# Information Visualization: Project 1
*David G.*
*Pau A.*
*GCED, Course 2023-24*

This project applies the knowledge acquired during the first two months of the Information Visualization course at UPC. To that end, we have applied different visualization and data processing techniques to analyze traffic accidents in New York City. Our goal has been to create a static visualization that allows us to answer the following questions:
- Are accidents more frequent during weekdays or weekends? Is there any difference between before COVID-19 and after COVID-19?
- Is there any type of vehicle more prone to participating in accidents?
- At what time of the day are accidents more common?
- Are there any areas with a higher number of accidents?
- Is there a correlation between weather conditions and accidents?
  
With these questions in mind, we have applied a diverse set of preprocessing and data transformation techniques to the provided dataset in order to obtain the appropriate visualizations for each of the objectives. This notebook aims to guide the reader through the steps that have been followed and the mental process through which the visualizations have been refined and obtained.

The report has the following outline:
- [Data Preprocessing](#data-preprocessing)
- [Design Process](#design-process)
  - [Question 1](#question-1)
  - [Question 2](#question-2)
  - [Question 3](#question-3)
  - [Question 4](#question-4)
  - [Question 5](#question-5)
- [Results](#results)



Before proceeding with the preprocessing and visualization steps, it is necessary to import the following libraries. In case any of them is not installed, the command to install them is also provided.

**IMPORTANT**: if running on Google Colab, enable the `colab` renderer.

In [None]:
# alt.renderers.enable('colab')

In [None]:
# !pip install -r requirements.txt

In [None]:
import altair as alt
import numpy as np  # used by np.where in the Question 2 preprocessing
import pandas as pd
import geopandas as gpd
import geoplot as gplt
import geodatasets
import h3pandas
import warnings
from graphs import *

# we disable max_rows in altair
alt.data_transformers.disable_max_rows()

# we disable warnings
warnings.filterwarnings('ignore')

## Data Preprocessing
To begin with, the original dataset was filtered to cover only the accidents that occurred in the period *Jun-Sep 2018 and 2020*. This step was performed before downloading the dataset. Afterward, the **Open Refine** software was used to take a first look at the data and do the bulk of the preprocessing.

Through Open Refine, we have been able to take a first look at our dataset and determine its columns and which issues need to be solved before proceeding. We identified three main issues:
- Many different encodings are used to refer to the same category on some variables, such as vehicle type
- Columns containing null values
- Columns not relevant to our analysis

After further analysis, it was found that the presence of null values was not as important as expected, as the columns that, in our opinion, were the most relevant, did not include many null values. 

On the other hand, the relevant columns were not selected in Open Refine, and no columns were removed there. This task was performed later using pandas in Python.

However, the main issue with the dataset was the fact that some of the categorical columns were very inconsistent in naming, e.g., the same category was written in many different ways. This was mainly observed in the Vehicle Type column. In other columns, it was not considered relevant, and no action was taken.

Consequently, a lot of work went into obtaining a more consistent naming scheme for the Vehicle Type 1 column, with the aim of normalizing the vehicle type codes. To do so, the **clustering** methods of Open Refine were used. With them, we were able to fix mainly orthographic inconsistencies. However, we were still left with many different categories or subcategories, which we believed were too granular for our analysis. Therefore, we manually merged some of the less frequent categories into more general ones.
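
To illustrate the kind of normalization that clustering performs, the following is a minimal sketch of a key-collision "fingerprint" approach, written in plain Python. It is an assumption about the general idea, not the exact method we ran inside Open Refine; the function names `fingerprint` and `cluster` and the sample labels are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group raw labels that share the same fingerprint (only groups with 2+ spellings)."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Hypothetical raw labels, for illustration only
raw_labels = ["Station Wagon", "wagon, STATION", "Taxi", "taxi ", "TAXI.", "Sedan"]
clusters = cluster(raw_labels)
print(clusters)
```

Each cluster can then be mapped to a single canonical spelling. Labels that differ by more than case, punctuation, or word order (for example abbreviations) are not caught by this kind of key and still need manual merging.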

It is worth noting that answering some of the questions required further preprocessing, which will be explained later.


In [None]:

df = pd.read_csv("dataset_v1.csv")[["CRASH DATE","CRASH TIME","BOROUGH", "LATITUDE", "LONGITUDE","VEHICLE TYPE CODE 1"]]

# We parse the date column as a date
df['date'] = pd.to_datetime(df['CRASH DATE'], format='%Y-%m-%d')    
df['VEHICLE TYPE CODE 1'] = df['VEHICLE TYPE CODE 1'].str.capitalize()             
print(df.shape)
df.head()


## Design Process
In order to obtain a visualization that allows us to answer the proposed questions, we decided to address each question separately and later integrate them all into a single view using Streamlit. This section describes the process followed to obtain the visualization for each question. Afterward, the Results section presents the final visualizations and the conclusions drawn from them.

### Question 1 
**Are accidents more frequent during weekdays or weekends? Is there any difference between before COVID-19 and after COVID-19?**

We wanted a visualization that shows the number of accidents while allowing the reader to easily compare both weekdays against weekends and before against after COVID-19.

Therefore, we needed a visualization that both presents a quantity and allows easy comparisons across groupings. We decided to focus on facilitating comparisons of the trend, while still showing precise information about the distribution of the accidents.

To begin with, we plotted a grouped bar plot of the number of accidents per day of the week, grouped by before or after COVID-19. We decided that encoding seven distinct values (and therefore bars) was not necessary and could distract the reader. Furthermore, the differences between individual weekdays were not large. Therefore, we decided to plot only weekdays and weekends.

In [None]:
# Day of the week as an integer (Monday=0, ..., Sunday=6)
df['weekday'] = df['date'].dt.dayofweek
# Year of the accident, used to tell whether it happened before or after COVID-19
df['covid'] = df['date'].dt.year
# Replace the year with a readable label
df['covid'] = df['covid'].replace([2018,2020],['before','after'])

alt.Chart(df[["covid","weekday","date"]]).mark_bar().transform_aggregate(
    accidents = 'count()',
    groupby=['weekday','date','covid']
).encode(
    x='weekday:N',
    y='average(accidents):Q',
    color='weekday',
    column = "covid:N"
)


Consequently, we repeated the grouped plot as a box plot, computing the mean number of accidents for weekdays and weekend days and again grouping by before or after COVID-19.
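
The aggregation behind this comparison can be sketched directly in pandas: first count accidents per calendar day, then average those daily counts within each group. The snippet below is a minimal sketch on a tiny synthetic table (the dates and the names `toy`, `day_type`, `daily`, and `avg_daily` are illustrative assumptions, not part of our dataset).

```python
import pandas as pd

# Synthetic stand-in for the accident table: one row per accident
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-06-04", "2018-06-04", "2018-06-05",  # Monday, Monday, Tuesday
        "2018-06-09",                              # Saturday
        "2020-06-01",                              # Monday
        "2020-06-06", "2020-06-06",                # Saturday, Saturday
    ])
})

# dayofweek uses Monday=0 ... Sunday=6, so 5 and 6 are the weekend
toy["day_type"] = toy["date"].dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")
toy["covid"] = toy["date"].dt.year.map({2018: "before", 2020: "after"})

# Step 1: number of accidents on each calendar day
daily = (
    toy.groupby(["covid", "day_type", "date"])
    .size()
    .rename("accidents")
    .reset_index()
)

# Step 2: mean of the daily counts within each (covid, day_type) group
avg_daily = daily.groupby(["covid", "day_type"])["accidents"].mean()
print(avg_daily)
```

Note that days without any recorded accident do not appear in `daily`, so they do not pull the average down; with a dataset as dense as ours this should rarely matter, but it is an assumption of this sketch.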

At this step, we also asked ourselves whether we should group by weekend or covid. As shown in the following graphs, we tried both.

In [None]:
df['weekday'] = df['weekday'].replace([0,1,2,3,4,5,6],['weekday','weekday','weekday','weekday','weekday','weekend','weekend'])

ch1 = alt.Chart(df[["covid","weekday","date"]]).mark_boxplot().transform_aggregate(
    accidents = 'count()',
    groupby=['weekday','date','covid']
).encode(
    x='weekday:N',
    y='average(accidents):Q',
    color='weekday',
    column = "covid:N"
).properties(width=300, title = "Average daily accidents before and after covid")

ch2 = alt.Chart(df[["covid","weekday","date"]]).mark_boxplot().transform_aggregate(
    accidents = 'count()',
    groupby=['weekday','date','covid']
).encode(
    x='covid:N',
    y='average(accidents):Q',
    color='covid',
    column = "weekday:N"
).properties(width=300, title = "Average daily accidents before and after covid")

ch1 | ch2


We believed the second option to be better, given that we can easily compare before and after covid "normalized" by either weekdays or weekends.

In [None]:
alt.Chart(df).mark_boxplot(size=30).transform_aggregate(accidents="count()", groupby=["weekday", "date", "covid"]).encode(
    x=alt.X(
        "covid:N",
        title=None,
        axis=alt.Axis(labelFontSize=16, labelAngle=0),
        sort=["before", "after"],
    ),
    y=alt.Y("average(accidents):Q", axis=alt.Axis(labelFontSize=14)),
    color=alt.Color(
        "covid:N",
        legend=None,
        scale=alt.Scale(range=[colors["col1"], colors["col2"]]),
    ),
    column=alt.Column(
        "weekday:N",
        header=alt.Header(labelFontSize=16),
        title=None,
        sort=alt.SortField(
            field="weekday", order="ascending"
        ),  # Order the columns by weekday label, ascending
    ),
).properties(width=300, height=400).configure_axis(titleFontSize=16)


### Question 2

**Is there any type of vehicle more prone to participate in accidents?**


In order to accurately answer this question, we would need data on the number of different types of vehicles on the road. We could then compare the number of accidents to the number of vehicles on the road. Unfortunately, we do not have this information, so we had to attempt to answer this question with the data we have. It is worth noting that we tried to obtain this information from other sources, but we were not able to find it.

As explained above, during preprocessing we clustered similar types of vehicles which had few instances of accidents into more general types. Before anything, we wanted to see the distribution of accidents by vehicle type.

In [None]:
alt.Chart(df).mark_bar().encode(
    x=alt.X("VEHICLE TYPE CODE 1", title="Vehicle Type", sort="-y"),
    y=alt.Y("count()", title="Number of Crashes"),
    tooltip=["VEHICLE TYPE CODE 1", "count()"],
).properties(title="Number of Crashes by Vehicle Type")

As the chart shows, most of the recorded accidents involve sedans and SUVs, which are the most common types of vehicles in New York City. We decided to keep the top 9 vehicle types by number of accidents and group the rest into the category _Others_, for 10 categories in total.

We decided to make a horizontal bar chart in order to make the category labels easier to read. Importantly, we wanted the _Others_ bar to be at the end, even though it was not the lowest, because it combines vehicle types that each have fewer accidents than any of the top 9.

In [None]:
def q2_preprocessing(df):
    """
    Preprocesses the given DataFrame by performing the following steps:
    1. Counts the occurrences of each vehicle type code.
    2. Selects the 9 most frequent vehicle type codes.
    3. Replaces all other vehicle type codes with "Others".
    4. Groups the data by vehicle type code and sums the counts.
    5. Sorts the data by count in descending order.
    6. Separates the data into two parts: one with the top 9 vehicle type codes and one with "Others".
    7. Concatenates the two parts.
    8. Calculates the percentage of each vehicle type code count.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame containing the vehicle type codes.

    Returns:
    - sorted_df (pandas.DataFrame): The preprocessed DataFrame with vehicle type codes and their counts and percentages.
    """
    count_df = df["VEHICLE TYPE CODE 1"].value_counts().reset_index()
    count_df.columns = ["VEHICLE TYPE CODE 1", "count"]
    top_9 = count_df.nlargest(9, "count")

    count_df["VEHICLE TYPE CODE 1"] = np.where(
        count_df["VEHICLE TYPE CODE 1"].isin(top_9["VEHICLE TYPE CODE 1"]),
        count_df["VEHICLE TYPE CODE 1"],
        "Others",
    )
    count_df = count_df.groupby("VEHICLE TYPE CODE 1").sum().reset_index()

    sorted_df = count_df.sort_values(by="count", ascending=False)

    df_part1 = sorted_df[sorted_df["VEHICLE TYPE CODE 1"] != "Others"]
    df_part2 = sorted_df[sorted_df["VEHICLE TYPE CODE 1"] == "Others"]

    sorted_df = pd.concat([df_part1, df_part2])

    sorted_df["percentage"] = (sorted_df["count"] / sorted_df["count"].sum()) * 100

    return sorted_df

In [None]:
sorted_df = q2_preprocessing(df)
bar_chart = alt.Chart(sorted_df).mark_bar().encode(
    y=alt.Y("VEHICLE TYPE CODE 1:N", title="Type of vehicle", sort=sorted_df['VEHICLE TYPE CODE 1'].tolist()),
    x=alt.X("count:Q", title="Number of accidents", scale=alt.Scale(domain=(0, max(sorted_df['count']) * 1.2))),
    color=alt.Color("VEHICLE TYPE CODE 1:N", 
                    scale=alt.Scale(domain=sorted_df['VEHICLE TYPE CODE 1'].tolist() + ['Others'], 
                                    range=[colors['col1']] * (len(sorted_df['VEHICLE TYPE CODE 1'])-1) + ['gray']),
                    legend=None  
                   )
)
text_labels = bar_chart.mark_text(
    fontWeight='bold',
    align='left',
    baseline='middle',
    dx=3, 
).encode(
    text='count:Q',
    color=alt.condition(
        alt.datum['VEHICLE TYPE CODE 1'] == 'Others',
        alt.value('gray'),
        alt.value('black')  
    )
)

layered_chart = alt.layer(bar_chart, text_labels).configure_axisX(grid=True)

layered_chart

We also tried out a lollipop chart with the same properties as this bar chart.

In [None]:
bar_chart = alt.Chart(sorted_df).mark_rule(size=2).encode(
    y=alt.Y("VEHICLE TYPE CODE 1:N", title="VEHICLE TYPE CODE 1", sort=sorted_df['VEHICLE TYPE CODE 1'].tolist()),
    x=alt.X("count:Q", title="Count", scale=alt.Scale(domain=(0, max(sorted_df['count']) * 1.2))),
    color=alt.Color("VEHICLE TYPE CODE 1:N", 
                    scale=alt.Scale(domain=sorted_df['VEHICLE TYPE CODE 1'].tolist() + ['Others'], 
                                    range=['lightblue'] * (len(sorted_df['VEHICLE TYPE CODE 1'])-1) + ['gray'])),
    tooltip=["VEHICLE TYPE CODE 1", "count"]
)

points_chart = alt.Chart(sorted_df).mark_circle(size=100, opacity=1).encode(
    y=alt.Y("VEHICLE TYPE CODE 1:N", title="VEHICLE TYPE CODE 1", sort=sorted_df['VEHICLE TYPE CODE 1'].tolist()),
    x=alt.X("count:Q", title="Count", scale=alt.Scale(domain=(0, max(sorted_df['count']) * 1.2))),
    color=alt.Color("VEHICLE TYPE CODE 1:N", 
                    scale=alt.Scale(domain=sorted_df['VEHICLE TYPE CODE 1'].tolist() + ['Others'], 
                                    range=['lightblue'] * (len(sorted_df['VEHICLE TYPE CODE 1'])-1) + ['gray'])),
    tooltip=["VEHICLE TYPE CODE 1", "count"]
)

text_labels = bar_chart.mark_text(
    fontWeight='bold',
    align='left',
    baseline='middle',
    dx=10, 
).encode(
    text='count:Q',
    color=alt.condition(
        alt.datum['VEHICLE TYPE CODE 1'] == 'Others',
        alt.value('gray'), 
        alt.value('black')  
    )
)

layered_chart = alt.layer(bar_chart, points_chart, text_labels).configure_axisX(grid=True)

layered_chart

We then considered that percentages would be more informative than absolute values, so we plotted the percentage of accidents per vehicle type instead.

In [None]:
bar_chart = alt.Chart(sorted_df).mark_bar().encode(
    y=alt.Y("VEHICLE TYPE CODE 1:N", title=None, sort=sorted_df['VEHICLE TYPE CODE 1'].tolist()),
    x=alt.X("percentage:Q", title="Percentage"),
    color=alt.Color("VEHICLE TYPE CODE 1:N",
                    scale=alt.Scale(domain=sorted_df['VEHICLE TYPE CODE 1'].tolist() + ['Others'],
                                    range=[colors['col1']] * (len(sorted_df['VEHICLE TYPE CODE 1'])-1) + ['gray']),
                    legend=None),
    tooltip=["VEHICLE TYPE CODE 1", "percentage"],
)

text_labels = bar_chart.mark_text(
    fontWeight='bold',
    align='left',
    baseline='middle',
    dx=3,  
).encode(
    text=alt.Text('percentage:Q', format='.1f'),  
    color=alt.condition(
        alt.datum['VEHICLE TYPE CODE 1'] == 'Others',
        alt.value('gray'), 
        alt.value('black') 
    )
)

layered_chart = alt.layer(bar_chart, text_labels).configure_axisX(grid=True)
layered_chart

In the end, we decided to use this chart.

### Question 3

**At what time of the day are accidents more common?**

To answer this question, we first wanted to take a look at the distribution of accidents throughout the day.

In [None]:
alt.Chart(df).mark_line().encode(
    alt.X("CRASH TIME", title="Crash Time"),
    alt.Y("count()", title="Number of Crashes")
).properties(title="Number of Crashes by Crash Time")

As we can see, we clearly needed to group the times into intervals in order to obtain meaningful results, as plotting each minute of the day was unworkable. Thanks to this chart, we also realized that a majority of accidents were registered as having occurred at full and half hours. We assumed this to be due to rounding when the time of the accident was written down.

Thus, we created intervals of 30 minutes each, and grouped accidents into said intervals.
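
The binning itself is simple integer arithmetic on minutes since midnight. As a minimal sketch, the standalone helper below (`half_hour_bin` is a hypothetical name used only for illustration) floors a single `HH:MM` string to the start of its 30-minute interval, using the same arithmetic as our preprocessing function.

```python
def half_hour_bin(hhmm: str) -> str:
    """Floor an 'H:MM' or 'HH:MM' time string to the start of its 30-minute interval."""
    hours, minutes = map(int, hhmm.split(":"))
    total_minutes = hours * 60 + minutes      # minutes since midnight
    floored = (total_minutes // 30) * 30      # drop the remainder within the interval
    return f"{floored // 60:02d}:{floored % 60:02d}"


for sample in ["0:00", "8:47", "15:29", "23:59"]:
    print(sample, "->", half_hour_bin(sample))
```

Because the label is zero-padded, sorting the labels alphabetically also sorts them chronologically, which is convenient when they are used as a categorical axis.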

In [None]:
def q3_preprocessing(df):
    """
    Preprocesses the given DataFrame by binning the 'CRASH TIME' column into 30-minute intervals, stored as 'HH:MM' labels in 'CRASH TIME INT'.

    Args:
        df (pandas.DataFrame): The DataFrame to be preprocessed.

    Returns:
        pandas.DataFrame: The preprocessed DataFrame.
    """
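    # Copy the selected columns so the assignments below do not write to a view of the caller's frame
    df = df[["CRASH TIME", "covid", "weekday"]].copy()
    # Parse the time once and convert it to minutes since midnight
    crash_time = pd.to_datetime(df["CRASH TIME"], format="%H:%M")
    df["CRASH TIME INT"] = crash_time.dt.hour * 60 + crash_time.dt.minute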
    df["CRASH TIME INT"] = (df["CRASH TIME INT"] // 30) * 30
    df["CRASH TIME INT"] = df["CRASH TIME INT"].apply(
        lambda x: f"{x // 60:02d}:{x % 60:02d}"
    )
    return df

In [None]:
q3_df = q3_preprocessing(df)

In [None]:
alt.Chart(q3_df).mark_line().encode(
    alt.X("CRASH TIME INT", title="Crash Time"),
    alt.Y("count()", title="Number of Crashes")
).properties(title="Number of Crashes by Crash Time")


We wanted to highlight the patterns we observed on the line chart, especially around the so-called "rush hours", the periods of the day when most people commute to and from work.
To do so, we divided the day into two categories: _Normal_ and _Rush Hour_. Then we duplicated the rows that fall on the boundary between time periods in order to connect the different lines. We do this in a new dataframe to keep the data clean.
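
As a minimal sketch of the labeling rule, the helper below assigns a time to one of five consecutive segments using half-open intervals (lower bound included, upper bound excluded). The boundaries 08:30, 09:30, 15:30, and 19:30 are the ones used in this notebook; the names `BOUNDS`, `LABELS`, and `time_period` are hypothetical and only serve the illustration.

```python
from bisect import bisect_right

# Segment boundaries in minutes since midnight: 08:30, 09:30, 15:30, 19:30
BOUNDS = [8 * 60 + 30, 9 * 60 + 30, 15 * 60 + 30, 19 * 60 + 30]
LABELS = ["Night", "Rush Hour Morning", "Midday", "Rush Hour Afternoon", "Evening"]


def time_period(hhmm: str) -> str:
    """Return the segment label for an 'HH:MM' time (a boundary belongs to the later segment)."""
    hours, minutes = map(int, hhmm.split(":"))
    # bisect_right counts how many boundaries are <= the given time
    return LABELS[bisect_right(BOUNDS, hours * 60 + minutes)]


for sample in ["08:00", "08:30", "09:30", "15:30", "19:00", "19:30"]:
    print(sample, "->", time_period(sample))
```

Five labels are used instead of two so that the morning and afternoon rush hours are drawn as separate line segments rather than being joined across midday.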

In [None]:
q3_df['interval'] = pd.to_datetime(q3_df['CRASH TIME INT'], format='%H:%M')


first_segment = pd.to_datetime('8:30', format='%H:%M')
second_segment = pd.to_datetime('9:30', format='%H:%M')
third_segment = pd.to_datetime('15:30', format='%H:%M')
fourth_segment = pd.to_datetime('19:30', format='%H:%M')


night = (q3_df['interval'] < first_segment)
morning_rush = (q3_df['interval'] >= first_segment) & (q3_df['interval'] < second_segment)
midday = (q3_df['interval'] >= second_segment) & (q3_df['interval'] < third_segment)
afternoon_rush = (q3_df['interval'] >= third_segment) & (q3_df['interval'] < fourth_segment)
evening = (q3_df['interval'] >= fourth_segment)

#We need to differentiate between all 5 segments in order to avoid rush hour lines connecting to each other
q3_df['time_period'] = 'Unknown'
q3_df.loc[night, 'time_period'] = 'Night'
q3_df.loc[morning_rush, 'time_period'] = 'Rush Hour Morning'
q3_df.loc[midday, 'time_period'] = 'Midday'
q3_df.loc[afternoon_rush, 'time_period'] = 'Rush Hour Afternoon'
q3_df.loc[evening, 'time_period'] = 'Evening'


# We select the rows that fall between intervals in order to connect the lines between different segments.
# .copy() avoids writing to a view of q3_df when the period label is overwritten.
first_overlap = q3_df.loc[q3_df['CRASH TIME INT'] == '08:00'].copy()
first_overlap['time_period'] = 'Rush Hour Morning'

second_overlap = q3_df.loc[q3_df['CRASH TIME INT'] == '09:00'].copy()
second_overlap['time_period'] = 'Midday'

third_overlap = q3_df.loc[q3_df['CRASH TIME INT'] == '15:00'].copy()
third_overlap['time_period'] = 'Rush Hour Afternoon'

fourth_overlap = q3_df.loc[q3_df['CRASH TIME INT'] == '19:00'].copy()
fourth_overlap['time_period'] = 'Evening'


# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
q3_df = pd.concat([q3_df, first_overlap, second_overlap, third_overlap, fourth_overlap])

In order to show the rush hour periods, we first tried drawing the line in a different color during rush hours.

In [None]:
#use specific colors for each time period
color_scale = alt.Scale(
    domain=['Night', 'Rush Hour Morning', 'Midday', 'Rush Hour Afternoon', 'Evening'],
    range=['blue', 'red', 'blue', 'red', 'blue']
)

alt.Chart(q3_df).mark_line(size=3).encode( 
    x='CRASH TIME INT',
    y='count()',
    color=alt.Color('time_period', scale=color_scale),
)

From this point onwards we decided to use an area chart, as we considered it easier to read. However, because each slice (e.g. rush hour and night) is a separate time series, the interpolation between adjacent series was not ideal.

In [None]:
color_scale = alt.Scale(domain=['Night', 'Rush Hour Morning', 'Midday', 'Rush Hour Afternoon', 'Evening'],
                        range=['#083d77', '#ff6f61', '#083d77', '#ff6f61', '#083d77'])

charts = []
for period in ['Night', 'Rush Hour Morning', 'Midday', 'Rush Hour Afternoon', 'Evening']:
    period_chart = alt.Chart(q3_df[q3_df['time_period'] == period]).mark_area(interpolate='basis',
        line={'color': 'darkblue'}
    ).encode(
        alt.X("CRASH TIME INT", title="Crash Time"),  # 'HH:MM' labels of the 30-minute bins
        alt.Y("count()", title="Number of Crashes"),
        color=alt.Color("time_period:N", scale=color_scale, legend=None),
        tooltip=["CRASH TIME INT", "count()", "time_period"],
    ).properties(title=f"Number of Crashes - {period}").interactive()

    charts.append(period_chart)

alt.layer(*charts).resolve_scale(color='independent')

At this point we thought it would be interesting to see whether the commute patterns shown in the plot would be more apparent if we separated the data into before and during COVID-19, and also distinguished between workdays and weekends.

In [None]:
def q3_preprocessing(df):
    """
    Preprocesses the given DataFrame by binning the 'CRASH TIME' column into 30-minute intervals, stored as 'HH:MM' labels in 'CRASH TIME INT'.

    Args:
        df (pandas.DataFrame): The DataFrame to be preprocessed.

    Returns:
        pandas.DataFrame: The preprocessed DataFrame.
    """
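    # Copy the selected columns so the assignments below do not write to a view of the caller's frame
    df = df[["CRASH TIME", "covid", "weekday"]].copy()
    # Parse the time once and convert it to minutes since midnight
    crash_time = pd.to_datetime(df["CRASH TIME"], format="%H:%M")
    df["CRASH TIME INT"] = crash_time.dt.hour * 60 + crash_time.dt.minute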
    df["CRASH TIME INT"] = (df["CRASH TIME INT"] // 30) * 30
    df["CRASH TIME INT"] = df["CRASH TIME INT"].apply(
        lambda x: f"{x // 60:02d}:{x % 60:02d}"
    )
    return df

def create_chart3(df, color_palette):

    morning_rh = {}
    morning_rh['x1'] = '08:00'
    morning_rh['x2'] = '09:00'
    morning_rh = pd.DataFrame([morning_rh])


    afternoon_rh = {}
    afternoon_rh['x1'] = '15:00'
    afternoon_rh['x2'] = '19:00'
    afternoon_rh = pd.DataFrame([afternoon_rh])

    morning_rh['x1'] = pd.to_datetime(morning_rh['x1'])
    morning_rh['x2'] = pd.to_datetime(morning_rh['x2'])

    afternoon_rh['x1'] = pd.to_datetime(afternoon_rh['x1'])
    afternoon_rh['x2'] = pd.to_datetime(afternoon_rh['x2'])


    morning_window = alt.Chart(morning_rh).mark_rect(opacity=0.1).encode(
        x='hours(x1):T',
        x2='hours(x2):T',
        color=alt.value('gray')
    )

    afternoon_window = alt.Chart(afternoon_rh).mark_rect(opacity=0.1).encode(
        x='hours(x1):T',
        x2='hours(x2):T',
        color=alt.value('gray')
    )

    before_after = alt.Chart().mark_area(size=3, opacity=0.4, interpolate='basis').encode(
       # x=alt.X('CRASH TIME INT', title=None, axis=alt.Axis(labelExpr="hours(timeFormat(datum.label, '%H:%M')) % 1 == 0 ? timeFormat(datum.label, '%H:%M') : ''")),
        x = alt.X('hours(HOUR):T'),
        y=alt.Y('count()').stack(None, title=None),
        color=alt.Color('covid:N', scale=alt.Scale(range=[color_palette['col1'], color_palette['col2']]))  # use the palette passed as argument
    )
    df["HOUR"]=pd.to_datetime(df["CRASH TIME"])


    chart = alt.layer(before_after, morning_window, afternoon_window, data=df).properties(width=1000).facet(row=alt.Row("weekday", title=None)).configure_axis(title=None)
    return chart

In [None]:
create_chart3(q3_preprocessing(df), colors)

Finally, we obtained this chart. We can observe that most accidents happen during work hours on weekdays. This trend can also be seen during COVID-19, although greatly diminished. We believe that the general restrictions during COVID-19, together with the fact that most people worked from home, contributed to the decrease in overall traffic and, consequently, in accidents.

### Question 4
**Are there any areas with a higher number of accidents?**

This question proved challenging to solve both in the visualization and technical aspects. To begin with, we made a few hand-drawn prototypes of how we wanted the visualization to look, which made us realize that, given the high density of accidents compared to the area, plotting all the accidents as dots was not a good option.

Therefore, we decided to try a choropleth but with a higher granularity than boroughs, e.g., precincts. This proved technically difficult, as we had to learn how to work with geopandas data frames in order to perform a spatial join between our dataset and a map. The result is shown in the following visualization.

At first, we encountered one issue related to which variable we encoded using color. If we encoded just the number of accidents per polygon, smaller polygons would appear with a lighter color even if the area had a higher concentration of accidents. To solve it we decided to normalize the count by the area of the polygon. We encoded this new variable using a quantitative sequential color scale.
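
The effect of this normalization can be shown with a minimal, library-free sketch. The two polygons and their accident counts below are made up for illustration, and `polygon_area` is a hypothetical helper that uses the shoelace formula; in the notebook itself the area comes from the geometry objects.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its (x, y) vertices, using the shoelace formula."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


# Made-up polygons (coordinates in arbitrary planar units) and accident counts
polygons = {
    "small": [(0, 0), (1, 0), (1, 1), (0, 1)],   # area 1
    "large": [(0, 0), (4, 0), (4, 2), (0, 2)],   # area 8
}
counts = {"small": 50, "large": 80}

# Accidents per unit of area: the quantity encoded by color
density = {name: counts[name] / polygon_area(verts) for name, verts in polygons.items()}

print("highest raw count:", max(counts, key=counts.get))
print("highest density:  ", max(density, key=density.get))
```

The larger polygon has more accidents in total, but the smaller one has the higher concentration, which is what the color should reflect. One caveat: an area is expressed in the units of the coordinate system, so areas computed on latitude and longitude are in squared degrees; this is fine for comparing polygons within a single city but not for reading the value as accidents per square kilometer.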

As can be seen, the result was quite good. Nevertheless, we felt that it would be better to have all the area divisions have a similar size to facilitate comparisons and allow for a more detailed view in some areas of the city.

*(see next cells for the continuation)*

In [None]:
map_orig = gpd.read_file(gplt.datasets.get_path('nyc_parking_tickets')) 
df_coord = df.dropna(subset=['LATITUDE', 'LONGITUDE'])
gdf = gpd.GeoDataFrame(
    df_coord, geometry=gpd.points_from_xy( df_coord.LONGITUDE,df_coord.LATITUDE)
)[["geometry"]]
gdf.set_crs(epsg=4326, inplace=True)
if gdf.crs != map_orig.crs:
    gdf.to_crs(map_orig.crs, inplace=True)
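# Spatial join between accidents and map polygons ('predicate' replaces the 'op' argument removed in geopandas 1.0)
gdf = gpd.sjoin(gdf, map_orig, how="right", predicate='intersects')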
gdf_count = gdf.groupby(['geometry','id']).size().reset_index(name='counts')
gdf_count["counts"] = gdf_count.apply(lambda row: row["counts"]/(row["geometry"].area), axis=1)
df_geo = pd.DataFrame(gdf_count[["id","counts"]])

alt.Chart(map_orig).mark_geoshape().encode(
    color = 'counts:Q',
    tooltip=['id:N','counts:Q']
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(df_geo, 'id', ['counts'])
).project(
    type='albersUsa'
).properties(
    width=500,
    height=300
)


In order to solve the aforementioned issues, we decided to make a hexagon-style map, which transforms the shape of the geography into hexagons and colors them as a choropleth. To do so, we first converted our original map into hexagons using the *h3pandas* library, following instructions we found while researching how to do it ([link](https://python.plainenglish.io/creating-beautiful-hexagon-maps-with-python-25c9291eeeda)). We then repeated the same process as in the previous graph.

However, we felt that transforming the map into hexagons made it more difficult to recognize the geographical areas of the city. Therefore, we added clear borders between boroughs and labeled them.


In [None]:
path = geodatasets.get_path("nybb")
ny = gpd.read_file(path).to_crs("EPSG:4326")
resolution = 8
hex_map = ny.h3.polyfill_resample(resolution)
hex_map=hex_map.to_crs("ESRI:102003")

hex_buroughs = hex_map.copy()
hex_buroughs = hex_buroughs.dissolve(by='BoroName')

hex_buroughs.head()
ny_df=pd.DataFrame()
ny_df['x'] = hex_buroughs.centroid.x
ny_df['y'] = hex_buroughs.centroid.y
ny_df["BoroName"] = hex_buroughs.index


df_coord = df.dropna(subset=['LATITUDE', 'LONGITUDE'])
gdf = gpd.GeoDataFrame(
    df_coord, geometry=gpd.points_from_xy( df_coord.LONGITUDE,df_coord.LATITUDE)
)[["geometry"]]
gdf = gdf.set_crs(epsg=4326).to_crs("ESRI:102003")

# 'predicate' replaces the 'op' argument removed in geopandas 1.0
gdf = gpd.sjoin(gdf, hex_map, how="right", predicate='intersects')
print(gdf.columns)
gdf_count = gdf.groupby(['geometry','h3_polyfill']).size().reset_index(name='counts')
gdf_count["counts"] = gdf_count.apply(lambda row: row["counts"], axis=1)
df_geo = pd.DataFrame(gdf_count[["h3_polyfill","counts"]])

hex = hex_map.merge(df_geo, left_on='h3_polyfill', right_on='h3_polyfill', how='left')
hex["counts"]=hex["counts"].apply(lambda x: 1 if x ==0 else x)



In [None]:
hexagons=alt.Chart(hex).mark_geoshape().encode(
    color = 'counts:Q',
    tooltip=['h3_polyfill:N','counts:Q']
).project(
    type='identity',
    reflectY=True
).properties(
    width=500,
    height=300
)
labels = alt.Chart(ny_df).mark_text().encode(
    longitude='x:Q',
    latitude='y:Q',
    text='BoroName:N'
)
borders = alt.Chart(hex_buroughs).mark_geoshape(
    stroke='darkgray',
    strokeWidth=1.25,
    opacity=1,
    fillOpacity=0
).project(
    type='identity',
    reflectY=True
).properties(
    width=500,
    height=300
)
hexagons + labels+borders

### Question 5
**Is there a correlation between weather conditions and accidents?**

This question proved the most difficult of them all. To begin with, we had to find a weather dataset. This was quickly achieved, as many weather datasets are available online. However, the first one we used did not provide enough information, as many of its columns contained many null values.
Therefore, we decided to use a second dataset, which contains daily weather data for New York City. The dataset contains many different columns, of which the following were considered relevant:
- `datetime`
- `precip`
- `temperature`
- `windspeed`
- `visibility`
- `conditions`
  
We started by visualizing the distribution of the data using the `pandas` histogram and describe functions. We observed that, as expected for summer data, temperatures were mostly relatively high and there was relatively little precipitation. This made the visualization design process more challenging, as we also had to take it into account.

We tried two approaches: a lollipop chart that shows the difference in the mean number of accidents depending on the weather, and a set of heatmaps that show pairs of weather variables colored by the number of accidents, normalized by the number of days with that weather combination.

Both approaches are displayed in the following cells. The lollipop chart was chosen and is further explained in the results section.

On the other hand, the heatmaps did not achieve good results, as many of the cells had little or no data available, which made them difficult to interpret.
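
The quantity shown in the lollipop chart, the mean number of daily accidents under each weather condition minus the overall daily mean, can be sketched on a small synthetic example. The three days, their conditions, and the names `accidents`, `weather`, `daily`, and `diff` below are illustrative assumptions, not values from our datasets.

```python
import pandas as pd

# Synthetic accidents: 3 on June 1, 5 on June 2, 4 on June 3 (one row per accident)
accidents = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01"] * 3 + ["2018-06-02"] * 5 + ["2018-06-03"] * 4)
})

# Synthetic daily weather, one row per day
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-06-01", "2018-06-02", "2018-06-03"]),
    "conditions": ["Clear", "Rain", "Clear"],
})

# Accidents per day, with the weather condition of that day attached
daily = accidents.groupby("date").size().rename("accidents").reset_index()
daily = daily.merge(weather, left_on="date", right_on="datetime", how="inner")

# Difference between the per-condition daily mean and the overall daily mean
overall_mean = daily["accidents"].mean()
diff = daily.groupby("conditions")["accidents"].mean() - overall_mean
print(diff)
```

A positive value means that days with that condition had more accidents than an average day, and a negative value means fewer. With few days per condition, as in summer data with little rain, these differences rest on small samples and should be read with caution.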

In [None]:
df_weather_1 = pd.read_csv("new york city 2018-06-01 to 2018-08-31.csv")
df_weather_2 = pd.read_csv("new york city 2020-06-01 to 2020-08-31.csv")
df_weather = pd.concat([df_weather_1,df_weather_2],axis=0)
df_weather.head()


In [None]:
data = df.copy()
weather_cond = df_weather[['datetime','conditions']].copy()
weather_cond['datetime'] = pd.to_datetime(weather_cond['datetime'], format='%Y-%m-%d')

# Normalize the crash dates to midnight so they match the daily 'datetime' values in weather_cond
data['date'] = pd.to_datetime(pd.to_datetime(data['CRASH DATE']).dt.date)

# Attach the weather conditions to each accident with an inner merge on the date
data = data.merge(weather_cond, left_on='date', right_on='datetime', how='inner')
data.head()

In [None]:
per_day = data[['date','conditions','CRASH TIME']].groupby(['date']).count().reset_index()
mean = per_day['CRASH TIME'].mean()

per_day_cond = data[['date','conditions','CRASH TIME']].groupby(['date','conditions']).count().reset_index()
mean_cond = per_day_cond[['conditions','CRASH TIME']].groupby(['conditions']).mean().reset_index()
mean_cond.columns = ['conditions','mean_cond']

mean_cond["diff"]=mean_cond["mean_cond"].apply(lambda x: x-mean)


In [None]:
# Color shared by the stem and the dot: blue above the overall mean, orange below
diff_color = alt.condition(
    alt.datum.diff > 0,
    alt.value("steelblue"),  # The positive color
    alt.value("orange")  # The negative color
)

# Encodings and size shared by both layers of the lollipop chart
base = alt.Chart(mean_cond).encode(
    y=alt.Y('conditions:N').sort('x'),
    x='diff:Q',
    color=diff_color
).properties(
    width=500,
    height=300
)

# Thin horizontal bars act as the stems; filled points act as the heads
stems = base.mark_bar(height=3, orient='horizontal')
dots = base.mark_point(size=100, opacity=1, fillOpacity=1).encode(fill=diff_color)

stems + dots

In [None]:
import numpy as np

def create_heatmap(df,weather_df,col_1,col_2,bins_1,bins_2,labels_1,labels_2):
    # Convert the date columns to datetime (at day resolution) so they can be merged
    weather_df['datetime'] = pd.to_datetime(weather_df['datetime'])
    df['CRASH DATE'] = pd.to_datetime(pd.to_datetime(df['CRASH DATE']).dt.date)

    # Bin the two weather variables; include_lowest=True keeps values equal to the
    # first bin edge (e.g. days with 0 precipitation), which pd.cut drops otherwise
    weather_df['col_1'] = pd.cut(weather_df[col_1], bins=bins_1, labels=labels_1, include_lowest=True)
    weather_df['col_2'] = pd.cut(weather_df[col_2], bins=bins_2, labels=labels_2, include_lowest=True)

    # Number of days observed for each weather combination
    weather_count = weather_df.groupby(["col_1","col_2"]).size().reset_index(name='w_counts')

    # Number of accidents for each weather combination
    merged_data = pd.merge(df, weather_df, left_on='CRASH DATE', right_on='datetime', how='inner')
    accident_count = merged_data.groupby(["col_1","col_2"]).size().reset_index(name='a_counts')
    accident_count = accident_count[accident_count["a_counts"]>0]

    # Keep only the combinations with accidents and plot accidents per day
    data = pd.merge(weather_count, accident_count, on=['col_1', 'col_2'], how='inner')
    return alt.Chart(data).mark_rect().transform_calculate(
        value = 'datum.a_counts/datum.w_counts'
    ).encode(
        x=alt.X('col_1:O', title=col_1, sort=labels_1),
        # Reverse the label order so the lowest bin sits at the bottom of the axis
        y=alt.Y('col_2:O', title=col_2, sort=labels_2[::-1]),
        color = alt.Color('value:Q', title='Ratio'),
    ).properties(
        width=400
    )

def generate_bin_labels(bins):
    # Build ordered labels such as '1. 0-2' or '3. 10+' from consecutive bin edges
    bin_labels = []
    for i, (start, end) in enumerate(zip(bins[:-1], bins[1:]), start=1):
        if end == float('inf'):
            bin_labels.append(f'{i}. {start}+')
        else:
            bin_labels.append(f'{i}. {start}-{end}')
    return bin_labels

def generate_uniform_bins(dataframe_column, num_bins):
    # Split the column's [min, max] range into num_bins equal-width bins
    min_val = dataframe_column.min()
    max_val = dataframe_column.max()
    bins = np.linspace(min_val, max_val, num=num_bins+1).tolist()
    return bins


In [None]:
# Bin edges for each weather variable
precip_bins = [0, 2, 5, 7, 10, 15, 100]
temp_bins = [15, 20, 22.5, 25, 30, 32, 37]
wind_bins = generate_uniform_bins(df_weather["windspeed"], 4)

# Ordered labels for each set of bins
precip_labels = generate_bin_labels(precip_bins)
temp_labels = generate_bin_labels(temp_bins)
wind_labels = generate_bin_labels(wind_bins)

# Temperature vs. precipitation
map1 = create_heatmap(df, df_weather, 'temp', 'precip', temp_bins, precip_bins, temp_labels, precip_labels)

# Wind speed vs. precipitation
map2 = create_heatmap(df, df_weather, 'windspeed', 'precip', wind_bins, precip_bins, wind_labels, precip_labels)


In [None]:
alt.hconcat(
    map1,
    map2,
    spacing=20  # Adjust the spacing between the heatmaps
).resolve_scale(
    y='shared'
)

## Results

This section shows the final visualizations obtained through the design process described in the previous section. Each visualization is shown in its final form and analyzed.

In [None]:
from graphs import *

**IMPORTANT** due to Altair rendering issues, the following cells might need to be run twice to display the charts

In Chart 1, a significant decrease in the average number of accidents per day is evident when comparing the periods before and after COVID. 
Notably, during the period preceding COVID, there is also a noticeable decline in accidents during weekends compared to weekdays. 

However, a distinctive shift occurs during the COVID period, when the number of accidents on weekends remains essentially the same as on weekdays. This pattern suggests a change in the distribution of accidents, with weekdays exhibiting a different trend during the pandemic compared to the period before COVID.

The chart allows the reader to easily compare the situations while also providing insight into the distributions. The use of colors facilitates the task.
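
The quantity compared in Chart 1, the average number of accidents per day, can be sketched as follows. This is a toy illustration rather than the code in `graphs`: the dates are made up, and mapping 2018 to "Before COVID" and 2020 to "During COVID" is an assumption.

```python
import pandas as pd

# Toy crash log: one row per accident (dates are made up for illustration)
crashes = pd.DataFrame({
    "CRASH DATE": pd.to_datetime([
        "2018-06-04", "2018-06-04", "2018-06-05",  # Monday, Monday, Tuesday
        "2018-06-09",                              # Saturday
        "2020-06-01",                              # Monday
        "2020-06-06",                              # Saturday
    ])
})

# Assumed period labels: 2018 is before COVID, 2020 is during COVID
crashes["period"] = crashes["CRASH DATE"].dt.year.map(
    {2018: "Before COVID", 2020: "During COVID"}
)
# dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
crashes["day_type"] = crashes["CRASH DATE"].dt.dayofweek.map(
    lambda d: "Weekend" if d >= 5 else "Weekday"
)

# Count accidents per calendar day, then average those daily counts per group
per_day = crashes.groupby(["period", "day_type", "CRASH DATE"]).size()
mean_per_day = per_day.groupby(level=["period", "day_type"]).mean()
print(mean_per_day)
```

Note that days without any accident do not appear in `per_day`, so this sketch would overestimate the mean if such days existed in the data.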

In [None]:
accident_data = get_accident_data("dataset_v1.csv", sample=False)
chart_1 = get_chart_1(accident_data)
chart_1

Chart 2 is a horizontal bar plot showing the percentage of accidents involving each type of vehicle. It highlights the top 9 vehicle categories and aggregates the remaining types into a single category labeled "Others". From this chart, one can readily identify which types of vehicles are most frequently involved in accidents.
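
The "top 9 plus Others" aggregation can be sketched as below. The helper `top_n_with_others` and the sample vehicle types are hypothetical, shown only to illustrate the idea; the actual preprocessing lives in `q2_preprocessing`.

```python
import pandas as pd

def top_n_with_others(values: pd.Series, n: int = 9, other_label: str = "Others") -> pd.Series:
    """Percentage share of the n most frequent values plus an aggregated remainder."""
    counts = values.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return 100 * top / top.sum()

# Made-up vehicle types; n=2 keeps the example short
vehicles = pd.Series(["Sedan"] * 5 + ["SUV"] * 3 + ["Taxi"] + ["Bus"])
shares = top_n_with_others(vehicles, n=2)
print(shares)
```

Here "Sedan" and "SUV" keep their own bars (50% and 30%), while "Taxi" and "Bus" are merged into "Others" (20%).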

However, this visualization alone cannot conclusively answer the question of which types of vehicles are more prone to participating in accidents, because we do not know the overall distribution of vehicle types in New York City. Without the prevalence of each vehicle type on the roads, we cannot tell whether a type is inherently more accident-prone or simply more common in the general vehicle population.

Therefore, the chart shows the relative involvement of each vehicle type in accidents, but drawing conclusions about their inherent propensity would require that additional context.

In [None]:
create_chart2(q2_preprocessing(accident_data),height=500,width=300)


In Chart 3, there are two subplots representing workdays and weekends. Each subplot features two overlapping area charts: a lighter green area representing the period before COVID, and a darker area representing the period during COVID. The Y-axis depicts the number of accidents in 30-minute intervals. Notably, the intervals of 8h-9h and 15h-19h, corresponding to the rush hour in NYC, are highlighted with a slight shadow.

Upon reviewing these charts, several observations can be made. Firstly, accidents tend to be more prevalent during typical work hours on workdays, while they are more evenly distributed and less frequent during weekends. Additionally, there is a noticeable decrease in the overall number of accidents (as observed in Question 1). Furthermore, the distinctive pattern of increased accidents during work hours, seen in the time before COVID, becomes less pronounced during the pandemic. 
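
The 30-minute aggregation used on the Y-axis can be sketched as follows. This is a minimal illustration that assumes `CRASH TIME` is stored as `H:MM` strings; the sample times are made up.

```python
import pandas as pd

crashes = pd.DataFrame({"CRASH TIME": ["8:05", "8:29", "8:30", "17:45", "17:59", "23:10"]})

# Parse the time strings and floor each one to the start of its 30-minute slot
times = pd.to_datetime(crashes["CRASH TIME"], format="%H:%M")
crashes["slot"] = times.dt.floor("30min").dt.strftime("%H:%M")

# Number of accidents in each 30-minute slot
per_slot = crashes.groupby("slot").size()
print(per_slot)
```

Slots without accidents are absent from `per_slot`; for an area chart they would typically be filled in with zeros so the curve drops to the baseline.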

In [None]:
data_3 = q3_preprocessing(accident_data)
chart_3 = create_chart3(data_3,color_palette=get_palette())
chart_3

In Chart 4, the visualization shows whether some areas of the city have a higher concentration of accidents than others. The hexagons give a detailed view of the spatial distribution of accidents. The borough borders give, from a perception point of view, a feeling of containment within the boroughs and, together with the borough labels, help the reader orient themselves on the map.
Furthermore, the sequential color palette clearly shows which areas have a higher count of accidents.

Nevertheless, the graph also has some issues. The main one is due to the nature of the data: the color palette has to cover a wide range of values, as the number of accidents in some areas is much greater than in others. This makes it hard to differentiate values at the lower and higher ends of the scale.
Furthermore, the borough labels are a bit hard to distinguish as they overlap the hexagons of the map. It would be better to place them manually around the map with an arrow pointing to each borough. However, this task would be better solved by editing the graph with design software rather than within Altair.
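
One common way to soften the wide-range problem, not used in this report, is a logarithmic color scale (in Altair, `alt.Scale(type='log')` on the color encoding). The sketch below uses hypothetical hexagon counts to compare where each value would fall along a color ramp under a linear and a logarithmic mapping.

```python
import numpy as np

# Hypothetical accident counts per hexagon: most cells are small, one is very large
counts = np.array([1, 3, 10, 30, 100, 1000])

def normalise(values):
    """Rescale values to [0, 1], i.e. the position along a color ramp."""
    return (values - values.min()) / (values.max() - values.min())

linear_pos = normalise(counts.astype(float))
log_pos = normalise(np.log10(counts))

for c, lin, lg in zip(counts, linear_pos, log_pos):
    print(f"{c:>5}  linear={lin:.2f}  log={lg:.2f}")
```

With the linear mapping, every count up to 100 is squeezed into the first tenth of the ramp, whereas the logarithmic mapping spreads them out. The trade-off is that equal color steps no longer correspond to equal differences in accident counts.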

Overall, the chart allows the reader to compare different areas of the city and notice how areas such as Manhattan or Brooklyn have a much higher number of accidents than other less visited areas.

In [None]:
mapa = get_map()
ny_df, bur = get_buroughs(mapa)
hex_data = calculate_spatial_data(accident_data, mapa)

map_chart, bar_chart=plot_map(hex_data,mapa,ny_df,bur)
map_chart | bar_chart


In [None]:
map_chart.properties(width=600)
# Might give error
# map_chart.save('resources/map2.svg')


Finally, we focus on the last question: whether the weather has an effect on the number of accidents. As can be seen in the following graph, we used a lollipop chart that displays, for each weather condition, the difference (in percentage) between the mean daily accidents under that condition and the overall mean.

The visualization allows the reader to quickly observe that with bad weather, the mean number of accidents is greater than with good weather. The colors and the ordering make it easy to distinguish between positive and negative differences, and the lollipop marks allow each value to be read precisely.

However, one criticism of this graph is that it hides further details about the distribution of the data and, more importantly, it does not let the reader know the number of days on which each weather condition occurred, which could skew the data. Nevertheless, as explained in the process section, we felt that this sacrifice had to be made in order to have a clearer and easier to understand visualization.

In [None]:
weather_data = get_weather_data(accident_data)
weather_chart(weather_data)
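
One possible way to address the criticism about the hidden number of days, sketched here on made-up data, is to compute the day count per condition alongside the mean and append it to the axis label. The column names `accidents`, `n_days` and `label` are illustrative and do not come from `get_weather_data`.

```python
import pandas as pd

# One row per day with its weather condition and accident count (made-up numbers)
daily = pd.DataFrame({
    "conditions": ["Clear", "Clear", "Clear", "Rain", "Rain, Overcast"],
    "accidents": [100, 110, 120, 150, 90],
})

# Mean daily accidents and number of days for each condition
summary = (
    daily.groupby("conditions")["accidents"]
    .agg(mean_cond="mean", n_days="size")
    .reset_index()
)

# Label such as "Clear (n=3)" that could replace the plain condition name on the axis
summary["label"] = summary["conditions"] + " (n=" + summary["n_days"].astype(str) + ")"
print(summary)
```

Conditions backed by a single day, such as "Rain" in this toy table, then become immediately recognizable as weak evidence.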