# VI: First Practical Work

**Authors:** Gerard Comas & Marc Franquesa.

In [None]:
# If using colaboratory, uncomment the following lines
# !pip3 uninstall -y altair
# !pip3 install altair==5.1.2

import pandas as pd
import numpy as np
import altair as alt
import geopandas as gpd
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

collisions = pd.read_csv("./processed-data/collisions.csv")
map_data = gpd.read_file("./processed-data/map.geojson")
weather = pd.read_csv("./processed-data/weather.csv")

## Design and implementation


In [None]:
# Helpful functions

def before_covid(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["AFTER COVID"] == False]

def after_covid(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["AFTER COVID"] == True]

### 1. Are accidents more frequent during weekdays or weekends? Is there any difference between before COVID-19 and after?

With an ambitious goal in mind, let's first plot the total number of collisions for each day of the week before COVID.

In [None]:
before_covid_day_count = before_covid(collisions).groupby(["CRASH WEEKDAY"]).size().reset_index(name="counts")

weekdayorder = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

alt.Chart(before_covid_day_count).mark_bar().encode(
    x = alt.X("CRASH WEEKDAY:O", sort=weekdayorder, axis=alt.Axis(title="Week Day")),
    y = alt.Y("counts:Q", axis=alt.Axis(title="Collisions"))
).properties(
    width=400
)

Let's now make a grouped bar chart, separating before and after COVID.

In [None]:
days_df = collisions.groupby(["CRASH WEEKDAY", "AFTER COVID"]).size().reset_index(name="counts")

before, after, all_time = "Summer 2018 (Before Covid)", "Summer 2020 (After Covid)", "All"

days_df["MOMENT"] = np.where(days_df["AFTER COVID"], after, before)

weekdayorder = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

opacity = 0.5

colors = {
    before: "#fdc086", # Before COVID
    after: "#7fc97f", # After COVID
    all_time: "#beaed4"
}

days_ch = alt.Chart(days_df).mark_bar(
    opacity=opacity
).encode(
   x=alt.X("CRASH WEEKDAY:O", axis=alt.Axis(labelAngle=-30, title=None), sort=weekdayorder),
   xOffset="MOMENT:O",
   y=alt.Y("counts:Q", axis=alt.Axis(title="Collisions", grid=True)),
   color=alt.Color("MOMENT:O", scale=alt.Scale(domain=list(colors.keys()), range=list(colors.values())), legend=alt.Legend(title=None))
)

days_ch

Let's now add the averages for before and after COVID.

In [None]:
averages = alt.Chart(days_df).mark_rule(opacity=1).encode(
    y="mean(counts):Q",
    size=alt.value(2),
    color="MOMENT:O"
)

averages + days_ch

Let's now separate the days of the week into two categories: weekdays and weekends.
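Because there are five weekdays but only two weekend days, raw totals for the two groups are not directly comparable, whereas a per-day average is. A minimal sketch of that comparison follows. The `toy` frame and its numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical collision counts per day of the week (not real data)
toy = pd.DataFrame({
    "CRASH WEEKDAY": ["Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"],
    "counts": [100, 110, 105, 115, 120, 90, 80],
})

# Tag each day as weekday or weekend
weekend_days = {"Saturday", "Sunday"}
toy["DAY TYPE"] = toy["CRASH WEEKDAY"].map(
    lambda day: "Weekend" if day in weekend_days else "Weekday"
)

# Average number of collisions per day within each group
per_day_mean = toy.groupby("DAY TYPE")["counts"].mean()
print(per_day_mean)
```

This is the same idea as the mean rules drawn on the charts: each rule summarizes its group by the average count per day, not by the group total.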

In [None]:
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
weekends = ["Saturday", "Sunday"]

weekdays_df = days_df[days_df["CRASH WEEKDAY"].isin(weekdays)]
weekends_df = days_df[days_df["CRASH WEEKDAY"].isin(weekends)]

weekdays_ch = alt.Chart(weekdays_df).mark_bar(opacity=opacity).encode(
   x=alt.X("CRASH WEEKDAY:O", axis=alt.Axis(labelAngle=0, title=None), sort=weekdayorder),
   xOffset="MOMENT:O",
   y=alt.Y("counts:Q", axis=alt.Axis(title="Collisions / Means", grid=True), scale=alt.Scale(domain=[0, 13000])),
   color=alt.Color("MOMENT:O", scale=alt.Scale(domain=list(colors.keys()), range=list(colors.values())))
).properties(title=alt.Title("Weekdays", fontSize=10, fontWeight=600))

averages_weekday = alt.Chart(weekdays_df).mark_rule(opacity=1).encode(
    y="mean(counts):Q",
    size=alt.value(2),
    color=alt.Color("MOMENT:O")
)


weekends_ch = alt.Chart(weekends_df).mark_bar(opacity=opacity).encode(
   x=alt.X("CRASH WEEKDAY:O", axis=alt.Axis(labelAngle=0, title=None), sort=weekdayorder),
   xOffset="MOMENT:O",
   y=alt.Y(
       "counts:Q",
       axis=alt.Axis(title=None, labels=False, domain=False, ticks=False, grid=True),
       scale=alt.Scale(domain=[0, 13000])
   ),
   color=alt.Color(
       "MOMENT:O",
       scale=alt.Scale(domain=list(colors.keys()), range=list(colors.values())),
       legend=alt.Legend(title=None)
   )
).properties(title=alt.Title("Weekends", fontSize=10, fontWeight=600))

averages_weekend = alt.Chart(weekends_df).mark_rule(opacity=1).encode(
    y="mean(counts):Q",
    size=alt.value(2),
    color="MOMENT:O"
)


# Played around with the sizes to make it look good in the final viz
q1 = (weekdays_ch + averages_weekday).properties(width=318, height=300) | (weekends_ch + averages_weekend).properties(width=136, height=300)

q1

### 2. Is there any type of vehicle more prone to participate in accidents?
Obviously, with the current data this cannot be answered properly: cars are by far the most common vehicle, so they will inevitably account for the most collisions. Let's start by viewing this data with a simple bar plot.

In [None]:
vehicles = collisions.groupby(["VEHICLE"]).size().reset_index(name="counts")

alt.Chart(vehicles).mark_bar().encode(
    y=alt.Y("counts:Q", axis=alt.Axis(title="Collisions")),
    x=alt.X("VEHICLE:O", axis=alt.Axis(title=None, labelAngle=-30))
).properties(
    width=400
)

This confirms what we hypothesized earlier.
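To illustrate why raw counts alone cannot tell us which vehicle is more prone to accidents, here is a small sketch that divides each vehicle's share of collisions by its share of circulation. All numbers, including the `CIRCULATION SHARE` column, are invented for the example. They are assumptions, not data from this project.

```python
import pandas as pd

# Made-up numbers for illustration only; the circulation shares are NOT real data
toy = pd.DataFrame({
    "VEHICLE": ["Car", "Motorcycle", "Bike"],
    "COLLISIONS": [900, 60, 40],
    "CIRCULATION SHARE": [0.90, 0.03, 0.07],
}).set_index("VEHICLE")

# Share of all collisions that involve each vehicle type
toy["COLLISION SHARE"] = toy["COLLISIONS"] / toy["COLLISIONS"].sum()

# Values above 1 mean the vehicle collides more often than its presence on the road would suggest
toy["RELATIVE RISK"] = toy["COLLISION SHARE"] / toy["CIRCULATION SHARE"]

riskiest = toy["RELATIVE RISK"].idxmax()
print(toy)
print("Highest relative risk:", riskiest)
```

In this toy case cars have by far the most collisions, yet the motorcycle has the highest relative risk, which is exactly the effect that raw counts hide.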

The first idea we had was to create a parallel coordinates plot with the following axes:

- Percentage of accidents
- Percentage of circulation
- Percentage of injuries
- Percentage of deaths
- Ratio of injuries/accident
- Ratio of deaths/accident

However, the provided data does not include the percentage of circulation for each vehicle, and we have not found any dataset online that provides this information. Now, we will look at how the following variables are distributed:

In [None]:
vehicles = collisions[["VEHICLE","NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]]
vehicles = vehicles[vehicles["VEHICLE"] != "Unknown"]

vehicles = vehicles.groupby("VEHICLE").agg({
    "VEHICLE": "count",
    "NUMBER OF PERSONS INJURED": "sum",
    "NUMBER OF PERSONS KILLED": "sum"
}).rename(columns={"VEHICLE": "COLLISIONS"}).reset_index()

total_collisions = vehicles["COLLISIONS"].sum()

# Compute the percentage of collisions for each vehicle type
vehicles["% COLLISIONS"] = vehicles["COLLISIONS"] / total_collisions * 100

# Compute the total number of people injured and killed across all collisions
total_injured = vehicles["NUMBER OF PERSONS INJURED"].sum()
total_killed = vehicles["NUMBER OF PERSONS KILLED"].sum()

# Compute the percentage of people injured and killed for each vehicle type
vehicles["% INJURED"] = vehicles["NUMBER OF PERSONS INJURED"] / total_injured * 100
vehicles["% KILLED"] = vehicles["NUMBER OF PERSONS KILLED"] / total_killed * 100

# Compute the ratios of people injured and killed per collision for each vehicle type
vehicles["INJURED PER COLLISION"] = vehicles["NUMBER OF PERSONS INJURED"] / vehicles["COLLISIONS"]
vehicles["KILLED PER COLLISION"] = vehicles["NUMBER OF PERSONS KILLED"] / vehicles["COLLISIONS"]

vehicles.head()

In [None]:
base = alt.Chart(vehicles, width=800).transform_window(
    index="count()"
).transform_fold(
    ["% COLLISIONS", "INJURED PER COLLISION", "KILLED PER COLLISION"]
).transform_joinaggregate(
    min="min(value)",
    max="max(value)",
    groupby=["key"]
).transform_calculate(
    norm_val="(datum.value - datum.min) / (datum.max - datum.min)",
    mid="(datum.min + datum.max) / 2"
)

lines = base.mark_line(opacity=0.3).encode(
    x="key:N",
    y= alt.Y("norm_val:Q", axis=None),
    color="VEHICLE:N",
    detail="index:N",
    opacity=alt.value(0.5)
)

rules = base.mark_rule(
    color="#ccc", tooltip=None
).encode(
    x="key:N",
    detail="count():Q",
) 

def ytick(yvalue, field):
    scale = base.encode(x="key:N", y=alt.value(yvalue), text=f"min({field}):Q")
    return alt.layer(
        scale.mark_text(baseline="middle", align="right", dx=-5, tooltip=None),
        scale.mark_tick(size=8, color="#ccc", orient="horizontal", tooltip=None)
    )

alt.layer(
    lines, rules ,ytick(0, "max"), ytick(150, "mid"), ytick(300, "min")
).configure_axisX(
    domain=False, labelAngle=0, tickColor="#ccc", title=None
).configure_view(
    stroke=None
)

Let's now try a scatter plot.

In [None]:
def parse(i):
    # Format with a thousands separator, keeping leading zeros (e.g. 12005 -> "12,005")
    return f"{int(round(i)):,}"

maximum = max(vehicles["COLLISIONS"])
minimum = min(vehicles["COLLISIONS"])
mean = vehicles["COLLISIONS"].mean()

legend_labels = (
    f"datum.label == '{parse(maximum)}' ? '{parse(maximum)}   (max)' : datum.label == '{parse(minimum)}' ? '{parse(minimum)}        (min)' : '{parse(mean)}   (mean)'"
)


# Using purple color as it represents the entire collision count
scatter = alt.Chart(vehicles).mark_circle(color=colors[all_time]).encode(
    x=alt.X("INJURED PER COLLISION:Q", axis=alt.Axis(title="Injuries per collision", tickCount=10)),
    y=alt.Y("KILLED PER COLLISION:Q", axis=alt.Axis(title="Deaths per collision")),
    size=alt.Size("COLLISIONS:Q", scale=alt.Scale(range=[10, 700]), legend=alt.Legend(title="Total collisions", values=[minimum, mean, maximum], labelExpr=legend_labels)),
).properties(
    width=500,
    height=300
)

# Let's add labels for each vehicle
labels = scatter.mark_text(
    align="right",
    dx=-15,
    dy=0
).encode(
    text="VEHICLE:N",
    size=alt.value(10)
)

q2 = (scatter + labels).properties(
    title="Vehicle Danger",
    width=590,
    height=300
)

q2

This one seems easier to understand and also looks nicer, so we have decided to keep it.

In [None]:
(q1 & q2).configure_legend(symbolOpacity=1).resolve_scale(size="independent")

### 3. At what time of the day are accidents more common?
Let's make a simpler histogram with the overall average as well as a little mark indicating the hour with the most collisions.
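As a minimal sketch of how the hour with the most collisions could be found programmatically, consider the tiny made-up sample below. The timestamps are hypothetical; only the column name `CRASH DATETIME` follows the dataset.

```python
import pandas as pd

# Hypothetical crash timestamps (not real data)
toy = pd.DataFrame({
    "CRASH DATETIME": [
        "2018-07-01 16:05", "2018-07-01 16:40", "2018-07-02 08:15",
        "2020-07-03 16:59", "2020-07-03 23:00",
    ]
})

# Extract the hour of the day and count collisions per hour
toy["HOUR"] = pd.to_datetime(toy["CRASH DATETIME"]).dt.hour
hourly = toy.groupby("HOUR").size()

# idxmax returns the index label (the hour) with the largest count
peak_hour = int(hourly.idxmax())
peak_count = int(hourly.max())
print(f"Peak hour: {peak_hour}h with {peak_count} collisions")
```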

In [None]:
time_df = collisions
time_df["HOUR"] = pd.to_datetime(time_df["CRASH DATETIME"]).dt.hour
time_df = time_df.groupby(["HOUR", "AFTER COVID"]).size().reset_index(name="counts")

time_df["MOMENT"] = np.where(time_df["AFTER COVID"], after, before)

time_ch = alt.Chart(time_df).mark_bar(opacity=opacity).encode(
    x=alt.X("HOUR:O", axis=alt.Axis(labelAngle=0, tickOffset=-10), title="Hour"),
    y=alt.Y("counts:Q", title="Collisions / Mean"),
    color=alt.Color(
        "MOMENT:O",
        scale=alt.Scale(domain=list(colors.keys()), range=list(colors.values())),
        legend=alt.Legend(title=None)
    ),
    order=alt.Order("MOMENT:O", sort="ascending")
)

# Sum only the counts, merging before and after COVID into one total per hour
time_all_df = time_df.groupby("HOUR", as_index=False)["counts"].sum()

averages_hour = alt.Chart(time_all_df).mark_rule(opacity=1, color=colors[all_time]).encode(
    y="mean(counts):Q",
    size=alt.value(2),
)

# Text mark with the total for hour 16 (defined here but not layered into q3)
max_hour = alt.Chart().mark_text(text=str(sum(time_df.loc[time_df["HOUR"] == 16, "counts"])), angle=0).encode(
    x=alt.value(330),
    y=alt.value(20),
)

q3 = (time_ch + averages_hour).properties(title="Collisions by Hour")
q3

In [None]:
((q1 | q3) & q2).configure_legend(symbolOpacity=1).resolve_scale(size="independent")

### 4. Are there any areas with a larger number of accidents?
Let's make a choropleth map. First, let's just plot a sample of collisions in NYC on top of a district map.

In [None]:
base = alt.Chart(map_data).mark_geoshape(fill="lightgray", stroke="black").project(type="albersUsa").properties(
    width=700,
    height=700
)

pts = alt.Chart(collisions[collisions["LOCATION"].notna()].head(5000)).mark_circle().encode(
    latitude="LATITUDE",
    longitude="LONGITUDE",
    color='BOROUGH',
    tooltip=['LATITUDE', "LONGITUDE"]
)

(base + pts)

Now let's make the choropleth map! We will use the purple scale because we are using the entire dataset, not just before/after COVID. Keep in mind that we normalize only by area; other factors, such as the total km of streets, also matter. However, we decided to go this way because any other variable would be tricky to use.
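The `COLLISIONS / KM2` column used for the map comes precomputed in the processed data. As a rough sketch of the normalization behind it, assuming hypothetical `collisions` and `area_km2` columns and invented values:

```python
import pandas as pd

# Hypothetical districts with made-up collision counts and areas
districts = pd.DataFrame({
    "boro_cd": ["A", "B", "C"],
    "collisions": [300, 300, 50],
    "area_km2": [2.0, 6.0, 0.5],
})

# Density: collisions divided by district area
districts["COLLISIONS / KM2"] = districts["collisions"] / districts["area_km2"]

# Rank districts from most to least dense
ranked = districts.sort_values("COLLISIONS / KM2", ascending=False)
print(ranked)
```

Districts A and B have the same number of collisions, but A is three times denser, and the small district C overtakes B once area is taken into account.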

In [None]:
base = alt.Chart(map_data).mark_geoshape().project(type="albersUsa").encode(
    color=alt.Color("COLLISIONS / KM2:Q", scale=alt.Scale(scheme='purples'), legend=alt.Legend(title="Collisions per km2")),
).properties(
    width=600,
    height=600,
    title="NYC Community Districts"
)

base

Let's add labels to the district with the most collisions per km2 and to the district in position 4, since districts 2 and 3 are next to #1. We label only two, as more would overcrowd the map. The labels come from [here](https://furmancenter.org/files/sotc/SOC2007_IndexofCommunityDistricts_000.pdf). We use the centroids of the districts to decide where to place the labels. Let's see how that looks.

In [None]:
top = map_data.sort_values(by="COLLISIONS / KM2", ascending=False).head(4)
top[["LATITUDE", "LONGITUDE"]] = top["geometry"].centroid.apply(lambda x: pd.Series([x.y, x.x]))

# Names for the two districts we want to label
labels = {
    "boro_cd": ["105", "205"],
    "LABELS": ["Midtown", "Fordham"]
}

top = top.merge(pd.DataFrame(labels), left_on="boro_cd", right_on="boro_cd")

text_labels = alt.Chart(top).mark_text(angle=0, dx=0, dy=0, fill="white", size=9).encode(
    longitude='LONGITUDE:Q',
    latitude='LATITUDE:Q',
    text='LABELS:N',
)

base + text_labels

The Midtown label looks good, but the Fordham one is barely visible. Let's place it where it can be read correctly. Let's also add a couple of icons for "interesting vehicles"! These icons are placed wherever those vehicles collided.

In [None]:
top.loc[top["LABELS"] == "Fordham", ["LATITUDE", "LONGITUDE"]] = [40.849746, -73.89958]


text_labels = alt.Chart(top).mark_text(angle=0, dx=0, dy=0, fill="white", size=9).encode(
    longitude="LONGITUDE:Q",
    latitude="LATITUDE:Q",
    text="LABELS:N",
)


horse = alt.Chart(collisions[collisions["ORIGINAL VEHICLE"] == "Horse"]).mark_text(text="🐎", size=18).encode(
    longitude="LONGITUDE:Q",
    latitude="LATITUDE:Q",
)

gokart = alt.Chart(collisions[collisions["ORIGINAL VEHICLE"] == "Go kart"]).mark_text(text="🏎️", size=18).encode(
    longitude="LONGITUDE:Q",
    latitude="LATITUDE:Q",
)


q4 = (base + horse + gokart + text_labels).properties(width=600, height=600)

q4

Great! Let's now put it all together.

In [None]:
((q4 | (q1 & q3)) & q2).configure_legend(symbolOpacity=1).resolve_scale(size="independent", color="shared")

### 5. Is there a correlation between weather conditions and accidents?

In [None]:
# Read weather data
weather = pd.read_csv("./processed-data/weather.csv")

weather_corr = weather.drop(columns=["valid"]).corr()

In [None]:
# reshape the data into a long format
corr_long = weather_corr.stack().reset_index()
corr_long.columns = ['x', 'y', 'value']

# create the heatmap
heatmap = alt.Chart(corr_long).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='value:Q'
).properties(
    width=300,
    height=300
)

# add text to the heatmap
text = heatmap.mark_text(baseline='middle').encode(
    text=alt.Text('value:Q', format='.2f'),
    color=alt.condition(
        alt.datum.value > 0.5,
        alt.value('white'),
        alt.value('black')
    )
)

heatmap + text

From this heatmap, we can see that there is a notable relationship between the columns `vsby` and `relh`: low visibility values are associated with high relative humidity values. There is also a strong correlation between the columns `relh` and `tmpf`. All of this makes sense given the thermodynamic relationships between climatic variables.
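To make the reading of the heatmap concrete, here is a small sketch on invented values in which humidity falls as temperature and visibility rise. The numbers are hypothetical and perfectly linear, so the correlations come out at exactly plus or minus one. Real data will be noisier.

```python
import pandas as pd

# Made-up weather readings (not real data)
toy = pd.DataFrame({
    "tmpf": [10.0, 15.0, 20.0, 25.0],
    "relh": [90.0, 80.0, 70.0, 60.0],
    "vsby": [4.0, 8.0, 12.0, 16.0],
})

# Pearson correlation matrix (the pandas default)
corr = toy.corr()

# Long format: one row per pair of variables, as needed for a heatmap
corr_long = corr.stack().reset_index()
corr_long.columns = ["x", "y", "value"]
print(corr_long)
```

A negative value for the pair `vsby` and `relh` is what "low visibility goes with high humidity" looks like in the matrix.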

In [None]:
# select the columns we want to keep
# .copy() so that columns added later do not trigger pandas' SettingWithCopyWarning
collisions_weather_selected = collisions[['CRASH DATETIME', 'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED', 'VEHICLE', 'tmpf', 'relh', 'sknt', 'p01i', 'vsby']].copy()

alt.data_transformers.disable_max_rows()

In [None]:
def violinPlot(dataset, column, rang):
    color = '#7fc97fbb' if dataset.equals(weather) else '#beaed4'
    title = 'Normal' if dataset.equals(weather) else 'Collisions'
    orient = 'right' if dataset.equals(weather) else 'left'
    chart = alt.Chart(dataset , width=100).transform_density(
        column,
        as_=[column, 'density'],
        extent= rang
    ).mark_area(orient='horizontal', color = color).encode(
        alt.X('density:Q')
            .stack('center')
            .impute(None)
            .title(None)
            .axis(labels=False, values=[0], grid=False, ticks=True),
        alt.Y(column + ':Q').title(title).axis(titleColor=color, orient=orient)
    )

    # Calculate quartiles
    q1 = dataset[column].quantile(0.25)
    q2 = dataset[column].quantile(0.5)
    q3 = dataset[column].quantile(0.75)

    # Add quartiles as horizontal lines
    q1_r = alt.Chart(pd.DataFrame({'y': [q1]})).mark_rule(color='#fee0d2', strokeWidth=2).encode(y='y')
    q2_r = alt.Chart(pd.DataFrame({'y': [q2]})).mark_rule(color='#fc9272', strokeWidth=2).encode(y='y')
    q3_r = alt.Chart(pd.DataFrame({'y': [q3]})).mark_rule(color='#de2d26', strokeWidth=2).encode(y='y')

    return chart + q1_r + q2_r + q3_r

(violinPlot(collisions_weather_selected, 'tmpf', [5, 45]) | 
 violinPlot(weather, 'tmpf', [5, 45])
).properties(
    title = "Temperature"
) | (violinPlot(collisions_weather_selected, 'relh', [0, 100]) | 
 violinPlot(weather, 'relh', [0, 100])
).properties(
    title = "Humidity"
) | (violinPlot(collisions_weather_selected, 'sknt', [0, 25]) | 
 violinPlot(weather, 'sknt', [0, 25])
).properties(
    title = "Speed of wind"
) | (violinPlot(collisions_weather_selected, 'p01i', [0, 0.5]) | 
 violinPlot(weather, 'p01i', [0, 0.5])
).properties(
    title = "Rainfall level"
) | (violinPlot(collisions_weather_selected, 'vsby', [0, 20]) | 
 violinPlot(weather, 'vsby', [0, 20])
).properties(
    title = "Visibility"
)


With this plot, we can compare the distribution of climatic variables when accidents occur versus their distribution at all times. The intention behind creating this graph was to help us understand if these distributions vary when accidents happen, that is, whether certain meteorological variables affect the number of accidents.

In some cases, we observe slightly different distributions, but it's not easy to compare them directly in this form. We need to find another way to address the question. The idea will be to look for the most extreme cases. We'll start with visibility, where we clearly see that the first quartile is at a lower level when accidents occur.

#### Visibility

In [None]:
print(f"Visibility in accidents: {collisions_weather_selected['vsby'].describe()}")
print(f"Visibility in general: {weather['vsby'].describe()}")


Based on this data, accidents appear to be more frequent when visibility is lower, as indicated by the lower mean and the lower first quartile. The other quartiles are at 16.093440 kilometers, equivalent to 10 miles, which is considered complete visibility according to the data source.

However, it would be essential to study the probability of an accident with low visibility compared to the probability of an accident with high visibility.

A clearer way to represent this would be a histogram where the X-axis represents visibility and the Y-axis represents the number of accidents divided by the number of weather observations with that visibility.
This way, we obtain a collision ratio, which says more about the likelihood of accidents under different visibility conditions.
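One detail worth keeping in mind when computing such a ratio is that both datasets should be cut with the same bin edges, otherwise a bin in one dataset may not cover the same visibility range as the bin it is divided by. A minimal sketch with invented values, where the variable names and the four-bin layout are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Made-up visibility readings in km (not real data)
weather_vsby = pd.Series([1.0, 2.0, 5.0, 9.0, 15.0, 16.0, 16.0, 16.0])
collision_vsby = pd.Series([1.0, 1.5, 16.0, 16.0])

# One set of edges shared by both datasets: 0, 4, 8, 12, 16
edges = np.linspace(0.0, 16.0, 5)

# Count observations per bin, keeping empty bins and the natural bin order
weather_counts = pd.cut(weather_vsby, bins=edges, include_lowest=True).value_counts(sort=False)
collision_counts = pd.cut(collision_vsby, bins=edges, include_lowest=True).value_counts(sort=False)

# Collisions per weather observation in each visibility bin
ratio = collision_counts / weather_counts
print(ratio)
```

In this toy sample the lowest-visibility bin has a ratio of 1.0 and the clearest bin a ratio of 0.5, even though both bins contain the same number of collisions.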

In [None]:
# create 17 bins for the vsby column
bins = pd.cut(collisions_weather_selected.dropna(subset=["vsby"])["vsby"], bins=17, labels=list(range(17)))

# group by the bins
grouped = collisions_weather_selected.groupby(bins)

# get the count of collisions in each bin
counts = grouped.size()


# create 17 bins for the vsby column
bins_weather = pd.cut(weather.dropna(subset=["vsby"])['vsby'], bins=17, labels=range(17))

# group by the bins
grouped_weather = weather.groupby(bins_weather)

# get the count of weather observations in each bin
counts_weather = grouped_weather.size()

In [None]:
# create a dataframe with counts and counts_weather
df = pd.DataFrame({'counts': counts, 'counts_weather': counts_weather})

# create a new column with the ratio of counts to counts_weather
df['ratio'] = df['counts'] / df['counts_weather']

df['visibility'] = df.index 

# create the bar chart
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X('visibility:O', axis=alt.Axis(title='Visibility')),
    y=alt.Y('ratio:Q', axis=alt.Axis(title='Ratio of Collisions')),
).properties(
    width=400,
    height=300
)

# add the mean to the plot
mean_line = alt.Chart(df).mark_rule(color='red', strokeDash=[5,5]).encode(
    y='mean(ratio):Q'
)

chart + mean_line

It could be beneficial for the user to understand how far each data point is from the average value. To achieve this, we will calculate the z-score for each data point, which measures the number of standard deviations a particular data point is from the mean. By plotting the data using a divergent color scheme based on the z-scores, we can emphasize the more extreme values in the dataset. This visual representation will highlight instances where the data significantly deviates from the mean, providing a clearer insight into the distribution and identifying any outliers.
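As a small worked example of the z-score, $z = (x - \bar{x}) / s$, computed on made-up ratios (the values are hypothetical). Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical ratios (not real data)
ratio = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_ratio = ratio.mean()   # 5.0
std_ratio = ratio.std()     # sample standard deviation (ddof=1)

# Number of standard deviations each value lies from the mean
z_score = (ratio - mean_ratio) / std_ratio
print(z_score.round(2))
```

By construction the z-scores have mean 0 and standard deviation 1, so a diverging color scheme centered on 0 separates values below the average from values above it.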

In [None]:
# calculate the mean and standard deviation of the ratio
mean_ratio = df['ratio'].mean()
std_ratio = df['ratio'].std()

# calculate the Z-Score for each data point
df['z_score'] = (df['ratio'] - mean_ratio) / std_ratio

# create the scatter plot
scatter = alt.Chart(df).mark_circle().encode(
    x=alt.X('visibility:O', axis=alt.Axis(title='Visibility')),
    y=alt.Y('z_score:Q', axis=alt.Axis(title='Z-Score of Ratio')),
    color=alt.Color('z_score:Q', scale=alt.Scale(scheme='purplegreen')),
    tooltip=['visibility', 'z_score']
).properties(
    width=400,
    height=300
)

# add the mean line to the plot
mean_line = alt.Chart(df).mark_rule(color='red', strokeDash=[5,5]).encode(
    y='mean(z_score):Q'
)

scatter + mean_line

We can combine the two graphs, which gives the following options:

In [None]:
# create a dataframe with counts and counts_weather
df = pd.DataFrame({'counts': counts, 'counts_weather': counts_weather})

# create a new column with the ratio of counts to counts_weather
df['ratio'] = df['counts'] / df['counts_weather']

df['visibility'] = df.index 

mean_ratio = df['ratio'].mean()
std_ratio = df['ratio'].std()

df['pstd'] = mean_ratio + 2*std_ratio
df['nstd'] = mean_ratio - 2*std_ratio

# calculate the Z-Score for each data point
df['z_score'] = (df['ratio'] - mean_ratio) / std_ratio

# create the bar chart
bar = alt.Chart(df).mark_bar().encode(
    x=alt.X('visibility:O', axis=alt.Axis(title='Visibility')),
    y=alt.Y('ratio:Q', axis=alt.Axis(title='Ratio of Collisions')),
    color=alt.Color('z_score:Q', scale=alt.Scale(scheme='purplegreen'), legend = alt.Legend(title='Z-Score of Ratio'))
).properties(
    width=400,
    height=300
)

# create the lollipop stems
rule = alt.Chart(df).mark_rule().encode(
    x=alt.X('visibility:O', axis=alt.Axis(title='Visibility')),
    y=alt.Y('ratio:Q', axis=alt.Axis(title='Ratio of Collisions')),
    color=alt.Color('z_score:Q', scale=alt.Scale(scheme='purplegreen'), legend = None)
).properties(
    width=400,
    height=300
)

# create the lollipop heads
point = alt.Chart(df).mark_circle().encode(
    x=alt.X('visibility:O', axis=alt.Axis(title='Visibility')),
    y=alt.Y('ratio:Q', axis=alt.Axis(title='Ratio of Collisions')),
    color=alt.Color('z_score:Q', scale=alt.Scale(scheme='purplegreen'), legend=None)
).properties(
    width=400,
    height=300
)

# add the mean to the plot
mean_line = alt.Chart(df).mark_rule(color='gray', strokeDash=[5,5]).encode(
    y='mean(ratio):Q'
)

pstd_line = alt.Chart(df).mark_rule(color='black', strokeDash=[5,5]).encode(
    y='pstd:Q'
)

nstd_line = alt.Chart(df).mark_rule(color='black', strokeDash=[5,5]).encode(
    y='nstd:Q'
)

(bar + mean_line + pstd_line + nstd_line) | (rule + point + mean_line + pstd_line + nstd_line)

We have two options, a barplot and a lollipop chart. Both represent the same information, but it seems that the barplot is easier to read. Therefore, we will stick with this graph.

The plot indicates a noticeable trend: as visibility decreases, the likelihood of an accident appears to increase. This observation is supported by the Z-Scores: the bins with extremely low visibility have ratios around 1.5 standard deviations above the mean, which corresponds to a higher probability of accidents. In simpler terms, when visibility is severely reduced, the data suggests a greater chance of accidents occurring. This aligns with the common understanding that adverse weather conditions leading to poor visibility can contribute to an elevated risk of accidents.


Still, a doubt arises: why are the results less conclusive than expected? This is because visibility is closely related to humidity, and humidity, in turn, is related to temperature, which is highly influenced by the time of day. Therefore, we can imagine that the hours with lower visibility coincide with the hours when there are fewer accidents. This is something we will verify shortly.

In [None]:
# extract the hour from the CRASH DATETIME column
collisions_weather_selected['HOUR'] = pd.to_datetime(collisions_weather_selected['CRASH DATETIME']).dt.hour

# group by hour and calculate the mean of the visibility column
mean_visibility = collisions_weather_selected.groupby('HOUR')['vsby'].mean()


# create a chart with the mean visibility by hour
visby_hour = alt.Chart(mean_visibility.reset_index()).mark_bar().encode(
    x=alt.X('HOUR:O', axis=alt.Axis(title='Hour')),
    y=alt.Y('vsby:Q', axis=alt.Axis(title='Mean Visibility')),
).properties(
    width=400,
    height=300
)

# add the mean line
mean_line = alt.Chart(mean_visibility.reset_index()).mark_rule(color='red', strokeDash=[5,5]).encode(
    y='mean(vsby):Q'
)

visby_hour + mean_line

We can see that the hours between 6 and 7 a.m. have the lowest visibility, and they coincide with hours with fewer collisions. This is not a causal relationship: lower visibility doesn't directly cause fewer accidents, which wouldn't make sense. Instead, the hours with lower visibility align with times when fewer people are driving, and therefore there are fewer accidents.

In [None]:
# create a new column with the visibility category
collisions_weather_selected['VISIBILITY CATEGORY'] = np.where(collisions_weather_selected['vsby'] > 16, 'High Visibility', 'Low Visibility')

# group by hour and visibility category and calculate the count of collisions
hourly_visibility = collisions_weather_selected.groupby(['HOUR', 'VISIBILITY CATEGORY']).size().reset_index(name='counts')


# calculate the total number of collisions per hour
hourly_total = hourly_visibility.groupby('HOUR')['counts'].sum().reset_index(name='total')

# merge the hourly_visibility and hourly_total dataframes
hourly_visibility = pd.merge(hourly_visibility, hourly_total, on='HOUR')

# calculate the percentage of low and high visibility collisions
hourly_visibility['percentage'] = hourly_visibility['counts'] / hourly_visibility['total'] * 100

# create the stacked bar chart
stacked_bar = alt.Chart(hourly_visibility).mark_bar().encode(
    x=alt.X('HOUR:O', axis=alt.Axis(title='Hour')),
    y=alt.Y('percentage:Q', axis=alt.Axis(title='Percentage of Collisions')),
    color=alt.Color('VISIBILITY CATEGORY:N', scale=alt.Scale(domain=['Low Visibility', 'High Visibility'], range=['#1f77b4', '#ff7f0e']), legend=alt.Legend(title='Visibility Category'))
).properties(
    width=400,
    height=300
)

stacked_bar

We will proceed with a similar approach, but without categorizing data into bins. Instead, we'll thoroughly analyze all the ratios, exploring the dataset to identify any discernible trends.

In [None]:
def visualize_weather(collisions_weather_selected, weather, column_name, grafic):
    """
    Generate a scatter plot with a regression line to visualize relationships between weather data and collision statistics.

    Parameters:
    - collisions_weather_selected (pd.DataFrame): DataFrame containing collision data with weather information.
    - weather (pd.DataFrame): DataFrame containing weather information.
    - column_name (str): The column in the dataframes to analyze and compare.
    - grafic (str): The type of relationship to visualize ("COLLISIONS PER HOUR", "INJURED PER ACCIDENT", or "KILLED PER ACCIDENT").

    Returns:
    alt.Chart: An Altair chart displaying the scatter plot and a regression line.
    """

    # Data Preparation
    df = collisions_weather_selected.groupby(column_name).agg({
        column_name: "count",
        "NUMBER OF PERSONS INJURED": "sum",
        "NUMBER OF PERSONS KILLED": "sum"
    }).rename(columns={column_name: "count"}).reset_index()

    # Number of weather observations for each value of the variable
    occurrence = weather.groupby(column_name).size().reset_index(name="occurrence")

    df = df.merge(occurrence, on=column_name)

    # Calculate additional metrics based on user choice
    if grafic == "COLLISIONS PER HOUR":
        df["COLLISIONS PER HOUR"] = df["count"] / df["occurrence"]
    elif grafic == "INJURED PER ACCIDENT":
        df["INJURED PER ACCIDENT"] = df["NUMBER OF PERSONS INJURED"] / df["count"]
    elif grafic == "KILLED PER ACCIDENT":
        df["KILLED PER ACCIDENT"] = df["NUMBER OF PERSONS KILLED"] / df["count"]

    # Calculate statistics for size legend
    minimum = min(df["occurrence"])
    maximum = max(df["occurrence"])
    mean = df["occurrence"].mean()

    # Create scatter plot; the regression line is layered on top below
    scatterplot = alt.Chart(df).mark_circle().encode(
        x=column_name + ":Q",
        y=grafic + ":Q",
        size=alt.Size("occurrence:Q", scale=alt.Scale(range=[10, 700]), legend=alt.Legend(title="Occurrences", values=[minimum, mean, maximum])),
    ).properties(
        width=400,
        height=300
    )

    regression = scatterplot.transform_regression(
        column_name, grafic, method="linear"
    ).mark_line(color="red")

    # Combine and return the chart
    return (scatterplot + regression).resolve_scale(y='shared')

In [None]:
charts = ["COLLISIONS PER HOUR", "INJURED PER ACCIDENT", "KILLED PER ACCIDENT"]
variables = ["vsby", "tmpf", "sknt", "p01i"]

show_charts = []
for grafic in charts:
    for variable in variables:
        show_charts.append(visualize_weather(collisions_weather_selected, weather, variable, grafic))


(((show_charts[0] | show_charts[1] | show_charts[2] | show_charts[3]) &
(show_charts[4] | show_charts[5] | show_charts[6] | show_charts[7])) &
(show_charts[8] | show_charts[9] | show_charts[10] | show_charts[11]))



It doesn't seem like we can reach a conclusive answer, as the distributions don't provide sufficient information. Other modifications that occurred to us to enhance this graph were:

- Adding the Pearson and Spearman correlation coefficients.

- Introducing color-coding based on the year.
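As a rough sketch of the first idea, the two coefficients could be computed as below. The data is invented, and Spearman is obtained here as the Pearson correlation of the ranks, which is its definition. This is an illustration, not something computed on the project data.

```python
import pandas as pd

# Hypothetical, monotonic but non-linear relationship (not real data)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2

# Pearson measures linear association
pearson = x.corr(y)

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

The gap between the two values hints at a relationship that is monotonic but not linear, which a single regression line would not capture well.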

For now, we won't pursue this path further and will continue exploring with the barplot we had. Other potential avenues could have included:

- Conducting hypothesis tests to compare the means of meteorological variables on days with accidents and without accidents.

- Employing clustering algorithms to group days with similar meteorological conditions and analyzing the accident frequency in each cluster.

- Segmenting the data by hours of the day and examining if there are significant differences in meteorological conditions between hours with more accidents and hours with fewer accidents.
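As an illustration of the first avenue, one possible hypothesis test is a permutation test on the difference of means. The sketch below uses invented temperatures for days with and without accidents. The values, group sizes, and number of permutations are all assumptions, and this test was not run on the project data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical daily mean temperatures (made-up values)
no_accident_days = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0])
accident_days = np.array([15.0, 16.0, 14.0, 17.0, 16.0, 15.0])

# Observed difference between the two group means
observed = accident_days.mean() - no_accident_days.mean()

# Shuffle the group labels many times and record how often a difference
# at least as large as the observed one appears by chance
pooled = np.concatenate([no_accident_days, accident_days])
n_first = len(no_accident_days)
n_perm = 2000
extreme = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[n_first:].mean() - shuffled[:n_first].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

# Two-sided p-value with the usual +1 correction
p_value = (extreme + 1) / (n_perm + 1)
print(f"Observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value would indicate that the difference in means is unlikely under random labelling. On real data the confounding with time of day discussed earlier would still need to be addressed.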

#### Rainfall

We will now follow the same process for rainfall:

In [None]:

# select the rows where p01i is 0
zero_p01i = collisions_weather_selected.loc[collisions_weather_selected['p01i'] == 0]

# get the number of rows with p01i = 0
num_zero_p01i = len(zero_p01i)

# select the rows where p01i is not 0
nonzero_p01i = collisions_weather_selected.loc[collisions_weather_selected['p01i'] != 0]

# create 10 bins for the p01i column
bins = pd.cut(nonzero_p01i.dropna(subset=["p01i"])['p01i'], bins=10)

# get the midpoint of each interval
midpoints = bins.apply(lambda x: x.mid.round(2))

# group by the midpoints
grouped = nonzero_p01i.groupby(midpoints)

# convert the result of the groupby to a dataframe
grouped_df = grouped.size().reset_index(name='counts')

# create a new dataframe with the count of rows with p01i = 0
zero_row = pd.DataFrame({'p01i': [0], 'counts': [num_zero_p01i]})

counts = pd.concat([zero_row , grouped_df])

# select the rows where p01i is 0
zero_p01i_weather = weather.loc[weather['p01i'] == 0]

# get the number of rows with p01i = 0
num_zero_p01i_weather = len(zero_p01i_weather)

# select the rows where p01i is not 0
nonzero_p01i_weather = weather.loc[weather['p01i'] != 0]

# create 10 bins for the p01i column
bins_weather = pd.cut(nonzero_p01i_weather.dropna(subset=["p01i"])['p01i'], bins=10)

# get the midpoint of each interval
midpoints_weather = bins_weather.apply(lambda x: x.mid.round(2))

# group by the midpoints
grouped_weather = nonzero_p01i_weather.groupby(midpoints_weather)

# convert the result of the groupby to a dataframe
grouped_df_weather = grouped_weather.size().reset_index(name='counts')

# create a new dataframe with the count of rows with p01i = 0
zero_row_weather = pd.DataFrame({'p01i': [0], 'counts': [num_zero_p01i_weather]})

counts_weather= pd.concat([zero_row_weather, grouped_df_weather])

In [None]:
# create a dataframe with counts and counts_weather
df = pd.DataFrame({'p01i': counts['p01i'] ,'counts': counts['counts'], 'counts_weather': counts_weather['counts']})

# create a new column with the ratio of counts to counts_weather
df['ratio'] = df['counts'] / df['counts_weather']

mean_ratio = df['ratio'].mean()
std_ratio = df['ratio'].std()

df['mean'] = mean_ratio
df['pstd'] = mean_ratio + 2*std_ratio
df['nstd'] = mean_ratio - 2*std_ratio

# calculate the Z-Score for each data point
df['z_score'] = (df['ratio'] - mean_ratio) / std_ratio

df.fillna(0, inplace=True)

# create the bar chart
bar = alt.Chart(df).mark_bar().encode(
    x=alt.X('p01i:O', axis=alt.Axis(title='Rain level')),
    y=alt.Y('ratio:Q', axis=alt.Axis(title='Ratio of Collisions')),
    color=alt.Color('z_score:Q', scale=alt.Scale(scheme='purplegreen'), legend = alt.Legend(title='Z-Score of Ratio'))
).properties(
    width=400,
    height=300
)

# add the mean to the plot
mean_line = alt.Chart(df).mark_rule(color='gray', strokeDash=[5,5]).encode(
    y='mean:Q'
)

pstd_line = alt.Chart(df).mark_rule(color='black', strokeDash=[5,5]).encode(
    y='pstd:Q'
)

nstd_line = alt.Chart(df).mark_rule(color='black', strokeDash=[5,5]).encode(
    y='nstd:Q'
)

(bar + mean_line + pstd_line + nstd_line) 

We would expect the collision ratio to increase as the rainfall level rises, and this holds up to 2.69 cm of rainfall, except for the bar at 2.28, which looks like an outlier: rain in that interval occurred only once. The ratio also drops, surprisingly, at 3.93, but this is again an outlier with a single occurrence. As a solution, we could consider a simpler plot: raining vs. not raining.

So, the graph we want to display will be the combination of two plots: the previous barplot and now a binary barplot indicating collisions per hour during rainfall and non-rainfall conditions.
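
One caveat of the binning used so far is that the collision rows and the weather rows are passed to `pd.cut` separately, so the bin edges (and therefore the midpoints) of the two tables need not coincide. A minimal sketch of one way to keep them comparable, assuming we compute the edges once from the weather data and reuse them for both tables; the toy frames and the helper `counts_per_bin` are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the notebook's tables.
weather_toy = pd.DataFrame({"p01i": [0.0, 0.0, 0.2, 0.4, 0.9, 1.5, 3.0]})
collisions_toy = pd.DataFrame({"p01i": [0.0, 0.2, 0.2, 0.4, 1.5, 1.5]})


def counts_per_bin(values: pd.Series, edges: np.ndarray) -> pd.Series:
    """Count the non-zero values that fall into each interval defined by edges."""
    nonzero = values[values != 0].dropna()
    binned = pd.cut(nonzero, bins=edges, include_lowest=True)
    # sort=False keeps the intervals in their natural order, empty ones included.
    return binned.value_counts(sort=False)


# Compute the edges once, from the non-zero weather readings ...
nonzero_weather = weather_toy.loc[weather_toy["p01i"] != 0, "p01i"]
edges = np.linspace(nonzero_weather.min(), nonzero_weather.max(), 4)  # 3 bins

# ... and reuse them for both tables so the rows line up bin by bin.
weather_counts = counts_per_bin(weather_toy["p01i"], edges)
collision_counts = counts_per_bin(collisions_toy["p01i"], edges)

ratio = collision_counts / weather_counts
print(ratio)
```

With shared edges, each ratio compares collisions and weather hours for the same rainfall interval.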

In [None]:
def weather_chart(collisions_weather_selected, weather, column_name, value, nbins):
    """
    Generate and visualize a bar chart comparing the collisions per hour in different weather conditions.

    Parameters:
    - collisions_weather_selected (pd.DataFrame): DataFrame containing collision data with weather information.
    - weather (pd.DataFrame): DataFrame containing weather information.
    - column_name (str): The column in the dataframes to analyze and compare.
    - value (float): The specific value within the column to focus on for comparison. This is the most common value of the column: for example, 0 for rainfall
      and 16.093440 for visibility.
    - nbins (int): Number of equal-width bins used to group the rows whose value differs from `value`. A separate bar is drawn for `value` itself,
      so the chart actually shows nbins + 1 bars.

    Returns:
    alt.Chart: An Altair chart displaying the collisions per hour in different weather conditions, along with statistical indicators.
    """

    # Data Processing (for collisions)
    zero = collisions_weather_selected.loc[collisions_weather_selected[column_name] == value]
    num_zero = len(zero)

    nonzero = collisions_weather_selected.loc[collisions_weather_selected[column_name] != value]
    num_nonzero = len(nonzero)

    bins = pd.cut(nonzero.dropna(subset=[column_name])[column_name], bins=nbins)
    midpoints = bins.apply(lambda x: x.mid.round(2))

    grouped = nonzero.groupby(midpoints)
    grouped_df = grouped.size().reset_index(name='counts')

    zero_row = pd.DataFrame({column_name: [value], 'counts': [num_zero]})
    non_zero_row = pd.DataFrame({column_name: [1], 'counts': [num_nonzero]})

    counts = pd.concat([zero_row , grouped_df])

    # Data Processing (for weather)
    zero_weather = weather.loc[weather[column_name] == value]
    num_zero_weather = len(zero_weather)

    nonzero_weather = weather.loc[weather[column_name] != value]
    num_nonzero_weather = len(nonzero_weather)

    bins_weather = pd.cut(nonzero_weather.dropna(subset=[column_name])[column_name], bins=nbins)
    midpoints_weather = bins_weather.apply(lambda x: x.mid.round(2))

    grouped_weather = nonzero_weather.groupby(midpoints_weather)
    grouped_df_weather = grouped_weather.size().reset_index(name='counts')

    zero_row_weather = pd.DataFrame({column_name: [value], 'counts': [num_zero_weather]})
    non_zero_row_weather = pd.DataFrame({column_name: [1], 'counts': [num_nonzero_weather]})

    counts_weather = pd.concat([zero_row_weather, grouped_df_weather])

    # Combine Collision and Weather Data
    df = pd.DataFrame({column_name: counts[column_name] ,'counts': counts['counts'], 'counts_weather': counts_weather['counts']})

    # Data Processing (for statistical analysis)
    counts_bin = pd.concat([zero_row , non_zero_row])
    counts_weather_bin = pd.concat([zero_row_weather , non_zero_row_weather])
    df_bin = pd.DataFrame({column_name: counts_bin[column_name] ,'counts': counts_bin['counts'], 'counts_weather': counts_weather_bin['counts']})

    df['ratio'] = df['counts'] / df['counts_weather']

    mean_ratio = df['ratio'].mean()
    std_ratio = df['ratio'].std()

    df['mean'] = mean_ratio
    df['pstd'] = mean_ratio + 2 * std_ratio
    df['nstd'] = mean_ratio - 2 * std_ratio

    # Calculate the Z-Score for each data point
    df['z_score'] = (df['ratio'] - mean_ratio) / std_ratio

    df.fillna(0, inplace=True)

    # Visualization (for Ratio of Collisions)
    bar = alt.Chart(df).mark_bar().encode(
        x=alt.X(column_name + ':O'),
        y=alt.Y('ratio:Q', axis=alt.Axis(title='Collisions per Hour')),
        color=alt.Color('z_score:Q', scale=alt.Scale(scheme='brownbluegreen'), legend = alt.Legend(title='Z-Score of Ratio'))
    ).properties(
        width=400,
        height=300
    )

    # Add statistical indicators to the plot
    mean_line = alt.Chart(df).mark_rule(color='gray', strokeDash=[5,5]).encode(
        y='mean:Q'
    )

    pstd_line = alt.Chart(df).mark_rule(color='black', strokeDash=[5,5]).encode(
        y='pstd:Q'
    )

    nstd_line = alt.Chart(df).mark_rule(color='black', strokeDash=[5,5]).encode(
        y='nstd:Q'
    )

    # Additional Visualization (Binary Bar Chart)
    df_bin['ratio'] = df_bin['counts'] / df_bin['counts_weather']

    binary_bar = alt.Chart(df_bin).mark_bar(size=90).encode(
        x=alt.X(column_name + ":O"),
        y=alt.Y('ratio:Q', axis=alt.Axis(title=None, labels=False, domain=False, ticks=False, grid=True)),
    ).properties(
        width=200,
        height=300
    )

    # Combine and Return
    return ((bar + mean_line + pstd_line + nstd_line) | binary_bar).resolve_scale(y='shared')


In [None]:
weather_chart(collisions_weather_selected, weather, 'p01i', 0, 3)

We have now changed the color scale to avoid colors already used in other graphs, and we have added the binary barplot. We have also reduced the number of bins: extreme meteorological conditions are infrequent, so the bins at that end contain very few records, and fewer, wider bins should make the conclusions more robust. Finally, we have renamed the Y-axis to 'Collisions per Hour' so that it is self-explanatory.

As we have created a function to generate this type of graph, we can now produce this plot for any meteorological variable of interest.

In [None]:
weather_chart(collisions_weather_selected, weather, 'sknt', 0, 3)

In [None]:
weather_chart(collisions_weather_selected, weather, 'vsby', 16.093440, 3)

With these graphs, we realize that we probably no longer need the binary plot, and we have rethought the entire approach. We already use barplots to address questions 1 and 3, and the color in these graphs only encodes the distance from the mean, so the charts convey little information for the space they occupy. That is why, in the end, we have opted for a heatmap.

In [None]:
def create_column_count_df(collisions_weather_selected, weather, column_name, value, nbins):
    """
    Create a DataFrame with counts and ratios based on specified conditions in the input dataframes.

    Parameters:
    - collisions_weather_selected (pd.DataFrame): DataFrame containing collision data with weather information.
    - weather (pd.DataFrame): DataFrame containing weather information.
    - column_name (str): The column in the dataframes to analyze and compare.
    - value (float): The specific value within the column to focus on for comparison.
    - nbins (int): Number of bins to use for grouping non-zero values.

    Returns:
    pd.DataFrame: A DataFrame containing the specified column, counts, and the ratio of counts in different conditions.
    """

    # Data Preparation (for collisions)
    zero = collisions_weather_selected.loc[collisions_weather_selected[column_name] == value]
    num_zero = len(zero)

    nonzero = collisions_weather_selected.loc[collisions_weather_selected[column_name] != value]
    num_nonzero = len(nonzero)

    bins = pd.cut(nonzero.dropna(subset=[column_name])[column_name], bins=nbins)
    midpoints = bins.apply(lambda x: x.mid.round(2))

    grouped = nonzero.groupby(midpoints)
    grouped_df = grouped.size().reset_index(name='counts')

    zero_row = pd.DataFrame({column_name: [value], 'counts': [num_zero]})
    counts = pd.concat([zero_row , grouped_df])
    # Data Preparation (for weather)
    zero_weather = weather.loc[weather[column_name] == value]
    num_zero_weather = len(zero_weather)

    nonzero_weather = weather.loc[weather[column_name] != value]
    num_nonzero_weather = len(nonzero_weather)

    bins_weather = pd.cut(nonzero_weather.dropna(subset=[column_name])[column_name], bins=nbins)
    midpoints_weather = bins_weather.apply(lambda x: x.mid.round(2))

    grouped_weather = nonzero_weather.groupby(midpoints_weather)
    grouped_df_weather = grouped_weather.size().reset_index(name='counts')

    zero_row_weather = pd.DataFrame({column_name: [value], 'counts': [num_zero_weather]})
    counts_weather = pd.concat([zero_row_weather, grouped_df_weather])

    # Combine Collision and Weather Data
    df = pd.DataFrame({column_name: counts[column_name],
                       'counts_' + column_name: counts['counts'],
                       'counts_weather_' + column_name: counts_weather['counts']})

    # Calculate and add ratio to the DataFrame
    df['ratio_' + column_name] = df['counts_' + column_name] / df['counts_weather_' + column_name]

    return df[[column_name, 'ratio_' + column_name]]

In [None]:
bins = 3
df1 = create_column_count_df(collisions_weather_selected, weather, 'p01i', 0, bins).reset_index(drop=True)
df2 = create_column_count_df(collisions_weather_selected, weather, 'sknt', 0, bins).reset_index(drop=True)
df3 = create_column_count_df(collisions_weather_selected, weather, 'vsby', 16.093440, bins)

df3_0 = df3[df3['vsby'] == 16.09344]
df3_1 = df3[df3['vsby'] != 16.09344]
df3_2 = pd.concat([df3_1, df3_0], axis=0).reset_index(drop=True).sort_index(ascending=False).reset_index(drop=True)

heatmap_df = pd.concat([df1, df2, df3_2], axis=1)

heatmap_df1 = heatmap_df.melt(id_vars=['ratio_p01i'], value_vars=['p01i'], var_name='Column', value_name='Value')
heatmap_df1.rename(columns={'ratio_p01i': 'Ratio'}, inplace=True)
heatmap_df2 = heatmap_df.melt(id_vars=['ratio_sknt'], value_vars=['sknt'], var_name='Column', value_name='Value')
heatmap_df2.rename(columns={'ratio_sknt': 'Ratio'}, inplace=True)
heatmap_df3 = heatmap_df.melt(id_vars=['ratio_vsby'], value_vars=['vsby'], var_name='Column', value_name='Value')
heatmap_df3.rename(columns={'ratio_vsby': 'Ratio'}, inplace=True)

heatmap_df = pd.concat([heatmap_df1, heatmap_df2, heatmap_df3], axis=0)

heatmap_df.reset_index(inplace=True)

conditionorder = ["Perfect", "Moderate", "Bad", "Terrible"]
heatmap_df["CONDITION"] = conditionorder*3

In [None]:
axis_y_labels = (
    "datum.label == 'p01i' ? 'Rain' : datum.label == 'sknt' ? 'Wind' : 'Visibility'"
)

q5 = alt.Chart(heatmap_df).mark_rect().encode(
    x=alt.X('CONDITION:O', axis = alt.Axis(title="Condition", labels=True, labelAngle=0, domain=True, ticks=True, grid=False), sort=conditionorder),
    y=alt.Y('Column:O', axis=alt.Axis(title="Weather", labelExpr=axis_y_labels)),
    color=alt.Color('Ratio:Q', scale=alt.Scale(scheme='purples'), legend=alt.Legend(title="Collisions per Hour")),
).properties(
    title="Different Weather Conditions",
    width=481,
    height=300
).resolve_legend(color="independent")

q5

We have mapped the best possible conditions to Perfect, and the remaining three bins are labelled from better to worse. It is now easy to see that the number of collisions per hour tends to increase as weather conditions worsen. Therefore, we can address question 5.
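
The labelling above relies on the row order of each per-variable table (which is why visibility had to be reversed by hand). As a hedged alternative, the sketch below makes the ordering explicit with an ordered categorical. The helper `label_conditions`, the flag `worse_when_higher`, and all numbers are hypothetical and only serve as illustration:

```python
import pandas as pd

condition_order = ["Perfect", "Moderate", "Bad", "Terrible"]


def label_conditions(df: pd.DataFrame, worse_when_higher: bool) -> pd.DataFrame:
    """Sort the bins from best to worst and attach an ordered CONDITION label."""
    ordered = df.sort_values("Value", ascending=worse_when_higher).reset_index(drop=True)
    ordered["CONDITION"] = pd.Categorical(
        condition_order[: len(ordered)], categories=condition_order, ordered=True
    )
    return ordered


# Hypothetical bin values and ratios, deliberately given in shuffled order.
rain = pd.DataFrame(
    {"Column": "p01i", "Value": [1.5, 0.0, 2.5, 0.5], "Ratio": [14.0, 10.0, 11.0, 12.0]}
)
visibility = pd.DataFrame(
    {"Column": "vsby", "Value": [8.0, 16.09, 2.7, 13.4], "Ratio": [13.0, 10.0, 12.0, 11.0]}
)

labelled = pd.concat(
    [
        label_conditions(rain, worse_when_higher=True),         # more rain is worse
        label_conditions(visibility, worse_when_higher=False),  # less visibility is worse
    ],
    ignore_index=True,
)
print(labelled)
```

Here the direction of "worse" is stated once per variable instead of being implied by how the rows happen to be sorted.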

In [None]:
((q4 | (q1 & q3)) & (q2 | q5).resolve_scale(color="independent").resolve_legend(size="independent")).configure_legend(symbolOpacity=1)

This will be our final visualization! Note that we mainly used purple, as this color represents the entire dataset, while yellow represents before Covid and green after Covid (get it? VIRUS!). We have also carefully sized all plots so that they are perfectly aligned.


### 7. What is the main cause of accidents?

Studying the different columns we have and the questions we have already answered, it seemed necessary to determine the main cause of accidents. Our initial idea is to create a heatmap where the Y-axis represents vehicle types, the X-axis represents the cause of accidents, and the color indicates the percentage of accidents for each type of vehicle. Using absolute numbers would be challenging for comparison since we know that there are many more car accidents.
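
To illustrate why percentages per vehicle are easier to compare than absolute counts, here is a small sketch on made-up records. It uses `pd.crosstab` with `normalize="index"`, which is one possible way to obtain per-vehicle shares and not necessarily the one used in the cells that follow; the rows and the factor names in the toy frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy records that reuse the notebook's column names.
toy = pd.DataFrame({
    "VEHICLE": ["Car", "Car", "Car", "Car", "Bicycle", "Bicycle"],
    "FACTOR": [
        "Driver Inattention", "Driving Infraction", "Driver Inattention", "Other",
        "Driver Inattention", "Pedestrian Error",
    ],
})

# normalize="index" divides every row by its own total, so each vehicle's
# factors sum to 100% regardless of how many collisions that vehicle has.
pct = pd.crosstab(toy["VEHICLE"], toy["FACTOR"], normalize="index") * 100
print(pct.round(1))
```

In this toy table cars have twice as many records as bicycles, yet both rows are on the same 0 to 100 scale, which is what makes a shared color scale meaningful.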

In [None]:
factor_df = collisions[["VEHICLE", "FACTOR", "ORIGINAL FACTOR"]]

# count the number of accidents for each vehicle
factor_df_grouped_vehicle = factor_df.groupby(["VEHICLE"]).size().reset_index(name="counts_vehicle")

# count the number of accidents for each factor and for each vehicle
factor_df_grouped = factor_df.groupby(["VEHICLE", "FACTOR"]).size().reset_index(name="counts")

# Drop rows with "Unknown" vehicle and "Unspecified" factor
factor_df_grouped = factor_df_grouped[(factor_df_grouped["VEHICLE"] != "Unknown") & (factor_df_grouped["FACTOR"] != "Unspecified")]

# merge the two dataframes
factor_contribution_df = factor_df_grouped.merge(factor_df_grouped_vehicle, on="VEHICLE")

# calculate the percentage of accidents for each factor and for each vehicle
factor_contribution_df["PERCENTAGE"] = factor_contribution_df["counts"] / factor_contribution_df["counts_vehicle"] * 100

In [None]:
factor_heatmap = alt.Chart(factor_contribution_df).mark_rect().encode(
    x=alt.X("FACTOR:O", axis = alt.Axis(title="Factor", labelAngle=30)),
    y=alt.Y("VEHICLE:O", axis=alt.Axis(title="Vehicle")),
    color=alt.Color("PERCENTAGE:Q", scale=alt.Scale(scheme="tealblues"), legend=alt.Legend(title="Percentage of Collisions")),
).properties(
    title="Factors Contributing to Accidents",
    width=522,
    height=300
).resolve_legend(color="independent")

factor_heatmap

We can see that two factors contribute more than the others: 'Driver Inattention' and 'Driving Infraction'. It's challenging to discern much from the other factors. Therefore, we have decided to explore what happens within the original factors classified as 'Driver Inattention' and 'Driving Infraction'.

In [None]:
# Filter the dataframe where FACTOR is "Driving Infraction"
driving_infraction_df = factor_df[factor_df["FACTOR"] == "Driving Infraction"]

# count the number of accidents for each vehicle
factor_df_grouped_vehicle = driving_infraction_df.groupby(["VEHICLE"]).size().reset_index(name="counts_vehicle")

# count the number of accidents for each factor and for each vehicle
factor_df_grouped = driving_infraction_df.groupby(["VEHICLE", "ORIGINAL FACTOR"]).size().reset_index(name="counts")

# Drop rows with "Unknown" vehicle
factor_df_grouped = factor_df_grouped[(factor_df_grouped["VEHICLE"] != "Unknown")]

# merge the two dataframes
factor_contribution_df = factor_df_grouped.merge(factor_df_grouped_vehicle, on="VEHICLE")

# calculate the percentage of accidents for each factor and for each vehicle
factor_contribution_df["PERCENTAGE"] = factor_contribution_df["counts"] / factor_contribution_df["counts_vehicle"] * 100

In [None]:
factor_heatmap2 = alt.Chart(factor_contribution_df).mark_rect().encode(
    x=alt.X("ORIGINAL FACTOR:O",  axis = alt.Axis(labelAngle=30, title="Factor")),
    y=alt.Y("VEHICLE:O", axis=alt.Axis(title=None)),
    color=alt.Color("PERCENTAGE:Q", scale=alt.Scale(scheme="tealblues"), legend=alt.Legend(title=["Percentage of Collisions due", "to Driving Infractions"])),
).properties(
    title="Driving Infractions contributing to Accidents",
    width=522,
    height=300
).resolve_legend(color="independent")

factors = (factor_heatmap | factor_heatmap2).resolve_legend(color="independent")
factors

In [None]:
(
    (
        (
            (q4 & q2)
            .resolve_scale(color="independent")
            .resolve_legend(size="independent")
            | (q5 & (q1 & q3))
        )
        & factors
    )
    .resolve_legend(size="independent")
    .resolve_scale(color="independent")
    .configure_legend(symbolOpacity=1)
)


## Questions: 

**1. Are accidents more frequent during weekdays or weekends? Is there any difference between before COVID-19 and after?**

Looking at the 2nd chart on the right (paired bar chart), we can use the mean lines to compare whether accidents are more frequent on weekends or on weekdays. We can expect roughly the same number of each day of the week in the data (give or take one), which is why we can compare the total sums of collisions. Comparing the weekday means (before Covid in *yellow/orange*, after Covid in *green*) with the weekend means, the number of collisions is clearly higher on weekdays. After Covid, the total collisions are more evenly distributed among the days of the week: the weekday average is still higher than the weekend average, but the difference is less pronounced.

Using this chart we can also see that Fridays are the days with the most collisions (likely because everyone is rushing for the weekend or is simply tired from the long week). Sundays, on the other hand, are the days with the fewest collisions.

**2. Is there any type of vehicle more prone to participate in accidents?**

This question is very tricky. To answer it properly, we would need to know the number of kilometers driven per vehicle type over the dates we used. Cars are certainly the vehicle with the most collisions, but simply because they are driven the most kilometers. We tried our best to find such data, spending several hours in fact, but what we found was either incomplete or simply wrong.

The 2nd chart on the left (Vehicle Danger) has deaths per collision on the y-axis, injuries per collision on the x-axis, and total collisions as size. We can see that cars have a total of 95263 collisions (they are the largest circle), representing around 90% of the collisions. This, however, does not imply that cars are 90% more prone!

**3. At what time of the day are accidents more common?**

Using the third chart on the right (stacked bar chart) we can see that we have the most accidents from 16 to 17. Again, since hours are equally distributed throughout the day, we can use the total amount to indicate when accidents are more common.

It is interesting to see the second-highest peak from 9 to 10. This peak corresponds to the beginning of the day, and the one from 16 to 17 to its end (when people go back home). Most collisions are concentrated between 8 and 20 (these hours are all above the mean).

Interestingly, from 0 to 1 (midnight to 1 a.m.) there is quite a large peak that does not seem to follow the trend. This is likely due to records whose time was missing and was set to 00:00. This is not an oversight on our part: we checked, and the Time variable in the original dataset has 0 nulls.
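
A check of this kind could look roughly like the sketch below. It runs on made-up data, and the column name `CRASH TIME` with `H:MM` strings is an assumption about how the time is stored, so it would need adapting to the real table.

```python
import pandas as pd

# Hypothetical toy data; "CRASH TIME" holding "H:MM" strings is an assumption.
toy = pd.DataFrame(
    {"CRASH TIME": ["0:00", "0:00", "0:00", "0:15", "0:40", "9:30", "16:05"]}
)

# Unparseable or missing times become NaT, so they can be counted explicitly.
times = pd.to_datetime(toy["CRASH TIME"], format="%H:%M", errors="coerce")
n_missing = int(times.isna().sum())

# Within the first hour of the day, how many records are stamped exactly 00:00?
in_first_hour = times.dt.hour == 0
exactly_midnight = in_first_hour & (times.dt.minute == 0)
share = exactly_midnight.sum() / in_first_hour.sum()

print(f"Missing times: {n_missing}")
print(f"Share of first-hour records at exactly 00:00: {share:.0%}")
```

A share far above what a uniform spread over 60 minutes would give suggests that 00:00 is being used as a placeholder, even when the column itself contains no nulls.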

**4. Are there any areas with a larger number of accidents?**

Using the map on the top left (opacity represents collisions per $km^2$) we can see that the Midtown Manhattan area as well as Fordham (Bronx) are the places with most collisions per $km^2$. Labels are placed in the district with most collisions per $km^2$ and the fourth most (because of space).

Using the map we can also see that a Horse crashed in Manhattan as well as a Go Kart in the Bronx! They are the sole cases of their respective vehicle type.

**5. Is there a correlation between weather conditions and accidents?**

Using the purple heatmap on the right, we can see that as Rain, Wind and Visibility conditions worsen, we find more collisions per hour: colors are darker when conditions are not perfect. For Rain and Visibility, the collision rate is lower under terrible conditions, which we can explain by thinking about how people behave. In very heavy rain we readily realise that we must drive more carefully, whereas in light rain we may not even care. The same goes for Visibility. Wind, on the other hand, is different: once we are in the car it is harder to notice, so we do not change how we drive as much.

In summary, there is a correlation: we find more accidents when conditions are not perfect.

## Extra

**6. What vehicles are the most dangerous?**

Using the 2nd chart on the left (scatter plot), we can see that two-wheelers injure more people per collision than four-wheelers! The latter have a chassis that protects the people inside in a collision, whereas two-wheelers do not. If we look a bit closer, we see that the faster a two-wheeler is, the higher its death rate, with the motorcycle being the vehicle with the most deaths per collision.

**7. What is the main cause of accidents?**

Using the left heatmap in the blue-teal pair at the bottom, it becomes evident that the primary causes of accidents are driving infractions and driver inattention. However, delving deeper into the chart reveals additional valuable insights. Notably, in the case of bicycles and e-bikes, accidents caused by pedestrian errors are more likely compared to other vehicle types. This indicates a heightened risk of accidents involving pedestrians in interactions with bicycles and e-bikes. Additionally, an interesting observation arises concerning bicycles, e-bikes, and e-scooters. In these vehicle categories, there is a higher percentage of accidents attributed to driver inattention. Curiously, these do not require a license to drive (except some e-scooters)!

Using the right heatmap in the blue-teal pair at the bottom, we can break down the driving infractions that lead to accidents. Here are some notable examples: Firstly, ambulances stand out with a higher percentage of collisions attributed to "passing too close." Similarly, in the case of bicycles, there is a higher percentage of collisions associated with "Failure to Yield Right of Way". These are just some highlighted examples, and a more in-depth analysis would reveal additional insights into specific driving infractions contributing to accidents.