# VI: Second Practical Work

**Authors:** Gerard Comas & Marc Franquesa.

**Altair:** version 5.

We have found that our visualisation works best in Jupyter, so if possible, run this notebook in Jupyter (you will find a `requirements.txt` file). Streamlit/Colab will be slower due to the number of layers some of these charts have.


**COLAB INSTRUCTIONS:**
1. Upload notebooks
2. Create `original-data` and `processed-data` folders
3. Upload datasets found in local `original-data` to colab `original-data`
4. Execute `pre-processing.ipynb` notebook
5. Execute `design.ipynb` notebook

The visualisation was planned in detail before any actual coding, which is why there won't be much playing with charts and the notebook is fairly straight to the point. In addition, some plots are very complex, with one of our final charts requiring up to 6 layers. Because of the interaction, a chart that depends on another chart cannot be plotted by itself, so you will see no layer-by-layer "build-up"; the final notebook would be too bulky in our opinion. However, we have done our best to explain what each layer does and to include our reasoning.

Additionally, as seen in the pre-processing notebook, we have the option to fill our dataset with empty values in order to make sensible plots. This in turn left us with over 180k rows, which is why you will see a groupby before every plot, removing unnecessary rows and columns. However, we have also made the charts compatible with the case where no filling is done, each plot requiring a different workaround.
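As a minimal sketch of this reduction step, assuming a tiny hypothetical frame (the column names mirror ours, but the rows are invented for illustration), grouping on only the columns a chart needs collapses filler and duplicate rows into one row per combination:

```python
import pandas as pd

# Hypothetical rows: VALID is 1 for a real collision and 0 for a filler row.
toy = pd.DataFrame(
    {
        "MONTH": ["June", "June", "June", "July", "July"],
        "VEHICLE": ["Taxi", "Taxi", "Ambulance", "Taxi", "Taxi"],
        "HOUR": [8, 9, 8, 8, 10],  # not needed by a per-month bar chart
        "VALID": [1, 1, 0, 1, 1],
    }
)

# Keep only the columns the chart uses and sum VALID per combination.
reduced = toy.groupby(["MONTH", "VEHICLE"]).agg({"VALID": "sum"}).reset_index()
print(reduced)
```

The filler row survives as a combination with a count of zero, which is what lets a chart still show an empty category.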

Finally, before starting, the visualisation has been designed to be able to answer _any_ variant of the questions required, as it has flexibility in choosing any and all features. For example:
* Which area presented the majority of ambulance accidents during clear days in August on Mondays at 8PM?
* Which day had more accidents during rainy days in June in the Bronx?

Both can be answered with our design as easily as the original questions.

Here are the imports, the initial data loading, and some useful global variables for the color schemes.

In [None]:
# If using colaboratory, uncomment the following lines
# !pip3 uninstall -y altair
# !pip3 install altair==5.*
# !pip3 uninstall -y pyarrow
# !pip3 install pyarrow

In [None]:
import pandas as pd
import altair as alt
import geopandas as gpd
import warnings

alt.data_transformers.disable_max_rows()
warnings.simplefilter(action="ignore", category=FutureWarning)

collisions = pd.read_csv("./processed-data/collisions_weather.csv")
map_data = gpd.read_file("./processed-data/map.geojson")


primary = "purple"
boroughs_colors = "boroughs"
schema = "schema"
vehicles_colors = "vehicles"

colors = {
    "green": "#ccebc5",
    "purple": "#bebada",
    "boroughs": {
        "Staten Island": "#8dd3c7",
        "Queens": "#fdb462",
        "Brooklyn": "#b3de69",
        "Manhattan": "#fb8072",
        "Bronx": "#80b1d3",
    },
    "vehicles": {
        "Ambulance": "#fccde5",
        "Fire truck": "#ffed6f",
        "Taxi": "#ccebc5",
    },
    "schema": "purples",
}

collisions["VALID"] = collisions["VALID"].astype("int64")

collisions.head()

We started by visualising three barplots: one shows collisions by month, another by vehicle, and the last by weather conditions. This easily answers question 1, while also providing flexibility for combinations further along. All three barplots filter one another, and multiple selection is also possible. We have also decided to include emojis for the vehicle and weather barplots, just to make them more visual.

**Note:** you will see that the order for each of these plots has been fixed. Initially we wanted to order by count; however, this could be confusing because, depending on the selection in the other two barplots, a selected bar could change position.
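The effect described in the note can be sketched in plain Python. The counts below are made up purely for illustration; the point is only that a count-based order depends on the active filter while a fixed order does not:

```python
# Hypothetical counts per weather condition for two different filter states.
all_vehicles = {"Rainy": 120, "Clear": 90, "Cloudy": 40}
ambulances_only = {"Rainy": 3, "Clear": 7, "Cloudy": 5}

fixed_order = ["Rainy", "Clear", "Cloudy"]


def order_by_count(counts):
    """Return categories sorted from most to fewest collisions."""
    return sorted(counts, key=counts.get, reverse=True)


# Sorting by count moves the bars when the filter changes...
print(order_by_count(all_vehicles))
print(order_by_count(ambulances_only))

# ...whereas a fixed order keeps every bar in the same position.
print([c for c in fixed_order if c in ambulances_only])
```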

In [None]:
month_order = ["June", "July", "August", "September"]
month_selection = alt.selection_point(fields=["MONTH"], empty=True)

vehicle_order = ["Taxi", "Ambulance", "Fire truck"]
vehicle_selection = alt.selection_point(fields=["VEHICLE"], empty=True)

weather_order = ["Rainy", "Clear", "Partly cloudy", "Cloudy"]
weather_selection = alt.selection_point(fields=["WEATHER"], empty=True)

bars_df = collisions.groupby(["MONTH", "VEHICLE", "VEHICLE EMOJI", "WEATHER", "WEATHER EMOJI"]).agg({"VALID": "sum"}).reset_index()
months = (
    alt.Chart(bars_df)
    .mark_bar(color=colors[primary])
    .encode(
        x=alt.X("MONTH:O", sort=month_order, axis=alt.Axis(title="Month", labelAngle=0), scale=alt.Scale(domain=month_order)),
        y=alt.Y("sum(VALID):Q", axis=alt.Axis(title="Collisions")),
        opacity=alt.condition(month_selection, alt.value(1), alt.value(0.2)),
        tooltip=[alt.Tooltip("MONTH:O", title="Month"), alt.Tooltip("sum(VALID):Q", title="Collisions")],
    ).add_params(
        month_selection
    ).transform_filter(
        vehicle_selection & weather_selection
    )
    .properties(
        title=alt.Title(["Collisions per Month", "(filtered by vehicle and weather)"], dy=-0),
        width=250,
        height=175
    )
)

vehicles = (
    alt.Chart(bars_df)
    .mark_bar(color=colors[primary])
    .encode(
        x=alt.X("VEHICLE:N", sort=vehicle_order, axis=alt.Axis(title="Vehicle", labels=False, domain=False, ticks=False, grid=False), scale=alt.Scale(domain=vehicle_order)),
        y=alt.Y("sum(VALID):Q", axis=alt.Axis(title="Collisions")),
        opacity=alt.condition(vehicle_selection, alt.value(1), alt.value(0.2)),
        tooltip=[alt.Tooltip("VEHICLE:N", title="Vehicle"), alt.Tooltip("sum(VALID):Q", title="Collisions")],
        # color=alt.Color(
        #     "VEHICLE:N",
        #     scale=alt.Scale(
        #             range=list(colors[vehicles_colors].values()),
        #             domain=list(colors[vehicles_colors].keys())
        #     ),
        #     legend=None
        # ),
    )
    .add_params(
        vehicle_selection
    ).transform_filter(
        month_selection & weather_selection
    )
    .properties(
        title=alt.Title(["Collisions per Vehicle", "(filtered by month and weather)"], dy=-0),
        width=200,
        height=175
    )
)

vehicles += (
    alt.Chart(bars_df)
    .mark_text(size=18, align="center", dy=-8)
    .encode(
        x=alt.X("VEHICLE:N", sort=vehicle_order, scale=alt.Scale(domain=vehicle_order)),
        y=alt.Y("sum(VALID):Q"),
        text=alt.Text("VEHICLE EMOJI:N"),
        opacity=alt.condition(vehicle_selection, alt.value(1), alt.value(0.2)),
        tooltip=[alt.Tooltip("VEHICLE:N", title="Vehicle"), alt.Tooltip("sum(VALID):Q", title="Collisions")],
        color=alt.Color(legend=None),
    ).add_params(
        vehicle_selection
    ).transform_filter(
        month_selection & weather_selection
    )
)

weather = (
    alt.Chart(bars_df)
    .mark_bar(color=colors[primary])
    .encode(
        x=alt.X("WEATHER:N", sort=weather_order, axis=alt.Axis(title="Weather", labels=False, domain=False, ticks=False, grid=False), scale=alt.Scale(domain=weather_order)),
        y=alt.Y("sum(VALID):Q", axis=alt.Axis(title="Collisions")),
        opacity=alt.condition(weather_selection, alt.value(1), alt.value(0.2)),
        tooltip=[alt.Tooltip("WEATHER:N", title="Weather"), alt.Tooltip("sum(VALID):Q", title="Collisions")],
    ).add_params(
        weather_selection
    ).transform_filter(
        month_selection & vehicle_selection
    ).properties(
        title=alt.Title(["Collisions per Weather", "(filtered by month and vehicle)"], dy=-0),
        width=200,
        height=175
    )
)

weather += (
    alt.Chart(bars_df)
    .mark_text(size=18, align="center", dy=-8)
    .encode(
        x=alt.X("WEATHER:N", sort=weather_order, scale=alt.Scale(domain=weather_order)),
        y=alt.Y("sum(VALID):Q"),
        text=alt.Text("WEATHER EMOJI:N"),
        opacity=alt.condition(weather_selection, alt.value(1), alt.value(0.2)),
        tooltip=[alt.Tooltip("WEATHER:N", title="Weather"), alt.Tooltip("sum(VALID):Q", title="Collisions")],
    ).add_params(
        weather_selection
    ).transform_filter(
        month_selection & vehicle_selection
    )
)

In [None]:
(months & (vehicles | weather))

We will now add an interactive map of NYC. We have decided to use borough granularity because we do not have many collisions, and a finer granularity would make the map appear quite empty. We plot $\frac{collisions}{km^2}$, as normalising by area makes boroughs of different sizes comparable. This chart does not directly answer any of the questions asked; however, it will be used to select the area later on. We color unselected locations gray, because lowering the opacity instead would be confusing: areas with high $\frac{collisions}{km^2}$ are much darker than others. The color scale is logarithmic, as we thought it would show the most information (without it, only Manhattan was dark).

This chart will be filtered by the features selected in the barplots. You will also see a separate version for the Streamlit app, as our initial (and lighter on Altair) version did not work there.
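To illustrate why a log color scale helps with the density values, here is a small plain-Python sketch. Both the areas (rough, rounded figures) and the collision counts are assumptions chosen for illustration, not values from our dataset, and `position` is a hypothetical helper that mimics how a value is placed on a continuous scale:

```python
import math

# Rough areas in km2 and made-up collision counts (illustrative assumptions).
area_km2 = {"Manhattan": 59, "Brooklyn": 180, "Queens": 280, "Bronx": 110, "Staten Island": 150}
toy_collisions = {"Manhattan": 5900, "Brooklyn": 1800, "Queens": 1400, "Bronx": 550, "Staten Island": 75}

density = {b: toy_collisions[b] / area_km2[b] for b in area_km2}


def position(value, low, high, log=False):
    """Map a value to [0, 1] on a linear or log color scale."""
    if log:
        value, low, high = math.log(value), math.log(low), math.log(high)
    return (value - low) / (high - low)


low, high = min(density.values()), max(density.values())
for borough, d in density.items():
    linear = position(d, low, high)
    logged = position(d, low, high, log=True)
    print(f"{borough:14s} linear={linear:.2f} log={logged:.2f}")
```

With these numbers, every borough except the densest one sits near the bottom of the linear scale, while the log scale spreads them across the range.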

In [None]:
map_data.head()

In [None]:
ny_map_selection = alt.selection_point(fields=["BOROUGH"], empty=True)

collisions_borough=collisions.groupby(["MONTH","VEHICLE","WEATHER","BOROUGH"]).agg({"VALID": "sum"}).reset_index()
map_data = map_data[["BOROUGH", "AREA_KM2", "geometry"]]

ny_map = (
    alt.Chart(collisions_borough)
    .mark_geoshape(stroke="gray")
    .project(type="albersUsa")
    .transform_lookup(
        lookup = "BOROUGH",                 
        from_ = alt.LookupData(data=map_data, key="BOROUGH", fields=["geometry", "type", "AREA_KM2"]), 
    )
    .transform_filter(month_selection & weather_selection & vehicle_selection)
    .transform_aggregate(sumCollisions = "sum(VALID)", groupby = ["BOROUGH", "AREA_KM2", "geometry", "type"])
    .transform_calculate(COLLISIONS_KM2 = "datum.sumCollisions / datum.AREA_KM2")
    .encode(
        color=alt.condition(ny_map_selection, alt.Color("COLLISIONS_KM2:Q", scale=alt.Scale(scheme=colors[schema], type="log"), legend=alt.Legend(title=["Collisions per km2", "(log scale)"])), alt.value("lightgray")),
        tooltip=[alt.Tooltip("BOROUGH:N", title="Borough"), alt.Tooltip("COLLISIONS_KM2:Q", title="Collisions per km2"), alt.Tooltip("sumCollisions:Q", title="Collisions")],
    ).properties(
        width=300,
        height=300,
        title=["NYC Boroughs", "(filtered by barplots)"]
    ).add_params(
        ny_map_selection
    )
)

collisions_borough_st = collisions_borough.merge(map_data, on="BOROUGH", how="left")
collisions_borough_st = collisions_borough_st[["MONTH", "VEHICLE", "WEATHER", "BOROUGH", "AREA_KM2", "VALID"]]

map_data_st = alt.Data(
    url="https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson",
    format=alt.DataFormat(property="features"),
)

ny_map_st = (
    alt.Chart(collisions_borough_st)
    .mark_geoshape(stroke="gray")
    .project(type="albersUsa")
    .transform_lookup(
        lookup = "BOROUGH",                 
        from_ = alt.LookupData(data=map_data_st, key="properties.boro_name", fields=["geometry", "type"]), 
    )
    .transform_filter(month_selection & weather_selection & vehicle_selection)
    .transform_aggregate(sumCollisions = "sum(VALID)", groupby = ["BOROUGH", "AREA_KM2", "geometry", "type"])
    .transform_calculate(COLLISIONS_KM2 = "datum.sumCollisions / datum.AREA_KM2")
    .encode(
        color=alt.condition(ny_map_selection, alt.Color("COLLISIONS_KM2:Q", scale=alt.Scale(scheme=colors[schema], type="log"), legend=alt.Legend(title=["Collisions per km2", "(log scale)"])), alt.value("lightgray")),
        tooltip=[alt.Tooltip("BOROUGH:N", title="Borough"), alt.Tooltip("COLLISIONS_KM2:Q", title="Collisions per km2"), alt.Tooltip("sumCollisions:Q", title="Collisions")],
    ).properties(
        width=300,
        height=300,
        title=["NYC Boroughs", "(filtered by barplots)"]
    ).add_params(
        ny_map_selection
    )
)


# Fixes boroughs not appearing when df not full
base_map = (
    alt.Chart(map_data_st)
    .mark_geoshape(stroke="gray")
    .transform_calculate(
        collisions="0",
    )
    .encode(
        color=alt.condition(ny_map_selection, alt.value("white"), alt.value("lightgray")),
        tooltip=[alt.Tooltip("properties.boro_name:N", title="Borough"), alt.Tooltip("collisions:Q", title="Collisions per km2"), alt.Tooltip("collisions:Q", title="Collisions")]
    )
    .add_params(
        ny_map_selection
    )
)

ny_map = base_map + ny_map
ny_map_st = base_map + ny_map_st


In [None]:
((months | weather | vehicles) & (ny_map_st)).resolve_legend(color="independent")

We now build a calendar-like heatmap, grouping data by day of week and week of year. This heatmap also serves to select a day of the week (click on any of the columns to observe the effect). By default it is set to Mondays to make answering question 3 easier, which can otherwise be tedious with our style of visualisation. The most interesting feature of this chart is the `*` that appears on the days with the maximum number of collisions (more than one day in case of a tie) for the selected features. This was pretty tricky to do; however, we are quite happy with the result.

The `*` detail has been put in place to instantly answer question 4 (once the correct features have been selected). We thought that otherwise, with close values, distinction would be hard.

We have also included grid lines when selecting a day of the week, so that if features with almost no values are selected, you are still able to see which day you have chosen.

This chart will be filtered by the features selected in the barplots and the locations selected in the map.
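The logic behind the `*` marker can be sketched in plain Python. This is only an illustration of the filter, rank, filter idea on invented data, not the Altair implementation; `top_ranked` is a hypothetical helper:

```python
# Hypothetical (day, collisions) pairs after filtering and aggregation.
daily = [("2018-07-02", 4), ("2018-07-03", 0), ("2018-07-09", 7), ("2018-07-19", 7)]


def top_ranked(rows):
    """Mimic: drop zero counts, rank descending, keep rank 1."""
    non_zero = [(day, n) for day, n in rows if n != 0]
    if not non_zero:
        return []  # nothing to mark when the selection has no collisions
    best = max(n for _, n in non_zero)
    # Tied rows share rank 1, so every day matching the maximum is kept.
    return [day for day, n in non_zero if n == best]


print(top_ranked(daily))
```

Dropping the zero counts first matters: if every day had zero collisions, all of them would tie for first place and every cell would get a marker.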

In [None]:
weekdayorder = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Default Mon to make it "quicker" to answer Q3
day_selection = alt.selection_point(fields=["CRASH WEEKDAY"], value="Mon")

weekdays_df = collisions.groupby(["CRASH DAY", "CRASH WEEKDAY", "CRASH WEEK NUMBER", "MONTH", "VEHICLE", "WEATHER", "DAY", "BOROUGH"]).agg({"VALID": "sum"}).reset_index()

# Base chart
weekdays = (
    alt.Chart(weekdays_df)
    .mark_rect()
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & ny_map_selection
    )
    .encode(
        x=alt.X("CRASH WEEKDAY:O", title="Day of week", sort=weekdayorder, axis=alt.Axis(labelAngle=0)),
        y=alt.Y("CRASH WEEK NUMBER:O", title="Week of year"),
        color=alt.Color("sumValid:Q", scale=alt.Scale(scheme=colors[schema]), title="Collisions"),
        opacity=alt.condition(day_selection, alt.value(1), alt.value(0.2)),
        tooltip=[alt.Tooltip("CRASH DAY:O", title="Crash day"), alt.Tooltip("sumValid:Q", title="Collisions")],
    )
    .transform_aggregate(
        groupby=["CRASH DAY", "CRASH WEEKDAY", "CRASH WEEK NUMBER"],
        sumValid="sum(VALID):Q",
    )
    .add_params(
        day_selection
    ).properties(
        title=["Collisions per Week and Weekday", "(filtered by barplots and map)"],
        width=300,
        height=300,
    )
)

# Adding the asterisk to max value
weekdays += (
    alt.Chart(weekdays_df)
    .mark_text(align="center", text="*", color="white", dy=3, size=15)
    .encode(
        x=alt.X("CRASH WEEKDAY:O", sort=weekdayorder),
        y=alt.Y("CRASH WEEK NUMBER:O"),
        tooltip=[alt.Tooltip("LABEL:N", title=" ")],
    )
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & ny_map_selection
    )
    .transform_aggregate(
        groupby=["CRASH DAY", "CRASH WEEKDAY", "CRASH WEEK NUMBER"],
        sumValid="sum(VALID):Q",
    )
    # ignore if no collisions, prevents labelling every
    # data point when filtering
    .transform_filter(
        alt.datum.sumValid != 0
    )
    .transform_window(
        sort=[alt.SortField(field="sumValid", order="descending")],
        rank="rank()",
    )
    # only label the top ranked data point(s)
    .transform_filter(alt.datum.rank == 1)
    .transform_calculate(
        LABEL="'Max value'"
    )
)

# Day of month, not convincing
# weekdays += (
#     alt.Chart(collisions)
#     .mark_text(align="left", size=10)
#     .transform_filter(month_selection)
#     .transform_aggregate(groupby = ["DAY", "MONTH", "CRASH WEEKDAY", "CRASH WEEK NUMBER"])
#     .encode(
#         x=alt.X("CRASH WEEKDAY:O", title=None, sort=weekdayorder),
#         y=alt.Y("CRASH WEEK NUMBER:O", title=None),
#         text=alt.Text("DAY:N") 
#     )
# )

weekdays_empty = (
    alt.Chart(weekdays_df)
    .transform_filter(
        month_selection
    )
    .mark_rect(color="white", stroke="grey", strokeWidth=0.5)
    .transform_calculate(
        collisions="0",
    )
    .encode(
        x=alt.X("CRASH WEEKDAY:O", title="Day of week", sort=weekdayorder, axis=alt.Axis(labelAngle=0)),
        y=alt.Y("CRASH WEEK NUMBER:O", title="Week of year"),
        tooltip=[alt.Tooltip("CRASH DAY:O", title="Day"), alt.Tooltip("collisions:Q", title="Collisions")],
        # Grid opacity
        opacity=alt.condition(day_selection, alt.value(0.05), alt.value(0)),
    )
)

weekdays = weekdays_empty + weekdays

In [None]:
((months | weather | vehicles) & (ny_map | weekdays).resolve_legend(color="independent").resolve_scale(color="independent"))

We will now make the final chart for the questions required for the delivery. This has been the most complex chart of all, as we decided to include a lot of features to make it quicker and easier to use. It essentially plots the collisions per borough and per hour, with the data filtered by the barplots and the heatmap, and with the boroughs selected in the map highlighted. Here is a quick list of additional features:

* <u>Imputing missing values</u>: workaround for filling data, using `transform_impute`.

* <u>Mark for max value</u>: the max value (the borough and hour with the highest number of collisions across all hours) is marked with an arrow. This makes it very easy to answer question 2 without requiring any interaction with, or much thought about, the line chart.

* <u>Hour rule</u>: rule to highlight a given hour. This has been put in place to facilitate question 3 (which is why it is set at noon by default).

* <u>Max value in selected hour</u>: a circle on the maximum value within the selected hour, again to speed up answering question 3.

* <u>Tooltip listing all boroughs that share a value at a given hour</u>: this is our favourite, as well as the hardest to pull off. We noticed that for question 3, two areas could potentially be tied for maximum collisions. Despite having the circle on the maximum within the selected hour, this was not enough, as it would only show one color. This is why we came up with this solution: upon hovering over this highlighted point (or any other, in fact), a list of the boroughs that have that value at that time appears. Telling apart the different colors that went through the point was hard! At least for us!
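The imputation step from the list above can be illustrated with a plain-Python sketch. The data and the `impute_hours` helper are hypothetical; this shows the effect of filling missing hours with zero per borough, not how `transform_impute` is implemented:

```python
# Hypothetical aggregated counts: only hours with collisions are present.
observed = {
    "Manhattan": {0: 5, 8: 2, 12: 3},
    "Bronx": {12: 1},
}


def impute_hours(counts_by_borough, hours=range(24), fill=0):
    """Give every borough a value for every hour, using `fill` for gaps."""
    return {
        borough: [counts.get(hour, fill) for hour in hours]
        for borough, counts in counts_by_borough.items()
    }


filled = impute_hours(observed)
print(filled["Bronx"])
```

Without this step, a line would be drawn straight between the hours that happen to have data, which would suggest collisions in hours where there were none.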

We believe we have struck a balance between redundancy/cluttering and speed/ease of use. We have also made the chart bigger to compensate for all the information shown at once. Also note that we initially wanted the interaction with this chart (the hour rule) to be a hover effect; however, this was too slow for the amount of data the line chart uses, and it made it harder to choose certain points.
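The idea behind the tooltip that lists tied boroughs can also be sketched in plain Python: boroughs are grouped by the point they pass through, that is, by the same hour and the same number of collisions. The rows and the `boroughs_by_point` helper are hypothetical and only illustrate the grouping:

```python
from collections import defaultdict

# Hypothetical (borough, hour, collisions) rows after aggregation.
rows = [
    ("Manhattan", 12, 3),
    ("Brooklyn", 12, 3),
    ("Queens", 12, 1),
    ("Bronx", 13, 2),
]


def boroughs_by_point(rows):
    """Group boroughs that share the same (hour, collisions) point."""
    grouped = defaultdict(list)
    for borough, hour, count in rows:
        grouped[(hour, count)].append(borough)
    return dict(grouped)


points = boroughs_by_point(rows)
print(points[(12, 3)])
```

In the chart, each such group becomes one invisible point whose tooltip has up to five borough slots, one per borough.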

In [None]:
hours_df = collisions.groupby(["CRASH DAY", "CRASH WEEKDAY", "MONTH", "VEHICLE", "WEATHER", "BOROUGH", "DAY", "HOUR", "CRASH HOUR", "LOCATION AT HOUR"]).agg({"VALID": "sum"}).reset_index()

hour_selection = alt.selection_point(encodings=["x"], nearest=True, value=12, empty=True)

# Base chart
hours = (
    alt.Chart(hours_df)
    .mark_line()
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & day_selection
    )
    .transform_aggregate(
        groupby=["HOUR", "CRASH HOUR", "LOCATION AT HOUR", "BOROUGH"],
        sumValid="sum(VALID):Q",
    )
    # Fill in missing values
    .transform_impute(
        impute="sumValid",
        key="HOUR",
        keyvals=list(range(24)),
        value=0,
        frame=[-1, 1],
        groupby=["BOROUGH"]
    )
    .encode(
        x=alt.X("HOUR:Q", axis=alt.Axis(title="Hour", labelAngle=0), scale=alt.Scale(domain=[0, 23])),
        y=alt.Y(
            "sumValid:Q",
            axis=alt.Axis(title="Collisions"),
        ),
        opacity=alt.condition(ny_map_selection, alt.value(1), alt.value(0.2)),
        color=alt.Color(
            "BOROUGH:N",
            legend=None, # alt.Legend(title="Borough"),
            scale=alt.Scale(
                range=list(colors[boroughs_colors].values()),
                domain=list(colors[boroughs_colors].keys())
            )
        ),
        # Fixes weird bug in streamlit
        tooltip=alt.value(None),
    ).properties(
        title=["Collisions per Hour and Location (filtered by barplots and heatmap)"],
        width=700,
        height=300
    )
    # Too laggy
    # .interactive()
)

max_values = (
    alt.Chart(hours_df)
    .mark_circle(opacity=0, size=50)
    .transform_filter(
        month_selection & weather_selection & vehicle_selection &
        day_selection
    )
    .encode(
        x=alt.X("HOUR:Q", axis=alt.Axis(labelAngle=0), scale=alt.Scale(domain=[0, 23])),
        y=alt.Y("sumValid:Q"),
        color=alt.Color("BOROUGH:N", scale=alt.Scale(range=list(colors[boroughs_colors].values()), domain=list(colors[boroughs_colors].keys()))),
        opacity=alt.condition(ny_map_selection, alt.value(1), alt.value(0.2)),
        # Fixes weird bug in streamlit
        tooltip=alt.value(None),
    )
    .transform_aggregate(
        groupby=["BOROUGH", "HOUR", "LOCATION AT HOUR"],
        sumValid="sum(VALID):Q",
    )
    # ignore if no collisions, prevents labelling every
    # data point when filtering
    .transform_filter(
        alt.datum.sumValid != 0
    )
    .transform_window(
        sort=[alt.SortField(field="sumValid", order="descending")],
        rank="rank()",
    )
    # only label the top ranked data point(s)
    .transform_filter(alt.datum.rank == 1)
)

# Label for max value
max_values += (
    alt.Chart(hours_df)
    .mark_text(fontSize=20, clip=False, angle=(180-45), text="→", dy=5, dx=-15)
    .transform_filter(
        month_selection & weather_selection & vehicle_selection &
        day_selection
    )
    .encode(
        x=alt.X("HOUR:Q", scale=alt.Scale(domain=[0, 23])),
        y=alt.Y("sumValid:Q"),
        color=alt.Color("BOROUGH:N", legend=None, scale=alt.Scale(range=list(colors[boroughs_colors].values()), domain=list(colors[boroughs_colors].keys()))),
        opacity=alt.condition(ny_map_selection, alt.value(1), alt.value(0.2)),
        # Tooltip shown when hovering over the arrow
        tooltip=[alt.Tooltip("LABEL:N", title=" ")],
    )
    .transform_aggregate(
        groupby=["BOROUGH", "HOUR", "LOCATION AT HOUR"],
        sumValid="sum(VALID):Q",
    )
    # ignore if no collisions, prevents labelling every
    # data point when filtering
    .transform_filter(
        alt.datum.sumValid != 0
    )
    .transform_window(
        sort=[alt.SortField(field="sumValid", order="descending")],
        rank="rank()",
    )
    .transform_filter(alt.datum.rank == 1)
    .transform_calculate(
        LABEL="'Max value'"
    )
)

# Rule to easily mark all values in the same hour
hour_rule = (
    alt.Chart(hours_df)
    .mark_rule(color="gray", strokeDash=[10, 10])
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & day_selection & hour_selection
    )
    .encode(
        x=alt.X("HOUR:Q", scale=alt.Scale(domain=[0, 23])),
    )
)

# Void chart, without it, the interaction doesn't work well
hour_rule += (
    alt.Chart(hours_df)
    .mark_text(opacity=0)
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & day_selection
    )
    .encode(
        x=alt.X("HOUR:Q", axis=alt.Axis(labelAngle=0), scale=alt.Scale(domain=[0, 23])),
        y=alt.Y("sum(VALID):Q"),
        color=alt.Color("BOROUGH:N", scale=alt.Scale(range=list(colors[boroughs_colors].values()), domain=list(colors[boroughs_colors].keys()))),
        # Fixes weird bug in streamlit
        tooltip=alt.value(None),
    )
    .add_params(
        hour_selection
    )
)

# Circle mark on max value in selected hour
hour_rule += (
    alt.Chart(hours_df)
    .mark_circle(size=50)
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & day_selection & hour_selection
    )
    .encode(
        x=alt.X("HOUR:Q", scale=alt.Scale(domain=[0, 23])),
        y=alt.Y("sumValid:Q"),
        color=alt.Color("BOROUGH:N", scale=alt.Scale(range=list(colors[boroughs_colors].values()), domain=list(colors[boroughs_colors].keys()))),
        opacity=alt.condition(ny_map_selection, alt.value(1), alt.value(0.2)),
        # Fixes weird bug in streamlit
        tooltip=alt.value(None),
    )
    .transform_aggregate(
        groupby=["BOROUGH", "HOUR"],
        sumValid="sum(VALID):Q",
    )
    .transform_filter(
        alt.datum.sumValid != 0
    )
    .transform_window(
        sort=[alt.SortField(field="sumValid", order="descending")],
        rank="rank()",
    )
    # only label the top ranked data point
    .transform_filter(alt.datum.rank == 1)
)

# Adds tooltip to each point in data, note that if more than
# one borough has the same value, it will show them all
tooltip = (
    alt.Chart(hours_df)
    .mark_circle(opacity=0, size=50)
    .transform_filter(
        month_selection & weather_selection & vehicle_selection & day_selection
    )
    .encode(
        x=alt.X("HOUR:Q", axis=alt.Axis(labelAngle=0), scale=alt.Scale(domain=[0, 23])),
        y=alt.Y("sumValid:Q"),
        opacity=alt.value(0),
        tooltip=[
            alt.Tooltip("first_bo:N", title="Boroughs"),
            alt.Tooltip("sec_bo:N", title=" "),
            alt.Tooltip("thi_bo:N", title=" "),
            alt.Tooltip("fo_bo:N", title=" "),
            alt.Tooltip("fi_bo:N", title=" "),
            alt.Tooltip("CRASH HOUR:N", title="Hour"),
            alt.Tooltip("COL:Q", title="Collisions"),
        ]
    )
    .transform_aggregate(
        groupby=["HOUR", "CRASH HOUR", "BOROUGH"],
        sumValid="sum(VALID):Q",
    )
    .transform_aggregate(
        groupby=["HOUR", "CRASH HOUR", "sumValid"],
        VALUES="values(BOROUGH):N",
        COUNT="count():Q",
        COL="max(sumValid)"
    )
    .transform_calculate(
        first_bo="datum.VALUES[0].BOROUGH",
        sec_bo= "datum.COUNT > 1 ? datum.VALUES[1].BOROUGH : ''",
        thi_bo= "datum.COUNT > 2 ? datum.VALUES[2].BOROUGH : ''",
        fo_bo= "datum.COUNT > 3 ? datum.VALUES[3].BOROUGH : ''",
        fi_bo= "datum.COUNT > 4 ? datum.VALUES[4].BOROUGH : ''",
    )
)

hours = hours + max_values + hour_rule + tooltip

In [None]:
(
    (months | weather | vehicles) &
    (ny_map | weekdays).resolve_legend(color="independent").resolve_scale(color="independent") &
    (hours)
).resolve_legend(color="independent")

### Extra

In [None]:
factor_df = collisions.groupby(["CRASH DAY", "CRASH WEEKDAY", "MONTH", "VEHICLE", "WEATHER", "BOROUGH", "ORIGINAL FACTOR", "FACTOR" ]).agg({"VALID": "sum", "NUMBER OF PERSONS INJURED": "sum",
    "NUMBER OF PERSONS KILLED": "sum"}).reset_index()

factor_df["NUMBER OF PERSONS KILLED"].sum()

In [None]:
factor_df = factor_df[factor_df["FACTOR"] == "Driving Infraction"]

factor_df["NUMBER OF PERSONS KILLED"].sum()

In [None]:
factor_df = collisions.groupby(["CRASH DAY", "CRASH WEEKDAY", "MONTH", "VEHICLE", "WEATHER", "BOROUGH", "ORIGINAL FACTOR", "FACTOR" ]).agg({"VALID": "sum", "NUMBER OF PERSONS INJURED": "sum",
    "NUMBER OF PERSONS KILLED": "sum"}).reset_index()

factor_selection = alt.selection_point(fields=["ORIGINAL FACTOR"], empty=True)

factors = (
    alt.Chart(factor_df)
    .mark_circle(color=colors[primary], size=125, opacity=1)
    .transform_filter(
        month_selection & weather_selection & vehicle_selection 
    )
    .transform_aggregate(
        sumValid="sum(VALID):Q",
        sumInjured="sum(NUMBER OF PERSONS INJURED):Q",
        groupby=["ORIGINAL FACTOR", "BOROUGH"],
    )
    .transform_calculate(
        INJURED_PER_COLLISION = "datum['sumInjured'] / datum['sumValid']"
    )
    .encode(
        x=alt.X("INJURED_PER_COLLISION:Q", axis=alt.Axis(title="Average injuries per collision", tickCount=10)),
        y=alt.Y("sumValid:Q", axis=alt.Axis(title="Collisions")),
        color=alt.condition(
            ny_map_selection & factor_selection,
            alt.Color(
                "BOROUGH:N",
                legend=alt.Legend(title="Borough"),
                scale=alt.Scale(
                    range=list(colors[boroughs_colors].values()),
                    domain=list(colors[boroughs_colors].keys())
                )
            ),
            alt.value("lightgray")
        ),
        tooltip=[
            alt.Tooltip("ORIGINAL FACTOR:N", title="Factor"),
            alt.Tooltip("sumValid:Q", title="Collisions"),
            alt.Tooltip("INJURED_PER_COLLISION:Q", title="Average injuries per collision")
        ],
    )
    .properties(
        title=["Driving infractions and their danger", "(filtered by barplots)"],
        width=700,
        height=300
    )
    .add_params(
        factor_selection
    )
    # Too laggy
    # .interactive()
)

In [None]:
(
    (months.properties(width=355) | weather.properties(width=315) | vehicles.properties(width=315)) &
    ((ny_map.properties(width=400, height=350) | factors.properties(width=550, height=300)) &
    (weekdays | hours.properties(width=700)))
)

## User Manual for the New York Collisions Visualization Dashboard

Welcome to the New York Collisions Visualization Dashboard! This dashboard provides valuable information about traffic accidents that occurred in New York City during the summer of 2018. Below is a step-by-step guide to help you make the most of the available features and graphs. 

### 1. Select Month, Vehicle Type, and Weather Condition
On the first line of the dashboard, use the three bar charts to customize your preferences:

- **Month**: Select the month of interest (June, July, August, September).
- **Vehicle Type**: Filter by vehicle type (Taxi, Ambulance, Fire truck).
- **Weather Condition**: Choose among weather conditions (Rainy, Clear, Cloudy, Partly cloudy).

Each barplot dynamically filters the others. When you select an option in a barplot, the other options reduce their opacity to highlight the selected one.

### 2. Collision Map by Borough and Infractions Scatterplot
On the second row of the dashboard, two charts are presented:

- **Collision Map by Borough**: Observe collision density per square kilometer in different boroughs. 
    - This chart is filtered by the previous three barplots, and you can select a borough to filter the following charts. While selecting a borough, the others are displayed in light gray.
    - Note: The color legend is in logarithmic scale for better differentiation. Hover over a borough to see a tooltip with collisions per kilometer squared.

- **Infractions Scatterplot**: Analyze the relationship between the type of infraction and its risk, categorized by borough.
    - This chart is filtered by the previous three barplots. If a borough is selected, the other points will be in light gray for easy comparison of the same infraction across different boroughs.
    - This chart doesn't modify the other ones, but clicking on a point highlights others with the same infraction.
    - Note: Due to some driving infractions without injuries there may be challenges in visibility and potential overlap.

### 3. Collision Calendar and Hourly Line Chart
On the third and last row of the dashboard, two additional charts provide further insights:

- **Collision Calendar (Heatmap)**: Customize your view by selecting weekdays and visualizing collision counts using a heatmap.
    - This chart is filtered with the three barplots and the map. Choose a specific weekday to filter the subsequent chart, while the other weekdays are displayed with reduced opacity.
    - Colors indicate the number of collisions each day, and a star highlights the maximum value within the current selections.

- **Hourly Line Chart**: Explore the collision trend per hour of the day in each borough.
    - This chart is filtered by the three barplots and the heatmap. If a borough is selected, the other lines will be displayed with reduced opacity.
    - This chart doesn't impact the other ones, but you can interact by clicking on an hour to reveal a marker indicating the peak value of collisions during that hour.
    - An arrow and point highlight the maximum collision value within your selections.

Feel free to explore these charts to enhance your understanding of collision patterns. However, the primary purpose of this dashboard is to answer the following questions:

**1. Which weather condition and type of vehicle were present in the majority of accidents each month? And in the combination of all the months?**

Upon reviewing the dashboard, discernible patterns emerge in the prevalent characteristics of accidents each month. Notably, taxi involvement and rainy weather conditions consistently dominate the majority of accidents across individual months. This observed trend persists when aggregating data for the entire summer period.

**2. In which area and at what hour did the majority of accidents each month happen? And in the combination of all the months?**

The geographical and temporal analysis of monthly accidents reveals a consistent pattern throughout the summer of 2018, with Manhattan emerging as the predominant borough for collisions. Specifically, the maximum number of accidents occurred at 00:00 (this can be seen with the arrow in the line chart). This prompts consideration that the reported time might be an arbitrary placeholder, as accidents occurring at an unknown time could have been inputted as 00:00. Alternatively, it raises the possibility of a predominant taxi presence in Manhattan at midnight, although this seems less likely based on available information.

However, a closer examination of individual months (by selecting in the months barplot) provides additional insights. In June, peak collisions were observed at 19:00, while in August and September, the highest frequency was recorded at 14:00. 

**3. Which area presented the majority of taxi accidents during rainy days in June on Mondays at noon (12 pm)?**

Upon an analysis of the line chart and appropriate filtering (using the barplots and the calendar to filter), it is evident that Manhattan and Brooklyn experienced the majority of accidents, with a total of 3 collisions during these specific conditions.

**4. Which day had more accidents during clear days in July in Manhattan?**

By looking at the heatmap after selecting July in the barplot and Manhattan in the map, we can easily observe a star showing the maximum, this star indicates that the peak occurred on July 19, 2018, with a total of 24 collisions.

We also created the scatterplot to answer extra questions like:

**5. Which infractions appear most frequently across the dataset, and do they correspond to higher or lower levels of risk?**

In every borough except Staten Island (an exception, with only 9 collisions), the most prevalent infraction is "Following too closely". The rate of injuries per accident varies, ranging from 0.202 in Manhattan to 0.414 in Brooklyn. It is noteworthy that while "Unsafe Speed" is logically the most dangerous infraction, the actual number of collisions attributed to it is relatively low.



The dashboard was intentionally designed to be highly exploratory, inviting users to inquire about a multitude of aspects, so other questions can be answered as well, in particular, as mentioned at the start, any variant of the original questions. Using the scatterplot, one can also highlight all infractions of the same type and see how that particular infraction behaves across boroughs.