In [1]:
import pandas as pd
import altair as alt
from altair.expr import datum
alt.data_transformers.disable_max_rows()

garbage = pd.read_csv("../Datasets/garbage.csv") # Change path!

selection = alt.selection_multi(fields=['Borough']) # A different kind of selection!

borough_color = alt.condition(selection,
                      alt.Color('Borough:N', legend=None),
                      alt.value('lightgray'))
borough_opacity = alt.condition(selection, alt.value(1.0), alt.value(0))


# Create line chart of garbage collected over time, one line color per borough
scatterplot = alt.Chart(garbage).mark_line().encode(
    x=alt.X("Month:T", scale=alt.Scale(zero=False)),
    y=alt.Y("Garbage Collected (tn):Q", scale=alt.Scale(zero=False)),
    color=borough_color,
    opacity=borough_opacity,

).properties(title="Garbage Collected (in tons) Across NYC from 1990-2022", width=1500, height=300)


# Create corresponding clickable legend for boroughs
legend = alt.Chart(garbage).mark_rect().encode(
    y=alt.Y("Borough:N", axis=alt.Axis(orient="right")),
    color=borough_color
).add_selection(selection) # We now add it to the legend instead, since that is what the viewer interacts with

scatterplot | legend

My dataset describes NYC garbage collection across the five boroughs from 1990 to 2022. It does not have many variables to inspect, so I put time on the x-axis and garbage collected (in US tons) on the y-axis, with borough as the hue. For all of the boroughs across all years, the summary statistics are:

|              Statistic | Garbage Collected (tn) |
|-----------------------:|-----------------------:|
|                  count |         21992.000000 |
|                   mean |          3762.064610 |
|                    std |          1474.827596 |
|                    min |             8.400000 |
|                    25% |          2630.025000 |
|                    50% |          3587.000000 |
|                    75% |          4779.725000 |
|                    max |          9757.000000 |
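Summary statistics like these can be computed with pandas' `describe()`. The sketch below is only an illustration: it uses a tiny made-up frame in place of `garbage.csv`, so its numbers are not the real ones. Only the column name `Garbage Collected (tn)` is taken from the chart code at the top of the post.

```python
import pandas as pd

# Made-up stand-in for garbage.csv; the values are illustrative only.
sample = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Manhattan", "Manhattan"],
    "Garbage Collected (tn)": [2000.0, 3000.0, 4000.0, 5000.0],
})

# describe() reports count, mean, std, min, the quartiles, and max.
summary = sample["Garbage Collected (tn)"].describe()
print(summary)
```

With the real data, `garbage["Garbage Collected (tn)"].describe()` would be the equivalent call.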

Interestingly, the graph shows similar pickup trends for each borough over the 30-year period: the range of the data differs from year to year, but the overall trend is largely consistent. Most boroughs saw some increase around 2003-2004 and a definite increase as the nation entered the pandemic in 2020 (with more trash pickup as people stayed in their homes and didn't go out to eat). It's important to note that the chart includes every district in a borough, which is why the marks are so long: each one is really many data points at the same point in time. This combination helps us look at each borough’s range of trash pickup at a given moment, and from it we can notice how widely Manhattan's districts vary in their pickup levels. Hopefully, the viewer notices this and sees some of the cool patterns that emerge. For example, look at Staten Island’s graph: toward the end of the calendar year, less trash is picked up. This seems noteworthy, as we might expect the opposite once the holiday season kicks in. There are many possible explanations for this, but the visualization is key to detecting patterns like it in the first place.
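The two observations above (the spread across districts within a borough, and the end-of-year dip) can also be checked numerically. Here is a minimal sketch, assuming the data has one row per district per month with the `Borough`, `Month`, and `Garbage Collected (tn)` columns used in the chart code. The rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows: two districts per borough per month (illustrative values only).
sample = pd.DataFrame({
    "Borough": ["Manhattan"] * 4 + ["Staten Island"] * 4,
    "Month": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-12-01", "2020-12-01"] * 2
    ),
    "Garbage Collected (tn)": [1000.0, 5000.0, 1200.0, 5400.0,
                               3000.0, 3400.0, 2500.0, 2700.0],
})

# Spread across districts for each borough and month:
# this is what the long vertical marks in the chart represent.
by_month = (
    sample.groupby(["Borough", "Month"])["Garbage Collected (tn)"]
    .agg(["min", "max"])
)
by_month["range"] = by_month["max"] - by_month["min"]
print(by_month)

# Average pickup by calendar month, to look for seasonal dips.
seasonal = (
    sample.assign(calendar_month=sample["Month"].dt.month)
    .groupby(["Borough", "calendar_month"])["Garbage Collected (tn)"]
    .mean()
)
print(seasonal)
```

A larger `range` value for a borough means its districts differ more from one another in that month.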

For my explanatory visualization, I wanted to include a graph comparing a Manhattan district with a low poverty rate to one with a higher poverty rate, to see whether there was a relationship between that metric and trash pickup. Ultimately, this proved too difficult given the time constraints, as I would have had to research the poverty rate of each of the 59 districts and add it to the dataset in a new column. To make this even trickier, New York State groups the districts differently than the NYC Sanitation department does, causing further headaches. This experience demonstrates the inherent challenges of combining datasets and how difficult it can be to fill in missing data. If I had more time or a better dataset, I would expect to see that the areas with higher poverty rates (unfortunately) receive less trash pickup service, which can lead to a lot of terrible problems for the residents and relates to a crucial problem at the intersection of race, socio-economic status, and living conditions.
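For anyone attempting this kind of join, here is a rough sketch of how the mismatch between two district groupings could be surfaced with `pandas.merge`. Everything in it is hypothetical: the `District` and `Poverty Rate (%)` column names, the district codes, and all of the values are invented for illustration, since I never built the real poverty table.

```python
import pandas as pd

# Hypothetical sanitation-side table (made-up district codes and values).
garbage_sample = pd.DataFrame({
    "District": ["MN01", "MN02", "MN03"],
    "Garbage Collected (tn)": [1500.0, 2200.0, 3100.0],
})

# Hypothetical poverty table that uses a partly different set of districts.
poverty_sample = pd.DataFrame({
    "District": ["MN01", "MN03", "MN99"],
    "Poverty Rate (%)": [8.0, 21.5, 15.0],
})

# A left join keeps every sanitation district; indicator=True adds a
# "_merge" column recording whether a matching poverty row was found.
merged = garbage_sample.merge(
    poverty_sample, on="District", how="left", indicator=True
)

# Districts with no poverty data would need manual research or remapping.
unmatched = merged.loc[merged["_merge"] == "left_only", "District"].tolist()
print(unmatched)
```

Checking the unmatched list first gives a quick estimate of how much manual work the join would take before committing to it.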

Regardless of the challenges with my explanatory visualization, I did enjoy making my exploratory visualization, especially the interactive component. I really wanted that graph to be high-quality, so I spent a while working on the opacity of the lines that are not selected, making them fade out rather than show as a distracting gray at full opacity. For the long-term project, I want to spend more time finding a dataset that has more columns or aspects to look at. This would save me from having to combine other datasets in order to analyze the components I am interested in. The coding itself was very rewarding, and I am confident that I have the skills and experience necessary to create professional, informative, and aesthetically pleasing graphs for the next project. The ability to insert the graphs (interactively) into the blog would be cool, but seeing as everyone struggled with that aspect, perhaps Feingold or Mr. Lee could assist us with that component.

Until next time, loyal readers,

-Jack
