## Problem statement 

Describe your four questions. Articulate your questions using absolutely no jargon. 

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did your analysis require any other kind of data preparation before it was ready to use?

After inspecting the "Geographic Cluster Name" column in the MCMF dataset, I found 121,359 missing values among in-person programs. Since Analyses 1 and 2 both rely heavily on the neighborhood each program belongs to, I used the geographic information in the Community Boundaries dataset together with the latitude and longitude information in the MCMF dataset to map programs to their respective neighborhoods.

I first compared the neighborhood names in the MCMF dataset with those in the Community Boundaries dataset to check for differences. I found that, besides neighborhood names, some programs in the MCMF dataset used unstandardized names such as "Far South Equity Zone" and "Back of the Yards", which also needed to be mapped. I extracted the programs that have both longitude and latitude information and whose geographic cluster name is either missing or unstandardized, then converted each longitude-latitude pair into a shapely point. I also converted the multipolygons in the Community Boundaries dataset into shapely format. Finally, for each longitude-latitude pair, I checked whether it falls inside any of the multipolygons that represent a neighborhood.
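The containment check described above can be illustrated with a minimal sketch. The analysis itself used shapely; the version below swaps in a plain-Python ray-casting test so that it runs without dependencies. The names `point_in_ring` and `find_neighborhood` and the toy boundaries are hypothetical, and holes inside polygons are ignored for simplicity.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float]   # (longitude, latitude)
Ring = List[Coord]            # outline of one polygon, as a list of vertices


def point_in_ring(point: Coord, ring: Ring) -> bool:
    """Ray-casting test: toggle 'inside' each time a rightward ray crosses an edge."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_neighborhood(point: Coord, boundaries: Dict[str, List[Ring]]) -> Optional[str]:
    """Return the first neighborhood whose multipolygon contains the point, else None."""
    for name, rings in boundaries.items():
        if any(point_in_ring(point, ring) for ring in rings):
            return name
    return None


# Hypothetical toy boundaries: two unit squares side by side (not real Chicago data).
boundaries = {
    "Area A": [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]],
    "Area B": [[(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]],
}

for lon_lat in [(0.5, 0.5), (1.5, 0.5), (5.0, 5.0)]:
    print(lon_lat, "->", find_neighborhood(lon_lat, boundaries))
```

Programs whose coordinates fall outside every boundary come back as `None`, which corresponds to the programs that could not be mapped to a neighborhood.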

After mapping, I reviewed the neighborhoods assigned to programs with unstandardized names. This step was necessary because some programs with unstandardized names lack latitude-longitude data, and I wanted to map them to the same neighborhoods as other programs with the same unstandardized name. Upon review, however, I found that many unstandardized names, such as equity zones, were mapped to several different neighborhoods. To avoid inconsistencies, where some equity zones are converted into neighborhood names while others remain unchanged, I created a new column, "Neighborhood," dedicated exclusively to neighborhood names. Programs in equity zones that could not be mapped to a specific neighborhood are marked as "NA" in this column.

## Exploratory Data Analysis

For each analysis:

What did you do exactly? How did you solve the problem? Why did you think it would be successful? 

What problems did you anticipate? What problems did you encounter? Did the very first thing you tried work? 

Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.

Note that you can write code to publish the results of the code, but hide the code using the YAML setting `#| echo: false`. For example, the code below makes a plot, but the code itself is not published with Quarto in the report.

### Analysis 2: How has the availability of equity-focused features among programs changed over time based on neighborhood socioeconomic status?
*By Luna Xu*

#### Program Distributions

To explore this question, I began by reviewing all columns in the "My Chi My Future" (MCMF) dataset to identify those relevant to equity-focused measures. I identified "Scholarship Available," "Participants Paid," "Transport Provided," and "Has Free Food" as key equity-related features. Since Analysis 4 will specifically focus on transportation, I narrowed the scope of equity-focused features to: "Scholarship Available," "Participants Paid," and "Has Free Food."

To understand neighborhood socioeconomic status (SES), I examined the Census dataset. Since the hardship index incorporates six selected socioeconomic indicators, I based my analysis on it, as it provides the most holistic view. The hardship index functions like a ranking, with each neighborhood having a unique score, where 99 represents the highest level of hardship and 1 the lowest. I therefore binned neighborhoods into three equal buckets: low-SES, mid-SES, and high-SES. I also dropped the "chicago" row, which is a citywide total rather than a neighborhood.
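The binning step can be sketched as follows. This is only an illustration under stated assumptions: the cut points (1-33, 34-66, 67-99) are one reading of "three equal buckets", higher hardship is taken to mean lower SES, and the function name `ses_bucket` and the sample rows are hypothetical.

```python
def ses_bucket(hardship_index: int) -> str:
    """Map a hardship index (1 = least hardship, 99 = most) to an SES bucket.

    The equal-width cut points are an assumption, not the report's confirmed thresholds.
    """
    if not 1 <= hardship_index <= 99:
        raise ValueError("hardship index is expected to lie between 1 and 99")
    if hardship_index <= 33:
        return "high-SES"   # least hardship
    if hardship_index <= 66:
        return "mid-SES"
    return "low-SES"        # most hardship


# Hypothetical rows: (neighborhood name, hardship index); values are made up.
census_rows = [
    ("Neighborhood X", 12),
    ("Neighborhood Y", 50),
    ("Neighborhood Z", 98),
    ("CHICAGO", None),      # citywide total row, dropped before binning
]

ses_by_neighborhood = {
    name: ses_bucket(index)
    for name, index in census_rows
    if name.lower() != "chicago"
}
print(ses_by_neighborhood)
```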

After merging the Census dataset with the SES bins and the MCMF dataset on the neighborhood variable, I explored the overall program distributions by neighborhood SES. Specifically, I counted the number of distinct programs over the years in each neighborhood and visualized them on the Chicago map using GeoPandas and Matplotlib's pyplot. Noticing several extreme values (e.g., one neighborhood had only 6 programs), I applied the LogNorm function from Matplotlib's colors module for better visualization, using the minimum and maximum program counts across all neighborhoods.
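To show why a logarithmic color scale helps with skewed counts, the short sketch below compares linear and logarithmic normalization. It reproduces only the scaling formula that a log norm applies, (log v − log vmin) / (log vmax − log vmin), rather than calling Matplotlib. The minimum of 6 comes from the text; the maximum of 6000 is a hypothetical value chosen for illustration.

```python
import math


def linear_norm(value: float, vmin: float, vmax: float) -> float:
    """Scale value to [0, 1] on a linear axis."""
    return (value - vmin) / (vmax - vmin)


def log_norm(value: float, vmin: float, vmax: float) -> float:
    """Scale value to [0, 1] on a logarithmic axis."""
    return (math.log(value) - math.log(vmin)) / (math.log(vmax) - math.log(vmin))


# vmin = 6 is the smallest count mentioned in the text; vmax = 6000 is hypothetical.
vmin, vmax = 6, 6000
for count in (6, 60, 600, 6000):
    print(count, round(linear_norm(count, vmin, vmax), 3), round(log_norm(count, vmin, vmax), 3))
```

On the linear scale, small and medium counts are squeezed near zero and become hard to tell apart on a map; on the log scale, each tenfold increase takes up the same share of the color range.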

![Figure 1](attachment:image.png)

As the maps above show, the contrast among the three SES bins is not stark, but the low-SES bucket generally contains more neighborhoods with fewer programs, including neighborhoods like Burnside, which has only 6 programs over the five-year period.

To further explore the program distributions among the three SES buckets, I plotted a line graph showing the total number of distinct programs in each SES bin in each year. Since not all 2025 programs have been entered into the dataset yet, I excluded the 2025 data.
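A minimal sketch of the counting behind such a line graph is shown below. The record layout `(program_id, year, ses_bucket)` and the function name are hypothetical; the point is that repeated rows for the same program count once per bucket and year, and that 2025 is skipped.

```python
from collections import defaultdict

# Hypothetical records: (program_id, year, ses_bucket); values are made up.
records = [
    ("p1", 2022, "low-SES"),
    ("p1", 2022, "low-SES"),    # same program listed twice, counted once
    ("p2", 2022, "high-SES"),
    ("p3", 2023, "low-SES"),
    ("p4", 2025, "mid-SES"),    # incomplete year, excluded
]


def distinct_programs_by_year(rows, exclude_years=(2025,)):
    """Count distinct program ids for each (ses_bucket, year) pair."""
    seen = defaultdict(set)
    for program_id, year, bucket in rows:
        if year in exclude_years:
            continue
        seen[(bucket, year)].add(program_id)
    return {key: len(ids) for key, ids in seen.items()}


counts = distinct_programs_by_year(records)
for (bucket, year), n in sorted(counts.items()):
    print(bucket, year, n)
```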

![Figure 2](attachment:image.png)

Now, we can clearly see that, in general, the number of distinct programs increases across all three SES buckets over the years, with high-SES neighborhoods showing the highest rate of increase. From 2022 to 2024, the total number of programs in low-SES neighborhoods is gradually approaching that of mid-SES neighborhoods. This suggests that, while a gap still exists in the number of programs offered across SES buckets, efforts are being made to bridge the equity gap in program availability.

I plotted a boxplot to further examine the distribution of program counts among neighborhoods within each SES bucket. As shown below, the high-SES bucket generally has a more spread-out distribution, especially in 2023 and 2024. Mid-SES and high-SES neighborhoods also account for most of the outliers, indicating that certain mid- or high-SES neighborhoods have far more programs than most others. After sorting by program count, I found that Irving Park, Near West Side, Morgan Park, Loop, and Lincoln Square are the five neighborhoods with the highest program counts, with Irving Park consistently having more programs each year than any other neighborhood.

I looked into other factors that might explain why these five neighborhoods have the most programs. Four of the five are located in or near downtown Chicago, so one possible explanation is that, because people are more likely to travel downtown, organizers in these neighborhoods expect to serve both residents and people who work in or visit the downtown area.

![Figure 3](attachment:image.png)

Next, I looked at the distribution of programs with equity features ("Scholarship Available," "Has Free Food," "Participants Paid") among the three SES buckets over the years. As shown below, programs that offer scholarships are concentrated in high-SES neighborhoods, while programs with free food are concentrated in low-SES neighborhoods, especially in 2023 and 2024. There is no clear pattern in how programs that pay participants are distributed, although high- and mid-SES neighborhoods have relatively more paid programs. The number of free food programs also increases across all three SES buckets over time. To explore the factors behind these overall distributions, I examined each equity feature in turn.

![Figure 4](attachment:image.png)

#### Equity Feature: Scholarship Available

Since academic programs are the ones that typically offer scholarships, my hypothesis for why high-SES neighborhoods have more scholarship programs is that more academic programs are offered there. To test this hypothesis, I further separated programs by category. However, the original dataset has too many categories, and several of them correspond to academic programs. I therefore grouped the categories into four buckets: Career & Life Skills, STEM & Writing, Arts & Humanity, and Sports & Wellbeing. Of these, STEM & Writing and Arts & Humanity are academic. I grouped STEM and Writing together because both are considered critical to academic performance, especially in higher education. After visualizing the counts in a clustered bar graph, I found that high-SES neighborhoods do have more academic programs, in both STEM & Writing and Arts & Humanity. Additionally, high-SES neighborhoods have more scholarship programs in every category than mid-SES and low-SES neighborhoods, indicating a severe financial inequity among programs.

![Figure 5](attachment:image.png)

#### Equity Feature: Has Free Food

While the general trend shows that low-SES neighborhoods have more free food programs, I wanted to look more closely at individual neighborhoods, not just at the broad SES level. For example, are free food programs distributed equally among neighborhoods within the low-SES bucket? At the neighborhood level, can we still observe a positive correlation between the hardship index and the number of free food programs offered? To answer these questions, I plotted a scatterplot with a trendline, with each neighborhood as a data point.

![Figure 6](attachment:image.png)

The result shows a mild positive correlation between the neighborhood hardship index and the number of free food programs, meaning that neighborhoods with more hardship do tend to have more free food programs. However, there are several extremely high values, indicating that a few neighborhoods have far more free food programs than most others. I sorted the dataset by the number of free food programs offered and found that Austin, Brighton Park, Gage Park, South Lawndale, and Near West Side are the five neighborhoods with the most free food programs. Austin has a larger population than most neighborhoods, which could account for its relatively large number of free food programs. Among these top five, Near West Side is a high-SES neighborhood. Since it is near downtown, it makes sense for it to have more free food programs, as downtown generally has a more active population. However, many low-SES neighborhoods are located on the South Side of Chicago, which makes it logistically harder for their residents to reach a downtown-area neighborhood like Near West Side than it is for residents of other neighborhoods near downtown (which are typically mid- and high-SES). Furthermore, some neighborhoods with very high hardship levels have very few free food programs (for example, Riverdale, the neighborhood with the second-highest hardship level, has only 1 free food program over the years). We can therefore conclude that sizable food inequity remains despite the general trend of low-SES neighborhoods having more free food programs.
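The correlation and trendline discussed above can be computed as in the following sketch. It uses the standard Pearson coefficient and an ordinary least-squares line; whether the original trendline was fitted this way is an assumption, and the sample values are made up to show how a single extreme neighborhood can dominate the picture.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def trend_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical data: hardship index and free food program count per neighborhood.
hardship = [10, 30, 50, 70, 90, 98]
free_food = [3, 5, 4, 40, 9, 1]   # one extreme value, one underserved high-hardship area

r = pearson_r(hardship, free_food)
slope, intercept = trend_line(hardship, free_food)
print(f"r = {r:.2f}, trendline: y = {slope:.2f} * x + {intercept:.2f}")
```

A positive coefficient can coexist with individual high-hardship neighborhoods that have almost no programs, which is why inspecting the individual points matters as much as the trendline.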

#### Equity Feature: Participants Paid

For the "Participants Paid" variable, there are no detectable differences among the three neighborhood SES buckets, and I noticed that many programs have NaN values for this variable. I therefore tried to improve the data quality. Specifically, I wondered whether some programs do in fact pay their participants but appear as unpaid or NaN. I first investigated word frequency in the descriptions of programs that pay participants, to identify keywords related to financial support. Using the re and nltk libraries and Counter, I removed stopwords and unreadable fragments, parsed the descriptions into words, and counted how many times each word appears. As shown in the graph below, among the 20 most common words, "paid" and "stipend" are the two related to financial support.

![Figure 7](attachment:image.png)
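The word-frequency step described above can be sketched with `re` and `collections.Counter`. To keep the example dependency-free, a small hand-written stopword set stands in for nltk's English stopword list; that set, the function name `top_words`, and the sample descriptions are all assumptions for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for nltk's English stopword list.
STOPWORDS = {
    "a", "an", "and", "are", "for", "in", "is", "of", "on",
    "the", "their", "this", "to", "will", "with",
}


def top_words(descriptions, n=20):
    """Return the n most common non-stopword tokens across all descriptions."""
    counts = Counter()
    for text in descriptions:
        # Keep alphabetic runs only, which drops digits, punctuation, and markup debris.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical program descriptions.
descriptions = [
    "Youth will receive a stipend for completing the program.",
    "This is a paid internship; participants are paid weekly.",
]

for word, count in top_words(descriptions, n=5):
    print(word, count)
```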

Next, I identified programs whose descriptions contain either of the two keywords, "paid" or "stipend", but that are labeled "unpaid" or NaN in the "Participants Paid" column and False in the "Scholarship Available" column. In this way, I hoped to find programs that pay participants but do not appear as paid or scholarship programs.

To ensure the keywords accurately reflect that a program compensates its participants, I used the re library to parse the descriptions into sentences and identify those containing the words "paid" or "stipend". Upon reviewing the matching sentences, I found that all instances of "stipend" reliably indicated programs that pay their participants. However, the keyword "paid" introduced noise, such as mentions of "paid parking." To reduce false positives, I chose to use "stipend" as the sole indicator of a paid program. I then added a new column, "Stipend", and marked programs offering stipends as True.
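A rough sketch of the sentence-level check and the resulting flag is given below. The sentence splitter is a simple regular expression rather than the author's exact pattern, and the function names, dictionary keys, and sample descriptions are hypothetical.

```python
import re


def sentences_with_keyword(description, keyword):
    """Split a description into sentences and keep those containing the whole word."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", description.strip())
    # Whole-word match; plural forms such as "stipends" would need their own pattern.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


def has_stipend(description):
    """True when any sentence of the description mentions a stipend."""
    return bool(sentences_with_keyword(description, "stipend"))


# Hypothetical programs with missing "Participants Paid" values.
programs = [
    {"description": "Teens earn a stipend each week. Lunch is provided.", "participants_paid": None},
    {"description": "Free to join. Paid parking is available nearby.", "participants_paid": None},
]

for program in programs:
    program["Stipend"] = has_stipend(program["description"])

# Reviewing the "paid" matches by hand reveals the false positive.
print(sentences_with_keyword(programs[1]["description"], "paid"))
print([program["Stipend"] for program in programs])
```

Printing the matching sentences, rather than only a True/False flag, is what makes it possible to spot noise such as "paid parking" before settling on a keyword.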

Finally, I created an updated heatmap counting programs that have either the "Paid, Type Unknown" value in the "Participants Paid" column or True in the "Stipend" column. A side-by-side comparison of the previous heatmap and the updated version shows that the number of programs that pay participants increases substantially in 2024 for low-SES neighborhoods. Comparing the two graphs also shows that, in low-SES neighborhoods, many programs pay participants through stipends without being labeled as such in the "Participants Paid" column, which is much less common in mid- and high-SES neighborhoods. This suggests that while efforts to provide financial support in low-SES neighborhoods have grown, these opportunities are not being adequately advertised.

![Figure 8](attachment:image.png)

## Conclusions

Do the individual analyses connect with each other to answer a bigger question? If yes, explain.

In conclusion, Analysis 2 reveals ongoing inequities in program availability and support across SES buckets, despite some positive trends. High-SES neighborhoods consistently have greater access to academic programs and scholarships. While low-SES neighborhoods generally have more free food programs, the uneven distribution within these areas, where neighborhoods with high hardship levels like Riverdale remain underserved, reveals persistent logistical and structural challenges. Similarly, the prevalence of mislabeled or under-advertised stipend opportunities in low-SES neighborhoods highlights the importance of improving program transparency and marketing. By addressing these gaps, stakeholders can better align resources with community needs, ensuring that equity-focused efforts reach the populations that need them most.

## Recommendations to stakeholder(s)
What are the action items for the stakeholder(s) based on your analysis? Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

Do the stakeholder(s) need to be aware about some limitations of your analysis? Can your analysis be directly used by the stakeholder(s) to obtain the expected benefit / make decisions, or do they need to do some further analysis based on their own, or do they need to repeat your analysis on a more recent data for the results to be applicable? 

My Chi My Future (MCMF) leadership should consider directing organizers in Irving Park, Near West Side, Morgan Park, and Lincoln Square to hold more programs in low-SES neighborhoods far from the downtown area, as these four neighborhoods have had the most programs over the years. MCMF leadership should also consider encouraging more academic programs (both STEM & Writing and Arts & Humanity) in low-SES neighborhoods, and program organizers should consider providing more scholarship options in low-SES neighborhoods or awarding scholarships based on need. MCMF leadership should likewise encourage more program organizers to offer free food in low-SES neighborhoods. Finally, MCMF leadership should feature or promote programs that are held in low-SES neighborhoods and pay participants. Program organizers whose programs pay participants should market this more effectively by entering the information accurately in the "Participants Paid" column, not just in the description. When collecting program information, MCMF leadership should consider placing the equity-focused fields (scholarship available, has free food, participants paid) first so that program organizers do not forget to fill them out. If possible, MCMF leadership should also consider building a data screening tool that scans program descriptions and extracts information related to equity-focused features (such as the amount of stipend provided).

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].



## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.