## Length of the report {-}

The main report must be **no more than 15 pages** when printed as a PDF, **excluding** the cover page, AI disclosure, references, and appendix. There is **no minimum** page requirement. Reports that exceed 15 pages in the main body may be penalized for lack of conciseness.


You may include additional materials in an **Appendix** (e.g., extra figures, tables, diagnostic plots, or detailed outputs). You are welcome to refer to the appendix in the main report to support your arguments. However, the appendix is **not guaranteed** to be read during grading and will only be consulted if the grader deems it necessary.

**Delete this instructional section from your final report when using this template.**

## Problem statement 

Describe your four questions. Articulate your questions using absolutely no jargon. 

### Q1: Access and Equity
How fairly are Divvy stations spread across Chicago? I look at where stations are located, whether some neighborhoods have many more stations than others, and whether areas with more stations tend to be richer or have more white residents than areas with few or no stations. In other words, I am asking whether access to Divvy is evenly shared across the city, or whether it is concentrated in higher-income, mostly white neighborhoods.

### Q2: Equity Over Time

### Q3: Membership & Frequency
The question asks how people who buy a yearly Divvy membership use the bikes differently from people who pay per ride or by day. It looks at how often they ride, how long their rides last, what time of day they ride, and which parts of Chicago they ride in. It also asks whether these differences line up with who lives in each area, such as income levels and racial or ethnic makeup based on census data.

### Q4: Safety and Mobility
How do traffic crashes, adult physical inactivity, and Divvy station accessibility relate across Chicago’s community areas, and do neighborhoods with higher physical inactivity experience different crash rates or different access to bike-share infrastructure?

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

### Q1: Access and Equity
First, I used the Divvy Bicycle Stations dataset from the Chicago Data Portal, which lists each station’s ID, name, status, and latitude/longitude and allowed me to map active stations across the city. Second, I pulled American Community Survey (ACS) 2023 5-year estimates from the U.S. Census using the Python census package. I used total population, white (non-Hispanic) population, and median household income at the census block-group level so I could calculate percent people of color and compare income. Finally, I used the 2023 TIGER/Line block-group shapefiles from the Census Bureau, filtered to Cook County, to get the geographic boundaries needed for spatial joins between Divvy stations and demographic data.
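The percent-people-of-color measure mentioned above can be sketched as follows. This is a minimal illustration, not the code used in the project: the column names (`total_pop`, `white_nh`, `median_income`) and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical block-group rows; column names and values are illustrative only.
acs = pd.DataFrame(
    {
        "geoid": ["A", "B", "C"],
        "total_pop": [1000, 500, 0],
        "white_nh": [250, 400, 0],
        "median_income": [52000, 98000, np.nan],
    }
)

# Percent people of color = share of residents who are not non-Hispanic white.
# Block groups with zero population get NaN instead of a division by zero.
pop = acs["total_pop"].where(acs["total_pop"] > 0)
acs["pct_poc"] = 100 * (acs["total_pop"] - acs["white_nh"]) / pop
```

Masking zero-population block groups before dividing keeps them out of later averages rather than letting them appear as 0% or infinity.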

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
Three public datasets were used:
- [Divvy trip data](https://divvybikes.com/system-data), which has one row per ride with start and end time, start and end station, latitude/longitude, and rider type (member vs casual).
- [Chicago community area boundaries](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-Map/cauq-8yn6), which define 77 official community areas and include their names and polygons for mapping trips to neighborhoods.
- [ACS 5-Year Data by Community Area (most recent year, 2023)](https://data.cityofchicago.org/Community-Economic-Development/ACS-5-Year-Data-by-Community-Area/t68z-cikk/about_data), which provides population, income brackets, and race/ethnicity counts for each community area; this was downloaded from the Chicago Data Portal.


### Q4: Safety and Mobility
First, I used the [Chicago Health Atlas](https://chicagohealthatlas.org/indicators/HCSPAP?topic=adult-physical-inactivity-rate) to obtain traffic crash counts, adult physical inactivity counts, and adult physical inactivity rates by neighborhood. Second, I used the [Divvy Bikes](https://divvybikes.com/system-data) dataset to get the locations of Divvy stations from their latitude and longitude. Finally, I used the [City of Chicago](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-Map/cauq-8yn6) community area boundaries dataset to map latitude-longitude coordinates to a neighborhood and compute the density of Divvy stations in each neighborhood.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did your analysis require any other kind of data preparation before it was ready to use?

### Q1: Access and Equity
(ethan)

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
(eduardo)

### Q4: Safety and Mobility
In the Divvy trip data, rides where the end time occurred before the start time were removed, since these represent incorrect or incomplete Divvy trips. Rows with missing latitude or longitude were dropped because they prevent spatial mapping when we overlay this data onto the community area boundaries. The ‘started_at’ and ‘ended_at’ columns were converted into datetime objects, and the additional variables ‘ride_minutes’, ‘date’, ‘hour’, and ‘day_of_week’ were added for potential temporal analysis. Finally, the columns ‘start_point’ and ‘end_point’ were derived from the latitude and longitude.
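The trip-cleaning steps above could look roughly like the sketch below. The sample rows are made up, and `start_lat` / `start_lng` are assumed names for the coordinate columns; the project's actual code may differ.

```python
import pandas as pd

# Tiny made-up sample. 'started_at' and 'ended_at' are named in the text;
# 'start_lat' and 'start_lng' are assumed names for the coordinate columns.
trips = pd.DataFrame(
    {
        "started_at": ["2024-06-03 08:00:00", "2024-06-03 09:30:00", "2024-06-04 17:15:00"],
        "ended_at": ["2024-06-03 08:12:00", "2024-06-03 09:10:00", "2024-06-04 17:45:00"],
        "start_lat": [41.88, 41.90, None],
        "start_lng": [-87.63, -87.65, -87.62],
    }
)

# Parse timestamps so they can be compared and subtracted.
trips["started_at"] = pd.to_datetime(trips["started_at"])
trips["ended_at"] = pd.to_datetime(trips["ended_at"])

# Drop rides that end before they start, then rides without start coordinates.
clean = trips[trips["ended_at"] >= trips["started_at"]]
clean = clean.dropna(subset=["start_lat", "start_lng"]).copy()

# Derived variables for temporal analysis.
clean["ride_minutes"] = (clean["ended_at"] - clean["started_at"]).dt.total_seconds() / 60
clean["date"] = clean["started_at"].dt.date
clean["hour"] = clean["started_at"].dt.hour
clean["day_of_week"] = clean["started_at"].dt.day_name()
```

In this sample, the second ride is removed because it ends before it starts, and the third is removed because its latitude is missing.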

In the health data, the original CSV contained four metadata rows at the top, which were removed to ensure that data could be read cleanly. Columns were also renamed for better interpretation. The column ‘community_area_name’ was standardized, and the column ‘community_area_number,’ which served as a GEOID, was converted to an integer data type for consistency. Non-analytical columns like the ‘layer’ column were also removed.
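A minimal sketch of the health-data preparation, assuming a CSV with four metadata rows above the real header. The raw column names (`Layer`, `Name`, `GEOID`, `HCSPAP_rate`) and the values are hypothetical stand-ins, not the actual export format.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded file: four metadata rows, then the header.
# All raw column names and values here are hypothetical.
raw = io.StringIO(
    "Source,Example\n"
    "Indicator,Example\n"
    "Period,Example\n"
    "Notes,Example\n"
    "Layer,Name,GEOID,HCSPAP_rate\n"
    "Community area,ROGERS PARK,1,20.5\n"
    "Community area,West Ridge ,2,24.1\n"
)

# Skip the four metadata rows so the fifth line is read as the header.
health = pd.read_csv(raw, skiprows=4)

# Rename columns for readability and drop the non-analytical 'Layer' column.
health = health.rename(
    columns={
        "Name": "community_area_name",
        "GEOID": "community_area_number",
        "HCSPAP_rate": "physical_inactivity_rate",
    }
).drop(columns=["Layer"])

# Standardize names (trim whitespace, upper-case) so they can match other tables.
health["community_area_name"] = health["community_area_name"].str.strip().str.upper()
health["community_area_number"] = health["community_area_number"].astype(int)
```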

In the community boundary data, columns were renamed for better interpretation and clarity. The column ‘community_area_name’ was standardized to match the health dataset. Geometry values, which originally were WKT strings, were converted into shapely geometry objects. The boundaries were then reprojected into a projected coordinate system (EPSG:3435), which would be used for distance and area calculations. Finally, centroid latitude and longitude values were computed for later mapping and visualization.   

All spatial data, such as Divvy-ride coordinates and community boundary polygons, were transformed into the same CRS (EPSG:3435) to standardize them for spatial operations. Next, a spatial join was used to assign each Divvy ride to the community area where the ride started. 
After spatially aligning the rides to each community area, features were aggregated at the community-area level. This included computing Divvy station counts per community area and calculating the number of rides that started in each community area. 
The three datasets were then merged into one unified dataframe (community_stats) containing all variables needed for mapping and modeling.
Finally, variables such as Divvy station counts and total ride counts were heavily right-skewed. We applied log transformations to make visualizations and relationships easier to interpret. These transformations helped stabilize variance and improve the interpretability of scatterplots and comparisons across community areas.
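The CRS alignment, spatial join, and aggregation described above can be sketched with GeoPandas as follows. The two rectangular "community areas" and three ride start points are invented for illustration, and the variable names are assumptions rather than the project's actual code.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up rectangular "community areas" in longitude/latitude (EPSG:4326).
areas = gpd.GeoDataFrame(
    {"community_area_name": ["AREA A", "AREA B"]},
    geometry=[box(-87.70, 41.80, -87.65, 41.85), box(-87.65, 41.80, -87.60, 41.85)],
    crs="EPSG:4326",
)

# Three made-up ride start points.
rides = gpd.GeoDataFrame(
    {"ride_id": [1, 2, 3]},
    geometry=[Point(-87.68, 41.82), Point(-87.62, 41.83), Point(-87.61, 41.81)],
    crs="EPSG:4326",
)

# Put both layers in the same projected CRS before any spatial operation.
areas = areas.to_crs(epsg=3435)
rides = rides.to_crs(epsg=3435)

# Assign each ride to the area that contains its start point.
joined = gpd.sjoin(rides, areas, how="left", predicate="within")

# Aggregate to the community-area level.
rides_per_area = (
    joined.groupby("community_area_name").size().rename("total_rides_starting_here")
)
```

Using `how="left"` keeps rides that fall outside every polygon (with missing area names), which makes it easy to check how many points failed to match.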


## Exploratory Data Analysis

For each analysis:

* What did you do exactly? How did you solve the problem? Why did you think it would be successful? 

* Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.



### Analysis 1: Access and Equity
*By Ethan Bledsoe*

(reminder for ethan to add census code citation)

To understand how Divvy access relates to neighborhood characteristics, I started by comparing block groups with and without stations. I used boxplots to compare median income and percent people of color for areas that have at least one station versus those that have none, and I ran t-tests to see whether the differences in means were statistically meaningful. I then refined this by looking at station density, grouping block groups into four bins (0, 1, 2–3, and 4+ stations) and making boxplots of income and percent people of color across these bins. A simple summary table showed that income generally increases, and the share of people of color decreases, as the number of stations rises.
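The station-density bins and summary table could be built along these lines. This is a sketch on made-up data with assumed column names (`station_count`, `median_income`, `pct_poc`); it is not the original analysis code.

```python
import numpy as np
import pandas as pd

# Made-up block-group rows; column names are assumptions for illustration.
bg = pd.DataFrame(
    {
        "station_count": [0, 0, 1, 2, 3, 4, 7],
        "median_income": [40000, 50000, 60000, 70000, 80000, 90000, 110000],
        "pct_poc": [80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 20.0],
    }
)

# Right-closed bins: (-1, 0] -> "0", (0, 1] -> "1", (1, 3] -> "2-3", (3, inf) -> "4+".
bg["station_bin"] = pd.cut(
    bg["station_count"],
    bins=[-1, 0, 1, 3, np.inf],
    labels=["0", "1", "2-3", "4+"],
)

# Summary table: group size and mean of each outcome per bin.
summary = bg.groupby("station_bin", observed=True).agg(
    n=("station_count", "size"),
    mean_income=("median_income", "mean"),
    mean_pct_poc=("pct_poc", "mean"),
)
```

Reporting `n` alongside the means matters here, because the higher-density bins usually contain far fewer block groups than the zero-station bin.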

Next, I treated the number of stations as a numeric variable and used scatterplots with regression lines to examine how station counts relate to median income and percent people of color. I calculated Pearson correlation coefficients to quantify these relationships. Finally, I created choropleth maps of percent people of color and median income across block groups and overlaid Divvy station locations. These maps showed visually that stations are concentrated in and around the Loop and the North Side, which tend to be higher-income and less heavily composed of people of color than many South and West Side areas.
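For readers unfamiliar with the Pearson coefficient, the sketch below computes it directly from its definition on made-up numbers. The function name `pearson_r` and the data are illustrative; in practice a library routine would normally be used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their individual variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up values: station counts and percent people of color for five block groups.
stations = [0, 1, 2, 3, 4]
pct_poc = [80.0, 70.0, 60.0, 50.0, 40.0]
r = pearson_r(stations, pct_poc)
```

The coefficient ranges from -1 to 1; the toy data above fall on a perfectly straight downward line, so `r` is at the negative extreme. Real block-group data would give a value closer to zero.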

### Analysis 2: Equity Over Time
*By Junho Hong*

### Analysis 3: Membership & Frequency
*By Eduardo Sourd*

### Analysis 4: Safety and Mobility
*By Isabella Yan (Chisa)*

This analysis examines how traffic crashes, adult physical inactivity, and Divvy station accessibility relate across Chicago’s community areas, and whether neighborhoods with higher physical inactivity experience different crash rates or different levels of access to bike-share infrastructure. To analyze across datasets, we needed to transform the Divvy ride coordinates and the Chicago community boundaries into a consistent coordinate reference system, EPSG:3435. Once these were transformed, our key metrics, Divvy station counts in each community area and ride volume, could be calculated at the community level to align with the existing inactivity rates and traffic crash metrics.

After spatially joining the data into one frame, we aggregated Divvy ride activity by community area. We chose two measures: the number of rides beginning in each neighborhood, which reflects actual usage of Divvy stations, and the number of Divvy stations located there, which reflects infrastructure access. Aggregating both by community area lets us compare community areas on physical inactivity, traffic crashes, and Divvy station infrastructure.

Below are the log-transformed plots. Because Divvy station counts and ride totals were both heavily right-skewed, we log-transformed them before analysis.

![Log transformed plots](./images/chisa/logtransformed.png)
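The effect of the log transformation can be illustrated on made-up counts. The report does not state which variant was used; this sketch assumes `log1p`, i.e. log(1 + x), which stays defined for community areas with zero stations.

```python
import numpy as np

# Made-up, right-skewed station counts per community area (note the zero).
station_counts = np.array([0, 1, 2, 3, 5, 8, 60])

# log1p(x) = log(1 + x): defined at zero, and it compresses the long right tail.
log_counts = np.log1p(station_counts)

# Compare the spread before and after the transformation.
raw_range = station_counts.max() - station_counts.min()
log_range = log_counts.max() - log_counts.min()
```

After the transformation, the single very large value no longer dominates the axis, so differences among the low-count areas become visible.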

Next, we visualized the pairwise relationships among the variables. The scatterplots below show physical inactivity against traffic crashes, physical inactivity against Divvy station counts, and Divvy station counts against traffic crashes. They let us see whether communities with higher inactivity tend to have fewer Divvy stations, which may suggest that limited access is related to inactivity, or different crash patterns, which may suggest that mobility environments differ between community areas. The scatterplots also highlight the direction, strength, and form of each relationship, including nonlinear trends and the influence of outliers.

![Inactivity vs. Crashes](./images/chisa/inactivityvscrashes.png)

![Inactivity vs. Divvy count](./images/chisa/inactivityvscount.png)

![Divvy count vs. crashes](./images/chisa/countvscrashes.png)

The correlation table below supplements the scatterplots by quantifying each pairwise relationship. Physical inactivity vs. traffic crashes has a correlation of -0.177, which indicates a very weak negative relationship. Physical inactivity vs. Divvy station count has a correlation of -0.468, which indicates a moderate negative relationship. Finally, Divvy station count vs. traffic crashes has a correlation of 0.644, which indicates a moderate to strong positive relationship.

|                               | physical_inactivity_rate | traffic_crashes | divvy_station_count | total_rides_starting_here |
| ----------------------------- | ------------------------ | --------------- | ------------------- | ------------------------- |
| **physical_inactivity_rate**  | 1.000000                 | -0.177478       | -0.468223           | -0.490147                 |
| **traffic_crashes**           | -0.177478                | 1.000000        | 0.643516            | 0.670118                  |
| **divvy_station_count**       | -0.468223                | 0.643516        | 1.000000            | 0.974841                  |
| **total_rides_starting_here** | -0.490147                | 0.670118        | 0.974841            | 1.000000                  |

Neighborhoods with higher Divvy station counts and ridership are clustered mainly in areas with higher traffic crash counts, which is consistent with the positive correlation above and suggests that community areas with many Divvy stations also tend to be high-activity, high-traffic zones. Areas with high physical inactivity tend to have low Divvy access, consistent with the moderate negative correlation between inactivity and station count. These patterns suggest there may be spatial divides in Chicago, where central and transit-rich neighborhoods have higher physical activity and more Divvy station access.

## Discussion / Limitations (Optional)

You are welcome to introduce additional sections or subsections, if required, to address your questions in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Challenges and reflection

Use this section to reflect honestly on your project process. This part is mainly for you and the instructors, not for the stakeholder.

You might address:

- **Anticipated vs. actual challenges.**  
  What problems did you expect at the beginning? What problems did you actually encounter (e.g., data quality, missing values, messy categories, time constraints, coordination within the team)?

- **First attempts and revisions.**  
  Did the very first approach you tried work? If not, what went wrong? How did you debug, revise your questions, or improve your methods?

- **Decisions and trade-offs.**  
  Did you simplify any questions, drop some analyses, or change your focus along the way? Why?

- **What you learned.**  
  What are 2–3 key things you learned from doing this project (about data science, the Divvy data, or working in a team)? If you started over today, what would you do differently?

Keep this section to about **1–2 paragraphs** or a short set of bullet points.

### Q1: Access and Equity
The main challenges in this project were geospatial and cleaning-related. At first, my spatial joins failed because the Divvy stations and block-group polygons were stored in different coordinate systems, so I had to learn how to reproject everything consistently and verify that the joins were working. I also had to deal with odd ACS income values and a very skewed income distribution, which led me to explore and apply the IQR rule for outlier removal. Working through these issues taught me how sensitive geospatial analysis is to careful preprocessing, and how small choices in cleaning and merging can change the story the data tells. If I repeated the project, I would bring in ridership data sooner so I could study both where stations are and how they are actually being used.
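For reference, the IQR rule mentioned above can be sketched as below. The income values are made up, and the conventional 1.5 × IQR fence is assumed; the report does not state the exact multiplier used.

```python
import pandas as pd

# Made-up median household incomes with one extreme value.
income = pd.Series([30000, 45000, 52000, 61000, 75000, 400000])

# Interquartile range: distance between the 25th and 75th percentiles.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1

# Keep values within 1.5 * IQR of the quartiles; the rest are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = income[income.between(lower, upper)]
```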

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
(eduardo)

### Q4: Safety and Mobility
- Messy data for spatial join was hard to clean.
- The spatial join didn’t work at first; everything had to be reprojected to EPSG:3435 and names had to be standardized.
- Initial values were heavily right-skewed and had to be log-transformed.
- Learned about geospatial preprocessing and how data cleaning determines analysis quality.


## Conclusions

Do the individual analysis connect with each other to answer a bigger question? If yes, explain.

### Q1: Access and Equity
Overall, my analysis suggests that Divvy stations are not evenly distributed across Chicago. Block groups with more stations tend to have higher median household income and a lower share of people of color than block groups with few or no stations. This pattern appears consistently in summaries by station presence, in station-density bins, in correlations with station counts, and on city maps. Together, these results indicate that Divvy infrastructure is more built out in central, higher-income, and whiter neighborhoods, while many outer or less advantaged areas have lower access to the system.

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
(eduardo)

### Q4: Safety and Mobility
Overall, the analysis indicates that Chicago community areas with higher physical inactivity tend to have fewer Divvy stations and lower Divvy usage, which suggests limited access to active transportation options. Furthermore, community areas with more Divvy access experience higher counts of traffic crashes, which likely reflects the fact that Divvy stations are concentrated in denser, more active parts of the city. The patterns observed were consistent across scatterplots, correlations, and choropleth maps. This highlights a spatial divide, where more central parts of the city have stronger Divvy infrastructure and higher mobility, while areas outside of these tend to face higher inactivity levels and less access to Divvy infrastructure.


## Recommendations to stakeholder(s)

In this section:

- Clearly state **which stakeholder** you are writing for (e.g., *Divvy operations manager*, *City of Chicago planner*, *marketing team*).
- Provide **2–4 specific, actionable recommendations** based on your analysis. Be as concrete and practical as possible so stakeholders can use your suggestions directly.
- Briefly discuss **limitations** of your analysis. Indicate whether stakeholders can act on your results as-is, or whether they should:
  - collect more or updated data,
  - perform additional analysis, or
  - repeat your analysis on more recent data.




### Q1: Access and Equity
My recommendations are aimed at a City of Chicago transportation planner or Divvy operations planner deciding where to expand the system. The results suggest that the city should treat block groups with low station counts, lower incomes, and higher shares of people of color as priority areas for new stations if the goal is to improve equity in access to bike-share. Because this analysis relies on a single snapshot of ACS and station data, planners should repeat a similar analysis with updated data and combine it with ridership information and community input before making major investment decisions.

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
(eduardo)

### Q4: Safety and Mobility
These recommendations are aimed at a City of Chicago transportation planner who is deciding where to add new Divvy stations and how to reduce physical inactivity rates. The city should expand Divvy station coverage in community areas with high physical inactivity and limited Divvy access, prioritizing neighborhoods with low station density, low ridership, and higher inactivity rates. Furthermore, places with higher rates of traffic crashes should receive stricter traffic regulations or infrastructure that supports the higher volume of Divvy use, such as dedicated bike lanes. However, the data covers only 2024 and may not reflect current Divvy trends. Traffic crashes are also aggregated across all severities, giving minor and severe crashes equal weight. An analysis with updated data and more detailed data about traffic crashes and Divvy station access should be performed before taking any major action.

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].



## AI Tools and Assistance (Disclosure — for course purposes) {-}

In this section, briefly describe **if and how** you used any AI tools while working on this project. This includes, for example, tools such as *ChatGPT*, *GitHub Copilot*, *Grammarly*, or other code-writing, code-completion, or writing-assistance tools.

Please address:

- **Which tools (if any)** you used.
- **What you used them for** (e.g., brainstorming ideas, clarifying concepts, debugging code, editing writing, generating plots, etc.).
- **What parts of the work are fully your own**, and how you checked or modified any AI-generated suggestions.

If you did **not** use any AI tools, simply write:

> We did not use any AI tools for this project.


### Q1: Access and Equity
(ethan)

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
(eduardo)

### Q4: Safety and Mobility
I used ChatGPT to help with reprojecting the community boundary data and the Divvy station data into EPSG:3435 and with performing the spatial join.

## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.