## Problem statement 

### Q1: Access and Equity
How fairly are Divvy stations spread across Chicago? I look at where stations are located, whether some neighborhoods have many more stations than others, and whether areas with more stations tend to be richer or have more white residents than areas with few or no stations. In other words, I am asking whether access to Divvy is evenly shared across the city, or whether it is concentrated in higher-income, mostly white neighborhoods.

### Q2: Equity Over Time
How has Divvy ridership equity evolved across different expansion phases in Chicago, and which census tracts experienced the most significant changes in usage rates based on their racial and socioeconomic characteristics?

### Q3: Membership & Frequency
The question asks how people who buy a yearly Divvy membership use the bikes differently from people who pay per ride or by day. It looks at how often they ride, how long their rides last, what time of day they ride, and which parts of Chicago they ride in. It also asks whether these differences line up with who lives in each area, such as income levels and racial or ethnic makeup based on census data.

### Q4: Safety and Mobility
How do traffic crashes, adult physical inactivity, and Divvy station accessibility relate across Chicago’s community areas, and do neighborhoods with higher physical inactivity experience different crash rates or different access to bike-share infrastructure?

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

### Q1: Access and Equity
I used the Divvy Bicycle Stations dataset from the Chicago Data Portal, which lists each station’s ID, name, status, and latitude/longitude and allowed me to map active stations across the city [1]. Second, I pulled American Community Survey (ACS) 2023 5-year estimates from the U.S. Census using the Python census package. I used total population, white (non-Hispanic) population, and median household income at the census block-group level so I could calculate percent people of color and compare income [5]. Finally, I used the 2023 TIGER/Line block-group shapefiles from the Census Bureau, filtered to Cook County, to get the geographic boundaries needed for spatial joins between Divvy stations and demographic data [6].

### Q2: Equity Over Time
Five public datasets were used:
- [Divvy Trip Data](https://divvybikes.com/system-data): Ride-level data for Q2 (April-June) of each year from 2014-2025
- [Divvy Station Data](https://data.cityofchicago.org/Transportation/Divvy-Bicycle-Stations/bbyy-e7gq): Station locations with coordinates as of Dec 7, 2024
- [Census Tract Boundaries](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik): 2010 Census tract boundaries for Cook County
- [ACS 5-year Demographic Data (DP05)](https://data.census.gov/): Race and ethnicity data by census tract. Filtered by census tract and year
- [ACS 5-year Economic Data (DP03)](https://data.census.gov/): Economic characteristics by census tract

### Q3: Membership & Frequency
Three public datasets were used:
- [Divvy trip data](https://divvybikes.com/system-data), which has one row per ride with start and end time, start and end station, latitude/longitude, and rider type (member vs casual).
- [Chicago community area boundaries](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-Map/cauq-8yn6), which define 77 official community areas and include their names and polygons for mapping trips to neighborhoods.
- [ACS 5-Year Data by Community Area (most recent year, 2023)](https://data.cityofchicago.org/Community-Economic-Development/ACS-5-Year-Data-by-Community-Area/t68z-cikk/about_data), which provides population, income brackets, and race/ethnicity counts for each community area; this was downloaded from the Chicago Data Portal.


### Q4: Safety and Mobility
First, I used the [Chicago Health Atlas](https://chicagohealthatlas.org/indicators/HCSPAP?topic=adult-physical-inactivity-rate) to find the traffic crashes, adult physical inactivity counts, and adult physical inactivity rates by neighborhood. Next, I used the [Divvy Bikes](https://divvybikes.com/system-data) dataset to get the location of Divvy stations from their latitude and longitude. Finally, I used the [City of Chicago](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-Map/cauq-8yn6) dataset to map latitude-longitude coordinates to a neighborhood and find the overall density of Divvy stations in each neighborhood.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did your analysis require any other kind of data preparation before it was ready to use?

### Q1: Access and Equity
For the Divvy–census analysis, I summarized the distribution of all variables used at the block-group level. For continuous variables such as median household income, percent white non-Hispanic, percent people of color, total population, and the number of Divvy stations, I reported standard summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum). These are shown in Table 1.

**Table 1. Summary of continuous variables (block-group level)**

| Variable      | Count |   Mean   |   Std   |  Min  |  25%  |  50%  |  75%  |  Max   |
|--------------|:-----:|:--------:|:-------:|:-----:|:-----:|:-----:|:-----:|:------:|
| med_income   | 3596  | 93123.0  | 48132.6 | 2499  | 59859 | 84076 | 116398| 250001 |
| pct_white_nh | 3990  | 0.39     | 0.31    | 0.00  | 0.07  | 0.39  | 0.68  | 1.00   |
| pct_poc      | 3990  | 0.61     | 0.31    | 0.00  | 0.32  | 0.61  | 0.93  | 1.00   |
| total_pop    | 4001  | 1294.9   | 615.1   | 0     | 860   | 1196  | 1611  | 8470   |
| n_stations   | 4001  | 0.28     | 0.68    | 0     | 0     | 0     | 0     | 12     |

For categorical variables, I examined the number of missing values, the number of unique levels, and the frequency of each level. In this analysis, the main categorical variables were whether a block group had at least one Divvy station (`has_divvy`) and the station-density bin (`station_bin`, which groups block groups into 0, 1, 2–3, or 4+ stations). Their distributions are summarized in Table 2. Here, `has_divvy = 0` indicates no stations in the block group, and `has_divvy = 1` indicates at least one station.

**Table 2. Distribution of categorical variables**

| Variable     | n_missing | n_unique | Level | Count |
|-------------|:---------:|:--------:|:-----:|:-----:|
| has_divvy   |     0     |    2     |   0   | 3175  |
| has_divvy   |     0     |    2     |   1   |  826  |
| station_bin |     0     |    4     |   0   | 3175  |
| station_bin |     0     |    4     |   1   |  649  |
| station_bin |     0     |    4     |  2–3  |  151  |
| station_bin |     0     |    4     |  4+   |   26  |

There were a few potentially incorrect or problematic values that required cleaning. In the ACS data, median income sometimes appeared as large negative numbers, which the Census uses as flags for missing or unreliable estimates, so I converted `med_income` to numeric and set any negative values to missing. I also found that the income distribution was highly skewed, so I applied the standard IQR rule to drop extreme outliers before computing group means and medians. On the Divvy side, I restricted the station dataset to rows whose status was “In Service” and converted latitude and longitude into geometries so that only active stations with valid locations were used.
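
The income cleaning described above can be sketched roughly as follows. This is a minimal illustration, not the exact project code: the helper name `clean_income`, the toy values, the 1.5 × IQR fences, and the choice to keep rows with missing income are all assumptions.

```python
import pandas as pd

def clean_income(df: pd.DataFrame, col: str = "med_income") -> pd.DataFrame:
    """Treat negative incomes as missing, then drop IQR outliers (assumed 1.5 x IQR)."""
    out = df.copy()
    # Convert to float so missing values can be represented as NaN.
    out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    # Negative values are Census flags for missing/unreliable estimates.
    out[col] = out[col].mask(out[col] < 0)
    q1, q3 = out[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Assumption: rows with missing income are kept; only extreme values are dropped.
    keep = out[col].isna() | out[col].between(lower, upper)
    return out[keep]

# Toy block groups; -666666666 stands in for a negative ACS flag value.
demo = pd.DataFrame(
    {"med_income": ["52000", "-666666666", "61000", "58000", "75000", "900000"]}
)
cleaned = clean_income(demo)
print(cleaned)
```

In this toy input the flagged value becomes missing and the 900,000 row falls outside the upper fence, so it is removed before any group means or medians are computed.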

The analysis also required several additional preparation steps before it was ready to use. I constructed a census `GEOID` for each block group and merged the ACS attributes with the 2023 TIGER/Line block-group shapefiles, filtered to Cook County. I reprojected both the block-group polygons and the Divvy station points into the same projected coordinate system to allow a correct spatial join. Using that join, I assigned each Divvy station to the block group it falls inside and then aggregated to obtain the number of stations per block group and a binary indicator of whether a block group has at least one station. Finally, I created the station-density categories (0, 1, 2–3, 4+ stations) and computed percent people of color and percent white non-Hispanic, which together formed the cleaned dataset used for all plots and statistical comparisons.
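
As a rough sketch of the last step, the derived variables could be built from per-block-group counts like this. The column names `white_nh` and the toy rows are hypothetical, and the handling of zero-population block groups (percentages left missing) is an assumption rather than something stated above.

```python
import pandas as pd

# Toy block groups after the spatial join and aggregation (values are illustrative).
bg = pd.DataFrame(
    {
        "total_pop": [1200, 900, 1500, 0],
        "white_nh": [600, 90, 1200, 0],
        "n_stations": [0, 1, 3, 5],
    }
)

# Assumption: block groups with zero population get missing percentages.
pop = bg["total_pop"].where(bg["total_pop"] > 0)
bg["pct_white_nh"] = bg["white_nh"] / pop
bg["pct_poc"] = 1 - bg["pct_white_nh"]

# Binary indicator and the four station-density bins (0, 1, 2-3, 4+).
bg["has_divvy"] = (bg["n_stations"] > 0).astype(int)
bg["station_bin"] = pd.cut(
    bg["n_stations"],
    bins=[-1, 0, 1, 3, float("inf")],  # right-closed: (-1,0], (0,1], (1,3], (3,inf]
    labels=["0", "1", "2-3", "4+"],
)
print(bg)
```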



### Q2: Equity Over Time

#### Variable Distributions

To verify data quality after cleaning, I examined the distribution of all variables used in the final analysis. This helped confirm that demographic variables showed reasonable ranges, ride counts followed expected patterns (right-skewed with high concentration in downtown areas), and missing values were minimal after spatial joins and demographic matching.

**Table 1. Summary of continuous variables (census tract level)**

| Variable       | Count | Mean    | Std     | Min     | 25%     | 50%     | 75%     | Max     |
|---------------|:-----:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| total_pop     | 3,983 | 3,929.1 | 1,805.9 | 0.0     | 2,535.0 | 3,800.0 | 5,126.5 | 19,015.0|
| pct_white_nh  | 3,971 | 38.6    | 30.1    | 0.0     | 6.6     | 38.1    | 65.9    | 97.2    |
| pct_black_nh  | 3,971 | 27.5    | 35.2    | 0.0     | 1.8     | 6.2     | 55.4    | 100.0   |
| pct_hispanic  | 3,971 | 24.6    | 26.0    | 0.0     | 5.7     | 13.7    | 34.6    | 99.6    |
| pct_asian     | 3,971 | 6.8     | 9.7     | 0.0     | 0.3     | 3.1     | 9.0     | 89.8    |
| median_income | 3,925 | 75,952  | 39,186  | 10,471  | 47,659  | 68,880  | 96,152  | 244,286 |
| pct_poverty   | 3,971 | 15.9    | 12.4    | 0.0     | 6.5     | 12.3    | 22.5    | 75.6    |
| ride_count    | 4,168 | 2,879.2 | 6,818.7 | 1       | 94      | 669     | 2,448   | 68,947  |

*Note: Demographic variables from ACS 5-year estimates (2017, 2022, 2023); ride_count represents total rides per census tract per year (2014-2025).*

**Table 2. Distribution of categorical variables**

| Variable         | n_missing | n_unique | Level                          | Count |
|-----------------|:---------:|:--------:|:------------------------------:|:-----:|
| acs_year        |     0     |    3     | 2017                           | 1,319 |
| acs_year        |     0     |    3     | 2022                           | 1,332 |
| acs_year        |     0     |    3     | 2023                           | 1,332 |
| race_category   |    12     |    4     | Predominantly White            | 1,942 |
| race_category   |    12     |    4     | Predominantly Black            | 1,140 |
| race_category   |    12     |    4     | Predominantly Hispanic         |   829 |
| race_category   |    12     |    4     | Predominantly Asian            |    60 |
| income_category |    58     |    4     | Medium Income ($50-75k)        | 1,129 |
| income_category |    58     |    4     | Low Income (<$50k)             | 1,095 |
| income_category |    58     |    4     | Very High Income (>$100k)      |   893 |
| income_category |    58     |    4     | High Income ($75-100k)         |   808 |
| is_diverse      |     0     |    2     | False (has majority >50%)      | 3,403 |
| is_diverse      |     0     |    2     | True (no majority)             |   580 |

*Note: Race category based on plurality (highest percentage) of racial/ethnic groups; is_diverse indicates census tracts where no single group exceeds 50%.*

#### Data Cleaning and Preparation

**Divvy Trip Data Standardization:**  
The Divvy trip datasets spanning 2013-2025 contained significant format variations across years. Column names changed multiple times: early years (2013-2016) used `starttime`, `from_station_*`, and `usertype`, while later years (2020+) switched to `started_at`, `start_station_*`, and `member_casual`. The 2019 dataset used entirely different naming conventions with prefixes like `'01 - Rental Details Rental ID'`. A standardization function was created to extract only essential columns (year, ride_id, station IDs/names, user_type) and unify them across all years. Q2 data (April-June) was filtered from full-year or Q1Q2 files where necessary. The 2013 dataset was excluded due to insufficient Q2 rides (only 4,005 rides).
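
A standardization function of this kind might look like the sketch below. The alias table `COLUMN_ALIASES`, the function name `standardize_trips`, and the `trip_id` to `ride_id` mapping are assumptions for illustration; the 2019 prefixed column names are not handled here.

```python
import pandas as pd

# Hypothetical alias table mapping old and new column names to one schema.
COLUMN_ALIASES = {
    "trip_id": "ride_id",                       # assumed early-years ID column
    "from_station_id": "start_station_id",
    "from_station_name": "start_station_name",
    "usertype": "user_type",
    "member_casual": "user_type",
}
KEEP = ["ride_id", "start_station_id", "start_station_name", "user_type"]

def standardize_trips(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Rename known aliases, keep only essential columns, and tag the year."""
    out = df.rename(columns=COLUMN_ALIASES).reindex(columns=KEEP)
    out.insert(0, "year", year)
    return out

early = pd.DataFrame(
    {"trip_id": [1], "starttime": ["5/1/2015 08:00"], "from_station_id": [5],
     "from_station_name": ["A"], "usertype": ["Subscriber"]}
)
late = pd.DataFrame(
    {"ride_id": ["x1"], "started_at": ["2021-05-01 08:00:00"], "start_station_id": ["5"],
     "start_station_name": ["A"], "member_casual": ["member"]}
)
combined = pd.concat(
    [standardize_trips(early, 2015), standardize_trips(late, 2021)], ignore_index=True
)
print(combined)
```

Note that this sketch only unifies column names; the rider-type values themselves would still need to be harmonized before comparing user types across years.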

**Station Coordinate Mapping:**  
The Divvy station file (`Divvy_Bicycle_Stations_20251207.csv`) contained 1,149 current stations but did not cover all historical stations from trip data. Station name matching was performed to add coordinates, achieving 75-80% ride coverage across years. Rides without matching stations were removed to ensure all records had valid geographic coordinates, resulting in 12,071,787 rides (80% of original 15,078,402 rides).

**Spatial Mapping to Census Tracts:**  
Ride coordinates were converted to Point geometries and spatially joined with 2010 Census tract boundaries using GeoPandas `sjoin` with `predicate='within'`. Of 12,071,787 rides, 12,000,613 (99.4%) successfully mapped to census tracts; 71,174 rides fell outside tract boundaries and were excluded.

**ACS Data Processing:**  
American Community Survey 5-year estimates were used to match demographic data to ride years: 2017 ACS (2013-2017 estimates) for 2014-2017 rides, 2022 ACS (2018-2022) for 2018-2022 rides, and 2023 ACS (2019-2023) for 2023-2025 rides. Census tract GEOIDs in ACS files had a `1400000US` prefix that was removed for matching. Column names varied between ACS years, with 2017/2022 using `Percent Estimate!!` prefix while 2023 used `Percent!!`. A prefix detection function handled these variations automatically. Median household income appeared in different inflation-adjusted dollar columns across years (e.g., "2017 INFLATION-ADJUSTED DOLLARS" vs "2023 INFLATION-ADJUSTED DOLLARS"), requiring dynamic column detection. All percentage and income values were converted to numeric, with errors coerced to NaN.
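
The matching rules above can be expressed compactly. The sketch below is illustrative only: the function names and the example column labels are hypothetical, while the year ranges and the `1400000US` prefix come from the description above.

```python
def acs_year_for_ride(ride_year: int) -> int:
    """Map a ride year to the ACS 5-year vintage used for its demographics."""
    if 2014 <= ride_year <= 2017:
        return 2017
    if 2018 <= ride_year <= 2022:
        return 2022
    if 2023 <= ride_year <= 2025:
        return 2023
    raise ValueError(f"No ACS vintage assigned for ride year {ride_year}")

def strip_geoid_prefix(geo_id: str, prefix: str = "1400000US") -> str:
    """Remove the tract-level prefix so GEOIDs match the boundary file."""
    return geo_id.removeprefix(prefix)  # requires Python 3.9+

def detect_percent_prefix(columns) -> str:
    """Return whichever percent-column prefix this ACS file uses."""
    for prefix in ("Percent Estimate!!", "Percent!!"):
        if any(col.startswith(prefix) for col in columns):
            return prefix
    raise ValueError("No recognized percent prefix found")

print(acs_year_for_ride(2020))                     # 2022
print(strip_geoid_prefix("1400000US17031010100"))  # 17031010100
# Hypothetical column labels, used only to exercise the prefix detection.
print(detect_percent_prefix(["Geography", "Percent!!RACE!!Total population"]))
```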

**Demographic Categorization:**  
Census tracts were categorized by race based on plurality (highest percentage among White NH, Black NH, Hispanic, Asian), labeled as "Predominantly [Race]". A diversity flag (`is_diverse`) was added for tracts where no group exceeded 50%. Income was binned into four categories: Low (<$50k), Medium ($50-75k), High ($75-100k), and Very High (>$100k) based on inflation-adjusted median household income.
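
A minimal sketch of this categorization is shown below, using toy tracts. The treatment of bin edges (left-closed intervals, so exactly $50,000 counts as Medium) is an assumption, since the boundaries are not specified above.

```python
import pandas as pd

tracts = pd.DataFrame(
    {
        "pct_white_nh": [70.0, 10.0, 35.0],
        "pct_black_nh": [10.0, 80.0, 30.0],
        "pct_hispanic": [15.0, 8.0, 30.0],
        "pct_asian": [5.0, 2.0, 5.0],
        "median_income": [120_000, 42_000, 68_000],
    }
)

labels = {
    "pct_white_nh": "Predominantly White",
    "pct_black_nh": "Predominantly Black",
    "pct_hispanic": "Predominantly Hispanic",
    "pct_asian": "Predominantly Asian",
}
race_cols = list(labels)

# Plurality: the group with the highest percentage in each tract.
tracts["race_category"] = tracts[race_cols].idxmax(axis=1).map(labels)
# Diverse: no single group exceeds 50%.
tracts["is_diverse"] = tracts[race_cols].max(axis=1) <= 50

# Assumed left-closed bins: [0, 50k), [50k, 75k), [75k, 100k), [100k, inf).
tracts["income_category"] = pd.cut(
    tracts["median_income"],
    bins=[0, 50_000, 75_000, 100_000, float("inf")],
    labels=[
        "Low Income (<$50k)",
        "Medium Income ($50-75k)",
        "High Income ($75-100k)",
        "Very High Income (>$100k)",
    ],
    right=False,
)
print(tracts[["race_category", "is_diverse", "income_category"]])
```

The third toy tract illustrates why the two flags are kept separate: it is labeled "Predominantly White" by plurality, yet it is also marked diverse because no group passes 50%.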

**Phase Assignment:**  
Years were grouped into three expansion phases: Phase 1 (2014-2017) representing the initial system, Phase 2 (2018-2022) showing equity-focused expansion, and Phase 3 (2023-2025) reflecting recent growth. The first year each station appeared in the trip data was calculated to identify the phase in which it was added.

**Potentially Incorrect Values:**
- Negative median income values in ACS data were converted to NaN
- Rides with missing station names or coordinates were excluded
- Census tracts with missing demographic data (4 tracts for race, 76 for income) were dropped from relevant analyses
- Station expansion analysis removed 6 records lacking demographic matches

**Additional Preparation:**
- Created cumulative station counts by phase for distribution analysis
- Calculated rides per capita by merging ride counts with census tract population
- Generated percentage change metrics comparing Phase 1 (2014-2017) to Phase 3 (2023-2025)
- Capped percentage change values at -100% to +300% for visualization clarity
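
The percentage-change metric and its cap can be sketched as follows. The column names and toy ride counts are hypothetical, and leaving tracts with no Phase 1 rides as missing is an assumption (a percentage change from zero is undefined).

```python
import pandas as pd

phase = pd.DataFrame(
    {
        "tract": ["A", "B", "C", "D"],
        "rides_p1": [100, 50, 200, 0],   # Phase 1 (2014-2017) rides, toy values
        "rides_p3": [150, 400, 0, 30],   # Phase 3 (2023-2025) rides, toy values
    }
)

# Assumption: tracts with zero Phase 1 rides get a missing percentage change.
base = phase["rides_p1"].where(phase["rides_p1"] > 0)
phase["pct_change"] = (phase["rides_p3"] - phase["rides_p1"]) / base * 100

# Cap at -100% to +300% so a few extreme tracts do not dominate the color scale.
phase["pct_change_capped"] = phase["pct_change"].clip(lower=-100, upper=300)
print(phase)
```

Here tract B grows by 700% but is displayed at the +300% cap, which keeps the map legend readable while the uncapped column remains available for tables.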



### Q3: Membership & Frequency
The main trip-level variables of interest and their distributions are summarized below. The key continuous variables are ride length in minutes (ride_length_min) and hour_of_day. The key categorical variables are member_casual, day_of_week, and start_community. Table 1 shows summary statistics for the two continuous variables, and Table 2 shows the distributions of the key categorical variables.

**Table 1. Summary of continuous variables (trip level)**

| Variable      | Count |   Mean   |   Std   |  Min  |  25%  |  50%  |  75%  |  Max   |
|--------------|:-----:|:--------:|:-------:|:-----:|:-----:|:-----:|:-----:|:------:|
| ride_length_min  | 172,846 | 11.42  | 24.71 | 2  | 5 | 8 | 13| 1,418 |
| hour_of_day| 172,846  | 13.43     | 4.89    | 0  | 10  | 14  | 17  | 23   |

**Table 2. Distribution of categorical variables**

| Variable     | n_missing | n_unique | Level | Count |
|-------------|:---------:|:--------:|:-----:|:-----:|
| member_casual  |     0     |    2     |   member   | 135,776 |
| member_casual  |     0     |    2     |   casual   |  37,070  |
| day_of_week|     0     |    7     |   Monday (top level)   | 33,854  |
| day_of_week|     0     |    7     |   (other 6 days)   | 138,992  |
| start_community|     1,632     |    77     | NEAR NORTH SIDE (top) |  36,868  |
| start_community|     1,632     |    77     |  (other 76 areas)   |   134,346  |

Several questionable values required cleaning before the data was usable. Trip start and end times were stored as strings, so they were cast to datetime objects for easier processing. Rows whose times could not be parsed were dropped to prevent invalid start times and durations. Duration was computed in minutes, and trips lasting one minute or less (likely a rider mistake or a log-in error) as well as trips over 24 hours (likely an attempt to manually extend a trip) were dropped. Start and end locations were given as latitude/longitude, so they were converted to start and end point geometries. Because some trips lacked lat/long information or fell just outside the community polygons, ~1,600 trips did not receive a start_community; these rows were kept, and their missing community status is noted explicitly in the summaries. In the ACS community dataset, income and race variables were provided as strings containing commas, so the commas were removed and the values cast to numeric types. Community names from the ACS table and the spatial boundaries were converted to uppercase and trimmed of surrounding spaces so that they match the names produced by the spatial join of trips to community areas.

Finally, additional preparation steps connected trips with neighborhoods and demographics. From the start time of each trip, day_of_week, hour_of_day, month, year, and a weekend flag were engineered so that usage patterns could be compared by day of week and hour of day. Trip start and end points were stored as GeoDataFrames and spatially joined to the Chicago community area polygons so that each ride was assigned a start_community and end_community. Community-level census information from the ACS table (total residents, income brackets, race/ethnicity breakdowns) was merged onto communities by community-area name once naming conventions were standardized. Lastly, a community-level summary table was created containing total trips, member trips, casual trips, mean and median ride length, trips per 1,000 residents, income measures (median income, low-income percentage), racial/ethnic measures (majority race/ethnicity, minority percentages), and a diversity index; this table is ready for further neighborhood and demographic analyses.
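
The time-related cleaning and feature engineering for trips can be sketched as below. The toy rows are invented, `started_at` and `ended_at` are assumed to be the timestamp column names, and the exact comparison operators at the one-minute and 24-hour boundaries are assumptions.

```python
import pandas as pd

trips = pd.DataFrame(
    {
        "started_at": ["2024-06-03 08:00:00", "2024-06-08 13:00:00", "bad", "2024-06-04 09:00:00"],
        "ended_at": ["2024-06-03 08:12:00", "2024-06-08 13:00:30", "2024-06-03 09:00:00", "2024-06-06 09:00:00"],
    }
)

# Cast strings to datetimes; unreadable values become NaT and are dropped.
for col in ("started_at", "ended_at"):
    trips[col] = pd.to_datetime(trips[col], errors="coerce")
trips = trips.dropna(subset=["started_at", "ended_at"]).copy()

# Duration in minutes; drop trips of one minute or less and trips over 24 hours.
trips["ride_length_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips = trips[(trips["ride_length_min"] > 1) & (trips["ride_length_min"] <= 24 * 60)].copy()

# Time features derived from the start time.
trips["hour_of_day"] = trips["started_at"].dt.hour
trips["day_of_week"] = trips["started_at"].dt.day_name()
trips["month"] = trips["started_at"].dt.month
trips["year"] = trips["started_at"].dt.year
trips["is_weekend"] = trips["started_at"].dt.dayofweek >= 5
print(trips)
```

Of the four toy rows, only the 12-minute Monday ride survives: one row has an unparseable start time, one lasts 30 seconds, and one lasts 48 hours.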


### Q4: Safety and Mobility
In the Divvy trip data, rides where the end time occurred before the start time were removed, since these represent incorrect or incomplete Divvy trips. Rows with missing latitude or longitude were dropped because they prevent spatial mapping when we overlay this data onto the community boundary areas. The ‘started_at’ and ‘ended_at’ columns were converted into datetime objects, and the additional variables ‘ride_minutes’, ‘date’, ‘hour’, and ‘day_of_week’ were added for potential temporal analysis. Finally, the columns ‘start_point’ and ‘end_point’ were derived from the latitude and longitude.

In the health data, the original CSV contained four metadata rows at the top, which were removed to ensure that data could be read cleanly. Columns were also renamed for better interpretation. The column ‘community_area_name’ was standardized, and the column ‘community_area_number,’ which served as a GEOID, was converted to an integer data type for consistency. Non-analytical columns like the ‘layer’ column were also removed.

In the community boundary data, columns were renamed for better interpretation and clarity. The column ‘community_area_name’ was standardized to match the health dataset. Geometry values, which originally were WKT strings, were converted into shapely geometry objects. The boundaries were then reprojected into a projected coordinate system (EPSG:3435), which would be used for distance and area calculations. Finally, centroid latitude and longitude values were computed for later mapping and visualization.   

All spatial data, such as Divvy-ride coordinates and community boundary polygons, were transformed into the same CRS (EPSG:3435) to standardize them for spatial operations. Next, a spatial join was used to assign each Divvy ride to the community area where the ride started. 
After spatially aligning the rides to each community area, features were aggregated at the community-area level. This included computing Divvy station counts per community area and calculating the number of rides that started in each community area. 
The three datasets were then merged into one unified dataframe (community_stats) containing all variables needed for mapping and modeling.
Finally, variables such as Divvy station counts and total ride counts were heavily right-skewed. We applied log transformations to make visualizations and relationships easier to interpret. These transformations helped stabilize variance and improve the interpretability of scatterplots and comparisons across community areas.
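
A log transformation of this kind can be sketched as follows. The text above does not say which variant was used; `np.log1p` (the log of one plus the value) is an assumption chosen here because it stays defined for community areas with zero stations or rides. The toy counts and the `log_` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy community-level counts with a long right tail.
community_stats = pd.DataFrame(
    {
        "station_count": [0, 3, 12, 85],
        "ride_count": [0, 450, 9_800, 310_000],
    }
)

for col in ("station_count", "ride_count"):
    # log1p(x) = log(1 + x), so zero counts map to 0 instead of -inf.
    community_stats[f"log_{col}"] = np.log1p(community_stats[col])

print(community_stats)
```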


## Exploratory Data Analysis

For each analysis:

* What did you do exactly? How did you solve the problem? Why did you think it would be successful? 

* Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.



### Analysis 1: Access and Equity
*By Ethan Bledsoe*

To understand how Divvy access relates to neighborhood characteristics, I started by comparing block groups with and without stations. I used boxplots to compare median income and percent people of color for areas that have at least one station versus those that have none, and I ran t-tests to see whether the differences in means were statistically meaningful. 
![Boxplot for block groups with or without stations](images/ethan/blockgroupboxplot.png)
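
For reference, a two-sample t statistic for this comparison can be computed as in the sketch below. The text does not state which t-test variant was run; Welch's unequal-variance form is an assumption here, and the income values are toy numbers in thousands of dollars.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

with_station = [90, 100, 110]    # toy median incomes, block groups with a station
without_station = [60, 70, 80]   # toy median incomes, block groups without one
t_stat, dof = welch_t(with_station, without_station)
print(round(t_stat, 3), round(dof, 1))
```

With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.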

I then refined this by looking at station density, grouping block groups into four bins (0, 1, 2–3, and 4+ stations) and making boxplots of income and percent people of color across these bins. A simple summary table showed that income generally increases, and the share of people of color decreases, as the number of stations rises.
![Station density boxplot](images/ethan/divvystationdensity.png)

Next, I treated the number of stations as a numeric variable and used scatterplots with regression lines to examine how station counts relate to median income and percent people of color. I calculated Pearson correlation coefficients to quantify these relationships. 

![Regression line for station counts vs. income and percent POC](images/ethan/regressionlines.png)
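
The correlation and fitted line behind such a plot can be computed as in this sketch. The helper name `pearson_and_line` is hypothetical, and the toy data lie exactly on a line purely to make the result easy to check.

```python
import numpy as np

def pearson_and_line(x, y):
    """Return Pearson's r plus the slope and intercept of a least-squares line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, deg=1)
    return r, slope, intercept

n_stations = [0, 1, 2, 3, 4]
med_income = [60_000, 68_000, 76_000, 84_000, 92_000]  # toy values on an exact line
r, slope, intercept = pearson_and_line(n_stations, med_income)
print(r, slope, intercept)
```

Because most block groups have zero stations, real data will sit far from this idealized line, so r should be read alongside the scatterplot rather than on its own.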

Finally, I created choropleth maps of percent people of color and median income across block groups and overlaid Divvy station locations. These maps showed visually that stations are concentrated in and around the Loop and the North Side, which tend to be higher-income and less heavily composed of people of color than many South and West Side areas.
![Percent POC Choropleth](images/ethan/pocchoropleth.png)
![Median Income Choropleth](images/ethan/incomechoropleth.png)

### Analysis 2: Equity Over Time
*By Junho Hong*

To understand how Divvy's expansion affected equity across Chicago neighborhoods, I analyzed station distribution and ridership patterns across three distinct phases: Phase 1 (2014-2017) representing the initial system, Phase 2 (2018-2022) capturing equity-focused expansion, and Phase 3 (2023-2025) reflecting recent growth.

**Station Distribution Analysis:**  
I first examined how station placement evolved across demographic groups. Using stacked bar charts, I visualized the percentage distribution of stations by race category at the end of each phase (cumulative) and compared this with the distribution of newly added stations in Phases 2 and 3 (incremental). Census tracts were categorized by plurality race (Predominantly White, Black, Hispanic, or Asian) based on 2017, 2022, and 2023 ACS data matched to their respective time periods.

![Station Distribution Charts](images/junho/station_distribution.png)

The analysis revealed that Phase 1 was heavily concentrated in Predominantly White areas (57.9% of stations), but expansion became progressively more equitable. By Phase 3, new stations were distributed nearly equally: 33.2% in Hispanic areas, 32.8% in White areas, and 31.9% in Black areas.

**Spatial Station Expansion:**  
To visualize geographic patterns, I created three choropleth maps showing census tracts colored by race category with station locations overlaid as points. Each map represents cumulative stations at the end of each phase, allowing visual tracking of how the system expanded from the downtown core and North Side into previously underserved South and West Side neighborhoods.

![Expansion Maps](images/junho/expansion_maps.png)

**Ridership Temporal Trends:**  
Next, I analyzed ridership patterns over time using dual line charts. The left chart shows total rides by race category from 2014-2025, revealing that Predominantly White areas consistently accounted for 85-92% of all rides despite representing only ~47% of stations by Phase 3. The right chart displays per capita ridership (rides per person per year), which provides a normalized view accounting for population differences.

![Ridership Trends](images/junho/ridership_trends.png)

The per capita analysis revealed persistent and severe disparities: by Phase 3, Predominantly White areas averaged 1.19 rides per person annually, while Black areas averaged only 0.14 (88% lower) and Hispanic areas 0.08 (93% lower).
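
The "percent lower" figures follow directly from the per capita rates quoted above, as this short check shows. The helper name `pct_lower` is hypothetical.

```python
def pct_lower(value: float, reference: float) -> float:
    """Percent by which `value` falls below `reference`."""
    return (reference - value) / reference * 100

white, black, hispanic = 1.19, 0.14, 0.08  # Phase 3 rides per person per year

print(round(pct_lower(black, white)))     # 88
print(round(pct_lower(hispanic, white)))  # 93
```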

**Ridership Volume with Demographic Context:**  
To visualize the relationship between ridership patterns and neighborhood demographics, I created maps showing census tracts colored by race category with ridership volume overlaid as red circles scaled by ride counts. These maps demonstrate the stark spatial inequality: large circles (high ridership) cluster almost exclusively in Predominantly White (blue) North Side areas, while Predominantly Black (orange) and Hispanic (light blue) South and West Side neighborhoods show minimal ridership despite receiving stations in later phases.

![Ridership Volume by Demographics](images/junho/ridership_demographics_overlay.png)

**Per Capita Comparison Across Phases:**  
I used grouped bar charts to compare per capita ridership across race categories for all three phases. This visualization made the equity gap explicit: while Predominantly White areas showed slight growth from Phase 1 to Phase 3 (1.29 → 1.41 rides per capita), Black areas declined from 0.20 to 0.11, and Hispanic areas dropped from 0.17 to 0.07.

![Per Capita Comparison](images/junho/per_capita_comparison.png)

**Census Tract Growth Analysis:**  
Finally, I examined which specific census tracts experienced the most significant ridership changes. A growth map colored tracts on a red-yellow-green scale representing percentage change from Phase 1 to Phase 3. Green areas (growth) clustered primarily in already well-served North Side neighborhoods, while many South and West Side tracts showed stagnation or decline despite receiving new stations.

![Growth Map](images/junho/growth_map.png)

The top 10 tracts with the largest absolute ridership gains were all Predominantly White, Very High Income areas, averaging +3,865 rides per tract compared to +628 in Black areas and +845 in Hispanic areas. This analysis demonstrated that while percentage growth rates were similar across demographic groups, absolute ridership gains widened the equity gap.

### Analysis 3: Membership & Frequency
*By Eduardo Sourd*

The first exploratory analysis tested whether ride lengths differ between annual members and casual riders. Using the cleaned trip data, the length of each trip in minutes was computed, and kernel density curves for members and casuals were overlaid on one plot. A density plot was chosen because it shows the entire distribution rather than only the averages. The two distributions have similar shapes: both are right-skewed, but member rides are concentrated slightly more around shorter durations, while casual rides run somewhat longer and have a heavier tail on the long-ride side. Density plots are a common way to compare groups on a continuous variable, and they make it easy to see how often shorter and longer trips occur.
![Ride length density plot](images/eduardo/distribution-of-ride-lengths-by-rider-type.png)

The second exploratory analysis examined when Divvy bikes are used throughout the day and week. The trip start time was used to create hour_of_day and day_of_week. Trips were counted by hour and rider type and drawn as a line plot of trips per hour for members and casuals. Member rides spike at 8am and 5pm, while casual rides peak around 12pm and then trend downward. This is consistent with casual riders being more focused on recreation, since midday is when recreational options are open, whereas members appear to ride mainly as part of a work commute. To assess hour and weekday patterns together, trips were aggregated by hour_of_day and day_of_week for each rider type and pivoted into matrices, which were displayed as two side-by-side heatmaps. The heatmaps show that member ridership is concentrated in weekday commuting hours, while casual activity is concentrated in afternoons and weekends. This is a stronger indicator of recreational use, because riders who relied on the bikes for routine transportation would be expected to ride consistently across the week. Line plots and heatmaps are compact representations that show clear patterns over time, and many Divvy and bike-share analyses use similar visualizations.
![Trips by hour line plot](images/eduardo/trips-by-hour-of-day-and-rider-type.png)
![Trips by hour and day heatmaps](images/eduardo/heatmap-trips-by-hour-and-day.png)
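
The hour-by-day matrices behind such heatmaps can be built as in the sketch below. The helper name `hour_day_matrix`, the `DAY_ORDER` constant, and the toy trips are hypothetical; the column names follow the variables described above.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def hour_day_matrix(trips: pd.DataFrame, rider_type: str) -> pd.DataFrame:
    """Count trips per (day_of_week, hour_of_day) cell for one rider type."""
    subset = trips[trips["member_casual"] == rider_type]
    matrix = pd.crosstab(subset["day_of_week"], subset["hour_of_day"])
    # Force a full 7 x 24 grid so both heatmaps share the same axes.
    return matrix.reindex(index=DAY_ORDER, columns=range(24), fill_value=0)

trips = pd.DataFrame(
    {
        "member_casual": ["member", "member", "casual", "member"],
        "day_of_week": ["Monday", "Monday", "Saturday", "Tuesday"],
        "hour_of_day": [8, 8, 13, 17],
    }
)
member_matrix = hour_day_matrix(trips, "member")
casual_matrix = hour_day_matrix(trips, "casual")
print(member_matrix.loc["Monday", 8], casual_matrix.loc["Saturday", 13])
```

Each matrix can then be passed to a heatmap function; reindexing to a fixed grid keeps empty hours visible as zeros instead of silently dropping them.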

The third analysis compared where trips start with community demographic patterns, testing whether member use is higher in certain neighborhoods. Trip start coordinates were converted to spatial points and joined to Chicago community area polygons so that each trip was assigned a starting community. Trip data were then aggregated to the community level to compute total number of trips, member share, number of trips per 1,000 residents, income bracket counts, race counts, and derived diversity metrics (ACS 5-year data). These community-level attributes were used to build scatter plots with member share on the y-axis; the x-axis variables were the percentage of residents in the highest income bracket, percentage Asian, trips per 1,000 residents, and an income diversity index (derived from ACS data), with bubble size corresponding to total trips. Simple linear trend lines were fitted, and the trendline equation and r value summarize the strength and direction of each association. This method directly addresses whether higher-income, more Asian, or higher-usage communities have a higher share of member riders; the association appears in some areas but not others. Scatter plots with fitted lines are also a common way to assess associations.
![Member share scatter plots](images/eduardo/correlation-with-high-member-share.png)
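
As a rough sketch of the community-level step, the example below computes member share, trips per 1,000 residents, and a Pearson r with a fitted trend line. It assumes the spatial join has already tagged each trip with a starting community. All column names and values are hypothetical, and `np.corrcoef` and `np.polyfit` stand in for whichever trend line tool was actually used.

```python
import numpy as np
import pandas as pd

# Toy trips already tagged with a starting community (spatial join assumed done).
trips = pd.DataFrame({
    "community": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "rider_type": ["member", "member", "member", "casual",
                   "member", "casual",
                   "casual", "casual", "casual", "member"],
})

# Hypothetical community attributes standing in for the ACS table.
acs = pd.DataFrame({
    "community": ["A", "B", "C"],
    "population": [2000, 1000, 4000],
    "pct_high_income": [40.0, 25.0, 10.0],
})

# Aggregate trips to the community level and attach demographics.
summary = (
    trips.assign(is_member=trips["rider_type"].eq("member"))
    .groupby("community")
    .agg(total_trips=("is_member", "size"), member_share=("is_member", "mean"))
    .reset_index()
    .merge(acs, on="community")
)
summary["trips_per_1000"] = summary["total_trips"] / summary["population"] * 1000

# Pearson r and a simple linear trend line for one scatter plot.
r = np.corrcoef(summary["pct_high_income"], summary["member_share"])[0, 1]
slope, intercept = np.polyfit(summary["pct_high_income"], summary["member_share"], 1)

print(summary)
print(f"r = {r:.3f}, member_share = {slope:.4f} * pct_high_income + {intercept:.4f}")
```

With only a handful of communities, a single outlier can dominate r, so the trend line is best read alongside the scatter plot and not on its own.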

### Analysis 4: Safety and Mobility
*By Isabella Yan (Chisa)*

This analysis examines how traffic crashes, adult physical inactivity, and Divvy station accessibility relate across Chicago’s community areas, and whether neighborhoods with higher physical inactivity experience different crash rates or different levels of access to bike-share infrastructure. To analyze across datasets, we transformed the Divvy ride coordinates and the Chicago community boundaries into a consistent coordinate reference system, EPSG:3435. After this transformation, our key metrics, the Divvy station count and the ride volume in each community area, could be calculated at the community level to align with existing inactivity rates and traffic crash metrics.

After spatially joining the data into one frame, we aggregated Divvy ride activity by community area. We chose the number of rides beginning in each neighborhood and the number of Divvy stations located there because the ride count reflects actual usage of Divvy stations and the station count reflects infrastructure access. Together, these variables describe Divvy station access, and aggregating by community area lets us compare community areas on physical inactivity, traffic crashes, and Divvy station infrastructure.

Below are the log-transformed plots. Because Divvy station counts and ride totals were both heavily right-skewed, we log-transformed them before analysis.

![Log transformed plots](./images/chisa/logtransformed.png)
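
A minimal sketch of this kind of transformation is shown below. The text does not say which logarithm was applied, so the use of `np.log1p` (which computes log(1 + x) and therefore tolerates community areas with zero stations) is an assumption, and the counts are made-up illustrative values.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed station counts; one area has far more stations than the rest.
counts = pd.Series([0, 1, 2, 3, 5, 8, 150], name="divvy_station_count")

# log1p computes log(1 + x), so zero counts map to 0 instead of negative infinity.
logged = np.log1p(counts)

# Sample skewness before and after the transform.
print(f"raw skew:    {counts.skew():.2f}")
print(f"logged skew: {logged.skew():.2f}")
```

The transformed values remain ordered the same way as the raw counts, so rankings of community areas are unchanged; only the spacing between them is compressed.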

Next, we visualized the pairwise relationships among the variables. The scatterplots below show the relationships between physical inactivity and traffic crashes, physical inactivity and Divvy station counts, and Divvy station counts and traffic crashes. These scatterplots let us see whether communities with higher inactivity tended to have fewer Divvy stations, which may imply that limited access is related to inactivity, or different crash patterns, which may imply that mobility environments differ between community areas. The scatterplots also highlight the direction, strength, and form of each relationship, including nonlinear trends and the influence of outliers.

![Inactivity vs. Crashes](./images/chisa/inactivityvscrashes.png)

![Inactivity vs. Divvy count](./images/chisa/inactivityvscount.png)

![Divvy count vs. crashes](./images/chisa/countvscrashes.png)

Now, let’s look at the correlation table, which supplements the scatterplots by quantifying each pairwise relationship. Physical inactivity vs. traffic crashes has a correlation of -0.177, indicating a very weak negative relationship. Physical inactivity vs. Divvy station count has a correlation of -0.468, indicating a moderate negative relationship. Finally, Divvy station count vs. traffic crashes has a correlation of 0.644, indicating a moderate-to-strong positive relationship.

|                               | physical_inactivity_rate | traffic_crashes | divvy_station_count | total_rides_starting_here |
| ----------------------------- | ------------------------ | --------------- | ------------------- | ------------------------- |
| **physical_inactivity_rate**  | 1.000000                 | -0.177478       | -0.468223           | -0.490147                 |
| **traffic_crashes**           | -0.177478                | 1.000000        | 0.643516            | 0.670118                  |
| **divvy_station_count**       | -0.468223                | 0.643516        | 1.000000            | 0.974841                  |
| **total_rides_starting_here** | -0.490147                | 0.670118        | 0.974841            | 1.000000                  |
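
A table of this shape can be produced with pandas' `DataFrame.corr`. The sketch below uses invented numbers for five hypothetical community areas, so its output will not match the values above. Whether the reported correlations were computed on raw or log-transformed values is not stated, and raw values are used here.

```python
import pandas as pd

# Invented values for five hypothetical community areas (not the project's data).
df = pd.DataFrame({
    "physical_inactivity_rate": [18.0, 22.5, 30.1, 27.4, 35.2],
    "traffic_crashes": [900, 400, 350, 500, 200],
    "divvy_station_count": [40, 12, 5, 9, 2],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr(method="pearson")
print(corr.round(3))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.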

Neighborhoods with higher Divvy station volume and ridership are clustered mainly in the areas with higher traffic crash counts, which confirms the earlier finding that community areas with high Divvy station volume also tend to be high-activity, high-traffic zones. The presence of areas with high physical inactivity and low Divvy access is consistent with the moderate negative correlation between inactivity and station count. These visualizations suggest that there may be spatial divides in Chicago, where central and transit-rich neighborhoods have higher physical activity and more Divvy station access.


## Challenges and reflection


### Q1: Access and Equity
The main challenges in this project were geospatial and cleaning-related. At first, my spatial joins failed because the Divvy stations and block-group polygons were stored in different coordinate systems, so I had to learn how to reproject everything consistently and verify that the joins were working. I also had to deal with odd ACS income values and a very skewed income distribution, which led me to explore and apply the IQR rule for outlier removal. Working through these issues taught me how sensitive geospatial analysis is to careful preprocessing, and how small choices in cleaning and merging can change the story the data tells. If I repeated the project, I would bring in ridership data sooner so I could study both where stations are and how they are actually being used.
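
For reference, the IQR rule mentioned above can be sketched as follows. The income values are invented, and the conventional 1.5 × IQR fences are an assumption about the cutoff, which the text does not specify.

```python
import pandas as pd

# Invented block-group incomes with one extreme value.
income = pd.Series(
    [42000, 55000, 61000, 58000, 49000, 250000],
    name="median_household_income",
)

# Quartiles and interquartile range.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles (assumed cutoff).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = income[income.between(lower, upper)]

print(f"fences: [{lower:.0f}, {upper:.0f}]")
print(kept.tolist())
```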

### Q2: Equity Over Time
(junho)

### Q3: Membership & Frequency
This was more complicated than I expected, less because of the analysis than because of the data preparation. Parsing timestamps, filtering bad trips, and spatially joining trips to community areas and ACS data took multiple attempts to get working, especially because of mismatched community names and comma-formatted numbers. I learned a lot about spatial joins and data cleaning, especially when working with large datasets. If I were to start over, I would focus on cleaning and preparing the data before jumping into the analysis, as that took the most time.
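
The two cleaning issues named above, mismatched community names and comma-formatted numbers, can be handled along these lines. This is only a sketch: the column names, the `community_key` helper column, and the population figures are all hypothetical.

```python
import pandas as pd

# Hypothetical ACS-style rows with inconsistent names and comma-formatted numbers.
acs = pd.DataFrame({
    "community": ["  Loop ", "NEAR NORTH SIDE", "Hyde park"],
    "population": ["42,298", "105,481", "29,456"],
})

# Normalize names into a join key: trim whitespace and use one letter case.
acs["community_key"] = acs["community"].str.strip().str.upper()

# Strip thousands separators, then convert the strings to integers.
acs["population"] = acs["population"].str.replace(",", "", regex=False).astype(int)

print(acs)
```

Applying the same key normalization to the community area boundaries lets the two tables be merged on `community_key` without silent mismatches.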


### Q4: Safety and Mobility
- Messy data for spatial join was hard to clean.
- The spatial join didn’t work at first; everything had to be reprojected to EPSG:3435 and names had to be standardized.
- Initial values were too right-skewed and had to be log-transformed.
- Learned about geospatial preprocessing and how data cleaning determines analysis quality.


## Conclusions


### Q1: Access and Equity
Overall, my analysis suggests that Divvy stations are not evenly distributed across Chicago. Block groups with more stations tend to have higher median household income and a lower share of people of color than block groups with few or no stations. This pattern appears consistently in summaries by station presence, in station-density bins, in correlations with station counts, and on city maps. Together, these results indicate that Divvy infrastructure is more built out in central, higher-income, and whiter neighborhoods, while many outer or less advantaged areas have lower access to the system.

### Q2: Equity Over Time

Divvy's expansion became progressively more equitable in geographic station distribution, but this improved access did not translate into equitable usage. 

Phase 1 (2014-2017) concentrated 57.9% of stations in Predominantly White areas, but by Phase 3 (2023-2025), new stations were distributed nearly equally across racial groups (33% each for White, Black, and Hispanic areas). Despite this geographic improvement, per capita ridership disparities remained severe: White areas averaged 1.19 rides per person in Phase 3, while Black areas averaged only 0.14 (88% lower) and Hispanic areas 0.08 (93% lower). 
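
The percentage gaps quoted above follow directly from the per capita figures, as this short check shows. The helper name `pct_lower` is introduced here only for illustration.

```python
def pct_lower(reference: float, value: float) -> float:
    """Percent by which `value` falls below `reference`."""
    return (reference - value) / reference * 100


# Phase 3 rides per person, taken from the paragraph above.
white, black, hispanic = 1.19, 0.14, 0.08

print(round(pct_lower(white, black)))     # 88
print(round(pct_lower(white, hispanic)))  # 93
```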

The top 10 census tracts by absolute ridership growth were all Predominantly White, Very High Income areas, gaining an average of +3,865 rides compared to +628 in Black areas and +845 in Hispanic areas. This pattern indicates that while station placement became more equitable, usage patterns widened the equity gap. Physical infrastructure alone is insufficient; barriers such as affordability, safety concerns, and inadequate bike infrastructure in underserved neighborhoods must be addressed to achieve true equity.

### Q3: Membership & Frequency
This project showed how annual members and casual riders use Divvy differently when their trips are compared by trip length, time of day, neighborhood, and ACS data. Cleaning and merging the trip, spatial, and ACS data showed that most of the work in this type of project goes into making the data reliable enough to answer even the most straightforward questions. Most interestingly, the final visualizations told a consistent story: members account for the majority of weekday rush-hour rides and are concentrated in higher-income, higher-use communities, while casual riders cluster around leisure hours and leisure-oriented neighborhoods. Ultimately, this project was an insightful experience in building a pipeline from scratch, from disparate files to spatially joined, community-level assessments, and it showed how even simple descriptive analytics can highlight significant inequities and opportunities in a real-life transit system.

### Q4: Safety and Mobility
Overall, the analysis indicates that Chicago community areas with higher physical inactivity tend to have fewer Divvy stations and lower Divvy usage, which suggests limited access to active transportation options. Furthermore, community areas with more Divvy access experience higher counts of traffic crashes, which likely reflects the placement of Divvy stations in denser or more active parts of the city. These patterns were consistent across scatter plots, correlations, and choropleth maps. They highlight a spatial divide: more central parts of the city have stronger Divvy infrastructure and higher mobility, while areas outside the center tend to face higher inactivity levels and less access to Divvy infrastructure.


## Recommendations to stakeholder(s)


### Q1: Access and Equity
My recommendations are aimed at a City of Chicago transportation planner or Divvy operations planner deciding where to expand the system. The results suggest that the city should treat block groups with low station counts, lower incomes, and higher shares of people of color as priority areas for new stations if the goal is to improve equity in access to bike-share. Because this analysis relies on a single snapshot of ACS and station data, planners should repeat a similar analysis with updated data and combine it with ridership information and community input before making major investment decisions.

### Q2: Equity Over Time

Based on findings that geographic expansion failed to close usage gaps, I recommend:

1. **Invest in protected bike lanes** in South and West Side neighborhoods to address safety concerns that likely deter ridership despite station availability.

2. **Partner with community organizations** for culturally relevant outreach in Predominantly Black and Hispanic areas, as usage patterns suggest barriers beyond physical access.

3. **Track per capita usage metrics by demographic group**, not just station counts, to identify widening disparities that current equity reporting may mask.

**Limitations:** This analysis uses ACS demographic snapshots that may not capture rapid neighborhood change. Census tract racial classifications shifted between 2017 and 2023, complicating longitudinal comparisons. Additionally, 20% of historical rides were excluded due to missing station data. Stakeholders should repeat this analysis with 2024-2025 data and conduct qualitative research (surveys, focus groups) to validate barriers like affordability and safety concerns.

### Q3: Membership & Frequency
These recommendations are intended for the Divvy operations and marketing teams. Because members account for most ridership during weekday commutes and are clustered in high-income, high-usage areas, Divvy should target communities with high usage but low member share for membership drives and employer partnerships that convert casual riders into members. At the same time, rebalancing and capacity planning should prioritize commuter peaks in member-heavy areas and afternoon and weekend demand in areas popular with casual and tourist riders. Because these findings rest on a single month of data and on correlations, they are suitable for pilots and follow-up experiments, not for sweeping or irreversible changes.

### Q4: Safety and Mobility
These recommendations are for a City of Chicago transportation planner deciding where to add new Divvy stations and how to reduce physical inactivity rates. The city should expand Divvy station coverage in community areas with high physical inactivity and limited Divvy access, prioritizing neighborhoods with low station density, low ridership, and higher inactivity rates. Furthermore, places with higher rates of traffic crashes should receive stricter traffic regulations or infrastructure that supports their higher volume of Divvy use, such as dedicated bike lanes. However, the data cover only 2024 and may not reflect current Divvy trends. Traffic crashes are also aggregated across all severities, which gives minor and severe crashes equal weight. An analysis with updated data and more detailed data on traffic crashes and Divvy station access should be performed before taking any major action.

## References {-}

[1] City of Chicago. *Divvy Bicycle Stations*. Chicago Data Portal.  
    Available at: https://data.cityofchicago.org/Transportation/Divvy-Bicycle-Stations-bbyy-e7gq

[2] Lyft / Divvy. *Divvy System Data*. Trip data for Chicago bike-share system.  
    Available at: https://divvybikes.com/system-data

[3] Chicago Health Atlas. *Adult Physical Inactivity Indicator (HCSPAP)*.  
    Available at: https://chicagohealthatlas.org/indicators/HCSPAP

[4] City of Chicago. *Boundaries – Community Areas (Map)*. Chicago Data Portal.  
    Available at: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-Map/cauq-8yn6

[5] U.S. Census Bureau. *American Community Survey 5-Year Estimates, 2023 (ACS 5-Year)*.  
    Retrieved via Census API and the Python `census` package.  
    Available at: https://www.census.gov/programs-surveys/acs

[6] U.S. Census Bureau. *TIGER/Line Shapefiles, 2023, State-County-Block Group, Illinois (tl_2023_17_bg)*.  
    Available at: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.2023.html

[7] City of Chicago. *ACS 5-Year Data by Community Area*. Chicago Data Portal.
    Available at: https://data.cityofchicago.org/Community-Economic-Development/ACS-5-Year-Data-by-Community-Area/t68z-cikk/about_data

## AI Tools and Assistance (Disclosure — for course purposes) {-}


### Q1: Access and Equity
I used ChatGPT as a support tool while working on this project. I mainly used it to clarify how to use Python libraries such as geopandas and census and to polish the wording of my write-up.

### Q2: Equity Over Time
I used Claude to help polish the wording of my write-up.

### Q3: Membership & Frequency
I used GitHub Copilot line recommendations as a support tool while working on this project to brainstorm ideas and debug code. I also used it to polish the wording of my write-up.

### Q4: Safety and Mobility
I used ChatGPT to help with reprojecting the community boundary data and the Divvy station data to EPSG:3435 and performing the spatial join.