# \[STAT-315\] Bikeshare Insights Data Analysis Project

#### Defining the questions
We begin with three questions:
1. Can we predict which casual riders are most likely to benefit from a membership, enabling targeted promotional strategies?
2. Are there seasonal or temporal patterns in ridership behavior that could be used to optimize station placement and the allocation of bikes across stations?
3. Which factors (such as membership status, trip length, bike type, day of week, or station location) most strongly influence whether a rider chooses an electric versus a classic bike, and how much do these factors impact overall demand?


#### Data collection

To answer these questions, we use the Divvy dataset from our earlier mini-project. Run the following cell to download the Divvy trip data for 2023 and merge it into a single file, which is stored in `./data/`.

In [1]:
!python combine.py

Downloading and unzipping all Divvy .csv files.: 100%|█| 12/12 [00:24<00:00,  2.
Reading all Divvy .csv files.: 100%|████████████| 12/12 [00:10<00:00,  1.13it/s]
Creating concatenated .csv file.
Successfully created merged .csv file. Path is ./data/2023-divvy-tripdata.csv


#### Data cleaning and preparation
We then prepare the data for analysis by removing anomalous rows and adding derived features.

In [3]:
# required imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [4]:
# loading data
divvy_df = pd.read_csv("./data/2023-divvy-tripdata.csv", index_col=0)
divvy_df = divvy_df.reset_index(drop=True)

In [5]:
# get rid of abnormally long or short ride times (<1 minute or >2 hours)
divvy_df["started_at"] = pd.to_datetime(divvy_df["started_at"])
divvy_df["ended_at"] = pd.to_datetime(divvy_df["ended_at"])

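# total_seconds() does not depend on the timedelta's internal resolution,
# unlike casting to int64 and dividing by nanoseconds per minute
divvy_df["ride_duration_min"] = (divvy_df["ended_at"] - divvy_df["started_at"]).dt.total_seconds() / 60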

divvy_df = divvy_df[(1 <= divvy_df["ride_duration_min"]) & (divvy_df["ride_duration_min"] <= 120)]
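
As a quick check of the duration filter, the sketch below applies the same bounds to three made-up rides (30 seconds, 10 minutes, and 3 hours). The timestamps and the names `toy` and `kept` are illustrative only and are not part of the Divvy data; the conversion to minutes here uses `.dt.total_seconds()`.

```python
import pandas as pd

# Toy rides (made-up timestamps): 30 seconds, 10 minutes, and 3 hours long.
toy = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-01-01 08:00:00"] * 3),
    "ended_at": pd.to_datetime([
        "2023-01-01 08:00:30",
        "2023-01-01 08:10:00",
        "2023-01-01 11:00:00",
    ]),
})

# Subtracting datetimes yields timedeltas; convert them to minutes.
toy["ride_duration_min"] = (toy["ended_at"] - toy["started_at"]).dt.total_seconds() / 60

# Same inclusive bounds as the cell above: keep rides of 1 to 120 minutes.
kept = toy[toy["ride_duration_min"].between(1, 120)]
print(kept)
```

Only the 10-minute ride survives; rides of exactly 1 or exactly 120 minutes would also be kept, because both bounds are inclusive.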

In [6]:
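# compute the great-circle distance between start and end points,
# then drop implausible trips (<= 0.1 km, which includes round trips, or > 15 km)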
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in kilometers
    lat1_rad, lon1_rad = np.radians(lat1), np.radians(lon1)
    lat2_rad, lon2_rad = np.radians(lat2), np.radians(lon2)
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

divvy_df["distance_km"] = haversine(
    divvy_df["start_lat"],
    divvy_df["start_lng"],
    divvy_df["end_lat"],
    divvy_df["end_lng"]
)

divvy_df = divvy_df[(divvy_df["distance_km"] <= 15) & (divvy_df["distance_km"] > 0.1)]
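
The haversine formula can be sanity-checked against a distance known in closed form: one degree of latitude along a meridian is R·π/180 km. The sketch below re-implements the same formula under the hypothetical name `haversine_km` so that it runs on its own; the coordinates are arbitrary test points.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km, using the same formula as the cell above."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian is radius * pi / 180 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
expected = 6371.0 * np.pi / 180  # roughly 111.19 km

# Identical points are 0 km apart, so such trips fail the `> 0.1` filter.
same_point = haversine_km(41.9, -87.6, 41.9, -87.6)

# The function is vectorized: arrays (or pandas columns) are handled elementwise.
batch = haversine_km(
    np.array([0.0, 0.0]), np.array([0.0, 0.0]),
    np.array([1.0, 0.0]), np.array([0.0, 1.0]),
)
print(one_degree, expected, same_point, batch)
```

Because the distance is measured in a straight line between endpoints, it is a lower bound on the distance actually ridden, which is worth keeping in mind when interpreting any speed derived from it.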

In [7]:
# adds season to dataframe
def season(month: int):
    if month in [1, 2, 12]:
        return "winter"
    elif month in [3, 4, 5]:
        return "spring"
    elif month in [6, 7, 8]:
        return "summer"
    else:
        return "fall"
    
divvy_df["season"] = divvy_df["started_at"].dt.month.apply(season)
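
On a dataframe with millions of rows, a dictionary lookup via `Series.map` may be faster than calling a Python function per row with `.apply`. The sketch below shows that alternative on made-up timestamps; `MONTH_TO_SEASON` is a hypothetical name, and the mapping mirrors the `season` function above.

```python
import pandas as pd

# Month-to-season lookup mirroring the `season` function above.
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

# Toy timestamps (made up for illustration), covering every season.
started_at = pd.Series(pd.to_datetime(
    ["2023-01-15", "2023-04-15", "2023-07-15", "2023-10-15", "2023-12-15"]
))

# Extract the month number, then look each one up in the dictionary.
seasons = started_at.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())
```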

In [8]:
# add average speed (km/h) and drop rides averaging 2 km/h or less;
# these slow rides have significantly longer ride times (~1 hr)
divvy_df["avg_velocity_km_per_hr"] = divvy_df["distance_km"] / (divvy_df["ride_duration_min"] / 60)
divvy_df = divvy_df[divvy_df["avg_velocity_km_per_hr"] > 2]

In [9]:
divvy_df.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_duration_min,distance_km,season,avg_velocity_km_per_hr
0,F96D5A74A3E41399,electric_bike,2023-01-21 20:05:42,2023-01-21 20:16:33,Lincoln Ave & Fullerton Ave,TA1309000058,Hampden Ct & Diversey Ave,202480.0,41.924074,-87.646278,41.93,-87.64,member,10.85,0.839042,winter,4.639863
1,13CB7EB698CEDB88,classic_bike,2023-01-10 15:37:36,2023-01-10 15:46:05,Kimbark Ave & 53rd St,TA1309000037,Greenwood Ave & 47th St,TA1308000002,41.799568,-87.594747,41.809835,-87.599383,member,8.483333,1.204573,winter,8.519576
2,BD88A2E670661CE5,electric_bike,2023-01-02 07:51:57,2023-01-02 08:05:11,Western Ave & Lunt Ave,RP-005,Valli Produce - Evanston Plaza,599,42.008571,-87.690483,42.039742,-87.699413,casual,13.233333,3.543683,winter,16.067074
3,C90792D034FED968,classic_bike,2023-01-22 10:52:58,2023-01-22 11:01:44,Kimbark Ave & 53rd St,TA1309000037,Greenwood Ave & 47th St,TA1308000002,41.799568,-87.594747,41.809835,-87.599383,member,8.766667,1.204573,winter,8.244229
4,3397017529188E8A,classic_bike,2023-01-12 13:58:01,2023-01-12 14:13:20,Kimbark Ave & 53rd St,TA1309000037,Greenwood Ave & 47th St,TA1308000002,41.799568,-87.594747,41.809835,-87.599383,member,15.316667,1.204573,winter,4.718677


In [17]:
divvy_df = divvy_df.dropna()
divvy_df.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name        0
start_station_id          0
end_station_name          0
end_station_id            0
start_lat                 0
start_lng                 0
end_lat                   0
end_lng                   0
member_casual             0
ride_duration_min         0
distance_km               0
season                    0
avg_velocity_km_per_hr    0
dtype: int64

In [37]:
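# find stations whose name is paired with more than one id, as a start or end station:
# map each name to its most frequent id, flag rides whose id differs from it,
# then drop every ride that starts or ends at a flagged station name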
stations = pd.concat([divvy_df[["start_station_name", "start_station_id"]].rename(columns={"start_station_name": "station_name", "start_station_id": "station_id"}),
                      divvy_df[["end_station_name", "end_station_id"]].rename(columns={"end_station_name": "station_name", "end_station_id": "station_id"})],
                      ignore_index=True)

name_to_id = (
    stations.groupby("station_name")["station_id"]
      .agg(lambda x: x.value_counts().idxmax())
)

mismatched_start_stations = divvy_df["start_station_id"] != divvy_df["start_station_name"].map(name_to_id)
mismatched_end_stations = divvy_df["end_station_id"] != divvy_df["end_station_name"].map(name_to_id)

suspicious_stations = pd.concat([
    divvy_df.loc[mismatched_start_stations, "start_station_name"].drop_duplicates(),
    divvy_df.loc[mismatched_end_stations, "end_station_name"].drop_duplicates(),
])
divvy_df = divvy_df[
    ~divvy_df["start_station_name"].isin(suspicious_stations)
    & ~divvy_df["end_station_name"].isin(suspicious_stations)
]
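
To make the station check concrete, the sketch below runs the same steps on five made-up rides in which station "A" appears under two ids. All station names and ids here are hypothetical, and `rides`, `suspicious`, and `clean` are illustrative names.

```python
import pandas as pd

# Toy rides (hypothetical names and ids): station "A" appears under two ids.
rides = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "C"],
    "start_station_id":   ["a1", "a1", "a2", "b1", "c1"],
    "end_station_name":   ["B", "C", "B", "C", "B"],
    "end_station_id":     ["b1", "c1", "b1", "c1", "b1"],
})

# Stack start and end stations into one (name, id) table.
stations = pd.concat([
    rides[["start_station_name", "start_station_id"]].rename(
        columns={"start_station_name": "station_name", "start_station_id": "station_id"}),
    rides[["end_station_name", "end_station_id"]].rename(
        columns={"end_station_name": "station_name", "end_station_id": "station_id"}),
], ignore_index=True)

# For each name, take the most frequent id as the reference id.
name_to_id = stations.groupby("station_name")["station_id"].agg(
    lambda ids: ids.value_counts().idxmax()
)

# Flag rides whose recorded id differs from the reference id for that name.
bad_start = rides["start_station_id"] != rides["start_station_name"].map(name_to_id)
bad_end = rides["end_station_id"] != rides["end_station_name"].map(name_to_id)

# Collect the flagged names and drop every ride that touches one of them.
suspicious = pd.concat([
    rides.loc[bad_start, "start_station_name"],
    rides.loc[bad_end, "end_station_name"],
]).drop_duplicates()
clean = rides[
    ~rides["start_station_name"].isin(suspicious)
    & ~rides["end_station_name"].isin(suspicious)
]
print(clean)
```

Only one ride carries the minority id `a2`, yet all three rides starting at "A" are removed. The filter is deliberately conservative in this sense: it discards every ride touching a flagged station name, not just the mismatched rows.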

In [None]:
divvy_df.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_duration_min,distance_km,season,avg_velocity_km_per_hr
0,F96D5A74A3E41399,electric_bike,2023-01-21 20:05:42,2023-01-21 20:16:33,Lincoln Ave & Fullerton Ave,TA1309000058,Hampden Ct & Diversey Ave,202480.0,41.924074,-87.646278,41.93,-87.64,member,10.85,0.839042,winter,4.639863
1,13CB7EB698CEDB88,classic_bike,2023-01-10 15:37:36,2023-01-10 15:46:05,Kimbark Ave & 53rd St,TA1309000037,Greenwood Ave & 47th St,TA1308000002,41.799568,-87.594747,41.809835,-87.599383,member,8.483333,1.204573,winter,8.519576
2,BD88A2E670661CE5,electric_bike,2023-01-02 07:51:57,2023-01-02 08:05:11,Western Ave & Lunt Ave,RP-005,Valli Produce - Evanston Plaza,599,42.008571,-87.690483,42.039742,-87.699413,casual,13.233333,3.543683,winter,16.067074
3,C90792D034FED968,classic_bike,2023-01-22 10:52:58,2023-01-22 11:01:44,Kimbark Ave & 53rd St,TA1309000037,Greenwood Ave & 47th St,TA1308000002,41.799568,-87.594747,41.809835,-87.599383,member,8.766667,1.204573,winter,8.244229
4,3397017529188E8A,classic_bike,2023-01-12 13:58:01,2023-01-12 14:13:20,Kimbark Ave & 53rd St,TA1309000037,Greenwood Ave & 47th St,TA1308000002,41.799568,-87.594747,41.809835,-87.599383,member,15.316667,1.204573,winter,4.718677


In [42]:
divvy_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3913034 entries, 0 to 5719876
Data columns (total 17 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   ride_id                 object        
 1   rideable_type           object        
 2   started_at              datetime64[ns]
 3   ended_at                datetime64[ns]
 4   start_station_name      object        
 5   start_station_id        object        
 6   end_station_name        object        
 7   end_station_id          object        
 8   start_lat               float64       
 9   start_lng               float64       
 10  end_lat                 float64       
 11  end_lng                 float64       
 12  member_casual           object        
 13  ride_duration_min       float64       
 14  distance_km             float64       
 15  season                  object        
 16  avg_velocity_km_per_hr  float64       
dtypes: datetime64[ns](2), float64(7), object(8)
memory 