# Analysis of Divvy Usage and Chicago Weather

Joint analysis of Divvy bikesharing data and Chicago weather from April 2020 to May 2023. [View this notebook on NBViewer](https://nbviewer.org/github/pollyren/divvy/blob/main/analysis/weather_analysis.ipynb) to see the proper map renderings.

### Preliminaries

In [1]:
import pandas as pd
import os
from datetime import datetime
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import folium

In [2]:
bike_dtypes = {
    'ride_id': str,
    'rideable_type': str,
    'started_at': str,
    'ended_at': str,
    'start_station_name': str,
    'start_station_id': str,
    'end_station_name': str,
    'end_station_id': str,
    'start_lat': float,
    'start_lng': float,
    'end_lat': float,
    'end_lng': float,
    'member_casual': str,
    'time': float,
    'distance': float,
}

In [3]:
weather_dtypes = {
    'name': str,
    'datetime': str,
    'temp': float,
    'feelslike': float,
    'humidity': float,
    'precip': float,
    'precipprob': float,
    'preciptype': str,
    'snow': float,
    'snowdepth': float,
    'windgust': float,
    'windspeed': float,
    'winddir': float,
    'sealevelpressure': float,
    'cloudcover': float,
    'visibility': float,
    'solarradiation': float,
    'solarenergy': float,
    'uvindex': float,
    'severerisk': str,
    'conditions': str,
    'icon': str,
    'stations': str,
}

In [4]:
data_path = os.getcwd() + '/../data/'
bike_data = pd.read_csv(data_path+'data_dist_time.csv', dtype=bike_dtypes, index_col=0)
weather_data = pd.read_csv(data_path+'chicago_04012020-05312023.csv', dtype=weather_dtypes, index_col=False)

### Cleaning and combining the datasets

I will repeat the cleaning operations applied to the bike dataset in `eda.ipynb`, for the reasons given there. I will also clean the new weather dataset and drop the columns it does not need.

In [5]:
bike_data['started_at'] = pd.to_datetime(bike_data['started_at'])
bike_data['ended_at'] = pd.to_datetime(bike_data['ended_at'])
bike_data['time'] = bike_data['time'].div(60)

bike_data['year'] = bike_data['started_at'].dt.year.astype('int')
bike_data['month'] = bike_data['started_at'].dt.month.astype('int')
bike_data['day'] = bike_data['started_at'].dt.day.astype('int')
bike_data['hour'] = bike_data['started_at'].dt.hour.astype('int')

bike_data.drop(
    ['ride_id','started_at','ended_at','start_station_id','end_station_id'], 
    axis=1, 
    inplace=True
)

In [6]:
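# Keep rides whose duration lies between the 1st and 99th percentiles
lower = np.percentile(bike_data['time'], 1)
upper = np.percentile(bike_data['time'], 99)
bike_data = bike_data[bike_data.time.between(lower, upper)]
# Drop rides with a recorded distance of 25 or more
bike_data = bike_data[bike_data.distance < 25]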
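
To make the trimming step concrete, here is a minimal sketch on a toy frame. The `rides` frame and its values are made up for illustration; only the percentile and distance thresholds mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for bike_data (values are illustrative only)
rides = pd.DataFrame({
    'time': np.arange(1, 101, dtype=float),      # durations 1.0, 2.0, ..., 100.0
    'distance': np.linspace(0.5, 30.0, 100),     # same unit as the notebook's distance column
})

# 1st and 99th percentiles of the duration column
lower = np.percentile(rides['time'], 1)
upper = np.percentile(rides['time'], 99)

# between() is inclusive on both ends by default
trimmed = rides[rides['time'].between(lower, upper)]
# Apply the same distance cutoff as the notebook
trimmed = trimmed[trimmed['distance'] < 25]

print(len(rides), '->', len(trimmed))
```

Because `np.percentile` interpolates linearly by default, the bounds need not be values that occur in the data; `between` simply keeps every row that falls inside them.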

In [7]:
# Treat docked bikes as classic bikes; only the rideable_type column needs the replacement
bike_data = bike_data.replace({'rideable_type': {'docked_bike': 'classic_bike'}})

In [8]:
bike_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15716344 entries, 0 to 16048416
Data columns (total 14 columns):
 #   Column              Dtype  
---  ------              -----  
 0   rideable_type       object 
 1   start_station_name  object 
 2   end_station_name    object 
 3   start_lat           float64
 4   start_lng           float64
 5   end_lat             float64
 6   end_lng             float64
 7   member_casual       object 
 8   time                float64
 9   distance            float64
 10  year                int64  
 11  month               int64  
 12  day                 int64  
 13  hour                int64  
dtypes: float64(6), int64(4), object(4)
memory usage: 1.8+ GB


Now that the bike dataset is ready, let's turn our attention to the weather dataset.

The `name` and `stations` columns are redundant and uninformative. The `precipprob`, `preciptype`, and `snow` columns are also largely redundant, because the same information can be deduced from the `precip` and `snowdepth` columns. The `windgust` and `severerisk` columns are incomplete over the timeframe of interest; since they correlate with `windspeed` and `conditions`, respectively, dropping them does not discard these factors entirely. The `datetime` column is dropped as well, once the year, month, day, and hour have been extracted from it. Let's remove these columns from the data.

In [9]:
weather_data['datetime'] = pd.to_datetime(weather_data['datetime'], format='ISO8601')
weather_data['year'] = weather_data['datetime'].dt.year.astype('int')
weather_data['month'] = weather_data['datetime'].dt.month.astype('int')
weather_data['day'] = weather_data['datetime'].dt.day.astype('int')
weather_data['hour'] = weather_data['datetime'].dt.hour.astype('int')

weather_data.drop(
    ['name','datetime','precipprob','preciptype','snow','windgust','severerisk','stations'], 
    axis=1, 
    inplace=True
)
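
The redundancy and incompleteness claims behind this drop can be checked on the raw data before the columns are removed. The sketch below shows one way to do that on a toy frame; the `toy` frame and all of its values are hypothetical, and the real check would run on `weather_data` before the `drop` call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw weather data
toy = pd.DataFrame({
    'windspeed':  [5.0, 10.0, 15.0, 20.0, 25.0],
    'windgust':   [np.nan, 18.0, 24.0, np.nan, 41.0],
    'precip':     [0.0, 0.2, 0.0, 0.0, 0.1],
    'precipprob': [0.0, 100.0, 0.0, 0.0, 100.0],
})

# Share of missing values per column (incompleteness)
missing_share = toy.isna().mean()

# Pearson correlation; pandas ignores rows where either value is missing
gust_speed_corr = toy['windgust'].corr(toy['windspeed'])

# In this toy frame, precipprob is nonzero exactly when precip is nonzero
redundant = ((toy['precip'] > 0) == (toy['precipprob'] > 0)).all()

print(missing_share['windgust'], round(gust_speed_corr, 3), redundant)
```

A high correlation or a perfect match on the real data would support the drop; a weak one would suggest keeping the column or imputing it instead.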

In [10]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27744 entries, 0 to 27743
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   temp              27744 non-null  float64
 1   feelslike         27744 non-null  float64
 2   dew               27744 non-null  float64
 3   humidity          27744 non-null  float64
 4   precip            27744 non-null  float64
 5   snowdepth         27744 non-null  float64
 6   windspeed         27744 non-null  float64
 7   winddir           27744 non-null  float64
 8   sealevelpressure  27744 non-null  float64
 9   cloudcover        27744 non-null  float64
 10  visibility        27744 non-null  float64
 11  solarradiation    27744 non-null  float64
 12  solarenergy       27744 non-null  float64
 13  uvindex           27744 non-null  float64
 14  conditions        27744 non-null  object 
 15  icon              27744 non-null  object 
 16  year              27744 non-null  int64 

Now the weather dataset is ready too! Let's group the rides by the year, month, day, and hour in which they started, computing hourly ride counts, average durations, and average distances for each combination of rider type and bike type. We will then merge the two datasets so we can begin to analyse how these hourly values relate to the weather.

In [11]:
bike_agg = pd.DataFrame()

In [12]:
classic = bike_data['rideable_type']=='classic_bike'
electric = bike_data['rideable_type']=='electric_bike'
member = bike_data['member_casual']=='member'
casual = bike_data['member_casual']=='casual'

In [13]:
bike_agg['member_classic_counts'] = bike_data[member & classic].groupby(['year','month','day','hour']).size()

In [14]:
bike_agg['member_electric_counts'] = bike_data[member & electric].groupby(['year','month','day','hour']).size()

In [15]:
bike_agg['casual_classic_counts'] = bike_data[casual & classic].groupby(['year','month','day','hour']).size()

In [16]:
bike_agg['casual_electric_counts'] = bike_data[casual & electric].groupby(['year','month','day','hour']).size()

In [17]:
bike_agg['member_classic_avg_time'] = bike_data[member & classic].groupby(['year','month','day','hour'])[['time']].mean()

In [18]:
bike_agg['member_electric_avg_time'] = bike_data[member & electric].groupby(['year','month','day','hour'])[['time']].mean()

In [19]:
bike_agg['casual_classic_avg_time'] = bike_data[casual & classic].groupby(['year','month','day','hour'])[['time']].mean()

In [20]:
bike_agg['casual_electric_avg_time'] = bike_data[casual & electric].groupby(['year','month','day','hour'])[['time']].mean()

In [21]:
bike_agg['member_classic_avg_dist'] = bike_data[member & classic].groupby(['year','month','day','hour'])[['distance']].mean()

In [22]:
bike_agg['member_electric_avg_dist'] = bike_data[member & electric].groupby(['year','month','day','hour'])[['distance']].mean()

In [23]:
bike_agg['casual_classic_avg_dist'] = bike_data[casual & classic].groupby(['year','month','day','hour'])[['distance']].mean()

In [24]:
bike_agg['casual_electric_avg_dist'] = bike_data[casual & electric].groupby(['year','month','day','hour'])[['distance']].mean()
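
The twelve cells above repeat one pattern, so they could be collapsed into a single grouped pass. The sketch below is one possible way to do that, not the notebook's method; the `rides` frame, the helper column `group`, and the `stats` name are all hypothetical. One behavioural difference is worth noting: because `bike_agg` starts empty, it takes its index from the first Series assigned to it, so later columns are aligned to the hours that have member classic rides. The `unstack` approach instead keeps every hour that has any ride and leaves missing combinations as NaN.

```python
import pandas as pd

# Hypothetical toy rides standing in for bike_data
rides = pd.DataFrame({
    'year':  [2021, 2021, 2021, 2021],
    'month': [6, 6, 6, 6],
    'day':   [1, 1, 1, 1],
    'hour':  [8, 8, 8, 9],
    'member_casual': ['member', 'member', 'casual', 'casual'],
    'rideable_type': ['classic_bike', 'classic_bike', 'electric_bike', 'classic_bike'],
    'time': [10.0, 20.0, 30.0, 40.0],
    'distance': [1.0, 2.0, 3.0, 4.0],
})

keys = ['year', 'month', 'day', 'hour']

# Build labels such as 'member_classic' or 'casual_electric'
rides = rides.assign(
    group=rides['member_casual'] + '_'
    + rides['rideable_type'].str.replace('_bike', '', regex=False)
)

# One pass: count, mean time, and mean distance per hour and group
stats = rides.groupby(keys + ['group']).agg(
    counts=('time', 'size'),
    avg_time=('time', 'mean'),
    avg_dist=('distance', 'mean'),
).unstack('group')

# Flatten the (stat, group) column MultiIndex into names like member_classic_counts
stats.columns = [f'{group}_{stat}' for stat, group in stats.columns]

print(stats)
```

In the toy data, hour 9 has only a casual classic ride, so it still gets a row, with NaN in the member columns.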

In [36]:
def most_common(series):
    counts = series.value_counts()
    return None if counts.empty else counts.idxmax()

In [51]:
# Most frequently used start station in each hour (None when no station name is recorded).
# Earlier attempts, kept for reference:
# bike_agg['pop_station'] = bike_data.groupby(['year','month','day','hour']).apply(lambda x: get_most_popular_station(*x.name))
# bike_agg['pop_station'] = bike_data.groupby(['year', 'month', 'day', 'hour'])['start_station_name'].agg(lambda x: x.mode().iloc[0])
bike_agg['pop_station'] = bike_data.groupby(['year','month','day','hour'])['start_station_name'].agg(most_common)
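
As a small illustration of how `most_common` behaves, the sketch below applies it to a toy frame with made-up station names. It also shows the case that presumably motivated the helper: when every station name in a group is missing, `value_counts()` is empty, so the helper returns `None`, whereas `x.mode().iloc[0]` would raise an `IndexError` on the empty result.

```python
import numpy as np
import pandas as pd

def most_common(series):
    # value_counts() skips NaN, so an all-missing group yields an empty result
    counts = series.value_counts()
    return None if counts.empty else counts.idxmax()

# Hypothetical toy data: station names are placeholders
toy = pd.DataFrame({
    'hour': [0, 0, 0, 1],
    'start_station_name': ['A St', 'B St', 'A St', np.nan],
})

result = toy.groupby('hour')['start_station_name'].agg(most_common)
print(result)
```

When several stations tie, `idxmax` returns only one of them, so ties are not visible in the output.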

In [49]:
bike_agg.sample(10)

year,month,day,hour,member_classic_counts,member_electric_counts,casual_classic_counts,casual_electric_counts,member_classic_avg_time,member_electric_avg_time,casual_classic_avg_time,casual_electric_avg_time,member_classic_avg_dist,member_electric_avg_dist,casual_classic_avg_dist,casual_electric_avg_dist,pop_station
2020,9,7,8,264,56.0,171.0,40.0,18.502715,15.421429,30.886257,19.572083,1.530619,1.296803,1.243794,2.338953,Michigan Ave & Oak St
2020,9,21,0,21,9.0,28.0,30.0,10.161111,13.564815,28.408929,23.555556,1.078422,1.208181,0.796597,1.595384,Sheridan Rd & Irving Park Rd
2021,6,6,10,444,204.0,508.0,257.0,16.360323,14.562663,28.495243,21.946757,1.491004,1.699551,1.528063,1.652213,Streeter Dr & Grand Ave
2022,10,17,7,244,335.0,33.0,104.0,11.111407,8.537612,10.366667,8.105128,1.198045,1.290032,0.755833,1.092195,Ellis Ave & 60th St
2023,1,11,15,241,240.0,54.0,112.0,10.448202,10.372986,21.652778,11.147917,0.995351,1.371489,0.942042,1.134349,University Ave & 57th St
2022,2,3,10,25,30.0,1.0,1.0,11.340667,9.868889,8.05,7.833333,1.094969,0.839843,0.288924,1.033101,Wells St & Huron St
2023,1,6,4,7,10.0,6.0,6.0,6.716667,7.406667,5.038889,12.208333,0.854118,1.237911,0.414271,1.695746,Clark St & Newport St
2023,1,8,13,169,156.0,49.0,73.0,11.925247,9.530342,15.670748,10.222146,1.135236,1.283147,1.349502,1.031786,Kingsbury St & Kinzie St
2021,1,3,21,25,12.0,5.0,5.0,8.671333,8.525,12.483333,8.206667,1.08798,1.06255,1.077115,1.000291,Glenwood Ave & Morse Ave
2021,11,1,13,263,260.0,115.0,123.0,11.512104,9.288526,23.205217,18.062873,0.942363,1.036361,1.101741,1.39992,Clinton St & Lake St


In [53]:
bw = weather_data.merge(
    bike_agg, 
    on=['year','month','day','hour'], 
    how='left'
)
bw.sample(10)

,temp,feelslike,dew,humidity,precip,snowdepth,windspeed,winddir,sealevelpressure,cloudcover,...,casual_electric_counts,member_classic_avg_time,member_electric_avg_time,casual_classic_avg_time,casual_electric_avg_time,member_classic_avg_dist,member_electric_avg_dist,casual_classic_avg_dist,casual_electric_avg_dist,pop_station
18668,60.4,60.4,52.9,76.53,0.0,0.0,6.3,341.0,1007.1,100.0,...,220.0,12.212485,11.348963,17.657534,16.311515,1.182738,1.446848,1.12145,1.405752,DuSable Lake Shore Dr & North Blvd
6378,39.0,33.2,27.2,62.28,0.0,0.0,8.3,177.0,1020.0,63.5,...,36.0,12.875744,11.846667,14.676587,15.066667,1.225901,1.343496,1.191425,1.616819,LaSalle St & Washington St
52,46.5,43.0,37.8,71.7,0.001,0.0,6.9,81.0,1019.1,99.3,...,,7.166667,,,,0.996817,,,,Millennium Park
19046,82.1,79.8,34.0,17.68,0.0,0.0,17.2,299.0,1012.2,24.2,...,378.0,14.19514,13.627253,27.022318,21.691623,1.235395,1.497459,1.259211,1.645337,Streeter Dr & Grand Ave
14074,53.9,53.9,39.6,58.44,0.0,0.0,13.4,227.0,1019.6,24.2,...,121.0,11.147273,8.872624,23.111111,15.229477,1.086898,1.229458,1.061889,1.519503,Loomis St & Lexington St
17414,33.8,27.3,8.9,34.79,0.0,0.0,7.6,324.0,1023.3,24.2,...,87.0,11.792488,10.422626,18.255044,16.331992,1.019561,1.247793,1.141285,1.448831,Dearborn St & Erie St
23822,23.0,23.0,9.0,54.35,0.0,0.07,2.2,110.0,1031.2,78.4,...,39.0,9.916064,7.780599,11.801961,8.128632,0.994166,0.926808,0.819611,0.97697,Wells St & Huron St
19320,89.2,92.8,67.6,49.13,0.0,0.0,12.5,210.0,1009.2,0.0,...,85.0,11.346667,15.166667,25.910959,13.966471,1.108943,1.966593,0.99019,1.359287,Wabash Ave & 9th St
4816,42.4,37.2,37.4,82.49,0.013,0.0,8.6,294.0,1022.2,100.0,...,57.0,12.288559,14.478409,18.526667,17.277485,1.193874,1.364129,1.219781,1.847776,Kingsbury St & Erie St
24980,34.8,28.4,25.1,67.23,0.0,0.13,7.9,310.0,1018.9,0.0,...,37.0,9.724359,7.56537,12.081373,8.905856,0.84729,0.948615,1.005834,1.217907,University Ave & 57th St
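
Since the merge is a left join on the weather rows, any weather hour without a matching row in the aggregated bike data ends up with NaN in the bike columns. A minimal sketch of how such hours could be counted follows, using pandas' `indicator` option on toy frames; the `weather` and `agg` frames and their values are hypothetical.

```python
import pandas as pd

keys = ['year', 'month', 'day', 'hour']

# Hypothetical toy frames: three weather hours, rides recorded in two of them
weather = pd.DataFrame({
    'year': [2021, 2021, 2021], 'month': [6, 6, 6], 'day': [1, 1, 1],
    'hour': [7, 8, 9], 'temp': [60.0, 62.5, 65.0],
})
agg = pd.DataFrame({
    'year': [2021, 2021], 'month': [6, 6], 'day': [1, 1],
    'hour': [8, 9], 'member_classic_counts': [12, 20],
}).set_index(keys)

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = weather.merge(agg, on=keys, how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

print(len(merged), 'weather hours,', len(unmatched), 'without bike data')
```

Whether unmatched hours should be treated as zero rides or as missing data is a modelling choice that depends on the later analysis.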


In [55]:
bw.to_csv(data_path+'bike_weather_merged.csv')