## 1 Data Wrangling<a id='2_Data_wrangling'></a>

## 1.1 Introduction<a id='2.2_Introduction'></a>

This step focuses on collecting the data, organizing it, and making sure it is well defined. The available datasets ('day.csv' and 'hour.csv') contain total counts for the D.C. district only.

## 1.2 Imports <a id='1.2_Imports'></a>

In [61]:
#Import libraries and modules necessary
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 1.3 Load Data <a id='1.3_Load Data'></a>

In [62]:
# Load the supplied CSV file with the daily data
day_data = pd.read_csv('/Users/esrasaydam/Documents/GitHub/Capstone Project #2/day.csv')

In [63]:
# Load the supplied CSV file with the hourly data
hour_data = pd.read_csv('/Users/esrasaydam/Documents/GitHub/Capstone Project #2/hour.csv')
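
The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable loader, assuming both CSV files sit together in one folder; the helper name `load_bike_csv` and the `data_dir` argument are hypothetical and not part of the original notebook:

```python
from pathlib import Path

import pandas as pd


def load_bike_csv(data_dir, filename):
    """Read one of the bike-sharing CSV files from data_dir.

    Both this helper and data_dir are illustrative assumptions; point
    data_dir at wherever day.csv and hour.csv actually live.
    """
    path = Path(data_dir) / filename
    return pd.read_csv(path)
```

In the notebook this might be called as `day_data = load_bike_csv(data_dir, 'day.csv')`, with `data_dir` set once near the imports.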

In [64]:
# We'll first focus on day_data.
# Call the info() method (with parentheses) on day_data to see a summary.
day_data.info()

'cnt' is the total count of 'casual' and 'registered' users; the remaining columns describe weather and time ('season', 'yr', 'workingday', 'holiday', 'mnth', etc.).
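
The relationship between 'cnt', 'casual' and 'registered' can be verified directly. A small sketch, using the first three rows of day.csv as they appear in the `head()` output; the helper name `cnt_matches_components` is made up for illustration:

```python
import pandas as pd


def cnt_matches_components(df):
    """Return True if cnt equals casual + registered on every row."""
    return bool((df["casual"] + df["registered"] == df["cnt"]).all())


# First three rows of day.csv as displayed by head()
sample = pd.DataFrame(
    {
        "casual": [331, 131, 120],
        "registered": [654, 670, 1229],
        "cnt": [985, 801, 1349],
    }
)
print(cnt_matches_components(sample))  # True
```

Running `cnt_matches_components(day_data)` would apply the same check to all 731 rows.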

In [65]:
day_data.head(50)

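| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |
| 5 | 6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | 0.233209 | 0.518261 | 0.089565 | 88 | 1518 | 1606 |
| 6 | 7 | 2011-01-07 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | 0.208839 | 0.498696 | 0.168726 | 148 | 1362 | 1510 |
| 7 | 8 | 2011-01-08 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.165 | 0.162254 | 0.535833 | 0.266804 | 68 | 891 | 959 |
| 8 | 9 | 2011-01-09 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.138333 | 0.116175 | 0.434167 | 0.36195 | 54 | 768 | 822 |
| 9 | 10 | 2011-01-10 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.150833 | 0.150888 | 0.482917 | 0.223267 | 41 | 1280 | 1321 |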


No null/NaN values are visible in these rows.

## 1.4 Explore the Data

In [66]:
# checking if there is any null data.
day_data.isnull().sum() 

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [67]:
day_data.isnull().mean()

instant       0.0
dteday        0.0
season        0.0
yr            0.0
mnth          0.0
holiday       0.0
weekday       0.0
workingday    0.0
weathersit    0.0
temp          0.0
atemp         0.0
hum           0.0
windspeed     0.0
casual        0.0
registered    0.0
cnt           0.0
dtype: float64

There are no NaN values in any column.
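
The two per-column checks above can also be collapsed into a single number. A minimal sketch; the helper name `count_missing` is hypothetical:

```python
import pandas as pd


def count_missing(df):
    """Total number of missing (NaN/None) cells across the whole frame."""
    # The first sum() counts missing cells per column, the second adds them up
    return int(df.isna().sum().sum())
```

Given the per-column zeros shown above, `count_missing(day_data)` would be expected to return 0.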

In [68]:
day_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


Checking whether there are any duplicate dates.

In [69]:
day_data['dteday'].value_counts().head(50)

2011-01-01    1
2012-04-25    1
2012-04-27    1
2012-04-28    1
2012-04-29    1
2012-04-30    1
2012-05-01    1
2012-05-02    1
2012-05-03    1
2012-05-04    1
2012-05-05    1
2012-05-06    1
2012-05-07    1
2012-05-08    1
2012-05-09    1
2012-05-10    1
2012-05-11    1
2012-05-12    1
2012-05-13    1
2012-05-14    1
2012-05-15    1
2012-04-26    1
2012-04-24    1
2012-05-17    1
2012-04-23    1
2012-04-04    1
2012-04-05    1
2012-04-06    1
2012-04-07    1
2012-04-08    1
2012-04-09    1
2012-04-10    1
2012-04-11    1
2012-04-12    1
2012-04-13    1
2012-04-14    1
2012-04-15    1
2012-04-16    1
2012-04-17    1
2012-04-18    1
2012-04-19    1
2012-04-20    1
2012-04-21    1
2012-04-22    1
2012-05-16    1
2012-05-18    1
2012-07-02    1
2012-06-10    1
2012-06-12    1
2012-06-13    1
Name: dteday, dtype: int64

In [70]:
day_data['dteday'].value_counts().tail(50)

2011-07-20    1
2011-07-21    1
2011-07-22    1
2011-07-23    1
2011-08-16    1
2011-08-17    1
2011-08-18    1
2011-08-19    1
2011-09-11    1
2011-09-12    1
2011-09-13    1
2011-09-14    1
2011-09-15    1
2011-09-16    1
2011-09-17    1
2011-09-18    1
2011-09-19    1
2011-09-20    1
2011-09-21    1
2011-09-22    1
2011-09-23    1
2011-09-24    1
2011-09-25    1
2011-09-26    1
2011-09-27    1
2011-09-28    1
2011-09-29    1
2011-09-10    1
2011-09-09    1
2011-09-08    1
2011-08-28    1
2011-08-20    1
2011-08-21    1
2011-08-22    1
2011-08-23    1
2011-08-24    1
2011-08-25    1
2011-08-26    1
2011-08-27    1
2011-08-29    1
2011-09-07    1
2011-08-30    1
2011-08-31    1
2011-09-01    1
2011-09-02    1
2011-09-03    1
2011-09-04    1
2011-09-05    1
2011-09-06    1
2012-12-31    1
Name: dteday, dtype: int64

No duplicates are found.
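
Scanning `value_counts()` output by eye only covers the rows that are printed. A sketch of a programmatic check that also looks for missing calendar days, assuming 'dteday' holds ISO date strings as shown above; the helper name `check_daily_dates` is made up for illustration:

```python
import pandas as pd


def check_daily_dates(dates):
    """Summarise duplicate and missing calendar days in a date column."""
    parsed = pd.to_datetime(pd.Series(dates))
    # Every calendar day between the first and last observed date
    expected = pd.date_range(parsed.min(), parsed.max(), freq="D")
    missing = expected.difference(pd.DatetimeIndex(parsed))
    return {
        "n_duplicates": int(parsed.duplicated().sum()),
        "n_missing_days": int(len(missing)),
    }
```

Calling `check_daily_dates(day_data['dteday'])` would report both counts at once. The span from 2011-01-01 to 2012-12-31 covers 731 calendar days (2012 is a leap year), which matches the 731 rows reported by `info()`.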

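## Explore the Seasons
Seasons are represented by integers. The January rows in the head() output have season 1, so the codes used in the cells below are:

1: winter,
2: spring,
3: summer,
4: fall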

In [None]:
# Daily usage counts in spring
day_data[day_data.season == 2].cnt

In [None]:
# Total usage count in spring (yearly average over the 2 years)
day_data[day_data.season == 2].cnt.sum()/2

In [None]:
# Daily usage counts in summer
day_data[day_data.season == 3].cnt

In [None]:
# Total usage count in summer (yearly average over the 2 years)
day_data[day_data.season == 3].cnt.sum()/2

In [None]:
# Daily usage counts in fall
day_data[day_data.season == 4].cnt

In [None]:
# Total usage count in fall (yearly average over the 2 years)
day_data[day_data.season == 4].cnt.sum()/2

In [None]:
# Daily usage counts in winter
day_data[day_data.season == 1].cnt

In [None]:
# Total usage count in winter (yearly average over the 2 years)
day_data[day_data.season == 1].cnt.sum()/2
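
The four per-season totals above can be computed in one pass with `groupby`. A minimal sketch; the label mapping follows the comments in the cells above (1 = winter through 4 = fall) and is an assumption taken from those comments, and the helper name `seasonal_yearly_average` is hypothetical:

```python
import pandas as pd

# Labels follow the cell comments above; treat this mapping as an assumption.
SEASON_LABELS = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}


def seasonal_yearly_average(df, n_years=2):
    """Total rides per season divided by the number of years covered."""
    totals = df.groupby("season")["cnt"].sum() / n_years
    # Replace the integer codes in the index with readable labels
    return totals.rename(index=SEASON_LABELS)
```

`seasonal_yearly_average(day_data)` would return one value per season, which makes the comparison between seasons easier to read than four separate cells.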

It seems that riders avoid biking as the weather gets colder, so temperature does matter. Fall and spring have similar totals, with spring ahead of fall by about 40k rides. In winter, usage is about half that of spring.

In [None]:
# A closer look at spring days in 2011 and 2012
spring_data = day_data[day_data.season == 2]

In [None]:
spring_data.head()

In [None]:
# Days with precipitation in spring 2011, 2012
spring_data[spring_data.weathersit == 3].cnt

In [None]:
# Days with precipitation in fall 2011, 2012
fall_data = day_data[day_data.season == 4]
fall_data[fall_data.weathersit == 3].cnt

In [None]:
# Days with precipitation in summer 2011, 2012
summer_data = day_data[day_data.season == 3]
summer_data[summer_data.weathersit == 3].cnt

In [None]:
# Days with precipitation in winter 2011, 2012
winter_data = day_data[day_data.season == 1]
winter_data[winter_data.weathersit == 3].cnt

Summer and winter actually have an equal number of days with precipitation, so precipitation alone does not explain the seasonal difference; temperature looks like the deciding factor. On the other hand, fall has three times as many days with precipitation as spring, which suggests that precipitation might be a secondary factor.
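
The counts behind this comparison can be produced in a single table rather than read off four separate outputs. A sketch, assuming `weathersit == 3` marks precipitation days as in the cells above; the helper name `precip_days_by_season` is made up for illustration:

```python
import pandas as pd


def precip_days_by_season(df, precip_code=3):
    """Count days per season whose weathersit equals precip_code."""
    wet = df[df["weathersit"] == precip_code]
    # Reindex so seasons with no such days show 0 instead of being dropped
    seasons = sorted(df["season"].unique())
    return wet.groupby("season").size().reindex(seasons, fill_value=0)
```

`precip_days_by_season(day_data)` would return a Series indexed by season code, so the summer/winter and fall/spring comparisons can be checked side by side.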

## Next steps:
Checking the number of usages:
- on the hottest day
- on the coldest day
- on the warmest day of winter

Checking the number of usages:
- when there is heavy wind
- when the sky is cloudy versus clear

Plotting these findings:
- e.g. number of usages (y) by season (x)
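
The temperature checks listed above could be sketched with `idxmax` and `idxmin`. This is only one possible approach, and the helper name `usage_on_temperature_extremes` is hypothetical:

```python
import pandas as pd


def usage_on_temperature_extremes(df):
    """Return the rows for the hottest and coldest day by the 'temp' column."""
    cols = ["dteday", "temp", "cnt"]
    return {
        "hottest": df.loc[df["temp"].idxmax(), cols],
        "coldest": df.loc[df["temp"].idxmin(), cols],
    }
```

Passing the whole frame gives the overall extremes, while passing a subset such as `day_data[day_data.season == 1]` would give the warmest and coldest winter days.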

Question to Silvia: Do I bring in the Philadelphia weather dataset at this stage?


