# Citibike - Statistical Analysis Project

August 1, 2020

<u>About the Data:</u>   https://www.citibikenyc.com/system-data <br>
<u>Data Source:</u>    https://s3.amazonaws.com/tripdata/index.html<br>
<u>Data Date Range:</u>    June 2019 - June 2020

## Setup

### Import libraries

In [45]:
# import relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option('display.float_format', lambda x: '%f' % x)

### Import data

In [25]:
os.getcwd()

'C:\\Users\\Grace\\Documents\\My_Git_Repos\\citibike-analysis\\data'

In [26]:
os.listdir()

['201906-citibike-tripdata.csv',
 '201907-citibike-tripdata.csv',
 '201908-citibike-tripdata.csv',
 '201909-citibike-tripdata.csv',
 '201910-citibike-tripdata.csv',
 '201911-citibike-tripdata.csv',
 '201912-citibike-tripdata.csv',
 '202001-citibike-tripdata.csv',
 '202002-citibike-tripdata.csv',
 '202003-citibike-tripdata.csv',
 '202004-citibike-tripdata.csv',
 '202005-citibike-tripdata.csv',
 '202006-citibike-tripdata.csv']

In [29]:
# import all files into a list of dataframes
dfs_list = []
for file in os.listdir():
    dfs_list.append(pd.read_csv(file))

In [30]:
# concatenate all dataframes into a single dataframe
big_df=pd.concat(dfs_list,ignore_index=True)
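
The loop above relies on the working directory containing nothing but the trip files. As a minimal sketch of an alternative, the helper below takes the directory explicitly, matches only the monthly trip files, and parses the two timestamp columns while reading. The function name `load_trip_files` and the filename pattern are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_trip_files(data_dir):
    """Read every *-citibike-tripdata.csv in data_dir into one dataframe."""
    # sorted() keeps the months in chronological order (YYYYMM prefix)
    files = sorted(Path(data_dir).glob("*-citibike-tripdata.csv"))
    if not files:
        raise FileNotFoundError(f"no trip data files found in {data_dir}")
    # parse_dates converts the timestamp columns while each file is read
    frames = [pd.read_csv(f, parse_dates=["starttime", "stoptime"]) for f in files]
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage, assuming the CSVs live in a folder named "data":
# big_df = load_trip_files("data")
```

Parsing the timestamps at read time would make a separate datetime conversion step unnecessary, at the cost of a slower initial load.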

In [33]:
big_df.head()

| | tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 330 | 2019-06-01 00:00:01.5000 | 2019-06-01 00:05:31.7600 | 3602.0 | 31 Ave & 34 St | 40.763154 | -73.920827 | 3570.0 | 35 Ave & 37 St | 40.755733 | -73.923661 | 20348 | Subscriber | 1992 | 1 |
| 1 | 830 | 2019-06-01 00:00:04.2400 | 2019-06-01 00:13:55.1470 | 3054.0 | Greene Ave & Throop Ave | 40.689493 | -73.942061 | 3781.0 | Greene Av & Myrtle Av | 40.698568 | -73.918877 | 34007 | Subscriber | 1987 | 2 |
| 2 | 380 | 2019-06-01 00:00:06.0190 | 2019-06-01 00:06:26.7790 | 229.0 | Great Jones St | 40.727434 | -73.99379 | 326.0 | E 11 St & 1 Ave | 40.729538 | -73.984267 | 20587 | Subscriber | 1990 | 2 |
| 3 | 1155 | 2019-06-01 00:00:06.7760 | 2019-06-01 00:19:22.5380 | 3771.0 | McKibbin St & Bogart St | 40.706237 | -73.933871 | 3016.0 | Kent Ave & N 7 St | 40.720368 | -73.961651 | 33762 | Subscriber | 1987 | 1 |
| 4 | 1055 | 2019-06-01 00:00:07.5200 | 2019-06-01 00:17:42.5580 | 441.0 | E 52 St & 2 Ave | 40.756014 | -73.967416 | 3159.0 | W 67 St & Broadway | 40.774925 | -73.982666 | 31290 | Subscriber | 1973 | 1 |


### Set datatypes

In [55]:
# original datatypes
big_df.dtypes

tripduration                        int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id                  float64
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                    float64
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                              int64
usertype                           object
birth year                          int64
gender                              int64
tripduration_min                  float64
dtype: object

`starttime` and `stoptime` are converted to datetimes.

In [39]:
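# pd.to_datetime parses the timestamp strings; a unit-less astype('datetime64') is rejected by newer pandas versions
big_df['starttime'] = pd.to_datetime(big_df['starttime'])
big_df['stoptime'] = pd.to_datetime(big_df['stoptime'])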

Station IDs should be integers, but the conversion below fails because the column contains non-finite values (NA or inf), which `int64` cannot represent.

In [49]:
big_df['start station id'] = big_df['start station id'].astype('int64')

ValueError: Cannot convert non-finite values (NA or inf) to integer
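
One way around this error, sketched here on a toy frame rather than on `big_df`, is pandas' nullable integer dtype `Int64` (capital "I"), which stores missing values as `<NA>` instead of failing. The `sample` frame and its values are made up for illustration, and the sketch assumes the non-finite entries are missing values rather than infinities.

```python
import numpy as np
import pandas as pd

# Toy stand-in for big_df: station ids stored as floats, one of them missing
sample = pd.DataFrame({"start station id": [3602.0, 3054.0, np.nan]})

# 'Int64' is the nullable integer dtype; NaN becomes <NA> and the rest become integers
sample["start station id"] = sample["start station id"].astype("Int64")

print(sample["start station id"].dtype)
```

The same cast could be applied to `end station id`. Alternatively, rows with missing station IDs could be dropped first and then cast to plain `int64`, if those trips are not needed.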

Updated datatypes

In [40]:
big_df.dtypes

tripduration                        int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id                  float64
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                    float64
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                              int64
usertype                           object
birth year                          int64
gender                              int64
dtype: object

## Exploratory Data Analysis (EDA)

### Notes about the data:

<u>Data Columns:</u>
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
- Gender (Zero=unknown; 1=male; 2=female)
- Year of Birth
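
The integer gender codes listed above are easy to misread in plots and group-bys. A small sketch of mapping them to labels follows; the `gender_label` column name and the toy `sample` frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mapping taken from the column notes above
GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}

# Toy stand-in for big_df
sample = pd.DataFrame({"gender": [1, 2, 0, 1]})

# map() looks up each code; the category dtype keeps memory use low on large frames
sample["gender_label"] = sample["gender"].map(GENDER_LABELS).astype("category")

print(sample)
```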

<u>Additional Notes:</u>
- Test trips & trips < 60 seconds (potentially false starts or users trying to redock a bike) have been removed.
- Mileage estimates are calculated using an assumed speed of 7.456 miles per hour, up to two hours. Trips over two hours max out at 14.9 miles. Once you opt into Ride Insights, the Citi Bike app will use your phone's location to record the route you take between your starting and ending Citi Bike station to give exact mileage.
- We only include trips that begin at publicly available stations (thereby excluding trips that originate at our depots for rebalancing or maintenance purposes).
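
The mileage rule in the notes above can be written out as a short calculation: duration in hours times 7.456 mph, with the duration capped at two hours (2 × 7.456 = 14.912, which rounds to the stated 14.9 miles). This is a sketch of that rule as described, not Citi Bike's actual implementation; the function name `estimate_miles` and the sample durations are hypothetical.

```python
import pandas as pd

ASSUMED_MPH = 7.456          # assumed speed from the Citi Bike notes
CAP_SECONDS = 2 * 60 * 60    # estimates stop growing after two hours


def estimate_miles(tripduration_s: pd.Series) -> pd.Series:
    """Estimate miles from trip duration in seconds, capped at two hours."""
    capped = tripduration_s.clip(upper=CAP_SECONDS)
    return capped / 3600 * ASSUMED_MPH


# Hypothetical durations: 5.5 minutes, 1 hour, and just under 3 hours
durations = pd.Series([330, 3600, 10000])
miles = estimate_miles(durations)
print(miles)
```

Applied to the notebook's data, this would be `estimate_miles(big_df['tripduration'])`.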

### Add additional features

Since `tripduration` is in seconds, let's create another column that shows duration in minutes.

In [51]:
big_df['tripduration_min'] = big_df['tripduration']/60

Extract the year and month from `starttime` into their own columns.

In [57]:
# the .dt accessor exposes datetime components of a datetime column
big_df['starttime_year'] = big_df['starttime'].dt.year
big_df['starttime_month'] = big_df['starttime'].dt.month

### Review 

In [53]:
big_df.head()

| | tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | end station longitude | bikeid | usertype | birth year | gender | tripduration_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 330 | 2019-06-01 00:00:01.500 | 2019-06-01 00:05:31.760 | 3602.0 | 31 Ave & 34 St | 40.763154 | -73.920827 | 3570.0 | 35 Ave & 37 St | 40.755733 | -73.923661 | 20348 | Subscriber | 1992 | 1 | 5.5 |
| 1 | 830 | 2019-06-01 00:00:04.240 | 2019-06-01 00:13:55.147 | 3054.0 | Greene Ave & Throop Ave | 40.689493 | -73.942061 | 3781.0 | Greene Av & Myrtle Av | 40.698568 | -73.918877 | 34007 | Subscriber | 1987 | 2 | 13.833333 |
| 2 | 380 | 2019-06-01 00:00:06.019 | 2019-06-01 00:06:26.779 | 229.0 | Great Jones St | 40.727434 | -73.99379 | 326.0 | E 11 St & 1 Ave | 40.729538 | -73.984267 | 20587 | Subscriber | 1990 | 2 | 6.333333 |
| 3 | 1155 | 2019-06-01 00:00:06.776 | 2019-06-01 00:19:22.538 | 3771.0 | McKibbin St & Bogart St | 40.706237 | -73.933871 | 3016.0 | Kent Ave & N 7 St | 40.720368 | -73.961651 | 33762 | Subscriber | 1987 | 1 | 19.25 |
| 4 | 1055 | 2019-06-01 00:00:07.520 | 2019-06-01 00:17:42.558 | 441.0 | E 52 St & 2 Ave | 40.756014 | -73.967416 | 3159.0 | W 67 St & Broadway | 40.774925 | -73.982666 | 31290 | Subscriber | 1973 | 1 | 17.583333 |
