# Chicago Divvy Bike Ride-Sharing Analysis

![alt text](images/divvy.jpg)
![alt_text](images/divvy_map.jpg)

### Introduction
This notebook is based on the [Divvy Ride-Sharing Kaggle dataset and competition](https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data) for Divvy bike rides in Chicago, IL. It re-creates a program I wrote about a year ago and lost when a hard drive died. I've since learned my lesson, so this one is going straight to git. Some features of the previous program will be implemented again, but I want to divide the notebooks more atomically so that each serves a specific purpose and doesn't have too large a scope.

### Goals of the Notebook
 * Conditionally split dataset
 * Determine which stations are busiest/where they usually lead to
 * Show distributions of the data, branch out with Seaborn library
 * Map rides using Basemap--really improve my skills with that
 * Perform machine learning and create models using Scikit-Learn
 

### Importing 
Just a typical data science library stack for the EDA notebook.

In [11]:
import os
import math
import random
import subprocess

import numpy   as np
import pandas  as pd
import seaborn as sns
import matplotlib.pyplot as plt
import dask.dataframe as dd

from tqdm import tqdm

pd.set_option('mode.chained_assignment', None)

### CSV File Exploration and Importing

The first thing we'll do is peer into the CSV file provided by Kaggle to see what kind of data we're looking at. Since the data is very large (I don't have a week to spend operating on 9 million Divvy bike rides), for this stage of the analysis we'll just take a sample of the data.

To sample during the import, I'm just going to retrieve every n-th line from the file.
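As a minimal sketch of the every-n-th-line idea, `pd.read_csv` accepts a callable for `skiprows` that is evaluated against each 0-based line index. The helper name `read_every_nth` and the in-memory demo file are made up for illustration. Note that this is systematic sampling rather than truly random sampling.

```python
import io

import pandas as pd


def read_every_nth(source, n, **read_csv_kwargs):
    """Read the header plus every n-th line of a CSV (hypothetical helper)."""
    # Line 0 is the header, so it is always kept; any other line is
    # skipped unless its index is a multiple of n.
    return pd.read_csv(
        source,
        skiprows=lambda i: i > 0 and i % n != 0,
        **read_csv_kwargs,
    )


# Tiny in-memory stand-in for data/data.csv: one column, values 1..10
demo_csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(1, 11)) + "\n")
demo_df = read_every_nth(demo_csv, n=3)
print(demo_df)
```

The callable runs once per line, so on a file with millions of rows this still scans the whole file; it only avoids parsing and storing the skipped rows.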

**Updated 8/20/19:** Changing import methods based on [this kaggle kernel](https://www.kaggle.com/szelee/how-to-import-a-csv-file-of-55-million-rows)

In [12]:
%%time

filename = "data/data.csv"

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0]) + 1

n_rows = file_len(filename)
print('Data file contains {} rows'.format(n_rows))

Data file contains 9495237 rows
CPU times: user 7.51 ms, sys: 36.4 ms, total: 43.9 ms
Wall time: 33.7 s
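The `wc -l` call above depends on a Unix-like shell. A portable alternative, sketched here under the assumption that counting newline bytes is good enough, reads the file in binary blocks. The name `count_lines` and the temporary demo file are illustrative only. Unlike `file_len`, this returns the raw newline count without adding 1.

```python
import os
import tempfile


def count_lines(path, block_size=1 << 20):
    """Count newline bytes in a file, reading fixed-size binary blocks."""
    total = 0
    with open(path, 'rb') as fh:
        # iter(callable, sentinel) keeps reading until read() returns b''
        for block in iter(lambda: fh.read(block_size), b''):
            total += block.count(b'\n')
    return total


# Demo on a small temporary file instead of the real data file
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('a,b\n1,2\n3,4\n')
    demo_path = tmp.name

demo_count = count_lines(demo_path)
os.remove(demo_path)
print(demo_count)
```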


In [13]:
df_tmp = pd.read_csv(filename, nrows=5)
df_tmp.head()

|  | trip_id | year | month | week | day | hour | usertype | gender | starttime | stoptime | ... | from_station_id | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_id | to_station_name | latitude_end | longitude_end | dpcapacity_end |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2355134 | 2014 | 6 | 27 | 0 | 23 | Subscriber | Male | 2014-06-30 23:57:00 | 2014-07-01 00:07:00 | ... | 131 | Lincoln Ave & Belmont Ave | 41.939365 | -87.668385 | 15.0 | 303 | Broadway & Cornelia Ave | 41.945512 | -87.64598 | 15.0 |
| 1 | 2355133 | 2014 | 6 | 27 | 0 | 23 | Subscriber | Male | 2014-06-30 23:56:00 | 2014-07-01 00:00:00 | ... | 282 | Halsted St & Maxwell St | 41.86458 | -87.64693 | 15.0 | 22 | May St & Taylor St | 41.869482 | -87.655486 | 15.0 |
| 2 | 2355130 | 2014 | 6 | 27 | 0 | 23 | Subscriber | Male | 2014-06-30 23:33:00 | 2014-06-30 23:35:00 | ... | 327 | Sheffield Ave & Webster Ave | 41.921687 | -87.653714 | 19.0 | 225 | Halsted St & Dickens Ave | 41.919936 | -87.64883 | 15.0 |
| 3 | 2355129 | 2014 | 6 | 27 | 0 | 23 | Subscriber | Female | 2014-06-30 23:26:00 | 2014-07-01 00:24:00 | ... | 134 | Peoria St & Jackson Blvd | 41.877749 | -87.649633 | 19.0 | 194 | State St & Wacker Dr | 41.887155 | -87.62775 | 11.0 |
| 4 | 2355128 | 2014 | 6 | 27 | 0 | 23 | Subscriber | Female | 2014-06-30 23:16:00 | 2014-06-30 23:26:00 | ... | 320 | Loomis St & Lexington St | 41.872187 | -87.661501 | 15.0 | 134 | Peoria St & Jackson Blvd | 41.877749 | -87.649633 | 19.0 |


In [14]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 23 columns):
trip_id              5 non-null int64
year                 5 non-null int64
month                5 non-null int64
week                 5 non-null int64
day                  5 non-null int64
hour                 5 non-null int64
usertype             5 non-null object
gender               5 non-null object
starttime            5 non-null object
stoptime             5 non-null object
tripduration         5 non-null float64
temperature          5 non-null float64
events               5 non-null object
from_station_id      5 non-null int64
from_station_name    5 non-null object
latitude_start       5 non-null float64
longitude_start      5 non-null float64
dpcapacity_start     5 non-null float64
to_station_id        5 non-null int64
to_station_name      5 non-null object
latitude_end         5 non-null float64
longitude_end        5 non-null float64
dpcapacity_end       5 non-null float64
dt

In [15]:
traintypes = {
    'trip_id': 'int32',
    'year': 'uint16',
    'month': 'uint8',
    'week': 'uint8',
    'day': 'uint8',
    'hour': 'uint8',
    'usertype': 'str',
    'gender': 'str',
    'starttime': 'str',
    'stoptime': 'str',
    'tripduration': 'float32',
    'temperature': 'float32',
    'events': 'str',
    'from_station_id': 'int32',
    'from_station_name': 'str',
    'latitude_start': 'float32',
    'longitude_start': 'float32',
    'dpcapacity_start': 'float32',
    'to_station_id': 'int32',
    'to_station_name': 'str',
    'latitude_end': 'float32',
    'longitude_end': 'float32',
    'dpcapacity_end': 'float32'
}
cols = list(traintypes.keys())
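To illustrate why the narrower dtypes above are worth declaring, here is a small sketch that compares memory use before and after downcasting. The synthetic frame and its values are invented for the demo; only the column names and target dtypes mirror `traintypes`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_demo = 10_000

# Synthetic stand-in columns; integers default to int64, floats to float64
demo = pd.DataFrame({
    'year': rng.integers(2014, 2018, n_demo),
    'month': rng.integers(1, 13, n_demo),
    'hour': rng.integers(0, 24, n_demo),
    'temperature': rng.uniform(-10, 100, n_demo),
})

# Apply the same narrow dtypes used in traintypes for these columns
compact = demo.astype({
    'year': 'uint16',
    'month': 'uint8',
    'hour': 'uint8',
    'temperature': 'float32',
})

bytes_before = demo.memory_usage(deep=True).sum()
bytes_after = compact.memory_usage(deep=True).sum()
print('before: {} bytes, after: {} bytes'.format(bytes_before, bytes_after))
```

The integer columns keep their exact values after the cast; `float32` keeps roughly seven significant digits, which is a trade-off to keep in mind for the latitude and longitude columns.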

In [18]:
chunksize = 1_000_000

In [None]:
%%time
df_list = []

for df_chunk in tqdm(
    pd.read_csv(
        filename, 
        usecols=cols, 
        dtype=traintypes, 
        chunksize=chunksize
    )
):
    df_chunk['starttime'] = df_chunk['starttime'].str.slice(0, 16)
    df_chunk['starttime'] = pd.to_datetime(df_chunk['starttime'], utc=True, format='%Y-%m-%d %H:%M')
    
    df_list.append(df_chunk)



0it [00:00, ?it/s]
1it [00:07,  7.65s/it]
2it [01:13, 25.04s/it]
3it [01:40, 25.69s/it]
4it [02:41, 35.98s/it]
5it [03:45, 44.47s/it]

In [9]:
data_df = pd.concat(df_list[0:4])

del df_list

data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 23 columns):
trip_id              400000 non-null int32
year                 400000 non-null uint16
month                400000 non-null uint8
week                 400000 non-null uint8
day                  400000 non-null uint8
hour                 400000 non-null uint8
usertype             400000 non-null object
gender               400000 non-null object
starttime            400000 non-null datetime64[ns, UTC]
stoptime             400000 non-null object
tripduration         400000 non-null float32
temperature          400000 non-null float32
events               400000 non-null object
from_station_id      400000 non-null int32
from_station_name    400000 non-null object
latitude_start       400000 non-null float32
longitude_start      400000 non-null float32
dpcapacity_start     400000 non-null float32
to_station_id        400000 non-null int32
to_station_name      400000 non-null objec

**Thoughts:** There are a few really interesting features in the dataset that I'll want to explore moving forward in this notebook. I'll break the features down roughly by category here.

**Geographical:**
 * from_station_id
 * from_station_name
 * latitude_start
 * longitude_start
 * to_station_id
 * to_station_name
 * latitude_end
 * longitude_end
 
**Weather:**
 * temperature
 * events
 
**Datetime:**
 * year
 * month
 * week
 * day
 * hour
 * starttime
 * stoptime
 * tripduration
 
**User-Specific:**
 * usertype
 * gender

### Feature Engineering

There are a couple of columns that we can build from our current set of features.

In [10]:
def haversine(row):
    lon1 = row['longitude_start']
    lat1 = row['latitude_start']
    lon2 = row['longitude_end']
    lat2 = row['latitude_end']
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

data_df['displacement'] = data_df.apply(lambda row: haversine(row), axis=1)
data_df['rate'] = data_df['displacement'].div(data_df['tripduration']).multiply(60)

KeyboardInterrupt: 
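One likely reason the cell above had to be interrupted is that `DataFrame.apply(..., axis=1)` calls the Python function once per row. A possible alternative, sketched here rather than taken from the original notebook, is to run the same haversine formula on whole columns with NumPy. The function name `haversine_km` is hypothetical; the demo coordinates are the first two rows of the preview above plus one zero-length trip.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6367  # same radius as the row-wise version


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for arrays of coordinates in degrees."""
    # Cast to float64 first so float32 columns do not lose precision
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype='float64'))
        for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


demo = pd.DataFrame({
    'latitude_start': [41.939365, 41.86458, 41.877749],
    'longitude_start': [-87.668385, -87.64693, -87.649633],
    'latitude_end': [41.945512, 41.869482, 41.877749],
    'longitude_end': [-87.64598, -87.655486, -87.649633],
})

# One call over whole columns instead of one Python call per row
demo['displacement'] = haversine_km(
    demo['latitude_start'], demo['longitude_start'],
    demo['latitude_end'], demo['longitude_end'],
)
print(demo['displacement'])
```

The same call should work on `data_df` directly, since the column names match those used by the row-wise function.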

## General Data Distribution

In [None]:
num_df = data_df.drop(
    [
        'trip_id', 
        'usertype',
        'events',
        'from_station_id',
        'from_station_name',
        'to_station_id',
        'to_station_name',
        'dpcapacity_start',
        'dpcapacity_end'
    ], axis=1
)

In [None]:
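# `n` is never defined in this notebook; it appears to be the 1-in-n sampling
# step from the earlier import method. Deriving it from row counts is an assumption.
n = n_rows // len(data_df)

if n >= 1000:
    sns.pairplot(num_df)
else:
    print('Not plotting to save time')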

### Thoughts:
* Divvy popularity grew over the course of data collection
* People like riding bikes in the summer
* People take bikes to and from work (9AM and 5PM)
* People don't like to ride bikes for more than about 10 minutes
* People like riding bikes when it's about 70 degrees out
* People seem to ride north more than south
* People seem to ride east more than west

In [None]:
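# Same guard as the pairplot cell; `n` is undefined in this notebook, so it is
# approximated from the row counts here (an assumption).
n = n_rows // len(data_df)

if n >= 1000:
    g = sns.PairGrid(num_df)
    g.map_diag(sns.kdeplot)
    g.map_offdiag(sns.kdeplot, n_levels=6)
else:
    print('Not plotting to save time')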

### Weather Analysis

**Vanilla Distributions**

In [None]:
sns.distplot(
    data_df.temperature
)

In [None]:
sns.catplot(
    x='events',
    data=data_df,
    kind='count',
    palette='ch:.25'
)

In [None]:
sns.catplot(
    x='events',
    y='tripduration',
    data=data_df
)

In [None]:
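# Keyword arguments keep this call working on seaborn versions that
# no longer accept x and y positionally.
sns.jointplot(
    x=data_df['temperature'], 
    y=data_df['tripduration'],
    kind='kde'
)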

**Conditional Splitting**

In [None]:
hot_rides  = data_df[data_df['temperature'] > 80]
cold_rides = data_df[data_df['temperature'] < 10]
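The two masks above can also be expressed as a single label column, which makes it easy to compare the groups side by side. This is only a sketch on made-up values; the `band` column and the `mild` label are not part of the original notebook, while the thresholds (below 10, above 80) are the ones used above.

```python
import numpy as np
import pandas as pd

# Made-up temperatures and trip durations for illustration
demo = pd.DataFrame({
    'temperature': [5.0, 9.9, 45.0, 80.0, 81.0, 95.0],
    'tripduration': [6.0, 8.0, 12.0, 15.0, 20.0, 25.0],
})

# Conditions are checked in order; rows matching neither fall back to 'mild'
conditions = [demo['temperature'] < 10, demo['temperature'] > 80]
demo['band'] = np.select(conditions, ['cold', 'hot'], default='mild')

summary = demo.groupby('band')['tripduration'].agg(['count', 'mean'])
print(summary)
```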

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
# catplot is figure-level and does not draw on a passed `ax`;
# countplot is the axes-level equivalent of kind='count'.
sns.countplot(
    x='temperature',
    data=hot_rides,
    palette='ch:.25',
    ax=ax
)
print(hot_rides['temperature'].value_counts())
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
# catplot is figure-level and does not draw on a passed `ax`;
# countplot is the axes-level equivalent of kind='count'.
sns.countplot(
    x='temperature',
    data=cold_rides,
    palette='ch:.25',
    ax=ax
)
print(cold_rides['temperature'].value_counts())
plt.show()

In [None]:
with sns.axes_style("white"):
    sns.jointplot(
        x=data_df['temperature'], 
        y=data_df['tripduration'], 
        kind="hex", 
        color="k",
    )

### Distance Analysis

**Vanilla Distributions**

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

sns.distplot(data_df.displacement, ax=ax)

plt.show()

In [None]:
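fig, ax = plt.subplots(figsize=(10,10))

# catplot is figure-level and does not draw on a passed `ax`;
# stripplot is the axes-level equivalent of catplot's default kind.
sns.stripplot(
    x='events',
    y='displacement',
    data=data_df,
    ax=ax
)

plt.show()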