# Chicago Divvy Bike Ride-Sharing Analysis

![alt text](images/divvy.jpg)
![alt_text](images/divvy_map.jpg)

### Introduction
This notebook is based on the [Divvy Ride-Sharing Kaggle dataset and competition](https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data) for Divvy bike rides in Chicago, IL. It re-creates a program I wrote about a year ago and lost when a hard drive died. I've since learned my lesson, so this one is going straight to git. Some of the features of the previous program will be implemented again, but I want to divide the notebooks more atomically so that each serves a specific purpose and keeps a small scope.

### Goals of the Notebook
 * Conditionally split dataset
 * Determine which stations are busiest/where they usually lead to
 * Show distributions of the data and branch out with the Seaborn library
 * Map rides using Basemap, and really improve my skills with it
 * Perform machine learning and create models using Scikit-Learn
 

### Importing 
Just a typical data science library stack for the EDA notebook.

In [1]:
import random

import numpy   as np
import pandas  as pd
import seaborn as sns
import matplotlib.pyplot as plt

#pip3 install https://github.com/matplotlib/basemap/archive/master.zip

### CSV File Exploration and Importing

The first thing we'll do is peek into the CSV files provided by Kaggle to see what kind of data we're looking at. The dataset is very large (roughly 9.5 million Divvy bike rides in `data/data.csv`, and I don't have a week to spend operating on all of them), so for this stage of the analysis we'll work with a sample of the data.

To sample during the import, I'm just going to keep every n-th line of the file. Strictly speaking this is a systematic sample rather than a random one, since the same rows are selected on every run.

In [31]:
!wc data/data.csv
!wc data/data_raw.csv

 9495236 105797302 2084674565 data/data.csv
 13774716 160096066 3483195736 data/data_raw.csv


In [2]:
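n = 1000  # keep every n-th data row

csv_filename = 'data/data.csv'

# Count the lines in the file (header included) and close it afterwards.
with open(csv_filename) as f:
    num_lines = sum(1 for _ in f)

# Row 0 is the header; skip every other row whose index is not a multiple of n.
skip_ix = [x for x in range(1, num_lines) if x % n != 0]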

data_df = pd.read_csv(
    csv_filename,
    skiprows=skip_ix
)

print("Columns in List: \n")
for column in list(data_df):
    print(column)

Columns in List
trip_id
year
month
week
day
hour
usertype
gender
starttime
stoptime
tripduration
temperature
events
from_station_id
from_station_name
latitude_start
longitude_start
dpcapacity_start
to_station_id
to_station_name
latitude_end
longitude_end
dpcapacity_end
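
Keeping every n-th row always selects the same rows. If a genuinely random sample is wanted instead, one option is to draw the row indices to keep with the `random` module and pass a callable to the `skiprows` argument of `pd.read_csv`. The sketch below is only an illustration: `read_random_rows`, `k`, and `seed` are hypothetical names, not part of the original notebook.

```python
import random

import pandas as pd


def read_random_rows(csv_path, k, seed=None):
    """Read k data rows chosen uniformly at random (header always kept)."""
    # Count the lines in the file, header included.
    with open(csv_path) as f:
        num_lines = sum(1 for _ in f)

    # Data rows have indices 1 .. num_lines - 1; row 0 is the header.
    rng = random.Random(seed)
    keep = set(rng.sample(range(1, num_lines), k))

    # read_csv calls the function with each row index and skips the row
    # whenever it returns True.
    return pd.read_csv(csv_path, skiprows=lambda i: i > 0 and i not in keep)
```

For a sample the same size as the one above, a call such as `read_random_rows('data/data.csv', k=9495, seed=0)` should work; fixing `seed` keeps the sample reproducible between runs.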


In [33]:
data_df.head()

,trip_id,year,month,week,day,hour,usertype,gender,starttime,stoptime,...,from_station_id,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_id,to_station_name,latitude_end,longitude_end,dpcapacity_end
0,2353212,2014,6,27,0,17,Subscriber,Male,2014-06-30 17:47:00,2014-06-30 17:52:00,...,51,Clark St & Randolph St,41.884576,-87.63189,31.0,43,Michigan Ave & Washington St,41.883893,-87.624649,43.0
1,2351643,2014,6,27,0,16,Subscriber,Male,2014-06-30 16:57:00,2014-06-30 17:03:00,...,49,Dearborn St & Monroe St,41.88132,-87.629521,27.0,91,Clinton St & Washington Blvd,41.88338,-87.64117,31.0
2,2349620,2014,6,27,0,14,Subscriber,Male,2014-06-30 14:58:00,2014-06-30 15:03:00,...,197,Michigan Ave & Madison St,41.882134,-87.625125,19.0,174,Canal St & Madison St,41.882091,-87.639833,23.0
3,2347261,2014,6,27,0,11,Subscriber,Female,2014-06-30 11:37:00,2014-06-30 11:52:00,...,211,St Clair St & Erie St,41.894448,-87.622663,19.0,49,Dearborn St & Monroe St,41.88132,-87.629521,27.0
4,2345061,2014,6,27,0,8,Subscriber,Male,2014-06-30 08:31:00,2014-06-30 08:40:00,...,264,Stetson Ave & South Water St,41.886835,-87.62232,19.0,321,Wabash Ave & 8th St,41.871962,-87.626106,19.0


In [34]:
data_df.describe()

,trip_id,year,month,week,day,hour,tripduration,temperature,from_station_id,latitude_start,longitude_start,dpcapacity_start,to_station_id,latitude_end,longitude_end,dpcapacity_end
count,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0,9495.0
mean,9861927.0,2015.737441,7.161559,29.383149,2.684887,13.630542,11.352547,63.011448,179.279305,41.900204,-87.644597,21.380832,179.651395,41.900417,-87.644647,21.2594
std,4680399.0,1.075634,2.708081,11.776481,1.890381,4.851057,7.110063,17.20015,121.730907,0.034862,0.021657,7.577247,122.679316,0.035232,0.021766,7.517441
min,1110978.0,2014.0,1.0,1.0,0.0,0.0,2.0,-9.9,2.0,41.746559,-87.80287,11.0,2.0,41.746559,-87.80224,9.0
25%,5943099.0,2015.0,5.0,21.5,1.0,9.0,6.025,52.0,76.0,41.881032,-87.654787,15.0,75.0,41.881032,-87.654787,15.0
50%,10059920.0,2016.0,7.0,30.0,3.0,15.0,9.616667,66.9,164.0,41.892278,-87.641066,19.0,164.0,41.89257,-87.641066,19.0
75%,13833300.0,2017.0,9.0,38.0,4.0,17.0,14.9,75.9,268.0,41.920082,-87.629928,23.0,272.0,41.920771,-87.629928,23.0
max,17536030.0,2017.0,12.0,53.0,6.0,23.0,59.35,95.0,625.0,42.063598,-87.559275,55.0,625.0,42.063999,-87.565688,55.0
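
The summary statistics suggest that `tripduration` is recorded in minutes (the maximum is 59.35), although nothing shown so far states the unit. One way to check that assumption is to recompute the duration from `starttime` and `stoptime`. A minimal sketch, using the five rides shown by `data_df.head()`; `duration_minutes` is a hypothetical helper:

```python
import pandas as pd


def duration_minutes(df):
    """Ride length in minutes, computed from the start and stop timestamps."""
    start = pd.to_datetime(df["starttime"])
    stop = pd.to_datetime(df["stoptime"])
    return (stop - start).dt.total_seconds() / 60.0


# Start and stop times of the first five sampled rides shown above.
head_df = pd.DataFrame({
    "starttime": [
        "2014-06-30 17:47:00",
        "2014-06-30 16:57:00",
        "2014-06-30 14:58:00",
        "2014-06-30 11:37:00",
        "2014-06-30 08:31:00",
    ],
    "stoptime": [
        "2014-06-30 17:52:00",
        "2014-06-30 17:03:00",
        "2014-06-30 15:03:00",
        "2014-06-30 11:52:00",
        "2014-06-30 08:40:00",
    ],
})

computed = duration_minutes(head_df)
print(computed.tolist())  # [5.0, 6.0, 5.0, 15.0, 9.0]
```

On the real data, comparing `duration_minutes(data_df)` with `data_df['tripduration']` would confirm or refute the minutes assumption. The timestamps shown are rounded to the minute while `tripduration` has fractional values, so the two should be compared with some tolerance.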


**Thoughts:** The dataset contains a few really interesting features that I'll want to explore moving forward in this notebook. I'll break the features down by category here.

**Geographical:**
 * from_station_id
 * from_station_name
 * latitude_start
 * longitude_start
 * to_station_id
 * to_station_name
 * latitude_end
 * longitude_end
 
**Weather:**
 * temperature
 * events
 
**Datetime:**
 * year
 * month
 * week
 * day
 * hour
 * starttime
 * stoptime
 * tripduration
 
**User-Specific:**
 * usertype
 * gender
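
For later steps it may be convenient to keep these groups in code. The sketch below stores them in a dictionary (`feature_groups` is a hypothetical name, not something defined elsewhere in the notebook) and lists the columns that the breakdown above leaves out.

```python
# Feature groups as listed above.
feature_groups = {
    "geographical": [
        "from_station_id", "from_station_name",
        "latitude_start", "longitude_start",
        "to_station_id", "to_station_name",
        "latitude_end", "longitude_end",
    ],
    "weather": ["temperature", "events"],
    "datetime": [
        "year", "month", "week", "day", "hour",
        "starttime", "stoptime", "tripduration",
    ],
    "user": ["usertype", "gender"],
}

# Column names as printed earlier in the notebook.
all_columns = [
    "trip_id", "year", "month", "week", "day", "hour",
    "usertype", "gender", "starttime", "stoptime", "tripduration",
    "temperature", "events",
    "from_station_id", "from_station_name",
    "latitude_start", "longitude_start", "dpcapacity_start",
    "to_station_id", "to_station_name",
    "latitude_end", "longitude_end", "dpcapacity_end",
]

# Columns that do not appear in any of the groups.
grouped = {col for cols in feature_groups.values() for col in cols}
ungrouped = [col for col in all_columns if col not in grouped]
print(ungrouped)  # ['trip_id', 'dpcapacity_start', 'dpcapacity_end']
```

With the groups defined this way, an expression like `data_df[feature_groups["weather"]]` selects just the weather columns.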

## Data Distribution