# Part I - Ford GoBike Dataset Exploration
## By Rellika Kisyula

## Introduction
The Ford GoBike dataset contains anonymized trip data for the bike-sharing system from June 2017 to April 2019. <p style="color:red"> **However, I decided to use only the data from 2018 (January 2018 to December 2018).**</p> The data includes information on individual bike rides such as trip duration, start and end time, start and end station, bike ID, and user type. Additionally, demographic data such as birth year and gender is provided for some users. The main columns are:

- `duration_sec`: The duration of the bike ride in seconds
- `start_time`: The date and time the bike ride started
- `end_time`: The date and time the bike ride ended
- `start_station_id`: The ID number of the station where the ride started
- `start_station_name`: The name of the station where the ride started
- `start_station_latitude`: The latitude of the station where the ride started
- `start_station_longitude`: The longitude of the station where the ride started
- `end_station_id`: The ID number of the station where the ride ended
- `end_station_name`: The name of the station where the ride ended
- `end_station_latitude`: The latitude of the station where the ride ended
- `end_station_longitude`: The longitude of the station where the ride ended
- `bike_id`: The ID number of the bike used in the ride
- `user_type`: The type of user, either "Subscriber" (members with monthly or annual memberships) or "Customer" (casual riders who purchase a single ride or day pass)
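- `member_birth_year`: The birth year of the user (not recorded for every ride)
- `member_gender`: The gender of the user (not recorded for every ride)
- `bike_share_for_all_trip`: Whether the ride was taken under the Bike Share for All program ("Yes" or "No")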

These columns provide information on the duration and location of the bike ride, the bike and station used, and some demographic information on the users.

### Extra Packages
We will be calculating the distance between the start and end stations using the `haversine` package. To install this package, run the following command in the terminal:

`pip install haversine`

In [None]:
%pip install haversine
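
For reference, the great-circle distance that the package is used for can be sketched with the standard library alone. The function name `haversine_km` and the Earth radius constant are illustrative assumptions, not the package's API; the coordinates are those of two stations that appear in the January data.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant for this sketch

def haversine_km(start, end):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine formula
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# 19th Street BART Station -> Telegraph Ave at 23rd St
distance = haversine_km((37.809013, -122.268247), (37.812678, -122.268773))
print(f"{distance:.2f} km")
```

For these two nearby stations the result is a little over 0.4 km, which is a plausible length for a three-minute ride.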

### Importing Packages

In [80]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
# import the haversine package
from haversine import haversine

### Base Color
The base color for this project is `#1F77B4`.

In [81]:
base_color = sb.color_palette()[0]
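
As a quick check of the hex code, the RGB tuple can be converted by hand. The tuple below is what seaborn's default palette is assumed to return as its first entry, and `rgb_to_hex` is a hypothetical helper, not a seaborn function.

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to an uppercase hex string."""
    return '#' + ''.join(f'{round(channel * 255):02X}' for channel in rgb)

# assumed value of sb.color_palette()[0] with the default palette
default_blue = (0.12156862745098039, 0.4666666666666667, 0.7058823529411765)
print(rgb_to_hex(default_blue))
```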

### Downloading the Dataset
I manually downloaded the datasets from the [System Data | Bay Wheels | Lyft](https://www.lyft.com/bikes/bay-wheels/system-data) page. The datasets come as zip files, which I stored in the `data/zip_files` folder; the `data` folder sits in the same directory as this notebook. The extracted csv files go into the `data/data_files` folder.

### Unzipping the Dataset
Imagine you have zip files stored in `./data/zip_files` with names like `201801-fordgobike-tripdata.csv.zip`, `201802-fordgobike-tripdata.csv.zip`, etc. The following code extracts all of them into the `./data/data_files` folder.

In [82]:
# Unzip zip files in the data/zip_files folder into the data/data_files folder
import zipfile
import os

# create a list of all zip files in the zip_files folder
zip_files = os.listdir('./data/zip_files')

# loop through the list of zip files
for zip_file in zip_files:
    # skip anything that is not a zip archive (e.g. hidden system files)
    if not zip_file.endswith('.zip'):
        continue
    # create a full path to the zip file
    zip_path = os.path.join('./data/zip_files', zip_file)
    # extract the zip file to the data/data_files folder
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall('./data/data_files')

> **Note:** The code above is adapted from [How to unzip multiple files in a folder using Python?](https://stackoverflow.com/questions/3451111/unzipping-files-in-python)

> **Note:** The folder `data/data_files` is not included in the repository because it contains the extracted csv files. These csv files can be generated by running the code above.

### Combining the Datasets
I combined the datasets by reading each csv file in the `./data/data_files` folder into its own pandas dataframe. I then concatenated those individual dataframes into one and saved it as `bike_data.csv` in the `data` folder.

In [83]:
# Read the data files from the data/data_files folder
january = pd.read_csv('./data/data_files/201801-fordgobike-tripdata.csv')
january.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
6296,178,2018-01-30 16:55:08.8090,2018-01-30 16:58:07.8020,182,19th Street BART Station,37.809013,-122.268247,180,Telegraph Ave at 23rd St,37.812678,-122.268773,152,Subscriber,1989.0,Male,No
17535,1742,2018-01-27 12:09:03.7450,2018-01-27 12:38:06.4410,119,18th St at Noe St,37.761047,-122.432642,70,Central Ave at Fell St,37.773311,-122.444293,1327,Customer,,,No
32689,582,2018-01-23 08:49:26.5720,2018-01-23 08:59:08.7800,122,19th St at Mission St,37.760299,-122.418892,60,8th St at Ringold St,37.77452,-122.409449,353,Subscriber,1991.0,Male,No
37241,610,2018-01-22 08:01:13.1060,2018-01-22 08:11:23.8990,89,Division St at Potrero Ave,37.769218,-122.407646,5,Powell St BART Station (Market St at 5th St),37.783899,-122.408445,45,Customer,,,No
93468,7699,2018-01-01 19:53:16.4740,2018-01-01 22:01:35.9910,21,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,8,The Embarcadero at Vallejo St,37.799953,-122.398525,2945,Customer,,,No
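
The `duration_sec` values in the sample above appear to be the whole-second difference between `end_time` and `start_time`. A minimal sketch of that check, using only the standard library; `ride_seconds` is a hypothetical helper and the timestamp format string is an assumption based on the values shown.

```python
from datetime import datetime

# assumed timestamp layout, e.g. 2018-01-30 16:55:08.8090
TIME_FORMAT = '%Y-%m-%d %H:%M:%S.%f'

def ride_seconds(start_time, end_time):
    """Elapsed whole seconds between two timestamp strings (fraction dropped)."""
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    return int((end - start).total_seconds())

# (start_time, end_time, duration_sec) copied from two of the sampled rows
samples = [
    ('2018-01-30 16:55:08.8090', '2018-01-30 16:58:07.8020', 178),
    ('2018-01-27 12:09:03.7450', '2018-01-27 12:38:06.4410', 1742),
]
for start_time, end_time, duration_sec in samples:
    # print the recomputed duration next to the recorded one
    print(ride_seconds(start_time, end_time), duration_sec)
```

Both recomputed values match the recorded `duration_sec`, which suggests the column drops the fractional second rather than rounding.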


In [84]:
september = pd.read_csv('./data/data_files/201809-fordgobike-tripdata.csv')
september.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
90984,193,2018-09-17 08:31:51.8550,2018-09-17 08:35:05.1720,318.0,San Carlos St at Market St,37.330698,-121.888979,310.0,San Fernando St at 4th St,37.335885,-121.88566,2529,Subscriber,1990.0,Male,Yes
83135,657,2018-09-18 08:52:05.4800,2018-09-18 09:03:03.2710,14.0,Clay St at Battery St,37.795001,-122.39997,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,1758,Customer,1953.0,Male,No
173018,787,2018-09-04 12:46:41.4690,2018-09-04 12:59:48.7740,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,13.0,Commercial St at Montgomery St,37.794231,-122.402923,3895,Subscriber,1989.0,Male,No
177339,154,2018-09-03 16:32:55.4910,2018-09-03 16:35:30.3950,52.0,McAllister St at Baker St,37.777416,-122.441838,53.0,Grove St at Divisadero,37.775946,-122.437777,2028,Subscriber,1993.0,Male,No
93243,242,2018-09-16 17:01:01.8580,2018-09-16 17:05:04.3830,281.0,9th St at San Fernando St,37.338395,-121.880797,311.0,Paseo De San Antonio at 2nd St,37.333798,-121.886943,1588,Customer,1984.0,Male,No


**Instead of reading the data files one by one, we can use a for loop to read all the files**

In [85]:
# create a list of all data files in the data_files folder
data_files = os.listdir('./data/data_files')
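
`os.listdir` returns names in arbitrary order and may include files that are not trip data. A minimal sketch of selecting only the 2018 trip files, assuming they all follow the `YYYYMM-fordgobike-tripdata.csv` pattern seen above; `select_trip_files` is a hypothetical helper and the example list is made up for illustration.

```python
import re

# assumed naming pattern, e.g. 201801-fordgobike-tripdata.csv
TRIP_FILE_PATTERN = re.compile(r'^(\d{4})(\d{2})-fordgobike-tripdata\.csv$')

def select_trip_files(file_names, year):
    """Return the trip csv names for the given year, sorted by month."""
    selected = []
    for name in file_names:
        match = TRIP_FILE_PATTERN.match(name)
        # keep only names that match the pattern and belong to the requested year
        if match and int(match.group(1)) == year:
            selected.append(name)
    return sorted(selected)

example_names = [
    '201809-fordgobike-tripdata.csv',
    '.DS_Store',
    '201801-fordgobike-tripdata.csv',
    '201901-fordgobike-tripdata.csv',
]
print(select_trip_files(example_names, 2018))
```

Sorting the names keeps the months in calendar order, so the per-file row counts printed later would line up with January through December.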

In [86]:
# Function to loop through the data files and read them into a dataframe
def read_data_files(data_files):
    # create an empty list to store the dataframes
    dataframe_list = []
    # loop through the list of data files
    for data_file in data_files:
        # ignore if it is not a csv file
        if data_file[-3:] != 'csv':
            continue
        # create a full path to the data file
        data_path = './data/data_files/' + data_file
        # read the data file and append it to the list of dataframes
        dataframe_list.append(pd.read_csv(data_path))
    # return the list of dataframes
    return dataframe_list

In [87]:
dataframes = read_data_files(data_files)
# concatenate the dataframes into one dataframe
bike_data = pd.concat(dataframes, ignore_index=True)

In [88]:
bike_data.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
1339572,640,2018-08-13 21:20:05.9510,2018-08-13 21:30:45.9680,195.0,Bay Pl at Vernon St,37.812314,-122.260779,162.0,Franklin St at 9th St,37.800516,-122.27208,1250,Subscriber,1988.0,Male,Yes
1156212,321,2018-07-12 17:01:13.5340,2018-07-12 17:06:34.8930,37.0,2nd St at Folsom St,37.785,-122.395936,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,3326,Subscriber,1981.0,Male,No
1095755,564,2018-07-21 15:10:58.6860,2018-07-21 15:20:23.1020,81.0,Berry St at 4th St,37.77588,-122.39317,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,1618,Subscriber,1988.0,Male,No
1802763,859,2018-04-16 16:43:20.6470,2018-04-16 16:57:40.1910,58.0,Market St at 10th St,37.776619,-122.417385,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,647,Subscriber,1969.0,Male,No
669614,414,2018-05-23 17:41:52.0110,2018-05-23 17:48:46.3290,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,3854,Customer,,,No


In [89]:
bike_data.shape

(1863721, 16)

> To confirm that all the rows of each dataset were added to the combined dataframe, let's compare the number of rows in the combined dataframe with the sum of the number of rows in the individual dataframes.

In [90]:
number_of_rows = []
# Loop through the list of dataframes and print the shape of each dataframe
for dataframe in dataframes:
    print(dataframe.shape)
    number_of_rows.append(dataframe.shape[0])
print(number_of_rows)
# Confirm that sum of the number of rows in each dataframe is equal to the number of rows in the concatenated dataframe
sum(number_of_rows) == bike_data.shape[0]

(106718, 16)
(134135, 16)
(186217, 16)
(195968, 16)
(179125, 16)
(131363, 16)
(94802, 16)
(199222, 16)
(192162, 16)
(201458, 16)
(111382, 16)
(131169, 16)
[106718, 134135, 186217, 195968, 179125, 131363, 94802, 199222, 192162, 201458, 111382, 131169]


True
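
The same check can be reproduced by hand from the printed shapes. This is only an arithmetic sketch using the numbers from the output above; it does not read any files.

```python
# row counts copied from the printed shapes above
row_counts = [106718, 134135, 186217, 195968, 179125, 131363,
              94802, 199222, 192162, 201458, 111382, 131169]

# number of rows reported by bike_data.shape
combined_rows = 1863721

total_rows = sum(row_counts)
print(len(row_counts), total_rows, total_rows == combined_rows)
```

Twelve dataframes, one per month of 2018, add up to the 1,863,721 rows of the combined dataframe.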