# Part I - Ford GoBike Dataset Exploration
## By Rellika Kisyula

## Introduction
The Ford GoBike dataset contains anonymized trip data for the bike-sharing system from June 2017 to April 2019. <p style="color:red"> **However, I decided to use only the data from 2018 (January 2018 to December 2018).**</p> The data includes information on individual bike rides such as trip duration, start and end time, start and end station, bike ID, and user type. Additionally, demographic data such as birth year and gender is provided for some users.

- `duration_sec`: The duration of the bike ride in seconds
- `start_time`: The date and time the bike ride started
- `end_time`: The date and time the bike ride ended
- `start_station_id`: The ID number of the station where the ride started
- `start_station_name`: The name of the station where the ride started
- `start_station_latitude`: The latitude of the station where the ride started
- `start_station_longitude`: The longitude of the station where the ride started
- `end_station_id`: The ID number of the station where the ride ended
- `end_station_name`: The name of the station where the ride ended
- `end_station_latitude`: The latitude of the station where the ride ended
- `end_station_longitude`: The longitude of the station where the ride ended
- `bike_id`: The ID number of the bike used in the ride
- `user_type`: The type of user, either "Subscriber" (members with monthly or annual memberships) or "Customer" (casual riders who purchase a single ride or day pass)
- `member_birth_year`: The birth year of the user (not available for every ride)
- `member_gender`: The gender of the user (not available for every ride)

These columns provide information on the duration and location of the bike ride, the bike and station used, and some demographic information on the users.

### Extra Packages
We will calculate the distance between the start and end stations using the `haversine` package. To install it, run the following command in the terminal:

`pip install haversine`

In [None]:
%pip install haversine

### Importing Packages

In [80]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
# import the haversine package
from haversine import haversine

### Base Color
The base color for this project is `#1F77B4`.

In [81]:
base_color = sb.color_palette()[0]

### Downloading the Dataset
I manually downloaded the datasets from the [System Data | Bay Wheels | Lyft](https://www.lyft.com/bikes/bay-wheels/system-data) page. The datasets come as zip files, which I stored in the `data/zip_files` folder. I extracted them and saved the CSV files in the `data` folder next to this notebook.

### Unzipping the Dataset
Suppose the zip files are stored in `./data/zip_files` with names like `201801-fordgobike-tripdata.csv.zip`, `201802-fordgobike-tripdata.csv.zip`, etc. The following code extracts all of them into the `./data/data_files` folder.

In [82]:
# Unzip zip files in the data/zip_files folder into the data/data_files folder
import zipfile
import os

# create a list of all zip files in the zip_files folder (skip anything that is not a .zip)
zip_files = [f for f in os.listdir('./data/zip_files') if f.endswith('.zip')]

# loop through the list of zip files
for zip_file in zip_files:
    # create a full path to the zip file
    zip_path = os.path.join('./data/zip_files', zip_file)
    # extract the zip file to the data/data_files folder
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall('./data/data_files')

> **Note:** The code above is adapted from [How to unzip multiple files in a folder using Python?](https://stackoverflow.com/questions/3451111/unzipping-files-in-python)

> **Note:** The folder `data/data_files` is not included in the repository because it contains the extracted csv files. These csv files can be generated by running the code above.

### Combining the Datasets
I combined the datasets by reading each CSV file in the `./data/data_files` folder into its own pandas dataframe and concatenating the results. I then saved the combined dataframe to the `data` folder as `bike_data.csv`.

In [83]:
# Read the data files from the data/data_files folder
january = pd.read_csv('./data/data_files/201801-fordgobike-tripdata.csv')
january.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
6296,178,2018-01-30 16:55:08.8090,2018-01-30 16:58:07.8020,182,19th Street BART Station,37.809013,-122.268247,180,Telegraph Ave at 23rd St,37.812678,-122.268773,152,Subscriber,1989.0,Male,No
17535,1742,2018-01-27 12:09:03.7450,2018-01-27 12:38:06.4410,119,18th St at Noe St,37.761047,-122.432642,70,Central Ave at Fell St,37.773311,-122.444293,1327,Customer,,,No
32689,582,2018-01-23 08:49:26.5720,2018-01-23 08:59:08.7800,122,19th St at Mission St,37.760299,-122.418892,60,8th St at Ringold St,37.77452,-122.409449,353,Subscriber,1991.0,Male,No
37241,610,2018-01-22 08:01:13.1060,2018-01-22 08:11:23.8990,89,Division St at Potrero Ave,37.769218,-122.407646,5,Powell St BART Station (Market St at 5th St),37.783899,-122.408445,45,Customer,,,No
93468,7699,2018-01-01 19:53:16.4740,2018-01-01 22:01:35.9910,21,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,8,The Embarcadero at Vallejo St,37.799953,-122.398525,2945,Customer,,,No


In [84]:
september = pd.read_csv('./data/data_files/201809-fordgobike-tripdata.csv')
september.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
90984,193,2018-09-17 08:31:51.8550,2018-09-17 08:35:05.1720,318.0,San Carlos St at Market St,37.330698,-121.888979,310.0,San Fernando St at 4th St,37.335885,-121.88566,2529,Subscriber,1990.0,Male,Yes
83135,657,2018-09-18 08:52:05.4800,2018-09-18 09:03:03.2710,14.0,Clay St at Battery St,37.795001,-122.39997,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,1758,Customer,1953.0,Male,No
173018,787,2018-09-04 12:46:41.4690,2018-09-04 12:59:48.7740,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,13.0,Commercial St at Montgomery St,37.794231,-122.402923,3895,Subscriber,1989.0,Male,No
177339,154,2018-09-03 16:32:55.4910,2018-09-03 16:35:30.3950,52.0,McAllister St at Baker St,37.777416,-122.441838,53.0,Grove St at Divisadero,37.775946,-122.437777,2028,Subscriber,1993.0,Male,No
93243,242,2018-09-16 17:01:01.8580,2018-09-16 17:05:04.3830,281.0,9th St at San Fernando St,37.338395,-121.880797,311.0,Paseo De San Antonio at 2nd St,37.333798,-121.886943,1588,Customer,1984.0,Male,No


**Instead of reading the data files one by one, we can use a for loop to read all the files**

In [85]:
# create a list of all data files in the data_files folder
data_files = os.listdir('./data/data_files')

In [86]:
# Function to loop through the data files and read them into a list of dataframes
def read_data_files(data_files):
    # create an empty list to store the dataframes
    dataframe_list = []
    # loop through the list of data files
    for data_file in data_files:
        # ignore if it is not a csv file
        if not data_file.endswith('.csv'):
            continue
        # create a full path to the data file
        data_path = os.path.join('./data/data_files', data_file)
        # read the data file and append it to the list of dataframes
        dataframe_list.append(pd.read_csv(data_path))
    # return the list of dataframes
    return dataframe_list

In [87]:
dataframes = read_data_files(data_files)
# concatenate the dataframes into one dataframe
bike_data = pd.concat(dataframes, ignore_index=True)
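
As a hedged alternative to the manual loop, the read-and-concatenate step can be sketched with `pathlib`. The helper name `load_monthly_csvs` is an assumption made for this example and is not part of the notebook. Sorting the file names makes the row order reproducible, which `os.listdir` does not guarantee.

```python
from pathlib import Path

import pandas as pd


def load_monthly_csvs(folder):
    """Read every CSV in `folder` and stack them into one dataframe.

    Hypothetical helper for illustration only; the notebook itself
    uses os.listdir together with read_data_files.
    """
    # sort the paths so the concatenation order does not depend on the OS
    csv_paths = sorted(Path(folder).glob('*.csv'))
    frames = [pd.read_csv(path) for path in csv_paths]
    # ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_monthly_csvs('./data/data_files')` would return the combined frame. Because the files are sorted, the row order (and therefore the index values in sampled outputs) may differ from the outputs recorded in this notebook.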

In [88]:
bike_data.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
1339572,640,2018-08-13 21:20:05.9510,2018-08-13 21:30:45.9680,195.0,Bay Pl at Vernon St,37.812314,-122.260779,162.0,Franklin St at 9th St,37.800516,-122.27208,1250,Subscriber,1988.0,Male,Yes
1156212,321,2018-07-12 17:01:13.5340,2018-07-12 17:06:34.8930,37.0,2nd St at Folsom St,37.785,-122.395936,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,3326,Subscriber,1981.0,Male,No
1095755,564,2018-07-21 15:10:58.6860,2018-07-21 15:20:23.1020,81.0,Berry St at 4th St,37.77588,-122.39317,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,1618,Subscriber,1988.0,Male,No
1802763,859,2018-04-16 16:43:20.6470,2018-04-16 16:57:40.1910,58.0,Market St at 10th St,37.776619,-122.417385,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,647,Subscriber,1969.0,Male,No
669614,414,2018-05-23 17:41:52.0110,2018-05-23 17:48:46.3290,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,3854,Customer,,,No


In [89]:
bike_data.shape

(1863721, 16)

> To confirm that all the rows of each dataset were added to the combined dataframe, let's compare the number of rows in the combined dataframe with the sum of the number of rows in the individual dataframes.

In [90]:
number_of_rows = []
# Loop through the list of dataframes and print the shape of each dataframe
for dataframe in dataframes:
    print(dataframe.shape)
    number_of_rows.append(dataframe.shape[0])
print(number_of_rows)
# Confirm that sum of the number of rows in each dataframe is equal to the number of rows in the concatenated dataframe
sum(number_of_rows) == bike_data.shape[0]

(106718, 16)
(134135, 16)
(186217, 16)
(195968, 16)
(179125, 16)
(131363, 16)
(94802, 16)
(199222, 16)
(192162, 16)
(201458, 16)
(111382, 16)
(131169, 16)
[106718, 134135, 186217, 195968, 179125, 131363, 94802, 199222, 192162, 201458, 111382, 131169]


True

## Data Preparation

**The following are the changes made to the dataset before saving it:**
1. Calculate the distance travelled from the station coordinates using the `haversine` package
2. Calculate the age of the users from the `member_birth_year` column
3. Extract the hour, day of the week, and month from the `start_time` column
4. Create the period of day (`period_of_day`) column from the `hour` column

### 1: Calculate distance travelled using the `haversine` package
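I decided to find the distance each rider travelled. I used the `haversine` package, which applies the Haversine formula, to calculate the great-circle distance in kilometers between the start and end stations of each ride. This is the straight-line distance between the two stations, not the length of the route actually ridden.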

In [91]:
# Create a new column `distance` which is the distance between the start and end station
bike_data['distance'] = bike_data.apply(lambda x: haversine((x['start_station_latitude'], x['start_station_longitude']),
                                                                    (x['end_station_latitude'], x['end_station_longitude'])), axis=1)   
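
Applying `haversine` row by row over roughly 1.86 million rides can be slow. As a sketch only, the same great-circle formula can be written with NumPy so that it operates on whole columns at once. The function name `haversine_km` and the mean Earth radius of 6371.0088 km are assumptions made for this example; the results should agree closely with the `haversine` package, which also reports kilometers by default.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km between points given in decimal degrees.

    Illustrative helper (not from the notebook); accepts scalars,
    NumPy arrays, or pandas Series of equal length.
    """
    # convert every coordinate from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # central angle times radius gives the arc length
    return 2 * radius_km * np.arcsin(np.sqrt(a))
```

With the dataframe from this notebook, the call would pass the four coordinate columns, for example `haversine_km(bike_data['start_station_latitude'], bike_data['start_station_longitude'], bike_data['end_station_latitude'], bike_data['end_station_longitude'])`.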

### 2: Calculate the age of the users

In [92]:
# Create a new column `member_age` which is the difference between 2018 and `member_birth_year`
bike_data['member_age'] = 2018 - bike_data.member_birth_year
# Select the column member_birth_year and member_age
bike_data[['member_birth_year', 'member_age']].sample(10)

Unnamed: 0,member_birth_year,member_age
9942,1976.0,42.0
1628966,1964.0,54.0
1295504,1983.0,35.0
1228287,1986.0,32.0
709065,1987.0,31.0
663747,1986.0,32.0
1123852,1977.0,41.0
1107968,1972.0,46.0
1299553,1979.0,39.0
750046,1988.0,30.0


### 3: Extract the hour, day of the week, and month from the `start_time` column

In [93]:
bike_data['start_time'] = pd.to_datetime(bike_data['start_time'])
# Extract the month name from the start_time column
bike_data['month_of_year'] = bike_data['start_time'].dt.strftime('%B')

# Extract the day of the week from the start_time column
bike_data['day_of_week'] = bike_data['start_time'].dt.strftime('%A')

# Extract the hour from the start_time column
bike_data['hour'] = bike_data['start_time'].dt.strftime('%H')
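
For comparison, pandas also exposes these parts directly through the `.dt` accessor. The sketch below uses two made-up timestamps rather than the real data. Note that `.dt.hour` returns integers, whereas `strftime('%H')` returns zero-padded strings such as `'09'`.

```python
import pandas as pd

# two made-up timestamps, used only to illustrate the accessors
start_time = pd.to_datetime(pd.Series(['2018-09-26 17:35:23.713',
                                       '2018-12-12 09:13:47.971']))

month_of_year = start_time.dt.month_name()   # full month names
day_of_week = start_time.dt.day_name()       # full weekday names
hour_as_int = start_time.dt.hour             # integers 0-23
hour_as_str = start_time.dt.strftime('%H')   # zero-padded strings
```

Either form works with the `int(x)` conversion used in the next step; the integer form simply avoids the conversion.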

In [94]:
# Select the columns start_time, month_of_year, day_of_week, hour
bike_data[['start_time', 'month_of_year', 'day_of_week', 'hour']].sample(10)

Unnamed: 0,start_time,month_of_year,day_of_week,hour
266797,2018-09-26 17:35:23.713,September,Wednesday,17
929505,2018-12-02 15:08:45.386,December,Sunday,15
881349,2018-12-12 09:13:47.971,December,Wednesday,9
348427,2018-09-13 20:42:35.645,September,Thursday,20
1448320,2018-10-27 10:23:38.654,October,Saturday,10
1174052,2018-07-10 11:40:11.508,July,Tuesday,11
1184773,2018-07-09 06:42:40.909,July,Monday,6
750054,2018-05-09 19:45:40.943,May,Wednesday,19
1013831,2018-01-06 18:30:34.609,January,Saturday,18
266822,2018-09-26 17:39:55.348,September,Wednesday,17


In [95]:
# Using the `month_of_year` column, perform a value count
bike_data.month_of_year.value_counts()

October      201458
July         199222
June         195968
August       192162
September    186217
May          179125
November     134135
December     131363
April        131169
March        111382
February     106718
January       94802
Name: month_of_year, dtype: int64

### 4: Create the period of day (`period_of_day`) column from the `hour` column

As listed above, I want to assign each ride to a period of the day: **Midnight** (12am to 3am), **Early Morning** (3am to 6am), **Morning** (6am to 12pm), **Afternoon** (12pm to 3pm), **Evening** (3pm to 6pm), **Night** (6pm to 9pm), or **Late Night** (9pm to 12am). I will use the `hour` column extracted from `start_time` to place each ride in one of these periods.

In [96]:
# Using the `hour`, generate a new column `period_of_day` which is the period of the day
# Early Morning: 3am - 6am, Morning: 6am - 12pm, Afternoon: 12pm - 3pm, Evening: 3pm - 6pm, Night: 6pm - 9pm, Late Night: 9pm - 12am, Midnight: 12am - 3am
bike_data['period_of_day'] = bike_data['hour'].apply(lambda x: 'Early Morning' if 3 <= int(x) < 6 else 'Morning' if 6 <= int(x) < 12 else 'Afternoon' if 12 <= int(x) < 15 else 'Evening' if 15 <= int(x) < 18 else 'Night' if 18 <= int(x) < 21 else 'Late Night' if 21 <= int(x) < 24 else 'Midnight')
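
The nested conditional above can be hard to read. One possible alternative, shown here as a sketch on a few sample hours rather than the full dataframe, is `pd.cut` with left-closed bins that mirror the same boundaries. The variable names are illustrative.

```python
import pandas as pd

# sample hours covering both edges of every period
hours = pd.Series([0, 2, 3, 5, 6, 11, 12, 14, 15, 17, 18, 20, 21, 23])

# bin edges in hours; right=False makes each bin [start, end)
bins = [0, 3, 6, 12, 15, 18, 21, 24]
labels = ['Midnight', 'Early Morning', 'Morning', 'Afternoon',
          'Evening', 'Night', 'Late Night']

period_of_day = pd.cut(hours, bins=bins, labels=labels, right=False)
```

In the notebook the `hour` column holds strings at this point, so it would need `.astype(int)` before being passed to `pd.cut`. The result is an ordered categorical, which also keeps the periods in chronological order when plotting.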

In [99]:
# Select the columns start_time, hour, period_of_day
bike_data[['start_time', 'hour', 'period_of_day']].sample(10)

Unnamed: 0,start_time,hour,period_of_day
1471246,2018-10-23 21:22:20.858,21,Late Night
937152,2018-01-31 08:48:50.630,8,Morning
1650605,2018-03-25 13:53:34.639,13,Afternoon
1374284,2018-08-08 09:52:35.797,9,Morning
607563,2018-06-03 19:11:00.031,19,Night
1074940,2018-07-24 19:03:20.307,19,Night
641284,2018-05-29 10:37:30.453,10,Morning
1455010,2018-10-26 08:17:33.150,8,Morning
1418982,2018-08-01 08:02:10.306,8,Morning
1099078,2018-07-20 19:36:41.853,19,Night


In [100]:
# Use the period_of_day and perform a value count
bike_data.period_of_day.value_counts()

Morning          669598
Evening          459806
Night            341831
Afternoon        261127
Late Night        96657
Midnight          19815
Early Morning     14887
Name: period_of_day, dtype: int64

### Saving the `bike_data` dataframe to a CSV file
I saved the combined dataframe as `bike_data.csv` in the `data` folder.

In [101]:
# Save the combined dataframe as bike_data.csv in the data folder
bike_data.to_csv('data/bike_data.csv', index=False)

## Preliminary Wrangling

In [102]:
# Read the bike_data.csv file into a dataframe
combined_bike_data = pd.read_csv('data/bike_data.csv')
combined_bike_data.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,...,user_type,member_birth_year,member_gender,bike_share_for_all_trip,distance,member_age,month_of_year,day_of_week,hour,period_of_day
437592,324,2018-06-29 08:42:30.458,2018-06-29 08:47:54.8900,16.0,Steuart St at Market St,37.79413,-122.39443,28.0,The Embarcadero at Bryant St,37.787168,...,Customer,1995.0,Female,No,0.953356,23.0,June,Friday,8,Morning
1482564,1573,2018-10-22 12:11:47.049,2018-10-22 12:38:00.1380,126.0,Esprit Park,37.761634,-122.390648,19.0,Post St at Kearny St,37.788975,...,Customer,1983.0,Male,No,3.241758,35.0,October,Monday,12,Afternoon
183312,2103,2018-11-09 23:50:11.592,2018-11-10 00:25:15.5790,41.0,Golden Gate Ave at Polk St,37.78127,-122.41874,58.0,Market St at 10th St,37.776619,...,Customer,,,No,0.530702,,November,Friday,23,Late Night
879176,948,2018-12-12 16:40:18.227,2018-12-12 16:56:06.2770,58.0,Market St at 10th St,37.776619,-122.417385,126.0,Esprit Park,37.761634,...,Subscriber,1985.0,Male,No,2.880893,33.0,December,Wednesday,16,Evening
1254209,737,2018-08-28 07:57:18.615,2018-08-28 08:09:36.0360,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,9.0,Broadway at Battery St,37.798572,...,Subscriber,1984.0,Other,No,2.483611,34.0,August,Tuesday,7,Morning


### What is the structure of your dataset?

In [103]:
# Check the shape of the data
combined_bike_data.shape

(1863721, 22)

In [104]:
# Get general information about the dataframe, including the number of non-null values in each column
combined_bike_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1863721 entries, 0 to 1863720
Data columns (total 22 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   duration_sec             1863721 non-null  int64  
 1   start_time               1863721 non-null  object 
 2   end_time                 1863721 non-null  object 
 3   start_station_id         1851950 non-null  float64
 4   start_station_name       1851950 non-null  object 
 5   start_station_latitude   1863721 non-null  float64
 6   start_station_longitude  1863721 non-null  float64
 7   end_station_id           1851950 non-null  float64
 8   end_station_name         1851950 non-null  object 
 9   end_station_latitude     1863721 non-null  float64
 10  end_station_longitude    1863721 non-null  float64
 11  bike_id                  1863721 non-null  int64  
 12  user_type                1863721 non-null  object 
 13  member_birth_year        1753003 non-null 

> I have observed the following properties of the dataset:
> - The `start_time` and `end_time` columns are stored as `object` (string) type. I will convert them to datetime so that time-based analysis is possible.
> - The `start_station_id`, `start_station_name`, `end_station_id`, and `end_station_name` columns each have 11,771 missing values. I will drop the rows with missing values.
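
The two fixes noted above could look roughly like the following. This is a minimal sketch on a tiny made-up dataframe, not the cleaning code used later in the notebook; the station names and variable names are invented for the example.

```python
import pandas as pd

# tiny made-up sample with the same column names as the real data
rides = pd.DataFrame({
    'start_time': ['2018-02-28 23:59:47.097', '2018-04-01 00:02:03.827'],
    'end_time': ['2018-03-01 00:09:45.1870', '2018-04-01 00:05:16.4430'],
    'start_station_id': [284.0, None],
    'start_station_name': ['Station A', None],
    'end_station_id': [114.0, None],
    'end_station_name': ['Station B', None],
})

# convert the timestamp columns from object (string) to datetime64
for column in ['start_time', 'end_time']:
    rides[column] = pd.to_datetime(rides[column])

# drop rides that are missing any of the station fields
station_columns = ['start_station_id', 'start_station_name',
                   'end_station_id', 'end_station_name']
rides_clean = rides.dropna(subset=station_columns).reset_index(drop=True)
```

Alternatively, passing `parse_dates=['start_time', 'end_time']` to `pd.read_csv` parses the timestamps at load time, which avoids the separate conversion step.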

In [105]:
# View descriptive statistics for numeric variables
combined_bike_data.describe()

Unnamed: 0,duration_sec,start_station_id,start_station_latitude,start_station_longitude,end_station_id,end_station_latitude,end_station_longitude,bike_id,member_birth_year,distance,member_age,hour
count,1863721.0,1851950.0,1863721.0,1863721.0,1851950.0,1863721.0,1863721.0,1863721.0,1753003.0,1863721.0,1753003.0,1863721.0
mean,857.3026,119.6744,37.76678,-122.3492,118.173,37.7669,-122.3487,2296.851,1983.088,1.590931,34.91204,13.51437
std,2370.379,100.3976,0.1057689,0.1654634,100.4403,0.1056483,0.1650597,1287.733,10.44289,1.028364,10.44289,4.742223
min,61.0,3.0,37.26331,-122.4737,3.0,37.26331,-122.4737,11.0,1881.0,0.0,18.0,0.0
25%,350.0,33.0,37.77106,-122.4114,30.0,37.77106,-122.4094,1225.0,1978.0,0.8675446,27.0,9.0
50%,556.0,89.0,37.78107,-122.3974,88.0,37.78127,-122.3971,2338.0,1985.0,1.374592,33.0,14.0
75%,872.0,186.0,37.79625,-122.2865,183.0,37.79728,-122.2894,3333.0,1991.0,2.087456,40.0,17.0
max,86366.0,381.0,45.51,-73.57,381.0,45.51,-73.57,6234.0,2000.0,65.30934,137.0,23.0


> The dataset contains 1,863,721 rows. The raw files have 16 columns, and in the **Data Preparation** section I added 6 more (`member_age`, `distance`, `hour`, `period_of_day`, `day_of_week`, and `month_of_year`), for 22 columns in total. The features are described above and fall into the following groups:
> - **trip duration**: This includes columns for the duration of the bike ride in seconds, the date and time the bike ride started, and the date and time the bike ride ended.
> - **start station**: This includes columns for the ID number of the station where the ride started, the name of the station where the ride started, and the latitude and longitude of the station where the ride started.
> - **end station**: This includes columns for the ID number of the station where the ride ended, the name of the station where the ride ended, and the latitude and longitude of the station where the ride ended.
> - **bike**: This includes columns for the ID number of the bike used in the ride.
> - **customer data**: This includes whether the rider was a customer or a subscriber, along with the rider's birth year, gender, and derived age.

### What is/are the main feature(s) of interest in your dataset?
> 1. The dataset records the start time and location of each ride, so I can explore when and where most trips are taken and identify popular starting points and times for the bike-sharing system. I will start by analyzing `start_station_name`. I will then use the latitude and longitude of the start and end stations to calculate the distance travelled.

> 2. In addition to identifying popular starting points and times, I am also interested in exploring the characteristics of the riders such as age, sex, and user type. This can be done by analyzing the `member_birth_year`, `member_gender`, and `user_type` columns. Understanding the demographics of the riders can help me identify patterns in bike usage and preferences.

> 3. I am also interested in the time of day at which rides start (for example, **morning**, **afternoon**, **evening**, or **night**), as well as the day of the week and the month of the year. These can reveal daily, weekly, and seasonal patterns in bike usage.

> 4. Finally, I plan to analyze the duration of the trips for each starting point and time. This information can help me understand how long riders typically use the bikes and whether there are any patterns or trends in trip duration based on the starting location or time. Overall, I am looking forward to exploring this dataset and gaining insights into the usage patterns of the Ford GoBike system.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?
> To observe the points mentioned above, we can use the following features of the Ford GoBike dataset:
> 1. To identify the popular starting points and times, we can use the `start_time`, `start_station_id`, `start_station_name`, `start_station_latitude`, and `start_station_longitude` columns.
> 2. To explore the characteristics of the riders, we can use the `member_birth_year`, `member_gender`, and `user_type` columns.
> 3. To explore the time of the day, the day of the week, and the month of the year, we can use the `hour`, `period_of_day`, `day_of_week`, and `month_of_year` columns extracted from the `start_time` column in the **data preparation phase**.
> 4. To analyze the duration of the trips for each starting point and time, we can use the `duration_sec` column, together with the `start_time` and `start_station_id` columns, to match each ride's duration with its starting point and time.
>
> By examining these features of the dataset, we can gain insights into when and where most trips are taken, the characteristics of the riders, and the duration of the trips for each starting point and time. These insights can help us understand usage patterns and preferences, and identify opportunities for improving the Ford GoBike system.
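
As a rough illustration of how these columns might be used, the sketch below counts rides per start station and computes the median duration per period of day on a small made-up dataframe. The values are invented; only the column names come from the dataset.

```python
import pandas as pd

# made-up rides; only the column names match the real dataset
rides = pd.DataFrame({
    'start_station_name': ['Market St at 10th St', 'Market St at 10th St',
                           'Esprit Park', 'San Pedro Square'],
    'period_of_day': ['Morning', 'Evening', 'Morning', 'Morning'],
    'duration_sec': [600, 900, 300, 450],
})

# most popular starting points: rides per start station, largest first
top_stations = rides['start_station_name'].value_counts()

# typical trip length in seconds for each period of the day
median_duration = rides.groupby('period_of_day')['duration_sec'].median()
```

The median is used here instead of the mean because trip durations in this dataset are heavily right-skewed (the maximum is close to 24 hours).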

> #### Expectations before univariate, bivariate, and multivariate exploration
> 1. I expect the most popular start times to be in the morning and afternoon, and the most popular starting points to be near the city center.
> 2. I expect younger riders to outnumber older riders among subscribers.
> 3. Comparing subscribers and customers, I expect subscribers to outnumber customers.
> 4. Regarding gender, I expect male riders to be more frequent than female riders.

## Data Wrangling

### Data Assessment

In [106]:
# Let's see the top 5 rows
combined_bike_data.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,...,user_type,member_birth_year,member_gender,bike_share_for_all_trip,distance,member_age,month_of_year,day_of_week,hour,period_of_day
0,598,2018-02-28 23:59:47.097,2018-03-01 00:09:45.1870,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,114.0,Rhode Island St at 17th St,37.764478,...,Subscriber,1988.0,Male,No,2.272573,30.0,February,Wednesday,23,Late Night
1,943,2018-02-28 23:21:16.495,2018-02-28 23:36:59.9740,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,324.0,Union Square (Powell St at Post St),37.7883,...,Customer,1987.0,Male,No,1.889595,31.0,February,Wednesday,23,Late Night
2,18587,2018-02-28 18:20:55.190,2018-02-28 23:30:42.9250,93.0,4th St at Mission Bay Blvd S,37.770407,-122.391198,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,...,Customer,1986.0,Female,No,2.790685,32.0,February,Wednesday,18,Night
3,18558,2018-02-28 18:20:53.621,2018-02-28 23:30:12.4500,93.0,4th St at Mission Bay Blvd S,37.770407,-122.391198,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,...,Customer,1981.0,Male,No,2.790685,37.0,February,Wednesday,18,Night
4,885,2018-02-28 23:15:12.858,2018-02-28 23:29:58.6080,308.0,San Pedro Square,37.336802,-121.89409,297.0,Locust St at Grant St,37.32298,...,Subscriber,1976.0,Female,Yes,1.6306,42.0,February,Wednesday,23,Late Night


In [107]:
# Let's see the last 5 rows
combined_bike_data.tail(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,...,user_type,member_birth_year,member_gender,bike_share_for_all_trip,distance,member_age,month_of_year,day_of_week,hour,period_of_day
1863716,887,2018-04-01 00:00:08.163,2018-04-01 00:14:55.5710,194.0,Lakeshore Ave at Trestle Glen Rd,37.811081,-122.243268,215.0,34th St at Telegraph Ave,37.822547,...,Subscriber,1988.0,Male,Yes,2.392783,30.0,April,Sunday,0,Midnight
1863717,387,2018-04-01 00:08:06.367,2018-04-01 00:14:33.9940,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,79.0,7th St at Brannan St,37.773492,...,Subscriber,1995.0,Female,No,0.814323,23.0,April,Sunday,0,Midnight
1863718,480,2018-04-01 00:06:21.281,2018-04-01 00:14:21.4600,44.0,Civic Center/UN Plaza BART Station (Market St ...,37.781074,-122.411738,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,...,Customer,1984.0,Male,No,1.351422,34.0,April,Sunday,0,Midnight
1863719,503,2018-04-01 00:04:36.805,2018-04-01 00:13:00.1020,100.0,Bryant St at 15th St,37.7671,-122.410662,93.0,4th St at Mission Bay Blvd S,37.770407,...,Subscriber,1984.0,Female,No,1.749894,34.0,April,Sunday,0,Midnight
1863720,192,2018-04-01 00:02:03.827,2018-04-01 00:05:16.4430,176.0,MacArthur BART Station,37.82841,-122.266315,215.0,34th St at Telegraph Ave,37.822547,...,Customer,1984.0,Male,No,0.651878,34.0,April,Sunday,0,Midnight


In [108]:
# Let's see the number of unique values in each column
combined_bike_data.nunique()

duration_sec                 16709
start_time                 1863584
end_time                   1863610
start_station_id               331
start_station_name             348
start_station_latitude         369
start_station_longitude        370
end_station_id                 331
end_station_name               348
end_station_latitude           370
end_station_longitude          371
bike_id                       5054
user_type                        2
member_birth_year               86
member_gender                    3
bike_share_for_all_trip          2
distance                     19145
member_age                      86
month_of_year                   12
day_of_week                      7
hour                            24
period_of_day                    7
dtype: int64

In [109]:
# Let's see the number of missing values in each column
combined_bike_data.isnull().sum()

duration_sec                    0
start_time                      0
end_time                        0
start_station_id            11771
start_station_name          11771
start_station_latitude          0
start_station_longitude         0
end_station_id              11771
end_station_name            11771
end_station_latitude            0
end_station_longitude           0
bike_id                         0
user_type                       0
member_birth_year          110718
member_gender              110367
bike_share_for_all_trip         0
distance                        0
member_age                 110718
month_of_year                   0
day_of_week                     0
hour                            0
period_of_day                   0
dtype: int64

In [110]:
# Let's see the number of fully duplicated rows
combined_bike_data.duplicated().sum()

0

In [111]:
# Let's see a random sample of 5 rows from the dataframe
combined_bike_data.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,...,user_type,member_birth_year,member_gender,bike_share_for_all_trip,distance,member_age,month_of_year,day_of_week,hour,period_of_day
514584,819,2018-06-18 11:32:11.150,2018-06-18 11:45:50.1860,86.0,Market St at Dolores St,37.769305,-122.426826,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,...,Subscriber,1969.0,Male,Yes,2.863178,49.0,June,Monday,11,Morning
1256930,465,2018-08-27 17:43:47.344,2018-08-27 17:51:33.0090,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,...,Subscriber,1986.0,Male,No,1.360623,32.0,August,Monday,17,Evening
264272,456,2018-09-27 06:45:15.917,2018-09-27 06:52:52.0770,215.0,34th St at Telegraph Ave,37.822547,-122.266318,182.0,19th Street BART Station,37.809013,...,Subscriber,1987.0,Female,No,1.514527,31.0,September,Thursday,6,Morning
744012,2444,2018-05-10 17:01:08.856,2018-05-10 17:41:52.9320,78.0,Folsom St at 9th St,37.773717,-122.411647,75.0,Market St at Franklin St,37.773793,...,Customer,1981.0,Female,No,0.843136,37.0,May,Thursday,17,Evening
1060202,414,2018-07-26 17:44:55.212,2018-07-26 17:51:49.7900,58.0,Market St at 10th St,37.776619,-122.417385,34.0,Father Alfred E Boeddeker Park,37.783988,...,Subscriber,1976.0,Male,Yes,0.928824,42.0,July,Thursday,17,Evening


In [112]:
# Let's see the column dtypes of the dataframe using info() with verbose=True
combined_bike_data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1863721 entries, 0 to 1863720
Data columns (total 22 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   duration_sec             int64  
 1   start_time               object 
 2   end_time                 object 
 3   start_station_id         float64
 4   start_station_name       object 
 5   start_station_latitude   float64
 6   start_station_longitude  float64
 7   end_station_id           float64
 8   end_station_name         object 
 9   end_station_latitude     float64
 10  end_station_longitude    float64
 11  bike_id                  int64  
 12  user_type                object 
 13  member_birth_year        float64
 14  member_gender            object 
 15  bike_share_for_all_trip  object 
 16  distance                 float64
 17  member_age               float64
 18  month_of_year            object 
 19  day_of_week              object 
 20  hour                     int64  
 21  period_o