# Udacity Project - Bikesharing Dataset

## Data Wrangling and Data Visualisation

This data set includes information about individual rides made in a bike-sharing system covering the greater Boston area. 

Table of Contents
- Data Wrangling
- Exploratory Data Analysis
- Explanatory Data Analysis

### Data Wrangling


In [112]:
# import libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

**Structure of the dataset**

I have 12 CSV files, one for each month of 2020. Ideally, I would like to have all 12 in one dataset, so it's easier to do explanatory analysis.
The files for January to April have 15 columns, while those for May to December have only 14. I need to make sure I'm not missing data, so I will reduce the ones with 15 columns to 14 before I merge them together. When I merge them, I will do it by quarters to help explore the data better. Different months have different numbers of entries, but that's normal and I will not manipulate that.

I will look at the datasets' statistics, data types, and missing and duplicated values, and also assess the datasets visually to see if any cleaning is needed.
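As a rough sketch of how the twelve files could be loaded and compared in one pass instead of one cell per month: the helper names `load_monthly_files` and `summarise_shapes` are hypothetical, and the two tiny CSV files written to a temporary directory are made-up stand-ins for the real monthly files, which are assumed to follow the `YYYYMM-bluebikes-tripdata.csv` naming used in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_monthly_files(data_dir, year=2020):
    """Read every '<year><month>-bluebikes-tripdata.csv' found in data_dir.

    Returns a dict mapping the 'YYYYMM' prefix to its DataFrame.
    """
    frames = {}
    pattern = f"{year}??-bluebikes-tripdata.csv"
    for path in sorted(Path(data_dir).glob(pattern)):
        month_key = path.name[:6]  # e.g. '202001'
        frames[month_key] = pd.read_csv(path)
    return frames


def summarise_shapes(frames):
    """Return one row per month with its row and column counts."""
    return pd.DataFrame(
        {
            "rows": {key: df.shape[0] for key, df in frames.items()},
            "columns": {key: df.shape[1] for key, df in frames.items()},
        }
    )


# Demo with made-up stand-in files (not the real trip data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"tripduration": [478, 363], "birth year": [1969, 2000], "gender": [0, 1]}
    ).to_csv(Path(tmp) / "202001-bluebikes-tripdata.csv", index=False)
    pd.DataFrame(
        {"tripduration": [1160], "postal code": ["02128"]}
    ).to_csv(Path(tmp) / "202005-bluebikes-tripdata.csv", index=False)

    frames = load_monthly_files(tmp)
    shape_summary = summarise_shapes(frames)

print(shape_summary)
```

Keeping the frames in a dict keyed by month means later steps can loop over them rather than repeating the same call twelve times.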


**Main feature(s) of interest**

- When are the most trips taken in terms of time of day, day of the week, or month of the year?
- Which are the most used start and end stations?
- How long does the average trip take?
- Do these depend on whether a user is a subscriber or a customer?

**Features in the dataset that will help support the investigation**

Trip duration, Start time, End time, Start station, End Station, Usertype


### Assessment part 1

In [114]:
# load in each month's dataset into a pandas dataframe
df_01 = pd.read_csv('202001-bluebikes-tripdata.csv')
df_01.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,478,2020-01-01 00:04:05.8090,2020-01-01 00:12:04.2370,366,Broadway T Stop,42.342781,-71.057473,93,JFK/UMass T Stop,42.32034,-71.05118,6005,Customer,1969,0
1,363,2020-01-01 00:04:45.6990,2020-01-01 00:10:49.0400,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3168,Subscriber,2000,1
2,284,2020-01-01 00:06:07.0630,2020-01-01 00:10:51.9240,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3985,Subscriber,2001,1
3,193,2020-01-01 00:06:13.8550,2020-01-01 00:09:27.8320,396,Main St at Beacon St,42.40933,-71.063819,387,Norman St at Kelvin St,42.409859,-71.066319,2692,Subscriber,1978,1
4,428,2020-01-01 00:07:25.2950,2020-01-01 00:14:33.7800,60,Charles Circle - Charles St at Cambridge St,42.360793,-71.07119,49,Stuart St at Charles St,42.351146,-71.066289,4978,Subscriber,1987,1


In [116]:
# assess shape
df_01.shape

(128598, 15)

In [118]:
# assess column names in each data set to see which column is missing
df_01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128598 entries, 0 to 128597
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             128598 non-null  int64  
 1   starttime                128598 non-null  object 
 2   stoptime                 128598 non-null  object 
 3   start station id         128598 non-null  int64  
 4   start station name       128598 non-null  object 
 5   start station latitude   128598 non-null  float64
 6   start station longitude  128598 non-null  float64
 7   end station id           128598 non-null  int64  
 8   end station name         128598 non-null  object 
 9   end station latitude     128598 non-null  float64
 10  end station longitude    128598 non-null  float64
 11  bikeid                   128598 non-null  int64  
 12  usertype                 128598 non-null  object 
 13  birth year               128598 non-null  int64  
 14  gend

In [120]:
df_02 = pd.read_csv('202002-bluebikes-tripdata.csv')
df_02.shape

(133235, 15)

In [122]:
# assess column names in each data set to see which column is missing
df_02.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133235 entries, 0 to 133234
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             133235 non-null  int64  
 1   starttime                133235 non-null  object 
 2   stoptime                 133235 non-null  object 
 3   start station id         133235 non-null  int64  
 4   start station name       133235 non-null  object 
 5   start station latitude   133235 non-null  float64
 6   start station longitude  133235 non-null  float64
 7   end station id           133235 non-null  int64  
 8   end station name         133235 non-null  object 
 9   end station latitude     133235 non-null  float64
 10  end station longitude    133235 non-null  float64
 11  bikeid                   133235 non-null  int64  
 12  usertype                 133235 non-null  object 
 13  birth year               133235 non-null  int64  
 14  gend

In [124]:
df_03 = pd.read_csv('202003-bluebikes-tripdata.csv')
df_03.shape

(107350, 15)

In [126]:
# assess column names in each data set to see which column is missing
df_03.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107350 entries, 0 to 107349
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             107350 non-null  int64  
 1   starttime                107350 non-null  object 
 2   stoptime                 107350 non-null  object 
 3   start station id         107350 non-null  int64  
 4   start station name       107350 non-null  object 
 5   start station latitude   107350 non-null  float64
 6   start station longitude  107350 non-null  float64
 7   end station id           107350 non-null  int64  
 8   end station name         107350 non-null  object 
 9   end station latitude     107350 non-null  float64
 10  end station longitude    107350 non-null  float64
 11  bikeid                   107350 non-null  int64  
 12  usertype                 107350 non-null  object 
 13  birth year               107350 non-null  int64  
 14  gend

In [128]:
df_04 = pd.read_csv('202004-bluebikes-tripdata.csv')
df_04.shape

(46793, 15)

In [130]:
# assess column names in each data set to see which column is missing
df_04.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46793 entries, 0 to 46792
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tripduration             46793 non-null  int64  
 1   starttime                46793 non-null  object 
 2   stoptime                 46793 non-null  object 
 3   start station id         46793 non-null  int64  
 4   start station name       46793 non-null  object 
 5   start station latitude   46793 non-null  float64
 6   start station longitude  46793 non-null  float64
 7   end station id           46793 non-null  int64  
 8   end station name         46793 non-null  object 
 9   end station latitude     46793 non-null  float64
 10  end station longitude    46793 non-null  float64
 11  bikeid                   46793 non-null  int64  
 12  usertype                 46793 non-null  object 
 13  birth year               46793 non-null  int64  
 14  gender                

In [132]:
df_05 = pd.read_csv('202005-bluebikes-tripdata.csv')
df_05.shape

(124879, 14)

In [134]:
# assess column names in each data set to see which column is missing
df_05.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124879 entries, 0 to 124878
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             124879 non-null  int64  
 1   starttime                124879 non-null  object 
 2   stoptime                 124879 non-null  object 
 3   start station id         124879 non-null  int64  
 4   start station name       124879 non-null  object 
 5   start station latitude   124879 non-null  float64
 6   start station longitude  124879 non-null  float64
 7   end station id           124879 non-null  int64  
 8   end station name         124879 non-null  object 
 9   end station latitude     124879 non-null  float64
 10  end station longitude    124879 non-null  float64
 11  bikeid                   124879 non-null  int64  
 12  usertype                 124879 non-null  object 
 13  postal code              110498 non-null  object 
dtypes: f

In [136]:
df_06 = pd.read_csv('202006-bluebikes-tripdata.csv')
df_06.shape

(191843, 14)

In [138]:
# assess column names in each data set to see which column is missing
df_06.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191843 entries, 0 to 191842
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             191843 non-null  int64  
 1   starttime                191843 non-null  object 
 2   stoptime                 191843 non-null  object 
 3   start station id         191843 non-null  int64  
 4   start station name       191843 non-null  object 
 5   start station latitude   191843 non-null  float64
 6   start station longitude  191843 non-null  float64
 7   end station id           191843 non-null  int64  
 8   end station name         191843 non-null  object 
 9   end station latitude     191843 non-null  float64
 10  end station longitude    191843 non-null  float64
 11  bikeid                   191843 non-null  int64  
 12  usertype                 191843 non-null  object 
 13  postal code              170883 non-null  object 
dtypes: f

In [140]:
df_07 = pd.read_csv('202007-bluebikes-tripdata.csv')
df_07.shape

(259726, 14)

In [142]:
# assess column names in each data set to see which column is missing
df_07.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259726 entries, 0 to 259725
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             259726 non-null  int64  
 1   starttime                259726 non-null  object 
 2   stoptime                 259726 non-null  object 
 3   start station id         259726 non-null  int64  
 4   start station name       259726 non-null  object 
 5   start station latitude   259726 non-null  float64
 6   start station longitude  259726 non-null  float64
 7   end station id           259726 non-null  int64  
 8   end station name         259726 non-null  object 
 9   end station latitude     259726 non-null  float64
 10  end station longitude    259726 non-null  float64
 11  bikeid                   259726 non-null  int64  
 12  usertype                 259726 non-null  object 
 13  postal code              230479 non-null  object 
dtypes: f

In [144]:
df_08 = pd.read_csv('202008-bluebikes-tripdata.csv')
df_08.shape

(289033, 14)

In [146]:
# assess column names in each data set to see which column is missing
df_08.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289033 entries, 0 to 289032
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             289033 non-null  int64  
 1   starttime                289033 non-null  object 
 2   stoptime                 289033 non-null  object 
 3   start station id         289033 non-null  int64  
 4   start station name       289033 non-null  object 
 5   start station latitude   289033 non-null  float64
 6   start station longitude  289033 non-null  float64
 7   end station id           289033 non-null  int64  
 8   end station name         289033 non-null  object 
 9   end station latitude     289033 non-null  float64
 10  end station longitude    289033 non-null  float64
 11  bikeid                   289033 non-null  int64  
 12  usertype                 289033 non-null  object 
 13  postal code              264273 non-null  object 
dtypes: f

In [148]:
df_09 = pd.read_csv('202009-bluebikes-tripdata.csv')
df_09.shape

(307853, 14)

In [150]:
# assess column names in each data set to see which column is missing
df_09.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307853 entries, 0 to 307852
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             307853 non-null  int64  
 1   starttime                307853 non-null  object 
 2   stoptime                 307853 non-null  object 
 3   start station id         307853 non-null  int64  
 4   start station name       307853 non-null  object 
 5   start station latitude   307853 non-null  float64
 6   start station longitude  307853 non-null  float64
 7   end station id           307853 non-null  int64  
 8   end station name         307853 non-null  object 
 9   end station latitude     307853 non-null  float64
 10  end station longitude    307853 non-null  float64
 11  bikeid                   307853 non-null  int64  
 12  usertype                 307853 non-null  object 
 13  postal code              284701 non-null  object 
dtypes: f

In [152]:
df_10 = pd.read_csv('202010-bluebikes-tripdata.csv')
df_10.shape

(248424, 14)

In [154]:
# assess column names in each data set to see which column is missing
df_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248424 entries, 0 to 248423
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             248424 non-null  int64  
 1   starttime                248424 non-null  object 
 2   stoptime                 248424 non-null  object 
 3   start station id         248424 non-null  int64  
 4   start station name       248424 non-null  object 
 5   start station latitude   248424 non-null  float64
 6   start station longitude  248424 non-null  float64
 7   end station id           248424 non-null  int64  
 8   end station name         248424 non-null  object 
 9   end station latitude     248424 non-null  float64
 10  end station longitude    248424 non-null  float64
 11  bikeid                   248424 non-null  int64  
 12  usertype                 248424 non-null  object 
 13  postal code              230070 non-null  object 
dtypes: f

In [156]:
df_11 = pd.read_csv('202011-bluebikes-tripdata.csv')
df_11.shape

(161712, 14)

In [158]:
# assess column names in each data set to see which column is missing
df_11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161712 entries, 0 to 161711
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tripduration             161712 non-null  int64  
 1   starttime                161712 non-null  object 
 2   stoptime                 161712 non-null  object 
 3   start station id         161712 non-null  int64  
 4   start station name       161712 non-null  object 
 5   start station latitude   161712 non-null  float64
 6   start station longitude  161712 non-null  float64
 7   end station id           161712 non-null  int64  
 8   end station name         161712 non-null  object 
 9   end station latitude     161712 non-null  float64
 10  end station longitude    161712 non-null  float64
 11  bikeid                   161712 non-null  int64  
 12  usertype                 161712 non-null  object 
 13  postal code              151233 non-null  object 
dtypes: f

In [160]:
df_12 = pd.read_csv('202012-bluebikes-tripdata.csv')
df_12.shape

(74002, 14)

In [162]:
# assess column names in each data set to see which column is missing
df_12.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74002 entries, 0 to 74001
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tripduration             74002 non-null  int64  
 1   starttime                74002 non-null  object 
 2   stoptime                 74002 non-null  object 
 3   start station id         74002 non-null  int64  
 4   start station name       74002 non-null  object 
 5   start station latitude   74002 non-null  float64
 6   start station longitude  74002 non-null  float64
 7   end station id           74002 non-null  int64  
 8   end station name         74002 non-null  object 
 9   end station latitude     74002 non-null  float64
 10  end station longitude    74002 non-null  float64
 11  bikeid                   74002 non-null  int64  
 12  usertype                 74002 non-null  object 
 13  postal code              71109 non-null  object 
dtypes: float64(4), int64(4

### Summary of Assessment

**Quality Issues**
*All datasets*
- the `starttime` and `stoptime` columns are in object format; they should be datetime

**Tidiness issues**

- merge datasets (concat)
- remove the `birth year`, `postal code` and `gender` columns, as each of them only covers some months of the year, not the full year

I have to state here that I haven't gone through the whole assessment yet, such as checking for missing values. I would like to do that after merging the datasets, as it will be a lot easier that way.
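The partial coverage noted above can also be checked programmatically. The following is a minimal sketch: `columns_not_in_every_frame` is a hypothetical helper, and the two toy frames only imitate the column layout seen in the April and May `info()` outputs (their postal code value is made up).

```python
import pandas as pd


def columns_not_in_every_frame(frames):
    """Map each column that is missing from at least one frame
    to the keys of the frames that do contain it."""
    all_columns = set().union(*(df.columns for df in frames.values()))
    partial = {}
    for column in sorted(all_columns):
        present_in = [key for key, df in frames.items() if column in df.columns]
        if len(present_in) < len(frames):
            partial[column] = present_in
    return partial


# Toy frames imitating the two layouts (values are illustrative only)
toy_frames = {
    "202004": pd.DataFrame(
        {"tripduration": [478], "usertype": ["Customer"],
         "birth year": [1969], "gender": [0]}
    ),
    "202005": pd.DataFrame(
        {"tripduration": [1160], "usertype": ["Customer"],
         "postal code": ["02128"]}
    ),
}

partial_columns = columns_not_in_every_frame(toy_frames)
print(partial_columns)
```

Columns that appear in this result are the candidates for removal before or after the merge.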


### Cleaning

### Define 
- merge datasets (concat)

Action: find best method to merge datasets

### Code

In [180]:
# make a copy of all datasets to be sure I can revert if things go wrong
df_01_copy = df_01.copy()
df_02_copy = df_02.copy()
df_03_copy = df_03.copy()
df_04_copy = df_04.copy()
df_05_copy = df_05.copy()
df_06_copy = df_06.copy()
df_07_copy = df_07.copy()
df_08_copy = df_08.copy()
df_09_copy = df_09.copy()
df_10_copy = df_10.copy()
df_11_copy = df_11.copy()
df_12_copy = df_12.copy()

Check each copied dataset's shape so I can see if I got all the data after merging

In [177]:
df_01_copy.shape

(128598, 15)

In [181]:
df_02_copy.shape

(133235, 15)

In [182]:
df_03_copy.shape

(107350, 15)

In [183]:
df_04_copy.shape

(46793, 15)

In [184]:
df_05_copy.shape

(124879, 14)

In [185]:
df_06_copy.shape

(191843, 14)

In [186]:
df_07_copy.shape

(259726, 14)

In [187]:
df_08_copy.shape

(289033, 14)

In [188]:
df_09_copy.shape

(307853, 14)

In [189]:
df_10_copy.shape

(248424, 14)

In [190]:
df_11_copy.shape

(161712, 14)

In [191]:
df_12_copy.shape

(74002, 14)

Check if copies have the right columns

In [166]:
df_01_copy.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,478,2020-01-01 00:04:05.8090,2020-01-01 00:12:04.2370,366,Broadway T Stop,42.342781,-71.057473,93,JFK/UMass T Stop,42.32034,-71.05118,6005,Customer,1969,0
1,363,2020-01-01 00:04:45.6990,2020-01-01 00:10:49.0400,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3168,Subscriber,2000,1
2,284,2020-01-01 00:06:07.0630,2020-01-01 00:10:51.9240,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3985,Subscriber,2001,1
3,193,2020-01-01 00:06:13.8550,2020-01-01 00:09:27.8320,396,Main St at Beacon St,42.40933,-71.063819,387,Norman St at Kelvin St,42.409859,-71.066319,2692,Subscriber,1978,1
4,428,2020-01-01 00:07:25.2950,2020-01-01 00:14:33.7800,60,Charles Circle - Charles St at Cambridge St,42.360793,-71.07119,49,Stuart St at Charles St,42.351146,-71.066289,4978,Subscriber,1987,1


In [168]:
df_06_copy.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,postal code
0,1160,2020-06-01 00:00:25.6240,2020-06-01 00:19:45.7010,192,Purchase St at Pearl St,42.354659,-71.053181,366,Broadway T Stop,42.342781,-71.057473,2831,Customer,
1,1419,2020-06-01 00:01:09.3800,2020-06-01 00:24:48.9520,355,Bennington St at Constitution Beach,42.385224,-71.010631,355,Bennington St at Constitution Beach,42.385224,-71.010631,5696,Customer,2128.0
2,1093,2020-06-01 00:01:29.4790,2020-06-01 00:19:43.3530,131,Jackson Square T Stop,42.322931,-71.100141,331,Huntington Ave at Mass Art,42.336586,-71.09887,3922,Subscriber,10570.0
3,1067,2020-06-01 00:01:35.8540,2020-06-01 00:19:23.3250,131,Jackson Square T Stop,42.322931,-71.100141,331,Huntington Ave at Mass Art,42.336586,-71.09887,3361,Subscriber,10570.0
4,1391,2020-06-01 00:01:51.0390,2020-06-01 00:25:02.8460,355,Bennington St at Constitution Beach,42.385224,-71.010631,355,Bennington St at Constitution Beach,42.385224,-71.010631,3621,Customer,1902.0


In [203]:
# concat the data sets
df_concatenated_columns_all = pd.concat([
    df_01_copy, df_02_copy, df_03_copy, df_04_copy,
    df_05_copy, df_06_copy, df_07_copy, df_08_copy,
    df_09_copy, df_10_copy, df_11_copy, df_12_copy,
])


### Test

In [204]:
df_concatenated_columns_all.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,postal code
0,478,2020-01-01 00:04:05.8090,2020-01-01 00:12:04.2370,366,Broadway T Stop,42.342781,-71.057473,93,JFK/UMass T Stop,42.32034,-71.05118,6005,Customer,1969.0,0.0,
1,363,2020-01-01 00:04:45.6990,2020-01-01 00:10:49.0400,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3168,Subscriber,2000.0,1.0,
2,284,2020-01-01 00:06:07.0630,2020-01-01 00:10:51.9240,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3985,Subscriber,2001.0,1.0,
3,193,2020-01-01 00:06:13.8550,2020-01-01 00:09:27.8320,396,Main St at Beacon St,42.40933,-71.063819,387,Norman St at Kelvin St,42.409859,-71.066319,2692,Subscriber,1978.0,1.0,
4,428,2020-01-01 00:07:25.2950,2020-01-01 00:14:33.7800,60,Charles Circle - Charles St at Cambridge St,42.360793,-71.07119,49,Stuart St at Charles St,42.351146,-71.066289,4978,Subscriber,1987.0,1.0,


In [205]:
df_concatenated_columns_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073448 entries, 0 to 74001
Data columns (total 16 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   tripduration             int64  
 1   starttime                object 
 2   stoptime                 object 
 3   start station id         int64  
 4   start station name       object 
 5   start station latitude   float64
 6   start station longitude  float64
 7   end station id           int64  
 8   end station name         object 
 9   end station latitude     float64
 10  end station longitude    float64
 11  bikeid                   int64  
 12  usertype                 object 
 13  birth year               float64
 14  gender                   float64
 15  postal code              object 
dtypes: float64(6), int64(4), object(6)
memory usage: 268.9+ MB
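Two details of the `info()` output above follow from how `pd.concat` works: columns that are absent from some frames are filled with `NaN` (which turns integer columns such as `birth year` into `float64`), and each frame keeps its own row labels, which is why the index runs "0 to 74001" rather than up to the total row count. A small sketch with made-up toy frames illustrates both; `ignore_index=True` is shown as one possible way to get a fresh running index, not as what this notebook did.

```python
import pandas as pd

# Toy frames imitating the two column layouts (values are illustrative)
early = pd.DataFrame({"tripduration": [478, 363], "birth year": [1969, 2000]})
late = pd.DataFrame({"tripduration": [1160, 1419], "postal code": ["02128", "10570"]})

# Default behaviour: union of columns, original row labels preserved
combined = pd.concat([early, late])
print(combined.dtypes)
print(combined.index.tolist())

# Optional: renumber the rows from 0 to n-1
renumbered = pd.concat([early, late], ignore_index=True)
print(renumbered.index.tolist())
```

With duplicate row labels, label-based lookups such as `df.loc[0]` return one row per source file, so renumbering can be worth considering before any index-based work.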


In [206]:
df_concatenated_columns_all.shape

(2073448, 16)
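To confirm that no rows were lost in the merge, the row counts from the individual `shape` outputs above can be added up and compared with the merged shape. The dictionary below simply restates those numbers; the variable names are illustrative.

```python
# Row counts copied from the per-month shape outputs above
monthly_rows = {
    "202001": 128598, "202002": 133235, "202003": 107350, "202004": 46793,
    "202005": 124879, "202006": 191843, "202007": 259726, "202008": 289033,
    "202009": 307853, "202010": 248424, "202011": 161712, "202012": 74002,
}

total_rows = sum(monthly_rows.values())
print(total_rows)  # 2073448, matching the merged shape

# Rows from January to April, the only months carrying 'birth year' and 'gender'
first_four_months = sum(
    rows for month, rows in monthly_rows.items() if month <= "202004"
)
print(first_four_months)  # 415976
```

The second figure equals the `count` reported for `birth year` and `gender` in the `describe()` output further down, which is consistent with those columns existing only in the first four files.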

In [207]:
df_all = df_concatenated_columns_all.copy()

In [208]:
df_all.shape


(2073448, 16)

### Define
- remove birth year, gender and postal code columns

Action: 
drop these columns permanently

### Code

In [235]:
df_all.drop(['birth year', 'postal code', 'gender'], axis=1, inplace=True)

### Test

In [236]:
df_all.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype
0,478,2020-01-01 00:04:05.809,2020-01-01 00:12:04.237,366,Broadway T Stop,42.342781,-71.057473,93,JFK/UMass T Stop,42.32034,-71.05118,6005,Customer
1,363,2020-01-01 00:04:45.699,2020-01-01 00:10:49.040,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3168,Subscriber
2,284,2020-01-01 00:06:07.063,2020-01-01 00:10:51.924,219,Boston East - 126 Border St,42.373312,-71.04102,212,Maverick Square - Lewis Mall,42.368844,-71.039778,3985,Subscriber
3,193,2020-01-01 00:06:13.855,2020-01-01 00:09:27.832,396,Main St at Beacon St,42.40933,-71.063819,387,Norman St at Kelvin St,42.409859,-71.066319,2692,Subscriber
4,428,2020-01-01 00:07:25.295,2020-01-01 00:14:33.780,60,Charles Circle - Charles St at Cambridge St,42.360793,-71.07119,49,Stuart St at Charles St,42.351146,-71.066289,4978,Subscriber
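One thing to be aware of with `inplace=True` is that re-running the cell raises a `KeyError`, because the columns are already gone. As a hedged alternative sketch (using a made-up one-row frame, not the real data), `drop` can return a new frame and be told to ignore labels that are missing:

```python
import pandas as pd

# Toy frame with the three partially covered columns (values are illustrative)
toy = pd.DataFrame(
    {
        "tripduration": [478],
        "usertype": ["Customer"],
        "birth year": [1969.0],
        "gender": [0.0],
        "postal code": [None],
    }
)

partial_columns = ["birth year", "postal code", "gender"]

# Returns a new frame; 'toy' itself is left unchanged
trimmed = toy.drop(columns=partial_columns)

# errors="ignore" makes the step safe to run again on the trimmed frame
trimmed_again = trimmed.drop(columns=partial_columns, errors="ignore")

print(list(trimmed.columns))
```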


### Define
- the `starttime` and `stoptime` columns are in object format; they should be datetime

Action: convert these to datetime

### Code

In [216]:
# convert starttime field into datetime
df_all['starttime'] = pd.to_datetime(df_all['starttime'])

In [217]:
# convert stoptime field into datetime
df_all['stoptime'] = pd.to_datetime(df_all['stoptime'])

### Test

In [220]:
df_all.dtypes

tripduration                        int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id                    int64
start station name                 object
start station latitude            float64
start station longitude           float64
end station id                      int64
end station name                   object
end station latitude              float64
end station longitude             float64
bikeid                              int64
usertype                           object
birth year                        float64
gender                            float64
postal code                        object
dtype: object
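Once the columns are `datetime64[ns]`, the `.dt` accessor gives direct access to the parts needed for the time-of-day, day-of-week and month questions. A minimal sketch on two timestamps taken from the `head()` outputs above; the derived column names (`start_hour`, `start_weekday`, `start_month`) are assumptions, not columns of the original data.

```python
import pandas as pd

# Two start times copied from the January and June head() outputs
trips = pd.DataFrame(
    {"starttime": ["2020-01-01 00:04:05.8090", "2020-06-01 00:00:25.6240"]}
)

trips["starttime"] = pd.to_datetime(trips["starttime"])

# Derived columns (hypothetical names) for later grouping
trips["start_hour"] = trips["starttime"].dt.hour
trips["start_weekday"] = trips["starttime"].dt.day_name()
trips["start_month"] = trips["starttime"].dt.month

print(trips)
```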

### Assessment part 2

In [222]:
# assess df_all statistics
df_all.describe()

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,birth year,gender
count,2073448.0,2073448.0,2073448.0,2073448.0,2073448.0,2073448.0,2073448.0,2073448.0,415976.0,415976.0
mean,1831.438,163.4481,42.35597,-71.08868,162.3836,42.35589,-71.0884,4263.332,1985.275107,1.156915
std,24258.91,135.7767,0.03452189,0.05628081,135.8793,0.03455072,0.05626398,1272.965,11.482199,0.524449
min,61.0,1.0,0.0,-71.22627,1.0,0.0,-71.22627,31.0,1888.0,0.0
25%,475.0,55.0,42.34522,-71.10594,55.0,42.34522,-71.10567,3175.0,1979.0,1.0
50%,828.0,110.0,42.3556,-71.08981,108.0,42.3556,-71.08822,4256.0,1989.0,1.0
75%,1443.0,239.0,42.36567,-71.06959,239.0,42.36567,-71.06894,5406.0,1994.0,1.0
max,3879352.0,499.0,42.41608,0.0,499.0,42.41608,0.0,6724.0,2004.0,2.0
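The statistics above hint at a few things worth checking: a minimum latitude and a maximum longitude of 0.0, and a maximum `tripduration` of 3879352. The sketch below first checks, on the first two rows from the `head()` output, that `tripduration` matches the elapsed time in whole seconds, and then flags rows with zero coordinates. The third row is entirely made up to show the flag firing, and the variable names are illustrative.

```python
import pandas as pd

trips = pd.DataFrame(
    {
        # First two rows come from the head() output; the third is hypothetical
        "tripduration": [478, 363, 600],
        "starttime": pd.to_datetime(
            ["2020-01-01 00:04:05.809", "2020-01-01 00:04:45.699",
             "2020-01-01 08:00:00.000"]
        ),
        "stoptime": pd.to_datetime(
            ["2020-01-01 00:12:04.237", "2020-01-01 00:10:49.040",
             "2020-01-01 08:10:00.000"]
        ),
        "start station latitude": [42.342781, 42.373312, 0.0],
        "start station longitude": [-71.057473, -71.04102, 0.0],
    }
)

# Elapsed time between start and stop, in seconds
elapsed_seconds = (trips["stoptime"] - trips["starttime"]).dt.total_seconds()
print(elapsed_seconds.tolist())

# Rows whose start coordinates contain a zero are suspicious for Boston
zero_coordinates = (
    (trips["start station latitude"] == 0)
    | (trips["start station longitude"] == 0)
)
print(trips[zero_coordinates])

# The longest trip in describe(), expressed in days
max_duration_days = 3879352 / 86400
print(round(max_duration_days, 1))
```

If `tripduration` is indeed in seconds, as the sample rows suggest, the longest recorded trip lasted roughly 45 days, which looks more like a bike that was not docked properly than a real ride.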


In [230]:
# assess df_all's duplicated values
df_all.duplicated().sum()

0

In [237]:
# assess df_all's null values
df_all.isnull().sum()

tripduration               0
starttime                  0
stoptime                   0
start station id           0
start station name         0
start station latitude     0
start station longitude    0
end station id             0
end station name           0
end station latitude       0
end station longitude      0
bikeid                     0
usertype                   0
dtype: int64

In [240]:
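# note: `value_counts` is written without parentheses here, so the output below
# is the bound method's repr (a preview of the column), not the station counts
df_all['start station name'].value_counts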

<bound method IndexOpsMixin.value_counts of 0                                    Broadway T Stop
1                        Boston East - 126 Border St
2                        Boston East - 126 Border St
3                               Main St at Beacon St
4        Charles Circle - Charles St at Cambridge St
                            ...                     
73997                          Rogers St & Land Blvd
73998                                  Wilson Square
73999                           Somerville City Hall
74000            Harvard Square at Mass Ave/ Dunster
74001                        Shawmut Ave at Oak St W
Name: start station name, Length: 2073448, dtype: object>

In [241]:
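# note: `value_counts` is written without parentheses here, so the output below
# is the bound method's repr (a preview of the column), not the station counts
df_all['end station name'].value_counts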

<bound method IndexOpsMixin.value_counts of 0                                         JFK/UMass T Stop
1                             Maverick Square - Lewis Mall
2                             Maverick Square - Lewis Mall
3                                   Norman St at Kelvin St
4                                  Stuart St at Charles St
                               ...                        
73997        One Broadway / Kendall Sq at Main St / 3rd St
73998    Cambridge Dept. of Public Works -147 Hampshire...
73999                                           Perry Park
74000                             Huron Ave At Vassal Lane
74001                         Albany St at E. Brookline St
Name: end station name, Length: 2073448, dtype: object>
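The two outputs above show `<bound method ...>` because `value_counts` was referenced without being called. Calling it with parentheses returns the frequency of each station, sorted from most to least used. A small sketch on the first four start stations from the `head()` output:

```python
import pandas as pd

# First four start stations from the head() output
starts = pd.Series(
    [
        "Broadway T Stop",
        "Boston East - 126 Border St",
        "Boston East - 126 Border St",
        "Main St at Beacon St",
    ],
    name="start station name",
)

# With parentheses: a Series of counts, most frequent first
station_counts = starts.value_counts()
top_station = station_counts.index[0]

print(station_counts)
print(top_station)
```

On the full data, `station_counts.head(10)` would give a top-ten list for the "most used start and end stations" question.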

### Explore Univariate Variables

In this section, I'll investigate distributions of individual variables. My aim is to check if there are any unusual points or outliers, take a deeper look to clean things up and prepare myself to look at relationships between variables.

I will look into the following:
- When are the most trips taken in terms of time of day, day of the week, or month of the year?
- Which are the most used start and end stations?
- How long does the average trip take?
- Do these depend on whether a user is a subscriber or a customer?
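As a starting point for these questions, the following sketch uses the five rows shown in the January `head()` output to compare trip duration by user type and to count trips per starting hour. It only illustrates the mechanics; five rows say nothing about the real distributions, and the variable names are illustrative.

```python
import pandas as pd

# The five rows from the January head() output
trips = pd.DataFrame(
    {
        "tripduration": [478, 363, 284, 193, 428],
        "usertype": ["Customer", "Subscriber", "Subscriber",
                     "Subscriber", "Subscriber"],
        "starttime": pd.to_datetime(
            [
                "2020-01-01 00:04:05.809",
                "2020-01-01 00:04:45.699",
                "2020-01-01 00:06:07.063",
                "2020-01-01 00:06:13.855",
                "2020-01-01 00:07:25.295",
            ]
        ),
    }
)

# Trip count, mean and median duration per user type
duration_by_usertype = (
    trips.groupby("usertype")["tripduration"].agg(["count", "mean", "median"])
)
print(duration_by_usertype)

# Number of trips starting in each hour of the day
trips_per_hour = trips["starttime"].dt.hour.value_counts().sort_index()
print(trips_per_hour)
```

Given the very large maximum in `tripduration`, the median is likely to be a more robust summary than the mean on the full data.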


In [None]:
df_all.