## Data Visualisation - Ford GoBike
### by Miji
### Preliminary Wrangling
This dataset was shared by the bike-sharing company Ford GoBike. It contains a lot of interesting information about users and their bike trips. I will start by wrangling the data to make it suitable for investigation.

In [3]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [4]:
ford = pd.read_csv('fordgobike.csv')
ford.head()

| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.77588 | -122.39317 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.24878 | 4898 | Subscriber | 1974.0 | Male | Yes |


In [5]:
ford.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  memb

### Checking for any missing information

In [6]:
ford.isnull().sum()

# A value of 0 in the hour columns derived later is valid (a rental at midnight), so it is not treated as missing.
# Missing values affect fewer than 5% of the rows in any column, so I will drop those rows.

duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
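
The 5% figure mentioned in the comments can be checked directly. The sketch below is not part of the original notebook: `missing_share` is a hypothetical helper and `toy` is a made-up frame used only to show the calculation. The last lines repeat the arithmetic with the row count and missing counts reported above.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values in each column."""
    # isnull() gives a boolean frame; the mean of booleans is the fraction of True
    return df.isnull().mean() * 100


# Toy data (assumption, for illustration only): 1 of 4 birth years is missing
toy = pd.DataFrame({
    'duration_sec': [52185, 42521, 61854, 36490],
    'member_birth_year': [1984.0, np.nan, 1972.0, 1989.0],
})
toy_share = missing_share(toy)
print(toy_share)

# Same calculation with the counts reported by ford.isnull().sum()
total_rows = 183412
birth_year_share = 8265 / total_rows * 100   # member_birth_year, member_gender
station_share = 197 / total_rows * 100       # station id and name columns
print(round(birth_year_share, 2), round(station_share, 2))
```

Both shares come out below 5%, which supports dropping the affected rows instead of imputing them.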

In [7]:
ford.dropna(inplace = True)

ford.isnull().sum()

duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64

### Removing unnecessary columns

In [8]:
# Latitude and longitude are not of interest for this analysis, so I will remove the four related columns.

ford.drop(columns = ['start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude'],
         inplace = True)

ford.columns

Index(['duration_sec', 'start_time', 'end_time', 'start_station_id',
       'start_station_name', 'end_station_id', 'end_station_name', 'bike_id',
       'user_type', 'member_birth_year', 'member_gender',
       'bike_share_for_all_trip'],
      dtype='object')

### Checking the data type of each column
- Object type in 'start_time', 'end_time' should be changed to datetime, and the day and hour extracted
- Numeric types in 'start_station_id', 'end_station_id' (float) and 'bike_id' (int) should be changed to string
- Object type in 'user_type', 'member_gender', 'bike_share_for_all_trip' should be changed to category
- Float type in 'member_birth_year' should be changed to integer

In [9]:
date_type = ['start_time', 'end_time']
str_type = ['start_station_id', 'end_station_id', 'bike_id']
cat_type = ['user_type', 'member_gender', 'bike_share_for_all_trip']

# Parse the timestamp strings into datetime64 values
for d in date_type:
    ford[d] = pd.to_datetime(ford[d])

# Derive the weekday and hour of each trip from the parsed timestamps
ford['start_day'] = ford['start_time'].dt.day_name()
ford['start_hour'] = ford['start_time'].dt.hour
ford['end_hour'] = ford['end_time'].dt.hour

# IDs are labels, not quantities: cast to int first so 21.0 becomes '21'
for s in str_type:
    ford[s] = ford[s].astype('int64').astype('str')

for c in cat_type:
    ford[c] = ford[c].astype('category')

# Birth years are whole numbers once the rows with missing values are dropped
ford['member_birth_year'] = ford['member_birth_year'].astype('int64')

# Hour 0 (midnight) is a valid value, so the hour columns are left unchanged
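
The ID conversion above goes through an integer step before casting to string. A small sketch with made-up values (not taken from the notebook) shows why that step matters, and how the `.dt` accessor yields the weekday and hour once a column holds real timestamps.

```python
import pandas as pd

# Toy station IDs stored as floats, as in the raw file (illustrative values)
ids = pd.Series([21.0, 23.0, 375.0])

direct = ids.astype('str').tolist()                    # keeps the decimal part
via_int = ids.astype('int64').astype('str').tolist()   # drops it first

print(direct)
print(via_int)

# Timestamps: parse once, then read weekday and hour from the .dt accessor
stamps = pd.to_datetime(pd.Series(['2019-02-28 17:32:10.1450',
                                   '2019-03-01 00:20:44.0740']))
day_names = stamps.dt.day_name().tolist()
hours = stamps.dt.hour.tolist()
print(day_names)
print(hours)
```

Casting straight from float to string would leave labels such as `'21.0'`, which is why the integer step is worth keeping.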

### Setting category order

In [10]:
day_lists = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def orders(df):
    # Make start_day an ordered categorical so sorts and plots follow Monday to Sunday
    day_var = pd.api.types.CategoricalDtype(ordered = True, categories = day_lists)
    df['start_day'] = df['start_day'].astype(day_var)
    print("Order setting successful")

orders(ford)

Order setting successful
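
To see what the ordered category changes, the sketch below sorts a few made-up weekday names twice: once as plain strings and once after casting to an ordered `CategoricalDtype` built from the same `day_lists`. The `days` series is an assumption for illustration only.

```python
import pandas as pd

day_lists = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_var = pd.api.types.CategoricalDtype(categories=day_lists, ordered=True)

# Toy column of weekday names (illustrative, not taken from the dataset)
days = pd.Series(['Thursday', 'Monday', 'Sunday', 'Friday'])

as_text = days.sort_values().tolist()                      # alphabetical order
as_ordered = days.astype(day_var).sort_values().tolist()   # calendar order

print(as_text)
print(as_ordered)
```

Plain strings sort alphabetically, while the ordered category follows the calendar week, which is the order wanted on a plot axis.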


### Checking for duplicated rows


In [11]:
ford.duplicated().sum()

0

### Checking basic statistics

In [12]:
ford.describe()

| | duration_sec | member_birth_year | start_hour | end_hour |
|---|---|---|---|---|
| count | 174952.0 | 174952.0 | 174952.0 | 174952.0 |
| mean | 704.002744 | 1984.803135 | 13.456165 | 13.609533 |
| std | 1642.204905 | 10.118731 | 4.734282 | 4.748029 |
| min | 61.0 | 1878.0 | 0.0 | 0.0 |
| 25% | 323.0 | 1980.0 | 9.0 | 9.0 |
| 50% | 510.0 | 1987.0 | 14.0 | 14.0 |
| 75% | 789.0 | 1992.0 | 17.0 | 18.0 |
| max | 84548.0 | 2001.0 | 23.0 | 23.0 |


### Amending birth year & Adding a categorical variable
We already have each user's birth year, but a birth year alone does not make it easy to see how old a user is or which generation they belong to. I am going to classify birth years into four groups: old, middle-aged, young adult and teenager. Ages are approximate and counted as of 2021.
- Old: born from 1920 to 1959 (age 62 - 101)
- Middle-aged: born from 1960 to 1980 (age 41 - 61)
- Young adult: born from 1981 to 2002 (age 19 - 40)
- Teenager: born from 2003 onwards (age 18 or younger)



In [13]:
ford['member_birth_year'].value_counts().sort_index(ascending = True).head(10)

1878     1
1900    53
1901     6
1902    11
1910     1
1920     3
1927     1
1928     1
1930     1
1931    89
Name: member_birth_year, dtype: int64

In [14]:
# Birth years of 1878, 1900, 1901, 1902 and 1910 are implausible for a dataset created in 2019.
# I will amend them by shifting each one forward by 100 years.

ford['member_birth_year'] = ford['member_birth_year'].replace([1878, 1900, 1901, 1902, 1910], [1978, 2000, 2001, 2002, 2010])

ford['member_birth_year'].value_counts().sort_index(ascending = True).head()

1920     3
1927     1
1928     1
1930     1
1931    89
Name: member_birth_year, dtype: int64
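
The implausible years above were picked by eye from the sorted counts. A rule-based alternative, which the original notebook does not use, is to flag every birth year that implies an age above some cutoff in the year the data was recorded. In this sketch the helper name `implausible_years`, the cutoff of 100 years and the `sample` values are all assumptions.

```python
import pandas as pd


def implausible_years(years: pd.Series, data_year: int = 2019, max_age: int = 100) -> list:
    """Return the sorted distinct birth years implying an age above max_age."""
    ages = data_year - years
    return sorted(years[ages > max_age].unique().tolist())


# Toy sample that mixes suspicious and ordinary birth years
sample = pd.Series([1878, 1900, 1901, 1902, 1910, 1920, 1931, 1984, 1989])
flagged = implausible_years(sample)
print(flagged)
```

With a cutoff of 100 years the rule flags the same five years that were amended by hand, while 1920 (age 99 in 2019) is kept.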

In [15]:
gen_list = [1920, 1960, 1981, 2003, 2021]
ford['generation'] = pd.cut(ford['member_birth_year'], gen_list, labels = ['Old', 'Middle-aged', 'Young adult', 'Teenager'], 
       include_lowest = True, ordered = True, right = False)

ford.head()

# I will keep member_birth_year for now, just in case I might need the column in future analysis

| | duration_sec | start_time | end_time | start_station_id | start_station_name | end_station_id | end_station_name | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | start_day | start_hour | end_hour | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21 | Montgomery St BART Station (Market St at 2nd St) | 13 | Commercial St at Montgomery St | 4902 | Customer | 1984 | Male | No | Thursday | 17 | 8 | Young adult |
| 2 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86 | Market St at Dolores St | 3 | Powell St BART Station (Market St at 4th St) | 5905 | Customer | 1972 | Male | No | Thursday | 12 | 5 | Middle-aged |
| 3 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375 | Grove St at Masonic Ave | 70 | Central Ave at Fell St | 6638 | Subscriber | 1989 | Other | No | Thursday | 17 | 4 | Young adult |
| 4 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7 | Frank H Ogawa Plaza | 222 | 10th Ave at E 15th St | 4898 | Subscriber | 1974 | Male | Yes | Thursday | 23 | 0 | Middle-aged |
| 5 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93 | 4th St at Mission Bay Blvd S | 323 | Broadway at Kearny | 5200 | Subscriber | 1959 | Male | No | Thursday | 23 | 0 | Old |
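
The `pd.cut` call above uses `right = False`, so each bin includes its lower edge and excludes its upper edge. The sketch below reuses the same `gen_list` and labels on a handful of made-up birth years (`edge_years`, an assumption for illustration) that sit on or next to each edge, to show which group each one falls into.

```python
import pandas as pd

gen_list = [1920, 1960, 1981, 2003, 2021]
labels = ['Old', 'Middle-aged', 'Young adult', 'Teenager']

# Birth years sitting on or next to each bin edge
edge_years = pd.Series([1920, 1959, 1960, 1980, 1981, 2002, 2003])

# right=False makes every bin closed on the left and open on the right
generation = pd.cut(edge_years, gen_list, labels=labels, right=False)

for year, label in zip(edge_years, generation):
    print(year, label)
```

A user born in 1960 is therefore counted as middle-aged, one born in 1981 as a young adult, and one born in 2003 as a teenager.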
