# Part I - Exploratory Data Analysis of the Ford GoBike February 2019 Trip Data
## by Moses Ojonuba

## Introduction

Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California. 

Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.

This dataset includes information about individual rides made in the Ford GoBike bike-sharing system, covering the greater San Francisco Bay Area.


#### Dataset Dictionary:

- duration_sec: Trip Duration (seconds)
- start_time: Start Time and Date
- end_time: End Time and Date
- start_station_id: Start Station ID
- start_station_name: Start Station Name
- start_station_latitude: Start Station Latitude
- start_station_longitude: Start Station Longitude
- end_station_id: End Station ID
- end_station_name: End Station Name
- end_station_latitude: End Station Latitude
- end_station_longitude: End Station Longitude
- bike_id: Bike ID
- user_type: User Type ("Subscriber" = Member, "Customer" = Casual)
- member_birth_year: Member Year of Birth
- member_gender: Member Gender
- bike_share_for_all_trip: Boolean to track members who are enrolled in the "Bike Share for All" program for low-income residents
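
The dictionary above can also be applied at load time. The following is a minimal sketch, assuming a few made-up rows in place of the real file; the `sample_csv` and `sample` names are illustrative and not part of the notebook. It shows `pd.read_csv` parsing the timestamp columns and keeping `bike_id` as text instead of a number.

```python
import io
import pandas as pd

# Two made-up rows that mimic a few columns of the trip file (illustrative only).
sample_csv = io.StringIO(
    "duration_sec,start_time,end_time,start_station_id,bike_id,user_type,member_birth_year\n"
    "600,2019-02-28 17:32:10.1450,2019-02-28 17:42:10.1450,21.0,4902,Customer,1984.0\n"
    "300,2019-02-28 18:00:00.0000,2019-02-28 18:05:00.0000,,5905,Subscriber,\n"
)

sample = pd.read_csv(
    sample_csv,
    parse_dates=["start_time", "end_time"],            # read timestamps as datetimes
    dtype={"bike_id": str, "user_type": "category"},   # identifiers and labels, not numbers
)

print(sample.dtypes)
print(sample.isna().sum())
```

Declaring types while reading is one option; converting after the assessment, as this notebook does, works equally well.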



## Preliminary Wrangling


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
!pip install plotly==5.9.0 --quiet
import plotly.express as px
%matplotlib inline

In [None]:
sns.set_style('darkgrid')
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.facecolor'] = '#00000000'

In [None]:
#loading data
rides_df = pd.read_csv('201902-fordgobike-tripdata.csv')

rides_df.head()


In [None]:
rides_df.shape

In [None]:
rides_df.info()

In [None]:
rides_df.isna().sum().sort_values()

In [None]:
rides_df.duplicated().sum()

In [None]:
rides_df.user_type.unique()

In [None]:
rides_df.member_gender.unique()

In [None]:
rides_df.bike_share_for_all_trip.unique()

From the assessment, two cleaning steps are needed:
- Drop rows with missing values
- Change data types

In [None]:
rides_df.dropna(inplace=True)

In [None]:
# convert station and bike identifiers to strings
id_cols = ['start_station_id', 'end_station_id', 'bike_id']
rides_df[id_cols] = rides_df[id_cols].astype(str)

# birth year is a whole number, so store it as an integer
rides_df['member_birth_year'] = rides_df['member_birth_year'].astype(int)

# parse start and end timestamps as datetimes
rides_df[['start_time', 'end_time']] = rides_df[['start_time', 'end_time']].apply(pd.to_datetime)
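
One detail of the string conversion is worth a quick check. If the station IDs are read as floats (an assumption here; it typically happens when a numeric column contains missing values), casting them straight to `str` keeps the decimal part. The sketch below uses a made-up frame named `demo` to show the difference; it is an illustration, not the notebook's cleaning step.

```python
import pandas as pd

# Made-up rows; the ID column is float because it contains a missing value.
demo = pd.DataFrame({
    "start_station_id": [21.0, None, 86.0],
    "member_birth_year": [1984.0, 1972.0, None],
})

complete = demo.dropna().copy()  # keep only complete rows, as in the notebook

# Casting the float straight to str keeps the decimal part.
as_is = complete["start_station_id"].astype(str)

# Going through int first gives a cleaner label.
via_int = complete["start_station_id"].astype(int).astype(str)

print(as_is.tolist())
print(via_int.tolist())
```

Either form works as an identifier; the second is only easier to read in plot labels.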

In [None]:
rides_df.info()

### What is the structure of your dataset?

> The raw dataset has 183412 rows and 16 columns. Rows with missing values were dropped before the analysis.

### What is/are the main feature(s) of interest in your dataset?

> Average age of riders, average duration of trips, gender distribution of riders, and distribution of user type.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The birth year, duration, gender and user type variables.

## Univariate Exploration
##### Q. What is the distribution of age?

In [None]:
#create an age column
rides_df['age'] = 2019 - rides_df['member_birth_year']

In [None]:
rides_df.head(3)

In [None]:
rides_df['age'].describe()

Our age column seems to have quite a number of outliers, seeing that 75% of the riders are below 39 years old. The maximum age is 141, which is likely an error. Let us visualize it for a clearer picture.

In [None]:
plt.boxplot(rides_df['age'], vert=False)
plt.xlabel('Age')
plt.title('Distribution of Age');

##### Observation
As we can see from the boxplot above, we have quite a number of outliers. Many riders fall in the senior age category. This is expected, because seniors are encouraged to ride bikes as a form of exercise, so we consider these outliers legitimate data points. However, we doubt that anyone rode a bike at age 141, or even at age 119. We have no record of anyone being alive at age 141 as of 2019, and even if such a person existed, it would be risky to let them ride a bike.

To deal with this, we set the age limit for this analysis to 100.
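
The distinction between statistical outliers and implausible values can be sketched as follows. This is a minimal example on made-up ages, assuming the common 1.5 × IQR rule for box plot whiskers (matplotlib's default); the `ages`, `flagged` and `implausible` names are illustrative.

```python
import pandas as pd

# Made-up ages: mostly young adults, two seniors, two impossible values.
ages = pd.Series([22, 24, 25, 26, 27, 28, 29, 30, 31, 33, 35, 38, 41, 65, 72, 119, 141])

q1, q3 = ages.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)   # points above this are drawn as outliers

flagged = ages[ages > upper_fence]   # statistical outliers, seniors included
implausible = ages[ages > 100]       # values rejected by the fixed cutoff

print(upper_fence)
print(flagged.tolist())
print(implausible.tolist())
```

The box plot rule would discard the seniors as well, which is why a fixed cutoff of 100 is used instead.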

In [None]:
# keep riders aged 100 or younger; copy so later changes do not touch rides_df
age_ok = rides_df["age"] < 101

rides = rides_df[age_ok].copy()

rides.age.describe()

In [None]:
rides['age'].hist(bins= 20);
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution');

From the histogram above, we can see that most of the riders are between 25 and 35 years old.

##### Q. What is the distribution of trip duration?

In [None]:
rides['duration_sec'].describe()

In [None]:
rides['duration_sec'].hist(bins=500)
plt.xlabel('Duration[Sec]')
plt.title('Distribution of Trip Duration');

##### Observation

The trip duration histogram is highly right-skewed because a few trips last very long. For this reason, we use the median when answering the other questions related to duration: unlike the mean, the median is barely affected by outliers.
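
The effect of a single very long trip on the two statistics can be seen in a small sketch. The durations below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up trip durations in seconds, with one extremely long trip.
durations = pd.Series([300, 420, 480, 510, 560, 640, 720, 84000])

mean_sec = durations.mean()
median_sec = durations.median()
print(f"with the long trip:    mean={mean_sec:.0f}s, median={median_sec:.0f}s")

# Dropping the extreme trip moves the mean far more than the median.
trimmed = durations[durations < 84000]
print(f"without the long trip: mean={trimmed.mean():.0f}s, median={trimmed.median():.0f}s")
```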

##### Q. What is the distribution of gender?

In [None]:
fig= px.pie(rides, names='member_gender', width=600, height=300, title='Distribution of Gender')

fig.show()

##### Observation

A large percentage of the riders are male (74%), roughly three times the share of female riders (23%).
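
The percentages shown in the pie chart can also be computed directly. The following is a minimal sketch on a made-up `gender` series (illustrative values only), using `value_counts(normalize=True)` for the shares and a simple count ratio for the male-to-female comparison.

```python
import pandas as pd

# Made-up gender labels, not the real data.
gender = pd.Series(["Male"] * 6 + ["Female"] * 2 + ["Other"] * 1)

# Share of each category as a percentage.
share = gender.value_counts(normalize=True).mul(100).round(1)

# How many male riders there are for each female rider.
ratio = (gender == "Male").sum() / (gender == "Female").sum()

print(share)
print(f"male-to-female ratio: {ratio:.1f}")
```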

This matches an investigation carried out by Elizabeth Plank on the bike paths of New York City, which found that far more men ride bikes than women: “In the U.S., 1 woman for every 3 men gets around on a bicycle”.

According to Plank, “In London, 77% of bike trips are taken by men and only 5% of women identify as frequent cyclists.”

Source: https://slate.com/human-interest/2014/09/gender-gap-alert-men-ride-bikes-way-more-than-women-do.html

##### Q. What is the distribution of user type?

In [None]:
# define a function to plot the counts of a categorical feature
def plot_cat(var, l=8, b=5):
    plt.figure(figsize=(l, b))
    # pass the column by keyword and order bars from most to least frequent
    sns.countplot(data=rides, x=var, order=rides[var].value_counts().index)

In [None]:
#call function to plot countplot
plot_cat('user_type')

##### Observation

Over 90% of riders in our dataset are subscribers, meaning they pay a subscription fee, which could be monthly or yearly. Only a small percentage of the riders are customers, meaning they pay at the station or kiosk for each trip.

##### Q. What percentage of riders are part of the Bike Share for All program?

In [None]:
plot_cat('bike_share_for_all_trip')

##### Observation
A large percentage of riders are not part of the program. We do not have enough information to know the reason for this.

## Bivariate Exploration

##### Q. What is the average (median) trip duration for each gender?

In [None]:
rides.groupby('member_gender')['duration_sec'].median()

In [None]:
rides.groupby('member_gender')['duration_sec'].median().sort_values(ascending=False).plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Duration[sec]')
plt.title('Median Duration of Trips by Gender');

##### Observation
Female riders take longer trips than male riders (median of 567 seconds, about 9.5 minutes), though the difference from the median trip duration for male riders is not large.

##### Q. What is the average (median) trip duration for each user type?

In [None]:
rides.groupby('user_type')['duration_sec'].median()

In [None]:
rides.groupby('user_type')['duration_sec'].median().plot(kind='bar')
plt.xlabel('User Type')
plt.ylabel('Duration[Sec]')
plt.title('Median Duration of Trips by User Type');


##### Observation
Customers take longer trips (median of 780 seconds, or 13 minutes) than subscribers (490 seconds, or about 8 minutes).
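
The conversion from seconds to minutes can be done alongside the grouped median. Below is a minimal sketch on a made-up frame named `demo`; the durations were chosen so the medians match the figures quoted above and are otherwise illustrative.

```python
import pandas as pd

# Made-up trips; medians are chosen to match the figures quoted above.
demo = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer",
                  "Subscriber", "Subscriber", "Subscriber"],
    "duration_sec": [600, 780, 3000, 400, 490, 900],
})

median_sec = demo.groupby("user_type")["duration_sec"].median()

# Put seconds and minutes side by side so both can be read off one table.
summary = pd.DataFrame({
    "median_sec": median_sec,
    "median_min": (median_sec / 60).round(1),
})
print(summary)
```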

##### Q. Which week of the month did people go on longer rides?

In [None]:
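# .dt.week was removed in recent pandas versions; isocalendar() gives the ISO week of the year
rides_df['ride_start_week'] = rides_df['start_time'].dt.isocalendar().week
rides_df.groupby('ride_start_week')['duration_sec'].median()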

In [None]:
rides_df.groupby('ride_start_week')['duration_sec'].median().sort_values().plot(kind='barh')
plt.xlabel('Duration[Sec]')
plt.ylabel('ISO Week of the Year')
plt.title('Median Duration of Trips per Week');

##### Observation

Riders took longer trips in the fourth week of the month (ISO week 8). The median trip duration for that week was 532 seconds, or approximately 9 minutes, though this is not much different from the other weeks.
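
The week numbers in the output are ISO weeks of the year, not weeks of the month. The sketch below shows one way to relate the two for a single month, assuming "week of the month" means the ISO week counted from the first week present in the data; that definition and the three sample dates are illustrative.

```python
import pandas as pd

# Three made-up start dates from February 2019.
starts = pd.Series(pd.to_datetime(["2019-02-01", "2019-02-20", "2019-02-28"]))

# ISO week of the year (February 2019 spans weeks 5 to 9).
iso_week = starts.dt.isocalendar().week.astype(int)

# Week of the month, counted from the first ISO week present in the data.
week_of_month = iso_week - iso_week.min() + 1

print(pd.DataFrame({"start": starts, "iso_week": iso_week, "week_of_month": week_of_month}))
```

Under this definition ISO week 8 is the fourth week of February 2019.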

##### Q. On which day of the week did riders take longer trips?

In [None]:
# Add a column for the weekday of the start of the ride
rides_df['ride_start_weekday'] = rides_df['start_time'].dt.day_name()

# Print the median trip time per weekday
print(rides_df.groupby('ride_start_weekday')['duration_sec'].median())

In [None]:
rides_df.groupby('ride_start_weekday')['duration_sec'].median().sort_values(ascending=False).plot(kind='bar')
plt.xlabel('Week Day')
plt.ylabel('Duration[Sec]')
plt.title('Median Duration of Trips by Day of the Week');

##### Observation
Riders took longer trips on weekends (Saturdays and Sundays).
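
The weekend effect can also be checked with an explicit weekend flag, and an ordered categorical keeps the days in calendar order instead of alphabetical order. The following is a minimal sketch on a made-up frame named `demo`; the column names `weekday` and `day_type` are hypothetical.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up trips: a Saturday, a Sunday, a Monday and a Tuesday in February 2019.
demo = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-02 10:00", "2019-02-03 11:00",
                                  "2019-02-04 08:00", "2019-02-05 08:30"]),
    "duration_sec": [900, 840, 480, 500],
})

# An ordered categorical sorts Monday..Sunday rather than alphabetically.
demo["weekday"] = pd.Categorical(demo["start_time"].dt.day_name(),
                                 categories=days, ordered=True)

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
demo["day_type"] = (demo["start_time"].dt.dayofweek >= 5).map(
    {True: "weekend", False: "weekday"})

by_day = demo.groupby("weekday", observed=True)["duration_sec"].median()
by_day_type = demo.groupby("day_type")["duration_sec"].median()

print(by_day)
print(by_day_type)
```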

##### Q. Is there a relationship between age and duration?

In [None]:
# use the age-filtered frame so implausible ages (above 100) are excluded
fig = px.scatter(rides, x='age', y='duration_sec', title='Duration vs Age')
fig.show()

##### Observation 
There is no linear relationship between age and trip duration. However, most riders who took longer trips were between 25 and 45 years old.
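
A correlation coefficient can back up what the scatter plot suggests. The sketch below uses made-up values in a frame named `sample`; it computes the Pearson coefficient and a rank-based (Spearman) coefficient, the latter obtained by applying Pearson to the ranks. The numbers say nothing about the real data.

```python
import pandas as pd

# Made-up ages and durations with no clear trend.
sample = pd.DataFrame({
    "age": [22, 27, 31, 35, 42, 58, 66],
    "duration_sec": [480, 2400, 650, 5200, 700, 540, 610],
})

# Pearson measures linear association.
pearson = sample["age"].corr(sample["duration_sec"])

# Spearman is Pearson applied to ranks, so it is less sensitive to long trips.
spearman = sample["age"].rank().corr(sample["duration_sec"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Values close to zero are consistent with the absence of a linear (or monotonic) relationship.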

##### Q. What is the distribution of station locations across San Francisco?

In [None]:
fig = px.scatter_mapbox(
    rides_df,  # Our DataFrame
    lat='start_station_latitude',
    lon='start_station_longitude',
    center={"lat": 37.773972, "lon": -122.431297},  # Map will be centered on San Francisco
    width=600,  # Width of map
    height=600,  # Height of map
    hover_data=['start_station_name'],  # Display Station name when hovering mouse over station
    title='Distribution of Stations'
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()

##### Observation
The bike stations form three clusters.
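
Because every trip repeats its start station, the map draws each station many times. One option is to reduce the data to one row per station first. This is a minimal sketch on a made-up frame named `demo`, with invented station names and coordinates.

```python
import pandas as pd

# Made-up trips; station names and coordinates are invented.
demo = pd.DataFrame({
    "start_station_name": ["A St", "A St", "B Ave", "C Blvd", "B Ave"],
    "start_station_latitude": [37.77, 37.77, 37.80, 37.33, 37.80],
    "start_station_longitude": [-122.42, -122.42, -122.27, -121.89, -122.27],
})

station_cols = ["start_station_name", "start_station_latitude", "start_station_longitude"]

# One row per station is enough to draw the map.
stations = demo[station_cols].drop_duplicates().reset_index(drop=True)

# Trips per station can be kept separately, for example to size the markers.
trips_per_station = demo["start_station_name"].value_counts()

print(stations)
print(trips_per_station)
```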

## Multivariate Exploration

Some riders take a bike from a station and return it to the same station. Let us call such trips round trips (`round_trips`).

In [None]:
# Create round trips
trips = (rides['start_station_name'] == rides['end_station_name'])

In [None]:
round_trips = rides[trips]

In [None]:
len(round_trips)

3458 of the trips were round trips: the bikes were picked up and dropped off at the same station.
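
The same idea can be expressed as a boolean column, which gives both the count and the share of round trips. This sketch compares station IDs instead of names, on the assumption that IDs identify a station at least as reliably; the frame `demo` and its values are made up.

```python
import pandas as pd

# Made-up trips with string station IDs.
demo = pd.DataFrame({
    "start_station_id": ["21", "21", "86", "3"],
    "end_station_id": ["21", "86", "86", "21"],
    "duration_sec": [1200, 500, 1500, 450],
})

# A round trip starts and ends at the same station.
demo["is_round_trip"] = demo["start_station_id"] == demo["end_station_id"]

n_round = int(demo["is_round_trip"].sum())   # number of round trips
share = demo["is_round_trip"].mean()         # fraction of all trips

# Median duration of round trips versus one-way trips.
by_kind = demo.groupby("is_round_trip")["duration_sec"].median()

print(n_round, share)
print(by_kind)
```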

##### Q. What is the Age Distribution for the categories of Gender who took a round trip?

In [None]:
# plot a histogram
fig = px.histogram(round_trips, 
                   x='age', 
                   color='member_gender',
                   marginal='box', 
                   nbins=50, 
                   title='Distribution of Age for Gender')
fig.update_layout(bargap=0.1)
fig.show()

##### Observation

The median age of male riders who took round trips is 36, while that of female riders is 29.

##### Q. What is the Age Distribution for the categories of Users who took a round trip?

In [None]:
fig = px.histogram(round_trips, 
                   x='age', 
                   color='user_type',
                   marginal='box', 
                   nbins=47, 
                   title='Distribution of Age of User Type')
fig.update_layout(bargap=0.1)
fig.show()

##### Observation
The median age of customers who took round trips is 30, while that of subscribers is 31.

##### Q. What is the distribution of round-trip duration for each gender?

In [None]:
fig = px.histogram(round_trips, 
                   x='duration_sec', 
                   color='member_gender',
                   marginal='box', 
                   nbins=75, 
                   title='Distribution of Trip Duration for Gender')
fig.update_layout(bargap=0.1)
fig.show()

##### Observation
The median duration of round trips for female riders (1114 seconds, or about 19 minutes) is greater than that for male riders (910 seconds, or about 15 minutes).

##### Q. What is the distribution of round-trip duration for each user type?

In [None]:
#plot Histogram
fig = px.histogram(round_trips, 
                   x='duration_sec', 
                   color='user_type',
                   marginal='box', 
                   nbins=75, 
                   title='Distribution of Trip Duration for User Type')
fig.update_layout(bargap=0.1)
fig.show()

##### Observation
Customers spend a median of 1680.5 seconds (about 28 minutes) on round trips, while subscribers spend a median of 757 seconds (about 13 minutes).
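
The figures read off the box plots can be tabulated for both categorical variables at once. The following is a minimal sketch using `pivot_table` with a median aggregation on a made-up frame named `demo`; the durations are illustrative and do not reproduce the values above.

```python
import pandas as pd

# Made-up round trips.
demo = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber",
                  "Subscriber", "Subscriber", "Customer"],
    "member_gender": ["Female", "Male", "Female", "Male", "Male", "Female"],
    "duration_sec": [1800, 1500, 800, 700, 760, 1600],
})

# Median duration for every combination of user type and gender.
table = demo.pivot_table(index="user_type", columns="member_gender",
                         values="duration_sec", aggfunc="median")
print(table)
```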

## Conclusions
- A large percentage of the riders are male (74%), roughly three times the share of female riders (23%).
- Over 90% of riders in our dataset are subscribers, meaning they pay a subscription fee, which could be monthly or yearly. Only a small percentage of the riders are customers, meaning they pay at the station or kiosk for each trip.
- Female riders take longer trips than male riders (median of 567 seconds, about 9.5 minutes), though the difference is not large.
- Customers take longer trips (median of 780 seconds, or 13 minutes) than subscribers (490 seconds, or about 8 minutes).
- Riders took longer trips on weekends (Saturdays and Sundays).
- There is no linear relationship between age and trip duration. However, most riders who took longer trips were between 25 and 45 years old.