# Part I - Exploration of Ford GoBike trip data in San Francisco for February 2019

## by Afariogun John Tolulope

## Introduction
> This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area for the month of February 2019.



In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb

In [2]:
# Loading in the dataset
ford_df = pd.read_csv('201902-fordgobike-tripdata.csv')

# ASSESSING THE DATA

In [3]:
ford_df.shape

(183412, 16)

In [4]:
ford_df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


In [5]:
ford_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
duration_sec               183412 non-null int64
start_time                 183412 non-null object
end_time                   183412 non-null object
start_station_id           183215 non-null float64
start_station_name         183215 non-null object
start_station_latitude     183412 non-null float64
start_station_longitude    183412 non-null float64
end_station_id             183215 non-null float64
end_station_name           183215 non-null object
end_station_latitude       183412 non-null float64
end_station_longitude      183412 non-null float64
bike_id                    183412 non-null int64
user_type                  183412 non-null object
member_birth_year          175147 non-null float64
member_gender              175147 non-null object
bike_share_for_all_trip    183412 non-null object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB


In [6]:
sum(ford_df.duplicated())

0

### The structure of the dataset?

> There are 183,412 unique trip records (rows) in the dataset, grouped under 16 variables (columns). Most variables are numeric, but user_type, member_gender, bike_share_for_all_trip and the station names are categorical.

* duration_sec - total time of the trip in seconds
* start_time - the starting time for the trip
* end_time - the time when the trip came to an end
* start_station_id -  a unique id given to the specific station where the trip was noted to start
* start_station_name - the name of the station where the trip started
* start_station_latitude - the latitude co-ordinate of the station where the trip started
* start_station_longitude - the longitude co-ordinate of the station where the trip started
* end_station_id - a unique id given to the specific station where the trip was noted to end
* end_station_name - the name of the station where the trip ended
* end_station_latitude - the latitude co-ordinate of the station where trip ended
* end_station_longitude - the longitude co-ordinate of the station where trip ended
* bike_id -  a unique id for the bikes
* user_type - whether the user is a subscriber or a casual customer
* member_birth_year - the birth year of the user
* member_gender - what gender the user belongs to, if male, female or other
* bike_share_for_all_trip - whether the trip was made under the Bike Share for All program (Yes/No)

### What is/are the main feature(s) of interest in your dataset?

* What variables influence trip duration

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

* Trip Duration
* User Type
* Member gender
* Member age


## Preliminary Wrangling

#### Creating a copy of the original data to carry out any data wrangling operations:
- Converting columns **start_time** and **end_time** to DateTime format
- Extracting the starting hour of the trip from the **start_time** into a **start_hour** column
- Extracting the starting day of the trip from the **start_time** into a **day** column
- Extracting the age of the user by subtracting the **member_birth_year** from the year of the trips into an **age** column, after first converting **member_birth_year** to a *float* datatype
- Calculating the distance of each trip in kilometres using the Python GeoPy package and storing it in a **distance_(km)** column
- Filling missing values with Unknown for categorical variables and 0 for numerical variables, so as to try to maintain the integrity of the dataset
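The time-derived steps in the list above can be sketched on two rows copied from the `head()` output shown earlier. The frame name `sample` is only for illustration and is not part of the notebook.

```python
import pandas as pd

# Two rows copied from the head() output above (only the columns needed here).
sample = pd.DataFrame({
    'start_time': ['2019-02-28 17:32:10.1450', '2019-02-28 23:54:18.5490'],
    'member_birth_year': [1984.0, 1974.0],
})

sample['start_time'] = pd.to_datetime(sample['start_time'])
sample['start_hour'] = sample['start_time'].dt.hour   # hour of day, 0-23
sample['day'] = sample['start_time'].dt.day_name()    # weekday name
sample['age'] = 2019 - sample['member_birth_year']    # age in the trip year

print(sample[['start_hour', 'day', 'age']])
```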

In [7]:
df=ford_df.copy()

Converting datetime-related columns to the DateTime datatype

In [8]:
df['start_time']=pd.to_datetime(df['start_time'])

In [9]:
df['end_time']=pd.to_datetime(df['end_time'])
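Once both columns are datetimes, a quick sanity check is to compare their difference with `duration_sec`. The sketch below uses rows 0 and 4 from the `head()` output; the frame name `check` is hypothetical, and the idea that `duration_sec` is the elapsed time rounded down to whole seconds is an inference from these two rows, not something stated in the dataset description.

```python
import numpy as np
import pandas as pd

# Rows 0 and 4 from the head() output above.
check = pd.DataFrame({
    'duration_sec': [52185, 1585],
    'start_time': pd.to_datetime(['2019-02-28 17:32:10.1450',
                                  '2019-02-28 23:54:18.5490']),
    'end_time': pd.to_datetime(['2019-03-01 08:01:55.9750',
                                '2019-03-01 00:20:44.0740']),
})

elapsed = (check['end_time'] - check['start_time']).dt.total_seconds()
# Assumption: duration_sec is the elapsed time rounded down to whole seconds.
check['matches'] = np.floor(elapsed).astype(int) == check['duration_sec']
print(check[['duration_sec', 'matches']])
```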

Creating a new **start_hour** column which contains the hour that the trip started

In [10]:
df['start_hour']=df['start_time'].dt.hour
df['start_hour']=df['start_hour'].astype('category')    

In [11]:
#check
df.start_hour.unique()

[17, 18, 12, 23, 22, ..., 6, 4, 3, 2, 1]
Length: 24
Categories (24, int64): [17, 18, 12, 23, ..., 4, 3, 2, 1]

Creating a new **day** column which contains the weekday that the trip started

In [12]:
df['day']=df['start_time'].dt.day_name()

Extracting the age of the user into an **age** column by subtracting the **member_birth_year** column from the year the data was gathered, i.e. 2019

In [13]:
df['member_birth_year']=df['member_birth_year'].astype('float')
df['age'] = 2019 - df['member_birth_year']

Creating a column to contain the distance in km the trip covers for analysis. Titled **distance_(km)**

In [None]:
from geopy.distance import geodesic as GD

# Geodesic distance (km) between the start and end station of every trip.
# Building the whole column in one assignment avoids chained indexing
# (df[col][i] = ...), which triggers pandas' SettingWithCopyWarning.
df['distance_(km)'] = [
    GD((lat1, lon1), (lat2, lon2)).km
    for lat1, lon1, lat2, lon2 in zip(
        df['start_station_latitude'], df['start_station_longitude'],
        df['end_station_latitude'], df['end_station_longitude'])
]
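Looping over more than 180,000 rows with GeoPy is slow. A faster alternative, sketched here as a swapped-in technique rather than what the notebook uses, is the haversine formula written with NumPy. It gives the great-circle distance on a sphere, so its values differ slightly from GeoPy's geodesic distance. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of this sketch.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; accepts scalars, NumPy arrays or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Start and end coordinates of row 0 in the head() output above.
d = haversine_km(37.789625, -122.400811, 37.794231, -122.402923)
print(round(float(d), 2))
```

Because the function is vectorised, the four coordinate columns of the dataframe could be passed in directly to fill the distance column in a single call.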
  


### Checking that the wrangling steps so far were successful

In [None]:
df.info()

In [None]:
df.isna().sum()

Filling non-numeric missing values with **unknown** as a placeholder and changing their datatype to category 

In [None]:
# fill missing values in the non-numeric columns with a placeholder
cat_cols=['start_station_id','start_station_name','end_station_id','end_station_name','member_gender']
df[cat_cols]=df[cat_cols].fillna('Unknown')

In [None]:
df.info()

In [None]:
# assign whole columns so the category dtype is kept
# (assignment through .loc can leave the old dtype in place)
cat_cols=['start_station_id','start_station_name','end_station_id','end_station_name','bike_id','user_type','member_gender']
df[cat_cols]=df[cat_cols].astype('category')

In [None]:
df.info()

In turn filling missing numeric variables with 0, so as to not cause conflicts when plotting the data

In [None]:
df.loc[:,['age','member_birth_year']]=df.loc[:,['age','member_birth_year']].fillna(0)
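One consequence of the 0 placeholder is worth keeping in mind: any summary statistic computed on the filled column treats unknown ages as real zeros. The small sketch below uses synthetic ages (not values from the dataset) to show the effect, which is why the age plots later filter out age 0.

```python
import numpy as np
import pandas as pd

# Synthetic ages for illustration only; NaN marks a missing birth year.
age = pd.Series([25.0, 35.0, np.nan, 45.0])

mean_ignoring_missing = age.mean()           # NaN is skipped by default
mean_with_zero_fill = age.fillna(0).mean()   # the placeholder counts as age 0

print(mean_ignoring_missing, mean_with_zero_fill)
```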

## Taking a look at our dataframe

In [None]:
df.info()

In [None]:
df.head()

# PLOTTING

## Univariate Exploration

Creating a new dataframe which contains all of the columns that are to be used in this analysis:

In [None]:
plot=df.loc[:,['duration_sec','start_hour','start_time','end_time','day','start_station_name','end_station_name','bike_id','user_type','age','bike_share_for_all_trip','member_gender','distance_(km)']]

In [None]:
# Defining the default color in seaborn in case I use it at any point
color=sb.color_palette()[0]

In [None]:
plot.info()

In [None]:
plt.hist(data=plot,x='duration_sec',bins=20);

When assessing the data earlier there were some values greater than 20,000 seconds in the dataset, and the presence of higher values in this plot lends credence to that. So redoing the plot, this time with x-axis limits, to properly see the distribution.
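The 1800-second limit used in the next cell is a manual choice. One way to pick such a limit from the data itself is a high quantile of the durations. The sketch below uses synthetic right-skewed values, not the real trips, and the 95% cut-off is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed durations in seconds, for illustration only.
durations = rng.lognormal(mean=6.2, sigma=0.8, size=10_000)

# Upper x-limit that still shows about 95% of the trips.
upper = np.quantile(durations, 0.95)
share_shown = (durations <= upper).mean()
print(round(float(share_shown), 3))
```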

In [None]:
bins = np.arange(0, 1800+20, 20)
plt.hist(data=plot,x='duration_sec',bins=bins);
plt.xlim([0,1800]);
plt.xlabel('Duration (secs)', size=12)
plt.ylabel('Count', size=10)
plt.title('Distribution of the time in seconds')
y_ticklocs=[0,1000,2000,3000, 4000,5000]
y_ticklabels=['0','1k','2k','3k','4k','5k']
plt.yticks(y_ticklocs, y_ticklabels);

We can see from the plot that the distribution is right-skewed and unimodal, with a peak between 200 and 400 seconds, at roughly 300 seconds.

In [None]:
bins = np.arange(0, 1800+20, 20)
plt.hist(data=plot,x='duration_sec',bins=bins);
ticks=np.arange(0,31,2)
tick_ =np.arange(0,1920,120)
plt.xlim([0,1800]);
plt.xticks(tick_,ticks);
plt.xlabel('Duration (mins)', size=14)
plt.ylabel('Count', size=11)
plt.title('Distribution of Trip Duration', size=16);
y_ticklocs=[0,1000,2000,3000, 4000,5000]
y_ticklabels=['0','1k','2k','3k','4k','5k']
plt.yticks(y_ticklocs, y_ticklabels);

Taking a look at this visualization again, but with the tick marks in minutes for easier reading. It shows that trip durations peak at about 5 minutes (300 seconds), the most common trip length.
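The relabelling works because the data stay in seconds while only the tick text changes. A small check of the two arrays used in the cell above (the names `tick_positions` and `tick_labels` are renamed here for readability):

```python
import numpy as np

tick_positions = np.arange(0, 1920, 120)   # where the ticks sit, in seconds
tick_labels = np.arange(0, 31, 2)          # what they display, in minutes

# Every label is its position converted from seconds to minutes.
print(list(zip(tick_positions.tolist(), tick_labels.tolist()))[:4])
```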

In [None]:
# The Duration with a logarithmic scale and x-axis limit
bins=np.arange(60,1e4,20)
plt.hist(data=plot,x='duration_sec',bins=bins)
plt.xscale('log')
plt.xlim(60,1e4)
x_ticklocs=[60,200,500,1000,2000,5000,10000]
x_ticklabels=['60','200','500','1k','2k','5k','10k']
plt.xticks(x_ticklocs, x_ticklabels)
plt.xlabel('Duration (Seconds)', size=14)
plt.ylabel('Count', size=13)
plt.title('Distribution of Trip Duration', size=16);
y_ticklocs=[0,1000,2000,3000, 4000,5000]
y_ticklabels=['0','1k','2k','3k','4k','5k']
plt.yticks(y_ticklocs, y_ticklabels);

After applying the log transform to the x axis we can see that the distribution is still unimodal and looks more symmetric, though still slightly right-skewed.
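The cell above keeps bins of a fixed 20-second width, which look narrower and narrower on a log axis. A common alternative, sketched here as a suggestion rather than what the notebook does, is to space the bin edges evenly in log10 units so every bar has the same visual width. The step of 0.1 is an arbitrary assumption.

```python
import numpy as np

# Bin edges equally spaced in log10 units between 60 s and 10,000 s.
log_edges = np.arange(np.log10(60), np.log10(1e4) + 0.1, 0.1)
bins = 10 ** log_edges

# Each bin's upper edge is the same multiple of its lower edge.
ratios = bins[1:] / bins[:-1]
print(len(bins), ratios.min().round(3), ratios.max().round(3))
```

These edges could then be passed as `bins=bins` to `plt.hist` together with `plt.xscale('log')`.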

In [None]:

plt.figure(figsize=[12,5])
bins =np.arange(0, 10+1/5,1/5)
plt.hist(data=plot, x='distance_(km)',bins=bins);
ticks=np.arange(0,10.4,0.4)
plt.xticks(ticks);
plt.xlabel('Distance Traveled (km)', size=12);
y_ticklocs=[0,2500,5000,7500,10000, 12500,15000,17500]
y_ticklabels=['0','2.5k','5k','7.5k','10k','12.5k','15k','17.5k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.ylabel('Count', size=12)
plt.title('Distribution of the distance traveled in km', size=15)

Taking a look at the distances traveled, we can see that the bikes were generally used for trips within about 2 km, with the most trips between 0.8 and 1.2 km. The distribution is right-skewed and bimodal.

In [None]:
label=np.arange(0,95,5);
bins=np.arange(0,plot['age'].max()+2,2);
plt.hist(data=plot[plot['age'] !=0],x='age',bins=bins);
plt.xlim([0,90]);
plt.xticks(label,rotation=90);
y_ticklocs=[0,2500,5000,7500,10000, 12500,15000,17500]
y_ticklabels=['0','2.5k','5k','7.5k','10k','12.5k','15k','17.5k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.xlabel('age')
plt.ylabel('count')
plt.title('Distribution of age')

From this plot, most trips were made by people between the ages of 25 and 40, whom we can assume to be the working class or the younger part of the population.
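A claim like "most trips fall between 25 and 40" can also be checked numerically by grouping ages into bands. The sketch below uses synthetic ages and band edges chosen only for illustration; neither comes from the notebook.

```python
import pandas as pd

# Synthetic ages for illustration only; 0 is the placeholder for unknown age.
ages = pd.Series([0, 22, 26, 31, 31, 38, 45, 67])

known = ages[ages != 0]
bands = pd.cut(known, bins=[15, 25, 40, 60, 100])   # right-closed intervals
band_counts = bands.value_counts().sort_index()
print(band_counts)
```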

In [None]:
plot.age.value_counts().nlargest(25).plot(kind='bar')
y_ticklocs=[0,2000,4000,6000,8000,10000]
y_ticklabels=['0','2k','4k','6k','8k','10k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.xlabel('age')
plt.ylabel('count')
plt.title('Distribution of the 25 most frequent ages')

Further analysis shows that the age with the most trips is 31, followed by 26, and so on, further supporting the theory that Ford GoBike users are mostly workers.

### Taking a look at the distribution of the trip starting hours

In [None]:
plot['start_hour'].value_counts().plot(kind='bar')
y_ticklocs=[0,2000,4000,6000,8000,10000,15e3,20e3]
y_ticklabels=['0','2k','4k','6k','8k','10k','15k','20k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.xlabel('Trip start time (hours)')
plt.ylabel('Count')
plt.title('Number of trips at different hours of the day');

We can see that 8 am and 5 pm were the most common hours for trips to start, and most trips generally started between 7-9 am and 4-6 pm. This is consistent with the idea that most Ford GoBike users are of the working class. The fewest trips started in the three hours before and after midnight.

 This relationship will be further investigated during the Bivariate and Multivariate analysis. 
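The bar chart above is ordered by frequency because `value_counts()` sorts by count. To read the bars as a timeline, the counts can be re-sorted by hour instead. A minimal sketch on synthetic hours (not the real data):

```python
import pandas as pd

# Synthetic start hours for illustration only.
start_hour = pd.Series([8, 17, 8, 9, 17, 8, 3])

by_count = start_hour.value_counts()               # most frequent hour first
by_hour = start_hour.value_counts().sort_index()   # chronological order

print(by_hour)
```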

In [None]:
users=plot['user_type'].value_counts()
plt.pie(users,labels=users.index);
plt.title('A pie depicting the types of users and their relative frequency');
plt.legend()

As seen, the number of trips by subscribers dwarfs the number by casual or one-off users.

In [None]:
users.plot(kind='bar')
y_ticklocs=[0,20000,40000,60000,80000,100000, 120000,140000,160000]
y_ticklabels=['0','20k','40k','60k','80k','100k','120k','140k','160k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.xlabel('User type')
plt.ylabel('Count')
plt.title('Count of the various user types');

Taking a closer look, we see that while subscribers took just north of 160,000 trips, customers took barely 20,000.

In [None]:
plot.info()

In [None]:
color=sb.color_palette()[0]
sb.countplot(data=plot, x='member_gender',color=color);
y_ticklocs=[0,20000,40000,60000,80000,100000, 120000]
y_ticklabels=['0','20k','40k','60k','80k','100k','120k']
plt.yticks(y_ticklocs, y_ticklabels);

Males make up most of Ford GoBike's users: more than 120,000 trips were made by males. Whether this is because bikes require more strenuous effort to use will be further investigated during the Bivariate exploration. Females are the next largest group, with around 40,000 trips. Trips with no registered gender form the third largest group, while users who identify as Other barely use Ford GoBike.

In [None]:
gender=plot['member_gender'].value_counts()
plt.pie(gender,labels=gender.index);
plt.title('Plot of gender distribution');
plt.legend()

Here we can see clearly how the gender distribution is in the dataset

In [None]:
sb.countplot(data=plot, x='bike_share_for_all_trip',color=color);
y_ticklocs=[0,20000,40000,60000,80000,100000, 120000,140000,160000]
y_ticklabels=['0','20k','40k','60k','80k','100k','120k','140k','160k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.title('Plot showing bike sharing for all trips')

From this we can say that the large majority of trips have bike_share_for_all_trip set to No

In [None]:
plt.figure(figsize=[10,5])
sb.countplot(data=plot,x='day',color=color);
plt.title('Distribution of trips across days')

Most trips were taken on weekdays, probably in connection with work, while the fewest were taken during the weekend.
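Weekday names are plain strings, so by default they appear in the order they occur in the data rather than Monday to Sunday. One way to fix the order, sketched on synthetic values, is an ordered categorical; the name `weekday_order` is an assumption of this sketch.

```python
import pandas as pd

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Synthetic day names for illustration only.
day = pd.Series(['Thursday', 'Monday', 'Sunday', 'Thursday', 'Friday'])

# An ordered categorical keeps Monday-to-Sunday order in counts and plots.
day = pd.Series(pd.Categorical(day, categories=weekday_order, ordered=True))
counts = day.value_counts().sort_index()
print(counts)
```

The same list could also be passed to seaborn's `countplot` through its `order` parameter.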

In [None]:
df['start_time'].dt.strftime('%d-%m').value_counts().sort_index().plot(kind='line', title='Ford GoBike trips over time', figsize=(12,10));
plt.xlabel('Day, Month');
plt.ylabel('Trips');

Taking a look at the number of trips over the month of February, we can see a gradual upward trend.

In [None]:
bikes=plot.bike_id.value_counts().nlargest(20)
bikes.plot(kind='bar')
plt.xlabel('Bike ids')
plt.ylabel('Count')
plt.title('The bikes which took the most trips');

## Bivariate Exploration
### Taking a look at influence of particular days on other variables

First taking a look at the mean trip duration and the days of the week

In [None]:
plot['duration_sec'].groupby(plot['day']).mean().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Mean trip duration')
plt.title('Plot of days compared to mean trip duration');

Here we can see that the longer trips were taken during the weekend, probably for leisure and exercise. Next taking a look at which day had the most total time spent on trips, which basically tells us how long Ford GoBike bikes were out on the streets.

In [None]:
plot['duration_sec'].groupby(plot['day']).sum().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Sum trip duration')
plt.title('Plot of days compared to sum of all trip duration');

From the plot above we can already tell that more trips were taken during the week than at the weekend, adding up to a longer total time, with Thursday having the highest total trip duration.
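The mean and the sum can rank the days differently because the sum also depends on how many trips were taken. A small sketch with synthetic trips (values invented for illustration) puts count, mean and sum side by side:

```python
import pandas as pd

# Synthetic trips for illustration only:
# several short weekday rides and a few long weekend rides.
trips = pd.DataFrame({
    'day': ['Thursday'] * 5 + ['Sunday'] * 2,
    'duration_sec': [300, 400, 500, 400, 400, 900, 700],
})

per_day = trips.groupby('day')['duration_sec'].agg(['count', 'mean', 'sum'])
print(per_day)
```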

In [None]:
plot['distance_(km)'].groupby(plot['day']).mean().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Mean trip distance')
plt.title('Plot of days compared to mean trip distance');

From this plot we can tell that trips covered longer distances during the week than on weekends

### Taking a glance at the influence of age on the other variables

First looking at how long the trips were for users of different ages

In [None]:
sb.regplot(data=plot,x='age',y='duration_sec',fit_reg=False, x_jitter=0.4, scatter_kws={'alpha':1/6})
y_ticklocs=[0,20000,40000,60000,80000]
y_ticklabels=['0','20k','40k','60k','80k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.xlabel('Age (years)')
plt.ylabel('Duration (Seconds)')
plt.title('Age and Duration');

First we might notice that there are several groups of outliers in the plot. The first is age 0, which was assigned to users whose age is unknown and detracts from the plot. There are also outliers from 100 years upward; since people at or near that age are unlikely to be riding bikes, we will remove them from the visualization.

In [None]:
sb.regplot(data=plot,x='age',y='duration_sec',fit_reg=False, x_jitter=0.4, scatter_kws={'alpha':1/6})
y_ticklocs=[0,20000,40000,60000,80000]
y_ticklabels=['0','20k','40k','60k','80k']
plt.yticks(y_ticklocs, y_ticklabels);
plt.xlim([15,95])
plt.xlabel('Age (years)')
plt.ylabel('Duration (Seconds)')
plt.title('Age and Duration');


Now taking a look at this plot, we can see that the longest trips were taken by those below 60 years of age. There is a high density of users up to about 50 years, where it begins to taper off, and the shortest durations are at around age 80, excluding additional outliers.
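Setting x-limits only hides the implausible ages; the rows are still in the data. If they should also be excluded from later summaries, a boolean filter is one option. The sketch below uses synthetic rows, and the frame name `riders` is hypothetical.

```python
import pandas as pd

# Synthetic rows for illustration only; age 0 stands for an unknown birth year.
riders = pd.DataFrame({
    'age': [0.0, 23.0, 35.0, 119.0, 58.0],
    'duration_sec': [600, 450, 300, 700, 520],
})

# Keep plausible ages only; the 15-95 window mirrors the x-limits used above.
plausible = riders[riders['age'].between(15, 95)]
print(plausible)
```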

In [None]:
sb.boxplot(data=plot,x='user_type',y='age',color=color)
plt.title('Distribution of the user types based on age')

From this we can see that Ford GoBike's subscribers were generally older than its casual customers

In [None]:
sb.boxplot(data=plot,x='member_gender',y='age',color=color);
plt.title('Plot of the various member genders against their age');

In [None]:
plot['age'].groupby(plot['member_gender']).mean().sort_values(ascending=False).plot(kind='bar');
plt.ylabel('Mean age')
plt.title('Mean age for the various member genders')

Users who identify as Other are on average the oldest of the different genders.

In [None]:
plot['age'].groupby(plot['start_hour']).mean().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Mean age of users')
plt.title('Plot of age compared to the start time (hour) of the trip');

Trips starting at 4:00 am have the highest mean user age.

In [None]:
sb.regplot(data=plot,x='age',y='distance_(km)',fit_reg=False, x_jitter=0.08, scatter_kws={'alpha':1/6});
plt.ylim([0,20])
plt.xlim([15,100]);
plt.title('Plot of user age against distance of trip');

A greater concentration of users who take longer-distance trips falls within the age range 20 to 50, consistent with the earlier assumption from the univariate analysis.

## Taking a look at the influence of other factors on the duration of trip

In [None]:
sb.boxplot(data=plot,x='user_type',y='duration_sec')
y_ticklocs=[0,20000,40000,60000,80000]
y_ticklabels=['0','20k','40k','60k','80k']
plt.yticks(y_ticklocs, y_ticklabels);

We see that user_type does not seem to have any distinct relationship with duration on a linear scale, so applying a log transform to the numerical y variable to check

In [None]:
plt.figure(figsize=[6,6])
sb.boxplot(data=plot,x='user_type',y='duration_sec',color=color)
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels)
plt.xlabel('User Type')
plt.ylabel('Duration (Seconds)')
plt.title('The Effect of User Type on the Duration of the Trip');

After applying the log transform we can clearly see that subscribers generally spend less time on trips than customers. This might be due to special privileges given to subscribers, but it might also be due to other factors. We will investigate this further in our analysis.

Taking a look at influence of member_gender on duration of trip, simply reusing the same code as before for the log transform but this time with *member_gender* instead of *user_type*

In [None]:
plt.figure(figsize=[6,6])
sb.boxplot(data=plot,x='member_gender',y='duration_sec',color=color)
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels)
plt.xlabel('Member gender')
plt.ylabel('Duration (Seconds)')
plt.title('The Effect of Member Gender on the Duration of the Trip');

Taking a look at the duration with respect to gender, we find that males generally spend less time on trips.

In [None]:
plt.figure(figsize=[6,6])
sb.boxplot(data=plot,x='bike_share_for_all_trip',y='duration_sec',color=color)
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels)
plt.xlabel('Bike sharing for all trips')
plt.ylabel('Duration (Seconds)')
plt.title('The Effect of Bike Share for All on the Duration of the Trip');

The visualization shows that trips with bike_share_for_all_trip set to No generally last longer

In [None]:
sb.regplot(data=plot,x='distance_(km)',y='duration_sec',fit_reg=False, x_jitter=0.08, scatter_kws={'alpha':1/6})
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels);
plt.xlim([0,20])
plt.ylabel('Duration (sec)')
plt.xlabel('Distance (km)')
plt.title('Distance against Duration');


As seen, there is a generally positive correlation between the distance of a trip and its duration.
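The visual impression can be backed with a number. The sketch below computes two correlation coefficients on synthetic trips (values invented for illustration): a rank-based (Spearman) correlation, obtained here by correlating the ranks, and a Pearson correlation against log10 of the duration to match the log-scaled y axis.

```python
import numpy as np
import pandas as pd

# Synthetic trips for illustration only.
trips = pd.DataFrame({
    'distance_km': [0.5, 1.0, 1.5, 2.0, 3.0, 4.0],
    'duration_sec': [200, 380, 500, 700, 1100, 1300],
})

# Spearman correlation = Pearson correlation of the ranks;
# it is less sensitive to very long outlier trips.
rho = trips['distance_km'].rank().corr(trips['duration_sec'].rank())
# Pearson correlation on log10(duration), matching the log-scaled axis.
r_log = trips['distance_km'].corr(np.log10(trips['duration_sec']))
print(round(rho, 3), round(r_log, 3))
```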

## Multivariate Plots
Making some multivariate plots to confirm the relationships between:

- distance of the trip
- user type
- duration of trip
- bike_share_for_all_trip
- member gender
- age

In [None]:
sb.scatterplot(data=plot,x='distance_(km)',y='duration_sec',hue='bike_share_for_all_trip')
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels);
plt.xlim([0,20])
plt.legend(bbox_to_anchor =(1.25, 1.23))
plt.title('Effect of bike_share_for_all_trip on distance and duration of travel')

From this plot we can see that the trips with bike_share_for_all_trip set to Yes mostly covered a small distance and did not last long. This confirms our earlier finding that these trips are shorter than the rest.

In [None]:
sb.scatterplot(data=plot,x='distance_(km)',y='duration_sec',hue='user_type',x_jitter=0.2)
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels);
plt.xlim([0,20]);
plt.title('Effect of user_type on distance and duration of travel')

This plot shows that most subscribers spend less time on their trips and also go on shorter trips than casual customers

In [None]:
sb.scatterplot(data=plot,x='distance_(km)',y='duration_sec',hue='member_gender',x_jitter=0.5)
plt.yscale('log')
y_ticklocs=[100,200,500,1000,2e3,5e3,1e4,2e4,4e4,8e4]
y_ticklabels=['100','200','500','1k','2k','5k','10k','20k','40k','80k']
plt.yticks(y_ticklocs,y_ticklabels);
plt.xlim([0,20])
plt.title('Effect of member_gender on distance and duration of travel')

There is a far greater spread of men who go on longer-distance trips but also spend less time on them, though since the data contains more males than females this may be a bias.

In [None]:
g=sb.FacetGrid(data=plot,col='member_gender',row='user_type',height=3,margin_titles=True)
g.map(plt.scatter, 'distance_(km)','duration_sec')
y_ticklocs=[0,20000,40000,60000,80000]
y_ticklabels=['0','20k','40k','60k','80k']
# apply the limits and ticks to every facet, not only the last axes drawn
for ax in g.axes.flat:
    ax.set_xlim(0,20)
    ax.set_yticks(y_ticklocs)
    ax.set_yticklabels(y_ticklabels)
# a figure-level title, so it is not attached to a single facet
g.fig.suptitle('Member genders and User type effect on duration of travel and distance', y=1.03);

In [None]:
sb.boxplot(x="member_gender", y="distance_(km)", hue='user_type', data=plot)
plt.ylim([0,20])
plt.legend(bbox_to_anchor =(1.27, 1.25))

From this visualization, customers tend to go on longer trips, with the **Other** gender going the furthest, and the same holds among subscribers, though there are more outliers of male subscribers going on longer trips. Those who did not give a gender also follow the trend of the **Other** gender closely.

In [None]:
sb.boxplot(x="user_type", y="duration_sec", hue='member_gender', data=plot)
# plt.ylim([0,20])
plt.yscale('log')
y_ticklocs=[1000,5e3,1e4,2e4,4e4,5e4,8e4]
y_ticklabels=['1k','5k','10k','20k','40k','50k','80k']
plt.yticks(y_ticklocs,y_ticklabels);
plt.legend(bbox_to_anchor =(1.27, 1.25))

From the analysis, males travel faster and take less time than the other genders.
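"Faster" can be made explicit by deriving an average speed from the two columns already available. The sketch below uses synthetic trips and a hypothetical column name `speed_kmh`. Since the distance is measured between stations rather than along the actual route, such a speed would be a lower bound on the real riding speed.

```python
import pandas as pd

# Synthetic trips for illustration only.
trips = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Female', 'Other'],
    'distance_km': [2.0, 3.0, 2.0, 1.5],
    'duration_sec': [480, 720, 600, 540],
})

# Average speed in km/h: distance divided by duration in hours.
trips['speed_kmh'] = trips['distance_km'] / (trips['duration_sec'] / 3600)
median_speed = trips.groupby('member_gender')['speed_kmh'].median()
print(median_speed)
```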

In [None]:
sb.countplot(data=plot,x='member_gender',hue='user_type')
plt.xlabel('Gender',labelpad=10)
plt.ylabel('Count')
plt.legend(title='User Type')
plt.title('Gender and User Type');

# Conclusion

After taking a look at the various variables and carrying out univariate, bivariate and multivariate analysis, I found that:
- User type is related to neither the age nor the gender of the user.
- The predominance of males in the data probably distorted a lot of the visualizations.
- The longer a trip was, the higher the probability of the user's gender being **male** or **Other**
- Customers generally go on longer trips compared to subscribers, excluding outliers
- Males take less time on trips for the same distance

In [None]:
!jupyter nbconvert project3.ipynb --to slides --post serve --no-input --no-prompt