
# Project: Ford Go-bike system data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#Exploratory">Exploratory Data Analysis</a></li>
<li><a href="#Explanatory">Explanatory</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> I will analyze and explore the Ford Go-bike system dataset, which includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area.
The dataset contains about 183,000 rides collected from the bike-sharing system and 16 variables, including ride duration, start and end times for each trip, the locations of the start and end stations, member gender, member birth year, and other attributes worth exploring.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
from scipy import stats
import datetime


<a id='wrangling'></a>
## Data Wrangling

In [3]:
df = pd.read_csv('201902-fordgobike-tripdata.csv')
print(df.shape)
df.head()

(183412, 16)


Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  memb

### To clean
- convert duration_sec to duration in minutes
- convert start and end times to datetime
- convert all IDs to object
- convert latitude and longitude to object
- add a new column for the start and end station locations
- convert user type to category
- convert birth year to object
- convert gender to category
- convert bike_share_for_all_trip to category

In [5]:
df.describe()

Unnamed: 0,duration_sec,start_station_id,start_station_latitude,start_station_longitude,end_station_id,end_station_latitude,end_station_longitude,bike_id,member_birth_year
count,183412.0,183215.0,183412.0,183412.0,183215.0,183412.0,183412.0,183412.0,175147.0
mean,726.078435,138.590427,37.771223,-122.352664,136.249123,37.771427,-122.35225,4472.906375,1984.806437
std,1794.38978,111.778864,0.099581,0.117097,111.515131,0.09949,0.116673,1664.383394,10.116689
min,61.0,3.0,37.317298,-122.453704,3.0,37.317298,-122.453704,11.0,1878.0
25%,325.0,47.0,37.770083,-122.412408,44.0,37.770407,-122.411726,3777.0,1980.0
50%,514.0,104.0,37.78076,-122.398285,100.0,37.78101,-122.398279,4958.0,1987.0
75%,796.0,239.0,37.79728,-122.286533,235.0,37.79732,-122.288045,5502.0,1992.0
max,85444.0,398.0,37.880222,-121.874119,398.0,37.880222,-121.874119,6645.0,2001.0


In [6]:
df.duration_sec.sort_values(ascending = True)

169882       61
157305       61
103565       61
44787        61
44301        61
          ...  
112435    83407
127999    83519
153705    83772
85465     84548
101361    85444
Name: duration_sec, Length: 183412, dtype: int64

In [7]:
df[df['member_birth_year'] < 1919]  # implausible birth years (riders over 100); convert to NaN

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
1285,148,2019-02-28 19:29:17.6270,2019-02-28 19:31:45.9670,158.0,Shattuck Ave at Telegraph Ave,37.833279,-122.263490,173.0,Shattuck Ave at 55th St,37.840364,-122.264488,5391,Subscriber,1900.0,Male,Yes
10827,1315,2019-02-27 19:21:34.4360,2019-02-27 19:43:30.0080,343.0,Bryant St at 2nd St,37.783172,-122.393572,375.0,Grove St at Masonic Ave,37.774836,-122.446546,6249,Subscriber,1900.0,Male,No
16087,1131,2019-02-27 08:37:36.8640,2019-02-27 08:56:28.0220,375.0,Grove St at Masonic Ave,37.774836,-122.446546,36.0,Folsom St at 3rd St,37.783830,-122.398870,4968,Subscriber,1900.0,Male,No
19375,641,2019-02-26 17:03:19.8550,2019-02-26 17:14:01.6190,9.0,Broadway at Battery St,37.798572,-122.400869,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,6164,Customer,1900.0,Male,No
21424,1424,2019-02-26 08:58:02.9040,2019-02-26 09:21:47.7490,375.0,Grove St at Masonic Ave,37.774836,-122.446546,343.0,Bryant St at 2nd St,37.783172,-122.393572,5344,Subscriber,1900.0,Male,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171996,1368,2019-02-03 17:33:54.6070,2019-02-03 17:56:42.9490,37.0,2nd St at Folsom St,37.785000,-122.395936,375.0,Grove St at Masonic Ave,37.774836,-122.446546,4988,Subscriber,1900.0,Male,No
173711,993,2019-02-03 09:45:30.4640,2019-02-03 10:02:04.1690,375.0,Grove St at Masonic Ave,37.774836,-122.446546,36.0,Folsom St at 3rd St,37.783830,-122.398870,5445,Subscriber,1900.0,Male,No
177708,1527,2019-02-01 19:09:28.3870,2019-02-01 19:34:55.9630,343.0,Bryant St at 2nd St,37.783172,-122.393572,375.0,Grove St at Masonic Ave,37.774836,-122.446546,5286,Subscriber,1900.0,Male,No
177885,517,2019-02-01 18:38:40.4710,2019-02-01 18:47:18.3920,25.0,Howard St at 2nd St,37.787522,-122.397405,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,2175,Subscriber,1902.0,Female,No


In [8]:
df[df['start_station_id'].isnull()] 
# I will keep these rows rather than drop them, because the station coordinates and all other fields are present.

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
475,1709,2019-02-28 20:55:53.9320,2019-02-28 21:24:23.7380,,,37.40,-121.94,,,37.40,-121.93,4211,Customer,1991.0,Female,No
1733,1272,2019-02-28 18:32:34.2730,2019-02-28 18:53:46.7270,,,37.40,-121.94,,,37.41,-121.96,4174,Subscriber,1980.0,Male,No
3625,142,2019-02-28 17:10:46.5290,2019-02-28 17:13:09.4310,,,37.41,-121.95,,,37.41,-121.96,4283,Subscriber,1988.0,Male,No
4070,585,2019-02-28 16:28:45.9340,2019-02-28 16:38:31.3320,,,37.39,-121.93,,,37.40,-121.92,4089,Subscriber,1984.0,Male,Yes
5654,509,2019-02-28 12:30:17.1310,2019-02-28 12:38:46.3290,,,37.40,-121.92,,,37.39,-121.93,4089,Subscriber,1984.0,Male,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
176154,1447,2019-02-02 12:03:04.5440,2019-02-02 12:27:12.2670,,,37.40,-121.93,,,37.40,-121.93,4249,Customer,1984.0,Male,No
179730,309,2019-02-01 12:59:45.9690,2019-02-01 13:04:55.4260,,,37.40,-121.94,,,37.40,-121.93,4249,Customer,1987.0,Female,No
179970,659,2019-02-01 12:17:37.6750,2019-02-01 12:28:37.0140,,,37.41,-121.96,,,37.41,-121.94,4092,Subscriber,1999.0,Female,No
180106,2013,2019-02-01 11:33:55.1470,2019-02-01 12:07:28.9400,,,37.40,-121.94,,,37.40,-121.94,4251,Customer,1990.0,Female,No


In [9]:
df[df['member_birth_year'].isnull()] 
# I will accept these rows: gender and birth year are missing from the CSV file for 8265 rides,
# but all other fields are present.

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.775880,-122.393170,2535,Customer,,,No
13,915,2019-02-28 23:49:06.0620,2019-03-01 00:04:21.8670,252.0,Channing Way at Shattuck Ave,37.865847,-122.267443,244.0,Shattuck Ave at Hearst Ave,37.873676,-122.268487,5101,Subscriber,,,No
28,650,2019-02-28 23:43:27.5030,2019-02-28 23:54:18.4510,258.0,University Ave at Oxford St,37.872355,-122.266447,263.0,Channing Way at San Pablo Ave,37.862827,-122.290231,4784,Customer,,,No
53,3418,2019-02-28 22:41:16.3620,2019-02-28 23:38:14.3630,11.0,Davis St at Jackson St,37.797280,-122.398436,11.0,Davis St at Jackson St,37.797280,-122.398436,319,Customer,,,No
65,926,2019-02-28 23:17:05.8530,2019-02-28 23:32:32.6820,13.0,Commercial St at Montgomery St,37.794231,-122.402923,81.0,Berry St at 4th St,37.775880,-122.393170,2951,Subscriber,,,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183354,449,2019-02-01 01:35:07.6630,2019-02-01 01:42:36.8780,244.0,Shattuck Ave at Hearst Ave,37.873676,-122.268487,253.0,Haste St at College Ave,37.866418,-122.253799,5430,Customer,,,No
183356,795,2019-02-01 01:25:50.3660,2019-02-01 01:39:05.9500,368.0,Myrtle St at Polk St,37.785434,-122.419622,125.0,20th St at Bryant St,37.759200,-122.409851,5400,Subscriber,,,No
183363,673,2019-02-01 01:12:24.4200,2019-02-01 01:23:37.6450,75.0,Market St at Franklin St,37.773793,-122.421239,133.0,Valencia St at 22nd St,37.755213,-122.420975,5166,Customer,,,No
183371,196,2019-02-01 01:08:38.6410,2019-02-01 01:11:54.9490,58.0,Market St at 10th St,37.776619,-122.417385,75.0,Market St at Franklin St,37.773793,-122.421239,2395,Customer,,,No


In [10]:
df.member_gender.isnull().sum()

8265
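The missing-gender count equals the missing-birth-year count (8265 each), which suggests that the same rides lack both fields. A minimal sketch of how that could be checked is shown here on a made-up toy frame; the values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trip table (values are made up for illustration).
toy = pd.DataFrame({
    "member_birth_year": [1984.0, np.nan, 1972.0, np.nan],
    "member_gender": ["Male", np.nan, "Male", np.nan],
})

missing_year = toy["member_birth_year"].isnull()
missing_gender = toy["member_gender"].isnull()

# True only when every row missing one field is also missing the other.
same_rows = bool((missing_year == missing_gender).all())
print(missing_year.sum(), missing_gender.sum(), same_rows)
```

On the real data the same comparison would use `df` in place of `toy`.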

In [11]:
df.sample(30)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
6504,654,2019-02-28 10:05:45.0950,2019-02-28 10:16:40.0750,92.0,Mission Bay Kids Park,37.772301,-122.393028,321.0,5th St at Folsom,37.780146,-122.403071,5967,Subscriber,1988.0,Male,No
114799,749,2019-02-12 09:07:37.8180,2019-02-12 09:20:07.0950,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,176,Subscriber,1963.0,Male,No
84503,445,2019-02-17 17:28:11.9750,2019-02-17 17:35:37.4640,105.0,16th St at Prosper St,37.764285,-122.431804,88.0,11th St at Bryant St,37.77003,-122.411726,5325,Subscriber,1984.0,Male,No
117708,925,2019-02-11 20:26:17.8960,2019-02-11 20:41:43.6720,27.0,Beale St at Harrison St,37.788059,-122.391865,365.0,Turk St at Fillmore St,37.78045,-122.431946,4911,Subscriber,1983.0,Male,No
48031,729,2019-02-22 08:13:09.3700,2019-02-22 08:25:19.3240,118.0,Eureka Valley Recreation Center,37.759177,-122.436943,114.0,Rhode Island St at 17th St,37.764478,-122.40257,5911,Subscriber,1974.0,Male,No
137018,551,2019-02-08 09:02:38.7060,2019-02-08 09:11:50.1740,73.0,Pierce St at Haight St,37.771793,-122.433708,58.0,Market St at 10th St,37.776619,-122.417385,1758,Customer,1981.0,Female,No
48297,563,2019-02-22 07:56:24.2260,2019-02-22 08:05:48.0070,241.0,Ashby BART Station,37.852477,-122.270213,238.0,MLK Jr Way at University Ave,37.871719,-122.273068,6411,Subscriber,1997.0,Male,No
140553,1219,2019-02-07 18:42:07.0520,2019-02-07 19:02:26.7970,59.0,S Van Ness Ave at Market St,37.774814,-122.418954,70.0,Central Ave at Fell St,37.773311,-122.444293,4626,Subscriber,1994.0,Female,No
16466,244,2019-02-27 08:33:18.7570,2019-02-27 08:37:22.7800,90.0,Townsend St at 7th St,37.771058,-122.402717,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,4988,Subscriber,1993.0,Female,No
69004,1793,2019-02-19 20:08:06.8410,2019-02-19 20:38:00.7220,78.0,Folsom St at 9th St,37.773717,-122.411647,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,4767,Subscriber,1991.0,Female,No


In [12]:
df.start_station_name.value_counts(ascending = True).head(10)

16th St Depot                             2
21st Ave at International Blvd            4
Palm St at Willow St                      4
Parker Ave at McAllister St               7
Willow St at Vine St                      9
Taylor St at 9th St                      13
Backesto Park (Jackson St at 13th St)    17
Leavenworth St at Broadway               17
Farnam St at Fruitvale Ave               18
23rd Ave at Foothill Blvd                18
Name: start_station_name, dtype: int64

In [13]:
df.end_station_name.value_counts(ascending = True).head(10)

Willow St at Vine St                5
21st Ave at International Blvd      6
16th St Depot                       6
Palm St at Willow St                7
Parker Ave at McAllister St         9
Taylor St at 9th St                11
Leavenworth St at Broadway         12
Foothill Blvd at Harrington Ave    16
26th Ave at International Blvd     19
Foothill Blvd at 42nd Ave          20
Name: end_station_name, dtype: int64

In [14]:
df.user_type.value_counts(ascending = True).head(10)

Customer       19868
Subscriber    163544
Name: user_type, dtype: int64

In [15]:
df.member_gender.value_counts(ascending = True)

Other       3652
Female     40844
Male      130651
Name: member_gender, dtype: int64

In [16]:
df.bike_share_for_all_trip.value_counts(ascending = True).head(10)

Yes     17359
No     166053
Name: bike_share_for_all_trip, dtype: int64

In [17]:
df.duplicated().sum()

0

### Data Cleaning

### To clean
- convert duration_sec to duration in minutes to be more readable
- convert start and end times to datetime
- convert all IDs to object
- convert user type to category
- convert birth year to object
- convert gender to category
- convert bike_share_for_all_trip to category
- convert birth years earlier than 1919, and the corresponding gender, to NaN
- rearrange columns
- add a new column for the user age

In [18]:
df_clean = df.copy()

In [19]:
df_clean.columns

Index(['duration_sec', 'start_time', 'end_time', 'start_station_id',
       'start_station_name', 'start_station_latitude',
       'start_station_longitude', 'end_station_id', 'end_station_name',
       'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type',
       'member_birth_year', 'member_gender', 'bike_share_for_all_trip'],
      dtype='object')

In [20]:
df_clean = df_clean.rename(columns = {"duration_sec": "duration_minutes"})
df_clean.duration_minutes = (df_clean.duration_minutes) / 60
# IDs become strings; the raw, anchored pattern strips only a trailing '.0'
df_clean.start_station_id = df_clean.start_station_id.astype('str').replace(r'\.0$', '', regex=True)
df_clean.end_station_id = df_clean.end_station_id.astype('str').replace(r'\.0$', '', regex=True)
df_clean.bike_id = df_clean.bike_id.astype('str').replace(r'\.0$', '', regex=True)
df_clean.user_type = df_clean.user_type.astype('category')
df_clean.member_gender = df_clean.member_gender.astype('category')
df_clean.bike_share_for_all_trip = df_clean.bike_share_for_all_trip.astype('category')
df_clean.start_time = pd.to_datetime(df_clean.start_time)
df_clean.end_time = pd.to_datetime(df_clean.end_time)
# blank out implausible birth years (before 1919) together with the matching gender
df_clean.loc[df_clean['member_birth_year'] < 1919, 'member_gender'] = np.nan
df_clean.loc[df_clean['member_birth_year'] < 1919, 'member_birth_year'] = np.nan
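The station IDs are stored as floats only because some are missing. As an alternative to the string conversion above, pandas' nullable `Int64` dtype keeps the gap as a real missing value. The sketch below uses a made-up three-value series; it is an illustration under that assumption, not the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Toy station IDs, one of them missing (values are made up).
ids = pd.Series([21.0, 375.0, np.nan], name="start_station_id")

# Nullable integers drop the '.0' without turning the gap into text.
as_int = ids.astype("Int64")

# If labels are needed, the nullable string dtype keeps the gap missing too.
as_label = as_int.astype("string")

print(as_int.isna().sum(), as_label.iloc[0])
```

With this approach `isnull()` still finds the rows without a station ID after cleaning.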


In [21]:
df_clean['user_age'] = (df_clean['end_time'].dt.year) - (df_clean.member_birth_year)
df_clean['user_age'].value_counts()

31.0    10236
26.0     9325
30.0     8972
29.0     8658
28.0     8498
        ...  
85.0        2
75.0        2
91.0        1
89.0        1
92.0        1
Name: user_age, Length: 70, dtype: int64
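The 1919 cutoff corresponds to a rider who would be 100 years old in 2019. A small sketch of the same idea expressed as an age limit is given below; `MAX_AGE` and the toy rows are assumptions introduced for illustration.

```python
import pandas as pd

# Toy rides (made-up rows that mimic the columns used for the age).
toy = pd.DataFrame({
    "end_time": pd.to_datetime([
        "2019-02-28 19:31:45",
        "2019-02-01 18:47:18",
        "2019-03-01 08:01:55",
    ]),
    "member_birth_year": [1900.0, 1902.0, 1984.0],
})

MAX_AGE = 100  # assumed plausibility limit; 2019 - 100 gives the 1919 cutoff

age = toy["end_time"].dt.year - toy["member_birth_year"]
# Keep ages up to the limit and replace the rest with NaN.
toy["user_age"] = age.where(age <= MAX_AGE)
print(toy["user_age"].tolist())
```

Stating the rule as an age makes the threshold independent of the year in which the rides were recorded.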

In [22]:
df_clean.member_birth_year = df_clean.member_birth_year.astype('str').replace(r'\.0$', '', regex=True)

In [23]:
df_clean.columns

Index(['duration_minutes', 'start_time', 'end_time', 'start_station_id',
       'start_station_name', 'start_station_latitude',
       'start_station_longitude', 'end_station_id', 'end_station_name',
       'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type',
       'member_birth_year', 'member_gender', 'bike_share_for_all_trip',
       'user_age'],
      dtype='object')

In [25]:
# rearrange columns so that user_age follows member_gender
df_clean = df_clean[['duration_minutes', 'start_time', 'end_time',
                     'start_station_id', 'start_station_name',
                     'start_station_latitude', 'start_station_longitude',
                     'end_station_id', 'end_station_name',
                     'end_station_latitude', 'end_station_longitude',
                     'bike_id', 'user_type', 'member_birth_year', 'member_gender',
                     'user_age', 'bike_share_for_all_trip']]

In [26]:
print(df_clean.shape)
df_clean.head()

(183412, 17)


Unnamed: 0,duration_minutes,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,user_age,bike_share_for_all_trip
0,869.75,2019-02-28 17:32:10.145,2019-03-01 08:01:55.975,21,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,35.0,No
1,708.683333,2019-02-28 18:53:21.789,2019-03-01 06:42:03.056,23,The Embarcadero at Steuart St,37.791464,-122.391034,81,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,,No
2,1030.9,2019-02-28 12:13:13.218,2019-03-01 05:24:08.146,86,Market St at Dolores St,37.769305,-122.426826,3,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,47.0,No
3,608.166667,2019-02-28 17:54:26.010,2019-03-01 04:02:36.842,375,Grove St at Masonic Ave,37.774836,-122.446546,70,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,30.0,No
4,26.416667,2019-02-28 23:54:18.549,2019-03-01 00:20:44.074,7,Frank H Ogawa Plaza,37.804562,-122.271738,222,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,45.0,Yes


In [27]:
df_clean.loc[df_clean['member_birth_year'] == 'nan', 'member_birth_year'] = np.nan
df_clean[df_clean['member_birth_year'] == 'nan']

Unnamed: 0,duration_minutes,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,user_age,bike_share_for_all_trip


In [28]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_minutes         183412 non-null  float64       
 1   start_time               183412 non-null  datetime64[ns]
 2   end_time                 183412 non-null  datetime64[ns]
 3   start_station_id         183412 non-null  object        
 4   start_station_name       183215 non-null  object        
 5   start_station_latitude   183412 non-null  float64       
 6   start_station_longitude  183412 non-null  float64       
 7   end_station_id           183412 non-null  object        
 8   end_station_name         183215 non-null  object        
 9   end_station_latitude     183412 non-null  float64       
 10  end_station_longitude    183412 non-null  float64       
 11  bike_id                  183412 non-null  object        
 12  user_type       

In [31]:
df_clean.describe()

Unnamed: 0,duration_minutes,start_station_latitude,start_station_longitude,end_station_latitude,end_station_longitude,user_age
count,183412.0,183412.0,183412.0,183412.0,183412.0,175075.0
mean,12.101307,37.771223,-122.352664,37.771427,-122.35225,34.158778
std,29.906496,0.099581,0.117097,0.09949,0.116673,9.972079
min,1.016667,37.317298,-122.453704,37.317298,-122.453704,18.0
25%,5.416667,37.770083,-122.412408,37.770407,-122.411726,27.0
50%,8.566667,37.78076,-122.398285,37.78101,-122.398279,32.0
75%,13.266667,37.79728,-122.286533,37.79732,-122.288045,39.0
max,1424.066667,37.880222,-121.874119,37.880222,-121.874119,99.0


In [32]:
df_clean.to_csv('Ford Go-bike system data.csv', index=False)

<a id='Exploratory'></a>
## Exploratory Data Analysis

In [None]:
df_to_plot = df_clean.copy()
df_to_plot['day'] = df_clean['start_time'].dt.day_name()
df_to_plot['hour'] = (df_clean['start_time'].dt.hour) / 1.0
df_to_plot_weekdays = df_to_plot.loc[df_to_plot['day'].isin(['Monday','Tuesday','Wednesday','Thursday','Friday'])]
df_to_plot_weekends = df_to_plot.loc[df_to_plot['day'].isin(['Saturday','Sunday',])]
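The weekday and weekend frames are built by matching day names. An equivalent split can be sketched with `dt.dayofweek`, where Monday is 0 and Sunday is 6; the timestamps below are toy values chosen for illustration.

```python
import pandas as pd

# Toy start times: a Thursday, a Sunday and a Friday in February 2019.
starts = pd.Series(pd.to_datetime([
    "2019-02-28 17:32:10",
    "2019-02-03 09:45:30",
    "2019-02-01 12:59:45",
]))

day_names = starts.dt.day_name()
# Saturday is 5 and Sunday is 6, so anything >= 5 is a weekend ride.
is_weekend = starts.dt.dayofweek >= 5

print(list(zip(day_names, is_weekend)))
```

A boolean column like this avoids spelling out the day names twice.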


### Research Question 1 (On which day and at which hour do most trips occur?)

In [None]:
# busiest days
most_trips_day = df_to_plot.groupby('day').start_station_id.count().sort_values(ascending= False).head(10)
most_trips_day

In [None]:
# busiest hours
most_trips_hr = df_to_plot['hour'].value_counts()
most_trips_hr

### Research Question 2 (At which hours do most trips occur on weekdays vs. weekends?)

In [None]:
# busiest hours on weekdays
most_trips_hr1 = df_to_plot_weekdays.groupby('hour').start_station_id.count().sort_values(ascending= False).head(6)
most_trips_hr1

In [None]:
# busiest hours on weekends
most_trips_hr2 = df_to_plot_weekends.groupby('hour').start_station_id.count().sort_values(ascending= False).head(6)
most_trips_hr2

### Research Question 3 (Which user type is most common on weekdays, on weekends, and overall?)


In [None]:
# most common user type on weekdays
most_trips_user1 = df_to_plot_weekdays.groupby('user_type').start_station_id.count().sort_values(ascending= False).head(10)
most_trips_user1

In [None]:
# most common user type on weekends
most_trips_user2 = df_to_plot_weekends.groupby('user_type').start_station_id.count().sort_values(ascending= False).head(10)
most_trips_user2

In [None]:
most_trips_user = df_to_plot.groupby('user_type').start_station_id.count().sort_values(ascending= False).head(10)
most_trips_user

### Research Question 4 (How does the number of trips vary with age?)

In [None]:
avg_trip_duration_age = df_to_plot.groupby('user_age').duration_minutes.count().sort_values(ascending= False).head(11)
avg_trip_duration_age

### Research Question 5 (Which start stations are the most popular?)


In [None]:
most_start_station = df_to_plot.groupby('start_station_name').start_station_id.count().sort_values(ascending= False).head(10)
most_start_station

### Research Question 6 (Which start stations are the least popular?)


In [None]:
less_start_station = df_to_plot.groupby('start_station_name').start_station_id.count().sort_values(ascending= True).head(10)
less_start_station


### Research Question 7 (Which end stations are the most popular?)


In [None]:
most_end_station = df_to_plot.groupby('end_station_name').start_station_id.count().sort_values(ascending= False).head(10)
most_end_station

### Research Question 8 (Which end stations are the least popular?)


In [None]:
less_end_station = df_to_plot.groupby('end_station_name').start_station_id.count().sort_values(ascending= True).head(10)
less_end_station

### Research Question 9 (Which start and end station pairs are the most common?)


In [None]:
df_to_plot['start_end_station'] = df_to_plot.start_station_name + ' / ' +  df_to_plot.end_station_name 

most_relatives = df_to_plot.groupby(['start_end_station']).start_station_id.count().sort_values(ascending= False).head(10)
most_relatives
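The pair count above joins the two station names into one string. The same ranking can be sketched by grouping on both columns directly, which avoids building the combined label; the station names below are placeholders.

```python
import pandas as pd

# Toy rides with placeholder station names.
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A"],
    "end_station_name": ["B", "B", "A", "C"],
})

# size() counts the rows in each (start, end) group.
pair_counts = (
    toy.groupby(["start_station_name", "end_station_name"])
       .size()
       .sort_values(ascending=False)
)
print(pair_counts.head(3))
```

The result is indexed by a (start, end) pair, so each station can still be read off separately.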

### Research Question 10 (Which bikes are used the most?)


In [None]:
most_bikes = df_to_plot.groupby('bike_id').start_station_id.count().sort_values(ascending= False).head(10)
most_bikes

### Research Question 11 (What is the average trip duration for the most-used bikes?)


In [None]:
df_to_plot_most_used = df_to_plot.loc[df_to_plot['bike_id'].isin(['4794','4814','5014','4422','5175','5145','4450','5482','5274','4773'])]
most_bikes_avg_dur = df_to_plot_most_used.groupby('bike_id').duration_minutes.mean().sort_values(ascending= False).head(10)
most_bikes_avg_dur
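The bike IDs in the filter above are typed in by hand. A hedged sketch of deriving them from the trip counts instead is shown below on toy data; `top_n` and the rows are assumptions for illustration.

```python
import pandas as pd

# Toy rides: bike 4794 has three trips, 4814 has two, 5014 has one.
toy = pd.DataFrame({
    "bike_id": ["4794", "4794", "4794", "4814", "4814", "5014"],
    "duration_minutes": [10.0, 20.0, 30.0, 5.0, 15.0, 8.0],
})

top_n = 2  # hypothetical cutoff; the notebook looks at the top 10
top_ids = toy["bike_id"].value_counts().head(top_n).index

# Average duration restricted to the most-used bikes.
avg_dur = (
    toy[toy["bike_id"].isin(top_ids)]
    .groupby("bike_id")["duration_minutes"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_dur)
```

Deriving the IDs this way keeps the filter in step with the data if the file changes.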

### Research Question 12 (Which bikes are used most on bike_share_for_all_trip rides?)


In [None]:
df_to_plot_all_trip = df_to_plot.loc[df_to_plot['bike_share_for_all_trip'].isin(['Yes'])]
bikes_all_trip = df_to_plot_all_trip.groupby(['bike_id']).bike_share_for_all_trip.count().sort_values(ascending= False).head(10)
bikes_all_trip

### Research Question 13 (What is the average duration for bike_share_for_all_trip rides vs. other rides?)


In [None]:
dur = df_to_plot.groupby(['bike_share_for_all_trip']).duration_minutes.mean().sort_values(ascending= False).head(10)
dur

### Research Question 14 (What is the average age for bike_share_for_all_trip rides vs. other rides?)


In [None]:
avg_age = df_to_plot.groupby(['bike_share_for_all_trip']).user_age.mean().sort_values(ascending= False).head(10)
avg_age

In [None]:
df_to_plot.columns

### Research Question 15 (What is the average age for each gender?)


In [None]:
avg_age_gender = df_to_plot.groupby(['member_gender']).user_age.mean().sort_values(ascending= False).head(10)
avg_age_gender

### Research Question 16 (How does trip duration vary with user type?)


In [None]:
type_dur = df_to_plot.groupby(['user_type']).duration_minutes.mean().sort_values(ascending= False).head(10)
type_dur

### Research Question 17 (How does trip duration vary with gender?)


In [None]:
gender_dur = df_to_plot.groupby(['member_gender']).duration_minutes.mean().sort_values(ascending= False).head(10)
gender_dur

### Research Question 18 (Which gender has the most trips?)


In [None]:
most_gender_dur = df_to_plot.groupby('member_gender').duration_minutes.count().sort_values(ascending= False).head(10)
most_gender_dur

### Research Question 19 (What proportion of trips are bike_share_for_all_trip rides?)


In [None]:
all_trip_por= df_to_plot.groupby('bike_share_for_all_trip').duration_minutes.count().sort_values(ascending= False).head(10)
all_trip_por

### Research Question 20 (How does duration relate to bike_share_for_all_trip and gender?)


In [None]:
relation1 = df_to_plot.groupby(['bike_share_for_all_trip','member_gender']).duration_minutes.mean().sort_values(ascending= False).head(10)
relation1

### Research Question 21 (How does duration relate to gender and user type?)


In [None]:
relation3 = df_to_plot.groupby(['member_gender','user_type']).duration_minutes.mean().sort_values(ascending= False).head(10)
relation3

### Research Question 24 (How does the number of trips relate to gender and user type?)


In [None]:
relation4 = df_to_plot.groupby(['member_gender','user_type']).duration_minutes.count().sort_values(ascending= False).head(10)
relation4

### Research Question 25 (How does the number of trips relate to bike_share_for_all_trip and user type?)


In [None]:
relation5 = df_to_plot.groupby(['bike_share_for_all_trip','user_type']).duration_minutes.count().sort_values(ascending= False).head(10)
relation5

### Research Question 26 (How does the number of trips relate to trip duration?)


In [None]:
relation5 = df_to_plot.groupby(['duration_minutes']).duration_minutes.count().sort_values(ascending= False).head(10)
relation5

### Research Question 27 (How does trip duration relate to user age?)


In [None]:
relation6 = df_to_plot.groupby(['user_age']).duration_minutes.mean().sort_values(ascending= False).head(10)
relation6

<a id='Explanatory'></a>
## Explanatory

### Finding 1: Most trips occur on Thursday, which accounts for 19.2% of all trips.

In [None]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,8  # set before plotting so it applies to this figure

base_color = sb.color_palette()[0]
sb.countplot(data=df_to_plot, x='day', color=base_color, order= most_trips_day.index);
n_trips = df_to_plot['day'].value_counts().sum()

# annotate each bar with its share of all trips
locs, labels = plt.xticks(rotation=90);
for loc, label in zip(locs, labels):
    count = most_trips_day[label.get_text()]
    pct_string = '{:0.1f}%'.format(100*count/n_trips)
    plt.text(loc, count+2, pct_string, ha = 'center', color = 'black')
plt.xlabel('Day');
plt.ylim((10000,40000));

plt.title('No. of trips for each day');
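The percentages on the bars are each day's count divided by the total number of trips. A minimal sketch of the same calculation without a plot is given below, using `value_counts(normalize=True)` on a toy series of day names.

```python
import pandas as pd

# Toy day labels (made up): two Thursday rides out of four.
days = pd.Series(["Thursday", "Thursday", "Tuesday", "Sunday"])

# normalize=True returns fractions of the total; scale them to percent.
share = days.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Applied to `df_to_plot['day']`, this would list the share behind each bar label.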

### Finding 2: Bike use peaks during the morning commute (7:00, 8:00, 9:00) and the evening commute (16:00, 17:00, 18:00); 8 a.m. and 5 p.m. are the busiest hours.


In [None]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,7  # set before plotting so it applies to this figure

base_color = sb.color_palette()[0]
hr_order = most_trips_hr.index
n_trips = df_to_plot['hour'].value_counts().sum()
sb.countplot(data=df_to_plot, y='hour', color=base_color, order=hr_order);

# label each bar with its percentage of all trips
for i in range (most_trips_hr.shape[0]):
    count = most_trips_hr.iloc[i]
    pct_string = '{:0.1f}'.format(100*count/n_trips)
    plt.text(count+1, i, pct_string, va='center')

ticks = np.arange(0,23000,1000)
labels = ['{}'.format(v) for v in ticks]
plt.xticks(ticks, labels,rotation=45);
plt.ylabel('Hour');
plt.title('No. of trips in day hours');

### Finding 3: Unlike weekdays, on weekends bike use is highest in the middle of the day, from 11 a.m. to 4 p.m.


In [None]:
plt.figure(figsize = [20, 5])

plt.subplot(1, 2, 1) 
base_color = sb.color_palette()[0]
hr_order1 = most_trips_hr1.index
n_trips1 = df_to_plot_weekdays['hour'].value_counts().sum()
max_hr1 = most_trips_hr1.iloc[0]
max_prop = max_hr1 / n_trips1
tick_props = np.arange(0, max_prop, 0.02)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
sb.countplot(data=df_to_plot_weekdays, y='hour', color=base_color, order=hr_order1);
for i in range (most_trips_hr1.shape[0]):
    count = most_trips_hr1.iloc[i]
    pct_string = '{:0.1f}'.format(100*count/n_trips1)
    plt.text(count+1, i, pct_string, va='center')
    
ticks = np.arange(0,23000,500)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);
plt.ylabel('Hour');
plt.xlim((9000,21000));
plt.title('No. of trips in weekdays hours');

from matplotlib import rcParams
rcParams['figure.figsize'] = 12,7


plt.subplot(1, 2, 2) 
base_color = sb.color_palette()[0]
hr_order2 = most_trips_hr2.index
n_trips2 = df_to_plot_weekends['hour'].value_counts().sum()
max_hr2 = most_trips_hr2.iloc[0]
max_prop = max_hr2 / n_trips2
tick_props = np.arange(0, max_prop, 0.02)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
sb.countplot(data=df_to_plot_weekends, y='hour', color=base_color, order=hr_order2);
for i in range (most_trips_hr2.shape[0]):
    count = most_trips_hr2.iloc[i]
    pct_string = '{:0.1f}'.format(100*count/n_trips2)
    plt.text(count+1, i, pct_string, va='center')
    
ticks = np.arange(0,3000,50)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);
plt.ylabel('Hour');
plt.xlim((2400,3000));
plt.title('No. of trips in weekends hours');

from matplotlib import rcParams
rcParams['figure.figsize'] = 12,7

### Finding 4: Subscribers make up the largest share of bike users, at 89.2%.

In [None]:
sorted_counts = df_to_plot['user_type'].value_counts()
explode = (0, 0.1) 

plt.pie(sorted_counts, labels = sorted_counts.index, explode=explode, autopct='%1.1f%%', startangle = 90, counterclock = False);

plt.axis('square');
plt.legend();
plt.title('Proportion of users');

### Finding 5: On weekends the share of customers almost doubles, to 18.2% from 9.3% on weekdays.

In [None]:
plt.subplot(1, 2, 1) 
sorted_counts1 = df_to_plot_weekdays['user_type'].value_counts()
explode = (0, 0.1) 

plt.pie(sorted_counts1, labels = sorted_counts1.index, explode=explode, autopct='%1.1f%%', startangle = 90, counterclock = False);

plt.axis('square');
plt.legend();
plt.title('Users in weekdays');

plt.subplot(1, 2, 2) 
sorted_counts2 = df_to_plot_weekends['user_type'].value_counts()
explode = (0, 0.1) 

plt.pie(sorted_counts2, labels = sorted_counts2.index, explode=explode, autopct='%1.1f%%', startangle = 90, counterclock = False);
plt.legend();
plt.title('Users in weekends');

plt.axis('square');
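
The two pie charts compare user-type shares between day types. As a rough sketch of the same comparison in table form, assuming made-up data and a hypothetical `day_type` column, a row-normalised cross-tabulation gives the percentages directly:

```python
import pandas as pd

# Toy trips, invented for illustration only.
trips = pd.DataFrame({
    'day_type':  ['weekday'] * 10 + ['weekend'] * 5,   # hypothetical column
    'user_type': ['Subscriber'] * 9 + ['Customer']
                 + ['Subscriber'] * 4 + ['Customer'],
})

# normalize='index' divides each row by its total, so every row sums to 100 %.
user_share = (
    pd.crosstab(trips['day_type'], trips['user_type'], normalize='index')
      .mul(100)
      .round(1)
)
print(user_share)
```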

### Finding 6: Most bike-share users are between 24 and 34 years old.

In [None]:
plt.figure(figsize = [12, 5]) 

bins = np.arange(0, df_to_plot['user_age'].max()+2, 2)
plt.hist(data = df_to_plot, x = 'user_age', bins = bins , rwidth = 0.7);
plt.xlabel('User Age');
plt.ylabel('Count');
plt.xlim((15,80));
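
The histogram shows where the ages concentrate; the share of riders inside a given age band can also be computed directly. A small sketch on invented ages (the real values live in `df_to_plot['user_age']`):

```python
import pandas as pd

# Invented ages for illustration only.
ages = pd.Series([22, 25, 27, 29, 31, 33, 34, 40, 52, 61], name='user_age')

# between() is inclusive on both ends by default.
in_band = ages.between(24, 34)

# The mean of a boolean mask is the fraction of True values.
share_in_band = in_band.mean() * 100
print('{:0.1f}% of riders are aged 24 to 34'.format(share_in_band))
```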

### Finding 7: Unlike males and females, the trip count for riders of gender "Other" follows a different distribution over age.

In [None]:
bins = np.arange(0, df_to_plot['user_age'].max()+2, 2)

# FacetGrid opens its own figure, so no plt.figure() call is needed here
g = sb.FacetGrid(data = df_to_plot, col = 'member_gender',sharey=False);
g.map(plt.hist, "user_age", bins = bins );

# Set the x-limits on every facet of the grid
g.set(xlim=(15, 80));

### Finding 8: Trip counts for subscribers and customers peak in almost the same age range.

In [None]:
bins = np.arange(0, df_to_plot['user_age'].max()+2, 2);

# FacetGrid opens its own figure, so no plt.figure() call is needed here
g = sb.FacetGrid(data = df_to_plot, col = 'user_type',sharey=False);
g.map(plt.hist, "user_age", bins = bins );
# Set the x-limits on every facet of the grid
g.set(xlim=(15, 80));

### Finding 9: Relation between age and trip count, split by `bike_share_for_all_trip`.

In [None]:
bins = np.arange(0, df_to_plot['user_age'].max()+2, 2)

# FacetGrid opens its own figure, so no plt.figure() call is needed here
g = sb.FacetGrid(data = df_to_plot, col = 'bike_share_for_all_trip',sharey=False);
g.map(plt.hist, "user_age", bins = bins );

# Set the x-limits on every facet of the grid
g.set(xlim=(15, 80));

### Finding 10: The busiest start station is Market St at 10th St, with 3904 trips.

In [None]:
base_color = sb.color_palette()[0]

m_order = most_start_station.index
ax = sb.countplot(data=df_to_plot, y='start_station_name', color=base_color, order=m_order);

for index, value in enumerate(most_start_station):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(2000,4200,100)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);
plt.ylabel('Start Station');    
plt.xlim((2000,4200));
plt.title('No. of trips for top 10 popular start stations');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8

### Finding 11: The least used start station is 16th St Depot, with only 2 trips.

In [None]:
base_color = sb.color_palette()[0]

l_order = less_start_station.index
ax = sb.countplot(data=df_to_plot, y='start_station_name', color=base_color, order=l_order);

for index, value in enumerate(less_start_station):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(0,21,1)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);
plt.ylabel('Start Station');    
plt.xlim((0,21));
plt.title('No. of trips for the 10 least popular start stations');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8
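
The Series `most_start_station` and `less_start_station` used in the two cells above are built earlier in the notebook. As a hedged sketch only, one way such top and bottom rankings could be derived from station names is with `value_counts()`; the station names and list length below are invented for illustration:

```python
import pandas as pd

# Invented station names for illustration only.
stations = pd.Series(
    ['Market St'] * 4 + ['Berry St'] * 3 + ['Depot'] * 1,
    name='start_station_name',
)

# value_counts() sorts from most to least frequent.
counts = stations.value_counts()

n = 2  # the notebook plots 10 stations; 2 keeps the toy example small
most_start_station = counts.head(n)
less_start_station = counts.tail(n).sort_values()  # least used first

print(most_start_station)
print(less_start_station)
```

Passing the index of either Series as `order=` to `countplot` then draws the bars in that ranking.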

### Finding 12: The busiest end station is San Francisco Caltrain Station 2 (Townsend St at 4th St), with 4875 trips.

In [None]:
base_color = sb.color_palette()[0]

m_end_order = most_end_station.index
ax = sb.countplot(data=df_to_plot, y='end_station_name', color=base_color, order=m_end_order);

for index, value in enumerate(most_end_station):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(2000,5100,100)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);
plt.ylabel('End Station');    
plt.xlim((2100,5000));
plt.title('No. of trips for top 10 popular end stations');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8

### Finding 13: The least used end station is Willow St at Vine St, with only 5 trips.

In [None]:
base_color = sb.color_palette()[0]

l_s_order = less_end_station.index
ax = sb.countplot(data=df_to_plot, y='end_station_name', color=base_color, order=l_s_order);

for index, value in enumerate(less_end_station):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(0,21,1)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);

plt.ylabel('End Station');    
plt.xlim((3,21));
plt.title('No. of trips for the 10 least popular end stations');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8

### Finding 14: The most common start and end station pair is Berry St at 4th St (start) and San Francisco Ferry Building (Harry Bridges Plaza) (end), with 337 trips.

In [None]:
base_color = sb.color_palette()[0]

m_r_order = most_relatives.index
ax = sb.countplot(data=df_to_plot, y='start_end_station', color=base_color, order=m_r_order);

for index, value in enumerate(most_relatives):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(0,350,10)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=45);

plt.ylabel('Start and end Station');    
plt.xlim((230,350));
plt.title('No. of trips for the top 10 start and end station pairs');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8
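
The `start_end_station` column plotted above is created earlier in the notebook. A minimal sketch of how such a route label could be built and ranked, assuming the two station-name columns are simply joined with a separator (the `' -> '` separator and the toy station names are assumptions, not the notebook's actual code):

```python
import pandas as pd

# Invented trips for illustration only.
trips = pd.DataFrame({
    'start_station_name': ['A', 'A', 'A', 'B', 'A'],
    'end_station_name':   ['B', 'B', 'B', 'A', 'C'],
})

# One label per route; direction matters, so A -> B differs from B -> A.
trips['start_end_station'] = (
    trips['start_station_name'] + ' -> ' + trips['end_station_name']
)

# The most frequent routes, busiest first.
most_relatives = trips['start_end_station'].value_counts().head(10)
print(most_relatives)
```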

### Finding 15: The most used bike in the Ford Go-bike system is bike ID 4794, with 191 trips.

In [None]:
base_color = sb.color_palette()[0]

bikes_order = most_bikes.index
ax = sb.countplot(data=df_to_plot, y='bike_id', color=base_color, order=bikes_order);

for index, value in enumerate(most_bikes):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(0,210,2)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=90);
plt.ylabel('Bike ID');    
plt.xlim((168,193));
plt.title('No. of trips for top 10 used bikes ');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8

### Finding 16: For the most used bikes, the median trip duration is around 10 minutes, and typical trips last between 5 and 15 minutes.

In [None]:
plt.figure(figsize = [10, 5])
base_color = sb.color_palette()[0]
sb.boxplot(data=df_to_plot_most_used, x='bike_id', y='duration_minutes', color=base_color)
plt.ylim(0,30);
plt.ylabel('Duration minutes');    
plt.xlabel('Most bikes used IDs'); 
plt.title('Duration distribution for top 10 bikes used');

### Finding 17: Trips lasting 3 to 8 minutes are the most frequent in the bike-share system.

In [None]:
plt.figure(figsize = [12, 5]) 

bins = np.arange(0, df_to_plot['duration_minutes'].max()+1, 1)
plt.hist(data = df_to_plot, x = 'duration_minutes', bins = bins , rwidth = 0.7);
plt.xlabel('Duration minutes');
plt.ylabel('Count');
plt.xlim((1,40));

### Finding 18: Trip count by trip duration for each user gender.

In [None]:
bins = np.arange(0, df_to_plot['duration_minutes'].max()+1, 1)

# FacetGrid opens its own figure, so no plt.figure() call is needed here
g = sb.FacetGrid(data = df_to_plot, col = 'member_gender',sharey=False);
g.map(plt.hist, "duration_minutes", bins = bins );

# Set the x-limits on every facet of the grid
g.set(xlim=(1, 50));


### Finding 19: Trip counts for subscribers peak at shorter durations than for customers.

In [None]:
bins = np.arange(0, df_to_plot['duration_minutes'].max()+1, 1)

# FacetGrid opens its own figure, so no plt.figure() call is needed here
g = sb.FacetGrid(data = df_to_plot, col = 'user_type',sharey=False);
g.map(plt.hist, "duration_minutes", bins = bins );

# Set the x-limits on every facet of the grid
g.set(xlim=(1, 50));


### Finding 20: Relation between trip duration and trip count, split by `bike_share_for_all_trip`.

In [None]:
bins = np.arange(0, df_to_plot['duration_minutes'].max()+1, 1)

# FacetGrid opens its own figure, so no plt.figure() call is needed here
g = sb.FacetGrid(data = df_to_plot, col = 'bike_share_for_all_trip',sharey=False);
g.map(plt.hist, "duration_minutes", bins = bins );

# Set the x-limits on every facet of the grid
g.set(xlim=(1, 50));

### Finding 21: Among trips flagged as bike share for all, the most used bike is bike ID 3967, with 91 trips.

In [None]:
base_color = sb.color_palette()[0]

bikes_order1 = bikes_all_trip.index
ax = sb.countplot(data=df_to_plot_all_trip, y='bike_id', color=base_color, order=bikes_order1);

for index, value in enumerate(bikes_all_trip):
    plt.text(value, index,
             str(value))
            
ticks = np.arange(0,105,5)
labels = ['{}'.format(v) for v in ticks]    
plt.xticks(ticks, labels,rotation=90);
plt.ylabel('Bike ID');    
plt.xlim((55,100));
plt.title('No. of trips for the top 10 bikes on bike-share-for-all trips');

from matplotlib import rcParams
rcParams['figure.figsize'] = 15,8

### Finding 22: Trips not flagged as bike share for all last longer than trips that are.

In [None]:

plt.figure(figsize = [10, 5])
base_color = sb.color_palette()[0]
sb.boxplot(data=df_to_plot, x='bike_share_for_all_trip', y='duration_minutes', color=base_color)
plt.ylim(0,30);
plt.ylabel('Duration minutes');    
plt.xlabel('Is bike share for all trip ?'); 


### Finding 23: Trips not flagged as bike share for all account for 90.5% of all trips.

In [None]:
sorted_counts = df_to_plot['bike_share_for_all_trip'].value_counts()
explode = (0, 0.1) 

plt.pie(sorted_counts, labels = sorted_counts.index, explode=explode, autopct='%1.1f%%', startangle = 90, counterclock = False);

plt.axis('square');
plt.title('Is bike share for all trip ?'); 
plt.legend();

### Finding 24: Users on trips not flagged as bike share for all are older than users on trips that are.

In [None]:
base_color = sb.color_palette()[0]

sb.violinplot(data=df_to_plot, x='bike_share_for_all_trip', y='user_age', color=base_color, inner='quartile')

ticks = np.arange(0,105,5)
labels = ['{}'.format(v) for v in ticks]    
plt.yticks(ticks, labels,rotation=90);
plt.ylabel('User Age');    
plt.xlabel('Is bike share for all trip ?'); 

### Finding 25: Age distribution for each gender.

In [None]:
base_color = sb.color_palette()[0]

sb.violinplot(data=df_to_plot, x='member_gender', y='user_age', color=base_color, inner='quartile')

ticks = np.arange(0,105,5)
labels = ['{}'.format(v) for v in ticks]    
plt.yticks(ticks, labels,rotation=90);
plt.ylabel('User Age');    
plt.xlabel('Member gender');

### Finding 26: Customers take longer trips than subscribers.

In [None]:
plt.figure(figsize = [10, 5])
base_color = sb.color_palette()[0]
sb.boxplot(data=df_to_plot, x='user_type', y='duration_minutes', color=base_color)
plt.ylim(0,45);
plt.ylabel('Duration time');    
plt.xlabel('User type');
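
The box plot compares the duration quartiles of the two user types visually. The same quartiles can be read off as numbers; the sketch below uses invented durations rather than the real `df_to_plot` data:

```python
import pandas as pd

# Invented trips for illustration only.
trips = pd.DataFrame({
    'user_type': ['Subscriber'] * 5 + ['Customer'] * 5,
    'duration_minutes': [5, 7, 8, 10, 12, 9, 14, 16, 20, 45],
})

# Lower quartile, median and upper quartile per user type:
# the three values a box plot draws as the box.
summary = (
    trips.groupby('user_type')['duration_minutes']
         .quantile([0.25, 0.5, 0.75])
         .unstack()
)
print(summary)
```

Quartiles are less sensitive than the mean to a few very long trips, such as the 45-minute customer trip in the toy data.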

### Finding 27: Trip duration for each gender.

In [None]:
plt.figure(figsize = [10, 5])
base_color = sb.color_palette()[0]
sb.boxplot(data=df_to_plot, x='member_gender', y='duration_minutes', color=base_color)
plt.ylim(0,30);
plt.ylabel('Duration time');    
plt.xlabel('Member gender');

### Finding 28: Males use the Ford Go-bike system the most.


In [None]:
plt.pie(most_gender_dur, labels = most_gender_dur.index, autopct='%1.1f%%', startangle = 90, counterclock = False);

plt.axis('square');
plt.legend();

### Finding 29: Male subscribers use the Ford Go-bike system the most, with 119069 trips.


In [None]:
ct_counts = df_to_plot.groupby(['member_gender', 'user_type']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index = 'member_gender', columns = 'user_type', values = 'count')
sb.heatmap(ct_counts, annot = True, fmt = 'd');
plt.ylabel('Member gender');    
plt.xlabel('User type');
plt.title('No. of trips for combination of gender and user type');
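
The `groupby`, `reset_index` and `pivot` steps above build a table of counts. As an alternative sketch on invented data, `pd.crosstab` produces the same kind of table in one call and fills combinations that never occur with 0, which keeps the values integers for `fmt = 'd'`:

```python
import pandas as pd

# Invented trips for illustration only.
trips = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Female', 'Male', 'Other', 'Female'],
    'user_type': ['Subscriber', 'Subscriber', 'Subscriber',
                  'Customer', 'Subscriber', 'Customer'],
})

# Rows are genders, columns are user types, cells are trip counts.
ct_counts = pd.crosstab(trips['member_gender'], trips['user_type'])

# The (gender, user type) combination with the most trips.
top_pair = ct_counts.stack().idxmax()

print(ct_counts)
print(top_pair)
```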

### Finding 30:
### - No customers take trips flagged as bike share for all.
### - Subscribers on trips not flagged as bike share for all form the largest group, with 146185 trips.

In [None]:
ct_counts = df_to_plot.groupby(['user_type', 'bike_share_for_all_trip']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index = 'user_type', columns = 'bike_share_for_all_trip', values = 'count')
sb.heatmap(ct_counts, annot = True, fmt = 'd');
plt.ylabel('User type');    
plt.xlabel('bike share for all trip ?');
plt.title('No. of trips for combination of user type and is bike share all trip');

### Finding 31: Males on trips not flagged as bike share for all form the largest group, with 117510 trips.


In [None]:
ct_counts = df_to_plot.groupby(['member_gender', 'bike_share_for_all_trip']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index = 'member_gender', columns = 'bike_share_for_all_trip', values = 'count')
sb.heatmap(ct_counts, annot = True, fmt = 'd');
plt.ylabel('Member gender');    
plt.xlabel('bike share for all trip ?');
plt.title('No. of trips for combination of gender and is bike share all trip');

### Finding 32: Relation between trip duration and user age.


In [None]:
plt.figure(figsize = [18, 6])
bins_x = np.arange(0.6, 100+4,4)
bins_y = np.arange(12, 100+4, 4)

plt.hist2d(data = df_to_plot, x = 'duration_minutes', y = 'user_age',cmin=0.5, cmap='viridis_r',bins = [bins_x, bins_y])
plt.colorbar()
plt.xlabel('Duration')
plt.ylabel('User Age');
plt.xlim((1,40));
plt.title('The relation between trip Duration and the age of user');

### Finding 33: Riders of gender "Other" on trips not flagged as bike share for all have the longest mean trip duration, at 17 minutes.


In [None]:
# Select the column before aggregating so that non-numeric columns are not averaged
cat_means1 = df_to_plot.groupby(['member_gender', 'bike_share_for_all_trip'])['duration_minutes'].mean()
cat_means1 = cat_means1.reset_index(name = 'duration_minutes_avg')
cat_means1 = cat_means1.pivot(index = 'bike_share_for_all_trip', columns = 'member_gender',
                            values = 'duration_minutes_avg');
sb.heatmap(cat_means1, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'Mean(Duration minutes)'});
plt.title('Avg. trip duration by gender and bike_share_for_all_trip');

### Finding 34: Customers of gender "Other" have the longest mean trip duration, at 26.5 minutes.

In [None]:
# Select the column before aggregating so that non-numeric columns are not averaged
cat_means2 = df_to_plot.groupby(['member_gender', 'user_type'])['duration_minutes'].mean()
cat_means2 = cat_means2.reset_index(name = 'duration_minutes_avg')
cat_means2 = cat_means2.pivot(index = 'user_type', columns = 'member_gender',
                            values = 'duration_minutes_avg');
sb.heatmap(cat_means2, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'Mean(Duration minutes)'});
plt.title('Avg. trip duration by gender and user type');
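
The table of group means fed to the heatmap can also be built in a single step. A hedged sketch with `pivot_table` on invented durations, shown as an alternative rather than the notebook's own method:

```python
import pandas as pd

# Invented trips for illustration only.
trips = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Female', 'Female', 'Other', 'Other'],
    'user_type': ['Subscriber', 'Customer'] * 3,
    'duration_minutes': [10.0, 20.0, 12.0, 22.0, 14.0, 26.0],
})

# Mean duration for every (user type, gender) combination.
cat_means = trips.pivot_table(
    index='user_type', columns='member_gender',
    values='duration_minutes', aggfunc='mean',
)

# The combination with the longest mean trip.
largest = cat_means.stack().idxmax()

print(cat_means)
print(largest)
```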

### Finding 35: Riders of gender "Other" on trips flagged as bike share for all have the highest mean age, at 36.2 years.

In [None]:
# Select the column before aggregating so that non-numeric columns are not averaged
cat_means3 = df_to_plot.groupby(['member_gender', 'bike_share_for_all_trip'])['user_age'].mean()
cat_means3 = cat_means3.reset_index(name = 'user_age_avg')
cat_means3 = cat_means3.pivot(index = 'bike_share_for_all_trip', columns = 'member_gender',
                            values = 'user_age_avg');
sb.heatmap(cat_means3, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'Mean(User Age)'});
plt.title('Avg. age by gender and bike_share_for_all_trip');

### Finding 36: Subscribers of gender "Other" have the highest mean age, at 36 years.

In [None]:
# Select the column before aggregating so that non-numeric columns are not averaged
cat_means3 = df_to_plot.groupby(['member_gender', 'user_type'])['user_age'].mean()
cat_means3 = cat_means3.reset_index(name = 'user_age_avg')
cat_means3 = cat_means3.pivot(index = 'user_type', columns = 'member_gender',
                            values = 'user_age_avg');
sb.heatmap(cat_means3, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'Mean(User Age)'});
plt.title('Avg. age by gender and user type');

<a id='conclusions'></a>

## Conclusions

#### After loading the data from the Ford Go-bike system CSV file, I assessed it visually and programmatically and found some issues that had to be handled before starting the analysis, for instance:

- Convert `duration_sec` to minutes to make it more readable.
- Convert the start and end times to datetime.
- Convert all IDs to object.
- Add a new column for the start and end station locations.
- Convert user type to category.
- Convert birth year to object.
- Convert gender to category.
- Convert `bike_share_for_all_trip` to category.
- Set birth years earlier than 1919, and the corresponding gender, to NaN.
- Rearrange the columns.
- Add a new column for the user age.
- I did not drop or change the rows with NaN start and end station names and IDs, because all their other data is present.
- Gender and birth year are missing for about 8265 rides in the source CSV file, but I decided to accept this.
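
A few of the cleaning steps listed above can be sketched as follows. This is an illustration on a three-row toy frame, not the notebook's actual wrangling code; in particular, computing the age as 2019 minus the birth year is an assumption based on the year the data was collected:

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names; values are partly invented.
raw = pd.DataFrame({
    'duration_sec': [600, 90, 1800],
    'start_time': ['2019-02-28 17:32:10.1450',
                   '2019-02-28 18:53:21.7890',
                   '2019-02-28 12:13:13.2180'],
    'user_type': ['Customer', 'Subscriber', 'Subscriber'],
    'member_birth_year': [1984.0, 1900.0, np.nan],
    'member_gender': ['Male', 'Female', None],
})

clean = raw.copy()

# Duration in minutes is easier to read than seconds.
clean['duration_minutes'] = clean['duration_sec'] / 60

# Parse timestamps and convert the user type to a category.
clean['start_time'] = pd.to_datetime(clean['start_time'])
clean['user_type'] = clean['user_type'].astype('category')

# Birth years before 1919 are treated as implausible:
# blank both the year and the corresponding gender.
too_old = clean['member_birth_year'] < 1919
clean.loc[too_old, 'member_birth_year'] = np.nan
clean.loc[too_old, 'member_gender'] = np.nan

# Assumed definition of age: data year minus birth year.
clean['user_age'] = 2019 - clean['member_birth_year']

print(clean)
```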


#### After cleaning the data I posed questions and analyzed the data. Here are the most important findings of the explanatory analysis:




- 1: Most trips occur on Thursday, with 19.2% of all trips.
- 2: Bike use peaks in the hours when people go to work (7:00, 8:00, 9:00) and leave work (16:00, 17:00, 18:00); 8:00 a.m. and 5:00 p.m. are the peak times for bike use.
- 3: Unlike weekdays, on weekends bike use peaks in the middle of the day, from 11:00 a.m. to 4:00 p.m.
- 4: Subscribers make up the largest share of bike users, at 89.2% of users, and on weekends the share of customers almost doubles.
- 5: Most bike-share users are between 24 and 34 years old.
- 6: The busiest start station is Market St at 10th St, with 3904 trips.
- 7: The least used start station is 16th St Depot, with only 2 trips.
- 8: The busiest end station is San Francisco Caltrain Station 2 (Townsend St at 4th St), with 4875 trips.
- 9: The least used end station is Willow St at Vine St, with only 5 trips.
- 10: The most common start and end station pair is Berry St at 4th St (start) and San Francisco Ferry Building (Harry Bridges Plaza) (end), with 337 trips.
- 11: The most used bike in the Ford Go-bike system is bike ID 4794, with 191 trips.
- 12: Trips lasting 3 to 8 minutes are the most frequent in the bike-share system.
- 13: Trip counts for subscribers peak at shorter durations than for customers.
- 14: Among trips flagged as bike share for all, the most used bike is bike ID 3967, with 91 trips.
- 15: Trips not flagged as bike share for all last longer than trips that are.
- 16: Trips not flagged as bike share for all account for 90.5% of all trips.
- 17: Users on trips not flagged as bike share for all are older than users on trips that are.
- 18: Customers take longer trips than subscribers.
- 19: Males use the Ford Go-bike system the most.
- 20: Male subscribers use the Ford Go-bike system the most, with 119069 trips.
- 21: No customers take trips flagged as bike share for all.
- 22: Subscribers on trips not flagged as bike share for all form the largest group, with 146185 trips.
- 23: Males on trips not flagged as bike share for all form the largest group, with 117510 trips.
- 24: Riders of gender "Other" on trips not flagged as bike share for all have the longest mean trip duration, at 17 minutes.
- 25: Customers of gender "Other" have the longest mean trip duration, at 26.5 minutes.
- 26: Riders of gender "Other" on trips flagged as bike share for all have the highest mean age, at 36.2 years.
- 27: Subscribers of gender "Other" have the highest mean age, at 36 years.