# Case Study: Cycle Sharing Scheme

##### Description:

-----------
    
The cycle sharing scheme gives the people of the city a convenient, cheap, and green way to commute. The service has 500 bikes at 50 stations across Seattle. Each station has a dock locking system (where all bikes are parked), a kiosk (where customers can get a membership key or pay for a trip), and a helmet rental service. A rider can purchase either a membership key or a short-term pass. A membership key, which can be obtained from a kiosk, provides an annual membership. Advantages for members include quick retrieval of bikes and unlimited 45-minute rentals. Short-term passes offer access to bikes for a 24-hour or 3-day period. Riders can pick up and return bikes at any of the 50 stations citywide.

------------

#### Data Dictionary
![title](asset/dictionary.png)

### Importing Packages

In [None]:
%matplotlib inline
import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
import seaborn

#### Reading Input File

In [None]:
data = pd.read_csv("data/trip.csv")

# EDA

#### Exploring data

##### Major types of variables
![title](asset/vartype.png)

In [None]:
len(data)

In [None]:
data.head()

In [None]:
data.info()

--------------------------
![title](asset/vartype1.png)

In [None]:
data.describe()

In [None]:
data.usertype.value_counts()

In [None]:
data.sort_values(by='starttime', inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
# starttime is still a string column here, so the sort above is lexicographic, not chronological
print('Date range of dataset: {} - {}'.format(data.loc[0, 'starttime'], data.loc[len(data)-1, 'stoptime']))

#### Data Transformation

In [None]:
data.starttime = pd.to_datetime(data.starttime)
data.stoptime = pd.to_datetime(data.stoptime)

In [None]:
data.sort_values(by='starttime', inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
print('Date range of dataset: {} - {}'.format(data.loc[0, 'starttime'], data.loc[len(data)-1, 'stoptime']))

##### Generations in the workplace
| Generation | Description (ages as of 2018) |
|--------|-------------|
| The Silent Generation | Born 1928-1945 (73-90 years old) |
| Baby Boomers | Born 1946-1964 (54-72 years old) |
| Generation X | Born 1965-1980 (38-53 years old) |
| Millennials | Born 1981-1996 (22-37 years old) |
| Post-Millennials | Born 1997-Present (0-21 years old) |

###### Exercise: Create a generation column using the above criteria

In [None]:
## Your answer here:


#### Plotting the distribution for the category variables

In [None]:
### Plotting the Distribution of User Types
groupby_user = data.groupby('usertype').size()
groupby_user.plot.bar(title = 'Distribution of User Types');

In [None]:
### Plotting the Distribution of User Gender
groupby_gender = data.groupby('gender').size()
groupby_gender.plot.bar(title = 'Distribution of User Gender');

In [None]:
### Plotting the Distribution of Birth Years
data = data.sort_values(by='birthyear')
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution of birth years',figsize = (15,4));

In [None]:
data_mil = data[(data['birthyear'] >= 1981) & (data['birthyear']<=1996)]
groupby_mil = data_mil.groupby('usertype').size()
groupby_mil.plot.bar(title = 'Distribution of user types among Millennials');

### Multivariate Analysis

In [None]:
data.gender.value_counts()

In [None]:
groupby_birthyear_gender = data.groupby(['birthyear', 'gender'])['birthyear'].count().unstack('gender').fillna(0)
groupby_birthyear_gender[['Male','Female','Other']].plot.bar(title =
'Distribution of birth years by Gender', stacked=True, figsize = (15,4));

##### Plotting the Distribution of Birth Years by User Types

In [None]:
groupby_birthyear_user = data.groupby(['birthyear', 'usertype'])['birthyear'].count().unstack('usertype').fillna(0)
groupby_birthyear_user[['Member']].plot.bar(title = 'Distribution of birth years by Usertype', stacked=True, figsize = (15,4));

In [None]:
data[data['usertype']=='Short-Term Pass Holder']['birthyear'].isnull().values.all()

In [None]:
data[data['usertype']=='Short-Term Pass Holder']['gender'].isnull().values.all()

In [None]:
data['starttime_date'] = pd.DatetimeIndex(data.starttime).date

---------------------------------------------------
## NORMAL DISTRIBUTION
![title](asset/nd.png)

In [None]:
print("Mean of trip duration: {}".format(data.tripduration.mean()))
print("Median of trip duration: {}".format(data.tripduration.median()))
# mode() returns a Series because a column can have more than one mode
print("Mode of trip duration: {}".format(data.tripduration.mode()))
print("Mode of station originating from: {}".format(data.from_station_name.mode()))

In [None]:
data.from_station_name.value_counts()

In [None]:
data.tripduration.plot.hist(bins = 50, title = 'Frequency Distribution of Trip Duration');

##### Box plot (box-and-whisker plot)

![title](asset/whisker.png)

##### With Outliers

![title](asset/whisker1.png)


In [None]:
data.boxplot(column=['tripduration']);

#### Outliers - Detecting using IQR

![title](asset/percentile.png)

In [None]:
q75, q25 = np.percentile(data.tripduration, [75,25])

In [None]:
iqr = q75 - q25

In [None]:
upper_whisker = q75 + 1.5 * iqr
lower_whisker = q25 - 1.5 * iqr

In [None]:
data.tripduration.describe()

In [None]:
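def check(x, ul, ll):
    # Return x when it lies within [ll, ul]; otherwise return None,
    # which pandas treats as missing, so outliers can be found with isnull()
    if ll <= x <= ul:
        return x
    return None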

#### Percentage of outliers

In [None]:
# True for every trip whose duration falls outside the whiskers
outlier_mask = data.tripduration.apply(check, args=(upper_whisker, lower_whisker)).isnull()
print("Percentage of outliers in tripduration:", outlier_mask.sum() / len(data) * 100)
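
The same IQR rule can also be written without a helper function. The following is a minimal sketch on a small, made-up sample (the `durations` values are hypothetical and not taken from `trip.csv`); it uses `Series.between`, which includes both endpoints by default, to build the outlier mask.

```python
import numpy as np
import pandas as pd

# Hypothetical trip durations in seconds; the last two are deliberately extreme
durations = pd.Series([300, 420, 480, 510, 540, 600, 660, 720, 780, 900, 5400, 28000])

# Quartiles, interquartile range, and the whisker limits
q25, q75 = np.percentile(durations, [25, 75])
iqr = q75 - q25
lower_whisker = q25 - 1.5 * iqr
upper_whisker = q75 + 1.5 * iqr

# Boolean mask: True where a value falls outside the whiskers
is_outlier = ~durations.between(lower_whisker, upper_whisker)

# The mean of a boolean mask is the fraction of True values
outlier_pct = is_outlier.mean() * 100
print("Whiskers: [{:.2f}, {:.2f}]".format(lower_whisker, upper_whisker))
print("Percentage of outliers: {:.2f}".format(outlier_pct))

# Mean of the values that are not outliers
mean_without_outliers = durations[~is_outlier].mean()
print("Mean without outliers: {:.1f}".format(mean_without_outliers))
```

A boolean mask avoids calling a Python function once per row, which tends to matter on a dataset with many trips.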

In [None]:
mean_trip_duration = data[data.tripduration.apply(check, args = (upper_whisker, lower_whisker)).notnull()]['tripduration'].mean()
print (mean_trip_duration)

### Outliers Treatment

In [None]:
def transform_tripduration(x):
    # Replace values above the upper whisker with the mean of the non-outlier trips
    if x > upper_whisker:
        return mean_trip_duration
    return x

data['tripduration_mean'] = data['tripduration'].apply(transform_tripduration)
data['tripduration_mean'].plot.hist(bins=100, title='Frequency distribution of mean transformed Trip duration');

#### Skewness vs. Symmetric distribution

![title](asset/skew.png)

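### Measures of Center and Spread
Mean - arithmetic average of the data points

Median - middle value of the sorted data points

Mode - most frequently occurring value

Variance - represents variability of data points about the mean (the average squared deviation from the mean)

Standard Deviation - square root of the variance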
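
As an illustration of these measures, here is a minimal sketch on a small, made-up sample (the `sample` values are hypothetical and not taken from `trip.csv`). Note that pandas computes the sample variance and standard deviation (`ddof=1`) by default, whereas NumPy defaults to the population versions (`ddof=0`).

```python
import pandas as pd

# Hypothetical sample, not taken from trip.csv
sample = pd.Series([2, 4, 4, 5, 7, 9, 11])

mean = sample.mean()        # arithmetic average
median = sample.median()    # middle value of the sorted data
mode = sample.mode()        # a Series, because a sample can have several modes
variance = sample.var()     # sample variance (ddof=1 by default in pandas)
std_dev = sample.std()      # square root of the sample variance

print("Mean:", mean)
print("Median:", median)
print("Mode(s):", list(mode))
print("Variance:", variance)
print("Standard deviation:", std_dev)
```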


### Correlation

1) Pearson R


2) Kendall Rank


3) Spearman Rank
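
To show how the three coefficients can differ, here is a minimal sketch using `scipy.stats` on made-up data (the arrays `x` and `y` are hypothetical). The relationship is perfectly monotonic but not linear, so the two rank-based coefficients should equal 1 while Pearson's r stays below 1.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y grows monotonically with x, but not linearly
x = np.array([1, 2, 3, 4, 5])
y = x ** 3

# Each function returns (coefficient, p-value); only the coefficient is kept here
pearson_r = stats.pearsonr(x, y)[0]      # strength of the linear relationship
spearman_rho = stats.spearmanr(x, y)[0]  # Pearson correlation of the ranks
kendall_tau = stats.kendalltau(x, y)[0]  # based on concordant and discordant pairs

print("Pearson r:    {:.4f}".format(pearson_r))
print("Spearman rho: {:.4f}".format(spearman_rho))
print("Kendall tau:  {:.4f}".format(kendall_tau))
```

On a DataFrame, the same coefficients are available through `DataFrame.corr(method=...)` with `'pearson'`, `'kendall'`, or `'spearman'`.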

In [None]:
data['starttime_year'] = pd.DatetimeIndex(data.starttime).year

In [None]:
data['age'] = data['starttime_year'] - data['birthyear']

In [None]:
data.age.plot.hist(bins=100)

In [None]:
data = data.dropna()
seaborn.pairplot(data, vars=['age', 'tripduration'], kind='reg')
plt.show()

##### Correlation Directions

---------------------

![title](asset/corr1.png)

-------------------
Reference table

![title](asset/corr2.png)

In [None]:
correlations = data[['tripduration','age']].corr(method='pearson')
print(correlations)

### Log Transformation to reduce skewness

In [None]:
plt.hist(data.age);

In [None]:
plt.hist(np.log10(data.age));
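
Besides comparing the two histograms visually, the effect of the transformation can be checked numerically. The following is a minimal sketch on a small, made-up right-skewed sample (the `values` are hypothetical, not the `age` column); it compares `Series.skew()` before and after taking `np.log10`. The logarithm is only defined for strictly positive values, so zeros or negatives would need to be handled first.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are very large
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40, 120])

skew_raw = values.skew()            # skewness of the original values
skew_log = np.log10(values).skew()  # skewness after the log transformation

print("Skewness before log transform: {:.3f}".format(skew_raw))
print("Skewness after log transform:  {:.3f}".format(skew_log))
```

A skewness closer to 0 indicates a more symmetric distribution; for this sample the log transformation reduces the skewness but does not remove it.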