In [None]:
# ***Cyclistic | Divvy Bike Sharing Data Analysis using Python***
By Mangesh Tayade

## Introduction
This analysis was prepared by Mangesh Tayade as part of the Google Data Analytics Certification on Coursera, with the Divvy bike-sharing company as the source of data. Throughout this project, you will see real-world data kindly provided by Motivate International Inc. for public use under its license.

In the Data Analytics Certification, the dataset is attributed to a fictional bike-sharing company called Cyclistic, so let's keep that name; you will see it at different stages of the analysis. You can follow along through all the steps of data cleaning and processing, but if you are interested only in the conclusions drawn from the data, you can find them at the end of the report.

## About Company

Divvy is Chicagoland’s bike share system across Chicago and Evanston. Divvy provides residents and visitors with a convenient, fun and affordable transportation option for getting around and exploring Chicago.

Divvy, like other bike share systems, consists of a fleet of specially-designed, sturdy and durable bikes that are locked into a network of docking stations throughout the region. The bikes can be unlocked from one station and returned to any other station in the system. People use bike share to explore Chicago, commute to work or school, run errands, get to appointments or social engagements, and more.

Divvy is available for use 24 hours/day, 7 days/week, 365 days/year, and riders have access to all bikes and stations across the system.

## Data sources used
[Divvy Data (July 2021 to June 2022)](https://divvy-tripdata.s3.amazonaws.com/index.html) - The data has been processed to remove trips taken by staff as they service and inspect the system, and any trips shorter than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it was secure).

## Business Task
* How do annual members and casual riders use Cyclistic bikes differently?
* Why would casual riders buy Cyclistic annual memberships?
* How can Cyclistic use digital media to influence casual riders to become members?

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Import the classes directly; a separate "import datetime" would be shadowed by this line
from datetime import datetime, timedelta
import scipy.stats
import pandas_profiling
from pandas_profiling import ProfileReport
%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
sns.set_style('dark')
sns.set(font_scale=1.2)
plt.rc('axes', titlesize=12)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
import warnings
warnings.filterwarnings('ignore')
# Use Folium library to plot values on a map.
#import folium
# Use Feature-Engine library
#import feature_engine
#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce
#from feature_engine.discretisation import EqualWidthDiscretiser, EqualFrequencyDiscretiser, DecisionTreeDis
#from feature_engine.encoding import OrdinalEncoder
pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)
random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

## Exploratory Data Analysis


In [10]:
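# Read every CSV in the input folder and combine them into one DataFrame,
# so that df holds all files rather than only the last one read
import os
data_dir = "../input/bikeshare/db"
frames = []
for file in sorted(os.listdir(data_dir)):
    if file.endswith(".csv"):
        frames.append(pd.read_csv(os.path.join(data_dir, file), parse_dates=['started_at','ended_at']))
df = pd.concat(frames, ignore_index=True)
# Overwrite instead of appending, so re-running the cell does not duplicate rows
df.to_csv("merged.csv", index=False)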


In [11]:
df

In [12]:

df.info()

In [13]:

df.describe(include='all')

In [14]:
df.columns

In [15]:
df["time_diff"] = df['ended_at'] - df['started_at']

In [16]:
#Convert to minutes
df["time_diff"] = df["time_diff"]/np.timedelta64(1,'m') 
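A quick sanity check on the computed durations can be useful at this point. The snippet below is a minimal sketch on a toy DataFrame (the rows and the `suspect` column name are made up for illustration, not taken from the Divvy files). It computes the duration in minutes with `.dt.total_seconds()` and flags rides shorter than one minute, which also catches negative durations where the end time is earlier than the start time.

```python
import pandas as pd

# Toy rides (hypothetical values, not taken from the Divvy files)
rides = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2022-01-01 08:00:00", "2022-01-01 09:00:00", "2022-01-01 10:00:00",
    ]),
    "ended_at": pd.to_datetime([
        "2022-01-01 08:30:00", "2022-01-01 09:00:30", "2022-01-01 09:55:00",
    ]),
})

# Ride duration in minutes
rides["time_diff"] = (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60

# Flag rides shorter than one minute, including negative durations
rides["suspect"] = rides["time_diff"] < 1

print(rides[["time_diff", "suspect"]])
```

Whether to drop the flagged rows or only report them depends on the analysis; this sketch only marks them.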

In [17]:
df.head()

In [18]:
# Day of the week as an integer, where Monday is 0 and Sunday is 6
df["weekday"] = df["started_at"].dt.weekday

In [19]:
df.head()
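Integer weekdays are compact but harder to read on plots. As a small sketch on made-up dates (the variable names here are my own), pandas can return the day name next to the integer code:

```python
import pandas as pd

# Three hypothetical start dates: a Monday, a Saturday and a Sunday
starts = pd.Series(pd.to_datetime(["2022-01-03", "2022-01-08", "2022-01-09"]))

weekday_num = starts.dt.weekday      # Monday is 0, Sunday is 6
weekday_name = starts.dt.day_name()  # full English day names

print(pd.DataFrame({"weekday": weekday_num, "weekday_name": weekday_name}))
```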

In [24]:
df.info()

## Save to CSV 

In [26]:
df.to_csv("bike.csv", index=False)

## Groupby Function

In [27]:
df.groupby("start_station_name")["ride_id"].count().sort_values()

In [28]:
df.groupby("end_station_name")["ride_id"].count().sort_values()

In [29]:
df.groupby("rideable_type")["ride_id"].count().sort_values()


In [30]:
df.groupby("member_casual")["ride_id"].count().sort_values()

In [31]:
df.groupby("start_station_name")["time_diff"].mean().sort_values()

In [32]:
df.groupby("rideable_type")["time_diff"].mean().sort_values()

In [33]:
df.groupby("member_casual")["time_diff"].mean().sort_values()
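The separate count and mean calls above can also be combined into one table. This is only a sketch on toy data (the numbers and the output column names `n_rides`, `mean_minutes` and `median_minutes` are assumptions for illustration). The median is included because a few very long rides can pull the mean up.

```python
import pandas as pd

# Toy data (hypothetical durations in minutes)
rides = pd.DataFrame({
    "member_casual": ["member", "member", "casual", "casual", "casual"],
    "time_diff": [10.0, 14.0, 20.0, 30.0, 100.0],
})

# Count, mean and median of ride duration per rider type in one table
summary = rides.groupby("member_casual")["time_diff"].agg(
    n_rides="count",
    mean_minutes="mean",
    median_minutes="median",
)

print(summary)
```

In this toy example the casual mean (50 minutes) is well above the casual median (30 minutes) because of the single 100-minute ride.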

## Pandas-Profiling Reports

In [35]:
profile = ProfileReport(df=df, title='Cyclistic Divvy Bike Report', minimal=True)

In [36]:
profile.to_notebook_iframe()

In [37]:
profile.to_file("bike_report.html")

## Drop unwanted features

In [38]:
df.columns

In [39]:
df.drop(['ride_id','start_station_id','end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng'],axis=1,inplace=True)

In [40]:
df.head()

## Treat Missing Values

In [41]:
df.isnull().sum()

In [42]:
df['start_station_name'] = df['start_station_name'].replace(np.nan,"Missing")

In [43]:
df['end_station_name'] = df['end_station_name'].replace(np.nan,"Missing")

In [44]:
df.isnull().sum()
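Before filling in the gaps, it can help to see what share of each column is missing. Here is a minimal sketch on toy data (the station names are made up) that computes the missing share per column and fills both station columns in one step with `fillna`:

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical station names)
rides = pd.DataFrame({
    "start_station_name": ["A St", np.nan, "B St", np.nan],
    "end_station_name": ["B St", "A St", np.nan, "C St"],
})

# Share of missing values per column (0.5 means half the rows are missing)
missing_share = rides.isnull().mean()

# Fill both station columns at once, leaving the original frame untouched
station_cols = ["start_station_name", "end_station_name"]
filled = rides.copy()
filled[station_cols] = filled[station_cols].fillna("Missing")

print(missing_share)
print(filled)
```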

In [45]:
df.describe()

## Data Visualization

In [46]:
df.hist(bins=50, figsize=(20,5))
plt.suptitle('Feature Distribution', x=0.5, y=1.02, ha='center', fontsize=20)
plt.tight_layout()
plt.show()

In [47]:
df.boxplot(figsize=(20,5))
plt.suptitle('BoxPlot', x=0.5, y=1.02, ha='center', fontsize=20)
plt.tight_layout()
plt.show()

In [59]:
# catplot creates its own figure, so a separate plt.figure() call is not needed
g = sns.catplot(x='rideable_type', hue='member_casual', col='weekday', col_wrap=4,
                kind='count', data=df,
                height=6, aspect=1)

g.set_xlabels("Rideable Type")
g.set_ylabels("Number of Rides")
g.set_xticklabels(rotation=90)
# Title the whole grid rather than only the last facet
g.fig.suptitle("Ride Count per Weekday", size=20, y=1.02)
plt.show()

In [49]:
# Ride duration per weekday, split by rider type (outliers hidden)
sns.catplot(x="time_diff", y="weekday", hue="member_casual",
            data=df, kind="box", orient="h", dodge=True,
            linewidth=3, showfliers=False,
            height=20, aspect=1)

plt.xlabel("Time Difference in Minutes", size=20)
plt.ylabel("Weekday", size=20)
plt.title("Boxplot of ride time for each weekday", size=20)
plt.show()

In [50]:
sns.relplot(x="time_diff", y="weekday", data=df, height = 6, aspect = 2)

plt.xlabel("Time Difference in Minutes", size=15)
plt.ylabel("WeekDay", size=15)
plt.title("Relationship plot", size=15)
plt.show()

### Time-Series Analysis

In [51]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df.started_at,y=df.time_diff,data=df, estimator=None)
plt.title("Minutes Spent by Date", fontsize=20)
plt.xlabel("Start Date", fontsize=20)
plt.ylabel("Minutes Spent", fontsize=20)
plt.show()
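Plotting every single ride can be slow and hard to read with millions of rows. One alternative, shown here only as a sketch on toy data (the values are made up), is to aggregate to a daily mean with `resample` before plotting:

```python
import pandas as pd

# Toy rides (hypothetical start times and durations in minutes)
rides = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2022-01-01 08:00", "2022-01-01 18:00", "2022-01-02 09:00",
    ]),
    "time_diff": [10.0, 20.0, 30.0],
})

# Mean ride duration per calendar day
daily = rides.set_index("started_at")["time_diff"].resample("D").mean()

print(daily)
```

The resulting series has one value per day, so it could be passed to `sns.lineplot` or plotted with `daily.plot()`.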

In [52]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df.started_at,y=df.weekday, data=df, estimator=None)
plt.title("Weekday by Start Date", fontsize=20)
plt.xlabel("Start Date", fontsize=20)
plt.ylabel("Weekday (0 = Monday)", fontsize=20)
plt.show()

### Pairplots

In [53]:
plt.figure(figsize=(20,20))
sns.pairplot(df.sample(500))
plt.suptitle('Pairplots of features', x=0.5, y=1.02, ha='center', fontsize=10)
plt.show()

### Correlation

In [54]:
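# Correlate numeric columns only; text and datetime columns are excluded
df.select_dtypes(include='number').corr()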

In [55]:
plt.figure(figsize=(9,5))
sns.heatmap(df.select_dtypes(include='number').corr(),cmap="coolwarm",annot=True,fmt='.2f',linewidths=2)
plt.title("Correlation Heatmap", fontsize=20)
plt.show()

## Conclusions
Twelve datasets with about 8 million trips were checked and cleaned. It took time to find the missing station names, which would be helpful in a geotargeted ad campaign for casual users. The whole combined dataset for 2021 and 2022 is not included here.

We had to check the current pricing plans on the company's website to see what could be offered to casual riders to turn them into members. The plots show the patterns of bike use for the different groups, so we had to segment users for our analysis.

Let's start with the Day Pass users. If they ride once a month on a day off, or are tourists on a short trip to Chicago, they might not sign up for the whole year but may do so for a month. That would let them get around the city easily if they decide to take one more trip during the month.

## Summary
* Casual riders spend more time on bikes than members
* The most popular station is Streeter Dr & Grand Ave
* Classic bikes are the most rented bike type
* Docked bikes have the longest ride times
* Saturday has the highest count of rides
* Members favor classic and electric bikes, while casual riders prefer docked bikes
* Usage is consistent across all days of the week for both members and casual riders
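The bike type preferences in the summary can be checked with a normalized cross-tabulation. The snippet below is a sketch on toy data only: the counts are invented, and I am assuming the category labels `classic_bike`, `electric_bike` and `docked_bike` in the `rideable_type` column. It shows how each rider type's rides are split across bike types.

```python
import pandas as pd

# Toy data (hypothetical rides; category labels assumed)
rides = pd.DataFrame({
    "member_casual": ["member", "member", "member",
                      "casual", "casual", "casual", "casual"],
    "rideable_type": ["classic_bike", "electric_bike", "classic_bike",
                      "docked_bike", "classic_bike", "electric_bike", "docked_bike"],
})

# Share of each bike type within each rider type (every row sums to 1)
share = pd.crosstab(rides["member_casual"], rides["rideable_type"], normalize="index")

print(share)
```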

## References

1. Datasets: https://divvy-tripdata.s3.amazonaws.com/index.html
2. Company website: https://ride.divvybikes.com/