<a href="https://colab.research.google.com/github/khushijain822/bike_sharing/blob/main/Bike_Sharing_Demand_pridiction_d4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>

# **GitHub Link -** https://github.com/khushijain822/bike_sharing.git

## <b> Problem Statement </b>

### Rental bikes have been introduced in many cities to make urban mobility more convenient. Bikes must be available and accessible to the public at the right time, because this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the number of bikes required at each hour.


## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature-Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - 1 = Winter,2 = Spring, 3 = Fall, 4 = Summer
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from datetime import date

sns.set_style('darkgrid')
# Importing Minmaxscaler to scale data
from sklearn.preprocessing import MinMaxScaler,StandardScaler

#Import the Models
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# importing library called warning to ignore warnings.
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# load & save data
data=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SeoulBikeData.csv',encoding='latin-1')

In [None]:
# creating copy so as to not disturb original dataset
df=data.copy()

### Dataset First View

In [None]:
# Dataset First Look
df.head().T

In [None]:
#checking bottom 5 rows
df.tail()

In [None]:
#checking random samples of data
df.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
#rows& columns of data
df.shape


In [None]:
#total datapoints
df.size

### Dataset Information

In [None]:
# Dataset Info
#checking non null and datatypes
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Heatmap to see null values in dataset

plt.figure(figsize=(15,8))
sns.heatmap(df.isnull(),cbar=False,cmap="crest")
plt.title('Missing values display',fontsize=20,fontweight="bold")
plt.show()

### What did you know about your dataset?

The dataset has 8760 rows and 14 columns.

No null values were found in the dataset.

We will change the datatype of the Date column from object to datetime.

We will convert Functioning Day, Seasons and Holiday from object type to categorical data, which helps machine learning algorithms.
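
The two dtype changes described above could look like the following minimal sketch. The three-row frame is made-up illustration data; only the column names and the day-first date format are taken from this notebook.

```python
import pandas as pd

# Tiny made-up sample that mimics the relevant columns of the dataset.
sample = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017", "03/12/2017"],
    "Seasons": ["Winter", "Winter", "Winter"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday"],
    "Functioning Day": ["Yes", "Yes", "No"],
})

# Parse the day-first date strings into datetime64 values.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Store the low-cardinality text columns as pandas categoricals.
for col in ["Seasons", "Holiday", "Functioning Day"]:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
```

Categorical dtype mainly saves memory and documents intent; most scikit-learn estimators still need the numeric encoding applied later in the notebook.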

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.to_list()

In [None]:
# Dataset Describe
df.describe().round(2).T

In [None]:
df.describe(include='O').T

### Variables Description

Seasons has 4 unique values: Spring, Summer, Winter and Fall. The most frequent season is Spring, with 2208 rows.

Holiday has 2 unique values: Holiday and No Holiday. The most frequent is No Holiday, with 8328 rows.

Functioning Day has 2 unique values: Yes and No.

The maximum Rented Bike Count is 3356 and the minimum is 0.

### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable, so that wrong entries such as #, @, %, ?, +, & in string or integer columns, which null-value detection cannot find, are caught.
for num,col in enumerate(df.columns,1):
    print('\n')
    print(num,')\n','{} : {}'.format(col,df[col].unique().tolist()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# get sum of missing values in every column
df.isna().sum()

In [None]:
# sum of duplicated rows in dataset
df.duplicated().sum()

In [None]:
# extracting day,month,year from date
from datetime import date
df['Date']=pd.to_datetime(df['Date'], format="%d/%m/%Y")
df['year']=df['Date'].dt.year
df['month']=df['Date'].dt.month
df['day']=df['Date'].dt.day
df['day_name']=df['Date'].dt.day_name()

In [None]:
# Convert Hour to object dtype so that it is treated as a categorical feature
df['Hour']=df['Hour'].astype('object')


In [None]:
df.info()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Bar plot for Daily, Hourly, monthly & yearly Rented Bike count
cols = ['day','Hour','year','month']

n=1
plt.figure(figsize=(20,12))
for i in cols:
  plt.subplot(2,2,n)
  n=n+1
  sns.barplot(data=df,x=i,y='Rented Bike Count')
  plt.title(f"count of {i}")
plt.show()

1. 	**Hourly: demand is highest at 8am and 6pm.**
2. 	**Daily: the rented bike count is lower on the 1st and 2nd day of the month, then rises gradually over the following week and stays in the range of 600-800.**
3.	**Monthly: the summer months have the highest rented bike count and the winter months the lowest.**
4.	**Yearly: 2017 has a lower rented bike count, and demand increased in 2018.**
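
The peak hours read off the bar plot can also be checked numerically, since `sns.barplot` draws the mean of `y` for each `x` value by default. The sketch below uses made-up counts (hypothetical numbers, not taken from the dataset) to show the idea.

```python
import pandas as pd

# Made-up hourly counts for two days (hypothetical numbers, not from the dataset).
toy = pd.DataFrame({
    "Hour": [7, 8, 9, 17, 18, 19] * 2,
    "Rented Bike Count": [400, 900, 500, 700, 1200, 800,
                          420, 950, 480, 720, 1100, 760],
})

# Mean count per hour, sorted so the busiest hours come first.
hourly_mean = (
    toy.groupby("Hour")["Rented Bike Count"]
    .mean()
    .sort_values(ascending=False)
)
peak_hours = hourly_mean.head(2).index.tolist()
print(peak_hours)
```

Running the same `groupby` on `df` would give the exact mean behind each bar of the hourly chart.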


In [None]:
# data available for 2017 in every month
year_2017 = df[df['year']==2017]
year_2017['month'].value_counts()

## **Monthly Rented Bike count for 2017 & 2018**

#### Chart - 2

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(data=df,x='month',y='Rented Bike Count',hue ='year')
plt.title('Monthly Rented Bike count for 2017 & 2018')
plt.show()

Demand for rented bikes increased gradually in 2018, from February up to June.

In [None]:
# data available for 2018 in every month
year_2018 = df[df['year']==2018]
year_2018['month'].value_counts()

#### Chart - 3

In [None]:
# Rented Bike count in every season
plt.figure(figsize=(8,8))
df.groupby('Seasons')['Rented Bike Count'].sum().plot.pie(autopct="%.2f%%")
plt.title(' Rented Bike count in every season')
plt.show()

**As seen earlier, demand for rented bikes is highest in summer, at 36.99%.  
Demand is lowest in winter, at only 7.8%.**
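
The percentages on the pie chart are each season's share of the total rented count. A minimal sketch of that calculation, using made-up numbers rather than the real data:

```python
import pandas as pd

# Made-up counts (hypothetical); the real values come from df.
toy = pd.DataFrame({
    "Seasons": ["Summer", "Summer", "Winter", "Spring", "Autumn"],
    "Rented Bike Count": [300, 100, 100, 250, 250],
})

# Total rentals per season, then each season's percentage of the overall total.
season_total = toy.groupby("Seasons")["Rented Bike Count"].sum()
season_share = (season_total / season_total.sum() * 100).round(2)
print(season_share.sort_values(ascending=False))
```

Printing the shares as a table makes the exact figures easier to quote than reading them off the chart labels.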


In [None]:
df.groupby('Hour')['Solar Radiation (MJ/m2)'].sum()

**Hourly Solar radiation Season wise**

In [None]:
plt.figure(figsize=(20,5))
#df.groupby('Hour').sum()['Solar Radiation (MJ/m2)'].plot(kind='bar', color='red')
sns.pointplot(x='Hour',y='Solar Radiation (MJ/m2)',hue='Seasons',data=df)
plt.title('Hourly Solar radiation Season wise')
plt.show()


**Solar radiation peaks at 1pm, and the same hourly pattern of solar radiation is seen in every season.**


In [None]:
df['Functioning Day'].value_counts()

**Rented Bike count on functioning day**

In [None]:
sns.barplot(data=df,x='Functioning Day',y='Rented Bike Count')
plt.title('Rented Bike count on functioning day')
plt.show()

**Rented Bike count on Holiday-non Holiday**

In [None]:
sns.barplot(data=df,x='Holiday',y='Rented Bike Count')
plt.title('Rented Bike count on Holiday-non Holiday')
plt.show()

**Non-holidays have a higher rented bike count. This may indicate that customers use bikes to travel to work on working days more than they use them on holidays.**



**Hourly distribution of Rented bike count on Holiday & non holiday**

In [None]:
plt.figure(figsize=(14,8))
sns.pointplot(data=df,y='Rented Bike Count',x='Hour',hue='Holiday')
plt.title('Hourly distribution of Rented bike count on Holiday & non holiday')
plt.show()

**On non-holidays we can see peaks at 7-9 am and 5-8 pm (17-20), which indicates the daily high-demand periods for rented bikes.**

**Rented Bike count in every season hourly distribution**

In [None]:
plt.figure(figsize=(14,8))
sns.pointplot(data=df,y='Rented Bike Count',x='Hour',hue='Seasons')
plt.title('Rented Bike count in every season hourly distribution')
plt.show()

**A similar hourly pattern is seen in every season, so the need for bike availability can be identified on an hourly basis. Irrespective of season, peaks are seen at 8am and 6pm.**


In [None]:
plt.figure(figsize=(14,8))
sns.lineplot(data=df,y='Rented Bike Count',x='month')

**Regplot – Relationship between Rental Bike count & numerical  variables**

In [None]:
numrical_var=['Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

plt.figure(figsize=(12,10))
n=1
for i in numrical_var:
  plt.subplot(4,2,n)
  n += 1
  sns.regplot(x=df[i],y=df['Rented Bike Count'],scatter_kws={"color": "orange"}, line_kws={"color": "red"},lowess=True)
  plt.tight_layout()

**These regression plots show that some of our features have a positive relationship and some a negative relationship with our target variable.**



# **Multicollinearity Detection**

In [None]:
numeric_columns = df.select_dtypes(include=['int', 'float'])

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(numeric_columns.corr(),annot=True,cmap='PuOr')
plt.title('Multicollinearity Detection by Heatmap')
plt.show()

**Observation:**

 We can see that there is **strong correlation** between the **temperature** and **dew point temperature** features which may cause trouble during the prediction. We will find/detect this type of multicollinearity in a different way ahead.
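
Reading strongly correlated pairs off a heatmap gets harder as the number of columns grows. One way to list them programmatically is sketched below; the helper name `correlated_pairs`, the 0.8 threshold and the synthetic columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic data: dew_point is nearly a shifted copy of temp, wind is unrelated.
rng = np.random.default_rng(0)
temp = rng.normal(15, 10, 500)
toy = pd.DataFrame({
    "temp": temp,
    "dew_point": temp - 5 + rng.normal(0, 1, 500),
    "wind": rng.normal(2, 1, 500),
})
pairs = correlated_pairs(toy)
print(pairs)
```

Applied to `numeric_columns`, the same helper would be expected to flag the temperature and dew point pair seen on the heatmap.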

In [None]:
# detecting multicollinearity by VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
attributes = df[['Temperature(°C)','Dew point temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']]
VIF = pd.DataFrame()
VIF["feature"] = attributes.columns
#calculating VIF
VIF["Variance Inflation Factor"] = [variance_inflation_factor(attributes.values, i)
                          for i in range(len(attributes.columns))]

print(VIF)

In [None]:
# watching correlation between target variable and remaining independent variable
numeric_columns = df.select_dtypes(include=['int', 'float'])
numeric_columns.corr()['Rented Bike Count']

Temperature has a higher correlation with the dependent variable, so let's drop dew point temperature from the list and check the VIF again.

In [None]:
# detecting multicollinearity by VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
attributes = df[['Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']]
VIF = pd.DataFrame()
VIF["feature"] = attributes.columns
#calculating VIF
VIF["Variance Inflation Factor"] = [variance_inflation_factor(attributes.values, i)
                          for i in range(len(attributes.columns))]

print(VIF)

Now the VIF values are pretty much normal, so dropping Dew point temperature is the better choice.
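
For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others. The sketch below is a NumPy-only illustration of that definition with an intercept included; it is not the statsmodels implementation. As far as I know, `variance_inflation_factor` does not add an intercept by itself, so its values can differ from these unless a constant column is added first. The function name `vif` and the synthetic columns are assumptions.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of a 2-D array X.

    Each column is regressed on the remaining columns plus an intercept,
    and VIF = 1 / (1 - R^2) of that regression.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add the intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.1, size=1000)  # almost collinear with a
c = rng.normal(size=1000)                 # independent of a and b
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity, which is the pattern the two nearly collinear columns produce here.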

In [None]:
df.drop(['Dew point temperature(°C)'],axis=1,inplace=True)

### Total columns after dropping Dew point temperature; the remaining columns are

In [None]:
df.columns.to_list()

# **Feature Transformation**

In [None]:
# checking distribution of continuous variables
numrical_col=['Rented Bike Count','Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']
plt.figure(figsize=(18,10))
n=1
for i in numrical_col:
  plt.subplot(3,3,n)
  n=n+1
  sns.histplot(df[i],kde=True,stat='density')  # distplot is deprecated in recent seaborn versions

In [None]:
# checking skewness of features
df[numrical_col].skew().sort_values(ascending=False)

In [None]:
# applying power transformation
from sklearn.preprocessing import PowerTransformer
sc_X=PowerTransformer(method = 'yeo-johnson')
df[numrical_col]=sc_X.fit_transform(df[numrical_col])

In [None]:
# Data distribution after applying Power Transformer
plt.figure(figsize=(18,10))
n=1
for i in numrical_col:
  plt.subplot(3,3,n)
  n=n+1
  sns.histplot(df[i],kde=True,stat='density')  # distplot is deprecated in recent seaborn versions

In [None]:
# skewness after power transformation
df[numrical_col].skew().sort_values(ascending=False)
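
Because `numrical_col` includes `Rented Bike Count`, the target itself is power-transformed above, so the error metrics reported later are on the transformed scale rather than in bikes. The sketch below shows one way to keep the two scales connected: fit the transformer on the training part only and map values back with `inverse_transform`. The gamma-distributed numbers are synthetic stand-ins for the target, not the real data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Made-up right-skewed "counts" standing in for the target column.
y_train = rng.gamma(shape=2.0, scale=300.0, size=800).reshape(-1, 1)
y_test = rng.gamma(shape=2.0, scale=300.0, size=200).reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson")
y_train_t = pt.fit_transform(y_train)  # learn lambda on the training part only
y_test_t = pt.transform(y_test)        # reuse the fitted lambda on the test part

# Values on the transformed scale can be mapped back to the original units.
y_back = pt.inverse_transform(y_test_t)
print(np.allclose(y_back, y_test))
```

Fitting on the training split only is a design choice that avoids leaking test-set statistics into the transformation.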

# **Encoding**

 ***Technique of converting categorical variables into numerical values so that it could be easily fitted to a machine learning model***
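
As a small illustration of one-hot encoding with `drop_first=True`, the option used for `Seasons`, `day_name` and `Hour` below, here is a sketch on a made-up five-row frame. The `dtype=int` argument is an assumption added here so the output prints as 0/1.

```python
import pandas as pd

# Made-up sample of the Seasons column.
toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# One column per category except the first (alphabetically), which becomes the baseline.
encoded = pd.get_dummies(toy, columns=["Seasons"], drop_first=True, dtype=int)
print(encoded)
```

The dropped category (here `Autumn`) is represented by a row of zeros, which avoids a redundant, perfectly collinear column in linear models.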

In [None]:
# lets have look at dataset to know which columns need to be encoded
df.head().T

columns to encode

1. Seasons
2. Holiday
3. Functioning Day
4. day_name
5. year

Binary Encoding

In [None]:
df.replace({'Holiday': { 'No Holiday': 0,'Holiday': 1 },'Functioning Day': { 'Yes': 0,'No': 1},'year':{2017:0,2018:1}},inplace=True)

In [None]:
df1=df.copy()
df1.head()

In [None]:
# shape of data after binary encoding
df.shape

In [None]:
#df['Hour'].value_counts()

In [None]:
dummy_col=pd.get_dummies(df[['Seasons','day_name','Hour']],drop_first=True)

In [None]:
# dummy columns in data
dummy_col.columns

In [None]:
# dropping columns for which dummy variables are created
df.drop(['Seasons','day_name','Hour','Date','day'],axis=1,inplace=True)

In [None]:
# joining dummy features to dataframe df
df=df.join(dummy_col)

In [None]:
# HAVE A LOOK AT ENCODED DATA
df.head().T

In [None]:
df.shape

In [None]:
df.columns

In [None]:
len(df.columns)

In [None]:
# dropping year & month to re
df.drop(['year','month'],axis=1,inplace=True)

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [None]:
# X = independent variables, y = dependent variable

X=df.drop(columns=['Rented Bike Count'])
y=df['Rented Bike Count']

In [None]:
# train_test_split to divide data into training & testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# checking shape of training data & testing data
X_train.shape , X_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
numrical_col

In [None]:
# encoded (binary/dummy) columns: every feature that is not one of the numeric columns
categorical_col=[col for col in X.columns if col not in numrical_col]

categorical_col

# **Scaling**



In [None]:
# Transform numerical features by scaling each feature to a given range.
scaler = MinMaxScaler()
scaling_cols = ['Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']
X_train[scaling_cols]=scaler.fit_transform(X_train[scaling_cols])
X_test[scaling_cols]=scaler.transform(X_test[scaling_cols])
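
The cell above fits the scaler on the training data and only applies it to the test data. A NumPy-only sketch of what that means is given below; the helper names `minmax_fit` and `minmax_apply` and the small arrays are hypothetical.

```python
import numpy as np

def minmax_fit(train):
    """Return the per-column minimum and range learned from the training array."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return lo, span

def minmax_apply(data, lo, span):
    """Scale data with statistics that were learned elsewhere (the training set)."""
    return (data - lo) / span

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 35.0]])  # second value lies above the training maximum

lo, span = minmax_fit(train)
train_scaled = minmax_apply(train, lo, span)
test_scaled = minmax_apply(test, lo, span)
print(test_scaled)
```

Test values can fall outside [0, 1] when they exceed the range seen in training; that is expected and is the price of not peeking at the test set.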

In [None]:
# Shape of Training data
X_train.shape

In [None]:
# Shape of Testing data
X_test.shape

In [None]:
X_train.head()

## **ML Model Implementation**

In [None]:
# function to fit a model and report evaluation metrics and the cross-validation score

def fit_evaluate (model):
  model.fit(X_train,Y_train)
  y_pred=model.predict(X_test)

  MSE  = mean_squared_error(Y_test, y_pred)
  print("MSE:" ,round(MSE,2))
  MAE=mean_absolute_error(Y_test, y_pred)
  print("MAE :" ,round(MAE,2))

  RMSE = np.sqrt(MSE)
  print("RMSE :" ,round(RMSE,2))

  r2 = r2_score(Y_test, y_pred)
  print("R2 :" ,round(r2,2))
  Adjusted_R2 = 1-(1-r2_score(Y_test, y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  print("Adjusted R2 : ",round(Adjusted_R2,2))

  # scoring the model on the training & testing data (for regressors, .score returns R2, not classification accuracy)
  print('                          ')
  print("-------Model accuracy-------")
  print(f"Training accuracy: {round(model.score(X_train,Y_train)*100)}%")
  print(f"Testing accuracy: {round(model.score(X_test,Y_test)*100)}%")
  print('                          ')
  print("-------cross_val_score-------")
  accuracies = cross_val_score(estimator = model, X = X_train, y = Y_train, cv = 5)
  print("Cross Val Accuracy: {:.2f} %".format(accuracies.mean()*100))

  # Plotting graph of actual vs predicted
  plt.figure(figsize=(20,10))
  plt.plot((y_pred)[:100])
  plt.plot((np.array(Y_test)[:100]))
  plt.legend(["Predicted","Actual"])
  plt.title(f'Difference in predicted & actual for {model}')
  plt.show()
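
The adjusted R2 line inside `fit_evaluate` implements $1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$, with $n$ test rows and $p$ predictors. A standalone sketch with hypothetical inputs shows how the penalty grows with the number of predictors; the helper name `adjusted_r2` and the R2 value of 0.90 are assumptions, while 1,752 is the size of a 20% test split of 8,760 rows.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical R^2 of 0.90 on 1,752 test rows, with few vs many predictors.
few = adjusted_r2(0.90, 1752, 10)
many = adjusted_r2(0.90, 1752, 500)
print(round(few, 4), round(many, 4))
```

For the value to be meaningful, `n_features` should be the number of predictors the model was actually fitted on.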

**LinearRegression**

In [None]:
lr= LinearRegression()
fit_evaluate(lr)

In [None]:
# Applying Polynomial Linear Regression
# degree 2
poly = PolynomialFeatures(degree=2,include_bias=True)
X_train_trans = poly.fit_transform(X_train)
X_test_trans = poly.transform(X_test)
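
A degree-2 expansion grows the feature matrix quickly: with $n$ input features and degree $d$, a full polynomial expansion has $\binom{n + d}{d}$ columns including the bias. The helper below, whose name `n_poly_features` is an assumption, simply evaluates that formula.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias=True):
    """Number of columns produced by a full polynomial expansion."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# Three inputs a, b, c at degree 2: 1, a, b, c, a^2, b^2, c^2, ab, ac, bc.
print(n_poly_features(3, 2))
```

Comparing this count with the number of training rows is a quick check on how much room the model has to overfit.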

In [None]:
lr = LinearRegression()
lr.fit(X_train_trans,Y_train)
y_pred1 = lr.predict(X_test_trans)

In [None]:
training_score=lr.score(X_train_trans,Y_train)*100
testing_score=lr.score(X_test_trans,Y_test)*100
print(f"Training score: {training_score}")
print(f"testing score: {testing_score}")

In [None]:
MSE  = mean_squared_error(Y_test, y_pred1)
print("MSE :" , MSE)

MAE=mean_absolute_error(Y_test, y_pred1)
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(Y_test, y_pred1)
print("R2 :" ,r2)
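# use the number of predictors actually fed to the model (polynomial terms, excluding the bias column)
n_poly_terms = X_test_trans.shape[1]-1
Adjusted_R2 = 1-(1-r2)*((X_test.shape[0]-1)/(X_test.shape[0]-n_poly_terms-1))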
print("Adjusted R2 : ",Adjusted_R2)

# Plotting graph of actual vs predicted
plt.figure(figsize=(20,10))
plt.plot((y_pred1)[:100])
plt.plot((np.array(Y_test)[:100]))
plt.legend(["Predicted","Actual"])
plt.title(f'Difference in predicted & actual for polynomial Regression')
plt.show()

In [None]:
poly_score={'r2':r2,'Adjusted_R2':Adjusted_R2,'MSE':MSE,'RMSE':RMSE,'MAE':MAE,'Training_score':training_score,'testing_score':testing_score}

**Ridge**

In [None]:
R=Ridge(alpha=9)
fit_evaluate(R)

**Decision Tree**

In [None]:
regressor=DecisionTreeRegressor(max_depth=18)
fit_evaluate(regressor)

**BaggingRegressor**

In [None]:
bag_regressor= BaggingRegressor(random_state=22)
fit_evaluate(bag_regressor)

**Random Forest**

In [None]:
random_forest=RandomForestRegressor(n_estimators=10,random_state=0)
fit_evaluate(random_forest)

**Randomized Search Cv**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
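# Number of features to consider at every split
# ('auto' was removed for regressors in scikit-learn 1.3; 1.0 means all features, which is what 'auto' meant)
max_features = [1.0, 'sqrt','log2']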
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['friedman_mse', 'squared_error']}  # 'gini' is a classification criterion and is not valid for a regressor
print(random_grid)

In [None]:
random_forest_best=RandomForestRegressor()

In [None]:
model_randomcv=RandomizedSearchCV(estimator=random_forest_best,param_distributions=random_grid,n_iter=10,cv=3,verbose=2,
                               random_state=100,n_jobs=-1)
### fit the randomized model
model_randomcv.fit(X_train,Y_train)

In [None]:
model_randomcv.best_params_
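
`RandomizedSearchCV` evaluates only `n_iter` sampled combinations rather than the whole grid, so it helps to know how large the grid is. The sketch below counts combinations for a small hypothetical grid; the helper name `grid_size` and the values in `toy_grid` are assumptions.

```python
from math import prod

def grid_size(param_grid):
    """Number of distinct parameter combinations in a dict of candidate lists."""
    return prod(len(values) for values in param_grid.values())

# Hypothetical small grid, shaped like the one used for the random forest.
toy_grid = {
    "n_estimators": [200, 400, 600],
    "max_depth": [10, 120, None],
    "min_samples_split": [2, 5],
}
total = grid_size(toy_grid)
n_iter = 10
print(total, n_iter / total)
```

Calling `grid_size(random_grid)` on the grid defined earlier would show what fraction of it the 10 sampled candidates cover, which is a reason to treat `best_params_` as a good candidate rather than a proven optimum.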

In [None]:
random_forest=RandomForestRegressor(n_estimators=600,min_samples_split=2,min_samples_leaf=1,max_features='sqrt',max_depth=120,criterion='squared_error',random_state=0)
fit_evaluate(random_forest)

**Adaboost**

In [None]:
# weak learner --> a model that is only slightly better than random guessing
# decision stump --> smallest decision tree, depth=1
# AdaBoost --> combines many weak learners into one strong learner
# note: scikit-learn's AdaBoostRegressor uses a depth-3 decision tree as its default weak learner

In [None]:
ada_regressor= AdaBoostRegressor(random_state=22)
fit_evaluate(ada_regressor)

**Gradientboost**

In [None]:
gb= GradientBoostingRegressor(random_state=22)
fit_evaluate(gb)

In [None]:
# defining function to save accuracy metrics for model evaluation summary
evaluation_summary=[]
def save_score (model):
  model.fit(X_train,Y_train)
  y_pred=model.predict(X_test)

  MSE  = mean_squared_error(Y_test, y_pred)
  MAE=mean_absolute_error(Y_test, y_pred)
  RMSE = np.sqrt(MSE)
  r2 = r2_score(Y_test, y_pred)
  Adjusted_R2 = 1-(1-r2_score(Y_test, y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  training_score=round(model.score(X_train,Y_train)*100,2)
  testing_score=round(model.score(X_test,Y_test)*100,2)
  cv_accuracies = cross_val_score(estimator = model, X = X_train, y = Y_train, cv = 5)
  model={'r2':round(r2,2),'Adjusted_R2':round(Adjusted_R2,2),'MSE':round(MSE,2),'RMSE':round(RMSE,2),'MAE':round(MAE,2),'Training_score':round(training_score,2),'testing_score':round(testing_score,2)}
  evaluation_summary.append(model)
  #evaluation_summary.write("\n")




In [None]:
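# note: random_forest holds the tuned parameters; random_forest_best is an untuned default RandomForestRegressor
algo=[lr,R,regressor,bag_regressor,random_forest,random_forest_best,ada_regressor,gb]
for i in algo:
  save_score(i)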

In [None]:
for idx, summary in enumerate(evaluation_summary, 1):
    print(f"Model {idx}:")
    for metric, value in summary.items():
        print(f"{metric}: {value}")
    print()

In [None]:
poly_score={'r2':round(r2,2),'Adjusted_R2':round(Adjusted_R2,2),'MSE':round(MSE,2),'RMSE':round(RMSE,2),'MAE':round(MAE,2),'Training_score':round(training_score,2),'testing_score':round(testing_score,2)}

In [None]:
df=pd.DataFrame(evaluation_summary,index=['lr','R','decision_tree','bag_regressor','random_forest','random_forest_best','ada_regressor','gb']).rename_axis('model', axis=1).sort_values(by='r2',ascending=False)

In [None]:
df

In [None]:
poly_df=pd.DataFrame(poly_score,index=['poly'])

In [None]:
poly_df

In [None]:
new_df=pd.concat([df,poly_df]).sort_values(by='r2',ascending=False)
new_df
