<a href="https://colab.research.google.com/github/khanimrangithu/End-to-End-Machine-Learning/blob/main/Predicting_Bike_Sharing_Demand_A_Regression_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Bike Sharing Demand Prediction



##### **Project Type** - Regression
##### **Contribution** - Team
##### **Team Member 1** - Imran
##### **Team Member 2** - Harshad

In [None]:
from IPython.display import Image
Image(url='https://static.wixstatic.com/media/1a7d78_77851008cda84fa99b8dcb7013ea9d3d~mv2.jpg/v1/fill/w_962,h_541,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/1a7d78_77851008cda84fa99b8dcb7013ea9d3d~mv2.jpg', width=800)

# **Project Summary -**

The Bike Seoul initiative represents a sustainable transportation effort in Seoul, South Korea, aiming to reduce traffic congestion by offering a bike sharing service. This service allows residents and visitors to rent bicycles at various stations across the city, promoting eco-friendly and convenient transportation. With increasing demand for bike rentals, there is a need for efficient management of bike sharing operations. Accurately predicting bike demand is crucial for optimizing fleet management, ensuring bike availability at high-demand locations, and reducing waste and costs.

The primary objective of the project was to develop a machine learning model to predict bike rental demand in Seoul, based on historical data and relevant factors such as weather conditions, time of day, and public holidays. Regression analysis techniques were employed to model the bike demand data, using a large dataset of past bike rental information along with weather and time data. The model was tested and evaluated using metrics such as mean squared error (MSE) and R-squared. The dataset used was sourced from the Seoul city government's open data portal and is also available on Kaggle.
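To make the two evaluation metrics concrete, the following minimal sketch computes mean squared error and R-squared by hand with NumPy. The arrays `y_true` and `y_pred` are made-up illustrative values, not results from this project.

```python
import numpy as np

# Illustrative values only (assumed for this sketch, not project results)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Mean squared error: the average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.1f}, R-squared: {r2:.3f}")
```

A lower MSE and an R-squared closer to 1 both indicate predictions that track the observed counts more closely.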

The project aimed to achieve a prediction accuracy of at least 85% for bike demand, enabling bike sharing service providers to plan fleet operations effectively and respond to demand changes in real time. Various regression algorithms, including linear regression, decision tree, random forest, and gradient boosting, were explored. Hyperparameter tuning and cross-validation were also conducted to enhance model accuracy. Ultimately, the Extreme Gradient Boosting (XGBoost) algorithm was selected due to its high accuracy, achieving around 93% and 90% on train and test data, respectively.
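As a rough illustration of comparing train and test scores, the sketch below fits a gradient boosting regressor on synthetic data and reports R-squared on both splits. The data, the 75/25 split, and the use of scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost are all assumptions made for this example; it does not reproduce the project's model or its scores.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): two informative features plus small noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 2))
y = 3 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 0.5, size=500)

# Hold out a quarter of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Score the same fitted model on both splits
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"Train R-squared: {r2_train:.3f}, Test R-squared: {r2_test:.3f}")
```

A large gap between the train and test scores would suggest overfitting, which is why both numbers are usually reported together.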

This project not only provided valuable insights into bike demand patterns in Seoul but also showcased the practical applications of machine learning in addressing real-world problems. The findings could potentially be extended to other cities with similar bike sharing systems, leading to improved services for bike users and more sustainable transportation systems.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Rental bikes have been introduced in many cities to make urban mobility more comfortable. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that this stable supply can be maintained.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data handling (pandas, numpy) and visualization (matplotlib, seaborn) libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Datetime library for manipulating Date columns.
from datetime import datetime
import datetime as dt


# Scaling, transforming and encoding utilities from scikit-learn, used to
# change raw feature vectors into a representation that is more suitable
# for the downstream estimators.
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MultiLabelBinarizer


# Importing various machine learning models.
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

# Import different metrics from sci-kit libraries for model evaluation.
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

# Importing warnings library. The warnings module handles warnings in Python.
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive                #Mounting google drive
drive.mount('/content/drive')

In [None]:
# Load the Seoul bike dataset from Google Drive using pd.read_csv
# The 'encoding='latin'' parameter is used to specify the character encoding of the file, ensuring proper reading of non-English characters or special symbols
bike_df = pd.read_csv('/content/drive/MyDrive/EDA Datasets/SeoulBikeData.csv', encoding='latin')

### Dataset First View

In [None]:
# Top 5 rows
bike_df.head()

In [None]:
# Last 5 rows
bike_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_cols = bike_df.shape

print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Number of duplicate rows:", bike_df.duplicated().sum())

# Dataset Unique Value Count (index the column so each line reports a single value)
for i in bike_df.columns.tolist():
  print(f"No. of unique values in {i} is {bike_df[i].nunique()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(20,8))
sns.heatmap(bike_df.isna().transpose(), cmap="viridis", cbar_kws={'label': 'Missing Data'})
plt.title('Visualization of Missing Values', fontsize=18)
plt.show()

### What did you know about your dataset?

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe(include='all').round(2)

### Variables Description

#### **Features Breakdown:**

**Date:** The day's date, ranging from 01/12/2017 to 30/11/2018, formatted as DD/MM/YYYY (string). Conversion to datetime format required.

**Rented Bike Count:** Number of rented bikes per hour, our dependent variable for prediction (integer).

**Hour:** The hour of the day, ranging from 0 to 23 in digital time format (integer). Conversion to categorical data type needed.

**Temperature(°C):** Temperature in Celsius (float).

**Humidity(%):** Air humidity percentage (integer).

**Wind speed (m/s):** Wind speed in meters per second (float).

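**Visibility (10m):** Visibility in units of 10 meters (integer).

**Dew point temperature(°C):** Dew point temperature in Celsius, the temperature at which air becomes saturated with water vapor (float).

**Solar Radiation (MJ/m2):** Solar radiation in megajoules per square meter (float).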

**Rainfall(mm):** Amount of rainfall in millimeters (float).

**Snowfall (cm):** Amount of snowfall in centimeters (float).

**Seasons:** Season of the year (string), limited to four seasons.

**Holiday:** Indicates if the day is a holiday period (string).

**Functioning Day:** Indicates if the day is a functioning day (string).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in bike_df.columns.tolist():
  print(f"No. of unique values in {i} is {bike_df[i].nunique()}.")

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Copy of the Dataset
bike_df_1 = bike_df.copy()

In [None]:
# Some column names in the dataset are long and clumsy, so we rename them to simpler ones. This does not affect the end results.
# Renaming the Columns

bike_df_1.rename(columns= {'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'}, inplace=True)

In [None]:
bike_df_1.columns

**In pandas, the "Date" column is read as an object type, essentially as a string. Since the date column is crucial for analyzing user behavior, it needs to be converted into a datetime format. After this conversion, we will split it into three columns: 'year', 'month', and 'day' (the name of the weekday).**

In [None]:
# converting date variable into datetime format
bike_df_1['Date'] = pd.to_datetime(bike_df_1['Date'], format='%d/%m/%Y')

In [None]:
# Split the "Date" column into three "year","month","day" columns
bike_df_1['year'] = bike_df_1['Date'].dt.year
bike_df_1['month'] = bike_df_1['Date'].dt.month
bike_df_1['day'] = bike_df_1['Date'].dt.day_name()

* **We split the "Date" column into 3 different columns: "year", "month", "day".**
* **The "year" column contains 2 unique values (2017 and 2018) because the data runs from December 2017 to November 2018. Since this span covers a single year, we can drop the "year" column.**
* **The "day" column contains the name of the weekday for each date. For our purposes, we only need to know whether a day is a weekday or a weekend, so we convert it into this format and drop the "day" column.**
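The date handling described above can be sketched on a few sample rows. The four dates below are illustrative inputs in the dataset's DD/MM/YYYY format, and the weekend flag is derived from `dt.dayofweek` rather than from day names, which is an alternative chosen for this sketch only.

```python
import pandas as pd

# Illustrative dates in the dataset's DD/MM/YYYY string format
df = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017", "30/11/2018"]})

# Parse the strings with an explicit format so day and month are not swapped
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["month"] = df["Date"].dt.month

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df["weekdays_weekend"] = (df["Date"].dt.dayofweek >= 5).astype(int)

print(df)
```

Passing the explicit `format` matters for this dataset, because a date such as 01/12/2017 could otherwise be read as January 12 instead of December 1.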

In [None]:
# Create a "weekdays_weekend" flag (1 = weekend, 0 = weekday), then drop "Date", "day" and "year"
bike_df_1['weekdays_weekend'] = bike_df_1['day'].isin(['Saturday', 'Sunday']).astype(int)
bike_df_1 = bike_df_1.drop(columns=['Date', 'day', 'year'])

In [None]:
bike_df_1.head()

In [None]:
bike_df_1['weekdays_weekend'].value_counts()

**The "Hour", "month", and "weekdays_weekend" columns are currently stored as integer data types, but their values represent categories rather than quantities, so they should be converted to the category data type. Failing to do so may lead to inaccurate analysis and correlations, potentially resulting in misleading conclusions.**
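A small sketch of what the conversion changes in practice: once a column has the category dtype, pandas no longer treats it as numeric. The toy frame below uses assumed values and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) with an integer-coded hour and a numeric feature
df = pd.DataFrame({"Hour": [0, 8, 18, 23], "Temperature": [1.5, 3.0, 7.2, 2.1]})

# Before conversion both columns count as numeric
print(df.select_dtypes(include=[np.number]).columns.tolist())

df["Hour"] = df["Hour"].astype("category")

# After conversion, Hour is excluded from numeric-only selections
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_cols)
```

One consequence is that numeric-only steps, such as a correlation matrix built from `select_dtypes(include=[np.number])`, skip the converted columns.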

In [None]:
# Change the int64 columns into category columns
cols=['Hour','month','weekdays_weekend']
for col in cols:
  bike_df_1[col]=bike_df_1[col].astype('category')

In [None]:
# Check the dtypes again
bike_df_1.info()

In [None]:
# defining continuous independent variables separately
cont_var = ['Temperature', 'Humidity', 'Wind_speed', 'Visibility', 'Dew_point_temperature','Solar_Radiation', 'Rainfall', 'Snowfall']

In [None]:
# defining dependent variable
dependent_variable = ['Rented_Bike_Count']

In [None]:
# defining categorical independent variables separately
cat_var = ['Hour','Seasons', 'Holiday', 'Functioning_Day', 'month', 'weekdays_weekend']

### What all manipulations have you done and insights you found?

* Some of the columns' names in the dataset were excessively long and cumbersome, so we simplified them. This modification did not impact our final results.
* The "Date" column in the dataset was initially read as an object type in Python, essentially as a string. Recognizing the significance of the date column for analyzing user behavior, we converted it into a datetime format.
* Following this conversion, we split it into three columns: 'year', 'month', and 'day' (the name of the weekday).
* The "year" column contains 2 unique values (2017 and 2018) because the data runs from December 2017 to November 2018. Treating this as a single year, we dropped the "year" column.
* The "day" column contains the name of the weekday for each date. For our purposes, we only needed to know if a day is a weekday or a weekend, so we converted it into this format and dropped the "day" column.
* The "Hour," "month," and "weekdays_weekend" columns were initially shown as integer data types. We categorized them as category data types to ensure accurate analysis and correlations, thereby avoiding potentially misleading conclusions.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Dependent variable Distribution

In [None]:
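# Chart-1 Visualization code for distribution of target variable
# histplot with kde=True replaces sns.distplot, which is deprecated in recent seaborn releases
plt.figure(figsize=(10,8))
sns.histplot(bike_df_1['Rented_Bike_Count'], kde=True, color='blue')
plt.show()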

##### 1. Why did you pick the specific chart?

A distribution plot, that is, a histogram with a kernel density estimate (KDE) overlay, offers a quick and straightforward way to examine how the data is distributed, detect patterns or outliers, and compare the distributions of multiple variables. It also helps assess whether the data follows a normal distribution.

Consequently, we used this plot to examine the distribution of the dependent variable and check its symmetry.

##### 2. What is/are the insight(s) found from the chart?

Based on the distribution plot of the dependent variable "rented bike", it is evident that the distribution is positively skewed (right skewed), indicating asymmetry around the mean.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Certainly. This insight shows that the dependent variable is not normally distributed. Therefore, before fitting any model on this dataset, it is worth transforming the dependent variable (for example, with a square-root or log transformation) to reduce the skew.
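A square-root transformation is one common way to reduce right skew. The sketch below measures skewness before and after the transformation on synthetic data; the exponential shape and scale are assumptions for illustration and do not represent the actual rental counts.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (assumption: exponential-like shape)
rng = np.random.default_rng(0)
counts = pd.Series(rng.exponential(scale=700, size=5000))

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()

print(f"Skewness before: {skew_raw:.2f}, after square root: {skew_sqrt:.2f}")
```

If the target is transformed before training, predictions need to be squared again to return to the original bike-count scale.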

#### Chart - 2 : Distribution V/s Box plot

In [None]:
# Histogram and boxplot for each numeric column to inspect its distribution
for col in bike_df_1.describe().columns:
    fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(18,6))
    sns.histplot(bike_df_1[col], ax = axes[0],kde = True, color='blue')
    sns.boxplot(bike_df_1[col], ax = axes[1],orient='h',showmeans=True,color='orange')
    fig.suptitle("Distribution plot of "+ col, fontsize = 15)
    plt.show()

##### 1. Why did you pick the specific chart?

A histplot is a chart that displays the distribution of a dataset, providing a graphical representation of how often each value or group of values occurs. It is valuable for understanding the dataset's distribution, identifying patterns or trends, and is particularly useful for large datasets (exceeding 100 observations) to detect outliers or gaps in the data.

Consequently, we utilized the histogram plot to analyze the variable distributions across the entire dataset for symmetry.

A boxplot summarizes key statistical characteristics of a dataset, including the median, quartiles, and range, in a single plot. It is useful for identifying outliers, comparing multiple datasets, and understanding data dispersion, commonly employed in statistical analysis and data visualization.

Therefore, for each numerical variable in the given dataset, we used a box plot to analyze outliers and the interquartile range, encompassing mean, median, maximum, and minimum values.

##### 2. What is/are the insight(s) found from the chart?

Based on the above univariate analysis of all continuous feature variables, only the temperature and humidity columns appear approximately normally distributed, while the others display different distributions.

Furthermore, outlier values are noticeable in the snowfall, rainfall, wind speed, and solar radiation columns.
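The points a boxplot draws as outliers follow the 1.5 × IQR rule by default. The sketch below applies that rule directly to a toy series; the values are assumed for illustration only.

```python
import pandas as pd

# Toy values (assumed) with one obvious extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

For heavily zero-inflated columns such as rainfall or snowfall, this rule can flag every non-zero reading, so flagged points are not necessarily errors.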

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Histograms and box plots alone cannot give us complete information about the data. They are used here only to see how each column's values are distributed across the dataset.

#### Chart - 3 : Dependent variable with continuous variables (Bivariate)

In [None]:
# Analyzing the relationship between the dependent variable and the continuous variables
for i in cont_var:
  plt.figure(figsize=(11,8))
  sns.regplot(x=i,y=dependent_variable[0],data=bike_df_1)
  plt.xlabel(i)
  plt.ylabel(dependent_variable[0])
  plt.title(i+' vs '+ dependent_variable[0])
  plt.show()

##### 1. Why did you pick the specific chart?

The regplot function is utilized to generate a scatter plot with a linear regression line. Its purpose is to visualize the relationship between two continuous variables, aiding in the identification of patterns and trends in the data. Additionally, it can be employed to test for linearity and independence of the variables.

We used regplot to examine the patterns between each continuous independent variable and our dependent variable, "rented bike."

##### 2. What is/are the insight(s) found from the chart?

From the regression plots above, we can see some linearity between the dependent variable (rented bike count) and temperature, solar radiation, and dew point temperature.

The other variables do not show any clear pattern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indeed, the regplot provided valuable insights, indicating that certain variables exhibit patterns with the dependent variable. These variables may be crucial features when predicting rented bike counts, warranting focused attention from the business.

#### Chart - 4 : Categorical variables with dependent variable (Bivariate)

In [None]:
# Analyzing the relationship between the dependent variable and the categorical variables
for i in cat_var:
    plt.figure(figsize=(11, 8))
    sns.barplot(x=i, y=dependent_variable[0], data=bike_df_1, palette="muted")
    plt.xlabel(i)
    plt.ylabel(dependent_variable[0])
    plt.title(i + ' vs ' + dependent_variable[0])
    plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are employed to compare the size or frequency of different categories or groups of data. They are valuable for comparing data across various categories and can effectively display a large amount of data in a small space.

We used bar charts to compare the average rented bike count across the levels of each categorical variable.

##### 2. What is/are the insight(s) found from the chart?

From the bar charts above, we obtained the following insights:

1. In the hour vs. rented bike chart, demand is highest around 8:00 in the morning and 18:00 in the evening.
2. In the season vs. rented bike chart, demand is highest in summer and lowest in winter.
3. Demand is high on working days.
4. The month chart shows that demand is highest in June.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely, these insights are likely to have a positive impact on the business. By analyzing demand based on categorical variables, we can discern when bike demand is highest, allowing us to focus more resources on those specific periods.

#### Chart - 5 : Rented Bike vs Hour

In [None]:
# Plotting a line graph
# Group by hour and compute the average number of bikes rented
avg_rent_hrs = bike_df_1.groupby('Hour')['Rented_Bike_Count'].mean()

# Plot the average rentals for each hour of the day
plt.figure(figsize=(12,6))
sns.lineplot(data=avg_rent_hrs, marker='o')
plt.title('Average bike rented per hour')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot, also referred to as a line chart or line graph, is a method to visualize the trend of a single variable over time. It connects a series of data points with a line to illustrate how the value of the variable changes over time.

Line plots are valuable as they swiftly and clearly display trends and patterns in the data, particularly showcasing how a variable changes over a period of time. They are also useful for comparing the trends of multiple variables.

We employed a line plot to observe the average rented bike demand over a 24-hour period.

##### 2. What is/are the insight(s) found from the chart?

From the line plot above, we can clearly see that demand is high in the morning and in the evening.
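The same peaks can be read off numerically instead of visually. This sketch averages rentals per hour and picks the two busiest hours; the records are toy values assumed for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy hourly records (assumed values, two observations per hour)
df = pd.DataFrame({
    "Hour": [7, 7, 8, 8, 12, 12, 18, 18],
    "Rented_Bike_Count": [400, 500, 900, 1100, 600, 700, 1400, 1600],
})

# Average rentals per hour, then the two hours with the highest averages
avg_by_hour = df.groupby("Hour")["Rented_Bike_Count"].mean()
peak_hours = avg_by_hour.nlargest(2).index.tolist()
print(peak_hours)
```

On the real data, the same two lines could be applied to the hourly averages computed for the line plot.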

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely, the insight gleaned indicates a high demand in the morning and evening, suggesting that the business should prioritize and focus on meeting the demand during these specific time slots.

#### Chart - 6 : Bike demand throughout the day (Multivariate)

In [None]:
# Chart - 6 visualization code
for i in cat_var:
  if i == 'Hour':   # 'Hour' is already on the x-axis, so skip it as a hue
    continue
  else:
    fig, ax = plt.subplots(figsize=(12,8))
    sns.pointplot(data=bike_df_1, x='Hour', y='Rented_Bike_Count', hue=i, ax=ax)
    plt.title('Hourly bike demand broken down based on the attribute: '+i)
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',title=i)
    plt.show()

##### 1. Why did you pick the specific chart?

A line plot, also referred to as a line chart or line graph, is a method to visualize the trend of a single variable over time, connecting a series of data points with a line to illustrate changes over time.

Line plots are valuable for swiftly and clearly displaying trends and patterns in the data, particularly showcasing how a variable changes over a period of time. They are also useful for comparing the trends of multiple variables.

We used seaborn's pointplot, which joins the mean value at each hour with a line and draws one line per category, to illustrate the demand for rented bikes throughout the day based on the other categorical variables.

##### 2. What is/are the insight(s) found from the chart?

From the plots above, we see that:

1. In the winter season there is no significant demand, even in the morning or in the evening.
2. On non-holidays there is a spike in the morning and in the evening, but this spike is absent on holidays.
3. During the three winter months (i.e., December, January & February) demand is low.
4. On weekends there is demand almost throughout the day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, from this analysis we identified key factors such as the high demand in the morning and evening slots, which appears in every season except winter.

#### Chart - 7 : Categorical plot for seasons

In [None]:
# Plot of rented bike count by season
sns.catplot(x='Seasons',y='Rented_Bike_Count',data=bike_df_1)
plt.show()

##### 1. Why did you pick the specific chart?

The catplot function creates a categorical plot, which shows how a numeric variable is distributed within each level of a categorical variable. Such plots make it easy to compare distributions across categories.

We used the catplot to observe the distribution of rented bikes based on the "Seasons" column.

##### 2. What is/are the insight(s) found from the chart?

From the catplot above, we learned that:

1. Demand is low in winter.
2. In all seasons, the points are densely distributed up to a bike count of about 2500.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The catplot shows that most bike counts lie below about 2500, so values above that level may be outliers. The business needs to evaluate them.

#### Chart - 8 : Pie Chart

In [None]:
# Chart - 8 visualization code

# Grouping by season and summing the rented bike count
season_counts = bike_df_1.groupby('Seasons')['Rented_Bike_Count'].sum()

# Plotting the pie chart
plt.figure(figsize=(10, 10))
plt.pie(season_counts, labels=season_counts.index, autopct='%1.1f%%')
plt.title("Distribution of rented bikes by season", fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are commonly employed to display the proportions of a whole, particularly useful for presenting data that has been calculated as a percentage of the whole.

In this case, we utilized a pie chart to illustrate the percentage distribution of rented bikes based on different seasons.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart above:

1. Over the year of data, summer contributes around 36% of rentals, followed by autumn at around 29%.
2. Winter has the lowest demand, contributing only around 7%.
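The percentages a pie chart displays can also be computed directly, which is useful when exact figures are needed. The totals below are toy values assumed for this sketch and do not match the dataset's actual sums.

```python
import pandas as pd

# Toy records (assumed values, not the dataset's actual counts)
df = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Summer", "Autumn"],
    "Rented_Bike_Count": [100, 300, 250, 200, 100, 50],
})

# Total rentals per season, expressed as a percentage of the overall total
season_totals = df.groupby("Seasons")["Rented_Bike_Count"].sum()
season_share = (season_totals / season_totals.sum() * 100).round(1)
print(season_share.sort_values(ascending=False))
```

This is the same calculation that `autopct` performs when matplotlib labels the pie slices.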

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight only describes each season's percentage contribution to the yearly total, but it gives a clear indication of seasonal demand.

#### Chart - 9 : Correlation Heatmap

In [None]:

# Select only numeric columns for correlation calculation
numeric_columns = bike_df_1.select_dtypes(include=[np.number])

corr = numeric_columns.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(18, 9))
    ax = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap="viridis")

##### 1. Why did you pick the specific chart?

The correlation coefficient serves as a measure of the strength and direction of a linear relationship between two variables. A correlation matrix is employed to summarize the relationships among a set of variables, serving as a crucial tool for data exploration and for selecting which variables to include in a model. The correlation range is between -1 and 1.

To understand the correlation between all the variables, along with the correlation coefficients, we utilized a correlation heatmap.
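For reference, the Pearson coefficient shown in each heatmap cell can be computed by hand. The sketch below does so with NumPy and cross-checks the result against `np.corrcoef`; the paired values are assumed for illustration only.

```python
import numpy as np

# Toy paired observations (assumed values)
temperature = np.array([-5.0, 0.0, 10.0, 20.0, 30.0])
rentals = np.array([150.0, 300.0, 700.0, 1100.0, 1300.0])

# Pearson r: sum of co-deviations divided by the product of deviation norms
dx = temperature - temperature.mean()
dy = rentals - rentals.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Cross-check against NumPy's built-in correlation matrix
r_numpy = np.corrcoef(temperature, rentals)[0, 1]
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Because the coefficient only captures linear association, a value near zero does not rule out a non-linear relationship.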

##### 2. What is/are the insight(s) found from the chart?

From the correlation map above, we can clearly see that:

1. There is high multicollinearity between independent variables (i.e., temperature & dew point temp, humidity & dew point temp, weekend & day of week).
2. Temperature, hour, dew point temp & solar radiation are correlated with the dependent variable, rented bike count.
3. Other than that, we did not see any notable correlation.
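One common way to quantify multicollinearity, beyond reading pairwise correlations, is the variance inflation factor (VIF). It is not part of the original analysis; the sketch below is a minimal NumPy implementation on synthetic features, where the helper name `vif` and the data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1 - residuals.var() / target.var()
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Synthetic features (assumption): x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vif_values = vif(np.column_stack([x1, x2, x3]))
print(np.round(vif_values, 1))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that a feature is largely explained by the others, which is the situation described for temperature and dew point temperature.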

#### Chart - 10 : Pair Plot

In [None]:
# Pair Plot
sns.pairplot(bike_df_1)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot, also known as a scatterplot matrix, is a visualization that allows you to visualize the relationships between all pairs of variables in a dataset. It is a useful tool for data exploration because it allows you to quickly see how all of the variables in a dataset are related to one another.

Thus, we used a pair plot to analyze the patterns in the data and the relationships between the features. It conveys much the same information as the correlation map, but as a graphical representation.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot above, we learned that there is no clear linear relationship between most variables. Other than dew point temp, temperature & solar radiation, no relationship is apparent.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & Remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***