
# **Project Name**    - Yes Bank Stock Closing Price Prediction

##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Ishan Srivastava

# **Project Summary -**

The project aimed to analyze stock price data for Yes Bank and build predictive models using Linear Regression, Lasso Regression, and Ridge Regression. The dataset contained information about the stock's opening, closing, high, and low prices for various months and years. The primary objective was to predict the closing stock price based on the other variables.

The initial data exploration revealed that the dataset had no missing values or null values. The date column was converted from an object type to a datetime datatype, and it was set as the index for time series analysis. The exploratory data analysis (EDA) was performed to gain insights into the stock's historical price trends.

The EDA showed that the stock's all-time maximum closing price was around 370 units, while the minimum was approximately 10 units. There were fluctuations in the stock's high and low prices over time, and the relationship between these variables was linear. The opening and closing prices also exhibited a linear relationship. Furthermore, the distribution of features was right-skewed, and some potential outliers were present, but they were not removed from the dataset.

To prepare the data for modeling, log transformations were applied to both the independent and dependent variables to improve their distribution. The dataset was then split into training and testing sets, and the data was standardized using StandardScaler.

Three machine learning models were implemented and evaluated: Linear Regression, Lasso Regression, and Ridge Regression. The models were first fitted without hyperparameter tuning. Linear Regression achieved an R-squared of 0.99374 on the test set, indicating a strong fit to the data. Each model was then refitted with GridSearchCV using 3-fold cross-validation. The error metrics reported after tuning were computed on the log10-transformed prices, whereas the metrics before tuning were computed on the original price scale, so the two sets of MAE, MSE, and RMSE values are not directly comparable.

After hyperparameter tuning, all three models performed almost identically. The tuned Lasso Regression had an MAE of 0.0173, Linear Regression 0.017328, and Ridge Regression 0.01772 (all in log10 units). The RMSE values were similarly low, at roughly 0.029.

Finally, the evaluation metric scores were plotted to visualize the performance of the models. All three models performed well. In the recorded comparison table, Linear Regression had the highest adjusted R-squared, so it was chosen as the final prediction model for this project.

The model was saved in a joblib file for future deployment and prediction on unseen data. With this model, stakeholders can make informed decisions about stock investments and plan their financial strategies based on reliable predictions of Yes Bank's closing stock price.

In conclusion, the project successfully analyzed historical stock price data for Yes Bank and developed robust predictive models using machine learning techniques. The chosen model, Linear Regression, exhibited the best performance and can be utilized for making accurate predictions on new data. The project demonstrates the effectiveness of machine learning in the financial domain and how it can help investors and financial analysts in making informed decisions.

# **GitHub Link -**

https://github.com/ishan711997/ML_linear_model_for_Stock_Closing_Price.git

# **Problem Statement**


Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. It is therefore interesting to see how that affected the company's stock prices and whether time series models or other predictive models can handle such situations. The dataset contains the bank's monthly stock prices since its inception, including the closing, opening, highest, and lowest price of every month. The main objective is to predict the stock's monthly closing price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following points are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following points are answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split    # for splitting the data
from sklearn.preprocessing import StandardScaler

# for setting x axis year range
import matplotlib.dates as mdates

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone Projects/Capstone P2/data/data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

In [None]:
# # convert string object to datetime object
# data['Date'] = data['Date'].apply(lambda x: datetime.strptime(x, "%b-%y"))

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data[data.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(), cbar=False, yticklabels=False)

### What did you know about your dataset?

According to the given dataset:


*   There are no null values.
*   The Date column is of object type and the other columns are floats.
*   There are 5 columns and 185 rows.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include= 'all')

### Variables Description


*  **Date**: The month and year to which the prices refer.
*  **Open**: The opening price of the stock for that month.
*  **High**: The highest price the stock reached during the month.
*  **Low**: The lowest price the stock reached during the month.
*  **Close**: The closing price of the stock for that month.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling and EDA***

### **Data Wrangling Code**

In [None]:
# making a copy of data and assign to df
df = data.copy()

In [None]:
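# converting the date column from object (month-year strings) to datetime datatype
# errors='ignore' is not used, so an unparseable date raises instead of silently leaving the column as strings
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')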
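
To illustrate what the `'%b-%y'` format string matches, here is a minimal sketch using only the standard library. The sample labels `"Jul-05"` and `"Dec-19"` and the helper name `parse_month_label` are made up for illustration and are not taken from the dataset; the sketch also assumes an English locale for the month abbreviations.

```python
from datetime import datetime

def parse_month_label(label):
    """Parse a label such as 'Jul-05': %b = abbreviated month, %y = two-digit year."""
    return datetime.strptime(label, "%b-%y")

# Hypothetical labels; the day defaults to the first of the month.
parsed = [parse_month_label(s) for s in ["Jul-05", "Dec-19"]]
print(parsed)
```

Two-digit years from 00 to 68 are read as 2000 to 2068, which is what monthly data starting in the 2000s needs.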

In [None]:
# all features are tied to the date, so we set Date as the index
df.set_index(keys='Date', inplace = True)

In [None]:
X = df[['High', 'Low', 'Open']]  # Independent variables
Y = df['Close']                  # Dependent variable

#### **functions**

In [None]:
# function for setting the frequency of x-axis ticks to show every year
def x_year_lable():
  years = mdates.YearLocator()
  plt.gca().xaxis.set_major_locator(years)

In [None]:
# function ---- 1
# range detection function
def range_detection(col1, col2, title):
  high_low_range = df[col1] - df[col2]

  plt.figure(figsize=(15,10))

  # Plot the trading range over time
  plt.plot(df.index, high_low_range)
  plt.title(title)
  plt.xlabel("Year")
  plt.ylabel("Trading Range")

  x_year_lable()

In [None]:
# function ---- 2
# function for plotting the variation of two columns over time
def price_comparison(col1, col2, title):
  plt.figure(figsize=(15,10))


  sns.barplot(x=df.index.year, y=col1, data=df, color='blue', alpha=0.7, label=col1)
  sns.barplot(x=df.index.year, y=col2, data=df, color='orange', alpha=0.7, label=col2)

  plt.title(title)
  plt.xlabel("Date")
  plt.ylabel("Price")
  plt.legend()


In [None]:
# function ---- 3
# function for relation of two columns
def relation_plot(col1, col2, title):
  plt.scatter(df[col1], df[col2])
  plt.title(title)
  plt.xlabel(f"{col1} Price")
  plt.ylabel(f"{col2} Price")

In [None]:
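# function ---- 4
# function for checking distribution and outliers
def hist_box():
  for column in df:
    plt.figure(figsize=(10, 6))

    # histogram with KDE to show the distribution
    plt.subplot(1, 2, 1)
    sns.histplot(df[column], kde = True)

    # boxplot to show potential outliers
    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[column])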

### **EDA**

In [None]:
# 1. Maximum and Minimum price of stock for Open and Close price
plt.bar(['Open', 'Close'], [df['Open'].max(), df['Close'].max()], label='Maximum')
plt.bar(['Open', 'Close'], [df['Open'].min(), df['Close'].min()], label='Minimum')

# Set the title and labels
plt.title("Maximum and Minimum  for Open and Close Prices")
plt.xlabel("Price Type")
plt.ylabel("Price")
plt.legend()

In [None]:
# 2. what is the difference between high and low price of the stock over time?
range_detection('High','Low', title= 'Difference b/w High & Low Price Over Time')

In [None]:
# 3. How does High price and Low price vary over time?
price_comparison('High', 'Low', title = 'High Price vs Low Price Over Time')

In [None]:
# 4. What is the relationship between the high price and low price of the stock?
relation_plot('High', 'Low', title="High Price vs Low Price")

In [None]:
# 5. What is the difference between the opening price and closing price of the stock over time?
range_detection('Close', 'Open', title = 'Difference b/w Open & Close Price Over Time')

In [None]:
# 6. How does the opening price vary with the closing price over time
price_comparison('Open', 'Close', title = 'Opening Price vs Closing Price Over Time')

In [None]:
# 7. What is the relationship between the opening price and closing price of the stock?
relation_plot('Open', 'Close', title="Opening Price vs Closing Price")

In [None]:
# create histogram and boxplot for all columns to identify how it is distributed
hist_box()

In [None]:
# plotting scatter plots of the independent features w.r.t. Closing price

for col in X:
  plt.figure(figsize=(8, 6))
  plt.scatter(x=df[col], y=Y)
  plt.xlabel(col)
  plt.ylabel('Closing Price')

In [None]:
# Correlation Heatmap for all features
sns.heatmap(df.corr(), annot=True)

In [None]:
# Pair Plot
sns.pairplot(df)

### What all manipulations have you done and insights you found?

* Made a copy of the data and assigned it to df.
* Converted the “Date” column from object type to datetime and set it as the index.
* Separated the independent and dependent variables.
* Wrote helper functions for the plots.
* The stock's all-time maximum opening and closing price is approx. 370, whereas the all-time minimum is approx. 10.
* In September 2018 the difference between the stock's high and low price was more than 180, which suggests heavy selling by investors.
* From 2016 to 2017 there was a sharp jump in stock prices, and since 2018 prices have been falling continually.
* The relation between High and Low price is linear.
* In September 2018 the closing price was approximately 160 units lower than the opening price.
* In 2016-17 the opening price is lower than the closing price, meaning the stock gained value during that period.
* The relation between Open and Close price is linear.
* All features are right (positively) skewed, and every feature has potential outliers. I decided not to remove them.
* All independent features (High, Low, Open) are linearly related to the closing price.
* Every feature is highly correlated with every other (0.98 to 1). A high correlation between the dependent and independent variables is desirable, but high correlation among the independent variables themselves is called multicollinearity, which makes the coefficients of linear models unstable.
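
One common way to quantify the multicollinearity mentioned in the last point is the variance inflation factor (VIF), which was not computed in this notebook. The following is a minimal sketch on synthetic data, assuming only NumPy; the function name `variance_inflation_factors` and the array `X_corr` are hypothetical and not part of the project code.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic example: two nearly identical columns and one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_corr = np.column_stack([
    base + rng.normal(scale=0.05, size=200),
    base + rng.normal(scale=0.05, size=200),
    rng.normal(size=200),
])
vif = variance_inflation_factors(X_corr)
print(vif)
```

A VIF above roughly 5 to 10 is usually read as a sign of strong collinearity. In this synthetic example the first two columns should land far above that and the third close to 1.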



## ***4. Feature Engineering & Data Pre-processing***

In [None]:
# Handling Null Values & Missing Value Imputation
df.isnull().sum()

There are no null or missing values.

In [None]:
# Creating arrays of our input variable and label to feed the data to the model.
# Create the data of independent variables
X = np.log10(X).values            # applying log transform on our independent variables.

# Create the dependent variable data
Y = np.log10(Y).values               # applying log transform on our dependent variable.
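
As a rough illustration of why the log transform helps with right-skewed prices, the sketch below measures skewness before and after `np.log10` on synthetic data. The generated `prices` array and the helper `sample_skewness` are assumptions made for this example only; they are not the Yes Bank data.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment; about 0 for symmetric data, > 0 for a right skew."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic right-skewed "prices" (log-normal), standing in for real stock prices.
rng = np.random.default_rng(42)
prices = 10 ** rng.normal(loc=1.8, scale=0.4, size=500)

skew_raw = sample_skewness(prices)
skew_log = sample_skewness(np.log10(prices))
print(f"skewness before: {skew_raw:.2f}, after log10: {skew_log:.2f}")

# The transform is invertible, so predictions can be mapped back with 10 ** value.
restored = 10 ** np.log10(prices)
```

Because the model is trained on `log10(Close)`, every prediction has to be passed through `10 ** prediction` before it is read as a price.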

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [None]:
# Scaling the data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
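
The scaling step above fits the scaler on the training data only and reuses those statistics for the test data. A minimal NumPy sketch of the same arithmetic, on synthetic arrays whose `_demo` names are hypothetical, looks like this:

```python
import numpy as np

# Synthetic stand-ins for the log-transformed train and test features.
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=2.0, scale=0.5, size=(80, 3))
x_test_demo = rng.normal(loc=2.0, scale=0.5, size=(20, 3))

# Statistics come from the training set only (population std, ddof=0,
# which is what StandardScaler uses).
train_mean = x_train_demo.mean(axis=0)
train_std = x_train_demo.std(axis=0)

x_train_scaled = (x_train_demo - train_mean) / train_std
x_test_scaled = (x_test_demo - train_mean) / train_std   # reuse the train statistics

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled.mean(axis=0))
```

The scaled test set generally does not have exactly zero mean, which is expected: no information from the test set is used to compute the statistics.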

## ***5. ML Model Implementation***

In [None]:
# importing LinearRegression model and the metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

# cross validation and hyperparameter tuning library
from sklearn.model_selection import GridSearchCV

In [None]:
# empty dataframe for all metric tools performance
per_df = pd.DataFrame()         # for before cross validation & hyperparameter
tuned_per_df = pd.DataFrame()   # for after cross validation & hyperparameter

### ML Model - 1 **Linear Regression**

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

In [None]:
lr = LinearRegression()

# fitting the model on our training data
lr.fit(x_train, y_train)

In [None]:
lr.score(x_train, y_train)

In [None]:
print(x_train.shape, x_test.shape)

In [None]:
# predict on our test data
y_pred = lr.predict(x_test)

In [None]:
lr.intercept_

In [None]:
lr.coef_

**Performance of metrics**

In [None]:
mae = round(mean_absolute_error(10**(y_test), 10**(y_pred)),5)
mse = round(mean_squared_error(10**(y_test), 10**(y_pred)),5)
rmse = round(np.sqrt(mse),5)
r2 = round(r2_score(10**(y_test), 10**(y_pred)),5)

# Calculate the number of observations and the number of independent variables
n = x_test.shape[0]
k = x_test.shape[1]
# Calculate the adjusted R-squared
adjusted_r2 = round(1 - (1 - r2) * ((n - 1) / (n - k - 1)), 5)
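
The adjusted R-squared expression is repeated for every model, so it can be wrapped in a small helper. This is only a sketch; the function name `adjusted_r2_score` is hypothetical. The example call assumes n = 37 test rows (a 20% split of 185 rows) and k = 3 predictors.

```python
def adjusted_r2_score(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("n_samples must be larger than n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# R-squared of 0.99374 with n = 37 and k = 3 (assumed sizes, see above).
example = round(adjusted_r2_score(0.99374, 37, 3), 5)
print(example)  # 0.99317
```

Passing the R-squared of the model being evaluated as an explicit argument avoids accidentally reusing a value left over from an earlier cell.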

In [None]:
print(f"MAE (Linear) : {mae}")
print(f"MSE (Linear): {mse}")
print(f"RMSE (Linear): {rmse}")
print(f"R-squared (Linear): {r2}")
print(f"Adjusted R-squared (Linear): {adjusted_r2}")

In [None]:
# inserting performance of metric tools in per_df
per_df.loc[0,'Model Name'] = 'Linear Regression'
per_df.loc[0,'MAE'] = mae
per_df.loc[0,'MSE'] = mse
per_df.loc[0,'RMSE'] = rmse
per_df.loc[0,'R2'] = r2
per_df.loc[0,'ADJUSTED_R2'] = adjusted_r2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
plt.figure(figsize = (10,4))
# Evaluation metric names
metrics = [ 'MAE', 'MSE', 'RMSE','R-squared', 'Adjusted R-squared']

# Evaluation metric values
values = [mae, mse, rmse, r2, adjusted_r2]

# Create the bar chart
plt.bar(metrics, values)

In [None]:
# Plotting the actual and predicted test data.
plt.figure(figsize=(8,5))
plt.plot(10**y_pred)
plt.plot(np.array(10**y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.ylabel("Price")
plt.title("Actual vs Predicted Closing price using Linear regression")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:

# Define the hyperparameter grid to search (empty for linear regression)
parameters = {}

# Fit the model using GridSearchCV for hyperparameter tuning and cross-validation
linear_reg_regressor = GridSearchCV(lr, parameters, scoring='neg_mean_squared_error', cv=3)
linear_reg_regressor.fit(x_train, y_train)

# Make predictions on the test set
y_pred = linear_reg_regressor.predict(x_test)

# Calculate evaluation metric values after cross validation and hyperparameter tuning
# (these are in log10 units, because y_test and y_pred are log-transformed)
tuned_lr_mae = mean_absolute_error(y_test, y_pred)
tuned_lr_mse = mean_squared_error(y_test, y_pred)
tuned_lr_rmse = np.sqrt(tuned_lr_mse)
tuned_lr_r2 = r2_score(y_test, y_pred)
# use the tuned R-squared (not the earlier r2) for the adjusted R-squared
tuned_lr_adjusted_r2 = round(1 - (1 - tuned_lr_r2) * ((n - 1) / (n - k - 1)), 5)


In [None]:
# Print the evaluation metric values after cross validation and hyperparameter tuning
print("Mean Absolute Error(Linear):", tuned_lr_mae)
print("Mean Squared Error(Linear):", tuned_lr_mse)
print("Root Mean Squared Error(Linear):", tuned_lr_rmse)
print("R-squared(Linear):", tuned_lr_r2)
print("Adjusted R-squared(Linear):", tuned_lr_adjusted_r2)

In [None]:
tuned_per_df.loc[0,'Model Name'] = 'Linear Regression'
tuned_per_df.loc[0,'MAE'] = tuned_lr_mae
tuned_per_df.loc[0,'MSE'] = tuned_lr_mse
tuned_per_df.loc[0,'RMSE'] = tuned_lr_rmse
tuned_per_df.loc[0,'R2'] = tuned_lr_r2
tuned_per_df.loc[0,'ADJUSTED_R2'] = tuned_lr_adjusted_r2

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used because it lets us specify a range of hyperparameter values to test and automatically performs cross-validation to evaluate the model for each combination. This helps find the hyperparameters that perform best on the validation folds, which improves generalization to new, unseen data. For Linear Regression the parameter grid is empty, so GridSearchCV here only cross-validates and refits the same model.

GridSearchCV is a popular and widely used technique for hyperparameter tuning because it simplifies the process of finding the optimal hyperparameters while making sure the model is not overfitting the training data. By searching through a grid of hyperparameter values, GridSearchCV helps to identify the best hyperparameters that lead to better model performance.
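
Conceptually, a grid search with cross-validation can be sketched in plain NumPy as below. This is not the scikit-learn implementation: the helper names (`ridge_fit`, `cv_mse`, `grid_search`), the closed-form ridge model, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_mse(X, y, alpha, n_splits=3):
    """Mean validation MSE over contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(y)), n_splits)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        coef, intercept = ridge_fit(X[train_mask], y[train_mask], alpha)
        pred = X[val_idx] @ coef + intercept
        scores.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(scores))

def grid_search(X, y, alphas, n_splits=3):
    """Score every candidate alpha and return the one with the lowest CV error."""
    scores = {alpha: cv_mse(X, y, alpha, n_splits) for alpha in alphas}
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data standing in for the scaled training set.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 3))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.05, size=90)

best_alpha, cv_scores = grid_search(X_demo, y_demo, [1e-3, 1e-1, 1, 10, 1000])
print(best_alpha, cv_scores)
```

Each candidate is judged only on folds it was not trained on, so a heavily over-regularized setting such as `alpha = 1000` shows up with a clearly higher validation error.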

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

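The scores changed after running GridSearchCV, but the values before tuning were computed on the original price scale, while the values after tuning were computed on the log10-transformed prices, so the error metrics are not directly comparable. Since the parameter grid for Linear Regression is empty, the underlying fit is the same. The recorded scores are as follows: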

**Before Hyperparameter Tuning:**

* Mean Absolute Error (mae): **4.81678**
* Mean Squared Error (mse): **70.42041**
* Root Mean Squared Error (rmse): **8.39169**
* R-squared (r2): **0.99374**
* Adjusted R-squared (adjusted_r2): **0.99317**

**After Hyperparameter Tuning:**

* Tuned Mean Absolute Error (tuned_lr_mae): **0.017328**
* Tuned Mean Squared Error (tuned_lr_mse): **0.000814**
* Tuned Root Mean Squared Error (tuned_lr_rmse): **0.02854**
* Tuned R-squared (tuned_lr_r2): **0.99562**
* Tuned Adjusted R-squared (tuned_lr_adjusted_r2): **0.99317** (in the recorded run this was derived from the R-squared before tuning, hence the identical value)
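
The two sets of error values above are measured on different scales: the first on prices, the second on log10 prices. The following sketch, with made-up numbers, shows how the same predictions give very different MAE values depending on the scale.

```python
import numpy as np

# Made-up prices; every prediction is exactly 5% too high.
actual_price = np.array([10.0, 100.0, 300.0])
predicted_price = actual_price * 1.05

# MAE on the original price scale.
mae_price = np.mean(np.abs(actual_price - predicted_price))

# MAE on the log10 scale for the very same predictions.
mae_log = np.mean(np.abs(np.log10(actual_price) - np.log10(predicted_price)))

print(f"MAE in price units: {mae_price:.4f}")
print(f"MAE in log10 units: {mae_log:.4f}")
```

The log-scale error reflects relative deviation (here `log10(1.05)` for every row), whereas the price-scale error grows with the price level. A drop from one to the other is therefore a change of units, not an improvement in the model.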

In [None]:
plt.figure(figsize = (10,4))
# Evaluation metric names
metrics = [ 'MAE', 'MSE', 'RMSE','R-squared', 'Adjusted R-squared']

# Evaluation metric values
values = [tuned_lr_mae, tuned_lr_mse, tuned_lr_rmse, tuned_lr_r2, tuned_lr_adjusted_r2]

# Create the bar chart
plt.bar(metrics, values)

### ML Model - 2 **Lasso Regression**

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso = Lasso(alpha = 0.1, max_iter = 3000)
lasso.fit(x_train, y_train)

In [None]:
lasso.score(x_train, y_train)

In [None]:
y_pred_lasso = lasso.predict(x_test)

In [None]:
lasso.intercept_

In [None]:
lasso.coef_

**Performance of metrics**

In [None]:
mae = round(mean_absolute_error(10**(y_test), 10**(y_pred_lasso)),5)
mse = round(mean_squared_error(10**(y_test), 10**(y_pred_lasso)),5)
rmse = round(np.sqrt(mse),5)
r2 = round(r2_score(10**(y_test), 10**(y_pred_lasso)),5)

# Calculate the number of observations and the number of independent variables
n = x_test.shape[0]
k = x_test.shape[1]
# Calculate the adjusted R-squared
adjusted_r2 = round(1 - (1 - r2) * ((n - 1) / (n - k - 1)), 5)

In [None]:
print(f"MAE (Lasso) : {mae}")
print(f"MSE (Lasso): {mse}")
print(f"RMSE (Lasso): {rmse}")
print(f"R-squared (Lasso): {r2}")
print(f"Adjusted R-squared (Lasso): {adjusted_r2}")

In [None]:
# inserting performance of metric tools in per_df
per_df.loc[1,'Model Name'] = 'Lasso Regression'
per_df.loc[1,'MAE'] = mae
per_df.loc[1,'MSE'] = mse
per_df.loc[1,'RMSE'] = rmse
per_df.loc[1,'R2'] = r2
per_df.loc[1,'ADJUSTED_R2'] = adjusted_r2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
plt.figure(figsize = (10,4))
# Evaluation metric names
metrics = [ 'MAE', 'MSE', 'RMSE','R-squared', 'Adjusted R-squared']

# Evaluation metric values
values = [mae, mse, rmse, r2, adjusted_r2]

# Create the bar chart
plt.bar(metrics, values)

In [None]:
# Plotting the actual and predicted test data.
plt.figure(figsize=(8,5))
plt.plot(10**y_pred_lasso)
plt.plot(np.array(10**y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.ylabel("Price")
plt.title("Actual vs Predicted Closing price using Lasso regression")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:

# Define the hyperparameter grid to search (candidate alpha values for Lasso)
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,0.005,0.006,0.007,0.01,0.015,0.02,1e-1,1,5,10,20,30,40,45,50]}  # list of parameters.

# Fit the model using GridSearchCV for hyperparameter tuning and cross-validation
lasso_reg_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=3)
lasso_reg_regressor.fit(x_train, y_train)

# Make predictions on the test set
y_pred_lasso = lasso_reg_regressor.predict(x_test)

# Calculate evaluation metric values after cross validation and hyperparameter tuning
# (these are in log10 units, because y_test and y_pred_lasso are log-transformed)
tuned_lasso_mae = mean_absolute_error(y_test, y_pred_lasso)
tuned_lasso_mse = mean_squared_error(y_test, y_pred_lasso)
tuned_lasso_rmse = np.sqrt(tuned_lasso_mse)
tuned_lasso_r2 = r2_score(y_test, y_pred_lasso)
# use the tuned R-squared (not the earlier r2) for the adjusted R-squared
tuned_lasso_adjusted_r2 = round(1 - (1 - tuned_lasso_r2) * ((n - 1) / (n - k - 1)), 5)


In [None]:
# Print the evaluation metric values after cross validation and hyperparameter tuning
print("Mean Absolute Error(Lasso):", tuned_lasso_mae)
print("Mean Squared Error(Lasso):", tuned_lasso_mse)
print("Root Mean Squared Error(Lasso):", tuned_lasso_rmse)
print("R-squared(Lasso):", tuned_lasso_r2)
print("Adjusted R-squared(Lasso):", tuned_lasso_adjusted_r2)

In [None]:
tuned_per_df.loc[1,'Model Name'] = 'Lasso Regression'
tuned_per_df.loc[1,'MAE'] = tuned_lasso_mae
tuned_per_df.loc[1,'MSE'] = tuned_lasso_mse
tuned_per_df.loc[1,'RMSE'] = tuned_lasso_rmse
tuned_per_df.loc[1,'R2'] = tuned_lasso_r2
tuned_per_df.loc[1,'ADJUSTED_R2'] = tuned_lasso_adjusted_r2

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used for hyperparameter optimization. It systematically searches through a predefined grid of hyperparameters and selects the best combination based on cross-validation performance. This helps to improve the model's predictive accuracy and achieve better evaluation metric scores.

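##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes. With the initial alpha = 0.1 the Lasso model fit noticeably worse than the other models (R-squared 0.7729). After GridSearchCV selected alpha by cross-validation, R-squared rose to 0.9956. The values before tuning were computed on the original price scale and the values after tuning on the log10-transformed prices, so the MAE, MSE, and RMSE values are not directly comparable. The recorded scores are as follows: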

**Before Hyperparameter Tuning:**

* Mean Absolute Error (mae): **32.746**
* Mean Squared Error (mse): **2555.834**
* Root Mean Squared Error (rmse): **50.555**
* R-squared (r2): **0.7729**
* Adjusted R-squared (adjusted_r2): **0.75229**

**After Hyperparameter Tuning:**

* Tuned Mean Absolute Error (tuned_lasso_mae): **0.0173**
* Tuned Mean Squared Error (tuned_lasso_mse): **0.0008**
* Tuned Root Mean Squared Error (tuned_lasso_rmse): **0.0285**
* Tuned R-squared (tuned_lasso_r2): **0.9956**
* Tuned Adjusted R-squared (tuned_lasso_adjusted_r2): **0.7522** (in the recorded run this was derived from the R-squared before tuning, hence the low value)

In [None]:
plt.figure(figsize = (10,4))
# Evaluation metric names
metrics = [ 'MAE', 'MSE', 'RMSE','R-squared', 'Adjusted R-squared']

# Evaluation metric values
values = [tuned_lasso_mae, tuned_lasso_mse, tuned_lasso_rmse, tuned_lasso_r2, tuned_lasso_adjusted_r2]

# Create the bar chart
plt.bar(metrics, values)

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

The ML model's performance, as evaluated by these metrics, directly influences how far its predictions can be relied on for business decisions. A model with lower MAE, MSE, and RMSE and higher R-squared and Adjusted R-squared will lead to more accurate predictions, better decision-making, and improved operational efficiency.
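
For reference, the four base metrics can be written out in a few lines of NumPy. This is a sketch of the standard definitions with a tiny made-up example; the function name `regression_metrics` is hypothetical, and the notebook itself uses the scikit-learn metric functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "MAE": float(np.mean(np.abs(residuals))),   # average absolute miss
        "MSE": float(mse),                           # penalizes large misses more
        "RMSE": float(np.sqrt(mse)),                 # same units as the target
        "R2": float(1 - ss_res / ss_tot),            # share of variance explained
    }

# Made-up values: two exact predictions and two that are off by 1.
scores = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(scores)
```

MAE and RMSE are expressed in the units of the target, so they are the easiest to translate into an expected pricing error. R-squared is unitless and is better suited for comparing fits.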

### ML Model - 3 **Ridge Regression**

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.1)

In [None]:
ridge.fit(x_train, y_train)

In [None]:
ridge.score(x_train, y_train)

In [None]:
y_pred_ridge = ridge.predict(x_test)

**Performance of metrics**

In [None]:
mae = round(mean_absolute_error(10**(y_test), 10**(y_pred_ridge)),5)
mse = round(mean_squared_error(10**(y_test), 10**(y_pred_ridge)),5)
rmse = round(np.sqrt(mse),5)
r2 = round(r2_score(10**(y_test), 10**(y_pred_ridge)),5)

# Calculate the number of observations and the number of independent variables
n = x_test.shape[0]
k = x_test.shape[1]
# Calculate the adjusted R-squared
adjusted_r2 = round(1 - (1 - r2) * ((n - 1) / (n - k - 1)), 5)

In [None]:
print(f"MAE (Ridge) : {mae}")
print(f"MSE (Ridge): {mse}")
print(f"RMSE (Ridge): {rmse}")
print(f"R-squared (Ridge): {r2}")
print(f"Adjusted R-squared (Ridge): {adjusted_r2}")

In [None]:
# inserting performance of metric tools in per_df
per_df.loc[2,'Model Name'] = 'Ridge Regression'
per_df.loc[2,'MAE'] = mae
per_df.loc[2,'MSE'] = mse
per_df.loc[2,'RMSE'] = rmse
per_df.loc[2,'R2'] = r2
per_df.loc[2,'ADJUSTED_R2'] = adjusted_r2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
plt.figure(figsize = (10,4))
# Evaluation metric names
metrics = [ 'MAE', 'MSE', 'RMSE','R-squared', 'Adjusted R-squared']

# Evaluation metric values
values = [mae, mse, rmse, r2, adjusted_r2]

# Create the bar chart
plt.bar(metrics, values)

In [None]:
# Plotting the actual and predicted test data.
plt.figure(figsize=(8,5))
plt.plot(10**y_pred_ridge)
plt.plot(np.array(10**y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('Test Data')
plt.ylabel("Price")
plt.title("Actual vs Predicted Closing price using Ridge regression")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:

# Define the hyperparameter grid to search (candidate alpha values for Ridge)
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,0.005,0.006,0.007,0.01,0.015,0.02,1e-1,1,5,10,20,30,40,45,50]}  # list of parameters.

# Fit the model using GridSearchCV for hyperparameter tuning and cross-validation
ridge_reg_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_reg_regressor.fit(x_train, y_train)

# Make predictions on the test set
y_pred_ridge = ridge_reg_regressor.predict(x_test)

# Calculate evaluation metric values after cross validation and hyperparameter tuning
# (these are in log10 units, because y_test and y_pred_ridge are log-transformed)
tuned_ridge_mae = mean_absolute_error(y_test, y_pred_ridge)
tuned_ridge_mse = mean_squared_error(y_test, y_pred_ridge)
tuned_ridge_rmse = np.sqrt(tuned_ridge_mse)
tuned_ridge_r2 = r2_score(y_test, y_pred_ridge)
# use the tuned R-squared (not the earlier r2) for the adjusted R-squared
tuned_ridge_adjusted_r2 = round(1 - (1 - tuned_ridge_r2) * ((n - 1) / (n - k - 1)), 5)


In [None]:
# Print the evaluation metric values after cross validation and hyperparameter tuning
print("Mean Absolute Error(Ridge):", tuned_ridge_mae)
print("Mean Squared Error(Ridge):", tuned_ridge_mse)
print("Root Mean Squared Error(Ridge):", tuned_ridge_rmse)
print("R-squared(Ridge):", tuned_ridge_r2)
print("Adjusted R-squared(Ridge):", tuned_ridge_adjusted_r2)

In [None]:
tuned_per_df.loc[2,'Model Name'] = 'Ridge Regression'
tuned_per_df.loc[2,'MAE'] = tuned_ridge_mae
tuned_per_df.loc[2,'MSE'] = tuned_ridge_mse
tuned_per_df.loc[2,'RMSE'] = tuned_ridge_rmse
tuned_per_df.loc[2,'R2'] = tuned_ridge_r2
tuned_per_df.loc[2,'ADJUSTED_R2'] = tuned_ridge_adjusted_r2   # Ridge value (previously the Lasso variable was stored here)

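##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The scores changed after running GridSearchCV, but the values before tuning were computed on the original price scale, while the values after tuning were computed on the log10-transformed prices, so the error metrics are not directly comparable. The recorded scores are as follows: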

**Before Hyperparameter Tuning:**

* Mean Absolute Error (mae): **4.96916**
* Mean Squared Error (mse): **70.20436**
* Root Mean Squared Error (rmse): **8.3788**
* R-squared (r2): **0.99376**
* Adjusted R-squared (adjusted_r2): **0.99319**

**After Hyperparameter Tuning:**

* Tuned Mean Absolute Error (tuned_ridge_mae): **0.01772075002**
* Tuned Mean Squared Error (tuned_ridge_mse): **0.00086081638134**
* Tuned Root Mean Squared Error (tuned_ridge_rmse): **0.029339672481907**
* Tuned R-squared (tuned_ridge_r2): **0.995378969299**
* Tuned Adjusted R-squared (tuned_ridge_adjusted_r2): **0.99319** (in the recorded run this was derived from the R-squared before tuning, hence the identical value)

In [None]:
plt.figure(figsize = (10,4))
# Evaluation metric names
metrics = [ 'MAE', 'MSE', 'RMSE','R-squared', 'Adjusted R-squared']

# Evaluation metric values
values = [tuned_ridge_mae, tuned_ridge_mse, tuned_ridge_rmse, tuned_ridge_r2, tuned_ridge_adjusted_r2]

# Create the bar chart
plt.bar(metrics, values)

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

The ML model's performance, as evaluated by these metrics, directly influences how far its predictions can be relied on for business decisions. A model with lower MAE, MSE, and RMSE and higher R-squared and Adjusted R-squared will lead to more accurate predictions, better decision-making, and improved operational efficiency.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

In [None]:
per_df

In [None]:
tuned_per_df

In [None]:
tuned_per_df.plot(kind = 'bar')

R-squared (r2) was chosen as an evaluation metric because it shows how well the model fits the data and explains the relationship between the input variables and the output. A higher R-squared means the model is better at predicting the target variable, which is essential for making reliable business decisions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The final prediction model chosen was Linear Regression. All evaluation metrics (MAE, MSE, RMSE, and R-squared) were similar across the tuned models, and the deciding factor was the adjusted R-squared recorded in the comparison table, which was highest for Linear Regression. Adjusted R-squared measures how well a model explains the variance in the target variable while penalizing the number of predictors. Because all three models use the same three predictors, the differences between them are small, and Linear Regression is also the simplest of the three, with no regularization strength to tune.

## ***6.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import joblib

# save the model to a file
joblib.dump(linear_reg_regressor, 'regression_model.joblib')

# the first argument is the fitted model object and the second is the file name
# under which it is saved

# the fitted GridSearchCV object 'linear_reg_regressor' is saved as 'regression_model.joblib' in the current directory.
# note: the StandardScaler is not part of this file, so new data needs the same log10 transform and scaling before prediction.
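
One possible way to keep the scaler and the model together for deployment is to wrap them in a scikit-learn `Pipeline` before saving. The sketch below is a suggestion rather than what this notebook did. It uses synthetic data, a temporary directory, and a made-up file name `close_price_pipeline.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for log10-transformed High, Low, Open (not the real data).
rng = np.random.default_rng(0)
X_log = rng.uniform(1.0, 2.5, size=(50, 3))
y_log = X_log @ np.array([0.5, 0.4, 0.1]) + rng.normal(scale=0.01, size=50)

# Scaler and regressor are fitted and stored as one object.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_log, y_log)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "close_price_pipeline.joblib")  # hypothetical name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# Raw prices in, predicted closing price out (values are made up).
new_prices = np.array([[120.0, 95.0, 100.0]])   # High, Low, Open
predicted_close = 10 ** restored.predict(np.log10(new_prices))
print(predicted_close)
```

With this layout the only preprocessing left to the caller is the `log10` transform and its inverse. The scaling statistics are restored together with the model.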

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the model from the joblib file
loaded_model = joblib.load('regression_model.joblib')

In [None]:
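# Make predictions on the held-out test data (sanity check)
# the model predicts log10(Close), so convert back to the price scale
predictions = 10 ** loaded_model.predict(x_test)

predictions_df = pd.DataFrame({'Predicted_Close_Price': predictions})
# predictions_df.to_csv('predictions.csv', index=False)

In [None]:
predictions_df.head(10)

In [None]:
# actual closing prices on the same scale, for comparison
10 ** y_test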

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we performed a comprehensive analysis of stock prices for Yes Bank. We started by understanding and exploring the dataset, which consisted of five columns representing the date, open, high, low, and close prices of the stock. We checked for missing values, duplicate entries, and identified the data types of each column.

Next, we performed exploratory data analysis (EDA) to gain insights into the trends and patterns of the stock prices over time. We visualized the maximum and minimum stock prices for open and close prices, the difference between high and low prices over time, the variation of high and low prices over time, and the relationship between high and low prices. Additionally, we explored the difference between opening and closing prices and their relationship over time.

After understanding the data, we conducted feature engineering and data pre-processing, which involved converting the date column to a datetime datatype, setting the date as the index, and applying log transforms on the independent and dependent variables. We split the data into training and testing sets and standardized the features.

For modeling, we implemented three regression models: Linear Regression, Lasso Regression, and Ridge Regression. We evaluated each model using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R2), and Adjusted R-squared. After hyperparameter tuning with GridSearchCV, all three models reached an R-squared between 0.995 and 0.996. The largest gain was for Lasso Regression, whose R-squared before tuning was 0.7729.

The best-performing model was determined based on the evaluation metrics, and Linear Regression was selected as the final prediction model because it showed the highest Adjusted R-squared value in the recorded comparison.

Finally, we saved the best-performing Linear Regression model as a joblib file for future deployment and prediction of unseen data.

In conclusion, our analysis provided valuable insights into the stock prices of Yes Bank, and the selected Linear Regression model can be used for making predictions and assisting in investment decisions. Further improvements could be made by incorporating more features or trying different regression techniques to enhance the model's accuracy and performance.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***