<a href="https://colab.research.google.com/github/pravinshukla108/Yes-Bank-Closing-Price-Prediction/blob/main/Yes_Bank_Closing_Price_Prediction_pravin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **YES BANK STOCK CLOSING PRICE PREDICTION**




##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual


# **Project Summary -**

**Yes Bank is a well-known bank in the Indian financial sector. Since 2018 it has been in the news because of the fraud case involving Rana Kapoor. That makes it interesting to see how the case affected the company's stock price, and whether time series models or other predictive models can handle such a situation. The dataset contains the bank's monthly stock prices since its inception, including the opening, highest, lowest, and closing price for every month. The main objective is to predict the stock's closing price for the month.**

# **GitHub Link -**

[ GitHub Link ](https://github.com/pravinshukla108/Yes-Bank-Closing-Price-Prediction.git)

# **Problem Statement**


# ***Predicting the stock’s closing price***
***YES BANK's stock has experienced significant volatility, and the bank has faced challenges including a high-profile fraud case involving Rana Kapoor.***

***The main objective of this project is therefore to develop a reliable prediction model that can forecast the closing price of YES BANK stock based on its historical data and relevant market indicators.***

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd  # Data manipulation and analysis
import numpy as np  # Numerical operations and calculations
import matplotlib.pyplot as plt  # Creating visualizations and plots

import seaborn as sns  # High-level interface for statistical graphics
from datetime import datetime  # Handling date and time operations
import missingno as msno  # Visualizing missing data in datasets
from sklearn.linear_model import LinearRegression  # Linear regression model
from sklearn.linear_model import Ridge, RidgeCV  # Ridge regression models
from sklearn.linear_model import Lasso, LassoCV  # Lasso regression models
from sklearn.model_selection import train_test_split  # Splitting data into train and test sets
from sklearn.model_selection import GridSearchCV  # Hyperparameter tuning through grid search
from sklearn.preprocessing import StandardScaler  # Standardization of features
from sklearn.preprocessing import MinMaxScaler  # Scaling features to a specific range
from scipy.stats import *  # Contains statistical functions for hypothesis testing and probability distributions
import warnings  # Managing and suppressing warnings
warnings.filterwarnings('ignore')  # Suppress warnings

# Import required metrics for model evaluation
from sklearn.metrics import (
    r2_score,  # R-squared (coefficient of determination) regression score function
    mean_squared_error,  # Mean squared error regression loss
    mean_absolute_percentage_error,  # Mean absolute percentage error regression loss
    mean_absolute_error  # Mean absolute error regression loss
)

# Import linear regression models
from sklearn import linear_model
# Import Google Drive for data access
from google.colab import drive

### Dataset Loading

In [None]:
drive.mount('/content/drive')

In [None]:
# Load Dataset
file_path = '/content/drive/MyDrive/Colab Notebooks/yes bank ( ml regression )/data_YesBank_StockPrices.csv'
yesbank = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
yesbank.head()

In [None]:
yesbank.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
yesbank.shape

### Dataset Information

In [None]:
# Dataset Info
yesbank.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(yesbank[yesbank.duplicated()])

In [None]:
# Dataset Duplicate Value Count
yesbank.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
yesbank.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(yesbank.isnull())

In [None]:
# Visualizing the missing values
msno.bar(yesbank,
         fontsize=10,
         figsize=(7,4),
         color='purple')
plt.title('Missing values')
plt.show()

### What did you know about your dataset?

**The dataset has 185 rows and 5 columns and contains monthly stock prices from July 2005 to November 2020. The five columns are Date, Open, High, Low, and Close. Date, Open, High, and Low are the independent variables and Close is the dependent variable. There are no missing values and no duplicate rows in the dataset.**
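
The facts quoted above (shape, missing values, duplicates) can be collected in one place with a small helper. This is only a sketch: `summarize_frame` is a hypothetical name, and the three-row `sample` frame is made-up illustrative data in the same column layout, not the real Yes Bank file.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return row count, column count, missing cells and duplicate rows."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up sample with the same columns as the dataset (values are illustrative only)
sample = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Sep-05"],
    "Open": [13.0, 12.6, 13.5],
    "High": [14.0, 14.9, 14.3],
    "Low": [11.3, 12.6, 12.3],
    "Close": [12.5, 13.4, 13.3],
})

summary = summarize_frame(sample)
print(summary)
```

Calling `summarize_frame(yesbank)` on the loaded dataframe would be one way to re-check the numbers stated above after every wrangling step.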

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
yesbank.columns

In [None]:
# Dataset Describe
yesbank.describe(include='all')

### Variables Description

**Date** - Date of the record. It has the month and year for a particular price.

**Open** - Opening price of the stock for that month.

**High** - Highest price of the month.

**Low** - Lowest price of the month.

**Close** - Closing price of the stock for that month.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in yesbank.columns.tolist():
  print("No. of unique values in ",i,"is",yesbank[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
yesbank.head()

In [None]:
# Converting Date from month-year strings (format '%b-%y') to datetime format (YYYY-MM-DD)
yesbank['Date'] = pd.to_datetime(yesbank['Date'], format='%b-%y')

In [None]:
# re-checking the dataset information regarding its datatype
yesbank.info()

***Most machine learning models cannot work on a raw "Date" column, so it would have to be converted into a numerical column. A numerical date is of no use for predicting our target in this dataframe, so here we make the "Date" column the dataframe index instead.***

In [None]:
# converting 'Date' feature to dataframe index.
yesbank.set_index('Date',inplace=True)

In [None]:
# checking the dataframe with index 'Date'
yesbank.head()

### What all manipulations have you done and insights you found?

***The dataset has 185 rows and 5 columns/features with no null or duplicate values. The 'Date' feature was not in a proper date format, so it was converted to 'datetime' format and set as the index of the dataframe for the steps that follow.***
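
A minimal sketch of the date handling described above, run on a made-up three-row frame (the `sample` values are illustrative, not taken from the dataset). It assumes the month-year strings follow the `'%b-%y'` pattern used in the wrangling code.

```python
import pandas as pd

# Made-up rows in the month-year string format assumed by '%b-%y'
sample = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Nov-20"],
    "Close": [10.0, 11.0, 12.0],
})

# Parse the strings; each month is mapped to its first day
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# Use the parsed dates as the index so only price columns remain as features
sample = sample.set_index("Date")
print(sample)
```

With a `DatetimeIndex` in place, slices such as `sample.loc["2020"]` select a whole year without any extra date arithmetic.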

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### ***UNIVARIATE ANALYSIS***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Initializing variable for independent features.
indp_numeric_features = yesbank.describe().columns[0:3]
indp_numeric_features

# Define a list of colors
colors = ["m", "red", "green","blue","indigo"]

for i, col in enumerate(indp_numeric_features):
    plt.figure(figsize=(20, 5))

    # Left panel: histogram with KDE curve (histplot replaces the deprecated distplot)
    plt.subplot(1, 2, 1)
    sns.histplot(yesbank[col], kde=True, stat="density", color=colors[i])
    plt.title('Distribution Curve')

    # axvline() adds a vertical line across the axes.
    # The solid line marks the mean and the dashed red line marks the median.
    plt.ylabel("Density", size=14)
    plt.axvline(yesbank[col].mean(), color=colors[i], linewidth=2)
    plt.axvline(yesbank[col].median(), color='red', linestyle="dashed", linewidth=2)

    # Right panel: box plot in the same figure to check whether outliers are present
    plt.subplot(1, 2, 2)
    plt.title('Box Plot')
    sns.boxplot(y=yesbank[col], color=colors[i])

    # Show each feature's figure before starting the next one
    plt.show()



##### 1. Why did you pick the specific chart?

***I picked this chart because it combines a histogram, a KDE curve, and a box plot, which together give a comprehensive view of the data distribution and its outliers. The histogram gives frequency-based information and the KDE gives a smooth density estimate, so the shape, peaks, and deviations from a normal distribution are easy to see. In the box plot, the quartiles divide the data into four equal parts, from which we can read the minimum, maximum, and median of the data and spot outliers.***

##### 2. What is/are the insight(s) found from the chart?

***The chart clearly shows that the distributions are right (positively) skewed, with outliers visible in the box plots. They should be transformed toward a normal distribution, which will also reduce the influence of these outliers.***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***We cannot yet say whether this has a positive or negative impact, but it helps us decide which feature transformations are required for model implementation. Here we will use a log transformation to bring the features closer to a normal distribution and to reduce the effect of outliers.***
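
The effect of a log transformation on right skew can be checked numerically as well as visually. The sketch below uses synthetic, log-normally distributed values (generated here for illustration, not the Yes Bank prices) and compares the sample skewness before and after `np.log`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices": exponentiated normal draws (illustrative only)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=4.0, scale=0.8, size=500)), name="price")

# Sample skewness: clearly positive for right-skewed data, near 0 for symmetric data
skew_raw = prices.skew()
skew_log = np.log(prices).skew()

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

The same two lines (`yesbank[col].skew()` and `np.log(yesbank[col]).skew()`) could be applied to each price column to back the visual impression with a number.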

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Plotting the histogram with KDE to see the distribution of the dependent variable 'Close', which we need to predict later
plt.figure(figsize=(9,7))
sns.histplot(yesbank['Close'], kde=True, stat="density", color="indigo")
plt.title("Close stock price distribution")

##### 1. Why did you pick the specific chart?

***I picked this chart because it combines a histogram and a KDE plot, which gives a comprehensive view of the data distribution. It makes the distribution's shape, peaks, and deviations from a normal distribution easier to understand, using both the frequency-based information from the histogram and the smooth density estimate from the KDE.***

##### 2. What is/are the insight(s) found from the chart?

***The chart clearly shows that the 'Close' price distribution is right (positively) skewed and should be transformed toward a normal distribution.***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***We cannot yet say whether this has a positive or negative impact, but it helps us decide which feature transformations are required for model implementation. Here we will use a log transformation to bring the distribution closer to normal.***

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plotting the log-transformed dependent variable 'Close' to see its distribution, which we need to predict later.
plt.figure(figsize=(9,5))
sns.histplot(np.log(yesbank['Close']), kde=True, stat="density", color="r")
plt.title("Close Price Distribution after log transformation")

##### 1. Why did you pick the specific chart?

***I used a log transformation because the distribution of 'Close' is right skewed, and a log transformation helps bring the distribution of the dependent feature closer to a normal pattern. It also compresses the largest values, so the outliers become far less extreme.***

##### 2. What is/are the insight(s) found from the chart?

***The log transformation is sufficient to bring the distribution close to normal. The chart shows that the mean is inflated and that the most frequent values do not lie near the mean. It also makes clear that the bubble price of Yes Bank stock lasted only a very short time.***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***It helps to observe the peaks and valleys in the closing stock price. The inflated price at the mean is temporary, as it is a bubble point; after it, the price declined sharply because of the fraud case which happened in 2018.***

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Log transformation to bring the independent features closer to a normal distribution

for col in indp_numeric_features:
    plt.figure(figsize=(30, 6))
    plt.subplot(1, 2, 1)
    plt.title("Distribution Curve")

    # np.log() applies the natural logarithm to every value of the feature
    feature_to_log = np.log(yesbank[col])  # store the log-transformed values
    sns.histplot(feature_to_log, kde=True, stat="density", color="green")

    # axvline() adds a vertical line across the axes.
    # The solid magenta line marks the mean and the dashed red line marks the median.
    plt.ylabel("Density", size=18)
    plt.axvline(feature_to_log.mean(), color='magenta', linewidth=2)
    plt.axvline(feature_to_log.median(), color='red', linestyle="dashed", linewidth=2)

    # Box plot in the same figure to check whether any outliers remain
    plt.subplot(1, 2, 2)
    plt.title("Box plot")
    sns.boxplot(y=feature_to_log, color="blue")

    # Show each feature's figure before starting the next one
    plt.show()

##### 1. Why did you pick the specific chart?

***I used a log transformation because the distributions of the independent features are right skewed, and a log transformation helps bring them closer to a normal pattern. It also compresses the largest values, so the outliers become far less extreme.***

##### 2. What is/are the insight(s) found from the chart?

***The log transformation is sufficient to bring the distributions close to normal. The plot makes clear that the bubble price of Yes Bank stock lasted only a very short time. We can also see from the distribution curves that the mean is now closer to the median.***

***From the box plots after log transformation, we can see that no outliers remain, and the approximate quartiles for the independent features are as follows:***

* For feature Open- Lower Quartile(Q1)- 3.6 ,Median(Q2)- 4.3, Upper Quartile(Q3)- 5.0
* For feature High- Lower Quartile(Q1)- 3.7 ,Median(Q2)- 4.4, Upper Quartile(Q3)- 5.2
* For feature Low- Lower Quartile(Q1)- 3.3 ,Median(Q2)- 4, Upper Quartile(Q3)- 4.9
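
Quartiles like the ones listed above can be computed directly instead of being read off the box plot. The sketch below uses a small made-up `prices` frame (illustrative values, not the dataset); applying the same two calls to the notebook's Open, High, and Low columns would give the exact figures.

```python
import numpy as np
import pandas as pd

# Made-up price columns (illustrative only)
prices = pd.DataFrame({
    "Open": [10.0, 20.0, 40.0, 80.0, 160.0],
    "High": [12.0, 25.0, 48.0, 95.0, 190.0],
    "Low": [8.0, 16.0, 33.0, 70.0, 140.0],
})

# Natural log of every value, then the 25th, 50th and 75th percentiles per column
log_quartiles = np.log(prices).quantile([0.25, 0.5, 0.75])
log_quartiles.index = ["Q1", "Median", "Q3"]

print(log_quartiles.round(2))
```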

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***It helps to observe the peaks and valleys in the stock prices. The inflated price at the mean is temporary, as it is a bubble point; after it, the price declined sharply because of the fraud case which happened in 2018.***

***After the log transformation, no outliers remain in the box plots and the distributions are closer to a normal pattern, which suits the model requirements and should help achieve better accuracy, so we can say that the transformation has a positive impact.***
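
A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles, so the number of flagged points before and after the log transformation can be counted directly. The sketch below is an illustration on made-up values; `count_iqr_outliers` is a hypothetical helper name. Note that the log transformation does not delete any rows: it compresses large values, and whether a point is still flagged afterwards depends on the data.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(values: pd.Series) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the box-plot whisker rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())


# Made-up right-skewed prices with one extreme value (illustrative only)
prices = pd.Series([10, 12, 15, 18, 20, 25, 30, 40, 60, 120], dtype=float)

outliers_raw = count_iqr_outliers(prices)
outliers_log = count_iqr_outliers(np.log(prices))

print("flagged on raw scale:", outliers_raw)
print("flagged on log scale:", outliers_log)
```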

### ***BIVARIATE ANALYSIS***

#### Chart - 5

In [None]:
#Chart - 5 visualization code
# Visualizing yesbank stock closing price over the time.

# Set the theme
sns.set_theme(style="darkgrid")

# Adjust figure size
sns.set(rc={'figure.figsize':(15, 8)})

# Create the lineplot
sns.lineplot(x="Date", y="Close", data=yesbank, color='red')

# Customize the plot
plt.title('Yes Bank Closing Price Over Time')
plt.xlabel('Year')
plt.grid(True)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

***A line plot was chosen to visualize the closing prices over time. Line plots are commonly used to show trends and changes in data over a continuous variable, such as time.***

##### 2. What is/are the insight(s) found from the chart?

***We can observe the overall trend of the closing prices over time.***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***The closing price serves as a benchmark for how a stock performs and helps investors understand how its value has changed over time. The plot shows the closing price falling continuously after 2018, which is alarming for Yes Bank, because the closing price is the one price that drives investors' decisions to invest in a stock or not. In this scenario the insight has a negative impact on the business; by observing the trend, the bank must try to restrict stock manipulation to become stable in terms of revenue.***

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Define custom colors
colors = ['#FF4500', '#0077b6']

# Plot Open vs Close with custom colors
ax = yesbank.loc[:, ['Open', 'Close']].tail(35).plot(kind='bar', figsize=(20, 8), color=colors)

# Customize grid lines
plt.grid(which='major', linestyle='-', linewidth='0.9', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.9', color='orange')

# Customize axis labels and tick labels
ax.set_xlabel('Time', fontsize=15, color='lime')
ax.set_ylabel('Value', fontsize=15, color='crimson')
ax.tick_params(axis='both', which='major', labelsize=14, colors='darkviolet')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

***The grouped bar chart shows the Open and Close prices side by side for the last 35 months in the data. The orange-red bar is the opening price of a month and the blue bar is the closing price of that month.***

##### 2. What is/are the insight(s) found from the chart?


***The graph above indicates that after 2018, the stock price of YES Bank drops, making it unwise for investors to place their money in the company.***

* ***The closing price peaked above Rs 350 four times: in August 2017, January 2018, April 2018, and July 2018.***

* ***The opening and closing prices were at lowest during these 4 months - August 2020, September 2020, October 2020, November 2020 .***
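
Peak and trough months like the ones listed above can be looked up in code rather than read off the bars. The sketch below runs on a made-up `close` series (the values are invented for illustration and are not the real prices); the same calls on `yesbank['Close']` would return the actual months.

```python
import pandas as pd

# Made-up monthly closing prices indexed by month start (illustrative only)
close = pd.Series(
    [300.0, 360.0, 340.0, 40.0, 12.0, 14.0],
    index=pd.to_datetime([
        "2017-07-01", "2017-08-01", "2018-09-01",
        "2020-03-01", "2020-10-01", "2020-11-01",
    ]),
    name="Close",
)

# Months in which the close was above a chosen threshold
above_threshold = close[close > 350]

# The two months with the lowest close, lowest first
lowest = close.nsmallest(2)

print(above_threshold.index.strftime("%b %Y").tolist())
print(lowest.index.strftime("%b %Y").tolist())
```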








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***The insight that the stock price of YES Bank has dropped after 2018 suggests that the company may be facing challenges and may not be performing well, which could lead to a negative impact on the business. This insight may discourage potential investors from investing in the company and could lead to negative growth.***

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# comparing High and Low Prices

# Create a range of colors for color-coding

colors = np.arange(len(yesbank))
color_map = plt.get_cmap('gist_rainbow_r', len(yesbank))

# Create a range of sizes for size variation

sizes = 50 + 10 * np.arange(len(yesbank))
plt.figure(figsize=(12, 8))

# Create the scatter plot with data labels, color-coding, and size variation

plt.scatter(yesbank['High'], yesbank['Low'],c=colors, cmap=color_map, s=sizes, alpha=0.35)
plt.xlabel('High Price')
plt.ylabel('Low Price')
plt.title('High vs Low Prices Over Time (Color-Coded by Time Progression)')
plt.colorbar(label='Time Progression')
plt.savefig('scatter_plot.png')
plt.show()

##### 1. Why did you pick the specific chart?

***A scatter plot is a good way to visualize the relationship between two variables. In this case, the two variables are the monthly high and low prices of the stock.***

##### 2. What is/are the insight(s) found from the chart?

* ***There is a positive correlation between the high and low prices. This means that when the high price increases, the low price also tends to increase.***

* ***There is a lot of variation in the data. This means that there is not a perfect relationship between the high and low prices.***
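
The strength of the relationship described above can be summarized with the Pearson correlation coefficient. The sketch below generates synthetic `low` and `high` series (random values for illustration, not the dataset), where each high is the low plus a varying markup, and computes their correlation.

```python
import numpy as np
import pandas as pd

# Synthetic monthly lows, and highs that sit 5% to 40% above them (illustrative only)
rng = np.random.default_rng(42)
low = pd.Series(rng.uniform(10, 300, size=100), name="Low")
high = pd.Series(low * (1 + rng.uniform(0.05, 0.40, size=100)), name="High")

# Pearson correlation: +1 is a perfect increasing linear relationship
r = high.corr(low)

# The monthly range (High - Low) is the part the correlation does not explain
monthly_range = high - low

print(f"correlation between High and Low: {r:.3f}")
print(f"average monthly range: {monthly_range.mean():.1f}")
```

On the notebook's data, `yesbank['High'].corr(yesbank['Low'])` would give the corresponding figure.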

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

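***This chart does not translate directly into a business action. Its value is for modelling: because the High and Low prices move together so closely, they carry overlapping information, which has to be kept in mind when both are used as features to predict the closing price. No insight from this chart points to negative growth.***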

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Visualizing yesbank stock opening price over the time.

# Sets the Seaborn style and figure size
sns.set_theme(style="ticks")
sns.set(rc={'figure.figsize': (11, 8)})

# Creates the lineplot for opening prices
plt.figure(figsize=(11, 8))
sns.lineplot(x="Date", y="Open", data=yesbank, color='green')

# Customize the plot
plt.title('Yes Bank Opening Price Over Time', fontsize=16)
plt.xlabel('Year', fontsize=12 , color='red')
plt.ylabel('Opening Price', fontsize=12, color='blue')

# Add grid lines
plt.grid(True, linestyle='solid', alpha=1)

plt.show()


##### 1. Why did you pick the specific chart?

***Line plot is used to show the progression of a variable over time or any continuous variable that has an inherent order. They are particularly useful for visualizing trends, seasonality, and changes in values over time.***

##### 2. What is/are the insight(s) found from the chart?

***From the above plot we can say that there is an increasing trend from 2009 that reaches its highest point during 2017, but the price started falling after 2018 because of Rana Kapoor's fraud case.***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***The opening price follows the same pattern as the closing price and helps investors understand how the stock's value has changed over time. The plot shows the opening price falling continuously after 2018, which is alarming for Yes Bank, because a falling price discourages investors from investing in the stock. In this scenario the insight has a negative impact on the business; by observing the trend, the bank must try to restrict stock manipulation to become stable in terms of revenue.***

#### Chart - 9

In [None]:
# Chart - 9 visualization code


# Create a figure and set its size
plt.figure(figsize=(24, 12))

# Plot the open, high, and low prices with custom line styles and colors
plt.plot(yesbank.index, yesbank['Open'], label='Open', color='blue', linestyle='-', linewidth=2)
plt.plot(yesbank.index, yesbank['High'], label='High', color='green', linestyle='--', linewidth=2)
plt.plot(yesbank.index, yesbank['Low'], label='Low', color='red', linestyle='-.', linewidth=2)


# Customize the plot labels and tick labels

plt.legend(loc='upper right', fontsize=14, title='Price Categories', title_fontsize=16)

plt.xlabel('Date', fontsize=16, color='lime')
plt.ylabel('Price', fontsize=16, color='crimson')
plt.title('Yes Bank Price - Open, High, and Low', fontsize=20, color='red')
plt.tick_params(axis='both', which='major', labelsize=16, colors='darkviolet')


# Add grid lines
plt.grid(True, linestyle='--', alpha=0.9)

# Add data labels to indicate price categories
plt.text(yesbank.index[-1], yesbank['Open'].iloc[-1], 'Open', fontsize=12, color='blue', verticalalignment='center')
plt.text(yesbank.index[-1], yesbank['High'].iloc[-1], 'High', fontsize=12, color='green', verticalalignment='bottom')
plt.text(yesbank.index[-1], yesbank['Low'].iloc[-1], 'Low', fontsize=12, color='red', verticalalignment='top')

plt.show()


##### 1. Why did you pick the specific chart?

***The specific chart used, is a line plot that displays the open, high, and low prices of Yes Bank stock over time. This chart was chosen because it is suitable for visualizing time series data, making it easier to identify trends, fluctuations, and historical peaks and valleys in stock prices. Additionally, using custom line styles and colors for each price category enhances the clarity of the visualization.***


##### 2. What is/are the insight(s) found from the chart?


The insights that can be derived from the chart include:

* Price Trends: Analyzing whether the stock prices have been consistently rising (positive trend), falling (negative trend), or fluctuating (volatility) over the observed time period.
* Historical Extremes: Identifying historical peaks (high prices) and valleys (low prices) in the stock's performance.
*  Comparing Price Categories: Comparing the behavior of open, high, and low prices over time, which can reveal patterns or differences in how these categories move.


***The insights found from the chart are:***
- Yes Bank's stock price has been steadily increasing from 2006 to 2018.
- The stock price peaked in 2018 and has been steadily decreasing since then.
- The stock price has been consistently lower than the opening price since 2018.


***All three prices show almost the same trend, which means that these features are likely to be strongly correlated with each other.***

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*The insights gained from the chart can help in making informed investment and business decisions. Here's how:*

***Positive Impact:***

* Investment Strategies: Investors can use insights into price trends and historical extremes to make strategic decisions on buying or selling Yes Bank stock.
* Risk Assessment: Understanding price volatility is essential for risk management. Investors can assess and mitigate risk more effectively.
* Timing: Identifying patterns or trends can assist in timing investment decisions for positive returns.


***Negative Impact:***

* Negative Growth Potential: A consistent and prolonged downward trend in stock prices, as seen here after 2018, could lead to negative growth. Investors may experience losses, and businesses associated with Yes Bank may face challenges.
* Volatility Risk: High volatility, especially if it leads to erratic price fluctuations, can introduce uncertainty and risk for both investors and businesses.

***Overall, the insights are useful because they show that Yes Bank's stock price has been decreasing steadily since 2018. This could be a sign that the company is not doing well and may need to make changes in order to improve its stock price. The decline itself is the insight that points to negative growth.***


#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Violin Plot: Distribution of Open, High, Low, and Close Prices

# Data preparation: Select the price columns
price_data = yesbank[['Open', 'High', 'Low', 'Close']]

# Set plot style and figure size
sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 6))

# Create the violin plot
sns.violinplot(data=price_data, palette="Set2", inner="stick", scale="width")

# Customize the plot
plt.title("Distribution of Open, High, Low, and Close Prices for Yes Bank Stock", fontsize=16)
plt.xlabel("Price Category", fontsize=14)
plt.ylabel("Price Value", fontsize=14)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

***I picked this specific chart because it shows the distribution of open, high, low, and close prices for Yes Bank stock. This chart is helpful in understanding the stock's performance and trends.***

##### 2. What is/are the insight(s) found from the chart?

***The insights found from the chart are:***
- The Open and Close distributions are very similar in shape and range.
- The High and Low distributions are also very similar to each other.
- All four distributions are concentrated at lower prices with a long tail toward high prices, in line with the right skew seen in the univariate charts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***The similarity of the four distributions is useful for modelling, because it shows that Open, High, and Low lie on the same scale as Close and behave like it. The chart by itself cannot show whether the stock is stable, since a distribution plot ignores the time order of the prices, so no conclusion about positive or negative growth can be drawn from it alone.***


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Calculate the correlation matrix
correlation_matrix = yesbank.corr()

# Set figure size and create the heatmap
plt.figure(figsize=(13, 9))
heatmap = sns.heatmap(correlation_matrix, annot=True, cmap="viridis", linewidths=.5, square=True)

# Customize the plot
plt.title("Correlation Heatmap for Yes Bank Stock Prices", fontsize=16)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

***I picked a correlation heatmap because it shows the pairwise correlations between the Open, High, Low, and Close prices at a glance, which is useful for understanding how the price columns relate to each other.***

##### 2. What is/are the insight(s) found from the chart?

***The insights found from the chart are that there is a strong positive correlation between the stock prices. This means that when one stock price increases, the other stock prices also tend to increase.***

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(yesbank, diag_kind="kde")

##### 1. Why did you pick the specific chart?

***The pair plot is suitable when you want to visualize the relationships between multiple variables in a dataset. It creates a grid of scatter plots, making it easier to identify patterns, trends, and potential outliers.***

##### 2. What is/are the insight(s) found from the chart?

***The pair plot allows a comprehensive examination of the pairwise relationships by displaying a scatter plot for every possible combination of variables, which helps to understand how the variables interact with each other.***

In [None]:
# Keep a copy of the data as it is before the indicator columns are added
yesbank1 = yesbank.copy()
print(yesbank1.head())

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**MACD**

In [None]:
# Calculate the short Exponential Moving Average (EMA) with a span of 12 periods
shortEMA = yesbank['Close'].ewm(span=12, adjust=False).mean()

# Calculate the long Exponential Moving Average (EMA) with a span of 26 periods
longEMA = yesbank['Close'].ewm(span=26, adjust=False).mean()

# Calculate the MACD (Moving Average Convergence Divergence) by subtracting the long EMA from the short EMA
MACD = shortEMA - longEMA

# Calculate the signal line by taking a 9-period EMA of the MACD
signal = MACD.ewm(span=9, adjust=False).mean()

# Add the MACD values to the DataFrame as a new column named 'macd'
yesbank['macd'] = MACD

# Add the MACD signal line values to the DataFrame as a new column named 'macd_signal'
yesbank['macd_signal'] = signal
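
As a small illustration of what `ewm(span=..., adjust=False).mean()` computes, the sketch below reproduces the exponential moving average by hand and derives MACD and its signal line from it. The price values and the helper name `manual_ema` are assumptions made for illustration; they are not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd


def manual_ema(values, span):
    """EMA recursion that pandas uses when adjust=False:
    ema[0] = x[0];  ema[t] = alpha * x[t] + (1 - alpha) * ema[t - 1],
    with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    ema = [float(values[0])]
    for x in values[1:]:
        ema.append(alpha * float(x) + (1 - alpha) * ema[-1])
    return np.array(ema)


# Made-up closing prices, for illustration only
toy_close = pd.Series([10.0, 11.0, 12.5, 12.0, 13.0, 15.0, 14.0, 16.0])

# MACD and signal line from the hand-written recursion
short_ema = manual_ema(toy_close.to_numpy(), span=12)
long_ema = manual_ema(toy_close.to_numpy(), span=26)
macd_manual = short_ema - long_ema
signal_manual = manual_ema(macd_manual, span=9)

# The same quantities computed with pandas
macd_pandas = (toy_close.ewm(span=12, adjust=False).mean()
               - toy_close.ewm(span=26, adjust=False).mean())
signal_pandas = macd_pandas.ewm(span=9, adjust=False).mean()

print(np.allclose(macd_manual, macd_pandas.to_numpy()))
print(np.allclose(signal_manual, signal_pandas.to_numpy()))
```

Because both EMAs start from the first closing price, the MACD value of the first period is always zero, so the earliest MACD values carry little information.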


**RSI**

In [None]:
# Calculate the price changes (differences) between consecutive periods (the data is monthly)
delta = yesbank['Close'].diff()

# Calculate gains (positive price changes) and set losses to 0
gain = delta.mask(delta < 0, 0)

# Calculate losses (negative price changes) and set gains to 0
loss = -delta.mask(delta > 0, 0)

# Calculate the average gain over a rolling window of 14 periods
avg_gain = gain.rolling(window=14).mean()

# Calculate the average loss over a rolling window of 14 periods
avg_loss = loss.rolling(window=14).mean()

# Calculate the relative strength (RS) which is the ratio of average gain to average loss
rs = avg_gain / avg_loss

# Calculate the Relative Strength Index (RSI) using the formula: RSI = 100 - (100 / (1 + RS))
rsi = 100 - (100 / (1 + rs))

# Add the RSI values to the DataFrame as a new column 'rsi'
yesbank['rsi'] = rsi


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

***Null hypothesis H0: There is no significant difference in the predictive power of the RSI and a random walk model on the closing price of the stock.***

***Alternative hypothesis H1: The RSI provides a significant improvement in predicting the closing price of the stock compared to a random walk model.***

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_rel

# Calculate random walk predictions by shifting the closing prices by one time period
rw_predictions = yesbank['Close'].shift(1)

# Calculate RSI predictions using a simple rule-based strategy:
# If RSI > 50, the prediction is the current closing price; otherwise, it's the previous closing price.
# Note: the RSI > 50 branch uses the value being predicted, so this rule is illustrative only.
rsi_predictions = yesbank['Close'].where(yesbank['rsi'] > 50, yesbank['Close'].shift(1))

# Calculate prediction errors for both RSI and random walk models
rsi_errors = yesbank['Close'] - rsi_predictions
rw_errors = yesbank['Close'] - rw_predictions

# Perform a paired t-test on the prediction errors to determine if there's a significant difference
# between the RSI model and the random walk model.
# Pairs containing NaN (the first row has no previous close) are omitted; otherwise the p-value would be NaN.
t_stat, p_val = ttest_rel(rsi_errors, rw_errors, nan_policy='omit')

# Print the results of the t-test
print("t-statistic:", t_stat, "p-value:", p_val)
if p_val < 0.05:
    print("Reject null hypothesis.")
else:
    print("Fail to reject null hypothesis.")


***If the test fails to reject the null hypothesis, we conclude that there is no significant difference in the predictive power of the RSI and a random walk model on the closing price of the stock.***

##### Which statistical test have you done to obtain P-Value?

***Paired t-test was performed to obtain the P-value.***

##### Why did you choose the specific statistical test?

***The RSI and random walk models are applied to the same set of data, and the prediction errors are calculated for each model on the same set of observations. Therefore, the data is paired, and a paired test is appropriate.***

***I want to test whether there is any significant difference between the mean prediction errors of the RSI and random walk models. The paired t-test is designed to test the difference between the means of the paired data.***
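
To make the mechanics of the paired test concrete, here is a minimal sketch that computes the paired t-statistic by hand. The error values and the helper name `paired_t_statistic` are made up for illustration. The p-value reported by `ttest_rel` is then obtained from Student's t distribution with `n - 1` degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Pairs in which either value is NaN are dropped first, because shifted or
    rolling series start with missing values.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up prediction errors of two models on the same five periods
model_a_errors = [float("nan"), 1.5, 2.0, 4.0, 3.5]
model_b_errors = [0.2, 0.5, 0.0, 1.0, 1.5]

t_stat, dof = paired_t_statistic(model_a_errors, model_b_errors)
print("t =", round(t_stat, 3), "with", dof, "degrees of freedom")
```

The test works on the per-period differences only, which is why both error series must come from the same observations.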

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

***Null hypothesis H0:*** *There is no significant difference in the predictive power of the MACD indicator and a simple moving average strategy on the closing price of the stock.*

***Alternative hypothesis H1:*** *The MACD indicator provides a significant improvement in predicting the closing price of the stock compared to a simple moving average strategy.*

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# generate predictions using MACD and SMA strategies
yesbank['macd_pred'] = yesbank['Close'] + yesbank['macd_signal']
yesbank['sma'] = yesbank['Close'].rolling(window=14).mean()
yesbank['sma_pred'] = yesbank['sma'].shift(1)

In [None]:
# calculate prediction errors
macd_errors = yesbank['macd_pred'] - yesbank['Close']
sma_errors = yesbank['sma_pred'] - yesbank['Close']

In [None]:
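# perform paired t-test
from scipy.stats import ttest_rel

# Pairs containing NaN (the first 14 periods have no SMA prediction) are omitted;
# otherwise the p-value would be NaN.
t_stat, p_val = ttest_rel(macd_errors, sma_errors, nan_policy='omit')

# print results
print("t-statistic:", t_stat, "p-value:", p_val)
if p_val < 0.05:
    print("Reject null hypothesis.")
else:
    print("Fail to reject null hypothesis.")

*If the test fails to reject the null hypothesis, there is no significant difference in the predictive power of the MACD indicator and a simple moving average strategy on the closing price of the bank stock.*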

##### Which statistical test have you done to obtain P-Value?

***A paired t-test was performed to obtain the P-value.***

***We can then use the resulting p-value to decide whether to reject or fail to reject the null hypothesis.***

##### Why did you choose the specific statistical test?

***Both strategies are evaluated on the same months, so their prediction errors are paired. The paired t-test is a parametric test that assumes the differences between the paired measurements follow a normal distribution.***

***It is a useful and widely used statistical test for comparing paired data and testing hypotheses about the difference between two means.***

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

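Null Hypothesis - The High, Low, and Open prices have no linear relationship with the stock’s closing price of the month (all regression coefficients are zero).

Alternate Hypothesis - At least one of the High, Low, and Open prices has a linear relationship with the stock’s closing price of the month (its regression coefficient is not zero).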

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
indep_var=yesbank[['High','Low','Open']]
dep_var=yesbank['Close']

In [None]:
indep_var = sm.add_constant(indep_var) ## let's add an intercept (beta_0) to our model
model = sm.OLS(dep_var, indep_var).fit() ## sm.OLS(output, input)
predictions = model.predict(indep_var)

In [None]:
model.summary()

##### Which statistical test have you done to obtain P-Value?

***I fitted an ordinary least squares (OLS) regression with `statsmodels.api`. Its summary reports a p-value for each coefficient (t-test) and for the model as a whole (F-test).***

##### Why did you choose the specific statistical test?

I chose this test because I was already familiar with it from my earlier personal projects.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Missing Values/Null Values Count
yesbank.isnull().sum()

***Our dataset has no missing or null values***

#### What all missing value imputation techniques have you used and why did you use those techniques?

**Since our data does not contain any missing values, there is no need to impute them.**

### 2. Handling Outliers

***IQR (Interquartile Range):***

In [None]:
# Handling Outliers & Outlier treatments

# Function to detect outliers using the IQR method
def detect_outlier_iqr(data_column):
  # Calculate the first quartile (Q1) and third quartile (Q3)
  q1 = data_column.quantile(0.25)
  q3 = data_column.quantile(0.75)

  # Calculate the interquartile range (IQR)
  iqr = q3 - q1

  # Calculate the lower fence and upper fence
  lower_fence = q1 - (1.5 * iqr)
  upper_fence = q3 + (1.5 * iqr)

  # Identify the outliers that fall outside the lower and upper fences
  outlier = data_column[(data_column < lower_fence) | (data_column > upper_fence)]

  return outlier

# Apply the function to the 'Close' column of the 'yesbank' DataFrame to detect outliers
outlier = detect_outlier_iqr(yesbank['Close'])

# Display the outliers (if any)
print(outlier)


In [None]:
# Define custom colors
custom_colors = ["m", "g", "r","cyan"]

# Create a boxplot of the four price columns (one custom color each) to identify outliers
plt.figure(figsize=(13, 8))
sns.boxplot(data=yesbank[["Open", "High", "Low", "Close"]], palette=custom_colors)
plt.title("Box Plot To Identify Outliers")
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

***The technique I used for outliers is the interquartile range (IQR). Because the data does not follow a normal distribution, the outlying data points are identified with the IQR method instead of the Z-score.***
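
The code above only detects outliers. One common way to treat them, shown here purely as an assumption about how the treatment could be done and not as the method actually used in this notebook, is to cap values at the IQR fences. The helper name `cap_outliers_iqr` and the sample values are made up for illustration.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values to the fences [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return series.clip(lower=lower_fence, upper=upper_fence)


# Made-up values with one obvious outlier
toy = pd.Series([10, 12, 11, 13, 12, 14, 100], dtype=float)
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Capping keeps every row, which matters for a small dataset, whereas dropping outliers would shorten the series.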

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

***There is no need for categorical encoding in this dataset, as all our columns are in numeric or datetime format.***

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

### ***There is no need to do any textual data preprocessing as it is not required according to our dataset***

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Calculate the correlation matrix for the DataFrame
corr = yesbank.corr(numeric_only=True)

# Create a colormap for the heatmap with specific color settings
cmap = sns.diverging_palette(450, 150, l=55, center="dark", as_cmap=True)

# Define a function to magnify the heatmap on hover
def magnify():
    return [
        dict(selector="th", props=[("font-size", "10pt")]),
        dict(selector="td", props=[('padding', "0em 0em")]),
        dict(selector="th:hover", props=[("font-size", "14pt")]),
        dict(selector="tr:hover td:hover", props=[('max-width', '250px'), ('font-size', '14pt')])
    ]

# Create the styled correlation heatmap with the specified colormap and hover effect
# (Styler.format(precision=2) replaces set_precision, which newer pandas versions no longer provide)
corr.style.background_gradient(cmap, axis=1) \
    .set_properties(**{'max-width': '85px', 'font-size': '12pt'}) \
    .set_caption("Hover to magnify") \
    .format(precision=2) \
    .set_table_styles(magnify())



#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# Feature selection for Independent and dependent variable
independent_variables = [col for col in yesbank.columns if col != 'Close']
dependent_variable = 'Close'


##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

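# Log-transform the 'Open', 'High', 'Low', and 'Close' columns.
# The transform is applied to a copy so that the log10 transform in the scaling step
# still receives the raw prices (otherwise the prices would be log-transformed twice).
columns_to_log_transform = ['Open', 'High', 'Low', 'Close']

yesbank_log = yesbank.copy()
for column in columns_to_log_transform:
    yesbank_log[column] = np.log(yesbank_log[column])

# Check the correlation between independent and dependent variables after data transformation
cor_log = yesbank_log[columns_to_log_transform].corr()

# Create a heatmap to visualize the correlation
sns.heatmap(cor_log, annot=True, cmap='plasma')
yesbank_log.head()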

***Data transformation is required, and I have used a log transformation (np.log) on the given data. The reasons behind the transformation are as follows:***

* *The given data is right-skewed, so a transformation is required to make the distribution closer to normal and symmetric. This helps the model to be fitted correctly and precisely, which helps us reach the desired accuracy.*

* *After treating outliers, the correlation between the independent and dependent variables was distorted, so a transformation is required to restore it. As the heatmap above shows, the correlations are restored, which should make our model more accurate for prediction. However, the features are multicollinear. Dealing with that would mean dropping the feature that is least correlated with the target variable, but doing so would lose valuable information because our dataset is small. We therefore keep the multicollinear features and check how our model behaves with them.*
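
The effect of a log transform on right-skewed data can be checked with the sample skewness. The sketch below uses made-up values (not the Yes Bank prices) to show the idea. A skewness near zero indicates a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values: a few large values stretch the right tail
toy_prices = pd.Series([10, 12, 13, 15, 18, 22, 30, 45, 90, 250], dtype=float)

skew_before = toy_prices.skew()
skew_after = np.log(toy_prices).skew()

print("Skewness before log transform:", round(skew_before, 2))
print("Skewness after log transform: ", round(skew_after, 2))
```

The same two calls on the real price columns would show how much of the skew the transform removes in this dataset.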

### 6. Data Scaling

In [None]:
# Scaling your data

# Define the independent variables (features) and the dependent variable (target)
independent_var = ['Open', 'High', 'Low']
dependent_var = 'Close'
x = np.log10(yesbank[independent_var])  # Log-transform the selected features
y = np.log10(yesbank['Close'])  # Log-transform the target variable

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)
# - x_train and y_train will be used for training the model.
# - x_test and y_test will be used for evaluating the model's performance.

# Scaling your data using Min-Max scaling
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # Fit and transform the training data
x_test = scaler.transform(x_test)  # Transform the testing data using the same scaler
# Min-Max scaling scales the data to a range between 0 and 1, making it suitable for many machine learning algorithms.


##### Which method have you used to scale your data and why?

***I have used MinMaxScaler.***
*I chose this method because it preserves the relative relationships between the data points and is suitable when the features have different scales. Scaling to a specific range helps algorithms that are sensitive to the magnitude of the features to perform better and converge faster.*
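
As a rough sketch of what Min-Max scaling does, the example below applies the formula `(x - min) / (max - min)` by hand with NumPy. The arrays are made-up values for illustration. The minimum and range are learned from the training data only and then reused for the test data, which mirrors the `fit_transform` / `transform` split in the code above.

```python
import numpy as np

# Made-up feature matrices: rows are observations, columns are features
train = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])
test = np.array([[2.5, 400.0]])

# Learn the per-column minimum and range from the training data only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range  # reuse the training statistics

print(train_scaled)
print(test_scaled)  # unseen data can fall outside the [0, 1] range
```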

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

***No, there is no need for dimensionality reduction in our dataset, as it is a small dataset with 185 rows and 5 columns.***

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Splitting data into training and testing sets with a 75-25 ratio.
# These X_train/X_test arrays (log10 features, not Min-Max scaled) are the ones used by the models below.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

# Print the shapes of the training and testing sets
print("Training set shape:", X_train.shape)  # Displays the shape of the training data
print("Testing set shape:", X_test.shape)  # Displays the shape of the testing data


##### What data splitting ratio have you used and why?

***I used train_test_split with a 75:25 train-to-test ratio, because this method is fast and easy to perform and also helps to compare our model's results with the given results.***

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

***No, the data is not imbalanced. Our dependent variable is continuous and spread across all data points, and imbalance is mainly a problem in classification models.***

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1 ***Linear Regression***

***For building this model I am using the Linear Regression machine learning algorithm. It is a statistical method that is used for predictive analysis.***
***The linear regression algorithm models a linear relationship between a dependent variable and one or more independent variables.***
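
To illustrate the linear relationship that the algorithm estimates, here is a minimal sketch of ordinary least squares with NumPy on made-up, noise-free data. The data and variable names are assumptions for illustration; the notebook itself uses scikit-learn's `LinearRegression`.

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 - 3*x2 (no noise)
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 1.0],
              [3.0, 2.0],
              [4.0, 0.5]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Add a column of ones so that the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefficients = beta[0], beta[1:]

print("Intercept:", intercept)
print("Coefficients:", coefficients)
```

Because the toy data contains no noise, the fit recovers the generating intercept and coefficients up to floating-point error.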

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm
reg = LinearRegression().fit(X_train, y_train)
# Calculate and print the R-squared (coefficient of determination) for the training data
train_r_squared = reg.score(X_train, y_train)
print("R-squared (R2) Score on Training Data:", train_r_squared)
# Print the model coefficients
model_coefficients = reg.coef_
print("Model Coefficients:", model_coefficients)
# Predict on the model
y_pred = reg.predict(X_test)
# Plot the predicted and actual values
plt.figure(figsize=(8, 5))
plt.plot(10 ** (y_pred), color='darkviolet', label="Predicted")
plt.plot(np.array(10**(y_test)),color='crimson', label="Actual")
plt.legend()
plt.title("Predicted vs Actual Values")
plt.xlabel("Data Point")
plt.ylabel("Stock Price")
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate Mean Absolute Error (MAE) on the original price scale, like the metrics below
mae = mean_absolute_error(10 ** (y_test), 10 ** (y_pred))
print('Mean Absolute Error (MAE):', mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(10 ** (y_test), 10 ** (y_pred))
print("Mean Squared Error (MSE):", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate R-squared (R2) Score
r2 = r2_score(10 ** (y_test), 10 ** (y_pred))
print("R-squared (R2) Score:", r2)

# Calculate Adjusted R-squared Score
n = X_test.shape[0]  # Number of data points
p = X_test.shape[1]  # Number of features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("Adjusted R-squared Score:", adjusted_r2)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

***I have not used any hyperparameter optimization technique. Our model already gives high accuracy, so there is no need for one.***

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Since there is no hyperparameter optimization included in the code, there's no specific improvement to mention in terms of hyperparameter tuning.

### ML Model - 2  ***Lasso Regression***

In [None]:
# Initialize the Lasso regression model with specified alpha and max_iter values
lasso = Lasso(alpha=0.1, max_iter=3000)

# Fit the Lasso model to the training data
lasso.fit(X_train, y_train)

# Calculate the R-squared score on the training data
train_score = lasso.score(X_train, y_train)
print("R-squared (R2) Score on Training Data:", train_score)

# Predict the target values on the test data
y_pred1 = lasso.predict(X_test)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate Mean Absolute Error (MAE) on the original price scale, like the metrics below
mae = mean_absolute_error(10 ** (y_test), 10 ** (y_pred1))
print('Mean Absolute Error (MAE):', mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(10 ** (y_test), 10 ** (y_pred1))
print("Mean Squared Error (MSE):", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate R-squared (R2) Score
r2 = r2_score(10 **(y_test), 10 **(y_pred1))
print("R-squared (R2) Score:", r2)

# Calculate Adjusted R-squared Score
n = X_test.shape[0]  # Number of data points
p = X_test.shape[1]  # Number of features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("Adjusted R-squared Score:", adjusted_r2)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Fit the Algorithm
# Create a Lasso regression model (alpha is set by the grid search below)
lasso1 = Lasso(max_iter=3000)

# Define a set of alpha values for hyperparameter tuning
parameters = {'alpha': [1e-15, 1e-13, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 20, 30, 40, 45, 50, 55, 60, 100]}

# Initialize the GridSearchCV with cross-validation and negative mean squared error scoring
lasso_regressor = GridSearchCV(lasso1, parameters, scoring='neg_mean_squared_error', cv=3)

# Fit the Lasso regressor with hyperparameter tuning to the training data
lasso_regressor.fit(X_train, y_train)

# Print the best alpha parameter and its corresponding cross-validated mean squared error
print("The best alpha value is found to be:", lasso_regressor.best_params_)
print("\nUsing alpha =", lasso_regressor.best_params_['alpha'], "the mean squared error is:", -lasso_regressor.best_score_)

# Predict on the model
# Make predictions on the test data
y_pred_lasso = lasso_regressor.predict(X_test)

# Plot the predicted and actual values
plt.figure(figsize=(18, 8))
plt.plot(10 ** (y_pred_lasso), color='orange', label='Predicted')
plt.plot(10 ** (np.array(y_test)), label='Actual')
plt.legend(["Predicted", "Actual"],fontsize=12,loc='upper right')

plt.xlabel('Data Points')
plt.ylabel('Values')
plt.title('Lasso Regression Predicted vs. Actual')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

***I used GridSearchCV (grid search with 3-fold cross-validation) to tune the lasso hyperparameter alpha, in order to avoid overfitting and to get better accuracy on the test data.***

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

***Not much improvement is seen, because the accuracy is still lower than that of our first model.***

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

***For all the error metrics (MAE, MSE, MAPE, RMSE), lower values are better for our business; which model to choose can only be decided at the end. The R2 score here signifies that 99.7% of the variance in the dependent variable is explained by the independent variables.***
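
For reference, the metrics discussed above can be written out by hand. The sketch below uses made-up actual and predicted values, and the helper name `regression_metrics` is hypothetical. It follows the same adjusted R2 formula as the evaluation cells in this notebook.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)

    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1 - ss_res / ss_tot

    n = len(y_true)
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adjusted_r2}


# Made-up actual and predicted prices
actual = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
predicted = [12.0, 18.0, 33.0, 39.0, 50.0, 58.0]

metrics = regression_metrics(actual, predicted, n_features=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE and RMSE are in the same units as the price, which makes them the easiest to interpret from a business point of view, while R2 and adjusted R2 are unitless.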

### ML Model - 3 ***Ridge Regression***

In [None]:
# ML Model - 3 Implementation

# Create a Ridge regression model
ridge = Ridge()

# Fit the Algorithm
# Define a set of alpha values for hyperparameter tuning
parameters = {'alpha': [1e-15, 1e-13, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 20, 30, 40, 45, 50, 55, 60, 100]}

# Initialize the GridSearchCV with cross-validation
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)

# Fit the Ridge regressor with hyperparameter tuning to the training data
ridge_regressor.fit(X_train, y_train)

# Print the best alpha parameter and its corresponding cross-validated mean squared error
print("The best alpha value is found to be:", ridge_regressor.best_params_)
print("\nUsing alpha =", ridge_regressor.best_params_['alpha'], "the mean squared error is:", -ridge_regressor.best_score_)

# Fit the Ridge algorithm using the best alpha value
ridge = Ridge(alpha=ridge_regressor.best_params_['alpha'])
ridge.fit(X_train, y_train)

# Print the R-squared score on the training data
print("R-squared score on training data:", ridge.score(X_train, y_train))

# Predict on the model
# Make predictions on the test data
y_pred_r = ridge.predict(X_test)





#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate Mean Squared Error (MSE)
MSE = mean_squared_error(10**(y_test), 10**(y_pred_r))
print("MSE:", MSE)

# Calculate Root Mean Squared Error (RMSE)
RMSE = np.sqrt(MSE)
print("RMSE:", RMSE)

# Calculate R-squared (R2) score
r2 = r2_score(10**(y_test), 10**(y_pred_r))
print("R2:", r2)

# Calculate and print Adjusted R-squared score
adjusted_r2 = 1 - (1 - r2) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1))
print("Adjusted R2:", adjusted_r2)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Create a Ridge regression model
ridge = Ridge()

# Define a set of alpha hyperparameters for GridSearchCV
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20, 30, 40, 45, 50, 55, 60, 100]}

# Fit the Algorithm

# Perform hyperparameter optimization using GridSearchCV
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_train, y_train)

# Print the best alpha parameter found during optimization
print("The best fit alpha value is found out to be:", ridge_regressor.best_params_)
print("Using", ridge_regressor.best_params_, "the negative mean squared error is:", ridge_regressor.best_score_)

# Predict on the model

# Make predictions using the optimized Ridge model
y_pred_ridge = ridge_regressor.predict(X_test)

# Visualize predicted vs. actual values in colorful lines
plt.figure(figsize=(12, 6))
plt.plot(10**(y_pred_ridge), color='aqua', label='Predicted', linewidth=2)
plt.plot(10**(np.array(y_test)), color='coral', label='Actual', linewidth=2)
plt.legend(["Predicted", "Actual"],fontsize=10,loc='upper right')
plt.title('Predicted vs. Actual Values')
plt.xlabel('Data Points')
plt.ylabel('Values')
plt.grid(True)
plt.show()

# Detect heteroscedasticity by plotting colorful residuals
residuals = 10**(y_test) - 10**(y_pred_ridge)
plt.figure(figsize=(12, 6))
plt.scatter(10**(y_pred_ridge), residuals, color='purple', marker='o', s=10)
plt.title('Residuals vs. Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.grid(True)
plt.show()



##### Which hyperparameter optimization technique have you used and why?

***The hyperparameter optimization technique that has been used in the above code is Grid Search. Grid Search is a brute-force approach to hyperparameter optimization, where the model is trained on a grid of different hyperparameter values. The model with the best performance on the validation data is selected as the final model.***
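
The idea behind grid search with cross-validation can be sketched in a few lines of NumPy. This is not scikit-learn's implementation: the helpers `ridge_fit` and `grid_search_cv` are hypothetical, the data is synthetic, and the folds are simple contiguous blocks.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def grid_search_cv(X, y, alphas, n_folds=3):
    """Return (best_alpha, mean validation MSE for every alpha)."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    mean_mse = {}
    for alpha in alphas:
        fold_mse = []
        for val_idx in folds:
            # Train on everything except the current validation fold
            train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
            coef, intercept = ridge_fit(X[train_idx], y[train_idx], alpha)
            pred = X[val_idx] @ coef + intercept
            fold_mse.append(np.mean((y[val_idx] - pred) ** 2))
        mean_mse[alpha] = float(np.mean(fold_mse))
    best_alpha = min(mean_mse, key=mean_mse.get)
    return best_alpha, mean_mse


# Synthetic data: a linear signal plus a little noise (fixed seed for repeatability)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

best_alpha, scores = grid_search_cv(X, y, alphas=[0.01, 1.0, 100.0, 10000.0])
print("Best alpha:", best_alpha)
print("Mean validation MSE per alpha:", scores)
```

Every candidate value is scored only on data that was held out from fitting, so the selected alpha is the one that generalises best across the folds rather than the one that fits the training data best.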

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

***Yes, there has been an improvement in the performance of the model after using Grid Search to optimize the hyperparameters.***

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

***In my view, the MSE and RMSE values are relatively low and the R-squared value is high, which indicates that the model is a good fit. The adjusted R-squared value is also high, which indicates that the model is a good fit even after accounting for the number of independent variables. This suggests that the model is likely to have a positive business impact.***

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

***We chose our first and third models, the simple linear regression model and the ridge regression model, for the final prediction, because they showed better prediction accuracy than lasso, the lowest mean squared error, and good scores on the evaluation metrics.***

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

***The model explainability tool that I have used is the shap library. The shap library provides a number of tools for visualizing and understanding the explanations of machine learning models.***
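
No SHAP code appears in this notebook, so the following is only a sketch of the underlying idea for a linear model, written with NumPy rather than the shap library. It assumes the features are treated as independent, in which case the contribution of feature `i` to one prediction is `coef_i * (x_i - mean(x_i))` and the base value is the average prediction. All numbers, coefficients, and helper names are made up for illustration.

```python
import numpy as np

# Made-up training features and a made-up fitted linear model
X_train = np.array([[1.0, 2.0, 0.5],
                    [2.0, 1.0, 1.5],
                    [3.0, 4.0, 2.5],
                    [4.0, 3.0, 3.5]])
coef = np.array([0.2, 0.5, 0.3])  # hypothetical coefficients
intercept = 1.0                    # hypothetical intercept


def predict(X):
    return X @ coef + intercept


def linear_shap_values(X, background):
    """Per-feature contributions coef_i * (x_i - mean_i) for each row of X."""
    return (X - background.mean(axis=0)) * coef


base_value = predict(X_train).mean()  # average prediction on the background data
row = np.array([[3.5, 1.0, 2.0]])     # one observation to explain
contributions = linear_shap_values(row, X_train)

print("Base value:", base_value)
print("Contributions per feature:", contributions[0])
print("Base value + contributions:", base_value + contributions.sum())
print("Model prediction:", predict(row)[0])
```

The contributions add up to the difference between the prediction and the base value, which is the property that makes this kind of explanation easy to read feature by feature.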

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
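
The two cells above are left empty in this notebook. As a minimal sketch of the save-and-reload round trip with the standard-library `pickle` module, the example below uses a plain dictionary of made-up parameters as a hypothetical stand-in for a fitted model. A fitted estimator such as `reg` could be passed to `pickle.dump` in the same way. The file name `best_model.pkl` is also an assumption.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model: just its learned parameters
model = {"coefficients": [0.1, 0.6, 0.3], "intercept": 0.05}


def predict(model, rows):
    """Linear prediction from the stored coefficients and intercept."""
    return [model["intercept"] + sum(c * x for c, x in zip(model["coefficients"], row))
            for row in rows]


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.pkl"  # hypothetical file name

    # Save the model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load the saved file again
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# Sanity check: the reloaded model gives the same prediction on an unseen row
unseen_rows = [[1.0, 2.0, 3.0]]
print(predict(loaded_model, unseen_rows) == predict(model, unseen_rows))
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.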

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**Closing Price**: Closing Price of a stock refers to the final price at which the stock is traded on a particular stock exchange on a given trading day. It is the last price at which the stock is bought or sold during the trading session.

**Importance** : The closing price is an important metric used by investors, analysts, and traders to evaluate a company’s financial health, market value, and stock performance. It is also used to calculate other important metrics such as the daily price change and market capitalization.

**For an Average Investor** : An average investor typically invests in stocks for the long term, choosing premium stocks that have proved to be high-quality, high-performing stocks over the years. For such investors, the daily closing price may not be as important as it is for an average trader.

**For Traders** : For traders and analysts, information on the closing price of stocks is essential for making sound trading decisions and maximizing returns on their portfolios.

## In this project we did the following things to get our desired results:

 1. First we did the data wrangling, then data cleaning and data transformation, and after that the modeling.

 2. The price of Yes Bank's stock trended upward until 2018, after which the Close, Open, High, and Low prices decreased.

 3. Based on the open vs. close price graph, we concluded that Yes Bank's stock fell significantly after 2018.

 4. Visualization has allowed us to notice that the closing price of the stock fell suddenly starting in 2018. It seems reasonable that the Yes Bank stock price was significantly impacted by the Rana Kapoor fraud case.

 5. High, Low, and Open are directly correlated with the closing price of the stock.

 6. The target variable is highly dependent on input variables.

 7. Linear Regression has given the best results with lowest MAE, MSE, RMSE and MAPE scores.

 8. Ridge regression shrunk the parameters to reduce complexity and multicollinearity, but ended up affecting the evaluation metrics.

 9. Lasso regression did feature selection and ended up giving worse results than ridge, which again reflects the fact that each feature is important (as previously discussed).

 10. The accuracy for each model is more than 90%.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***