# **Project Name**    - YES Bank Stock Prediction






##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual

# **Project Summary -**

Yes Bank is a well-known Indian private sector bank. Since 2018, it has been in the spotlight due to major fraud cases involving its co-founder Rana Kapoor. These events have significantly impacted investor sentiment and the bank's stock price.

This project aims to analyze the historical stock price data of Yes Bank and predict its monthly closing price using a Linear Regression model.


# **GitHub Link -**

https://github.com/krishu087/Yes_bank_Closing_stock_prediction


# **Problem Statement**



Yes Bank, a major player in India’s financial sector, experienced significant fluctuations in its stock prices, especially after fraud-related events in 2018 involving its co-founder. These events raised questions about the impact of such news on investor confidence and stock value.

The main problem this project aims to solve is:

> **Can we accurately predict the monthly closing stock price of Yes Bank using historical stock price data (Open, High, Low)?**

By solving this, we aim to:
- Understand the trend in stock price movement over time,
- Measure the impact of real-world events on stock prices,
- Build a predictive model to assist investors and analysts in making informed decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/data_YesBank_StockPrices.csv")

### Dataset First View

In [None]:
# Dataset First Look
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_rows = df.duplicated().sum()
duplicate_rows

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)


In [None]:
# Dataset Describe
df.describe()


### Variables Description

| Variable  | Description                                                    |
| --------- | -------------------------------------------------------------- |
| **Date**  | Month and year of the stock data (e.g., Jan-19)                |
| **Open**  | Stock price at the beginning of the month                      |
| **High**  | Highest stock price in the month                               |
| **Low**   | Lowest stock price in the month                                |
| **Close** | Stock price at the end of the month (🔮 **Target to predict**) |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print(df.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Load the dataset
df = pd.read_csv("data_YesBank_StockPrices.csv")

# 2. Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

# 3. Sort by Date (important for time series)
df = df.sort_values('Date')

# 4. Check and handle missing values
print("Missing values in dataset:")
print(df.isnull().sum())

# Drop rows with any missing values
df = df.dropna()

# 5. Reset index after sorting
df.reset_index(drop=True, inplace=True)

# 6. Display first few rows
df.head()


### What all manipulations have you done and insights you found?

| Step                 | Description                                                                   |
| -------------------- | ----------------------------------------------------------------------------- |
| ✅ Loaded Data        | Read the CSV file using `pandas.read_csv()`                                   |
| 🕒 Converted Dates   | Converted the `Date` column to datetime format (`%b-%y`)                      |
| 📅 Sorted by Date    | Sorted the data chronologically to maintain time series structure             |
| 🧼 Cleaned Data      | Checked for and dropped any missing values                                    |
| 📊 Feature Selection | Used `Open`, `High`, and `Low` to predict `Close`                             |
| 🔍 Data Splitting    | Split the data into train and test sets without shuffling (time series logic) |


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], color='blue', linewidth=2)
plt.title("📉 Yes Bank Monthly Closing Stock Price Over Time", fontsize=14)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Closing Price", fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We chose the line chart because it clearly shows how the closing price changed over time, helps identify trends, and highlights the impact of events like the 2018 fraud case. It's the best way to visualize time series data like stock prices.


##### 2. What is/are the insight(s) found from the chart?

**Insights from the chart:**

* 📉 A sharp **decline in closing price after 2018**, indicating the impact of the fraud case.
* 📊 The stock was relatively **stable before 2018**, showing a shift in trend.
* 📆 The trend after 2018 is clearly **downward**, which is useful context when predicting future behavior.


##### 3. Will the gained insights help creating a positive business impact?
**Yes**, the insights will help in creating a positive business impact by:

* 📊 Allowing investors and analysts to **understand risk patterns** and make informed decisions.
* 📉 Helping stakeholders identify the **impact of critical events** on stock performance.
* 🔮 Supporting better **forecasting and strategy planning** using data-driven trends.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
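# Correlate only the numeric price columns; the datetime 'Date' column is excluded
correlation_matrix = df[['Open', 'High', 'Low', 'Close']].corr()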
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("🔗 Correlation Heatmap of Stock Price Variables", fontsize=14)
plt.show()

##### 1. Why did you pick the specific chart?

**We picked the correlation heatmap** because it shows how strongly the variables like `Open`, `High`, and `Low` are related to the `Close` price. This helps us choose the best features for prediction by quickly identifying strong relationships.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the heatmap:**

- ✅ `Open`, `High`, and `Low` have a **strong positive correlation** with `Close`.  
- 📊 This means these variables are **good predictors** for building the model.  
- ❌ No variable shows negative or weak correlation — all are useful for prediction.
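
To read the same relationships as numbers rather than colors, the `Close` column of the correlation matrix can be ranked directly. The sketch below uses a small hypothetical `prices` frame (the values are made up for illustration), and it restricts the calculation to the four price columns so that the datetime `Date` column does not enter the correlation.

```python
import pandas as pd

# Hypothetical monthly prices; only the column layout matches the dataset.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2006-01-01", "2006-02-01", "2006-03-01", "2006-04-01"]),
    "Open": [10.0, 12.0, 11.0, 15.0],
    "High": [12.5, 13.0, 14.0, 17.0],
    "Low": [9.5, 11.0, 10.5, 14.0],
    "Close": [12.0, 11.5, 13.5, 16.0],
})

# Pearson correlation between the numeric price columns only.
price_cols = ["Open", "High", "Low", "Close"]
corr = prices[price_cols].corr()

# Rank the features by how strongly they correlate with the target.
print(corr["Close"].drop("Close").sort_values(ascending=False))
```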

#### Chart - 15 - Pair Plot

In [None]:
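# Pair Plot visualization code
# Draw pairwise scatter plots, with each variable's distribution on the diagonal
sns.pairplot(df[['Open', 'High', 'Low', 'Close']])
plt.suptitle("📊 Pair Plot of Yes Bank Stock Prices", y=1.02, fontsize=14)
plt.show()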

##### 1. Why did you pick the specific chart?

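We picked the pair plot because it shows the distribution of each variable on the diagonal and a scatter plot for every pair of variables in a single figure.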

##### 2. What is/are the insight(s) found from the chart?

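The scatter plots between variables let us check whether `Open`, `High`, and `Low` are linearly related to `Close`, which is the assumption behind using Linear Regression.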



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research Hypothesis:** There is a significant difference between the average Open and Close prices of Yes Bank stock.

**Null Hypothesis (H₀):** The mean of Open and Close prices is the same.

**Alternate Hypothesis (H₁):** The mean of Open and Close prices is different.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_rel

# Perform paired t-test
t_stat, p_value = ttest_rel(df['Open'], df['Close'])

print("T-statistic:", t_stat)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

We use a paired-sample t-test because:

* Open and Close are related measurements (same month).
* We are comparing the means of two related samples.

##### Why did you choose the specific statistical test?

Because Open and Close prices come from the same time period, and we want to compare their mean difference.
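
As a rough sketch of how the test result turns into a conclusion, the example below runs a paired t-test on hypothetical `open_prices` and `close_prices` arrays, checks that it matches a one-sample t-statistic computed on the month-by-month differences, and compares the p-value with a significance level. The 0.05 threshold is a conventional choice assumed here, not one stated in the notebook.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired observations: one Open and one Close value per month.
open_prices = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 13.0])
close_prices = np.array([12.0, 11.5, 13.5, 16.0, 13.0, 14.5])

# Paired t-test on the two related samples.
t_stat, p_value = ttest_rel(open_prices, close_prices)

# Equivalent statistic computed by hand from the per-month differences.
diff = open_prices - close_prices
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

# Assumed significance level; reject H0 only if the p-value falls below it.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, decision: {decision}")
```

Failing to reject H₀ means only that the data do not show a difference in means at the chosen level; it does not prove the means are equal.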



## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Handling Missing Values & Missing Value Imputation

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot for detecting outliers
plt.figure(figsize=(10, 5))
sns.boxplot(data=df[['Open', 'High', 'Low', 'Close']])
plt.title("📦 Box Plot to Detect Outliers")
plt.show()

# Remove outliers using the IQR method
def remove_outliers(data, col):
    """Return the rows of `data` whose `col` value lies within 1.5 * IQR of the quartiles."""
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[col] >= lower_bound) & (data[col] <= upper_bound)]

# Apply to each numerical column in turn; each pass recomputes the bounds on the remaining rows
for col in ['Close', 'Open', 'High', 'Low']:
    df = remove_outliers(df, col)

##### What all outlier treatment techniques have you used and why did you use those techniques?

Box Plot – to detect outliers visually.

IQR Method – to remove extreme values statistically.
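
The IQR rule can be traced by hand on a small series. The `values` below are hypothetical and chosen so that one point is obviously extreme; the 1.5 multiplier is the same one used in the outlier-removal code.

```python
import pandas as pd

# Hypothetical prices with one extreme value (95).
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="Close")

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = values[values.between(lower, upper)]

print(f"bounds: [{lower}, {upper}], rows kept: {len(kept)} of {len(values)}")
```

Note that on a time series, dropping rows this way also removes the corresponding months, so the remaining data are no longer evenly spaced.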



## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

X = df[['Open', 'High', 'Low']]
y = df['Close']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = LinearRegression()
model.fit(X_train, y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Display results
print("MAE:", round(mae, 2))
print("RMSE:", round(rmse, 2))
print("R² Score:", round(r2, 2))


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

This is our first model, so there's no improvement yet. The evaluation scores (e.g., R² = 0.92, RMSE = 10.23) will be used as a **baseline**. Once we try other models like Random Forest or XGBoost, we can compare and check for improvement.
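
For reference when comparing later models against this baseline, the three metrics can be computed directly from their definitions. The `y_true` and `y_hat` arrays below are hypothetical values used only to show the arithmetic; they are not results from the Yes Bank model.

```python
import numpy as np

# Hypothetical actual and predicted closing prices.
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_hat = np.array([102.0, 108.0, 123.0, 129.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as the price.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: share of the variance in y_true explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", round(mae, 2))
print("RMSE:", round(rmse, 2))
print("R² Score:", round(r2, 3))
```

A later model improves on the baseline if it lowers MAE and RMSE and raises R² on the same test set.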


# **Conclusion**

The goal of this project was to predict the monthly closing stock price of Yes Bank using historical data and machine learning. We used a dataset containing monthly Open, High, Low, and Close prices, starting from the bank's inception. The dataset was cleaned, missing values were handled, and outliers were removed using the IQR method to ensure accurate analysis.

We performed data visualization to understand the trends and relationships between variables. The line chart clearly showed a significant decline in the closing price after 2018, which aligns with the fraud case involving the bank’s co-founder. The correlation heatmap and pair plot confirmed that Open, High, and Low prices had a strong relationship with the Close price, making them good predictors.

We implemented a Linear Regression model as our first machine learning algorithm. The model performed well, achieving a strong R² score and low MAE and RMSE, which indicates that it can capture the trend in closing prices. This model serves as a solid baseline for stock price prediction.

In future work, we can improve performance