# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

The objective of this project was to predict the closing price of Yes Bank stock from historical data: the date and the open, high, and low prices. We transformed the Date column into numeric features to capture potential time-based trends and used Open, High, Low, Day, and Month as predictors. Starting from a Linear Regression baseline, we moved to a Random Forest Regressor to capture non-linear patterns in stock prices and tuned its hyperparameters with RandomizedSearchCV and GridSearchCV, with the Random Forest model achieving lower error rates. The final model demonstrated the potential to support short-term stock price predictions, helping analysts and investors anticipate changes and make informed trading decisions. Although effective, the model could improve further by integrating external factors such as news and market sentiment. Overall, this project highlights the value of data preparation, model selection, and optimization in building predictive financial models, showing potential for expanded applications in investment and portfolio management.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


1. Predictive Analysis of Stock Prices: Develop a machine learning model to predict the closing price of Yes Bank stock based on historical data, including open, high, and low prices. This predictive capability aims to support financial analysts in making data-driven trading decisions.

2. Optimization of Model Accuracy for Financial Forecasting: Investigate and apply hyperparameter optimization techniques, such as GridSearchCV, RandomizedSearchCV, and Bayesian Optimization, to improve model accuracy in forecasting stock prices. The objective is to identify optimal model parameters that minimize prediction errors and maximize reliability.

3. Feature Engineering to Enhance Predictive Power: Examine and transform the Date feature into relevant numeric variables (day, month, and year) to capture seasonal or time-based patterns in stock prices. The goal is to test whether these engineered features improve the model’s ability to predict stock price trends effectively.

4. Evaluating Predictive Model Performance with Financial Data: Evaluate the effectiveness of regression models for predicting stock prices using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). This problem focuses on assessing the accuracy and stability of predictions and determining the model’s suitability for real-world financial forecasting applications.
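
As a small illustration of the three metrics named in item 4, the sketch below computes MAE, MSE, and RMSE by hand and checks them against scikit-learn. The prices are made up for the example and are not taken from the Yes Bank data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted closing prices (illustrative only).
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 125.0, 130.0])

errors = y_true - y_pred            # per-row prediction error

mae = np.mean(np.abs(errors))       # average absolute error
mse = np.mean(errors ** 2)          # average squared error, penalizes large misses
rmse = np.sqrt(mse)                 # back in the same units as the price

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
print("sklearn MAE:", mean_absolute_error(y_true, y_pred))
print("sklearn MSE:", mean_squared_error(y_true, y_pred))
```

Because RMSE is expressed in price units, it is usually the easiest of the three to compare against the typical size of a price move.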

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# loading dataset
df= pd.read_csv('/content/drive/MyDrive/data_YesBank_StockPrices.csv')
df.head(6)

### Dataset First View

In [None]:
# Dataset First Look
df.head(6)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(df.columns)
df.shape


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

### What did you know about your dataset?

In my dataset, there are 185 rows and 5 columns. There are no duplicate values and there are no missing values.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

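The dataset has 5 columns (Date, Open, High, Low, Close). The `%b-%y` date format used in the wrangling step suggests one row per month.

*   Date: Month and year of the record
*   Open: Opening price for the period
*   High: Highest price during the period
*   Low: Lowest price during the period
*   Close: Closing price for the period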

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
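# Convert the Date column from object to datetime
# ('%b-%y' = abbreviated month name + two-digit year)
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
# Add a column for the percentage change in Close from the previous row
# (the first row is NaN because it has no previous value)
df['daily_return'] = df['Close'].pct_change()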


### What all manipulations have you done and insights you found?

First, I converted the Date column from object type to datetime type. Then I added a new column named 'daily_return', which holds the percentage change in Close from the previous row. Since the original columns have no null or missing values, no rows needed to be removed.
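
To make the two wrangling steps concrete, here is a minimal sketch on a hypothetical three-row sample laid out like the Date and Close columns. The values are invented; only the `'%b-%y'` format and the `pct_change()` call mirror the notebook.

```python
import pandas as pd

# Hypothetical sample in the same 'Mon-YY' layout as the Date column.
sample = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Sep-05"],
    "Close": [10.0, 11.0, 9.9],
})

# '%b-%y' reads an abbreviated month name and a two-digit year.
# No day is present in the text, so each parsed date falls on the 1st.
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# pct_change compares each row with the previous one,
# so the first row has nothing to compare against and becomes NaN.
sample["daily_return"] = sample["Close"].pct_change()

print(sample)
print("Distinct day values:", sample["Date"].dt.day.unique())
```

Two things are worth noticing in the output: the first return is NaN, and the day component is the same for every row, so it carries no information on data in this format.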

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
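# Plot Close against the parsed Date so the x-axis shows time, not the row index
plt.plot(df['Date'], df['Close'], label='Close Price', color='blue')
plt.title('Yes Bank Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')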
plt.legend()
plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

 To show trends over time.

##### 2. What is/are the insight(s) found from the chart?

 This chart allows us to see trends in the closing price over time, such as periods of increase or decrease.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in forecasting future prices based on historical trends.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.scatter(df['Open'], df['Close'], color='green')
plt.title('Relationship Between Open and Close Prices')
plt.xlabel('Open Price')
plt.ylabel('Close Price')
plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

To explore the relationship between two continuous variables.

##### 2. What is/are the insight(s) found from the chart?

This chart helps to visualize any correlation between the opening and closing prices, revealing trends in market behavior.
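
A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r on invented Open/Close pairs, once from its definition and once with NumPy. The arrays are assumptions for illustration, not the notebook's data.

```python
import numpy as np

# Hypothetical Open/Close pairs (illustrative only).
open_prices = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
close_prices = np.array([11.0, 11.5, 12.0, 14.5, 15.0])

# Pearson r: sample covariance divided by the product of sample standard deviations.
covariance = np.cov(open_prices, close_prices, ddof=1)[0, 1]
r_manual = covariance / (open_prices.std(ddof=1) * close_prices.std(ddof=1))

# The same value from NumPy's correlation matrix.
r_numpy = np.corrcoef(open_prices, close_prices)[0, 1]

print(f"Pearson r (manual): {r_manual:.3f}")
print(f"Pearson r (numpy):  {r_numpy:.3f}")
```

On the real DataFrame the equivalent one-liner would be `df['Open'].corr(df['Close'])`.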

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps traders understand how opening prices affect closing prices.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Bar chart for High and Low prices
plt.figure(figsize=(10, 5))
bar_width = 0.35
index = range(len(df))

plt.bar(index, df['High'], width=bar_width, label='High', color='green', alpha=0.6)
plt.bar([i + bar_width for i in index], df['Low'], width=bar_width, label='Low', color='red', alpha=0.6)

plt.title('High and Low Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
# plt.xticks([i + bar_width / 2 for i in index], df.index.date, rotation=45)
plt.legend()
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

To compare high and low prices over a specific period.

##### 2. What is/are the insight(s) found from the chart?

This chart helps to visualize the price range in each period, showing the volatility and extreme price points.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Assists in understanding price volatility, helping to manage risk.
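
One simple way to quantify the volatility visible in the chart is the High-Low range per row, in absolute terms and relative to the Low. This is a sketch on invented rows; the column names `range` and `range_pct` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical High/Low rows (illustrative only).
prices = pd.DataFrame({
    "High": [12.0, 15.0, 14.0, 20.0],
    "Low": [10.0, 12.0, 13.0, 15.0],
})

# Absolute range, and range as a percentage of the period's low.
prices["range"] = prices["High"] - prices["Low"]
prices["range_pct"] = prices["range"] / prices["Low"] * 100

# Rank periods from widest to narrowest relative range.
widest = prices.sort_values("range_pct", ascending=False)
print(widest)
```

The relative range is often more comparable across time than the absolute one, since a fixed price gap means less when the price level is high.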

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



*   Null Hypothesis (H0): The mean Open price is equal to the mean Close price (the mean of the paired differences is zero).
*   Alternative Hypothesis (H1): The mean Open price differs from the mean Close price.







#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
t_statistic, p_value = stats.ttest_rel(df['Open'], df['Close'])

print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")

# Interpret the p-value
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between Open and Close prices.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between Open and Close prices.")

##### Which statistical test have you done to obtain P-Value?

Paired t-test

##### Why did you choose the specific statistical test?

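Each row records the Open and Close price for the same period, so the two samples are paired rather than independent. A paired t-test compares the mean of the row-wise differences against zero, which is the appropriate test for this kind of matched data.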
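
The sketch below shows what a paired t-test computes, using invented numbers rather than the notebook's data: it is the same as a one-sample t-test on the row-wise differences, and the statistic can be reproduced by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical Open/Close pairs, one pair per period (illustrative only).
open_prices = np.array([10.0, 12.5, 11.0, 14.0, 13.5, 15.0])
close_prices = np.array([10.8, 12.0, 11.9, 14.6, 13.1, 15.9])

# Paired t-test on the two columns.
t_paired, p_paired = stats.ttest_rel(open_prices, close_prices)

# Equivalent one-sample t-test on the differences against zero.
diff = open_prices - close_prices
t_diff, p_diff = stats.ttest_1samp(diff, 0.0)

# The same statistic by hand: mean(d) / (sample std(d) / sqrt(n)).
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(f"paired:     t={t_paired:.4f}, p={p_paired:.4f}")
print(f"one-sample: t={t_diff:.4f}, p={p_diff:.4f}")
print(f"manual:     t={t_manual:.4f}")
```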

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



*   Null Hypothesis (H0): There is no significant difference between the High and Low prices of the stock.
*   Alternative Hypothesis (H1): There is a significant difference between the High and Low prices of the stock.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
t_statistic_high_low, p_value_high_low = stats.ttest_rel(df['High'], df['Low'])

print("Scenario 1: Paired T-Test between High and Low Prices")
print(f"T-Statistic: {t_statistic_high_low}")
print(f"P-Value: {p_value_high_low}")

# Interpret the p-value for High and Low Prices
alpha = 0.05  # Significance level
if p_value_high_low < alpha:
    print("Reject the null hypothesis: There is a significant difference between High and Low prices.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between High and Low prices.")

# --- Scenario 2: One-Sample T-Test for Close Prices ---
# Test if mean Close price is greater than $100
# alternative='greater' returns the one-tailed p-value directly; halving a
# two-sided p-value is only valid when the t-statistic is positive
t_statistic_close, p_value_close_one_tailed = stats.ttest_1samp(
    df['Close'], 100, alternative='greater'
)

print("\nScenario 2: One-Sample T-Test for Close Prices")
print(f"T-Statistic: {t_statistic_close}")
print(f"P-Value (One-Tailed): {p_value_close_one_tailed}")

# Interpret the p-value for Close Prices
if p_value_close_one_tailed < alpha:
    print("Reject the null hypothesis: The average Close price is greater than $100.")
else:
    print("Fail to reject the null hypothesis: The average Close price is not greater than $100.")

##### Which statistical test have you done to obtain P-Value?

Paired t-test and One Sample t-test.

##### Why did you choose the specific statistical test?

Both tests were selected based on the nature of the data and the specific hypotheses being tested. The paired t-test was suitable for comparing two related samples, while the one-sample t-test was used for comparing a sample mean to a known value.
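
For a one-tailed test, the p-value depends on the direction of the t-statistic. The sketch below uses invented prices whose mean sits below the threshold to compare two ways of getting a one-tailed p-value for "mean greater than 100": SciPy's `alternative="greater"` option (available in SciPy 1.6 and later) and halving the two-sided p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical closing prices with mean 90, below the threshold of 100.
close = np.array([80.0, 95.0, 85.0, 100.0, 90.0])

# Two-sided test, then a one-sided test for H1: mean > 100.
t_two, p_two = stats.ttest_1samp(close, 100)
t_one, p_one = stats.ttest_1samp(close, 100, alternative="greater")

# Halving the two-sided p-value ignores the sign of the t-statistic.
p_halved = p_two / 2

print(f"t-statistic:           {t_two:.3f}")
print(f"one-sided p (greater): {p_one:.3f}")
print(f"two-sided p / 2:       {p_halved:.3f}")
```

Here the t-statistic is negative, so the data give no support for "greater than 100", yet the halved p-value is small. Halving is only a valid shortcut when the statistic points in the direction of the alternative.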

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

*   Null Hypothesis (H0): The average Open price of the stock is equal to $50.
*   Alternative Hypothesis (H1): The average Open price of the stock is not equal to $50.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
t_statistic_open, p_value_open = stats.ttest_1samp(df['Open'], 50)

# Print the results
print("One-Sample T-Test for Open Prices")
print(f"T-Statistic: {t_statistic_open}")
print(f"P-Value: {p_value_open}")

# Since this is a two-tailed test, we do not need to adjust the p-value
alpha = 0.05  # Significance level
if p_value_open < alpha:
    print("Reject the null hypothesis: The average Open price is not equal to $50.")
else:
    print("Fail to reject the null hypothesis: The average Open price is equal to $50.")

##### Which statistical test have you done to obtain P-Value?

One sample t-test

##### Why did you choose the specific statistical test?

The hypothesis involves comparing the mean of a single sample (the Open prices) against a specific known value (in this case, $50). The one-sample t-test is designed for exactly this type of analysis.
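
To show where the two-sided p-value comes from, this sketch reproduces a one-sample t-test by hand on invented Open prices: the t-statistic is the distance of the sample mean from the hypothesized value in standard-error units, and the p-value is the two-tail area of a t distribution with n - 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical Open prices and hypothesized mean (illustrative only).
open_prices = np.array([48.0, 52.0, 55.0, 47.0, 53.0, 51.0])
mu0 = 50.0

n = len(open_prices)
standard_error = open_prices.std(ddof=1) / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (open_prices.mean() - mu0) / standard_error

# Two-sided p-value: twice the upper-tail area beyond |t|.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Reference result from SciPy.
t_scipy, p_scipy = stats.ttest_1samp(open_prices, mu0)

print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```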

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isna().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

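The original columns have no missing values, so no imputation was needed. The only NaN is in the first row of the derived 'daily_return' column, which is not used as a model feature.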

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
Q1 = df['Open'].quantile(0.25)
Q3 = df['Open'].quantile(0.75)
IQR = Q3 - Q1

# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['Open'] < lower_bound) | (df['Open'] > upper_bound)]
print(outliers)

# Treat outliers by capping Open at the IQR bounds
df['Open'] = df['Open'].clip(lower=lower_bound, upper=upper_bound)
print("Data after handling outliers:\n", df.describe())


##### What all outlier treatment techniques have you used and why did you use those techniques?

I used capping at the IQR bounds to handle outliers in the Open column. Capping retains every observation in the dataset, which matters when dropping rows could discard valuable information or insights.
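
The capping rule can be wrapped in a small helper and checked on a toy series. `cap_iqr` is a hypothetical name introduced here for illustration, and the prices are invented.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical prices with one extreme value (illustrative only).
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
capped = cap_iqr(prices)

print(pd.DataFrame({"original": prices, "capped": capped}))
```

The row count is unchanged and only the extreme value moves. One caveat when applying this to a single price column: a capped Open can end up outside the same row's Low-High range, so it may be worth checking that consistency afterwards.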

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

# Display the categorical columns
print("Categorical Columns:\n", categorical_columns)

#### What all categorical encoding techniques have you used & why did you use those techniques?

We don't have any categorical columns in our dataset, so no encoding was needed.


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

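# Date-derived features. The '%b-%y' format carries no day, so every parsed
# date falls on the 1st and Day is constant; only Month varies.
df['Day']=df['Date'].dt.day
df['Month']=df['Date'].dt.month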

X = df[['Open', 'High', 'Low', 'Day', 'Month']]
y = df['Close']
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.2,random_state=42)

# Fit the Algorithm
model =LinearRegression()
model.fit(X_train,y_train)

# Predict on the model
pred= model.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)

I used a Linear Regression model on this dataset. I split the data into training and test sets with `train_test_split` (80/20, random_state=42), then evaluated the predictions using MAE, MSE, and RMSE.
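
`train_test_split` shuffles rows by default, so with time-ordered prices some test rows fall earlier than training rows. An alternative that this notebook does not use is a chronological hold-out, where the most recent rows form the test set. The sketch below shows the idea on a hypothetical monthly frame named `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame standing in for the notebook's df.
dates = pd.date_range("2020-01-01", periods=10, freq="MS")
toy = pd.DataFrame({"Date": dates, "Close": np.arange(10, dtype=float)})

# Sort by date and hold out the most recent 20% as the test set.
toy = toy.sort_values("Date").reset_index(drop=True)
split = int(len(toy) * 0.8)
train, test = toy.iloc[:split], toy.iloc[split:]

print("train:", train["Date"].min().date(), "to", train["Date"].max().date())
print("test: ", test["Date"].min().date(), "to", test["Date"].max().date())
```

With scikit-learn, passing `shuffle=False` to `train_test_split` on date-sorted data produces the same kind of split. Errors measured this way are usually a more honest estimate of forecasting performance than errors from a shuffled split.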

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

# Fit the Algorithm
model = RandomForestRegressor(random_state=42)
random_grid = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 4)
}
grid_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# 1. RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(estimator=model, param_distributions=random_grid, n_iter=10, cv=3, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)

# Best parameters from RandomizedSearchCV
print("Best parameters from RandomizedSearchCV:", random_search.best_params_)

# 2. GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=model, param_grid=grid_grid,
                           cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters from GridSearchCV
print("Best parameters from GridSearchCV:", grid_search.best_params_)


##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV and GridSearchCV. RandomizedSearchCV samples a fixed number of random hyperparameter combinations (here n_iter=10) from the defined ranges, while GridSearchCV exhaustively evaluates every combination in the grid.
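
The practical difference between the two searches is the number of model fits. The sketch below counts them for the grid and distributions defined in the tuning cell, using scikit-learn's `ParameterGrid` and `ParameterSampler` helpers; no models are trained, and the counts exclude the final refit on the full training set.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search spaces as the tuning cell.
grid = {
    "n_estimators": [100, 150, 200],
    "max_depth": [10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 4),
}

cv_folds = 3

# Grid search tries every combination: 3 * 3 * 3 * 3 candidates.
n_grid = len(ParameterGrid(grid))

# Randomized search draws a fixed number of candidates from the distributions.
sampled = list(ParameterSampler(distributions, n_iter=10, random_state=42))

print("GridSearchCV fits:      ", n_grid * cv_folds)
print("RandomizedSearchCV fits:", len(sampled) * cv_folds)
print("First sampled candidate:", sampled[0])
```

Randomized search covers a wider range of values for a fraction of the cost, while grid search guarantees that every listed combination is evaluated.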

# **Conclusion**

In this project, we developed and evaluated a machine learning model to predict the closing price of Yes Bank stock, covering data wrangling, feature engineering, model training, hyperparameter optimization, and evaluation. Through systematic data processing, model selection, and optimization, we achieved a model that can reasonably predict stock prices. While the model is valuable as an exploratory tool, additional external data sources and more complex algorithms could yield higher accuracy. This project highlights the importance of feature engineering, model selection, and hyperparameter tuning in building effective predictive models for financial data. With further development, this predictive capability could be a significant asset for portfolio management and stock trading strategies.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***