<a href="https://colab.research.google.com/github/ishan711997/ML_linear_model_for_Stock_Closing_Price/blob/main/ML_linear_model_for_Stock_Closing_Price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Yes Bank Stock Closing Price Prediction

##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Ishan Srivastava

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/ishan711997/ML_linear_model_for_Stock_Closing_Price.git

# **Problem Statement**


Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. It is therefore interesting to see how the case affected the company's stock prices, and whether time series models or other predictive models can handle such situations. The dataset contains the bank's monthly stock prices since its inception, including the closing, opening, highest, and lowest price for every month. The main objective is to predict the stock's monthly closing price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is followed.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is followed.

*   Explain the ML model used and its performance using the Evaluation metric Score Chart.

*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# for setting x axis year range
import matplotlib.dates as mdates

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone Projects/Capstone P2/data/data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

In [None]:
# # convert string object to datetime object
# data['Date'] = data['Date'].apply(lambda x: datetime.strptime(x, "%b-%y"))

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data[data.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(), cbar=False, yticklabels=False)

### What did you know about your dataset?

According to the given dataset:

*   There are no null values.
*   The Date column is of object type; the other columns are floats.
*   There are 5 columns and 185 rows.
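Because the Date column is stored as text, it has to be parsed before it can serve as a time axis. The sketch below shows what the `%b-%y` format (the one used later in this notebook) does to a month-year label. The sample labels and the helper name `parse_month_year` are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime


def parse_month_year(label: str) -> datetime:
    """Parse a 'Mon-YY' label such as 'Jul-05' into a datetime (first day of that month)."""
    return datetime.strptime(label, "%b-%y")


# Hypothetical labels in the 'Mon-YY' shape implied by the '%b-%y' format
samples = ["Jul-05", "Aug-05", "Nov-20"]
parsed = [parse_month_year(s) for s in samples]

for raw, dt in zip(samples, parsed):
    print(raw, "->", dt.date())
```

Note that `%b` matches abbreviated month names in the current locale, so this assumes English month abbreviations.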



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include= 'all')

### Variables Description


*  **Date**: The month and year to which the prices apply.
*  **Open**: The opening price of the stock for that month.
*  **High**: The highest price the stock reached during the month.
*  **Low**: The lowest price the stock reached during the month.
*  **Close**: The closing price of the stock for that month.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# making a copy of data and assign to df
df = data.copy()

In [None]:
# convert the Date column from object to datetime dtype (raises if a value does not match '%b-%y')
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

In [None]:
# every feature is recorded per date, so use Date as the index
df.set_index(keys='Date', inplace = True)

In [None]:
X = df[['High', 'Low', 'Open']]  # Independent variables
Y = df['Close']                  # Dependent variable

#### functions

In [None]:
# helper that sets the x-axis tick frequency to one tick per year
def x_year_lable():
  years = mdates.YearLocator()
  plt.gca().xaxis.set_major_locator(years)

In [None]:
# function ---- 1
# range detection function
def range_detection(col1, col2, title):
  high_low_range = df[col1] - df[col2]

  plt.figure(figsize=(15,10))

  # Plot the trading range over time
  plt.plot(df.index, high_low_range)
  plt.title(title)
  plt.xlabel("Year")
  plt.ylabel("Trading Range")

  x_year_lable()

In [None]:
# function ---- 2
# function for plotting the variation of two columns over time
def price_comparison(col1, col2, title):
  plt.figure(figsize=(15,10))


  sns.barplot(x=df.index.year, y=col1, data=df, color='blue', alpha=0.7, label=col1)
  sns.barplot(x=df.index.year, y=col2, data=df, color='orange', alpha=0.7, label=col2)

  plt.title(title)
  plt.xlabel("Date")
  plt.ylabel("Price")
  plt.legend()


In [None]:
# function ---- 3
# function for relation of two columns
def relation_plot(col1, col2, title):
  plt.scatter(df[col1], df[col2])
  plt.title(title)
  plt.xlabel(f"{col1} Price")
  plt.ylabel(f"{col2} Price")

In [None]:
# function ---- 4
# function for checking distribution and outlier
def hist_box():
  for column in df:
    plt.figure(figsize=(10, 6))

    # for histogram
    plt.subplot(1, 2, 1)
    sns.histplot(df[column], kde = True)

    # for boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(y = df[column])

### EDA

In [None]:
# 1. Maximum and Minimum price of stock for Open and Close price
plt.bar(['Open', 'Close'], [df['Open'].max(), df['Close'].max()], label='Maximum')
plt.bar(['Open', 'Close'], [df['Open'].min(), df['Close'].min()], label='Minimum')

# Set the title and labels
plt.title("Maximum and Minimum of Open and Close Prices")
plt.xlabel("Price Type")
plt.ylabel("Price")
plt.legend()

In [None]:
# 2. what is the difference between high and low price of the stock over time?
range_detection('High','Low', title= 'Difference b/w High & Low Price Over Time')

In [None]:
# 3. How does High price and Low price vary over time?
price_comparison('High', 'Low', title = 'High Price vs Low Price Over Time')

In [None]:
# 4. What is the relationship between the high price and low price of the stock?
relation_plot('High', 'Low', title="High Price vs Low Price")

In [None]:
# 5. What is the difference between the opening price and closing price of the stock over time?
range_detection('Close', 'Open', title = 'Difference b/w Open & Close Price Over Time')

In [None]:
# 6. How does the opening price vary with the closing price over time
price_comparison('Open', 'Close', title = 'Opening Price vs Closing Price Over Time')

In [None]:
# 7. What is the relationship between the opening price and closing price of the stock?
relation_plot('Open', 'Close', title="Opening Price vs Closing Price")

In [None]:
# create histogram and boxplot for all columns to identify how it is distributed
hist_box()

In [None]:
# apply a log10 transformation to every column to reduce the right skew
for col in df.columns:
  df[col] = np.log10(df[col])

In [None]:
# histogram and boxplot after log transformation
hist_box()

In [None]:
# scatter plot of each independent feature against the closing price
# (df holds log10 values at this point, so both axes use the log scale)

for col in X:
  plt.figure(figsize=(8, 6))
  plt.scatter(x=df[col], y=df['Close'])
  plt.xlabel(col)
  plt.ylabel('Closing Price')

In [None]:
# Correlation Heatmap for all features
sns.heatmap(df.corr(), annot=True)

In [None]:
# Pair Plot
sns.pairplot(df)

### What all manipulations have you done and insights you found?

*   Made a copy of the data and assigned it to `df`.
*   Converted the "Date" column from object to datetime and set it as the index.
*   Separated the independent variables (High, Low, Open) from the dependent variable (Close).
*   Defined helper functions for the plots.
*   The all-time maximum of both the opening and closing price is approximately 370, whereas the all-time minimum is approximately 10.
*   In September 2018 the difference between the stock's high and low price was more than 180, which suggests that investors were withdrawing money.
*   Stock prices jumped sharply from 2016 to 2017 and have fallen continually since 2018.
*   The relationship between the High and Low price is linear.
*   In that same month of September, the closing price was approximately 160 units lower than the opening price.
*   In 2016-17 the opening price is lower than the closing price, meaning the stock gained value during that time frame.
*   The relationship between the Open and Close price is linear.
*   All features are right (positively) skewed and every feature has potential outliers, but I decided not to remove them.
*   All independent features (High, Low, Open) are linearly related to the closing price.
*   All features are highly correlated with each other (coefficients between 0.98 and 1). A high correlation between the dependent variable and the independent variables is desirable, but a high correlation among the independent variables themselves is multicollinearity, which can make the coefficient estimates of a linear model unstable.
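One common way to put a number on the multicollinearity noted in the last insight is the variance inflation factor (VIF): each feature is regressed on the others, and VIF = 1 / (1 - R²). The sketch below is only an illustration on synthetic data, not a result from this dataset. The `vif` helper and the synthetic columns are assumptions introduced here.

```python
import numpy as np


def vif(features: np.ndarray) -> np.ndarray:
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the others."""
    n_rows, n_cols = features.shape
    out = np.empty(n_cols)
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        out[j] = 1.0 / (1.0 - r_squared)
    return out


rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Synthetic stand-ins: two nearly identical columns and one independent column
synthetic = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=200),
    rng.normal(size=200),
])
print(vif(synthetic).round(1))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. Here the two near-duplicate columns land far above that range, while the independent column stays close to 1.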



## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Handling Null Values & Missing Value Imputation
df.isnull().sum()

**Removing multicollinearity using Principal Component Analysis (PCA)**
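Before collapsing several features into a single component, it is worth checking how much of the total variance that component keeps. The sketch below computes the explained variance ratio with plain NumPy on synthetic, strongly correlated columns. The data, the column stand-ins, and the helper name are assumptions for illustration only.

```python
import numpy as np


def explained_variance_ratio(features: np.ndarray) -> np.ndarray:
    """Share of total variance captured by each principal component of standardized data."""
    standardized = (features - features.mean(axis=0)) / features.std(axis=0)
    # Squared singular values are proportional to the variance along each component
    singular_values = np.linalg.svd(standardized, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(42)
trend = np.cumsum(rng.normal(size=185))  # synthetic random-walk price driver
synthetic = np.column_stack([
    trend + rng.normal(scale=0.1, size=185),  # stand-in for High
    trend + rng.normal(scale=0.1, size=185),  # stand-in for Low
    trend + rng.normal(scale=0.1, size=185),  # stand-in for Open
])
ratios = explained_variance_ratio(synthetic)
print(ratios.round(4))
```

With scikit-learn, a fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`. If the first ratio is close to 1, keeping one component loses little information.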

In [None]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA, keeping a single component
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame with the first principal component and the 'Close' variable
pca_df = pd.DataFrame({'PCA': X_pca.flatten(), 'Close': Y})

# Plot the relationship between PCA and Close using a scatter plot
plt.scatter(pca_df['PCA'], pca_df['Close'])
plt.xlabel("First Principal Component (PCA)")
plt.ylabel("Close")
plt.title("PCA vs. Close")
plt.show()


In [None]:
Y = pca_df['Close']
X = pca_df['PCA']

In [None]:
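# Distribution of the first principal component. No log is applied here:
# the component is centred on zero, so the log of its negative values is undefined.
sns.histplot(pca_df['PCA'], kde=True)
# Plot the mean (green) and the median (red, dashed)
plt.axvline(pca_df['PCA'].mean(), color='green', linewidth=2)
plt.axvline(pca_df['PCA'].median(), color='red', linestyle='dashed', linewidth=1.5)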


In [None]:
pca_df['PCA'].max()

In [None]:
# mean (green) and median (red, dashed) of the first principal component
plt.axvline(pca_df['PCA'].mean(), color='green', linewidth=2)
plt.axvline(pca_df['PCA'].median(), color='red', linestyle='dashed', linewidth=1.5)


There are no null or missing values.

In [None]:
# Scaling your data

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
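The save-and-reload round trip described in these two steps can be sketched with the standard `pickle` module. The "model" below is a hypothetical stand-in (a dictionary of linear coefficients), and the file name is an assumption. A fitted scikit-learn estimator could be dumped and loaded the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted linear model: only its learned parameters
model = {"intercept": 1.5, "slope": 0.8}


def predict(params: dict, x: float) -> float:
    """Linear prediction from stored parameters."""
    return params["intercept"] + params["slope"] * x


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    # Save the model
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    # Load it back
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# Sanity check: the restored model must reproduce the original prediction
print(predict(restored, 10.0) == predict(model, 10.0))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.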


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***