# **Project Name**    - **`Yes Bank Stock Closing Price Prediction`**



##### **Project Type**    - Regression
##### **Contribution**    - Individual - Sameer Khan


# **Project Summary -**

In this project, we aim to predict the closing price of Yes Bank's stock using historical financial data and machine learning techniques. The closing price of a stock is an essential indicator for investors and traders to make informed decisions regarding buying, selling, or holding the stock.
* The primary objective of this project is to build a regression model that can accurately predict the closing price of Yes Bank's stock based on the available historical data.
* The model will help investors and traders make more informed decisions and potentially improve their trading strategies.

# **GitHub Link -**

https://github.com/itssameerkhan/Yes_bank_stock_closing_price_prediction

# **Problem Statement**


1. Data Collection: Gather historical financial data for Yes Bank from reliable sources, such as financial websites or APIs.
2. Data Preprocessing: Clean the data, handle missing values, and perform feature engineering to extract relevant features for the regression model.
3. Exploratory Data Analysis (EDA): Conduct EDA to understand the distribution of the target variable (closing price) and explore relationships between features and the target variable.
4. Feature Selection: Identify the most relevant features that have a significant impact on the closing price using correlation analysis and feature importance techniques.
5. Model Selection: Experiment with different regression algorithms, such as Linear Regression, Decision Tree Regression, Random Forest Regression, and Gradient Boosting Regression, to determine the best-performing model.
6. Model Training: Split the dataset into training and testing sets. Train the selected model on the training data and tune hyperparameters using cross-validation techniques to optimize model performance.
7. Model Evaluation: Evaluate the trained model's performance on the test set using appropriate evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) to assess how well the model predicts the closing price.
8. Result Interpretation: Interpret the model's predictions and provide insights into the factors influencing the closing price of Yes Bank's stock.
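
The evaluation metrics named in step 7 can be sketched with NumPy alone. This is a minimal illustration, not the notebook's own evaluation code; the helper name `regression_metrics` and the sample prices are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    # R-squared compares the model's squared error with that of a mean-only baseline
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical closing prices and predictions, for illustration only
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(metrics)
```

RMSE is in the same units as the price, which makes it easier to read than MSE; R-squared is unitless and measures improvement over always predicting the mean.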







# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import plotly.graph_objects as go
import statsmodels.api as sm
from statsmodels.tsa.stattools import kpss
from statsmodels.tsa.seasonal import STL
from scipy.stats import zscore
from tabulate import tabulate
import scipy.stats as stats
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from statsmodels.tsa.stattools import acf,pacf
from itertools import product
! pip install eli5
import eli5
import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
! git clone https://github.com/itssameerkhan/Yes_bank_stock_closing_price_prediction.git
data=pd.read_csv("/content/Yes_bank_stock_closing_price_prediction/data_YesBank_StockPrices.csv")

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Number of rows  {data.shape[0]}")
print(f"Number of columns {data.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull())

### What did you know about your dataset?

* The dataset contains 185 rows and 5 columns.
* One column has the object data type and the remaining four are floats.
* The dataset has no null values and no duplicate rows.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe()

### Variables Description

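* Date :- month and year of the record (stored as a string and parsed with the `%b-%y` format).
* Open :- opening price for that month.
* High :- highest price during that month.
* Low :- lowest price during that month.
* Close :- closing price for that month (the target variable).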

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data.columns.tolist():
  print(i)
  print(data[i].value_counts(),'\n\n')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert the Date column from string objects to datetime objects (format: abbreviated month, two-digit year)
data['Date'] = data['Date'].apply(lambda x: datetime.strptime(x, "%b-%y"))


In [None]:
#set Date as index.
data=data.set_index('Date')

### What all manipulations have you done and insights you found?

* Converted the Date column from string objects to datetime objects.
* Set the Date column as the index of the DataFrame.
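
The two wrangling steps above can also be sketched with `pd.to_datetime`, which parses the whole column at once. The three rows below are hypothetical stand-ins for the real CSV, used only to show the `%b-%y` format in action.

```python
import pandas as pd

# Hypothetical rows mimicking the "%b-%y" Date format; prices are made up
sample = pd.DataFrame(
    {"Date": ["Sep-05", "Jul-05", "Aug-05"], "Close": [12.0, 10.0, 11.0]}
)

# Vectorised parsing: "Jul-05" becomes 2005-07-01
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# A sorted datetime index makes rolling windows and resampling behave predictably
sample = sample.set_index("Date").sort_index()
print(sample)
```

Sorting after setting the index is a defensive step; it is an assumption here that the source file might not already be in chronological order.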

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
#creating a new copy of data for EDA.
eda_data=data.copy()

#### Chart - 1 :- **`Close price and maximum and minimum of all time of closing price`**

In [None]:
# Chart - 1 visualization code
fig = go.Figure()
fig.add_trace(go.Candlestick(x=eda_data.index,
                open=eda_data['Open'], high=eda_data['High'],
                low=eda_data['Low'], close=eda_data['Close'],name='Candlestick')
                     )
fig.add_trace(go.Scatter(x=eda_data.index,y=data.Close,marker=dict(color='red'),opacity=0.35,name='Close'))
fig.update_layout(xaxis_rangeslider_visible=True,
                  title_text='Close price',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(family='bold',size=15))
fig.update_yaxes(title_text='Price')
fig.show()

In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data.Close,mode='markers+lines'))
fig.add_annotation(
        x='2018-07-01',
        y=370,
        xref="x",
        yref="y",
        text="367.9 is the maximum of all time",
        showarrow=True,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="darkslategray"
            ),
        arrowhead=2,
        arrowsize=1,
        arrowwidth=2,
        )
fig.add_annotation(
        x='2009-03-01',
        y=11,
        xref="x",
        yref="y",
        text="9.98 is the minimum of all time",
        showarrow=True,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="darkslategray"
            ),
        arrowhead=8,
        arrowsize=1,
        ay=-70,
        )
fig.update_layout(title_text='Closing Price',
                  font=dict(
                      family="Courier New, monospace",
                      size=14,
                      color="darkslategray"),
                  title_x=0.5,
                  title_y=0.85
                  )
fig.update_yaxes(title_text='stock price')
fig.show()

##### 1. Why did you pick the specific chart?

* The first chart is a candlestick chart of the monthly prices with the closing price overlaid.
* The second chart marks the all-time maximum and minimum of the closing price.

##### 2. What is/are the insight(s) found from the chart?

* The all-time maximum closing price, 367.9, occurs at 2018-07-01.
* The all-time minimum closing price, 9.98, occurs at 2009-03-01.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes: it is helpful for understanding the range of the stock's closing price.

#### Chart - 2:- **`(monthly returns) monthly percentage change of closing stock price`**

In [None]:
# use pct_change to find the percent change for each month
eda_data['Return'] =eda_data['Close'].pct_change()
eda_data.dropna(inplace=True)

In [None]:
# Chart - 2 visualization code
fig= go.Figure()
fig.add_trace(go.Scatter(y=eda_data['Return'],x=eda_data.index, mode='markers+lines',line = dict(color='firebrick', width=4, dash='dot')))
fig.update_layout(
    title_text='monthly percentage change in Close price (monthly return)',
    title_x=0.3,
    title_y=0.85,
    font=dict(
       family="Courier New, monospace",
       size=11,
       color="darkslategray"
    )
)
fig.update_yaxes(
    title_text='percentage change'
)
fig.show()

In [None]:
#Now let's get an overall look at the average monthly return using a histogram.
fig=go.Figure()
fig.add_trace(go.Histogram(x=eda_data['Return'],marker=dict(color='red')))
fig.update_layout(
    title_text='distribution of monthly returns',
    title_x=0.5,
    title_y=0.85,
    font=dict(
        family='bold',
        size=15
    )
)
fig.update_xaxes(
    title_text='percentage change'
)
fig.show()

##### 1. Why did you pick the specific chart?

We now analyze the risk of the stock. To do so, we need to look at the monthly changes in the stock price, not just its absolute value, so we use pandas to compute the monthly returns.
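
As a small worked example of what `pct_change` computes, the sketch below uses four hypothetical monthly closing prices. Each return is the current price divided by the previous price, minus one, so the first month has no return and is dropped.

```python
import pandas as pd

# Hypothetical monthly closing prices, for illustration only
close = pd.Series(
    [100.0, 110.0, 99.0, 108.9],
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
)

# return_t = close_t / close_(t-1) - 1; the first value is NaN and is dropped
returns = close.pct_change().dropna()
print(returns)

# The standard deviation of returns is one common, simple proxy for risk
print("volatility:", returns.std())
```

The returns are fractions, so a value of 0.10 corresponds to a 10% monthly gain.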

##### 2. What is/are the insight(s) found from the chart?

* The largest fluctuations in the stock price occur between 2009 and 2010 and between 2018 and 2020.
* From Jan-2009 to May-2009 and from Sep-2019 to Oct-2019 the monthly price rises by up to 60%.
* Most monthly percentage changes lie between -0.2 and 0.2.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes: this helps us understand the monthly changes in the stock price.


#### Chart - 3:- **`all time min and max of Open stock price.`**

In [None]:
# Chart - 3 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data.Open,mode='markers+lines'))
fig.add_annotation(
        x='2018-08-01',
        y=371,
        text="369.95 is the maximum of all time",
        showarrow=True,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="darkslategray"
            ),
        arrowhead=2,
        arrowsize=1,
        arrowwidth=2,
        )
fig.add_annotation(
        x='2009-03-01',
        y=11,
        text="10 is the minimum of all time",
        showarrow=True,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="darkslategray"
            ),
        arrowhead=8,
        arrowsize=1,
        ay=-70,
        )
fig.update_layout(
    title_text='stock OPEN price',
    title_x=0.5,
    title_y=0.85,
    font=dict(
        family='bold',
        size=15
    )
)
fig.update_yaxes(title_text='price')
fig.show()

In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data.Open,line=dict(color='blue'),showlegend=False))
fig.add_trace(go.Bar(x=eda_data.index,y=eda_data.Open ,marker=dict(color='white'),showlegend=False,textposition='auto'))
fig.update_layout(
    title_text='Open price',
    title_x=0.5,
    title_y=0.85,
    font=dict(
        family='bold',
        size=15
    )
)
fig.show()

##### 1. Why did you pick the specific chart?

These charts show the all-time minimum and maximum of the stock's Open price.

##### 2. What is/are the insight(s) found from the chart?

* The all-time maximum Open price is 369.95, at 2018-08-01.
* The all-time minimum Open price is 10, at 2009-03-01.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes: it improves our understanding of the stock.

#### Chart - 4:- **`monthly percentage change in Open price`**

In [None]:
# Chart - 4 visualization code
fig= go.Figure()
fig.add_trace(go.Scatter(y=eda_data['Open'].pct_change(),x=eda_data.index, mode='markers+lines',line = dict(color='blue', width=4, dash='dot')))
fig.update_layout(
    title_text='monthly percentage change in Open price (monthly return)',
    title_x=0.3,
    title_y=0.85,
    font=dict(
       family="Courier New, monospace",
       size=11,
       color="darkslategray"
    )
)
fig.update_yaxes(
    title_text='percentage change'
)
fig.show()

In [None]:
#Now let's get an overall look at the average monthly return using a histogram.
fig=go.Figure()
fig.add_trace(go.Histogram(x=eda_data['Open'].pct_change(),marker=dict(color='blue')))
fig.update_layout(
    title_text='Open monthly percentage change',
    title_x=0.5,
    title_y=0.85,
    font=dict(
        family='bold',
        size=15
    )
)
fig.update_xaxes(
    title_text='percentage change'
)
fig.show()

##### 1. Why did you pick the specific chart?

monthly percentage change in Open price

##### 2. What is/are the insight(s) found from the chart?

* Between Apr-2009 and Jul-2009, and again starting Oct-2019, the monthly percentage change in Open price exceeds 60%.
* Between Oct-2008 and Jan-2009, Sep-2018 and Nov-2018, and Jul-2020 and Aug-2020, the percentage change in Open price is at its lowest, reaching about -40%.
* Most percentage changes lie between 10% and 20%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

#### Chart - 5:- **`yearly changes in Close price.`**

In [None]:
# Chart - 5 visualization code
color=['green',]*13+['red']*3
fig=go.Figure()
fig.add_trace(go.Bar(x=eda_data.resample('Y').mean().index,
                     y=eda_data.resample('Y').mean()['Close'],showlegend=False,marker=dict(color=color),name='Close'))
fig.add_trace(go.Scatter(
    x=eda_data.resample('Y').mean().index,
    y=eda_data.resample('Y').mean()['Close'],
    text=round(eda_data.resample('Y').mean()['Close'],3),
    mode='text',
    textposition='top center',
    textfont=dict(
        size=14,
    ),
    showlegend=False
))
fig.add_trace(go.Scatter(x=eda_data.resample('Y').mean().index,y=eda_data.resample('Y').mean()['Close']*1.2,mode='lines',name='trend',showlegend=False))
fig.update_layout(title_text='Yearly changes in Close price',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(
                      family='bold',
                      size=15
                  ))
fig.update_yaxes(title_text='Price')
fig.show()

##### 1. Why did you pick the specific chart?

This chart shows the yearly average Close price and the trend in the stock price.

##### 2. What is/are the insight(s) found from the chart?

Between 2006 and 2018 the trend is upward; from 2019 to 2021 the closing price of the stock falls suddenly.
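
The yearly averages behind this chart can be sketched without hard-coding the bar colours. The example below is an alternative to the `resample('Y')` call and fixed colour list used in the chart code, not a copy of it; the prices are hypothetical, and colouring a bar red when the yearly mean falls below the previous year's is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical monthly closes: a flat year at 100 followed by a flat year at 40
idx = pd.date_range("2019-01-01", periods=24, freq="MS")
close = pd.Series([100.0] * 12 + [40.0] * 12, index=idx, name="Close")

# Average closing price per calendar year
yearly_mean = close.groupby(close.index.year).mean()

# Assumed rule: red if the yearly mean dropped versus the previous year
changes = yearly_mean.diff().dropna()
colors = ["green"] + ["red" if change < 0 else "green" for change in changes]

print(yearly_mean)
print(colors)
```

Deriving the colours from the data keeps the chart correct if rows are added or removed later.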

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It shows that the stock price declines after 2019.

#### Chart - 6:- **`yearly returns of stock`** .

In [None]:
# Chart - 6 visualization code
color=[]
for i in eda_data.resample('Y').mean()['Return']:
  if i < 0:
    color.append('red')
  else:
    color.append('green')

fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.resample('Y').mean().index,
                         y=eda_data.resample('Y').mean()['Return']*0.2,line=dict(color='#d9967e'),showlegend=False,name='yearly return'))
fig.add_trace(go.Bar(x=eda_data.resample('Y').mean().index,
                     y=eda_data.resample('Y').mean()['Return'],marker=dict(color=color),showlegend=False,name='yearly return'))
fig.update_layout(
    title_text="Yearly returns of stock price",
    title_x=0.5,
    title_y=0.85,
    font=dict(
        family='bold',
        size=15
    )
)
fig.show()

##### 1. Why did you pick the specific chart?

this shows, yearly return of stock price

##### 2. What is/are the insight(s) found from the chart?

* The stock price falls in 2012, 2014, 2016, 2019, 2020, and 2021, and it falls rapidly from 2019 to 2021.
* Every green bar marks a year in which the stock price moved upward; in 2018 the stock rose rapidly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Here we can see the yearly increases and decreases in the stock price.

#### Chart - 7:- **`All time high price of stock .`**

In [None]:
# Chart - 7 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data.High,mode='markers+lines'))
fig.add_annotation(
    text='All time high price :- 404',
    x='2018-08-01',
    y=408,
    arrowhead=4,
    font=dict(
        family="Courier New, monospace",
        size=13
    )
)
fig.update_layout(
    title_text='All time high price',
    title_x=0.5,
    title_y=0.85,
    font=dict(
        size=15,
        family='bold'
    )
)
fig.show()

##### 1. Why did you pick the specific chart?

showing all time high price of stock.

##### 2. What is/are the insight(s) found from the chart?

All time high price is 404.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

yes

#### Chart - 8:- **`All time Low price of stock.`**

In [None]:
# Chart - 8 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Low'],mode='lines+markers',line=dict(color='red')))
fig.add_annotation(
    text='All time Low = 5.55',
    x='2020-03-01',
    y=6,
    arrowhead=3,
    arrowsize=2,
    ay=-90,
    ax=30
)
fig.update_layout(
    title_text='All time Low price ',
    title_x=0.5,
    title_y=0.85,
    font=dict(
        family='bold',
        size=15
    )
)
fig.update_yaxes(title_text='price')
fig.show()

##### 1. Why did you pick the specific chart?

All time low price of stock .

##### 2. What is/are the insight(s) found from the chart?

all time low price is 5.55

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes: it helps to understand the stock.

#### Chart - 9:- **`simple moving average (SMA)`**

In [None]:
# Chart - 9 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'],name='closing price'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].rolling(window=5,min_periods=1).mean(),name='window 5'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].rolling(window=10,min_periods=1).mean(),name='window 10'))
fig.update_layout(title_text='simple moving average [Close price]',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(family='bold',
                            size=15))
fig.show()

##### 1. Why did you pick the specific chart?

simple moving average

##### 2. What is/are the insight(s) found from the chart?

* **Trend**: By looking at the blue line (closing price), you can observe the overall trend of the financial instrument. If the line is consistently going upwards, it indicates an uptrend, while a consistent downward movement suggests a downtrend.
* **Moving Averages**: The orange line represents the 5-month simple moving average, and the green line represents the 10-month simple moving average (each row in the data is one month). When the closing price (blue line) is above the moving averages, it suggests a positive trend, and vice versa.
* **Volatility**: The distance between the blue line and the moving averages can provide insights into the volatility of the financial instrument. When the lines are close together, it indicates relatively low volatility, while wider gaps suggest higher volatility.
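
A small sketch of how the rolling mean behaves may help here. It uses a hypothetical six-value price series and a window of 3 instead of the 5 and 10 used in the chart, so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

# min_periods=1 averages whatever is available until the window fills up,
# so the first values are means over 1 and 2 points instead of NaN
sma3 = close.rolling(window=3, min_periods=1).mean()
print(sma3.tolist())  # 10, 11, 12, 14, 16, 18

# Gap between price and its moving average; a wider gap suggests stronger moves
gap = close - sma3
print(gap.tolist())
```

In a steadily rising series the average lags the price, which is why the closing price sits above the moving average in an uptrend.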

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This shows the trend, moving averages, and volatility of the stock.

#### Chart - 10 :- **`Cumulative moving average (CMA)`**

In [None]:
# Chart - 10 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].expanding().mean(),name='CMA'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'],name='closing price'))
fig.update_layout(title_text='cumulative moving average',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(family='bold',size=15))
fig.update_yaxes(title_text='price')
fig.show()

##### 1. Why did you pick the specific chart?

Cumulative moving average

##### 2. What is/are the insight(s) found from the chart?

* The blue line represents the cumulative moving average (CMA) of the closing price
* **Trend**: By examining the blue CMA line, you can get a sense of the long-term trend in the closing price of the financial instrument. If the CMA is consistently rising over time, it indicates a general upward trend in the price.
* **Price Movements**: The orange line represents the actual closing price of the financial instrument. By comparing the orange closing-price line to the blue CMA line, you can identify periods when the closing price is above or below the CMA. When the closing price is above the CMA, it suggests that the price is performing better than the long-term average, and vice versa.
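
The cumulative moving average is the mean of all observations up to and including the current one. The sketch below shows this on four hypothetical prices and flags the periods where the price is above its long-run average.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([10.0, 20.0, 30.0, 20.0])

# expanding().mean() averages everything seen so far
cma = close.expanding().mean()
print(cma.tolist())  # 10, 15, 20, 20

# True where the price is strictly above its cumulative average
above = close > cma
print(above.tolist())  # False, True, True, False
```

Because every past value carries equal weight, the CMA reacts more and more slowly as the series gets longer.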

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

it indicates a general upward trend in the price

#### Chart - 11 :- **`Exponential moving average (EMA)`**

In [None]:
# Chart - 11 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'],name='closing price'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].ewm(alpha=0.1,adjust=False).mean(),name='alpha=0.1'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].ewm(alpha=0.5,adjust=False).mean(),name='alpha=0.5'))
fig.update_layout(title_text='Exponential moving average (EMA)',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(family='bold',
                            size=15))
fig.update_yaxes(title_text='price')
fig.show()

##### 1. Why did you pick the specific chart?

Exponential moving average (EMA)

##### 2. What is/are the insight(s) found from the chart?

* Closing Price Trend: The first line represents the closing prices of the financial asset over time. By observing this line, we can see the general trend of the asset's price movement. If the line is mostly ascending, it indicates a positive price trend
* Alpha = 0.1: The second line represents the Exponential Moving Average (EMA) of the closing prices with a smoothing factor (alpha) of 0.1. A small alpha gives little weight to the newest observation, so this line is smooth, reacts slowly, and is more useful for identifying longer-term trends.
* Alpha = 0.5 EMA: The third line represents another EMA of the closing prices, but this time with a larger smoothing factor of 0.5. A larger alpha weights recent prices more heavily, so this line tracks the closing price closely and may provide earlier signals for potential price reversals or short-term trends.
* Comparing EMAs: When the EMA lines cross each other or the closing price line, it might signal a shift in the trend direction.
* Volatility: The distance between the closing price line and the EMAs indicates the volatility of the asset's price. If the closing price line is more distant from the EMAs, it indicates higher volatility, and if it's closer, it suggests lower volatility.
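
To make the role of alpha concrete, the sketch below reproduces `ewm(alpha=..., adjust=False)` with the recursion it implements, EMA_t = alpha * price_t + (1 - alpha) * EMA_(t-1). The prices and the helper `ema_manual` are hypothetical and exist only for this illustration.

```python
import pandas as pd


def ema_manual(values, alpha):
    """Recursive EMA matching pandas' ewm(adjust=False): seeded with the first value."""
    out = [values[0]]
    for price in values[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out


# Hypothetical, steadily rising closing prices
close = pd.Series([10.0, 20.0, 30.0, 40.0])

slow = close.ewm(alpha=0.1, adjust=False).mean()  # heavy smoothing, slow to react
fast = close.ewm(alpha=0.5, adjust=False).mean()  # light smoothing, quick to react

print(slow.tolist())  # 10, 11, 12.9, 15.61
print(fast.tolist())  # 10, 15, 22.5, 31.25
print(ema_manual(close.tolist(), 0.5))
```

After four rising prices, the alpha = 0.5 line is much closer to the latest price than the alpha = 0.1 line, which is the lag described in the bullets above.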

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

it helps to understand the trend and volatility of stock price.

#### Chart - 12 :- **`Exponential weighted moving average (EWMA)`**

In [None]:
# Chart - 12 visualization code
fig=go.Figure()
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'],name='closing price'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].ewm(span=4,adjust=False).mean(),name='span=4'))
fig.add_trace(go.Scatter(x=eda_data.index,y=eda_data['Close'].ewm(span=2,adjust=False).mean(),name='span=2'))
fig.update_layout(title_text='Exponential weighted moving average (EWMA)',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(family='bold',
                            size=15))
fig.update_yaxes(title_text='Price')
fig.show()

##### 1. Why did you pick the specific chart?

Exponential weighted moving average: this is used to examine the smoothness and trend of the stock price.

##### 2. What is/are the insight(s) found from the chart?

* EWMA with Span=4: The second line represents the exponentially weighted moving average of the closing prices with a span of 4 periods. In pandas, a span of 4 corresponds to a smoothing factor alpha = 2 / (4 + 1) = 0.4, so most of the weight falls on roughly the four most recent data points.
* EWMA with Span=2: The third line represents another EWMA of the closing prices, but this time with a smaller span of 2 periods. A span of 2 corresponds to alpha = 2 / (2 + 1), about 0.67, which puts even more weight on the most recent data points and makes the line highly responsive to immediate price changes.
* Comparing EWMAs: By comparing the closing price line to the EWMAs with spans of 4 and 2, traders and analysts can identify potential short-term and very short-term trends. When the closing price line diverges significantly from the EWMAs with spans of 4 and 2, it may indicate potential overbought or oversold conditions in the market. Divergence occurs when the price and the EWMA lines move in opposite directions, suggesting a possible upcoming reversal in the short-term trend.
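
The `span` parameter is another way of specifying the smoothing factor: pandas converts it with alpha = 2 / (span + 1). The sketch below checks that equivalence on hypothetical prices; the helper name `span_to_alpha` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd


def span_to_alpha(span):
    """Smoothing factor that pandas derives from an ewm span."""
    return 2.0 / (span + 1.0)


# Hypothetical closing prices, for illustration only
close = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

matches = {}
for span in (4, 2):
    by_span = close.ewm(span=span, adjust=False).mean()
    by_alpha = close.ewm(alpha=span_to_alpha(span), adjust=False).mean()
    # Both parameterisations should give the same smoothed series
    matches[span] = bool(np.allclose(by_span, by_alpha))
    print(span, round(span_to_alpha(span), 3), matches[span])
```

Seen this way, Chart 11 and Chart 12 apply the same smoother; only the way the smoothing strength is specified differs.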

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

yes:-

#### Chart - 13:- **`Seasonal Decomposition of Time Series`**

In [None]:
result = sm.tsa.seasonal_decompose(data['Close'], model='additive', period=10)
plt.figure(figsize=(10, 8))
plt.subplot(4, 1, 1)
plt.plot(result.trend)
plt.title('Trend')
plt.subplot(4, 1, 2)
plt.plot(result.seasonal)
plt.title('Seasonal')
plt.subplot(4, 1, 3)
plt.plot(result.resid)
plt.title('Residual')
plt.subplot(4, 1, 4)
plt.plot(data['Close'])
plt.title('Original Data')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This shows the seasonal, trend, and residual components of the time series.

##### 2. What is/are the insight(s) found from the chart?

* Before 2018 the trend is increasing, and after 2018 the trend is decreasing.

* The seasonal component represents the repeating patterns or periodic fluctuations in the data. Because the decomposition uses `period=10`, the pattern shown repeats every 10 monthly observations rather than every calendar year; a yearly pattern on monthly data would correspond to `period=12`. Identifying and understanding seasonal patterns can be valuable for various applications, such as demand forecasting.

* In the residual component, the noisiest data appear between 2018 and 2020.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps us understand the trend and smoothness of the data, which is useful for trading.

#### Chart - 14 - **`Correlation Heatmap`**

In [None]:
# Correlation Heatmap visualization code
correlation_matrix=eda_data.drop(columns=['Return'],axis=1).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.show()

##### 1. Why did you pick the specific chart?

This shows the correlation between the variables.

##### 2. What is/are the insight(s) found from the chart?

Every feature is extremely correlated with the others, so a single feature, or the average of these features, would suffice for our regression model, because linear regression assumes there is no multicollinearity among the features.

#### Chart - 15 - **`Pair Plot`**

In [None]:
# Pair Plot visualization code
sns.pairplot(eda_data.drop(columns=['Return']))

##### 1. Why did you pick the specific chart?

This shows the relationship of each variable with the others.

##### 2. What is/are the insight(s) found from the chart?

All the variables in this dataset are **`linearly related`** to each other.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

* Dickey-Fuller test: tests whether the time series is stationary.
* Pearson correlation test: tests whether the columns are correlated.
* Testing the significance of individual coefficients: measures the strength of the relationship between each independent variable and the dependent variable.


### Hypothetical Statement - 1 :- **`Dickey-fuller Test`**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

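*  Null hypothesis (H0): the closing-price series has a unit root, i.e. it is non-stationary.
*  Alternative hypothesis (H1): the closing-price series is stationary.
*  Decision rule: if p-value <= 0.05, reject the null hypothesis and treat the series as stationary; if p-value > 0.05, fail to reject the null hypothesis and treat the series as non-stationary.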

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
result = sm.tsa.adfuller(data['Close'])
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:")
for key, value in result[4].items():
    print(f"\t{key}: {value}")

In [None]:
if result[1] <= 0.05:
    print("Reject the null hypothesis. The time series is stationary.")
else:
    print("Fail to reject the null hypothesis. The time series is non-stationary.")

##### Which statistical test have you done to obtain P-Value?

Augmented Dickey-Fuller (ADF) test.

##### Why did you choose the specific statistical test?

It tests whether the time series is stationary.

### Hypothetical Statement - 2 :- **`Pearson correlation test`**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

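*  Null hypothesis (H0): there is no linear correlation between the two columns (the population correlation coefficient is zero).
*  Alternative hypothesis (H1): the two columns are linearly correlated.
*  Decision rule: if p-value <= 0.05, reject the null hypothesis and conclude that the correlation is statistically significant; if p-value > 0.05, fail to reject the null hypothesis.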
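
A minimal sketch of the test on simulated data: a hypothetical "close" series is built as a noisy copy of a hypothetical "open" series, so a strong, significant correlation is expected. The variable names, seed, and noise level are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical prices: close follows open plus a little noise
open_price = np.linspace(10.0, 100.0, 50)
close_price = open_price + rng.normal(scale=2.0, size=50)

r, p_value = stats.pearsonr(open_price, close_price)
print("Pearson r:", r)
print("p-value:", p_value)

# Apply the decision rule at the 5% level
if p_value <= 0.05:
    print("Reject H0: the correlation is statistically significant.")
else:
    print("Fail to reject H0: no significant linear correlation detected.")
```

The p-value only says whether the correlation differs from zero; the size of `r` is what indicates how strong the linear relationship is.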

#### 2. Perform an appropriate statistical test.

In [None]:
from itertools import combinations
import scipy.stats as stats

# Test each distinct pair of columns once; a column paired with itself is trivially r = 1
for i, j in combinations(data.columns.tolist(), 2):
    correlation_coefficient, p_value = stats.pearsonr(data[i], data[j])
    print("           ", i, " : ", j, "           ")
    print("Pearson correlation coefficient:", correlation_coefficient)
    print("p-value:", p_value, "\n")

##### Which statistical test have you done to obtain P-Value?

Pearson correlation test.

##### Why did you choose the specific statistical test?

It shows how strongly each column is linearly correlated with every other column.

### Hypothetical Statement - 3:- **`Testing the Significance of Individual Coefficients`**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): The coefficient of the predictor variable is equal to zero (there is no relationship between the predictor and the response).
* Alternative Hypothesis (H1): The coefficient of the predictor variable is not equal to zero (there is a statistically significant relationship between the predictor and the response).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
X=data.drop(columns=['Close'])
Y=data.Close
# Add a constant to the predictor variables for the intercept term
X = sm.add_constant(X)
stats_model = sm.OLS(Y, X)
results =stats_model.fit()
print(results.summary())

In [None]:
index=results.pvalues.index
value=results.pvalues.values
for column,p_value in zip(index,value):
  if p_value<0.05:
    print(f"{column}: variable is statistically significant in explaining the variation in the target variable")
  else:
    print(f"{column}: variable is statistically NOT significant in explaining the variation in the target variable")

##### Which statistical test have you done to obtain P-Value?

Testing the Significance of Individual Coefficients.

##### Why did you choose the specific statistical test?

This test (a t-test on each coefficient) evaluates the significance of each individual coefficient (slope) in the regression model. Each coefficient represents the change in the response variable associated with a one-unit change in the corresponding predictor variable, assuming all other predictors are held constant.
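
For reference, the t-statistic behind each coefficient's p-value can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the notebook's dataset; the variables `x1`, `x2` and the generating coefficients are assumptions made up for the example.

```python
import numpy as np

# Synthetic data (assumption): y depends on x1 but not on x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

# Design matrix with an intercept column, similar to what sm.add_constant builds.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate: beta = (X'X)^-1 X'y
xtx_inv = np.linalg.inv(X.T @ X)
beta = xtx_inv @ X.T @ y

# Residual variance with n - k degrees of freedom (k = number of columns).
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof

# Standard error and t-statistic of each coefficient.
std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
t_stats = beta / std_err

for name, b, t in zip(["const", "x1", "x2"], beta, t_stats):
    print(f"{name}: coef={b:.3f}, t={t:.2f}")
```

A coefficient whose absolute t-statistic is large relative to the t distribution with `n - k` degrees of freedom gets a small p-value, and the null hypothesis of a zero coefficient is rejected.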

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
data.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

The output above shows that the dataset has no missing values, so no imputation is needed.

### 2. Handling Outliers

In [None]:
def seasonal_decomposition(ts):
    decomposition = STL(ts, seasonal=13)
    result = decomposition.fit()
    return result.trend, result.seasonal, result.resid
trend, seasonal, residual = seasonal_decomposition(data['Close'])
def detect_anomalies(residual, threshold=3.5):
    z_scores = np.abs(zscore(residual))
    anomalies = np.where(z_scores > threshold)[0]
    return anomalies,z_scores

anomalies,z_score = detect_anomalies(residual)
sns.histplot(z_score)
plt.title('z_score')
plt.show()

In [None]:
red_mu=residual.mean()
red_dev=residual.std()
uper_lim=red_mu+(3*red_dev)
lower_lim=red_mu-(3*red_dev)

In [None]:
res_sea=trend+seasonal
fig=go.Figure()
fig.add_trace(go.Scatter(x=data.index,y=data['Close'],name='close price'))
fig.add_trace(go.Scatter(x=res_sea.index,y=res_sea.values,name='trend+seasonality'))
fig.add_trace(go.Scatter(x=data.index[anomalies],y=data['Close'].iloc[anomalies],mode='markers',name='anomalies',marker=dict(color='red')))
fig.add_trace(go.Scatter(x=data.index,y=z_score,name='z_score'))
fig.update_layout(title_text='Anomalies in close price',
                  title_x=0.5,
                  title_y=0.85,
                  font=dict(family='bold',
                            size=15))
# Residuals with a shaded band of mean +/- 3 standard deviations
fig2=go.Figure()
fig2.add_hline(y=uper_lim)
fig2.add_hline(y=lower_lim)
fig2.add_trace(go.Scatter(y=residual,x=data.index,marker=dict(color='red')))
fig2.update_yaxes(title_text='residuals')
fig2.update_layout(shapes=[
    dict(type='rect',
         xref='paper',
         yref='y',
         x0=0,
         y0=lower_lim,
         x1=1,
         y1=uper_lim,
         fillcolor='green',
         opacity=0.2,
         line=dict(width=0),
    )
])
fig2.show()
fig.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

We use Seasonal-Trend decomposition using LOESS (STL) to split the time series into its trend, seasonal, and residual components. Residuals whose absolute z-score exceeds 3.5 are flagged as anomalies, and the residual plot shades the band of mean ± 3 standard deviations for reference.
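
The z-score rule applied to the residuals can be sketched as follows. This is a minimal NumPy version on a toy residual series (an assumption for illustration, not the STL residuals of the notebook); the helper name `zscore_anomalies` is hypothetical.

```python
import numpy as np

def zscore_anomalies(residual, threshold=3.5):
    """Return positions whose absolute z-score exceeds the threshold."""
    residual = np.asarray(residual, dtype=float)
    z = np.abs((residual - residual.mean()) / residual.std())
    return np.where(z > threshold)[0], z

# Toy residual series (assumption): a bounded wave plus one injected spike.
residual = np.sin(np.arange(100))
residual[40] = 15.0

positions, z = zscore_anomalies(residual)
print(positions)
```

Because the spike itself inflates the standard deviation, a single extreme value can partly mask smaller anomalies; that is one reason to inspect the residual plot as well as the flagged positions.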

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
data.info()

#### What all categorical encoding techniques have you used & why did you use those techniques?

No categorical variable is present in the dataset, so no encoding is needed.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Not applicable: the dataset has no text columns.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Not applicable: the dataset has no text columns.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Prepare the dataset for time series forecasting.
data_time=data[['Close']].copy()
# Hypothesis test 1 (Dickey-Fuller) showed that 'Close' is not stationary,
# so we take the log-difference to make the series stationary.
data_time['Close_log_diff']=np.log(data_time['Close'])-np.log(data_time['Close']).shift(1)
# Check stationarity again on the transformed series.
result = sm.tsa.adfuller(data_time['Close_log_diff'].dropna())
if result[1] <= 0.05:
    print(f"p_value is :- {result[1]}")
    print("Reject the null hypothesis. The time series is stationary.")
else:
    print(f"p_value is :- {result[1]}")
    print("Fail to reject the null hypothesis. The time series is non-stationary.")



In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(x=data_time.index,y=data_time['Close_log_diff'],name='log_diff'))
fig.add_trace(go.Scatter(x=data_time.index,y=data_time['Close_log_diff'].rolling(12).mean(),name='rolling mean'))
fig.add_trace(go.Scatter(x=data_time.index,y=data_time['Close_log_diff'].rolling(12).std(),name='rolling std'))
fig.update_layout(title_text='STATIONARY CLOSE VALUES',font=dict(family='bold',size=15),title_x=0.5,title_y=0.85)
fig.show()

The rolling mean and rolling standard deviation show no sustained upward or downward drift, so the log-differenced series is now stationary.
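
The log-difference used above is the log return of the series, and it can be inverted to recover prices. A minimal sketch with NumPy on a toy price series (the values are an assumption chosen so that every step is a 10% increase):

```python
import numpy as np

# Toy price series (assumption): each value is 10% higher than the previous one.
prices = np.array([10.0, 11.0, 12.1, 13.31, 14.641])

# Log-difference: log(P_t) - log(P_{t-1}), the same quantity as Close_log_diff.
log_diff = np.diff(np.log(prices))
print(log_diff)  # every entry equals log(1.1)

# Inverting the transform: cumulative sum of log-differences, then exponentiate.
reconstructed = prices[0] * np.exp(np.cumsum(log_diff))
print(reconstructed)
```

Note that `np.diff` drops the first observation, whereas the pandas `shift(1)` version keeps it as `NaN`, which is why the notebook calls `dropna()` before the test.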

#### 2. Feature Selection

In [None]:
X=data.drop(columns='Close')
y=data.Close

In [None]:
# Use the f_regression method for feature selection
from sklearn.feature_selection import f_regression
f_r=f_regression(X, y, center=True, force_finite=True)
fig=go.Figure()
fig.add_trace(go.Bar(x=X.columns,y=f_r[0],marker=dict(color=['red','yellow','green'])))
fig.update_layout(title_text='f_regression technique',
                  title_x=0.5,title_y=0.85,
                  font=dict(family='bold',size=15))
fig.show()


In [None]:
# Check the variance of each independent variable.
feature_variances = X.var()
fig=go.Figure()
fig.add_trace(go.Bar(x=feature_variances.index,y=feature_variances.values))
fig.update_layout(title_text='variances in independent variable',
                  title_x=0.5,title_y=0.85,
                  font=dict(family='bold',size=15))
fig.update_xaxes(title_text='features name')
fig.show()

##### What all feature selection methods have you used  and why?

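1. `f_regression`: a univariate F-test between each feature and the target.
2. Variance comparison across features (the idea behind the variance threshold method).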

##### Which all features you found important and why?

* By the f_regression scores, the `Low` feature is the most important for the regression model.
* By variance, all features vary by a similar amount over their ranges.

Feature selection shows that the `Low` column is the most informative, but I am not dropping the other features because **`this dataset shows no overfitting problem`**.
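
The score that `f_regression` assigns to a single feature can be sketched from the Pearson correlation between that feature and the target. The example below is a minimal NumPy illustration on toy data; the arrays `strong` and `weak` and the helper `univariate_f_stat` are assumptions invented for the example.

```python
import numpy as np

def univariate_f_stat(x, y):
    """F-statistic of a one-feature linear fit, from the Pearson correlation."""
    r = np.corrcoef(x, y)[0, 1]
    dof = len(y) - 2  # two parameters are estimated: slope and intercept
    return r**2 / (1.0 - r**2) * dof

# Toy data (assumption): y tracks `strong` closely and `weak` hardly at all.
t = np.arange(50, dtype=float)
strong = t + np.sin(t)
weak = np.cos(1.7 * t)
y = 2.0 * t + 5.0 * np.cos(t)

f_strong = univariate_f_stat(strong, y)
f_weak = univariate_f_stat(weak, y)
print(f_strong, f_weak)
```

Because the score looks at one feature at a time, several highly correlated features (as in this dataset) can all receive high scores even though they carry overlapping information.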

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Show the distribution of each variable with a histogram and a Q-Q plot.
for col in data.columns:
  plt.figure(figsize=(14,4))
  plt.subplot(121)
  sns.histplot(data[col], bins=20, kde=True)
  plt.title(col)

  plt.subplot(122)
  stats.probplot(data[col],dist='norm',plot=plt)
  plt.title(col)
  plt.show()

None of the variables in the dataset is normally distributed; the histograms are multimodal. So we transform them toward a normal distribution.

In [None]:
# Apply a power transform to bring every variable closer to a normal distribution.
# A power transform makes the probability distribution of a variable more Gaussian.
pt=PowerTransformer()
transform_data=pt.fit_transform(data)
transform_data=pd.DataFrame(transform_data,columns=data.columns)

In [None]:
for col in transform_data.columns:
  plt.figure(figsize=(14,4))
  plt.subplot(121)
  sns.histplot(transform_data[col], bins=20, kde=True)
  plt.title(col)

  plt.subplot(122)
  stats.probplot(transform_data[col],dist='norm',plot=plt)
  plt.title(col)
  plt.show()

After the transformation, the variables are much closer to a normal distribution.
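
One consequence of transforming the target is that error metrics computed on transformed values are expressed in transformed units. The sketch below illustrates this with plain standardization as a stand-in for `PowerTransformer` (an assumption made to keep the example short; the power transform is non-linear, so the exact factor differs there). The toy values are also assumptions.

```python
import numpy as np

# Toy target values on a price-like scale (assumption).
y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 205.0, 240.0, 310.0])

# Stand-in for the transform: plain standardization (zero mean, unit variance).
mu, sigma = y_true.mean(), y_true.std()
z_true = (y_true - mu) / sigma
z_pred = (y_pred - mu) / sigma

mse_original = np.mean((y_true - y_pred) ** 2)
mse_scaled = np.mean((z_true - z_pred) ** 2)

# The two numbers differ by a factor of sigma**2 although the fit is identical.
print(mse_original, mse_scaled, mse_original / mse_scaled)
```

To compare models trained on original and transformed data fairly, one option is to map the predictions back to the original scale before computing the metrics.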

### 6. Data Scaling

In [None]:
# Scaling your data
table=[['column_name','skewness'],
       ['Open',transform_data['Open'].skew()],
       ['High',transform_data['High'].skew()],
       ['Low',transform_data['Low'].skew()],
       ['Close',transform_data['Close'].skew()]]
print(tabulate(table,headers='firstrow',tablefmt='grid'))

##### Which method have you used to scale you data and why?

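We do not need a separate scaling step: the skewness of the dataset has already been removed by the transformation (see the table above), and `PowerTransformer` standardizes its output to zero mean and unit variance by default.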

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

When a dataset has a large number of features or variables, dimensionality reduction can simplify the data and make it more manageable for analysis and modeling.

This dataset has very few features, so dimensionality reduction is not needed.

In [None]:
# Dimensionality Reduction (If needed)
# No dimensionality reduction is applied.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not applicable: no dimensionality reduction was performed.

### 8. Data Splitting

Here we train the models on two versions of the data:
1. original data
2. transformed data

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Splitting the original data
x_train=data.drop(columns=['Close'])[:110]
x_test=data.drop(columns=['Close'])[110:]
y_train=data['Close'][:110]
y_test=data['Close'][110:]
# Splitting the transformed data
transform_x_train=transform_data.drop(columns=['Close'])[:110]
transform_x_test=transform_data.drop(columns=['Close'])[110:]
transform_y_train=transform_data['Close'][:110]
transform_y_test=transform_data['Close'][110:]

##### What data splitting ratio have you used and why?

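The first 110 rows are used for training and the remaining 75 for testing, a ratio of 110:75 (about 59:41). The rows are split by position rather than at random, so the test set always comes after the training set.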
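
The same order-preserving idea can be extended to cross-validation. The sketch below is a plain-Python illustration of expanding-window folds, where each fold trains on everything before its test block; the function name and parameters are hypothetical, and scikit-learn's `TimeSeriesSplit` offers a comparable scheme.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs that respect time order."""
    for k in range(n_splits):
        train_end = n_samples - (n_splits - k) * test_size
        if train_end <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test

# Example with 10 observations, 3 folds and 2 test rows per fold.
folds = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in folds:
    print(len(train), test)
```

Every training index is smaller than every test index in the same fold, so no fold is evaluated on observations that precede its training data.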

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No. This is a regression problem, so class imbalance does not apply, and no imbalanced variable is present.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not applicable.

## ***7. ML Model Implementation***

### ML Model - 1:- **`ElasticNet`**

**`note`**:- the model is implemented on both versions of the data.
1. Original data.
2. Transformed data.

In [None]:
# ML Model - 1 Implementation
model=ElasticNet()
t_model=ElasticNet()
# Fit the Algorithm
model.fit(x_train,y_train)
t_model.fit(transform_x_train,transform_y_train)
# Predict on the model
y_pred=model.predict(x_test)
t_y_pred=t_model.predict(transform_x_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# r2_score expects (y_true, y_pred) in that order
r2=r2_score(y_test,y_pred)
t_r2=r2_score(transform_y_test,t_y_pred)
cv=np.mean(cross_val_score(model,x_train,y_train,scoring='r2'))
t_cv=np.mean(cross_val_score(t_model,transform_x_train,transform_y_train,scoring='r2'))
table=[['','r2 score','mean(cross_score)'],
       ['without transformation',r2,cv],
       ['with transformation',t_r2,t_cv]]
print(tabulate(table,headers='firstrow',tablefmt='grid'))

In [None]:
# Helper functions for visualizing predictions and evaluation metrics

# Plot predicted values against the original values
def pred_org(y_test,y_pred,text):
  fig=go.Figure()
  fig.add_trace(go.Scatter(x=[min(y_test), max(y_test)],y=[min(y_test), max(y_test)],name='Ideal_line'))
  fig.add_trace(go.Scatter(x=y_test,y=y_pred,mode='markers',name='prediction'))
  fig.update_xaxes(title_text='original value')
  fig.update_yaxes(title_text='predicted value')
  if text=='original':
    fig.update_layout(title_text='EVALUATION METRICS OF ORIGINAL DATA',
                      title_x=0.5,title_y=0.85,
                      font=dict(color='blue',size=15,family='bold'))
  else:
    fig.update_layout(title_text='EVALUATION METRICS OF TRANSFORMED DATA',
                      title_x=0.5,title_y=0.85,
                      font=dict(color='blue',size=15,family='bold'))
  fig.show()

# Compute the error metrics and show them as a table and a bar chart
def error_score(y_test,y_pred):
  mse = np.mean((y_test - y_pred) ** 2)
  mae = np.mean(np.abs(y_test - y_pred))
  rmse = np.sqrt(mse)
  r2 = (1 - (np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)))*100
  header=['','score']
  values=[
       ['Mean Squared Error (MSE)',mse],
       ['Root Mean Squared Error (RMSE)',rmse],
       ['Mean Absolute Error (MAE)',mae],
       ['R-squared (R2)',r2]]
  evaluation_metrics = ['Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)', 'Mean Absolute Error (MAE)', 'R-squared (R2)']
  metric_scores = [mse, rmse, mae, r2]
  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 3), sharex=True)
  axes[0].table(cellText=values, colLabels=header, cellLoc='center', loc='center')
  axes[0].set_xticklabels([])
  axes[0].set_yticklabels([])
  axes[1].bar(evaluation_metrics, metric_scores,  color='blue')
  axes[1].set_xlabel('')
  axes[1].set_ylabel('score')
  axes[1].tick_params(axis='x', rotation=90)
  axes[1].grid(True)
  axes[1].set_xticklabels([])
  for i,j in zip(evaluation_metrics,metric_scores):
    axes[1].text(i,j+2,i.split(' ')[-1])
  plt.tight_layout()
  plt.show()

# Compare the scores on transformed and original data
def com_score(y_test,y_pred,t_y_test,t_y_pred):
  mse = np.mean((y_test - y_pred) ** 2)
  mae = np.mean(np.abs(y_test - y_pred))
  rmse = np.sqrt(mse)
  r2 = (1 - (np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)))*100

  t_mse = np.mean((t_y_test - t_y_pred) ** 2)
  t_mae = np.mean(np.abs(t_y_test - t_y_pred))
  t_rmse = np.sqrt(t_mse)
  t_r2 = (1 - (np.sum((t_y_test - t_y_pred) ** 2) / np.sum((t_y_test - np.mean(t_y_test)) ** 2)))*100
  # cv and t_cv are the global cross-validation means computed earlier
  table=[['','r2 score','mean(cv_score)','Mean Squared Error (MSE)','Root Mean Squared Error (RMSE)','Mean Absolute Error (MAE)'],
        ['without transformation',r2,cv,mse,rmse,mae],
        ['with transformation',t_r2,t_cv,t_mse,t_rmse,t_mae]]
  print('                                   METRIC SCORE CHART: TRANSFORMED AND ORIGINAL DATA ')
  print(tabulate(table,headers='firstrow',tablefmt='grid'))

# Compare a model with and without hyperparameter tuning
def com_tunning_model(y_test,y_pred,t_y_pred):
  t_y_test=y_test
  mse = np.mean((y_test - y_pred) ** 2)
  mae = np.mean(np.abs(y_test - y_pred))
  rmse = np.sqrt(mse)
  r2 = (1 - (np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)))*100

  t_mse = np.mean((t_y_test - t_y_pred) ** 2)
  t_mae = np.mean(np.abs(t_y_test - t_y_pred))
  t_rmse = np.sqrt(t_mse)
  t_r2 = (1 - (np.sum((t_y_test - t_y_pred) ** 2) / np.sum((t_y_test - np.mean(t_y_test)) ** 2)))*100
  table=[['','r2 score','Mean Squared Error (MSE)','Root Mean Squared Error (RMSE)','Mean Absolute Error (MAE)'],
        ['without tuning',r2,mse,rmse,mae],
        ['with tuning',t_r2,t_mse,t_rmse,t_mae]]
  print('                                   METRIC SCORE CHART: WITH AND WITHOUT TUNING ')
  print(tabulate(table,headers='firstrow',tablefmt='grid'))

In [None]:
# Evaluation metrics and score chart on original data
pred_org(y_test,y_pred,'original')
error_score(y_test,y_pred)

In [None]:
# Evaluation metrics and score chart on transformed data
pred_org(transform_y_test,t_y_pred,'transform')
error_score(transform_y_test,t_y_pred)

In [None]:
# comparison between transformed and original data
com_score(y_test,y_pred,transform_y_test,t_y_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

**`note:-`** Hyperparameter tuning is done on the transformed data.

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid = {
    'max_iter': [1, 5, 10,15],
    'alpha': [0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.5, 0.9]
}
grid_search = GridSearchCV(estimator=t_model, param_grid=param_grid, cv=5)
# Fit the Algorithm
grid_search.fit(transform_x_train, transform_y_train)
print(grid_search.best_params_)
# Refit with the best parameters and predict on the test set
t2_model=ElasticNet(**grid_search.best_params_)
t2_model.fit(transform_x_train, transform_y_train)
t2_y_pred=t2_model.predict(transform_x_test)

In [None]:
# Original vs predicted values after hyperparameter tuning
pred_org(transform_y_test,t2_y_pred,'transform')

##### Which hyperparameter optimization technique have you used and why?

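GridSearchCV with 5-fold cross-validation. The grid is small (4 × 3 × 3 = 36 combinations), so an exhaustive search is cheap.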
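
For context, the hyperparameters `alpha` and `l1_ratio` searched above control the penalty in the objective that scikit-learn's `ElasticNet` minimizes. Writing $\alpha$ for `alpha`, $\rho$ for `l1_ratio`, and $n$ for the number of training samples (notation chosen here for illustration):

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty reduces to the L1 term alone, which is the Lasso; a larger $\alpha$ shrinks the coefficients more strongly.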

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
com_tunning_model(transform_y_test,t_y_pred,t2_y_pred)

The tuned model performs clearly better on the transformed test set:
1. R2 score (in percent) goes from -169 to 96.
2. Mean squared error goes from 2.47 to 0.03.
3. Root mean squared error goes from 1.572 to 0.174.
4. Mean absolute error goes from 1.456 to 0.137.

### ML Model - 2 :- **`LassoCV`**

**`note`**:- the model is implemented on both versions of the data.
1. Original data.
2. Transformed data.

In [None]:
# ML Model - 2 Implementation
from sklearn.linear_model import LassoCV
l_model=LassoCV()
t_l_model=LassoCV()
# Fit the Algorithm
t_l_model.fit(transform_x_train,transform_y_train)
l_model.fit(x_train,y_train)
# Predict on the model
t_l_pred=t_l_model.predict(transform_x_test)
l_pred=l_model.predict(x_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# r2_score expects (y_true, y_pred) in that order
l_r2=r2_score(y_test,l_pred)
t_l_r2=r2_score(transform_y_test,t_l_pred)
# Cross-validate the LassoCV model itself (not the ElasticNet `model`)
l_cv=np.mean(cross_val_score(l_model,x_train,y_train,scoring='r2'))
t_l_cv=np.mean(cross_val_score(t_l_model,transform_x_train,transform_y_train,scoring='r2',cv=10))
table=[['','r2 score','mean(cross_score)'],
       ['without transformation',l_r2,l_cv],
       ['with transformation',t_l_r2,t_l_cv]]
print(tabulate(table,headers='firstrow',tablefmt='grid'))

In [None]:
#here we can see the evaluation metrics and prediction on original data.
pred_org(y_test,l_pred,'original')
error_score(y_test,l_pred)

In [None]:
#here we can see the evaluation metrics and prediction on transform data.
pred_org(transform_y_test,t_l_pred,'transform')
error_score(transform_y_test,t_l_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
alphas =[0.01, 0.1, 1.0, 10.0]
lasso_cv = LassoCV(alphas=alphas, cv=5)
# Fit the Algorithm
lasso_cv.fit(transform_x_train, transform_y_train)
best_alpha = lasso_cv.alpha_
# Predict on the model
lasso_pred=lasso_cv.predict(transform_x_test)


##### Which hyperparameter optimization technique have you used and why?

LassoCV has the search built in: it fits the model for each candidate `alpha` and uses cross-validation to pick the value that gives the best trade-off between model complexity and performance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Tuning did not improve the model.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

In [None]:
com_tunning_model(transform_y_test,t_l_pred,lasso_pred)

The comparison shows that tuning did not help; every metric is slightly worse with the restricted alpha grid:
1. R2 score (in percent) goes from 99.11 to 98.67.
2. Mean squared error goes from 0.008 to 0.012.
3. Root mean squared error goes from 0.089 to 0.110.
4. Mean absolute error goes from 0.058 to 0.0781.

### ML Model - 3 :- **`SARIMAX`**

The closing price series is not stationary and is treated here as seasonal, so we use a Seasonal ARIMA model (SARIMAX) for time series forecasting. The ACF and PACF plots of the log-differenced series below are used to choose the orders p and q.

In [None]:
# Plot the autocorrelation and partial autocorrelation of the log-differenced series
for func in [acf,pacf]:
  if func is acf:
    text='AUTO CORRELATION'
  else:
    text='PARTIAL AUTO CORRELATION'
  fig=go.Figure()
  fig.add_trace(go.Scatter(x=[str(lag) for lag in range(21)],y=func(data_time['Close_log_diff'].dropna(),nlags=20),mode='markers+lines'))
  fig.update_layout(shapes=[
    dict(
         x0=-1,
         y0=-1.96/np.sqrt(len(data_time['Close_log_diff'])),
         x1=21,
         y1=1.96/np.sqrt(len(data_time['Close_log_diff'])),
         fillcolor='red',
         opacity=0.3,
        )
        ],
                    height=450,
                    title_text=text,
                    title_x=0.5,title_y=0.85,font=dict(
                        family='bold',
                        size=15
                    ))
  fig.show()

* p=7
* d=1
* q=7
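
The shaded band in the plots above is ±1.96/√n, the approximate 95% range for the autocorrelation of white noise. Lags whose value falls outside the band are the usual candidates when choosing orders (PACF for p, ACF for q). The sketch below computes a sample autocorrelation by hand with NumPy on a toy alternating series (an assumption for illustration); the helper `sample_acf` is hypothetical.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = centered @ centered
    return np.array([
        (centered[: len(x) - k] @ centered[k:]) / denom
        for k in range(nlags + 1)
    ])

# Toy series (assumption): a strictly alternating sequence of +1 and -1.
x = np.array([1.0, -1.0] * 50)
acf_values = sample_acf(x, nlags=3)

# Approximate 95% band for a white-noise series of the same length.
band = 1.96 / np.sqrt(len(x))
significant = np.abs(acf_values[1:]) > band
print(acf_values, band, significant)
```

For this toy series every lag is far outside the band, which signals strong serial dependence; for a stationary log-differenced price series most lags are expected to fall inside it.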

In [None]:
# ML Model - 3 Implementation
model=sm.tsa.statespace.SARIMAX(data_time['Close'][:185],
                                order=(7, 1, 7),
                                seasonal_order=(7, 1, 7, 12))
# Fit the Algorithm
model=model.fit()
# Predict on the model
predictions = model.predict(160, 190)

In [None]:
fig=go.Figure()
fig.add_trace(go.Scatter(x=data_time.index,y=data_time['Close'],name='original'))
fig.add_trace(go.Scatter(x=data_time.index[160:],y=predictions,name='predicted'))
fig.show()

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("                            Akaike's Information Criterion (AIC) of this model is :- ",model.aic)
error_score(y_test=data_time['Close'][160:],y_pred=predictions)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the hyperparameter ranges to search over.
p_range = range(0, 8)
d_range = range(0, 2)
q_range = range(0, 8)
P_range = range(0, 2)
D_range = range(0, 2)
Q_range = range(0, 2)
s_range = [12]
hyperparameter_combinations = list(product(p_range, d_range, q_range, P_range, D_range, Q_range, s_range))

best_aic = float("inf")
best_params = None

# Loop through all combinations and fit SARIMAX models
for params in hyperparameter_combinations:
    try:
        model = sm.tsa.SARIMAX(data_time['Close'], order=params[:3], seasonal_order=params[3:], enforce_stationarity=False, enforce_invertibility=False)
        results = model.fit()

        # Choose a performance metric (AIC in this case) to evaluate models
        aic = results.aic

        # Update best model if the current model has a lower AIC
        if aic < best_aic:
            best_aic = aic
            best_params = params

    except Exception:
        # Skip parameter combinations that fail to fit
        continue

# Re-fit the best model with the selected hyperparameters
best_model = sm.tsa.SARIMAX(data_time['Close'], order=best_params[:3], seasonal_order=best_params[3:], enforce_stationarity=False, enforce_invertibility=False)
best_results = best_model.fit()

print("Best hyperparameters:", best_params)
print("Best AIC:", best_aic)
print("Best Model Summary:")
print(best_results.summary())


##### Which hyperparameter optimization technique have you used and why?

A plain exhaustive search: every combination of (p, d, q) and seasonal (P, D, Q, s) in the ranges above is fitted, and the model with the lowest AIC is kept.
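
The size of this search and the selection step can be sketched without fitting any model. In the example below, `toy_score` is a hypothetical stand-in for the AIC of a fitted model (an arbitrary function with a known minimum), used only to show how the best combination is picked.

```python
from itertools import product

# Same ranges as in the search above.
grid = list(product(range(0, 8), range(0, 2), range(0, 8),   # p, d, q
                    range(0, 2), range(0, 2), range(0, 2),   # P, D, Q
                    [12]))                                    # s
print(len(grid))  # 8 * 2 * 8 * 2 * 2 * 2 * 1 = 1024

def toy_score(params):
    """Hypothetical stand-in for AIC: an arbitrary function with a known minimum."""
    p, d, q, P, D, Q, s = params
    return abs(p - 2) + abs(q - 1) + d + P + D + Q

# Keep the combination with the lowest score, as the loop above does with AIC.
best = min(grid, key=toy_score)
print(best)
```

With 1024 candidate models the full search can take a long time, so narrowing the ranges using the ACF and PACF plots is a common way to reduce the cost.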

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Mean Squared Error (MSE): The MSE is one of the most widely used metrics for regression problems. It calculates the average of the squared differences between predicted and actual values. The lower the MSE, the better the model performance.
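
The MSE described above, together with the other metrics reported in this notebook, can be sketched in a few lines of NumPy. The helper name `regression_metrics` and the toy values are assumptions for illustration; note that the notebook's own helpers report R-squared multiplied by 100.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(errors)),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Toy values (assumption).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(actual, predicted)
print(metrics)

# R-squared is not symmetric: swapping the arguments changes the result.
swapped_r2 = regression_metrics(predicted, actual)["r2"]
print(swapped_r2)
```

MSE, RMSE and MAE are unchanged when the two arguments are swapped, but R-squared is not, so the true values must always be passed first.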

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I choose **LassoCV**, an L1-regularized linear regression model, as the final model for this dataset because it shows the best performance of all the models.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
# Calculate feature importances using permutation importance
# The 'model' argument can be your trained LassoCV model
perm_importance = eli5.sklearn.PermutationImportance(l_model, random_state=42)
perm_importance.fit(x_train, y_train)

# Display feature importances
eli5.show_weights(perm_importance, feature_names=X.columns.tolist())

In [None]:
eli5.show_prediction(l_model , np.array(x_test)[1],feature_names=data.drop(columns=['Close']).columns.tolist(),show_feature_values=True)

The two tables above show, first, the permutation importance of each feature for the LassoCV model fitted on the training data and, second, how each feature contributed to the predicted value for a single test instance (row 1).

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
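
The save-and-reload sanity check described in this section can be sketched with the standard library's `pickle` module. The example uses a plain dictionary of made-up linear coefficients as a hypothetical stand-in for a fitted model; a fitted scikit-learn estimator can usually be pickled in the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted linear model: its learned parameters.
model_params = {"coef": [0.2, 0.5, 0.3], "intercept": 1.0}

def predict(params, row):
    """Linear prediction: intercept + sum(coef_i * x_i)."""
    return params["intercept"] + sum(c * x for c, x in zip(params["coef"], row))

# Save to a pickle file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model_params, f)

# Load it back and check that predictions are unchanged.
with open(path, "rb") as f:
    loaded_params = pickle.load(f)

row = [10.0, 12.0, 9.0]
print(predict(model_params, row), predict(loaded_params, row))
```

Pickle files should only be loaded from trusted sources, and a model is best reloaded with the same library versions that were used to save it.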


# **Conclusion**

* The dataset contains no null values.
* The outlier analysis flagged three outliers; I did not drop them because they do not affect the variance and the dataset is already small.
* Although the features are highly correlated with each other, I did not drop any column because there are already very few features.
* All variables are statistically significant in explaining the variation in the target variable.
* I transformed the data with a power transformer, which makes the probability distribution of each variable more Gaussian.
* I implemented two regression models, ElasticNet and LassoCV, and one time series forecasting model, SARIMAX (seasonal auto-regressive integrated moving average with exogenous factors).
* LassoCV is the final model among those I implemented because it has the lowest MSE (MSE on transformed data is in transformed units and is not directly comparable with MSE on the original prices):
    1. ElasticNet: MSE is 151.25 on original data and 0.0304 on transformed data.
    2. LassoCV: MSE is 150 on original data and 0.0122 on transformed data.
    3. SARIMAX: MSE is 517 on original data.
* The High and Low features contribute most to the prediction, while the remaining feature, Open, contributes very little.





