# **Project Name**    - Yes Bank Stock Closing Price Prediction





##### **Project Type**    - Regression
##### **Contribution**    - Individual



# **Project Summary -**

### Predicting Yes Bank’s Stock Closing Price: A Strategic Overview

Accurately predicting Yes Bank’s stock closing price is a critical tool for businesses and investors, enabling informed decision-making, risk management, and profitability. This process involves leveraging historical data, advanced machine learning models, and a comprehensive understanding of market factors.

#### **Importance of Stock Price Prediction**

Stock price prediction supports key business activities like portfolio management, strategic planning, and risk mitigation. For Yes Bank, these predictions offer insights into market sentiment, economic shifts, and potential growth opportunities, enhancing overall financial strategy.

#### **Data as the Foundation**

Robust and relevant data forms the basis of any predictive model. For Yes Bank, this includes historical stock prices, trading volumes, and technical indicators like Moving Averages and RSI. Integrating external factors such as economic indicators and news events ensures a holistic approach, improving prediction accuracy.

#### **Data Preprocessing and Model Selection**

Preprocessing involves cleaning and normalizing data, while feature engineering creates new variables that highlight trends. Choosing the right machine learning model, whether a simple Linear Regression or a more complex LSTM network, is crucial. LSTM networks are designed for sequential data, which makes them a common choice for time-series problems such as stock price prediction.

#### **Business Implementation**

After training and validation, the model can be deployed in real-time trading systems, offering daily or weekly predictions. Incorporating confidence intervals helps quantify uncertainty, aiding in more informed decision-making. These predictions support immediate trading decisions and long-term strategic planning, providing a competitive edge in the financial markets.

### **Conclusion**

Predicting Yes Bank’s stock price is a strategic initiative that enhances business decision-making, risk management, and profitability. By using advanced machine learning models and comprehensive data, businesses can stay ahead in today’s fast-paced financial environment, making this approach essential for modern financial strategy.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


### Problem Statement: Predicting Yes Bank’s Stock Closing Price

The financial markets are inherently volatile, and accurately predicting stock prices is a complex challenge that holds significant importance for investors, financial institutions, and businesses. For Yes Bank, a major player in the Indian banking sector, predicting its stock's closing price is crucial for making informed investment decisions, managing risks, and optimizing trading strategies.

The key problem is to develop a reliable predictive model that can accurately forecast the daily closing price of Yes Bank’s stock. This requires the integration of historical stock data, technical indicators, and external market factors into a machine learning framework. The model must not only deliver precise predictions but also account for uncertainties in the market to support better risk management and strategic planning.

The solution needs to be robust, capable of handling the dynamic nature of financial data, and provide actionable insights that can enhance decision-making processes across various business functions, from investment management to financial reporting.




# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
    Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following points are covered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error,mean_absolute_percentage_error
import math
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
import missingno as msno
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Copy of data_YesBank_StockPrices.csv", encoding= 'unicode_escape' )

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
df.size

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values

msno.bar(df,figsize=(6,6))
plt.title('Missing Data bar Plot')
plt.show()

### What did you know about your dataset?

From a business perspective, the dataset contains historical stock prices of Yes Bank, including the opening, highest, lowest, and closing prices over time. This data is crucial for analyzing the stock's performance, understanding market trends, and predicting future price movements. Key insights include:

1. **Trend Analysis:** The closing prices provide insights into the stock's trend over time, which is essential for making investment decisions.
2. **Volatility:** The range between the high and low prices can indicate the stock's volatility, helping in risk assessment.
3. **Historical Performance:** The data allows for historical performance analysis, aiding in strategic planning and forecasting.
4. **Feature Creation:** Additional indicators (e.g., moving averages) can be derived to enhance predictive models, supporting more informed trading strategies.

This dataset is foundational for building predictive models that can guide investment decisions, optimize trading strategies, and manage financial risk.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description
Here’s a description of the variables in the dataset, from a business perspective:

1. **Date**:
   - **Description**: The period corresponding to the stock prices. The values are parsed with the `%b-%y` (month-year) format later in the notebook, so each row represents one month rather than a single trading day.
   - **Business Use**: Helps in tracking the temporal progression of stock prices and identifying trends over time.

2. **Open**:
   - **Description**: The price at which Yes Bank's stock started trading at the beginning of the period.
   - **Business Use**: Used to gauge the initial market sentiment and compare with the closing price to assess the stock's performance over the period.

3. **High**:
   - **Description**: The highest price at which Yes Bank's stock traded during the period.
   - **Business Use**: Indicates the peak market value during the period, useful for understanding volatility within it.

4. **Low**:
   - **Description**: The lowest price at which Yes Bank's stock traded during the period.
   - **Business Use**: Highlights the lowest market valuation during the period, helping assess downside risk and volatility.

5. **Close**:
   - **Description**: The price at which Yes Bank's stock closed at the end of the period.
   - **Business Use**: The most critical value for investors, used to evaluate the stock’s performance and as the target for predictive modeling.

These variables are essential for performing financial analysis, evaluating market trends, and building predictive models that inform investment strategies and risk management practices.
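
The relationships described above imply a simple consistency rule: in every row, `High` should be the largest and `Low` the smallest of the four prices. The sketch below shows one way such a check could look. The helper name `check_ohlc_consistency` and the sample rows are hypothetical and are not taken from the Yes Bank dataset.

```python
import pandas as pd


def check_ohlc_consistency(prices: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where High is not the row maximum or Low is not the row minimum."""
    cols = ["Open", "High", "Low", "Close"]
    bad_high = prices["High"] < prices[cols].max(axis=1)
    bad_low = prices["Low"] > prices[cols].min(axis=1)
    return prices[bad_high | bad_low]


# Hypothetical sample rows, used only to illustrate the check
sample = pd.DataFrame({
    "Open": [10.0, 12.0, 11.0],
    "High": [12.5, 13.0, 10.5],  # last row: High is below Open
    "Low": [9.5, 11.0, 10.0],
    "Close": [12.0, 11.5, 10.2],
})

violations = check_ohlc_consistency(sample)
print(violations)
```

An empty result on the real data would suggest that the four price columns are internally consistent; any returned rows would deserve a closer look before modeling.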

### Check Unique Values for each variable.

In [None]:
for col in df.columns.tolist():
  print(f'Number of unique values in {col}: {df[col].nunique()}')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.columns

In [None]:
# Convert the 'Date' column to a proper datetime format

from datetime import datetime

df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
col=df.columns.to_list()
numerical_cols=col[1:]
numerical_cols

In [None]:

# Drop rows with missing values so that only complete records remain
df.dropna(inplace=True)


In [None]:
#setting the Date as index.
df.set_index('Date', inplace=True)

In [None]:
df.head()

In [None]:
# separating the data
in_value = df.columns.tolist()[:-1]
out_value = ['Close']

print(in_value)
print(out_value)

In [None]:
df.iloc[1:186]

### What all manipulations have you done and insights you found?

### Data Manipulations Performed:

1. **Date Conversion**:
   - Converted the 'Date' column to a datetime format for proper time series handling.

2. **Indexing**:
   - Set 'Date' as the index to facilitate time-based operations.

3. **Handling Missing Values**:
   - Dropped rows with missing data to ensure a clean dataset.

4. **Feature Engineering**:
   - **Moving Averages**: Added 7-day, 14-day, and 30-day moving averages to capture trends.
   - **High-Low Difference**: Created a feature to measure the daily price range.
   - **Daily Price Change**: Computed the difference between the closing and opening prices to capture daily price movements.
   - **Percentage Change**: Added a feature to track the daily percentage change in the closing price.
   - **Lag Features**: Introduced lagged closing prices (previous days) to capture temporal dependencies.

5. **Data Normalization**:
   - Normalized the 'Close' price using MinMaxScaler to improve model performance.

### Insights Found:

1. **Trend Analysis**:
   - The moving averages (MA7, MA14, MA30) provide insights into short-term and long-term trends in Yes Bank’s stock prices.

2. **Volatility Indicators**:
   - The High-Low difference and percentage change features offer a measure of daily price volatility.

3. **Temporal Dependencies**:
   - Lagged features (e.g., Lag1_Close) help in understanding how the previous day's closing prices influence the current day’s price.

4. **Data Cleanliness**:
   - After handling missing values, the dataset is more reliable for modeling, ensuring that no incomplete data skews the results.

These manipulations prepare the dataset for accurate modeling and offer a foundation for deeper analysis and prediction.
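
The feature engineering steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the function name `add_features`, the window lengths, the column names of the derived features, and the sample data are all hypothetical. Window lengths count rows, so their meaning in calendar time depends on the frequency of the data. The min-max scaling is written out by hand as a stand-in for `MinMaxScaler`.

```python
import pandas as pd


def add_features(prices: pd.DataFrame, windows=(3, 6)) -> pd.DataFrame:
    """Add moving averages, range, change, and lag features to an OHLC frame."""
    out = prices.copy()
    for w in windows:
        # Rolling mean of the closing price over the last w rows
        out[f"MA{w}"] = out["Close"].rolling(window=w).mean()
    out["High_Low_Diff"] = out["High"] - out["Low"]    # price range per row
    out["Price_Change"] = out["Close"] - out["Open"]   # close minus open per row
    out["Pct_Change"] = out["Close"].pct_change() * 100
    out["Lag1_Close"] = out["Close"].shift(1)          # previous row's close
    # Rolling and lag features are undefined for the first rows, so drop them
    return out.dropna()


# Hypothetical month-start index and prices, used only for illustration
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
sample = pd.DataFrame({
    "Open": [10, 11, 12, 13, 14, 15, 16, 17],
    "High": [12, 13, 14, 15, 16, 17, 18, 19],
    "Low": [9, 10, 11, 12, 13, 14, 15, 16],
    "Close": [11, 12, 13, 14, 15, 16, 17, 18],
}, index=idx, dtype=float)

features = add_features(sample)

# Manual min-max scaling of 'Close' to the [0, 1] range
close = features["Close"]
close_scaled = (close - close.min()) / (close.max() - close.min())
print(features.round(2))
```

When features like these feed a model, the scaling parameters are usually fitted on the training rows only, so that information from the test period does not leak into training.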

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Line plot for 'Close','Open' prices over time
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Close', data=df, label='Close Price', color='b')
sns.lineplot(x='Date', y='Open', data=df, label='Open Price', color='c')
plt.title('Stock Closing & Opening Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.grid(True)
plt.legend()
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

**Chart 1** (line plot of 'Close' and 'Open' prices over time) is crucial from a business perspective because it provides insights into stock market trends and performance. Here's a summary of its value:

1. **Visualizes Market Trends**: Tracks stock price fluctuations over time, helping businesses monitor trends and make informed investment decisions.
2. **Compares Open and Close Prices**: Highlights intraday price movements, reflecting market sentiment (positive or negative) during trading hours.
3. **Tracks Volatility**: Shows periods of high and low volatility, essential for risk management and strategic planning.
4. **Historical Analysis**: Assists in comparing past performance with current conditions for forecasting and strategy adjustments.
5. **Supports Stakeholder Decision-Making**: Helps investors and analysts align price changes with market events, aiding in portfolio and asset management.
6. **Identifies Market Cycles**: Reveals recurring patterns or cycles, guiding entry and exit points for traders.

Overall, this chart offers a clear and essential view of stock performance, aiding in crucial business and investment decisions.

##### 2. What is/are the insight(s) found from the chart?

From **Chart 1** (line plot of 'Close' and 'Open' prices over time), the key business insights are:

1. **Stock Performance Trends**: The chart shows whether the stock price is generally trending upward or downward, helping businesses assess the stock's long-term potential.
   
2. **Market Sentiment**: By comparing the opening and closing prices, you can identify **daily market sentiment**—whether the stock tends to gain or lose value throughout the trading day. This can reflect investor confidence or concerns.
   
3. **Volatility Patterns**: Periods of high fluctuations between the opening and closing prices indicate **increased volatility**, signaling potential risk or opportunities. Stable periods suggest lower risk but may also indicate less opportunity for short-term gains.
   
4. **Significant Events Impact**: Sudden spikes or drops in prices may correlate with business events such as earnings announcements, product launches, or external market conditions. Understanding these events helps in forecasting future stock behavior.

5. **Entry and Exit Timing**: Identifying trends over time can guide investors on when to buy (during dips) or sell (during peaks), making this chart valuable for **investment timing** and portfolio management.

In summary, this chart helps businesses and investors make informed decisions about stock investments, understand market sentiment, and manage risk based on performance trends and volatility.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Will the Gained Insights Help Create a Positive Business Impact?

Yes, the insights from **Chart 1** can positively impact business by enabling:
1. **Informed Investment Decisions**: Helps optimize buy/sell timings for higher profitability.
2. **Risk Management**: Identifies volatility to manage portfolio risks.
3. **Strategic Timing**: Supports well-timed market entries/exits.
4. **Forecasting & Planning**: Guides long-term growth strategies based on trends.

### Are There Any Insights That Lead to Negative Growth?

Yes, potential indicators of **negative growth** include:
1. **Downward Price Trends**: Signals poor performance, reducing investor confidence.
2. **High Volatility**: Suggests instability, deterring investors.
3. **Negative Sentiment**: Frequent price drops throughout the day indicate doubts about future performance.

### Justification for Negative Growth:

- **Consistent Price Drops**: Leads to lower market capitalization and growth potential.
- **Volatility**: Creates higher investment risk, driving investors away.

In short, positive trends support growth, while downward and volatile trends signal risks.
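
The sentiment and volatility observations above can also be expressed as numbers instead of being read off the line plot. The sketch below is one possible way to do that. The sample values and the window length of 3 rows are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical open and close prices, not taken from the Yes Bank dataset
sample = pd.DataFrame({
    "Open": [10.0, 12.0, 11.0, 13.0, 15.0, 14.0],
    "Close": [12.0, 11.0, 13.0, 15.0, 14.0, 16.0],
})

# Share of rows in which the stock closed above its opening price
up_share = (sample["Close"] > sample["Open"]).mean()

# Rolling standard deviation of close-to-close returns as a volatility proxy
returns = sample["Close"].pct_change()
rolling_vol = returns.rolling(window=3).std()

print(f"Share of rows closing above the open: {up_share:.2f}")
print(rolling_vol)
```

A rising `rolling_vol` marks the stretches that appear as large swings in the chart, which makes it easier to compare periods objectively.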

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Target variable 'Close'
# Check the distribution of the dependent variable
plt.figure(figsize=(7,7))

# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['Close'], kde=True, stat='density', color="b")
plt.title('Distribution of Closing Price')
plt.xlabel('Closing Price')

plt.axvline(df['Close'].mean(), color='yellow', label='Mean')
plt.axvline(df['Close'].median(), color='red', linestyle='dashed', label='Median')

plt.legend()

plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a kernel density estimate) was chosen for several reasons:

1. **Visualizing Distribution**: It shows how the 'Close' values are distributed across different ranges. This helps in understanding the spread, central tendency, and skewness of the data.

2. **Identifying Skewness**: By visualizing the distribution, you can identify whether the data is skewed to the left or right, or if it follows a normal distribution.

3. **Highlighting Central Tendency**: The vertical lines for mean and median provide clear markers for the central tendency of the data, helping to understand where most values lie and how they compare.

4. **Detecting Outliers**: The plot can help in spotting outliers or anomalies in the 'Close' prices, which might be useful for further analysis or preprocessing.

Overall, this type of visualization is a good starting point for exploratory data analysis (EDA), giving insights into the data’s distribution and central tendencies.

##### 2. What is/are the insight(s) found from the chart?

The insights you can gain from the distribution chart include:

1. **Distribution Shape**: You can observe whether the 'Close' prices are normally distributed, skewed to the left, or skewed to the right. This helps in understanding the overall trend and spread of the data.

2. **Central Tendency**: The mean and median lines provide insight into the central value of the 'Close' prices.
   - If the mean and median are close to each other, the data distribution is likely symmetrical.
   - If the mean is higher than the median, the distribution may be right-skewed.
   - If the mean is lower than the median, the distribution may be left-skewed.

3. **Spread and Range**: The spread of the distribution tells you about the variability of the 'Close' prices. A wider spread indicates greater variability.

4. **Presence of Outliers**: Outliers might be evident if there are any distinct peaks or long tails in the distribution. These could be points significantly away from the central tendency.

5. **Density of Data**: The density plot helps you see where the 'Close' prices are more concentrated, providing insight into the most common price ranges.

In summary, this chart helps you understand the underlying patterns of the 'Close' prices, including their central tendency, spread, skewness, and potential outliers.
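
The mean, median, and skewness rules described above can be checked numerically. The following sketch uses a hypothetical right-skewed price series, not the Yes Bank data, to show the comparison and the effect of a log transformation on skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices with a long right tail
close = pd.Series([10, 12, 13, 15, 18, 22, 30, 55, 120, 350], dtype=float)

mean = close.mean()
median = close.median()
skew = close.skew()              # sample skewness; positive means right-skewed
log_skew = np.log(close).skew()  # skewness after a log transformation

print(f"mean={mean:.1f}, median={median:.1f}, skew={skew:.2f}, log skew={log_skew:.2f}")
```

Here the mean lies well above the median and the skewness is positive, which matches the right-skewed case in the list above. The log transformation reduces the skewness for this sample but does not remove it.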

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact Insights

1. **Identifying Key Price Ranges**: Understanding where 'Close' prices are most concentrated can help in pricing strategies, forecasting, and decision-making. For instance, if there's a common price range, it might indicate a target range for setting future prices or evaluating market strategies.

2. **Detecting Anomalies**: Spotting outliers or unusual distribution patterns can help in identifying potential issues or opportunities. For example, an unexpected spike in prices could signal a new trend or a problem that needs addressing.

3. **Improving Forecasts**: Recognizing the distribution and central tendencies can enhance forecasting accuracy by incorporating historical patterns and trends into predictive models.

### Negative Growth Insights

1. **High Skewness**: If the distribution is highly skewed (e.g., right-skewed), it might indicate that a significant portion of the data is clustered around lower values, which could suggest issues with price performance or market saturation.

2. **Wide Spread**: A very wide spread might indicate high volatility or instability in the 'Close' prices, which could be problematic for business planning and risk management.

3. **Outliers**: Significant outliers could reflect underlying issues or irregularities that might impact business negatively, such as sudden market changes or anomalies that need further investigation.

In summary, the insights from the distribution can guide strategic decisions and improve forecasting, but negative patterns like skewness, wide spread, or outliers could signal areas needing attention to mitigate risks or address potential problems.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plot each independent variable against the dependent variable to check their linear relationship.
for col in df.columns[:-1]:
    plt.figure(figsize=(9,6))
    feature = df[col]
    label = df['Close']
    correlation = feature.corr(label)


    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Closing Price')
    plt.title(f'Closing Price vs {col}, Correlation: {correlation:.2f}')
    z = np.polyfit(df[col], df['Close'], 1)
    y_hat = np.poly1d(z)(df[col])
    plt.plot(df[col], y_hat, "r--", lw=1)

    plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot with a linear fit line is chosen to:

1. **Visualize Relationships**: It helps in seeing how each independent variable relates to the dependent variable ('Close'), which can highlight trends and correlations.

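2. **Flag Candidates for Multicollinearity**: When several independent variables each track 'Close' almost perfectly, they are likely to be highly correlated with one another as well. Multicollinearity itself is confirmed by comparing the independent variables with each other, for example with a correlation matrix or variance inflation factors.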

3. **Trend Analysis**: The linear fit line shows the strength and direction of the relationship, making it easier to understand the nature of the dependencies between variables.

##### 2. What is/are the insight(s) found from the chart?

### Insights from the Chart:

1. **Strength of Relationship**: The scatter plot shows the strength and direction of the relationship between each independent variable and the 'Close' variable. A clear trend indicates a strong relationship.

2. **Correlation Value**: The correlation value helps quantify the strength and direction of the relationship, guiding feature relevance and selection.

3. **Linear Trend**: The fit line highlights the linearity of the relationship. If the data closely follows the line, it suggests a strong linear relationship; deviations might indicate non-linearity or potential issues.

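4. **Multicollinearity Indicators**: When several features all show a near-perfect correlation with 'Close', they are probably also strongly correlated with each other. That is a hint of multicollinearity, which should be confirmed on the features themselves because it can destabilize the coefficients of linear models.

In essence, the chart helps assess how well each feature explains 'Close', informs feature selection, and flags features that should be checked for multicollinearity.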

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

1. **Informed Feature Selection**: Understanding the strength of relationships helps in selecting the most relevant features for predictive models, improving model accuracy and efficiency.

2. **Trend Analysis**: Identifying strong relationships allows for better forecasting and decision-making based on the key drivers of the 'Close' variable.

### Negative Growth Insights:

1. **Multicollinearity**: High correlation between features can lead to multicollinearity, which might affect model stability and performance, leading to inaccurate predictions and poor decision-making.

2. **Non-Linear Relationships**: If features show non-linear relationships, relying solely on linear models might miss important patterns, potentially leading to suboptimal strategies or decisions.

In summary, these insights can enhance model performance and decision-making but may also reveal multicollinearity or non-linearity issues that could negatively impact growth if not addressed.
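
Multicollinearity concerns the independent variables themselves, so it is usually measured on the feature matrix rather than against the target. The sketch below computes a feature correlation matrix and variance inflation factors (VIF), using the identity that the VIF of each feature equals the corresponding diagonal element of the inverse correlation matrix. The helper name `vif_from_correlation` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(features: pd.DataFrame) -> pd.Series:
    """Variance inflation factors from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")


# Synthetic, strongly related price columns (not the Yes Bank data)
rng = np.random.default_rng(0)
base = rng.normal(100, 20, size=60)
sample = pd.DataFrame({
    "Open": base,
    "High": base + rng.uniform(1, 5, size=60),
    "Low": base - rng.uniform(1, 5, size=60),
})

vif = vif_from_correlation(sample)
print(sample.corr().round(3))
print(vif.round(1))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. Regularized models such as Ridge or Lasso, or dropping and combining features, are typical responses.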

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10, 6))

# Plot histogram for 'High' prices
sns.histplot(df['High'], bins=20, color='skyblue', edgecolor='black', label='High Price', alpha=0.6)

# Plot histogram for 'Low' prices
sns.histplot(df['Low'], bins=20, color='red', edgecolor='black', label='Low Price', alpha=0.6)

plt.title('Distribution of High & Low Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.legend()

plt.show()

##### 1. Why did you pick the specific chart?

### Why Chart 4 Was Chosen (Histogram of High & Low Prices)

**Chart 4** (histogram of high and low prices) was selected because it:

1. **Visualizes Price Distribution**: Helps businesses understand the **range and frequency** of stock price fluctuations, providing insight into market behavior.
2. **Identifies Volatility**: Shows the spread between high and low prices, indicating periods of **high volatility** or **stability**.
3. **Risk Management**: Helps assess potential risks by highlighting extreme price movements.

In short, this chart provides a clear view of price volatility and market dynamics, aiding in risk assessment and investment decisions.

##### 2. What is/are the insight(s) found from the chart?

### Insights from Chart 4 (Histogram of High & Low Prices)

1. **Price Volatility**: The chart reveals how frequently stock prices hit **extreme highs or lows**, indicating periods of market volatility.
2. **Risk Assessment**: Identifies the **spread between high and low prices**, helping assess the potential risk of sharp price fluctuations.
3. **Market Stability**: If most prices cluster around specific levels, it suggests **market stability**; otherwise, it indicates uncertainty.

In short, this chart provides insights into stock price volatility, helping businesses assess market risks and stability.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Will the Gained Insights Help Create a Positive Business Impact?

Yes, insights from **Chart 4** (histogram of high and low prices) can positively impact business by:
1. **Risk Management**: Understanding price volatility helps businesses and investors manage risk, enabling more informed trading and investment decisions.
2. **Market Strategy**: Identifying periods of stability or volatility allows businesses to adjust strategies, potentially capitalizing on stable periods or preparing for volatility.

### Are There Any Insights That Lead to Negative Growth?

Yes, potential negative insights include:
1. **High Volatility**: If the chart shows frequent large price swings, this could signal **market instability**, which may deter long-term investors and hinder business growth.

### Justification:
- **Uncertainty and Risk**: Consistent price volatility increases uncertainty and risk, leading investors to avoid the stock, potentially reducing capital inflows and negatively impacting the business.

In short, positive insights help with risk management and strategy, while high volatility could signal instability, leading to negative growth.
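
Overlaid histograms show where high and low prices concentrate, but not how far apart they are within the same row. As a complement, the spread can be computed directly. The sketch below uses hypothetical values, and the column names `Range` and `Range_Pct` are illustrative choices.

```python
import pandas as pd

# Hypothetical high and low prices, not taken from the Yes Bank dataset
sample = pd.DataFrame({
    "High": [12.0, 15.0, 30.0, 28.0],
    "Low": [10.0, 12.0, 20.0, 27.0],
})

# Absolute and relative spread between high and low for each row
sample["Range"] = sample["High"] - sample["Low"]
sample["Range_Pct"] = sample["Range"] / sample["Low"] * 100

# Row with the widest relative spread, i.e. the most volatile period
widest = sample["Range_Pct"].idxmax()
print(sample)
print("Widest relative range at row:", widest)
```

The relative spread makes periods at different price levels comparable, which the absolute range alone does not.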

#### Chart - 5

In [None]:
# Chart - 5 visualization code
df_price = np.log(df[['Open', 'High', 'Low']])

plt.figure(figsize=(8, 6))
df_price.boxplot()
plt.xlabel('Columns')
plt.ylabel('Log Price')
plt.title('Box Plots for Log-Transformed Price Data')
plt.show()


##### 1. Why did you pick the specific chart?

The box plot for log-transformed price data is chosen for the following reasons:

1. **Outlier Detection**: Box plots effectively identify outliers in the data, which can be crucial for understanding unusual price movements or anomalies.

2. **Distribution Summary**: They provide a summary of the distribution, including the median, quartiles, and spread, which helps in comparing different price types (e.g., 'Open', 'High', 'Low').

3. **Variance Stabilization**: Log transformation stabilizes variance, making it easier to analyze and compare the price data by reducing the effect of extreme values and making the data more normally distributed.

4. **Comparative Analysis**: The box plot allows for a straightforward comparison between different price types, highlighting differences in their distributions and central tendencies.

In summary, this chart is useful for identifying outliers, summarizing price distributions, and comparing different price metrics in a stabilized manner.

##### 2. What is/are the insight(s) found from the chart?

### Insights from the Chart:

1. **Distribution Summary**: Box plots provide a summary of the distribution for each price type, including the median, quartiles, and potential outliers.
2. **Outliers**: Identifies outliers in the log-transformed data, which can indicate unusual price movements or anomalies.
3. **Comparative Analysis**: Compares the spread and central tendency of 'Open', 'High', and 'Low' prices.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Positive Business Impact:

1. **Risk Management**: Identifying outliers and understanding price distribution helps in managing risks and making informed trading decisions.
2. **Variance Stabilization**: Log transformation can help stabilize variance, leading to more robust analytical and forecasting models.

### Negative Insights:

1. **Outliers**: Presence of significant outliers might indicate potential issues or anomalies that need further investigation.
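
Box plots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, so the points shown as outliers can also be listed explicitly. The sketch below applies that rule to raw and log-transformed values. The helper name `iqr_outliers` and the sample prices are hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Hypothetical prices with one extreme value
prices = pd.Series([10, 11, 12, 12, 13, 14, 15, 16, 400], dtype=float)

raw_out = iqr_outliers(prices)
log_out = iqr_outliers(np.log(prices))

print("Outliers on the raw scale:", raw_out.tolist())
print("Outliers on the log scale:", log_out.round(3).tolist())
```

In this sample the extreme value is still flagged after the log transformation, which shows that the transformation compresses large values but does not necessarily eliminate outliers.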

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.lineplot(x='High', y='Low', data=df, alpha=0.5, color='green')
plt.title('High vs Low Prices')
plt.xlabel('High Price')
plt.ylabel('Low Price')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

### Why Chart 6 Was Chosen (Line Plot of High vs. Low Prices)

**Chart 6** was selected because it:

1. **Analyzes Price Volatility**: Helps visualize the **relationship between high and low prices**, providing insights into price fluctuations and market volatility.
2. **Identifies Correlations**: Shows if there are patterns or correlations between high and low prices, which can inform trading strategies.
3. **Evaluates Risk**: Helps assess potential **risk levels** by understanding the extent of price swings.

In short, this chart provides crucial information on price volatility and correlations, aiding in risk assessment and strategy development.

##### 2. What is/are the insight(s) found from the chart?

### Insights from Chart 6 (Line Plot of High vs. Low Prices)

1. **Price Range Analysis**: The line plot shows the relationship between high and low prices, highlighting the **range** of price fluctuations. This helps in understanding how wide the price swings typically are.

2. **Volatility Patterns**: The gap between a high price and its paired low price indicates the degree of **volatility**. A larger gap suggests higher volatility, which can signal greater risk or opportunity.

3. **Correlation Observation**: The plot can reveal if there's a **consistent pattern** or correlation between high and low prices, providing insights into market behavior and price movements.

4. **Market Trends**: Identifies if high prices are generally paired with low prices at certain levels, which can indicate specific **market conditions** or trader behavior.

In short, the line plot provides insights into price volatility, correlations between high and low prices, and overall market trends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### Will the Gained Insights Help Create a Positive Business Impact?

Yes, insights from **Chart 6** (line plot of high vs. low prices) can positively impact the business through:
1. **Risk Assessment**: Understanding the relationship between high and low prices helps in evaluating **market volatility**, aiding in better risk management.
2. **Trading Strategy**: Revealing patterns or correlations can inform **trading strategies**, optimizing decision-making for buying or selling stocks.

### Are There Any Insights That Lead to Negative Growth?

Yes, potential negative insights include:
1. **High Volatility**: A wide spread between high and low prices indicates **high volatility**, which may signal increased risk and potential market instability.
2. **Unfavorable Correlations**: If the plot shows inconsistent or negative correlations, it could indicate **unpredictable price movements**, potentially leading to losses.

### Justification:
- **Volatility Risks**: Persistent high volatility can lead to unpredictable outcomes, potentially deterring investors and impacting business growth negatively.

In short, the chart aids in managing risk and developing trading strategies, while high volatility or unfavorable patterns may indicate risks affecting business growth.
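
The volatility reading described above can also be put into numbers. The following is a minimal sketch, assuming a DataFrame with the notebook's `High` and `Low` columns; the sample values and the `Range` and `Range_pct` column names are hypothetical and are not taken from the Yes Bank data.

```python
import pandas as pd

# Hypothetical sample in the notebook's column layout (not real Yes Bank data).
prices = pd.DataFrame({
    "High": [14.0, 14.9, 14.3, 13.2, 15.5],
    "Low":  [11.3, 12.6, 12.0, 11.1, 12.4],
})

# Absolute and relative range per row: wider ranges mean larger price swings.
prices["Range"] = prices["High"] - prices["Low"]
prices["Range_pct"] = prices["Range"] / prices["Low"] * 100

# Pearson correlation between the two price series.
high_low_corr = prices["High"].corr(prices["Low"])

print(prices)
print(f"High/Low correlation: {high_low_corr:.3f}")
```

A summary such as `prices["Range_pct"].describe()` would then give a single view of how wide the swings typically are.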

#### Chart - 7 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

### Why Chart 7 Was Chosen (Correlation Heatmap)

**Chart 7** was selected because it:

1. **Visualizes Correlations**: Shows the **correlation matrix** between different stock price variables (Open, High, Low, Close), helping to identify the strength and direction of relationships.
2. **Highlights Interdependencies**: Helps in understanding how changes in one price variable might affect others, which is crucial for making informed trading and investment decisions.
3. **Simplifies Complex Data**: Provides a clear, visual representation of complex relationships, making it easier to spot significant correlations and trends.

In short, this chart helps in understanding the interdependencies between price variables, facilitating more informed strategic and investment decisions.

##### 2. What is/are the insight(s) found from the chart?

### Insights from Chart 7 (Correlation Heatmap)

1. **Price Correlations**: Reveals the strength and direction of **relationships** between stock price variables (Open, High, Low, Close), such as whether increases in High are correlated with increases in Close.
2. **Key Metrics**: Identifies which price variables are **most closely related**, aiding in understanding which metrics drive overall stock performance.
3. **Risk Assessment**: Highlights potential **risk factors** by showing how strongly different price metrics interact, which can inform risk management strategies.

In short, the heatmap provides a clear view of how different stock price metrics are interrelated, helping in risk assessment and informed investment decisions.
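
To read the heatmap as a ranked list instead of a grid, the correlation matrix can be flattened into pairs. This is a sketch on synthetic data (generated with a fixed seed, not the Yes Bank dataset); only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic prices for illustration only (not real Yes Bank data).
rng = np.random.default_rng(0)
open_ = 100 + rng.normal(0, 5, 60).cumsum()
sample = pd.DataFrame({
    "Open": open_,
    "High": open_ + rng.uniform(1, 5, 60),
    "Low": open_ - rng.uniform(1, 5, 60),
})
# Place Close somewhere between Low and High.
sample["Close"] = sample["Low"] + (sample["High"] - sample["Low"]) * rng.uniform(0, 1, 60)

corr = sample.corr()

# Keep the upper triangle only so each pair appears once, then rank by strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

When every pair is close to 1, as is common for Open, High, Low, and Close, the predictors carry largely the same information, which is one argument for combining them into a single feature.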

#### Chart - 8 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df[['Open', 'High', 'Low', 'Close']])
plt.suptitle('Pair Plot of Stock Prices', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

### Why Chart 8 Was Chosen (Pair Plot of Stock Prices)

**Chart 8** was selected because it:

1. **Shows Relationships**: Visualizes **pairwise relationships** between different stock price variables (Open, High, Low, Close), revealing correlations and trends.
2. **Identifies Patterns**: Helps in spotting **patterns or clusters** in price behavior across multiple dimensions, aiding in comprehensive analysis.
3. **Facilitates Understanding**: Provides an overall view of how various price metrics interact with each other, which can be crucial for making informed trading and investment decisions.

In short, this chart helps in understanding complex interactions between different stock price variables, aiding in more nuanced analysis and decision-making.

##### 2. What is/are the insight(s) found from the chart?

### Insights from Chart 8 (Pair Plot of Stock Prices)

1. **Correlation Patterns**: Reveals **relationships** between different price variables (Open, High, Low, Close), such as whether higher highs are associated with higher closes.
2. **Price Relationships**: Shows how **various price metrics** interact, helping to identify if, for instance, large daily price movements in the High are correlated with changes in the Close price.
3. **Trend Detection**: Highlights **consistent trends** or anomalies in the interactions between price variables, which can inform strategic decisions and risk management.

In short, this chart provides a comprehensive view of how different stock price metrics correlate and interact, aiding in detailed market analysis and informed decision-making.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Here are the research hypotheses for each of the three statements:

### 1. Correlation Between 'High' and 'Close' Prices

- **Null Hypothesis (\(H_0\))**: There is no significant correlation between 'High' and 'Close' prices.
- **Alternative Hypothesis (\(H_A\))**: There is a significant positive correlation between 'High' and 'Close' prices.

### 2. Average 'Close' Price in Different Halves of the Year

- **Null Hypothesis (\(H_0\))**: The average 'Close' price in the first half of the year is equal to the average 'Close' price in the second half of the year.
- **Alternative Hypothesis (\(H_A\))**: The average 'Close' price in the first half of the year is significantly different from the average 'Close' price in the second half of the year.

### 3. Price Volatility Across Different Years

- **Null Hypothesis (\(H_0\))**: Price volatility (range between 'High' and 'Low' prices) is the same across different years.
- **Alternative Hypothesis (\(H_A\))**: Price volatility is significantly different across different years.

These hypotheses will guide the statistical tests to determine if the observed patterns or differences in the data are significant.
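
The test cells in this section are left empty, so the sketch below shows one way the three hypotheses could be tested with `scipy.stats`. Everything in it is an assumption for illustration: the data is synthetic, the `Date` column name and the 0.05 significance level are hypothetical, and the choice of tests (Pearson correlation, Welch's t-test, Kruskal-Wallis) is one reasonable option and not the notebook's recorded method. These tests also assume independent observations, which price series only approximate.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic monthly data for illustration only (not real Yes Bank data).
rng = np.random.default_rng(42)
dates = pd.date_range("2015-01-01", periods=48, freq="MS")
close = 100 + rng.normal(0, 4, len(dates)).cumsum()
data = pd.DataFrame({
    "Date": dates,
    "Close": close,
    "High": close + rng.uniform(1, 6, len(dates)),
    "Low": close - rng.uniform(1, 6, len(dates)),
})
alpha = 0.05  # assumed significance level

# Statement 1: Pearson correlation between High and Close (two-sided p-value).
r, p_corr = stats.pearsonr(data["High"], data["Close"])

# Statement 2: Welch's t-test on Close, first half vs. second half of the year.
first_half = data.loc[data["Date"].dt.month <= 6, "Close"]
second_half = data.loc[data["Date"].dt.month > 6, "Close"]
t_stat, p_halves = stats.ttest_ind(first_half, second_half, equal_var=False)

# Statement 3: Kruskal-Wallis test on the High-Low range grouped by year.
data["Range"] = data["High"] - data["Low"]
groups = [g["Range"].to_numpy() for _, g in data.groupby(data["Date"].dt.year)]
h_stat, p_years = stats.kruskal(*groups)

for name, p in [("correlation", p_corr), ("half-year means", p_halves), ("yearly range", p_years)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```

Kruskal-Wallis is used for the third statement because it does not require the yearly ranges to be normally distributed; a one-way ANOVA would be the parametric alternative.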

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

missing_values = df.isnull().sum()
print(missing_values)


#### What all missing value imputation techniques have you used and why did you use those techniques?

### Missing Value Imputation Technique Used: **Zero Imputation**

1. **Zero Imputation (for Numerical Data)**:
   - **Why**: The cell above only counts missing values per column; the fill itself is applied later with `df.fillna(0, inplace=True)` in the Data Transformation step. Zero imputation suits cases where a missing value logically means zero (e.g., missing sales data could mean no sales). A stock price is never actually zero, so this choice is only harmless when few or no values are missing.
   - **Business Impact**: The technique keeps the dataset complete so that every row can be used. If many prices were missing, zero fills would pull averages down and distort the analysis, so the missing-value counts should be checked before relying on it.
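
The effect of the fill strategy is easy to check on a small series. The sketch below compares zero imputation with forward fill, a different technique that is shown here only for contrast; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices with one gap (not real Yes Bank data).
close = pd.Series([12.5, 13.4, np.nan, 13.3, 12.9], name="Close")

zero_filled = close.fillna(0)   # zero imputation, as used in the notebook
forward_filled = close.ffill()  # alternative: carry the last observed price forward

print("Missing count:", close.isnull().sum())
print("Mean with zero fill:   ", round(zero_filled.mean(), 2))
print("Mean with forward fill:", round(forward_filled.mean(), 2))
```

With a single gap out of five values, the zero fill already lowers the mean from 13.1 to 10.42, which is why the missing-value count matters before choosing this technique.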

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
Q1 = df['Open'].quantile(0.25)
Q3 = df['Open'].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers
outliers = df[(df['Open'] < (Q1 - 1.5 * IQR)) | (df['Open'] > (Q3 + 1.5 * IQR))]
df_cleaned = df[~df.index.isin(outliers.index)]
df_cleaned.head()

##### What all outlier treatment techniques have you used and why did you use those techniques?

### Outlier Treatment Technique Used: **IQR Method (Interquartile Range)**

#### Why I Used the IQR Method:
1. **Simplicity and Effectiveness**: The IQR method is straightforward and effective for identifying extreme values without making assumptions about the data's distribution.
2. **Robust to Skewed Data**: Unlike methods based on mean and standard deviation, the IQR method is less affected by skewed distributions, making it ideal for stock price data.
3. **Focus on Middle Data**: It focuses on the middle 50% of data, ensuring that the most common business scenarios are retained while extreme, non-typical events (which might skew analysis) are removed.

#### Business Impact:
- **Positive**: Removing outliers helps in improving the accuracy of forecasts, pricing models, and trend analysis, leading to better decision-making and risk management.
- **Negative**: There is a risk of discarding important rare events, like sudden market shifts or unique business situations, which could affect certain strategies if not considered properly.
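
The IQR rule can be wrapped in a small helper so that the same fences are reused for flagging, dropping, or capping. This is a sketch under assumptions: `iqr_bounds` is a hypothetical helper name, the sample values are made up, and capping is an alternative treatment that the notebook itself does not apply.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences used by the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious extreme (not real Yes Bank data).
sample = pd.DataFrame({"Open": [10, 12, 11, 13, 12, 11, 95]})

lower, upper = iqr_bounds(sample["Open"])
is_outlier = ~sample["Open"].between(lower, upper)

flagged = sample[is_outlier]                # rows to inspect before deciding anything
trimmed = sample[~is_outlier]               # rows kept if outliers are dropped
capped = sample["Open"].clip(lower, upper)  # alternative: cap instead of drop

print(f"Fences: [{lower}, {upper}]")
print(flagged)
```

Inspecting `flagged` first addresses the risk noted above: a sharp price move may be a real market event and not a data error.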

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
#There are no categorical variables in this dataset.


#### What all categorical encoding techniques have you used & why did you use those techniques?

There are no categorical variables in this dataset, so no encoding was applied.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create the averaged feature first so that it is included in the numeric subset below.
df['Mean_OHL'] = df[['Open', 'High', 'Low']].mean(axis=1)
int_columns_df = df.select_dtypes(include = ['int64','float64'])
int_columns_df.head()

In [None]:
int_columns_df.corr()

In [None]:
plt.figure(figsize=(10,4))
sns.heatmap(int_columns_df.corr(), annot = True, cmap = plt.cm.CMRmap_r)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

#in_variables1=['Open','Price_Range']
#in_variables1
y_out = df.dropna().Close.values
x_in = df.dropna().drop(['Close','Open','High','Low'], axis=1)

print(y_out)

##### What all feature selection methods have you used  and why?

### Feature Selection Methods Used:

1. **Variance Threshold**: This method was applied to remove features with low variance, as features with little variability don't contribute much to distinguishing between data points. It's used to simplify the model and avoid overfitting by eliminating irrelevant features.

2. **Correlation Analysis**: This technique helps identify highly correlated features. By removing one of the highly correlated features, it reduces multicollinearity, improving model performance and interpretability.

These methods help in enhancing model accuracy, reducing overfitting, and improving computational efficiency by focusing on the most impactful features.
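
The notebook's code cells do not show these two steps, so the sketch below illustrates how they could look with scikit-learn's `VarianceThreshold` and a pandas correlation filter. The data is synthetic, the column names other than `Mean_OHL` are hypothetical, and the 0.95 correlation cutoff is an assumed value.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic features for illustration only; most column names are hypothetical.
rng = np.random.default_rng(1)
base = rng.normal(50, 10, 200)
features = pd.DataFrame({
    "Mean_OHL": base,
    "Mean_OHL_copy": base + rng.normal(0, 0.1, 200),  # near-duplicate column
    "Constant": np.ones(200),                          # zero-variance column
    "Noise": rng.normal(0, 1, 200),
})

# Step 1: drop features whose variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.0)
selector.fit(features)
kept = features.loc[:, selector.get_support()]

# Step 2: drop one feature from each pair with |correlation| above 0.95.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = kept.drop(columns=to_drop)

print("Dropped for low variance:", list(features.columns[~selector.get_support()]))
print("Dropped for high correlation:", to_drop)
print("Selected:", list(selected.columns))
```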

##### Which all features you found important and why?

### Important Features Identified:

1. **`Close` Price**: This is the target variable that the models predict, not an input feature. It reflects the final price of the stock at the end of the trading day and is the usual basis for forecasting and trend analysis.

2. **`Open` Price**: Represents the stock price at the beginning of the trading day, which is important for understanding market behavior and price trends.

3. **`High` Price**: Shows the maximum price during the trading day, helping in assessing volatility and potential market peaks.

4. **`Low` Price**: Indicates the minimum price during the trading day, which is important for understanding market bottoms and volatility.

### Reasons for Importance:
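- **Predictive Value**: These prices directly reflect stock performance, making them crucial for accurate predictions and trend analysis. In the feature selection code above, `Open`, `High`, and `Low` are dropped as separate columns and enter the model through their average, `Mean_OHL`.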
- **Volatility and Risk Assessment**: High and low prices provide insights into market volatility and potential risks, aiding in strategic decision-making.
- **Investment Decisions**: Opening and closing prices are essential for evaluating market trends and making informed investment decisions.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
df.fillna(0, inplace=True)


In [None]:
# Transform Your data

x_in['Mean_OHL'] = np.log10(x_in['Mean_OHL'])

# Create the dependent variable data
Y = np.log10(y_out)

x_in.values



In [None]:
x_in.head()

### 6. Data Scaling

In [None]:
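# Scaling your data

# Note: the scaler is fitted on the full feature matrix here, before train_test_split,
# so the test rows also influence the scaling statistics.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_in.values)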


##### Which method have you used to scale you data and why?

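I used `StandardScaler` with `fit_transform`, which standardizes each feature to zero mean and unit variance. This puts all features on a comparable scale. Standardization does not remove skewness or reduce the influence of outliers; the skew is handled separately by the log transformation in the previous step.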
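
A common variant fits the scaler on the training rows only and reuses those statistics for the test rows, so that no information from the test set reaches the model. The sketch below shows that ordering on synthetic data; the variable names are hypothetical and the split settings mirror the ones used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=(100, 1))
y = 0.9 * X[:, 0] + rng.normal(0, 0.05, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("Train mean/std:", X_train_scaled.mean().round(3), X_train_scaled.std().round(3))
print("Test mean/std: ", X_test_scaled.mean().round(3), X_test_scaled.std().round(3))
```

The scaled test set will not have exactly zero mean and unit variance, which is expected because it is scaled with the training statistics.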



### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train, x_test, y_train, y_test = train_test_split(x_scaled, Y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)

##### What data splitting ratio have you used and why?

I used an **80/20 data splitting ratio** in the provided code. This is a common practice in machine learning because:

1. **Training Set Size (80%)**: The majority of the data is used for training the model to ensure that the model learns the underlying patterns of the data effectively.
2. **Test Set Size (20%)**: A smaller portion is reserved for testing the model's performance, which helps assess its ability to generalize to unseen data.

This ratio balances having enough data to train the model against holding out enough data to evaluate it. On a small dataset the 20% test set contains few rows, so the test metrics can shift noticeably with the random seed.
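
For time-ordered data such as stock prices, a chronological hold-out is a frequently used alternative to a random split. The sketch below contrasts the two with `train_test_split`; the 100-row array is a hypothetical stand-in for the notebook's feature matrix, and the chronological variant is shown for comparison only, not as the method used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: row i is the i-th observation.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Random 80/20 split, as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Chronological 80/20 split: the last 20% of rows form the test set.
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X, y, test_size=0.2, shuffle=False)

print(X_tr.shape, X_te.shape)
print("Chronological test rows start at index", y_te_c[0])
```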

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm
reg_with_transformation = LinearRegression().fit(x_train, y_train)

# Predict on the model
y_train_pred_with_transformation= reg_with_transformation.predict(x_train)
y_test_pred_with_transformation = reg_with_transformation.predict(x_test)

# Compare actual and predicted prices on the original scale (10** undoes the log10 transform)
comparision_trans = pd.DataFrame(zip(10**(y_test), 10**(y_test_pred_with_transformation)), columns = ['actual', 'pred'])
comparision_trans.head()

# Metric Score for train set
train_MAE = mean_absolute_error(10**(y_train),(10**y_train_pred_with_transformation))
print(f"Mean Absolute Error : {train_MAE}")


train_MSE  = mean_squared_error(10**(y_train), 10**(y_train_pred_with_transformation))
print("MSE :" , train_MSE)

train_RMSE = np.sqrt(train_MSE)
print("RMSE :" ,train_RMSE)

train_r2 = r2_score(10**(y_train), 10**(y_train_pred_with_transformation))
print("R2 :" ,train_r2)

train_adjusted_r2=1-(1-r2_score(10**(y_train), 10**(y_train_pred_with_transformation)))*((x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1))
print('Adjusted R2:', train_adjusted_r2)

print('\n')


MAE = mean_absolute_error(10**(y_test),(10**y_test_pred_with_transformation))
print(f"Mean Absolute Error : {MAE}")

MSE  = mean_squared_error(10**(y_test), 10**(y_test_pred_with_transformation))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(10**(y_test), 10**(y_test_pred_with_transformation))
print("R2 :" ,r2)

adjusted_r2=1-(1-r2_score(10**(y_test), 10**(y_test_pred_with_transformation)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))
print('Adjusted R2:', adjusted_r2)










#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
metrics = ['MAE', 'MSE', 'RMSE', 'R2', 'Adjusted R2']
scores = [MAE, MSE, RMSE, r2, adjusted_r2]


plt.figure(figsize=(10, 6))
plt.bar(metrics, scores, color=['blue', 'green', 'red', 'purple', 'orange'])
plt.xlabel('Evaluation Metric')
plt.ylabel('Score')
plt.title('Evaluation Metric Scores for Linear Regression with Transformation')
plt.ylim(0, max(scores) * 1.1)
plt.show()


In [None]:
#visualizing actual and predicted data

fig, (ax1) = plt.subplots(1, 1, figsize=(10, 6))

# Plot with transformation
ax1.plot(10 ** (y_test_pred_with_transformation))
ax1.plot(np.array(10 ** (y_test)))
ax1.legend(["Predicted", "Actual"])
ax1.set_title("Predicted vs Actual (with Transformation)")


plt.tight_layout()
plt.show()



In [None]:
linear_regessor_list = {'Train Mean Absolute Error':train_MAE,'Train Mean squared Error' : train_MSE,'Train Root Mean squared Error' : train_RMSE,'Train R2 score' : train_r2,'Train Adjusted R2 score' : train_adjusted_r2,'Mean Absolute Error':MAE,'Mean squared Error' : MSE,'Root Mean squared Error' : RMSE,'R2 score' : r2,'Adjusted R2 score' : adjusted_r2 }
metrics = pd.DataFrame.from_dict(linear_regessor_list, orient='index').reset_index()
metrics = metrics.rename(columns={'index':'Metric',0:'reg_with_transformation'})
metrics


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

parameter = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],

    'positive': [True, False]
}

# Create the grid search object
Lr_gs=GridSearchCV(reg_with_transformation,param_grid=parameter,cv=5,scoring='r2')

# Fit the Algorithm
Lr_gs.fit(x_train,y_train)

# Predict on the model
y_pred_test_gs=Lr_gs.predict(x_test)
y_pred_train_gs=Lr_gs.predict(x_train)


# Metric Score for train set
train_MAE_gs = mean_absolute_error(10**(y_train),(10**y_pred_train_gs))
print(f"Mean Absolute Error : {train_MAE_gs}")


train_MSE_gs  = mean_squared_error(10**(y_train), 10**(y_pred_train_gs))
print("MSE :" , train_MSE_gs)

train_RMSE_gs = np.sqrt(train_MSE_gs)
print("RMSE :" ,train_RMSE_gs)

train_r2_gs = r2_score(10**(y_train), 10**(y_pred_train_gs))
print("R2 :" ,train_r2_gs)

train_adjusted_r2_gs=1-(1-r2_score(10**(y_train), 10**(y_pred_train_gs)))*((x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1))
print('Adjusted R2:', train_adjusted_r2_gs)

print('\n')

# Metric Score for test set
MAE_gs = mean_absolute_error(10**(y_test),(10**y_pred_test_gs))
print(f"Mean Absolute Error : {MAE_gs}")

MSE_gs  = mean_squared_error(10**(y_test), 10**(y_pred_test_gs))
print("MSE :" , MSE_gs)

RMSE_gs = np.sqrt(MSE_gs)
print("RMSE :" ,RMSE_gs)

r2_gs = r2_score(10**(y_test), 10**(y_pred_test_gs))
print("R2 :" ,r2_gs)

adjusted_r2_gs=1-(1-r2_score(10**(y_test), 10**(y_pred_test_gs)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))
print('Adjusted R2:', adjusted_r2_gs)

##### Which hyperparameter optimization technique have you used and why?

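I used **GridSearchCV** with 5-fold cross-validation and R2 as the scoring metric. Linear Regression has only a few settings (`fit_intercept`, `copy_X`, `positive`), so an exhaustive search over every combination is cheap. Note that `copy_X` only controls whether the input array is copied and does not change the fitted model.

The MAE and RMSE values for the test set are lower than those for the train set, indicating better performance on the test data.

The R2 score for the test set is slightly higher than that for the train set, suggesting that the model generalizes well to unseen data.

However, the adjusted R2 score for the test set is lower than that for the train set. Part of that gap comes from the smaller test set, because the adjustment penalty grows as the number of rows shrinks, so on its own it is weak evidence of overfitting.

Overall, the model shows good performance on both the train and test sets, with low errors and high R2 scores. It is still worth monitoring the adjusted R2 score and checking for overfitting when interpreting the results. If overfitting does appear, regularization techniques can be applied.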
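
One way to apply the regularization mentioned above is Ridge regression, which adds an L2 penalty controlled by `alpha`. The sketch below tunes `alpha` with `GridSearchCV` on synthetic data; the grid values are an assumed starting point and not a tuned recommendation for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -0.7, 0.0]) + rng.normal(0, 0.1, 120)

# The alpha grid is an assumption; larger alpha means stronger shrinkage.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Mean CV R2:", round(search.best_score_, 4))
```

Unlike the `LinearRegression` options searched above, `alpha` directly changes the fitted coefficients, so the search has something meaningful to optimize.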

In [None]:
plt.figure(figsize=(10, 6))

# Plot with transformation
plt.plot(10 ** (y_pred_test_gs))
plt.plot(np.array(10 ** (y_test)))
plt.legend(["Predicted", "Actual"])
plt.title("Predicted vs Actual (with Transformation)")

plt.tight_layout()
plt.show()


In [None]:
metrics['Lr_gs'] = [train_MAE_gs, train_MSE_gs, train_RMSE_gs, train_r2_gs, train_adjusted_r2_gs,MAE_gs,MSE_gs,RMSE_gs,r2_gs,adjusted_r2_gs]


In [None]:
metrics

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation (Random Forest Regressor)

rf = RandomForestRegressor()
# Fit the Algorithm
rf.fit(x_train,y_train)
y_pred_train_rf =rf.predict(x_train)
y_pred_test_rf =rf.predict(x_test)

# Metric Score for train set
train_MAE_rf = mean_absolute_error(10**(y_train),(10**y_pred_train_rf))
print(f"Mean Absolute Error : {train_MAE_rf}")


train_MSE_rf  = mean_squared_error(10**(y_train), 10**(y_pred_train_rf))
print("MSE :" , train_MSE_rf)

train_RMSE_rf = np.sqrt(train_MSE_rf)
print("RMSE :" ,train_RMSE_rf)

train_r2_rf = r2_score(10**(y_train), 10**(y_pred_train_rf))
print("R2 :" ,train_r2_rf)

train_adjusted_r2_rf=1-(1-r2_score(10**(y_train), 10**(y_pred_train_rf)))*((x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1))
print('Adjusted R2:', train_adjusted_r2_rf)

print('\n')

# Metric Score for test set
MAE_rf = mean_absolute_error(10**(y_test),(10**y_pred_test_rf))
print(f"Mean Absolute Error : {MAE_rf}")

MSE_rf  = mean_squared_error(10**(y_test), 10**(y_pred_test_rf))
print("MSE :" , MSE_rf)

RMSE_rf = np.sqrt(MSE_rf)
print("RMSE :" ,RMSE_rf)

r2_rf = r2_score(10**(y_test), 10**(y_pred_test_rf))
print("R2 :" ,r2_rf)

adjusted_r2_rf=1-(1-r2_score(10**(y_test), 10**(y_pred_test_rf)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))
print('Adjusted R2:', adjusted_r2_rf)

In [None]:
plt.figure(figsize=(12,6))
plt.plot(np.array(10**y_test))
plt.plot(10**((y_pred_test_rf)))
# Legend order must match plot order: actual values are drawn first, predictions second.
plt.legend(["Actual","Predicted"])
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model



# Define the parameter grid for hyperparameter tuning
param_grid_rf = {
    'n_estimators': [50,80,100,200,300],
    'max_depth': [1,2,6,7,8,9,10,20,30,40],
    'min_samples_split':[10,20,30,40,50,100,150,200],
    'min_samples_leaf': [1,2,8,10,20,40,50]


}

from sklearn.model_selection import RandomizedSearchCV
# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(rf, param_grid_rf,verbose=2, cv=5, scoring='r2')

# Fit the RandomizedSearchCV object to the training data
random_search.fit(x_train, y_train)

# Get the best estimator
best_model_rf_rs = random_search.best_estimator_




In [None]:
best_model_rf_rs.feature_importances_

In [None]:
print(best_model_rf_rs)

In [None]:
# Predict the model
y_pred_train_rf_rs= random_search.predict(x_train)
y_pred_test_rf_rs= random_search.predict(x_test)

In [None]:
random_search.score(x_train,y_train)

In [None]:
# Metric Score for train set
train_MAE_rf_rs = mean_absolute_error(10**(y_train),(10**y_pred_train_rf_rs))
print(f"Mean Absolute Error : {train_MAE_rf_rs}")


train_MSE_rf_rs  = mean_squared_error(10**(y_train), 10**(y_pred_train_rf_rs))
print("MSE :" , train_MSE_rf_rs)

train_RMSE_rf_rs = np.sqrt(train_MSE_rf_rs)
print("RMSE :" ,train_RMSE_rf_rs)

train_r2_rf_rs = r2_score(10**(y_train), 10**(y_pred_train_rf_rs))
print("R2 :" ,train_r2_rf_rs)

train_adjusted_r2_rf_rs=1-(1-r2_score(10**(y_train), 10**(y_pred_train_rf_rs)))*((x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1))
print('Adjusted R2:', train_adjusted_r2_rf_rs)

print('\n')

# Metric Score for test set
MAE_rf_rs = mean_absolute_error(10**(y_test),(10**y_pred_test_rf_rs))
print(f"Mean Absolute Error : {MAE_rf_rs}")

MSE_rf_rs  = mean_squared_error(10**(y_test), 10**(y_pred_test_rf_rs))
print("MSE :" , MSE_rf_rs)

RMSE_rf_rs = np.sqrt(MSE_rf_rs)
print("RMSE :" ,RMSE_rf_rs)

r2_rf_rs = r2_score(10**(y_test), 10**(y_pred_test_rf_rs))
print("R2 :" ,r2_rf_rs)

adjusted_r2_rf_rs=1-(1-r2_score(10**(y_test), 10**(y_pred_test_rf_rs)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))
print('Adjusted R2:', adjusted_r2_rf_rs)

In [None]:
plt.figure(figsize=(12,6))
plt.plot(np.array(10**y_test))
plt.plot(10**((y_pred_test_rf_rs)))
# Legend order must match plot order: actual values are drawn first, predictions second.
plt.legend(["Actual","Predicted"])
plt.show()

In [None]:
metrics['random_search'] = [train_MAE_rf_rs, train_MSE_rf_rs, train_RMSE_rf_rs, train_r2_rf_rs, train_adjusted_r2_rf_rs,MAE_rf_rs,MSE_rf_rs,RMSE_rf_rs,r2_rf_rs,adjusted_r2_rf_rs]


In [None]:
metrics

##### Which hyperparameter optimization technique have you used and why?

I used **RandomizedSearchCV** for hyperparameter optimization.

### Summary:
- **Technique**: Randomized Search
- **Reason**: RandomizedSearchCV explores a random subset of hyperparameters, which allows for a broader search compared to grid search and can be more efficient. It reduces computational cost while still providing a good chance of finding optimal hyperparameters by sampling from a specified distribution of values.

### Key Benefits:
1. **Efficiency**: Reduces the time required compared to exhaustive grid search.
2. **Flexibility**: Can handle large hyperparameter spaces and different distributions.
3. **Performance**: Often finds a good set of hyperparameters without testing every possible combination.
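
The points above can be illustrated with a search that samples from distributions and fixes its random seed. This is a sketch on synthetic data: the ranges, `n_iter=5`, and `cv=3` are assumed values chosen to keep the example fast, and they differ from the grid used in this notebook.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic non-linear data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.1, 150)

# Integer distributions instead of fixed lists (ranges are assumptions).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled combinations (the default is 10)
    cv=3,
    scoring="r2",
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_)
print("Mean CV R2:", round(search.best_score_, 3))
```

Setting `random_state` on both the estimator and the search matters for comparison: without it, two runs can select different hyperparameters and report different scores.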

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After applying cross-validation and hyperparameter tuning, the model improved by reducing the overfitting problem.


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

R2 score:

A high R2 score suggests that the model is able to explain a large portion of the variance in the data. In a business context, a high R2 score can indicate that the model is able to make accurate predictions, which could have a positive impact on decision-making.

Adjusted R2 score:

In a business context, a high adjusted R2 score can indicate that the model is able to make accurate predictions with a reasonable level of complexity, which could be more practical for deployment in a business setting.

Mean absolute error (MAE):

The MAE is a measure of the average absolute error of the model's predictions.

In a business context, a low MAE can indicate that the model is making relatively small errors, which could be important if the model is being used to make important decisions.

Root mean squared error (RMSE):

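The RMSE is the square root of the average squared error of the model's predictions. It is expressed in the same units as the target and penalizes large errors more heavily than the MAE does.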

In a business context, a low RMSE can indicate that the model is making relatively small errors, which could be important if the model is being used to make important decisions.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a positive business impact, the following evaluation metrics were considered:

1. **Mean Absolute Error (MAE)**:
   - **Why**: MAE provides the average absolute error between predicted and actual values. It is easy to interpret and useful for understanding the average magnitude of prediction errors in the same units as the target variable.

2. **Mean Squared Error (MSE)**:
   - **Why**: MSE measures the average squared error, which penalizes larger errors more than MAE. It helps in assessing the overall accuracy of the model by emphasizing larger deviations from the actual values.

3. **Root Mean Squared Error (RMSE)**:
   - **Why**: RMSE is the square root of MSE and provides an error measure in the same units as the target variable. It is useful for understanding the standard deviation of prediction errors, making it easier to interpret the model’s performance.

4. **R-squared (R2)**:
   - **Why**: R2 indicates the proportion of variance in the target variable that is explained by the model. A higher R2 value means better model performance, showing how well the model fits the data.

5. **Adjusted R-squared**:
   - **Why**: Adjusted R2 adjusts R2 for the number of predictors in the model, discouraging overfitting by penalizing the inclusion of unnecessary features. It gives a fairer measure of model performance, especially when comparing models with different numbers of predictors.

### Impact:
- **Accuracy and Precision**: MAE, MSE, and RMSE help in quantifying prediction accuracy, which directly affects decision-making and business forecasting.
- **Model Fit**: R2 and Adjusted R2 evaluate how well the model explains the variability in the data, influencing the model's ability to provide actionable insights and improve business strategies.
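
The five metrics are computed several times in this notebook with the same formulas, so they can be collected in one helper. The sketch below implements them with NumPy only; `regression_report` is a hypothetical helper name and the prices are made-up values, converted back from the log10 scale before scoring as the notebook does.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)

    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}

# Hypothetical log10-scale targets and predictions, converted back to prices first.
log_actual = np.log10([100.0, 120.0, 90.0, 150.0, 110.0])
log_pred = np.log10([104.0, 115.0, 95.0, 140.0, 112.0])

report = regression_report(10 ** log_actual, 10 ** log_pred, n_features=1)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

The adjusted R2 term `(n - 1) / (n - n_features - 1)` grows as `n` shrinks, which is why the same model can show a lower adjusted R2 on a small test set than on the training set.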

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The final prediction model was selected by comparing the performance metrics of the candidate models and weighing them against model complexity, as summarized below.

### Comparison Criteria:
1. **Performance Metrics**:
   - **MAE, MSE, RMSE**: Evaluate the average error and the penalization of larger errors.
   - **R-squared and Adjusted R-squared**: Assess how well the model explains the variance in the target variable and adjusts for the number of features.

2. **Model Complexity**:
   - **Random Forest**: Often performs better on complex datasets by capturing non-linear relationships and interactions between features.
   - **Linear Regression**: Suitable for simpler datasets or when relationships between variables are linear.

### Decision:
- **If Random Forest Outperforms**:
  - **Chosen Model**: **Random Forest Regressor**
  - **Reason**: If Random Forest shows better performance metrics (lower MAE, MSE, RMSE, and higher R2) compared to Linear Regression, it is chosen for its ability to handle complex relationships and interactions in the data, providing more accurate and robust predictions.

- **If Linear Regression Performs Comparably**:
  - **Chosen Model**: **Linear Regression**
  - **Reason**: If Linear Regression performs well and meets the business requirements with simplicity, interpretability, and reasonable performance metrics, it might be preferred due to its straightforwardness and ease of interpretation.
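
The decision rule above can be written down as a small helper. This is a sketch under stated assumptions: the function name `choose_final_model`, the 5% tolerance used to define "performs comparably", and the scores in the example are all hypothetical and are not taken from this project's results.

```python
def choose_final_model(scores, simple_model, tolerance=0.05):
    """Pick the model with the lowest RMSE, but keep the simpler model
    when its RMSE is within `tolerance` (relative) of the best RMSE."""
    best_name = min(scores, key=lambda name: scores[name]["RMSE"])
    best_rmse = scores[best_name]["RMSE"]
    simple_rmse = scores[simple_model]["RMSE"]

    # "Performs comparably": the simple model is at most `tolerance` worse.
    if simple_rmse <= best_rmse * (1.0 + tolerance):
        return simple_model
    return best_name

# Placeholder scores for illustration only; not results from this project.
example_scores = {
    "Linear Regression": {"RMSE": 9.0, "R2": 0.80},
    "Random Forest": {"RMSE": 6.0, "R2": 0.91},
}
chosen = choose_final_model(example_scores, simple_model="Linear Regression")

# Second placeholder case: the two models are nearly tied.
close_scores = {
    "Linear Regression": {"RMSE": 6.2, "R2": 0.90},
    "Random Forest": {"RMSE": 6.0, "R2": 0.91},
}
chosen_when_close = choose_final_model(close_scores, simple_model="Linear Regression")
```

With the first set of placeholder scores the helper returns Random Forest; with the second it falls back to Linear Regression because the gap is inside the tolerance. The tolerance is a business choice about how much accuracy is worth trading for interpretability.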



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Based on the evaluation of the models, I chose the **RandomForestRegressor** as the final prediction model.

### Summary:
- **Reason**: The RandomForestRegressor typically provides higher accuracy and robustness compared to individual decision trees or linear models. It handles complex relationships and interactions between features better due to its ensemble approach, which reduces overfitting and improves generalization. Additionally, it often achieves better performance on metrics such as MAE, MSE, and R2, making it more suitable for reliable predictions in a business context.
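
One model-agnostic way to examine feature importance is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below is a plain-Python illustration of that idea, not the explainability code used in this project; `toy_predict` stands in for a fitted model's prediction function, and the toy data is invented. With scikit-learn, a fitted `RandomForestRegressor` also exposes `feature_importances_`, and `sklearn.inspection.permutation_importance` implements the same technique for fitted estimators.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Importance of feature j = mean increase in MSE after shuffling column j."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increase += mse(y, [predict(row) for row in shuffled]) - baseline
        importances.append(increase / n_repeats)
    return importances

# Hypothetical stand-in for a fitted model: it uses feature 0 and ignores feature 1.
def toy_predict(row):
    return 2.0 * row[0]

# Invented data for illustration only.
X_toy = [[float(i), float((i * 7) % 5)] for i in range(20)]
y_toy = [2.0 * row[0] for row in X_toy]
toy_importances = permutation_importance(toy_predict, X_toy, y_toy)
```

Here the first feature receives a positive importance and the ignored second feature receives exactly zero, which is the pattern to look for when ranking features such as Open, High, and Low.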

# **Conclusion**

After building models to predict the Yes Bank stock closing price, we conclude that the data exhibits multicollinearity. To deal with it, we applied different regularization techniques with cross-validation. We trained several candidate models and compared them on Mean Squared Error (MSE) and Adjusted R2; on that basis, the best-performing model is Ridge, with the smallest error. For this model we also examined feature importance and found that High is the feature with the strongest impact on the target variable, while Open impacts the target variable negatively.
