<a href="https://colab.research.google.com/github/kavyasingh581/Rrgression_project/blob/main/Share_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual/Team
##### **Team Member 1 -** kavya

# **Project Summary -**

This stock market analysis project focuses on understanding historical stock price trends of a company from July 2005 to November 2020 using Exploratory Data Analysis (EDA). The dataset includes monthly records with four main features: Open, High, Low, and Close prices. These variables are critical in financial analysis, as they capture the movement of a stock's value over time. The goal of the project is to uncover meaningful patterns, assess price behavior, and analyze relationships between these variables using Python’s data analysis libraries such as pandas, matplotlib, and seaborn.

The analysis begins with a statistical summary to understand the basic characteristics of the dataset. It reveals that the stock underwent significant fluctuations during the 15-year period, ranging from very low values in the early years to substantial growth in later years, followed by notable dips and rebounds. For example, during the 2008 global financial crisis and the 2020 COVID-19 pandemic, sharp declines in stock prices were clearly observed, reflecting real-world economic events. The statistical summary shows strong variance across the years, with the highest Close prices reaching over ₹350 and the lowest dropping below ₹15. These trends suggest a volatile but potentially high-growth asset.

A key part of the analysis involved calculating the correlation coefficients among Open, High, Low, and Close prices. The results indicated extremely strong positive correlations. For example, the correlation between Open and Close prices was approximately 0.99, meaning that months with a higher opening price almost always had a higher closing price as well. Similarly high correlations were found between High and Low prices, which suggests internal consistency in the dataset and predictable intra-month behavior. This is further supported by scatter plots that visualize the relationship between each pair of variables. The data points formed tight clusters along straight lines, visually reinforcing the idea of high correlation and minimal outliers. These visualizations are essential for understanding whether relationships are linear and whether the data has any anomalies. In this case, no extreme outliers were found beyond what could be explained by major economic events.

Time-wise trends were also observed through line plots, which showed the evolution of the stock price over the years. Major upward trends were seen between 2013 and 2018, while the stock experienced significant drops in 2008 and 2020. This points to the influence of macroeconomic factors on stock performance. Moreover, the dataset is rich enough to be used for further time-series modeling and forecasting, with potential applications in ARIMA, Prophet, or LSTM-based models.

Overall, this project demonstrates how EDA can be used to extract valuable insights from historical stock data. It highlights the importance of understanding price movements, the effects of external shocks on market performance, and the predictive value of historical patterns. The analysis provides a solid foundation for more advanced work, such as risk assessment, forecasting, or trading strategy development. In conclusion, this stock market EDA offers a clear view into stock price behavior over time, reinforcing the value of data-driven approaches in financial analysis.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Stock market prices are highly dynamic and influenced by various economic, political, and social factors. Investors and analysts seek to understand the historical behavior of stock prices to make informed decisions about buying, selling, or holding stocks. However, without a proper analysis of past price trends and relationships among price variables, predicting future movements or assessing risk becomes challenging. This project aims to perform Exploratory Data Analysis (EDA) on monthly stock price data, including Open, High, Low, and Close prices over a 15-year period, to identify patterns, trends, and correlations. The primary objective is to understand how these price variables relate to each other and evolve over time, providing insights that can support better investment strategies and future predictive modeling. By analyzing historical price data, this project addresses the problem of extracting meaningful information from complex financial time series to aid in market analysis and decision-making.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML model used and its performance using an Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### Dataset First View

In [None]:
# Mount Google Drive first so the CSV path below is reachable
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Copy of data_YesBank_StockPrices.csv')

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(' No. of Columns : ',len(df.columns))
print(' No. of Rows : ',len(df))

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null = df.isnull().sum()
null

In [None]:
# Visualizing the missing values
plt.figure(figsize = (10,3))
null.plot(kind = 'bar')
plt.title('Missing Values')
plt.show()

### What did you know about your dataset?

The dataset contains monthly stock price data from July 2005 to November 2020, with columns: Date, Open, High, Low, and Close. It is clean, with no missing values, and shows typical stock market behavior including growth, volatility, and crashes (e.g., 2008, 2020). The four price columns (Open, High, Low, and Close) are highly correlated with one another, which makes the data suitable for regression and time series prediction models.
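The claims above about the date range and monthly completeness can be checked directly. The following is a minimal sketch, assuming the `Date` column holds `Mon-YY` strings (the format parsed later in the wrangling cell); the `sample` frame and its prices are made up for illustration and are not taken from the real CSV.

```python
import pandas as pd

# Illustrative rows only; in the notebook the real values come from `df`
sample = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Sep-05", "Nov-05"],
    "Close": [10.0, 11.0, 12.0, 13.0],
})

# Parse 'Mon-YY' strings into month-start timestamps
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# Every month-start between the first and last observation
expected = pd.date_range(sample["Date"].min(), sample["Date"].max(), freq="MS")

# Months absent from the data (October 2005 is left out on purpose here)
missing_months = expected.difference(pd.DatetimeIndex(sample["Date"]))

print("Range:", sample["Date"].min().date(), "to", sample["Date"].max().date())
print("Missing months:", list(missing_months.strftime("%b-%y")))
```

Running the same check on `df` after the `Date` conversion would show whether any month between the first and last record is absent.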



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
clm = df.columns
for i in clm:
  print(i)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

import pandas as pd

# Load your dataset
# df = pd.read_csv("your_dataset.csv")  # Removed as the dataset is already loaded

# Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

# Sort by Date
df = df.sort_values(by='Date').reset_index(drop=True)

# Check for duplicates and remove them
df = df.drop_duplicates()

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

# Check data types
print("\nData types:\n", df.dtypes)

# Ensure all numerical columns are floats
df[['Open', 'High', 'Low', 'Close']] = df[['Open', 'High', 'Low', 'Close']].astype(float)

# Final shape and info
print("\nFinal dataset shape:", df.shape)
print(df.head())

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1  Monthly Average Close Price (Seasonality Check)
df['Month'] = df['Date'].dt.month
monthly_avg = df.groupby('Month')['Close'].mean()

plt.figure(figsize=(10, 5))
monthly_avg.plot(kind='bar', color='teal')
plt.title("Average Monthly Close Prices")
plt.xlabel("Month")
plt.ylabel("Average Close Price")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart of the average Close price per calendar month to explore seasonal patterns in stock performance.

##### 2. What is/are the insight(s) found from the chart?

Certain months, such as January, April, and December, show higher average close prices, suggesting stronger performance during these times.

Large differences between months would point to seasonality or volatility. Because the chart averages price levels over a 15-year period with a strong trend, these differences should be read with caution.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2  Line Plot (Trend Over Time)
plt.figure(figsize=(14, 6))
plt.plot(df['Date'], df['Close'], color='blue')
plt.title("Close Price Over Time")
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 Box Plot (Price Distributions)
plt.figure(figsize=(10, 5))
sns.boxplot(data=df[['Open', 'High', 'Low', 'Close']])
plt.title("Price Feature Distribution with Outliers")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4
# Rolling Statistics (Moving Average, Rolling Volatility)
df['MA20'] = df['Close'].rolling(window=20).mean()
# Calculate 'Return' column before calculating rolling volatility
df['Return'] = df['Close'].pct_change()
df['Rolling_Volatility'] = df['Return'].rolling(window=20).std()

plt.figure(figsize=(14,6))
plt.plot(df['Date'], df['Close'], label='Close Price', alpha=0.5)
# The data is monthly, so a 20-row window spans 20 months, not 20 days
plt.plot(df['Date'], df['MA20'], label='20-Month Moving Average', color='green')
plt.title("Close Price with 20-Month Moving Average")
plt.legend()
plt.grid(True)
plt.show()

plt.figure(figsize=(14,4))
plt.plot(df['Date'], df['Rolling_Volatility'], label='20-Month Rolling Volatility', color='red')
plt.title("Rolling Volatility Over Time")
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Volatility Over Time (the dataset has no Volume column)
# Volatility is approximated here by the monthly price range, High - Low
df['Volatility'] = df['High'] - df['Low']

plt.figure(figsize=(14,4))
plt.plot(df['Date'], df['Volatility'], label='Volatility (High - Low)', color='red')
plt.title("Stock Volatility Over Time")
plt.xlabel("Date")
plt.ylabel("Volatility")
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Plot the Time Series: Price Over Time
import matplotlib.pyplot as plt

plt.figure(figsize=(14,6))
plt.plot(df['Date'], df['Close'], label='Close Price', color='blue')
plt.plot(df['Date'], df['Open'], label='Open Price', color='orange', alpha=0.7)
plt.title("Stock Prices Over Time")
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend()
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Cumulative Returns Plot
df['Cumulative_Return'] = (1 + df['Return']).cumprod() - 1

plt.figure(figsize=(14,6))
plt.plot(df['Date'], df['Cumulative_Return'], color='green')
plt.title('Cumulative Returns Over Time')
plt.xlabel('Date')
plt.ylabel('Cumulative Return')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Lag Plot
from pandas.plotting import lag_plot

plt.figure(figsize=(6,6))
lag_plot(df['Close'])
plt.title('Lag Plot of Close Price')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Distribution Plot of Returns
sns.histplot(df['Return'], bins=50, kde=True)
plt.title('Distribution of Returns')
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Rolling Correlation (the data is monthly, so window=30 spans 30 months)
rolling_corr = df['Open'].rolling(window=30).corr(df['Close'])

plt.figure(figsize=(14,6))
plt.plot(df['Date'], rolling_corr)
plt.title('30-Month Rolling Correlation between Open and Close Prices')
plt.xlabel('Date')
plt.ylabel('Correlation')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Pairplot for Feature Relationships
sns.pairplot(df[['Open', 'High', 'Low', 'Close', 'Return']])
plt.suptitle('Pairplot of Price Variables and Returns', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#Autocorrelation Plot (ACF Plot)
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(df['Return'].dropna(), lags=30)
plt.title('Autocorrelation of Returns')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
#Heatmap of Monthly or Yearly Returns
df['Year'] = df['Date'].dt.year
returns_pivot = df.pivot_table(index='Year', columns='Month', values='Return')

plt.figure(figsize=(12,6))
sns.heatmap(returns_pivot, annot=True, fmt=".2%", cmap='RdYlGn', center=0)
plt.title('Heatmap of Monthly Returns by Year')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix of Price Features")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df[['Open', 'High', 'Low', 'Close']])
plt.suptitle("Pairwise Scatter Plots of Stock Prices", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.
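One candidate test is sketched below. The notebook does not yet state its hypotheses, so this assumes, purely for illustration, a null hypothesis that the mean monthly return is zero, tested with a one-sample t-test from `scipy.stats`. The helper name `mean_return_ttest` and the toy prices are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_return_ttest(close_prices, alpha=0.05):
    """One-sample t-test of H0: mean simple return == 0 (H1: mean != 0)."""
    prices = np.asarray(close_prices, dtype=float)
    # Simple period-over-period returns, the same quantity as pct_change()
    returns = prices[1:] / prices[:-1] - 1.0
    t_stat, p_value = stats.ttest_1samp(returns, popmean=0.0)
    # Reject H0 when the p-value falls below the significance level
    return float(t_stat), float(p_value), bool(p_value < alpha)

# Made-up prices for illustration; in the notebook this would be df['Close']
toy_close = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]
t_stat, p_value, reject = mean_return_ttest(toy_close)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject}")
```

The t-test assumes roughly independent, approximately normal returns. The returns histogram and ACF plot from the chart section are the natural places to judge whether that assumption is reasonable for this dataset.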

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import numpy as np

# 1. Load the data
# data = pd.read_clipboard()  # Removed as the dataset is already loaded
# data.columns = ['Date', 'Open', 'High', 'Low', 'Close'] # Removed as columns are already named

# Use the existing DataFrame 'df'
data = df.copy()

# 2. Convert 'Date' and sort
# This is already done in the data wrangling section, but we ensure it here as well
data['Date'] = pd.to_datetime(data['Date']) # No format needed as it's already datetime
data = data.sort_values('Date').reset_index(drop=True)

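# 3. Feature Engineering
data['Prev_Close'] = data['Close'].shift(1)
# Drop only the row made invalid by the shift; a bare dropna() would also
# drop the early rows where chart columns such as MA20 are still NaN
data = data.dropna(subset=['Prev_Close']).reset_index(drop=True)

# 4. Features and target
X = data[['Open', 'High', 'Low', 'Prev_Close']]
y = data['Close']

# 5. Train/Test Split (chronological: first 80% train, last 20% test)
train_size = int(len(X) * 0.8)
X_train_raw, X_test_raw = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]
test_dates = data['Date'].iloc[train_size:]

# 6. Normalize features (important for KNN)
# Fit the scaler on the training rows only so test statistics do not leak in
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)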


# 7. Train KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

# 8. Predict
y_pred = knn.predict(X_test)

# 9. Evaluation
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R² Score:", r2_score(y_test, y_pred))

rmse = np.sqrt(mse)  # Calculate RMSE by taking the square root of MSE
print(f'RMSE: {rmse:.3f}')
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE for KNN: {mae:.3f}")


# 10. Plotting
plt.figure(figsize=(10,6))
plt.plot(test_dates, y_test.values, label='Actual')
plt.plot(test_dates, y_pred, label='Predicted', linestyle='--')
plt.title('KNN Regression: Actual vs Predicted Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Load data
# data = pd.read_clipboard() # Removed as the data is already loaded
# data.columns = ['Date', 'Open', 'High', 'Low', 'Close'] # Removed as columns are already named

# Use the existing DataFrame 'df'
data = df.copy()

# 2. Convert 'Date' and sort
# This is already done in the data wrangling section, but we ensure it here as well
data['Date'] = pd.to_datetime(data['Date'])  # No format needed as it's already datetime
data = data.sort_values('Date').reset_index(drop=True)

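# 3. Feature Engineering
data['Prev_Close'] = data['Close'].shift(1)
# Drop only the row made invalid by the shift; a bare dropna() would also
# drop the early rows where chart columns such as MA20 are still NaN
data = data.dropna(subset=['Prev_Close']).reset_index(drop=True)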

# 4. Define features and target
X = data[['Open', 'High', 'Low', 'Prev_Close']]
y = data['Close']
dates = data['Date']

# 5. Chronological train/test split (last 20% of data as test)
train_size = int(len(X) * 0.8)
X_train_raw, X_test_raw = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]
test_dates = dates.iloc[train_size:]

# 6. Scale the features, fitting the scaler on the training rows only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# 7. TimeSeriesSplit for Cross-Validation (preserves time order)
tscv = TimeSeriesSplit(n_splits=5)

# 8. Grid Search with CV, run on the training rows only so that the
#    test set plays no part in choosing k
param_grid = {
    'n_neighbors': list(range(2, 21))
}

knn = KNeighborsRegressor()
grid_search = GridSearchCV(knn, param_grid, cv=tscv, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# 9. Best Params and Best Score
best_k = grid_search.best_params_['n_neighbors']
print("Best k (n_neighbors):", best_k)
print("Best CV Score (Negative MSE):", grid_search.best_score_)

# 10. Predict on the held-out test set with the best model
# (GridSearchCV refits best_estimator_ on all training rows by default)
knn_best = grid_search.best_estimator_
y_pred = knn_best.predict(X_test)

# 11. Final evaluation
print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R² Score:", r2_score(y_test, y_pred))

# 12. Plot actual vs predicted
plt.figure(figsize=(10,6))
plt.plot(test_dates, y_test.values, label='Actual')
plt.plot(test_dates, y_pred, label='Predicted', linestyle='--')
plt.title(f'KNN Regression (k={best_k}): Actual vs Predicted')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
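The tuning cell above relies on `TimeSeriesSplit` rather than shuffled k-fold cross-validation. The following minimal sketch, using eight placeholder observations rather than the stock data, shows how each fold trains only on rows that precede its validation rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight placeholder observations in chronological order (illustrative only)
X_demo = np.arange(8).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in tscv_demo.split(X_demo)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training index precedes every validation index,
    # so no future observation is used to predict the past
    print(f"Fold {i}: train={train_idx} validate={test_idx}")
```

The training window grows with each fold while the validation window moves forward in time, which mirrors how a forecasting model would actually be used.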

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation (XGBoost) with evaluation metrics
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb
import matplotlib.pyplot as plt

# Load dataset
# df = pd.read_csv("stock_data.csv")  # Replace with your filename # Removed as the data is already loaded
# Use the existing DataFrame 'df'
data = df.copy()

data['Date'] = pd.to_datetime(data['Date'])  # 'Date' is already datetime after wrangling, so no format is needed
data = data.sort_values('Date').reset_index(drop=True)

# Feature engineering
data['Return'] = data['Close'].pct_change()
data['Volatility'] = data['High'] - data['Low']
data['Next_Close'] = data['Close'].shift(-1)

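# Drop only the rows made invalid by pct_change/shift; a bare dropna() would
# also drop the early rows where chart columns such as MA20 are still NaN
data = data.dropna(subset=['Return', 'Next_Close'])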

# Features and target
features = ['Open', 'High', 'Low', 'Close', 'Return', 'Volatility']
X = data[features]
y = data['Next_Close']

# Train-test split (time-based)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Initialize XGBoost Regressor
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("📊 XGBoost Evaluation Metrics:")
print(f"MAE  : {mae:.2f}")
print(f"MSE  : {mse:.2f}")
print(f"RMSE : {rmse:.2f}")
print(f"R²   : {r2:.2f}")

# Plot predictions
plt.figure(figsize=(10, 5))
plt.plot(np.arange(len(y_test)), y_test.values, label='Actual', marker='o') # Use index for plotting
plt.plot(np.arange(len(y_test)), y_pred, label='Predicted', marker='x') # Use index for plotting
plt.title("XGBoost - Actual vs Predicted Closing Price")
plt.xlabel("Test Data Points")
plt.ylabel("Closing Price")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Feature importance
xgb.plot_importance(model, height=0.6, importance_type='gain', title='Feature Importance')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb
import matplotlib.pyplot as plt

# Load and prepare the data
# df = pd.read_csv("stock_data.csv") # Removed as the data is already loaded

# Use the existing DataFrame 'df'
data = df.copy()

data['Date'] = pd.to_datetime(data['Date'])  # 'Date' is already datetime after wrangling, so no format is needed
data = data.sort_values('Date').reset_index(drop=True)


# Feature engineering
data['Return'] = data['Close'].pct_change()
data['Volatility'] = data['High'] - data['Low']
data['Next_Close'] = data['Close'].shift(-1)
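# Drop only the rows made invalid by pct_change/shift; a bare dropna() would
# also drop the early rows where chart columns such as MA20 are still NaN
data = data.dropna(subset=['Return', 'Next_Close'])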

features = ['Open', 'High', 'Low', 'Close', 'Return', 'Volatility']
X = data[features]
y = data['Next_Close']

# TimeSeriesSplit for time-aware cross-validation
tscv = TimeSeriesSplit(n_splits=5)

# XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)

# Hyperparameter grid
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1],
    'colsample_bytree': [0.7, 0.8, 1]
}

# Randomized Search
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=20,
    scoring='neg_mean_squared_error',
    cv=tscv,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

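# Hold out the last 20% of the data (test set) before tuning,
# so the search never sees the rows used for evaluation
split = int(len(data) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Fit the search on the training rows only
random_search.fit(X_train, y_train)

# Best model (refit on the full training set by RandomizedSearchCV)
best_model = random_search.best_estimator_
print("✅ Best Parameters:", random_search.best_params_)

# Predict on the held-out test set
y_pred = best_model.predict(X_test)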

# Evaluation
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\n📊 Evaluation Metrics (Tuned XGBoost):")
print(f"MAE  : {mae:.2f}")
print(f"MSE  : {mse:.2f}")
print(f"RMSE : {rmse:.2f}")
print(f"R²   : {r2:.2f}")

# Evaluation Metric Score Chart
scores = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R² Score": r2}
plt.figure(figsize=(8, 5))
plt.bar(scores.keys(), scores.values(), color=['steelblue', 'skyblue', 'slategray', 'seagreen'])
plt.title("Tuned XGBoost Evaluation Metrics")
plt.ylabel("Score")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

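RandomizedSearchCV was used, combined with TimeSeriesSplit so that every validation fold comes later in time than the rows it was trained on.

XGBoost has many tunable hyperparameters. An exhaustive GridSearchCV over the search space above would need 3 × 3 × 3 × 3 × 3 = 243 combinations, each fitted once per fold, which is slow.

RandomizedSearchCV samples only 20 of these combinations (n_iter=20), which allows us to find a near-optimal solution faster and with lower compute cost.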
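
To make the cost difference concrete, the sketch below counts the combinations in the search space from the tuning cell and draws a random subset of the same size as `n_iter`. It uses only the standard library rather than scikit-learn's own sampler, and the helper names `grid_size` and `sample_configs` are illustrative, not part of the notebook.

```python
import itertools
import math
import random

# Same search space as the tuning cell above
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1],
    'colsample_bytree': [0.7, 0.8, 1],
}

def grid_size(space):
    """Number of combinations an exhaustive grid search would try."""
    return math.prod(len(values) for values in space.values())

def sample_configs(space, n_iter, seed=42):
    """Draw n_iter distinct combinations at random (illustrative helper)."""
    keys = list(space)
    all_combos = list(itertools.product(*(space[k] for k in keys)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, n_iter)
    return [dict(zip(keys, combo)) for combo in chosen]

total = grid_size(param_dist)
sampled = sample_configs(param_dist, n_iter=20)
n_folds = 5  # matches TimeSeriesSplit(n_splits=5)
print(f"Grid search fits      : {total * n_folds}")
print(f"Randomized search fits: {len(sampled) * n_folds}")
```

The random subset covers far fewer configurations, so the trade-off is a small risk of missing the single best combination in exchange for a much shorter search.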



##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

MAE dropped by ~17.4% (→ more accurate predictions)

MSE dropped by ~30.2% (→ fewer large errors)

RMSE decreased, meaning the typical prediction error (in price units) is smaller

R² Score increased → better overall model fit
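
The percentage figures above are relative changes between the baseline and tuned scores. A minimal sketch of that calculation follows. The function name `pct_reduction` is hypothetical, and the input numbers are placeholders chosen only to show the arithmetic; they are not results from this notebook.

```python
def pct_reduction(baseline, tuned):
    """Relative reduction of an error metric, in percent (positive = improvement)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - tuned) / baseline * 100.0

# Placeholder values for illustration only
baseline_mae, tuned_mae = 10.0, 8.0
print(f"MAE reduction: {pct_reduction(baseline_mae, tuned_mae):.1f}%")
```

For R², where higher is better, the sign convention flips, so it is usually clearer to report the absolute change instead.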

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

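MAE (Mean Absolute Error): the average size of a forecast error in price units. Lower MAE = more accurate monthly forecasts.

MSE (Mean Squared Error): squares each error, so it highlights large forecasting mistakes. Useful if large errors lead to high financial losses.

RMSE (Root Mean Squared Error): the square root of MSE, expressed in the same units as the closing price (₹), so it is easier to interpret as “a typical prediction is off by about this much”.

R² Score: the share of the variance in the closing price that the model explains. Values closer to 1 indicate a better overall fit.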
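
A small worked example can make the difference between these metrics concrete. The sketch below computes MAE, MSE, RMSE and R² by hand with the standard library on made-up prices (not data from this project). Both prediction sets have the same MAE, but the one with a single large miss has a much higher MSE and RMSE.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up prices for illustration only
actual = [100.0, 110.0, 120.0, 130.0]
small_misses = [102.0, 108.0, 122.0, 128.0]   # every prediction off by 2
one_big_miss = [100.0, 110.0, 120.0, 138.0]   # same total absolute error, one miss of 8

for name, preds in [("small misses", small_misses), ("one big miss", one_big_miss)]:
    m = regression_metrics(actual, preds)
    print(name, {k: round(v, 3) for k, v in m.items()})
```

If occasional large misses are the main business risk, MSE or RMSE is the more informative number to track; if all errors cost roughly the same per unit, MAE is easier to explain.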



### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# 1. Load the data
# data = pd.read_clipboard()  # Removed as the data is already loaded
# data.columns = ['Date', 'Open', 'High', 'Low', 'Close'] # Removed as columns are already named

# Use the existing DataFrame 'df'
data = df.copy()

# 2. Convert 'Date' to datetime and sort
# This is already done in the data wrangling section, but we ensure it here as well
data['Date'] = pd.to_datetime(data['Date'], format='%b-%y')
data = data.sort_values('Date').reset_index(drop=True)

# 3. Feature Engineering: Previous Close as a new feature
data['Prev_Close'] = data['Close'].shift(1)
data = data.dropna().reset_index(drop=True)

# 4. Define Features and Target
X = data[['Open', 'High', 'Low', 'Prev_Close']]
y = data['Close']

# 5. Train/Test Split
train_size = int(len(data) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# 6. Train Random Forest Model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# 7. Predict
y_pred = rf_model.predict(X_test)

# 8. Evaluation
from sklearn.metrics import mean_absolute_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
mae = mean_absolute_error(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2_score(y_test, y_pred))
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")



# 9. Plotting
plt.figure(figsize=(10,6))
plt.plot(data['Date'].iloc[train_size:], y_test.values, label='Actual')
plt.plot(data['Date'].iloc[train_size:], y_pred, label='Predicted', linestyle='--')
plt.title('Random Forest: Actual vs Predicted Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load the data
# data = pd.read_clipboard() # Removed as the data is already loaded
# data.columns = ['Date', 'Open', 'High', 'Low', 'Close'] # Removed as columns are already named

# Use the existing DataFrame 'df'
data = df.copy()

# 2. Convert 'Date' to datetime and sort
# This is already done in the data wrangling section, but we ensure it here as well
data['Date'] = pd.to_datetime(data['Date'], format='%b-%y')
data = data.sort_values('Date').reset_index(drop=True)

# 3. Feature Engineering
data['Prev_Close'] = data['Close'].shift(1)
data = data.dropna().reset_index(drop=True)

# 4. Features and target
X = data[['Open', 'High', 'Low', 'Prev_Close']]
y = data['Close']
dates = data['Date']

# 5. Chronological train/test split, done before scaling and tuning
# so the test period stays unseen during model selection
train_size = int(len(X) * 0.8)
X_train_raw, X_test_raw = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]
test_dates = dates.iloc[train_size:]

# Scaling not strictly needed for Random Forest, but keeping structure consistent;
# the scaler is fit on the training rows only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# 6. TimeSeriesSplit for time-aware CV
tscv = TimeSeriesSplit(n_splits=5)

# 7. Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# 8. Grid Search with Cross-Validation on the training rows only
rf = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=tscv, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# 9. Best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("Best CV Score (Neg MSE):", grid_search.best_score_)

# 10. Train best model on train set
best_rf = RandomForestRegressor(**best_params, random_state=42)
best_rf.fit(X_train, y_train)

# 11. Predict
y_pred = best_rf.predict(X_test)

# 12. Evaluation
print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R² Score:", r2_score(y_test, y_pred))

# 13. Plot actual vs predicted
plt.figure(figsize=(10,6))
plt.plot(test_dates, y_test.values, label='Actual')
plt.plot(test_dates, y_pred, label='Predicted', linestyle='--')
plt.title('Random Forest (Tuned): Actual vs Predicted Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

##### Which hyperparameter optimization technique have you used and why?

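GridSearchCV with TimeSeriesSplit (5 splits) was used. The Random Forest grid is small (3 × 4 × 2 × 2 = 48 combinations), so an exhaustive search is affordable, and TimeSeriesSplit keeps every validation fold later in time than the rows used to train on it.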
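
The tuning cell above passes a `TimeSeriesSplit` object as `cv`. To illustrate the idea of time-aware validation, here is a plain-Python sketch of expanding-window splits in which every validation block comes after its training rows. The helper `expanding_window_splits` is hypothetical and only approximates scikit-learn's default behaviour; it is not the library's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with training data always earlier in time."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train = list(range(0, start))                  # everything before the block
        test = list(range(start, start + test_size))   # the next block in time
        yield train, test

# 12 monthly rows and 5 folds, chosen only to keep the printout short
for fold, (train, test) in enumerate(expanding_window_splits(12, 5), start=1):
    print(f"Fold {fold}: train 0..{train[-1]}, validate {test[0]}..{test[-1]}")
```

Unlike a shuffled K-fold, no fold here is ever validated on rows that precede its training data, which is the property that matters for price series.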

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***