<a href="https://colab.research.google.com/github/nabeelnazeer/Reggresion-anaylsis/blob/main/stock_market_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction




The data is the price history and trading volumes of the fifty stocks in the NIFTY 50 index from the NSE (National Stock Exchange of India). All datasets are at day level, with pricing and trading values split across .csv files for each stock, along with a metadata file with some macro-information about the stocks themselves. The data spans from 1st January, 2000 to 30th April, 2021.
The dataset we specifically focus on is AdaniPort from 2001 to 2021. The cells below, however, download Reliance Industries (`RELIANCE.NS`) prices through `yfinance` with `start='2015-01-01'` and `end='2023-01-01'`, and all results in this notebook refer to that series.



In [3]:

import yfinance as yf

stock_data = yf.download('RELIANCE.NS', start='2015-01-01', end='2023-01-01')


stock_data.head()


[*********************100%***********************]  1 of 1 completed


Date,Open,High,Low,Close,Adj Close,Volume
2015-01-01,405.186554,407.792389,403.975037,405.917999,381.517303,1481821
2015-01-02,406.00943,409.643921,404.272217,404.843658,380.507538,3665683
2015-01-05,404.592224,407.28952,399.6091,400.409149,376.339569,5051970
2015-01-06,397.734741,399.106232,380.362396,382.236786,359.259613,9313990
2015-01-07,382.69397,392.614471,382.648254,390.55722,367.079895,10360156


In [4]:
# Check for missing values
stock_data.isnull().sum()


Column,Missing values
Open,0
High,0
Low,0
Close,0
Adj Close,0
Volume,0


In [5]:
stock_data.describe()

Statistic,Open,High,Low,Close,Adj Close,Volume
count,1976.0,1976.0,1976.0,1976.0,1976.0,1976.0
mean,1214.484407,1228.323599,1200.068431,1213.642562,1190.948641,9847056.0
std,694.282341,701.938658,686.132276,693.724458,693.09281,7251829.0
min,373.322052,373.824921,364.110138,370.647614,348.367096,852828.0
25%,489.356575,495.962624,485.487823,490.985237,471.85331,5679796.0
50%,1081.198425,1096.376343,1062.774597,1081.244141,1057.462585,7745768.0
75%,1886.011353,1906.848114,1849.333649,1878.027496,1853.996704,11105030.0
max,2636.225586,2636.225586,2571.569336,2602.720703,2576.380371,71341680.0


**Plot of Closing price over time**

In [13]:
# Interactive plot for closing price over time
import plotly.express as px
import plotly.graph_objects as go

fig = px.line(stock_data, x=stock_data.index, y='Close', title='Reliance Industries Stock Closing Price (2015-2023)',
              labels={'Close': 'Closing Price (INR)', 'index': 'Date'})
fig.show()


In [14]:
# Create additional features: Moving Averages (MA)
stock_data['MA50'] = stock_data['Close'].rolling(window=50).mean()
stock_data['MA200'] = stock_data['Close'].rolling(window=200).mean()

# Percentage Change in Close Price
stock_data['Pct_Change'] = stock_data['Close'].pct_change()

# Drop NA values after feature creation
stock_data.dropna(inplace=True)

fig = go.Figure()

# Add traces for closing price and moving averages
fig.add_trace(go.Scatter(x=stock_data.index, y=stock_data['Close'], mode='lines', name='Close Price', line=dict(color='blue')))
fig.add_trace(go.Scatter(x=stock_data.index, y=stock_data['MA50'], mode='lines', name='50-day MA', line=dict(color='red')))
fig.add_trace(go.Scatter(x=stock_data.index, y=stock_data['MA200'], mode='lines', name='200-day MA', line=dict(color='green')))

# Update layout
fig.update_layout(title='Reliance Industries: Close Price & Moving Averages',
                  xaxis_title='Date', yaxis_title='Price (INR)',
                  hovermode='x unified')
fig.show()
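
The feature-creation cell above relies on how `rolling(...).mean()` and `pct_change()` treat the first rows of a series. The following is a minimal sketch on made-up prices (not the Reliance data), using a hypothetical 3-day window so the effect is easy to see by eye:

```python
import pandas as pd

# Toy closing prices (made-up values, not taken from the Reliance data)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="Close")

# A 3-day moving average needs 3 observations, so the first 2 entries are NaN
ma3 = close.rolling(window=3).mean()

# Day-over-day fractional change; the first entry has no previous day
pct = close.pct_change()

toy = pd.DataFrame({"Close": close, "MA3": ma3, "Pct_Change": pct})
toy_clean = toy.dropna()
rows_dropped = len(toy) - len(toy_clean)

print(toy)
print(f"Rows dropped by dropna(): {rows_dropped}")
```

By the same logic, a 200-day window leaves the first 199 rows without an `MA200` value, so `dropna()` should remove those rows from `stock_data` before modelling (assuming no other missing values).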

Relevant features for multiple linear regression:
1. Moving averages (`MA50`, `MA200`)
2. Percentage change in closing price (`Pct_Change`)
3. Trading volume (`Volume`)

In [10]:
from sklearn.model_selection import train_test_split

X = stock_data[['MA50', 'MA200', 'Pct_Change', 'Volume']]  # Independent variables
y = stock_data['Close']  # Dependent variable (target)

# Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
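
`train_test_split` shuffles rows by default, so with daily prices the test set ends up interleaved in time with the training set. One common alternative for time-ordered data is a chronological split. The sketch below is an illustration on a synthetic frame (the column names `feature` and `Close`, and the `_ts` variable names, are placeholders), not the split used to produce the results in this notebook:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table (hypothetical values)
dates = pd.bdate_range("2020-01-01", periods=100)
frame = pd.DataFrame(
    {"feature": np.arange(100, dtype=float),
     "Close": np.arange(100, dtype=float) * 2.0},
    index=dates,
)

# Keep the first 80% of rows for training and the last 20% for testing
split_at = int(len(frame) * 0.8)
train, test = frame.iloc[:split_at], frame.iloc[split_at:]

X_train_ts, y_train_ts = train[["feature"]], train["Close"]
X_test_ts, y_test_ts = test[["feature"]], test["Close"]

print(len(train), len(test))
# Every training date precedes every test date
print(train.index.max() < test.index.min())
```

Passing `shuffle=False` to `train_test_split` gives the same order-preserving behaviour. Metrics from such a split are likely to differ from the shuffled-split numbers reported below.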


Models compared:
1. Multiple linear regression (MLR)
2. Ridge regression
3. Lasso regression
4. Polynomial regression

In [16]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

mlr = LinearRegression()
mlr.fit(X_train, y_train)

# Make predictions
y_pred = mlr.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Multiple Linear Regression - MSE: {mse}, R2: {r2}')


Multiple Linear Regression - MSE: 9658.284616272347, R2: 0.9799601079261735


In [17]:
from sklearn.linear_model import Ridge

# Initialize and fit the Ridge Regression model
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Predict and evaluate
y_pred_ridge = ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f'Ridge Regression - MSE: {mse_ridge}, R2: {r2_ridge}')


Ridge Regression - MSE: 10097.26903879667, R2: 0.9790492629056554



Ill-conditioned matrix (rcond=1.73641e-17): result may not be accurate.
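
The ill-conditioning warning above plausibly arises because the features differ by many orders of magnitude (`Volume` reaches about 7e7, while prices are in the hundreds to thousands). A usual remedy is to standardise the features before the penalised regression. The sketch below is an assumption-laden illustration on synthetic data (`X_demo`, `y_demo`, and `scaled_ridge` are placeholder names), not a rerun of the notebook's model:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical values)
rng = np.random.default_rng(0)
n = 200
price_like = rng.uniform(400, 2600, size=n)
volume_like = rng.uniform(1e6, 7e7, size=n)
X_demo = np.column_stack([price_like, volume_like])
y_demo = 0.9 * price_like + 1e-6 * volume_like + rng.normal(0, 5, size=n)

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit()
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_demo, y_demo)

# After scaling, each column has mean 0 and unit standard deviation
X_scaled = scaled_ridge.named_steps["standardscaler"].transform(X_demo)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
print(scaled_ridge.score(X_demo, y_demo))
```

In the notebook this would correspond to fitting the pipeline on `X_train` and `y_train` and predicting on `X_test`. Scaling also changes the effective strength of `alpha`, so the Ridge and Lasso metrics would not necessarily match the ones shown here.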



In [18]:
from sklearn.linear_model import Lasso

# Initialize and fit the Lasso Regression model
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Predict and evaluate
y_pred_lasso = lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f'Lasso Regression - MSE: {mse_lasso}, R2: {r2_lasso}')


Lasso Regression - MSE: 9860.491576607328, R2: 0.9795405504350988


In [19]:
from sklearn.preprocessing import PolynomialFeatures

# Transform features to polynomial
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

# Fit Linear Regression on the polynomial features
poly_mlr = LinearRegression()
poly_mlr.fit(X_poly_train, y_train)

# Predict and evaluate
y_pred_poly = poly_mlr.predict(X_poly_test)
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
print(f'Polynomial Regression - MSE: {mse_poly}, R2: {r2_poly}')


Polynomial Regression - MSE: 9554.119549269572, R2: 0.9801762391320278


In [21]:
# Summary of model performances
import pandas as pd
model_performance = pd.DataFrame({
    'Model': ['Multiple Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Polynomial Regression'],
    'MSE': [mse, mse_ridge, mse_lasso, mse_poly],
    'R2 Score': [r2, r2_ridge, r2_lasso, r2_poly]
})

model_performance


,Model,MSE,R2 Score
0,Multiple Linear Regression,9658.284616,0.97996
1,Ridge Regression,10097.269039,0.979049
2,Lasso Regression,9860.491577,0.979541
3,Polynomial Regression,9554.119549,0.980176


In [24]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Helper function to calculate evaluation metrics
def evaluate_model(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, rmse, mae, r2

# Evaluate MLR
mlr_metrics = evaluate_model(y_test, y_pred)

# Evaluate Ridge Regression
ridge_metrics = evaluate_model(y_test, y_pred_ridge)

# Evaluate Lasso Regression
lasso_metrics = evaluate_model(y_test, y_pred_lasso)

# Evaluate Polynomial Regression
poly_metrics = evaluate_model(y_test, y_pred_poly)

# Combine the results into a DataFrame for comparison
model_comparison = pd.DataFrame({
    'Model': ['Multiple Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Polynomial Regression'],
    'MSE': [mlr_metrics[0], ridge_metrics[0], lasso_metrics[0], poly_metrics[0]],
    'RMSE': [mlr_metrics[1], ridge_metrics[1], lasso_metrics[1], poly_metrics[1]],
    'MAE': [mlr_metrics[2], ridge_metrics[2], lasso_metrics[2], poly_metrics[2]],
    'R² Score': [mlr_metrics[3], ridge_metrics[3], lasso_metrics[3], poly_metrics[3]]
})

model_comparison


,Model,MSE,RMSE,MAE,R² Score
0,Multiple Linear Regression,9658.284616,98.276572,69.75005,0.97996
1,Ridge Regression,10097.269039,100.485168,70.964962,0.979049
2,Lasso Regression,9860.491577,99.300008,70.264629,0.979541
3,Polynomial Regression,9554.119549,97.745177,68.622971,0.980176
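
For reference, the four metrics in the table can be computed by hand. The sketch below uses a tiny made-up example (the `_demo` names and values are illustrative only) and follows the standard definitions that `mean_squared_error`, `mean_absolute_error`, and `r2_score` implement:

```python
import numpy as np

# Made-up targets and predictions
y_true_demo = np.array([100.0, 110.0, 120.0, 130.0])
y_hat_demo = np.array([102.0, 108.0, 123.0, 129.0])

errors = y_true_demo - y_hat_demo           # residuals: [-2, 2, -3, 1]

mse_demo = np.mean(errors ** 2)             # mean of squared residuals
rmse_demo = np.sqrt(mse_demo)               # same units as the target
mae_demo = np.mean(np.abs(errors))          # mean of absolute residuals

ss_res = np.sum(errors ** 2)                               # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot

print(mse_demo, rmse_demo, mae_demo, r2_demo)
```

RMSE and MAE are in the same units as the target (INR here), which makes them easier to read than MSE: for example, the MLR row's RMSE of about 98.28 is the square root of its MSE of about 9658.28.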


In [26]:
# Sort the dataframe by R² Score in descending order to find the best model
best_model = model_comparison.sort_values('R² Score', ascending=False).iloc[0]

# Print a report summarizing the model performance
print("Model Comparison Report:\n")
print(model_comparison.to_string(), "\n")

print("Best Model:\n")
print(f"Model: {best_model['Model']}\n")
print(f"R² Score: {best_model['R² Score']}\n")
print(f"MSE: {best_model['MSE']}\n")
print(f"RMSE: {best_model['RMSE']}\n")
print(f"MAE: {best_model['MAE']}\n")

print("Analysis:\n")
print(f"Based on the R² score, {best_model['Model']} is the best performing model, indicating it explains the most variance in the data.")
print("A higher R² score is preferred, as it means that the model is fitting the data better and making more accurate predictions.")
print("Furthermore, a lower MSE, RMSE, and MAE indicates better model accuracy, as these metrics measure the average error of the model's predictions.")


Model Comparison Report:

                        Model           MSE        RMSE        MAE  R² Score
0  Multiple Linear Regression   9658.284616   98.276572  69.750050  0.979960
1            Ridge Regression  10097.269039  100.485168  70.964962  0.979049
2            Lasso Regression   9860.491577   99.300008  70.264629  0.979541
3       Polynomial Regression   9554.119549   97.745177  68.622971  0.980176 

Best Model:

Model: Polynomial Regression

R² Score: 0.9801762391320278

MSE: 9554.119549269572

RMSE: 97.74517660360317

MAE: 68.62297103240546

Analysis:

Based on the R² score, Polynomial Regression is the best performing model, indicating it explains the most variance in the data.
A higher R² score is preferred, as it means that the model is fitting the data better and making more accurate predictions.
Furthermore, a lower MSE, RMSE, and MAE indicates better model accuracy, as these metrics measure the average error of the model's predictions.
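
The report above selects the best model by R² alone. As a quick cross-check, the sketch below ranks the models on each metric using the values copied from the comparison table and also measures how far apart the R² scores are. The `comparison` and `ranks` frames are helper names introduced for this illustration:

```python
import pandas as pd

# Values copied from the comparison table in this notebook
comparison = pd.DataFrame({
    "Model": ["Multiple Linear Regression", "Ridge Regression",
              "Lasso Regression", "Polynomial Regression"],
    "RMSE": [98.276572, 100.485168, 99.300008, 97.745177],
    "MAE": [69.750050, 70.964962, 70.264629, 68.622971],
    "R2": [0.979960, 0.979049, 0.979541, 0.980176],
}).set_index("Model")

# Rank 1 = best: lower is better for the errors, higher is better for R2
ranks = pd.DataFrame({
    "RMSE": comparison["RMSE"].rank(),
    "MAE": comparison["MAE"].rank(),
    "R2": comparison["R2"].rank(ascending=False),
})

best_everywhere = ranks[(ranks == 1).all(axis=1)].index.tolist()
r2_spread = comparison["R2"].max() - comparison["R2"].min()

print(ranks)
print(best_everywhere, round(r2_spread, 6))
```

Polynomial Regression ranks first on every metric, but the R² values span only about 0.001, so on this single split the four models perform almost identically and the ordering may not be stable under a different split or random seed.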


In [25]:
# Interactive bar plot for MSE
fig = px.bar(model_comparison, x='Model', y='MSE', title='Model Comparison: Mean Squared Error (MSE)',
             labels={'MSE': 'Mean Squared Error'}, hover_data=['RMSE', 'MAE', 'R² Score'], color='MSE', text='MSE')
fig.show()

# Interactive bar plot for RMSE
fig = px.bar(model_comparison, x='Model', y='RMSE', title='Model Comparison: Root Mean Squared Error (RMSE)',
             labels={'RMSE': 'Root Mean Squared Error'}, hover_data=['MSE', 'MAE', 'R² Score'], color='RMSE', text='RMSE')
fig.show()

# Interactive bar plot for MAE
fig = px.bar(model_comparison, x='Model', y='MAE', title='Model Comparison: Mean Absolute Error (MAE)',
             labels={'MAE': 'Mean Absolute Error'}, hover_data=['MSE', 'RMSE', 'R² Score'], color='MAE', text='MAE')
fig.show()

# Interactive bar plot for R² Score
fig = px.bar(model_comparison, x='Model', y='R² Score', title='Model Comparison: R² Score',
             labels={'R² Score': 'R² Score'}, hover_data=['MSE', 'RMSE', 'MAE'], color='R² Score', text='R² Score')
fig.show()
