## ABC Bank Stock Closing Price Prediction by Regression


# **Summary -**

ABC Bank stock closing price prediction by regression uses historical data to build a regression model that can forecast future stock prices. Regression analysis is a statistical method that uses a combination of independent variables to predict the value of a dependent variable, in this case the stock's closing price. Here we develop different regression models to predict the closing price, evaluate them on several metrics, and try to identify the best one. We also try to gain some insight into feature importance using various methods.

# **Problem Statement**


The goal is to develop an accurate and reliable ML model that forecasts the ABC Bank stock closing price. The model should be trained on one subset of the historical data and validated on another subset to check that it can accurately predict prices it has not seen. The ultimate objective is a robust model that helps predict the closing price of ABC Bank stock.

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

import datetime

### Dataset Loading

In [None]:
# Load Dataset
data=pd.read_csv("D://Files//data_Bank_StockPrices.csv")

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data[data.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

### What did you know about your dataset?

According to the analysis above, the dataset comprises 185 rows and 5 columns. The four price columns (Open, High, Low, Close) are numeric, and the Date column is stored as an object (string). The data contains no null values and no duplicate rows.
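The individual checks above can also be gathered in a single helper. The following is a minimal sketch, assuming a frame with the same column layout; the function name `summarize_frame` and the toy values are hypothetical and not taken from the ABC Bank data.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the basic checks from this section into one dictionary."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Toy frame with the same column layout as the stock data (values are made up).
toy = pd.DataFrame({
    "Date": ["Jul-05", "Aug-05", "Sep-05"],
    "Open": [13.0, 12.6, 13.5],
    "High": [14.0, 14.9, 14.3],
    "Low": [11.3, 12.6, 12.3],
    "Close": [12.5, 13.4, 13.3],
})
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(data)` on the real dataset would report the same quantities that the separate cells above display one at a time.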

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description 

There are 5 variables in the data, as follows:

**Independent Variables**

1. Date : Month and year of the record.
2. Open : Opening stock price for the respective month.
3. High : Highest stock price for the respective month.
4. Low : Lowest stock price for the respective month.

**Dependent Variable**

5. Close : Closing stock price for the respective month.

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Converting Date column from object format to Date
data["Date"]=pd.to_datetime(data["Date"],format='%b-%y')

In [None]:
data['Date']

In [None]:
plt.figure(figsize=(9,6))
plt.plot(data['Date'],data['Close'])

In [None]:
# Selecting the numerical feature columns (the datetime Date column is excluded)
numeric_fea=data.select_dtypes(include='number').columns
numeric_fea

### What all manipulations have you done and insights you found?

We converted the Date column from object format to datetime using `pd.to_datetime`. This makes it possible to plot the closing price against time and see how the stock moved over the period covered by the data.
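To make the conversion concrete, here is a small sketch of how `format='%b-%y'` interprets month-year strings. The sample values are hypothetical, and the example assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical sample in the 'Mon-YY' layout that format='%b-%y' expects.
sample = pd.Series(["Jul-05", "Aug-05", "Jan-20"])
parsed = pd.to_datetime(sample, format="%b-%y")

# Only month and year are given, so each value becomes the first day of that month.
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
print(parsed.dt.day.tolist())
```

Because the day is not present in the source strings, every parsed timestamp falls on day 1 of its month, which is sufficient for monthly data.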

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Distribution of the Close column: histogram with a KDE curve.
plt.figure(figsize=(9,6))
sns.histplot(data['Close'],kde=True,color='y')

In [None]:
# Distribution of the Close column after a log10 transformation.
plt.figure(figsize=(9,6))
sns.histplot(np.log10(data["Close"]),kde=True,color='y')

##### 1. Why did you pick the specific chart?

For the ABC Bank stock closing price prediction model, a distribution plot of the closing price provides useful insight into the shape of the target variable and its potential outliers.

By visualizing the distribution of the target variable, we can gain a better understanding of its central tendency (i.e., the mean, median, and mode), spread (i.e., the range, variance, and standard deviation), skewness (i.e., whether it's symmetric or skewed), and any potential outliers.

This information can be helpful in selecting an appropriate ML algorithm for predicting the target variable, as well as in identifying any potential issues with the data (e.g., non-normality, extreme values) that may need to be addressed before training the model.

##### 2. What is/are the insight(s) found from the chart?

The distribution plot of the closing price provides the following insights:
1. The shape of the distribution: The distribution is right-skewed (positively skewed), meaning most closing prices are concentrated at lower values with a long tail toward higher prices. To bring the target variable closer to a symmetric distribution, we apply a log transformation.
2. The presence of outliers: Outliers are data points that differ significantly from the rest of the data. By examining the distribution plot, we can identify potential outliers that may need to be addressed before training the model.
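Skewness can also be checked numerically rather than by eye. The sketch below uses synthetic log-normal values as a stand-in for right-skewed prices (it is not the ABC Bank data) and compares the sample skewness before and after a log10 transformation using pandas' `Series.skew`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (log-normal); NOT the ABC Bank data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

skew_raw = prices.skew()            # sample skewness of the raw values
skew_log = np.log10(prices).skew()  # sample skewness after log10

print(f"raw skewness: {skew_raw:.2f}")
print(f"log10 skewness: {skew_log:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, so the same two lines applied to `data['Close']` would show how much the transformation helps on the real target.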

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Plotting Histogram for each independent column in Data. 
for col in numeric_fea[:-1]:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature=data[col]
  feature.hist(bins=50,ax=ax)
  ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)    
  ax.set_title(col)
plt.show()

##### 1. Why did you pick the specific chart?

Histograms visualize the distribution of a single variable. Here they are used to visualize the distribution of each independent variable.
We also plotted the mean and median as dashed lines to make the skewness of the data easier to see: when the mean lies to the right of the median, the distribution is right-skewed.

##### 2. What is/are the insight(s) found from the chart?

The distributions of the independent variables are right-skewed. This information helps determine the appropriate statistical approach for regression, for example whether a transformation should be applied first.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plotting each independent variable against the dependent variable to check for a linear relationship.
for col in numeric_fea[:-1]:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature=data[col]
  label=data["Close"]
  correlation=feature.corr(label)
  plt.scatter(x=feature,y=label)
  plt.ylabel("Closing price")
  plt.xlabel(col)
  ax.set_title('Closing price vs '+col+', Correlation: '+str(correlation))
  z=np.polyfit(data[col],data['Close'],1)
  y_hat=np.poly1d(z)(data[col])

  plt.plot(data[col],y_hat,"r--",lw=1)

plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is a common and useful visualization technique to explore the relationship between a dependent variable (i.e., ABC Bank stock closing price) and one or more independent variables. In a scatter plot, each observation is represented as a point on the graph, with the independent variable plotted on the x-axis and the dependent variable plotted on the y-axis.

By examining the scatter plot, we can identify any patterns or relationships between the two variables. For example, if the points on the scatter plot are closely clustered around a straight line, this suggests a strong linear relationship between the two variables. On the other hand, if the points on the scatter plot are more spread out and do not appear to form a straight line, this suggests a weaker relationship or no relationship at all.

By examining the scatter plot, we can determine whether there is a strong or weak relationship between the closing price and the independent variable(s), and whether this relationship is linear or nonlinear. This information can be used to inform the selection of appropriate ML algorithms for predicting the closing price, and to identify any potential issues with the data that may need to be addressed before training the ML model.

##### 2. What is/are the insight(s) found from the chart?

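The graphs above show that each independent variable is strongly linearly correlated with the dependent variable (the ABC Bank stock closing price). Because Open, High, and Low all track Close this closely, they are likely to be highly correlated with one another as well, so we need to choose a model that can deal with multicollinearity in our data.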

#### Chart - 4 Correlation Heatmap

In [None]:
# Chart - 4 Correlation Heatmap visualization code
# Heatmap to see collinearity between the numeric columns
plt.figure(figsize=(9,6))
cor=data.select_dtypes(include='number').corr() # Date is datetime, so keep numeric columns only
sns.heatmap(abs(cor),annot=True)

##### 1. Why did you pick the specific chart?

A heatmap can be used to explore the correlation between the closing price and the independent variables. By examining the heatmap, we can identify patterns or relationships between the variables, which can inform the selection of appropriate ML algorithms for predicting the closing price.

A heatmap can also be used to identify any potential issues with the data, such as multicollinearity (i.e., high correlation between independent variables). 

##### 2. What is/are the insight(s) found from the chart?

The chart above shows that the independent variables are highly correlated with one another, i.e., the data is multicollinear.
Multicollinearity can cause problems for linear regression because it makes the coefficient estimates unstable and unreliable and can contribute to overfitting. By identifying variables with high correlations, we can decide whether to remove one of the variables or to use a different ML algorithm that is less sensitive to multicollinearity.
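One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is an illustration on synthetic columns rather than the notebook's own method; it relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The helper name `vif_from_correlation` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(features: pd.DataFrame) -> pd.Series:
    """VIF for each column, read off the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")

# Synthetic, strongly related columns standing in for Open/High/Low.
rng = np.random.default_rng(1)
base = rng.uniform(10, 300, size=200)
toy = pd.DataFrame({
    "Open": base,
    "High": base * 1.10 + rng.normal(0, 5, size=200),
    "Low": base * 0.90 + rng.normal(0, 5, size=200),
})
vif = vif_from_correlation(toy)
print(vif.round(1))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; the exact threshold is a convention, not a hard limit.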

## ***6. Feature Engineering & Data Pre-processing***

### 1. Data Scaling

In [None]:
# Scaling your data
data_pr=data.copy() # Making a copy of our original data
# Separate dependent and independent variables
X=np.log10(data_pr.iloc[:,1:-1]) # Open, High, Low: log10 transformation to reduce skewness
y=np.log10(data_pr['Close']) # Close: log10 transformation to reduce skewness


##### Which method have you used to scale your data and why?

We use a log transformation (base 10) to scale the data. The logarithm compresses large values more than small ones, which helps make right-skewed data more symmetric and reduces the impact of outliers. It requires strictly positive inputs, which stock prices satisfy.
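As a small illustration of the transformation and its inverse, the sketch below applies `np.log10` to a few made-up positive prices and recovers them with `10 **`, which is the same back-conversion needed later to report predictions in price units.

```python
import numpy as np

prices = np.array([12.5, 48.0, 150.0, 367.9])  # made-up positive prices
logged = np.log10(prices)                       # forward transform
restored = 10 ** logged                         # inverse transform

print(logged.round(3))
print(np.allclose(restored, prices))
```

The transformed values span a much narrower range than the raw prices, which is why the long right tail becomes less pronounced.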



### 2. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
print(X_train.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why? 

Data splitting is a common technique in machine learning for evaluating a model on data it was not trained on. We used an 80:20 split.

80:20 split - `train_test_split` with `test_size=0.2` places 80% of the rows in the training set and the remaining 20% in the testing set. With 185 rows, this gives 148 training rows and 37 testing rows. The training set is used to train the model, and the testing set is used to evaluate the model's performance. Note that `train_test_split` shuffles the rows by default, so the testing set is a random sample of months rather than the most recent ones; the scores therefore describe how well the model fits months within the observed period more than how well it forecasts future prices.
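For time-ordered data, a chronological hold-out is a common alternative to a shuffled split. The sketch below is not what this notebook does; it only shows the idea on a synthetic monthly frame with 185 rows. The helper name `chronological_split`, the start date, and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def chronological_split(df: pd.DataFrame, date_col: str, test_size: float = 0.2):
    """Sort by date and hold out the most recent rows as the test set."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = max(1, int(round(len(ordered) * test_size)))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]

# Synthetic monthly frame with 185 rows, matching the dataset's length only.
toy = pd.DataFrame({
    "Date": pd.date_range("2005-07-01", periods=185, freq="MS"),
    "Close": np.linspace(10.0, 200.0, 185),
})
train, test = chronological_split(toy, "Date")
print(train.shape, test.shape)
```

With this kind of split every test month comes after every training month, so the evaluation resembles a genuine forecast. scikit-learn's `TimeSeriesSplit` offers a cross-validated variant of the same idea.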

## ***7. ML Model Implementation***

### ML Model - 1 Linear Regression

In [None]:
# ML Model - 1 Implementation
reg=LinearRegression()

# Fit the Algorithm
reg.fit(X_train,y_train)

# Predict on the model
y_pred=reg.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Evaluation metrics, all computed on the original price scale
from sklearn.metrics import mean_absolute_error

# Undo the log10 transform (10**) so every metric is in price units
y_test_price=10**(y_test)
y_pred_price=10**(y_pred)

mse=mean_squared_error(y_test_price,y_pred_price)
rmse=np.sqrt(mse)
mae=mean_absolute_error(y_test_price,y_pred_price)

r2=r2_score(y_test_price,y_pred_price)
n,p=X_test.shape # number of test rows and number of features
Adjusted_R2=1-(1-r2)*((n-1)/(n-p-1))

In [None]:
eval_df=pd.DataFrame([mse,rmse,r2,Adjusted_R2,mae],columns=['Linear'],index=['MSE','RMSE','R2','Adj R2','MAE']) # collect the metrics in a DataFrame; named eval_df to avoid shadowing the built-in eval
eval_df
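The metric calculations can also be wrapped in one function that back-transforms the log10 values first, so that every metric is reported on the same price scale. This is a minimal sketch: the function name `regression_report` and the sample numbers are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true_log, y_pred_log, n_features):
    """Back-transform log10 values, then compute all metrics on the price scale."""
    y_true = 10 ** np.asarray(y_true_log, dtype=float)
    y_pred = 10 ** np.asarray(y_pred_log, dtype=float)
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2,
        "Adj R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "MAE": mean_absolute_error(y_true, y_pred),
    }

# Made-up log10 prices and predictions, for illustration only.
y_true_log = np.log10([10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
y_pred_log = np.log10([11.0, 19.0, 42.0, 78.0, 165.0, 310.0])
report = regression_report(y_true_log, y_pred_log, n_features=3)
print({name: round(float(value), 3) for name, value in report.items()})
```

In the notebook this would be called as `regression_report(y_test, y_pred, X_test.shape[1])`. Adjusted R² penalizes R² for the number of features, so it is never larger than R² when the model has at least one feature and R² is below 1.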

In [None]:
# Visualization of predicted and Actual data
plt.figure(figsize=(8,5))
plt.plot(10**(y_pred))
plt.plot(np.array(10**(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

### 1. Save the ML model in a joblib file format for the deployment process.


In [None]:
# Save the File
import joblib
joblib.dump(reg, 'best_model.joblib') # saving the fitted linear regression model (reg) in a joblib file

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the file and predict as a sanity check.
model = joblib.load('best_model.joblib')

10**model.predict(X_test) # predictions are in log10 units; 10** converts them back to prices
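A stricter sanity check is to confirm that the reloaded model reproduces the predictions of the model that was saved. The sketch below demonstrates this round trip on synthetic data in a temporary directory; the data, coefficients, and file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the log-transformed features and target.
rng = np.random.default_rng(42)
X = rng.uniform(1.0, 2.5, size=(50, 3))
y = X @ np.array([0.2, 0.5, 0.3]) + rng.normal(0, 0.01, size=50)

model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The reloaded model should give the same predictions as the original.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

In the notebook, the equivalent check would compare `reg.predict(X_test)` with the predictions of the model loaded from `best_model.joblib`.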

# **Conclusion**