# LINEAR REGRESSION 

## CHAPTER #1 - LINEAR REGRESSION MODEL 

 ### CLASS CREATION  
 This class contains all of the model logic, i.e. the linear regression model itself.  
 Each object is a "best fit line", i.e. one line for each set of training data.  

 The attributes are:  
 - weight (coefficient of the independent variable, x)  
 - bias (y-intercept of the line)

 The methods are the "verbs" of the model (functions that use or update the weight and bias), namely:  
 - how the model learns the weight and bias. 
 - how the model fits the line to the data.  
 - how the line is evaluated, e.g. R^2 and RMSE. 
 - how close our predictions are, i.e. residuals.
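
The learning and evaluation steps listed above can be summarised mathematically. This is a sketch, assuming the model minimises the mean squared error (MSE) by gradient descent. The symbols $w$ (weight), $b$ (bias), $\hat{y}_i$ (predicted value), $\bar{y}$ (mean of the actual values) and $\alpha$ (learning rate) are notation introduced here for illustration.

```latex
\begin{aligned}
\hat{y}_i &= w x_i + b \\
\mathrm{MSE}(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial w} &= \frac{2}{n} \sum_{i=1}^{n} x_i \left(\hat{y}_i - y_i\right) \\
\frac{\partial\,\mathrm{MSE}}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) \\
w &\leftarrow w - \alpha \frac{\partial\,\mathrm{MSE}}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial\,\mathrm{MSE}}{\partial b} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
\end{aligned}
```

Each update moves against the gradient. Subtracting the gradient is the same as adding its negative, so a method that returns the negated partial derivative can simply be added to the current weight or bias.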


In [None]:
class LinearRegression(object):
    def __init__(self, weight=0, bias=0): #initialises the weight and bias (both default to 0)
        self.weight = weight              #stores weight (slope of the line)
        self.bias = bias                  #stores bias (y-intercept of the line)
        self.x = []                       #empty list to store our predictor values (x)
        self.y = []                       #empty list to store our actual response values (y)

    def vectorise(self, x, y):            #method to store the data points to be modelled
        self.x = x                        #storing the values of x (independent variable) within the object
        self.y = y                        #storing the values of y (dependent variable) within the object

    def predict_y(self):                  #calculating the predicted y[i] for our optimisation later
        y_predict = []                    #empty list to store all predicted y values
        n = len(self.x)                   #number of data points to iterate over

        for i in range(n):                #looping over every data point in the dataset
            y_predict.append(self.weight*self.x[i] + self.bias)     #line equation: predicted y = weight*x + bias
        return y_predict

#NUMERICAL OPTIMISATION 
#Creating method to get the update direction for the weight
    def partial_w(self):                  #negative partial derivative of the MSE with respect to weight
        y_predict = self.predict_y()      #predicted y values from the method we defined above
        gradient = 0 
        n = len(self.y)

        for i in range(n):
            gradient += self.x[i]*(y_predict[i] - self.y[i])         #summing x*(predicted - actual) over all data points
        return (-2/n)*gradient                                       #negated partial derivative, i.e. the direction that reduces the error (optimise adds it to the weight)

#Creating method to get the update direction for the bias
    def partial_b(self):                  #negative partial derivative of the MSE with respect to bias
        y_predict = self.predict_y()
        gradient = 0
        n = len(self.y)

        for i in range(n):
            gradient += (y_predict[i] - self.y[i])                    #summing (predicted - actual) over all data points
        return (-2/n)*gradient                                        #negated partial derivative, i.e. the direction that reduces the error (optimise adds it to the bias)

#Gradient Descent - iterating over multiple steps with our partial_w and partial_b methods
    def optimise(self): 
        learn_rate = 0.005                 #size of the steps we take "downhill" to minimise the total error

        for i in range(10000):             #number of "epochs" (steps) we take to minimise the aggregate error
            step_w = self.partial_w()      #both steps are computed from the same weight and bias, before either is updated
            step_b = self.partial_b()
            self.weight = self.weight + learn_rate * step_w #partial_w returns the negative gradient, so we add it
            self.bias = self.bias + learn_rate * step_b     #partial_b returns the negative gradient, so we add it
            if i % 10 == 0:                #prints out the weight and bias every 10 epochs 
                print(self.weight, self.bias)
    
#Residuals - creating a new residuals method to display deviation of predicted values from actual values
    def residuals(self):
        residuals = []
        n=len(self.x) 

        for i in range(n):
            residuals.append(self.y[i] - (self.weight * self.x[i] + self.bias)) #adding to the list called "residuals" the difference between actual and predicted y
        return residuals                                                        

#EVALUATION METRICS  -  these are key values that we will use to quantify how well our model predicts the data it is trained on. 
#Mean Square Error (MSE)  - the average squared deviation of predicted y from actual y

    def mse(self):
        n = len(self.y)                      #number of data points to iterate over
        self.square_error = 0                #sum of the squared errors, stored for future calculations

        for i in range(n):                   #iterating to accumulate the squared errors
            error = self.y[i] - (self.weight * self.x[i] + self.bias) #actual y minus predicted y
            self.square_error += error**2
        return self.square_error / n         #mean of the squared errors
    
#R^2 -  how much of the variation in y is explained by our model
    def rsquared(self):
        n = len(self.y)
        self.avg_y = sum(self.y) / n         #average of our actual y values

    #Total sum of squares - squared deviations of actual y from its average
        self.sum_squares = 0
    #Residual sum of squares - recomputed here so rsquared() also works if mse() has not been called first
        self.square_error = 0

        for i in range(n):
            self.sum_squares += (self.y[i] - self.avg_y)**2
            self.square_error += (self.y[i] - (self.weight * self.x[i] + self.bias))**2

    #Final calculation 
        return 1 - (self.square_error / self.sum_squares)



EXAMPLE WITH SIMPLE LISTS

In [None]:
x = [1,2,3,4,5]
y = [6,7,8,9,10]



In [None]:
%pip install matplotlib
from matplotlib import pyplot as plt

In [None]:
plt.plot(x,y)

In [None]:
model = LinearRegression()

In [None]:
model.vectorise(x,y)

In [None]:
model.optimise()

In [None]:
model.predict_y()

In [None]:
model.residuals()

In [None]:
model.mse()

In [None]:
model.rsquared()
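
One way to sanity-check the gradient descent result for this small example is to compare it with the closed-form least-squares solution. The sketch below is an illustration only; `closed_form_fit` is a hypothetical helper that is not part of the class above.

```python
# Hypothetical helper: closed-form simple linear regression (ordinary least squares).
def closed_form_fit(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    weight = sxy / sxx
    bias = mean_y - weight * mean_x  # the fitted line passes through (mean_x, mean_y)
    return weight, bias

weight, bias = closed_form_fit([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(weight, bias)  # prints: 1.0 5.0
```

Since y = x + 5 exactly for these lists, the weight and bias printed by `optimise()` should approach 1 and 5 as the epochs progress.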

## CHAPTER #2 - HANDLING DATA 

## Importing dependencies 

In [None]:
# Importing necessary packages 
%pip install pandas 
%pip install numpy
%pip install matplotlib
%pip install scikit-learn

In [None]:
import numpy as np 
import matplotlib as mpl
import pandas as pd
from matplotlib import pyplot as plt

## FINANCE DATA 
The following datasets provide metrics partitioned by market capitalization, price, volatility, and turnover. The stock market activity metrics are partitioned by decile and the ETP metrics by quartile. 

I want to look into how cancellation rate (cancel to trade) is affected by stock volatility.  
I will look at different deciles (each decile is one tenth of the stocks when ranked by market capitalisation), using the Market Cap Decile column and its matching Volatility Decile column to see how cancellation rate changes with volatility.   

**Hypothesis** - I would hypothesise that the greater the volatility, the greater the rate of cancellation. 

In our data schema the following are defined:  
**Market Cap Decile(n)** - the decile_cancel_to_trade (number of cancelled trades / number of successful trades) for that market capitalisation decile at that date. 
- will be renamed to "Cancellation rate".

**Volatility Decile(n)** - the amount of statistical variation within each stock decile (e.g. decile 9) at that date. 
- will be renamed to "Volatility".

## Can we predict the cancellation rate of a stock based on its volatility?
We will use linear regression to find out.

In [None]:
decile_path= "/Users/admin/Desktop/Data Science Career /Python/Python Projects/Linear regression from scratch /decile_quartile_2025_q1/decile_cancel_to_trade_stock.csv"
#Saving path name as variable for read csv argument 

In [None]:
decile_to_cancel_raw=pd.read_csv(decile_path) #importing file as a pandas dataframe

In [None]:
decile_to_cancel_raw.head()
decile_to_cancel_raw.tail()                   #insight into what our data looks like 

## EXPLORATORY DATA ANALYSIS AND DATA PRE-PROCESSING


I will use exploratory data analysis (EDA) to identify which decile has the most linear pattern, so that it suits my linear regression model.  
This isolates the volatility and cancellation features into one set of independent and dependent variables. 

### ASSUMPTIONS OF LINEAR REGRESSION
Let us investigate decile_1 to see if it is a good candidate to be modelled by linear regression. 
For this to be true, there must be:
1) Strong negative or positive correlation
2) Linearity in the data points
3) Homoscedasticity (data points maintain a similar spread throughout all values of the independent variable)
4) Normality of errors
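
The first assumption can be quantified as well as judged by eye. As a minimal sketch, the Pearson correlation coefficient can be computed with pandas. The `toy` DataFrame below is hypothetical data made up for illustration, not taken from the decile files.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not from the decile dataset).
toy = pd.DataFrame({
    "Volatility": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Cancellation rate": [2.1, 3.9, 6.2, 7.8, 10.1],
})

# Series.corr defaults to the Pearson correlation coefficient, which lies in [-1, 1].
r = toy["Volatility"].corr(toy["Cancellation rate"])
print(r)
```

Values of `r` close to 1 or -1 suggest a strong linear association, while values near 0 suggest that a straight line is unlikely to describe the data well.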

#### Decile 1 and Linear Regression

In [None]:
#Extracting features of market cap decile 1 
#independent variable = volatility decile1 (deviation in the stock prices in this decile for each date)
#dependent variable = market cap decile1 (the cancel to trade of stocks each date for the tenth of businesses with the lowest market capitalisation)
decile_1 = decile_to_cancel_raw[["Market Cap Decile1","Volatility Decile1"]] 

In [None]:
decile_1.head() #what does our data look like

In [None]:
help(decile_1.rename) #help on how to rename columns 

In [None]:
decile_1 = decile_1.rename(columns={'Market Cap Decile1':'Cancellation rate','Volatility Decile1':'Volatility'})
#renaming columns since we know we are in decile 1 of the decile_cancel_to_trade file

## Does decile 1 fit our assumptions?
Let us see if our data for decile 1 fits our assumptions.    
To test this, I will make a basic plot of the two features.

### Linearity
Plot our independent variable vs dependent variable as a scatterplot.

In [None]:
x_1 = decile_1[['Volatility']]        #assigning the independent variable (volatility) to x
y_1 = decile_1[['Cancellation rate']] #assigning the dependent variable (cancellation rate) to y

In [None]:
plot_linear = plt.scatter(x_1, y_1) #scatter plot of our two features for decile 1

From this plot we can see a few key details, namely: 
1) Our data has a few outliers.  
2) Our data does follow a linear relationship with most values condensed around the centre point of the volatility scale.  
3) The linear trend is present but its gradient is shallow, meaning the volatility in decile 1 does not have much predictive power for the cancellation rate. I will explore different features for this model, i.e. different deciles. 

#### Decile 9 and Linear Regression

In [None]:
#Extracting features of market cap decile 9 
decile_9 = decile_to_cancel_raw[["Market Cap Decile9","Volatility Decile9"]] 

In [None]:
y_2 = decile_9[['Market Cap Decile9']] #assigning the dependent variable (cancellation rate) to y
x_2 = decile_9[['Volatility Decile9']] #assigning the independent variable (volatility) to x

In [None]:
plot_linear = plt.scatter(x_2, y_2) #scatter plot of our two features for decile 9

From this plot we can see that:
1) There is a strong positive linear relationship, therefore volatility does have predictive power for cancellation rates. 
2) There is heteroscedasticity in the raw data, so we may need to apply some kind of transformation. I will build the regression model first and then check for homoscedasticity in the residuals. 
3) This is interesting, since the high heteroscedasticity indicates that as stocks get more volatile, purchasing decisions become more extreme. 

In [None]:
decile_9 = decile_9.rename(columns={'Market Cap Decile9':'Cancellation rate','Volatility Decile9':'Volatility'})
#renaming columns since we know we are in decile 9 of the cancel_to_trade file

In [None]:
plot_box = plt.boxplot(x_2) #creating a boxplot of our independent variable

In [None]:
plot_box = plt.boxplot(y_2) #creating a boxplot of our dependent variable

From these plots we can see a significant number of outliers above the upper whisker of each boxplot.
We will use the interquartile range (IQR) method to handle these values.

## OUTLIER REMOVAL 
I will identify outliers using bounds based on the quartiles (the IQR method). 
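
As a compact illustration of the IQR method, the bounds can be wrapped in one small function. This is a sketch only: `iqr_bounds` and the `toy` Series are hypothetical names and data introduced here, and the multiplier `k=1.5` is the conventional choice.

```python
import pandas as pd

# Hypothetical helper: lower and upper outlier bounds from the interquartile range.
def iqr_bounds(series, k=1.5):
    q1 = series.quantile(0.25)   # first quartile
    q3 = series.quantile(0.75)   # third quartile
    iqr = q3 - q1                # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data with one obvious high outlier.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(toy)
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper)        # prints: -1.0 7.0
print(outliers.tolist())   # prints: [100.0]
```

Any value outside the two bounds is treated as an outlier. Where the outliers sit only in the upper tail, the upper bound alone is enough.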

In [None]:
#Step 1- compute Q1 and Q3
#Cancellation rates 
Q1C = decile_9['Cancellation rate'].quantile(0.25) #quartile 1 of cancellation rates 
Q3C = decile_9['Cancellation rate'].quantile(0.75) #quartile 3 of cancellation rates 
print(Q1C)
print(Q3C)




In [None]:
#Volatility 
Q1V = decile_9['Volatility'].quantile(0.25)         #quartile 1 of volatility  
Q3V = decile_9['Volatility'].quantile(0.75)         #quartile 3 of volatility  
print(Q1V)
print(Q3V)

In [None]:
#Step 2 - Compute IQR
#Cancellation rate 
IQRC = Q3C - Q1C
print(IQRC)

In [None]:
#Volatility 
IQRV= Q3V -Q1V
print(IQRV)

In [None]:
#Step 3 - Find the upper bound (only the upper bound is needed, since the outliers are in the upper tail)
#Cancellation rate
upper_b_Canc = Q3C + 1.5*IQRC
print(upper_b_Canc)

In [None]:
#Volatility 
upper_b_Vol = Q3V + 1.5*IQRV
print(upper_b_Vol)

In [None]:
#Number of outliers - https://www.analyticsvidhya.com/blog/2022/09/dealing-with-outliers-using-the-iqr-method/
#Cancellation rate
decile_9[decile_9['Cancellation rate'] > upper_b_Canc].count()
print((70/3329)*100) #percentage of values to cap; not excessive at ≈ 2%

In [None]:
#Volatility
decile_9[decile_9['Volatility'] > upper_b_Vol].count()
print((69/3329)*100) #percentage of values to cap; not excessive at ≈ 2%

### WINSORISATION 
A method of dealing with systematic outliers while maintaining the distribution of the data. 

Through our EDA, I learned that our outliers are only present in the upper tail. Truncating them would remove information from the data and give our model lower predictive power at the high end.   
Therefore I will use Winsorisation to cap them at the maximum of Q3 + 1.5*IQR - https://www.datacamp.com/tutorial/winsorized-mean
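
To make the capping step concrete, here is a minimal sketch on hypothetical toy data (the `toy` Series is made up for illustration). It uses pandas `Series.clip` with the Q3 + 1.5*IQR bound as the cap.

```python
import pandas as pd

# Hypothetical toy data with one obvious high outlier.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)      # upper bound used as the cap

# clip(upper=...) replaces every value above the cap with the cap itself.
capped = toy.clip(upper=upper)
print(capped.tolist())            # prints: [1.0, 2.0, 3.0, 4.0, 7.0]
```

The capped Series keeps the same number of rows as the original. This is the difference from truncation, which would drop the outlying rows entirely.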

In [None]:

#Cancellation rate 
decile_9['Cancellation rate'] = decile_9['Cancellation rate'].clip(upper=upper_b_Canc) #winsorising cancellation rate 
decile_9[decile_9['Cancellation rate'] > upper_b_Canc].count()                         #should now be 0, as no values remain above the cap


In [None]:
#Volatility
decile_9['Volatility'] = decile_9['Volatility'].clip(upper=upper_b_Vol)                 #winsorising volatility  
decile_9[decile_9['Volatility'] > upper_b_Vol].count()                                  #should now be 0, as no values remain above the cap

In [None]:
# New values of x and y with Winsorisation 
y_3 = decile_9[['Cancellation rate']]
x_3 = decile_9[['Volatility']]

In [None]:
plt.boxplot(x_3) #New boxplot of volatility with Winsorisation

In [None]:
plt.boxplot(y_3) #New boxplot of cancellation rate with Winsorisation

In [None]:
plt.hist(x_3)   #New histogram of volatility with Winsorisation

In [None]:
plt.hist(y_3)   #New histogram of cancellation rate with Winsorisation

Now that we have pre-processed data, with linearity checked and outliers handled, we can use our linear regression model.
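
A from-scratch model that indexes its inputs by position (`x[i]`, `y[i]`) is easiest to feed with plain Python lists rather than DataFrames. The sketch below shows one possible hand-off. The `toy` DataFrame and the names `x_list` and `y_list` are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the pre-processed decile data.
toy = pd.DataFrame({
    "Volatility": [0.1, 0.2, 0.3],
    "Cancellation rate": [10.0, 12.0, 15.0],
})

# Selecting a single column gives a Series; tolist() converts it to a plain list
# whose elements can be accessed by position.
x_list = toy["Volatility"].tolist()
y_list = toy["Cancellation rate"].tolist()
print(x_list, y_list)
```

Lists like these could then be passed to a method such as `vectorise(x_list, y_list)`.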