# Simple Linear Regression

Simple linear regression is a statistical technique for determining whether an association exists
between a dependent variable (aka response variable or outcome variable) and an independent
variable (aka explanatory variable, predictor variable or feature).
<br>
<br>
Regression is one of the most popular supervised learning algorithms in predictive analytics. A regression model requires knowledge of both the outcome and the feature variables in the training dataset. Typical applications include the following:

1. A hospital may be interested in finding how the total cost of a patient for a treatment varies with the body weight of the patient.
2. Insurance companies would like to understand the association between healthcare costs and ageing.
3. An organization may be interested in finding the relationship between revenue generated from a product and features such as the price, money spent on promotion, and competitors’ prices and promotion expenses.
4. Restaurants would like to know the relationship between the customer waiting time after placing the order and the revenue.
5. E-commerce companies such as Amazon, BigBasket, and Flipkart would like to understand the relationship between revenue and features such as <br>
(a) Number of customer visits to their portal.<br>
(b) Number of clicks on products.<br>
(c) Number of items on sale.<br>
(d) Average discount percentage.<br>
6. Banks and other financial institutions would like to understand the impact of variables such as unemployment rate, marital status, balance in the bank account, rainfall, etc. on the percentage of non-performing assets (NPA).


# STEPS IN BUILDING A REGRESSION MODEL
In this section, we will explain the steps used in building a regression model. Building a regression
model is an iterative process and several iterations may be required before finalizing the appropriate
model.
#### STEP 1: Collect/Extract Data
The first step in building a regression model is to collect or extract data on the dependent (outcome)
variable and independent (feature) variables from different data sources. Data collection in many cases
can be time-consuming and expensive, even when the organization has a well-designed enterprise resource
planning (ERP) system.
#### STEP 2: Pre-Process the Data
Before the model is built, it is essential to check the quality of the data for issues such as reliability,
completeness, usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with missing data. Descriptive statistics and visualization (such as box plots and scatter plots) may be used to identify outliers and variability in the dataset.
2. Many new variables (such as the ratio of variables or product of variables) can be derived (aka feature engineering) and also used in model building.
3. Categorical data must be pre-processed using dummy variables (part of feature engineering) before it is used in the regression model.
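
The three pre-processing ideas above (imputation, derived variables, dummy variables) can be sketched with pandas. This is only an illustration: the `patients` table, its column names, and the choice of median imputation are assumptions made for the example, not part of the MBA dataset used later.

```python
import pandas as pd

# Hypothetical toy data: a numeric column with a gap and a categorical column.
patients = pd.DataFrame({
    "weight_kg": [62.0, None, 80.5, 71.0],
    "height_m": [1.60, 1.75, 1.82, 1.68],
    "ward": ["general", "private", "general", "icu"],
})

# Impute the missing weight with the column median (one of many possible choices).
patients["weight_kg"] = patients["weight_kg"].fillna(patients["weight_kg"].median())

# Derived feature: body mass index, a ratio built from existing variables.
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2

# Dummy variables for the categorical column; drop_first leaves out one
# level ("general") so the dummies are not redundant with the intercept.
encoded = pd.get_dummies(patients, columns=["ward"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level is a common convention for regression with an intercept; whether it is appropriate depends on how the model will be specified.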

#### STEP 3: Dividing Data into Training and Validation Datasets
In this stage, the data is divided into two subsets (sometimes more than two): a training dataset
and a validation or test dataset. The training dataset usually makes up between 70% and 80% of the
data, and the remaining data is treated as the validation data. The subsets may be created using a
random or stratified sampling procedure. This step is important for measuring the performance of the
model on data not used in model building, and it is also essential for detecting overfitting. In many
cases, multiple training and test datasets are used (called cross-validation).
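
Cross-validation, as mentioned above, can be sketched with NumPy alone. The synthetic data, the choice of five folds, and the use of `np.polyfit` to fit the line are assumptions for illustration; a library helper for k-fold splitting would work equally well.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic data: y depends linearly on x plus noise.
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=100)

# Shuffle the row indices once, then cut them into 5 roughly equal folds.
indices = rng.permutation(len(x))
folds = np.array_split(indices, 5)

fold_rmse = []
for k, test_idx in enumerate(folds):
    # Every fold except the k-th forms the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = intercept + slope * x[test_idx]
    fold_rmse.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))

print(np.round(fold_rmse, 3))
```

Each observation is used for testing exactly once, so the spread of the fold errors gives a rough idea of how stable the model is across different splits.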

#### STEP 4: Perform Descriptive Analytics or Data Exploration
It is always good practice to perform descriptive analytics before building a predictive analytics
model. Descriptive statistics help us understand the variability in the data, and visualizing the
data through, say, a box plot shows whether there are any outliers. Another visualization technique,
the scatter plot, may reveal any obvious relationship between the two variables under consideration.
A scatter plot is useful for describing the functional relationship between the dependent or outcome
variable and the features.
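
As a numerical counterpart to the box plot, the sketch below flags outliers with the 1.5 × IQR rule that is conventionally used for box-plot whiskers. The `values` array is hypothetical, and the 1.5 multiplier is a convention rather than a fixed requirement.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([52.0, 55.0, 58.0, 60.0, 61.0, 63.0, 65.0, 120.0])

# Basic descriptive statistics.
print("mean:", values.mean(), "std:", values.std(ddof=1))

# Interquartile range and the usual whisker limits.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the limits are the ones a box plot would draw individually.
outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)
```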
#### STEP 5: Build the Model
The model is built using the training dataset to estimate the regression parameters. The method of
Ordinary Least Squares (OLS) is used to estimate the regression parameters.
#### STEP 6: Perform Model Diagnostics
Regression is often misused because the modeler fails to perform the necessary diagnostic tests
before applying the model. Before the model can be applied, it must be validated against all model
assumptions, including the definition of the functional form. If the model assumptions are
violated, the modeler must use remedial measures.
#### STEP 7: Validate the Model and Measure Model Accuracy
A major concern in analytics is over-fitting, that is, the model may perform very well on the training
dataset but badly on the validation dataset. It is important to ensure that the model performance
on the validation dataset is consistent with that on the training dataset. In fact, the model may be
cross-validated using multiple training and test datasets.
#### STEP 8: Decide on Model Deployment
The final step in building the regression model is to develop a deployment strategy in the form of
actionable items and business rules that can be used by the organization.

# BUILDING SIMPLE LINEAR REGRESSION MODEL
Simple Linear Regression (SLR) is a statistical model in which there is only one independent variable
(or feature) and the functional relationship between the outcome variable and the regression
coefficients is linear. Linear regression implies that the mathematical function is linear with respect
to the regression parameters.
One of the functional forms of SLR is as follows:
<img src="q.png" />

For a dataset with $n$ observations $(X_i, Y_i)$, where $i = 1, 2, \ldots, n$, the above functional form can be written
as follows:
<img src="w.png" />

where $Y_i$ is the value of the $i$th observation of the dependent variable (outcome variable) in the sample, $X_i$ is
the value of the $i$th observation of the independent variable or feature in the sample, $e_i$ is the random error
(also known as the residual) in predicting the value of $Y_i$, and $b_0$ and $b_1$ are the regression parameters
(or regression coefficients or feature weights).

The regression relationship stated above is a statistical relationship and so, unlike a mathematical
relationship, is not exact; this is why the error terms $e_i$ appear. It can be written as
<img src="e.png" />
The regression parameters $b_0$ and $b_1$ are estimated by minimizing the sum of squared errors (SSE).
<img src="r.png" />
The estimated values of the regression parameters are obtained by taking the partial derivatives of SSE
with respect to $b_0$ and $b_1$, setting them to zero, and solving the resulting equations for the
regression parameters. The estimated parameter values are given by
<img src="t.png" />

where $\hat{b}_0$ and $\hat{b}_1$ are the estimated values of the regression parameters $b_0$ and $b_1$. The above
procedure is known as the method of ordinary least squares (OLS). The estimate using OLS gives the best linear
unbiased estimates (BLUE) of the regression parameters.
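
The closed-form OLS estimates can be checked numerically. The sketch below uses the standard textbook solution (slope equals the sum of cross-deviations divided by the sum of squared deviations of the feature) on hypothetical toy data, and compares the result with `np.polyfit` as an independent check.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared deviations of x.
b1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the fitted line passes through the point of means.
b0_hat = y_mean - b1_hat * x_mean

# Independent check with NumPy's least-squares polynomial fit (degree 1).
slope, intercept = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)
print(intercept, slope)
```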

## Assumptions of the Linear Regression Model
1. The errors or residuals $e_i$ are assumed to follow a normal distribution with expected value $E(e_i) = 0$.
2. The variance of the error, $\mathrm{VAR}(e_i)$, is constant for various values of the independent variable $X$. This is known as homoscedasticity. When the variance is not constant, it is called heteroscedasticity.
3. The error and independent variable are uncorrelated.
4. The functional relationship between the outcome variable and feature is correctly defined.
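
A rough way to look at these assumptions is to inspect the residuals of a fitted line. The sketch below uses synthetic data and a simple split of the residuals at the median of the feature; both are assumptions for illustration, and the variance ratio is an informal heuristic, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data with constant error variance.
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=200)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# With an intercept in the model, OLS forces both of these to be ~0 in the
# sample, so they confirm the fit rather than test the assumptions.
mean_residual = residuals.mean()
corr_rx = np.corrcoef(residuals, x)[0, 1]
print("mean residual:", mean_residual)
print("corr(residual, x):", corr_rx)

# Rough homoscedasticity check: residual variance for low versus high x.
low = residuals[x <= np.median(x)]
high = residuals[x > np.median(x)]
variance_ratio = high.var(ddof=1) / low.var(ddof=1)
print("variance ratio (high x / low x):", variance_ratio)
```

A ratio far from 1 would suggest that the error variance changes with the feature; a residual plot or a formal test would be the natural next step in that case.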

## Properties of Simple Linear Regression
1. The mean value of $Y_i$ for a given $X_i$ is $E(Y_i \mid X_i) = b_0 + b_1 X_i$.

2. $Y_i$ follows a normal distribution with mean $b_0 + b_1 X_i$ and variance $\mathrm{VAR}(e_i)$.
Let us consider an example of predicting MBA salary (outcome variable) from the total GMAT score (feature).



In [1]:
import pandas as pd
import numpy as np

In [2]:
mba_salary = pd.read_csv("MBASSData.csv")
mba_salary.head()

Unnamed: 0,age,sex,gmat_tot,gmat_qpc,gmat_vpc,gmat_tpc,s_avg,f_avg,quarter,work_yrs,frstlang,salary,satis
0,23,2,620,77,87,87,3.4,3.0,1,2,1,0,7
1,24,1,610,90,71,87,3.5,4.0,1,2,1,0,6
2,24,1,670,99,78,95,3.3,3.25,1,2,1,0,6
3,24,1,570,56,81,75,3.3,2.67,1,1,1,0,7
4,24,2,710,93,98,98,3.6,3.75,1,2,1,999,5


In [3]:
mba_salary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274 entries, 0 to 273
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       274 non-null    int64  
 1   sex       274 non-null    int64  
 2   gmat_tot  274 non-null    int64  
 3   gmat_qpc  274 non-null    int64  
 4   gmat_vpc  274 non-null    int64  
 5   gmat_tpc  274 non-null    int64  
 6   s_avg     274 non-null    float64
 7   f_avg     274 non-null    float64
 8   quarter   274 non-null    int64  
 9   work_yrs  274 non-null    int64  
 10  frstlang  274 non-null    int64  
 11  salary    274 non-null    int64  
 12  satis     274 non-null    int64  
dtypes: float64(2), int64(11)
memory usage: 28.0 KB


Field descriptions:<br>
age: age in years<br>
sex: 1 = Male; 2 = Female<br>
gmat_tot: total GMAT score<br>
gmat_qpc: quantitative GMAT percentile<br>
gmat_vpc: verbal GMAT percentile<br>
gmat_tpc: overall GMAT percentile<br>
s_avg: spring MBA average<br>
f_avg: fall MBA average<br>
quarter: quartile ranking (1st is top, 4th is bottom)<br>
work_yrs: years of work experience<br>
frstlang: first language (1 = English; 2 = other)<br>
salary: starting salary<br>
satis: degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)<br>
<br>
Missing salary data are coded as follows:<br>
998 = did not answer the survey<br>
999 = answered the survey but did not disclose salary data<br>
Size of data set: 274 records<br>
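
Because 998 and 999 are codes rather than salaries, one option is to exclude those rows before modelling. The sketch below shows the filtering step on a small hypothetical frame that mimics the coding scheme; the rows are invented for illustration. The field description does not say how a salary of 0 should be interpreted, so those rows are left untouched here.

```python
import pandas as pd

# Hypothetical rows that mimic the coding scheme described above.
sample = pd.DataFrame({
    "gmat_tot": [620, 610, 710, 650, 560],
    "salary": [998, 999, 105000, 85000, 0],
})

# Codes that mark missing salary information rather than actual salaries.
missing_codes = [998, 999]

# Keep only rows whose salary is not one of the missing-data codes.
reported = sample[~sample["salary"].isin(missing_codes)]
print(reported)
```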

In [4]:
import statsmodels.api as sm

x = sm.add_constant(mba_salary['gmat_tot'])
x.head()

Unnamed: 0,const,gmat_tot
0,1.0,620
1,1.0,610
2,1.0,670
3,1.0,570
4,1.0,710


In [5]:
y = mba_salary['salary']
y

0           0
1           0
2           0
3           0
4         999
        ...  
269    104000
270    105000
271    115000
272    126710
273    220000
Name: salary, Length: 274, dtype: int64

In [6]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x,y, train_size=0.8, random_state = 42)

In [7]:
linearmodel = sm.OLS(ytrain, xtrain).fit()

In [8]:
print(linearmodel.params)

const       63133.771578
gmat_tot      -42.371118
dtype: float64


In [None]:
# MBA Salary = -42.371118(gmat_tot)  +  63133.771578
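
To see how the fitted equation is used, the sketch below plugs a GMAT score into it. The coefficients are copied from the parameter output above; the helper name `predict_salary` and the example score of 600 are assumptions for illustration.

```python
# Coefficients copied from the fitted-parameter output above.
intercept = 63133.771578
slope = -42.371118


def predict_salary(gmat_tot):
    """Return the fitted salary for a given total GMAT score."""
    return intercept + slope * gmat_tot


print(predict_salary(600))
```

For a score of 600 the fitted value is roughly 37,711. Whether such a prediction is meaningful depends on the model diagnostics.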

In [9]:
linearmodel.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,-0.002
Dependent Variable:,salary,AIC:,5368.8965
Date:,2021-03-02 09:53,BIC:,5375.6747
No. Observations:,219,Log-Likelihood:,-2682.4
Df Model:,1,F-statistic:,0.5236
Df Residuals:,217,Prob (F-statistic):,0.47
R-squared:,0.002,Scale:,2573500000.0

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
const,63133.7716,36498.7618,1.7298,0.0851,-8803.6929,135071.2360
gmat_tot,-42.3711,58.5570,-0.7236,0.4701,-157.7844,73.0422

0,1,2,3
Omnibus:,41.513,Durbin-Watson:,1.993
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30.513
Skew:,0.803,Prob(JB):,0.0
Kurtosis:,2.125,Condition No.:,6636.0


# MODEL DIAGNOSTICS
It is important to validate the regression model to ensure its validity and goodness of fit before it
can be used for practical applications. The following measures are used to validate the simple linear
regression models:
1. Co-efficient of determination (R-squared).
2. Hypothesis test for the regression coefficient.
3. Analysis of variance for overall model validity (important for multiple linear regression).



## Co-efficient of Determination (R-Squared or $R^2$)
The primary objective of regression is to explain the variation in $Y$ using the knowledge of $X$. The
co-efficient of determination (R-squared or $R^2$) measures the percentage of variation in $Y$ explained by
the model ($b_0 + b_1 X$). The total variation in the outcome variable can be broken into
1. Variation in outcome variable explained by the model.
2. Unexplained variation.

<img src = "one.png" />


It can be proven mathematically that <br>
<img src = "two.png" />
SST is the sum of squares of total variation, SSR is the sum of squares of explained variation due to the regression model, and SSE is the sum of squares of unexplained variation (error).

The co-efficient of determination (R-squared) is given by


<img src = "three.png" />

The co-efficient of determination (R-squared) has the following properties:
1. The value of R-squared lies between 0 and 1.
2. Mathematically, R-squared ($R^2$) is the square of the correlation coefficient ($R^2 = r^2$), where $r$ is the Pearson correlation co-efficient.
3. Higher R-squared indicates better fit; however, one should be careful about spurious relationships.
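
The decomposition of variation and the properties listed above can be verified numerically. The sketch below uses hypothetical toy data and computes SST, SSR, and SSE as defined earlier, then checks that R-squared equals the squared Pearson correlation.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)         # unexplained variation (error)

r_squared = ssr / sst
pearson_r = np.corrcoef(x, y)[0, 1]

print("SST:", sst, "SSR + SSE:", ssr + sse)
print("R-squared:", r_squared, "r squared:", pearson_r ** 2)
```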


## Hypothesis Test for the Regression Co-efficient
The regression co-efficient ($b_1$) captures the existence of a linear relationship between the outcome
variable and the feature. If $b_1 = 0$, there is no linear relationship between the two variables, so the
test checks whether the estimated co-efficient differs significantly from zero.

It can be shown that, when the null hypothesis is true, the estimate of $b_1$ divided by its standard error follows a t-distribution with $n - 2$ degrees of freedom.

The null and alternative hypotheses are

<img src = "four.png" />
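
The t-statistic for the slope can be computed by hand as the estimated co-efficient divided by its standard error. The sketch below does this on hypothetical toy data; the standard-error formula (square root of MSE divided by the sum of squared deviations of the feature) is the usual one for SLR, and the critical value 3.182 is taken from a standard t-table for 3 degrees of freedom at the two-sided 5% level.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Mean squared error with n - 2 degrees of freedom (two estimated parameters).
mse = np.sum(residuals ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)

# Standard error of the slope and the resulting t-statistic.
se_b1 = np.sqrt(mse / sxx)
t_stat = slope / se_b1

# Two-sided 5% critical value for 3 degrees of freedom (standard t-table).
t_critical = 3.182
reject_null = abs(t_stat) > t_critical
print("t-statistic:", t_stat, "reject H0:", reject_null)
```

The same ratio reproduces the t-value in the model summary: the reported co-efficient for gmat_tot divided by its reported standard error.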



## Analysis of Variance (ANOVA) in Regression Analysis
We can check the overall validity of the regression model using ANOVA in the case of multiple linear
regression model with k features. The null and alternative hypotheses are given by
<img src = "five.png" />
$H_A$: Not all regression coefficients are zero
The corresponding F-statistic is given by

<img src = "six.png" />



where MSR (= SSR/k) and MSE [= SSE/(n − k − 1)] are mean squared regression and mean squared error,
respectively. F-test is used for checking whether the overall regression model is statistically significant or not.
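
The F-statistic defined above can be computed directly from the sums of squares. The sketch below uses hypothetical toy data with a single feature ($k = 1$) and also checks that, in this special case, the F-statistic equals the square of the t-statistic for the slope.

```python
import numpy as np

# Hypothetical toy data with one feature, so k = 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n, k = len(x), 1

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

msr = ssr / k            # mean squared regression
mse = sse / (n - k - 1)  # mean squared error
f_stat = msr / mse

# t-statistic of the slope, for comparison with the F-statistic.
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_stat = slope / se_b1

print("F:", f_stat, "t squared:", t_stat ** 2)
```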




In [None]:
linearmodel.summary2()

#### From the summary output
1. The model R-squared value is 0.002, that is, the model explains 0.2% of the variation in salary.
2. The p-value for the t-test is 0.4701, which indicates that there is no statistically significant relationship (at significance level $\alpha = 0.05$) between the feature, gmat_tot, and salary.

Also, the probability value of the F-statistic of the model is 0.47, which indicates that the overall model is not statistically significant. Note that, in a simple linear regression, the p-values for the t-test and the F-test are the same since the null hypothesis is the same. (Also, $F = t^2$ in the case of SLR; here $(-0.7236)^2 \approx 0.5236$, which matches the reported F-statistic.)