# Simple Linear Regression - Model Validation #
Work through the code cells: run each one, look at the relevant output and then interpret the model.

<hr>

## Importing packages and reading the data ##



If you haven't already, you may need to install the relevant packages in your virtual environment.

(If you haven't set up a virtual environment already, now's the time to do so!)

``` shell
python3 -m venv my-venv-name
source my-venv-name/bin/activate
```

You can call your virtual environment whatever you like, not just `my-venv-name`.

``` shell 
(my-venv-name) pip install -U scikit-learn 
``` 
It then downloads ...
``` shell
(my-venv-name) pip install -U statsmodels
```
It also downloads...

``` shell
(my-venv-name) pip install -U pandas matplotlib
```
Make sure pandas and matplotlib are installed in this virtual environment as well.
You can check what is installed by running:
``` shell
(my-venv-name) pip freeze > requirements.txt
```
Then open `requirements.txt` and check that each package is listed.
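
As an alternative quick check from inside Python, a minimal sketch like the one below reports which packages can be found in the active environment. The helper name `check_packages` is hypothetical; `importlib.util.find_spec` is part of the standard library and returns `None` when a top-level module cannot be found.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# These are import names, which can differ from the pip package name
# (scikit-learn is imported as sklearn).
required = ["pandas", "matplotlib", "sklearn", "statsmodels"]

for name, found in check_packages(required).items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```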

In [2]:
# Import the required packages

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import sklearn.metrics as metrics
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv("../datasets/SLR_advertising_budget.csv")



<HR>

## Split the DataFrame into train and test data ##

In [3]:
train, test = train_test_split(
    df,
    random_state = 13 # this ensures that we get the same answer each time
)

print(len(train))
print(len(test))
print(len(test)/(len(train)+len(test)))

150
50
0.25
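
The 150/50 split arises because `train_test_split` holds out 25% of the rows for testing by default when no size is given. The sketch below illustrates the idea using only the standard library: shuffle the rows with a fixed seed, then slice. It is not scikit-learn's implementation, the helper `simple_split` is hypothetical, and the rows it selects will differ from those chosen with `random_state = 13`.

```python
import random


def simple_split(rows, test_fraction=0.25, seed=13):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same order every run
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(200))  # stand-in for the 200 rows of the DataFrame
train_rows, test_rows = simple_split(rows)

print(len(train_rows), len(test_rows))  # 150 50
```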


<hr>
 
 ## Train the model


The syntax for building a regression model is:
> model = sm.OLS(<br>
&nbsp;&nbsp;&nbsp;&nbsp;DataFrame['dependent variable'],<br>
&nbsp;&nbsp;&nbsp;&nbsp;DataFrame[['independent variable 1','independent variable 2',...]]<br>
).fit()


In [4]:
# Identify dependent and independent variables

dependent_var = train['Sales']
independent_var = train['Advertising']
independent_var = sm.add_constant(independent_var)

# Build the model 
model = sm.OLS(
    dependent_var,
    independent_var
).fit()
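
For a single independent variable, the least-squares line has a closed-form solution: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept is mean(y) minus slope times mean(x). The sketch below uses made-up points, not the advertising data, and the helper `fit_line` is hypothetical. It only illustrates what `sm.OLS(...).fit()` estimates in this simple case.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations divided by sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]

slope, intercept = fit_line(x, y)
print(slope, intercept)  # 2.0 1.0
```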


<hr>

## Interpret the model results
In this section, we interpret the model results. The summary statistics printed below give us the line of best fit. They also tell us whether the model is 'a good fit' for our data, meaning how accurately it can predict outcomes or trends seen in the data.

In [5]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.738 |
| Model: | OLS | Adj. R-squared: | 0.736 |
| Method: | Least Squares | F-statistic: | 416.8 |
| Date: | Wed, 14 Jun 2023 | Prob (F-statistic): | 6.94e-45 |
| Time: | 12:16:32 | Log-Likelihood: | -355.26 |
| No. Observations: | 150 | AIC: | 714.5 |
| Df Residuals: | 148 | BIC: | 720.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.2263 | 0.515 | 8.211 | 0.000 | 3.209 | 5.243 |
| Advertising | 0.0489 | 0.002 | 20.415 | 0.000 | 0.044 | 0.054 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.786 | Durbin-Watson: | 2.157 |
| Prob(Omnibus): | 0.151 | Jarque-Bera (JB): | 3.542 |
| Skew: | -0.376 | Prob(JB): | 0.17 |
| Kurtosis: | 3.042 | Cond. No. | 520.0 |


 
### Equation of the line ###

y = m*x + b

y = **0.0489**x + **4.2263**

y = 0.0489 * \[Advertising] + 4.2263

So for every additional £1 spent on advertising, the predicted sales number increases by 0.0489.
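
To make the equation concrete, the sketch below plugs a few advertising values into the fitted line, using the coefficients from the summary table. The advertising values are arbitrary examples and the helper `predict_sales` is hypothetical.

```python
SLOPE = 0.0489      # 'Advertising' coefficient from the summary
INTERCEPT = 4.2263  # 'const' coefficient from the summary


def predict_sales(advertising):
    """Apply y = m*x + b with the fitted coefficients."""
    return SLOPE * advertising + INTERCEPT


for spend in (0, 100, 200):  # arbitrary example values
    print(spend, round(predict_sales(spend), 4))
```

With zero advertising the prediction is just the intercept, and each extra unit of advertising adds the slope to the predicted sales.
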
<br>
 
### Is the model 'a good fit'? ###
To understand if a model can be used to make a prediction we need to look at additional elements of the OLS Regression results: <br>

- **p-value**\
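A p-value < 0.05 suggests that the relationship between an independent variable and the dependent variable is statistically significant: if there were no real relationship, a coefficient this far from zero would be unlikely to arise by chance alone. It helps us to identify which independent variables have a significant effect on the dependent variable. In our case, the p-value for Advertising is shown as 0.000 (rounded to three decimal places).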


- **R-squared**\
R-squared is the proportion of the variation in the dependent variable that the model explains, between 0 (terrible fit) and 1 (perfect fit). In our case, 73.8\% is high.\
In good real-world examples, you can expect to achieve closer to 30\%.\
\
30\% is usually sufficient because we are not trying to describe the dependent variable exactly (since we don’t have all the data); we just want to estimate what might happen under small changes. Despite not describing everything, the model still has practical use!

- **Adjusted R-squared**\
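Adjusted R-squared is similar to R-squared but applies a penalty for each additional independent variable, so it is the better measure for models with more than one independent variable. With a single independent variable, as here, it is almost identical to R-squared (0.736 versus 0.738).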

- **F-Statistic**\
The F-statistic is a test that compares two models. In this case, it compares our regression model to the base model (always predicting the mean).\
An F value well above 1 suggests that the regression model is better. An F value close to or below 1 suggests that it is no better than the base model. In our case, the F-statistic is 416.8.

- **Prob F-Statistic**\
Prob (F-statistic) tells us how statistically significant the F-statistic is. The lower the probability, the more significant the difference between the models. In our case it is 6.94e-45, so our regression model is MUCH better than the base model in a statistically significant way.
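
The adjusted R-squared and the F-statistic can both be recovered from R-squared, the number of observations `n` and the number of independent variables `k`. The sketch below uses the rounded values from the summary table, so the results should match the reported 0.736 and 416.8 only to within rounding.

```python
r_squared = 0.738  # R-squared from the summary (rounded)
n = 150            # No. Observations
k = 1              # Df Model (number of independent variables)

# Adjusted R-squared penalises each extra independent variable
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# F-statistic: explained variance per model degree of freedom
# divided by unexplained variance per residual degree of freedom
f_statistic = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

print(round(adj_r_squared, 3))  # 0.736
print(round(f_statistic, 1))    # close to the reported 416.8
```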

<HR>
 
## Predict data in the test set ##

First, we use the model to predict the test data. Then we compare it to the actual test data.

The syntax for predicting using the model is:
> model.predict(TestDataFrame)


In [6]:
# Predict the model results on the test data

predicted = model.predict(
    sm.add_constant(test['Advertising'])
)

## Is the model also 'a good fit' for our test data? ##
We can check this by looking at the R squared score for our prediction.

In [7]:
# Measure the test R squared

metrics.r2_score(test['Sales'],predicted)

0.7845437198872972
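
The R squared score is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below applies that definition to made-up numbers, not the advertising data. The helper `r2` is hypothetical and only mirrors the standard definition that `metrics.r2_score` computes for a single output.

```python
def r2(actual, predicted):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]     # made-up observed values
predicted = [2.5, 5.5, 6.5, 9.5]  # made-up model predictions

print(r2(actual, predicted))  # 0.95
```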

Our test R squared is over 78%!\
In a good real-world example, you want your ***test R squared*** to be *roughly* similar to your ***train R squared***, within about $\pm 10\%$. Here the train R squared was 73.8\%, so the two are close.
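
As a quick arithmetic check of that rule of thumb, the sketch below compares the two scores reported above. Reading the tolerance as 10 percentage points (0.10) is an assumption made for this example.

```python
train_r2 = 0.738              # R-squared from the training summary
test_r2 = 0.7845437198872972  # test R squared from r2_score above

gap = abs(test_r2 - train_r2)
print(round(gap, 3))  # 0.047
print(gap <= 0.10)    # True
```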

Once we're confident that the model can be applied to new data, we usually rebuild the model using *all* the data (test + train).

## Rebuilding the model for the whole data set ##

In [8]:
dependent_var = df['Sales']
independent_var = df['Advertising']
independent_var = sm.add_constant(independent_var)

# Build the model 
model = sm.OLS(
    dependent_var,
    independent_var
).fit()

model.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.753 |
| Model: | OLS | Adj. R-squared: | 0.752 |
| Method: | Least Squares | F-statistic: | 603.4 |
| Date: | Wed, 14 Jun 2023 | Prob (F-statistic): | 5.06e-62 |
| Time: | 12:16:32 | Log-Likelihood: | -473.88 |
| No. Observations: | 200 | AIC: | 951.8 |
| Df Residuals: | 198 | BIC: | 958.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.2430 | 0.439 | 9.676 | 0.000 | 3.378 | 5.108 |
| Advertising | 0.0487 | 0.002 | 24.564 | 0.000 | 0.045 | 0.053 |

| | | | |
|---|---|---|---|
| Omnibus: | 6.851 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.033 | Jarque-Bera (JB): | 6.692 |
| Skew: | -0.373 | Prob(JB): | 0.0352 |
| Kurtosis: | 3.495 | Cond. No. | 528.0 |


In [10]:
# Make a prediction for sales with an advertising value of 10.
# The input is [const, Advertising]: the leading 1 multiplies the intercept.
sales_pred = model.predict([1, 10])
sales_pred[0]

4.729907009226813