# How to build a linear regression model

### 0. Import required modules

In [1]:
import pandas as pd #import pandas library for data manipulation (https://pandas.pydata.org/)
import numpy as np #import the 'NumPy' module for scientific computing with Python (https://numpy.org/)

In [2]:
%run -i "../00_utils.py" #source required functions for building a linear regression and plotting the results

### 1. Create generic dataset
*see [2. Creating dataframes using pandas](02_creating_df.ipynb)*

In [3]:
#create multiple independent variable lists
var1 = [8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2]
var2 = [5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5]
var3 = [4.7, 8.8, 15.1, None, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9]

#create single dependent variable list
y = [8, 9.300000191, 7.5, 8.899999619, 10.19999981, 8.300000191, 8.800000191, 8.800000191, 10.69999981, 11.69999981]

#merge independent and dependent variables into dataframe
data = pd.DataFrame({
    'var1': var1, 
    'var2': var2, 
    'var3': var3,
    'y': y})

data #display dataframe

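|   | var1 | var2 | var3 | y |
|---|------|------|------|---|
| 0 | 8.5 | 5.1 | 4.7 | 8.0 |
| 1 | 12.9 | 5.8 | 8.8 | 9.3 |
| 2 | 5.2 | 2.1 | 15.1 | 7.5 |
| 3 | 10.7 | 8.4 | NaN | 8.9 |
| 4 | 3.1 | 2.9 | 10.6 | 10.2 |
| 5 | 3.5 | 1.2 | 3.5 | 8.3 |
| 6 | 9.2 | 3.7 | 9.7 | 8.8 |
| 7 | 9.0 | 7.6 | 5.9 | 8.8 |
| 8 | 15.1 | 7.7 | 20.8 | 10.7 |
| 9 | 10.2 | 4.5 | 7.9 | 11.7 |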


### 2. Fitting a linear regression

Linear regression models a continuous outcome (i.e. with a change in x we see a proportional change in y).
This distinguishes it from logistic regression, which predicts a categorical outcome.

A simple linear regression is defined by the formula y = mx + c, with 1 independent variable (x) and 1 dependent variable (y).
The relationship between the two variables x and y is determined by the gradient (m) and the y-intercept (c).
The greater the gradient (m), the greater the change in y for each unit change in x.
The y-intercept (c) is the value of y when x = 0, so it shifts the whole line up or down without changing its slope.
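
The roles of the gradient and the y-intercept can be seen in a minimal sketch that fits a straight line with NumPy's `np.polyfit`. The data points here are made up for illustration (they lie exactly on y = 2x + 1) and are not part of the dataset used in this notebook.

```python
import numpy as np

# Illustrative data only: points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line).
# polyfit returns coefficients from the highest power down: [m, c]
m, c = np.polyfit(x, y, deg=1)

print(f"gradient m = {m:.2f}, y-intercept c = {c:.2f}")
```

The fitted gradient recovers the slope of 2 and the fitted intercept recovers the value of y at x = 0, which is 1.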

In this example, we use 3 independent variables (var1, var2 and var3) to model the continuous dependent variable (y).
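
With several independent variables the formula generalises to y = c + m1·x1 + m2·x2 + m3·x3. The internals of `LinearRegTrain()` are not shown in this notebook, so the following is only a sketch of one common way to estimate such a model: ordinary least squares via `np.linalg.lstsq`. It assumes that rows with missing values are dropped before fitting, and its coefficients may not match the function's output exactly.

```python
import numpy as np
import pandas as pd

# Same values as the generic dataset displayed in section 1
data = pd.DataFrame({
    'var1': [8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2],
    'var2': [5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5],
    'var3': [4.7, 8.8, 15.1, None, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9],
    'y': [8.0, 9.3, 7.5, 8.9, 10.2, 8.3, 8.8, 8.8, 10.7, 11.7]})

# Assumption: rows containing a missing value are excluded from the fit
complete = data.dropna()

# Design matrix: a column of ones for the intercept, then the 3 variables
features = complete[['var1', 'var2', 'var3']].to_numpy(dtype=float)
design = np.column_stack([np.ones(len(features)), features])
target = complete['y'].to_numpy(dtype=float)

# Least-squares solution for [const, m1, m2, m3]
coef, *_ = np.linalg.lstsq(design, target, rcond=None)

for name, value in zip(['const', 'var1', 'var2', 'var3'], coef):
    print(f"{name}: {value:.2f}")
```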

In [4]:
#--- use the LinearRegTrain() function from 'SportsAnalytics.py' to build a linear regression model
model = LinearRegTrain(X = data[['var1', 'var2', 'var3']], Y = data['y'])

                  const var1 var2 var3
y                                     
Coefficients       7.85 0.11 0.02 0.03
Std error          1.42 0.25 0.43 0.11
p-value            0.00 0.67 0.96 0.77
Log-likelihood   -14.08               
Number valid obs   5.00               
Total obs          9.00               


### 3. Predicting results using linear regression model

In [5]:
#--- use the LinearRegPredict() function from 'SportsAnalytics.py' to predict values of y with known values of x
model_predict = LinearRegPredict(model, X = data[['var1', 'var2', 'var3']])
model_predict

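|   | var1 | var2 | var3 | prediction |
|---|------|------|------|------------|
| 0 | 8.5 | 5.1 | 4.7 | 9.1 |
| 1 | 12.9 | 5.8 | 8.8 | 9.75 |
| 2 | 5.2 | 2.1 | 15.1 | 9.01 |
| 3 | 10.7 | 8.4 | NaN | NaN |
| 4 | 3.1 | 2.9 | 10.6 | 8.64 |
| 5 | 3.5 | 1.2 | 3.5 | 8.4 |
| 6 | 9.2 | 3.7 | 9.7 | 9.32 |
| 7 | 9.0 | 7.6 | 5.9 | 9.25 |
| 8 | 15.1 | 7.7 | 20.8 | 10.46 |
| 9 | 10.2 | 4.5 | 7.9 | 9.39 |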
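
A linear regression prediction is the intercept plus each variable multiplied by its coefficient. The sketch below applies the rounded coefficients from the model summary in section 2 to two rows of the dataset by hand. It is not the implementation of `LinearRegPredict()`, which is not shown here, and because the coefficients are rounded to two decimals the results only approximate the predictions in the table.

```python
import pandas as pd

# Coefficients copied from the (rounded) model summary in section 2
coef = {'const': 7.85, 'var1': 0.11, 'var2': 0.02, 'var3': 0.03}

# Rows 0 and 3 of the dataset; row 3 has a missing value for var3
X = pd.DataFrame({
    'var1': [8.5, 10.7],
    'var2': [5.1, 8.4],
    'var3': [4.7, None]})

# prediction = const + m1*var1 + m2*var2 + m3*var3
prediction = (coef['const']
              + coef['var1'] * X['var1']
              + coef['var2'] * X['var2']
              + coef['var3'] * X['var3'])

print(prediction.round(2))
```

The first row evaluates to roughly 9.03, close to the 9.1 reported by the model. The second row is NaN because a missing input propagates through the arithmetic, which is consistent with the empty prediction for row 3.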


### 4. Predicting out-of-sample (OOS) results using linear regression model

In [6]:
#create multiple independent variable lists
oos_var1 = [3.099999905, 3.5, 12.89999962, 5.199999809, 9, 15.10000038]
oos_var2 = [5.800000191, 2.099999905, 7.699999809, 4.5, 2.900000095, 1.200000048]
oos_var3 = [4.699999809, 8.800000191, 15.10000038, 12.19999981, 10.60000038, 20.79999924]

#merge independent variables into dataframe
oos_data = pd.DataFrame({
    'var1': oos_var1, 
    'var2': oos_var2, 
    'var3': oos_var3})

#predict values of y with new out-of-sample values of x
oos_model_predict = LinearRegPredict(model, X = oos_data[['var1', 'var2', 'var3']])

oos_model_predict #display dataframe

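|   | var1 | var2 | var3 | prediction |
|---|------|------|------|------------|
| 0 | 3.1 | 5.8 | 4.7 | 8.5 |
| 1 | 3.5 | 2.1 | 8.8 | 8.6 |
| 2 | 12.9 | 7.7 | 15.1 | 10.01 |
| 3 | 5.2 | 4.5 | 12.2 | 8.96 |
| 4 | 9.0 | 2.9 | 10.6 | 9.31 |
| 5 | 15.1 | 1.2 | 20.8 | 10.31 |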
