# Multiple Linear Regression

This notebook studies and implements a linear regression model with two or more predictors. The diabetes dataset is used to build and explain the multiple linear regression model.


## Acknowledgments

- Used dataset: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html or diabetes from the "sklearn" package



## Importing libraries

In [None]:
# Import the packages that we will be using
import numpy as np                  # For arrays
import matplotlib.pyplot as plt     # For showing plots
import pandas as pd                 # For data handling
import seaborn as sns               # For advanced plotting

# Note: specific functions of the "sklearn" package will be imported when needed to show concepts easily
#from sklearn import datasets
#from sklearn import linear_model
#from sklearn.metrics import mean_squared_error, r2_score


## Importing data

In [None]:
# Dataset url
url = 

# Load the dataset
data = 

# Construct dataframe
ColumnNames = ['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6','Y']
df          = pd.DataFrame(data, columns = ColumnNames)


## Understanding the dataset

Get a general 'feel' of the dataset

In [None]:
df

In [None]:
# Number of predictors/variables/features and observations in the dataset
Nr, Nc = 

print("Number of observations (rows)  = {0:0d}".format(Nr))
print("Number of variables (columns)  = {0:0d}".format(Nc))


#### Characteristics of the dataset

1. The dataset description
    - Many observations/measurements/recordings of the characteristics/attributes/variables of persons
    - Variables: age, sex, bmi, bp, tc, ... (10 variables)
    - Total number of observations: 442


2. Description of the predictors/variables/features/attributes (independent variables)
    - age: age in years
    - sex
    - bmi: body mass index
    - bp: average blood pressure
    - s1: tc, total serum cholesterol
    - s2: ldl, low-density lipoproteins
    - s3: hdl, high-density lipoproteins
    - s4: tch, total cholesterol / HDL
    - s5: ltg, possibly log of serum triglycerides level
    - s6: glu, blood sugar level


3. Description of the response (dependent variable)
    - quantitative measure of disease progression one year after baseline

Note:
- If you load the dataset using sklearn with its default settings, each of the 10 variables has been mean-centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1)
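The scaling described in the note can be checked numerically. The sketch below assumes the default call to `load_diabetes`, which returns the scaled predictors; the names `col_means` and `col_sum_sq` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Default settings: predictors are mean-centered and scaled
X, _ = load_diabetes(return_X_y=True)

col_means = X.mean(axis=0)          # expected to be close to 0 for every column
col_sum_sq = (X ** 2).sum(axis=0)   # expected to be close to 1 for every column

print(np.round(col_means, 6))
print(np.round(col_sum_sq, 6))
```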

In [None]:
# Summary statistics for the variables



In [None]:
# Plot pairwise relationships in the dataset: use "BMI","BP","S6","Y"



In [None]:
# scatter plot between BMI and disease



The scatter plot seems to show a positive linear relationship between body mass index $BMI$ and disease $Y$. Note that we can draw a straight line with positive slope that roughly fits the values on the chart.

In [None]:
# scatter plot between BP and disease



The scatter plot shows only a weak positive linear relationship between blood pressure $BP$ and disease $Y$.

In [None]:
# scatter plot between S6 and disease



The scatter plot shows only a weak positive linear relationship between blood sugar level $S6$ and disease $Y$.

In [None]:
# Calculate correlation between variables: use ["BMI","BP","S3","S6","Y"]



This shows the correlation between variables. See how the disease $Y$ is positively correlated with $BMI$, $BP$ and $S6$, and negatively correlated with $S3$. Also, among these variables, $BMI$ has the strongest correlation with disease progression $Y$.
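A minimal sketch of the correlation step, assuming the scikit-learn copy of the dataset. Pearson correlation is unaffected by centering and scaling, so the default (scaled) data gives the same matrix as the raw data. The names `cols` and `corr` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Correlation does not depend on centering/scaling, so the default data is fine
X, y = load_diabetes(return_X_y=True)

cols = ['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'Y']
df = pd.DataFrame(np.column_stack([X, y]), columns=cols)

# Pairwise Pearson correlation for the selected variables
corr = df[["BMI", "BP", "S3", "S6", "Y"]].corr()
print(corr.round(2))
```

The last column (or row) of `corr` holds the correlation of each selected variable with $Y$.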

## Problem description

Given a dataset, we want to fit a model to explain and predict $y$ based on $p$ predictors or independent variables $x_1$, $x_2$, $x_3$, ..., $x_p$, that is:

$y = f(x_1,x_2,x_3,...,x_p)$

Two main objectives of model fitting:
- Making inference about relationships between variables in a given data set (commonly used in statistical analysis)
- Making predictions/forecasting future outcomes, based on models estimated using historical data (commonly used in machine learning)

## Description of the linear model

We want to build a (linear) model that predicts **disease** ($y$) based on the **body mass index** ($x_1$), **blood pressure** ($x_2$) and **blood sugar level** ($x_3$), that is:

$$y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 $$

or

$$y = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot BP + \beta_3 \cdot S6 $$

This is a multiple linear model, where $y$ is the dependent variable, the $x_i$'s are the independent variables, and
$\beta_0$ (intercept) and the $\beta_i$'s (slopes) are the unknown model parameters (or coefficients) that will be estimated from the data.

- $y$: response 
- $x_i$: predictors/variables/features
- $\beta_i$: coefficients (slopes)
- $\beta_0$: intercept



## Fitting the linear regression model using sklearn

Fit the linear regression model with predictors BMI, BP and S6 using sklearn.

In [None]:
# Import sklearn linear_model
from sklearn import linear_model

# Create linear regression object
regmodel = 

# Train the model using the training sets



In [None]:
# Model intercept
#regmodel.intercept_
#print("Intercept: \n", regmodel.intercept_)

# Get model intercept
b0 = 
print("The intercept b0 is", b0)


In [None]:
# Model coefficient (slope)
#regmodel.coef_
#print("Coefficients: \n", regmodel.coef_)

# Get model coefficients (slopes) 
b = 
print("The slopes beta are", b[0])


In [None]:
# Create a dataframe for the coefficients
Predictor_Names =['BMI', 'BP', 'S6']
#model_coefficients = regressor.coef_
df_coeff = pd.DataFrame(data = b, columns = Predictor_Names, index = ['Coefficient value:'])
print(df_coeff)


## Understanding the model

The estimated coefficients $\hat{\beta_0}$ and $\hat{\beta_i}$, $i=1,2,3$ were calculated from the data.

NOTE: the slopes are not zero, which means that for a new observation of **BMI** $x_1$, **BP** $x_2$ and **S6** $x_3$ we can predict the **disease** progression $y$

The final linear regression model is:

$$y = \beta_0 + \beta_1 \cdot BMI + \beta_2 \cdot BP + \beta_3 \cdot S6 $$

$$y = -244.74 + 7.93 \cdot BMI + 1.19 \cdot BP + 0.82 \cdot S6 $$

Interpretation (each statement holds the other two predictors fixed):

- for a unit increase in BMI, there is an increase of 7.93 in disease progression
- for a unit increase in BP, there is an increase of 1.19 in disease progression
- for a unit increase in S6, there is an increase of 0.82 in disease progression

The S6 coefficient is the closest to 0, which means a unit change in S6 has the least impact on the predicted disease progression

BMI, with a coefficient of 7.93, has the biggest per-unit impact on the prediction of disease progression

In other words, in this model disease progression is mostly explained by BMI. Keep in mind that the predictors are measured in different units, so raw coefficients are only a rough guide to relative importance.

Task: explore the potential connection between this result and the correlations computed above.

In [None]:
# Plot the data and the linear model

# NOTE: this is not possible because we have four dimensions


This multiple linear model is the one that minimizes the residual sum of squares (RSS) between the observed responses in the dataset $y_i$ and the responses predicted by the linear approximation $\hat{y_i}$.
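To illustrate what minimizing the RSS means, here is a small sketch on synthetic data. All values, including `true_beta` and the noise level, are hypothetical and unrelated to the diabetes dataset. The least-squares solution is computed with `np.linalg.lstsq`, and its RSS is compared with that of a slightly perturbed coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors and a known (hypothetical) linear relationship plus noise
X = rng.normal(size=(n, 3))
true_beta = np.array([2.0, 1.5, -0.7, 0.3])   # [b0, b1, b2, b3]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(n), X])

# Ordinary least squares: the beta that minimizes ||y - A beta||^2
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def rss(beta):
    """Residual sum of squares for a given coefficient vector."""
    residuals = y - A @ beta
    return float(residuals @ residuals)

print("Estimated coefficients:", beta_hat)
print("RSS at the estimate:   ", rss(beta_hat))
print("RSS after perturbation:", rss(beta_hat + 0.05))
```

Any other coefficient vector gives an RSS at least as large as the one obtained at `beta_hat`.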

## Assessing the accuracy of the coefficient estimates

These concepts will not be covered here. For more details, explore the concepts of "Statistical Modelling" and "Fitting Statistical Models to Data".

## Assessing the accuracy of the model: the $R^2$ statistic

In [None]:
# Calculate the coefficient of determination of the prediction



According to this $R^2$ value:

- The $R^2$ score of 0.41 implies that 41% of the variability of the dependent variable **disease** $y$ is explained by our multiple linear model
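The $R^2$ statistic can be written as $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, where TSS is the total sum of squares around the mean of $y$. The sketch below implements that formula directly on a tiny hypothetical example; the helper name `r_squared` and the sample values are illustrative. For a fitted sklearn model, `regmodel.score(X, y)` or `sklearn.metrics.r2_score` computes the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - rss / tss

# Hypothetical observations
y_obs = np.array([3.0, 5.0, 7.0, 9.0])

r2_perfect = r_squared(y_obs, y_obs)                        # predictions equal observations
r2_mean = r_squared(y_obs, np.full(4, y_obs.mean()))        # always predict the mean

print(r2_perfect, r2_mean)
```

A model that predicts every observation exactly scores 1, while a model that always predicts the mean of $y$ scores 0.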

## Make predictions

Make a prediction using one new observation of $x_1$, $x_2$ and $x_3$


In [None]:
# Define one input x and compute the output using regmodel
xnew  = 
ynew  = 
ynew


In [None]:
# Plot scatter plot of the data, the linear model, and (xnew,ynew)

# NOTE: this is not possible because we have four dimensions


Make predictions using several new observations of $x_1$, $x_2$ and $x_3$


In [None]:
# Define several inputs x and compute the output using regmodel
xnew = 
ynew = 
ynew


In [None]:
# Plot scatter plot of the data, the linear model, and (xnew,ynew)

# NOTE: this is not possible because we have four dimensions


Define a function to make predictions

In [None]:
# Function to predict


In [None]:
# Predict a value
X1new = 
X2new = 
X3new = 
Ynew  = 
Ynew
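The prediction function could be sketched as below, using the rounded coefficients reported in the "Understanding the model" section as default arguments. The function name `predict_disease` and the input values are hypothetical; in practice the coefficients would come from `b0` and `b` rather than being hard-coded.

```python
def predict_disease(bmi, bp, s6, b0=-244.74, b1=7.93, b2=1.19, b3=0.82):
    """Evaluate y = b0 + b1*BMI + b2*BP + b3*S6 for one observation."""
    return b0 + b1 * bmi + b2 * bp + b3 * s6

# Hypothetical new observation
X1new = 25.0   # BMI
X2new = 90.0   # BP
X3new = 85.0   # S6

Ynew = predict_disease(X1new, X2new, X3new)
print(round(Ynew, 2))
```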


# Final remarks

We have studied multiple linear regression, one of the most fundamental (supervised) machine learning algorithms.


# Activity 1: work with the diabetes dataset

Use the **diabetes** dataset to:

1. Build a linear model to predict
    - Disease progression ($y$) based on all available variables


2. Analyze your model and
    - Identify variables with negligible impact on the prediction of disease progression
    - Build a linear model to predict disease progression based on the remaining variables
    - Build a linear model to predict disease progression based on the discarded variables


3. Provide conclusions

# Activity 2: work with the cartwheel dataset

Using the **cartwheel** dataset:

1. Understand the dataset and provide descriptions

2. Build a linear model to predict cartwheel distance ($y$) based on at least three variables. Justify your choice of variables

3. Describe the impact that each variable has in the prediction of cartwheel distance

4. Indicate how well the model fits the data. Explain your results

5. Use the learned machine learning model to predict your own cartwheel distance and that of four family members. Provide detailed comments.


Note: always explain and elaborate on your responses in detail
    