# Multiple Regression

## Learning goals:

For a multivariable linear regression, students will be able to:

* compare and contrast with univariable linear regression
* write an example of the equation
* develop one with statsmodels 
* assess the model fit 
* validate the model


### Key terms
- Multivariable
- Train-test split
- MSE: Mean squared error
- RMSE: Root mean squared error
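
As a quick illustration of the last two terms, here is a minimal sketch of how MSE and RMSE are computed. The three observed values are taken from the beer dataset used later in this lesson; the predictions are made-up numbers chosen only for the example.

```python
import math

# Observed consumption (litres) for three days, and hypothetical predictions.
y_true = [25.461, 28.972, 30.814]
y_pred = [26.0, 28.0, 31.0]  # assumed values, for illustration only

# MSE: the average of the squared differences between observed and predicted.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE: the square root of the MSE, which puts the error back in litres.
rmse = math.sqrt(mse)

print("MSE :", round(mse, 4))
print("RMSE:", round(rmse, 4))
```

Because RMSE is in the same units as the target, it is usually the easier of the two to explain to a non-technical audience.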


## Scenario

The University of São Paulo in Brazil likes to party. We are a contracted beer supplier to the university, and we want to make sure we have enough supply on hand. We hope to build a model that can predict beer consumption from other variables.


![beer](pexels-photo-544988-small.jpeg)
More about the dataset can be found [here](https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo)


###  Prior Knowledge


Before looking at the dataset, what variables do we think might be in there? What might make a student drink more? 

#### Step 1:  Discussion 

- compare and contrast with univariable linear regression
- How is this different from the regression we've done before?
- Here, you'll explore how to perform linear regressions using multiple independent variables to better predict a target variable.

#### Step 2:  Develop a multivariable regression model with statsmodels 

**Load Libraries and load in data**

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

import matplotlib.pyplot as plt

In [10]:
df = pd.read_csv('Consumo_cerveja.csv')

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941 entries, 0 to 940
Data columns (total 7 columns):
Data                           365 non-null object
Temperatura Media (C)          365 non-null object
Temperatura Minima (C)         365 non-null object
Temperatura Maxima (C)         365 non-null object
Precipitacao (mm)              365 non-null object
Final de Semana                365 non-null float64
Consumo de cerveja (litros)    365 non-null float64
dtypes: float64(2), object(5)
memory usage: 51.5+ KB


In [5]:
df.head()

| | Data | Temperatura Media (C) | Temperatura Minima (C) | Temperatura Maxima (C) | Precipitacao (mm) | Final de Semana | Consumo de cerveja (litros) |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 | 27,3 | 23,9 | 32,5 | 0 | 0.0 | 25.461 |
| 1 | 2015-01-02 | 27,02 | 24,5 | 33,5 | 0 | 0.0 | 28.972 |
| 2 | 2015-01-03 | 24,82 | 22,4 | 29,9 | 0 | 1.0 | 30.814 |
| 3 | 2015-01-04 | 23,98 | 21,5 | 28,6 | 1,2 | 1.0 | 29.799 |
| 4 | 2015-01-05 | 23,82 | 21 | 28,3 | 0 | 0.0 | 28.9 |


### Small Data Cleaning Tasks:
- Drop Date
- convert all the columns to numeric (replace ',' with '.')
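- rename columns to be `name = ['temp-median', 'temp-min', 'temp-max', 'rain', 'finals-week', 'target']` (note that `Temperatura Media` is the mean temperature and `Final de Semana` is Portuguese for "weekend", so `temp-median` holds daily means and `finals-week` flags weekend days)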

In [6]:
df.columns

Index(['Data', 'Temperatura Media (C)', 'Temperatura Minima (C)',
       'Temperatura Maxima (C)', 'Precipitacao (mm)', 'Final de Semana',
       'Consumo de cerveja (litros)'],
      dtype='object')

In [18]:
df = pd.read_csv('Consumo_cerveja.csv', decimal=',', parse_dates=['Data'])
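
The `decimal=','` argument in the call above is what turns the temperature and rain columns from `object` into `float64`. Here is a minimal sketch of that behaviour on a tiny in-memory sample. The two rows are an assumption about how the raw file is laid out (quoted values with a comma as the decimal separator), not a copy of it.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the assumed style of the raw file:
# comma-separated fields, with quoted numbers that use ',' as the decimal mark.
raw = io.StringIO(
    'Data,Temperatura Media (C),Precipitacao (mm)\n'
    '2015-01-01,"27,3","0"\n'
    '2015-01-04,"23,98","1,2"\n'
)

# decimal=',' parses "27,3" as 27.3; parse_dates converts the date column.
sample = pd.read_csv(raw, decimal=',', parse_dates=['Data'])

print(sample.dtypes)
print(sample)
```

With this approach there is no need to replace commas by hand afterwards.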

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941 entries, 0 to 940
Data columns (total 7 columns):
Data                           365 non-null datetime64[ns]
Temperatura Media (C)          365 non-null float64
Temperatura Minima (C)         365 non-null float64
Temperatura Maxima (C)         365 non-null float64
Precipitacao (mm)              365 non-null float64
Final de Semana                365 non-null float64
Consumo de cerveja (litros)    365 non-null float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 51.5 KB


In [20]:
df['Consumo de cerveja (litros)'] = pd.to_numeric(
    df['Consumo de cerveja (litros)'])

In [12]:
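# clean data here
# Only needed when the CSV was read without decimal=',': the columns then hold
# strings such as '27,3'. astype(str) keeps this safe for values that are
# already floats, and the result is assigned back to the DataFrame.
var = ['Temperatura Media (C)', 'Temperatura Minima (C)',
       'Temperatura Maxima (C)', 'Precipitacao (mm)']
for v in var:
    df[v] = pd.to_numeric(
        df[v].astype(str).str.replace(',', '.', regex=False), errors='coerce')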

In [22]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941 entries, 0 to 940
Data columns (total 7 columns):
Data                           365 non-null datetime64[ns]
Temperatura Media (C)          365 non-null float64
Temperatura Minima (C)         365 non-null float64
Temperatura Maxima (C)         365 non-null float64
Precipitacao (mm)              365 non-null float64
Final de Semana                365 non-null float64
Consumo de cerveja (litros)    365 non-null float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 51.5 KB


| | Temperatura Media (C) | Temperatura Minima (C) | Temperatura Maxima (C) | Precipitacao (mm) | Final de Semana | Consumo de cerveja (litros) |
|---|---|---|---|---|---|---|
| count | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 | 365.0 |
| mean | 21.226356 | 17.46137 | 26.611507 | 5.196712 | 0.284932 | 25.401367 |
| std | 3.180108 | 2.826185 | 4.317366 | 12.417844 | 0.452001 | 4.399143 |
| min | 12.9 | 10.6 | 14.5 | 0.0 | 0.0 | 14.343 |
| 25% | 19.02 | 15.3 | 23.8 | 0.0 | 0.0 | 22.008 |
| 50% | 21.38 | 17.9 | 26.9 | 0.0 | 0.0 | 24.867 |
| 75% | 23.28 | 19.6 | 29.4 | 3.2 | 1.0 | 28.631 |
| max | 28.86 | 24.5 | 36.5 | 94.8 | 1.0 | 37.937 |


**Check** for NaNs

In [23]:
df.isna().sum()

Data                           576
Temperatura Media (C)          576
Temperatura Minima (C)         576
Temperatura Maxima (C)         576
Precipitacao (mm)              576
Final de Semana                576
Consumo de cerveja (litros)    576
dtype: int64

In [24]:
df.dropna(inplace=True)

In [25]:
df.shape

(365, 7)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 365 entries, 0 to 364
Data columns (total 7 columns):
Data                           365 non-null datetime64[ns]
Temperatura Media (C)          365 non-null float64
Temperatura Minima (C)         365 non-null float64
Temperatura Maxima (C)         365 non-null float64
Precipitacao (mm)              365 non-null float64
Final de Semana                365 non-null float64
Consumo de cerveja (litros)    365 non-null float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 32.8 KB


In [31]:
df = df.drop('Data', axis=1)

In [32]:
df.columns = ['temp-median', 'temp-min', 'temp-max', 'rain', 'finals-week', 'target']

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 365 entries, 0 to 364
Data columns (total 6 columns):
temp-median    365 non-null float64
temp-min       365 non-null float64
temp-max       365 non-null float64
rain           365 non-null float64
finals-week    365 non-null float64
target         365 non-null float64
dtypes: float64(6)
memory usage: 30.0 KB


In [34]:
df.dropna(inplace=True)

In [35]:
df.shape

(365, 6)

### Everyone write an example of an equation for our multiple regression

The main idea here is pretty simple. Whereas in simple linear regression we took our dependent variable to be a function of a single independent variable, here we take it to be a function of multiple independent variables.

<img src="https://miro.medium.com/max/1400/1*d0icRnPHWjHSNXxuoYT5Vg.png" width=450 />

Our regression equation, then, instead of looking like $\hat{y} = mx + b$, will now look like:

$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ 

Remember that the hats (as in $\hat{\beta}$) indicate quantities that are estimated from the data.

What would the formula be with real values?

**Send your equations to me via zoom or slack and I will paste them into the notebook**

Equations here

>

![statsmodels](https://www.statsmodels.org/stable/_static/statsmodels_hybi_banner.png)

Okay, now here's how you can use format and join to make the formula with **code**:

In [36]:
formula = 'target~{}'.format("+".join(df.columns[:-1]))
formula

'target~temp-median+temp-min+temp-max+rain+finals-week'
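
One caveat if this string is ever passed to the statsmodels formula API (`statsmodels.formula.api.ols`, which this notebook does not use): formulas are parsed by patsy, where `-` is an operator, so a column called `temp-median` would be read as `temp` minus `median`. The sketch below shows two common workarounds, either wrapping each name in patsy's `Q("...")` quoting helper or renaming the columns with underscores. The column list is copied from the renaming step above.

```python
# Column names as set in the renaming step; the last one is the target.
columns = ['temp-median', 'temp-min', 'temp-max', 'rain', 'finals-week', 'target']
predictors = columns[:-1]

# Option 1: quote every predictor so patsy treats it as a single name.
quoted_formula = 'target ~ ' + ' + '.join('Q("{}")'.format(c) for c in predictors)

# Option 2: replace hyphens with underscores (the DataFrame columns would
# need the same renaming before fitting).
safe_columns = [c.replace('-', '_') for c in columns]
plain_formula = 'target ~ ' + ' + '.join(safe_columns[:-1])

print(quoted_formula)
print(plain_formula)
```

The array-based `sm.OLS(y, X)` call used below is not affected, because it never parses the column names.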

In [41]:
df.head()

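| | temp-median | temp-min | temp-max | rain | finals-week | target |
|---|---|---|---|---|---|---|
| 0 | 27.3 | 23.9 | 32.5 | 0.0 | 0.0 | 25.461 |
| 1 | 27.02 | 24.5 | 33.5 | 0.0 | 0.0 | 28.972 |
| 2 | 24.82 | 22.4 | 29.9 | 0.0 | 1.0 | 30.814 |
| 3 | 23.98 | 21.5 | 28.6 | 1.2 | 1.0 | 29.799 |
| 4 | 23.82 | 21.0 | 28.3 | 0.0 | 0.0 | 28.9 |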


In [42]:
cons = sm.add_constant(df.drop('target', axis=1))
cons


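| | const | temp-median | temp-min | temp-max | rain | finals-week |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 27.30 | 23.9 | 32.5 | 0.0 | 0.0 |
| 1 | 1.0 | 27.02 | 24.5 | 33.5 | 0.0 | 0.0 |
| 2 | 1.0 | 24.82 | 22.4 | 29.9 | 0.0 | 1.0 |
| 3 | 1.0 | 23.98 | 21.5 | 28.6 | 1.2 | 1.0 |
| 4 | 1.0 | 23.82 | 21.0 | 28.3 | 0.0 | 0.0 |
| 5 | 1.0 | 23.78 | 20.1 | 30.5 | 12.2 | 0.0 |
| 6 | 1.0 | 24.00 | 19.5 | 33.7 | 0.0 | 0.0 |
| 7 | 1.0 | 24.90 | 19.5 | 32.8 | 48.6 | 0.0 |
| 8 | 1.0 | 28.20 | 21.9 | 34.0 | 4.4 | 0.0 |
| 9 | 1.0 | 26.76 | 22.1 | 34.2 | 0.0 | 1.0 |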


In [45]:
model = sm.OLS(df.target, cons).fit()
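
For intuition about what the `add_constant` and `OLS(...).fit()` steps compute, here is a minimal NumPy sketch of the same idea: add a column of ones for the intercept, then solve the least-squares problem. It uses synthetic, noise-free data with coefficients chosen for the example rather than the beer dataset, so the true values are known in advance.

```python
import numpy as np

# Synthetic data with known coefficients (assumed for illustration):
# y = 3 + 2*x1 - 1*x2, with no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Equivalent of sm.add_constant: prepend a column of ones for the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of [beta_0, beta_1, beta_2].
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)
```

statsmodels adds the standard errors, t-statistics and p-values shown in the summary on top of these point estimates.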

In [46]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | target | R-squared: | 0.723 |
| Model: | OLS | Adj. R-squared: | 0.719 |
| Method: | Least Squares | F-statistic: | 187.1 |
| Date: | Fri, 19 Jul 2019 | Prob (F-statistic): | 1.19e-97 |
| Time: | 12:56:29 | Log-Likelihood: | -824.07 |
| No. Observations: | 365 | AIC: | 1660.0 |
| Df Residuals: | 359 | BIC: | 1684.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.4447 | 0.845 | 7.627 | 0.000 | 4.783 | 8.107 |
| temp-median | 0.0308 | 0.188 | 0.164 | 0.870 | -0.339 | 0.401 |
| temp-min | -0.0190 | 0.110 | -0.172 | 0.863 | -0.236 | 0.198 |
| temp-max | 0.6560 | 0.095 | 6.895 | 0.000 | 0.469 | 0.843 |
| rain | -0.0575 | 0.010 | -5.726 | 0.000 | -0.077 | -0.038 |
| finals-week | 5.1832 | 0.271 | 19.126 | 0.000 | 4.650 | 5.716 |

| | | | |
|---|---|---|---|
| Omnibus: | 39.362 | Durbin-Watson: | 1.93 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 12.936 |
| Skew: | 0.153 | Prob(JB): | 0.00155 |
| Kurtosis: | 2.13 | Cond. No. | 271.0 |


### What's the actual multivariable  linear regression equation with the coefficients?

$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ 
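
Plugging in the `coef` column from the summary gives one way to write the fitted equation:

$$ \hat y = 6.4447 + 0.0308\,x_{\text{temp-median}} - 0.0190\,x_{\text{temp-min}} + 0.6560\,x_{\text{temp-max}} - 0.0575\,x_{\text{rain}} + 5.1832\,x_{\text{finals-week}} $$

As a sanity check, the short sketch below evaluates that equation by hand for the first row of the cleaned data. The coefficients are the rounded values printed in the summary, so the result will differ slightly from `model.predict`.

```python
# Rounded coefficients copied from the model summary.
coefs = {
    'const': 6.4447,
    'temp-median': 0.0308,
    'temp-min': -0.0190,
    'temp-max': 0.6560,
    'rain': -0.0575,
    'finals-week': 5.1832,
}

# First row of the cleaned dataset (see df.head()); const is always 1.
row = {
    'const': 1.0,
    'temp-median': 27.3,
    'temp-min': 23.9,
    'temp-max': 32.5,
    'rain': 0.0,
    'finals-week': 0.0,
}

# y_hat is the sum of coefficient * value over all terms.
y_hat = sum(coefs[name] * row[name] for name in coefs)

print(round(y_hat, 2))
```

The prediction comes out at roughly 28.15 litres, against an observed value of 25.461 for that day.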

#### Step 3: Assess the model fit
Demonstrate and Apply:

**Discussion:**

In groups of 2 or 3 write a synopsis of the following summary

* What can you say about the coefficients?

* What do the p-values tell us?

* What does $R^2$ represent?

* What other insights do you notice?
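
To support the discussion of $R^2$, here is a small sketch of how it and the adjusted version are defined. The helper names `r_squared` and `adjusted_r_squared` are made up for this example. The last lines check the adjusted value against the summary, using the reported $R^2$ of 0.723 with 365 observations and 5 predictors.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R^2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values taken from the statsmodels summary.
adj = adjusted_r_squared(0.723, n_obs=365, n_predictors=5)
print(round(adj, 3))
```

A model that always predicts the mean of the target has $R^2 = 0$, which is a useful baseline when reading the summary.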





#### Step 4: Validate the model 
![scikit](https://cdn-images-1.medium.com/max/1200/1*-FHtcdQljtGKQGm77uDIyQ.png)
- Build LinReg Model with Scikit-Learn
- Check some of the linear regression assumptions
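
One assumption worth checking for this dataset is low multicollinearity, since the three temperature columns are likely to move together. The sketch below computes variance inflation factors (VIFs) with NumPy, using the fact that they equal the diagonal of the inverse of the predictors' correlation matrix. The data here is synthetic and only meant to show the pattern; on the real data you would pass `X.values` instead.

```python
import numpy as np

# Synthetic predictors (assumed for illustration): x2 is almost a copy of x1,
# while x3 is independent of both.
rng = np.random.default_rng(1)
n = 365
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
corr = np.corrcoef(X_demo, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

print(vif)
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that a predictor is largely explained by the others, which can make its coefficient and p-value unstable.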


In [47]:
linreg = LinearRegression()

In [48]:
X = df.drop("target", axis=1)
y = df.target

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [50]:
# use fit to form model
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [51]:
# R^2 of the fitted model, evaluated on the held-out test set
linreg.score(X_test, y_test)

0.7584380893422863

`score` returns $R^2$, here computed on the held-out test set.

How does it differ from when you use the whole dataset?
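
One way to explore that question is to compare the test-set $R^2$ with the $R^2$ on the full dataset and with k-fold cross-validation, which averages over several different splits. The sketch below does this on synthetic data (the coefficients, noise level and `random_state` values are arbitrary choices for the example) and also reports the RMSE on the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for X and y (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(365, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=365)

# Single train/test split; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)
linreg_demo = LinearRegression().fit(X_tr, y_tr)

test_r2 = linreg_demo.score(X_te, y_te)
full_r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)
rmse = np.sqrt(mean_squared_error(y_te, linreg_demo.predict(X_te)))

# 5-fold cross-validation: one R^2 per held-out fold.
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

print("test R^2:", test_r2)
print("full R^2:", full_r2)
print("test RMSE:", rmse)
print("CV R^2 per fold:", cv_scores, "mean:", cv_scores.mean())
```

A single split can land above or below the full-data value purely by chance, so the spread of the fold scores gives a better sense of how stable the model is.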

### Integration:

Repeat this process for the concrete compressive strength dataset. The documentation can be found [here](http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength).
![test](building-construction-building-site-constructing-small.jpg)

In [None]:
df2 = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls')

In [None]:
df2.head()
df2.info()

### Assessment

### Reflection

### Resources

- https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86
- Full code implementation of Linear Regression: https://github.com/SSaishruthi/Linear_Regression_Detailed_Implementation
- Multiple regression explained: https://www.statisticssolutions.com/what-is-multiple-linear-regression/
