# Machine Learning - Part I - Linear Regression

Let's start our tour of machine learning algorithms with a regression algorithm. <br>
Linear regression is widely used and more than two centuries old: its earliest form was published at the beginning of the 19th century.

Although it does not usually produce the best model, it lets us take the first steps toward understanding:

* model performance metrics, that is, how to compare models
* validation strategies: splitting the data into training and test sets


We will also introduce a notation shared by all the algorithms:

* $X$ : feature matrix
* $y$ : vector of prediction targets
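As a minimal sketch of this notation, assuming a tiny made-up dataset (the values and the names `X_toy` and `y_toy` are illustrative only and do not come from the insurance data), $X$ holds one row per observation and one column per feature, while $y$ holds one target value per observation:

```python
import numpy as np

# Hypothetical toy data: 4 observations, 2 features each.
X_toy = np.array([
    [19, 27.9],
    [28, 33.0],
    [33, 22.7],
    [50, 31.0],
])

# One target value per observation (illustrative values).
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)
```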


# Regression

<div class="span5 alert alert-info">

<p> Given $x$ and $y$, the goal of linear regression is to: </p>
<ul>
  <li> Build a <b>predictive model</b> to predict $y$ from the $x_i$</li>
  <li> Model the <b>strength of the relationship</b> between each independent variable $x_i$ and $y$</li>
    <ul>
      <li> Not every $x_i$ is related to $y$</li>
      <li> Which $x_i$ contribute the most to determining $y$? </li>
    </ul>
</ul>
</div>



### Recap
***

[Linear regression](http://en.wikipedia.org/wiki/Linear_regression) is a method for modeling the relationship between a set of independent variables $x$ (explanatory variables, features, predictors) and a dependent variable $y$. The method assumes that $y$ is linearly related to $x$.

$$ y = \beta_0 + \beta_1 x + \epsilon$$

where $\epsilon$ is the error term.

* $\beta_0$ is the intercept of the model and $\beta_1$ is the slope

* The goal is to estimate the coefficients (e.g. $\beta_0$ and $\beta_1$). We denote the estimates with a "hat" over the letter.

$$ \hat{\beta}_0, \hat{\beta}_1 $$

* Once we have the estimated coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, we can use them to predict new values of $y$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

* Multiple linear regression is the case with more than one independent variable
    * $x_1$, $x_2$, $x_3$, $\ldots$

$$ y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon$$ 
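The prediction formula above can be sketched in NumPy. The coefficient values and the names `beta0_hat_demo`, `beta_hat_demo` and `X_new` are hypothetical, chosen only to make the arithmetic easy to follow:

```python
import numpy as np

# Hypothetical estimated coefficients (illustrative values only).
beta0_hat_demo = 1.0                     # intercept
beta_hat_demo = np.array([2.0, -0.5])    # one coefficient per feature

# Feature matrix with 3 observations and p = 2 features.
X_new = np.array([
    [1.0, 2.0],
    [0.0, 4.0],
    [3.0, 0.0],
])

# y_hat = beta_0 + beta_1 * x_1 + ... + beta_p * x_p, computed for every row at once.
y_hat_demo = beta0_hat_demo + X_new @ beta_hat_demo
print(y_hat_demo)
```

The matrix product `X_new @ beta_hat_demo` computes the weighted sum of features for every row, which is why the same line covers both simple and multiple regression.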

In [1]:
import numpy as np
import pandas as pd

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

### Loading the data

In [3]:
# Local:
insurance = pd.read_excel('insurance_v2.xlsx', index_col = 0)

# Colab:
#from google.colab import files
#uploaded = files.upload()
#import io
#data = io.BytesIO(uploaded['insurance_v2.xlsx'])
#insurance = pd.read_excel(data, index_col=0)

In [7]:
insurance.charges.mean()

13270.422265141257

In [8]:
insurance.groupby(['region']).charges.agg(['mean', 'std'])

| region | mean | std |
|---|---|---|
| northeast | 13406.384516 | 11255.803066 |
| northwest | 12417.575374 | 11072.276928 |
| southeast | 14735.411438 | 13971.098589 |
| southwest | 12346.937377 | 11557.179101 |


In [9]:
insurance.groupby(['age', 'region']).charges.agg(['mean', 'std'])

| age | region | mean | std |
|---|---|---|---|
| 18 | northeast | 7558.732497 | 9016.763255 |
| 18 | southeast | 6677.555986 | 11228.556651 |
| 19 | northwest | 9479.636524 | 11808.718394 |
| 19 | southeast | 35570.314200 | 3718.397002 |
| 19 | southwest | 7543.201624 | 11007.425522 |
| ... | ... | ... | ... |
| 63 | southwest | 25327.514667 | 17630.838395 |
| 64 | northeast | 14944.022862 | 786.059483 |
| 64 | northwest | 20971.302894 | 8061.604748 |
| 64 | southeast | 26134.965187 | 14692.788507 |


In [10]:
insurance

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


In [4]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB


In [13]:
insurance.region.value_counts()

southeast    364
northwest    325
southwest    325
northeast    324
Name: region, dtype: int64

In [15]:
insurance.bmi.describe()

count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64

In [17]:
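# Restrict to the numeric columns: recent pandas versions raise an error on non-numeric columns
insurance[['age', 'bmi', 'children', 'charges']].corr()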

|  | age | bmi | children | charges |
|---|---|---|---|---|
| age | 1.0 | 0.109272 | 0.042469 | 0.299008 |
| bmi | 0.109272 | 1.0 | 0.012759 | 0.198341 |
| children | 0.042469 | 0.012759 | 1.0 | 0.067998 |
| charges | 0.299008 | 0.198341 | 0.067998 | 1.0 |


In [21]:
pd.pivot_table(insurance, 
               values='charges',
               index='sex',
               columns='region',
               aggfunc='median')


| sex (rows) / region (columns) | northeast | northwest | southeast | southwest |
|---|---|---|---|---|
| female | 10197.7722 | 9614.0729 | 8582.3023 | 8530.837 |
| male | 9957.7216 | 8413.46305 | 9504.3103 | 9391.346 |


In [23]:
pd.pivot_table(insurance, 
               values='charges',
               index='sex',
               columns='region', aggfunc='count')

| sex (rows) / region (columns) | northeast | northwest | southeast | southwest |
|---|---|---|---|---|
| female | 161 | 164 | 175 | 162 |
| male | 163 | 161 | 189 | 163 |


In [24]:
insurance

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


In [25]:
insurance = pd.get_dummies(data = insurance, columns = ['sex', 'smoker', 'region'], drop_first = False)

In [26]:
insurance

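|  | age | bmi | children | charges | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.900 | 0 | 16884.92400 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 18 | 33.770 | 1 | 1725.55230 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | 33.000 | 3 | 4449.46200 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 33 | 22.705 | 0 | 21984.47061 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32 | 28.880 | 0 | 3866.85520 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | 30.970 | 3 | 10600.54830 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1334 | 18 | 31.920 | 0 | 2205.98080 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1335 | 18 | 36.850 | 0 | 1629.83350 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1336 | 21 | 25.800 | 0 | 2007.94500 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |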


In [39]:
X = insurance.drop(columns=['charges'])
y = insurance['charges']

In [40]:
X = insurance.drop('charges', axis = 1)
y = insurance.charges

#### How do we find the "best" $\beta_0$ and $\beta_1$?

**Least squares method**


In [None]:
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/220px-Linear_least_squares_example2.svg.png'
Image(url)

Question: which blue line best represents the set of red points? <br>
Answer: the one that minimizes the sum of the squares of the green segments (the errors).

\begin{equation*}
MSE = \frac{1}{n} \sum_{i=0}^{n-1} \left( \hat{y}^{(i)} - y^{(i)} \right)^{2}
\end{equation*}

\begin{equation*}
RMSE = \sqrt{ \frac{1}{n} \sum_{i=0}^{n-1} \left( \hat{y}^{(i)} - y^{(i)} \right)^{2} }
\end{equation*}
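As a rough sketch of the idea, the snippet below fits a line by least squares on synthetic data and then evaluates MSE and RMSE exactly as defined above. The data, the `_demo` names and the use of `np.linalg.lstsq` are assumptions made for illustration. They are one way to solve the problem, not necessarily what any particular library does internally.

```python
import numpy as np

# Synthetic data generated from y = 2 + 3x plus noise (hypothetical values).
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 50)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(scale=1.0, size=x_demo.size)

# Design matrix with a column of ones so that the intercept is estimated too.
A_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# Least squares: the coefficients that minimize the sum of squared residuals.
coef_demo, *_ = np.linalg.lstsq(A_demo, y_demo, rcond=None)
beta0_hat_demo, beta1_hat_demo = coef_demo

# MSE and RMSE of the fitted line on the same data.
y_fit_demo = A_demo @ coef_demo
mse_demo = np.mean((y_fit_demo - y_demo) ** 2)
rmse_demo = np.sqrt(mse_demo)

print(beta0_hat_demo, beta1_hat_demo, mse_demo, rmse_demo)
```

Any other pair of coefficients would give a larger MSE on this data, which is what "best" means in the least squares sense.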

### Using all the features, with sklearn

In [41]:
from sklearn.linear_model import LinearRegression

In [42]:
lreg = LinearRegression()

Methods used:

* `lreg.fit()` : trains the model

* `lreg.predict()` : predicts values using a trained model

* `lreg.score()` : returns the coefficient of determination ($R^2$), the proportion of the variance in $y$ that the model explains.
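To make the coefficient of determination concrete, here is a minimal sketch that computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand. The helper `r2_manual` and the toy arrays are hypothetical names introduced only for this illustration:

```python
import numpy as np

def r2_manual(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions explain all of the variance.
print(r2_manual(y_true_demo, y_true_demo))

# Always predicting the mean explains none of it.
print(r2_manual(y_true_demo, np.full(4, y_true_demo.mean())))
```

A model that predicts worse than the mean gives a negative value, so $R^2$ is not bounded below by zero.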

In [57]:
lreg.fit(X, y)

LinearRegression()

In [44]:
lreg.coef_

array([   256.85635254,    339.19345361,    475.50054515,     65.6571797 ,
          -65.6571797 , -11924.26727096,  11924.26727096,    587.00923503,
          234.0453356 ,   -448.01281436,   -373.04175627])

In [58]:
X.columns

Index(['age', 'bmi', 'children', 'sex_female', 'sex_male', 'smoker_no',
       'smoker_yes', 'region_northeast', 'region_northwest',
       'region_southeast', 'region_southwest'],
      dtype='object')

In [45]:
lreg.intercept_

-666.9377199366318

In [46]:
y_pred = lreg.predict(X)

In [50]:
X.sample()

|  | age | bmi | children | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 816 | 24 | 24.225 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |


In [51]:
y_pred[816]

2090.0113990271093

In [56]:
y[816]

2842.76075

In [62]:
y_pred

array([25293.7130284 ,  3448.60283431,  6706.9884907 , ...,
        4149.13248568,  1246.58493898, 37085.62326757])

In [64]:
y

0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges, Length: 1338, dtype: float64

In [None]:
mse = np.mean((y - y_pred) ** 2)

In [None]:
rmse = np.sqrt(mse)

In [None]:
rmse

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
np.sqrt(mean_squared_error(y, y_pred))

## Training and Validation

### Why split the data into training and test sets

<div class="span5 alert alert-info">

<p> In the example above: </p>
<ul>
  <li> We trained and tested on the same data </li>
  <li> Predictions on that same data are expected to be good, but what about new data? </li>
  <li> One solution is to split the data: <b>train</b> on one part
      and reserve the other part for <b>testing</b>  </li>
  <li> This is called validation </li>  
</ul>
</div>
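The split described above can be sketched by hand with NumPy. The synthetic data, the 25% test fraction and every name in the block are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 observations, 3 features.
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle the row indices and reserve 25% of them for testing.
idx = rng.permutation(len(X_demo))
n_test = len(X_demo) // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]

print(X_tr.shape, X_te.shape)
```

Shuffling before splitting matters when the rows are stored in some meaningful order. No row ends up in both sets, so the test error is measured on data the model never saw during training.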

#### Predicting charges

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Roughly the number of rows in a 25% test set
X.shape[0] // 4

In [None]:
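# test_size=0.25 is the default; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)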

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
lreg.fit(X_train, y_train)

In [None]:
y_pred = lreg.predict(X_test)

In [None]:
mse_test = mean_squared_error(y_test, y_pred)
mse_test

In [None]:
rmse_test = np.sqrt(mse_test)
rmse_test

In [None]:
lreg.coef_

In [None]:
X.describe()

### References

* Machine learning modules: [scikit-learn](http://scikit-learn.org/stable/)
* Machine Learning course by Andrew Ng: [Coursera](https://www.coursera.org/learn/machine-learning)
* Data Analysis course by Jose Portilla: [Udemy](https://www.udemy.com/learning-python-for-data-analysis-and-visualization/learn/v4/t/lecture/2338236?start=0)
* Harvard CS109 course: [Harvard](http://cs109.github.io/2015)