<div >
    <img src = "../banner/banner_ML_UNLP_1900_200.png" />
</div>

<a target="_blank" href="https://colab.research.google.com/github/ignaciomsarmiento/ML_UNLP_Lectures/blob/main/Week01/Notebook_SS01_CE.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# Linear Regression for Prediction

Model

$$
y= f(X) + u
$$

Linear regression assumes that $f$ is linear in the parameters:

$$
y= X\beta + u
$$


We want to predict: $Y_i$

> *Example:* the log of wages

Characteristics (aka **predictors**, **features**): $X_i=\left(X_{1i},\ldots,X_{pi}\right)'$

> *Example:* education, age, parents' education, cognitive ability, etc.
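
To make the linear model concrete, here is a minimal sketch on simulated data (not the NLSY sample). The feature names, the coefficient values in `beta_true`, and the noise scale are assumptions chosen only for illustration. It generates $y = X\beta + u$ and recovers $\beta$ by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
# Hypothetical features: years of schooling and experience (simulated)
educ = rng.integers(8, 21, size=n).astype(float)
exper = rng.integers(0, 25, size=n).astype(float)

# Design matrix with a constant column for the intercept
X = np.column_stack([np.ones(n), educ, exper])

# Assumed "true" coefficients, used only to generate the data
beta_true = np.array([1.0, 0.10, 0.02])
u = rng.normal(scale=0.5, size=n)
y = X @ beta_true + u

# Least-squares estimate of beta and the fitted values X @ beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
print(beta_hat)
```

With a sample of this size the estimates should land close to `beta_true`, and the residuals `y - y_hat` are orthogonal to every column of `X` by construction.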


In [23]:
%matplotlib inline

# import some useful packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
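# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in matplotlib 3.6
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)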



In [24]:
nlsy=pd.read_csv('https://raw.githubusercontent.com/ignaciomsarmiento/datasets/main/nlsy97.csv')

In [25]:
nlsy.head()

Unnamed: 0,lnw_2016,educ,black,hispanic,other,exp,afqt,mom_educ,dad_educ,yhea_100_1997,...,_XPexp_13,_XPexp_14,_XPexp_16,_XPexp_17,_XPexp_18,_XPexp_19,_XPexp_20,_XPexp_21,_XPexp_22,_XPexp_23
0,4.076898,16,0,0,0,11,7.0724,12,12,3,...,0,0,0,0,0,0,0,0,0,0
1,3.294138,9,0,0,0,19,4.7481,9,10,2,...,0,0,0,0,0,1,0,0,0,0
2,2.830896,9,0,1,0,22,1.1987,12,9,3,...,0,0,0,0,0,0,0,0,1,0
3,4.306459,16,0,0,0,13,8.9321,16,18,2,...,1,0,0,0,0,0,0,0,0,0
4,5.991465,16,0,1,0,15,2.2618,16,16,1,...,0,0,0,0,0,0,0,0,0,0


## Linear Regression: log(wage) on polynomials of education

In [26]:
# generate a list with the powers 1 through 9 of education
powerlist=[nlsy['educ']**j for j in np.arange(1,10)]
X=pd.concat(powerlist,axis=1)
X.columns = ['educ'+str(j) for j in np.arange(1,10)]

In [27]:
X

Unnamed: 0,educ1,educ2,educ3,educ4,educ5,educ6,educ7,educ8,educ9
0,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736
1,9,81,729,6561,59049,531441,4782969,43046721,387420489
2,9,81,729,6561,59049,531441,4782969,43046721,387420489
3,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736
4,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736
...,...,...,...,...,...,...,...,...,...
1261,14,196,2744,38416,537824,7529536,105413504,1475789056,20661046784
1262,9,81,729,6561,59049,531441,4782969,43046721,387420489
1263,10,100,1000,10000,100000,1000000,10000000,100000000,1000000000
1264,18,324,5832,104976,1889568,34012224,612220032,11019960576,198359290368


### Running the regression

In [28]:
# run least squares regression
# instantiate and fit our regression object:
reg= LinearRegression().fit(X,nlsy['lnw_2016'])

In [29]:
print(reg.coef_)  # view the estimated coefficients

[ 3.89539531e+01  1.17857783e+02 -4.54359326e+01  8.01174965e+00
 -8.16097613e-01  5.09465653e-02 -1.92939301e-03  4.07951469e-05
 -3.70127381e-07]


In [33]:
from statsmodels.formula.api import ols

# copy X so that adding the outcome column does not modify X itself
db1 = X.copy()
db1['lnw_2016'] = nlsy['lnw_2016']

In [34]:
db1.head()

Unnamed: 0,educ1,educ2,educ3,educ4,educ5,educ6,educ7,educ8,educ9,lnw_2016
0,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736,4.076898
1,9,81,729,6561,59049,531441,4782969,43046721,387420489,3.294138
2,9,81,729,6561,59049,531441,4782969,43046721,387420489,2.830896
3,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736,4.306459
4,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736,5.991465


In [35]:
X

Unnamed: 0,educ1,educ2,educ3,educ4,educ5,educ6,educ7,educ8,educ9
0,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736
1,9,81,729,6561,59049,531441,4782969,43046721,387420489
2,9,81,729,6561,59049,531441,4782969,43046721,387420489
3,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736
4,16,256,4096,65536,1048576,16777216,268435456,4294967296,68719476736
...,...,...,...,...,...,...,...,...,...
1261,14,196,2744,38416,537824,7529536,105413504,1475789056,20661046784
1262,9,81,729,6561,59049,531441,4782969,43046721,387420489
1263,10,100,1000,10000,100000,1000000,10000000,100000000,1000000000
1264,18,324,5832,104976,1889568,34012224,612220032,11019960576,198359290368


In [None]:
reg_statsmodels = ols("lnw_2016~educ1+educ2+educ3+educ4+educ5+educ6+educ7+educ8+educ9", data = db1).fit()
print(reg_statsmodels.summary())

## Prediction

$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \text{educ} + \cdots + \hat{\beta}_{9} \text{educ}^{9}
$$

In [None]:
X

In [None]:
# predict computes \hat{\beta}_0 + X\hat{\beta} for us

# generate predicted values
yhat=reg.predict(X)
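
As a quick check of what `predict` does, the sketch below fits a `LinearRegression` on simulated data and compares its predictions with the manual computation $\hat{\beta}_0 + X\hat{\beta}$. The names `X_demo`, `y_demo`, and `reg_demo` are hypothetical, the data are simulated, and the polynomial is kept to degree 3 only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated years of schooling and a cubic polynomial in it
educ = rng.integers(8, 21, size=200).astype(float)
X_demo = np.column_stack([educ**j for j in range(1, 4)])
y_demo = 1.0 + 0.1 * educ + rng.normal(scale=0.5, size=200)

reg_demo = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept plus X times the estimated coefficients
manual = reg_demo.intercept_ + X_demo @ reg_demo.coef_

# Compare with the predictions returned by predict
same = np.allclose(manual, reg_demo.predict(X_demo))
print(same)
```

Note that `coef_` excludes the intercept, which scikit-learn stores separately in `intercept_`.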


In [None]:
# plot predicted values against mean log wages at each level of education
lnwbar=nlsy.groupby('educ')['lnw_2016'].mean()
# build the same polynomial features for each observed education level
Xbar=pd.DataFrame({'educ':lnwbar.index.values})
powerlist=[Xbar['educ']**j for j in np.arange(1,10)]
Xbar=pd.concat(powerlist,axis=1)
Xbar.columns = ['educ'+str(j) for j in np.arange(1,10)]
# reg was fit on the unscaled features, so predict on Xbar directly
ybarhat=reg.predict(Xbar)
fig = plt.figure()
ax = plt.axes()
ax.plot(Xbar['educ1'],lnwbar,'bo',Xbar['educ1'],ybarhat,'g-');
plt.title("ln Wages by Education in the NLSY")
plt.xlabel("years of schooling")
plt.ylabel("ln wages");

As we can see, least-squares linear regression with a rich enough set of transformations (here, polynomials in education) can approximate any continuous function on a bounded range, so it can certainly be used for prediction.

With a sufficiently rich set of transformations, OLS predictions are approximately unbiased estimates of the ideal predictor: the conditional expectation function. But these estimates will be quite noisy, because every additional term adds a parameter that must be estimated from the same sample.
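
The trade-off between bias and noise can be illustrated with a small simulation. This is a sketch under stated assumptions: the "true" conditional expectation `cef`, the noise scale, the sample size, and the two polynomial degrees compared (1 and 9) are all chosen for illustration and are unrelated to the NLSY data. It refits both polynomials on many simulated samples and reports the average squared bias and variance of the predictions on a fixed grid.

```python
import numpy as np
from numpy.polynomial.polynomial import polyvander

rng = np.random.default_rng(42)

def cef(x):
    # Assumed "true" conditional expectation function (illustrative only)
    return np.sin(2.5 * x)

def fit_predict(x, y, degree, grid):
    # Least-squares fit on polynomial terms 1, x, ..., x^degree
    coef, *_ = np.linalg.lstsq(polyvander(x, degree), y, rcond=None)
    return polyvander(grid, degree) @ coef

grid = np.linspace(-0.9, 0.9, 19)
n, reps = 100, 300
preds = {1: [], 9: []}
for _ in range(reps):
    # Draw a fresh sample and refit both specifications
    x = rng.uniform(-1, 1, size=n)
    y = cef(x) + rng.normal(scale=0.5, size=n)
    for degree in preds:
        preds[degree].append(fit_predict(x, y, degree, grid))

results = {}
for degree, p in preds.items():
    p = np.asarray(p)  # shape (reps, len(grid))
    results[degree] = {
        "bias2": float(np.mean((p.mean(axis=0) - cef(grid)) ** 2)),
        "variance": float(np.mean(p.var(axis=0))),
    }
    print(degree, results[degree])
```

In this setup the degree-9 fit should show much smaller squared bias than the linear fit, but a larger variance across samples, which is the noise referred to above.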