# **ITEM 3 (PT. 1): LINEAR MODEL**

### Regression model

$y = \beta_0 + \beta_1 \times \text{X1 transaction date} + \beta_2 \times \text{X2 house age} + \beta_3 \times \text{X3 distance to the nearest MRT station} + \beta_4 \times \text{X4 number of convenience stores} + \beta_5 \times \text{X5 latitude} + \beta_6 \times \text{X6 longitude}$


In [33]:
# feature names
features = ['X1 transaction date','X2 house age','X3 distance to the nearest MRT station', 'X4 number of convenience stores','X5 latitude','X6 longitude']

# feature DataFrame
X = data[features]

X.head()

Unnamed: 0,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude
0,2012.917,32.0,84.87882,10,24.98298,121.54024
1,2012.917,19.5,306.5947,9,24.98034,121.53951
2,2013.583,13.3,561.9845,5,24.98746,121.54391
3,2013.5,13.3,561.9845,5,24.98746,121.54391
4,2012.833,5.0,390.5684,5,24.97937,121.54245


In [57]:
X.shape

(414, 6)

In [35]:
# response variable
y = data['Y house price of unit area']

y.head()

0    37.9
1    42.2
2    47.3
3    54.8
4    43.1
Name: Y house price of unit area, dtype: float64

In [36]:
# types of X and y
print(type(X))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


### Train/test split

In [37]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [38]:
# sizes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(310, 6)
(310,)
(104, 6)
(104,)
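
The 310/104 split follows from the default test fraction of 0.25 that `train_test_split` uses when no size is given. The sketch below reproduces that arithmetic; `split_sizes` is a hypothetical helper written for illustration, and it assumes the test count is rounded up with the remainder going to training, which agrees with the shapes printed above.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining rows
    go to the training set.
    """
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test

# 414 rows with the default 25% test fraction
print(split_sizes(414))

# the same 414 rows with a 20% test fraction
print(split_sizes(414, test_size=0.2))
```

Because 414 is not divisible by 4, the test set gets the extra row.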


In [39]:
print(X.head())
print(X_train.head())

   X1 transaction date  X2 house age  X3 distance to the nearest MRT station  \
0             2012.917          32.0                                84.87882   
1             2012.917          19.5                               306.59470   
2             2013.583          13.3                               561.98450   
3             2013.500          13.3                               561.98450   
4             2012.833           5.0                               390.56840   

   X4 number of convenience stores  X5 latitude  X6 longitude  
0                               10     24.98298     121.54024  
1                                9     24.98034     121.53951  
2                                5     24.98746     121.54391  
3                                5     24.98746     121.54391  
4                                5     24.97937     121.54245  
     X1 transaction date  X2 house age  \
368             2013.417          18.2   
218             2013.417          13.6   
127      

In [40]:
# changing the test set size
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

In [41]:
# sizes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(331, 6)
(331,)
(83, 6)
(83,)


In [42]:
# without shuffling the data (random_state has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, shuffle=False)

In [43]:
print(X.head())
print(X_train.head())

   X1 transaction date  X2 house age  X3 distance to the nearest MRT station  \
0             2012.917          32.0                                84.87882   
1             2012.917          19.5                               306.59470   
2             2013.583          13.3                               561.98450   
3             2013.500          13.3                               561.98450   
4             2012.833           5.0                               390.56840   

   X4 number of convenience stores  X5 latitude  X6 longitude  
0                               10     24.98298     121.54024  
1                                9     24.98034     121.53951  
2                                5     24.98746     121.54391  
3                                5     24.98746     121.54391  
4                                5     24.97937     121.54245  
   X1 transaction date  X2 house age  X3 distance to the nearest MRT station  \
0             2012.917          32.0                   

In [44]:
# sizes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(310, 6)
(310,)
(104, 6)
(104,)
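
To see what `shuffle=False` does in isolation, here is a minimal sketch on toy data (the arrays `X_toy` and `y_toy` are made up for illustration and are not the housing data). Each value equals its row position, so the printed targets show directly which rows ended up in each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 rows whose values equal their row position
X_toy = np.arange(8).reshape(-1, 1)
y_toy = np.arange(8)

# With shuffle=False the split is a plain cut: leading rows train, trailing rows test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, shuffle=False)

print(y_tr)  # training targets, in original order
print(y_te)  # test targets, the last rows
```

An unshuffled split like this is usually only appropriate when row order matters, for example with time-ordered data; otherwise any ordering in the file leaks into the split.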


In [45]:
# going back to the shuffled split with a 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

In [46]:
from sklearn.linear_model import LinearRegression

# create the model object
linreg = LinearRegression()

# fit the model parameters using the training data
linreg.fit(X_train, y_train)

In [47]:
# print the intercept and the coefficients
print(linreg.intercept_)
print(linreg.coef_)

-12796.117684899287
[ 5.71714218e+00 -2.49326467e-01 -4.93769843e-03  1.07614509e+00
  2.27037100e+02 -3.56988335e+01]


In [48]:
# coefficients paired with their feature names
list(zip(features, linreg.coef_))

[('X1 transaction date', 5.7171421836102745),
 ('X2 house age', -0.24932646689585264),
 ('X3 distance to the nearest MRT station', -0.004937698432022964),
 ('X4 number of convenience stores', 1.0761450934453074),
 ('X5 latitude', 227.03710037116494),
 ('X6 longitude', -35.69883346505084)]
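
One way to read these coefficients: each one is the change in the predicted price per unit change in its feature, holding the others fixed. The sketch below applies that to three coefficients copied from the output above; `predicted_change` is a hypothetical helper and the scenarios are chosen only for illustration.

```python
# Coefficients copied from the fitted model's output above
coefs = {
    'X2 house age': -0.24932646689585264,
    'X3 distance to the nearest MRT station': -0.004937698432022964,
    'X4 number of convenience stores': 1.0761450934453074,
}

def predicted_change(feature, delta):
    """Change in the predicted price when `feature` moves by `delta`, all else fixed."""
    return coefs[feature] * delta

# Hypothetical scenarios: 1000 more distance units, 1 more store, 10 more years of age
print(predicted_change('X3 distance to the nearest MRT station', 1000))
print(predicted_change('X4 number of convenience stores', 1))
print(predicted_change('X2 house age', 10))
```

Raw coefficient magnitudes are not directly comparable across features because the features are on very different scales, which is likely why latitude and longitude show such large values.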

### Predictions on the test data

In [49]:
y_pred = linreg.predict(X_test)
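
For a linear model, `predict` amounts to the intercept plus the dot product of each row with the coefficient vector, as in the regression equation at the top. The following sketch checks this on synthetic data (the arrays and the "true" coefficients are invented for illustration, not taken from the housing data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known structure (illustrative values only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 2.0 + X_toy @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_toy, y_toy)

# Manual prediction: beta_0 + X @ beta
manual_pred = model.intercept_ + X_toy @ model.coef_

# Should agree with the library's predict up to floating-point error
print(np.allclose(manual_pred, model.predict(X_toy)))
```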

### Evaluating the model

**Mean absolute error**: 

$$\text{MAE} = \frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean squared error**: 
$$\text{MSE} = \frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root mean squared error**: 
$$\text{RMSE} = \sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
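
The three formulas translate almost literally into NumPy. This is a small worked example on four made-up values (`y_true` and `y_hat` are illustrative, not model output), meant only to make the definitions concrete.

```python
import numpy as np

# Made-up observations and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat          # residuals y_i - y_hat_i

mae = np.mean(np.abs(errors))    # mean of absolute residuals
mse = np.mean(errors ** 2)       # mean of squared residuals
rmse = np.sqrt(mse)              # square root of the MSE

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE: ", rmse)
```

MAE and RMSE are in the same units as the response, while MSE is in squared units; RMSE penalizes large errors more heavily than MAE.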

In [50]:
from sklearn import metrics

In [51]:
import numpy as np

# mean absolute error
MAE = metrics.mean_absolute_error(y_test, y_pred)

# mean squared error
MSE = metrics.mean_squared_error(y_test, y_pred)

# root mean squared error
RMSE = np.sqrt(MSE)

In [52]:
print("MAE: ", MAE)
print("MSE: ", MSE)
print("RMSE: ", RMSE)

MAE:  5.343030944663339
MSE:  45.01050719519749
RMSE:  6.708987046879543


### Cross-validation

In [53]:
from sklearn.model_selection import cross_val_score

# use MSE (mean squared error); scikit-learn reports it negated, so flip the sign
scores = cross_val_score(linreg, X, y, cv=5, scoring='neg_mean_squared_error')
mse_scores = -scores
print(mse_scores)

[ 49.89813853  89.0294996   57.865991   134.82397694  60.0535528 ]


In [54]:
# compute RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)


[ 7.06386145  9.43554448  7.6069699  11.61137274  7.74942274]


In [55]:
# average RMSE across all folds
print(rmse_scores.mean())

8.693434260346503
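
To make explicit what `cross_val_score` does with `cv=5` for a regressor, here is a sketch that runs the folds by hand on synthetic data and compares the result. It assumes an integer `cv` corresponds to `KFold` without shuffling for regressors; the data is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative values only)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = 1.0 + X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Manual 5-fold loop: fit on four folds, score on the held-out fold
fold_mse = []
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    residuals = y_toy[test_idx] - model.predict(X_toy[test_idx])
    fold_mse.append(np.mean(residuals ** 2))
fold_mse = np.array(fold_mse)

# Library version; the scorer returns negated MSE, so flip the sign
auto_mse = -cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                            scoring='neg_mean_squared_error')

print(np.allclose(fold_mse, auto_mse))
```

Since the folds are consecutive blocks of rows, any ordering in the data can make some folds noticeably harder than others, which may partly explain the spread in the per-fold scores above.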


## Now using statsmodels

In [56]:
import statsmodels.api as sm

features = ['X1 transaction date','X2 house age','X3 distance to the nearest MRT station', 'X4 number of convenience stores','X5 latitude','X6 longitude']

X = data[features]
y = data['Y house price of unit area']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# add the constant (intercept) term explicitly
X_train = sm.add_constant(X_train)

# regression using ordinary least squares (OLS)
model = sm.OLS(y_train, X_train).fit()

# summary of results
print(model.summary())

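
The constant has to be added by hand because an OLS fit on the raw feature matrix has no intercept term. The sketch below illustrates the idea with plain NumPy rather than statsmodels: it prepends a column of ones and solves the least-squares problem. The data and names are invented, and this is not statsmodels' implementation.

```python
import numpy as np

# Synthetic data with a true intercept of 4.0 (illustrative values only)
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(60, 2))
y_toy = 4.0 + X_toy @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=60)

# Prepend a column of ones so the first coefficient acts as the intercept
X_const = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(X_const, y_toy, rcond=None)

# Without the constant column the fit is forced through the origin
beta_no_const, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(beta)           # intercept followed by the two slopes
print(beta_no_const)  # slopes only, no intercept available
```

To predict on `X_test` with the fitted statsmodels model, the same constant column would need to be added to `X_test` as well.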