We will try a few models on our own (without h2o) to test whether any of them fits what we are looking for. We will evaluate each model both on the full dataset ('data') and on the dataset that keeps only the columns whose relative importance exceeds an F score of 65 ('umbral_65').

In [4]:
import pandas as pd
import numpy as np

from scipy.stats import linregress

import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn.model_selection import train_test_split as tts

**We generate the subsets.**

In [5]:
data = pd.read_csv('data/clean_train.csv')

In [6]:
X = data.drop('price', axis=1)
y = data.price

In [7]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, train_size=0.8, random_state=42)

In [80]:
umbral_65 = pd.read_csv('data/train_umbralF_65.csv')

In [81]:
X2 = umbral_65.drop('price', axis=1)
y2 = umbral_65.price

X2_train, X2_test, y2_train, y2_test = tts(X2, y2, test_size=0.2, train_size=0.8, random_state=42)

## Linear Regression

In [8]:
# model

from sklearn.linear_model import LinearRegression as LinReg

linreg=LinReg()

linreg.fit(X_train, y_train)

LinearRegression()

In [9]:
y_pred=linreg.predict(X_test)

In [10]:
train_score=linreg.score(X_train, y_train)  # R2
test_score=linreg.score(X_test, y_test)


print('Train:', train_score)
print('Test:', test_score)

Train: 0.14690477174563454
Test: 0.3418211850621623


In [82]:
linreg.fit(X2_train, y2_train)

y_pred=linreg.predict(X2_test)

train_score=linreg.score(X2_train, y2_train)  # R2
test_score=linreg.score(X2_test, y2_test)


print('Train:', train_score)
print('Test:', test_score)

Train: 0.33855722058492266
Test: 0.29546036477560556


## Lasso

In [11]:
from sklearn.linear_model import Lasso

In [12]:
# Lasso L1

lasso=Lasso()
lasso.fit(X_train, y_train)

train_score=lasso.score(X_train, y_train)  
test_score=lasso.score(X_test, y_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: 0.13976915847732774
Test: 0.3335011589833955


In [83]:
lasso=Lasso()
lasso.fit(X2_train, y2_train)

train_score=lasso.score(X2_train, y2_train)  
test_score=lasso.score(X2_test, y2_test)


print('Train:', train_score)
print('Test:', test_score)

Train: 0.3364318508847277
Test: 0.2965886500361614


## Ridge

In [13]:
from sklearn.linear_model import Ridge

In [14]:
# Ridge L2

ridge=Ridge()
ridge.fit(X_train, y_train)

train_score=ridge.score(X_train, y_train)  
test_score=ridge.score(X_test, y_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: 0.14633150863746003
Test: 0.33976228024304


In [84]:
ridge=Ridge()
ridge.fit(X2_train, y2_train)

train_score=ridge.score(X2_train, y2_train)  
test_score=ridge.score(X2_test, y2_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: 0.3384417944358885
Test: 0.29515780572744543


## ElasticNet

In [15]:
from sklearn.linear_model import ElasticNet

In [16]:
# ElasticNet  L1+L2

elastic=ElasticNet()
elastic.fit(X_train, y_train)

train_score=elastic.score(X_train, y_train)  
test_score=elastic.score(X_test, y_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: 0.11845579662153494
Test: 0.3112274672804425


In [85]:
elastic=ElasticNet()
elastic.fit(X2_train, y2_train)

train_score=elastic.score(X2_train, y2_train)  
test_score=elastic.score(X2_test, y2_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: 0.3188780008765909
Test: 0.28622941972188076


## Support Vector Machine

In [17]:
from sklearn.svm import SVR

In [18]:
svr=SVR(kernel='poly', degree=10)
svr.fit(X_train, y_train)

y_pred=svr.predict(X_test)

train_score=svr.score(X_train, y_train)  
test_score=svr.score(X_test, y_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: -0.01764617275274416
Test: -0.03488028344823624
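
For reference, the value returned by `.score()` on scikit-learn regressors is the coefficient of determination. The definition below shows why it can be negative, as it is here: whenever the model's squared error exceeds that of always predicting the mean of the evaluated targets, the ratio is greater than 1.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

Here $y_i$ are the true prices in the evaluated set, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the true prices in that set.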


In [86]:
svr=SVR(kernel='poly', degree=10)
svr.fit(X2_train, y2_train)

y_pred=svr.predict(X2_test)

train_score=svr.score(X2_train, y2_train)  
test_score=svr.score(X2_test, y2_test)


print('Train:', train_score)
print('Test:', test_score) 

Train: -0.01628587848288965
Test: -0.006734442212936731


## Random Forest

In [19]:
from sklearn.ensemble import RandomForestRegressor as RFR

In [20]:
rfr=RFR(n_estimators=132)
rfr.fit(X_train, y_train)

y_pred=rfr.predict(X_test)

train_score=rfr.score(X_train, y_train)  
test_score=rfr.score(X_test, y_test)


print('Train:', train_score)
print('Test:', test_score)

Train: 0.8703781261305832
Test: -0.9707038794349756


In [21]:
def bosque(n):
    rfr=RFR(n_estimators=n)
    rfr.fit(X_train, y_train)

    y_pred=rfr.predict(X_test)

    train_score=rfr.score(X_train, y_train)  
    test_score=rfr.score(X_test, y_test)


    print('Train:', train_score)
    print('Test:', test_score) 

In [22]:
for e in [2, 5, 50, 100, 200]:
    print(e, bosque(e))

Train: 0.6636915497065528
Test: -2.1896293572990833
2 None
Train: 0.5363585363369836
Test: -6.8728196685399405
5 None
Train: 0.8578732231059017
Test: -1.7397042861258485
50 None
Train: 0.8724723949345651
Test: -1.1523357065856268
100 None
Train: 0.8620090139088246
Test: -0.9682220647614672
200 None


In [92]:
def bosque2(n):
    rfr=RFR(n_estimators=n)
    rfr.fit(X2_train, y2_train)

    y_pred=rfr.predict(X2_test)

    train_score=rfr.score(X2_train, y2_train)  
    test_score=rfr.score(X2_test, y2_test)


    print('Train:', train_score)
    print('Test:', test_score)

In [94]:
for e in [50, 100, 200, 300, 500, 1000]:
    print(e, bosque2(e))

Train: 0.9079704299124288
Test: 0.33444458539560806
50 None
Train: 0.9142624208183321
Test: 0.3288773768894825
100 None
Train: 0.9186645357609028
Test: 0.33705809184692037
200 None
Train: 0.9137990523159553
Test: 0.33420298140542515
300 None
Train: 0.9172942272070344
Test: 0.3291629684427323
500 None
Train: 0.9171521129395384
Test: 0.3328403578200253
1000 None


## Modeling functions.

In [28]:
def regre(modelo):
    
    modelo.fit(X_train, y_train)
    
    train_score=modelo.score(X_train, y_train)  # R2
    test_score=modelo.score(X_test, y_test)
    
    print(modelo)
    print('Train R2:', train_score)
    print('Test R2:', test_score)
    
    return modelo

In [95]:
def regre2(modelo):
    
    modelo.fit(X2_train, y2_train)
    
    train_score=modelo.score(X2_train, y2_train)  # R2
    test_score=modelo.score(X2_test, y2_test)
    
    print(modelo)
    print('Train R2:', train_score)
    print('Test R2:', test_score)
    
    return modelo
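
The two helpers above differ only in which split they read from the global scope. As a minimal sketch (not part of the original notebook), a single function could take the split as arguments and return the scores instead of only printing them. The name `fit_and_score` and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit `model`, return (model, train R2, test R2)."""
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    return model, train_r2, test_r2


# Synthetic stand-in data (an assumption; the notebook reads clean_train.csv).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
# train_test_split returns X_train, X_test, y_train, y_test in that order.
splits = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for model in (LinearRegression(), Ridge()):
    _, train_r2, test_r2 = fit_and_score(model, *splits)
    results[type(model).__name__] = (train_r2, test_r2)
    print(type(model).__name__, 'Train R2:', train_r2, 'Test R2:', test_r2)
```

With this shape, the same call works for either split (for example the `X2_*` variables) without duplicating the function.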

## SGDR

In [29]:
from sklearn.linear_model import SGDRegressor as SGDR

sgdr=SGDR(max_iter=50000)

sgdr=regre(sgdr)

SGDRegressor(max_iter=50000)
Train R2: -9.931602069607105e+25
Test R2: -2.7663650037630836e+26


In [96]:
sgdr=regre2(sgdr)

SGDRegressor(max_iter=50000)
Train R2: -3.3574374541784975e+25
Test R2: -2.3537322748308385e+25
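
R2 values on the order of -1e25 usually mean the optimizer diverged. Stochastic gradient descent is sensitive to feature scale, so one plausible cause (an assumption, since the notebook does not show the feature ranges) is that the columns are not standardized. The sketch below uses synthetic data with deliberately mismatched scales and puts a `StandardScaler` in front of `SGDRegressor` in a pipeline.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features on very different scales (illustrative assumption).
X_demo = np.column_stack([
    rng.normal(0, 1, 500),      # unit-scale feature
    rng.normal(0, 1000, 500),   # feature with a much larger scale
])
y_demo = 3.0 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(0, 0.5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied to both splits.
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=50000, random_state=42),
)
scaled_sgd.fit(X_tr, y_tr)

scaled_test_r2 = scaled_sgd.score(X_te, y_te)
print('Test R2 (scaled):', scaled_test_r2)
```

Because the pipeline exposes `fit` and `score`, it could be passed to `regre` or `regre2` like any other estimator.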


## KNNR

In [46]:
from sklearn.neighbors import KNeighborsRegressor as KNNR

knnr=KNNR(n_neighbors=112, weights='distance')

knnr=regre(knnr)

KNeighborsRegressor(n_neighbors=112, weights='distance')
Train R2: 0.9999064128569031
Test R2: 0.13989448750144917


In [106]:
from sklearn.neighbors import KNeighborsRegressor as KNNR

knnr=KNNR(n_neighbors=50, weights='uniform')

knnr=regre(knnr)

KNeighborsRegressor(n_neighbors=50)
Train R2: 0.0915800660970838
Test R2: 0.11298430845905938


In [107]:
knnr=regre2(knnr)

KNeighborsRegressor(n_neighbors=50)
Train R2: 0.1741759550516515
Test R2: 0.13434733324960257


### Train on the full dataset.

We select the KNNR with the hyperparameters that gave us the best result.

We generate the predictions and export them to keep as a baseline.

However, the fit is too low (on both the full dataset and the filtered one), so we need to keep working to improve the model. We decide to go back to h2o, this time on the filtered dataset (with the dimension-importance threshold).
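
A note on the selection: with `weights='distance'`, each training point is its own zero-distance neighbor, so a train R2 close to 1 is expected and says little about generalization. A hedged alternative to comparing single train/test splits is a cross-validated search. The sketch below uses synthetic data and an illustrative grid, both assumptions rather than the notebook's actual setup.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption; the notebook uses clean_train.csv).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=15.0, random_state=42
)

# Illustrative grid; the values are not taken from the notebook.
param_grid = {
    'n_neighbors': [5, 20, 50],
    'weights': ['uniform', 'distance'],
}

# 5-fold cross-validation, scoring each candidate by mean held-out R2.
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Mean CV R2:', search.best_score_)
```

`search.best_estimator_` is refitted on all the data passed to `fit`, which matches the step of training on the full dataset before predicting.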



In [67]:
test = pd.read_csv('data/clean_test.csv')

In [65]:
knnr=KNNR(n_neighbors=112, weights='distance') 

In [69]:
knnr.fit(X, y)

KNeighborsRegressor(n_neighbors=112, weights='distance')

In [71]:
preds = knnr.predict(test)

In [74]:
sample = pd.read_csv('data/sample.csv')

In [76]:
sample.price = preds

In [79]:
sample.to_csv('data/submit_2.csv', index=False) # Export the predictions.