**REGRESSION FROM NUMERICAL AND CATEGORICAL DATA**

In this notebook we fit a regression model to estimate the price of a rental apartment.
The prediction is based on the numerical and categorical features of the Airbnb dataset that we have been using throughout the practical exercises of this Bootcamp.

First we download the file from the internet and copy it to a local My Drive directory that holds the whole environment for this exercise.
We also mount My Drive in Google Colab so that the two are linked.

These steps only need to be run the first time. Once the files are in My Drive they can be skipped, and the data can be loaded directly from that directory.

In [None]:
# download the dataset from OpenDataSoft
!wget -O "airbnb-listings.csv" "https://public.opendatasoft.com/explore/dataset/airbnb-listings/download/?format=csv&disjunctive.host_verifications=true&disjunctive.amenities=true&disjunctive.features=true&refine.country=Spain&q=Madrid&timezone=Europe/London&use_labels_for_header=true&csv_separator=%3B"

!ls -lah

--2020-06-24 15:01:28--  https://public.opendatasoft.com/explore/dataset/airbnb-listings/download/?format=csv&disjunctive.host_verifications=true&disjunctive.amenities=true&disjunctive.features=true&refine.country=Spain&q=Madrid&timezone=Europe/London&use_labels_for_header=true&csv_separator=%3B
Resolving public.opendatasoft.com (public.opendatasoft.com)... 34.248.20.69, 34.249.199.226
Connecting to public.opendatasoft.com (public.opendatasoft.com)|34.248.20.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/csv]
Saving to: ‘airbnb-listings.csv’

airbnb-listings.csv     [         <=>        ]  54.19M  2.66MB/s    in 37s     

2020-06-24 15:02:15 (1.46 MB/s) - ‘airbnb-listings.csv’ saved [56826824]

total 55M
drwxr-xr-x 1 root root 4.0K Jun 24 15:01 .
drwxr-xr-x 1 root root 4.0K Jun 24 14:57 ..
-rw-r--r-- 1 root root  55M Jun 24 15:02 airbnb-listings.csv
drwxr-xr-x 1 root root 4.0K Jun 19 16:15 .config
drwx------ 4 root root 4.0K Jun 24

In [None]:
!ls -lah

total 55M
drwxr-xr-x 1 root root 4.0K Jun 24 15:01 .
drwxr-xr-x 1 root root 4.0K Jun 24 14:57 ..
-rw-r--r-- 1 root root  55M Jun 24 15:02 airbnb-listings.csv
drwxr-xr-x 1 root root 4.0K Jun 19 16:15 .config
drwx------ 4 root root 4.0K Jun 24 15:00 drive
drwxr-xr-x 1 root root 4.0K Jun 17 16:18 sample_data


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
!cp airbnb-listings.csv "drive/My Drive/BootcampBD&ML/práctica/prácticaDeepLearning"

This download, mount and copy part only needs to be run the first time. Once the file is stored in My Drive, we only need to load it from there.

Our regression exercise starts here.

As good practice, and to avoid inadvertent mistakes, we first split the original dataset into train, validation and test sets.

We work only with the train set in order to choose a model. That choice is checked against the validation set, and finally the trained model is applied to the test set.
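
As a minimal sketch of that two-step split, assuming a synthetic table with 14,001 rows (the total implied by the split shapes printed in this notebook) and the same `test_size=0.2` and `random_state=0` settings the notebook uses: the first call holds out 20% of the rows for test, and the second holds out 20% of the remainder for validation, which gives roughly a 64/16/20 split. The `demo_*` names and the two columns are placeholders, not part of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings table (hypothetical columns and values).
demo_df = pd.DataFrame({
    "feature": np.arange(14001),
    "Price": np.linspace(10, 400, 14001),
})

# Step 1: hold out 20% of all rows as the final test set.
demo_full_train, demo_test = train_test_split(
    demo_df, test_size=0.2, shuffle=True, random_state=0)

# Step 2: hold out 20% of the remaining rows for validation.
demo_train, demo_val = train_test_split(
    demo_full_train, test_size=0.2, shuffle=True, random_state=0)

print(len(demo_train), len(demo_val), len(demo_test))
```

Because the test rows are removed before anything else happens, no statistic computed later on train or validation can depend on them.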

In [None]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error



full_df = pd.read_csv('drive/My Drive/BootcampBD&ML/práctica/prácticaDeepLearning/airbnb-listings.csv', sep=';', decimal='.')
full_train, test = train_test_split(full_df, test_size=0.2, shuffle=True, random_state=0)
train, val = train_test_split(full_train, test_size=0.2, shuffle=True, random_state=0)

print(f'Training set shape: {train.shape}')
print(f'Validation set shape: {val.shape}')
print(f'Test set shape: {test.shape}')

# Save the three splits
train.to_csv('./train.csv', sep=';', decimal='.', index=True)
val.to_csv('./val.csv', sep=';', decimal='.', index=True)
test.to_csv('./test.csv', sep=';', decimal='.', index=True)


Training set shape: (8960, 89)
Validation set shape: (2240, 89)
Test set shape: (2801, 89)


The preprocessing is similar to the one used in the classification exercise, except that here we do not bin the Price variable into categories.

In [None]:
# From this point on we load the saved splits; every statistic is learned ONLY from train.
df_train = pd.read_csv('./train.csv', sep=';', decimal='.')
df_val = pd.read_csv('./val.csv', sep=';', decimal='.')
df_test = pd.read_csv('./test.csv', sep=';', decimal='.')

def preprocesado(train, val, test):
  # Keep only the rows that belong to the city of Madrid
  indexNames = train[ train['City'] != 'Madrid' ].index
  train.drop(indexNames , inplace=True)
  train.drop(['City'], axis=1, inplace=True)

  indexNames = val[ val['City'] != 'Madrid' ].index
  val.drop(indexNames , inplace=True)
  val.drop(['City'], axis=1, inplace=True)

  indexNames = test[ test['City'] != 'Madrid' ].index
  test.drop(indexNames , inplace=True)
  test.drop(['City'], axis=1, inplace=True)

  # Drop the columns that carry no useful information
  train.drop(['ID','Scrape ID','Last Scraped','Host ID','Calendar last Scraped','Listing Url','Thumbnail Url',
         'Medium Url','Picture Url','XL Picture Url','Host URL','Host Thumbnail Url','Host Picture Url',
        'Name','Summary','Space','Description','Neighborhood Overview','Notes','Transit','Access',
         'Interaction','House Rules','Host Name','Host About','Street','Host Location','State','Market',
         'Smart Location','Country Code','Country','Geolocation','Weekly Price','Monthly Price',
         'Host Acceptance Rate','Experiences Offered','Has Availability','License','Jurisdiction Names','Square Feet'], 
        axis=1, inplace=True)
  val.drop(['ID','Scrape ID','Last Scraped','Host ID','Calendar last Scraped','Listing Url','Thumbnail Url',
         'Medium Url','Picture Url','XL Picture Url','Host URL','Host Thumbnail Url','Host Picture Url',
        'Name','Summary','Space','Description','Neighborhood Overview','Notes','Transit','Access',
         'Interaction','House Rules','Host Name','Host About','Street','Host Location','State','Market',
         'Smart Location','Country Code','Country','Geolocation','Weekly Price','Monthly Price',
         'Host Acceptance Rate','Experiences Offered','Has Availability','License','Jurisdiction Names','Square Feet'], 
        axis=1, inplace=True)
  test.drop(['ID','Scrape ID','Last Scraped','Host ID','Calendar last Scraped','Listing Url','Thumbnail Url',
         'Medium Url','Picture Url','XL Picture Url','Host URL','Host Thumbnail Url','Host Picture Url',
        'Name','Summary','Space','Description','Neighborhood Overview','Notes','Transit','Access',
         'Interaction','House Rules','Host Name','Host About','Street','Host Location','State','Market',
         'Smart Location','Country Code','Country','Geolocation','Weekly Price','Monthly Price',
         'Host Acceptance Rate','Experiences Offered','Has Availability','License','Jurisdiction Names','Square Feet'], 
        axis=1, inplace=True)
  
  # New feature --> not used in DL (it comes from the ML exercise) because we saw it lowers accuracy
  #train['Bed_Bath_Rooms'] = train['Bedrooms']*train['Bathrooms']
  #val['Bed_Bath_Rooms'] = val['Bedrooms']*val['Bathrooms']
  #test['Bed_Bath_Rooms'] = test['Bedrooms']*test['Bathrooms']
  
  # PRICE
  # Impute missing values with the train mean
  MeanPriceTrain = train['Price'].mean()
  train['Price'].fillna(MeanPriceTrain, inplace=True)
  val['Price'].fillna(MeanPriceTrain, inplace=True)
  test['Price'].fillna(MeanPriceTrain, inplace=True)
  # Treat prices above 400€ as outliers
  Price_filter = train['Price'] <= 400
  filtered_train = train[Price_filter]
  Price_filter = val['Price'] <= 400
  filtered_val = val[Price_filter]
  Price_filter = test['Price'] <= 400
  filtered_test = test[Price_filter]
  # Log-transform Price so that its distribution is closer to Gaussian
  filtered_train['Price'] = filtered_train['Price'].apply(lambda x: np.log10(x))
  filtered_val['Price'] = filtered_val['Price'].apply(lambda x: np.log10(x))
  filtered_test['Price'] = filtered_test['Price'].apply(lambda x: np.log10(x))
    
  
  # DATES: convert each date to the number of years elapsed up to 2017
  filtered_train['Host Since'] = pd.to_datetime(filtered_train['Host Since'], format="%Y-%m-%d")
  filtered_train['First Review'] = pd.to_datetime(filtered_train['First Review'], format="%Y-%m-%d")
  filtered_train['Last Review'] = pd.to_datetime(filtered_train['Last Review'], format="%Y-%m-%d")
  filtered_train['Host Since'] = filtered_train['Host Since'].apply(lambda x: 2017 - x.year)
  filtered_train['First Review'] = filtered_train['First Review'].apply(lambda x: 2017 - x.year)
  filtered_train['Last Review'] = filtered_train['Last Review'].apply(lambda x: 2017 - x.year)

  filtered_val['Host Since'] = pd.to_datetime(filtered_val['Host Since'], format="%Y-%m-%d")
  filtered_val['First Review'] = pd.to_datetime(filtered_val['First Review'], format="%Y-%m-%d")
  filtered_val['Last Review'] = pd.to_datetime(filtered_val['Last Review'], format="%Y-%m-%d")
  filtered_val['Host Since'] = filtered_val['Host Since'].apply(lambda x: 2017 - x.year)
  filtered_val['First Review'] = filtered_val['First Review'].apply(lambda x: 2017 - x.year)
  filtered_val['Last Review'] = filtered_val['Last Review'].apply(lambda x: 2017 - x.year)

  filtered_test['Host Since'] = pd.to_datetime(filtered_test['Host Since'], format="%Y-%m-%d")
  filtered_test['First Review'] = pd.to_datetime(filtered_test['First Review'], format="%Y-%m-%d")
  filtered_test['Last Review'] = pd.to_datetime(filtered_test['Last Review'], format="%Y-%m-%d")
  filtered_test['Host Since'] = filtered_test['Host Since'].apply(lambda x: 2017 - x.year)
  filtered_test['First Review'] = filtered_test['First Review'].apply(lambda x: 2017 - x.year)
  filtered_test['Last Review'] = filtered_test['Last Review'].apply(lambda x: 2017 - x.year)

  # Impute missing values in the variables treated as categorical, using the mode.
  # Each train mode is stored in its own variable so the same value can be applied to val and test
  ModeHSTrain = filtered_train['Host Since'].mode()[0]
  ModeHLCTrain = filtered_train['Host Listings Count'].mode()[0]
  ModeHTLCTrain = filtered_train['Host Total Listings Count'].mode()[0]
  ModeBathroomsTrain = filtered_train['Bathrooms'].mode()[0]
  ModeBedroomsTrain = filtered_train['Bedrooms'].mode()[0]
  ModeBedsTrain = filtered_train['Beds'].mode()[0]

  filtered_train['Host Since'].fillna(ModeHSTrain, inplace=True)
  filtered_train['Host Listings Count'].fillna(ModeHLCTrain, inplace=True)
  filtered_train['Host Total Listings Count'].fillna(ModeHTLCTrain, inplace=True)
  filtered_train['Bathrooms'].fillna(ModeBathroomsTrain, inplace=True)
  filtered_train['Bedrooms'].fillna(ModeBedroomsTrain, inplace=True)
  filtered_train['Beds'].fillna(ModeBedsTrain, inplace=True)

  filtered_val['Host Since'].fillna(ModeHSTrain, inplace=True)
  filtered_val['Host Listings Count'].fillna(ModeHLCTrain, inplace=True)
  filtered_val['Host Total Listings Count'].fillna(ModeHTLCTrain, inplace=True)
  filtered_val['Bathrooms'].fillna(ModeBathroomsTrain, inplace=True)
  filtered_val['Bedrooms'].fillna(ModeBedroomsTrain, inplace=True)
  filtered_val['Beds'].fillna(ModeBedsTrain, inplace=True)
  filtered_test['Host Since'].fillna(ModeHSTrain, inplace=True)
  filtered_test['Host Listings Count'].fillna(ModeHLCTrain, inplace=True)
  filtered_test['Host Total Listings Count'].fillna(ModeHTLCTrain, inplace=True)
  filtered_test['Bathrooms'].fillna(ModeBathroomsTrain, inplace=True)
  filtered_test['Bedrooms'].fillna(ModeBedroomsTrain, inplace=True)
  filtered_test['Beds'].fillna(ModeBedsTrain, inplace=True)

  # Impute missing values in the continuous variables, using the mean.
  # Each train mean is stored in its own variable so the same value can be applied to val and test
  MeanRSRatingTrain = filtered_train['Review Scores Rating'].mean()
  MeanRSAccuracyTrain = filtered_train['Review Scores Accuracy'].mean()
  MeanRSCleanlinessTrain = filtered_train['Review Scores Cleanliness'].mean()
  MeanRSCheckinTrain = filtered_train['Review Scores Checkin'].mean()
  MeanRSCommunicationTrain = filtered_train['Review Scores Communication'].mean()
  MeanRSLocationTrain = filtered_train['Review Scores Location'].mean()
  MeanRSValueTrain = filtered_train['Review Scores Value'].mean()

  filtered_train['Review Scores Rating'].fillna(MeanRSRatingTrain, inplace=True)
  filtered_train['Review Scores Accuracy'].fillna(MeanRSAccuracyTrain, inplace=True)
  filtered_train['Review Scores Cleanliness'].fillna(MeanRSCleanlinessTrain, inplace=True)
  filtered_train['Review Scores Checkin'].fillna(MeanRSCheckinTrain, inplace=True)
  filtered_train['Review Scores Communication'].fillna(MeanRSCommunicationTrain, inplace=True)
  filtered_train['Review Scores Location'].fillna(MeanRSLocationTrain, inplace=True)
  filtered_train['Review Scores Value'].fillna(MeanRSValueTrain, inplace=True)
  filtered_val['Review Scores Rating'].fillna(MeanRSRatingTrain, inplace=True)
  filtered_val['Review Scores Accuracy'].fillna(MeanRSAccuracyTrain, inplace=True)
  filtered_val['Review Scores Cleanliness'].fillna(MeanRSCleanlinessTrain, inplace=True)
  filtered_val['Review Scores Checkin'].fillna(MeanRSCheckinTrain, inplace=True)
  filtered_val['Review Scores Communication'].fillna(MeanRSCommunicationTrain, inplace=True)
  filtered_val['Review Scores Location'].fillna(MeanRSLocationTrain, inplace=True)
  filtered_val['Review Scores Value'].fillna(MeanRSValueTrain, inplace=True)
  filtered_test['Review Scores Rating'].fillna(MeanRSRatingTrain, inplace=True)
  filtered_test['Review Scores Accuracy'].fillna(MeanRSAccuracyTrain, inplace=True)
  filtered_test['Review Scores Cleanliness'].fillna(MeanRSCleanlinessTrain, inplace=True)
  filtered_test['Review Scores Checkin'].fillna(MeanRSCheckinTrain, inplace=True)
  filtered_test['Review Scores Communication'].fillna(MeanRSCommunicationTrain, inplace=True)
  filtered_test['Review Scores Location'].fillna(MeanRSLocationTrain, inplace=True)
  filtered_test['Review Scores Value'].fillna(MeanRSValueTrain, inplace=True)

  # Missing values are treated as unknown
  filtered_train['Host Neighbourhood'].fillna('Unknown', inplace=True)
  filtered_train['Host Verifications'].fillna('Unknown', inplace=True)
  filtered_train['Neighbourhood'].fillna('Unknown', inplace=True)
  filtered_train['Zipcode'].fillna('Unknown', inplace=True)
  filtered_train['Amenities'].fillna('Unknown', inplace=True)
  filtered_train['First Review'].fillna('Unknown', inplace=True)
  filtered_train['Last Review'].fillna('Unknown', inplace=True)
  filtered_val['Host Neighbourhood'].fillna('Unknown', inplace=True)
  filtered_val['Host Verifications'].fillna('Unknown', inplace=True)
  filtered_val['Neighbourhood'].fillna('Unknown', inplace=True)
  filtered_val['Zipcode'].fillna('Unknown', inplace=True)
  filtered_val['Amenities'].fillna('Unknown', inplace=True)
  filtered_val['First Review'].fillna('Unknown', inplace=True)
  filtered_val['Last Review'].fillna('Unknown', inplace=True)
  filtered_test['Host Neighbourhood'].fillna('Unknown', inplace=True)
  filtered_test['Host Verifications'].fillna('Unknown', inplace=True)
  filtered_test['Neighbourhood'].fillna('Unknown', inplace=True)
  filtered_test['Zipcode'].fillna('Unknown', inplace=True)
  filtered_test['Amenities'].fillna('Unknown', inplace=True)
  filtered_test['First Review'].fillna('Unknown', inplace=True)
  filtered_test['Last Review'].fillna('Unknown', inplace=True)

  # Here we assume a missing value means the thing does not exist, i.e. there is no response or the fee is 0€
  filtered_train['Host Response Time'].fillna('No response', inplace=True)
  filtered_train['Host Response Rate'].fillna(0, inplace=True)
  filtered_train['Security Deposit'].fillna(0, inplace=True)
  filtered_train['Cleaning Fee'].fillna(0, inplace=True)
  filtered_train['Reviews per Month'].fillna(0, inplace=True)
  filtered_val['Host Response Time'].fillna('No response', inplace=True)
  filtered_val['Host Response Rate'].fillna(0, inplace=True)
  filtered_val['Security Deposit'].fillna(0, inplace=True)
  filtered_val['Cleaning Fee'].fillna(0, inplace=True)
  filtered_val['Reviews per Month'].fillna(0, inplace=True)
  filtered_test['Host Response Time'].fillna('No response', inplace=True)
  filtered_test['Host Response Rate'].fillna(0, inplace=True)
  filtered_test['Security Deposit'].fillna(0, inplace=True)
  filtered_test['Cleaning Fee'].fillna(0, inplace=True)
  filtered_test['Reviews per Month'].fillna(0, inplace=True)

  # Transformations that count comma-separated items. This is very simple; improving it with NLP techniques is left for future work
  filtered_train['Amenities'] = filtered_train['Amenities'].apply(lambda x: len(str(x).split(',')))
  filtered_train['Host Verifications'] = filtered_train['Host Verifications'].apply(lambda x: len(str(x).split(',')))
  filtered_train['Features'] = filtered_train['Features'].apply(lambda x: len(str(x).split(',')))
  filtered_val['Amenities'] = filtered_val['Amenities'].apply(lambda x: len(str(x).split(',')))
  filtered_val['Host Verifications'] = filtered_val['Host Verifications'].apply(lambda x: len(str(x).split(',')))
  filtered_val['Features'] = filtered_val['Features'].apply(lambda x: len(str(x).split(',')))
  filtered_test['Amenities'] = filtered_test['Amenities'].apply(lambda x: len(str(x).split(',')))
  filtered_test['Host Verifications'] = filtered_test['Host Verifications'].apply(lambda x: len(str(x).split(',')))
  filtered_test['Features'] = filtered_test['Features'].apply(lambda x: len(str(x).split(',')))

  # Mean encoder: replace each category with the train mean of Price for that category
  categorical = ['Host Response Time', 'Host Neighbourhood', 'Neighbourhood','Neighbourhood Cleansed',
               'Neighbourhood Group Cleansed','Zipcode','Property Type','Room Type','Bed Type',
               'Calendar Updated','First Review','Last Review','Cancellation Policy']
  # On train we build a dict so it can be reused for val and test
  mean_map = {}
  for c in categorical:
      mean = filtered_train.groupby(c)['Price'].mean()
      filtered_train[c] = filtered_train[c].map(mean)    
      mean_map[c] = mean
  for c in categorical:
    filtered_val[c] = filtered_val[c].map(mean_map[c])
  for c in categorical:
    filtered_test[c] = filtered_test[c].map(mean_map[c])
  # Categories unseen in train become NaN in val and test; fill them with the mode of the encoded train column
  for c in categorical:
    filtered_val[c].fillna(filtered_train[c].mode()[0], inplace=True)
  for c in categorical:
    filtered_test[c].fillna(filtered_train[c].mode()[0], inplace=True)

  # Extract the target variable
  Ytrain = filtered_train['Price']
  Yval = filtered_val['Price']
  Ytest = filtered_test['Price']
  # Drop the Price column from the features
  filtered_train.drop(['Price'],axis=1, inplace=True)
  filtered_val.drop(['Price'],axis=1, inplace=True)
  filtered_test.drop(['Price'],axis=1, inplace=True)

  # Scale the input features (the scaler is fit on train only)
  cs = MinMaxScaler()
  Xtrain_Scaled = cs.fit_transform(filtered_train)
  Xval_Scaled = cs.transform(filtered_val)
  Xtest_Scaled = cs.transform(filtered_test)



  return (Xtrain_Scaled, Xval_Scaled, Xtest_Scaled, Ytrain, Yval, Ytest)
      
  





We use the function defined above to obtain our datasets and the target variable.

In [None]:
(Xtrain, Xval, Xtest, ytrain, yval, ytest) = preprocesado(df_train, df_val, df_test)
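
One step inside `preprocesado` that is easy to get wrong is the mean encoding of categorical columns. The following is a small sketch of the same idea on made-up data (the `toy_*` frames and their values are hypothetical): the category-to-mean map is learned on train only, reused on validation, and categories never seen in train fall back to the mode of the encoded train column.

```python
import pandas as pd

# Toy data; the column names mirror the dataset but the values are made up.
toy_train = pd.DataFrame({
    "Room Type": ["Entire home/apt", "Private room", "Private room",
                  "Entire home/apt", "Private room"],
    "Price": [2.0, 1.5, 1.5, 1.8, 1.2],  # already on the log10 scale
})
toy_val = pd.DataFrame({"Room Type": ["Private room", "Shared room"]})

# Learn the category -> mean(Price) map on train only.
room_means = toy_train.groupby("Room Type")["Price"].mean()

# Apply the same map to train and to validation.
toy_train["Room Type"] = toy_train["Room Type"].map(room_means)
toy_val["Room Type"] = toy_val["Room Type"].map(room_means)

# "Shared room" never appears in train, so it maps to NaN; fill it with
# the most frequent encoded value in train.
fallback = toy_train["Room Type"].mode()[0]
toy_val["Room Type"] = toy_val["Room Type"].fillna(fallback)

print(toy_val)
```

Learning the map from train alone keeps the validation and test prices out of the features, which is the same reason the imputation values and the scaler are fit on train.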

Now we define the models we are going to work with and compare. They are very similar to those used in the classification notebook. The main difference is in the last layer: since this is a regression, it has a single neuron and no nonlinear activation (it is linear).

In [None]:
# import the necessary packages
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import BatchNormalization, Conv2D, MaxPooling2D, Activation, Dropout, Dense, Flatten, Input

# a first, very simple network like the one used for classification, but with a single neuron in the output layer and no activation (linear)
def MiRedReg1(dim):
  model = Sequential()
  model.add(Dense(8, input_dim=dim, activation="relu"))
  model.add(Dense(4, activation="relu"))
  model.add(Dense(1, activation="linear"))

  return model

# a slightly more complex network with more hidden layers
def MiRedReg2(dim):
  model = Sequential()
  model.add(Dense(64, input_dim=dim, activation="relu"))
  model.add(Dense(32, activation="relu"))
  model.add(Dense(16, activation="relu"))
  model.add(Dense(8, activation="relu"))
  model.add(Dense(4, activation="relu"))
  model.add(Dense(1, activation="linear"))

  return model

In [None]:
# These standalone Keras imports rebind the Sequential and Dense names imported
# from tensorflow.keras above, so the model-building functions use standalone Keras.
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = MiRedReg1(Xtrain.shape[1])
opt = Adam(lr=1e-2, decay=1e-3 / 200)

# compile the model
model.compile(loss="mse",
              optimizer=opt,
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

# train the model
print("[INFO] training model...")
model.fit(x=Xtrain, y=ytrain,
          validation_data=(Xval, yval),
          epochs=200, batch_size=8)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Using TensorFlow backend.


[INFO] training model...

Train on 8412 samples, validate on 2100 samples
Epoch 1/200
...
Epoch 72/200

<keras.callbacks.callbacks.History at 0x7f1375f4e9e8>

In the regression exercise we use the RMSE to compare the accuracy of the models. This value is the square root of the mean squared error, so when it is computed on prices in euros it can be read as roughly how many euros our predictions are off by.

To do this correctly we have to undo the logarithmic transformation that we applied to the Price variable to make it closer to Gaussian.
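
To make the difference concrete, here is a small sketch with made-up prices (none of these numbers come from the dataset). The RMSE computed directly on the log10 values is what Keras reports during training; only after applying `10**x` to both the targets and the predictions is the RMSE expressed in euros.

```python
import numpy as np

# Hypothetical true prices and predictions, both stored on the log10 scale.
y_true_log = np.log10(np.array([50.0, 100.0, 200.0]))
y_pred_log = np.log10(np.array([60.0, 90.0, 150.0]))

# RMSE on the log10 scale (not in euros).
rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

# Undo the transformation, then compute the RMSE in euros.
y_true_eur = 10 ** y_true_log
y_pred_eur = 10 ** y_pred_log
rmse_eur = np.sqrt(np.mean((y_true_eur - y_pred_eur) ** 2))

print(f"RMSE (log10 scale): {rmse_log:.3f}")
print(f"RMSE (euros): {rmse_eur:.1f}")
```

In this toy case the errors in euros are 10, 10 and 50, so the euro RMSE works out to 30, while the log-scale RMSE is a much smaller number with no direct monetary meaning.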

In [None]:

predTrain = model.predict(Xtrain)
predVal = model.predict(Xval)

#undo the log transformation
predTrain_Eur = pd.DataFrame(predTrain).apply(lambda x: 10**(x))
predVal_Eur = pd.DataFrame(predVal).apply(lambda x: 10**(x))
Ytrain_Eur = pd.DataFrame(ytrain).apply(lambda x: 10**(x))
Yval_Eur = pd.DataFrame(yval).apply(lambda x: 10**(x))

#compute the MSE and the RMSE for train and validation
mseTrainModel = mean_squared_error(Ytrain_Eur,predTrain_Eur)
mseValModel = mean_squared_error(Yval_Eur,predVal_Eur)

print('MSE (train): %0.3g' % mseTrainModel)
print('MSE (val) : %0.3g' % mseValModel)

print('RMSE (train): %0.3g' % np.sqrt(mseTrainModel))
print('RMSE (val) : %0.3g' % np.sqrt(mseValModel))

MSE (train): 706
MSE (val) : 947
RMSE (train): 26.6
RMSE (val) : 30.8


In [None]:
model2 = MiRedReg2(Xtrain.shape[1])
opt = Adam(lr=1e-3, decay=1e-3 / 200)
model2.compile(loss="mse", optimizer=opt, metrics=[tf.keras.metrics.RootMeanSquaredError()])

# train the model
print("[INFO] training model...")
model2.fit(x=Xtrain, y=ytrain, 
	validation_data=(Xval, yval),
	epochs=200, batch_size=8)

[INFO] training model...
Train on 8412 samples, validate on 2100 samples
Epoch 1/200
...
Epoch 72/200


<keras.callbacks.callbacks.History at 0x7f13081f57b8>

In [None]:
predTrainM2 = model2.predict(Xtrain)
predValM2 = model2.predict(Xval)

#undo the log transformation
predTrainM2_Eur = pd.DataFrame(predTrainM2).apply(lambda x: 10**(x))
predValM2_Eur = pd.DataFrame(predValM2).apply(lambda x: 10**(x))
Ytrain_Eur = pd.DataFrame(ytrain).apply(lambda x: 10**(x))
Yval_Eur = pd.DataFrame(yval).apply(lambda x: 10**(x))

#compute the MSE and the RMSE for train and validation
mseTrainModel2 = mean_squared_error(Ytrain_Eur,predTrainM2_Eur)
mseValModel2 = mean_squared_error(Yval_Eur,predValM2_Eur)

print('MSE (train): %0.3g' % mseTrainModel2)
print('MSE (val) : %0.3g' % mseValModel2)

print('RMSE (train): %0.3g' % np.sqrt(mseTrainModel2))
print('RMSE (val) : %0.3g' % np.sqrt(mseValModel2))

MSE (train): 367
MSE (val) : 629
RMSE (train): 19.2
RMSE (val) : 25.1


Looking at the results, the following points stand out:

As in the classification exercise, model 2 shows some overfitting, with a wider gap between train and validation RMSE (19.2 vs. 25.1, against 26.6 vs. 30.8 for model 1), and it is also slower to train, so we go with model 1.

Another thing to keep in mind is that in this regression exercise we used more epochs, because the RMSE keeps going down as more epochs are run. This means the network has not fully converged and is still learning. I tried increasing the learning rate so that the network would converge faster, but the same behaviour persisted. For lack of time we left it as it is, but the model could be trained for longer and the results would very likely improve somewhat.
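
One way to back the "still learning" observation with a number is to compare the best validation error of the last few epochs with the best value seen before them. The helper below is hypothetical (it is not part of this notebook) and works on any list of per-epoch values, such as the lists Keras stores in `History.history`; the exact key name depends on the metric and the Keras version, so it is not assumed here.

```python
def still_improving(val_metric, window=10, min_delta=0.0):
    """Return True if the best value in the last `window` epochs beats the
    best value seen before them by more than `min_delta`.

    `val_metric` is a list of per-epoch validation errors (lower is better).
    """
    if len(val_metric) <= window:
        # Not enough history to compare two periods.
        return True
    best_before = min(val_metric[:-window])
    best_recent = min(val_metric[-window:])
    return best_before - best_recent > min_delta


# Made-up curves for illustration only.
decreasing = [0.30 - 0.001 * epoch for epoch in range(200)]
plateaued = [0.30 - 0.001 * epoch for epoch in range(100)] + [0.205] * 100

print(still_improving(decreasing))
print(still_improving(plateaued))
```

If the check is still true at the last epoch, as the observation above suggests for this network, training for more epochs is a reasonable next step.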

In [None]:
# Evaluate the model on the test set
scores = model.evaluate(Xtest, ytest)

print('Loss: %.3f' % scores[0])
print('RMSE: %.3f' % scores[1])

Loss: 0.020
RMSE: 0.144
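
Note that `model.evaluate` works on the log10-transformed target, so the two numbers printed above are not in euros. As a rough reading aid (an interpretation, not something the notebook computes), the loss is the MSE, its square root should match the RMSE up to the rounding of the printed loss, and `10**RMSE` gives an approximate multiplicative error factor on the price.

```python
import math

loss_log = 0.020   # MSE on the log10 scale, as printed above (rounded)
rmse_log = 0.144   # RMSE on the log10 scale, as printed above

# sqrt(MSE) should agree with the RMSE up to rounding of the printed loss.
print(f"sqrt(loss) = {math.sqrt(loss_log):.3f}")

# An error of 0.144 in log10 units corresponds to multiplying or dividing
# the price by 10**0.144.
factor = 10 ** rmse_log
print(f"approximate error factor = {factor:.2f}x")
```

This is only a heuristic reading of a log-scale error; the euro figures computed after undoing the transformation remain the ones used to compare models.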


In [None]:
predTrain = model.predict(Xtrain)
predTest = model.predict(Xtest)

#undo the log transformation
predTrain_Eur = pd.DataFrame(predTrain).apply(lambda x: 10**(x))
predTest_Eur = pd.DataFrame(predTest).apply(lambda x: 10**(x))
Ytrain_Eur = pd.DataFrame(ytrain).apply(lambda x: 10**(x))
Ytest_Eur = pd.DataFrame(ytest).apply(lambda x: 10**(x))

#compute the MSE and the RMSE for train and test
mseTrainModel = mean_squared_error(Ytrain_Eur,predTrain_Eur)
mseTestModel = mean_squared_error(Ytest_Eur,predTest_Eur)

print('MSE (train): %0.3g' % mseTrainModel)
print('MSE (test) : %0.3g' % mseTestModel)

print('RMSE (train): %0.3g' % np.sqrt(mseTrainModel))
print('RMSE (test) : %0.3g' % np.sqrt(mseTestModel))

MSE (train): 706
MSE (test) : 868
RMSE (train): 26.6
RMSE (test) : 29.5
