# **Deep Learning Practical**



Predicting the price of Airbnb rooms by combining the **images** available in the Airbnb dataset with the **numerical data**, using:

*   Regression
*   Classification

and applying **convolutional neural networks (CNNs)**.


#### **Load the required libraries and functions**

In [1]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [2]:
import tensorflow as tf
print(tf.__version__)

1.15.2


In [3]:
# Load the required libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt # for plotting
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Image handling
import imageio as io
import cv2

# Sets the value of the specified option
# Show all rows (e.g. for dtypes or head().T) by setting max_rows
pd.set_option('display.max_rows', None)
# Show all columns (e.g. for the correlation matrix)
pd.set_option('display.max_columns', None)

In [4]:
# Function that moves a column to the given position
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

In [5]:
from keras.utils import to_categorical

def classes_price(df):

  y = df['Price']
  y_class = []

  for x in y:
      # The target is mapped to 5 classes: cheap, medium, medium-high, expensive and very expensive
      if x <= 50:
          y_class.append(0)
      elif x <=100:
          y_class.append(1)
      elif x <=150:
          y_class.append(2)
      elif x <=200:
          y_class.append(3)
      else:
          y_class.append(4)
  
  y_class_onehot = to_categorical(y_class)

  return y_class_onehot

Using TensorFlow backend.
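
The binning loop in `classes_price` can also be written in vectorized form. The following is a minimal sketch, assuming the same four thresholds (50, 100, 150, 200); `price_to_class` and `one_hot` are hypothetical helper names that do not appear in the notebook, and plain NumPy is used here in place of `to_categorical`.

```python
import numpy as np

# Upper bounds of classes 0..3; class 4 is everything above the last bound
PRICE_EDGES = (50, 100, 150, 200)

def price_to_class(prices, edges=PRICE_EDGES):
    """Map prices to class indices 0..len(edges) using right-closed bins."""
    # right=True makes each bin (lower, upper], matching the `x <= 50` style checks
    return np.digitize(np.asarray(prices, dtype=float), bins=edges, right=True)

def one_hot(labels, num_classes=len(PRICE_EDGES) + 1):
    """One-hot encode integer labels with a fixed number of classes."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

labels = price_to_class([9, 50, 75, 150, 200, 250])
encoded = one_hot(labels)
```

Fixing the number of classes is a deliberate choice: `to_categorical` infers it from the largest label when `num_classes` is not given, so a split that happened to contain no price above 200 would produce fewer than 5 columns.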


In [6]:
# Function that defines the MLP (Multi-Layer Perceptron) network
# regress=True: regression problem, linear activation (i.e. no activation function) and 1 output neuron
# regress=False: classification problem, softmax activation with one output neuron per target class (5)
# Imported from the standalone keras package, like the CNN cell below, because
# mixing tensorflow.keras and keras objects in one model is error-prone
from keras.models import Sequential
from keras.layers import Dense
	
def create_mlp(dim, regress=False):
	# define our MLP network
	model = Sequential()
	model.add(Dense(8, input_dim=dim, activation='relu'))
	#model.add(Dense(18, activation='relu'))
	model.add(Dense(4, activation='relu'))
	# check to see if the regression node should be added with function activation linear
	# otherwise (classification) use softmax with the number of classes
	if regress:
		model.add(Dense(1, activation='linear'))
	else:
		model.add(Dense(5, activation='softmax'))  
  
	# return our model
	return model

In [7]:
# import the necessary packages
from keras.models import Sequential
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Dropout
from keras.layers.core import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.models import Model

# Function that defines the convolutional network (CNN)
def create_cnn(width, height, depth, filters=(16, 32, 64), regress=False):
	# initialize the input shape and channel dimension, assuming
	# TensorFlow/channels-last ordering
	inputShape = (height, width, depth)
	chanDim = -1

	# define the model input
	inputs = Input(shape=inputShape)

	# loop over the number of filters
	for (i, f) in enumerate(filters):
		# if this is the first CONV layer then set the input
		# appropriately
		if i == 0:
			x = inputs

		# CONV => RELU => BN => POOL
		x = Conv2D(f, (3, 3), padding="same")(x)
		x = Activation('relu')(x)
		x = BatchNormalization(axis=chanDim)(x)
		x = MaxPooling2D(pool_size=(2, 2))(x)

	# flatten the volume, then FC => RELU => BN => DROPOUT
	x = Flatten()(x)
	x = Dense(16)(x)
	x = Activation('relu')(x)
	x = BatchNormalization(axis=chanDim)(x)
	x = Dropout(0.5)(x)

	# apply another FC layer, this one to match the number of nodes
	# coming out of the MLP
	x = Dense(4)(x)
	x = Activation("relu")(x)

	# check to see if the regression node should be added
	if regress:
		x = Dense(1, activation='linear')(x)

	# construct the CNN
	model = Model(inputs, x)

	# return the CNN
	return model
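
To see how large the flattened feature vector of `create_cnn` becomes, the shape can be tracked by hand. This is a minimal sketch, assuming the default `filters=(16, 32, 64)`, `"same"` padding and 2x2 max pooling as in the function above; `cnn_feature_shape` is a hypothetical helper, not part of the notebook.

```python
def cnn_feature_shape(height, width, depth=3, filters=(16, 32, 64), pool=2):
    """Track the tensor shape through the CONV => POOL stages of the CNN."""
    channels = depth
    for f in filters:
        # 'same' padding keeps height and width; 2x2 max pooling halves them
        height, width = height // pool, width // pool
        channels = f
    return height, width, channels

h, w, c = cnn_feature_shape(224, 224)
flattened = h * w * c                  # units entering the first Dense layer
dense16_params = flattened * 16 + 16   # weights plus biases of Dense(16)
print((h, w, c), flattened, dense16_params)
```

With 224x224 inputs the first fully connected layer alone holds about 0.8 million weights, while the training set has only a few thousand images, which is worth keeping in mind when reading the overfitting results later on.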

#### **Load the preprocessed Airbnb dataset**

In [8]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
# We start from the clean dataset produced by the preprocessing applied in the Machine Learning module
# Null values in numerical variables are replaced with the mean, except for ['Host Response Time', 'Host Verifications'], which use the mode
# The categorical variables ['Host Response Time', 'Neighbourhood', 'Property Type', 'Room Type', 'Bed Type', 'Calendar Updated', 'Cancellation Policy']
# are encoded with mean_encode
df_airbnb_clean = pd.read_csv('/content/drive/My Drive/df_airbnb_clean.csv', sep=';', decimal='.')
print(f'df_airbnb_clean has {df_airbnb_clean.shape[0]} rows and {df_airbnb_clean.shape[1]} columns')

df_airbnb_clean has 10593 rows and 26 columns


In [10]:
df_airbnb_clean.describe()

Unnamed: 0,Host Response Time,Host Response Rate,Host Verifications,Neighbourhood,Latitude,Longitude,Property Type,Room Type,Bathrooms,Bedrooms,Beds,Bed Type,Amenities,Price,Security Deposit,Cleaning Fee,Guests Included,Extra People,Minimum Nights,Maximum Nights,Calendar Updated,Availability 365,Number of Reviews,Cancellation Policy,Review Scores Mean
count,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0,10593.0
mean,58.342679,96.189733,4.153309,58.342679,40.420579,-3.69688,58.342679,58.342679,1.245407,1.254067,1.900246,58.342679,14.431039,58.342679,176.237497,26.627911,1.539035,7.140659,2.401303,982.983668,58.342679,201.306429,25.5936,58.342679,21.330314
std,3.221779,11.786212,1.318122,13.684927,0.021078,0.024832,5.645638,23.645223,0.573463,0.745048,1.401336,1.612701,4.72183,39.100986,63.666834,13.726473,1.029315,9.701699,2.196173,9963.674662,10.425168,126.436562,40.868694,5.971538,1.758807
min,50.156282,0.0,1.0,32.070922,40.331888,-3.863907,17.0,26.830409,0.5,0.0,1.0,37.777778,1.0,9.0,70.0,4.0,1.0,0.0,1.0,1.0,15.0,0.0,0.0,50.836646,4.857143
25%,55.862564,96.189733,3.0,50.213333,40.409775,-3.708258,59.957491,32.508019,1.0,1.0,1.0,58.563422,11.0,30.0,176.237497,20.0,1.0,0.0,1.0,360.0,50.186722,79.0,1.0,50.836646,21.0
50%,59.858756,100.0,4.0,58.828571,40.418799,-3.701612,59.957491,79.794077,1.0,1.0,1.0,58.563422,14.0,50.0,176.237497,26.627911,1.0,0.0,2.0,1125.0,54.815315,236.0,9.0,58.387047,21.330314
75%,59.858756,100.0,5.0,66.681416,40.428242,-3.693171,59.957491,79.794077,1.0,1.0,2.0,58.563422,18.0,75.0,176.237497,26.627911,2.0,13.0,3.0,1125.0,74.193866,318.0,32.0,64.949627,22.285714
max,68.782609,100.0,10.0,103.666667,40.562736,-3.573613,145.0,79.794077,8.0,10.0,16.0,58.563422,34.0,250.0,900.0,200.0,16.0,276.0,27.0,1000000.0,217.5,365.0,446.0,131.5,80.0


#### **Load the images**

In [11]:
# Load the images from the files saved in Drive
loaded_images = np.load('/content/drive/My Drive/images_combinate.npy')
was_loaded = np.load('/content/drive/My Drive/was_loaded_combinate.npy')

print(f'Shape of loaded_images: {loaded_images.shape}')
print(f'Shape of was_loaded: {was_loaded.shape}')

Shape of loaded_images: (10593, 224, 224, 3)
Shape of was_loaded: (10593,)


In [12]:
# Of the 10593 Airbnb samples with a reference in the Thumbnail URL column, a total of 10564 images were downloaded
# We keep only the dataset rows associated with those images
loaded_images_ok = loaded_images[was_loaded==1]
loaded_images_ok.shape

(10564, 224, 224, 3)

In [13]:
# Retrieve the downloaded images for the Airbnb dataframe
#loaded_images_ok = np.load('/content/drive/My Drive/loaded_images_combinate_ok.npy')
loaded_images_ok.shape

(10564, 224, 224, 3)

In [14]:
# Keep the rows of the Airbnb dataset for which an image was downloaded
df_airbnb_images = df_airbnb_clean[was_loaded==1]
df_airbnb_images.shape

(10564, 26)
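
The same boolean mask is applied to the image array and to the dataframe, which is what keeps row `i` of the tabular data paired with image `i`. A minimal sketch of that idea with toy stand-ins (all values below are hypothetical and only illustrate the indexing):

```python
import numpy as np

# Toy stand-ins: 5 listings, 3 of which have a downloaded image
was_loaded_demo = np.array([1, 0, 1, 1, 0])
images_demo = np.arange(5)             # stands in for the image array
rows_demo = np.array(list("abcde"))    # stands in for the dataframe rows

mask = was_loaded_demo == 1
images_ok = images_demo[mask]
rows_ok = rows_demo[mask]

# Both filtered arrays keep the same length and order, so position i
# still refers to the same listing in each of them
assert len(images_ok) == len(rows_ok) == mask.sum()
```

This only holds because both objects are filtered with the identical mask and neither is shuffled or re-sorted in between.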

In [15]:
# Drop the Thumbnail Url column
# (reassigning instead of inplace=True avoids modifying a filtered view of df_airbnb_clean)
print(f'Shape of df_airbnb_images: {df_airbnb_images.shape}')
df_airbnb_images = df_airbnb_images.drop(columns=['Thumbnail Url'])
print(f'Shape of df_airbnb_images: {df_airbnb_images.shape}')

Shape of df_airbnb_images: (10564, 26)
Shape of df_airbnb_images: (10564, 25)


In [16]:
df_airbnb_images.dtypes

Host Response Time     float64
Host Response Rate     float64
Host Verifications       int64
Neighbourhood          float64
Latitude               float64
Longitude              float64
Property Type          float64
Room Type              float64
Bathrooms              float64
Bedrooms               float64
Beds                   float64
Bed Type               float64
Amenities                int64
Price                  float64
Security Deposit       float64
Cleaning Fee           float64
Guests Included          int64
Extra People             int64
Minimum Nights           int64
Maximum Nights           int64
Calendar Updated       float64
Availability 365         int64
Number of Reviews        int64
Cancellation Policy    float64
Review Scores Mean     float64
dtype: object

In [17]:
print(f'df_airbnb_images shape: {df_airbnb_images.shape} and type: {type(df_airbnb_images)}')
print(f'loaded_images_ok shape: {loaded_images_ok.shape} and type: {type(loaded_images_ok)}')

df_airbnb_images shape: (10564, 25) and type: <class 'pandas.core.frame.DataFrame'>
loaded_images_ok shape: (10564, 224, 224, 3) and type: <class 'numpy.ndarray'>


In [18]:
# Move the Price column to the first position
df_airbnb_images = change_column_order(df_airbnb_images, 'Price', 0)

In [19]:
df_airbnb_images.dtypes

Price                  float64
Host Response Time     float64
Host Response Rate     float64
Host Verifications       int64
Neighbourhood          float64
Latitude               float64
Longitude              float64
Property Type          float64
Room Type              float64
Bathrooms              float64
Bedrooms               float64
Beds                   float64
Bed Type               float64
Amenities                int64
Security Deposit       float64
Cleaning Fee           float64
Guests Included          int64
Extra People             int64
Minimum Nights           int64
Maximum Nights           int64
Calendar Updated       float64
Availability 365         int64
Number of Reviews        int64
Cancellation Policy    float64
Review Scores Mean     float64
dtype: object

#### **Regression with numerical data and images**

In [20]:
from sklearn.model_selection import train_test_split
# Split into train, validation and test, keeping images and numerical data aligned
split = train_test_split(df_airbnb_images, loaded_images_ok, test_size=0.33, shuffle = True, random_state=0)
(trainAttrX, testAttrX, trainImagesX, testImagesX) = split

splitval = train_test_split(trainAttrX, trainImagesX, test_size=0.33, shuffle = True, random_state=0)
(trainAttrX, valAttrX, trainImagesX, valImagesX) = splitval

print(f'trainAttrX: {trainAttrX.shape} and type: {type(trainAttrX)} - trainImagesX:{trainImagesX.shape} and type: {type(trainImagesX)}')
print(f'valAttrX:   {valAttrX.shape} and type: {type(valAttrX)} - valImagesX:  {valImagesX.shape} and type: {type(valImagesX)}')
print(f'testAttrX:  {testAttrX.shape} and type: {type(testAttrX)} - testImagesX: {testImagesX.shape} and type: {type(testImagesX)}')


trainAttrX: (4741, 25) and type: <class 'pandas.core.frame.DataFrame'> - trainImagesX:(4741, 224, 224, 3) and type: <class 'numpy.ndarray'>
valAttrX:   (2336, 25) and type: <class 'pandas.core.frame.DataFrame'> - valImagesX:  (2336, 224, 224, 3) and type: <class 'numpy.ndarray'>
testAttrX:  (3487, 25) and type: <class 'pandas.core.frame.DataFrame'> - testImagesX: (3487, 224, 224, 3) and type: <class 'numpy.ndarray'>
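
The three split sizes follow from applying `test_size=0.33` twice. A minimal sketch of the arithmetic, assuming the test share is rounded up (which reproduces the shapes printed above); `split_sizes` is a hypothetical helper:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # the held-out share is rounded up
    return n_samples - n_test, n_test

n_trainval, n_test = split_sizes(10564, 0.33)
n_train, n_val = split_sizes(n_trainval, 0.33)
print(n_train, n_val, n_test)
```

Because the second split takes 33% of what remains, the model is trained on roughly 45% of the samples, with about 22% used for validation and 33% for testing.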


In [21]:
# find the largest price in the training set and use it to scale the
# prices to roughly the range [0, 1] (this tends to help training and convergence)
maxPrice = trainAttrX['Price'].max()    # maximum price in training, also applied to validation and test
trainY = trainAttrX['Price'] / maxPrice
valY   = valAttrX['Price'] / maxPrice
testY  = testAttrX['Price'] / maxPrice
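
Since the target is divided by the training maximum, model outputs have to be multiplied by the same value to read them as prices again. A minimal sketch with hypothetical numbers (`max_price_demo` stands in for `maxPrice`, whose actual value is not printed in the notebook):

```python
import numpy as np

def scale_prices(prices, max_price):
    """Divide prices by the training-set maximum."""
    return np.asarray(prices, dtype=float) / max_price

def unscale_prices(scaled, max_price):
    """Invert scale_prices to recover values in the original currency units."""
    return np.asarray(scaled, dtype=float) * max_price

max_price_demo = 250.0  # hypothetical value, for illustration only
scaled = scale_prices([50.0, 125.0, 250.0], max_price_demo)
restored = unscale_prices(scaled, max_price_demo)
```

Validation or test prices above the training maximum would map to values slightly greater than 1, which is harmless for a linear output layer.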

In [22]:
trainAttrX.columns

Index(['Price', 'Host Response Time', 'Host Response Rate',
       'Host Verifications', 'Neighbourhood', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Bathrooms', 'Bedrooms', 'Beds',
       'Bed Type', 'Amenities', 'Security Deposit', 'Cleaning Fee',
       'Guests Included', 'Extra People', 'Minimum Nights', 'Maximum Nights',
       'Calendar Updated', 'Availability 365', 'Number of Reviews',
       'Cancellation Policy', 'Review Scores Mean'],
      dtype='object')

In [23]:
from sklearn import preprocessing

# Keep every variable except the target (Price is column 0)
trainAttrX = trainAttrX.values[:, 1:]
valAttrX   = valAttrX.values[:, 1:]
testAttrX  = testAttrX.values[:, 1:]

# Standardize the features (the scaler is fitted on the training set only)
scaler = preprocessing.StandardScaler().fit(trainAttrX)
trainScaledAttrX = scaler.transform(trainAttrX)
valScaledAttrX = scaler.transform(valAttrX)
testScaledAttrX = scaler.transform(testAttrX)
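
What `StandardScaler` does in this cell can be sketched in a few lines of NumPy: statistics are computed on the training set and then reused unchanged for validation and test. The helpers `fit_standardizer` and `standardize` are hypothetical names used only for this illustration, and the data is made up.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-column mean and standard deviation on the training data only."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)     # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0       # constant columns: avoid division by zero
    return mean, std

def standardize(x, mean, std):
    """Apply previously fitted statistics to any split."""
    return (np.asarray(x, dtype=float) - mean) / std

train_demo = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
val_demo = np.array([[3.0, 12.0]])

mean, std = fit_standardizer(train_demo)
train_scaled = standardize(train_demo, mean, std)
val_scaled = standardize(val_demo, mean, std)
```

Fitting on the training split alone keeps information about the validation and test distributions out of the preprocessing step.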

##### **Model with MLP and CNN networks**

In [24]:
## Build the MLP and CNN networks
from keras.layers import concatenate
# create the MLP and CNN models
mlp = create_mlp(trainScaledAttrX.shape[1], regress=False)
cnn = create_cnn(224, 224, 3, regress=False)

# create the input to our final set of layers as the *output* of both the MLP and CNN
combinedInput = concatenate([mlp.output, cnn.output])




In [31]:
from keras.models import Model
from keras.optimizers import Adam

# our final FC layer head will have two dense layers, the final one
# being our regression head
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)

# our final model will accept categorical/numerical data on the MLP
# input and images on the CNN input, outputting a single value (the
# predicted price of the house)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

# compile the model using mean squared error as our loss, implying
# that we seek to minimize the squared difference between our
# (scaled) price *predictions* and the *actual prices*
opt = Adam(lr=1e-1, decay=1e-1 / 10)
model.compile(loss='mean_squared_error', optimizer=opt)

# train the model with validation 
print('[INFO] training model...')
history = model.fit([trainScaledAttrX, trainImagesX], trainY,
	                  validation_data=([valScaledAttrX, valImagesX], valY),
	                  epochs=10, batch_size=16)

[INFO] training model...
Train on 4741 samples, validate on 2336 samples




```
mean_squared_error 

opt = Adam(lr=1e-2, decay=1e-2 / 10)
Epoch 10/10
4741/4741 [==============================] - 8s 2ms/step - loss: 0.0061 - val_loss: 0.0077 - batch_size = 16
4741/4741 [==============================] - 7s 1ms/step - loss: 0.0065 - val_loss: 0.0076 - batch_size = 32

opt = Adam(lr=1e-3, decay=1e-3 / 10)
Epoch 10/10
4741/4741 [==============================] - 6s 1ms/step - loss: 0.0085 - val_loss: 19.3696 -  batch_size = 512   
4741/4741 [==============================] - 9s 2ms/step - loss: 0.0061 - val_loss: 50409648.7492 - batch_size = 16

opt = Adam(lr=1e-1, decay=1e-1 / 10) 
Epoch 10/10
4741/4741 [==============================] - 8s 2ms/step - loss: 0.0232 - val_loss: 470418.2140 - batch_size = 16
```



Given these results, we keep batch_size at 32 and vary the learning rate of the Adam optimizer. The best result is obtained with lr = 1e-2, so we keep that configuration and train it for more epochs.
In fact the result gets worse with more epochs: the validation loss rises from 0.0076 to 0.0083, while the training loss only drops from 0.0065 to 0.0062.
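
The `decay` argument changes the effective learning rate during training. The sketch below assumes the time-based schedule of the legacy Keras optimizers, `lr / (1 + decay * iterations)`, where `iterations` counts batch updates rather than epochs; `decayed_lr` is a hypothetical helper written for this illustration.

```python
import math

def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay: the learning rate shrinks with every batch update."""
    return initial_lr / (1.0 + decay * iterations)

steps_per_epoch = math.ceil(4741 / 32)   # training samples / batch_size
total_steps = steps_per_epoch * 10       # 10 epochs
final_lr = decayed_lr(1e-2, 1e-2 / 10, total_steps)
print(steps_per_epoch, total_steps, final_lr)
```

Under that assumption, with lr = 1e-2, batch_size = 32 and 10 epochs, the learning rate falls to roughly 0.004 by the end of training, so runs with different batch sizes also differ in how quickly the learning rate decays.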


In [32]:
from keras.models import Model
from keras.optimizers import Adam

# our final FC layer head will have two dense layers, the final one
# being our regression head
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)

# our final model will accept categorical/numerical data on the MLP
# input and images on the CNN input, outputting a single value (the
# predicted price of the house)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

# compile the model using mean squared error as our loss, implying
# that we seek to minimize the squared difference between our
# (scaled) price *predictions* and the *actual prices*
opt = Adam(lr=1e-2, decay=1e-2 / 50)
model.compile(loss='mean_squared_error', optimizer=opt)

# train the model with validation 
print('[INFO] training model...')
history = model.fit([trainScaledAttrX, trainImagesX], trainY,
	                  validation_data=([valScaledAttrX, valImagesX], valY),
	                  epochs=50, batch_size=32)

[INFO] training model...
Train on 4741 samples, validate on 2336 samples


With longer training, the loss on the training set has decreased, but the loss on the validation set has increased.
```
batch_size = 32 opt = Adam(lr=1e-2, decay=1e-2 / epochs)
Epoch 10/10
4741/4741 [==============================] - 16s 3ms/step - loss: 0.0065 - val_loss: 0.0076
Epoch 50/50
4741/4741 [==============================] - 7s 2ms/step - loss: 0.0062 - val_loss: 0.0083
```
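
When the validation loss starts rising while the training loss keeps falling, a common remedy is to stop at the epoch with the lowest validation loss. The following is a minimal sketch of that patience logic in plain Python; the loss values are hypothetical, and `early_stop_epoch` is an illustrative helper rather than anything used in the notebook.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience rule."""
    best_loss = float("inf")
    best_idx = -1
    wait = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_idx, wait = loss, i, 0
        else:
            wait += 1
            if wait >= patience:
                return best_idx + 1, i + 1
    return best_idx + 1, len(val_losses)

# Hypothetical validation losses per epoch, for illustration only
demo_val_losses = [0.0120, 0.0090, 0.0076, 0.0079, 0.0081, 0.0083]
best_epoch, stop_epoch = early_stop_epoch(demo_val_losses, patience=3)
```

Keras offers this behaviour through the `EarlyStopping` callback passed to `model.fit`, which would be the natural way to apply it to the training cells here.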



The selected model uses batch_size = 32 and the optimizer opt = Adam(lr=1e-2, decay=1e-2 / 10); we now evaluate it on the test set. Note that the `model` object used in the next cell is whatever the last training cell left in memory, which here is the 50-epoch run.



In [33]:
# make predictions on the testing data
print("[INFO] predicting house prices...")
preds = model.predict([testScaledAttrX, testImagesX])
preds

[INFO] predicting house prices...


array([[0.26296735],
       [0.1499562 ],
       [0.10507278],
       ...,
       [0.25243145],
       [0.7211937 ],
       [0.6594808 ]], dtype=float32)

In [34]:
import locale

# compute the difference between the *predicted* house prices and the
# *actual* house prices, then compute the percentage difference and
# the absolute percentage difference
diff = preds.flatten() - testY
percentDiff = (diff / testY) * 100
absPercentDiff = np.abs(percentDiff)

# compute the mean and standard deviation of the absolute percentage
# difference
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)

# finally, show some statistics on our model
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
print("[INFO] avg. house price: {}, std house price: {}".format(
	locale.currency(df_airbnb_images["Price"].mean(), grouping=True),
	locale.currency(df_airbnb_images["Price"].std(), grouping=True)))
print("[INFO] mean: {:.2f}%, std: {:.2f}%".format(mean, std))

[INFO] avg. house price: $58.36, std house price: $39.13
[INFO] mean: 26.57%, std: 26.87%


The mean absolute percentage error on the test set means that, on average, the network's price predictions are off by about 26.57%, with a standard deviation of about 26.87%. This metric is computed after training and is different from the mean squared error used as the loss.
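
The statistic reported above can be isolated in a small function. A minimal sketch with made-up prices; `abs_percentage_error_stats` is a hypothetical helper that mirrors the computation in the preceding cell.

```python
import numpy as np

def abs_percentage_error_stats(y_true, y_pred):
    """Mean and standard deviation of the absolute percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ape = np.abs((y_pred - y_true) / y_true) * 100.0
    return ape.mean(), ape.std()

# Hypothetical actual and predicted prices, for illustration only
mean_ape, std_ape = abs_percentage_error_stats(
    [50.0, 100.0, 200.0],
    [40.0, 110.0, 200.0],
)
```

Because the error is a ratio, dividing both predictions and targets by `maxPrice` leaves it unchanged, which is why the notebook can compute it directly on the scaled values.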



#### **Classification with numerical data and images**

In [35]:
from sklearn.model_selection import train_test_split
# Split into train, validation and test, keeping images and numerical data aligned
split = train_test_split(df_airbnb_images, loaded_images_ok, test_size=0.33, shuffle = True, random_state=0)
(trainAttrX, testAttrX, trainImagesX, testImagesX) = split

splitval = train_test_split(trainAttrX, trainImagesX, test_size=0.33, shuffle = True, random_state=0)
(trainAttrX, valAttrX, trainImagesX, valImagesX) = splitval

print(f'trainAttrX: {trainAttrX.shape} and type: {type(trainAttrX)} - trainImagesX:{trainImagesX.shape} and type: {type(trainImagesX)}')
print(f'valAttrX:   {valAttrX.shape} and type: {type(valAttrX)} - valImagesX:  {valImagesX.shape} and type: {type(valImagesX)}')
print(f'testAttrX:  {testAttrX.shape} and type: {type(testAttrX)} - testImagesX: {testImagesX.shape} and type: {type(testImagesX)}')


trainAttrX: (4741, 25) and type: <class 'pandas.core.frame.DataFrame'> - trainImagesX:(4741, 224, 224, 3) and type: <class 'numpy.ndarray'>
valAttrX:   (2336, 25) and type: <class 'pandas.core.frame.DataFrame'> - valImagesX:  (2336, 224, 224, 3) and type: <class 'numpy.ndarray'>
testAttrX:  (3487, 25) and type: <class 'pandas.core.frame.DataFrame'> - testImagesX: (3487, 224, 224, 3) and type: <class 'numpy.ndarray'>


In [36]:
# map each price to one of the 5 price classes and one-hot encode it
# (no price scaling is needed for the classification target)

trainY = classes_price(trainAttrX)
valY   = classes_price(valAttrX)
testY  = classes_price(testAttrX)

In [37]:
trainAttrX.columns

Index(['Price', 'Host Response Time', 'Host Response Rate',
       'Host Verifications', 'Neighbourhood', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Bathrooms', 'Bedrooms', 'Beds',
       'Bed Type', 'Amenities', 'Security Deposit', 'Cleaning Fee',
       'Guests Included', 'Extra People', 'Minimum Nights', 'Maximum Nights',
       'Calendar Updated', 'Availability 365', 'Number of Reviews',
       'Cancellation Policy', 'Review Scores Mean'],
      dtype='object')

In [38]:
# Keep every variable except the target (Price is column 0)
trainAttrX = trainAttrX.values[:, 1:]
valAttrX   = valAttrX.values[:, 1:]
testAttrX  = testAttrX.values[:, 1:]

# Standardize the features (the scaler is fitted on the training set only)
scaler = preprocessing.StandardScaler().fit(trainAttrX)
trainScaledAttrX = scaler.transform(trainAttrX)
valScaledAttrX = scaler.transform(valAttrX)
testScaledAttrX = scaler.transform(testAttrX)

##### **Model with MLP and CNN networks**

In [39]:
## Build the MLP and CNN networks
from keras.layers import concatenate
# create the MLP and CNN models
mlp = create_mlp(trainAttrX.shape[1], regress=False)
cnn = create_cnn(224, 224, 3, regress=False)

# create the input to our final set of layers as the *output* of both
# the MLP and CNN
combinedInput = concatenate([mlp.output, cnn.output])

In [45]:
from keras.models import Model
from keras.optimizers import Adam

# our final FC layer head will have two dense layers, the final one
# being our classification head with 5 neurons
x = Dense(4, activation="relu")(combinedInput)
x = Dense(5, activation="softmax")(x)

# our final model will accept categorical/numerical data on the MLP
# input and images on the CNN input, outputting a probability for each
# of the 5 price classes
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

# compile the model using categorical cross-entropy as our loss, which
# suits one-hot encoded multi-class targets, and track accuracy as an
# additional metric
opt = Adam(lr=1e-1, decay=1e-1 / 10)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# train the model with validation
# NOTE: the unscaled arrays trainAttrX/valAttrX are passed below; the
# standardized trainScaledAttrX/valScaledAttrX computed earlier are not used
print('[INFO] training model...')
history = model.fit([trainAttrX, trainImagesX], trainY,
	                  validation_data=([valAttrX, valImagesX], valY),
	                  epochs=10, batch_size=16)

[INFO] training model...
Train on 4741 samples, validate on 2336 samples


Based on these tests we settle on batch_size = 16 with the Adam optimizer, opt = Adam(lr=1e-2, decay=1e-2 / 10). The model overfits, since training accuracy is 73.89% while validation accuracy is 56.46%, but we keep this configuration and train it for up to 50 epochs.

```
Epoch 10/10 opt = Adam(lr=1e-2, decay=1e-2 / 10) 
4741/4741 [======] - 20s 4ms/step - loss: 0.6798 - accuracy: 0.7389 - val_loss: 1.2005 - val_accuracy: 0.5646 - batch_size = 16
4741/4741 [======] - 17s 4ms/step - loss: 0.8361 - accuracy: 0.6625 - val_loss: 1.0374 - val_accuracy: 0.5736 - batch_size = 32

Epoch 10/10 opt = Adam(lr=1e-3, decay=1e-3 / 10)
4741/4741 [======] - 20s 4ms/step - loss: 0.4894 - accuracy: 0.8131 - val_loss: 1.4752 - val_accuracy: 0.5244 - batch_size = 16
4741/4741 [======] - 16s 3ms/step - loss: 0.3649 - accuracy: 0.8564 - val_loss: 2.0783 - val_accuracy: 0.5283 - batch_size = 32

Epoch 10/10 opt = Adam(lr=1e-1, decay=1e-1 / 10)
4741/4741 [======] - 20s 4ms/step - loss: 0.7425 - accuracy: 0.7184 - val_loss: 1.1270 - val_accuracy: 0.5407 - batch_size = 16
4741/4741 [======] - 17s 4ms/step - loss: 0.7763 - accuracy: 0.6946 - val_loss: 1.1090 - val_accuracy: 0.5582 - batch_size = 32

```



In [46]:
from keras.models import Model
from keras.optimizers import Adam

# our final FC layer head will have two dense layers, the final one
# being our classification head with 5 neurons
x = Dense(4, activation="relu")(combinedInput)
x = Dense(5, activation="softmax")(x)

# our final model will accept categorical/numerical data on the MLP
# input and images on the CNN input, outputting a probability for each
# of the 5 price classes
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

# compile the model using categorical cross-entropy as our loss, which
# suits one-hot encoded multi-class targets, and track accuracy as an
# additional metric
opt = Adam(lr=1e-2, decay=1e-2 / 50)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# train the model with validation
# NOTE: the unscaled arrays trainAttrX/valAttrX are passed below; the
# standardized trainScaledAttrX/valScaledAttrX computed earlier are not used
print('[INFO] training model...')
history = model.fit([trainAttrX, trainImagesX], trainY,
	                  validation_data=([valAttrX, valImagesX], valY),
	                  epochs=50, batch_size=16)

[INFO] training model...
Train on 4741 samples, validate on 2336 samples


With 50 epochs the overfitting is worse, since the gap between the accuracies widens (90.40% on training versus 51.37% on validation). Although this is therefore not the best model, we choose it and evaluate it on the test set.

```
Epoch 10/10 opt = Adam(lr=1e-2, decay=1e-2 / 10) - batch_size = 16
4741/4741 [======] - 17s 4ms/step - loss: 0.6798 - accuracy: 0.7389 - val_loss: 1.2005 - val_accuracy: 0.5646
Epoch 50/50
4741/4741 [======] - 17s 4ms/step -loss: 0.2774 - accuracy: 0.9040 - val_loss: 2.1520 - val_accuracy: 0.5137
```



In [47]:
# evaluate the model on the test data (unscaled testAttrX, consistent with training)
loss, acc = model.evaluate([testAttrX, testImagesX], testY)
print(f'Loss={loss}, Acc={acc}')

Loss=2.270183051319631, Acc=0.5007169246673584


The test accuracy of 50.07% is very poor and, together with the overfitting, shows that the model is not generalizing well.
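
A useful reference point for this figure is the majority-class baseline. The `describe()` output earlier reports a median Price of 50.0 for the full dataset, so at least half of the listings fall into class 0 (Price <= 50); a model that always predicted class 0 would therefore be expected to score roughly 50% as well. The sketch below computes that baseline on hypothetical labels; on the notebook's data the input would be something like `testY.argmax(axis=1)`, which is an assumption about how the one-hot targets would be decoded.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical class labels, for illustration only
demo_labels = [0, 0, 0, 1, 1, 2, 4, 0, 1, 0]
baseline = majority_baseline_accuracy(demo_labels)
```

Comparing the model against this baseline, rather than against 20% for five equally likely classes, gives a more honest picture of how much the network has learned.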