<a href="https://colab.research.google.com/github/lucatraverso/House-price-prediction_Prosperati-dataset./blob/main/prosperati.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Predicción de precios de alquileres en CABA

En este colab vamos a armar un modelo para predecir precios de casas usando un dataset de Prosperati y varios algoritmos diferentes. Primero importamos las librerias y módulos necesarios.

In [36]:
import pandas as pd
import gzip
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

##Obtencion del dataset

Ahora descargamos e importamos el dataset.

In [48]:
!wget https://storage.googleapis.com/properati-data-public/ar_properties.csv.gz

--2021-07-08 00:21:30--  https://storage.googleapis.com/properati-data-public/ar_properties.csv.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.204.128, 64.233.187.128, 64.233.189.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.204.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 292380968 (279M) [application/octet-stream]
Saving to: ‘ar_properties.csv.gz’


2021-07-08 00:21:32 (108 MB/s) - ‘ar_properties.csv.gz’ saved [292380968/292380968]



In [49]:
with gzip.open('ar_properties.csv.gz') as f:
    dataset = pd.read_csv(f)

##Limpieza y filtrado de datos

Filtramos el dataset para ver solamente alquileres en capital federal, en pesos que correspondan a casas, departamentos o ph. Tambien eliminamos columnas innecesarias.


In [50]:
dataset = dataset[(dataset.l2 == 'Capital Federal') & 
                  (dataset.operation_type == 'Alquiler') &
                  (dataset.price > 0) &
                  (dataset.currency == 'ARS')]

dataset = dataset[dataset.property_type == ('Departamento' or 'PH' or 'Casa')]

ubicacion = dataset[['lat', 'lon']]

drop = ['id', 
        'ad_type', 
        'start_date', 
        'end_date', 
        'created_on', 
        'title', 
        'description', 
        'l1', 'l2', 'l4',  
        'l5', 'l6', 
        'lat', 'lon', 
        'price_period', 
        'operation_type', 
        'currency']

dataset = dataset.drop(drop, axis=1)

##Feature engineering

Como las columnas de barrio y tipo de propiedad son de tipo categoricas, vamosa codificarlas utilizando one-hot encoding. Tambien se rellenan los valores ausentes.
Los correspondientes a cuartos, habitaciones y baños con 0 y las superficies con el valor medio. Tambien se normalizan las superficies total y cubierta utilizando la media y desviacion del training set.

In [51]:
barrios = pd.get_dummies(dataset['l3'])
tipos = pd.get_dummies(dataset['property_type'])

dataset = dataset.drop('l3', axis=1)
dataset = dataset.drop('property_type', axis=1)

dataset = dataset.dropna()

dataset = dataset.join(barrios)
dataset = dataset.join(tipos)

x_train = dataset.iloc[0:10000].drop('price', axis=1)
y_train = dataset.iloc[0:10000:]['price']

x_test = dataset.iloc[10001:].drop('price', axis=1)
y_test = dataset.iloc[10001:]['price']

def pad_and_normalize(column, media, desv):
    '''
    Remplaza valores vacios de una columna por la media y normaliza
    '''
    column = column.fillna(media)
    column = (column - media) / desv
    return column

(sc_media, sc_desv) = (x_train['surface_covered'].mean(), x_train['surface_covered'].std())
(st_media, st_desv) = (x_train['surface_total'].mean(), x_train['surface_total'].std())

x_train['surface_covered'] = pad_and_normalize(x_train['surface_covered'], sc_media, sc_desv)
x_train['surface_total'] = pad_and_normalize(x_train['surface_total'], st_media, st_desv)
x_train[['rooms', 'bedrooms', 'bathrooms']] = x_train[['rooms', 'bedrooms', 'bathrooms']].fillna(0)

x_test['surface_covered'] = pad_and_normalize(x_test['surface_covered'], sc_media, sc_desv)
x_test['surface_total'] = pad_and_normalize(x_test['surface_total'], st_media, st_desv)
x_test[['rooms', 'bedrooms', 'bathrooms']] = x_test[['rooms', 'bedrooms', 'bathrooms']].fillna(0)

##Definición de modelos

Vamos a usar regresión lineal y una red neuronal.

In [52]:
#%% REGRESIÓN LINEAL

def entrenar_rlineal(x_train, y_train, x_test, y_test):
    modelo = LinearRegression()
    modelo.fit(x_train, y_train)
    y_pred = modelo.predict(x_test)

    train_score = modelo.score(x_train, y_train)
    test_score = modelo.score(x_test, y_test)
    mserror = mean_squared_error(y_test, y_pred)
    
    mse = (((y_test - y_pred)**2).sum()) / len(y_test)

    print('Testing linear model...')
    print(f'Training score: {train_score:.2f}')
    print(f'Testing score: {test_score:.2f}')
    print(f'MSError: {mserror:.2f} | {mse:.2f}')
    print('...')
    return modelo


#%% RED NEURONAL

def entrenar_red(x_train, y_train, x_test, y_test, n=(100), 
                 solver='adam', lri=0.001, lr='constant'):
    '''
    Entrena una red 
    n: tupla con en numero de unidades por capa    
    '''
    nn = MLPRegressor(hidden_layer_sizes=n, 
                      activation = 'relu', 
                      solver='adam', 
                      learning_rate_init=lri,
                      learning_rate='constant', 
                      random_state=1
                      )
    nn.fit(x_train, y_train)
    y_pred = nn.predict(x_test)
    train_score = nn.score(x_train, y_train)
    test_score = nn.score(x_test, y_test)
    mserror = mean_squared_error(y_test, y_pred)

    print(f'Neural Net with {n} units')
    print(f'Training score: {train_score:.2f}')
    print(f'Testing score: {test_score:.2f}')
    print(f'MSError: {mserror:.2f}')
    print('...')
    return nn

Vamos a probar un par de configuraciones para ver que funciona mejor.

In [53]:
modelo_lineal = entrenar_rlineal(x_train, y_train, x_test, y_test)
red_simple = entrenar_red(x_train, y_train, x_test, y_test, (100), 'adam', 0.1, 'adaptive')
red_doble = entrenar_red(x_train, y_train, x_test, y_test, (100, 50), 'adam', 0.1, 'adaptive')
red_triple = entrenar_red(x_train, y_train, x_test, y_test, (100, 50, 25), 'adam', 0.1, 'adaptive')

Testing linear model...
Training score: 0.52
Testing score: 0.36
MSError: 1725687145.79 | 1725687145.79
...
Neural Net with 100 units
Training score: 0.57
Testing score: 0.36
MSError: 1716976288.12
...
Neural Net with (100, 50) units
Training score: 0.70
Testing score: 0.58
MSError: 1116857314.68
...
Neural Net with (100, 50, 25) units
Training score: 0.70
Testing score: 0.57
MSError: 1142053403.10
...


Ahora tomtamos la red de dos capaz ocultas para realizar mas pruebas. Vamos a convertir los precios a una escala logaritmica a ver como afecta la precision de los modelos.

In [22]:
import numpy as np

y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

modelo_lineal = entrenar_rlineal(x_train, y_train_log, x_test, y_test_log)
    
red_doble = entrenar_red(x_train, y_train_log, x_test, y_test_log, 
                         (100, 50), 'adam', 0.1, 'adaptive')

Testing linear model...
Training score:  0.5774018469572362
Testing score:  0.4560626691162659
...
Neural Net with (100, 50) units
Training score:  0.5188867253553939
Testing score:  -0.8489961152915662
...
