<a href="https://colab.research.google.com/github/kikiymini/7506R-1C2024-GRUPO02/blob/main/7506R_TP1_GRUPO02_ENTREGA_N4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Organizaci&oacute;n de Datos</center>

#### <center>C&aacute;tedra Ing. Rodriguez, Juan Manuel </center>

## <center>Practical Assignment 1: Properties for Sale</center>

### <center> Group 2</center>

## Members:

*   Aramayo Carolina
*   Utrera Maximo Damian
*   Villalba Ana Daniela
*   Fiorilo Roy


<font color='#fa5050'>Important: notebook number 1 must be run before this one; otherwise the train and test datasets will not be available.</font>

# Library imports

(Important: read the comments)

In [45]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns
import folium

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import sklearn
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

import joblib

import warnings
warnings.filterwarnings("ignore")

RAND_SEED = 42

# Reading the files

### From Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')
drive_path = "/content/drive/MyDrive/7506R-1C2024-GRUPO02"

train_file = drive_path + '/Dataset/ds_train.csv'
test_file = drive_path + '/Dataset/ds_test.csv'

ds_train = pd.read_csv(train_file)
ds_test = pd.read_csv(test_file)
using_drive = True

Mounted at /content/drive


### From a local machine

In [None]:
# If working locally, uncomment this cell and comment out the one above
# train_file = './dataset/ds_train.csv'
# test_file = './dataset/ds_test.csv'

# ds_train = pd.read_csv(train_file)
# ds_test = pd.read_csv(test_file)
# using_drive = False

# Regression

### Feature engineering

In [4]:
# Create copies of the train and test datasets so we can modify them freely
ds_reg_train = ds_train.copy()
ds_reg_test = ds_test.copy()

Current columns of the dataset:

In [5]:
ds_reg_train.dtypes

id                           object
start_date                   object
end_date                     object
latitud                     float64
longitud                    float64
place_l3                     object
property_type                object
property_rooms                int64
property_bedrooms             int64
property_surface_total      float64
property_surface_covered    float64
property_price              float64
dtype: object

In [6]:
ds_reg_test.dtypes

id                           object
start_date                   object
end_date                     object
latitud                     float64
longitud                    float64
place_l3                     object
property_type                object
property_rooms                int64
property_bedrooms             int64
property_surface_total      float64
property_surface_covered    float64
property_price              float64
dtype: object

Columns considered irrelevant:
* The property ID
* The listing's creation, start, and end dates
* Latitude and longitude (the neighborhood appears to be a better and more reliable indicator of location)
* Surface covered (it is highly correlated with surface total)
* Property bedrooms (it is highly correlated with property rooms)
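
The correlation argument above can be checked directly on the data. The following is a minimal sketch, assuming a toy DataFrame with made-up values in place of `ds_train`; the helper name `highly_correlated_pairs` and the 0.9 threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for ds_train; the values are made up.
toy = pd.DataFrame({
    "property_rooms": [1, 2, 3, 4, 5],
    "property_bedrooms": [1, 1, 2, 3, 4],
    "property_surface_total": [30.0, 45.0, 70.0, 95.0, 120.0],
    "property_surface_covered": [28.0, 40.0, 65.0, 90.0, 110.0],
})

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

pairs = highly_correlated_pairs(toy)
for col_a, col_b, r in pairs:
    print(f"{col_a} ~ {col_b}: r = {r:.3f}")
```

On the real dataset the same call would take `ds_train` restricted to its numeric columns; the pairs it reports are candidates for dropping one of the two columns.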

In [8]:
# Drop those columns
cols_a_eliminar = ["id", "start_date", "end_date", "latitud", "longitud", "property_surface_covered", "property_bedrooms"]
ds_reg_train.drop(columns=cols_a_eliminar, inplace=True)
ds_reg_test.drop(columns=cols_a_eliminar, inplace=True)
ds_reg_train.dtypes

place_l3                   object
property_type              object
property_rooms              int64
property_surface_total    float64
property_price            float64
dtype: object

Encoding the categorical variables

In [9]:
def encode_non_numerical_vars(ds, col):
  """returns encoding dict and its inverse dict"""
  _col = ds[col].unique()

  _col_dict = dict(zip(_col, range(len(_col))))
  _col_inv_dict = dict(zip(range(len(_col)), _col))

  ds[col] = ds[col].map(_col_dict)
  return _col_dict, _col_inv_dict

In [10]:
# Map property_type to a numeric representation
property_type_dict, property_type_inv_dict = encode_non_numerical_vars(ds_reg_train, "property_type")
# Note: the test set is encoded independently, so its codes are not guaranteed to match the train mapping
_ = encode_non_numerical_vars(ds_reg_test, "property_type")

In [11]:
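# Map neighborhoods to a numeric representation
place_l3_dict, place_l3_inv_dict = encode_non_numerical_vars(ds_reg_train, "place_l3")
# Note: the test set is encoded independently, so its codes are not guaranteed to match the train mapping
_ = encode_non_numerical_vars(ds_reg_test, "place_l3")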
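
Because `encode_non_numerical_vars` builds its dictionary from the order in which values appear, calling it separately on train and test can assign different codes to the same category. One possible alternative is to build the mapping once from the training column and reuse it on the test column. The sketch below assumes toy data with illustrative neighborhood names; the helpers `fit_category_codes` and `apply_category_codes` and the `-1` code for unseen categories are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

def fit_category_codes(series):
    """Build a category -> integer code mapping from the training column."""
    categories = series.unique()
    return {category: code for code, category in enumerate(categories)}

def apply_category_codes(series, mapping, unknown_code=-1):
    """Encode a column with an existing mapping; unseen categories get unknown_code."""
    return series.map(mapping).fillna(unknown_code).astype(int)

# Hypothetical toy data; the neighborhood names are only illustrative.
toy_train = pd.DataFrame({"place_l3": ["Palermo", "Belgrano", "Palermo", "Caballito"]})
toy_test = pd.DataFrame({"place_l3": ["Belgrano", "Palermo", "Recoleta"]})

# The mapping is learned from train only and then applied to both sets.
mapping = fit_category_codes(toy_train["place_l3"])
toy_train["place_l3"] = apply_category_codes(toy_train["place_l3"], mapping)
toy_test["place_l3"] = apply_category_codes(toy_test["place_l3"], mapping)

print(mapping)
print(toy_test["place_l3"].tolist())
```

With a shared mapping, a given code refers to the same neighborhood in both sets, which matters for distance-based models such as KNN.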

We normalize the variables using Min-Max scaling

In [12]:
# Only the non-encoded quantitative variables are normalized
columnas_con_numeros = ['property_rooms', 'property_surface_total', 'property_price']
scaler = MinMaxScaler()
ds_reg_train[columnas_con_numeros] = scaler.fit_transform(ds_reg_train[columnas_con_numeros])
ds_reg_test[columnas_con_numeros] = scaler.transform(ds_reg_test[columnas_con_numeros])
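
Since `property_price` is scaled together with the features, predictions and errors come out in the scaled range rather than in price units. A minimal sketch of undoing the scaling for a single column follows, assuming a fitted `MinMaxScaler` with the default `feature_range=(0, 1)`; the toy values and the helper name `unscale_column` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

cols = ["property_rooms", "property_surface_total", "property_price"]

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "property_rooms": [1, 2, 4],
    "property_surface_total": [30.0, 60.0, 120.0],
    "property_price": [50000.0, 100000.0, 250000.0],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy[cols])

def unscale_column(values, scaler, column_index):
    """Undo Min-Max scaling for one column of a fitted MinMaxScaler (default feature_range)."""
    col_min = scaler.data_min_[column_index]
    col_range = scaler.data_range_[column_index]
    return np.asarray(values) * col_range + col_min

price_idx = cols.index("property_price")
recovered = unscale_column(scaled[:, price_idx], scaler, price_idx)
print(recovered)
```

Under the same assumption, an RMSE computed on scaled prices can be converted to price units by multiplying it by `scaler.data_range_[price_idx]`, because the shift by the minimum cancels out in the error.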

In [13]:
ds_reg_train.head()

| | place_l3 | property_type | property_rooms | property_surface_total | property_price |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.333333 | 0.0736 | 0.012383 |
| 1 | 1 | 0 | 0.111111 | 0.0164 | 0.006075 |
| 2 | 2 | 0 | 0.333333 | 0.074 | 0.010724 |
| 3 | 2 | 0 | 0.0 | 0.014 | 0.003738 |
| 4 | 3 | 1 | 0.555556 | 0.078 | 0.011215 |


In [14]:
ds_reg_test.head()

| | place_l3 | property_type | property_rooms | property_surface_total | property_price |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.111111 | 0.018 | 0.003738 |
| 1 | 1 | 1 | 0.333333 | 0.0744 | 0.01215 |
| 2 | 2 | 0 | 0.111111 | 0.0212 | 0.006028 |
| 3 | 3 | 0 | 0.333333 | 0.0276 | 0.001752 |
| 4 | 4 | 0 | 0.111111 | 0.0132 | 0.003972 |


### Model 1: KNN


#### Construction

In [27]:
# Split the sets into x and y: x holds the features used for training and y is the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

We search for the best KNN hyperparameters with respect to the R2 score, which is the metric passed to the random search below.

In [28]:
kfoldcv = KFold(n_splits=10)
scorer_fn = make_scorer(r2_score)
n = 10

# Parameter grid
params_grid={ 'n_neighbors':range(5,25),
              'weights':['distance','uniform'],
              'algorithm':['ball_tree', 'kd_tree', 'brute'],
              'metric':['euclidean','manhattan','chebyshev']
             }

# KNN regressor
knn_model = KNeighborsRegressor()

# Random search with 10 folds and 10 iterations
rs = RandomizedSearchCV(knn_model, params_grid, scoring=scorer_fn, cv=kfoldcv, n_iter=n, random_state=RAND_SEED)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best mean cross-validated R2 score
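
The search above ranks candidates by R2. If the goal were to select by Root Mean Squared Error instead, scikit-learn's built-in `"neg_root_mean_squared_error"` scoring string could be used; it is negated because the search always maximizes the score. The following is a minimal sketch on synthetic data; the toy arrays, the reduced grid, and the shuffled `KFold` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic regression data.
rng = np.random.default_rng(42)
X_toy = rng.uniform(size=(60, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.05, size=60)

search = RandomizedSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": range(2, 10), "weights": ["distance", "uniform"]},
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    n_iter=5,
    random_state=42,
)
search.fit(X_toy, y_toy)

# Flip the sign to read the best score as a plain RMSE.
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```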

The best parameters chosen by cross-validation:

In [26]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'weights': 'distance',
 'n_neighbors': 16,
 'metric': 'manhattan',
 'algorithm': 'ball_tree'}
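
The same information is exposed directly by the fitted search object: `best_params_` holds the winning combination, and `best_estimator_` is that model already refit on the full training data (with the default `refit=True`). A minimal sketch on synthetic data, with toy arrays and a reduced grid assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(40, 2))
y_toy = X_toy.sum(axis=1)

search = RandomizedSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": range(2, 8)},
    n_iter=4,
    cv=4,
    random_state=0,
).fit(X_toy, y_toy)

# Manual lookup, as done in the cell above.
manual = search.cv_results_["params"][np.argmax(search.cv_results_["mean_test_score"])]

# Built-in equivalents.
print(manual == search.best_params_)
print(search.best_estimator_)
```

Using `best_estimator_` would avoid rebuilding the model by hand from the chosen parameters.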

##### Building the best model

In [31]:
# Best KNN regressor
best_knn_model = KNeighborsRegressor(
    n_neighbors = params_elegidos['n_neighbors'],
    weights = params_elegidos['weights'],
    algorithm = params_elegidos['algorithm'],
    metric = params_elegidos['metric'],
)

best_knn_model.fit(x_train, y_train)
y_train_pred_best_knn_model = best_knn_model.predict(x_train)
y_test_pred_best_knn_model = best_knn_model.predict(x_test)

**Performance on the test set**

In [22]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_knn_model,
                            'Error': y_test - y_test_pred_best_knn_model})
# View
performance.head()

| | Valor Real | Prediccion | Error |
|---|---|---|---|
| 0 | 0.003738 | 0.00594 | -0.002202 |
| 1 | 0.01215 | 0.016513 | -0.004363 |
| 2 | 0.006028 | 0.006475 | -0.000447 |
| 3 | 0.001752 | 0.006026 | -0.004274 |
| 4 | 0.003972 | 0.002934 | 0.001038 |


* MSE

In [32]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_knn_model)

0.00013219540210612606

* RMSE

In [33]:
# Root Mean Squared Error
np.sqrt(mean_squared_error(y_test, y_test_pred_best_knn_model))

0.0114976259334754

* R2 score

In [38]:
r2_score(y_test, y_test_pred_best_knn_model)

0.3767081268054536

**Performance on the same data the model was trained on**

In [40]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_knn_model,
                            'Error': y_train - y_train_pred_best_knn_model})
# View
performance.head()

| | Valor Real | Prediccion | Error |
|---|---|---|---|
| 0 | 0.012383 | 0.012383 | 0.0 |
| 1 | 0.006075 | 0.004972 | 0.001103 |
| 2 | 0.010724 | 0.015088 | -0.004363 |
| 3 | 0.003738 | 0.00364 | 9.8e-05 |
| 4 | 0.011215 | 0.011215 | 0.0 |


* MSE

In [41]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_knn_model)

2.2803319376006244e-05

* RMSE

In [42]:
# Root Mean Squared Error
np.sqrt(mean_squared_error(y_train, y_train_pred_best_knn_model))

0.0047752821252786985

* R2 score

In [43]:
r2_score(y_train, y_train_pred_best_knn_model)

0.8911565166803185
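
The train R2 (about 0.89) is far above the test R2 (about 0.38), which suggests the model fits the training data much better than it generalizes. Part of that gap is probably explained by `weights='distance'`: a training point at distance zero from itself dominates its own prediction, which is consistent with the 0.0 errors in the table above. To compare both sets side by side, the three metrics can be gathered in one helper. This is a minimal sketch; the name `regression_report` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MSE, RMSE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),  # RMSE computed as the square root of the MSE
        "r2": float(r2_score(y_true, y_pred)),
    }

# Hypothetical toy values, only to show the call.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
report = regression_report(y_true, y_pred)
print(report)
```

In the notebook, the helper would be called once with `(y_train, y_train_pred_best_knn_model)` and once with `(y_test, y_test_pred_best_knn_model)`.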

### Model 2: XGBoost

#### Construction

In [None]:
# Split the sets into x and y: x holds the features used for training and y is the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

In [None]:
kfoldcv = KFold(n_splits=5)
scorer_fn = make_scorer(r2_score)
n = 10

# Parameter grid (reg_lambda and reg_alpha are the XGBRegressor names for the L2 and L1 regularization terms)
params_grid = {
    'learning_rate': np.linspace(0.05, 0.5, 50),
    'gamma': [0,1,2],
    'max_depth': list(range(2,10)),
    'subsample': np.linspace(0.5, 1, 20),
    'reg_lambda': [0,1,2],
    'reg_alpha' : [0,1,2],
    'n_estimators': list(range(10,150,10))
    }

# XGBoost regressor
xgbr_model = XGBRegressor(random_state=RAND_SEED)

# Random search with 5 folds and 10 iterations
rs = RandomizedSearchCV(xgbr_model, params_grid, scoring=scorer_fn, cv=kfoldcv, n_iter=n, random_state=RAND_SEED)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best mean cross-validated R2 score

The best parameters:

In [None]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

##### Building the best model

In [46]:
# Best XGBoost regressor
best_xgbr_model = XGBRegressor(
    n_estimators = params_elegidos['n_estimators'],
    learning_rate = params_elegidos['learning_rate'],
    gamma = params_elegidos['gamma'],
    max_depth = params_elegidos['max_depth'],
    subsample = params_elegidos['subsample'],
    reg_lambda = params_elegidos['reg_lambda'],
    reg_alpha = params_elegidos['reg_alpha'],
    random_state = RAND_SEED,  # same seed as in the search, for reproducible subsampling
)

best_xgbr_model.fit(x_train, y_train)
y_train_pred_best_xgbr_model = best_xgbr_model.predict(x_train)
y_test_pred_best_xgbr_model = best_xgbr_model.predict(x_test)

Performance on the test data

In [None]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_xgbr_model,
                            'Error': y_test - y_test_pred_best_xgbr_model})
# View
performance.head()

* MSE

In [None]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_xgbr_model)

* RMSE

In [None]:
# Root Mean Squared Error
np.sqrt(mean_squared_error(y_test, y_test_pred_best_xgbr_model))

* R2 score

In [None]:
r2_score(y_test, y_test_pred_best_xgbr_model)

Performance on the training data

In [None]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_xgbr_model,
                            'Error': y_train - y_train_pred_best_xgbr_model})
# View
performance.head()

* MSE

In [None]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_xgbr_model)

* RMSE

In [None]:
# Root Mean Squared Error
np.sqrt(mean_squared_error(y_train, y_train_pred_best_xgbr_model))

* R2 score

In [None]:
r2_score(y_train, y_train_pred_best_xgbr_model)

### Model 3: our choice <font color="#e02626">(Pending)</font>

#### Construction

##### Cross-validation

##### Building the best model

## Saving the models

In [None]:
if using_drive:
  joblib.dump(best_knn_model, drive_path + '/Models/best_knn_model.pkl')
  joblib.dump(best_xgbr_model, drive_path + '/Models/best_xgbr_model.pkl')
else:
  joblib.dump(best_knn_model, './models/best_knn_model.pkl')
  joblib.dump(best_xgbr_model, './models/best_xgbr_model.pkl')
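
A saved model can later be restored with `joblib.load` and used for prediction without retraining. The following is a minimal round-trip sketch, assuming a toy KNN model and a temporary directory in place of the Drive or local `models` folder; the file name `toy_knn_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy model standing in for best_knn_model.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0])
model = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_knn_model.pkl")
    joblib.dump(model, model_path)      # save to disk
    restored = joblib.load(model_path)  # load it back

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Note that `joblib.dump` expects the target directory to exist; calling `os.makedirs(path, exist_ok=True)` beforehand is one way to make sure of that.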