<a href="https://colab.research.google.com/github/kikiymini/7506R-1C2024-GRUPO02/blob/main/7506R_TP1_GRUPO02_ENTREGA_N4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Organización de Datos</center>

#### <center>Course chair: Eng. Rodriguez, Juan Manuel</center>

## <center>Practical Assignment 1: Properties for Sale</center>

### <center>Group 2</center>

## Members:

*   Aramayo Carolina
*   Utrera Maximo Damian
*   Villalba Ana Daniela
*   Fiorilo Roy


<font color='#fa5050'>Important: notebook number 1 must be run before this one; otherwise the train and test datasets will not exist.</font>

# Library imports

(Important: read the comments)

In [27]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns
import folium

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import sklearn
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

import joblib

import warnings
warnings.filterwarnings("ignore")

RAND_SEED = 42

# Reading the files

### From Google Drive

In [28]:
from google.colab import drive
drive.mount('/content/drive')
drive_path = "/content/drive/MyDrive/7506R-1C2024-GRUPO02"

train_file = drive_path + '/Dataset/ds_train.csv'
test_file = drive_path + '/Dataset/ds_test.csv'

ds_train = pd.read_csv(train_file)
ds_test = pd.read_csv(test_file)
using_drive = True

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### From a local machine

In [29]:
# If you are working locally, uncomment this cell and comment out the one above
# train_file = './dataset/ds_train.csv'
# test_file = './dataset/ds_test.csv'

# ds_train = pd.read_csv(train_file)
# ds_test = pd.read_csv(test_file)
# using_drive = False

# Regression

### Feature engineering

In [30]:
# Create copies of the train and test datasets so we can modify them freely
ds_reg_train = ds_train.copy()
ds_reg_test = ds_test.copy()

Current columns of the datasets:

In [31]:
ds_reg_train.dtypes

id                           object
start_date                   object
end_date                     object
latitud                     float64
longitud                    float64
place_l3                     object
property_type                object
property_rooms                int64
property_bedrooms             int64
property_surface_total      float64
property_surface_covered    float64
property_price              float64
dtype: object

In [32]:
ds_reg_test.dtypes

id                           object
start_date                   object
end_date                     object
latitud                     float64
longitud                    float64
place_l3                     object
property_type                object
property_rooms                int64
property_bedrooms             int64
property_surface_total      float64
property_surface_covered    float64
property_price              float64
dtype: object

Columns considered irrelevant:
* The property ID
* The start and end dates of the listing
* Latitude and longitude (the neighborhood appears to be a better and more reliable indicator of location)
* Surface covered (it is highly correlated with surface total)
* Property bedrooms (it is highly correlated with property rooms)
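
The correlation argument used to drop `property_surface_covered` and `property_bedrooms` can be checked with a Pearson correlation matrix. The following is a minimal sketch on synthetic data; the `toy` frame and its values are hypothetical and only reuse the column names from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (hypothetical values).
rng = np.random.default_rng(0)
total = rng.uniform(30, 300, size=200)
toy = pd.DataFrame({
    "property_surface_total": total,
    # Covered surface built as roughly 80% of the total surface plus noise.
    "property_surface_covered": 0.8 * total + rng.normal(0, 5, size=200),
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(method="pearson")
surface_corr = corr.loc["property_surface_total", "property_surface_covered"]
print(round(surface_corr, 3))
```

In the notebook, the same call on `ds_train` (restricted to its numeric columns) would show whether the two pairs of columns are as redundant as assumed.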

In [33]:
# Drop those columns
cols_a_eliminar = ["id", "start_date", "end_date", "latitud", "longitud", "property_surface_covered", "property_bedrooms"]
ds_reg_train.drop(columns=cols_a_eliminar, inplace=True)
ds_reg_test.drop(columns=cols_a_eliminar, inplace=True)
ds_reg_train.dtypes

place_l3                   object
property_type              object
property_rooms              int64
property_surface_total    float64
property_price            float64
dtype: object

Encoding the categorical variables

In [34]:
def encode_non_numerical_vars(ds, col):
  """returns encoding dict and its inverse dict"""
  _col = ds[col].unique()

  _col_dict = dict(zip(_col, range(len(_col))))
  _col_inv_dict = dict(zip(range(len(_col)), _col))

  ds[col] = ds[col].map(_col_dict)
  return _col_dict, _col_inv_dict

In [35]:
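# Map property_type to a numeric representation.
# Note: the test set is encoded with its own mapping (built from its own order of appearance),
# so a given code is not guaranteed to denote the same category in train and test.
property_type_dict, property_type_inv_dict = encode_non_numerical_vars(ds_reg_train, "property_type")
_ = encode_non_numerical_vars(ds_reg_test, "property_type")

In [36]:
# Map neighborhoods to a numeric representation (same caveat as above for the test set)
place_l3_dict, place_l3_inv_dict = encode_non_numerical_vars(ds_reg_train, "place_l3")
_ = encode_non_numerical_vars(ds_reg_test, "place_l3")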
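
Because `encode_non_numerical_vars` builds its dictionary from whichever dataset it receives, train and test can end up with different codes for the same category. One possible alternative, sketched below, fits the mapping on the training column only and reuses it on test. The helper names `fit_category_codes` and `apply_category_codes`, the `-1` code for unseen categories, and the toy neighborhoods are all assumptions for illustration; this is not what the notebook does.

```python
import pandas as pd

def fit_category_codes(train_col):
    """Build a category -> integer mapping from the training column only."""
    categories = pd.unique(train_col)
    return {cat: code for code, cat in enumerate(categories)}

def apply_category_codes(col, mapping, unknown=-1):
    """Encode a column with an existing mapping; unseen categories get `unknown`."""
    return col.map(mapping).fillna(unknown).astype(int)

# Toy columns (hypothetical values).
toy_train = pd.Series(["Palermo", "Belgrano", "Palermo", "Caballito"])
toy_test = pd.Series(["Belgrano", "Recoleta", "Palermo"])

codes = fit_category_codes(toy_train)
train_encoded = apply_category_codes(toy_train, codes)
test_encoded = apply_category_codes(toy_test, codes)

print(codes)
print(test_encoded.tolist())
```

With a shared mapping, a code seen at prediction time refers to the same neighborhood the model saw during training, and categories absent from train are flagged explicitly instead of silently colliding with another code.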

We normalize the variables using Min-Max scaling

In [37]:
# Only the quantitative, non-encoded variables are scaled (this includes the target, property_price)
columnas_con_numeros = ['property_rooms', 'property_surface_total', 'property_price']
scaler = MinMaxScaler()
ds_reg_train[columnas_con_numeros] = scaler.fit_transform(ds_reg_train[columnas_con_numeros])
ds_reg_test[columnas_con_numeros] = scaler.transform(ds_reg_test[columnas_con_numeros])
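
Since `property_price` is scaled together with the features, every prediction and error reported later is expressed in Min-Max units rather than in currency. The sketch below shows one way to map scaled values back using the fitted scaler's `data_min_` and `data_range_` attributes. The toy matrix, `price_idx`, and the helper `unscale_price` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy columns in the same order as columnas_con_numeros:
# rooms, surface_total, price (hypothetical values).
toy = np.array([
    [1.0, 40.0, 80_000.0],
    [3.0, 75.0, 150_000.0],
    [5.0, 120.0, 300_000.0],
])
toy_scaler = MinMaxScaler().fit(toy)

price_idx = 2  # position of the price column in the fitted array
price_min = toy_scaler.data_min_[price_idx]
price_range = toy_scaler.data_range_[price_idx]

def unscale_price(scaled):
    """Invert Min-Max scaling for the price column only."""
    return np.asarray(scaled) * price_range + price_min

scaled_prices = toy_scaler.transform(toy)[:, price_idx]
restored = unscale_price(scaled_prices)

# An error measured in scaled units has no offset, so it converts
# to price units by multiplying by the range alone.
rmse_scaled = 0.01
rmse_in_price_units = rmse_scaled * price_range
print(restored, rmse_in_price_units)
```

The same two attributes of the notebook's `scaler` would allow reporting RMSE in the original price units, which is usually easier to interpret.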

In [38]:
ds_reg_train.head()

Unnamed: 0,place_l3,property_type,property_rooms,property_surface_total,property_price
0,0,0,0.333333,0.0736,0.012383
1,1,0,0.111111,0.0164,0.006075
2,2,0,0.333333,0.074,0.010724
3,2,0,0.0,0.014,0.003738
4,3,1,0.555556,0.078,0.011215


In [39]:
ds_reg_test.head()

Unnamed: 0,place_l3,property_type,property_rooms,property_surface_total,property_price
0,0,0,0.111111,0.018,0.003738
1,1,1,0.333333,0.0744,0.01215
2,2,0,0.111111,0.0212,0.006028
3,3,0,0.333333,0.0276,0.001752
4,4,0,0.111111,0.0132,0.003972


### Model 1: KNN


#### Construction

In [40]:
# Split the sets into x and y: x holds the features used for training and y holds the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

We search for the best KNN hyperparameters with respect to the R2 score, which is the scorer passed to the random search.

In [41]:
kfoldcv = KFold(n_splits=10)
scorer_fn = make_scorer(r2_score)
n = 10

# Parameter grid
params_grid={ 'n_neighbors':range(5,25),
              'weights':['distance','uniform'],
              'algorithm':['ball_tree', 'kd_tree', 'brute'],
              'metric':['euclidean','manhattan','chebyshev']
             }

# KNN regressor
knn_model = KNeighborsRegressor()

# Random search with 10 folds and 10 iterations
rs = RandomizedSearchCV(knn_model, params_grid, scoring=scorer_fn, cv=kfoldcv, n_iter=n, random_state=RAND_SEED)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best mean cross-validated R2

0.78210729761634

Best parameters chosen by the cross-validation:

In [42]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'weights': 'distance',
 'n_neighbors': 21,
 'metric': 'euclidean',
 'algorithm': 'brute'}
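
The `argmax` lookup over `cv_results_` used above selects the same candidate that a fitted `RandomizedSearchCV` exposes through `best_params_` (with the default `refit=True`). A small sketch on synthetic data illustrates the equivalence; `X_toy`, `y_toy`, and the reduced grid are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(120, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.05, size=120)

search = RandomizedSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": list(range(3, 15)), "weights": ["distance", "uniform"]},
    scoring="r2",
    cv=KFold(n_splits=5),
    n_iter=5,
    random_state=42,
)
search.fit(X_toy, y_toy)

# Manual lookup, as done in the notebook.
mean_scores = search.cv_results_["mean_test_score"]
manual_best = search.cv_results_["params"][np.argmax(mean_scores)]

# Built-in shortcut.
print(search.best_params_ == manual_best)
```

`search.best_estimator_` is also already refitted on the full training data with those parameters, so it could be used directly instead of constructing the model again by hand.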

##### Building the best model

In [43]:
# Best KNN regressor
best_knn_model = KNeighborsRegressor(
    n_neighbors = params_elegidos['n_neighbors'],
    weights = params_elegidos['weights'],
    algorithm = params_elegidos['algorithm'],
    metric = params_elegidos['metric'],
)

best_knn_model.fit(x_train, y_train)
y_train_pred_best_knn_model = best_knn_model.predict(x_train)
y_test_pred_best_knn_model = best_knn_model.predict(x_test)

**Performance on the test data**

In [44]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_knn_model,
                            'Error': y_test - y_test_pred_best_knn_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.003738,0.005928,-0.00219
1,0.01215,0.016377,-0.004228
2,0.006028,0.006258,-0.00023
3,0.001752,0.006019,-0.004267
4,0.003972,0.002922,0.00105


* MSE

In [45]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_knn_model, squared=True)

0.0001315753070488609

* RMSE

In [46]:
# Root Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_knn_model, squared=False)

0.011470628014579713

* R2 score

In [47]:
r2_score(y_test, y_test_pred_best_knn_model)

0.37963183068352946
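
For reference, the three metrics reported for every model can be computed directly from their definitions. The sketch below uses only NumPy; the function name `regression_metrics` and the toy vectors are hypothetical. Computing RMSE as the square root of MSE also avoids relying on the `squared` argument of `mean_squared_error`, which newer scikit-learn releases deprecate in favour of `root_mean_squared_error`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)        # mean of squared residuals
    rmse = np.sqrt(mse)                  # same units as the target
    ss_res = np.sum(residuals ** 2)      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot           # share of variance explained

    return {"MSE": mse, "RMSE": rmse, "R2": r2}

# Toy check (hypothetical values).
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Because RMSE is just the square root of MSE, both metrics always rank the models in the same order; R2 adds the comparison against a baseline that always predicts the mean.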

**Performance on the same data the model was trained on**

In [48]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_knn_model,
                            'Error': y_train - y_train_pred_best_knn_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.012383,0.012383,0.0
1,0.006075,0.005411,0.000664
2,0.010724,0.015088,-0.004363
3,0.003738,0.003628,0.00011
4,0.011215,0.011215,0.0


* MSE

In [49]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_knn_model, squared=True)

2.2743972777593e-05

* RMSE

In [50]:
# Root Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_knn_model, squared=False)

0.00476906414064573

* R2 score

In [51]:
r2_score(y_train, y_train_pred_best_knn_model)

0.8914397864266198

### Modelo 2: XGBoost

#### Construction

In [52]:
# Split the sets into x and y: x holds the features used for training and y holds the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

In [53]:
kfoldcv = KFold(n_splits=5)
scorer_fn = make_scorer(r2_score)
n = 10

# Parameter grid
params_grid = {
    'learning_rate': np.linspace(0.05, 0.5, 50),
    'gamma': [0,1,2],
    'max_depth': list(range(2,10)),
    'subsample': np.linspace(0.5, 1, 20),
    'lambda': [0,1,2],
    'alpha' : [0,1,2],
    'n_estimators': list(range(10,150,10))
    }

# XGBoost regressor
xgb_model = XGBRegressor(random_state=RAND_SEED)

# Random search with 5 folds and 10 iterations
rs = RandomizedSearchCV(xgb_model, params_grid, scoring=scorer_fn, cv=kfoldcv, n_iter=n, random_state=RAND_SEED)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best mean cross-validated R2

0.7263209515306797

Best parameters found:

In [54]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'subsample': 0.7105263157894737,
 'n_estimators': 120,
 'max_depth': 3,
 'learning_rate': 0.5,
 'lambda': 0,
 'gamma': 0,
 'alpha': 0}

##### Building the best model

In [55]:
# Best XGBoost regressor
best_xgbr_model = XGBRegressor(
    n_estimators = params_elegidos['n_estimators'],
    learning_rate = params_elegidos['learning_rate'],
    gamma = params_elegidos['gamma'],
    max_depth = params_elegidos['max_depth'],
    subsample = params_elegidos['subsample'],
    reg_lambda = params_elegidos['lambda'],
    reg_alpha = params_elegidos['alpha'],
)

best_xgbr_model.fit(x_train, y_train)
y_train_pred_best_xgbr_model = best_xgbr_model.predict(x_train)
y_test_pred_best_xgbr_model = best_xgbr_model.predict(x_test)

Performance on the test data

In [56]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_xgbr_model,
                            'Error': y_test - y_test_pred_best_xgbr_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.003738,0.004711,-0.000972
1,0.01215,0.016982,-0.004833
2,0.006028,0.005695,0.000333
3,0.001752,0.007461,-0.005709
4,0.003972,0.003392,0.00058


* MSE

In [57]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_xgbr_model, squared=True)

8.873311150515407e-05

* RMSE

In [58]:
# Root Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_xgbr_model, squared=False)

0.009419825449824113

* R2 score

In [59]:
r2_score(y_test, y_test_pred_best_xgbr_model)

0.5816297208277482

Performance on the training data

In [60]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_xgbr_model,
                            'Error': y_train - y_train_pred_best_xgbr_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.012383,0.01372,-0.001337
1,0.006075,0.00588,0.000195
2,0.010724,0.022218,-0.011493
3,0.003738,0.003623,0.000115
4,0.011215,0.012561,-0.001346


* MSE

In [61]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_xgbr_model, squared=True)

5.227970868486065e-05

* RMSE

In [62]:
# Root Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_xgbr_model, squared=False)

0.007230470848074879

* R2 score

In [63]:
r2_score(y_train, y_train_pred_best_xgbr_model)

0.7504615224489726

### Model 3: Decision Tree


#### Construction

In [64]:
# Split the sets into x and y: x holds the features used for training and y holds the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

In [65]:
from sklearn.tree import DecisionTreeRegressor

n = 10

# Set of parameters to explore
params_grid = { "criterion" : ["squared_error", "friedman_mse", "absolute_error"],
               "min_samples_leaf" : [5, 10],
               "min_samples_split" : [2, 4, 10, 12, 16],
               "splitter": ['random','best'] }

# Decision tree regressor
tree = DecisionTreeRegressor(random_state = n)

# Random search with 10 folds and 10 iterations, scored by negative RMSE
rs = RandomizedSearchCV(tree, params_grid, scoring='neg_root_mean_squared_error', cv=n, n_iter=n, random_state=n)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best score (negative RMSE, so closer to 0 is better)

-0.00724910126605482

Best parameters found:

In [66]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'splitter': 'best',
 'min_samples_split': 4,
 'min_samples_leaf': 5,
 'criterion': 'squared_error'}

##### Building the best model

In [67]:
# Best decision tree regressor
best_dtr_model = DecisionTreeRegressor(
    splitter = params_elegidos['splitter'],
    min_samples_split = params_elegidos['min_samples_split'],
    min_samples_leaf = params_elegidos['min_samples_leaf'],
    criterion = params_elegidos['criterion'],

)

best_dtr_model.fit(x_train, y_train)
y_train_pred_best_dtr_model = best_dtr_model.predict(x_train)
y_test_pred_best_dtr_model = best_dtr_model.predict(x_test)

Performance on the test data

In [68]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_dtr_model,
                            'Error': y_test - y_test_pred_best_dtr_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.003738,0.005928,-0.00219
1,0.01215,0.015087,-0.002937
2,0.006028,0.005866,0.000162
3,0.001752,0.006368,-0.004615
4,0.003972,0.002284,0.001688


* MSE

In [69]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_dtr_model, squared=True)

9.027193895245698e-05

* RMSE

In [70]:
# Root Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_dtr_model, squared=False)

0.00950115461154364

* R2 score

In [71]:
r2_score(y_test, y_test_pred_best_dtr_model)

0.574374259390575

Performance on the training data

In [72]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_dtr_model,
                            'Error': y_train - y_train_pred_best_dtr_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.012383,0.014439,-0.002056
1,0.006075,0.005448,0.000627
2,0.010724,0.018229,-0.007504
3,0.003738,0.003668,7e-05
4,0.011215,0.013868,-0.002653


* MSE

In [73]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_dtr_model, squared=True)

3.943778225825102e-05

* RMSE

In [74]:
# Root Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_dtr_model, squared=False)

0.006279950816547135

* R2

In [75]:
r2_score(y_train, y_train_pred_best_dtr_model)

0.8117578618879576

### Metrics summary

All values are computed on the Min-Max scaled price and are taken from the cell outputs above.

| TRAIN     | KNN                 | XGB                  | TREE                  |
|-----------|---------------------|----------------------|-----------------------|
| MSE       | 2.2743972777593e-05 | 5.227970868486065e-05| 3.943778225825102e-05 |
| RMSE      | 0.00476906414064573 | 0.007230470848074879 | 0.006279950816547135  |
| R2        | 0.8914397864266198  | 0.7504615224489726   | 0.8117578618879576    |

| TEST      | KNN                  | XGB                  | TREE                  |
|-----------|----------------------|----------------------|-----------------------|
| MSE       | 0.0001315753070488609| 8.873311150515407e-05| 9.027193895245698e-05 |
| RMSE      | 0.011470628014579713 | 0.009419825449824113 | 0.00950115461154364   |
| R2        | 0.37963183068352946  | 0.5816297208277482   | 0.574374259390575     |


Based on these metrics we can see:
1. MSE:
  - On the training set, KNN has the lowest MSE (about 2.3e-05), followed by the Decision Tree (about 3.9e-05) and XGBoost (about 5.2e-05).
  - On the test set the ranking changes: XGBoost (about 8.9e-05) and the Decision Tree (about 9.0e-05) are practically tied, while KNN is clearly worse (about 1.3e-04).
2. RMSE:
  - Since RMSE is the square root of MSE, the ordering is the same. On training, KNN is lowest (0.0048), followed by the Decision Tree (0.0063) and XGBoost (0.0072).
  - On the test set, XGBoost (0.0094) and the Decision Tree (0.0095) are again almost identical, and KNN is the highest (0.0115).
3. R2:
  - On the training set, KNN has the highest R2 (0.89), followed by the Decision Tree (0.81) and XGBoost (0.75).
  - On the test set, XGBoost (0.58) and the Decision Tree (0.57) explain a similar share of the variability, while KNN drops to 0.38. The gap between 0.89 on training and 0.38 on test suggests that KNN is overfitting the training data.

  In summary, KNN fits the training set best but generalizes worst. The Decision Tree and XGBoost perform comparably on the test set, with XGBoost marginally ahead on all three metrics.

 **Which model would you choose to predict the sale price of the properties?**
 Based on the analysis above, we choose the DECISION TREE to predict property prices: its test metrics are practically tied with those of XGBoost (R2 of 0.574 versus 0.582) and clearly better than those of KNN.
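
A summary table like the one above is easy to get wrong when values are copied by hand, so it may be safer to build it from the prediction vectors. The following is a minimal sketch; the helper `summarize` and the toy predictions are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize(y_true, predictions):
    """Build a metrics table with one column per model.

    `predictions` maps a model label to its predicted values.
    """
    y_true = np.asarray(y_true, dtype=float)
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    rows = {}
    for name, y_pred in predictions.items():
        residuals = y_true - np.asarray(y_pred, dtype=float)
        mse = float(np.mean(residuals ** 2))
        r2 = 1.0 - float(np.sum(residuals ** 2)) / ss_tot
        rows[name] = {"MSE": mse, "RMSE": mse ** 0.5, "R2": r2}
    # Outer keys become columns, inner keys become the index.
    return pd.DataFrame(rows)

# Toy example (hypothetical values).
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_summary = summarize(toy_true, {
    "KNN": [1.5, 2.0, 2.5, 4.0],
    "XGB": [1.0, 2.0, 3.0, 4.0],
})
print(toy_summary)
```

In the notebook this could be called as `summarize(y_test, {"KNN": y_test_pred_best_knn_model, "XGB": y_test_pred_best_xgbr_model, "TREE": y_test_pred_best_dtr_model})`, and likewise for the training predictions.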

## Saving the models

In [76]:
if using_drive:
  joblib.dump(best_knn_model, drive_path + '/Models/best_knn_model.pkl')
  joblib.dump(best_xgbr_model, drive_path + '/Models/best_xgbr_model.pkl')
  joblib.dump(best_dtr_model, drive_path + '/Models/best_dtr_model.pkl')
else:
  joblib.dump(best_knn_model, './models/best_knn_model.pkl')
  joblib.dump(best_xgbr_model, './models/best_xgbr_model.pkl')
  joblib.dump(best_dtr_model, './models/best_dtr_model.pkl')
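
A saved model can be restored later with `joblib.load`. The sketch below round-trips a small model through a temporary directory and checks that the reloaded copy predicts the same values. The toy data, the `toy_model.pkl` file name, and the `os.makedirs` call (which assumes the target folder may not exist yet) are illustrative choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small stand-in model (hypothetical data).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 4.0, 9.0])
toy_model = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the target folder first; joblib.dump does not create it.
    models_dir = os.path.join(tmp_dir, "models")
    os.makedirs(models_dir, exist_ok=True)

    model_path = os.path.join(models_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.array_equal(
    toy_model.predict(X_toy), reloaded.predict(X_toy)
)
print(same_predictions)
```

Pickled models are generally only safe to load with the same library versions that produced them, so it helps to record the scikit-learn and xgboost versions alongside the files.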