<a href="https://colab.research.google.com/github/kikiymini/7506R-1C2024-GRUPO02/blob/main/7506R_TP1_GRUPO02_ENTREGA_N4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Organizaci&oacute;n de Datos</center>

#### <center>C&aacute;tedra Ing. Rodriguez, Juan Manuel </center>

## <center>Practical Assignment 1: Properties for Sale</center>

### <center> Group 2</center>

## Members:

*   Aramayo Carolina
*   Utrera Maximo Damian
*   Villalba Ana Daniela
*   Fiorilo Roy


<font color='#fa5050'>Important: notebook number 1 must be run before this one; otherwise the train and test datasets will not exist.</font>

# Importing libraries

(Important: read the comments)

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns
import folium

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import sklearn
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

import joblib

import warnings
warnings.filterwarnings("ignore")

RAND_SEED = 42

# Reading the files

### From Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')
drive_path = "/content/drive/MyDrive/7506R-1C2024-GRUPO02"

train_file = drive_path + '/Dataset/ds_train.csv'
test_file = drive_path + '/Dataset/ds_test.csv'

ds_train = pd.read_csv(train_file)
ds_test = pd.read_csv(test_file)
using_drive = True

Mounted at /content/drive


### From a local machine

In [None]:
# If working locally, uncomment this cell and comment out the one above
# train_file = './dataset/ds_train.csv'
# test_file = './dataset/ds_test.csv'

# ds_train = pd.read_csv(train_file)
# ds_test = pd.read_csv(test_file)
# using_drive = False

# Regression

### Feature engineering

In [4]:
# Work on copies of the train and test datasets so they can be modified freely
ds_reg_train = ds_train.copy()
ds_reg_test = ds_test.copy()

The current columns of the dataset:

In [5]:
ds_reg_train.dtypes

id                           object
start_date                   object
end_date                     object
latitud                     float64
longitud                    float64
place_l3                     object
property_type                object
property_rooms                int64
property_bedrooms             int64
property_surface_total      float64
property_surface_covered    float64
property_price              float64
dtype: object

In [6]:
ds_reg_test.dtypes

id                           object
start_date                   object
end_date                     object
latitud                     float64
longitud                    float64
place_l3                     object
property_type                object
property_rooms                int64
property_bedrooms             int64
property_surface_total      float64
property_surface_covered    float64
property_price              float64
dtype: object

Columns considered irrelevant:
* The property ID
* The listing's creation, start, and end dates
* Latitude and longitude (the neighborhood appears to be a better and more reliable indicator of location)
* Surface covered (it is highly correlated with surface total)
* Property bedrooms (it is highly correlated with property rooms)

In [7]:
# Drop those columns
cols_a_eliminar = ["id", "start_date", "end_date", "latitud", "longitud", "property_surface_covered", "property_bedrooms"]
ds_reg_train.drop(columns=cols_a_eliminar, inplace=True)
ds_reg_test.drop(columns=cols_a_eliminar, inplace=True)
ds_reg_train.dtypes

place_l3                   object
property_type              object
property_rooms              int64
property_surface_total    float64
property_price            float64
dtype: object

Encoding the categorical variables

In [8]:
def encode_non_numerical_vars(ds, col):
  """returns encoding dict and its inverse dict"""
  _col = ds[col].unique()

  _col_dict = dict(zip(_col, range(len(_col))))
  _col_inv_dict = dict(zip(range(len(_col)), _col))

  ds[col] = ds[col].map(_col_dict)
  return _col_dict, _col_inv_dict

In [9]:
# Map property_type to a numeric representation
property_type_dict, property_type_inv_dict = encode_non_numerical_vars(ds_reg_train, "property_type")
# Note: this call builds a separate mapping from the test set's own values,
# so a given category is not guaranteed to get the same code as in train
_ = encode_non_numerical_vars(ds_reg_test, "property_type")

In [10]:
# Map neighborhoods to a numeric representation (same caveat for the test set)
place_l3_dict, place_l3_inv_dict = encode_non_numerical_vars(ds_reg_train, "place_l3")
_ = encode_non_numerical_vars(ds_reg_test, "place_l3")
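The calls above build one mapping per dataset. A minimal sketch of one way to keep the codes consistent: build the mapping on train only and reuse it on test, sending categories that never appeared in train to a sentinel code. The helper names `fit_encoding` and `apply_encoding`, the sentinel `-1`, and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def fit_encoding(values):
    """Build a category -> integer code mapping from the training values."""
    uniques = pd.unique(values)
    return {cat: code for code, cat in enumerate(uniques)}

def apply_encoding(series, mapping, unseen_code=-1):
    """Encode a column with an existing mapping; unseen categories get unseen_code."""
    return series.map(mapping).fillna(unseen_code).astype(int)

# Toy data (hypothetical values, only to show the behaviour)
train = pd.DataFrame({"property_type": ["Departamento", "PH", "Departamento", "Casa"]})
test = pd.DataFrame({"property_type": ["Casa", "Departamento", "Local"]})

# The mapping is learned from train and then applied unchanged to both sets
mapping = fit_encoding(train["property_type"])
train["property_type"] = apply_encoding(train["property_type"], mapping)
test["property_type"] = apply_encoding(test["property_type"], mapping)

print(mapping)
print(test["property_type"].tolist())
```

With this approach a category keeps the same integer in both sets, which matters for any model that compares feature values across train and test.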

We normalize the variables using Min-Max

In [11]:
# Only the quantitative, non-encoded variables are normalized
columnas_con_numeros = ['property_rooms', 'property_surface_total', 'property_price']
scaler = MinMaxScaler()
ds_reg_train[columnas_con_numeros] = scaler.fit_transform(ds_reg_train[columnas_con_numeros])
ds_reg_test[columnas_con_numeros] = scaler.transform(ds_reg_test[columnas_con_numeros])
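Because `property_price` is scaled together with the features, every error metric computed on it is expressed in scaled units. The sketch below shows one way a scaled RMSE could be mapped back to price units. Min-Max maps x to (x - min) / (max - min), so an RMSE is multiplied by (max - min) and an MSE by its square. The toy prices and the simulated prediction errors are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy prices (hypothetical), shaped as a single-column matrix
prices = np.array([[50_000.0], [120_000.0], [250_000.0], [90_000.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)

# data_range_ holds (max - min) for each column seen during fit
price_range = scaler.data_range_[0]

# Simulated predictions in scaled units
pred_scaled = scaled.ravel() + np.array([0.01, -0.02, 0.0, 0.03])
rmse_scaled = np.sqrt(np.mean((scaled.ravel() - pred_scaled) ** 2))

# An RMSE in scaled units times the range gives the RMSE in price units
rmse_price = rmse_scaled * price_range

# Cross-check by undoing the scaling on the predictions themselves
pred_price = scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
rmse_check = np.sqrt(np.mean((prices.ravel() - pred_price) ** 2))
print(rmse_price, rmse_check)
```

In the notebook the scaler is fitted on three columns at once, so the price range would be the entry of `data_range_` at the position of `property_price`.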

In [12]:
ds_reg_train.head()

Unnamed: 0,place_l3,property_type,property_rooms,property_surface_total,property_price
0,0,0,0.333333,0.0736,0.012383
1,1,0,0.111111,0.0164,0.006075
2,2,0,0.333333,0.074,0.010724
3,2,0,0.0,0.014,0.003738
4,3,1,0.555556,0.078,0.011215


In [13]:
ds_reg_test.head()

Unnamed: 0,place_l3,property_type,property_rooms,property_surface_total,property_price
0,0,0,0.111111,0.018,0.003738
1,1,1,0.333333,0.0744,0.01215
2,2,0,0.111111,0.0212,0.006028
3,3,0,0.333333,0.0276,0.001752
4,4,0,0.111111,0.0132,0.003972


### Model 1: KNN


#### Construction

In [None]:
# Split the sets into x and y: x holds the features used for training, y holds the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

We search for the best KNN parameters with respect to the R2 score (the scorer passed to the search is `r2_score`, not RMSE).

In [None]:
kfoldcv = KFold(n_splits=10)
scorer_fn = make_scorer(r2_score)
n = 10

# Parameter grid
params_grid={ 'n_neighbors':range(5,25),
              'weights':['distance','uniform'],
              'algorithm':['ball_tree', 'kd_tree', 'brute'],
              'metric':['euclidean','manhattan','chebyshev']
             }

# KNN regressor
knn_model = KNeighborsRegressor()

# Random search with 10 folds and 10 iterations
rs = RandomizedSearchCV(knn_model, params_grid, scoring=scorer_fn, cv=kfoldcv, n_iter=n, random_state=RAND_SEED)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best mean cross-validated score

0.78210729761634

The best parameters chosen by cross-validation:

In [None]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'weights': 'distance',
 'n_neighbors': 21,
 'metric': 'euclidean',
 'algorithm': 'brute'}
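The manual lookup in `cv_results_` above should return the same dictionary that the fitted search exposes as `best_params_`. A small sketch that checks this, using synthetic data from `make_regression` in place of `x_train` and `y_train` (an assumption) and a reduced grid to keep it quick:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for x_train / y_train (assumption)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

search = RandomizedSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": range(5, 25), "weights": ["distance", "uniform"]},
    scoring="r2",
    cv=KFold(n_splits=5),
    n_iter=5,
    random_state=42,
)
search.fit(X, y)

# Manual lookup: the candidate with the highest mean test score
manual = search.cv_results_["params"][np.argmax(search.cv_results_["mean_test_score"])]

# Built-in attribute holding the same information
print(manual == search.best_params_)
print(search.best_score_)
```

The search also keeps a model refitted on the full training data in `best_estimator_`, which could be used instead of rebuilding the regressor by hand.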

##### Building the best model

In [None]:
# Best KNN regressor
best_knn_model = KNeighborsRegressor(
    n_neighbors = params_elegidos['n_neighbors'],
    weights = params_elegidos['weights'],
    algorithm = params_elegidos['algorithm'],
    metric = params_elegidos['metric'],
)

best_knn_model.fit(x_train, y_train)
y_train_pred_best_knn_model = best_knn_model.predict(x_train)
y_test_pred_best_knn_model = best_knn_model.predict(x_test)

**How it performs on the test set**

In [None]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_knn_model,
                            'Error': y_test - y_test_pred_best_knn_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.003738,0.005928,-0.00219
1,0.01215,0.016377,-0.004228
2,0.006028,0.006258,-0.00023
3,0.001752,0.006019,-0.004267
4,0.003972,0.002922,0.00105


* MSE

In [None]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_knn_model, squared=True)

0.0001315753070488609

* RMSE

In [None]:
# Root Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_knn_model, squared=False)

0.011470628014579713
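As far as we know, the `squared` argument of `mean_squared_error` was deprecated in scikit-learn 1.4 and is rejected by later releases. A version-independent sketch is to take the square root of the default MSE. The arrays below are the five `Valor Real` and `Prediccion` rows shown in the table above, so the printed numbers describe only those rows, not the whole test set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First five rows of the performance table (scaled units)
y_true = np.array([0.003738, 0.01215, 0.006028, 0.001752, 0.003972])
y_pred = np.array([0.005928, 0.016377, 0.006258, 0.006019, 0.002922])

mse = mean_squared_error(y_true, y_pred)  # default behaviour: plain MSE
rmse = np.sqrt(mse)                       # RMSE without the `squared` flag
print(mse, rmse)
```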

* R2 score

In [None]:
r2_score(y_test, y_test_pred_best_knn_model)

0.37963183068352946

**How it performs on the same data it was trained on**

In [None]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_knn_model,
                            'Error': y_train - y_train_pred_best_knn_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.012383,0.012383,0.0
1,0.006075,0.005411,0.000664
2,0.010724,0.015088,-0.004363
3,0.003738,0.003628,0.00011
4,0.011215,0.011215,0.0


* MSE

In [None]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_knn_model, squared=True)

2.2743972777593e-05

* RMSE

In [None]:
# Root Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_knn_model, squared=False)

0.00476906414064573

* R2 score

In [None]:
r2_score(y_train, y_train_pred_best_knn_model)

0.8914397864266198
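The rows with an error of exactly 0.0 in the table above are expected with `weights='distance'`: when a query coincides with a stored training point, that zero-distance neighbor receives all the weight and the model returns its stored target. The sketch below illustrates this on synthetic data (an assumption; the features, noise level, and `kd_tree` choice are picked only for the demonstration). In the notebook the training R2 stays below 1, and one plausible reason is that several listings share identical feature values but have different prices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)

# Synthetic training data: unique points with a noisy target
X = rng.uniform(size=(100, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=100)

model = KNeighborsRegressor(n_neighbors=21, weights="distance", algorithm="kd_tree")
model.fit(X, y)

# Predicting the training points returns (almost) exactly the stored targets
max_train_error = np.max(np.abs(model.predict(X) - y))
print(max_train_error)

# Fresh points from the same distribution are not reproduced this way
X_new = rng.uniform(size=(100, 2))
y_new = X_new[:, 0] + rng.normal(scale=0.5, size=100)
train_r2 = model.score(X, y)
new_r2 = model.score(X_new, y_new)
print(train_r2, new_r2)
```

For this configuration the training metrics say little about generalization, so the test metrics are the ones to compare across models.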

### Model 2: XGBoost

#### Construction

In [14]:
# Split the sets into x and y: x holds the features used for training, y holds the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

In [None]:
kfoldcv = KFold(n_splits=5)
scorer_fn = make_scorer(r2_score)
n = 10

# Parameter grid
params_grid = {
    'learning_rate': np.linspace(0.05, 0.5, 50),
    'gamma': [0,1,2],
    'max_depth': list(range(2,10)),
    'subsample': np.linspace(0.5, 1, 20),
    'lambda': [0,1,2],
    'alpha' : [0,1,2],
    'n_estimators': list(range(10,150,10))
    }

# XGBoost regressor
xgb_model = XGBRegressor(random_state=RAND_SEED)

# Random search with 5 folds and 10 iterations
rs = RandomizedSearchCV(xgb_model, params_grid, scoring=scorer_fn, cv=kfoldcv, n_iter=n, random_state=RAND_SEED)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # best mean cross-validated score

0.7263209515306797

The best parameters:

In [None]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'subsample': 0.7105263157894737,
 'n_estimators': 120,
 'max_depth': 3,
 'learning_rate': 0.5,
 'lambda': 0,
 'gamma': 0,
 'alpha': 0}
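The search space uses the keys `lambda` and `alpha`, while the constructor call in the next cell uses `reg_lambda` and `reg_alpha`. Since `lambda` is a reserved word in Python, it cannot be written as a keyword argument directly. A minimal sketch of renaming the keys so the dictionary can be unpacked with `**`; the helper `to_constructor_kwargs` and the `ALIASES` table are hypothetical names introduced here.

```python
# Parameters as printed by the search above (copied from the notebook output)
params_elegidos = {
    "subsample": 0.7105263157894737,
    "n_estimators": 120,
    "max_depth": 3,
    "learning_rate": 0.5,
    "lambda": 0,
    "gamma": 0,
    "alpha": 0,
}

# Search-space key -> constructor argument name used in the notebook
ALIASES = {"lambda": "reg_lambda", "alpha": "reg_alpha"}

def to_constructor_kwargs(params):
    """Rename search-space keys to the constructor argument names."""
    return {ALIASES.get(key, key): value for key, value in params.items()}

xgb_kwargs = to_constructor_kwargs(params_elegidos)
print(xgb_kwargs)
```

With xgboost installed, `XGBRegressor(**xgb_kwargs, random_state=RAND_SEED)` would then build the model in one call. Passing `random_state` is a suggestion so that a fit with `subsample` below 1 is repeatable.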

##### Building the best model

In [16]:
# Best XGB regressor
best_xgbr_model = XGBRegressor(
    n_estimators = params_elegidos['n_estimators'],
    learning_rate = params_elegidos['learning_rate'],
    gamma = params_elegidos['gamma'],
    max_depth = params_elegidos['max_depth'],
    subsample = params_elegidos['subsample'],
    reg_lambda = params_elegidos['lambda'],
    reg_alpha = params_elegidos['alpha'],
)

best_xgbr_model.fit(x_train, y_train)
y_train_pred_best_xgbr_model = best_xgbr_model.predict(x_train)
y_test_pred_best_xgbr_model = best_xgbr_model.predict(x_test)

How it performs on the test data

In [18]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_xgbr_model,
                            'Error': y_test - y_test_pred_best_xgbr_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.003738,0.004711,-0.000972
1,0.01215,0.016982,-0.004833
2,0.006028,0.005695,0.000333
3,0.001752,0.007461,-0.005709
4,0.003972,0.003392,0.00058


* MSE

In [21]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_xgbr_model, squared=True)

8.873311150515407e-05

* RMSE

In [20]:
# Root Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_xgbr_model, squared=False)

0.009419825449824113

* R2 score

In [22]:
r2_score(y_test, y_test_pred_best_xgbr_model)

0.5816297208277482

How it performs on the training data

In [23]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_xgbr_model,
                            'Error': y_train - y_train_pred_best_xgbr_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.012383,0.01372,-0.001337
1,0.006075,0.00588,0.000195
2,0.010724,0.022218,-0.011493
3,0.003738,0.003623,0.000115
4,0.011215,0.012561,-0.001346


* MSE

In [24]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_xgbr_model, squared=True)

5.227970868486065e-05

* RMSE

In [25]:
# Root Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_xgbr_model, squared=False)

0.007230470848074879

* R2 score

In [26]:
r2_score(y_train, y_train_pred_best_xgbr_model)

0.7504615224489726

### Model 3: GradientBoost <font color="#e02626">(Pending)</font>


#### Construction

In [None]:
# Split the sets into x and y: x holds the features used for training, y holds the target to predict
x_train = ds_reg_train.drop(columns=["property_price"])
y_train = ds_reg_train["property_price"].copy()

x_test = ds_reg_test.drop(columns=["property_price"])
y_test = ds_reg_test["property_price"].copy()

##### Cross-validation

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
n = 10

# Set of parameters to search over
params_grid = {'n_estimators': [50, 100], # number of boosting stages
               'learning_rate': [0.01, 0.1], # shrinks the contribution of each tree by this factor
               'max_features': [4, 5], # number of features considered at each split
               'min_samples_split': [5, 10]} # minimum number of samples required to split an internal node (n.minobsinnode in R)

# GradientBoost regressor
gb_model = GradientBoostingRegressor()

# Metric used for the search: MSE
# Note: make_scorer defaults to greater_is_better=True, so as written the
# search keeps the candidate with the HIGHEST mean MSE, not the lowest
scorer_fn = make_scorer(sklearn.metrics.mean_squared_error)

# Random search with 10 folds (cv=n) and 10 iterations
rs = RandomizedSearchCV(estimator=gb_model,
                        param_distributions=params_grid,
                        scoring=scorer_fn,
                        n_iter=n, cv=n, random_state=5)

rs_fit = rs.fit(x_train, y_train)
rs_fit.best_score_ # mean MSE of the selected candidate

0.0001333860175105814
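`make_scorer` treats the wrapped function as a score to maximize unless `greater_is_better=False` is passed, so a search scored with a plain MSE scorer selects the worst candidate. The sketch below contrasts both scorers on synthetic data. It uses `GridSearchCV` instead of `RandomizedSearchCV` so that both candidates are always evaluated, and the data and the two-value grid are assumptions chosen to make the difference obvious.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for x_train / y_train (assumption)
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

# Deliberately one weak and one strong setting
grid = {"n_estimators": [5, 100]}

def run(scorer):
    """Fit a small grid search with the given scorer and return it."""
    search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, scoring=scorer, cv=3)
    return search.fit(X, y)

# Higher MSE is treated as better, so the weak model wins
as_written = run(make_scorer(mean_squared_error))

# The sign is flipped internally, so the lowest MSE wins
corrected = run(make_scorer(mean_squared_error, greater_is_better=False))

print(as_written.best_params_, as_written.best_score_)
print(corrected.best_params_, corrected.best_score_)
```

The built-in string `scoring="neg_mean_squared_error"` gives the same behaviour as the corrected scorer. In both cases `best_score_` is reported as a negative number.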

The parameters selected by the search:

In [None]:
params_elegidos=rs_fit.cv_results_['params'][np.argmax(rs_fit.cv_results_['mean_test_score'])]
params_elegidos

{'n_estimators': 50,
 'min_samples_split': 10,
 'max_features': 4,
 'learning_rate': 0.01}

##### Building the best model

In [None]:
# Best GradientBoost regressor
best_gb_model = GradientBoostingRegressor(
    n_estimators = params_elegidos['n_estimators'],
    min_samples_split = params_elegidos['min_samples_split'],
    max_features = params_elegidos['max_features'],
    learning_rate = params_elegidos['learning_rate'],

)

best_gb_model.fit(x_train, y_train)
y_train_pred_best_gb_model = best_gb_model.predict(x_train)
y_test_pred_best_gb_model = best_gb_model.predict(x_test)

How it performs on the test data

In [None]:
# Performance
performance = pd.DataFrame({'Valor Real': y_test,
                            'Prediccion': y_test_pred_best_gb_model,
                            'Error': y_test - y_test_pred_best_gb_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.003738,0.008125,-0.004387
1,0.01215,0.011991,0.000158
2,0.006028,0.008297,-0.002269
3,0.001752,0.009654,-0.007902
4,0.003972,0.008125,-0.004153


* MSE

In [None]:
# Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_gb_model, squared=True)

0.00014086444892094183

* RMSE

In [None]:
# Root Mean Squared Error
mean_squared_error(y_test, y_test_pred_best_gb_model, squared=False)

0.011868632984507603

* R2 score

In [None]:
r2_score(y_test, y_test_pred_best_gb_model)

0.3358341906326966

How it performs on the training data

In [None]:
# Performance
performance = pd.DataFrame({'Valor Real': y_train,
                            'Prediccion': y_train_pred_best_gb_model,
                            'Error': y_train - y_train_pred_best_gb_model})
# View
performance.head()

Unnamed: 0,Valor Real,Prediccion,Error
0,0.012383,0.016326,-0.003943
1,0.006075,0.008125,-0.00205
2,0.010724,0.016326,-0.005601
3,0.003738,0.008125,-0.004387
4,0.011215,0.011991,-0.000776


* MSE

In [None]:
# Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_gb_model, squared=True)

0.0001325717838628663

* RMSE

In [None]:
# Root Mean Squared Error
mean_squared_error(y_train, y_train_pred_best_gb_model, squared=False)

0.011513982102768194

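* R2 score

In [None]:
# Note: this call scores the test split, so the value below is the test R2, not the training R2
r2_score(y_test, y_test_pred_best_gb_model)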

0.3358341906326966
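The call above scores the test split, so the notebook never reports GradientBoost's training R2. It can be estimated from numbers that are reported: R2 = 1 - MSE / Var(y), and KNN and XGB were both scored on the same `y_train`, so either one recovers Var(y). The sketch below does that arithmetic. The result, roughly 0.37, is an estimate derived from the printed outputs and not a value produced by the notebook.

```python
# Values copied from the notebook outputs above (training split, scaled units)
knn_mse, knn_r2 = 2.2743972777593e-05, 0.8914397864266198
xgb_mse, xgb_r2 = 5.227970868486065e-05, 0.7504615224489726
gb_mse = 0.0001325717838628663

# R2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R2)
var_from_knn = knn_mse / (1 - knn_r2)
var_from_xgb = xgb_mse / (1 - xgb_r2)

# Both models were scored on the same y_train, so the two estimates should agree
var_y_train = (var_from_knn + var_from_xgb) / 2

# Plug the recovered variance into the same identity for GradientBoost
gb_r2_train_estimate = 1 - gb_mse / var_y_train
print(var_from_knn, var_from_xgb)
print(round(gb_r2_train_estimate, 3))
```

Re-running the cell with `r2_score(y_train, y_train_pred_best_gb_model)` would give the exact figure.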

### Metrics summary


| TRAIN     | KNN                 | XGB                   | GB                    |
|-----------|---------------------|-----------------------|-----------------------|
| MSE       | 2.2743972777593e-05 | 5.227970868486065e-05 | 0.0001325717838628663 |
| RMSE      | 0.00476906414064573 | 0.007230470848074879  | 0.011513982102768194  |
| R2        | 0.8914397864266198  | 0.7504615224489726    | not computed (the R2 cell for GB on train scored the test split) |


| TEST      | KNN                   | XGB                   | GB                     |
|-----------|-----------------------|-----------------------|------------------------|
| MSE       | 0.0001315753070488609 | 8.873311150515407e-05 | 0.00014086444892094183 |
| RMSE      | 0.011470628014579713  | 0.009419825449824113  | 0.011868632984507603   |
| R2        | 0.37963183068352946   | 0.5816297208277482    | 0.3358341906326966     |


Based on the metrics we can see:
1. MSE:
  - On the training set, KNN has the lowest MSE (2.27e-05), followed by XGB (5.23e-05), which indicates a close fit of both models to the training data. GradientBoost has a clearly higher value (1.33e-04).
  - On the test set, XGB has the lowest MSE (8.87e-05). KNN (1.32e-04) is only slightly below GB (1.41e-04). This suggests that XGB generalizes best to unseen data.
2. RMSE:
  - RMSE is the square root of MSE, so it follows the same ordering. On the test set XGB has the lowest value (0.0094), followed by KNN (0.0115) and GB (0.0119).
3. R2:
  - On the training set, KNN (0.891) and XGB (0.750) have high R2 values, meaning they explain most of the variability in the training data. GB's training R2 was not computed, but its training MSE is the highest of the three, so its fit is the weakest.
  - On the test set, R2 is lower for every model than on the training set. XGB reaches 0.582, KNN 0.380, and GB 0.336. KNN shows the largest drop between train and test (0.891 to 0.380), which points to overfitting.

 **Which model would you choose to predict the sale price of the properties?**

## Saving the models

In [None]:
if using_drive:
  joblib.dump(best_knn_model, drive_path + '/Models/best_knn_model.pkl')
  joblib.dump(best_xgbr_model, drive_path + '/Models/best_xgbr_model.pkl')
  joblib.dump(best_gb_model, drive_path + '/Models/best_gb_model.pkl')
else:
  joblib.dump(best_knn_model, './models/best_knn_model.pkl')
  joblib.dump(best_xgbr_model, './models/best_xgbr_model.pkl')
  joblib.dump(best_gb_model, './models/best_gb_model.pkl')
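A saved model can be read back with `joblib.load`. The sketch below runs a round trip in a temporary directory and checks that the restored model predicts the same values. The small KNN model and its data are stand-ins (an assumption); in the notebook the object would be one of the `best_*` models and the path one of those used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small stand-in model (assumption) so the example runs on its own
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model = KNeighborsRegressor(n_neighbors=2).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_knn_model.pkl")
    joblib.dump(model, path)      # write the fitted model to disk
    restored = joblib.load(path)  # read it back into memory

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickled models are generally only safe to load with the same library versions that wrote them, so it helps to record the scikit-learn and xgboost versions next to the files.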