# Properatio Project: Building and evaluating a predictive model

Building a model that predicts USD prices for Office and Retail Space properties (`Oficina` and `Local comercial`) in CABA, Argentina <br><br>
Created by: Adriana Villalobos

## 1. Importing libraries and loading the dataset

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import mlflow.sklearn
from mlflow.models.signature import infer_signature

In [28]:
import os
from pathlib import Path

print("Current working directory:", os.getcwd())

# Set the working directory so that /mlruns is not created inside notebooks/
current_dir = Path.cwd()
if current_dir.name == "notebooks":
    os.chdir(current_dir.parent)

print("New working directory:", os.getcwd())

Current working directory: /Users/cosmos/Adri/Developer/DataScience/Clases y Consignas/Proyecto Final
New working directory: /Users/cosmos/Adri/Developer/DataScience/Clases y Consignas/Proyecto Final


In [29]:
df = pd.read_csv('data/data_cleaned.csv', sep=",")
df.head()

| | neighborhood | rooms | bathrooms | surface_covered | property_type | operation_type | price_usd | price_per_m2 | bathrooms_missing | rooms_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Once | 2.0 | 1.0 | 20.0 | Oficina | Venta | 32000.0 | 1600.0 | 0 | 1 |
| 1 | Flores | 2.0 | 1.0 | 32.0 | Oficina | Alquiler | 279.72028 | 9.0 | 0 | 1 |
| 2 | Flores | 2.0 | 1.0 | 46.0 | Oficina | Alquiler | 349.65035 | 8.0 | 0 | 1 |
| 3 | Palermo | 2.0 | 2.0 | 70.0 | Oficina | Alquiler | 1258.741259 | 18.0 | 0 | 1 |
| 4 | Tribunales | 2.0 | 1.0 | 40.0 | Oficina | Venta | 89000.0 | 2225.0 | 0 | 1 |


In [30]:
df['operation_type'].value_counts()

operation_type
Alquiler    10238
Venta        9754
Name: count, dtype: int64

In [31]:
df['property_type'].value_counts()

property_type
Oficina            10200
Local comercial     9792
Name: count, dtype: int64

In [32]:
df.sample(5)

| | neighborhood | rooms | bathrooms | surface_covered | property_type | operation_type | price_usd | price_per_m2 | bathrooms_missing | rooms_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| 13313 | Retiro | 2.0 | 3.0 | 356.0 | Oficina | Alquiler | 5000.0 | 14.0 | 0 | 1 |
| 5996 | Villa Crespo | 1.0 | 2.0 | 240.0 | Local comercial | Alquiler | 787.401575 | 3.0 | 0 | 1 |
| 16500 | Almagro | 1.0 | 3.0 | 450.0 | Local comercial | Venta | 950000.0 | 2111.0 | 0 | 1 |
| 16563 | Palermo | 2.0 | 1.0 | 45.0 | Oficina | Venta | 210000.0 | 4667.0 | 0 | 1 |
| 15734 | Nuñez | 1.0 | 1.0 | 35.0 | Local comercial | Alquiler | 250.0 | 7.0 | 0 | 0 |


In [33]:
# Use MLflow to track the performance of the different options throughout the process

mlflow.set_tracking_uri(f"file://{os.getcwd()}/mlruns")
mlflow.set_experiment(experiment_name="Proyectio_Properatio")

<Experiment: artifact_location='file:///Users/cosmos/Adri/Developer/DataScience/Clases%20y%20Consignas/Proyecto%20Final/notebooks/mlruns/417185556600615154', creation_time=1762538423016, experiment_id='417185556600615154', last_update_time=1762538423016, lifecycle_stage='active', name='Proyectio_Properatio', tags={'mlflow.experimentKind': 'custom_model_development'}>

## 2. Scaling numeric variables with StandardScaler

In [34]:

from sklearn.preprocessing import StandardScaler

numericas = ['rooms', 'bathrooms', 'surface_covered']
for col in numericas:
    scaler = StandardScaler()
    df[col] = scaler.fit_transform(df[[col]])
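
The cell above fits each scaler on the full dataframe, before any train/test split. A common alternative is to fit the scaler on the training rows only, so the test rows do not contribute to the mean and variance. The sketch below illustrates that pattern with a scikit-learn pipeline; the toy data, the `toy` and `toy_target` names, and the linear target are assumptions made for illustration, not part of the notebook. Tree-based models such as RandomForest are insensitive to this kind of scaling, so the point matters mainly for the linear model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the notebook's numeric columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=200).astype(float),
    "bathrooms": rng.integers(1, 4, size=200).astype(float),
    "surface_covered": rng.uniform(20, 400, size=200),
})
# Assumed target: linear in surface plus noise (illustration only).
toy_target = 10 + 0.05 * toy["surface_covered"] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    toy, toy_target, test_size=0.2, random_state=24
)

# The scaler learns its mean and variance from X_train only;
# X_test is transformed with those same statistics at predict time.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_)                # training-set means, one per column
print(pipe.score(X_test, y_test))  # R² on the held-out rows
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refit inside every fold.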


## 3. Splitting the dataframes for the different models

The neighborhood variable is one-hot encoded (OHE) for the LinearRegression model.<br>
For RandomForest, that column is encoded with LabelEncoder.
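
To make the difference between the two encodings concrete, here is a minimal sketch on a four-row toy column (the `barrios` dataframe is hypothetical). `pd.get_dummies` with `drop_first=True` produces one boolean column per category except the first in sorted order, while `LabelEncoder` maps each category to an integer code in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with three distinct neighborhoods.
barrios = pd.DataFrame({"neighborhood": ["palermo", "once", "flores", "palermo"]})

# One-hot encoding: 'flores' (first in sorted order) becomes the dropped baseline.
ohe = pd.get_dummies(barrios, columns=["neighborhood"], drop_first=True)
print(ohe.columns.tolist())
# ['neighborhood_once', 'neighborhood_palermo']

# Label encoding: a single integer column, codes assigned in sorted order.
le = LabelEncoder()
codes = le.fit_transform(barrios["neighborhood"])
print(codes.tolist())
# [2, 1, 0, 2]
```

The integer codes imply an ordering between neighborhoods that has no real meaning, which a linear model would treat as a numeric trend; tree-based models can split on the codes more freely, which is why the two encodings are paired with different models here.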

In [35]:
from sklearn.preprocessing import LabelEncoder

### 3.2 LabelEncoder for RandomForest

In [36]:
# Version for RandomForest
df_le = df.copy()

le = LabelEncoder()
df_le['neighborhood_encoded'] = le.fit_transform(df_le['neighborhood'])
df_le.drop(columns=['neighborhood'], inplace=True)

X_tree = df_le.drop(columns=['price_usd', 'price_per_m2'])
y_tree = df_le['price_usd']


### 3.3 OHE for LinearRegression

In [37]:
# Version for LinearRegression
df_ohe = df.copy()

In [38]:
df_ohe.columns

Index(['neighborhood', 'rooms', 'bathrooms', 'surface_covered',
       'property_type', 'operation_type', 'price_usd', 'price_per_m2',
       'bathrooms_missing', 'rooms_missing'],
      dtype='object')

In [39]:
# Replace spaces with _ and "/" with "o" so the neighborhood values yield consistent column names

df_ohe['neighborhood'] = df_ohe['neighborhood'].str.replace(' ', '_', regex=False)
df_ohe['neighborhood'] = df_ohe['neighborhood'].str.replace('/', 'o', regex=False)


# Lowercase all neighborhood values

df_ohe['neighborhood'] = df_ohe['neighborhood'].str.lower()

In [40]:
df_ohe['neighborhood'].unique()

array(['once', 'flores', 'palermo', 'tribunales', 'puerto_madero',
       'centro_o_microcentro', 'almagro', 'barracas', 'balvanera',
       'chacarita', 'san_nicolás', 'villa_crespo', 'san_cristobal',
       'villa_urquiza', 'retiro', 'recoleta', 'barrio_norte', 'congreso',
       'san_telmo', 'colegiales', 'parque_patricios', 'otros',
       'caballito', 'catalinas', 'floresta', 'paternal', 'belgrano',
       'monserrat', 'liniers', 'mataderos', 'nuñez', 'boca',
       'constitución', 'parque_chacabuco', 'boedo', 'villa_devoto',
       'abasto', 'villa_del_parque', 'saavedra'], dtype=object)

In [41]:
# Apply OHE to the dataframe prepared for LinearRegression
df_ohe = pd.get_dummies(df_ohe, columns=['neighborhood'], drop_first=True)
X_linear = df_ohe.drop(columns=['price_usd', 'price_per_m2'])
y_linear = df_ohe['price_usd']


In [42]:
#with open('models/columns_oficina_OHE.pkl', 'wb') as f:
#   pickle.dump(df_ohe.columns.tolist(), f)

## 4. Comparing LinearRegression and RandomForest models, and OHE and LabelEncoder encodings

This section compares the performance of 2 model types, 2 encodings, and the 4 possible data combinations of Alquiler, Venta, Local comercial and Oficina.

To measure performance as a function of the share assigned to test, I store that share as an MLflow parameter. <br>
Initially I set aside 70% of the data for training and 30% for test; the cell below shows `TEST_SIZE = 0.2`, the other value that was tried.

In [43]:
TEST_SIZE = 0.2
RANDOM_STATE = 24
mlflow.log_param("Tamaño de Test2", TEST_SIZE)
mlflow.log_param("Random state2", RANDOM_STATE)


24

In [44]:
df_ohe.sample(1)

Unnamed: 0,rooms,bathrooms,surface_covered,property_type,operation_type,price_usd,price_per_m2,bathrooms_missing,rooms_missing,neighborhood_almagro,...,neighborhood_retiro,neighborhood_saavedra,neighborhood_san_cristobal,neighborhood_san_nicolás,neighborhood_san_telmo,neighborhood_tribunales,neighborhood_villa_crespo,neighborhood_villa_del_parque,neighborhood_villa_devoto,neighborhood_villa_urquiza
19489,0.166562,0.00167,-0.142908,Oficina,Venta,280000.0,1918.0,0,1,False,...,True,False,False,False,False,False,False,False,False,False


In [45]:
df_le.sample(1)

Unnamed: 0,rooms,bathrooms,surface_covered,property_type,operation_type,price_usd,price_per_m2,bathrooms_missing,rooms_missing,neighborhood_encoded
19616,0.166562,0.727499,0.213883,Oficina,Alquiler,8000.0,22.0,0,1,4


### 4.1 Training and initial metrics

In [46]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd

combinations = [
    ('Oficina', 'Alquiler'),
    ('Oficina', 'Venta'),
    ('Local comercial', 'Alquiler'),
    ('Local comercial', 'Venta')
]
datasets = {'get_dummies': df_ohe, 'label_encoder': df_le}

results = []

for nombre_df, dataset in datasets.items():
    for tipo, operacion in combinations:
        # ----- LINEAR REGRESSION -----
        subset_lin = dataset[(dataset['property_type'] == tipo) & (dataset['operation_type'] == operacion)].copy()
        subset_lin = subset_lin.drop(columns=['property_type', 'operation_type'], axis=1)
        if not subset_lin.empty:
            X_lin = subset_lin.drop(columns=['price_usd', 'price_per_m2'])
            y_lin = np.log1p(subset_lin['price_per_m2'])

            X_train, X_test, y_train, y_test = train_test_split(X_lin, y_lin, test_size=TEST_SIZE, random_state=RANDOM_STATE)

            model_lin = LinearRegression()
            model_lin.fit(X_train, y_train)
            y_pred = model_lin.predict(X_test)

            results.append({
                'Dataset': nombre_df,
                'Tipo': tipo,
                'Operación': operacion,
                'Modelo': 'LinearRegression',
                'R²': r2_score(y_test, y_pred),
                'RMSE': root_mean_squared_error(np.expm1(y_test), np.expm1(y_pred)),
                'MAE': mean_absolute_error(np.expm1(y_test), np.expm1(y_pred))
            })

        # ----- RANDOM FOREST -----
        subset_tree = dataset[(dataset['property_type'] == tipo) & (dataset['operation_type'] == operacion)].copy()
        subset_tree = subset_tree.drop(columns=['property_type', 'operation_type'], axis=1)
        if not subset_tree.empty:
            X_tree = subset_tree.drop(columns=['price_usd', 'price_per_m2'])
            y_tree = np.log1p(subset_tree['price_per_m2'])

            X_train, X_test, y_train, y_test = train_test_split(X_tree, y_tree, test_size=TEST_SIZE, random_state=RANDOM_STATE)

            model_rf = RandomForestRegressor(
                n_estimators=500,
                max_depth=None,
                max_features='sqrt',
                min_samples_leaf=2,
                min_samples_split=2,
                random_state=RANDOM_STATE
            )

            model_rf.fit(X_train, y_train)
            y_pred = model_rf.predict(X_test)

            results.append({
                'Dataset': nombre_df,
                'Tipo': tipo,
                'Operación': operacion,
                'Modelo': 'RandomForest',
                'R²': r2_score(y_test, y_pred),
                'RMSE': root_mean_squared_error(np.expm1(y_test), np.expm1(y_pred)),
                'MAE': mean_absolute_error(np.expm1(y_test), np.expm1(y_pred))
            })

# ---- Comparison table ----
results_df = pd.DataFrame(results)



In [47]:
results_df.sort_values(by='R²', ascending=False)

| | Dataset | Tipo | Operación | Modelo | R² | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| 1 | get_dummies | Oficina | Alquiler | RandomForest | 0.554703 | 3.833121 | 2.802505 |
| 9 | label_encoder | Oficina | Alquiler | RandomForest | 0.554596 | 3.767488 | 2.691248 |
| 11 | label_encoder | Oficina | Venta | RandomForest | 0.549134 | 1431.823739 | 498.314649 |
| 3 | get_dummies | Oficina | Venta | RandomForest | 0.524825 | 1446.879697 | 509.734917 |
| 2 | get_dummies | Oficina | Venta | LinearRegression | 0.373206 | 1525.012204 | 618.076184 |
| 5 | get_dummies | Local comercial | Alquiler | RandomForest | 0.347067 | 3.573645 | 2.564039 |
| 7 | get_dummies | Local comercial | Venta | RandomForest | 0.340693 | 1877.84692 | 968.39273 |
| 15 | label_encoder | Local comercial | Venta | RandomForest | 0.336276 | 1801.525465 | 895.528427 |
| 13 | label_encoder | Local comercial | Alquiler | RandomForest | 0.330407 | 3.6216 | 2.578715 |
| 0 | get_dummies | Oficina | Alquiler | LinearRegression | 0.320208 | 5.254609 | 3.776248 |


RandomForest gives better results than LinearRegression, while the choice of neighborhood encoding makes little difference between get_dummies and LabelEncoder (for example, R² of 0.5547 vs. 0.5546 for Oficina/Alquiler).
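
When reading the table above, note that the models are trained on `np.log1p(price_per_m2)`: R² is computed on that log scale, whereas RMSE and MAE are computed after mapping back with `np.expm1`, so they are in USD per m². The sketch below illustrates the round trip using the five `price_per_m2` values shown in `df.head()`; the prediction offsets are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# price_per_m2 values from df.head(): three rentals and two sales.
price_per_m2 = np.array([1600.0, 9.0, 8.0, 18.0, 2225.0])
y_log = np.log1p(price_per_m2)

# expm1 is the exact inverse of log1p.
assert np.allclose(np.expm1(y_log), price_per_m2)

# Hypothetical predictions: the true log value plus a small offset.
y_pred_log = y_log + np.array([0.1, -0.1, 0.05, -0.05, 0.1])

r2_log = r2_score(y_log, y_pred_log)                                   # log scale
mae_orig = mean_absolute_error(np.expm1(y_log), np.expm1(y_pred_log))  # USD per m²
print(r2_log, mae_orig)
```

A similar error on the log scale corresponds to a similar relative error, so it translates into a much larger absolute error for sale prices than for rents. This is one reason the RMSE and MAE columns differ by orders of magnitude between Alquiler and Venta rows with comparable R².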

#### 4.1.1 Cross-validation

In [48]:
from sklearn.model_selection import cross_val_score, KFold

# Define a custom scoring function for MAE with the inverse transformation
def mae_exp(y_true, y_pred):
    # Apply the inverse transformation: exp(x) - 1
    y_true_orig = np.expm1(y_true)
    y_pred_orig = np.expm1(y_pred)
    # Compute the MAE on the original scale
    return mean_absolute_error(y_true_orig, y_pred_orig)

# General configuration
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scorer_r2 = 'r2'
scorer_mae_custom = make_scorer(mae_exp, greater_is_better=False) # 'greater_is_better=False' because a lower MAE is better.


results = []

# Iterate over property type and operation
for type, operation in combinations:
    # Segment-specific dataset
    df_ohe_sub = df_ohe[
        (df_ohe['property_type'] == type) &
        (df_ohe['operation_type'] == operation)
    ]
    df_le_sub = df_le[
        (df_le['property_type'] == type) &
        (df_le['operation_type'] == operation)
    ]
    # Drop the object-typed columns
    df_ohe_sub = df_ohe_sub.drop(columns=['property_type', 'operation_type'], axis=1)
    df_le_sub = df_le_sub.drop(columns=['property_type', 'operation_type'], axis=1)

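    # X/y split
    # NOTE: only 'price_per_m2' is dropped here, so 'price_usd' stays in X
    # (section 4.1 drops both); the scores below are obtained with that column as a feature.
    y_lin = np.log1p(df_ohe_sub['price_per_m2'])
    X_lin = df_ohe_sub.drop(columns=['price_per_m2'])
    y_tree = np.log1p(df_le_sub['price_per_m2'])
    X_tree = df_le_sub.drop(columns=['price_per_m2'])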
    
    # Linear Regression
    lin_model = LinearRegression()
    scores_r2_lin = cross_val_score(lin_model, X_lin, y_lin, cv=kf, scoring=scorer_r2)
    scores_mae_lin = cross_val_score(lin_model, X_lin, y_lin, cv=kf, scoring=scorer_mae_custom)

    results.append({
        "Tipo": type,
        "Operación": operation,
        "Modelo": "LinearRegression",
        "R² promedio": scores_r2_lin.mean(),
        "R² std": scores_r2_lin.std(),
        "MAE promedio": -scores_mae_lin.mean(),  # negated because the scorer returns the MAE as a negative value
        "MAE std": scores_mae_lin.std()
    })

    # Random Forest
    rf_model = RandomForestRegressor(
        n_estimators=500,
        max_depth=None,
        max_features='sqrt',
        min_samples_split=2,
        min_samples_leaf=2,
        random_state=42
    )
    scores_r2_rf = cross_val_score(rf_model, X_tree, y_tree, cv=kf, scoring=scorer_r2)
    scores_mae_rf = cross_val_score(rf_model, X_tree, y_tree, cv=kf, scoring=scorer_mae_custom)

    results.append({
        "Tipo": type,
        "Operación": operation,
        "Modelo": "RandomForest",
        "R² promedio": scores_r2_rf.mean(),
        "R² std": scores_r2_rf.std(),
        "MAE promedio": -scores_mae_rf.mean(),
        "MAE std": scores_mae_rf.std()
    })

# Final table
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by=["MAE promedio", "R² promedio"]).reset_index(drop=True)
results_df


| | Tipo | Operación | Modelo | R² promedio | R² std | MAE promedio | MAE std |
|---|---|---|---|---|---|---|---|
| 0 | Oficina | Alquiler | RandomForest | 0.948182 | 0.003978 | 0.810433 | 0.022252 |
| 1 | Local comercial | Alquiler | RandomForest | 0.914469 | 0.004649 | 0.853443 | 0.030397 |
| 2 | Local comercial | Alquiler | LinearRegression | 0.271076 | 0.143673 | 2.660125 | 0.243379 |
| 3 | Oficina | Alquiler | LinearRegression | 0.423125 | 0.026453 | 3.620035 | 0.252724 |
| 4 | Oficina | Venta | RandomForest | 0.856763 | 0.011944 | 252.661015 | 22.240877 |
| 5 | Local comercial | Venta | RandomForest | 0.935379 | 0.008849 | 495.242914 | 189.569554 |
| 6 | Oficina | Venta | LinearRegression | 0.507154 | 0.034801 | 621.00952 | 149.039957 |
| 7 | Local comercial | Venta | LinearRegression | 0.189574 | 0.154047 | 2004.832272 | 918.256852 |
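
The sign handling of the custom scorer used in this cell can be checked in isolation. `make_scorer(..., greater_is_better=False)` makes scikit-learn negate the metric so that "higher is better" still holds internally, which is why the notebook flips the sign again before reporting. The sketch below uses random toy data and a `DummyRegressor`, both of which are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

def mae_exp(y_true, y_pred):
    # MAE on the original scale, after undoing log1p.
    return mean_absolute_error(np.expm1(y_true), np.expm1(y_pred))

# Hypothetical toy data: random features, log-transformed target.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 2))
y_toy = np.log1p(rng.uniform(5, 30, size=100))

scorer = make_scorer(mae_exp, greater_is_better=False)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    DummyRegressor(strategy="mean"), X_toy, y_toy, cv=cv, scoring=scorer
)

print(scores)          # one negative value per fold
print(-scores.mean())  # positive MAE, as reported in the table
```

The standard deviation is unaffected by the sign flip, so `scores.std()` can be reported as is.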


### 4.2 Could a single RandomForest model estimate both rent and sale? (2 in 1)

In [49]:
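le = LabelEncoder()
df_le['operation_type'] = le.fit_transform(df_le['operation_type'])
# NOTE: the column is dropped right after being encoded, so the models in this
# section are trained without any rent/sale indicator.
df_le.drop(columns=['operation_type'], inplace=True)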

In [50]:
tipo_propiedad = ['Oficina', 'Local comercial']

results = []

for tp in tipo_propiedad:
    
    # ----- RANDOM FOREST -----
    subset_tree = df_le[(df_le['property_type'] == tp)].copy()
    subset_tree = subset_tree.drop(columns=['property_type'], axis=1)
    if not subset_tree.empty:
        X_tree = subset_tree.drop(columns=['price_usd', 'price_per_m2'])
        y_tree = np.log1p(subset_tree['price_per_m2'])

        X_train, X_test, y_train, y_test = train_test_split(X_tree, y_tree, test_size=TEST_SIZE, random_state=RANDOM_STATE)

        model_rf_2in1 = RandomForestRegressor(
            n_estimators=500,
            max_depth=None,
            max_features='sqrt',
            min_samples_leaf=2,
            min_samples_split=2,
            random_state=RANDOM_STATE
        )

        model_rf_2in1.fit(X_train, y_train)
        y_pred = model_rf_2in1.predict(X_test)

        results.append({
            'Tipo': tp,
            'Modelo': 'RandomForest',
            'R²': r2_score(y_test, y_pred),
            'RMSE': root_mean_squared_error(np.expm1(y_test), np.expm1(y_pred)),
            'MAE': mean_absolute_error(np.expm1(y_test), np.expm1(y_pred))
        })

# ---- Comparison table ----
results = pd.DataFrame(results)


In [51]:
results.sort_values(by='R²', ascending=False)

| | Tipo | Modelo | R² | RMSE | MAE |
|---|---|---|---|---|---|
| 1 | Local comercial | RandomForest | 0.274101 | 11503.292828 | 1378.845228 |
| 0 | Oficina | RandomForest | 0.162006 | 1946.292311 | 954.118401 |


The metrics degrade sharply when a single model estimates both rent and sale values, for retail spaces as well as for offices (R² of 0.27 and 0.16, against 0.33 to 0.55 for the per-operation RandomForest models in 4.1). This suggests using a separate model for each operation type to obtain more precise estimates.
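
How much a pooled model depends on knowing the operation type can be illustrated on synthetic data. The sketch below is an illustration under stated assumptions, not a rerun of the notebook: the `is_sale` flag, the price levels (around 10 USD/m² for rent and 2000 USD/m² for sale, loosely inspired by `df.head()`), and the noise level are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(24)
n = 600
toy = pd.DataFrame({
    "surface_covered": rng.uniform(20, 400, size=n),
    "is_sale": rng.integers(0, 2, size=n),  # hypothetical flag: 0 = Alquiler, 1 = Venta
})
# Assumed price levels per operation, with multiplicative noise.
level = np.where(toy["is_sale"] == 1, 2000.0, 10.0)
toy["price_per_m2"] = level * rng.lognormal(mean=0.0, sigma=0.3, size=n)

y = np.log1p(toy["price_per_m2"])
train_idx, test_idx = train_test_split(np.arange(n), test_size=0.2, random_state=24)

def holdout_r2(feature_names):
    # Fit on the training rows and score on the held-out rows (log scale).
    model = RandomForestRegressor(n_estimators=100, random_state=24)
    model.fit(toy.iloc[train_idx][feature_names], y.iloc[train_idx])
    preds = model.predict(toy.iloc[test_idx][feature_names])
    return r2_score(y.iloc[test_idx], preds)

r2_without_flag = holdout_r2(["surface_covered"])
r2_with_flag = holdout_r2(["surface_covered", "is_sale"])
print(r2_without_flag, r2_with_flag)
```

In this toy setting the pooled R² with the flag is high mostly because the model separates the two price scales, so a pooled score says little about accuracy within each operation; per-operation metrics remain the more informative comparison.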

#### 4.2.1 Cross-validation

In [52]:
# General configuration
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scorer_r2 = 'r2'
scorer_mae_custom = make_scorer(mae_exp, greater_is_better=False) # 'greater_is_better=False' because a lower MAE is better.

results = []

# Iterate over property type
for tp in tipo_propiedad:
    df_le_sub = df_le[df_le['property_type'] == tp] 
    df_le_sub = df_le_sub.drop(columns=['property_type'], axis=1)

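    # X/y split
    # NOTE: only 'price_per_m2' is dropped here, so 'price_usd' stays in X_tree
    # (section 4.2 drops both); the scores below are obtained with that column as a feature.
    y_tree = np.log1p(df_le_sub['price_per_m2'])
    X_tree = df_le_sub.drop(columns=['price_per_m2'])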
    
    # Random Forest
    rf_model_2in1 = RandomForestRegressor(
        n_estimators=500,
        max_depth=None,
        max_features='sqrt',
        min_samples_split=2,
        min_samples_leaf=2,
        random_state=42
    )
    scores_r2_rf = cross_val_score(rf_model_2in1, X_tree, y_tree, cv=kf, scoring=scorer_r2)
    
    # MAE: use the custom scorer (scorer_mae_custom)
    # y_tree is passed on the log scale; the custom scorer inverts the transformation internally
    scores_mae_rf = cross_val_score(rf_model_2in1, X_tree, y_tree, cv=kf, scoring=scorer_mae_custom)

    results.append({
        "Tipo": tp,
        "Modelo": "RandomForest",
        "R² promedio": scores_r2_rf.mean(),
        "R² std": scores_r2_rf.std(),
        "MAE promedio": -scores_mae_rf.mean(), # sign flipped to report a positive MAE
        "MAE std": scores_mae_rf.std()
    })

# Final table
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by=["MAE promedio", "R² promedio"]).reset_index(drop=True)
results_df


| | Tipo | Modelo | R² promedio | R² std | MAE promedio | MAE std |
|---|---|---|---|---|---|---|
| 0 | Oficina | RandomForest | 0.996301 | 0.000329 | 121.247944 | 9.334401 |
| 1 | Local comercial | RandomForest | 0.994148 | 0.001292 | 288.501752 | 115.231381 |


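An average R² of 0.99 is a symptom that the model is not really solving the regression problem but exploiting a variable that gives away the scale of the target. `operation_type` was dropped from `df_le` in cell 49, so it cannot be that variable; the more likely candidate is `price_usd`, which is still in `X_tree` because this cell drops only `price_per_m2`.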
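
The effect of leaving a price column among the features, as happens with `price_usd` in `X_tree` in the cell above, can be reproduced on synthetic data. The sketch below is an illustration under assumptions, not a rerun of the notebook: the toy distributions are hypothetical, and price per m² is drawn independently of surface so that surface alone carries no signal.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({"surface_covered": rng.uniform(20, 200, size=n)})
# Hypothetical price per m², independent of surface by construction.
toy["price_per_m2"] = rng.lognormal(mean=2.5, sigma=0.5, size=n)
toy["price_usd"] = toy["price_per_m2"] * toy["surface_covered"]

y = np.log1p(toy["price_per_m2"])
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)

X_leaky = toy.drop(columns=["price_per_m2"])               # price_usd still present
X_clean = toy.drop(columns=["price_per_m2", "price_usd"])  # both price columns removed

r2_leaky = cross_val_score(model, X_leaky, y, cv=cv, scoring="r2").mean()
r2_clean = cross_val_score(model, X_clean, y, cv=cv, scoring="r2").mean()
print(r2_leaky, r2_clean)
```

With `price_usd` available, the forest can approximate `price_usd / surface_covered`, which is the target itself, so its score no longer reflects what the model would achieve on a listing whose price is unknown.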

## 5. Conclusions

The per-segment results show that the models are more precise when trained separately. The results of the "2 in 1" model confirm that pooling rent and sale produces a confused and inefficient model.<br>
<br>
We will move forward with Offices for Rent only, leaving open the possibility of adding models for Sales and for Retail Spaces in the future. <br>
Two test_size values were tried (0.2 and 0.3); 0.2 gave the better results.