# Properatio Project: Building and evaluating a predictive model

Building a model that predicts USD prices for Oficina (office) and Local comercial (retail space) properties in CABA, Argentina <br><br>
Created by: Adriana Villalobos

## 1. Importing libraries and loading the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import mlflow.sklearn
from mlflow.models.signature import infer_signature

In [2]:
import os
from pathlib import Path

print("Current working directory:", os.getcwd())

# Set the working directory so that /mlruns is not created inside notebooks
current_dir = Path.cwd()
if current_dir.name == "notebooks":
    os.chdir(current_dir.parent)

print("New working directory:", os.getcwd())

Current working directory: /Users/cosmos/Adri/Developer/DataScience/Clases y Consignas/Proyecto Final/notebooks
New working directory: /Users/cosmos/Adri/Developer/DataScience/Clases y Consignas/Proyecto Final


In [3]:
df = pd.read_csv('data/data_cleaned.csv', sep=",")
df.head()

Unnamed: 0,neighborhood,rooms,bathrooms,surface_covered,property_type,operation_type,price_usd,price_per_m2,bathrooms_missing,rooms_missing
0,Once,2.0,1.0,20.0,Oficina,Venta,32000.0,1600.0,0,1
1,Flores,2.0,1.0,32.0,Oficina,Alquiler,514800.0,16088.0,0,1
2,Flores,2.0,1.0,46.0,Oficina,Alquiler,643500.0,13989.0,0,1
3,Palermo,2.0,2.0,70.0,Oficina,Alquiler,2316600.0,33094.0,0,1
4,Tribunales,2.0,1.0,40.0,Oficina,Venta,89000.0,2225.0,0,1


In [4]:
df['operation_type'].value_counts()

operation_type
Alquiler    12024
Venta        9754
Name: count, dtype: int64

In [5]:
df['property_type'].value_counts()

property_type
Oficina            11304
Local comercial    10474
Name: count, dtype: int64

In [6]:
df.sample(5)

Unnamed: 0,neighborhood,rooms,bathrooms,surface_covered,property_type,operation_type,price_usd,price_per_m2,bathrooms_missing,rooms_missing
19568,Monserrat,2.0,2.0,361.0,Oficina,Alquiler,3610.0,10.0,1,1
1749,Monserrat,2.0,2.0,47.0,Oficina,Venta,79500.0,1691.0,0,1
12484,Boedo,1.0,4.0,895.0,Local comercial,Alquiler,23400000.0,26145.0,0,1
15225,Chacarita,1.0,2.0,475.0,Local comercial,Alquiler,8680000.0,18274.0,0,1
8072,Recoleta,3.0,2.0,79.0,Oficina,Venta,199000.0,2519.0,0,0


In [7]:
# Use MLflow to track the performance of the different options throughout the process

mlflow.set_tracking_uri(f"file://{os.getcwd()}/mlruns")
mlflow.set_experiment(experiment_name="Proyectio_Properatio")

<Experiment: artifact_location='file:///Users/cosmos/Adri/Developer/DataScience/Clases%20y%20Consignas/Proyecto%20Final/notebooks/mlruns/417185556600615154', creation_time=1762538423016, experiment_id='417185556600615154', last_update_time=1762538423016, lifecycle_stage='active', name='Proyectio_Properatio', tags={'mlflow.experimentKind': 'custom_model_development'}>
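
The project path contains spaces ("Clases y Consignas"), which show up percent-encoded in the artifact location above. As a minimal sketch, a `file://` URI can be built with `pathlib` so that the encoding is handled explicitly. The directory used here is a hypothetical stand-in, not the real project path.

```python
from pathlib import Path

# Hypothetical absolute project directory containing a space,
# standing in for os.getcwd() in the cell above
project_dir = Path("/tmp/Proyecto Final")

# as_uri() requires an absolute path and percent-encodes characters such as spaces
tracking_uri = (project_dir / "mlruns").as_uri()
print(tracking_uri)
```

The resulting string could then be passed to `mlflow.set_tracking_uri`.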

## 2. Scaling numeric variables with StandardScaler

In [8]:

from sklearn.preprocessing import StandardScaler

numericas = ['rooms', 'bathrooms', 'surface_covered']
scalers = {}  # one fitted scaler per numeric column
for col in numericas:
    scaler = StandardScaler()
    df[col] = scaler.fit_transform(df[[col]])
    scalers[col] = scaler


In [9]:
# Save one scaler per numeric column for use in production
for col, scaler in scalers.items():
    with open(f'models/scaler_{col}.pkl', 'wb') as f:
        pickle.dump(scaler, f)
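
A pickled scaler is only useful if it can be restored and applied to new listings with the statistics learned at training time. The sketch below illustrates that round trip on toy values (not the project's data), serializing in memory instead of writing to `models/`. In a stricter setup the scaler would be fitted on the training split only, so that test rows do not influence the mean and standard deviation.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training column (hypothetical covered-surface values)
train_surface = np.array([[20.0], [40.0], [60.0]])

scaler = StandardScaler().fit(train_surface)

# Serialize and restore in memory; on disk this would be pickle.dump / pickle.load
restored = pickle.loads(pickle.dumps(scaler))

# A new listing is scaled with the mean and std learned from the training data
new_surface = np.array([[40.0]])
scaled = restored.transform(new_surface)
print(restored.mean_, restored.scale_, scaled)
```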

## 3. Separate DataFrames for the different models

The neighborhood variable is one-hot encoded (OHE) for the LinearRegression model.<br>
For RandomForest, that column is encoded with LabelEncoder instead.

In [10]:
from sklearn.preprocessing import LabelEncoder

### 3.2 LabelEncoder for RandomForest

In [11]:
# Version for RandomForest
df_tree = df.copy()

le = LabelEncoder()
df_tree['neighborhood_encoded'] = le.fit_transform(df_tree['neighborhood'])
df_tree.drop(columns=['neighborhood'], inplace=True)

X_tree = df_tree.drop(columns=['price_usd', 'price_per_m2'])
y_tree = df_tree['price_usd']
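
As a small illustration of what `LabelEncoder` produces, the sketch below encodes a toy list of neighborhood names. The integer codes follow the sorted order of the labels, not any ordering by price. scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the feature-oriented counterpart and may be worth considering.

```python
from sklearn.preprocessing import LabelEncoder

# Toy neighborhood column (names taken from the dataset, values hypothetical)
neighborhoods = ["Once", "Flores", "Palermo", "Flores"]

le = LabelEncoder()
codes = le.fit_transform(neighborhoods)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(le.classes_))
print(list(codes))

# inverse_transform maps codes back to names
print(list(le.inverse_transform([2])))
```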


In [12]:
#with open('models/columns_labelEncoder.pkl', 'wb') as f:
#   pickle.dump(df.columns.tolist(), f)

### 3.3 OHE for LinearRegression

In [13]:
# Version for LinearRegression
df_linear = df.copy()

In [14]:
df_linear.columns

Index(['neighborhood', 'rooms', 'bathrooms', 'surface_covered',
       'property_type', 'operation_type', 'price_usd', 'price_per_m2',
       'bathrooms_missing', 'rooms_missing'],
      dtype='object')

In [15]:
# Replace spaces with _ and '/' with 'o' so the values make consistent column names

df_linear['neighborhood'] = df_linear['neighborhood'].str.replace(' ', '_', regex=False)
df_linear['neighborhood'] = df_linear['neighborhood'].str.replace('/', 'o', regex=False)


# Lowercase the neighborhood names (they become column names after OHE)

df_linear['neighborhood'] = df_linear['neighborhood'].str.lower()

In [16]:
df_linear['neighborhood'].unique()

array(['once', 'flores', 'palermo', 'tribunales', 'san_nicolás',
       'puerto_madero', 'centro_o_microcentro', 'almagro', 'barracas',
       'balvanera', 'chacarita', 'barrio_norte', 'villa_crespo',
       'san_cristobal', 'villa_urquiza', 'retiro', 'recoleta', 'congreso',
       'monserrat', 'san_telmo', 'colegiales', 'parque_patricios',
       'otros', 'caballito', 'catalinas', 'floresta', 'paternal',
       'belgrano', 'liniers', 'mataderos', 'nuñez', 'boca',
       'constitución', 'abasto', 'parque_chacabuco', 'boedo',
       'villa_devoto', 'villa_del_parque', 'saavedra'], dtype=object)

In [17]:
# Apply OHE to the DataFrame prepared for LinearRegression
df_linear = pd.get_dummies(df_linear, columns=['neighborhood'], drop_first=True)
X_linear = df_linear.drop(columns=['price_usd', 'price_per_m2'])
y_linear = df_linear['price_usd']
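
The effect of `drop_first=True` is easiest to see on a toy frame: the first category in sorted order is dropped and becomes the implicit baseline. The sketch also shows one possible way to align a single new row to the training columns with `reindex`. The `new_row` frame and the alignment step are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy frame with one numeric column and three neighborhoods
toy = pd.DataFrame({
    "rooms": [2.0, 1.0, 3.0],
    "neighborhood": ["once", "flores", "palermo"],
})

encoded = pd.get_dummies(toy, columns=["neighborhood"], drop_first=True)

# 'flores' sorts first, so it is dropped and acts as the baseline category
print(list(encoded.columns))

# Hypothetical production row: encode it, then align it to the training columns
new_row = pd.DataFrame({"rooms": [2.0], "neighborhood": ["palermo"]})
new_encoded = pd.get_dummies(new_row, columns=["neighborhood"])
aligned = new_encoded.reindex(columns=encoded.columns, fill_value=False)
print(aligned.iloc[0].to_dict())
```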


In [18]:
#with open('models/columns_oficina_OHE.pkl', 'wb') as f:
#   pickle.dump(df_linear.columns.tolist(), f)

## 4. Comparing the LinearRegression and RandomForest models

To measure performance as a function of the share of data held out for test, I log the test size as an MLflow parameter. <br>
I started with a 70% train / 30% test split; the cell below uses `TEST_SIZE = 0.2` (80% / 20%), the value that gave the best results.

In [19]:
TEST_SIZE = 0.2
RANDOM_STATE = 24
mlflow.log_param("Tamaño de Test2", TEST_SIZE)
mlflow.log_param("Random state2", RANDOM_STATE)


24
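
To make the two candidate splits concrete, the sketch below applies `train_test_split` to 100 toy rows with both test sizes and checks that a fixed `random_state` reproduces the same split. The data is synthetic; only the `test_size` values and the seed mirror the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, one feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

for test_size in (0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=24
    )
    print(test_size, len(X_train), len(X_test))

# The same random_state yields an identical split on repeated calls
first = train_test_split(X, y, test_size=0.2, random_state=24)[1]
second = train_test_split(X, y, test_size=0.2, random_state=24)[1]
print(np.array_equal(first, second))
```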

In [20]:
df_linear.sample(4)

Unnamed: 0,rooms,bathrooms,surface_covered,property_type,operation_type,price_usd,price_per_m2,bathrooms_missing,rooms_missing,neighborhood_almagro,...,neighborhood_retiro,neighborhood_saavedra,neighborhood_san_cristobal,neighborhood_san_nicolás,neighborhood_san_telmo,neighborhood_tribunales,neighborhood_villa_crespo,neighborhood_villa_del_parque,neighborhood_villa_devoto,neighborhood_villa_urquiza
6755,0.162549,-0.021792,-0.324425,Oficina,Venta,113000.0,2457.0,1,1,False,...,False,False,False,True,False,False,False,False,False,False
4559,0.162549,-0.021792,-0.264184,Oficina,Venta,127000.0,1494.0,0,1,False,...,False,False,False,False,False,False,False,False,False,False
12970,0.162549,-0.730141,-0.194675,Oficina,Alquiler,4054500.0,31188.0,0,1,False,...,False,False,False,False,False,False,False,False,False,False
1751,0.162549,-0.021792,-0.264184,Oficina,Venta,145000.0,1706.0,0,1,False,...,False,False,False,False,False,True,False,False,False,False


In [21]:
df_tree.sample(4)

Unnamed: 0,rooms,bathrooms,surface_covered,property_type,operation_type,price_usd,price_per_m2,bathrooms_missing,rooms_missing,neighborhood_encoded
20130,-0.469932,-0.021792,-0.341416,Local comercial,Alquiler,2034900.0,58140.0,1,1,23
13255,0.162549,-0.021792,-0.210121,Oficina,Alquiler,1109080.0,9242.0,1,1,29
11209,-0.469932,-0.730141,-0.349139,Local comercial,Venta,430000.0,14333.0,0,1,23
16174,-0.469932,-0.730141,-0.135979,Local comercial,Venta,549000.0,3268.0,0,1,1


In [22]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error, make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd

combos = [
    ('Oficina', 'Alquiler'),
    ('Oficina', 'Venta'),
    ('Local comercial', 'Alquiler'),
    ('Local comercial', 'Venta')
]

resultados = []

for tipo, operacion in combos:
    # ----- LINEAR REGRESSION -----
    subset_lin = df_linear[(df_linear['property_type'] == tipo) & (df_linear['operation_type'] == operacion)].copy()
    subset_lin = subset_lin.drop(columns=['property_type', 'operation_type'], axis=1)
    if not subset_lin.empty:
        X_lin = subset_lin.drop(columns=['price_usd', 'price_per_m2'])
        y_lin = np.log1p(subset_lin['price_per_m2'])

        X_train, X_test, y_train, y_test = train_test_split(X_lin, y_lin, test_size=TEST_SIZE, random_state=RANDOM_STATE)

        model_lin = LinearRegression()
        model_lin.fit(X_train, y_train)
        y_pred = model_lin.predict(X_test)

        resultados.append({
            'Tipo': tipo,
            'Operación': operacion,
            'Modelo': 'LinearRegression',
            'R²': r2_score(y_test, y_pred),
            'RMSE': root_mean_squared_error(np.expm1(y_test), np.expm1(y_pred)),
            'MAE': mean_absolute_error(np.expm1(y_test), np.expm1(y_pred))
        })

    # ----- RANDOM FOREST -----
    subset_tree = df_tree[(df_tree['property_type'] == tipo) & (df_tree['operation_type'] == operacion)].copy()
    subset_tree = subset_tree.drop(columns=['property_type', 'operation_type'], axis=1)
    if not subset_tree.empty:
        X_tree = subset_tree.drop(columns=['price_usd', 'price_per_m2'])
        y_tree = np.log1p(subset_tree['price_per_m2'])

        X_train, X_test, y_train, y_test = train_test_split(X_tree, y_tree, test_size=TEST_SIZE, random_state=RANDOM_STATE)

        model_rf = RandomForestRegressor(
            n_estimators=500,
            max_depth=None,
            max_features='sqrt',
            min_samples_leaf=2,
            min_samples_split=2,
            random_state=RANDOM_STATE
        )

        model_rf.fit(X_train, y_train)
        y_pred = model_rf.predict(X_test)

        resultados.append({
            'Tipo': tipo,
            'Operación': operacion,
            'Modelo': 'RandomForest',
            'R²': r2_score(y_test, y_pred),
            'RMSE': root_mean_squared_error(np.expm1(y_test), np.expm1(y_pred)),
            'MAE': mean_absolute_error(np.expm1(y_test), np.expm1(y_pred))
        })

# ---- Comparison table ----
resultados_df = pd.DataFrame(resultados)



In [23]:
resultados_df.sort_values(by='R²', ascending=False)

Unnamed: 0,Tipo,Operación,Modelo,R²,RMSE,MAE
1,Oficina,Alquiler,RandomForest,0.588822,25834.67466,13432.874042
3,Oficina,Venta,RandomForest,0.454649,430476.320029,14962.198495
5,Local comercial,Alquiler,RandomForest,0.377932,30536.420889,19651.280861
0,Oficina,Alquiler,LinearRegression,0.332827,30081.253338,17674.29401
2,Oficina,Venta,LinearRegression,0.312838,430480.003261,15065.77265
4,Local comercial,Alquiler,LinearRegression,0.283262,33708.372236,22243.342269
7,Local comercial,Venta,RandomForest,0.151496,457802.389439,40141.110965
6,Local comercial,Venta,LinearRegression,0.131037,458149.104223,40349.057916
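
In the loop above, the models are fitted on `log1p(price_per_m2)`; RMSE and MAE are computed after inverting with `expm1`, while R² is computed on the log scale. The sketch below, on synthetic data, shows that scikit-learn's `TransformedTargetRegressor` gives the same predictions as the manual transform, and that R² generally differs depending on the scale on which it is measured. It is an illustration under those assumptions, not the notebook's method.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, strictly positive, right-skewed target (not the project's data)
rng = np.random.default_rng(24)
X = rng.normal(size=(200, 3))
y = np.exp(1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200))

# Manual approach: fit on log1p(y), invert predictions with expm1
manual = LinearRegression().fit(X, np.log1p(y))
pred_manual = np.expm1(manual.predict(X))

# Wrapper approach: the transform and its inverse travel with the model
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
).fit(X, y)
pred_wrapped = wrapped.predict(X)

# R² depends on the scale it is computed on
r2_log = r2_score(np.log1p(y), np.log1p(pred_wrapped))
r2_raw = r2_score(y, pred_wrapped)
print(round(r2_log, 3), round(r2_raw, 3))
```

Reporting R² and the error metrics on the same scale makes rows of a comparison table easier to read side by side.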


### 4.1 Validación cruzada

In [24]:
from sklearn.model_selection import cross_val_score, KFold

# General configuration
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scorer_r2 = 'r2'
scorer_mae = make_scorer(mean_absolute_error, greater_is_better=False)

results = []

# Iterate over property type and operation
for tipo, operacion in combos:
    # Subset for this combination
    df_linear_sub = df_linear[
        (df_linear['property_type'] == tipo) &
        (df_linear['operation_type'] == operacion)
    ]
    df_tree_sub = df_tree[
        (df_tree['property_type'] == tipo) &
        (df_tree['operation_type'] == operacion)
    ]
    # Drop the object-dtype columns
    df_linear_sub = df_linear_sub.drop(columns=['property_type', 'operation_type'], axis=1)
    df_tree_sub = df_tree_sub.drop(columns=['property_type', 'operation_type'], axis=1)

    # X/y split
    # Note: only price_per_m2 is dropped here, so price_usd stays in X and leaks the target
    y_lin = np.log1p(df_linear_sub['price_per_m2'])
    X_lin = df_linear_sub.drop(columns=['price_per_m2'])
    y_tree = np.log1p(df_tree_sub['price_per_m2'])
    X_tree = df_tree_sub.drop(columns=['price_per_m2'])
    
    # Linear Regression
    lin_model = LinearRegression()
    scores_r2_lin = cross_val_score(lin_model, X_lin, (np.expm1(y_lin)), cv=kf, scoring=scorer_r2)
    scores_mae_lin = cross_val_score(lin_model, X_lin, (np.expm1(y_lin)), cv=kf, scoring=scorer_mae)

    results.append({
        "Tipo": tipo,
        "Operación": operacion,
        "Modelo": "LinearRegression",
        "R² promedio": scores_r2_lin.mean(),
        "R² std": scores_r2_lin.std(),
        "MAE promedio": -scores_mae_lin.mean(),  # negated because the scorer returns negative MAE
        "MAE std": scores_mae_lin.std()
    })

    # Random Forest
    rf_model = RandomForestRegressor(
        n_estimators=500,
        max_depth=None,
        max_features='sqrt',
        min_samples_split=2,
        min_samples_leaf=2,
        random_state=42
    )
    scores_r2_rf = cross_val_score(rf_model, X_tree, (np.expm1(y_tree)), cv=kf, scoring=scorer_r2)
    scores_mae_rf = cross_val_score(rf_model, X_tree, (np.expm1(y_tree)), cv=kf, scoring=scorer_mae)

    results.append({
        "Tipo": tipo,
        "Operación": operacion,
        "Modelo": "RandomForest",
        "R² promedio": scores_r2_rf.mean(),
        "R² std": scores_r2_rf.std(),
        "MAE promedio": -scores_mae_rf.mean(),
        "MAE std": scores_mae_rf.std()
    })

# Final table
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by=["Tipo", "Operación", "Modelo"]).reset_index(drop=True)
results_df


Unnamed: 0,Tipo,Operación,Modelo,R² promedio,R² std,MAE promedio,MAE std
0,Local comercial,Alquiler,LinearRegression,0.33501,0.023878,18339.26376,610.851133
1,Local comercial,Alquiler,RandomForest,0.927729,0.005581,4447.811159,162.294753
2,Local comercial,Venta,LinearRegression,0.691248,0.187579,44036.914396,4667.35905
3,Local comercial,Venta,RandomForest,0.742044,0.220652,21663.881491,11049.362443
4,Oficina,Alquiler,LinearRegression,0.464258,0.05792,12576.138199,521.358415
5,Oficina,Alquiler,RandomForest,0.935886,0.004269,2496.114363,142.795479
6,Oficina,Venta,LinearRegression,-10.709301,20.616429,6678.716954,2010.409151
7,Oficina,Venta,RandomForest,-19.650553,27.893048,7933.305878,5454.482627
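
The cross-validation cell keeps `price_usd` among the features and scores on the untransformed target, which may explain why these scores differ so much from the hold-out table. Note also that R² = 1 − SS_res / SS_tot has no lower bound, so a fold with a few extreme prices can produce strongly negative values. The sketch below, on synthetic data with no price columns in `X`, shows one way to compute both metrics in a single pass with `cross_validate`, with scaling re-fitted inside each fold. The data, the smaller forest, and the pipeline are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one (property type, operation) subset
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))  # features only: no price columns
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=2.0, size=300)

# Scaling inside the pipeline is re-fitted on each training fold
model = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, min_samples_leaf=2, random_state=42),
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model, X, y, cv=kf, scoring=["r2", "neg_mean_absolute_error"]
)

r2 = scores["test_r2"]
mae = -scores["test_neg_mean_absolute_error"]  # flip the sign back to a positive error
print(r2.mean(), r2.std(), mae.mean(), mae.std())
```

Compared with two separate `cross_val_score` calls, this fits each fold once instead of twice.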


## 5. Conclusiones

The best metrics come from RandomForest for Oficina properties with Alquiler (rental) operations: in cross-validation it reaches a mean R² of 0.94, a mean MAE of 2,496 and an MAE standard deviation of 143. <br>
It is followed by RandomForest for Local comercial rentals, with values close to those of office rentals (mean R² of 0.93, mean MAE of 4,448). <br>
Note that a single model cannot predict both property types well; each one needs its own trained model. <br>
These cross-validation scores should be read with caution: in section 4.1 `price_usd` is still among the features and the target is not log-transformed, so they are not directly comparable to the hold-out results in section 4, where the same segment reaches an R² of 0.59. <br>
We will move forward with Oficina rentals only, leaving the Local comercial rental model as a possible future addition. <br>
Two test_size values were tried (0.2 and 0.3); 0.2 gave the better results.
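
Since each property type and operation needs its own model, one possible way to organize them is a dictionary keyed by segment. The sketch below uses synthetic listings and a plain `LinearRegression`; the column values, the slopes, and the `models` dictionary are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic listings (hypothetical values): two segments with different price behavior
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame({
    "property_type": ["Oficina"] * n + ["Local comercial"] * n,
    "operation_type": ["Alquiler"] * (2 * n),
    "surface_covered": rng.uniform(20, 200, size=2 * n),
})
slope = np.where(toy["property_type"] == "Oficina", 10.0, 25.0)
toy["price"] = slope * toy["surface_covered"] + rng.normal(scale=5.0, size=2 * n)

# One fitted model per (property_type, operation_type) key
models = {}
for key, group in toy.groupby(["property_type", "operation_type"]):
    models[key] = LinearRegression().fit(group[["surface_covered"]], group["price"])

# At prediction time, pick the model that matches the listing's segment
listing = pd.DataFrame({"surface_covered": [100.0]})
pred = models[("Oficina", "Alquiler")].predict(listing)[0]
print(sorted(models), round(pred, 1))
```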