# Data Challenge by Letizia Dimonopoli 3132775

First, I imported the datasets. I will proceed with Data Cleaning and Preparation.

In [1]:
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

## 1. Data Cleaning and Preparation
### a) Cleaning

In the instructions, it is indicated that the variable w in the train dataset is always 1, thus it can be discarded.

In [2]:
train.drop("w", axis = 1, inplace = True)

First, I want to visualize the count (and percentage) of the missing variables in the dataset to be able to start the data cleaning. I want to explore them before possibly removing them or imputing them.

In [3]:
total_train = train.isnull().sum().sort_values(ascending=False)
percent_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data_train = pd.concat([total_train, percent_train], axis=1, keys=['Total', 'Percent'])
missing_data_train

Unnamed: 0,Total,Percent
availability,645,0.143333
energy_efficiency_class,380,0.084444
condominium_fees,197,0.043778
conditions,109,0.024222
other_features,8,0.001778
floor,6,0.001333
y,0,0.0
square_meters,0,0.0
contract_type,0,0.0
description,0,0.0


We still have some variables that have quite a few missing variables. In particular, these variables are:
- availability,
- energy_efficiency_class, for which I will see how energy efficiency relates to the year of construction and the proximity to the center variable
- condominium_fees, for which I will impute the mean value per zone;
- conditions, which might be a little harder to deal with: the options are only "good condition", "new" or "excellent". Since we don't know the reason why one or the other was picked and it can be very subjective, we will impute "good condition" to all the missing ones, which is general enough;
- other_features, which I can't deduce from any of the other columns, thus I will simply put "unknown";
- floor, which again is hard to deduce, so I am going to impute the mode of the floor column.

Moreover, after having explored the data types of the observations for each column, I see that the elevator column has strings ("yes" and "no"), and I want to make it binary, thus using an OrdinalEncoder which will put 0 for every "no" and 1 for every "yes".

I am going to deal with the missing values of "other_features" and "floor" first, as they are the most straightforward as there is nothing I can do to deduce them. For the column "floor", I am also converting any non-numerical value into a numerical value for simplicity.

In [4]:
train['other_features'] = train['other_features'].apply(lambda x: ['unknown'] if pd.isna(x) else x)
test['other_features'] = test['other_features'].apply(lambda x: ['unknown'] if pd.isna(x) else x)

train['floor'] = train['floor'].fillna(int(train['floor'].mode()[0]))
test['floor'] = test['floor'].fillna(int(test['floor'].mode()[0]))

In [5]:
floor_mapping = {"ground floor": 0, "mezzanine": 1, "semi-basement": -1}

import re
def convert_floor(value):
  value = str(value).lower().strip()
  if value in floor_mapping:
      return floor_mapping[value]
  match = re.search(r'\d+', value)
  if match:
      return int(match.group())
  return int(value)

train['floor'] = train['floor'].apply(convert_floor)
test['floor'] = test['floor'].apply(convert_floor)

For my analysis, I really believe that the zone the appartment is located in is one of the variables that affects its price the most. Therefore, I transform the column "zone" into a list. I then transformed each zone name into its corresponding "via", "piazza", "corso", etc (I used the help of ChatGPT for this). I assumed all were places in Milan, e.g. Abbiategrasso is via Abbiategrasso and not the town. I decided to use the geopy library for scaling. Basically what I would like to do is to transform each zone into a readable area (e.g. "farini" will become "Farini, Milan") to feed it to the geocoder, which will output latitude and longitude coordinates. In this way, I can then use the coordinates to compute a distance from Duomo (coordinates: (45.4642, 9.1900), which I am considering as the center of Milan). Finally, I am going to assign a value to the distance of every zone. I will call this the "proximity_to_center", where 5 represents the furthest area and 1 represents the closest area (in this case the area of the Duomo itself). I inspected the results, it's not perfect but I believe it's good enough for my analyses.

In [6]:
zones_list = train["zone"].tolist()

def rewrite_zones(zones_list):
    rewritten = []

    for elem in zones_list:
        parts = elem.split(" - ")
        resolved_part = None
        ambiguous_names = {
        "abbiategrasso": "piazza abbiategrasso", "affori": "via affori","amendola": "via amendola","arco della pace": "arco della pace",
    "arena": "via arena","argonne": "viale argonne","ascanio sforza": "alzaia naviglio pavese","baggio": "via baggio",
    "bande nere": "via bande nere","barona": "via barona","bicocca": "viale bicocca","bignami": "viale fulvio testi",
    "bisceglie": "via bisceglie","bocconi": "via bocconi","bologna": "via emilia","borgogna": "via borgogna",
    "bovisa": "via bovisa","brenta": "viale brenta","brera": "via brera","bruzzano": "piazza bruzzano","buenos aires": "corso buenos aires",
    "buonarroti": "via buonarroti","ca' granda": "viale ca' granda",
    "cadore": "via cadore","cadorna": "piazzale cadorna","cantalupa": "via del mare","carrobbio": "via torino","via cesare correnti": "via leoncavallo",
    "cascina merlata": "via pier paolo pasolini","casoretto": "via casoretto","castello": "piazza castello","cenisio": "via cenisio",
    "centrale": "piazza duca d'aosta", "cermenate": "via cermenate",
    "certosa": "viale certosa","chiesa rossa": "via della chiesa rossa","cimiano": "via padova","città studi": "viale romagna",
    "city life": "piazza tre torri","comasina": "via comasina",
    "corsica": "viale corsica","corvetto": "piazzale corvetto","crescenzago": "via padova",
    "crocetta": "largo della crocetta","cuoco": "via cuoco",
    "darsena": "viale gorizia","de angeli": "piazza de angeli","dergano": "via dergano","dezza": "via dezza",
    "duomo": "piazza del duomo","faenza": "via faenza",
    "famagosta": "via famagosta","farini": "via carlo farini","fatima": "via diomede","frua": "via frua",
    "gallaratese": "via gallarate","gambara": "piazzale gambara","garibaldi": "corso garibaldi",
    "ghisolfa": "via ghisolfa",
    "giambellino": "via giambellino","gorla": "via gorla","gratosoglio": "via dei missaglia","greco": "via prospero finzi",
    "guastalla": "via francesco sforza","indipendenza": "corso indipendenza","inganni": "via inganni",
    "insubria": "piazza insubria","isola": "via borsieri","istria": "piazzale istria",
    "lambrate": "piazza gobetti","lanza": "via lanza","lodi": "corso lodi",
    "lorenteggio": "via lorenteggio","lotto": "piazzale lotto","mac mahon": "via mac mahon",
    "maggiolina": "via melchiorre gioia","manzoni": "via alessandro manzoni",
    "martini": "viale martini","mecenate": "via mecenate",
    "meda": "via meda","medaglie d'oro": "piazza medaglie d'oro",
    "melchiorre gioia": "via melchiorre gioia","missori": "piazza missori","molise": "via molise","montenero": "viale monte nero",
    "morgagni": "via morgagni","moscova": "via della moscova",
    "musocco": "via musocco","navigli": "alzaia naviglio grande",
    "niguarda": "piazza belloveso","ortica": "via ortica","pagano": "via mario pagano",
    "palestro": "via palestro","paolo sarpi": "via paolo sarpi",
    "parco trotter": "via giuseppe giacosa",
    "parco vittoria": "viale certosa","pasteur": "via pasteur","pezzotti": "via pezzotti",
    "piave": "viale piave","plebisciti": "corso plebisciti","ponale": "via ponale","ponte lambro": "via ponte lambro",
    "porta nuova": "piazza gae aulenti","porta romana": "corso di porta romana","porta venezia": "corso buenos aires",
    "porta vittoria": "viale monte nero","portello": "piazzale portello","prato centenaro": "viale ca' granda",
    "precotto": "via precotto","primaticcio": "via primaticcio","qt8": "piazza santa maria nascente",
    "quadrilatero della moda": "via della spiga","quadronno": "via quadronno",
    "quartiere adriano": "via adriano","quartiere feltre": "via feltre","quartiere forlanini": "via forlanini",
    "quarto cagnino": "via quarto cagnino","quarto oggiaro": "via quarto oggiaro",
    "repubblica": "piazza della repubblica","ripamonti": "via ripamonti",
    "rogoredo": "via rogoredo","rovereto": "via rovereto","rubattino": "via rubattino",
    "san babila": "piazza san babila","san siro": "piazzale dello sport","sant'ambrogio": "piazza sant'ambrogio",
    "santa giulia": "via cassinari","scala": "piazza della scala","segnano": "viale fulvio testi",
    "sempione": "corso sempione","soderini": "via soderini","solari": "via solari",
    "sulmona": "via sulmona","susa": "piazzale susa","ticinese": "corso di porta ticinese","tre castelli": "via tre castelli",
    "trenno": "via trenno","tricolore": "piazza del tricolore",
    "tripoli": "via tripoli","turati": "via filippo turati","turro": "via turro",
    "udine": "piazzale udine","vercelli": "corso vercelli","vigentino": "via vigentino","villa san giovanni": "viale monza",
    "vincenzo monti": "via vincenzo monti","wagner": "piazza wagner",
    "washington": "via luigi pirandello","zara": "viale zara","monte rosa": "via monterosa",
    "ponte nuovo": "via ponte nuovo","san carlo": "piazza san carlo","san paolo": "via san paolo",
    "san vittore": "via san vittore","abbiategrasso": "piazza abbiategrasso"}
        for part in parts:
            key = part.strip().lower()
            resolved_part = ambiguous_names.get(key, part.strip())
            break  # stop after first relevant part

        formatted = ', '.join([resolved_part.strip(), "Milano", "Italia"])
        rewritten.append(formatted)

    return rewritten

new_zone_names = rewrite_zones(zones_list)
train["zone_rewritten"] = rewrite_zones(train["zone"].tolist())

test["zone_rewritten"] = rewrite_zones(test["zone"].tolist())

In [7]:
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
import time
import pandas as pd

duomo_coords = (45.4642, 9.1900) #duomo coordinates
geolocator = Nominatim(user_agent="my_milan_script")

#THIS IS TO PREVENT IT RUNNING OVER AND OVER BECAUSE IT TAKES A WHILE
try:
    coords_df = pd.read_csv("zone_coordinates.csv")
    coords_dict = dict(zip(coords_df['zone'], zip(coords_df['lat'], coords_df['lon'], coords_df['distance_km'])))
except FileNotFoundError:
    coords_dict = {}

latitude_list = []
longitude_list = []
distance_km_list = []
proximity_to_center_list = []

for zone in train["zone_rewritten"]:
    if zone in coords_dict:
        lat, lon, dist = coords_dict[zone]
        latitude_list.append(lat)
        longitude_list.append(lon)
        distance_km_list.append(dist)
    else:
        location = geolocator.geocode(zone, timeout=5)
        time.sleep(1)
        if location:
            coords = (location.latitude, location.longitude)
            distance_km = geodesic(coords, duomo_coords).km
            latitude_list.append(coords[0])
            longitude_list.append(coords[1])
            distance_km_list.append(distance_km)
            coords_dict[zone] = (coords[0], coords[1], distance_km)
        else:
            coords_dict[zone] = (None, None, None)
            latitude_list.append(None)
            longitude_list.append(None)
            distance_km_list.append(None)
            print(f"NOT FOUND: {zone}")

train["lat"] = latitude_list
train["lon"] = longitude_list
train["distance_km"] = distance_km_list

def classify_proximity(distance):
    if distance is None:
        return None
    elif distance <= 1:
        return 1
    elif distance <= 5:
        return 2
    elif distance <= 10:
        return 3
    elif distance <= 15:
        return 4
    else:
        return 5

train["proximity_to_center"] = train["distance_km"].apply(classify_proximity)

# TO SAVE IT FOR NEXT USE
#pd.DataFrame([
#    {'zone': k, 'lat': v[0], 'lon': v[1], 'distance_km': v[2]}
#    for k, v in coords_dict.items()
#]).to_csv("zone_coordinates.csv", index=False)

In [8]:
#TEST SET
latitude_list_test = []
longitude_list_test = []
distance_km_list_test = []
proximity_to_center_list_test = []

for zone_test in test["zone_rewritten"]:
    if zone_test in coords_dict:
        lat, lon, dist = coords_dict[zone_test]
        latitude_list_test.append(lat)
        longitude_list_test.append(lon)
        distance_km_list_test.append(dist)

test["distance_km"] = distance_km_list_test
test["proximity_to_center"] = test["distance_km"].apply(classify_proximity)

In [9]:
test["lat"] = latitude_list_test
test["lon"] = longitude_list_test

In [10]:
train.drop(columns=['zone'], inplace=True)

test.drop(columns=['zone'], inplace=True)

In [11]:
train_unique = train.drop_duplicates(subset=['zone_rewritten'])

smallest_proximity = train_unique.nsmallest(10, 'proximity_to_center')
largest_proximity = train_unique.nlargest(10, 'proximity_to_center')

print("10 Smallest Proximity Zones:")
print(smallest_proximity[['zone_rewritten', 'proximity_to_center']])
print("\n10 Largest Proximity Zones:")
print(largest_proximity[['zone_rewritten', 'proximity_to_center']])

10 Smallest Proximity Zones:
                               zone_rewritten  proximity_to_center
20                  via brera, Milano, Italia                    1
23       via francesco sforza, Milano, Italia                    1
46             piazza missori, Milano, Italia                    1
51         piazza della scala, Milano, Italia                    1
61                 via torino, Milano, Italia                    1
129          piazza del duomo, Milano, Italia                    1
201   corso di porta ticinese, Milano, Italia                    1
390           via della spiga, Milano, Italia                    1
820         piazza san babila, Milano, Italia                    1
1799         piazza san carlo, Milano, Italia                    1

10 Largest Proximity Zones:
                           zone_rewritten  proximity_to_center
9               via lanza, Milano, Italia                    5
24            via pasteur, Milano, Italia                    5
48          via 

Furthermore, I notice that the column "condominium_fees" has 197 missing values. I decide to impute them, and I choose to fill missing values based on the mean fee of each zone of Milan. This is because I believe that zones have pretty much similar buildings, thus their respective fees (also after a careful manual analysis) should be more or less similar. In fact, we can see that the highest condominium fees are in piazza Tre Torri (CityLife area), which a mean of 481€, while the lowest ones are in via Ortica (close to Parco Forlanini), with a mean of around 58€. This makes sense.

In [12]:
zone_fee_means = train.groupby("zone_rewritten")["condominium_fees"].mean()
def impute_fee(row):
    if pd.isna(row["condominium_fees"]):
        return zone_fee_means.get(row["zone_rewritten"], None)
    else:
        return row["condominium_fees"]
train["condominium_fees"] = train.apply(impute_fee, axis=1)

zone_fee_means_test = test.groupby("zone_rewritten")["condominium_fees"].mean()
def impute_fee(row):
    if pd.isna(row["condominium_fees"]):
        return zone_fee_means_test.get(row["zone_rewritten"], None)
    else:
        return row["condominium_fees"]
test["condominium_fees"] = test.apply(impute_fee, axis=1)

In [13]:
zone_fee_means.idxmax(), zone_fee_means.max()

('piazza tre torri, Milano, Italia', 481.35714285714283)

In [14]:
zone_fee_means.idxmin(), zone_fee_means.min()

('via ortica, Milano, Italia', 58.333333333333336)

I do the same with the 380 missing values of the column "energy_efficiency_class". This is because I believe that similar areas have buildings that have been built in approximately the same decade, and they are supposed to have the same energy class (even though might have been renovated, but there's nothing I can do to control that). The code here is slightly different as the variables are categorical and not numerical anymore.

In [15]:
train["energy_efficiency_class"] = train["energy_efficiency_class"].astype(str).str.lower().str.strip()
valid_classes = {"a", "b", "c", "d", "e", "f", "g"}
train.loc[~train["energy_efficiency_class"].isin(valid_classes), "energy_efficiency_class"] = np.nan

zone_energy = train.dropna(subset=["energy_efficiency_class"]).groupby("zone_rewritten")["energy_efficiency_class"].agg(lambda x: x.mode()[0])

def impute_energy_class(row):
    val = row["energy_efficiency_class"]
    if pd.isna(val):
        val = zone_energy.get(row["zone_rewritten"], "g")
    return str(val).lower().strip()

energy_map = {"a":1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7}

train["energy_efficiency_class"] = train.apply(impute_energy_class, axis=1)
train["energy_efficiency_class"] = train["energy_efficiency_class"].map(energy_map)

In [16]:
zone_energy_test = test.dropna(subset=["energy_efficiency_class"]).groupby("zone_rewritten")["energy_efficiency_class"].agg(lambda x: x.mode()[0])

def impute_energy_class(row):
    if pd.isna(row["energy_efficiency_class"]):
        return zone_energy_test.get(row["zone_rewritten"], "g")
    else:
        return row["energy_efficiency_class"]

test["energy_efficiency_class"] = test.apply(impute_energy_class, axis=1)
test["energy_efficiency_class"] = test["energy_efficiency_class"].str.lower().map(energy_map)

For the variable availability, 14% of the data has missing values. It is way too much for it to be removed completely, so I'm just going to assume these apartments are available right away as there is no way I can infer a precise date of availability from the data. The last variable I need to deal with is conditions. As mentioned before, conditions of a house for rent are very subjective, and the available options for this dataset are only "new", "excellent" and "good condition". I will assume that all the missing ones are in good conditions, but it is not a reliable variable so I am not going to use it much in my analysis anyway.

In [17]:
train["availability"].fillna("available", inplace=True)
train["conditions"].fillna("good condition", inplace=True)

test["availability"].fillna("available", inplace=True)
test["conditions"].fillna("good condition", inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train["availability"].fillna("available", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train["conditions"].fillna("good condition", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which 

In [18]:
condition_mapping = {"new": 1, "excellent": 2, "good condition": 3}

train["conditions"] = train["conditions"].map(condition_mapping)
test["conditions"] = test["conditions"].map(condition_mapping)

In [19]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_map = {"no": 0, "yes": 1}
ordinal_encoder = OrdinalEncoder(categories=[list(ordinal_map.keys())])
train["elevator"] = ordinal_encoder.fit_transform(train[["elevator"]]).astype(int)

test["elevator"] = ordinal_encoder.fit_transform(test[["elevator"]]).astype(int)

### b) Preparation
Then, after having observed a bit more the train dataset, I see that the columns "description" and "other_features" actually have a lot of different material inside them. Thus, I import re and use some NLP techniques to extract what I am interested in. For example, I don't really care about the type of kitchen (e.g. "kitchenette", "open kitchen", etc) but I believe it is important to have a kitchen compared to not having it. I would also have a few more columns: one indicating the number of rooms (excluding bathroom(s) and kitchen), one for the number of bedrooms, and one for the number of bathrooms.

In [20]:
import re

def parse_description(desc):
    #is there a kitchen?
    kitchen_present = int("kitchen" in desc.lower())
    #total rooms
    room_match = re.match(r'(\d+)', desc)
    total_rooms = int(room_match.group(1)) if room_match else None
    #nr bedrooms
    bedroom_match = re.search(r'(\d+)\s+bedroom', desc)
    bedrooms = int(bedroom_match.group(1)) if bedroom_match else 0
    #nr bathrooms
    bathroom_match = re.search(r'(\d+)\s+bathroom', desc)
    bathrooms = int(bathroom_match.group(1)) if bathroom_match else 0
    # Calculate other rooms (excluding kitchen and bathrooms)
    other_rooms = total_rooms if total_rooms is not None else bedrooms
    return pd.Series({
        "kitchen_present": kitchen_present,
        "nr_rooms": other_rooms,
        "nr_bedrooms": bedrooms,
        "nr_bathrooms": bathrooms})

parsed_features = train["description"].apply(parse_description)
train = pd.concat([train, parsed_features], axis=1)

parsed_features_test = test["description"].apply(parse_description)
test = pd.concat([test, parsed_features_test], axis=1)

The column "other_features" is trickier, as I am not interested in all the features that are mentioned. In particular, I don't believe that the following would contribute in a significantly higher rent price: video entryphone, optic fiber, security door, centralized TV system, concierge (unless luxury property), internal/external exposure, window frame materials, alarm system, closet, electric gate. It could make the apartment more prestigious and all of them together might be significant in raising the price, but not one by one. Therefore I do not want to create dummies for these features.

However, after a careful analysis, I see other features I am more interested in:
- balcony
- terrace
- private garden
- shared garden
- furnished /partially furnished

I am aware they are not exactly the same, but I will classify the first four in the same column to not have too many dummies, and then I will create another column for "furnished" and "partially furnished". I see there is also the option "only kitchen furnished", but I also don't consider it important enough.

In [21]:
features_lower = train["other_features"]
features_lower_test = test["other_features"]

#outdoors
outdoor_keywords = ["balcony", "terrace", "private garden", "shared garden"]
train["has_outdoor_space"] = features_lower.apply(lambda x: int(any(feature in x for feature in outdoor_keywords)))

test["has_outdoor_space"] = features_lower_test.apply(lambda x: int(any(feature in x for feature in outdoor_keywords)))

#furnished?
def get_furnishing_status(text):
    if "partially furnished" in text:
        return 1
    elif "furnished" in text:
        return 1
    else:
        return 0

train["is_furnished"] = features_lower.apply(get_furnishing_status) #contains also partially furnished
test["is_furnished"] = features_lower_test.apply(get_furnishing_status) #contains also partially furnished

In [22]:
total_train = train.isnull().sum().sort_values(ascending=False)
percent_train = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data_train = pd.concat([total_train, percent_train], axis=1, keys=['Total', 'Percent'])
missing_data_train

Unnamed: 0,Total,Percent
y,0,0.0
square_meters,0,0.0
contract_type,0,0.0
availability,0,0.0
description,0,0.0
other_features,0,0.0
conditions,0,0.0
floor,0,0.0
elevator,0,0.0
energy_efficiency_class,0,0.0


In [23]:
total_test = test.isnull().sum().sort_values(ascending=False)
percent_test = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
missing_data_test = pd.concat([total_test, percent_test], axis=1, keys=['Total', 'Percent'])
missing_data_test

Unnamed: 0,Total,Percent
square_meters,0,0.0
contract_type,0,0.0
availability,0,0.0
description,0,0.0
other_features,0,0.0
conditions,0,0.0
floor,0,0.0
elevator,0,0.0
energy_efficiency_class,0,0.0
condominium_fees,0,0.0


No missing data is left now and everything is prepared for my analyses. I also did some checks for the outliers with the help of boxplots but they were not significant at this point of the analysis. I didn't want to remove anything that I might have needed later. Therefore, I start doing my models now.

In [25]:
from sklearn.cluster import KMeans
zone_avg_price = train.groupby('zone_rewritten')['y'].mean().to_dict()
train['zone_avg_price'] = train['zone_rewritten'].map(zone_avg_price)
test['zone_avg_price'] = test['zone_rewritten'].map(zone_avg_price)

# === 3. Add zone cluster (k-means) ===
kmeans = KMeans(n_clusters=10, random_state=42)
train['zone_cluster'] = kmeans.fit_predict(train[['lat', 'lon']])
test['zone_cluster'] = kmeans.predict(test[['lat', 'lon']])

## 2. Models


### CATBOOST

In [26]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m99.2/99.2 MB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: catboost
Successfully installed catboost-1.2.8


In [27]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from catboost import CatBoostRegressor
from sentence_transformers import SentenceTransformer

# === 1. Prepare base features ===
drop_cols = ['y', 'contract_type', 'availability', 'description',
             'other_features', 'zone_rewritten', "nr_rooms", "proximity_to_center"]
X_base = train.drop(columns=drop_cols)
y = train['y']

# Add new features to structured base
X_base['zone_avg_price'] = train['zone_avg_price']
X_base['zone_cluster'] = train['zone_cluster']
X_test_base = test[X_base.columns]
X_test_base['zone_avg_price'] = test['zone_avg_price']
X_test_base['zone_cluster'] = test['zone_cluster']

# === 4. Scale structured data ===
scaler = MinMaxScaler()
X_base_scaled = scaler.fit_transform(X_base)
X_test_base_scaled = scaler.transform(X_test_base)

# === 6. Combine structured + text ===
X_train_combined = np.hstack([X_base_scaled])
X_test_combined = np.hstack([X_test_base_scaled])

# === 7. Train CatBoost on combined features ===
cat_model = CatBoostRegressor(
    iterations=1000,
    depth=8,
    learning_rate=0.05,
    loss_function='MAE',
    random_seed=42,
    early_stopping_rounds=50,
    verbose=100
)
cat_model.fit(X_train_combined, y)

# === 8. Predict and save ===
test_predictions = cat_model.predict(X_test_combined)
pd.Series(test_predictions).to_csv("DIMONOPOLI_3132775_FINAL.txt", index=False, header=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_base['zone_avg_price'] = test['zone_avg_price']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_base['zone_cluster'] = test['zone_cluster']


0:	learn: 793.9926097	total: 52.6ms	remaining: 52.5s
100:	learn: 301.8053951	total: 580ms	remaining: 5.16s
200:	learn: 264.2375154	total: 1.06s	remaining: 4.21s
300:	learn: 239.6277462	total: 1.55s	remaining: 3.6s
400:	learn: 221.9699682	total: 2.05s	remaining: 3.06s
500:	learn: 208.2711496	total: 2.52s	remaining: 2.51s
600:	learn: 198.4053827	total: 3.02s	remaining: 2s
700:	learn: 191.2817219	total: 3.49s	remaining: 1.49s
800:	learn: 183.6157058	total: 3.99s	remaining: 990ms
900:	learn: 176.6408810	total: 4.45s	remaining: 489ms
999:	learn: 172.4918733	total: 4.95s	remaining: 0us
