# Short notebook

**Kaggle competition**: *Moscow Housing*

**Kaggle team name**: *Team 16*

**Team members**: Name - studentId
- Laure Beringer - 
- Hasse Rombouts - 566536
- Henrik Fjellheim - 490763

**Submission 1**: Bagging_copy_2.csv with public score 0.15661

**Submission 2**: LGBM_2_copy_2.csv with public score 0.15653

## Short about the final submission

The final submission is a "*bagging*" prediction, combining several submissions using a weighted average. We tried bagging with a lot of different models, both stand-alone models, and stacked meta-models. The models were trained individually on different aspects of the dataset, and used to create "Out-of-fold" (oof) datasets to train a meta-models; the *stacking* process, and to combine model predictions using a weighted average function; the *bagging* process. 

The advantage of having a wide variaty of models, training on different targets and with different labels, was that their final predictions would be biased differently - one model would not result in the same errors as the next. Using bagging and stacking, meta-models and meta-predictions can take advantage of this diversity, producing better results than any of the individual models could do by themselves.

The final and best submission however ended up needing only three, relatively simple but state of the art, models in the bag. 

Note that during the submissions, something went wrong whith using the random state and reproducibility. We were thus unable to reproduce our absolute best solution (Bagging.csv with public score 0.15610) which is why we did not submit this on Kaggle. This bagging submission used the files: LGMB2.csv + CatBoost.csv + GradientBoost.csv (GB.csv on Kaggle), so we also included them in the repository.

## What is in the bag?

1. **LGBM_2_copy_2.csv**
2. **Catboost_copy_2.csv**
3. **GradientBoost_copy_2.csv**

## Reproducing the Submission
To reproduce this submission the following steps must be taken:
* Ready the data
    > - Import the data (note that the data needs to be stored one level above this notebook in a data folder, for example: `../data/apartments_train.csv`)
    > - Clean the data
    > - Add new features
    > - Normalize the data
    > - Dependent on the model: perform one-hot-encoding
    > - Drop unnecessary features
* Declare the models with optimal parameters
* Train each models
* Predict the testing data using each model
* Bag the result
    > - Load all .csv files
    > - Use weighted averages to determine the importance of each model prediction in the final result
    > - Store final result
    

### Preparing the Data
#### Imports

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import math

#### Helper Functions

In [2]:
def load_all_data(fraction_of_data=1, apartment_id='apartment_id',path=None):
    
    if path is not None:
        # Metadata
        metaData_apartment = pd.read_json(path+'../data/apartments_meta.json')
        metaData_building = pd.read_json(path+'../data/buildings_meta.json')
        metaData = pd.concat([metaData_apartment, metaData_building])

        # Train
        train_apartment = pd.read_csv(path+'../data/apartments_train.csv')
        train_building = pd.read_csv(path+'../data/buildings_train.csv')

        # Test
        test_apartment = pd.read_csv(path+'../data/apartments_test.csv')
        test_building = pd.read_csv(path+'../data/buildings_test.csv')
    
    else:
    # Metadata
        metaData_apartment = pd.read_json('../data/apartments_meta.json')
        metaData_building = pd.read_json('../data/buildings_meta.json')
        metaData = pd.concat([metaData_apartment, metaData_building])

        # Train
        train_apartment = pd.read_csv('../data/apartments_train.csv')
        train_building = pd.read_csv('../data/buildings_train.csv')

        # Test
        test_apartment = pd.read_csv('../data/apartments_test.csv')
        test_building = pd.read_csv('../data/buildings_test.csv')
  
    train = pd.merge(train_apartment, train_building, left_on='building_id', right_on='id')
    train.rename(columns={'id_x' : apartment_id}, inplace=True)
    train.drop('id_y', axis=1, inplace=True)
    train = train.head(int(train.shape[0] * fraction_of_data))


    test = pd.merge(test_apartment, test_building, left_on='building_id', right_on='id')
    test.rename(columns={'id_x' : apartment_id}, inplace=True)
    test.drop('id_y', axis=1, inplace=True)

    return train, test, metaData

def clean_data(train, test,
                 features, float_numerical_features, int_numerical_features, cat_features,
                 log_targets=True, log_area=True, fillNan=True, log1_area=False, log1_targets=False):
    '''Clean the data according to best knowledge so far...
   
    '''



    # Extract
    train_labels = train[features]
    test_labels = test[features]
    train_targets = train['price']

    # Log targets
    if log_targets:
        train_targets = np.log(train_targets)
    elif log1_targets:
        print("log1p targets!")
        train_targets = np.log1p(train_targets)

    ## ------------------------------------------------------------------------------------------------ ##
    # TODO: Shoul we use thise at all?
    # Fill nans using correlated features
    train_labels = fillnaReg(train_labels, ['area_total'], 'area_living')
    test_labels = fillnaReg(test_labels, ['area_total'], 'area_living')
    # Area_kitchen
    train_labels = fillnaReg(train_labels, ['area_total', 'area_living'], 'area_kitchen')
    test_labels = fillnaReg(test_labels, ['area_total', 'area_living'], 'area_kitchen')
    # ceiling
    is_outlier = ((train_labels["ceiling"] > 10) | (train_labels["ceiling"] < 0.5))
    outliers = train_labels.copy()[is_outlier]
    inliers_index=[]
    for index in train_labels.index:
        if index not in outliers.index:
            inliers_index.append(index)
    train_labels.loc[outliers.index,['ceiling']] = train_labels.loc[inliers_index,['ceiling']].mean()

    is_outlier = ((test_labels["ceiling"] > 10) | (test_labels["ceiling"] < 0.5))
    outliers = test_labels.copy()[is_outlier]
    for index in test_labels.index:
        if index not in outliers.index:
            inliers_index.append(index)
    test_labels.loc[outliers.index,['ceiling']] = test_labels.loc[test_labels.index.intersection(inliers_index),['ceiling']].mean()

    ## ------------------------------------------------------------------------------------------------ ##

    # Remove zero area living
    # TODO: why not > 0?
    remove_zero = [row["area_living"] if row["area_living"] >= 1 else row["area_total"]*(train_labels["area_living"].mean() / train_labels["area_total"].mean()) for _, row in train_labels.iterrows()] 
    train_labels["area_living"] = remove_zero

    remove_zero = [row["area_living"] if row["area_living"] >= 1 else row["area_total"]*(test_labels["area_living"].mean() / test_labels["area_total"].mean()) for _, row in test_labels.iterrows()] 
    test_labels["area_living"] = remove_zero

    # Log the areas
    fs = ["area_total", "area_living", "area_kitchen"]
    if log_area:
        for feature in fs:
            # Logging
            train_labels[feature] = np.log(train_labels[feature])
            test_labels[feature] = np.log(test_labels[feature])
    elif log1_area:
        print("log1p area!!")
        for feature in fs:
            # Logging
            train_labels[feature] = np.log1p(train_labels[feature])
            test_labels[feature] = np.log1p(test_labels[feature])


    # remove upper Strip
    remove_upper_stripe = [row["area_living"] if row["area_living"] < row["area_total"] else row["area_total"]*(train_labels["area_living"].mean() / train_labels["area_total"].mean()) for _, row in train_labels.iterrows()] 
    train_labels["area_living"] = remove_upper_stripe

    remove_upper_stripe = [row["area_living"] if row["area_living"] < row["area_total"] else row["area_total"]*(test_labels["area_living"].mean() / test_labels["area_total"].mean()) for _, row in test_labels.iterrows()] 
    test_labels["area_living"] = remove_upper_stripe
    ## ------------------------------------------------------------------------------------------------ ##


    # Fillnans of the two lat/log.
    # Insert median district
    #unknown_index = test_labels[["district", "latitude", "longitude"]][test_labels["latitude"].isna()==True].index
    #test_labels.loc[unknown_index,['district']] = test_labels["district"].median()
    ## Mean the long/lat
    #test_labels["longitude"] = test_labels["longitude"].fillna(test_labels["longitude"].mean())
    #test_labels["latitude"] = test_labels["latitude"].fillna(test_labels["latitude"].mean())

    # Both nans in test-data is in same known street.
    # (37.470959, 55.570540)
    is_na = (test_labels["longitude"].isna())
    nas = test_labels.copy()[is_na]
    test_labels.loc[nas.index,['longitude']] = 37.470959
    test_labels.loc[nas.index,['latitude']] = 55.570540
    

    # Fix houses on the "north pole".
    is_outlier = (test_labels["longitude"] > 39) | (test_labels["longitude"] < 35)
    outliers = test_labels.copy()[is_outlier]
    for index in outliers.index:
        if test_labels.loc[index,['street']][0]=="Бунинские Луга ЖК":
            test_labels.loc[index,['longitude']] = 37.482297
            test_labels.loc[index,['latitude']] = 55.543673
        elif test_labels.loc[index,['street']][0]=="улица Центральная":
            test_labels.loc[index,['longitude']] = 37.640383
            test_labels.loc[index,['latitude']] = 55.566131
        else:
            test_labels.loc[index,['longitude']] = train_labels["longitude"].mean()
            test_labels.loc[index,['latitude']] = train_labels["latitude"].mean()
            test_labels.loc[index,['street']] = "Ленинский проспект"

    
    # Fill districts using long/lat
    district_centre_long = np.array(train_labels[["district", "longitude"]].groupby(['district']).mean())
    district_centre_lat = np.array(train_labels[["district", "latitude"]].groupby(['district']).mean())
    remove_nan_districts = [closest_district(row["latitude"],
     row["longitude"], district_centre_long, district_centre_lat) if math.isnan(row["district"]) else row["district"] for _, row in train_labels.iterrows()] 
    train_labels["district"] = remove_nan_districts
    remove_nan_districts = [closest_district(row["latitude"],
     row["longitude"], district_centre_long, district_centre_lat) if math.isnan(row["district"]) else row["district"] for _, row in test_labels.iterrows()] 
    test_labels["district"] = remove_nan_districts

    # Fix the seller with unknown category
    unknown_category = len(list(train_labels["seller"].unique())) - 1
    train_labels["seller"] = train_labels["seller"].fillna(unknown_category)
    test_labels["seller"] = test_labels["seller"].fillna(unknown_category)
    ## --------------------------------------------------------------------------------------------------------------- ##
    if fillNan:
        # Taking care of the string features first... NB! no nans in string data.
        # Fill rest with median or mean
        # Float
        train_labels[float_numerical_features] = train_labels[float_numerical_features].fillna(train_labels[float_numerical_features].mean())
        # Int
        train_labels[int_numerical_features] = train_labels[int_numerical_features].fillna(train_labels[int_numerical_features].median())
        # Cat
        train_labels[cat_features] = train_labels[cat_features].fillna(train_labels[cat_features].median())
        # Bool (The rest)
        train_labels = train_labels.fillna(train_labels.median()) # Boolean

        # Float
        test_labels[float_numerical_features] = test_labels[float_numerical_features].fillna(test_labels[float_numerical_features].mean())
        # Int
        test_labels[int_numerical_features] = test_labels[int_numerical_features].fillna(test_labels[int_numerical_features].median())
        # Cat
        test_labels[cat_features] = test_labels[cat_features].fillna(test_labels[cat_features].median())
        # Bool (The rest)
        test_labels = test_labels.fillna(test_labels.median()) # Boolean

    if "constructed" in features:
        # Is supposed to be integer.
        train["constructed"] = np.asarray(train["constructed"]).astype("int")
        test["constructed"] = np.asarray(test["constructed"]).astype("int")

    return train_labels, train_targets, test_labels

def feature_engineering(train_labels, test_labels,
    add_base_features=True, 
    add_bool_features=True,
    add_weak_features=False,
    add_dist_to_metro=False,
    add_close_to_uni=False,
    add_dist_to_hospital=False,
    add_floor_features=False,
    add_street_info=False,
    add_some_more_features=False,
    add_district_information=False,
    ):

    if add_base_features:
        added_features = []
        # Add R and theta
        train_labels, test_labels = polar_coordinates(train_labels, test_labels)
        added_features.append("r")
        added_features.append("theta")

        # Add "Spacious_rooms": area per room
        train_labels['spacious_rooms'] = train_labels['area_total'] / train_labels['rooms']
        test_labels['spacious_rooms'] = test_labels['area_total'] / test_labels['rooms']
        added_features.append('spacious_rooms')

        # Newly_built
        is_new = [1 if row["constructed"] >= 2000 else 0 for _, row in train_labels.iterrows()] 
        train_labels["actually_new"] = is_new
        is_new = [1 if row["constructed"] >= 2000 else 0 for _, row in test_labels.iterrows()] 
        test_labels["actually_new"] = is_new
        added_features.append("actually_new")

        train_labels["rel_living"] = np.asarray(train_labels["area_living"] / train_labels["area_total"]).astype("float32")
        test_labels["rel_living"] = np.asarray(test_labels["area_living"] / test_labels["area_total"]).astype("float32")
        added_features.append("rel_living")
            
        train_labels["total_bathrooms"] = np.asarray(train_labels["bathrooms_private"] + train_labels["bathrooms_shared"]).astype("int")
        test_labels["total_bathrooms"] = np.asarray(test_labels["bathrooms_private"] + test_labels["bathrooms_shared"]).astype("int")
        added_features.append("total_bathrooms")

    if add_bool_features:
        # Some boolean features
        train_labels["multiple_balconies"] = np.asarray((train_labels["balconies"]>1)).astype("int")
        test_labels["multiple_balconies"] = np.asarray((test_labels["balconies"]>1)).astype("int")
        added_features.append("multiple_balconies")

        train_labels["multiple_loggias"] = np.asarray((train_labels["loggias"]>1)).astype("int")
        test_labels["multiple_loggias"] = np.asarray((test_labels["loggias"]>1)).astype("int")
        added_features.append("multiple_loggias")

        train_labels["both_windows"] = np.asarray((train_labels["windows_court"]==True) & (train_labels["windows_street"]==True)).astype("int")
        test_labels["both_windows"] = np.asarray((test_labels["windows_court"]==True) & (test_labels["windows_street"]==True)).astype("int")
        added_features.append("both_windows")

    if add_weak_features:
        # Weakly correlated ones.
        train_labels["rel_kitchen"] = np.asarray(train_labels["area_kitchen"] / train_labels["area_total"]).astype("float32")
        test_labels["rel_kitchen"] = np.asarray(test_labels["area_kitchen"] / test_labels["area_total"]).astype("float32")
        added_features.append("rel_kitchen")

        train_labels['rel_height'] = np.asarray(train_labels["floor"] / train_labels["stories"]).astype("float32")
        test_labels['rel_height'] = np.asarray(test_labels["floor"] / test_labels["stories"]).astype("float32")
        added_features.append('rel_height')

    if add_dist_to_metro:
        # Pre-calculated
        dist_to_metro_train = np.loadtxt("./external_datasets/metro_distances_train.csv")
        dist_to_metro_test = np.loadtxt("./external_datasets/metro_distances_test.csv")
        # Meter variant
        train_labels["dist_to_metro_m"] = dist_to_metro_train * (2*np.pi*(6371000) / 360)
        test_labels["dist_to_metro_m"] = dist_to_metro_test * (2*np.pi*(6371000) / 360)
        added_features.append('dist_to_metro_m')
        # Walking distance
        train_labels["metro_walking_distance"] = np.asarray(train_labels["dist_to_metro_m"]<train_labels["dist_to_metro_m"].median()).astype("int")
        test_labels["metro_walking_distance"] = np.asarray(test_labels["dist_to_metro_m"]<test_labels["dist_to_metro_m"].median()).astype("int")
        added_features.append('metro_walking_distance')
    
    if add_close_to_uni:
        # Uni location
        uni_location = (37.5286, 55.7039)
        ## Distance to state university
        dist_to_uni_train = np.zeros(len(train_labels))
        for i, row_t in train_labels.iterrows():
            apartment = (row_t["longitude"], row_t["latitude"])
            dist_to_uni_train[i] = eucledian_distance(apartment, uni_location)
        dist_to_uni_test = np.zeros(len(test_labels))
        for i, row_t in test_labels.iterrows():
            apartment = (row_t["longitude"], row_t["latitude"])
            dist_to_uni_test[i] = eucledian_distance(apartment, uni_location)
        # Meter edition
        dist_to_uni_train_meters = dist_to_uni_train*(2*np.pi*(6371000) / 360)
        dist_to_uni_test_meters = dist_to_uni_test*(2*np.pi*(6371000) / 360)
        # Close to uni 
        train_labels["close_to_uni"] = np.asarray((dist_to_uni_train_meters<2000)).astype("int")
        test_labels["close_to_uni"] = np.asarray((dist_to_uni_test_meters<2000)).astype("int")
        added_features.append('close_to_uni')

    if add_dist_to_hospital:
        # Location of state hospital
        hosp_location = (37.389167, 55.746389)
        ## Distance to state university
        dist_to_hosp_train = np.zeros(len(train_labels))
        for i, row_t in train_labels.iterrows():
            apartment = (row_t["longitude"], row_t["latitude"])
            dist_to_hosp_train[i] = eucledian_distance(apartment, hosp_location)
        dist_to_hosp_test = np.zeros(len(test_labels))
        for i, row_t in test_labels.iterrows():
            apartment = (row_t["longitude"], row_t["latitude"])
            dist_to_hosp_test[i] = eucledian_distance(apartment, hosp_location)

        dist_to_hosp_train_meters = dist_to_hosp_train*(2*np.pi*(6371000) / 360)
        train_labels["dist_to_hospital_m"] = dist_to_hosp_train_meters
        dist_to_hosp_test_meters = dist_to_hosp_test*(2*np.pi*(6371000) / 360)
        test_labels["dist_to_hospital_m"] = dist_to_hosp_test_meters
        added_features.append('dist_to_hospital_m')

    if add_floor_features:
        train_labels["lives_in_highrise"] = np.asarray((train_labels["stories"]>30)).astype("int")
        test_labels["lives_in_highrise"] = np.asarray((test_labels["stories"]>30)).astype("int")
        added_features.append('lives_in_highrise')

        train_labels["first_floor"] = np.asarray(train_labels["floor"]==1).astype("int")
        test_labels["first_floor"] = np.asarray(test_labels["floor"]==1).astype("int")
        added_features.append('first_floor')
        
        train_labels["floor_inverse"] = train_labels["stories"]-train_labels["floor"]
        median_floor_inverse = np.median(train_labels["floor_inverse"])
        train_labels["floor_inverse"]=train_labels["floor_inverse"].where(train_labels["floor_inverse"]>=0, other=median_floor_inverse)
        test_labels["floor_inverse"] = test_labels["stories"]-test_labels["floor"]
        test_labels["floor_inverse"]=test_labels["floor_inverse"].where(test_labels["floor_inverse"]>=0, other=median_floor_inverse)
        added_features.append('floor_inverse')

    if add_street_info:
        # Seafront = набережная.
        # This was pretty much the only one correlated to price.
        on_type = [1 if "набережная" in row["street"] else 0 for _, row in train_labels.iterrows()] 
        train_labels["on_seafront"] = on_type
        on_type = [1 if "набережная" in row["street"] else 0 for _, row in test_labels.iterrows()] 
        test_labels["on_seafront"] = on_type
        added_features.append('on_seafront')
        # Also, these streets seemed good to live in.
        good_streets = ["Мосфильмовская улица", "набережная Пресненская",
         "улица Ефремова", "Казарменный переулок"]
        for street_name in good_streets:
            on_street = [1 if street_name in row["street"] else 0 for _, row in train_labels.iterrows()] 
            train_labels["in_"+street_name] = on_street
        for street_name in good_streets:
            on_street = [1 if street_name in row["street"] else 0 for _, row in test_labels.iterrows()] 
            test_labels["in_"+street_name] = on_street
            added_features.append("in_"+street_name)
        
    if add_some_more_features:
        # This is moved to data_cleaning.
        #train_labels = train_labels.astype({'constructed':'int'})
        #test_labels = test_labels.astype({'constructed':'int'})

        train_labels["area_floor"] = train_labels["area_total"] / train_labels["floor"]
        train_labels["area_stories"] = train_labels["area_total"] / train_labels["stories"]
        train_labels["area_rooms"] = train_labels["area_total"] / np.average(train_labels["rooms"]) # is just the invert of spacious rooms
        train_labels["old_building"] = (train_labels["constructed"]<1950)
        train_labels["cold_war_building"] = (train_labels["constructed"]>1955) & (train_labels["constructed"]<2000)
        train_labels["modern_but_not_too_modern"] = (train_labels["constructed"]>200) & (train_labels["constructed"]<2018)
        train_labels["bathroom_area"] = (train_labels["bathrooms_private"] + train_labels["bathrooms_shared"])/train_labels["area_total"]
        train_labels['bathrooms_per_room'] = (train_labels["total_bathrooms"])/train_labels["rooms"]

        test_labels["area_floor"] = test_labels["area_total"] / test_labels["floor"]
        test_labels["area_stories"] = test_labels["area_total"] / test_labels["stories"]
        test_labels["area_rooms"] = test_labels["area_total"] / np.average(test_labels["rooms"]) # is just the invert of spacious rooms
        test_labels["old_building"] = (test_labels["constructed"]<1950)
        test_labels["cold_war_building"] = (test_labels["constructed"]>1955) & (test_labels["constructed"]<2000)
        test_labels["modern_but_not_too_modern"] = (test_labels["constructed"]>200) & (test_labels["constructed"]<2018)
        test_labels["bathroom_area"] = (test_labels["bathrooms_private"] + test_labels["bathrooms_shared"])/test_labels["area_total"]
        test_labels['bathrooms_per_room'] = (test_labels["total_bathrooms"])/test_labels["rooms"]


        # Calculate district average areas
        arr_train = train_labels[["district","area_total","area_living","area_kitchen",]].to_numpy()
        arr_test = test_labels[["district","area_total","area_living","area_kitchen",]].to_numpy()

        arr_full = np.concatenate((arr_train,arr_test),axis=0)

        average_area_total_district = {}
        average_area_living_district = {}
        average_area_kitchen_district = {}

        for x in sorted(np.unique(arr_full[...,0])):
            average_area_total_district[x] = np.average(arr_full[np.where(arr_full[...,0]==x)][...,1])
            average_area_living_district[x] = np.average(arr_full[np.where(arr_full[...,0]==x)][...,2])
            average_area_kitchen_district[x] = np.average(arr_full[np.where(arr_full[...,0]==x)][...,3])

        # Calculate floor average areas
        arr_train = train_labels[["floor","area_total"]].to_numpy()
        arr_test = test_labels[["floor","area_total"]].to_numpy()

        arr_full = np.concatenate((arr_train,arr_test),axis=0)

        average_area_total_floor = {}
        for x in sorted(np.unique(arr_full[...,0])):
            average_area_total_floor[x] = np.average(arr_full[np.where(arr_full[...,0]==x)][...,1])
   
        # Calculate construction year average areas
        arr_train = train_labels[["constructed","area_total","area_living","area_kitchen",]].to_numpy()
        arr_test = test_labels[["constructed","area_total","area_living","area_kitchen",]].to_numpy()
        
        arr_full = np.concatenate((arr_train,arr_test),axis=0)

        average_area_total_constructed = {}
        for x in sorted(np.unique(arr_full[...,0])):
            average_area_total_constructed[x] = np.average(arr_full[np.where(arr_full[...,0]==x)][...,1])
   
        total_district = []
        living_district = []
        kitchen_district = []

        total_floor = []
        total_constructed = []

        for _,row in train_labels.iterrows():
            total_district.append(row['area_total']/average_area_total_district[row['district']])
            living_district.append(row['area_living']/average_area_living_district[row['district']])
            kitchen_district.append(row['area_kitchen']/average_area_kitchen_district[row['district']])
            total_floor.append(row['area_total']/average_area_total_floor[row['floor']])
            total_constructed.append(row['area_total']/average_area_total_constructed[row['constructed']] if (pd.notnull(row['area_total']) and pd.notnull(row['constructed'])) else None)
        
        total_district_test = []
        living_district_test = []
        kitchen_district_test = []

        total_floor_test = []
        total_constructed_test = []

        for _,row in test_labels.iterrows():
            total_district_test.append(row['area_total']/average_area_total_district[row['district']])
            living_district_test.append(row['area_living']/average_area_living_district[row['district']])
            kitchen_district_test.append(row['area_kitchen']/average_area_kitchen_district[row['district']])
            total_floor_test.append(row['area_total']/average_area_total_floor[row['floor']])
            total_constructed_test.append(row['area_total']/average_area_total_constructed[row['constructed']] if (pd.notnull(row['area_total']) and pd.notnull(row['constructed'])) else None)

        train_labels['average_area_total_district'] = total_district
        train_labels['average_area_living_district'] = living_district
        train_labels['average_area_kitchen_district'] = kitchen_district
        train_labels['average_area_total_floor'] = total_floor
        train_labels['average_area_year_constructed'] = total_constructed

        test_labels['average_area_total_district'] = total_district_test
        test_labels['average_area_living_district'] = living_district_test
        test_labels['average_area_kitchen_district'] = kitchen_district_test
        test_labels['average_area_total_floor'] = total_floor_test
        test_labels['average_area_year_constructed'] = total_constructed_test
    
    if add_district_information:
        density = {
            0: 701353/66.1755,
            1: 1112846/109.9,
            2: 1240062/101.889,
            3: 1394497/154.6,
            4: 1116924/117.6,
            5: 1593065/132,
            6: 1179211/111.4,
            7: 1049104/153,
            8: 779965/93.281,
            9: 215727/37.22,
            10:86752/1084.3,
            11:113569/361.4
          }
        population = {
            0: 701353,
            1: 1112846,
            2: 1240062,
            3: 1394497,
            4: 1116924,
            5: 1593065,
            6: 1179211,
            7: 1049104,
            8: 779965,
            9: 215727,
            10:86752,
            11:113569
                }
        district_area ={
            0: 66.1755,
            1: 109.9,
            2: 101.889,
            3: 154.6,
            4: 117.6,
            5: 132,
            6: 111.4,
            7: 153,
            8: 93.281,
            9: 37.22,
            10:1084.3,
            11:361.4
                }

        district_popularity_train_set = train_labels[['district','building_id']].groupby(['district']).count()
        district_popularity_test_set = test_labels[['district','building_id']].groupby(['district']).count()
        population_df = pd.DataFrame.from_dict(population,orient='index')
        total = len(train_labels) + len(test_labels)
        popularity_data_df = (district_popularity_train_set+district_popularity_test_set)/total
        popularity_data = popularity_data_df.to_dict()['building_id']
        popularity_population_df = pd.DataFrame.from_dict(popularity_data,orient='index')/population_df
        popularity_population = popularity_population_df.to_dict()[0]



        density_district = [density[row['district']] for _,row in train_labels.iterrows()]
        population_district = [population[row['district']] for _,row in train_labels.iterrows()]
        area_district = [district_area[row['district']] for _,row in train_labels.iterrows()]
        outside_MKAD = [1 if (row["district"] in [9,10,11]) else 0 for _,row in train_labels.iterrows()]
        popularity_district_data = [popularity_data[row['district']] for _,row in train_labels.iterrows()]
        popularity_district_population = [popularity_population[row['district']] for _,row in train_labels.iterrows()]

        train_labels["density_district"] = density_district
        train_labels["population_district"] = population_district
        train_labels["area_district"] = area_district
        train_labels["outside_MKAD"] = outside_MKAD        
        train_labels["popularity_district_data"] = popularity_district_data
        train_labels["popularity_district_population"] = popularity_district_population


        density_district_test = [density[row['district']] for _,row in test_labels.iterrows()]
        population_district_test = [population[row['district']] for _,row in test_labels.iterrows()]
        area_district_test = [district_area[row['district']] for _,row in test_labels.iterrows()]
        outside_MKAD_test = [1 if (row["district"] in [9,10,11]) else 0 for _,row in test_labels.iterrows()]
        popularity_district_data_test = [popularity_data[row['district']] for _,row in test_labels.iterrows()]
        popularity_district_population_test = [popularity_population[row['district']] for _,row in test_labels.iterrows()]

        test_labels["density_district"] = density_district_test
        test_labels["population_district"] = population_district_test
        test_labels["area_district"] = area_district_test
        test_labels["outside_MKAD"] = outside_MKAD_test
        test_labels["popularity_district_data"] = popularity_district_data_test
        test_labels["popularity_district_population"] = popularity_district_population_test


    return train_labels, test_labels, added_features

def normalize(train_labels, test_labels, features, scaler="minMax"):
    # Only normalize/scale the numerical data. Categorical data is kept as is.
    train_labels_n = train_labels.filter(features)
    test_labels_n = test_labels.filter(features)

    # Scale it.
    if scaler=="minMax":
        print("minMax")
        scaler = MinMaxScaler(feature_range=(0, 1))
        train_labels_scaled = scaler.fit_transform(train_labels_n)
        test_labels_scaled = scaler.transform(test_labels_n)
    elif scaler=="std":
        print("Std")
        std_scale = preprocessing.StandardScaler().fit(train_labels_n)
        train_labels_scaled = std_scale.transform(train_labels_n)
        test_labels_scaled = std_scale.transform(test_labels_n)
    else:
        assert ValueError, "Incorrect scaler"

    # Re-enter proceedure
    training_norm_col = pd.DataFrame(train_labels_scaled, index=train_labels_n.index, columns=train_labels_n.columns) 
    train_labels.update(training_norm_col)

    testing_norm_col = pd.DataFrame(test_labels_scaled, index=test_labels_n.index, columns=test_labels_n.columns) 
    test_labels.update(testing_norm_col)

    return train_labels, test_labels


def one_hot_encoder(train_df, test_df, cat_features, drop_old=True):
    '''
    Returns a copy of all three dfs, after one-hot encoding and !removing!
    their old cat_features.

    NB! pd.get_dummies() does pretty much the same job!
    https://stackoverflow.com/questions/36285155/pandas-get-dummies
    BUG! Some categories are only present in train not test or the other way around!
        - Then the encoding is made differently for the two!
    https://stackoverflow.com/questions/57946006/one-hot-encoding-train-with-values-not-present-on-test
    '''
    # TODO: use Laures one-hot encoder.
    
    #if(len(train_df.isna())!=0 or len(train_df.isna())!=0 or len(train_df.isna())!=0):
    #    assert ValueError

    train_labels = train_df.copy()
    test_labels = test_df.copy()

    encoded_features = []
    dfs=[train_labels, test_labels]
    for df in dfs:
        for feature in cat_features:
            encoded_feat = OneHotEncoder().fit_transform(df[feature].values.reshape(-1, 1)).toarray()
            n = df[feature].nunique()
            cols = ['{}_{}'.format(feature, n) for n in range(1, n + 1)]
            encoded_df = pd.DataFrame(encoded_feat, columns=cols)
            encoded_df.index = df.index
            encoded_features.append(encoded_df)

    n = len(cat_features)

    train_labels = pd.concat([train_labels, *encoded_features[ : n]], axis=1)
    test_labels = pd.concat([test_labels, *encoded_features[n : ]], axis=1)


    # Now drop the non-encoded ones!
    if drop_old:
        train_labels.drop(cat_features, inplace=True, axis=1)
        test_labels.drop(cat_features, inplace=True, axis=1)
    return train_labels, test_labels



In [3]:
def fillnaReg(df, X_features, y_feature):
    df = df.copy()
    df_temp = df[df[y_feature].notna()]
    if df_temp.shape[0] == 1: df_temp = df_temp.values.reshape(-1, 1)
    reg = LinearRegression().fit(df_temp[X_features], df_temp[y_feature])
    predict = reg.predict(df[X_features])
    df[y_feature] = np.where(df[y_feature]>0, df[y_feature], predict)
    return df

def fillnaReg(df, X_features, y_feature):
    df = df.copy()
    df_temp = df[df[y_feature].notna()]
    if df_temp.shape[0] == 1: df_temp = df_temp.values.reshape(-1, 1)
    reg = LinearRegression().fit(df_temp[X_features], df_temp[y_feature])
    predict = reg.predict(df[X_features])
    df[y_feature] = np.where(df[y_feature]>0, df[y_feature], predict)
    return df

def closest_district(lat, long, district_centre_long, district_centre_lat):
    best_distance = -1
    closest_dist = -1
    for i in range(len(district_centre_long)):
        long_dist =  district_centre_long[i][0]
        lat_dist = district_centre_lat[i][0]
        total_dist = np.sqrt((long-long_dist)**2 + (lat-lat_dist)**2)
        if total_dist < best_distance or best_distance==-1:
            best_distance = total_dist
            closest_dist = i
    return closest_dist

def polar_coordinates(train_labels, test_labels, use_city_centre=False):
    '''
    labels and input labels. Adds theta and also R to the dataframe copy.
    '''
    # Make a copy to manipulate
    labels1_normed_r = train_labels[["latitude", "longitude"]].copy()
    test1_normed_r = test_labels[["latitude", "longitude"]].copy()

    # City centre 55.755833, 37.617222 (lat/long)
    # Instead of mean, use city centre: 
    if use_city_centre:
        labels1_normed_r['latitude'] = labels1_normed_r['latitude'] - 55.755833
        labels1_normed_r['longitude'] = labels1_normed_r['longitude'] - 37.617222
        test1_normed_r['latitude'] = test1_normed_r['latitude'] -  55.755833
        test1_normed_r['longitude'] = test1_normed_r['longitude'] -  37.617222
    else: # just use means
        # FLOAT
        train_float_mean = train_labels["longitude"].mean()
        test_float_mean = test_labels["longitude"].mean()
        tl = len(train_labels) + len(test_labels)
        total_mean_long = (train_float_mean*len(train_labels) + test_float_mean*len(test_labels)) / tl

        train_float_mean = train_labels["latitude"].mean()
        test_float_mean = test_labels["latitude"].mean()
        tl = len(train_labels) + len(test_labels)
        total_mean_lat = (train_float_mean*len(train_labels) + test_float_mean*len(test_labels)) / tl 

        labels1_normed_r['latitude'] = labels1_normed_r['latitude'] - total_mean_lat
        labels1_normed_r['longitude'] = labels1_normed_r['longitude'] - total_mean_long
        test1_normed_r['latitude'] = test1_normed_r['latitude'] -  total_mean_lat
        test1_normed_r['longitude'] = test1_normed_r['longitude'] -  total_mean_long

    # Convert to polar coordinates
    labels1_normed_r['r'] =  np.sqrt(labels1_normed_r['latitude']**2 + labels1_normed_r['longitude']**2)
    labels1_normed_r['theta'] = np.arctan(labels1_normed_r['longitude']/labels1_normed_r['latitude'])
    test1_normed_r['r'] =  np.sqrt(test1_normed_r['latitude']**2 + test1_normed_r['longitude']**2)
    test1_normed_r['theta'] = np.arctan(test1_normed_r['longitude']/test1_normed_r['latitude'])

    # Add polar coordinates
    train_labels["r"] = labels1_normed_r['r']
    train_labels["theta"] = labels1_normed_r['theta']
    test_labels["r"] = test1_normed_r['r']
    test_labels["theta"] = test1_normed_r['theta']

    return train_labels, test_labels


def eucledian_distance(point1, point2):
    return np.sqrt((point1[0]-point2[0])**2 + (point1[1]-point2[1])**2)

In [4]:
def predict_and_store(model, test_labels, test_pd, path="default", exponential=False, price_per_sq = False, total_area_df = None, exponentialm1=False):
    '''
        Inputs
        - test_pd needs to be the original full test dataframe
    '''
    result = model.predict(test_labels)
    if exponential:
        result = np.exp(result)
    elif exponentialm1:
        print("expm1")
        result = np.expm1(result)
    if price_per_sq:
        result = result*total_area_df
    submission = pd.DataFrame()
    submission['id'] = test_pd['apartment_id']
    submission['price_prediction'] = result
    if len(submission['id']) != 9937:
        raise Exception("Not enough rows submitted!")
    submission.to_csv(path, index=False)

#### Obtain Final Training and Testing Datasets

In [5]:
# Declare features
features =           ["building_id", # For grouping
                      "area_total", "area_kitchen", "area_living", "floor", "ceiling", "stories", "rooms",
                      "bathrooms_private", "bathrooms_shared", "balconies","loggias", "phones", "latitude", "longitude", "constructed", # Numerical
                     "layout", "condition", "district", "material", "parking", "heating", "seller", #Categorical
                      "windows_court", "windows_street", "new", "elevator_without", "elevator_passenger", "elevator_service", "garbage_chute", # Bool
                     "street", "address"] # Strings

all_numerical_features = ["area_total", "area_kitchen", "area_living", "floor",
                      "ceiling", "stories", "rooms", "bathrooms_private", "bathrooms_shared", "balconies","loggias", "phones", "latitude", "longitude", "constructed"]

float_numerical_features = ["area_total", "area_kitchen", "area_living", "ceiling", "latitude", "longitude", "constructed"]
int_numerical_features = ["floor", "stories", "rooms", "bathrooms_private", "bathrooms_shared", "balconies", "loggias", "phones"] # Ordinal categories

cat_features = ["layout", "condition", "district", "material", "parking", "heating", "seller"] # All are non-ordinal

# Load data
train, test, metaData = load_all_data()

# Clean data
train_labels, train_targets, test_labels = clean_data(train, test, features, float_numerical_features, int_numerical_features, cat_features, log_targets=False, log_area=True, fillNan=True)

# Add result feature engineering
train_labels, test_labels, added_features = feature_engineering(
    train_labels, 
    test_labels,
    add_base_features=True, 
    add_bool_features=True,
    add_weak_features=True,
    add_dist_to_metro=True,
    add_close_to_uni=True,
    add_dist_to_hospital=True,
    add_floor_features=True,
    add_street_info=True,
    add_some_more_features=True,
    add_district_information=True,
    )

# Normalize
train_labels, test_labels = normalize(train_labels, test_labels, float_numerical_features, scaler="minMax")

# One-hot encoding
train_labels_one_hot, test_labels_one_hot = one_hot_encoder(train_labels, test_labels, ["condition", "district", "material", "parking", "heating", "seller"], drop_old=True)

# Drop unnecessary features
droptable = ['street','address']
train_labels.drop(droptable, inplace=True, axis=1)
test_labels.drop(droptable, inplace=True, axis=1)
train_labels_one_hot.drop(droptable, inplace=True, axis=1)
test_labels_one_hot.drop(droptable, inplace=True, axis=1)

# Categorical features to integers (necessary for CatBoost)
cat_dict = {}
for feature in cat_features:
    cat_dict[feature] = 'int'
train_labels = train_labels.astype(cat_dict)
test_labels = test_labels.astype(cat_dict)

print("You have now imported all data!")

minMax
You have now imported all data!


### Preparing the Models

#### Global Parameters and Imports

In [6]:
import lightgbm as lgbm
from catboost import CatBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

price_per_square_meter = train_targets/train['area_total']

#### LightGBM

In [7]:
params= {
 'num_iterations': 10000,
 'n_estimators': 152*5,
 'learning_rate': 0.05/5,
 'num_leaves': 40,
 'max_depth': 10,
 'min_data_in_leaf': 20,
 'bagging_fraction': 0.9,
 'bagging_freq': 5,
 'feature_fraction': 0.8,
}
lgb = lgbm.LGBMRegressor(
    **params, 
    random_state= 1,
    silent=True,
    metric='regression',
    num_threads=4)

#### CatBoost

In [8]:
best_params = {
    'learning_rate': 0.0772446776104594, 
    'l2_leaf_reg': 0.5709519938247928, 
    'colsample_bylevel': 0.09221911838854839, 
    'depth': 9, 
    'boosting_type': 'Plain', 
    'bootstrap_type': 'MVS', 
    'min_data_in_leaf': 17, 
    'one_hot_max_size': 11}
catboost = CatBoostRegressor(
    **best_params, 
    random_state=42,
    loss_function='RMSE', 
    cat_features=cat_features)

#### GradientBoost

In [9]:
optimal_n_estimators = 300
optimal_max_depth = 13
optimal_min_samples_split = 1000
optimal_min_samples_leaf = 40
optimal_max_features = 40
optimal_subsample = 0.95
original_learning_rate = 0.1

gradientboost = GradientBoostingRegressor(
            n_estimators = optimal_n_estimators*10,
            max_depth = optimal_max_depth,
            min_samples_split = optimal_min_samples_split,
            min_samples_leaf = optimal_min_samples_leaf,
            max_features = optimal_max_features,
            subsample = optimal_subsample,
            learning_rate = original_learning_rate / 10,
            loss = 'squared_error',
            criterion = 'squared_error',
            verbose = 0,
            warm_start = False,
            random_state = 42,
        )

### Training the Models
#### LightGBM

In [26]:
lgb.fit(train_labels_one_hot.drop(['area_total','building_id'],axis=1),np.log(price_per_square_meter))

#### CatBoost

In [19]:
catboost.fit(train_labels.drop(['area_total','building_id'],axis=1),np.log(price_per_square_meter))

#### GradientBoost

In [27]:
gradientboost.fit(train_labels_one_hot.drop(['area_total','building_id'],axis=1),np.log(price_per_square_meter))

### Predicting the Test Data
#### LightGBM: public score `0.15653`

In [13]:
predict_and_store(lgb, test_labels_one_hot.drop(['area_total','building_id'],axis=1), test, path="LGBM_2_copy_2.csv", exponential=True, price_per_sq = True, total_area_df = test['area_total'])

Note that this recomputes the second submission.

#### CatBoost: public score `0.16221`

In [14]:
predict_and_store(catboost, test_labels.drop(['area_total','building_id'],axis=1), test, path="CatBoost_copy2.csv", exponential=True, price_per_sq = True, total_area_df = test['area_total'])

#### GradientBoost: public score `0.16443`

In [21]:
predict_and_store(gradientboost, test_labels_one_hot.drop(['area_total','building_id'],axis=1), test, path="GradientBoost_copy_3.csv", exponential=True, price_per_sq = True, total_area_df = test['area_total'])

### Predicting the Test Data

In [17]:
# Helper Function
def csv_bagging(kaggle_scores, csv_paths, submission_path):
    # Making the acc dataframe
    d = {}
    for i, score in enumerate(kaggle_scores):
        d[i] = score
    acc = pd.DataFrame(
    d,
    index=[0]
    )
    acc = acc.T
    acc.columns = ['RMSLE']
    acc

    # Read dataframes, sort and store
    pd_predictions = []
    for path in csv_paths:
        pd_predictions.append(
            pd.read_csv(path).sort_values(by="id")
            )
    # Cast to numpy
    np_predictions = []
    for pred in pd_predictions:
        np_predictions.append(
            pred["price_prediction"].to_numpy().T
        )

    # Bagging
    avg_prediction = np.average(
        np_predictions,
        weights = 1 / acc['RMSLE'] ** 4,
        axis=0
        )
    
    result = avg_prediction
    submission = pd.DataFrame()
    submission['id'] = pd_predictions[0]['id']
    submission['price_prediction'] = result
    if len(submission['id']) != 9937:
        raise Exception("Not enough rows submitted!")
    
    submission.to_csv(submission_path, index=False)

In [29]:
prediction_scores = [0.15653, 0.16443, 0.16221]
csv_paths = ["LGBM_2_copy_2.csv", "GradientBoost_copy_2.csv", "CatBoost_copy2.csv"]
submission_path = "Bagging_copy_2.csv"

csv_bagging(prediction_scores, csv_paths, submission_path)

Note: in case one would want to reproduce our absolute best submission (using only previously stored csv files), this code could be used:

In [None]:
predicted_scores = [0.15623, 0.16401, 0.16264]
csv_paths = ["LGBM_2_copy_2.csv", "GradientBoost_copy_2.csv", "CatBoost_copy2.csv"]
submission_path = "Bagging_original.csv"

csv_bagging(prediction_scores, csv_paths, submission_path)