# Modelling: XGBoost for Berlin rent

XGBoost (eXtreme Gradient Boosting) is a supervised learning algorithm from the family of ensemble methods, widely used and known for its speed and accuracy. Here it is used to predict the total rent of apartments in Berlin. Gradient boosting tends to work well on tabular data with non-linear relationships and feature interactions, although on small datasets it overfits easily unless it is regularized. It builds the model step by step: each new model is fitted to reduce the error left by the models before it.

In short, XGBoost creates an ensemble of decision trees that are trained sequentially. Each tree is trained to correct the errors made by the trees before it, and the predictions of all trees are added up. This procedure, called gradient boosting, minimizes a loss function step by step.

The gradient boosting algorithm works by fitting a simple model (such as a shallow decision tree) to the data and then using the residuals (the differences between the actual and the predicted values) to train the next model. XGBoost extends this scheme: it uses second-order gradient information of the loss, adds regularization (a penalty on the number of leaves and L1/L2 penalties on the leaf weights) to prevent overfitting, and offers a linear booster (`gblinear`) as an alternative to the default tree booster (`gbtree`).
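
The residual-fitting loop described above can be sketched in a few lines. This is a minimal illustration for squared error loss on synthetic data, not XGBoost's actual implementation; the names `X_demo`, `y_demo`, `learning_rate` and `n_rounds` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic one-dimensional regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1   # shrinks the contribution of each tree
n_rounds = 50         # number of boosting rounds

# start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_per_round = []

for _ in range(n_rounds):
    # for squared error loss the negative gradient is the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # add the shrunken correction of the new tree to the ensemble prediction
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_round.append(np.mean((y_demo - prediction) ** 2))

print("training MSE after first round: %.4f" % mse_per_round[0])
print("training MSE after last round:  %.4f" % mse_per_round[-1])
```

Each round lowers the training error a little. How well the ensemble generalizes is a separate question and has to be checked on held-out data.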

In [1]:
import warnings

import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# silence warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# show the result of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

# display all columns and rows of a DataFrame
pd.set_option('display.max_columns', None)
pd.set_option("display.max_rows", None)

In [2]:
# read the clean data without outliers and only numeric (with dummies)
df = pd.read_csv("df_numeric.csv")
df.tail()

Unnamed: 0,total rent,area,number of rooms,balcony,built-in kitchen,basement,garden,elevator,stepless,guest toilet,flat share possible,district_alt-hohenschönhausen,district_alt-treptow,district_altglienicke,district_baumschulenweg,district_biesdorf,district_blankenburg,district_bohnsdorf,district_borsigwalde,district_britz,district_buch,district_buckow,district_charlottenburg,district_charlottenburg-nord,district_dahlem,district_falkenberg,district_falkenhagener feld,district_fennpfuhl,district_französisch buchholz,district_friedenau,district_friedrichsfelde,district_friedrichshagen,district_friedrichshain,district_gatow,district_gesundbrunnen,district_gropiusstadt,district_grunewald,district_grünau,district_hakenfelde,district_halensee,district_hansaviertel,district_haselhorst,district_heiligensee,district_heinersdorf,district_hellersdorf,district_hermsdorf,district_johannisthal,district_karlshorst,district_karow,district_kaulsdorf,district_konradshöhe,district_kreuzberg,district_köpenick,district_lankwitz,district_lichtenberg,district_lichtenrade,district_lichterfelde,district_lübars,district_mahlsdorf,district_mariendorf,district_marienfelde,district_marzahn,district_mitte,district_moabit,district_märkisches viertel,district_müggelheim,district_neu-hohenschönhausen,district_neukölln,district_niederschöneweide,district_niederschönhausen,district_nikolassee,district_oberschöneweide,district_pankow,district_plänterwald,district_prenzlauer berg,district_rahnsdorf,district_reinickendorf,district_rosenthal,district_rummelsburg,district_schmargendorf,district_schmöckwitz,district_schöneberg,district_siemensstadt,district_spandau,district_staaken,district_steglitz,district_tegel,district_tempelhof,district_tiergarten,district_wannsee,district_wedding,district_weißensee,district_westend,district_wilhelmsruh,district_wilhelmstadt,district_wilmersdorf,district_wittenau,district_zehlendorf,landlord_degewo,landlord_estate agent,landlord_housinganywhere b.v.,landlord_howoge,landlord_numa group,landlord_private offer,landlord_tauschwohnung wohnungstausch,landlord_visionapartments,landlord_wohnungsswap.de
3330,1035.0,80.0,3.0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3331,1663.0,84.1,2.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3332,1260.0,72.0,2.5,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3333,2995.0,158.0,4.0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3334,1999.0,136.32,4.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


#### Dropping columns with low number of occurrences (not yet)

#### Doing X-y split (y is the target variable total rent)

In [3]:
y = df["total rent"]

In [4]:
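# note: "Unnamed: 0" (the row index saved with the CSV) stays in X as a feature;
# it carries no information about the rent and is a candidate for dropping as well
X = df.drop(columns=["total rent"])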


#### Train/Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Standardizing the data with a scaler is not necessary for tree-based models: their splits depend only on the ordering of the feature values, so the model stays the same in this case

In [6]:
# fit and transform on the train data
# only transform the test data
#transformer = StandardScaler()
#X_train = transformer.fit_transform(X_train)
#X_test  = transformer.transform(X_test)
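
The claim that scaling does not change a tree-based model can be checked on a small example. The sketch below uses synthetic data and a single `DecisionTreeRegressor` instead of the rent data and the boosted models; all names in it (`X_demo`, `tree_raw`, `tree_scaled`) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# synthetic features on very different scales (roughly "area" and "rooms")
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(20, 200, 300), rng.integers(1, 6, 300)])
y_demo = 12 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 30, 300)

X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr = y_demo[:240]

# tree fitted on the raw features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)

# same tree fitted on standardized features (scaler fitted on train data only)
scaler = StandardScaler().fit(X_tr)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(
    scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

# standardization keeps the ordering of the values, so the splits partition
# the data in the same way and the predictions agree
same = np.allclose(pred_raw, pred_scaled)
print("identical predictions:", same)
```

For models that depend on distances or on the size of the coefficients (for example k-nearest neighbours or regularized linear regression) the same comparison would generally not come out identical.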

#### Train, fit, and evaluate the XGBoost model and the gradient boosting regression model

In [7]:
#!pip install xgboost

In [14]:
# grid_search (scikit-learn parameter names):
# {'max_features': 'sqrt',
#  'min_samples_leaf': 1,
#  'min_samples_split': 2,
#  'n_estimators': 50}
# grid_search_2:
# {'colsample_bytree': 1.0,
#  'max_depth': 6,
#  'min_child_weight': 10,
#  'n_estimators': 100}

import xgboost as xgb

xgb_ops = {"max_depth": 6,
           "min_child_weight": 10,
           "colsample_bytree": 1.0,
           # the next three are scikit-learn parameter names; XGBoost does not
           # use them and reports them in the warning printed below
           "max_features": "sqrt",
           "min_samples_leaf": 1,
           "min_samples_split": 2,
           "n_estimators": 100,
           "random_state": 42}

regressor_x = xgb.XGBRegressor(**xgb_ops)
regressor_x.fit(X_train, y_train)

print("train prediction R2 score: %.2f" % (regressor_x.score(X_train, y_train)))
print("test prediction R2 score: %.2f" % (regressor_x.score(X_test, y_test)))


Parameters: { "max_features", "min_samples_leaf", "min_samples_split" } are not used.



XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=1.0, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=6, max_features='sqrt',
             max_leaves=None, min_child_weight=10, min_samples_leaf=1,
             min_samples_split=2, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, ...)

train prediction R2 score: 0.81
test prediction R2 score: 0.58


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gb_ops = {"max_depth": 6,
          "min_samples_leaf": 20,
          "max_features": None,
          "n_estimators": 100,
          "random_state": 42}

regressor = GradientBoostingRegressor(**gb_ops)
regressor.fit(X_train, y_train)

print("train prediction R2 score: %.2f" % (regressor.score(X_train, y_train)))
print("test prediction R2 score: %.2f" % (regressor.score(X_test, y_test)))


#### Cross validation

In [None]:
from sklearn.model_selection import cross_val_score
folds = 5
cross_val_scores = cross_val_score(regressor, X_train, y_train, cv=folds)


In [None]:
print("cv scores over {:d} iterations: \n".format(folds))
cross_val_scores

In [None]:
# 5 folds, so the algorithm was trained and evaluated 5 times, and the resulting scores for each fold are given in the array
# these scores are measures of the model's performance on the validation set for each fold. 
# by looking at these scores, you can get an idea of how well the model is performing on different parts of the data. 
# a large variation in the scores may indicate that the model is overfitting or that the data is not consistent. 
# a consistent, high score across all folds suggests that the model is performing well and is generalizing to new data.

In [None]:
print("the std. dev. in the cv scores is {:.4f}".format(np.std(cross_val_scores)))

In [None]:
# np.std condenses the cross-validation scores into a single number
# that describes how much the scores vary across the folds.
# here the result is 0.0309, i.e. the fold scores typically deviate from their mean by about 0.03.
# the standard deviation is a measure of the spread or variability of a set of values:
# a large standard deviation indicates that the values are more spread out,
# while a small standard deviation indicates that the values are tightly clustered around the mean.

# in the context of cross-validation, a small standard deviation indicates that the model performs consistently
# across all the folds, while a large standard deviation indicates that its performance varies between the folds.
# ideally, the standard deviation is as small as possible,
# since this indicates that the model is stable and reliable across different parts of the data.

# a standard deviation of 0.0309 therefore suggests that the model's performance is relatively consistent across the folds,
# which is a good sign.
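
As a compact illustration of the two comment cells above, the following sketch reports the per-fold scores together with their mean and standard deviation. It runs on synthetic data from `make_regression` rather than on the rent data, and the shuffled `KFold` splitter is an assumption of this example, not the setting used in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, n_informative=4,
                                 noise=20.0, random_state=42)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=42)

# 5 shuffled folds: the model is trained and scored 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean R2: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

Reporting the mean together with the standard deviation makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.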

#### R² test (0.58) clearly below R² train (0.81) for XGBoost indicates that the model is overfitting to the train data


#### Improving model with feature selection!

#### Comparing feature importance: the higher the score, the more important the feature is

In [None]:
regressor_x.fit(X_train, y_train)

In [None]:
regressor.fit(X_train, y_train)

In [None]:
feature_names = list(X_train.columns)

In [None]:
# feature importances of the gradient boosting model, most important first
importance_gb = pd.DataFrame({'columns_name_gradient': feature_names,
                              'score_feature_importance': regressor.feature_importances_})
importance_gb = importance_gb.sort_values(by=['score_feature_importance'], ascending=False)
importance_gb

In [None]:
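# feature importances of the XGBoost model, most important first
# (the result is not named `list`, which would shadow the Python built-in)
importance_xgb = pd.DataFrame({'columns_name_x': feature_names,
                               'score_feature_importance': regressor_x.feature_importances_})
importance_xgb = importance_xgb.sort_values(by=['score_feature_importance'], ascending=False)
importance_xgb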
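
One possible way to turn such importance scores into an actual feature selection is scikit-learn's `SelectFromModel`. The sketch below is only an illustration on synthetic data, not a step the notebook performs; the `"median"` threshold and the names `X_demo`, `y_demo` and `selector` are assumptions of this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# synthetic data: only the first two of six features influence the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, 400)

# keep the features whose importance is at least the median importance
selector = SelectFromModel(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    threshold="median")
selector.fit(X_demo, y_demo)

support = selector.get_support()          # boolean mask of kept columns
X_selected = selector.transform(X_demo)   # data reduced to the kept columns

print("kept columns:", np.flatnonzero(support))
print("shape before/after:", X_demo.shape, X_selected.shape)
```

Whether dropping the low-importance columns improves the test score has to be verified by refitting the model on the reduced data and comparing the cross-validation results.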

#### Hyperparameter Tuning

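The XGBoost algorithm has several hyperparameters that can be tuned to optimize the performance of the model. These include the number of trees (`n_estimators`), the learning rate (`learning_rate`), the maximum depth of the trees (`max_depth`), and the minimum sum of instance weights required in a child node (`min_child_weight`), which plays the role that the minimum number of samples per leaf plays in scikit-learn.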

In [None]:
from sklearn.model_selection import GridSearchCV


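# note: min_samples_split, min_samples_leaf and max_features are scikit-learn
# parameter names that XGBoost does not use, so only n_estimators changes the model here
param_grid = {
    'n_estimators': [50, 100, 500],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt']
    ##'max_samples' : ['None', 0.5],
    ##'max_depth':[3,5,10],
    ## 'bootstrap':[True,False]
    }
regressor_x = xgb.XGBRegressor(random_state=42)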

In [None]:
# parameter grid with XGBoost's own parameter names:
param_grid_2 = {
    "max_depth": [4, 6, 8],
    "min_child_weight": [10, 20, 30],
    # colsample_bytree is the fraction of columns sampled per tree and must lie in (0, 1]
    "colsample_bytree": [0.5, 1.0],
    "n_estimators": [100, 120]
    ##'max_samples' : ['None', 0.5],
    ##'max_depth':[3,5,10],
    ## 'bootstrap':[True,False]
    }
regressor_x = xgb.XGBRegressor(random_state=42)

In [None]:
grid_search = GridSearchCV(regressor_x, param_grid, cv=5, return_train_score=True, n_jobs=-1)

In [None]:
grid_search_2 = GridSearchCV(regressor_x, param_grid_2, cv=5, return_train_score=True, n_jobs=-1)

In [None]:
grid_search.fit(X_train,y_train.values.ravel())

In [None]:
grid_search_2.fit(X_train,y_train.values.ravel())

In [None]:
best_params = grid_search.best_params_ #To check the best set of parameters returned
best_params

In [None]:
best_params_2 = grid_search_2.best_params_ #To check the best set of parameters returned
best_params_2

In [None]:
pd.DataFrame(grid_search.cv_results_)
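
The full `cv_results_` table is wide, so it can help to reduce it to a few columns and sort it by rank. The following sketch shows one way to do that, including the gap between train and test score as a rough overfitting indicator. It uses synthetic data and scikit-learn's `GradientBoostingRegressor` so that it runs without the rent data; `demo_grid`, `demo_search` and `summary` are names assumed for this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=15.0, random_state=42)

# small grid: 2 x 2 = 4 parameter combinations, each evaluated with 3 folds
demo_grid = {"n_estimators": [20, 50], "max_depth": [2, 3]}
demo_search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                           demo_grid, cv=3, return_train_score=True)
demo_search.fit(X_demo, y_demo)

# keep the most informative columns and sort by the rank of the test score
results = pd.DataFrame(demo_search.cv_results_)
columns = ["params", "mean_train_score", "mean_test_score",
           "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")

# a large gap between train and test score hints at overfitting
summary["train_test_gap"] = summary["mean_train_score"] - summary["mean_test_score"]
print(summary)
```

A combination with a slightly lower test score but a much smaller gap can be the more robust choice.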