In this notebook I will test the model proposed in the ensemble learning lecture.
Chapter with explenation about the algorithm behind the lightGBM, will write this after the coding.

LightGBM is good in our case because it can deal with categorical values. We will start with importing the data, the data is already divided into test and training data.

In [89]:
import lightgbm as lgb
from catboost import CatBoostRegressor
import pandas as pd
import numpy as np

import matplotlib.pylab as plt
plt.style.use('ggplot')

I want to use import_data function from another directory, and make the data ready for the model. LightGBM accepts categorical values, but they have to be encoded as no-negative integers.


In [90]:
from sklearn.model_selection import train_test_split
from moscow_housing.display_data import import_data

#We dont want our model to care about the id of the house or the seller
#In my first run, i will replace missing values with the mean value
data, data_test = import_data()
Y = data.price
id = data.id
data = data.drop(columns=['price','id','seller'])


for column in data.columns:
    column_type = data[column].dtype
    if column_type == 'object':
        break
    data[column] = data[column].replace(np.NaN, data[column].mean())

#turn categorical features into correct type
for column in data.columns:
    column_type = data[column].dtype
    if column_type == 'object':
        data[column] = data[column].astype('category')


X_train, X_test, y_train, y_test = train_test_split(data, Y, test_size=0.2, random_state=42)


def root_mean_squared_log_error(y_true, y_pred):
    # Alternatively: sklearn.metrics.mean_squared_log_error(y_true, y_pred) ** 0.5
    assert (y_true >= 0).all()
    assert (y_pred >= 0).all()
    log_error = np.log1p(y_pred) - np.log1p(y_true)  # Note: log1p(x) = log(1 + x)
    return np.mean(log_error ** 2) ** 0.5

lightGBM_model = lgb.LGBMRegressor(
    num_leaves=20,
    max_depth=15,
    random_state=42,
    silent=True,
    metric='mse',
    n_jobs=4,
    n_estimators=10000,
    colsample_bytree=0.95,
    subsample=0.9,
    learning_rate=0.09)

lightGBM_model.fit(X_train,y_train)
prediction = lightGBM_model.predict(X_test)

count = 0
for i in prediction:
    if i < 0:
        prediction[count] = prediction.mean()
    count += 1

rmsle = root_mean_squared_log_error(y_test,prediction)
print("first run", rmsle)





first run 0.1834119728836997


Second attempt

In [91]:
from sklearn.model_selection import train_test_split
from moscow_housing.display_data import import_data
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


#We dont want our model to care about the id of the house or the seller
#For my second run, i want to use linear regression to predict the missing values
data_2, data_test_2 = import_data()
id = data_2.id

data_2 = data_2.drop(columns=['id', 'seller'])
Y_2 = data_2.price


data_without_NaN = data_2[['street','address']]
data_2 = data_2.drop(columns=['street','address'])

imp_mean = IterativeImputer(verbose= 2, random_state=0)
transformed_data = imp_mean.fit_transform(data_2)

print('done with transforming data')
data_2_after_transformation = data_2

[IterativeImputer] Completing matrix with shape (23285, 30)
[IterativeImputer] Ending imputation round 1/10, elapsed time 1.57
[IterativeImputer] Change: 1508.6948635194135, scaled tolerance: 2600000.0 
[IterativeImputer] Early stopping criterion reached.
done with transforming data


Now that the missing values has been predicted by the data, we need to get the data back as a pandas datafram instead of a numpy array

In [92]:
data_2 = data_2_after_transformation
Y = Y_2

data_2 = pd.DataFrame(transformed_data, columns=data_2.columns)

data[['street','address']] = data_without_NaN

#turn categorical features into correct type
for column in data_2.columns:
    column_type = data_2[column].dtype
    if column_type == 'object':
        data_2[column] = data_2[column].astype('category')


X_train, X_test, y_train, y_test = train_test_split(data_2, Y, test_size=0.2, random_state=42)


def root_mean_squared_log_error(y_true, y_pred):
    # Alternatively: sklearn.metrics.mean_squared_log_error(y_true, y_pred) ** 0.5
    assert (y_true >= 0).all()
    assert (y_pred >= 0).all()
    log_error = np.log1p(y_pred) - np.log1p(y_true)  # Note: log1p(x) = log(1 + x)
    return np.mean(log_error ** 2) ** 0.5


lightGBM_model = lgb.LGBMRegressor(
    num_leaves=20,
    max_depth=15,
    random_state=42,
    silent=True,
    metric='mse',
    n_jobs=4,
    n_estimators=10000,
    colsample_bytree=0.95,
    subsample=0.9,
    learning_rate=0.09)

lightGBM_model.fit(X_train,y_train)
prediction = lightGBM_model.predict(X_test)

count = 0
for i in prediction:
    if i < 0:
        prediction[count] = prediction.mean()
    count += 1

rmsle = root_mean_squared_log_error(y_test,prediction)
print("second run", rmsle)

second run 0.06719042389062292
