# Germany Rental Prediction - Creating Model

## Contents:
- Part 1: Cleaning and Visualization
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1biEgivJEOUVS8KbeTXyb1lNgsVtbitYj)
- Part 2: Using PyCaret for Model Hyperparameter Tuning
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lXJhdH3rGnKQ_LjBGMh8ZK-Lf2VcfLW5)
- Part 3: Create Model
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14XIC90Lss_izdw-PE1cgIe4eECsXrHbY)


## Purpose of this notebook

This notebook builds the models based on '02_model_comparison_pycaret.ipynb', where all the hyperparameter tuning has already been done with PyCaret.

We will create 5 models:
1. Ridge Regression: [Wikipedia](https://en.wikipedia.org/wiki/Ridge_regression#:~:text=Ridge%20regression%20is%20a%20method,econometrics%2C%20chemistry%2C%20and%20engineering.)
2. Light Gradient Boost: [Wikipedia](https://en.wikipedia.org/wiki/LightGBM)
3. Extreme Gradient Boost
4. CatBoost
5. Linear Regression

and compare them using R2 ([Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)) and RMSE ([Root Mean Square Error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)), which are standard metrics for measuring the accuracy of a regression model. In addition, we will compare prediction speed to see which model is faster when it is used for prediction.
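
To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The helper name `r2_and_rmse` and the rent values are made up for illustration; the notebook itself relies on scikit-learn's `r2_score` and `mean_squared_error`.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_true - y_pred) ** 2)       # variation the model fails to explain
    total_ss = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

# Made-up rents in euros, for illustration only
actual = [500.0, 750.0, 1000.0, 1250.0]
predicted = [550.0, 700.0, 1050.0, 1200.0]
r2, rmse = r2_and_rmse(actual, predicted)
print(f"R2: {r2:.3f}, RMSE: {rmse:.1f}")
```

RMSE is expressed in the same unit as the target, so here it reads as a typical error in euros, while R2 is unitless and easier to compare across datasets.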

# Basic data handling and inspection

Import the libraries used in this kernel

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline
pd.set_option('display.max_columns', None)

Load the dataset into the kernel

In [None]:
!gdown --id 1yw4RN-Z9b7PlF45kC5HnXZokaXivk3EV

df = pd.read_csv('predict_test.csv').iloc[:,1:]
df.head()

In [None]:
df[df['regio2']=='Berlin']

# Machine Learning

The code below creates dummy variables for every categorical column, that is, every column whose dtype is object or bool (True, False).

In [None]:
# Create dummy variables
columns = []
for cols in df.columns:
    if df[cols].dtype == 'object' or df[cols].dtype == 'bool':
        columns.append(cols)
dummies_feature = pd.get_dummies(df[columns],prefix='',prefix_sep='')
dummies_feature.head()

Combine the remaining columns with the dummy columns, and drop the original categorical columns that the dummies replace.

In [None]:
predict_df = df.copy()
predict_df = predict_df.drop(columns=columns)
predict_df = pd.concat([predict_df, dummies_feature], axis=1)
predict_df.head()

## Splitting the data into train and test

In [None]:
# Split the data into training and testing datasets
X = predict_df.iloc[:,1:] # Select all the columns except totalRent
y = predict_df.iloc[:,0] # Select only totalRent
X_val = X.values
y_val = y.values
x_train, x_test, y_train, y_test = train_test_split(X_val, y_val, test_size = 0.10, random_state = 123)

In [None]:
print(f"Number of training samples: {x_train.shape[0]}\n")
print(f"Number of test samples: {x_test.shape[0]}")

## Ridge Regression

Ridge regression is an example of a shrinkage method: in contrast to least squares, it reduces the parameter estimates in an effort to reduce variance, improve prediction accuracy, and simplify interpretation.
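
The shrinkage effect can be seen in a small sketch. It solves the penalised least-squares problem in closed form with NumPy, without an intercept, on synthetic data; the helper `ridge_closed_form` and the data are assumptions made for illustration, not part of this notebook's pipeline.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha * ||w||^2 (no intercept) in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data with known coefficients, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_weak = ridge_closed_form(X_demo, y_demo, alpha=0.01)     # close to least squares
w_strong = ridge_closed_form(X_demo, y_demo, alpha=100.0)  # heavily shrunk

norm_weak = np.linalg.norm(w_weak)
norm_strong = np.linalg.norm(w_strong)
print(f"Coefficient norm with alpha=0.01: {norm_weak:.3f}")
print(f"Coefficient norm with alpha=100:  {norm_strong:.3f}")
```

A larger `alpha` pulls the coefficients toward zero, which lowers variance at the cost of some bias.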


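### Setting hyperparameters and fitting the model

In [None]:
from sklearn.linear_model import Ridge
# Hyperparameters come from the PyCaret tuning in Part 2.
# Note: the `normalize` argument was removed in scikit-learn 1.2,
# so this call needs an older scikit-learn version.
ridge = Ridge(alpha=2.81, copy_X=True, fit_intercept=False, max_iter=None,
      normalize=True, random_state=123, solver='auto', tol=0.001)

# Fit the data
ridge.fit(x_train,y_train)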

In [None]:
# Predict in test dataset

y_pred = ridge.predict(x_test)
y_pred

### Create the Function to evaluate the model

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Pass the feature matrix as `dataset` and the true target values as `target`
def evaluate_function(model,dataset,target,name): 
    """
    Evaluate a fitted model and print its metrics.
    - model: the fitted model that we created before
    - dataset: the dataset that contains the features
    - target: the true labels of the dataset
    - name: the name of the model
    
    Returns the results as a dictionary for further use.
    """
    y_pred = model.predict(dataset)

    print(f'Model name: {name}\n')
    r2 = r2_score(target,y_pred)
    print(f'Coefficient of Determination (R2 score) of the model: {round(r2,3)}')
    rmse = pow(mean_squared_error(target,y_pred),0.5)
    print(f'RMSE (Root Mean Square Error) of the prediction: {round(rmse,3)}')
    mae = mean_absolute_error(target, y_pred)
    print(f'MAE (Mean Absolute Error) of the prediction: {round(mae,3)}')
    
    evaluation_results = {"Model Name": name,
                          'R2':r2,
                        'RMSE':rmse,
                        'MAE':mae}
    
    return evaluation_results



In [None]:
ridge_evaluation = evaluate_function(ridge,x_test,y_test,'Ridge Regression')

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(f'Coefficient of Determination (r2_score): {r2}')
rms = pow(mean_squared_error(y_test,y_pred),0.5)
print(f'RMSE of the prediction: {rms}')

In [None]:
fig, ax = plt.subplots()
ax.scatter(y_pred,y_test,edgecolors=(0,0,1))
ax.plot([y_test.min(),y_test.max()],[y_test.min(),y_test.max()], 'r--',
       lw=3)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

In [None]:
predict_df.head()

### Create the Predict Function by using Ridge Regression Model

In [None]:
from numpy import random

def model_predict_price(model, df):
    random_data = df.iloc[random.randint(df.shape[0]-1),:]

    heatingIndex = np.where(X.columns == random_data['heatingType'])[0][0]
    conIndex = np.where(X.columns == random_data['condition'])[0][0]
    flatTypeIndex = np.where(X.columns == random_data['typeOfFlat'])[0][0]
    regionIndex = np.where(X.columns == random_data['regio2'])[0][0]

    x = np.zeros(len(X.columns))
    x[0] = random_data['livingSpace']
    x[1] = random_data['noRooms']
    x[2] = random_data['additioncost']
  

    if heatingIndex >= 0:
        x[heatingIndex] = 1
    if conIndex >= 0:
        x[conIndex] = 1
    if flatTypeIndex >= 0:
        x[flatTypeIndex] = 1
    if regionIndex >= 0:
        x[regionIndex] = 1

    predict_price = model.predict([x])[0]
    print(f"Price from the dataframe: {random_data['totalRent']}\nPrice from prediction :{predict_price}")

    return  predict_price

In [None]:
predict_price = model_predict_price(ridge, df)
print(predict_price)
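
One thing to note about the function above: `np.where(...)[0][0]` raises an `IndexError` when a category has no matching dummy column, so the `>= 0` checks never act as a fallback. Below is a minimal sketch of a more defensive way to build the input row. The helper `build_feature_vector` and the demo columns are hypothetical and only illustrate the idea.

```python
def build_feature_vector(row, feature_columns, numeric_features, categorical_features):
    """Build a one-hot encoded input row in the order of feature_columns.

    row: mapping from raw column name to value (a dict or a pandas Series)
    """
    position = {name: i for i, name in enumerate(feature_columns)}
    x = [0.0] * len(feature_columns)
    for name in numeric_features:
        x[position[name]] = float(row[name])
    for name in categorical_features:
        idx = position.get(row[name])  # None when the category has no dummy column
        if idx is not None:
            x[idx] = 1.0
    return x

# Hypothetical columns and values, for illustration only
demo_columns = ['livingSpace', 'noRooms', 'additioncost', 'Berlin', 'central_heating']
demo_row = {'livingSpace': 60.0, 'noRooms': 2, 'additioncost': 150.0,
            'regio2': 'Berlin', 'heatingType': 'unknown_type'}
x_demo = build_feature_vector(demo_row, demo_columns,
                              ['livingSpace', 'noRooms', 'additioncost'],
                              ['regio2', 'heatingType'])
print(x_demo)
```

Looking numeric features up by name also removes the assumption that `livingSpace`, `noRooms` and `additioncost` sit in the first three positions.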

## Create a function to compare predictions with random rows from the dataset

In [None]:
def comparison(model, df):
    list_of_diff = [] # Create a list to calculate the average
    
    for i in range(0,20):
        random_data = df.iloc[random.randint(df.shape[0]-1),:]

        heatingIndex = np.where(X.columns == random_data['heatingType'])[0][0]
        conIndex = np.where(X.columns == random_data['condition'])[0][0]
        flatTypeIndex = np.where(X.columns == random_data['typeOfFlat'])[0][0]
        regionIndex = np.where(X.columns == random_data['regio2'])[0][0]

        x = np.zeros(len(X.columns))
        x[0] = random_data['livingSpace']
        x[1] = random_data['noRooms']
        x[2] = random_data['additioncost']
      

        if heatingIndex >= 0:
            x[heatingIndex] = 1
        if conIndex >= 0:
            x[conIndex] = 1
        if flatTypeIndex >= 0:
            x[flatTypeIndex] = 1
        if regionIndex >= 0:
            x[regionIndex] = 1

        predict_price = model.predict([x])[0]
        diff = abs(random_data['totalRent']-predict_price)
        print(f"{i:<5}: Dataframe: {random_data['totalRent']:<10},  Model Prediction : {round(predict_price,2):<10},  Difference: {round(diff,2)}")

        
        # add the absolute difference between the actual and predicted rent to the list
        list_of_diff.append(diff) 
        # print("\n======================\n")
    
    avg = sum(list_of_diff)/len(list_of_diff)
    print("\n======================")
    print(f"\nThe average of the difference from the actual and prediction: {avg}")

### Testing the data randomly by using comparison function

In [None]:
comparison(ridge, df)

## Light Gradient Boost

### Setting hyperparameters and fitting the model

In [None]:
d_train = lgb.Dataset(x_train, label=y_train) # Wrap the training data in a LightGBM Dataset

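# Parameters for this model
# Note: 'n_estimators' is an alias for the number of boosting rounds, so it may
# take precedence over the 100 rounds passed to lgb.train below.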
params = {
        'n_estimators': 10000,
        'objective': 'regression',
        'metric': 'rmse',
        'boosting_type': 'gbdt',
        'max_depth': -1,
        'learning_rate': 0.01,
        'subsample': 0.72,
        'subsample_freq': 4,
        'feature_fraction': 0.4,
        'lambda_l1': 1,
        'lambda_l2': 1,
        'seed': 46,
        }

lightgb = lgb.train(params, d_train, 100)

Check whether the predictions look as expected.

In [None]:
y_pred = lightgb.predict(x_test)
y_pred

The 'Light Gradient Boosting Machine' works well, and in terms of accuracy we could use this model in preference to 'Ridge Regression'.

In [None]:
lightgb_evaluation = evaluate_function(lightgb,x_test,y_test,'Light Gradient Boost')
lightgb_evaluation

### Testing the data randomly by using comparison function

In [None]:
comparison(lightgb, df)

Let's try some rows to make sure our model is working properly.

## CatBoost

### Setting hyperparameters and fitting the model

In [None]:
!pip install catboost
import catboost as cb

In [None]:
train_dataset = cb.Pool(x_train, y_train)
test_dataset = cb.Pool(x_test, y_test)

In [None]:
catboost = cb.CatBoostRegressor(loss_function="RMSE")

In [None]:
grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}
catboost.grid_search(grid, train_dataset)

In [None]:
y_preds = catboost.predict(x_test)
y_preds

In [None]:
catboost_evaluation = evaluate_function(catboost,x_test,y_test,'CatBoost')


In [None]:
comparison(catboost, df)

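## XGBoost

### Setting hyperparameters and fitting the model

In [None]:
# Only the class is imported, because the fitted estimator below is stored
# in a variable named `xgboost`, which would shadow the module.
from xgboost import XGBRegressor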

In [None]:
xgboost = XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             n_estimators=100, n_jobs=-1, num_parallel_tree=1, objective='reg:squarederror',
             random_state=123)
xgboost = xgboost.fit(x_train, y_train)

In [None]:
y_preds = xgboost.predict(x_test)
y_preds

In [None]:
xgboost_evaluation = evaluate_function(xgboost,x_test,y_test,'Xgboost')

In [None]:
comparison(xgboost, df)

## Linear Regression

### Setting hyperparameters and fitting the model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=True)
linear = linear.fit(x_train, y_train)

In [None]:
y_preds = linear.predict(x_test)
y_preds[0:100]

In [None]:
linear_evaluation = evaluate_function(linear,x_test,y_test,'Linear Regression')

# Comparison between models

In [None]:
ridge_evaluation

In [None]:
# Create the comparison dataframe from the evaluation dictionaries.
# It is named comparison_df so that it does not overwrite the comparison() function defined earlier.
comparison_df = pd.DataFrame({ridge_evaluation['Model Name']:ridge_evaluation,
                              lightgb_evaluation['Model Name']:lightgb_evaluation,
                              catboost_evaluation['Model Name']:catboost_evaluation,
                              xgboost_evaluation['Model Name']:xgboost_evaluation,
                              linear_evaluation['Model Name']:linear_evaluation})


comparison_df = comparison_df.transpose()
comparison_df.reset_index(drop=True)

## Accuracy of the model

In [None]:
comparison_df[['RMSE','MAE']].sort_values('RMSE',ascending=False).plot(kind='bar',figsize=(10,10)).legend(bbox_to_anchor=(1.0,1.0))

## The speed/score tradeoff

### Create the function to measure the time of prediction

In [None]:
import time
def pred_timer(model,samples):
    start_time = time.perf_counter()
    model.predict(samples)
    end_time = time.perf_counter()
    total_time = end_time-start_time
    time_per_pred = total_time/len(samples)
    return total_time, time_per_pred
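
A single timed call can be noisy, since the first call often includes warm-up costs. As a rough sketch, the measurement could be repeated and the median taken. The function `pred_timer_repeated` and the stand-in `ConstantModel` are hypothetical and are not used in the rest of this notebook.

```python
import statistics
import time

def pred_timer_repeated(model, samples, repeats=5):
    """Time model.predict(samples) several times; return the median total and per-sample time."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(samples)
        durations.append(time.perf_counter() - start)
    median_total = statistics.median(durations)
    return median_total, median_total / len(samples)

class ConstantModel:
    """Hypothetical stand-in with a predict method, used only for this demo."""
    def predict(self, samples):
        return [0.0 for _ in samples]

total_time, time_per_pred = pred_timer_repeated(ConstantModel(), [[1.0, 2.0]] * 1000)
print(f"Median total time: {total_time:.6f} seconds")
print(f"Median time per prediction: {time_per_pred:.10f} seconds")
```

The median is less sensitive to one slow run than the mean, which matters when the per-prediction times being compared are this small.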

Calculate the Ridge Regression time per prediction

In [None]:
x_test.shape

In [None]:
total_pred_time_ridge_reg, time_per_pred_ridge_reg = pred_timer(ridge,x_test)

print('Ridge Regression Model')
print(f"Total time prediction for test dataframe: {round(total_pred_time_ridge_reg,3)} seconds")
print(f"Time per each prediction: {round(time_per_pred_ridge_reg,10)}")


Calculate the Light Gradient Boost time per prediction

In [None]:
total_pred_time_lgbm, time_per_pred_lgbm = pred_timer(lightgb,x_test)

print("Light Gradient Boost Model")
print(f"Total time prediction for test dataframe: {round(total_pred_time_lgbm,3)} seconds")
print(f"Time per each prediction: {round(time_per_pred_lgbm,10)}")


Calculate the CatBoost time per prediction

In [None]:
total_pred_time_catboost, time_per_pred_catboost = pred_timer(catboost,x_test)

print("CatBoost Model")
print(f"Total time prediction for test dataframe: {round(total_pred_time_catboost,3)} seconds")
print(f"Time per each prediction: {round(time_per_pred_catboost,10)}")


Calculate the XGBoost time per prediction

In [None]:
total_pred_time_xgboost, time_per_pred_xgboost = pred_timer(xgboost,x_test)

print("XGBoost Model")
print(f"Total time prediction for test dataframe: {round(total_pred_time_xgboost,3)} seconds")
print(f"Time per each prediction: {round(time_per_pred_xgboost,10)}")


Calculate the Linear Regression time per prediction

In [None]:
total_pred_time_linear, time_per_pred_linear = pred_timer(linear,x_test)

print("Linear Regression Model")
print(f"Total time prediction for test dataframe: {round(total_pred_time_linear,3)} seconds")
print(f"Time per each prediction: {round(time_per_pred_linear,10)}")


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.scatter(time_per_pred_ridge_reg, ridge_evaluation['RMSE'], s=200, label="Ridge Regression")
plt.scatter(time_per_pred_lgbm, lightgb_evaluation['RMSE'], s=200,  label="Light Gradient Boost")
plt.scatter(time_per_pred_catboost, catboost_evaluation['RMSE'], s=200, label="Catboost")
plt.scatter(time_per_pred_xgboost, xgboost_evaluation['RMSE'], s=200, label="Xgboost")
plt.text(x=0.00003, y=121.5, s="Best Model: Xgboost", size=20)
plt.scatter(time_per_pred_linear, linear_evaluation['RMSE'], s=200, c="black", label="Linear Regression")
plt.scatter(0.00001,118, s=200, label="Ideal model")
plt.text(x=0.00003, y=118, s="Ideal regression model", size=20)

plt.legend(prop={'size': 16})
plt.title("Comparison between RMSE and time per prediction")
plt.xlabel("Time per prediction")
plt.ylabel("RMSE")

## Conclusion

From the plot above, we can see that the Xgboost regressor might have a slightly higher error than the Light Gradient Boost Machine, but its prediction time is much faster. So if you want the model with the better predictions, 'Light Gradient Boost Machine' might be the better answer. However, if you are concerned about speed, 'Ridge Regression' is 10 times faster.

I will use 'Xgboost' to predict prices from here on, because my resources and the scope of my work are limited.
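
The reasoning above, accepting a slightly higher error in exchange for faster predictions, can be written down as an explicit rule. The sketch below is only one possible rule: the function `pick_model`, the 2% tolerance and the numbers are all hypothetical placeholders, not results from this notebook.

```python
def pick_model(results, rmse_tolerance=0.02):
    """Pick the fastest model whose RMSE is within rmse_tolerance (relative) of the best RMSE.

    results: dict mapping model name -> (rmse, seconds_per_prediction)
    """
    best_rmse = min(rmse for rmse, _ in results.values())
    candidates = {name: secs for name, (rmse, secs) in results.items()
                  if rmse <= best_rmse * (1.0 + rmse_tolerance)}
    return min(candidates, key=candidates.get)

# Placeholder numbers, for illustration only
demo_results = {
    'model_a': (120.0, 4e-5),  # most accurate, slowest
    'model_b': (121.0, 1e-5),  # slightly less accurate, faster
    'model_c': (150.0, 1e-6),  # fastest, but much less accurate
}
choice = pick_model(demo_results)
print(choice)
```

With a tolerance of zero the rule falls back to the most accurate model, so the tolerance expresses how much accuracy we are willing to give up for speed.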

# Saving the model for further use.

In [None]:
import pickle
with open('german_home_prices_model.pickle','wb') as f:
    pickle.dump(xgboost,f)

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open("columns.json","w") as f:
    f.write(json.dumps(columns))
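
For completeness, here is a minimal sketch of how the two saved files might be loaded again later. It uses a temporary directory and a plain dictionary as a stand-in for the trained model, so the file names and contents here are assumptions made for illustration.

```python
import json
import os
import pickle
import tempfile

# Stand-in for the trained model: any picklable object is saved and loaded the same way
demo_model = {'name': 'stand-in model'}
demo_columns = {'data_columns': ['livingspace', 'norooms', 'additioncost', 'berlin']}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pickle')
    columns_path = os.path.join(tmp_dir, 'columns.json')

    # Save both artifacts
    with open(model_path, 'wb') as f:
        pickle.dump(demo_model, f)
    with open(columns_path, 'w') as f:
        json.dump(demo_columns, f)

    # Reload them the way a prediction script might
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(columns_path) as f:
        loaded_columns = json.load(f)['data_columns']

# Column names were lower-cased when saved, so look values up in lower case
region_index = loaded_columns.index('Berlin'.lower())
print(loaded_columns, region_index)
```

Because the column names are stored in lower case, any lookup against them needs to lower-case its input too. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.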

# Summary

This is the end of the kernel. If you liked this kernel or learned something from it, please upvote! It means a lot for my future opportunities. Also, feel free to comment on my mistakes, because that will surely help me improve.

Thanks for viewing!