# Germany Rental Prediction - Cleaning, EDA and Prediction

## Contents:
- Part 1: Cleaning and Visualization
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1biEgivJEOUVS8KbeTXyb1lNgsVtbitYj)

- Part 2: Using PyCaret for Model Hyperparameters Tuning
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lXJhdH3rGnKQ_LjBGMh8ZK-Lf2VcfLW5)
- Part 3: Create Model
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14XIC90Lss_izdw-PE1cgIe4eECsXrHbY)


## Purpose of this kernel
* For the coding part, skip to part 2 below

I traveled here from SEA and I don't know how much an apartment in Berlin should cost. It's really tough to find an apartment while staying in Germany for my master's degree. Furthermore, I need something for my data science portfolio for job applications after graduation. So why not build something from scratch with the dataset from Kaggle ([Immobilien](https://www.immobilienscout24.de/))?

So this kernel is meant to be better written than my previous kernels and easier for other people to follow. It uses what I've learned in my master's courses and from other online resources to produce something practical for a real-world setting.

## What to expect from this notebook
- Data cleaning to remove outliers and drop columns that have little correlation with the prediction target
- Visualizations to build a better understanding of the rental data in Germany
- Feature engineering from the original variables to create a better model
- A tool that estimates the rental cost from many variables

## Where is the data from?


The data was scraped from Immoscout24, the biggest real estate platform in Germany. Immoscout24 has listings for both rental properties and homes for sale, however, the data only contains offers for rental properties. <br>
At a given time, all available offers were scraped from the site and saved. This process was repeated three times, so the data set contains offers from the dates 2018-09-22, 2019-05-10 and 2019-10-08.

### Content

The data set contains most of the important properties, such as living area size, the rent, both base rent as well as total rent (if applicable), the location (street and house number, if available, ZIP code and state), type of energy etc. It also has two variables containing longer free text descriptions: description with a text describing the offer and facilities describing all available facilities, newest renovation etc.

## Acknowledgements
The data belongs to www.immobilienscout24.de and is for research purposes only. The data was created with .

# Basic data handling and inspection

Import all important libraries in this kernel

In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import time
import datetime
from datetime import date
from plotly.offline import init_notebook_mode, iplot
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
import lightgbm as lgb

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# import plotly.io as pio
# pio.renderers.default = "iframe"

# To make it run on Colab
# import plotly.io as pio
# pio.renderers.default = "colab"

Load the dataset to the kernel

In [None]:
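# The data is stored in a zip archive, so we unzip it with this helper
import zipfile

def unzip_data(filename):
  """Extract every file in the archive into the current working directory."""
  # The context manager closes the archive even if extraction fails
  with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall()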

In [None]:
# Download csv file from my google drive
!gdown --id 1fpkyO-9WbkVdxjZ6dhidl6FNzGJ5u4AD

# Unzip the file
unzip_data("immo_data.csv.zip")


In [None]:
df = pd.read_csv('data/immo_data.csv')

## EDA to better know the data

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
print(f"Number of columns (variables): {df.shape[1]}\nNumber of rows: {df.shape[0]}")

# Cleaning the Data

It is often said that about 80% of a data science job is cleaning the data. It might be a bit confusing, but in this section we deal with several things, such as
- Outliers
- Missing data
- Dropping columns
- Etc.

The result of this part is a better dataset to analyze, visualize, and use for prediction.

## Dealing with the missing values

When working on any dataset, we need to check for missing values to make sure the data is ready for further analysis and visualization.

Create a function that shows the columns with the most missing values, including the number of missing values and their percentage.

In [None]:
def missing_values(temp_idf, norows):   # takes the df and the number of rows to show
    # Count missing values per column, largest first
    total = temp_idf.isnull().sum().sort_values(ascending=False)
    # Express the same counts as a percentage of all rows
    percent = total / temp_idf.shape[0] * 100
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data.head(norows)


In [None]:
missing_values(df,20) # we use the df and the number of rows to show is 20

As we can see, some columns contain a lot of missing values, so I decided to remove every column with more than 30% missing values.

In [None]:
missing_data = missing_values(df,20)
# Drop the columns where more than 30% of the values are missing
df = df.drop(columns=missing_data[missing_data['Percent'] > 30].index)
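
As a rough illustration of this thresholding step, the sketch below applies the same idea to a small made-up frame. It computes the missing share for every column (not only the 20 listed by `missing_values`) and drops the ones above a cutoff. The toy data, its column names, and the `threshold` variable are assumptions for the example and not part of the rental dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'b' is 75% missing, 'c' is 25% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

threshold = 30  # percent; assumed cutoff, same value as in the cell above

# isna().mean() gives the fraction of missing values per column
missing_pct = toy.isna().mean() * 100
cols_to_drop = missing_pct[missing_pct > threshold].index

cleaned = toy.drop(columns=cols_to_drop)
print(list(cleaned.columns))
```

Only column `b` crosses the cutoff here, so `a` and `c` survive.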

Because I want to predict the rental price ('totalRent'), I drop all rows that don't contain a 'totalRent' value.

In [None]:
df.dropna(subset=['totalRent'],inplace=True)
print(f"Data shape after dropping rows that don't contain 'totalRent' data: {df.shape}")

## Remove columns that don't contain useful information

In [None]:
df.head()

In [None]:
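# Restrict to numeric and boolean columns; newer pandas versions no longer skip text columns silently
df.select_dtypes(include=['number', 'bool']).corr().sort_values('totalRent', ascending=False).totalRent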

### Drop what we won't use further

In [None]:
df.drop(columns=['livingSpaceRange','street','description','facilities','geo_krs','scoutId','regio1','telekomUploadSpeed','telekomTvOffer','pricetrend','regio3','noRoomsRange','picturecount','geo_bln','date',\
    'houseNumber','streetPlain','firingTypes','yearConstructedRange','baseRentRange','lift'],inplace=True)

Let's check the missing data in this dataframe again before making any other decision.

In [None]:
missing_values(df,10)

## Handling specific variables
We need to clean some specific variables to make the data more useful for visualization and prediction.

### Focus on the 'condition' variable
Take a closer look at the 'condition' variable and fill all of its missing values with 'Other'.

In [None]:
df['condition'] = df['condition'].fillna("Other") # fill the NA with 'Other'
df['condition'].value_counts()

The three least frequent conditions are not good conditions for an apartment finder, so I group them into 'Other'.

In [None]:
others_condition = df['condition'].value_counts().tail(3).index

def editcondition(dflist):
    if dflist in others_condition:
        return 'Other'
    else:
        return dflist

df['condition'] =df['condition'].apply(editcondition)
df['condition'].value_counts()

Fill the missing 'yearConstructed' values with the mean of each 'condition' group because, from my perspective, an apartment that is not fully_renovated or refurbished has probably been in use for many years.

In [None]:
df["yearConstructed"] = df['yearConstructed'].fillna(df.groupby('condition')['yearConstructed'].transform('mean')).round(0)
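
To make the group-wise fill easier to follow, here is a minimal sketch on made-up data. The column names mirror the notebook, but the values and the two condition labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: one missing construction year in each condition group
toy = pd.DataFrame({
    "condition": ["new", "new", "new", "old", "old", "old"],
    "yearConstructed": [2015.0, 2017.0, np.nan, 1950.0, np.nan, 1960.0],
})

# transform('mean') returns one value per row: the mean of that row's group
group_mean = toy.groupby("condition")["yearConstructed"].transform("mean")

# fillna aligns on the index, so each gap receives its own group's mean
toy["yearConstructed"] = toy["yearConstructed"].fillna(group_mean).round(0)
print(toy)
```

The missing 'new' row is filled with 2016 and the missing 'old' row with 1955, so each gap takes the average of its own group rather than the global average.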

Create a new variable that gives the number of years from construction until today

In [None]:
df['numberOfYear'] = date.today().year - df["yearConstructed"]

### Focus on 'Regio2' variable
Some regions might not have sufficient data, so I group them together into an 'Other' category.

In [None]:
df['regio2']

In [None]:
df['regio2'].value_counts().head(100)

In [None]:
df['regio2'] = df['regio2'].replace("_Kreis", "", regex=True)

In [None]:
others_region = list(df['regio2'].value_counts().iloc[80:,].index)

def edit_region(dflist):
    if dflist in others_region:
        return 'Other'
    else:
        return dflist

df['regio2'] =df['regio2'].apply(edit_region)
df['regio2'].value_counts()
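
The same "keep the most frequent categories, group the rest" idea can also be written without a helper function. This is only a sketch on a made-up series. The city list and the `top_n` value are assumptions (the notebook itself keeps the 80 most frequent regions).

```python
import pandas as pd

# Hypothetical region column
cities = pd.Series(["Berlin", "Berlin", "Berlin", "Hamburg", "Hamburg", "Kiel", "Jena"])

top_n = 2  # assumed cutoff for this toy example
keep = cities.value_counts().head(top_n).index

# where() keeps the values that satisfy the condition and replaces the rest
grouped = cities.where(cities.isin(keep), "Other")
print(grouped.value_counts())
```

`Series.where` combined with `isin` is vectorized, which tends to be faster than `apply` with a Python function on a large column.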

In [None]:
df['regio2'].value_counts().sort_values(ascending=False).head(50)

## Outliers

First, we should focus on the variable we want to predict and the one most relevant to it: 'totalRent' (Warmmiete) and 'baseRent' (Kaltmiete).

In [None]:
df['baseRent'].describe().round(2)

In [None]:
df['totalRent'].describe().round(2)

In [None]:
df['typeOfFlat'].value_counts()

### Remove outliers by focusing on 'totalRent' and 'livingSpace'

In [None]:
print(f"Number of rows before the outlier removal: {df.shape[0]}")

In [None]:
df.head()

In [None]:
# 95th percentile of 'totalRent' among penthouses
percentile_95_penthouse = np.percentile(df.loc[df['typeOfFlat'] == 'penthouse', 'totalRent'], 95)
print(f"95th percentile of the total rent for penthouses: {percentile_95_penthouse}")

In [None]:
# 5th percentile of 'totalRent' among normal apartments
percentile_5_apartment = np.percentile(df.loc[df['typeOfFlat'] == 'apartment', 'totalRent'], 5)
print(f"5th percentile of the total rent for apartments: {percentile_5_apartment}")

### Finding the difference between total rent and base rent ('totalRent' - 'baseRent')

In [None]:
df['rent_difference'] = df['totalRent']-df['baseRent']
df[['totalRent','baseRent','rent_difference']].head()

In [None]:
percentile_95_diff = np.percentile(df['rent_difference'],95)
percentile_5_diff = np.percentile(df['rent_difference'],5)

print(f"95th percentile of the rent difference: {percentile_95_diff}")
print(f"5th percentile of the rent difference: {percentile_5_diff}")

Filter and plot repeatedly to find the best result, removing the outliers that don't make sense for the dataset

In [None]:
df = df[(df['totalRent'] > percentile_5_apartment) & (df['totalRent'] < percentile_95_penthouse)]
df = df[(df['totalRent'] > df['baseRent'])]
df = df[(df['rent_difference'] < percentile_95_diff) & (df['rent_difference'] > percentile_5_diff)]

df.shape
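
A simplified sketch of the percentile trimming used above, on a single made-up column. The notebook takes its cutoffs from specific flat types, whereas this toy version takes both cutoffs from the whole column. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rents from 100 to 2000 in steps of 100 (20 values)
toy = pd.DataFrame({"totalRent": np.arange(100, 2100, 100)})

low = np.percentile(toy["totalRent"], 5)
high = np.percentile(toy["totalRent"], 95)

# Strict inequalities drop everything at or beyond the two cutoffs
trimmed = toy[(toy["totalRent"] > low) & (toy["totalRent"] < high)]
print(low, high, len(trimmed))
```

With the default linear interpolation the cutoffs fall between observed values, so only the smallest and the largest rent are removed here.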

Drop the 'rent_difference' column because we won't use it anymore, then plot the total rent against the base rent.

In [None]:
df = df.drop(columns=['rent_difference'])

fig = px.scatter(df, x='totalRent', y='baseRent')
fig.show()

In [None]:
df['livingSpace'].describe()

In [None]:
df = df[(df['livingSpace'] > 10) & (df['livingSpace'] < 400)]


In [None]:
fig = px.scatter(df, x='baseRent', y='livingSpace')
fig.show()

## Feature Engineering
We've already created some columns, such as 'numberOfYear'. Now I'm creating more variables to inspect and to build a model from later.

Create new columns for the price per square meter and the additional cost

In [None]:
df['Pricepm2'] = df['baseRent'] / df['livingSpace']
df['additioncost'] = df['totalRent'] - df['baseRent']

In [None]:
fig = px.scatter(df, x='totalRent', y='Pricepm2')
fig.show()

### Service Charge

In [None]:
df['serviceCharge'].describe()

In [None]:
df = df[(df['serviceCharge'] < 1000)]
print(f"Number of rows after removing service charges of 1000 or more: {df.shape[0]}")

In [None]:
fig = px.scatter(df, x='totalRent', y='serviceCharge')
fig.show()

In [None]:
df['floor'] = df['floor'].fillna(df['floor'].mode()[0])
df = df[(df['floor'] >= -1) & (df['floor'] <= 20)] # Keep only floors from the basement (-1) to the 20th floor
print(f"Shape after removing floor: {df.shape[0]}")


In [None]:
df['heatingType'] = df['heatingType'].fillna(df['heatingType'].mode()[0])
df['typeOfFlat'] = df['typeOfFlat'].fillna(df['typeOfFlat'].mode()[0])

In [None]:
heatinglist = list(df['heatingType'].value_counts().head(10).index)
df = df[df['heatingType'].isin(heatinglist)] # keep only the 10 most common heating types

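Use the empirical (three-sigma) rule as a final cleaning step: drop the rows where a numeric column lies more than three standard deviations from its mean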

In [None]:
for cols in df.columns:
    if df[cols].dtype == 'int64' or df[cols].dtype == 'float64':
        upper_range = df[cols].mean() + 3 * df[cols].std()
        lower_range = df[cols].mean() - 3 * df[cols].std()
        
        indexs = df[(df[cols] > upper_range) | (df[cols] < lower_range)].index
        df = df.drop(indexs)
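
To show what this rule does on one column, here is a small sketch with made-up numbers. The series below is hypothetical and only meant to make the arithmetic visible.

```python
import pandas as pd

# 19 ordinary values and one extreme value (hypothetical data)
s = pd.Series([50.0] * 19 + [1000.0])

mean, std = s.mean(), s.std()  # pandas uses the sample standard deviation
upper = mean + 3 * std
lower = mean - 3 * std

# Keep only the values inside the three-sigma band
kept = s[(s >= lower) & (s <= upper)]
print(len(s), len(kept))
```

The mean is 97.5 and the standard deviation is about 212.4, so the upper bound is about 734.8 and only the value 1000 is removed. Note that the outlier itself inflates the mean and the standard deviation, so on very small samples (ten values or fewer) no point can ever lie more than three sample standard deviations from the mean.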

Check one last time that no missing data is left

In [None]:
missing_values(df,5)

Now we're finished with the data cleaning, so we can move on to visualization to get a better understanding of our dataset.

In [None]:
print(f"Number of rows after all of the main data cleaning tasks: {df.shape[0]}")

# Data Visualization

In [None]:
df.head()

I love to use a correlation map to inspect the dataset and see which variables correlate most with the variable we want to predict.

In [None]:
f, ax = plt.subplots(figsize=(12, 12))

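# Restrict to numeric and boolean columns before computing the correlations
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr().sort_values(by='totalRent',ascending=False), square = True,fmt='.2f' ,annot = True)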

In [None]:
cor = df.select_dtypes(include=['number', 'bool']).corr().sort_values(by='totalRent',ascending=False)
cor.style.background_gradient(cmap='coolwarm')

From the correlation map, the variables least relevant to 'totalRent' are 'cellar', 'floor' and 'garden', so I drop them.

In [None]:
df.drop(['cellar','floor','garden'],axis=1,inplace=True)
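
Instead of reading the weak columns off the heatmap by eye, they could also be listed programmatically. The sketch below does this on a made-up frame. The data, the `noise` column, and the `cutoff` value are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "totalRent": [10.0, 20.0, 30.0, 40.0],
    "livingSpace": [1.0, 2.0, 3.0, 4.0],  # perfectly correlated with the target
    "noise": [1.0, -1.0, -1.0, 1.0],       # constructed to have zero correlation
})

cutoff = 0.1  # hypothetical threshold on the absolute correlation

# Correlation of every other column with the target
target_corr = toy.corr()["totalRent"].drop("totalRent")
weak_cols = target_corr[target_corr.abs() < cutoff].index.tolist()

reduced = toy.drop(columns=weak_cols)
print(weak_cols)
```

Keep in mind that the Pearson correlation only measures linear relationships, so a column with a low value here can still be useful to a tree-based model.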

## Basic Inspection

### Kurtosis and Skewness

In [None]:
fig,ax = plt.subplots(figsize=(10,6))
sns.distplot(df['totalRent'],fit=norm)

In [None]:
fig,ax = plt.subplots(figsize=(10,6))
sns.distplot(df['livingSpace'],fit=norm)
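
The plots above show the shape of the distributions, and the two statistics named in the heading can also be computed as numbers with pandas. A minimal sketch on two made-up series (the values are hypothetical):

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

# skew(): 0 for symmetric data, positive when the right tail is longer
# kurt(): excess kurtosis, which is 0 for a normal distribution
print(symmetric.skew(), symmetric.kurt())
print(right_skewed.skew(), right_skewed.kurt())
```

On the rental data the same calls would be `df['totalRent'].skew()` and `df['totalRent'].kurt()`. Rent distributions are typically right-skewed, but the actual values depend on the cleaning steps above.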

In [None]:
df.head()

### Start with the ratio of each city

In [None]:
# Count the listings per city first, then keep the five most frequent cities
countpie = df['regio2'].value_counts().head()

fig = {
  "data": [
    {
      "values": countpie.values,
      "labels": countpie.index,
      "domain": {"x": [0, .5]},
      "name": "City",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    },],
  "layout": {
        "title":"Share of listings among the five most frequent cities",
    }
}
iplot(fig)

In [None]:
countpie = df['regio2'].value_counts().sort_index()
countpie.drop(labels=['Other'],
             axis=0,
             inplace=True)

fig = px.histogram(x = df['regio2'].value_counts().drop(labels=['Other'],axis=0).sort_index().index,
                  y = countpie,
                  color=df['regio2'].value_counts().drop(labels=['Other'],axis=0).sort_index().index)

fig.update_xaxes(title="City")
fig.update_yaxes(title = "Number of listings per city")
fig.show()

In [None]:
plotter = df.groupby('regio2')['totalRent'].agg(['mean'])
plotter.columns = ["mean"]
plotter['regio2'] = plotter.index

data = [
    {
        'x': plotter['regio2'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['regio2'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Average rental per month", 
                   xaxis=dict(title='City'),
                   yaxis=dict(title='Cost of rental')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

München, Frankfurt am Main, Hamburg, Düsseldorf, Berlin and Köln seem to be the cities with the highest rents.

In [None]:
rentmean = df.groupby(['regio2'])['totalRent'].mean().sort_index()

fig = px.histogram(x = df['regio2'].value_counts().sort_index().index,
                   y = rentmean,
                   color= df['regio2'].value_counts().sort_index().index
             )
fig.update_xaxes(title="City")
fig.update_yaxes(title = "Average rental per month")
fig.show()

In [None]:
plotter = df.groupby('regio2')['Pricepm2'].agg(['mean'])
plotter.columns = ["mean"]
plotter['regio2'] = plotter.index

data = [
    {
        'x': plotter['regio2'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['regio2'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Average base rent per square meter by city", 
                   xaxis=dict(title='City'),
                   yaxis=dict(title='Euro/square meter')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

In [None]:
plotter = df.groupby('condition')['totalRent'].agg(['mean'])
plotter.columns = ["mean"]
plotter['condition'] = plotter.index

data = [
    {
        'x': plotter['condition'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['condition'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Average rental per month grouped by apartment condition", 
                   xaxis=dict(title='Apartment Condition'),
                   yaxis=dict(title='Cost of rental')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

In [None]:
plotter = df.groupby('regio2')['livingSpace'].agg(['mean'])
plotter.columns = ["mean"]
plotter['regio2'] = plotter.index

data = [
    {
        'x': plotter['regio2'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['regio2'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Average living space grouped by city", 
                   xaxis=dict(title='City'),
                   yaxis=dict(title='Average Living Space')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

In [None]:
# Count the listings per heating type (only the 10 most common types remain after cleaning)
countpie = df['heatingType'].value_counts()

fig = {
  "data": [
    {
      "values": countpie.values,
      "labels": countpie.index,
      "domain": {"x": [0, .5]},
      "name": "Heating Type",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    },],
  "layout": {
        "title":"Share of each heating type in the dataset",
    }
}
iplot(fig)

In [None]:
countpie = df['heatingType'].value_counts().sort_index()


fig = px.histogram(x = df['heatingType'].value_counts().sort_index().index,
                  y = countpie,
                  color=df['heatingType'].value_counts().sort_index().index)

fig.update_xaxes(title="Heating Type")
fig.update_yaxes(title = "Quantity of Heating Type")
fig.show()

In [None]:
plotter = df.groupby('heatingType')['totalRent'].agg(['mean'])
plotter.columns = ["mean"]
plotter['heatingType'] = plotter.index

data = [
    {
        'x': plotter['heatingType'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['heatingType'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Average rental grouped by heating type", 
                   xaxis=dict(title='Heating Type'),
                   yaxis=dict(title='Average Rental Cost')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

In [None]:
countpie = df['newlyConst'].value_counts()
countpie = countpie.sort_index() 
fig = {
  "data": [
    {
      "values": countpie.values,
      "labels": ['False','True'],
      "domain": {"x": [0, .5]},
      "hoverinfo":"label+percent+name",
      "hole": .3,
      "type": "pie"
    },],
  "layout": {
        "title":"Share of residences that are newly constructed",
    }
}
iplot(fig)

The pie chart shows how many of the apartments are newly constructed. Next, I want to know whether there is a big price gap between newly constructed apartments and the rest.

In [None]:
constructmean = df.groupby(['newlyConst'])['totalRent'].mean().sort_index()

fig = px.histogram(x = df['newlyConst'].value_counts().sort_index().index,
                   y = constructmean,
                   color= df['newlyConst'].value_counts().sort_index().index
             )
fig.update_xaxes(title="Newly constructed or not")
fig.update_yaxes(title = "Rental Cost")
fig.show()

In [None]:
constructmean = df.groupby(['newlyConst'])['totalRent'].mean().sort_index()
constructmean
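
The gap between the two group means can be computed directly rather than read off the table. A minimal sketch with made-up rents (the numbers are hypothetical and do not reflect the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "newlyConst": [True, True, False, False, False],
    "totalRent": [1400.0, 1200.0, 800.0, 900.0, 700.0],
})

# Mean rent per group, converted to a plain dict keyed by True/False
mean_rent = toy.groupby("newlyConst")["totalRent"].mean().to_dict()

gap = mean_rent[True] - mean_rent[False]
print(gap)
```

Applied to the notebook's data, the same subtraction on `constructmean` would give the difference discussed below.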

So newly constructed apartments rent for about 600 euros more per month on average than the others.

In [None]:
plotter = df.groupby('regio2')['additioncost'].agg(['mean'])
plotter.columns = ["mean"]
plotter['regio2'] = plotter.index

data = [
    {
        'x': plotter['regio2'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['regio2'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Average Additional Cost Per Month (Warmmiete - Kaltmiete)", 
                   xaxis=dict(title='City'),
                   yaxis=dict(title='Additional Cost per month')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

In [None]:
plotter = df.groupby('typeOfFlat')['totalRent'].agg(['mean'])
plotter.columns = ["mean"]
plotter['typeOfFlat'] = plotter.index

data = [
    {
        'x': plotter['typeOfFlat'],
        'y': plotter['mean'],
        'mode': 'markers+text',
        'text' : plotter['typeOfFlat'],
        'textposition' : 'bottom center',
        'marker': {  
            'size': 20,
        }
    }
]

layout = go.Layout(title="Type of Apartment and Average Rental Cost Per Month", 
                   xaxis=dict(title='Type Of Rental'),
                   yaxis=dict(title='Average rental type cost per month')
                  )
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename='scatter0')

We might want to separate the rental types because some of them cost much more per month than others.

In [None]:
countpie = df['hasKitchen'].value_counts()

fig = {
  "data": [
    {
      "values": countpie.values,
      "labels": countpie.index,
      "domain": {"x": [0, .5]},
      "name": "Has Kitchen",
      "hoverinfo":"label+percent+name",
      "hole": .7,
      "type": "pie"
    },],
  "layout": {
        "title":"Share of apartments with and without a kitchen",
    }
}
iplot(fig)

Most of the places do not include a kitchen.

In [None]:
constructmean = df.groupby(['hasKitchen'])['totalRent'].mean().sort_index()

fig = px.histogram(x = df['hasKitchen'].value_counts().sort_index().index,
                   y = constructmean,
                   color= df['hasKitchen'].value_counts().sort_index().index
             )
             
fig.update_xaxes(title="Has Kitchen")
fig.update_yaxes(title = "Cost of Rental")
fig.show(renderer="colab")

As a further idea, we could create more meaningful visualizations, such as separating the rental types, to make the trend of rental costs in Germany clearer.

## Visualize by Postal Code
This shapefile contains all the postal codes from Germany.
It was sourced from: https://www.suche-postleitzahl.org/

### Preparing the data

In [None]:
!gdown --id 1bAiuZigs6RzYsrOGE4YMEn83eDgQtYuK
unzip_data("maps.zip")

In [None]:
!pip install geopandas
!pip install folium==0.12.1
!pip install mapclassify==2.4.3

In [None]:
import geopandas as gpd

maps_data = gpd.read_file("data/maps/plz-5stellig.shp")

In [None]:
maps_data.head()

In [None]:
maps_data.plot(figsize=(20,20))

In [None]:
plz_avg = df.groupby(["geo_plz"])["totalRent"].mean()
plz_avg.head(10)

In [None]:
plz_avg = pd.DataFrame(plz_avg)
plz_avg.reset_index(drop=False, inplace=True)

plz_avg.head()

In [None]:
fill_na = pd.DataFrame()
fill_na["geo_plz"] = pd.DataFrame(maps_data["plz"]).astype(int)

fill_na

In [None]:
fill_na["is_in_results"] = fill_na["geo_plz"].isin(plz_avg["geo_plz"]).astype(int)
fill_na

In [None]:
# Take a copy so that the edits in the next cell don't touch fill_na
to_be_filled = fill_na.loc[fill_na['is_in_results'] == 0].copy()

to_be_filled

In [None]:
to_be_filled["is_in_results"] = to_be_filled['is_in_results'].replace(0, np.nan)
to_be_filled.columns = ["geo_plz", "totalRent"]

to_be_filled

In [None]:
plz_avg = pd.concat([plz_avg, to_be_filled], axis=0)
plz_avg

In [None]:
maps_data["plz"] = maps_data.plz.astype(int)
plz_avg["geo_plz"] = plz_avg.geo_plz.astype(int)

In [None]:
maps_data_final = maps_data.merge(plz_avg, left_on="plz", right_on="geo_plz")
maps_data_final = gpd.GeoDataFrame(maps_data_final)

maps_data_final = maps_data_final.rename(columns = {'totalRent':"avg_rent"})
maps_data_final.head(10)
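
The cells above first append placeholder rows for postal codes without listings and then merge. As an alternative sketch (not the approach this notebook takes), a left join produces the same missing values in one step. The example uses plain pandas frames with made-up postal codes instead of the GeoDataFrame, so the names `shapes` and `merged` are hypothetical.

```python
import pandas as pd

# Hypothetical postal codes: 10115 and 20095 have listings, 99999 has none
shapes = pd.DataFrame({"plz": [10115, 20095, 99999]})
plz_avg = pd.DataFrame({"geo_plz": [10115, 20095], "totalRent": [1200.0, 1100.0]})

# how="left" keeps every row of `shapes`; unmatched codes get NaN
merged = shapes.merge(plz_avg, how="left", left_on="plz", right_on="geo_plz")
merged = merged.rename(columns={"totalRent": "avg_rent"})
print(merged)
```

Every postal code from the shapefile side stays in the result, and the codes without listings carry `NaN` in `avg_rent`, which is what the `missing_kwds` styling of the map relies on.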

In [None]:
maps_data_final.info()

In [None]:
maps_data_final.plot(figsize=(20,20))

In [None]:
plt.rcParams["figure.figsize"] = (50,50)

fig, ax = plt.subplots(1)

ax.axis('off')

ax.set_title("Average Rent in Euros by German 5-digit \"Postleitzahl\" (Zip Code)", fontdict={'fontsize': '50', 'fontweight': '10'})

maps_data_final.plot(column="avg_rent",
                      ax=ax,
                      legend=True,
                      scheme="natural_breaks",
                      k=20,
                      #cmap = 'cividis',
                      edgecolor = "0",
                      linewidth = 0.001,
                      missing_kwds={"color": "white",
                                    "edgecolor": "red",
                                    "hatch": "///",
                                    "label": "Missing values"});

ax.annotate("Source: Kaggle Dataset", xy=(0.1, .08), xycoords='figure fraction', horizontalalignment='left', 
            verticalalignment='bottom', fontsize=25)


## Save for further use
# fig.savefig('map.eps', format='eps')
# fig.savefig('map2.svg', format='svg')

In [None]:
maps_data_final.explore()

# Machine Learning

## Preparing the data for training
Copying the data

In [None]:
df.head()

Drop the columns that are highly correlated with the prediction target.

In [None]:
predict_df = df.copy()
predict_df.drop(columns=['yearConstructed','serviceCharge','numberOfYear','newlyConst','balcony','hasKitchen','Pricepm2','baseRent','geo_plz'],inplace=True)
predict_df.head()

In [None]:
print(f"Number of data for training and testing is : {predict_df.shape[0]}")

### <font color='red'>Move to the Google Colab for Model Comparison </font>


Preparing the data for PyCaret in Google Colab because I couldn't run this library in my local environment<br>
<a href="https://colab.research.google.com/drive/1T0K249nNayfjiGjkDvjyG0UKj6bnkqKn?usp=sharing">Click to Google Colab</a>


In [None]:
# predict_df.to_csv('data/predict_test.csv')

# Summary

This is the end of this notebook. If you loved this kernel or learned something from it, please upvote! It means a lot for my future opportunities. Moreover, feel free to comment on my mistakes, because that will surely help me improve, and feel free to read my other notebooks.

Thanks for viewing!