# Fun project: building a regressor for Berlin flat prices using data from 2019

## steps to approach the project

1. Preprocess the data, i.e. use PCA to find the features that make sense
2. EDA on the cleaned dataset
3. Modeling
4. Model evaluation
5. Deploy the model using FastAPI

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

## step 1: Pre-process data

In [None]:
df_org = pd.read_csv('berlin-houses.csv')
df_org.info()

## Dataset

The dataset used for the analysis can be found in the file `berlin-houses.csv`

The variables of this dataset are:

- `id` - id of listing
- `lat` - latitude of the listing
- `lon` - longitude of the listing
- `cold_price` - price of the listing before heating and upkeep costs
- `warm_price` - price of the listing after heating and upkeep costs
- `currency` - currency of the listing prices
- `short_listed` - if a given listing has short listed candidates
- `postcode_id` - post code of the listing
- `balcony` - if a listing has a balcony
- `builtin_kitchen` - if a listing has a built-in kitchen
- `created_date` - date the listing was created
- `modified_date` - date the listing was modified
- `published_date` - date the listing was published
- `energy_certificate` - if a listing has an energy certificate
- `has_new_flag` - if a listing is a new build or has been renovated recently.
- `living_space` - the living area in square meters (m²)
- `new_home_builder` - if a listing has been built by a new building company
- `number_rooms` - total number of rooms in the listing
- `private_offer` - if a listing is published by a private owner
- `address` - address of the listing
- `link` - link to listing page
- `quarter` - district where listing is located
- `garden` - if a listing has a garden
- `listing_type` - listing size category
- `localhost_date` - date when listing data was saved into database
- `no_longer_available` - if the listing is no longer available on the website
- `no_longer_available_date` - date when listing was no longer available on the website

In [None]:
all_feats = df_org.columns.values
useful_feats = df_org.drop(columns=['id','currency','cold_price','short_listed',
                                'created_date','published_date',
                                'modified_date','address','link',
                                'listing_type','localhost_date','no_longer_available',
                                'no_longer_available_date','quarter']).columns.values
num_feats = ['lat', 'lon','number_rooms','living_space','warm_price']
cat_feats = df_org[useful_feats].drop(num_feats,axis=1).columns.values

In [18]:
x = ['id','currency','cold_price','currency','short_listed',
                                'created_date','published_date',
                                'modified_date','address','link',
                                'listing_type','localhost_date','no_longer_available',
                                'no_longer_available_date','quarter']

### Reasons for dropping the columns below
- `cold_price` is dropped because it is redundant with `warm_price`
- `quarter` is dropped because `lat`, `lon`, and `postcode_id` (PLZ) are chosen as the location information group
- `address` is dropped because it is redundant with the location information group

**The features below are dropped because they are metadata about the listing and add no value to model training**
- `id`
- `currency`
- `short_listed`
- `created_date`
- `published_date`
- `modified_date`
- `link`
- `listing_type`
- `localhost_date`
- `no_longer_available`
- `no_longer_available_date`

In [None]:
df = df_org[useful_feats].dropna()
df['postcode_id'] = df['postcode_id'].astype('category')
df.info()

In [None]:
df.sample(3)

In [None]:
df[num_feats].head(2)

In [None]:
df[cat_feats].head(2)

In [None]:
df_analysis = df[useful_feats]

In [None]:
df_analysis.sample(3)

In [None]:
X = df_analysis.drop(columns=['warm_price'])
y = df_analysis['warm_price']
SEED = 42

# Create a PCA model with 10 components
pca = PCA(n_components=10)

# Fit the model to the data and transform the data
X_transformed = pca.fit_transform(X)

# Get the explained variance ratio of each component
explained_variance = pca.explained_variance_ratio_

In [None]:
# create a bar chart
plt.bar(range(len(explained_variance)), explained_variance)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')

# show the plot
plt.show()

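PCA does not help here: its components are linear combinations of all input columns, so the plot does not tell us which original features to keep. I will use a random forest to estimate feature importances instead.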
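A further caveat is that PCA is sensitive to feature scale: columns with a large numeric range (such as `postcode_id` or `living_space`) tend to dominate the components unless the data is standardized first. The sketch below illustrates the effect on a small synthetic frame. The column names mirror the dataset, but the values are made up, and `StandardScaler` is an addition that the cells above do not use.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Berlin dataset): three independent
# columns that live on very different numeric scales.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "lat": rng.normal(52.5, 0.05, 500),
    "living_space": rng.normal(70, 25, 500),
    "postcode_id": rng.integers(10115, 14199, 500).astype(float),
})

# PCA on the raw columns: the largest-scale column absorbs almost all variance.
raw_ratio = PCA(n_components=3).fit(toy).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy)
scaled_ratio = PCA(n_components=3).fit(scaled).explained_variance_ratio_

print("raw:   ", raw_ratio.round(3))
print("scaled:", scaled_ratio.round(3))
```

On the raw frame the first component is driven almost entirely by the postcode column, while after scaling the variance is spread across components, so an unscaled explained-variance plot says more about units than about usefulness.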

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rfr = RandomForestRegressor(max_depth=16, 
                           n_estimators=700, 
                           max_features='log2', 
                           random_state=SEED) 

rfr.fit(X,y)

importance = rfr.feature_importances_

# Map each feature name to its importance score
f_importance = dict(zip(X.columns, importance))

plt.bar(f_importance.keys(),f_importance.values())
plt.xticks(rotation='vertical')
plt.title('Feature Importance in RF Regression Model')

According to the RandomForestRegressor feature importances, living space is the most important feature, followed by the location information group, and then whether the flat has a built-in kitchen (EBK, Einbauküche).
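Impurity-based importances such as `feature_importances_` can be misleading for features with many unique values, so it may be worth cross-checking the ranking with permutation importance on held-out data. The following is a minimal sketch of that cross-check, not part of the original analysis: it runs on synthetic data, and the `toy_X`, `toy_y`, and `noise` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Berlin dataset).
rng = np.random.default_rng(42)
n = 400
toy_X = pd.DataFrame({
    "living_space": rng.uniform(20, 150, n),
    "builtin_kitchen": rng.integers(0, 2, n),
    "noise": rng.normal(size=n),  # unrelated to the target by construction
})
toy_y = 12 * toy_X["living_space"] + 50 * toy_X["builtin_kitchen"] + rng.normal(0, 20, n)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out split and measure the score drop.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
perm_importance = pd.Series(result.importances_mean, index=toy_X.columns)
perm_importance = perm_importance.sort_values(ascending=False)
print(perm_importance)
```

If the permutation ranking on the real data agrees with the bar chart above, the conclusion about living space and location is on firmer ground.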

## step 2: EDA

In [None]:
import seaborn as sns

In [None]:
df[num_feats]

In [None]:
sns.heatmap(df[num_feats].corr(),annot=True)

Looking at the numerical features, the following stands out:
- The number of rooms correlates positively with living space
- The flat price has only mild correlations with all numerical features

In [None]:
sns.countplot(data=df_analysis,x='postcode_id').set(title='Distribution of flats per PLZ')
plt.xticks(rotation=90)

In [None]:
df[cat_feats].columns[1:]

In [None]:
fig, axes = plt.subplots(1,7,figsize=(20,3))
# sns.countplot(data=df_analysis,x='balcony', ax=axes[0])
# sns.countplot(data=df_analysis,x='builtin_kitchen', ax=axes[1])
# sns.countplot(data=df_analysis,x='energy_certificate', ax=axes[2])

for i, col in enumerate(df[cat_feats].columns[1:]):
    sns.countplot(data=df_analysis,x=col,ax=axes[i])

We can also drop the `new_home_builder` feature, since all of its values are identical.
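Rather than reading this off the count plots, constant columns can also be detected programmatically. The sketch below uses a tiny made-up frame with column names borrowed from the dataset; `constant_cols` is a hypothetical helper name.

```python
import pandas as pd

# Made-up example rows (NOT taken from berlin-houses.csv).
toy = pd.DataFrame({
    "balcony": [True, False, True, True],
    "new_home_builder": [False, False, False, False],
    "garden": [False, True, False, False],
})

# A column with a single distinct value carries no information for the model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)  # ['new_home_builder']
```

Applied to the real data, the same list comprehension over `df[cat_feats]` would list every column that is safe to drop for this reason.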

In [None]:
final_features = df_analysis.drop(columns=['new_home_builder','warm_price']).columns.values
final_features

## step 3: Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
X = df[final_features]
y = df['warm_price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

In [None]:
# from keras.models import Sequential
# from keras.layers import Dense

In [None]:
rfr.fit(X_train,y_train)
y_pred = rfr.predict(X_test)

## step 4: Model evaluation

In [None]:
r2 = r2_score(y_test,y_pred)
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))  # squared=False is deprecated in recent scikit-learn releases
print(f"Model r2: {r2}")
print(f"Model RMSE: {RMSE}")
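For reference, the two metrics are defined as $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$. The sketch below computes both by hand on four made-up prices and compares them with the scikit-learn functions; the numbers are illustrative only and are not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices, for illustration only.
y_true = np.array([800.0, 1200.0, 950.0, 1500.0])
y_hat = np.array([850.0, 1100.0, 1000.0, 1400.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same unit as the price.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

Because RMSE is expressed in the price unit, it is the easier of the two to interpret as a typical prediction error per flat.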

## step 4.1: Retry with other features

In [None]:
X = df[['lat', 'lon', 'postcode_id','builtin_kitchen','living_space']]
y = df['warm_price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
rfr.fit(X_train,y_train)
y_pred = rfr.predict(X_test)
r2 = r2_score(y_test,y_pred)
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))  # squared=False is deprecated in recent scikit-learn releases
print(f"Model r2: {r2}")
print(f"Model RMSE: {RMSE}")

## step 4.2: Retry with other model - DNN
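This step is not implemented yet. As a placeholder, here is a minimal sketch of what a small neural-network regressor could look like. It is an assumption on my side, not the final approach: it uses scikit-learn's `MLPRegressor` as a stand-in for the Keras model hinted at by the commented-out imports, it runs on synthetic data, and names such as `toy_X`, `toy_y`, and `dnn` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Berlin dataset).
rng = np.random.default_rng(42)
n = 400
toy_X = pd.DataFrame({
    "living_space": rng.uniform(20, 150, n),
    "builtin_kitchen": rng.integers(0, 2, n).astype(float),
})
toy_y = 12 * toy_X["living_space"] + 50 * toy_X["builtin_kitchen"] + rng.normal(0, 20, n)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)

# Neural networks train poorly on unscaled data, so both the inputs and
# the target are standardized; predictions are mapped back automatically.
dnn = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=42),
    ),
    transformer=StandardScaler(),
)
dnn.fit(X_tr, y_tr)

dnn_r2 = r2_score(y_te, dnn.predict(X_te))
print(f"Sketch r2: {dnn_r2}")
```

On the real data the same wrapper could be fitted on `X_train` and `y_train` and scored with the metrics from step 4, which would make the comparison with the random forest direct.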

## step 5: Serve the model using FastAPI - still on the to-do list

## step 6: Deploy model to Streamlit or Heroku - still on the to-do list