# RandomForestRegressor Machine Learning model
## Info:
- This algorithm will use an imputation method to prevent values from being lost.
- It will also use one of these preprocessing methods: OneHotEncoding or OrdinalEncoding.

Imports used in this notebook:

In [16]:
import numpy as np
import os
import pandas as pd
from sklearn.ensemble   import RandomForestRegressor
from sklearn.impute     import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

Gets the current working directory and the input files folder

In [17]:
cwd = os.getcwd()
# os.path.join picks the right separator on any operating system
input_folder = os.path.join(cwd, "input")

Reads the dataset into the `database` variable

In [18]:
database_location = os.path.join(input_folder, "melb_data.csv")
database = pd.read_csv(database_location, encoding='utf-8')

Pandas options

In [19]:
pd.set_option('display.max_columns', None)

Prints the first 5 records

In [20]:
database.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


We need to make some checks in order to start:
1. Check data types
2. Check column names
3. Check for missing values
4. Check for bad data
5. Impute null values
6. Check the distribution type
7. Scale the data
8. Check for outliers
9. Check for data imbalance
10. Perform necessary transformations
11. Perform feature engineering
12. Bin continuous data
13. Select features
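
As a rough illustration of checks 3 and 5 above, the sketch below counts missing values and fills them with `SimpleImputer`. The toy frame and the `median` strategy are assumptions made for the example, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for two numeric columns of the dataset (illustrative values).
toy = pd.DataFrame({
    "Car": [1.0, 0.0, np.nan, 2.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan],
})

# Check 3: count missing values per column.
missing_per_column = toy.isna().sum()

# Check 5: replace each missing value with the column median (assumed strategy).
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
```

On the real data, the same two steps would run on the numeric columns of `database`; which strategy suits each column is still an open choice.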

Check data types, unique values, and whether each column has empty fields

In [21]:
def get_database_information(database):
    """Summarize each column: dtype, unique-value count, and whether it has missing values."""
    types_tb = pd.DataFrame(database.dtypes.reset_index())
    types_tb.rename(columns={'index':'column', 0:'type'}, inplace=True)
    # NaN counts as one unique value here
    unique_count = pd.Series([database[col].unique().size for col in database.columns], name='uniqueValues')
    # reset_index() keeps the column names in an extra 'index' column
    null_values = pd.Series(database.isna().any(), name='hasEmpty').reset_index()
    types_tb = pd.concat([types_tb, unique_count, null_values], axis=1)
    return types_tb
get_database_information(database)

Unnamed: 0,column,type,uniqueValues,index,hasEmpty
0,Suburb,object,314,Suburb,False
1,Address,object,13378,Address,False
2,Rooms,int64,9,Rooms,False
3,Type,object,3,Type,False
4,Price,float64,2204,Price,False
5,Method,object,5,Method,False
6,SellerG,object,268,SellerG,False
7,Date,object,58,Date,False
8,Distance,float64,202,Distance,False
9,Postcode,float64,198,Postcode,False


Rearrange data types

In [23]:
# Five most repeated values per column
group = [database.groupby(by=[col]).size() for col in database.columns]
group = [serie.sort_values(ascending=False) for serie in group]
sums = [serie.sum() for serie in group] # count validation

five_more_repeated = pd.concat([pd.DataFrame(serie).head().reset_index().rename({0:'count'}, axis=1, inplace=False) for serie in group], axis=1)

This code is extremely expensive, so it will not go to production:
```python
suburb_most_repeated = database['Suburb'].value_counts().sort_values(ascending=False).head(10).index

for label in suburb_most_repeated:
    database['Suburb_'+label] = np.where(database['Suburb'] == label, 1, 0)
for label in [label for label in database['Suburb'].unique() if label not in suburb_most_repeated]:
    database['Suburb_'+label] = 0
```
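
A cheaper way to get the same top-N indicator columns is to mask the rare labels first and let `pd.get_dummies` build the columns in one pass. This is only a sketch on a toy frame: `top_n = 2` stands in for the 10 used above, and unlike the loop above it does not create the all-zero columns for the remaining suburbs, on the assumption that they carry no information.

```python
import pandas as pd

# Toy frame standing in for the Suburb column (illustrative values).
toy = pd.DataFrame(
    {"Suburb": ["Reservoir", "Richmond", "Reservoir", "Kew", "Richmond", "Reservoir"]}
)

top_n = 2  # the notebook uses 10
# value_counts() already sorts by frequency, descending.
top_labels = toy["Suburb"].value_counts().head(top_n).index

# Labels outside the top N become NaN, so get_dummies emits no column for them.
reduced = toy["Suburb"].where(toy["Suburb"].isin(top_labels))
dummies = pd.get_dummies(reduced, prefix="Suburb", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)
```

Rows whose suburb is outside the top N end up with zeros in every indicator column.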

In [26]:
## Suburb column (0)
# Needs encoding (object column)
suburb_most_repeated = database['Suburb'].value_counts().sort_values(ascending=False).head(10).index


ValueError: Expected 2D array, got 1D array instead:
array=['Abbotsford' 'Abbotsford' 'Abbotsford' ... 'Williamstown' 'Williamstown'
 'Yarraville'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [None]:
# Address column (1)

In [None]:
# Type column (3)

In [None]:
# Method column (5)

In [None]:
# SellerG column (6)

In [None]:
# Date column (7)
database['Date'] = pd.to_datetime(database['Date'], format="%d/%m/%Y")
get_database_information(database).loc[7]

column                    Date
type            datetime64[ns]
uniqueValues                58
index                     Date
hasEmpty                 False
Name: 7, dtype: object
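
The explicit `format` matters because the dates are written day first. A small check on two values copied from the `head()` output, as a sketch:

```python
import pandas as pd

# Two raw Date values taken from the first rows of the dataset.
dates = pd.Series(["3/12/2016", "4/02/2016"])

# With an explicit day-first format, "3/12/2016" is read as 3 December 2016.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
months = parsed.dt.month.tolist()
```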

In [None]:
# CouncilArea column (16)