# Brazilian House Rent Price

After taking a look at the data, I've decided that houses_to_rentv1 is pretty much unusable for me, so I'll use v2 and predict the most useful-looking value: the rent amount.

# Setup

Setup time. We load all of the necessary libraries and the data.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Import the scikit-learn methods and models here
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

In [None]:
data = pd.read_csv("../input/brasilian-houses-to-rent/houses_to_rent_v2.csv").drop(["total (R$)"], axis=1)
data.head()

# Data Inference

Now we'll look at the raw data, and use statistical methods and analysis to infer meaning and decide what to visualize, and what to use for the ML process.

In [None]:
X_labels = ["city", "area", "rooms", "bathroom", "parking spaces", "floor", "animal", "furniture", "hoa (R$)", "fire insurance (R$)", "property tax (R$)"]
data = data.dropna()  # dropna() returns a new frame, so the result has to be assigned
X = data[X_labels]
y = data["rent amount (R$)"]

In [None]:
data.describe()

In [None]:
data.groupby("city").mean(numeric_only=True)  # average only the numeric columns

**Conclusion**
> Because this dataset is pretty much impossible to pin down by simply looking at the raw numbers, I conclude that it is a great dataset to train a model on, and that such a model would be of use in real life.

And thus, we move to the visualization part for better inference.

# Data Visualization

We begin with general, whole-dataset visualization, then move on to more specific cases and pairs.

In [None]:
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(data.select_dtypes("number").corr(), annot=True)

Hmm... It seems that the HOA fee and the property tax have little to do with anything else, showing up as a clear dark band in the heatmap. I'll drop them.
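
To make that heatmap reading a bit more concrete, one option is to rank the numeric columns by their absolute correlation with the target. The sketch below runs on a small synthetic frame: the column names mirror the dataset, but every value is made up, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: all values are invented.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(20, 300, size=n)
toy = pd.DataFrame({
    "area": area,
    "fire insurance (R$)": area * 0.5 + rng.normal(0, 5, size=n),
    "property tax (R$)": rng.uniform(0, 1000, size=n),  # unrelated noise
    "rent amount (R$)": area * 30 + rng.normal(0, 300, size=n),
})

# Absolute Pearson correlation of every column with the target,
# sorted from most to least related.
target = "rent amount (R$)"
ranking = (
    toy.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

A column that lands at the bottom of such a ranking is a candidate for dropping, although a low linear correlation does not rule out a non-linear relationship.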

In [None]:
X = X.drop(["hoa (R$)", "property tax (R$)"], axis=1)
X.head()

In [None]:
sns.pairplot(data=data, hue="city")

# Data Manipulation

Time to change up the data a bit so the model can better infer from it. The first step is to convert every string value to a number.

In [None]:
print(X["city"].unique())
print(X["animal"].unique())
print(X["furniture"].unique())

In [None]:
print(X["floor"].unique())
# I checked the other columns. Seems to be all fine

There are 5 unique cities, so a simple integer code per city should work OK:
* São Paulo = 1
* Porto Alegre = 2
* Rio de Janeiro = 3
* Campinas = 4
* Belo Horizonte = 5

There are also two different values each in the animal column and the furniture column, so those become 0/1 flags. Then there is the "-" thing in the floor column. I will assume it means no data, replace it with NaN, and let the imputer fill it in later.
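
As a sketch of the encoding described above: a dictionary passed to `Series.map` keeps the city codes in one place, and `pd.to_numeric(..., errors="coerce")` turns the "-" placeholder into NaN while converting the remaining floors to numbers. The toy rows are made up, and the "not acept" / "not furnished" spellings are assumptions for the second category in each column.

```python
import pandas as pd

# Toy rows, invented for illustration. "not acept" and "not furnished"
# are assumed spellings of the second category in each column.
toy = pd.DataFrame({
    "city": ["São Paulo", "Campinas", "Belo Horizonte"],
    "animal": ["acept", "not acept", "acept"],
    "furniture": ["furnished", "not furnished", "not furnished"],
    "floor": ["7", "-", "12"],
})

city_codes = {
    "São Paulo": 1,
    "Porto Alegre": 2,
    "Rio de Janeiro": 3,
    "Campinas": 4,
    "Belo Horizonte": 5,
}

encoded = pd.DataFrame({
    "city": toy["city"].map(city_codes),
    "animal": (toy["animal"] == "acept").astype(int),
    "furniture": (toy["furniture"] == "furnished").astype(int),
    # "-" cannot be parsed as a number, so errors="coerce" turns it into NaN.
    "floor": pd.to_numeric(toy["floor"], errors="coerce"),
})
print(encoded)
```

Note that integer codes impose an arbitrary order on the cities. Tree-based models can usually split around that, but a one-hot encoding would be the safer choice for a linear model.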

In [None]:
# Run this only once, please
X["city"] = X["city"].apply(lambda x: 1 if x == "São Paulo" 
                            else 2 if x == "Porto Alegre" 
                            else 3 if x == "Rio de Janeiro"
                            else 4 if x =="Campinas" else 5)
X["animal"] = X["animal"].apply(lambda x: 1 if x == "acept" else 0)  # "acept" (sic) is the dataset's own spelling
X["furniture"] = X["furniture"].apply(lambda x: 1 if x == "furnished" else 0)
# "-" becomes NaN and every other floor becomes a number instead of a string
X["floor"] = pd.to_numeric(X["floor"], errors="coerce")

In [None]:
X.tail()

In [None]:
y.tail()

... and all done! Now the fun part.

# Machine Learning

Here we go. I have decided to use RandomForestRegressor as my model, as the XGBoost models are too much work.
First we split the dataset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
print("There are {} samples in the training set and {} samples in the test set".format(X_train.shape[0], X_test.shape[0]))

This was the place for the grid search. I no longer need it, because I've already got the best parameters from it plus some experimentation of my own, so the cell below is left disabled.

In [None]:
"""
pipeline = Pipeline(steps=[("preprocess", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
                            ("model", RandomForestRegressor(random_state=1))])
grid_params = {
    "model__n_estimators": [140, 160, 180],
    "model__criterion": ["squared_error"],  # this was called "mse" before scikit-learn 1.0
    "model__bootstrap": [False],
    "model__max_depth": list(range(5, 21, 5))
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=grid_params, cv=3, verbose=1)
grid_search.fit(X_train, y_train)
grid_search.best_params_
"""
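
For reference, here is a scaled-down version of the same kind of search that runs in a few seconds. It is only a sketch: it uses synthetic data from `make_regression` rather than the rent dataset, and the tiny parameter grid is chosen for speed, not because these values are good.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data, not the rent dataset.
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)
X_toy[::10, 0] = np.nan  # inject a few missing values for the imputer to fill

pipeline = Pipeline(steps=[
    ("preprocess", SimpleImputer(strategy="most_frequent")),
    ("model", RandomForestRegressor(random_state=1)),
])

# "<step name>__<parameter>" routes each value to the right pipeline step.
param_grid = {
    "model__n_estimators": [10, 20],
    "model__max_depth": [3, 5],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X_toy, y_toy)
print(search.best_params_)
print(-search.best_score_)  # flip the sign to read the score as an RMSE
```

Because the imputer sits inside the pipeline, it is refit on the training part of every fold, so the held-out part never leaks into the imputation.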

# Mean Squared Error Model

In [None]:
final_model = Pipeline(steps=[("preprocess", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
                            ("model", RandomForestRegressor(random_state=1,
                                                            bootstrap=False, 
                                                            criterion="squared_error",  # "mse" before scikit-learn 1.0
                                                            n_estimators=180,
                                                            max_depth=7))])
scores = cross_validate(final_model, X_train, y_train, cv=3, scoring="neg_root_mean_squared_error")
print(-scores["test_score"].mean())

# Mean Absolute Error Model

In [None]:
final_model_mae = Pipeline(steps=[("preprocess", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
                            ("model", RandomForestRegressor(random_state=1,
                                                            bootstrap=False, 
                                                            criterion="absolute_error",  # "mae" before scikit-learn 1.0
                                                            n_estimators=60,
                                                            max_depth=16))])
scores_mae = cross_validate(final_model_mae, X_train, y_train, cv=3, scoring="neg_mean_absolute_error")
print(-scores_mae["test_score"].mean())
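
The two sections above report different metrics (RMSE for the first model, MAE for the second), so their printed numbers are not directly comparable. A small hand-computed sketch, with made-up rent values, shows how a single large miss affects RMSE much more than MAE:

```python
import numpy as np

# Made-up rents and predictions; the last prediction is off by 1000.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3000.0, 5000.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # average size of the miss
rmse = np.sqrt(np.mean(errors ** 2))   # squares the misses before averaging
print(mae, rmse)
```

RMSE is never smaller than MAE on the same predictions, and the gap widens as the errors get more uneven.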

# Mini Model 

This predicts the rent price using only the fire insurance price and the city.

In [None]:
# Same pipeline shape as above, trained on just two features and scored with RMSE
final_model_fire = Pipeline(steps=[("preprocess", SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
                            ("model", RandomForestRegressor(random_state=1,
                                                            bootstrap=False, 
                                                            criterion="squared_error",  # "mse" before scikit-learn 1.0
                                                            n_estimators=200,
                                                            max_depth=15))])
scores_fire = cross_validate(final_model_fire, X_train[["fire insurance (R$)", "city"]], y_train,
                             cv=3, scoring="neg_root_mean_squared_error")
print(-scores_fire["test_score"].mean())  # mean RMSE across the 3 folds
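
All of the scores above come from cross-validation on the training split. The held-out test split is never touched. A minimal sketch of that last step, again on synthetic data and with illustrative hyperparameters rather than the tuned ones, could look like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data standing in for the rent features and target.
X_toy, y_toy = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1, test_size=0.2)

model = Pipeline(steps=[
    ("preprocess", SimpleImputer(strategy="most_frequent")),
    ("model", RandomForestRegressor(random_state=1, n_estimators=50, max_depth=7)),
])
model.fit(X_tr, y_tr)      # fit once on the whole training split
pred = model.predict(X_te)  # predict on rows the model has never seen

test_mae = mean_absolute_error(y_te, pred)
test_rmse = mean_squared_error(y_te, pred) ** 0.5
print(test_mae, test_rmse)
```

With the notebook's own objects, the same pattern would be `final_model.fit(X_train, y_train)` followed by `final_model.predict(X_test)`.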