# Rent analysis

## Problem statement

**Understand the factors affecting rental prices in Tunisia**

## Research questions and hypotheses

* **H1**: `area` has an impact on `price`
* **H2**: `governorate` has an impact on `price`
* Which of `governorate`, `delegation` and `municipality` has the most impact on `price`?
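
One possible way to probe H1 and H2 is with rank-based tests from `scipy.stats`: a Spearman correlation between `area` and `price`, and a Kruskal-Wallis test of `price` across `governorate` groups. The sketch below is only illustrative. The `toy` frame and its values are made up for the example, and the choice of tests is an assumption, not a method taken from this notebook.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data, NOT taken from the real dataset.
toy = pd.DataFrame({
    "area": [60, 80, 100, 120, 150, 200],
    "price": [400, 550, 700, 900, 1200, 1500],
    "governorate": ["Tunis", "Nabeul", "Tunis", "Nabeul", "Tunis", "Nabeul"],
})

# H1: is there a monotonic association between area and price?
rho, p_area = stats.spearmanr(toy["area"], toy["price"])

# H2: does the price distribution differ between governorates?
groups = [g["price"].to_numpy() for _, g in toy.groupby("governorate")]
h_stat, p_gov = stats.kruskal(*groups)

print(f"Spearman rho={rho:.2f} (p={p_area:.3f})")
print(f"Kruskal-Wallis H={h_stat:.2f} (p={p_gov:.3f})")
```

Rank-based tests are used here because listing prices tend to be heavily skewed; a parametric alternative would need the data to be checked first.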

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion

plt.style.use("fivethirtyeight")

warnings.filterwarnings('ignore')

## Loading data

In [2]:
df = pd.read_json("./tunann.jl", encoding="utf-8", lines=True)
df.dropna(how="all", inplace=True)

## Cleaning

In [3]:
# https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)

        try:
            return X[self.columns]
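        except KeyError:
            # `columns` may be a single name or a list of names
            cols = ([self.columns]
                    if isinstance(self.columns, str) else self.columns)
            cols_error = list(set(cols) - set(X.columns))
            raise KeyError(
                "The DataFrame does not include the columns: %s" % cols_error)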

In [4]:
class NumFromStringExtractor(BaseEstimator, TransformerMixin):
    """
    Remove non-numeric characters from each string and convert the result to a number
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
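        # Keep only digit characters; decimal separators and signs are dropped too.
        # Values with no digits become "", which pd.to_numeric turns into NaN.
        extract_num = lambda x: "".join([i for i in str(x) if i.isdecimal()])
        return X.applymap(extract_num).apply(pd.to_numeric)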
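
To see what the digit filter does to typical raw values, the helper below mirrors the `extract_num` lambda as a standalone function. The sample strings are invented for illustration and are not taken from the dataset.

```python
import pandas as pd


def extract_num(value):
    """Keep only the decimal digits of str(value), as the transformer does."""
    return "".join(ch for ch in str(value) if ch.isdecimal())


# Hypothetical raw values, made up for this example.
samples = pd.Series(["130 m²", "1 100 DT", None, "12.5"])
cleaned = pd.to_numeric(samples.map(extract_num))
print(cleaned.tolist())
```

The last sample shows a limitation: the decimal separator is silently removed, so `"12.5"` is read as `125`. This is harmless if the raw fields only hold whole numbers, but worth checking before trusting the result.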

In [7]:
class ColumnSplitterByChar(BaseEstimator, TransformerMixin):
    """
    Split a string column on a character and return the parts as separate columns
    """

    def __init__(self, split_char, out_cols):
        self.split_char = split_char
        self.out_cols = out_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Assuming operations preserve order
        # TODO: check they do
        # TODO: error check for resulting columns and provided column names (pandas already does it)
        return pd.DataFrame(
            X.astype(str).str.split(self.split_char).tolist(),
            columns=self.out_cols)
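
Regarding the TODO about row order: a variant based on `Series.str.split(expand=True)` keeps the original index, so alignment does not rely on positional order. This is a sketch rather than a replacement for the class above; `split_column` is a hypothetical helper name and the sample values are made up.

```python
import pandas as pd


def split_column(series, split_char, out_cols):
    """Split each string on split_char and return one column per part."""
    parts = series.astype(str).str.split(split_char, expand=True)
    # Raises ValueError if the number of parts does not match out_cols.
    parts.columns = out_cols
    return parts


# Hypothetical values with a non-contiguous index, as left behind by dropna().
raw = pd.Series(["Offres>Vente>Duplex", "Offres>Location>Maisons"], index=[0, 3])
out = split_column(raw, ">", ["type", "transaction_type", "estate_type"])
print(out)
```

Inside a `FeatureUnion` the difference is minor, because the sub-pipeline outputs are stacked positionally; it matters more if the split result is later joined back to the original frame by index.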

In [6]:
df.columns

Index(['address', 'area', 'category', 'created', 'description', 'edited',
       'location', 'price'],
      dtype='object')

* Numerical columns
    * area
    * price
* Categorical columns
    * category
        * type
        * transaction_type
        * estate_type
    * location
        * country (ignored)
        * governorate
        * delegation
        * municipality

In [12]:
numerical_pipeline = Pipeline([("select_numerical",
                                ColumnSelector(["area", "price"])),
                               ("extract_numbers", NumFromStringExtractor())])

category_pipeline = Pipeline(
    [("select_category", ColumnSelector("category")),
     ("split_column",
      ColumnSplitterByChar(">", ["type", "transaction_type", "estate_type"]))])

location_pipeline = Pipeline(
    [("select_location", ColumnSelector("location")),
     ("split_column",
      ColumnSplitterByChar(
          ">", ["_", "governorate", "delegation", "municipality"]))])

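# FeatureUnion stacks the three outputs side by side and returns a NumPy
# array, so the column names are lost (columns appear as 0..8 below).
pipeline = FeatureUnion([
    ("numerical", numerical_pipeline),
    ("category", category_pipeline),
    ("location", location_pipeline)
])
pd.DataFrame(pipeline.fit_transform(df))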

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 130 | 165000 | Offres | Vente | Duplex | Tunisie | Nabeul | Kelibia | Kelibia |
| 1 | 500 | 125000 | Offres | Vente | Surfaces | Tunisie | Nabeul | Hammamet | Hammamet |
| 2 |  | 1100 | Offres | Location | Appart. 4 pièces | Tunisie | Tunis | La Marsa | Cite El Khalil |
| 3 | 150 | 230 | Offres | Location vacances | Appart. 4 pièces | Tunisie | Tunis | La Marsa | Berge Du Lac |
| 4 |  | 7000 | Offres | Location | Appart. 4 pièces | Tunisie | Tunis | La Marsa | Gammart |
| 5 | 120 | 220 | Offres | Location vacances | Appart. 4 pièces | Tunisie | Tunis | La Marsa | Berge Du Lac |
| 6 | 1040 | 600000 | Offres | Vente | Maisons | Tunisie | Monastir | Sahline | Sahline |
| 7 | 500 | 3500 | Offres | Location | Maisons | Tunisie | Nabeul | Hammamet | Hammamet |
| 8 | 100 | 1600 | Offres | Location | Appart. 2 pièces | Tunisie | Tunis | La Marsa | Gammart |
| 9 | 14000 | 210000 | Offres | Terrain | Terrain agricole | Tunisie | Nabeul | Hammamet | Hammamet |
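
The output above carries no column names and mixes sales, land and holiday lets with rentals. A possible next step, sketched here under assumptions, is to re-attach names in the order the sub-pipelines were declared and keep only rows whose `transaction_type` is `"Location"`. The three rows are copied from the output above; treating `"Location"` as the only relevant transaction type and naming the ignored column `country` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed column order: numerical, then category, then location sub-pipelines.
columns = ["area", "price", "type", "transaction_type", "estate_type",
           "country", "governorate", "delegation", "municipality"]

# Three rows copied from the output shown above (rows 0, 2 and 7).
rows = [
    [130, 165000, "Offres", "Vente", "Duplex",
     "Tunisie", "Nabeul", "Kelibia", "Kelibia"],
    [np.nan, 1100, "Offres", "Location", "Appart. 4 pièces",
     "Tunisie", "Tunis", "La Marsa", "Cite El Khalil"],
    [500, 3500, "Offres", "Location", "Maisons",
     "Tunisie", "Nabeul", "Hammamet", "Hammamet"],
]
named = pd.DataFrame(rows, columns=columns)

# Keep long-term rentals only and drop the constant country column.
rentals = named[named["transaction_type"] == "Location"].drop(columns="country")
print(rentals[["area", "price", "governorate"]])
```

Filtering before any analysis keeps sale prices, which are orders of magnitude larger, from dominating statistics computed on `price`.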
