<div align="center"><img src="https://www.muycomputerpro.com/wp-content/uploads/2018/02/Machine-learning-in-cyber-security-770x476.jpg"></div>

<h1>Price History<span class="tocSkip"></span></h1>

In this Jupyter notebook I clean up a DataFrame that will be used to build a machine learning model, in order to check whether the price of a given house is in line with the market price.

The model is built on a dataset downloaded from [Kaggle](https://www.kaggle.com/mirbektoktogaraev/madrid-real-estate-market) that contains more than 21,000 houses in Madrid. This will allow me to predict the future price of a property.

# Import libraries

In [1]:
import src.limpieza as lm
import pandas as pd
import numpy as np
import re

# Import DataFrame

## Download data from [Kaggle](https://www.kaggle.com/mirbektoktogaraev/madrid-real-estate-market)

In [2]:
#lm.download_kaggle()

## Open `.csv` file

In [3]:
data = pd.read_csv("data/houses_Madrid.csv")

## Dimension

In [4]:
data.shape

(21742, 58)

## Null values

Let's check how many null values there are in each column, in order to see which columns are useful.

In [5]:
data.isnull().sum()

Unnamed: 0                          0
id                                  0
title                               0
subtitle                            0
sq_mt_built                       126
sq_mt_useful                    13514
n_rooms                             0
n_bathrooms                        16
n_floors                        20305
sq_mt_allotment                 20310
latitude                        21742
longitude                       21742
raw_address                      5465
is_exact_address_hidden             0
street_name                      5905
street_number                   15442
portal                          21742
floor                            2607
is_floor_under                   1170
door                            21742
neighborhood_id                     0
operation                           0
rent_price                          0
rent_price_by_area              21742
is_rent_price_known                 0
buy_price                           0
buy_price_by

In [6]:
data.house_type_id.value_counts()

HouseType 1: Pisos            17705
HouseType 2: Casa o chalet     1938
HouseType 5: Áticos            1032
HouseType 4: Dúplex             676
Name: house_type_id, dtype: int64

## Delete not useful columns

Instead of deleting the columns that are not useful, I am going to select which columns are useful. It's quicker!

In [7]:
estas_si = ["id",
           "title",
           "subtitle",
           "sq_mt_built",
           "sq_mt_useful",
           "n_rooms",
           "n_bathrooms",
           "floor",
           "neighborhood_id",
           "rent_price",
           "buy_price",
           "house_type_id",
           "is_new_development",
           "is_renewal_needed",
           "energy_certificate",
           "has_parking",
           "is_exterior"]

# New DataFrame
Now I have a new DataFrame with fewer columns than the original, but still with some null values.

In [8]:
data_limpio = data[estas_si]
data_limpio.to_csv("data/data_limpio.csv")

In [9]:
casas = pd.read_csv("data/data_limpio.csv")
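
The `Unnamed: 0` column that appears after re-reading the file is what pandas produces when the index is written to the CSV and read back as an ordinary column. A minimal sketch on a toy DataFrame (the `toy` data is invented for illustration and is not part of the dataset) shows the difference that `index=False` makes:

```python
import io
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"id": [10, 20], "buy_price": [100000, 250000]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'id', 'buy_price']
print(cols_without_index)  # ['id', 'buy_price']
```

Passing `index=False` when writing, or `index_col=0` when reading, are two possible ways to avoid carrying the extra column through the rest of the notebook.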

In [10]:
casas.isnull().sum()

Unnamed: 0                0
id                        0
title                     0
subtitle                  0
sq_mt_built             126
sq_mt_useful          13514
n_rooms                   0
n_bathrooms              16
floor                  2607
neighborhood_id           0
rent_price                0
buy_price                 0
house_type_id           391
is_new_development      992
is_renewal_needed         0
energy_certificate        0
has_parking               0
is_exterior            3043
dtype: int64

## Fixing null values
Now I am going to fill in the unknown values of useful $m^2$. There are `13,514` null values in this column, but it is a value that we can 'predict'. To do this I am going to find out what share of the built area the useful area represents on average.

This factor should vary depending on the thickness of the walls and the existence of terraces or partitions. It is up to the technician to define this value, which is usually between 0.80 and 0.90, so a correct value would fall in this range.

`I do this process even though I know that I might delete the useful square metres column at a later stage due to the high correlation with the built square metres column.`

In [11]:
casas["useful/built"] = round(casas["sq_mt_useful"] / casas["sq_mt_built"],2)
porc_ub = casas["useful/built"].mean().round(2)
porc_ub

0.85

The average ratio between these columns in the DataFrame is `0.85` (that is, 85%), so it is within the expected range.

In [12]:
# Assign the result back instead of using inplace=True on a column selection
casas["sq_mt_useful"] = casas["sq_mt_useful"].fillna(round(casas["sq_mt_built"] * porc_ub, 2))

In [13]:
# Same idea in the other direction: built area from useful area
casas["sq_mt_built"] = casas["sq_mt_built"].fillna(round(casas["sq_mt_useful"] / porc_ub, 2))

In [14]:
casas.drop(["useful/built"], axis=1, inplace=True)
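
The ratio-based fill in this section can be condensed into a single helper. The following is only a sketch on invented toy data: `fill_areas` is a hypothetical name, and unlike the cells in this section it does not round the ratio before using it.

```python
import pandas as pd

def fill_areas(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing useful/built areas from the mean useful-to-built ratio."""
    out = df.copy()
    # Rows where either area is missing give NaN and are skipped by mean()
    ratio = (out["sq_mt_useful"] / out["sq_mt_built"]).mean()
    # Useful area estimated from built area
    out["sq_mt_useful"] = out["sq_mt_useful"].fillna((out["sq_mt_built"] * ratio).round(2))
    # Built area estimated from useful area
    out["sq_mt_built"] = out["sq_mt_built"].fillna((out["sq_mt_useful"] / ratio).round(2))
    return out

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "sq_mt_built":  [100.0, 80.0, None, None],
    "sq_mt_useful": [85.0, None, 68.0, None],
})
filled = fill_areas(toy)
print(filled)
```

Rows where both areas are missing, like the last toy row, cannot be recovered this way and stay null.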

## Is exterior?
To try to fill in the values of the `is_exterior` column, I look at `house_type_id`. We cannot know whether a flat or a duplex is interior or exterior, but we can say that a villa and a penthouse are exterior. The cell below applies this rule to `HouseType 2: Casa o chalet` only.

In [15]:
casas.house_type_id.value_counts()

HouseType 1: Pisos            17705
HouseType 2: Casa o chalet     1938
HouseType 5: Áticos            1032
HouseType 4: Dúplex             676
Name: house_type_id, dtype: int64

In [16]:
casas.is_exterior.value_counts()

True     16922
False     1777
Name: is_exterior, dtype: int64

In [17]:
casas["is_exterior"] = np.where((casas["house_type_id"] == "HouseType 2: Casa o chalet"),
                             True,
                             casas["is_exterior"])
casas.is_exterior.value_counts()

True     18860
False     1777
Name: is_exterior, dtype: int64
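
Note that `np.where` with this condition sets every house row to `True`, including any row that was already marked `False`. If the intention is to touch only the missing values, one possible variant is a boolean mask with `.loc`. This is a sketch on invented toy data, not the cell that was actually run:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "house_type_id": [
        "HouseType 2: Casa o chalet",
        "HouseType 1: Pisos",
        "HouseType 2: Casa o chalet",
    ],
    "is_exterior": [None, None, False],
})

# Only rows that are houses AND have no value yet
mask = toy["is_exterior"].isna() & (toy["house_type_id"] == "HouseType 2: Casa o chalet")
toy.loc[mask, "is_exterior"] = True

print(toy)
```

In the toy result the first house is filled with `True`, the flat stays null, and the house that was explicitly `False` keeps its value.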

## Categorical to numeric

### Floor

In [18]:
casas.floor.value_counts()

1                       4440
2                       3546
3                       3001
4                       2323
Bajo                    2144
5                       1310
6                        913
7                        556
8                        326
Entreplanta exterior     236
9                        181
Semi-sótano exterior      55
Semi-sótano interior      36
Entreplanta interior      32
Sótano interior           23
Sótano                     5
Sótano exterior            4
Entreplanta                3
Semi-sótano                1
Name: floor, dtype: int64

In [19]:
valores_nuevos = {
    "Bajo" : 0,
    "Entreplanta" : 0.5,
    "Entreplanta exterior" : 0.5,
    "Entreplanta interior" : 0.5,
    "Semi-sótano" : -0.5,
    "Semi-sótano exterior" : -0.5,
    "Semi-sótano interior" : -0.5,
    "Sótano" : -1,
    "Sótano interior" : -1,
    "Sótano exterior" : -1,    
}

In [20]:
# Replace every text label with its numeric value in a single call
casas["floor"] = casas["floor"].replace(valores_nuevos)

In [21]:
casas.floor.value_counts()

1       4440
2       3546
3       3001
4       2323
0       2144
5       1310
6        913
7        556
8        326
0.5      271
9        181
-0.5      92
-1        32
Name: floor, dtype: int64
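
Because the column was read from a CSV that contained text labels, it may still hold digits as strings (such as `"1"`) next to real numbers (such as `0.5`) after the replacement. A minimal sketch of turning such a mixed column into floats with `pd.to_numeric`, on an invented toy Series and a shortened mapping:

```python
import math
import pandas as pd

# Shortened mapping and toy values, for illustration only
mapping = {"Bajo": 0, "Entreplanta exterior": 0.5, "Sótano": -1}
floor = pd.Series(["1", "Bajo", "Entreplanta exterior", "Sótano", None], dtype=object)

# Swap the labels for numbers, keep everything else as it is
mixed = floor.map(lambda value: mapping.get(value, value))

# Convert strings and numbers alike to float; missing values become NaN
floor_num = pd.to_numeric(mixed)

print(floor_num.tolist())  # [1.0, 0.0, 0.5, -1.0, nan]
```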

### House type

In [22]:
keys_tipo = list(casas.house_type_id.value_counts().keys())
dicc_tipo = {}
for elem in keys_tipo:
    dicc_tipo[elem] = 0
    
dicc_tipo
    

{'HouseType 1: Pisos': 0,
 'HouseType 2: Casa o chalet': 0,
 'HouseType 5: Áticos': 0,
 'HouseType 4: Dúplex': 0}

In [23]:
valores_tipo = {
    'HouseType 1: Pisos': 0,
    'HouseType 2: Casa o chalet': 3,
    'HouseType 5: Áticos': 2,
    'HouseType 4: Dúplex': 1
}

In [24]:
casas["tipo"] = casas.house_type_id.map(valores_tipo)


### Neighborhood ID

In [25]:
barrios = list(casas.neighborhood_id.value_counts().keys()) 

In [26]:
barrios[0]

'Neighborhood 23: Malasaña-Universidad (5196.25 €/m2) - District 4: Centro'

In [27]:
value = re.findall(r"\d+.\d+", barrios[0])
float(value[0])

5196.25

In [28]:
dicc_pm2 = {}
for elem in barrios:
    value = re.findall(r"\d+.\d+", elem)
    try:
        dicc_pm2[elem] = (float(value[0])/1000)
    except (IndexError, ValueError):
        # No usable price was found in the label
        dicc_pm2[elem] = 0

In [29]:
casas["barrio_pm2"] = casas.neighborhood_id.map(dicc_pm2)
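
One caveat with the pattern `\d+.\d+`: the dot is not escaped, so it matches any character, and a three-digit neighborhood number such as `102` is matched before the price is reached. The sketch below shows the effect and one possible stricter pattern that anchors on the parentheses and the `€/m2` suffix. `PRICE_RE` and `price_per_m2` are hypothetical names, and the pattern assumes that every label follows the layout of the example shown earlier.

```python
import re

# Assumed layout: "... (5196.25 €/m2) ..."
PRICE_RE = re.compile(r"\((\d+(?:\.\d+)?)\s*€/m2\)")

def price_per_m2(label: str) -> float:
    """Return the €/m2 figure in thousands, or 0.0 when no price is found."""
    match = PRICE_RE.search(label)
    return float(match.group(1)) / 1000 if match else 0.0

# Beginning of a label with a three-digit neighborhood number
label = "Neighborhood 102: Recoletos (8392.43 €/m2)"

print(re.findall(r"\d+.\d+", label)[0])  # '102' -> would give 0.102
print(price_per_m2(label))
```

With the stricter pattern the Recoletos label yields roughly 8.39 instead of 0.102.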

In [30]:
value_bd = re.findall(r"(?<=\:.)(\w+.\w+)", barrios[0])
value_bd

['Malasaña-Universidad', 'Centro']

In [31]:
barrio = {}
distrito = {}
for elem in barrios:
    value_bd = re.findall(r"(?<=\:.)(\w+.\w+)", elem)
    barrio[elem] = value_bd[0]
    distrito[elem] = value_bd[1]

In [32]:
casas["barrio"] = casas.neighborhood_id.map(barrio)
casas["distr"] = casas.neighborhood_id.map(distrito)
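
The pattern `(\w+.\w+)` captures at most two words joined by a single character, so longer names are cut short (a name such as "Puente de ..." ends up as "Puente de"). A possible alternative is to match the whole label at once with named groups. This is a sketch that assumes all labels follow the layout `Neighborhood N: name (price €/m2) - District N: name`; `LABEL_RE` and `split_label` are hypothetical names and the second label is invented.

```python
import re

LABEL_RE = re.compile(
    r"^Neighborhood \d+: (?P<barrio>.+?) \(.*?\) - District \d+: (?P<distrito>.+)$"
)

def split_label(label: str):
    """Return (neighborhood, district), or (None, None) if the layout differs."""
    match = LABEL_RE.match(label)
    if match is None:
        return None, None
    return match.group("barrio"), match.group("distrito").strip()

real = "Neighborhood 23: Malasaña-Universidad (5196.25 €/m2) - District 4: Centro"
# Invented label with multi-word names, for illustration only
invented = "Neighborhood 1: Barrio de Ejemplo (1234.5 €/m2) - District 2: Distrito de Ejemplo"

print(split_label(real))      # ('Malasaña-Universidad', 'Centro')
print(split_label(invented))  # ('Barrio de Ejemplo', 'Distrito de Ejemplo')
```

Returning `(None, None)` for labels that do not fit keeps the mapping step from raising an `IndexError` on unexpected input.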

### Energy Certificate
The `energy_certificate` column is turned into an ordinal score: the letters go from `G` (score 1) up to `A` (score 7), and the values that carry no rating (`en trámite`, `no indicado` and `inmueble exento`) are set to 0.

In [33]:
keys = list(casas.energy_certificate.value_counts().keys())
dicc = {}
for elem in keys:
    dicc[elem] = 0
    
dicc
    

{'en trámite': 0,
 'no indicado': 0,
 'E': 0,
 'D': 0,
 'G': 0,
 'F': 0,
 'A': 0,
 'C': 0,
 'B': 0,
 'inmueble exento': 0}

In [34]:
valores_cert = {
    'en trámite': 0,
     'no indicado': 0,
     'E': 3,
     'D': 4,
     'G': 1,
     'F': 2,
     'A': 7,
     'C': 5,
     'B': 6,
     'inmueble exento': 0}

In [35]:
casas["e_certificate"] = casas.energy_certificate.map(valores_cert)
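
The same ordinal mapping can also be generated from the ordered list of letters instead of being typed by hand, which makes the ordering explicit. A small sketch, where `valores_cert_sketch` is a hypothetical name that reproduces the values of `valores_cert`:

```python
# Letters ordered from the lowest score to the highest
ordered_letters = "GFEDCBA"

# G -> 1, F -> 2, ..., A -> 7
valores_cert_sketch = {
    letter: score for score, letter in enumerate(ordered_letters, start=1)
}

# Labels without a rating get 0, as in the mapping used in this section
for label in ("en trámite", "no indicado", "inmueble exento"):
    valores_cert_sketch[label] = 0

print(valores_cert_sketch)
```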


## Delete rows with null values
Once we have fixed as much of the data as we can, we delete the rows that still contain null values in any of their columns.

In [36]:
casas.shape

(21742, 23)

In [37]:
casas.isnull().sum()

Unnamed: 0               0
id                       0
title                    0
subtitle                 0
sq_mt_built            108
sq_mt_useful           108
n_rooms                  0
n_bathrooms             16
floor                 2607
neighborhood_id          0
rent_price               0
buy_price                0
house_type_id          391
is_new_development     992
is_renewal_needed        0
energy_certificate       0
has_parking              0
is_exterior           1105
tipo                   391
barrio_pm2               0
barrio                   0
distr                    0
e_certificate            0
dtype: int64

In [38]:
casas_limpio = casas.dropna(axis=0, how="any")
casas_limpio.isnull().sum()

Unnamed: 0            0
id                    0
title                 0
subtitle              0
sq_mt_built           0
sq_mt_useful          0
n_rooms               0
n_bathrooms           0
floor                 0
neighborhood_id       0
rent_price            0
buy_price             0
house_type_id         0
is_new_development    0
is_renewal_needed     0
energy_certificate    0
has_parking           0
is_exterior           0
tipo                  0
barrio_pm2            0
barrio                0
distr                 0
e_certificate         0
dtype: int64

In [39]:
casas_limpio.shape

(17526, 23)
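
Dropping with `how="any"` removes a row as soon as a single column is null. When only some columns matter for the model, `dropna` also accepts a `subset` argument. A sketch on invented toy data that compares both options:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [None, None, 6.0],
    "c": [7, 8, 9],
})

# A row is dropped if ANY column is null
strict = toy.dropna(axis=0, how="any")

# A row is dropped only if column "a" is null
only_a = toy.dropna(subset=["a"])

print(len(toy), len(strict), len(only_a))  # 3 1 2
```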

In [42]:
casas_limpio.sample(5)

| | Unnamed: 0 | id | title | subtitle | sq_mt_built | sq_mt_useful | n_rooms | n_bathrooms | floor | neighborhood_id | ... | is_new_development | is_renewal_needed | energy_certificate | has_parking | is_exterior | tipo | barrio_pm2 | barrio | distr | e_certificate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7724 | 7724 | 14018 | Piso en venta en calle de Martín de los Heros, 83 | Argüelles, Madrid | 89.0 | 75.65 | 2 | 2.0 | 2 | Neighborhood 73: Argüelles (4807.69 €/m2) - Di... | ... | False | False | C | False | False | 0.0 | 4.80769 | Argüelles | Moncloa | 5 |
| 19102 | 19102 | 2640 | Piso en venta en calle de Alcalá, 68 | Recoletos, Madrid | 114.0 | 100.0 | 4 | 1.0 | 4 | Neighborhood 102: Recoletos (8392.43 €/m2) - D... | ... | False | False | en trámite | False | False | 0.0 | 0.102 | Recoletos | Salamanca | 0 |
| 5213 | 5213 | 16529 | Piso en venta en calle del Puerto de Arlabán, 45 | San Diego, Madrid | 40.0 | 34.0 | 1 | 1.0 | 0 | Neighborhood 89: San Diego (2007.79 €/m2) - Di... | ... | False | False | en trámite | False | False | 0.0 | 2.00779 | San Diego | Puente de | 0 |
| 9346 | 9346 | 12396 | Piso en venta en calle de la Costa Brava | Mirasierra, Madrid | 109.0 | 85.0 | 2 | 2.0 | 2 | Neighborhood 56: Mirasierra (3695.5 €/m2) - Di... | ... | False | False | en trámite | True | True | 0.0 | 3.6955 | Mirasierra | Fuencarral | 0 |
| 16703 | 16703 | 5039 | Piso en venta en calle de Zaida | San Isidro, Madrid | 53.0 | 45.05 | 1 | 1.0 | 3 | Neighborhood 19: San Isidro (2323.93 €/m2) - D... | ... | False | False | en trámite | False | True | 0.0 | 2.32393 | San Isidro | Carabanchel | 0 |


# Export DataFrame

In [41]:
casas_limpio.to_csv("data/casas_limpio.csv")