<div align="center"><img src="https://www.muycomputerpro.com/wp-content/uploads/2018/02/Machine-learning-in-cyber-security-770x476.jpg"></div>

<h1>Price History<span class="tocSkip"></span></h1>

In this notebook I clean up a DataFrame that will be used to build a machine learning model, in order to check whether the price of a given house is in line with the market price.

The model is built on a dataset downloaded from [Kaggle](https://www.kaggle.com/mirbektoktogaraev/madrid-real-estate-market) that lists more than 21,000 houses in Madrid. This will allow me to predict the future price of a property.

# Import libraries

In [1]:
import src.limpieza as lm
import pandas as pd
import numpy as np
import re

# Import DataFrame

## Download data from [Kaggle](https://www.kaggle.com/mirbektoktogaraev/madrid-real-estate-market)

In [2]:
#lm.download_kaggle()

## Open `.csv` file

In [3]:
data = pd.read_csv("data/houses_Madrid.csv")

In [4]:
data.head()

,Unnamed: 0,id,title,subtitle,sq_mt_built,sq_mt_useful,n_rooms,n_bathrooms,n_floors,sq_mt_allotment,...,energy_certificate,has_parking,has_private_parking,has_public_parking,is_parking_included_in_price,parking_price,is_orientation_north,is_orientation_west,is_orientation_south,is_orientation_east
0,0,21742,"Piso en venta en calle de Godella, 64","San Cristóbal, Madrid",64.0,60.0,2,1.0,,,...,D,False,,,,,False,True,False,False
1,1,21741,Piso en venta en calle de la del Manojo de Rosas,"Los Ángeles, Madrid",70.0,,3,1.0,,,...,en trámite,False,,,,,,,,
2,2,21740,"Piso en venta en calle del Talco, 68","San Andrés, Madrid",94.0,54.0,2,2.0,,,...,no indicado,False,,,,,,,,
3,3,21739,Piso en venta en calle Pedro Jiménez,"San Andrés, Madrid",64.0,,2,1.0,,,...,en trámite,False,,,,,False,False,True,False
4,4,21738,Piso en venta en carretera de Villaverde a Val...,"Los Rosales, Madrid",108.0,90.0,2,2.0,,,...,en trámite,True,,,True,0.0,True,True,True,True


## Dimension

In [5]:
data.shape

(21742, 58)

## Null values

Let's check how many null values there are in each column, to see which columns are useful.

In [7]:
data.isnull().sum()

Unnamed: 0                          0
id                                  0
title                               0
subtitle                            0
sq_mt_built                       126
sq_mt_useful                    13514
n_rooms                             0
n_bathrooms                        16
n_floors                        20305
sq_mt_allotment                 20310
latitude                        21742
longitude                       21742
raw_address                      5465
is_exact_address_hidden             0
street_name                      5905
street_number                   15442
portal                          21742
floor                            2607
is_floor_under                   1170
door                            21742
neighborhood_id                     0
operation                           0
rent_price                          0
rent_price_by_area              21742
is_rent_price_known                 0
buy_price                           0
buy_price_by
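
The null counts above are easier to compare as a share of the rows. The following is a minimal sketch of that idea on a toy frame; the values and the `max_null_share` cut-off are invented for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
demo = pd.DataFrame({
    "sq_mt_built": [64.0, 70.0, np.nan, 64.0],
    "latitude": [np.nan, np.nan, np.nan, np.nan],
    "n_rooms": [2, 3, 2, 2],
})

# isnull().mean() gives the fraction of missing values per column.
null_share = demo.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: keep columns with at most half of their values missing.
max_null_share = 0.5
usable_columns = null_share[null_share <= max_null_share].index.tolist()

print(null_share)
print(usable_columns)
```

A column such as `latitude`, which is empty in every row, drops out immediately, while a column with only a few gaps stays as a candidate.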

In [8]:
data.house_type_id.value_counts()

HouseType 1: Pisos            17705
HouseType 2: Casa o chalet     1938
HouseType 5: Áticos            1032
HouseType 4: Dúplex             676
Name: house_type_id, dtype: int64

## Delete not useful columns

Instead of deleting the columns that are not useful, I am going to select the ones that are. It's quicker!

In [9]:
estas_si = ["id",
           "title",
           "subtitle",
           "sq_mt_built",
           "sq_mt_useful",
           "n_rooms",
           "n_bathrooms",
           "floor",
           "neighborhood_id",
           "rent_price",
           "buy_price",
           "house_type_id",
           "is_new_development",
           "is_renewal_needed",
           "energy_certificate",
           "has_parking",
           "is_exterior",           
           ]

# New DataFrame
Now I have a new DataFrame with fewer columns than the original, but still with some null values.

In [10]:
data_limpio = data[estas_si]
data_limpio.to_csv("data/data_limpio.csv")

In [11]:
casas = pd.read_csv("data/data_limpio.csv")

In [12]:
casas.isnull().sum()

Unnamed: 0                0
id                        0
title                     0
subtitle                  0
sq_mt_built             126
sq_mt_useful          13514
n_rooms                   0
n_bathrooms              16
floor                  2607
neighborhood_id           0
rent_price                0
buy_price                 0
house_type_id           391
is_new_development      992
is_renewal_needed         0
energy_certificate        0
has_parking               0
is_exterior            3043
dtype: int64

##  Fixing null values
Now I am going to fill in the unknown values of useful $m^2$: there are `13,514` null values in this column, but it is a value that we can 'predict'. To do this, I am going to find out what share of the built $m^2$ is, on average, usable area.

This factor varies with the thickness of the walls and the existence of terraces or partitions. It is up to the technician to define this value, which is usually between 0.80 and 0.90, so a correct value should fall in this range.

`I do this process even though I know that I might delete the useful square metres column at a later stage due to the high correlation with the built square metres column.`

In [13]:
casas["useful/built"] = round(casas["sq_mt_useful"] / casas["sq_mt_built"],2)
porc_ub = casas["useful/built"].mean().round(2)
porc_ub

0.85

The average ratio between these columns in the DataFrame is `0.85` (that is, 85%), so it is within the expected range.

In [14]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
casas["sq_mt_useful"] = casas["sq_mt_useful"].fillna(round(casas["sq_mt_built"]*porc_ub,2))

In [15]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
casas["sq_mt_built"] = casas["sq_mt_built"].fillna(round(casas["sq_mt_useful"]/porc_ub,2))

In [16]:
casas.drop(["useful/built"], axis=1, inplace=True)
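
As a worked illustration of the two fills above, here is a minimal sketch on a three-row toy frame. The numbers are invented; only the column names and the idea of a single average ratio come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: one complete row, one missing the useful area, one missing the built area.
demo = pd.DataFrame({
    "sq_mt_built": [100.0, 80.0, np.nan],
    "sq_mt_useful": [85.0, np.nan, 51.0],
})

# Average useful/built ratio; rows with a missing value are skipped by mean().
ratio = round((demo["sq_mt_useful"] / demo["sq_mt_built"]).mean(), 2)

# Estimate the missing useful area from the built area, then the other way round.
demo["sq_mt_useful"] = demo["sq_mt_useful"].fillna((demo["sq_mt_built"] * ratio).round(2))
demo["sq_mt_built"] = demo["sq_mt_built"].fillna((demo["sq_mt_useful"] / ratio).round(2))

print(ratio)
print(demo)
```

With a ratio of 0.85, the 80 m² flat gets an estimated 68 m² of useful area, and the flat with 51 m² of useful area gets an estimated 60 m² built.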

## Is exterior?
To try to fill in the values of the `is_exterior` column, we look at `house_type_id`: we cannot know whether a flat or a duplex is interior or exterior, but we can say that a villa and a penthouse are exterior.

In [17]:
casas.house_type_id.value_counts()

HouseType 1: Pisos            17705
HouseType 2: Casa o chalet     1938
HouseType 5: Áticos            1032
HouseType 4: Dúplex             676
Name: house_type_id, dtype: int64

In [18]:
casas.is_exterior.value_counts()

True     16922
False     1777
Name: is_exterior, dtype: int64
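
The rule described in this section could be applied roughly as follows. This is only a sketch on toy rows: the `always_exterior` list encodes the assumption stated above (villas and penthouses are exterior), and the sample values are invented.

```python
import numpy as np
import pandas as pd

# Toy rows: three with an unknown is_exterior value, one already known.
demo = pd.DataFrame({
    "house_type_id": [
        "HouseType 1: Pisos",
        "HouseType 2: Casa o chalet",
        "HouseType 5: Áticos",
        "HouseType 4: Dúplex",
    ],
    "is_exterior": pd.Series([np.nan, np.nan, np.nan, False], dtype=object),
})

# Assumption from the text: villas and penthouses are always exterior.
always_exterior = ["HouseType 2: Casa o chalet", "HouseType 5: Áticos"]

# Only touch rows where the value is missing and the house type allows the inference.
mask = demo["is_exterior"].isnull() & demo["house_type_id"].isin(always_exterior)
demo.loc[mask, "is_exterior"] = True

print(demo)
```

Flats and duplexes with a missing value are left untouched, so they are still removed later if rows with nulls are dropped.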

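## Delete negative values

Only the rows with a positive `rent_price` are kept; rows where it is zero or negative are dropped.
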
In [19]:
casas = casas[casas["rent_price"]>0]
casas.shape

(19095, 18)

## Categorical to numeric

### Floor

In [20]:
casas.floor.value_counts()

1                       4173
2                       3227
3                       2757
4                       2103
Bajo                    2089
5                       1133
6                        787
7                        491
8                        293
Entreplanta exterior     227
9                        157
Semi-sótano exterior      55
Semi-sótano interior      36
Entreplanta interior      32
Sótano interior           23
Sótano                     5
Sótano exterior            4
Entreplanta                3
Semi-sótano                1
Name: floor, dtype: int64

In [21]:
valores_nuevos = {
    "Bajo" : 0,
    "Entreplanta" : 0.5,
    "Entreplanta exterior" : 0.5,
    "Entreplanta interior" : 0.5,
    "Semi-sótano" : -0.5,
    "Semi-sótano exterior" : -0.5,
    "Semi-sótano interior" : -0.5,
    "Sótano" : -1,
    "Sótano interior" : -1,
    "Sótano exterior" : -1,    
}

In [22]:
# replace() accepts the whole mapping at once, so no loop is needed
casas["floor"] = casas["floor"].replace(valores_nuevos)

In [23]:
casas.floor.value_counts()

1       4173
2       3227
3       2757
4       2103
0       2089
5       1133
6        787
7        491
8        293
0.5      262
9        157
-0.5      92
-1        32
Name: floor, dtype: int64
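
After the replacement, the column may still mix text such as `"1"` with numbers such as `0.5`, which a model cannot use directly. The sketch below shows one way to get a purely numeric column, assuming the numbered floors are stored as strings; `floor_labels` is a shortened, illustrative version of the mapping above.

```python
import pandas as pd

# Shortened, illustrative version of the label mapping.
floor_labels = {"Bajo": 0, "Entreplanta exterior": 0.5, "Semi-sótano": -0.5, "Sótano": -1}

# Toy column: numbered floors as text, named floors, and one missing value.
raw = pd.Series(["1", "Bajo", "Entreplanta exterior", "Sótano", None, "9"])

# Swap known labels for numbers and leave everything else as it is.
mapped = raw.map(lambda value: floor_labels.get(value, value))

# Convert the remaining text to numbers; anything unparseable becomes NaN.
floor_num = pd.to_numeric(mapped, errors="coerce")

print(floor_num)
```

The result is a `float64` column, so half floors such as 0.5 and missing values can coexist.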

### House type

In [24]:
keys_tipo = list(casas.house_type_id.value_counts().keys())
dicc_tipo = {}
for elem in keys_tipo:
    dicc_tipo[elem] = 0
    
dicc_tipo
    

{'HouseType 1: Pisos': 0,
 'HouseType 2: Casa o chalet': 0,
 'HouseType 5: Áticos': 0,
 'HouseType 4: Dúplex': 0}

In [25]:
valores_tipo = {
    'HouseType 1: Pisos': 0,
    'HouseType 2: Casa o chalet': 3,
    'HouseType 5: Áticos': 2,
    'HouseType 4: Dúplex': 1
}

In [26]:
casas["tipo"] = casas.house_type_id.map(valores_tipo)


### Neighborhood ID

In [27]:
barrios = list(casas.neighborhood_id.value_counts().keys()) 

In [28]:
barrios[0]

'Neighborhood 23: Malasaña-Universidad (5196.25 €/m2) - District 4: Centro'

In [29]:
# Escape the decimal point so it only matches a literal "."; the fraction is optional
value = re.findall(r"(?<=\()\d+(?:\.\d+)?", barrios[0])
float(value[0])

5196.25

In [30]:
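dicc_pm2 = {}
for elem in barrios:
    # Price per m² inside the parentheses, e.g. "(5196.25 €/m2)"
    value = re.findall(r"(?<=\()\d+(?:\.\d+)?", elem)
    try:
        # Stored in thousands of €/m²
        dicc_pm2[elem] = float(value[0]) / 1000
    except IndexError:
        # No price found in the label
        dicc_pm2[elem] = 0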

In [31]:
dicc_pm2

{'Neighborhood 23: Malasaña-Universidad (5196.25 €/m2) - District 4: Centro': 5.19625,
 'Neighborhood 22: Lavapiés-Embajadores (4448.3 €/m2) - District 4: Centro': 4.448300000000001,
 'Neighborhood 30: Prosperidad (4255.84 €/m2) - District 5: Chamartín': 4.25584,
 'Neighborhood 129: Ensanche de Vallecas - La Gavia (2677.28 €/m2) - District 20: Villa de Vallecas': 2.67728,
 'Neighborhood 39: Pueblo Nuevo (2578.87 €/m2) - District 7: Ciudad Lineal': 2.5788699999999998,
 'Neighborhood 113: Cuatro Caminos (4247.49 €/m2) - District 17: Tetuán': 4.24749,
 'Neighborhood 35: Trafalgar (5640.18 €/m2) - District 6: Chamberí': 5.64018,
 'Neighborhood 89: San Diego (2007.79 €/m2) - District 13: Puente de Vallecas': 2.00779,
 'Neighborhood 72: Aravaca (3600.4 €/m2) - District 11: Moncloa': 3.6004,
 'Neighborhood 31: Bernabéu-Hispanoamérica (5170.22 €/m2) - District 5: Chamartín': 5.1702200000000005,
 'Neighborhood 73: Argüelles (4807.69 €/m2) - District 11: Moncloa': 4.80769,
 'Neighborhood 53: Peñ

In [32]:
casas["barrio_pm2"] = casas.neighborhood_id.map(dicc_pm2)

In [33]:
value_bd = re.findall(r"(?<=\:.)(\w+.\w+)", barrios[0])
value_bd

['Malasaña-Universidad', 'Centro']

In [34]:
barrio = {}
distrito = {}
for elem in barrios:
    value_bd = re.findall(r"(?<=\:.)(\w+.\w+)", elem)
    barrio[elem] = value_bd[0]
    distrito[elem] = value_bd[1]

In [35]:
casas["barrio"] = casas.neighborhood_id.map(barrio)
casas["distr"] = casas.neighborhood_id.map(distrito)

In [36]:
len(casas.barrio.unique())

124
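
The pattern `\w+.\w+` keeps at most two words, so a longer name such as "Ensanche de Vallecas - La Gavia" would be cut short. As an alternative sketch, a single regular expression with named groups can pull out the neighborhood, the price and the district in one pass. It assumes every label follows the format of the examples printed above; `parse_neighborhood` is a hypothetical helper, not part of the notebook or of `src.limpieza`.

```python
import re

# Assumed label format: "Neighborhood N: <barrio> (<price> €/m2) - District M: <distrito>"
PATTERN = re.compile(
    r"Neighborhood \d+: (?P<barrio>.+?) "
    r"\((?P<price>\d+(?:\.\d+)?) €/m2\) - "
    r"District \d+: (?P<distrito>.+)"
)


def parse_neighborhood(label):
    """Return barrio, price per m² and district, or None if the label does not fit."""
    match = PATTERN.fullmatch(label)
    if match is None:
        return None
    return {
        "barrio": match.group("barrio"),
        "price_per_m2": float(match.group("price")),
        "distrito": match.group("distrito"),
    }


example = (
    "Neighborhood 129: Ensanche de Vallecas - La Gavia (2677.28 €/m2)"
    " - District 20: Villa de Vallecas"
)
parsed = parse_neighborhood(example)
print(parsed)
```

Labels that do not fit the assumed format return `None`, so they can be inspected separately instead of silently getting a price of 0. Switching to this parser would change the `barrio` and `distr` values shown in the outputs of this notebook.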

### Energy Certificate
The energy certificate is an ordered category, so each letter is mapped to a number from 1 (`G`) to 7 (`A`). Listings without a usable rating (`en trámite`, `no indicado`, `inmueble exento`) are mapped to 0.

In [37]:
keys = list(casas.energy_certificate.value_counts().keys())
dicc = {}
for elem in keys:
    dicc[elem] = 0
    
dicc
    

{'en trámite': 0,
 'no indicado': 0,
 'E': 0,
 'D': 0,
 'G': 0,
 'F': 0,
 'A': 0,
 'C': 0,
 'B': 0,
 'inmueble exento': 0}

In [38]:
valores_cert = {
    # Rated certificates, from best (A) to worst (G)
    'A': 7,
    'B': 6,
    'C': 5,
    'D': 4,
    'E': 3,
    'F': 2,
    'G': 1,
    # No usable rating
    'en trámite': 0,
    'no indicado': 0,
    'inmueble exento': 0,
}

In [39]:
casas["e_certificate"] = casas.energy_certificate.map(valores_cert)


In [40]:
casas.head()

,Unnamed: 0,id,title,subtitle,sq_mt_built,sq_mt_useful,n_rooms,n_bathrooms,floor,neighborhood_id,...,is_new_development,is_renewal_needed,energy_certificate,has_parking,is_exterior,tipo,barrio_pm2,barrio,distr,e_certificate
0,0,21742,"Piso en venta en calle de Godella, 64","San Cristóbal, Madrid",64.0,60.0,2,1.0,3,Neighborhood 135: San Cristóbal (1308.89 €/m2)...,...,False,False,D,False,True,0.0,1.30889,San Cristóbal,Villaverde,4
1,1,21741,Piso en venta en calle de la del Manojo de Rosas,"Los Ángeles, Madrid",70.0,59.5,3,1.0,4,Neighborhood 132: Los Ángeles (1796.68 €/m2) -...,...,False,True,en trámite,False,True,0.0,1.79668,Los Ángeles,Villaverde,0
2,2,21740,"Piso en venta en calle del Talco, 68","San Andrés, Madrid",94.0,54.0,2,2.0,1,Neighborhood 134: San Andrés (1617.18 €/m2) - ...,...,False,False,no indicado,False,True,0.0,1.61718,San Andrés,Villaverde,0
3,3,21739,Piso en venta en calle Pedro Jiménez,"San Andrés, Madrid",64.0,54.4,2,1.0,0,Neighborhood 134: San Andrés (1617.18 €/m2) - ...,...,False,False,en trámite,False,True,0.0,1.61718,San Andrés,Villaverde,0
4,4,21738,Piso en venta en carretera de Villaverde a Val...,"Los Rosales, Madrid",108.0,90.0,2,2.0,4,Neighborhood 133: Los Rosales (1827.79 €/m2) -...,...,False,False,en trámite,True,True,0.0,1.82779,Los Rosales,Villaverde,0


## Delete rows with null values
Once we have tried to fix the data and we cannot do anything else, we delete those rows that contain null values in any of their columns.

In [41]:
casas.drop(["Unnamed: 0"], axis=1, inplace=True)

In [42]:
casas.shape

(19095, 22)

In [43]:
casas.isnull().sum()

id                       0
title                    0
subtitle                 0
sq_mt_built              0
sq_mt_useful             0
n_rooms                  0
n_bathrooms             14
floor                 1499
neighborhood_id          0
rent_price               0
buy_price                0
house_type_id          388
is_new_development     919
is_renewal_needed        0
energy_certificate       0
has_parking              0
is_exterior           1947
tipo                   388
barrio_pm2               0
barrio                   0
distr                    0
e_certificate            0
dtype: int64

In [44]:
casas_limpio = casas.dropna(axis=0, how="any")
casas_limpio.isnull().sum()

id                    0
title                 0
subtitle              0
sq_mt_built           0
sq_mt_useful          0
n_rooms               0
n_bathrooms           0
floor                 0
neighborhood_id       0
rent_price            0
buy_price             0
house_type_id         0
is_new_development    0
is_renewal_needed     0
energy_certificate    0
has_parking           0
is_exterior           0
tipo                  0
barrio_pm2            0
barrio                0
distr                 0
e_certificate         0
dtype: int64

In [45]:
casas_limpio.shape

(16035, 22)

In [46]:
#casas_limpio.buy_price = casas_limpio.buy_price/100_000


In [47]:
#casas_limpio.buy_price

In [48]:
casas = casas[casas["rent_price"]>0]
casas.shape

(19095, 22)

In [49]:
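# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
casas[casas["sq_mt_built"]>160].groupby("barrio").median(numeric_only=True).tail(50)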

barrio,id,sq_mt_built,sq_mt_useful,n_rooms,n_bathrooms,rent_price,buy_price,is_renewal_needed,has_parking,tipo,barrio_pm2,e_certificate
Media Legua,15785.0,183.5,152.95,4.0,2.5,1943.0,620000.0,0.0,1.0,0.0,2.80311,0.0
Mirasierra,13023.0,242.0,205.7,5.0,4.0,2255.0,930000.0,0.0,1.0,1.0,3.6955,0.0
Montecarmelo,12522.0,192.5,167.5,4.0,3.0,2085.0,807500.0,0.0,1.0,2.0,4.61095,6.0
Moscardó,19453.0,196.0,186.0,3.0,2.0,1250.0,320000.0,0.0,1.0,2.0,2.28448,0.0
Niño Jesús,17377.5,207.5,180.1,4.0,3.0,2338.0,890000.0,0.0,1.0,0.0,4.9356,0.0
Nueva España,7011.0,222.0,187.0,4.0,3.0,1931.0,1195000.0,0.0,1.0,0.0,5.36375,0.0
Nuevos Ministerios,10217.0,199.0,170.0,4.0,3.0,2275.0,895000.0,0.0,1.0,0.0,5.0,0.0
Numancia,16585.0,250.0,212.5,11.0,9.0,1667.0,499998.0,0.0,0.0,3.0,2.08194,6.0
Opañel,5540.0,232.5,200.0,4.5,2.5,1316.5,347500.0,0.0,0.5,2.0,2.23532,0.0
Pacífico,17505.5,181.5,162.725,4.0,2.0,1930.0,645000.0,0.0,0.0,0.0,4.10512,0.0


## Boolean to numerical

In [50]:
# Convert all boolean flags to 0/1 in one step
columnas_bool = ["is_new_development", "is_renewal_needed", "has_parking", "is_exterior"]
casas_limpio = casas_limpio.astype({col: int for col in columnas_bool})

In [51]:
casas_limpio.columns

Index(['id', 'title', 'subtitle', 'sq_mt_built', 'sq_mt_useful', 'n_rooms',
       'n_bathrooms', 'floor', 'neighborhood_id', 'rent_price', 'buy_price',
       'house_type_id', 'is_new_development', 'is_renewal_needed',
       'energy_certificate', 'has_parking', 'is_exterior', 'tipo',
       'barrio_pm2', 'barrio', 'distr', 'e_certificate'],
      dtype='object')

In [52]:
columnas = ['sq_mt_built', 'sq_mt_useful', 'n_rooms',
       'n_bathrooms', 'floor', 'is_new_development', 'is_renewal_needed',
        'has_parking', 'is_exterior', 'tipo',
       'barrio_pm2', 'e_certificate', 'rent_price', 'buy_price',]

In [53]:
casas_ml = casas_limpio[columnas]

# Save Test Dataframe
The next step is to hold out a portion of the DataFrame to test the model later; I keep `10%` of it for that.

In [54]:
from sklearn.model_selection import train_test_split

In [55]:
casas_train, casas_test = train_test_split(casas_ml, test_size=0.1, random_state=666)

In [56]:
casas_train.shape

(14431, 14)

In [57]:
casas_test.shape

(1604, 14)
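
The two shapes above can be checked by hand. As far as I understand `train_test_split`, a fractional `test_size` is rounded up to a whole number of rows and the rest goes to training; under that assumption the arithmetic is:

```python
import math

n_rows = 16035      # rows in casas_ml, from the shape printed earlier
test_size = 0.1

# Assumed rounding rule: the test share is rounded up, the remainder is the train set.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```

This gives 14431 training rows and 1604 test rows, which matches the shapes printed by the notebook.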

# Export DataFrames

In [58]:
casas_limpio.to_csv("data/casas_limpio.csv")

In [59]:
casas_train.to_csv("data/Machine_learning/casas_train.csv")

In [60]:
casas_test.to_csv("data/Machine_learning/Test/casas_test.csv")
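
`to_csv` does not create missing folders, and by default it also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows one way to handle both. It writes to a temporary folder that stands in for the notebook's `data` directory, and the toy frame is invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for casas_test.
demo = pd.DataFrame({"rent_price": [900, 1200], "buy_price": [200000, 310000]})

# Temporary folder used here in place of the notebook's "data" directory.
base_dir = Path(tempfile.mkdtemp())
test_dir = base_dir / "Machine_learning" / "Test"

# Create the nested folders if they do not exist yet.
test_dir.mkdir(parents=True, exist_ok=True)

# index=False keeps the row index out of the file.
out_path = test_dir / "casas_test.csv"
demo.to_csv(out_path, index=False)

restored = pd.read_csv(out_path)
print(list(restored.columns))
```

Reading the file back yields only the original columns, so no extra index column has to be dropped afterwards.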

In [67]:
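# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
casas_limpio.groupby("distr").mean(numeric_only=True)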

distr,id,sq_mt_built,sq_mt_useful,n_rooms,n_bathrooms,rent_price,buy_price,is_new_development,is_renewal_needed,has_parking,is_exterior,tipo,barrio_pm2,e_certificate
Arganzuela,1075.920598,95.486766,80.582451,2.56847,1.585731,1370.684695,380846.334868,0.03107,0.196778,0.275029,0.782509,0.150748,4.084129,1.23015
Barajas,71.052632,119.631579,101.313158,2.736842,2.052632,1482.789474,425494.736842,0.0,0.0,0.684211,1.0,0.315789,3.178492,1.052632
Carabanchel,5212.688044,82.543151,70.596595,2.569279,1.311164,862.646873,191704.14331,0.055424,0.182106,0.21536,0.940618,0.104513,2.181776,0.943785
Centro,3507.364656,107.468443,91.716655,2.485975,1.793829,1663.228612,555679.665498,0.018934,0.164797,0.06101,0.803647,0.097475,5.0791,1.418654
Chamartín,6783.291444,128.755793,109.664929,2.868093,1.97148,1747.701426,635256.923351,0.043672,0.298574,0.336898,0.909091,0.110517,4.957931,1.026738
Chamberí,9825.647701,126.217459,107.638932,3.050663,1.951676,1824.212003,648091.961808,0.051442,0.275136,0.201871,0.755261,0.111458,5.251565,1.375682
Ciudad Lineal,8260.802926,103.407917,88.064329,2.722892,1.6179,1230.57401,366264.578313,0.050775,0.179862,0.324441,0.960413,0.133391,3.151249,1.054217
Fuencarral,12705.21608,124.724874,105.086369,3.003769,1.969849,1549.032663,473798.430905,0.066583,0.14196,0.605528,0.98995,0.14196,3.522947,1.002513
Hortaleza,11474.778791,119.230331,99.008951,2.736602,1.851767,1561.473204,471029.765108,0.196123,0.087799,0.59179,0.989738,0.18358,3.644477,1.327252
Latina,14613.957252,81.357252,70.065115,2.587786,1.276336,907.276336,206344.754198,0.021374,0.2,0.184733,0.969466,0.073282,2.317072,1.105344


In [68]:
list(casas_limpio.columns)

['id',
 'title',
 'subtitle',
 'sq_mt_built',
 'sq_mt_useful',
 'n_rooms',
 'n_bathrooms',
 'floor',
 'neighborhood_id',
 'rent_price',
 'buy_price',
 'house_type_id',
 'is_new_development',
 'is_renewal_needed',
 'energy_certificate',
 'has_parking',
 'is_exterior',
 'tipo',
 'barrio_pm2',
 'barrio',
 'distr',
 'e_certificate']