<div align="center"><img src="https://www.muycomputerpro.com/wp-content/uploads/2018/02/Machine-learning-in-cyber-security-770x476.jpg"></div>

<h1>Price History<span class="tocSkip"></span></h1>

In this Jupyter notebook I am going to build a machine learning model that lets us check whether the price of a given house is in line with the market price.

The model is built on a dataset downloaded from [Kaggle](https://www.kaggle.com/mirbektoktogaraev/madrid-real-estate-market) that contains more than 21,000 houses in Madrid. This will allow me to predict the price of a property.

# Import libraries

In [3]:
import src.limpieza as lm
import pandas as pd

# Import DataFrame

## Download data from [Kaggle](https://www.kaggle.com/mirbektoktogaraev/madrid-real-estate-market)

In [4]:
lm.download_kaggle()

Kaggle file downloaded.
Kaggle file unzipped.
zip file deleted.
Files moved to data folder.


'DataFrame downloaded correctly as madrid-real-estate-market.'

## Open `.csv` file

In [5]:
data = pd.read_csv("data/houses_Madrid.csv")

## Dimension

In [6]:
data.shape

(21742, 58)

## Null values

Let's check how many null values there are in each column, in order to see which columns are useful.

In [7]:
data.isnull().sum()

Unnamed: 0                          0
id                                  0
title                               0
subtitle                            0
sq_mt_built                       126
sq_mt_useful                    13514
n_rooms                             0
n_bathrooms                        16
n_floors                        20305
sq_mt_allotment                 20310
latitude                        21742
longitude                       21742
raw_address                      5465
is_exact_address_hidden             0
street_name                      5905
street_number                   15442
portal                          21742
floor                            2607
is_floor_under                   1170
door                            21742
neighborhood_id                     0
operation                           0
rent_price                          0
rent_price_by_area              21742
is_rent_price_known                 0
buy_price                           0
buy_price_by
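Raw null counts are easier to compare when expressed as a share of the rows. The following is a minimal sketch on a toy DataFrame, not the notebook's data: the values are made up, and the 0.5 cut-off is a hypothetical threshold chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "sq_mt_built": [80.0, 120.0, np.nan, 65.0],
    "latitude": [np.nan, np.nan, np.nan, np.nan],
    "n_rooms": [2, 3, 1, 2],
})

# isnull() gives booleans, so the column mean is the fraction of missing values.
null_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns that are missing in more than half of the rows.
mostly_empty = null_share[null_share > 0.5].index.tolist()

print(null_share)
print(mostly_empty)
```

Columns that are entirely empty, such as `latitude` in the real output, show up at the top of this ranking.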

In [8]:
data.house_type_id.value_counts()

HouseType 1: Pisos            17705
HouseType 2: Casa o chalet     1938
HouseType 5: Áticos            1032
HouseType 4: Dúplex             676
Name: house_type_id, dtype: int64

## Delete columns that are not useful

In [38]:
borrar = ["Unnamed: 0",
          "latitude",
          "longitude",
          "sq_mt_allotment",
          "is_exact_address_hidden",
          "street_number",
          "portal",
          "door",
          "is_rent_price_known",
          "operation",
          "rent_price_by_area",
          "is_furnished",
          "is_buy_price_known",
          "is_kitchen_equipped",
          "are_pets_allowed",
          "has_private_parking",
          "has_public_parking",
          "is_parking_included_in_price",
          "parking_price"
         ]

Instead of deleting the columns that are not useful, I am going to select the ones that are. It's quicker!

In [10]:
estas_si = ["id",
           "title",
           "subtitle",
           "sq_mt_built",
           "sq_mt_useful",
           "n_rooms",
           "n_bathrooms",
           "floor",
           "is_floor_under",
           "neighborhood_id",
           "rent_price",
           "buy_price",
           "house_type_id",
           "is_new_development",
           "is_renewal_needed",
           "energy_certificate",
           "has_parking",
           "is_exterior"]

# New DataFrame
Now I have a new DataFrame with fewer columns than the original, but it still has some null values.

In [20]:
casas = data[estas_si]
casas.isnull().sum()

id                        0
title                     0
subtitle                  0
sq_mt_built             126
sq_mt_useful          13514
n_rooms                   0
n_bathrooms              16
floor                  2607
is_floor_under         1170
neighborhood_id           0
rent_price                0
buy_price                 0
house_type_id           391
is_new_development      992
is_renewal_needed         0
energy_certificate        0
has_parking               0
is_exterior            3043
dtype: int64
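Selecting the columns to keep and dropping the columns to discard give the same result only when the two lists together cover every column. This sketch shows that on a toy DataFrame; the names `keep` and `drop` are illustrative stand-ins for `estas_si` and `borrar`.

```python
import pandas as pd

# Toy DataFrame with made-up values; only the column names echo the notebook.
toy = pd.DataFrame({
    "id": [1, 2],
    "buy_price": [250000, 410000],
    "portal": [None, None],
    "door": [None, None],
})

keep = ["id", "buy_price"]   # plays the role of estas_si
drop = ["portal", "door"]    # plays the role of borrar

selected = toy[keep]                # keep only the listed columns
dropped = toy.drop(columns=drop)    # remove the listed columns

# True here because keep and drop together name every column of toy.
same_result = selected.equals(dropped)
print(same_result)
```

In this notebook the two lists name 18 and 19 columns out of 58, so they are not interchangeable: dropping `borrar` would leave more columns than selecting `estas_si`.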

##  Fixing null values
Now I am going to fill in the missing values of the useful area in $m^2$: this column has `13,514` null values, but they are values that we can estimate. To do this, I am going to find out what fraction of the built area, on average, is useful area.

This factor varies with the thickness of the walls and the presence of terraces or partitions. It is up to the technician to define this value, which is usually between 0.80 and 0.90, so a plausible result should fall in this range.

`I do this process even though I know that I might delete the useful square metres column at a later stage due to the high correlation with the built square metres column.`

In [39]:
casas["useful/built"] = round(casas["sq_mt_useful"] / casas["sq_mt_built"],2)
porc_ub = casas["useful/built"].mean().round(2)
porc_ub

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  casas["useful/built"] = round(casas["sq_mt_useful"] / casas["sq_mt_built"],2)


0.85
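The warning in this output is pandas' `SettingWithCopyWarning`: `casas` was produced by selecting columns from `data`, and pandas cannot guarantee that assigning a new column to it acts on an independent object. One common way to avoid it, sketched here on toy data with hypothetical names (`data_toy`, `casas_toy`), is to take an explicit copy when creating the subset.

```python
import pandas as pd

nan = float("nan")

# Toy data: two rows with a known ratio of 0.85 and one row with no useful area.
data_toy = pd.DataFrame({
    "sq_mt_built": [100.0, 80.0, 60.0],
    "sq_mt_useful": [85.0, nan, 51.0],
    "portal": [None, None, None],
})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous.
casas_toy = data_toy[["sq_mt_built", "sq_mt_useful"]].copy()
casas_toy["useful/built"] = (casas_toy["sq_mt_useful"] / casas_toy["sq_mt_built"]).round(2)

# mean() skips NaN by default, so rows without a useful area do not affect the ratio.
ratio = round(casas_toy["useful/built"].mean(), 2)
print(ratio)
```

The original `data_toy` is left untouched, and the new column exists only on the copy.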

The average ratio between these columns in the DataFrame is `0.85` (that is, 85%), so it is within the expected range.

In [26]:
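# Estimate the missing useful area from the built area; assign the result back instead of using a chained inplace=True
casas["sq_mt_useful"] = casas["sq_mt_useful"].fillna((casas["sq_mt_built"] * porc_ub).round(2))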
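One detail of this fill is that it cannot help rows where the built area is also missing, because the estimate itself is then `NaN`. A minimal sketch on toy data (made-up values, with `ratio` standing in for `porc_ub`):

```python
import numpy as np
import pandas as pd

# Row 0: useful area missing, built area known.
# Row 1: both known.
# Row 2: both missing.
toy = pd.DataFrame({
    "sq_mt_built": [100.0, 60.0, np.nan],
    "sq_mt_useful": [np.nan, 50.0, np.nan],
})
ratio = 0.85  # stands in for porc_ub

# The estimate is NaN wherever sq_mt_built is NaN.
estimate = (toy["sq_mt_built"] * ratio).round(2)

# fillna with a Series aligns on the index and only touches missing entries.
toy["sq_mt_useful"] = toy["sq_mt_useful"].fillna(estimate)
print(toy)
```

Row 0 receives the estimate, row 1 keeps its original value, and row 2 stays null.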

## Is exterior?
To try to fill in the values of the `is_exterior` column, we are going to look at `house_type_id`: we cannot know whether a flat or a duplex is interior or exterior, but we can say that a villa and a penthouse are exterior.

In [40]:
casas.house_type_id.value_counts()

HouseType 1: Pisos            17705
HouseType 2: Casa o chalet     1938
HouseType 5: Áticos            1032
HouseType 4: Dúplex             676
Name: house_type_id, dtype: int64

In [41]:
casas.is_exterior.value_counts()

True     16922
False     1777
Name: is_exterior, dtype: int64

In [46]:
def isexterior():
    if casas["house_type_id"] == "HouseType 5: Áticos":
        return True
    elif casas["house_type_id"] == "HouseType 2: Casa o chalet":
        return True

In [48]:
casas["is_exterior"].apply(isexterior)

TypeError: isexterior() takes 0 positional arguments but 1 was given
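The call fails because `Series.apply` passes each value of the column to the function, while `isexterior` accepts no arguments. Even with a parameter, comparing the whole `casas["house_type_id"]` column inside an `if` would raise an error, since the truth value of a Series is ambiguous. A vectorised alternative is sketched below on toy data; the names `toy`, `always_exterior` and `mask` are assumptions for illustration, and this is only one possible way to implement the rule described above.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; house type labels are copied from the notebook output.
toy = pd.DataFrame({
    "house_type_id": pd.Series([
        "HouseType 5: Áticos",
        "HouseType 2: Casa o chalet",
        "HouseType 1: Pisos",
        "HouseType 4: Dúplex",
    ], dtype=object),
    "is_exterior": pd.Series([np.nan, np.nan, np.nan, False], dtype=object),
})

# House types assumed to be exterior by definition.
always_exterior = ["HouseType 5: Áticos", "HouseType 2: Casa o chalet"]

# Only fill rows that are missing AND belong to one of those types.
mask = toy["is_exterior"].isnull() & toy["house_type_id"].isin(always_exterior)
toy.loc[mask, "is_exterior"] = True

print(toy)
```

Flats and duplexes with a missing value are left as null, and known values are never overwritten.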

In [27]:
casas.isnull().sum()

id                       0
title                    0
subtitle                 0
sq_mt_built            126
sq_mt_useful           108
n_rooms                  0
n_bathrooms             16
floor                 2607
is_floor_under        1170
neighborhood_id          0
rent_price               0
buy_price                0
house_type_id          391
is_new_development     992
is_renewal_needed        0
energy_certificate       0
has_parking              0
is_exterior           3043
useful/built           126
dtype: int64

In [32]:
len(casas.neighborhood_id.unique())

126

## Delete rows with null values
Once we have tried to fix the data and cannot do anything else, we delete the rows that contain a null value in any of their columns.

In [35]:
casas_limpio = casas.dropna(axis=0, how="any")
casas_limpio.isnull().sum()

id                    0
title                 0
subtitle              0
sq_mt_built           0
sq_mt_useful          0
n_rooms               0
n_bathrooms           0
floor                 0
is_floor_under        0
neighborhood_id       0
rent_price            0
buy_price             0
house_type_id         0
is_new_development    0
is_renewal_needed     0
energy_certificate    0
has_parking           0
is_exterior           0
useful/built          0
dtype: int64

In [37]:
casas_limpio.shape

(17526, 19)
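Comparing the two shapes reported in this notebook shows how many listings the row deletion costs. The arithmetic, using only the row counts printed above:

```python
rows_before = 21742  # rows in data, from data.shape
rows_after = 17526   # rows in casas_limpio, from casas_limpio.shape

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(rows_dropped, f"{share_dropped:.1%}")
```

So 4,216 rows, roughly 19.4% of the original listings, are removed by `dropna`.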