# Digital House - Data Science a Distancia

## Practical Assignment 2

This notebook prepares the original dataset with the features used by the [Properati property valuation tool](https://www.properati.com.ar/tools/valuador-propiedades).

### Authors: Daniel Borrino, Ivan Mongi, Jessica Polakoff, Julio Tentor

<p style="text-align:right;">May 2022</p>

#### Technical notes
The notebook runs correctly on a standard Anaconda installation, version 4.11.0, build 3.21.6, with Python 3.9.7.

#### Required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import re
import matplotlib.pyplot as plt

In [2]:
data_url = "../Data/properatti.csv"
data = pd.read_csv(data_url, encoding="utf-8")

---
### Building the final dataset

#### Drop null values from the target variable

In [3]:
# Drop the rows whose price is NaN
data = data.dropna(axis=0, how='any', subset=['price_aprox_usd'])
# Work on an independent copy so that later edits do not also modify `data`
data_clean = data.copy()

#### Keep only Capital Federal and the Bs.As. North, South and West zones

In [4]:
# Keep only Capital Federal and the Greater Buenos Aires (G.B.A.) North, South and West zones
iterar_state = ['Capital Federal',
                'Bs.As. G.B.A. Zona Norte',
                'Bs.As. G.B.A. Zona Sur',
                'Bs.As. G.B.A. Zona Oeste']

# Mark every other state as missing so the next line can drop those rows
data_clean['state_name'] = [x if x in iterar_state else np.nan for x in data_clean['state_name']]
data_clean = data_clean.dropna(axis=0, how='any', subset=['state_name']).copy()

#### Keep only apartments, houses and PHs

In [5]:
# Keep the three most frequent property types (apartment, house and PH)
iterar_tipo = data_clean['property_type'].value_counts().head(3)
iterar_tipo = iterar_tipo.index

# Mark every other property type as missing so the next line can drop those rows
data_clean['property_type'] = [x if x in iterar_tipo else np.nan for x in data_clean['property_type']]
data_clean = data_clean.dropna(axis=0, how='any', subset=['property_type']).copy()

In [6]:
data_clean['place_name'].value_counts().head(100)

Tigre                3024
Nordelta             2879
Belgrano             2481
Palermo              2425
Caballito            2022
                     ... 
San Nicolás           157
Villa Bosch           154
Paternal              154
Lomas del Mirador     146
Villa Sarmiento       140
Name: place_name, Length: 100, dtype: int64

#### Keep only places with many observations

In [7]:
# Keep the 100 places with the most listings
iterar_place = data_clean['place_name'].value_counts()[:100]
iterar_place = iterar_place.index

# Mark every other place as missing so the next line can drop those rows
data_clean['place_name'] = [x if x in iterar_place else np.nan for x in data_clean['place_name']]
data_clean = data_clean.dropna(axis=0, how='any', subset=['place_name']).copy()
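
The three selections above follow one pattern: build a list of accepted values, then discard every row outside it. As a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the Properati file), the same filtering can be written as a boolean mask with `Series.isin`, which leaves the column untouched and also excludes missing values:

```python
import pandas as pd

# Toy listings; the values are invented for illustration only
toy = pd.DataFrame({
    "state_name": ["Capital Federal", "Córdoba", "Bs.As. G.B.A. Zona Sur", None],
    "place_name": ["Palermo", "Centro", "Lanús", "Palermo"],
})

# Fixed whitelist, in the spirit of iterar_state
keep_states = ["Capital Federal", "Bs.As. G.B.A. Zona Sur"]
by_state = toy[toy["state_name"].isin(keep_states)].copy()

# Most frequent values, in the spirit of iterar_tipo and iterar_place
top_places = toy["place_name"].value_counts().head(1).index
by_place = toy[toy["place_name"].isin(top_places)].copy()

print(by_state.index.tolist())  # rows whose state is in the whitelist
print(by_place.index.tolist())  # rows whose place is the most frequent one
```

Rows with a missing `state_name` are excluded because `isin` evaluates to `False` for them.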

#### Remove outliers from the variables we are going to correlate:

In [8]:
# Function to remove outliers.
def borrar_outliers(data, columnas):
    """Remove outlier rows. Only accepts columns with numeric values.

    data: dataset to analyze
    columnas: columns in which to remove outliers; must be a tuple
    """
    cols_limpiar = columnas
    mask = np.ones(shape=(data.shape[0]), dtype=bool)

    for i in cols_limpiar:

        # Compute the quartiles and the cut-off values
        Q1 = data[i].quantile(0.25)
        Q3 = data[i].quantile(0.75)
        IQR = Q3 - Q1  # interquartile range
        max_value = Q3 + 1.5 * IQR
        min_value = Q1 - 1.5 * IQR

        # Adjust the lower cut-off:
        #   - it cannot be negative
        #   - it cannot lie below the boxplot's lower whisker
        #   - by expert judgment, prices are kept from the 5th percentile upward
        #   - in addition, properties under 10 m2 are not considered
        # Note: as written, the 5% quantile and the floor of 10 apply to every column in `columnas`.
        min_value = max(data[i].quantile(0.05), min_value, 10)

        # Filter by max and min
        mask = np.logical_and(mask, np.logical_and(data[i] >= min_value, data[i] <= max_value))
    return data[mask]
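
To make the cut-offs concrete, here is a small worked example on an invented series of covered surfaces (the numbers are assumptions chosen for easy arithmetic). It applies the same rule as `borrar_outliers`: upper fence at Q3 + 1.5·IQR, and lower cut-off at the largest of the 5th percentile, Q1 − 1.5·IQR and 10.

```python
import pandas as pd

# Invented covered surfaces in m2; 8 and 400 are the suspicious values
surface = pd.Series([8, 30, 40, 45, 50, 55, 60, 400], dtype=float)

q1 = surface.quantile(0.25)   # 37.5
q3 = surface.quantile(0.75)   # 56.25
iqr = q3 - q1                 # 18.75

upper = q3 + 1.5 * iqr        # 84.375
# Lower cut-off: the largest of the 5th percentile, the lower fence and 10
lower = max(surface.quantile(0.05), q1 - 1.5 * iqr, 10)

kept = surface[(surface >= lower) & (surface <= upper)]
print(kept.tolist())
```

Here the 5th percentile (about 15.7) is the binding lower cut-off, so the 8 m2 observation is dropped along with the 400 m2 one.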

In [9]:
# Flag column marking the observations that are not outliers in USD price or covered surface
data_clean['tidy1'] = np.nan

for tipo in iterar_tipo:
    for place in iterar_place:
        # Select by place and property type
        mask = np.logical_and(data_clean['place_name']==place, data_clean['property_type']==tipo)
        # Compute the outliers within the group and remove them
        data_ok = borrar_outliers(data_clean[mask], ('price_aprox_usd', 'surface_covered_in_m2'))
        # Mark the surviving observations as valid
        data_clean.loc[data_ok.index, 'tidy1'] = True


In [10]:
# Drop the observations that were not marked as valid
data_clean = data_clean.dropna(axis=0, how='any', subset=['tidy1']).copy()
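
The nested loop over property type and place can also be expressed with `groupby`. The following is a sketch on invented data, not the notebook's method: `within_fences` is a hypothetical helper that reproduces the cut-offs of `borrar_outliers` for a single column, and the `toy` frame is made up.

```python
import pandas as pd

def within_fences(values):
    """Boolean mask with the same cut-offs as borrar_outliers (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower = max(values.quantile(0.05), q1 - 1.5 * iqr, 10)
    upper = q3 + 1.5 * iqr
    return values.between(lower, upper)

# Two invented groups of six listings each
toy = pd.DataFrame({
    "property_type": ["house"] * 12,
    "place_name": ["A"] * 6 + ["B"] * 6,
    "price_aprox_usd": [100, 110, 120, 130, 140, 1000, 200, 210, 220, 230, 240, 250],
    "surface_covered_in_m2": [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 500],
})

columns = ["price_aprox_usd", "surface_covered_in_m2"]
groups = toy.groupby(["property_type", "place_name"])

# A row is valid only if it lies within the cut-offs of its own group for every column
valid = pd.Series(True, index=toy.index)
for column in columns:
    valid &= groups[column].transform(within_fences).astype(bool)

toy_clean = toy[valid].copy()
print(toy_clean.index.tolist())
```

Because the lower cut-off is never below the group's 5th percentile, the smallest value of each group is removed as well as the extreme ones, which is worth keeping in mind for small groups.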

---
#### Creating new variables with predictive value:


##### Analysis of the number of rooms

In [11]:
def regex_to_values(col, reg, not_match=0):
    """Return a series with the result of applying the regular expression to the column.

    Each entry is the first captured group converted to float when the regular
    expression's search() method finds a match, and `not_match` otherwise
    (including missing values).

    col : column to which the regular expression is applied
    reg : compiled regular expression
    """

    def extract(text):
        # pd.isna is used instead of `is np.NaN`, which relies on object identity
        if pd.isna(text):
            return not_match
        found = reg.search(text)
        return float(found.group(1)) if found is not None else not_match

    return col.apply(extract)

In [12]:
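# Look for the number of rooms ("ambientes") in the description
# Note: the pattern only captures numbers whose first digit is 1 or 2 (1, 2, 10 to 29);
# a text such as "3 amb" does not match and falls back to the default value of 1.
_pattern = r'([1-2][0-9]?)(?= amb)'
_express = re.compile(_pattern, flags = re.IGNORECASE)

work = regex_to_values(data_clean['description'], _express, 1)

data_clean['ambientes'] = work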
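
To see what this pattern captures, the sketch below runs it over a few invented descriptions using only the standard `re` module. The sample strings and the helper `count_or_default` are assumptions for illustration.

```python
import re

room_pattern = re.compile(r'([1-2][0-9]?)(?= amb)', flags=re.IGNORECASE)

# Invented descriptions, not taken from the dataset
samples = [
    "Departamento de 2 ambientes",
    "Casa 3 amb con jardín",
    "PH 12 AMB",
    "Monoambiente luminoso",
]

def count_or_default(text, default=1.0):
    """Return the captured number as float, or the default when nothing matches."""
    found = room_pattern.search(text)
    return float(found.group(1)) if found else default

results = [count_or_default(text) for text in samples]
print(results)  # [2.0, 1.0, 12.0, 1.0]
```

The second sample shows that a count starting with a digit other than 1 or 2 is not captured and receives the default.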


In [13]:
# Imputation: where `rooms` is reported, it overrides the count extracted from the description
#data_clean['ambientes_final'] = data_clean['rooms']
#mask = data_clean['ambientes_final'].isnull()
#data_clean.loc[mask, 'ambientes_final'] = data_clean.loc[mask, 'ambientes']

mask = data_clean['rooms'].notnull()
data_clean.loc[mask, 'ambientes'] = data_clean.loc[mask, 'rooms']
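
The imputation rule is "use `rooms` when it is present, otherwise keep the value extracted from the description". A minimal sketch on invented values shows that `Series.fillna` expresses the same rule in one line:

```python
import numpy as np
import pandas as pd

# Invented values: `rooms` as reported, `ambientes` as extracted by regex
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0],
    "ambientes": [1.0, 4.0, 1.0],
})

# Prefer the reported value and fall back to the extracted one
toy["ambientes"] = toy["rooms"].fillna(toy["ambientes"])
print(toy["ambientes"].tolist())  # [3.0, 4.0, 2.0]
```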

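##### Analysis of the number of bathrooms

In [14]:
# Look for the number of bathrooms ("baños") in the description
# Note: as with the rooms, only numbers whose first digit is 1 or 2 are captured.
_pattern = r'([1-2][0-9]?)(?= baño)'
_express = re.compile(_pattern, flags = re.IGNORECASE)

# Use the filtered dataset rather than the full `data` frame
work = regex_to_values(data_clean['description'], _express, 1)

data_clean['baños'] = work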


---
##### We set out to detect amenities

In [15]:
def regex_to_tags(col, reg, match, not_match=np.nan):
    """Return a series of tags obtained by applying the regular expression to the column.

    Each entry is `match` when the regular expression's search() method finds a match,
    and `not_match` when it does not or when the value is missing.

    col : column to which the regular expression is applied
    reg : compiled regular expression
    """

    def tag(text):
        # A missing description must not be tagged as a match
        if pd.isna(text):
            return not_match
        return match if reg.search(text) is not None else not_match

    return col.apply(tag)

In [16]:
# Look for a balcony
_pattern = 'balcon|balcón'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['balcón'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [17]:
# Look for parking (note: 'auto' also matches inside longer words such as 'automático')
_pattern = 'cochera|garage|auto'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['cochera'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [18]:
# Look for a barbecue grill
_pattern = 'parrilla'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['parrilla'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [19]:
# Look for a swimming pool
_pattern = 'piscina|pileta'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['pileta'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [20]:
# Look for furnished properties
_pattern = 'amoblado'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['amoblado'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [21]:
# Look for a laundry room
_pattern = 'lavadero'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['lavadero'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [22]:
# Look for a patio
_pattern = 'patio'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['patio'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [23]:
# Look for a terrace
_pattern = 'terraza'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['terraza'] = regex_to_tags(data_clean['description'], _express, 1, 0)


In [24]:
# Look for a garden (note: this pattern does not match the accented spelling 'jardín')
_pattern = 'jardin'
_express = re.compile(_pattern, flags = re.IGNORECASE)

data_clean['jardin'] = regex_to_tags(data_clean['description'], _express, 1, 0)
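
The amenity cells above repeat one step with a different pattern and column name. As a sketch of a more compact alternative (not the notebook's code), a dictionary of patterns can drive a single loop built on pandas' `Series.str.contains`. The `toy` descriptions are invented, and only three of the patterns are shown.

```python
import re
import pandas as pd

# Column name -> pattern, copied from three of the cells above
amenity_patterns = {
    "balcón": "balcon|balcón",
    "cochera": "cochera|garage|auto",
    "pileta": "piscina|pileta",
}

# Invented descriptions, including a missing one
toy = pd.DataFrame({
    "description": ["Hermoso BALCÓN y pileta", "Cochera cubierta", None],
})

for column, pattern in amenity_patterns.items():
    # na=False makes a missing description count as "amenity not found"
    found = toy["description"].str.contains(pattern, flags=re.IGNORECASE, regex=True, na=False)
    toy[column] = found.astype(int)

print(toy[list(amenity_patterns)].values.tolist())
```

Adding an amenity then means adding one dictionary entry instead of a new cell.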


In [26]:
data_clean.reset_index(inplace=True)

In [27]:
data_final = data_clean.drop(columns=[
    'index', 'Unnamed: 0', 'operation', 'place_with_parent_names',
    'country_name', 'geonames_id', 'lat-lon', 'lat', 'lon', 'price', 'currency',
    'price_aprox_local_currency', 'floor', 'surface_total_in_m2', 'price_usd_per_m2',
    'price_per_m2', 'rooms', 'expenses', 'properati_url', 'description', 'title',
    'image_thumbnail', 'tidy1'])

In [28]:
data_final

|  | property_type | place_name | state_name | price_aprox_usd | surface_covered_in_m2 | ambientes | baños | balcón | cochera | parrilla | pileta | amoblado | lavadero | patio | terraza | jardin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | apartment | Mataderos | Capital Federal | 72000.0 | 55.0 | 2.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | PH | Munro | Bs.As. G.B.A. Zona Norte | 130000.0 | 78.0 | 1.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | apartment | Belgrano | Capital Federal | 138000.0 | 40.0 | 1.0 | 1.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | apartment | Belgrano | Capital Federal | 195000.0 | 60.0 | 1.0 | 1.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | apartment | Palermo Soho | Capital Federal | 111700.0 | 30.0 | 1.0 | 1.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 47032 | apartment | Belgrano | Capital Federal | 128000.0 | 35.0 | 1.0 | 1.0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 47033 | apartment | Recoleta | Capital Federal | 165000.0 | 39.0 | 1.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47034 | house | Beccar | Bs.As. G.B.A. Zona Norte | 498000.0 | 360.0 | 1.0 | 1.0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 47035 | apartment | Villa Urquiza | Capital Federal | 131500.0 | 39.0 | 1.0 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |


In [29]:
data_final.shape

(47037, 16)

---
#### Sanity checks

In [30]:
data_final.isnull().sum()

property_type            0
place_name               0
state_name               0
price_aprox_usd          0
surface_covered_in_m2    0
ambientes                0
baños                    0
balcón                   0
cochera                  0
parrilla                 0
pileta                   0
amoblado                 0
lavadero                 0
patio                    0
terraza                  0
jardin                   0
dtype: int64
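
Beyond counting nulls, a few extra checks could guard the final dataset. The sketch below is only a suggestion: `sanity_report` is a hypothetical helper, the `toy` frame is invented, and the choice of checks (no nulls, amenity flags restricted to 0/1, strictly positive price and surface) is an assumption about what the final table should satisfy.

```python
import pandas as pd

def sanity_report(df, binary_cols, positive_cols):
    """Return a dict of boolean checks (hypothetical helper, not part of the notebook)."""
    return {
        "no_nulls": not df.isnull().any().any(),
        "binary_flags": bool(df[binary_cols].isin([0, 1]).all().all()),
        "positive_values": bool((df[positive_cols] > 0).all().all()),
    }

# Invented rows shaped like the final dataset
toy = pd.DataFrame({
    "price_aprox_usd": [72000.0, 130000.0],
    "surface_covered_in_m2": [55.0, 78.0],
    "balcón": [0, 1],
    "cochera": [0, 0],
})

report = sanity_report(
    toy,
    binary_cols=["balcón", "cochera"],
    positive_cols=["price_aprox_usd", "surface_covered_in_m2"],
)
print(report)
```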

In [31]:
data_final_url = "../Data/properatti_final5.csv"
data_final.to_csv(data_final_url)
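
One detail to be aware of when saving: `to_csv` writes the index as an unnamed first column by default, and reading the file back then yields a column called `Unnamed: 0`. This is probably where the `Unnamed: 0` column dropped earlier came from, although the source does not say so. The sketch below shows the effect with an in-memory buffer and invented data:

```python
import io
import pandas as pd

toy = pd.DataFrame({"price_aprox_usd": [72000.0, 130000.0]})

# Default behaviour: the index is written and comes back as 'Unnamed: 0'
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# With index=False only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'price_aprox_usd']
print(without_index.columns.tolist())  # ['price_aprox_usd']
```

Whether to pass `index=False` depends on what the notebooks that consume `properatti_final5.csv` expect.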