## Challenge 5: Cleaning a dataset

### 1. Objectives:
- Apply everything we learned today to a real dataset

---
    
### 2. Walkthrough:

#### a) Data cleaning in the real world

So far we have been working on exercises with dummy (fake) datasets. Now we are going to apply everything we have learned today to a real dataset.

The dataset is in the [Datasets](../../Datasets/Readme.md) folder at the root of the repository. The file name is `melbourne_housing-raw.csv`.

Read the dataset with pandas and complete the following tasks:

1. Go to this [link](https://www.kaggle.com/anthonypino/melbourne-housing-market) to learn more about the dataset and the data it contains.
2. Explore your dataset to understand its structure
3. Identify the `NaNs` in the dataset and where they are located
4. Remove the `NaNs` from your dataset
5. Reset your index so that it matches the new dataset
6. Rename the columns so that they are consistent and free of spelling errors
7. Compute aggregations (min, max, mean, etc.) of the following columns to better understand the distribution of your data:
    - Price
    - Distance
    - Landsize

If you have questions at any point, please ask the expert for guidance. We have already done every one of these tasks in earlier challenges; you can go back to those exercises to refresh your memory.

Good luck!

In [1]:
import pandas as pd
import numpy as np
import json

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/melbourne_housing-raw.csv', sep=',')

df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0


In [6]:
df.shape

(19740, 21)
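
Besides `head()` and `shape`, a few more calls help with task 2 (exploring the structure). The sketch below uses a small made-up frame, not the real housing data, so the names and values are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Richmond"],
    "Rooms": [2, 3, 4],
    "Price": [1480000.0, np.nan, 850000.0],
})

toy.info()                       # column names, non-null counts and dtypes
summary = toy.describe()         # summary statistics for the numeric columns
print(summary)
n_suburbs = toy["Suburb"].nunique()  # number of distinct values in a text column
print(n_suburbs)
```

The same three calls can be run on `df` to see which columns are numeric and how many distinct values the categorical ones hold.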

In [8]:
df.isna().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             4344
Method               0
SellerG              0
Date                 0
Distance             8
Postcode             8
Bedroom2          4413
Bathroom          4413
Car               4413
Landsize          4796
BuildingArea     11123
YearBuilt        10389
CouncilArea       4444
Lattitude         4292
Longtitude        4292
Regionname           8
Propertycount        8
dtype: int64

In [14]:
df.isna().sum()/df.shape[0]*100

Suburb            0.000000
Address           0.000000
Rooms             0.000000
Type              0.000000
Price            22.006079
Method            0.000000
SellerG           0.000000
Date              0.000000
Distance          0.040527
Postcode          0.040527
Bedroom2         22.355623
Bathroom         22.355623
Car              22.355623
Landsize         24.295846
BuildingArea     56.347518
YearBuilt        52.629179
CouncilArea      22.512665
Lattitude        21.742655
Longtitude       21.742655
Regionname        0.040527
Propertycount     0.040527
dtype: float64
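
The percentage of missing values can also be obtained from the mean of the boolean mask, since `True` counts as 1. This is a minimal sketch on a made-up frame; sorting the result puts the worst columns first.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Price": [1.0, np.nan, 3.0, np.nan],
    "Rooms": [2, 3, 3, 4],
    "Landsize": [np.nan, 120.0, 95.0, 200.0],
})

# isna() yields booleans; their mean is the share of missing values per column.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```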

In [15]:
df_2 = df.drop(columns=['BuildingArea', 'YearBuilt'])

df_2.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price            4344
Method              0
SellerG             0
Date                0
Distance            8
Postcode            8
Bedroom2         4413
Bathroom         4413
Car              4413
Landsize         4796
CouncilArea      4444
Lattitude        4292
Longtitude       4292
Regionname          8
Propertycount       8
dtype: int64
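
The two columns dropped above are the ones with more than half of their values missing. Instead of naming them by hand, the choice can be driven by a threshold. The helper `drop_sparse_columns` and its `max_missing` parameter are hypothetical names used only for this sketch.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.5):
    """Return a copy without the columns whose NaN share exceeds max_missing."""
    missing_share = frame.isna().mean()
    keep = missing_share[missing_share <= max_missing].index
    return frame[keep].copy()

# Toy frame (values are made up).
toy = pd.DataFrame({
    "Price": [1.0, np.nan, 3.0, 4.0],                # 25% missing, kept
    "BuildingArea": [np.nan, np.nan, np.nan, 79.0],  # 75% missing, dropped
    "Rooms": [2, 3, 3, 4],                           # 0% missing, kept
})

trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```

With the percentages shown earlier, a threshold of 0.5 would select exactly `BuildingArea` and `YearBuilt`.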

In [18]:
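# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to df_2.
df_2['Regionname'] = df_2['Regionname'].fillna('unknown')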

df_2.isna().sum()

Suburb              0
Address             0
Rooms               0
Type                0
Price            4344
Method              0
SellerG             0
Date                0
Distance            8
Postcode            8
Bedroom2         4413
Bathroom         4413
Car              4413
Landsize         4796
CouncilArea      4444
Lattitude        4292
Longtitude       4292
Regionname          0
Propertycount       8
dtype: int64

In [21]:
df_2.dropna(inplace=True)
df_2.isna().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64

In [24]:
df_2.reset_index(drop=True, inplace=True)
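
Why the reset is needed: `dropna` removes rows but keeps the original index labels, so the index ends up with gaps. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({"Price": [np.nan, 10.0, np.nan, 30.0]})

cleaned = toy.dropna()
labels_before = list(cleaned.index)   # original labels survive, leaving gaps
print(labels_before)

cleaned = cleaned.reset_index(drop=True)
labels_after = list(cleaned.index)    # labels are consecutive again
print(labels_after)
```

Passing `drop=True` discards the old labels instead of storing them in a new `index` column.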

In [25]:
df_2.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [26]:
df_2.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname',
       'Propertycount'],
      dtype='object')

In [29]:
column_mapping = {
    'Suburb': 'suburbio',
    'Address': 'direccion',
    'Rooms': 'habitaciones',
    'Type': 'tipo',
    'Price': 'precio',
    'Method': 'metodo',
    'SellerG': 'vendedor',
    'Date': 'fecha',
    'Distance': 'distancia',
    'Postcode': 'cp',
    'Bedroom2': 'habitacion2',
    'Bathroom': 'baño',
    'Car': 'coches',
    'Landsize': 'tamaño',
    'CouncilArea': 'area',
    'Lattitude': 'latitud',
    'Longtitude': 'longitud',
    'Regionname': 'region',
    'Propertycount': 'totalpropiedades',
}

In [30]:
df_renamed = df_2.rename(columns=column_mapping)

df_renamed

Unnamed: 0,suburbio,direccion,habitaciones,tipo,precio,metodo,vendedor,fecha,distancia,cp,habitacion2,baño,coches,tamaño,area,latitud,longitud,region,totalpropiedades
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,Yarra,-37.79960,144.99840,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,Yarra,-37.79690,144.99690,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11641,Whittlesea,30 Sherwin St,3,h,601000.0,S,Ray,29/07/2017,35.5,3757.0,3.0,2.0,2.0,1970.0,Manningham,-37.76311,145.10494,Northern Victoria,2170.0
11642,Williamstown,87 Pasco St,3,h,1285000.0,S,Jas,29/07/2017,6.8,3016.0,2.0,1.0,1.0,2010.0,Whittlesea,-37.68199,145.01744,Western Metropolitan,6380.0
11643,Yarraville,2 Adeney St,2,h,750000.0,SP,hockingstuart,29/07/2017,6.3,3013.0,3.0,2.0,2.0,1999.0,Darebin,-37.75948,144.99615,Western Metropolitan,6543.0
11644,Yarraville,54 Pentland Pde,6,h,2450000.0,VB,Village,29/07/2017,6.3,3013.0,3.0,2.0,1.0,2011.0,Hume,-37.70322,144.88236,Western Metropolitan,6543.0
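
An alternative for task 6 is to keep the English names, correct only the misspelled ones (`Lattitude`, `Longtitude`) and normalise the case in a second pass. The target names in `typo_fixes` are a choice made for this sketch, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical corrections for the misspelled raw names.
typo_fixes = {
    "Lattitude": "Latitude",
    "Longtitude": "Longitude",
}

# Empty toy frame that only carries a few of the raw column names.
raw_columns = ["Suburb", "SellerG", "CouncilArea", "Lattitude", "Longtitude"]
toy = pd.DataFrame(columns=raw_columns)

# Fix the spelling first, then lower-case every name in one pass.
tidy = toy.rename(columns=typo_fixes).rename(columns=str.lower)
print(list(tidy.columns))
```

`rename` accepts either a dictionary or a function for `columns`, so both steps use the same method.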


In [33]:
df_renamed[['tamaño','precio','distancia']].mean()

tamaño       5.544581e+02
precio       1.068142e+06
distancia    9.583059e+00
dtype: float64

In [35]:
df_renamed[['tamaño','precio','distancia']].max()

tamaño         76000.0
precio       9000000.0
distancia         47.4
dtype: float64

In [36]:
df_renamed[['tamaño','precio','distancia']].min()

tamaño           0.0
precio       85000.0
distancia        0.0
dtype: float64

In [37]:
df_renamed[['tamaño','precio','distancia']].std()

tamaño         1460.432326
precio       643728.191437
distancia         5.304187
dtype: float64
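
The four separate calls above can be combined into a single table with `agg`, which makes the columns easier to compare side by side. A minimal sketch on made-up values, reusing the renamed column names:

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "precio": [100.0, 200.0, 300.0],
    "distancia": [2.5, 5.0, 7.5],
    "tamaño": [0.0, 150.0, 300.0],
})

# One row per statistic, one column per variable.
summary = toy[["tamaño", "precio", "distancia"]].agg(["min", "max", "mean", "std"])
print(summary)
```

On the real data the same call would be `df_renamed[['tamaño','precio','distancia']].agg(['min', 'max', 'mean', 'std'])`.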

In [38]:
df_renamed.to_csv('../../Datasets/melbourne_housing-no_nans.csv')

OSError: ignored
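
The traceback above is truncated, so the cause is an assumption: most likely the relative folder `../../Datasets` does not exist from the Colab working directory, and `to_csv` cannot write into a missing folder. The sketch below creates the parent folder before writing and skips the index column. `save_csv` is a hypothetical helper, and the temporary directory only stands in for the real destination (for example the mounted Drive folder the input was read from).

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_csv(frame, path):
    """Write frame to path, creating missing parent folders first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(target, index=False)  # index=False avoids an extra unnamed column
    return target

# Toy frame (values are made up).
toy = pd.DataFrame({"precio": [1.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(toy, Path(tmp) / "Datasets" / "melbourne_housing-no_nans.csv")
    reloaded = pd.read_csv(out)

print(reloaded.columns.tolist())
```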