# Cleaning the data set

Now that the dataset is stored in the GitHub project, we can import it directly into our framework. We will continue using [pandas](https://pandas.pydata.org/).

In [1]:
import pandas as pd

In [14]:
df = pd.read_csv("https://raw.githubusercontent.com/prope-2020-gh-classroom/practica-final-por-equipos-verano-2020-itam-EddOselotl/master/airbnb.csv")

In [15]:
df.head()

Unnamed: 0,id,name,host_id,host_since,host_total_listings_count,latitude,longitude,neighbourhood_cleansed,property_type,room_type,square_feet,price,review_scores_rating
0,22787,"Sunny suite w/ queen size bed, inside boutique...",87973,2010-03-03,9,19.44076,-99.16324,Cuauhtémoc,Boutique hotel,Private room,248.0,"$2,331.00",98.0
1,35797,Villa Dante,153786,2010-06-28,2,19.38399,-99.27335,Cuajimalpa de Morelos,Villa,Entire home/apt,32292.0,"$4,457.00",
2,56074,Great space in historical San Rafael,265650,2010-10-19,2,19.43937,-99.15614,Cuauhtémoc,Condominium,Entire home/apt,646.0,$809.00,97.0
3,58955,Entire beautiful duplex in la Roma,282620,2010-11-09,1,19.42292,-99.15775,Cuauhtémoc,Loft,Entire home/apt,1184.0,"$1,932.00",100.0
4,61792,Spacious Clean Quiet room (own bath) in la Con...,299558,2010-11-26,1,19.41259,-99.17959,Cuauhtémoc,House,Private room,161.0,"$1,364.00",98.0


Now we need to check the data types of the dataframe columns.

In [16]:
df.dtypes

id                             int64
name                          object
host_id                        int64
host_since                    object
host_total_listings_count      int64
latitude                     float64
longitude                    float64
neighbourhood_cleansed        object
property_type                 object
room_type                     object
square_feet                  float64
price                         object
review_scores_rating         float64
dtype: object

As we can see, the columns host_since and price do not have the right data type: both are stored as object (strings). We only want to change those two.

In [13]:
### Change column datatype
df['host_since'] = pd.to_datetime(df['host_since'])
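As a quick check of what this conversion gives us, here is a minimal sketch on a small hypothetical sample (the `sample` frame is invented for illustration, with dates copied from the `df.head()` output). It assumes the dates are ISO-formatted strings, as they appear above.

```python
import pandas as pd

# Hypothetical sample mimicking the host_since strings shown by df.head()
sample = pd.DataFrame({"host_since": ["2010-03-03", "2010-06-28", "2010-10-19"]})

# Same conversion as in the cell above
sample["host_since"] = pd.to_datetime(sample["host_since"])

# The column is now a datetime column, so the .dt accessor becomes available
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["host_since"])
years = sample["host_since"].dt.year.tolist()

print(is_datetime)  # True
print(years)        # [2010, 2010, 2010]
```

Once the column is a datetime, components such as the year or month can be extracted directly instead of slicing strings.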

For the price column, we need to remove the $\$$ sign and the commas before changing its data type.

In [24]:
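### Remove the $ sign and the commas (thousands separators) in one pass.
### Assigning the result back avoids inplace=True on a column selection,
### which may not modify df in recent pandas versions.
df['price'] = df['price'].str.replace(r'[$,]', '', regex=True)
### Change column datatype
df['price'] = df['price'].astype(float)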
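If some price strings could be malformed, `astype(float)` would raise an error. A possible alternative, sketched here on a hypothetical sample, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN so they can be counted and inspected. The `sample` series and its `"N/A"` entry are invented for illustration; nothing in the data shown above indicates that such values exist.

```python
import pandas as pd

# Hypothetical sample: the first two values mimic the raw strings shown by df.head();
# "N/A" is an invented malformed entry used only for illustration.
sample = pd.Series(["$2,331.00", "$809.00", "N/A"], name="price")

# Strip "$" and "," in one pass
stripped = sample.str.replace(r"[$,]", "", regex=True)

# errors="coerce" converts anything unparseable into NaN instead of raising
prices = pd.to_numeric(stripped, errors="coerce")

# Count how many entries could not be parsed
n_bad = int(prices.isna().sum())

print(prices.tolist())  # [2331.0, 809.0, nan]
print(n_bad)            # 1
```

Note that values that were already missing before the conversion are also counted as NaN, so the count is an upper bound on parsing failures.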

In [32]:
df.dtypes

id                             int64
name                          object
host_id                        int64
host_since                    object
host_total_listings_count      int64
latitude                     float64
longitude                    float64
neighbourhood_cleansed        object
property_type                 object
room_type                     object
square_feet                  float64
price                        float64
review_scores_rating         float64
dtype: object

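The price column is now numeric. Note that host_since still appears as object in the output above: the conversion cell (In [13]) was executed before the data was reloaded in In [14], so that cell must be rerun for the change to take effect. Once that is done, the data set is clean and ready for analysis.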

Finally, we take a last look at the data and save the data set.

In [34]:
df.tail()

Unnamed: 0,id,name,host_id,host_since,host_total_listings_count,latitude,longitude,neighbourhood_cleansed,property_type,room_type,square_feet,price,review_scores_rating
21657,43517931,a media Cuadra de Reforma 222,78475678,2016-06-18,1,19.4311,-99.16093,Cuauhtémoc,House,Private room,,432.0,
21658,43520164,Depa en ajusco,168938330,2018-01-20,1,19.28033,-99.21582,Tlalpan,Apartment,Entire home/apt,,1705.0,
21659,43523985,"Un acogedor lugar, lleno de luz y plantas.",37100931,2015-06-30,1,19.37843,-99.16049,Benito Juárez,Apartment,Private room,,736.0,
21660,43527334,Estrena hermoso departamento,71849927,2016-05-13,1,19.38477,-99.1964,Álvaro Obregón,Condominium,Entire home/apt,,669.0,
21661,43527513,"3 Bedroom Apartment in Centro, CDMX [Mexico City]",346685643,2020-05-18,1,19.43575,-99.13223,Cuauhtémoc,Apartment,Entire home/apt,,55602.0,


In [33]:
df.to_csv("airbnb_clean.csv", index=False)
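One thing to keep in mind when loading the cleaned file later: CSV is a plain-text format and does not store column types, so a datetime column comes back as text unless it is parsed again. The sketch below illustrates this with a small hypothetical frame and an in-memory buffer standing in for `airbnb_clean.csv`; the `sample`, `plain` and `parsed` names are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample with the two converted columns
sample = pd.DataFrame({
    "host_since": pd.to_datetime(["2010-03-03", "2010-06-28"]),
    "price": [2331.0, 4457.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Reading back without hints: host_since is no longer a datetime column
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with parse_dates restores the datetime type
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["host_since"])

print(pd.api.types.is_datetime64_any_dtype(plain["host_since"]))   # False
print(pd.api.types.is_datetime64_any_dtype(parsed["host_since"]))  # True
```

The numeric price column survives the round trip without extra arguments, since `read_csv` infers float columns on its own.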