# Data Cleansing with Airbnb

We start with some exploratory data analysis and cleansing, using the SF Airbnb rental dataset from [Inside Airbnb](http://insideairbnb.com/get-the-data.html).

<img src="https://files.training.databricks.com/images/301/sf.jpg" style="height: 200px; margin: 10px; border: 1px solid #ddd; padding: 10px"/>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Impute missing values
 - Identify & remove outliers

### Setting the default database and user name  
##### Replace "renato" with your name in the `username` variable.

In [0]:
## Put your name here
username = "renato"

dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dsacademy_embedded_wave3_{username}")
spark.sql(f"USE dsacademy_embedded_wave3_{username}")
spark.conf.set("spark.sql.shuffle.partitions", 40)

spark.sql("SET spark.databricks.delta.formatCheck.enabled = false")
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Out[1]: DataFrame[key: string, value: string]

By default, Spark on Databricks resolves file paths against DBFS unless you explicitly specify a different URI scheme.  
To read a file from the local filesystem with the **spark.read** functions in Databricks, use the prefix **file:** followed by the complete (absolute) path to the file.   
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
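
As a minimal sketch of the prefixing described above, the snippet below builds a `file://` URI from a filesystem path using only the standard library. The helper name `to_local_uri` and the example path are hypothetical and not part of the course material; the sketch assumes a POSIX-style filesystem, where absolute paths start with `/`.

```python
import os


def to_local_uri(path: str) -> str:
    """Build a file:// URI for a local path (hypothetical helper)."""
    # abspath resolves relative paths against the current working directory
    absolute = os.path.abspath(path)
    # On POSIX the absolute path starts with "/", which yields "file:///..."
    return "file://" + absolute


# Example path chosen for illustration only
print(to_local_uri("/tmp/data/listings.csv.gz"))
```

The resulting string can then be passed to a reader such as `spark.read.csv`, which is what the following cells do with the path of the Airbnb file.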

In [0]:
import os

In [0]:
datapath = os.path.join(os.getcwd(), "data", "airbnb", "listings.csv.gz")
datapath = "file://" + datapath
print(datapath)

file:///Workspace/Repos/renato.rocha-souza@rbinternational.com/Embedded_Data_Scientist/Module_B/Day3/data/airbnb/listings.csv.gz


Let's load the Airbnb dataset.

In [0]:
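# multiLine lets a quoted field span several lines; escape='"' treats a doubled quote inside a quoted field as a literal quote
rawDF = spark.read.csv(datapath, header=True, inferSchema=True, multiLine=True, escape='"')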
rawDF.limit(10).display()

id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
15883,https://www.airbnb.com/rooms/15883,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,b&b near Old Danube river,"Four rooms, each one differently and individually designed, really charming with lots of details. The interior: a potpourri of many different styles, completed with precious things brought along from our trips. The space Old Danube river, a short walk to one of the small streets here. Everywhere small houses or gardens, a small idyllic island far from hectic city life. Here we settled down a few years ago, built a house for us and the kids and decided: let’s add 4 more rooms. And thus this small b&b was created, anything but standard and Mozart kitsch. Guest access free wifi, many books about Vienna, great tipps from the hosts...","small and personal Four rooms at this B&B, each one differently and individually designed, really charming with lots tasteful details. The interior – a potpourri of many different styles – has been enhanced with precious souvenirs from our trips. free parking You arrive by car? No worries! Parking is no problem at all! It is possible for free on the street right in front of our house. There are always enough spaces available. Arrive and unload peacefully, and begin your stay without any stress whatsoever!  recreation nearby The Old Danube River’s beautiful recreational area begins at the end of our street. Kilometers of walks beside the water offer you fun and relaxation without cars and noise. It is a natural treasure in the middle of the city.",https://a0.muscache.com/pictures/18eff738-a737-428d-b653-99f2e79145b9.jpg,62142.0,https://www.airbnb.com/users/show/62142,Eva,2009-12-11T00:00:00.000+0000,"Vienna, Austria",Mein größtes Hobby: Reisen! Am liebsten mit meinen 3 Kindern um ihnen die weite Welt zu zeigen. Eindrücke von allen möglichen Orten finden so Einfluss im Design vom the rooms. 
Ich freue mich auf neue Begegnungen!,within a day,50%,33%,f,https://a0.muscache.com/im/pictures/user/2416670c-a785-417e-b345-f74b368678c9.jpg?aki_policy=profile_small,https://a0.muscache.com/im/pictures/user/2416670c-a785-417e-b345-f74b368678c9.jpg?aki_policy=profile_x_medium,Donaustadt,4.0,6.0,"['email', 'phone']",t,t,"Vienna, Austria",Donaustadt,,48.24262,16.42767,Room in bed and breakfast,Hotel room,3.0,,1 private bath,1.0,2.0,"[""Essentials"", ""Heating"", ""High chair"", ""Hangers"", ""Wifi"", ""Air conditioning"", ""Patio or balcony"", ""Paid parking garage off premises"", ""Luggage dropoff allowed"", ""Shampoo"", ""Long term stays allowed"", ""Pack \u2019n play/Travel crib"", ""Hair dryer"", ""Breakfast"", ""Bed linens"", ""TV"", ""Smoke alarm"", ""Hot water""]",$110.00,1.0,365.0,1.0,1.0,365.0,365.0,1.0,365.0,,t,29.0,59.0,89.0,348.0,2022-09-12T00:00:00.000+0000,14.0,1.0,0.0,2015-04-10T00:00:00.000+0000,2021-10-07T00:00:00.000+0000,4.71,4.86,4.93,4.93,4.86,4.71,4.5,,f,4.0,2.0,0.0,0.0,0.15
38768,https://www.airbnb.com/rooms/38768,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,central cityapartement- wifi- nice neighbourhood,"39m² apartment with beautiful courtyard of the house. quiet and safe upcoming neighborhood-located in old jewish quater and popular area of Vienna. Surrounded by nice bars and restaurants, 15 min walkingdistance to the middle of citycenter, close to many sights . 300m to subway U2 Taborstraße. Next to the karmelitermarket (with delicious places to eat) and the Augarten. free wifi. touristtax of 3,2% is included in the price. The space Holiday atmosphere apartment, 39m². 2 rooms with free wlan Internet access. Its situated in the heart of Vienna, in a renovated house of the 2nd district, very close to the subwaystation U2 Taborstraße, only 15 min walking distance to St. Stephansdom-cathedral. The Karmelitermarket is around the corner and the neighborhood is with many bars and restaurants is very popular. Some nice restaurants and bars are only about 100 meters walkingdistance. The quiet apartment was recently completely renovated and stylish","the Karmeliterviertel became very popular in the last years. It offers many new pubs and restaurants and is located next to Augarten (huge garden good for jogging), to the concert hall of Wiener Sängerknaben (muth), and has the Karmelitermarket, which is best to visit on Saturday morning to get biological products. In summer the best nightbars are at the donaukanal, which is 5 minutes away.",https://a0.muscache.com/pictures/ad4089a3-5355-4681-96bb-e3ad70684987.jpg,166283.0,https://www.airbnb.com/users/show/166283,Hannes,2010-07-14T00:00:00.000+0000,"Vienna, Austria","I am open minded and like travelling myself. I have spent many months in Latinamerica and Asia, where I got in touch with Indian philosophie and meditation... 
Now I mostly work in the field of contemporary art and I do my best to offer you a nice apartment next to the citycenter...!",within an hour,100%,100%,t,https://a0.muscache.com/im/users/166283/profile_pic/1435040494/original.jpg?aki_policy=profile_small,https://a0.muscache.com/im/users/166283/profile_pic/1435040494/original.jpg?aki_policy=profile_x_medium,Leopoldstadt,3.0,3.0,"['email', 'phone']",t,t,"Vienna, Austria",Leopoldstadt,,48.21924,16.37831,Entire rental unit,Entire home/apt,5.0,,1 bath,1.0,3.0,"[""Dishes and silverware"", ""Cooking basics"", ""Shampoo"", ""Cleaning products"", ""Host greets you"", ""Long term stays allowed"", ""Wifi"", ""Die Mikrowelle hat eine Grillfunktion. Es gibt keinen eigenst\u00e4ndigen Ofen. oven"", ""Hangers"", ""Stove"", ""Hot water"", ""Outdoor dining area"", ""Drying rack for clothing"", ""Hot water kettle"", ""Iron"", ""Essentials"", ""Dining table"", ""Coffee maker"", ""Hair dryer"", ""Bed linens"", ""Free washer"", ""Portable fans"", ""Microwave"", ""Dedicated workspace"", ""Refrigerator"", ""Carbon monoxide alarm"", ""Kitchen"", ""Room-darkening shades"", ""Sound system with Bluetooth and aux"", ""Smoke alarm"", ""Shared patio or balcony"", ""Heating""]",$69.00,5.0,100.0,3.0,5.0,1125.0,1125.0,4.6,1125.0,,t,7.0,19.0,29.0,47.0,2022-09-12T00:00:00.000+0000,350.0,18.0,1.0,2011-03-23T00:00:00.000+0000,2022-09-06T00:00:00.000+0000,4.75,4.8,4.65,4.91,4.93,4.75,4.69,,t,3.0,3.0,0.0,0.0,2.5
40625,https://www.airbnb.com/rooms/40625,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,"Near Palace Schönbrunn, Apt. 1","Welcome to my Apt. 1! This is a 2bedroom apartment for 4 - 6 persons. The space Welcome to my Apt. 1! This is a two-bedroom apartment size 55m2. Kitchen: fridge, dish washer, micro wave etc. Bathroom: WC + bathtub. Washing machine with dryer function. Master bedroom: double bed king size (180cm x 200cm) for 2 persons 2nd bedroom: bunk bed for 2 persons * Built-in air-conditioning * Wifi Internet * Flat screen cable TV * DVD & DVD player * Iron, ironboard * Hairdryer * iPod docking station * Two extra mattresses can be provided to sleep two additional children. The apartment is located next to Austria's most visited tourist attraction: Schönbrunn Palace, with its lovely Imperial Gardens and famous Zoo. Furthermore, the apartment is directly on underground line U4 (Meidling Hauptstrasse) which takes you to the heart of the city in only 5 stops (i.e. less than 10 min). A nu","The neighbourhood offers plenty of restaurants and grocery shops. In the apartment you will find an area map marked with the different restaurants and shops in the area, including opening hours for your ease of reference.",https://a0.muscache.com/pictures/11509144/d55c2742_original.jpg,175131.0,https://www.airbnb.com/users/show/175131,Ingela,2010-07-20T00:00:00.000+0000,"Vienna, Austria","I´m originally from Sweden but have been living in Vienna for many years. I love this city and I love meeting guests from different parts of the world :-) During my travel, both privately and on duty, I have learned the importance of safe, clean and comfortable accommodation in a central location with good infrastructure. This is what I offer you with my apartments.  
Most of my apartments are located in the same building right on the underground U4 Meidling Hauptstrase that connects to the city center in less than 10min, and at the same time walking distance to Palace Schönbrunn - Austria´s biggest tourist attraction. My city centre apartment is located next to Hofburg Palace with walking distance to all the inner city attractions. As the majority of my apartments are located in the same building it is very convenient for groups as well to stay with me. My offer spans from studio apartments of 30m2 to 3 bedroom apartments of 100m2 sleeping up to 14 persons. Since 1 January 2015 I have been an Airbnb Super Host. This is such an honour and I will do everything I can to make YOUR stay perfect as well!! Kind regards, Ingela Johansson :-)",within a few hours,94%,79%,t,https://a0.muscache.com/im/users/175131/profile_pic/1279660518/original.jpg?aki_policy=profile_small,https://a0.muscache.com/im/users/175131/profile_pic/1279660518/original.jpg?aki_policy=profile_x_medium,Rudolfsheim-Fünfhaus,16.0,19.0,"['email', 'phone']",t,t,"Vienna, Austria",Rudolfsheim-Fnfhaus,,48.18434,16.32701,Entire rental unit,Entire home/apt,6.0,,1 bath,2.0,4.0,"[""Babysitter recommendations"", ""Dishes and silverware"", ""Free washer \u2013 In unit"", ""Cooking basics"", ""Toaster"", ""Shampoo"", ""Cleaning products"", ""Long term stays allowed"", ""Shower gel"", ""Cable TV"", ""Clothing storage: closet and wardrobe"", ""Free street parking"", ""Baking sheet"", ""Wifi"", ""Stainless steel oven"", ""Elevator"", ""High chair"", ""Hangers"", ""Paid parking garage off premises"", ""Children\u2019s books and toys"", ""Stainless steel induction stove"", ""Drying rack for clothing"", ""Freezer"", ""First aid kit"", ""Hot water kettle"", ""Free dryer \u2013 In unit"", ""Iron"", ""Essentials"", ""Dishwasher"", ""Dining table"", ""Single level home"", ""Body soap"", ""Radiant heating"", ""Conditioner"", ""Hair dryer"", ""Coffee maker"", ""Bed linens"", ""Bathtub"", 
""Wine glasses"", ""Nespresso machine"", ""Microwave"", ""Rice maker"", ""Central air conditioning"", ""Lockbox"", ""Refrigerator"", ""Barbecue utensils"", ""Carbon monoxide alarm"", ""Kitchen"", ""Extra pillows and blankets"", ""Luggage dropoff allowed"", ""Pack \u2019n play/Travel crib"", ""Fire extinguisher"", ""Pour-over coffee"", ""42\"" HDTV with standard cable"", ""Smoke alarm"", ""Hot water""]",$145.00,1.0,180.0,1.0,1.0,180.0,180.0,1.0,180.0,,t,19.0,49.0,79.0,169.0,2022-09-12T00:00:00.000+0000,181.0,21.0,0.0,2010-08-04T00:00:00.000+0000,2022-08-10T00:00:00.000+0000,4.83,4.9,4.88,4.89,4.93,4.59,4.7,,t,15.0,14.0,1.0,0.0,1.23
392757,https://www.airbnb.com/rooms/392757,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,VCA3 Palais Brambilla - studio with city views,"Palais Brambilla is an oasis located in the historic heart of Vienna, just a 10-minuteswalk away from St. Stephen’s Cathedral. Palais Brambilla is situated next to the Danube Canal, the green water band and recreational zone crossing the city. Four individual apartments are at Palais Brambilla – very quite, orientated to a private court yard or facing the quay with lots of light and panoramic views overlooking half of the city. The space Studio apartment with city views – this 40 m² apartment on 4th floor with elevator is light and sunny with panoramic views overlooking half of the city. The large size living room / bedroom is flexible to use. The kitchenette offers a breakfast nook. The bathroom is handicap accessible with a drive-in shower and the toilet can be accessed from the side. Guest access Guests are being welcomed by roses bushes in front of the house. When they enter the house, they cross a hallway with h","The neighborhood offers a wide range of cafés, restaurants and all kinds of shops. Trees in front of the house and a green park in the back invite to relax. There is even a playground and a dog park. We love our neighborhood as it is a nice community but still urban - we know our neighbors it is almost like living in village in the middle of a city- the shops and restaurant add a hip and down to earth mix to a vast variety of supermarkets from budget to gourmet. To the sights close by count the two oldest churches in Vienna, the Ruprechtskirche and the church Maria am Gestade. The street around our corner leads into the ""Tiefer Graben"" a deep ditch that was used by the Romans as fortification and edge of their castellum. If you continue further the ""Naglergasse"" is the right angled corner of the castellum. The narrow street still shows the Roman street pattern. The St. 
Stephen’s Cathedral is in walking distance as is all of the 1. district. The Vienna stock mark",https://a0.muscache.com/pictures/miso/Hosting-392757/original/a2f0d520-9076-440c-abe2-2feb38adf534.jpeg,1833176.0,https://www.airbnb.com/users/show/1833176,Markus,2012-02-29T00:00:00.000+0000,,"I am an architect and my speciality are historical buildings. I won a price with the restoration of this building from the city Vienna. I love finding the bones, the good structures in old buildings or within a neighbourhood and create with this knowledge perfect solutions. I also have art gallery in the former stables of this house. Our ""stable gallery"" is a perfect place to stable art and show them off to possible buyers. Our home is named after our ancestor Giovanni Alessandro de Brambilla who was the chief surgeon of Emperor Joseph II in the 18th century. He held this position for most of his life, travelled with the Emperor always and founded the Academy for surgeons (which is not far away from us) and reorganised the way of medical treatment in the Age of Enlightment. We love to travel ourselves but love the way to be at home far away. 
We have many frequent guests who love returning to us and our neighbourhood.",within a few hours,100%,95%,f,https://a0.muscache.com/im/pictures/user/2f403f6c-0a38-42a8-9980-20e30406101c.jpg?aki_policy=profile_small,https://a0.muscache.com/im/pictures/user/2f403f6c-0a38-42a8-9980-20e30406101c.jpg?aki_policy=profile_x_medium,Innere Stadt,4.0,6.0,"['email', 'phone', 'work_email']",t,t,"Vienna, Austria",Innere Stadt,,48.21496,16.37161,Entire rental unit,Entire home/apt,2.0,,1 bath,1.0,1.0,"[""Dishes and silverware"", ""Shampoo"", ""Long term stays allowed"", ""Cable TV"", ""Wifi"", ""Elevator"", ""High chair"", ""Hangers"", ""Stove"", ""Hot water"", ""Crib"", ""First aid kit"", ""TV with standard cable"", ""Iron"", ""Washer"", ""Essentials"", ""Dryer"", ""Coffee maker"", ""Hair dryer"", ""Microwave"", ""Paid parking off premises"", ""Central air conditioning"", ""Refrigerator"", ""Kitchen"", ""Luggage dropoff allowed"", ""Fire extinguisher"", ""Host greets you"", ""Smoke alarm"", ""Heating""]",$100.00,2.0,180.0,2.0,2.0,180.0,180.0,2.0,180.0,,t,6.0,29.0,54.0,329.0,2022-09-12T00:00:00.000+0000,100.0,12.0,3.0,2012-05-05T00:00:00.000+0000,2022-08-27T00:00:00.000+0000,4.64,4.73,4.55,4.8,4.91,4.89,4.59,,f,4.0,4.0,0.0,0.0,0.79
51287,https://www.airbnb.com/rooms/51287,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,little studio- next to citycenter- wifi- nice area,"small studio in new renovated old house and very nice upcoming neighbourhood. close to many sights and subway. wifi. safe area, nice restaurants, bars and market. walking distance to citycenter. if wanted 2 old bicycles are available for free. touristtax of 3,2% is included in the price. The space Nice little studio in one of the oldest and beautiful houses of the neighbourhood, at the same time renovated house with new rooftops, free wifi. Its a small and quiet apartment with a little balcony, 1st floor. A modern sofa can be used as a double bed (1,60x2,05m). There is a well equipped little kitchencorner, a little bathroom with shower and toilett, extra heater and hair dryer. Fresh sheets and towels are provided. Of course there is hot water. The spacious double bed (1,60mx2m) is located above the bathroom. The ceiling of the shower is lower than usual. For low-budget guests I offer the option for up to 4 persons. the sleeping sofa is 1,40mx2m,","The neighbourhood has a lot of very nice little pubs and restaurants. 200 meters away you have the karmelitermarket, which is especially great on saturday morning. Aswell I like the Augarten, which is one of the biggist gardens in Vienna and just 5 minutes away! Beautiful old buildings are around and nearby is the modern Noveltower, including a skybar at the Sofitel (Praterstraße 1) which has a terrific view above the city!",https://a0.muscache.com/pictures/25163038/1c4e1334_original.jpg,166283.0,https://www.airbnb.com/users/show/166283,Hannes,2010-07-14T00:00:00.000+0000,"Vienna, Austria","I am open minded and like travelling myself. I have spent many months in Latinamerica and Asia, where I got in touch with Indian philosophie and meditation... 
Now I mostly work in the field of contemporary art and I do my best to offer you a nice apartment next to the citycenter...!",within an hour,100%,100%,t,https://a0.muscache.com/im/users/166283/profile_pic/1435040494/original.jpg?aki_policy=profile_small,https://a0.muscache.com/im/users/166283/profile_pic/1435040494/original.jpg?aki_policy=profile_x_medium,Leopoldstadt,3.0,3.0,"['email', 'phone']",t,t,"Vienna, Austria",Leopoldstadt,,48.21778,16.37847,Entire rental unit,Entire home/apt,3.0,,1 bath,,2.0,"[""Dishes and silverware"", ""Cooking basics"", ""Shampoo"", ""Long term stays allowed"", ""Wifi"", ""Hangers"", ""Stove"", ""Hot water"", ""Iron"", ""Essentials"", ""Patio or balcony"", ""Dryer"", ""Coffee maker"", ""Hair dryer"", ""Bed linens"", ""Free washer"", ""Refrigerator"", ""Kitchen"", ""Host greets you"", ""Smoke alarm"", ""Heating""]",$68.00,5.0,31.0,3.0,5.0,1125.0,1125.0,4.9,1125.0,,t,7.0,12.0,19.0,19.0,2022-09-12T00:00:00.000+0000,347.0,32.0,4.0,2011-01-27T00:00:00.000+0000,2022-09-07T00:00:00.000+0000,4.65,4.77,4.51,4.93,4.95,4.86,4.58,,f,3.0,3.0,0.0,0.0,2.45
392905,https://www.airbnb.com/rooms/392905,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,City Apartment 1- Palais Brambilla romantic style,"Palais Brambilla is an oasis located in the historic heart of Vienna, just a 10-minuteswalk away from St. Stephen’s Cathedral. Palais Brambilla is situated next to the Danube Canal, the green water band and recreational zone crossing the city. Four individual apartments are at Palais Brambilla – very quite, orientated to a private court yard or facing the quay with lots of light and panoramic views overlooking half of the city. The space One bedroom apartment with views into a private court yard – this 50 m² apartment on ground floor (US first floor) is quite with high and vaulted ceilings. The decent sized living and dining room is furnished partly with antiques, kitchen is separate. An open doorway leads you from the living room to the bedroom decorated with 19th century landscape paintings and direct access to the bathroom – equipped with a whirlpool. Guest access Guests are being welcomed by roses bushes in front","The neighborhood offers a wide range of cafés, restaurants and all kinds of shops. Trees in front of the house and a green park in the back invite to relax. There is even a playground and a dog park. We love our neighborhood as it is a nice community but still urban - we know our neighbors it is almost like living in village in the middle of a city- the shops and restaurant add a hip and down to earth mix to a vast variety of supermarkets from budget to gourmet. To the sights close by count the two oldest churches in Vienna, the Ruprechtskirche and the church Maria am Gestade. The street around our corner leads into the ""Tiefer Graben"" a deep ditch that was used by the Romans as fortification and edge of their castellum. If you continue further the ""Naglergasse"" is the right angled corner of the castellum. The narrow street still shows the Roman street pattern. The St. 
Stephen’s Cathedral is in walking distance as is all of the 1. district. The Vienna stock mark",https://a0.muscache.com/pictures/18998543-d9dd-4217-8293-5ec47cb8fb59.jpg,1833176.0,https://www.airbnb.com/users/show/1833176,Markus,2012-02-29T00:00:00.000+0000,,"I am an architect and my speciality are historical buildings. I won a price with the restoration of this building from the city Vienna. I love finding the bones, the good structures in old buildings or within a neighbourhood and create with this knowledge perfect solutions. I also have art gallery in the former stables of this house. Our ""stable gallery"" is a perfect place to stable art and show them off to possible buyers. Our home is named after our ancestor Giovanni Alessandro de Brambilla who was the chief surgeon of Emperor Joseph II in the 18th century. He held this position for most of his life, travelled with the Emperor always and founded the Academy for surgeons (which is not far away from us) and reorganised the way of medical treatment in the Age of Enlightment. We love to travel ourselves but love the way to be at home far away. 
We have many frequent guests who love returning to us and our neighbourhood.",within a few hours,100%,95%,f,https://a0.muscache.com/im/pictures/user/2f403f6c-0a38-42a8-9980-20e30406101c.jpg?aki_policy=profile_small,https://a0.muscache.com/im/pictures/user/2f403f6c-0a38-42a8-9980-20e30406101c.jpg?aki_policy=profile_x_medium,Innere Stadt,4.0,6.0,"['email', 'phone', 'work_email']",t,t,"Vienna, Austria",Innere Stadt,,48.21351,16.37282,Entire rental unit,Entire home/apt,2.0,,1 bath,1.0,1.0,"[""Shampoo"", ""Long term stays allowed"", ""Cable TV"", ""Wifi"", ""Elevator"", ""High chair"", ""Hangers"", ""Hot water"", ""Crib"", ""First aid kit"", ""TV with standard cable"", ""Iron"", ""Washer"", ""Essentials"", ""Dryer"", ""Hair dryer"", ""Paid parking off premises"", ""Kitchen"", ""Luggage dropoff allowed"", ""Fire extinguisher"", ""Host greets you"", ""Smoke alarm"", ""Heating""]",$99.00,3.0,180.0,3.0,3.0,180.0,180.0,3.0,180.0,,t,19.0,49.0,73.0,324.0,2022-09-12T00:00:00.000+0000,52.0,10.0,2.0,2012-10-12T00:00:00.000+0000,2022-08-22T00:00:00.000+0000,4.63,4.67,4.35,4.69,4.75,4.88,4.56,,f,4.0,4.0,0.0,0.0,0.43
70637,https://www.airbnb.com/rooms/70637,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,Flat in the Center with Terrace,"The space My apartment (including a large terrace) is very quiet. The city center is about ten minutes away, in less than five minutes walk you reach one of the nicest parks in Vienna, the Augarten, and in one minute you are at the U-Bahn Underground-station. You have your own room upstairs and can share the rest of the apartement with me. The apartment is located in the old Jewish quarter, and many artists live here. There is a variety of excellent bars and restaurants around the place as well as a very nice market, which is worth a visit. From the flat roof of my apartment one has a really nice view of the nearby city. The apartment is on the fifth floor however the lift goes up only to the fourth floor. you don´t have to care about your bed linen and towels. you can use my hairdryer. feel free to make your own coffee or tea in the morning. Welcome to Vienna - I am looking forward meeting you! nullhttps://a0.muscache.com/pictures/925691/c8c1bdd6_original.jpg358842https://www.airbnb.com/users/show/358842Elxe2011-01-23T00:00:00.000+0000Vienna, AustriaFlat in the Center with TerraceWien, Wien, ÖsterreichMy apartment (including a large terrace) is very quiet. 
The city center is about ten minutes away, in less than five minutes walk you reach one of the nicest parks in Vienna, the Augarten, and in o...Ferienwohnungen in Vienna hi there, i´m an outdoor trainer, graficdesigner, mother of two children, i love my city, running, playing accordion, meeting friends, gardening, dancing, traveling, my job, good art and summertime ...within a few hours100%80%thttps://a0.muscache.com/im/users/358842/profile_pic/1434576141/original.jpg?aki_policy=profile_smallhttps://a0.muscache.com/im/users/358842/profile_pic/1434576141/original.jpg?aki_policy=profile_x_mediumLeopoldstadt44['email', 'phone']ttnullLeopoldstadtnull48.217616.38018Private room in rental unitPrivate room2null2 shared baths12[""Dishes and silverware"", ""Cooking basics"", ""Shampoo"", ""Long term stays allowed"", ""Cable TV"", ""Wifi"", ""Elevator"", ""Hangers"", ""Stove"", ""Hot water"", ""Outdoor dining area"", ""BBQ grill"", ""Ethernet connection"", ""First aid kit"", ""TV with standard cable"", ""Iron"", ""Washer"", ""Essentials"", ""Dishwasher"", ""Oven"", ""Dryer"", ""Coffee maker"", ""Hair dryer"", ""Lock on bedroom door"", ""Bed linens"", ""Bathtub"", ""Paid parking off premises"", ""Refrigerator"", ""Kitchen"", ""Outdoor furniture"", ""Extra pillows and blankets"", ""Luggage dropoff allowed"", ""Fire extinguisher"", ""Indoor fireplace"", ""Host greets you"", ""Shared patio or balcony"", ""Heating""]$50.002100022100010002.01000.0nullt2221422022-09-12T00:00:00.000+0000117002011-03-28T00:00:00.000+00002021-06-25T00:00:00.000+00004.774.744.684.84.754.814.71nullf31200.84",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
393961,https://www.airbnb.com/rooms/393961,20220911230927,2022-09-12T00:00:00.000+0000,city scrape,City Apartment 2- Palais Brambilla - 1920ies style,"Palais Brambilla is an oasis located in the historic heart of Vienna, just a 10-minuteswalk away from St. Stephen’s Cathedral. Palais Brambilla is situated next to the Danube Canal, the green water band and recreational zone crossing the city. Four individual apartments are at Palais Brambilla – very quite, orientated to a private court yard or facing the quay with lots of light and panoramic views overlooking half of the city. The space Two bedrooms apartment and views into a private court yard – this 60 m² apartment on first floor (US second floor) with elevator is very quiet with high ceilings. The decent sized living and dining room is furnished partly with an early art deco cabinet and Chesterfield sofas. kitchen, bathroom and powder room are separate. From the living room the bedrooms can be accessed, decorated in the style of the 1920ties. Guest access Guests are being welcomed by roses bushes in front of the","The neighborhood offers a wide range of cafés, restaurants and all kinds of shops. Trees in front of the house and a green park in the back invite to relax. There is even a playground and a dog park. We love our neighborhood as it is a nice community but still urban - we know our neighbors it is almost like living in village in the middle of a city- the shops and restaurant add a hip and down to earth mix to a vast variety of supermarkets from budget to gourmet. To the sights close by count the two oldest churches in Vienna, the Ruprechtskirche and the church Maria am Gestade. The street around our corner leads into the ""Tiefer Graben"" a deep ditch that was used by the Romans as fortification and edge of their castellum. If you continue further the ""Naglergasse"" is the right angled corner of the castellum. The narrow street still shows the Roman street pattern. The St. 
Stephen’s Cathedral is in walking distance as is all of the 1. district. The Vienna stock market, th",https://a0.muscache.com/pictures/f4ca12dd-0e99-494c-96ff-1c61f193958e.jpg,1833176.0,https://www.airbnb.com/users/show/1833176,Markus,2012-02-29T00:00:00.000+0000,,"I am an architect and my speciality are historical buildings. I won a price with the restoration of this building from the city Vienna. I love finding the bones, the good structures in old buildings or within a neighbourhood and create with this knowledge perfect solutions. I also have art gallery in the former stables of this house. Our ""stable gallery"" is a perfect place to stable art and show them off to possible buyers. Our home is named after our ancestor Giovanni Alessandro de Brambilla who was the chief surgeon of Emperor Joseph II in the 18th century. He held this position for most of his life, travelled with the Emperor always and founded the Academy for surgeons (which is not far away from us) and reorganised the way of medical treatment in the Age of Enlightment. We love to travel ourselves but love the way to be at home far away. 
We have many frequent guests who love returning to us and our neighbourhood.",within a few hours,100%,95%,f,https://a0.muscache.com/im/pictures/user/2f403f6c-0a38-42a8-9980-20e30406101c.jpg?aki_policy=profile_small,https://a0.muscache.com/im/pictures/user/2f403f6c-0a38-42a8-9980-20e30406101c.jpg?aki_policy=profile_x_medium,Innere Stadt,4.0,6.0,"['email', 'phone', 'work_email']",t,t,"Vienna, Austria",Innere Stadt,,48.21318,16.37486,Entire rental unit,Entire home/apt,4.0,,0 baths,2.0,1.0,"[""Elevator"", ""Iron"", ""Washer"", ""Essentials"", ""Hangers"", ""Kitchen"", ""First aid kit"", ""Dryer"", ""Shampoo"", ""Long term stays allowed"", ""Fire extinguisher"", ""Hair dryer"", ""Cable TV"", ""TV with standard cable"", ""Wifi"", ""Heating""]",$140.00,3.0,180.0,3.0,3.0,180.0,180.0,3.0,180.0,,t,12.0,42.0,72.0,337.0,2022-09-12T00:00:00.000+0000,69.0,5.0,0.0,2013-06-07T00:00:00.000+0000,2022-07-21T00:00:00.000+0000,4.58,4.8,4.76,4.83,4.92,4.85,4.73,,f,4.0,4.0,0.0,0.0,0.61
75471,https://www.airbnb.com/rooms/75471,20220911230927,2022-09-12T00:00:00.000+0000,previous scrape,nice big apartment with balcony,"you will like my beautiful apartment with balcony, waterbed, wooden floors and everything you need for your vacation. The space i am a teacher and like travelling a lot. since i moved to a new apartment with my family my apartment is for rent now. it has everything you need for your short or longer vacation or business stay, such as a big kitchen with dining table and direct access to a nice big balcony with an awning directed to a calm courtyard, a spacious living- and bedroom with parquet flooring and a new waterbed. For guests that have work to do, there is also an office area. There is a big bathroom with bathtub and washing machine beside the bedroom and a walk-in-closet. the toilet is seperate. there is a further room, which you can use as a nursery or a second bedroom if required. crib and highchair are available. towels, tea towels and linen are available. kettle, toaster and nespresso machine (without gorge clooney)",the neighborhood is charming and quiet. but you have easy access to public transport which bfings you to the center of the city within 25 minutes. 
there are public,https://a0.muscache.com/pictures/7292067/63747cd0_original.jpg,363315.0,https://www.airbnb.com/users/show/363315,Wolfgang,2011-01-26T00:00:00.000+0000,"Vienna, Austria",I am a teacher and I like to travel myself - nonetheless I enjoy being a host too - My first experiences with Airbnb have been very promising and I look forward to welcome more people!,,,,f,https://a0.muscache.com/im/users/363315/profile_pic/1296216486/original.jpg?aki_policy=profile_small,https://a0.muscache.com/im/users/363315/profile_pic/1296216486/original.jpg?aki_policy=profile_x_medium,Ottakring,1.0,1.0,"['email', 'phone']",t,t,"Vienna, Austria",Ottakring,,48.22207,16.31594,Entire rental unit,Entire home/apt,4.0,,1 bath,2.0,2.0,"[""Dishes and silverware"", ""Cooking basics"", ""Shampoo"", ""Long term stays allowed"", ""Cable TV"", ""Wifi"", ""Elevator"", ""High chair"", ""Hangers"", ""Stove"", ""Hot water"", ""Children\u2019s books and toys"", ""TV with standard cable"", ""Iron"", ""Washer"", ""Essentials"", ""Dishwasher"", ""Patio or balcony"", ""Oven"", ""Single level home"", ""Coffee maker"", ""Hair dryer"", ""Bed linens"", ""Bathtub"", ""Microwave"", ""Paid parking off premises"", ""Refrigerator"", ""Carbon monoxide alarm"", ""Kitchen"", ""Room-darkening shades"", ""Extra pillows and blankets"", ""Pack \u2019n play/Travel crib"", ""Smoke alarm"", ""Heating""]",$77.00,3.0,60.0,3.0,3.0,1125.0,1125.0,3.0,1125.0,,t,0.0,0.0,0.0,0.0,2022-09-12T00:00:00.000+0000,50.0,0.0,0.0,2011-08-17T00:00:00.000+0000,2019-01-02T00:00:00.000+0000,4.87,4.94,4.71,4.94,4.96,4.4,4.73,,t,1.0,1.0,0.0,0.0,0.37
397617,https://www.airbnb.com/rooms/397617,20220911230927,2022-09-12T00:00:00.000+0000,previous scrape,bright+colourful loft - near Hauptbahnhof,"Quiet and colourful 2 bedroom loft in the authentic and non touristy district 'Favoriten' in Vienna. 15 min to city center. 35 min to ✈. I can host up to 6 people. Real apartment with everything you need. The space Hallo! I’m renting my place to people who are searching for a private apartment which is affordable yet also has a good atmosphere to spend some time with friends and family. The flat is situated in the 10th district of Vienna, Favoriten, and is 15 min away from the city center (STEPHANSPLATZ). The apartment: ---------------------- There are two bedrooms with windows heading towards the court yard so it’s very quiet and peaceful – you hear no neighbours or traffic noise at all. There is the masterbed room with a double bed. The guest room is furnished with a double bed. Two people can sleep in the spacious living room on two matresses. The kitchen-cum-living room with access to balkony","THE LOCATION: ------------------ The flat is located in a safe and friendly neighborhood. To go to CITY CENTER it takes about 15 min (2 stations with tram 6 , 5 steps with metro U1). To go to AIRPORT Schwechat it takes about 35 min. And to train station WIEN WESTBAHNHOF 25 min (just take the tram 6). To WIEN HAUPTBAHNHOF 15 min. THE NEIGHBORHOOD: -------------------------- Right next to the apartment is a supermarket (BILLA) which has open from mo-fr until 7:30 pm and on sa until 6 pm. It's not a discounter, but they also have cheaper foods from their own label called 'clever'. So you have a good mixture of quality and price. There are also some very nice turkish grocery stores in the neighborhood which also have open longer than the ordinary super markets. 
Also very close by are two other supermarkets (another billa and an Interspar), a bakery (Anker), a drug store (BIPA), a pharmacy and a tobacconist's, a post office, a bank and a secon",https://a0.muscache.com/pictures/52a4f28a-f98a-4d86-a47b-35b1f0bdd603.jpg,1986417.0,https://www.airbnb.com/users/show/1986417,Mave,2012-03-22T00:00:00.000+0000,"Vienna, Austria",,within an hour,100%,100%,t,https://a0.muscache.com/im/pictures/user/bbc72bfc-e5ec-4d79-ad85-7e3a83b4d4c6.jpg?aki_policy=profile_small,https://a0.muscache.com/im/pictures/user/bbc72bfc-e5ec-4d79-ad85-7e3a83b4d4c6.jpg?aki_policy=profile_x_medium,Favoriten,1.0,2.0,"['email', 'phone']",t,t,"Vienna, Austria",Favoriten,,48.17437,16.39339,Entire condo,Entire home/apt,4.0,,1 bath,1.0,2.0,"[""Dishes and silverware"", ""Cooking basics"", ""Toaster"", ""Cleaning products"", ""Shower gel"", ""Baking sheet"", ""Fast wifi \u2013 197 Mbps"", ""Elevator"", ""Hangers"", ""Clothing storage"", ""Stove"", ""Hot water"", ""Outdoor dining area"", ""Drying rack for clothing"", ""Freezer"", ""Iron"", ""Essentials"", ""Sound system"", ""Dishwasher"", ""Dining table"", ""Patio or balcony"", ""Oven"", ""Body soap"", ""Paid dryer \u2013 In unit"", ""Paid washer \u2013 In unit"", ""Coffee maker"", ""Hair dryer"", ""Bed linens"", ""Bathtub"", ""Safe"", ""Wine glasses"", ""Microwave"", ""Paid parking off premises"", ""Refrigerator"", ""Kitchen"", ""Luggage dropoff allowed"", ""HDTV with Amazon Prime Video, Netflix"", ""Heating""]",$87.00,5.0,10.0,5.0,5.0,10.0,10.0,5.0,10.0,,t,0.0,0.0,0.0,0.0,2022-09-12T00:00:00.000+0000,178.0,1.0,0.0,2012-05-05T00:00:00.000+0000,2022-05-30T00:00:00.000+0000,4.77,4.87,4.67,4.88,4.87,3.98,4.66,,f,1.0,1.0,0.0,0.0,1.41


In [0]:
rawDF.columns

Out[6]: ['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'source',
 'name',
 'description',
 'neighborhood_overview',
 'picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bathrooms_text',
 'bedrooms',
 'beds',
 'amenities',
 'price',
 'minimum_nights',
 'maximum_nights',
 'minimum_minimum_nights',
 'maximum_minimum_nights',
 'minimum_maximum_nights',
 'maximum_maximum_nights',
 'minimum_nights_avg_ntm',
 'maximum_nights_avg_ntm',
 'calendar_updated',
 'has_availability',
 'availab

For the sake of simplicity, only keep certain columns from this dataset. We will talk about feature selection later.

In [0]:
columnsToKeep = [
  "host_is_superhost",
  #"cancellation_policy",
  "instant_bookable",
  "host_total_listings_count",
  "neighbourhood_cleansed",
  "latitude",
  "longitude",
  "property_type",
  "room_type",
  "accommodates",
  #"bathrooms",
  "bedrooms",
  "beds",
  #"bed_type",
  "minimum_nights",
  "number_of_reviews",
  "review_scores_rating",
  "review_scores_accuracy",
  "review_scores_cleanliness",
  "review_scores_checkin",
  "review_scores_communication",
  "review_scores_location",
  "review_scores_value",
  "price"]

baseDF = rawDF.select(columnsToKeep)
baseDF.cache().count()
baseDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
f,f,6,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3,1.0,2,1,14,4.71,4.86,4.93,4.93,4.86,4.71,4.5,$110.00
t,t,3,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5,1.0,3,5,350,4.75,4.8,4.65,4.91,4.93,4.75,4.69,$69.00
t,t,19,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6,2.0,4,1,181,4.83,4.9,4.88,4.89,4.93,4.59,4.7,$145.00
f,f,6,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2,1.0,1,2,100,4.64,4.73,4.55,4.8,4.91,4.89,4.59,$100.00
t,f,3,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3,,2,5,347,4.65,4.77,4.51,4.93,4.95,4.86,4.58,$68.00
f,f,6,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2,1.0,1,3,52,4.63,4.67,4.35,4.69,4.75,4.88,4.56,$99.00
t,f,4,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2,1.0,2,2,117,4.77,4.74,4.68,4.8,4.75,4.81,4.71,$50.00
f,f,6,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4,2.0,1,3,69,4.58,4.8,4.76,4.83,4.92,4.85,4.73,$140.00
f,t,1,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4,2.0,2,3,50,4.87,4.94,4.71,4.94,4.96,4.4,4.73,$77.00
t,f,2,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4,1.0,2,5,178,4.77,4.87,4.67,4.88,4.87,3.98,4.66,$87.00


### Fixing Data Types

Take a look at the output above. You'll notice that the `price` field was read as a string (for example, `$110.00`).  
For our task, we need it to be a numeric (double type) field. 

Let's fix that.

In [0]:
from pyspark.sql.functions import col, translate

fixedPriceDF = baseDF.withColumn("price", translate(col("price"), "$,", "").cast("double"))

fixedPriceDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
f,f,6,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3,1.0,2,1,14,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0
t,t,3,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5,1.0,3,5,350,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0
t,t,19,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6,2.0,4,1,181,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0
f,f,6,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2,1.0,1,2,100,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0
t,f,3,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3,,2,5,347,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0
f,f,6,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2,1.0,1,3,52,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0
t,f,4,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2,1.0,2,2,117,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0
f,f,6,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4,2.0,1,3,69,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0
f,t,1,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4,2.0,2,3,50,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0
t,f,2,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4,1.0,2,5,178,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0
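
To make the cleaning step above more concrete, here is a minimal plain-Python sketch of what `translate(col("price"), "$,", "")` followed by a cast does to a single value. It is an analogue, not Spark code: `parse_price` and the sample strings are hypothetical names and values chosen for illustration. Because the replacement string is empty, every `$` and `,` is simply deleted before the conversion.

```python
def parse_price(raw):
    """Strip '$' and ',' from a price string and convert it to a float."""
    if raw is None:
        return None  # keep missing values missing
    # Third argument of str.maketrans lists characters to delete.
    cleaned = raw.translate(str.maketrans("", "", "$,"))
    return float(cleaned)

# Hypothetical sample values, including a thousands separator and a null.
samples = ["$110.00", "$1,250.00", None]
parsed = [parse_price(s) for s in samples]
print(parsed)
```

One difference worth keeping in mind: Python's `float()` raises an error on a string it cannot parse, whereas a Spark cast to double yields null in that case.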


### Summary statistics

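Two options:
* `describe()`: count, mean, standard deviation, min, and max
* `summary()`: everything in `describe()` plus the 25%, 50% (median), and 75% quartiles, from which the IQR can be computed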

**Question:** When to use IQR/median over mean? Vice versa?
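
As a hint for the question above, the following plain-Python sketch shows how a single extreme value pulls the mean while barely moving the median. The price list is made up for the example and is not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical nightly prices; the appended value is a deliberate outlier.
prices = [47.0, 60.0, 71.0, 85.0, 104.0]
prices_with_outlier = prices + [9270.0]

mean_clean, median_clean = mean(prices), median(prices)
mean_out, median_out = mean(prices_with_outlier), median(prices_with_outlier)

print(f"without outlier: mean={mean_clean:.1f}, median={median_clean:.1f}")
print(f"with outlier:    mean={mean_out:.1f}, median={median_out:.1f}")
```

The median and IQR are therefore usually the safer summary for skewed columns, while the mean is informative when the distribution is roughly symmetric and free of extreme values.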

In [0]:
display(fixedPriceDF.describe())

summary,host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,11795,11797,11795.0,11797,11797.0,11797.0,11797,11797,11797.0,10648.0,11649.0,11797.0,11797.0,9869.0,9774.0,9774.0,9774.0,9775.0,9773.0,9772.0,11797.0
mean,,,21.95158965663417,,48.20483212884346,16.361390798461894,,,3.3154191743663644,1.3684259954921112,1.8950124474203796,6.818004577434941,35.61159616851742,4.67334279055627,4.7868262737875815,4.692784939635757,4.833127685696722,4.810543222506388,4.722745318735274,4.679234547687248,95.07883360176317
stddev,,,52.91740195195649,,0.0213999840808346,0.0364909101528313,,,1.8156688982294795,0.8899037562441913,1.3422869374535251,28.42904205203276,66.26044025022497,0.6014365842466193,0.3870716872209593,0.4513291980541244,0.3583794383333711,0.398030373690555,0.3672015626706204,0.4013966348394232,194.3411495925487
min,f,f,1.0,Alsergrund,48.10857,16.16986,Camper/RV,Entire home/apt,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,t,t,1339.0,Whring,48.32643,16.55566859090834,Tiny home,Shared room,16.0,19.0,35.0,1125.0,844.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,9270.0


In [0]:
display(fixedPriceDF.summary())

summary,host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,11795,11797,11795.0,11797,11797.0,11797.0,11797,11797,11797.0,10648.0,11649.0,11797.0,11797.0,9869.0,9774.0,9774.0,9774.0,9775.0,9773.0,9772.0,11797.0
mean,,,21.95158965663417,,48.20483212884346,16.361390798461894,,,3.3154191743663644,1.3684259954921112,1.8950124474203796,6.818004577434941,35.61159616851742,4.67334279055627,4.7868262737875815,4.692784939635757,4.833127685696722,4.810543222506388,4.722745318735274,4.679234547687248,95.07883360176317
stddev,,,52.91740195195649,,0.0213999840808346,0.0364909101528313,,,1.8156688982294795,0.8899037562441913,1.3422869374535251,28.42904205203276,66.26044025022497,0.6014365842466193,0.3870716872209593,0.4513291980541244,0.3583794383333711,0.398030373690555,0.3672015626706204,0.4013966348394232,194.3411495925487
min,f,f,1.0,Alsergrund,48.10857,16.16986,Camper/RV,Entire home/apt,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,,1.0,,48.19107,16.33922,,,2.0,1.0,1.0,1.0,2.0,4.61,4.75,4.59,4.81,4.79,4.62,4.59,47.0
50%,,,3.0,,48.20439,16.35991,,,3.0,1.0,2.0,2.0,9.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,71.0
75%,,,13.0,,48.21854,16.38211,,,4.0,2.0,2.0,3.0,39.0,5.0,5.0,5.0,5.0,5.0,4.98,4.91,104.0
max,t,t,1339.0,Whring,48.32643,16.55566859090834,Tiny home,Shared room,16.0,19.0,35.0,1125.0,844.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,9270.0


### Getting rid of extreme values

Let's take a look at the *min* and *max* values of the `price` column:

In [0]:
display(fixedPriceDF.select("price").describe())

summary,price
count,11797.0
mean,95.07883360176317
stddev,194.3411495925487
min,0.0
max,9270.0


There are some super-expensive listings. But it's the data scientist's job to decide what to do with them. We can certainly filter the "free" Airbnbs though.
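
One common convention for flagging extreme values is Tukey's rule, which marks anything further than 1.5 × IQR from the quartiles. The sketch below applies it to the `price` quartiles reported by `summary()` earlier (25% = 47.0, 75% = 104.0). The rule and the 1.5 multiplier are assumptions for illustration; this notebook does not prescribe a cut-off, and the quartiles Spark reports may be approximate.

```python
# Quartiles of `price` as shown in the summary() output.
q1, q3 = 47.0, 104.0

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this would be flagged
upper_fence = q3 + 1.5 * iqr     # values above this would be flagged

print(f"IQR={iqr}, lower fence={lower_fence}, upper fence={upper_fence}")
```

Under this rule the lower fence is negative, so it cannot catch the zero-priced listings; those need an explicit filter.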

Let's see first how many listings we can find where the *price* is zero.

In [0]:
fixedPriceDF.filter(col("price") == 0).count()

Out[12]: 2

Now only keep rows with a strictly positive *price*.

In [0]:
posPricesDF = fixedPriceDF.filter(col("price") > 0)

Let's take a look at the *min* and *max* values of the *minimum_nights* column:

In [0]:
display(posPricesDF.select("minimum_nights").describe())

summary,minimum_nights
count,11795.0
mean,6.818991097922849
stddev,28.43135145333913
min,1.0
max,1125.0


In [0]:
display(posPricesDF
  .groupBy("minimum_nights").count()
  .orderBy(col("count").desc(), col("minimum_nights"))
)

minimum_nights,count
1,4004
2,3294
3,1788
4,542
30,446
5,418
7,308
14,158
6,125
28,123


A minimum stay of one year seems to be a reasonable limit here. Let's filter out those records where *minimum_nights* is greater than 365:

In [0]:
minNightsDF = posPricesDF.filter(col("minimum_nights") <= 365)

display(minNightsDF)

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
f,f,6,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3,1.0,2.0,1,14,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0
t,t,3,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5,1.0,3.0,5,350,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0
t,t,19,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6,2.0,4.0,1,181,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0
f,f,6,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2,1.0,1.0,2,100,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0
t,f,3,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3,,2.0,5,347,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0
f,f,6,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2,1.0,1.0,3,52,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0
t,f,4,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2,1.0,2.0,2,117,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0
f,f,6,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4,2.0,1.0,3,69,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0
f,t,1,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4,2.0,2.0,3,50,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0
t,f,2,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4,1.0,2.0,5,178,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0


### Nulls

There are a lot of different ways to handle null values. Sometimes, a null can actually be a key indicator of the thing you are trying to predict (e.g. if you don't fill in certain portions of a form, the probability of it being approved decreases).

Some ways to handle nulls:
* Drop any records that contain nulls
* Numeric:
  * Replace them with mean/median/zero/etc.
* Categorical:
  * Replace them with the mode
  * Create a special category for null
* Use techniques like ALS which are designed to impute missing values
  
**If you do ANY imputation techniques for categorical/numerical features, you MUST include an additional field specifying that field was imputed.**

SparkML's Imputer (covered below) does not support imputation for categorical features.
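
Since categorical columns have to be handled separately, here is a minimal plain-Python sketch of the two categorical options listed above (mode and a special category), together with the indicator field. The function name `impute_categorical`, the label `"missing"`, and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter

def impute_categorical(values, strategy="mode", missing_label="missing"):
    """Return (imputed_values, was_imputed_flags) for a list with None gaps."""
    observed = [v for v in values if v is not None]
    if strategy == "mode":
        # Most frequent observed value (assumes at least one non-null value).
        fill = Counter(observed).most_common(1)[0][0]
    elif strategy == "category":
        fill = missing_label
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    imputed = [fill if v is None else v for v in values]
    flags = [1.0 if v is None else 0.0 for v in values]
    return imputed, flags

# Hypothetical column with two missing entries.
room_types = ["Entire home/apt", None, "Private room", "Entire home/apt", None]
by_mode, flags = impute_categorical(room_types, "mode")
by_category, _ = impute_categorical(room_types, "category")
print(by_mode, flags)
print(by_category)
```

The flags are computed from the original values, so they record which rows were filled regardless of the strategy chosen.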

### Impute: Cast to Double

SparkML's `Imputer` ([Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Imputer)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.feature.Imputer)) requires all fields to be of type double. Let's cast all integer fields to double.

In [0]:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

integerColumns = [x.name for x in minNightsDF.schema.fields if x.dataType == IntegerType()]
doublesDF = minNightsDF

for c in integerColumns:
  doublesDF = doublesDF.withColumn(c, col(c).cast("double"))

columns = "\n - ".join(integerColumns)
print(f"Columns converted from Integer to Double:\n - {columns}")

Columns converted from Integer to Double:
 - host_total_listings_count
 - accommodates
 - bedrooms
 - beds
 - minimum_nights
 - number_of_reviews


Add a dummy (indicator) variable for every column in which we will impute values.

In [0]:
from pyspark.sql.functions import when

imputeCols = [
  "bedrooms",
  #"bathrooms",
  "beds", 
  "review_scores_rating",
  "review_scores_accuracy",
  "review_scores_cleanliness",
  "review_scores_checkin",
  "review_scores_communication",
  "review_scores_location",
  "review_scores_value"
]

for c in imputeCols:
  doublesDF = doublesDF.withColumn(c + "_na", when(col(c).isNull(), 1.0).otherwise(0.0))

In [0]:
display(doublesDF.describe())

summary,host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
count,11789,11789,11787.0,11789,11789.0,11789.0,11789,11789,11789.0,10644.0,11643.0,11789.0,11789.0,9862.0,9767.0,9767.0,9767.0,9768.0,9766.0,9765.0,11789.0,11789.0,11789.0,11789.0,11789.0,11789.0,11789.0,11789.0,11789.0,11789.0
mean,,,21.964197845083568,,48.20483212494854,16.361392849621012,,,3.3160573415896173,1.368564449455092,1.8948724555526928,6.377724997879379,35.601408092289425,4.673328939363195,4.786805569775756,4.6927224326814665,4.833116617180275,4.81047092547092,4.722740118779423,4.679256528417796,95.11866994656036,0.0971244380354567,0.0123844261599796,0.1634574603443888,0.1715158198320468,0.1715158198320468,0.1715158198320468,0.1714309949953346,0.171600644668759,0.1716854695054712
stddev,,,52.93311232817838,,0.0214056757104265,0.0365004428076598,,,1.8158889072156743,0.8900422995620187,1.3424639943607417,19.887887736108137,66.26414096165404,0.6016279347580344,0.3871977141090647,0.4514627182426419,0.3584926360722987,0.39814976421419,0.367200041389517,0.4014807958219126,194.39992348113,0.2961396977803014,0.1105987781744533,0.369798213705189,0.3769750626417025,0.3769750626417025,0.3769750626417025,0.3769011258891479,0.3770489658110705,0.3771228354169931
min,f,f,1.0,Alsergrund,48.10857,16.16986,Camper/RV,Entire home/apt,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,t,t,1339.0,Whring,48.32643,16.55566859090834,Tiny home,Shared room,16.0,19.0,35.0,365.0,844.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,9270.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
doublesDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,6.0,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3.0,1.0,2.0,1.0,14.0,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,3.0,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5.0,1.0,3.0,5.0,350.0,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,19.0,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6.0,2.0,4.0,1.0,181.0,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,100.0,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,3.0,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3.0,,2.0,5.0,347.0,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2.0,1.0,1.0,3.0,52.0,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,4.0,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,117.0,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4.0,2.0,1.0,3.0,69.0,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,t,1.0,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4.0,2.0,2.0,3.0,50.0,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,2.0,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4.0,1.0,2.0,5.0,178.0,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Transformers and Estimators

**Transformer**: Accepts a DataFrame as input, and returns a new DataFrame with one or more columns appended to it. Transformers do not learn any parameters from your
data and simply apply rule-based transformations to either prepare data for model training or generate predictions using a trained MLlib model. They have a `.transform()` method.

**Estimator**: Learns (or "fits") parameters from your DataFrame via a `.fit()` method and returns a Model, which is a transformer.
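
The distinction can be sketched in plain Python with a toy median imputer. This is not the SparkML implementation: the classes `MedianImputer` and `MedianImputerModel` and the sample list are hypothetical, and they only mirror the `.fit()` / `.transform()` contract described above.

```python
from statistics import median

class MedianImputerModel:
    """Transformer: applies a fill value that was learned earlier."""
    def __init__(self, fill_value):
        self.fill_value = fill_value

    def transform(self, values):
        # Rule-based step: replace None with the stored value, learn nothing.
        return [self.fill_value if v is None else v for v in values]

class MedianImputer:
    """Estimator: learns the fill value from data and returns a model."""
    def fit(self, values):
        observed = [v for v in values if v is not None]
        return MedianImputerModel(median(observed))

# Hypothetical bedroom counts with one missing entry.
bedrooms = [1.0, 1.0, 2.0, None, 1.0, 2.0]
model = MedianImputer().fit(bedrooms)        # learning happens here
imputed = model.transform(bedrooms)          # learned value is applied here
unseen = model.transform([None, 3.0])        # the same model works on new data
print(model.fill_value, imputed, unseen)
```

Separating the two steps means the value learned from the training data can be reused unchanged on data the model has not seen.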

In [0]:
from pyspark.ml.feature import Imputer

imputer = Imputer(strategy="median", inputCols=imputeCols, outputCols=imputeCols)

imputerModel = imputer.fit(doublesDF)
imputedDF = imputerModel.transform(doublesDF)

In [0]:
imputedDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,6.0,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3.0,1.0,2.0,1.0,14.0,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,3.0,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5.0,1.0,3.0,5.0,350.0,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,19.0,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6.0,2.0,4.0,1.0,181.0,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,100.0,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,3.0,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3.0,1.0,2.0,5.0,347.0,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2.0,1.0,1.0,3.0,52.0,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,4.0,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,117.0,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4.0,2.0,1.0,3.0,69.0,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,t,1.0,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4.0,2.0,2.0,3.0,50.0,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,2.0,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4.0,1.0,2.0,5.0,178.0,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


OK, our data is cleansed now. Let's save this DataFrame as a Delta table so that we can start building models with it.

In [0]:
deltaPath = os.path.join("/", "tmp", username)
# Local directory creation is only relevant when writing to the driver's root folder rather than to DBFS
os.makedirs(deltaPath, exist_ok=True)
    
print(deltaPath)

/tmp/renato


In [0]:
# Converting Spark DataFrame to Delta Table
dbutils.fs.rm(deltaPath, True)
imputedDF.write.format("delta").mode("overwrite").save(deltaPath)

### We are going to use this Delta table in the next notebooks!

Code modified and enhanced from 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>