# Scraping Airbnb data 

# Assignment 2

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions need to be code-based, i.e. hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in the dedicated cell and not to modify the test cells. When you have completed all the exercises, submit this same notebook to Moodle in .ipynb format.

<div class="alert alert-success">The aim of this assignment is to create and save a dataset containing information about different listings in Airbnb. You will then use this dataset during the Artificial Intelligence course to train a predictive model.</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, October 17, 23:55</div>

## Getting started

[Airbnb](https://www.airbnb.com/) allows people to rent out their properties on their online platform. Travelers can then book these properties for shorter or longer periods of time. The company was founded in August 2008 in San Francisco, California, and currently has an annual revenue stream of over 2.5 Billion US Dollars. In the US alone, the platform has 660,000 listings.

![airbnb](https://www.dropbox.com/s/njll910mmpzm86z/airbnb.png?raw=1)

Every individual listing contains a lot of information, such as the facilities offered, the location, information about the host, and reviews. In this assignment you will build a web scraper to extract information from these listings using Python.

![bali](https://www.dropbox.com/s/t1erpuf8zo9sdt3/search.png?dl=1)

In particular, we are going to take a look at the different listings available to spend 5 nights in Bali during these Christmas holidays, from December 29 until January 3. You can check the different options available at the following [link](https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=november&flexible_trip_dates%5B%5D=october&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2021-12-29&checkout=2022-01-03&source=structured_search_input_header&search_type=autocomplete_click).

The whole url has been copied for you below.

In [1]:
url = "https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=november&flexible_trip_dates%5B%5D=october&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2021-12-29&checkout=2022-01-03&source=structured_search_input_header&search_type=autocomplete_click"

Let's begin by making a request to retrieve the HTML code for this website. Since this is an action that you may need to perform several times throughout the assignment, let's encapsulate the corresponding code in a function.

<div class="alert alert-info"><b>Exercise 1 </b>Write the code to create a function called <i>get_page</i>. This function should take as input a single string containing the url for a given webpage and return its underlying HTML code as a <b>BeautifulSoup object</b> as output.<br><i>[0.25 points]</i></div>

In [2]:
# YOUR CODE HERE
import requests
import bs4

def get_page(url_input, max_retries=10):
    listing_class = '_1e541ba5'

    soup = bs4.BeautifulSoup(requests.get(url_input).text, 'html.parser')

    # Check whether the class corresponding to each listing is present
    # (to combat Airbnb's anti-scraping policy) and retry if it is not.
    # The number of retries is bounded so the loop cannot run forever.
    retries = 0
    while soup.find('div', {'class':listing_class}) is None and retries < max_retries:
        soup = bs4.BeautifulSoup(requests.get(url_input).text, 'html.parser')
        retries += 1

    return soup

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [3]:
# LEAVE BLANK

In [4]:
# LEAVE BLANK

In [5]:
# LEAVE BLANK

As you well know, the first step in extracting information from a webpage is to check how it is constructed. A brief look at the given webpage shows that the listings are displayed one below the other, in list form.

<img src="https://www.dropbox.com/s/197l7nr21fk5w48/listing.png?dl=1" width="700">

For every listing a preview image is shown together with some standard information, including a title, a subtitle, the number of guests allowed, the number of bedrooms and bathrooms, the number of beds, information about certain amenities, the price per night, the discounted price per night, the total price per stay, the average rating and the number of reviews.

<div class="alert alert-success">For the example shown above, the title would be "Veluvana Bali - Bamboo House", while the subtitle corresponds to "Treehouse in Sidemen". The number of guests allowed is 2, the number of bedrooms 1, the number of beds 1, and the number of baths 1.5. Among the amenities, it includes a pool, wifi, and a kitchen. The price per night is 175€, the discounted price per night is 144€ and the total price per stay is 720€. The listing has received 29 reviews and has an average rating of 4.90.</div>

Your primary goal will be to retrieve this information for each separate listing. Hence, we will start by identifying the tags that refer to each of these items and writing the code to retrieve them. 

## Retrieving the data

In what follows you will write functions to extract the information for one listing at a time. This means that the input to these functions should always be the code tag for a single listing. Hence, before moving on, make sure that you can extract these tags.

<div class="alert alert-info">Identify the code tag for each separate listing. Once you are done, use your <i>get_page</i> function above to extract the HTML code for the provided website and save the code tag corresponding to the first listing in a new variable called <i>first_listing</i></div>

<div class="alert alert-warning">If you inspect the HTML code for the website, you'll find that there are different options to choose from. All of them are valid choices, as long as you extract a block that contains all the information for each listing.</div>
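As a rough illustration of extracting one block per listing, the sketch below parses a tiny hand-written HTML snippet. The markup and the class names `listing-card` and `title` are made up for the example; the real page uses its own class names, which you need to find by inspecting it.

```python
import bs4

# Hand-written stand-in for a results page; "listing-card" is a made-up class
html = """
<div id="results">
  <div class="listing-card"><div class="title">Bamboo House</div></div>
  <div class="listing-card"><div class="title">Beach Villa</div></div>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# find_all returns one Tag per matching block, in document order
cards = soup.find_all('div', {'class': 'listing-card'})
first_card = cards[0]

# Each Tag can be searched again, which is what the exercises below rely on
first_title = first_card.find('div', {'class': 'title'}).text
print(len(cards), first_title)
```

Whichever block you pick on the real page, the same pattern applies: one `find_all` call to get the listing blocks, then further `find` calls inside a single block.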

In [6]:
# YOUR CODE HERE
#get HTML code of listing page using functions get_page
listings_page = get_page(url)

#listings_list = listings_page.find_all('div', {'itemprop':'itemListElement'})
listing_class = '_1e541ba5'
listings_list = listings_page.find_all('div', {'class':listing_class})

#get first listing from list
first_listing = listings_list[0]

<div class="alert alert-info"><b>Exercise 2 </b>Write the code to create a function called <i>get_listing_title</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>string</b> with its title as output. If no title is listed for the provided code tag, then the function should return a <b>None</b>.<br><i>[0.5 points]</i></div>

<div class="alert alert-warning">Note that your function should only output the title for the provided listing.</div>

In [7]:
# YOUR CODE HERE
#function that gets as input a tag object containing the code tag for a listing and returns the title in string form    
def get_listing_title(tag_object):
    title_class = '_5kaapu'
    try:
        title = tag_object.find('div', {'class':title_class}).text
    except AttributeError:
        # find() returned None: this listing has no title tag
        return None
    # Return None rather than an empty string when the title is blank
    return title if title else None

In [8]:
#for x in range(len(listings_list)):
#    print(x, get_listing_title(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [9]:
# LEAVE BLANK

In [10]:
# LEAVE BLANK

In [11]:
# LEAVE BLANK

In [12]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3 </b>Write the code to create a function called <i>get_listing_subtitle</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>string</b> with its subtitle as output. If no subtitle is listed, then the function should return a <b>None</b>.<br><i>[0.5 points]</i></div>

In [13]:
# YOUR CODE HERE
def get_listing_subtitle(tag_object):
    subtitle_class = '_1xzimiid'
    try:
        subtitle = tag_object.find('div', {'class':subtitle_class}).text
    except AttributeError:
        # find() returned None: this listing has no subtitle tag
        return None
    # Return None rather than an empty string when the subtitle is blank
    return subtitle if subtitle else None

In [14]:
#test function 
#for x,y in enumerate(listings_list):
#    print(get_listing_subtitle(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [15]:
# LEAVE BLANK

In [16]:
# LEAVE BLANK

In [17]:
# LEAVE BLANK

In [18]:
# LEAVE BLANK

As mentioned above, right below each listing's title there's a list of attributes that contains information about the number of guests allowed, the number of bedrooms, the number of beds and the number of bathrooms. Let's create a new function that retrieves this information for each separate listing.

<div class="alert alert-info"><b>Exercise 4 </b>Write the code to complete the function <i>get_listing_info</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>string</b> containing the detailed information as provided by the website. If no information is provided for one item, then the function should return a <i>None</i> for that item.<br><i>[0.5 points]</i></div>

<div class="alert alert-warning">For the example shown above, the function get_listing_info should return the string "2 guests · 1 bedroom · 1 bed · 1.5 baths"</div>

In [19]:
# YOUR CODE HERE
def get_listing_info(tag_object):
    try:
        # The first div with this class holds the guests/bedrooms/beds/baths line
        info_class = '_3c0zz1'
        information = tag_object.find_all('div', {'class':info_class})[0].text
    except (AttributeError, IndexError):
        return None

    parts = information.split(' · ')

    # Information is complete: return it exactly as shown on the website
    if len(parts) == 4:
        return information

    # Information is incomplete: place each part in its slot
    # (guests, bedrooms, beds, baths) and leave None where it is missing
    slots = [None, None, None, None]
    for part in parts:
        lowered = part.lower()
        if 'guest' in lowered:
            slots[0] = part
        elif 'bedroom' in lowered or 'studio' in lowered:
            slots[1] = part
        elif 'bath' in lowered:
            slots[3] = part
        elif 'bed' in lowered:
            # Checked after 'bedroom', so only "bed"/"beds" reach this branch
            slots[2] = part

    return ' · '.join(str(slot) for slot in slots)

In [20]:
#for x in range(len(listings_list)):
#    print(get_listing_info(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [21]:
# LEAVE BLANK

In [22]:
# LEAVE BLANK

In [23]:
# LEAVE BLANK

In [24]:
# LEAVE BLANK

Now that we have retrieved and formatted the basic information we can move on to the list of amenities.

<div class="alert alert-info"><b>Exercise 5 </b>Write the code to create a function called <i>get_listing_ammenities</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>string</b> with the listing's amenities as output, as shown on the website. If no information is provided, then the function should return a <i>None</i>.</div>

<div class="alert alert-warning">For the example shown above, the function get_listing_ammenities should return the string "Pool · Wifi · Air conditioning · Kitchen".</div>

In [25]:
# YOUR CODE HERE
def get_listing_ammenities(tag_object):
    ammenities_class = '_3c0zz1'
    try:
        # The second div with this class holds the amenities line
        ammenities = tag_object.find_all('div', {'class':ammenities_class})[1].text
    except (AttributeError, IndexError):
        return None
    # Return None rather than an empty string when nothing is listed
    return ammenities if ammenities else None

In [26]:
#for x in range(len(listings_list)):
#    print(get_listing_ammenities(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [27]:
# LEAVE BLANK

In [28]:
# LEAVE BLANK

In [29]:
# LEAVE BLANK

In [30]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 6 </b>Write the code to create a function called <i>get_listing_rating</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>float</b> with its average rating as output. If no rating is listed, then the function should return a <i>None</i>.<br><i>[0.75 points]</i></div>

In [31]:
# YOUR CODE HERE
def get_listing_rating(tag_object):
    rating_class = '_10fy1f8'

    try:
        avg_rating = tag_object.find('span', {'class':rating_class}).text
        return float(avg_rating)
    except (AttributeError, ValueError):
        # No rating tag, or its text is empty or not a number
        return None

In [32]:
#for x in range(len(listings_list)):
#    print(get_listing_rating(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [33]:
# LEAVE BLANK

In [34]:
# LEAVE BLANK

In [35]:
# LEAVE BLANK

In [36]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 7 </b>Write the code to create a function called <i>get_listing_reviews</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return an <b>int</b> with its number of reviews as output. If no reviews are included, then the function should return a <i>None</i>.<br><i>[1.25 points]</i></div>
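One possible way to isolate the review count is a regular expression. The sketch below assumes, purely for illustration, that the text looks like "4.90 (29 reviews)"; the helper name `parse_review_count` and the sample strings are hypothetical, so check the actual text on the page before relying on this pattern.

```python
import re

def parse_review_count(text):
    """Return the number right after the first opening parenthesis, or None."""
    # Matches e.g. "(29 reviews)", "(1 review)" or "(1,024 reviews)"
    match = re.search(r'\((\d[\d,]*)', text)
    if match is None:
        return None
    # Drop thousands separators before converting to int
    return int(match.group(1).replace(',', ''))

print(parse_review_count('4.90 (29 reviews)'))
print(parse_review_count('New'))
```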

In [37]:
# YOUR CODE HERE
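def get_listing_reviews(tag_object):
    reviews_class = '_a7a5sx'
    try:
        # Keep the number that follows the opening parenthesis
        reviews = tag_object.find('span', {'class':reviews_class}).text.split('(')[1].split(' ')[0]
        # Drop any thousands separator before converting
        return int(reviews.replace(',', ''))
    except (AttributeError, IndexError, ValueError):
        return None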

In [38]:
#for x in range(len(listings_list)):
#    print(get_listing_title(listings_list[x]), get_listing_reviews(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [39]:
# LEAVE BLANK

In [40]:
# LEAVE BLANK

In [41]:
# LEAVE BLANK

In [42]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 8 </b>Write the code to create a function called <i>get_listing_price_per_night</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>float</b> with its corresponding price per night. If no price is listed, then the function should return a <i>None</i>.<br><i>[1.75 points]</i></div>

<div class="alert alert-warning">Note that for some listings, two different prices are available: the regular price and the discounted price. When this is the case, your code should return the <b>discounted</b> price only.</div>

<div class="alert alert-warning">Note that depending on the chosen language for the website, the price may be listed in $ or €. Make sure your code is able to retrieve both in the right format, i.e. you don't need to convert the currencies, but make sure the output is a float in both cases.</div>
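A minimal sketch of the string-to-float step is shown below. It assumes the price text is just a number with a currency sign before or after it, and that a comma is a thousands separator (as on the English-language site); the helper name `parse_price` is made up for this example.

```python
def parse_price(text):
    """Convert a displayed price such as "$1,740" or "175€" to a float."""
    cleaned = text.strip()
    # Remove currency signs and thousands separators; keep digits and the dot
    for symbol in ('$', '€', ','):
        cleaned = cleaned.replace(symbol, '')
    try:
        return float(cleaned)
    except ValueError:
        # Anything that is still not a plain number is treated as "no price"
        return None

print(parse_price('$1,740'))
print(parse_price('175€'))
```

Removing the symbols instead of splitting on them means the same code handles a currency sign on either side of the number.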

In [43]:
# YOUR CODE HERE
def get_listing_price_per_night(tag_object):
    price_class = '_tyxjp1'

    try:
        price = tag_object.find('span', {'class':price_class}).text
        # Remove the currency sign (before or after the number)
        # and any thousands separator, then turn into float
        for symbol in ('$', '€', ','):
            price = price.replace(symbol, '')
        return float(price)
    except (AttributeError, ValueError):
        return None

In [44]:
#for x in range(len(listings_list)):
#    print(get_listing_price_per_night(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [45]:
# LEAVE BLANK

In [46]:
# LEAVE BLANK

In [47]:
# LEAVE BLANK

In [48]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 9 </b>Write the code to create a function called <i>get_listing_total_price</i>. This function should take a <b>Tag object</b> containing the code tag for an individual listing as input and return a <b>float</b> with its total price. If no total price is listed, then the function should return a <i>None</i>.<br><i>[1 point]</i></div>

<div class="alert alert-warning">Note that depending on the chosen language for the website, the price may be listed in $ or €. Make sure your code is able to retrieve both in the right format, i.e. you don't need to convert the currencies, but make sure the output is a float in both cases.</div>

In [49]:
# YOUR CODE HERE
def get_listing_total_price(tag_object):
    total_price_class = '_tt122m'

    try:
        total_price = tag_object.find('div', {'class':total_price_class}).text
        # Keep only the first word, which holds the amount
        total_price = total_price.split(' ')[0]

        # Remove the currency sign (before or after the number) and the
        # comma used when the price is greater than 1000, then turn into float
        for symbol in ('$', '€', ','):
            total_price = total_price.replace(symbol, '')
        return float(total_price)
    except (AttributeError, ValueError):
        return None

In [50]:
#for x in range(len(listings_list)):
#    print(x, get_listing_total_price(listings_list[x]))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [51]:
# LEAVE BLANK

In [52]:
# LEAVE BLANK

In [53]:
# LEAVE BLANK

In [54]:
# LEAVE BLANK

You should now have the code ready to extract all the information for the listings displayed on the provided website. Before moving on, it might be a good idea to check that your functions provide the right output for all of them.

## Looking for additional data

If you take a closer look at the website, you'll see that there are additional pages of listings that you can visit. At the end of each page, there is a link that allows you to access the next page. The final thing left to do is to come up with a way to extract the data from all the different pages.

<div class="alert alert-info"><b>Exercise 10 </b>Write the code to create a function called <i>get_next_page</i>. This function should take a <b>BeautifulSoup object</b> containing the code for an individual page as input and return the <b>complete url</b> for the next page as output. If there are no more pages left, it should return a <i>None</i>. The <i>base_url</i> has already been defined for you.<br><i>[1.5 points]</i></div>

<div class="alert alert-warning">Note that this exercise requires you to check your code against different inputs. In particular, make sure that your function returns the right output once it reaches the last page.</div>
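The links in a page are often relative, so the complete url has to be built from the base url. The sketch below shows one way to do that with the standard library's `urljoin`; the helper name `build_next_url` and the sample `href` are hypothetical, chosen only to illustrate the idea.

```python
from urllib.parse import urljoin

base_url = "https://airbnb.com"

def build_next_url(href):
    """Return an absolute url for a relative href, or None if there is none."""
    if href is None:
        # No link found, e.g. on the last page
        return None
    return urljoin(base_url, href)

print(build_next_url('/s/Bali--Indonesia/homes?items_offset=20'))
print(build_next_url(None))
```

Plain string concatenation works as well when the `href` always starts with a slash; `urljoin` simply handles a few more cases, such as an `href` that is already absolute.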

In [55]:
url_first_page = 'https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=november&flexible_trip_dates%5B%5D=october&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2021-12-29&checkout=2022-01-03&source=structured_search_input_header&search_type=autocomplete_click'
url_last_page = 'https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=november&flexible_trip_dates%5B%5D=october&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2021-12-29&checkout=2022-01-03&source=structured_search_input_header&search_type=autocomplete_click&federated_search_session_id=78c8e5c9-3e06-4627-961e-941b7265ff36&pagination_search=true&items_offset=280&section_offset=2'

In [56]:
#get first page
first_page = get_page(url_first_page)
#print(first_page.prettify())

In [57]:
#get last page
last_page = get_page(url_last_page)

In [58]:
base_url = "https://airbnb.com"

# YOUR CODE HERE
def get_next_page(beautifulsoup_object):
    next_button_class = '_za9j7e'

    # The "next" link is missing on the last page
    next_button = beautifulsoup_object.find('a', {'class':next_button_class})
    if next_button is None:
        return None

    # The href is relative, so prepend the base_url defined above
    return base_url + next_button.get('href')

In [59]:
#print(get_next_page(first_page))

In [60]:
#print(get_next_page(last_page))

The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [61]:
# LEAVE BLANK

In [62]:
# LEAVE BLANK

In [63]:
# LEAVE BLANK

In [64]:
# LEAVE BLANK

## Saving the data

Great! You are done defining the required functions. Now, let's put it all together. Go ahead and retrieve the data for all the listings, for all the different pages. 

<div class="alert alert-info"><b>Exercise 11</b> Write the code to retrieve the data above for all the listings in all the different pages. Store this information in a DataFrame object called <b>airbnb</b>. Set the names of the columns to: <i>title</i>, <i>subtitle</i>, <i>info</i>, <i>ammenities</i>, <i>rating</i>, <i>reviews</i>, <i>price_per_night</i> and <i>total_price</i>. Don't define any index when defining your DataFrame.<br><i>[1.5 points]</i></div>

<div class="alert alert-warning">Make sure you retrieve the information for all the listings. Each page contains <b>20 different listings</b>.</div>

<div class="alert alert-warning">Once you retrieve the data, make sure you check again that your functions return the right output in all cases. Otherwise, you may need to revise some of your code.</div>
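The overall crawl can be pictured as a loop that processes one page, then follows the next-page link until there is none. The sketch below mimics this on a made-up in-memory "site" (`fake_site`, `crawl` and the page names are hypothetical) so that it runs without any network access.

```python
# Hypothetical stand-in for the website: url -> (listings on the page, next url)
fake_site = {
    'page-1': (['Bamboo House', 'Beach Villa'], 'page-2'),
    'page-2': (['Jungle Hut'], None),
}

def crawl(start_url, max_pages=50):
    """Collect one row per listing by following next-page links."""
    rows = []
    url = start_url
    pages_seen = 0
    # Stop at the last page, or at max_pages as a guard against endless loops
    while url is not None and pages_seen < max_pages:
        listings, next_url = fake_site[url]
        for title in listings:
            rows.append({'title': title})
        url = next_url
        pages_seen += 1
    return rows

rows = crawl('page-1')
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which is one alternative to keeping a separate list per column.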

In [65]:
# YOUR CODE HERE
#define starting url
url = "https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=november&flexible_trip_dates%5B%5D=october&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2021-12-29&checkout=2022-01-03&source=structured_search_input_header&search_type=autocomplete_click"

#define variables
title = []
subtitle = []
info = []
ammenities = []
rating = []
reviews = []
price_per_night = []
total_price = []


#follow the next-page links until there are no pages left
while url is not None:
    #get the page from the url
    page = get_page(url)

    #store html code for each listing in a list
    listing_class = '_1e541ba5'
    listings_list = page.find_all('div', {'class':listing_class})

    #retry while no listings are found, to make sure the page is retrieved
    while len(listings_list) == 0:
        page = get_page(url)
        listings_list = page.find_all('div', {'class':listing_class})

    #extract every field of each listing and append it to its list
    for listing in listings_list:
        title.append(get_listing_title(listing))
        subtitle.append(get_listing_subtitle(listing))
        info.append(get_listing_info(listing))
        ammenities.append(get_listing_ammenities(listing))
        rating.append(get_listing_rating(listing))
        reviews.append(get_listing_reviews(listing))
        price_per_night.append(get_listing_price_per_night(listing))
        total_price.append(get_listing_total_price(listing))

    #get the next url and assign it to the variable url
    url = get_next_page(page)
In [66]:
#len(title) == len(subtitle) == len(ammenities) == len(rating) == len(reviews) == len(price_per_night) == len(total_price)
#print(title, subtitle, ammenities, rating, price_per_night, total_price)

In [67]:
#define dataframe
import pandas as pd

dict_data = {'title':title,
             'subtitle':subtitle,
             'info':info,
             'ammenities':ammenities,
             'rating':rating,
             'reviews':reviews,
             'price_per_night':price_per_night,
             'total_price':total_price}
             

airbnb = pd.DataFrame(dict_data)

In [68]:
airbnb

| | title | subtitle | info | ammenities | rating | reviews | price_per_night | total_price |
|---|---|---|---|---|---|---|---|---|
| 0 | ❣️Romantic Staycation-PrivateSunset Pool@megan... | Entire villa in Ubud | 2 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi · Air conditioning | 4.94 | 243.0 | 92.0 | 458.0 |
| 1 | Cozy 2BR Villa with Panoramic View of Rice Fields | Entire villa in Kecamatan Ubud | 6 guests · 2 bedrooms · 2 beds · 2 baths | Pool · Wifi · Air conditioning · Kitchen | 4.93 | 41.0 | 348.0 | 1740.0 |
| 2 | Sparkling Gem - private pool - Mins to La Brisa | Entire villa in Mengwi | 3 guests · 1 bedroom · 2 beds · 1 bath | Pool · Wifi · Air conditioning | 4.85 | 98.0 | 69.0 | 342.0 |
| 3 | Bali Bamboo House \| Rescape Ubud - Relief Villa | Hut in Ubud | 2 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi | 4.72 | 39.0 | 71.0 | 354.0 |
| 4 | NEW Private Luxe Villa, Jungle View-Pool-Koi Pond | Entire villa in Kecamatan Ubud | 2 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi · Air conditioning · Kitchen | 5.00 | 16.0 | 71.0 | 355.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | One Bedroom Private Pool Villa - Monthly Discount | Entire villa in Kecamatan Kuta Selatan | 2 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi · Air conditioning · Kitchen | 4.40 | 5.0 | 86.0 | 427.0 |
| 296 | Walk to the Beach from this Gorgeous Designer ... | Entire villa in Seminyak | 4 guests · 2 bedrooms · 2 beds · 2.5 baths | Pool · Wifi · Air conditioning · Kitchen | 4.90 | 29.0 | 293.0 | 1464.0 |
| 297 | Villa Beda | Entire villa in Kuta Utara | 8 guests · 4 bedrooms · 4 beds · 4 baths | Pool · Wifi · Air conditioning · Kitchen | 4.38 | 71.0 | 182.0 | 907.0 |
| 298 | Cozy Room wth Pool - 10 min Walk to Ubud Market | Hotel room in Ubud | 2 guests · 1 bedroom · 1 bed · 1 private bath | Pool · Wifi · Air conditioning | 4.55 | 229.0 | 16.0 | 80.0 |


The following cell runs additional checks to your code. Please **don't write any code here**. Just leave it as it is.

In [69]:
# LEAVE BLANK

In [70]:
# LEAVE BLANK

In [71]:
# LEAVE BLANK

In [72]:
# LEAVE BLANK

## Bonus exercises

The functions *get_listing_info* and *get_listing_ammenities* you created above do the job, but retrieve only the raw information. In order to be able to use these data, we will need to separate the different items.

<div class="alert alert-danger"><b>Bonus 1 </b>Write the code to create a function called <i>get_listing_info_2</i>. This function should take a soup object containing the code for an individual listing as input and return <b>a tuple containing 4 separate items</b>: the number of guests, the number of bedrooms, the number of beds and the number of baths. In all cases, the output should be returned in <b>float</b> form. If no information is provided for one item, then the function should return a <i>None</i> for that item. If no information is provided for any of the items, then the function should return a <i>None</i> for all of them.<br><i>[1 point]</i></div>

<div class="alert alert-warning">For the example shown above, the function get_listing_info_2 should return (2.0, 1.0, 1.0, 1.5). If a listing contains the information "2 guests · 1 bedroom · 1.5 baths", then the function get_listing_info_2 should return (2.0, 1.0, None, 1.5). If a listing contains the information "3 guests · 1 private bedroom · 2 beds · 1 shared bath", the function get_listing_info_2 should return (3.0, 1.0, 2.0, 1.0). If a listing contains the information "4 guests · Studio · 2 beds · 1 private bath", then the function get_listing_info_2 should return (4.0, None, 2.0, 1.0)</div>

<div class="alert alert-warning">You'll find that, for some reason, certain listings have a very mysterious thing called <i>Half-bath</i>. Encode those cases using a 0.5 in <b>float</b> form.</div>

<div class="alert alert-warning">If necessary, your function can call other functions. Just make sure that all the code it needs to run properly is included below. </div>

In [73]:
# YOUR CODE HERE
def get_listing_info_2(tag_object):
    # Default: no information available for any of the items
    guests = bedrooms = beds = baths = None

    try:
        info_class = '_3c0zz1'
        information = tag_object.find_all('div', {'class':info_class})[0].text
    except (AttributeError, IndexError):
        return (guests, bedrooms, beds, baths)

    def leading_number(part):
        # "1.5 baths" -> 1.5; None when the part does not start with a number
        try:
            return float(part.split(' ')[0])
        except ValueError:
            return None

    # Assign each part to its item based on the keyword it contains
    for part in information.split(' · '):
        lowered = part.lower()
        if 'guest' in lowered:
            guests = leading_number(part)
        elif 'bedroom' in lowered:
            bedrooms = leading_number(part)
        elif 'half-bath' in lowered:
            # "Half-bath" is encoded as 0.5
            baths = 0.5
        elif 'bath' in lowered:
            baths = leading_number(part)
        elif 'bed' in lowered:
            beds = leading_number(part)
        # "Studio" carries no bedroom count, so bedrooms stays None

    return (guests, bedrooms, beds, baths)

In [74]:
#for x in range(len(listings_list)):
#    print(get_listing_info_2(listings_list[x]))

The following cells run additional checks on your code. Please **don't write any code here**. Just leave them as they are.

In [75]:
# LEAVE BLANK

In [76]:
# LEAVE BLANK

In [77]:
# LEAVE BLANK

In [78]:
# LEAVE BLANK

Apart from the general information, when searching for the best choice there might be some specific things we are looking for. I don't know about you, but I would definitely look for a place with a pool, a kitchen, wifi and some air conditioning.

<div class="alert alert-danger"><b>Bonus 2 </b>Write the code to create a function called <i>get_listing_ammenities_2</i>. This function should take a soup object containing the code for an individual listing as input and return <b>a tuple containing 4 booleans</b> indicating whether the listing includes a pool, a kitchen, wifi, and air conditioning (in this order). If no amenities are listed, the function should return False for all the entries.<br><i>[1 point]</i></div>

<div class="alert alert-warning">For the example shown above, the function get_listing_ammenities_2 should return (True, True, True, False). If a listing contains the information "Pool · Kitchen", then the function get_listing_ammenities_2 should return (True, True, False, False).</div>

<div class="alert alert-warning">If necessary, your function can call other functions. Just make sure that all the code it needs to run properly is included below. </div>
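
The expected behaviour can be tried out on plain strings before any HTML is involved. The following is a minimal sketch, assuming the amenities arrive as a single ' · '-separated string; the helper name `ammenities_flags` is hypothetical and it compares whole items rather than substrings.

```python
def ammenities_flags(ammenities_text):
    """Return (pool, kitchen, wifi, aircon) for a ' · '-separated string."""
    # normalise every listed item so the comparison ignores case and padding
    listed = {item.strip().lower() for item in ammenities_text.split(' · ')}
    wanted = ('pool', 'kitchen', 'wifi', 'air conditioning')
    return tuple(item in listed for item in wanted)


print(ammenities_flags('Pool · Kitchen'))  # (True, True, False, False)
print(ammenities_flags(''))                # (False, False, False, False)
```

Matching whole items means a hypothetical entry such as "Pool table" would not count as a pool, whereas a substring check would accept it. Which behaviour the checks expect is not stated, so treat this as one possible design.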

In [79]:
# YOUR CODE HERE
def get_listing_ammenities_2(tag_object):
    ammenities_class = '_3c0zz1'
    try:
        #the second div with this class holds the ammenities text
        ammenities = tag_object.find_all('div', {'class':ammenities_class})[1].text

        #check pool, kitchen, wifi and air conditioning (in this order);
        #an empty string gives False for every entry
        pool = 'Pool' in ammenities
        kitchen = 'Kitchen' in ammenities
        wifi = 'Wifi' in ammenities
        aircon = 'Air conditioning' in ammenities

        return (pool, kitchen, wifi, aircon)

    except Exception:
        #no ammenities found for this listing
        return (False, False, False, False)

In [80]:
#for x in range(len(listings_list)):
#    print(get_listing_ammenities_2(listings_list[x]))

The following cells run additional checks on your code. Please **don't write any code here**. Just leave them as they are.

In [81]:
# LEAVE BLANK

In [82]:
# LEAVE BLANK

In [83]:
# LEAVE BLANK

In [84]:
# LEAVE BLANK

Now that you have finished writing the code for the more advanced functions, you may want to use them to extract the corresponding data and update your DataFrame.

<div class="alert alert-info">Write the code to extract data regarding the number of guests, bedrooms, beds and bathrooms, as well as the data about whether the listings include a pool, a kitchen, wifi or air conditioning. Update your <i>airbnb</i> DataFrame to include this information and save your results to a new csv file.</div>
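
Both functions return one tuple per listing, while the DataFrame needs one list per column. A minimal sketch of that reshaping step is shown here, assuming made-up sample rows and a hypothetical helper `to_columns`; a row of `None` stands for a listing whose information could not be parsed.

```python
def to_columns(rows, width):
    """Turn a list of equal-length tuples into one list per column.

    A row that is None is replaced by a tuple of `width` None values,
    so every column keeps one entry per listing.
    """
    filled = [row if row is not None else (None,) * width for row in rows]
    if not filled:
        return [[] for _ in range(width)]
    return [list(column) for column in zip(*filled)]


# hypothetical results for three listings: (guests, bedrooms, beds, baths)
info_rows = [(2.0, 1.0, 1.0, 1.0), (6.0, 2.0, 2.0, 2.0), None]

guests_col, bedrooms_col, beds_col, baths_col = to_columns(info_rows, 4)
print(guests_col)    # [2.0, 6.0, None]
print(bedrooms_col)  # [1.0, 2.0, None]
```

Each resulting list can then be assigned to a DataFrame column, provided it has exactly one entry per row of the DataFrame.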

In [85]:
# YOUR CODE HERE
#define starting url
url = "https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_dates%5B%5D=november&flexible_trip_dates%5B%5D=october&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2021-12-29&checkout=2022-01-03&source=structured_search_input_header&search_type=autocomplete_click"

guests = []
bedrooms = []
beds = []
bathrooms = []

pool = []
kitchen = []
wifi = []
aircon = []


#while the url is not None
while url is not None:
    #get the page from the url
    page = get_page(url)

    #store html code for each listing in a list
    listing_class = '_1e541ba5'
    listings_list = page.find_all('div', {'class':listing_class})

    #reload the page until it contains at least one listing
    while len(listings_list) == 0:
        page = get_page(url)
        listings_list = page.find_all('div', {'class':listing_class})

    #for each listing on the page
    for listing in listings_list:

        #get information once; use None for all values if it could not be parsed,
        #so that every list keeps one entry per listing
        info = get_listing_info_2(listing)
        if info is None:
            info = (None, None, None, None)
        guests.append(info[0])
        bedrooms.append(info[1])
        beds.append(info[2])
        bathrooms.append(info[3])

        #get ammenities once, with the same fallback
        ammenities = get_listing_ammenities_2(listing)
        if ammenities is None:
            ammenities = (None, None, None, None)
        pool.append(ammenities[0])
        kitchen.append(ammenities[1])
        wifi.append(ammenities[2])
        aircon.append(ammenities[3])

    #get the next url and assign it to the variable url
    url = get_next_page(page)
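
The inner loop above reloads the page until at least one listing is found, which never ends if a page stays empty. A bounded variant could look like the following sketch. The function `fetch_listings`, its parameters, and the `fetch` callable are all hypothetical names; in this notebook `fetch` would wrap the page download and the `find_all` call.

```python
import time


def fetch_listings(fetch, url, max_attempts=5, delay=1.0):
    """Call fetch(url) until it returns a non-empty list or attempts run out.

    Returns the listings, or an empty list if every attempt came back empty.
    """
    for attempt in range(max_attempts):
        listings = fetch(url)
        if listings:
            return listings
        # wait before retrying, but not after the final attempt
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return []


# stand-in for a page that only yields listings on the third request
calls = []

def fake_fetch(url):
    calls.append(url)
    return [] if len(calls) < 3 else ['listing a', 'listing b']

result = fetch_listings(fake_fetch, 'https://example.com', delay=0)
print(result, len(calls))
```

Returning an empty list instead of looping forever lets the caller decide whether to skip the page or stop scraping.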

In [86]:
#len(guests) == len(bedrooms) == len(beds) == len(bathrooms)
#len(pool) == len(kitchen) == len(wifi) == len(aircon)

In [87]:
airbnb.drop('info', inplace=True, axis=1)
airbnb.drop('ammenities', inplace=True, axis=1)

In [88]:
airbnb['guests'] = guests
airbnb['bedrooms'] = bedrooms
airbnb['beds'] = beds
airbnb['bathrooms'] = bathrooms

airbnb['pool'] = pool
airbnb['kitchen'] = kitchen
airbnb['wifi'] = wifi
airbnb['airconditioning'] = aircon

In [89]:
airbnb

Unnamed: 0,title,subtitle,rating,reviews,price_per_night,total_price,guests,bedrooms,beds,bathrooms,pool,kitchen,wifi,airconditioning
0,❣️Romantic Staycation-PrivateSunset Pool@megan...,Entire villa in Ubud,4.94,243.0,92.0,458.0,2.0,1.0,1.0,1.0,True,False,True,True
1,Cozy 2BR Villa with Panoramic View of Rice Fields,Entire villa in Kecamatan Ubud,4.93,41.0,348.0,1740.0,6.0,2.0,2.0,2.0,True,True,True,True
2,Sparkling Gem - private pool - Mins to La Brisa,Entire villa in Mengwi,4.85,98.0,69.0,342.0,2.0,1.0,1.0,1.0,True,False,True,False
3,Bali Bamboo House | Rescape Ubud - Relief Villa,Hut in Ubud,4.72,39.0,71.0,354.0,3.0,1.0,2.0,1.0,True,False,True,True
4,"NEW Private Luxe Villa, Jungle View-Pool-Koi Pond",Entire villa in Kecamatan Ubud,5.00,16.0,71.0,355.0,6.0,3.0,3.0,3.0,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,One Bedroom Private Pool Villa - Monthly Discount,Entire villa in Kecamatan Kuta Selatan,4.40,5.0,86.0,427.0,2.0,1.0,1.0,1.0,True,False,True,True
296,Walk to the Beach from this Gorgeous Designer ...,Entire villa in Seminyak,4.90,29.0,293.0,1464.0,4.0,3.0,3.0,3.0,True,True,True,True
297,Villa Beda,Entire villa in Kuta Utara,4.38,71.0,182.0,907.0,2.0,1.0,1.0,1.0,True,False,True,True
298,Cozy Room wth Pool - 10 min Walk to Ubud Market,Hotel room in Ubud,4.55,229.0,16.0,80.0,4.0,2.0,2.0,2.0,True,True,True,True
