# Webscrape AirBnB Search

AirBnB is a popular short-term rental listing site. It hosts hundreds of thousands of listings that users can rent from homeowners.

This first notebook, part of a larger project, is focused on scraping data from the search page of AirBnB. In broad brush strokes, the notebook accomplishes the following.

* Create roughly 40,000 unique links, each a box-by-box map search of the western United States, from California to the Colorado-Kansas border.
* Request the first page of search results. If there are multiple pages, click through each page of the results.
* On each page, get the listings on the page.
* Format the data into a structured DataFrame.

This won't handle all of the data cleaning -- right now, we're primarily concerned with getting the data.

_(An example screenshot of the search page that is scraped.)_


![Image of AirBnB search](https://lh4.googleusercontent.com/7FeLyHckAoGePixHRazECphf4gNZbem5huowTH-VpFTiQfcaEos8_1aKWQOZQ0Zq3j8=w2400)


# Import Packages

In [1]:
# For webscraping.
import requests
from bs4 import BeautifulSoup
import re
import random
import time
from requests_html import HTMLSession
import json

# For standard data manipulation.
import numpy as np
import pandas as pd
from datetime import datetime

# For progress tracking.
from tqdm import tqdm

# Show full cell contents instead of truncating long values (such as URLs).
pd.set_option('display.max_colwidth', None)

# Set headers for webscraping

This will set the headers we'll use for scraping. This will help our requests look more "human" to AirBnB.

In [2]:
headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'OptanonAlertBoxClosed=NR; OptanonAlertBoxClosed=NR; bev=1671467614_NDg3YWI1OWViODE0; country=US; everest_cookie=1671467934.o8UVAqq76D3QPRXxo5SV.2TXJ26qIr6wRp2aGPSqpNu0hmcaL7JhFzmDBWbnzmDw; _csrf_token=V4%24.airbnb.com%24Dwa4m1DnRds%24l_ThNsfxVtYzJnV3gf7vehcjPJusOG_QBcYxaP2lZOM%3D; flags=0; _gcl_au=1.1.1979572412.1671467936; _ga=GA1.1.923589191.1671467936; tzo=-420; frmfctr=wide; ak_bmsc=2BF62A04339B5A8040F6BE42D7397A3E~000000000000000000000000000000~YAAQjuXEF4UQhgeFAQAA27JjLRJCqxbQs87wSwEdHRYX0G6/TDKZTovu4o6yCZ81IqZ6MhWckkFsgOG1eRvCyGcwYcGwZT6Hj0lKPA7+LECJW/p7uLVBPkdj9vDMc+M3APPxtp3dA4N7Xiz9k2CsK1cScWB15o99WkNT8TblkPRJMP+//tibMWG2xoEjeEV14znNbpxu/a19eOVXAFS6pvZFEKc1PvKaN6qVTNMeaHczy8vtIlUIhOHm8n0/neIB5WjhQewZ2bsI4Jk5FAyuugtHiXFiRJU7poWmTC1z3vQV5BNZly4umCoBfxL+4/1Gwtvg+5jzWIgEsQIOlNvhj6IhTWNdgIiUubyhm6SwXUHdD2r80FCzSqqkgs/F95M/C2eTuiMmpV414Qo=; jitney_client_session_id=d94ad55f-160c-4b75-aa19-6b94073c92fc; jitney_client_session_created_at=1671505980; _user_attributes=%7B%22curr%22%3A%22USD%22%2C%22guest_exchange%22%3A1.0%2C%22device_profiling_session_id%22%3A%221671467934--953d4ec0c3e885ffcfcecdcb%22%2C%22giftcard_profiling_session_id%22%3A%221671503786--bedf36f5fd7bc1727463feba%22%2C%22reservation_profiling_session_id%22%3A%221671503786--104feb525ef2701da84b3359%22%7D; jitney_client_session_updated_at=1671505982; _ga_2P6Q8PGG16=GS1.1.1671505983.4.1.1671505983.0.0.0; _uetsid=a14c3a907fbb11ed80888db3ffb5b814; _uetvid=a14c54c07fbb11edbfa35d16156e96a6; previousTab=%7B%22id%22%3A%22c5e39694-2720-411d-bbb4-5880ad1455b6%22%2C%22url%22%3A%22https%3A%2F%2Fwww.airbnb.com%2Fs%2FUnited-States%2Fhomes%22%7D; bm_sv=67E8D97D583DC2BA73262B2A33B8BD6A~YAAQjuXEF3gehweFAQAAzlqFLRKkxd6iaP1xq0hHbfO3CsyIIQifNN1kg5O8tYdobGiFUFkAU0ek5Tg67B28nu2btfk+1yHhT6TE1jibTRrmUjxmCnS6gZ9Sclthu91yIEEePEd32A66YE2G6ofZjSQCTS8TsZU9O8pKFQ7Nid5vVQ6DlyEPLX5yEYMLK94A1kBNulzVg+KjaWf2UN8pSkr66hIUxNn5Qq3fQtl8kCxJeSb6MGW977OC++DGDRXKqg==~1; cfrmfctr=MOBILE; cbkp=2',
    'device-memory': '8',
    'dpr': '2',
    'ect': '4g',
    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'viewport-width': '895'
    }


# Create the helper functions required to run the scraper

Let's set up a few functions we'll use to breathe life into this webscraper.


### def `reset_dataframe`:
**Args:** 
* None.

**Returns:**
* Empty DataFrame with the shape and headers of the data we intend to scrape.

In [3]:
def reset_dataframe():

    # Create the dataframe.
    listings_dataframe = pd.DataFrame(
            {'listing_id':       [],
             'listing_url':      [],
             'is_superhost':     [],
             'rating':           [],
             'n_reviews':        [],
             'listing_city':     [],
             'listing_title':    [],
             'n_pictures':       [],
             'room_type':        [],
             'latitude':         [],
             'longitude':        [],
             'beds':             [],
             'price':            [],
             'discounted_price': [],
             'original_price':   [],
             'price_qualifier':  [],
             'image_1':          [],
             'image_2':          [],
             'image_3':          [],
             'image_4':          [],
             'image_5':          []
            })
    
    return listings_dataframe

### def `first_page`:
**Args:** 
* url: A URL on a search page that will be scraped using the `requests` Python package.
* headers: The pre-set headers used to make the requests look more human.

**Returns:**
* Either a `BeautifulSoup` soup object of the webpage, or a None object.

In [4]:
def first_page(url, headers=headers):

    # Try the request up to three times before giving up.
    for _ in range(3):
        try:
            r = requests.get(url, headers=headers)
            text = r.text
            soup = BeautifulSoup(text, 'html.parser')
            return soup
        except Exception:
            continue

    # Every attempt failed.
    return None
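
`first_page` gives a failed request up to three chances before returning `None`. A minimal sketch of that retry idea as a reusable helper is shown here, assuming a short pause between attempts is acceptable. The names `retry` and `flaky_fetch` are hypothetical and not part of the notebook, and `flaky_fetch` only simulates a network call.

```python
import time


def retry(func, attempts=3, delay=0.0):
    """Call func() until it succeeds or the attempts run out.

    Returns func()'s result, or None if every attempt raised.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None


# A stand-in for a network call that fails twice, then succeeds.
calls = {'n': 0}


def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated failure')
    return '<html></html>'


result = retry(flaky_fetch, attempts=3)
```

Passing a non-zero `delay` spaces the attempts out, which may help when a failure is caused by a brief connection drop.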


### def `extract_listing_info`:
**Args:** 
* obj: a JSON data structure object that is parsed out of a `BeautifulSoup` soup object of a webpage.
* df: a DataFrame with all of the listings scraped to date. For the first listing, it is the empty DataFrame generated by `reset_dataframe()`. For the second listing and onwards, it is a DataFrame of all the prior data scraped.

**Returns:**
* a DataFrame of all the listings scraped up to that point.

In [5]:
def extract_listing_info(obj, df):
    # Extract all the information about the listing from the search page.
    try:
        try:
            listing_id       = obj['listing']['id']
        except:
            listing_id = None

        try:
            listing_url      = 'https://www.airbnb.com/rooms/' + obj['listing']['id']
        except:
            listing_url = None
        try:
            is_superhost     = obj['listing']['formattedBadges'][0]['text']
        except:
            is_superhost = None
        try:
            rating           = obj['listing']['avgRatingLocalized'].split(' ')[0]
        except:
            rating = None
        try:   
            n_reviews        = obj['listing']['avgRatingLocalized'].split(' ')[1][1:-1]
        except:
            n_reviews = None
        try:
            listing_city     = obj['listing']['city']
        except:
            listing_city = None
        try:
            listing_title    = obj['listing']['title']
        except:
            listing_title = None
        try:
            n_pictures       = obj['listing']['contextualPicturesCount']
        except:
            n_pictures = None
        try:
            room_type        = obj['listing']['roomTypeCategory']
        except:
            room_type = None
        try:
            latitude         = obj['listing']['coordinate']['latitude']
        except:
            latitude = None
        try:
            longitude        = obj['listing']['coordinate']['longitude']
        except:
            longitude = None
        try:
            beds             = obj['listing']['structuredContent']['primaryLine'][0]['body']
        except:
            beds = None
        try:
            price            = obj['pricingQuote']['structuredStayDisplayPrice']['primaryLine']['price']
        except:
            price = None
        try:
            discounted_price = obj['pricingQuote']['structuredStayDisplayPrice']['primaryLine']['discountedPrice']
        except:
            discounted_price = None
        try:
            original_price   = obj['pricingQuote']['structuredStayDisplayPrice']['primaryLine']['originalPrice']
        except:
            original_price = None
        try:
            price_qualifier  = obj['pricingQuote']['structuredStayDisplayPrice']['primaryLine']['qualifier']
        except:
            price_qualifier = None
        try:
            image_1          = obj['listing']['contextualPictures'][0]['picture']
        except:
            image_1 = None
        try:
            image_2          = obj['listing']['contextualPictures'][1]['picture']
        except:
            image_2 = None
        try:
            image_3          = obj['listing']['contextualPictures'][2]['picture']
        except:
            image_3 = None
        try:
            image_4          = obj['listing']['contextualPictures'][3]['picture']
        except:
            image_4 = None
        try:
            image_5          = obj['listing']['contextualPictures'][4]['picture']
        except:
            image_5 = None
        
        row = pd.DataFrame(
            {'listing_id':       [listing_id],
             'listing_url':      [listing_url],
             'is_superhost':     [is_superhost],
             'rating':           [rating],
             'n_reviews':        [n_reviews],
             'listing_city':     [listing_city],
             'listing_title':    [listing_title],
             'n_pictures':       [n_pictures],
             'room_type':        [room_type],
             'latitude':         [latitude],
             'longitude':        [longitude],
             'beds':             [beds],
             'price':            [price],
             'discounted_price': [discounted_price],
             'original_price':   [original_price],
             'price_qualifier':  [price_qualifier],
             'image_1':          [image_1],
             'image_2':          [image_2],
             'image_3':          [image_3],
             'image_4':          [image_4],
             'image_5':          [image_5]
            }
        )
        # Return dataframe, with new row of data appended.
        return pd.concat([df, row], ignore_index=True)
    except:
        # Return the dataframe in its current state
        #print('error: data didn\'t scrape')
        return df
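
Each field in `extract_listing_info` is read from a nested structure, falling back to `None` when a key or index is missing. One way to express that pattern more compactly is a small lookup helper. This is only a sketch: `safe_get` is a hypothetical name, and the `sample` dictionary is made up to mirror a few of the keys used above, so real payloads may differ.

```python
def safe_get(obj, *path, default=None):
    """Walk nested dicts/lists along path; return default on any miss."""
    current = obj
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Made-up sample shaped like the fields read above.
sample = {
    'listing': {
        'id': '12345',
        'coordinate': {'latitude': 40.0, 'longitude': -105.0},
        'contextualPictures': [{'picture': 'https://example.com/a.jpg'}],
    }
}

listing_id = safe_get(sample, 'listing', 'id')
latitude = safe_get(sample, 'listing', 'coordinate', 'latitude')
image_1 = safe_get(sample, 'listing', 'contextualPictures', 0, 'picture')
# Only one picture exists, so index 1 falls back to None.
image_2 = safe_get(sample, 'listing', 'contextualPictures', 1, 'picture')
# The whole 'pricingQuote' branch is absent, so this is None as well.
price = safe_get(sample, 'pricingQuote', 'structuredStayDisplayPrice',
                 'primaryLine', 'price')
```

Because every lookup goes through the same helper, a missing value can never leave a variable undefined, which is the failure mode that one `try`/`except` per field is prone to.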
        

### def `get_next_page`:
**Args:** 
* obj: a JSON data structure object that is parsed out of a `BeautifulSoup` soup object of a webpage.
* url: The url of the current page of search results.
* headers: The pre-set headers used to make the requests look more human.

**Returns:**
* A `BeautifulSoup` soup object of the _next_ page of search results.
* A dictionary of the form `{'url': ...}` holding the url of the _next_ page of search results, which is fed into `get_next_page()` again to see if there is another page. If no next page is found, both return values are `None`.

In [6]:
def get_next_page(obj, url, headers=headers):

    try:
        # Get next page link if it is there.
        next_page = obj['niobeMinimalClientData'][0][1]['data']
        next_page = next_page['presentation']['explore']['sections']
        next_page = next_page['sectionIndependentData']['staysSearch']['paginationInfo']['nextPageCursor']
        
        # If there isn't already a next page cursor in the url:
        if 'cursor=' not in str(url):
            # add the url param and next page cursor.
            url = url + '&pagination_search=true&cursor=' + next_page
        # If there is a next page cursor in the url:
        else:
            # Remove the existing cursor, add the new one.
            url = url.split('&pagination_search=true&cursor=')[0] + '&pagination_search=true&cursor=' + next_page
        #print('URL:', url)
        r = requests.get(url, headers=headers)                               # Request the HTML.
        text = r.text                                                        # Extract the text.
        soup = BeautifulSoup(text, 'html.parser')                            # Soupify text.
        return soup, {'url': url}
        #print('Found next page.')
    except Exception as e:
        #print('No next page found.')
        #print(e)
        return None, None
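
`get_next_page` adds or swaps the pagination cursor by splitting and concatenating strings. As an alternative sketch, the same update can be done with the standard library's `urllib.parse`, which rebuilds the query string from its parameters. The helper name `with_cursor` and the cursor values `'abc'` and `'def'` are hypothetical. Note that `urlencode` percent-encodes special characters in the cursor, which the string concatenation above does not do.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_cursor(url, cursor):
    """Return url with its pagination cursor set to `cursor`."""
    parts = urlsplit(url)
    # Keep every existing parameter except any old pagination ones.
    params = [(k, v)
              for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in ('cursor', 'pagination_search')]
    params += [('pagination_search', 'true'), ('cursor', cursor)]
    return urlunsplit(parts._replace(query=urlencode(params)))


base = ('https://www.airbnb.com/s/United-States/homes'
        '?search_type=user_map_move&tab_id=home_tab')
page_2 = with_cursor(base, 'abc')     # no cursor yet: one is appended
page_3 = with_cursor(page_2, 'def')   # existing cursor: it is replaced
```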
        

### def `generate_mapviews`:
**Args:** 
* None.

**Returns:**
* A list of dictionaries, each with three keys: `url` (the search url for one map box), `est_lat` (the estimated latitude of that box), and `est_lng` (the estimated longitude of that box). Together, the urls tile the box specified by `north_lat`, `south_lat`, `east_lng`, and `west_lng`. Depending on the size of the box and the size of each incremental snapshot, this list could run from the tens of thousands up to the hundreds of thousands.


_Image demonstrating how mapbox filters each request link for search._
![Image demonstrating mapbox](https://lh6.googleusercontent.com/irSKqaHTVG0_28c8ma6b-9gcTkVaNOH4Nvg6FClmYnCkDmgTn96BH9FOJhHb1vuc35s=w2400)

In [7]:
def generate_mapviews():
    # The West: Washington state, Oregon, California, Idaho, Nevada, Utah, 
    # Arizona, Montana, Wyoming, Colorado, New Mexico.
    north_lat =   48.9995662408735
    south_lat =   31.333363318718604
    east_lng  = -102.05125274620897
    west_lng  = -124.74655654881464
    
    coords = []
    
    # incrementing between western and easternmost longitude, in .1 increments:
    for lng in np.arange(west_lng, east_lng,.1):      
        # incrementing between southern and northernmost latitude, in .1 increments:
        for lat in np.arange(south_lat, north_lat,.1):
            # Create a dictionary to define the corners of the map box.
            coord = {
                'ne_lat':lat+.1,
                'ne_lng':lng+.1,
                'sw_lat':lat,
                'sw_lng':lng
            }
            
            coords.append(coord)
            
    def gen_links(coord):
        
        return ('https://www.airbnb.com/s/',
                'United-States/homes?',
                'search_type=user_map_move',
                '&ne_lat={}'.format(coord['ne_lat']),
                '&ne_lng={}'.format(coord['ne_lng']),
                '&sw_lat={}'.format(coord['sw_lat']),
                '&sw_lng={}'.format(coord['sw_lng']),
                '&tab_id=home_tab')
    
    
    
    # Store the URL, as well as the estimated lat/long location -- the average of the northernmost/southernmost lat,
    # and the easternmost/westernmost long in the view.
    coord_data = []
    
    for coord in coords:
        dict_ = {
            'url': ''.join(gen_links(coord)),
            'est_lat':(coord['ne_lat']+coord['sw_lat'])/2,
            'est_lng':(coord['ne_lng']+coord['sw_lng'])/2
        }
        
        coord_data.append(dict_)
    
    return coord_data

urls = generate_mapviews()

print('Sample output of URLs:')
print(urls[0:2])
print('# of URLs',len(urls))

Sample output of URLs:
[{'url': 'https://www.airbnb.com/s/United-States/homes?search_type=user_map_move&ne_lat=31.433363318718605&ne_lng=-124.64655654881464&sw_lat=31.333363318718604&sw_lng=-124.74655654881464&tab_id=home_tab', 'est_lat': 31.383363318718605, 'est_lng': -124.69655654881464}, {'url': 'https://www.airbnb.com/s/United-States/homes?search_type=user_map_move&ne_lat=31.533363318718607&ne_lng=-124.64655654881464&sw_lat=31.433363318718605&sw_lng=-124.74655654881464&tab_id=home_tab', 'est_lat': 31.483363318718606, 'est_lng': -124.69655654881464}]
# of URLs 40179
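
The count of 40,179 urls can be checked by hand: stepping across a span in 0.1-degree increments takes `ceil(span / 0.1)` steps, and the grid is the product of the latitude and longitude step counts. The sketch below reproduces that arithmetic with the standard library only. It also builds the boxes from integer indices, which is an assumption of this example and not how `generate_mapviews` is written; the variable `tiles` is hypothetical.

```python
import math

north_lat, south_lat = 48.9995662408735, 31.333363318718604
east_lng, west_lng = -102.05125274620897, -124.74655654881464
step = 0.1

# Number of 0.1-degree steps needed to cover each span.
n_lat = math.ceil((north_lat - south_lat) / step)
n_lng = math.ceil((east_lng - west_lng) / step)
print(n_lat, n_lng, n_lat * n_lng)  # prints: 177 227 40179

# Build each box from integer indices, so every corner is computed
# directly from the starting edge.
tiles = [
    {'sw_lat': south_lat + i * step,
     'sw_lng': west_lng + j * step,
     'ne_lat': south_lat + (i + 1) * step,
     'ne_lng': west_lng + (j + 1) * step}
    for j in range(n_lng)
    for i in range(n_lat)
]
```

Halving `step` roughly quadruples the number of urls, which is worth keeping in mind given how long the full job takes.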


### def `process_listings`:
**Args:** 
* url: A dictionary produced by `generate_mapviews()`; its `url` key holds the search page URL that will be scraped using the `requests` Python package.
* listings_dataframe: a DataFrame with all of the listings scraped to date. For the first listing, it is the empty DataFrame generated by `reset_dataframe()`. For the second listing and onwards, it is a DataFrame of all the prior data scraped.
* url_count: A count of the number of urls that have been processed.

**Returns:**
* A DataFrame with the results of scraping one url. The url is the starting point of `first_page()`, and then each listing object on the page is passed to `extract_listing_info`. If there is more than one page of search results, `get_next_page` will use the url passed to `first_page()` to get to the next page, and then pass it to `extract_listing_info`. Once there are no more pages of search results, the function will exit and return the DataFrame of all the results. 

In [8]:
# Reset the dataframe.
listings_dataframe = reset_dataframe()

# Keeping track of the number of URLs that have been worked through.
url_count = 0

def process_listings(url,listings_dataframe=listings_dataframe,url_count=url_count):

    frame_list = []
    
    # Feed the url through to get the soup of the first page.
    soup = first_page(url['url'])
    # TODO: create printing log to show pagination to make sure it is working.
    page_count = 1
    # Until the scraper cannot find another page to scrape.
    while soup is not None:
        
        #print('Page Count:', page_count)

        # Get all of the individual listings on the page.
        
        try:
            string = str(soup.html.find_all('script')[-1])
            string_filtered = string[84:-9]
            json_obj = json.loads(string_filtered)


            json_obj = json_obj['niobeMinimalClientData'][0][1]['data']['presentation']
            json_obj = json_obj['explore']['sections']['sectionIndependentData']['staysSearch']['searchResults']

            if 'splitStaysListings' not in json_obj:

                objs = [obj for obj in json_obj]

                # for each listing.
                for obj in objs:

                    # Extract this listing's fields into a one-row
                    # DataFrame (appended to `listings_dataframe`).
                    frame = extract_listing_info(
                                    obj=obj,
                                    df=listings_dataframe
                        )

                    #if frame != None:
                    frame_list.append(frame)

                # Sleep for 1-5 seconds. Just in case someone is monitoring
                # my events on AirBnB's side, I want to give at least *some*
                # illusion that this is a human being and not a scraper.
                time.sleep(random.randint(1,5))

            # Attempt to go to next page. If no next page is found, returns
            # None and breaks the loop. This completes the loop for one url,
            # and the scraper moves onto the next url.
            page_count += 1
            json_obj = json.loads(string_filtered)
            soup, url = get_next_page(obj=json_obj, url=url['url'])

            # Tick another url completed.
            url_count += 1
        except:
            break

    try: 
        return pd.concat(frame_list, ignore_index=True)
    except:
        return reset_dataframe()
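
`process_listings` recovers the JSON payload by slicing the last `<script>` tag at fixed offsets (`string[84:-9]`), which only works while the opening tag is exactly 84 characters long. A sketch of a less position-dependent approach is to cut at the end of the opening tag and at the start of the closing tag. The helper name `script_payload` and the tag below are hypothetical, and the attributes on the real page may differ. The sketch also assumes no attribute value contains a `>` character.

```python
import json


def script_payload(script_html):
    """Return the parsed JSON found between <script ...> and </script>."""
    start = script_html.index('>') + 1        # end of the opening tag
    end = script_html.rindex('</script>')     # start of the closing tag
    return json.loads(script_html[start:end])


# Hypothetical tag for illustration only.
tag = ('<script id="example-state" type="application/json">'
       '{"niobeMinimalClientData": []}'
       '</script>')
payload = script_payload(tag)
```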


# Execute webscraping.

_Depending on your wifi connection, this job might get interrupted! To allow for that, the code reads previously scraped results from `extract 1.csv` and appends them to the new results when the job is started back up._


**If True:**
(This takes approximately 7-10 days to run, depending on interruptions.)
* Set `url_count` to `0`, and `listings_dataframe` to `reset_dataframe()`.
* Create a list of frames, `frames`.
* Generate a list of urls, `urls`.
* For each url in `urls`, call `process_listings()`, and save the returned results as `frame`.
* Append `frame` to `frames`.
* Sleep for a random period of time between 1 and 5 seconds to keep rate of requests slow.
* For every 100 urls processed, save the results down to `scraped_listings.csv`. This allows us to (1) inspect results and (2) prevent losing all data if the job is interrupted (only up to a hundred will be lost).

**If False:**
* Just read in the already scraped data, and skip everything above. Allows someone to "run" the notebook and see all of the code without setting off the full job.

In [9]:
if False:
    # count of URLs.
    url_count = 0

    # Create empty DataFrame.
    listings_dataframe = reset_dataframe()

    # List of frames.
    frames = []

    # Generate urls. If job interrupted, use indexing to remove all URLs already processed.
    urls = generate_mapviews()[2295+300+1000+1800+100+4200+14100+13800+700:]

    # Temp solution: read in all of the already processed listings. Used if job is interrupted.
    extract1 = pd.read_csv('extract 1.csv')
    extract1 = extract1.drop(columns=['Unnamed: 0'])

    # For each url in the list of generated urls.
    for url in tqdm(urls):
        # Process all of the listings at that url, including any pagination through the next page of search.
        frame = process_listings(url=url)
        # Append the listings to the list of results, `frames`.
        frames.append(frame)
        # Pause the script for a random amount of time, 1 to 5 seconds. Don't want to overwhelm AirBnB with requests!
        time.sleep(random.randint(1,5))
        # Mark a processed url.
        url_count += 1
        # For every 100 urls processed:
        if url_count % 100 == 0:

            # Copy the list of frames.
            write_frames = frames.copy()
            # Add the extracted listings from prior jobs that were interrupted.
            write_frames.append(extract1)
            # Write the concatenated list of frames into one csv that is plugged into Tableau for monitoring.
            pd.concat(write_frames, ignore_index=True).to_csv('scraped_listings.csv')

    # Concatenate all final results into one DataFrame.
    write_frames = frames.copy()
    write_frames.append(extract1)
    listings_dataframe = pd.concat(write_frames, ignore_index=True)
if True:
    listings_dataframe = pd.read_csv('scraped_listings BACKUP.csv')

100%|██████████| 1884/1884 [3:46:07<00:00,  7.20s/it]  
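
When the job is restarted, the cell above skips already-processed urls with a hand-maintained slice offset. One possible alternative, sketched here under the assumption that urls are always generated in the same order, is to record the number of completed urls in a small checkpoint file and read it back on restart. The names `load_progress`, `save_progress`, and `progress.json` are hypothetical, and the placeholder urls stand in for the output of `generate_mapviews()`.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return how many urls were already processed (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)['urls_done']


def save_progress(path, urls_done):
    """Record the number of urls processed so far."""
    with open(path, 'w') as f:
        json.dump({'urls_done': urls_done}, f)


# Demo with placeholder urls and a temporary checkpoint file.
checkpoint = os.path.join(tempfile.mkdtemp(), 'progress.json')
all_urls = ['url_{}'.format(i) for i in range(10)]

# First run: process four urls, then get "interrupted".
for done, url in enumerate(all_urls[:4], start=1):
    save_progress(checkpoint, done)

# Second run: skip what is already done.
start = load_progress(checkpoint)
remaining = all_urls[start:]
```

In the real job, the checkpoint would most naturally be written at the same point as the periodic save to `scraped_listings.csv`, so that the saved data and the recorded position stay in step.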


# Deduplicate listings.

There might have been a few duplicate listings. Let's fix that!

### def `dedupe_data`:
**Args:** 
* listings_dataframe: the scraped listings in DataFrame form.

**Returns:**
* listings_dataframe: deduplicated copy of the original DataFrame.

In [12]:
# Dedupe listings.
def dedupe_data(listings_dataframe):

    # Get length of dataset.
    predupe = len(listings_dataframe)

    # We assume with this method there might be some duplicate scrapes. This step will
    # drop the duplicates, and keep the first occurrence.
    subset_cols = ['listing_id']
    listings_dataframe = listings_dataframe.drop_duplicates(subset=subset_cols, keep='first')
    postdupe = len(listings_dataframe)
    delta = predupe - postdupe

    # Pretty print the results post de-duplication.
    print(f'''
    After deduplication, the AirBnB listings dataset decreased from {predupe} to {postdupe}, a decrease of
    {delta}. Where duplicates were found, the first record was taken.
    ''')

    # Save the results.
    return listings_dataframe

In [13]:
listings_dataframe = dedupe_data(listings_dataframe)


    After deduplication, the AirBnB listings dataset decreased from 188457 to 186808, a decrease of
    1649. Where duplicates were found, the first record was taken.
    


In [14]:
listings_dataframe.to_csv('scraped_listings BACKUP.csv')