# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility, we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve an archived snapshot of the page that lists events starting from the end of March 2019.

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open the inspect element feature in your web browser to preview the underlying HTML associated with the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a pandas DataFrame with columns for the `Event_Name`, `Venue`, `Event_Date`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.

In [1]:
# Relevant imports
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time

In [2]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"

# Exploration: making the request and parsing the response
response = requests.get(EVENTS_PAGE_URL)
soup = BeautifulSoup(response.content, "html.parser")
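Since one goal of this lab is dealing with errors, here is a minimal sketch of how a request could be retried when it fails. The helper name `fetch_with_retries` and its parameters are hypothetical (they are not part of the lab), and the `get` argument is assumed to be a callable such as `requests.get` whose return value has a `raise_for_status()` method.

```python
import time

def fetch_with_retries(url, get, retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Call get(url) up to `retries` times, pausing between failed attempts.

    Hypothetical helper: `get` is assumed to behave like requests.get, i.e. it
    returns an object with a raise_for_status() method.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url)
            # Turn HTTP error codes (4xx/5xx) into exceptions
            response.raise_for_status()
            return response
        except Exception as error:
            last_error = error
            # Wait before the next attempt, but not after the final one
            if attempt < retries:
                sleep(wait_seconds)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error

# Possible usage with the names from this lab (performs a real request):
# response = fetch_with_retries(EVENTS_PAGE_URL, lambda u: requests.get(u, timeout=30))
# soup = BeautifulSoup(response.content, "html.parser")
```

Passing the request function in as an argument keeps the retry logic separate from the network call, so it can be exercised with a stand-in function before pointing it at the real site.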

In [3]:
# Find the container with event listings in it
event_container = soup.find('li', class_='Column-sc-18hsrnn-0 gnwWng')

In [4]:
# Find a list of events by date within that container
# There are 7 days, each with its own list of events.
# Each day's div has a class made of 'Box-omzyfs-0' followed by a single 6-character suffix 'xxxxxx',
# so a regular expression can be used to select these 7 days
import re
regex = re.compile('Box-omzyfs-0 (?=.{6}$)')
events_all = event_container.find_all('div', {'class': regex}, recursive=False)
len(events_all)

7
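To see why this pattern singles out the per-day containers, it can help to try it on a few class strings outside of BeautifulSoup. The sketch below is illustrative only: it applies `re.search` directly to whole class strings, and the first sample uses a made-up six-character suffix, since the real suffixes are not shown in this lab.

```python
import re

# Same pattern as in the cell above: the literal prefix, a space, then a
# lookahead requiring exactly six more characters before the end of the string.
day_class = re.compile('Box-omzyfs-0 (?=.{6}$)')

# Sample class strings for illustration; 'abcdef' is a made-up suffix,
# the other two are classes that appear elsewhere in this lab.
samples = [
    'Box-omzyfs-0 abcdef',
    'Box-omzyfs-0 sc-AxjAm fCFvgO',
    'Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk',
]

# Map each sample to whether the pattern matches it
matches = {s: bool(day_class.search(s)) for s in samples}
for s, matched in matches.items():
    print(f'{matched!s:5} {s}')
```

Only the first string matches, because the other two have more than six characters after the `Box-omzyfs-0 ` prefix.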

In [5]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers
# Take the first item from events_all and find the h3 heading that holds the date; its text starts with a '/' that [1:] drops

# Grabbing just one to practice on
date = events_all[0].find('h3', class_='Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk').text[1:]
date

'Sat, 30 Mar'

In [6]:
# Extract the name from one of the events within that container
name = events_all[0].find('a', attrs = {'data-test-id': 'event-listing-heading'}).text
name

'UnterMania II'

In [7]:
# Each event's name, venue, and attendee count are in an <li class='Column-sc-18hsrnn-0 jHShKh'> inside each item of events_all
[event for event in events_all[0].find_all('li', class_='Column-sc-18hsrnn-0 jHShKh')]

[<li class="Column-sc-18hsrnn-0 jHShKh"><div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="ticketed-event"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" href="/web/20210326225933/https://ra.co/events/1234892"><span class="Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf" color="primary" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" font-weight="normal" href="/events/1234892">UnterMania II</span></a></h3><div class="Box-omzyfs-0 sc-AxjAm jVLhoy"><span class="Text-sc-1t0gn2o-0 dWOMtb" color="primary" font-weight="normal">Mary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady Machine</span></div><div class="Box-omzyfs-0 sc-AxjAm fCFvgO"><div class="Box-omzyfs-0 sc-AxjAm ebaaK"><div class="Box-omzyfs-0 sc-AxjAm fOOuYI" height="30"><div class="Box-omzyfs-0 sc-AxjAm hoMiiH" color="accent" height="24" width="24"><svg aria-label=

In [8]:
# We need to iterate over the <li> items of an entry in events_all to get the info for each event on that date.

event_list_firstdate = events_all[0].find_all('li', class_='Column-sc-18hsrnn-0 jHShKh')
event_name = event_list_firstdate[0].find('a', attrs = {'data-test-id': 'event-listing-heading'}).text
event_name

'UnterMania II'

In [9]:
# Extract the venue and number of attendees from one of the
# events within that container
# Both are inside a 'div' with class_='Box-omzyfs-0 sc-AxjAm fCFvgO',
# with the venue in a 'div' with class_='Box-omzyfs-0 sc-AxjAm fOOuYI'
# and the number of attendees in a 'span' with class_='Text-sc-1t0gn2o-0 hhfigA'

venue = event_list_firstdate[0].find('div', class_='Box-omzyfs-0 sc-AxjAm fCFvgO').find('div', class_='Box-omzyfs-0 sc-AxjAm fOOuYI')
attendee = event_list_firstdate[0].find('div', class_='Box-omzyfs-0 sc-AxjAm fCFvgO').find_all('span', class_='Text-sc-1t0gn2o-0 hhfigA')
venue.text, attendee[-1].text

('TBA - New York', '457')

In [10]:
# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe
# Loop over all date containers on the page

# NOTE: the number of attendees is sometimes abbreviated, e.g. 1.8K, and must be converted to a plain number.
# Use ** for powers; in Python, ^ is bitwise XOR.
units = {'K': 10**3, 'M': 10**6, 'B': 10**9}
event_info = []
for events in events_all:
    for event in events.find_all('li', class_='Column-sc-18hsrnn-0 jHShKh'):
        date = events.find('h3', class_='Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk').text[1:]
        name = event.find('a', attrs = {'data-test-id': 'event-listing-heading'}).text
        venue_container = event.find('div', class_='Box-omzyfs-0 sc-AxjAm fCFvgO').find('div', class_='Box-omzyfs-0 sc-AxjAm fOOuYI')
        attendees = event.find('div', class_='Box-omzyfs-0 sc-AxjAm fCFvgO').find_all('span', class_='Text-sc-1t0gn2o-0 hhfigA')
        if venue_container:
            venue = venue_container.text
        else:
            venue = np.nan
        if attendees:
            try:
                attd_num = int(attendees[-1].text)
            except ValueError:
                # Abbreviated count such as 1.8K: split off the unit and scale
                unit = attendees[-1].text[-1]
                attd_num = int((float(attendees[-1].text[:-1])) * units[unit])
        else:
            attd_num = np.nan
        info_list = [name, venue, date, attd_num]
        event_info.append(info_list)
df = pd.DataFrame(event_info)
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 114 | A Night at the Baths | C'mon Everybody | Fri, 5 Apr | 1.0 |
| 115 | Blaqk Audio | Music Hall of Williamsburg | Fri, 5 Apr | 1.0 |
| 116 | Erik the Lover | Erv's | Fri, 5 Apr | 1.0 |
| 117 | Wax On Vissions | Starliner | Fri, 5 Apr | 1.0 |
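The conversion of abbreviated attendee counts such as 1.8K can also be pulled out into a small standalone function, which makes it easy to check on its own. This is a sketch under assumptions: the name `parse_attendees` is hypothetical, and it assumes the only suffixes that occur are K, M, and B.

```python
import math

# Multipliers for the abbreviated suffixes, written with ** (power)
UNITS = {'K': 10**3, 'M': 10**6, 'B': 10**9}

def parse_attendees(text):
    """Convert strings such as '457' or '1.8K' to an int; return NaN if empty."""
    text = text.strip()
    if not text:
        return math.nan
    suffix = text[-1].upper()
    if suffix in UNITS:
        # round() rather than int() so floating-point error cannot truncate
        # a product that lands just below the intended whole number
        return round(float(text[:-1]) * UNITS[suffix])
    return int(text)

for raw in ['457', '1.8K', '2M', '']:
    print(repr(raw), '->', parse_attendees(raw))
```

Checking for the suffix explicitly, instead of relying on a `ValueError`, means an unexpected string still raises an error rather than being silently scaled by the wrong unit.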


In [11]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(events_page_url):
    resp = requests.get(events_page_url)
    soup = BeautifulSoup(resp.content, 'html.parser')
    event_container = soup.find('li', class_='Column-sc-18hsrnn-0 gnwWng')
    regex = re.compile('Box-omzyfs-0 (?=.{6}$)')
    events_all = event_container.find_all('div', {'class': regex}, recursive=False)
    units = {'K': 10**3, 'M': 10**6, 'B': 10**9}
    event_info = []
    for events in events_all:
        for event in events.find_all('li', class_='Column-sc-18hsrnn-0 jHShKh'): 
            date = events.find('h3', class_='Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk').text[1:]
            name = event.find('a', attrs = {'data-test-id': 'event-listing-heading'}).text
            venue_container = event.find('div', class_='Box-omzyfs-0 sc-AxjAm fCFvgO').find('div', class_='Box-omzyfs-0 sc-AxjAm fOOuYI')
            attendees = event.find('div', class_='Box-omzyfs-0 sc-AxjAm fCFvgO').find_all('span', class_='Text-sc-1t0gn2o-0 hhfigA')
            if venue_container:
                venue = venue_container.text
            else:
                venue = np.nan
            if attendees:
                try:
                    attd_num = int(attendees[-1].text)
                except ValueError:
                    unit = attendees[-1].text[-1]
                    attd_num = int((float(attendees[-1].text[:-1])) * units[unit])
            else:
                attd_num = np.nan
            info_list = [name, venue, date, attd_num]
            event_info.append(info_list)
    df = pd.DataFrame(event_info)
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [12]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 114 | A Night at the Baths | C'mon Everybody | Fri, 5 Apr | 1.0 |
| 115 | Blaqk Audio | Music Hall of Williamsburg | Fri, 5 Apr | 1.0 |
| 116 | Erik the Lover | Erv's | Fri, 5 Apr | 1.0 |
| 117 | Wax On Vissions | Starliner | Fri, 5 Apr | 1.0 |


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

This is a relative path, so make sure you add `https://web.archive.org` to the front to get the full URL.

![next page](images/ra_next.png)

In [13]:
# Find the button, find the relative path, create the URL for the current `soup`
next_button = soup.find('div', class_='Box-omzyfs-0 sc-AxjAm Panel__StyledAlignment-sc-1udo2qh-0 ArchiveNavigator___StyledPanel2-x733n4-2 kKGHmX').find('a').attrs['href']
next_page_url = 'https://web.archive.org' + next_button
next_page_url

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'
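As an alternative to concatenating strings, the standard library's `urllib.parse.urljoin` can resolve a relative path against the URL of the page it came from. The sketch below assumes the `href` is root-relative (it starts with `/`), as in the output above; the variable names are illustrative.

```python
from urllib.parse import urljoin

# URL of the page that was scraped (EVENTS_PAGE_URL in this lab)
current_url = 'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30'

# Root-relative href of the kind found on the "Next Week" link
relative_href = '/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

# urljoin keeps the scheme and host of current_url and swaps in the new path
next_url = urljoin(current_url, relative_href)
print(next_url)
```

This avoids hard-coding the host, which may be convenient if the same function is later pointed at a different site.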

In [14]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(url):
    #Your code here
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'html.parser')
    next_button = soup.find('div', class_='Box-omzyfs-0 sc-AxjAm Panel__StyledAlignment-sc-1udo2qh-0 ArchiveNavigator___StyledPanel2-x733n4-2 kKGHmX').find('a').attrs['href']
    next_page_url = 'https://web.archive.org' + next_button
    return next_page_url

In [15]:
# Test out your function
next_page(EVENTS_PAGE_URL)

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.
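The control flow described here (scrape a page, pause, move to the next page, stop once enough rows are collected) can be sketched independently of the network. Everything in this block is an assumption made for illustration: `collect_rows`, `fake_scrape`, and `fake_next_url` are hypothetical names, and the stand-in functions return plain lists instead of DataFrames so the loop can be run offline.

```python
import time

def collect_rows(first_url, scrape, get_next_url, min_rows=500,
                 pause_seconds=5.0, sleep=time.sleep):
    """Scrape page after page until at least `min_rows` rows are collected."""
    rows = []
    url = first_url
    while len(rows) < min_rows:
        rows.extend(scrape(url))
        # Pause between pages to stay clear of rate limiting
        sleep(pause_seconds)
        url = get_next_url(url)
    return rows

# Stand-ins for scrape_events and next_page: each fake "page" is just a number
def fake_scrape(url):
    # Pretend every page lists 120 events
    return [(url, i) for i in range(120)]

def fake_next_url(url):
    return url + 1

# sleep is replaced with a no-op so the demonstration runs instantly
rows = collect_rows(0, fake_scrape, fake_next_url, sleep=lambda seconds: None)
print(len(rows))
```

With 120 rows per page, the loop stops after the fifth page, the first point at which the total is no longer below 500.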

In [16]:
# Your code here
df_all = pd.DataFrame()
current_url = EVENTS_PAGE_URL
# Keep scraping week by week until at least 500 events are collected
while len(df_all) < 500:
    df = scrape_events(current_url)
    time.sleep(5)
    current_url = next_page(current_url)
    time.sleep(5)
    df_all = pd.concat([df_all, df])
df_all.head()

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |


In [19]:
df_all.sort_values('Number_of_Attendees', ascending=False)

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Charlotte de Witte, Monoloc, Victor Ruiz | Avant Gardner | Sat, 13 Apr | 1800.0 |
| 0 | Zero presents... The Masquerade | The 1896 | Sat, 6 Apr | 919.0 |
| 65 | Secret Solstice Pre-Party (Free Entry): Metro ... | Kings Hall - Avant Gardner | Thu, 18 Apr | 670.0 |
| 0 | Nina Kraviz / James Murphy / Justin Cudmore | Knockdown Center | Sat, 20 Apr | 501.0 |
| 89 | Stavroz live! presented by Zero | The Williamsburg Hotel | Fri, 12 Apr | 481.0 |
| ... | ... | ... | ... | ... |
| 56 | 420: A Musical Experience | The Kraine Theater | Mon, 22 Apr | NaN |
| 61 | 420: A Musical Experience | The Kraine Theater | Tue, 23 Apr | NaN |
| 75 | 420: A Musical Experience | The Kraine Theater | Wed, 24 Apr | NaN |
| 34 | Klandestino Brunch with Electronic Music | Avena Downtown | Sat, 27 Apr | NaN |


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!