# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills again on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing many pages of a website, handling errors, and storing data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility, we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve an archived version of the events listing for the week of March 30, 2019.
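
Wayback Machine snapshot URLs embed a 14-digit capture timestamp followed by the original address. The sketch below takes one apart using only the standard library; the helper name `split_wayback_url` is hypothetical, and the URL is the snapshot used later in this lab.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"

def split_wayback_url(snapshot_url):
    """Return (capture_time, original_url) for a Wayback Machine snapshot URL."""
    if not snapshot_url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine snapshot URL")
    remainder = snapshot_url[len(WAYBACK_PREFIX):]
    # The first path segment is the capture timestamp (YYYYMMDDhhmmss)
    timestamp, original_url = remainder.split("/", 1)
    capture_time = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return capture_time, original_url

capture_time, original_url = split_wayback_url(
    "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"
)
print(capture_time)   # when the snapshot was captured
print(original_url)   # the page that was archived
```

The capture time and the `week` query parameter are independent: the first says when the archive saved the page, the second says which week of events the page lists.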

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open your web browser's inspect element feature to preview the underlying HTML of the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a Pandas DataFrame with columns for the `Event_Name`, `Venue`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.
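
As a warm-up before working with the live response, here is a small sketch of the `find` pattern on a hand-written HTML fragment. The fragment is a simplified, hypothetical stand-in for one event listing: only the `data-test-id="event-listing-heading"` attribute and the event name come from the archived page, while the venue text and the `attendance` attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment modeled on one event listing
sample_html = """
<ul>
  <li>
    <h3><a data-test-id="event-listing-heading" href="/events/1242553">Konkrete Jungle - April Fool's Fest</a></h3>
    <div height="30">Example Venue</div>
  </li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match a tag by one of its attributes rather than by a generated class name
heading = sample_soup.find("a", attrs={"data-test-id": "event-listing-heading"})
event_name = heading.text

# find() returns None when no tag matches, so check before reading .text
# ("attendance" is a made-up attribute value that is not in the fragment)
missing = sample_soup.find("span", attrs={"data-test-id": "attendance"})
attendees_text = missing.text if missing is not None else None

print(event_name)
print(attendees_text)
```

Because `find` returns `None` for a missing element, reading `.text` on the result raises an `AttributeError`, so any helper that extracts a field should account for that case.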

In [1]:
# Relevant imports
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [4]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"


In [None]:

html_page = requests.get(EVENTS_PAGE_URL)
soup = BeautifulSoup(html_page.content, 'html.parser')

In [31]:
# Find the container with event listings in it
container = soup.find('div', attrs={"data-tracking-id": "events-all"})
print(len(container))

2


In [33]:
# Find a list of events by date within that container
all_events = container.find('ul').find('li')
march_31 = all_events.text.find("Sun, 31 Mar")
print(all_events.text[march_31:march_31 + 100])

Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New 


In [34]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers
dates = all_events.findChildren(recursive=False)
mar30 = dates[0]
print(len(dates))
print(f'March 30 - {mar30}')


13
March 30 - <div class="Box-omzyfs-0 fYkcJU"><div class="Box-omzyfs-0 SectionStyledBox-tvjxx0-0 eEYxFS sticky-header"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk"><span class="Text-sc-1t0gn2o-0 dvCBwl" color="accent" font-weight="normal"><span class="Text-sc-1t0gn2o-0 gSvLLX" color="accent" font-weight="normal"≯</span>Sat, 30 Mar</span></h3></div><hr class="Divider__HorizontalDivider-sc-1qsmuc-0 klshtO"/><ul class="Grid__GridStyled-sc-1l00ugd-0 fuNsvk grid" data-test-id="ticketed-event"><li class="Column-sc-18hsrnn-0 jHShKh"><div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="ticketed-event"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" href="/web/20210326225933/https://ra.co/events/1234892"><span class="Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf" color="primary" data-test-id="event-listing-heading" data-tracking-id="

In [35]:
# Extract the name, venue, and number of attendees from one of the
# events within that container

apr1 = dates[4]
apr1_events = apr1.findChildren("ul")
print(len(apr1_events))
apr1_event = apr1_events[1]
print(apr1_event)

4
<ul class="Grid__GridStyled-sc-1l00ugd-0 fuNsvk grid" data-test-id="non-ticketed-event"><li class="Column-sc-18hsrnn-0 jHShKh"><div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="non-ticketed-event"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-heading" data-tracking-id="/events/1242553" href="/web/20210326225933/https://ra.co/events/1242553"><span class="Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf" color="primary" data-test-id="event-listing-heading" data-tracking-id="/events/1242553" font-weight="normal" href="/events/1242553">Konkrete Jungle - April Fool's Fest</span></a></h3><div class="Box-omzyfs-0 sc-AxjAm jVLhoy"><span class="Text-sc-1t0gn2o-0 dWOMtb" color="primary" font-weight="normal">Wavewhore</span></div><div class="Box-omzyfs-0 sc-AxjAm fCFvgO"><div class="Box-omzyfs-0 sc-AxjAm ebaaK"><div class="Box-omzyfs-0 sc-AxjAm fOOuYI" height="30"><div class="Box-omzyfs-0 sc-Axj

In [6]:
def date_finder(date):
    try:
        event_date = date.find('h3').text.strip("'̸")
        return event_date
    except AttributeError:  # find() returned None, so there is no .text
        return 'None'

In [7]:
def name_finder(event):
    try:
        event_name = event.find('a', attrs={"data-test-id": "event-listing-heading"}).text
        return event_name
    except AttributeError:  # find() returned None, so there is no .text
        return 'None'

In [8]:
def artists_finder(event):
    try:
        event_artists = event.find('div', class_='Box-omzyfs-0 sc-AxjAm jVLhoy').text
        return event_artists
    except AttributeError:  # find() returned None, so there is no .text
        return 'None'

In [9]:
def venue_finder(event): 
    try:
        event_venue = event.find('div', attrs={'height':'30'}).text
        return event_venue
    except AttributeError:  # find() returned None, so there is no .text
        return 'None'

In [10]:
def attendees_finder(event):
    try:
        text_event_attendees = event.find('span', class_='Text-sc-1t0gn2o-0 hhfigA').text
        # Convert the attendee count to an integer
        return int(text_event_attendees)
    except (AttributeError, ValueError):
        # No attendee element was found, or its text was not a number
        return 0

In [53]:
# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe

date_list=[]
name_list=[]
artists_list=[]
venue_list=[]
attendees_list=[]

for date in dates:
    events = date.findChildren("ul")
    for event in events:
        date_list.append(date_finder(date))
        name_list.append(name_finder(event))
        artists_list.append(artists_finder(event))
        venue_list.append(venue_finder(event))
        attendees_list.append(attendees_finder(event))

df = pd.DataFrame([date_list, name_list, artists_list, venue_list, attendees_list]).transpose()
df.columns=('date', 'event name', 'announced artists', 'venue', 'confirmed attendees')
df

Unnamed: 0,date,event name,announced artists,venue,confirmed attendees
0,"Sat, 30 Mar",UnterMania II,"Mary Yuzovskaya, Manni Dee, Umfang, Juana, The...",TBA - New York,TBA - New York
1,"Sat, 30 Mar","Cocoon New York: Sven Väth, Ilario Alicante, B...","Sven Vath, Butch, Taimur, Ilario Alicante",99 Scott Ave,407
2,"Sat, 30 Mar",Horse Meat Disco - New York Residency,"Horse Meat Disco, The Carry Nation, Amber Vale...",Elsewhere,375
3,"Sat, 30 Mar",Rave: Underground Resistance All Night,"Nomadico, Mark Flash",Nowadays,232
4,"Sat, 30 Mar","Believe You Me // Beta Librae, Stephan Kimbel,...","Beta Librae, Stephan Kimbel",TBA - New York,TBA - New York
...,...,...,...,...,...
114,"Fri, 5 Apr",A Night at the Baths,"Manchildblack, The Illustrious Blacks",C'mon Everybody,1
115,"Fri, 5 Apr",Blaqk Audio,,Music Hall of Williamsburg,1
116,"Fri, 5 Apr",Erik the Lover,,Erv's,1
117,"Fri, 5 Apr",Wax On Vissions,"Interpreter, Silkroad Acid",Starliner,1


In [11]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(events_page_url):
    date_list=[]
    name_list=[]
    artists_list=[]
    venue_list=[]
    attendees_list=[]
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    container = soup.find('div', attrs={"data-tracking-id": "events-all"})
    all_events = container.find('ul').find('li')
    dates = all_events.findChildren(recursive=False)
    for date in dates:
        events = date.findChildren("ul")
        for event in events:
            date_list.append(date_finder(date))
            name_list.append(name_finder(event))
            artists_list.append(artists_finder(event))
            venue_list.append(venue_finder(event))
            attendees_list.append(attendees_finder(event))
    df = pd.DataFrame([name_list, artists_list, venue_list, date_list, attendees_list]).transpose()        
    df.columns = ["Event_Name", "Announced Artists", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [12]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

Unnamed: 0,Event_Name,Announced Artists,Venue,Event_Date,Number_of_Attendees
0,UnterMania II,"Mary Yuzovskaya, Manni Dee, Umfang, Juana, The...",TBA - New York,"Sat, 30 Mar",0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...","Sven Vath, Butch, Taimur, Ilario Alicante",99 Scott Ave,"Sat, 30 Mar",407
2,Horse Meat Disco - New York Residency,"Horse Meat Disco, The Carry Nation, Amber Vale...",Elsewhere,"Sat, 30 Mar",375
3,Rave: Underground Resistance All Night,"Nomadico, Mark Flash",Nowadays,"Sat, 30 Mar",232
4,"Believe You Me // Beta Librae, Stephan Kimbel,...","Beta Librae, Stephan Kimbel",TBA - New York,"Sat, 30 Mar",0
...,...,...,...,...,...
114,A Night at the Baths,"Manchildblack, The Illustrious Blacks",C'mon Everybody,"Fri, 5 Apr",1
115,Blaqk Audio,,Music Hall of Williamsburg,"Fri, 5 Apr",1
116,Erik the Lover,,Erv's,"Fri, 5 Apr",1
117,Wax On Vissions,"Interpreter, Silkroad Acid",Starliner,"Fri, 5 Apr",1


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

The link in that button is a relative path, so make sure you add `https://web.archive.org` to the front to get the full URL.

![next page](images/ra_next.png)
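
The prefixing step can be sketched as follows. The helper name `to_absolute` is hypothetical, and the `relative_href` value is an assumed example of what the button's link looks like for this snapshot. Plain string concatenation is enough here because the path already begins with `/`.

```python
ARCHIVE_BASE = "https://web.archive.org"

def to_absolute(href, base=ARCHIVE_BASE):
    """Prefix a root-relative archive path with the Wayback Machine host."""
    if href.startswith(("http://", "https://")):
        return href  # already absolute, leave it untouched
    return base + href

# Assumed example of a relative link taken from the "Next Week" button
relative_href = "/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06"
print(to_absolute(relative_href))
```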

In [13]:
# Find the button, find the relative path, create the URL for the current `soup`
next_boxes = soup.findAll('div', attrs={'aria-hidden': "true"})
href = next_boxes[1].find('a').attrs['href']
url_base = 'https://web.archive.org'
next_url = url_base+href

In [14]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(url):
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    next_boxes = soup.findAll('div', attrs={'aria-hidden': "true"})
    href = next_boxes[1].find('a').attrs['href']
    url_base = 'https://web.archive.org'
    next_page_url = url_base+href
    return next_page_url

In [15]:
# Test out your function
next_page(EVENTS_PAGE_URL)

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.
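
If a request still fails occasionally, one option is to pause and try again. The following is a minimal sketch under stated assumptions, not part of the lab's required solution: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `fetch` can be any function that takes a URL.

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), pausing and retrying if it raises an exception."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: fetch is caller-supplied
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error

# Usage with a stand-in fetch function that fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch, "https://example.com", delay=0)
print(result, len(calls))
```

In the lab you could pass `requests.get` as `fetch`. Note that `requests.get` raises only for connection-level problems such as timeouts; an HTTP error status is reported as an exception only if you also call `raise_for_status()` on the response.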

In [16]:
# Your code here
import time
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"

def frame_events(url=EVENTS_PAGE_URL, min_rows=500):
    # Scrape one week at a time until at least `min_rows` events are collected
    frames = []
    total_rows = 0
    while total_rows < min_rows:
        new_df = scrape_events(url)
        frames.append(new_df)
        total_rows += len(new_df)
        # Move on to the following week's page
        url = next_page(url)
        # Pause between requests to avoid rate limiting
        time.sleep(1)
    full_df = pd.concat(frames)
    print(len(full_df))
    return full_df


In [17]:
full_df = frame_events()
full_df.sort_values(by=['Number_of_Attendees'], ascending = False)

606


Unnamed: 0,Event_Name,Announced Artists,Venue,Event_Date,Number_of_Attendees
0,Zero presents... The Masquerade,"Madmotormiquel, Chris Schwarzwälder, Lovecraft...",The 1896,"Sat, 6 Apr",919
65,Secret Solstice Pre-Party (Free Entry): Metro ...,"Alexi Delano, Metro Area, Nitin, No Regular Play",Kings Hall - Avant Gardner,"Thu, 18 Apr",670
0,Nina Kraviz / James Murphy / Justin Cudmore,"Mike Huckaby, Shit Robot, Nina Kraviz, James M...",Knockdown Center,"Sat, 20 Apr",501
89,Stavroz live! presented by Zero,Stavroz,The Williamsburg Hotel,"Fri, 12 Apr",481
91,Teksupport: Honey Dijon (All Night Long) Sold Out,Honey Dijon,99 Scott Ave,"Fri, 5 Apr",463
...,...,...,...,...,...
7,[CANCELLED] Yokoo: Nothing Can Compare - Album...,,TBA - New York,"Sat, 27 Apr",0
6,Coalesquerade,"Drewsky, Atman",TBA - New York,"Sat, 27 Apr",0
104,Cavali NY Nightclub Kourture Fridays Everyone ...,,TBA - New York,"Fri, 12 Apr",0
113,I Feel: Superhero Utopia,"Miyagi, Alex Cecil, Kate Stein",TBA - New York,"Fri, 12 Apr",0


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!