# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve a version of this page from March 2019.

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a Pandas DataFrame with columns for the `Event_Name`, `Venue`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.

In [1]:
# Relevant imports
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import numpy as np

In [2]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"

# Exploration: making the request and parsing the response
url = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"
html_page = requests.get(url) 
soup = BeautifulSoup(html_page.content, 'html.parser')
soup.prettify

<bound method Tag.prettify of <!DOCTYPE html>
<html lang="en"><head><script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
<script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb-app203.us.archive.org';v.server_ms=417;archive_analytics.send_pageview({});});</script>
<script charset="utf-8" src="/_static/js/bundle-playback.js?v=UfTkgsKx" type="text/javascript"></script>
<script charset="utf-8" src="/_static/js/wombat.js?v=UHAOicsW" type="text/javascript"></script>
<script type="text/javascript">
  __wm.init("https://web.archive.org/web");
  __wm.wombat("https://ra.co/events/us/newyork?week=2019-03-30","20210326225933","https://web.archive.org/","web","/_static/",
	      "1616799573");
</script>
<link href="/_static/css/banner-styles.css?v=omkqRugM" rel="stylesheet" type="text/css"/>
<link href="/_static/css/iconochive.css?v=qtvMKcIJ" rel="stylesheet" type=

In [3]:
# Find the container with event listings in it
events_all = soup.find('div', {'data-tracking-id':'events-all'})
events_all.prettify

<bound method Tag.prettify of <div class="Box-omzyfs-0 sc-AxjAm cshLAo" data-tracking-id="events-all"><div class="Box-omzyfs-0 sc-AxjAm htYgja"></div><div><ul class="Grid__GridStyled-sc-1l00ugd-0 fHKIhJ grid"><li class="Column-sc-18hsrnn-0 gnwWng"><div class="Box-omzyfs-0 fYkcJU"><div class="Box-omzyfs-0 SectionStyledBox-tvjxx0-0 eEYxFS sticky-header"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fwuoVk"><span class="Text-sc-1t0gn2o-0 dvCBwl" color="accent" font-weight="normal"><span class="Text-sc-1t0gn2o-0 gSvLLX" color="accent" font-weight="normal"≯</span>Sat, 30 Mar</span></h3></div><hr class="Divider__HorizontalDivider-sc-1qsmuc-0 klshtO"/><ul class="Grid__GridStyled-sc-1l00ugd-0 fuNsvk grid" data-test-id="ticketed-event"><li class="Column-sc-18hsrnn-0 jHShKh"><div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="ticketed-event"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-he

In [7]:
# Find a list of events by date within that container
events_list = events_all.find('ul').find('li')

In [8]:
events_list.text[:150]

'̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457Cocoon New York: Sven Väth, Ilario Al'

In [10]:
events_by_dates = events_list.findChildren(recursive=False)

In [11]:
events_by_dates[0].text[:100]

'̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRA'

In [12]:
events_by_dates[1].text

''

In [13]:
events_by_dates[2].text[:100]

'̸Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New'

In [14]:
date = events_by_dates[0].find("div", class_="sticky-header").text
date

'̸Sat, 30 Mar'

In [16]:
# Extract the name, venue, and number of attendees from one of the
# events within that container
event = events_by_dates[0].findChildren('ul')
event[0].text


'UnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457'

In [17]:
name = event[0].find("h3").text
name

'UnterMania II'

In [21]:
venue_and_attendees = event[0].findAll("div", {"height": 30})

In [22]:
venue = venue_and_attendees[0].text
venue

'TBA - New York'

In [26]:
len(venue_and_attendees)

3

In [27]:
attend = venue_and_attendees[-1].text
attend

'457'

In [28]:
print("Name -", name)
print("Venue -", venue)
print("Date -", date.strip("'̸"))
print("Attendence -", attend)

Name - UnterMania II
Venue - TBA - New York
Date - Sat, 30 Mar
Attendence - 457


In [30]:
# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe

rows = []

for event_date in events_by_dates:
    if not event_date.text:
        continue
    date = event_date.find("div", class_="sticky-header").text
    date = date.strip("'̸")
    events = event_date.findChildren('ul')
    for event in events:
        name = event.find("h3").text
        venue_and_attendees = event.findAll("div", {"height": 30})
        venue = venue_and_attendees[0].text   
        if len(venue_and_attendees) == 3:
            num_attendees = int(venue_and_attendees[-1].text)
        else:
            num_attendees = np.nan
        rows.append([name, venue, date, num_attendees])
    
    
df = pd.DataFrame(rows)
df.head()
   

Unnamed: 0,0,1,2,3
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375.0
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89.0


In [31]:
len(df)

119

In [38]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(events_page_url):
    #Your code here
    html_page = requests.get(events_page_url) 
    soup = BeautifulSoup(html_page.content, 'html.parser')
    events_all = soup.find('div', attrs={'data-tracking-id':'events-all'})
    events_list = events_all.find('ul').find('li')
    events_by_dates = events_list.findChildren(recursive=False)
    rows = []
    for event_date in events_by_dates:
        if not event_date.text:
            continue
        date = event_date.find("div", class_="sticky-header").text
        date = date.strip("'̸")
        events = event_date.findChildren('ul')
        for event in events:
            name = event.find("h3").text
            venue_and_attendees = event.findAll("div", attrs={"height": 30})
            venue = venue_and_attendees[0].text   
            if len(venue_and_attendees) == 3:
                num_attendees = int(venue_and_attendees[-1].text)
            else:
                num_attendees = np.nan
            rows.append([name, venue, date, num_attendees])
    df = pd.DataFrame(rows)
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [39]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375.0
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89.0
...,...,...,...,...
114,A Night at the Baths,C'mon Everybody,"Fri, 5 Apr",
115,Blaqk Audio,Music Hall of Williamsburg,"Fri, 5 Apr",
116,Erik the Lover,Erv's,"Fri, 5 Apr",
117,Wax On Vissions,Starliner,"Fri, 5 Apr",


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

This is a relative path, so make sure you add `https://web.archive.org` to the front to get the URL.

![next page](images/ra_next.png)

In [42]:
# Find the button, find the relative path, create the URL for the current `soup`
buttons = soup.find("div", attrs={"aria-hidden": 'true'})
buttons

<div aria-hidden="true" class="Box-omzyfs-0 sc-AxjAm Panel__StyledAlignment-sc-1udo2qh-0 ArchiveNavigator___StyledPanel-x733n4-1 gUdql"><a class="ArchiveNavigator__StyledLink-x733n4-0 fXcSbA" data-tracking-id="/events/us/newyork?week=2019-03-23" href="/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-23">Sat, 23 Mar</a><div class="Box-omzyfs-0 sc-AxjAm NavigationItem___StyledAlignment-t52po1-0 iHjmUI"><span class="Text-sc-1t0gn2o-0 dtYoLc" color="secondary" font-weight="normal">Previous week</span><div class="Box-omzyfs-0 sc-AxjAm fsoYzJ" width="100%"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-tracking-id="/events/us/newyork?week=2019-03-23" display="block" href="/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-23"><span class="Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 eIXPIq" color="primary" data-tracking-id="/events/us/newyork?week=2019-03-23" display="block" font-weight="normal" href="/events/us/newyork?week=2019-03-23">Sat, 23 Mar</span></a

In [43]:
next_arrow = soup.find("svg", attrs={"aria-label": "Right arrow"})
next_arrow

<svg aria-label="Right arrow" height="100%" viewbox="0 0 24 24" width="100%"><g fill="none" fill-rule="evenodd"><path d="M0 0h24v24H0z" fill="none"></path><path d="M8.293 6.707a1 1 0 011.414-1.414l6 6a1 1 0 010 1.414l-6 6a1 1 0 11-1.414-1.414L13.586 12 8.293 6.707z" fill="currentColor"></path></g></svg>

In [44]:
next_arrow.parent

<div class="Box-omzyfs-0 sc-AxjAm dQKRnM" color="primary" height="24" width="24"><svg aria-label="Right arrow" height="100%" viewbox="0 0 24 24" width="100%"><g fill="none" fill-rule="evenodd"><path d="M0 0h24v24H0z" fill="none"></path><path d="M8.293 6.707a1 1 0 011.414-1.414l6 6a1 1 0 010 1.414l-6 6a1 1 0 11-1.414-1.414L13.586 12 8.293 6.707z" fill="currentColor"></path></g></svg></div>

In [47]:
link = next_arrow.parent.previous_sibling

In [48]:
next_page = link.get("href")

In [50]:
next_page_url = "https://web.archive.org" + next_page
next_page_url

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

In [51]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(url):
    #Your code here
    next_arrow = soup.find("svg", attrs={"aria-label": "Right arrow"})
    link = next_arrow.parent.previous_sibling
    next_page = link.get("href")
    next_page_url = "https://web.archive.org" + next_page
    return next_page_url

In [52]:
# Test out your function
next_page(EVENTS_PAGE_URL)

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.

In [53]:
import time

In [57]:
# Your code here
concert_df = pd.DataFrame()
current_page = EVENTS_PAGE_URL

while len(concert_df) <= 500:
    scrap = scrape_events(current_page)
    time.sleep(1)
    concert_df = pd.concat([concert_df, scrap])
    current_page = next_page(current_page)
    time.sleep(1)
    
concert_df

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,UnterMania II,TBA - New York,"Sat, 30 Mar",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"Sat, 30 Mar",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"Sat, 30 Mar",375.0
3,Rave: Underground Resistance All Night,Nowadays,"Sat, 30 Mar",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"Sat, 30 Mar",89.0
...,...,...,...,...
111,"Gavin Rayna Russom, Sybil Jason, Fougere & Perrx",Bossa Nova Civic Club,"Fri, 12 Apr",
112,Lqqk Studio with DJ Healthy x Dondero x DJ Fire,Black Flamingo,"Fri, 12 Apr",
113,I Feel: Superhero Utopia,TBA - New York,"Fri, 12 Apr",
114,SUB:MERGE Presents: Sunk:. VI - SUB:Merge vs K...,Sunnyvale,"Fri, 12 Apr",


In [58]:
concert_df.sort_values("Number_of_Attendees", ascending=False)

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,Zero presents... The Masquerade,The 1896,"Sat, 6 Apr",919.0
0,Zero presents... The Masquerade,The 1896,"Sat, 6 Apr",919.0
0,Zero presents... The Masquerade,The 1896,"Sat, 6 Apr",919.0
0,Zero presents... The Masquerade,The 1896,"Sat, 6 Apr",919.0
89,Stavroz live! presented by Zero,The Williamsburg Hotel,"Fri, 12 Apr",481.0
...,...,...,...,...
111,"Gavin Rayna Russom, Sybil Jason, Fougere & Perrx",Bossa Nova Civic Club,"Fri, 12 Apr",
112,Lqqk Studio with DJ Healthy x Dondero x DJ Fire,Black Flamingo,"Fri, 12 Apr",
113,I Feel: Superhero Utopia,TBA - New York,"Fri, 12 Apr",
114,SUB:MERGE Presents: Sunk:. VI - SUB:Merge vs K...,Sunnyvale,"Fri, 12 Apr",


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!