# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by opening the [events page](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open your browser's Inspect Element feature to preview the underlying HTML of the page.

In [2]:
# Open the inspect element feature in your browser
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [136]:
page = requests.get('https://www.residentadvisor.net/events/us/newyork/')
soup = BeautifulSoup(page.content, 'html.parser')

In [137]:
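# Container that the helper functions below search for events
event_listings = soup.find('ul', id='items')
# First <div class="bbox"> block, kept only for inspection
event_listings1 = event_listings.find('div', class_='bbox')
# Peek at the text of the first link in the container
event_listings.find('a').text
ev = event_listings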

In [138]:
def event_names(ev):
    names = []
    for x in ev.find_all('div', class_='bbox'):
        names.append(x.find('a').text)
    return names
event_names(ev)

['Disorient presents: Country Club X: Astrodynamica',
 "Virtual Thursday: A Panel with Candidates of New York's DSA-Slate",
 'Virtual Thursday: Planetarium with Davis Galvin and Father of Two',
 'NYC MDW Kickoff Hip Hop vs Reggae® Yacht Party 2020',
 'Disorient presents: Country Club X: Astrodynamica',
 'Virtual Friday: Ash Lauryn and Musclecars',
 'Eggs and Toast',
 '[CANCELED] 𝐄 𝐗 𝐓 𝐄 𝐍 𝐃 𝐄 𝐃 ⇆ Speedy J // Shlømo // Tapefeed',
 '[CANCELLED] The Bunker with Ben UFO x Joy Orbison, Forest Drive West',
 '[CANCELLED] Jon Hopkins (DJ Set), Gee Dee, Timo Lee, Chittom and Eric From America',
 'Disorient presents: Country Club X: Astrodynamica',
 "Tony Humphries' Le Bain Residency",
 'Multiple Man, Figure Section',
 'Virtual Saturday: Breakwave and Lee Gamble',
 'NYC Memorial Day Sunday Yacht Party Cruise 2020',
 'Disorient presents: Country Club X: Astrodynamica',
 'Virtual Sunday: Mister Sunday Season Opener']

In [139]:
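def event_venues(ev):
    venues = []
    for x in ev.find_all('div', class_='bbox'):
        # The venue link is looked up inside the first <span>; some events have none
        span = x.find('span')
        hold = span.find('a') if span is not None else None
        if hold is not None:
            hold = hold.text
        venues.append(hold)
    return venues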
event_venues(ev)

[None,
 'Nowadays',
 'Nowadays',
 'Skyport Marina',
 None,
 'Nowadays',
 None,
 '23 Meadow',
 'Market Hotel',
 'Good Room',
 None,
 'Le Bain',
 'Saint Vitus',
 'Nowadays',
 'Skyport Marina',
 None,
 'Nowadays']

In [144]:
import re

def event_dates(ev):
    # Match <article> tags whose full class string contains 'event-item '
    regex = re.compile('event-item (.*)')
    dates = []
    for x in ev.find_all('article', class_=regex):
        dates.append(x.find('time').text)
    return dates
event_dates(ev)

['2020-05-21T00:00',
 '2020-05-21T00:00',
 '2020-05-21T00:00',
 '2020-05-22T00:00',
 '2020-05-22T00:00',
 '2020-05-22T00:00',
 '2020-05-22T00:00',
 '2020-05-22T00:00',
 '2020-05-22T00:00',
 '2020-05-22T00:00',
 '2020-05-23T00:00',
 '2020-05-23T00:00',
 '2020-05-23T00:00',
 '2020-05-23T00:00',
 '2020-05-24T00:00',
 '2020-05-24T00:00',
 '2020-05-24T00:00']

In [212]:
def event_attendees(ev):
    attendees = []
    for x in ev.find_all('p', class_='attending'):
        attendees.append(int(x.find('span').text))
    return attendees
event_attendees(ev)

[4, 2, 2, 1, 4, 4, 2, 311, 112, 52, 4, 2, 2, 2, 3, 4, 1]

In [193]:
def scrape_events(events_page_url):
    """Return a DataFrame of the events listed on one events page."""
    page = requests.get(events_page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    ev = soup.find('ul', id='items')
    names = event_names(ev)
    venues = event_venues(ev)
    dates = event_dates(ev)
    attendees = event_attendees(ev)
    # One list per column; transpose so that each event becomes a row
    df = pd.DataFrame([names, venues, dates, attendees]).T
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [194]:
scrape_events('https://www.residentadvisor.net/events/us/newyork/')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Disorient presents: Country Club X: Astrodynamica | | 2020-05-21T00:00 | 4 |
| 1 | Virtual Thursday: A Panel with Candidates of N... | Nowadays | 2020-05-21T00:00 | 2 |
| 2 | Virtual Thursday: Planetarium with Davis Galvi... | Nowadays | 2020-05-21T00:00 | 2 |
| 3 | NYC MDW Kickoff Hip Hop vs Reggae® Yacht Party... | Skyport Marina | 2020-05-22T00:00 | 1 |
| 4 | Disorient presents: Country Club X: Astrodynamica | | 2020-05-22T00:00 | 4 |
| 5 | Virtual Friday: Ash Lauryn and Musclecars | Nowadays | 2020-05-22T00:00 | 4 |
| 6 | Eggs and Toast | | 2020-05-22T00:00 | 2 |
| 7 | [CANCELED] 𝐄 𝐗 𝐓 𝐄 𝐍 𝐃 𝐄 𝐃 ⇆ Speedy J // Shløm... | 23 Meadow | 2020-05-22T00:00 | 311 |
| 8 | [CANCELLED] The Bunker with Ben UFO x Joy Orbi... | Market Hotel | 2020-05-22T00:00 | 112 |
| 9 | [CANCELLED] Jon Hopkins (DJ Set), Gee Dee, Tim... | Good Room | 2020-05-22T00:00 | 52 |
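The column-by-column approach above only lines up if every event yields exactly one name, venue, date, and attendee count. If an event has no `attending` paragraph, the attendee list comes out shorter than the others and the remaining counts shift up. A minimal sketch of an alternative is to read all four fields from each event's `article` in a single pass. The `SAMPLE_HTML` markup and the `parse_events` and `text_or_none` helpers below are hypothetical, modeled only on the tags and classes that the lab's selectors use, not on the site's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption modeled on the selectors used in this lab.
SAMPLE_HTML = """
<ul id="items">
  <li><article class="event-item clearfix">
    <time>2020-05-21T00:00</time>
    <div class="bbox">
      <h1><a href="#">Event A</a> <span>at <a href="#">Venue A</a></span></h1>
      <p class="attending"><span>4</span> Attending</p>
    </div>
  </article></li>
  <li><article class="event-item clearfix">
    <time>2020-05-22T00:00</time>
    <div class="bbox">
      <h1><a href="#">Event B</a> <span>at TBA</span></h1>
    </div>
  </article></li>
</ul>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_events(ev):
    """Build one dict per event so a missing field cannot shift other rows."""
    rows = []
    for article in ev.find_all('article', class_='event-item'):
        box = article.find('div', class_='bbox')
        span = box.find('span') if box is not None else None
        attending = article.find('p', class_='attending')
        count = text_or_none(attending.find('span')) if attending is not None else None
        rows.append({
            'Event_Name': text_or_none(box.find('a')) if box is not None else None,
            'Venue': text_or_none(span.find('a')) if span is not None else None,
            'Event_Date': text_or_none(article.find('time')),
            'Number_of_Attendees': int(count) if count is not None else None,
        })
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = parse_events(soup.find('ul', id='items'))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give the same four columns as `scrape_events`, with missing venues and counts left empty in the row they belong to.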


## Write a Function to Retrieve the URL for the Next Page

In [202]:
page = requests.get('https://www.residentadvisor.net/events/us/newyork/')
soup = BeautifulSoup(page.content, 'html.parser')
soup.find('li', id='liNext').find('a').attrs['href']

50


In [205]:
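def next_page(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Drop the first 19 characters of the href (the length of '/events/us/newyork/')
    ext = soup.find('li', id='liNext').find('a').attrs['href'][19:]
    # The first 50 characters of url are 'https://www.residentadvisor.net/events/us/newyork/'
    next_page_url = url[:50] + ext
    return next_page_url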

next_page('https://www.residentadvisor.net/events/us/newyork/')

'https://www.residentadvisor.net/events/us/newyork/week/2020-05-28'
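The fixed slice positions in `next_page` only work for this one city URL. As a sketch of a less brittle option, the standard library's `urljoin` can resolve the link against the page it came from. The `href` value below is an assumption: a root-relative path inferred from the `[19:]` slice and the output above. The `resolve_next_url` helper is hypothetical.

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, href):
    """Resolve a 'next page' href against the URL of the page it was found on."""
    return urljoin(current_url, href)


base = 'https://www.residentadvisor.net/events/us/newyork/'
# Assumed href: a root-relative link like the one the liNext anchor appears to hold.
href = '/events/us/newyork/week/2020-05-28'

next_url = resolve_next_url(base, href)
print(next_url)
```

Because `urljoin` handles root-relative and absolute links alike, the same call works for any region's events page without counting characters.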

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [213]:
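url = 'https://www.residentadvisor.net/events/us/newyork/'
master_df = pd.DataFrame()
# Note: this run stops at 100 events; raise the threshold to 1000 to match the task above
while len(master_df) < 100:
    print(url)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    master_df = pd.concat([master_df, scrape_events(url)])
    print(len(master_df))
    url = next_page(url)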

https://www.residentadvisor.net/events/us/newyork/
17
https://www.residentadvisor.net/events/us/newyork/week/2020-05-28
25
https://www.residentadvisor.net/events/us/newyork/week/2020-06-04
31
https://www.residentadvisor.net/events/us/newyork/week/2020-06-11
41
https://www.residentadvisor.net/events/us/newyork/week/2020-06-18
48
https://www.residentadvisor.net/events/us/newyork/week/2020-06-25
54
https://www.residentadvisor.net/events/us/newyork/week/2020-07-02
61
https://www.residentadvisor.net/events/us/newyork/week/2020-07-09
71
https://www.residentadvisor.net/events/us/newyork/week/2020-07-16
80
https://www.residentadvisor.net/events/us/newyork/week/2020-07-23
83
https://www.residentadvisor.net/events/us/newyork/week/2020-07-30
86
https://www.residentadvisor.net/events/us/newyork/week/2020-08-06
87
https://www.residentadvisor.net/events/us/newyork/week/2020-08-13
92
https://www.residentadvisor.net/events/us/newyork/week/2020-08-20
94
https://www.residentadvisor.net/events/us/newyork
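The loop above has no guard against a failed request or a page without a `liNext` element, either of which would raise partway through and leave the run incomplete. The sketch below shows one way to add that handling. `collect_events` is a hypothetical helper, and the `fake_pages` dictionary is a stand-in so the example runs offline. In the lab, the two callables would be `scrape_events` and `next_page`.

```python
import time

import pandas as pd


def collect_events(start_url, scrape_page, get_next_url,
                   target=100, max_pages=50, delay=0.0):
    """Follow 'next' links until `target` rows are collected or a page fails."""
    frames, total, url, seen = [], 0, start_url, set()
    while url and url not in seen and total < target and len(seen) < max_pages:
        seen.add(url)
        try:
            frames.append(scrape_page(url))
            total += len(frames[-1])
            url = get_next_url(url)
        except Exception as exc:  # deliberately broad for this sketch
            print(f'Stopping at {url}: {exc}')
            break
        time.sleep(delay)  # pause between requests to go easy on the server
    if not frames:
        return pd.DataFrame()
    # Concatenate once at the end instead of growing a DataFrame in the loop
    return pd.concat(frames, ignore_index=True)


# Stand-in pages (hypothetical data): each maps to (DataFrame, next url).
fake_pages = {
    'page1': (pd.DataFrame({'Event_Name': ['A', 'B']}), 'page2'),
    'page2': (pd.DataFrame({'Event_Name': ['C']}), 'page3'),
}

result = collect_events(
    'page1',
    scrape_page=lambda u: fake_pages[u][0],
    get_next_url=lambda u: fake_pages[u][1],
    target=10,
)
print(result)
```

The `seen` set stops the loop if a "next" link points back to a page that was already visited, and `max_pages` caps the run even when the target is never reached.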

In [217]:
master_df.reset_index(inplace=True)

In [218]:
# Most attended first; ties are broken by the earlier event date
master_df.sort_values(['Number_of_Attendees', 'Event_Date'], ascending=[False, True])

| | index | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|---|
| 30 | 5 | [POSTPONED] All Day I Dream Summer Season Opening | Brooklyn Mirage | 2020-06-07T00:00 | 437 |
| 7 | 7 | [CANCELED] 𝐄 𝐗 𝐓 𝐄 𝐍 𝐃 𝐄 𝐃 ⇆ Speedy J // Shløm... | 23 Meadow | 2020-05-22T00:00 | 311 |
| 80 | 0 | Can't Stop The Feeling Midnight Yacht Cruise | Harbor Lights Yacht | 2020-07-24T00:00 | 218 |
| 91 | 4 | Lane 8 - Brightest Lights Tour (Sunday) - resc... | Brooklyn Mirage | 2020-08-16T00:00 | 137 |
| 102 | 3 | Live It Up Midnight Yacht Cruise | Harbor Lights Yacht | 2020-09-26T00:00 | 122 |
| ... | ... | ... | ... | ... | ... |
| 82 | 2 | Elrow NYC - Rowsattacks - Postponed | Avant Gardner | 2020-07-25T00:00 | |
| 86 | 0 | A Midsummer Night's Dream Midnight Yacht Cruise | Harbor Lights Yacht | 2020-08-08T00:00 | |
| 93 | 1 | Summer Obsession Midnight Yacht Cruise | Harbor Lights Yacht | 2020-08-21T00:00 | |
| 97 | 0 | Shake That Midnight Yacht Cruise | Harbor Lights Yacht | 2020-09-12T00:00 | |
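Because `scrape_events` builds its DataFrame by transposing a list of lists, the resulting columns hold generic Python objects rather than numbers or dates. A minimal sketch of making the sort order explicit is to convert both sort columns first and then sort by attendees with event date as the tie-breaker. The rows below are hypothetical and only mimic the shape of `master_df`.

```python
import pandas as pd

# Hypothetical rows shaped like master_df, including a missing attendee count.
df = pd.DataFrame({
    'Event_Name': ['A', 'B', 'C', 'D'],
    'Event_Date': ['2020-06-07T00:00', '2020-05-22T00:00',
                   '2020-05-21T00:00', '2020-07-25T00:00'],
    'Number_of_Attendees': [4, 311, 4, None],
})

# Convert to real numeric and datetime types so comparisons are well defined
df['Number_of_Attendees'] = pd.to_numeric(df['Number_of_Attendees'], errors='coerce')
df['Event_Date'] = pd.to_datetime(df['Event_Date'])

# Most attended first; ties go to the earlier date; missing counts sink to the bottom
ranked = df.sort_values(
    ['Number_of_Attendees', 'Event_Date'],
    ascending=[False, True],
    na_position='last',
).reset_index(drop=True)
print(ranked)
```

Events A and C both have four attendees here, so C, which falls on the earlier date, is listed before A.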


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!