# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills again on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
# import packages

import pandas as pd
pd.options.display.max_rows = 999
from bs4 import BeautifulSoup
import requests
import re

In [4]:
# request page and soupify

url ='https://www.residentadvisor.net/events/uk/london/week/2020-06-15'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser' )
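# NOTE: without parentheses this displays the bound method (whose repr embeds the raw document);
# call soup.prettify() to get the indented HTML string instead
soup.prettify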

<bound method Tag.prettify of <!DOCTYPE html>

<html lang="en,ja,es">
<head id="_x1"><title>
	RA: Events in London, United Kingdom
</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="en,ja,es" http-equiv="content-language"/><meta content="RA: Resident Advisor" name="Description"/><meta content="RA, residentadvisor, resident, advisor, music, ra, events, in, london, united, kingdom" name="Keywords"/><meta content="Resident Advisor" name="Author"/><meta content="Resident Advisor" property="og:site_name"/><meta content="712773712080127" property="fb:app_id"/><link href="/bundles/default-css?v=FkfRVAlFvpndxqgZliJaJOXD-OhkiRFP8nrBK9Pg2R01" rel="stylesheet"/>
<meta content="app-id=981952703, app-argument=ra-guide://search" name="apple-itunes-app"/><link href="/bundles/cat-listings-css?v=qgpSmyPbylOKeJFqy2yvCrTgAsw9yQYcJtLKS_vPO6s1" rel="stylesheet"/>
<link href="/favicon.ico" rel="icon" type="image/vnd.microsoft.icon"/><link color="#000000" href="/images

In [5]:
# construct a container
container = soup.find(id = 'items')

In [6]:
# this is each row within the container, some may contain dates, some may contain event data
rows = container.findAll('li')
len(rows)

39

In [7]:
# find rows that have the date and rows that do not.
a = rows[3].find(class_ = 'eventDate date') # note this returns None
b = rows[2].find(class_ = 'eventDate date') # this will return a <p> tag with the date
print(a, b)

# using the result below, build a condition whereby if the result is None populate a list with NA else parse the tag
# and return the date
type(b)

None <p class="eventDate date"><a href="/events.aspx?ai=13&amp;v=day&amp;mn=6&amp;yr=2020&amp;dy=16"><span>Tue, 16 Jun 2020 /</span></a></p>


bs4.element.Tag
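
The condition described in the comments above can be sketched on a small hand-written snippet. The HTML below is made up for illustration and only mimics the `eventDate date` and `event-title` classes seen on the page; the real markup may differ, and the names `rows_demo` and `dates_demo` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: date rows and event rows are separate <li> elements.
html = """
<ul id="items">
  <li><p class="eventDate date"><a href="#"><span>Mon, 15 Jun 2020 /</span></a></p></li>
  <li><h1 class="event-title"><a href="#">Event A</a></h1></li>
  <li><p class="eventDate date"><a href="#"><span>Tue, 16 Jun 2020 /</span></a></p></li>
  <li><h1 class="event-title"><a href="#">Event B</a></h1></li>
  <li><h1 class="event-title"><a href="#">Event C</a></h1></li>
</ul>
"""

soup_demo = BeautifulSoup(html, "html.parser")
rows_demo = soup_demo.find(id="items").find_all("li")

dates_demo = []
current_date = None  # stays None until the first date row is seen
for row in rows_demo:
    date_tag = row.find(class_="eventDate date")
    if date_tag is not None:
        # a date row: remember its date for the rows that follow
        current_date = date_tag.get_text().strip(" /")
    dates_demo.append(current_date)

print(dates_demo)
```

Every row ends up with a date, so the list stays the same length as the list of rows.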

In [8]:
# traverse the container; searching by h1 heading should narrow things down
eventdata = container.findAll('h1', class_ ='event-title')
len(eventdata)

30

In [9]:
# looks like extracting the text behind the a tag might extract just the event name (this should be Event_Name)
name = eventdata[0].find('a').text

# write a loop that iterates through the container for a list of names that will be used in the dataframe (final_name_list)
final_name_list = [ ]

#set up a list that contains the list of names
#this list will be used by the for loop to populate the final_name_list 
#ONLY where an event-title exists.
name_list = [ ]

# this loop is used to populate a list of event names for use in name_list
for n in range(len(eventdata)):
    name = eventdata[n].find('a').text
    name_list.append(name)

# this loop is used to check if there is an event name in a row, if not, return NaN
# if there is an event name, it takes the actual name of the event from name_list and uses it to populate final_name_list
# final_name_list goes into final dataframe

name_list_counter = 0

for n in range(len(rows)):
    
    if rows[n].find(class_ = 'event-title') is None:
        final_name_list.append('NaN')
    else:
        final_name_list.append(name_list[name_list_counter])
        name_list_counter += 1

    
print(final_name_list, len(final_name_list))

['NaN', "Grace Jones' Meltdown festival", 'NaN', 'Art of Noise Reboot', "Grace Jones' Meltdown festival", 'NaN', 'Art of Noise Reboot', "Grace Jones' Meltdown festival", '[CANCELLED] VØID: Venetian Snares & Big Lad vs Sly & The Family Drone', 'NaN', 'Art of Noise Reboot', 'Throwback Thursdays at PI // Student Drink Deals', 'NaN', '[RESCHEDULED] The Pickle Factory with Prosumer & Tama Sumo All Night Long', 'Chancha Via Circuito', 'Session Victim: Every Friday in June', 'Fabio (Jungle Set) ft Jumping Jack Frost B2B Bryan Gee, DJ Ron, DJ Rap & Ragga Twins', 'Gold Teeth', 'Antics', 'Supa Dupa Fly x Trapeze Basement', '[CANCELLED] Palms Trax b2b Young Marco & Kamma & Masalo', 'NaN', 'NaN', 'Sankeys London Beach Rave', "Byday Bynight - June's Summer Rooftop Party Brixton", 'Chancha Via Circuito', 'Garagebox: 2020 Opening Party', '[RESCHEDULED] Footwrk 005: Cosmic Disco with Pete Herbert', 'UKG Brunch', 'London Boat Party with Free After Party', 'Shut The Front Door: Summer Solstice Day & Nig

In [10]:
eventdata[0]

<h1 class="event-title" itemprop="summary"><a href="/events/1343621" itemprop="url" title="Event details of Grace Jones' Meltdown festival">Grace Jones' Meltdown festival</a> <span>at <a href="/club.aspx?id=19738">Southbank Centre</a></span></h1>
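
One caveat when pulling the venue out of the `<span>` in the heading above: its text has the form `at Southbank Centre`, and `str.strip` treats its argument as a set of characters rather than as a prefix. A minimal sketch of the difference follows; the helper name `venue_from_span_text` and the sample string are illustrative assumptions, not part of the lab.

```python
import re

def venue_from_span_text(span_text):
    """Remove only a leading 'at ' from text such as 'at Southbank Centre'."""
    return re.sub(r"^\s*at\s+", "", span_text).strip()

sample = "at Trapeze Basement"  # illustrative span text

# strip('at ') removes any of 'a', 't', ' ' from BOTH ends, so the final 't' is lost
print(sample.strip("at "))

# removing just the prefix keeps the full venue name
print(venue_from_span_text(sample))
```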

In [11]:
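# extract just the venue
# defining the venue with reference to the event helps to link the event and the venue
# NOTE: str.strip('at ') strips any of the characters 'a', 't' and ' ' from both ends of the string,
# so it can also eat trailing letters of a venue name; re.sub removes only the leading 'at '

venue = re.sub(r'^\s*at\s+', '', eventdata[0].find('span').text).strip()
venue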

# set up a list of venues actually found in the container that will be used to populate final_venue_list
venue_list = [ ]

# set up an empty list of final_venue_list which will be used to construct the dataframe
final_venue_list = [ ]

# this loop is used to populate the venue names actually found in the container that will, in turn, be used to
# populate the final_venue_list
# re.sub removes only the leading 'at '; str.strip('at ') would also strip trailing 'a' or 't' characters from the name
for v in range(len(eventdata)):
    venue = re.sub(r'^\s*at\s+', '', eventdata[v].find('span').text).strip()
    venue_list.append(venue)

# this loop is used to populate the final_venue_list used in the dataframe, it will search each row of the container
# where it does not find a venue name in a row, it will return NaN and populate final_venue_list accordingly
# where it does find a venue name, it will take a name from the venue_list and populate final_venue_list accordingly.

venue_list_counter = 0

for v in range(len(rows)):
    
    if rows[v].find(class_ = 'event-title') is None:
        final_venue_list.append('NaN')
    else:
        final_venue_list.append(venue_list[venue_list_counter])
        venue_list_counter += 1
    
print(final_venue_list, len(final_venue_list))


['NaN', 'Southbank Centre', 'NaN', 'The Jazz Cafe', 'Southbank Centre', 'NaN', 'The Jazz Cafe', 'Southbank Centre', 'fabric', 'NaN', 'The Jazz Cafe', 'Piccadilly Institute', 'NaN', 'The Pickle Factory', 'The Jazz Cafe', 'The Jazz Cafe', 'Camden Assembly', 'Brixton Jamm', 'Fest Camden', 'Trapeze Basement', 'XOYO', 'NaN', 'NaN', 'Studio 338', 'The Prince of Wales', 'The Jazz Cafe', 'TBA - London', 'Simulacra Studio', 'TBA - London', 'Crown Pier', 'Brixton Jamm', 'Forest Road Brewery', 'Chelmsford City Racecourse', 'Apps Court Farm, Surrey', 'Dreamland Margate', 'Fire & Lightbox', 'NaN', 'NaN', 'Dreamland Margate'] 39


In [12]:
# extract just the date
date_tags = container.findAll('p', class_ = 'eventDate date')
date_tags[5].find('a').text.strip(' /')

# write a loop that iterates through the rows of the container to generate a list of event dates
# NOTE: certain dates have more events than others, so rows without a date reuse the most recent date seen
final_date_list = [ ]
current_date = None  # stays None until the first date row is seen

for d in range(len(rows)):
    date_tag = rows[d].find(class_ = 'eventDate date')
    if date_tag is not None:
        current_date = date_tag.find('a').text.strip(' /')
    final_date_list.append(current_date)
    
print(final_date_list, len(final_date_list))

['Mon, 15 Jun 2020', 'Mon, 15 Jun 2020', 'Tue, 16 Jun 2020', 'Tue, 16 Jun 2020', 'Tue, 16 Jun 2020', 'Wed, 17 Jun 2020', 'Wed, 17 Jun 2020', 'Wed, 17 Jun 2020', 'Wed, 17 Jun 2020', 'Thu, 18 Jun 2020', 'Thu, 18 Jun 2020', 'Thu, 18 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Fri, 19 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sat, 20 Jun 2020', 'Sun, 21 Jun 2020', 'Sun, 21 Jun 2020'] 39


In [13]:
# get the number of attendees and convert it to an integer
# NOTE: not every event will have attendees
attend = container.findAll('p', class_ = 'attending')
int(attend[17].text.strip('Attending '))

#set up an empty list, populating with number of attendees specified by the page.
attend_list = [ ]

#set up a final_attend_list to be used to construct the dataframe, this is populated with
#the figures from attend_list where available and populated with 0 if not available.
final_attend_list = [ ]

# this loop populates attend_list and creates a list of number of attendees where they have been specified
for a in range(len(attend)):
    a_num = int(attend[a].text.strip('Attending '))
    attend_list.append(a_num)

# this loop populates the final_attend_list with numbers from attend_list where available; listings
# that do not specify any number of attendees get 0.

attend_list_counter = 0

for a in range(len(rows)):
    if rows[a].find(class_ = 'attending') is None:
        final_attend_list.append(0)
    else:
        final_attend_list.append(attend_list[attend_list_counter])
        attend_list_counter +=1
        
print(final_attend_list, len(final_attend_list))

[0, 30, 0, 7, 30, 0, 7, 30, 24, 0, 1, 0, 0, 75, 13, 13, 2, 1, 1, 1, 63, 0, 0, 199, 63, 13, 7, 6, 2, 2, 1, 0, 21, 0, 0, 0, 0, 0, 4] 39


In [14]:
type(attend_list[1])

int

In [15]:
# put all the lists together and return the length of each list just to ensure they are the same length
final_lists = [final_name_list, final_venue_list, final_date_list, final_attend_list]
y = [len(x) for x in final_lists]
y

[39, 39, 39, 39]
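
Checking the lengths guards against the parallel lists drifting out of step. An alternative that avoids the counters altogether is to walk the rows once and build one record per event. The sketch below runs on made-up HTML that only imitates the class names used above (the `attending` markup in particular is an assumption), and `rows_to_records` is a hypothetical helper name.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure explored above.
html = """
<ul id="items">
  <li><p class="eventDate date"><a href="#"><span>Mon, 15 Jun 2020 /</span></a></p></li>
  <li>
    <h1 class="event-title"><a href="#">Event A</a> <span>at <a href="#">Venue One</a></span></h1>
    <p class="attending"><span>30</span> Attending</p>
  </li>
  <li><p class="eventDate date"><a href="#"><span>Tue, 16 Jun 2020 /</span></a></p></li>
  <li>
    <h1 class="event-title"><a href="#">Event B</a> <span>at <a href="#">Venue Two</a></span></h1>
  </li>
</ul>
"""

def rows_to_records(rows):
    """Build one dict per event row, carrying the most recent date forward."""
    records = []
    current_date = None
    for row in rows:
        date_tag = row.find(class_="eventDate date")
        if date_tag is not None:
            current_date = date_tag.get_text().strip(" /")
        title = row.find(class_="event-title")
        if title is None:
            continue  # not an event row, nothing to record
        venue_text = title.find("span").get_text()
        attending = row.find(class_="attending")
        digits = re.search(r"\d+", attending.get_text()) if attending else None
        records.append({
            "Event_Name": title.find("a").get_text(),
            "Venue": re.sub(r"^\s*at\s+", "", venue_text).strip(),
            "Event_Date": current_date,
            "Number_of_Attendees": int(digits.group()) if digits else 0,
        })
    return records

demo_rows = BeautifulSoup(html, "html.parser").find(id="items").find_all("li")
df_demo = pd.DataFrame(
    rows_to_records(demo_rows),
    columns=["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"],
)
print(df_demo)
```

Because each event's fields are gathered together, there are no placeholder rows to filter out afterwards.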

In [16]:
# construct the dataframe
df = pd.DataFrame(final_lists).transpose()
df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]

# rows where the Event Name is NaN can be removed.
df = df.loc[df['Event_Name'] != 'NaN']

# this resets the index and drops the old index which doesn't work anymore now that the NaNs have been removed.
df = df.reset_index(drop = True)
df

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Grace Jones' Meltdown festival | Southbank Centre | Mon, 15 Jun 2020 | 30 |
| 1 | Art of Noise Reboot | The Jazz Cafe | Tue, 16 Jun 2020 | 7 |
| 2 | Grace Jones' Meltdown festival | Southbank Centre | Tue, 16 Jun 2020 | 30 |
| 3 | Art of Noise Reboot | The Jazz Cafe | Wed, 17 Jun 2020 | 7 |
| 4 | Grace Jones' Meltdown festival | Southbank Centre | Wed, 17 Jun 2020 | 30 |
| 5 | [CANCELLED] VØID: Venetian Snares & Big Lad vs... | fabric | Wed, 17 Jun 2020 | 24 |
| 6 | Art of Noise Reboot | The Jazz Cafe | Thu, 18 Jun 2020 | 1 |
| 7 | Throwback Thursdays at PI // Student Drink Deals | Piccadilly Institute | Thu, 18 Jun 2020 | 0 |
| 8 | [RESCHEDULED] The Pickle Factory with Prosumer... | The Pickle Factory | Fri, 19 Jun 2020 | 75 |
| 9 | Chancha Via Circuito | The Jazz Cafe | Fri, 19 Jun 2020 | 13 |


## Combine the Steps into a Single `scrape_events` Function

In [17]:
# put together everything from before

def scrape_events(events_page_url):
    page = requests.get(events_page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # set up the container
    container = soup.find(id = 'items')
    
    # set up each row of the container, we will be iterating over each row of the container
    rows = container.findAll('li') # each row is identifiable by its tag 'li'
    
    # traverse the container; searching by h1 heading should narrow things down
    eventdata = container.findAll('h1', class_ ='event-title') # eventdata contains both name and venue
    
    # write a loop that iterates through the container for a list of names that will be used in the dataframe (final_name_list)
    final_name_list = [ ]

    #set up a list that contains the list of names
    #this list will be used by the for loop to populate the final_name_list 
    #ONLY where an event-title exists.
    name_list = [ ]

    # this loop is used to populate a list of event names for use in name_list
    for n in range(len(eventdata)):
        name = eventdata[n].find('a').text
        name_list.append(name)

    # this loop is used to check if there is an event name in a row, if not, return NaN
    # if there is an event name, it takes the actual name of the event from name_list and uses it to populate final_name_list
    # final_name_list goes into final dataframe

    name_list_counter = 0

    for n in range(len(rows)):
    
        if rows[n].find(class_ = 'event-title') is None:
            final_name_list.append('NaN')
            
        else:
            
            final_name_list.append(name_list[name_list_counter])
            name_list_counter += 1
   
    # set up a list of venues actually found in the container that will be used to populate final_venue_list
    venue_list = [ ]

    # set up an empty list of final_venue_list which will be used to construct the dataframe
    final_venue_list = [ ]

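    # this loop is used to populate the venue names actually found in the container that will, in turn, be used to
    # populate the final_venue_list
    # re.sub removes only the leading 'at '; str.strip('at ') would also strip trailing 'a' or 't' characters
    for v in range(len(eventdata)):
        venue = re.sub(r'^\s*at\s+', '', eventdata[v].find('span').text).strip()
        venue_list.append(venue)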

    # this loop is used to populate the final_venue_list used in the dataframe, it will search each row of the container
    # where it does not find a venue name in a row, it will return NaN and populate final_venue_list accordingly
    # where it does find a venue name, it will take a name from the venue_list and populate final_venue_list accordingly.

    venue_list_counter = 0

    for v in range(len(rows)):
    
        if rows[v].find(class_ = 'event-title') is None:
            final_venue_list.append('NaN')
        else:
            final_venue_list.append(venue_list[venue_list_counter])
            venue_list_counter += 1 
    
    # write a loop that iterates through the rows of the container to generate a list of event dates
    # NOTE: certain dates have more events than others, so rows without a date reuse the most recent date seen
    final_date_list = [ ]
    date = None  # initialised so that a first row without a date cannot raise UnboundLocalError

    for d in range(len(rows)):
        date_tag = rows[d].find(class_ = 'eventDate date')
        if date_tag is not None:
            date = date_tag.find('a').text.strip(' /')
        final_date_list.append(date)
    
    #set up an empty list, populating with number of attendees specified by the page.
    attend_list = [ ]

    #set up a final_attend_list to be used to construct the dataframe, this is populated with
    #the figures from attend_list where available and populated with 0 if not available.
    final_attend_list = [ ]
    
    # This narrows down into listings that do have attendees
    attend = container.findAll('p', class_ = 'attending')

    # this loop populates attend_list and creates a list of number of attendees where they have been specified
    for a in range(len(attend)):
        a_num = int(attend[a].text.strip('Attending '))
        attend_list.append(a_num)

    # this loop populates the final_attend_list with numbers from attend_list where available; listings
    # that do not specify any number of attendees get 0.

    attend_list_counter = 0

    for a in range(len(rows)):
        
        if rows[a].find(class_ = 'attending') is None:
            final_attend_list.append(0)
            
        else:
            final_attend_list.append(attend_list[attend_list_counter])
            attend_list_counter +=1
    
    final_lists = [final_name_list, final_venue_list, final_date_list, final_attend_list]
    
    # construct the dataframe
    df = pd.DataFrame(final_lists).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]

    # rows where the Event Name is NaN can be removed.
    df = df.loc[df['Event_Name'] != 'NaN']

    # this resets the index and drops the old index which doesn't work anymore now that the NaNs have been removed.
    df = df.reset_index(drop = True)
    
    return df
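
The function above assumes that the request succeeds and that the page contains an element with `id="items"`. Since dealing with errors is one of the lab's objectives, here is a minimal defensive sketch of the fetching step. The name `fetch_container`, the optional `session` argument and the 10-second timeout are assumptions made for illustration, not part of the lab.

```python
import requests
from bs4 import BeautifulSoup

def fetch_container(url, session=None, timeout=10):
    """Return the element with id='items', or None if the request fails or the element is missing."""
    http = session or requests  # a requests.Session (or a test double) can be passed in
    try:
        response = http.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as err:
        print(f"Could not fetch {url}: {err}")
        return None
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.find(id="items")  # find() returns None when nothing matches
```

A caller can then check for `None` and skip the page rather than crash partway through a long scrape.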

In [18]:
scrape_events('https://www.residentadvisor.net/events')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Music Sans Frontiers | The Social | Mon, 18 May 2020 | 1 |
| 1 | Shy b2b Cat All Night Long | TBA - London | Tue, 19 May 2020 | 1 |
| 2 | Yves Tumor & Its Band | Electric Brixton | Wed, 20 May 2020 | 26 |
| 3 | Despacio | The Roundhouse | Thu, 21 May 2020 | 17 |
| 4 | Techno RE-Imagine:06 | Canavan's Peckham Pool Club | Thu, 21 May 2020 | 2 |
| 5 | [POSTPONED] - Dispatch Recordings | E1 London | Fri, 22 May 2020 | 99 |
| 6 | Human Traffic Live, Opening Night with Pete Tong | Printworks | Fri, 22 May 2020 | 24 |
| 7 | Katermukke Showcase: London | Night Tales | Fri, 22 May 2020 | 8 |
| 8 | The Fleetwood Mac Summer Rooftop Party 2020 | The Prince of Wales | Fri, 22 May 2020 | 2 |
| 9 | All Points East 2020 | Victoria Park | Fri, 22 May 2020 | 96 |


## Write a Function to Retrieve the URL for the Next Page

In [19]:
# find the section in the soup where the previous and next buttons can be found
next_container = soup.find('div', class_ = 'page-items content sub clearfix')
next_container

<div class="page-items content sub clearfix">
<ul>
<li class="but arrow-left left" id="liPrevious2">
<a ga-event-action="Previous " ga-event-category="event-listings" ga-on="click" href="/events/uk/london/week/2020-06-08">Previous </a>
</li><li class="but arrow-right right" id="liNext2">
<a ga-event-action="Next " ga-event-category="event-listings" ga-on="click" href="/events/uk/london/week/2020-06-22">Next </a>
</li>
</ul>
</div>

In [20]:
# this will list links to both the previous AND the next buttons
buttons = next_container.findAll('a', href=True)

# the next button is index number 1 in the buttons list, specifying ['href'] like that should get the url for the next button
# (named next_href so that Python's built-in next() is not shadowed)
next_href = buttons[1]['href']
next_href

'/events/uk/london/week/2020-06-22'

In [21]:
# Now, a bit of string manipulation to join everything up, use a literal to complete the url and ensure it is a string
# This is what the full url looks like:
# https://www.residentadvisor.net/events/uk/london/week/2020-06-22

# print() returns None, so build the string first and print it afterwards
next_url = 'https://www.residentadvisor.net{}'.format(next_href)
print(next_url)

https://www.residentadvisor.net/events/uk/london/week/2020-06-22
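
String formatting works here because the `href` is root-relative. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against the current page URL and also copes with relative or already-absolute links. The relative `href` and the `example.com` link below are made-up inputs for illustration.

```python
from urllib.parse import urljoin

current_page = "https://www.residentadvisor.net/events/uk/london/week/2020-06-15"

# root-relative href, as found on the page
root_relative = urljoin(current_page, "/events/uk/london/week/2020-06-22")

# hypothetical relative href: replaces the last path segment of the current page
relative = urljoin(current_page, "2020-06-22")

# hypothetical absolute href: returned unchanged
absolute = urljoin(current_page, "https://example.com/other")

print(root_relative)
print(relative)
print(absolute)
```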


In [22]:
# putting the above together

def next_page(url):
    #set up the page and the soup
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # find the section in the soup where the previous and next buttons can be found
    next_container = soup.find('div', class_ = 'page-items content sub clearfix')
    
    # this will list links to both the previous AND the next buttons
    buttons = next_container.findAll('a', href=True)

    # the next button is index number 1 in the buttons list, specifying ['href'] like that should get the url for the next button
    # (named next_href so that Python's built-in next() is not shadowed)
    next_href = buttons[1]['href']
    
    # Now, a bit of string manipulation to join everything up, use a literal to complete the url and ensure it is a string
    # This is what the full url looks like: https://www.residentadvisor.net/events/uk/london/week/2020-06-22

    next_page_url = 'https://www.residentadvisor.net{}'.format(next_href)
    
    return next_page_url

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [23]:
# construct an initial dataframe of the initial page where scraping begins
start_url = 'https://www.residentadvisor.net/events'
df_large = scrape_events(start_url)

# get the next url; there will definitely be need of another url, we aren't going to get 1000 listings in 1 page....

new_url = next_page(start_url)

# scrape the next page and add the dataframe constructed thereby to the initial dataframe
# remember to test the length of the dataframe 
# keep running the loop to scrape the following page until the dataframe holds at least 400 events
# (the lab asks for 1000; raise the threshold in the loop below to collect more)

df_length = len(df_large)

while df_length < 400:
    next_df = scrape_events(new_url)
    df_large = pd.concat([df_large, next_df])
    df_length = len(df_large)
    if df_length >= 400:
        break
    new_url = next_page(new_url)

# tidy up the index
df_large = df_large.reset_index(drop = True)
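
The paging logic above can be separated from the scraping itself, which makes it easy to try out without touching the network. The sketch below is one way to do that under stated assumptions: `collect_events`, its parameters and the `max_pages` safety cap are hypothetical names, and the "site" is a small dictionary standing in for real pages.

```python
def collect_events(start_url, scrape_page, get_next_url, target, max_pages=50):
    """Follow 'next' links until at least `target` events are collected or max_pages is reached."""
    events = []
    url = start_url
    for _ in range(max_pages):
        events.extend(scrape_page(url))
        if len(events) >= target:
            break  # enough events, no need to look up another page
        url = get_next_url(url)
        if url is None:
            break  # no further page to visit
    return events[:target]

# A fake three-page site: each entry holds (events on the page, next page or None).
fake_site = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c", "d"], "page3"),
    "page3": (["e"], None),
}

first_three = collect_events(
    "page1",
    scrape_page=lambda u: fake_site[u][0],
    get_next_url=lambda u: fake_site[u][1],
    target=3,
)
print(first_three)
```

In the notebook, the two callables could plausibly be `lambda u: scrape_events(u).to_dict('records')` and `next_page`; the cap on pages guards against an endless loop if the target is never reached.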

In [24]:
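# ties on attendance are broken by event date; the date strings are parsed so the order is chronological, not alphabetical
df_large = df_large.sort_values(by = ['Number_of_Attendees', 'Event_Date'], ascending = [False, True],
                                key = lambda col: pd.to_datetime(col, format='%a, %d %b %Y') if col.name == 'Event_Date' else col)
df_large.head(10)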

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 151 | [CANCELLED] Cross The Tracks Festival 2020 | Brockwell Park | Sun, 07 Jun 2020 | 4339 |
| 296 | LoveJuice Block Party at Site 5 | E1 London | Sat, 18 Jul 2020 | 1489 |
| 375 | Waterworks Festival | Lee Valley Waterworks | Sat, 22 Aug 2020 | 1385 |
| 274 | Field Day 2020 | The Drumsheds | Sat, 11 Jul 2020 | 1213 |
| 297 | [RESCHEDULED] E1 & Labyrinth present: Sasha, J... | E1 London | Sat, 18 Jul 2020 | 578 |
| 220 | [RESCHEDULED] Symmetry with Break, Skeptical, ... | The Steel Yard | Fri, 26 Jun 2020 | 347 |
| 348 | Bicep Live | O2 Academy Brixton | Fri, 07 Aug 2020 | 316 |
| 265 | [RESCHEDULED] Turno, Problem Central, K Motion... | E1 London | Fri, 10 Jul 2020 | 280 |
| 231 | [POSTPONED] Care3: 3rd Birthday | TBA - London | Sat, 27 Jun 2020 | 271 |
| 374 | [RESCHEDULED] Giolì & Assia | The Jazz Cafe | Fri, 21 Aug 2020 | 247 |


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!