# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
# Import libraries
import re
import requests
from bs4 import BeautifulSoup

In [4]:
# Load html page using requests and pass to Beautifulsoup
url = 'https://www.residentadvisor.net/events'
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
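
The request above assumes the page loads successfully. A minimal sketch of a more defensive fetch follows; the helper name `fetch_soup` and the 10-second timeout are assumptions for illustration, not part of the lab.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Hypothetical helper: download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.content, 'html.parser')
```

Calling `fetch_soup(url)` in place of the two lines above would surface a failed request immediately instead of passing an error page to the parser.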

In [5]:
# find area with data - middle column with event listings
relevantdata = soup.find('div', class_='fl col4')

In [6]:
#Narrow it down - pick up each event entry

entries = relevantdata.findAll('article')
entries[:2]

[<article class="event-item clearfix tickets-bkg-logo" itemscope="" itemtype="http://data-vocabulary.org/Event"><a href="/events/1381691#tickets"><img class="nohide" src="https://residentadvisor.net/images/ra-tix.png" style="height: 23px; width: 40px; right: 0px; position: absolute; top: 1px;"/></a><span style="display:none;"><time datetime="2020-02-24T00:00" itemprop="startDate">2020-02-24T00:00</time></span><a href="/events/1381691"><img height="76" src="/images/listing-default.gif" width="152"/></a><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1381691" itemprop="url" title="Event details of Play London Every Monday">Play London Every Monday</a> <span>at <a href="/club.aspx?id=33592">XOYO</a></span></h1><div class="grey event-lineup">DJ GLENN D</div><p class="attending"><span>3</span> Attending</p></div></article>,
 <article class="event-item clearfix tickets-bkg-logo" itemscope="" itemtype="http://data-vocabulary.org/Event"><a href="/events/1380682#ti

In [7]:
# Get the first event name
print(entries[0].find('h1').find('a').text)

# Get all event names in a list
alleventnames = [entry.find('h1').find('a').text for entry in relevantdata.findAll('article')]
print(alleventnames[:5])
print(len(alleventnames))

Play London Every Monday
['Play London Every Monday', 'Desire Magic Mondays', 'London DJ School (DJ Lessons)', 'Kandy Mondays (£2 Drinks) at The Roxy', 'SCJ! Jam Night and Networking Event']
253


In [8]:
# Define a function to retrieve event names
def get_eventnames(soup):
    relevantdata = soup.find('div', class_='fl col4')
    entries = relevantdata.findAll('article')
    # The first link inside each <h1> holds the event title
    eventnames = [entry.find('h1').find('a').text for entry in entries]
    return eventnames

In [9]:
get_eventnames(soup)

['Play London Every Monday',
 'Desire Magic Mondays',
 'London DJ School (DJ Lessons)',
 'Kandy Mondays (£2 Drinks) at The Roxy',
 'SCJ! Jam Night and Networking Event',
 'Mobile Mondays',
 'Paradox presents Tuesday Madness',
 'Synergy: Back 2 Back',
 'Pancake Rave at Ministry of Sound',
 'Play London Every Monday',
 'Brazilian Carnival Party',
 'CDR London with Emmavie',
 'Sneak Tuesdays at Xoyo (£3 Drinks)',
 'Dirt',
 'Mobile Mondays',
 'F.M.G',
 'Soul In Motion - Wed Feb 26th',
 'Play London Every Monday',
 'Final Cut: Midweek Party - R&B, Charts, House and More',
 'FAT Action',
 'Carl Michael VON Hausswolff / Zachary Paul / Tears|OV',
 'The Beat Bangs',
 'Alex Bradley',
 'Pretend Residency',
 '100 Wardour St Originals presents Louisa',
 'SIN BIN Every Wednesday at Loop Bar',
 'Millies Lounge at The Ned',
 'GLO Thursday',
 'No Static / Automatic Label Launch with Radioactive Man, Sync 24, Ara-u',
 'Disco Teq: UKG Special',
 'Algo Mas Every Thursday Night/Friday Morning',
 'Mxlx // T

In [10]:
# getting first event date
entries[0].find('time').text.split('T')[0]

'2020-02-24'
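
Splitting on `'T'` works because the `<time>` tag holds an ISO 8601 timestamp. As an alternative sketch, the standard library can parse the same string into a real date object, which may be convenient for later sorting or filtering:

```python
from datetime import datetime

raw = '2020-02-24T00:00'  # text of the <time> tag in the sample entry

# Parse the ISO 8601 string, then keep only the calendar date
event_date = datetime.fromisoformat(raw).date()
print(event_date.isoformat())  # 2020-02-24
```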

In [11]:
# Get all dates (the part of each timestamp before the 'T')
alldates = [entry.find('time').text.split('T')[0] for entry in relevantdata.findAll('article')]
print(alldates[:5])
print(len(alldates))

['2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24']
253


In [12]:
def get_eventdates(soup):
    relevantdata = soup.find('div', class_='fl col4')
    entries = relevantdata.findAll('article')
    # Keep only the date part of each ISO timestamp
    eventdates = [entry.find('time').text.split('T')[0] for entry in entries]
    return eventdates

In [13]:
get_eventdates(soup)

['2020-02-24',
 '2020-02-24',
 '2020-02-24',
 '2020-02-24',
 '2020-02-24',
 '2020-02-24',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-25',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-26',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-27',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-28',
 '2020-02-

In [14]:
# Get all venues in a list
alleventvenues = []

for entry in relevantdata.findAll('article'):
    try:
        alleventvenues.append(entry.find('h1').find('span').find('a').text)
    except AttributeError:
        # find() returned None somewhere in the chain: no venue link for this event
        alleventvenues.append('none')

In [15]:
alleventvenues[:5]

['XOYO', 'Union Club, Vauxhall', 'The Cause', 'The Roxy', 'Spice Of Life']

In [16]:
len(alleventvenues)

253

In [17]:
def get_eventvenues(soup):
    relevantdata = soup.find('div', class_='fl col4')
    entries = relevantdata.findAll('article')
    eventvenues = []
    for entry in entries:
        try:
            eventvenues.append(entry.find('h1').find('span').find('a').text)
        except AttributeError:
            # No venue link for this event
            eventvenues.append('none')
    return eventvenues

In [18]:
get_eventvenues(soup)

['XOYO',
 'Union Club, Vauxhall',
 'The Cause',
 'The Roxy',
 'Spice Of Life',
 'Chip Shop Brixton',
 'Egg London',
 'Corsica Studios',
 'Ministry Of Sound',
 'XOYO',
 'JAKO',
 'Corsica Studios',
 'XOYO',
 'The Roxy',
 'Chip Shop Brixton',
 'Union Club, Vauxhall',
 'Orange Yard',
 'XOYO',
 'Egg London',
 'Folklore',
 'Iklectik',
 'Werkhaus',
 'The Library Lounge at The Standard, London',
 'New Cross Inn',
 '100 Wardour Street',
 'Loop Bar & Nightclub',
 'Millies Lounge - The Ned',
 'Egg London',
 'The Glove That Fits',
 'DreamBags Jaguarshoes',
 'Union Club, Vauxhall',
 'New River Studios',
 'Omeara',
 'Pop Brixton',
 'XOYO',
 'Camden Assembly',
 'The Phoenix',
 'The Barbican Centre',
 'Shoreditch Platform',
 "King's Head Members Club",
 'Basing House',
 'The Library Lounge at The Standard, London',
 'Iklectik',
 'Piccadilly Institute',
 'Kristin Hjellegjerde Gallery',
 'Hand of Glory',
 'Lion and Lamb',
 'Number 10 London',
 'Aquum Bar And Restaurant',
 'Queen Of Hoxton',
 'Kings Plac

In [20]:
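allnumattendees = []

for entry in relevantdata.findAll('article'):
    try:
        # Text looks like "3 Attending"; keep the leading number
        allnumattendees.append(int(entry.find(class_='attending').text.split()[0]))
    except (AttributeError, IndexError, ValueError):
        # No attendance element, or no number in it
        allnumattendees.append('NA')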

In [21]:
allnumattendees[:5]

[3, 3, 3, 1, 'NA']
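
The `re` module is imported at the top of the lab but never used. One possible use, sketched here with a hypothetical helper `parse_attendees`, is to pull the number out of the attendance text with a regular expression and return `None` (rather than the string `'NA'`) when no number is present:

```python
import re


def parse_attendees(text):
    """Hypothetical helper: return the first integer found in `text`, or None."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else None


print(parse_attendees('3 Attending'))  # 3
print(parse_attendees(''))             # None
```

Returning `None` keeps the column free of mixed string and integer values, at the cost of handling missing values later.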

In [22]:
def get_numattendees(soup):
    relevantdata = soup.find('div', class_='fl col4')
    entries = relevantdata.findAll('article')
    numattendees = []
    for entry in entries:
        try:
            numattendees.append(int(entry.find(class_='attending').text.split()[0]))
        except (AttributeError, IndexError, ValueError):
            # No attendance element, or no number in it
            numattendees.append('NA')
    return numattendees

In [23]:
get_numattendees(soup)

[3,
 3,
 3,
 1,
 'NA',
 'NA',
 98,
 57,
 7,
 3,
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 14,
 5,
 3,
 2,
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 101,
 29,
 18,
 14,
 13,
 11,
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 388,
 341,
 296,
 283,
 243,
 213,
 167,
 155,
 140,
 114,
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 1665,
 1327,
 819,
 585,
 505,
 311,
 277,
 250,
 218,
 213,
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA'

In [24]:
import pandas as pd

def eventscraper(url):
    
    eventnames = []
    eventdates = []
    eventvenues = []
    numattendees = []
    
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    
    eventnames += get_eventnames(soup)
    eventdates += get_eventdates(soup)
    eventvenues += get_eventvenues(soup)
    numattendees += get_numattendees(soup)
    
    df = pd.DataFrame([eventnames, eventvenues, eventdates, numattendees]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    
    return df 
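
The four helper functions each walk the page separately. As an alternative sketch (not the lab's solution), a single pass can build one record per event and hand the list of records to pandas. The HTML below is a trimmed copy of the first entry shown earlier plus a made-up second entry with no venue or attendance; `parse_entry` and `df_demo` are hypothetical names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed sample entry, plus an invented entry ("Example Night") with missing fields
html = """
<div class="fl col4">
  <article><time datetime="2020-02-24T00:00">2020-02-24T00:00</time>
    <h1 class="event-title"><a href="/events/1381691">Play London Every Monday</a>
      <span>at <a href="/club.aspx?id=33592">XOYO</a></span></h1>
    <p class="attending"><span>3</span> Attending</p></article>
  <article><time datetime="2020-02-25T00:00">2020-02-25T00:00</time>
    <h1 class="event-title"><a href="/events/0">Example Night</a></h1></article>
</div>
"""


def parse_entry(article):
    """Hypothetical helper: turn one <article> tag into a dict of the four fields."""
    title = article.find('h1')
    venue = title.find('span')
    attending = article.find(class_='attending')
    return {
        'Event_Name': title.find('a').text,
        'Venue': venue.find('a').text if venue and venue.find('a') else 'none',
        'Event_Date': article.find('time').text.split('T')[0],
        'Number_of_Attendees': int(attending.text.split()[0]) if attending else 'NA',
    }


soup_demo = BeautifulSoup(html, 'html.parser')
records = [parse_entry(a) for a in soup_demo.find('div', class_='fl col4').find_all('article')]
df_demo = pd.DataFrame(records)
print(df_demo)
```

Building rows as dictionaries keeps each event's fields together, so a missing venue or attendance count cannot shift values between rows.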

In [25]:
df = eventscraper('https://www.residentadvisor.net/events')
df

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 0 | Play London Every Monday | XOYO | 2020-02-24 | 3 |
| 1 | Desire Magic Mondays | Union Club, Vauxhall | 2020-02-24 | 3 |
| 2 | London DJ School (DJ Lessons) | The Cause | 2020-02-24 | 3 |
| 3 | Kandy Mondays (£2 Drinks) at The Roxy | The Roxy | 2020-02-24 | 1 |
| 4 | SCJ! Jam Night and Networking Event | Spice Of Life | 2020-02-24 |  |
| ... | ... | ... | ... | ... |
| 248 | Lime - Every Sunday Morning | Club Reina | 2020-03-01 |  |
| 249 | Sunday Selections with Jack Sellen | Number 90 | 2020-03-01 |  |
| 250 | Loose Trax 6 | Venue MOT Unit 18 | 2020-03-01 |  |
| 251 | Blaffirmative x Kupid - FHP | Brixton Jamm | 2020-03-01 |  |


In [45]:
len(df)

1000

## Write a Function to Retrieve the URL for the Next Page

In [41]:
# look for the next page url
nextinfo = soup.find('div', class_= 'page-items content sub clearfix').find('a', attrs = {'ga-event-action':"Next "}).attrs['href']
nextinfo

'/events/uk/london/week/2020-03-02'

In [39]:
def next_page(url):
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    exturl = soup.find('div', class_= 'page-items content sub clearfix').find('a', attrs = {'ga-event-action':"Next "}).attrs['href']
    next_page_url = "https://www.residentadvisor.net" + exturl
    return next_page_url

In [40]:
next_page('https://www.residentadvisor.net/events/uk/london/week/2020-03-02')

'https://www.residentadvisor.net/events/uk/london/week/2020-03-09'
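
The function above builds the absolute URL by string concatenation. A small sketch of the same step using the standard library's `urljoin`, which resolves a relative link against the page it was found on (the variable names here are illustrative):

```python
from urllib.parse import urljoin

base_url = 'https://www.residentadvisor.net/events'
relative_href = '/events/uk/london/week/2020-03-02'  # the href found on the events page

# urljoin resolves the relative link against the page it came from
next_url = urljoin(base_url, relative_href)
print(next_url)
# https://www.residentadvisor.net/events/uk/london/week/2020-03-02
```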

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [52]:
# Scrape page after page until at least 1000 events are collected
rows = 0
url = "https://www.residentadvisor.net/events"
dfs = []

while rows < 1000:
    df = eventscraper(url)  # run the event scraper function and save the result into df
    dfs.append(df)
    rows += len(df)
    url = next_page(url)

df = pd.concat(dfs)
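
The loop above assumes every page has a "Next" link and leaves repeated index values after concatenation. A hedged sketch of a more general driver follows; `collect_events` is a hypothetical function, and the stand-in `scrape` and `get_next` callables only simulate three pages so the example runs without network access.

```python
import pandas as pd


def collect_events(first_url, scrape, get_next, target=1000):
    """Hypothetical driver: follow pages until `target` rows are gathered or pages run out."""
    frames, rows, url = [], 0, first_url
    while url is not None and rows < target:
        page = scrape(url)
        frames.append(page)
        rows += len(page)
        url = get_next(url)  # expected to return None when there is no next page
    # ignore_index gives one continuous 0..n-1 index; head() trims any overshoot
    return pd.concat(frames, ignore_index=True).head(target)


# Stand-in functions: three fake pages with four events each
fake_pages = {'p1': 'p2', 'p2': 'p3', 'p3': None}
demo = collect_events(
    'p1',
    scrape=lambda url: pd.DataFrame({'Event_Name': [f'{url}-event-{i}' for i in range(4)]}),
    get_next=fake_pages.get,
    target=10,
)
print(len(demo))  # 10
```

In the lab's setting, `scrape` would be `eventscraper` and `get_next` a version of `next_page` adapted to return `None` when no link is found.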

In [56]:
df[:1000]

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 0 | Play London Every Monday | XOYO | 2020-02-24 | 3 |
| 1 | Desire Magic Mondays | Union Club, Vauxhall | 2020-02-24 | 3 |
| 2 | London DJ School (DJ Lessons) | The Cause | 2020-02-24 | 3 |
| 3 | Kandy Mondays (£2 Drinks) at The Roxy | The Roxy | 2020-02-24 | 1 |
| 4 | SCJ! Jam Night and Networking Event | Spice Of Life | 2020-02-24 |  |
| ... | ... | ... | ... | ... |
| 23 | Giolì & Assia | The Jazz Cafe | 2020-04-03 | 217 |
| 24 | Rupture | Corsica Studios | 2020-04-03 | 200 |
| 25 | Unleash presents Solomun | Printworks | 2020-04-03 | 195 |
| 26 | Spencer Brown - London | Colours Hoxton | 2020-04-03 | 150 |


In [62]:
# Replace the 'NA' placeholder for missing attendee counts with 0
df['Number_of_Attendees'] = df['Number_of_Attendees'].replace('NA', int(0))
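
Replacing `'NA'` with 0 works, but the column may still hold generic Python objects rather than integers. One alternative, sketched on the first five scraped values, is to coerce the column to a numeric type explicitly:

```python
import pandas as pd

attendees = pd.Series([3, 3, 3, 1, 'NA'])  # first five scraped attendee values

# Non-numeric entries become NaN, which are then filled with 0 and cast back to int
cleaned = pd.to_numeric(attendees, errors='coerce').fillna(0).astype(int)
print(cleaned.tolist())  # [3, 3, 3, 1, 0]
```

Whether a missing count should be treated as 0 is a modeling choice; the lab makes that choice, and this sketch simply follows it.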

In [64]:
df.head(10)

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 0 | Play London Every Monday | XOYO | 2020-02-24 | 3 |
| 1 | Desire Magic Mondays | Union Club, Vauxhall | 2020-02-24 | 3 |
| 2 | London DJ School (DJ Lessons) | The Cause | 2020-02-24 | 3 |
| 3 | Kandy Mondays (£2 Drinks) at The Roxy | The Roxy | 2020-02-24 | 1 |
| 4 | SCJ! Jam Night and Networking Event | Spice Of Life | 2020-02-24 | 0 |
| 5 | Mobile Mondays | Chip Shop Brixton | 2020-02-24 | 0 |
| 6 | Paradox presents Tuesday Madness | Egg London | 2020-02-25 | 98 |
| 7 | Synergy: Back 2 Back | Corsica Studios | 2020-02-25 | 58 |
| 8 | Pancake Rave at Ministry of Sound | Ministry Of Sound | 2020-02-25 | 7 |
| 9 | Play London Every Monday | XOYO | 2020-02-25 | 3 |


In [67]:
df = df.sort_values('Number_of_Attendees', ascending = False)
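
The sort above orders by attendance only, while the task also asks for ties to be broken by event date. A minimal sketch of a two-key sort, using three rows taken from the scraped output and assuming the attendee column is already numeric:

```python
import pandas as pd

sample = pd.DataFrame({
    'Event_Name': ['Play London Every Monday', 'Paradox presents Tuesday Madness',
                   'Play London Every Monday'],
    'Event_Date': ['2020-02-25', '2020-02-25', '2020-02-24'],
    'Number_of_Attendees': [3, 98, 3],
})

# Most-attended first; ties on attendance fall back to the earlier date
ranked = sample.sort_values(['Number_of_Attendees', 'Event_Date'], ascending=[False, True])
print(ranked)
```

Because the dates are ISO-formatted strings, sorting them as text gives the same order as sorting them chronologically.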

In [68]:
df

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 130 | Space Ibiza 2020 Opening Party | Studio 338 | 2020-02-29 | 1668 |
| 57 | RE-TEXTURED — British Murder Boys (Live), Hele... | Tobacco Dock | 2020-04-04 | 1544 |
| 131 | ANTS | Printworks | 2020-02-29 | 1327 |
| 17 | RE-TEXTURED | none | 2020-04-02 | 1166 |
| 95 | Honey Dijon Pres. 'Black Girl Magic' with BBZ ... | Village Underground | 2020-03-07 | 1127 |
| ... | ... | ... | ... | ... |
| 151 | Des Was a Bowie Fan: Walthamstow - The Chequers | The Chequers | 2020-03-07 | 0 |
| 152 | Faded | Junction House | 2020-03-07 | 0 |
| 153 | Tarabband | Rich Mix | 2020-03-07 | 0 |
| 154 | Glitterbox | Printworks | 2020-03-07 | 0 |


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!