# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping https://www.residentadvisor.net. Start by navigating to the [events page](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [63]:
# open the first events page and save as a soup
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork/week/2020-03-03')
if html_page.status_code == 200:
    soup = BeautifulSoup(html_page.content, 'html.parser')
else:
    print('something went wrong.  the status code is: ', html_page.status_code)

In [64]:
event_items = soup.find_all(attrs={'class': 'event-item'})
print(len(event_items))
print(event_items[0])

75
<article class="event-item clearfix" itemscope="" itemtype="http://data-vocabulary.org/Event"><span style="display:none;"><time datetime="2020-03-03T00:00" itemprop="startDate">2020-03-03T00:00</time></span><a href="/events/1392346"><img height="76" src="/images/events/flyer/2020/3/us-0303-1392346-list.jpg" width="152"/></a><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1392346" itemprop="url" title="Event details of Hart &amp; Soul 10 with The Duchess">Hart &amp; Soul 10 with The Duchess</a> <span>at <a href="/club.aspx?id=71292">Bossa Nova Civic Club</a></span></h1><div class="grey event-lineup">The Duchess, Son of Lee, kyle.wav</div><p class="attending"><span>4</span> Attending</p></div></article>


In [13]:
# event_date
event_items[0].find('time').text[:10]

'2020-02-25'

In [14]:
# event_name
event_items[0].find(attrs={'class': 'event-title'}).find('a').text

'Aggressive Skool Nite'

In [15]:
# venue
event_items[0].find(attrs={'class': 'event-title'}).find('span').find('a').text

'Jupiter Disco'

In [16]:
# number_of_attendees
int(event_items[0].find(attrs={'class': 'attending'}).find('span').text)

5

In [40]:
event_items[104]

<article class="event-item clearfix small-item" itemscope="" itemtype="http://data-vocabulary.org/Event"><span style="display:none;"><time datetime="2020-03-02T00:00" itemprop="startDate">2020-03-02T00:00</time></span><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1387620" itemprop="url" title="Event details of The Office presents: Shhh Music By: Deborah Sattler">The Office presents: Shhh Music By: Deborah Sattler</a> <span>at <span class="grey" style="display:inline;">TBA - Brooklyn</span></span></h1></div></article>

In [24]:
event_items[30].find(attrs={'class': 'event-title'}).find('span').find('a').text

AttributeError: 'NoneType' object has no attribute 'text'
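
The `AttributeError` above appears when an event's venue is not wrapped in a link, as in the `TBA - Brooklyn` item shown earlier. One way to guard against it is sketched below. The helper name `extract_venue` is hypothetical, and the two HTML strings are trimmed copies of the event items printed in this lab, so the example runs without a network request.

```python
from bs4 import BeautifulSoup


def extract_venue(event_item):
    """Return the venue text for one event-item tag, or None if it is missing."""
    title = event_item.find(attrs={'class': 'event-title'})
    if title is None:
        return None
    venue_span = title.find('span')
    if venue_span is None:
        return None
    # Linked venues sit in an <a>; unlinked ones sit in a nested <span>.
    venue_tag = venue_span.find('a') or venue_span.find('span')
    return venue_tag.text.strip() if venue_tag else None


# Trimmed copies of the two event items shown in this lab.
linked = (
    '<article class="event-item"><h1 class="event-title">'
    '<a href="/events/1392346">Hart &amp; Soul 10 with The Duchess</a> '
    '<span>at <a href="/club.aspx?id=71292">Bossa Nova Civic Club</a></span>'
    '</h1></article>'
)
unlinked = (
    '<article class="event-item"><h1 class="event-title">'
    '<a href="/events/1387620">The Office presents: Shhh Music By: Deborah Sattler</a> '
    '<span>at <span class="grey" style="display:inline;">TBA - Brooklyn</span></span>'
    '</h1></article>'
)

venues = []
for html in (linked, unlinked):
    item = BeautifulSoup(html, 'html.parser').find(attrs={'class': 'event-item'})
    venues.append(extract_venue(item))

print(venues)
```

Returning `None` instead of raising lets the caller decide whether to skip the event or record a missing venue.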

In [47]:
event_url = event_items[0].find(attrs={'class': 'event-title'}).find('a')['href']
event_url

'/events/1382985'

In [50]:
event_url = event_items[8].find(attrs={'class': 'event-title'}).find('a')['href']
event_page = requests.get(f'https://www.residentadvisor.net{event_url}')
if event_page.status_code == 200:
    event_soup = BeautifulSoup(event_page.content, 'html.parser')
    print(int(event_soup.find(id='MembersFavouriteCount').text))
else:
    print(f'something went wrong with {event_url}.  the status code is: {event_page.status_code}')

4


In [58]:
def scrape_events(events_page_url):
    
    # debugging flag
    DEBUG = False
    
    # initialize lists for the scraped data
    event_dates = []
    event_names = []
    venues = []
    attendees = []
    
    # load the html page into a soup
    html_page = requests.get(events_page_url)
    
    # check if the page was successfully loaded
    if html_page.status_code == 200:
        soup = BeautifulSoup(html_page.content, 'html.parser')
    else:
        print('something went wrong.  the status code is: ', html_page.status_code)
        return None
    
    # create a list of all the events
    event_items = soup.find_all(attrs={'class': 'event-item'})
    
    # store the number of events on the page in a var
    num_events = len(event_items)
    
    # loop through the events, loading them into the lists
    for i in range(0,num_events):
        print('event #', i) if DEBUG else None
        
        event_date = event_items[i].find('time').text[:10]
        print(event_date) if DEBUG else None
        event_dates.append(event_date)
        
        event_name = event_items[i].find(attrs={'class': 'event-title'}).find('a').text
        print(event_name) if DEBUG else None
        event_names.append(event_name)
        
        venue_element = event_items[i].find(attrs={'class': 'event-title'}).find('span')
        if venue_element.find('a'): # event location has a link
            venue = venue_element.find('a').text
        elif venue_element.find('span'): # event location doesn't have a link
            venue = venue_element.find('span').text
        else: # no venue markup found; don't reuse the previous event's venue
            venue = None
        print(venue) if DEBUG else None
        venues.append(venue)
        
        attendee_element = event_items[i].find(attrs={'class': 'attending'})
        if attendee_element: # there is a number of attendees listed on this page
            num_attendees = int(attendee_element.find('span').text)
        else: # there are no attendees listed on this page
            # open the event details page
            event_url = event_items[i].find(attrs={'class': 'event-title'}).find('a')['href']
            event_page = requests.get(f'https://www.residentadvisor.net{event_url}')
            if event_page.status_code == 200:
                event_soup = BeautifulSoup(event_page.content, 'html.parser')
                num_attendees = int(event_soup.find(id='MembersFavouriteCount').text)
            else:
                print(f'something went wrong with {event_url}.  the status code is: {event_page.status_code}')
                # record a missing value instead of reusing the previous event's count
                num_attendees = None
        print(num_attendees) if DEBUG else None
        attendees.append(num_attendees)

    df = pd.DataFrame([event_names, venues, event_dates, attendees]).transpose()
    
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [52]:
df = scrape_events('https://www.residentadvisor.net/events')
print('total events: ',len(df))
df.head()

event # 0
2020-02-25
Aggressive Skool Nite
Jupiter Disco
5
event # 1
2020-02-25
Cheers Bklyn 32
TBA Brooklyn
5
event # 2
2020-02-25
Feel Real presents Purified
Rumpus Room
1
event # 3
2020-02-25
Cool Runnings with MKL
Wei's
1
event # 4
2020-02-26
Pure Immanence Xliii
Bossa Nova Civic Club
15
event # 5
2020-02-26
Wednesday Films: Buena Vista Social Club
Nowadays
13
event # 6
2020-02-26
Exit 32 Recordings Showcase with Alex Fogo, Federico Moore
TBA Brooklyn
7
event # 7
2020-02-26
Funk You with DJ Bruce
House Of Yes
4
event # 8
2020-02-26
Rare Frequency Transmissions
Jupiter Disco
4
event # 9
2020-02-26
Strange Edition Launch with GAIKA, Azekel, Russell E.L. Butler and Gloria
Strange Edition
4
event # 10
2020-02-26
Open Decks Session 101
Eris
3
event # 11
2020-02-26
Happy Hour
H0L0
0
event # 12
2020-02-26
Shots
Blank Forms
0
event # 13
2020-02-27
Deep Root Sessions At Public Arts with Leftwing:Kody
Public Arts
9
event # 14
2020-02-27
Modular Synthesis Workshop: Apocalipsis U
Nowadays
23

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,Aggressive Skool Nite,Jupiter Disco,2020-02-25,5
1,Cheers Bklyn 32,TBA Brooklyn,2020-02-25,5
2,Feel Real presents Purified,Rumpus Room,2020-02-25,1
3,Cool Runnings with MKL,Wei's,2020-02-25,1
4,Pure Immanence Xliii,Bossa Nova Civic Club,2020-02-26,15


In [53]:
sorted_df = df.sort_values(by=['Number_of_Attendees', 'Event_Date'], ascending=False)
sorted_df

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
54,"Cityfox Live Festival: Paul Kalkbrenner, Schwa...",Avant Gardner,2020-02-29,1475
55,The Bunker and Interdimensional Transmissions ...,Market Hotel,2020-02-29,165
56,Altær: Henning Baer / VSK / Volvox / Auspex,BASEMENT,2020-02-29,141
28,Pangaea / Galcher Lustwerk / Lauren Flax,BASEMENT,2020-02-28,123
57,Honey Soundsystem Good Room Takeover,Good Room,2020-02-29,65
...,...,...,...,...
25,Shots,Blank Forms,2020-02-27,0
26,Telefon Tel Aviv,Buffalo/Rochester,2020-02-27,0
27,Uptown Nikko b2b Johnsville in Paradise,Paradise Club,2020-02-27,0
11,Happy Hour,H0L0,2020-02-26,0


## Write a Function to Retrieve the URL for the Next Page

In [77]:
def next_page(url):
    # load the html page into a soup
    html_page = requests.get(url)
    
    # check if the page was successfully loaded
    if html_page.status_code == 200:
        soup = BeautifulSoup(html_page.content, 'html.parser')
    else:
        print(f'something went wrong with {url}.  the status code is: {html_page.status_code}')
        return None
    
    # get the url anchor element (the 'liNext' item may be missing on the last page)
    next_item = soup.find(id='liNext')
    url_anchor = next_item.find('a') if next_item else None
    
    # if url_anchor exists and has an href attribute
    if url_anchor and url_anchor.has_attr('href'):
        next_page_url = 'https://www.residentadvisor.net' + url_anchor['href']
    else: # you're at the last page and there is no next
        next_page_url = None
    
    return next_page_url

In [78]:
next_page('https://www.residentadvisor.net/events')

'https://www.residentadvisor.net/events/us/newyork/week/2020-03-03'

In [56]:
next_page('https://www.residentadvisor.net/events/us/newyork/week/2020-03-03')

'https://www.residentadvisor.net/events/us/newyork/week/2020-03-10'

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.
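
The tie-breaking rule can be tried on a small frame before running the full scrape. The sketch below uses three rows taken from this lab's scraped results, two of which tie at 340 attendees. Passing a list to `ascending` is an assumption about the desired order (most attended first, earlier dates first within a tie). The lab's own cells use `ascending=False` for both columns.

```python
import pandas as pd

# Three rows taken from this lab's results; two events tie at 340 attendees.
events = pd.DataFrame({
    "Event_Name": [
        "Zero presents... The Masquerade 2020: Clarity",
        "Amelie Lens / Farrago / Rachel Noon",
        "Carl Cox Invites: Brooklyn Takeover",
    ],
    "Event_Date": ["2020-04-04", "2020-03-21", "2020-03-28"],
    "Number_of_Attendees": [340, 340, 585],
})

# Make sure the counts are numeric so they sort as numbers, not as text.
events["Number_of_Attendees"] = pd.to_numeric(events["Number_of_Attendees"])

# Sort by attendees (descending), then by date (ascending) to break ties.
# ISO-formatted date strings sort chronologically, so no date parsing is needed here.
ranked = events.sort_values(
    by=["Number_of_Attendees", "Event_Date"],
    ascending=[False, True],
).reset_index(drop=True)

print(ranked)
```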

In [79]:
# Initialize the number of scraped events
scraped_events = 0
# set a max number of events
MAX_EVENTS = 1000
# initialize page url
url = 'https://www.residentadvisor.net/events'
# initialize an empty Events data frame
events_df = pd.DataFrame()
# set debug flag
DEBUG_HERE = True

while (scraped_events < MAX_EVENTS) and (url is not None):
    
    print('current page ', url) if DEBUG_HERE else None
    
    # scrape the events from the current page
    df = scrape_events(url)
    
    # if this is the first scrape
    if scraped_events == 0:
        events_df = df
    else:
        # concatenate the newly scraped events to the events data frame
        events_df = pd.concat([events_df, df])
        
    # update the number of scraped events
    scraped_events = len(events_df)
    
    print('scraped events so far: ', scraped_events) if DEBUG_HERE else None
    
    # get the url of the next page
    url = next_page(url)



current page  https://www.residentadvisor.net/events
scraped events so far:  105
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-03-03
scraped events so far:  180
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-03-10
scraped events so far:  242
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-03-17
scraped events so far:  293
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-03-24
scraped events so far:  333
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-03-31
scraped events so far:  359
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-04-07
scraped events so far:  384
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-04-14
scraped events so far:  405
current page  https://www.residentadvisor.net/events/us/newyork/week/2020-04-21
scraped events so far:  419
current page  https://www.residentadvisor.net/events/us
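
The loop above can also be written as a reusable function that is easy to test offline. This is a minimal sketch rather than the lab's solution: `crawl`, `scrape_page`, `get_next_url`, and `FAKE_SITE` are hypothetical names, records are plain lists instead of DataFrames, and the pause between requests is an added courtesy to the server. It also skips pages that fail to load and stops if pagination loops back to a page it has already visited.

```python
import time


def crawl(start_url, scrape_page, get_next_url, max_events=1000, delay_seconds=1.0):
    """Collect records page by page until max_events is reached or pages run out."""
    records = []
    seen = set()
    url = start_url
    while url is not None and len(records) < max_events:
        if url in seen:  # guard against pagination that loops back
            break
        seen.add(url)
        page_records = scrape_page(url)
        if page_records is not None:  # a failed page returns None and is skipped
            records.extend(page_records)
        url = get_next_url(url)
        if url is not None:
            time.sleep(delay_seconds)  # brief pause between requests
    return records[:max_events]


# Hypothetical stand-in for a three-page site, so the sketch runs offline.
# Each entry maps a page to (records on that page, next page).
FAKE_SITE = {
    "page1": (["a", "b"], "page2"),
    "page2": (None, "page3"),  # simulates a page that failed to load
    "page3": (["c", "d", "e"], None),
}

events = crawl(
    "page1",
    scrape_page=lambda u: FAKE_SITE[u][0],
    get_next_url=lambda u: FAKE_SITE[u][1],
    max_events=4,
    delay_seconds=0,
)
print(events)  # ['a', 'b', 'c', 'd']
```

With the lab's functions, `scrape_page` would wrap `scrape_events` and `get_next_url` would be `next_page`, with the per-page DataFrames combined by `pd.concat` instead of `list.extend`.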

In [80]:
sorted_df = events_df.iloc[:MAX_EVENTS].sort_values(by=['Number_of_Attendees', 'Event_Date'], ascending=False)
sorted_df

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
54,"Cityfox Live Festival: Paul Kalkbrenner, Schwa...",Avant Gardner,2020-02-29,1481
7,Brooklyn Mirage Opening Weekend 2020: Cityfox ...,Brooklyn Mirage,2020-05-09,831
25,Carl Cox Invites: Brooklyn Takeover,Avant Gardner,2020-03-28,585
17,Zero presents... The Masquerade 2020: Clarity,TBA - Brooklyn,2020-04-04,340
28,Amelie Lens / Farrago / Rachel Noon,Knockdown Center,2020-03-21,340
...,...,...,...,...
25,Shots,Blank Forms,2020-02-27,0
26,Telefon Tel Aviv,Buffalo/Rochester,2020-02-27,0
27,Uptown Nikko b2b Johnsville in Paradise,Paradise Club,2020-02-27,0
11,Happy Hour,H0L0,2020-02-26,0


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!