# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
# import required libraries
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re
import time
from pprint import pprint

In [4]:
# set variables for parsing
url = "https://www.residentadvisor.net/events/us/washingtondc"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

### Find the Page's Event Names

In [5]:
event_listings = soup.find('div', id="event-listing")

In [6]:
entries = event_listings.findAll('li')
print(len(entries), entries[0])

21 <li><p class="eventDate date"><a href="/events.aspx?ai=22&amp;v=day&amp;mn=1&amp;yr=2020&amp;dy=16"><span>Thu, 16 Jan 2020 /</span></a></p></li>


### Develop the Function

In [7]:
rows = []
for entry in entries:
    # Is it a date or an event title?
    date = entry.find('p', class_="eventDate date")
    event = entry.find('h1', class_="event-title")
    if event:
        details = event.text.split(' at ') # split to find event name sub-string
        event_name = details[0].strip()
        venue = details[1].strip() # split to find venue
        # try converting number of attendees string to integer or else to Nan
        try:
            n_attendees = int(re.match("(\d*)", entry.find('p', class_="attending").text)[0])
        except:
            n_attendees = np.nan
        rows.append([event_name, venue, cur_date, n_attendees]) # add results to rows list
    elif date:
        cur_date = date.text.replace("/","")
    else:
        continue

df = pd.DataFrame(rows)
df.head()

Unnamed: 0,0,1,2,3
0,Contact: Nathan Barato,Flash,"Thu, 16 Jan 2020",10
1,Playlist 4 Year Anniversary - Navbox / Mina/ G...,Eighteenth Street Lounge,"Thu, 16 Jan 2020",3
2,Mark Farina,Flash,"Fri, 17 Jan 2020",17
3,The Gardens of Babylon Fundraiser Night - Wash...,Zeba Bar,"Fri, 17 Jan 2020",10
4,Hedonism I,The Thirsty Crow,"Fri, 17 Jan 2020",2


In [8]:
#Final function
def scrape_events(events_page_url):
    #Your code here
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    entries = event_listings.findAll('li')
    rows = []
    for entry in entries:
        #Is it a date? If so, set current date.
        date = entry.find('p', class_="eventDate date")
        event = entry.find('h1', class_="event-title")
        if event:
            details = event.text.split(' at ')
            event_name = details[0].strip()
            venue = details[1].strip()
            try:
                n_attendees = int(re.match("(\d*)", entry.find('p', class_="attending").text)[0])
            except:
                n_attendees = np.nan
            rows.append([event_name, venue, cur_date, n_attendees])
        elif date:
            cur_date = date.text.replace("/","") #remove trailing slash
        else:
            continue
    df = pd.DataFrame(rows)
    df.head()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df.sort_values(by='Number_of_Attendees', ascending=False)

scrape_events(url)

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
6,TNX & SEQUENCE: Spank,TBA - Washington DC,"Sat, 18 Jan 2020",169
7,Horse Meat Disco,Flash,"Sat, 18 Jan 2020",96
8,[POSTPONED] 1213 K Event,1213 K,"Sat, 18 Jan 2020",49
2,Mark Farina,Flash,"Fri, 17 Jan 2020",17
0,Contact: Nathan Barato,Flash,"Thu, 16 Jan 2020",10
3,The Gardens of Babylon Fundraiser Night - Wash...,Zeba Bar,"Fri, 17 Jan 2020",10
11,Wunderdisco 2020,Wunder Garten,"Sun, 19 Jan 2020",5
13,Wunderdisco 2020,Wunder Garten,"Mon, 20 Jan 2020",5
12,The Cluck Off II,Jimmy Valentine's Lonely Hearts Club,"Sun, 19 Jan 2020",4
1,Playlist 4 Year Anniversary - Navbox / Mina/ G...,Eighteenth Street Lounge,"Thu, 16 Jan 2020",3


## Write a Function to Retrieve the URL for the Next Page

In [9]:
#Function development cell
soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']

'/events/us/washingtondc/week/2020-01-21'

In [10]:
def next_page(url):
    #Your code here
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Break if there is no next page
    if soup.find('a', attrs={'ga-event-action':"Next "}) is None:
        Break
    else:
        # find the next page attibute
        url_ext = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
        next_page_url = "https://www.residentadvisor.net" + url_ext
    return next_page_url

next_page(url)

'https://www.residentadvisor.net/events/us/washingtondc/week/2020-01-21'

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [11]:
dfs = []
total_rows = 0
cur_url = "https://www.residentadvisor.net/events/us/washingtondc"

while total_rows <= 100:
    df = scrape_events(cur_url)
    dfs.append(df)
    total_rows += len(df)
    #verify iteration
    display(f"{total_rows} rows after finding {cur_url}")
    cur_url = next_page(cur_url)
    time.sleep(.2)
    if next_page(cur_url) is None:
        break

'14 rows after finding https://www.residentadvisor.net/events/us/washingtondc'

'28 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-01-21'

'42 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-01-28'

'56 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-02-04'

'70 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-02-11'

'84 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-02-18'

'98 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-02-25'

'112 rows after finding https://www.residentadvisor.net/events/us/washingtondc/week/2020-03-03'

*...printing the same 14 rows for each loop*

In [12]:
df = pd.concat(dfs)
df = df.iloc[:1000]

print(len(df))
display(df.head(29))

112


Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
6,TNX & SEQUENCE: Spank,TBA - Washington DC,"Sat, 18 Jan 2020",169
7,Horse Meat Disco,Flash,"Sat, 18 Jan 2020",96
8,[POSTPONED] 1213 K Event,1213 K,"Sat, 18 Jan 2020",49
2,Mark Farina,Flash,"Fri, 17 Jan 2020",17
0,Contact: Nathan Barato,Flash,"Thu, 16 Jan 2020",10
3,The Gardens of Babylon Fundraiser Night - Wash...,Zeba Bar,"Fri, 17 Jan 2020",10
11,Wunderdisco 2020,Wunder Garten,"Sun, 19 Jan 2020",5
13,Wunderdisco 2020,Wunder Garten,"Mon, 20 Jan 2020",5
12,The Cluck Off II,Jimmy Valentine's Lonely Hearts Club,"Sun, 19 Jan 2020",4
1,Playlist 4 Year Anniversary - Navbox / Mina/ G...,Eighteenth Street Lounge,"Thu, 16 Jan 2020",3


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!