# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Scrape events from a website
* Follow links to those events to retrieve further information
* Clean and store scraped data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
#Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
#Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [4]:
# Import required packages
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time

In [5]:
# Create the soup
response = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(response.content, 'html.parser')

# Preview the soup
# soup.prettify

In [6]:
# Get the div containing relevant info
event_listings = soup.find('div', id="event-listing")

In [7]:
# Get the event list items
entries = event_listings.findAll('li')

In [8]:
# Preview an event list item
entries[0]

<li><p class="eventDate date"><a href="/events.aspx?ai=8&amp;v=day&amp;mn=9&amp;yr=2019&amp;dy=13"><span>Fri, 13 Sep 2019 /</span></a></p></li>

In [9]:
# Practice extracting one date
a_date = entries[0].find('p', class_='eventDate date')
a_date.text.strip()

'Fri, 13 Sep 2019 /'

In [10]:
# Loop through 'entries' to extract dates
dates = []
for entry in entries:
    date = entry.find('p', class_='eventDate date')
    if date:
        date = date.text.strip()
        dates.append(date)
    
dates

['Fri, 13 Sep 2019 /',
 'Sat, 14 Sep 2019 /',
 'Sun, 15 Sep 2019 /',
 'Mon, 16 Sep 2019 /',
 'Tue, 17 Sep 2019 /',
 'Wed, 18 Sep 2019 /',
 'Thu, 19 Sep 2019 /']

In [11]:
# Loop through 'entries' to extract names and venues
names = []
venues = []
for entry in entries:
    event = entry.find('h1', class_='event-title')
    if event:
        info = event.text.split(' at ')
        name = info[0].strip()
        venue = info[1].strip()
        names.append(name)
        venues.append(venue)


In [12]:
# Loop through 'entries' to extract number of attendees
attendees = []
for entry in entries:
    try:
        num_attend = int(re.match("(\d*)", entry.find('p', class_='attending').text)[0])
        attendees.append(num_attend)
    except:
        num_attend = np.nan


In [13]:
# Gather the data into a list of lists
rows = []
for entry in entries:
    date = entry.find('p', class_='eventDate date')
    if date:
        date = date.text.strip()
    event = entry.find('h1', class_='event-title')
    if event:
        info = event.text.split(' at ')
        name = info[0].strip()
        venue = info[1].strip()
    try:
        num_attend = int(re.match("(\d*)", entry.find('p', class_='attending').text)[0])
        attendees.append(num_attend)
    except:
        num_attend = np.nan
    rows.append([date, name, venue, num_attend])

In [14]:
# Convert the list of lists to a DataFrame
df = pd.DataFrame(rows)
df.columns = ['Date', 'Event_name', 'Venue', 'Number_of_Attendees']
df.head()

Unnamed: 0,Date,Event_name,Venue,Number_of_Attendees
0,"Fri, 13 Sep 2019 /","Anything Goes: Filsonik, Athesun & Little New ...",1 Oak,
1,,James Murphy b2b The Black Madonna,Knockdown Center,401.0
2,,The Bunker with Wata Igarashi / Aleksi Perala ...,BASEMENT,270.0
3,,HARDER NYC,3 Dollar Bill,118.0
4,,Friday: Timmy Regisford All Night,Nowadays,106.0


In [39]:
# Define a function to extract event info to DataFrame
def scrape_events(events_page_url):
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    event_listings = soup.find('div', id="event-listing")
    entries = event_listings.findAll('li')
    rows = []
    for entry in entries:
        date = entry.find('p', class_='eventDate date')
        if date:
            date = date.text.strip()
            todays_date = date
        event = entry.find('h1', class_='event-title')
        if event:
            info = event.text.split(' at ')
            name = info[0].strip()
            venue = info[1].strip()
            try:
                num_attend = int(re.match("(\d*)", entry.find('p', class_='attending').text)[0])
            except:
                num_attend = np.nan
            rows.append([name, venue, todays_date, num_attend])
#         elif date:
#             todays_date = date.text
        else:
            continue
    df = pd.DataFrame(rows)
    df.columns = ['Event_name', 'Venue', 'Date', 'Number_of_Attendees']
    return df

In [16]:
# Test the function on a url
scrape_events('https://www.residentadvisor.net/events/us/newyork')

Unnamed: 0,Event_name,Venue,Date,Number_of_Attendees
0,James Murphy b2b The Black Madonna,Knockdown Center,"Fri, 13 Sep 2019 /",401.0
1,The Bunker with Wata Igarashi / Aleksi Perala ...,BASEMENT,"Fri, 13 Sep 2019 /",270.0
2,HARDER NYC,3 Dollar Bill,"Fri, 13 Sep 2019 /",118.0
3,Friday: Timmy Regisford All Night,Nowadays,"Fri, 13 Sep 2019 /",106.0
4,"Justin Martin, Justin Strauss (Elsewhere Rooftop)",Elsewhere,"Fri, 13 Sep 2019 /",94.0
5,Golden Record x Blind Colors present Jan Krueg...,TBA - New York,"Fri, 13 Sep 2019 /",88.0
6,"Love Medicine: Dramian, Concret, Kawas",The Williamsburg Hotel,"Fri, 13 Sep 2019 /",44.0
7,"Jlin, K-Hand, Scraaatch",Elsewhere,"Fri, 13 Sep 2019 /",36.0
8,"Wolf + Lamb, Lloyd Plus SPIRITUAL MENTAL PHYSICAL",Good Room,"Fri, 13 Sep 2019 /",
9,"Thugfucker, Alexander:Louis and A-Rock",Elsewhere,"Fri, 13 Sep 2019 /",


## Write a Function to Retrieve the URL for the Next Page

In [17]:
# Find the next button in the html
next_button = soup.find('li', id='liNext')
next_button

<li class="but arrow-right right" id="liNext">
<a ga-event-action="Next " ga-event-category="event-listings" ga-on="click" href="/events/us/newyork/week/2019-09-20">Next </a>
</li>

In [18]:
# Extract the slug for the next page's url
next_slug = next_button.find('a').attrs['href']
next_slug

'/events/us/newyork/week/2019-09-20'

In [19]:
# Split the original url so new slug can be attached
url = 'https://www.residentadvisor.net/events/us/newyork'
url_stem = url.split('/events')[0]
url_stem

'https://www.residentadvisor.net'

In [20]:
# Concatenate new url from stem and slug
next_url = url_stem + next_slug
next_url

'https://www.residentadvisor.net/events/us/newyork/week/2019-09-20'

In [36]:
# Define a function to get the next page's url given a url
def next_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
#     next_button = soup.find('li', id='liNext')
#     next_slug = next_button.find('a').attrs['href']
    url_ext = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
#     url_stem = url.split('/events')[0]
#     next_url = url_stem + next_slug
    next_page_url = "https://www.residentadvisor.net" + url_ext
    return next_url

In [37]:
# Test the function on a url
next_page('https://www.residentadvisor.net/events/us/newyork')

'https://www.residentadvisor.net/events/us/newyork/week/2019-09-20'

In [25]:
# Test the function on another url
next_page('https://www.residentadvisor.net/events/us/newyork/week/2019-09-20')

'https://www.residentadvisor.net/events/us/newyork/week/2019-09-27'

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [26]:
test_scrape = scrape_events('https://www.residentadvisor.net/events/us/newyork')
len(test_scrape)

130

In [40]:
# # Solution
# dfs = []
# total_rows = 0
# cur_url = "https://www.residentadvisor.net/events/us/newyork"
# while total_rows <= 1000:
#     df = scrape_events(cur_url)
#     dfs.append(df)
#     total_rows += len(df)
#     cur_url = next_page(cur_url)
#     time.sleep(.2)
# df = pd.concat(dfs)
# df = df.iloc[:1000]
# print(len(df))
# df.head()

1000


Unnamed: 0,Event_name,Venue,Date,Number_of_Attendees
0,James Murphy b2b The Black Madonna,Knockdown Center,"Fri, 13 Sep 2019 /",403.0
1,The Bunker with Wata Igarashi / Aleksi Perala ...,BASEMENT,"Fri, 13 Sep 2019 /",270.0
2,HARDER NYC,3 Dollar Bill,"Fri, 13 Sep 2019 /",122.0
3,Friday: Timmy Regisford All Night,Nowadays,"Fri, 13 Sep 2019 /",107.0
4,"Justin Martin, Justin Strauss (Elsewhere Rooftop)",Elsewhere,"Fri, 13 Sep 2019 /",94.0


In [41]:
# My attempt
dfs_list = []
total_length = 0
current_url = "https://www.residentadvisor.net/events/us/newyork"
while total_length <= 1000:
    new_df = scrape_events(current_url)
    dfs_list.append(new_df)
    total_length += len(new_df)
    current_url = next_page(current_url)
    time.sleep(1)
master_df = pd.concat(dfs_list)
master_df = master_df.iloc[:1000]
print(len(master_df))
master_df.head()

1000


Unnamed: 0,Event_name,Venue,Date,Number_of_Attendees
0,James Murphy b2b The Black Madonna,Knockdown Center,"Fri, 13 Sep 2019 /",403.0
1,The Bunker with Wata Igarashi / Aleksi Perala ...,BASEMENT,"Fri, 13 Sep 2019 /",270.0
2,HARDER NYC,3 Dollar Bill,"Fri, 13 Sep 2019 /",122.0
3,Friday: Timmy Regisford All Night,Nowadays,"Fri, 13 Sep 2019 /",107.0
4,"Justin Martin, Justin Strauss (Elsewhere Rooftop)",Elsewhere,"Fri, 13 Sep 2019 /",94.0


In [56]:
master_df.sort_values('Number_of_Attendees', ascending=False)

Unnamed: 0,Event_name,Venue,Date,Number_of_Attendees
78,All Day I Dream's Summer Season Closing,Brooklyn Mirage,"Sun, 15 Sep 2019 /",1903.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0
24,"Paradise New York: Jamie Jones, Masters",Work & More,"Sat, 21 Sep 2019 /",718.0


I'm not sure why there seem to be duplicate entries.

## Summary 

Congratulations! In this lab, you successfully scraped a website for concert event information!