# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [1]:
import re
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

In [3]:
response = requests.get("https://www.residentadvisor.net/events/us/newyork")
soup = BeautifulSoup(response.content, 'html.parser')
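
Since one of the objectives is dealing with errors, the plain `requests.get` call above could be wrapped in a small helper that retries on failure. The following is only a sketch under assumptions: `fetch_html` and its `retries`, `backoff`, and `timeout` parameters are hypothetical names that do not appear elsewhere in this lab, and the right retry policy depends on how the site behaves.

```python
import time

import requests


def fetch_html(url, session=None, retries=3, backoff=1.0, timeout=10):
    """Return the page body as text, retrying on network or HTTP errors.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    session = session or requests.Session()
    last_error = None
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return response.text
        except requests.RequestException as err:
            last_error = err
            if attempt < retries - 1:
                # Wait a little longer after each failed attempt
                time.sleep(backoff * (attempt + 1))
    raise last_error


# Example usage (performs a real request, so it is left commented out):
# html = fetch_html("https://www.residentadvisor.net/events/us/newyork")
```

The returned text can be passed to `BeautifulSoup(html, 'html.parser')` in the same way as `response.content`.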

In [5]:
event_listings = soup.find('div', id="event-listing")
entries = event_listings.findAll('li')
entries[0]

<li><p class="eventDate date"><a href="/events.aspx?ai=8&amp;v=day&amp;mn=7&amp;yr=2020&amp;dy=9"><span>Thu, 09 Jul 2020 /</span></a></p></li>
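
The first entry is a date header rather than an event. As a rough sketch, the header text could be converted into a real date so that it sorts chronologically later on. The helper name `parse_date_header` is hypothetical, and the format string assumes every header looks like the sample above.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# The date-header <li> shown in the output above
sample_html = (
    '<li><p class="eventDate date">'
    '<a href="/events.aspx?ai=8&amp;v=day&amp;mn=7&amp;yr=2020&amp;dy=9">'
    '<span>Thu, 09 Jul 2020 /</span></a></p></li>'
)


def parse_date_header(li_html):
    """Return the date in a date-header <li>, or None if it is not a header."""
    tag = BeautifulSoup(li_html, "html.parser").find("p", class_="eventDate date")
    if tag is None:
        return None
    # Drop the trailing " /" before parsing (assumed header format)
    text = tag.get_text().rstrip(" /")
    return datetime.strptime(text, "%a, %d %b %Y").date()


parsed = parse_date_header(sample_html)
print(parsed)
```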

In [6]:
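def scrape_events(events_page_url):
    """Return a DataFrame with one row per event listed on the given page."""
    columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Look up the listing in the page just fetched, not in a notebook global
    event_listings = soup.find('div', id="event-listing")
    if event_listings is None:
        return pd.DataFrame(columns=columns)
    rows = []
    cur_date = None  # most recent date header seen so far
    for entry in event_listings.find_all('li'):
        date = entry.find('p', class_="eventDate date")
        event = entry.find('h1', class_="event-title")
        if event:
            # Titles read "<event name> at <venue>"
            event_name, _, venue = event.text.partition(' at ')
            try:
                n_attendees = int(re.match(r"(\d+)", entry.find('p', class_="attending").text)[0])
            except (AttributeError, TypeError, ValueError):
                # No attendance tag, or its text does not start with digits
                n_attendees = np.nan
            rows.append([event_name.strip(), venue.strip(), cur_date, n_attendees])
        elif date:
            # Date headers are separate <li> items that precede their events
            cur_date = date.text
    return pd.DataFrame(rows, columns=columns)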
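
The two most fragile steps in the function above are splitting the title into event name and venue, and reading the attendee count. One option is to pull them into small helpers that can be checked without a network request. This is a sketch only: `split_title` and `parse_attendees` are hypothetical names, the `"9 Attending"` string is an assumed shape for the attendance text, and splitting on the last `" at "` (rather than the first) is an assumption that venue names do not themselves contain `" at "`.

```python
import math
import re


def split_title(title):
    """Split 'Event Name at Venue' on the last ' at ' (assumed title format)."""
    name, sep, venue = title.rpartition(" at ")
    if not sep:
        # No ' at ' found: treat the whole title as the event name
        return title.strip(), None
    return name.strip(), venue.strip()


def parse_attendees(text):
    """Return the leading integer in the attendance text, or NaN if absent."""
    match = re.match(r"\s*(\d+)", text or "")
    return int(match.group(1)) if match else math.nan


print(split_title("Techno Cabin 5.0 at TBA - New York"))
print(parse_attendees("9 Attending"))  # assumed text shape
```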


In [9]:
url = "https://www.residentadvisor.net/events/us/newyork"
scrape_events(url)

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Virtual Thursday: Planetarium with Changsie | Nowadays | Thu, 09 Jul 2020 / | 2 |
| 1 | Virtual Thursday: Now Then: Ciel Chats with Ea... | Nowadays | Thu, 09 Jul 2020 / | 2 |
| 2 | Techno Cabin 5.0 | TBA - New York | Fri, 10 Jul 2020 / | 9 |
| 3 | Manhattan Hip Hop vs. Reggae® Midnight Yacht P... | Skyport Marina | Fri, 10 Jul 2020 / | 1 |
| 4 | NYC LED Glowsticks Booze Cruise Yacht Party 2020 | Skyport Marina | Fri, 10 Jul 2020 / | 1 |
| 5 | Virtual Friday: Beautiful Swimmers, DJ Freez a... | Nowadays | Fri, 10 Jul 2020 / | 1 |
| 6 | Techno Cabin 5.0 | TBA - New York | Sat, 11 Jul 2020 / | 9 |
| 7 | Ride The Wave Midnight Yacht Cruise | Harbor Lights Yacht | Sat, 11 Jul 2020 / | 1 |
| 8 | Virtual Saturday: Groovy Groovy with Akanbi an... | Nowadays | Sat, 11 Jul 2020 / | 1 |
| 9 | Virtual Mister Sunday: Eamon Harkin and Justin... | Nowadays | Sun, 12 Jul 2020 / | 1 |


## Write a Function to Retrieve the URL for the Next Page

In [13]:
url = 'https://www.residentadvisor.net/events/us/newyork/week/2020-07-14'
def next_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    url_ext = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
    next_page_url = "https://www.residentadvisor.net" + url_ext
    return next_page_url

next_page(url)

'https://www.residentadvisor.net/events/us/newyork/week/2020-07-21'
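
`next_page` raises an `AttributeError` when a page has no "Next" link, which would stop a multi-page crawl. A possible variation, sketched here on an inline HTML string instead of a live request, returns `None` in that case and builds the absolute URL with `urljoin`. The name `extract_next_url` and the sample markup are hypothetical; only the `ga-event-action` attribute and its `"Next "` value come from the function above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.residentadvisor.net"


def extract_next_url(html, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None if there is no link."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", attrs={"ga-event-action": "Next "})
    if link is None or not link.get("href"):
        return None
    # urljoin resolves the relative href against the site root
    return urljoin(base_url, link["href"])


# Hypothetical markup mimicking the site's "Next" link
sample = '<a ga-event-action="Next " href="/events/us/newyork/week/2020-07-21">Next</a>'
print(extract_next_url(sample))
print(extract_next_url("<p>No more pages</p>"))
```

A crawl loop can then stop cleanly when the helper returns `None` instead of crashing.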

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [14]:
dfs = []
total_rows = 0
# The lab asks for 1000 events, but the site did not list that many upcoming
# events (most likely because of COVID-19), so this run stops at 100.
target_rows = 100
url = "https://www.residentadvisor.net/events/us/newyork"
while total_rows < target_rows:
    df = scrape_events(url)
    dfs.append(df)
    total_rows += len(df)
    url = next_page(url)
    time.sleep(.2)  # pause briefly between requests
df = pd.concat(dfs, ignore_index=True)
df = df.iloc[:target_rows]
print(len(df))

100
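
The cell above collects the events but does not yet sort them as the instructions ask. A minimal sketch of that step follows, assuming "sorted by the number of attendees" means most attended first and that ties should be ordered by the earliest date. The `sort_events` helper and the temporary `Parsed_Date` column are hypothetical names, and the rows are copied from the sample output earlier in this lab so the example runs offline.

```python
import pandas as pd

# Illustrative rows taken from the sample output earlier in this lab
events = pd.DataFrame(
    [
        ["Ride The Wave Midnight Yacht Cruise", "Harbor Lights Yacht", "Sat, 11 Jul 2020 /", 1],
        ["Techno Cabin 5.0", "TBA - New York", "Sat, 11 Jul 2020 /", 9],
        ["Virtual Thursday: Planetarium with Changsie", "Nowadays", "Thu, 09 Jul 2020 /", 2],
        ["Techno Cabin 5.0", "TBA - New York", "Fri, 10 Jul 2020 /", 9],
    ],
    columns=["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"],
)


def sort_events(df):
    """Sort by attendees (descending), breaking ties by event date (ascending)."""
    # Event_Date is text such as "Thu, 09 Jul 2020 /", so parse it first;
    # sorting the raw strings would order by weekday name, not by date.
    parsed = pd.to_datetime(df["Event_Date"].str.rstrip(" /"), format="%a, %d %b %Y")
    return (
        df.assign(Parsed_Date=parsed)
        .sort_values(["Number_of_Attendees", "Parsed_Date"], ascending=[False, True])
        .drop(columns="Parsed_Date")
        .reset_index(drop=True)
    )


ranked = sort_events(events)
print(ranked)
```

Calling `sort_events(df)` on the scraped DataFrame would apply the same ordering to the full result.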


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!