# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature in your web browser to view the underlying HTML of the page.

In [None]:
# Open the inspect element feature in your browser

In [40]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [4]:
url = "https://www.residentadvisor.net/events/us/newyork"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')

In [5]:
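soup.prettify  # no parentheses, so the method is not called and the output below is its repr; use print(soup.prettify()) for formatted HTML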

<bound method Tag.prettify of <!DOCTYPE html>

<html lang="en,ja,es">
<head id="_x1">
<script>
            if (typeof dataLayer === 'undefined') {
                dataLayer = []
            }
        </script>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-940832047"></script>
<script>
            window.dataLayer = window.dataLayer || [];

            function gtag() { dataLayer.push(arguments); }
            gtag('js', new Date());

            gtag('config', 'AW-940832047');
        </script>
<script>
            (function (w, d, s, l, i) {
                w[l] = w[l] || []; w[l].push({
                    'gtm.start':
                        new Date().getTime(), event: 'gtm.js'
                }); var f = d.getElementsByTagName(s)[0],
                    j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
                        'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
       

In [23]:
soup.find('div', id="event-listing").findAll('li')[0].text.replace('/', '').strip()

'Wed, 02 Sep 2020'
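The cleaned date header is still a plain string. If a real date is needed later (for sorting, say), it could be parsed with the standard library. This is a minimal sketch; `parse_event_date` is a hypothetical helper name, and the format string assumes every header looks like the one shown above.

```python
from datetime import date, datetime

def parse_event_date(raw):
    """Turn a header such as 'Wed, 02 Sep 2020 /' into a datetime.date."""
    cleaned = raw.replace('/', '').strip()
    # %a = abbreviated weekday, %d = day, %b = abbreviated month, %Y = year
    return datetime.strptime(cleaned, '%a, %d %b %Y').date()

print(parse_event_date('Wed, 02 Sep 2020 /'))  # 2020-09-02
```

Note that `%a` and `%b` depend on the locale, so this assumes English month and weekday abbreviations.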

In [25]:
soup.find('div', id="event-listing").findAll('li')[1].text

"2020-09-02T00:00Outdoor Films: Sign o' the Times at NowadaysN/A2 Attending"

In [26]:
soup.find('div', id='event-listing').findAll('li')[0]

<li><p class="eventDate date"><a href="/events.aspx?ai=8&amp;v=day&amp;mn=9&amp;yr=2020&amp;dy=2"><span>Wed, 02 Sep 2020 /</span></a></p></li>

In [28]:
soup.find('div', id="event-listing").findAll('li')[0].find(class_='eventDate date')

<p class="eventDate date"><a href="/events.aspx?ai=8&amp;v=day&amp;mn=9&amp;yr=2020&amp;dy=2"><span>Wed, 02 Sep 2020 /</span></a></p>

In [76]:
soup.find('div', id='event-listing').findAll('li')[4].find('h1').find('span').text 

'at Now and Then'
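The venue text comes back with a leading "at ". One way to drop only that prefix is a regular expression anchored at the start of the string. The sketch below uses a hypothetical helper, `strip_venue_prefix`; the venue "Manhattan Club" is made up purely to show why a blanket `replace('at', '')` can be risky.

```python
import re

def strip_venue_prefix(text):
    """Remove a single leading 'at ' and leave the rest of the name untouched."""
    return re.sub(r'^\s*at\s+', '', text).strip()

print(strip_venue_prefix('at Now and Then'))    # Now and Then
print(strip_venue_prefix('at Manhattan Club'))  # Manhattan Club

# A plain replace removes every "at", including the one inside the name
print('at Manhattan Club'.replace('at', '').strip())  # Manhtan Club
```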

In [None]:
soup.find('div', id='event-listing').findAll('li')[1]

In [38]:
soup.find('div', id="event-listing").findAll('li')[1].find(class_='event-item clearfix').find(class_='event-title').find('a').text

"Outdoor Films: Sign o' the Times"

In [81]:
# Date headers and events are sibling <li> items inside the listing
for x in soup.find('div', id="event-listing").findAll('li'):
    if x.find(class_='eventDate date'):
        print(x.text.replace('/', '').strip())
    if x.find(class_='event-item clearfix') or x.find(class_='event-item clearfix tickets-bkg-logo'):
        print(x.find(class_='event-title').find('a').text)
        # Strip only the leading "at "; replace('at', '') would also remove "at" inside venue names
        print(re.sub(r'^\s*at\s+', '', x.find('h1').find('span').text).strip())
        print(x.find('p').find('span').text)

Wed, 02 Sep 2020
Outdoor Films: Sign o' the Times
Nowadays
2
Sat, 05 Sep 2020
Cuff Tour: Amine Edge & Dance with Raw Phonics
Pier 15
17
White Rabbit
Now and Then
7
Sun, 06 Sep 2020
Taj Lounge NYC LDW Hip Hop vs. Reggae® Sunday Funday Brunch Party
Taj Lounge
1


## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [82]:
def scrape_events(events_page_url):
    resp = requests.get(events_page_url)
    soup = BeautifulSoup(resp.content, 'html.parser')

    name_list = []
    date_list = []
    venue_list = []
    attendees_list = []
    date = None  # most recent date header seen; the events that follow belong to it

    for x in soup.find('div', id="event-listing").findAll('li'):
        if x.find(class_='eventDate date'):
            date = x.text.replace('/', '').strip()
        if x.find(class_='event-item clearfix') or x.find(class_='event-item clearfix tickets-bkg-logo'):
            name_list.append(x.find(class_='event-title').find('a').text)
            # Strip only the leading "at "; replace('at', '') would also remove "at" inside venue names
            venue_list.append(re.sub(r'^\s*at\s+', '', x.find('h1').find('span').text).strip())
            attendees_list.append(x.find('p').find('span').text)
            date_list.append(date)

    df = pd.DataFrame({'Event_Name': name_list,
                       'Venue': venue_list,
                       'Event_Date': date_list,
                       'Number_of_Attendees': attendees_list})
    return df

In [83]:
scrape_events('https://www.residentadvisor.net/events/us/newyork')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Outdoor Films: Sign o' the Times | Nowadays | Wed, 02 Sep 2020 | 2 |
| 1 | Cuff Tour: Amine Edge & Dance with Raw Phonics | Pier 15 | Sat, 05 Sep 2020 | 17 |
| 2 | White Rabbit | Now and Then | Sat, 05 Sep 2020 | 7 |
| 3 | Taj Lounge NYC LDW Hip Hop vs. Reggae® Sunday ... | Taj Lounge | Sun, 06 Sep 2020 | 1 |


## Write a Function to Retrieve the URL for the Next Page

In [89]:
'https://www.residentadvisor.net' + soup.find('li', class_='but arrow-right right').find('a')['href']

'https://www.residentadvisor.net/events/us/newyork/week/2020-09-09'

In [87]:
url

'https://www.residentadvisor.net/events/us/newyork'

In [90]:
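def next_page(url):
    # Parse the page at `url` itself rather than reusing the global `soup`,
    # which would return the same next-page link on every call
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'html.parser')

    back_of_url = soup.find('li', class_='but arrow-right right').find('a')['href']
    next_page_url = 'https://www.residentadvisor.net' + back_of_url
    return next_page_url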

In [92]:
next_page('https://www.residentadvisor.net/events/us/newyork')

'https://www.residentadvisor.net/events/us/newyork/week/2020-09-09'
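Joining the site root and the relative `href` by string concatenation works for the link shown here. As an alternative sketch, `urllib.parse.urljoin` from the standard library does the same job and also copes with an `href` that is already absolute. `BASE_URL` and `absolute_url` are hypothetical names introduced only for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.residentadvisor.net'  # site root, as used in the lab

def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the site root."""
    return urljoin(base, href)

print(absolute_url('/events/us/newyork/week/2020-09-09'))
# https://www.residentadvisor.net/events/us/newyork/week/2020-09-09
```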

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.
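The sort itself might look like the sketch below. It assumes the column names from `scrape_events`, that attendee counts arrive as strings (so they are converted to numbers first, otherwise "17" would sort before "2"), and that "sorted by the number of attendees" means largest first with earlier dates breaking ties. `sort_events` is a hypothetical helper, and the sample rows are copied from the outputs in this lab.

```python
import pandas as pd

def sort_events(df):
    """Sort by attendees (descending), then by event date (ascending)."""
    out = df.copy()
    out['Number_of_Attendees'] = pd.to_numeric(out['Number_of_Attendees'])
    out['Event_Date'] = pd.to_datetime(out['Event_Date'], format='%a, %d %b %Y')
    out = out.sort_values(['Number_of_Attendees', 'Event_Date'],
                          ascending=[False, True])
    return out.reset_index(drop=True)

sample = pd.DataFrame({
    'Event_Name': ['Rooftop Yoga Dance Event', 'Taj Lounge brunch',
                   'Cuff Tour', 'Outdoor Films'],
    'Venue': ['Singularity Bushwick', 'Taj Lounge', 'Pier 15', 'Nowadays'],
    'Event_Date': ['Sun, 13 Sep 2020', 'Sun, 06 Sep 2020',
                   'Sat, 05 Sep 2020', 'Wed, 02 Sep 2020'],
    'Number_of_Attendees': ['1', '1', '17', '2'],
})

print(sort_events(sample))
```

If ascending order is wanted instead, only the `ascending` list needs to change.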

In [None]:
#Your code here

In [94]:
#grabs 2 pages

#start with blank dataframe
df1 = pd.DataFrame()

#grab first page
df_first_page = scrape_events('https://www.residentadvisor.net/events/us/newyork')

In [98]:
df1 = pd.concat([df1, df_first_page])

In [99]:
df1

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Outdoor Films: Sign o' the Times | Nowadays | Wed, 02 Sep 2020 | 2 |
| 1 | Cuff Tour: Amine Edge & Dance with Raw Phonics | Pier 15 | Sat, 05 Sep 2020 | 17 |
| 2 | White Rabbit | Now and Then | Sat, 05 Sep 2020 | 7 |
| 3 | Taj Lounge NYC LDW Hip Hop vs. Reggae® Sunday ... | Taj Lounge | Sun, 06 Sep 2020 | 1 |


In [100]:
#grab url of next page
next_url = next_page('https://www.residentadvisor.net/events/us/newyork')
next_url

'https://www.residentadvisor.net/events/us/newyork/week/2020-09-09'

In [102]:
#grab 2nd page
df_second_page = scrape_events(next_url)
df_second_page

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | She Past Away | Le Poisson Rouge | Wed, 09 Sep 2020 | 3 |
| 1 | Shake That Midnight Yacht Cruise | Harbor Lights Yacht | Sat, 12 Sep 2020 | 2 |
| 2 | Rooftop Yoga Dance Event | Singularity Bushwick | Sun, 13 Sep 2020 | 1 |
| 3 | Rooftop Yoga Dance Event | Singularity Bushwick | Mon, 14 Sep 2020 | 1 |


In [106]:
df2 = pd.concat([df_first_page, df_second_page])
df2

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Outdoor Films: Sign o' the Times | Nowadays | Wed, 02 Sep 2020 | 2 |
| 1 | Cuff Tour: Amine Edge & Dance with Raw Phonics | Pier 15 | Sat, 05 Sep 2020 | 17 |
| 2 | White Rabbit | Now and Then | Sat, 05 Sep 2020 | 7 |
| 3 | Taj Lounge NYC LDW Hip Hop vs. Reggae® Sunday ... | Taj Lounge | Sun, 06 Sep 2020 | 1 |
| 0 | She Past Away | Le Poisson Rouge | Wed, 09 Sep 2020 | 3 |
| 1 | Shake That Midnight Yacht Cruise | Harbor Lights Yacht | Sat, 12 Sep 2020 | 2 |
| 2 | Rooftop Yoga Dance Event | Singularity Bushwick | Sun, 13 Sep 2020 | 1 |
| 3 | Rooftop Yoga Dance Event | Singularity Bushwick | Mon, 14 Sep 2020 | 1 |


In [110]:
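# Start with a blank DataFrame and keep following next-page links
df1 = pd.DataFrame()
url = 'https://www.residentadvisor.net/events/us/newyork'

while len(df1) < 1000:
    print(f'Currently have {len(df1)} events scraped')

    df_events = scrape_events(url)

    df1 = pd.concat([df1, df_events], ignore_index=True)

    try:
        url = next_page(url)
    except (AttributeError, TypeError, requests.RequestException):
        # No next-page link was found, or the request failed: stop paging
        break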


Currently have 0 events scraped
Currently have 4 events scraped
Currently have 8 events scraped
Currently have 12 events scraped
Currently have 16 events scraped
Currently have 20 events scraped
Currently have 24 events scraped
Currently have 28 events scraped
Currently have 32 events scraped
Currently have 36 events scraped
Currently have 40 events scraped
Currently have 44 events scraped
Currently have 48 events scraped
Currently have 52 events scraped
Currently have 56 events scraped
Currently have 60 events scraped
Currently have 64 events scraped
Currently have 68 events scraped
Currently have 72 events scraped
Currently have 76 events scraped
Currently have 80 events scraped
Currently have 84 events scraped
Currently have 88 events scraped
Currently have 92 events scraped
Currently have 96 events scraped
Currently have 100 events scraped
Currently have 104 events scraped
Currently have 108 events scraped
Currently have 112 events scraped
Currently have 116 events scraped
Currentl

KeyboardInterrupt: 
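The count above grows by exactly four on every pass, which is what one would expect to see if the same page were being fetched over and over. A pagination loop can guard against that by remembering which URLs it has already visited. The sketch below is one possible shape, not the lab's reference solution: `collect_events`, `scrape_page`, `get_next_url`, and `max_pages` are hypothetical names, and the "site" is a small dictionary so the example runs without any network access.

```python
def collect_events(start_url, scrape_page, get_next_url, target=1000, max_pages=500):
    """Follow next-page links until `target` rows are collected.

    scrape_page(url) returns a list of rows for that page;
    get_next_url(url) returns the next URL, or None when there is none.
    """
    rows, seen, url = [], set(), start_url
    while url and url not in seen and len(rows) < target and len(seen) < max_pages:
        seen.add(url)                 # remember the page so a cycle ends the loop
        rows.extend(scrape_page(url))
        url = get_next_url(url)
    return rows[:target]

# Fake two-page site: page2 links back to itself, which would loop forever
# without the `seen` check.
fake_site = {
    'page1': (['a', 'b'], 'page2'),
    'page2': (['c', 'd'], 'page2'),
}

events = collect_events('page1',
                        lambda u: fake_site[u][0],
                        lambda u: fake_site[u][1])
print(events)  # ['a', 'b', 'c', 'd']
```

With real pages, `scrape_page` could wrap `scrape_events` and `get_next_url` could wrap `next_page`; adding a short `time.sleep` between requests would also be considerate toward the site.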

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!