# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site.
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Build a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data
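
The "handles errors" part of this objective can be sketched as a small retry wrapper. This is only an illustration: `fetch_with_retries` and `flaky_fetch` are hypothetical names that do not appear elsewhere in this lab, and `fetch` stands for any callable that returns page content or raises (for example, a thin wrapper around `requests.get`).

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=0.0):
    """Call fetch(url), retrying on any exception for up to `retries` attempts.

    Returns whatever `fetch` returns, or None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:
            print(f"attempt {attempt} failed for {url}: {err}")
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    return None


# Stand-in fetcher that fails twice and then succeeds, so no network is needed.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "https://example.com/events")
print(result)
```

Passing the fetcher in as an argument keeps the retry logic separate from the HTTP library, which also makes it easy to try out without hitting the live site.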

## View the Website

For this lab, you'll be scraping https://www.residentadvisor.net. Start by opening the [events page](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

In [15]:
import requests
import pandas as pd
from bs4 import BeautifulSoup 

url = 'https://www.residentadvisor.net/events'

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser
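
To connect what the inspector shows with BeautifulSoup calls, here is a minimal offline sketch. The markup in `sample_html` is hypothetical: it is reconstructed from the selectors used later in this lab (`ul#items`, `article`, `h1.event-title`, and so on), and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inferred from the selectors used in this lab.
sample_html = """
<ul id="items">
  <li>
    <article>
      <time>2019-12-08T00:00</time>
      <h1 class="event-title">
        <a itemprop="url" href="/events/1">R U Down</a>
        <span>at The Dive</span>
      </h1>
    </article>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
article = soup.find("ul", id="items").find("article")

name = article.find("a", {"itemprop": "url"}).text
date = article.find("time").text
venue = article.find("h1", {"class": "event-title"}).find("span").text
attending = article.find("p")  # None: this sample event has no attendee block

print(name, date, venue, attending)
```

Working against a saved snippet like this is a cheap way to check selectors before sending requests to the real site.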

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a pandas DataFrame with the columns `Event_Name`, `Venue`, `Event_Date` and `Number_of_Attendees`.

In [2]:
def scrape_events(events_page_url):
    """Scrape one events page into a DataFrame with one row per event."""
    columns = ['Event_Name', 'Venue', 'Event_Date', 'Number_of_Attendees']

    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # All events live inside <ul id="items">, one <article> per event
    event_block = soup.find('ul', id='items')
    if event_block is None:
        # No event list on this page: return an empty frame with the same columns
        return pd.DataFrame(columns=columns)

    rows = []
    for article in event_block.find_all('article'):
        event_name = article.find('a', {'itemprop': 'url'}).text
        event_date = article.find('time').text

        # The venue (and city, when listed) is the <span> inside the title heading
        heading = article.find('h1', {'class': 'event-title'})
        event_venue = heading.find('span').text

        # The attendee count sits in a <span> inside the first <p>;
        # when that block is missing, record 0 attendees
        tag = article.find('p')
        count = tag.find('span') if tag is not None else None
        attendees = int(count.text) if count is not None else 0

        rows.append([event_name, event_venue, event_date, attendees])

    return pd.DataFrame(rows, columns=columns)

display(scrape_events(url))

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Steve Lawler (Viva Music) | at Club Here I Love You, El Paso | 2019-12-08T00:00 | 4 |
| 1 | Tacky Sweater Tech House Party | at Voodoo Rm 3rd Floor, Austin | 2019-12-08T00:00 | 5 |
| 2 | R U Down | at The Dive | 2019-12-08T00:00 | 0 |
| 3 | Demuja by BauHaus Houston | at Bauhaus, Houston | 2019-12-13T00:00 | 4 |
| 4 | Pitch Tempo X Propa: Lauren Flax | at The Rooftop @ TNT, Dallas/Fort Worth | 2019-12-13T00:00 | 0 |
| 5 | Tourist & Matthew Dear | at The Parish, Austin | 2019-12-14T00:00 | 5 |
| 6 | Cirque Noir presents: Do Not Sit in Houston | at Secret Location, Houston | 2019-12-14T00:00 | 4 |
| 7 | As.If Records vs Denied Music | at TBA - Austin, Austin | 2019-12-14T00:00 | 2 |
| 8 | Underwater Productions feat. Demuja | at The Underground, Dallas/Fort Worth | 2019-12-14T00:00 | 1 |
| 9 | Disrupt Pres. Annie Errez | at Bauhaus, Houston | 2019-12-14T00:00 | 0 |
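
The scraped `Venue` values keep the leading "at " and, for most rows, a trailing city. A possible cleanup step is sketched below; `split_venue` is a hypothetical helper, and it assumes the format seen in the output above (an optional "at " prefix, with the city after the last comma).

```python
def split_venue(raw):
    """Split a scraped venue string into (venue, city).

    Assumes an optional leading "at ", then the venue name, then an
    optional ", City" suffix. Returns city=None when no city is listed.
    """
    text = raw.strip()
    if text.startswith("at "):
        text = text[3:]
    # Split on the last comma so commas inside the venue name survive.
    venue, sep, city = text.rpartition(", ")
    if not sep:  # no comma found: the whole string is the venue
        return text, None
    return venue, city


print(split_venue("at Bauhaus, Houston"))
print(split_venue("at The Dive"))
```

Applied to the whole column, something like `df['Venue'].apply(split_venue)` would yield one `(venue, city)` tuple per row.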


## Write a Function to Retrieve the URL for the Next Page

In [17]:
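from urllib.parse import urljoin

def next_page(url):
    """Return the URL of the next events page, or False if there is none."""
    base_url = 'https://www.residentadvisor.net/'
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    # The "Next" control is an <a> inside <li id="liNext">
    next_item = soup.find('li', id='liNext')
    link = next_item.find('a') if next_item is not None else None

    # On the last page the link is missing or has no href attribute
    if link is not None and link.has_attr('href'):
        # urljoin avoids a double slash when the href starts with "/"
        return urljoin(base_url, link['href'])
    return False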

print(next_page('https://www.residentadvisor.net/events/us/texas/week/2020-04-18'))

False
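
When building the next-page URL, plain string concatenation of a site root and an `href` can produce a double slash if the `href` starts with `/`. The sketch below compares concatenation with `urllib.parse.urljoin`. The `href` value is a hypothetical example, not taken from the live page.

```python
from urllib.parse import urljoin

base_url = "https://www.residentadvisor.net/"
href = "/events/us/texas/week/2019-12-15"  # hypothetical href value

naive = base_url + href           # keeps both slashes
joined = urljoin(base_url, href)  # resolves the href against the site root

print(naive)
print(joined)
```

`urljoin` gives the same result whether or not the `href` has a leading slash, which makes it the safer choice when the markup is not under your control.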


## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [18]:
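# Collect one DataFrame per page, then combine them once at the end
columns = ['Event_Name', 'Venue', 'Event_Date', 'Number_of_Attendees']
pages = []
total_events = 0

while url and total_events < 1000:
    page_df = scrape_events(url)
    pages.append(page_df)
    total_events += len(page_df)

    # next_page returns False on the last page, which ends the loop
    url = next_page(url)

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame(columns=columns)

display(df)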
    

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Steve Lawler (Viva Music) | at Club Here I Love You, El Paso | 2019-12-08T00:00 | 4 |
| 1 | Tacky Sweater Tech House Party | at Voodoo Rm 3rd Floor, Austin | 2019-12-08T00:00 | 5 |
| 2 | R U Down | at The Dive | 2019-12-08T00:00 | 0 |
| 3 | Demuja by BauHaus Houston | at Bauhaus, Houston | 2019-12-13T00:00 | 4 |
| 4 | Pitch Tempo X Propa: Lauren Flax | at The Rooftop @ TNT, Dallas/Fort Worth | 2019-12-13T00:00 | 0 |
| 5 | Tourist & Matthew Dear | at The Parish, Austin | 2019-12-14T00:00 | 5 |
| 6 | Cirque Noir presents: Do Not Sit in Houston | at Secret Location, Houston | 2019-12-14T00:00 | 4 |
| 7 | As.If Records vs Denied Music | at TBA - Austin, Austin | 2019-12-14T00:00 | 2 |
| 8 | Underwater Productions feat. Demuja | at The Underground, Dallas/Fort Worth | 2019-12-14T00:00 | 1 |
| 9 | Disrupt Pres. Annie Errez | at Bauhaus, Houston | 2019-12-14T00:00 | 0 |
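
The sorting requirement for this section (by number of attendees, with ties broken by event date) can be sketched with `DataFrame.sort_values`. The rows below are a small subset of the scraped output. The sort directions are an assumption, since the lab does not state them: most attendees first, and the earlier date first within a tie.

```python
import pandas as pd

events = pd.DataFrame({
    "Event_Name": [
        "R U Down",
        "Tourist & Matthew Dear",
        "Tacky Sweater Tech House Party",
        "Demuja by BauHaus Houston",
    ],
    "Event_Date": [
        "2019-12-08T00:00",
        "2019-12-14T00:00",
        "2019-12-08T00:00",
        "2019-12-13T00:00",
    ],
    "Number_of_Attendees": [0, 5, 5, 4],
})

# Parse the ISO strings so dates are compared as dates, not as text
events["Event_Date"] = pd.to_datetime(events["Event_Date"])

# Assumed ordering: attendees descending, then date ascending for ties
ranked = events.sort_values(
    by=["Number_of_Attendees", "Event_Date"],
    ascending=[False, True],
).reset_index(drop=True)

print(ranked)
```

The two events with 5 attendees are ordered by date, so the December 8 event comes before the December 14 one.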


## Please Note

I have scraped the maximum possible number of events. There are no events listed past April 18, 2020, and the "Next" button on that page is greyed out and does not appear to have an `href` attribute (probably by design, to make it non-functional).

Interestingly enough, my birthday is the next day (April 19)!

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!