# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Scrape events from a website
* Follow links to those events to retrieve further information
* Clean and store scraped data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(html_page.content, 'html.parser')

In [2]:
#Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [3]:
#Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [4]:
def get_items(soup):
    items = soup.findAll('article', class_='highlight-top')
    return items
items = get_items(soup)
items[0]

<article class="highlight-top">
<p>Thu, 13 Jun 2019</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1275181"><img class="nohide" src="/images/events/flyer/2019/6/us-0613-1275181-list.jpg"/></a>
<p class="counter nohide">
<span>615</span> attending
</p>
<a ga-event-action="popular-events" ga-event-category="events-page" ga-on="click" href="/events/1275181">
<h1>
Morgana [free entry]: Atish, John Tejada Live, Nicone, Patricio
</h1>
</a>
<p class="copy nohide">
<a href="\club.aspx?id=105938">Brooklyn Mirage</a>
</p>
</article>

In [5]:
def get_names_venues(soup):
    event_names = [article.find('h1').text.strip() for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    names = []
    venues = []
    for e in event_names:
        names.append(e.split(' at ')[0])
        venues.append(e.split(' at ')[1])
    return names, venues
e_names, e_venues = get_names_venues(soup)

In [6]:
def get_dates(soup):
    dates = []
    event_dates = [article.find('time').text.strip() for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    for d in event_dates:
        dates.append(d.split('T')[0])
    return dates
e_dates = get_dates(soup)

In [7]:
def get_attending(soup):
    attending = []
    event_attending = [article.find('p') for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    for a in event_attending:
        try:
            attending.append(int(a.find('span').text))
        except:
            attending.append(a)
    return attending
e_attending = get_attending(soup)

In [8]:
df = pd.DataFrame([e_names, e_venues, e_dates, e_attending]).transpose()
df.columns = (["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"])
df.head()

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,"Delivery. Carlos Alkalina, Playsuit, Pjay, Bytz",Ms. Yoo,2019-06-11,5.0
1,Cancelled,14B Rooftop / Lounge,2019-06-11,2.0
2,Tempo with Steve Tek (Open to Close),TBA Brooklyn,2019-06-11,2.0
3,"Feel Real with DJ Disciple, Ejoe Wilson Friends",Rumpus Room,2019-06-11,1.0
4,Small Rave 019 - Xcreenplay (Live)/ Perris/ Fe...,Bossa Nova Civic Club,2019-06-11,


In [9]:
def scrape_events(events_page_url):
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    e_names, e_venues = get_names_venues(soup)
    e_dates = get_dates(soup)
    e_attending = get_attending(soup)
    df = pd.DataFrame([e_names, e_venues, e_dates, e_attending]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [10]:
scrape_events('https://www.residentadvisor.net/events/us/newyork')

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,"Delivery. Carlos Alkalina, Playsuit, Pjay, Bytz",Ms. Yoo,2019-06-11,5
1,Cancelled,14B Rooftop / Lounge,2019-06-11,2
2,Tempo with Steve Tek (Open to Close),TBA Brooklyn,2019-06-11,2
3,"Feel Real with DJ Disciple, Ejoe Wilson Friends",Rumpus Room,2019-06-11,1
4,Small Rave 019 - Xcreenplay (Live)/ Perris/ Fe...,Bossa Nova Civic Club,2019-06-11,
5,Feel Free: Grand Atrium + Buskko,Kinfolk 90,2019-06-11,
6,Nacho Rojas' B2b2b2b2b2b2b Bday Bash [actual B...,TBA - New York,2019-06-12,5
7,Party Party Party,Bossa Nova Civic Club,2019-06-12,9
8,"Full Flex with DJ Swagger, Klein Zage b2b Joey...",Good Room,2019-06-12,8
9,"Ūndisclosed - Ryan Crosson, Jeff Veliz, Martin...",TBA Brooklyn,2019-06-12,6


## Write a Function to Retrieve the URL for the Next Page

In [11]:
def next_page(url):
    next_page_url = soup.find(id="liNext")
    if next_page_url:
        return str("https://www.residentadvisor.net" + next_page_url.find('a').attrs['href'])
    else:
        return None
    return next_page_url

In [12]:
print(next_page('https://www.residentadvisor.net/events/us/newyork'))

https://www.residentadvisor.net/events/us/newyork/week/2019-06-18


## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [62]:
def get_names(soup, names):
    event_names = [article.find('h1').text.strip() for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    for e in event_names:
        names.append(e.split(' at ')[0])
    return names


def get_venues(soup, venues):
    event_venues = [article.find('h1').text.strip() for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    for e in event_venues:
        venues.append(e.split(' at ')[1])
    return venues
    

def get_dates(soup, dates):
    event_dates = [article.find('time').text.strip() for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    for d in event_dates:
        dates.append(d.split('T')[0])
    return dates


def get_attending(soup, attending):
    event_attending = [article.find('p') for article in soup.findAll('article', attrs={'itemtype':'http://data-vocabulary.org/Event'})]
    for a in event_attending:
        try:
            attending.append(int(a.find('span').text))
        except:
            attending.append(a)
    return attending    

    
def events_to_df(names, venues, dates, attending):
    df = pd.DataFrame([names, venues, dates, attending]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df


# def next_page(url):
#     next_page_url = soup.find(id="liNext")
#     if next_page_url:
#         return str("https://www.residentadvisor.net" + next_page_url.find('a').attrs['href'])
#     else:
#         return None
#     return next_page_url


def next_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    url_ext = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
    next_page_url = "https://www.residentadvisor.net" + url_ext
    #Your code here
    return next_page_url



def parse_url(url, names=[], venues=[], dates=[], attending=[]):
    names=names
    venues=venues
    dates=dates
    attending=attending
    url = url
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    names += get_names(soup, names)
    venues += get_venues(soup, venues)
    dates += get_dates(soup, dates)
    attending += get_attending(soup, attending)
    for i in range (0,10000):
        while len(names) < 1000:
            url = next_page(url)
            html_page = requests.get(url)
            soup = BeautifulSoup(html_page.content, 'html.parser')
            names += get_names(soup, names)
            venues += get_venues(soup, venues)
            dates += get_dates(soup, dates)
            attending += get_attending(soup, attending)
    return names, venues, dates, attending

In [63]:
url = "https://www.residentadvisor.net/events/us/newyork"

In [64]:
names, venues, dates, attending = parse_url(url)

In [65]:
df1 = events_to_df(names, venues, dates, attending).copy()
df1[360:920]

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
360,718 Sessions Returns to the Elsewhere Rooftop,Elsewhere,2019-06-24,21
361,Body Music Therapy with Nihal Ramchandani,Nowadays,2019-06-24,6
362,Rollup 1 Year Anniversary: Takaya Nagase zorenLo,Bossa Nova Civic Club,2019-06-24,2
363,"Delivery. Carlos Alkalina, Playsuit, Pjay, Bytz",Ms. Yoo,2019-06-11,5
364,Cancelled,14B Rooftop / Lounge,2019-06-11,2
365,Tempo with Steve Tek (Open to Close),TBA Brooklyn,2019-06-11,2
366,"Feel Real with DJ Disciple, Ejoe Wilson Friends",Rumpus Room,2019-06-11,1
367,Small Rave 019 - Xcreenplay (Live)/ Perris/ Fe...,Bossa Nova Civic Club,2019-06-11,
368,Feel Free: Grand Atrium + Buskko,Kinfolk 90,2019-06-11,
369,Nacho Rojas' B2b2b2b2b2b2b Bday Bash [actual B...,TBA - New York,2019-06-12,5


## Summary 

Congratulations! In this lab, you successfully scraped a website for concert event information!