## Lab | Web Scraping Multiple Pages
Jorge Castro DAPT NOV 2021

### Instructions

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

# 100 songs from Billboard Single Page

In [72]:
from bs4 import BeautifulSoup
from time import sleep
import random
import requests
import pandas as pd
from datetime import datetime
# Library to create a progress bar
from tqdm.notebook import tqdm

In [73]:
r = requests.get('https://www.billboard.com/charts/hot-100/')

In [74]:
r.status_code

200

In [75]:
html = r.content

In [76]:
#html

In [77]:
soup = BeautifulSoup(html, 'html.parser')

In [78]:
#soup

In [79]:
html_songs = soup.find_all('h3', attrs={'id': 'title-of-a-story'})

In [80]:
#html_songs

In [81]:
html_top1 = soup.find_all('span', attrs={'class': 'c-label a-no-trucate a-font-primary-s \
lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 \
lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet'})


In [82]:
html_top1

[<span class="c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet">
 	
 	Jack Harlow
 </span>]

In [83]:
html_artists = soup.find_all('span', attrs={'class': 'c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max \
u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 \
u-max-width-230@tablet-only'})

In [84]:
#html_artists

In [85]:
# Inspecting the Billboard chart shows that a parent "li" element contains both the song title and the artist.
# Because the fonts differ, the first song's "li" has one class string and all the other songs share another.
# So we have to collect two lists and put them together.

In [86]:
html_first = soup.find_all('li', attrs={'class': 'o-chart-results-list__item // lrv-u-flex-grow-1 lrv-u-flex \
lrv-u-flex-direction-column lrv-u-justify-content-center lrv-u-border-b-1 u-border-b-0@mobile-max \
lrv-u-border-color-grey-light lrv-u-padding-l-1@mobile-max'})

In [87]:
#html_first

In [88]:

html_rest = soup.find_all('li', attrs={'class': 'o-chart-results-list__item // lrv-u-flex-grow-1 lrv-u-flex \
lrv-u-flex-direction-column lrv-u-justify-content-center lrv-u-border-b-1 u-border-b-0@mobile-max lrv-u-border-color-grey-light \
lrv-u-padding-l-050 lrv-u-padding-l-1@mobile-max'})

In [89]:
#html_rest

In [90]:
# I can now join the 2 lists

In [91]:
html_all = html_first + html_rest

In [92]:
#html_all
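
The two long class strings above differ only in a few utility classes, so matching on the full string is brittle if the page changes. As a hedged alternative, the sketch below selects items by the single shared class `o-chart-results-list__item` and keeps only those that contain a title heading. The HTML snippet is a simplified, hypothetical stand-in for the Billboard markup, not a copy of the real page, and the names `sample_html`, `sample_soup`, `entries`, and `pairs` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: the first entry carries different
# utility classes from the rest, as described in the comments above.
sample_html = """
<ul>
  <li class="o-chart-results-list__item lrv-u-padding-l-1@mobile-max">
    <h3 id="title-of-a-story">First Class</h3>
    <span class="c-label u-font-size-20@tablet">Jack Harlow</span>
  </li>
  <li class="o-chart-results-list__item lrv-u-padding-l-050">
    <h3 id="title-of-a-story">As It Was</h3>
    <span class="c-label">Harry Styles</span>
  </li>
  <li class="o-chart-results-list__item">
    <span class="c-label">2</span>
  </li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A CSS class selector matches any element that has this class,
# regardless of which other classes appear next to it.
items = sample_soup.select("li.o-chart-results-list__item")

# Keep only the list items that actually hold a song title.
entries = [li for li in items if li.find("h3", id="title-of-a-story")]

pairs = [
    (li.find("h3").get_text(strip=True), li.find("span").get_text(strip=True))
    for li in entries
]
print(pairs)
```

On the real page the shared class may also match items that hold other data (rank, last week's position, and so on), which is why the sketch filters on the presence of the `h3` title.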

## Data Cleaning

In [93]:
# Creating a DataFrame: here we create two lists, loop through the content in html_all,
# and append the data we need to each list.

song = []
artist = []

for entry in html_all:
    song.append(entry.find("h3").get_text().replace("\n", "").replace('\t', ''))
    artist.append(entry.find("span").get_text().replace("\n", "").replace('\t', ''))
    
# Here we assemble the DataFrame
top100 = pd.DataFrame()
top100['song'] = song
top100['artist'] = artist
top100
    

| | song | artist |
|---|---|---|
| 0 | First Class | Jack Harlow |
| 1 | As It Was | Harry Styles |
| 2 | Heat Waves | Glass Animals |
| 3 | Big Energy | Latto |
| 4 | Enemy | Imagine Dragons X JID |
| ... | ... | ... |
| 95 | Over | Lucky Daye |
| 96 | Neck & Wrist | Pusha T Featuring JAY-Z & Pharrell Williams |
| 97 | Desesperados | Rauw Alejandro & Chencho Corleone |
| 98 | X Ultima Vez | Daddy Yankee & Bad Bunny |
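
Chaining `.replace()` calls removes newlines and tabs but leaves any leading, trailing, or doubled spaces in place. A minimal sketch of a whitespace-normalizing helper follows; `clean_text` is a hypothetical name introduced here, and the sample string only imitates the shape of the scraped text.

```python
def clean_text(raw: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any whitespace
    # (spaces, tabs, newlines) and drops empty pieces.
    return " ".join(raw.split())


# Text shaped like the contents of the <span> shown earlier.
raw_artist = "\n \t\n \tImagine Dragons X JID\n "
print(clean_text(raw_artist))  # Imagine Dragons X JID
```

With BeautifulSoup, `tag.get_text(strip=True)` trims the leading and trailing whitespace of each text fragment in a similar way, though it does not collapse whitespace inside a fragment.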


In [94]:
data_ = pd.DataFrame()
data_['song'] = song
data_['artist'] = artist
data_
    

| | song | artist |
|---|---|---|
| 0 | First Class | Jack Harlow |
| 1 | As It Was | Harry Styles |
| 2 | Heat Waves | Glass Animals |
| 3 | Big Energy | Latto |
| 4 | Enemy | Imagine Dragons X JID |
| ... | ... | ... |
| 95 | Over | Lucky Daye |
| 96 | Neck & Wrist | Pusha T Featuring JAY-Z & Pharrell Williams |
| 97 | Desesperados | Rauw Alejandro & Chencho Corleone |
| 98 | X Ultima Vez | Daddy Yankee & Bad Bunny |


In [95]:
# Another way to create a DataFrame

chart = []

for entry in html_all:
    col = {'song': entry.find('h3').get_text().replace('\n', "").replace('\t', ""),
           'artist': entry.find('span').get_text().replace('\n', "").replace('\t', "")}
    chart.append(col)

In [96]:
# We obtain a list of dictionaries, "chart"

In [97]:
top100_b = pd.DataFrame(chart)

In [98]:
top100_b

| | song | artist |
|---|---|---|
| 0 | First Class | Jack Harlow |
| 1 | As It Was | Harry Styles |
| 2 | Heat Waves | Glass Animals |
| 3 | Big Energy | Latto |
| 4 | Enemy | Imagine Dragons X JID |
| ... | ... | ... |
| 95 | Over | Lucky Daye |
| 96 | Neck & Wrist | Pusha T Featuring JAY-Z & Pharrell Williams |
| 97 | Desesperados | Rauw Alejandro & Chencho Corleone |
| 98 | X Ultima Vez | Daddy Yankee & Bad Bunny |


# Web Scraping Multiple Pages
### Getting more songs from Wikipedia

In [99]:
song = []
artist = []

In [100]:
urls = []
for i in range(1,7):
    urls.append(f"https://en.wikipedia.org/wiki/List_of_songs_in_Glee_(season_{i})")    

In [101]:
response = requests.get(urls[0])

In [102]:
soups = []
for i in urls:
    soups.append(BeautifulSoup((requests.get(i)).content, 'html.parser'))
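
The loop above requests every page back to back and stops at the first network error. As a rough sketch of a gentler approach, the helper below pauses a random interval between requests and records failures instead of raising. `fetch_all`, `fake_fetch`, and the `max_delay` parameter are hypothetical names introduced here; the demonstration uses a fake fetcher so it runs without network access.

```python
import random
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    max_delay: float = 4.0,
) -> Dict[str, Optional[bytes]]:
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that returns the page body or raises on failure.
    Failed URLs are recorded as None so one bad page does not stop the run.
    """
    pages: Dict[str, Optional[bytes]] = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f"skipping {url}: {exc}")
            pages[url] = None
        # Same idea as sleep(random.random() * 4): wait 0..max_delay seconds.
        time.sleep(random.uniform(0, max_delay))
    return pages


# Offline demonstration with a fake fetcher (no network access needed).
def fake_fetch(url: str) -> bytes:
    if url.endswith("season_2)"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>".encode()


demo_urls = [
    f"https://en.wikipedia.org/wiki/List_of_songs_in_Glee_(season_{i})"
    for i in range(1, 4)
]
demo_pages = fetch_all(demo_urls, fake_fetch, max_delay=0)
print([body is not None for body in demo_pages.values()])  # [True, False, True]
```

With `requests`, a real fetcher could be a small function that calls `requests.get(url, timeout=10)`, then `raise_for_status()` on the response, and returns its `.content`; the 10-second timeout is an arbitrary choice.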

In [103]:
len(soups[0].select('.wikitable > tbody > tr > th > a'))

132

In [104]:
song = []
artist = []

# Song selector from header row
#mw-content-text > div.mw-parser-output > table > thead > tr > th:nth-child(1)

# Artist selector from first song row
#mw-content-text > div.mw-parser-output > table > tbody > tr:nth-child(1) > td:nth-child(3)

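for i in soups:
    for tag in i.select('.wikitable > tbody > tr > th:nth-child(1)'):
        # Header cells carry scope="col"; only scope="row" cells hold song titles.
        # .get() avoids a KeyError when a cell has no scope attribute.
        if tag.get('scope') == 'row':
            song.append(tag.get_text().rstrip().strip('"'))
    for tag in i.select('.wikitable > tbody > tr > td:nth-child(3)'):
        # Strip the surrounding whitespace that table cells usually carry.
        artist.append(tag.get_text().strip())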
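
Collecting songs and artists in two separate passes only works when both selectors match the same number of cells; otherwise the two lists fall out of step. One possible way to keep each song paired with its artist is to walk the table row by row, as sketched below. The table is a simplified, hypothetical example in the shape the selectors above expect (the "Version covered" values are placeholders), and the assumption that the artist sits in the third cell of each row comes from the `nth-child(3)` selector used above.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical table: a header row, then rows with a
# <th scope="row"> title followed by <td> cells.
sample_table = """
<table class="wikitable">
  <tbody>
    <tr><th scope="col">Title</th><th scope="col">Version covered</th><th scope="col">Performed by</th></tr>
    <tr><th scope="row">"Respect"</th><td>(original artist)</td><td>Mercedes Jones</td></tr>
    <tr><th scope="row">"On My Own"</th><td>(original artist)</td><td>Rachel Berry</td></tr>
  </tbody>
</table>
"""

table_soup = BeautifulSoup(sample_table, "html.parser")

rows = []
for tr in table_soup.select(".wikitable > tbody > tr"):
    # Direct children only, so nested markup cannot shift the columns.
    cells = tr.find_all(["th", "td"], recursive=False)
    title_cell = cells[0] if cells else None
    # Skip header rows and rows too short to hold an artist column.
    if title_cell is None or title_cell.get("scope") != "row" or len(cells) < 3:
        continue
    rows.append(
        {
            "song": title_cell.get_text(strip=True).strip('"'),
            "artist": cells[2].get_text(strip=True),
        }
    )

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`. Tables that use `rowspan` may still need extra handling, which this sketch does not attempt.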


In [105]:
# Here we assemble the DataFrame
songs_glee_7s = pd.DataFrame()
songs_glee_7s['song'] = song
songs_glee_7s['artist'] = artist
songs_glee_7s

| | song | artist |
|---|---|---|
| 0 | Where Is Love? | Hank Saunders and Sandy Ryerson |
| 1 | Respect | Mercedes Jones |
| 2 | Mister Cellophane | Kurt Hummel |
| 3 | I Kissed a Girl | Tina Cohen-Chang |
| 4 | On My Own | Rachel Berry |
| ... | ... | ... |
| 738 | Someday We'll Be Together | Mercedes Jones with gospel choir |
| 739 | The Winner Takes It All | Sue Sylvester and Will Schuester |
| 740 | Daydream Believer | Kurt Hummel and Blaine Anderson with schoolchi... |
| 741 | This Time | Rachel Berry |


In [108]:
data = pd.concat([data_, songs_glee_7s]) 
data

| | song | artist |
|---|---|---|
| 0 | First Class | Jack Harlow |
| 1 | As It Was | Harry Styles |
| 2 | Heat Waves | Glass Animals |
| 3 | Big Energy | Latto |
| 4 | Enemy | Imagine Dragons X JID |
| ... | ... | ... |
| 738 | Someday We'll Be Together | Mercedes Jones with gospel choir |
| 739 | The Winner Takes It All | Sue Sylvester and Will Schuester |
| 740 | Daydream Believer | Kurt Hummel and Blaine Anderson with schoolchi... |
| 741 | This Time | Rachel Berry |
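
By default `pd.concat` keeps each frame's original index, so labels such as 0 appear once per source frame in the combined result. The sketch below shows one way to get a fresh index and to remember where each row came from. The two small frames are stand-ins for `data_` and `songs_glee_7s`, and the `source` column is a hypothetical addition, not something the notebook creates.

```python
import pandas as pd

# Tiny stand-ins for data_ and songs_glee_7s.
hot = pd.DataFrame({"song": ["First Class", "As It Was"],
                    "artist": ["Jack Harlow", "Harry Styles"]})
glee = pd.DataFrame({"song": ["Respect", "On My Own"],
                     "artist": ["Mercedes Jones", "Rachel Berry"]})

# Tag each row with its origin, then stack with a fresh 0..n-1 index.
combined = pd.concat(
    [hot.assign(source="billboard"), glee.assign(source="glee")],
    ignore_index=True,
)
print(combined.index.tolist())  # [0, 1, 2, 3]
```

A unique index makes label-based lookups such as `combined.loc[0]` return a single row instead of one row per source.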


In [110]:
# Top 40 songs. Can be searched by week (yyyy-mm-dd)
# Let's get two years of weekly data
urls = []
dates = pd.date_range('2020-04-23', '2022-04-23', freq='W')
dates = [date.strftime('%Y-%m-%d') for date in dates]

for i in dates:
    urls.append(f'https://www.billboard.com/charts/pop-songs/{i}')
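
In pandas, `freq='W'` is an alias for weeks anchored on Sunday, so the range above yields Sundays rather than dates counted from 2020-04-23 itself. The standard-library sketch below reproduces that behaviour to make the generated dates explicit; `weekly_dates` and its `weekday` parameter are hypothetical names introduced here.

```python
from datetime import date, timedelta


def weekly_dates(start: date, end: date, weekday: int = 6) -> list:
    """Return ISO date strings for every `weekday` between start and end, inclusive.

    weekday follows date.weekday(): Monday is 0 and Sunday is 6.
    """
    # Move forward to the first matching weekday on or after `start`.
    current = start + timedelta(days=(weekday - start.weekday()) % 7)
    out = []
    while current <= end:
        out.append(current.isoformat())
        current += timedelta(weeks=1)
    return out


sundays = weekly_dates(date(2020, 4, 23), date(2022, 4, 23))
print(len(sundays), sundays[0], sundays[-1])  # 104 2020-04-26 2022-04-17
```

Passing `weekday=5` would produce Saturdays instead, should the chart URLs turn out to need them; whether they do is not something this notebook establishes.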

In [112]:
soups = []
for url in tqdm(urls):
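    # Name the parser explicitly, as in the earlier cells, to avoid bs4's "no parser specified" warning
    soups.append(BeautifulSoup(requests.get(url).content, 'html.parser'))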
    sleep(random.random()*4)

  0%|          | 0/104 [00:00<?, ?it/s]