# Web Scraping

---

### In this notebook, we will scrape two separate lists:

1. The current Top 100 songs.
2. The all-time Top 100 rock and pop songs, from the 1950s to the present.

#### Both lists are scraped with Beautiful Soup and combined into a single dataframe that serves as our top 200 hot songs.

---

## Import Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Scraping Top 100 songs in October 2023

For the first dataframe, we scrape the current Top 100 songs from PopVortex.

In [2]:
# URL of the chart page to scrape
url = "https://www.popvortex.com/music/charts/top-100-songs.php"

In [3]:
# Send a GET request and inspect the HTTP status code
response = requests.get(url)
response.status_code

200

The request returned status code 200, so the page was fetched successfully and we can parse it with Beautiful Soup.
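
Checking the status code by eye works in a notebook, but the same check can be made automatic. The sketch below is one possible way to do it: `fetch_html` is a hypothetical helper, and the 10-second timeout is an assumed value, not something the notebook specifies.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so a failed download stops here instead of being parsed as a chart.
    response.raise_for_status()
    return response.content
```

With this helper, `fetch_html(url)` would return the bytes that are later passed to `BeautifulSoup`, and a blocked or missing page would surface as an exception rather than as an empty dataframe.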

In [4]:
# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

**Extract the songs and artists from the webpage**

In [5]:
# Collect each field in its own list; the lists become the dataframe columns
artist = []
song = []

# Every chart entry is a div.feed-item inside the chart wrapper
songart = soup.select("body > div.container > div:nth-child(4) > div.col-xs-12.col-md-8 > div.chart-wrapper > div.feed-item")

# Loop over the matched elements themselves, not over the length of the selector string
for item in songart:
    artist.append(item.em.get_text().strip().lower())
    song.append(item.cite.get_text().strip().lower())

# Build the dataframe from the two lists
currenttop100 = pd.DataFrame({'artist': artist, 'track': song})
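
To make the extraction step easier to follow, here is a minimal sketch that runs the same idea on a hand-written HTML fragment. The markup is hypothetical: it only mirrors what the cell above relies on, namely that each `div.feed-item` holds the title in a `<cite>` tag and the artist in an `<em>` tag. The shortened selector is also an assumption made for the toy fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the cite/em nesting mirrors what the loop relies on.
sample_html = """
<div class="chart-wrapper">
  <div class="feed-item"><cite> Thriller </cite><em>Michael Jackson</em></div>
  <div class="feed-item"><cite>Ghostbusters</cite><em> Ray Parker Jr. </em></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() returns a list of matching elements, one per chart entry.
items = sample_soup.select("div.chart-wrapper > div.feed-item")

# item.em / item.cite return the first <em> / <cite> inside each entry.
rows = [
    {
        "artist": item.em.get_text().strip().lower(),
        "track": item.cite.get_text().strip().lower(),
    }
    for item in items
]
print(rows)
# [{'artist': 'michael jackson', 'track': 'thriller'},
#  {'artist': 'ray parker jr.', 'track': 'ghostbusters'}]
```

`strip()` removes the surrounding whitespace and `lower()` normalizes the case, so the same song is spelled identically in both charts.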

In [6]:
currenttop100

| | artist | track |
|---|---|---|
| 0 | dax | to be a man (feat. darius rucker) |
| 1 | ray parker jr. | ghostbusters |
| 2 | michael jackson | thriller |
| 3 | the citizens of halloween | this is halloween |
| 4 | paul russell | lil boo thang |
| ... | ... | ... |
| 95 | noah kahan & post malone | dial drunk |
| 96 | elevation worship | praise (feat. brandon lake, chris brown & chan... |
| 97 | taylor swift | suburban legends (taylor's version) [from the ... |
| 98 | oingo boingo | dead man's party |


## Scraping Billboard's all-time Top 100 songs

Next, we scrape Billboard's all-time Top 100 songs.

In [7]:
# URL of the Billboard chart page to scrape
urls = "https://www.billboard.com/charts/greatest-hot-100-singles/"

In [8]:
responses = requests.get(urls)
responses.status_code

200

In [9]:
# Parse the HTML content with Beautiful Soup
soups = BeautifulSoup(responses.content, "html.parser")

**Extract the songs and artists from the webpage**

In [10]:
songs = []
artists = []

# All entries share one container; only the final tag differs (span = artist, h3 = title)
base = "body > div:nth-child(6) > main > div:nth-child(3) > div:nth-child(2) > div > div > div > div > ul > li:nth-child(3) > ul > li"
artist_tags = soups.select(base + " > span")
title_tags = soups.select(base + " > h3")

# Run each selector once and pair the results instead of re-querying inside the loop
for artist_tag, title_tag in zip(artist_tags, title_tags):
    artists.append(artist_tag.get_text().strip().lower())
    songs.append(title_tag.get_text().strip().lower())

# Build the dataframe from the two lists
alltimetop100 = pd.DataFrame({'artist': artists, 'track': songs})
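
Positional selectors such as `div:nth-child(6)` can stop matching without any error when a page layout changes, so it may help to check a scraped frame before using it. The following is a minimal sketch, assuming each chart should yield 100 rows; `check_chart` is a hypothetical helper and not part of pandas.

```python
import pandas as pd


def check_chart(df, expected_rows=100):
    """Return a list of human-readable problems found in a scraped chart."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    for column in ("artist", "track"):
        # Empty strings usually mean a selector matched the wrong element.
        empty = int((df[column].str.strip() == "").sum())
        if empty:
            problems.append(f"{empty} empty value(s) in '{column}'")
    return problems
```

Calling `check_chart(alltimetop100)` would return an empty list when the frame looks complete, and a short description of each problem otherwise.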

In [11]:
alltimetop100

| | artist | track |
|---|---|---|
| 0 | the weeknd | blinding lights |
| 1 | chubby checker | the twist |
| 2 | santana featuring rob thomas | smooth |
| 3 | bobby darin | mack the knife |
| 4 | mark ronson featuring bruno mars | uptown funk! |
| ... | ... | ... |
| 95 | donna summer | hot stuff |
| 96 | post malone featuring 21 savage | rockstar |
| 97 | coolio featuring l.v. | gangsta's paradise |
| 98 | the steve miller band | abracadabra |


## Concatenate the dataframes and export to CSV

In [12]:
hot_tracks = pd.concat([currenttop100,alltimetop100], axis = 0)
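
Each source frame keeps its own 0-based index, so `pd.concat` with `axis=0` repeats the row labels in the combined frame. This does not affect the exported file, because the index is not written, but it matters if `hot_tracks` is later accessed by label. The sketch below uses toy frames with made-up values to show the effect, along with two optional variations: renumbering the rows and dropping songs that appear on both charts.

```python
import pandas as pd

# Toy frames with made-up values; they stand in for the two scraped charts.
first = pd.DataFrame({"artist": ["artist a", "artist b"], "track": ["song 1", "song 2"]})
second = pd.DataFrame({"artist": ["artist b", "artist c"], "track": ["song 2", "song 3"]})

# Plain concatenation keeps each frame's own labels, so 0 and 1 appear twice.
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the combined rows from 0.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# Optional: drop songs that appear on both charts, keeping the first occurrence.
unique_tracks = renumbered.drop_duplicates(subset=["artist", "track"]).reset_index(drop=True)
print(len(unique_tracks))  # 3
```

Whether duplicates should be removed depends on how the combined list is used afterwards; the notebook itself keeps every row.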

In [13]:
hot_tracks

| | artist | track |
|---|---|---|
| 0 | dax | to be a man (feat. darius rucker) |
| 1 | ray parker jr. | ghostbusters |
| 2 | michael jackson | thriller |
| 3 | the citizens of halloween | this is halloween |
| 4 | paul russell | lil boo thang |
| ... | ... | ... |
| 95 | donna summer | hot stuff |
| 96 | post malone featuring 21 savage | rockstar |
| 97 | coolio featuring l.v. | gangsta's paradise |
| 98 | the steve miller band | abracadabra |


**Export to CSV**

In [14]:
hot_tracks.to_csv('top.csv', index = False)
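
The `index = False` argument keeps the row labels out of the file, so the CSV contains only the `artist` and `track` columns. As a rough illustration with a toy frame (made-up values, written to an in-memory buffer instead of `top.csv`), the sketch below compares what comes back when the file is read with and without the index. `round_trip` is a hypothetical helper.

```python
import io

import pandas as pd

# Toy frame with made-up values standing in for hot_tracks.
toy = pd.DataFrame({"artist": ["artist a", "artist b"], "track": ["song 1", "song 2"]})


def round_trip(df, index):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


print(list(round_trip(toy, index=False).columns))  # ['artist', 'track']
print(list(round_trip(toy, index=True).columns))   # ['Unnamed: 0', 'artist', 'track']
```

Writing the index adds an unnamed first column, which pandas labels `Unnamed: 0` when the file is loaded again.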