## Lab | Web Scraping Multiple Pages
Jorge Castro DAPT NOV 2021

### Instructions

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

In [60]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [61]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_Eurovision_Song_Contest_entries_(2004%E2%80%93present)')

In [62]:
r.status_code

200
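
Checking `r.status_code` by eye works in a notebook, but a small helper can make the same check fail loudly when a page cannot be fetched. The sketch below is an assumption and not part of the lab: `fetch_html`, its `timeout` default, and the `User-Agent` string are hypothetical names chosen for illustration.

```python
import requests


# Hypothetical helper: the name, timeout and User-Agent value are illustrative.
def fetch_html(url, timeout=10):
    """Return the raw HTML of `url`, raising an error on a 4xx or 5xx response."""
    headers = {"User-Agent": "song-scraper-lab/0.1 (educational project)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(...)` with the Eurovision URL used above would return the same bytes as `r.content`, which can then be passed to `BeautifulSoup`.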

In [63]:
html = r.content
soup = BeautifulSoup(html, 'html.parser')

# Get all the tables with Eurovision songs

html_table = soup.find_all('table', attrs={'class': 'wikitable plainrowheaders'})
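
A note on the class filter: passing the full string `'wikitable plainrowheaders'` matches tables whose `class` attribute is exactly that string. If some tables on a page carry an extra class, a CSS selector is one possible alternative. The sketch below uses made-up HTML (`sample_html` is an assumption, not the real Wikipedia markup) to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = """
<table class="wikitable plainrowheaders"><tr><td>A</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>B</td></tr></table>
<table class="infobox"><tr><td>C</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute, as in the cell above
exact_matches = sample_soup.find_all(
    "table", attrs={"class": "wikitable plainrowheaders"}
)

# CSS selector: any table that has both classes, whatever else it carries
css_matches = sample_soup.select("table.wikitable.plainrowheaders")

print(len(exact_matches), len(css_matches))
```

Which form is appropriate depends on the page: the exact match is stricter, the selector is more tolerant of extra classes.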

In [64]:
#html_table

In [65]:
# Create a DataFrame by looping through the scraped tables and parsing each one

eurosong = pd.DataFrame()

for i in range(len(html_table)):
    df = pd.read_html(html_table[i].prettify())[0]
    eurosong = pd.concat([eurosong, df])
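
Concatenating inside the loop keeps each table's own row labels, so the combined frame ends up with repeated index values. A minimal sketch of that effect, using two small stand-in frames (`first` and `second` are assumptions, not scraped data), and of `ignore_index=True` as one way to renumber the rows:

```python
import pandas as pd

# Two small stand-in frames; in the notebook these would come from pd.read_html
first = pd.DataFrame({"song": ["A", "B"], "artist": ["X", "Y"]})
second = pd.DataFrame({"song": ["C", "D"], "artist": ["Z", "W"]})

# Plain concat keeps each frame's own index, so labels repeat
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 in a single step
combined = pd.concat([first, second], ignore_index=True)
print(combined.index.tolist())
```

Collecting the frames in a list and calling `pd.concat` once also avoids copying the growing DataFrame on every iteration.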

In [66]:
#df

##### Cleaning the data

In [67]:
# Make all column names lowercase and replace spaces with underscores

eurosong.columns = [col_name.lower().replace(' ', '_') for col_name in eurosong.columns]

In [68]:
eurosong.columns

Index(['#', 'r/o_sf', 'r/o_f', 'country', '#.1', 'artist', 'song', 'language',
       'songwriter(s)', 'placing', 'year', 'reason', 'ref(s)'],
      dtype='object')
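
The same renaming can also be written with the vectorized `.str` accessor on the column index. This is a sketch on a hypothetical empty frame (`sample` and its column names are assumptions chosen to resemble the ones above):

```python
import pandas as pd

# Hypothetical frame with headers similar to the scraped tables
sample = pd.DataFrame(columns=["R/O SF", "Songwriter(s)", "Song"])

# Lowercase every header, then replace literal spaces with underscores
sample.columns = sample.columns.str.lower().str.replace(" ", "_", regex=False)
print(sample.columns.tolist())
```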

In [69]:
# Select only the song and artist columns, then reset the index.
# drop=True discards the old index instead of adding it as a new column

eurosong = eurosong[['song', 'artist']].reset_index(drop=True)

# Remove the double quotes from every song title, then apply strip()
# to drop leading and trailing whitespace
eurosong['song'] = eurosong['song'].apply(lambda title: title.replace('"','').strip())
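
The `lambda` above assumes every title is a string; a missing value would raise an `AttributeError`. A possible alternative, sketched on a made-up series (`titles` is an assumption, not the scraped column), is the `.str` accessor, which passes missing values through:

```python
import pandas as pd

# Made-up titles, including one missing value
titles = pd.Series(['"My Galileo"', ' "Celebrate" ', None])

# Remove literal double quotes, then trim surrounding whitespace;
# the missing entry stays missing instead of raising an error
cleaned = titles.str.replace('"', "", regex=False).str.strip()
print(cleaned.tolist())
```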

In [70]:
eurosong

| | song | artist |
|---|---|---|
| 0 | Takes 2 to Tango | Jari Sillanpää |
| 1 | My Galileo | Aleksandra and Konstantin |
| 2 | Celebrate | Piero and the MusicStars |
| 3 | Dziesma par laimi | Fomins and Kleins |
| 4 | Leha'amin ( להאמין ) | David D'Or |
| ... | ... | ... |
| 734 | Universo | Blas Cantó |
| 735 | Move | The Mamas |
| 736 | Répondez-moi | Gjon's Tears |
| 737 | Solovey ( Соловей ) | Go_A |
