## Web Scraping Lab 1:


### Prepare your project:

#### Business goal:

Make sure you've understood the big picture of your project: the goal of the company (Gnod), their current product (Gnoosic), their strategy, and how your project fits into this context. Re-read the business case and the e-mail from the CTO, take a look at the flowchart, and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Scraping popular songs:

Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song will have to be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular at the moment.
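The CTO's rule can be sketched before any scraping is done. The snippet below is a minimal illustration, not the final product logic: the function name `recommend_hot_song`, the `seed` parameter, and the sample chart are all hypothetical, and it assumes the chart is a DataFrame with `Song` and `Artist` columns.

```python
import pandas as pd


def recommend_hot_song(user_song, hot_songs, seed=None):
    """Return another chart entry if user_song is on the chart, else None."""
    # Compare titles case-insensitively and ignore surrounding whitespace.
    titles = hot_songs["Song"].str.strip().str.lower()
    is_match = titles == user_song.strip().lower()
    if not is_match.any():
        return None  # not a hot song: fall back to similarity-based logic
    # Pick one random row among the remaining chart entries.
    candidates = hot_songs[~is_match]
    if candidates.empty:
        return None
    return candidates.sample(n=1, random_state=seed).iloc[0].to_dict()


# Hypothetical sample data standing in for the scraped chart.
sample_chart = pd.DataFrame({
    "Song": ["Song A", "Song B", "Song C"],
    "Artist": ["Artist 1", "Artist 2", "Artist 3"],
})

print(recommend_hot_song("song a", sample_chart, seed=0))
print(recommend_hot_song("Unknown Song", sample_chart))
```

Returning `None` for songs that are not on the chart leaves room for the similarity-based recommendation to take over.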

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [2]:
url = "https://www.billboard.com/charts/hot-100"

In [3]:
response = requests.get(url)

In [4]:
soup = BeautifulSoup(response.content, "html.parser")
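The request above does not check whether the download actually succeeded. One possible way to make that step fail loudly is sketched below; the helper name `fetch_html`, the `User-Agent` value, and the 10-second timeout are assumptions chosen for illustration, not requirements of the site.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    # Some sites reject the default requests User-Agent; this value is an assumption.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; scraping-lab)"}
    response = session.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

The result can be parsed the same way as before, for example `BeautifulSoup(fetch_html(url), "html.parser")`. Accepting an optional `session` also makes the helper easy to exercise without network access.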

In [10]:
# Call prettify() and preview only the start; printing the whole page exceeds the notebook's output rate limit.
print(soup.prettify()[:1000])

In [5]:
song_names_soup = soup.select("ol > li span.chart-element__information__song.text--truncate.color--primary")

In [6]:
artist_names_soup = soup.select("ol > li span.chart-element__information__artist.text--truncate.color--secondary")

In [7]:
song_names = [tag.get_text() for tag in song_names_soup]

In [8]:
artist_names = [tag.get_text() for tag in artist_names_soup]
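Class names on a live site can change without notice, in which case the selectors above silently return empty lists. The sketch below shows one way to guard against that, run against a hypothetical HTML fragment that only imitates the chart markup; the function name `extract_chart` is likewise made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the chart markup targeted above.
sample_html = """
<ol>
  <li>
    <span class="chart-element__information__song">Song A</span>
    <span class="chart-element__information__artist">Artist 1 Featuring Artist 2</span>
  </li>
  <li>
    <span class="chart-element__information__song">Song B</span>
    <span class="chart-element__information__artist">Artist 3</span>
  </li>
</ol>
"""


def extract_chart(html):
    """Return (songs, artists), failing loudly if the selectors stop matching."""
    chart_soup = BeautifulSoup(html, "html.parser")
    song_tags = chart_soup.select("ol > li span.chart-element__information__song")
    artist_tags = chart_soup.select("ol > li span.chart-element__information__artist")
    songs = [tag.get_text(strip=True) for tag in song_tags]
    artists = [tag.get_text(strip=True) for tag in artist_tags]
    # Empty or misaligned lists usually mean the page layout has changed.
    if not songs or len(songs) != len(artists):
        raise ValueError(f"Unexpected chart data: {len(songs)} songs, {len(artists)} artists")
    return songs, artists


songs, artists = extract_chart(sample_html)
print(list(zip(songs, artists)))
```

Checking the list lengths before building the DataFrame catches a layout change early, rather than as a confusing pandas error later.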

In [9]:
hot_top100_dict = {"Song":song_names,
                   "Artist":artist_names}

In [10]:
hot_top100 = pd.DataFrame(hot_top100_dict)

In [11]:
hot_top100["Main Artist"] = hot_top100["Artist"].str.split(" Featuring ")
hot_top100["Main Artist"] = hot_top100["Main Artist"].apply(lambda x: str(x[0]))

In [12]:
hot_top100["Featuring Artist"] = hot_top100["Artist"].str.split(" Featuring ")
hot_top100["Featuring Artist"] = hot_top100["Featuring Artist"].apply(lambda x: str(x[1]) if len(x) > 1 else None)
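The two cells above split only on the exact string `" Featuring "`. As an alternative sketch, the `re` module imported earlier can handle a few spelling variants in one pass. The helper name `split_credit` and the list of variants (`featuring`, `feat.`, `ft.`) are assumptions; inspect the scraped credits to see which forms actually occur.

```python
import re

# Separator variants assumed for illustration; matched case-insensitively.
FEATURING = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", flags=re.IGNORECASE)


def split_credit(credit):
    """Split 'A Featuring B' into ('A', 'B'); return (credit, None) without a feature."""
    # maxsplit=1 keeps everything after the first separator together.
    parts = FEATURING.split(credit.strip(), maxsplit=1)
    main_artist = parts[0]
    featured = parts[1] if len(parts) > 1 else None
    return main_artist, featured


print(split_credit("Artist 1 Featuring Artist 2"))
print(split_credit("Artist 3"))
```

Applied with `hot_top100["Artist"].apply(split_credit)`, it yields one `(main, featured)` tuple per row.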

In [13]:
hot_top100.to_csv("hot_top100.csv")

### Bonus

Can you find other websites with lists of "hot" songs? What about songs that were popular in a certain decade?

You can scrape more lists and add extra features to the project.

In [14]:
url_2010 = "https://www.billboard.com/charts/decade-end/hot-100"

In [15]:
response_2010 = requests.get(url_2010)

In [16]:
soup_2010 = BeautifulSoup(response_2010.content, "html.parser")

In [17]:
song_names_soup_2010 = soup_2010.select("div.ye-chart-item__title")

In [18]:
song_names_2010 = [tag.get_text().replace("\n", "") for tag in song_names_soup_2010]

In [19]:
artist_names_soup_2010 = soup_2010.select("div.ye-chart-item__artist")

In [20]:
artist_names_2010 = [tag.get_text().replace("\n", "") for tag in artist_names_soup_2010]

In [21]:
hot_top100_2010_dict = {"Song":song_names_2010,
                        "Artist":artist_names_2010}

In [22]:
hot_top100_2010 = pd.DataFrame(hot_top100_2010_dict)

In [23]:
hot_top100_2010["Main Artist"] = hot_top100_2010["Artist"].str.split(" Featuring ")
hot_top100_2010["Main Artist"] = hot_top100_2010["Main Artist"].apply(lambda x: str(x[0]))

In [24]:
hot_top100_2010["Featuring Artist"] = hot_top100_2010["Artist"].str.split(" Featuring ")
hot_top100_2010["Featuring Artist"] = hot_top100_2010["Featuring Artist"].apply(lambda x: str(x[1]) if len(x) > 1 else None)

In [25]:
hot_top100_2010.to_csv("hot_top100_2010.csv")
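Once several lists have been scraped, it may be convenient to keep them in a single table that records where each song came from. The following is a minimal sketch under that assumption; the function name `combine_charts`, the `Chart` column, and the sample rows are hypothetical.

```python
import pandas as pd


def combine_charts(charts):
    """Stack several chart DataFrames, tagging each row with its source chart."""
    # assign() returns a copy, so the original DataFrames are left unchanged.
    labelled = [df.assign(Chart=name) for name, df in charts.items()]
    return pd.concat(labelled, ignore_index=True)


# Hypothetical sample rows standing in for the two scraped charts.
weekly = pd.DataFrame({"Song": ["Song A", "Song B"], "Artist": ["Artist 1", "Artist 2"]})
decade = pd.DataFrame({"Song": ["Song C"], "Artist": ["Artist 3"]})

all_hot_songs = combine_charts({"weekly": weekly, "decade-end": decade})
print(all_hot_songs)
```

With the real data, the same call could take `hot_top100` and `hot_top100_2010`, and the `Chart` column then becomes an extra feature for the recommender.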