# Case: GNOD
You have been hired as a Data Analyst for Gnod.

Gnod is a site that provides recommendations for music, art, literature and products based on collaborative filtering algorithms. Their flagship product is the music recommender, which you can try at www.gnoosic.com. The site asks users to enter 3 bands they like and computes similarity scores against the other users. It then recommends bands that users with similar tastes have picked.
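To make the idea concrete, here is a minimal sketch of user-based collaborative filtering. It is not Gnod's actual algorithm, which is not described here: the Jaccard similarity measure, the function names `jaccard` and `recommend`, and the sample users are all assumptions made for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Share of bands two users have in common (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked, other_users, top_n=3):
    """Rank bands the user has not picked, weighted by how similar each other user is."""
    liked = set(liked)
    scores = Counter()
    for picks in other_users.values():
        similarity = jaccard(liked, picks)
        if similarity == 0:
            continue  # users with no overlap contribute nothing
        for band in set(picks) - liked:
            scores[band] += similarity
    return [band for band, _ in scores.most_common(top_n)]


# Hypothetical data: three existing users, each with 3 picks
other_users = {
    "user_1": {"Radiohead", "Muse", "Coldplay"},
    "user_2": {"Portishead", "Massive Attack", "Tricky"},
    "user_3": {"Metallica", "Slayer", "Megadeth"},
}
recommendations = recommend({"Radiohead", "Muse", "Portishead"}, other_users)
print(recommendations)
```

In this toy example, `user_1` shares two bands with the new user and so carries the most weight, while `user_3` shares none and is ignored.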

Gnod is a small company, and its only revenue stream so far is ads on the site. In the future, they would like to explore partnership options with music apps (such as Deezer, Soundcloud or even Apple Music and Spotify). But for that to be possible, they need to expand and improve their recommendations.

That’s precisely where you come in. They have hired you as a Data Analyst, and they expect you to bring a mix of technical expertise and business mindset to the table.

Jane, CTO of Gnod, has sent you an email assigning you your first task.

### Lab | Web Scraping Single Page

In [None]:
# get the Top 100
# https://www.popvortex.com/music/charts/top-100-songs.php
# https://playback.fm/charts

##### Lab 1
Start with Step 1:
Goal: List of 100 songs

- Scrape a list of **top 100 songs**
- Elements: Artist and title of the song

##### Lab 2
- In a different notebook, get more songs
- Focus on practicing web scraping (try at least 2)

##### Top 100 2023

In [1]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# 2. find url and store it in a variable
url = "https://www.popvortex.com/music/charts/top-100-songs.php"

In [3]:
# 3. download html with a get request
response = requests.get(url)  # fetch the page; the Response object holds the HTML
response.status_code  # 200 means the request succeeded

200
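Checking the status code by eye works in a notebook, but the same check can be wrapped in a small helper that fails loudly on HTTP errors. The sketch below is one possible way to do it: the function name `fetch_html`, the optional `session` parameter, and the 10-second timeout are assumptions, not part of the lab.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP error codes."""
    # Use the given session if there is one, otherwise the requests module itself
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content
```

With this helper, step 3 could become `html = fetch_html(url)`, followed by `BeautifulSoup(html, "html.parser")`.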

In [4]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")

In [7]:
# 4.2. check that the html code looks like it should
#soup.prettify()

In [8]:
# 5. retrieve/extract the desired info 
    # 1. Name of the artist
    # 2. Title of the song
    
soup.select("#chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p")

[<p class="title-artist"><cite class="title">Separate Ways (Worlds Apart) [feat. Lzzy Hale]</cite><em class="artist">Daughtry</em></p>]

In [9]:
soup.select("p.title-artist")

[<p class="title-artist"><cite class="title">Separate Ways (Worlds Apart) [feat. Lzzy Hale]</cite><em class="artist">Daughtry</em></p>,
 <p class="title-artist"><cite class="title">Unholy</cite><em class="artist">Sam Smith &amp; Kim Petras</em></p>,
 <p class="title-artist"><cite class="title">Heart Like A Truck</cite><em class="artist">Lainey Wilson</em></p>,
 <p class="title-artist"><cite class="title">Anti-Hero</cite><em class="artist">Taylor Swift</em></p>,
 <p class="title-artist"><cite class="title">Son Of A Sinner</cite><em class="artist">Jelly Roll</em></p>,
 <p class="title-artist"><cite class="title">Made You Look</cite><em class="artist">Meghan Trainor</em></p>,
 <p class="title-artist"><cite class="title">Lift Me Up (From Black Panther: Wakanda Forever - Music From and Inspired By)</cite><em class="artist">Rihanna</em></p>,
 <p class="title-artist"><cite class="title">Temperature</cite><em class="artist">Sean Paul</em></p>,
 <p class="title-artist"><cite class="title">wait 

In [15]:
# just get the title
soup.select("p.title-artist cite")[0].get_text()

'Separate Ways (Worlds Apart) [feat. Lzzy Hale]'

In [16]:
# just get the artist
soup.select("p.title-artist em")[0].get_text()

'Daughtry'

In [19]:
# 6. Create lists to put everything in a table
artist = []
title = []

# 6.1. Define the number of iterations for the loop
num_iter = len(soup.select("p.title-artist"))

nameartist = soup.select("p.title-artist em")
Ctitle = soup.select("p.title-artist cite")

for i in range(num_iter):
    artist.append(nameartist[i].get_text())
    title.append(Ctitle[i].get_text())

print(artist)
print(title)

['Daughtry', 'Sam Smith & Kim Petras', 'Lainey Wilson', 'Taylor Swift', 'Jelly Roll', 'Meghan Trainor', 'Rihanna', 'Sean Paul', 'HARDY & Lainey Wilson', 'Morgan Wallen', 'Lady Gaga', 'Kane Brown & Katelyn Brown', 'Sia', 'OneRepublic', 'David Guetta & Bebe Rexha', 'Zach Bryan', 'Tom MacDonald', 'Luke Grimes', 'Harry Styles', 'Bailey Zimmerman', 'Morgan Wallen', 'Coi Leray', 'Bazzi', 'Morgan Wallen', 'Old Dominion', 'Morgan Wallen', 'Lady Gaga', 'Shania Twain', 'Rema & Selena Gomez', 'Mike Posner', 'Elton John & Britney Spears', 'David Guetta & Bebe Rexha', 'Metro Boomin, The Weeknd & 21 Savage', 'Beyoncé', 'Luke Combs', 'Bailey Zimmerman', 'Zach Bryan', 'Brandon Lake', 'Bad Omens', 'Lizzo', 'Jordan Davis', 'JVKE', 'Fuerza Regida & Grupo Frontera', 'SZA', 'Luke Combs', 'Chris Brown', 'Lainey Wilson', 'P!nk', 'Carin Leon & Grupo Frontera', 'Lily Meola', 'Ed Sheeran', 'Tyler Hubbard', 'Nate Smith', 'Chris Stapleton', 'The Weeknd', 'Elton John & Dua Lipa', 'Three Dog Night', 'Morgan Wallen'
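The loop above pairs two separately selected lists by index, which assumes every chart entry has both a title and an artist. An alternative sketch is to walk each `p.title-artist` entry and read both fields from it, so a missing tag cannot shift the pairing. The function name `extract_songs` is hypothetical, and the sample HTML is copied from the output shown earlier so the block runs without a network call.

```python
from bs4 import BeautifulSoup

# Two entries copied from the page output above, used as offline sample data
sample_html = """
<p class="title-artist"><cite class="title">Unholy</cite><em class="artist">Sam Smith &amp; Kim Petras</em></p>
<p class="title-artist"><cite class="title">Anti-Hero</cite><em class="artist">Taylor Swift</em></p>
"""


def extract_songs(soup):
    """Return one {'artist', 'title'} dict per chart entry found in the soup."""
    songs = []
    for entry in soup.select("p.title-artist"):
        title_tag = entry.select_one("cite.title")
        artist_tag = entry.select_one("em.artist")
        if title_tag is None or artist_tag is None:
            continue  # skip incomplete entries instead of misaligning the columns
        songs.append({
            "artist": artist_tag.get_text(strip=True),
            "title": title_tag.get_text(strip=True),
        })
    return songs


songs = extract_songs(BeautifulSoup(sample_html, "html.parser"))
print(songs)
```

On the real page, `pd.DataFrame(extract_songs(soup))` would give a table with `artist` and `title` columns.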

In [20]:
# 7. Put everything in a dataframe

top100 = pd.DataFrame({"artist": artist,
                      "title": title
                      })

In [23]:
top100.tail(60)

| | artist | title |
|---|---|---|
| 40 | Jordan Davis | What My World Spins Around |
| 41 | JVKE | golden hour |
| 42 | Fuerza Regida & Grupo Frontera | Bebe Dame |
| 43 | SZA | Kill Bill |
| 44 | Luke Combs | Going, Going, Gone |
| 45 | Chris Brown | Under the Influence |
| 46 | Lainey Wilson | Watermelon Moonshine |
| 47 | P!nk | Sober |
| 48 | Carin Leon & Grupo Frontera | Que Vuelvas |
| 49 | Lily Meola | Daydream |


**Comment**: If a song has more than one artist, should the artist column be split?
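One way to explore that question is sketched below. It assumes that `&` and `,` always separate artists, which is not safe in general: a band whose own name contains `&` would be split incorrectly. The helper `split_artists` and the column names `artist_list` and `single_artist` are hypothetical; the sample rows are taken from the chart above.

```python
import re

import pandas as pd

# A few rows from the scraped chart, used as sample data
songs = pd.DataFrame({
    "artist": ["Sam Smith & Kim Petras", "Taylor Swift", "Fuerza Regida & Grupo Frontera"],
    "title": ["Unholy", "Anti-Hero", "Bebe Dame"],
})


def split_artists(name):
    """Split a credit string on ',' or '&' and drop empty pieces."""
    return [part for part in re.split(r"\s*(?:,|&)\s*", name) if part]


# Keep the original column and add a list of individual artists next to it
songs["artist_list"] = songs["artist"].apply(split_artists)

# One row per (individual artist, song) pair
long_format = songs.explode("artist_list").rename(columns={"artist_list": "single_artist"})
print(long_format[["single_artist", "title"]])
```

Keeping the original `artist` column alongside the split version makes it easy to undo the split for names where the assumption fails.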

##### Top 100 2021

In [24]:
# 1. find url and store it in a variable
url = "https://playback.fm/charts/top-100-songs/2021"

# 2. download html with a get request
response = requests.get(url)  # fetch the page; the Response object holds the HTML
response.status_code  # 200 means the request succeeded

200

In [25]:
# 3.1. parse html (create the 'soup')
soup2 = BeautifulSoup(response.content, "html.parser")

In [26]:
# 3.2. check that the html code looks like it should
#soup2.prettify()



In [135]:
# 4. retrieve/extract the desired info 
    # 1. Name of the artist
    # 2. Title of the song
    
soup2.select("#myTable")

[<table class="chartTbl" id="myTable">
 <thead>
 <tr class="tableHead">
 <th>Rank</th>
 <th><span class="mobile-only">Song</span><span class="mobile-hide">Artist</span></th>
 <th><span class="mobile-hide">Title</span></th>
 </tr>
 </thead>
 <tr itemprop="track" itemscope="" itemtype="https://schema.org/MusicRecording">
 <td>1</td>
 <td>
 <span class="mobile-only song">
 <a href="/charts/top-100-songs/video/2021/Dua-Lipa--DaBaby-Levitating" itemprop="name">
                        Levitating
                        </a>
 </span>
 <a class="artist" href="/artist/dua-lipa-top-songs" itemprop="byArtist">
                    Dua Lipa &amp; DaBaby
                    </a>
 <meta content="/artist/dua-lipa-top-songs" itemprop="url">
 </meta></td>
 <td class="mobile-hide">
 <a href="/charts/top-100-songs/video/2021/Dua-Lipa--DaBaby-Levitating">
 <span class="red-play">►</span>
 <span class="song" itemprop="name">Levitating</span>
 </a>
 </td>
 <td class="mobile-only play">
 <a href="/charts/top

In [159]:
soup2.select("table.chartTbl a")[0]["href"]

'/charts/top-100-songs/video/2021/Dua-Lipa--DaBaby-Levitating'

In [147]:
soup2.select("table.chartTbl a")[1].get_text()

'\n                   Dua Lipa & DaBaby\n                   '
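The raw text above still carries the newlines and indentation from the page source. A possible next step, sketched here, is to go through the table row by row and call `get_text(strip=True)` on the artist link and the title span. The selectors are based only on the HTML fragment shown earlier, and the sample table is a trimmed copy of that fragment, so the real page may need adjustments. The names `extract_chart_rows` and `top100_2021` are hypothetical.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed copy of the first table row shown above, used as offline sample data
sample_table = """
<table class="chartTbl" id="myTable">
<tr itemprop="track">
<td>1</td>
<td>
<span class="mobile-only song"><a href="#" itemprop="name">
    Levitating
</a></span>
<a class="artist" href="/artist/dua-lipa-top-songs" itemprop="byArtist">
    Dua Lipa &amp; DaBaby
</a>
</td>
<td class="mobile-hide"><a href="#"><span class="song" itemprop="name">Levitating</span></a></td>
</tr>
</table>
"""


def extract_chart_rows(soup):
    """Return one {'artist', 'title'} dict per track row in the chart table."""
    rows = []
    for row in soup.select('table.chartTbl tr[itemprop="track"]'):
        artist_tag = row.select_one("a.artist")
        title_tag = row.select_one("td.mobile-hide span.song")
        if artist_tag is None or title_tag is None:
            continue  # skip rows that do not match the expected layout
        rows.append({
            "artist": artist_tag.get_text(strip=True),  # strip=True drops the surrounding whitespace
            "title": title_tag.get_text(strip=True),
        })
    return rows


chart_rows = extract_chart_rows(BeautifulSoup(sample_table, "html.parser"))
top100_2021 = pd.DataFrame(chart_rows)
print(top100_2021)
```

Run against `soup2` instead of the sample, the same function would produce a dataframe shaped like `top100` from the 2023 chart.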