# lab-web-scraping-multiple-pages

### Business goal:

#### - Check the case_study_gnod.md file.

#### - Make sure you've understood the big picture of your project:
#### the goal of the company (Gnod),
#### their current product (Gnoosic),
#### their strategy, and
#### how your project fits into this context.

#### Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

### Instructions

#### Prioritize the MVP

#### In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.
#### If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

#### If you're done, you can try to expand the project on your own. Here are a few suggestions:

#### - Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!

#### - Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.

#### - Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

In [1]:
# importing useful libraries 
#! pip install bs4 # (installing the library in case it is not installed)
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# Reading the data-results from the previous lab 'lab-web-scraping-single-page' into Python
top100=pd.read_csv('top100.csv')
top100

Unnamed: 0,songs,artists
0,About Damn Time,Lizzo
1,As It Was,Harry Styles
2,Running Up That Hill (A Deal With God),Kate Bush
3,First Class,Jack Harlow
4,Wait For U,Future Featuring Drake & Tems
...,...,...
95,Te Felicito,Shakira & Rauw Alejandro
96,Are You Entertained,Russ & Ed Sheeran
97,"Bzrp Music Sessions, Vol. 52",Bizarrap & Quevedo
98,Right On,Lil Baby
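
If `top100.csv` was written with pandas' default `index=True`, reading it back turns the old index into an extra `Unnamed: 0` column. The sketch below uses a small in-memory CSV as a stand-in for the real file (an assumption, since the file's contents are not shown here) to illustrate how `index_col=0` avoids that.

```python
import io
import pandas as pd

# Stand-in for top100.csv (assumption: the file was saved together with its index)
csv_text = ",songs,artists\n0,About Damn Time,Lizzo\n1,As It Was,Harry Styles\n"

# Default read: the unnamed first column becomes a regular column
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))
print(list(clean.columns))
```

With the real file this would read `pd.read_csv('top100.csv', index_col=0)`.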


In [3]:
# Let's get the URL of a website with charts and store it in a variable
url1="https://www.offiziellecharts.de/charts/single/for-date-1651096800000"

In [4]:
# Request the page with requests.get() and check the HTTP status code
request_charts1 = requests.get(url1)
request_charts1.status_code
# A status code of 200 means the request succeeded

200
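
Reading the status code by eye works in a notebook, but a reusable scraper can let `requests` raise on failure instead. A minimal sketch, assuming a hypothetical helper `fetch_html` (not part of the lab); its `http` parameter is also an assumption, added so that a `requests.Session` or a test double can be passed in.

```python
import requests

def fetch_html(url, timeout=10, http=requests):
    """Return the raw HTML bytes for url, raising on HTTP errors."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

It could then replace the manual check, for example `BeautifulSoup(fetch_html(url1), 'html.parser')`.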

In [5]:
# Peek at the raw response body through the .content attribute
request_charts1.content[:100]
# .content is one long bytes object, so slicing the first 100 bytes is enough to
# confirm that we received the HTML source of the page

b'<!doctype html>\r\n<html prefix="og: http://ogp.me/ns#" class="no-js" xmlns="http://www.w3.org/1999/xh'

In [6]:
# Parse the response body with the built-in 'html.parser' to get a BeautifulSoup object
soup1 = BeautifulSoup(request_charts1.content, 'html.parser')
# Printing the whole soup is hard to read, so print the first 3000 characters of the
# prettified (indented) version instead; the indentation is not always perfect
print(soup1.prettify()[:3000])
# The output looks like well-formed HTML, and it is now stored in a BeautifulSoup object

<!DOCTYPE html>
<html class="no-js" lang="de-de" prefix="og: http://ogp.me/ns#" xml:lang="de-de" xmlns="http://www.w3.org/1999/xhtml">
 <meta content="Hier gibt’s die Offiziellen Deutschen Charts in ihrer ganzen Vielfalt. Denn: Hier zählt die Musik." name="description"/>
 <head>
  <script type="text/javascript">
   (function(){   function blockCookies(disableCookies, disableLocal, disableSession){   if(disableCookies == 1){   if(!document.__defineGetter__){   Object.defineProperty(document, 'cookie',{   get: function(){ return ''; },   set: function(){ return true;}   });   }else{   var oldSetter = document.__lookupSetter__('cookie');   if(oldSetter) {   Object.defineProperty(document, 'cookie', {   get: function(){ return ''; },   set: function(v){   if(v.match(/reDimCookieHint\=/)) {   oldSetter.call(document, v);   }   return true;   }   });   }   }   var cookies = document.cookie.split(';');   for (var i = 0; i < cookies.length; i++) {   var cookie = cookies[i];   var pos = cookie.

In [7]:
# Find the artists: each one sits in a <span> with the class 'info-artist'
artist1 = []
for span in soup1.find_all('span', attrs={'class':'info-artist'}):
    artist1.append(span.get_text())

In [8]:
# Find the song titles: each one sits in a <span> with the class 'info-title'
title1 = []
for span in soup1.find_all('span', attrs={'class':'info-title'}):
    title1.append(span.get_text())

In [9]:
# creating a dataframe with artists and songs
charts1 = pd.DataFrame({"artists":artist1,
                        "songs":title1})

charts1

Unnamed: 0,artists,songs
0,Harry Styles,As It Was
1,Jack Harlow,First Class
2,Rammstein,Zick zack
3,Miksu / MacLoud & T-Low,Sehnsucht
4,Glass Animals,Heat Waves
...,...,...
95,Art [DE],Belgisches Viertel
96,Pashanim,Paris Freestyle
97,Gabry Ponte x Lum!x x Prezioso,Thunder
98,atb x Topic x A7S,Your Love (9PM)


In [10]:
# Let's get a second URL of a website with charts and store it in a variable
url2="https://www.officialcharts.com/charts/singles-chart/"

In [11]:
# Request the second page with requests.get() and check the HTTP status code
request_charts2 = requests.get(url2)
request_charts2.status_code
# A status code of 200 means the request succeeded

200

In [12]:
# Peek at the raw response body through the .content attribute
request_charts2.content[:100]
# .content is one long bytes object, so slicing the first 100 bytes is enough to
# confirm that we received the HTML source of the page

b'\r\n\r\n<!doctype html>\r\n<!--[if lt IE 7]><html class="no-js ie6 oldie" lang="en"><![endif]-->\r\n<!--[if '

In [22]:
# Parse the response body with the built-in 'html.parser' to get a BeautifulSoup object
soup2 = BeautifulSoup(request_charts2.content, 'html.parser')
# Printing the whole soup is hard to read, so print the first 3000 characters of the
# prettified (indented) version instead; the indentation is not always perfect
print(soup2.prettify()[:3000])
# The output looks like well-formed HTML, and it is now stored in a BeautifulSoup object

<!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js ie6 oldie" lang="en"><![endif]-->
<!--[if IE 7]><html class="no-js ie7 oldie" lang="en"><![endif]-->
<!--[if IE 8]><html class="no-js ie8 oldie" lang="en"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <title>
   Official Singles Chart Top 100 | Official Charts Company
  </title>
  <meta content="The Official UK Top 40 chart is compiled by the Official Charts Company, based on official sales of sales of downloads, CD, vinyl, audio streams and video streams. The Top 40 is broadcast on BBC Radio 1 and MTV, the full Top 100 is published exclusively on OfficialCharts.com." name="description"/>
  <meta content="Top 40, UK Top 40, Charts, Top 40 UK, UK Charts, UK singles chart, Music Charts, Official UK Top 40, Charts 2012, Hit 40 UK, UK Chart, Official Singles Chart, Official Albums Chart, Number 1, Nu

In [14]:
# Find the artists: on this site each one sits in a <div> with the class 'artist'
artist2 = []
for div in soup2.find_all('div', attrs={'class':'artist'}):
    artist2.append(div.get_text().strip())

In [15]:
artist2

['LF SYSTEM',
 'CENTRAL CEE',
 'HARRY STYLES',
 'GEORGE EZRA',
 'BURNA BOY',
 'BEYONCE',
 'DAVID GUETTA/HILL/HENDERSON',
 'LIZZO',
 'KATE BUSH',
 'TION WAYNE & LA ROUX',
 'NATHAN DAWE FT ELLA HENDERSON',
 'ONEREPUBLIC',
 'HARRY STYLES',
 'SIGALA & TALIA MAR',
 'DRAKE',
 'JAX JONES FT MNEK',
 'SAM FENDER',
 'STEVE LACY',
 'JOJI',
 'BRU-C',
 'FUTURE FT DRAKE & TEMS',
 'HARRIS/TIMBERLAKE/HALSEY',
 'BILLIE EILISH',
 'HARRY STYLES',
 'DOJA CAT',
 'ROSA LINN',
 'DRAKE FT 21 SAVAGE',
 'CAT BURNS',
 'POST MALONE FT DOJA CAT',
 'TIESTO & CHARLI XCX',
 'METALLICA',
 'BURNA BOY FT ED SHEERAN',
 'BILLIE EILISH',
 'CALVIN HARRIS/DUA LIPA/YOUNG',
 'LATTO/MARIAH CAREY/DJ KHALED',
 'JAMES HYPE/MIGGY DELA ROSA',
 'LUUDE & MATTAFIX',
 'D-BLOCK EUROPE/GHOST KILLER',
 'GLASS ANIMALS',
 'DIPLO & MIGUEL',
 'ED SHEERAN',
 'TOM GRENNAN',
 'BECKY HILL & DAVID GUETTA',
 'CIAN DUCROT',
 'LADY GAGA',
 'BENSON BOONE',
 'RUSS FT ED SHEERAN',
 'NICKY YOURE & DAZY',
 'D-BLOCK EUROPE',
 'ED SHEERAN',
 'FIREBOY DML & E

In [16]:
# Find the song titles: on this site each one sits in a <div> with the class 'title'
title2 = []
for div in soup2.find_all('div', attrs={'class':'title'}):
    title2.append(div.get_text().strip())  

In [17]:
title2

['AFRAID TO FEEL',
 'DOJA',
 'AS IT WAS',
 'GREEN GREEN GRASS',
 'LAST LAST',
 'BREAK MY SOUL',
 'CRAZY WHAT LOVE CAN DO',
 'ABOUT DAMN TIME',
 'RUNNING UP THAT HILL',
 'IFTK',
 '21 REASONS',
 "I AIN'T WORRIED",
 'LATE NIGHT TALKING',
 'STAY THE NIGHT',
 'MASSIVE',
 'WHERE DID YOU GO',
 'SEVENTEEN GOING UNDER',
 'BAD HABIT',
 'GLIMPSE OF US',
 'NO EXCUSES',
 'WAIT FOR U',
 'STAY WITH ME',
 'TV',
 'MUSIC FOR A SUSHI RESTAURANT',
 'VEGAS',
 'SNAP',
 'JIMMY COOKS',
 'GO',
 'I LIKE YOU (A HAPPIER SONG)',
 'HOT IN IT',
 'MASTER OF PUPPETS',
 'FOR MY HAND',
 'THE 30TH',
 'POTION',
 'BIG ENERGY',
 'FERRARI',
 'BIG CITY LIFE',
 'ELEGANT & GANG',
 'HEAT WAVES',
 "DON'T FORGET MY LOVE",
 'BAD HABITS',
 'REMIND ME',
 'REMEMBER',
 'ALL FOR YOU',
 'HOLD MY HAND',
 'IN THE STARS',
 'ARE YOU ENTERTAINED',
 'SUNROOF',
 'FANTASY',
 'SHIVERS',
 'PERU',
 'RAINFALL',
 'HELLO MATE',
 'WHERE ARE YOU NOW',
 'MIXED EMOTIONS',
 'MAKE ME FEEL GOOD',
 '2STEP',
 'COLD HEART',
 'SOMETHING TO SOMEONE',
 'DANDELIONS

In [18]:
# Create a dataframe with the artists and songs from the second scraped chart
charts2 = pd.DataFrame({"artists":artist2,
                        "songs":title2})

charts2

Unnamed: 0,artists,songs
0,LF SYSTEM,AFRAID TO FEEL
1,CENTRAL CEE,DOJA
2,HARRY STYLES,AS IT WAS
3,GEORGE EZRA,GREEN GREEN GRASS
4,BURNA BOY,LAST LAST
...,...,...
95,BELTERS ONLY,I WILL SURVIVE
96,GIVEON,LOST ME
97,TOM ODELL,ANOTHER LOVE
98,GEORGE EZRA,ANYONE FOR YOU
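
Both sites follow the same recipe and differ only in the tag and class names, so the extraction can be driven by a small lookup table. The sketch below is one possible arrangement, not the lab's prescribed solution: the site labels, the `parse_chart` helper, the `source` column, and the HTML fragments are all assumptions made for illustration, and the fragments only imitate the class names used above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tag and class names from the cells above, keyed by a made-up site label
SITE_PATTERNS = {
    "offiziellecharts": {"tag": "span", "artist_class": "info-artist", "title_class": "info-title"},
    "officialcharts": {"tag": "div", "artist_class": "artist", "title_class": "title"},
}

def parse_chart(html, site):
    """Extract artists and songs from one page's HTML using that site's pattern."""
    pattern = SITE_PATTERNS[site]
    soup = BeautifulSoup(html, "html.parser")
    artists = [el.get_text(strip=True)
               for el in soup.find_all(pattern["tag"], class_=pattern["artist_class"])]
    songs = [el.get_text(strip=True)
             for el in soup.find_all(pattern["tag"], class_=pattern["title_class"])]
    # A count mismatch means one of the patterns missed at least one chart entry
    if len(artists) != len(songs):
        raise ValueError(f"{site}: {len(artists)} artists but {len(songs)} songs")
    return pd.DataFrame({"artists": artists, "songs": songs, "source": site})

# Hypothetical HTML fragments standing in for the downloaded pages
pages = {
    "offiziellecharts": '<span class="info-artist">Rammstein</span>'
                        '<span class="info-title">Zick zack</span>',
    "officialcharts": '<div class="artist"> LF SYSTEM </div>'
                      '<div class="title"> AFRAID TO FEEL </div>',
}

all_charts = pd.concat(
    [parse_chart(html, site) for site, html in pages.items()],
    ignore_index=True,
)
print(all_charts)
```

Adding another chart site then only requires a new entry in `SITE_PATTERNS`, provided its artists and titles can be found by tag and class in the same way.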


In [19]:
# Concatenate the three dataframes top100, charts1 and charts2 row-wise
charts = pd.concat([top100, charts1, charts2], axis=0)
charts

Unnamed: 0,songs,artists
0,About Damn Time,Lizzo
1,As It Was,Harry Styles
2,Running Up That Hill (A Deal With God),Kate Bush
3,First Class,Jack Harlow
4,Wait For U,Future Featuring Drake & Tems
...,...,...
95,I WILL SURVIVE,BELTERS ONLY
96,LOST ME,GIVEON
97,ANOTHER LOVE,TOM ODELL
98,ANYONE FOR YOU,GEORGE EZRA
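
In the output above the last rows are still labelled 95 to 98, because `pd.concat` keeps each dataframe's own index. The sketch below uses two tiny stand-in dataframes (assumptions, not the real data) to show that `concat` aligns on column names, so the different column order of the inputs is harmless, and that `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Tiny stand-ins for top100 and charts1; note the different column order
a = pd.DataFrame({"songs": ["As It Was"], "artists": ["Harry Styles"]})
b = pd.DataFrame({"artists": ["Rammstein"], "songs": ["Zick zack"]})

# Columns are matched by name; ignore_index=True gives one continuous 0..n-1 index
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

A continuous index makes later lookups with `.loc` unambiguous, since no row label appears more than once.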


In [20]:
# Remove exact duplicate rows (the comparison is case-sensitive, so 'As It Was' and 'AS IT WAS' both remain)
charts = charts.drop_duplicates()
charts

Unnamed: 0,songs,artists
0,About Damn Time,Lizzo
1,As It Was,Harry Styles
2,Running Up That Hill (A Deal With God),Kate Bush
3,First Class,Jack Harlow
4,Wait For U,Future Featuring Drake & Tems
...,...,...
95,I WILL SURVIVE,BELTERS ONLY
96,LOST ME,GIVEON
97,ANOTHER LOVE,TOM ODELL
98,ANYONE FOR YOU,GEORGE EZRA
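
Because the sources spell the same song differently ("As It Was" and "AS IT WAS"), exact matching leaves such pairs in place. One way to catch them, sketched here on a small stand-in dataframe (an assumption), is to compare lowercase, trimmed keys while keeping the original spelling of the row that is retained.

```python
import pandas as pd

charts_sample = pd.DataFrame({
    "songs": ["As It Was", "AS IT WAS", "First Class"],
    "artists": ["Harry Styles", "HARRY STYLES", "Jack Harlow"],
})

# Build lowercase, whitespace-trimmed keys that are used only for the comparison
keys = charts_sample[["songs", "artists"]].apply(lambda col: col.str.strip().str.lower())

# Keep the first occurrence of each (song, artist) key
deduplicated = charts_sample[~keys.duplicated()].reset_index(drop=True)
print(deduplicated)
```

This still treats "Running Up That Hill (A Deal With God)" and "RUNNING UP THAT HILL", or "Featuring" and "FT", as different entries; catching those would need extra cleaning rules.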


In [21]:
# A simple recommender: if the entered song is in our charts, show it and recommend a random song
recommend_song = input("enter a song name: ")

if recommend_song in charts['songs'].values:
    print("Your choice is: ")
    # boolean mask selects the rows whose song matches the input exactly
    display(charts[charts['songs'] == recommend_song])
    print("and our recommendation is:")
    # sample() returns one random row, which may be the chosen song itself
    display(charts.sample())
else:
    print("no recommendation")

enter a song name: Wait For U
Your choice is: 


Unnamed: 0,songs,artists
4,Wait For U,Future Featuring Drake & Tems


and our recommendation is:


Unnamed: 0,songs,artists
83,Love Me More,Sam Smith
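
The lookup above needs the exact spelling and can recommend the very song that was entered. A minimal sketch of a variant that matches case-insensitively and excludes the chosen song; the `recommend` function and the sample dataframe are assumptions for illustration, not part of the lab.

```python
import pandas as pd

def recommend(charts, song, random_state=None):
    """Return one random row for a different song, or None if song is not in charts."""
    wanted = song.strip().lower()
    normalized = charts["songs"].str.strip().str.lower()
    if not (normalized == wanted).any():
        return None
    # Exclude the chosen song so it cannot be recommended back
    candidates = charts[normalized != wanted]
    if candidates.empty:
        return None
    return candidates.sample(n=1, random_state=random_state)

sample_charts = pd.DataFrame({
    "songs": ["Wait For U", "Love Me More", "First Class"],
    "artists": ["Future Featuring Drake & Tems", "Sam Smith", "Jack Harlow"],
})

print(recommend(sample_charts, "wait for u", random_state=0))
print(recommend(sample_charts, "Not In The Charts"))
```

Passing `random_state` makes the pick reproducible, which is convenient for testing; leaving it as `None` gives a different recommendation on each call.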
