**Lab | Web Scraping Single Page (GNOD part 1)**

Business goal:
Check the case_study_gnod.md file.

Make sure you've understood the big picture of your project:

the goal of the company (Gnod),
their current product (Gnoosic),
their strategy, and
how your project fits into this context.
Re-read the business case and the e-mail from the CTO.

Instructions - Scraping popular songs
Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will also enjoy a recommendation of another currently popular song.

You have to find data on the internet about currently popular songs. Popvortex maintains a weekly Top 100 of "hot" songs here: http://www.popvortex.com/music/charts/top-100-songs.php.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [3]:
import random  # used later to pick a random recommendation

from bs4 import BeautifulSoup
import requests
import pandas as pd

In [4]:
url = 'https://www.popvortex.com/music/charts/top-100-songs.php'

In [75]:
response = requests.get(url)
response.status_code 

200

In [10]:
# parse the downloaded HTML; the cells below rely on `soup` being defined
soup = BeautifulSoup(response.content, "html.parser")
# soup

In [77]:
#initialize empty lists
title = []
artist = []
genre = []
date = []


# define the number of iterations of our for loop
# by checking how many elements are in the retrieved result set
# (this is equivalent to, but more robust than, hard-coding 100 iterations)
num_iter = len(soup.select("div.chart-content.col-xs-12.col-sm-8 > p"))

# song titles are in <cite> tags and artist names in <em> tags inside each <p>
tlist = soup.select("div.chart-content.col-xs-12.col-sm-8 > p > cite")
alist = soup.select("div.chart-content.col-xs-12.col-sm-8 > p > em")
# iterate through the result set and retrieve all the data
for i in range(num_iter):
    title.append(tlist[i].get_text())
    artist.append(alist[i].get_text())
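
The loop above indexes two separate result sets by position, which assumes that every `<p>` holds exactly one `<cite>` and one `<em>`. A minimal sketch of an alternative that reads both tags from the same `<p>` and skips incomplete entries is shown below. The HTML snippet and the `extract_rows` helper are hypothetical stand-ins for illustration, not the real Popvortex markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the chart page.
html = """
<div class="chart-content col-xs-12 col-sm-8">
  <p><cite>Song A</cite><em>Artist A</em></p>
  <p><cite>Song B</cite><em>Artist B</em></p>
  <p>Paragraph without song data</p>
</div>
"""

def extract_rows(soup):
    """Return one dict per chart entry that has both a title and an artist."""
    rows = []
    for p in soup.select("div.chart-content > p"):
        # look only at direct children, mirroring the "p > cite" selector
        cite = p.find("cite", recursive=False)
        em = p.find("em", recursive=False)
        if cite is None or em is None:
            continue  # skip paragraphs that are not complete chart entries
        rows.append({
            "title": cite.get_text(strip=True),
            "artist": em.get_text(strip=True),
        })
    return rows

sample_soup = BeautifulSoup(html, "html.parser")
rows = extract_rows(sample_soup)
print(rows)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`.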


In [78]:
songs = pd.DataFrame({'title': title, 'artist': artist})
songs.head()

Unnamed: 0,title,artist
0,Fast Car,Luke Combs
1,Take Two,BTS
2,Last Night,Morgan Wallen
3,Need A Favor,Jelly Roll
4,"81 Million Votes, My Ass",The Truth Bombers & Kari Lake


In [79]:
songs.shape

(100, 2)

In [85]:
song = input('Enter a song you like: ')
if song in songs.title.to_list():
    print('You might also like: ', random.choice(songs.title))
else:
    print('Try a different song.')

Enter a song you like: Fast Car
You might also like:  red flag collector
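
The lookup above is case-sensitive and can recommend the very song the user typed. The sketch below shows one possible way to handle both points, using only the standard library. The `recommend` function and its parameters are hypothetical names introduced for illustration; the sample titles come from the scraped table above.

```python
import random

def recommend(song, hot_titles, rng=random):
    """Return another hot title if `song` is on the chart, else None."""
    wanted = song.strip().casefold()
    normalized = [t.strip().casefold() for t in hot_titles]
    if wanted not in normalized:
        return None
    # exclude the song the user typed so it is never recommended back
    candidates = [t for t, n in zip(hot_titles, normalized) if n != wanted]
    return rng.choice(candidates) if candidates else None

hot = ["Fast Car", "Take Two", "Last Night", "Need A Favor"]
print(recommend("  fast car ", hot))   # one of the other three titles
print(recommend("Unknown Song", hot))  # None
```

With a DataFrame, the same function could be called as `recommend(song, songs.title.to_list())`.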


**Lab | Web Scraping Multiple Pages**
Business goal:
Check the case_study_gnod.md file.

Expand the project
If you're done, you can try to expand the project on your own. Here are a few suggestions:

Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

In [6]:
url = 'https://www.billboard.com/charts/hot-100/'
response = requests.get(url)
# parse the downloaded HTML; the cells below rely on `soup` being defined
soup = BeautifulSoup(response.content, "html.parser")
response.status_code

200

In [115]:
title = []
artist = []

result = soup.find_all('div', class_='o-chart-results-list-row-container')
for i in result:
    title.append(i.find('h3').text.strip())
    artist.append(i.find('h3').find_next('span').text.strip())


billboard = pd.DataFrame({'title': title, 'artist': artist})
billboard.head()

Unnamed: 0,title,artist
0,Last Night,Morgan Wallen
1,Karma,Taylor Swift Featuring Ice Spice
2,Flowers,Miley Cyrus
3,All My Life,Lil Durk Featuring J. Cole
4,Calm Down,Rema & Selena Gomez


In [116]:
billboard.shape

(100, 2)

In [122]:
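song = input('Enter a song you like: ')
# the song is looked up in the Billboard chart, but the recommendation
# is drawn from the Popvortex table (`songs`) built in part 1
if song in billboard.title.to_list():
    print('You might also like: ', random.choice(songs.title))
else:
    print('Try a different song.')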

Enter a song you like: Calm Down
You might also like:  Wild Thing
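
Since the goal of scraping a second chart is a bigger pool of songs, the two tables can be combined into one. Below is a rough sketch of merging charts while dropping songs that appear in both, such as "Last Night" by Morgan Wallen. The `merge_charts` helper and the plain tuples are assumptions made for illustration; with DataFrames, `pd.concat` followed by `drop_duplicates` would be the usual route.

```python
def merge_charts(*charts):
    """Merge (title, artist) rows from several charts, keeping first occurrences."""
    seen = set()
    merged = []
    for chart in charts:
        for title, artist in chart:
            # compare case-insensitively and ignore surrounding whitespace
            key = (title.strip().casefold(), artist.strip().casefold())
            if key not in seen:
                seen.add(key)
                merged.append((title, artist))
    return merged

popvortex_rows = [("Fast Car", "Luke Combs"), ("Last Night", "Morgan Wallen")]
billboard_rows = [("Last Night", "Morgan Wallen"), ("Flowers", "Miley Cyrus")]
pool = merge_charts(popvortex_rows, billboard_rows)
print(pool)
```

Note that this only removes exact matches after normalization; artist strings that differ between charts (for example a "Featuring" credit on one chart only) would still count as separate songs.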


Instructions Part 2
Practice web scraping. This part is not related to the GNOD project of the week.
As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field. Open a new Jupyter notebook and scrape at least 3 of these sites.

1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url ='https://en.wikipedia.org/wiki/Python'
2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
3. Create a Python list with the top ten FBI's Most Wanted names: url = 'https://www.fbi.gov/wanted/topten'
4. Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
5. List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
6. Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
7. Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

**Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page**

In [7]:
url = 'https://en.wikipedia.org/wiki/Python'
response = requests.get(url)
# parse the downloaded HTML; the cells below rely on `soup` being defined
soup = BeautifulSoup(response.content, "html.parser")
response.status_code

200

In [180]:
link = []

llist = soup.select('#mw-content-text > div.mw-parser-output a')

for i in llist:
    url = i.get('href', '')
    link.append(url)

link

['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 '/w/index.php?title=Python&action=edit&section=1',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/wiki/Python_(mythology)',
 '/w/index.php?title=Python&action=edit&section=2',
 '/wiki/Python_(programming_language)',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/w/index.php?title=Python&action=edit&section=3',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_Anghelo',
 '/w/index.php?title=Python&action=edit&section=4',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/w/index.php?title=Python&action=edit&section=5',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/w/index.php?title=Python&action=edit&section=6',
 '/wiki/Python_(missile)',
 '/wiki/Python_(nuclear_primary)',
 '/wiki/Colt_Python',
 '/w/index.php?title=Python&act
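
The list above mixes absolute URLs, site-relative paths, and section edit links. One possible clean-up, sketched with the standard library only, resolves every `href` against the page URL and drops the edit links. The `clean_links` helper and `BASE_URL` constant are hypothetical names; the sample values are taken from the output above.

```python
from urllib.parse import urljoin, urlparse, parse_qs

BASE_URL = "https://en.wikipedia.org/wiki/Python"

def clean_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs to absolute URLs and drop empty, anchor-only and edit links."""
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # no target, or a jump within the same page
        absolute = urljoin(base_url, href)
        # section edit links carry action=edit in their query string
        if parse_qs(urlparse(absolute).query).get("action") == ["edit"]:
            continue
        cleaned.append(absolute)
    return cleaned

sample = [
    "https://en.wiktionary.org/wiki/Python",
    "/w/index.php?title=Python&action=edit&section=1",
    "/wiki/Pythonidae",
    "",
]
print(clean_links(sample))
```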

**Create a Python list with the top ten FBI's Most Wanted names**

In [8]:
url = 'https://www.fbi.gov/wanted/topten'
response = requests.get(url)
# parse the downloaded HTML; the cells below rely on `soup` being defined
soup = BeautifulSoup(response.content, "html.parser")
response.status_code

200

In [186]:
soup.select('h3 > a')

[<a href="https://www.fbi.gov/wanted/topten/ruja-ignatova">RUJA IGNATOVA</a>,
 <a href="https://www.fbi.gov/wanted/topten/donald-eugene-fields-ii">DONALD EUGENE FIELDS II</a>,
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>,
 <a href="https://www.fbi.gov/wanted/topten/omar-alexander-cardenas">OMAR ALEXANDER CARDENAS</a>,
 <a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a>,
 <a href="https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias">YULAN ADONAY ARCHAGA CARIAS</a>,
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>,
 <a href="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino">WILVER VILLEGAS-PALOMINO</a>,
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>,
 <a href="https://www.fbi.gov/wanted/topten/jose-rodolfo-villarreal-hernandez">JOSE RODOLFO VILLARREAL-HERNANDEZ</a>]

In [189]:
names = []

nlist = soup.select('h3 > a')
num_iter = len(nlist)

for i in range(num_iter):
    names.append(nlist[i].get_text())
    
names

['RUJA IGNATOVA',
 'DONALD EUGENE FIELDS II',
 'ARNOLDO JIMENEZ',
 'OMAR ALEXANDER CARDENAS',
 'ALEXIS FLORES',
 'YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'WILVER VILLEGAS-PALOMINO',
 'ALEJANDRO ROSALES CASTILLO',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ']

**List all language names and number of related articles in the order they appear in wikipedia.org**

In [9]:
url = 'https://www.wikipedia.org/'
response = requests.get(url)
display(response.status_code)
# parse the downloaded HTML; the cells below rely on `soup` being defined
soup = BeautifulSoup(response.content, "html.parser")

200

In [199]:
soup.select('div > a > strong')

[<strong>English</strong>,
 <strong>日本語</strong>,
 <strong>Español</strong>,
 <strong>Русский</strong>,
 <strong>Deutsch</strong>,
 <strong>Français</strong>,
 <strong>Italiano</strong>,
 <strong>中文</strong>,
 <strong><bdi dir="rtl">فارسی</bdi></strong>,
 <strong>Português</strong>]

In [201]:
soup.select('div > a > small > bdi')

[<bdi dir="ltr">6 668 000+</bdi>,
 <bdi dir="ltr">1 376 000+</bdi>,
 <bdi dir="ltr">1 869 000+</bdi>,
 <bdi dir="ltr">1 921 000+</bdi>,
 <bdi dir="ltr">2 808 000+</bdi>,
 <bdi dir="ltr">2 528 000+</bdi>,
 <bdi dir="ltr">1 814 000+</bdi>,
 <bdi dir="ltr">1 360 000+</bdi>,
 <bdi dir="ltr">965 000+</bdi>,
 <bdi dir="ltr">1 103 000+</bdi>]

In [202]:
languages = []
articles = []

lanlist = soup.select('div > a > strong')
artlist = soup.select('div > a > small > bdi')

num_iter = len(lanlist)

for i in range(num_iter):
    languages.append(lanlist[i].get_text())
    articles.append(artlist[i].get_text())
    
wiki = pd.DataFrame({'language': languages, 'numArticles': articles})
wiki

Unnamed: 0,language,numArticles
0,English,6 668 000+
1,日本語,1 376 000+
2,Español,1 869 000+
3,Русский,1 921 000+
4,Deutsch,2 808 000+
5,Français,2 528 000+
6,Italiano,1 814 000+
7,中文,1 360 000+
8,فارسی,965 000+
9,Português,1 103 000+
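
The `numArticles` column above holds strings such as `6 668 000+`, which cannot be sorted or compared numerically. A small sketch of converting them to integers follows; `parse_article_count` is a hypothetical helper, and stripping every non-digit character is an assumption that also covers non-breaking spaces, should the page use them as separators.

```python
import re

def parse_article_count(text):
    """Turn a string like '6 668 000+' into the integer 6668000."""
    digits = re.sub(r"\D", "", text)  # drop spaces, '+' and any other non-digits
    return int(digits) if digits else None

counts = ["6 668 000+", "965 000+"]
print([parse_article_count(c) for c in counts])  # [6668000, 965000]
```

On the DataFrame, this could be applied with `wiki['numArticles'].map(parse_article_count)`.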
