![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Web Scraping Multiple Pages

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Instructions 

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles, in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`




In [1]:
from bs4 import BeautifulSoup
from time import sleep
import random
from tqdm.notebook import tqdm
import requests
import pandas as pd
from datetime import datetime

In [2]:
url = "https://www.billboard.com/charts/hot-100"

In [3]:
response = requests.get(url)

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
titles = []
artists = []
for tag in soup.find_all("span", attrs={"class":"chart-element__information__song text--truncate color--primary"}):
    titles.append(tag.get_text())
for tag in soup.find_all("span", attrs={"class":"chart-element__information__artist text--truncate color--secondary"}):
    artists.append(tag.get_text())

In [6]:
data = pd.DataFrame({'title':titles, 'artist':artists})
data.head()

              title                    artist
0   Drivers License            Olivia Rodrigo
1             34+35             Ariana Grande
2  Calling My Phone  Lil Tjay Featuring 6LACK
3   Blinding Lights                The Weeknd
4                Up                   Cardi B


In [7]:
data.shape

(100, 2)

Get more songs from Wikipedia

In [8]:
urls = []
for i in range(1,7):
    urls.append(f"https://en.wikipedia.org/wiki/List_of_songs_in_Glee_(season_{i})")

# Download and parse each season page once
soups = []
for page_url in urls:
    soups.append(BeautifulSoup(requests.get(page_url).content, 'html.parser'))

In [9]:
len(soups[0].select('.wikitable > tbody > tr > th > a'))

132

In [10]:
titles = []
artists = []

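for i in soups:
    # Song titles sit in the row-header cells (<th scope="row">) of each wikitable
    for tag in (i.select('.wikitable > tbody > tr > th:nth-child(1)')):
        # .get() avoids a KeyError on header cells that have no scope attribute
        if (tag.get('scope') == 'row'):
            titles.append(tag.get_text().rstrip().strip('\"'))
    # The <td> that is the third child of each row is stored as the artist
    for tag in i.select('.wikitable > tbody > tr > td:nth-child(3)'):
        artists.append(tag.get_text())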
#     print('titles',len(titles),'artists',len(artists))
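
The commented-out length print hints at a real risk: titles and artists are collected by two separate selectors, so a single missing cell would silently shift every later pair. A minimal sketch of an explicit guard follows; `pair_songs` is a hypothetical helper name, not part of the notebook.

```python
def pair_songs(titles, artists):
    """Pair titles with artists, refusing to continue if the lists drifted apart."""
    if len(titles) != len(artists):
        raise ValueError(
            f"selector mismatch: {len(titles)} titles vs {len(artists)} artists"
        )
    # Strip stray whitespace so later comparisons are not thrown off by it
    return [(title.strip(), artist.strip()) for title, artist in zip(titles, artists)]
```

The returned pairs could then be unpacked into the two columns passed to `add_new_data`.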

In [11]:
def add_new_data(df, titles, artists):
    df = pd.concat([df, pd.DataFrame({'title':titles, 'artist':artists})], axis=0)
    return df

In [12]:
data = add_new_data(data, titles, artists)

In [13]:
data.shape

(843, 2)

In [14]:
# Top 40 'alt' songs. Can be searched by week (yyyy-mm-dd)
# Let's get one year of data
urls = []
dates = pd.date_range('2020-02-24', '2021-02-24', freq='W')
dates = [date.strftime('%Y-%m-%d') for date in dates]

for i in dates:
    urls.append(f'https://www.billboard.com/charts/alternative-airplay/{i}')
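
Note that `freq='W'` in `pd.date_range` is anchored on Sundays, so the range above yields 52 Sunday dates starting at 2020-03-01 rather than at 2020-02-24. The same list can be sketched with the standard library, which makes the weekday explicit. `weekly_dates` is a hypothetical helper, and whether Billboard expects a particular weekday in the URL is not something this notebook establishes.

```python
from datetime import date, timedelta

def weekly_dates(start, end, weekday=6):
    """Return ISO date strings for every `weekday` in [start, end].

    weekday follows date.weekday(): Monday is 0, Sunday is 6.
    """
    # Move forward to the first requested weekday on or after `start`
    current = start + timedelta(days=(weekday - start.weekday()) % 7)
    dates = []
    while current <= end:
        dates.append(current.isoformat())
        current += timedelta(weeks=1)
    return dates

dates = weekly_dates(date(2020, 2, 24), date(2021, 2, 24))
urls = [f"https://www.billboard.com/charts/alternative-airplay/{d}" for d in dates]
```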

In [15]:
soups = []
for url in tqdm(urls):
    soups.append(BeautifulSoup(requests.get(url).content, 'html.parser'))
    sleep(random.random()*4)

  0%|          | 0/52 [00:00<?, ?it/s]
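
With 52 requests in a row, one failed download would abort the whole loop above. The sketch below keeps the random pause but records failures per URL instead of stopping. It uses `urllib` from the standard library rather than `requests`, and `fetch_html` and `fetch_all` are hypothetical helper names; the `fetch` and `sleep` parameters exist only so the loop can be tried without touching the network.

```python
import random
import time
from urllib.request import Request, urlopen

def fetch_html(url, timeout=10):
    """Download one page as text (hypothetical helper built on urllib)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (lab scraper)"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def fetch_all(urls, fetch=fetch_html, max_delay=4.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    Failures are recorded per URL instead of aborting the whole run.
    """
    pages, failures = {}, {}
    for position, url in enumerate(urls):
        if position > 0:
            # Same idea as sleep(random.random()*4) in the cell above
            sleep(random.uniform(0, max_delay))
        try:
            pages[url] = fetch(url)
        except Exception as error:  # broad on purpose: any failure is logged per URL
            failures[url] = str(error)
    return pages, failures
```

Calling `fetch_all(urls)` returns the downloaded pages and a dictionary of URLs that failed, which can be retried later.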

In [16]:
titles = []
artists = []

for soup in soups:

    for tag in soup.find_all('span', 'chart-list-item__title-text'):
        titles.append(tag.get_text().strip())
    
    for tag in soup.find_all('div', 'chart-list-item__artist'):
        artists.append(tag.get_text().strip())

#     print(len(titles), len(artists))

In [17]:
data = add_new_data(data, titles, artists)

In [18]:
data.shape

(1923, 2)
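
A year of weekly charts lists many of the same songs week after week, so the combined frame very likely contains repeated rows. In pandas the closest built-in is `DataFrame.drop_duplicates`, which compares values exactly. The standard-library sketch below also ignores differences in case and spacing; `dedupe_songs` is a hypothetical helper name.

```python
def dedupe_songs(pairs):
    """Keep the first occurrence of each (title, artist), ignoring case and spacing."""
    seen = set()
    unique = []
    for title, artist in pairs:
        # casefold() plus split/join normalises case and collapses repeated whitespace
        key = (" ".join(title.casefold().split()), " ".join(artist.casefold().split()))
        if key not in seen:
            seen.add(key)
            unique.append((title.strip(), artist.strip()))
    return unique
```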

Recommendation algorithm

In [26]:
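def ask_song():
    print("""
    Enter your song
    example:
    >>> Where Is Love?
    """)
    user_input = input()
    
    # Case-insensitive literal match (regex=False so "?" is not read as a regex quantifier)
    dtemp = data['title'][data['title'].str.contains(user_input, na=False, case=False, regex=False)]
    dtemp = dtemp.to_frame().reset_index(drop=True)
    print(dtemp)
    print("""
    Is it any of this?
    Pick one
    """)
    user_input = int(input())
    
    print("""
    You selected:
    """)
    selected = data[data['title'] == dtemp.iloc[user_input]['title']]
    print(selected)
    # Return the selection so the caller receives the chosen song instead of None
    return selected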

In [27]:
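def make_recommendation(x):
    # x is the value returned by ask_song(); this placeholder ignores it
    # and simply returns one random row from the pool
    print("""
    Making recommendation
    """)
    # Cosmetic progress bar only: loop without building a 10-million-element list
    for _ in tqdm(range(10000000)):
        pass
    y = data.sample()
    
    print("""
    The almighty algorithm recommends:
    """)
    return y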

In [28]:
print(make_recommendation(ask_song()))


    Enter your song
    example:
    >>> Where Is Love?
    
where
                    title
0          Where Is Love?
1  Somewhere Only We Know
2               Somewhere
3  I Know Where I've Been

    Is it any of this?
    Pick one
    
2

    You selected:
    
         title                            artist
276  Somewhere  Rachel Berry and Shelby Corcoran

    Making recommendation
    


  0%|          | 0/10000000 [00:00<?, ?it/s]


    The almighty algorithm recommends:
    
      title artist
1009  Bang!    AJR
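
Substring search works for input such as `where`, but it returns nothing when the user misspells a title. A minimal sketch of a fallback using `difflib.get_close_matches` from the standard library follows; `find_candidates` is a hypothetical helper and the `0.6` cutoff is an assumption to tune, not a value taken from the notebook.

```python
import difflib

def find_candidates(query, titles, limit=5):
    """Return titles matching `query`: substring hits first, close spellings as fallback."""
    needle = query.casefold().strip()
    hits = [title for title in titles if needle in title.casefold()]
    if hits:
        return hits[:limit]
    # No substring hit: fall back to similarity matching on lower-cased titles
    lowered = {title.casefold(): title for title in titles}
    close = difflib.get_close_matches(needle, list(lowered), n=limit, cutoff=0.6)
    return [lowered[match] for match in close]
```

For example, `find_candidates("Blindin Lights", list(data['title']))` would be expected to surface "Blinding Lights" even though the plain substring test fails.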


- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`

In [29]:
url ='https://en.wikipedia.org/wiki/Python'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [30]:
WIKI_URL = 'https://en.wikipedia.org'

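# Print the first link of each <ul>, keeping only site-relative hrefs
for tag in soup.find('div', 'mw-parser-output').find_all('ul'):
    link = tag.a  # None when the list contains no link at all
    if link is not None and link.get('href', '').startswith('/'):
        print(WIKI_URL+link.get('href'))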

https://en.wikipedia.org/wiki/Pythons
https://en.wikipedia.org/wiki/Python_(genus)
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/Python_of_Aenus
https://en.wikipedia.org/wiki/Python_(Efteling)
https://en.wikipedia.org/wiki/Python_(automobile_maker)
https://en.wikipedia.org/wiki/Python_(missile)
https://en.wikipedia.org/wiki/PYTHON
https://en.wikipedia.org/wiki/Python_(Monty)_Pictures
https://en.wikipedia.org/wiki/Cython
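
The loop above prints only the first link of each `<ul>`. To build an actual list of every link on the page, one option is sketched below using only the standard library (`html.parser` and `urljoin`) rather than BeautifulSoup; `LinkCollector` and `extract_links` are hypothetical names. With BeautifulSoup, `soup.find_all('a', href=True)` would be the usual starting point.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without a target and in-page fragments such as "#History"
        if href and not href.startswith("#"):
            self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    """Return the de-duplicated links of `html`, resolved against `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(collector.links))
```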


- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`

In [31]:
url = 'http://uscode.house.gov/download/download.shtml'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [32]:
print(f"There are {len(soup.find_all('div', 'usctitlechanged'))} titles changed.")

There are 13 titles changed.


In [33]:
for tag in soup.find_all('div', 'usctitlechanged'):
    print(tag.get_text().strip())

Title 5 - Government Organization and Employees ٭
Title 6 - Domestic Security
Title 20 - Education
Title 25 - Indians
Title 29 - Labor
Title 33 - Navigation and Navigable Waters
Title 34 - Crime Control and Law Enforcement
Title 36 - Patriotic and National Observances, Ceremonies, and Organizations ٭
Title 38 - Veterans' Benefits ٭
Title 40 - Public Buildings, Property, and Works ٭
Title 42 - The Public Health and Welfare
Title 49 - Transportation ٭
Title 51 - National and Commercial Space Programs ٭
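
The printed lines mix the title number, the title name and, on some rows, a trailing `٭`. A small sketch for splitting them into structured values is shown below. `parse_title` is a hypothetical helper; the meaning of the marker is defined on the source page and is not assumed here, the sketch only records whether it is present.

```python
import re

def parse_title(line, marker="٭"):
    """Split 'Title 5 - Name ٭' into (number, name, has_marker)."""
    text = line.strip()
    has_marker = text.endswith(marker)
    if has_marker:
        # Drop the marker and the whitespace that precedes it
        text = text[: -len(marker)].rstrip()
    match = re.match(r"Title\s+(\d+)\s+-\s+(.+)", text)
    if match is None:
        raise ValueError(f"unexpected title format: {line!r}")
    return int(match.group(1)), match.group(2), has_marker
```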


- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`

In [34]:
url = 'https://www.fbi.gov/wanted/topten'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [35]:
[i.get_text().strip() for i in soup.find_all('h3', 'title')]

['RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'EUGENE PALMER',
 'YASER ABDEL SAID']

- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`

In [36]:
url = 'https://www.emsc-csem.org/Earthquake/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [38]:
date = []
time = []
latitude = []
longitude = []
region_name = []

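# Only the first 10 rows are parsed here; raise `limit` to 20 to cover the full task
for tag1 in soup.find(id='tbody').find_all('tr', limit=10):
    # The text after the word "earthquake" starts with the date and the time
    s = tag1.get_text().split('earthquake')[1]
    date.append(s.split()[0])
    time.append(s.split()[1][:10])
    # After " ago" come latitude, N/S, longitude and E/W as separate tokens
    v = s.split(' ago')[1].split()[0:2]
    latitude.append("".join(v)[:-1]+" "+"".join(v)[-1])
    v = s.split(' ago')[1].split()[2:4]
    longitude.append("".join(v)[:-1]+" "+"".join(v)[-1])
    region_name.append(tag1.find('td', 'tb_region').string.replace(u'\xa0', u''))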

In [39]:
pd.DataFrame({'date':date, 'time':time, 'latitude':latitude, 'longitude':longitude, 'region_name':region_name})

         date        time  latitude  longitude                     region_name
0  2021-02-25  10:05:33.0   16.41 N    98.47 W                GUERRERO, MEXICO
1  2021-02-25  09:56:04.8   38.69 N    15.69 E                   SICILY, ITALY
2  2021-02-25  09:48:21.3   39.30 N    41.15 E                  EASTERN TURKEY
3  2021-02-25  09:22:02.0   24.17 S    67.43 W                SALTA, ARGENTINA
4  2021-02-25  08:51:05.0   47.48 S   100.11 E          SOUTHEAST INDIAN RIDGE
5  2021-02-25  08:44:29.0    9.30 N   123.18 E   NEGROS- CEBU REG, PHILIPPINES
6  2021-02-25  08:41:09.3   38.18 N   117.79 W                          NEVADA
7  2021-02-25  08:37:54.0    0.31 N    98.59 E          NIAS REGION, INDONESIA
8  2021-02-25  08:36:39.3   36.63 N    71.49 E  HINDU KUSH REGION, AFGHANISTAN
9  2021-02-25  08:11:51.7   36.71 N   121.35 W              CENTRAL CALIFORNIA
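
All five columns above are strings, which makes sorting or mapping awkward. A minimal sketch for turning them into numeric coordinates and a single timestamp follows. Both helper names are hypothetical, south and west are treated as negative by the usual convention, and the timestamp is left timezone-naive because the notebook does not state which timezone the page reports.

```python
from datetime import datetime

def to_signed_degrees(value):
    """Convert '24.17 S' to -24.17; south and west become negative."""
    number, hemisphere = value.split()
    sign = -1.0 if hemisphere.upper() in ("S", "W") else 1.0
    return sign * float(number)

def to_timestamp(date_text, time_text):
    """Combine '2021-02-25' and '10:05:33.0' into one datetime."""
    return datetime.strptime(f"{date_text} {time_text}", "%Y-%m-%d %H:%M:%S.%f")
```

Applied to the frame, these could be used with `Series.map`, for example `df['latitude'].map(to_signed_degrees)`.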


- List all language names and the number of related articles, in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`

In [40]:
url = 'https://www.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [41]:
langs = ['lang'+str(i) for i in range(1,11)]
results = []

for lang in langs:
    results.append(soup.find('div', lang).get_text().split()[0])
    results.append(soup.find('div', lang).get_text().split(maxsplit=1)[1].replace('\xa0','').strip())

In [42]:
pd.DataFrame({'language':results[::2], 'description':results[1::2]})

    language         description
0    English   6245000+ articles
1        日本語         1252000+ 記事
2    Deutsch    2534000+ Artikel
3    Español  1659000+ artículos
4    Русский     1697000+ статей
5   Français   2296000+ articles
6   Italiano       1672000+ voci
7         中文         1175000+ 條目
8     Polski      1454000+ haseł
9  Português    1055000+ artigos
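
The task asks for the number of related articles, but the `description` column still holds text such as `6245000+ articles`. A small sketch for extracting the number and ranking languages by it is shown below; `article_count` and `rank_by_articles` are hypothetical helper names.

```python
import re

def article_count(description):
    """Return the leading number in strings such as '6245000+ articles'."""
    # Remove regular and non-breaking spaces used as thousands separators
    compact = description.replace("\xa0", "").replace(" ", "")
    match = re.match(r"\d+", compact)
    if match is None:
        raise ValueError(f"no leading number in {description!r}")
    return int(match.group())

def rank_by_articles(rows):
    """Sort (language, description) pairs by article count, largest first."""
    return sorted(rows, key=lambda row: article_count(row[1]), reverse=True)
```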


- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`

In [43]:
url = 'https://www.data.gov.uk/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [44]:
[tag.a.string for tag in soup.find_all('h3', 'govuk-heading-s dgu-topics__heading')]

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']

- Display the top 10 languages by number of native speakers stored in a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`

In [45]:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

In [46]:
l = []
# Query the rows of the first wikitable once instead of re-selecting them per row
rows = soup.select('.mw-parser-output > .wikitable > tbody')[0].find_all('tr')
for row in rows[1:11]:  # skip the header row, keep the top 10
    l.append(row.a.string)
    l.append(row.find_all('td')[2].string.strip())

In [47]:
pd.DataFrame({'language':l[::2], 'speakers (millions)':l[1::2]})

           language  speakers (millions)
0  Mandarin Chinese                918.0
1           Spanish                480.0
2           English                379.0
3             Hindi                341.0
4           Bengali                228.0
5        Portuguese                221.0
6           Russian                154.0
7          Japanese                128.0
8   Western Punjabi                 92.7
9           Marathi                 83.1
