![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Web Scraping Multiple Pages

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Instructions 

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles, in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers as a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`
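
As a starting point for the first challenge, here is a minimal sketch of collecting the links on a page. It assumes BeautifulSoup is installed, as in the rest of this lab. The helper name `extract_links` and the inline sample HTML are hypothetical; the sample stands in for `response.text` so the snippet runs without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URL of every <a href=...> in the HTML, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for a_tag in soup.find_all("a", href=True):
        # urljoin resolves relative paths such as /wiki/... against the page URL.
        links.append(urljoin(base_url, a_tag["href"]))
    return links


# Hypothetical sample standing in for requests.get(url).text
sample_html = """
<ul>
  <li><a href="/wiki/Python_(programming_language)">Python (language)</a></li>
  <li><a href="https://example.org/about">About</a></li>
  <li><a name="no-href">Anchor without a link</a></li>
</ul>
"""

links = extract_links(sample_html, "https://en.wikipedia.org/wiki/Python")
print(links)
```

On the real page you would pass `response.text` and the page `url` instead of the sample values.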


In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Adele"
response = requests.get(url)

In [33]:
soup = BeautifulSoup(response.text,"html.parser")
print(soup)

In [30]:
soup.find('th',{'scope':'row'}).text.strip()

'"All I Ask"'

In [32]:
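song_names = []
# On this page each song title sits in the row-header cell (<th scope="row">) of its table row.
for row in soup.find_all('tr'):
    th_tag = row.find('th', {'scope': 'row'})
    if th_tag:  # skip rows that have no row-header cell
        song_name = th_tag.text.strip()
        song_names.append(song_name)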
song_names

['"All I Ask"',
 '"All Night Parking"',
 '"Be Divine"',
 '"Best for Last"',
 '"Can I Get It"',
 '"Can\'t Be Together"',
 '"Can\'t Let Go"',
 '"Chasing Pavements"',
 '"Cold Shoulder"',
 '"Crazy for You"',
 '"Cry Your Heart Out"',
 '"Daydreamer"',
 '"Don\'t You Remember"',
 '"Every Glance" #',
 '"Easy on Me"',
 '"Easy on Me" (feature version)',
 '"First Love"',
 '"Fool That I Am" (cover)',
 '"He Won\'t Go"',
 '"Hello"',
 '"Hiding My Heart" (cover)',
 '"Hold On"',
 '"Hometown Glory"',
 '"I Can\'t Make You Love Me" (cover)',
 '"I Drink Wine"',
 '"I Found a Boy"',
 '"I Miss You"',
 '"I\'ll Be Waiting"',
 '"If It Hadn\'t Been For Love"(cover)',
 '"Lay Me Down"',
 '"Love in the Dark"',
 '"Love Is a Game"',
 '"Lovesong" (cover)',
 '"Make You Feel My Love" (cover)',
 '"Many Shades of Black"',
 '"Melt My Heart to Stone"',
 '"Million Years Ago"',
 '"My Little Love"',
 '"My Same"',
 '"My Yvonne"  #',
 '"Need You Now" (cover)',
 '"Now and Then"',
 '"Oh My God"',
 '"One and Only"',
 '"Painting Pictu
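
The scraped titles still carry their surrounding double quotes, and some have trailing markers such as `#` or `(cover)`. One possible way to clean them, sketched here under the assumption that the title is always the text between the first pair of double quotes, is a small regular expression. The helper name `clean_title` is hypothetical.

```python
import re

# Captures the text between the first pair of double quotes in a scraped cell.
QUOTED_TITLE = re.compile(r'"([^"]+)"')


def clean_title(raw):
    """Return the quoted title without quotes or trailing markers.

    Falls back to the stripped input when the cell contains no quoted text.
    """
    match = QUOTED_TITLE.search(raw)
    return match.group(1).strip() if match else raw.strip()


# A few raw values copied from the output above.
raw_titles = [
    '"All I Ask"',
    '"Every Glance" #',
    '"Fool That I Am" (cover)',
    '"If It Hadn\'t Been For Love"(cover)',
]
clean_titles = [clean_title(title) for title in raw_titles]
print(clean_titles)
```

Note the trade-off: dropping everything after the closing quote also drops labels such as `(feature version)`, so two distinct entries can collapse into the same title.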

In [None]:
# Kanye West: https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Kanye_West

In [48]:
url_kanye_west = "https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Kanye_West"
response_kanye_west = requests.get(url_kanye_west)

In [49]:
soup_kanye_west = BeautifulSoup(response_kanye_west.text,"html.parser")
print(soup_kanye_west)

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of songs recorded by Kanye West - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-cont

In [50]:
soup_kanye_west.find('th',{'scope':'row'}).text.strip()

'"30 Hours"'

In [53]:
song_names_kanye_west = []
for row in soup_kanye_west.find_all('tr'):
    th_tag = row.find('th', {'scope': 'row'})
    if th_tag:
        song_name_kanye_west = th_tag.text.strip()
        song_names_kanye_west.append(song_name_kanye_west)
song_names_kanye_west

['"30 Hours"',
 '"4th Dimension"',
 '"Addiction"',
 '"All Day"',
 '"All Falls Down"',
 '"All Mine"',
 '"All of the Lights" (Interlude)',
 '"All of the Lights"',
 '"Amazing"',
 '"Back to Basics"',
 '"Bad News"',
 '"Barry Bonds"',
 '"Big Brother"',
 '"Bittersweet Poetry"',
 '"Black Skinhead"',
 '"Blame Game"',
 '"Blood on the Leaves"',
 '"Bound 2"',
 '"Breathe In Breathe Out"',
 '"Bring Me Down"',
 '"Can\'t Tell Me Nothing"',
 '"Celebration"',
 '"Chain Heavy"',
 '"Champion"',
 '"Champions"',
 '"Champions"',
 '"Christian Dior Denim Flow"',
 '"Christmas In Harlem"',
 '"Classic (Better Than I\'ve Ever Been)"',
 '"Clique"',
 '"Closed On Sunday"',
 '"Cold"',
 '"Coldest Winter"',
 '"Crack Music"',
 '"Cudi Montage"',
 '"Dark Fantasy"',
 '"Devil in a New Dress"',
 '"Diamonds from Sierra Leone"',
 '"Diamonds from Sierra Leone" (Remix)',
 '"Don\'t Like. 1"',
 '"Don\'t Look Down"',
 '"Donda Chant"',
 '"Drive Slow"',
 '"Drunk and Hot Girls"',
 '"Every Hour"',
 '"Everything I Am"',
 '"Everything We N
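
The output above contains `'"Champions"'` twice. If repeated titles are not wanted in the song pool, a minimal sketch of removing them while keeping the original order could look like this. The helper name `dedupe_keep_order` is hypothetical, and whether the repeats are true duplicates or distinct recordings is something to check on the page before dropping them.

```python
def dedupe_keep_order(titles):
    """Remove repeated titles while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(titles))


# A short slice of the scraped list, including the repeated entry.
sample_titles = [
    '"Champion"',
    '"Champions"',
    '"Champions"',
    '"Christian Dior Denim Flow"',
]
unique_titles = dedupe_keep_order(sample_titles)
print(unique_titles)
```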

In [None]:
# RFM top 25

In [54]:
url_rfm = "https://rfm.sapo.pt/top25rfm"
response_rfm = requests.get(url_rfm)

In [56]:
soup_rfm = BeautifulSoup(response_rfm.text,"html.parser")
print(soup_rfm)

In [61]:
soup_rfm.find_all('div', {'class':'tAuthor'})

[<div class="tAuthor">Heaven</div>,
 <div class="tAuthor">Kiss Me</div>,
 <div class="tAuthor">Como Tu</div>,
 <div class="tAuthor">Flowers</div>,
 <div class="tAuthor">Bloody Mary</div>,
 <div class="tAuthor">Casa</div>,
 <div class="tAuthor">Never Gonna Not Dance Again</div>,
 <div class="tAuthor">Glimpse of Us</div>,
 <div class="tAuthor">Escola</div>,
 <div class="tAuthor">Creepin</div>,
 <div class="tAuthor">I Give Up</div>,
 <div class="tAuthor">Celestial</div>,
 <div class="tAuthor">A Nossa Dança</div>,
 <div class="tAuthor">Outros Planos</div>,
 <div class="tAuthor">Pilantra</div>,
 <div class="tAuthor">Lay Low</div>,
 <div class="tAuthor">Maria Joana</div>,
 <div class="tAuthor">Forget Me</div>,
 <div class="tAuthor">Despecha</div>,
 <div class="tAuthor">All By Myself</div>,
 <div class="tAuthor">Made You Look</div>,
 <div class="tAuthor">Agarra Em Mim</div>,
 <div class="tAuthor">Como Antes</div>,
 <div class="tAuthor">Shakira Bzrp Music Sessions Vol 53</div>,
 <div class="tA

In [62]:
song_names_rfm = []
for row in soup_rfm.find_all('li'):
    # On this page the song title is stored in the div with class "tAuthor".
    title_tag = row.find('div', {'class': 'tAuthor'})
    if title_tag:
        song_name_rfm = title_tag.text.strip()
        song_names_rfm.append(song_name_rfm)
song_names_rfm

['Heaven',
 'Kiss Me',
 'Como Tu',
 'Flowers',
 'Bloody Mary',
 'Casa',
 'Never Gonna Not Dance Again',
 'Glimpse of Us',
 'Escola',
 'Creepin',
 'I Give Up',
 'Celestial',
 'A Nossa Dança',
 'Outros Planos',
 'Pilantra',
 'Lay Low',
 'Maria Joana',
 'Forget Me',
 'Despecha',
 'All By Myself',
 'Made You Look',
 'Agarra Em Mim',
 'Como Antes',
 'Shakira Bzrp Music Sessions Vol 53',
 'La Bachata']
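
The three loops above follow the same pattern: find every row element, look for one specific cell inside it, and keep the stripped text. A possible way to avoid repeating that loop for each new page is a small helper, sketched below. The name `extract_titles` and its parameters are hypothetical, and the inline HTML snippets are made-up stand-ins for `response.text` so the example runs offline.

```python
from bs4 import BeautifulSoup


def extract_titles(html, row_tag, cell_tag, cell_attrs):
    """Collect the stripped text of the first matching cell in each row element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for row in soup.find_all(row_tag):
        cell = row.find(cell_tag, cell_attrs)
        if cell:  # rows without a matching cell are skipped
            titles.append(cell.text.strip())
    return titles


# Made-up sample shaped like the Wikipedia song tables.
wiki_html = """
<table>
  <tr><th scope="col">Song</th></tr>
  <tr><th scope="row">"Hello"</th><td>25</td></tr>
  <tr><th scope="row">"Hold On"</th><td>30</td></tr>
</table>
"""

# Made-up sample shaped like the RFM chart list.
chart_html = """
<ul>
  <li><div class="tAuthor">Flowers</div></li>
  <li><div class="other">not a title</div></li>
  <li><div class="tAuthor"> Casa </div></li>
</ul>
"""

wiki_titles = extract_titles(wiki_html, "tr", "th", {"scope": "row"})
chart_titles = extract_titles(chart_html, "li", "div", {"class": "tAuthor"})
print(wiki_titles)
print(chart_titles)
```

With the real pages, the calls would be `extract_titles(response.text, "tr", "th", {"scope": "row"})` for the Wikipedia lists and `extract_titles(response_rfm.text, "li", "div", {"class": "tAuthor"})` for the RFM chart.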