## Instructions Part 2

Practice web scraping. This is not related to the GNOD project of the week.
As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some small challenges to help you gain more experience in the field. Open a new Jupyter notebook and scrape at least 3 of these sites.

1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url = 'https://en.wikipedia.org/wiki/Python'
2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
3. Create a Python list with the names of the FBI's top ten Most Wanted: url = 'https://www.fbi.gov/wanted/topten'
4. Display information on the 20 latest earthquakes reported by the EMSC (date, time, latitude, longitude and region name) as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
5. List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
6. Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
7. Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

Exercises:
Done: 1, 6, 7
In progress: 4, 5

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import sleep
from random import randint

### 1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url ='https://en.wikipedia.org/wiki/Python'

- The task does not specify which kind of links is needed, so I collected all internal Wikipedia links, i.e. those whose href starts with /wiki/.

In [2]:
url = "https://en.wikipedia.org/wiki/Python"

In [3]:
# response = requests.get(url)
response = requests.get(url, headers = {"Accept-Language": "en-US"})
response.status_code 

200
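Printing the status code shows whether the request worked, but the following cells would still run on an error page. A minimal sketch of one way to guard against that, assuming the same `requests` setup as above: `get_page` is a hypothetical helper (not part of the notebook) that adds a timeout and raises on error statuses via `Response.raise_for_status()`. The `get` parameter is only there so the helper can be tried without network access.

```python
import requests


def get_page(url, timeout=10, get=requests.get):
    """Fetch url and raise requests.HTTPError for 4xx/5xx responses."""
    # `get` defaults to requests.get; it is a parameter only to allow offline testing
    response = get(url, headers={"Accept-Language": "en-US"}, timeout=timeout)
    response.raise_for_status()  # does nothing for status codes below 400
    return response
```

With this in place, `response = get_page(url)` could stand in for the plain `requests.get(...)` call.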

In [4]:
soup = BeautifulSoup(response.content, "html.parser")

In [5]:
# Checking the output
# soup

In [6]:
soup.select("a")

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>,
 <a href=

In [7]:
links_list = []  # Creating an empty list of links

for link in soup.select("a"):
    href = link.get('href')
    if href and href.startswith('/wiki/'):
        links_list.append(href)

In [8]:
# Output: list of links
links_list

['/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Special:RecentChanges',
 '/wiki/Wikipedia:File_upload_wizard',
 '/wiki/Main_Page',
 '/wiki/Special:Search',
 '/wiki/Help:Introduction',
 '/wiki/Special:MyContributions',
 '/wiki/Special:MyTalk',
 '/wiki/Python',
 '/wiki/Talk:Python',
 '/wiki/Python',
 '/wiki/Python',
 '/wiki/Special:WhatLinksHere/Python',
 '/wiki/Special:RecentChangesLinked/Python',
 '/wiki/Wikipedia:File_Upload_Wizard',
 '/wiki/Special:SpecialPages',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/wiki/Python_(mythology)',
 '/wiki/Python_(programming_language)',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_Anghelo',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(Busch_G
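
The hrefs collected above are relative paths, and several of them repeat (for example '/wiki/Python'). If full, unique URLs are wanted, one possible follow-up step is sketched below. `absolute_unique` and `BASE_URL` are illustrative names, not part of the notebook; the sketch uses only the standard library.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"  # page the hrefs were scraped from


def absolute_unique(hrefs, base_url=BASE_URL):
    """Resolve relative hrefs against base_url and drop repeats, keeping order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    return list(dict.fromkeys(absolute))


sample = ["/wiki/Main_Page", "/wiki/Python", "/wiki/Python", "/wiki/PERQ#PERQ_3"]
print(absolute_unique(sample))
```

Calling `absolute_unique(links_list)` would apply the same step to the full list.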

### 7. Display the top 10 languages by number of native speakers stored in a pandas dataframe

In [9]:
# The link for the source
# https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers 

In [10]:
url = "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers"

In [11]:
response = requests.get(url)
response.status_code 

200

In [12]:
soup = BeautifulSoup(response.content, "html.parser")

In [13]:
# soup  # checking the output

In [14]:
# Looking for languages

soup.select("table tr td a")

[<a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>,
 <a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>,
 <a href="/wiki/Sinitic_languages" title="Sinitic languages">Sinitic</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>,
 <a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>,
 <a href="/wiki/Romance_languages" title="Romance languages">Romance</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>,
 <a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>,
 <a href="/wiki/Germanic_languages" title="Germanic languages">Germanic</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:hin" title="ISO 639:hin">Hindi</a>,
 <a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>,
 <a href="/wiki/Indo-Aryan_languages" title="Indo-Aryan languages">Indo-Ary

In [15]:
pre_list= []

for i in soup.select("tr td a"):
    pre_list.append(i.get_text())

In [16]:
pre_list

['Mandarin Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Spanish',
 'Indo-European',
 'Romance',
 'English',
 'Indo-European',
 'Germanic',
 'Hindi',
 'Indo-European',
 'Indo-Aryan',
 'Portuguese',
 'Indo-European',
 'Romance',
 'Bengali',
 'Indo-European',
 'Indo-Aryan',
 'Russian',
 'Indo-European',
 'Balto-Slavic',
 'Japanese',
 'Japonic',
 'Japanese',
 'Yue Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Vietnamese',
 'Austroasiatic',
 'Vietic',
 'Turkish',
 'Turkic',
 'Oghuz',
 'Wu Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Marathi',
 'Indo-European',
 'Indo-Aryan',
 'Telugu',
 'Dravidian',
 'Korean',
 'Koreanic',
 'French',
 'Indo-European',
 'Romance',
 'Tamil',
 'Dravidian',
 'Egyptian Arabic',
 'Afroasiatic',
 'Semitic',
 'Standard German',
 'Indo-European',
 'Germanic',
 'Urdu',
 'Indo-European',
 'Indo-Aryan',
 'Javanese',
 'Austronesian',
 'Malayo-Polynesian',
 'Western Punjabi',
 'Indo-European',
 'Indo-Aryan',
 'Italian',
 'Indo-European',
 'Romance',
 'Gujarati',
 'Indo-European',
 'I

In [17]:
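# Most rows contain three links (language, family, branch), so take every third entry.
# Rows with fewer links (e.g. Telugu, Korean) shift the stride, so only the first entries are reliable.
languages = pre_list[::3]
languages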

['Mandarin Chinese',
 'Spanish',
 'English',
 'Hindi',
 'Portuguese',
 'Bengali',
 'Russian',
 'Japanese',
 'Yue Chinese',
 'Vietnamese',
 'Turkish',
 'Wu Chinese',
 'Marathi',
 'Telugu',
 'Koreanic',
 'Romance',
 'Egyptian Arabic',
 'Standard German',
 'Urdu',
 'Javanese',
 'Western Punjabi',
 'Italian',
 'Gujarati',
 'Iranian Persian',
 'Bhojpuri',
 'Hausa',
 'Mandarin Chinese',
 'Arabic',
 'Portuguese',
 'Western Punjabi',
 'Countries by the number of recognized official languages',
 'Dutch/Afrikaans',
 'German',
 'Malay',
 'Romanian',
 'Tamil',
 'Exonyms',
 'D–I',
 'Cambodia',
 'Iceland',
 'Japan',
 'Myanmar',
 'Vietnam',
 'Americas',
 'South America',
 'South Asia',
 'Oceania',
 'Polynesia',
 'Countries by the number of recognized official languages',
 'Europe',
 'language family',
 'List of Austronesian languages',
 'List of Tungusic languages',
 'geopolitical',
 'Commonwealth of Nations (English)',
 'Three Linguistic Spaces',
 'Países Africanos de Língua Oficial Portuguesa (Port

In [18]:
language_top10 = languages[:11]
language_top10

['Mandarin Chinese',
 'Spanish',
 'English',
 'Hindi',
 'Portuguese',
 'Bengali',
 'Russian',
 'Japanese',
 'Yue Chinese',
 'Vietnamese',
 'Turkish']
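
Slicing every third link relies on each row containing exactly three links. The `languages` output above already drifts further down ('Koreanic' and 'Romance' show up as languages) because some rows have fewer links. One possible alternative, sketched here on a small hypothetical HTML sample rather than the live page, is to walk the table row by row. `SAMPLE_HTML` and `rows_from_table` are illustrative names; the sample values are copied from the outputs in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the Wikipedia table (not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Language</th><th>Native speakers</th><th>Family</th><th>Branch</th></tr>
  <tr><td><a href="#">Telugu</a></td><td>83.0</td><td><a href="#">Dravidian</a></td><td>South-Central</td></tr>
  <tr><td><a href="#">Tamil</a></td><td>78.6</td><td><a href="#">Dravidian</a></td><td>South</td></tr>
  <tr><td><a href="#">French</a></td><td>80.8</td><td><a href="#">Indo-European</a></td><td><a href="#">Romance</a></td></tr>
</table>
"""


def rows_from_table(html):
    """Return (language, speakers) pairs, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:  # the header row has only <th> cells and is skipped
            pairs.append((cells[0], cells[1]))
    return pairs


print(rows_from_table(SAMPLE_HTML))
```

Because each pair is read from a single row, a row with fewer links no longer shifts the later entries.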

In [19]:
soup.select("table tr td")

[<td><a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>
 </td>,
 <td>939
 </td>,
 <td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
 </td>,
 <td><a href="/wiki/Sinitic_languages" title="Sinitic languages">Sinitic</a>
 </td>,
 <td><a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>
 </td>,
 <td>485
 </td>,
 <td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
 </td>,
 <td><a href="/wiki/Romance_languages" title="Romance languages">Romance</a>
 </td>,
 <td><a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>
 </td>,
 <td>380
 </td>,
 <td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
 </td>,
 <td><a href="/wiki/Germanic_languages" title="Germanic languages">Germanic</a>
 </td>,
 <td><a class="mw-redirect" href="/wiki/ISO_639:hin" title="ISO 639:hin">Hindi</a>
 </td>,
 <td>3

In [20]:
soup.select("table tr td")[1]

<td>939
</td>

In [21]:
pre_list2= []

for i in soup.select("table tr td"):
    pre_list2.append(i.get_text().strip())

In [22]:
pre_list2

['Mandarin Chinese',
 '939',
 'Sino-Tibetan',
 'Sinitic',
 'Spanish',
 '485',
 'Indo-European',
 'Romance',
 'English',
 '380',
 'Indo-European',
 'Germanic',
 'Hindi',
 '345',
 'Indo-European',
 'Indo-Aryan',
 'Portuguese',
 '236',
 'Indo-European',
 'Romance',
 'Bengali',
 '234',
 'Indo-European',
 'Indo-Aryan',
 'Russian',
 '147',
 'Indo-European',
 'Balto-Slavic',
 'Japanese',
 '123',
 'Japonic',
 'Japanese',
 'Yue Chinese',
 '86.1',
 'Sino-Tibetan',
 'Sinitic',
 'Vietnamese',
 '85.0',
 'Austroasiatic',
 'Vietic',
 'Turkish',
 '84.0',
 'Turkic',
 'Oghuz',
 'Wu Chinese',
 '83.4',
 'Sino-Tibetan',
 'Sinitic',
 'Marathi',
 '83.2',
 'Indo-European',
 'Indo-Aryan',
 'Telugu',
 '83.0',
 'Dravidian',
 'South-Central',
 'Korean',
 '81.7',
 'Koreanic',
 '—',
 'French',
 '80.8',
 'Indo-European',
 'Romance',
 'Tamil',
 '78.6',
 'Dravidian',
 'South',
 'Egyptian Arabic',
 '77.4',
 'Afroasiatic',
 'Semitic',
 'Standard German',
 '75.3',
 'Indo-European',
 'Germanic',
 'Urdu',
 '70.6',
 'Indo-E

In [23]:
pre_list2[1]

'939'
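
The scraped value is a string ('939'), so sorting or arithmetic on the speaker counts would need a numeric conversion first. A minimal sketch, assuming the cells contain plain decimal numbers; `to_number` is a hypothetical helper, and the 'n/a' entry only illustrates a cell that cannot be converted.

```python
def to_number(text):
    """Convert a scraped cell such as '86.1' to float; return None if it is not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


speakers_raw = ["939", "485", "86.1", "n/a"]
print([to_number(value) for value in speakers_raw])
```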

In [24]:
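# Each table row has four cells (language, speakers, family, branch); the speaker count is the second one
speakers = pre_list2[1::4]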

In [25]:
speakers_top10 = speakers[:11]

In [26]:
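# language_top10 and speakers_top10 each hold 11 entries (sliced with [:11]),
# so trim both to the first 10 rows here
top_10_languages = pd.DataFrame({"language": language_top10[:10],
                                 "nb_of_native_speakers": speakers_top10[:10]
                                })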

In [27]:
top_10_languages

| | language | nb_of_native_speakers |
|---|---|---|
| 0 | Mandarin Chinese | 939.0 |
| 1 | Spanish | 485.0 |
| 2 | English | 380.0 |
| 3 | Hindi | 345.0 |
| 4 | Portuguese | 236.0 |
| 5 | Bengali | 234.0 |
| 6 | Russian | 147.0 |
| 7 | Japanese | 123.0 |
| 8 | Yue Chinese | 86.1 |
| 9 | Vietnamese | 85.0 |


### 4. Display information on the 20 latest earthquakes reported by the EMSC (date, time, latitude, longitude and region name) as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'

In [None]:
url = "https://www.emsc-csem.org/Earthquake_information/#1"

In [None]:
response = requests.get(url)
response.status_code 

In [None]:
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
# soup

In [None]:
# Looking for the table cells
soup.select("table tr td")

### 5. List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'

In [None]:
url = "https://www.wikipedia.org/"
response = requests.get(url)
response.status_code 

In [None]:
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
soup

In [None]:
soup.select("div ul li a")

In [None]:
# List with languages 

languages = []

for i in soup.select("div ul li a"):
    # keep only anchors whose markup mentions "lang" (the language links)
    if "lang" in str(i):
        languages.append(i.get_text())

languages

In [None]:
list_href= []

for i in soup.select("div ul li a"):
    link = i.get("href")
    # keep only hrefs that point to a Wikipedia language edition
    if link is not None and "wikipedia.org" in link:
        list_href.append(link)

list_href

In [None]:
# Request from multiple pages 

pages = []

for i in list_href[:3]:
    # assemble the url (assumes the hrefs are protocol-relative, e.g. //en.wikipedia.org/):
    url = "https:" + str(i)

    # download html with a get request:
    response = requests.get(url)  # no Accept-Language header is sent, so the server picks the language
    #response = requests.get(url, headers = {"Accept-Language": "en-US"})   # ask for the English version

    # monitor the process by printing the status code
    print("Status code: " + str(response.status_code))

    # store response into "pages" list
    pages.append(response)

    # respectful nap:
    wait_time = randint(1,4000)
    print("I will sleep for " + str(wait_time/1000) + " seconds.")
    sleep(wait_time/1000)
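
The request-then-nap loop above can also be wrapped in a small reusable function. This is only a sketch under stated assumptions: `fetch_politely` is a hypothetical name, the 1 to 4 second default range is a choice made here (the loop above waits between 0.001 and 4 seconds), and `fetch` is any callable that takes a URL, such as `requests.get`.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=4.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        results.append(fetch(url))
        if index < len(urls) - 1:  # no need to wait after the last request
            sleep(random.uniform(min_delay, max_delay))
    return results
```

For the cell above, a call could look like `pages = fetch_politely(["https:" + href for href in list_href[:3]], requests.get)`.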

In [None]:
print(BeautifulSoup(pages[0].content, "html.parser").prettify())

### 6. Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'

In [None]:
url = "https://www.data.gov.uk/"
response = requests.get(url)
response.status_code 

In [None]:
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
# soup

In [None]:
soup.select("div ul li h3")

In [None]:
datasets = []

for i in soup.select("div ul li h3"):
    datasets.append(i.get_text())
    
datasets