# Practice web scraping

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

**1. Retrieve the [Wikipedia page for "Python"](https://en.wikipedia.org/wiki/Python) and create a list of the links on that page**

In [2]:
# Download html
wiki_python_response = requests.get('https://en.wikipedia.org/wiki/Python')

# Parse html (create the soup)
wiki_python_soup = BeautifulSoup(wiki_python_response.content, 'html.parser')

In [3]:
links = [
    'https://en.wikipedia.org' + link['href']
    for link in wiki_python_soup.select('#mw-content-text ul [href^="/wiki"]')
]
links[:5]

['https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(genus)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/CMU_Common_Lisp',
 'https://en.wikipedia.org/wiki/PERQ#PERQ_3']
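
The selector used above combines a descendant match (`#mw-content-text ul`) with an attribute-prefix match (`[href^="/wiki"]`). Here is a minimal sketch of how it filters links, run on a small made-up HTML fragment rather than the real page. It builds absolute URLs with `urllib.parse.urljoin` instead of string concatenation, which is a substitution on our part and not what the cell above does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page.
sample_html = """
<div id="mw-content-text">
  <ul>
    <li><a href="/wiki/Pythonidae">Pythonidae</a></li>
    <li><a href="https://example.org/external">External site</a></li>
    <li><a href="/wiki/Python_(genus)">Python (genus)</a></li>
  </ul>
  <p><a href="/wiki/Not_in_a_list">Outside any list</a></p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# [href^="/wiki"] keeps only elements whose href starts with "/wiki";
# the "ul" part of the selector drops the link that is not inside a list.
sample_links = [
    urljoin('https://en.wikipedia.org', tag['href'])
    for tag in sample_soup.select('#mw-content-text ul [href^="/wiki"]')
]
print(sample_links)
```

The external link and the link outside the list are both excluded, so only the two `/wiki/...` list entries remain.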

**2. Find the number of titles that have changed in the [United States Code](http://uscode.house.gov/download/download.shtml) since its last release point**

From the website: "Titles in **bold** have been changed since the last release point."

In [4]:
# Download html
us_code_response = requests.get('http://uscode.house.gov/download/download.shtml')

# Parse html (create the soup)
us_code_soup = BeautifulSoup(us_code_response.content, 'html.parser')

In [5]:
len(us_code_soup.select('.usctitlechanged'))

2
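
As a rough illustration of why counting `.usctitlechanged` works, the sketch below runs the same class selector on a made-up fragment. Only the `usctitlechanged` class comes from the cell above; the other class names and the titles are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: changed titles carry the "usctitlechanged" class.
sample_html = """
<div class="uscitemlist">
  <div class="usctitle">Title 1</div>
  <div class="usctitlechanged">Title 2</div>
  <div class="usctitle">Title 3</div>
  <div class="usctitlechanged">Title 4</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A class selector matches whole class names, so ".usctitlechanged"
# does not pick up elements whose class is just "usctitle".
changed_titles = sample_soup.select('.usctitlechanged')
changed_count = len(changed_titles)
print(changed_count)
```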

**3. Create a Python list with the names on the [FBI's top ten Most Wanted](https://www.fbi.gov/wanted/topten) list**

In [6]:
# Download html
most_wanted_response = requests.get('https://www.fbi.gov/wanted/topten')

# Parse html (create the soup)
most_wanted_soup = BeautifulSoup(most_wanted_response.content, 'html.parser')

In [7]:
most_wanted = [wanted.get_text().strip() for wanted in most_wanted_soup.select('.portal-type-person .title')]
most_wanted

['RAFAEL CARO-QUINTERO',
 'YULAN ADONAY ARCHAGA CARIAS',
 'EUGENE PALMER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'OCTAVIANO JUAREZ-CORRO']

**4. Display information on the 20 latest earthquakes (date, time, latitude, longitude, and region name) reported by the [EMSC](https://www.emsc-csem.org/Earthquake/) as a pandas DataFrame**

In [8]:
# Download html
earthquake_response = requests.get('https://www.emsc-csem.org/Earthquake/')

# Parse html (create the soup)
earthquake_soup = BeautifulSoup(earthquake_response.content, 'html.parser')

**Date & time**

In [9]:
# Select the date & time links once, then split each text into [date, time]
date_time = [element.get_text().split() for element in earthquake_soup.select('.tabev6 a')[:20]]
date = [parts[0] for parts in date_time]
time = [parts[1] for parts in date_time]

First we access the tag containing the date & time information:
```python
earthquake_soup.select('.tabev6 a')[0]
```
```
>><a href="/Earthquake/earthquake.php?id=1065652">2021-11-23   09:18:39.9</a>
```

Then we `.get_text()` and `.split()` the string to retrieve the date (index `[0]`) and time (index `[1]`).
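
The two steps can be tried out on a small fragment without hitting the website. This is only a sketch: the first row reuses the tag shown above, while the second row (its `id` and time) is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment with two date & time cells.
sample_html = """
<table>
  <tr><td class="tabev6"><a href="/Earthquake/earthquake.php?id=1065652">2021-11-23   09:18:39.9</a></td></tr>
  <tr><td class="tabev6"><a href="/Earthquake/earthquake.php?id=1065651">2021-11-23   09:02:11.0</a></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .split() with no arguments splits on any run of whitespace,
# so the repeated spaces between date and time are not a problem.
sample_date_time = [a.get_text().split() for a in sample_soup.select('.tabev6 a')]
sample_date = [parts[0] for parts in sample_date_time]
sample_time = [parts[1] for parts in sample_date_time]
print(sample_date)
print(sample_time)
```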

**Latitude & longitude**

In [10]:
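# Coordinate values, alternating latitude and longitude
lat_long = earthquake_soup.select('tbody tr .tabev1')
# Cardinal points: only the first two .tabev2 cells of each row (the third is the magnitude)
directions = earthquake_soup.select('tbody tr :nth-child(-n+2 of .tabev2)')

# 20 earthquakes -> 40 cells: even indices are latitudes, odd indices are longitudes
latitude = [lat_long[i].get_text(strip=True) + ' ' + directions[i].get_text(strip=True) for i in range(0, 40, 2)]
longitude = [lat_long[i].get_text(strip=True) + ' ' + directions[i].get_text(strip=True) for i in range(1, 41, 2)]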

First we access the tag containing the desired information:
```python
earthquake_soup.select('tbody tr .tabev1')[:4]
```
```
>> [<td class="tabev1">28.55 </td>, - lat
    <td class="tabev1">17.86 </td>, - long
    <td class="tabev1">37.66 </td>, - lat
    <td class="tabev1">22.87 </td>] - long 
```
Each two adjacent cells form a (latitude, longitude) pair. The cardinal points are located in another tag:
```python
earthquake_soup.select('tbody tr .tabev2')[:3]
```
```
>> [<td class="tabev2">N  </td>, - lat
    <td class="tabev2">W  </td>, - long
    <td class="tabev2">2.7</td>] - magnitude
```
That selection also returns the magnitude, so we need only the first two matches in each row. The `:nth-child()` selector with an `of` clause does this:
```python
earthquake_soup.select('tbody tr :nth-child(-n+2 of .tabev2)')[:4]
```
```
>> [<td class="tabev2">N  </td>,
    <td class="tabev2">W  </td>,
    <td class="tabev2">N  </td>,
    <td class="tabev2">E  </td>]
```
Then we can concatenate the coordinates and cardinal points to return the latitude & longitude with a list comprehension.
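
Below is a minimal, self-contained sketch of the whole step on a simplified, hypothetical table. The real page has more columns and markup, and the magnitude in the second row is made up. It pairs values and cardinal points with `zip` and slices, a variation on the index-based comprehension used in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified rows: value, direction, value, direction, magnitude.
sample_html = """
<table><tbody>
  <tr>
    <td class="tabev1">28.55 </td><td class="tabev2">N  </td>
    <td class="tabev1">17.86 </td><td class="tabev2">W  </td>
    <td class="tabev2">2.7</td>
  </tr>
  <tr>
    <td class="tabev1">37.66 </td><td class="tabev2">N  </td>
    <td class="tabev1">22.87 </td><td class="tabev2">E  </td>
    <td class="tabev2">3.1</td>
  </tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

values = [td.get_text(strip=True) for td in sample_soup.select('tbody tr .tabev1')]
# Keep only the first two .tabev2 cells per row, which skips the magnitude.
points = [
    td.get_text(strip=True)
    for td in sample_soup.select('tbody tr :nth-child(-n+2 of .tabev2)')
]

# Join each value with its cardinal point, then split the alternating list.
coordinates = [f'{value} {point}' for value, point in zip(values, points)]
sample_latitude = coordinates[0::2]   # even positions
sample_longitude = coordinates[1::2]  # odd positions
print(sample_latitude)
print(sample_longitude)
```

Slicing with a step of 2 avoids hard-coding the number of cells, so the same lines work for any number of rows.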

**Region**

In [11]:
region = [region.get_text(strip=True) for region in earthquake_soup.select('.tb_region')[:20]]

In [12]:
earthquake_df = pd.DataFrame({
    'Date': date,
    'Time': time,
    'Latitude': latitude,
    'Longitude': longitude,
    'Region': region,
})
earthquake_df

| | Date | Time | Latitude | Longitude | Region |
|---|---|---|---|---|---|
| 0 | 2021-11-23 | 17:50:09.1 | 33.63 S | 177.52 E | NORTH OF NEW ZEALAND |
| 1 | 2021-11-23 | 17:49:29.5 | 35.51 N | 22.89 E | CENTRAL MEDITERRANEAN SEA |
| 2 | 2021-11-23 | 17:48:32.0 | 11.13 N | 86.79 W | NEAR COAST OF NICARAGUA |
| 3 | 2021-11-23 | 17:45:15.7 | 28.53 N | 17.84 W | CANARY ISLANDS, SPAIN REGION |
| 4 | 2021-11-23 | 17:15:06.3 | 28.53 N | 17.82 W | CANARY ISLANDS, SPAIN REGION |
| 5 | 2021-11-23 | 17:11:10.0 | 21.53 S | 68.51 W | ANTOFAGASTA, CHILE |
| 6 | 2021-11-23 | 17:09:16.2 | 28.55 N | 17.86 W | CANARY ISLANDS, SPAIN REGION |
| 7 | 2021-11-23 | 17:03:48.1 | 37.67 N | 22.88 E | SOUTHERN GREECE |
| 8 | 2021-11-23 | 16:57:55.0 | 0.92 S | 125.36 E | MOLUCCA SEA |
| 9 | 2021-11-23 | 16:41:24.3 | 19.22 N | 155.43 W | ISLAND OF HAWAII, HAWAII |


**5. List all language names and the number of articles for each, in the order they appear on [wikipedia.org](https://www.wikipedia.org/)**

In [13]:
# Download html
wiki_response = requests.get('https://www.wikipedia.org/')

# Parse html (create the soup)
wiki_soup = BeautifulSoup(wiki_response.content, 'html.parser')

In [14]:
languages = [language.get_text() for language in wiki_soup.select('.central-featured-lang strong')]
num_articles = [','.join(num.string.split()) for num in wiki_soup.select('.central-featured-lang bdi')]

articles_language = list(zip(languages, num_articles))
articles_language

[('English', '6,383,000+'),
 ('日本語', '1,292,000+'),
 ('Русский', '1,756,000+'),
 ('Deutsch', '2,617,000+'),
 ('Español', '1,717,000+'),
 ('Français', '2,362,000+'),
 ('中文', '1,231,000+'),
 ('Italiano', '1,718,000+'),
 ('Português', '1,074,000+'),
 ('Polski', '1,490,000+')]

**Number of articles**

Both `.string` and `.get_text()` return the number as a string that uses non-breaking spaces (`\xa0`) as thousands separators:
```python
wiki_soup.select('.central-featured-lang bdi')[0].string
```
```
> '6\xa0383\xa0000+'
```

To fix that, we can `.split()` the string, which splits on any whitespace (including non-breaking spaces), and `.join()` the pieces with a comma as the separator.
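
This clean-up needs no scraping at all, so it can be checked on the raw string shown above. The conversion to an integer at the end is an optional extra we add here, not something the notebook does.

```python
# Raw text as returned by .string: non-breaking spaces between digit groups.
raw_count = '6\xa0383\xa0000+'

# split() with no arguments treats '\xa0' as whitespace.
parts = raw_count.split()
formatted_count = ','.join(parts)
print(formatted_count)

# Optional extra: drop the separators and the trailing "+" to get a number.
numeric_count = int(''.join(parts).rstrip('+'))
print(numeric_count)
```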

In [15]:
# Or as a DataFrame

articles_language_df = pd.DataFrame({
    'Language': languages,
    'Number of articles': num_articles
})
articles_language_df

| | Language | Number of articles |
|---|---|---|
| 0 | English | 6,383,000+ |
| 1 | 日本語 | 1,292,000+ |
| 2 | Русский | 1,756,000+ |
| 3 | Deutsch | 2,617,000+ |
| 4 | Español | 1,717,000+ |
| 5 | Français | 2,362,000+ |
| 6 | 中文 | 1,231,000+ |
| 7 | Italiano | 1,718,000+ |
| 8 | Português | 1,074,000+ |
| 9 | Polski | 1,490,000+ |


**6. List the dataset categories shown on [data.gov.uk](https://data.gov.uk/)**

In [16]:
# Download html
uk_dataset_response = requests.get('https://data.gov.uk/')

# Parse html (create the soup)
uk_dataset_soup = BeautifulSoup(uk_dataset_response.content, 'html.parser')

In [17]:
uk_datasets = [dataset.get_text() for dataset in uk_dataset_soup.select('.govuk-heading-s a')]
uk_datasets

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

**7. Display the [top 10 languages by number of native speakers](https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers) stored in a pandas dataframe**

In [18]:
# Download html
top_languages_response = requests.get('https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers')

# Parse html (create the soup)
top_languages_soup = BeautifulSoup(top_languages_response.content, 'html.parser')

In [19]:
languages = [language.string for language in
                   top_languages_soup.select('td:nth-of-type(2) [title]:not([title="Hindustani language"])')[:10]]
speakers = [number.string.strip() for number in top_languages_soup.select('td:nth-of-type(3)')[:10]]
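
The selectors in this cell rely on `:nth-of-type()` to pick a column and on `:not()` to skip an unwanted link. The sketch below shows the idea on two made-up rows. The markup is an assumption about how a cell with two links might look, not a copy of the real table, and the conversion to `float` is an optional extra.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified rows: rank, language, speakers (millions).
sample_html = """
<table>
  <tr>
    <td>1</td>
    <td><a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a></td>
    <td> 918 </td>
  </tr>
  <tr>
    <td>4</td>
    <td><a href="/wiki/Hindi" title="Hindi">Hindi</a>
        (<a href="/wiki/Hindustani_language" title="Hindustani language">Hindustani</a>)</td>
    <td> 341 </td>
  </tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Second column: any element with a title attribute, except the Hindustani link.
sample_languages = [
    tag.string
    for tag in sample_soup.select(
        'td:nth-of-type(2) [title]:not([title="Hindustani language"])'
    )
]
# Third column: strip surrounding whitespace, then convert to a number.
sample_speakers = [
    float(td.string.strip()) for td in sample_soup.select('td:nth-of-type(3)')
]
print(sample_languages)
print(sample_speakers)
```

Without the `:not()` filter the second row would contribute two names, and the language and speaker lists would no longer line up.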

In [20]:
top10_languages = pd.DataFrame({
    'Language': languages,
    'Speakers (millions)': speakers
}, index=range(1,11))
top10_languages

| | Language | Speakers (millions) |
|---|---|---|
| 1 | Mandarin Chinese | 918.0 |
| 2 | Spanish | 480.0 |
| 3 | English | 379.0 |
| 4 | Hindi | 341.0 |
| 5 | Bengali | 300.0 |
| 6 | Portuguese | 221.0 |
| 7 | Russian | 154.0 |
| 8 | Japanese | 128.0 |
| 9 | Western Punjabi | 92.7 |
| 10 | Marathi | 83.1 |
