# Use the table of country names and links to download all the data

The [country list in Berkeley Earth](http://berkeleyearth.lbl.gov/country-list/) maps each country name to the form of the name used in URLs and file names. We'll mine this list for those names and use them to generate direct links to the average temperature time series files. The download is managed by [Pooch](https://www.fatiando.org/pooch/).

In [1]:
from pathlib import Path
import urllib.parse
from bs4 import BeautifulSoup
from tqdm import tqdm
import pooch

Set the Pooch logging level so that it doesn't print a message for every download.

In [2]:
logger = pooch.get_logger()
logger.setLevel("WARNING")

Create a folder for the raw data (if it doesn't exist).

In [3]:
output_dir = Path("../data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

Download the HTML table from the website.

In [4]:
html_path = pooch.retrieve(
    url="http://berkeleyearth.lbl.gov/country-list/",
    known_hash=None,
    path=output_dir,
    fname="country-list.html",
)

Load the HTML into [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) so we can parse the content and isolate the links from the table.

In [5]:
with open(html_path) as file_pointer:
    # Name the parser explicitly so BeautifulSoup doesn't have to guess one.
    html = BeautifulSoup(file_pointer, "html.parser")

Get a list of the country IDs (the name used in the Berkeley Earth URLs). To do this, find all table row definitions (`tr`) and skip the first since it defines the table header. Then, get the link (`a`) from the first element (`td`) of the row and keep only the last part of the URL.

In [6]:
names = sorted(tr.td.a["href"].split("/")[-1] for tr in html.table.find_all("tr")[1:])
names[:10]

['afghanistan',
 'albania',
 'algeria',
 'american-samoa',
 'andorra',
 'angola',
 'anguilla',
 'antarctica',
 'antigua-and-barbuda',
 'argentina']
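To make the extraction step easier to follow, here is a minimal sketch of the same idea on a toy table. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The table content, the `example.org` URLs, and the `FirstLinkPerRow` class are made up for illustration. Unlike the cell above, it keeps the first link found anywhere in a row rather than the link in the first `td`.

```python
from html.parser import HTMLParser

# Toy table standing in for the real page; the URLs are made up.
TOY_HTML = """
<table>
  <tr><th>Country</th><th>Stations</th></tr>
  <tr><td><a href="http://example.org/regions/albania">Albania</a></td><td>12</td></tr>
  <tr><td><a href="http://example.org/regions/afghanistan">Afghanistan</a></td><td>34</td></tr>
</table>
"""


class FirstLinkPerRow(HTMLParser):
    """Collect the href of the first link found in each table row."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._need_link = False  # True while the current row has no link yet

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._need_link = True
        elif tag == "a" and self._need_link:
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)
                self._need_link = False


parser = FirstLinkPerRow()
parser.feed(TOY_HTML)
# Keep only the last part of each URL, as in the cell above.
ids = sorted(href.split("/")[-1] for href in parser.hrefs)
print(ids)
```

The header row has no link, so it contributes nothing and doesn't need to be skipped explicitly here.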

The base URL is the same for all of them, so we only need the name of each country as it appears in the URL, which we got in the last step. The only trick now is quoting the special characters in the country names to make them safe for use in URLs. One detail: Berkeley Earth seems to use Latin1 encoding for everything instead of UTF-8.
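The choice of encoding changes the quoted result for any name with non-ASCII characters. The short sketch below shows the difference using `urllib.parse.quote`. The name `curaçao` is only an illustrative input and is not taken from the table above.

```python
import urllib.parse

# Illustrative name with a non-ASCII character (an assumption, not from the table).
name = "curaçao"

# Latin1 encodes "ç" as the single byte 0xE7.
latin1_quoted = urllib.parse.quote(name, encoding="latin1")
# UTF-8 encodes "ç" as the two bytes 0xC3 0xA7.
utf8_quoted = urllib.parse.quote(name, encoding="utf-8")

print(latin1_quoted)  # cura%E7ao
print(utf8_quoted)    # cura%C3%A7ao
```

If the server expects Latin1, the UTF-8 form would point to a different (most likely nonexistent) file name, which is why the encoding is passed explicitly.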

Some European countries have data both for their continental areas and for their extended territories. We only want the continental parts. To get them, we'll build a dictionary of URLs keyed by country and, for these special cases, replace the URL with the one for the `(europe)` entry.

In [7]:
base_url = "http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/{country}-TAVG-Trend.txt"

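url = {}
# "names" is sorted, so "denmark" comes before "denmark-(europe)". Dropping the
# suffix from the key makes the "(europe)" URL overwrite the plain one.
for country in names:
    key = country.replace("-(europe)", "")
    url[key] = base_url.format(country=urllib.parse.quote(country, encoding="latin1"))
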
for country in ["denmark", "france", "united-kingdom"]:
    print(url[country])

http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/denmark-%28europe%29-TAVG-Trend.txt
http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/france-%28europe%29-TAVG-Trend.txt
http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/united-kingdom-%28europe%29-TAVG-Trend.txt
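The replacement of the special cases relies on ordering: a plain name always sorts before the same name with a `-(europe)` suffix, so the later entry wins. Here is a minimal sketch of that logic on a hand-written list of names. The `build_urls` helper is hypothetical and the input list is a toy example, not the real table.

```python
import urllib.parse

BASE_URL = "http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/{country}-TAVG-Trend.txt"


def build_urls(names):
    """Map each country key to its URL, preferring the "-(europe)" variant."""
    urls = {}
    # Sorting guarantees "x" is visited before "x-(europe)", so the
    # European entry is assigned last and overwrites the plain one.
    for name in sorted(names):
        key = name.replace("-(europe)", "")
        quoted = urllib.parse.quote(name, encoding="latin1")
        urls[key] = BASE_URL.format(country=quoted)
    return urls


# Toy input, deliberately unsorted.
toy_urls = build_urls(["denmark-(europe)", "norway", "denmark"])
print(sorted(toy_urls))
print(toy_urls["denmark"])
```

The dictionary ends up with one key per country, and `denmark` points at the `denmark-%28europe%29` file, matching the output above.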


Now we can download the data files for all countries. 

In [8]:
%%time
for country in tqdm(url, ncols=100):
    pooch.retrieve(
        url[country],
        known_hash=None,
        path=output_dir,
        fname=f"{country}.txt",
    )

100%|█████████████████████████████████████████████████████████████| 233/233 [03:02<00:00,  1.27it/s]

CPU times: user 3.97 s, sys: 672 ms, total: 4.64 s
Wall time: 3min 2s
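After a long download loop, it can be worth checking that every country ended up with a file. The sketch below is one possible check, assuming one `<country>.txt` file per dictionary key as in the loop above. The `missing_files` helper is hypothetical, and the demonstration runs against a temporary folder with placeholder files rather than the real data.

```python
from pathlib import Path
import tempfile


def missing_files(countries, folder):
    """Return the sorted keys that have no matching '<key>.txt' file in folder."""
    folder = Path(folder)
    return sorted(c for c in countries if not (folder / f"{c}.txt").is_file())


# Demonstrate on a temporary folder holding two placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["albania", "angola"]:
        (Path(tmp) / f"{name}.txt").write_text("placeholder\n")
    result = missing_files(["albania", "angola", "aruba"], tmp)

print(result)
```

In this notebook, the equivalent call would be `missing_files(url, output_dir)`, which should return an empty list if every download succeeded.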





Now we have a folder full of files with the country data.

In [9]:
sorted(output_dir.glob("*.txt"))

[PosixPath('../data/raw/afghanistan.txt'),
 PosixPath('../data/raw/albania.txt'),
 PosixPath('../data/raw/algeria.txt'),
 PosixPath('../data/raw/american-samoa.txt'),
 PosixPath('../data/raw/andorra.txt'),
 PosixPath('../data/raw/angola.txt'),
 PosixPath('../data/raw/anguilla.txt'),
 PosixPath('../data/raw/antarctica.txt'),
 PosixPath('../data/raw/antigua-and-barbuda.txt'),
 PosixPath('../data/raw/argentina.txt'),
 PosixPath('../data/raw/armenia.txt'),
 PosixPath('../data/raw/aruba.txt'),
 PosixPath('../data/raw/australia.txt'),
 PosixPath('../data/raw/austria.txt'),
 PosixPath('../data/raw/azerbaijan.txt'),
 PosixPath('../data/raw/bahamas.txt'),
 PosixPath('../data/raw/bahrain.txt'),
 PosixPath('../data/raw/baker-island.txt'),
 PosixPath('../data/raw/bangladesh.txt'),
 PosixPath('../data/raw/barbados.txt'),
 PosixPath('../data/raw/belarus.txt'),
 PosixPath('../data/raw/belgium.txt'),
 PosixPath('../data/raw/belize.txt'),
 PosixPath('../data/raw/benin.txt'),
 PosixPath('../data/raw/bhu