# Geographic Data
***
Two objectives:
1. Get shapefile data for countries to build a choropleth map
2. Get a mapping from each country to its continent (to add to the population table)

In [2]:
import geopandas as gpd

import requests
import zipfile
import os
import io

# allow web access for downloading: https://stackoverflow.com/a/60671292
# note: this disables SSL certificate verification for urllib-based downloads
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

from src.data.quick_queries import queryDB
qdb = queryDB('sqlite','../../data/processed/covid_db.sqlite')
%load_ext sql

%load_ext autoreload
%autoreload 2

sqlite:///../../data/processed/covid_db.sqlite


## 1. Gather Data
***
Relevant shapefiles for country data can be found on www.naturalearthdata.com. The data is available at three levels of detail (10m, 50m and 110m); we chose the coarsest one (110m).

As a result, some small countries (e.g. Singapore) are missing; however, the more detailed files proved too big to display on interactive choropleth maps.

A potential improvement would be to identify the countries missing from the 110m shapefile and add only those from the more precise (50m or 10m) shapefiles, leaving the bigger countries coarse.
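The improvement described above could be sketched roughly as follows. To stay library-free, this assumes the country records are available as GeoJSON-style feature dicts with the name under `properties["ADMIN"]`; the function name `fill_missing_features` and its parameters are hypothetical. With GeoDataFrames the same idea would be a row filter on the finer file followed by a `pandas.concat`.

```python
def fill_missing_features(coarse, fine, wanted_names, key="ADMIN"):
    """Return the coarse features plus fine features for wanted names the coarse set lacks."""
    # names already covered by the coarse (110m) data
    present = {feature["properties"][key] for feature in coarse}
    # names we need for the map but cannot find in the coarse data
    missing = set(wanted_names) - present
    # take only those countries from the finer (50m or 10m) data
    extra = [feature for feature in fine if feature["properties"][key] in missing]
    return coarse + extra


# toy data: geometry is left empty because only the name matching matters here
coarse = [{"properties": {"ADMIN": "Fiji"}, "geometry": None}]
fine = [
    {"properties": {"ADMIN": "Fiji"}, "geometry": None},
    {"properties": {"ADMIN": "Singapore"}, "geometry": None},
]
combined = fill_missing_features(coarse, fine, ["Fiji", "Singapore"])
```

Large countries keep their coarse geometry, so the combined file should grow only by the handful of small countries that were added.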

#### 1.1 Helper function to download & extract zipfiles

In [7]:
def extractZipfile(path, url):
    """
    Download the zipfile at url and extract its .shp file(s) into path
    """
    # download the archive into memory; fail early on HTTP errors
    r = requests.get(url)
    r.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(r.content)) as myzip:
        # extract files one-by-one
        for name in myzip.namelist():
            # skip the __MACOSX metadata folder
            if name.startswith('__MACOSX'):
                continue
            # only keep the shapefile itself
            if name.endswith('.shp'):
                myzip.extract(name, path)
                print(name)
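
A shapefile is really a set of files: the `.shp` geometry, the `.shx` index, and the `.dbf` attribute table (where columns such as `ADMIN` and `CONTINENT` live), usually with a `.prj` projection file. The helper above keeps only the `.shp`, so reading attribute columns presumably relies on the sidecar files already being present in `path`. A minimal sketch of a variant that keeps them, assuming these four suffixes are sufficient (the function name and the `SHAPEFILE_SUFFIXES` constant are hypothetical):

```python
import io
import zipfile
from pathlib import Path

# assumption: these sidecar files are enough for the use in this notebook
SHAPEFILE_SUFFIXES = {".shp", ".shx", ".dbf", ".prj"}


def extract_shapefile_members(zip_bytes, path, suffixes=SHAPEFILE_SUFFIXES):
    """Extract the shapefile and its sidecar files from an in-memory zip archive."""
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as myzip:
        for name in myzip.namelist():
            # skip the __MACOSX metadata folder
            if name.startswith("__MACOSX"):
                continue
            # keep only members whose extension belongs to the shapefile set
            if Path(name).suffix.lower() in suffixes:
                myzip.extract(name, path)
                extracted.append(name)
    return extracted
```

It takes the raw archive bytes rather than a URL, so it could be called as `extract_shapefile_members(r.content, path)` after the download step.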

#### 1.2 Download shapefiles

In [8]:
# extract zipfiles
#url_10 = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip'
#url_50 = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/cultural/ne_50m_admin_0_countries.zip'
url_110 = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip'
path = '../../data/raw'

#for url in [url_10, url_50, url_110]:
#    extractZipfile(path, url)
extractZipfile(path, url_110)

ne_110m_admin_0_countries.shp


#### 1.3 Load Geopandas dataframe

In [9]:
countries = gpd.read_file(path + '/' + 'ne_110m_admin_0_countries.shp')[['ADMIN','CONTINENT','geometry']]
print(countries.shape)
countries.head()

(177, 3)


| | ADMIN | CONTINENT | geometry |
|---|---|---|---|
| 0 | Fiji | Oceania | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| 1 | United Republic of Tanzania | Africa | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
| 2 | Western Sahara | Africa | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
| 3 | Canada | North America | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
| 4 | United States of America | North America | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |


## 2. Assess Data
***
The goal is to merge this data with the existing Covid data, again using the country name as the merge key.

#### 2.1 Compare with countries in `stats` table

In [22]:
# get the countries in our main stats table
query = """
    SELECT DISTINCT country
      FROM stats"""

df = qdb.output_query(query)
df.head(2)

| | country |
|---|---|
| 0 | Afghanistan |
| 1 | Albania |


In [21]:
# full join to get comparison
merged = countries.merge(df, left_on = 'ADMIN', right_on = 'country', how = 'outer')
merged.head(2)

| | ADMIN | CONTINENT | geometry | country |
|---|---|---|---|---|
| 0 | Fiji | Oceania | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... | Fiji |
| 1 | United Republic of Tanzania | Africa | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... | |


In [24]:
# not in our shapefile
merged[merged['ADMIN'].isnull()]

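| | ADMIN | CONTINENT | geometry | country |
|---|---|---|---|---|
| 177 | | | | Andorra |
| 178 | | | | Antigua and Barbuda |
| 179 | | | | Bahamas |
| 180 | | | | Bahrain |
| 181 | | | | Barbados |
| 182 | | | | Cabo Verde |
| 183 | | | | Comoros |
| 184 | | | | Congo |
| 185 | | | | Czech Republic |
| 186 | | | | DR Congo |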


In [25]:
# not in our stats table
merged[merged['country'].isnull()]

| | ADMIN | CONTINENT | geometry | country |
|---|---|---|---|---|
| 1 | United Republic of Tanzania | Africa | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... | |
| 4 | United States of America | North America | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... | |
| 11 | Democratic Republic of the Congo | Africa | POLYGON ((29.34000 -4.49998, 29.51999 -5.41998... | |
| 19 | The Bahamas | North America | MULTIPOLYGON (((-78.98000 26.79000, -78.51000 ... | |
| 20 | Falkland Islands | South America | POLYGON ((-61.20000 -51.85000, -60.00000 -51.2... | |
| 22 | Greenland | North America | POLYGON ((-46.76379 82.62796, -43.40644 83.225... | |
| 23 | French Southern and Antarctic Lands | Seven seas (open ocean) | POLYGON ((68.93500 -48.62500, 69.58000 -48.940... | |
| 24 | East Timor | Asia | POLYGON ((124.96868 -8.89279, 125.08625 -8.656... | |
| 45 | Puerto Rico | North America | POLYGON ((-66.28243 18.51476, -65.77130 18.426... | |
| 67 | Republic of the Congo | Africa | POLYGON ((18.45307 3.50439, 18.39379 2.90044, ... | |


#### 2.2 Learnings
***
Issues:
* countries named differently in the two sources (e.g. `The Bahamas` vs. `Bahamas`)
* countries missing from the 110m shapefile
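
To tell the two issues apart, one option is to compare the two name sets directly and let a fuzzy matcher propose rename candidates. The sketch below uses `difflib` from the standard library; it is only a heuristic and not something the notebook itself does, and the function name `compare_names` is hypothetical. Names without a suggestion are either truly missing or spelled too differently (such as `Republic of the Congo` vs. `Congo`) and need a manual check.

```python
from difflib import get_close_matches


def compare_names(shape_names, stats_names, cutoff=0.6):
    """Return names unique to each source plus fuzzy rename suggestions."""
    shape_only = sorted(set(shape_names) - set(stats_names))
    stats_only = sorted(set(stats_names) - set(shape_names))
    # for every unmatched shapefile name, look for similar unmatched stats names
    suggestions = {
        name: get_close_matches(name, stats_only, n=3, cutoff=cutoff)
        for name in shape_only
    }
    return shape_only, stats_only, suggestions


# toy input taken from the names shown in the outputs above
shape_only, stats_only, suggestions = compare_names(
    ["Fiji", "The Bahamas", "Greenland"],
    ["Fiji", "Bahamas", "Andorra"],
)
```

With real data, the two inputs would be `countries['ADMIN']` and `df['country']`.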

## 3. Clean Data
***

In [12]:
# translate countries & continents
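
One way this translation step might look is a plain lookup table from shapefile names to the names used in the `stats` table. The pairings below are inferred from the two mismatch listings in section 2.1 and are assumptions to verify, in particular that `Congo` in `stats` refers to the Republic of the Congo; `NAME_MAP` and `translate` are hypothetical names.

```python
# shapefile name -> name used in the `stats` table (assumed pairings, to be verified)
NAME_MAP = {
    "The Bahamas": "Bahamas",
    "Democratic Republic of the Congo": "DR Congo",
    "Republic of the Congo": "Congo",
}


def translate(names, mapping=NAME_MAP):
    """Replace every name found in the mapping and leave all others untouched."""
    return [mapping.get(name, name) for name in names]


translated = translate(["Fiji", "The Bahamas", "Republic of the Congo"])
```

On the GeoDataFrame itself, `countries['ADMIN'].replace(NAME_MAP)` should give the same result for the name column.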

## 4. Store Data
***

In [13]:
# continent in populations table

In [14]:
# geojson the rest!?