# Prepare Sources

HMRC provides overseas trade statistics broken down by country and commodity code, using the 8-digit Combined Nomenclature ("CN8") codes.

These statistics have been obtained as a series of CSV files in "Tidy Data" form.

However, some preparation is necessary in order to process these files using the table2qb utility.

Firstly, fetch the source data, in this case from a shared (open) Google Drive folder.

We also keep track of the processing, and of the provenance of the inputs and outputs, using W3C PROV.

In [1]:
from datetime import datetime
import json
from pytz import timezone
from os import environ

provActivity = {
    '@id': environ.get('BUILD_URL', 'unknown-build') + "#prepare_sources",
    '@type': 'activity',
    'startedAtTime': datetime.now(timezone('Europe/London')).isoformat(),
    'label': 'Prepare sources',
    'comment': 'Jupyter Python notebook as part of Jenkins job %s' % environ.get('JOB_NAME', 'unknown-job')
}

In [2]:
import requests
from pathlib import Path

provSources = []

sourceFolder = Path('in')
sourceFolder.mkdir(exist_ok=True)

sources = [
    ('CN8_Non-EU_cod_2012.csv', '1P7YyFF6qXKXWVtR0Vt3kkvFPOjThMQH8'),
    ('CN8_Non-EU_cod_2013.csv', '1de-Le9ungrbdoGyvWI_RwmEhNpTmR-70'),
    ('CN8_Non-EU_cod_2014.csv', '1oC3jlItfsUshd54KOR7yn9NxpR83iCbC'),
    ('CN8_Non-EU_cod_2015.csv', '1H54-FYrCFa1DylCBg38RAPAeCtkGq4la'),
    ('CN8_Non-EU_cod_2016.csv', '11fLsnoiWzTcA1d3nSDWvyrKQEHwIf6Hz')
]

for filename, google_id in sources:
    sourceFile = sourceFolder / filename
    sourceUrl = f'https://drive.google.com/uc?export=download&id={google_id}'

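    # Only download files that are not already cached locally.
    if not sourceFile.is_file():
        response = requests.get(sourceUrl)
        # Fail loudly rather than saving an HTTP error body as a CSV file.
        response.raise_for_status()
        with open(sourceFile, 'wb') as f:
            f.write(response.content)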
    provSources.append({
        '@id': sourceUrl,
        '@type': 'entity',
        'label': filename,
        'wasUsedBy': provActivity['@id']
    })
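
Google Drive can answer a download request with an HTML page instead of the file itself (for example an interstitial or error page), in which case the cell above would save HTML under a `.csv` name. The following is a minimal sketch of a guard against that. The helper `looks_like_html` and both sample payloads are hypothetical and not part of the notebook.

```python
def looks_like_html(payload: bytes) -> bool:
    """Return True if the payload starts like an HTML page rather than CSV."""
    # Only inspect the beginning of the payload, ignoring leading whitespace.
    head = payload[:512].lstrip().lower()
    return head.startswith(b'<!doctype html') or head.startswith(b'<html')


# Hypothetical payloads for illustration only.
csv_payload = b',year,flow,comcode,country,svalue\n0,2012,e,1012100,Norway,1773490\n'
html_payload = b'<!DOCTYPE html><html><head><title>Sign in</title></head></html>'

print(looks_like_html(csv_payload))   # False
print(looks_like_html(html_payload))  # True
```

In the download loop, such a check could be applied to `response.content` before the file is written.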

In [3]:
import pandas
pandas.read_csv(sourceFolder / sources[0][0], dtype={'comcode': str}).head()

Unnamed: 0,year,flow,comcode,country,svalue
0,2012,e,1012100,Norway,1773490
1,2012,e,1012100,Switzerland,69378
2,2012,e,1012100,Turkey,406337
3,2012,e,1012100,Ukraine,49903
4,2012,e,1012100,Serbia,32550


The table2qb utility requires that the input CSV look like:

```
Year,Flow,Commodity,Foreign Country,Measure Type,Unit,Value
2012,Export,28399000,Singapore,GBP Total,£ million,35275
2012,Export,42050011,Ghana,GBP Total,£ million,1709
2012,Export,85049018,Israel,GBP Total,£ million,13205
2012,Import,73269060,Hong Kong,GBP Total,£ million,2414
```
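
To see how far a given file is from that layout, a header check along the following lines may help. This is only a sketch: the expected column list is copied from the example above and is not an authoritative table2qb specification, and `header_problems` is a hypothetical helper.

```python
import csv
import io

# Column names taken from the example layout above (an assumption, not a spec).
EXPECTED_COLUMNS = ['Year', 'Flow', 'Commodity', 'Foreign Country',
                    'Measure Type', 'Unit', 'Value']


def header_problems(csv_text):
    """Return (missing, unexpected) column names found in the header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Header as in the raw source files, and as in the target layout.
raw = ',year,flow,comcode,country,svalue\n0,2012,e,1012100,Norway,1773490\n'
good = ('Year,Flow,Commodity,Foreign Country,Measure Type,Unit,Value\n'
        '2012,Export,28399000,Singapore,GBP Total,£ million,35275\n')

print(header_problems(raw))   # every expected column is reported missing
print(header_problems(good))  # ([], [])
```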

In [4]:
destFolder = Path('out')
destFolder.mkdir(exist_ok=True, parents=True)

countries = set()
provOutputs = []

table = pandas.concat([pandas.read_csv(sourceFolder / filename, dtype={'comcode': str})
                       for filename, google_id in sources], ignore_index=True).rename(
    index = str,
    columns = {'year': 'Year', 'flow': 'Flow', 'comcode': 'Commodity',
               'country': 'Foreign Country', 'svalue': 'Value'})
table['Measure Type'] = 'GBP Total'
table['Unit'] = '£ million'
table['Flow'] = table['Flow'].map(lambda x: {'i': 'Import', 'e': 'Export'}[x])
table = table[['Year', 'Flow', 'Commodity', 'Foreign Country', 'Measure Type', 'Unit', 'Value']]
countries.update(table['Foreign Country'])
destFile = destFolder / 'CN8_Non-EU_cod_2012-2016.csv'
table.sample(n=100000, random_state=149).to_csv(destFile, index=False)
#table.to_csv(destFile, index=False)
provOutputs.append((destFile, 'CN8_Non-EU_cod-2012-2016 table'))
table.head()

AttributeError: 'NoneType' object has no attribute 'to_csv'

table2qb further requires that strings which will form part of URIs be formatted for use as (RDF) identifiers, and that the resulting one-to-one mapping (bijection) between labels and identifiers be output as another CSV file. In this case, we currently need a countries.csv file along the following lines:

```
Label,Notation,Parent Notation
Australia,australia,
Chile,chile,
Falkland Islands,falkland-islands,
French Polynesia,french-polynesia,
Ghana,ghana,
Hong Kong,hong-kong,
```
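
Because the mapping is meant to be a bijection, it may be worth checking that no two labels collapse to the same notation. The sketch below assumes the same lower-case-and-hyphenate rule that this notebook applies to country labels. The helper names and the deliberately colliding label `'hong kong'` are hypothetical.

```python
from collections import defaultdict


def to_notation(label):
    """Lower-case a label, spell out '&' and replace spaces with hyphens."""
    return label.lower().replace('&', 'and').replace(' ', '-')


def find_collisions(labels):
    """Map each notation shared by several distinct labels to those labels."""
    by_notation = defaultdict(set)
    for label in labels:
        by_notation[to_notation(label)].add(label)
    return {notation: sorted(group)
            for notation, group in by_notation.items() if len(group) > 1}


# 'hong kong' is a made-up duplicate to show what a collision looks like.
labels = ['Falkland Islands', 'French Polynesia', 'Hong Kong', 'hong kong']
print(to_notation('Falkland Islands'))  # falkland-islands
print(find_collisions(labels))          # {'hong-kong': ['Hong Kong', 'hong kong']}
```

An empty result means the label-to-notation mapping is one-to-one for the given labels.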

In [None]:
countriesTable = pandas.DataFrame(data={'Label': sorted(list(countries))})
countriesTable['Notation'] = countriesTable['Label'].map(lambda x: x.lower().replace('&', 'and').replace(' ', '-'))
countriesTable['Parent Notation'] = ''
countriesTable

In [None]:
countriesTable.to_csv(destFolder / 'countries.csv', index=False)
provOutputs.append((destFolder / 'countries.csv', 'countries table'))

Finally, output the PROV metadata as JSON-LD to the 'out' folder.

In [None]:
metadataDir = Path('metadata')
with open(metadataDir / 'prov_context.json') as contextFile:
    context = json.load(contextFile)

provActivity['endedAtTime'] = datetime.now(timezone('Europe/London')).isoformat()
prov = {
    '@context': context,
    '@graph': [ provActivity ] + provSources + [
        {
            '@id': environ.get('BUILD_URL', 'unknown-build') + 'artifact/' + str(filename),
            '@type': 'entity',
            'wasGeneratedBy': provActivity['@id'],
            'label': label
        } for (filename, label) in provOutputs
    ]
}

with open(destFolder / 'prov.jsonld', 'w') as provFile:
    json.dump(prov, provFile, indent=2)
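
As a sanity check on the resulting graph, one could confirm that every `wasUsedBy` and `wasGeneratedBy` value points at a node that is actually present. The following is a sketch only: `dangling_references` is a hypothetical helper, and the example graph is made up to mirror the shape of the nodes built above.

```python
def dangling_references(graph):
    """List (node id, property, target) triples whose target is not in the graph."""
    ids = {node['@id'] for node in graph}
    dangling = []
    for node in graph:
        for key in ('wasUsedBy', 'wasGeneratedBy'):
            target = node.get(key)
            if target is not None and target not in ids:
                dangling.append((node['@id'], key, target))
    return dangling


# Made-up graph: the last entity refers to an activity that does not exist.
example_graph = [
    {'@id': 'unknown-build#prepare_sources', '@type': 'activity'},
    {'@id': 'unknown-buildartifact/out/countries.csv', '@type': 'entity',
     'wasGeneratedBy': 'unknown-build#prepare_sources'},
    {'@id': 'https://example.org/source.csv', '@type': 'entity',
     'wasUsedBy': 'unknown-build#other_activity'},
]

for problem in dangling_references(example_graph):
    print(problem)
```

Applied to `prov['@graph']`, an empty list would indicate that all source and output entities are linked to the recorded activity.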