# PRIMAP-hist Data Preparation

This Jupyter notebook sets out a method for preparing PRIMAP-hist data for plotting. It can also serve as a guide and basis for preparing other input datasets.

The dataset can be viewed and accessed at:
https://www.pik-potsdam.de/paris-reality-check/primap-hist/

Gütschow, J.; Jeffery, L.; Gieseke, R. (2019): The PRIMAP-hist national historical emissions time series (1850-2016). v2.0. GFZ Data Services. https://doi.org/10.5880/pik.2019.001

The downloaded data should be stored in the 'input-data' folder of this repository. PRIMAP-hist is updated roughly once per year, so check the link above for updates and download new data when available. After downloading a new version, update the filenames in the data-reduction cell below.
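
Because the filenames change with each PRIMAP-hist release, it can help to list what is actually present in the input folder before editing the filenames. The following is a minimal sketch, assuming the files keep the `PRIMAP-hist_*.csv` naming pattern. The helper `find_primap_files` is hypothetical and not part of `gst_tools`.

```python
import glob
import os


def find_primap_files(folder="input-data", pattern="PRIMAP-hist_*.csv"):
    """Return the sorted paths of files in `folder` that match `pattern`."""
    return sorted(glob.glob(os.path.join(folder, pattern)))


# Print whatever PRIMAP-hist files are currently available (nothing if none).
for path in find_primap_files():
    print(path)
```

If the list is empty, the data has most likely not been downloaded yet or is stored in a different folder.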

For a further description of the data requirements, please see **TODO**.

------
TEMPORARY TO-DO LIST!
* consider adding unit conversion
* try a few more examples - is the metadata / data description sufficient?
* warn of overwriting files and/or automate file names? Currently it's possible to give a file a poor name. This would require a look-up for category names, though...
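
One possible shape for the unit conversion mentioned in the first item is sketched below. It assumes the wide layout used in this notebook (one column per year plus a `unit` column) and converts GgCO2eq to MtCO2eq (1 Mt = 1000 Gg). The function `convert_to_mt` and the factor table are illustrative assumptions, not existing `gst_tools` functionality.

```python
import pandas as pd

# Assumed conversion factors to MtCO2eq; only the units listed here are handled.
FACTORS_TO_MT = {"GgCO2eq": 1e-3, "MtCO2eq": 1.0}


def convert_to_mt(df):
    """Return a copy of df with all year columns scaled to MtCO2eq."""
    out = df.copy()
    year_cols = [c for c in out.columns if str(c).isdigit()]
    factors = out["unit"].map(FACTORS_TO_MT)
    if factors.isna().any():
        unknown = sorted(str(u) for u in out.loc[factors.isna(), "unit"].unique())
        raise ValueError("No conversion factor for units: " + ", ".join(unknown))
    # Multiply each row by the factor that belongs to that row's unit.
    out[year_cols] = out[year_cols].mul(factors, axis=0)
    out["unit"] = "MtCO2eq"
    return out


example = pd.DataFrame({
    "country": ["AFG", "AGO"],
    "unit": ["GgCO2eq", "GgCO2eq"],
    "1990": [223.0, 211.0],
    "1991": [196.0, 214.0],
})
print(convert_to_mt(example))
```

Raising an error for unknown units, rather than silently leaving them unchanged, avoids mixing units in one output file.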


In [5]:
# import modules

# system
import re
import sys
import os

# data handling
import pandas as pd
import numpy as np

# open climate data packages
from countrygroups import UNFCCC, EUROPEAN_UNION, ANNEX_ONE, NON_ANNEX_ONE
from shortcountrynames import to_name

# global stocktake tools
import gst_tools.gst_utils as utils

In [6]:
## User definition of requirements

include_extrapolated_data = False

# to understand the codes to use here, please see the PRIMAP-hist documentation
raw_entity = 'KYOTOGHGAR4'
raw_sector = 'IPC2'
raw_scenario = 'HISTCR' 

# choose something useful! These will be used to generate the new filename.
new_variable_name = 'IPPU-KyotoGHG'
new_source_name = 'PRIMAP-hist_v2.0'

# Based on countrygroups package, select the group of countries you would like to extract. 
# Note that the raw data also includes groups.
needed_countries = UNFCCC

# First year of data needed for further plotting
start_year = 1990 


In [7]:
## reduce the data

# select the right file 
# !*** Note that the file names here will need to be updated for new versions of PRIMAP-hist ***! 
if include_extrapolated_data:
    raw_data_file = 'PRIMAP-hist_v2.0_11-Dec-2018.csv'
else:
    raw_data_file = 'PRIMAP-hist_v2.0_no_extrapolation_11-Dec-2018.csv'
    

# get the data
raw_data_folder = 'input-data'
fname = os.path.join(raw_data_folder, raw_data_file)
print('reading ' + fname)
raw_data = pd.read_csv(fname)

# reduce to only the desired variable (one per output file)
new_data = raw_data.loc[(raw_data['entity'] == raw_entity) & 
                        (raw_data['scenario'] == raw_scenario) &
                        (raw_data['category'] == raw_sector)
                       ]

# reduce the countries or regions to only those desired
new_data = new_data.loc[new_data['country'].isin(needed_countries)]

# tell the user if any of the needed countries are missing and, if yes, which ones:
missing_countries = list(set(needed_countries) - set(new_data['country'].unique()))
if missing_countries:
    print('Not all countries requested were available in the raw data. You are missing the following:')
    for country in missing_countries:
        print('   ' + to_name(country))
    print('---------')


# reduce to only required years
new_data = utils.change_first_year(new_data, start_year)

# rename columns to follow conventions
new_data = new_data.rename(columns={'entity': 'variable'})

# make sure 'variable' contains all necessary information
new_data['variable'] = new_variable_name

# label the source
new_data['source'] = new_source_name

new_data = utils.check_column_order(new_data)

# check! 
new_data

reading input-data/PRIMAP-hist_v2.0_no_extrapolation_11-Dec-2018.csv
Not all countries requested were available in the raw data. You are missing the following:
   Palestine
   European Union
---------
First year of data available is now 1990
Last year of data available is 2016


Unnamed: 0,category,country,scenario,source,unit,variable,1990,1991,1992,1993,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,IPC2,AFG,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,223.000,196.000,204.000,182.000,...,173.000,179.000,182.000,198.00,177.0000,232.0000,19.70,39.3,31.6,31.6
1,IPC2,AGO,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,211.000,214.000,236.000,219.000,...,744.000,897.000,914.000,806.00,760.0000,802.0000,778.00,969.0,988.0,988.0
2,IPC2,ALB,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,775.000,626.000,244.000,261.000,...,664.000,707.000,774.000,898.00,1040.0000,1240.0000,1150.00,1100.0,1100.0,1100.0
3,IPC2,AND,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,52.800,53.400,55.300,55.800,...,45.700,42.600,43.800,47.00,49.1000,44.4000,2.62,,,
4,IPC2,ARE,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,3680.000,4110.000,4140.000,4130.000,...,15100.000,19200.000,16700.000,17000.00,17400.0000,15600.0000,10100.00,10300.0,10300.0,10300.0
5,IPC2,ARG,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,10700.000,9530.000,9480.000,10100.000,...,17100.000,16600.000,14200.000,16500.00,17500.0000,16700.0000,16500.00,16700.0,7120.0,7120.0
6,IPC2,ARM,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,808.000,826.000,338.000,271.000,...,997.000,1050.000,940.000,1020.00,350.0000,360.0000,214.00,210.0,208.0,208.0
7,IPC2,ATG,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,48.200,48.800,49.700,50.800,...,72.300,73.900,75.400,76.90,2.3100,2.3100,,,,
8,IPC2,AUS,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,27000.000,26300.000,26900.000,26700.000,...,35800.000,36300.000,34300.000,37800.00,38700.0000,36500.0000,33200.00,32600.0,33800.0,34200.0
9,IPC2,AUT,HISTCR,PRIMAP-hist_v2.0,GgCO2eq,IPPU-KyotoGHG,13800.000,13800.000,12100.000,12100.000,...,17000.000,17300.000,14000.000,16000.00,16000.0000,15600.0000,15900.00,16100.0,16700.0,16500.0
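
The year reduction above is handled by `utils.change_first_year`, whose implementation is not shown in this notebook. As a rough illustration of the idea only, the sketch below drops year columns earlier than a chosen start year from a wide table. The function `trim_to_first_year` is a hypothetical stand-in and may differ from what `gst_tools` actually does.

```python
import pandas as pd


def trim_to_first_year(df, first_year):
    """Drop year columns earlier than first_year; keep all other columns."""
    early = [c for c in df.columns if str(c).isdigit() and int(c) < first_year]
    return df.drop(columns=early)


example = pd.DataFrame({
    "country": ["AFG"],
    "unit": ["GgCO2eq"],
    "1989": [1.0],
    "1990": [223.0],
    "1991": [196.0],
})
trimmed = trim_to_first_year(example, 1990)
print(list(trimmed.columns))  # → ['country', 'unit', '1990', '1991']
```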


In [8]:
## write the data to file

"""
First ensure that the 'country' and 'unit' columns are present in the data.
If they are, proceed to write the data to file.
"""

if 'country' not in new_data.columns or 'unit' not in new_data.columns:
    
    print('Missing required information! Please check your input data and processing!')
    
else:
    
    # define filename as composite of variable and source name
    fname_out = new_source_name + '_' + new_variable_name + '.csv' 
    fullfname_out = os.path.join('proc-data', fname_out)

    # make sure the output folder exists
    os.makedirs('proc-data', exist_ok=True)

    # write to csv in proc data folder
    new_data.to_csv(fullfname_out, index=False)

    # celebrate success 
    print('Processed data written to file! - ' + fullfname_out)
    

Processed data written to file! - proc-data/PRIMAP-hist_v2.0_IPPU-KyotoGHG.csv
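
The writing step above silently replaces any existing file with the same name. One way to guard against that is sketched below. The helper `safe_output_path` and its `overwrite` flag are hypothetical and not part of `gst_tools`.

```python
import os


def safe_output_path(folder, fname, overwrite=False):
    """Return the output path, refusing to replace an existing file unless asked."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, fname)
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path + " already exists; pass overwrite=True to replace it")
    return path


# Possible use in the cell above (not executed here):
# new_data.to_csv(safe_output_path('proc-data', fname_out), index=False)
```

Raising an exception, rather than only printing a warning, makes an accidental overwrite hard to miss when the notebook is run from top to bottom.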


**Test ground below**

In [6]:
new_data.columns


Index(['category', 'country', 'scenario', 'source', 'unit', 'variable', '1990',
       '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999',
       '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016'],
      dtype='object')
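
The column listing above can also be checked programmatically before the data is written. The sketch below reports missing metadata columns and whether any four-digit year column is present. The required-column list and the function `missing_requirements` are assumptions made for illustration and should be adjusted to the actual data requirements.

```python
import pandas as pd

# Assumed set of metadata columns that the processed data should contain.
REQUIRED_COLUMNS = ("country", "unit", "variable")


def missing_requirements(df):
    """Return a list describing what is missing from df; empty if all is present."""
    problems = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    has_year = any(str(c).isdigit() and len(str(c)) == 4 for c in df.columns)
    if not has_year:
        problems.append("year columns")
    return problems


example = pd.DataFrame({
    "country": ["AFG"],
    "unit": ["GgCO2eq"],
    "variable": ["IPPU-KyotoGHG"],
    "1990": [223.0],
})
print(missing_requirements(example))  # → []
```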