## Load WRI Power Plant data from the 2019 dataset (see https://datasets.wri.org/dataset/globalpowerplantdatabase for the original source)

Copyright (C) 2021 OS-Climate

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

### We have a local copy rooted in the S3_BUCKET: WRI/global_power_plant_database_v_1_3/global_power_plant_database.csv
### To tidy the data, we factor it into three tables:

* **wri_plants** (all the fixed data about each plant)
* **wri_annual_gwh** (per plant/per year annual generation in GWh)
* **wri_estimated_gwh** (per plant/per year estimated generation in GWh)

### The next step is to enrich with OS-C Factor metadata

Contributed by Michael Tiemann (GitHub: MichaelTiemannOSC)

Load Credentials and Data Commons libraries

In [1]:
# From the AWS Account page, copy the export scripts from the appropriate role using the "Command Line or Programmatic Access" link
# Paste the copied text into ~/credentials.env

from dotenv import dotenv_values, load_dotenv
import os
import pathlib

dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path,override=True)

import pyarrow as pa
import pyarrow.parquet as pq
import json
import io
import uuid
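
The cells below read several settings from the environment (S3 endpoints and keys, Trino host and token). As a minimal sketch, assuming you want to fail early when `credentials.env` is incomplete, the check below lists any variables that are unset or empty. The helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative and not part of the original notebook; the variable names themselves are the ones used later in this notebook.

```python
import os

# Variable names taken from the cells further down in this notebook
REQUIRED_VARS = [
    "S3_LANDING_ENDPOINT", "S3_LANDING_ACCESS_KEY", "S3_LANDING_SECRET_KEY", "S3_LANDING_BUCKET",
    "S3_DEV_ENDPOINT", "S3_DEV_ACCESS_KEY", "S3_DEV_SECRET_KEY", "S3_DEV_BUCKET",
    "TRINO_HOST", "TRINO_PORT", "TRINO_USER", "TRINO_PASSWD",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [n for n in names if not env.get(n)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("missing settings: " + ", ".join(missing))
```

Passing a plain dict as `environ` makes the helper easy to try out without touching the real environment.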

In [2]:
# Thanks to https://stackoverflow.com/a/56172803/1291237 (and their CC-BY-SA 4.0 contribution!)
def set_metadata(tbl, col_meta={}, tbl_meta={}):
    """Store table- and column-level metadata as json-encoded byte strings.

    Table-level metadata is stored in the table's schema.
    Column-level metadata is stored in the table columns' fields.

    To update the metadata, first new fields are created for all columns.
    Next a schema is created using the new fields and updated table metadata.
    Finally a new table is created by replacing the old one's schema, but
    without copying any data.

    Args:
        tbl (pyarrow.Table): The table to store metadata in
        col_meta: A json-serializable dictionary with column metadata in the form
            {
                'column_1': {'some': 'data', 'value': 1},
                'column_2': {'more': 'stuff', 'values': [1,2,3]}
            }
        tbl_meta: A json-serializable dictionary with table-level metadata.
    """
    # Create updated column fields with new metadata
    if col_meta or tbl_meta:
        fields = []
        for col in tbl.schema.names:
            if col in col_meta:
                # Get updated column metadata
                metadata = tbl.field(col).metadata or {}
                for k, v in col_meta[col].items():
                    metadata[k] = json.dumps(v).encode('utf-8')
                # Update field with updated metadata
                fields.append(tbl.field(col).with_metadata(metadata))
            else:
                fields.append(tbl.field(col))
        
        # Get updated table metadata
        tbl_metadata = tbl.schema.metadata or {}
        for k, v in tbl_meta.items():
            if isinstance(v, bytes):
                tbl_metadata[k] = v
            else:
                tbl_metadata[k] = json.dumps(v).encode('utf-8')

        # Create new schema with updated field metadata and updated table metadata
        schema = pa.schema(fields, metadata=tbl_metadata)

        # With updated schema build new table (shouldn't copy data)
        # tbl = pa.Table.from_batches(tbl.to_batches(), schema)
        tbl = tbl.cast(schema)

    return tbl


def decode_metadata(metadata):
    """Arrow stores metadata keys and values as bytes.
    We store "arbitrary" data as json-encoded strings (utf-8),
    which are here decoded into normal dict.
    """
    if not metadata:
        # None or {} are not decoded
        return metadata

    decoded = {}
    for k, v in metadata.items():
        key = k.decode('utf-8')
        val = json.loads(v.decode('utf-8'))
        decoded[key] = val
    return decoded


def table_metadata(tbl):
    """Get table metadata as dict."""
    return decode_metadata(tbl.schema.metadata)


def column_metadata(tbl):
    """Get column metadata as dict."""
    return {col.name: decode_metadata(col.metadata) for col in tbl.schema}


def get_metadata(tbl):
    """Get column and table metadata as dicts."""
    return column_metadata(tbl), table_metadata(tbl)
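
To illustrate the encoding convention these helpers rely on, here is a small sketch that uses only the standard library and plain dicts instead of an Arrow table. The names `encode_meta` and `decode_meta` are hypothetical stand-ins, not functions from the notebook or from pyarrow. It shows why string values later appear wrapped in extra quotes and why `None` appears as `null` when a schema is displayed.

```python
import json

def encode_meta(meta):
    """Encode a plain dict as bytes keys and JSON-encoded bytes values (illustrative helper)."""
    return {k.encode("utf-8"): json.dumps(v).encode("utf-8") for k, v in meta.items()}

def decode_meta(metadata):
    """Reverse encode_meta: bytes keys to str, JSON bytes back to Python values."""
    return {k.decode("utf-8"): json.loads(v.decode("utf-8")) for k, v in metadata.items()}

field_meta = {"description": "year of report", "dimension": "year", "unit": None}
encoded = encode_meta(field_meta)
decoded = decode_meta(encoded)

print(encoded[b"dimension"])  # the JSON text of the string, quotes included
print(encoded[b"unit"])       # None is stored as the JSON literal null
```

Because every value goes through `json.dumps`, lists and nested dicts survive the round trip as well, which is what lets the table-level `metafields` entry hold a whole dictionary.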

Build a map and define schema mapping logic for parquet to sql

In [3]:
import re
import pandas as pd
_wsdedup = re.compile(r"\s+")
_usdedup = re.compile(r"__+")
_rmpunc = re.compile(r"[,.()&$/+-]+")
# 63 seems to be a common max column name length
def snakify(name, maxlen=63):
    if isinstance(name, list):
        # Pass maxlen through so list elements are truncated consistently
        return [snakify(e, maxlen=maxlen) for e in name]
    w = str(name).casefold().strip()
    w = w.replace("-", "_")
    w = _rmpunc.sub("", w)
    w = _wsdedup.sub("_", w)
    w = _usdedup.sub("_", w)
    w = w.replace("average", "avg")
    w = w.replace("maximum", "max")
    w = w.replace("minimum", "min")
    w = w.replace("absolute", "abs")
    w = w.replace("source", "src")
    w = w.replace("distribution", "dist")
    # these are common in the sample names but unsure of standard abbv
    #w = w.replace("inference", "inf")
    #w = w.replace("emissions", "emis")
    #w = w.replace("intensity", "int")
    #w = w.replace("reported", "rep")
    #w = w.replace("revenue", "rev")
    w = w[:maxlen] 
    return w

def snakify_columns(df, inplace=False, maxlen=63):
    icols = df.columns.to_list()
    ocols = snakify(icols, maxlen=maxlen)
    # Refuse to rename if normalization or truncation made two names collide
    if len(set(ocols)) < len(ocols):
        raise ValueError("remapped column names were not unique!")
    rename_map = dict(zip(icols, ocols))
    return df.rename(columns=rename_map, inplace=inplace)

rename_year_columns={}
for y in range(1900,2100):
    rename_year_columns[str(y)] = 'y{yr}'.format(yr=y)
#rename_year_columns

_p2smap = {
    'object': 'varchar',
    'string': 'varchar',
    'str': 'varchar',
    'float32': 'real',
    'Float32': 'real',
    'float64': 'double',
    'Float64': 'double',
    'int32': 'integer',
    'Int32': 'integer',
    'int64': 'bigint',
    'Int64': 'bigint',
    'category': 'varchar',
    'datetime64[ns, UTC]': 'timestamp',
    'datetime64[ns]': 'timestamp'
}

def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))

# add ability to specify optional dict for specific fields?
# if column name is present, use specified value?
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0],t=e[1]) for e in pz])
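
As a rough sketch of what the name normalization does, the snippet below re-implements only the core steps of `snakify` (case folding, punctuation removal, whitespace and underscore collapsing, truncation) and leaves out the abbreviation substitutions. `snakify_sketch` and `find_collisions` are illustrative names, not part of the notebook; the second one shows the kind of collision that makes `snakify_columns` raise.

```python
import re

_rmpunc = re.compile(r"[,.()&$/+-]+")
_wsdedup = re.compile(r"\s+")
_usdedup = re.compile(r"__+")

def snakify_sketch(name, maxlen=63):
    """Simplified snake_case normalization (no abbreviation rules)."""
    w = str(name).casefold().strip().replace("-", "_")
    w = _rmpunc.sub("", w)      # drop punctuation such as parentheses
    w = _wsdedup.sub("_", w)    # runs of whitespace become one underscore
    w = _usdedup.sub("_", w)    # collapse repeated underscores
    return w[:maxlen]

def find_collisions(names, maxlen=63):
    """Group original names that normalize to the same result."""
    seen = {}
    for n in names:
        seen.setdefault(snakify_sketch(n, maxlen), []).append(n)
    return {k: v for k, v in seen.items() if len(v) > 1}

print(snakify_sketch("Capacity (MW)"))
print(snakify_sketch("Other-Fuel  1"))
print(find_collisions(["Net Generation (GWh)", "net generation gwh", "Owner"]))
```

Checking for collisions before renaming matters most when `maxlen` is small, since truncation can make two long names identical.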

Create an S3 resource for the bucket holding source data

In [4]:
import boto3
s3_resource = boto3.resource(
    service_name="s3",
    endpoint_url=os.environ['S3_LANDING_ENDPOINT'],
    aws_access_key_id=os.environ['S3_LANDING_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_LANDING_SECRET_KEY'],
)
bucket = s3_resource.Bucket(os.environ['S3_LANDING_BUCKET'])

In [5]:
# Create an S3 client.  We will use it later when we write out data and metadata
s3 = boto3.client(
    service_name="s3",
    endpoint_url=os.environ['S3_DEV_ENDPOINT'],
    aws_access_key_id=os.environ['S3_DEV_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_DEV_SECRET_KEY'],
)

In [6]:
import trino

conn = trino.dbapi.connect(
    host=os.environ['TRINO_HOST'],
    port=int(os.environ['TRINO_PORT']),
    user=os.environ['TRINO_USER'],
    http_scheme='https',
    auth=trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    verify=True,
)
cur = conn.cursor()

# Show available schemas to ensure trino connection is set correctly
cur.execute('show schemas in osc_datacommons_dev')
cur.fetchall()

[['aicoe_osc_demo'],
 ['company_data'],
 ['default'],
 ['defaultschema1'],
 ['demo'],
 ['eje'],
 ['epacems'],
 ['epacems_y95_al'],
 ['information_schema'],
 ['metastore'],
 ['pudl'],
 ['rmi_utility_transition_hub'],
 ['team1'],
 ['team2'],
 ['testaccessschema1'],
 ['urgentem'],
 ['wri'],
 ['wri_gppd'],
 ['wri_gppd_md']]

In [7]:
schemaname = 'wri_gppd_md'  # Add the _md so we don't disturb others who are trying to use a "stable" version of wri_gppd right now
cur.execute('create schema if not exists osc_datacommons_dev.' + schemaname)
cur.fetchall()

[[True]]

For osc_datacommons_dev, a trino pipeline is a parquet file stored in the S3_DEV_BUCKET

In [8]:
tablename_to_df = {}

def create_trino_pipeline (s3, schemaname, tablename, df):
    global tablename_to_df
    df.to_parquet('/tmp/{sname}.{tname}.parquet'.format(sname=schemaname, tname=tablename), index=False)
    tablename_to_df[tablename] = df
    s3.upload_file(
        Bucket=os.environ['S3_DEV_BUCKET'],
        Key='trino/{sname}/{tname}/{tname}.parquet'.format(sname=schemaname, tname=tablename),
        Filename='/tmp/{sname}.{tname}.parquet'.format(sname=schemaname, tname=tablename)
    )
    cur.execute('.'.join(['drop table if exists osc_datacommons_dev', schemaname, tablename]))
    print('dropping table: ' + tablename)
    cur.fetchall()
    
    schema = generate_table_schema_pairs(df)

    tabledef = """create table if not exists osc_datacommons_dev.{sname}.{tname}(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{bucket}/trino/{sname}/{tname}/'
)""".format(schema=schema,bucket=os.environ['S3_DEV_BUCKET'],sname=schemaname,tname=tablename)
    print(tabledef)

    # tables created externally may not show up immediately in cloud-beaver
    cur.execute(tabledef)
    cur.fetchall()
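
It can help to preview the statement that `create_trino_pipeline` sends before running it against a live catalog. The sketch below builds the same kind of `create table` text from a list of (column, SQL type) pairs without touching S3 or Trino. `build_table_ddl` and the bucket name `example-bucket` are assumptions made for illustration only.

```python
def build_table_ddl(catalog, schemaname, tablename, columns, bucket):
    """Return a Trino 'create table' statement for an external parquet table.

    columns is a list of (name, sql_type) pairs, e.g. [("year", "bigint")].
    """
    col_lines = ",\n".join("    {n} {t}".format(n=n, t=t) for n, t in columns)
    return (
        "create table if not exists {c}.{s}.{t}(\n"
        "{cols}\n"
        ") with (\n"
        "    format = 'parquet',\n"
        "    external_location = 's3a://{b}/trino/{s}/{t}/'\n"
        ")"
    ).format(c=catalog, s=schemaname, t=tablename, cols=col_lines, b=bucket)

ddl = build_table_ddl(
    "osc_datacommons_dev",
    "wri_gppd_md",
    "annual_gwh",
    [("gppd_idnr", "varchar"), ("year", "bigint"), ("generation_gwh", "double")],
    "example-bucket",  # placeholder bucket name
)
print(ddl)
```

Note that `external_location` names the directory, not the file: any parquet file uploaded under that prefix is expected to match the declared columns.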

Load WRI data file using pandas *read_csv*

In [9]:
import pandas as pd
import numpy as np

wri_file = bucket.Object('WRI/global_power_plant_database_v_1_3/global_power_plant_database.csv').get()['Body']

# Because NaN cannot be converted to int type, we cannot use int as a datatype for years that are NaN

wri_df = pd.read_csv(wri_file, dtype={'latitude':'float64', 'longitude':'float64', 'capacity_mw':'float64', 'other_fuel3':'str'})

# Add a unique identifier to the data set
uid = str(uuid.uuid4())
wri_df['uuid'] = uid

display(wri_df.columns)

Index(['country', 'country_long', 'name', 'gppd_idnr', 'capacity_mw',
       'latitude', 'longitude', 'primary_fuel', 'other_fuel1', 'other_fuel2',
       'other_fuel3', 'commissioning_year', 'owner', 'source', 'url',
       'geolocation_source', 'wepp_id', 'year_of_capacity_data',
       'generation_gwh_2013', 'generation_gwh_2014', 'generation_gwh_2015',
       'generation_gwh_2016', 'generation_gwh_2017', 'generation_gwh_2018',
       'generation_gwh_2019', 'generation_data_source',
       'estimated_generation_gwh_2013', 'estimated_generation_gwh_2014',
       'estimated_generation_gwh_2015', 'estimated_generation_gwh_2016',
       'estimated_generation_gwh_2017', 'estimated_generation_note_2013',
       'estimated_generation_note_2014', 'estimated_generation_note_2015',
       'estimated_generation_note_2016', 'estimated_generation_note_2017',
       'uuid'],
      dtype='object')

### Melt the generation data into a more tidy format, dropping NA values

In [10]:
wri_plants = wri_df[['country', 'country_long', 'name', 'gppd_idnr', 'capacity_mw',
       'latitude', 'longitude', 'primary_fuel', 'other_fuel1', 'other_fuel2',
       'other_fuel3', 'commissioning_year', 'owner', 'source', 'url',
       'geolocation_source', 'wepp_id', 'year_of_capacity_data', 'generation_data_source', 'uuid']]

wri_id_vars = ['gppd_idnr', 'uuid']
wri_value_vars = ['generation_gwh_2013', 'generation_gwh_2014', 'generation_gwh_2015', 'generation_gwh_2016',
                  'generation_gwh_2017', 'generation_gwh_2018', 'generation_gwh_2019']
wri_annual_gwh = wri_df.melt(wri_id_vars, wri_value_vars, var_name='year', value_name='generation_gwh')
wri_annual_gwh['year'] = pd.to_numeric(wri_annual_gwh['year'].apply(lambda x: int(x.split('_')[-1])))
wri_annual_gwh.dropna(subset=['generation_gwh'],inplace=True)

# Push uuid to the end of the column list
wri_annual_gwh.insert(len(wri_annual_gwh.columns)-1, 'uuid', wri_annual_gwh.pop('uuid'))

# display(wri_annual_gwh)
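
The reshaping above can be seen more easily on a toy frame. This is a minimal sketch with made-up plant identifiers and values, not data from the WRI file: two wide `generation_gwh_<year>` columns are melted into one row per plant and year, the year is parsed from the column name, and missing values are dropped.

```python
import pandas as pd

# Toy data: two plants, two yearly columns, one missing value
wide = pd.DataFrame({
    "gppd_idnr": ["P1", "P2"],
    "generation_gwh_2018": [10.0, None],
    "generation_gwh_2019": [12.5, 3.0],
})

tidy = wide.melt(
    id_vars=["gppd_idnr"],
    value_vars=["generation_gwh_2018", "generation_gwh_2019"],
    var_name="year",
    value_name="generation_gwh",
)
# The year is the last underscore-separated token of the old column name
tidy["year"] = tidy["year"].str.rsplit("_", n=1).str[-1].astype("int64")
tidy = tidy.dropna(subset=["generation_gwh"]).reset_index(drop=True)
print(tidy)
```

Unlike the cell above, this sketch resets the index after `dropna`. Without that step the leftover row labels are no longer a simple range, and `pa.Table.from_pandas` carries them along as an extra `__index_level_0__` column.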

### Create and store metadata for annual_gwh table

Convert the DataFrame to an Arrow table using PyArrow and inspect the table’s metadata property 

In [11]:
table = pa.Table.from_pandas(wri_annual_gwh)
print(table.schema.metadata)

{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "gppd_idnr", "field_name": "gppd_idnr", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "year", "field_name": "year", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "generation_gwh", "field_name": "generation_gwh", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "uuid", "field_name": "uuid", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": "5.0.0"}, "pandas_version": "1.3.3"}'}


Create custom metadata and key

In [12]:
custom_meta_fields = {
    'year': { 'description': 'year of report', 'dimension': 'year'},
    'gppd_idnr': { 'description': 'unique index into plants table', 'dimension': None},
    'generation_gwh': { 'description': 'electricity generation in gigawatt-hours reported for the year', 'dimension': 'GWh'}
}
custom_meta_key_fields = 'metafields'

custom_meta_content = {
    'title': 'WRI GPPD Annual Generation Data (GWh)',
    'description': 'A tidy representation of annual generation data from the WRI GPPD database',
    'version': '1.3.0',
    'release_date': '20210602'
    # How should we describe our transformative step here?
}
custom_meta_key = 'metaset'

pyarrow tables are immutable, so we create a new table that combines the original (pandas) metadata with our custom metadata

In [13]:
table = set_metadata(table, custom_meta_fields, {custom_meta_key_fields: custom_meta_fields, custom_meta_key: custom_meta_content})
display(table.schema)

gppd_idnr: string
  -- field metadata --
  description: '"unique index into plants table"'
  dimension: 'null'
year: int64
  -- field metadata --
  description: '"year of report"'
  dimension: '"year"'
generation_gwh: double
  -- field metadata --
  description: '"electricity generation in gigawatt-hours reported for th' + 7
  dimension: '"GWh"'
uuid: string
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 788
metafields: '{"year": {"description": "year of report", "dimension": "yea' + 208
metaset: '{"title": "WRI GPPD Annual Generation Data (GWh)", "description' + 128

Save the table as a parquet file in S3 so that the DataFrame and its metadata are stored together

Write the Arrow table to a temporary storage location in S3_DEV_BUCKET

In [14]:
tablename = 'annual_gwh'
pq.write_table(
    table,
    '/tmp/{tname}.parquet'.format(tname=tablename)
)
# df.to_parquet('/tmp/{tname}.parquet'.format(tname=tablename), index=False)
s3.upload_file(
    Bucket=os.environ['S3_DEV_BUCKET'],
    Key='trino/wri_gppd/{tname}/{tname}.parquet'.format(tname=tablename),
    Filename='/tmp/{tname}.parquet'.format(tname=tablename)
)

In [15]:
create_trino_pipeline (s3, schemaname, tablename, wri_annual_gwh)

dropping table: annual_gwh
create table if not exists osc_datacommons_dev.wri_gppd_md.annual_gwh(
    gppd_idnr varchar,
    year bigint,
    generation_gwh double,
    uuid varchar
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/trino/wri_gppd_md/annual_gwh/'
)


Restore data and metadata

In [16]:
# Read the Parquet file into an Arrow table
obj = s3.get_object(
    Bucket=os.environ['S3_DEV_BUCKET'], 
    Key='trino/wri_gppd/{tname}/{tname}.parquet'.format(tname=tablename)
)
restored_table = pq.read_table(io.BytesIO(obj['Body'].read()))
# Call the table’s to_pandas conversion method to restore the dataframe
# This operation uses the Pandas metadata to reconstruct the dataFrame under the hood
restored_df = restored_table.to_pandas()
# The custom metadata is accessible via the Arrow table’s metadata object
# Use the custom metadata key used earlier (taking care to once again encode the key as bytes)
restored_meta_json = restored_table.schema.metadata[custom_meta_key.encode()]
# Deserialize the json string to get back metadata
restored_meta = json.loads(restored_meta_json)
# Use the custom metadata fields key used earlier (taking care to once again encode the key as bytes)
restored_meta_json_fields = restored_table.schema.metadata[custom_meta_key_fields.encode()]
# Deserialize the json string to get back metadata
restored_meta_fields = json.loads(restored_meta_json_fields)

In [17]:
restored_df

Unnamed: 0,gppd_idnr,year,generation_gwh,uuid
337,AUS0000065,2013,89.595278,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
340,AUS0000114,2013,1095.676944,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
342,AUS0000264,2013,204.804444,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
345,AUS0000081,2013,132.456667,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
346,AUS0000113,2013,4.194444,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
...,...,...,...,...
244154,USA0056871,2019,22.647000,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
244155,USA0001368,2019,-0.045000,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
244156,USA0057648,2019,1.211000,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
244157,USA0061574,2019,1.589000,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0


In [18]:
restored_meta

{'title': 'WRI GPPD Annual Generation Data (GWh)',
 'description': 'A tidy representation of annual generation data from the WRI GPPD database',
 'version': '1.3.0',
 'release_date': '20210602'}

In [19]:
restored_meta_fields

{'year': {'description': 'year of report', 'dimension': 'year'},
 'gppd_idnr': {'description': 'unique index into plants table',
  'dimension': None},
 'generation_gwh': {'description': 'electricity generation in gigawatt-hours reported for the year',
  'dimension': 'GWh'}}

Merge the estimation data so that each estimate is paired 1:1 with its note, dropping NA values

In [20]:
wri_id_vars = ['gppd_idnr', 'uuid']
wri_value_vars = ['estimated_generation_gwh_2013', 'estimated_generation_gwh_2014', 'estimated_generation_gwh_2015',
                  'estimated_generation_gwh_2016', 'estimated_generation_gwh_2017']
wri_estimated_gwh = wri_df.melt(wri_id_vars, wri_value_vars, var_name='year', value_name='estimated_generation_gwh')
wri_estimated_gwh['year'] = pd.to_numeric(wri_estimated_gwh['year'].apply(lambda x: int(x.split('_')[-1])))

wri_value_vars = ['estimated_generation_note_2013', 'estimated_generation_note_2014', 'estimated_generation_note_2015',
                  'estimated_generation_note_2016', 'estimated_generation_note_2017']
wri_estimated_note = wri_df.melt(wri_id_vars, wri_value_vars, var_name='year', value_name='estimated_generation_note')
wri_estimated_note['year'] = pd.to_numeric(wri_estimated_note['year'].apply(lambda x: int(x.split('_')[-1])))

# Push uuid to the end of the column list
wri_estimated_gwh.insert(len(wri_estimated_gwh.columns)-1, 'uuid', wri_estimated_gwh.pop('uuid'))

display(wri_estimated_gwh)

wri_estimated_gwh = wri_estimated_gwh.merge(wri_estimated_note, on=['gppd_idnr', 'year', 'uuid'], validate="one_to_one")

wri_estimated_gwh.dropna(subset=['estimated_generation_gwh'],inplace=True)

# Push uuid to the end of the column list
wri_estimated_gwh.insert(len(wri_estimated_gwh.columns)-1, 'uuid', wri_estimated_gwh.pop('uuid'))

display(wri_estimated_gwh)

Unnamed: 0,gppd_idnr,year,estimated_generation_gwh,uuid
0,GEODB0040538,2013,123.77,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
1,WKS0070144,2013,18.43,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
2,WKS0071196,2013,18.64,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
3,GEODB0040541,2013,225.06,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
4,GEODB0040534,2013,406.16,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
...,...,...,...,...
174675,WRI1022386,2017,183.79,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174676,WRI1022384,2017,73.51,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174677,WRI1022380,2017,578.32,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174678,GEODB0040404,2017,2785.10,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0


Unnamed: 0,gppd_idnr,year,estimated_generation_gwh,estimated_generation_note,uuid
0,GEODB0040538,2013,123.77,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
1,WKS0070144,2013,18.43,SOLAR-V1-NO-AGE,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
2,WKS0071196,2013,18.64,SOLAR-V1-NO-AGE,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
3,GEODB0040541,2013,225.06,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
4,GEODB0040534,2013,406.16,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
...,...,...,...,...,...
174675,WRI1022386,2017,183.79,CAPACITY-FACTOR-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174676,WRI1022384,2017,73.51,CAPACITY-FACTOR-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174677,WRI1022380,2017,578.32,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174678,GEODB0040404,2017,2785.10,CAPACITY-FACTOR-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
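
The `validate="one_to_one"` argument in the merge above acts as a guard: pandas checks that the merge keys are unique on both sides and raises `MergeError` otherwise. A small sketch with made-up rows shows both the passing case and the failing one.

```python
import pandas as pd

estimates = pd.DataFrame({
    "gppd_idnr": ["P1", "P2"],
    "year": [2017, 2017],
    "estimated_generation_gwh": [5.0, 7.0],
})
notes = pd.DataFrame({
    "gppd_idnr": ["P1", "P2"],
    "year": [2017, 2017],
    "estimated_generation_note": ["HYDRO-V1", "SOLAR-V1-NO-AGE"],
})

# Keys are unique on both sides, so the merge succeeds
merged = estimates.merge(notes, on=["gppd_idnr", "year"], validate="one_to_one")

# Duplicate one note row: the same merge now fails the uniqueness check
dup_notes = pd.concat([notes, notes.iloc[[0]]], ignore_index=True)
try:
    estimates.merge(dup_notes, on=["gppd_idnr", "year"], validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(merged)
print("duplicate keys rejected:", raised)
```

Without `validate`, the duplicated key would silently produce an extra row for plant `P1`, which is exactly the kind of error a tidy table should rule out.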


### Create and store metadata for estimated_gwh table

Convert the DataFrame to an Arrow table using PyArrow and inspect the table’s metadata property 

In [22]:
table = pa.Table.from_pandas(wri_estimated_gwh)
print(table.schema.metadata)

{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "gppd_idnr", "field_name": "gppd_idnr", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "year", "field_name": "year", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "estimated_generation_gwh", "field_name": "estimated_generation_gwh", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "estimated_generation_note", "field_name": "estimated_generation_note", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "uuid", "field_name": "uuid", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "creator": {"library": "pyarrow", "version": 

Create custom metadata and key

pyarrow tables are immutable, so after defining the custom metadata we create a new table that combines the original (pandas) metadata with it

In [23]:
custom_meta_fields = {
    'year': { 'description': 'year of report', 'dimension': 'year'},
    'gppd_idnr': { 'description': 'unique index into plants table', 'dimension': None},
    'estimated_generation_gwh': { 'description': 'estimated electricity generation in gigawatt-hours reported for the year', 'dimension': 'GWh'},
    'estimated_generation_note': { 'description': 'type of generation estimated', 'dimension': None}
}
custom_meta_key_fields = 'metafields'

custom_meta_content = {
    'title': 'WRI GPPD Estimated Generation Data (GWh)',
    'description': 'A tidy representation of estimated generation data from the WRI GPPD database',
    'version': '1.3.0',
    'release_date': '20210602'
    # How should we describe our transformative step here?
}
custom_meta_key = 'metaset'

In [24]:
table = set_metadata(table, custom_meta_fields, {custom_meta_key_fields: custom_meta_fields, custom_meta_key: custom_meta_content})
display(table.schema)

gppd_idnr: string
  -- field metadata --
  description: '"unique index into plants table"'
  dimension: 'null'
year: int64
  -- field metadata --
  description: '"year of report"'
  dimension: '"year"'
estimated_generation_gwh: double
  -- field metadata --
  description: '"estimated electricity generation in gigawatt-hours repor' + 17
  dimension: '"GWh"'
estimated_generation_note: string
  -- field metadata --
  description: '"type of generation estimated"'
  dimension: 'null'
uuid: string
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 958
metafields: '{"year": {"description": "year of report", "dimension": "yea' + 325
metaset: '{"title": "WRI GPPD Estimated Generation Data (GWh)", "descript' + 134

Construct the combined metadata by merging existing table metadata and custom metadata.
Note: The metadata content must be JSON serialisable and encoded as bytes; the metadata key must also be encoded as bytes.

In [25]:
tablename = 'estimated_gwh'
pq.write_table(
    table,
    '/tmp/{tname}.parquet'.format(tname=tablename)
)
# df.to_parquet('/tmp/{tname}.parquet'.format(tname=tablename), index=False)
s3.upload_file(
    Bucket=os.environ['S3_DEV_BUCKET'],
    Key='trino/wri_gppd/{tname}/{tname}.parquet'.format(tname=tablename),
    Filename='/tmp/{tname}.parquet'.format(tname=tablename)
)

In [26]:
create_trino_pipeline (s3, schemaname, tablename, wri_estimated_gwh)

dropping table: estimated_gwh
create table if not exists osc_datacommons_dev.wri_gppd_md.estimated_gwh(
    gppd_idnr varchar,
    year bigint,
    estimated_generation_gwh double,
    estimated_generation_note varchar,
    uuid varchar
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/trino/wri_gppd_md/estimated_gwh/'
)


Restore data and metadata

In [27]:
# Read the Parquet file into an Arrow table
obj = s3.get_object(
    Bucket=os.environ['S3_DEV_BUCKET'], 
    Key='trino/wri_gppd/{tname}/{tname}.parquet'.format(tname=tablename)
)
restored_table = pq.read_table(io.BytesIO(obj['Body'].read()))
# Call the table’s to_pandas conversion method to restore the dataframe
# This operation uses the Pandas metadata to reconstruct the dataFrame under the hood
restored_df = restored_table.to_pandas()
# The custom metadata is accessible via the Arrow table’s metadata object
# Use the custom metadata key used earlier (taking care to once again encode the key as bytes)
restored_meta_json = restored_table.schema.metadata[custom_meta_key.encode()]
# Deserialize the json string to get back metadata
restored_meta = json.loads(restored_meta_json)
# Use the custom metadata fields key used earlier (taking care to once again encode the key as bytes)
restored_meta_json_fields = restored_table.schema.metadata[custom_meta_key_fields.encode()]
# Deserialize the json string to get back metadata
restored_meta_fields = json.loads(restored_meta_json_fields)

In [28]:
restored_df

Unnamed: 0,gppd_idnr,year,estimated_generation_gwh,estimated_generation_note,uuid
0,GEODB0040538,2013,123.77,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
1,WKS0070144,2013,18.43,SOLAR-V1-NO-AGE,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
2,WKS0071196,2013,18.64,SOLAR-V1-NO-AGE,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
3,GEODB0040541,2013,225.06,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
4,GEODB0040534,2013,406.16,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
...,...,...,...,...,...
174675,WRI1022386,2017,183.79,CAPACITY-FACTOR-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174676,WRI1022384,2017,73.51,CAPACITY-FACTOR-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174677,WRI1022380,2017,578.32,HYDRO-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
174678,GEODB0040404,2017,2785.10,CAPACITY-FACTOR-V1,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0


In [29]:
restored_meta

{'title': 'WRI GPPD Estimated Generation Data (GWh)',
 'description': 'A tidy representation of estimated generation data from the WRI GPPD database',
 'version': '1.3.0',
 'release_date': '20210602'}

In [30]:
restored_meta_fields

{'year': {'description': 'year of report', 'dimension': 'year'},
 'gppd_idnr': {'description': 'unique index into plants table',
  'dimension': None},
 'estimated_generation_gwh': {'description': 'estimated electricity generation in gigawatt-hours reported for the year',
  'dimension': 'GWh'},
 'estimated_generation_note': {'description': 'type of generation estimated',
  'dimension': None}}

### Finally, write out plants with metadata

Convert the DataFrame to an Arrow table using PyArrow and inspect the table’s metadata property 

In [31]:
table = pa.Table.from_pandas(wri_plants)
print(table.schema.metadata)

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 34936, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "country", "field_name": "country", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "country_long", "field_name": "country_long", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "name", "field_name": "name", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "gppd_idnr", "field_name": "gppd_idnr", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "capacity_mw", "field_name": "capacity_mw", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "latitude", "field_name": "latitude", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "longitude", "field_name": "l

Create the actual metadata for the source.  In this case, it is WRI_GPPD.

The quoted text comes from the README.txt file that comes with the dataset.

In [32]:
custom_meta_content = {}
metadata_text = """Title: Global Power Plant Database
Description: A comprehensive, global, open source database of power plants
Version: 1.3.0
Release Date: 2021-06-02
URI: http://datasets.wri.org/dataset/globalpowerplantdatabase
Copyright: Copyright 2018-2021 World Resources Institute and Data Contributors
License: Creative Commons Attribution 4.0 International -- CC BY 4.0
Contact: powerexplorer@wri.org
Citation: Global Energy Observatory, Google, KTH Royal Institute of Technology in Stockholm, Enipedia, World Resources Institute. 2019. Global Power Plant Database. Published on Resource Watch and Google Earth Engine. http://resourcewatch.org/ https://earthengine.google.com/  """

for line in metadata_text.split('\n'):
    k, v = line.split(':', 1)
    k = snakify(k)
    custom_meta_content[k] = v.strip()  # drop the space after the colon and any trailing blanks

custom_meta_content['abstract'] = """An affordable, reliable, and environmentally sustainable power sector is central to modern society.
Governments, utilities, and companies make decisions that both affect and depend on the power sector.
For example, if governments apply a carbon price to electricity generation, it changes how plants run and which plants are built over time.
On the other hand, each new plant affects the electricity generation mix, the reliability of the system, and system emissions.
Plants also have significant impact on climate change, through carbon dioxide (CO2) emissions; on water stress, through water withdrawal and consumption; and on air quality, through sulfur oxides (SOx), nitrogen oxides (NOx), and particulate matter (PM) emissions.

The Global Power Plant Database is an open-source open-access dataset of grid-scale (1 MW and greater) electricity generating facilities operating across the world.

The Database currently contains nearly 35000 power plants in 167 countries, representing about 72% of the world's capacity.
Entries are at the facility level only, generally defined as a single transmission grid connection point.
Generation unit-level information is not currently available. 
The methodology for the dataset creation is given in the World Resources Institute publication "A Global Database of Power Plants" [0].
Associated code for the creation of the dataset can be found on GitHub [1].
See also the technical note published in early 2020 on an improved methodology to estimate annual generation [2].

To stay updated with news about the project and future database releases, please sign up for our newsletter for the release announcement [3].


[0] www.wri.org/publication/global-power-plant-database
[1] https://github.com/wri/global-power-plant-database
[2] https://www.wri.org/publication/estimating-power-plant-generation-global-power-plant-database
[3] https://goo.gl/ivTvkd"""
custom_meta_content['name'] = 'WRI_GPPD'
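
The header parsing above can be sketched in isolation as follows. This is a minimal, standard-library illustration, not the notebook's own code: `snakify_sketch` is a hypothetical stand-in for the `snakify` helper (whose definition is not shown in this section), and `parse_header` is a name introduced here for illustration only.

```python
import re


def snakify_sketch(name: str) -> str:
    """Hypothetical stand-in for snakify: lower-case and join words with '_'."""
    return re.sub(r'[^0-9a-z]+', '_', name.strip().lower()).strip('_')


def parse_header(text: str) -> dict:
    """Turn 'Key: value' lines into a dict with snake_case keys."""
    parsed = {}
    for line in text.splitlines():
        if ':' not in line:
            continue  # skip blank or malformed lines instead of raising
        key, value = line.split(':', 1)  # split once so URLs keep their colons
        parsed[snakify_sketch(key)] = value.strip()
    return parsed


sample = """Title: Global Power Plant Database
Release Date: 2021-06-02
URI: http://datasets.wri.org/dataset/globalpowerplantdatabase"""

header = parse_header(sample)
print(header)
```

Splitting only at the first colon matters for lines such as `URI`, whose value contains further colons.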

Create the metadata for all the fields in all the tables

In [33]:
field_text = """`country` (text): 3 character country code corresponding to the ISO 3166-1 alpha-3 specification [https://www.iso.org/iso-3166-country-codes.html]
`country_long` (text): longer form of the country designation
`name` (text): name or title of the power plant, generally in Romanized form
`gppd_idnr` (text): 10 or 12 character identifier for the power plant
`capacity_mw` (number): electrical generating capacity in megawatts
`latitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)
`longitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)
`primary_fuel` (text): energy source used in primary electricity generation or export
`other_fuel1` (text): energy source used in electricity generation or export
`other_fuel2` (text): energy source used in electricity generation or export
`other_fuel3` (text): energy source used in electricity generation or export
`commissioning_year` (number): year of plant operation, weighted by unit-capacity when data is available
`owner` (text): majority shareholder of the power plant, generally in Romanized form
`source` (text): entity reporting the data; could be an organization, report, or document, generally in Romanized form
`url` (text): web document corresponding to the `source` field
`geolocation_source` (text): attribution for geolocation information
`wepp_id` (text): a reference to a unique plant identifier in the widely-used PLATTS-WEPP database.
`year_of_capacity_data` (number): year the capacity information was reported
`generation_data_source` (text): attribution for the reported generation information"""

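# Split each line once, at the first ': ', so a description containing ': ' stays intact
field_descs = [line.split(': ', 1)[1] for line in field_text.split('\n')]
# The key is the backtick-quoted name that precedes the '(type)' annotation
field_keys = [line.split(': ', 1)[0].split(' ')[0][1:-1] for line in field_text.split('\n')]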
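
As an alternative sketch of the same parsing step, a single regular expression can capture the field name, the declared type, and the description in one pass. The names `FIELD_LINE` and `parse_fields` are hypothetical and are not part of the notebook; the two sample lines are copied from the field list above.

```python
import re

# `name` (type): description
FIELD_LINE = re.compile(r'^`(?P<name>[^`]+)` \((?P<kind>[^)]+)\): (?P<desc>.*)$')


def parse_fields(text: str) -> dict:
    """Map each field name to its declared type and description."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that do not follow the expected layout
        fields[match['name']] = {
            'type': match['kind'],
            'description': match['desc'],
        }
    return fields


sample = """`capacity_mw` (number): electrical generating capacity in megawatts
`url` (text): web document corresponding to the `source` field"""

parsed_fields = parse_fields(sample)
print(parsed_fields['capacity_mw'])
```

Unlike the list comprehensions, this variant also keeps the `(text)` or `(number)` annotation, which could be useful when checking column types.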

Create the custom metadata and the keys under which they are stored

In [34]:
custom_meta_fields = {}
for k, v in zip(field_keys, field_descs):
    custom_meta_fields[k] = { 'description': v }

custom_meta_fields['capacity_mw']['dimension'] = 'MW'
custom_meta_fields['latitude']['dimension'] = 'degrees'
custom_meta_fields['longitude']['dimension'] = 'degrees'
custom_meta_fields['commissioning_year']['dimension'] = 'year'
custom_meta_fields['year_of_capacity_data']['dimension'] = 'year'
custom_meta_fields['year'] = { 'description': 'year of report', 'dimension': 'year'}
custom_meta_fields['gppd_idnr'] = { 'description': 'unique index into plants table', 'dimension': None}
custom_meta_fields['generation_gwh'] = { 'description': 'electricity generation in gigawatt-hours reported for the year', 'dimension': 'GWh'}
custom_meta_fields['estimated_generation_gwh'] = { 'description': 'estimated electricity generation in gigawatt-hours reported for the year', 'dimension': 'GWh'}
custom_meta_fields['estimated_generation_note'] = { 'description': 'label of the model/method used to estimate generation for the year', 'dimension': None }
custom_meta_key_fields = 'metafields'

# Note: this assignment replaces the custom_meta_content dictionary built in cell In [32]
custom_meta_content = {
    'title': 'Global Power Plant Database',
    'description': 'A comprehensive, global, open source database of power plants',
    'version': '1.3.0',
    'release_date': '20210602'
}
custom_meta_key = 'metaset'

In [35]:
table = set_metadata(table, custom_meta_fields, {custom_meta_key_fields: custom_meta_fields, custom_meta_key: custom_meta_content})
display(table.schema)

country: string
  -- field metadata --
  description: '"3 character country code corresponding to the ISO 3166-1' + 73
country_long: string
  -- field metadata --
  description: '"longer form of the country designation"'
name: string
  -- field metadata --
  description: '"name or title of the power plant, generally in Romanized' + 6
gppd_idnr: string
  -- field metadata --
  description: '"unique index into plants table"'
  dimension: 'null'
capacity_mw: double
  -- field metadata --
  description: '"electrical generating capacity in megawatts"'
  dimension: '"MW"'
latitude: double
  -- field metadata --
  description: '"geolocation in decimal degrees; WGS84 (EPSG:4326)"'
  dimension: '"degrees"'
longitude: double
  -- field metadata --
  description: '"geolocation in decimal degrees; WGS84 (EPSG:4326)"'
  dimension: '"degrees"'
primary_fuel: string
  -- field metadata --
  description: '"energy source used in primary electricity generation or ' + 7
other_fuel1: string
  -- field meta
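
The extra double quotes and the `'null'` value in the schema display above suggest that each field-level value is JSON-encoded before it is stored. The following sketch shows that encoding using only the standard library. It assumes, without having seen the definition of `set_metadata`, that this is what happens; `encode_field_metadata` is a hypothetical name.

```python
import json


def encode_field_metadata(meta: dict) -> dict:
    """JSON-encode each value, then convert keys and values to bytes."""
    return {key.encode(): json.dumps(value).encode() for key, value in meta.items()}


encoded = encode_field_metadata(
    {'description': 'unique index into plants table', 'dimension': None}
)
print(encoded)
# {b'description': b'"unique index into plants table"', b'dimension': b'null'}
```

JSON encoding lets `None`, numbers, and strings all survive the round trip, at the cost of the quoted appearance in the display.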

In [36]:
tablename = 'plants'
pq.write_table(
    table,
    '/tmp/{tname}.parquet'.format(tname=tablename)
)
# df.to_parquet('/tmp/{tname}.parquet'.format(tname=tablename), index=False)
s3.upload_file(
    Bucket=os.environ['S3_DEV_BUCKET'],
    Key='trino/wri_gppd/{tname}/{tname}.parquet'.format(tname=tablename),
    Filename='/tmp/{tname}.parquet'.format(tname=tablename)
)

In [37]:
create_trino_pipeline(s3, schemaname, tablename, wri_plants)

dropping table: plants
create table if not exists osc_datacommons_dev.wri_gppd_md.plants(
    country varchar,
    country_long varchar,
    name varchar,
    gppd_idnr varchar,
    capacity_mw double,
    latitude double,
    longitude double,
    primary_fuel varchar,
    other_fuel1 varchar,
    other_fuel2 varchar,
    other_fuel3 varchar,
    commissioning_year double,
    owner varchar,
    source varchar,
    url varchar,
    geolocation_source varchar,
    wepp_id varchar,
    year_of_capacity_data double,
    generation_data_source varchar,
    uuid varchar
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/trino/wri_gppd_md/plants/'
)
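
A statement of the shape shown above can be generated from a mapping of column names to types. The sketch below is illustrative only and is not the implementation of `create_trino_pipeline`: the function `build_create_table`, the `PANDAS_TO_TRINO` mapping, and the bucket name `example-bucket` are all assumptions made for this example.

```python
# Assumed mapping from pandas dtype names to Trino column types
PANDAS_TO_TRINO = {
    'object': 'varchar',
    'float64': 'double',
    'int64': 'bigint',
    'bool': 'boolean',
}


def build_create_table(catalog, schema, table, columns, location):
    """Render a CREATE TABLE statement for an external Parquet table."""
    column_lines = ',\n'.join(
        f'    {name} {sql_type}' for name, sql_type in columns.items()
    )
    return (
        f'create table if not exists {catalog}.{schema}.{table}(\n'
        f'{column_lines}\n'
        ') with (\n'
        "    format = 'parquet',\n"
        f"    external_location = '{location}'\n"
        ')'
    )


# Two columns taken from the plants table, with their pandas dtype names
dtypes = {'country': 'object', 'capacity_mw': 'float64'}
columns = {name: PANDAS_TO_TRINO[dtype] for name, dtype in dtypes.items()}

ddl = build_create_table(
    'osc_datacommons_dev', 'wri_gppd_md', 'plants', columns,
    's3a://example-bucket/trino/wri_gppd_md/plants/',  # hypothetical location
)
print(ddl)
```

Keeping the type mapping in one dictionary makes it easy to see why `commissioning_year`, a `float64` column, appears as `double` in the generated statement.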


Restore data and metadata

In [38]:
# Read the Parquet file into an Arrow table
obj = s3.get_object(
    Bucket=os.environ['S3_DEV_BUCKET'], 
    Key='trino/wri_gppd/{tname}/{tname}.parquet'.format(tname=tablename)
)
restored_table = pq.read_table(io.BytesIO(obj['Body'].read()))
# Call the table's to_pandas conversion method to restore the DataFrame
# This operation uses the pandas metadata stored in the schema to reconstruct the DataFrame
restored_df = restored_table.to_pandas()
# The custom metadata is accessible via the Arrow table’s metadata object
# Use the custom metadata key used earlier (taking care to once again encode the key as bytes)
restored_meta_json = restored_table.schema.metadata[custom_meta_key.encode()]
# Deserialize the json string to get back metadata
restored_meta = json.loads(restored_meta_json)
# Use the custom metadata fields key used earlier (taking care to once again encode the key as bytes)
restored_meta_json_fields = restored_table.schema.metadata[custom_meta_key_fields.encode()]
# Deserialize the json string to get back metadata
restored_meta_fields = json.loads(restored_meta_json_fields)
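
Arrow schema metadata is a mapping from bytes to bytes, which is why the keys are encoded before lookup and the values are passed through `json.loads`. The sketch below wraps that pattern in a small helper. A plain dictionary stands in for `restored_table.schema.metadata` so the example runs without pyarrow; `decode_schema_metadata` is a hypothetical name.

```python
import json


def decode_schema_metadata(schema_metadata, key):
    """Look up a str key in a bytes-keyed mapping and decode its JSON value."""
    raw = (schema_metadata or {}).get(key.encode())
    if raw is None:
        return None  # key absent, or the schema carries no metadata at all
    return json.loads(raw)


# Stand-in for restored_table.schema.metadata
fake_metadata = {
    b'metaset': json.dumps(
        {'title': 'Global Power Plant Database', 'version': '1.3.0'}
    ).encode(),
}

print(decode_schema_metadata(fake_metadata, 'metaset'))
print(decode_schema_metadata(fake_metadata, 'missing'))
```

Returning `None` for a missing key avoids the `KeyError` that direct indexing would raise when a file was written without the custom metadata.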

In [39]:
restored_df

Unnamed: 0,country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_data_source,uuid
0,AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
1,AFG,Afghanistan,Kandahar DOG,WKS0070144,10.0,31.6700,65.7950,Solar,,,,,,Wiki-Solar,https://www.wiki-solar.org,Wiki-Solar,,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
2,AFG,Afghanistan,Kandahar JOL,WKS0071196,10.0,31.6230,65.7920,Solar,,,,,,Wiki-Solar,https://www.wiki-solar.org,Wiki-Solar,,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
3,AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
4,AFG,Afghanistan,Naghlu Dam Hydroelectric Power Plant Afghanistan,GEODB0040534,100.0,34.6410,69.7170,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009797,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34931,ZMB,Zambia,Ndola,WRI1022386,50.0,-12.9667,28.6333,Oil,,,,,ZESCO,Energy Regulation Board of Zambia,http://www.erb.org.zm/reports/EnergySectorRepo...,Power Africa,1089529,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
34932,ZMB,Zambia,Nkana,WRI1022384,20.0,-12.8167,28.2000,Oil,,,,,ZESCO,Energy Regulation Board of Zambia,http://www.erb.org.zm/reports/EnergySectorRepo...,Power Africa,1043097,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
34933,ZMB,Zambia,Victoria Falls,WRI1022380,108.0,-17.9167,25.8500,Hydro,,,,,ZESCO,Energy Regulation Board of Zambia,http://www.erb.org.zm/reports/EnergySectorRepo...,Power Africa,1033763,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
34934,ZWE,Zimbabwe,Hwange Coal Power Plant Zimbabwe,GEODB0040404,920.0,-18.3835,26.4700,Coal,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1033856,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0


In [40]:
restored_meta

{'title': 'Global Power Plant Database',
 'description': 'A comprehensive, global, open source database of power plants',
 'version': '1.3.0',
 'release_date': '20210602'}

In [41]:
restored_meta_fields

{'country': {'description': '3 character country code corresponding to the ISO 3166-1 alpha-3 specification [https://www.iso.org/iso-3166-country-codes.html]'},
 'country_long': {'description': 'longer form of the country designation'},
 'name': {'description': 'name or title of the power plant, generally in Romanized form'},
 'gppd_idnr': {'description': 'unique index into plants table',
  'dimension': None},
 'capacity_mw': {'description': 'electrical generating capacity in megawatts',
  'dimension': 'MW'},
 'latitude': {'description': 'geolocation in decimal degrees; WGS84 (EPSG:4326)',
  'dimension': 'degrees'},
 'longitude': {'description': 'geolocation in decimal degrees; WGS84 (EPSG:4326)',
  'dimension': 'degrees'},
 'primary_fuel': {'description': 'energy source used in primary electricity generation or export'},
 'other_fuel1': {'description': 'energy source used in electricity generation or export'},
 'other_fuel2': {'description': 'energy source used in electricity generati

In [42]:
display(wri_plants)

Unnamed: 0,country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_data_source,uuid
0,AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
1,AFG,Afghanistan,Kandahar DOG,WKS0070144,10.0,31.6700,65.7950,Solar,,,,,,Wiki-Solar,https://www.wiki-solar.org,Wiki-Solar,,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
2,AFG,Afghanistan,Kandahar JOL,WKS0071196,10.0,31.6230,65.7920,Solar,,,,,,Wiki-Solar,https://www.wiki-solar.org,Wiki-Solar,,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
3,AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
4,AFG,Afghanistan,Naghlu Dam Hydroelectric Power Plant Afghanistan,GEODB0040534,100.0,34.6410,69.7170,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009797,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34931,ZMB,Zambia,Ndola,WRI1022386,50.0,-12.9667,28.6333,Oil,,,,,ZESCO,Energy Regulation Board of Zambia,http://www.erb.org.zm/reports/EnergySectorRepo...,Power Africa,1089529,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
34932,ZMB,Zambia,Nkana,WRI1022384,20.0,-12.8167,28.2000,Oil,,,,,ZESCO,Energy Regulation Board of Zambia,http://www.erb.org.zm/reports/EnergySectorRepo...,Power Africa,1043097,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
34933,ZMB,Zambia,Victoria Falls,WRI1022380,108.0,-17.9167,25.8500,Hydro,,,,,ZESCO,Energy Regulation Board of Zambia,http://www.erb.org.zm/reports/EnergySectorRepo...,Power Africa,1033763,,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
34934,ZWE,Zimbabwe,Hwange Coal Power Plant Zimbabwe,GEODB0040404,920.0,-18.3835,26.4700,Coal,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1033856,2017.0,,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0


## Load metadata into the Trino metadata store after ingestion

### The schema is *metastore*, and the table names are *meta_schema*, *meta_table*, *meta_field*

In [44]:
# Create metastore structure
metastore = {'catalog':'osc_datacommons_dev',
             'schema':'wri_gppd_md',
             'table':tablename,
             'metadata':custom_meta_content,
             'uuid':uid}
# Create a DataFrame with one row per metadata key; the scalar entries are repeated on every row
df_meta = pd.DataFrame(metastore)
# Display the result
df_meta

Unnamed: 0,catalog,schema,table,metadata,uuid
description,osc_datacommons_dev,wri_gppd_md,plants,"A comprehensive, global, open source database ...",a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
release_date,osc_datacommons_dev,wri_gppd_md,plants,20210602,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
title,osc_datacommons_dev,wri_gppd_md,plants,Global Power Plant Database,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
version,osc_datacommons_dev,wri_gppd_md,plants,1.3.0,a3f38a4c-fb44-4d3f-becf-b218db1b7cd0
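
The output above shows that the keys of the metadata dictionary become the index while the catalog, schema, table, and uuid values are repeated on each row. The same layout can be made explicit with a list of row dictionaries, as in the sketch below. The function `metastore_rows` and the column names `key` and `value` are assumptions introduced for this example and are not part of the notebook.

```python
def metastore_rows(catalog, schema, table, metadata, uuid):
    """Build one row per metadata entry, repeating the identifying columns."""
    return [
        {
            'catalog': catalog,
            'schema': schema,
            'table': table,
            'key': key,
            'value': value,
            'uuid': uuid,
        }
        for key, value in metadata.items()
    ]


rows = metastore_rows(
    'osc_datacommons_dev',
    'wri_gppd_md',
    'plants',
    {'title': 'Global Power Plant Database', 'version': '1.3.0'},
    'a3f38a4c-fb44-4d3f-becf-b218db1b7cd0',
)
for row in rows:
    print(row)
```

Storing the metadata key as an ordinary column, rather than as the index, may make the rows easier to insert into a table such as *meta_table*.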
