## Load 2020 WIDE-formatted ESG data (Generic)

Copyright (C) 2021 OS-Climate

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

### Initially developed using the Royal Dutch Shell plc Sustainability Report 2020 report (Many Sheets)

Contributed by Michael Tiemann (Github: MichaelTiemannOSC)

Load Credentials

In [1]:
# From the AWS Account page, copy the export scripts from the appropriate role using the "Command Line or Programmatic Access" link
# Paste the copied text into ~/credentials.env

from dotenv import dotenv_values, load_dotenv
import os
import pathlib
import sys

dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path,override=True)

import sys
sys.path.append('../src/')

In [2]:
import re
import pandas as pd
import numpy as np

import openpyxl
from openpyxl import load_workbook
from openpyxl.worksheet.dimensions import ColumnDimension, DimensionHolder
from openpyxl.utils import get_column_letter
from openpyxl.styles import Alignment, Font
from itertools import islice

import pint
import pint_pandas
import iam_units
from openscm_units import unit_registry
pint_pandas.PintType.ureg = unit_registry
ureg = unit_registry
ureg.define('fraction = [] = frac')
ureg.define('percent = 1e-2 frac = pct = percentage')
ureg.define('ppm = 1e-6 fraction')

ureg.define("USD = [currency]")
ureg.define("EUR = nan USD")
ureg.define("JPY = nan USD")
ureg.define("MM_USD = 1000000 USD")
ureg.define("revenue = USD")

ureg.define("btu = Btu")
ureg.define("tBtu = T Btu")
ureg.define("boe = 5.712 GJ")
ureg.define("UEDCTM = [shell_index]")

ureg.define("CO2e = CO2 = CO2eq = CO2_eq")
ureg.define("HFC = [ HFC_emissions ]")
ureg.define("PFC = [ PFC_emissions ]")
ureg.define("mercury = Hg = Mercury")
ureg.define("PM10 = [ PM10_emissions ]")

ureg.define("production = [ output ]")
ureg.define("Index = pct = Share")

ureg.define("Number = dimensionless")

one_co2 = ureg("CO2e")
print(one_co2)

from osc_ingest_trino import *
import pyarrow as pa
import pyarrow.parquet as pq
import json
import io
import uuid

1 CO2e


In [3]:
ureg("tonnes CO2e/revenue")

For spreadsheets in WIDE format, pre-process the spreadsheet as a workbook, cascading label data into 3rd-normal form row and column metadata

* var_col is the label of the variable being measured (whose specificity (like CO2, CH4, NOx, etc) often affects units)
* units_col is the column where units are stated
* val_col:last_val_col are the column where the values are quantitatively reported
* last_val_col+1:last_col are additional columns that are presumed to be metadata labels (such as GRI or SASB labels)

We add:
* notes_col (source worksheet-specific; could act as a kind of source table metadata)
* topic_col (sheet-level category; if we wanted large tables, they could be named by topic)
* category_col (to which row-level data rolls up; if we wanted small tables, they could be named by topic:category)
* segment_col (the dimension by which row-level data is segmented)
* units_col (if not already existing in input)

Some spreadsheets use color to express a multi-level category hierarchy (such as Energy Consumption>>Business Use>>Fuel Type).  We concatenate the categories from left to right as the category for our purposes, except we split off the rightmost subcategory as the segmentation.

Based on all of the above, we don't really have table-level metadata other than notes attached to sheets and generic column information.  An argument could be made that we need to allocate specifier columns for additional data we want to split out from our variables.  That could look like:

* spec1_col
* spec2_col

etc

In [4]:
# var_col = 1

# Magic knowledge
# last_col = 4
# max_hidden_col = 5
# year_regex = r'^(20\d\d) Data$'

ingest_columns = [ 'Variable', 'Notes', 'Topic', 'Category', 'Segmentation', 'Unit' ]
ingest_col_offsets = dict((j,i) for i,j in enumerate(ingest_columns[1:], start=1))

# In this case, Value columns are named like 2020, 2019, 2018, ... .  It is the pd.melt function that gives us an actual Value column.
# the val_col index merely refers to the first such value row (which hopefully has tasty data)

# Magic knowledge
# val_col = units_col+1     # units_col starts as var_col+1, val_col starts as var_col+2 which is also units_col+1

# If topic_row is None, set topic based on name of sheet
# topic = topic_row = None
# header_row = None

# If init_header_row is None, find header row based on color scheme
# init_header_row = 1

class corp_report_magic:
    def __init__(self, shortname, input_filename, ws_start, ws_end, var_col=None, units_col=None, 
                 notes_col=None, topic_row=None, topic_col=None, category_col=None, init_header_row=None, header_row_list=None,
                 header_color=None, cat_color_dict={ None:0 }, year_regex=None, max_hidden_col=None,
                 val_col=None, last_val_col=None):
        self.shortname = shortname
        self.input_filename = '/'.join([os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src')),
                                        'osc-ingest-shell/data/external', input_filename])
        self.ws_start = ws_start,
        self.ws_end = ws_end,
        self.init_topic_row = topic_row    # If topic_row is None, use the worksheet name as the topic
        self.var_col = var_col or 1
        self.init_units_col = units_col            # If units_col is none, we have to allocate it
        self.init_topic_col = topic_col    # If topic_col is non-null, we get topics from this row
        self.init_category_col = category_col
        self.init_notes_col = notes_col
        self.init_val_col = val_col
        self.init_last_val_col = last_val_col
        # val_col, last_val_col, and last_col can be derived from the spreadsheet
        self.units_row = -1 if units_col==None else 0   # -1: Carry across only; 0: no units seen yet; > 0 row of prevailing unit
        self.init_header_row = init_header_row
        self.header_row_list = header_row_list if header_row_list else ([-1] * ws_start) + ([init_header_row] * (ws_end-ws_start+1))
        self.header_row = None
        self.header_color = header_color
        self.cat_color_dict = cat_color_dict
        self.year_regex = year_regex
        # For AEP, there are several hidden columns on the first sheet we must delete
        # to make that sheet line up with other sheets
        self.max_hidden_col = max_hidden_col
        
        self.units_col = units_col
        self.topic_col = topic_col
        self.category_col = category_col
        self.notes_col = notes_col
        self.segmentation_col = None
        self.val_col = val_col or units_col+1 if units_col else var_col+1 if var_col else 2
        self.last_val_col = last_val_col
        self.last_val_row = None              # Set by preprocess (after we've identified our value columns)
        self.last_col = None                  # Set by crop_sheet
    
    def preprocess(self):
        self.wb_superscripts = None
        self.topic_row = self.init_topic_row
        self.units_col = self.init_units_col
        self.topic_col = self.init_topic_col
        self.category_col = self.init_category_col
        self.notes_col = self.init_notes_col
        self.segmentation_col = None
        self.val_col = self.init_val_col or self.init_units_col+1 if self.init_units_col else self.var_col+1
        self.last_val_col = self.init_last_val_col

Shell_magic = corp_report_magic("Shell", r"greenhouse-gas-and-energy-data-shell-sr20.xlsx", 1, 10,
                                init_header_row=5, units_col=2)
DPDHL_magic = corp_report_magic("DPDHL", r"DPDHL-ESG-Statbook-2020-en.xlsx", 2, 4,
                                topic_row=1, header_row_list=[ -1, -1, 8, 5, 4], header_color='FFBF00',
                                cat_color_dict={ 'FF00B050':0, 'E2F0D9':1, 'D0CECE':0, 'E7E6E6':0 },
                                units_col=2)
Unilever_magic = corp_report_magic("Unilever", r"Unilever sustainability performance data_Climate FINAL.xlsx", 0, 0,
                                   topic_row=9, init_header_row=10,
                                   cat_color_dict={'FFEBF1DE':0, 'E2F0D9':0})
AEP_magic = corp_report_magic("AEP", r"2021-Data-Centerv1.xlsx", 0, 3,
                              init_header_row=1,
                              cat_color_dict={'FF237F2E':0, 'FF40B14B':1, 'FFC6E7C8':2,
                                              'FF757575':0, 'FFBDBDBD':1, 
                                              'FF5FB3F9':0, 'FFB9DDFC':1, 
                                              'FFD0AF8F':0, 'FFEEDCCA':1},
                              year_regex=r'^(20\d\d) Data$', max_hidden_col=5)
Altria_magic = corp_report_magic("Altra", r"esg-tables.xlsx", 1, 1,
                                 init_header_row=2,
                                 cat_color_dict={'FF9BDA44':0, 'FF92D050':1},
                                 units_col=2)

SUEZ_magic = corp_report_magic("SUEZ", r"SUEZ-FY-2020-ESG-dataset-xls-may2020.xlsx", 1, 1,
                               init_header_row=3,
                               topic_col=1, category_col=2, var_col=3, units_col=4,
                               val_col=9, last_val_col=10)

filename_magic = {
    r"greenhouse-gas-and-energy-data-shell-sr20.xlsx": Shell_magic,
    r"2021-Data-Centerv1.xlsx": AEP_magic,
    r"DPDHL-ESG-Statbook-2020-en.xlsx": DPDHL_magic,
    r"Unilever sustainability performance data_Climate FINAL.xlsx": Unilever_magic,
    r"esg-tables.xlsx": Altria_magic,
}
# A storage area in case we delete items from the above.
foo = {
    r"esg-tables.xlsx": Altria_magic,
    r"greenhouse-gas-and-energy-data-shell-sr20.xlsx": Shell_magic,
    r"DPDHL-ESG-Statbook-2020-en.xlsx": DPDHL_magic,
    r"Unilever sustainability performance data_Climate FINAL.xlsx": Unilever_magic,
    r"2021-Data-Centerv1.xlsx": AEP_magic,
}

crm = None
value_vars = None

In [5]:
scale_regex = re.compile(r'^((mi|bi|tri|quadri)llion|thousand|hundred)(s of)? ', flags=re.I)
sc_xlate = {'hun':1e2, 'tho':1e3, 'mil':1e6, 'bil':1e9, 'tri':1e12, 'qua':1e15}

def find_units(var):
    scale = 1.0
    if var in ['%', 'pct', 'percent']:
        return 'percent'
    if '-based' in var or 'KPI' in var:
        return None
    if 'Total no.' in var:
        var = var.replace('Total no.', 'Number')
    elif 'No.' in var:
        var = var.replace('No.', 'Number')
    if ' of production' in var:
        var = var.replace(' of production', '')
    var = var.replace ('Net MWh', 'MWh')
    var = var.replace ('trillion (10^12)', 'trillion')
    var = var.replace ('m3', 'kl')
    var = var.replace ('KWh', 'kWh')
    var = var.replace ('Index points', 'Number')
    if var.lower() in ureg:
        var = var.lower()
    var = var.replace ('metric ton', 'metric_ton')
    var = var.replace('short ton', 'short_ton')
    if var in ureg:
        return f'{ureg(var).u:~}'
    m = re.search(scale_regex, var)
    if m:
        var = ' '.join([var[0:m.start(0)],var[m.end(0):]]).strip()
        if var in ureg:
            units = sc_xlate[m.group(1)[0:3].lower()] * ureg(var)
            units = units.to_compact()
            if units.m - 1.0 < 0.00001:
                # Address roundoff problems such as giga = 1.00000000000002 x 10^9
                return f'{units.u:~}'
            print(f'units do not reduce: {units}')
    print(f'find units: nothing found for {var}')
    return None

In [6]:
topic_keywords = { 'footprint':['intensity'],
                   'emissions':['scope 1', 'scope 2', 'scope 3', 'ghg', 'intensity'],
                   'energy':['consum', 'generat', 'renewable', 'intensity'],
                   'water':['consum', 'discharge', 'withdraw', 'intensity'],
                   'waste':['landfill', 'incinerate', 'compost', 'recycle', 'reuse', 'intensity'],
                   'other':[]}

topic_cell = None
category_cell = None
segmentation_stack = []

In [7]:
def process_topic(ws, row):
    """
    Topics are major headers.  If such headers also have units defined, they are also categories.
    If such headers also have values defined, they are also processed as a variable.
    """
    global topic_cell, category_cell
    
    if row==None:
        # We have to put the topic in header_row+1 because header_row is column info, not data, for the dataframe
        row = crm.header_row+1
        topic_cell = ws.cell(row, crm.topic_col)
        topic_cell.value = ws.title.lower()
        print(f'process_topic {row}: setting topic from title {topic_cell.value}')

    cell = ws.cell(row, crm.var_col)
    if cell.value:
        var_text = cell.value.split('\n')[0]    # notes have been stripped out
    else:
        var_text = None
    
    if var_text:
        topic_cell = ws.cell(row, crm.topic_col)
        if topic_cell.value!=ws.title.lower():
            # Let's assume topic text is not parenthetical, but titular
            var_text = re.sub(r'\(.+\)', '', var_text)
            var_words = var_text.split(' ')
            for word in var_words:
                if topic_cell.value and word.lower() == topic_cell.value:
                    # Not a new topic
                    break
                if word.lower() in topic_keywords.keys():
                    print(f'process_topic {row}: setting topic {word}')
                    topic_cell.value = word.lower()

            if topic_cell.value==None:
                print(f'worksheet {ws.title}: unknown topic {var_text}')
                topic_cell.value = ws.title.lower()
                # topic_keywords[ws.title.lower()] = []

        # Try to extract units from Variable description
        if ws.cell(row, crm.units_col).value==None:
            p_exprs = re.findall(r'\((.+)\)', ws.cell(row, crm.var_col).value)
            for p in p_exprs:
                if find_units(p):
                    print(f'process_topic {row}: setting units from var: {p}')
                    ws.cell(row, crm.units_col).value = p
                    break
        # Here we don't look for species_unit; mistake???
        
        # If we definitely have units, set the category, which will also process the variable (if needed)
        if ws.cell(row, crm.units_col).value:
            print(f'process_topic {row}: setting category {var_text}')
            category_cell = ws.cell(row, crm.category_col)
            ws.cell(row, crm.category_col).value = var_text

            print(f'process_topic {row}: setting units {ws.cell(row, crm.units_col).value}')
            units = find_units (ws.cell(row, crm.units_col).value)
            if units == None:
                error(f'unknown units {ws.cell(row, crm.units_col).value}')
            ws.cell(row, crm.units_col).value = units
            
            row = process_categories (ws, row)
    else:
        print(f'process_topic {row}: no var text')
    if row < 0 or row >= crm.last_val_row:
        return crm.last_val_row
    return row+1

In [8]:
def formatted_as_sub(cell1, cell2):
    color1 = cell2rgb(cell1)
    color2 = cell2rgb(cell2)
    if color1 in crm.cat_color_dict:
        if color2 in crm.cat_color_dict:
            if crm.cat_color_dict[color1] < crm.cat_color_dict[color2]:
                return True
            if crm.cat_color_dict[color1] > crm.cat_color_dict[color2]:
                return False
        else:
            return True
    elif color2 in crm.cat_color_dict:
        return False
    
    sub_score = 0
    if cell1.font.b and cell2.font.b==False:
        print('+bold')
        sub_score += 1
    if cell1.font.b==False and cell2.font.b:
        print('-bold')
        sub_score -= 1
    if cell1.font.u and cell2.font.u==False:
        print('+underline')
        sub_score += 1
    if cell1.font.u==False and cell2.font.u:
        print('-underline')
        sub_score += 1
    if cell1.alignment.indent < cell2.alignment.indent:
        print('+indent')
        sub_score += 1
    elif cell1.alignment.indent > cell2.alignment.indent:
        sub_score -= 1
    if cell1.font.sz < cell2.font.sz:
        print('+size')
        sub_score += 1
    elif cell1.font.sz > cell2.font.sz:
        sub_score -= 1
    if cell1.alignment.horizontal == 'left' and cell2.alignment.horizontal == 'right':
        print('+halign')
        sub_score += 1
    elif cell1.alignment.horizontal == 'right' and cell2.alignment.horizontal == 'left':
        sub_score -= 1
    print(f'sub_score = {sub_score}')
    if sub_score > 0:
        return True
    if sub_score < 0:
        return False
    if sub_score == 0:
        return None

def process_categories(ws, row):
    """
    Categories have units
    """
    global topic_cell, category_cell, segmentation_stack
    
    while row < crm.last_val_row:
        cell = ws.cell(row, crm.var_col)
        if cell.value:
            var_text = cell.value.split('\n')[0]    # notes have been stripped out
        else:
            # Some categories are declared across multiple lines, with units by themselves
            var_text = None
        
        # If we're already processing this row as a topic (and being called from that context), don't recurse
        if var_text and topic_cell.row < row:
            color = cell2rgb(ws.cell(row,1))
            if color and color in crm.cat_color_dict:
                # Hack because Unilever uses colors wrongly, but punctuation saves the day
                if crm.cat_color_dict[color]==0 and var_text[-1]!=':':
                    # Register that we have a new topic
                    topic_cell = ws.cell(row, crm.topic_col)
                    topic_cell.value = var_text
                    print(f'process_categories: new topic set at row {row}: {var_text} (color = {color})')
                    return process_topic(ws, row)
                if crm.cat_color_dict[color]==1:
                    category_cell = ws.cell(row, crm.category_col)
                    category_cell.value = var_text
                    print(f'process_categories: new category set at row {row}: {var_text} (color = {color})')
            elif color:
                    print(f'process_categories: unknown color {color} at row {row}: {var_text}')
            elif None in crm.cat_color_dict:
                category_cell = ws.cell(row, crm.category_col)
                category_cell.value = var_text
                print(f'process_categories: new category set at row {row}: {var_text}')

        # Try to extract units from Variable description
        if ws.cell(row, crm.units_col).value==None and var_text:
            p_exprs = re.findall(r'\((.+)\)', ws.cell(row, crm.var_col).value)
            for p in p_exprs:
                if 'scope' in p.lower() or 'category' in p.lower():
                    continue
                if find_units(p):
                    print(f'process_categories {row}: setting units from var: {p}')
                    ws.cell(row, crm.units_col).value = p
                    break
        
        # Apply our best guess for units in case we need to propagate in segmentation
        var_units = ws.cell(row, crm.units_col).value
        var_species = ''
        if var_units:
            var_units = find_units (var_units)
            m = re.search('r\((.+)\)', var_text)
            if m:
                var_species = m.group(1)
                species_units = find_units(' '.join([var_units, var_species]))
                if species_units:
                    units = species_units
                    var_text = ' '.join([var_text[0:m.start(1)], var_text[m.end(1)+1:]]).replace('  ', ' ')
            else:
                units = var_units
        elif category_cell:
            units = ws.cell(category_cell.row, crm.units_col).value
        else:
            units = None
        ws.cell(row, crm.units_col).value = units
        
        total_of = ''
        if var_text and 'total' in var_text.lower():
            c1, c2 = re.split(r'\s*totals?\s*', var_text, flags=re.I)
            if c2.strip()=='':
                total_of = c2 = c1
                c1 = var_text
            elif c1.strip()=='':
                c1 = var_text
                total_of = c2
            else:
                total_of = var_text
                # we have c1:c2
        
        segment_by = ''
        for x in [ ' per ', ' by ', ' of ' ]:
            if var_text and x in var_text:
                segment_by = x
                break

        if formatted_as_sub(ws.cell(row, crm.var_col), ws.cell(row+1, crm.var_col)):
            if segment_by:
                c1, c2 = var_text.split(segment_by, 1)
            elif not total_of:
                c1 = var_text
                c2 = '(anon)'
            category_cell.value = c1
            if segment_by or total_of:
                print(f'process_categories {row}: segmenting {c1}::{c2}')
                segmentation_stack = [ ws.cell(row, crm.segmentation_col) ]
                segmentation_stack[-1].value = c2
                row = process_var(ws, row)
                row = process_segmentation (ws, row)
                if segmentation_stack != []:
                    print(f'process_categories: segmentation_stack = {segmentation_stack}')
                if row < 0:
                    return row
                continue
        segmentation_stack = []
        print(f'process_category {row}: processing variable')
        row = process_var(ws, row)
        if row < 0:
            return row
    if row < crm.last_val_row:
        return row+1
    if row == crm.last_val_row:
        process_var (ws, row)
    return -1

In [9]:
def process_segmentation(ws, row):
    """Process rows starting at ROW as a part of a segmentation.  We push and recurse if we see
    a new level of indentation.  We pop and return if we see an outdent.
    """
    global segmentation_stack
    
    seg_start_cell = ws.cell(row, crm.var_col)
    seg_start_units = ws.cell(row, crm.units_col).value
    while row < crm.last_val_row:
        cell = ws.cell(row, crm.var_col)
        if cell.value==None:
            if ws.cell(row, crm.units_col).value and ws.cell(row+1, crm.units_col).value==None:
                ws.cell(row+1, crm.units_col).value = ws.cell(row, crm.units_col).value
                row = row+1
                # Don't start a segment with an unlabeled cell
                seg_start_cell = ws.cell(row, crm.var_col)
                continue
            else:
                print(f'unhandled units at row {row}: {ws.cell(row+1, crm.units_col).value}; {ws.cell(row, crm.units_col).value}')

        if cell==seg_start_cell:
            # Handle easy case
            row = process_var (ws, row)
        elif formatted_as_sub(seg_start_cell, cell):
            if ws.cell(row, crm.units_col).value==None:
                ws.cell(row, crm.units_col).value = seg_start_units
            # There could be many rows at the same level as SEG_START_CELL before a new segmentation is seen
            # The label we care about is the one immediately preceding, not the first one with that indentation
            segmentation_stack.append(ws.cell(row-1, crm.var_col))
            row = process_segmentation(ws, row)
        elif formatted_as_sub(seg_start_cell, cell)==False:
            segmentation_stack.pop()
            print(f'pop at row {row}: segmentation_stack now {segmentation_stack}')
            return row
        else:
            if ws.cell(row, crm.units_col).value==None and seg_start_units:
                ws.cell(row, crm.units_col).value = seg_start_units
            row = process_var (ws, row)
        if row < 0:
            return row
    if row < crm.last_val_row:
        return row+1
    if row == crm.last_val_row:
        process_var (ws, row)
        if segmentation_stack:
            print(f'process_segmentation: stack at end = {segmentation_stack}')
            segmentation_stack = []
    return -1

In [10]:
def process_var(ws, row):
    global topic_cell, category_cell, segmentation_stack
    
    cell = ws.cell(row, crm.var_col)
    
    # Treat X (Y) as 'Category X Segmentation Y'
    if cell.value:
        var_text = cell.value.split('\n')[0]    # notes have been stripped out
        var_text = var_text.replace('(%)', '(pct)')
        m = re.search(r'^(.*) \((.*?)\)', var_text)
    else:
        var_text = None
        m = None
    if m and '-based' not in m.group(2) and 'scope' not in m.group(2).lower() and 'category' not in m.group(2).lower():
        species_units = find_units (m.group(2))
        if species_units:
            print(f'process_var {row}: found species or units in {var_text}')
            units = ws.cell(row, crm.units_col).value
            if units and units != species_units:
                species_units = find_units(' '.join([units, m.group(2)]))
                if species_units:
                    var_text = m.group(1).rstrip()
                    units = species_units
                else:
                    print(f'??? Not overriding {units} with {m.group(2)}')
                    # units = ws.cell(row, crm.units_col).value
            else:
                units = species_units
                if ' ' not in m.group(1) and m.group(1) in ureg:
                    species_units = find_units(' '.join([units, m.group(1)]))
                    if species_units:
                        var_text = m.group(1).rstrip()
                        units = species_units
                print(f'Inferring/composing units: {units}')
            if units != ws.cell(category_cell.row, crm.units_col).value:
                print(f'changing units from category: {ws.cell(category_cell.row, crm.units_col).value} to {units}')
                ws.cell(row, crm.units_col).value = units
        elif m.group(2).lower() and topic_cell.value.lower() in topic_keywords and m.group(2).lower() in topic_keywords[topic_cell.value.lower()]:
            # Scope 1 is actually a sneaky segmentation
            category_cell = ws.cell(row, crm.category_col)
            category_cell.value = m.group(2)
        else:
            print(f'process_var {row}: unhandled ( {m.group(2)} )')
    else:
        if ws.cell(row, crm.units_col).value == None:
            print(f'process_var {row}: propagating units {ws.cell(category_cell.row, crm.units_col).value}')
            ws.cell(row, crm.units_col).value = ws.cell(category_cell.row, crm.units_col).value
        else:
            print(f'process_var {row}: using units {ws.cell(row, crm.units_col).value}')
    ws.cell(row, crm.topic_col).value = topic_cell.value
    ws.cell(row, crm.category_col).value = category_cell.value
    if segmentation_stack != []:
        ws.cell(row, crm.segmentation_col).value = '::'.join(s.value for s in segmentation_stack)
    if row < crm.last_val_row:
        return row+1
    return -1

In [11]:
# ??? The header row color is going to be spreadsheet-specific.  This is what DPDHL gives us.

import cell2rgb
from cell2rgb import cell2rgb

def find_header_row(wb, ws):
    # If we haven't found the header by max_row-1, we'll never find it...
    for row in range(1, ws.max_row):
        color = cell2rgb(ws.cell(row,1))
        if color == crm.header_color:
            return row
        print(f'color = {color}')
    error('No header found')
    return -1

In [12]:
import zipfile
from lxml import etree
import xml.etree.ElementTree as eTree

# We pre-process the structure of the worksheet so that it can be trivially loaded into a dataframe for further reshaping.

# Stash notes for each worksheet here.  These are *per worksheet*
# ??? In the case of DPDHL, there's a Comment field we don't track, which means we miss a stated target
ws_notes = {}

def preprocess(wb, ws):
    global crm, value_vars
    global topic_cell, category_cell, segmentation_stack
    
    cell_notes_text = []
    cell_notes_cells = []
    
    def crop_sheet(ws):
        global crm
        # Frist, set max_row/max_column based on actually active cells, not cells with random spaces or empty strings
        this_max_row = 1
        this_max_col = 1
        for row in range(1,ws.max_row+1):
            row_max_col = None
            for col in range(1,ws.max_column+1):
                cell = ws.cell(row,col)
                if cell.value==None:
                    continue
                if type(cell.value)==str and cell.value.strip()=='':
                    cell.value = None
                    continue
                if col > this_max_col:
                    this_max_col = col
                row_max_col = col
            if row_max_col:
                this_max_row = row
        print('crop_sheet')
        print('{} x {}'.format(ws.max_row, ws.max_column))
        ws.delete_rows(this_max_row+1,ws.max_row)
        ws.delete_cols(this_max_col+1,ws.max_column)
        print('{} x {}'.format(ws.max_row, ws.max_column))
        crm.last_col = ws.max_column
        
    def preprocess_notes():
        global crm
        z = zipfile.ZipFile(crm.input_filename)

        if crm.wb_superscripts==None:
            with z.open('xl/sharedStrings.xml') as fp:
                ss_xml = etree.fromstring(fp.read())
            # get the namespaces                                                                                                                                                                                                                                             
            ssns = ss_xml.nsmap
            if None in ssns:
                ssns['none'] = ssns.pop(None)
            crm.text_list = ss_xml.xpath('//none:si', namespaces=ssns)
            # All shared strings across all sheets with superscripts                                                                                                                                                                                                                           
            crm.wb_superscripts = [s for s in range(len(crm.text_list))
                                   if 'superscript' in eTree.tostring(crm.text_list[s], encoding='unicode')]
        
        with z.open(f'xl/worksheets/sheet{wb.worksheets.index(ws)+1}.xml') as fp:
            ws_xml = etree.fromstring(fp.read())
        z.close()

        # get the namespaces                                                                                                                                                                                                                                             
        wsns = ws_xml.nsmap
        if None in wsns:
            wsns['none'] = wsns.pop(None)
        cell_list = ws_xml.xpath('//none:c', namespaces=wsns)

        # Dictionary of cells:shared strings with superscripts within this sheet's cells                                                                                                                                                                                                          
        sheet_ss_dict = {c:s for c in range(len(cell_list))
                         for s in crm.wb_superscripts if f' t="s"><ns0:v>{s}</ns0:v></ns0:c>' in eTree.tostring(cell_list[c],
                                                                                                                 encoding='unicode')}

        cell_notes_text = []
        cell_notes_cells = []
        for c, s in sheet_ss_dict.items():
            cell_name = eTree.tostring(cell_list[c], encoding='unicode').split(' ')[2].split('=')[1][1:-1]
            cell_text_xml = eTree.tostring(crm.text_list[s],encoding='unicode')
            ss_bool = ['superscript' in x for x in cell_text_xml.split('<ns0:r>')[1:]]
            cell_text_parts = [re.split('<ns0:t.*?>', x)[-1].split('</ns0:t>')[0]
                  for x in cell_text_xml.split('<ns0:r>')[1:]]
            for x in range(len(ss_bool)-1):
                if ss_bool[x]==False and ss_bool[x+1]==True:
                    cell_notes_text.append(cell_text_parts[x+1])
                    cell_notes_cells.append(cell_name)
                    ws[cell_name].value = ws[cell_name].value.replace(cell_text_parts[x]+cell_text_parts[x+1],
                                                                      cell_text_parts[x])
    
    crm.preprocess()
    preprocess_notes()
    print(cell_notes_text)
    print(cell_notes_cells)
    
    topic_cell = None
    category_cell = None
    segmentation_stack = []

    # Remove merged cells
    mergedRanges=ws.merged_cells.ranges
    while mergedRanges:
        for entry in mergedRanges:
            ws.unmerge_cells(str(entry))

    if crm.max_hidden_col and wb.worksheets[0]==ws:
        ws.delete_cols(1,crm.max_hidden_col)
    crop_sheet(ws)

    if crm.init_header_row:
        crm.header_row = crm.init_header_row
    else:
        crm.header_row = find_header_row (wb, ws)
    
    # Reset this for each worksheet
    if crm.units_row >= 0:
        crm.units_row = 0

    col = crm.val_col
    last_val_col = crm.last_val_col or col
    while crm.last_val_col==None or col<=crm.last_val_col:
        # ??? Deal with note in header value (such as '2019(b)' or, God forbit '20197' where the superscripted 7 just sits like it's part of the number)
        if crm.year_regex:
            maybe_year = re.sub(crm.year_regex, r'\1', str(ws.cell(crm.header_row, col).value))
        else:
            maybe_year = str(ws.cell(crm.header_row, col).value)
        if len(maybe_year)>=4 and maybe_year[0:2]=='20' and maybe_year[2].isdigit() and maybe_year[3].isdigit():
            ws.cell(crm.header_row, col).value = maybe_year[0:4]
            last_val_col = col
        elif crm.last_val_col==None:
            crm.last_val_col = last_val_col
            break
        col = col+1
    value_vars = [ None ] * (crm.last_val_col-crm.val_col+1)
    for col in range(crm.val_col, crm.last_val_col+1):
        value_vars[col-crm.val_col] = ws.cell(crm.header_row, col).value
    print(value_vars)
    
    # Make space for TOPIC : CATEGORY : SEGMENTATION triple.
    # This triple could very well become an index into a data framework (such as SASB, TCFD, etc)
    new_column_count = (len(ingest_columns)-1
                        -int(crm.notes_col!=None)
                        -int(crm.topic_col!=None)
                        -int(crm.category_col!=None)
                        -int(crm.units_col!=None))
    ws.insert_cols(crm.last_val_col+1,amount=new_column_count)
    if crm.notes_col==None:
        crm.notes_col = crm.last_val_col+ingest_col_offsets['Notes']
    ws.cell(crm.header_row,crm.notes_col).value = 'Notes'
    if crm.topic_col==None:
        crm.topic_col = crm.last_val_col+ingest_col_offsets['Topic']
    ws.cell(crm.header_row,crm.topic_col).value = 'Topic'
    if crm.category_col==None:
        crm.category_col = crm.last_val_col+ingest_col_offsets['Category']
    ws.cell(crm.header_row,crm.category_col).value = 'Category'
    crm.segmentation_col = crm.last_val_col+ingest_col_offsets['Segmentation']
    ws.cell(crm.header_row,crm.segmentation_col).value = 'Segmentation'
    if crm.units_col==None:
        crm.units_col = crm.last_val_col+ingest_col_offsets['Unit']
    ws.cell(crm.header_row,crm.units_col).value = 'Unit'
    ws.cell(crm.header_row,crm.var_col).value = 'Variable'
        
    crm.last_col = crm.last_col + new_column_count

    # Find last row of actual values so we can process notes at the end
    for row in range(ws.max_row, 0, -1):
        if any([True for col in range(crm.val_col, crm.last_val_col+1) if ws.cell(row, col).value]):
            crm.last_val_row = row
            break

In [13]:
def postprocess(wb, ws):
    
    # Intended for Shell notes
    def save_ws_notes(ws, note):
        global ws_notes
        
        if ws.title not in ws_notes:
            ws_notes[ws.title] = {}
        note_label, note_text = note.split(' ', 1)
        ws_notes[ws.title][note_label] = note_text.strip()
    
    # Intended for DPDHL notes
    def save_ws_notes2(ws, note):
        global ws_notes
        
        if ws.title not in ws_notes:
            ws_notes[ws.title] = {}
        notes = re.split(r' (\d+)\)\s+', note)
        print('NOTES')
        print(notes)
        print('END NOTES')
        ws_notes[ws.title]['0'] = notes[0]
        for i in range(int(len(notes)/2)):
            ws_notes[ws.title][notes[1+2*i]] = notes[2+2*i].strip()
    
    # Used for Unilever
    def finish_notes(row):
        print('finish_notes @ {}'.format(row))

    for row in range(crm.last_val_row+1, ws.max_row+1):
        cell = ws.cell(row, crm.var_col)
        # Find either bracketed note or note that begins with possible superscript
        if cell.value==None:
            continue
        if cell.value[0]=='[':
            save_ws_notes(ws, cell.value)
            continue
        if re.search(r'^[^(]*\d[)]', str(cell.value)):
            save_ws_notes2(ws, cell.value)
        if re.search(r'notes', str(cell.value), flags=re.I):
            finish_notes(row)
            return

In [14]:
# With a nicely formatted workbook, do the rest of our work (including writing to Trino) using dataframes

# IPIECA, SASB, and GRI columns all feed metadata

def ws_to_df(wb, i):
    data = islice(wb.worksheets[i].values, crm.header_row_list[i]-1, None)
    cols = list(next(data))
    data = list(data)
    # idx = [r[0] for r in data]
    # data = (islice(r, 0, None) for r in data)
    cols[crm.units_col-1] = 'Unit'            # Already set by Shell; DF indexes are XLSX-1
    df = pd.DataFrame(data, columns=cols) # we don't pass in an index here

    # Remove null columns
    df = df[[c for c in df.columns if c!= None]]
    
    # For now, do not remove rows lacking units.  Those are basically where Notes are stored (for better or worse).
    # print('rows lacking proper Units')
    # display(df[df['Unit'].isnull()])
    df = df.loc[df.Unit.notna() | df.Category.isna()]

    # Clear out data that is n/a, n/c (not collected), n/d (not disclosed)
    df[df['Unit'].notna()].replace(to_replace='^n/[acd]$', value='', regex=True, inplace=True)
    
    # Change numerical years to strings to make pandas indexing behave
    df.columns = [str(c) for c in df.columns]
    # Drop completely empty rows
    # df.dropna(how='all', axis=0, inplace=True)
    return df

Write out polymorphic dataframe in LONG format.  This follows tidy data model, with one variable observation per row.  
Polymorphic means that Units/dimensions of each row are specified, but not necessarily the same row to row.  
Aggregation functions must be careful that selection criteria does not mix up incompatible unit types and/or observation variables.

In [15]:
wb = None
ws = None
melted_df = None

def ingest_filename(filename):
    global melted_df
    global topic_cell
    global crm, wb, ws
    
    crm = filename_magic[filename]

    wb = load_workbook(crm.input_filename, data_only=True)

    # For a label like "Scope 1 emissions by country" return ['', 'Scope 1 emissions', 'country']
    # For a label like "Direct GHG emissions (Scope 1) [A] [B] [C] [D]" return ['[A] [B] [C] [D]', 'Direct GHG emissions (Scope 1) ', '']

    long_fmt_filename = ''
    wide_fmt_filename = ''

    for i in range(crm.ws_start[0], crm.ws_end[0]+1):
        ws = wb.worksheets[i]
        ws_notes = {}
        preprocess(wb, ws)
        
        row = process_topic(ws, crm.topic_row)

        while row < crm.last_val_row:
            if row == crm.header_row_list[i]:
                row = row+1
                continue
            color = cell2rgb(ws.cell(row,1))
            if color and color in crm.cat_color_dict:
                if crm.cat_color_dict[color]==0:
                    # Register that we have a new topic
                    topic_cell = ws.cell(row, crm.topic_col)
                    topic_cell.value = ws.cell(row, crm.var_col).value
                    print(f'new topic set at row {row}: {topic_cell.value} (color = {color})')
                    row = process_topic(ws, row)
                else:
                    row = process_categories(ws, row)
                    if row < 0:
                        break
            else:
                print(f'ingest_file: processing {row}')
                row = process_topic(ws, row)
    
        # process_topic (wb, ws, header_row_list[i])
        # preprocess2(wb, ws)
        postprocess(wb, ws)
        # What to do with ws_notes???
        df = ws_to_df(wb, i)
        df.replace('',pd.NA,inplace=True)
        print(f'wb({i}) dataframe')
        display(df.loc[0:min(len(df),45)])
        melted_df = pd.melt(df, id_vars=ingest_columns, var_name='Year', value_name='Value', value_vars=value_vars)
        melted_df.dropna(subset=['Value'],inplace=True)
        melted_df = melted_df.astype({'Year': 'int'})

        if i==crm.ws_start[0]:
            report_year = max(df.columns[crm.val_col-1:crm.last_val_col])
            long_fmt_filename = ''.join([os.environ.get('PWD', '/opt/app-root/src'), '/osc-ingest-shell/data/interim/',
                                         crm.shortname, '_', report_year, '_', 'LONG.xlsx'])
            writer_long = pd.ExcelWriter(long_fmt_filename)
            wide_fmt_filename = ''.join([os.environ.get('PWD', '/opt/app-root/src'), '/osc-ingest-shell/data/interim/',
                                         crm.shortname, '_', report_year, '_', 'WIDE.xlsx'])
            writer_wide = pd.ExcelWriter(wide_fmt_filename)

        # This writes out LONG data with TOPIC as SHEET_NAME.  Later we'll create a truly long table with TOPIC restored as a column
        melted_df.loc[:, melted_df.columns != 'Topic'].to_excel(writer_long, index=False, sheet_name=df.iloc[0]['Topic'][0:30])

        print(ws.title)
        columns = ['Variable', 'Unit']
        # We need these columns to reshape our data
        for extra_col in ['Notes', 'Category', 'Segmentation']:
            if df[extra_col].notna().any():
                columns.append(extra_col)
        # In the case of Shell, we have only one topic per sheet, so can transform melted_df directly
        pf = melted_df.pivot(index=['Year', 'Topic'], columns=columns, values=['Value'])
        pf = pf.droplevel('Topic')
        # Once reshaped, the extra columns actually appear as multi-level indexes.  Drop them from also behaving like values
        pf[[c for c in columns if c not in ['Variable', 'Unit']]] = pd.NA
        pf.dropna(how='all', axis=1, inplace=True)
        pf.to_excel(writer_wide, sheet_name=df.iloc[0]['Topic'][0:30])

    writer_long.close()
    writer_wide.close()
    
    # We are now working with our own workbook, which doesn't have a zero-index sheet to ignore
    # Make the workbook more legible to those reading it
    long_wb = load_workbook(long_fmt_filename, data_only=True)
    for ws in long_wb.worksheets:
        dim_holder = DimensionHolder(worksheet=ws)
        for col in range(ws.min_column, ws.max_column + 1):
            if get_column_letter(col)=='A':
                width = 40
            elif get_column_letter(col) in ['B', 'E']:
                width = 15
            elif get_column_letter(col) in ['C', 'D']:
                width = 25
            else:
                width = 10
            dim_holder[get_column_letter(col)] = ColumnDimension(ws, min=col, max=col, width=width)
        ws.column_dimensions = dim_holder
    
    long_wb.save(long_fmt_filename)
    long_wb.close()
    
    def as_text(value):
        if value is None:
            return ""
        return str(value)
    
    # Write out dataframe in WIDE format.  This data is technically tidy, with one multi-dimensional observation per row.
    # Units/dimensions are consistent on a per-column basis, making it easy to aggregate column-based data.
    wide_wb = load_workbook(wide_fmt_filename, data_only=True)
    # Make the workbook more legible to those reading it
    for ws in wide_wb.worksheets:
        dim_holder = DimensionHolder(worksheet=ws)
        for col in range(ws.min_column, ws.max_column + 1):
            cell = ws.cell(2, col)
            cell.alignment = Alignment(wrap_text=True,vertical='top') 
            dim_holder[get_column_letter(col)] = ColumnDimension(ws, min=col, max=col, width=max(10,1+len(as_text(cell.value))/3))
        ws.column_dimensions = dim_holder

    wide_wb.save(wide_fmt_filename)
    wide_wb.close()

In [16]:
for filename in filename_magic:
    print(filename)
    crm = filename_magic[filename]
    ingest_filename(filename)

print('Done!')

greenhouse-gas-and-energy-data-shell-sr20.xlsx
[]
[]
crop_sheet
77 x 11
25 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-net-carbon-footprint
ingest_file: processing 7
process_topic 7: setting topic Footprint
process_topic 7: setting category Net Carbon Footprint
process_topic 7: setting units g CO2e/ MJ
sub_score = 0
process_category 7: processing variable
process_var 7: using units CO2eq * g / MJ
process_categories: unknown color 00000000 at row 8: Estimated total energy delivered by Shell [B]
-bold
sub_score = -1
process_category 8: processing variable
process_var 8: using units EJ
process_categories: unknown color 00000000 at row 9: Share of energy delivered per energy product type [C]
+bold
+indent
sub_score = 2
process_categories 9: segmenting Share of energy delivered::energy product type [C]
process_var 9: using units CO2eq * g / MJ
process_var 10: using units %
sub_score = 0
sub_score = 0
process_var 11: using units %
sub_score = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,NET CARBON FOOTPRINT [A],,,,,,,,opd-net-carbon-footprint,,,,,
1,Net Carbon Footprint,CO2eq * g / MJ,75.0,78.0,79.0,79.0,79.0,,footprint,Carbon intensity,,,,
2,Estimated total energy delivered by Shell [B],EJ,18.4,21.05,22.0,21.44,20.93,,footprint,Net Carbon Footprint,,,,
3,Share of energy delivered per energy product t...,CO2eq * g / MJ,,,,,,,footprint,Share of energy delivered,energy product type [C],,,
4,Oil products and GTL,%,47.0,56.0,55.0,54.0,54.0,,footprint,Share of energy delivered,energy product type [C],,,
5,Gas,%,21.0,17.0,21.0,23.0,24.0,,footprint,Share of energy delivered,energy product type [C],,,
6,LNG,%,19.0,18.0,16.0,15.0,14.0,,footprint,Share of energy delivered,energy product type [C],,,
7,Biofuels,%,1.0,1.0,1.0,1.0,1.0,,footprint,Share of energy delivered,energy product type [C],,,
8,Power,%,12.0,9.0,7.0,7.0,7.0,,footprint,Share of energy delivered,energy product type [C],,,
9,Total estimated greenhouse gas emissions cover...,CO2eq * Mt,1384.0,1646.0,1731.0,1688.0,1645.0,,footprint,Share of energy delivered,,,,


opd-net-carbon-footprint
[]
[]
crop_sheet
111 x 12
60 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-scope-1-ghg-emissions
process_topic 6: setting category Direct GHG emissions (Scope 1) [A] [B] [C] [D]
process_topic 6: setting units million tonnes CO2e
+bold
+indent
sub_score = 2
process_category 6: processing variable
process_var 6: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 7: Carbon dioxide (CO2)
sub_score = 0
process_category 7: processing variable
process_var 7: found species or units in Carbon dioxide (CO2)
changing units from category: CO2eq * Mt to CO2 * Mt
process_categories: unknown color 00000000 at row 8: Methane (CH4)
sub_score = 0
process_category 8: processing variable
process_var 8: found species or units in Methane (CH4)
changing units from category: CO2eq * Mt to CH4 * kt
process_categories: unknown color 00000000 at row 9: Nitrous oxide (N2O)
sub_score = 0
process_category 9: processing vari

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Direct GHG emissions (Scope 1) [A] [B] [C] [D],CO2eq * Mt,63.0,70.0,71.0,73.0,72.0,,opd-scope-1-ghg-emissions,Total hydrocarbons flared,,CCE-4,EM-EP-110a.1,305-1
1,Carbon dioxide (CO2),CO2 * Mt,61.0,67.0,69.0,70.0,68.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
2,Methane (CH4),CH4 * kt,67.0,91.0,92.0,123.0,138.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
3,Nitrous oxide (N2O),N2O * kt,1.0,1.0,1.0,1.0,1.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
4,Hydrofluorocarbons (HFCs),HFC * t,30.0,29.0,31.0,22.0,22.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
5,Sulphur hexafluoride (SF6),SF6 * t,0.0,0.0,0.0,0.0,2.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
6,Perfluorocarbons (PFC),PFC * t,0.0,0.0,0.0,0.0,0.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
7,Nitrogen trifluoride (NF3),NF3 * t,0.0,0.0,0.0,0.0,0.0,,opd-scope-1-ghg-emissions,Direct GHG emissions (Scope 1) [A] [B] [C] [D],,CCE-4,EM-EP-110a.1,305-1
8,Scope 1 emissions by business,CO2eq * Mt,,,,,,,opd-scope-1-ghg-emissions,Scope 1 emissions,business,,,
9,Upstream,million tonnes CO2e,12.8,12.9,14.8,19.6,19.0,,opd-scope-1-ghg-emissions,Scope 1 emissions,business,CCE-4,EM-EP-110a.1,305-1


opd-scope-1-ghg-emissions
[]
[]
crop_sheet
91 x 12
33 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-scope-2-ghg-emissions
process_topic 6: setting category Scope 2 emissions - market-based method 
process_topic 6: setting units million tonnes CO2e
sub_score = 0
process_category 6: processing variable
process_var 6: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 7: Scope 2 emissions - location-based method
-bold
sub_score = -1
process_category 7: processing variable
process_var 7: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 8: Scope 2 emissions by business (market-based method)
+bold
+indent
sub_score = 2
process_categories 8: segmenting Scope 2 emissions::business (market-based method)
process_var 8: using units CO2eq * Mt
process_var 9: using units million tonnes CO2e
sub_score = 0
sub_score = 0
process_var 10: using units million tonnes CO2e
sub_score = 0
sub_score = 0
process_var 11:

Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Scope 2 emissions - market-based method,CO2eq * Mt,9.0,10.0,11.0,12.0,11.0,,opd-scope-2-ghg-emissions,Scope 2 emissions,,CCE-4,–,305-2
1,Scope 2 emissions - location-based method,CO2eq * Mt,11.0,11.0,11.0,11.0,11.0,,opd-scope-2-ghg-emissions,Scope 2 emissions - market-based method,,CCE-4,–,305-2
2,Scope 2 emissions by business (market-based me...,CO2eq * Mt,,,,,,,opd-scope-2-ghg-emissions,Scope 2 emissions,business (market-based method),,,
3,Upstream,million tonnes CO2e,0.6,1.1,1.4,1.4,1.4,,opd-scope-2-ghg-emissions,Scope 2 emissions,business (market-based method),CCE-4,–,305-2
4,Integrated Gas,million tonnes CO2e,1.5,1.6,2.4,2.4,2.0,,opd-scope-2-ghg-emissions,Scope 2 emissions,business (market-based method),CCE-4,–,305-2
5,Downstream,million tonnes CO2e,7.0,7.3,6.8,7.5,7.5,,opd-scope-2-ghg-emissions,Scope 2 emissions,business (market-based method),CCE-4,–,305-2
6,Other,million tonnes CO2e,0.1,0.2,0.2,0.2,0.2,,opd-scope-2-ghg-emissions,Scope 2 emissions,business (market-based method),CCE-4,–,305-2
7,Scope 2 emissions by country (market-based met...,CO2eq * Mt,,,,,,,opd-scope-2-ghg-emissions,Scope 2 emissions,country (market-based method),,,
8,USA,million tonnes CO2e,3.1,3.1,3.2,3.1,2.7,,opd-scope-2-ghg-emissions,Scope 2 emissions,country (market-based method),CCE-4,–,305-2
9,Netherlands,million tonnes CO2e,1.8,2.1,1.8,1.9,1.8,,opd-scope-2-ghg-emissions,Scope 2 emissions,country (market-based method),CCE-4,–,305-2


opd-scope-2-ghg-emissions
[]
[]
crop_sheet
67 x 12
13 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-greenhouse-gas-intensities
process_topic 6: setting category Upstream and Integrated Gas GHG intensity [A]
process_topic 6: setting units tonne CO2e/tonne production
+indent
sub_score = 1
process_category 6: processing variable
process_var 6: using units CO2eq * production
process_categories: unknown color 00000000 at row 7: Upstream and Integrated Gas GHG intensity [B]
sub_score = -1
process_category 7: processing variable
process_var 7: using units CO2eq * kg / boe
process_categories: unknown color 00000000 at row 8: Refinery GHG intensity [C]
find units: nothing found for tonne CO2e/UEDC
sub_score = 0
process_category 8: processing variable
process_var 8: propagating units CO2eq * production
process_var 9: using units tonne CO2e/tonne production
wb(4) dataframe


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Upstream and Integrated Gas GHG intensity [A],CO2eq * production,0.159,0.168,0.158,0.166,0.166,,opd-greenhouse-gas-intensities,Upstream and Integrated Gas GHG intensity [A],,CCE-4,–,305-4
1,Upstream and Integrated Gas GHG intensity [B],CO2eq * kg / boe,21.0,22.0,21.0,22.0,23.0,,opd-greenhouse-gas-intensities,Upstream and Integrated Gas GHG intensity [A],,CCE-4,–,305-4
2,Refinery GHG intensity [C],CO2eq * production,1.05,1.06,1.05,1.14,1.18,,opd-greenhouse-gas-intensities,Upstream and Integrated Gas GHG intensity [A],,CCE-4,–,305-4
3,Chemical GHG intensity [D],tonne CO2e/tonne production,0.98,1.04,0.96,0.95,0.99,,opd-greenhouse-gas-intensities,Upstream and Integrated Gas GHG intensity [A],,CCE-4,–,305-4
4,[A] In tonnes of Scope 1 and Scope 2 GHG emiss...,,,,,,,,,,,,,
5,[B] In kilograms of Scope 1 and Scope 2 GHG em...,,,,,,,,,,,,,
6,[C] UEDCTM (Utilised Equivalent Distillation C...,,,,,,,,,,,,,
7,[D] High-value chemicals include olefin produc...,,,,,,,,,,,,,


opd-greenhouse-gas-intensities
[]
[]
crop_sheet
79 x 15
21 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-scope-1-2-ghg-emissions
process_topic 6: setting category Direct GHG emissions (Scope 1)
process_topic 6: setting units million tonnes CO2e
+bold
+indent
sub_score = 2
process_category 6: processing variable
process_var 6: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 7: Upstream
sub_score = 0
process_category 7: processing variable
process_var 7: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 8: Integrated Gas
sub_score = 0
process_category 8: processing variable
process_var 8: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 9: Downstream
sub_score = 0
process_category 9: processing variable
process_var 9: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 10: Other
-bold
sub_score = -2
process_category 10: processing variable
process_var

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Direct GHG emissions (Scope 1),CO2eq * Mt,98.0,105.0,102.0,97.0,100,,opd-scope-1-2-ghg-emissions,Scope 2 emissions (location-based method),,CCE-4,EM-EP-110a.1,305-1
1,Upstream,CO2eq * Mt,20.1,21.7,22.2,25.4,25.1,,opd-scope-1-2-ghg-emissions,Direct GHG emissions (Scope 1),,CCE-4,EM-EP-110a.1,305-1
2,Integrated Gas,CO2eq * Mt,24.2,25.9,25.2,24.1,24.6,,opd-scope-1-2-ghg-emissions,Direct GHG emissions (Scope 1),,CCE-4,EM-EP-110a.1,305-1
3,Downstream,CO2eq * Mt,53.2,57.3,53.8,47.1,47.8,,opd-scope-1-2-ghg-emissions,Direct GHG emissions (Scope 1),,CCE-4,EM-EP-110a.1,305-1
4,Other,CO2eq * Mt,0.2,0.2,0.8,0.26,2.4,,opd-scope-1-2-ghg-emissions,Direct GHG emissions (Scope 1),,CCE-4,EM-EP-110a.1,305-1
5,Scope 2 emissions (market-based method),CO2eq * Mt,9.0,11.0,11.0,13.0,13,,opd-scope-1-2-ghg-emissions,Scope 2 emissions (market-based method),,CCE-4,–,305-2
6,Upstream,CO2eq * Mt,0.7,1.2,1.3,1.3,1.5,,opd-scope-1-2-ghg-emissions,Scope 2 emissions (market-based method),,CCE-4,–,305-2
7,Integrated Gas,CO2eq * Mt,1.0,1.1,1.8,2.0,1.6,,opd-scope-1-2-ghg-emissions,Scope 2 emissions (market-based method),,CCE-4,–,305-2
8,Downstream,CO2eq * Mt,7.1,8.0,7.7,9.2,8.4,,opd-scope-1-2-ghg-emissions,Scope 2 emissions (market-based method),,CCE-4,–,305-2
9,Other,CO2eq * Mt,0.1,0.2,0.2,0.23,1.3,,opd-scope-1-2-ghg-emissions,Scope 2 emissions (market-based method),,CCE-4,–,305-2


opd-scope-1-2-ghg-emissions
[]
[]
crop_sheet
71 x 12
21 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-scope-3-greenhosue-gas-em
find units: nothing found for Category 1
ingest_file: processing 7
worksheet opd-scope-3-greenhosue-gas-em: unknown topic Third-party products [C]
process_topic 7: setting category Third-party products [C]
process_topic 7: setting units million tonnes CO2e
-bold
sub_score = -2
process_category 7: processing variable
process_var 7: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 8: Fuel and energy-related activities (not included in Scope 1 or Scope 2) (Category 3)
+bold
+indent
sub_score = 2
process_category 8: processing variable
process_var 8: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 9: Third-party power [D]
-bold
sub_score = -2
process_category 9: processing variable
process_var 9: using units CO2eq * Mt
process_categories: unknown color 00000000 at row 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Purchased goods and services (Category 1),,,,,,,,opd-scope-3-greenhosue-gas-em,,,,,
1,Third-party products [C],CO2eq * Mt,147.0,178.0,190.0,186.0,172.0,,opd-scope-3-greenhosue-gas-em,Use,,CCE-4,–,305-3
2,Fuel and energy-related activities (not includ...,CO2eq * Mt,,,,,,,opd-scope-3-greenhosue-gas-em,Fuel and energy-related activities (not includ...,,,,
3,Third-party power [D],CO2eq * Mt,103.0,102.0,96.0,87.0,89.0,,opd-scope-3-greenhosue-gas-em,Fuel and energy-related activities (not includ...,,CCE-4,–,305-3
4,Use of sold products (Category 11),CO2eq * Mt,,,,,,,opd-scope-3-greenhosue-gas-em,Use,sold products (Category 11),,,
5,Use of sold products [E] [F],million tonnes CO2e,1054.0,1271.0,1351.0,1318.0,1284.0,,opd-scope-3-greenhosue-gas-em,Use,sold products (Category 11),CCE-4,–,305-3
6,Own production [G],million tonnes CO2e,452.0,564.0,594.0,582.0,593.0,,opd-scope-3-greenhosue-gas-em,Use,sold products (Category 11) ::Use of sold prod...,CCE-4,–,305-3
7,Third-party products [H],million tonnes CO2e,602.0,708.0,757.0,736.0,681.0,,opd-scope-3-greenhosue-gas-em,Use,sold products (Category 11) ::Use of sold prod...,CCE-4,–,305-3
8,[A] The values in this table reflect estimated...,,,,,,,,,,,,,
9,[B] Estimated emissions from other Scope 3 cat...,,,,,,,,,,,,,


opd-scope-3-greenhosue-gas-em
[]
[]
crop_sheet
66 x 12
9 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-other-greenhouse-gas-data
ingest_file: processing 7
worksheet opd-other-greenhouse-gas-data: unknown topic CO2 captured and stored
process_topic 7: setting category CO2 captured and stored
process_topic 7: setting units million tonnes
sub_score = 0
process_category 7: processing variable
process_var 7: using units mt
process_var 8: using units million tonnes
wb(7) dataframe


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Carbon capture and storage and CO2 transfer out,,,,,,,,opd-other-greenhouse-gas-data,,,,,
1,CO2 captured and stored,mt,0.94,1.13,1.07,1.14,1.11,,opd-other-greenhouse-gas-data,CO2 captured and stored,,CCE-3,EM-EP-530a.1,305-5
2,CO2 transferred out [A],million tonnes,0.3,0.43,0.46,0.45,0.58,,opd-other-greenhouse-gas-data,CO2 captured and stored,,CCE-3,EM-EP-530a.1,305-5
3,[A] CO2 captured and transferred to another or...,,,,,,,,,,,,,


opd-other-greenhouse-gas-data
[]
[]
crop_sheet
66 x 12
9 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-carbon-offsets
process_topic 6: setting category Total carbon offsets retired
process_topic 6: setting units million tonnes
+bold
+indent
sub_score = 2
process_categories 6: segmenting Total carbon offsets retired::carbon offsets retired
process_var 6: using units mt
process_var 7: using units million tonnes
process_var 8: using units million tonnes
process_segmentation: stack at end = [<Cell 'opd-carbon-offsets'.K6>]
wb(8) dataframe


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Total carbon offsets retired,mt,,,,,,,opd-carbon-offsets,Total carbon offsets retired,carbon offsets retired,,,
1,Included in Net Carbon Footprint,million tonnes,3.9,2.2,0,0,0,,opd-carbon-offsets,Total carbon offsets retired,carbon offsets retired,–,EM-EP-530a.1,305-5
2,Other carbon offsets,million tonnes,0.4,0.5,n/c,n/c,n/c,,opd-carbon-offsets,Total carbon offsets retired,carbon offsets retired,–,EM-EP-530a.1,305-5
3,n/c - not collected,,,,,,,,,,,,,


opd-carbon-offsets
[]
[]
crop_sheet
78 x 12
22 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title opd-energy-use
process_topic 6: setting category Total energy use
process_topic 6: setting units million MWh
+bold
+indent
sub_score = 2
process_categories 6: segmenting Total energy use::energy use
process_var 6: using units TWh
process_var 7: using units million MWh
sub_score = 0
sub_score = 0
process_var 8: using units million MWh
sub_score = 0
sub_score = 0
process_var 9: using units million MWh
sub_score = 0
sub_score = 0
process_var 10: using units million MWh
sub_score = 0
sub_score = 0
process_var 11: using units million MWh
-bold
sub_score = -2
-bold
sub_score = -2
pop at row 12: segmentation_stack now []
process_categories: unknown color 00000000 at row 12: Consumption of energy from renewable sources
+bold
+indent
sub_score = 2
process_categories 12: segmenting Consumption::energy from renewable sources
process_var 12: using units TWh
process

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Total energy use,TWh,240.0,264,268,269,266,,opd-energy-use,Energy intensity,energy use,CCE-6,–,302-1
1,Own energy generated,million MWh,219.0,236,240,241,238,,opd-energy-use,Total energy use,energy use,CCE-6,–,302-1
2,Imported electricity,million MWh,22.0,27,26,26,23,,opd-energy-use,Total energy use,energy use,CCE-6,–,302-1
3,Imported steam and heat,million MWh,17.0,17,15,17,20,,opd-energy-use,Total energy use,energy use,CCE-6,–,302-1
4,Exported electricity,million MWh,12.0,10,10,10,8,,opd-energy-use,Total energy use,energy use,CCE-6,–,302-1
5,Exported steam and heat,million MWh,5.0,6,3,5,7,,opd-energy-use,Total energy use,energy use,CCE-6,–,302-1
6,Consumption of energy from renewable sources,TWh,,,,,,,opd-energy-use,Consumption,energy from renewable sources,,,
7,Renewable sources - onsite energy generation c...,million MWh,0.01,n/c,n/c,n/c,n/c,,opd-energy-use,Consumption,energy from renewable sources,CCE-6,–,302-1
8,Renewable sources - purchased electricity,million MWh,1.8,1.5,0.03,0.03,0.03,,opd-energy-use,Consumption,energy from renewable sources,CCE-6,–,302-1
9,Renewable sources - purchased steam,million MWh,0.0,n/c,n/c,n/c,n/c,,opd-energy-use,Consumption,energy from renewable sources,CCE-6,–,302-1


opd-energy-use
[]
[]
crop_sheet
84 x 11
9 x 11
['2020', '2019', '2018', '2017', '2016']
process_topic 6: setting topic from title sef-sales-gas-power
process_topic 6: setting units from var: tBtu
process_topic 6: setting category Gas (tBtu)
process_topic 6: setting units tBtu
sub_score = 0
process_category 6: processing variable
process_var 6: found species or units in Gas (tBtu)
Inferring/composing units: tBtu
process_var 7: found species or units in Power (TWh)
Inferring/composing units: TWh
changing units from category: tBtu to TWh
wb(10) dataframe


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2020,2019,2018,2017,2016,Notes,Topic,Category,Segmentation,IPIECA,SASB,GRI
0,Gas (tBtu),tBtu,3009.0,2720.0,3246.0,3276.0,3298.0,,sef-sales-gas-power,Gas (tBtu),,,,
1,Power (TWh),TWh,252.0,207.0,179.0,165.0,169.0,,sef-sales-gas-power,Gas (tBtu),,,,
2,"In certain cases, prior to 2019, it was not po...",,,,,,,,,,,,,
3,"[A] From 2019, gas and power sales volumes are...",,,,,,,,,,,,,


sef-sales-gas-power
2021-Data-Centerv1.xlsx
[]
[]
crop_sheet
306 x 4
44 x 4
['2018', '2019', '2020']
process_topic 2: setting topic from title energy
process_categories: new category set at row 3: Owned Generation Capacity (MW) (color = FF40B14B)
process_categories 3: setting units from var: MW
process_category 3: processing variable
process_var 3: found species or units in Owned Generation Capacity (MW)
Inferring/composing units: MW
+bold
+halign
sub_score = 2
process_categories 4: segmenting Total Owned Nameplate Generation Capacity::Owned Nameplate Generation Capacity
process_var 4: using units MW
process_var 5: propagating units MW
sub_score = 0
sub_score = 0
process_var 6: propagating units MW
sub_score = 0
sub_score = 0
process_var 7: propagating units MW
sub_score = 0
sub_score = 0
process_var 8: propagating units MW
sub_score = 0
sub_score = 0
process_var 9: propagating units MW
sub_score = 0
sub_score = 0
process_var 10: propagating units MW
sub_score = 0
sub_score = 0
process

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,2018,2019,2020,Notes,Topic,Category,Segmentation,Unit
0,Energy,,,,,energy,,,
1,Owned Generation Capacity (MW),,,,,energy,Total Owned Nameplate Generation Capacity,,MW
2,Total Owned Nameplate Generation Capacity,25447,25490.0,25490.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
3,Coal,14056,13230.0,13230.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
4,Natural Gas,7809,7678.0,7678.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
5,Nuclear,2278,2288.0,2288.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
6,Total Renewable Energy Resources,1304,2294.0,2294.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
7,Hydroelectric,853,853.0,853.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
8,Solar,190,229.0,229.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW
9,Wind,261,1212.0,1212.0,,energy,Total Owned Nameplate Generation Capacity,Owned Nameplate Generation Capacity,MW


Energy
[]
[]
crop_sheet
16 x 4
16 x 4
['2018', '2019', '2020']
process_topic 2: setting topic from title emissions
find units: nothing found for From AEP owned facilities only
process_categories: new category set at row 3: Scope 1 emissions breakdown (color = FFBDBDBD)
process_category 3: processing variable
process_var 3: propagating units None
process_categories: unknown color FFE5E5E5 at row 4: CO2 (Metric Tons)
find units: nothing found for Metric Tons
sub_score = 0
process_category 4: processing variable
find units: nothing found for Metric Tons
process_var 4: unhandled ( Metric Tons )
process_categories: unknown color FFE5E5E5 at row 5: SO2 (Lbs)
process_categories 5: setting units from var: Lbs
sub_score = 0
process_category 5: processing variable
process_var 5: found species or units in SO2 (Lbs)
Inferring/composing units: SO2 * lb
changing units from category: None to SO2 * lb
process_categories: unknown color FFE5E5E5 at row 6: SO2 (MT)
process_categories 6: setting units fro

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,2018,2019,2020,Notes,Topic,Category,Segmentation,Unit
0,Emissions\n(From AEP owned facilities only),,,,,emissions,,,
3,SO2 (Lbs),137291386.0,104466600.0,65521730.0,,emissions,Scope 1 emissions breakdown,,SO2 * lb
4,SO2 (MT),62274.0,47385.0,29720.0,,emissions,Scope 1 emissions breakdown,,SO2 * mt
5,NOx (Lbs),99830243.0,78809060.0,52889170.0,,emissions,Scope 1 emissions breakdown,,lb * nox
6,NOx (MT),45282.0,35747.0,23990.0,,emissions,Scope 1 emissions breakdown,,mt * nox
7,Mercury (Lbs),395.0,311.0,195.0,,emissions,Scope 1 emissions breakdown,,Mercury * lb
8,Mercury (kg),179.0,141.1,88.6,,emissions,Scope 1 emissions breakdown,,Mercury * kg
11,CO2 (MT),74661649.0,64157260.0,48807820.0,,emissions,Scope 1 Emissions GHG CO2e,,CO2 * mt


Emissions
[]
[]
crop_sheet
28 x 4
28 x 4
['2018', '2019', '2020']
process_topic 2: setting topic from title water
process_categories: new category set at row 3: Total Water Withdrawal (color = FFB9DDFC)
find units: nothing found for Comanche Plant
process_categories 3: segmenting Total Water Withdrawal::Water Withdrawal
process_var 3: propagating units None
find units: nothing found for Gallons/day
process_var 4: unhandled ( Million Gallons/day )
sub_score = 0
sub_score = 0
find units: nothing found for Gallons/ year
process_var 5: unhandled ( Million Gallons/ year )
sub_score = 0
sub_score = 0
find units: nothing found for Liters/year
process_var 6: unhandled ( Billions of Liters/year )
sub_score = 0
sub_score = 0
process_var 7: found species or units in Total Water Withdrawal (Millions of m3/year)
Inferring/composing units: Gl / a
changing units from category: None to Gl / a
pop at row 8: segmentation_stack now []
process_categories: new category set at row 8: Water Withdrawal by Sou

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,2018,2019,2020,Notes,Topic,Category,Segmentation,Unit
0,Water,,,,,water,,,
5,Total Water Withdrawal (Millions of m3/year),5773.0,5506.0,5569.0,,water,Total Water Withdrawal,Water Withdrawal,Gl / a
9,Surface Water Withdrawal (m3/year)\n*excludes ...,5763987000.0,5498195000.0,5562475000.0,,water,Water Withdrawal,Source Breakdown,kl / a
10,Groundwater (m3/year),7306941.0,6984231.0,5474566.0,,water,Water Withdrawal,Source Breakdown,kl / a
11,Other (m3/year) \n*Represents Comanche Plant W...,1525092.0,1138772.0,1156019.0,,water,Water Withdrawal,Source Breakdown,kl / a


Water
[]
[]
crop_sheet
13 x 4
13 x 4
['2018', '2019', '2020']
process_topic 2: setting topic from title waste
process_categories: new category set at row 3: Facility Waste Generation (color = FFEEDCCA)
find units: nothing found for Waste data does not include waste streams from competitive portion of business
process_category 3: processing variable
process_var 3: propagating units None
process_categories: unknown color FFF2E8DE at row 4: Recycled Paper and Office Waste (Lbs)
process_categories 4: setting units from var: Lbs
sub_score = 0
process_category 4: processing variable
process_var 4: found species or units in Recycled Paper and Office Waste (Lbs)
Inferring/composing units: lb
changing units from category: None to lb
process_categories: unknown color FFF2E8DE at row 5: Recycled Scrap Metal Waste (Lbs)
process_categories 5: setting units from var: Lbs
sub_score = 0
process_category 5: processing variable
process_var 5: found species or units in Recycled Scrap Metal Waste (Lbs)
In

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,2018,2019,2020,Notes,Topic,Category,Segmentation,Unit
0,Waste,,,,,waste,,,
2,Recycled Paper and Office Waste (Lbs),382000.0,159300.0,67581.0,,waste,Facility Waste Generation,,lb
3,Recycled Scrap Metal Waste (Lbs),50500000.0,28950000.0,87031171.0,,waste,Facility Waste Generation,,lb
4,Batteries Recycled (Lbs),216000.0,169000.0,171545.0,,waste,Facility Waste Generation,,lb
5,Electronic Waste Recycled( Lbs),234000.0,430000.0,28183.0,,waste,Facility Waste Generation,,lb
6,Light Bulbs Recycled (Lbs),40200.0,44500.0,26727.0,,waste,Facility Waste Generation,,lb
7,Recycled used Oil (Gallons),400500.0,725500.0,273994.0,,waste,Facility Waste Generation,,gal
9,Total CCPs Generated (Tons),4846451.0,4123465.84,2908761.0,,waste,Coal Combustion Products,,ton


Waste
DPDHL-ESG-Statbook-2020-en.xlsx
[]
[]
crop_sheet
90 x 13
80 x 10
color = FF00B050
color = 00000000
color = 00000000
color = 00000000
color = 00000000
color = 00000000
color = 00000000
['2016', '2017', '2018', '2019', '2020']
worksheet Environmental Group Overview: unknown topic Environmental Data at Group levels
ingest_file: processing 2
worksheet Environmental Group Overview: unknown topic The Group's material topics: Carbon Efficiency & Climate Protection, Air Pollution
ingest_file: processing 3
worksheet Environmental Group Overview: unknown topic Key Performance Indicator : Carbon Efficiency Index 
ingest_file: processing 4
worksheet Environmental Group Overview: unknown topic Relevant GRI-indicators: 103 and 305
ingest_file: processing 5
worksheet Environmental Group Overview: unknown topic Relevant SASB-codes: TR-AF-110a.1 - 3; 120a.1
ingest_file: processing 6
worksheet Environmental Group Overview: unknown topic Materiality analysis
ingest_file: processing 7
process_topic 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2016,2017,2018,2019,2020,Notes,Topic,Category,Segmentation,YoY,Comment
0,Carbon emissions,,,,,,,,emissions,,,,"2020 ESG Presentation, slides 12ff"
1,KPI: Carbon Efficiency Index (CEX),Number,30,32,33,35,37.0,,emissions,CO2e emissions total,,0.057143,"Target 2021: 38%, Target 2025: 50%\nBase year ..."
2,CO2e emissions total,CO2eq * m * t,26.86,28.86,29.46,27.42,27.38,,emissions,CO2e emissions total,CO2e emissions,-0.001459,"Metric tons= 1,000 kg"
3,Scope 1,Number,5.68,5.9,6.3,6.27,6.58,,emissions,CO2e emissions total,CO2e emissions,0.049442,
4,Scope 2 (market-based),Number,0.37,0.44,0.27,0.21,0.19,,emissions,CO2e emissions total,CO2e emissions,-0.095238,
5,Scope 3,Number,20.81,22.52,22.89,20.94,20.61,,emissions,CO2e emissions total,CO2e emissions,-0.015759,
6,CO2e emissions by modes,Share,,,,,,,emissions,CO2e emissions,modes,,
7,Air transport,Share,0.65,0.64,0.64,0.64,0.66,,emissions,CO2e emissions,modes,,
8,Ocean transport,Share,0.11,0.12,0.13,0.12,0.1,,emissions,CO2e emissions,modes,,
9,Road transport,Share,0.21,0.21,0.21,0.22,0.22,,emissions,CO2e emissions,modes,,


Environmental Group Overview
[]
[]
crop_sheet
49 x 10
49 x 9
color = FF00B050
color = 00000000
color = 00000000
color = 00000000
['2016', '2017', '2018', '2019', '2020']
worksheet Environmental Data by Division: unknown topic Environmental Data by Divisions
ingest_file: processing 2
worksheet Environmental Data by Division: unknown topic The Group's material topics: Carbon Efficiency & Climate Protection, Air Pollution
ingest_file: processing 3
worksheet Environmental Data by Division: unknown topic Key Performance Indicator: Carbon Efficiency Index 
find units: nothing found for CEX
ingest_file: processing 4
process_topic 4: no var text
new topic set at row 6: Post & Parcel  Germany (color = FF00B050)
process_categories: new category set at row 7: CEX (color = E2F0D9)
process_category 7: processing variable
process_var 7: using units Number
process_categories: unknown color 00000000 at row 8: CO2e emissions total
+indent
sub_score = 1
process_categories 8: segmenting CO2e emissions to

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2016,2017,2018,2019,2020,Notes,Topic,Category,Segmentation,Y-o-Y,Comment
0,Post & Parcel Germany,,,,,,,,Post & Parcel Germany,,,,
1,CEX,Number,31,31,39.0,41.0,45.0,,Post & Parcel Germany,CO2e emissions total,,0.097561,
2,CO2e emissions total,CO2eq * m * t,1.85,2.14,1.36,1.33,1.32,,Post & Parcel Germany,CO2e emissions total,CO2e emissions,-0.007519,"Metric tons = 1,000 kg"
3,Scope 1,Number,0.53,0.54,0.36,0.36,0.36,,Post & Parcel Germany,CO2e emissions total,CO2e emissions,0.0,
4,Scope 2 (market-based),Number,0.03,0.09,0.05,0.05,0.04,,Post & Parcel Germany,CO2e emissions total,CO2e emissions,-0.2,
5,Scope 3,Number,1.29,1.51,0.95,0.92,0.92,,Post & Parcel Germany,CO2e emissions total,CO2e emissions,0.0,
6,Energy consumption (own operations) total,kWh * m,1861,1903,1913.0,1895.0,1974.0,,Post & Parcel Germany,Energy consumption (own operations) total,,0.041689,
7,Express,,,,,,,,Express,,,,
8,CEX,Number,37,39,38.0,38.0,41.0,,Express,CO2e emissions total,,0.078947,
9,CO2e emissions total,CO2eq * m * t,9.42,9.71,10.77,10.62,12.09,,Express,CO2e emissions total,CO2e emissions,0.138418,"Metric tons = 1,000 kg"


Environmental Data by Division
[]
[]
crop_sheet
42 x 10
40 x 9
color = FF00B050
color = 00000000
color = 00000000
['2016', '2017', '2018', '2019', '2020']
worksheet Group Fleet Data: unknown topic Fleet Data at Group Levels
ingest_file: processing 2
worksheet Group Fleet Data: unknown topic The Group's material topics: Carbon Efficiency & Climate Protection, Air Pollution
ingest_file: processing 3
process_topic 3: no var text
new topic set at row 5: Air fleet (jets and feeders) (color = FF00B050)
process_topic 5: setting category Air fleet 
process_topic 5: setting units Total no.
process_category 5: processing variable
find units: nothing found for jets and feeders
process_var 5: unhandled ( jets and feeders )
process_categories: new category set at row 6: Jets by NOx emission standards (color = E2F0D9)
process_categories 6: segmenting Jets::NOx emission standards
process_var 6: using units Number
process_var 7: propagating units Number
sub_score = 0
sub_score = 0
process_var 8: propa

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2016,2017,2018,2019,2020,Notes,Topic,Category,Segmentation,YoY,Comment
0,Air fleet (jets and feeders),Number,,,> 260,> 260,> 280,,Air fleet (jets and feeders),Air fleet (jets and feeders),,,"2020 ESG Presentation, slide 20"
1,Jets by NOx emission standards,Number,190,208.0,214,219,246,,Air fleet (jets and feeders),Jets,NOx emission standards,0.123288,
2,CAEP/8,Number,38,43.0,50,55,67,,Air fleet (jets and feeders),Jets,NOx emission standards,0.218182,
3,CAEP/6,Number,74,82.0,85,88,101,,Air fleet (jets and feeders),Jets,NOx emission standards,0.147727,
4,CAEP/4,Number,38,45.0,42,36,38,,Air fleet (jets and feeders),Jets,NOx emission standards,0.055556,
5,CAEP/2,Number,23,17.0,12,13,13,,Air fleet (jets and feeders),Jets,NOx emission standards,0.0,
6,Unclassified,Number,17,21.0,25,27,27,,Air fleet (jets and feeders),Jets,NOx emission standards,0.0,
7,Jets by noise standards,Number,190,208.0,214,219,246,,Air fleet (jets and feeders),Jets,noise standards,0.123288,
8,Chapter 14,Number,32,41.0,49,59,73,,Air fleet (jets and feeders),Jets,noise standards,0.237288,
9,Chapter 4,Number,127,133.0,130,131,141,,Air fleet (jets and feeders),Jets,noise standards,0.076336,


Group Fleet Data
Unilever sustainability performance data_Climate FINAL.xlsx


  warn("DrawingML support is incomplete and limited to charts and images only. Shapes and drawings will be lost.")


[]
[]
crop_sheet
48 x 1007
47 x 12
['2020', '2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012', '2011', '2010']
worksheet Sustainability performance data: unknown topic CLIMATE ACTION
new topic set at row 11: Value chain GHG emissions (color = FFEBF1DE)
process_topic 11: setting topic emissions
ingest_file: processing 12
process_topic 12: setting topic emissions
process_topic 12: setting units from var: tonnes CO2
process_topic 12: setting category Unilever operations: Scope 1 GHG emissions 
process_topic 12: setting units tonnes CO2
sub_score = 0
process_category 12: processing variable
process_var 12: found species or units in Unilever operations: Scope 1 GHG emissions (tonnes CO2)
Inferring/composing units: CO2 * t
process_categories: unknown color FFFFFF at row 13: Unilever operations: Scope 2 GHG emissions (tonnes CO2)
process_categories 13: setting units from var: tonnes CO2
sub_score = 0
process_category 13: processing variable
process_var 13: found species or units 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,2020,2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,Notes,Topic,Category,Segmentation,Unit
0,Value chain GHG emissions,,,,,,,,,,,,,emissions,,,
1,Unilever operations: Scope 1 GHG emissions (to...,606771.0,659028.0,758232.0,820355.3,882972.7,890801.0,,,,,,,emissions,Unilever operations: Scope 1 GHG emissions,,CO2 * t
2,Unilever operations: Scope 2 GHG emissions (to...,171906.0,443897.0,893825.0,938943.8,1007953.0,1071076.0,,,,,,,emissions,Unilever operations: Scope 1 GHG emissions,,CO2 * t
3,Upstream and downstream of Unilever operations...,60388592.33,61020357.3,62017584.93,59823310.0,61118870.0,57758580.0,,,,,,,emissions,Unilever operations: Scope 1 GHG emissions,,CO2eq * t
4,Ingredients and packaging:,14239917.96,14897174.0,15367491.0,16539560.0,17903180.0,17083190.0,,,,,,,Ingredients and packaging:,,,
5,Ingredients,11270832.65,11782483.0,12416204.0,13409640.0,14534100.0,13787060.0,,,,,,,,,,
6,Primary packaging,2326020.65,2389654.84,2246470.77,2379938.0,2436180.0,2343797.0,,,,,,,sustainability performance data,,,
7,Secondary packaging,376701.46,462670.66,464439.9,522212.0,553198.0,596911.0,,,,,,,sustainability performance data,,,
8,Inbound transport,266363.2,262364.83,240376.47,227771.0,379703.0,355423.0,,,,,,,sustainability performance data,,,
9,Distribution and retail:,4055333.37,4379729.3,4368625.93,4089568.0,4137010.0,4002856.0,,,,,,,Distribution and retail:,,,


Sustainability performance data
esg-tables.xlsx


  warn(msg)
  warn(msg)
  warn(msg)


[]
[]
crop_sheet
132 x 28
122 x 7
['2017', '2020', '2019', '2018']
process_topic 3: setting topic from title environmental
process_topic 3: setting category Scope 1 (direct) emissions
process_topic 3: setting units tonnes CO2e
process_category 3: processing variable
find units: nothing found for direct
process_var 3: unhandled ( direct )
process_categories: unknown color FFFFFFFF at row 4: Normalized Scope 1 emissions
process_category 4: processing variable
process_var 4: using units CO2eq * revenue * t / MM
process_categories: new category set at row 5: Scope 1 emissions by business division (color = FF92D050)
process_categories 5: segmenting Scope 1 emissions::business division
process_var 5: propagating units None
process_var 6: using units tonnes CO2e
sub_score = 0
sub_score = 0
process_var 7: using units tonnes CO2e
sub_score = 0
sub_score = 0
process_var 8: using units tonnes CO2e
sub_score = 0
sub_score = 0
process_var 9: using units tonnes CO2e
sub_score = 0
sub_score = 0
proce

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(


Unnamed: 0,Variable,Unit,2017,2020,2019,2018,Notes,Topic,Category,Segmentation,Assurance Letter
0,Scope 1 (direct) emissions,CO2eq * t,167720,139868.0,154507,162162,,environmental,Scope 1 (direct) emissions,,
1,Normalized Scope 1 emissions,CO2eq * revenue * t / MM,6.56,5.348067,6.15,6.39,,environmental,Scope 1 (direct) emissions,,
3,Altria Group Distribution Company,tonnes CO2e,20779,10333.0,17076,19422,,environmental,Scope 1 emissions,business division,
4,Altria Client Services,tonnes CO2e,12839,10862.0,14924,14001,,environmental,Scope 1 emissions,business division,
5,John Middleton,tonnes CO2e,3885,3042.0,3110,3424,,environmental,Scope 1 emissions,business division,
6,Nat Sherman,tonnes CO2e,22.25,13.9,25.7,24.21,,environmental,Scope 1 emissions,business division,
7,Nu Mark,tonnes CO2e,0,0.0,0,0,,environmental,Scope 1 emissions,business division,
8,Philip Morris USA,tonnes CO2e,102659,85258.0,93063,99701,,environmental,Scope 1 emissions,business division,
9,Ste. Michelle Wine Estates,tonnes CO2e,6376,14114.0,7557,6508,,environmental,Scope 1 emissions,business division,
10,U.S. Smokeless Tobacco Co.,tonnes CO2e,21141,16238.0,18744,18921,,environmental,Scope 1 emissions,business division,


Environmental
Done!


In [17]:
segmentation_stack

[]

In [18]:
ws.cell(11,1).border.top.style

'thin'

In [19]:
columns = ['Variable', 'Unit']
# We need these columns to reshape our data
for extra_col in ['Notes', 'Category', 'Segmentation']:
    if melted_df[extra_col].notna().any():
        columns.append(extra_col)
melted_df[melted_df['Segmentation']=='(anon)::Road transport']

Unnamed: 0,Variable,Notes,Topic,Category,Segmentation,Unit,Year,Value


In [20]:
units = 1e6 * ureg('kl/year')
units.to_compact()

In [21]:
df = ws_to_df(wb, 2)

IndexError: list index out of range

In [None]:
melted_df = pd.melt(df, id_vars=ingest_columns, var_name='Year', value_name='Value', value_vars=value_vars)
melted_df.dropna(subset=['Value'],inplace=True)
melted_df = melted_df.astype({'Year': 'int'})
melted_df.loc[:, melted_df.columns != 'Topic']

In [None]:
df.iloc[0:2]

In [None]:
find_units('trillion (10^12) MJ')

In [None]:
from cell2rgb import cell2rgb
cell2rgb(ws.cell(8, crm.units_col))

In [None]:
gj = f"{ureg('1e9 J').to_compact():P}".split(' ', 1)[1]

In [None]:
ureg(gj)

In [None]:
ureg('1e9 J').to_compact().u

In [None]:
scale_regex = re.compile(r'^(((mi|bi|tri|quadri)llion)|(thousand)|(hundred))(s of)? ', flags=re.I)
scale_regex = re.compile(r'^((mi|bi|tri|quadri)llion|thousand|hundred)(s of)? ', flags=re.I)
sc_xlate[re.search(scale_regex, 'thousand ').group(1)[0:3]]

In [None]:
bletch!

In [None]:
crm = filename_magic['DPDHL-ESG-Statbook-2020-en.xlsx']
wb = load_workbook(crm.input_filename, data_only=True)
ws = wb.worksheets[2]
preprocess(wb, ws)

row = process_topic(pc, crm.topic_row)


In [None]:
def preprocess2(wb, ws):
    global crm
    
    scope1_gases = ['CO2', 'CH4', 'N2O', 'HFC', 'SF6', 'PFC', 'NF3', 'CO2e', 'NOx', 'SO2', 'PM10']
    scope1_regex = re.compile('(' + ')|('.join(scope1_gases) + ')', flags=re.I)
    
    scope3_dict = { 'Purchased Goods and Services':1,
                    'Capital Goods':2,
                    'Fuel and Energy Related Activities':3,
                    'Fuel and Energy Related Activities (Market-Based)':3,
                    'Fuel and Energy Related Activities (Location-Based)':3,
                    'Upstream Transportation and Distribution':4,
                    'Transportation services':4,                # DPDHL
                    'Fuel- and energy-related activities':4,    # DPDHL
                    'Waste Generated in Operations (Large office campuses)':5,
                    'Business Travel':6,
                    'Employee Commuting':7,
                    'Upstream Leased Assets':8,
                    'Downstream Transportation and Distribution':9,
                    'Processing of Sold Products':10,
                    'Use of Sold Products':11,
                    'End of Life Treatment of Sold Products':12,
                    'Downstream Leads Assets':13,
                    'Franchises':14,
                    'Investments':15 }

    def normalize_scope3(s3):
        # Later we should normalize against the scope3_dict
        return s3
    
    # This only returns notes and a value, and it presumes the cell is None, Numeric, or would be Numeric, but for the note.
    def split_value_cell(c):
        if c.value==None:
            return '', ''
        
        notes = ''
        v = str(c.value)
        m = re.search(r'(\([a-z]\))+', v)
        if m:
            notes = m.group(0)
            v = v.replace(notes,'').strip().replace(',','')
            if '.' in v:
                v = float(v)
            else:
                v = int(v)
            return notes, v
        return '', c.value
    
    # Convert reported units to things standard in `pint`
    unit_dict = { 'trillion (10^12) MJ':'PJ', 'million MWh':'TWh', 'MW':'MW', 
                  'million tonnes CO2e':'Mt CO2e', 'tonnes CO2e':'t CO2e', 'm t CO2e':'Mt CO2e', 'MT CO2e':'Mt CO2e',
                  'million tonnes':'Mt', 'thousand tonnes':'kilot', 'tonnes':'t', 'kg':'kg', 'MT':'Mt', 'Lbs':'lbs', 'Metric Tons':'t', 
                  'tBtu':'TBtu',
                  'm liter':'M liter', 'Grams per € revenue':'Grams / EUR',
                  'Millions of m3':'1000 dam', 'm3':'m3', 'Gallons':'gal',
                  'Million Gallons':'M gallons', 'Billions of Liters':'10^9 l', 'billions of Liters':'10^9 l',}
    u2u_dict = { '%':'pct', 'Grams per € revenue':'Grams / EUR', 'revenue':'EUR', 'MM$ revenue':'1000000 revenue',
                 'short ton':'short_ton', 'No.':'[]', 'Nb':'[]', }
    scale_regex = re.compile(r'^(((mi|bi|tri|quadri)llion)|(thousand)|(hundred))(s of)? ', flags=re.I)
    sc_xlate = {'hun':1e2, 'tho':1e3, 'mil':1e6, 'bil':1e9, 'tri':1e12, 'qua':1e15}
    def normalize_units(u, g):
        scale = 1.0
        m = re.search(scale_regex, u)
        if m:
            u = u[m.end(0):].strip()
            scale = sc_xlate[m.group(0)[0:3].lower()]
        m = re.search(r'((short)|(long)|(metric))( )ton', u, flags=re.I)
        if m:
            u = '_'.join([u[0:m.start(5)], u[m.start(5)+1:]])
            print(u)
        if u in u2u_dict:
            u = u2u_dict[u]
        if g and g not in u:
            u = ' '.join([u, g])
        return ureg(u) * scale
        
        if g in u:
            g = ''
        if '/' in u:
            u1, u2 = u.split('/', 1)
            if g in u2:
                g1 = ''
                g2 = g
            else:
                g1 = g
                g2 = ''
            return ' / '.join([normalize_units(u1, g1), normalize_units(u2, g2)])
        u = u.strip()
        if u in unit_dict and u!=unit_dict[u]:
            return normalize_units(unit_dict[u], g)
        if g:
            return ' '.join([u, g])
        return u
    
    def finish_notes(row):
        print('finish_notes @ {}'.format(row))
    
    if crm.topic_row==None:
        topic = ws.title
    else:
        topic = ws.cell(crm.topic_row,crm.var_col).value
    
    notes = ''
    category = ''
    categories = ['', '', '']
    segmentation = ''
    scope1_gas = ''
    units = ''
    
    # Make the inferences, filling out TOPIC : CATEGORY : SEGMENTATION, as well as inferring/adjusting UNITS
    # If we start with no Units column, then all units can be carried across from parenthetical expressions in Variable
    # If we do have a Units column, either it's fully expressed (like Shell),
    # or a prevailing unit can be carried down (and cross-combined with parenthetical expressions in Variable)
    
    for row in range(crm.header_row+1, crm.last_val_row+1):
        cell = ws.cell(row, crm.var_col)
        
        # Needed to put dataframe together later
        if (crm.topic_col < crm.var_col):
            topic = ws.cell(row, crm.topic_col).value
        else:
            ws.cell(row, crm.topic_col).value = topic
        
        # Carry-forward comes from state variables: category, segmentation, units
        if cell.value==None:
            continue
        
        # *BOLD* text indicates we have a header to parse, as do particular colors
        if cell.fill.fgColor.type == 'rgb':
            cat_color = format(cell.fill.fgColor.rgb)
        else:
            theme = cell.fill.start_color.theme
            tint = cell.fill.start_color.tint
            cat_color = theme_and_tint_to_rgb(wb, theme, tint)
        if cat_color not in crm.cat_color_dict:
            cat_color = None

        if cell.font.b or cat_color:
            notes, category, segmentation = split_cell(cell)

            if cat_color==None:
                # Shell doesn't use colors
                categories[0] = re.sub(r' total\s?', '', category)
            else:
                categories[crm.cat_color_dict[cat_color]] = re.sub(r' total\s?', '', category)
                for i in range(crm.cat_color_dict[cat_color]+1, len(categories)):
                    categories[i] = ''
                category = ':'.join([c for c in categories[0:2] if c])
            if re.search(r'Scope\s*3', category, flags=re.I):
                m = re.search(r'Scope\s+3 (CO2e?\s*)?(emissions\s+)(by.*categor((y)|(ies)))?', category, flags=re.I)
                category = 'Scope 3 emissions'
                segmentation = 'GHG Categories'
            else:
                if categories[2]:
                    if segmentation:
                        print('cat[2] = {}; segmentation = {}'.format(categories[2], segmentation))
                    segmentation = categories[2]

        if crm.units_row >= 0 and ws.cell(row, crm.units_col).value:
            crm.units_row = row
            # We might refine this as "tonnes of WHAT", depending on Variable
            units = normalize_units (ws.cell(row, crm.units_col).value, '')

        if any([True for col in range(crm.val_col, crm.last_val_col+1) if ws.cell(row, col).value]):
            # Has a variable observation.  Need to set/use units
            if crm.units_row < 0 and any([True for col in range(crm.val_col, crm.last_val_col+1) if ws.cell(row, col).value]):
                # We have to dig out units for each and every variable row
                notes, var_text, unit_text = split_cell(cell)
                s1_gases = [ g for g in var_text.split(' ') if re.search(scope1_regex, g) ]
                s1_gas = '' if s1_gases==[] else s1_gases[0]
                if unit_text not in ['market-based', 'location_based']:
                    units = normalize_units (unit_text, s1_gas)
                # Could also check for fuel types, water types, and other things
                ws.cell(row, crm.units_col).value = format(units.u, '~')
            elif crm.units_row >= 0:
                # If we have no units, borrow from following row (see 'opd-scope-1-2-ghg-emissions')
                # Theory: if border line heavier above, borrow from below; if heavier below, borrow from above

                # from openpyxl.styles.borders import Border, Side, BORDER_THIN
                # thin_border = Border(
                #     top=Side(border_style=BORDER_THIN, color='00000000'),
                #     bottom=Side(border_style=BORDER_THIN, color='00000000')
                # )
                # ws.cell(row=3, column=2).border = thin_border
                
                if ws.cell(row, crm.units_col).value==None and ws.cell(row+1, crm.units_col).value!=None:
                    ws.cell(row, crm.units_col).value = normalize_units(ws.cell(row+1, crm.units_col).value, '')

                # If there is no disclosure here, move on with the notes/category/segmentation we've captured
                if not any([True for col in range(crm.val_col, crm.last_val_col+1) if ws.cell(row, col).value]):
                    continue

                # ??? Should correctly compute segmentation here and pass as argument to normalize_scope3
                if category == 'Scope 3 emissions' and segmentation == 'GHG Categories':
                    ws.cell(row, crm.var_col).value = normalize_scope3 (ws.cell(row, crm.var_col).value)
                
                # Try to get units and category from variable description by using a found unit, inferring from previous rows,
                # and possibly combining with other info in the variable (such as gas species)
                if ws.cell(row, crm.units_col).value==None:
                    ws.cell(row, crm.units_col).value = units
                units = ureg(ws.cell(row, crm.units_col).value)
                
                maybe_notes, maybe_category, xyzzy = split_cell(cell)
                if maybe_var_units:
                    maybe_var_units = re.sub(r'\((.*)\)', r'\1', maybe_var_units)
                    print('maybe_var_units: {}'.format(maybe_var_units))
                    if ' per ' in maybe_var_units:
                        u1, u2 = maybe_var_units.split(' per ', 1)
                        units = normalize_units(u1, '') / normalize_units(u2, '')
                    elif maybe_var_units in ureg:
                        units = normalize_units (maybe_var_units, '')
                    else:
                        error('maybe_var_units: {}'.format(maybe_var_units))
                    # We carry down units, whether we changed them this row or not
                    ws.cell(row, crm.units_col).value = format(units.u, '~')
                notes, category = maybe_notes, maybe_category

        # Now fill the empty columns we created with the metadata we have inferred
        ws.cell(row, crm.category_col).value = category
        ws.cell(row, crm.segmentation_col).value = segmentation
        if 'emissions' in category.lower() and not re.search('CO2e', ws.cell(row, crm.units_col).value, flags=re.I):
            m = re.search(scope1_regex, str(ws.cell(row, crm.var_col).value))
            if m:
                scope1_gas = m.group(0)
            # else it carries forward
        else:
            scope1_gas = ''
        if ws.cell(row, crm.units_col).value!=None:
            units = normalize_units(ws.cell(row, crm.units_col).value, scope1_gas)
            ws.cell(row, crm.units_col).value = format(units.u, '~')
        # Find notes hiding in values 
        for col in range(crm.val_col, crm.last_val_col+1):
            # print('cell({},{}) = {}'.format(row,col,ws.cell(row,col).value))
            maybe_notes, value = split_value_cell(ws.cell(row, col))
            if maybe_notes:
                if maybe_notes not in notes:
                    notes = notes + maybe_notes
                ws.cell(row,col).value = value
        # Scan for notes in remaining columns, but don't scan again the columns we ourselves created
        # (namely notes, topic, category, segmentation, and possibly units)
        for col in range(crm.last_val_col+1, crm.notes_col):
            # print('cell({},{}) = {}'.format(row,col,ws.cell(row,col).value))
            maybe_notes, main_text, error_if_nonempty = split_cell(ws.cell(row, col))
            if maybe_notes:
                if maybe_notes not in notes:
                    notes = notes + maybe_notes
                if error_if_nonempty:
                    error('error_if_nonempty={}; cell({},{}) = {}'.format(error_if_nonempty,row,col,ws.cell(row, col)))
                ws.cell(row,col).value = main_text
        ws.cell(row, crm.notes_col).value = notes

### Time for a Pint!

See https://github.com/IAMconsortium/units/issues/9https://github.com/IAMconsortium/units/issues/9
and https://github.com/openscm/openscm-units/issues/31https://github.com/openscm/openscm-units/issues/31
and 

In [None]:
import pandas as pd
import pint_pandas
from openscm_units import unit_registry

pint_pandas.PintType.ureg = u = unit_registry

one_co2 = unit_registry("CO2")
print(one_co2)

x = pd.DataFrame([[2.0,'Mt CO2']], columns=['Value', 'Unit'])
print(x)
x = x.astype({'Value': 'pint[Mt CO2]'})
print(x.Value.pint.to('t CO2'))

In [None]:
u('Mt/1000000').to_compact()

In [None]:
PA_ = pint_pandas.PintArray

ureg = unit_registry
Q_ = ureg.Quantity

Note that pint[unit] must be used for the Series constuctor, whereas the PintArray constructor allows the unit string or object.

```
    df = pd.DataFrame({
        "length" : pd.Series([1.,2.], dtype="pint[m]"),
        "width" : PA_([2.,3.], dtype="pint[m]"),
        "distance" : PA_([2.,3.], dtype="m"),
        "height" : PA_([2.,3.], dtype=ureg.m),
        "depth" : PA_.from_1darray_quantity(Q_([2,3],ureg.m)),
    })
```

See https://pint.readthedocs.io/en/0.18/pint-pandas.html

In [None]:
wb = load_workbook(long_fmt_filename, data_only=True)

from itertools import islice

def long_ws_to_df(ws):
    data = ws.values
    cols = next(data)
    data = list(data)
    # idx = [r[0] for r in data]
    # data = (islice(r, 1, None) for r in data)
    
    df = pd.DataFrame(data, columns=cols)

    # The original data has topic we construct.  It is removed when writing LONG data but can be restored from SHEET_NAME
    if 'Topic' not in df.columns:
        print('Restoring Topic ' + ws.title)
        df.insert(crm.topic_col-1, 'Topic', ws.title)
    
    return df

trino_df = pd.concat([long_ws_to_df(ws) for ws in wb.worksheets])
    
len(trino_df)

In [None]:
print(trino_df['Unit'].value_counts())
trino_df.Unit.unique()

Now create data in Trino

In [None]:
import boto3

# Create an S3 client.  We will user later when we write out data and metadata
s3 = boto3.client(
    service_name="s3",
    endpoint_url=os.environ['S3_DEV_ENDPOINT'],
    aws_access_key_id=os.environ['S3_DEV_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_DEV_SECRET_KEY'],
)

In [None]:
import trino

conn = trino.dbapi.connect(
    host=os.environ['TRINO_HOST'],
    port=int(os.environ['TRINO_PORT']),
    user=os.environ['TRINO_USER'],
    http_scheme='https',
    auth=trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    verify=True,
)
cur = conn.cursor()

# Show available schemas to ensure trino connection is set correctly
cur.execute('show schemas in osc_datacommons_dev')
cur.fetchall()

In [None]:
import datetime
# datetime.datetime.now()
# For now we used a fixed date so we don't fill things up needlessly
timestamp = "2008-09-03T20:56:35.450686Z"

In [None]:
ingest_uuid = str(uuid.uuid4())

custom_meta_key_fields = 'metafields'
custom_meta_key = 'metaset'

schemaname = 'osc_corp_data'
cur.execute('create schema if not exists osc_datacommons_dev.' + schemaname)
cur.fetchall()

For osc_datacommons_dev, a trino pipeline is a parquet data stored in the S3_DEV_BUCKET
It is a 5-step process to get there from a pandas dataframe

In [None]:
def create_trino_pipeline (s3, schemaname, tablename, timestamp, df, meta_fields, meta_content):
    global ingest_uuid
    global custom_meta_key_fields, custom_meta_key
    
    # First convert dataframe to pyarrow for type conversion and basic metadata
    table = pa.Table.from_pandas(enforce_sql_column_names(df))
    # Second, since pyarrow tables are immutable, create a new table with additional combined metadata
    if meta_fields or meta_content:
        meta_json_fields = json.dumps(meta_fields)
        meta_json = json.dumps(meta_content)
        existing_meta = table.schema.metadata
        combined_meta = {
            custom_meta_key_fields.encode(): meta_json_fields.encode(),
            custom_meta_key.encode(): meta_json.encode(),
            **existing_meta
        }
        table = table.replace_schema_metadata(combined_meta)
    # Third, convert table to parquet format (which cannot be written directly to s3)
    pq.write_table(table, '/tmp/{sname}.{tname}.{uuid}.{timestamp}.parquet'.format(sname=schemaname, tname=tablename, uuid=ingest_uuid, timestamp=timestamp))
    # df.to_parquet('/tmp/{sname}.{tname}.{uuid}.parquet'.format(sname=schemaname, tname=tablename, uuid=ingest_uuid, index=False))
    # Fourth, put the parquet-ified data into our S3 bucket for trino.  We cannot compute parquet format directly to S3 but we can copy it once computed
    s3.upload_file(
        Bucket=os.environ['S3_DEV_BUCKET'],
        Key='trino/{sname}/{tname}/{uuid}/{timestamp}/{tname}.parquet'.format(sname=schemaname, tname=tablename, uuid=ingest_uuid, timestamp=timestamp),
        Filename='/tmp/{sname}.{tname}.{uuid}.{timestamp}.parquet'.format(sname=schemaname, tname=tablename, uuid=ingest_uuid, timestamp=timestamp)
    )
    # Finally, create the trino table backed by our parquet files enhanced by our metadata
    cur.execute('.'.join(['drop table if exists osc_datacommons_dev', schemaname, tablename]))
    print('dropping table: ' + tablename)
    cur.fetchall()
    
    schema = create_table_schema_pairs(df)

    tabledef = """create table if not exists osc_datacommons_dev.{sname}.{tname}(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{bucket}/trino/{sname}/{tname}/{uuid}/{timestamp}'
)""".format(schema=schema,bucket=os.environ['S3_DEV_BUCKET'],sname=schemaname,tname=tablename,uuid=ingest_uuid,timestamp=timestamp)
    print(tabledef)

    # tables created externally may not show up immediately in cloud-beaver
    cur.execute(tabledef)
    cur.fetchall()

### Write out Report with metadata

Create the actual metadata for the source.  In this case, it is osc_corp_data.

In [None]:
custom_meta_content = {}
metadata_text = """Title: AEP GHG and Energy Report, 2020
Description: 
Version: 2020
Release Date: 
URI: https://reports.shell.com/sustainability-report/2020/our-performance-data/greenhouse-gas-and-energy-data.html
Copyright: 
License: 
Contact: 
Citation: """

for line in metadata_text.split('\n'):
    k, v = line.split(':', 1)
    k = sql_compliant_name(k)
    custom_meta_content[k] = v

custom_meta_content['abstract'] = """Abstract text"""
custom_meta_content['name'] = 'osc_corp_data'

Create the metadata for all the fields in all the tables

Create custom meta data and key

In [None]:
shell_df

In [None]:
tablename = 'aep_2020'
custom_meta_fields = {}
create_trino_pipeline (s3, schemaname, tablename, timestamp, shell_df, custom_meta_fields, custom_meta_content)

Restore data and metadata

In [None]:
# Everything below here is speculative / in process of design

## Load metadata following an ingestion process into trino metadata store

### The schema is *metastore*, and the table names are *meta_schema*, *meta_table*, *meta_field*

In [None]:
# Create metastore structure
metastore = {'catalog':'osc_datacommons_dev',
             'schema':'aep_2020',
             'table':tablename,
             'metadata':custom_meta_content,
             'uuid':ingest_uuid}
# Create DataFrame
df_meta = pd.DataFrame(metastore)
# Print the output
df_meta

In [None]:
help(iam_units)

In [None]:
help(registry)