# Relevance Inference
This notebook takes the text extracted in the PDF preprocessing stage and the fine-tuned relevance model from the training stage, and runs inference on the input text.

In [1]:
import os
import pandas as pd
import pathlib

from src.models.relevance_infer import TextRelevanceInfer
from config_farm_train import InferConfig
import config
from src.data.s3_communication import S3Communication, S3FileType
from dotenv import load_dotenv
import zipfile

07/16/2022 15:48:16 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
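
The lookup order used in the cell above (first `CREDENTIAL_DOTENV_DIR`, then `PWD`, then a hard-coded default) can be checked in isolation. The sketch below is illustrative only: `resolve_dotenv_path` is a hypothetical helper that is not part of the repository, and it takes a plain mapping instead of `os.environ` so the behaviour is easy to verify.

```python
import pathlib
from typing import Mapping


def resolve_dotenv_path(
    env: Mapping[str, str], default_dir: str = "/opt/app-root/src"
) -> pathlib.PurePosixPath:
    """Mirror the notebook's lookup: CREDENTIAL_DOTENV_DIR, then PWD, then default."""
    dotenv_dir = env.get("CREDENTIAL_DOTENV_DIR", env.get("PWD", default_dir))
    return pathlib.PurePosixPath(dotenv_dir) / "credentials.env"


# Hypothetical environments, used only to show the precedence.
explicit = resolve_dotenv_path({"CREDENTIAL_DOTENV_DIR": "/secrets", "PWD": "/work"})
from_pwd = resolve_dotenv_path({"PWD": "/work"})
fallback = resolve_dotenv_path({})

print(explicit)  # /secrets/credentials.env
print(from_pwd)  # /work/credentials.env
print(fallback)  # /opt/app-root/src/credentials.env
```

Because the notebook only calls `load_dotenv` when the file exists, a wrong directory is skipped silently, and the S3 settings then come from whatever is already set in the environment.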

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)
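
`os.getenv` returns `None` for a variable that is not set, so a missing credential only surfaces later, when the connector is first used. A minimal pre-check, assuming the four variable names from the cell above are all required, might look like this. `missing_env_vars` is a hypothetical helper, and the example values are placeholders.

```python
from typing import List, Mapping, Sequence

# Variable names taken from the S3Communication call in the notebook.
REQUIRED_VARS = (
    "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET",
)


def missing_env_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the required names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Placeholder environment; in the notebook one would pass os.environ instead.
example_env = {"S3_ENDPOINT": "https://s3.example.invalid", "S3_BUCKET": "demo-bucket"}
missing = missing_env_vars(example_env)
print(missing)  # ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```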

In [4]:
infer_config = InferConfig("infer_demo")

In [5]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # extracted pdfs
    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

    # inference results dir
    if not os.path.exists(infer_config.result_dir['Text']):
        pathlib.Path(infer_config.result_dir['Text']).mkdir(parents=True, exist_ok=True)

    # load dir
    if not os.path.exists(infer_config.load_dir['Text']):
        pathlib.Path(infer_config.load_dir['Text']).mkdir(parents=True, exist_ok=True)

    # download extracted pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )
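
The three directory checks above follow the same pattern. Since `mkdir(parents=True, exist_ok=True)` already does nothing when the directory exists, the pattern could be condensed as in the sketch below. `ensure_dirs` is a hypothetical helper, and the paths are temporary stand-ins for the configured folders.

```python
import pathlib
import tempfile


def ensure_dirs(*dirs):
    """Create each directory (and its parents) if needed; safe to call repeatedly."""
    paths = []
    for d in dirs:
        path = pathlib.Path(d)
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the extraction, result, and load directories.
    targets = [
        pathlib.Path(tmp, "data", "extraction"),
        pathlib.Path(tmp, "data", "infer_relevance"),
        pathlib.Path(tmp, "models", "RELEVANCE"),
    ]
    ensure_dirs(*targets)
    ensure_dirs(*targets)  # second call is a no-op and raises no error
    all_exist = all(p.is_dir() for p in targets)

print(all_exist)  # True
```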

In [6]:
model_root = pathlib.Path(infer_config.load_dir['Text']).parent
model_rel_zip = pathlib.Path(model_root, 'RELEVANCE.zip')

# This file must be downloaded because model training doesn't save a local copy
s3c.download_file_from_s3(model_rel_zip, config.BASE_SAVED_MODELS_S3_PREFIX, "RELEVANCE.zip")

with zipfile.ZipFile(model_rel_zip, 'r') as z:
    z.extractall(model_root)
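
If the download fails or the archive is truncated, `ZipFile` raises an error that can be hard to trace back to the S3 step. One possible guard, sketched with the standard library only, validates the archive before extracting it. `extract_model_archive` and the placeholder member name are assumptions made for illustration; the real contents of `RELEVANCE.zip` are not shown in this notebook.

```python
import pathlib
import tempfile
import zipfile


def extract_model_archive(zip_path, target_dir):
    """Validate a zip archive, extract it, and return the member names."""
    zip_path = pathlib.Path(zip_path)
    # is_zipfile returns False for a missing file as well as for a non-zip file.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as z:
        bad_member = z.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise ValueError(f"corrupt member in {zip_path}: {bad_member}")
        z.extractall(target_dir)
        return z.namelist()


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    archive = root / "RELEVANCE.zip"
    # Build a tiny stand-in archive; the member name is a placeholder.
    with zipfile.ZipFile(archive, "w") as z:
        z.writestr("RELEVANCE/placeholder.txt", "stand-in for model files")

    members = extract_model_archive(archive, root)
    extracted_ok = (root / "RELEVANCE" / "placeholder.txt").is_file()

    rejects_missing = False
    try:
        extract_model_archive(root / "missing.zip", root)
    except ValueError:
        rejects_missing = True

print(members)          # ['RELEVANCE/placeholder.txt']
print(extracted_ok)     # True
print(rejects_missing)  # True
```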

The inference parameters come from `InferConfig`. We advise that you manually update them in the corresponding config file:

`esg_data_pipeline/config/config_farm_trainer.py`

## Inference

### Loading the model

The following cells print the configured directories, download the KPI mapping, and load the trained model.

In [7]:
print(infer_config.load_dir)
print(infer_config.extracted_dir)
print(infer_config.result_dir)

{'Text': '/opt/app-root/src/aicoe-osc-demo-MichaelTiemannOSC/models/RELEVANCE'}
/opt/app-root/src/aicoe-osc-demo-MichaelTiemannOSC/data/extraction
{'Text': '/opt/app-root/src/aicoe-osc-demo-MichaelTiemannOSC/data/infer_relevance'}


In [8]:
kpi_df = s3c.download_df_from_s3(
    f"{config.EXPERIMENT_NAME}/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

Unnamed: 0,kpi_id,question,sectors,add_year,kpi_category
0,1.0,What is the company name?,"OG, CM, CU",False,TEXT
1,2.0,What is the Start Date of the CDP report publi...,"OG, CM, CU",False,TEXT
2,3.0,What is the End Date of the CDP report published?,"OG, CM, CU",False,TEXT
3,4.0,What is the currency used for all financial in...,"OG, CM, CU",False,TEXT
4,5.0,Did you have an emissions target that was acti...,"OG, CM, CU",False,TEXT


In [9]:
component = TextRelevanceInfer(infer_config, kpi_df)



### Prediction on a Single Example

In [10]:
input_text = "The company is going to reduce 8% in gas production"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)



[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is going to reduce 8% in gas production',
    'label': '1',
    'probability': 0.99976236}]}]

In [11]:
input_text = "The company is about semi conductors"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is about semi conductors',
    'label': '0',
    'probability': 0.5169579}]}]
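
The nested structure returned by `run_text` can be flattened for easier inspection. The sketch below works on the two outputs shown above, copied in as literals. It assumes that label `'1'` means "relevant", that `context` joins question and paragraph with a `|`, and that `probability` refers to the predicted label; these readings are inferred from the outputs and are not confirmed by the source. `summarize_prediction` is a hypothetical helper.

```python
def summarize_prediction(result):
    """Flatten a run_text-style result into one dict per prediction."""
    rows = []
    for task_output in result:
        for pred in task_output["predictions"]:
            # Assumption: the first "|" separates the question from the paragraph.
            question, _, paragraph = pred["context"].partition("|")
            rows.append(
                {
                    "question": question,
                    "paragraph": paragraph,
                    "relevant": pred["label"] == "1",  # assumption: '1' = relevant
                    "probability": pred["probability"],
                }
            )
    return rows


# Outputs copied from the two cells above.
relevant_output = [{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is going to reduce 8% in gas production',
    'label': '1',
    'probability': 0.99976236}]}]
irrelevant_output = [{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is about semi conductors',
    'label': '0',
    'probability': 0.5169579}]}]

rows = summarize_prediction(relevant_output) + summarize_prediction(irrelevant_output)
for row in rows:
    print(row["relevant"], row["probability"], row["paragraph"])
```

Under these assumptions, the second example is a low-confidence "not relevant" prediction (about 0.52), while the first is a high-confidence "relevant" one.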

### Prediction on an Entire Folder

`run_folder()` makes predictions on all the JSON files in the `/data/extraction` folder. This can take some time, depending on the number of JSON files.

In [12]:
component.run_folder()

07/16/2022 15:48:28 - INFO - src.models.relevance_infer -   #################### Starting Relevence Inference for the following extracted pdf files found in /opt/app-root/src/aicoe-osc-demo-MichaelTiemannOSC/data/infer_relevance:
['2020-cdp-climate-response-checkpoint', '2020-cdp-climate-response', 'Adobe_CDP_Climate_Change_Questionnaire_2021', 'Apple_CDP-Climate-Change-Questionnaire_2021', 'Bayer AG Climate Change 2021', 'Corning_Incorporated_CDP_Climate_Change_Questionnaire_2021_FINAL', 'Michelin-CDP-Climate-Change-2021_def', 'NextEra Energy 2021 CDP Response', 'PGE_Corporation_CDP_Climate_Change_Questionnaire_2021', 'Unilever CDP Climate Response', 'bp-cdp-climate-change-questionnaire-2021', 'gap_inc-_cdp_climate_change_questionnaire_2021', 'vodafone-group-cdp-climate-change-questionnaire2021'] 
07/16/2022 15:48:28 - INFO - src.models.relevance_infer -   #################### 1/13 PDFs
07/16/2022 15:48:28 - INFO - src.models.relevance_infer -   The relevance infer results for 2020-cd

The results are saved as CSV files in the result directory. For each paragraph predicted as relevant, the file stores the KPI question (`text`), the extracted paragraph (`text_b`), and the page number in the source PDF (`page`).

In [13]:
df_table_results = pd.read_csv(
    os.path.join(infer_config.result_dir['Text'], "NextEra Energy 2021 CDP Response_predictions_relevant.csv")
)
df_table_results.head(20)

Unnamed: 0.1,Unnamed: 0,page,pdf_name,text,text_b,source
0,0,0,NextEra Energy 2021 CDP Response,What is the company name?,"NextEra Energy, Inc. (NextEra Energy) is one o...",Text
1,1,0,NextEra Energy 2021 CDP Response,What is the company name?,FPL is the largest electric utility in the sta...,Text
2,2,0,NextEra Energy 2021 CDP Response,What is the company name?,"NextEra Energy Resources, together with its af...",Text
3,3,0,NextEra Energy 2021 CDP Response,What is the company name?,Capital investment is central to executing our...,Text
4,4,0,NextEra Energy 2021 CDP Response,What is the company name?,"For decades, we have focused on building a bus...",Text
5,5,0,NextEra Energy 2021 CDP Response,What is the company name?,Executing on this vision exemplifies what it m...,Text
6,6,3,NextEra Energy 2021 CDP Response,What is the company name?,President and CEO of NextEra Energy Resources:,Text
7,7,3,NextEra Energy 2021 CDP Response,What is the company name?,Rationale for why these responsibilities are a...,Text
8,8,10,NextEra Energy 2021 CDP Response,What is the company name?,Company-specific description The transition to...,Text
9,9,19,NextEra Energy 2021 CDP Response,What is the company name?,NextEra Energy Resources is at the leading edg...,Text
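
A quick way to see where the relevant paragraphs sit in a document is to count rows per page. The sketch below uses only the standard library and rebuilds a small sample from the `page`, `pdf_name`, and `text` columns displayed above. The `text_b` column is left out because the display truncates it, and `rows_per_page` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Page numbers copied from the ten rows displayed above.
pages = [0, 0, 0, 0, 0, 0, 3, 3, 10, 19]
sample_csv = "page,pdf_name,text\n" + "".join(
    f"{p},NextEra Energy 2021 CDP Response,What is the company name?\n" for p in pages
)


def rows_per_page(csv_file):
    """Count how many result rows fall on each page."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["page"]) for row in reader)


counts = rows_per_page(io.StringIO(sample_csv))
print(sorted(counts.items()))  # [(0, 6), (3, 2), (10, 1), (19, 1)]
```

With pandas, the same count on the real file would be `df_table_results.groupby("page").size()`.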


In [14]:
# upload the predicted files to s3
s3c.upload_files_in_dir_to_prefix(
    infer_config.result_dir['Text'],
    config.BASE_INFER_RELEVANCE_S3_PREFIX
)

# Conclusion
This notebook ran _Relevance_ inference on a sample dataset and stored the output in CSV format.