# Relevance Inference
This notebook takes the text extracted in the PDF preprocessing stage and the fine-tuned relevance model from the training stage, and runs inference on the extracted text.

In [1]:
import os
import pandas as pd
import pathlib
from src.models.relevance_infer import TextRelevanceInfer
from config_farm_train import InferConfig
import config
from src.data.s3_communication import S3Communication, S3FileType
from dotenv import load_dotenv
import zipfile

07/12/2022 21:44:54 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
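
The S3 connector reads four environment variables. The following is a minimal sketch, assuming you want to fail early when any of them is unset. `missing_env_vars` is a hypothetical helper, not part of the project; the variable names are the ones this notebook passes to `S3Communication`.

```python
import os

# Names taken from the S3Communication call in this notebook.
REQUIRED_VARS = (
    "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.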

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [4]:
infer_config = InferConfig("infer_demo")

In [6]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # extracted pdfs
    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

    # inference results dir
    if not os.path.exists(infer_config.result_dir['Text']):
        pathlib.Path(infer_config.result_dir['Text']).mkdir(parents=True, exist_ok=True)

    # load dir
    if not os.path.exists(infer_config.load_dir['Text']):
        pathlib.Path(infer_config.load_dir['Text']).mkdir(parents=True, exist_ok=True)

    # download extracted pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )

In [8]:
model_root = pathlib.Path(infer_config.load_dir['Text']).parent
model_rel_zip = pathlib.Path(model_root, 'RELEVANCE.zip')
s3c.download_file_from_s3(model_rel_zip, config.BASE_SAVED_MODELS_S3_PREFIX, "RELEVANCE.zip")

with zipfile.ZipFile(model_rel_zip, 'r') as z:
    z.extractall(model_root)

We advise that you manually update the parameters in the corresponding config file:

`esg_data_pipeline/config/config_farm_trainer.py`

## Inference

### Loading the model

The following cells print the configured directories, download the KPI mapping, and load the trained model.

In [9]:
print(infer_config.load_dir)
print(infer_config.extracted_dir)
print(infer_config.result_dir)

{'Text': '/opt/app-root/src/aicoe-osc-demo/models/RELEVANCE'}
/opt/app-root/src/aicoe-osc-demo/data/extraction
{'Text': '/opt/app-root/src/aicoe-osc-demo/data/infer_relevance'}


In [10]:
kpi_df = s3c.download_df_from_s3(
    f"{config.EXPERIMENT_NAME}/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

Unnamed: 0,kpi_id,question,sectors,add_year,kpi_category,Unnamed: 5,Unnamed: 6
0,0.0,What is the company name?,"OG, CM, CU",False,TEXT,,
1,1.0,In which year was the annual report or the sus...,"OG, CM, CU",False,TEXT,,
2,2.0,What is the total volume of proven and probabl...,OG,True,"TEXT, TABLE",,
3,2.1,What is the volume of estimated proven hydroca...,OG,True,"TEXT, TABLE",,
4,2.2,What is the volume of estimated probable hydro...,OG,True,"TEXT, TABLE",,


In [11]:
component = TextRelevanceInfer(infer_config, kpi_df)

Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
[Interleaved output from ForkPoolWorker-2 through ForkPoolWorker-15 repeats the same traceback header; the remaining output is truncated.]

### Prediction on a Single Example

In [12]:
input_text = "The company is going to reduce 8% in gas production"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is going to reduce 8% in gas production',
    'label': '1',
    'probability': 0.73437506}]}]

In [13]:
input_text = "The company is about semi conductors"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is about semi conductors',
    'label': '0',
    'probability': 0.989326}]}]
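
`run_text` returns a nested list of dictionaries, as the two outputs above show. The sketch below flattens that structure into `(label, probability)` pairs. `extract_predictions` is a hypothetical helper written against the output shape printed in this notebook, and the sample is copied from the first example. Reading label `1` as "relevant" is an assumption based on the two examples. `probability` appears to be the confidence of the predicted label rather than of label `1`, but this notebook does not confirm that.

```python
def extract_predictions(results):
    """Flatten run_text-style output into (label, probability) tuples."""
    pairs = []
    for task_result in results:
        for pred in task_result.get("predictions", []):
            pairs.append((pred["label"], float(pred["probability"])))
    return pairs


# Sample copied from the output of the first example above.
sample = [{
    "task": "text_classification",
    "predictions": [{
        "start": None,
        "end": None,
        "context": "Is the company going to go green?|"
                   "The company is going to reduce 8% in gas production",
        "label": "1",
        "probability": 0.73437506,
    }],
}]

for label, probability in extract_predictions(sample):
    # Assumption: label "1" means the paragraph is relevant to the question.
    verdict = "relevant" if label == "1" else "not relevant"
    print(f"{verdict} (label={label}, probability={probability:.3f})")
```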

### Prediction on an Entire Folder

`run_folder()` makes predictions on all the JSON files in the `/data/extraction` folder. The run time depends on the number of JSON files.

In [14]:
component.run_folder()

04/08/2022 18:47:22 - INFO - src.models.relevance_infer -   #################### Starting Relevence Inference for the following extracted pdf files found in /opt/app-root/src/aicoe-osc-demo/data/infer_relevance:
['sustainability-report-2019'] 
04/08/2022 18:47:22 - INFO - src.models.relevance_infer -   #################### 1/1 PDFs
04/08/2022 18:47:22 - INFO - src.models.relevance_infer -   Running inference for sustainability-report-2019:
04/08/2022 18:47:22 - INFO - src.models.relevance_infer -   ###### Received 726 examples for Text, number of questions: 24
04/08/2022 18:49:22 - INFO - src.models.relevance_infer -   Saved 559 relevant Text examples for sustainability-report-2019 in /opt/app-root/src/aicoe-osc-demo/data/infer_relevance/sustainability-report-2019_predictions_relevant.csv


Unnamed: 0,page,pdf_name,text,text_b,source
0,1,sustainability-report-2019,What is the company name?,Equinor supports the Paris agreement and a net...,Text
1,1,sustainability-report-2019,What is the company name?,broad energy company is founded on a strong co...,Text
2,1,sustainability-report-2019,What is the company name?,Equinor and partners reached a final investmen...,Text
3,1,sustainability-report-2019,What is the company name?,awarded five major contracts. Equinor is posit...,Text
4,1,sustainability-report-2019,What is the company name?,Equinor is a values-based company. How we deli...,Text
...,...,...,...,...,...
554,29,sustainability-report-2019,"What is the total amount of scope 1, scope 2 a...",GHG emissions associated with the production a...,Text
555,29,sustainability-report-2019,"What is the total amount of scope 1, scope 2 a...",Indirect GHG emissions from energy imported fr...,Text
556,29,sustainability-report-2019,"What is the total amount of scope 1, scope 2 a...",Upstream carbon dioxide (CO₂ ) emission intensity,Text
557,29,sustainability-report-2019,"What is the total amount of scope 1, scope 2 a...",Total scope one emissions of CO₂ (kg CO₂) from...,Text


The results are saved in a CSV file. For each relevant question-paragraph pair, the file stores the question (`text`), the extracted paragraph (`text_b`), and the page number in the source PDF file.

In [15]:
df_table_results = pd.read_csv(infer_config.result_dir['Text'] + "/sustainability-report-2019_predictions_relevant.csv")
df_table_results.head(20)

Unnamed: 0.1,Unnamed: 0,page,pdf_name,text,text_b,source
0,0,1,sustainability-report-2019,What is the company name?,Equinor supports the Paris agreement and a net...,Text
1,1,1,sustainability-report-2019,What is the company name?,broad energy company is founded on a strong co...,Text
2,2,1,sustainability-report-2019,What is the company name?,Equinor and partners reached a final investmen...,Text
3,3,1,sustainability-report-2019,What is the company name?,awarded five major contracts. Equinor is posit...,Text
4,4,1,sustainability-report-2019,What is the company name?,Equinor is a values-based company. How we deli...,Text
5,5,1,sustainability-report-2019,What is the company name?,2019 marked the start-up of Johan Sverdrup – t...,Text
6,6,1,sustainability-report-2019,What is the company name?,"For almost 50 years, Equinor has dedicated its...",Text
7,7,1,sustainability-report-2019,What is the company name?,Equinor is partnering with SSE Renewables to d...,Text
8,8,1,sustainability-report-2019,What is the company name?,Equinor Sustainability report 2019Introduction,Text
9,9,2,sustainability-report-2019,What is the company name?,"We are Equinor, an international energy compan...",Text
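
Each row pairs a question (`text`) with a paragraph (`text_b`), so one quick way to inspect the results is to count the relevant paragraphs per question. The following is a minimal sketch with pandas on a small hand-made frame that uses the same column names as the CSV. The rows are illustrative placeholders, not actual results.

```python
import pandas as pd

# Illustrative rows only; column names match the predictions CSV above.
df = pd.DataFrame(
    {
        "page": [1, 1, 2, 29],
        "pdf_name": ["sustainability-report-2019"] * 4,
        "text": ["question A", "question A", "question A", "question B"],
        "text_b": ["paragraph 1", "paragraph 2", "paragraph 3", "paragraph 4"],
        "source": ["Text"] * 4,
    }
)

# Per question: number of relevant paragraphs and number of distinct pages.
summary = (
    df.groupby("text")
    .agg(n_paragraphs=("text_b", "size"), n_pages=("page", "nunique"))
    .reset_index()
)
print(summary)
```

To apply the same grouping to the real results, replace `df` with the frame loaded from the predictions CSV.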


In [16]:
# upload the predicted files to s3
s3c.upload_files_in_dir_to_prefix(
    infer_config.result_dir['Text'],
    config.BASE_INFER_RELEVANCE_S3_PREFIX
)

# Conclusion
This notebook ran the _Relevance_ inference on a sample dataset, stored the output in CSV format, and uploaded the results to S3.