# Text Curation
This notebook is the second step of the pipeline. It takes the text extracted from the PDFs, together with the annotation files, and builds the curated training set for the language model. The extracted text is read from the `ROOT/data/extraction` directory, and the output CSV dataset is written to the `ROOT/data/curation` directory. The `Curator` class finds positive examples in the annotation files and creates negative examples from the extracted text. The resulting dataset is the input to the next step of the pipeline, training the model.
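To make the positive/negative construction concrete, here is a minimal sketch. It is not the `Curator` implementation: it assumes that an annotation marks some paragraphs as relevant to a question and that negatives are drawn at random from the remaining extracted paragraphs. The helper `build_examples`, its arguments, and the sample paragraphs are all hypothetical.

```python
import random


def build_examples(question, paragraphs, relevant_idx, neg_pos_ratio=1, seed=42):
    """Return (question, paragraph, label) rows: 1 = positive, 0 = negative.

    Hypothetical helper for illustration only; the real Curator may differ.
    """
    rng = random.Random(seed)
    # Positives: paragraphs the annotation marks as relevant.
    rows = [(question, paragraphs[i], 1) for i in sorted(relevant_idx)]
    # Candidate negatives: every paragraph that is not annotated as relevant.
    candidates = [p for i, p in enumerate(paragraphs) if i not in relevant_idx]
    n_neg = min(len(candidates), neg_pos_ratio * len(relevant_idx))
    rows += [(question, p, 0) for p in rng.sample(candidates, n_neg)]
    return rows


# Made-up paragraphs, used only to exercise the helper.
paragraphs = [
    "Scope 1 emissions were reported in the sustainability chapter.",
    "The board met several times during the year.",
    "Our offices are located in three countries.",
]
examples = build_examples("What are the Scope 1 emissions?", paragraphs, {0})
```

With one relevant paragraph and a ratio of 1, `examples` holds one positive row followed by one randomly chosen negative row.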

In [4]:
# Author: ALLIANZ NLP esg data pipeline
import os
import pathlib
from dotenv import load_dotenv


import config
from src.components.preprocessing.curator import Curator
from src.data.s3_communication import S3FileType, S3Communication

### Injecting Credentials

In order to run this notebook, we need credentials to connect with S3 storage to retrieve and store data.

In an automated environment, the credentials can be specified in the pipeline's environment variables or through OpenShift secrets.

To run the notebook in automation as part of an Elyra pipeline, update the environment variables in the notebook "Properties" panel in the pipeline UI, or under `"env_vars"` in the `demo2.pipeline yaml` file.

For the Elyra pipeline run, you can set `DATA_S3_PREFIX` in `config.py` to `"corpdata/ESG/pipeline_run/sample_1"`, which contains the PDF for which annotations exist.

To run the notebook in a local environment, define the credentials as environment variables in a `credentials.env` file at the root of the project repository and load them using dotenv. An example of what the contents of `credentials.env` could look like is shown below:

```
# s3 credentials
S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
S3_BUCKET=ocp-odh-os-demo-s3
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
```

In [5]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
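Because the cell above silently skips loading when `credentials.env` is absent, it can help to check that the four variables from the example file are actually set before connecting. The following is a minimal sketch; `missing_credentials` and `REQUIRED_VARS` are hypothetical names and not part of the project.

```python
import os

# Variable names taken from the credentials.env example above.
REQUIRED_VARS = (
    "S3_ENDPOINT",
    "S3_BUCKET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_credentials({"S3_ENDPOINT": "https://s3.us-east-1.amazonaws.com"}))
```

Called without arguments, the helper inspects `os.environ`, so it can be run right after `load_dotenv` to fail early with a readable message.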

In [6]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [8]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # Create the local working folders if they do not exist yet
    for folder in (
        config.BASE_EXTRACTION_FOLDER,
        config.BASE_ANNOTATION_FOLDER,
        config.BASE_CURATION_FOLDER,
    ):
        folder.mkdir(parents=True, exist_ok=True)

    # download the files created by the extraction phase
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )

In [9]:
# download the annotation files
s3c.download_files_in_prefix_to_dir(
    config.BASE_ANNOTATION_S3_PREFIX,
    config.BASE_ANNOTATION_FOLDER,
)

### Call the Text Curator

In [7]:
SEED = 42
TextCurator_kwargs = {
    "retrieve_paragraph": False,
    "neg_pos_ratio": 1,
    "columns_to_read": [
        "company",
        "source_file",
        "source_page",
        "kpi_id",
        "year",
        "answer",
        "data_type",
        "relevant_paragraphs",
    ],
    "company_to_exclude": [],
    "create_neg_samples": True,
    "seed": SEED,
}
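The `"seed"` entry suggests that the random selection of negative samples is meant to be reproducible. How `TextCurator` uses the seed internally is not shown here, so the sketch below only illustrates the general idea with Python's standard `random` module; `sample_negatives` is a hypothetical helper.

```python
import random


def sample_negatives(candidates, k, seed):
    """Draw k distinct candidates using a dedicated, seeded generator."""
    return random.Random(seed).sample(candidates, k)


candidates = [f"paragraph_{i}" for i in range(100)]
first = sample_negatives(candidates, 5, seed=42)
second = sample_negatives(candidates, 5, seed=42)
assert first == second  # same seed, same draw
```

Using a local `random.Random(seed)` instance rather than the global generator keeps the draw independent of any other code that consumes random numbers.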

In [8]:
kpi_df = s3c.download_df_from_s3(
    f"{config.EXPERIMENT_NAME}/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

| Unnamed: 0 | kpi_id | question | sectors | add_year | kpi_category | Unnamed: 5 | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | What is the company name? | OG, CM, CU | False | TEXT | | |
| 1 | 1.0 | In which year was the annual report or the sus... | OG, CM, CU | False | TEXT | | |
| 2 | 2.0 | What is the total volume of proven and probabl... | OG | True | TEXT, TABLE | | |
| 3 | 2.1 | What is the volume of estimated proven hydroca... | OG | True | TEXT, TABLE | | |
| 4 | 2.2 | What is the volume of estimated probable hydro... | OG | True | TEXT, TABLE | | |
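The `sectors` and `kpi_category` columns in the preview above hold comma-separated values such as `"OG, CM, CU"` or `"TEXT, TABLE"`. As a small sketch of how such cells could be interpreted, the snippet below splits them into tokens and selects the KPIs that include `TABLE`. Plain dictionaries stand in for the DataFrame rows, and `split_field` is a hypothetical helper.

```python
def split_field(value):
    """Split a comma-separated cell such as "OG, CM, CU" into clean tokens."""
    return [token.strip() for token in value.split(",") if token.strip()]


# Rows copied from the preview above (question text omitted).
rows = [
    {"kpi_id": 0.0, "sectors": "OG, CM, CU", "kpi_category": "TEXT"},
    {"kpi_id": 2.0, "sectors": "OG", "kpi_category": "TEXT, TABLE"},
    {"kpi_id": 2.1, "sectors": "OG", "kpi_category": "TEXT, TABLE"},
]

table_kpis = [r["kpi_id"] for r in rows if "TABLE" in split_field(r["kpi_category"])]
print(table_kpis)  # [2.0, 2.1]
```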


In [9]:
cur = Curator([("TextCurator", TextCurator_kwargs)])
cur.run(
    config.BASE_EXTRACTION_FOLDER,
    config.BASE_ANNOTATION_FOLDER,
    config.BASE_CURATION_FOLDER,
    kpi_df,
)

Could not process row number 15 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 16 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 17 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 18 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 37 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 165 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 166 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 167 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 179 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 180 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could
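The warnings above follow a fixed pattern, so they can be summarised per annotation file to see which rows need correcting. The following is a minimal sketch that assumes the cell output has been captured as a string; `failed_rows_by_file` is a hypothetical helper, and incomplete lines (such as a truncated final line) are simply ignored.

```python
import re
from collections import defaultdict

# Matches lines of the form "Could not process row number <n> in <file>".
LINE_RE = re.compile(r"^Could not process row number (\d+) in (.+)$")


def failed_rows_by_file(log_text):
    """Group the unprocessed row numbers by annotation file name."""
    failed = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            failed[match.group(2)].append(int(match.group(1)))
    return dict(failed)


log = """\
Could not process row number 15 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could not process row number 16 in 20201030 1Qbit aggregated_annotations_needs_correction (1).xlsx
Could
"""
summary = failed_rows_by_file(log)
```

Here `summary` maps the single file name to the row numbers `[15, 16]`.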

In [10]:
# upload the curation file to s3
ret = s3c.upload_file_to_s3(
    config.BASE_CURATION_FOLDER / "esg_TEXT_dataset.csv",
    config.BASE_CURATION_S3_PREFIX,
    "esg_TEXT_dataset.csv",
)
ret['ResponseMetadata']['HTTPStatusCode']

200
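In an automated run nobody inspects the printed status code, so the upload result could be checked programmatically instead. The sketch below assumes only that the return value has the dictionary shape used in the cell above (`ret['ResponseMetadata']['HTTPStatusCode']`); `ensure_upload_ok` is a hypothetical helper.

```python
def ensure_upload_ok(response):
    """Return the HTTP status code, raising if the upload did not return 200."""
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    if status != 200:
        raise RuntimeError(f"S3 upload failed with HTTP status {status}")
    return status


# Stand-in response with the same shape as `ret` above.
print(ensure_upload_ok({"ResponseMetadata": {"HTTPStatusCode": 200}}))  # 200
```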

### Conclusion
We called the `Curator` class to combine the extracted text with the annotation files, stored the output in the `ROOT/data/curation` folder, and uploaded the resulting dataset to S3.