# Text Extraction
This notebook is the first step of the pipeline: it extracts text from PDFs. The input PDFs are read from the `ROOT/data/pdfs` directory, and the output JSON is stored in the `ROOT/data/extraction` directory. The output of this notebook, combined with the annotations, is used in the next step, curation.

In [1]:
# Author: ALLIANZ NLP esg data pipeline
import os
import pathlib
from dotenv import load_dotenv
from pathlib import Path

import config
from src.components.preprocessing import Extractor
from src.data.s3_communication import S3Communication

### Injecting Credentials

To run this notebook, we need credentials to connect to S3 storage, from which data is retrieved and to which results are stored.

In an automated environment, the credentials can be specified in a pipeline's environment variables or through OpenShift secrets.

When the notebook runs in automation as part of an Elyra pipeline, the environment variables can be updated in the notebook "Properties" in the pipeline UI or under `"env_vars"` in the `demo2.pipeline yaml` file.

When the notebook runs in a local environment, we define the credentials as environment variables in a `credentials.env` file at the root of the project repository and load them using dotenv. An example of what the contents of `credentials.env` could look like is shown below:

```
# s3 credentials
S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
S3_BUCKET=ocp-odh-os-demo-s3
S3_ACCESS_KEY=xxx
S3_SECRET_KEY=xxx
```
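
Since `os.getenv` returns `None` for a variable that is not set, a missing credential may only show up later as a connection error. The following is a minimal sketch of an early check, assuming the four variable names from the example above. `missing_env_vars` is a hypothetical helper and is not part of the project code.

```python
import os

# Variable names taken from the credentials.env example above.
REQUIRED_VARS = ("S3_ENDPOINT", "S3_BUCKET", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "S3_ENDPOINT": "https://s3.us-east-1.amazonaws.com",
    "S3_BUCKET": "ocp-odh-os-demo-s3",
}
print(missing_env_vars(env=example_env))  # ['S3_ACCESS_KEY', 'S3_SECRET_KEY']
```

In the notebook, calling `missing_env_vars()` with no arguments after the credentials are loaded would check the real environment.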

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_SECRET_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [5]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    if not os.path.exists(config.CHECKPOINT_FOLDER):
        config.CHECKPOINT_FOLDER.mkdir(parents=True, exist_ok=True)

    if not os.path.exists(config.BASE_PDF_FOLDER):
        config.BASE_PDF_FOLDER.mkdir(parents=True, exist_ok=True)

    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

In [6]:
# download all files from which text is to be extracted
s3c.download_files_in_prefix_to_dir(
    config.BASE_PDF_S3_PREFIX,
    config.BASE_PDF_FOLDER,
)

### Call text extractor

In [7]:
PDFTextExtractor_kwargs = {
    "min_paragraph_length": 30,
    "annotation_folder": None,
    "skip_extracted_files": False,
}
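
The `min_paragraph_length` option presumably discards very short text fragments, such as page numbers or stray headers. The sketch below illustrates that idea only. It is an assumption about the option's intent (including that length is counted in characters), not the `PDFTextExtractor` implementation, and `filter_paragraphs` is a hypothetical helper.

```python
def filter_paragraphs(paragraphs, min_paragraph_length=30):
    """Keep paragraphs whose stripped length is at least `min_paragraph_length` characters."""
    return [p for p in paragraphs if len(p.strip()) >= min_paragraph_length]


sample = [
    "12",  # likely a page number
    "Annual Report",  # likely a running header
    "We reduced our Scope 1 emissions by 12% compared with the previous year.",
]
kept = filter_paragraphs(sample)
print(len(kept))  # 1
```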

In [8]:
ext = Extractor([("PDFTextExtractor", PDFTextExtractor_kwargs)])
ext.run_folder(config.BASE_PDF_FOLDER, config.BASE_EXTRACTION_FOLDER)

['/opt/app-root/src/aicoe-osc-demo/data/pdfs/sustainability-report-2019.pdf', '/opt/app-root/src/aicoe-osc-demo/data/pdfs/75506106_BOA_2016-12-31.pdf', '/opt/app-root/src/aicoe-osc-demo/data/pdfs/90044053_Fisher & Paykel Hl_2017-11-07.pdf', '/opt/app-root/src/aicoe-osc-demo/data/pdfs/88094292_Carriage Svcs Inc_2019-07-23.pdf']
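
Before uploading, it can be useful to confirm that the extraction folder actually contains output. The following is a minimal sketch, assuming the extractor writes `.json` files directly into the output folder. `list_json_outputs` is a hypothetical helper, and the temporary directory stands in for `config.BASE_EXTRACTION_FOLDER`.

```python
import json
import tempfile
from pathlib import Path


def list_json_outputs(folder):
    """Return the sorted names of .json files directly inside `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.json"))


# Demonstration on a temporary folder standing in for the extraction folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report_a.json").write_text(json.dumps({}))
    (Path(tmp) / "notes.txt").write_text("not an extraction output")
    found = list_json_outputs(tmp)

print(found)  # ['report_a.json']
```

In the notebook, the equivalent call would be `list_json_outputs(config.BASE_EXTRACTION_FOLDER)`.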


In [9]:
# upload the extracted files to s3
s3c.upload_files_in_dir_to_prefix(
    config.BASE_EXTRACTION_FOLDER,
    config.BASE_EXTRACTION_S3_PREFIX
)

### Conclusion
We called the `Extractor` class to extract text from the PDFs, stored the output in the `ROOT/data/extraction` folder, and uploaded the extracted files to S3.