# Text Extraction
As the first step of the pipeline, this notebook extracts text from PDFs. The input PDFs are read from the `ROOT/data/pdfs` directory, and the output JSON is stored in the `ROOT/data/extraction` directory. The output of this notebook, combined with the annotations, is used in the next step, curation.

In [9]:
# Author: ALLIANZ NLP esg data pipeline
import os
import pathlib

from dotenv import load_dotenv

import config
from src.components.preprocessing import Extractor
from src.data.s3_communication import S3Communication

In [10]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
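
Because the cell above skips `load_dotenv` when `credentials.env` is not found, it can help to confirm that the variables read by the S3 connector are actually set before continuing. The following is a minimal sketch, not part of the pipeline: the helper name `missing_env_vars` is hypothetical, and the variable names are the ones this notebook passes to `os.getenv` when initializing the connector.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    # Hypothetical helper for illustration; defaults to the process environment.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names read by the S3 connector cell in this notebook.
REQUIRED_S3_VARS = ["S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET"]

missing = missing_env_vars(REQUIRED_S3_VARS)
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Reporting the missing names at this point tends to be easier to diagnose than a connection error raised later by the S3 client.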

In [11]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_SECRET_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [12]:
# Create the local working folders if they do not exist yet
config.CHECKPOINT_FOLDER.mkdir(parents=True, exist_ok=True)
config.BASE_PDF_FOLDER.mkdir(parents=True, exist_ok=True)
config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

In [13]:
# download all files from which text is to be extracted
s3c.download_files_in_prefix_to_dir(
    config.BASE_PDF_S3_PREFIX,
    config.BASE_PDF_FOLDER,
)

### Call text extractor

In [14]:
PDFTextExtractor_kwargs = {
    "min_paragraph_length": 30,
    "annotation_folder": None,
    "skip_extracted_files": False,
}

In [15]:
ext = Extractor([("PDFTextExtractor", PDFTextExtractor_kwargs)])
ext.run_folder(config.BASE_PDF_FOLDER, config.BASE_EXTRACTION_FOLDER)

['/opt/app-root/src/pipeline/aicoe-osc-demo/data/pdfs/sustainability-report-2019.pdf']
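
To check what the extraction step wrote, one option is to list the JSON files in the output folder together with the type and length of each file's top-level object. This is only a sketch: the helper `summarize_json_outputs` is hypothetical, and it assumes nothing about the schema the `Extractor` produces beyond the output being `.json` files, as stated in the introduction.

```python
import json
from pathlib import Path


def summarize_json_outputs(folder):
    """Return (file name, top-level type name, length) for each .json file in folder."""
    summary = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Containers report their number of entries; scalar values count as 1.
        size = len(data) if isinstance(data, (dict, list)) else 1
        summary.append((path.name, type(data).__name__, size))
    return summary
```

In this notebook it could be called as `summarize_json_outputs(config.BASE_EXTRACTION_FOLDER)`; an empty list would suggest that no output was written.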


In [16]:
# upload the extracted files to s3
s3c.upload_files_in_dir_to_prefix(
    config.BASE_EXTRACTION_FOLDER,
    config.BASE_EXTRACTION_S3_PREFIX
)

### Conclusion
We called the `Extractor` class to extract text from the PDF, stored the output in the `ROOT/data/extraction` folder, and uploaded the extracted files to S3.