# Document Processing Pipeline

In this notebook, we demonstrate three types of ways to process a given document so that it could be used as repositories for a question and answering chatbot (as shown in 04-rag.ipynb). The Flan T5 LLM and GPT-J embedding model are not required for this section, this notebook just demonstrates an understanding of the document processing pipeline. 

We use the following document, [Schedule 14A SEC document from Alabama Power Company](https://www.sec.gov/Archives/edgar/data/3153/000000315320000004/apc2020noticeofannualmeeti.htm), to extract useful information and answer questions. Before we can do that, it is vital to process the documents in a meaningful manner. This will help get the appropriate context from these large documents into the LLM so that the LLM can query the data to generate answers. The document is downloaded at `data/14A/0000003153-20-000004.html`

The three different ways to process a document include: 
1. Using a standard HTML Loader from langchain
2. Using a custom HTML loader parser
3. Using ideal textual paragraphs

Run through the sequence of cells to understand the mechanisms behind the steps in each of the approaches of the document processing pipeline. 

### 1. Set Up Kernel and Required Dependencies

First, check that the correct kernel is chosen.

<img src="img/kernel_set_up_03.png" width="300"/>

You can click on that to see and check the details of the image, kernel, and instance type.

<img src="img/w3_kernel_and_instance_type_03.png" width="600"/>

# NOTE:  YOU CANNOT CONTINUE UNTIL THE KERNEL IS STARTED
# ### PLEASE WAIT UNTIL THE KERNEL IS STARTED BEFORE CONTINUING!!! ###

# Use `Shift+Enter` to Run Each Cell

Use `Shift+Enter` on the cell below to see the output.

# Click `Kernel` => `Restart Kernel and Run All Cells` to Run All Cells
![](img/restart-kernel-and-run-all-cells.png)

In [2]:
import sys

# Get the Python version.
python_version = sys.version_info

# Check if the Python version is above 3.9.
if python_version.major < 3 or python_version.minor < 9:
  # Raise an error message if the Python version is not above 3.9.
  raise Exception("Python version must be above 3.9.")

# Print a success message if the Python version is above 3.9.
print("Python version is above 3.9.")

Python version is above 3.9.


## _==> Please ignore all WARNINGs and ERRORs from the `pip install`'s below. <==_

In [3]:
!pip install -r ../requirements.txt

Collecting langchain
  Downloading langchain-0.0.321-py3-none-any.whl (1.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m:00:01[0m0:01[0m
[?25hCollecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting lxml
  Downloading lxml-4.9.3-cp39-cp39-manylinux_2_28_x86_64.whl (8.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.0/8.0 MB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting regex
  Downloading regex-2023.10.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m773.3/773.3 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25hCollecting opensearch-py
  Downloading opensearch_py-2.3.2-py2.py3-none-any.whl (327 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m327.3/327.3 kB[0m [31m3.5 MB/s[0m eta [36m0

In [16]:
# standard imports
import re
import glob
import logging
from typing import Dict, List, Union

from langchain.document_loaders import BSHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

The following show the path to the document being processed as well as its name. 

In [17]:
doc_path = "../data/14A/0000003153-20-000004.html"
san = "0000003153-20-000004"

In [18]:
# helper functions
def clean_text(s):
    s = s.replace(u'\xa0', u' ') # no-break space 
    s = s.replace('\n', ' ')
    s = re.sub(r'\s+',' ', s) 
    return s

def clean_html(s):
    s = s.replace(u'\xa0', u' ') # no-break space 
    return s
    
def like_page_number(text):
    '''Match standalone one or two digit number'''
    return re.match(r'^(\d+){1,2}\.?$',text)

def read_text(path):
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

## Approach 1: Using a Standard HTML Loader from Langchain

For processing the documents in the standard format, we will use langchain's `BSHTMLLoader`. It uses BeautifulSoup4 to load HTML documents to extract the text from the HTML into `page_content`, and the page title as `title` into `metadata`.

Now, we will load the HTML file into a langchain `Document`.

In [19]:
print('Loading', doc_path)
loader = BSHTMLLoader(doc_path)

data = loader.load()

Loading ../data/14A/0000003153-20-000004.html


The langchain `Document` class can be formed with any piece of text and optional metadata. The piece of text is what we interact with the language model, while the optional metadata is useful for keeping track of metadata about the document (such as the source). A single 14A document can be long so we will split the document in multiple chunks using [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter). Lastly, we edit the metadata of the docs to include further information. 

In [22]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1400, chunk_overlap=0)

docs_standard = text_splitter.split_documents(data)

for i, doc in enumerate(docs_standard):
    doc.page_content = clean_text(doc.page_content)  
    doc.metadata['passage_id'] = i

In [26]:
print(f'Now you have {len(docs_standard)} short passages')

Now you have 111 short passages


## Approach 2: Using a Custom HTML Parser and Loader

Using the langchain HTML loader can help in chunking the page into different parts. However, it still does not chunk in a way that preserves some of the components we have in the html file e.g. preserving the sentence strucutures. Therefore, we will use a custom loader similar to the `BSHTMLLoader` that adds parsing for different html tags. In this example, we show how to add a custom tag to the table to make it easier for our models to learn what is a table and what is a text. 

In [28]:
"""Custom serialization of HTML into text."""

class BSHTMLLoaderEx(BaseLoader):
    """Loader that uses beautiful soup to parse HTML files."""

    block_level_elements = set([
        'address','article','aside','blockquote','canvas',
        'dd','div','dl','dt','fieldset','figcaption','figure','footer','form',
        'h1','h2','h3','h4','h5','h6',
        'header','hr','li','main','nav','noscript','ol',
        'p','pre','section','table','tfoot','ul','video',
        'body',
    ])

    
    def __init__(
        self,
        file_path: str,
        open_encoding: Union[str, None] = None,
        bs_kwargs: Union[dict, None] = None,
    ) -> None:
        """Initialise with path, and optionally, file encoding to use, and any kwargs
        to pass to the BeautifulSoup object."""
        try:
            import bs4  # noqa:F401
        except ImportError:
            raise ValueError(
                "bs4 package not found, please install it with " "`pip install bs4`"
            )

        self.file_path = file_path
        self.open_encoding = open_encoding
        if bs_kwargs is None:
            bs_kwargs = {"features": "lxml"}
        self.bs_kwargs = bs_kwargs

    def load(self) -> List[Document]:
        from bs4 import BeautifulSoup, Tag, NavigableString, Comment

        """Load HTML document into document objects."""
        with open(self.file_path, "r", encoding=self.open_encoding) as f:
            soup = BeautifulSoup(f, **self.bs_kwargs)

        # text = soup.get_text()
        text = self._parse(soup)

        if soup.title:
            title = str(soup.title.string)
        else:
            title = ""

        metadata: Dict[str, Union[str, None]] = {
            "source": self.file_path,
            "title": title,
        }
        return [Document(page_content=text, metadata=metadata)]

    def _parse(self,soup):
        '''Custom parser'''
        from bs4 import BeautifulSoup, Tag, NavigableString, Comment
        
        def __clean(s):
            s = s.replace(u'\xa0', u' ')
            return s
            
        # DFS post-order. 
        # Adding a different delimeter at the end of each tag
        def dfs(tags,texts):
            for tag in tags:
                if isinstance(tag,Comment):
                    pass
                elif isinstance(tag,NavigableString):
                    texts += tag.string,

                else:
                    # Starting a new paragraph tag, 
                    # terminate previous non-paragraph by the period.
                    if tag.name in type(self).block_level_elements:
                        # if texts:
                        #     print("'{}'".format(texts[-1][-10:]))
                        #     print(texts[-1][-1].isalnum())
                        if texts and texts[-1] and texts[-1][-1].isalnum():
                            texts[-1] += '.'
                        texts += '\n',

                    # Serialize a table
                    if tag.name in ['td'] and tag.text != '':
                        texts += ' <Cell> ',
                    if tag.name in ['table']:
                        texts += ' <Table Start> '

                    dfs(tag.children,texts)

                    if tag.name in ['table']:
                        texts += ' <Table End> '

        texts = []   

        dfs([soup.find('body')],texts)        

        s = ''.join(texts)
        s = __clean(s)
        s = re.sub(r'[\.]+','.',s) # multiple dots
        
        return s

Now in a similar way that the document was loaded through langchain, we call the BSHTMLLoaderEx custom class and split the documents using the RecursiveCharacterTextSplitter. An example of a document is outputted from the below cell. 

In [29]:
print('Loading:', doc_path)

# 
loader = BSHTMLLoaderEx(doc_path, bs_kwargs={'features':'html.parser'})

data = loader.load()

# Clean entire HTML
data[0].page_content = clean_html(data[0].page_content)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1400, chunk_overlap=0)

docs = text_splitter.split_documents(data)

for i,doc in enumerate(docs):
    doc.page_content = clean_text(doc.page_content)
    doc.metadata['passage_id'] = i

# Filter page numbers
docs = [doc for doc in docs if not like_page_number(doc.page_content)]

print(f'Now you have {len(docs)} short passages')
docs[0]

Loading: ../data/14A/0000003153-20-000004.html
Now you have 172 short passages


Document(page_content='UNITED STATES. SECURITIES AND EXCHANGE COMMISSION. WASHINGTON, D.C. 20549. SCHEDULE 14A INFORMATION. Proxy Statement Pursuant To Section 14(a) of the Securities Exchange Act of 1934. <Table Start> <Cell> x <Cell> Filed by the Registrant <Cell> o <Cell> Filed by a party other than the Registrant <Table End> Check the appropriate box: <Table Start> <Cell> o <Cell> Preliminary proxy statement <Cell> o <Cell> Confidential, for use of the Commission only (as permitted by Rule 14a-6(e)(2)) <Cell> x <Cell> Definitive proxy statement <Cell> o <Cell> Definitive additional materials <Cell> o <Cell> Soliciting material under Rule 14a-12 <Table End> ALABAMA POWER COMPANY. (Name of Registrant as Specified in Its Charter) (Name of Person(s) Filing Proxy Statement, if Other Than the Registrant) Payment of Filing Fee (Check the appropriate box):', metadata={'source': '../data/14A/0000003153-20-000004.html', 'title': 'None', 'passage_id': 0})

## Approach 3: Using Ideal Textual 14A paragraphs
This section demonstrates how to load the fragments of texts as shown as a document. Each of the files are of .txt format and include all important paragraphs. 

In [15]:
folder = "../data/14A_frags/"
print(f'Loading: {folder}{san}.*.txt')

docs_ideal = []
for i,fname in enumerate(glob.iglob(folder + san + '*.txt')):
    docs_ideal += Document(
                page_content=read_text(fname),
                metadata ={'source': fname, 'title': None, 'passage_id': i}
    ),

print(f'Now you have {len(docs)} short textual passages')
print(docs_ideal[1])

Loading: ../data/14A_frags/0000003153-20-000004.*.txt
Now you have 172 short textual passages
page_content="Phillip M. Webb - Director since 2018. Mr. Webb, 62, is President of Webb Concrete and Building Materials, a position he has held since 1982. Mr. Webb serves on the Board of Directors of NobleBank & Trust, as well as numerous philanthropic and non-profit boards, such as the Business Council of Alabama, Calhoun County Home Builders Association, the Jacksonville State University Foundation, the Calhoun. 3. County Chamber of Commerce, and the Greater Alabama Council of the Boy Scouts of America. He is also the Chairman of the Knox Concert Series. Mr. Webb's business experience and investment in his local community make him a well-qualified member of the Company's Board. Each nominee has served in his or her present position for at least the past five years, unless otherwise noted. Vote Required." metadata={'source': '../data/14A_frags/0000003153-20-000004.webb.txt', 'title': None, '