# A Quick Start on the Extended JSON HTML Loader/Splitter

## Introduction

Contact: Clara Zang (cz692t@attt.com)

Date: 11/16/2023 (the code may have been updated after this date, so please check for changes.)

In this short EDA, you will get a quick start on how to use the Extended JSON HTML Loader/Splitter:
- [What does an HTML file look like?](#section_1)
- [How to load and parse HTML contained in a JSON file using JSONLoaderWithHtml](#section_2)
- [How to split the loaded HTML using HTMLSectionSplitter with some key parameters set](#section_3)
- [How to print the actual size and content of each chunk](#section_4)

<a id="section_1"></a>
## What does an HTML file look like?

Actually, I'm saving this quick start as an HTML file too!!


In [24]:
with open(path, 'r', encoding='utf-8') as html_file:  # set `path` to your HTML file first
    html_content = html_file.read()

html_content[:5000]  # preview the first 5000 characters

'<a id=\\"Top\\" name=\\"Top\\"> </a>\\n<table align=\\"center\\" border=\\"0\\" cellpadding=\\"5\\" style=\\"width: 100%;\\"><tbody><tr><td colspan=\\"1\\" rowspan=\\"1\\" style=\\"text-align: center;width: 20%;vertical-align: top;\\"><a href=\\"#policyDetails\\"><img alt=\\"policy details jump to\\" src=\\"https://attone.file.force.com/servlet/rtaImage?eid=ka04M000000WU9R&amp;feoid=00N6g00000UM7L2&amp;refid=0EM4M0000059w2T\\"></img></a><br>\\t\\t\\t<a href=\\"#policyDetails\\" style=\\"text-decoration: none;color: #454b52;\\"><span style=\\"padding: 8px;font-size: 11px;text-align: center;\\">POLICY\\u00a0 \\u00a0 DETAILS/SUGGESTED VERBIAGE</span></a></td><td colspan=\\"1\\" rowspan=\\"1\\" style=\\"text-align: center;width: 20%;vertical-align: top;\\"><a href=\\"#etftrs\\"><img alt=\\"etf types jump to\\" src=\\"https://attone.file.force.com/servlet/rtaImage?eid=ka04M000000WU9R&amp;feoid=00N6g00000UM7L2&amp;refid=0EM4M0000059w2d\\"></img></a><br>\\t\\t\\t<a href=\\"#etftrs\\" style=\

<a id="section_2"></a>
## Load a JSON file (HTML + metadata) using JSONLoaderWithHtml

 Note:
 1. Change the import path according to where your **json_loader.py** is located.

 Some parameters:
 - remove_all_tags (bool): whether to remove all HTML tags.
 - html_tags_to_decompose (Optional[list[str]]): a list of tags to remove together with their content.
 - html_tags_to_unwrap (Optional[list[str]]): a list of tags to unwrap. The tags themselves are removed, but their content is kept.
 - html_attrs_to_keep (Optional[list[str]]): a list of attributes to retain.
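
The difference between decomposing and unwrapping a tag is easiest to see on a tiny input. The sketch below is not the loader's implementation; it is a standalone illustration built on the standard-library `html.parser`, and the names `TagFilter` and `filter_html` are hypothetical. It assumes every tag has a matching closing tag, so void elements such as a bare `<img>` are out of scope.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Illustration only: mimic 'decompose' and 'unwrap' on an HTML string."""

    def __init__(self, decompose=(), unwrap=(), attrs_to_keep=()):
        super().__init__()
        self.decompose = set(decompose)
        self.unwrap = set(unwrap)
        self.attrs_to_keep = set(attrs_to_keep)
        self.skip_depth = 0  # > 0 while inside a decomposed tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.decompose:
            self.skip_depth += 1
        if self.skip_depth or tag in self.unwrap:
            return  # drop the tag itself
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in self.attrs_to_keep)
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.decompose:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in self.unwrap:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:  # text inside a decomposed tag is dropped
            self.out.append(data)


def filter_html(html, **options):
    """Hypothetical helper: run TagFilter over `html` and return the result."""
    parser = TagFilter(**options)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div class="x" id="top"><ul><li><a href="#faq">FAQ</a></li></ul>'
    "<script>track()</script><h2>Rules</h2></div>"
)
result = filter_html(
    sample,
    decompose=["script"],        # removed together with its content
    unwrap=["li", "ul", "a"],    # tags removed, content kept
    attrs_to_keep=["id"],        # every other attribute is dropped
)
print(result)  # <div id="top">FAQ<h2>Rules</h2></div>
```

The `script` element disappears entirely, the list and link tags vanish while the text `FAQ` survives, and only the `id` attribute is retained on the `div`.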

In [5]:
from json_loader import JSONLoaderWithHtml

from importlib import import_module
metadata_func = getattr(import_module("metadata_enrichment.functions"),"care_metadata")

file_name = "MSS_000002692_20230917_2236uf.json"

html_tags_to_unwrap = [
    "li",
    "ul",
    "ol",
    "a",
    "span",
]

json_loader = JSONLoaderWithHtml(
    file_name,
    '.',
    "BW_Article_Details__c",
    metadata_func,
    text_content=False,
    remove_all_tags=False,
    html_tags_to_unwrap=html_tags_to_unwrap,
)

docs = json_loader.load()



In [6]:
docs

[Document(page_content='<html><body>\n<div>\n POLICY    DETAILS/SUGGESTED VERBIAGE\n ETF TYPES, RATES &amp; SCHEDULES\n WAIVED ETF&amp;     SCENARIOS\n CRU\n FAQ\n</div>\n<div>\n<h1>Policy Details/Suggested Verbiage</h1>\n<div>\n<h2><b>Rules and Restrictions</b></h2>\nCustomers who activate/upgrade service with AT&amp;T on a contract agree to maintain active service with AT&amp;T for a specified time, normally 24 months; also, customers who activate service on an installment plan are required to pay the full-outstanding balance on the installment plan when cancelling their service and this may be taxable according to state tax codes.\n\t\t\t\tWhen customer cancels service before the end of the agreed term, the existing contract terms determine the ETF and this may be taxable according to state tax codes.\nFollow normal retention steps before cancelling a customer\'s account for any reason:\n\t\t\t\tOffer solutions and explain what services are available that would keep the customer sat

In [7]:
print(docs[0].page_content)

<html><body>
<div>
 POLICY    DETAILS/SUGGESTED VERBIAGE
 ETF TYPES, RATES &amp; SCHEDULES
 WAIVED ETF&amp;     SCENARIOS
 CRU
 FAQ
</div>
<div>
<h1>Policy Details/Suggested Verbiage</h1>
<div>
<h2><b>Rules and Restrictions</b></h2>
Customers who activate/upgrade service with AT&amp;T on a contract agree to maintain active service with AT&amp;T for a specified time, normally 24 months; also, customers who activate service on an installment plan are required to pay the full-outstanding balance on the installment plan when cancelling their service and this may be taxable according to state tax codes.
				When customer cancels service before the end of the agreed term, the existing contract terms determine the ETF and this may be taxable according to state tax codes.
Follow normal retention steps before cancelling a customer's account for any reason:
				Offer solutions and explain what services are available that would keep the customer satisfied.
To determine if the customer's equipment
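
Note that the loaded `page_content` above still carries HTML entities such as `&amp;`. If you want to see the decoded text while inspecting a document by hand, the standard library can decode entities. This is only an inspection aid using a line copied from the output above, not a step the loader requires.

```python
from html import unescape

# One line taken from the printed page_content above.
line = "ETF TYPES, RATES &amp; SCHEDULES"

decoded = unescape(line)  # turns HTML entities back into plain characters
print(decoded)  # ETF TYPES, RATES & SCHEDULES
```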

<a id="section_3"></a>
## Split the loaded Document using HTMLSectionSplitter

 Note:
 1. Change the import path according to where your **html_section_splitter.py** is located.
 2. Pay attention to `separator`, since it determines where a section can be cut into chunks.

Some parameters:
- headers_to_split_on: list of tuples of the headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")]
- the parameters of the class CharacterTextSplitter
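
To make the role of `headers_to_split_on` concrete, here is a simplified sketch of header-based sectioning using only the standard library. It is an assumption-laden illustration, not the code of HTMLSectionSplitter: the function `split_by_headers` is hypothetical, it uses regular expressions instead of a real HTML parser, and it records only the nearest header in the metadata.

```python
import re


def split_by_headers(html, headers_to_split_on):
    """Illustration only: cut an HTML string at the listed header tags."""
    names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
    alternation = "|".join(re.escape(tag) for tag in names)
    header_re = re.compile(
        rf"<({alternation})\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL
    )
    tag_re = re.compile(r"<[^>]+>")  # crude tag stripper, fine for a sketch

    matches = list(header_re.finditer(html))
    # Each entry is (metadata, header title, where the section body starts).
    # The first entry covers any text that appears before the first header.
    bounds = [({}, "", 0)]
    for m in matches:
        title = tag_re.sub("", m.group(2)).strip()
        bounds.append(({names[m.group(1).lower()]: title}, title, m.end()))
    ends = [m.start() for m in matches] + [len(html)]

    sections = []
    for (metadata, title, start), end in zip(bounds, ends):
        body = tag_re.sub("", html[start:end]).strip()
        text = "\n".join(part for part in (title, body) if part)
        if text:
            sections.append({"text": text, "metadata": metadata})
    return sections


sample = (
    "<div>Intro</div><h1>Policy Details</h1>"
    "<h2><b>Rules</b></h2><p>Follow retention steps.</p>"
)
sections = split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for section in sections:
    print(section)
```

Each header starts a new section, and the key you chose for that header level (for example `"Header 2"`) becomes the metadata key holding the header text.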

In [12]:
from html_section_splitter import HTMLSectionSplitter

headers_to_split_on = [
    ["h1", "Header 1"],
    ["h2", "Header 2"],
    ["h3", "Header 3"],
]
chunk_size = 2000
chunk_overlap = 200
separator = '\n'
is_separator_regex = True

sec_splitter = HTMLSectionSplitter(
    headers_to_split_on=headers_to_split_on,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    is_separator_regex=is_separator_regex,
    separator=separator,
)

docs_splitted = sec_splitter.split_documents(docs)
docs_splitted

Created a chunk of size 3725, which is longer than the specified 2000
Created a chunk of size 7581, which is longer than the specified 2000


[Document(page_content='POLICY    DETAILS/SUGGESTED VERBIAGE\n ETF TYPES, RATES & SCHEDULES\n WAIVED ETF&     SCENARIOS\n CRU\n FAQ', metadata={'source': '/Users/clara_zang/Documents/Planning2023/Cognitive_AI/Care/After_September/BS4-html-preprocessing/MSS_000002692_20230917_2236uf.json', 'seq_num': 1, 'attributes_type': 'Knowledge__kav', 'attributes_url': '/services/data/v55.0/sobjects/Knowledge__kav/ka04M000000WU9RQAW', 'document_id': 'ka04M000000WU9RQAW', 'ArticleNumber': '000002692', 'Title': 'Early Termination Fees (ETFs) - Mobility', 'UrlName': 'early-termination-fees-et-fs---mobility', 'articleUrl': 'https://attone.my.salesforce.com/lightning/articles/Knowledge/early-termination-fees-et-fs---mobility', 'PublishStatus': 'Published', 'ArticleSummary': 'We bill an early termination fee (ETF) to customers who cancel service prior to completing the contractual agreement made with AT&amp;T.', 'Header 1': 'Early Termination Fees (ETFs) - Mobility'}),
 Document(page_content='Policy Deta

In [13]:
len(docs), len(docs_splitted)

(1, 17)
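
The two "Created a chunk of size ..." warnings above are consistent with how separator-based splitting generally behaves: text is cut only at the separator, so a stretch of text that contains no separator cannot be made smaller than it already is. The sketch below is a simplified greedy merger without overlap handling, written purely for illustration. It is not the code of CharacterTextSplitter or HTMLSectionSplitter, and `greedy_split` is a hypothetical name.

```python
def greedy_split(text, separator="\n", chunk_size=20):
    """Illustration only: merge separator-delimited pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep accumulating
        else:
            if current:
                chunks.append(current)
            current = piece  # a single piece may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# The third line has 35 characters and no newline inside it.
text = "short line\nanother one\n" + "x" * 35 + "\ntail"
chunks = greedy_split(text, chunk_size=20)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [10, 11, 35, 4]
```

The 35-character line ends up as its own chunk even though `chunk_size` is 20, which is why the choice of `separator` matters for long passages.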

<a id="section_4"></a>
## Take a look at the actual size and content of each chunk

In [14]:
[len(doc.page_content) for doc in docs_splitted]

[103,
 33,
 1287,
 1565,
 1163,
 31,
 72,
 3724,
 1433,
 1908,
 226,
 3,
 7580,
 1607,
 1590,
 1787,
 985]
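
A quick way to read this list is to flag the chunks that exceed `chunk_size` and the ones that are very short. The sketch below hard-codes the sizes printed above so that it runs on its own; in the notebook you would compute them from `docs_splitted` instead. The threshold of 50 characters for "tiny" chunks is an arbitrary assumption chosen for illustration.

```python
chunk_size = 2000   # same value that was passed to the splitter
tiny_threshold = 50  # arbitrary cut-off, assumed for this example

# Sizes copied from the output above.
sizes = [103, 33, 1287, 1565, 1163, 31, 72, 3724, 1433,
         1908, 226, 3, 7580, 1607, 1590, 1787, 985]

# (index, size) pairs for chunks larger than the requested chunk size.
oversized = [(i, n) for i, n in enumerate(sizes) if n > chunk_size]
# (index, size) pairs for chunks that are probably just a header or stray text.
tiny = [(i, n) for i, n in enumerate(sizes) if n < tiny_threshold]

print(oversized)  # [(7, 3724), (12, 7580)]
print(tiny)       # [(1, 33), (5, 31), (11, 3)]
```

Chunks 7 and 12 are the two that triggered the warnings during splitting, so they are the first ones worth inspecting.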

## Take a look at the content of each chunk!


In [15]:
for i, doc in enumerate(docs_splitted):
    print(f" chunk {i}" + "****" * 10)
    print(doc.page_content)
    print('\n')

 chunk 0****************************************
POLICY    DETAILS/SUGGESTED VERBIAGE
 ETF TYPES, RATES & SCHEDULES
 WAIVED ETF&     SCENARIOS
 CRU
 FAQ


 chunk 1****************************************
Policy Details/Suggested Verbiage


 chunk 2****************************************
Rules and Restrictions 
Customers who activate/upgrade service with AT&T on a contract agree to maintain active service with AT&T for a specified time, normally 24 months; also, customers who activate service on an installment plan are required to pay the full-outstanding balance on the installment plan when cancelling their service and this may be taxable according to state tax codes.
				When customer cancels service before the end of the agreed term, the existing contract terms determine the ETF and this may be taxable according to state tax codes.
Follow normal retention steps before cancelling a customer's account for any reason:
				Offer solutions and explain what services are available that wou