## Notebook for Viewing HTML, Elements, Chunks, and Documents ##

* HTML obtained via Requests and Filtered by Beautiful Soup
* Elements obtained by Unstructured partition_html
* Chunks from Unstructured chunk_by_title
* Document format is from LangChain

In [1]:

import json
from IPython.display import HTML
import requests
from bs4 import BeautifulSoup
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title
from langchain_core.documents.base import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

First, let's choose the web page of interest and the tag type and class we want to filter on.

In [2]:
# Web page to fetch
html_source =  "https://www.chilisin.com/en-global/Inductor/index/emi_common"
# What type of tag do we want to filter by?
tag_type = "div"
# What is the class name to filter?
tag_class = 'container'

Now use Requests to fetch the page's HTML text and Beautiful Soup to keep only the matching tags.

In [3]:

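def get_filtered_html(html_source: str, tag_type: str, tag_class: str):
    '''Fetch a page and return every tag_type tag whose class matches tag_class'''
    page = requests.get(html_source, timeout=30)
    page.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(page.text, 'html.parser')
    # find_all returns a (possibly empty) list-like ResultSet of matching tags
    return soup.find_all(tag_type, class_=tag_class)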

In [4]:
html_output = get_filtered_html(html_source, tag_type, tag_class)
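
To see what the class filter does without touching the network, here is a minimal sketch on an inline HTML string. The markup and the names `sample_html` and `matches` are made up for illustration; only `BeautifulSoup` and `find_all` come from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a fetched page
sample_html = """
<html><body>
  <div class="header">Site header</div>
  <div class="container"><h1>EMI-Common Mode Choke</h1></div>
  <div class="container wide"><p>Second block</p></div>
  <span class="container">Not a div</span>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# class_ matches any tag whose class list contains "container",
# so the div with class="container wide" is included as well
matches = soup.find_all("div", class_="container")

for tag in matches:
    print(tag.get_text(strip=True))
```

Because `find_all` returns every match, indexing the result with `[0]` takes only the first matching div, and an empty result would raise an `IndexError` at that point.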

In [5]:
# Raw HTML Output
print(html_output[0])

<div class="container">
<div class="banner">
<img src="/upload/website/banner/chilisin_product_banner_18062116137.jpg"/>
</div>
<div class="unit">
<div class="crumbs_block clearfix">
<ul class="crumbs no_ul clearfix">
<li>
<a href="/en-global/home/index">
					Home					</a>
</li>
<li><a href="/en-global/Inductor/index">Products</a></li>
</ul>
</div>
</div>
<div class="main_content">
<h1 class="main_title">EMI-Common Mode Choke</h1>
<script type="text/javascript">
	
	$(window).load(function()
	{
		var url = window.location.toString();
	    var id = url.split('#')[1];
	   	if(id)
	   	{
		    var t = $('#'+id).offset().top;
		   $('html,body').animate({scrollTop:(t)}, 1000);
		}
	});
</script>
<!-- Search -->
<div class="top_search_block">
<div class="search_sel_block lv1 clearfix">
<div class="search_title hidden_m">Search by : </div>
<div class="search_title hidden_pc" id="search_block">Search<i aria-hidden="true" class="fa fa-chevron-down"></i></div>
<div class="portfolio-experiment" 

In [6]:
HTML(str(html_output[0]))

Series,Size Code (JIS/EIA),Impedance(Ω),Inductance(uH),RDC(Ω),Rated Current(mA),Unnamed: 6
BWCU_121008-02,1210/0504,25~330,-,0.25~1.3,100~400,More
BWCU_160811-02,1608/0603,25~220,-,0.077~0.209,500,More
BWCU_201212-02,2012/0805,30~900,-,0.2~0.88,80~450,More
BWCU_231512-02,2012/0805,30~260,-,0.2~0.6,700~1300,More
BWCU_321619-02,3216/1206,90~2200,-,0.3~1.2,200~370,More
BWCU_201212-03,2012/0805,50~130,-,0.2~0.4,300~500,More
BWCU_121008-03,1210/0504,22~90,-,0.2~0.4,250~400,More
BWCU_322518-00 NEW,3225/1210,90~1000,-,0.08~0.25,480~1000,More
BWCU_252012-3P NEW,2520/1008,67~800,-,0.16~1.6,150~500,More

Series,Size Code (JIS/EIA),Impedance(Ω),Inductance(uH),RDC(Ω),Rated Current(mA),Unnamed: 6
BWCU_322518-00 NEW,3225/1210,90~1000,-,0.08~0.25,600~1000,More
BWCU_252012-3P NEW,2520/1008,67~800,-,0.16~1.6,150~500,More

Series,Size Code (JIS/EIA),Impedance(Ω),Inductance(uH),RDC(Ω),Rated Current(mA),Unnamed: 6
BWCC_201208,2012/0805,30~260,-,0.2~0.6,700~1300,More

Series,Size Code (JIS/EIA),Impedance(Ω),Inductance(uH),RDC(Ω),Rated Current(mA),Unnamed: 6
BWDM_341620 NEW,3416/1306,-,40~105,1.5~3,120~300,More

Series,Size Code (JIS/EIA),Impedance(Ω),Inductance(uH),RDC(Ω),Rated Current(mA),Unnamed: 6
BPPM_050520,5050/2020,100~1200,-,0.045~0.081,1900~3000,More
BPPM_070638,7060/2824,70~3000,-,0.005~0.075,900~15000,More
BPPM_090748,9070/3628,300~2700,-,0.006~0.086,2000~6000,More
BPPM_121164,1211/4844,230~2700,-,0.0022~0.05,1500~10000,More
BPPM_151360,1513/6052,300~1500,-,0.0038~0.009,8500~13000,More
BPPM_485023,4850/1920,100~1500,-,0.01~0.04,2000~6000,More

Series,Size Code (JIS/EIA),Impedance(Ω),Inductance(uH),RDC(Ω),Rated Current(mA),Unnamed: 6
BPPI_050525,4850/1920,100~1500,-,0.009~0.04,1500~6000,More
BPPI_050545,4850/1920,190~3000,0.6~6,0.02~0.2,500~5000,More


Now we use the Unstructured library to partition the HTML into Elements.

Reference:

[partition_html](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-html)

In [7]:

elements = partition_html(text=str(html_output[0]))

Let's define a few helper functions for examining the Elements.

In [8]:
def print_elements(els: list) -> None:
    '''Print type and text of each element'''
    for i, el in enumerate(els):
        print(i, ' ', el.to_dict()['type'].upper(), ': ', el.text)

In [9]:
def print_element_json(element):
    '''Print JSON representation of an individual element'''
    print(json.dumps(element.to_dict(), indent=2))

In [10]:
print_elements(elements)

0   LISTITEM :  Home
1   LISTITEM :  Products
2   TITLE :  EMI-Common Mode Choke
3   TITLE :  Search by :
4   TITLE :  Search
5   TITLE :  Part Number
6   TITLE :  Characteristics
7   TITLE :  Cross reference
8   TITLE :  General
9   TITLE :  Automotive Grade
10   TITLE :  Note
11   TITLE :  BWCU
12   TITLE :  Series Pdf
13   TITLE :  02_For USB 2.0,IEEE1394,LVDS HDMI1.4
14   TITLE :  03_For USB 3.1,HDMI 2.0, IEEE1394b,LVDS
15   TABLE :  Series Size Code (JIS/EIA) Impedance(Ω) Inductance(uH) RDC(Ω) Rated Current(mA) BWCU_121008-02 1210/0504 25~330 - 0.25~1.3 100~400 More BWCU_160811-02 1608/0603 25~220 - 0.077~0.209 500 More BWCU_201212-02 2012/0805 30~900 - 0.2~0.88 80~450 More BWCU_231512-02 2012/0805 30~260 - 0.2~0.6 700~1300 More BWCU_321619-02 3216/1206 90~2200 - 0.3~1.2 200~370 More BWCU_201212-03 2012/0805 50~130 - 0.2~0.4 300~500 More BWCU_121008-03 1210/0504 22~90 - 0.2~0.4 250~400 More BWCU_322518-00 NEW 3225/1210 90~1000 - 0.08~0.25 480~1000 More BWCU_252012-3P NEW 2520/1008

In [11]:
len(elements)

36
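
With 36 elements, a tally by type can be easier to scan than the full listing. A minimal sketch, assuming each element exposes a `category` attribute (the same attribute the chunk-printing helper later in this notebook relies on). `count_categories` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without Unstructured installed.

```python
from collections import Counter
from types import SimpleNamespace

def count_categories(els):
    """Return a Counter mapping element category -> number of elements."""
    return Counter(el.category for el in els)

# Stand-in elements (hypothetical); in the notebook, pass `elements` instead
demo_elements = [
    SimpleNamespace(category="ListItem", text="Home"),
    SimpleNamespace(category="ListItem", text="Products"),
    SimpleNamespace(category="Title", text="EMI-Common Mode Choke"),
    SimpleNamespace(category="Table", text="Series Size Code ..."),
]

counts = count_categories(demo_elements)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```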

In [12]:
for element in elements:
    print_element_json(element)
    print('\n')

{
  "type": "ListItem",
  "element_id": "a205dbbfd0ab1d9556653f03607bc637",
  "text": "Home",
  "metadata": {
    "category_depth": 1,
    "link_texts": [
      "\r\n\t\t\t\t\tHome\t\t\t\t\t"
    ],
    "link_urls": [
      "/en-global/home/index"
    ],
    "link_start_indexes": [
      0
    ],
    "languages": [
      "eng"
    ],
    "filetype": "text/html"
  }
}


{
  "type": "ListItem",
  "element_id": "b6465dc3bb053dffaf4495cb47290ebd",
  "text": "Products",
  "metadata": {
    "category_depth": 1,
    "link_texts": [
      "Products"
    ],
    "link_urls": [
      "/en-global/Inductor/index"
    ],
    "link_start_indexes": [
      0
    ],
    "languages": [
      "eng"
    ],
    "filetype": "text/html"
  }
}


{
  "type": "Title",
  "element_id": "f8b9c87f4aab79e28b1828bb8f7884bd",
  "text": "EMI-Common Mode Choke",
  "metadata": {
    "category_depth": 0,
    "languages": [
      "eng"
    ],
    "filetype": "text/html"
  }
}


{
  "type": "Title",
  "element_id": "cd0943f

Now let's convert the elements into chunks that are suitable for embedding.
We will use the Unstructured library's chunk_by_title function.

Reference:

[chunk_by_title](https://docs.unstructured.io/open-source/core-functionality/chunking)

In [13]:

def elements2chunks(elements):
    '''
    Convert elements to chunks of text
    '''
    chunks = chunk_by_title(elements, 
                            combine_text_under_n_chars=100, 
                            max_characters=3000)
    return chunks

In [23]:
chunks = elements2chunks(elements)

In [24]:

def print_chunks(chunks):
    for i, chunk in enumerate(chunks):
        print('Chunk Number: ', i, ' ***', chunk.category.upper(), '****')
        if (chunk.category == 'Table'):
            print(chunk.metadata.text_as_html)
        else:
            print(chunk.text)
        print("\n\n" + "-"*80)

In [25]:
print_chunks(chunks)

Chunk Number:  0  *** COMPOSITEELEMENT ****
Home

Products

EMI-Common Mode Choke

Search by :

Search

Part Number

Characteristics

Cross reference


--------------------------------------------------------------------------------
Chunk Number:  1  *** COMPOSITEELEMENT ****
General

Automotive Grade

Note

BWCU

Series Pdf

02_For USB 2.0,IEEE1394,LVDS HDMI1.4

03_For USB 3.1,HDMI 2.0, IEEE1394b,LVDS


--------------------------------------------------------------------------------
Chunk Number:  2  *** TABLE ****
<table><tr><td>Series</td><td>Size Code (JIS/EIA)</td><td>Impedance(Ω)</td><td>Inductance(uH)</td><td>RDC(Ω)</td><td>Rated Current(mA)</td><td></td></tr><tr><td>BWCU_121008-02</td><td>1210/0504</td><td>25~330</td><td>-</td><td>0.25~1.3</td><td>100~400</td><td>More</td></tr><tr><td>BWCU_160811-02</td><td>1608/0603</td><td>25~220</td><td>-</td><td>0.077~0.209</td><td>500</td><td>More</td></tr><tr><td>BWCU_201212-02</td><td>2012/0805</td><td>30~900</td><td>-</td><td>0.2~0.88</
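
The table chunk keeps its structure in `metadata.text_as_html`, as the output above shows. A minimal sketch for turning such a string back into rows of cells with the standard library's `html.parser`; `TableRows` and `table_html_to_rows` are hypothetical names, and the sample string is a shortened stand-in in the same shape as the real output.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")   # empty cells stay as ""

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def table_html_to_rows(table_html):
    """Parse a simple <table> string into a list of rows."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return parser.rows

# Shortened stand-in for a chunk's text_as_html
sample = (
    "<table><tr><td>Series</td><td>Size Code (JIS/EIA)</td><td>RDC(Ω)</td></tr>"
    "<tr><td>BWCU_121008-02</td><td>1210/0504</td><td>0.25~1.3</td></tr></table>"
)
rows = table_html_to_rows(sample)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

In the notebook, the input would be `chunk.metadata.text_as_html` for a chunk whose category is `Table`.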

In [26]:
for chunk in chunks:
    print_element_json(chunk)

{
  "type": "CompositeElement",
  "element_id": "0cc42e01-686a-45c1-9c0e-097bd56b85a2",
  "text": "Home\n\nProducts\n\nEMI-Common Mode Choke\n\nSearch by :\n\nSearch\n\nPart Number\n\nCharacteristics\n\nCross reference",
  "metadata": {
    "filetype": "text/html",
    "languages": [
      "eng"
    ],
    "link_texts": [
      "\r\n\t\t\t\t\tHome\t\t\t\t\t",
      "Products",
      "Part Number",
      "Characteristics",
      "Cross reference"
    ],
    "link_urls": [
      "/en-global/home/index",
      "/en-global/Inductor/index",
      "/en-global/Product_Search_Engine/index/",
      "/en-global/Product_Search_Engine/specSearch/",
      "/en-global/Product_Search_Engine/crossReference/"
    ],
    "orig_elements": "eJzFlGtr2zAUhv+K8eeG6GZL6tdSWGAdY9u3JoSjo6PE1JegKNBQ9t9nz70ty7p0rBSDQee8enn9WEfXdznV1FCblpXPz7McBCu8c8EzcNzboijLQgYmS6YdllLnZ1neUAIPCXr9XY6QaNXF/dLTJq37Eu8Voaop7Tc0OCa6TdN1auphaw3tagcr2vad65zaVb4YqlV7s9wmiH2K1tPt2GaPrcFi3DGP83aeHp4PXUNPqyerXaxH+ZTayaruHNTTda+d/nTPF9/Px

Let's clean up the text in the chunks.

In [27]:

def clean_chunks(chunks):
    '''Remove boilerplate strings from each chunk's text (modifies the chunks in place)'''
    strings2replace = ["Series Pdf"]
    for ch in chunks:
        for string2replace in strings2replace:
            # str.replace leaves the text unchanged when the string is absent
            ch.text = ch.text.replace(string2replace, "")
    return chunks

In [28]:
new_chunks = clean_chunks(chunks)
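
Removing "Series Pdf" leaves behind the blank lines that surrounded it, which is why a run of four newlines appears in the second document's `page_content` further down. A minimal sketch of a follow-up pass, assuming runs of three or more newlines should collapse to the usual two; `collapse_blank_lines` is a hypothetical helper, not part of Unstructured.

```python
import re

def collapse_blank_lines(text):
    """Collapse runs of 3+ newlines to one blank line and trim both ends."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()

before = "BWCU\n\nSeries Pdf\n\n02_For USB 2.0,IEEE1394,LVDS HDMI1.4"
after_removal = before.replace("Series Pdf", "")
cleaned = collapse_blank_lines(after_removal)

print(repr(after_removal))
print(repr(cleaned))
```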

In [30]:
for ch in new_chunks:
    print_element_json(ch)

{
  "type": "CompositeElement",
  "element_id": "0cc42e01-686a-45c1-9c0e-097bd56b85a2",
  "text": "Home\n\nProducts\n\nEMI-Common Mode Choke\n\nSearch by :\n\nSearch\n\nPart Number\n\nCharacteristics\n\nCross reference",
  "metadata": {
    "filetype": "text/html",
    "languages": [
      "eng"
    ],
    "link_texts": [
      "\r\n\t\t\t\t\tHome\t\t\t\t\t",
      "Products",
      "Part Number",
      "Characteristics",
      "Cross reference"
    ],
    "link_urls": [
      "/en-global/home/index",
      "/en-global/Inductor/index",
      "/en-global/Product_Search_Engine/index/",
      "/en-global/Product_Search_Engine/specSearch/",
      "/en-global/Product_Search_Engine/crossReference/"
    ],
    "orig_elements": "eJzFlGtr2zAUhv+K8eeG6GZL6tdSWGAdY9u3JoSjo6PE1JegKNBQ9t9nz70ty7p0rBSDQee8enn9WEfXdznV1FCblpXPz7McBCu8c8EzcNzboijLQgYmS6YdllLnZ1neUAIPCXr9XY6QaNXF/dLTJq37Eu8Voaop7Tc0OCa6TdN1auphaw3tagcr2vad65zaVb4YqlV7s9wmiH2K1tPt2GaPrcFi3DGP83aeHp4PXUNPqyerXaxH+ZTayaruHNTTda+d/nTPF9/Px

Further cleanup to combine a table with its title text.

In [31]:
def combine_chunks(chunks):
    '''Combine each table with its title.
       This relies on the chunk just before a table being its title:
       the title chunk's text becomes the table chunk's text, and the
       title chunk is then dropped. text_as_html in the table's
       metadata is left unchanged.
    '''
    chunks2remove = set()
    # Start at 1: a table at index 0 has no preceding title, and
    # chunks[-1] would silently pick up the last chunk instead
    for i in range(1, len(chunks)):
        if chunks[i].to_dict()['type'] == 'Table':
            chunks[i].text = chunks[i-1].text
            chunks2remove.add(i-1)
    new_chunks = [chunk for i, chunk in enumerate(chunks) if i not in chunks2remove]
    return new_chunks

In [32]:
new_new_chunks = combine_chunks(new_chunks)
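
The title-merging idea can be tried on stand-in objects without Unstructured. A minimal sketch, assuming chunks expose `category` and `text`; `merge_titles_into_tables` is a hypothetical restatement of the step, and the `SimpleNamespace` chunks are made up. This sketch starts the loop at index 1, so a table in first position is left alone rather than picking up the last chunk's text through a negative index.

```python
from types import SimpleNamespace

def merge_titles_into_tables(chunks):
    """Give each table the preceding chunk's text, then drop that chunk."""
    drop = set()
    for i in range(1, len(chunks)):      # index 0 has no predecessor
        if chunks[i].category == "Table":
            chunks[i].text = chunks[i - 1].text
            drop.add(i - 1)
    return [c for i, c in enumerate(chunks) if i not in drop]

# Stand-in chunks (hypothetical)
demo = [
    SimpleNamespace(category="CompositeElement", text="Home\n\nProducts"),
    SimpleNamespace(category="CompositeElement", text="BWCU"),
    SimpleNamespace(category="Table", text="Series Size Code ..."),
]

merged = merge_titles_into_tables(demo)
for c in merged:
    print(c.category, "->", repr(c.text))
```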

In [33]:
for chunk in new_new_chunks:
    print_element_json(chunk)

{
  "type": "CompositeElement",
  "element_id": "0cc42e01-686a-45c1-9c0e-097bd56b85a2",
  "text": "Home\n\nProducts\n\nEMI-Common Mode Choke\n\nSearch by :\n\nSearch\n\nPart Number\n\nCharacteristics\n\nCross reference",
  "metadata": {
    "filetype": "text/html",
    "languages": [
      "eng"
    ],
    "link_texts": [
      "\r\n\t\t\t\t\tHome\t\t\t\t\t",
      "Products",
      "Part Number",
      "Characteristics",
      "Cross reference"
    ],
    "link_urls": [
      "/en-global/home/index",
      "/en-global/Inductor/index",
      "/en-global/Product_Search_Engine/index/",
      "/en-global/Product_Search_Engine/specSearch/",
      "/en-global/Product_Search_Engine/crossReference/"
    ],
    "orig_elements": "eJzFlGtr2zAUhv+K8eeG6GZL6tdSWGAdY9u3JoSjo6PE1JegKNBQ9t9nz70ty7p0rBSDQee8enn9WEfXdznV1FCblpXPz7McBCu8c8EzcNzboijLQgYmS6YdllLnZ1neUAIPCXr9XY6QaNXF/dLTJq37Eu8Voaop7Tc0OCa6TdN1auphaw3tagcr2vad65zaVb4YqlV7s9wmiH2K1tPt2GaPrcFi3DGP83aeHp4PXUNPqyerXaxH+ZTayaruHNTTda+d/nTPF9/Px

In [34]:

def chunks2docs(chunks):
    '''Convert chunks to LangChain Documents; tables keep their HTML in metadata'''
    docs = []
    for chunk in chunks:
        # html_source is the notebook-level URL defined near the top
        metadata = {'source': html_source}
        if chunk.to_dict()['type'] == 'Table':
            metadata['text_as_html'] = chunk.to_dict()['metadata']['text_as_html']
        docs.append(Document(page_content=chunk.text, metadata=metadata))
    return docs

In [35]:
docs = chunks2docs(new_new_chunks)
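
`chunks2docs` reads `html_source` from the notebook's global scope. A minimal sketch of the same metadata logic with the source passed in explicitly, working on the plain dict that `to_dict()` returns; `build_metadata` is a hypothetical helper, and the sample dicts are made up.

```python
def build_metadata(chunk_dict, source):
    """Build Document metadata from a chunk's to_dict() output."""
    metadata = {"source": source}
    if chunk_dict.get("type") == "Table":
        # .get avoids a KeyError if a table has no text_as_html entry
        html = chunk_dict.get("metadata", {}).get("text_as_html")
        if html is not None:
            metadata["text_as_html"] = html
    return metadata

url = "https://www.chilisin.com/en-global/Inductor/index/emi_common"
table_chunk = {
    "type": "Table",
    "text": "BWCU",
    "metadata": {"text_as_html": "<table><tr><td>Series</td></tr></table>"},
}
text_chunk = {"type": "CompositeElement", "text": "Home", "metadata": {}}

table_meta = build_metadata(table_chunk, url)
text_meta = build_metadata(text_chunk, url)
print(table_meta)
print(text_meta)
```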

In [36]:
docs

[Document(page_content='Home\n\nProducts\n\nEMI-Common Mode Choke\n\nSearch by :\n\nSearch\n\nPart Number\n\nCharacteristics\n\nCross reference', metadata={'source': 'https://www.chilisin.com/en-global/Inductor/index/emi_common'}),
 Document(page_content='General\n\nAutomotive Grade\n\nNote\n\nBWCU\n\n\n\n02_For USB 2.0,IEEE1394,LVDS HDMI1.4\n\n03_For USB 3.1,HDMI 2.0, IEEE1394b,LVDS', metadata={'source': 'https://www.chilisin.com/en-global/Inductor/index/emi_common', 'text_as_html': '<table><tr><td>Series</td><td>Size Code (JIS/EIA)</td><td>Impedance(Ω)</td><td>Inductance(uH)</td><td>RDC(Ω)</td><td>Rated Current(mA)</td><td></td></tr><tr><td>BWCU_121008-02</td><td>1210/0504</td><td>25~330</td><td>-</td><td>0.25~1.3</td><td>100~400</td><td>More</td></tr><tr><td>BWCU_160811-02</td><td>1608/0603</td><td>25~220</td><td>-</td><td>0.077~0.209</td><td>500</td><td>More</td></tr><tr><td>BWCU_201212-02</td><td>2012/0805</td><td>30~900</td><td>-</td><td>0.2~0.88</td><td>80~450</td><td>More</td><

Now let's add the LangChain documents to the Chroma vector database.

In [37]:
def docs2db(docs, persist_directory:str):
    mydb = Chroma(embedding_function=OpenAIEmbeddings(),
                    persist_directory=persist_directory)
    ids = mydb.add_documents(docs)
    return ids

In [38]:
docs2db(docs, '../mydb')

['b6df74cb-02c3-48a1-af90-d9888d7812c5',
 '2a272af8-d134-436c-bcad-22ab4d5a7210',
 'a46f1cae-d412-4dc5-8fd8-3c3017fdce81',
 '7c80f1fc-4e5a-447f-9b29-5946950b42a6',
 '8b1467dd-d483-4f0c-b60c-eb96ed40dcef',
 '85d01aa9-4e83-45af-9698-9388db4ec382',
 '5035b3ae-c15c-466e-a6d0-db887e74306c',
 '7071d052-5301-4fae-ba97-b307123aa99b']
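
The IDs above look like random UUIDs, so running the cell again would presumably store a second copy of every document. A minimal sketch of deriving repeatable IDs from each document's source and text with `uuid.uuid5`; `stable_ids` is a hypothetical helper. Passing the result as `ids=` to `add_documents` is an assumption about the vector store wrapper and should be checked against the installed LangChain version.

```python
import uuid

def stable_ids(texts, source):
    """Return one deterministic UUID string per text, namespaced by source."""
    namespace = uuid.uuid5(uuid.NAMESPACE_URL, source)
    return [str(uuid.uuid5(namespace, text)) for text in texts]

source = "https://www.chilisin.com/en-global/Inductor/index/emi_common"
texts = ["Home\n\nProducts", "BWCU"]   # stand-ins for each doc's page_content

ids = stable_ids(texts, source)
print(ids)
```

In the notebook the texts would come from `[d.page_content for d in docs]`. Two documents with identical text and source would receive the same ID, so the position in the list would need to be mixed in if that can happen.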