# Web extraction

This notebook shows examples of text extraction from web pages using LangChain loaders and transformers.

**Table of contents**<a id='toc0_'></a>    
- 1. [Methods to load from websites](#toc1_)    
  - 1.1. [Load from Async local loader](#toc1_1_)    
    - 1.1.1. [Clean html output from async loader with the Html2Text transformer](#toc1_1_1_)    
    - 1.1.2. [Split output from Html2Text transformer with recursive character text splitter](#toc1_1_2_)    
- 2. [Evaluate loaded docs by embedding similarity](#toc2_)    
  - 2.1. [Embedding & Storage](#toc2_1_)    
  - 2.2. [Similarity search](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=2
	maxLevel=4
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [2]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

import pandas as pd
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm.autonotebook import trange


## 1. <a id='toc1_'></a>[Methods to load from websites](#toc0_)

In [3]:
urls = [
    "https://en.wikipedia.org/wiki/Unstructured_data",
    "https://unstructured-io.github.io/unstructured/introduction.html",
]

##### Load text splitter

In [4]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, just to make splitting evident.
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
    separators=["\n\n\n", "\n\n", "\n", "."],
)
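
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch in plain Python. It is not LangChain's algorithm: it only slides a fixed-size window over a string and records each chunk's start offset, similar in spirit to `add_start_index=True`. `RecursiveCharacterTextSplitter` additionally tries the `separators` in order, so real chunks usually end at paragraph or line breaks and are often shorter than `chunk_size`. The helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Return (start_index, chunk) pairs from a fixed-size sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 1200 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_chunks(sample, chunk_size=500, chunk_overlap=100)
for start, chunk in pieces:
    print(start, len(chunk))
```

With these settings, the last 100 characters of each chunk are repeated at the start of the next one, which is why neighbouring chunks in the split output share some lines.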

### 1.1. <a id='toc1_1_'></a>[Load from Async local loader](#toc0_)

In [5]:
from langchain.document_loaders import AsyncHtmlLoader

# verify_ssl=False skips TLS certificate checks: convenient for a demo, unsafe in production.
loader = AsyncHtmlLoader(urls, verify_ssl=False)
docs = loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---')

Fetching pages: 100%|##########| 2/2 [00:00<00:00,  2.53it/s]


<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Unstructured data - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vec



#### 1.1.1. <a id='toc1_1_1_'></a>[Clean html output from async loader with the Html2Text transformer](#toc0_)

In [6]:
from langchain.document_transformers import Html2TextTransformer

html2text_transformer = Html2TextTransformer()
docs = html2text_transformer.transform_documents(documents=docs)
for doc in docs:
    print(f'{doc.page_content}\n---')

Jump to content

Main menu

Main menu

move to sidebar hide

Navigation

  * Main page
  * Contents
  * Current events
  * Random article
  * About Wikipedia
  * Contact us
  * Donate

Contribute

  * Help
  * Learn to edit
  * Community portal
  * Recent changes
  * Upload file

Languages

Language links are at the top of the page.

Search

Search

  * Create account
  * Log in

Personal tools

  * Create account
  * Log in

Pages for logged out editors learn more

  * Contributions
  * Talk

## Contents

move to sidebar hide

  * (Top)

  * 1Background

  * 2Issues with terminology

  * 3Dealing with unstructured data

Toggle Dealing with unstructured data subsection

    * 3.1Approaches in natural language processing

    * 3.2Approaches in medicine and biomedical research

  * 4The use of "unstructured" in data privacy regulations

  * 5See also

  * 6Notes

  * 7References

  * 8External links

Toggle the table of contents

# Unstructured data

11 languages

  * العربية
  * Català
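
The transformed output is plain text, but it still carries the page's navigation menus and sidebars. As a rough illustration of what tag stripping involves, the sketch below uses only the standard library's `html.parser`. It is not how `Html2TextTransformer` is implemented (that transformer relies on the `html2text` package and emits Markdown-style text); the class name `TextExtractor`, the helper `html_to_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = """<html><head><title>Unstructured data - Wikipedia</title>
<script>var className = "client-js";</script></head>
<body><h1>Unstructured data</h1><p>Typically <b>text-heavy</b>.</p></body></html>"""
result = html_to_text(sample_html)
print(result)
```

A parser like this removes markup and scripts but cannot tell article text from menus, which is why the navigation items still appear in the output and end up in the chunks later on.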

#### 1.1.2. <a id='toc1_1_2_'></a>[Split output from Html2Text transformer with recursive character text splitter](#toc0_)

In [7]:
docs = text_splitter.split_documents(docs)
for doc in docs:
    print(f'{doc.page_content}\n---')

Jump to content

Main menu

Main menu

move to sidebar hide

Navigation

  * Main page
  * Contents
  * Current events
  * Random article
  * About Wikipedia
  * Contact us
  * Donate

Contribute

  * Help
  * Learn to edit
  * Community portal
  * Recent changes
  * Upload file

Languages

Language links are at the top of the page.

Search

Search

  * Create account
  * Log in

Personal tools

  * Create account
  * Log in

Pages for logged out editors learn more

  * Contributions
  * Talk
---
* Create account
  * Log in

Pages for logged out editors learn more

  * Contributions
  * Talk

## Contents

move to sidebar hide

  * (Top)

  * 1Background

  * 2Issues with terminology

  * 3Dealing with unstructured data

Toggle Dealing with unstructured data subsection

    * 3.1Approaches in natural language processing

    * 3.2Approaches in medicine and biomedical research

  * 4The use of "unstructured" in data privacy regulations

  * 5See also

  * 6Notes

  * 7References
---
* 5S

## 2. <a id='toc2_'></a>[Evaluate loaded docs by embedding similarity](#toc0_)

### 2.1. <a id='toc2_1_'></a>[Embedding & Storage](#toc0_)

In [8]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings(
    model_name='intfloat/e5-large-v2',
    embed_instruction="",  # no instructions needed for candidate passages
    query_instruction="Represent this sentence for searching relevant passages: ",
    encode_kwargs=encode_kwargs,
)
vectorstore = FAISS.from_documents(documents=docs, embedding=embd_model)

load INSTRUCTOR_Transformer
max_seq_length  512
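
The `normalize_embeddings` flag matters for how distances are interpreted. For unit-length vectors, the dot product equals the cosine similarity, and the squared L2 distance is `2 - 2 * cosine`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks this on two toy vectors in plain Python; the vectors and helper names are hypothetical and stand in for real embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional "embeddings".
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos = cosine_similarity(a, b)
dot_unit = dot(a_unit, b_unit)
dist_sq = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(cos, dot_unit, dist_sq)
```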


### 2.2. <a id='toc2_2_'></a>[Similarity search](#toc0_)

In [9]:
query = "how unstructured deal with ambiguities?"

ans = vectorstore.similarity_search(query)
print("-------Async local Loader + html2text transformer----------\n")
print(ans[0].page_content)



-------Async local Loader + html2text transformer----------

**Unstructured data** (or **unstructured information** ) is information that
either does not have a pre-defined data model or is not organized in a pre-
defined manner. Unstructured information is typically text-heavy, but may
contain data such as dates, numbers, and facts as well. This results in
irregularities and ambiguities that make it difficult to understand using
traditional programs as compared to data stored in fielded form in databases
or annotated (semantically tagged) in documents.
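
`similarity_search` returns the `k` closest chunks (4 by default in LangChain), and the cell above prints only the first. Conceptually, a flat index does an exhaustive nearest-neighbour scan like the sketch below, which ranks toy vectors by squared L2 distance to a query vector. This is an illustration in plain Python, not FAISS code; `top_k` and the toy vectors are hypothetical.

```python
def top_k(query_vec, doc_vecs, k=4):
    """Return (index, squared_l2_distance) pairs for the k closest vectors."""
    scored = [
        (i, sum((q - d) ** 2 for q, d in zip(query_vec, vec)))
        for i, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy unit-length "chunk embeddings" and a query embedding.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
toy_query = [0.0, 1.0]

hits = top_k(toy_query, toy_docs, k=2)
for index, distance in hits:
    print(index, distance)
```

Inspecting more than the first hit is a quick way to see whether navigation-only chunks rank highly for a query, which would suggest that further cleaning is needed before indexing.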
