[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb)


#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Preparing Text Data for use with Retrieval-Augmented LLMs

In this walkthrough we'll look at an example of preparing text data for retrieval-augmented question answering with **L**arge **L**anguage **M**odels (LLMs), along with some of the considerations involved.

This notebook is adapted from https://github.com/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb to use gpt4all and the Flutter documentation website. A video explanation of the original notebook is available at https://www.youtube.com/watch?v=eqOfr4AGLk8


## Required Libraries


This notebook needs a few Python libraries, which we install with `pip`:


In [None]:
%pip install -qU langchain unstructured matplotlib seaborn tqdm transformers sentencepiece


## Preparing Data


The data comes from the [Flutter documentation website](https://github.com/flutter/website), which is hosted at https://docs.flutter.dev/ and was last updated in March 2023. The static HTML files are stored in the `site` directory.


Now we can use LangChain itself to process these docs. We do this using the `UnstructuredHTMLLoader` like so:


In [None]:
from langchain.document_loaders import UnstructuredHTMLLoader

# html_files_index.txt lists one HTML file path per line
with open('html_files_index.txt', 'r') as file:
    # strip trailing newlines and skip blank lines
    file_paths = [line.strip() for line in file if line.strip()]

docs = []
for file_path in file_paths:
    # each loader returns a list of Document objects for that file
    docs.extend(UnstructuredHTMLLoader(file_path).load())
len(docs)


This leaves us with `428` processed doc pages. Let's take a look at the format each one contains:


In [None]:
docs[0]


We access the plaintext page content like so:


In [None]:
print(docs[0].page_content)


In [None]:
print(docs[5].page_content)


We can also find the source of each document:


In [None]:
docs[5].metadata['source'].replace('./site', 'https://docs.flutter.dev')
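
The path-to-URL mapping above can also be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `source_to_url` and the sample path are hypothetical, and it assumes every source path starts with `./site`. It rewrites only the leading prefix, whereas `str.replace` would also rewrite any later occurrence of `./site` in the path.

```python
def source_to_url(source, local_root="./site", base_url="https://docs.flutter.dev"):
    """Map a local HTML path under `local_root` to its public URL."""
    # Only rewrite the leading prefix of the path.
    if source.startswith(local_root):
        return base_url + source[len(local_root):]
    return source  # leave unexpected paths untouched


# Illustrative path, assumed to follow the layout of the `site` directory
example = source_to_url("./site/whatnew/index.html")
print(example)
```

Paths that do not start with the expected prefix are returned unchanged, which makes them easy to spot later.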


Looks good. We also need to consider the length of each page, measured in tokens, relative to what will reasonably fit within the context window of the latest LLMs. We will use `gpt4all` as an example.

To count the number of tokens that `gpt4all` will use for some text we need to initialize the `transformers.LlamaTokenizer`.


In [None]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("tokenizer.model")

# create the length function
def token_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)


Using the `token_len` function, let's count and visualize the number of tokens across our webpages.


In [None]:
token_counts = [token_len(doc.page_content) for doc in docs]


Let's see `min`, average, and `max` values:


In [None]:
print(f"""Min: {min(token_counts)}
Avg: {int(sum(token_counts) / len(token_counts))}
Max: {max(token_counts)}""")
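
The average can be pulled upward by a handful of very long pages, so it may help to also look at the median and at how many pages exceed a given budget. The following is a small sketch with made-up numbers: `sample_counts` stands in for `token_counts`, and the 250-token `chunk_limit` is an assumed budget used only for illustration.

```python
import statistics

# Hypothetical token counts standing in for `token_counts`
sample_counts = [120, 340, 560, 980, 1500, 2300, 4100, 8700]
chunk_limit = 250  # assumed per-chunk token budget

median = statistics.median(sample_counts)
over_limit = sum(1 for n in sample_counts if n > chunk_limit)
share_over = over_limit / len(sample_counts)

print(f"Median: {median}")
print(f"Pages over {chunk_limit} tokens: {over_limit} of {len(sample_counts)} ({share_over:.0%})")
```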


Now visualize:


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# set style and color palette for the plot
sns.set_style("whitegrid")
sns.set_palette("muted")

# create histogram
plt.figure(figsize=(12, 6))
sns.histplot(token_counts, kde=False, bins=50)

# customize the plot info
plt.title("Token Counts Histogram")
plt.xlabel("Token Count")
plt.ylabel("Frequency")

plt.show()


The vast majority of pages contain a relatively small number of tokens. Even so, the limit on the number of tokens we can put in each chunk is lower than the length of some of these smaller pages. So how do we decide what this limit should be?


### Chunking the Text

At the time of writing, `gpt4all` is built on the `llama-7b` model, which supports a context window of 2048 tokens. That means input tokens plus generated (completion) output tokens cannot total more than 2048 without hitting an error.

So we must stay below this limit. We assume a very safe margin of ~1000 tokens for the input prompt into `gpt4all`, leaving ~1000 tokens for conversation history and completion.

With this ~1000 token limit we may want to include _4_ snippets of relevant information, meaning each snippet can be no more than **250** tokens long.
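
The budget arithmetic above can be written out as a quick check. The variable names here are illustrative, and the numbers are the ones stated in the text:

```python
context_window = 2048  # llama-7b context size stated above
input_budget = 1000    # tokens reserved for retrieved snippets in the prompt
num_snippets = 4       # number of snippets we want to fit

chunk_size = input_budget // num_snippets
remaining = context_window - input_budget  # left for history and completion

print(chunk_size, remaining)
```

Changing `num_snippets` or `input_budget` gives the corresponding chunk size for a different trade-off.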

To create these snippets we use the `RecursiveCharacterTextSplitter` from LangChain. To measure the length of snippets we also need a _length function_. This is a function that consumes text, counts the number of tokens within the text (after tokenization using the `llama` tokenizer), and returns that number. We already defined it above as `token_len`.


With the length function in place we can initialize our `RecursiveCharacterTextSplitter` object like so:


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=token_len,
    separators=['\n\n', '\n', ' ', '']
)
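
To build some intuition for what the `separators` list does, here is a simplified sketch of the recursive idea. It is not LangChain's implementation: it ignores `chunk_overlap`, the function `simple_recursive_split` is hypothetical, and `word_len` counts whitespace-separated words as a stand-in for a token-based length function.

```python
def simple_recursive_split(text, chunk_size, length_function, separators):
    """Split on the coarsest separator, recurse into oversized pieces with
    finer separators, and merge neighbouring small pieces."""
    if length_function(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing finer to split on; keep the oversized piece

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if length_function(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(piece) <= chunk_size:
            current = piece
        else:
            # piece alone is too long: retry with the finer separators
            chunks.extend(
                simple_recursive_split(piece, chunk_size, length_function, finer)
            )
    if current:
        chunks.append(current)
    return chunks


def word_len(text):
    return len(text.split())  # stand-in for a token-based length function


sample = "one two three\n\nfour five six seven eight\n\nnine ten"
result = simple_recursive_split(sample, 4, word_len, ["\n\n", "\n", " ", ""])
print(result)
```

Paragraph breaks are tried first, so text is only cut at spaces when a whole paragraph does not fit within the limit.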


Then we split the text for a document like so:


In [None]:
chunks = text_splitter.split_text(docs[5].page_content)
len(chunks)


In [None]:
token_len(chunks[0]), token_len(chunks[1])


For `docs[5]` we created `2` chunks of token length `200` and `215`.

This is for a single document, but we need to do this for all of our documents. While we iterate through the docs to create these chunks, we will reformat them into a structure that looks like:

```json
[
    {
        "id": "abc-0",
        "text": "some important document text",
        "source": "https://docs.flutter.dev/codelabs/implicit-animations/index.html"
    },
    {
        "id": "abc-1",
        "text": "the next chunk of important document text",
        "source": "https://docs.flutter.dev/whatnew/index.html"
    }
    ...
]
```

The `"id"` will be created from the URL of the text plus its chunk number.


In [None]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[5].metadata['source'].replace('./site', 'https://docs.flutter.dev')
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)


Then use the `uid` alongside chunk number and actual `url` to create the format needed:


In [None]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'source': url
    } for i, chunk in enumerate(chunks)
]
data
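
One detail is worth knowing before hashing many URLs: a `hashlib` object accumulates everything passed to `update()`, so reusing one object for several URLs makes each digest depend on all the URLs hashed before it. A minimal sketch of a helper that avoids this is shown below. The name `url_to_uid` is hypothetical, and the two URLs are only examples.

```python
import hashlib


def url_to_uid(url, length=12):
    """Return a short, repeatable ID derived only from `url`."""
    # A fresh md5 object per call keeps the ID independent of call order.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]


url_a = "https://docs.flutter.dev/whatnew/index.html"
url_b = "https://docs.flutter.dev/codelabs/implicit-animations/index.html"

# A shared hash object mixes earlier input into later digests.
shared = hashlib.md5()
shared.update(url_b.encode("utf-8"))
shared.update(url_a.encode("utf-8"))
shared_uid = shared.hexdigest()[:12]

print(url_to_uid(url_a), shared_uid)
```

With a fresh object per URL, rerunning the notebook or processing pages in a different order yields the same ID for the same page.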


Now we repeat the same logic across our full dataset:


In [None]:
from tqdm.auto import tqdm

documents = []

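for doc in tqdm(docs):
    url = doc.metadata['source'].replace('./site', 'https://docs.flutter.dev')
    # hash each URL with a fresh md5 object; reusing `m` would fold every
    # previously seen URL into the digest
    uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]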
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'source': url
        })

len(documents)


We're now left with `14600` documents. We can save them to a JSON lines (`.jsonl`) file like so:


In [None]:
import json

with open('train.jsonl', 'w') as f:
    for doc in documents:
        f.write(json.dumps(doc) + '\n')


To load the data from file we'd write:


In [None]:
documents = []

with open('train.jsonl', 'r') as f:
    for line in f:
        documents.append(json.loads(line))

len(documents)
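
As a quick sanity check on the format, the write and read steps can be exercised on a couple of records. This is a sketch under stated assumptions: the `records` below are made up to mirror the shape of `documents`, and the file is written to a temporary directory rather than to `train.jsonl`.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as `documents`
records = [
    {"id": "abc-0", "text": "some important document text",
     "source": "https://docs.flutter.dev/whatnew/index.html"},
    {"id": "abc-1", "text": "line one\nline two",
     "source": "https://docs.flutter.dev/whatnew/index.html"},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")

# json.dumps escapes embedded newlines, so each record stays on one line
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```

Because newlines inside chunk text are escaped, reading the file line by line recovers exactly one record per line.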


In [None]:
documents[0]


In [None]:
texts = [doc.pop('text') for doc in documents]
print(len(texts))
print(texts[0])
print(documents[0])
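
Note that `doc.pop('text')` removes the `text` key from each dictionary in `documents`, so rerunning that cell raises a `KeyError`. If the original records should stay intact, one possible alternative is to build the two lists without mutating anything. The `records` and `metadatas` names below are hypothetical and the data is made up:

```python
# Hypothetical records in the same shape as `documents`
records = [
    {"id": "abc-0", "text": "first chunk",
     "source": "https://docs.flutter.dev/whatnew/index.html"},
    {"id": "abc-1", "text": "second chunk",
     "source": "https://docs.flutter.dev/whatnew/index.html"},
]

# Collect the texts and copy every other field into a separate metadata list
texts = [record["text"] for record in records]
metadatas = [
    {key: value for key, value in record.items() if key != "text"}
    for record in records
]

print(texts[0])
print(metadatas[0])
```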

### (Optional) Sharing the Dataset


We've now created our dataset, and you can go ahead and use it in any way you like. If you'd like to share the dataset, or store it somewhere you can easily access later, you can use the [Hugging Face Datasets Hub](https://huggingface.co/datasets).

To begin we first need to create an account by clicking the **Sign Up** button at [huggingface.co](https://huggingface.co/). Once done we click our profile button in the same location > click **New Dataset** > give it a name like _"flutter-website-3.7"_ > set the dataset to **Public** or **Private** > click **Create dataset**.


In [None]:
%pip install datasets

In [None]:
from datasets import load_dataset

data = load_dataset("limcheekin/flutter-website-3.7", split="train")
data

In [None]:
data[0]