# Text Units Extraction

## Overview

This guide shows how to use the `TextUnitExtractor` class, which relies on
a supplied `TextSplitter` to extract text units from the supplied documents.

The output of this component is a pandas DataFrame with the following columns:
-  `document_id`
-  `id`
-  `text_unit`

## Make a fake Document

Below is some random text that we will use to make a `langchain` [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document).

In [1]:
from langchain_core.documents import Document

from langchain_graphrag.indexing import TextUnitExtractor

In [2]:
SOME_TEXT = """
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, 
making it over 2000 years old. Richard McClintock, a Latin professor 
at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words,
consectetur, from a Lorem Ipsum passage, and going through the cites of the word in 
classical literature, discovered the undoubtable source. Lorem Ipsum comes 
from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" 
(The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a 
treatise on the theory of ethics, very popular during the Renaissance. 
The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", 
comes from a line in section 1.10.32.

The standard chunk of Lorem Ipsum used since the 1500s is reproduced below 
for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et 
Malorum" by Cicero are also reproduced in their exact original form, accompanied
by English versions from the 1914 translation by H. Rackham.
"""

document = Document(page_content=SOME_TEXT)

## Select a TextSplitter

`TextUnitExtractor` requires you to supply a `TextSplitter`.

All available splitters are listed in [langchain_text_splitters](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html), and you can also write your own.

In this example, we are going to use the simplest of them, [CharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html#langchain_text_splitters.character.CharacterTextSplitter).

In [3]:
from langchain_text_splitters import CharacterTextSplitter

In [4]:
splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=64)

text_unit_extractor = TextUnitExtractor(text_splitter=splitter)
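
If none of the stock splitters fits, the idea of writing your own can be sketched as follows. LangChain's `TextSplitter` base class asks subclasses to implement `split_text(text)` and return a list of strings; the class below is a hypothetical, plain-Python stand-in that only mimics that one method, so it runs without any dependencies. The name `SentenceSplitter` and its `sentences_per_chunk` parameter are made up for illustration and are not part of any library.

```python
import re


class SentenceSplitter:
    """Hypothetical splitter that groups a fixed number of sentences per chunk.

    This is a plain class for illustration only. To pass it to
    TextUnitExtractor you would presumably subclass
    langchain_text_splitters.TextSplitter and override split_text instead.
    """

    def __init__(self, sentences_per_chunk: int = 2) -> None:
        if sentences_per_chunk < 1:
            raise ValueError("sentences_per_chunk must be at least 1")
        self.sentences_per_chunk = sentences_per_chunk

    def split_text(self, text: str) -> list[str]:
        # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
        sentences = [
            s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
        ]
        step = self.sentences_per_chunk
        # Join consecutive sentences into chunks of `step` sentences each.
        return [
            " ".join(sentences[i : i + step])
            for i in range(0, len(sentences), step)
        ]


chunks = SentenceSplitter(sentences_per_chunk=2).split_text(
    "One. Two! Three? Four. Five."
)
print(chunks)
```

The boundary rule is deliberately naive: it would also break after an initial such as "H." in "H. Rackham", so a real splitter would need a more careful sentence detector.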

## Run the TextUnitExtractor

Now we run it. The `run` method takes a list of documents and returns
a pandas DataFrame.

In [5]:
df_text_units = text_unit_extractor.run([document])

df_text_units.head()

Processing documents ...:   0%|          | 0/1 [00:00<?, ?it/s]
Created a chunk of size 773, which is longer than the specified 512
Extracting text units ...: 100%|██████████| 2/2 [00:00<00:00, 25653.24it/s]
Processing documents ...: 100%|██████████| 1/1 [00:00<00:00, 430.58it/s]


|   | document_id | id | text_unit |
|---|-------------|----|-----------|
| 0 | d6b99162-5843-4c73-89c1-e53d92d6dd56 | 534d87de-7463-47b0-81d2-2e4c392b4e7b | Contrary to popular belief, Lorem Ipsum is not... |
| 1 | d6b99162-5843-4c73-89c1-e53d92d6dd56 | 78f69b7b-9c74-4c57-af0d-e39c83119866 | The standard chunk of Lorem Ipsum used since t... |
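
The log above warns that a chunk of size 773 is longer than the specified 512. This happens because `CharacterTextSplitter` first splits on its separator, which defaults to a blank line (`"\n\n"`), and a piece that contains no separator cannot be broken down any further. The following is a minimal plain-Python sketch of that behaviour, not LangChain's actual implementation; the helper `split_on_separator` and the stand-in text are hypothetical.

```python
# Plain-Python sketch of separator-first splitting (not LangChain's code).
SEPARATOR = "\n\n"  # default separator of CharacterTextSplitter
CHUNK_SIZE = 512


def split_on_separator(text: str, separator: str = SEPARATOR) -> list[str]:
    """Split on the separator, strip whitespace, and drop empty pieces."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]


# Hypothetical stand-in text: one long paragraph followed by a short one.
long_paragraph = "word " * 150  # well over CHUNK_SIZE, no blank line inside
short_paragraph = "word " * 40  # comfortably under CHUNK_SIZE
text = long_paragraph + SEPARATOR + short_paragraph

pieces = split_on_separator(text)
for piece in pieces:
    status = "oversized" if len(piece) > CHUNK_SIZE else "fits"
    print(len(piece), status)
```

The long paragraph stays in one piece even though it exceeds `CHUNK_SIZE`. To keep every text unit under the limit, you could pass a finer `separator` or use `RecursiveCharacterTextSplitter`, which falls back to progressively smaller separators.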


## Final Remarks

As you can see above, this dataframe has three columns:
-  `document_id`
-  `id`
-  `text_unit`

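Our document is short and contains only two paragraphs, so we get just two rows: by default `CharacterTextSplitter` splits on blank lines, which here yields one text unit per paragraph.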

Every text unit gets a unique `id`, which other components use to refer to it.

If the document object (type `Document`) does not have an `id`, the
`TextUnitExtractor` generates one.
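
The id handling described above can be sketched in plain Python. This is only an illustration under assumptions: `FakeDocument` and `to_rows` are hypothetical names, and the use of `uuid.uuid4()` is a guess based on the shape of the ids in the output, not a statement about how `TextUnitExtractor` is implemented.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a langchain Document (not the real class)."""

    page_content: str
    id: Optional[str] = None


def to_rows(document: FakeDocument, chunks: list[str]) -> list[dict[str, str]]:
    """Build one row per chunk with the three columns shown above."""
    # Assumption: reuse the document's id if it has one, else generate one.
    document_id = document.id or str(uuid.uuid4())
    return [
        # Each text unit gets its own freshly generated id.
        {"document_id": document_id, "id": str(uuid.uuid4()), "text_unit": chunk}
        for chunk in chunks
    ]


rows = to_rows(FakeDocument("first part second part"), ["first part", "second part"])
for row in rows:
    print(row)
```

All rows from the same document share one `document_id`, while each row has its own `id`, which matches the pattern in the DataFrame above.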