# Extract text from documents

_This notebook is part of a tutorial series on [txtai](https://github.com/neuml/txtai), an AI-powered semantic search platform._

Up to this point, the examples have worked with sections of text that were already split by some other means. What happens when we're working with whole documents? First we need to get the text out of the documents, then decide how to segment and index it to best support similarity search.

This notebook shows how to extract text from documents and segment it to support similarity search.

# Install dependencies

Install `txtai` and all dependencies. Since this notebook is using optional pipelines, we need to install the pipeline extras package.

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]

# Get test data
!wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz
!tar -xvzf tests.tar.gz

# Download the NLTK punkt tokenizer data used for sentence splitting
import nltk
nltk.download('punkt')

# Create a Textractor instance

The `Textractor` instance is the main entry point for extracting text. It is backed by Apache Tika, a robust text extraction library written in Java. [Apache Tika](https://tika.apache.org/0.9/formats.html) supports a large number of file formats, including PDF, Word, Excel and HTML. The [Python Tika package](https://github.com/chrismattmann/tika-python) automatically installs Tika and starts a local REST API instance used to read extracted data.

*Note: This requires Java to be installed locally.*
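
Since Tika runs on the JVM, it can help to confirm that a `java` executable is reachable before creating the Textractor. The check below is a minimal sketch using only the Python standard library; the `java_available` helper name is hypothetical and not part of txtai.

```python
import shutil

def java_available():
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None

# Warn early instead of waiting for the extraction call to fail
if not java_available():
    print("Java was not found on the PATH; Tika-backed extraction will likely fail.")
```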

In [2]:
%%capture

from txtai.pipeline import Textractor

# Create textractor model
textractor = Textractor()

# Extract text

The example below shows how to extract text from a file.

In [3]:
textractor("txtai/article.pdf")

'Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never solved or done. The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models and advancements coming in at what seems a weekly basis. This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application. Introducing txtai txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive question-answering based 

Note that the text from the article was extracted into a single string. Depending on the article, this may be acceptable. For long articles, you'll often want to split the content into logical sections to build better downstream vectors.

# Extract sentences

Passing `sentences=True` makes the Textractor return a list of sentences instead of a single string.

In [9]:
textractor = Textractor(sentences=True)
extracted = textractor("txtai/article.pdf")
print(extracted)
print(len(extracted))
for i, line in enumerate(extracted):
    print(f"{i+1} : {line}")

['Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Search is the base of many applications.', 'Once data starts to pile up, users want to be able to find it.', 'It’s the foundation of the internet and an ever-growing challenge that is never solved or done.', 'The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments.', 'Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people.', 'Innovation continues with new models and advancements coming in at what seems a weekly basis.', 'This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.', 'Introducing txtai txtai builds an AI-powered index over sections of text.', 'txtai supports building text indices to perform similarity searches and create extractive 

Now the document is split at the sentence level. These sentences can be fed to a workflow that adds each sentence to an embeddings index. Depending on the task, this may work well. Alternatively, it may be even better to split at the paragraph level.
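
One way to prepare the sentences for indexing is to turn them into `(id, text, tags)` tuples, a row format that txtai's `Embeddings.index` accepts. The sketch below is only an illustration: the `to_rows` helper and its id scheme are hypothetical, and the sample sentences are copied from the output above. The indexing calls themselves are left as comments because they need txtai and a downloaded model, and the model path shown there is an assumption.

```python
def to_rows(sentences, prefix="article"):
    """Build (id, text, tags) tuples, one per sentence (hypothetical id scheme)."""
    return [(f"{prefix}-{i}", sentence, None) for i, sentence in enumerate(sentences)]

# Sample sentences copied from the extraction output above
sentences = [
    "Once data starts to pile up, users want to be able to find it.",
    "The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments.",
]

rows = to_rows(sentences)
print(rows[0])

# With txtai installed, the rows could then be indexed along these lines
# (the model path is an assumption; pick whichever model suits the task):
# from txtai.embeddings import Embeddings
# embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
# embeddings.index(rows)
```

Keeping the row-building step separate from the indexing call makes it easy to swap sentences for paragraphs without touching the rest of the workflow.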

# Extract paragraphs

In [10]:
textractor = Textractor(paragraphs=True)
extracted = textractor("txtai/article.pdf")
# Print the number of paragraphs, then each paragraph with its position
print(len(extracted))
for i, paragraph in enumerate(extracted):
    print(f"{i+1} : {paragraph}")

13
1 : Introducing txtai, an AI-powered search engine built on Transformers
2 : Add Natural Language Understanding to any application
3 : Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never solved or done.
4 : The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models and advancements coming in at what seems a weekly basis.
5 : This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.
6 : Introducing txtai txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive
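
In the output above, the first two entries are a title and a subtitle rather than full paragraphs. If such short fragments add noise to an index, one option is to filter them out after extraction. The sketch below is plain Python; the `drop_short` helper and the 60-character threshold are hypothetical choices for illustration, not part of the original notebook.

```python
def drop_short(paragraphs, min_chars=60):
    """Keep only paragraphs with at least min_chars characters (hypothetical threshold)."""
    return [p for p in paragraphs if len(p.strip()) >= min_chars]

# Sample paragraphs copied from the output above
paragraphs = [
    "Add Natural Language Understanding to any application",
    "Search is the base of many applications. Once data starts to pile up, users want to be able to find it.",
]

kept = drop_short(paragraphs)
print(len(kept))
```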
