# Part 1: Introducing txtai

[txtai](https://github.com/neuml/txtai) builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and to create extractive question-answering systems.

NeuML uses txtai and/or the concepts behind it to power all of our Natural Language Processing (NLP) applications. Example applications:

- [paperai](https://github.com/neuml/paperai) - AI-powered literature discovery and review engine for medical/scientific papers
- [tldrstory](https://github.com/neuml/tldrstory) - AI-powered understanding of headlines and story text
- [neuspo](https://neuspo.com) - a fact-driven, real-time sports event and news site
- [codequestion](https://github.com/neuml/codequestion) - Ask coding questions directly from the terminal

txtai is built on the following stack:

- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [transformers](https://github.com/huggingface/transformers)
- [faiss](https://github.com/facebookresearch/faiss)
- Python 3.6+

This notebook gives an overview of txtai and how to run similarity searches.

# Install dependencies

Install txtai and all of its dependencies.

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai

# Create an Embeddings instance

The Embeddings instance is the main entry point for txtai. An Embeddings instance defines the method used to tokenize a text section and convert it into an embeddings vector.

In [2]:
%%capture

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Running similarity queries

An Embeddings instance relies on the underlying transformer model to build text embeddings. The following example shows how to use a transformers-based Embeddings instance to run similarity searches for a list of different concepts.

In [3]:
import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day


The example above shows that for almost all of the queries, the query text doesn't appear anywhere in the list of text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!
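For contrast, here is a minimal sketch of what a purely token-based match would find for the same queries. The `tokens` and `token_overlap` helpers are hypothetical and written only for this illustration; they are not part of txtai, and real keyword search engines are more sophisticated than a raw overlap count.

```python
import re

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

queries = ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk")

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric word tokens (hypothetical helper)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_overlap(query, text):
    """Count the word tokens shared by the query and the text (hypothetical helper)."""
    return len(tokens(query) & tokens(text))

for query in queries:
    # Score every section by the number of exact word tokens it shares with the query
    scores = [token_overlap(query, section) for section in sections]
    best = max(scores)

    # Report the top section only when at least one token actually matched
    match = sections[scores.index(best)] if best > 0 else "(no shared tokens)"
    print("%-20s %s" % (query, match))
```

With exact word matching, a query such as "climate change" has nothing to anchor on, while the embeddings-based search still ranks the ice shelf headline first.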

# Building an Embeddings index

For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and convert every section to embeddings on each query. txtai supports building pre-computed indices, which significantly improve performance.

Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector for each search.
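The idea of pre-computing can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `embed` is a toy character-count vector standing in for a real model, and `ToyIndex` is a hypothetical brute-force index, not txtai's implementation. The point is only where the work happens: section vectors are built once, and each search embeds just the query.

```python
import math

def embed(text, dims=16):
    """Toy stand-in for a model: a normalized character-count vector (hypothetical)."""
    vector = [0.0] * dims
    for ch in text.lower():
        if ch.isalnum():
            vector[ord(ch) % dims] += 1.0

    # Normalize so that a dot product between two vectors is a cosine similarity
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

class ToyIndex:
    """Hypothetical brute-force index that stores pre-computed vectors."""

    def __init__(self, texts):
        # Embed every text once, up front
        self.vectors = [embed(text) for text in texts]

    def search(self, query, limit=1):
        # Only the query is embedded at search time
        q = embed(query)
        scores = [(uid, sum(a * b for a, b in zip(q, vector)))
                  for uid, vector in enumerate(self.vectors)]

        # Return (uid, score) pairs, highest score first
        return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

index = ToyIndex(["abc", "xyz"])
print(index.search("abc", 1))
```

The toy index compares the query against every stored vector, which is fine for a handful of sections but would not scale to a large repository.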

In [4]:
# Create an index for the list of sections
embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print section
    print("%-20s %s" % (query, sections[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day


# Embeddings load/save

Embeddings indices can be saved to disk and reloaded. At this time, indices are not created incrementally; the index needs a full rebuild to incorporate new data. That enhancement is in the backlog.

In [5]:
embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(sections[uid])

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg


In [6]:
!ls index

config	embeddings


# Embedding methods

Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embeddings vectors. Both methods have their merits, as shown below:

- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
  - Creates a single embeddings vector via mean pooling of vectors generated by the transformers library. 
  - Supports models stored on Hugging Face's model hub or stored locally. 
  - See sentence-transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face's model hub.
  - Base models require significant compute capability (GPU preferred). It is possible to build smaller, lighter-weight models that trade off accuracy for speed.
- word embeddings
  - Creates a single embeddings vector via BM25 scoring of each word component. See this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240) for the logic behind this method.
  - Backed by the [pymagnitude](https://github.com/plasticityai/magnitude) library. Pre-trained word vectors can be installed from the referenced link.
  - See [vectors.py](https://github.com/neuml/txtai/blob/master/src/python/txtai/vectors.py) for code that can build word vectors for custom datasets.
  - Significantly faster with default models. For larger datasets, it offers a good tradeoff between speed and accuracy.
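The two pooling strategies described above can be sketched with plain Python lists. This is an illustration under stated assumptions, not txtai's code: the token vectors and weights are made up, and `weighted_pool` only shows the general shape of combining word vectors by per-word scores. Real BM25 scores are derived from corpus statistics, which this sketch does not compute.

```python
def mean_pool(vectors):
    """Average token vectors element-wise into a single vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def weighted_pool(vectors, weights):
    """Weighted average of word vectors; the weights stand in for per-word BM25 scores."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, column)) / total
            for column in zip(*vectors)]

# Made-up 2-dimensional vectors for a three-token text
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Every token contributes equally
print(mean_pool(token_vectors))

# A highly weighted word dominates, and a zero-weighted word is ignored
print(weighted_pool(token_vectors, [3.0, 1.0, 0.0]))
```

With equal weights the two functions give the same result; the weighted variant lets rare, informative words pull the final vector toward their meaning.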

# Next
In part 2 of this series, we'll look at how to use txtai to run extractive searches.