# Part 1: Introducing txtai

[txtai](https://github.com/neuml/txtai) executes machine-learning workflows to transform data and build AI-powered text indices to perform similarity search. The following is a summary of key features:

- 🔎 Large-scale similarity search with multiple index backends ([Faiss](https://github.com/facebookresearch/faiss), [Annoy](https://github.com/spotify/annoy), [Hnswlib](https://github.com/nmslib/hnswlib))
- 📄 Create embeddings for text snippets, documents, audio and images. Supports transformers and word vectors.
- 💡 Machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction
- ↪️️ Workflows that join pipelines together to aggregate business logic. txtai processes can be microservices or full-fledged indexing workflows.
- 🔗 API bindings for [JavaScript](https://github.com/neuml/txtai.js), [Java](https://github.com/neuml/txtai.java), [Rust](https://github.com/neuml/txtai.rs) and [Go](https://github.com/neuml/txtai.go)
- ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)

txtai and/or the concepts behind it have already been used to power the Natural Language Processing (NLP) applications listed below:

- [paperai](https://github.com/neuml/paperai) - AI-powered literature discovery and review engine for medical/scientific papers
- [tldrstory](https://github.com/neuml/tldrstory) - AI-powered understanding of headlines and story text
- [neuspo](https://neuspo.com) - Fact-driven, real-time sports event and news site
- [codequestion](https://github.com/neuml/codequestion) - Ask coding questions directly from the terminal

txtai is built with Python 3.6+, [Hugging Face Transformers](https://github.com/huggingface/transformers), [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) and [FastAPI](https://github.com/tiangolo/fastapi).

This notebook gives an overview of txtai and how to run similarity searches.

# Install dependencies

Install `txtai` and all dependencies.

In [None]:
%%capture
!pip install git+https://github.com/neuml/txtai

# Create an Embeddings instance

The Embeddings instance is the main entrypoint for txtai. An Embeddings instance defines the method used to tokenize and convert a text section into an embeddings vector. 

In [None]:
%%capture

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Running similarity queries

An Embeddings instance relies on the underlying transformer model to build text embeddings. The following example shows how to use a transformers Embeddings instance to run similarity searches for a list of different concepts.

In [None]:
data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = embeddings.similarity(query, data)[0][0]

    print("%-20s %s" % (query, data[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day


The example above shows that for almost all of the queries, the query text doesn't appear in any of the text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!
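
To make the contrast with token-based search concrete, the sketch below runs a naive whole-word match over the same headlines and queries. The `tokens` and `keyword_matches` helpers are illustrative assumptions and not part of txtai; real keyword engines add stemming, stop words and ranking.

```python
import re

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

queries = ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk")

def tokens(text):
    # Hypothetical helper: lowercase the text and split it into alphanumeric words
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_matches(query, texts):
    # Indices of texts that share at least one whole word with the query
    terms = tokens(query)
    return [i for i, text in enumerate(texts) if terms & tokens(text)]

matches = {query: keyword_matches(query, data) for query in queries}
for query, hits in matches.items():
    print("%-20s %s" % (query, hits))
```

With whole-word matching every query returns an empty list, since none of the query words occur in the headlines. Only a substring match would pair "war" with "warns".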

# Building an Embeddings index

For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and convert all of the texts to embeddings on each query. txtai supports building pre-computed indices, which significantly improve performance.

Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector on each search.

In [None]:
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day
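
The idea behind a pre-computed index can be sketched in plain Python. This is a minimal illustration, assuming a stand-in `encode` function based on letter frequencies and a hypothetical `ToyIndex` class; neither reflects how txtai or its index backends are implemented. The point is only to count how often the encoder runs.

```python
import math

def encode(text, calls):
    # Stand-in encoder (assumption): letter-frequency vector, NOT a transformer model
    calls.append(text)
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

class ToyIndex:
    def __init__(self):
        self.vectors = {}
        self.calls = []

    def index(self, documents):
        # Encode every text once and keep the resulting vector
        for uid, text, _ in documents:
            self.vectors[uid] = encode(text, self.calls)

    def search(self, query, limit):
        # Only the query is encoded at search time
        qv = encode(query, self.calls)
        scores = [(uid, cosine(qv, v)) for uid, v in self.vectors.items()]
        return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

texts = ["ice shelf collapsed", "lottery ticket wins", "virus cases confirmed"]
toy = ToyIndex()
toy.index([(uid, text, None) for uid, text in enumerate(texts)])
indexed_calls = len(toy.calls)

results = toy.search("ice shelf", 1)
search_calls = len(toy.calls) - indexed_calls
print(results[0][0], indexed_calls, search_calls)
```

The encoder runs once per text at index time and once per search afterwards, regardless of how many texts are stored. A production index would also replace the linear scan in `search` with an approximate nearest neighbor structure.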


# Embeddings load/save

Embeddings indices can be saved to disk and reloaded.

In [None]:
embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg


In [None]:
!ls index

config	embeddings
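
As a rough mental model of the save and load round trip, the sketch below writes a toy index to a directory and reads it back. The file names mirror the listing above, but the JSON encoding and the `save` and `load` helpers are assumptions made for illustration; they are not txtai's on-disk format.

```python
import json
import os
import tempfile

def save(path, config, vectors):
    # Write settings and vectors to two files in one directory (JSON is an assumption)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "config"), "w") as f:
        json.dump(config, f)
    with open(os.path.join(path, "embeddings"), "w") as f:
        json.dump(vectors, f)

def load(path):
    with open(os.path.join(path, "config")) as f:
        config = json.load(f)
    with open(os.path.join(path, "embeddings")) as f:
        # JSON object keys are strings, so convert ids back to int
        vectors = {int(uid): vector for uid, vector in json.load(f).items()}
    return config, vectors

config = {"method": "toy"}
vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index")
    save(path, config, vectors)
    files = sorted(os.listdir(path))
    loaded_config, loaded_vectors = load(path)

print(files, loaded_config == config, loaded_vectors == vectors)
```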


# Embeddings update/delete

Updates and deletes are supported for Embeddings indices. The upsert operation will insert new data and update existing data.

The following section runs a query, then updates a value, which changes the top result, and finally deletes the updated value to revert to the original query results.

In [None]:
# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])

# Update data
data[0] = "Feel good story: baby panda born"
embeddings.upsert([(0, data[0], None)])

uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", data[uid])

# Remove the updated record from the index
embeddings.delete([0])

# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", data[uid])

Initial:  Maine man wins $1M from $25 lottery ticket
After update:  Feel good story: baby panda born
After delete:  Maine man wins $1M from $25 lottery ticket
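
The upsert and delete semantics described above can be sketched with a dictionary keyed by id. The `ToyStore` class is a hypothetical stand-in that only tracks text by id; it is not txtai's implementation and holds no vectors.

```python
class ToyStore:
    """Dictionary-backed stand-in for an index (assumption, not txtai's implementation)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, documents):
        # Insert new ids and overwrite existing ones
        for uid, text, _ in documents:
            self.rows[uid] = text

    def delete(self, ids):
        # Remove ids that exist and report which ones were removed
        return [uid for uid in ids if self.rows.pop(uid, None) is not None]

store = ToyStore()
store.upsert([(0, "US tops 5 million confirmed virus cases", None)])
store.upsert([(0, "Feel good story: baby panda born", None),             # updates id 0
              (1, "Maine man wins $1M from $25 lottery ticket", None)])  # inserts id 1

after_upsert = dict(store.rows)
removed = store.delete([0, 99])
print(after_upsert, removed, store.rows)
```

Note that a delete removes the record outright instead of restoring its earlier text. In the notebook example, id 0 is gone from the index after the delete, which is why the lottery headline becomes the top result again.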


# Embedding methods

Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embeddings vectors. Both methods have their merits, as shown below:

- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
  - Creates a single embeddings vector via mean pooling of vectors generated by the transformers library. 
  - Supports models stored on Hugging Face's model hub or stored locally. 
  - See sentence-transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face's model hub.
  - Base models require significant compute capability (GPU preferred). It is possible to build smaller, lighter-weight models that trade off accuracy for speed.
- word embeddings
  - Creates a single embeddings vector via BM25 scoring of each word component. See this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240) for the logic behind this method.
  - Backed by the [pymagnitude](https://github.com/plasticityai/magnitude) library. Pre-trained word vectors can be installed from the referenced link.
  - See [words.py](https://github.com/neuml/txtai/blob/master/src/python/txtai/vectors/words.py) for code that can build word vectors for custom datasets.
  - Significantly faster with default models. For larger datasets, it offers a good tradeoff of speed and accuracy.
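
The two pooling strategies in the list above can be compared on toy data. This is a minimal sketch, assuming made-up 2-d vectors and the standard Okapi BM25 term weight with a Lucene-style IDF and common default values for `k1` and `b`. The function names are hypothetical, and the weighting txtai actually applies may differ.

```python
import math

def mean_pool(vectors):
    # Average the vectors component-wise into a single vector
    size = len(vectors)
    return [sum(column) / size for column in zip(*vectors)]

def bm25_weight(term, document, corpus, k1=1.2, b=0.75):
    # Standard Okapi BM25 term weight; k1 and b values are assumptions
    n = sum(1 for doc in corpus if term in doc)
    idf = math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1)
    freq = document.count(term)
    avglen = sum(len(doc) for doc in corpus) / len(corpus)
    return idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(document) / avglen))

def weighted_pool(document, corpus, word_vectors):
    # Weighted average of word vectors, using BM25 weights per word
    weights = [bm25_weight(term, document, corpus) for term in document]
    total = sum(weights)
    dims = len(next(iter(word_vectors.values())))
    return [sum(w * word_vectors[t][d] for w, t in zip(weights, document)) / total
            for d in range(dims)]

# Toy 2-d word vectors (made up for illustration)
word_vectors = {"the": [1.0, 0.0], "bear": [0.0, 1.0], "park": [0.0, 1.0]}
corpus = [["the", "bear"], ["the", "park"]]

plain = mean_pool([word_vectors[t] for t in corpus[0]])
weighted = weighted_pool(corpus[0], corpus, word_vectors)
print(plain, weighted)
```

Plain averaging treats "the" and "bear" equally, while the BM25 weights shift the pooled vector towards "bear", the rarer and more informative word in this toy corpus.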

# Next
In part 2 of this series, we'll look at how to use txtai on larger datasets with the Hugging Face Datasets library.