In [27]:
#!pip install git+https://github.com/neuml/txtai
#!pip install newspaper3k

The Embeddings instance is the main entry point for txtai. It defines the method used to tokenize a section of text and convert it into an embeddings vector.
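To make "tokenize and convert to a vector" concrete, here is a deliberately crude sketch. It is not what txtai or the transformer model does: `toy_embed` is a hypothetical stand-in that hashes each token into one of a few buckets, whereas a real model produces dense learned vectors. It only shows the shape of the operation (text in, fixed-length list of floats out).

```python
import re
import zlib


def toy_embed(text, dims=8):
    """Hypothetical stand-in for a real model: count tokens per hash bucket."""
    # Crude tokenizer: lowercase, keep runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    vector = [0.0] * dims
    for token in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        vector[zlib.crc32(token.encode("utf-8")) % dims] += 1.0

    return vector


vec = toy_embed("Maine man wins $1M from $25 lottery ticket")
print(len(vec), vec)
```

Every input, short or long, maps to a vector of the same length, which is what makes the comparisons in the following cells possible.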

In [7]:
from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/distilbert-base-nli-stsb-mean-tokens"})


Running Similarity Query

In [9]:
data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest"):
    # Get the index of the section that best matches the query
    uid = embeddings.similarity(query, data)[0][0]

    print("%-20s %s" % (query, data[uid]))

Query                Best Match
--------------------------------------------------
feel good            US tops 5 million confirmed virus cases
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               The National Park Service warns against sacrificing slower friends in a bear attack
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
dishonest            Make huge profits without work, earn up to $100,000 a day


In [12]:
# `query` still holds the last value from the loop above ("dishonest")
embeddings.similarity(query, data)

[(5, 0.2530994415283203),
 (1, 0.12102699279785156),
 (3, 0.07071232795715332),
 (4, 0.04331689700484276),
 (2, 0.025478333234786987),
 (0, 0.020189998671412468)]
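
The output above is a list of (index, score) pairs sorted from best to worst match, which is why `[0][0]` picks the index of the top result. A minimal sketch of that ranking step follows, assuming cosine similarity as the scoring function (the notebook does not state which measure txtai uses) and using made-up two-dimensional vectors in place of real embeddings. The names `cosine` and `rank` are illustrative, not part of txtai.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending score."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for document embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

results = rank([1.0, 0.0], doc_vecs)
print(results)

# Index of the best match, as in the loop above
print(results[0][0])
```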

For larger document repositories, however, it doesn't make sense to tokenize and convert every document to embeddings on each query. txtai supports building pre-computed indices, which significantly improve performance.

In [16]:
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))

Query                Best Match
--------------------------------------------------
feel good story      US tops 5 million confirmed virus cases
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               The National Park Service warns against sacrificing slower friends in a bear attack
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
dishonest junk       Make huge profits without work, earn up to $100,000 a day


In [24]:
# `query` still holds the last value from the loop above ("dishonest junk")
embeddings.search(query, 1)

[(5, 0.24277672171592712)]
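
The index-then-search pattern can be sketched in a few lines. This is a simplified illustration, not txtai's implementation: `ToyIndex` is a hypothetical class, the vectors are made up, and a brute-force scan stands in for whatever index structure the library actually builds. The point is that document vectors are prepared once in `index`, so each `search` only has to process the query.

```python
import math


class ToyIndex:
    """Hypothetical brute-force index over pre-computed vectors."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else list(vec)

    def index(self, items):
        """Store unit-length vectors once. items: iterable of (uid, vector)."""
        for uid, vec in items:
            self.ids.append(uid)
            self.vectors.append(self._normalize(vec))

    def search(self, query_vec, limit):
        """Return the top `limit` (uid, score) pairs by dot product."""
        q = self._normalize(query_vec)
        scored = [
            (uid, sum(a * b for a, b in zip(q, vec)))
            for uid, vec in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


toy = ToyIndex()
toy.index([(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [3.0, 4.0])])

# Same result format as above: a list of (uid, score)
hits = toy.search([0.0, 2.0], 1)
print(hits)
```

Because the stored vectors are already unit length, the dot product in `search` equals the cosine similarity, so the per-query cost is one normalization plus one pass over the stored vectors.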

Building a query system on your own dataset

In [87]:
import pandas as pd
import numpy as np

In [89]:
test = pd.read_csv('/content/BBC News Test.csv')

In [90]:
dataset = test['Text'].tolist()

In [91]:
len(dataset)

735

In [81]:
# `train` is assumed to be the training split, loaded in a cell not shown here
train['Category'].unique()

array(['business', 'tech', 'politics', 'sport', 'entertainment'],
      dtype=object)

In [93]:
from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/distilbert-base-nli-stsb-mean-tokens"})

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ('business', 'tech', 'politics', 'sport', 'entertainment'):
    # Get the index of the article (among the first 100) that best matches the query
    uid = embeddings.similarity(query, dataset[:100])[0][0]

    print("%-20s %s" % (query, dataset[:100][uid]))

Query                Best Match
--------------------------------------------------
business             fiat chief takes steering wheel the chief executive of the fiat conglomerate has taken day-to-day control of its struggling car business in an effort to turn it around.  sergio marchionne has replaced herbert demel as chief executive of fiat auto  with mr demel leaving the company. mr marchionne becomes the fourth head of the business - which is expected to make a 800m euro ($1bn) loss in 2004 - in as many years. fiat underperformed the market in europe last year  seeing flat sales.  the car business has made an operating loss in five of the last six years and was forced to push back its break-even target from 2005 to 2006. the management changes are part of a wider shake-up of the business following fiat s resolution of its dispute with general motors. as part of a major restructuring  fiat is to integrate the maserati car company - currently owned by ferrari - within its own operat