## Semantic Search with txtai

The workflow has three steps: vectorize -> index -> search.

Warning: the data for this example are drawn from the Bitter Aloe Project, a research project documenting human rights abuses during apartheid in South Africa. The descriptions include accounts of violence.

First, install the txtai package.

In [3]:
%%capture
!pip install git+https://github.com/neuml/txtai

The original data set contains 21,000+ documents, which would take a long time to index. For the sake of this example, I'll use the first 2,500 documents.

In [4]:
%%capture
from txtai.embeddings import Embeddings 
import json 
from google.colab import drive 
import pandas as pd 

# mount Google Drive so the notebook can read and write files in my account
drive.mount('/content/drive', force_remount=True)

# Create the embeddings model, backed by sentence-transformers & transformers.
# The config is a dictionary with one key, "path", which names the model
# used to vectorize the corpus.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# load data from Google Drive
with open("/content/drive/MyDrive/Colab Notebooks/Data/vol7.json", "r") as f: 
  data = json.load(f)['descriptions'][:2500]



In [5]:
# examine length of data
len(data) 

2500

In [6]:
data[0]

"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."

In [7]:
# txtai expects (id, text, tags) tuples; use each document's position as its id
txtai_data = [(i, text, None) for i, text in enumerate(data)]
txtai_data[0]

(0,
 "An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.",
 None)

In [8]:
# create embeddings 
# this will take a bit! 
embeddings.index(txtai_data)

Once we've created the embeddings, we can query the data at a semantic level, meaning that we can return documents that match our search even if specific keywords are not entered.

The example below asks for documents related to the word "protest". The results include documents that use the word itself as well as documents on similar themes that never use the exact keyword.

In [9]:
res = embeddings.search("protest", 10)

# print doc number and the similarity score 
for item in res: 
  print(item) 

(1426, 0.4392104744911194)
(2203, 0.4273015260696411)
(122, 0.4196664094924927)
(2307, 0.4092385768890381)
(3, 0.40719857811927795)
(2349, 0.4065841734409332)
(31, 0.4040210247039795)
(1419, 0.4027186334133148)
(2464, 0.39241018891334534)
(547, 0.38735347986221313)
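
To give a feel for what these scores mean, here is a minimal sketch of cosine similarity, a measure commonly used to compare sentence embeddings. It assumes txtai's scores behave like cosine similarity over the model's vectors, which I believe is the default but have not verified here. The `cosine_similarity` helper and the three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # points in the same direction as the query
doc_far = [0.0, 3.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

The first pair points in the same direction and scores very close to 1; the second pair is orthogonal and scores 0. Vector length does not matter, only direction.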


In [10]:
# print out documents that are similar to our search terms 
res = embeddings.search("protest", 10)
for r in res: 
  print(f"Text: {data[r[0]]}")
  print(f"Similarity: {r[1]}")
  print() 

Text: Was injured on his right leg when he was knocked over by a police van in Klipplaat, Cape, on 13 May 1985, during a political protest in the area, which involved protesters toyi-toying and singing freedom songs in the street.
Similarity: 0.4392104744911194

Text: Was shot and injured by members of the SAP during a protest against racial discrimination in the use of beach facilities in Durban, on 1 January 1986. People claiming to be PAC supporters assaulted ‘whites’ and police fired shots randomly in an attempt to disperse the crowd.
Similarity: 0.4273015260696411

Text: A UDF supporter who was injured in a drive-by shooting by named members of the Municipal Police during protests over the Black Local Authority in Fort Beaufort, Cape, on 30 November 1985.
Similarity: 0.4196664094924927

Text: A UDF supporter who was severely injured when he was hit with pellets by named Special Constables during protests in Hofmeyr, Cape, on 10 February 1988.
Similarity: 0.4092385768890381

Text: 
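
The lookup in the loop above can be wrapped in a small helper that also drops weak matches. This is only a sketch: the name `matching_documents`, the `min_score` cutoff, and the sample lists are hypothetical, and it assumes results arrive as `(id, score)` tuples, as in the output printed earlier.

```python
def matching_documents(results, documents, min_score=0.0):
    """Pair each (id, score) result with its text, dropping low scores."""
    matches = []
    for doc_id, score in results:
        if score >= min_score:
            matches.append((documents[doc_id], score))
    return matches

# Hypothetical stand-ins for `data` and for the output of embeddings.search
sample_docs = ["first description", "second description", "third description"]
sample_results = [(2, 0.44), (0, 0.41), (1, 0.12)]

for text, score in matching_documents(sample_results, sample_docs, min_score=0.4):
    print(f"Text: {text}")
    print(f"Similarity: {score}")
    print()
```

In the notebook, the same helper could be called with `res` and `data` in place of the sample lists.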

Next, we can save the embeddings index so it can be reused later without re-indexing the documents.

In [11]:
embeddings.save("/content/drive/MyDrive/Colab Notebooks/Models/embeddings_model1") 

In the code above, I've saved the index to the 'Models' folder in my Drive. The code below can then be used in a separate notebook to load the saved index. Note that when connecting to a new runtime in Colab, you'll need to re-install txtai, reconnect to Drive, and re-create the `embeddings` object before calling `load`.

In [12]:
%%capture
embeddings.load("/content/drive/MyDrive/Colab Notebooks/Models/embeddings_model1") 
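
One thing to keep in mind: the search results shown earlier are `(id, score)` pairs, where the id is the document's position in `data`. A reloaded index is therefore only useful alongside the same 2,500 documents in the same order. As a minimal sketch (the helper names, the file name, and the temporary folder are all hypothetical), the exact subset could be saved next to the index like this:

```python
import json
import os
import tempfile

def save_documents(documents, path):
    """Write the document list to a JSON file, preserving order."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"descriptions": documents}, f)

def load_documents(path):
    """Read the document list back in its original order."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["descriptions"]

# Demonstration in a temporary folder (a hypothetical stand-in for Drive)
docs = ["first description", "second description"]
with tempfile.TemporaryDirectory() as folder:
    docs_path = os.path.join(folder, "documents.json")
    save_documents(docs, docs_path)
    restored = load_documents(docs_path)

print(restored == docs)
```

Keeping the documents and the index side by side means the ids returned by a search always point at the right text.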


In [13]:
# import data (here, it's the same data set) 
import json
with open("/content/drive/MyDrive/Colab Notebooks/Data/vol7.json", "r") as f: 
  data = json.load(f)['descriptions'][:2500]

# test out the search tool on the data
res = embeddings.search("Cape Town", 10)
for r in res: 
  print(f"Text: {data[r[0]]}")
  print(f"Similarity: {r[1]}")
  print() 

Text: Was shot and injured by members of the SAP in Langa, Cape Town, in September 1976, after the Soweto uprising had spread to the Cape.
Similarity: 0.3715386986732483

Text: Was shot by police as he walked down a street in Bonteheuwel, Cape Town, on 17 June 1980, during unrest and protest in the area commemorating the Soweto uprising. As a result of the shooting, he is partially paralysed.
Similarity: 0.36472249031066895

Text: Was shot with rubber bullets by police in Eersterivier, Cape Town, on 6 July 1993, while participating in a community sit-in demanding improved water services.
Similarity: 0.36368122696876526

Text: Was shot and injured by SAP members near Crossroads, Cape Town, in February 1987.
Similarity: 0.35981452465057373

Text: Was shot and injured by members of the SAP in Crossroads, Cape Town, on 20 May 1993, during attempts by a local leader to remove residents from Section 2 by force.
Similarity: 0.3529833257198334

Text: A UDF supporter from Atlantis, Cape who was