# Part 3: Build an Embeddings index from a data source

In Part 1, we gave a general overview of txtai, the backing technology and examples of how to use it for similarity searches. Part 2 covered how to use txtai for extractive question-answer systems.

The previous examples worked on data stored in memory for demo purposes. For real world large-scale use cases, data is usually stored in a database (Elasticsearch, SQL, MongoDB, files, etc). This example covers reading data from SQLite, building a Embedding index backed by word embeddings and running queries against the generated Embeddings index.

This example covers functionality found in the [paperai](https://github.com/neuml/paperai) library. See that library for a full solution that can be used with the dataset discussed below.

# Install dependencies

Install txtai and all dependencies

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai

# Download data

This example is going to work off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.

The following download is SQLite database with a subject of CORD-19, generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format, can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook.

In [2]:
!wget https://www.kaggleusercontent.com/kf/40510829/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..i594VxIvDFKijTSDVIfvfw.4rZ0qT2s7KPZGWRQUt5NVETt_SiaIlq0VPV4O9qqmYD3MRAJ2D0nZDWSdSNwlW9aOZPJNsJNOSMnWdhGGmI9tSAYtRdaBoach-i0zFhVCUNYp1Y04dqB_YLtcx1whx6s0_jxl0TnIlenaJqpZvaSizpuOZjRrmiO4nb4hctJTftbV0AJJNKWeKYMex-1dxK0FFaT_d4lbh3p_ArVZguQIbOKbBRrJcrA589PylYl_2oqlowgb2OazsTZpe4JQFXekWmr5IP4Yem5llN-j3CTFp9M4AAKtbIS908FoRia9bLc6JyP8mEaVt4PNF_ayHNnmarovnK8DOTueK1Ld8OCyFlbGxerPClihQ0HuosiQH5GeX8MAEsZV8Ot8dvU8fR_Pp2xJ0Td_OB8FcL5jMO1yVOB6_P1GtPq2OSriaSN731QlvK5WbXfkYhUSREWOyTmXC0G1dlsdJzgXaw7U5VeLRH5yfA4n9HZWq5s_hisJdAoxiLtRjSmKg8Dw3t5uNlig1QPLq_VcyLtCiO4sZh5xe4qRCL-tQ1PfTVOe7z8QMM06UsRaX0686PgSOFTarKYuB6t44sjc7YcddiCNK33hPWbDR2vAtcjHoxmj-xeM-zgV1S89OVD971eUpsLz5jF.ihLqVQjVU5xZtKKDCu8MwA/articles.sqlite

--2020-08-24 12:12:07--  https://www.kaggleusercontent.com/kf/40510829/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..i594VxIvDFKijTSDVIfvfw.4rZ0qT2s7KPZGWRQUt5NVETt_SiaIlq0VPV4O9qqmYD3MRAJ2D0nZDWSdSNwlW9aOZPJNsJNOSMnWdhGGmI9tSAYtRdaBoach-i0zFhVCUNYp1Y04dqB_YLtcx1whx6s0_jxl0TnIlenaJqpZvaSizpuOZjRrmiO4nb4hctJTftbV0AJJNKWeKYMex-1dxK0FFaT_d4lbh3p_ArVZguQIbOKbBRrJcrA589PylYl_2oqlowgb2OazsTZpe4JQFXekWmr5IP4Yem5llN-j3CTFp9M4AAKtbIS908FoRia9bLc6JyP8mEaVt4PNF_ayHNnmarovnK8DOTueK1Ld8OCyFlbGxerPClihQ0HuosiQH5GeX8MAEsZV8Ot8dvU8fR_Pp2xJ0Td_OB8FcL5jMO1yVOB6_P1GtPq2OSriaSN731QlvK5WbXfkYhUSREWOyTmXC0G1dlsdJzgXaw7U5VeLRH5yfA4n9HZWq5s_hisJdAoxiLtRjSmKg8Dw3t5uNlig1QPLq_VcyLtCiO4sZh5xe4qRCL-tQ1PfTVOe7z8QMM06UsRaX0686PgSOFTarKYuB6t44sjc7YcddiCNK33hPWbDR2vAtcjHoxmj-xeM-zgV1S89OVD971eUpsLz5jF.ihLqVQjVU5xZtKKDCu8MwA/articles.sqlite
Resolving www.kaggleusercontent.com (www.kaggleusercontent.com)... 35.190.26.106
Connecting to www.kaggleusercontent.com (www.kaggleusercontent.com)|35.190.26.106|:443... connec

# Build Word Vectors

This example will build a search system backed by word embeddings. While note quite as powerful as transformer embeddings, they often provide a good tradeoff of performance to functionality for an embedding based search system.

For this notebook, we'll build our own custom embeddings for demo purposes. A number of pre-trained word embedding models are available:

 - [General language models from pymagnitude](https://github.com/plasticityai/magnitude)
 - [CORD-19 fastText](https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors)

In [3]:
import os
import sqlite3
import tempfile

from txtai.tokenizer import Tokenizer
from txtai.vectors import WordVectors

print("Streaming tokens to temporary file")

# Stream tokens to temp working file
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as output:
  # Save file path
  tokens = output.name

  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()
  cur.execute("SELECT Text from sections")

  for row in cur:
    output.write(" ".join(row[0]) + "\n")

  # Free database resources
  db.close()

# Build word vectors model - 300 dimensions, 3 min occurrences
WordVectors.build(tokens, 300, 3, "cord19-300d")

# Remove temporary tokens file
os.remove(tokens)

# Show files
!ls -l

Streaming tokens to temporary file
Building 300 dimension model
Converting vectors to magnitude format
total 9024
-rw-r--r-- 1 root root 8065024 Aug 24 12:12 articles.sqlite
-rw-r--r-- 1 root root  360448 Aug 24 12:13 cord19-300d.magnitude
-rw-r--r-- 1 root root  807886 Aug 24 12:13 cord19-300d.txt
drwxr-xr-x 1 root root    4096 Jul 30 16:30 sample_data


# Build an embeddings index

The following steps builds an embeddings index using the word vector model just created. This model builds a BM25 + fastText index. BM25 is used to build a weighted average of the word embeddings for a section. More information on this method can be found in this [Medium article](https://towardsdatascience.com/building-a-sentence-embedding-index-with-fasttext-and-bm25-f07e7148d240?gi=79da927aa10). 

In [4]:
import sqlite3

import regex as re

from txtai.embeddings import Embeddings
from txtai.tokenizer import Tokenizer

def stream():
  # Connection to database file
  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()

  # Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.
  cur.execute("SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND tags is not null")

  count = 0
  for row in cur:
    # Unpack row
    uid, name, text = row

    # Only process certain document sections
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      # Tokenize text
      tokens = Tokenizer.tokenize(text)

      document = (uid, tokens, None)

      count += 1
      if count % 1000 == 0:
        print("Streamed %d documents" % (count), end="\r")

      # Skip documents with no tokens parsed
      if tokens:
        yield document

  print("Iterated over %d total rows" % (count))

  # Free database resources
  db.close()

# BM25 + fastText vectors
embeddings = Embeddings({"path": "cord19-300d.magnitude",
                         "scoring": "bm25",
                         "pca": 3})

# Build scoring index if scoring method provided
if embeddings.config.get("scoring"):
  embeddings.score(stream())

# Build embeddings index
embeddings.index(stream())


Iterated over 21499 total rows
Iterated over 21499 total rows


# Query data

The following runs a query against the embeddings index for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.

In [5]:
import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  results.append(cur.fetchone() + (text,))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))

Title,Published,Reference,Match
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection,2020-04-24 00:00:00,http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1,"This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors."
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants,2020-07-23 00:00:00,https://www.ncbi.nlm.nih.gov/pubmed/32705587/,"The identification of risk factors for contracting COVID-19 is crucial, to inform public health policy and to facilitate the appropriate distribution of healthcare resources."
Quantitative evaluation of olfactory dysfunction in hospitalized patients with Coronavirus [2] (COVID-19),2020-05-25 00:00:00,https://www.ncbi.nlm.nih.gov/pubmed/32451613/,"In addition, these reports included patients with minor COVID-19 symptoms and low-risk factor burden."
COVID-19 from the perspective of urban and rural general adult mental health services,2020-05-21 00:00:00,https://doi.org/10.1017/ipm.2020.62,At-risk groups among staff members and service users were identified early and prioritised in service changes.
Management of osteoarthritis during COVID‐19 pandemic,2020-05-21 00:00:00,https://doi.org/10.1002/cpt.1910,"Consistently, a recent report indicated diabetes as a risk factor significantly associated with COVID-19 unfavourable clinical outcomes (37) ."


# Extracting additional columns from query results

The example above uses the Embeddings index to find the top 5 best matches. In addition to this, an Extractor instance is used to ask additional questions over the search results, creating a richer query response.

In [6]:
%%capture
from txtai.extractor import Extractor

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

In [7]:
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  # Get list of document text sections to use for the context
  cur.execute("SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND article = ?", [uid])
  sections = []
  for sid, name, txt in cur.fetchall():
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      sections.append((sid, txt))

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  article = cur.fetchone()

  # Use QA extractor to derive additional columns
  answers = extractor(sections, [("Risk Factors", "risk factors", "What risk factors?", False),
                                 ("Locations", "hospital country", "What locations?", False)])

  results.append(article + (text,) + tuple([answer[1] for answer in answers]))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])
display(HTML(df.to_html(index=False)))

Title,Published,Reference,Match,Risk Factors,Locations
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection,2020-04-24 00:00:00,http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1,"This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.",neither CVD nor risk factors,New York City hospitals
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants,2020-07-23 00:00:00,https://www.ncbi.nlm.nih.gov/pubmed/32705587/,"The identification of risk factors for contracting COVID-19 is crucial, to inform public health policy and to facilitate the appropriate distribution of healthcare resources.",Frailty and multimorbidity,hospital settings
Quantitative evaluation of olfactory dysfunction in hospitalized patients with Coronavirus [2] (COVID-19),2020-05-25 00:00:00,https://www.ncbi.nlm.nih.gov/pubmed/32451613/,"In addition, these reports included patients with minor COVID-19 symptoms and low-risk factor burden.",patients with minor COVID-19 symptoms and low-risk factor burden,COVID-19 wards
COVID-19 from the perspective of urban and rural general adult mental health services,2020-05-21 00:00:00,https://doi.org/10.1017/ipm.2020.62,At-risk groups among staff members and service users were identified early and prioritised in service changes.,At-risk groups among staff members and service users,rural regions
Management of osteoarthritis during COVID‐19 pandemic,2020-05-21 00:00:00,https://doi.org/10.1002/cpt.1910,"Consistently, a recent report indicated diabetes as a risk factor significantly associated with COVID-19 unfavourable clinical outcomes (37) .","sex, obesity, genetic factors and mechanical factors",


In the example above, the Embeddings index is used to find the top N results for a given query. On top of that, a question-answer extractor is used to derive additional columns based on a list of questions. In this case, the "Risk Factors" and "Location" columns were pulled from the document text.

# Next
In part 4 of this series, we'll use combine the power of Elasticsearch with Extractive QA to build a large-scale, advanced search system.
