# Build an Embeddings index from a data source

Part 1 gave a general overview of txtai, its backing technology and examples of how to use it for similarity search. Part 2 covered building an embeddings index with a larger dataset.

For real-world, large-scale use cases, data is often stored in a database or other data store (Elasticsearch, SQL, MongoDB, files, etc.). Here we'll show how to read from SQLite, build an Embeddings index and run queries against the generated index.

This example covers functionality found in the [paperai](https://github.com/neuml/paperai) library. See that library for a full solution that can be used with the dataset discussed below.
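The read step boils down to a generator that pulls rows from a database and yields them one at a time, so the full dataset never has to sit in memory. Below is a minimal sketch of that pattern, assuming a hypothetical in-memory `sections` table and a hypothetical helper named `stream_rows`; neither is part of txtai. It yields tuples in the same `(id, text, None)` shape that the `stream()` function later in this notebook produces.

```python
import sqlite3

def stream_rows(db, query):
    """Yield (id, text, None) tuples for each row returned by query."""
    cur = db.cursor()
    try:
        for uid, text in cur.execute(query):
            yield (uid, text, None)
    finally:
        # Runs even if the consumer stops iterating early
        cur.close()

# Hypothetical in-memory table standing in for a real data source
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, text TEXT)")
db.executemany(
    "INSERT INTO sections (id, text) VALUES (?, ?)",
    [(1, "Fever is a common symptom"), (2, "Masks reduce transmission")],
)

documents = list(stream_rows(db, "SELECT id, text FROM sections ORDER BY id"))
db.close()
print(documents)
```

The `try`/`finally` is a design choice rather than a requirement: it releases the cursor even when the consumer abandons the generator partway through.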

# Install dependencies

Install `txtai` and all dependencies.

In [None]:
%%capture
!pip install git+https://github.com/neuml/txtai

# Download data

This example works off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.

The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook.

In [None]:
%%capture
!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
!gunzip tests.gz
!mv tests articles.sqlite

# Build an embeddings index

The following steps build an embeddings index using a vector model designed for medical papers, [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings).

In [2]:
import sqlite3

import regex as re

from txtai import Embeddings

def stream():
  # Connection to database file
  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()

  # Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.
  cur.execute("SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND tags is not null")

  count = 0
  for row in cur:
    # Unpack row
    uid, name, text = row

    # Skip background, introduction and reference sections, plus discussion sections not preceded by "results"
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      document = (uid, text, None)

      count += 1
      if count % 1000 == 0:
        print("Streamed %d documents" % (count), end="\r")

      yield document

  print("Iterated over %d total rows" % (count))

  # Free database resources
  db.close()

# Create embeddings index 
embeddings = Embeddings(path="neuml/pubmedbert-base-embeddings")

# Build embeddings index
embeddings.index(stream())


Iterated over 21499 total rows
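The section filter in the code above relies on a variable-length lookbehind, which is why it imports the third-party `regex` module instead of the standard `re` module. To make the rule easier to read, here is a rough standard-library approximation. The function name `is_excluded` is hypothetical, and the sketch assumes section names are single-line strings.

```python
import re

def is_excluded(name):
    """Approximate the section filter for single-line section names."""
    if not name:
        # Sections without a name are always kept
        return False

    name = name.lower()
    if re.search(r"background|introduction|reference", name):
        return True

    # "discussion" only excludes the section when "results" does not precede it
    index = name.find("discussion")
    return index >= 0 and "results" not in name[:index]

for name in [None, "Methods", "Introduction", "Discussion", "Results and Discussion"]:
    print(name, "->", "skip" if is_excluded(name) else "keep")
```

Under this reading, a section named "Discussion" is skipped while "Results and Discussion" is kept, since the latter is likely to contain findings rather than commentary alone.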


# Query data

The following runs a query against the embeddings index for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.

In [7]:
import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  results.append(cur.fetchone() + (text,))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

# It has been reported that displaying HTML within VSCode doesn't work.
# When using VSCode, the data can be exported to an external HTML file to view.
# See example below.

# htmlData = df.to_html(index=False)
# with open("data.html", "w") as file:
#     file.write(htmlData)

display(HTML(df.to_html(index=False)))

Title,Published,Reference,Match
Management of osteoarthritis during COVID‐19 pandemic,2020-05-21 00:00:00,https://doi.org/10.1002/cpt.1910,"Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) ."
Does apolipoprotein E genotype predict COVID-19 severity?,2020-04-27 00:00:00,https://doi.org/10.1093/qjmed/hcaa142,"Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors ."
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection,2020-04-24 00:00:00,http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1,"This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors."
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants,2020-07-23 00:00:00,https://www.ncbi.nlm.nih.gov/pubmed/32705587/,BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.
"Risk Stratification for Healthcare workers during the CoViD-19 Pandemic; using demographics, co-morbid disease and clinical domain in order to assign clinical duties",2020-05-09 00:00:00,http://medrxiv.org/cgi/content/short/2020.05.05.20091967v1?rss=1,"Vascular disease, diabetes and chronic pulmonary disease further increased risk."

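The query code above runs two lookups per search hit: one to resolve the section to its article and one to fetch the article metadata. As a possible alternative, the two lookups can be folded into a single JOIN. The sketch below assumes a hypothetical, simplified schema in an in-memory database and hard-coded `hits` standing in for the `(id, score)` pairs returned by `embeddings.search`; the real `articles.sqlite` schema has more columns.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE sections (id INTEGER PRIMARY KEY, article TEXT, text TEXT);
INSERT INTO articles VALUES ('a1', 'Example article one');
INSERT INTO articles VALUES ('a2', 'Example article two');
INSERT INTO sections VALUES (10, 'a1', 'First matching sentence.');
INSERT INTO sections VALUES (20, 'a2', 'Second matching sentence.');
""")

# Stand-in for search output: (section id, score) pairs, best match first
hits = [(20, 0.82), (10, 0.75)]

rows = []
cur = db.cursor()
for uid, score in hits:
    # One JOIN replaces the two separate lookups per result
    cur.execute(
        "SELECT a.title, s.text FROM sections s "
        "JOIN articles a ON a.id = s.article WHERE s.id = ?",
        [uid],
    )
    rows.append(cur.fetchone() + (score,))

db.close()
print(rows)
```

Iterating over the hits in order preserves the ranking from the search, which a single `WHERE id IN (...)` query would not guarantee.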

# Extracting additional columns from query results

The example above uses the Embeddings index to find the top 5 matches. In addition, an Extractor instance (explained further in Part 5) can ask additional questions over the search results, creating a richer query response.

In [None]:
%%capture
from txtai.pipeline import Extractor

# Create extractor instance using qa model designed for the CORD-19 dataset
# Note: Extractive QA was a predecessor to Large Language Models (LLMs). LLMs will likely get better results.
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

In [9]:
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  # Get list of document text sections to use for the context
  cur.execute("SELECT Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND article = ? ORDER BY Id", [uid])
  texts = []
  for name, txt in cur.fetchall():
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      texts.append(txt)

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  article = cur.fetchone()

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk Factors", "risk factors", "What risk factors?", False),
                       ("Locations", "hospital country", "What locations?", False)], texts)

  results.append(article + (text,) + tuple([answer[1] for answer in answers]))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])
display(HTML(df.to_html(index=False)))

Title,Published,Reference,Match,Risk Factors,Locations
Management of osteoarthritis during COVID‐19 pandemic,2020-05-21 00:00:00,https://doi.org/10.1002/cpt.1910,"Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .","sex, obesity, genetic factors and mechanical factors",hospitals and clinics
Does apolipoprotein E genotype predict COVID-19 severity?,2020-04-27 00:00:00,https://doi.org/10.1093/qjmed/hcaa142,"Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .",,
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection,2020-04-24 00:00:00,http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1,"This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.",neither CVD nor risk factors,Mount Sinai Health System
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants,2020-07-23 00:00:00,https://www.ncbi.nlm.nih.gov/pubmed/32705587/,BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.,Frailty and multimorbidity,213 countries and territories
"Risk Stratification for Healthcare workers during the CoViD-19 Pandemic; using demographics, co-morbid disease and clinical domain in order to assign clinical duties",2020-05-09 00:00:00,http://medrxiv.org/cgi/content/short/2020.05.05.20091967v1?rss=1,"Vascular disease, diabetes and chronic pulmonary disease further increased risk.","Vascular disease, diabetes and chronic pulmonary disease",


In the example above, the Embeddings index is used to find the top N results for a given query. On top of that, a question-answer extractor is used to derive additional columns based on a list of questions. In this case, the "Risk Factors" and "Locations" columns were pulled from the document text.
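To show how the extracted answers become table columns, here is a small standalone sketch. It assumes the extractor returns one `(name, answer)` pair per question and that an unanswered question comes back as `None`, which would explain the blank cells in the table above. The `answers`, `article` and `match` values are made up for illustration.

```python
# Hypothetical extractor output: one (name, answer) pair per question,
# assumed to be None when no answer was found
answers = [("Risk Factors", "Frailty and multimorbidity"), ("Locations", None)]

# Made-up article metadata and matching sentence
article = ("Example title", "2020-07-23", "https://example.org/ref")
match = "Example matching sentence."

# Append one value per question, in question order
row = article + (match,) + tuple(answer for _, answer in answers)

# Column names come from the fixed metadata fields plus the question names
columns = ["Title", "Published", "Reference", "Match"] + [name for name, _ in answers]
record = dict(zip(columns, row))
print(record)
```

Deriving the column names from the question names keeps the table header in sync with the question list when more questions are added.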