<a href="https://colab.research.google.com/github/jonas-jun/haystack_search_engine/blob/main/Search_haystack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Faster & Accurate CORD Search Engine

Apr. 2021

**Dataset**  
COVID-19 Open Research Dataset Challenge (CORD-19)  
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House
[link](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)


**Reference**  
Medium [link](https://medium.com/analytics-vidhya/building-a-faster-and-accurate-search-engine-on-custom-dataset-with-transformers-d1277bedff3d)  
Kaggle notebook [link](https://www.kaggle.com/officialshivanandroy/building-faster-accurate-cord-search-engine)  
Haystack github [link](https://github.com/deepset-ai/haystack)  
Basic QA pipeline tutorial by Farm-Haystack [Colab link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb#scrollTo=ENjEn8L4Y8Fo)  


## Explain

**Overall Structure**  
pypi.org/project/farm-haystack  
![Key components](https://warehouse-camo.ingress.cmh1.psfhosted.org/1e0b43ce96afebe6ad8436b361fc791398a810c2/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f646565707365742d61692f686179737461636b2f6d61737465722f646f63732f5f7372632f696d672f636f6e63657074735f686179737461636b5f76322e706e67)  
github.com/deepset-ai/haystack
![Key components github](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/concepts_haystack_handdrawn.png)  
  
    
    
**Document Store**: Database storing the documents for our search.  
![DocumentStore_github](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F0f58b074-9e5a-4080-89be-b53bd2c37713%2Fdocument_store.jpg?table=block&id=981d43aa-cf84-46b9-96ef-3051862d7378&width=4520&userId=8202ecb5-2b7f-4cf4-a51f-002f5e64247c&cache=v2)
- Elasticsearch
- SQL
- In-Memory
- FAISS

**Retriever**: Fast, simple algorithm that identifies candidate passages from a large collection of documents (the Document Store). Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever narrows down the scope for the Reader.
- TF-IDF: Term Frequency - Inverse Document Frequency
- BM25
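
To make the sparse retrieval idea concrete, here is a minimal pure-Python sketch of TF-IDF ranking. The toy corpus, the whitespace tokenizer, and the function name `tfidf_scores` are illustrative assumptions, and the plain `log(N / df)` weighting is only one of several common variants. It is not Haystack's implementation.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with a plain TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # length-normalized term frequency times inverse document frequency
                score += (tf[term] / len(tokens)) * math.log(n_docs / df[term])
        scores.append(score)
    return scores

# toy corpus written for illustration only
docs = [
    "fever and cough are common symptoms",
    "vaccine design uses epitope prediction",
    "cough medicine and fever medicine",
]
scores = tfidf_scores("fever cough", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # document indices, best match first
```

The document with no query terms scores zero, and the shorter of the two matching documents ranks first because its term frequencies are divided by a smaller length.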

**Reader**: Powerful neural model that reads through texts in detail to find an answer, like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like tasks. It takes multiple texts as input and returns top-n answers with corresponding confidence scores. You can just load a pretrained model from Hugging Face's model hub or fine-tune it to your own domain data.

**Pipeline**: Glues the Retriever and the Reader together (formerly the "Finder").

## BM25 Algorithm

- "Okapi BM25", Best Matching  
- A ranking function used by search engines to estimate the relevance of documents to a given search query. [Wiki link](https://en.wikipedia.org/wiki/Okapi_BM25)  
- Several modified ranking functions (BM11: b=1, BM15: b=0, BM25+: additional free parameter)

![BM_1](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F78c6fe9f-aaff-404c-9be4-6090f4b282f2%2Fbm25_algorithm.jpg?table=block&id=dec383f4-4030-43d1-9bb1-d66b4437a6b8&width=1840&userId=8202ecb5-2b7f-4cf4-a51f-002f5e64247c&cache=v2)  
![BM_2](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F616ebde9-95a9-4237-ac0e-bea2a3da2ea2%2FBM25_algorithm2.jpg?table=block&id=00d469d8-e92c-45f4-a928-a7146af64384&width=2420&userId=&cache=v2)  
Components
![BM_3](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F33ec9b8a-5911-45cd-aab9-ca8af8e5e100%2F_2021-04-02__1.16.49.png?table=block&id=e4ca495e-7af5-493f-a22b-365f6357fb04&width=2270&userId=&cache=v2)
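
The scoring function can be sketched in a few lines of pure Python. This follows the formula on the Wikipedia page linked above, with the `ln(... + 1)` form of IDF. The values `k1=1.2` and `b=0.75` are commonly used defaults and are assumptions here, as are the whitespace tokenizer, the toy documents, and the name `bm25_scores`. It is a sketch, not what Elasticsearch runs internally.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs  # average document length
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]  # term frequency in this document (0 if absent)
            if f == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # b controls how strongly document length is penalized
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# toy documents written for illustration only
docs = [
    "covid symptom fever",
    "covid symptom fever and many other unrelated words here",
    "vaccine trial",
]
print(bm25_scores("fever", docs))         # default length normalization
print(bm25_scores("fever", docs, b=0.0))  # b=0 (BM15): length is ignored
```

With `b=0.75` the short document outranks the long one even though both contain the query term once. With `b=0` the two receive the same score, which matches the BM15 variant listed above.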

## Prepare Haystack

**What to build with Haystack**  
- Ask questions in natural language and find granular answers in your documents.
- Perform semantic search and retrieve documents according to meaning, not keywords
- Use off-the-shelf models or fine-tune them to your domain.
- Use user feedback to evaluate, benchmark, and continuously improve your live models.
- Leverage existing knowledge bases and better handle the long tail of queries that chatbots receive.
- Automate processes by automatically applying a list of questions to new documents and using the extracted answers.

**For installation**  
- From GitHub: `!pip install git+https://github.com/deepset-ai/haystack.git`

In [None]:
!pip install git+https://github.com/deepset-ai/haystack.git

In [5]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text #haystack.indexing -> haystack.preprocessor
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

04/02/2021 04:56:10 - INFO - faiss.loader -   Loading faiss with AVX2 support.
04/02/2021 04:56:10 - INFO - faiss.loader -   Loading faiss.
04/02/2021 04:56:11 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


## Dataset

Convert the CORD-19 JSON files to a DataFrame.

In [6]:
# for colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
GDRIVE_HOME = '/content/drive/MyDrive'
FOLDER = 'GSDS/2021_1/Search_engine_Haystack/data_cord'

Mounted at /content/drive


In [None]:
# # read 50,000 docs to dataframe [id, title, abstract, full text]
# import numpy as np
# import pandas as pd
# import os
# import json
# import re
# from tqdm import tqdm

# dirs = ['pmc_json', 'pdf_json']
# docs = list()
# counts = list()

# for d in dirs:
#     print(d)
#     counts = 0
#     target_dir = os.path.join(GDRIVE_HOME, FOLDER, d)
#     for f in tqdm(os.listdir(target_dir)):
#         file_path = os.path.join(target_dir, f)
#         j = json.load(open(file_path, 'rb'))
#         paper_id = j['paper_id']
#         paper_id = paper_id[-7:] # take last 7 characters for id
#         title = j['metadata']['title']

#         try: # no abstracts in some docs
#             abstract = j['abstract'][0]['text']
#         except (KeyError, IndexError):
#             abstract = ''

#         full_text = str()
#         bib_entries = list()
#         for text in j['body_text']:
#             full_text += text['text']

#         docs.append([paper_id, title, abstract, full_text])
#         counts += 1
#         if counts >= 25000:
#             break # only for 25000 files

# df = pd.DataFrame(docs, columns=['paper_id', 'title', 'abstract', 'full_text'])
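
The per-file extraction in the commented cell above can also be written as a small function, which makes it easy to test on a single record. The keys (`paper_id`, `metadata`, `title`, `abstract`, `body_text`, `text`) are the ones used in that cell. The function name `parse_cord_record` and the sample record are hypothetical.

```python
def parse_cord_record(record):
    """Flatten one CORD-19 JSON record into [paper_id, title, abstract, full_text]."""
    paper_id = record['paper_id'][-7:]  # last 7 characters, as in the cell above
    title = record['metadata']['title']
    # some records have no abstract (missing key or empty list)
    abstract_parts = record.get('abstract') or []
    abstract = abstract_parts[0]['text'] if abstract_parts else ''
    # concatenate all body paragraphs without a separator, as in the cell above
    full_text = ''.join(part['text'] for part in record['body_text'])
    return [paper_id, title, abstract, full_text]

# hand-written sample record for illustration only
sample = {
    'paper_id': 'abcdef0123456789',
    'metadata': {'title': 'A toy paper'},
    'body_text': [{'text': 'First paragraph. '}, {'text': 'Second paragraph.'}],
}
row = parse_cord_record(sample)
print(row)
```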

In [7]:
import os
import pandas as pd
df = pd.read_csv(os.path.join(GDRIVE_HOME, FOLDER, 'processed.csv'))
print('Shape of Dataframe: {}\n'.format(df.shape))
print('Information')
print(df.info())
print('\nSamples')
df.sample(5)

Shape of Dataframe: (50000, 4)

Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   50000 non-null  object
 1   title      47432 non-null  object
 2   abstract   17286 non-null  object
 3   full_text  50000 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB
None

Samples


Unnamed: 0,paper_id,title,abstract,full_text
35618,2e559b1,Epitope-Based Immunome-Derived Vaccines: A Strategy for Improved Design and ...,Vaccine science has extended beyond genomics to proteomics and has come to a...,The availability of immunome-mining tools has fueled the design and developm...
1919,7202459,Changing trends of ocular trauma in the time of COVID-19 pandemic,,"To reduce the spread of the novel coronavirus (2019-nCoV), countries have pr..."
14444,7478888,200 Years of Florence and the challenges of nursing practices\nmanagement in...,,"The development of nursing faces different challenges concerning autonomy,\n..."
48329,f1261cd,Application of Open-Source Software in Knowledge Graph Construction,"Knowledge graph (KG), as a new type of knowledge representation, has gained ...",specific definition and representation of the knowledge based on graph could...
34335,082af2f,AIOSP -Association Internationale d'Orientation Scolaire et Professionnelle ...,,The aims of educational and vocational guidance are to assist students and a...


In [8]:
# remove null samples in those columns
df = df.dropna(subset=['paper_id', 'title', 'full_text']) # add 'paper_id'
df.info()

04/02/2021 04:59:08 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 47432 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   47432 non-null  object
 1   title      47432 non-null  object
 2   abstract   16608 non-null  object
 3   full_text  47432 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


## Set up DocumentStore

*Haystack* finds answers to queries within the documents stored in a *DocumentStore*.  

The current implementations of DocumentStore include *ElasticsearchDocumentStore*, *SQLDocumentStore*, *FAISSDocumentStore*, and *InMemoryDocumentStore*.  

The Haystack authors recommend *ElasticsearchDocumentStore* because it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.

In [None]:
# Recommended: Start Elasticsearch using Docker (basic, but manually download in colab)
#! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2

In [9]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2
# version change to 7.9.2 from 7.6
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
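
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. A possible alternative is to poll until the server answers. The helper names `wait_until` and `es_is_up` below are hypothetical, and the URL assumes Elasticsearch listens on `localhost:9200` as in the Docker command above.

```python
import time
import urllib.error
import urllib.request

def wait_until(probe, timeout=60.0, interval=1.0):
    """Call probe() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def es_is_up(url='http://localhost:9200'):
    """Return True if an HTTP GET on the Elasticsearch root URL succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Usage in the notebook, in place of the fixed sleep:
# if not wait_until(es_is_up, timeout=60):
#     raise RuntimeError('Elasticsearch did not start in time')
```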

In [10]:
# Connect to Elasticsearch
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore # file name changed from database to document_store
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

04/02/2021 04:59:59 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.082s]
04/02/2021 04:59:59 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.348s]
04/02/2021 05:00:00 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.179s]


Convert the dataset into a list of dicts. Two structures are possible:

1. [{'name': str, 'text': str}, {'name': str, 'text': str}, ...]
2. [{'text': str, 'meta': {'name': str, ...}}, ...] (recommended)

In [11]:
# Structure 1
modified = df[['title', 'full_text']].rename(columns={'title': 'name', 'full_text': 'text'})
dicts_1 = modified.to_dict(orient='records')

In [None]:
dicts_1[:3]

In [13]:
# Structure 2
# build one dict per row; zip over the columns avoids slow row-by-row iloc lookups
dicts_2 = [
    {'text': full_text, 'meta': {'name': title, 'p_id': paper_id}}
    for title, paper_id, full_text in zip(df['title'], df['paper_id'], df['full_text'])
]

In [None]:
dicts_2[:3]

In [None]:
document_store.write_documents(dicts_2)

## Retriever
- dense: get dense embeddings for query and passage using bi-encoder [github](https://github.com/deepset-ai/haystack/blob/master/haystack/retriever/dense.py)
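- sparse: keyword-based ranking such as TF-IDF or BM25; the `ElasticsearchRetriever` used below falls in this group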

In [17]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store, top_k=10)

## Reader

`FARMReader.train()` fine-tunes the Reader on your own data [github](https://github.com/deepset-ai/haystack/tree/master/haystack/reader)  
Parameters:  
data_dir, train_filename, dev_filename, test_filename, use_gpu, batch_size, n_epochs, learning_rate, max_seq_len, warmup_proportion, dev_split, evaluate_every, save_dir, num_processes, use_amp
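
The training files are expected to be SQuAD-style JSON. That is an assumption based on the Haystack documentation, so check it against the version you install. The sketch below builds one such record and checks that the answer offset points at the answer text. The helper name `make_squad_example` and the toy text are hypothetical.

```python
import json

def make_squad_example(title, context, question, answer_text, qa_id):
    """Build a one-question SQuAD-style document; raise if the answer is not in the context."""
    answer_start = context.find(answer_text)
    if answer_start == -1:
        raise ValueError('answer_text must appear verbatim in context')
    return {
        'title': title,
        'paragraphs': [{
            'context': context,
            'qas': [{
                'id': qa_id,
                'question': question,
                'answers': [{'text': answer_text, 'answer_start': answer_start}],
                'is_impossible': False,
            }],
        }],
    }

# toy text written for illustration only, not taken from the dataset
context = 'Fever was the most common symptom, followed by cough.'
example = make_squad_example(
    title='toy-doc',
    context=context,
    question='What was the most common symptom?',
    answer_text='Fever',
    qa_id='toy-0',
)
squad = {'version': 'v2.0', 'data': [example]}
print(json.dumps(squad, indent=2))
```

Writing `squad` to a file inside `data_dir` with `json.dump` and passing its name as `train_filename` would then match the parameter list above.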

In [None]:
reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2-covid',
                    use_gpu=True,
                    context_window_size=500) # length of answer context
# model_name_or_path: directory of a saved model or the name of a public model,
# e.g. 'bert-base-cased', 'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.

## Finder

In [None]:
#finder = Finder(reader, retriever)

In [19]:
# updated version
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Search

In [20]:
# You can configure how many candidates the reader and retriever shall return
# A higher top_k_retriever usually gives better answers, but is slower.
prediction = pipe.run(query="what is the covid19 symptom?", top_k_retriever=10, top_k_reader=3)

04/02/2021 05:19:04 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.286s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.12 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.93 Batches/s]
Inferencing Samples: 100%|██████████| 3/3 [00:01<00:00,  1.66 Batches/s]
04/02/2021 05:19:08 - ERROR - farm.modeling.predictions -   Both start and end offsets should be 0: 
41658, 41658 with a no_answer. 
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.95 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.45 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.99 Batches/s]


In [21]:
print(type(prediction))
print('\n===== Minimal Answer =====')
print_answers(results=prediction, details='minimal')
print('\n===== Number of Answers =====')
print('{}, same as top_k_reader'.format(len(prediction['answers'])))
print('\n===== Answer Structure =====')
print(prediction['answers'][0].keys())
print('\n===== Meta Information =====')
print(prediction['answers'][0]['meta'])
print('\n===== Length of Context =====')
print(len(prediction['answers'][-1]['context']))

<class 'dict'>

===== Minimal Answer =====
[   {   'answer': 'the most common symptom was dizziness (16.8%) followed '
                  'closely by headache (13.1%)',
        'context': 'airment, neuropathic pain, Guillain-Barre Syndrome and '
                   'variants), and skeletal muscular injury 2 . In one '
                   'observational study from Wuhan, of the 36.4% of COVID19 '
                   'patients who showed neurologic manifestations, the most '
                   'common symptom was dizziness (16.8%) followed closely by '
                   'headache (13.1%) 2 . In another prospective analysis out '
                   'of Wuhan, headache was present in 8% of all patients, '
                   'overall the most common neurological symptom 1 . Neither '
                   'of these studies collected data on milder nervous system '},
    {   'answer': 'that fever was the most common initial symptom, followed by '
                  'a cough, fatigue and shortness o
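
The printed dict is long, so a compact view of each answer can help. The keys `answers`, `answer`, `score`, and `meta` appear in the cells and output of this notebook, and `name` and `p_id` are the meta fields written in Structure 2. The helper `summarize_answers` and the `mock_prediction` dict are hypothetical stand-ins, not real pipeline output.

```python
def summarize_answers(prediction):
    """Reduce a prediction dict to answer text, score, paper title, and paper id."""
    rows = []
    for ans in prediction['answers']:
        meta = ans.get('meta') or {}  # meta may be missing or None
        rows.append({
            'answer': ans.get('answer'),
            'score': ans.get('score'),
            'title': meta.get('name'),
            'paper_id': meta.get('p_id'),
        })
    return rows

# hand-written stand-in for pipe.run(...) output, for illustration only
mock_prediction = {
    'answers': [
        {'answer': 'fever', 'score': 1.0, 'context': '... fever ...',
         'meta': {'name': 'Example title', 'p_id': '0000000'}},
        {'answer': 'cough', 'score': 0.5, 'context': '... cough ...', 'meta': None},
    ]
}
for row in summarize_answers(mock_prediction):
    print(row)
```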

In [22]:
prediction = pipe.run(query="what is the impact of coronavirus on pregnant women?", top_k_retriever=10, top_k_reader=2)

04/02/2021 05:20:09 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.090s]
Inferencing Samples: 100%|██████████| 2/2 [00:01<00:00,  1.90 Batches/s]
Inferencing Samples: 100%|██████████| 2/2 [00:01<00:00,  1.96 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.29 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.97 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.54 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.26 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.55 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.58 Batches/s]


In [24]:
print(type(prediction))
print('\n===== Medium Answer =====')
print_answers(results=prediction, details='medium')
print('\n===== Number of Answers =====')
print('{}, same as top_k_reader'.format(len(prediction['answers'])))
print('\n===== Answer Structure =====')
print(prediction['answers'][0].keys())
print('\n===== Meta Information =====')
print(prediction['answers'][0]['meta'])
print('\n===== Length of Context =====')
print(len(prediction['answers'][-1]['context']))

<class 'dict'>

===== Minimal Answer =====
[   {   'answer': 'pregnant women are not prone to experiencing higher levels '
                  'of stress and anxiety in comparison to non-pregnant '
                  'controls',
        'context': 'TSD development in vulnerable individuals [3] .Pregnancy '
                   'is a rewarding yet challenging period of life, which '
                   'demands physical, psychological and social adjustment to a '
                   'new role. In general, pregnant women are not prone to '
                   'experiencing higher levels of stress and anxiety in '
                   'comparison to non-pregnant controls [4, 5] . Nevertheless, '
                   'women with complicated pregnancies report higher levels of '
                   'anxiety symptoms compared to low-risk pregnant subjects '
                   '[6, 7] . Literature data on the impact of COVID-19 ',
        'score': 14.60550308227539},
    {   'answer': 'pregnant women were