<a href="https://colab.research.google.com/github/jonas-jun/haystack_search_engine/blob/main/Search_haystack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Faster & Accurate CORD Search Engine

**Dataset**  
COVID-19 Open Research Dataset Challenge (CORD-19)  
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House
[link](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)


**Reference**  
Kaggle notebook [link](https://www.kaggle.com/officialshivanandroy/building-faster-accurate-cord-search-engine)  
Medium [link](https://medium.com/analytics-vidhya/building-a-faster-and-accurate-search-engine-on-custom-dataset-with-transformers-d1277bedff3d)  
haystack [link](https://github.com/deepset-ai/haystack)  
Basic QA pipeline tutorial by Farm-Haystack [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb#scrollTo=ENjEn8L4Y8Fo)  


## Prepare Haystack

**What to build with Haystack**  
- Ask questions in natural language and find granular answers in your documents.
- Perform semantic search and retrieve documents according to meaning, not keywords.
- Use off-the-shelf models or fine-tune them to your domain.
- Use user feedback to evaluate, benchmark, and continuously improve your live models.
- Leverage existing knowledge bases and better handle the long tail of queries that chatbots receive.
- Automate processes by automatically applying a list of questions to new documents and using the extracted answers.

**For installation**  
- From GitHub: `!pip install git+https://github.com/deepset-ai/haystack.git`

In [1]:
!pip install git+https://github.com/deepset-ai/haystack.git

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-wtf_uo1x
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-wtf_uo1x
Collecting farm==0.6.2
  Downloading https://files.pythonhosted.org/packages/5b/3d/91c184813b8205c697c13117154f3216f01709291155cc9ee88628cb63d2/farm-0.6.2-py3-none-any.whl (207kB)
Collecting fastapi
  Downloading https://files.pythonhosted.org/packages/9f/33/1b643f650688ad368983bbaf3b0658438038ea84d775dd37393d826c3833/fastapi-0.63.0-py3-none-any.whl (50kB)
Collecting uvicorn
  Downloading https://files.pythonhosted.org/packages/c8/de/953f0289508b1b92debdf0a6822d9b88ffb0c6ad471d709cf639a2c8a176/uvicorn-0.13.4-py3-none-any.whl (46kB)
Collectin

In [2]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text #haystack.indexing -> haystack.preprocessor
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

04/01/2021 08:30:20 - INFO - faiss.loader -   Loading faiss with AVX2 support.
04/01/2021 08:30:20 - INFO - faiss.loader -   Loading faiss.
04/01/2021 08:30:21 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


## Dataset

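From JSON to DataFrame: the commented-out cell below shows how the raw JSON files were converted, and the cell after it loads the preprocessed CSV.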

In [3]:
# for colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
GDRIVE_HOME = '/content/drive/MyDrive'
FOLDER = 'GSDS/2021_1/Search_engine_Haystack/data_cord'

Mounted at /content/drive


In [4]:
# # Read up to 25,000 docs per folder (50,000 in total) into a DataFrame
# # with columns [paper_id, title, abstract, full_text]
# import os
# import json
# import pandas as pd
# from tqdm import tqdm

# dirs = ['pmc_json', 'pdf_json']
# docs = list()

# for d in dirs:
#     print(d)
#     count = 0  # number of files read from this folder
#     target_dir = os.path.join(GDRIVE_HOME, FOLDER, d)
#     for f in tqdm(os.listdir(target_dir)):
#         file_path = os.path.join(target_dir, f)
#         with open(file_path, 'rb') as fp:
#             j = json.load(fp)
#         paper_id = j['paper_id']
#         paper_id = paper_id[-7:] # take last 7 characters for id
#         title = j['metadata']['title']

#         try: # no abstracts in some docs
#             abstract = j['abstract'][0]['text']
#         except (KeyError, IndexError):
#             abstract = ''

#         full_text = str()
#         for text in j['body_text']:
#             full_text += text['text']

#         docs.append([paper_id, title, abstract, full_text])
#         count += 1
#         if count >= 25000:
#             break # stop after 25,000 files in this folder

# df = pd.DataFrame(docs, columns=['paper_id', 'title', 'abstract', 'full_text'])
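
The per-file extraction in the commented-out cell can also be written as a small standalone function, which makes it easy to check on an in-memory example. This is only a sketch: the name `parse_cord_record` and the sample record are hypothetical, while the keys (`paper_id`, `metadata`, `abstract`, `body_text`) are the ones the cell reads.

```python
import json


def parse_cord_record(record):
    """Turn one parsed CORD-19 JSON object into [paper_id, title, abstract, full_text]."""
    paper_id = record["paper_id"][-7:]  # keep the last 7 characters, as in the cell above
    title = record["metadata"]["title"]
    # Some files have no abstract (missing key or empty list).
    abstract_parts = record.get("abstract") or []
    abstract = abstract_parts[0]["text"] if abstract_parts else ""
    # Concatenate all body paragraphs without a separator, as the cell does.
    full_text = "".join(part["text"] for part in record["body_text"])
    return [paper_id, title, abstract, full_text]


# Hypothetical record, for illustration only.
sample = json.loads("""
{"paper_id": "0123456789abcdef",
 "metadata": {"title": "A toy paper"},
 "body_text": [{"text": "First paragraph. "}, {"text": "Second paragraph."}]}
""")
row = parse_cord_record(sample)
print(row)
```

Checking for a missing abstract explicitly avoids a bare `except`, which would also hide unrelated errors.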

In [5]:
import os
import pandas as pd
df = pd.read_csv(os.path.join(GDRIVE_HOME, FOLDER, 'processed.csv'))
print(df.shape)
df.sample(5)

(50000, 4)


| | paper_id | title | abstract | full_text |
|---|---|---|---|---|
| 8902 | 7260474 | Critical adjustments in a department of orthopaedics through the COVID-19 pa... | | The pandemic caused by the previously unknown SARS-CoV-2 (2019-nCoV, COVID-1... |
| 13228 | 7892327 | Multi-omics highlights ABO plasma protein as a causal risk factor for COVID-19 | | The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsi... |
| 48610 | b8e9ddc | | | Bovine respiratory disease (BRD) is the most significant production problem ... |
| 45255 | be96982 | Development and Implementation of Influenza A Virus Subtyping and Detection ... | | Influenza virus is an RNA virus of the Orthomyxoviridae family comprised of ... |
| 31050 | 9cfc359 | The use of synthetic polymers for delivery of therapeutic antisense oligodeo... | Developed over the past two decades, the antisense strategy has become a tec... | Treatment with traditional drugs is based on molecular substitution, which i... |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47432 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   47432 non-null  object
 1   title      47432 non-null  object
 2   abstract   16608 non-null  object
 3   full_text  47432 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


In [26]:
# drop rows with a missing title or full_text
df = df.dropna(subset=['title', 'full_text'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47432 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   47432 non-null  object
 1   title      47432 non-null  object
 2   abstract   16608 non-null  object
 3   full_text  47432 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


## Set up DocumentStore

*Haystack* finds answers to queries within the documents stored in a *DocumentStore*.  

The current implementations of DocumentStore include *ElasticsearchDocumentStore*, *SQLDocumentStore*, *FAISSDocumentStore*, and *InMemoryDocumentStore*.  

The Haystack authors recommend *ElasticsearchDocumentStore* because it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.
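
To give a feel for what BM25 retrieval does, here is a toy scorer in plain Python. It is a sketch of the basic BM25 formula with whitespace tokenization, not Elasticsearch's implementation; the function name `bm25_scores` and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: long documents need more matches to score as high.
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "fever and cough are common symptoms",
    "the vaccine trial enrolled adults",
    "cough cough cough",
]
scores = bm25_scores("cough symptoms", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Rare terms get a higher weight through the `idf` factor, and repeating a term gives diminishing returns, which is why keyword stuffing alone does not win the ranking.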

In [6]:
# Recommended: start Elasticsearch using Docker (the standard way; in Colab, download and run it manually as in the next cell)
#! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2

In [11]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2
# version changed from 7.6 to 7.9.2
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
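
A fixed `sleep 30` can be too short on a slow machine and wastes time on a fast one. One alternative, sketched below under the assumption that Elasticsearch answers on `http://localhost:9200` as in the log output of the next cell, is to poll until the server responds. The helper names `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url):
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout=60.0, interval=1.0):
    """Call `probe()` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage in the notebook (assumes Elasticsearch listens on localhost:9200):
# ready = wait_until_ready(lambda: http_ok("http://localhost:9200"))
```

Passing the check as a callable keeps the waiting loop independent of HTTP, so it can be exercised without a running server.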

In [12]:
# Connect to Elasticsearch
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore # module renamed: haystack.database -> haystack.document_store
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

04/01/2021 08:35:08 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.078s]
04/01/2021 08:35:08 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.370s]
04/01/2021 08:35:09 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.217s]


In [16]:
modified = df[['title', 'full_text']].rename(columns={'title': 'name', 'full_text': 'text'})
m_dict = modified.to_dict(orient='records') # build one dict per row
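
The records above are flat (`name` and `text` side by side), while the loop in the next cell aims for a layout with the title nested under `meta`. A minimal sketch of converting one into the other in plain Python follows; `to_documents` and the sample rows are hypothetical, and whether `write_documents` treats both layouts identically depends on the Haystack version, so take that as an assumption.

```python
def to_documents(rows):
    """Convert flat {'name': ..., 'text': ...} records into {'text', 'meta'} dicts."""
    documents = []
    for row in rows:
        documents.append({"text": row["text"], "meta": {"name": row["name"]}})
    return documents


# Hypothetical rows standing in for `m_dict`.
rows = [
    {"name": "Paper A", "text": "Body of paper A."},
    {"name": "Paper B", "text": "Body of paper B."},
]
documents = to_documents(rows)
print(documents[0])
```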

In [14]:
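dicts = list()

# After dropna the index has gaps (47432 rows, labels 0 to 49999), and
# df['title'][i] looks up by label, so it raises KeyError at a dropped row.
# Iterating over the values avoids that.
for title, full_text in zip(df['title'], df['full_text']):
    temp = dict()
    temp['text'] = full_text
    temp['meta'] = {'name': title}
    dicts.append(temp)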

In [None]:
dicts[:3]

In [17]:
print(len(m_dict))
m_dict[:3]

47432


[{'name': 'The in-vitro effect of famotidine on sars-cov-2 proteases and virus replication',
  'text': 'A large part of the current therapeutic discovery effort against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV)-2 is focused on drug repurposing1. Of such agents, only remdesivir has thus far shown clinical evidence of antiviral effect2, while several others have not met their primary endpoints in various clinical studies3,4. Recently, famotidine has gained attention as a therapeutic option against SARS-CoV-2, initially based on anecdotal evidence of its positive effects in COVID-19 patients in China. Famotidine (PEPCID®), a histamine-2 receptor (H2R) antagonist, is an FDA approved drug for the treatment of gastroesophageal reflux disease (GERD) and gastric ulcers5.Earlier reports of the beneficial effect of famotidine in China were recently supported by a retrospective clinical study involving 1620 patients in the U.S., which noted that hospitalized COVID-19 patients

In [18]:
document_store.write_documents(m_dict)

04/01/2021 08:36:16 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:4.525s]
04/01/2021 08:36:18 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.265s]
04/01/2021 08:36:21 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.319s]
04/01/2021 08:36:23 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.304s]
04/01/2021 08:36:26 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.253s]
04/01/2021 08:36:28 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.202s]
04/01/2021 08:36:31 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.202s]
04/01/2021 08:36:33 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.336s]


## Retriever

In [19]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

## Reader

In [20]:
reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2-covid',
                    use_gpu=True,
                    context_window_size=500) # length of answer context

04/01/2021 08:39:57 - INFO - farm.utils -   Using device: CUDA 
04/01/2021 08:39:57 - INFO - farm.utils -   Number of GPUs: 1
04/01/2021 08:39:57 - INFO - farm.utils -   Distributed Training: False
04/01/2021 08:39:57 - INFO - farm.utils -   Automatic Mixed Precision: None
04/01/2021 08:40:26 - INFO - farm.utils -   Using device: CUDA 
04/01/2021 08:40:26 - INFO - farm.utils -   Number of GPUs: 1
04/01/2021 08:40:26 - INFO - farm.utils -   Distributed Training: False
04/01/2021 08:40:26 - INFO - farm.utils -   Automatic Mixed Precision: None
04/01/2021 08:40:26 - INFO - farm.infer -   Got ya 2 parallel workers to do inference ...
04/01/2021 08:40:26 - INFO - farm.infer -    0    0 
04/01/2021 08:40:26 - INFO - farm.infer -   /w\  /w\
04/01/2021 08:40:26 - INFO - farm.infer -   /'\  / \
04/01/2021 08:40:26 - INFO - farm.infer -     


## Finder

In [47]:
#finder = Finder(reader, retriever)

            1. The 'Finder' class will be deprecated in the next Haystack release in 
            favour of a new `Pipeline` class that supports building custom search pipelines using Haystack components
            including Retriever, Readers, and Generators.
            For more details, please refer to the issue: https://github.com/deepset-ai/haystack/issues/544
            2. The `question` parameter in search requests & results is renamed to `query`.


In [21]:
# Finder is deprecated; use the Pipeline API (ExtractiveQAPipeline) instead
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Search

In [24]:
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers. 
prediction = pipe.run(query="what is the covid19 symptom?", top_k_retriever=10, top_k_reader=3)

04/01/2021 08:41:21 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.094s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.65 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.67 Batches/s]
Inferencing Samples: 100%|██████████| 3/3 [00:01<00:00,  1.65 Batches/s]
04/01/2021 08:41:24 - ERROR - farm.modeling.predictions -   Both start and end offsets should be 0: 
41658, 41658 with a no_answer. 
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.60 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.87 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.49 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.94 Batches/s]
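
The two `top_k` parameters control a two-stage process: the retriever narrows the corpus to a few candidates, and the reader only looks inside those. The toy sketch below imitates that flow with simple word overlap. It is an illustration of the idea, not how Haystack's retriever or reader work internally, and all names and sentences are made up.

```python
def retrieve(query, docs, top_k):
    """Toy retriever: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.lower().split())), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def read(query, candidates, top_k):
    """Toy reader: score every sentence in the candidates and keep the top_k."""
    words = set(query.lower().split())
    answers = []
    for doc in candidates:
        for sentence in doc.split(". "):
            overlap = len(words & set(sentence.lower().split()))
            if overlap:
                answers.append((overlap, sentence))
    answers.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in answers[:top_k]]


docs = [
    "Fever is a common symptom. Patients recovered quickly",
    "The trial enrolled adults",
    "Cough is another symptom. Fever was rare in children",
]
candidates = retrieve("common symptom fever", docs, top_k=2)
answers = read("common symptom fever", candidates, top_k=1)
print(len(candidates), answers)
```

A larger `top_k` for the retriever gives the reader more text to examine, which is why it tends to improve answers at the cost of speed.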


In [25]:
print(type(prediction))
print('\n===== Minimal Answer =====')
print_answers(results=prediction, details='minimal')
print('\n===== Number of Answers =====')
print(len(prediction['answers']))
print('\n===== Answer Structure =====')
print(prediction['answers'][0].keys())
print('\n===== Meta Information =====')
print(prediction['answers'][0]['meta'])
print('\n===== Length of Context =====')
print(len(prediction['answers'][-1]['context']))

<class 'dict'>

===== Minimal Answer =====
[   {   'answer': 'the most common symptom was dizziness (16.8%) followed '
                  'closely by headache (13.1%)',
        'context': 'airment, neuropathic pain, Guillain-Barre Syndrome and '
                   'variants), and skeletal muscular injury 2 . In one '
                   'observational study from Wuhan, of the 36.4% of COVID19 '
                   'patients who showed neurologic manifestations, the most '
                   'common symptom was dizziness (16.8%) followed closely by '
                   'headache (13.1%) 2 . In another prospective analysis out '
                   'of Wuhan, headache was present in 8% of all patients, '
                   'overall the most common neurological symptom 1 . Neither '
                   'of these studies collected data on milder nervous system '},
    {   'answer': 'that fever was the most common initial symptom, followed by '
                  'a cough, fatigue and shortness o

1000
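
For a more compact view than `print_answers`, the prediction dictionary can be flattened by hand. The sketch below assumes only the keys visible above (`answers`, and per answer `answer`, `context`, `meta`) plus the `name` field written into the document store; `summarize_answers` and the fake prediction are hypothetical.

```python
def summarize_answers(prediction, max_context=80):
    """Return (answer, title, context snippet) tuples from a prediction dict."""
    rows = []
    for item in prediction["answers"]:
        title = (item.get("meta") or {}).get("name", "")
        context = item.get("context") or ""
        rows.append((item["answer"], title, context[:max_context]))
    return rows


# Hypothetical prediction, shaped like the output above.
fake_prediction = {
    "answers": [
        {"answer": "fever", "context": "x" * 200, "meta": {"name": "Paper A"}},
        {"answer": "cough", "context": "short context", "meta": {}},
    ]
}
summary = summarize_answers(fake_prediction)
for row in summary:
    print(row)
```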