<a href="https://colab.research.google.com/github/jonas-jun/haystack_search_engine/blob/main/Search_haystack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Faster & Accurate CORD Search Engine

**Dataset**  
COVID-19 Open Research Dataset Challenge (CORD-19)  
An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House
[link](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)


**Reference**  
Medium [link](https://medium.com/analytics-vidhya/building-a-faster-and-accurate-search-engine-on-custom-dataset-with-transformers-d1277bedff3d)  
Kaggle notebook [link](https://www.kaggle.com/officialshivanandroy/building-faster-accurate-cord-search-engine)  
Haystack [link](https://github.com/deepset-ai/haystack)  
Basic QA pipeline tutorial by Farm-Haystack [Colab link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb#scrollTo=ENjEn8L4Y8Fo)  


## Prepare Haystack

**What to build with Haystack**  
- Ask questions in natural language and find granular answers in your documents.
- Perform semantic search and retrieve documents according to meaning, not keywords.
- Use off-the-shelf models or fine-tune them to your domain.
- Use user feedback to evaluate, benchmark, and continuously improve your live models.
- Leverage existing knowledge bases and better handle the long tail of queries that chatbots receive.
- Automate processes by automatically applying a list of questions to new documents and using the extracted answers.

**For installation**  
- Install from GitHub: `!pip install git+https://github.com/deepset-ai/haystack.git`

In [None]:
!pip install git+https://github.com/deepset-ai/haystack.git

In [2]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text #haystack.indexing -> haystack.preprocessor
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

04/01/2021 11:35:06 - INFO - faiss.loader -   Loading faiss with AVX2 support.
04/01/2021 11:35:06 - INFO - faiss.loader -   Loading faiss.
04/01/2021 11:35:08 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


## Dataset

Convert the CORD-19 JSON files to a DataFrame.

In [3]:
# for colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
GDRIVE_HOME = '/content/drive/MyDrive'
FOLDER = 'GSDS/2021_1/Search_engine_Haystack/data_cord'

Mounted at /content/drive


In [4]:
# # read 50,000 docs (up to 25,000 per directory) to dataframe [id, title, abstract, full text]
# import json
# import os
# import pandas as pd
# from tqdm import tqdm

# dirs = ['pmc_json', 'pdf_json']
# docs = list()

# for d in dirs:
#     print(d)
#     counts = 0 # files read from the current directory
#     target_dir = os.path.join(GDRIVE_HOME, FOLDER, d)
#     for f in tqdm(os.listdir(target_dir)):
#         file_path = os.path.join(target_dir, f)
#         with open(file_path, 'rb') as fp:
#             j = json.load(fp)
#         paper_id = j['paper_id']
#         paper_id = paper_id[-7:] # take last 7 characters for id
#         title = j['metadata']['title']

#         try: # no abstracts in some docs
#             abstract = j['abstract'][0]['text']
#         except (KeyError, IndexError):
#             abstract = ''

#         full_text = str()
#         for text in j['body_text']:
#             full_text += text['text']

#         docs.append([paper_id, title, abstract, full_text])
#         counts += 1
#         if counts >= 25000:
#             break # only for 25000 files

# df = pd.DataFrame(docs, columns=['paper_id', 'title', 'abstract', 'full_text'])
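
The per-file parsing step can be tried on a single record without the full dataset. The sketch below is a minimal illustration, not the notebook's original code: `parse_cord_record` is a hypothetical helper name, and the sample record is made up. It only assumes the fields used in the cell above (`paper_id`, `metadata.title`, `abstract`, `body_text`).

```python
import json


def parse_cord_record(record):
    """Return [paper_id, title, abstract, full_text] for one parsed JSON record."""
    paper_id = record['paper_id'][-7:]  # last 7 characters, as in the cell above
    title = record['metadata']['title']

    # Some documents have no abstract (missing key or empty list).
    abstract_blocks = record.get('abstract') or []
    abstract = abstract_blocks[0]['text'] if abstract_blocks else ''

    # Concatenate all body paragraphs into one string.
    full_text = ''.join(block['text'] for block in record['body_text'])
    return [paper_id, title, abstract, full_text]


# Hypothetical record, for illustration only (not taken from CORD-19).
sample = json.loads('''
{"paper_id": "0123456789abcdef",
 "metadata": {"title": "Example title"},
 "body_text": [{"text": "First paragraph. "}, {"text": "Second paragraph."}]}
''')
row = parse_cord_record(sample)
print(row)
```

Checking for a missing or empty `abstract` explicitly avoids a bare `except`, which would also hide unrelated errors such as a malformed file.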

In [5]:
import os
import pandas as pd
df = pd.read_csv(os.path.join(GDRIVE_HOME, FOLDER, 'processed.csv'))
print('Shape of Dataframe: {}\n'.format(df.shape))
print('Information')
print(df.info())
print('\nSamples')
df.sample(5)

Shape of Dataframe: (50000, 4)

Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   50000 non-null  object
 1   title      47432 non-null  object
 2   abstract   17286 non-null  object
 3   full_text  50000 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB
None

Samples


| | paper_id | title | abstract | full_text |
|---|---|---|---|---|
| 28058 | f010403 | 6 Population Health Unit | Policy makers in Africa need robust estimates of the current and future spre... | The potential risk from SARS-CoV-2 to Africa was identified early in the glo... |
| 6839 | 7833263 | Fear and agony of the pandemic leading to stress and mental illness: An emer... | | With the pandemic of the novel coronavirus (COVID-19) sweeping over the glob... |
| 30708 | a36c220 | Mitteilungen der Deutschen Gesellschaft für Neurologie Rückblick und Ausblic... | Die DGN schaut auf ein erfolgreiches Jahr 2019 zurück. Die Mitgliederzahl is... | Wir sind sehr zufrieden. 2019 war ein spannendes Jahr, wir konnten viel auf ... |
| 30289 | daaa517 | Identifying Facemask-Wearing Condition Using Image Super-Resolution with Cla... | The rapid worldwide spread of Coronavirus Disease 2019 has resulted in a glo... | Coronavirus disease 2019 (COVID- 19) is an emerging respiratory infectious d... |
| 18522 | 7726608 | Case management for frequent emergency department users: no longer a questio... | | Ten years ago, an editorial entitled “Frequent Users of Emergency Department... |


In [6]:
# remove null samples in those columns
df = df.dropna(subset=['paper_id', 'title', 'full_text']) # add 'paper_id'
df.info()

04/01/2021 11:36:01 - INFO - numexpr.utils -   NumExpr defaulting to 2 threads.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 47432 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   paper_id   47432 non-null  object
 1   title      47432 non-null  object
 2   abstract   16608 non-null  object
 3   full_text  47432 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


## Set up DocumentStore (survey by type)

*Haystack* finds answers to queries within the documents stored in a *DocumentStore*.  

The current implementations of DocumentStore include *ElasticsearchDocumentStore*, *SQLDocumentStore*, *FAISSDocumentStore*, and *InMemoryDocumentStore*.  

*ElasticsearchDocumentStore* is the recommended choice because it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.

In [7]:
# Recommended: Start Elasticsearch using Docker (basic, but manually download in colab)
#! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2

In [8]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2
# version change to 7.9.2 from 7.6
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
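
A fixed `sleep 30` can be too short on a slow machine and wastefully long on a fast one. As a hedged alternative, the sketch below polls the HTTP endpoint until it answers. `wait_for_http` is a hypothetical helper written for this notebook, not part of Haystack or Elasticsearch; the port 9200 is the one used in the Docker command above.

```python
import time
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until it returns a successful response or `timeout` seconds pass.

    Returns True once the server answers, False if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except OSError:
            # Connection refused, timeouts, and HTTP error statuses all land
            # here (URLError and HTTPError are OSError subclasses); retry.
            time.sleep(interval)
    return False


# Usage (assumes Elasticsearch listens on the default port):
# if not wait_for_http("http://localhost:9200"):
#     raise RuntimeError("Elasticsearch did not start in time")
```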

In [9]:
# Connect to Elasticsearch
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore # module renamed from haystack.database to haystack.document_store
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

04/01/2021 11:36:42 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.083s]
04/01/2021 11:36:42 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.368s]
04/01/2021 11:36:42 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.204s]


Restructure the dataset as a list of dicts, in one of two layouts:

1. [{'name': str, 'text': str}, {'name': str, 'text': str}, ...]
2. [{'text': str, 'meta': {'name': str, ...}}, ...]

In [10]:
# Structure 1
modified = df[['title', 'full_text']].rename(columns={'title': 'name', 'full_text': 'text'})
dicts_1 = modified.to_dict(orient='records') # how the dicts are built: one dict per row

In [None]:
dicts_1[:3]

In [12]:
# Structure 2: text plus a meta dict carrying the title and paper id
dicts_2 = [
    {'text': row.full_text, 'meta': {'name': row.title, 'p_id': row.paper_id}}
    for row in df.itertuples(index=False)
]

In [None]:
dicts_2[:3]

In [None]:
document_store.write_documents(dicts_2)

## Retriever

In [15]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)
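
The retriever relies on the BM25 ranking that Elasticsearch provides, as noted in the DocumentStore section. To give a feel for what that score rewards, here is a minimal pure-Python sketch of the textbook BM25 formula. It is an illustration only, not Elasticsearch's implementation: `bm25_scores` is a hypothetical name, tokenization is a plain whitespace split, and `k1=1.2`, `b=0.75` are the commonly used default parameters.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by length with b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy documents, made up for illustration.
docs = [
    "fever and cough are common covid symptoms",
    "the stock market fell sharply",
    "cough medicine",
]
scores = bm25_scores("covid symptoms", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because BM25 only matches exact terms, documents that share no words with the query score zero, which is why the reader model is applied afterwards to the retrieved candidates.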

## Reader

In [None]:
reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2-covid',
                    use_gpu=True,
                    context_window_size=500) # length of answer context

## Finder

In [17]:
#finder = Finder(reader, retriever)

In [18]:
# updated version
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Search

In [None]:
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers. 
prediction = pipe.run(query="what is the covid19 symptom?", top_k_retriever=10, top_k_reader=3)
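
`pipe.run` returns a dict whose `'answers'` list holds one dict per answer, which the next cell inspects. The sketch below shows one way to flatten that structure into rows for display. It is a rough illustration under assumptions: `summarize_answers` is a hypothetical helper, the sample prediction is made up, and only the `'answer'`, `'context'`, and `'meta'` keys are used, with `'name'` and `'p_id'` in `meta` as written to the document store earlier.

```python
def summarize_answers(prediction, max_context=60):
    """Return (rank, answer, title, context snippet) tuples from a prediction dict."""
    rows = []
    for rank, ans in enumerate(prediction['answers'], start=1):
        meta = ans.get('meta') or {}          # meta may be missing or None
        title = meta.get('name', '(no title)')
        snippet = (ans.get('context') or '')[:max_context]
        rows.append((rank, ans['answer'], title, snippet))
    return rows


# Hypothetical prediction, for illustration only.
sample_prediction = {
    'query': 'what is the covid19 symptom?',
    'answers': [
        {'answer': 'fever and cough',
         'context': 'Hypothetical context: the most frequent symptoms were fever and cough.',
         'meta': {'name': 'Hypothetical paper title', 'p_id': 'abc1234'}},
        {'answer': 'headache',
         'context': 'Hypothetical context mentioning headache.',
         'meta': None},
    ],
}
rows = summarize_answers(sample_prediction)
for row in rows:
    print(row)
```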

In [20]:
print(type(prediction))
print('\n===== Minimal Answer =====')
print_answers(results=prediction, details='minimal')
print('\n===== Number of Answers =====')
print(len(prediction['answers']))
print('\n===== Answer Structure =====')
print(prediction['answers'][0].keys())
print('\n===== Meta Information =====')
print(prediction['answers'][0]['meta'])
print('\n===== Length of Context =====')
print(len(prediction['answers'][-1]['context']))

<class 'dict'>

===== Minimal Answer =====
[   {   'answer': 'the most common symptom was dizziness (16.8%) followed '
                  'closely by headache (13.1%)',
        'context': 'airment, neuropathic pain, Guillain-Barre Syndrome and '
                   'variants), and skeletal muscular injury 2 . In one '
                   'observational study from Wuhan, of the 36.4% of COVID19 '
                   'patients who showed neurologic manifestations, the most '
                   'common symptom was dizziness (16.8%) followed closely by '
                   'headache (13.1%) 2 . In another prospective analysis out '
                   'of Wuhan, headache was present in 8% of all patients, '
                   'overall the most common neurological symptom 1 . Neither '
                   'of these studies collected data on milder nervous system '},
    {   'answer': 'that fever was the most common initial symptom, followed by '
                  'a cough, fatigue and shortness o