# Семинар №11 - Вопросно-ответная система

Сделаем поисковую систему на основе BM25 и BERT, чтобы отфильтровать научные работы по запросу.

## Bert for QA
![bert.png](./imgs/bert.png)

BERT (Bidirectional Encoder Representations from Transformers) - недавняя статья, опубликованная исследователями Google AI Language. Она вызвала ажиотаж в сообществе машинного обучения, представив самые современные результаты в широком спектре задач NLP, включая ответы на вопросы (SQuAD v1.1), вывод на естественном языке (MNLI) и другие.

Ключевым техническим новшеством БЕРТА является применение двунаправленного обучения Transformer, популярной модели внимания, к языковому моделированию. Это контрастирует с предыдущими попытками, в которых рассматривалась последовательность текста либо слева направо, либо комбинированное обучение слева направо и справа налево. Результаты работы показывают, что языковая модель, которая обучается в двух направлениях, может иметь более глубокое представление о языковом контексте и потоке, чем однонаправленные языковые модели. В статье исследователи подробно описывают новую технику под названием Masked LM (MLM), которая позволяет проводить двунаправленное обучение в моделях, в которых ранее это было невозможно.

Question answering (QA) - дисциплина компьютерных наук в области поиска информации и обработки естественного языка (NLP), которая занимается созданием систем, автоматически отвечающих на вопросы, задаваемые людьми на естественном языке.

## BM25
![image.png](./imgs/bm25_formula.jpg)

Здесь:
- D - документ.
- Q - запрос.
- f(q_i, D) - число вхождений слова q_i в документ.
- |D| - длина документа.
- avgdl - средняя длина документов в коллекции.
- N - число документов в коллекции.
- n(q_i) - число документов, содержащих слово q_i.
- k_1 - параметр в диапазоне [1.2, 2.0].
- b - параметр, обычно 0.75.
- $\delta$ - параметр, обычно 1.

В области поиска информации Okapi BM25 (BM - сокращение от best matching) - это функция ранжирования, используемая поисковыми системами для оценки релевантности документов заданному поисковому запросу. Он основан на системе вероятностного поиска, разработанной в 1970-х и 1980-х годах Стивеном Э. Робертсоном, Карен Сперк Джонс и другими.

Название фактической функции ранжирования - BM25. Более полное название, Okapi BM25, включает название первой системы, которая его использовала, - информационно-поисковой системы Okapi, внедренной в Лондонском городском университете в 1980-х и 1990-х годах. BM25 и его более новые варианты, например BM25F (версия BM25, которая может учитывать структуру документа и текст привязки), представляют собой современные функции поиска, подобные TF-IDF, используемые при поиске документов.

Почитать про BM25: https://habr.com/ru/articles/823568/

## Install and import packages

In [1]:
!pip install rank_bm25 nltk transformers
!pip install ipywidgets

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m51.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: jedi
Successfully installed jedi-0.19.2


In [2]:
import re
import os
import json
from pathlib import Path, PurePath

import requests
from requests.exceptions import HTTPError, ConnectionError

import numpy as np
import pandas as pd
from tqdm import tqdm
from rank_bm25 import BM25Okapi

import nltk
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download('stopwords')
nltk.download('punkt_tab')

import torch
from transformers import BertTokenizer, BertForQuestionAnswering

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


# Предварительная обработка данных

[Загрузите](https://disk.yandex.ru/d/Z3MfKtbtCwdZkw) полнотекстовые публикации и очистите их.

In [4]:
merged = pd.read_csv(r'processed_data_v8_2 2.csv', sep=';')

In [5]:
merged.head()

Unnamed: 0,paper_id,authors,title,abstract,text,date,url,risk_factor_age,risk_factor_sex,risk_factor_overweight,...,tag_design_retrospective_cohort,tag_design_cross_sectional_case_control,tag_design_matched_case_control,tag_design_prevalence_survey,tag_design_time_series_analysis,tag_design_systematic_review,tag_design_randomized_control,tag_design_pseudo_randomized_control,tag_design_case_study,tag_design_simulation
0,sl79r65r,"Zhang, Zhonghua; Jiang, Shan; Liu, Yun; Sun, Y...","Identification of ireA, 0007, 0008, and 2235 a...","Avian pathogenic Escherichia coli (APEC), a pa...","Avian pathogenic Escherichia coli (APEC), a pa...",2020-01-23,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,wvx6q999,,Note from the editors: novel coronavirus (2019...,,"At the end of 2019, on 31 December, the World ...",2020-01-23,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,z9jnljrt,"Cao, Yurou; Gao, Lulu; Zhang, Li; Zhou, Lixian...",Genome-wide screening of lipoproteins in Actin...,Actinobacillus pleuropneumoniae is an importan...,Actinobacillus pleuropneumoniae is a Gram-nega...,2020-02-11,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,dqpujzo8,"Li, Hong-Ying; Zhu, Guang-Jian; Zhang, Yun-Zhi...",A qualitative study of zoonotic risk factors a...,BACKGROUND: Strategies are urgently needed to ...,Emerging and re-emerging zoonotic diseases are...,2020-02-10,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,True,True,False,...,False,False,False,False,False,False,False,False,False,False
4,q6mhqcho,"Zhou, Zhen; Zhang, Pan; Cui, Yuxia; Zhang, Yon...",Experiments Investigating the Competitive Grow...,Human metapneumovirus (hMPV) is an important p...,"Human metapneumovirus (hMPV), which was first ...",2020-02-18,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,True,False,False,...,False,False,False,False,False,False,False,False,True,False


In [6]:
f = merged.loc[0].to_frame().fillna('')
f.columns = ['Value']
f.iloc[:10]

Unnamed: 0,Value
paper_id,sl79r65r
authors,"Zhang, Zhonghua; Jiang, Shan; Liu, Yun; Sun, Y..."
title,"Identification of ireA, 0007, 0008, and 2235 a..."
abstract,"Avian pathogenic Escherichia coli (APEC), a pa..."
text,"Avian pathogenic Escherichia coli (APEC), a pa..."
date,2020-01-23
url,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
risk_factor_age,False
risk_factor_sex,False
risk_factor_overweight,False


## Задача №1

Создайте классную работу, чтобы упростить получение какого-либо поля, такого как аннотация, полный текст, doi и так далее.

In [8]:
class Paper:
    def __init__(self, item):
        self.paper = item.to_frame().fillna('')
        # self.paper.columns = ['Value']
    def url(self):
        return self._get_field('url')
    def text(self):
        return self._get_field('text')
    def abstract(self):
        return self._get_field('abstract')
    def title(self):
        return self._get_field('title')
    def authors(self, split=False):
        return self._get_field('authors')
    def _get_field(self, field_name):
        return self.paper.get(field_name, '')

## Задача №2

Некоторые функции предварительной обработки для очистки исходных данных. Мы обозначаем текст, удаляем знаки препинания и некоторые специальные символы.

In [9]:
english_stopwords = list(set(stopwords.words('english')))
SEARCH_DISPLAY_COLUMNS = ['title', 'abstract', 'url', 'authors', 'text']

In [10]:
def strip_characters(text):
    return re.sub(r'[\(\)\[\]\{\}:;,\\.!?\'\"-]', '', text)

def clean(text):
    text = text.lower()
    text = strip_characters(text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def tokenize(text):
    words = nltk.word_tokenize(text)
    return list(
        set([word for word in words
             if len(word) > 1
             and not word in english_stopwords
             and not (word.isnumeric() and len(word) is not 4)
             and (not word.isnumeric() or word.isalpha())])
    )

def preprocess(text):
    t = clean(text)
    tokens = tokenize(t)
    return tokens


# Поисковая система BM25
Определите класс для построения индекса и поиска по строкам.

In [11]:
class SearchResults:
    def __init__(self,
                 data: pd.DataFrame,
                 columns = None):
        self.results = data
        if columns:
            self.results = self.results[columns]

    def __getitem__(self, item):
        return Paper(self.results.loc[item])

    def __len__(self):
        return len(self.results)

    def set_ans(self, ans):
        col_name = self.results.columns.tolist()
        col_name.insert(2, 'Answer')
        self.results = self.results.reindex(columns=col_name)
        self.results['Answer'] = ans

In [12]:
class WordTokenIndex:
    def __init__(self,
                 corpus: pd.DataFrame,
                 columns=SEARCH_DISPLAY_COLUMNS):
        self.corpus = corpus

        raw_search_str = self.corpus.abstract.fillna('') + ' ' + self.corpus.title.fillna('')
        self.index = raw_search_str.apply(preprocess).to_frame()
        self.index.columns = ['terms']
        self.index.index = self.corpus.index
        self.columns = columns

    def search(self, search_string):
        search_terms = preprocess(search_string)
        # получить индексы, включающие строку search_string
        result_index = self.index.terms.apply(lambda terms: any(i in terms for i in search_terms))
        # получить найденные papers
        results = self.corpus[result_index].copy().reset_index().rename(columns={'index':'paper'})

        return SearchResults(results, self.columns + ['paper'])

In [13]:
class RankBM25Index(WordTokenIndex):
    def __init__(self, corpus: pd.DataFrame, columns=SEARCH_DISPLAY_COLUMNS):
        super().__init__(corpus, columns)
        self.bm25 = BM25Okapi(self.index.terms.tolist())

    def search(self, search_string, n=4):
        search_terms = preprocess(search_string)
        doc_scores = self.bm25.get_scores(search_terms)

        ind = np.argsort(doc_scores)[::-1][:n]
        results = self.corpus.iloc[ind][self.columns]
        results['Score'] = doc_scores[ind]
        results = results[results.Score > 0]

        return SearchResults(results.reset_index(), self.columns + ['Score'])

# Bert furtuer pre-train and fine
Используйте реализацию transformers bert и загрузите предварительно обученную модель "bert-base-uncased". Мы используем все статьи в качестве корпуса для дальнейшего предварительного обучения bert задаче языковой модели. После дальнейшей предварительной подготовки мы используем набор данных SQuAD-2.0 для точной настройки модели bert. Наконец, мы сохраняем обученную модель, которая называется "output_squad".

In [14]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# QA function
Создайте функцию контроля качества, которая получает вопрос и фрагмент текста, передает их в модель bert и выводит возможный ответ в тексте. Если модель выдает неверный ответ, функция выдаст "Нет ответа".

## Постройте пайплайн поиска и покажите результаты

### Задача №3
Мы показываем выходные данные модели в таблицах и заключаем их в "【】", чтобы представить ответ в статье.

In [97]:
max_question_len = 100

def getAnswer(question, text):
    input_ids = tokenizer.encode(question, text, max_length=512, truncation=True)
    # mask text sequence  # 102 - SEP между question и text
    token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]

    model_out = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
    start_scores, end_scores = model_out['start_logits'], model_out['end_logits']

    all_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # выделить  часть за начало ответа и его конец
    start_scores = start_scores[0, input_ids.index(102):]  # bs = 1
    end_scores = end_scores[0, input_ids.index(102):]  # bs = 1

    if torch.argmax(start_scores).item() >= torch.argmax(end_scores).item():
        return "No Answer"

    # выделить токены до ответа + [ начало контекста ответа
    all_tokens = all_tokens[: input_ids.index(102) + torch.argmax(start_scores)] + ["【"] + all_tokens[input_ids.index(102) + torch.argmax(start_scores):]
    # выделить токены конца ответа контекста ] + токены после ответа
    all_tokens = all_tokens[: input_ids.index(102) + torch.argmax(end_scores) + 1] + ["】"] + all_tokens[input_ids.index(102) + torch.argmax(end_scores) + 1:]

    # найти начало и конец ответа
    start_span = max(input_ids.index(102) + torch.argmax(start_scores) - 5, input_ids.index(102))
    end_span = min(input_ids.index(102) + torch.argmax(end_scores) + 6, len(all_tokens) + 1)

    answer = tokenizer.convert_tokens_to_string(all_tokens[start_span:end_span])
    answer = answer.replace("[SEP]", "")
    return answer

In [47]:
text = merged.iloc[0].text
text

'Avian pathogenic Escherichia coli (APEC), a pathotype of extraintestinal pathogenic E. coli (ExPEC), causes serious infectious diseases in poultry [1, 2]. Different serotypes of APEC cause local or systemic infectious diseases in poultry, including respiratory infections, sepsis, polyserositis, coligranuloma, cellulitis, yolk sac infection, omphalitis, and swollen head syndrome, resulting in significant economic losses to the poultry industry [3]. Strains of APEC and neonatal meningitis-associated E. coli (NMEC, a subpathotype of ExPEC), the latter of which causes infections in humans, reportedly share some common virulence genes [4–6]. It is thus particularly important to study the genes encoding virulence factors in APEC strains. These strains contain several virulence-associated genes that encode various virulence factors, including adhesins (fimC, ompA, papC), invasins (ibeA), avian haemolysins (hlyF), serum survival proteins (iss, ompT), and siderophores (iutA, fyuA, iroN); moreo

In [84]:
query = 'What is it omphalitis'
text = merged.iloc[0].text
getAnswer(query, text)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


'to the poultry industry [ 【 3 ] . strains of apec and neonatal meningitis - associated e . coli ( nmec , a subpathotype of expec ) , the latter of which causes infections in humans , reportedly share some common virulence genes [ 4 – 6 ] . it is thus particularly important to study the genes encoding virulence factors in apec strains . these strains contain several virulence - associated genes that encode various virulence factors , including adhesins ( fimc , ompa , papc ) , invasins ( ibea ) , avian haemolysins ( hlyf ) , serum survival proteins ( iss , ompt ) , and siderophores ( iuta , fyua , iron ) ; moreover , they show the presence of a pathogenicity island [ 7 – 9 ] . iron is a vital micronutrient that regulates enzyme activity and metabolism . this element plays a key 】 role in basic cellular'

### Задача №4
Соберем весь процесс поиска в одной функции.

In [90]:
bm25_index = RankBM25Index(merged)

In [91]:
def searchAndGetAnswer(question, top_bm=50, top_k=10):
    assert top_bm > top_k and top_bm > 1 and top_k > 1, 'set top_bm > top_k > 1'
    results = bm25_index.search(question, top_bm)
    ans_list = []
    ans_index = []
    for i in range(len(results)):
        text = results[i].text()
        ans = getAnswer(question, text)
        if ans != "No Answer":
            ans_list.append(ans)
            ans_index.append(i)
            print(f"{len(ans_index)}/{top_k}")
            if len(ans_index) >= top_k:
                break
    if not ans_index:
        print("No valid answers found.")
        return results.results
    final_results = results.results.iloc[ans_index].reset_index(drop=True)
    final_results = final_results.copy()
    final_results['Answer'] = ans_list
    cols = ['title', 'abstract', 'Answer', 'Score', 'url', 'authors']
    res = final_results.loc[:, cols]
    return res

### Q1: *Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.*

In [92]:
p = "Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time."
res = searchAndGetAnswer(p)
res

No valid answers found.


Unnamed: 0,title,abstract,url,authors,text,Score
0,Genome Detective Coronavirus Typing Tool for r...,"SUMMARY: Genome Detective is a web-based, user...",https://doi.org/10.1093/bioinformatics/btaa145...,"Cleemput, Sara; Dumon, Wim; Fonseca, Vagner; K...",We are currently faced with a potential global...,22.262102
1,Genome Detective Coronavirus Typing Tool for r...,"Genome Detective is a web-based, user-friendly...",https://doi.org/10.1101/2020.01.31.928796,"Cleemput, Sara; Dumon, Wim; Fonseca, Vagner; A...",We are currently faced with a potential global...,21.113115
2,Genotyping coronavirus SARS-CoV-2: methods and...,The emerging global infectious COVID-19 corona...,https://arxiv.org/pdf/2003.10965v1.pdf,"Yin, Changchuan","The novel coronavirus in humans, first discove...",20.446664
3,Genotyping coronavirus SARS-CoV-2: Methods and...,Abstract The emerging global infectious COVID-...,https://api.elsevier.com/content/article/pii/S...,"Yin, Changchuan",through nsp16 in all coronaviruses [8] . There...,20.102274
4,BioLaboro: A bioinformatics system for detecti...,Background Emerging and reemerging infectious ...,https://doi.org/10.1101/2020.04.08.031963,"Holland, Mitchell; Negrón, Daniel; Mitchell, S...","Using next generation sequencing, the whole ge...",18.086823
5,MINERVA: A facile strategy for SARS-CoV-2 whol...,The novel coronavirus disease 2019 (COVID-19) ...,https://doi.org/10.1101/2020.04.25.060947,"Chen, Chen; Li, Jizhou; Di, Lin; Jing, Qiuyu; ...","As of May 22, 2020, the ongoing COVID-19 viral...",16.010989
6,"The ongoing COVID-19 epidemic in Minas Gerais,...",The recent emergence of a previously unknown c...,http://medrxiv.org/cgi/content/short/2020.05.0...,"Xavier, J.; Giovanetti, M.; Adelino, T.; Fonse...","To date, more than 3.5 million cases of the di...",15.469203
7,Differences in power-law growth over time and ...,An automated statistical and error analysis of...,https://doi.org/10.1101/2020.03.31.20048827,"Merrin, Jack",Mathematics is essential to predict and contro...,15.163071
8,Infection and Rapid Transmission of SARS-CoV-2...,The outbreak of coronavirus disease 2019 (COVI...,https://www.ncbi.nlm.nih.gov/pubmed/32259477/;...,"Kim, Young-Il; Kim, Seong-Gyu; Kim, Se-Mi; Kim...",Coronaviruses (CoVs) are a large family of vir...,15.076527
9,Tracking COVID-19 in Europe: Infodemiology App...,"BACKGROUND: Infodemiology (ie, information epi...",https://www.ncbi.nlm.nih.gov/pubmed/32250957/;...,"Mavragani, Amaryllis","In December 2019, Chinese researchers identifi...",14.799094


### Q2: *Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.*

In [98]:
p = "Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged."
res = searchAndGetAnswer(p)
res

1/10
2/10
3/10
4/10
5/10
6/10
7/10
8/10
9/10
10/10


Unnamed: 0,title,abstract,Answer,Score,url,authors
0,No-Test Medication Abortion: A Sample Protocol...,,【 】,18.587669,https://doi.org/10.1016/j.contraception.2020.0...,"Raymond, Elizabeth G.; Grossman, Daniel; Mark,..."
1,Evaluation of Rural Public Libraries to Addres...,"Introduction: In the United States, access to ...",【 】,16.814598,http://medrxiv.org/cgi/content/short/2020.05.2...,"DeGuzman, P. B.; Siegfried, Z. C.; Leimkuhler,..."
2,Perfectionism and Perceived Control in Posttra...,"In this study, we sought to examine associatio...",【 】,16.680524,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"Molnar, Danielle S.; Flett, Gordon L.; Hewitt,..."
3,Anatomic evidence shows that lymphatic drainag...,Respiratory infections can result in intracran...,【 】,16.471302,https://www.sciencedirect.com/science/article/...,"Elham, Elzat; Wumaier, Reziya; Wang, Chengji; ..."
4,Quantifying antibody kinetics and RNA shedding...,Our ability to understand and mitigate the spr...,【 】,15.147414,http://medrxiv.org/cgi/content/short/2020.05.1...,"Borremans, B.; Gamble, A.; Prager, K. C.; Helm..."
5,Presence of SARS-Coronavirus-2 in sewage,"In the current COVID-19 pandemic, a significan...",【 】,14.39056,http://medrxiv.org/cgi/content/short/2020.03.2...,"Medema, G.; Heijnen, L.; Elsinga, G.; Italiaan..."
6,Accelerated infection testing at scale: a prop...,"In pandemics or epidemics, public health autho...",【 】,13.906085,https://arxiv.org/pdf/2003.13282v1.pdf,"Jain, Tarun; Jain, Bijendra Nath"
7,Mobilization of Telepsychiatry in Response to ...,The COVID-19 pandemic threatens to disrupt the...,【 】,13.785966,https://doi.org/10.1007/s10488-020-01044-z; ht...,"Kannarkat, Jacob T.; Smith, Noah N.; McLeod-Br..."
8,Genomic variance of the 2019‐nCoV coronavirus,There is a rising global concern for the recen...,【 】,13.749903,https://www.ncbi.nlm.nih.gov/pubmed/32027036/;...,"Ceraolo, Carmine; Giorgi, Federico M."
9,Identification of IgG antibody response to SAR...,Diagnostic testing and evaluation of patient i...,【 】,13.572998,http://medrxiv.org/cgi/content/short/2020.05.0...,"McAndrews, K. M.; Dowlatshahi, D. P.; Hensel, ..."


### Q3: *Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.*
* Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
* Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
* Experimental infections to test host range for this pathogen.

In [99]:
p = '''
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.
Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
Experimental infections to test host range for this pathogen.
'''
res = searchAndGetAnswer(p)
res

No valid answers found.


Unnamed: 0,title,abstract,url,authors,text,Score
0,Coronavirus surveillance of wildlife in the La...,Coronaviruses can become zoonotic as in the ca...,https://doi.org/10.1101/2020.04.22.056218,"McIver, David J.; Silithammavong, Soubanh; The...",The latest coronavirus (CoV) outbreak in human...,30.306554
1,SARS-CoV-2 spike protein predicted to form sta...,SARS-CoV-2 has a zoonotic origin and was trans...,https://doi.org/10.1101/2020.05.01.072371,"Lam, SD; Bordin, N; Waman, VP; Scholes, HM; As...",Severe acute respiratory syndrome coronavirus ...,29.467318
2,Spillover of SARS-CoV-2 into novel wild hosts ...,Abstract There is evidence that the current ou...,https://api.elsevier.com/content/article/pii/S...,"Franklin, Alan B.; Bevins, Sarah N.",J o u r n a l P r e -p r o o f 2 infected huma...,26.276919
3,Are pangolins the intermediate host of the 201...,The outbreak of 2019-nCoV pneumonia (COVID-19)...,https://doi.org/10.1101/2020.02.18.954628,"Liu, Ping; Jiang, Jing-Zhe; Wan, Xiu-Feng; Hua...","In December 2019, there was an outbreak of pne...",25.001637
4,A potential role for integrins in host cell en...,• Integrin may act as an alternative receptor ...,https://doi.org/10.1016/j.antiviral.2020.10475...,"Sigrist, Christian JA; Bridge, Alan; Le Mercie...","Since December 2019, a novel coronavirus (nCoV...",23.883754
5,Hypothesis: angiotensin-converting enzyme inhi...,Intravenous infusions of angiotensin-convertin...,https://www.ncbi.nlm.nih.gov/pubmed/32186711/;...,"Diaz, James H",Highlight Intravenous infusions of angiotensin...,23.743547
6,Proteolytic cleavage of the SARS-CoV-2 spike p...,Abstract Severe acute respiratory syndrome cor...,https://api.elsevier.com/content/article/pii/S...,"Jaimes, Javier A.; Millet, Jean K.; Whittaker,...","Since December 2019, human infections by a nov...",23.606242
7,Predicting wildlife hosts of betacoronaviruses...,Despite massive investment in research on rese...,https://doi.org/10.1101/2020.05.22.111344,"Becker, Daniel J.; Albery, Gregory F.; Sjodin,...",sampling bias and can only make predictions fo...,23.308656
8,"SARS-CoV-2, an evolutionary perspective of int...",The emergence of SARS-CoV-2 has resulted in mo...,https://doi.org/10.1101/2020.03.21.001933,"Armijos-Jaramillo, Vinicio; Yeager, Justin; Mu...",The recent emergence of the novel SARS coronav...,22.974869
9,Use of the informational spectrum methodology ...,A novel coronavirus recently identified in Wuh...,https://www.ncbi.nlm.nih.gov/pubmed/32419926/;...,"Veljkovic, Veljko; Vergara-Alert, Júlia; Segal...",Fears are mounting worldwide over the cross-bo...,22.492407


### Q4: *Animal host(s) and any evidence of continued spill-over to humans*

In [100]:
p = '''
Animal host(s) and any evidence of continued spill-over to humans
'''
res = searchAndGetAnswer(p)
res

No valid answers found.


Unnamed: 0,title,abstract,url,authors,text,Score
0,Spillover of SARS-CoV-2 into novel wild hosts ...,Abstract There is evidence that the current ou...,https://api.elsevier.com/content/article/pii/S...,"Franklin, Alan B.; Bevins, Sarah N.",J o u r n a l P r e -p r o o f 2 infected huma...,22.379831
1,Are pangolins the intermediate host of the 201...,The outbreak of 2019-nCoV pneumonia (COVID-19)...,https://doi.org/10.1101/2020.02.18.954628,"Liu, Ping; Jiang, Jing-Zhe; Wan, Xiu-Feng; Hua...","In December 2019, there was an outbreak of pne...",15.671288
2,Are pangolins the intermediate host of the 201...,The outbreak of a novel corona Virus Disease 2...,https://www.ncbi.nlm.nih.gov/pubmed/32407364/;...,"Liu, Ping; Jiang, Jing-Zhe; Wan, Xiu-Feng; Hua...","In December 2019, there was an outbreak of pne...",14.773573
3,Exceptional diversity and selection pressure o...,Pandemics originating from pathogen transmissi...,https://doi.org/10.1101/2020.04.20.051656,"Frank, Hannah K.; Enard, David; Boyd, Scott D.",because studies only examine a small subset of...,14.143423
4,"COVID-19: Epidemiology, Evolution, and Cross-D...",The recent outbreak of COVID-19 in Wuhan turne...,https://www.ncbi.nlm.nih.gov/pubmed/32359479/;...,"Sun, Jiumeng; He, Wan-Ting; Wang, Lifang; Lai,...","In December 2019, a cluster of pneumonia with ...",12.735855
5,Severe acute respiratory syndrome-related coro...,The present outbreak of lower respiratory trac...,https://doi.org/10.1101/2020.02.07.937862,"Gorbalenya, Alexander E.; Baker, Susan C.; Bar...",Is the outbreak of an infectious disease cause...,12.127306
6,Horses as a Crucial Part of One Health,"One Health (OH) is a crucial concept, where th...",https://www.ncbi.nlm.nih.gov/pubmed/32121327/;...,"Lönker, Nelly Sophie; Fechner, Kim; Abd El Wah...",One Health (OH) is a holistic approach which d...,12.061077
7,Emerging novel coronavirus (2019-nCoV)—current...,Coronaviruses are the well-known cause of seve...,https://www.ncbi.nlm.nih.gov/pubmed/32036774/;...,"Malik, Yashpal Singh; Sircar, Shubhankar; Bhat...",Coronaviruses (CoVs) are well-known causes of ...,11.823325
8,The species Severe acute respiratory syndrome-...,The present outbreak of a coronavirus-associat...,https://doi.org/10.1038/s41564-020-0695-z; htt...,,Understanding the cause of a specific disease ...,11.678412
9,Emergence of SARS-CoV-2 through Recombination ...,COVID-19 has become a global pandemic caused b...,https://doi.org/10.1101/2020.03.20.000885,"Li, Xiaojun; Giorgi, Elena E.; Marichann, Manu...",The severe respiratory disease COVID-19 was fi...,11.136798


### Q5: *Socioeconomic and behavioral risk factors for this spill-over*

In [101]:
p = '''
Socioeconomic and behavioral risk factors for this spill-over
'''
res = searchAndGetAnswer(p)
res

No valid answers found.


Unnamed: 0,title,abstract,url,authors,text,Score
0,Household financial vulnerability in Indonesia...,This study assesses the level of financial vul...,https://api.elsevier.com/content/article/pii/S...,"Noerhidajati, Sri; Purwoko, Agung Bayu; Werdan...","The crisis in 2008, which was caused by securi...",12.580977
1,BigO: A public health decision support system ...,Obesity is a complex disease and its prevalenc...,https://arxiv.org/pdf/2005.02928v1.pdf,"Diou, Christos; Sarafis, Ioannis; Papapanagiot...",Obesity prevalence has been continuously risin...,11.940364
2,A deadly spillover: SARS-CoV-2 outbreak,,https://www.ncbi.nlm.nih.gov/pubmed/32321324/;...,"Mori, Mattia; Capasso, Clemente; Carta, Fabriz...",Coronaviruses (CoV) are a family of viruses th...,10.810611
3,Is Brazil prepared for the new era of infectio...,Brazil must maintain a focus on enhancing the ...,https://www.ncbi.nlm.nih.gov/pubmed/32374800/;...,"Vicente, Creuza Rachel",Dear Editor:\nThe world has been facing a new ...,10.182449
4,On the dynamics emerging from pandemics and in...,This position paper discusses emerging behavio...,https://arxiv.org/pdf/2004.08917v1.pdf,"Leitner, Stephan",in the direct aftermath of a pandemic are key ...,10.110361
5,Exposure to air pollution and COVID-19 mortali...,Objectives: United States government scientist...,https://doi.org/10.1101/2020.04.05.20054502,"Wu, Xiao; Nethery, Rachel C.; Sabath, Benjamin...",The scale of the COVID-19 public health emerge...,9.56009
6,COVID-19: the First Documented Coronavirus Pan...,The novel human coronavirus disease COVID-19 h...,https://www.ncbi.nlm.nih.gov/pubmed/32387617/;...,"Liu, Yen-Chin; Kuo, Rei-Lin; Shih, Shin-Ru","Currently, people all over the world have been...",9.036929
7,Invasion Science and the Global Spread of SARS...,Abstract Emerging infectious diseases like COV...,https://www.sciencedirect.com/science/article/...,"Nuñez, M. A.; Pauchard, A.; Ricciardi, A.",A sinister combination of ecosystem alteration...,8.91751
8,Horses as a Crucial Part of One Health,"One Health (OH) is a crucial concept, where th...",https://www.ncbi.nlm.nih.gov/pubmed/32121327/;...,"Lönker, Nelly Sophie; Fechner, Kim; Abd El Wah...",One Health (OH) is a holistic approach which d...,8.913372
9,Climate affects global patterns of COVID-19 ea...,"Environmental factors, including seasonal clim...",https://doi.org/10.1101/2020.03.23.20040501,"Ficetola, Gentile Francesco; Rubolini, Diego","1 Abstract 1 2 Environmental factors, includin...",8.739939


### Q6: *Sustainable risk reduction strategies*

In [102]:
p = '''
Sustainable risk reduction strategies
'''
res = searchAndGetAnswer(p)
res

1/10
2/10
3/10
4/10
5/10
6/10
7/10
8/10
9/10
10/10


Unnamed: 0,title,abstract,Answer,Score,url,authors
0,Indirect effects of COVID-19 on the environment,Abstract This research aims to show the positi...,【 】,9.35098,https://api.elsevier.com/content/article/pii/S...,"Zambrano-Monserrate, Manuel A.; Ruano, María A..."
1,Post COVID 19 and food pathways to sustainable...,,【 】,9.074676,https://doi.org/10.1007/s10460-020-10051-7; ht...,"Blay-Palmer, Alison; Carey, Rachel; Valette, E..."
2,A new vehicle to accelerate the UN Sustainable...,,【 】,8.976158,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,"Sherr, Lorraine; Cluver, Lucie; Desmond, Chris..."
3,A Brave New World: Lessons from the COVID-19 P...,,【 】,8.693033,https://www.ncbi.nlm.nih.gov/pubmed/32313383/;...,"Sarkis, Joseph; Cohen, Maurie J.; Dewick, Paul..."
4,Editorial: Impacts of COVID-19 on agricultural...,,【 】,8.602586,https://www.sciencedirect.com/science/article/...,"Stephens, Emma; Martin, Guillaume; van Wijk, M..."
5,From crisis to utopia: crafting new public–pri...,,【 】,8.514002,https://www.ncbi.nlm.nih.gov/pubmed/32395003/;...,"Caron, Patrick"
6,COVID-19 lockdowns cause global air pollution ...,The lockdown response to COVID-19 has caused a...,【 】,8.45078,https://doi.org/10.1101/2020.04.10.20060673,"Venter, Zander S; Aunan, Kristin; Chowdhury, S..."
7,Further analysis of the impact of distancing u...,This paper questions various claims from the p...,【 】,8.254299,https://doi.org/10.1101/2020.04.14.20048025,"Bernstein, Daniel J."
8,A cloth mask for under-resourced healthcare se...,INTRODUCTION: COVID19 pandemic poses a global ...,【 】,8.101378,https://www.ncbi.nlm.nih.gov/pubmed/32394153/;...,"Sugrue, Michael; O’Keeffe, Derek; Sugrue, Ryan..."
9,Preparing your intensive care unit for the COV...,The coronavirus disease 2019 (COVID-19) has ra...,【 】,8.032679,https://doi.org/10.1186/s13054-020-02916-4; ht...,"Goh, Ken Junyang; Wong, Jolin; Tien, Jong-Chie..."
