# Simple Search Engine over Scientific Papers
and other small attempts to make searching easier. 

In this notebook we try a few slightly naive approaches to searching smartly over a database of scientific papers covering medicine, biology, genetics, virology, and related fields. The technologies we wanted to test were the Natural Language Toolkit (NLTK), Doc2Vec and Word2Vec, and a lightweight RNN that we attempted to build.

As students of information science rather than medicine, we have no expertise in viruses or illnesses, so our idea was to create a search engine that lets a user search through papers without knowing the domain-specific vocabulary or terminology. This was our sole ambition for this short-lived project. So let's get into the specifics of what worked and what did not.

Resources:
https://www.nltk.org/ ---- Natural Language Toolkit
https://www.kaggle.com/maksimeren/covid-19-literature-clustering ---- Parts of preprocessing

In [1]:
import numpy as np
seed = 666
np.random.seed(seed)
import os
import sys
import math
from datetime import datetime
import tensorflow as tf

###############################################################################
# Compatibility: Tensorflow 2.1 with Python 3.7
###############################################################################

# Check GPUs:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            # Prevent TensorFlow from allocating all memory of all GPUs:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

# Enable mixed-precision (speeds-up computations and uses less memory):
tf.keras.mixed_precision.experimental.set_policy('infer_float32_vars')



1 Physical GPUs, 1 Logical GPUs


The next few cells build a single dataframe that holds all of the data.

In [2]:
import numpy as np
import pandas as pd 
import os

for dirname, _, filenames in os.walk('C:\\Users\\sindr\\MachineLearning\\CORD-19\\CORD-19-research-challenge'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\COVID.DATA.LIC.AGMT.pdf
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\json_schema.txt
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\metadata.csv
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\metadata.readme
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\processed.csv
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\biorxiv_medrxiv\biorxiv_medrxiv\0015023cc06b5362d332b3baf348d11567ca2fbb.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\biorxiv_medrxiv\biorxiv_medrxiv\004f0f8bb66cf446678dc13cf2701feec4f36d76.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\biorxiv_medrxiv\biorxiv_medrxiv\00d16927588fb04d4be0e6b269fc02f0d3c2aa7b.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\biorxiv_medrxiv\biorxiv_medrxiv\0139ea4ca580af99b602c6435368e7fdbefacb03.json
C:\Users\si

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\000b7d1517ceebb34e1e3e817695b6de03e2fa78.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\00142f93c18b07350be89e96372d240372437ed9.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\0022796bb2112abd2e6423ba2d57751db06049fb.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\00326efcca0852dc6e39dc6b7786267e1bc4f194.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\00352a58c8766861effed18a4b079d1683fec2ec.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\0043d044273b8eb1585d3a66061e9b4e03edc062.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\0049ba8861864506e1e8559e7815f4de8b03db

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\551c5105a50ea1d7d52d13424441d1788f67bc91.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\55248fb2e12f9712b35a26e3fdb723baea0ac775.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\552ee0cc0a8266b6c22dbc287bd080513a3374fb.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\55322bfc591eeae6d68a9826baaebf6ec74234ca.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\5534fcf5c7e7df182da3c253d2312cd5662259b8.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\5535ed240ab00c780ce2aa42f1fbe13ef24b9120.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\5538bf904bcd32abf4ca1dbe9fc7e6c7514ebe

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a283bfd56827aac7e25105d2d9488714de7769a.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a2c1193b96d3e54da56f17d1060e236df5b0e38.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a31e63c4ec26320695ef383dfd8c772d6afc92b.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a4eb1665b578ba6744d7c2a62775379f6fcfc00.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a512204e9b43d11f12458baaa3db5cb71e5ae16.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a5586078edf5a43100db04f987d635d39fee4a5.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\9a63f7c66c750e9076dfb9d04319c1811cd49e

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\efdf0a3bdc300ce80392b2b52be33728db54a51f.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\efdfec9626b7961580cf4ecea75e8d4fde27980d.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\efe13a8d42b60ef9f7387ea539a1b2eeb5f80101.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\eff26d8739498efca2d32fe2e66cdbebf0569c50.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\eff3310317521aed7abe06ef1fa9963ca9d6caf3.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\eff8bed68ef6109e8f0c51a8b1ec4b6ca5b6329e.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\comm_use_subset\comm_use_subset\f00106cad50635bb15409ac6039b93b5af0315

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\0001418189999fea7f7cbe3e82703d71c85a6fe5.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\00016663c74157a66b4d509d5c4edffd5391bbe0.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\0005d253951fedc237715a37db147032eea28912.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\000e754142ba65ef77c6fdffcbcbe824e141ea7b.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\000eec3f1e93c3792454ac59415c928ce3a6b4ad.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\001259ae6d9bfa9376894f61aa6b6c5f18be2177.json
C:\Use

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2cc5edbddf62154d1e40ef0d7530ca1dbbd2ea34.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2ccc1adec85afa08657d6691bc0b8c973043a7cc.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2cd35285dc4881d52dbd8c1887576424a0457491.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2cd6423e41f0a2038315c47d4f629eae3851031d.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2cd74646ef69d8ea8f04a046a88eb22dc7187dab.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2cdb7b3be2a0c0e1c276302897afc6d3f4a16ef4.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\2cdf51f941c0b55e2d2f7d9b682e7be065f50c48.json
C:\Use

C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5ab3cede4366752ef371ecd626d9c45ae745634a.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5ab5348a892519dc2274322f3dc9900e40f1113e.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5ab6c8f7098cb0e7491d7818b4069f7d7856d692.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5ac2452e7b3463270f4c19ca10892b1a80ab7661.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5ac495bc3644fff725ad9da58c2891f193f4569e.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5acb4a64dd7a05fef7a6a3bed1609f2301d949d0.json
C:\Users\sindr\MachineLearning\CORD-19\CORD-19-research-challenge\custom_license\custom_license\5acde628c301c2b30515a4adb699d8b633795580.json
C:\Use




In [3]:
root_path = 'C:\\Users\\sindr\\MachineLearning\\CORD-19\\CORD-19-research-challenge'
metadata_path = f'{root_path}\\metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={})
print(len(meta_df))

44220


In [4]:
import glob
import json
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

29315

In [5]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
first_row = FileReader(all_json[0])
print(first_row)

def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    # add break every length characters
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data

0015023cc06b5362d332b3baf348d11567ca2fbb: word count: 194 22 Text word count: 5168 23 24 25 author/funder. All rights reserved. No reuse allowed without permission. Abstract 27 The positive stranded RNA genomes of picornaviruses comprise a si... VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions encode the non-structural proteins 2B and 2C and 3A, 3B (1-3) (VPg), 3C pro and 4 structura...
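A quick check of how `get_breaks` behaves may help when reading the `<br>`-laden titles later in the notebook. The function below is copied from the cell above so the example runs on its own; the sample title is one that appears in the search output, and the 40-character budget matches the value used for titles. Note that only word lengths are counted (spaces are not), and that the result always starts with a space.

```python
def get_breaks(content, length):
    # Copied from the cell above so this example is self-contained.
    data = ""
    words = content.split(' ')
    total_chars = 0
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            # budget exceeded: start a new line with this word
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data

title = "Viruses are a dominant driver of protein adaptation in mammals"
wrapped = get_breaks(title, 40)
print(repr(wrapped))
# ' Viruses are a dominant driver of protein<br>adaptation in mammals'
```

The leading space explains why every title printed further down is indented by one character.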


In [19]:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
    if idx % (len(all_json) // 10) == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the summary of abstract to be used in a plot
    if len(content.abstract) == 0: 
        # no abstract provided
        dict_['abstract_summary'].append("Not provided.")
    elif len(content.abstract.split(' ')) > 100:
        # abstract provided is too long for plot, take first 100 words append with ...
        info = content.abstract.split(' ')[:100]
        summary = get_breaks(' '.join(info), 40)
        dict_['abstract_summary'].append(summary + "...")
    else:
        # abstract is short enough
        summary = get_breaks(content.abstract, 40)
        dict_['abstract_summary'].append(summary)
        
    # meta_data from the lookup at the top of the loop is reused below
    
    try:
        # if more than one author
        authors = meta_data['authors'].values[0].split(';')
        if len(authors) > 2:
            # more than 2 authors, may be problem when plotting, so take first 2 append with ...
            dict_['authors'].append(". ".join(authors[:2]) + "...")
        else:
            # authors will fit in plot
            dict_['authors'].append(". ".join(authors))
    except Exception as e:
        # authors field is missing (NaN), so .split() fails; store the raw value
        dict_['authors'].append(meta_data['authors'].values[0])
    
    # add the title information, add breaks when needed
    try:
        title = get_breaks(meta_data['title'].values[0], 40)
        dict_['title'].append(title)
    # if title was not provided
    except Exception as e:
        dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()

Processing index: 0 of 29315
Processing index: 2931 of 29315
Processing index: 5862 of 29315
Processing index: 8793 of 29315
Processing index: 11724 of 29315
Processing index: 14655 of 29315
Processing index: 17586 of 29315
Processing index: 20517 of 29315
Processing index: 23448 of 29315
Processing index: 26379 of 29315
Processing index: 29310 of 29315


| | paper_id | abstract | body_text | authors | title | journal | abstract_summary |
|---|---|---|---|---|---|---|---|
| 0 | 0015023cc06b5362d332b3baf348d11567ca2fbb | word count: 194 22 Text word count: 5168 23 24... | VP3, and VP0 (which is further processed to VP... | Ward, J. C. J.. Lasecka-Dykes, L.... | The RNA pseudoknots in foot-and-mouth disease... | | word count: 194 22 Text word count: 5168 23 2... |
| 1 | 004f0f8bb66cf446678dc13cf2701feec4f36d76 | | The 2019-nCoV epidemic has spread across China... | Hanchu Zhou. Jianan Yang... | Healthcare-resource-adjusted<br>vulnerabiliti... | | Not provided. |
| 2 | 00d16927588fb04d4be0e6b269fc02f0d3c2aa7b | Infectious bronchitis (IB) causes significant ... | Infectious bronchitis (IB), which is caused by... | Butt, S. L.. Erwood, E. C.... | Real-time, MinION-based, amplicon<br>sequenci... | | Infectious bronchitis (IB) causes<br>signific... |
| 3 | 0139ea4ca580af99b602c6435368e7fdbefacb03 | Nipah Virus (NiV) came into limelight recently... | Nipah is an infectious negative-sense single-s... | Nishi Kumari. Ayush Upadhyay... | A Combined Evidence Approach to Prioritize<br... | | Nipah Virus (NiV) came into limelight recentl... |
| 4 | 013d9d1cba8a54d5d3718c229b812d7cf91b6c89 | Background: A novel coronavirus (2019-nCoV) em... | In December 2019, a cluster of patients with p... | Shengjie Lai. Isaac Bogoch... | Assessing spread risk of Wuhan novel<br>coron... | | Background: A novel coronavirus (2019-nCoV)<b... |
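The loop above filters `meta_df` once per JSON file, so each of the 29315 papers triggers a scan over all 44220 metadata rows. One possible alternative is to build a lookup table keyed by hash once and reuse it. The sketch below uses plain dictionaries and made-up rows to stay self-contained; `build_sha_index` and the sample records are hypothetical, and splitting the `sha` field on `;` is an assumption about how several hashes might share one field, not something confirmed by this notebook.

```python
def build_sha_index(rows):
    """Map every hash found in a row's 'sha' field to that row."""
    index = {}
    for row in rows:
        sha_field = row.get("sha")
        if not isinstance(sha_field, str):
            continue  # missing value (None or NaN): nothing to index
        # Assumption: several hashes may share one field, separated by ';'.
        for sha in sha_field.split(";"):
            sha = sha.strip()
            if sha:
                index.setdefault(sha, row)  # keep the first row per hash
    return index

# Made-up metadata rows for illustration only.
rows = [
    {"sha": "aaa111", "title": "Paper A", "journal": "J1"},
    {"sha": "bbb222; ccc333", "title": "Paper B", "journal": None},
    {"sha": None, "title": "No full text", "journal": "J2"},
]

sha_index = build_sha_index(rows)
print(sha_index["ccc333"]["title"])   # dictionary lookup instead of a scan
print("missing" in sha_index)         # papers without metadata are absent
```

In the notebook, the rows could probably come from `meta_df.to_dict('records')`, and the `len(meta_data) == 0` check would become a membership test on the index.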


In [100]:
df_covid.to_csv(f'{root_path}\\processed.csv', index=False, header=True)

In [3]:
import pandas as pd
root_path = 'C:\\Users\\sindr\\MachineLearning\\CORD-19\\CORD-19-research-challenge'
processed_path = f'{root_path}\\processed.csv'
df_covid = pd.read_csv(processed_path, dtype={})

In [171]:
import nltk
from nltk.corpus import stopwords
#nltk.download('punkt')
#nltk.download('stopwords')
bodies = df_covid["body_text"]
sentences = []  # each entry is the token list for a block of 7 sentences
lines = 0       # sentences seen since the last block was stored
stopWords = set(stopwords.words('english'))
stopWords.update(['(', ')', ',', ', '])

for body in bodies:
    tokens = nltk.word_tokenize(body)
    # tokens left over from the previous paper are discarded here
    words = []
    for token in tokens:
        if token == '.':
            # a full stop ends a sentence: keep it and count the sentence
            words.append(token)
            lines += 1
            if lines > 6:
                sentences.append(words)
                words = []
                lines = 0
        elif token not in stopWords:
            words.append(token)
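The grouping logic above can be isolated into a small function, which makes it easier to test without NLTK. The following is a sketch under assumptions, not the notebook's code: `chunk_sentences`, its parameters, and the toy token list are hypothetical. Unlike the loop above, it resets the sentence counter on every call, keeps the leftover tokens at the end, and lower-cases tokens before the stop-word test.

```python
def chunk_sentences(tokens, stop_words, block_size=7):
    """Group a token stream into blocks of `block_size` sentences."""
    blocks, words, count = [], [], 0
    for token in tokens:
        if token == ".":
            words.append(token)
            count += 1
            if count == block_size:
                blocks.append(words)
                words, count = [], 0
        elif token.lower() not in stop_words:
            words.append(token)
    if words:
        blocks.append(words)  # keep the final, shorter block
    return blocks

# Toy input: three short sentences, already tokenised.
tokens = ["The", "virus", "spreads", ".",
          "It", "is", "an", "RNA", "virus", ".",
          "Cells", "respond", "."]
stop_words = {"the", "it", "is", "an"}

blocks = chunk_sentences(tokens, stop_words, block_size=2)
print(blocks)
# [['virus', 'spreads', '.', 'RNA', 'virus', '.'], ['Cells', 'respond', '.']]
```

Calling it once per paper body would keep blocks from spanning two papers, which the shared counter in the original loop allows.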
            

In [173]:
sentences[3:4]

[['All',
  'rights',
  'reserved',
  'No',
  'reuse',
  'allowed',
  'without',
  'permission',
  'Plasmid',
  'construction',
  '117',
  'The',
  'FMDV',
  'replicon',
  'plasmids',
  'pRep-ptGFP',
  'replication-defective',
  'polymerase',
  'mutant',
  '118',
  'control',
  '3D-GNN',
  'already',
  'described',
  '10',
  'To',
  'introduce',
  'mutations',
  'PK',
  'region',
  'pRep-ptGFP',
  'replicon',
  'plasmid',
  'digested',
  '121',
  'SpeI',
  'KpnI',
  'resulting',
  'fragment',
  'inserted',
  'sub-cloning',
  'vector',
  'pBluescript',
  '122',
  'create',
  'pBluescript',
  'PK',
  'PKs',
  '3',
  '4',
  'removed',
  'digestion',
  'HindIII',
  'AatII',
  '123',
  'insertion',
  'synthetic',
  'DNA',
  'sequence',
  'PK',
  '3',
  '4',
  'deleted',
  'PKs',
  '2',
  '3',
  '4',
  '124',
  'deleted',
  'PCR',
  'amplification',
  'using',
  'ΔPK',
  '234',
  'Forward',
  'primer',
  'FMDV',
  '1331-1311',
  'reverse',
  '125',
  'primer',
  'resultant',
  'product',
  'd

In [195]:
import pickle
with open("paragraphs.txt", "wb") as fp:
    pickle.dump(sentences, fp)
#with open("tokens.txt", "wb") as fp:
#    pickle.dump(tokens, fp)

In [180]:
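#https://ai.intelligentonlinetools.com/ml/text-clustering-word-embedding-machine-learning/
from gensim.models import Word2Vec

from nltk.cluster import KMeansClusterer
import nltk
import numpy as np
from sklearn import cluster
from sklearn import metrics

# Train word embeddings on the token blocks; min_count=1 keeps every word.
model = Word2Vec(sentences, min_count=1)

def sent_vectorizer(sent, model):
    """Return the mean word vector of the words in `sent` known to the model."""
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            # model.wv[w] replaces the deprecated model[w] lookup
            if numw == 0:
                sent_vec = model.wv[w]
            else:
                sent_vec = np.add(sent_vec, model.wv[w])
            numw += 1
        except KeyError:
            # word not in the vocabulary: skip it
            pass

    return np.asarray(sent_vec) / numw

# One averaged vector per block of sentences
X = []
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))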
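The matrix `X` holds one averaged vector per block, but this part of the notebook does not show it being used. One way such vectors could support search is to compare a query vector with each block by cosine similarity. The sketch below uses a tiny hand-written embedding table in place of the trained Word2Vec model; `toy_vectors`, `mean_vector`, and `cosine` are hypothetical names, and the numbers are invented for illustration.

```python
import math

# Invented 3-dimensional "embeddings" standing in for a trained model.
toy_vectors = {
    "immune":   [0.9, 0.1, 0.0],
    "response": [0.8, 0.2, 0.1],
    "travel":   [0.0, 0.9, 0.3],
    "airport":  [0.1, 0.8, 0.4],
}

def mean_vector(words, vectors):
    """Average the vectors of the words that have an embedding."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

blocks = [["immune", "response", "unknown"], ["travel", "airport"]]
query = mean_vector(["immune"], toy_vectors)
scores = [cosine(query, mean_vector(block, toy_vectors)) for block in blocks]
best = max(range(len(blocks)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

Here the first block scores highest because its words point in nearly the same direction as the query word; with real embeddings the same comparison would run over the rows of `X`.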




In [13]:
print(model.wv.similarity('infectious', 'virus'))
print(model.wv.most_similar(positive=['Immunity', 'immune', 'response'], negative=[], topn=10))

0.43073845
[('immunity', 0.8353835344314575), ('immunoresponses', 0.7473381161689758), ('responses', 0.7448035478591919), ('humoral', 0.7415076494216919), ('defense', 0.6948438882827759), ('defenses', 0.6890788674354553), ('immunities', 0.6828920245170593), ('CMI', 0.6561574935913086), ('innate', 0.6515140533447266), ('defence', 0.6503939628601074)]


In [70]:
#filename = 'embedding_model.sav'
#pickle.dump(model, open(filename, 'wb'))

In [5]:
import pandas as pd
import pickle
root_path = 'C:\\Users\\sindr\\MachineLearning\\CORD-19\\CORD-19-research-challenge'
processed_path = f'{root_path}\\processed.csv'
df_covid = pd.read_csv(processed_path, dtype={})
# "paragraphs.txt" is the file written by the pickle.dump cell above
with open("paragraphs.txt", "rb") as fp:
    sentences = pickle.load(fp)
with open('embedding_model.sav', 'rb') as fp:
    model = pickle.load(fp)

In [11]:
from nltk.cluster import KMeansClusterer
import nltk
import numpy as np 
from sklearn import cluster
from sklearn import metrics
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
stopWords.add('(')
stopWords.add(')')
stopWords.add(',')
stopWords.add(', ')
search = "immune immunity"
# Tokenise the query and drop stop words
prepare_search = nltk.word_tokenize(search)
search_tokens = [w for w in prepare_search if w not in stopWords]
# Expand the query with the nearest neighbours of its words in embedding space
similar_tokens = model.wv.most_similar(positive=search_tokens, negative=[], topn=len(search_tokens))
for s in similar_tokens:
    search_tokens.append(s[0])
search_tokens



['immune', 'immunity', 'humoral', 'immunoresponses']
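With the expanded token list in hand, papers could be ranked by how often the tokens occur rather than listed in dataset order. The following sketch counts case-insensitive whole-word matches over a few invented documents; `score_document`, `rank_documents`, and the sample texts are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

def score_document(text, tokens):
    """Count whole-word, case-insensitive occurrences of the search tokens."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts[t.lower()] for t in tokens)

def rank_documents(docs, tokens):
    """Return (title, score) pairs with a positive score, best first."""
    scored = [(title, score_document(text, tokens)) for title, text in docs.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)

# Invented documents for illustration only.
docs = {
    "Paper A": "Humoral immunity and the immune response after vaccination.",
    "Paper B": "Air travel patterns during the outbreak.",
    "Paper C": "Immune cells were counted.",
}
search_tokens = ["immune", "immunity", "humoral", "immunoresponses"]

ranking = rank_documents(docs, search_tokens)
print(ranking)
# [('Paper A', 3), ('Paper C', 1)]
```

Raw counts favour long papers, so dividing by document length or using TF-IDF weights would be a natural next step.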

In [12]:
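# Print the title of every paper whose body text contains any search token
for i in range(len(df_covid)):
    for x in search_tokens:
        # `in` also catches a match at position 0, which `.find(x) > 0` missed
        if x in df_covid['body_text'][i]:
            print(df_covid['title'][i])
            print("===========================================================================================================================")
            break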

 A Combined Evidence Approach to Prioritize<br>Nipah Virus Inhibitors
 TWIRLS, an automated topic-wise inference<br>method based on massive literature, suggests a<br>possible mechanism via ACE2 for the pathological<br>changes in the human host after coronavirus infection
 Viruses are a dominant driver of protein<br>adaptation in mammals
 The impact of regular school closure on<br>seasonal influenza epidemics: a data-driven spatial<br>transmission model for Belgium
 Development of CRISPR as a prophylactic<br>strategy to combat novel coronavirus and influenza
 A spatial model of CoVID-19 transmission in<br>England and Wales: early spread and peak timing
 Beyond R0: the importance of contact tracing<br>when predicting epidemics
 International travelers and genomics uncover<br>a ‘hidden’ Zika outbreak
 Quantifying the success of measles<br>vaccination campaigns in the Rohingya refugee camps
 Mono-ADP-ribosylation by ARTD10 restricts<br>Chikungunya virus replication by interfering with the<

 Houttuynia cordata Thunb. and its bioactive<br>compound 2-undecanone significantly suppress<br>benzo(a)pyrene-induced lung tumorigenesis by activating the<br>Nrf2-HO-1/NQO-1 signaling pathway
 Dynamics and Differences in Systemic and Local<br>Immune Responses After Vaccination With Inactivated<br>and Live Commercial Vaccines and Subsequent<br>Subclinical Infection With PRRS Virus
 Display of Porcine Epidemic Diarrhea Virus<br>Spike Protein on Baculovirus to Improve<br>Immunogenicity and Protective Efficacy
 Lack of cross-protection against Mycoplasma<br>haemofelis infection and signs of enhancement in<br>“Candidatus Mycoplasma turicensis”-recovered cats
 Cellular Proteins Associated with the<br>Interior and Exterior of Vesicular Stomatitis Virus<br>Virions
 Pathogenesis, imaging and clinical<br>characteristics of CF and non-CF bronchiectasis
 Contact among healthcare workers in the<br>hospital setting: developing the evidence base for<br>innovative approaches to infection control
 Exp

 Functional and Homeostatic Impact of<br>Age-Related Changes in Lymph Node Stroma
 Neisseria gonorrhoeae uses cellular proteins<br>CXCL10 and IL8 to enhance HIV‐1 transmission across<br>cervical mucosa
 Analysis of IFITM-IFITM Interactions by a Flow<br>Cytometry-Based FRET Assay
 Deep Sequencing in Infectious Diseases:<br>Immune and Pathogen Repertoires for the Improvement<br>of Patient Outcomes
 Distinct Patterns of IFITM-Mediated<br>Restriction of Filoviruses, SARS Coronavirus, and<br>Influenza A Virus
 Spatial modelling of contribution of<br>individual level risk factors for mortality from Middle<br>East respiratory syndrome coronavirus in the<br>Arabian Peninsula
 Essential role of HCMV deubiquitinase in<br>promoting oncogenesis by targeting anti-viral innate<br>immune signaling pathways
 Let the sun shine in: effects of ultraviolet<br>radiation on invasive pneumococcal disease risk in<br>Philadelphia, Pennsylvania
 Haemophilus is overrepresented in the<br>nasopharynx of infants ho

 Etiological analysis and predictive<br>diagnostic model building of community-acquired<br>pneumonia in adult outpatients in Beijing, China
 Necrotizing pneumonia: an emerging problem in<br>children?
 Beyond cells – The virome in the human holobiont
 Comparative Pathogenesis of Three Human and<br>Zoonotic SARS-CoV Strains in Cynomolgus Macaques
 Development and comparison of enzyme-linked<br>immunosorbent assays based on recombinant trimeric<br>full-length and truncated spike proteins for detecting<br>antibodies against porcine epidemic diarrhea virus
 Cross-sectional survey of selected enteric<br>viruses in Polish turkey flocks between 2008 and 2011
 A placebo-controlled trial of Korean red<br>ginseng extract for preventing Influenza-like<br>illness in healthy adults
 Southern Hemisphere Influenza and Vaccine<br>Effectiveness Research and Surveillance
 Novel PDE4 Inhibitors Derived from Chinese<br>Medicine Forsythia
 Innovation in observation: a vision for early<br>outbreak detection


 Illuminating pathogen–host intimacy through<br>optogenetics
 Identification of a conserved linear B-cell<br>epitope in the M protein of porcine epidemic diarrhea<br>virus
 Protection against Virulent Infectious<br>Bronchitis Virus Challenge Conferred by a Recombinant<br>Baculovirus Co-Expressing S1 and N Proteins
 Tick-borne encephalitis virus induces<br>chemokine RANTES expression via activation of IRF-3<br>pathway
 Specialty Grand Challenge In Pediatric<br>Infectious Diseases
 Immunity-Related Protein Expression and<br>Pathological Lung Damage in Mice Poststimulation with<br>Ambient Particulate Matter from Live Bird Markets
 Role of Incretin Axis in Inflammatory Bowel<br>Disease
 Classical Swine Fever Virus Infection Induces<br>Endoplasmic Reticulum Stress-Mediated Autophagy to<br>Sustain Viral Replication in vivo and in vitro
 Respiratory virus of severe pneumonia in South<br>Korea: Prevalence and clinical implications
 International Air Travel to Ohio, USA, and the<br>Impact on Ma

 Pulmonary infections in the returned<br>traveller
 A Porcine Epidemic Diarrhea Virus Outbreak in<br>One Geographic Region of the United States:<br>Descriptive Epidemiology and Investigation of the<br>Possibility of Airborne Virus Spread
 Therapeutic Potential of Annexin A1 in<br>Ischemia Reperfusion Injury
 The ecology and adaptive evolution of<br>influenza A interspecies transmission
 'ONE HEALTH' and parasitology
 The Antiviral Restriction Factors IFITM1, 2<br>and 3 Do Not Inhibit Infection of Human<br>Papillomavirus, Cytomegalovirus and Adenovirus
 A dose and time response Markov model for the<br>in-host dynamics of infection with intracellular<br>bacteria following inhalation: with application to<br>Francisella tularensis
 Proteome analysis of vaccinia virus<br>IHD-W-infected HEK 293 cells with 2-dimensional gel<br>electrophoresis and MALDI-PSD-TOF MS of on solid phase support<br>N-terminally sulfonated peptides
 Feasibility of a randomized controlled trial<br>to assess treatment 

 Avian Group D Rotaviruses: Structure,<br>Epidemiology, Diagnosis, and Perspectives on Future<br>Research Challenges
 Influenza Transmission in the Mother-Infant<br>Dyad Leads to Severe Disease, Mammary Gland<br>Infection, and Pathogenesis by Regulating Host Responses
 Parasites or Cohabitants: Cruel Omnipresent<br>Usurpers or Creative “Éminences Grises”?
 Global research trends in<br>microbiome-gut-brain axis during 2009–2018: a bibliometric and<br>visualized study
 Potential Vaccines and Post-Exposure<br>Treatments for Filovirus Infections
 Novel avian single-chain fragment variable<br>(scFv) targets dietary gluten and related natural<br>grain prolamins, toxic entities of celiac disease
 Evidence of Aujeszky’s disease in wild boar in<br>Serbia
 Individual or Common Good? Voluntary Data<br>Sharing to Inform Disease Surveillance Systems in Food<br>Animals
 Using LongSAGE to Detect Biomarkers of<br>Cervical Cancer Potentially Amenable to Optical<br>Contrast Agent Labelling
 Combined app

 Protective effect of intranasal peste des<br>petits ruminants virus and bacterin vaccinations:<br>Clinical, hematological, serological, and serum<br>oxidative stress changes in challenged goats
 Expression profiles of immune mediators in<br>feline Coronavirus-infected cells and clinical<br>samples of feline Coronavirus-positive cats
 Specific mutations in H5N1 mainly impact the<br>magnitude and velocity of the host response in mice
 Coronavirus Gene 7 Counteracts Host Defenses<br>and Modulates Virus Virulence
 Glucose-6-Phosphate Dehydrogenase<br>(G6PD)-Deficient Epithelial Cells Are Less Tolerant to<br>Infection by Staphylococcus aureus
 Clinical Features and Courses of Adenovirus<br>Pneumonia in Healthy Young Adults during an Outbreak<br>among Korean Military Personnel
 Detection of dicistroviruses RNA in blood of<br>febrile Tanzanian children
 CD8 T Cell–Independent Antitumor Response and<br>Its Potential for Treatment of Malignant Gliomas
 Efficacy and synergy of live-attenuated a

 Chapter 17 Parasitic Diseases
 TGEV corona virus ORF4 encodes a membrane<br>protein that is incorporated into virions
 Chapter 11 Pathology The Clinical Description<br>of Human Disease
 Speculation on whether a vaccine against<br>cryptosporidiosis is a reality or fantasy
 Reproductive emergencies in camelids
 Plasma therapy against infectious pathogens,<br>as of yesterday, today and tomorrow
 Epidemiological survey in a day care center<br>following toddler sudden death due to human<br>metapneumovirus infection
 CHAPTER 107 Genomic Approaches to the Host<br>Response to Pathogens
 Enhancement of safety and immunogenicity of<br>the Chinese Hu191 measles virus vaccine by<br>alteration of the S-adenosylmethionine (SAM) binding<br>site in the large polymerase protein
 Bovine coronaviruses from the respiratory<br>tract: Antigenic and genetic diversity
 Systemic acute phase proteins response in<br>calves experimentally infected with Eimeria zuernii
 Polymorphisms of interferon-inducible genes

 Dissecting host cell death programs in the<br>pathogenesis of influenza
 Plagues and Diseases in History
 Systems biology approach: Panacea for<br>unravelling host-virus interactions and dynamics of<br>vaccine induced immune response
 Recent achievements in studies on diseases of<br>common carp (Cyprinus carpio L.)
 Outbreak detection model based on danger<br>theory
 Modern Plasma Fractionation
 Infectious encephalitis: Management without<br>etiological diagnosis 48hours after onset
 Chapter 6 Integumentary System
 A killed Leishmania vaccine with sand fly<br>saliva extract and saponin adjuvant displays<br>immunogenicity in dogs
 The influence of social behaviour on<br>competition between virulent pathogen strains
 Characterization of anti-porcine epidemic<br>diarrhea virus neutralizing activity in mammary<br>secretions
 Toroviruses of Animals And Humans: A Review
 Deciphering the Nucleotide and RNA Binding<br>Selectivity of the Mayaro Virus Macro Domain
 Redecoration of apartments pr

 Development of vaccines and passive<br>immunotherapy against SARS corona virus using SCID-PBL/hu<br>mouse models
 Chapter 10 Digestive disorders
 Characteristics of Group A Streptococcus<br>Strains Circulating during Scarlet Fever Epidemic,<br>Beijing, China, 2011
 Intestinal changes associated with rotavirus<br>and enterotoxigenic Escherichia coli infection<br>in calves
 Edible bird's nest extract inhibits influenza<br>virus infection
 Myelination by oligodendrocytes isolated<br>from 4–6-week-old rat central nervous system and<br>transplanted into newborn shiverer brain
 Seasonal variation of respiratory pathogen<br>colonization in asymptomatic health care professionals: A<br>single-center, cross-sectional, 2-season observational<br>study
 Enhancement of the immunogenicity of an<br>infectious bronchitis virus DNA vaccine by a bicistronic<br>plasmid encoding nucleocapsid protein and<br>interleukin-2
 The BAFF/APRIL system: Emerging functions<br>beyond B cell biology and autoimmunity
 

 Screening and identification of T helper 1 and<br>linear immunodominant antibody-binding epitopes in<br>spike 1 domain and membrane protein of feline<br>infectious peritonitis virus
 Chapter 3 Autopsy Biosafety
 Chapter 12 Viral Disease
 From SARS in 2003 to H1N1 in 2009: lessons<br>learned from Taiwan in preparation for the next<br>pandemic
 Chapter 13 Eosinophils in Human Disease
 Health conditions for travellers to Saudi<br>Arabia for the Umra and pilgrimage to Mecca (Hajj) –<br>2014
 Meeting report: 4th ISIRV antiviral group<br>conference: Novel antiviral therapies for influenza and<br>other respiratory viruses
 Community-Acquired Pneumonia: An Unfinished<br>Battle
 7 Antisense Oligonucleotides and RNA<br>Interference
 Comparison of mono- and co-infection by swine<br>influenza A viruses and porcine respiratory coronavirus<br>in porcine precision-cut lung slices
 Assessment of returning travellers with fever
 The clinical impact of coronavirus infection<br>in patients with hematolo

 Antiviral activities of niclosamide and<br>nitazoxanide against chikungunya virus entry and<br>transmission
 Monitoring tourism flows and destination<br>management: Empirical evidence for Portugal
 Temperature, nitrogen dioxide, circulating<br>respiratory viruses and acute upper respiratory<br>infections among children in Taipei, Taiwan: A<br>population-based study
 Methods for studying stem cells: Adult stem<br>cells for lung repair
 Subject and Author Indexes for volume 62
 EMMPRIN-Targeted Magnetic Nanoparticles for<br>In Vivo Visualization and Regression of Acute<br>Myocardial Infarction
 Subject index Veterinary Microbioloy,<br>volumes 26–50, 1991–1996
 Identification of<br>6′-β-fluoro-homoaristeromycin as a potent inhibitor of chikungunya virus<br>replication
 Passive Immunity Stimulated by Vaccination of<br>Dry Cows with a Salmonella Bacterial Extract
 The diversity, evolution and origins of<br>vertebrate RNA viruses
 Effect of TLR agonist on infections bronchitis<br>virus repl

 Quantitative Temporal in Vivo Proteomics<br>Deciphers the Transition of Virus-Driven Myeloid Cells<br>into M2 Macrophages
 CHAPTER 10 Focal Bacterial Infections
 Ring Vaccination and Smallpox Control
 Super-spreaders and the rate of transmission<br>of the SARS virus
 19 Respiratory Viruses and Atypical Bacteria
 Chapter 47 Felidae
 CHAPTER 22 Seizures and Sleep Disorders
 Cost-Benefit of Stockpiling Drugs for<br>Influenza Pandemic
 B cell homeostasis and follicle confines are<br>governed by fibroblastic reticular cells
 SARS: Epidemiology, Clinical Presentation,<br>Management, and Infection Control Measures
 Multiple sclerosis as a viral disease
 ISNI 2010 Abstracts Tuesday October 26th, 2010<br>10th Course of the European School of<br>Neuroimmunology
 Seasonality and selective trends in viral<br>acute respiratory tract infections
 Interferon induction in porcine leukocytes<br>with transmissible gastroenteritis virus
 Protective effect of Xuebijing injection on<br>myocardial injury in

 Vectored vaccines to protect against PRRSV
 5 Overcoming regulatory gaps in biological<br>materials oversight by enhancing IBC protocol review
 Characterization and inhibition of norovirus<br>proteases of genogroups I and II using a fluorescence<br>resonance energy transfer assay
 Immunogenicity and protective efficacy in<br>monkeys of purified inactivated Vero-cell SARS<br>vaccine
 Severe Acute Respiratory Syndrome: Temporal<br>Stability and Geographic Variation in Death Rates and<br>Doubling Times
 Modified vaccinia virus Ankara as a vaccine<br>against feline coronavirus: immunogenicity and<br>efficacy
 Surface-displayed porcine epidemic diarrhea<br>viral (PEDV) antigens on lactic acid bacteria
 Report from the World Health Organization's<br>Product Development for Vaccines Advisory Committee<br>(PDVAC) meeting, Geneva, 7–9th Sep 2015
 Canine Distemper Spillover in Domestic Dogs<br>from Urban Wildlife
 Enumeration of isotype-specific<br>antibody-secreting cells derived from gnotobio

 Chapter 21 New Emerging Viruses
 Low-Incidence, High-Consequence Pathogens
 Sequence analysis of the spike protein gene of<br>murine coronavirus variants: Study of genetic sites<br>affecting neuropathogenicity
 Severe Community-Acquired Pneumonia
 CHAPTER 34 SEROLOGIC TESTS FOR DETECTION OF<br>ANTIBODY TO RODENT VIRUSES
 Research Note: Lyophilization of hyperimmune<br>egg yolk: effect on antibody titer and protection<br>of broilers against Campylobacter colonization
 Hajj: infectious disease surveillance and<br>control
 Influenza A(H1N1)pdm09 virus infection in<br>Norwegian swine herds 2009/10: The risk of human to swine<br>transmission
 A review on the antagonist Ebola: A<br>prophylactic approach
 Intranasal Protollin-formulated<br>recombinant SARS S-protein elicits respiratory and serum<br>neutralizing antibodies and protection in mice
 Quoi de neuf en dermatologie clinique ?
 Plagues and adaptation: Lessons from the<br>Felidae models for SARS and AIDS
 An investigation of the combi

 Risk of Bacterial Coinfections in Febrile<br>Infants 60 Days Old and Younger with Documented Viral<br>Infections
 Advanced nanotechnologies in avian<br>influenza: Current status and future trends – A review
 Novel system for detecting SARS coronavirus<br>nucleocapsid protein using an ssDNA aptamer
 Cyclophilin inhibitors as antiviral agents
 Chapter 4 Molecular Modeling of Major<br>Structural Protein Genes of Avian Coronavirus:<br>Infectious Bronchitis Virus Mass H120 and Italy02 Strains
 Gut microbiota: Implications in Parkinson's<br>disease
 CHAPTER 8 Emerging and Reemerging Viral<br>Diseases
 Respiratory viruses transmission from<br>children to adults within a household
 Chapter 41 Virus Infection of Epithelial Cells
 Antiviral Agents☆
 Oral immunization with LacVax® OmpA induces<br>protective immune response against Shigella flexneri 2a<br>ATCC 12022 in a murine model
 Prevalence of Isospora suis and Eimeria spp. in<br>suckling piglets and sows in Poland
 Chapter 6 Clinical pathol

 Mini-transposons in microbial ecology and<br>environmental biotechnology
 Intake and growth in transported Holstein<br>calves classified as diarrheic or healthy within the<br>first 21 days after arrival in a retrospective<br>observational study
 Antiviral combinations for severe influenza
 DC-SIGN mediates avian H5N1 influenza virus<br>infection in cis and in trans
 Animal virus schemes for translation<br>dominance
 Using Complementary and Alternative<br>Medicines to Target the Host Response during Severe<br>Influenza
 The impact of synthetic biology on drug<br>discovery
 Adverse Reactions to Vaccination From<br>Anaphylaxis to Autoimmunity
 Applications of the Phytomedicine Echinacea<br>purpurea (Purple Coneflower) in Infectious Diseases
 Expressional induction of Paralichthys<br>olivaceus cathepsin B gene in response to virus, poly I:C<br>and lipopolysaccharide
 T-705 (favipiravir) and related compounds:<br>Novel broad-spectrum inhibitors of RNA viral<br>infections
 Membrane Interact

 Nrf2 expression modifies influenza A entry and<br>replication in nasal epithelial cells
 A comparative evaluation of modelling<br>strategies for the effect of treatment and host<br>interactions on the spread of drug resistance
 Enhancement of immunostimulatory properties<br>of exosomal vaccines by incorporation of<br>fusion-competent G protein of vesicular stomatitis virus
 Infección nosocomial en el paciente receptor<br>de un trasplante de órgano sólido o de precursores<br>hematopoyéticos
 Chapter 6 Conservation Genetics of the<br>Cheetah: Genetic History and Implications for<br>Conservation
 A review of experimental infections with<br>bluetongue virus in the mammalian host
 Aminoacyl-tRNA synthetases, therapeutic<br>targets for infectious diseases
 Talking about colds and flu: The lay diagnosis<br>of two common illnesses among older British<br>people
 Single B cell antibody technologies
 SARS-coronavirus protein 6 conformations<br>required to impede protein import into the nucleus
 

 Induction of neutralising antibodies and<br>cellular immune responses against SARS coronavirus by<br>recombinant measles viruses
 Effects of level of social contact on dairy calf<br>behavior and health
 An antigen to remember: regulation of B cell<br>memory in health and disease
 Specific elevation of DcR3 in sera of sepsis<br>patients and its potential role as a clinically<br>important biomarker of sepsis
 Examples of expression systems based on animal<br>RNA viruses: Alphaviruses and influenza virus
 Postexposure protection of non-human<br>primates against a lethal Ebola virus challenge with RNA<br>interference: a proof-of-concept study
 The Role of Infection in Interstitial Lung<br>Diseases A Review
 Critical care and the global burden of critical<br>illness in adults
 Chapter 10 The Digestive System
 The pathogenesis of nephritis in chickens<br>induced by infectious bronchitis virus
 Induction of type I interferons by a novel<br>porcine reproductive and respiratory syndrome virus<

 IL-21 optimizes T cell and humoral responses in<br>the central nervous system during viral<br>encephalitis
 Viral Pneumonias Other Than Cytomegalovirus<br>in Transplant Recipients
 Should dermatologists be immunized against<br>hepatitis B?
 CHAPTER 9 Viral Diseases * * The authors are<br>grateful to Drs. Barbara Deeb, Douglas Gregg, John<br>Kreider, Charles Leathers, David McLean, Peter<br>Medveczky, David Small, and Margaret Thouless for review<br>of the chapter and to Alice Ruff for excellent<br>editorial assistance.
 Medicinal plants of the genus<br>Betula—Traditional uses and a phytochemical–pharmacological<br>review
 Viral infections of the respiratory tract in<br>patients with cystic fibrosis
 A double antibody sandwich enzyme-linked<br>immunosorbent assay for detection of soft-shelled turtle<br>iridovirus antigens
 Chapter 8 Reproduction and Breeding of<br>Nonhuman Primates
 Exploration of the mechanisms of Ge Gen<br>Decoction against influenza A virus infection
 Chapter 1 Intr

 Cost-effectiveness analysis of oral versus<br>intravenous drip infusion of levofloxacin in the treatment<br>of acute lower respiratory tract infection in<br>Chinese elderly patients
 IFN-γ– and IL-10–expressing virus<br>epitope-specific Foxp3(+) T reg cells in the central nervous<br>system during encephalomyelitis
 Mouse LSECtin as a model for a human Ebola virus<br>receptor
 The Novel Coronavirus: A Bird's Eye View
 Differential tumor necrosis factor alpha<br>expression by astrocytes from experimental allergic<br>encephalomyelitis-susceptible and -resistant rat strains
 The Epidemiology of Hand, Foot and Mouth<br>Disease in Asia: A Systematic Review and Analysis
 Influence of Breed Size, Age, Fecal Quality,<br>and Enteropathogen Shedding on Fecal<br>Calprotectin and Immunoglobulin A Concentrations in<br>Puppies During the Weaning Period
 ViPR: an open bioinformatics database and<br>analysis resource for virology research
 Viral Etiology of Acute Respiratory<br>Infections in Pediatric

 Occurrence of infectious bronchitis in layer<br>birds in Plateau state, north central Nigeria
 Heat shock protein 90β in the Vero cell membrane<br>binds Japanese encephalitis virus
 Epidemiological characteristics of<br>pulmonary tuberculosis in Shandong, China, 2005–2017: A<br>retrospective study
 Systems vaccinology and big data in the vaccine<br>development chain
 Establishment of minimal positive-control<br>conditions to ensure brain safety during rapid<br>development of emergency vaccines
 The rubella virus E2 and E1 spike glycoproteins<br>are targeted to the Golgi complex
 The cell biology of receptor-mediated virus<br>entry
 Emergency treatment and nursing of children<br>with severe pneumonia complicated by heart failure<br>and respiratory failure: 10 case reports
 Prevalence and outcomes of Guillain-Barré<br>syndrome among pediatrics in Saudi Arabia: a 10-year<br>retrospective study
 Emerging roles of interferon-stimulated<br>genes in the innate immune response to hepatitis C<

In [15]:
#https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py
import smart_open
import gensim
def read_corpus(texts, tokens_only=False):
    """Yield one tokenized document per entry of `texts` (an iterable of strings)."""
    for i, line in enumerate(texts):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            # For training data, tag each document with its position in the corpus
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(df_covid['body_text']))

In [16]:
train_corpus[3]

TaggedDocument(words=['nipah', 'is', 'an', 'infectious', 'negative', 'sense', 'single', 'stranded', 'rna', 'virus', 'which', 'belongs', 'to', 'the', 'genus', 'henipavirus', 'and', 'family', 'paramyxoviridae', 'it', 'is', 'pleomorphic', 'enveloped', 'virus', 'with', 'particle', 'size', 'ranging', 'from', 'to', 'nm', 'while', 'fruit', 'bats', 'are', 'thought', 'to', 'be', 'the', 'natural', 'reservoirs', 'of', 'the', 'virus', 'they', 'are', 'also', 'able', 'to', 'spread', 'to', 'humans', 'and', 'some', 'other', 'species', 'there', 'are', 'two', 'major', 'genetic', 'lineages', 'of', 'the', 'virus', 'which', 'are', 'known', 'to', 'infect', 'humans', 'niv', 'malaysia', 'niv', 'and', 'niv', 'bangladesh', 'niv', 'the', 'first', 'outbreak', 'of', 'nipah', 'virus', 'infection', 'erupted', 'between', 'september', 'and', 'april', 'when', 'cases', 'of', 'febrile', 'encephalitis', 'in', 'the', 'suburb', 'of', 'ipoh', 'perak', 'southern', 'peninsular', 'malaysia', 'were', 'reported', 'to', 'the', 'ma

In [17]:
doc_model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=1)
doc_model.build_vocab(train_corpus)
doc_model.train(train_corpus, total_examples=doc_model.corpus_count, epochs=doc_model.epochs)

In [None]:
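ranks = []
second_ranks = []
text = "immune immunity virus"
# infer_vector expects a list of tokens, not a list holding the raw query string
query_tokens = gensim.utils.simple_preprocess(text)
for doc_id in range(len(train_corpus)):
    inferred_vector = doc_model.infer_vector(query_tokens)
    sims = doc_model.docvecs.most_similar([inferred_vector], topn=len(doc_model.docvecs))
    # Position of this document in the similarity ranking for the query
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])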

In [None]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(nltk.word_tokenize(text))))
sims = doc_model.docvecs.most_similar([inferred_vector], topn=len(doc_model.docvecs))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % doc_model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
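
As far as we understand it, `most_similar` orders the stored document vectors by their cosine similarity to the query vector. The sketch below reproduces that idea in plain Python on made-up three-dimensional vectors. `cosine_similarity`, `rank_documents` and the toy vectors are illustrative names and values we introduce here; they are not part of gensim or of this notebook's data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors standing in for inferred Doc2Vec vectors
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

ranking = rank_documents(toy_query_vec, toy_doc_vecs)
for doc_id, similarity in ranking:
    print(doc_id, round(similarity, 4))
```

The first entry of such a ranking plays the role of the 'MOST' similar document printed above, the last entry the 'LEAST' similar one.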

In [456]:
from keras.utils import to_categorical
import pandas as pd
import pickle
import nltk
from nltk.corpus import stopwords
root_path = 'C:\\Users\\sindr\\MachineLearning\\CORD-19\\CORD-19-research-challenge'
processed_path = f'{root_path}\\processed.csv'
df_covid = pd.read_csv(processed_path, dtype={})
#tokenize, embed, vectorize, or something
data = df_covid.drop(['paper_id', 'body_text', 'authors', 'journal','abstract_summary'], axis=1)
data = data.dropna().reset_index(drop=True)[:10]
print(len(data))

10


In [457]:
stopWords = set(stopwords.words('english'))
# Punctuation tokens to drop alongside the English stop words
stopWords.update(['(', ')', ',', ', '])

# One row per remaining token, paired with the title of its paper.
# Rows are collected in a list first; DataFrame.append is slow and was removed in pandas 2.0.
rows = []
for x in range(len(data)):
    tokens = nltk.word_tokenize(data['abstract'][x])
    for token in tokens:
        if token not in stopWords:
            rows.append({'abstract': token, 'title': data['title'][x]})
proc_data = pd.DataFrame(rows, columns=['abstract', 'title'])

In [445]:
stopWords = set(stopwords.words('english'))
# Punctuation tokens to drop alongside the English stop words
stopWords.update(['(', ')', ',', ', '])

# One row per sentence: the list of remaining tokens, paired with the paper title.
# Rows are collected in a list first; DataFrame.append is slow and was removed in pandas 2.0.
rows = []
for x in range(len(data)):
    tokens = nltk.word_tokenize(data['abstract'][x])
    words = []
    for token in tokens:
        if token not in stopWords:
            if token == '.':
                # A full stop closes the current sentence
                rows.append({'abstract': words, 'title': data['title'][x]})
                words = []
            else:
                words.append(token)
proc_data = pd.DataFrame(rows, columns=['abstract', 'title'])
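
The sentence-grouping idea in the cell above can be tried without NLTK or pandas. The following is a minimal sketch under our own assumptions: `TOY_STOP_WORDS` is a tiny hypothetical stop-word set (not the NLTK list), and `split_into_filtered_sentences` is an illustrative helper name. Unlike the cell above, it lowercases tokens before the stop-word check and keeps a trailing sentence that lacks a final full stop.

```python
# Hypothetical stop-word set, far smaller than NLTK's English list
TOY_STOP_WORDS = {"the", "is", "of", "a", "(", ")", ","}

def split_into_filtered_sentences(tokens, stop_words):
    """Group tokens into sentences at '.', dropping stop words."""
    sentences, current = [], []
    for token in tokens:
        if token == ".":
            if current:
                sentences.append(current)
            current = []
        elif token.lower() not in stop_words:
            current.append(token)
    if current:
        # Keep words that follow the last full stop
        sentences.append(current)
    return sentences

toy_tokens = ["The", "virus", "is", "small", ".", "It", "spreads", "fast"]
toy_sentences = split_into_filtered_sentences(toy_tokens, TOY_STOP_WORDS)
print(toy_sentences)
```

Lowercasing matters because stop-word lists are typically lowercase, so a sentence-initial "The" would otherwise survive the filter.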

In [443]:
for x in range(len(proc_data)):
    print(proc_data['title'][x])
#Perceptions of the Adult US Population<br>regarding the Novel Coronavirus Outbreak

 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome replication but<br>essential for the production of infectious virus.
 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome replication but<br>essential for the production of infectious virus.
 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome replication but<br>essential for the production of infectious virus.
 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome replication but<br>essential for the production of infectious virus.
 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome replication but<br>essential for the production of infectious virus.
 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome replication but<br>essential for the production of infectious virus.
 The RNA pseudoknots in foot-and-mouth disease<br>virus are dispensable for genome repli

In [458]:
proc_data = proc_data.reset_index(drop=True)
proc_data.tail()

      abstract    title
2729  reuse       Live-cell single RNA imaging reveals bursts o...
2730  allowed     Live-cell single RNA imaging reveals bursts o...
2731  without     Live-cell single RNA imaging reveals bursts o...
2732  permission  Live-cell single RNA imaging reveals bursts o...
2733  .           Live-cell single RNA imaging reveals bursts o...


In [459]:
from sklearn import preprocessing
from sklearn.model_selection import StratifiedShuffleSplit

# Use one encoder per column: refitting a single encoder on the tokens
# would overwrite the title mapping needed to decode predictions later.
label_encoder = preprocessing.LabelEncoder()
labels = label_encoder.fit_transform(proc_data['title'])
token_encoder = preprocessing.LabelEncoder()
abstracts = token_encoder.fit_transform(proc_data['abstract'])


sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Each iteration overwrites the previous split, so only the last of the five is kept
for train_index, test_index in sss.split(abstracts, labels):
    print("TRAIN:", train_index, "TEST:", test_index)
    train_x, test_x = abstracts[train_index], abstracts[test_index]
    train_y, test_y = labels[train_index], labels[test_index]

#train_x = abstracts
#train_y = labels
#test_x = abstracts[15000:19075]
#test_y = labels[15000:19075]

TRAIN: [ 699  429 2633 ... 1194 1278 2323] TEST: [1277 2373  312 1564 2449 2732 2690  628 1317 1226 1932 2018  237 1777
 2712 2196  357 1995  833   70  899   75  870 1970 1505  336 1622  795
  997 1406  972 2102 2395  475 1114 1526 2658 1010 1940 1820 1073  735
 2502 2147 1355 1528 1410  854 1810 1060 1110 1315  730 1290  184  367
 2397 2151 1149 2307 1246 1420 1286   37  155 1224 2448   11  657 1851
  984  313 2526  706  761 2284  145 2364  603 1704   51 1546  963 1301
 1780 2199 2465 1413  864 2237 1815 1753 1550  305 1758   23   34 1304
 2573 1929 2076  637 2730 1262 2060  467 2544 2659 1398 1896 2575 1296
  223 2175  428 2614 1699 2049  157 1702 2297 2689 1927 1843  215 1145
  234 2056  560 2382   22 1594 1885  316  597 1980 2610  858 1259  897
 2305 2670 1517 1343  283 2579 1478 1955 1335  653 2615 1982 2144  673
  837 2417  538 1799 1868  183 2480  901 1423 1892  827 2678  756 1472
   43  166 2542 1519 2259  973  132 2671 1221 2012 1696 2282 1051 2636
  388  127 1630 1802  988  2

In [307]:
print(test_x[0])
print(test_y[0])

2558
36
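
The two integers printed above are encoded ids rather than text: the first stands for a token, the second for a title. As a rough illustration of what a label encoder does, the sketch below assigns ids in sorted order of the unique values and maps them back again. `fit_label_encoding` and the toy titles are names and values we made up for this example; this is a plain-Python sketch, not scikit-learn's implementation.

```python
def fit_label_encoding(values):
    """Return (classes, to_id): sorted unique values and their integer ids."""
    classes = sorted(set(values))
    to_id = {value: idx for idx, value in enumerate(classes)}
    return classes, to_id

# Hypothetical titles; in the notebook each token row carries its paper's title
toy_titles = ["Paper B", "Paper A", "Paper B", "Paper C"]

classes, to_id = fit_label_encoding(toy_titles)
encoded = [to_id[title] for title in toy_titles]   # text -> integer id
decoded = [classes[idx] for idx in encoded]        # integer id -> text

print(encoded)
print(decoded)
```

Decoding a predicted class id back to a title needs the `classes` list of the title encoding, which is why the title and token columns should not share one encoder.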


In [460]:
#https://github.com/WillKoehrsen/recurrent-neural-networks/blob/master/notebooks/Quick%20Start%20to%20Recurrent%20Neural%20Networks.ipynb
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam
from keras.utils import plot_model
def get_model():
    lstm = Sequential()

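    # Embedding layer
    # input_dim must cover every token id, i.e. the largest id in `abstracts` plus one
    # (the summary recorded below was produced with input_dim=1)
    lstm.add(
        Embedding(
            input_dim=int(abstracts.max()) + 1,
            output_dim=90,
            weights=None,
            trainable=True))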

    # Recurrent layer
    lstm.add(
        LSTM(
            60, return_sequences=False, dropout=0.1,
            recurrent_dropout=0.1))

    # Fully connected layer
    lstm.add(Dense(1, activation='relu'))

    # Dropout for regularization
    lstm.add(Dropout(0.1))

    # Output layer
    lstm.add(Dense(90, activation='softmax'))

    # Compile the model
    lstm.compile(
        optimizer='SGD', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    lstm.summary()
    return lstm

In [461]:
lstm = get_model()
history = lstm.fit(train_x,  train_y, 
                    batch_size=16, epochs=20,
                    validation_data=(test_x, test_y))

Model: "sequential_48"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_48 (Embedding)     (None, None, 90)          90        
_________________________________________________________________
lstm_48 (LSTM)               (None, 60)                36240     
_________________________________________________________________
dense_95 (Dense)             (None, 1)                 61        
_________________________________________________________________
dropout_48 (Dropout)         (None, 1)                 0         
_________________________________________________________________
dense_96 (Dense)             (None, 90)                180       
Total params: 36,571
Trainable params: 36,571
Non-trainable params: 0
_________________________________________________________________
Train on 2187 samples, validate on 547 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Ep

In [465]:
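sample_id = 3
# Take the arg-max over the softmax output as the predicted class id
# (equivalent to predict_classes, which newer Keras versions no longer provide)
preds = np.argmax(lstm.predict(test_x[:10]), axis=-1)
# `dicts` maps an encoded label id back to its title
print("Predicted: ", dicts[preds[sample_id]])
print("Actual: ", dicts[test_y[sample_id]])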

Predicted:   A hidden gene in astroviruses encodes a<br>cell-permeabilizing protein involved in virus release
Actual:   A hidden gene in astroviruses encodes a<br>cell-permeabilizing protein involved in virus release


In [411]:
for x in range(10):
    print(dicts[train_y[x]])

 Characterizing the transmission and<br>identifying the control strategy for COVID-19 through<br>epidemiological modeling
 Forecasting the Wuhan coronavirus<br>(2019-nCoV) epidemics using a simple (simplistic) model -<br>update (Feb. 8, 2020)
 A planarian nidovirus expands the limits of RNA<br>genome size
 Early in planta detection of Xanthomonas<br>axonopodis pv. punicae in pomegranate using enhanced<br>loop-mediated isothermal amplification assay
 Risk of disease spillover from dogs to wild<br>carnivores in Kanha Tiger Reserve, India.
 68 Consecutive patients assessed for COVID-19<br>infection; experience from a UK regional infectious<br>disease unit
 Evolution and variation of 2019-novel<br>coronavirus
 Functional pangenome analysis suggests<br>inhibition of the protein E as a readily available therapy<br>for COVID-2019.
 SKEMPI 2.0: An updated benchmark of changes in<br>protein-protein binding energy, kinetics and thermodynamics<br>upon mutation
 The Viral Protein Corona Directs Vi