# COVID-19 Research Papers Text Extraction

Goal: Answer the question "What has been published about medical care?"

Print the most "important" sentences from a sample of papers labeled cluster 3 by the LDA model in "COVID-19 Research Papers LDA Topic Modeling". A sentence's importance is the sum of the TF-IDF scores of its words, where each sentence of a paper is treated as its own document: a word scores higher the more often it appears in the sentence and the fewer of the paper's sentences contain it. The sum is then divided by the log of the sentence length to reduce the bias toward longer sentences.
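To make the scoring idea concrete, here is a minimal sketch in plain Python on made-up tokens. It is a simplification: it uses the textbook weight tf * log(N / df), whereas scikit-learn's `TfidfVectorizer` also smooths the idf and L2-normalizes each row, so the numbers will not match the notebook's. The name `sentence_scores` and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter


def sentence_scores(token_lists):
    """Score each tokenized sentence by the sum of its TF-IDF weights.

    Each sentence is treated as one "document" within a single paper.
    """
    n_sentences = len(token_lists)
    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))

    scores = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        # Simplified weight: tf * log(N / df). A term found in every
        # sentence gets log(1) = 0 and contributes nothing.
        score = sum(
            tf * math.log(n_sentences / doc_freq[term])
            for term, tf in term_freq.items()
        )
        scores.append(score)
    return scores


# Toy paper with three already-stemmed sentences (made-up data).
toy = [
    ["patient", "care", "hospit"],
    ["patient", "ventil", "oxygen", "support"],
    ["patient", "care"],
]
scores = sentence_scores(toy)
print(scores)
```

In this toy paper "patient" occurs in every sentence and adds nothing, so the second sentence, with three terms that appear nowhere else, ranks first.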

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jayfeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jayfeng/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jayfeng/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
"""Read in full_texts, containing the full texts of all papers, from data cleaning notebook."""

full_texts = pd.read_csv("full_texts.csv")
full_texts

Unnamed: 0.1,Unnamed: 0,0
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a p...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting pat..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastr...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease..."


In [3]:
"""Read in document_clusters, containing the assigned topic of each paper from LDA clusters notebook."""

df_clusters = pd.read_csv("document_clusters.csv")[["sha", "assigned topic"]]
df_clusters

Unnamed: 0,sha,assigned topic
0,eb60faa5390ed6225f47be23de4d9f42d852d84f,0
1,80c1d869f81784f4b1d5a064b5751b2ee5f33f26,0
2,793b90203f2d2abdc7ae32cf413579954f2278eb,0
3,a6529715bc34fce68e08780be0b80acdc823e744,0
4,0ad114139585a123079f9d6658429b37457e6b07,0
...,...,...
24040,3767c72c714d598a7668f1b53fd57db3383ce6ca,9
24041,c8ca3a5306db10a7842b853031404ecbc0a363ed,9
24042,60aa58e759985165053ddd6f154ea42518143056,9
24043,cced09f9a42aebacffd3847b8891ed5942687d7d,9


In [4]:
"""Filter out full texts that don't have an assigned topic."""
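has_cluster = df_clusters["sha"].to_numpy()

# Keep the first two columns (sha, text) and only the rows whose sha has a cluster.
# isin replaces the row-by-row loop; the original index is preserved.
valid_full_texts = full_texts.iloc[:, :2].copy()
valid_full_texts.columns = ["sha", "text"]
valid_full_texts = valid_full_texts[valid_full_texts["sha"].isin(has_cluster)]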

valid_full_texts

Unnamed: 0,sha,text
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a p...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting pat..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastr...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease..."


In [5]:
"""Merge df_clusters and valid_full_texts by sha, then select the papers relevant to medical care.
Note: the text refers to this group as cluster 3, but the filter below selects assigned topic 8."""

merged = pd.merge(df_clusters, valid_full_texts, on="sha")

medical_care_papers = merged.loc[merged['assigned topic'] == 8]
medical_care_papers

Unnamed: 0,sha,assigned topic,text
20987,525f05f9b074167ed6255533b8db2349d8a157d5,8,Pentraxins are a superfamily of phylogenetica...
20988,2c0c7b9bdc81461fbf533ae6b89fd78326651306,8,Type I interferons (IFNs) are major players o...
20989,602e1d7bdb8d5504c5fe21d39ab5f915cfc6287f,8,breast and uterus. These types of compounds a...
20990,4296edbb527d1e55ad2079c76f619659e0b5720d,8,"The threat of using orthopoxviruses, variola ..."
20991,3d77851b29b6aae8f448825a52262e6792d2dbf8,8,The major comorbidity in patients with non-sm...
...,...,...,...
23389,4ec02c71e62b4df8efeed2f53623778f49c7438a,8,Crimean-Congo hemorrhagic fever (CCHF) is a w...
23390,c249ffd76e3166c034955824b02dd3cc89cc23d2,8,Following recent advances in molecular biolog...
23391,98051953b3bd0b24f84d4ccbfa579130eacfe1d0,8,"Oxalate is normally produced in plants, prima..."
23392,432f684fc8610e68154ea21a86387c4f9ec80cf5,8,It is estimated that 10e33% of hospital-acqui...
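As a small illustration of the merge-then-filter step, the sketch below uses two made-up frames standing in for `df_clusters` and the full-text table. It relies on `pd.merge` defaulting to an inner join, so a sha missing from either side is dropped. The toy values are assumptions, not data from the notebook.

```python
import pandas as pd

# Toy stand-ins for df_clusters and the full-text table (made-up values).
clusters = pd.DataFrame({"sha": ["a", "b", "c"], "assigned topic": [8, 3, 8]})
texts = pd.DataFrame({"sha": ["a", "c", "d"], "text": ["text a", "text c", "text d"]})

# pd.merge defaults to how="inner": only shas present in both frames survive,
# so "b" (no text) and "d" (no cluster) are dropped.
merged = pd.merge(clusters, texts, on="sha")

# Boolean mask on the topic column, mirroring the filter in the cell above.
topic_8 = merged.loc[merged["assigned topic"] == 8]
print(topic_8)
```

Because the join is inner, papers without an assigned topic would also be excluded by the merge itself, independently of any earlier filtering.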


In [6]:
"""Sample 100 papers from cluster 3."""

mcp_partial = medical_care_papers.sample(100)
mcp_partial

Unnamed: 0,sha,assigned topic,text
22281,50897189565f10bfa3b4884058b5a51cf75c091e,8,High-mortality and occurrence of Ebola diseas...
21407,0afa3ea846396533c7ca515968abcfea3f895082,8,port neutrophil infiltration in inflammatory-...
22603,f33e644d058bc8373378c115878be06346298687,8,As a consequence of advances in modern cancer...
22707,889ba9338ea71cd42c3bc675db30a1928d487f43,8,Plague is one of the most feared infectious d...
21246,44023966b2c4e4a1538fe2e8dd9f0511898ca344,8,first vaccines were similar to early pneumoco...
...,...,...,...
21796,81b22a5b5989dfc8f57c6d05fd29d3c735785887,8,"In 2012, a novel virus, termed Middle East re..."
21421,21280ea42df08d5a4e52024bcfaadd4abba39088,8,Transmissible spongiform encephalopathies (TS...
22665,1c7c9d9c03f404119c5cc1dd98e0ebe17ea3bb3e,8,a1111111111 a1111111111 a1111111111 a11111111...
22672,41ee771b9bd14efb72c392edf03b3cff55c9d570,8,Neonatal calf diarrhea (NCD) is the most comm...


In [7]:
"""Add column of full text split into sentences using nltk sentence tokenizer."""

from nltk.tokenize import sent_tokenize

# One list of sentences per paper. The lists have different lengths, so they are
# stored in an object Series instead of being passed through np.array, which
# rejects ragged input in recent NumPy versions.
sentences = [sent_tokenize(text) for text in mcp_partial["text"].values]

mcp_partial["sentences"] = pd.Series(sentences, index=mcp_partial.index)
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences
22281,50897189565f10bfa3b4884058b5a51cf75c091e,8,High-mortality and occurrence of Ebola diseas...,[ High-mortality and occurrence of Ebola disea...
21407,0afa3ea846396533c7ca515968abcfea3f895082,8,port neutrophil infiltration in inflammatory-...,[ port neutrophil infiltration in inflammatory...
22603,f33e644d058bc8373378c115878be06346298687,8,As a consequence of advances in modern cancer...,[ As a consequence of advances in modern cance...
22707,889ba9338ea71cd42c3bc675db30a1928d487f43,8,Plague is one of the most feared infectious d...,[ Plague is one of the most feared infectious ...
21246,44023966b2c4e4a1538fe2e8dd9f0511898ca344,8,first vaccines were similar to early pneumoco...,[ first vaccines were similar to early pneumoc...
...,...,...,...,...
21796,81b22a5b5989dfc8f57c6d05fd29d3c735785887,8,"In 2012, a novel virus, termed Middle East re...","[ In 2012, a novel virus, termed Middle East r..."
21421,21280ea42df08d5a4e52024bcfaadd4abba39088,8,Transmissible spongiform encephalopathies (TS...,[ Transmissible spongiform encephalopathies (T...
22665,1c7c9d9c03f404119c5cc1dd98e0ebe17ea3bb3e,8,a1111111111 a1111111111 a1111111111 a11111111...,[ a1111111111 a1111111111 a1111111111 a1111111...
22672,41ee771b9bd14efb72c392edf03b3cff55c9d570,8,Neonatal calf diarrhea (NCD) is the most comm...,[ Neonatal calf diarrhea (NCD) is the most com...


In [8]:
"""Set up stop words, stemmer, and lemmatizer."""

stop_words = set(stopwords.words('english')) 
snowBallStemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [9]:
"""Tokenize and clean a single sentence (applied to every sentence of every paper)."""

import re


def tokenize_clean(text):
    # Lowercase the text and split it into word tokens.
    tokens = word_tokenize(text.lower())

    # Lemmatize each token (WordNet lemmatizer, default noun part of speech).
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Drop stop words, tokens of 3 or fewer characters, and tokens that are
    # not purely alphabetic; stem whatever remains.
    filtered_tokens = []
    for token in tokens:
        if token not in stop_words and len(token) > 3 and re.match(r"^[A-Za-z]+$", token):
            filtered_tokens.append(snowBallStemmer.stem(token))

    return filtered_tokens
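The filter condition is easy to misread, so here is a standalone sketch of just that step, without NLTK. The helper name `keep_token`, the three-word stop list, and the sample tokens are assumptions for illustration; the real cell uses NLTK's English stop words and stems the surviving tokens afterwards.

```python
import re

ALPHA_ONLY = re.compile(r"^[A-Za-z]+$")

# Assumption: a tiny stand-in for NLTK's English stop-word list.
toy_stop_words = {"this", "with", "were"}


def keep_token(token, stop_words):
    """True if the token passes the same three checks as the filter above."""
    return (
        token not in stop_words
        and len(token) > 3
        and ALPHA_ONLY.match(token) is not None
    )


tokens = ["patients", "with", "covid-19", "were", "given", "40", "mg", "doses", "(", "daily", ")"]
kept = [t for t in tokens if keep_token(t, toy_stop_words)]
print(kept)  # ['patients', 'given', 'doses', 'daily']
```

One consequence worth keeping in mind: any token containing a digit or hyphen (such as "covid-19") and any token of three or fewer characters (such as "mg") is discarded before scoring, so such terms cannot contribute to a sentence's TF-IDF sum.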

In [10]:
"""Add column of cleaned and tokenized sentences."""

sent_list = mcp_partial["sentences"].values
tokenized_sentences = []


for i in np.arange(len(sent_list)):
    #sent_arr is a list of lists
    sent_arr = sent_list[i]
    sent_tokens = []
    for j in sent_arr:
        sent_tokens.append(tokenize_clean(j))
    tokenized_sentences.append(sent_tokens)

tokenized_sentences = np.array(tokenized_sentences)

mcp_partial["tokenized sentences"] = tokenized_sentences
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences
22281,50897189565f10bfa3b4884058b5a51cf75c091e,8,High-mortality and occurrence of Ebola diseas...,[ High-mortality and occurrence of Ebola disea...,"[[occurr, ebola, diseas, ebod, outbreak, almos..."
21407,0afa3ea846396533c7ca515968abcfea3f895082,8,port neutrophil infiltration in inflammatory-...,[ port neutrophil infiltration in inflammatory...,"[[port, neutrophil, infiltr, model], [neutroph..."
22603,f33e644d058bc8373378c115878be06346298687,8,As a consequence of advances in modern cancer...,[ As a consequence of advances in modern cance...,"[[consequ, advanc, modern, cancer, care, incre..."
22707,889ba9338ea71cd42c3bc675db30a1928d487f43,8,Plague is one of the most feared infectious d...,[ Plague is one of the most feared infectious ...,"[[plagu, fear, infecti, diseas, human], [kill,..."
21246,44023966b2c4e4a1538fe2e8dd9f0511898ca344,8,first vaccines were similar to early pneumoco...,[ first vaccines were similar to early pneumoc...,"[[first, vaccin, similar, earli, pneumococc, v..."
...,...,...,...,...,...
21796,81b22a5b5989dfc8f57c6d05fd29d3c735785887,8,"In 2012, a novel virus, termed Middle East re...","[ In 2012, a novel virus, termed Middle East r...","[[novel, virus, term, middl, east, respiratori..."
21421,21280ea42df08d5a4e52024bcfaadd4abba39088,8,Transmissible spongiform encephalopathies (TS...,[ Transmissible spongiform encephalopathies (T...,"[[transmiss, spongiform, encephalopathi, degen..."
22665,1c7c9d9c03f404119c5cc1dd98e0ebe17ea3bb3e,8,a1111111111 a1111111111 a1111111111 a11111111...,[ a1111111111 a1111111111 a1111111111 a1111111...,"[[coxsackievirus, virus, belong, genus, human,..."
22672,41ee771b9bd14efb72c392edf03b3cff55c9d570,8,Neonatal calf diarrhea (NCD) is the most comm...,[ Neonatal calf diarrhea (NCD) is the most com...,"[[neonat, calf, diarrhea, common, caus, calf, ..."


In [11]:
"""Given a sorted df and int n, returns the index labels of the top n rows
(fewer if df has fewer than n rows)."""

def top_n_index(df, n):
    return df.index[:n].tolist()
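Taking the first n labels of a table sorted by score is the same as ranking positions by score and keeping the best n. A minimal plain-Python sketch of that idea follows; the name `top_n_positions` and the scores are hypothetical, and it returns fewer than n positions when fewer scores are available.

```python
def top_n_positions(scores, n):
    """Positions of the n highest scores, best first."""
    # Rank positions by their score, highest first, then keep the first n.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


scores = [0.8, -1, 2.5, 1.1]
print(top_n_positions(scores, 3))   # [2, 3, 0]
print(top_n_positions(scores, 10))  # [2, 3, 0, 1]
```

The returned positions index back into the original list of sentences, which is what makes it possible to look up the sentence text afterwards.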

In [12]:
"""Returns the indexes of the n most important sentences of the text in the given row. Importance is the
sum of a sentence's TF-IDF values divided by log(number of tokens + 1), to reduce bias towards longer
sentences. Sentences with fewer than 10 tokens get a score of -1 so that they rank last."""

def top_n_sentences(df, row, n):
    vectorizer = TfidfVectorizer()

    # Re-join each sentence's tokens so the vectorizer treats a sentence as one document.
    token_lists = df["tokenized sentences"].values[row]
    sentence_strings = [" ".join(tokens) for tokens in token_lists]
    sentence_lengths = [len(tokens) for tokens in token_lists]

    tfidf_matrix = vectorizer.fit_transform(sentence_strings)

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
    # is its replacement (available from scikit-learn 1.0).
    feature_names = vectorizer.get_feature_names_out()

    # One row per sentence, one column per term.
    scores = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    # Debug output: prints the TF-IDF matrix for this paper.
    print(scores)

    sums = []
    for i in np.arange(len(scores)):
        if sentence_lengths[i] < 10:
            sums.append(-1)
        else:
            sums.append(np.sum(scores.iloc[i].values) / np.log(sentence_lengths[i] + 1))

    scores["sums"] = sums
    scores = scores.sort_values("sums", ascending=False)

    return top_n_index(scores, n)
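The length normalization can be checked with a little arithmetic. The sketch below isolates that step in plain Python; the function name `normalized_score`, the `min_tokens` parameter, and the input numbers are assumptions chosen for illustration, not values from the notebook.

```python
import math


def normalized_score(tfidf_sum, n_tokens, min_tokens=10):
    """TF-IDF sum divided by log(n_tokens + 1); short sentences score -1."""
    if n_tokens < min_tokens:
        # Too short to be informative: pushed to the bottom of the ranking.
        return -1
    return tfidf_sum / math.log(n_tokens + 1)


short = normalized_score(3.0, 5)     # below the 10-token cutoff
medium = normalized_score(3.0, 10)   # 3.0 / log(11)
long_ = normalized_score(6.0, 40)    # 6.0 / log(41)
print(short, medium, long_)
```

The logarithm damps length only mildly: in this example a sentence four times as long with twice the TF-IDF sum still outranks the shorter one. With `TfidfVectorizer`'s default `norm="l2"`, each row has unit Euclidean length, so a row's sum is bounded by the square root of its number of distinct terms, which already limits how much long sentences can gain.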

In [13]:
"""Adds column of indexes of the top 3 most relevant sentences based on tfidf sum."""

temp = []
for i in np.arange(len(mcp_partial)):
    temp.append(top_n_sentences(mcp_partial, i, 3))

mcp_partial["relevant sentences"] = temp
mcp_partial

[Printed output condensed: the print call inside top_n_sentences emits one TF-IDF matrix per sampled paper. Rows are sentences, columns are stemmed terms in alphabetical order (for example absolut, abt, abund, accept, access), and most entries are 0.0. The individual dumps are truncated in this export.]

     ...  whole  wide  within  without 
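
The matrices printed above have one row per sentence and one column per stemmed term. As a rough illustration of how such rows can be produced, the sketch below computes TF-IDF by hand in plain Python, treating each sentence of a paper as its own "document". The helper name `tfidf_rows` and the toy sentences are hypothetical, and the smoothed IDF with L2 normalisation is assumed to mirror the default behaviour of `TfidfVectorizer`; it is not necessarily the exact configuration used in this notebook.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_sentences):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per sentence.

    Hypothetical helper: each sentence is treated as a separate document.
    """
    n = len(tokenized_sentences)
    vocab = sorted({tok for sent in tokenized_sentences for tok in sent})

    # Document frequency: the number of sentences that contain each term.
    df = Counter(tok for sent in tokenized_sentences for tok in set(sent))

    # Smoothed inverse document frequency (assumed to match sklearn's default).
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        raw = [tf[term] * idf[term] for term in vocab]
        # Scale each row to unit length; guard against all-zero rows.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy stemmed sentences, made up for illustration only.
sentences = [
    ["virus", "bind", "receptor"],
    ["virus", "replic", "cell"],
    ["patient", "receiv", "care"],
]
vocab, rows = tfidf_rows(sentences)
```

Terms that occur in many sentences of the same paper (here `virus`) receive a lower weight than terms confined to a single sentence (here `bind`), which is why most cells in the printed matrices are `0.0` and the non-zero ones differ from sentence to sentence.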

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences,relevant sentences
22281,50897189565f10bfa3b4884058b5a51cf75c091e,8,High-mortality and occurrence of Ebola diseas...,[ High-mortality and occurrence of Ebola disea...,"[[occurr, ebola, diseas, ebod, outbreak, almos...","[38, 10, 143]"
21407,0afa3ea846396533c7ca515968abcfea3f895082,8,port neutrophil infiltration in inflammatory-...,[ port neutrophil infiltration in inflammatory...,"[[port, neutrophil, infiltr, model], [neutroph...","[83, 26, 103]"
22603,f33e644d058bc8373378c115878be06346298687,8,As a consequence of advances in modern cancer...,[ As a consequence of advances in modern cance...,"[[consequ, advanc, modern, cancer, care, incre...","[165, 163, 204]"
22707,889ba9338ea71cd42c3bc675db30a1928d487f43,8,Plague is one of the most feared infectious d...,[ Plague is one of the most feared infectious ...,"[[plagu, fear, infecti, diseas, human], [kill,...","[168, 64, 206]"
21246,44023966b2c4e4a1538fe2e8dd9f0511898ca344,8,first vaccines were similar to early pneumoco...,[ first vaccines were similar to early pneumoc...,"[[first, vaccin, similar, earli, pneumococc, v...","[31, 115, 157]"
...,...,...,...,...,...,...
21796,81b22a5b5989dfc8f57c6d05fd29d3c735785887,8,"In 2012, a novel virus, termed Middle East re...","[ In 2012, a novel virus, termed Middle East r...","[[novel, virus, term, middl, east, respiratori...","[173, 147, 28]"
21421,21280ea42df08d5a4e52024bcfaadd4abba39088,8,Transmissible spongiform encephalopathies (TS...,[ Transmissible spongiform encephalopathies (T...,"[[transmiss, spongiform, encephalopathi, degen...","[130, 144, 23]"
22665,1c7c9d9c03f404119c5cc1dd98e0ebe17ea3bb3e,8,a1111111111 a1111111111 a1111111111 a11111111...,[ a1111111111 a1111111111 a1111111111 a1111111...,"[[coxsackievirus, virus, belong, genus, human,...","[259, 256, 201]"
22672,41ee771b9bd14efb72c392edf03b3cff55c9d570,8,Neonatal calf diarrhea (NCD) is the most comm...,[ Neonatal calf diarrhea (NCD) is the most com...,"[[neonat, calf, diarrhea, common, caus, calf, ...","[75, 54, 33]"
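
The `relevant sentences` column holds three sentence indices per paper. Following the description at the top of this notebook (a sentence's importance is the sum of the TF-IDF scores of its words), one way such indices could be derived is sketched below. The function name `top_sentences` and the small score matrix are hypothetical, and the ordering of ties is an assumption.

```python
def top_sentences(tfidf_rows, k=3):
    """Return the indices of the k rows with the largest TF-IDF sums.

    Hypothetical helper: tfidf_rows is a list of per-sentence score lists.
    """
    scores = [sum(row) for row in tfidf_rows]
    # Sort sentence indices by score, highest first; ties keep original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up scores: five sentences, three terms.
rows = [
    [0.0, 0.26, 0.0],
    [0.5, 0.5, 0.7],
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.3],
    [0.1, 0.1, 0.1],
]
top = top_sentences(rows)
```

Because the score is a plain sum, longer sentences with many distinct weighted terms tend to rank higher than short ones, which is worth keeping in mind when reading the extracted sentences.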


In [14]:
"""Print out top 3 most relevant sentences of each text in the sample."""

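# Each row is one paper; "relevant sentences" holds the indices of its top-scoring sentences.
for _, row in mcp_partial.iterrows():
    for i in row["relevant sentences"]:
        print(row["sentences"][i])
        print("\n")  # leaves two blank lines between sentences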

The codon optimized nucleotide sequence for EBOV nucleocapsid (NP) (GenBank accession number AF086833.2) was synthesized with a C-terminal glycine linker to a 6× Histidine tag (GenScript, Piscataway, NJ, USA) and subcloned into the NcoI and XhoI restriction sites of the pET-15b expression vector (Novagen, Merck, USA).


SL blood specimens were originally submitted for EBOV RT-PCR testing to the SA modular high-biosafety field Ebola diagnostic laboratory (FEDL) established in August 2014 near Freetown, in international response to the rapidly increasing number of EBOD cases in SL [36] .


High detection rate  The diagnostic decision limit or cut-off represents a serological assay test value used to dichotomize negative and positive results, and by inference, to define the infection status of an individual against a specific pathogen of disease.


Recent work has shown that gut microbiota metabolism of dietary fiber can increase the concentration of circulating short-chain fatty acids (S