# Text Extraction

Goal: Answer the question "What has been published about medical care?"

Print out the most "important" sentences from a sample of papers labeled cluster 3, as determined by the LDA model in "COVID-19 Research Papers LDA Topic Modeling". Importance is the sum of the TF-IDF scores of the words in a sentence. Each sentence is treated as a document and each paper as the corpus, so a word scores highly when it appears often in the sentence but in few other sentences of the same paper.
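
As a rough illustration of the scoring idea, the sketch below computes a sum-of-TF-IDF score per sentence in plain Python. It is a simplification rather than the notebook's implementation: the three toy sentences are made up, `score_sentences` is a hypothetical helper, and the IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1 without the L2 normalization that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter


def score_sentences(tokenized_sentences):
    """Return one score per sentence: the sum of TF-IDF weights of its tokens.

    Each sentence plays the role of a "document"; the list of sentences is the corpus.
    """
    n_sentences = len(tokenized_sentences)

    # document frequency: in how many sentences does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_sentences:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_sentences:
        term_freq = Counter(tokens)
        score = 0.0
        for term, tf in term_freq.items():
            # smoothed IDF (assumption: same form as scikit-learn's default)
            idf = math.log((1 + n_sentences) / (1 + doc_freq[term])) + 1
            score += tf * idf
        scores.append(score)
    return scores


# made-up, already-cleaned sentences for illustration only
toy = [
    ["virus", "infect", "cell"],
    ["virus", "infect", "host"],
    ["antibodi", "assay", "serum"],
]
scores = score_sentences(toy)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

In this toy corpus the third sentence ranks first because all of its terms appear in only one sentence, while the first two share "virus" and "infect".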

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jayfeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jayfeng/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jayfeng/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
"""Read in full_texts, containing the full texts of all papers, from data cleaning notebook."""

full_texts = pd.read_csv("full_texts.csv")
full_texts

Unnamed: 0.1,Unnamed: 0,0
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a p...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting pat..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastr...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease..."


In [3]:
"""Read in document_clusters, containing the assigned topic of each paper from LDA clusters notebook."""

df_clusters = pd.read_csv("document_clusters.csv")[["sha", "assigned topic"]]
df_clusters

Unnamed: 0,sha,assigned topic
0,eb60faa5390ed6225f47be23de4d9f42d852d84f,0
1,80c1d869f81784f4b1d5a064b5751b2ee5f33f26,0
2,793b90203f2d2abdc7ae32cf413579954f2278eb,0
3,a6529715bc34fce68e08780be0b80acdc823e744,0
4,0ad114139585a123079f9d6658429b37457e6b07,0
...,...,...
24040,3767c72c714d598a7668f1b53fd57db3383ce6ca,9
24041,c8ca3a5306db10a7842b853031404ecbc0a363ed,9
24042,60aa58e759985165053ddd6f154ea42518143056,9
24043,cced09f9a42aebacffd3847b8891ed5942687d7d,9


In [4]:
"""Filter out full texts that don't have an assigned topic."""
has_cluster = df_clusters["sha"].to_numpy()

# Keep only rows whose sha (first column) has an assigned topic. A vectorized
# isin() mask replaces the row-by-row loop and preserves the original index.
mask = full_texts.iloc[:, 0].isin(has_cluster)
valid_full_texts = full_texts.loc[mask].iloc[:, :2].copy()
valid_full_texts.columns = ["sha", "text"]

valid_full_texts

Unnamed: 0,sha,text
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a p...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting pat..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastr...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease..."


In [5]:
"""Merge df_clusters and valid_full_texts by sha, select papers relevant to medical care (cluster 3)."""

merged = pd.merge(df_clusters, valid_full_texts, on="sha")

medical_care_papers = merged.loc[merged['assigned topic'] == 8]
medical_care_papers

Unnamed: 0,sha,assigned topic,text
20987,525f05f9b074167ed6255533b8db2349d8a157d5,8,Pentraxins are a superfamily of phylogenetica...
20988,2c0c7b9bdc81461fbf533ae6b89fd78326651306,8,Type I interferons (IFNs) are major players o...
20989,602e1d7bdb8d5504c5fe21d39ab5f915cfc6287f,8,breast and uterus. These types of compounds a...
20990,4296edbb527d1e55ad2079c76f619659e0b5720d,8,"The threat of using orthopoxviruses, variola ..."
20991,3d77851b29b6aae8f448825a52262e6792d2dbf8,8,The major comorbidity in patients with non-sm...
...,...,...,...
23389,4ec02c71e62b4df8efeed2f53623778f49c7438a,8,Crimean-Congo hemorrhagic fever (CCHF) is a w...
23390,c249ffd76e3166c034955824b02dd3cc89cc23d2,8,Following recent advances in molecular biolog...
23391,98051953b3bd0b24f84d4ccbfa579130eacfe1d0,8,"Oxalate is normally produced in plants, prima..."
23392,432f684fc8610e68154ea21a86387c4f9ec80cf5,8,It is estimated that 10e33% of hospital-acqui...


In [6]:
"""Sample 100 papers from cluster 3."""

mcp_partial = medical_care_papers.sample(100)
mcp_partial

Unnamed: 0,sha,assigned topic,text
22581,b15d045694070f5ef9844bffcf76b80698863665,8,Porcine epidemic diarrhea (PED) is characteri...
21964,620d69e419e3c8756de8bf96d86f6bf0de7ed919,8,Neural injury associated with virus infection...
22090,2cbfb121b9aa939783bc8da1ba5c91204662b047,8,The human central nervous system (CNS) diseas...
22471,dc38fa3321df5137294140ffe1a358399befb500,8,Coronaviruses are large enveloped single-stra...
22430,fb9dd0cc6798855dbdc9f5194636243b353cad4a,8,Animals that hunt and scavenge are often expo...
...,...,...,...
21019,894e9a681490af51667a059d727a9cce8dddbf7f,8,"and member of the Flaviviridae family, which ..."
21377,7737bafb5ddf5ec04a3c0e586e7115112b8cdbe0,8,The first parvovirus shown to be a filterable...
22195,f9c1e38f9dc4fdbcd4aa6ebf7bf7022053215762,8,Recombinant virus-like particles (VLPs) are t...
22393,8e706c240839154f4fb123d28c5b57e078703bc6,8,The agent that caused the 2002-2003 Severe Ac...


In [7]:
"""Add column of full text split into sentences using nltk sentence tokenizer."""

from nltk.tokenize import sent_tokenize

# Split each paper's text into a list of sentences. The plain Python list is
# assigned directly: the per-paper lists have different lengths, and recent
# NumPy versions raise an error when building an array from ragged lists.
sentences = [sent_tokenize(text) for text in mcp_partial["text"].values]

mcp_partial["sentences"] = sentences
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences
22581,b15d045694070f5ef9844bffcf76b80698863665,8,Porcine epidemic diarrhea (PED) is characteri...,[ Porcine epidemic diarrhea (PED) is character...
21964,620d69e419e3c8756de8bf96d86f6bf0de7ed919,8,Neural injury associated with virus infection...,[ Neural injury associated with virus infectio...
22090,2cbfb121b9aa939783bc8da1ba5c91204662b047,8,The human central nervous system (CNS) diseas...,[ The human central nervous system (CNS) disea...
22471,dc38fa3321df5137294140ffe1a358399befb500,8,Coronaviruses are large enveloped single-stra...,[ Coronaviruses are large enveloped single-str...
22430,fb9dd0cc6798855dbdc9f5194636243b353cad4a,8,Animals that hunt and scavenge are often expo...,[ Animals that hunt and scavenge are often exp...
...,...,...,...,...
21019,894e9a681490af51667a059d727a9cce8dddbf7f,8,"and member of the Flaviviridae family, which ...","[ and member of the Flaviviridae family, which..."
21377,7737bafb5ddf5ec04a3c0e586e7115112b8cdbe0,8,The first parvovirus shown to be a filterable...,[ The first parvovirus shown to be a filterabl...
22195,f9c1e38f9dc4fdbcd4aa6ebf7bf7022053215762,8,Recombinant virus-like particles (VLPs) are t...,[ Recombinant virus-like particles (VLPs) are ...
22393,8e706c240839154f4fb123d28c5b57e078703bc6,8,The agent that caused the 2002-2003 Severe Ac...,[ The agent that caused the 2002-2003 Severe A...


In [8]:
"""Set up stop words, stemmer, and lemmatizer."""

stop_words = set(stopwords.words('english')) 
snowBallStemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [9]:
"""Tokenize and clean a piece of text (applied below to each sentence of every paper)."""

import re
# Python form of the JavaScript-style literal /\b([a-z]+)\b/gi; not used by tokenize_clean
pattern = re.compile(r"\b([a-z]+)\b", re.IGNORECASE)


def tokenize_clean(abstract):
    # lowercase the string and split it into tokens
    tokens = word_tokenize(abstract.lower())

    # lemmatize each token
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # keep alphabetic, non-stop-word tokens longer than 3 characters, then stem them
    filtered_tokens = []
    for token in tokens:
        if token not in stop_words and len(token) > 3 and re.match(r"^[A-Za-z]+$", token):
            filtered_tokens.append(snowBallStemmer.stem(token))

    return filtered_tokens
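
The filter condition in `tokenize_clean` can be tried in isolation. The sketch below applies the same three checks (not a stop word, longer than 3 characters, letters only) to a hand-written token list. `TOY_STOP_WORDS` and `keep_token` are hypothetical stand-ins so the example runs without the NLTK data, and no lemmatizing or stemming is done.

```python
import re

# hypothetical stand-in for NLTK's English stop word list
TOY_STOP_WORDS = {"with", "this", "that", "from", "were"}


def keep_token(token, stop_words=TOY_STOP_WORDS):
    """Apply the same three checks as the filter in tokenize_clean."""
    return (
        token not in stop_words
        and len(token) > 3
        and re.match(r"^[A-Za-z]+$", token) is not None
    )


tokens = ["patients", "with", "covid-19", "were", "rna", "treated", "2019", "(", "icu"]
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['patients', 'treated']
```

Any token of three characters or fewer, or containing a digit or hyphen, is discarded, so short abbreviations and terms such as "covid-19" do not contribute to the TF-IDF scores.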

In [10]:
"""Add column of cleaned and tokenized sentences."""

# For each paper, clean and tokenize every sentence. The result is a list
# (one entry per paper) of lists (one per sentence) of stemmed tokens.
# It is assigned as a plain list because the nested lists are ragged.
tokenized_sentences = [
    [tokenize_clean(sentence) for sentence in paper_sentences]
    for paper_sentences in mcp_partial["sentences"].values
]

mcp_partial["tokenized sentences"] = tokenized_sentences
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences
22581,b15d045694070f5ef9844bffcf76b80698863665,8,Porcine epidemic diarrhea (PED) is characteri...,[ Porcine epidemic diarrhea (PED) is character...,"[[porcin, epidem, diarrhea, character, sever, ..."
21964,620d69e419e3c8756de8bf96d86f6bf0de7ed919,8,Neural injury associated with virus infection...,[ Neural injury associated with virus infectio...,"[[neural, injuri, associ, virus, infect, repre..."
22090,2cbfb121b9aa939783bc8da1ba5c91204662b047,8,The human central nervous system (CNS) diseas...,[ The human central nervous system (CNS) disea...,"[[human, central, nervous, system, diseas, mul..."
22471,dc38fa3321df5137294140ffe1a358399befb500,8,Coronaviruses are large enveloped single-stra...,[ Coronaviruses are large enveloped single-str...,"[[coronavirus, larg, envelop, virus, associ, w..."
22430,fb9dd0cc6798855dbdc9f5194636243b353cad4a,8,Animals that hunt and scavenge are often expo...,[ Animals that hunt and scavenge are often exp...,"[[anim, hunt, scaveng, often, expos, broad, ar..."
...,...,...,...,...,...
21019,894e9a681490af51667a059d727a9cce8dddbf7f,8,"and member of the Flaviviridae family, which ...","[ and member of the Flaviviridae family, which...","[[member, flavivirida, famili, also, includ, w..."
21377,7737bafb5ddf5ec04a3c0e586e7115112b8cdbe0,8,The first parvovirus shown to be a filterable...,[ The first parvovirus shown to be a filterabl...,"[[first, parvovirus, shown, filter, agent, fel..."
22195,f9c1e38f9dc4fdbcd4aa6ebf7bf7022053215762,8,Recombinant virus-like particles (VLPs) are t...,[ Recombinant virus-like particles (VLPs) are ...,"[[recombin, particl, vlps, icosahedr, structur..."
22393,8e706c240839154f4fb123d28c5b57e078703bc6,8,The agent that caused the 2002-2003 Severe Ac...,[ The agent that caused the 2002-2003 Severe A...,"[[agent, caus, sever, acut, respiratori, syndr..."


In [11]:
"""Given a sorted df and int n, returns indexes of the top n rows."""

def top_n_index(df, n):
    # index labels of the first n rows (df is assumed to be sorted already)
    return df.index[:n].tolist()
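
For comparison, the same "indexes of the n highest scores" idea can be expressed without a sorted DataFrame. The sketch below uses `heapq.nlargest` on a plain list of scores; `top_n_positions` is a hypothetical helper, not part of the notebook.

```python
import heapq


def top_n_positions(scores, n):
    """Return the positions of the n largest scores, highest first."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)


scores = [0.2, 1.5, -1, 0.9, 1.1]
print(top_n_positions(scores, 3))  # [1, 4, 3]
```

`heapq.nlargest` returns fewer items when `n` exceeds the number of scores, which may be convenient for very short papers.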

In [18]:
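"""Returns the indexes of the most important n sentences of the text in the given row based on the
sum of TF-IDF values divided by log(number of words in the sentence + 1) to reduce bias towards longer sentences.
Sentences with fewer than 10 cleaned tokens are scored -1 so that they rank last."""

def top_n_sentences(df, row, n):
    vectorizer = TfidfVectorizer()

    # rejoin each sentence's tokens into a string, since TfidfVectorizer expects strings
    list_of_lists = df["tokenized sentences"].values[row]
    list_of_strings = []
    sentence_lengths = []
    for i in list_of_lists:
        list_of_strings.append(" ".join(i))
        sentence_lengths.append(len(i))

    # each sentence is one "document"; the paper's sentences form the corpus
    tfidf_matrix = vectorizer.fit_transform(list_of_strings)

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out()

    vectors = tfidf_matrix.todense().tolist()

    # one row per sentence, one column per term; a new name avoids shadowing the df argument
    tfidf_df = pd.DataFrame(vectors, columns=feature_names)

    print(tfidf_df)

    sums = []
    for i in np.arange(len(tfidf_df)):
        if sentence_lengths[i] < 10:
            sums.append(-1)
        else:
            sums.append(np.sum(tfidf_df.iloc[i].values) / np.log(sentence_lengths[i] + 1))

    tfidf_df["sums"] = sums
    tfidf_df = tfidf_df.sort_values("sums", ascending=False)

    return top_n_index(tfidf_df, n)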
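
To see what the length normalization does, the sketch below compares a raw TF-IDF sum with the sum divided by log(length + 1) for three made-up sentences. The numbers are invented for illustration and `normalized_score` is a hypothetical helper; only the formula and the 10-token cutoff mirror `top_n_sentences`.

```python
import math


def normalized_score(tfidf_sum, n_tokens, min_tokens=10):
    """Ranking score: TF-IDF sum divided by log(token count + 1).

    Sentences with fewer than min_tokens cleaned tokens get -1, as in top_n_sentences.
    """
    if n_tokens < min_tokens:
        return -1
    return tfidf_sum / math.log(n_tokens + 1)


# (raw TF-IDF sum, number of cleaned tokens) for hypothetical sentences
long_sentence = (5.0, 60)
medium_sentence = (3.2, 12)
short_sentence = (2.0, 5)

for name, (total, length) in [
    ("long", long_sentence),
    ("medium", medium_sentence),
    ("short", short_sentence),
]:
    print(name, round(normalized_score(total, length), 3))
```

With these numbers the long sentence has the larger raw sum, but the medium sentence ranks higher once both are divided by the log of their length, and the short sentence is pushed to the bottom by the cutoff.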

In [19]:
"""Adds column of indexes of the top 3 most relevant sentences based on tfidf sum."""

temp = []
for i in np.arange(len(mcp_partial)):
    temp.append(top_n_sentences(mcp_partial, i, 3))

mcp_partial["relevant sentences"] = temp
mcp_partial

     abcam  abund  access  accessori  accord  accur      acid  across  activ  \
0      0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
1      0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
2      0.0    0.0     0.0   0.208144     0.0    0.0  0.000000     0.0    0.0   
3      0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
4      0.0    0.0     0.0   0.000000     0.0    0.0  0.245286     0.0    0.0   
..     ...    ...     ...        ...     ...    ...       ...     ...    ...   
227    0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
228    0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
229    0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
230    0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   
231    0.0    0.0     0.0   0.000000     0.0    0.0  0.000000     0.0    0.0   

      ad  ...  window  within  without 

[Similar truncated TF-IDF matrices, one per remaining sampled paper, omitted.]

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences,relevant sentences
22581,b15d045694070f5ef9844bffcf76b80698863665,8,Porcine epidemic diarrhea (PED) is characteri...,[ Porcine epidemic diarrhea (PED) is character...,"[[porcin, epidem, diarrhea, character, sever, ...","[12, 9, 182]"
21964,620d69e419e3c8756de8bf96d86f6bf0de7ed919,8,Neural injury associated with virus infection...,[ Neural injury associated with virus infectio...,"[[neural, injuri, associ, virus, infect, repre...","[160, 139, 58]"
22090,2cbfb121b9aa939783bc8da1ba5c91204662b047,8,The human central nervous system (CNS) diseas...,[ The human central nervous system (CNS) disea...,"[[human, central, nervous, system, diseas, mul...","[12, 17, 53]"
22471,dc38fa3321df5137294140ffe1a358399befb500,8,Coronaviruses are large enveloped single-stra...,[ Coronaviruses are large enveloped single-str...,"[[coronavirus, larg, envelop, virus, associ, w...","[10, 91, 21]"
22430,fb9dd0cc6798855dbdc9f5194636243b353cad4a,8,Animals that hunt and scavenge are often expo...,[ Animals that hunt and scavenge are often exp...,"[[anim, hunt, scaveng, often, expos, broad, ar...","[15, 161, 188]"
...,...,...,...,...,...,...
21019,894e9a681490af51667a059d727a9cce8dddbf7f,8,"and member of the Flaviviridae family, which ...","[ and member of the Flaviviridae family, which...","[[member, flavivirida, famili, also, includ, w...","[161, 3, 116]"
21377,7737bafb5ddf5ec04a3c0e586e7115112b8cdbe0,8,The first parvovirus shown to be a filterable...,[ The first parvovirus shown to be a filterabl...,"[[first, parvovirus, shown, filter, agent, fel...","[148, 33, 49]"
22195,f9c1e38f9dc4fdbcd4aa6ebf7bf7022053215762,8,Recombinant virus-like particles (VLPs) are t...,[ Recombinant virus-like particles (VLPs) are ...,"[[recombin, particl, vlps, icosahedr, structur...","[28, 209, 18]"
22393,8e706c240839154f4fb123d28c5b57e078703bc6,8,The agent that caused the 2002-2003 Severe Ac...,[ The agent that caused the 2002-2003 Severe A...,"[[agent, caus, sever, acut, respiratori, syndr...","[16, 110, 66]"


In [20]:
"""Print out top 3 most relevant sentences of each text in the sample."""

for index, row in mcp_partial.iterrows():
    for i in row["relevant sentences"]:
        print(row["sentences"][i])
        print("\n")

In this study, a panel of recombinant PEDV ORFs encoding structural and nonstructural proteins were expressed in mammalian and/or bacterial cells and screened for reactivity with porcine sera from seven provinces of China by ELISA and/or western blot analysis, in order to determine which antigen is most suitable as a diagnostic marker for PEDV infection.


Many tools have been developed for the detection of anti-PEDV antibodies based upon the major structural proteins (such as S, M or N proteins) in serum, colostrum, milk, feces and oral fluid, including indirect immunofluorescence assays (IFA), virus neutralization assays (SN), enzyme-linked immunosorbent assays (ELISA), and fluorescent microsphere immunoassays (FMIA) (Diel et al., 2016; Gerber et al., 2014; Gerber and Opriessnig, 2015; Gimenez-Lirola et al., 2017; Okda et al., 2015) .


The current study showed a similar decline, further confirming the reliability of the S1-based ELISA assays applicable to weaning pigs in addition to