# Text Extraction

Goal: Answer the question "What has been published about medical care?"

Print the most "important" sentences from a sample of papers labeled cluster 3, as determined by the LDA model in "COVID-19 Research Papers LDA Topic Modeling". A sentence's importance is the sum of the TF-IDF scores of its words, divided by the log of the sentence length to reduce bias toward longer sentences. TF-IDF is computed separately for each paper, with every sentence treated as a document, so a word scores highly when it is frequent in the sentence but appears in few other sentences of the same paper.
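
As a rough illustration of the scoring idea, the sketch below computes TF-IDF weights by hand for three toy sentences, treating each sentence as a document. The smoothed IDF formula and the L2 normalization are assumptions chosen to mirror the defaults of scikit-learn's `TfidfVectorizer`; the toy sentences and the helper name `tfidf_rows` are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_sentences):
    """Return one {term: weight} dict per sentence (smoothed IDF, L2-normalized)."""
    n = len(tokenized_sentences)
    # Document frequency: number of sentences that contain each term.
    doc_freq = Counter(term for sent in tokenized_sentences for term in set(sent))
    rows = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        raw = {
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Hypothetical, already stemmed sentences from one paper.
toy = [
    ["patient", "receiv", "oxygen", "therapi"],
    ["patient", "admit", "hospit"],
    ["patient", "patient", "recov"],
]
rows = tfidf_rows(toy)
sentence_scores = [sum(row.values()) for row in rows]
for tokens, score in zip(toy, sentence_scores):
    print(round(score, 3), tokens)
```

Because "patient" occurs in every toy sentence, it receives a lower weight than terms such as "oxygen" that occur in only one.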

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jayfeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jayfeng/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jayfeng/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
"""Read in full_texts, containing the full texts of all papers, from data cleaning notebook."""

full_texts = pd.read_csv("full_texts.csv")
full_texts

Unnamed: 0.1,Unnamed: 0,0
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a p...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting pat..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastr...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease..."


In [3]:
"""Read in document_clusters, containing the assigned topic of each paper from LDA clusters notebook."""

df_clusters = pd.read_csv("document_clusters.csv")[["sha", "assigned topic"]]
df_clusters

Unnamed: 0,sha,assigned topic
0,4d1f6d57c9a112fb1afc5c8a819b293cf8146ee9,0
1,d11eb07952f5a0fb076a49935508707abeecf8af,0
2,168667e863cc79cb44157edf0e39ca66a37b2b7b,0
3,361e12416e0ff49f28521c2dc5fc69934e419910,0
4,14c6f99cf0ec54c78447ad5a8403e3ee2708176f,0
...,...,...
24040,a61e707ef1a808123b3c6f542a79373d4315d603,9
24041,12329a694934f5824e1ef7d695a7a1fa0310f52c,9
24042,a02952893a51f2255a8d9d84b3eba6fe1b113996,9
24043,866943e037d3b683d0405a9d65d94568074ed592,9


In [4]:
"""Filter out full texts that don't have an assigned topic."""
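has_cluster = df_clusters["sha"].to_numpy()

# The first two columns of full_texts hold the paper hash and the full text.
valid_full_texts = full_texts.iloc[:, :2].copy()
valid_full_texts.columns = ["sha", "text"]

# Vectorized membership test instead of a row-by-row loop.
valid_full_texts = valid_full_texts[valid_full_texts["sha"].isin(has_cluster)]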

valid_full_texts

Unnamed: 0,sha,text
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a p...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting pat..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastr...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease..."


In [5]:
"""Merge df_clusters and valid_full_texts by sha, then select the papers relevant to medical care (assigned topic 8)."""

merged = pd.merge(df_clusters, valid_full_texts, on="sha")

medical_care_papers = merged.loc[merged['assigned topic'] == 8]
medical_care_papers

Unnamed: 0,sha,assigned topic,text
20076,1cbbb93a94f33426a219e0f912f329549fa0f59d,8,The recent severe acute respiratory syndrome ...
20077,02d808df510466b74102dd922486bf922a5ed947,8,"Severe acute respiratory syndrome (SARS), a c..."
20078,4f88e7da806c41f481d1243ec6d3581d1696c360,8,Eukaryotic and most viral mRNAs possess a 5 -...
20079,409c387fb844383d955d8a6d7aace1a4d9e40e63,8,Middle East respiratory syndrome coronavirus ...
20080,6d5ab19bef25b700b2d283b8c57f8f044fb24290,8,The first outbreak of severe acute respirator...
...,...,...,...
21074,e65b607b21bcb4822a4b28666621b56261176a1d,8,Management of inpatients exposed to an outbre...
21075,37852fc821ed2b33624de0347c78bd135be5933f,8,The first introduction to the medical communi...
21076,348b0c8306d3648f4223c802b77ded58758a3fa1,8,An outbreak of atypical pneumonia has wreaked...
21077,d6750dff2734f993de45ffc3020428036b15ac21,8,Middle East respiratory syndrome coronavirus ...
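
Because `pd.merge` performs an inner join by default, only `sha` values present in both frames survive the merge. The sketch below shows this on made-up data; every hash, text, and variable name in it is hypothetical.

```python
import pandas as pd

# Hypothetical cluster assignments and full texts.
clusters = pd.DataFrame({"sha": ["a1", "b2", "c3"], "assigned topic": [8, 0, 8]})
texts = pd.DataFrame({
    "sha": ["a1", "c3", "d4"],
    "text": ["first paper", "third paper", "paper without a topic"],
})

# how="inner" is the default: "b2" (no text) and "d4" (no topic) are dropped.
merged_demo = pd.merge(clusters, texts, on="sha")

# Boolean mask on the topic column, as in the cell above.
topic_8 = merged_demo.loc[merged_demo["assigned topic"] == 8]
print(topic_8)
```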


In [6]:
"""Sample 100 of the medical-care papers."""

mcp_partial = medical_care_papers.sample(100)
mcp_partial

Unnamed: 0,sha,assigned topic,text
20931,cbc7ee25347acd1ad2fef88420dca328d27d8f5d,8,Middle East respiratory syndrome (MERS) is a ...
20746,9dfa774c9dee1273ae85f5c7199e83bbb45eb18b,8,T he 2003 outbreak of severe acute respirator...
21039,6b8f7239c992dfe42aaca36e104a33252619d909,8,The Post Graduate Institute of Medical Educat...
20940,d7c1b2d0dd9e1e7843bcbdbed7986ffdf53eb822,8,"Aprotinin, also known as bovine pancreatic tr..."
20575,20b1e0b8230e2037ba57108c70dd0fbee2c1e44e,8,Middle East respiratory syndrome coronavirus ...
...,...,...,...
20823,018a299cbdcbebd6f4c2584376941907fe7d3aaf,8,Helm and his associates (1981)'s work suggest...
20342,ec85fddb316d71605d2e8e705555005a3b31d3ea,8,Middle East Respiratory Syndrome (MERS) coron...
20997,aaf27331f63d372547f8055f7484dab80551423c,8,"On June 13, 2012, a 60-year-old Saudi man suf..."
20470,d3fb38a79ef539dc92d5db845fd061a22bab33df,8,The increased surveillance was to include a m...


In [7]:
"""Add column of full text split into sentences using nltk sentence tokenizer."""

from nltk.tokenize import sent_tokenize

# Each paper yields a list of sentence strings. Series.apply stores these
# ragged lists as Python objects; np.array() on ragged lists raises a
# ValueError in recent NumPy versions.
mcp_partial["sentences"] = mcp_partial["text"].apply(sent_tokenize)
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences
20931,cbc7ee25347acd1ad2fef88420dca328d27d8f5d,8,Middle East respiratory syndrome (MERS) is a ...,[ Middle East respiratory syndrome (MERS) is a...
20746,9dfa774c9dee1273ae85f5c7199e83bbb45eb18b,8,T he 2003 outbreak of severe acute respirator...,[ T he 2003 outbreak of severe acute respirato...
21039,6b8f7239c992dfe42aaca36e104a33252619d909,8,The Post Graduate Institute of Medical Educat...,[ The Post Graduate Institute of Medical Educa...
20940,d7c1b2d0dd9e1e7843bcbdbed7986ffdf53eb822,8,"Aprotinin, also known as bovine pancreatic tr...","[ Aprotinin, also known as bovine pancreatic t..."
20575,20b1e0b8230e2037ba57108c70dd0fbee2c1e44e,8,Middle East respiratory syndrome coronavirus ...,[ Middle East respiratory syndrome coronavirus...
...,...,...,...,...
20823,018a299cbdcbebd6f4c2584376941907fe7d3aaf,8,Helm and his associates (1981)'s work suggest...,[ Helm and his associates (1981)'s work sugges...
20342,ec85fddb316d71605d2e8e705555005a3b31d3ea,8,Middle East Respiratory Syndrome (MERS) coron...,[ Middle East Respiratory Syndrome (MERS) coro...
20997,aaf27331f63d372547f8055f7484dab80551423c,8,"On June 13, 2012, a 60-year-old Saudi man suf...","[ On June 13, 2012, a 60-year-old Saudi man su..."
20470,d3fb38a79ef539dc92d5db845fd061a22bab33df,8,The increased surveillance was to include a m...,[ The increased surveillance was to include a ...


In [8]:
"""Set up stop words, stemmer, and lemmatizer."""

stop_words = set(stopwords.words('english')) 
snowBallStemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [9]:
"""Tokenize and clean a single sentence (or any other string of text)."""

import re


def tokenize_clean(text):
    # Lowercase the text and split it into word tokens.
    tokens = word_tokenize(text.lower())

    # Lemmatize every token (WordNet lemmatizer, default noun part of speech).
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Drop stop words, tokens of 3 characters or fewer, and tokens containing
    # anything other than letters; stem whatever remains.
    filtered_tokens = []
    for token in tokens:
        if token not in stop_words and len(token) > 3 and re.match(r"^[A-Za-z]+$", token):
            filtered_tokens.append(snowBallStemmer.stem(token))

    return filtered_tokens
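
To make the filter conditions concrete, the sketch below applies the same three checks to a handful of tokens without requiring any NLTK data. The stop-word set is a small hypothetical stand-in for NLTK's English list, `keep_token` is a hypothetical helper name, and lemmatization and stemming are left out.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list.
demo_stop_words = {"with", "were", "that", "this"}

def keep_token(token):
    # Same three conditions as the filter above: not a stop word,
    # longer than 3 characters, and purely alphabetic.
    return (
        token not in demo_stop_words
        and len(token) > 3
        and re.match(r"^[A-Za-z]+$", token) is not None
    )

tokens = ["patients", "were", "treated", "with", "covid-19", "for", "14", "days"]
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

Tokens such as "covid-19" are dropped because they contain characters other than letters, which is worth keeping in mind when reading the selected sentences.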

In [10]:
"""Add column of cleaned and tokenized sentences."""

sent_list = mcp_partial["sentences"].values
tokenized_sentences = []


for i in np.arange(len(sent_list)):
    #sent_arr is a list of lists
    sent_arr = sent_list[i]
    sent_tokens = []
    for j in sent_arr:
        sent_tokens.append(tokenize_clean(j))
    tokenized_sentences.append(sent_tokens)

tokenized_sentences = np.array(tokenized_sentences)

mcp_partial["tokenized sentences"] = tokenized_sentences
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences
20931,cbc7ee25347acd1ad2fef88420dca328d27d8f5d,8,Middle East respiratory syndrome (MERS) is a ...,[ Middle East respiratory syndrome (MERS) is a...,"[[middl, east, respiratori, syndrom, mer, resp..."
20746,9dfa774c9dee1273ae85f5c7199e83bbb45eb18b,8,T he 2003 outbreak of severe acute respirator...,[ T he 2003 outbreak of severe acute respirato...,"[[outbreak, sever, acut, respiratori, syndrom,..."
21039,6b8f7239c992dfe42aaca36e104a33252619d909,8,The Post Graduate Institute of Medical Educat...,[ The Post Graduate Institute of Medical Educa...,"[[post, graduat, institut, medic, educ, resear..."
20940,d7c1b2d0dd9e1e7843bcbdbed7986ffdf53eb822,8,"Aprotinin, also known as bovine pancreatic tr...","[ Aprotinin, also known as bovine pancreatic t...","[[aprotinin, also, known, bovin, pancreat, try..."
20575,20b1e0b8230e2037ba57108c70dd0fbee2c1e44e,8,Middle East respiratory syndrome coronavirus ...,[ Middle East respiratory syndrome coronavirus...,"[[middl, east, respiratori, syndrom, coronavir..."
...,...,...,...,...,...
20823,018a299cbdcbebd6f4c2584376941907fe7d3aaf,8,Helm and his associates (1981)'s work suggest...,[ Helm and his associates (1981)'s work sugges...,"[[helm, associ, work, suggest, govern, public,..."
20342,ec85fddb316d71605d2e8e705555005a3b31d3ea,8,Middle East Respiratory Syndrome (MERS) coron...,[ Middle East Respiratory Syndrome (MERS) coro...,"[[middl, east, respiratori, syndrom, mer, coro..."
20997,aaf27331f63d372547f8055f7484dab80551423c,8,"On June 13, 2012, a 60-year-old Saudi man suf...","[ On June 13, 2012, a 60-year-old Saudi man su...","[[june, saudi, suffer, diseas, character, feve..."
20470,d3fb38a79ef539dc92d5db845fd061a22bab33df,8,The increased surveillance was to include a m...,[ The increased surveillance was to include a ...,"[[increas, surveil, includ, medic, presenc, in..."


In [11]:
"""Given a sorted df and int n, returns the index labels of the top n rows."""

def top_n_index(df, n):
    # Slicing returns fewer than n labels when df has fewer than n rows.
    return df.index[:n].tolist()
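
A short usage sketch on a made-up score table, using a compact stand-in for the helper above. The `scores` frame and its values are hypothetical; the point is that the returned labels are the original row positions, not ranks.

```python
import pandas as pd

def top_n_index(df, n):
    # Index labels of the first n rows of an already sorted frame.
    return df.index[:n].tolist()

# Hypothetical per-sentence scores; the row label is the sentence position.
scores = pd.DataFrame({"sums": [0.4, 1.2, -1.0, 0.9]})
ranked = scores.sort_values("sums", ascending=False)

top_two = top_n_index(ranked, 2)
print(top_two)
```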

In [12]:
"""Returns the indexes of the most important n sentences of the text in the given row, based on the
sum of TF-IDF values divided by log(number of tokens in the sentence + 1) to reduce bias towards longer
sentences. Sentences with fewer than 10 tokens are scored -1 so that they rank last."""

def top_n_sentences(df, row, n):
    vectorizer = TfidfVectorizer()

    # Rebuild one space-joined string per sentence; each sentence acts as a "document".
    token_lists = df["tokenized sentences"].values[row]
    sentence_strings = [" ".join(tokens) for tokens in token_lists]
    sentence_lengths = [len(tokens) for tokens in token_lists]

    tfidf_matrix = vectorizer.fit_transform(sentence_strings)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    feature_names = vectorizer.get_feature_names_out()

    # Rows are sentences, columns are stemmed terms.
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    # Debug output: the TF-IDF matrix of this paper.
    print(tfidf_df)

    sums = []
    for i in np.arange(len(tfidf_df)):
        if sentence_lengths[i] < 10:
            sums.append(-1)
        else:
            sums.append(np.sum(tfidf_df.iloc[i].values) / np.log(sentence_lengths[i] + 1))

    tfidf_df["sums"] = sums
    tfidf_df = tfidf_df.sort_values("sums", ascending=False)

    return top_n_index(tfidf_df, n)
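
The effect of the length normalization can be seen in isolation with a few made-up numbers. In this sketch, `normalized_score` and its `min_tokens` parameter are hypothetical names that mirror the rule in the function above; the raw TF-IDF sums are invented.

```python
import math

def normalized_score(tfidf_sum, n_tokens, min_tokens=10):
    """Score -1 for short sentences, otherwise divide by log(length + 1)."""
    if n_tokens < min_tokens:
        return -1.0
    return tfidf_sum / math.log(n_tokens + 1)

medium = normalized_score(3.0, 10)     # 10 tokens, raw sum 3.0
long_one = normalized_score(5.0, 60)   # 60 tokens, larger raw sum 5.0
short = normalized_score(2.5, 6)       # fewer than 10 tokens

print(medium, long_one, short)
```

With these numbers the 10-token sentence outranks the 60-token sentence even though its raw sum is smaller, which is the intended damping of long sentences.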

In [13]:
"""Adds column of indexes of the top 3 most relevant sentences based on tfidf sum."""

temp = []
for i in np.arange(len(mcp_partial)):
    temp.append(top_n_sentences(mcp_partial, i, 3))

mcp_partial["relevant sentences"] = temp
mcp_partial

[Output truncated: for each sampled paper, the TF-IDF matrix is printed, with one row per sentence and one column per stemmed term (for example `abdomin`, `abl`, `absent`, `acquir`). Most entries are 0.0, and the printed matrices are cut off in the source.]

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences,relevant sentences
20931,cbc7ee25347acd1ad2fef88420dca328d27d8f5d,8,Middle East respiratory syndrome (MERS) is a ...,[ Middle East respiratory syndrome (MERS) is a...,"[[middl, east, respiratori, syndrom, mer, resp...","[58, 67, 59]"
20746,9dfa774c9dee1273ae85f5c7199e83bbb45eb18b,8,T he 2003 outbreak of severe acute respirator...,[ T he 2003 outbreak of severe acute respirato...,"[[outbreak, sever, acut, respiratori, syndrom,...","[8, 51, 35]"
21039,6b8f7239c992dfe42aaca36e104a33252619d909,8,The Post Graduate Institute of Medical Educat...,[ The Post Graduate Institute of Medical Educa...,"[[post, graduat, institut, medic, educ, resear...","[122, 83, 0]"
20940,d7c1b2d0dd9e1e7843bcbdbed7986ffdf53eb822,8,"Aprotinin, also known as bovine pancreatic tr...","[ Aprotinin, also known as bovine pancreatic t...","[[aprotinin, also, known, bovin, pancreat, try...","[13, 41, 36]"
20575,20b1e0b8230e2037ba57108c70dd0fbee2c1e44e,8,Middle East respiratory syndrome coronavirus ...,[ Middle East respiratory syndrome coronavirus...,"[[middl, east, respiratori, syndrom, coronavir...","[183, 188, 225]"
...,...,...,...,...,...,...
20823,018a299cbdcbebd6f4c2584376941907fe7d3aaf,8,Helm and his associates (1981)'s work suggest...,[ Helm and his associates (1981)'s work sugges...,"[[helm, associ, work, suggest, govern, public,...","[2, 18, 36]"
20342,ec85fddb316d71605d2e8e705555005a3b31d3ea,8,Middle East Respiratory Syndrome (MERS) coron...,[ Middle East Respiratory Syndrome (MERS) coro...,"[[middl, east, respiratori, syndrom, mer, coro...","[3, 8, 41]"
20997,aaf27331f63d372547f8055f7484dab80551423c,8,"On June 13, 2012, a 60-year-old Saudi man suf...","[ On June 13, 2012, a 60-year-old Saudi man su...","[[june, saudi, suffer, diseas, character, feve...","[225, 187, 236]"
20470,d3fb38a79ef539dc92d5db845fd061a22bab33df,8,The increased surveillance was to include a m...,[ The increased surveillance was to include a ...,"[[increas, surveil, includ, medic, presenc, in...","[4, 24, 18]"
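
The `relevant sentences` column holds three sentence indices per paper. The ranking idea described at the top of this notebook (sum the TF-IDF scores of the words in each sentence, keep the highest totals) can be illustrated with a minimal sketch. This is not the notebook's own code: `top_sentence_indices` and `toy_scores` are hypothetical names, the scores are made up, and the input is assumed to be laid out like the matrices printed above, with one row per sentence and one column per stemmed term.

```python
def top_sentence_indices(tfidf_rows, k=3):
    """Return the indices of the k rows with the largest summed TF-IDF score.

    tfidf_rows: one list of per-term scores for each sentence (assumed layout).
    """
    # Total score of a sentence = sum of the TF-IDF scores of its terms.
    totals = [sum(row) for row in tfidf_rows]
    # Rank sentence indices from highest to lowest total; ties keep document order.
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return ranked[:k]


# Made-up scores for four sentences over three terms (illustration only).
toy_scores = [
    [0.0, 0.2, 0.0],  # sentence 0
    [0.5, 0.5, 0.0],  # sentence 1
    [0.0, 0.0, 0.0],  # sentence 2
    [0.3, 0.0, 0.4],  # sentence 3
]

print(top_sentence_indices(toy_scores))
```

Because a plain sum is used, longer sentences tend to score higher simply by containing more terms; dividing each total by the sentence length would be one way to counter that, if desired.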


In [14]:
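"""Print the three most relevant sentences of each text in the sample."""

# The row index is not needed, so it is discarded with "_".
for _, row in mcp_partial.iterrows():
    # "relevant sentences" holds the indices of the top-ranked sentences for this paper.
    for i in row["relevant sentences"]:
        print(row["sentences"][i])
        print("\n")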

13, 14 As the infiltrates were localized mainly in the periphery of the lungs, and probably in the interstitial space rather than in the airspace, we also speculated that symptoms due to airway inflammation, such as cough and sputum production, were not prominent.


The present record highlights early detection of a MERS-CoV case is challenging, because 1) exposure history is often absent, 2) presenting symptoms are not specific to MERS, 3) pneumonia can be missed without chest radiography.


A recent case report also highlighted that respiratory symptoms did not develop, although chest CT scan showed multiple patch infiltrates in both lungs.


The questionnaire consisted of 25 questions in different formats (multiple choice, 7-point scale, open-ended, and follow-up streamed questions) and collected information on the following topics: demographics; access to and use of the Internet, radio, television, magazines, and newspapers for general use and as a source for SARS information; perc