# COVID-19 Research Papers Text Extraction

Goal: Answer the question "What has been published about medical care?"

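Print out the most "important" sentences from a sample of papers in the medical care cluster (called cluster 3 in this notebook; the code below filters on `assigned topic == 8`), as determined by the LDA model in "COVID-19 Research Papers LDA Topic Modeling". A sentence's importance is the sum of the TF-IDF scores of its words, where each sentence of a paper is treated as a separate document: a word scores higher the more often it appears in the sentence and the fewer of the paper's other sentences contain it. The sum is then divided by the log of the sentence length to reduce bias towards longer sentences.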
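
The scoring idea can be sketched without any third-party library. The snippet below is only an illustration: the function name `tfidf_sentence_sums` and the toy tokens are hypothetical, and the weighting (raw term count times a smoothed idf, followed by L2 normalisation) is intended to approximate the defaults of scikit-learn's `TfidfVectorizer` rather than reproduce the notebook's exact numbers.

```python
import math
from collections import Counter


def tfidf_sentence_sums(tokenized_sentences):
    """Return the summed, L2-normalised TF-IDF weight of each sentence."""
    n_sentences = len(tokenized_sentences)
    # Sentence frequency: in how many sentences each term occurs
    sentence_freq = Counter(term for sent in tokenized_sentences for term in set(sent))

    sums = []
    for sent in tokenized_sentences:
        if not sent:
            sums.append(0.0)  # an empty sentence carries no weight
            continue
        counts = Counter(sent)
        # tf * idf, with the smoothed idf ln((1 + n) / (1 + sf)) + 1
        weights = [
            count * (math.log((1 + n_sentences) / (1 + sentence_freq[term])) + 1)
            for term, count in counts.items()
        ]
        norm = math.sqrt(sum(w * w for w in weights))
        sums.append(sum(weights) / norm)
    return sums


# Hypothetical, already-cleaned sentences of one paper
toy_paper = [
    ["patient", "care", "hospit"],      # mixes a common term with two rarer ones
    ["patient", "patient", "patient"],  # repeats a single common term
    [],                                 # nothing left after cleaning
]
sums = tfidf_sentence_sums(toy_paper)
```

In this toy paper the first sentence gets a larger sum than the second, because it contains terms that no other sentence uses.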

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jayfeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jayfeng/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jayfeng/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
"""Read in full_texts, containing the full texts of all papers, from data cleaning notebook."""

full_texts = pd.read_csv("full_texts.csv")
full_texts

Unnamed: 0.1,Unnamed: 0,0
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally ..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTEY...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, l..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNER...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently encou..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a pu...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting patt..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastro...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease ..."


In [3]:
"""Read in document_clusters, containing the assigned topic of each paper from LDA clusters notebook."""

df_clusters = pd.read_csv("document_clusters.csv")[["sha", "assigned topic"]]
df_clusters

Unnamed: 0,sha,assigned topic
0,8589358e390499b6c9d95a5eb20b0bfb4bc75466,0
1,be57ba746b8fec268025df6afc68536fbd0188d8,0
2,704eebd9653b61d2dfbe1483d1a616e6836eeec8,0
3,7440e73586a86e38a773cb529bca7ffef7a1afb0,0
4,c56ffdaf1cfbae5a6ed0abea495eaf7fa1cbc031,0
...,...,...
24040,b001ccf1b6c3a6bf6b6928a619a04cebae6c45a8,9
24041,c383b8dedcefbf78e370b0152058176d9219330e,9
24042,de1c5a16a75d5c28d9e3a9be2e928406fee458e4,9
24043,543eecd34daf0351160934011c8da277b883b56c,9


In [4]:
"""Filter out full texts that don't have an assigned topic."""
# The first column of full_texts holds the sha and the second the full text.
# isin() checks every sha in one vectorized pass instead of looping over the rows.
has_cluster = full_texts.iloc[:, 0].isin(df_clusters["sha"])

valid_full_texts = full_texts.loc[has_cluster].iloc[:, :2].copy()
valid_full_texts.columns = ["sha", "text"]

valid_full_texts

Unnamed: 0,sha,text
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally ..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTEY...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, l..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNER...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently encou..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a pu...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting patt..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastro...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease ..."


In [5]:
"""Merge df_clusters and valid_full_texts by sha, then select the papers relevant to medical care (called cluster 3 in this notebook; the filter below uses assigned topic 8)."""

merged = pd.merge(df_clusters, valid_full_texts, on="sha")

medical_care_papers = merged.loc[merged['assigned topic'] == 8]
medical_care_papers

Unnamed: 0,sha,assigned topic,text
18740,85600d58166573ed016072d757f3b6e133098b74,8,Both the humoral and cellular components of th...
18741,c1ae608c7ffb926a0f50a6a34c0780983274ea74,8,"Novel coronaviruses (COVID- 19) , which was fo..."
18742,964442dc964b2d97ceef1883ce2353e866efd8c2,8,Difficulties exist in the detection of etiolog...
18743,d94e5b9c6acf81bf4e1e4966129b3281af3cf728,8,Well-controlled clinical trials are the most e...
18744,faac6cddc8f0126d5f93e948689eead41f54ad1f,8,Tonsillitis with associated tonsillar hypertro...
...,...,...,...
20551,3208852c9234f3d7d415af376bdb22cdb95f5bba,8,How emergency departments of different levels ...
20552,494fcd165e1795794862bfc44aa019e280fb695e,8,able to reach consensus (80% of committee memb...
20553,74c245d5a0275629dce204354579d47071dffb85,8,Various indexes of premature death are propose...
20554,7ef2b2ac31e6f1c498641e1b775c7120152577ff,8,Objective monitoring of nasal patency and nasa...


In [6]:
"""Sample 100 papers from the medical care cluster."""

# No random_state is passed, so every run draws a different sample
mcp_partial = medical_care_papers.sample(100)
mcp_partial

Unnamed: 0,sha,assigned topic,text
19091,d715943a776ca56fc921de789ec636f8f6866dd0,8,The 2014-2016 West Africa Ebola virus disease ...
19291,957d77501f48dcd482bfa9491c169ffb781a3277,8,Cytomegalovirus (CMV) latently infects up to 7...
20483,25f32c16d43ab75509ee5468b322c09553bea898,8,r 2014 Elsevier Inc. All rights reserved.Autog...
20471,9fa838a7ec9e083660848bf720e4d13d7d6f372f,8,T he patterns of spread of severe acute respir...
20044,99ebc69b9e40e9d3736b3c6494bb0911fd635356,8,Ischaemia-reperfusion (I/R) contributes to the...
...,...,...,...
18850,1ad1d4b84aea4ceaf05c62a1ad04e7150f7f4684,8,SARS-CoV-2 is a recently named novel coronavir...
20345,21fcf6e2d4e563502141ef61141de43d26f6676c,8,État des connaissances La survenue d'un troubl...
20005,ab77b7faa3d2ca4577c472c8bcdeecbfe6c0373d,8,The aim of this study was to identify possible...
19706,1bc049b8dc09a80fd35d6aacbfda2b80e5f41b2b,8,Le but essentiel de la prise en charge d'un pa...


In [7]:
"""Add column of full text split into sentences using nltk sentence tokenizer."""

from nltk.tokenize import sent_tokenize

# One list of sentences per paper
sentences = [sent_tokenize(text) for text in mcp_partial["text"]]

# The lists have different lengths, so store them in an object Series rather than
# converting with np.array(), which raises an error on ragged input in recent NumPy versions
mcp_partial["sentences"] = pd.Series(sentences, index=mcp_partial.index)
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences
19091,d715943a776ca56fc921de789ec636f8f6866dd0,8,The 2014-2016 West Africa Ebola virus disease ...,[The 2014-2016 West Africa Ebola virus disease...
19291,957d77501f48dcd482bfa9491c169ffb781a3277,8,Cytomegalovirus (CMV) latently infects up to 7...,[Cytomegalovirus (CMV) latently infects up to ...
20483,25f32c16d43ab75509ee5468b322c09553bea898,8,r 2014 Elsevier Inc. All rights reserved.Autog...,[r 2014 Elsevier Inc. All rights reserved.Auto...
20471,9fa838a7ec9e083660848bf720e4d13d7d6f372f,8,T he patterns of spread of severe acute respir...,[T he patterns of spread of severe acute respi...
20044,99ebc69b9e40e9d3736b3c6494bb0911fd635356,8,Ischaemia-reperfusion (I/R) contributes to the...,[Ischaemia-reperfusion (I/R) contributes to th...
...,...,...,...,...
18850,1ad1d4b84aea4ceaf05c62a1ad04e7150f7f4684,8,SARS-CoV-2 is a recently named novel coronavir...,[SARS-CoV-2 is a recently named novel coronavi...
20345,21fcf6e2d4e563502141ef61141de43d26f6676c,8,État des connaissances La survenue d'un troubl...,[État des connaissances La survenue d'un troub...
20005,ab77b7faa3d2ca4577c472c8bcdeecbfe6c0373d,8,The aim of this study was to identify possible...,[The aim of this study was to identify possibl...
19706,1bc049b8dc09a80fd35d6aacbfda2b80e5f41b2b,8,Le but essentiel de la prise en charge d'un pa...,[Le but essentiel de la prise en charge d'un p...


In [8]:
"""Set up stop words, stemmer, and lemmatizer."""

stop_words = set(stopwords.words('english')) 
snowBallStemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [9]:
"""Tokenize and clean a string (applied below to every sentence of every paper)."""

import re


def tokenize_clean(abstract):
    # Lowercase the string and split it into tokens
    tokens = word_tokenize(abstract.lower())

    # Lemmatize each token
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Keep purely alphabetic, non-stop-word tokens longer than 3 characters, then stem them
    filtered_tokens = []
    for token in tokens:
        if token not in stop_words and len(token) > 3 and re.match(r"^[A-Za-z]+$", token):
            filtered_tokens.append(snowBallStemmer.stem(token))

    return filtered_tokens
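
The filtering rule can be tried in isolation. The sketch below is a simplified stand-in, not the notebook's function: it splits on whitespace instead of calling `word_tokenize`, skips lemmatization and stemming, and uses a hypothetical five-word stop list (`DEMO_STOP_WORDS`) in place of NLTK's English stop words.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration
DEMO_STOP_WORDS = {"this", "that", "with", "from", "were"}


def filter_tokens(text):
    """Keep alphabetic, non-stop-word tokens longer than 3 characters."""
    kept = []
    for token in text.lower().split():
        if (
            token not in DEMO_STOP_WORDS
            and len(token) > 3
            and re.match(r"^[A-Za-z]+$", token)
        ):
            kept.append(token)
    return kept


demo = filter_tokens("Patients with COVID-19 were treated in 2020 using oxygen therapy")
# demo == ["patients", "treated", "using", "oxygen", "therapy"]
```

Tokens that contain digits or hyphens, such as "covid-19" and "2020", are dropped by the alphabetic check, which is why the tokenized output above contains no numbers.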

In [10]:
"""Add column of cleaned and tokenized sentences."""

sent_list = mcp_partial["sentences"].values

# Each paper becomes a list of sentences, and each sentence a list of cleaned tokens
tokenized_sentences = [
    [tokenize_clean(sentence) for sentence in paper_sentences]
    for paper_sentences in sent_list
]

# The nested lists are ragged, so store them in an object Series rather than
# converting with np.array(), which raises an error on ragged input in recent NumPy versions
mcp_partial["tokenized sentences"] = pd.Series(tokenized_sentences, index=mcp_partial.index)
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences
19091,d715943a776ca56fc921de789ec636f8f6866dd0,8,The 2014-2016 West Africa Ebola virus disease ...,[The 2014-2016 West Africa Ebola virus disease...,"[[west, africa, ebola, virus, diseas, outbreak..."
19291,957d77501f48dcd482bfa9491c169ffb781a3277,8,Cytomegalovirus (CMV) latently infects up to 7...,[Cytomegalovirus (CMV) latently infects up to ...,"[[cytomegalovirus, latent, infect, general, po..."
20483,25f32c16d43ab75509ee5468b322c09553bea898,8,r 2014 Elsevier Inc. All rights reserved.Autog...,[r 2014 Elsevier Inc. All rights reserved.Auto...,"[[elsevi, right, vaccin, vaccin, made, microor..."
20471,9fa838a7ec9e083660848bf720e4d13d7d6f372f,8,T he patterns of spread of severe acute respir...,[T he patterns of spread of severe acute respi...,"[[pattern, spread, sever, acut, respiratori, s..."
20044,99ebc69b9e40e9d3736b3c6494bb0911fd635356,8,Ischaemia-reperfusion (I/R) contributes to the...,[Ischaemia-reperfusion (I/R) contributes to th...,"[[contribut, pathophysiolog, mani, clinic, pro..."
...,...,...,...,...,...
18850,1ad1d4b84aea4ceaf05c62a1ad04e7150f7f4684,8,SARS-CoV-2 is a recently named novel coronavir...,[SARS-CoV-2 is a recently named novel coronavi...,"[[recent, name, novel, coronavirus, respons, o..."
20345,21fcf6e2d4e563502141ef61141de43d26f6676c,8,État des connaissances La survenue d'un troubl...,[État des connaissances La survenue d'un troub...,"[[connaiss, survenu, troubl, ventilatoir, obst..."
20005,ab77b7faa3d2ca4577c472c8bcdeecbfe6c0373d,8,The aim of this study was to identify possible...,[The aim of this study was to identify possibl...,"[[studi, identifi, possibl, risk, factor, calf..."
19706,1bc049b8dc09a80fd35d6aacbfda2b80e5f41b2b,8,Le but essentiel de la prise en charge d'un pa...,[Le but essentiel de la prise en charge d'un p...,"[[essentiel, prise, charg, patient, ayant, pne..."


In [11]:
"""Given a sorted df and int n, returns the index labels of the top n rows (or of all rows if there are fewer than n)."""

def top_n_index(df, n):
    # Slicing the index avoids an IndexError when the df has fewer than n rows
    return df.index[:n].tolist()

In [12]:
"""Returns the indexes of the n most important sentences of the text in the given row. Each sentence is scored by
the sum of its TF-IDF values divided by log(number of tokens in the sentence + 1) to reduce bias towards longer
sentences. Sentences with fewer than 10 tokens get a score of -1 so that they rank last."""

def top_n_sentences(df, row, n):
    vectorizer = TfidfVectorizer()

    # Each sentence of the paper is treated as one document for TF-IDF
    list_of_lists = df["tokenized sentences"].values[row]
    list_of_strings = [" ".join(tokens) for tokens in list_of_lists]
    sentence_lengths = [len(tokens) for tokens in list_of_lists]

    tfidf_matrix = vectorizer.fit_transform(list_of_strings)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()

    # One row per sentence, one column per term
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    # Debug output: prints the TF-IDF matrix of every paper that is processed
    print(tfidf_df)

    sums = []
    for i in np.arange(len(tfidf_df)):
        if sentence_lengths[i] < 10:
            sums.append(-1)
        else:
            sums.append(np.sum(tfidf_df.iloc[i].values) / np.log(sentence_lengths[i] + 1))

    tfidf_df["sums"] = sums
    tfidf_df = tfidf_df.sort_values("sums", ascending=False)

    return top_n_index(tfidf_df, n)
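
The effect of the length threshold and the log damping can be checked on made-up numbers. The following is a minimal sketch, assuming the TF-IDF sums are already computed; `rank_sentences`, `min_tokens`, and the input values are hypothetical and chosen only for illustration.

```python
import math


def rank_sentences(tfidf_sums, sentence_lengths, n, min_tokens=10):
    """Return the positions of the n best sentences and all damped scores."""
    scores = []
    for total, length in zip(tfidf_sums, sentence_lengths):
        if length < min_tokens:
            scores.append(-1.0)  # too short: pushed to the bottom of the ranking
        else:
            scores.append(total / math.log(length + 1))
    # Sort positions by score, best first, and keep at most n of them
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n], scores


# Four hypothetical sentences: raw TF-IDF sums and token counts
top, scores = rank_sentences([2.0, 3.5, 3.0, 4.0], [4, 12, 30, 60], n=2)
# top == [1, 3]
```

The 60-token sentence has the largest raw sum, but after damping it ranks below the 12-token sentence, and the 4-token sentence is excluded by the threshold.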

In [13]:
"""Adds column of indexes of the top 3 most relevant sentences based on tfidf sum."""

temp = []
for i in np.arange(len(mcp_partial)):
    temp.append(top_n_sentences(mcp_partial, i, 3))

mcp_partial["relevant sentences"] = temp
mcp_partial

[Output: for each sampled paper, the printed TF-IDF matrix (one row per sentence, one column per stemmed term); truncated.]
88     0.0     0.0     0.0  0.00000    0.0     0.0      0.0  0.000000    0.0   

    aggrav  ...  variat  vem  ventilato

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences,relevant sentences
19091,d715943a776ca56fc921de789ec636f8f6866dd0,8,The 2014-2016 West Africa Ebola virus disease ...,[The 2014-2016 West Africa Ebola virus disease...,"[[west, africa, ebola, virus, diseas, outbreak...","[15, 64, 74]"
19291,957d77501f48dcd482bfa9491c169ffb781a3277,8,Cytomegalovirus (CMV) latently infects up to 7...,[Cytomegalovirus (CMV) latently infects up to ...,"[[cytomegalovirus, latent, infect, general, po...","[60, 29, 94]"
20483,25f32c16d43ab75509ee5468b322c09553bea898,8,r 2014 Elsevier Inc. All rights reserved.Autog...,[r 2014 Elsevier Inc. All rights reserved.Auto...,"[[elsevi, right, vaccin, vaccin, made, microor...","[23, 541, 192]"
20471,9fa838a7ec9e083660848bf720e4d13d7d6f372f,8,T he patterns of spread of severe acute respir...,[T he patterns of spread of severe acute respi...,"[[pattern, spread, sever, acut, respiratori, s...","[3, 67, 2]"
20044,99ebc69b9e40e9d3736b3c6494bb0911fd635356,8,Ischaemia-reperfusion (I/R) contributes to the...,[Ischaemia-reperfusion (I/R) contributes to th...,"[[contribut, pathophysiolog, mani, clinic, pro...","[124, 106, 60]"
...,...,...,...,...,...,...
18850,1ad1d4b84aea4ceaf05c62a1ad04e7150f7f4684,8,SARS-CoV-2 is a recently named novel coronavir...,[SARS-CoV-2 is a recently named novel coronavi...,"[[recent, name, novel, coronavirus, respons, o...","[53, 4, 42]"
20345,21fcf6e2d4e563502141ef61141de43d26f6676c,8,État des connaissances La survenue d'un troubl...,[État des connaissances La survenue d'un troub...,"[[connaiss, survenu, troubl, ventilatoir, obst...","[22, 36, 28]"
20005,ab77b7faa3d2ca4577c472c8bcdeecbfe6c0373d,8,The aim of this study was to identify possible...,[The aim of this study was to identify possibl...,"[[studi, identifi, possibl, risk, factor, calf...","[64, 212, 85]"
19706,1bc049b8dc09a80fd35d6aacbfda2b80e5f41b2b,8,Le but essentiel de la prise en charge d'un pa...,[Le but essentiel de la prise en charge d'un p...,"[[essentiel, prise, charg, patient, ayant, pne...","[271, 56, 142]"
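
The "relevant sentences" column above holds three sentence indices per paper. As a minimal sketch of how such indices could be derived from a per-sentence TF-IDF matrix like the ones printed earlier, the helper below sums each row's scores and keeps the three largest totals. The name `top_k_sentences` and the toy `scores` matrix are hypothetical and only illustrate the idea; they are not taken from the notebook's own code.

```python
def top_k_sentences(score_rows, k=3):
    """Return the indices of the k rows with the largest total TF-IDF score.

    score_rows: one sequence of per-term scores per sentence (hypothetical
    input, e.g. the rows of a TF-IDF DataFrame converted to lists).
    """
    # Importance of a sentence = sum of the TF-IDF scores of its terms.
    totals = [sum(row) for row in score_rows]
    # Sort indices by descending total; ties keep the lower index first
    # because Python's sort is stable.
    ranked = sorted(range(len(totals)), key=lambda i: -totals[i])
    return ranked[:k]


# Toy matrix: 4 sentences (rows) x 3 stemmed terms (columns).
scores = [
    [0.0, 0.32, 0.0],
    [0.39, 0.0, 0.2],
    [0.0, 0.0, 0.0],
    [0.1, 0.1, 0.5],
]
top_three = top_k_sentences(scores)
```

Because the ranking uses a plain sum, longer sentences with many scored terms tend to rank higher. Dividing each total by the number of non-zero terms would be one way to reduce that bias, if it proves to be a problem.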


In [14]:
"""Print the three most relevant sentences of each text in the sample."""

for index, row in mcp_partial.iterrows():
    for i in row["relevant sentences"]:
        print(row["sentences"][i])
        print("\n")

Results were analyzed to determine current levels of HID training, education, current protocols and procedures in place at the respondent's agency, and potential differences in levels of certification or position responsibilities correlated to differences in HID knowledge to determine areas of HID training and education that can be bolstered in this industry to increase occupational safety and health in EMS practitioners.Adapted from vetted checklists used by the European Network of Highly Infectious Disease Units to survey the capabilities, training, and resources available at European high-level containment facilities, 25 an EMS-specific gap analysis survey was developed, reviewed by subject matter experts in local and national EMS organizations, and administered during July 2016 utilizing Qualtrics Software version 2016.17 (Provo, UT) (Indiana University institutional review board exemption No.


This could be achieved by adding specific competencies on HIDs within continuing educat