# Text Extraction

Goal: Answer the question "What has been published about medical care?"

Print out the most "important" sentences from a sample of papers labeled cluster 3, as determined by the LDA model in "COVID-19 Research Papers LDA Topic Modeling". A sentence's importance is the sum of the TF-IDF scores of its words; a word scores highly when it appears often in that sentence but in few other sentences of the same paper.
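
As a rough illustration of the scoring idea, the sketch below computes TF-IDF by hand for three toy sentences and sums the weights per sentence. It treats each sentence as a "document" and uses raw term counts, a smoothed IDF, and L2 normalization, which are the documented defaults of scikit-learn's `TfidfVectorizer`. The function name `tfidf_sentence_scores` and the toy sentences are assumptions made for illustration; none of the cleaning steps used later in this notebook are applied.

```python
import math
from collections import Counter

def tfidf_sentence_scores(tokenized_sentences):
    """Return the sum of L2-normalized TF-IDF weights for each sentence."""
    n = len(tokenized_sentences)
    # document frequency: number of sentences that contain each term
    df = Counter(term for sent in tokenized_sentences for term in set(sent))
    scores = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        # raw term count times smoothed inverse document frequency
        weights = [tf * (math.log((1 + n) / (1 + df[term])) + 1)
                   for term, tf in counts.items()]
        norm = math.sqrt(sum(w * w for w in weights))
        scores.append(sum(weights) / norm if norm > 0 else 0.0)
    return scores

# hypothetical, already-stemmed toy sentences
toy = [
    ["hospit", "infect", "control"],
    ["hospit", "staff"],
    ["vaccin", "trial", "hospit", "infect"],
]
scores = tfidf_sentence_scores(toy)
```

Because each non-empty sentence vector has unit length, its sum lies between 1 and the square root of the number of distinct terms, so sentences with more distinct terms can reach higher totals.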

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords 
#nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
#nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
#nltk.download('wordnet')

Read in `full_texts`, containing the full texts of all papers, from the data cleaning notebook.

In [2]:
full_texts = pd.read_csv("full_texts.csv")
full_texts.head()

Unnamed: 0.1,Unnamed: 0,0
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."


Read in `document_clusters.csv`, containing the assigned topic of each paper, from the LDA clusters notebook.

In [3]:
df_clusters = pd.read_csv("document_clusters.csv")[["sha", "assigned topic"]]
df_clusters

Unnamed: 0,sha,assigned topic
0,87d5b80231b8956e498791ab3507f0f1ca529be8,0
1,bb5d7caba7ff8afec3c1fde62cadf65db745ce35,0
2,c48208dd7beb82687c017d5d1b44bde08de0990a,0
3,e1e50e72368b173717a1b7f886c65c61ce5ade59,0
4,dfe64ba8bf59ab09b956b293c47a3798df11b31a,0
...,...,...
9995,cdcb57ca4a03749e961145c638a2ba30aea8a1d8,9
9996,f210cb3da35e11e4e268372baad244fbdc140ab1,9
9997,e5adf7aca5308543613d1f966616c2fbec18ec21,9
9998,eb289af3995ebdf5654f8553444b4dbc6ca30667,9


Filter out full texts that don't have an assigned topic.

In [4]:
has_cluster = df_clusters["sha"].to_numpy()

# keep only the papers whose sha (first column) has an assigned topic;
# the second column holds the full text
mask = full_texts.iloc[:, 0].isin(has_cluster)
valid_full_texts = full_texts.loc[mask].iloc[:, [0, 1]].copy()
valid_full_texts.columns = ["sha", "text"]

valid_full_texts.head()

Unnamed: 0,sha,text
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
10,d9d3627bd3e93877a8934f06db472f3d641bbc99,Restriction of virus replication by macrophag...
11,005d48b545794f09d6db2d03a770466dcacaf7c2,"Carbocyclic nucleoside analogues, which conta..."


Merge `df_clusters` and `valid_full_texts` on `sha`, then select the papers relevant to medical care (cluster 3).


In [5]:
merged = pd.merge(df_clusters, valid_full_texts, on="sha")

medical_care_papers = merged.loc[merged['assigned topic'] == 8]
medical_care_papers.head()

Unnamed: 0,sha,assigned topic,text
8417,f1e09c7e9de97c71b670d0a1db786ba24a1216bf,8,Vaccination has historically and remains one ...
8418,a9d9fca4fb167b81456adbef78d4ff7be77d8b8f,8,N osocomial infections affect nearly 10% of h...
8419,1a392cbb2aa69bde3a1cc82dd07226328fee713b,8,"Early in transfusion compatibility, hemolysis..."
8420,9ac01047c360e0def0d96adf2e59f6e5bd68b3b7,8,Hospital-acquired (nosocomial) infections pos...
8421,357d5838127aad0439fe56f9a619d11919e51d01,8,"During autumn 2005, we conducted 3,436 interv..."
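
The merge-then-filter step can be tried on toy data. This is a minimal sketch with hypothetical frames (`clusters`, `texts`) standing in for `df_clusters` and `valid_full_texts`; it relies on `pd.merge` performing an inner join by default, so a `sha` that is missing from either frame is dropped.

```python
import pandas as pd

# hypothetical toy frames; the sha values and texts are made up
clusters = pd.DataFrame({"sha": ["a1", "b2", "c3"], "assigned topic": [8, 8, 0]})
texts = pd.DataFrame({"sha": ["a1", "c3", "d4"], "text": ["first", "third", "fourth"]})

# inner join (the default): only sha values present in both frames survive
toy_merged = pd.merge(clusters, texts, on="sha")

# a boolean mask selects the rows of a single topic
topic_8 = toy_merged.loc[toy_merged["assigned topic"] == 8]
```

Here `b2` has topic 8 but no full text, so it does not appear in `topic_8`.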


Sample 100 papers from cluster 3.


In [6]:
mcp_partial = medical_care_papers.sample(100)
mcp_partial.head()

Unnamed: 0,sha,assigned topic,text
8916,0d578824fe7a24ea08124982a21a7fa9f7011f5f,8,The primary responsibility of the boar stud i...
8743,1a9fdd51745b132ee92107b4e9c68597d3767b7e,8,"In this paper, we review examples of these no..."
8640,a42ff48d50aa3a0ebfe840d46ed2204c49955442,8,In addition to the impact of the COVID-19 on ...
9041,64504909cf5e7b5249b8c617c2d1973a3c301e85,8,The United States-Japan Cooperative Medical S...
8833,f3f471d10a36a7a28e9050c10bd4dfd680cba17b,8,Preparedness against pandemic influenza has b...
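
Because `sample(100)` draws a different subset on each run, the sentences printed at the end of the notebook will differ between runs. If reproducibility matters, one option is to pass a fixed seed. The sketch below uses a hypothetical toy frame and an arbitrary seed of 0; neither is part of the original notebook.

```python
import pandas as pd

# hypothetical stand-in for medical_care_papers
toy_papers = pd.DataFrame({"sha": [f"id{i}" for i in range(20)]})

# the same random_state yields the same rows on every call
first = toy_papers.sample(5, random_state=0)
second = toy_papers.sample(5, random_state=0)
```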


Add column of full text split into sentences using nltk sentence tokenizer.

In [8]:
from nltk.tokenize import sent_tokenize

# split each paper's full text into a list of sentences; the lists have
# different lengths, so they are stored as plain Python lists rather than
# converted to a NumPy array
sentences = [sent_tokenize(text) for text in mcp_partial["text"].values]

mcp_partial["sentences"] = sentences
mcp_partial



Unnamed: 0,sha,assigned topic,text,sentences
8916,0d578824fe7a24ea08124982a21a7fa9f7011f5f,8,The primary responsibility of the boar stud i...,[ The primary responsibility of the boar stud ...
8743,1a9fdd51745b132ee92107b4e9c68597d3767b7e,8,"In this paper, we review examples of these no...","[ In this paper, we review examples of these n..."
8640,a42ff48d50aa3a0ebfe840d46ed2204c49955442,8,In addition to the impact of the COVID-19 on ...,[ In addition to the impact of the COVID-19 on...
9041,64504909cf5e7b5249b8c617c2d1973a3c301e85,8,The United States-Japan Cooperative Medical S...,[ The United States-Japan Cooperative Medical ...
8833,f3f471d10a36a7a28e9050c10bd4dfd680cba17b,8,Preparedness against pandemic influenza has b...,[ Preparedness against pandemic influenza has ...
...,...,...,...,...
9005,bfa379b686f05c7dec75708cfa044c653e8dfaaa,8,"t is not uncommon to hear, ""Don't worry, we h...","[ t is not uncommon to hear, ""Don't worry, we ..."
9179,f5fb2153b992cdc2d4bdb3678e667126c9a27689,8,The 'One Health' concept states simply that t...,[ The 'One Health' concept states simply that ...
9153,a125265919886034e8bec3a56c600656de445cb4,8,The spread of infectious organisms within hea...,[ The spread of infectious organisms within he...
9024,328b36edcc6169a2aceba2c7e6859d1770d6a81d,8,Antimicrobial resistance (AMR) has gained con...,[ Antimicrobial resistance (AMR) has gained co...


Set up stop words, stemmer, and lemmatizer.

In [9]:
stop_words = set(stopwords.words('english')) 
snowBallStemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

Define `tokenize_clean`, which tokenizes, lemmatizes, filters, and stems a string. It is applied to each sentence below.

In [10]:
import re


def tokenize_clean(abstract):
    # lowercase the string and split it into word tokens
    tokens = word_tokenize(abstract.lower())

    # lemmatize each token
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # drop stop words, tokens of 3 characters or fewer, and tokens that are
    # not purely alphabetic; stem whatever remains
    filtered_tokens = []
    for i in tokens:
        if i not in stop_words and len(i) > 3 and re.match(r"^[A-Za-z]+$", i):
            stemmed_word = snowBallStemmer.stem(i)
            filtered_tokens.append(stemmed_word)

    return filtered_tokens
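
To see what the filter in `tokenize_clean` keeps, the condition can be tried on its own. The sketch below applies the same three checks to a hand-written token list, with a tiny stand-in stop-word set instead of NLTK's English list; lemmatizing and stemming are left out. `keep_token`, `toy_stop_words`, and the tokens are hypothetical names and data.

```python
import re

# small stand-in for stopwords.words('english'); an assumption for this sketch
toy_stop_words = {"with", "were", "that", "this", "from"}

def keep_token(token, stop_words):
    """Same condition as in tokenize_clean: not a stop word, longer than
    3 characters, and made up of ASCII letters only."""
    return (token not in stop_words
            and len(token) > 3
            and re.match(r"^[A-Za-z]+$", token) is not None)

tokens = ["patients", "with", "covid-19", "icu", "were", "ventilated", "2020", "."]
kept = [t for t in tokens if keep_token(t, toy_stop_words)]
```

Note that the length check drops short domain terms such as "icu", and the letters-only check drops tokens such as "covid-19".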

Add column of cleaned and tokenized sentences.

In [11]:
sent_list = mcp_partial["sentences"].values
tokenized_sentences = []

for sent_arr in sent_list:
    # sent_arr is the list of sentence strings for one paper;
    # each sentence becomes a list of cleaned tokens
    tokenized_sentences.append([tokenize_clean(s) for s in sent_arr])

mcp_partial["tokenized sentences"] = tokenized_sentences
mcp_partial.head()



Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences
8916,0d578824fe7a24ea08124982a21a7fa9f7011f5f,8,The primary responsibility of the boar stud i...,[ The primary responsibility of the boar stud ...,"[[primari, respons, boar, stud, provid, genet,..."
8743,1a9fdd51745b132ee92107b4e9c68597d3767b7e,8,"In this paper, we review examples of these no...","[ In this paper, we review examples of these n...","[[paper, review, exampl, novel, applic, detect..."
8640,a42ff48d50aa3a0ebfe840d46ed2204c49955442,8,In addition to the impact of the COVID-19 on ...,[ In addition to the impact of the COVID-19 on...,"[[addit, impact, peopl, emot, peopl, cope, str..."
9041,64504909cf5e7b5249b8c617c2d1973a3c301e85,8,The United States-Japan Cooperative Medical S...,[ The United States-Japan Cooperative Medical ...,"[[unit, cooper, medic, scienc, program, usjcms..."
8833,f3f471d10a36a7a28e9050c10bd4dfd680cba17b,8,Preparedness against pandemic influenza has b...,[ Preparedness against pandemic influenza has ...,"[[prepared, pandem, influenza, becom, high, pr..."


Define `top_n_index(df, n)`, which takes a sorted DataFrame `df` and an integer `n`, and returns the index labels of the top `n` rows.

In [12]:
def top_n_index(df, n):
    # index labels of the first n rows; returns fewer labels (instead of
    # raising IndexError) when df has fewer than n rows
    return df.index[:n].tolist()
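
A small usage sketch, with a hypothetical toy frame, shows why the helper returns index labels rather than positions: after sorting, the labels still point back to the original rows, which is what lets them be used later to look up sentences. The helper is repeated here in compact form so the block runs on its own.

```python
import pandas as pd

def top_n_index(df, n):
    # index labels of the first n rows, as in the notebook's helper
    return [df.iloc[i].name for i in range(n)]

# hypothetical scores for four sentences (rows 0 to 3)
toy_scores = pd.DataFrame({"sums": [0.4, 1.7, -1.0, 0.9]})
ranked = toy_scores.sort_values("sums", ascending=False)
top_two = top_n_index(ranked, 2)
```

`top_two` holds the labels 1 and 3, the original row numbers of the two highest scores.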

Define `top_n_sentences(df, row, n)`, which returns the indexes of the `n` most important sentences of the text in the given row. Each sentence is scored by the sum of its TF-IDF values divided by the log of (number of tokens + 1), which reduces the bias toward longer sentences. Sentences with fewer than 10 tokens are given a score of -1 so that they rank last.
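
Reading off the code in the cell that follows, the score of a sentence $s$ can be written as below. The notation is introduced here for illustration: $|s|$ is the number of cleaned tokens in $s$, the sum runs over the terms in the paper's vocabulary, and $\ln$ is the natural logarithm (`np.log`).

$$
\mathrm{score}(s) =
\begin{cases}
-1, & |s| < 10 \\[4pt]
\dfrac{\sum_{w} \mathrm{tfidf}(w, s)}{\ln(|s| + 1)}, & \text{otherwise}
\end{cases}
$$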


In [13]:
def top_n_sentences(df, row, n):
    vectorizer = TfidfVectorizer()

    # rebuild each tokenized sentence as a string and record its token count
    list_of_lists = df["tokenized sentences"].values[row]
    list_of_strings = []
    sentence_lengths = []
    for i in list_of_lists:
        list_of_strings.append(" ".join(i))
        sentence_lengths.append(len(i))

    # each sentence is treated as one document when fitting TF-IDF
    tfidf_matrix = vectorizer.fit_transform(list_of_strings)

    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()

    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    # debug output: one TF-IDF matrix per paper
    print(tfidf_df)

    sums = []
    for i in np.arange(len(tfidf_df)):
        if sentence_lengths[i] < 10:
            # sentences with fewer than 10 tokens rank last
            sums.append(-1)
        else:
            sums.append(np.sum(tfidf_df.iloc[i].values) / np.log(sentence_lengths[i] + 1))

    tfidf_df["sums"] = sums
    tfidf_df = tfidf_df.sort_values("sums", ascending=False)

    return top_n_index(tfidf_df, n)

Add column of indexes of the top 3 most relevant sentences based on TF-IDF sum.

In [14]:
temp = []
for i in np.arange(len(mcp_partial)):
    temp.append(top_n_sentences(mcp_partial, i, 3))

mcp_partial["relevant sentences"] = temp
mcp_partial

     abil  abl  absent  access  accord  activ  actual   ad     adapt  addit  \
0     0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
1     0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
2     0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.338586    0.0   
3     0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
4     0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
..    ...  ...     ...     ...     ...    ...     ...  ...       ...    ...   
198   0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
199   0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
200   0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
201   0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   
202   0.0  0.0     0.0     0.0     0.0    0.0     0.0  0.0  0.000000    0.0   

[Output truncated: a TF-IDF matrix like the one above (rows are sentences, columns are stemmed terms) is printed for each paper in the sample.]

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences,relevant sentences
8916,0d578824fe7a24ea08124982a21a7fa9f7011f5f,8,The primary responsibility of the boar stud i...,[ The primary responsibility of the boar stud ...,"[[primari, respons, boar, stud, provid, genet,...","[174, 19, 100]"
8743,1a9fdd51745b132ee92107b4e9c68597d3767b7e,8,"In this paper, we review examples of these no...","[ In this paper, we review examples of these n...","[[paper, review, exampl, novel, applic, detect...","[36, 10, 24]"
8640,a42ff48d50aa3a0ebfe840d46ed2204c49955442,8,In addition to the impact of the COVID-19 on ...,[ In addition to the impact of the COVID-19 on...,"[[addit, impact, peopl, emot, peopl, cope, str...","[17, 134, 32]"
9041,64504909cf5e7b5249b8c617c2d1973a3c301e85,8,The United States-Japan Cooperative Medical S...,[ The United States-Japan Cooperative Medical ...,"[[unit, cooper, medic, scienc, program, usjcms...","[138, 77, 2]"
8833,f3f471d10a36a7a28e9050c10bd4dfd680cba17b,8,Preparedness against pandemic influenza has b...,[ Preparedness against pandemic influenza has ...,"[[prepared, pandem, influenza, becom, high, pr...","[55, 17, 52]"
...,...,...,...,...,...,...
9005,bfa379b686f05c7dec75708cfa044c653e8dfaaa,8,"t is not uncommon to hear, ""Don't worry, we h...","[ t is not uncommon to hear, ""Don't worry, we ...","[[uncommon, hear, worri, disast, plan, ask, re...","[2, 107, 8]"
9179,f5fb2153b992cdc2d4bdb3678e667126c9a27689,8,The 'One Health' concept states simply that t...,[ The 'One Health' concept states simply that ...,"[[health, concept, state, simpli, seamless, in...","[92, 94, 73]"
9153,a125265919886034e8bec3a56c600656de445cb4,8,The spread of infectious organisms within hea...,[ The spread of infectious organisms within he...,"[[spread, infecti, organ, within, health, care...","[156, 26, 27]"
9024,328b36edcc6169a2aceba2c7e6859d1770d6a81d,8,Antimicrobial resistance (AMR) has gained con...,[ Antimicrobial resistance (AMR) has gained co...,"[[antimicrobi, resist, gain, consider, recogni...","[344, 350, 145]"


Print out top 3 most relevant sentences of each text in the sample.

In [15]:
for index, row in mcp_partial.iterrows():
    for i in row["relevant sentences"]:
        print(row["sentences"][i])
        print("\n")

A reasonable example recommended in the field by the author, is to spray the back of the trailer and the loadout door with a glutaraldehyde disinfectant, combined with windshield washer fluid in the winter, and then wait 20 min before loading or unloading boars [32] .


The movement of people and pigs is certainly different now from that in the past, and this has led to widespread concern about diseases such as ASF (African Swine Fever) virus spreading to free areas, leading not only to animal suffering, but also to political and economic impacts.


When crossing these barriers through a gate or entrance, the incorporation of physical barriers provides a separation point at which footwear can be changed, potential fomites left in quarantine, and other measures of control implemented to reduce the chance of introducing new disease to the boar or lab facility.




In many developing countries, surveillance is limited due to the lack of a robust public health or laboratory infrastructure;