# COVID-19 Research Papers Text Extraction

Goal: Answer the question "What has been published about medical care?"

Print the most "important" sentences from a sample of papers assigned to cluster 3 by the LDA model in "COVID-19 Research Papers LDA Topic Modeling". A sentence's importance is the sum of the TF-IDF scores of its words. The vectorizer is fit separately on each paper, with every sentence treated as one document, so a word's score weighs how often it appears in the sentence against how many sentences of that paper contain it.
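As a rough illustration of this scoring, the sketch below ranks three made-up "sentences" by the sum of their TF-IDF weights. The toy strings and the variable names are assumptions for illustration only, not data from the papers. One thing worth noting: with scikit-learn's default settings each row is L2-normalized, so a sentence's sum lies between 1 and the square root of its number of distinct terms, which tends to favor sentences with many distinct terms.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy "paper": each string stands in for one cleaned sentence.
toy_sentences = [
    "fever cough",
    "fever cough fatigue headache nausea",
    "fever",
]

# Fit on the sentences of a single paper, so idf is relative to that paper only.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_sentences)

# Sentence score = sum of the tf-idf weights in its row.
scores = np.asarray(tfidf.sum(axis=1)).ravel()

# Sentence positions ordered from highest to lowest score.
ranking = scores.argsort()[::-1].tolist()
print(ranking)
print(scores)
```

Here the five-term sentence ranks first and the single-term sentence last, since a one-term row normalizes to a score of exactly 1.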

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jayfeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jayfeng/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jayfeng/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
"""Read in full_texts, containing the full texts of all papers, from data cleaning notebook."""

full_texts = pd.read_csv("full_texts.csv")
full_texts

Unnamed: 0.1,Unnamed: 0,0
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally ..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTEY...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, l..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNER...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently encou..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a pu...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting patt..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastro...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease ..."


In [3]:
"""Read in document_clusters, containing the assigned topic of each paper from LDA clusters notebook."""

df_clusters = pd.read_csv("document_clusters.csv")[["sha", "assigned topic"]]
df_clusters

Unnamed: 0,sha,assigned topic
0,f27a5562dd776c3a927ef078b0038ac690d03d90,0
1,9862f8f952ee3c06f71abde040191057aae32175,0
2,613b280bd1f7e0a0dd50cbf2501da003caf95eb4,0
3,9c32d461dc9d4737756a990cf13bae1a03e078a9,0
4,b67c1adb9815a8ac0b118d1bd2f563d0d0e7c2bb,0
...,...,...
24040,868afcaa176cdfdc50900313a5657583d5a74e9e,9
24041,8ccaf50414e8f530aaa405630c4e477d377d09ce,9
24042,3399d0fe01c7cb8bff0615a506b7beacc813a05e,9
24043,06989a9659f1b9b10abc5b92a90ecff38a778d55,9


In [15]:
"""Filter out full texts that don't have an assigned topic."""
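# full_texts holds two unnamed columns: the paper's sha, then its full text
valid_full_texts = full_texts.copy()
valid_full_texts.columns = ["sha", "text"]

# keep only the papers whose sha appears in df_clusters (vectorized, no row loop)
valid_full_texts = valid_full_texts[valid_full_texts["sha"].isin(df_clusters["sha"])]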

valid_full_texts

Unnamed: 0,sha,text
0,aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally ..."
1,212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTEY...
2,bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, l..."
3,ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNER...
4,a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently encou..."
...,...,...
27673,179df1e769292dd113cef1b54b0b43213e6b5c97,The ongoing novel coronavirus outbreak is a pu...
27674,9b4445849937393a4b05378653521a9d0c34dc8e,"interactions, co-residence, and commuting patt..."
27675,4e618ec5d2edea031a9ff8058a9bafafe30937be,Human coronaviruses (HCoV) which causes gastro...
27676,28b53e0cab53b10ab87431d6cc4ac1e0a7c4d6b9,"In December 2019, a novel coronavirus disease ..."


In [5]:
"""Merge df_clusters and valid_full texts by sha, select papers relevant to medical care (cluster 3)"""

merged = pd.merge(df_clusters, valid_full_texts, on="sha")

medical_care_papers = merged.loc[merged['assigned topic'] == 3]
medical_care_papers

Unnamed: 0,sha,assigned topic,text
7460,b2f59ed63b03f298f809f4aaefc80dc3e0415685,3,Infl uenza virus infection causes substantial ...
7461,4b6924e8a47cd58844905f4ce2a134e44783d68b,3,Novel influenza A (H1N1) emerged from Mexico i...
7462,0b71c286d446269acaefbfa63e64125241ed3efb,3,The human bocavirus (HBoV) was initially disco...
7463,01f48822661d416716bb7c132c1fffd0a4b68ede,3,"A new respiratory virus, human metapneumovirus..."
7464,d7a311b3538fe182e2e894c731591d015f67c779,3,"O n average, the total blood volume in rodent ..."
...,...,...,...
11576,f13c88733ea45be9e923a282dfd42f8c277c187c,3,Acute respiratory infections (ARIs) manifest a...
11577,c0bec3dd80af76e0f60020b72ab8f6e1f14354fb,3,Infection by one of the four serotypes of the ...
11578,388db9cf002b256532f7dd45511dd8357630bcc6,3,Influenza virus (IV) and respiratory syncytial...
11579,60cfb43ca9b8d54f5b8cb4ab264d8d628c7c227b,3,It has been suggested that annual epidemics of...
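The merge-then-filter step can be checked on a miniature example. The sketch below uses two hypothetical frames (`toy_clusters`, `toy_texts`) that only mimic the shape of `df_clusters` and `valid_full_texts`; the shas and texts are invented.

```python
import pandas as pd

# Hypothetical miniature versions of df_clusters and valid_full_texts.
toy_clusters = pd.DataFrame(
    {"sha": ["a1", "b2", "c3"], "assigned topic": [0, 3, 3]}
)
toy_texts = pd.DataFrame(
    {"sha": ["b2", "c3", "d4"], "text": ["text b", "text c", "text d"]}
)

# pd.merge defaults to an inner join: only shas present in both frames survive.
toy_merged = pd.merge(toy_clusters, toy_texts, on="sha")

# A boolean mask selects the rows assigned to topic 3.
toy_topic_3 = toy_merged.loc[toy_merged["assigned topic"] == 3]
print(toy_topic_3)
```

Because the join is an inner join, a text whose sha has no assigned topic ("d4" here) is dropped by the merge itself, so the earlier filtering step probably does not change the merged result.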


In [6]:
"""Sample 100 papers from cluster 3."""

mcp_partial = medical_care_papers.sample(100)
mcp_partial

Unnamed: 0,sha,assigned topic,text
10030,d03cf2f21e08b7e0df78531221da6c06915e1b50,3,"The influenza pandemic of 1918-1919, which con..."
7790,cacf69c575667e4fe2e84b872563ed59f55f0ce2,3,Respiratory viral infections account for many ...
10889,18d6b41f2b9f5cad93519eb39f422f7f008a4ed3,3,Community-acquired pneumonia (CAP) is an impor...
10592,0e46f380b9d3d066fb9a46098731ae1247ccc4e7,3,"In recent years, emerging infectious diseases ..."
10343,3a2ebe61bc67b4f57b6af20d70a3636ee22e5337,3,Bronchoscopy is performed as a diagnostic proc...
...,...,...,...
9562,d93f7790f05612e0984dba1bf0e720d2771ac304,3,The status of the oral and oropharyngeal cavit...
9411,1cfcdcf2a3783b9adba2b091bd50c5d8ac3252ff,3,Influenza viruses belong to the family Orthomy...
11274,cca041657183da140d9bff16a7ec097eb67cdc41,3,Certain neurotropic viruses can invade the bra...
10167,e62f6de9bc9588f663d52ae7d998bbe160485d48,3,Three main clinical syndromes can be distingui...
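Since `sample(100)` draws a different set of papers on every run, the sentences printed at the end change between runs. If a repeatable sample is wanted, one option is to pass a seed through the `random_state` parameter of `DataFrame.sample`. The sketch below shows this on a hypothetical toy frame; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for medical_care_papers.
toy = pd.DataFrame({"sha": [f"paper_{i}" for i in range(10)]})

# The same seed yields the same rows on every call.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)
print(first.index.tolist() == second.index.tolist())
```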


In [7]:
"""Add column of full text split into sentences using nltk sentence tokenizer."""

from nltk.tokenize import sent_tokenize

# work on a copy so the new column is not written to a slice of medical_care_papers
mcp_partial = mcp_partial.copy()

# each entry becomes a list of sentence strings; apply() avoids building a
# ragged NumPy array, which newer NumPy versions reject
mcp_partial["sentences"] = mcp_partial["text"].apply(sent_tokenize)
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences
10030,d03cf2f21e08b7e0df78531221da6c06915e1b50,3,"The influenza pandemic of 1918-1919, which con...","[The influenza pandemic of 1918-1919, which co..."
7790,cacf69c575667e4fe2e84b872563ed59f55f0ce2,3,Respiratory viral infections account for many ...,[Respiratory viral infections account for many...
10889,18d6b41f2b9f5cad93519eb39f422f7f008a4ed3,3,Community-acquired pneumonia (CAP) is an impor...,[Community-acquired pneumonia (CAP) is an impo...
10592,0e46f380b9d3d066fb9a46098731ae1247ccc4e7,3,"In recent years, emerging infectious diseases ...","[In recent years, emerging infectious diseases..."
10343,3a2ebe61bc67b4f57b6af20d70a3636ee22e5337,3,Bronchoscopy is performed as a diagnostic proc...,[Bronchoscopy is performed as a diagnostic pro...
...,...,...,...,...
9562,d93f7790f05612e0984dba1bf0e720d2771ac304,3,The status of the oral and oropharyngeal cavit...,[The status of the oral and oropharyngeal cavi...
9411,1cfcdcf2a3783b9adba2b091bd50c5d8ac3252ff,3,Influenza viruses belong to the family Orthomy...,[Influenza viruses belong to the family Orthom...
11274,cca041657183da140d9bff16a7ec097eb67cdc41,3,Certain neurotropic viruses can invade the bra...,[Certain neurotropic viruses can invade the br...
10167,e62f6de9bc9588f663d52ae7d998bbe160485d48,3,Three main clinical syndromes can be distingui...,[Three main clinical syndromes can be distingu...


In [8]:
"""Set up stop words, stemmer, and lemmatizer."""

stop_words = set(stopwords.words('english')) 
snowBallStemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [9]:
"""Tokenize and clean the abstracts of every paper."""

def tokenize_clean(text):
    # text is any string; in this notebook it is one sentence of a paper
    # lowercase the text and split it into word tokens
    tokens = word_tokenize(text.lower())

    # lemmatize each token (the WordNet lemmatizer treats tokens as nouns by default)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # drop stop words and tokens of 3 characters or fewer, then stem what remains
    filtered_tokens = []
    for token in tokens:
        if token not in stop_words and len(token) > 3:
            filtered_tokens.append(snowBallStemmer.stem(token))

    return filtered_tokens
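To see roughly what the length filter and stemmer in `tokenize_clean` do to a sentence, here is a simplified sketch. It is not the function above: it swaps in a regular-expression tokenizer and skips both the stop-word list and the lemmatizer so that it runs without any NLTK data downloads. `toy_clean` is a hypothetical name.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def toy_clean(text):
    # Assumption: a simple regex tokenizer stands in for nltk's word_tokenize.
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    # Keep tokens longer than 3 characters, then stem them.
    return [stemmer.stem(token) for token in tokens if len(token) > 3]

print(toy_clean("Respiratory viral infections account for many"))
```

The stems are not meant to be readable words ("respiratori", "mani"); they only need to map different surface forms of a word to the same feature before TF-IDF is computed.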

In [10]:
"""Add column of cleaned and tokenized sentences."""

# for each paper, clean every sentence; the result is one list of token lists per paper
mcp_partial["tokenized sentences"] = mcp_partial["sentences"].apply(
    lambda sents: [tokenize_clean(sent) for sent in sents]
)
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences
10030,d03cf2f21e08b7e0df78531221da6c06915e1b50,3,"The influenza pandemic of 1918-1919, which con...","[The influenza pandemic of 1918-1919, which co...","[[influenza, pandem, 1918-1919, contribut, est..."
7790,cacf69c575667e4fe2e84b872563ed59f55f0ce2,3,Respiratory viral infections account for many ...,[Respiratory viral infections account for many...,"[[respiratori, viral, infect, account, mani, h..."
10889,18d6b41f2b9f5cad93519eb39f422f7f008a4ed3,3,Community-acquired pneumonia (CAP) is an impor...,[Community-acquired pneumonia (CAP) is an impo...,"[[community-acquir, pneumonia, import, caus, m..."
10592,0e46f380b9d3d066fb9a46098731ae1247ccc4e7,3,"In recent years, emerging infectious diseases ...","[In recent years, emerging infectious diseases...","[[recent, year, emerg, infecti, diseas, appear..."
10343,3a2ebe61bc67b4f57b6af20d70a3636ee22e5337,3,Bronchoscopy is performed as a diagnostic proc...,[Bronchoscopy is performed as a diagnostic pro...,"[[bronchoscopi, perform, diagnost, procedur], ..."
...,...,...,...,...,...
9562,d93f7790f05612e0984dba1bf0e720d2771ac304,3,The status of the oral and oropharyngeal cavit...,[The status of the oral and oropharyngeal cavi...,"[[status, oral, oropharyng, caviti, includ, mo..."
9411,1cfcdcf2a3783b9adba2b091bd50c5d8ac3252ff,3,Influenza viruses belong to the family Orthomy...,[Influenza viruses belong to the family Orthom...,"[[influenza, virus, belong, famili, orthomyxov..."
11274,cca041657183da140d9bff16a7ec097eb67cdc41,3,Certain neurotropic viruses can invade the bra...,[Certain neurotropic viruses can invade the br...,"[[certain, neurotrop, virus, invad, brain, alo..."
10167,e62f6de9bc9588f663d52ae7d998bbe160485d48,3,Three main clinical syndromes can be distingui...,[Three main clinical syndromes can be distingu...,"[[three, main, clinic, syndrom, distinguish, h..."


In [11]:
"""Given a sorted df and int n, returns indexes of the top 3 rows."""

def top_n_index(df, n):
    # df must already be sorted; returns the index labels of its first n rows
    # (fewer if df has fewer than n rows)
    return df.index[:n].tolist()
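`top_n_index` relies on its caller having sorted the frame first. When the scores sit in a single pandas Series, `Series.nlargest` offers a possible alternative that sorts and truncates in one step. The sketch below uses invented scores and is not part of the original pipeline.

```python
import pandas as pd

# Hypothetical per-sentence scores, indexed by sentence position.
toy_scores = pd.Series([0.4, 2.1, 1.3, 1.7])

# Index labels of the 3 highest scores, highest first.
top_3 = toy_scores.nlargest(3).index.tolist()
print(top_3)
```

Asking `nlargest` for more rows than exist returns all of them, so a paper with fewer than n sentences does not raise an error.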

In [12]:
"""Returns the indexes of the top n sentences of the text in the given row."""

def top_n_sentences(df, row, n):
    vectorizer = TfidfVectorizer()

    # rejoin each sentence's tokens into a string, the input TfidfVectorizer expects
    list_of_lists = df["tokenized sentences"].values[row]
    list_of_strings = [" ".join(tokens) for tokens in list_of_lists]

    # each sentence is one document, so idf is computed across this paper's sentences
    tfidf_matrix = vectorizer.fit_transform(list_of_strings)

    # get_feature_names() was deprecated in scikit-learn 1.0 and later removed
    feature_names = vectorizer.get_feature_names_out()

    scores = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    # score each sentence by the sum of its tf-idf weights, highest first
    scores["sums"] = scores.sum(axis=1)
    scores = scores.sort_values("sums", ascending=False)

    return top_n_index(scores, n)

In [13]:
"""Adds column of indexes of the top 3 most relevant sentences based on tfidf sum."""

temp = []
for i in np.arange(len(mcp_partial)):
    temp.append(top_n_sentences(mcp_partial, i, 3))

mcp_partial["relevant sentences"] = temp
mcp_partial

Unnamed: 0,sha,assigned topic,text,sentences,tokenized sentences,relevant sentences
10030,d03cf2f21e08b7e0df78531221da6c06915e1b50,3,"The influenza pandemic of 1918-1919, which con...","[The influenza pandemic of 1918-1919, which co...","[[influenza, pandem, 1918-1919, contribut, est...","[66, 13, 86]"
7790,cacf69c575667e4fe2e84b872563ed59f55f0ce2,3,Respiratory viral infections account for many ...,[Respiratory viral infections account for many...,"[[respiratori, viral, infect, account, mani, h...","[8, 13, 33]"
10889,18d6b41f2b9f5cad93519eb39f422f7f008a4ed3,3,Community-acquired pneumonia (CAP) is an impor...,[Community-acquired pneumonia (CAP) is an impo...,"[[community-acquir, pneumonia, import, caus, m...","[165, 13, 153]"
10592,0e46f380b9d3d066fb9a46098731ae1247ccc4e7,3,"In recent years, emerging infectious diseases ...","[In recent years, emerging infectious diseases...","[[recent, year, emerg, infecti, diseas, appear...","[104, 13, 92]"
10343,3a2ebe61bc67b4f57b6af20d70a3636ee22e5337,3,Bronchoscopy is performed as a diagnostic proc...,[Bronchoscopy is performed as a diagnostic pro...,"[[bronchoscopi, perform, diagnost, procedur], ...","[11, 6, 57]"
...,...,...,...,...,...,...
9562,d93f7790f05612e0984dba1bf0e720d2771ac304,3,The status of the oral and oropharyngeal cavit...,[The status of the oral and oropharyngeal cavi...,"[[status, oral, oropharyng, caviti, includ, mo...","[8, 90, 67]"
9411,1cfcdcf2a3783b9adba2b091bd50c5d8ac3252ff,3,Influenza viruses belong to the family Orthomy...,[Influenza viruses belong to the family Orthom...,"[[influenza, virus, belong, famili, orthomyxov...","[25, 170, 61]"
11274,cca041657183da140d9bff16a7ec097eb67cdc41,3,Certain neurotropic viruses can invade the bra...,[Certain neurotropic viruses can invade the br...,"[[certain, neurotrop, virus, invad, brain, alo...","[124, 116, 5]"
10167,e62f6de9bc9588f663d52ae7d998bbe160485d48,3,Three main clinical syndromes can be distingui...,[Three main clinical syndromes can be distingu...,"[[three, main, clinic, syndrom, distinguish, h...","[87, 56, 55]"


In [16]:
"""Print out top 3 most relevant sentences of each text in the sample."""

for index, row in mcp_partial.iterrows():
    for i in row["relevant sentences"]:
        print(row["sentences"][i])
        print("\n")

Additionally, the odds ratio in the no pathogen detected cohort was significantly higher in vaccinated versus unvaccinated individuals (OR = 1.51) ( Table 5) .Examining 6120 people with respiratory viruses other than influenza and pan-negative results who submitted a respiratory specimen for laboratory testing to the DoDGRS team, those who received an influenza vaccine had a decreased risk of having other respiratory pathogens identified compared to the unvaccinated group.


Laboratory testing completed at USAFSAM and Landstuhl Regional Medical Center (LRMC) included multiplex PCR respiratory pathogen panels (including: adenovirus, Chlmydia pneumoniae, coronavirus, human bocavirus, human metapnumovirus, Mycoplasma pneumoniae, parainfluenza, respiratory syncytial virus (RSV), rhinovirus/enterovirus, and co-infections) [14, 15] , viral culture detecting influenza and other respiratory viruses, and influenza A/B subtyping via PCR [16, 17] .


Both the unadjusted and adjusted models did no