## Step 1

Load the abstracts and article texts from the pickled lists of paper dictionaries.

In [2]:
import pandas as pd
import numpy as np
import pickle
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import gensim
import os
import collections
from collections import Counter

import smart_open
import random

import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

%matplotlib inline

In [3]:
with open("documents/atrial_fibrillation_review_paper_dictionary_list.pkl", "rb") as picklefile:
    af_list_of_dictionaries = pickle.load(picklefile)
    
with open("documents/lewy_body_dementia_review_paper_dictionary_list.pkl", "rb") as picklefile:
    lbd_list_of_dictionaries = pickle.load(picklefile)
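
Before pulling fields out of these lists, it can help to check which keys each paper dictionary actually carries. The sketch below is a minimal, hypothetical helper (`key_coverage` is not part of the original notebook) that counts how often each key appears. The sample records are made up for illustration; only `abstract_text` and `article_text` are key names taken from this notebook.

```python
from collections import Counter


def key_coverage(list_of_dictionaries):
    """Count how many dictionaries contain each key."""
    counts = Counter()
    for paper_dict in list_of_dictionaries:
        counts.update(paper_dict.keys())
    return counts


# Made-up sample records standing in for the unpickled lists.
sample_papers = [
    {"abstract_text": "First abstract.", "article_text": "First body."},
    {"abstract_text": "Second abstract."},  # article_text is missing here
]

coverage = key_coverage(sample_papers)
print(coverage)
```

Running `key_coverage(af_list_of_dictionaries)` would show whether every paper has both fields before the extraction loops index into them.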

In [6]:
# Collect the abstract of every atrial fibrillation paper
af_abstracts = [paper_dict['abstract_text'] for paper_dict in af_list_of_dictionaries]

In [20]:
# Collect the full article text of every atrial fibrillation paper
af_papers = [paper_dict['article_text'] for paper_dict in af_list_of_dictionaries]

Now that the atrial fibrillation (AF) papers are loaded into a list, we can import sumy and run several summarization algorithms on each paper.
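
The idea of running several summarizers over every paper can be sketched as a small loop over named backends. This is only an illustration under stated assumptions: `sumy_backend`, `summarize_all`, and `first_sentences` are hypothetical names, and `first_sentences` is a naive stand-in so the demo runs without sumy. The sumy wrapper uses only the calls that appear in the cells of this notebook.

```python
def sumy_backend(summarizer_cls, language="english"):
    """Wrap a sumy summarizer class as a (text, n_sentences) -> list[str] callable.

    sumy is imported lazily, so the rest of this sketch runs without it.
    """
    def summarize(text, n_sentences):
        from sumy.parsers.plaintext import PlaintextParser
        from sumy.nlp.tokenizers import Tokenizer
        from sumy.nlp.stemmers import Stemmer
        from sumy.utils import get_stop_words

        parser = PlaintextParser(text, Tokenizer(language))
        summarizer = summarizer_cls(Stemmer(language))
        summarizer.stop_words = get_stop_words(language)
        return [str(sentence) for sentence in summarizer(parser.document, n_sentences)]

    return summarize


def summarize_all(papers, backends, n_sentences=3):
    """Return one {backend_name: [sentences]} dictionary per paper."""
    return [
        {name: backend(text, n_sentences) for name, backend in backends.items()}
        for text in papers
    ]


def first_sentences(text, n_sentences):
    """Stand-in backend (an assumption for the demo): naive split on '. '."""
    return [part.strip() for part in text.split(". ") if part.strip()][:n_sentences]


demo_papers = ["One. Two. Three. Four.", "Alpha. Beta."]
demo_results = summarize_all(demo_papers, {"first": first_sentences}, n_sentences=2)
print(demo_results)
```

With sumy installed, `backends` could instead map names such as `"lexrank"` and `"luhn"` to `sumy_backend(LexRankSummarizer)` and `sumy_backend(LuhnSummarizer)`, and `papers` could be `af_papers`.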

In [8]:
import sumy

In [9]:
# Plain-text parser, since each paper is stored as a plain string
from sumy.parsers.plaintext import PlaintextParser

# Tokenizer for splitting the text into sentences and words
from sumy.nlp.tokenizers import Tokenizer

In [10]:
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer

In [12]:
af_abstracts[0]

'Pulse palpation has been recommended as the first step of screening to detect atrial fibrillation. We aimed to determine and compare the accuracy of different methods for detecting pulse irregularities caused by atrial fibrillation. We systematically searched MEDLINE, EMBASE, CINAHL and LILACS until 16 March 2015. Two reviewers identified eligible studies, extracted data and appraised quality using the QUADAS-2 instrument. Meta-analysis, using the bivariate hierarchical random effects method, determined average operating points for sensitivities, specificities, positive and negative likelihood ratios (PLR, NLR); we constructed summary receiver operating characteristic plots. Twenty-one studies investigated 39 interventions (n = 15,129 pulse assessments) for detecting atrial fibrillation. Compared to 12-lead electrocardiography (ECG) diagnosed atrial fibrillation, blood pressure monitors (BPMs; seven interventions) and non-12-lead ECGs (20 interventions) had the greatest accuracy for d

In [21]:
af_papers[0]

'Atrial fibrillation (AF) has a prevalence that increases with age. 1 , 2 AF is associated with significant morbidity and mortality, most notably from its associated four to fivefold increased risk of ischaemic stroke, 3 and poses a significant public health burden. 4 The SAFE trial was the largest randomised study of AF screening in primary care and found this to be an effective method for increasing AF detection when compared to routine practice. 5 , 6 Combined with the subsequent provision of antithrombotic therapy, 7 , 8 screening is likely to reduce thromboembolic complications from AF 9 and, consequently, this has been proposed to improve AF. 10 , 11 AF screening is a two-stage process. Firstly, asymptomatic patients with irregular pulses are identified and then AF is confirmed or excluded using 12-lead electrocardiography (ECG). 8 , 10 The accuracy with which irregular pulses are caused by AF is important; a high false positive rate would result in many patients having unnecessa

In [22]:
parser = PlaintextParser(af_papers[0], Tokenizer("english"))

In [27]:
# LexRank summarization
from sumy.summarizers.lex_rank import LexRankSummarizer
summarizer = LexRankSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

# Summarize the document in 3 sentences
summary = summarizer(parser.document, 3)
for sentence in summary:
    print(sentence)

The findings were similar to our primary analyses, although the specificity of non-12-lead ECGs was slightly lower (non-12-lead ECGs: sensitivity 0.91 (95% CI 0.83–0.95), specificity 0.89 (95% CI 0.85–0.92); pulse palpation: all studies were conducted in primary care and findings already presented above).This review of 21 studies for methods of detecting irregular pulses caused by AF found modified BPMs and non-12-lead ECG devices had the greatest diagnostic accuracy.
Our review identified four methods (non-12-lead ECG, BPMs, smartphone applications and pulse oximetry) as alternative methods for detecting pulse irregularities, although the latter method was not eligible for inclusion in our analyses.
Our review identified four methods (non-12-lead ECG, BPMs, smartphone applications and pulse oximetry) as alternative methods for detecting pulse irregularities, although the latter method was not eligible for inclusion in our analyses.
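
The first line of the output above holds two sentences fused at `above).This review`, presumably because the article text has no space after that full stop, so the tokenizer does not split there. One possible preprocessing step, sketched below, inserts the missing space before the text reaches `PlaintextParser`. The function name `add_missing_sentence_spaces` and the regex heuristic are assumptions for illustration, not part of sumy or the original notebook.

```python
import re

# Position right after '.', '!' or '?' when that mark follows a lowercase
# letter or closing bracket and is immediately followed by a capital letter.
_FUSED_BOUNDARY = re.compile(r"(?<=[a-z\)\]][.!?])(?=[A-Z])")


def add_missing_sentence_spaces(text):
    """Insert a space at sentence boundaries that lack one."""
    return _FUSED_BOUNDARY.sub(" ", text)


fused = "findings already presented above).This review of 21 studies"
repaired = add_missing_sentence_spaces(fused)
print(repaired)
```

Because the lookbehind requires a lowercase letter or closing bracket before the punctuation, decimals such as `0.91` are left alone. The heuristic is still crude: abbreviations written without a following space would also be split.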


In [35]:
" ".join(map(str, summary))

'The findings were similar to our primary analyses, although the specificity of non-12-lead ECGs was slightly lower (non-12-lead ECGs: sensitivity 0.91 (95% CI 0.83–0.95), specificity 0.89 (95% CI 0.85–0.92); pulse palpation: all studies were conducted in primary care and findings already presented above).This review of 21 studies for methods of detecting irregular pulses caused by AF found modified BPMs and non-12-lead ECG devices had the greatest diagnostic accuracy. Our review identified four methods (non-12-lead ECG, BPMs, smartphone applications and pulse oximetry) as alternative methods for detecting pulse irregularities, although the latter method was not eligible for inclusion in our analyses. Our review identified four methods (non-12-lead ECG, BPMs, smartphone applications and pulse oximetry) as alternative methods for detecting pulse irregularities, although the latter method was not eligible for inclusion in our analyses.'
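
The joined summary above contains the same sentence twice, which suggests that the sentence occurs more than once in the article text. A minimal sketch of an order-preserving filter is shown below; `unique_sentences` is a hypothetical helper, and the demo strings are made up.

```python
def unique_sentences(sentences):
    """Drop repeated sentences, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for sentence in sentences:
        # str() lets this accept sumy Sentence objects as well as strings;
        # whitespace is normalized so trivially different copies still match.
        text = " ".join(str(sentence).split())
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


demo_summary = [
    "Devices had the greatest accuracy.",
    "Our review identified four methods.",
    "Our review identified  four methods.",
]
deduplicated = unique_sentences(demo_summary)
print(" ".join(deduplicated))
```

In this notebook the call would be `" ".join(unique_sentences(summary))`. The result may then hold fewer sentences than were requested from the summarizer.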

In [24]:
# Luhn summarization

from sumy.summarizers.luhn import LuhnSummarizer
summarizer_1 = LuhnSummarizer(Stemmer("english"))
summarizer_1.stop_words = get_stop_words("english")

summary_1 = summarizer_1(parser.document, 3)

for sentence in summary_1:
    print(sentence)

All randomised trials and observational studies, with the exclusion of case reports and case series, which recruited participants ≥18 years of age, investigated any method of identifying patients with an irregular pulse or suspected AF (the index test) and compared the index test with any ECG interpreted by a competent professional (the reference standard), involved healthcare professionals identifying patients with an irregular pulse, and reported sufficient data to enable calculation of diagnostic accuracy were included.
All randomised trials and observational studies, with the exclusion of case reports and case series, which recruited participants ≥18 years of age, investigated any method of identifying patients with an irregular pulse or suspected AF (the index test) and compared the index test with any ECG interpreted by a competent professional (the reference standard), involved healthcare professionals identifying patients with an irregular pulse, and reported sufficient data to

In [25]:
# LSA summarization

from sumy.summarizers.lsa import LsaSummarizer
summarizer_2 = LsaSummarizer(Stemmer("english"))
summarizer_2.stop_words = get_stop_words("english")

summary_2 = summarizer_2(parser.document, 3)

for sentence in summary_2:
    print(sentence)

All randomised trials and observational studies, with the exclusion of case reports and case series, which recruited participants ≥18 years of age, investigated any method of identifying patients with an irregular pulse or suspected AF (the index test) and compared the index test with any ECG interpreted by a competent professional (the reference standard), involved healthcare professionals identifying patients with an irregular pulse, and reported sufficient data to enable calculation of diagnostic accuracy were included.
All randomised trials and observational studies, with the exclusion of case reports and case series, which recruited participants ≥18 years of age, investigated any method of identifying patients with an irregular pulse or suspected AF (the index test) and compared the index test with any ECG interpreted by a competent professional (the reference standard), involved healthcare professionals identifying patients with an irregular pulse, and reported sufficient data to

In [26]:
# TextRank summarization

from sumy.summarizers.text_rank import TextRankSummarizer
summarizer_3 = TextRankSummarizer(Stemmer("english"))
summarizer_3.stop_words = get_stop_words("english")

summary_3 = summarizer_3(parser.document, 3)

for sentence in summary_3:
    print(sentence)

All randomised trials and observational studies, with the exclusion of case reports and case series, which recruited participants ≥18 years of age, investigated any method of identifying patients with an irregular pulse or suspected AF (the index test) and compared the index test with any ECG interpreted by a competent professional (the reference standard), involved healthcare professionals identifying patients with an irregular pulse, and reported sufficient data to enable calculation of diagnostic accuracy were included.
The 21 studies investigated 39 interventions (n = 15,129 pulse assessments), which were categorised as blood pressure monitors (BPMs) (six studies; seven interventions), 25 , 28 , 33 , 36 – 38 non-12-lead ECG (10 studies; 20 interventions), 9 , 19 – 22 , 24 , 25 , 31 , 32 , 35 pulse palpation (six studies; six interventions) 9 , 25 , 27 , 30 , 32 , 34 and smartphone applications (three studies; six interventions).
The 21 studies investigated 39 interventions (n = 15,