## Clustering

The idea behind clustering is this: once we've fetched N review papers from PubMed, we want to check which papers are similar enough to each other that they should be fed to the summarizer as a single document. We can do this by comparing the papers' abstracts.

We will try this using gensim's doc2vec model, for which there is a tutorial [here](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb).
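Before bringing in doc2vec, it can help to have a simple baseline for "comparing the abstracts". The sketch below is not doc2vec: it is a plain bag-of-words cosine similarity, swapped in as a stand-in, and the helper names `tokenize` and `cosine_similarity` are hypothetical rather than part of this notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two abstracts loaded as strings could then be compared with `cosine_similarity(abstract_a, abstract_b)`, giving a score between 0 (no shared words) and 1 (identical word counts). Raw counts over-weight common words, so this is only a rough sanity check next to learned embeddings.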

In [45]:
import pandas as pd
import numpy as np

In [78]:
%%bash
ls documents/

ClinicalSenseInventoryII_RefinedMasterFile.txt
ClinicalSenseInventoryI_MasterFile.txt
af_paper1_abstract.txt
af_paper1_body.txt
af_paper1_dict.pkl
af_paper2_abstract.txt
af_paper2_body.txt
af_paper2_dict.pkl
af_paper3_abstract.txt
af_paper3_body.txt
af_paper3_dict.pkl
af_paper4_abstract.txt
af_paper4_body.txt
af_paper4_dict.pkl
af_paper5_abstract.txt
af_paper5_body.txt
af_paper5_dict.pkl
dict_structure.txt
lbd_paper1_dict.pkl
lbd_paper2_dict.pkl
lbd_paper3_dict.pkl
lbd_paper4_dict.pkl
lbd_paper5_dict.pkl
lbd_paper6_dict.pkl
lbd_paper7_dict.pkl
lbd_paper8_dict.pkl


In [6]:
import pickle

In [8]:
with open("documents/lbd_paper1_dict.pkl", "rb") as picklefile:
    lbd_paper1_dict = pickle.load(picklefile)

In [9]:
lbd_paper1_dict.keys()

dict_keys(['pmcid', 'title', 'abstract_text', 'article_text', 'citation_tuples'])

In [10]:
lbd_paper1_dict['title']

'Non-pharmacological interventions for Lewy body dementia: a systematic review'

In [11]:
lbd_paper1_dict['abstract_text']

"Lewy body dementia (consisting of dementia with Lewy bodies and Parkinson's disease dementia) is a common neurodegenerative disease characterised by visual hallucinations, fluctuating attention, motor disturbances, falls, and sensitivity to antipsychotics. This combination of features presents challenges for pharmacological management. Given this, we sought to review evidence for non-pharmacological interventions with patients with Lewy body dementia and their carers. Bibliographic databases were searched using a wide range of search terms and no restrictions were placed on study design, language, or clinical setting. Two reviewers independently assessed papers for inclusion, rated study quality, and extracted data. The search identified 21 studies including two randomised controlled trials with available subgroup data, seven case series, and 12 case studies. Most studies reported beneficial effects of the interventions used, though the only sizeable study was on dysphagia, showing a 

In [12]:
import gensim

In [13]:
import os
import collections
import smart_open
import random

In [14]:
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

In [18]:
def read_corpus(fname, tokens_only=False):
    """Yield one tokenized document per line of `fname`."""
    # smart_open.open replaces the deprecated smart_open.smart_open
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [21]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

Let's import the PubMed word2vec vectors. Because the full vector file is so large, we'll work with chunks of the original file.

In [35]:
med_w2v_chunks_path = "../../../resources/med_w2v_chunks/"

In [38]:
with open(med_w2v_chunks_path + "directory_of_chunk_names.txt", "r") as f:
    # splitlines() avoids the empty last entry that split("\n") leaves after a final newline
    chunk_names = f.read().splitlines()

In [63]:
with open(med_w2v_chunks_path + chunk_names[0], "r") as f:
    vector_chunk = f.readlines()

In [60]:
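# Skip the first line of the chunk (the header)
clean_vector_chunk = vector_chunk[1:]
# Split on spaces; [:-1] drops the final element, which is just the newline after the trailing space
clean_vector_chunk = [vector.split(" ")[:-1] for vector in clean_vector_chunk]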

In [69]:
words = [line[0] for line in clean_vector_chunk]

In [72]:
vectors = [line[1:] for line in clean_vector_chunk]

In [74]:
pubmed_vector_chunk_1 = pd.DataFrame(data = vectors, index=words)

In [75]:
pubmed_vector_chunk_1.shape

(9999, 200)
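The cells above leave every vector component as a string, because `str.split` does not convert anything. A minimal sketch of a parser that converts the components to floats follows. It assumes the layout implied above: an optional first line to skip, then one `<word> <v1> ... <vN>` line per word. The function name `parse_vector_lines` and its `skip_header` parameter are hypothetical.

```python
def parse_vector_lines(lines, skip_header=True):
    """Parse word2vec text-format lines into (words, vectors).

    Each data line is expected to look like "<word> <v1> ... <vN>", possibly
    followed by trailing whitespace. Blank lines are skipped.
    """
    rows = lines[1:] if skip_header else lines
    words, vectors = [], []
    for line in rows:
        parts = line.split()  # splits on any whitespace and ignores the trailing newline
        if len(parts) < 2:
            continue  # blank line or a word with no vector
        words.append(parts[0])
        vectors.append([float(value) for value in parts[1:]])
    return words, vectors
```

With the notebook's variables this would be called as `words, vectors = parse_vector_lines(vector_chunk)`, and `pd.DataFrame(data=vectors, index=words)` would then hold numeric columns. Later chunks presumably have no header line, in which case `skip_header=False` would apply.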

In [77]:
pubmed_vector_chunk_1.tail()

word,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
gram,-0.018303,0.126082,0.091416,-0.036617,0.062727,0.10113,-0.042764,0.026161,0.063491,0.07787,...,0.085909,0.045432,-0.008336,-0.015598,0.051747,0.091813,-0.037191,0.022189,0.071518,-0.004419
cope,-0.159158,-0.048508,0.084827,-0.067393,0.080998,1.5e-05,-0.042624,-0.029751,0.006232,0.104981,...,0.069751,0.065958,0.032497,-0.015137,-0.054521,-0.022652,0.071351,0.132672,0.062271,0.010946
8.4,-0.011951,-0.039005,0.184014,0.025395,0.064504,-0.056436,-0.070603,0.128208,0.016662,0.05368,...,0.096095,0.012659,0.057466,0.033076,-0.030929,0.043632,-0.065978,-0.019174,-0.057212,0.020115
Sodium,-0.061905,-0.006765,-0.021027,-0.048634,0.101416,-0.049456,0.000857,-0.019953,0.069388,-0.077846,...,0.091255,0.085205,-0.053687,0.075144,0.020077,0.027895,0.029918,-0.051652,-0.084587,0.074435
Much,-0.034406,0.001993,-0.046504,-0.0143,-0.030149,0.017973,-0.096581,-0.072681,0.049732,0.064109,...,-0.020626,-0.019448,-0.059979,0.009705,0.127309,3.3e-05,0.018481,-0.024821,-0.000173,-0.092571


## Step 1

Let's manually inspect the 5 abstracts for the papers on atrial fibrillation that you kept.

After reading through all 5 abstracts, I would say that these papers cover distinctly different aspects of atrial fibrillation, so they should not be clustered together and should instead each be summarized on their own.

In [82]:
for i in range(5):
    print(f"Paper {i+1}:")
    with open(f"documents/af_paper{i+1}_abstract.txt", "r") as f:
        print(f.read())
    print("--------------------------\n")

Paper 1:
In contemporary atrial fibrillation trials most deaths are cardiac related, whereas stroke and bleeding represent only a small subset of deaths. We aimed to evaluate the long-term risk of cardiac events and all-cause mortality in individuals with atrial fibrillation compared to no atrial fibrillation. A systematic review and meta-analysis of studies published between 1 January 2006 and 21 October 2016. Four databases were searched. Studies had follow-up of at least 500 stable patients for either cardiac endpoints or all-cause mortality for 12 months or longer. Publication bias was evaluated and random effects models were used to synthesise the results. Heterogeneity between studies was examined by subgroup and meta-regression analyses. A total of 15 cohort studies was included. Analyses indicated that atrial fibrillation was associated with an increased risk of myocardial infarction (relative risk (RR) 1.54, 95% confidence interval (CI) 1.26–1.85), all-cause mortality (RR 1.95
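The manual judgement above could eventually be replaced by a rule applied to pairwise similarity scores. The sketch below shows one possible rule, not a method this notebook has settled on: link any two papers whose similarity reaches a threshold, then treat each connected group as one cluster. The function name `group_by_similarity` and any threshold value are hypothetical.

```python
def group_by_similarity(similarity, threshold):
    """Group item indices whose pairwise similarity reaches `threshold`.

    `similarity` is a square, symmetric matrix given as a list of lists. Two
    items land in the same group if they are linked directly or through a
    chain of sufficiently similar items (single-linkage style).
    """
    n = len(similarity)
    parent = list(range(n))  # union-find forest: each item starts as its own root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

If the five atrial fibrillation abstracts really are as distinct as they read, a sensible threshold should return five singleton groups, which gives a quick check of any similarity measure against the manual reading.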

In [84]:
lbd_paper_dict_list = []

for i in range(8):
    print(f"Paper {i+1}:")
    with open(f"documents/lbd_paper{i+1}_dict.pkl", "rb") as picklefile:
        lbd_paper_dict = pickle.load(picklefile)
        print(lbd_paper_dict['abstract_text'])
        lbd_paper_dict_list.append(lbd_paper_dict)
    print("------------------------\n")

Paper 1:
Lewy body dementia (consisting of dementia with Lewy bodies and Parkinson's disease dementia) is a common neurodegenerative disease characterised by visual hallucinations, fluctuating attention, motor disturbances, falls, and sensitivity to antipsychotics. This combination of features presents challenges for pharmacological management. Given this, we sought to review evidence for non-pharmacological interventions with patients with Lewy body dementia and their carers. Bibliographic databases were searched using a wide range of search terms and no restrictions were placed on study design, language, or clinical setting. Two reviewers independently assessed papers for inclusion, rated study quality, and extracted data. The search identified 21 studies including two randomised controlled trials with available subgroup data, seven case series, and 12 case studies. Most studies reported beneficial effects of the interventions used, though the only sizeable study was on dysphagia, sh
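The loop above hardcodes the number of papers. As a sketch, the same loading step could discover the files by name instead, assuming they all follow the `<prefix>_paper<N>_dict.pkl` pattern seen in the `documents/` listing. The helper name `load_paper_dicts` is hypothetical.

```python
import pickle
from pathlib import Path


def load_paper_dicts(directory, prefix):
    """Load every '<prefix>_paper<N>_dict.pkl' file in `directory`, ordered by N."""
    paths = Path(directory).glob(f"{prefix}_paper*_dict.pkl")

    def paper_number(path):
        # 'lbd_paper12_dict' -> 12, so that paper10 sorts after paper9
        return int(path.stem[len(prefix) + len("_paper"):-len("_dict")])

    papers = []
    for path in sorted(paths, key=paper_number):
        # pickle.load should only be used on files from a trusted source
        with open(path, "rb") as picklefile:
            papers.append(pickle.load(picklefile))
    return papers
```

Under that naming assumption, `load_paper_dicts("documents", "lbd")` would return the same list as the loop, and it would keep working if more papers were added.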

In [98]:
lbd_paper_dict_list[2]['article_text']

'The nosologic relationship, as defined by DSM-5 , between dementia with Lewy bodies (DLB) and Parkinson’s disease dementia (PDD), both of which are major neurocognitive disorders with α-synuclein (αSyn) deposition/Lewy bodies (LB), is continuously being debated . The clinical features of DLB and PDD are similar and include dementia, cognitive fluctuations, and (visual) hallucinations in the setting of clinical or latent parkinsonism. The cognitive domains of both disorders overlap, with progressive executive dysfunctions, visual-spatial abnormalities, and memory disorders . Based on international consensus, DLB is diagnosed when cognitive impairment precedes parkinsonian motor signs or begins within 1 year from its onset , whereas in PDD, cognitive impairment develops in the setting of well-established Parkinson’s disease (PD) . DLB patients will also develop parkinsonism of increasing severity over the years, although 25% of them never develop parkinsonian symptoms . Despite differen

In [99]:
lbd_paper_dict_list[3]['article_text']

'The nosologic relationship, as defined by DSM-5 , between dementia with Lewy bodies (DLB) and Parkinson’s disease dementia (PDD), both of which are major neurocognitive disorders with α-synuclein (αSyn) deposition/Lewy bodies (LB), is continuously being debated . The clinical features of DLB and PDD are similar and include dementia, cognitive fluctuations, and (visual) hallucinations in the setting of clinical or latent parkinsonism. The cognitive domains of both disorders overlap, with progressive executive dysfunctions, visual-spatial abnormalities, and memory disorders . Based on international consensus, DLB is diagnosed when cognitive impairment precedes parkinsonian motor signs or begins within 1 year from its onset , whereas in PDD, cognitive impairment develops in the setting of well-established Parkinson’s disease (PD) . DLB patients will also develop parkinsonism of increasing severity over the years, although 25% of them never develop parkinsonian symptoms . Despite differen
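The two truncated outputs above begin with the same text, which suggests (but does not confirm) that two of the pickled papers contain the same article. A quick way to check is to hash each paper's full text and look for collisions. This is a sketch only, and the helper name `find_duplicate_texts` is hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicate_texts(papers, field="article_text"):
    """Return lists of indices of papers whose `field` text is identical."""
    by_digest = defaultdict(list)
    for index, paper in enumerate(papers):
        # Collapse whitespace so formatting differences do not hide a duplicate.
        normalized = " ".join(paper[field].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_digest[digest].append(index)
    return [indices for indices in by_digest.values() if len(indices) > 1]
```

Calling `find_duplicate_texts(lbd_paper_dict_list)` would return a group such as `[2, 3]` if those two entries really are the same article. Any duplicates should be removed before clustering, since identical documents would trivially cluster together.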