## Clustering

The idea behind clustering is that once we've fetched N review papers from pubmed, now what we want to do is to check to see which papers are similar to each other, and should be fed to the summarizer as a single document. We can do this by comparing the abstracts for each paper. We'll do this by taking an average of the abstract's word2vec word vectors.

In [120]:
import pandas as pd
import numpy as np
import pickle
import string
import nltk
from nltk.tokenize import word_tokenize

import gensim
import os
import collections
import smart_open
import random

## Step 1

Let's manually inspect the 5 abstracts for the papers on atrial fibrillation that you kept.

After reading through all 5 abstracts, I would say that all of these papers are about distinctly different aspects of atrial fibrillation, and that they should not be clustered together and should instead by summarized each on their own.

In [82]:
for i in range(5):
    print(f"Paper {i+1}:")
    with open(f"documents/af_paper{i+1}_abstract.txt", "r") as f:
        print(f.read())
    print("--------------------------\n")

Paper 1:
In contemporary atrial fibrillation trials most deaths are cardiac related, whereas stroke and bleeding represent only a small subset of deaths. We aimed to evaluate the long-term risk of cardiac events and all-cause mortality in individuals with atrial fibrillation compared to no atrial fibrillation. A systematic review and meta-analysis of studies published between 1 January 2006 and 21 October 2016. Four databases were searched. Studies had follow-up of at least 500 stable patients for either cardiac endpoints or all-cause mortality for 12 months or longer. Publication bias was evaluated and random effects models were used to synthesise the results. Heterogeneity between studies was examined by subgroup and meta-regression analyses. A total of 15 cohort studies was included. Analyses indicated that atrial fibrillation was associated with an increased risk of myocardial infarction (relative risk (RR) 1.54, 95% confidence interval (CI) 1.26–1.85), all-cause mortality (RR 1.95

In [100]:
lbd_paper_dict_list = []

for i in range(5):
    print(f"Paper {i+1}:")
    with open(f"documents/lbd_paper{i+1}_dict.pkl", "rb") as picklefile:
        lbd_paper_dict = (pickle.load(picklefile))
        print(lbd_paper_dict['abstract_text'])
        lbd_paper_dict_list.append(lbd_paper_dict)
    print("------------------------\n")

Paper 1:
Lewy body dementia (consisting of dementia with Lewy bodies and Parkinson's disease dementia) is a common neurodegenerative disease characterised by visual hallucinations, fluctuating attention, motor disturbances, falls, and sensitivity to antipsychotics. This combination of features presents challenges for pharmacological management. Given this, we sought to review evidence for non-pharmacological interventions with patients with Lewy body dementia and their carers. Bibliographic databases were searched using a wide range of search terms and no restrictions were placed on study design, language, or clinical setting. Two reviewers independently assessed papers for inclusion, rated study quality, and extracted data. The search identified 21 studies including two randomised controlled trials with available subgroup data, seven case series, and 12 case studies. Most studies reported beneficial effects of the interventions used, though the only sizeable study was on dysphagia, sh

## Step 2

Let's go ahead and try averaging the word2vec vectors for each of the abstracts. The other thing to look at would be whether or not taking out the stopwords for each of the abstracts would actually help.

The first step is to go ahead and tokenize the abstract. We'll use the abstract from the first LBD paper.

In [104]:
abstract = lbd_paper_dict_list[0]['abstract_text']

In [109]:
import nltk
from nltk.tokenize import word_tokenize

In [117]:
def vectorize_abstract(abstract):
    abstract_words = word_tokenize(abstract)
    return abstract_words

Let's import the pubmed word2vec vectors. Due to the size of the vectors, we're going to just use chunks of the original file. We're also going to make a dictionary of all the words contained in each chunk file, so that we can can get the words that are contained in that file.

In [152]:
med_w2v_chunks_path = "../../../resources/med_w2v_chunks/"

In [153]:
with open(med_w2v_chunks_path + "directory_of_chunk_names.txt", "r") as f:
    chunk_names = f.read().split("\n")

In [169]:
len(chunk_names)

410

In [168]:
word_to_w2v_chunk_dict = {}

In [184]:
for i in range(100):
    with open(med_w2v_chunks_path + chunk_names[i], "r") as f:
        vector_chunk = f.readlines()
    words = [(vector.split(" ")[0], chunk_names[i]) for vector in vector_chunk]
    word_to_w2v_chunk_dict.update(dict(words))

In [229]:
with open(med_w2v_chunks_path + chunk_names[0], "r") as f:
    vector_chunk = f.readlines()
# words = [(vector.split(" ")[0], chunk_names[i]) for vector in vector_chunk]
# word_to_w2v_chunk_dict.update(dict(words))

In [196]:
# We'll go ahead and pickle this dictionary.

with open("documents/word_to_w2v_chunk_dict.pkl", "wb") as picklefile:
    pickle.dump(word_to_w2v_chunk_dict, picklefile)

Now, we'll go ahead and get the word vectors for each abstract, and just look to see whether or not we can cluster different papers.

In [277]:
def get_abstract_word_vectors(abstract, word_to_w2v_chunk_dict):
    abstract_word_vectors = {}
    word_to_chunkfile_dict = {}
    
    word2vec_dict = {}
    
    abstract_words = word_tokenize(abstract)
    
    for word in abstract_words:
        if word not in word_to_chunkfile_dict:
            if word in word_to_w2v_chunk_dict:
                word_to_chunkfile_dict[word] = word_to_w2v_chunk_dict[word]
    
    word_to_chunkfile_series = pd.Series(word_to_chunkfile_dict)
    
    words_to_check_for_each_chunkfile = {}
    
    for chunk_file in set(word_to_chunkfile_dict.values()):
        words_to_check_for_each_chunkfile[chunk_file] = \
        list(word_to_chunkfile_series[word_to_chunkfile_series == chunk_file].index)
    
    for chunk_file, word_list in words_to_check_for_each_chunkfile.items():
        with open(med_w2v_chunks_path + chunk_file, "r") as f:
            word2vec_file = f.readlines()
        
        word2vec_dict = {}
        for word2vec_line in word2vec_file:
            split_word2vec_line = word2vec_line.split(" ")
            word2vec_dict[split_word2vec_line[0]] = list(map(float, split_word2vec_line[1:-1]))
            
        for word in word_list:
            abstract_word_vectors[word] = word2vec_dict[word]
    
    return abstract_word_vectors

In [291]:
lbd_paper1_word_vectors = get_abstract_word_vectors(abstract, word_to_w2v_chunk_dict)

In [303]:
lewy_body_abstract_mean_vector = []

for lbd_paper_dict in lbd_paper_dict_list:
    print(lbd_paper_dict['abstract_text'])
    abstract_word_vectors = get_abstract_word_vectors(lbd_paper_dict['abstract_text'], word_to_w2v_chunk_dict)
    lewy_body_abstract_mean_vector.append(np.mean(list(abstract_word_vectors.values()), axis=1))

Lewy body dementia (consisting of dementia with Lewy bodies and Parkinson's disease dementia) is a common neurodegenerative disease characterised by visual hallucinations, fluctuating attention, motor disturbances, falls, and sensitivity to antipsychotics. This combination of features presents challenges for pharmacological management. Given this, we sought to review evidence for non-pharmacological interventions with patients with Lewy body dementia and their carers. Bibliographic databases were searched using a wide range of search terms and no restrictions were placed on study design, language, or clinical setting. Two reviewers independently assessed papers for inclusion, rated study quality, and extracted data. The search identified 21 studies including two randomised controlled trials with available subgroup data, seven case series, and 12 case studies. Most studies reported beneficial effects of the interventions used, though the only sizeable study was on dysphagia, showing a b

In [304]:
lewy_body_abstract_mean_vector

[array([-0.00228016, -0.005327  ,  0.00343219, -0.00239429, -0.00272403,
        -0.0025297 , -0.00652254, -0.00539283, -0.00395533, -0.00460923,
         0.00201319, -0.00389045, -0.00366805,  0.00201477, -0.00122336,
        -0.00606793, -0.00457171, -0.00538787, -0.00480024,  0.00173611,
         0.0016477 , -0.00514065,  0.00276594,  0.00255077, -0.00092949,
        -0.00021652, -0.00035522, -0.0055997 , -0.00477975,  0.0005191 ,
        -0.00444989,  0.00218933, -0.00032444, -0.01004172, -0.00176036,
        -0.00696765, -0.00014063, -0.00752481, -0.00021117,  0.00293141,
        -0.00592717, -0.00116795, -0.00016118, -0.0088345 , -0.00894355,
        -0.00174407, -0.00710111, -0.0065993 , -0.01337985, -0.00694562,
        -0.00542744, -0.0079034 , -0.00089415,  0.00160952, -0.00689147,
        -0.00659728, -0.00732531, -0.00450111, -0.00629366, -0.00243257,
        -0.00267998,  0.00159085, -0.00468278,  0.00311708, -0.00507915,
         0.00332733, -0.00445157, -0.00978587,  0.0

In [251]:
abstract

"Lewy body dementia (consisting of dementia with Lewy bodies and Parkinson's disease dementia) is a common neurodegenerative disease characterised by visual hallucinations, fluctuating attention, motor disturbances, falls, and sensitivity to antipsychotics. This combination of features presents challenges for pharmacological management. Given this, we sought to review evidence for non-pharmacological interventions with patients with Lewy body dementia and their carers. Bibliographic databases were searched using a wide range of search terms and no restrictions were placed on study design, language, or clinical setting. Two reviewers independently assessed papers for inclusion, rated study quality, and extracted data. The search identified 21 studies including two randomised controlled trials with available subgroup data, seven case series, and 12 case studies. Most studies reported beneficial effects of the interventions used, though the only sizeable study was on dysphagia, showing a 

In [114]:
import string

In [116]:
translator = str.maketrans('', '', string.punctuation)

abstract.translate(translator)

'Lewy body dementia consisting of dementia with Lewy bodies and Parkinsons disease dementia is a common neurodegenerative disease characterised by visual hallucinations fluctuating attention motor disturbances falls and sensitivity to antipsychotics This combination of features presents challenges for pharmacological management Given this we sought to review evidence for nonpharmacological interventions with patients with Lewy body dementia and their carers Bibliographic databases were searched using a wide range of search terms and no restrictions were placed on study design language or clinical setting Two reviewers independently assessed papers for inclusion rated study quality and extracted data The search identified 21 studies including two randomised controlled trials with available subgroup data seven case series and 12 case studies Most studies reported beneficial effects of the interventions used though the only sizeable study was on dysphagia showing a benefit of honeythicken

In [284]:
from nltk.corpus import stopwords

In [288]:
abstract_words = word_tokenize(abstract.translate(translator))

In [289]:
abstract_words

['Lewy',
 'body',
 'dementia',
 'consisting',
 'of',
 'dementia',
 'with',
 'Lewy',
 'bodies',
 'and',
 'Parkinsons',
 'disease',
 'dementia',
 'is',
 'a',
 'common',
 'neurodegenerative',
 'disease',
 'characterised',
 'by',
 'visual',
 'hallucinations',
 'fluctuating',
 'attention',
 'motor',
 'disturbances',
 'falls',
 'and',
 'sensitivity',
 'to',
 'antipsychotics',
 'This',
 'combination',
 'of',
 'features',
 'presents',
 'challenges',
 'for',
 'pharmacological',
 'management',
 'Given',
 'this',
 'we',
 'sought',
 'to',
 'review',
 'evidence',
 'for',
 'nonpharmacological',
 'interventions',
 'with',
 'patients',
 'with',
 'Lewy',
 'body',
 'dementia',
 'and',
 'their',
 'carers',
 'Bibliographic',
 'databases',
 'were',
 'searched',
 'using',
 'a',
 'wide',
 'range',
 'of',
 'search',
 'terms',
 'and',
 'no',
 'restrictions',
 'were',
 'placed',
 'on',
 'study',
 'design',
 'language',
 'or',
 'clinical',
 'setting',
 'Two',
 'reviewers',
 'independently',
 'assessed',
 'papers

In [290]:
[word for word in abstract_words if word not in stopwords.words("English")]

['Lewy',
 'body',
 'dementia',
 'consisting',
 'dementia',
 'Lewy',
 'bodies',
 'Parkinsons',
 'disease',
 'dementia',
 'common',
 'neurodegenerative',
 'disease',
 'characterised',
 'visual',
 'hallucinations',
 'fluctuating',
 'attention',
 'motor',
 'disturbances',
 'falls',
 'sensitivity',
 'antipsychotics',
 'This',
 'combination',
 'features',
 'presents',
 'challenges',
 'pharmacological',
 'management',
 'Given',
 'sought',
 'review',
 'evidence',
 'nonpharmacological',
 'interventions',
 'patients',
 'Lewy',
 'body',
 'dementia',
 'carers',
 'Bibliographic',
 'databases',
 'searched',
 'using',
 'wide',
 'range',
 'search',
 'terms',
 'restrictions',
 'placed',
 'study',
 'design',
 'language',
 'clinical',
 'setting',
 'Two',
 'reviewers',
 'independently',
 'assessed',
 'papers',
 'inclusion',
 'rated',
 'study',
 'quality',
 'extracted',
 'data',
 'The',
 'search',
 'identified',
 '21',
 'studies',
 'including',
 'two',
 'randomised',
 'controlled',
 'trials',
 'available',

In [285]:
stopwords.words("English")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each