# Topic Modelling and Visualisation
***

## Background

Determine the degree of consensus in contentious academic fields. 

Collect the title, publication date and summary of scholarly articles containing a given keyword or keywords. Apply NLP models to this data to identify and categorise concepts in the field and to test whether opposing 'truths', if any, differ to a statistically significant degree. Ranking these groups by weighted influence then indicates the degree of consensus among the various approaches in a given academic field.

To this end, the academic_consensus model has already searched the abstracts of academic papers for the keyword "nutrition" and saved the results to corpus_raw.csv.  
  
Overview of this notebook:
- Set up notebook environment and load data (corpus_raw.csv)
- Review articles published per year (sklearn's CountVectorizer)
- Create Bag Of Words (BOW) of all articles for Titles and Conclusions (nltk)
- Create interactive BOW per year with Bokeh

### Credit

Credit is due in large part to Patrick Harrison: Modern NLP in Python | PyData DC 2016  
  
Thank you!

## Setup

### Packages and setup

In [1]:
# Common
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Workspace
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML

In [2]:
# Set workspace
sns.set()
# Set output characters to 110 (not 79)
pd.options.display.width = 110
# Show output for the last statement in each cell only (use 'all' to show every expression)
InteractiveShell.ast_node_interactivity = 'last'
# Make notebook wider to fit pyLDAvis plot
display(HTML("<style>.container { width:50% !important; }</style>"))

### Load and inspect corpus_raw.csv

In [3]:
# Load corpus_raw.csv as DataFrame with parsed date format
corpus = pd.read_csv('../data/interim/corpus_raw.csv', parse_dates=[0])

In [4]:
# Keyword 
keywords = ['nutrition', 'diet']

In [5]:
# Inspect
corpus.info()
corpus.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1530 entries, 0 to 1529
Data columns (total 3 columns):
publication_date    1530 non-null datetime64[ns, UTC]
title               1530 non-null object
conclusions         1530 non-null object
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 36.0+ KB


|   | publication_date | title | conclusions |
|---|---|---|---|
| 0 | 2016-03-09 00:00:00+00:00 | Pregnancy Requires Major Changes in the Qualit... | ['Pregnancy tends to markedly widen the nutrit... |
| 1 | 2016-08-23 00:00:00+00:00 | Continental-Scale Patterns Reveal Potential fo... | ['In all, given the geographic patterns in die... |
| 2 | 2015-06-17 00:00:00+00:00 | Assessing Nutritional Parameters of Brown Bear... | ['Previous studies have illustrated the differ... |
| 3 | 2015-04-17 00:00:00+00:00 | The Self-Reported Clinical Practice Behaviors ... | ['The present study provides a valuable insigh... |
| 4 | 2017-10-09 00:00:00+00:00 | The impact of nutritional supplement intake on... | ['Our study shows that the propensity to consu... |


## Phrase Modeling 

Use spaCy to tokenise and clean each sentence in every document: remove stop words, punctuation, unnecessary white space and numbers, and lemmatise each word.  
  
Do this for every sentence in each document. 
Generate a list of:
- Phrased documents
- Phrased sentences combined

In [6]:
# Necessary imports for Phrase Modeling
import spacy
from gensim.models.phrases import Phrases, Phraser
#from gensim.models.word2vec import LineSentence     - to be used when scaling

# Set Spacy
nlp = spacy.load('en_core_web_sm')

In [7]:
def docs_preclean_from_df(df, col):
    '''
    Convert documents from DataFrame and clean new line escapes and brackets.
    '''
    # Gather all articles' 'Conclusions' into docs
    docs_corp = df[col]

    # Clean unnecessary escapes and characters
    docs = []
    for doc in docs_corp:
        text = doc.replace('\'', '')
        text = text.replace('[', '')
        text = text.replace(']', '')
        text = text.replace('\\n' ,'')
        docs.append(text)
                
    return docs
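
The `conclusions` values shown above look like stringified Python lists. Stripping every quote and bracket also removes apostrophes and brackets that belong to the text itself. The sketch below shows one alternative using the standard library's `ast.literal_eval`. It assumes each cell is a valid Python list literal; `conclusions_to_text` is a hypothetical helper and is not part of the original notebook.

```python
import ast

def conclusions_to_text(raw):
    """Parse a stringified list of paragraphs and join it into one string.

    Hypothetical helper: assumes `raw` looks like "['para one', 'para two']".
    Falls back to the raw string when it is not a valid Python literal.
    """
    try:
        parts = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(parts, (list, tuple)):
        # Replace embedded newlines and trim each paragraph before joining
        return ' '.join(str(p).replace('\n', ' ').strip() for p in parts)
    return str(parts)

# Made-up example cell containing an apostrophe and an escaped newline
example = "['Children\\'s diets vary.\\n', 'More data is needed.']"
cleaned = conclusions_to_text(example)
print(cleaned)
```

Unlike the character-stripping approach, this keeps the apostrophe in "Children's" while still dropping the list syntax.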


In [8]:
def sents_split_from_docs(docs):
    '''
    Convert documents to list of all sentences.
    '''
    sents = []
    for parsed_doc in nlp.pipe(docs, n_process=4):
        # Collect every sentence of every document as a list of tokens
        for sent in parsed_doc.sents:
            sents.append([token for token in sent])

    return sents

In [9]:
def docs_sents_clean(docs):
    '''
    Clean corpus and produce:
    - Cleaned documents
    - Cleaned sentences combined from all documents
    '''
    # Use spaCy's pipe to parse each document, break it into sentences,
    # clean it (lemmatise, remove stop words and punctuation) and combine
    # all cleaned sentences into sents_clean, and
    # all cleaned sentences per document into docs_clean

    docs_clean = []
    sents_clean = []
    sents = []
    for parsed_doc in nlp.pipe(docs, n_process=4):   
        # For each document 
        sents_clean_per_doc = []
        for sent in parsed_doc.sents:
            sent_token = [token.lemma_ for token in sent if not (token.is_stop or token.is_punct)]
            
            # Only keep the original sentence if more than one token survives cleaning
            if len(sent_token) > 1:
                # Adds to corpus of sentences
                sents.append(sent)
                sents_clean.append(sent_token)
            
            sents_clean_per_doc += sent_token
           
        # Adds to corpus of documents
        docs_clean.append(sents_clean_per_doc)
    # Confirming output
    print('Number of documents (articles): {}'.format(len(docs_clean)))
    print('Total number of sentences: {}'.format(len(sents_clean)))
    print('Total words: {}\n'.format(sum([len(sent) for sent in sents_clean])))
    print('\nsents and sents_clean have the same length: ', len(sents) == len(sents_clean), '\n')
    
    return docs_clean, sents_clean, sents


In [10]:
def n_plus_one_gram(sents):
    ''' Generate n+1_gram model and outputs new sentences 
    Returns n+1_gram model and n+1_gram sentences'''
    # Create n+1 gram model with Gensim for all sentences
    g = Phrases(sents)
    g = Phraser(g)    
    # Generate n+1 gram sentences
    g_sents = [sent for sent in g[sents]]

    # Confirming output
    print('Total words after (n+1) gram: {}'.format(sum([len(sent) for sent in g_sents])) + '\n')
    
    return g, g_sents
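
To make the phrase-detection step more concrete, here is a pure-Python sketch of the collocation score described by Mikolov et al., which Gensim's `Phrases` uses by default as far as I understand it. The helper names, the toy corpus and the threshold are made up for illustration. Gensim's real implementation has further options (its defaults are `min_count=5` and `threshold=10.0`) and should be preferred in practice.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair: (count(a,b) - min_count) / (count(a) * count(b)) * vocab_size."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n in bigrams.items()
    }

def merge_bigrams(sentence, scores, threshold):
    """Join adjacent words whose score exceeds the threshold with an underscore."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append('_'.join(pair))
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

# Toy corpus (illustrative only)
toy = [
    ['amino', 'acid', 'intake'],
    ['amino', 'acid', 'level'],
    ['fatty', 'acid', 'intake'],
    ['amino', 'acid'],
]
scores = score_bigrams(toy)
merged = merge_bigrams(toy[0], scores, threshold=0.7)
print(merged)
```

Running the same procedure a second time on the merged sentences is what lets a bigram such as `amino_acid` combine with a neighbouring word into a trigram, which mirrors the two passes used in this notebook.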


In [11]:
def docs_sents_phrasing(docs_clean, sents_clean):
    '''
    Prepare documents and sentences for phrasing.
    '''
    # Phrasing SENTENCES
    # Bigram
    bg, bg_sents = n_plus_one_gram(sents_clean)
    # Trigram
    tg, tg_sents = n_plus_one_gram(bg_sents)
    
    # Phrasing DOCUMENTS
    # Apply the bigram model to the cleaned documents
    bg_docs = bg[docs_clean]
    # Apply the trigram model to the bigram-converted documents
    tg_docs = tg[bg_docs]
    
    # Extract docs from generator
    ph_docs = [doc for doc in tg_docs]
    
    # Add domain-specific stop words
    add_stopwords = ['study', 'effect', 'population', 'change', 'increase', 'different', 'high', 'level', 'low', 'result'] #food, diet
    nlp.Defaults.stop_words.update(add_stopwords)
                
    # Take out final stop words for each document
    phrased_docs = []
    for doc in ph_docs:
        phrased_docs.append([word for word in doc if word not in nlp.Defaults.stop_words])
    
    # Take out final stop words for each sentence (to be used later)
    phrased_sents = []
    for sent in tg_sents:
        phrased_sents.append([word for word in sent if word not in nlp.Defaults.stop_words])
    
    
    return phrased_docs, phrased_sents
    

In [12]:
# Messing around by excluding some words

nlp.Defaults.stop_words.add('study')
nlp.Defaults.stop_words.add('diet')
nlp.Defaults.stop_words.add('effect')

In [14]:
docs = docs_preclean_from_df(corpus, 'conclusions')
docs_clean, sents_clean, sents = docs_sents_clean(docs)
phrased_docs, phrased_sents = docs_sents_phrasing(docs_clean, sents_clean)

Number of documents (articles): 1530
Total number of sentences: 12464
Total words: 189133


sents and sents_clean have the same length:  True 

Total words after (n+1) gram: 177335

Total words after (n+1) gram: 175828



In [15]:
print('First 10 words from first document: \n', phrased_docs[0][:10], '\n')
print('First sentence: \n', phrased_sents[0])

First 10 words from first document: 
 ['pregnancy', 'tend', 'markedly', 'widen', 'nutritional', 'gap', 'decrease', 'nutritional', 'adequacy', 'observed'] 

First sentence: 
 ['pregnancy', 'tend', 'markedly', 'widen', 'nutritional', 'gap']


## Topic Modeling with LDA

### Topic Modeling

In [20]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

In [21]:
def do_lda(corp, num_topics=15):
    '''
    Apply LDA modeling to a list of documents.
    '''
    # Create dictionary and tidy
    corp_dict = Dictionary(corp)
    corp_dict.filter_extremes(no_below=10, no_above=0.6)
    corp_dict.compactify()
    # Create BOW. 
    # Note: docs and sents are the same now for all practical purposes
    corp_bow = [corp_dict.doc2bow(doc) for doc in corp]
    # LDA model
    lda = LdaMulticore(corp_bow, num_topics=num_topics, id2word=corp_dict, workers=3)
    
    return lda, corp_bow, corp_dict
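
The `filter_extremes(no_below=10, no_above=0.6)` call keeps tokens that occur in at least 10 documents and in no more than 60% of all documents. The pure-Python sketch below illustrates that document-frequency filter and the bag-of-words step on a toy corpus. `filter_vocab` and `to_bow` are hypothetical helpers, the thresholds are scaled down for the toy data, and Gensim's version additionally caps the vocabulary size through its `keep_n` parameter.

```python
from collections import Counter

def filter_vocab(docs, no_below=2, no_above=0.6):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (fraction) of docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    return {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

def to_bow(doc, vocab):
    """Count the in-vocabulary tokens of one document, sorted by token."""
    return sorted(Counter(tok for tok in doc if tok in vocab).items())

# Toy corpus (illustrative only)
toy_docs = [
    ['diet', 'fat', 'fat'],
    ['diet', 'protein'],
    ['diet', 'fat'],
    ['diet', 'fibre'],
    ['protein', 'fat'],
    ['diet', 'sugar'],
]
vocab = filter_vocab(toy_docs)
bow_first = to_bow(toy_docs[0], vocab)
print(sorted(vocab), bow_first)
```

Here `diet` is dropped because it appears in five of six documents, while `fibre` and `sugar` are dropped because each appears in only one.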

In [22]:
# Function to show a topic
def show_topic(lda, num=0, topn=10):
    '''
    Display topic for given LDA Model, topic number and number of words.
    '''
    for word, freq in lda.show_topic(num, topn=topn):
        print(f'{word:20} {freq:.3f}')

In [23]:
lda_docs, bow_docs, dict_docs  = do_lda(phrased_docs,  num_topics=15)
lda_sents, bow_sents, dict_sents = do_lda(phrased_sents, num_topics=15)

In [24]:
print('First topic for lda_docs:')
show_topic(lda_docs, num=0)
print('\nFirst topic for lda_sents:')
show_topic(lda_sents, num=0)

First topic for lda_docs:
food                 0.005
model                0.005
include              0.005
suggest              0.005
find                 0.005
child                0.005
important            0.005
dietary              0.004
base                 0.004
individual           0.004

First topic for lda_sents:
University           0.016
United_States        0.007
base                 0.007
muscle               0.006
New                  0.006
provide              0.005
patient              0.005
likely               0.005
component            0.005
activity             0.005


### Visualisation with pyLDAvis

In [33]:
import pyLDAvis
import pyLDAvis.gensim
import warnings

In [34]:
LDAvis_prep = pyLDAvis.gensim.prepare(lda_docs, bow_docs, dict_docs)


In [35]:
pyLDAvis.display(LDAvis_prep)

## Word Vector Model

### Word2Vec

Using Gensim's Word2Vec.

In [36]:
from gensim.models import Word2Vec

In [37]:
w2v_docs = Word2Vec(phrased_docs, size=100, window=5, min_count=20, sg=1, workers=3, iter=100, compute_loss=True)

In [38]:
w2v_docs.get_latest_training_loss()

35279060.0

In [39]:
w2v_sents = Word2Vec(phrased_sents, size=100, window=5, min_count=20, sg=1, workers=3, iter=100, compute_loss=True)

In [40]:
w2v_sents.get_latest_training_loss()

28394614.0

In [42]:
print('Words in Word2Vec dictionary for docs: {}'.format(len(w2v_docs.wv.vocab)))
print('Words in Word2Vec dictionary for sents: {}'.format(len(w2v_sents.wv.vocab)))

Words in Word2Vec dictionary for docs: 1660
Words in Word2Vec dictionary for sents: 1660


Create functions to find similar and opposite words in the vocabulary.

In [43]:
def similar_word(w2v, word, topn=5):
    """Find similar words in the corpus."""
    for word, simil in w2v.wv.most_similar(positive=[word], topn=topn):
        print(f'{word:25} {round(simil, 4)}')

In [44]:
def opposite_word(w2v, word, topn=5):
    """Find opposite words in the corpus."""
    for word, oppos in w2v.wv.most_similar(negative=[word], topn=topn):
        print(f'{word:25} {round(oppos, 4)}')
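
`most_similar` ranks words by cosine similarity to a query vector. When a word is passed only as `negative`, the query is the negated word vector, so the "opposite" words are those pointing least in the same direction, not antonyms in the linguistic sense. The sketch below illustrates this with made-up three-dimensional vectors; `toy_vectors` and `rank` are hypothetical and carry no values from the trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings (illustrative only)
toy_vectors = {
    'child':  [1.0, 0.2, 0.0],
    'infant': [0.9, 0.3, 0.1],
    'soil':   [-0.8, 0.1, 0.5],
    'cost':   [0.0, 1.0, 0.0],
}

def rank(word, vectors, negative=False):
    """Rank all other words by cosine similarity to +vector or -vector of `word`."""
    sign = -1.0 if negative else 1.0
    query = [sign * x for x in vectors[word]]
    others = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: -pair[1])

most_similar = rank('child', toy_vectors)
most_opposite = rank('child', toy_vectors, negative=True)
print(most_similar[0][0], most_opposite[0][0])
```

This may help explain why the `opposite_word` outputs further down have low scores and look loosely related at best: they are simply the words least aligned with the query.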

In [45]:
def meaning(w2v, plus=None, minus=None, topn=3):
    """Using 'plus' to find positive/similar words and 'minus' to find
       negative/opposite words in the word2vec vocabulary, 
       showing the top 'topn' results."""
    answers = w2v.wv.most_similar(positive=plus or [], negative=minus or [], topn=topn)
    
    for term, similarity in answers:
        print(term)

In [46]:
# Only if needed to see the matrix as DataFrame
def make_df(w2v):
    '''
    Create a DataFrame from word2vec matrix.
    '''
    # Get word, index and word count
    vecs = [(word, info.index, info.count) for word, info in w2v.wv.vocab.items()]
    # Sort it according to word count (3rd column) - descending order
    vecs = sorted(vecs, key=lambda val: -val[2])
    # Get individual 'columns' values
    words, idx, count = zip(*vecs)
    # Now use the indices to get the vectors
    df_vecs = pd.DataFrame(w2v.wv.vectors_norm[idx,:], index=words)
    
    return df_vecs

In [49]:
similar_word(w2v_docs, 'child', topn=10)

stunting                  0.586
household                 0.5346
health                    0.5215
education                 0.5193
woman                     0.501
child_age                 0.4904
poor                      0.4817
undernutrition            0.4772
stunt                     0.4732
care                      0.4725


In [50]:
similar_word(w2v_sents, 'child', topn=10)

stunting                  0.5307
girl                      0.5185
poor                      0.5064
young_child               0.4992
mother                    0.4986
parent                    0.4898
health                    0.4753
poverty                   0.4695
stunt                     0.4668
household                 0.4608


In [77]:
similar_word(w2v_docs, 'carbohydrate', topn=10)

intake                    0.4006
synthesis                 0.3881
insulin                   0.3759
fat                       0.3721
sugar                     0.369
calorie                   0.3631
anti                      0.3542
reveal                    0.3511
metabolism                0.3503
adjustment                0.3445


In [238]:
similar_word(w2v_sents, 'carbohydrate', topn=10)

amino_acid                0.4185
starch                    0.4017
impaired                  0.3979
insulin                   0.394
nitrogen                  0.3893
metabolism                0.3806
protein                   0.3391
fatty_acid                0.339
vitamin                   0.3387
offer                     0.3367


In [78]:
similar_word(w2v_docs, 'fat', topn=10)

dietary                   0.4605
fatty_acid                0.4513
total                     0.4237
liver                     0.4234
supplement                0.4179
plasma                    0.4154
weight_gain               0.4061
protein                   0.4017
cholesterol               0.4014
vegetable                 0.3862


In [81]:
similar_word(w2v_docs, 'death', topn=10)

clinical                  0.3634
mortality                 0.3618
life                      0.3578
cardiovascular            0.3465
neonatal                  0.3399
peptide                   0.3289
perform                   0.3208
screening                 0.3174
newborn                   0.3135
birth                     0.3126


In [242]:
opposite_word(w2v_sents, 'carbohydrate', topn=10)

take_account              0.2092
dynamic                   0.1803
actual                    0.1674
end                       0.148
child_nutrition           0.1472
herbivore                 0.1403
evaluate                  0.1314
cost                      0.1282
percentage                0.1154
extend                    0.0993


In [51]:
similar_word(w2v_sents, 'diabetes', topn=10)

hypertension              0.4999
obesity                   0.4659
risk_factor               0.4017
patient                   0.3944
awareness                 0.3818
adipose_tissue            0.3736
obese                     0.3692
health                    0.3653
metabolic_syndrome        0.3605
acute                     0.3499


In [61]:
opposite_word(w2v_docs, 'sedentary', topn=10)

transfer                  0.19
role                      0.1721
prediction                0.1672
open                      0.1604
critical                  0.1494
ensure                    0.1489
robust                    0.1436
22                        0.1361
novel                     0.1347
species                   0.1346


In [72]:
opposite_word(w2v_docs, 'muscle', topn=10)

difficult                 0.2089
successful                0.2086
manage                    0.175
context                   0.1652
cultivation               0.1559
breastfeed                0.1372
face                      0.1363
local                     0.1347
Health                    0.133
health_risk               0.1226


In [108]:
meaning(w2v_sents, plus=['nutrition'], minus=['food'], topn=5)

breastfeed
organ
clearly
nutrition_intervention
underlying


In [109]:
meaning(w2v_docs, plus=['child'], minus=['food'], topn=5)

undernutrition
stunting
stunt
height
nutritional_intervention


In [114]:
meaning(w2v_docs, plus=['life', 'negative'], topn=5)#, minus=['nutrition'])

death
year
young
10
live


In [111]:
meaning(w2v_docs, plus=['growth'], minus=['food'], topn=5)

height
organ
damage
deposition
inflammation


In [113]:
meaning(w2v_docs, plus=['health'], minus=['nutrition'], topn=5)

colony
nutritional_intervention
overweight
behavioral
indicator


In [115]:
meaning(w2v_docs, plus=['muscle', 'protein'], topn=10)#, minus=['nutrition'])

skeletal_muscle
fatty_acid
liver
plasma
lipid
metabolism
expression
mitochondrial
deposition
respectively


In [78]:
meaning(w2v_sents, plus=['carbohydrate', 'protein', 'fat'], topn=10)#, minus=['nutrition'])

fatty_acid
amino_acid
dietary
plasma
cholesterol
n-3
acid
sugar
starch
metabolism


### Doc2Vec

In [153]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

In [154]:
def train_d2v(docs, max_epochs=100, vec_size=100, alpha=0.025):
    '''
    Train Doc2Vec model on all documents, or collection of sentences.
    '''
    tagged_docs = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(docs)]
    
    d2v = Doc2Vec(vector_size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=1, dm=1)
  
    d2v.build_vocab(tagged_docs)

    for epoch in range(max_epochs):
        # Each call runs d2v.epochs passes over the data
        d2v.train(tagged_docs, total_examples=d2v.corpus_count, epochs=d2v.epochs)
        print('Completed epoch {} of {}'.format(epoch+1, max_epochs), end='\r')
        # Decrease the learning rate manually for the next call
        d2v.alpha -= 0.0002
        # Pin min_alpha to alpha so the rate stays constant within the next call
        d2v.min_alpha = d2v.alpha
    
    return d2v
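
The training loop lowers the learning rate by hand between calls to `train()`. The short sketch below works out which starting rate each call sees under the defaults used here (`alpha=0.025`, a step of `0.0002`, 100 calls). `manual_alpha_schedule` is a hypothetical helper for the arithmetic only and does not train anything.

```python
def manual_alpha_schedule(start_alpha=0.025, step=0.0002, n_calls=100):
    """Return the learning rate in effect at the start of each train() call."""
    alphas = []
    alpha = start_alpha
    for _ in range(n_calls):
        alphas.append(alpha)
        alpha -= step
    return alphas

alphas = manual_alpha_schedule()
print(f'first call: {alphas[0]:.4f}, last call: {alphas[-1]:.4f}')
```

The rate stays positive throughout with these settings, but a larger step or more calls would push it below zero. Gensim's `train()` also accepts `epochs`, `start_alpha` and `end_alpha`, so a single call can apply a linear decay without the manual loop.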

In [155]:
# Train Doc2Vec model on documents if True, else load the model
run = False
if run:
    d2v_docs = train_d2v(phrased_docs)
    d2v_docs.save('../models/d2v_docs.model')
else:
    d2v_docs = Doc2Vec.load('../models/d2v_docs.model')

In [156]:
# Train Doc2Vec model on sentences if True, else load the model
run = False
if run:
    d2v_sents = train_d2v(phrased_sents)
    d2v_sents.save('../models/d2v_sents.model')
else:
    d2v_sents = Doc2Vec.load('../models/d2v_sents.model')

In [193]:
def find_similar_doc_or_sent(d2v_model, docs_or_sents, pos, neg='', topn=3):
    '''
    Find the topn documents or sentences most related to the input string.
    d2v_model -> Doc2Vec model
    docs_or_sents -> Original documents or sentences, used to look up the matching text
    pos -> Positive document/sentence as string
    neg -> Negative document/sentence as string (optional)
    topn -> Show topn number of docs/sents
    '''
    # Tokenise the input strings
    pos_tokens = [word.lower().replace('.', '') for word in pos.split()]
    neg_tokens = [word.lower().replace('.', '') for word in neg.split()]
    # Infer vectors; skip the negative vector when no negative text is given
    positive = [d2v_model.infer_vector(pos_tokens)]
    negative = [d2v_model.infer_vector(neg_tokens)] if neg_tokens else []
    similars = d2v_model.docvecs.most_similar(positive=positive, negative=negative, topn=topn)
    
    # Find similars in docs/sents
    for idx, weight in similars:
        print('Index {}: Relevance: {:0.2f}%'.format(idx, weight*100))
        print(docs_or_sents[int(idx)], '\n')
       
    return similars

In [212]:
text = 'a child requires healthy nutrition to avoid stunting growth'

a = find_similar_doc_or_sent(d2v_docs, docs, pos=text, topn=5)

Index 453: Relevance: 53.66%
We observed an association between early nutrition or growth, and depression at 30 years, this association seems to be cumulative, because the risk of depression was higher only among subjects who were SGA and were also stunted at age two or four years. 

Index 673: Relevance: 53.21%
This study provides evidence for the temporary presence of BAT in the BFP in early postnatal life. A key question that remains is whether BAT transdifferentiates into WAT in the BFP or WAT simply fills an empty space and replaces BAT over time. Studies are also needed to better determine the degree to which BAT in the BFP is linked to muscle function and infant nutrition, and to establish the significance of this adipose tissue in preterm and low birth weight infants known to lack the coordination between suckling, swallowing, and breathing. 

Index 497: Relevance: 53.19%
Solid state fermentation with A.niger could be practical methods for altering physicochemical properties of

In [186]:
text = 'protein is important for building muscles'

find_similar_doc_or_sent(d2v_sents, sents, pos=text, topn=5);

Index 7065: Relevance: 61.57%
First, the results are based on male individuals who are between 40 and 55. 

Index 7944: Relevance: 60.10%
In addition to health education and policies aimed at facilitating the adoption of healthy lifestyles by individuals, targeting stroke prevention efforts on poor urban households can participate in the effective implementation of stroke prevention to tackle the growing related burden in Morocco. 

Index 1847: Relevance: 56.70%
The field of child nutrition is very important, but some related issues are not on the policymakers’ agenda as required. 

Index 6023: Relevance: 56.60%
This study assessed differences in the relationship between food outlet access and diet according to level of educational attainment. 

Index 3793: Relevance: 55.81%
Currently, we cannot distinguish whether AM are responsible for high K acquisition from soil minerals or whether they have high capacities for K retention and efficiently recapture cations released from degrading l

### Vector Visualisation (t-SNE)

Prepare data.

In [118]:
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [148]:
def make_vecs_for_bokeh_tsne(w2v):
    '''
    Prepare data to DataFrame vecs for input into bokeh_tsne_plot given Word2Vec model.
    '''
    df_vecs = make_df(w2v)
    # Initialise TSNE and fit_transform
    tsne = TSNE()
    tsne_vecs = tsne.fit_transform(df_vecs.values)
    # Wrap the 2D coordinates in a DataFrame for plotting
    tsne_vecs = pd.DataFrame(tsne_vecs, index=pd.Index(df_vecs.index), columns=['x', 'y'])
    # Add columns for words
    tsne_vecs['word'] = tsne_vecs.index
    
    return tsne_vecs

In [149]:
def bokeh_tsne_plot(tsne_vecs):
    '''
    Bokeh plot for DataFrame of t-SNE vectors.
    '''

    # Credit: Patrick Harrison: Modern NLP in Python | PyData DC 2016  
    # add our DataFrame as a ColumnDataSource for Bokeh
    plot_data = ColumnDataSource(tsne_vecs)

    # create the plot and configure the
    # title, dimensions, and tools
    tsne_plot = figure(
        title='t-SNE Word Embeddings',
        plot_width=500,
        plot_height=500,
        tools=(
            'pan, wheel_zoom, box_zoom,'
            'box_select, reset'
            ),
        active_scroll='wheel_zoom'
        )

    # add a hover tool to display words on roll-over
    tsne_plot.add_tools(
        HoverTool(tooltips = '@word')
        )
    
    # draw the words as circles on the plot
    tsne_plot.circle(
        'x',
        'y',
        source=plot_data,
        color='blue',
        line_alpha=0.2,
        fill_alpha=0.1,
        size=15,
        hover_line_color='black'
        )
    
    # configure visual elements of the plot
    tsne_plot.title.text_font_size = value('16pt')
    tsne_plot.xaxis.visible = False
    tsne_plot.yaxis.visible = False
    tsne_plot.grid.grid_line_color = None
    tsne_plot.outline_line_color = None
    
    # engage!
    show(tsne_plot);

In [150]:
tsne_vecs_docs = make_vecs_for_bokeh_tsne(w2v_docs)

In [151]:
# docs
bokeh_tsne_plot(tsne_vecs_docs)

In [337]:
tsne_vecs_sents = make_vecs_for_bokeh_tsne(w2v_sents)

In [338]:
# sents
bokeh_tsne_plot(tsne_vecs_sents)

## Hierarchical Clustering

import scipy.cluster.hierarchy as shc

df = make_df(w2v_docs)
dft = df.transpose()

df.head()

# First define the leaf label function.
# Not called below: n, count and R must be defined before it can be used.
def llf(id):
    if id < n:
        return str(id)
    else:
        return '[%d %d %1.2f]' % (id, count, R[n-id,3])
# The text for the leaf nodes is going to be big, so force
# a rotation of 90 degrees.
#dendrogram(Z, leaf_label_func=llf, leaf_rotation=90)

plt.figure(figsize=(20, 250))  
plt.title('Dendrograms')  
dend = shc.dendrogram(shc.linkage(df, method='ward'), orientation='left', leaf_font_size=10)#, truncate_mode='lastp')

dend['ivl']

df.shape

df.reset_index()

df

X = np.array([[5,3],  
    [10,15],
    [15,12],
    [24,10],
    [30,30],
    [85,70],
    [71,80],
    [60,78],
    [70,55],
    [80,91],])

labels = range(1, 11)  
plt.figure(figsize=(10, 7))  
plt.subplots_adjust(bottom=0.1)  
plt.scatter(X[:,0],X[:,1], label='True Position')

for label, x, y in zip(labels, X[:, 0], X[:, 1]):  
    plt.annotate(
        label,
        xy=(x, y), xytext=(-3, 3),
        textcoords='offset points', ha='right', va='bottom')
plt.show()  

from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(X, 'single')

labelList = range(1, 11)

plt.figure(figsize=(10, 7))  
dendrogram(linked,  
            orientation='top',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()  
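
To show what the `'single'` linkage in the toy example does, the sketch below repeats the agglomeration in plain Python on the same ten points: it repeatedly merges the two clusters whose closest members are nearest to each other. This is a naive illustration, not how SciPy implements it, and `single_linkage` is a hypothetical helper. Labels are 1-based to match `labelList`.

```python
import math

# Same ten points as the toy array X above
points = [(5, 3), (10, 15), (15, 12), (24, 10), (30, 30),
          (85, 70), (71, 80), (60, 78), (70, 55), (80, 91)]

def single_linkage(points, n_clusters):
    """Merge clusters by smallest closest-pair distance until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Convert to sorted 1-based labels
    return [sorted(i + 1 for i in c) for c in clusters]

two_clusters = sorted(single_linkage(points, 2))
print(two_clusters)
```

Stopping at two clusters separates points 1 to 5 from points 6 to 10, which corresponds to the two main branches of the dendrogram.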

## Sub-topic analysis

Let's summarise articles containing certain keywords (sub-keywords). This is done manually. See **4_article_summarisation.ipynb** for the Gensim 'summarize' method.

In [304]:
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

In [298]:
# Find all sentences with sub keyword
sub_key = 'diabetes'
# plant_base_diet
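
The cells that follow summarise a single document and do not use `sub_key`. One way the keyword filter might look is sketched below. It assumes a list of sentences that are either plain strings or objects whose `str()` gives the sentence text. `sentences_with_keyword` is a hypothetical helper, and the example sentences are made up.

```python
def sentences_with_keyword(sentences, keyword):
    """Return (index, text) pairs for sentences containing the keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (i, str(sent))
        for i, sent in enumerate(sentences)
        if needle in str(sent).lower()
    ]

# Made-up sentences (illustrative only)
example_sents = [
    'Diabetes risk rose with sugar intake.',
    'Protein intake was unchanged.',
    'Patients with type 2 diabetes were excluded.',
]
matches = sentences_with_keyword(example_sents, 'diabetes')
for idx, text in matches:
    print(idx, text)
```

Because this is a substring match, a keyword such as `diet` would also match `dietary`; matching on tokens or lemmas would be stricter.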

In [305]:
stopwords = list(STOP_WORDS)

In [306]:
docx = nlp(docs[0])

In [308]:
# Frequency table
word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

In [310]:
# Maximum word frequency
maximum_frequency = max(word_frequencies.values())
# Normalise word frequencies
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

In [312]:
# Sentence Tokens
sentence_list = [sentence for sentence in docx.sents]

In [313]:
# Sentence score: sum the normalised frequencies of the words in each sentence
sentence_scores = {}
for sent in sentence_list:
    score = sum(word_frequencies.get(word.text.lower(), 0) for word in sent)
    # Normalise according to sentence length; skip sentences with no scored words
    if score > 0:
        sentence_scores[sent] = score / len(sent)

In [314]:
sentence_scores

{Pregnancy tends to markedly widen the nutritional gap.: 0.37777777777777777,
 The decrease in the nutritional adequacy of observed diets of women of childbearing age cannot be solved by a simple increase in energy intake, as recommended in some countries.: 0.21250000000000005,
 A relatively good nutritional adequacy in pregnant women can therefore not be simply attributed to a higher energy intake, but to qualitative changes in the diet.: 0.2428571428571429,
 Recommendations of snack additions from public health agencies make a somewhat limited and an extremely variable contribution to tackling the nutritional gap during pregnancy.: 0.2,
 These results call for dedicated studies to define the most theoretically efficient dietary counselling during pregnancy on a methodologically sound basis, either as generic counselling for the whole population or as tailor-fitted advice at the individual level, to improve the nutritional adequacy of the diet which is jeopardized at this specific, cr

In [315]:
summarized_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)

In [316]:
summarized_sentences

[Pregnancy tends to markedly widen the nutritional gap.,
 A relatively good nutritional adequacy in pregnant women can therefore not be simply attributed to a higher energy intake, but to qualitative changes in the diet.,
 The decrease in the nutritional adequacy of observed diets of women of childbearing age cannot be solved by a simple increase in energy intake, as recommended in some countries.,
 Recommendations of snack additions from public health agencies make a somewhat limited and an extremely variable contribution to tackling the nutritional gap during pregnancy.,
 These results call for dedicated studies to define the most theoretically efficient dietary counselling during pregnancy on a methodologically sound basis, either as generic counselling for the whole population or as tailor-fitted advice at the individual level, to improve the nutritional adequacy of the diet which is jeopardized at this specific, critical stage of the life cycle.]