# Topic Modelling and Visualisation
***

## Background

Determine the degree of consensus in contentious academic fields. 

Collect the title, publication date and summary of scholarly articles containing a given keyword or keywords. Apply NLP models to this data to identify and categorise concepts in the field and to test whether opposing 'truths', if any, differ with statistical significance. Ranking these groups by weighted influence then gives a measure of the degree of consensus among the various approaches in a given academic field.

To this end, the academic_consensus model has already searched the abstracts of academic papers that contain the keyword "nutrition" and saved the results to corpus_raw.csv.  
  
Overview of this notebook:
- Set up the notebook environment and load data (corpus_raw.csv)
- Review articles published per year (sklearn's CountVectorizer)
- Create Bag Of Words (BOW) of all articles for Titles and Conclusions (nltk)
- Create interactive BOW per year with Bokeh

### Credit

Credit is due in large part to Patrick Harrison: Modern NLP in Python | PyData DC 2016  
  
Thank you!

## Setup

### Packages and setup

In [1]:
# Common
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Workspace
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML

In [2]:
# Set workspace
sns.set()
# Set output width to 110 characters (wider than the default)
pd.options.display.width = 110
# Show only the last expression's output per cell; use 'all' to show every expression
InteractiveShell.ast_node_interactivity = 'last'
# Make notebook wider to fit pyLDAvis plot
display(HTML("<style>.container { width:50% !important; }</style>"))

### Load and inspect corpus_raw.csv

In [3]:
# Load corpus_raw.csv as a DataFrame, parsing the first column as dates
corpus = pd.read_csv('../data/interim/corpus_raw.csv', parse_dates=[0])

In [4]:
# Keyword 
keywords = ['nutrition', 'diet']

In [5]:
# Inspect
corpus.info()
corpus.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1530 entries, 0 to 1529
Data columns (total 3 columns):
publication_date    1530 non-null datetime64[ns, UTC]
title               1530 non-null object
conclusions         1530 non-null object
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 36.0+ KB


| | publication_date | title | conclusions |
|---|---|---|---|
| 0 | 2016-03-09 00:00:00+00:00 | Pregnancy Requires Major Changes in the Qualit... | ['Pregnancy tends to markedly widen the nutrit... |
| 1 | 2016-08-23 00:00:00+00:00 | Continental-Scale Patterns Reveal Potential fo... | ['In all, given the geographic patterns in die... |
| 2 | 2015-06-17 00:00:00+00:00 | Assessing Nutritional Parameters of Brown Bear... | ['Previous studies have illustrated the differ... |
| 3 | 2015-04-17 00:00:00+00:00 | The Self-Reported Clinical Practice Behaviors ... | ['The present study provides a valuable insigh... |
| 4 | 2017-10-09 00:00:00+00:00 | The impact of nutritional supplement intake on... | ['Our study shows that the propensity to consu... |


## Phrase Modeling 

Use spaCy to split each document into sentences and clean every sentence by removing stop words and punctuation and lemmatising each remaining word.  
  
Do this for every sentence in each document, collecting the cleaned sentences both as one list for the whole corpus and as one token list per document.

In [6]:
# Necessary imports for Phrase Modeling
import spacy
from gensim.models.phrases import Phrases, Phraser
#from gensim.models.word2vec import LineSentence     - to be used when scaling

In [7]:
# Set Spacy
nlp = spacy.load('en_core_web_sm')

In [8]:
# Gather all articles' 'Conclusions' into docs
docs_corp = corpus['conclusions']

# Clean unnecessary quotes and escaped newlines
docs = []
for doc in docs_corp:
    text = doc.replace('\'', '')
    docs.append(text.replace('\\n' ,''))
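
The `conclusions` column appears to hold string representations of Python lists (for example `['Pregnancy tends to ...']`), which is why a leading `[` survives the quote stripping above. As a hedged alternative, assuming every entry is a valid list literal, the standard library's `ast.literal_eval` can recover the list before joining it. The helper name `parse_conclusions` is hypothetical and not part of the original notebook.

```python
import ast


def parse_conclusions(raw):
    """Turn a stringified list such as "['First.', 'Second.']" into one text.

    Hypothetical helper: assumes `raw` is usually a Python list literal.
    """
    try:
        parts = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid literal: fall back to stripping escaped newlines only
        return raw.replace('\\n', '').strip()
    if not isinstance(parts, (list, tuple)):
        parts = [parts]
    # Replace real newlines inside each part and join the parts with spaces
    return ' '.join(str(part).replace('\n', ' ').strip() for part in parts)


example = "['Diet quality matters.\\n', 'More research is needed.']"
print(parse_conclusions(example))
```

Unlike plain quote removal, this keeps apostrophes inside words and drops the surrounding brackets.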

In [9]:
# Use spaCy's pipe to parse each document, break it into sentences,
# clean it (lemmatise, drop stop words and punctuation) and collect
# all cleaned sentences into sents_clean, and
# all cleaned tokens per document into docs_clean

docs_clean = []
sents_clean = []
for parsed_doc in nlp.pipe(docs, n_threads=4):   
    # For each document 
    sents_clean_per_doc = []
    for sent in parsed_doc.sents:
        sent_token          = [token.lemma_ for token in sent if not (token.is_stop or token.is_punct)]
        sents_clean_per_doc += sent_token
        # Adds to corpus of sentences
        sents_clean.append(sent_token)
    # Adds to corpus of documents
    docs_clean.append(sents_clean_per_doc)
        
# Confirming output
print('Number of documents (articles): {}'.format(len(docs_clean)))
print('Total number of sentences: {}'.format(len(sents_clean)))
print('Total words: {}'.format(sum([len(sent) for sent in sents_clean])))

Number of documents (articles): 1530
Total number of sentences: 12598
Total words: 192351


Create a function that generates an (n+1)-gram model and outputs the new sentences.

In [10]:
def n_plus_one_gram(sents):
    '''Generate an (n+1)-gram model and apply it to the sentences.
    Returns the (n+1)-gram model and the (n+1)-gram sentences.'''
    # Create n+1 gram model with Gensim for all sentences
    g = Phrases(sents)
    g = Phraser(g)    
    # Generate n+1 gram sentences
    g_sents = [sent for sent in g[sents]]

    # Confirming output
    print('Total number of sentences: {}'.format(len(g_sents)))
    print('Total words: {}'.format(sum([len(sent) for sent in g_sents])) + '\n')
    return g, g_sents

Run phrase modeling over all the sentences three times to get a four-gram model.

In [11]:
# Bigram
print('Bigram')
bg, bg_sents = n_plus_one_gram(sents_clean)
# Trigram
print('Trigram')
tg, tg_sents = n_plus_one_gram(bg_sents)
# Four-gram
print('Four-gram')
qg, qg_sents = n_plus_one_gram(tg_sents)

Bigram
Total number of sentences: 12598
Total words: 180022

Trigram
Total number of sentences: 12598
Total words: 178310

Four-gram
Total number of sentences: 12598
Total words: 178063
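
To illustrate why repeated passes yield longer phrases and a shrinking word count, here is a minimal pure-Python sketch of one merge pass. It is a simplified stand-in for Gensim's `Phrases`, loosely modelled on the collocation score `(count(a, b) - min_count) / count(a) / count(b) * vocabulary size`; the function name `merge_pairs`, the toy sentences and the toy `min_count` and `threshold` values are assumptions for illustration only.

```python
from collections import Counter


def merge_pairs(sents, min_count=1, threshold=1.0):
    """One pass: join adjacent token pairs that score above the threshold."""
    unigrams = Counter(tok for sent in sents for tok in sent)
    bigrams = Counter(pair for sent in sents for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)

    def score(a, b):
        # Pairs seen together more often than their parts suggest score higher
        return (bigrams[(a, b)] - min_count) / unigrams[a] / unigrams[b] * vocab_size

    merged = []
    for sent in sents:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + '_' + sent[i + 1])
                i += 2  # skip the token that was just merged
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


toy_sents = [
    ['public', 'health', 'agency', 'advise', 'diet'],
    ['public', 'health', 'agency', 'fund', 'study'],
    ['diet', 'study', 'advise', 'fund'],
]
pass_one = merge_pairs(toy_sents)   # joins 'public' and 'health'
pass_two = merge_pairs(pass_one)    # joins 'public_health' and 'agency'
print(pass_two[0])
```

Each pass can only join two neighbouring tokens, so a second pass is needed before `public_health` can absorb `agency`.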



Now run the complete corpus of documents through the (n+1)-gram models.

In [12]:
# Apply the bigram model to the cleaned documents
bg_docs = bg[docs_clean]
# Apply the trigram model to the bigram documents
tg_docs = tg[bg_docs]
# Apply the four-gram model to the trigram documents
qg_docs = qg[tg_docs]

In [13]:
ph_docs = [doc for doc in qg_docs]
print('Total number of documents: {}'.format(len(ph_docs)))
print('Total words: {}'.format(sum([len(doc) for doc in ph_docs])))

Total number of documents: 1530
Total words: 178048


In [134]:
# Add stopwords
#nlp.Defaults.stop_words.add('study')

In [31]:
# Take out final stop words for each document
corp = []
for doc in ph_docs:
    corp.append([word for word in doc if word not in nlp.Defaults.stop_words])

# Confirm output
print('Total number of documents: {}'.format(len(corp)))
print('Total words: {}'.format(sum([len(doc) for doc in corp])))

Total number of documents: 1530
Total words: 175457


In [32]:
# Take out final stop words for each sentence (to be used later)
corp_sent = []
for sent in qg_sents:
    corp_sent.append([word for word in sent if word not in nlp.Defaults.stop_words])

# Confirm output
print('Total number of sentences: {}'.format(len(corp_sent)))
print('Total words: {}'.format(sum([len(sent) for sent in corp_sent])))

Total number of sentences: 12598
Total words: 175468


In [33]:
print('First document in corpus after applying Phrase Modeling: \n\n', u' '.join(corp[0]))
print('\n')
print('First sentence in corpus after applying Phrase Modeling: \n\n', u' '.join(corp_sent[0]))

First document in corpus after applying Phrase Modeling: 

 pregnancy tend markedly widen nutritional gap decrease nutritional adequacy observed diet woman_childbeare_age solve simple increase energy_intake recommend country relatively good nutritional adequacy pregnant_woman simply attribute high energy_intake qualitative change diet recommendation snack addition public_health agency somewhat limited extremely variable contribution tackle nutritional gap pregnancy result dedicated define theoretically efficient dietary counselling pregnancy methodologically sound basis generic counselling population tailor fit advice individual level improve_nutritional adequacy diet jeopardize specific critical stage life_cycle


First sentence in corpus after applying Phrase Modeling: 

 pregnancy tend markedly widen nutritional gap


## Topic Modeling with LDA

### Topic Modeling

In [34]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

In [100]:
# Create dictionary and tidy
corp_dict = Dictionary(corp)
corp_dict.filter_extremes(no_below=10, no_above=0.6)
corp_dict.compactify()
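
As a rough illustration of what the `no_below` and `no_above` arguments do, the sketch below filters a toy vocabulary by document frequency: a token is kept only if it occurs in at least `no_below` documents and in no more than a fraction `no_above` of all documents. The helper `filter_tokens` and the toy documents are hypothetical, and the sketch ignores other Gensim options such as `keep_n`.

```python
from collections import Counter


def filter_tokens(docs, no_below=2, no_above=0.6):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Count each token once per document (document frequency)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ['diet', 'child', 'stunting'],
    ['diet', 'child', 'obesity'],
    ['diet', 'insulin', 'obesity'],
    ['diet', 'insulin'],
    ['diet', 'ruminant'],
]
kept = filter_tokens(toy_docs)
print(sorted(kept))
```

Here `diet` is dropped for appearing in every document, while `stunting` and `ruminant` are dropped for appearing in only one.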

In [101]:
# Create BOW. 
# Note: docs and sents are the same now for all practical purposes
corp_bow = [corp_dict.doc2bow(doc) for doc in corp]
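
For readers unfamiliar with the bag-of-words format, the following sketch shows the idea behind `doc2bow` in plain Python: each document becomes a sorted list of `(token_id, count)` pairs, and tokens missing from the dictionary are ignored. The helpers `build_vocab` and `to_bow` are illustrative assumptions, not Gensim functions.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


toy_vocab = build_vocab([['diet', 'child', 'diet'], ['child', 'obesity']])
toy_bow = to_bow(['diet', 'obesity', 'diet', 'unknown'], toy_vocab)
print(toy_vocab)
print(toy_bow)
```

Word order is discarded at this step, which is the input LDA expects.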

In [102]:
# LDA model
lda = LdaMulticore(corp_bow, num_topics=15, id2word=corp_dict, workers=3)

In [103]:
# Function to show a topic
def show_topic(lda, num=0, topn=10):
    '''Display topic for given LDA Model, topic number and number of words.'''
    for word, freq in lda.show_topic(num, topn=topn):
        print(f'{word:20} {round(freq, 3):.3f}')

In [109]:
show_topic(lda, num=0, topn=10)

diet                 0.013
nutrition            0.007
effect               0.007
increase             0.007
high                 0.006
child                0.005
result               0.005
research             0.005
different            0.005
individual           0.005


### Visualisation with pyLDAvis

In [110]:
import pyLDAvis
import pyLDAvis.gensim
import warnings

In [111]:
LDAvis_prep = pyLDAvis.gensim.prepare(lda, corp_bow, corp_dict)



In [112]:
#%%time
pyLDAvis.display(LDAvis_prep)

## Word Vector Model

### Vector Modeling  (word2vec)

Using Gensim's Word2Vec.

In [45]:
from gensim.models import Word2Vec

In [135]:
w2v = Word2Vec(corp_sent, size=100, window=5, min_count=20, sg=1, workers=3, iter=100)

In [136]:
w2v.init_sims()
print(f'{w2v.epochs} training epochs so far.')

100 training epochs so far.


In [137]:
print('Words in Word2Vec dictionary: {}'.format(len(w2v.wv.vocab)))

Words in Word2Vec dictionary: 1680


First, build the 100-dimensional matrix of word vectors from the Word2Vec model trained above.

In [138]:
# Get word, index and word count
vecs = [(word, info.index, info.count) for word, info in w2v.wv.vocab.items()]
# Sort it according to word count (3rd column) - descending order
vecs = sorted(vecs, key=lambda val: -val[2])
# Get individual 'columns' values
words, idx, count = zip(*vecs)
# Now use the indices to get the vectors
df_vecs = pd.DataFrame(w2v.wv.vectors_norm[idx,:], index=words)

In [139]:
df_vecs.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diet | 0.038471 | -0.065727 | 0.036493 | 0.111315 | -0.094863 | 0.094459 | -0.028849 | -0.040578 | 0.034383 | 0.01437 | ... | 0.119562 | 0.236303 | -0.002424 | -0.001734 | -0.061753 | 0.038822 | 0.292569 | 0.12691 | -0.013864 | -0.225706 |
| increase | -0.005244 | 0.028765 | 0.035089 | -0.189051 | -0.06572 | -0.064051 | 0.011634 | 0.046224 | 0.184972 | -0.01394 | ... | 0.130158 | 0.187754 | -0.013506 | -0.042828 | 0.035044 | 0.094181 | 0.009193 | -0.012315 | 0.093556 | -0.055268 |
| result | 0.079385 | -0.083812 | -0.112972 | 0.056171 | -0.176594 | -0.030169 | -0.037042 | -0.020893 | -0.088918 | 0.092742 | ... | -0.154363 | 0.161008 | -0.049255 | 0.013462 | 0.040598 | -0.106429 | 0.190632 | -0.024493 | 0.105604 | -0.174962 |
| high | 0.003289 | 0.002262 | 0.03286 | -0.126417 | -0.169252 | 0.057015 | 0.023633 | 0.075326 | 0.079267 | -0.078839 | ... | 0.078307 | 0.164383 | -0.002872 | -0.056232 | 0.01394 | 0.006983 | 0.049052 | -0.044258 | 0.006378 | -0.149614 |
| effect | -0.00357 | -0.144737 | -0.020202 | -0.205143 | 0.038867 | -0.069709 | -0.031036 | 0.034612 | 0.007768 | 0.054054 | ... | -0.036904 | 0.078452 | -0.144285 | -0.071278 | 0.159907 | 0.00237 | 0.104134 | -0.073446 | 0.234772 | -0.075305 |


Create functions to find similar and opposite words in the vocabulary.

In [140]:
def similar_word(word, topn=10):
    """Find the words most similar to 'word' in the corpus."""
    # Use a separate loop variable so the 'word' argument is not shadowed
    for term, simil in w2v.wv.most_similar(positive=[word], topn=topn):
        print(f'{term:25} {round(simil, 4)}')

In [141]:
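def opposite_word(word, topn=10):
    """Find the words closest to the negated vector of 'word' in the corpus."""
    # Use a separate loop variable so the 'word' argument is not shadowed
    for term, oppos in w2v.wv.most_similar(negative=[word], topn=topn):
        print(f'{term:25} {round(oppos, 4)}')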

In [142]:
similar_word('child')

girl                      0.5115
education                 0.509
overweight_obesity        0.4899
mother                    0.4777
stunting                  0.4765
parent                    0.4711
nutritional_status        0.4686
stunt                     0.4621
adolescent                0.4583
young_child               0.4554


In [143]:
similar_word('carbohydrate')

impaired                  0.4358
metabolism                0.4289
insulin                   0.4246
transport                 0.4172
fat                       0.4155
fluid                     0.4027
amino_acid                0.3918
plasma                    0.376
low                       0.3642
secretion                 0.3639


In [144]:
similar_word('diabetes')

hypertension              0.5797
obesity                   0.4783
blood_pressure            0.4581
obese                     0.4236
acute                     0.4213
alteration                0.4107
african                   0.4104
awareness                 0.4053
barrier                   0.4037
susceptibility            0.3777


In [145]:
opposite_word('diabetes')

niche                     0.1827
example                   0.1819
replacement               0.1697
distribution              0.1564
ruminant                  0.1555
clade                     0.1438
tropical                  0.1401
step                      0.128
pressure                  0.1212
36                        0.1199
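
The rankings above rest on cosine similarity between word vectors. The sketch below, using made-up two-dimensional vectors rather than the trained model, shows that an "opposite" query simply ranks words by similarity to the negated vector; it finds the least similar words, not semantic antonyms, which may explain the low scores and unrelated terms returned for 'diabetes'. The names `cosine`, `rank` and `toy_vecs` are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank(word, vectors, sign=1):
    """Rank all other words by cosine similarity to sign * vector(word)."""
    query = [sign * x for x in vectors[word]]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors invented for this example; not taken from the trained model
toy_vecs = {
    'diabetes': [1.0, 0.0],
    'hypertension': [0.8, 0.6],
    'ruminant': [-0.6, 0.8],
    'niche': [-1.0, 0.0],
}
similar = rank('diabetes', toy_vecs, sign=1)
opposite = rank('diabetes', toy_vecs, sign=-1)
print([name for name, _ in similar])
print([name for name, _ in opposite])
```

Negating the query reverses the ranking, so the "opposite" list is the "similar" list read from the bottom.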


Create a function that combines the methods above to do arithmetic on word vectors.

In [146]:
def meaning(plus=None, minus=None, topn=2):
    """Use 'plus' for positive/similar words and 'minus' for
       negative/opposite words in the word2vec vocabulary,
       and print the top 'topn' results."""
    # None defaults avoid sharing one mutable list between calls
    answers = w2v.wv.most_similar(positive=plus or [], negative=minus or [], topn=topn)

    for term, similarity in answers:
        print(term)

In [147]:
meaning(plus=['nutrition'], minus=['food'], topn=5)

breastfeed
organ
key
infection
malaria


In [148]:
meaning(plus=['obese', 'food'], topn=10)#, minus=['nutrition'])

consume
healthy_food
health
seek
elderly
nutrition_education
nutritional
meal
diet
consumption


In [149]:
meaning(plus=['age'], topn=5, minus=['nutrition'])

age_group
young
12
mammal
morphological
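
A sketch of the arithmetic behind such queries, under the assumption that the query vector is the mean of the unit vectors of the 'plus' words minus those of the 'minus' words, and that the input words are excluded from the results. The toy vectors and the helper names `unit`, `query_vector` and `nearest` are invented for illustration and are not part of Gensim.

```python
import math


def unit(v):
    """Scale a vector to length 1 (assumes it is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def query_vector(vectors, plus=(), minus=()):
    """Mean of signed unit vectors: +1 for 'plus' words, -1 for 'minus' words."""
    signed = [(w, 1.0) for w in plus] + [(w, -1.0) for w in minus]
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, weight in signed:
        for i, x in enumerate(unit(vectors[word])):
            total[i] += weight * x
    return unit([x / len(signed) for x in total])


def nearest(vectors, plus=(), minus=(), topn=2):
    """Rank words by cosine similarity to the query, excluding the input words."""
    query = query_vector(vectors, plus, minus)
    exclude = set(plus) | set(minus)
    scored = [(w, sum(a * b for a, b in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: -pair[1])[:topn]


# Toy vectors invented for this example; not taken from the trained model
toy_vecs = {
    'nutrition': [1.0, 1.0],
    'food': [1.0, 0.0],
    'breastfeed': [0.0, 1.0],
    'meal': [1.0, 0.1],
}
top = nearest(toy_vecs, plus=['nutrition'], minus=['food'], topn=2)
print([name for name, _ in top])
```

Subtracting 'food' removes the direction the two words share, so the word pointing along the remaining direction ranks first.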


### Vector Visualisation (t-SNE)

Prepare data.

In [150]:
from sklearn.manifold import TSNE

In [151]:
df_vecs.shape
#df_vecs.drop(nlp.Defaults.stop_words, errors='ignore').head(5000)    # Take out stopwords. no need I think

(1680, 100)

In [152]:
# Initialise TSNE and fit_transform
tsne = TSNE()
tsne_vecs = tsne.fit_transform(df_vecs.values)

In [153]:
# Transform to 2d coordinates for plotting
tsne_vecs = pd.DataFrame(tsne_vecs, index=pd.Index(df_vecs.index), columns=['x', 'y'])
# Add columns for words
tsne_vecs['word'] = tsne_vecs.index

Plot data.

In [154]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [155]:
# Credit: Patrick Harrison: Modern NLP in Python | PyData DC 2016  
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vecs)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(
    title='t-SNE Word Embeddings',
    plot_width=500,
    plot_height=500,
    tools=(
        'pan, wheel_zoom, box_zoom,'
        'box_select, reset'
        ),
    active_scroll='wheel_zoom'
    )

# add a hover tool to display words on roll-over
tsne_plot.add_tools(
    HoverTool(tooltips = '@word')
    )

# draw the words as circles on the plot
tsne_plot.circle(
    'x',
    'y',
    source=plot_data,
    color='blue',
    line_alpha=0.2,
    fill_alpha=0.1,
    size=15,
    hover_line_color='black'
    )

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

## Sub-topic analysis

Let's summarise articles containing certain keywords (sub-keywords).

In [93]:
# Find all sentences with sub keyword
sub_key = 'diabetes'
# plant_base_diet

In [99]:
len(corp)

1530

In [98]:
len(corp_sent)

12598

In [158]:
# Keep only the sentences that contain the sub keyword
sub_corp_sent = [sent for sent in corp_sent if sub_key in sent]

In [160]:
len(sub_corp_sent)

57

In [170]:
sub_corp_sent
docs[0]

'[Pregnancy tends to markedly widen the nutritional gap. The decrease in the nutritional adequacy of observed diets of women of childbearing age cannot be solved by a simple increase in energy intake, as recommended in some countries. A relatively good nutritional adequacy in pregnant women can therefore not be simply attributed to a higher energy intake, but to qualitative changes in the diet. Recommendations of snack additions from public health agencies make a somewhat limited and an extremely variable contribution to tackling the nutritional gap during pregnancy. These results call for dedicated studies to define the most theoretically efficient dietary counselling during pregnancy on a methodologically sound basis, either as generic counselling for the whole population or as tailor-fitted advice at the individual level, to improve the nutritional adequacy of the diet which is jeopardized at this specific, critical stage of the life cycle.]'

In [255]:
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

In [180]:
stopwords = list(STOP_WORDS)

In [181]:
docx = nlp(docs[0])

In [182]:
docx

[Pregnancy tends to markedly widen the nutritional gap. The decrease in the nutritional adequacy of observed diets of women of childbearing age cannot be solved by a simple increase in energy intake, as recommended in some countries. A relatively good nutritional adequacy in pregnant women can therefore not be simply attributed to a higher energy intake, but to qualitative changes in the diet. Recommendations of snack additions from public health agencies make a somewhat limited and an extremely variable contribution to tackling the nutritional gap during pregnancy. These results call for dedicated studies to define the most theoretically efficient dietary counselling during pregnancy on a methodologically sound basis, either as generic counselling for the whole population or as tailor-fitted advice at the individual level, to improve the nutritional adequacy of the diet which is jeopardized at this specific, critical stage of the life cycle.]

In [211]:
# Frequency table
word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_frequencies.keys():
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1

In [212]:
#word_frequencies

In [213]:
# Maximum word frequency
maximum_frequency = max(word_frequencies.values())
# Normalise word frequencies
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

In [217]:
#word_frequencies

In [221]:
# Sentence Tokens
sentence_list = [sentence for sentence in docx.sents]

In [254]:
# Sentence score: sum the normalised frequency of each word in the sentence
sentence_scores = {}
for sent in sentence_list:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            #if len(sent.text.split(' ')) < 100:
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
    # Normalise according to sentence length
    sentence_scores[sent] = sentence_scores[sent] / len(sent)

In [258]:
sentence_scores

{[Pregnancy tends to markedly widen the nutritional gap.: 0.36,
 The decrease in the nutritional adequacy of observed diets of women of childbearing age cannot be solved by a simple increase in energy intake, as recommended in some countries.: 0.21250000000000005,
 A relatively good nutritional adequacy in pregnant women can therefore not be simply attributed to a higher energy intake, but to qualitative changes in the diet.: 0.2428571428571429,
 Recommendations of snack additions from public health agencies make a somewhat limited and an extremely variable contribution to tackling the nutritional gap during pregnancy.: 0.2,
 These results call for dedicated studies to define the most theoretically efficient dietary counselling during pregnancy on a methodologically sound basis, either as generic counselling for the whole population or as tailor-fitted advice at the individual level, to improve the nutritional adequacy of the diet which is jeopardized at this specific, critical stage o

In [256]:
summarized_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)

In [257]:
summarized_sentences

[[Pregnancy tends to markedly widen the nutritional gap.,
 A relatively good nutritional adequacy in pregnant women can therefore not be simply attributed to a higher energy intake, but to qualitative changes in the diet.,
 The decrease in the nutritional adequacy of observed diets of women of childbearing age cannot be solved by a simple increase in energy intake, as recommended in some countries.,
 Recommendations of snack additions from public health agencies make a somewhat limited and an extremely variable contribution to tackling the nutritional gap during pregnancy.,
 These results call for dedicated studies to define the most theoretically efficient dietary counselling during pregnancy on a methodologically sound basis, either as generic counselling for the whole population or as tailor-fitted advice at the individual level, to improve the nutritional adequacy of the diet which is jeopardized at this specific, critical stage of the life cycle.]]
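
The scoring steps above can be gathered into one function. The sketch below is a self-contained variant under stated assumptions: it uses a regular expression instead of spaCy for sentence and word splitting, a tiny illustrative stop-word set, and lower-cases words both when counting and when scoring. The name `summarise` and the toy text are hypothetical, and the scores it produces will not match the notebook's output.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list; the notebook uses spaCy's STOP_WORDS
STOP = {'the', 'a', 'an', 'of', 'in', 'to', 'is', 'was', 'and', 'for', 'by'}


def words(sentence):
    """Lower-cased alphabetic tokens of a sentence."""
    return re.findall(r'[a-z]+', sentence.lower())


def summarise(text, n=2, stop=STOP):
    """Return the n sentences with the highest mean normalised word frequency."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(w for s in sentences for w in words(s) if w not in stop)
    if not freq:
        return []
    top = max(freq.values())
    scores = {}
    for sentence in sentences:
        tokens = words(sentence)
        if not tokens:
            continue  # nothing to score in this sentence
        total = sum(freq[w] / top for w in tokens if w not in stop)
        # Normalise by sentence length so long sentences are not favoured
        scores[sentence] = total / len(tokens)
    return nlargest(n, scores, key=scores.get)


toy_text = ('Diet affects health. Diet quality affects child health. '
            'The weather was nice.')
print(summarise(toy_text, n=2))
```

Sentences built from the most frequent content words score highest, so the off-topic third sentence is left out of the summary.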