# Exploratory Analysis
***

## Background

Determine the degree of consensus in contentious academic fields. 

Collect the title, publication date and summary of scholarly articles that contain a given keyword or keywords. Apply NLP models to this data to identify and categorise concepts in the field and to determine the statistical significance of differences between opposing 'truths', if any. Ranking these groups by weighted influence should indicate the degree of consensus among the various approaches in a given academic field.

To this end, the academic_consensus model has already collected the abstracts of academic papers that contain the keyword "nutrition" and saved them to corpus_raw.csv.  
  
Overview of this notebook:
- Set up notebook environment and load data (corpus_raw.csv)
- Review articles published per year (sklearn's CountVectorizer)
- Create Bag Of Words (BOW) of all articles for Titles and Conclusions (nltk)
- Create interactive BOW per year with Bokeh

### Credit

Credit is due in large part to Patrick Harrison: Modern NLP in Python | PyData DC 2016  
  
Thank you!

## Setup

### Packages and setup

In [2]:
# Common
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Workspace
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML

In [3]:
# Set workspace
sns.set()
# Set output characters to 110 (not 79)
pd.options.display.width = 110
# Display the output of the last statement in a cell only.
# Set to 'all' to display the output of every statement.
InteractiveShell.ast_node_interactivity = 'last'
# Make notebook wider to fit pyLDAvis plot
display(HTML("<style>.container { width:50% !important; }</style>"))

### Load and inspect corpus_raw.csv

In [4]:
# Load corpus_raw.csv as DataFrame with parsed date format
corpus = pd.read_csv('../data/interim/corpus_raw.csv', parse_dates=[0])

In [5]:
# Keyword 
keyword = 'nutrition'

In [6]:
# Inspect
corpus.info()
corpus.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 794 entries, 0 to 793
Data columns (total 3 columns):
publication_date    794 non-null datetime64[ns, UTC]
title               794 non-null object
conclusions         794 non-null object
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 18.7+ KB


| | publication_date | title | conclusions |
|---|---|---|---|
| 0 | 2014-10-21 00:00:00+00:00 | Psychological Determinants of Consumer Accepta... | ['To the authors’ knowledge, this is the first... |
| 1 | 2015-03-13 00:00:00+00:00 | Uncovering the Nutritional Landscape of Food | ['In this study, we have developed a unique co... |
| 2 | 2017-06-27 00:00:00+00:00 | Developing and validating a scale to measure F... | ['Food and nutrition literacy scale is a valid... |
| 3 | 2017-05-18 00:00:00+00:00 | Quality of nutrition services in primary healt... | ['The aim of the NNS, integrating nutrition se... |
| 4 | 2015-10-21 00:00:00+00:00 | To See or Not to See: Do Front of Pack Nutriti... | ['Our work strongly supports the idea that FOP... |
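
The overview lists a review of articles published per year. A minimal sketch of that count follows, using only the standard library and a few timestamps in the same format as `publication_date`. The helper name `articles_per_year` and the short sample list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter
from datetime import datetime


def articles_per_year(timestamps):
    """Count articles per publication year from ISO-8601 timestamp strings."""
    years = (datetime.fromisoformat(ts).year for ts in timestamps)
    # Sort by year so the result reads chronologically
    return dict(sorted(Counter(years).items()))


# Small sample in the same format as the publication_date column
sample = [
    '2014-10-21 00:00:00+00:00',
    '2015-03-13 00:00:00+00:00',
    '2015-10-21 00:00:00+00:00',
    '2017-06-27 00:00:00+00:00',
]
print(articles_per_year(sample))
```

On the DataFrame itself, `corpus['publication_date'].dt.year.value_counts().sort_index()` should give the equivalent counts.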


## Phrase Modeling 

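Use spaCy to tokenise and clean each sentence in every document: drop stop words and punctuation and lemmatise each remaining word. Whitespace tokens and numbers are not filtered by the code below; spaCy's `token.is_space` and `token.like_num` flags could be added to the filter for that.  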
  
Do this for every sentence in each document and combine all documents in the corpus.

In [7]:
# Necessary imports for Phrase Modeling
import spacy
from gensim.models.phrases import Phrases, Phraser
#from gensim.models.word2vec import LineSentence     - to be used when scaling

In [8]:
# Set Spacy
nlp = spacy.load('en_core_web_sm')

In [9]:
# Gather all articles' 'Conclusions' into docs
docs_corp = corpus['conclusions']

# Clean unnecessary escapes
docs = []
for doc in docs_corp:
    text = doc.replace('\'', '')
    docs.append(text.replace('\\n' ,''))
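
The preview of `conclusions` suggests each cell is a stringified Python list of paragraphs. Stripping every quote character, as above, also removes apostrophes inside words. A possible alternative, sketched under the assumption that the cells are valid list literals (the truncated preview does not confirm this), is to parse them with `ast.literal_eval`. The helper name `parse_conclusions` and the sample string are hypothetical.

```python
import ast


def parse_conclusions(raw):
    """Turn a stringified list of paragraphs into one plain-text string.

    Falls back to the raw string if it is not a valid Python literal.
    """
    try:
        parts = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(parts, (list, tuple)):
        # Replace real newlines and trim each paragraph before joining
        return ' '.join(str(p).replace('\n', ' ').strip() for p in parts)
    return str(parts)


# Hypothetical cell value, shaped like the preview of the 'conclusions' column
raw = "['First finding.\\n', 'Second finding, from the authors\\' study.']"
print(parse_conclusions(raw))
```

The fallback keeps the pipeline running on malformed cells, at the cost of leaving those cells uncleaned.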

In [10]:
# Use Spacy - pipe to parse each document, break it into sentences, 
# clean it (lemmatise, punctuation and white space) and combine
# all cleaned sentences into sents_clean, and 
# all cleaned sentences per document into docs_clean

docs_clean = []
sents_clean = []
for parsed_doc in nlp.pipe(docs, n_threads=4):   
    # For each document 
    sents_clean_per_doc = []
    for sent in parsed_doc.sents:
        sent_token          = [token.lemma_ for token in sent if not (token.is_stop or token.is_punct)]
        sents_clean_per_doc += sent_token
        # Adds to corpus of sentences
        sents_clean.append(sent_token)
    # Adds to corpus of documents
    docs_clean.append(sents_clean_per_doc)
        
# Confirming output
print('Number of documents (articles): {}'.format(len(docs_clean)))
print('Total number of sentences: {}'.format(len(sents_clean)))
print('Total words: {}'.format(sum([len(sent) for sent in sents_clean])))

Number of documents (articles): 794
Total number of sentences: 6448
Total words: 98386


Create a function that generates a (n+1)-gram model and outputs the new sentences.

In [11]:
def n_plus_one_gram(sents):
    ''' Generate n+1_gram model and outputs new sentences 
    Returns n+1_gram model and n+1_gram sentences'''
    # Create n+1 gram model with Gensim for all sentences
    g = Phrases(sents)
    g = Phraser(g)    
    # Generate n+1 gram sentences
    g_sents = [sent for sent in g[sents]]

    # Confirming output
    print('Total number of sentences: {}'.format(len(g_sents)))
    print('Total words: {}'.format(sum([len(sent) for sent in g_sents])) + '\n')
    return g, g_sents
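
For intuition on what `Phrases` decides, here is a plain-Python sketch of its default scoring rule: an adjacent pair is joined when its score exceeds a threshold. It assumes gensim's default `min_count=5` and `threshold=10.0`, which the call above does not override. The counts are made up, and `phrase_score` and `is_phrase` are illustrative names, not gensim functions.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Default-style collocation score for the adjacent word pair (a, b)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


def is_phrase(count_a, count_b, count_ab, vocab_size,
              min_count=5, threshold=10.0):
    """True if the pair scores above the threshold and would be joined."""
    score = phrase_score(count_a, count_b, count_ab, vocab_size, min_count)
    return score > threshold


# Made-up counts: a rare word that nearly always precedes a common one
print(is_phrase(count_a=40, count_b=900, count_ab=35, vocab_size=20000))
# Made-up counts: two common words that rarely sit next to each other
print(is_phrase(count_a=800, count_b=1200, count_ab=12, vocab_size=20000))
```

Pairs of frequent words need many co-occurrences to pass, which is why generic terms tend to stay unjoined.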

Run phrase modeling over all the sentences three times, feeding each pass the output of the previous one, to get a four-gram model.

In [12]:
# Bigram
print('Bigram')
bg, bg_sents = n_plus_one_gram(sents_clean)
# Trigram
print('Trigram')
tg, tg_sents = n_plus_one_gram(bg_sents)
# Four-gram
print('Four-gram')
qg, qg_sents = n_plus_one_gram(tg_sents)

Bigram
Total number of sentences: 6448
Total words: 93268

Trigram
Total number of sentences: 6448
Total words: 92523

Four-gram
Total number of sentences: 6448
Total words: 92407
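
The word totals fall with each pass because every join replaces two tokens with one. A simplified stand-in for applying a trained phrase model is sketched below. The pairs are hand-picked for illustration rather than learned, and `merge_pairs` is a hypothetical helper, not part of gensim.

```python
def merge_pairs(tokens, pairs, delimiter='_'):
    """Join adjacent tokens whose (left, right) pair is in `pairs`,
    scanning left to right and skipping past each joined pair."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sent = ['potential', 'health', 'benefit', 'associate', 'personalised', 'nutrition']
# First pass: join two word pairs into bigrams
first_pass = merge_pairs(sent, {('health', 'benefit'), ('personalised', 'nutrition')})
# Second pass: a bigram from the first pass can join with its neighbour
second_pass = merge_pairs(first_pass, {('potential', 'health_benefit')})
print(len(sent), len(first_pass), len(second_pass))
print(second_pass)
```

The sentence count stays constant across passes, as in the output above; only the token count shrinks.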



Now run the complete corpus of documents through the three phrase models.

In [13]:
# Apply the bigram model to the cleaned documents
bg_docs = bg[docs_clean]
# Apply the trigram model to the bigram output
tg_docs = tg[bg_docs]
# Apply the four-gram model to the trigram output
qg_docs = qg[tg_docs]

In [14]:
ph_docs = [doc for doc in qg_docs]
print('Total number of documents: {}'.format(len(ph_docs)))
print('Total words: {}'.format(sum([len(doc) for doc in ph_docs])))

Total number of documents: 794
Total words: 92404


In [15]:
# Take out final stop words for each document
corp = []
for doc in ph_docs:
    corp.append([word for word in doc if word not in nlp.Defaults.stop_words])

# Confirm output
print('Total number of documents: {}'.format(len(corp)))
print('Total words: {}'.format(sum([len(doc) for doc in corp])))

Total number of documents: 794
Total words: 91760


In [16]:
# Take out final stop words for each sentence
corp_sent = []
for sent in qg_sents:
    corp_sent.append([word for word in sent if word not in nlp.Defaults.stop_words])

# Confirm output
print('Total number of sentences: {}'.format(len(corp_sent)))
print('Total words: {}'.format(sum([len(sent) for sent in corp_sent])))

Total number of sentences: 6448
Total words: 91762


In [17]:
print('First document in corpus after applying Phrase Modeling: \n\n', u' '.join(corp[0]))
print('\n')
print('First sentence in corpus after applying Phrase Modeling: \n\n', u' '.join(corp_sent[0]))

First document in corpus after applying Phrase Modeling: 

 author knowledge study model factor determine intention personalised_nutrition representative sample european consumer important strength study element model inform qualitative research similar population datum imply attitude adoption personalised_nutrition primarily drive perception benefit adoption personalised_nutrition achievable trust regulatory system particular relate datum protection extent individual commit improve perceive action influence health_status attitude personalised_nutrition imply promotion personalised_nutrition general public need emphasise personal benefit personalised_nutrition discussion risk focus end user concern particular relate datum protection service_delivery communication address perceive Efficacy provide information personalised_nutrition adopt consumer provide information potential health_benefit associate personalised_nutrition influence adoption individual low level Health Locus Control



## Topic Modeling with LDA

### Topic Modeling

In [18]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

In [19]:
# Create dictionary and tidy
corp_dict = Dictionary(corp)
corp_dict.filter_extremes(no_below=10, no_above=0.6)
corp_dict.compactify()

In [20]:
# Create BOW. 
# Note: docs and sents are the same now for all practical purposes
corp_bow = [corp_dict.doc2bow(doc) for doc in corp]
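
To make the effect of `no_below` and `no_above` concrete, here is a plain-Python sketch of the same idea: keep tokens that occur in at least `no_below` documents and in no more than a `no_above` fraction of all documents, then count the surviving tokens per document. It is a simplification under stated assumptions: gensim's `doc2bow` returns `(token_id, count)` pairs rather than token keys, and `filter_extremes` also caps the vocabulary size, which is not modelled here. The names `filter_vocab` and `to_bow` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_vocab(docs, no_below=10, no_above=0.6):
    """Keep tokens found in at least `no_below` documents and in
    no more than a `no_above` fraction of all documents."""
    # Document frequency: count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


def to_bow(doc, vocab):
    """Count the in-vocabulary tokens of one document."""
    return Counter(tok for tok in doc if tok in vocab)


# Toy corpus: 'study' is in every document, 'growth' and 'protein' in one each
toy_docs = [
    ['study', 'child', 'growth'],
    ['study', 'diet', 'child', 'child'],
    ['study', 'protein'],
    ['study', 'diet'],
    ['study'],
]
toy_vocab = filter_vocab(toy_docs, no_below=2, no_above=0.6)
print(sorted(toy_vocab))
print(to_bow(toy_docs[1], toy_vocab))
```

The filter removes both ends of the frequency range: ubiquitous terms that carry little topical signal and rare terms with too little evidence.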

In [21]:
# LDA model
lda = LdaMulticore(corp_bow, num_topics=5, id2word=corp_dict, workers=3)

In [22]:
# Function to show a topic
def show_topic(lda, num=0, topn=10):
    '''Print the top `topn` words and their probabilities for topic `num` of the given LDA model.'''
    for word, prob in lda.show_topic(num, topn=topn):
        print(f'{word:20} {prob:.3f}')

In [23]:
show_topic(lda, num=0, topn=20)

study                0.013
change               0.007
child                0.007
result               0.007
effect               0.007
increase             0.006
high                 0.006
nutritional          0.005
health               0.005
model                0.005
low                  0.005
intervention         0.005
associate            0.005
food                 0.004
consumption          0.004
different            0.004
compare              0.004
community            0.004
growth               0.004
specific             0.004


### Visualisation with pyLDAvis

In [24]:
import pyLDAvis
import pyLDAvis.gensim
import warnings

In [25]:
LDAvis_prep = pyLDAvis.gensim.prepare(lda, corp_bow, corp_dict)



In [27]:
#%%time
#pyLDAvis.display(LDAvis_prep)

## Word Vector Model

### Vector Modeling  (word2vec)

Train a skip-gram (`sg=1`) Word2Vec model with Gensim on the cleaned, phrase-modelled sentences (`corp_sent`).

In [28]:
from gensim.models import Word2Vec

In [29]:
w2v = Word2Vec(corp_sent, size=100, window=5, min_count=20, sg=1, workers=3, iter=100)
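
As a rough picture of what `sg=1` and `window` mean, the sketch below lists the (centre, context) pairs a skip-gram model would learn from in one sentence. It is a simplification: gensim samples a smaller effective window at random for each position and drops words below `min_count` before pairing, neither of which is modelled here. `skipgram_pairs` is an illustrative helper, not a gensim function.

```python
def skipgram_pairs(sentence, window=5):
    """List (centre, context) pairs within `window` tokens on either side."""
    pairs = []
    for i, centre in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((centre, sentence[j]))
    return pairs


sent = ['trust', 'regulatory', 'system', 'datum', 'protection']
pairs = skipgram_pairs(sent, window=1)
print(len(pairs))
print(pairs[:3])
```

A wider window pairs each word with more distant neighbours, which tends to capture topical rather than syntactic similarity.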

In [30]:
w2v.init_sims()
print(f'{w2v.epochs} training epochs so far.')

100 training epochs so far.


In [31]:
print('Words in Word2Vec dictionary: {}'.format(len(w2v.wv.vocab)))

Words in Word2Vec dictionary: 910


First build a DataFrame of the 100-dimensional, unit-normalised word vectors from the Word2Vec model trained above, sorted by word count.

In [32]:
# Get word, index and word count
vecs = [(word, info.index, info.count) for word, info in w2v.wv.vocab.items()]
# Sort it according to word count (3rd column) - descending order
vecs = sorted(vecs, key=lambda val: -val[2])
# Get individual 'columns' values
words, idx, count = zip(*vecs)
# Now use the indices to get the vectors
df_vecs = pd.DataFrame(w2v.wv.vectors_norm[idx,:], index=words)
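
`vectors_norm` holds the word vectors scaled to unit length. The sketch below shows why that is convenient: for unit vectors, the dot product equals the cosine similarity. It uses made-up 2-d vectors and plain Python; the helper names are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors for illustration
a = [3.0, 4.0]
b = [4.0, 3.0]
print(cosine(a, b))
print(dot(unit(a), unit(b)))
```

Both printed values agree up to floating-point rounding, so similarity queries on normalised vectors reduce to dot products.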

In [33]:
df_vecs.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| study | 0.163705 | 0.021156 | 0.101648 | -0.136232 | -0.009814 | -0.15814 | 0.076082 | -0.048634 | -0.013547 | -0.029488 | ... | -0.095729 | 0.016112 | 0.118536 | -0.060578 | 0.049431 | 0.078662 | -0.081215 | 0.026343 | 0.000994 | -0.037305 |
| food | 0.068261 | -0.098494 | 0.228036 | -0.060514 | 0.052625 | -0.056582 | 0.112749 | -0.063511 | -0.083908 | -0.07381 | ... | -0.095357 | 0.133349 | 0.105197 | 0.080472 | -0.084886 | -0.06918 | -0.182211 | 0.001532 | 0.085801 | -0.003614 |
| increase | 0.124175 | -0.076242 | -0.06712 | -0.200467 | 0.044417 | 0.0256 | -0.009536 | -0.203688 | -0.106361 | 0.15543 | ... | -0.004481 | 0.116649 | 0.160058 | -0.068762 | -0.048025 | -0.042022 | 0.042618 | 0.040412 | 0.041577 | -0.090093 |
| high | 0.098998 | -0.104873 | -0.099589 | -0.026485 | -0.002834 | -0.09898 | 0.099358 | -0.083949 | -0.181567 | -0.052678 | ... | 0.095712 | 0.077151 | 0.073151 | -0.059743 | -0.011824 | 0.002956 | 0.08377 | -0.025857 | 0.154293 | -0.060923 |
| nutrition | 0.113328 | -0.036537 | 0.04871 | -0.097475 | 0.066991 | -0.119434 | -0.004724 | -0.123263 | -0.034086 | 0.033742 | ... | -0.188973 | -0.009063 | 0.196974 | -0.066029 | -0.040653 | 0.002761 | 0.025761 | -0.068838 | -0.082555 | -0.11241 |


Create functions to find similar and opposite words in the vocabulary.

In [34]:
def similar_word(word, topn=10):
    """Print the `topn` words most similar to `word` in the Word2Vec vocabulary."""
    for term, simil in w2v.wv.most_similar(positive=[word], topn=topn):
        print(f'{term:25} {round(simil, 4)}')

In [35]:
def opposite_word(word, topn=10):
    """Print the `topn` words closest to the negated vector of `word`,
    i.e. the least similar words, which are not necessarily antonyms."""
    for term, oppos in w2v.wv.most_similar(negative=[word], topn=topn):
        print(f'{term:25} {round(oppos, 4)}')

In [36]:
similar_word('child')

stunting                  0.5089
undernutrition            0.4678
education                 0.4558
mother                    0.4455
malnutrition              0.4409
adult                     0.4297
socioeconomic             0.4244
school                    0.4221
old                       0.4213
stunt                     0.4176


In [37]:
similar_word('carbohydrate')

fat                       0.4896
fatty_acid                0.479
sugar                     0.3943
synthesis                 0.3873
amino_acid                0.3853
diet                      0.3824
oral                      0.3686
protein                   0.3681
acid                      0.3651
ad                        0.3438


In [38]:
opposite_word('protein')

India                     0.1887
establish                 0.1451
aim                       0.1372
way                       0.1271
assess                    0.1239
design                    0.1137
market                    0.1131
public                    0.1123
exist                     0.1118
scale                     0.1067


Create a function using above methods to manipulate word vectors.

In [39]:
def meaning(plus=None, minus=None, topn=2):
    """Add the vectors of the words in 'plus', subtract those in 'minus',
       and print the 'topn' words in the word2vec vocabulary closest
       to the result."""
    # Use None defaults to avoid sharing mutable default lists between calls
    answers = w2v.wv.most_similar(positive=plus or [], negative=minus or [], topn=topn)

    for term, similarity in answers:
        print(term)

In [40]:
meaning(plus=['nutrition'], minus=['food'], topn=5)

death
crucial
breastfeed
optimal
chronic
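
The arithmetic behind such a query can be sketched in plain Python: add the unit vectors of the `plus` words, subtract those of the `minus` words, and rank the remaining words by cosine similarity to the result. The 2-d vectors below are made up for illustration and say nothing about the trained model. The local `most_similar` is a simplified stand-in, not gensim's implementation.

```python
import math


def unit(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(vectors, plus=(), minus=(), topn=2):
    """Rank words by cosine similarity to the combined query vector."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    weighted = [(w, 1.0) for w in plus] + [(w, -1.0) for w in minus]
    for word, weight in weighted:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += weight * x
    query = unit(query)
    # Score every word except the query words themselves
    scores = {
        word: sum(q * x for q, x in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in plus and word not in minus
    }
    return sorted(scores.items(), key=lambda item: -item[1])[:topn]


# Made-up 2-d vectors for illustration only
toy = {
    'nutrition': [1.0, 1.0],
    'food': [1.0, 0.0],
    'health': [0.0, 1.0],
    'market': [1.0, -1.0],
}
print(most_similar(toy, plus=['nutrition'], minus=['food'], topn=2))
print(most_similar(toy, minus=['food'], topn=1))
```

With only `minus` words, the query is the negated vector, so the top result is the least similar word rather than a semantic opposite. That is consistent with the low scores in the `opposite_word('protein')` output above.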


### Vector Visualisation (t-SNE)

Prepare data.

In [48]:
from sklearn.manifold import TSNE

In [49]:
df_vecs.shape
#df_vecs.drop(nlp.Defaults.stop_words, errors='ignore').head(5000)    # Take out stopwords. no need I think

(910, 100)

In [64]:
# Initialise TSNE and fit_transform
tsne = TSNE()
tsne_vecs = tsne.fit_transform(df_vecs.values)

In [65]:
# Wrap the 2-D coordinates in a DataFrame for plotting
tsne_vecs = pd.DataFrame(tsne_vecs, index=pd.Index(df_vecs.index), columns=['x', 'y'])
# Add a column for the words
tsne_vecs['word'] = tsne_vecs.index

Plot data.

In [66]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [71]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vecs)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(
    title='t-SNE Word Embeddings',
    plot_width=500,
    plot_height=500,
    tools=(
        'pan, wheel_zoom, box_zoom,'
        'box_select, reset'
        ),
    active_scroll='wheel_zoom'
    )

# add a hover tool to display words on roll-over
tsne_plot.add_tools(
    HoverTool(tooltips = '@word')
    )

# draw the words as circles on the plot
tsne_plot.circle(
    'x',
    'y',
    source=plot_data,
    color='blue',
    line_alpha=0.2,
    fill_alpha=0.1,
    size=15,
    hover_line_color='black'
    )

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);