# Exploratory Analysis
***

## Background

The aim is to determine the degree of consensus in contentious academic fields.

Collect the title, publication date and summary of scholarly articles that contain a given keyword or keywords. Apply NLP models to this data to identify and categorise concepts in the field and to test whether differences between opposing 'truths', if any, are statistically significant. Ranking these groups by weighted influence should then indicate the degree of consensus among the various approaches in a given academic field.

To this end, the academic_consensus model has already searched the abstracts of academic papers for the keyword "nutrition" and saved the matching articles to corpus_raw.csv.  
  
Overview of this notebook:
- Set up notebook environment and load data (corpus_raw.csv)
- Review articles published per year (sklearn's CountVectorizer)
- Create Bag Of Words (BOW) of all articles for Titles and Conclusions (nltk)
- Create interactive BOW per year with Bokeh
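
As a rough illustration of the "articles published per year" step, the sketch below counts rows per publication year with plain pandas rather than CountVectorizer. The `toy` DataFrame is a hypothetical stand-in for corpus_raw.csv; only the `publication_date` column name is taken from the real data.

```python
import pandas as pd

# Hypothetical stand-in for corpus_raw.csv: only the date column matters here.
toy = pd.DataFrame({
    'publication_date': pd.to_datetime(
        ['2014-10-21', '2015-03-13', '2015-10-21', '2017-05-18', '2017-06-27'],
        utc=True,
    ),
})

# Count articles per publication year, sorted chronologically.
articles_per_year = toy['publication_date'].dt.year.value_counts().sort_index()
print(articles_per_year)
```

On the real corpus the same expression should work on `corpus['publication_date']`, since that column is parsed as a datetime on load.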

## Setup

### Packages and setup

In [1]:
# Common
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Spacy
import spacy

# Gensim
from gensim.models import Phrases
#from gensim.models.word2vec import LineSentence     - to be used when scaling

# Workspace
from IPython.core.interactiveshell import InteractiveShell

In [2]:
# Set workspace
sns.set()
# Widen pandas display output to 110 characters
pd.options.display.width = 110
# Display only the last expression of each cell (the default).
# Set to 'all' to display the result of every expression instead.
InteractiveShell.ast_node_interactivity = 'last'

### Load and inspect corpus_raw.csv

In [3]:
# Load corpus_raw.csv as DataFrame, parsing the first column as dates
corpus = pd.read_csv('../data/interim/corpus_raw.csv', parse_dates=[0])

In [4]:
# Keyword 
keyword = 'nutrition'

In [5]:
# Inspect
corpus.info()
corpus.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 794 entries, 0 to 793
Data columns (total 3 columns):
publication_date    794 non-null datetime64[ns, UTC]
title               794 non-null object
conclusions         794 non-null object
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 18.7+ KB


Unnamed: 0,publication_date,title,conclusions
0,2014-10-21 00:00:00+00:00,Psychological Determinants of Consumer Accepta...,"['To the authors’ knowledge, this is the first..."
1,2015-03-13 00:00:00+00:00,Uncovering the Nutritional Landscape of Food,"['In this study, we have developed a unique co..."
2,2017-06-27 00:00:00+00:00,Developing and validating a scale to measure F...,['Food and nutrition literacy scale is a valid...
3,2017-05-18 00:00:00+00:00,Quality of nutrition services in primary healt...,"['The aim of the NNS, integrating nutrition se..."
4,2015-10-21 00:00:00+00:00,To See or Not to See: Do Front of Pack Nutriti...,['Our work strongly supports the idea that FOP...


## Phrase Modeling 

Use spaCy to tokenise and clean each sentence in every document: remove stop words, punctuation, unnecessary white space and numbers, and lemmatise each word.  
  
Do this for every sentence in each document and combine all documents in the corpus.
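
The cleaning rule described above can be sketched as a single filter over token attributes. This is a minimal sketch, not the notebook's own cell: `clean_sentence` and `fake_token` are hypothetical helpers, and the stand-in tokens only mimic the spaCy `Token` attributes `lemma_`, `is_stop`, `is_punct`, `is_space` and `like_num` so the example runs without a language model.

```python
from types import SimpleNamespace


def clean_sentence(tokens):
    """Return the lemmas of tokens that pass all four filters."""
    return [
        tok.lemma_
        for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space or tok.like_num)
    ]


def fake_token(lemma, stop=False, punct=False, space=False, num=False):
    # Stand-in for a spaCy Token (assumption: only these attributes are needed).
    return SimpleNamespace(lemma_=lemma, is_stop=stop, is_punct=punct,
                           is_space=space, like_num=num)


sentence = [
    fake_token('in', stop=True),
    fake_token('this', stop=True),
    fake_token('study'),
    fake_token('include'),
    fake_token('794', num=True),
    fake_token('article'),
    fake_token('.', punct=True),
]
print(clean_sentence(sentence))  # ['study', 'include', 'article']
```

With real spaCy output, the same function could be applied to each `sent` in `parsed_doc.sents`.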

In [18]:
# Set Spacy
nlp = spacy.load('en_core_web_sm')

In [43]:
# Gather all articles' 'Conclusions' into docs
docs = corpus['conclusions']

# Clean unnecessary escapes: drop apostrophes and literal '\n' sequences
docs_clean = [text.replace('\'', '').replace('\\n', '') for text in docs]
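
An alternative to stripping quote characters, sketched here under the assumption that each `conclusions` value is the string form of a Python list (as the `head()` output suggests), is to parse it back into a list with `ast.literal_eval`. The helper name `parse_conclusions` and the sample string are hypothetical.

```python
import ast


def parse_conclusions(raw):
    """Turn a stringified list of paragraphs into one plain-text string."""
    try:
        parts = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (e.g. truncated text): keep the raw string.
        return raw
    if isinstance(parts, str):
        parts = [parts]
    if not isinstance(parts, (list, tuple)):
        return raw
    # Replace real newlines with spaces and join the paragraphs.
    return ' '.join(str(part).replace('\n', ' ').strip() for part in parts)


# Hypothetical sample in the same shape as the 'conclusions' column.
raw = '''['In this study, we developed a framework.\\n', "It doesn't cover cooked food."]'''
print(parse_conclusions(raw))
```

Unlike removing every apostrophe, this keeps contractions such as "doesn't" intact and leaves no stray brackets for the tokeniser to pick up.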

In [51]:
# Use spaCy's pipe to parse each document, break it into sentences,
# clean each sentence (lemmatise, drop stop words and punctuation) and
# collect all cleaned sentences in cleaned_sents.
# The n_threads argument has been deprecated since spaCy 2.1, so it is omitted.
cleaned_sents = []
for parsed_doc in nlp.pipe(docs_clean):
    for sent in parsed_doc.sents:
        cleaned_sent = [token.lemma_ for token in sent if not (token.is_stop or token.is_punct)]
        cleaned_sents.append(cleaned_sent)
        
# Confirming output
print('Number of documents (articles): {}'.format(len(docs_clean)))
print('Total number of sentences: {}'.format(len(cleaned_sents)))
print('Total words: {}'.format(sum([len(sent) for sent in cleaned_sents])))

Number of documents (articles): 794
Total number of sentences: 6448
Total words: 98386


In [52]:
# Create bigram model with Gensim
bigram_model = Phrases(cleaned_sents)
# Generate bigram sentences to bigram_sents
bigram_sents = list(bigram_model[cleaned_sents])
# Confirming output
print('Total number of sentences: {}'.format(len(bigram_sents)))
print('Total words: {}'.format(sum([len(sent) for sent in bigram_sents])))

Total number of sentences: 6448
Total words: 93268


In [53]:
# Create trigram model with Gensim
trigram_model = Phrases(bigram_sents)
# Generate trigram sentences to trigram_sents
trigram_sents = list(trigram_model[bigram_sents])
# Confirming output
print('Total number of sentences: {}'.format(len(trigram_sents)))
print('Total words: {}'.format(sum([len(sent) for sent in trigram_sents])))

Total number of sentences: 6448
Total words: 92523


In [54]:
# Create quadgram model with Gensim
qgram_model = Phrases(trigram_sents)
# Generate quadgram sentences to qgram_sents
qgram_sents = list(qgram_model[trigram_sents])
# Confirming output
print('Total number of sentences: {}'.format(len(qgram_sents)))
print('Total words: {}'.format(sum([len(sent) for sent in qgram_sents])))

Total number of sentences: 6448
Total words: 92407
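
Each phrase merge joins two adjacent tokens into one, so the drop in the word total between passes equals the number of merges that pass applied. A small sketch of that arithmetic, using the totals printed by the cells above (the dictionary keys are just labels chosen for this example):

```python
# Token totals printed by the cells above, in pass order.
token_totals = {
    'unigram': 98386,
    'bigram': 93268,
    'trigram': 92523,
    'quadgram': 92407,
}

# Merges applied by each pass = tokens before the pass - tokens after it.
names = list(token_totals)
merges = {
    later: token_totals[earlier] - token_totals[later]
    for earlier, later in zip(names, names[1:])
}
print(merges)  # {'bigram': 5118, 'trigram': 745, 'quadgram': 116}
```

The sharply falling counts suggest that most useful phrases are captured by the first pass, with little gained from the quadgram pass on this corpus.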


In [70]:
qgram_sents[0]

['author',
 'knowledge',
 'study',
 'model',
 'factor',
 'determine',
 'intention',
 'personalised_nutrition',
 'representative',
 'sample',
 'european',
 'consumer']
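
To see which phrases the models actually merged, one option is to count the tokens that contain the underscore joiner, as in 'personalised_nutrition' above. This is a minimal sketch: `merged_phrases` is a hypothetical helper, `toy_sents` is made-up data, and it assumes that no original lemma already contains an underscore.

```python
from collections import Counter


def merged_phrases(sents, delimiter='_'):
    """Count tokens that contain the phrase delimiter across all sentences."""
    return Counter(tok for sent in sents for tok in sent if delimiter in tok)


# Toy sentences; 'public_health' is an invented example phrase.
toy_sents = [
    ['intention', 'personalised_nutrition', 'european', 'consumer'],
    ['personalised_nutrition', 'public_health', 'policy'],
]
print(merged_phrases(toy_sents).most_common(2))
# [('personalised_nutrition', 2), ('public_health', 1)]
```

Calling `merged_phrases(qgram_sents).most_common(20)` on the real output would list the most frequent multi-word concepts in the corpus.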

## Topic Modeling with LDA

### Topic Modeling

### Visualisation with pyLDAvis

Scratch example of gensim.models.phrases.Phrases on toy sentences:

In [103]:
documents = [
    "the mayor of new york was there",
    "machine learning can be useful sometimes",
    "new york mayor was present",
]

sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2)
sent = ['the', 'mayor', 'of', 'new', 'york', 'was', 'there']
print(bigram[sent])

['the', 'mayor', 'of', 'new_york', 'was', 'there']


In [189]:
sentence_stream

[['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],
 ['machine', 'learning', 'can', 'be', 'useful', 'sometimes'],
 ['new', 'york', 'mayor', 'was', 'present']]

In [115]:
documents

['the mayor of new york was there',
 'machine learning can be useful sometimes',
 'new york mayor was present']

In [121]:
bigram[sentence_stream[2]]

['new_york', 'mayor', 'was', 'present']

In [88]:
cleaned_doc

'study develop unique computational framework systematic analysis large scale food nutritional datum network food nutrient offer global unbiased view organization nutritional connection enable discovery unexpected knowledge association food nutrient nutritional fitness gauge quality raw food accord nutritional balance appear widely disperse different food raise question origin variation food remarkably nutritional balance food solely depend characteristic individual nutrient structure intimate correlation multiple nutrient amount food underscore importance nutrient nutrient connection constitute network structure embody multiple level nutritional composition food extend analysis raw food cooked food necessary truly understand nutritional landscape food consume daily leave study consider raw food sufficient draw primary insight relatively simple system number application achievable concept present judiciously combine practical approach incorporation region specific information analysis 

In [69]:
for sent in parsed_doc.sents:
    parsed_sent = nlp(sent.lemma_)
    for token in parsed_sent:
        token.is_stop
    
# if not token.is_stop ...
    
print(parsed_sent)
print(token.is_stop)
#parsed_sent.is_stop

finally , -PRON- systematic approach set the foundation for future endeavor to enhance the understanding of food and nutrition . ' ]
False


In [47]:
#[t for t in parsed_row.sents]

In [43]:
#parsed_row

In [None]:
corp = []
doc = []

for token in doc:
    pass  # loop body not written yet

In [10]:

token_attr = [(token.orth_, 
               token.lemma_,
               token.is_stop, 
               token.is_punct, 
               token.is_space, 
               token.like_num, 
               token.is_oov) for token in parsed_row]

df = pd.DataFrame(token_attr, columns=['text', 'lemma', 'stop', 'punct', 'space', 'num', 'oov'])


In [12]:
#token

In [13]:
df.head()
#df.shape

Unnamed: 0,text,lemma,stop,punct,space,num,oov
0,[,[,False,True,False,False,True
1,',',False,True,False,False,True
2,In,in,True,False,False,False,True
3,this,this,True,False,False,False,True
4,study,study,False,False,False,False,True


In [14]:
#!time
# Keep lemmas of tokens that are not stop words, punctuation, white space or numbers
df_clean = df.loc[~(df['stop'] | df['punct'] | df['space'] | df['num']), 'lemma']

In [15]:
df_clean

4              study
8            develop
10            unique
11     computational
12         framework
           ...      
530         endeavor
532          enhance
534    understanding
536             food
538        nutrition
Name: lemma, Length: 265, dtype: object