![](images/mk.png)

<h1><center>ONLINE SUPPLEMENT</center></h1>

John McLevey & Reid McIlroy-Young. **Introducing *metaknowledge*: Software for Computational Research in Information Science, Science of Science, and Network Analysis.** *Journal of Informetrics*. XX(XX):XX-XX.

<h1><center>Part 3: Text Analysis</center></h1>

This supplementary notebook was prepared by [Steve McColl](http://networkslab.org/) (NetLab, University of Waterloo), Dr. [John McLevey](http://www.johnmclevey.com/) (University of Waterloo), and [Reid McIlroy-Young](http://reidmcy.com/) (University of Chicago). The code in this notebook is current as of *metaknowledge* version 3.1.1.

In [1]:
# __future__ imports must come first; this one is only needed on Python 2.
from __future__ import print_function

import metaknowledge as mk
from stop_words import get_stop_words
from nltk.tokenize import RegexpTokenizer
import seaborn as sns
import numpy
import matplotlib as mpl
import pandas
import os

# Imports for gensim.
import gensim
from gensim import corpora, models

# Imports for pyLDAvis.
import pyLDAvis.gensim as gensimvis
import pyLDAvis

# Imports for sklearn.
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

sns.set_style(style="white") # change the default background plot colour
sns.set(font_scale=.7)

mpl.rc("savefig", dpi=300) # improve default resolution of graphics

os.chdir('.')


In [2]:
RC = mk.RecordCollection('raw_data/imetrics/', cached = True)

# Topic Models


* Create topic models using the Gensim package or Scikit-Learn.
* Create interactive visualizations using Gensim models and pyLDAvis.

In [5]:
# Transform the record collection into a format for use with natural language processing applications.
raw = RC.forNLP('generated_datasets/topic_model/topic_model.csv', lower=True, removeNumbers=True,
         removeNonWords=True, removeWhitespace=True, extractCopyright=False)

In [6]:
# Extract the abstracts from the raw text as a list of documents.
documents = raw['abstract']

## SKLearn

* Basic topic models using code from [this tutorial](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html) written by Olivier Grisel, Lars Buitinck, and Chyi-Kwei Yau.

In [7]:
# For use with SKlearn, convert the raw text to a numpy array.
docs = numpy.asarray(documents)

In [18]:
# Increasing the number of features can improve the model, but it also increases the runtime.
features = 1000
topics = 50
top_words = 10

In [19]:
# Initialize the TF-IDF vectorizer.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=features,
                                   stop_words='english')

In [20]:
# Fit the vectorizer and transform the documents into a TF-IDF matrix.
tfidf = tfidf_vectorizer.fit_transform(docs)
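
To make the weighting concrete, here is a minimal pure-Python sketch of the TF-IDF scheme that `TfidfVectorizer` applies with its default settings (smoothed inverse document frequency, followed by L2 normalization of each document vector). The toy documents and the helper name `tfidf_rows` are hypothetical and not part of the notebook's data, and the sketch leaves out the `max_df`, `min_df`, `max_features`, and stop-word filtering used above.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocab, rows); rows[d][j] is the TF-IDF weight of vocab[j] in document d."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: the number of documents that contain each term.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    "citation network analysis".split(),
    "citation impact factor".split(),
    "patent citation analysis".split(),
]
vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

In the first toy document, "citation" receives the lowest non-zero weight because it occurs in every document, while "network" receives the highest because it occurs in only one.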

In [21]:
# Define the output function, code exactly as shown in tutorial.
def print_top_words(model, feature_names, top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-top_words - 1:-1]]))
    print()
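
The slice `topic.argsort()[:-top_words - 1:-1]` in `print_top_words` can be hard to read at first. The following sketch reproduces the same idea with plain Python lists: sort the indices by ascending weight, then walk backwards from the end for `n` steps. The function name `top_indices` and the toy weights are hypothetical.

```python
def top_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    # Indices ordered by ascending weight, like numpy's argsort.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: start at the end and step backwards n times.
    return ascending[:-n - 1:-1]

# Hypothetical vocabulary and topic weights.
feature_names = ["journal", "impact", "factor", "network", "patent"]
topic_weights = [0.10, 0.75, 0.60, 0.05, 0.30]

top = top_indices(topic_weights, 3)
print(" ".join(feature_names[i] for i in top))
```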

In [22]:
# Initialize the term-count vectorizer that extracts the features (tokens) for the LDA model.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=features,
                                stop_words='english')

In [23]:
tf = tf_vectorizer.fit_transform(docs)
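
Both vectorizers above prune the vocabulary by document frequency. As a rough illustration of what `min_df=2` and `max_df=0.95` mean, the sketch below keeps only terms that occur in at least two documents and in no more than 95% of them. The helper `filter_vocabulary` and the toy corpus are hypothetical; the sketch does not implement `max_features` or stop-word removal.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_count=2, max_fraction=0.95):
    """Keep terms whose document frequency lies within the given bounds."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it occurs there.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(
        term for term, count in df.items()
        if count >= min_count and count <= max_fraction * n_docs
    )

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    "citation network analysis".split(),
    "citation impact analysis".split(),
    "citation impact factor".split(),
    "citation patent".split(),
]
kept = filter_vocabulary(toy_docs)
print(kept)
```

Here "citation" is dropped because it appears in every document, and "network", "factor", and "patent" are dropped because each appears in only one.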

In [24]:
# Fit the Non-negative matrix factorization model.
nmf = NMF(n_components=topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)

In [25]:
# Print the list of topics and their contents. 
# Note that the constraints can be modified by changing the number of topics or number of words in each topic.
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, top_words)

Topic #0:
research funding institutions output bibliometric analysis areas field institutes groups
Topic #1:
information seeking behavior use systems article health retrieval study needs
Topic #2:
journal jif factor subject categories journals jcr reports isi editors
Topic #3:
journals published oa publishing international jcr editorial open access sci
Topic #4:
patent patents technological patenting companies applications value relationship market paper
Topic #5:
chinese china database india international nanotechnology usa japan english words
Topic #6:
web pages sites site links page link content engines websites
Topic #7:
impact factor factors average measure isi influence year higher metrics
Topic #8:
papers published cited paper highly physics average years proportion coauthored
Topic #9:
network networks centrality coauthorship structure social degree analysis properties structural
Topic #10:
collaboration international collaborative coauthorship collaborations domestic patterns 

In [26]:
# Initialize the LDA model and fit it to the term counts.
lda = LatentDirichletAllocation(n_topics=topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=50, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [27]:
# Print the list of topics and their contents. 
# Note that the constraints can be modified by changing the number of topics or number of words in each topic.
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, top_words)

Topic #0:
patent technology patents technological technologies china analysis development study nanotechnology
Topic #1:
evaluation quality criteria design environment assessment evaluations organization visualization tools
Topic #2:
practices discussion history historical ways cultural visual ideas reasons nature
Topic #3:
method approach based performance using results used measures proposed classification
Topic #4:
network networks structure analysis centrality structures coauthorship bibliographic study degree
Topic #5:
data groups sets size set scale statistical sample collected large
Topic #6:
hierarchical published recall lack applying impacts resulting decision projects universities
Topic #7:
search interface users tasks information features health models patent group
Topic #8:
search users user results information task tasks web study searching
Topic #9:
productivity academic publication publications publishing differences scholars scientific countries conference
Topic #10:
mo


## Gensim

* For model coefficients and more analysis options.
* Code adapted from [the gensim tutorial](https://radimrehurek.com/gensim/tutorial.html) by Radim Řehůřek.
* Visualizations created using pyLDAvis, using code from the [notebooks](https://github.com/bmabey/pyLDAvis/tree/master/notebooks) by Ben Mabey.

In [28]:
# Create a list of stopwords, using the stop_words package.
stopwords = get_stop_words('en')

In [29]:
# Initialize the tokenizer.
tokenizer = RegexpTokenizer(r'\w+')

In [30]:
# Initialize a list in which to save the tokens.
tokens = []

In [31]:
# Iterate over the documents list, and tokenize and save each entry.
for l in documents:
    token = tokenizer.tokenize(l)
    tokens.append(token)

In [32]:
# Initialize a list in which to save the cleaned tokens.
cleaned_tokens = []

In [33]:
# Keep tokens only if they do not appear in the list of stopwords.
for l in tokens:
    cleaned_tokens.append([i for i in l if i not in stopwords])
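
The tokenizing and cleaning steps can be sketched with the standard library alone. With the pattern `\w+`, `RegexpTokenizer` returns the non-overlapping matches of the pattern, which `re.findall` also does. The stopword set below is a hypothetical stand-in that is far shorter than the list returned by `get_stop_words('en')`, and the function name `tokenize_and_clean` is an assumption for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+")

# Hypothetical, heavily shortened stopword list for illustration only.
toy_stopwords = {"the", "of", "a", "in", "and"}

def tokenize_and_clean(text, stopwords):
    """Split text into word tokens and drop any token found in stopwords."""
    tokens = TOKEN_PATTERN.findall(text)
    return [t for t in tokens if t not in stopwords]

cleaned = tokenize_and_clean("the structure of a coauthorship network", toy_stopwords)
print(cleaned)
```

Storing the stopwords in a set rather than a list makes each membership test constant time, which can matter when filtering many abstracts.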

In [34]:
# Create dictionary from the cleaned tokens.
dictionary = corpora.Dictionary(cleaned_tokens)

In [35]:
# Convert the cleaned tokens into a numpy array.
array = numpy.asarray(cleaned_tokens)

In [36]:
# Build the corpus by converting each document's tokens into a bag-of-words: a list of (token id, count) pairs.
corpus = [dictionary.doc2bow(word) for word in array]
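
As a rough picture of what the dictionary and the bag-of-words corpus contain, the sketch below maps each token to an integer id and then represents each document as a sorted list of `(token id, count)` pairs. The helpers `build_dictionary` and `to_bow` are hypothetical, and the ids they assign need not match the ids gensim would assign to the same tokens.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Represent one document as a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical toy corpus, already tokenized and cleaned.
toy_docs = [["citation", "network", "citation"], ["network", "patent"]]
token2id = build_dictionary(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation: only which tokens occur in a document, and how often, is kept.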

In [37]:
# Generate the LDA model using gensim.
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=50, id2word = dictionary, passes=20)

In [41]:
# Printing all 50 topics is hard to read, so print a sample of 10 topics with the top 5 words in each.
# You can change the number of topics and the number of words as you wish.
ldamodel.print_topics(num_topics=10, num_words=5)

[(5,
  '0.033*"h" + 0.029*"index" + 0.028*"number" + 0.023*"author" + 0.022*"citations"'),
 (41,
  '0.047*"chinese" + 0.021*"financial" + 0.020*"swedish" + 0.019*"english" + 0.019*"languages"'),
 (33,
  '0.075*"research" + 0.019*"researchers" + 0.014*"performance" + 0.014*"universities" + 0.014*"university"'),
 (1,
  '0.026*"analysis" + 0.021*"research" + 0.015*"science" + 0.013*"data" + 0.012*"study"'),
 (4,
  '0.051*"image" + 0.033*"images" + 0.032*"indexing" + 0.029*"tags" + 0.019*"retrieval"'),
 (10,
  '0.022*"hindices" + 0.021*"pubmed" + 0.021*"names" + 0.019*"citer" + 0.018*"biomedical"'),
 (44,
  '0.079*"scientific" + 0.052*"production" + 0.032*"indicators" + 0.023*"spanish" + 0.021*"institutions"'),
 (8,
  '0.016*"used" + 0.014*"resources" + 0.013*"language" + 0.012*"usage" + 0.012*"ontology"'),
 (30,
  '0.068*"digital" + 0.065*"library" + 0.046*"libraries" + 0.027*"internet" + 0.024*"efficiency"'),
 (12,
  '0.025*"countries" + 0.023*"papers" + 0.023*"publications" + 0.022*"res
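
Each topic in the output above is a single string of `weight*"word"` terms joined by `+`. If the weights are needed for further analysis, one option is to parse these strings into `(word, weight)` pairs, as in the sketch below. The function name `parse_topic` is hypothetical and the example string is copied from topic 5 in the output. Gensim's `show_topics` also accepts `formatted=False`, which should return the pairs directly without any parsing.

```python
import re

# Matches one term such as 0.033*"h" and captures the weight and the word.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a formatted topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

example = '0.033*"h" + 0.029*"index" + 0.028*"number" + 0.023*"author" + 0.022*"citations"'
pairs = parse_topic(example)
print(pairs)
```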

In [42]:
# You also have the option of saving the corpus, dictionary, and model.
# corpora.MmCorpus.serialize('paper_abstracts.mm', corpus)
dictionary.save('paper_abstracts.dict')
ldamodel.save('paper_abstracts_lda.model')

### Using the model created with Gensim for an interactive pyLDAvis visualization

In [43]:
# Prepare the visualization data.
vis_data = gensimvis.prepare(ldamodel, corpus, dictionary)

In [44]:
# Visualize the topic model.
pyLDAvis.display(vis_data)