# TEDtalk NMF Topics

## Preliminaries

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


See also the scikit-learn example on topic extraction with NMF and LDA for suggestions: http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html#example-applications-topics-extraction-with-nmf-lda-py

There is more useful discussion of NMF and LDA in this [SO thread][so].

[so]: http://stackoverflow.com/questions/35140117/how-to-interpret-lda-components-using-sklearn

In [2]:
import pandas
import re
colnames = ['author', 'title', 'date' , 'length', 'text']
data = pandas.read_csv('./data/talks-v1b.csv', names=colnames)

# Creating 3 lists of relevant data.
# Importing everything here. 
# If we want to test, we should import 2006-2015 and test on 2016.

talks = data.text.tolist()
authors = data.author.tolist()
dates = data.date.tolist()

# Getting only the years from dates list
years = [re.sub('[A-Za-z ]', '', item) for item in dates]

# Combining year with presenter for citation
authordate = [author+" "+year for author, year in zip(authors, years)]

# We need to remove the "empty" talks from both lists.

# We establish which talks are empty (anything that is not a string)
no_good = [i for i, talk in enumerate(talks) if not isinstance(talk, str)]

# Now we delete them in reverse order so as to preserve index order
for index in sorted(no_good, reverse=True):
    del talks[index]
    del authordate[index]
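
The index-based deletion above can also be written as a single filtering pass over the paired lists, which keeps talks and citations aligned by construction. This is only a sketch: the values below are made-up stand-ins for `talks` and `authordate`, which in the notebook come from the CSV file.

```python
# Hypothetical stand-in data; the real lists are built from talks-v1b.csv.
talks = ["First transcript ...", float("nan"), "Third transcript ..."]
authordate = ["Speaker A 2006", "Speaker B 2007", "Speaker C 2008"]

# Keep only the (talk, citation) pairs whose talk is an actual string.
kept = [(talk, cite) for talk, cite in zip(talks, authordate)
        if isinstance(talk, str)]

# Unzip the surviving pairs back into two parallel lists.
talks = [talk for talk, _ in kept]
authordate = [cite for _, cite in kept]

print(len(talks), authordate)
```

Filtering the pairs together means there is no second loop that could drift out of step with the first.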

## Non-Negative Matrix Topic Models

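The block of code below produces a vocabulary, saved as an `np.array`, of 7254 words and a **document-term matrix** whose shape, `dtm.shape`, is `(2106, 7254)`. (The cell as written also passes `max_features=n_features`, which caps the vocabulary at 1000 terms.) The `CountVectorizer` does a lot of the work, and it takes the following parameters: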

* `stop_words` specifies which stop-word list to use. (The built-in English list is the one from the Glasgow Information Retrieval Group. See the link on [GitHub][].)
* `lowercase` (default `True`) convert all text to lowercase before tokenizing
* `min_df` (default 1) remove terms from the vocabulary that occur in fewer than `min_df` documents (in a large corpus this may be set to 15 or higher to eliminate very rare words)
* `vocabulary` ignore words that do not appear in the provided list of words
* `token_pattern` (default `u'(?u)\b\w\w+\b'`) regular expression identifying tokens. By default, words that consist of a single character (e.g., ‘a’, ‘2’) are ignored; setting `token_pattern` to `'(?u)\b\w+\b'` will include these tokens
* `tokenizer` (default unused) use a custom function for tokenizing

[GitHub]: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
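
To make the effect of `token_pattern` and `min_df` concrete, here is a small sketch on a toy corpus. The three sentences are invented for illustration and have nothing to do with the TED data; the vocabulary is read from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = ["a cat sat", "a cat ran", "a dog ran 2 miles"]

# The default token_pattern drops single-character tokens such as 'a' and '2'.
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# A looser pattern keeps single-character tokens.
loose_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
loose_vec.fit(docs)
loose_vocab = sorted(loose_vec.vocabulary_)

# min_df=2 keeps only terms that occur in at least two documents.
frequent_vec = CountVectorizer(min_df=2)
dtm = frequent_vec.fit_transform(docs)
frequent_vocab = sorted(frequent_vec.vocabulary_)

print(default_vocab)
print(loose_vocab)
print(frequent_vocab, dtm.shape)
```

The document-term matrix has one row per document and one column per surviving vocabulary term, so raising `min_df` shrinks its second dimension.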

In [5]:
import numpy as np
import sklearn.feature_extraction.text as text
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Function for printing topic words (used later):
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic {}:'.format(int(topic_id))) 
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              +', ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_samples = len(talks)
n_features = 1000
n_topics = 35
n_top_words = 20

# Use tf-idf features for NMF.
tfidf_vectorizer = text.TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(talks)

# Use tf (raw term count) features for LDA.
tf_vectorizer = text.CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(talks)

In [6]:
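# Fit the NMF model
# (Written for the scikit-learn release used here; from 1.2 on, `alpha` is
# replaced by `alpha_W`/`alpha_H` and `get_feature_names()` by
# `get_feature_names_out()`.)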
print("Fitting the NMF model with tf-idf features, "
      "n_samples={} and n_features={}...".format(n_samples, n_features))
nmf = NMF(n_components=n_topics, 
          random_state=1,
          alpha=.1, 
          l1_ratio=.5).fit(tfidf)

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

# Scale component values so that they add up to 1 for any given document
#doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

Fitting the NMF model with tf-idf features, n_samples=2106 and n_features=1000...

Topics in NMF model:

Topic 0:
actually 0.93, really 0.89, just 0.88, think 0.7, things 0.68, going 0.65, way 0.59, time 0.58, ve 0.57, make 0.52, little 0.48, kind 0.48, different 0.45, thing 0.45, look 0.45, right 0.45, new 0.44, world 0.43, work 0.42, use 0.39, 

Topic 1:
said 0.7, life 0.67, story 0.5, day 0.47, man 0.45, time 0.43, father 0.43, years 0.43, family 0.39, went 0.39, mother 0.37, love 0.37, stories 0.36, did 0.35, home 0.34, told 0.34, people 0.34, old 0.33, didn 0.32, came 0.32, 

Topic 2:
world 0.67, countries 0.62, country 0.5, global 0.47, percent 0.47, government 0.46, china 0.42, dollars 0.4, economic 0.39, economy 0.37, money 0.36, states 0.35, india 0.34, democracy 0.34, growth 0.33, billion 0.33, people 0.33, united 0.31, political 0.31, need 0.29, 

Topic 3:
patients 1.11, health 0.98, patient 0.8, disease 0.63, care 0.6, medical 0.54, doctors 0.45, medicine 0.38, doctor 0.37,

In [9]:
print(nmf)

NMF(alpha=0.1, beta=1, eta=0.1, init=None, l1_ratio=0.5, max_iter=200,
  n_components=35, nls_max_iter=2000, random_state=1, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)
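
The commented-out scaling step at the end of the NMF cell divides each document's topic weights by their sum so that every row adds up to 1. A minimal sketch of that idea follows, using a made-up matrix in place of the real document-topic weights (which would presumably come from something like `nmf.transform(tfidf)`; that call is an assumption, not shown in the notebook). The guard for all-zero rows is also an addition.

```python
import numpy as np

# Hypothetical document-topic weights (documents x topics).
doctopic = np.array([[0.2, 0.6, 0.2],
                     [0.0, 0.5, 1.5],
                     [0.0, 0.0, 0.0]])  # a document with no topic weight at all

row_sums = doctopic.sum(axis=1, keepdims=True)

# Leave all-zero rows untouched instead of dividing by zero.
safe_sums = np.where(row_sums == 0, 1.0, row_sums)
doctopic_scaled = doctopic / safe_sums

print(doctopic_scaled)
```

After scaling, each non-empty row can be read as that document's share of each topic.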


In [7]:
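# Fit the LDA model
# (From scikit-learn 0.19 on, `n_topics` is named `n_components`; from 1.2 on,
# `get_feature_names()` is `get_feature_names_out()`.)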
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:

Topic 0:
people 0.17, going 0.12, know 0.12, talk 0.11, car 0.11, computer 0.11, way 0.1, thing 0.1, actually 0.1, able 0.1, world 0.1, different 0.09, film 0.09, device 0.09, did 0.09, use 0.09, think 0.09, really 0.09, got 0.09, power 0.09, 

Topic 1:
world 0.1, violence 0.1, cells 0.09, patients 0.09, cancer 0.09, war 0.09, father 0.08, day 0.08, today 0.08, just 0.08, make 0.08, work 0.08, field 0.08, time 0.08, cell 0.08, stand 0.08, use 0.08, life 0.08, people 0.07, state 0.07, 

Topic 2:
years 1272.46, energy 1044.08, planet 952.69, earth 921.22, world 814.61, water 762.98, ve 719.5, ocean 691.31, going 642.01, life 609.82, oil 572.08, just 563.98, time 556.73, need 533.48, climate 509.74, sea 507.03, fish 482.79, species 470.4, ice 454.26, year 430.16, 

Topic 3:
light 1121.88, just 922.57, space 896.63, universe 869.77, time 728.21, look 625.82, different 567.79, way 542.17, world 490.07, science 486.24, actually 428.66, new 421.71, image 364.43, make 36
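
The LDA weights above sit on very different scales from topic to topic (around 0.1 in Topics 0 and 1, in the hundreds or thousands in Topics 2 and 3). scikit-learn describes `components_` as pseudo-counts, and dividing each row by its sum turns it into a distribution over words, which makes topics easier to compare. A sketch with a made-up matrix and vocabulary standing in for `lda.components_` and `tf_feature_names`:

```python
import numpy as np

# Made-up pseudo-counts standing in for lda.components_ (topics x words).
components = np.array([[0.1, 0.3, 0.6],
                       [1200.0, 300.0, 500.0]])
feature_names = ["energy", "light", "planet"]  # hypothetical vocabulary

# Normalize each topic (row) so that its word weights sum to 1.
topic_word = components / components.sum(axis=1)[:, np.newaxis]

for topic_id, row in enumerate(topic_word):
    top = row.argsort()[::-1][:2]  # indices of the two largest weights
    print(topic_id, [(feature_names[j], round(float(row[j]), 2)) for j in top])
```

The ranking of words within a topic is unchanged by this step; only the scale differs.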

In [15]:
# Now to associate topics to documents...
doc_topic_distrib = lda.transform(tf)


In [16]:
print(type(doc_topic_distrib), len(doc_topic_distrib))

# Turn our list of authors and dates into an array:
citations = np.asarray(authordate)

# Average the topic weights of all talks that share a citation (author + year)
unique_citations = sorted(set(citations))
num_groups = len(unique_citations)
doctopic_grouped = np.zeros((num_groups, n_topics))
for i, name in enumerate(unique_citations):
    doctopic_grouped[i, :] = np.mean(doc_topic_distrib[citations == name, :], axis=0)

# Print the three highest-weighted topics for each talk
for i in range(len(doc_topic_distrib)):
    top_topics = np.argsort(doc_topic_distrib[i, :])[::-1][0:3]
    top_topics_str = ' '.join(str(t) for t in top_topics)
    print("{}: {}".format(citations[i], top_topics_str))

# topic_words is not defined earlier in the notebook; it is assumed here to be
# each LDA topic's vocabulary sorted from highest to lowest weight.
topic_words = [[tf_feature_names[j] for j in topic.argsort()[::-1]]
               for topic in lda.components_]
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))

<class 'numpy.ndarray'> 2106
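
The grouping step in the cell above averages the topic weights of every talk that shares a citation label. The same idea can be sketched without a Python-level loop over groups, using `np.unique` with `return_inverse=True`. The matrix and labels below are invented stand-ins for `doc_topic_distrib` and `authordate`.

```python
import numpy as np

# Hypothetical stand-ins for the document-topic matrix and citation labels.
doc_topic = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.5, 0.4, 0.1]])
citations = np.asarray(["Speaker A 2006", "Speaker B 2007", "Speaker A 2006"])

# np.unique returns the sorted unique labels and, for each row, its group index.
labels, group_ids = np.unique(citations, return_inverse=True)

grouped = np.zeros((len(labels), doc_topic.shape[1]))
np.add.at(grouped, group_ids, doc_topic)           # sum the rows of each group
grouped /= np.bincount(group_ids)[:, np.newaxis]   # turn the sums into means

# Topics ranked from highest to lowest weight for each group.
ranked = np.argsort(grouped, axis=1)[:, ::-1]
for label, order in zip(labels, ranked):
    print(label, order[:2])
```

Because `labels` comes back sorted, the row order of `grouped` matches what `sorted(set(citations))` would give.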
