# TEDtalk NMF Topics

## Preliminaries

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


See also the scikit-learn example on topic extraction with NMF and LDA for suggestions: http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf_lda.html#example-applications-topics-extraction-with-nmf-lda-py

There is more useful discussion of interpreting NMF and LDA components in this [Stack Overflow thread][so].

[so]: http://stackoverflow.com/questions/35140117/how-to-interpret-lda-components-using-sklearn

In [2]:
import pandas
import re
colnames = ['author', 'title', 'date' , 'length', 'text']
data = pandas.read_csv('./data/talks-v1b.csv', names=colnames)

# Creating 3 lists of relevant data.
# Importing everything here. 
# If we want to test, we should import 2006-2015 and test on 2016.

talks = data.text.tolist()
authors = data.author.tolist()
dates = data.date.tolist()

# Getting only the years from dates list
years = [re.sub('[A-Za-z ]', '', item) for item in dates]

# Combining year with presenter for citation
authordate = [author+" "+year for author, year in zip(authors, years)]

# We need to remove the "empty" talks from both lists.
# A talk is empty when its text is not a string (pandas reads a missing
# transcript as a float NaN), so we keep only the entries that are strings.
keep = [isinstance(talk, str) for talk in talks]

# Filter both lists with the same mask so that they stay aligned
authordate = [ad for ad, ok in zip(authordate, keep) if ok]
talks = [talk for talk, ok in zip(talks, keep) if ok]
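
The same cleanup can also be done on the DataFrame before any lists are built. The following is a minimal sketch on made-up data, not the notebook's own method: the `toy` frame and every name ending in `_toy` are hypothetical stand-ins for `data`, `talks`, and `authordate`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the talks CSV: the second row has no transcript.
toy = pd.DataFrame({
    "author": ["A Speaker", "B Speaker", "C Speaker"],
    "date": ["Feb 2006", "Mar 2011", "Jul 2007"],
    "text": ["some transcript", np.nan, "another transcript"],
})

# Boolean mask: True where the transcript is an actual string.
has_text = toy["text"].apply(lambda t: isinstance(t, str))
kept = toy[has_text]

# Build the aligned lists from the filtered rows only.
talks_toy = kept["text"].tolist()
years_toy = kept["date"].str.replace(r"[A-Za-z ]", "", regex=True).tolist()
authordate_toy = [f"{a} {y}" for a, y in zip(kept["author"], years_toy)]

print(talks_toy)
print(authordate_toy)
```

Filtering the rows first means the text and citation lists cannot drift out of step, because both are derived from the same filtered frame.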

## Non-Negative Matrix Topic Models

The block of code below produces a vocabulary of 7254 words, stored as an `np.array`, and a **document-term matrix** `dtm` whose shape (`dtm.shape`) is `(2106, 7254)`: one row per talk and one column per word. The `CountVectorizer` does a lot of the work, and it takes the following parameters:

* `stop_words` specifies which stop-word list to use. (The built-in English list is the one from the Glasgow Information Retrieval Group. See the list on [GitHub][].)
* `lowercase` (default `True`) converts all text to lowercase before tokenizing.
* `min_df` (default 1) removes terms from the vocabulary that occur in fewer than `min_df` documents. (In a large corpus this may be set to 15 or higher to eliminate very rare words.)
* `vocabulary` ignores words that do not appear in the provided list of words.
* `token_pattern` (default `u'(?u)\b\w\w+\b'`) is the regular expression that identifies tokens. By default, words that consist of a single character (e.g., ‘a’, ‘2’) are ignored; setting `token_pattern` to `'(?u)\b\w+\b'` will include these tokens.
* `tokenizer` (default unused) uses a custom function for tokenizing.

[GitHub]: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
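
To make the effect of `token_pattern` and `min_df` concrete, here is a small sketch on a made-up three-sentence corpus (the `toy_docs` list and the variable names are hypothetical, not part of the talks data). It reads the learned vocabulary from the `vocabulary_` attribute so that it does not depend on which feature-name method a given scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, for illustration only.
toy_docs = [
    "I gave a talk about the ocean",
    "A talk about energy and the ocean",
    "Energy is a big idea",
]

# Default settings: lowercase everything, keep tokens of two or more characters.
default_vocab = sorted(CountVectorizer().fit(toy_docs).vocabulary_)

# Relaxed pattern: single-character tokens such as "a" and "i" are kept too.
single_vocab = sorted(
    CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(toy_docs).vocabulary_
)

# min_df=2: keep only terms that occur in at least two documents.
min_df_vocab = sorted(CountVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(default_vocab)
print(single_vocab)
print(min_df_vocab)
```

On this toy corpus the default vocabulary has no one-letter words, the relaxed pattern adds `a` and `i`, and `min_df=2` leaves only the five words shared by at least two sentences.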

In [3]:
import numpy as np
import sklearn.feature_extraction.text as text
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Function for printing topic words (used later):
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic {}:'.format(int(topic_id))) 
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              +', ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Show the top X words in a topic:
#for t in range(len(topic_words)):
#    print("Topic {}: {}".format(t, ' '.join(topic_words[t][:15])))
    
n_samples = len(talks)
n_features = 1000
n_topics = 35
n_top_words = 20

# Use tf-idf features for NMF.
tfidf_vectorizer = text.TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(talks)

# Use tf (raw term count) features for LDA.
tf_vectorizer = text.CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(talks)

In [None]:
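# Fit the NMF model
# (Written against an older scikit-learn. In version 1.2 and later, `alpha` is
# removed in favour of `alpha_W`/`alpha_H`, and `get_feature_names()` in favour
# of `get_feature_names_out()`.)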
print("Fitting the NMF model with tf-idf features, "
      "n_samples={} and n_features={}...".format(n_samples, n_features))
nmf = NMF(n_components=n_topics, 
          random_state=1,
          alpha=.1, 
          l1_ratio=.5).fit(tfidf)

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

# Scale component values so that they add up to 1 for any given document
#doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
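
As a rough illustration of what a fitted NMF model contains, the sketch below factorizes a small random non-negative matrix and then rescales each row of the document-topic matrix to sum to 1, as in the commented-out line above. The random matrix is only a stand-in for the tf-idf matrix, all names ending in `_toy` are hypothetical, and the regularization arguments are omitted so that the code does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for a tf-idf matrix: 6 "documents" x 8 "terms", all non-negative.
rng = np.random.RandomState(0)
X_toy = rng.rand(6, 8)

nmf_toy = NMF(n_components=3, init="nndsvda", random_state=1, max_iter=500)
W_toy = nmf_toy.fit_transform(X_toy)  # document-topic weights, shape (6, 3)
H_toy = nmf_toy.components_           # topic-term weights, shape (3, 8)

# Rescale each document's topic weights so that they sum to 1.
# Rows that are entirely zero are left as zeros instead of dividing by zero.
row_sums = W_toy.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0
W_norm = W_toy / row_sums

print(W_toy.shape, H_toy.shape)
print(W_norm.sum(axis=1))
```

The product of `W_toy` and `H_toy` approximates `X_toy`, which is why rows of `components_` can be read as topics and rows of the transformed matrix as per-document topic weights.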

In [13]:
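# Now to associate NMF topics to documents...
# Note: fit_transform() refits the model, here on the raw counts in `tf`,
# so these topic numbers need not match the tf-idf topics printed earlier.
# To reuse the model as already fitted, call nmf.transform(tfidf) instead.

dtm = tf.toarray()
doctopic = nmf.fit_transform(dtm)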

print("Top NMF topics in...")
for i in range(len(doctopic)):
    top_topics = np.argsort(doctopic[i,:])[::-1][0:3]
    top_topics_str = ' '.join(str(t) for t in top_topics)
    print("{}: {}".format(authordate[i], top_topics_str))

Top NMF topics in...
Al Gore 2006: 8 2 3
David Pogue 2006: 9 13 2
Cameron Sinclair 2006: 9 34 23
Sergey Brin + Larry Page 2007: 0 5 13
Nathalie Miebach 2011: 10 0 23
Richard Wilkinson 2011: 10 1 20
Malcolm Gladwell 2011: 1 0 17
Jay Bradner 2011: 32 27 5
Béatrice Coron 2011: 3 34 25
Hasan Elahi 2011: 13 26 0
Paul Zak 2011: 1 2 4
Anna Mracek Dietrich 2011: 13 0 3
Daniel Wolpert 2011: 4 8 15
Marco Tempest 2011: 31 18 8
Stew 2007: 6 33 9
Martin Hanczyc 2011: 22 25 26
Aparna Rao 2011: 23 1 33
Ben Kacyra 2011: 31 3 23
Allan Jones 2011: 4 10 27
Charlie Todd 2011: 1 34 3
Alexander Tsiaras 2011: 26 27 8
Yves Rossy 2011: 0 13 25
Thomas Suarez 2011: 12 25 9
Cynthia Kenyon 2011: 0 13 27
Robin Ince 2011: 33 16 7
James Howard Kunstler 2007: 8 34 5
Phil Plait 2011: 19 8 13
Péter Fankhauser 2011: 31 18 25
Joe Sabia 2011: 3 14 1
Britta Riley 2011: 24 26 34
Amy Purdy 2011: 22 3 8
Damon Horowitz 2011: 5 18 13
Annie Murphy Paul 2011: 6 14 24
John Bohannon 2011: 27 31 33
Charles Limb 2011: 29 0 15
Kathryn 
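
The top-three selection in the loop relies on `np.argsort`, which returns indices in ascending order of value, so the result is reversed before slicing. A minimal sketch on a hypothetical 2 x 4 weight matrix (`doctopic_toy` is made up for illustration) shows the same idea applied to all rows at once:

```python
import numpy as np

# Hypothetical document-topic weights: 2 documents x 4 topics.
doctopic_toy = np.array([
    [0.1, 0.7, 0.0, 0.2],
    [0.5, 0.0, 0.4, 0.3],
])

# Sort topic indices by weight (ascending), reverse each row, keep the first two.
top2 = np.argsort(doctopic_toy, axis=1)[:, ::-1][:, :2]

print(top2.tolist())  # [[1, 3], [0, 2]]
```

The first document's strongest topics are 1 and 3, and the second document's are 0 and 2, matching the largest values in each row.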

In [6]:
# Fit the LDA model
# (`n_topics` was renamed `n_components` in scikit-learn 0.19 and later removed.)
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:

Topic 0:
people 0.17, going 0.12, know 0.12, talk 0.11, car 0.11, computer 0.11, way 0.1, thing 0.1, actually 0.1, able 0.1, world 0.1, different 0.09, film 0.09, device 0.09, did 0.09, use 0.09, think 0.09, really 0.09, got 0.09, power 0.09, 

Topic 1:
world 0.1, violence 0.1, cells 0.09, patients 0.09, cancer 0.09, war 0.09, father 0.08, day 0.08, today 0.08, just 0.08, make 0.08, work 0.08, field 0.08, time 0.08, cell 0.08, stand 0.08, use 0.08, life 0.08, people 0.07, state 0.07, 

Topic 2:
years 1272.46, energy 1044.08, planet 952.69, earth 921.22, world 814.61, water 762.98, ve 719.5, ocean 691.31, going 642.01, life 609.82, oil 572.08, just 563.98, time 556.73, need 533.48, climate 509.74, sea 507.03, fish 482.79, species 470.4, ice 454.26, year 430.16, 

Topic 3:
light 1121.88, just 922.57, space 896.63, universe 869.77, time 728.21, look 625.82, different 567.79, way 542.17, world 490.07, science 486.24, actually 428.66, new 421.71, image 364.43, make 36

In [8]:
# Now to associate topics to documents...
doc_topic_distrib = lda.transform(tf)

In [9]:
print(type(doc_topic_distrib), len(doc_topic_distrib))

<class 'numpy.ndarray'> 2106
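
The array returned by `lda.transform` has one row per document and one column per topic. The sketch below, on a hypothetical 4 x 6 count matrix, shows one way to turn such an array into a single most likely topic per document. It assumes a recent scikit-learn, where the argument is `n_components`, and it normalizes the rows explicitly rather than relying on what a given version returns.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical term-count matrix: 4 "documents" x 6 "terms".
counts_toy = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 4, 1, 0, 0, 0],
    [0, 0, 3, 4, 0, 2],
    [0, 1, 2, 3, 0, 3],
])

# n_components is the current name of the argument called n_topics in old releases.
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
dist_toy = lda_toy.fit(counts_toy).transform(counts_toy)

# Normalize explicitly so that each row sums to 1 whatever the version returns.
dist_toy = dist_toy / dist_toy.sum(axis=1, keepdims=True)

# Index of the most probable topic for each document.
best_topic = dist_toy.argmax(axis=1)

print(dist_toy.shape)
print(best_topic)
```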


In [None]:
# For a quick check of the number of documents and terms in the matrix:
print(dtm.shape)

In [16]:
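doctopic_orig = doctopic.copy()

# Convert to an array so that `== name` gives one boolean per document
# (comparing a plain list to a string would give a single False).
authordate_arr = np.array(authordate)
group_names = sorted(set(authordate))
num_groups = len(group_names)
doctopic_grouped = np.zeros((num_groups, n_topics))

for i, name in enumerate(group_names):
    doctopic_grouped[i, :] = np.mean(doctopic[authordate_arr == name, :], axis=0)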
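
Averaging rows by label depends on element-wise comparison, which NumPy arrays support and plain Python lists do not. A minimal sketch with hypothetical labels and weights (none of these values come from the talks data) shows the difference:

```python
import numpy as np

# Hypothetical labels and document-topic weights: 3 documents x 2 topics.
labels = ["Ann 2006", "Bob 2007", "Ann 2006"]
weights = np.array([
    [0.2, 0.8],
    [0.5, 0.5],
    [0.4, 0.6],
])

# A list compared to a string is a single False, which is useless as a row mask.
list_comparison = (labels == "Ann 2006")

# An array compared to a string is one boolean per element.
labels_arr = np.array(labels)
names = sorted(set(labels))
grouped = np.vstack([weights[labels_arr == n].mean(axis=0) for n in names])

print(list_comparison)
print(grouped)
```

Here the two "Ann 2006" rows are averaged into one row and "Bob 2007" keeps its single row, so `grouped` has one row per distinct label.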

