As Katherine Kinnaird and I continue our work on the Tedtalks, we have found ourselves drawn to examine more closely the notion of topics, which we both feel have been underexamined in their usage in the humanities. 

Most humanists use an implementation of LDA, which we will probably also use simply to stay in parallel, but at some point in our work, frustrated with my inability to get LDA to work within Python, I picked up Alan Riddell's DARIAH tutorial and drafted an implementation of NMF topic modeling for our corpus. One advantage I noticed right away, in comparing the results to earlier work I had done with Jonathan Goodwin, was what seemed like a much more stable set of word clusters in the algorithmically-derived topics. 

Okay, good, but Kinnaird noticed that stopwords kept creeping into the topics. That raised larger issues about how NMF does what it does, and that meant, because she's so thorough, backing up a bit and making sure we understand how NMF works.

What follows is an experiment to understand the shape and nature of the tf matrix, the tfidf matrix, and the output of the `sklearn` NMF algorithm. Some of this is driven by the following essays:

* [Improving the Interpretation of Topic Models][]
* [Practical Topic Finding for Short-Sentence Texts][]

[Improving the Interpretation of Topic Models]: https://medium.com/towards-data-science/improving-the-interpretation-of-topic-models-87fd2ee3847d
[Practical Topic Finding for Short-Sentence Texts]: https://nbviewer.jupyter.org/github/dolaameng/tutorials/blob/master/topic-finding-for-short-texts/topics_for_short_texts.ipynb

To start our adventure, we needed a small set of texts with sufficient overlap that we could later successfully derive topics from them. I set myself the task of creating ten sentences, each of approximately ten words. Careful readers who take the time to read the sentences themselves will, I hope, forgive me for the texts being rather reflexive in nature, but it did seem appropriate given the overall reflexive nature of this task.

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# The Toy Corpus
# =-=-=-=-=-=-=-=-=-=-= 

sentences = ["Green grow the rushes along the banks of the river.", 
             "The sea refuses no river, and this river is homeward flowing.", 
             "I have run this river during high and low water.", 
             "A river grows larger as it collects water from more tributaries along its course.", 
             "The river regularly broke its banks before the levee system.", 
             "The global economy is too dependent upon a limited number of banks.", 
             "Clinton's campaign was too dependent on the saying about the economy.", 
             "No economy can survive if its banks go broke.", 
             "We are too dependent upon the economy as a measure.", 
             "Many banks went broke when the river topped its banks."]

# =-=-=-=-=-=-=-=-=-=-=
# The Stopwords for this corpus
# =-=-=-=-=-=-=-=-=-=-= 

stopwords = ["a", "about", "along", "and", "are", "as", "before", "can", "during", 
             "from", "go", "have", "high", "i", "if", "is", "it", "its", "low", 
             "many", "more", "no", "of", "on", "run", "the", "this", "too", "upon", 
             "was", "we", "went", "when"]

Each text is simply a sentence in a list of strings. Below the texts is the custom stopword list for this corpus. For those curious, there are a total of 107 tokens in the corpus and 33 stopwords. Once the stopwords are applied, 49 tokens remain for a total of 30 words.

In [2]:
# =-=-=-=-=-=
# Clean & Tokenize
# =-=-=-=-=-=

import re
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
# stopwords = re.split('\s+', open('../data/tt_stop.txt', 'r').read().lower())

# Loop to tokenize, stop, and stem (if needed) texts.
tokenized = []
for sentence in sentences:   
    raw = re.sub(r"[^\w\d'\s]+",'', sentence).lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [word for word in tokens if word not in stopwords]
    tokenized.append(stopped_tokens)

    
# =-=-=-=-=-=-=-=-=-=-=
# Re-Assemble Texts as Strings from Lists of Words
# (because this is what sklearn expects)
# =-=-=-=-=-=-=-=-=-=-= 

texts = []
for item in tokenized:
    the_string = ' '.join(item)
    texts.append(the_string)

In [3]:
for text in texts:
    print(text)

green grow rushes banks river
sea refuses river river homeward flowing
river water
river grows larger collects water tributaries course
river regularly broke banks levee system
global economy dependent limited number banks
clinton's campaign dependent saying economy
economy survive banks broke
dependent economy measure
banks broke river topped banks


In [4]:
all_words = ' '.join(texts).split()
print("There are {} tokens representing {} words."
      .format(len(all_words), len(set(all_words))))

There are 49 tokens representing 30 words.


We will explore below the possibility of using the sklearn module's built-in tokenization and stopword abilities, but while I continue to teach myself that functionality, we can move ahead with understanding the vectorization of a corpus. 

There are a lot of ways to turn a series of words into a series of numbers. One of the principal ways of doing so ignores any individuated context for a particular word as we might understand it within the context of a given sentence but simply considers a word in relationship to other words in a text. That is, one way to turn words into numbers is simply to count the words in a text, reducing a text to what is known as a "bag of words." (There's a lot of linguistics and information science that validates this approach, but it will always chafe most humanists.)
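
As a minimal sketch of the bag-of-words idea, using only the standard library rather than sklearn, `collections.Counter` throws away word order and keeps only how often each word appears. The sample string is the second sentence of the toy corpus after stopword removal.

```python
from collections import Counter

# Second sentence of the toy corpus, already lowercased and stopped.
text = "sea refuses river river homeward flowing"

# A bag of words keeps the counts and discards the order.
bag = Counter(text.split())

print(bag["river"])       # 2
print(sum(bag.values()))  # 6 tokens
print(len(bag))           # 5 distinct words
```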

If we run our corpus of ten sentences through the `CountVectorizer`, we will get a representation of it as a series of numbers, each representing the count of a particular word within a particular text:

In [5]:
# =-=-=-=-=-=-=-=-=-=-=
# TF
# =-=-=-=-=-=-=-=-=-=-= 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = CountVectorizer()
tf_data = vec.fit_transform(texts)
tf_data_array = tf_data.toarray()
print(tf_data.shape)
print(tf_data[1])
print(tf_data_array)

(10, 30)
  (0, 21)	2
  (0, 24)	1
  (0, 19)	1
  (0, 13)	1
  (0, 8)	1
[[1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 2 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]
 [0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0]
 [1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0]]


The term frequency vectorizer in sklearn creates a set of words out of all the tokens, like we did above, then counts the number of times a given word occurs within a given text, returning that text as a vector. Thus, the second sentence above:

    "The sea refuses no river, and this river is homeward flowing."

which we had tokenized and stopworded to become:

    sea refuses river river homeward flowing

becomes a list of numbers, or a vector, that looks like this:

      0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 2 0 0 1 0 0 0 0 0

I chose the second sentence because it has a word that occurs twice, *river*, and thus it stands out a bit and, too, it doesn't look like a line of binary. If you stack all ten texts on top of each other, then you get a matrix of 10 rows, each row a text, and 30 columns, each column one of the important, lexical, words.

Based on the location of the two, my guess is that the `CountVectorizer` alphabetizes its list of words, which can also be considered as **features** of a text. A quick check of our set of words, sorted alphabetically is our first step in confirmation. (This particular corpus avoids the problem of multiple word forms: so there is no "bank" only "banks". Stemming is a different, and difficult conversation.)

In [6]:
the_words = list(set(all_words))
the_words.sort()
print(the_words)

['banks', 'broke', 'campaign', "clinton's", 'collects', 'course', 'dependent', 'economy', 'flowing', 'global', 'green', 'grow', 'grows', 'homeward', 'larger', 'levee', 'limited', 'measure', 'number', 'refuses', 'regularly', 'river', 'rushes', 'saying', 'sea', 'survive', 'system', 'topped', 'tributaries', 'water']
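
To make the guess about alphabetical columns concrete, here is a small standard-library sketch (not sklearn's own implementation) that sorts the vocabulary, gives each word a column index, and builds the count vector for the second sentence by hand. The helper name `count_vector` is made up for this illustration.

```python
# The ten stopped texts, copied from the output of In [3].
texts = [
    "green grow rushes banks river",
    "sea refuses river river homeward flowing",
    "river water",
    "river grows larger collects water tributaries course",
    "river regularly broke banks levee system",
    "global economy dependent limited number banks",
    "clinton's campaign dependent saying economy",
    "economy survive banks broke",
    "dependent economy measure",
    "banks broke river topped banks",
]

# Alphabetical vocabulary: one column per distinct word.
vocab = sorted(set(" ".join(texts).split()))
index = {word: i for i, word in enumerate(vocab)}

def count_vector(text):
    """Return the row of counts for one text (hypothetical helper)."""
    row = [0] * len(vocab)
    for word in text.split():
        row[index[word]] += 1
    return row

row = count_vector(texts[1])
print(index["river"])  # 21
print(row)
```

If the alphabetical guess is right, the printed row should match the second row of the `CountVectorizer` array above, with the 2 sitting in column 21.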


We can get nearly the same list from the vectorizer itself with the `get_feature_names` method (the one difference is that "clinton's" has become "clinton"):

In [7]:
features = vec.get_feature_names()
print(features)

['banks', 'broke', 'campaign', 'clinton', 'collects', 'course', 'dependent', 'economy', 'flowing', 'global', 'green', 'grow', 'grows', 'homeward', 'larger', 'levee', 'limited', 'measure', 'number', 'refuses', 'regularly', 'river', 'rushes', 'saying', 'sea', 'survive', 'system', 'topped', 'tributaries', 'water']
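
The hand-built list has "clinton's" where the vectorizer's list has "clinton". A likely explanation is the vectorizer's default `token_pattern`, which keeps runs of two or more word characters; the apostrophe splits the word and the lone "s" is too short to survive. The sketch below applies that pattern with the standard `re` module; it imitates only the tokenizing step, not the whole vectorizer.

```python
import re

# Default token_pattern of sklearn's CountVectorizer: two or more
# word characters in a row. An apostrophe is not a word character.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

tokens = re.findall(TOKEN_PATTERN, "clinton's campaign")
print(tokens)  # ['clinton', 'campaign']
```

The same rule silently drops one-letter words such as "a" and "i", whether or not they appear in a stopword list.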


We can also get the column index for each term with the `vocabulary_` attribute, which reveals that sklearn stores the information as a dictionary with the term as the key and the column index as the value:

In [8]:
occurrences = vec.vocabulary_
print(occurrences)

{'water': 29, 'homeward': 13, 'global': 9, 'tributaries': 28, 'dependent': 6, 'grows': 12, 'collects': 4, 'measure': 17, 'grow': 11, 'topped': 27, 'refuses': 19, 'limited': 16, 'larger': 14, 'sea': 24, 'flowing': 8, 'banks': 0, 'survive': 25, 'clinton': 3, 'saying': 23, 'system': 26, 'regularly': 20, 'river': 21, 'course': 5, 'levee': 15, 'green': 10, 'broke': 1, 'campaign': 2, 'rushes': 22, 'number': 18, 'economy': 7}


It's also worth pointing out that we can get a count of particular terms within our corpus by feeding the `CountVectorizer` a `vocabulary` argument. Here I've prepopulated a list with three of our terms -- "banks", "river", and "broke" -- and the function returns an array which counts only the occurrences of those three terms across all ten texts:

In [9]:
# =-=-=-=-=-=-=-=-=-=-=
# Controlled Vocabulary Count
# =-=-=-=-=-=-=-=-=-=-= 

tags = ['banks', 'river', 'broke']
cv = CountVectorizer(vocabulary=tags)
data = cv.fit_transform(texts).toarray()
print(data)

[[1 1 0]
 [0 2 0]
 [0 1 0]
 [0 1 0]
 [1 1 1]
 [1 0 0]
 [0 0 0]
 [1 0 1]
 [0 0 0]
 [2 1 1]]


So far we've been trafficking in raw counts, or occurrences, of a word -- aka term, aka feature -- in our corpus. Chances are, longer, or bigger, texts which simply have more words will have more of any given word, which means they may come to be overvalued (overweighted?) if we rely only on occurrences. Fortunately, we can simply normalize by length of a text to get a value that can be used to compare how often a word is used in relationship to the size of the text across all texts in a corpus. That is, we can get a term's **frequency**.

As I was working on this bit of code, I learned that sklearn stores this information in a compressed sparse row matrix, wherein a series of `(text, term)` coordinates are followed by a value. I have captured the second text below. (Note the commented-out `toarray` method in the second-to-last line. It's there so often in sklearn code that I had come to take it for granted.)

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(tf_data)
words_tf = tf_transformer.transform(tf_data)#.toarray()
print(words_tf[1]) # values for second sentence

  (0, 21)	0.707106781187
  (0, 24)	0.353553390593
  (0, 19)	0.353553390593
  (0, 13)	0.353553390593
  (0, 8)	0.353553390593
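
The values above are not counts divided by the number of tokens (that would give 2/6 for *river*). They are consistent with the transformer's default `norm='l2'` setting, which divides each count by the Euclidean length of the row. A minimal check by hand, assuming that default:

```python
import math

# Raw counts for the second sentence.
counts = {"river": 2, "sea": 1, "refuses": 1, "homeward": 1, "flowing": 1}

# Euclidean (L2) length of the count vector: sqrt(2^2 + 1 + 1 + 1 + 1) = sqrt(8).
norm = math.sqrt(sum(c * c for c in counts.values()))

tf = {word: c / norm for word, c in counts.items()}
print(round(tf["river"], 6))  # 0.707107
print(round(tf["sea"], 6))    # 0.353553
```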


And here's that same information represented as an array:

In [11]:
words_tf_array = words_tf.toarray()
print(words_tf_array[0:2])

[[ 0.4472136   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.4472136   0.4472136   0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.4472136   0.4472136   0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.35355339  0.          0.          0.          0.
   0.35355339  0.          0.          0.          0.          0.
   0.35355339  0.          0.70710678  0.          0.          0.35355339
   0.          0.          0.          0.          0.        ]]


Finally, we can also weight words within a document contra the number of times they occur within the overall corpus, thus lowering the value of common words.

The second sentence was reduced to 6 tokens of 5 words: river, word 21, occurs twice in this sentence.

In [12]:
# =-=-=-=-=-=-=-=-=-=-=
# TFIDF
# =-=-=-=-=-=-=-=-=-=-= 

tfidf = TfidfVectorizer(use_idf=True) # This defaults to True without `use_idf` passed to it.
tfidf_data = tfidf.fit_transform(texts)#.toarray()
print(tfidf_data.shape)
print(tfidf_data[1]) # values for second sentence

(10, 30)
  (0, 8)	0.44053555152
  (0, 13)	0.44053555152
  (0, 19)	0.44053555152
  (0, 24)	0.44053555152
  (0, 21)	0.472983838401


And now, again, in the more common form of an array:

In [13]:
tfidf_array = tfidf_data.toarray()
print(tfidf_array[1]) # values for second sentence

[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.44053555  0.          0.          0.          0.
  0.44053555  0.          0.          0.          0.          0.
  0.44053555  0.          0.47298384  0.          0.          0.44053555
  0.          0.          0.          0.          0.        ]
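
To see where these numbers come from, the sketch below recomputes the second sentence's weights by hand. It assumes the vectorizer's defaults (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`; the function name `idf` is just a label for this illustration.

```python
import math

# The ten stopped texts, copied from the output of In [3].
texts = [
    "green grow rushes banks river",
    "sea refuses river river homeward flowing",
    "river water",
    "river grows larger collects water tributaries course",
    "river regularly broke banks levee system",
    "global economy dependent limited number banks",
    "clinton's campaign dependent saying economy",
    "economy survive banks broke",
    "dependent economy measure",
    "banks broke river topped banks",
]
docs = [t.split() for t in texts]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency (assumed sklearn default)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

doc = docs[1]
# Raw count times idf, then L2-normalize the row.
weights = {term: doc.count(term) * idf(term) for term in set(doc)}
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf_row = {term: w / norm for term, w in weights.items()}

print(tfidf_row["river"])  # compare with 0.47298384 above
print(tfidf_row["sea"])    # compare with 0.44053555 above
```

*River* occurs twice in the sentence but in six of the ten texts, so its weight ends up only slightly above that of the four words that occur once each and nowhere else.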


In [14]:
tfidf_feature_names = tfidf.get_feature_names()
print(tfidf_feature_names)

['banks', 'broke', 'campaign', 'clinton', 'collects', 'course', 'dependent', 'economy', 'flowing', 'global', 'green', 'grow', 'grows', 'homeward', 'larger', 'levee', 'limited', 'measure', 'number', 'refuses', 'regularly', 'river', 'rushes', 'saying', 'sea', 'survive', 'system', 'topped', 'tributaries', 'water']


Looking at our older code, it looks like the moment to populate the various facets of the vectorizer is when you are first setting it up:

    vectorizer = sk_text.TfidfVectorizer(max_df = max_percent, 
                                         min_df = min_percent,
                                         max_features = n_features,
                                         stop_words = tt_stopwords)
    tfidf = vectorizer.fit_transform(strungs)
    
We'll revisit that possibility in the next section. For now, let's just continue with the exploration of NMF. In the code below, I should note that you can feed the model either the data in the CSR matrix or the full array. The results are the same.

In [15]:
from sklearn.decomposition import NMF

topics = 2
model = NMF(n_components = topics,
          random_state = 1,
          alpha = 0,
          l1_ratio = 0).fit(tfidf_array)
print(model)

NMF(alpha=0, beta=1, eta=0.1, init=None, l1_ratio=0, max_iter=200,
  n_components=2, nls_max_iter=2000, random_state=1, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)


In [16]:
W = model.fit_transform(tfidf_array)
H = model.components_

print(W)
print(H)

[[ 0.40440648  0.        ]
 [ 0.32888381  0.        ]
 [ 0.41970176  0.        ]
 [ 0.29288554  0.        ]
 [ 0.50431328  0.        ]
 [ 0.04788035  0.70818205]
 [ 0.          0.72139888]
 [ 0.32382609  0.37050242]
 [ 0.          0.80814917]
 [ 0.58151533  0.02546852]]
[[ 0.62470522  0.46205882  0.          0.          0.09686473  0.09686473
   0.          0.05076776  0.1174786   0.          0.17185199  0.17185199
   0.09686473  0.1174786   0.09686473  0.19967539  0.          0.          0.
   0.1174786   0.19967539  0.69965254  0.17185199  0.          0.1174786
   0.15525365  0.19967539  0.26147722  0.09686473  0.37008406]
 [ 0.14171045  0.06209536  0.19919591  0.19919591  0.          0.
   0.52254373  0.54813629  0.          0.1874398   0.          0.          0.
   0.          0.          0.          0.1874398   0.31596346  0.1874398
   0.          0.          0.          0.          0.19919591  0.
   0.11906477  0.          0.          0.          0.        ]]
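
What NMF is doing here is factoring the 10 x 30 tf-idf matrix into a 10 x 2 document-topic matrix `W` and a 2 x 30 topic-word matrix `H`, both non-negative, whose product approximates the original. The sketch below multiplies one row of `W` by one column of `H`, using values copied from the output above, to rebuild a single cell. With only two topics the approximation is coarse.

```python
# Row of W for the second sentence, and the column of H for "river"
# (column 21), both copied from the printed output above.
w_row = [0.32888381, 0.0]
h_col = [0.69965254, 0.0]

# One cell of the product W x H: a dot product over the two topics.
approx = sum(w * h for w, h in zip(w_row, h_col))

actual = 0.47298384  # tf-idf weight of "river" in the second sentence
print(round(approx, 4), actual)
```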


In [17]:
print(H[0])

[ 0.62470522  0.46205882  0.          0.          0.09686473  0.09686473
  0.          0.05076776  0.1174786   0.          0.17185199  0.17185199
  0.09686473  0.1174786   0.09686473  0.19967539  0.          0.          0.
  0.1174786   0.19967539  0.69965254  0.17185199  0.          0.1174786
  0.15525365  0.19967539  0.26147722  0.09686473  0.37008406]


I'd like to see if we can get a grid of words-to-topics. The first step, I think, is to get a list of tuples out of the dictionary of `occurrences` above, sort by the numerical values for position, and then zip that into a pandas dataframe along with the H array above.

In [18]:
# Make sure `occurrences` is a dictionary
type(occurrences)

dict

In [19]:
# Use a list comprehension on a dictionary to create a list of tuples (Yeah Huh)
vocab_list = [(val, key) for key, val in occurrences.items()]

In [20]:
# Check the outcome
print(vocab_list)

[(29, 'water'), (13, 'homeward'), (9, 'global'), (28, 'tributaries'), (6, 'dependent'), (12, 'grows'), (4, 'collects'), (17, 'measure'), (11, 'grow'), (27, 'topped'), (19, 'refuses'), (16, 'limited'), (14, 'larger'), (24, 'sea'), (8, 'flowing'), (0, 'banks'), (25, 'survive'), (3, 'clinton'), (23, 'saying'), (26, 'system'), (20, 'regularly'), (21, 'river'), (5, 'course'), (15, 'levee'), (10, 'green'), (1, 'broke'), (2, 'campaign'), (22, 'rushes'), (18, 'number'), (7, 'economy')]


In [21]:
# Sort
vocab_list.sort() # (reverse=False) for ascending

In [22]:
# Check again
print(vocab_list)

[(0, 'banks'), (1, 'broke'), (2, 'campaign'), (3, 'clinton'), (4, 'collects'), (5, 'course'), (6, 'dependent'), (7, 'economy'), (8, 'flowing'), (9, 'global'), (10, 'green'), (11, 'grow'), (12, 'grows'), (13, 'homeward'), (14, 'larger'), (15, 'levee'), (16, 'limited'), (17, 'measure'), (18, 'number'), (19, 'refuses'), (20, 'regularly'), (21, 'river'), (22, 'rushes'), (23, 'saying'), (24, 'sea'), (25, 'survive'), (26, 'system'), (27, 'topped'), (28, 'tributaries'), (29, 'water')]
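
The same ordering can be had in one step by sorting the dictionary's keys by their values. This is only an alternative sketch, shown on a handful of entries copied from `occurrences`; it returns the terms alone rather than `(index, term)` tuples.

```python
# A few entries copied from the `occurrences` dictionary above.
occurrences = {"water": 29, "banks": 0, "river": 21, "economy": 7, "broke": 1}

# Sort the terms by their column index.
ordered_terms = sorted(occurrences, key=occurrences.get)
print(ordered_terms)  # ['banks', 'broke', 'economy', 'river', 'water']
```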


In [23]:
# Convert to pandas dataframe
import pandas as pd

df = pd.DataFrame(vocab_list)

In [24]:
# And check
print(df)

     0            1
0    0        banks
1    1        broke
2    2     campaign
3    3      clinton
4    4     collects
5    5       course
6    6    dependent
7    7      economy
8    8      flowing
9    9       global
10  10        green
11  11         grow
12  12        grows
13  13     homeward
14  14       larger
15  15        levee
16  16      limited
17  17      measure
18  18       number
19  19      refuses
20  20    regularly
21  21        river
22  22       rushes
23  23       saying
24  24          sea
25  25      survive
26  26       system
27  27       topped
28  28  tributaries
29  29        water


In [25]:
topic_0 = H[0].tolist()
topic_1 = H[1].tolist()
print(topic_0, topic_1)

[0.6247052188788008, 0.4620588181862671, 0.0, 0.0, 0.09686473061048828, 0.09686473061048828, 0.0, 0.05076776195177597, 0.11747860355297193, 0.0, 0.17185198507628535, 0.17185198507628535, 0.09686473061048828, 0.11747860355297193, 0.09686473061048828, 0.19967539374676238, 0.0, 0.0, 0.0, 0.11747860355297193, 0.19967539374676238, 0.6996525364356248, 0.17185198507628535, 0.0, 0.11747860355297193, 0.15525364970917988, 0.19967539374676238, 0.2614772151502125, 0.09686473061048828, 0.3700840630854316] [0.14171045350929856, 0.06209535810168242, 0.19919591161113917, 0.19919591161113917, 0.0, 0.0, 0.5225437281955799, 0.5481362932314325, 0.0, 0.18743980465004675, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18743980465004675, 0.31596346419397225, 0.18743980465004675, 0.0, 0.0, 0.0, 0.0, 0.19919591161113917, 0.0, 0.1190647736525464, 0.0, 0.0, 0.0, 0.0]


In [26]:
df['Topic 0'] = topic_0
df['Topic 1'] = topic_1

In [27]:
print(df)

     0            1   Topic 0   Topic 1
0    0        banks  0.624705  0.141710
1    1        broke  0.462059  0.062095
2    2     campaign  0.000000  0.199196
3    3      clinton  0.000000  0.199196
4    4     collects  0.096865  0.000000
5    5       course  0.096865  0.000000
6    6    dependent  0.000000  0.522544
7    7      economy  0.050768  0.548136
8    8      flowing  0.117479  0.000000
9    9       global  0.000000  0.187440
10  10        green  0.171852  0.000000
11  11         grow  0.171852  0.000000
12  12        grows  0.096865  0.000000
13  13     homeward  0.117479  0.000000
14  14       larger  0.096865  0.000000
15  15        levee  0.199675  0.000000
16  16      limited  0.000000  0.187440
17  17      measure  0.000000  0.315963
18  18       number  0.000000  0.187440
19  19      refuses  0.117479  0.000000
20  20    regularly  0.199675  0.000000
21  21        river  0.699653  0.000000
22  22       rushes  0.171852  0.000000
23  23       saying  0.000000  0.199196


In [28]:
df.drop(0, axis=1)

           1   Topic 0   Topic 1
0      banks  0.624705  0.141710
1      broke  0.462059  0.062095
2   campaign  0.000000  0.199196
3    clinton  0.000000  0.199196
4   collects  0.096865  0.000000
5     course  0.096865  0.000000
6  dependent  0.000000  0.522544
7    economy  0.050768  0.548136
8    flowing  0.117479  0.000000
9     global  0.000000  0.187440


In [29]:
df.rename(columns={1:'Word'}, inplace=True)

In [30]:
df.drop(0, axis=1)

        Word   Topic 0   Topic 1
0      banks  0.624705  0.141710
1      broke  0.462059  0.062095
2   campaign  0.000000  0.199196
3    clinton  0.000000  0.199196
4   collects  0.096865  0.000000
5     course  0.096865  0.000000
6  dependent  0.000000  0.522544
7    economy  0.050768  0.548136
8    flowing  0.117479  0.000000
9     global  0.000000  0.187440


In [31]:
df.to_csv('../outputs/toy_WTT.csv')

This print/display code must be somewhere in the sklearn documentation: both [Alan Riddell][] and [Aneesha Bakharia][] use a version of it.

[Alan Riddell]: https://de.dariah.eu/tatom/topic_model_python.html
[Aneesha Bakharia]: https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730

In [32]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}: ".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
display_topics(model, tfidf_feature_names, no_top_words)

Topic 0: 
river banks broke water topped system regularly levee grow rushes
Topic 1: 
economy dependent measure campaign clinton saying limited global number banks
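
The slice `topic.argsort()[:-no_top_words - 1:-1]` is compact but cryptic: it sorts the column indices by weight, then walks backwards from the end to pick the heaviest ones. A plain-Python sketch of the same idea follows, using a few Topic 1 weights from the table above; `top_words` is a hypothetical helper, and the order of tied weights may differ from what `argsort` gives.

```python
# A handful of Topic 1 weights copied from the words-to-topics table.
feature_names = ["banks", "dependent", "economy", "measure", "river"]
topic = [0.141710, 0.522544, 0.548136, 0.315963, 0.0]

def top_words(topic, feature_names, n):
    """Return the n feature names with the largest weights (hypothetical helper)."""
    ranked = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)
    return [feature_names[i] for i in ranked[:n]]

print(top_words(topic, feature_names, 3))  # ['economy', 'dependent', 'measure']
```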


Mostly what this reveals is that I need to go back and tweak my toy corpus so that the topics can be more clearly seen.

## Staying within the `sklearn` ecosystem

What if we do all tokenization and normalization in `sklearn`?

In [41]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# This is the bog-standard version from the documentation
sk_test_vec = TfidfVectorizer(
    lowercase = True, 
    preprocessor = None, 
    tokenizer = None, 
    stop_words = stopwords, 
    ngram_range = (1, 1), 
    analyzer = u'word', 
    max_df = 1.0, 
    min_df = 1, 
    max_features = None, 
    vocabulary = None, 
    binary = False)

In [42]:
sk_test_data = sk_test_vec.fit_transform(sentences)

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [35]:
from sklearn.decomposition import NMF

topics = 2
sk_test_model = NMF(n_components = topics, 
                    random_state = 1,
                    alpha = 0.5,
                    l1_ratio = 0.5,
                    solver = 'cd').fit(sk_test_data)
print(sk_test_model)

NMF(alpha=0.5, beta=1, eta=0.1, init=None, l1_ratio=0.5, max_iter=200,
  n_components=2, nls_max_iter=2000, random_state=1, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)


In [36]:
W = sk_test_model.fit_transform(sk_test_data)
H = sk_test_model.components_

To my non-mathematical mind, it just seems pretty amazing that you can *see* so clearly the division of the larger matrix by the topics. First, we have the document-topic matrix and we can see the assignment of topics to the documents. (And this is with a terrible set of texts in which it's not clear there are any particular topics -- so, I guess, a lot like life, really.)

In [37]:
print(W)

[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]


And then here are the words spread across the two topics in a `2 x 61` matrix:

In [38]:
print(H)

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.]]


In [39]:
feature_names = sk_test_data.get_feature_names()
print(feature_names)

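AttributeError: get_feature_names not found

The error makes sense on reflection: `sk_test_data` is the sparse matrix returned by `fit_transform`, and the feature names live on the vectorizer object rather than on the matrix it returns, so the call should be `sk_test_vec.get_feature_names()`.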