# Extracting topics from text documents

Sometimes you have a nice big set of documents, and all you wish for is to know what's hiding inside. But without reading them, of course! Two approaches to try to lazily get some information from your texts are **topic modeling** and **clustering**.

I'm going to tell you a big secret: **computers are really really really bad at reading documents and figuring out what they're about.** Text is for _people_ to read, people with a collective knowledge of The World At Large and a history of reading things and all kinds of other tricky secret little things we don't think about that help us understand what a piece of text means.

When it comes to understanding content, computers are good at doing _very specific things_ in _very specific situations_. Or, alternatively, at doing a not-that-great job when you aren't going to be terribly picky about the results.

Do I sound a little biased? Oh, but aren't we all. It isn't going to stop us from talking about it, though!

Before we start, **let's make some assumptions:**

* When you're dealing with documents, each document is (typically) about something.
* You know what each document is about by looking at the words in the document.
* Documents with similar words are probably about similar things. 
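
To make that last assumption concrete, here's a minimal sketch that compares documents purely by the words they share. The three "documents" are made up for illustration, and plain word counts with cosine similarity are just one common way to measure overlap, not necessarily what any particular library does under the hood.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product only needs the words both texts contain
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy "documents" made up for illustration
salad = "olive oil garlic tomato feta"
pasta = "olive oil garlic tomato basil"
cake = "flour sugar egg butter vanilla"

salad_vs_pasta = cosine_similarity(salad, pasta)
salad_vs_cake = cosine_similarity(salad, cake)
print(salad_vs_pasta, salad_vs_cake)
```

The salad and the pasta share four of their five words, so they score high. The salad and the cake share nothing, so they score zero.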

We have two major options available to us: **topic modeling** and **clustering**. There's a lot of NLP nuance going on between the two, but we're going to keep it simple:

**Topic modeling** is if each document can be about **multiple topics**. There might be 100 different topics, and a document might be 30% about one topic, 20% about another, and then 50% spread out between the others.

**Clustering** is if each document should only fit into **one topic**. It's an all-or-nothing approach.
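
Here's a tiny sketch of the difference, using made-up topic names and weights for a single document. Picking the strongest topic is only a stand-in for what a clustering algorithm does, but it shows the "mix" versus "one label" idea.

```python
# Hypothetical topic weights for one document (made-up numbers)
topic_weights = {"baking": 0.6, "dairy": 0.4, "spicy": 1.0}

# Topic modeling view: the document is a mix, so express each weight as a share
total = sum(topic_weights.values())
shares = {topic: weight / total for topic, weight in topic_weights.items()}

# Clustering view: the document gets exactly one label
# (here simply the strongest topic, as a stand-in for a cluster assignment)
single_label = max(topic_weights, key=topic_weights.get)

print(shares)
print(single_label)
```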

The most important part of _all of this_ is the fact that **the computer figures out these topics by itself**. You don't tell it what to do! If you're teaching the algorithm what different specific topics look like, that's **classification.** In this case we're just saying "hey computer, please figure this out!"

Let's get started.

## Preparing our datasets

### Recipes

We're going to start with analyzing **about 36,000 recipes**. Food is interesting because you can split it so many ways: by courses, or by baked goods vs meat vs vegetables vs others, by national cuisine...

In [19]:
import pandas as pd
pd.set_option("display.max_colwidth", 200)

recipes = pd.read_csv("data/recipes.csv")
recipes.head()

Unnamed: 0,cuisine,id,ingredient_list
0,greek,10259,"romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese crumbles"
1,southern_us,25693,"plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, milk, vegetable oil"
2,filipino,20130,"eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, soy sauce, butter, chicken livers"
3,indian,22213,"water, vegetable oil, wheat, salt"
4,indian,13162,"black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, chili powder, passata, oil, ground cumin, boneless chicken skinless thigh, garam ma..."


In order to analyze the text, we'll need to count the words in each recipe. To do that we're going to use a **stemmed TF-IDF vectorizer** from scikit-learn.

* **Stemming** will allow us to combine words like `tomato` and `tomatoes`
* Using **TF-IDF** will allow us to devalue common ingredients like salt and water
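
To see why TF-IDF pushes down common ingredients, here's a rough sketch using the textbook formula (term frequency times `log(N / df)`) on three made-up recipes. scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row, so its numbers will differ, but the ordering idea is the same.

```python
import math

# Three tiny made-up recipes, already split into ingredients
toy_recipes = [
    ["salt", "water", "flour"],
    ["salt", "tomato", "basil"],
    ["salt", "water", "saffron"],
]

def tf_idf(term, recipe, all_recipes):
    # Term frequency: how often the term appears in this recipe
    tf = recipe.count(term) / len(recipe)
    # Document frequency: how many recipes contain the term at all
    df = sum(1 for r in all_recipes if term in r)
    # Inverse document frequency: rare terms get a bigger boost
    idf = math.log(len(all_recipes) / df)
    return tf * idf

salt_score = tf_idf("salt", toy_recipes[2], toy_recipes)
water_score = tf_idf("water", toy_recipes[2], toy_recipes)
saffron_score = tf_idf("saffron", toy_recipes[2], toy_recipes)
print(salt_score, water_score, saffron_score)
```

Salt is in every recipe, so it ends up with the lowest score, while saffron, which shows up only once, gets the highest.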

I'm using the code from [the reference section](https://investigate.ai/reference/vectorizing/#stem-and-vectorize), just adjusted from a `CountVectorizer` to a `TfidfVectorizer`, and set it so ingredients have to appear in at least **fifty recipes**.
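
If the "at least fifty recipes" setting feels abstract, this sketch shows the same idea by hand on made-up data, with a threshold of two instead of fifty. The `frequent_terms` helper is hypothetical. It mimics what an integer `min_df` does, which is to count how many documents contain a term rather than how many times the term appears.

```python
from collections import Counter

# Made-up ingredient lists; each inner list is one recipe
toy_recipes = [
    ["salt", "flour", "egg"],
    ["salt", "egg", "saffron"],
    ["salt", "flour", "yuzu"],
    ["salt", "egg", "flour"],
]

def frequent_terms(docs, min_df):
    """Keep terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term repeated within one recipe counts once
        doc_freq.update(set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

vocabulary = frequent_terms(toy_recipes, min_df=2)
print(vocabulary)
```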

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

# Override TfidfVectorizer so every token is stemmed after tokenizing
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=50)
matrix = vectorizer.fit_transform(recipes.ingredient_list)

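# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, use get_feature_names() instead
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())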
words_df.head()

Unnamed: 0,activ,adobo,agav,alfredo,all,allspic,almond,amchur,anaheim,ancho,...,wrapper,yam,yeast,yellow,yoghurt,yogurt,yolk,yukon,zest,zucchini
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.278745,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.276,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.210575,0.0,0.0,0.0,0.0


Looks like we have 752 ingredients! Yes, there are some numbers in there and probably other things we aren't interested in, but let's stick with it for now.

## Topic modeling

There are multiple techniques for topic modeling, but in the end they do the same thing: **you get a list of topics, and a list of words associated with each topic.**
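
For NMF (non-negative matrix factorization, one such technique), those two lists are literally two matrices: one scoring each document against each topic, and one scoring each topic against each word. In scikit-learn the topic-word matrix lives in `model.components_` and the document-topic matrix is what `model.transform(...)` returns. The sketch below uses small made-up matrices just to show the shapes and how they multiply back together.

```python
import numpy as np

# Made-up factors: 3 documents, 2 topics, 4 words
doc_topic = np.array([      # one row per document, one column per topic
    [1.0, 0.0],
    [0.0, 2.0],
    [0.5, 0.5],
])
topic_word = np.array([     # one row per topic, one column per word
    [0.9, 0.8, 0.0, 0.0],   # topic 0 leans on the first two words
    [0.0, 0.1, 0.7, 0.6],   # topic 1 leans on the last two words
])

# Multiplying them back together approximates the document-word matrix
reconstructed = doc_topic @ topic_word
print(reconstructed.shape)
print(reconstructed[2])
```

The third document is half topic 0 and half topic 1, so its reconstructed row is an even blend of both topics' words.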

Let's tell it to break them down into **five topics.**

In [54]:
from sklearn.decomposition import NMF

model = NMF(n_components=5)
model.fit(matrix)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

Why five topics? **Because we have to tell it _something_.** Our job is to decide the number of topics, and it's the computer's job to find the topics. We'll talk about how to pick the "right" number later, but for now: it's magic.

Fitting the model allowed it to "learn" what the ingredients are and how they're organized. Now we just need to find out what's inside. Let's ask for the **top ten terms in each group.**

In [55]:
n_words = 10
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += " ".join([feature_names[i]
                         for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
print()

Topic #0: oliv pepper fresh oil dri garlic salt parsley red tomato
Topic #1: flour egg sugar purpos all butter bake milk larg powder
Topic #2: sauc soy sesam rice oil ginger sugar chicken vinegar garlic
Topic #3: ground chili cilantro cumin powder lime onion pepper chop fresh
Topic #4: chees shred cream parmesan cheddar grate tortilla mozzarella sour chicken
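
The `topic.argsort()[:-n_words - 1:-1]` slice in that loop is a bit cryptic, so here's a small sketch of what it does, using made-up words and weights for a single topic.

```python
import numpy as np

# Made-up weights for one topic over a five-word vocabulary
words = ["salt", "soy", "flour", "lime", "egg"]
weights = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
n_words = 3

# argsort() returns indices ordered from smallest to largest weight
ascending = weights.argsort()

# [:-n_words - 1:-1] walks backwards from the end and stops after n_words
# items, giving the indices of the n_words largest weights, biggest first
top_indices = ascending[:-n_words - 1:-1]
top_words = [words[i] for i in top_indices]
print(top_words)
```

Reversing first and then taking the first few, as in `ascending[::-1][:n_words]`, gives the same result and may be easier to read.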



Those actually seem like _pretty good topics_. Italian-ish, then baking, then Chinese, maybe Latin American or Indian food, and then dairy. What if we did it with **fifteen topics** instead?

In [56]:
model = NMF(n_components=15)
model.fit(matrix)

n_words = 10
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += " ".join([feature_names[i]
                         for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
print()

Topic #0: pepper bell red green onion celeri flake black tomato crush
Topic #1: flour purpos all bake powder butter soda buttermilk salt egg
Topic #2: sauc soy sesam oil ginger rice sugar garlic scallion starch
Topic #3: tortilla cream shred chees sour cheddar salsa corn bean jack
Topic #4: chees parmesan grate mozzarella pasta ricotta basil italian fresh spinach
Topic #5: lime cilantro fresh chop juic jalapeno chile avocado chili fish
Topic #6: chicken breast boneless skinless broth halv sodium low fat thigh
Topic #7: ground black pepper cumin cinnamon salt beef cayenn kosher paprika
Topic #8: chili seed powder cumin coriand masala garam curri ginger coconut
Topic #9: sugar egg vanilla milk extract larg cream butter yolk unsalt
Topic #10: oliv extra virgin oil clove garlic fresh salt tomato parsley
Topic #11: white wine vinegar rice shallot red salt grain mustard sugar
Topic #12: dri oregano tomato thyme parsley garlic bay basil leaf onion
Topic #13: lemon juic fresh orang zest parsle

This is where we start to see **the big difference between categories and topics**. The grouping with five groups seemed very much like cuisines - Italian, Chinese, etc. But now that we're breaking it down further, the groups have changed a bit.

They're now **more like classes of ingredients.** Chicken gets a category - `chicken breast boneless skinless` - and so do generic European ingredients - `oliv extra virgin oil clove garlic fresh salt`. The algorithm got a little confused about black pepper vs. hot pepper flakes vs. red/green bell peppers when it created `pepper bell red green onion celeri flake black`, but we understand what it's going for.

Remember, the important thing about topic modeling is that every row in our dataset is a **combination of topics**. It might be a little bit about one thing, a little bit less about another, etc etc. Let's take a look at how that works.

In [59]:
# Get a list of topics - yes, they're named poorly
topic_list = [f"topic_{i}" for i in range(model.n_components_)]

# Convert our TF-IDF matrix into a score for each topic
percentages = model.transform(matrix)

# Set it up as a dataframe
topics = pd.DataFrame(percentages, columns=topic_list)
topics.head(2) * 100

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14
0,1.438302,0.0,0.0,2.665898,1.615467,0.22474,0.0,0.507691,0.0,0.0,2.561061,0.0,0.480989,0.0,0.0
1,2.447042,3.041425,0.176429,1.408264,0.0,0.0,0.0,5.7155,0.49568,1.583069,0.548611,0.0,1.620243,0.0,0.243763


Our first recipe is primarily `topic_3` with a score of 2.67, closely followed by `topic_10` at 2.56, and it's also a bit topic 4 and topic 0 with scores of 1.62 and 1.44.

Our second recipe is a bit bolder - it scores a whopping 5.7 in `topic_7`, with topics 1 and 0 trailing at about 3.0 and 2.4.
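
Reading fifteen columns by eye is error-prone, so here's a small sketch that ranks the topics for the first recipe. The scores are copied from row 0 of the table above. Note that NMF scores aren't true percentages, so the last step rescales them into shares of the row total, which is one reasonable way (an assumption, not the only option) to make rows comparable.

```python
# Scores for the first recipe, copied from row 0 of the table above
row_0 = [1.438302, 0.0, 0.0, 2.665898, 1.615467, 0.22474, 0.0, 0.507691,
         0.0, 0.0, 2.561061, 0.0, 0.480989, 0.0, 0.0]

# Pair each score with its topic name, then sort from strongest to weakest
ranked = sorted(
    ((f"topic_{i}", score) for i, score in enumerate(row_0)),
    key=lambda pair: pair[1],
    reverse=True,
)
top_three = ranked[:3]

# Dividing by the row total turns raw scores into shares that sum to 1
total = sum(row_0)
shares = {name: score / total for name, score in ranked}

print(top_three)
print(round(shares["topic_3"], 2))
```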

Let's combine this topics dataframe with our **original dataframe** to see what is going on here.

In [None]:
merged = recipes.merge(topics, right_index=True, left_index=True)
merged.head(2)

Now we can do things like...

* Uncover possible topics discussed in the dataset
* See how many documents cover each topic
* Find the top documents in each topic
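
As a rough sketch of the last two ideas, here's how they might look in pandas. The `toy` dataframe is a made-up stand-in for `merged` with only two topic columns, and "strongest topic wins" is just one simple way to decide which topic a recipe "covers".

```python
import pandas as pd

# Made-up stand-in for the merged dataframe: one row per recipe,
# one column per topic
toy = pd.DataFrame({
    "cuisine": ["greek", "indian", "italian", "mexican"],
    "topic_0": [0.8, 0.1, 0.6, 0.0],
    "topic_1": [0.1, 0.7, 0.2, 0.9],
})
topic_cols = ["topic_0", "topic_1"]

# The strongest topic for each recipe
toy["top_topic"] = toy[topic_cols].idxmax(axis=1)

# How many recipes lean mostly on each topic
recipes_per_topic = toy["top_topic"].value_counts()

# The recipes that score highest on one particular topic
best_for_topic_1 = toy.sort_values("topic_1", ascending=False).head(2)

print(recipes_per_topic)
print(best_for_topic_1[["cuisine", "topic_1"]])
```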

There's a lot lot lot more to say on topic modeling - other techniques, other critiques, as well as fun visualizations - so be sure to check out the follow-up after you read the comparison with clustering down below!

## Clustering

Clustering's major difference is that **each category is kept completely separate.**
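
For k-means, the clustering technique used in this section, "completely separate" comes from how it works: every document is assigned to whichever cluster center is closest, then each center moves to the average of its members, and the two steps repeat. Below is a bare-bones sketch of one such round on made-up 2-D points with made-up starting centers. The real `KMeans` runs many rounds on the full TF-IDF matrix and picks its own starting points.

```python
import math

# Made-up 2-D points standing in for documents
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
# Made-up starting centers for two clusters
centers = [(0.0, 0.0), (10.0, 10.0)]

def nearest_center(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

# Assignment step: every point gets exactly one cluster label
labels = [nearest_center(p, centers) for p in points]

# Update step: move each center to the average of its members
# (this toy data never leaves a cluster empty)
new_centers = []
for k in range(len(centers)):
    members = [p for p, label in zip(points, labels) if label == k]
    new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))

print(labels)
print(new_centers)
```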

Let's do the same thing with clustering that we did with topic modeling, starting with breaking things into **five categories.**

In [57]:
%%time

from sklearn.cluster import KMeans

km = KMeans(n_clusters=5)
km.fit(matrix)

CPU times: user 7min 37s, sys: 2.42 s, total: 7min 40s
Wall time: 7min 54s


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

It's a fair bit slower, but that's partly because the NMF we used for topic modeling is about as fast as topic modeling gets.

Let's see what the top five words are for each of the clusters.

In [58]:
print("Top terms per cluster:")

order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
n_words = 5
for i in range(km.n_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :n_words]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: ground chili cilantro cumin lime
Cluster 1: sugar flour egg butter purpos
Cluster 2: pepper oliv fresh oil salt
Cluster 3: sauc soy sesam oil rice
Cluster 4: chees shred cream parmesan cheddar
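
Each row of `km.cluster_centers_` is the average TF-IDF vector of the recipes in that cluster, so the "top terms" are simply the columns with the biggest averages. Here's a small sketch of the `argsort()[:, ::-1]` step on made-up centers with a four-word vocabulary.

```python
import numpy as np

# Made-up cluster centers: 2 clusters over a four-word vocabulary
terms = ["soy", "flour", "lime", "egg"]
centers = np.array([
    [0.9, 0.0, 0.4, 0.1],   # cluster 0
    [0.0, 0.8, 0.1, 0.6],   # cluster 1
])

# argsort() sorts each row's indices from smallest to largest value;
# [:, ::-1] reverses every row so the heaviest terms come first
order = centers.argsort()[:, ::-1]

top_two = [[terms[i] for i in order[k, :2]] for k in range(len(centers))]
print(top_two)
```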


Nothing too surprising there! Pretty much the same thing we got with topic modeling. Now let's break it into **fifteen groups**.

In [62]:
%%time

km = KMeans(n_clusters=15)
km.fit(matrix)

print("Top terms per cluster:")
n_words = 5
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(km.n_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :n_words]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: lemon juic fresh oliv pepper
Cluster 1: lime cilantro fresh juic chili
Cluster 2: sauc soy sesam oil ginger
Cluster 3: flour purpos all bake egg
Cluster 4: pepper bell green onion red
Cluster 5: salt water sugar butter pepper
Cluster 6: chees tortilla shred cheddar cream
Cluster 7: dri oliv fresh pepper tomato
Cluster 8: vanilla sugar extract egg cream
Cluster 9: chees parmesan grate mozzarella pepper
Cluster 10: fish sauc lime fresh thai
Cluster 11: seed masala chili cumin coriand
Cluster 12: chicken boneless breast skinless halv
Cluster 13: virgin extra oliv oil fresh
Cluster 14: ground pepper cumin salt black
CPU times: user 11min 53s, sys: 17.5 s, total: 12min 10s
Wall time: 11min 1s


In [63]:
%%time

km = KMeans(n_clusters=20)
km.fit(matrix)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(km.n_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: vanilla extract sugar egg butter
Cluster 1: pepper bell green onion red
Cluster 2: flour all purpos egg salt
Cluster 3: chees parmesan grate mozzarella pepper
Cluster 4: salt pepper onion oil water
Cluster 5: ground pepper cumin salt powder
Cluster 6: chicken boneless breast skinless halv
Cluster 7: bake flour powder purpos all
Cluster 8: seed chili masala cumin coriand
Cluster 9: virgin extra oliv oil fresh
Cluster 10: sauc soy oil ginger starch
Cluster 11: sesam soy seed sauc oil
Cluster 12: coconut milk curri past lime
Cluster 13: dri oliv tomato fresh wine
Cluster 14: lemon juic fresh oliv pepper
Cluster 15: chees tortilla shred cheddar cream
Cluster 16: sugar cream orang milk egg
Cluster 17: fish sauc lime fresh peanut
Cluster 18: lime cilantro fresh chili juic
Cluster 19: flat leaf parsley oliv pepper
CPU times: user 11min 2s, sys: 15 s, total: 11min 17s
Wall time: 10min 27s


In [None]:
%%time

km = KMeans(n_clusters=30)
km.fit(matrix)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(km.n_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))