# Topic Modeling Lab
## Contents
1. [Setup](#Section-1%3A-Setup)
    1. [Import](#1.1-Import-Packages)
    1. [Download](#1.2-Download-and-Prepare-Data)
    1. [Read](#1.3-Read-Data)
    1. [Pick Your Data](#1.4-Pick-Your-Data)
    1. [Clean Text](#1.5-Clean-Text)
1. [Converting Text to Numbers](#Section-2%3A-Converting-Text-to-Numbers)
1. [LDA Topic Model](#Section-3%3A-LDA-Topic-Model)
1. [Interpret Topics](#Section-4%3A-Interpret-Topics)
1. [Topic Quality](#Section-5%3A-Topic-Quality)
1. [Topic Popularity](#Section-6%3A-Topic-Popularity)
1. [NMF Topic Model](#Section-7%3A-NMF-Topic-Model)
1. [What We Learned](#Section-8%3A-What-We-Learned)

## Section 0: Background
- In this lab, we'll learn about topic modeling. Topic modeling uses statistics to understand what text is about, that is, to find the topics in text.
- We'll use the online dating profile text that OKCupid made public as our example, but of course topic modeling can be used on any text.

@Author: [Jeff Lockhart](http://www-personal.umich.edu/~jwlock/) & [Ed Platt](https://elplatt.com/), with some code adapted from [Aneesha Bakharia](https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730)'s example.

## Section 1: Setup
### 1.1 Import Packages
- Packages contain code others have written to make our work easier.

In [None]:
!pip install lxml

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()
from scipy.stats import pearsonr  
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from bs4 import BeautifulSoup
import warnings
import matplotlib.pyplot as plt
%matplotlib inline

### 1.2 Download and Prepare Data
This code checks whether you have the data. If you don't, it will download and prepare it for you. To see how it works, look at lab `1 Data munging` which explains it in detail.

In [None]:
%run -i "download_and_clean_data.py"
print('Ready to go!')

### 1.3 Read Data

In [None]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')
profiles.head(2)

### 1.4 Pick Your Data
- Pick which section of the profiles you want to analyze.
#### Options:
- `text` - All of the text from a profile (**Recommended**)
- `essay0` - My self summary (**Recommended**)
- `essay1` - What I’m doing with my life
- `essay2` - I’m really good at
- `essay3` - The first thing people usually notice about me
- `essay4` - Favorite books, movies, show, music, and food
- `essay5` - The six things I could never do without
- `essay6` - I spend a lot of time thinking about
- `essay7` - On a typical Friday night I am
- `essay8` - The most private thing I am willing to admit
- `essay9` - You should message me if...

#### Replace `essay0` in the cell below with the essay you want to look at.
- `text` and `essay0` are both recommended, but it's your choice.

In [None]:
profile_section_to_use = 'essay0'

### 1.5 Clean Text 
- For this lab, it is not so important that you understand this code. 
- For now, just run it and move on. 

In [None]:
# Some of the essays have just a link in the text. BeautifulSoup sees that and gets 
# the wrong idea. This line hides those warnings.
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
def clean(text):
    if pd.isnull(text):
        t = np.nan
    else:
        t = BeautifulSoup(text, 'lxml').get_text()
        t = t.lower()

        bad_words = ['http', 'www', '\nnan']

        for b in bad_words:
            t = t.replace(b, '')
    if t == '':
        t = np.nan
    
    return t

In [None]:
print('Cleaning up profile text for', profile_section_to_use, '...')
profiles['clean'] = profiles[profile_section_to_use].progress_apply(clean)

print('We started with', profiles.shape[0], 'profiles.')
print("Dropping profiles that didn't write anything for the essay we chose...")
profiles.dropna(axis=0, subset=['clean'], inplace=True)

text_cols = ['text', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 
             'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
profiles.drop(columns=text_cols, inplace=True)

#what we will use as our documents, here the cleaned up text of each profile
documents = profiles['clean'].values

print('We have', documents.shape[0], 'profiles left.')

## Section 2: Converting Text to Numbers
- Our first model takes "count vectors" as input, that is, a count of how many times each word shows up in each document. 
    - Here we tell it to only use the 1,000 most popular words, ignoring stop words like "a" and "of".
    - We use the abbreviation `tf` for these because they represent "term frequency," i.e., how often each word (term) shows up in the text.

<div class="alert-info">
 
#### Short Answer No.1
- (1) Why do we want to ignore stop words? (1-2 sentences)
- (2) Give 3 more examples of stop words 
    
    
</div>

🤔 **Write your response here:**
...



In [None]:
tf_vectorizer = CountVectorizer(max_features=1000, stop_words='english')

print("Vectorizing text by word counts...")
tf_text = tf_vectorizer.fit_transform(documents)

tmp = tf_text.get_shape()
print("Our transformed text has", tmp[0], "rows and", tmp[1], "columns.")

#### See what words are being counted

In [None]:
tf_words = tf_vectorizer.get_feature_names_out()

print("The first few words (alphabetically) are:\n\n", tf_words[:20])

#### See an example of how a profile's text is encoded
- `n` is the profile number you want to look at. Change the value of `n` and re-run the code to see different profiles.
- Note that only some of the words are counted. This is because we set `max_features=1000` in the vectorizer, so it counts only the 1,000 most common words and ignores the rest. 
    - You can change that number to be bigger or smaller and see what happens.
    - We found in Lab 1 that 1,000 is a good choice for this data because words less popular than that show up in fewer than 1% of all profiles.

In [None]:
n = 6

def show_vector(x, words):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        print(words[col], '\t', x[row,col].round(2))

print('Profile text:\n', documents[n])
print('\nTF (count) vector:')
show_vector(tf_text[n], tf_words)

<div class="alert-info">
    
#### Short Answer No.2 

- (1) Given the essay questions described above, list 5 words that you predict will be the most popular in essay4. 
    
    
</div>

🤔 **Write your response here:**
...



In [None]:
profile_section_to_use = 'essay4'

In [None]:
# In this cell, let's find the 10 most popular words and list them alphabetically. 
# Look at the sample code above and try to fill in the "???".

profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')

print('Cleaning up profile text for', profile_section_to_use, '...')
profiles['clean'] = profiles[profile_section_to_use].progress_apply(clean)

print('We started with', profiles.shape[0], 'profiles.')
print("Dropping profiles that didn't write anything for the essay we chose...")
profiles.dropna(axis=0, subset=['clean'], inplace=True)

text_cols = ['text', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 
             'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
profiles.drop(columns=text_cols, inplace=True)

#what we will use as our documents, here the cleaned up text of each profile
documents = profiles['clean'].values


tf_vectorizer_your_turn = CountVectorizer(???)

print("Vectorizing text by word counts...")
tf_text_your_turn = tf_vectorizer_your_turn.fit_transform(documents)

tmp_your_turn = tf_text_your_turn.get_shape()
print("Our transformed text has", tmp_your_turn[0], "rows and", tmp_your_turn[1], "columns.")

tf_words_your_output = tf_vectorizer_your_turn.get_feature_names_out()

print("The first few words (alphabetically) are:\n\n", tf_words_your_output[:20])

<div class="alert-info">
    
- (2) Did any of your predicted words make it into the top 10? (No right or wrong answer!) 
    
</div>

🤔 **Write your response here:**
...



In [None]:
# Now let's focus on essay0 again.

profile_section_to_use = 'essay0'

profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')

print('Cleaning up profile text for', profile_section_to_use, '...')
profiles['clean'] = profiles[profile_section_to_use].progress_apply(clean)

print('We started with', profiles.shape[0], 'profiles.')
print("Dropping profiles that didn't write anything for the essay we chose...")
profiles.dropna(axis=0, subset=['clean'], inplace=True)

text_cols = ['text', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 
             'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
profiles.drop(columns=text_cols, inplace=True)

#what we will use as our documents, here the cleaned up text of each profile
documents = profiles['clean'].values


tf_vectorizer_your_turn = CountVectorizer(max_features=10, stop_words='english')

print("Vectorizing text by word counts...")
tf_text_your_turn = tf_vectorizer_your_turn.fit_transform(documents)

tmp_your_turn = tf_text_your_turn.get_shape()
print("Our transformed text has", tmp_your_turn[0], "rows and", tmp_your_turn[1], "columns.")

tf_words_your_output = tf_vectorizer_your_turn.get_feature_names_out()

print("The first few words (alphabetically) are:\n\n", tf_words_your_output[:20])

## Section 3: LDA Topic Model

- LDA stands for Latent Dirichlet Allocation. The statistical math behind it is complicated, but its goals are simple:
    - find groups of words that often show up together and call those groups topics. 
    - find topics that can be used to tell documents apart, i.e. topics that are in some documents but not others.
- LDA is one of the most popular methods for topic modeling.
- [Learn more](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) about LDA
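Before running LDA on the real profiles, it can help to see the shapes of its inputs and outputs on a toy example. The sketch below uses four made-up documents and asks for 2 topics; the `toy_` names are invented for illustration, and with so little text the topics themselves should not be taken seriously.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up documents: two about the outdoors, two about entertainment
toy_docs = [
    "dog hiking trail dog camping",
    "hiking camping trail outdoors",
    "movies music concert movies",
    "music concert band movies",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# components_ has one row per topic and one column per word
print("Topic-word matrix shape:", toy_lda.components_.shape)
# Each row says how much of a document belongs to each topic; rows sum to 1
print(np.round(toy_doc_topics, 2))
```

The two outputs play the same roles as `lda_topics` (topics by words) and `profile_topics` (profiles by topics) in this lab.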

### Step 1: Decide how many topics we want to find
- We must tell LDA how many topics we want it to look for (we will do this with the `ntopics` variable below.)

<div class="alert-info">
    
#### Short Answer No.3
- How do you think the number of topics you choose will affect the results? (Think about extreme cases, like picking only 1 topic versus picking 10,000 topics.) (1-2 sentences)
- Choose 1 essay and describe the types of topics you would expect to find.
    
</div>

🤔 **Write your response here:**
...



In [None]:
#how many topics we want our model to find
ntopics = 15

#how many top words we want to display for each topic
nshow = 10

### Step 2: Run the LDA algorithm
- LDA can be a little slow. We'll use a faster method later on.
- Set `n_jobs=` to the number of processors you want to use to compute LDA. If you set it to `-1`, it will use all available processors. 

In [None]:
model = LatentDirichletAllocation(n_components=ntopics, max_iter=10, 
                                  learning_method='online', n_jobs=-1, random_state=42)

print('Performing LDA on vectors. This may take a while...')
lda = model.fit(tf_text)
lda_topics = lda.components_

print('Done!')

## Section 4: Interpret Topics
#### Some helper functions 
Don't worry about how these work right now. Just run them and scroll down. We'll use them to make our analysis easier later on.

In [None]:
def describe_topic(topic, feature_names, n_words=10):
    words = []
    # sort the words in the topic by importance
    topic = topic.argsort() 
    # select the n_words most important words
    topic = topic[:-n_words - 1:-1]
    # for each important word, get its name (i.e. the word) from our list of names
    for i in topic:
        words.append(feature_names[i])
    # return the topic's most important words, separated by spaces
    return " ".join(words)

def display_topics(components, feature_names, n_words=10):
    # loop through each topic (component) in the model; show its top words
    for topic_idx, topic in enumerate(components):
        print("Topic {}:".format(topic_idx), 
              describe_topic(topic, feature_names, n_words))
    return

def find_intersection(idxa, idxb, n):
    a = set()
    b = set()
    both = set()
    i = 0
    while len(both) < n:
        a.add(idxa[i])
        b.add(idxb[i])
        both = a.intersection(b)
        i += 1
    return list(both)

def compare_topic_words(topics, a, b, words, how='overlap', n_words=nshow):
    b_sort = False
    if how == 'difference':
        b_sort = True
    
    dfa = pd.DataFrame(topics, columns=words).T
    idxa = dfa.sort_values(by=a, ascending=False).index.values
    idxb = dfa.sort_values(by=b, ascending=b_sort).index.values
    both = find_intersection(idxa, idxb, n=n_words)
    
    out = how + ' between ' + str(a) + ' and ' + str(b) + ':'
    for w in both:
        out += ' ' + w
    print(out)
    return

def blue_matrix(cells, xl, yl, t, x_labels=None):
    n = cells.shape[0]
    # plt.subplots already creates the figure, so no separate plt.figure() call is needed
    fig, ax = plt.subplots(figsize=(8,8))
    plt.imshow(cells, cmap='Blues')
    ax.xaxis.tick_top()
    ax.xaxis.set_label_position('top') 
    if x_labels is not None:
        plt.xticks(range(n), x_labels)
    else:
        plt.xticks(range(n))
    plt.yticks(range(n))
    plt.ylabel(yl)
    plt.xlabel(xl)
    plt.title(t)
    # show a colorbar legend
    plt.colorbar()
    return

### Step 1: Show our topics with the top words in each

In [None]:
display_topics(lda_topics, tf_words, n_words=10)
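If you are curious how the `describe_topic` helper picks the top words, the key trick is `argsort` followed by a backwards slice. Here is a small sketch with made-up weights for a single topic; the words and numbers are invented for illustration.

```python
import numpy as np

# Made-up weights for one topic over a six-word vocabulary (illustrative only)
toy_words = np.array(["city", "dog", "food", "hiking", "music", "travel"])
toy_topic = np.array([0.1, 2.5, 0.3, 4.0, 0.2, 1.7])

order = toy_topic.argsort()      # word positions, from smallest weight to largest
top3 = order[:-3 - 1:-1]         # walk backwards from the end, taking 3 positions
print(" ".join(toy_words[top3]))
```

This prints `hiking dog travel`: the three words with the largest weights, most important first.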

<div class="alert-info">
    
#### Short Answer No.4
- Pick three topics.
- Look at the words that make up each one. Say which one it is, and briefly answer these questions about it (3-5 sentences per topic):
    - What does the topic seem to be about?
    - Do any of the other topics seem similar to this one?
    
</div>

🤔 **Write your response here:**
....

### Step 2: Examine the words that make two topics similar or different
- We can also compare two topics to each other by looking at words that are common in both, or words that are common in one but not the other.
- Try changing `topic_a` and `topic_b` to different topic numbers.
- Notice the `how` option will let you see either the `overlap` or `difference` between two topics
    - Notice also that the difference between topics a and b is not the same as between b and a
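The idea behind `overlap` and `difference` can be sketched on two made-up topics. This is a simplified stand-in for the lab's helper functions, not a copy of them: `shared_top` and all the numbers are hypothetical. It walks down two rankings side by side until enough words appear in both.

```python
import numpy as np

toy_words = ["city", "dog", "food", "hiking", "music", "travel"]
topic_a = np.array([0.2, 3.0, 0.1, 4.0, 0.3, 2.0])   # an "outdoors" topic
topic_b = np.array([3.5, 2.5, 0.2, 0.1, 0.4, 3.0])   # a "city life" topic

def shared_top(rank_a, rank_b, n):
    """Walk down both rankings together until n words appear in both."""
    seen_a, seen_b = set(), set()
    for word_a, word_b in zip(rank_a, rank_b):
        seen_a.add(word_a)
        seen_b.add(word_b)
        if len(seen_a & seen_b) >= n:
            break
    return sorted(seen_a & seen_b)

def ranked(weights, descending=True):
    """List the words from most to least important (or the reverse)."""
    order = np.argsort(-weights) if descending else np.argsort(weights)
    return [toy_words[i] for i in order]

# overlap: words that rank high in both topics
print("overlap:", shared_top(ranked(topic_a), ranked(topic_b), 2))
# difference: words that rank high in one topic and low in the other
print("difference a vs b:", shared_top(ranked(topic_a), ranked(topic_b, False), 1))
print("difference b vs a:", shared_top(ranked(topic_b), ranked(topic_a, False), 1))
```

The last two lines give different answers ("hiking" versus "city"), which is why the difference between a and b is not the same as between b and a.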

In [None]:
topic_a = 0
topic_b = 2

compare_topic_words(lda_topics, topic_a, topic_b, tf_words, how='overlap', n_words=nshow)
compare_topic_words(lda_topics, topic_a, topic_b, tf_words, how='difference', n_words=nshow)
compare_topic_words(lda_topics, topic_b, topic_a, tf_words, how='difference', n_words=nshow)

### Step 3: Interpret these topics
<div class="alert-info">
    
#### Short Answer No.5
- This part is for you to do: code can't do it for you.
- Try to come up with a short, catchy name for each topic and write it down.
    - For example, if the words were "san francisco city moved living born years raised lived live", you might call it "places lived" because the topic seems to be about where people currently live (San Francisco) and where they were born / raised / moved from.  
    
    
</div>

🤔 **Write your response here:**
...

- Topic 0:
- Topic 1: 
- Topic 2:
- Topic 3: 
- Topic 4:
- Topic 5: 
- Topic 6:
- Topic 7: 
- Topic 8: 
- Topic 9:
- Topic 10: 
- Topic 11:
- Topic 12: 
- Topic 13:
- Topic 14: 

...

### Step 4: Check whether your interpretations match with the text

#### Helper functions
- Run this code and scroll down. 
- You don't need to understand these details right now.

In [None]:
def get_profiles_from_topics(data, transformed, topic_a, topic_b=None, pick_from=10):
    #get our data ready
    df = pd.DataFrame(transformed)
    df = df.sort_values(by=topic_a, ascending=False)
    n = df.shape[0]
    
    if topic_b is None:
        #if we only want things high in one topic, take randomly from the top
        keep = df.head(pick_from).sample(1)
        pid = keep.index.values[0]
    else:
        #if we want things high in two topics, find them and pick one of the top
        idxb = df.sort_values(by=topic_b, ascending=False).index.values
        both = find_intersection(df.index.values, idxb, pick_from)
        keep = df.loc[both, :].sample(1)
        pid = keep.index.values[0]    
        
    #output text to show our results
    match_text = 'Profile number ' + str(pid)
    match_text += ' has more of topic ' + str(topic_a)
    match_text += ' than {:.2f}%'.format(100 - (np.where(df.index==pid)[0][0] / n)*100)
    match_text += ' of other profiles'
    if topic_b is not None:
        match_text += ' and more of topic ' + str(topic_b)
        match_text += ' than {:.2f}%'.format(100 - (np.where(idxb==pid)[0][0] / n)*100)
        match_text += ' of other profiles.'
        
    #print results
    text = data[pid]
    print(match_text)
    print('Here is the text:\n\n', text)
    return

def visualize_profile(profile_topics, profile_id):
    # plot a stem diagram for a single profile
    plt.figure(figsize=(8,4))
    plt.xticks(range(profile_topics.shape[1]))
    plt.xlabel('Topic number')
    plt.ylabel('How much of profile is about each topic')
    plt.title('Profile #'+str(profile_id))
    plt.stem(profile_topics[profile_id,:])
    return

#### Calculate what portion of each profile is about each topic

In [None]:
profile_topics = lda.transform(tf_text)

#### Look at the text of a profile that has a lot of a particular topic
- This function randomly picks one of the top few profiles for a topic, so each time you run it you will see a different example.
    - If you want it to pick from more or fewer profiles, change the value of `pick_from`
    - If you want to see a different topic, change the value of `topic_a`
    - **Hint:** you can press `ctrl`+`enter` over and over to keep re-running the code in this cell.

In [None]:
get_profiles_from_topics(documents, profile_topics, topic_a=7, pick_from=1000)

<div class="alert-info">
    
#### Short Answer No.6
    
- For each of the 3 topics you chose in **Step 1**, use the `get_profiles_from_topics` function defined above to find a document that strongly aligns with the topic. 
- Then discuss whether the document truly aligns with the topic. (3-5 sentences) 
    
</div>

🤔 **Write your response here:**
...

#### Look at the text of a profile that matches two topics well at the same time
- Note that some topics might not happen together very often. If this is the case, the examples we find of both together might not be very good.

In [None]:
get_profiles_from_topics(documents, profile_topics, topic_a=1, topic_b=2)

#### See how much of a profile is about each topic
- Try looking at some of the profiles you just found:
    - Make the `pid` equal to the profile number from above.

In [None]:
pid = 40000 
visualize_profile(profile_topics, profile_id=pid)


# Show the text of the same profile
documents[pid]

<div class="alert-info">
    
#### Short Answer No.7

- Does the topic classification make sense to you? If so, which of the words in the example printed above are the most representative of each of the topics? (3-5 sentences)
    
- Do your topic names match what you are seeing in the text? 
- Did any of your interpretations change after reading some profiles? (To check, change `pid` to create several different charts. Then, based on the results, decide whether you want to change any of your topic names.) 
    - If you need to update your topic names, do so here.
    
</div>

🤔 **Write your response here:**
...

## Section 5: Topic Quality
Let's see how good the topics we found are.

### Step 1: See if the topics are each about different things.
We want each topic to be about something different than the other topics. We can check this by comparing the words in each topic to the words in all the others. How to interpret:
- Each square shows how similar two topics are. Darker means more similar, and lighter means more different.
- The square in the very top left shows how similar topic 0 is to topic 0 (i.e. how similar it is to itself). 
- The square next to it in the top row shows how similar topic 0 is to topic 1, and so on. 
- For any two topics, you can see how similar they are by finding their numbers on the edges and seeing where they intersect.
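The similarity in each square is a cosine similarity between two topics' word weights. A minimal sketch with made-up numbers, computed by hand with NumPy rather than with the `cosine_similarity` function the lab uses:

```python
import numpy as np

# Made-up topic-word weights: 3 topics over 4 words (illustrative only)
toy_topics = np.array([
    [4.0, 3.0, 0.0, 0.0],   # topic 0
    [3.0, 4.0, 0.0, 0.0],   # topic 1 uses the same words as topic 0
    [0.0, 0.0, 5.0, 1.0],   # topic 2 uses completely different words
])

# Cosine similarity: scale each row to length 1, then take dot products
unit_rows = toy_topics / np.linalg.norm(toy_topics, axis=1, keepdims=True)
toy_sim = unit_rows @ unit_rows.T
print(np.round(toy_sim, 2))
```

Topics 0 and 1 score 0.96 (very similar), topic 2 scores 0 against both, and every topic scores 1 against itself.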

In [None]:
def plot_topics(components):
    sim = cosine_similarity(components)
    blue_matrix(sim, xl='Topic number', yl='Topic number',
                t = 'Word Similarity between Topics')
    return

plot_topics(lda_topics)

### Step 2: See if different topics show up in different profiles
The point is to tell profiles apart based on what topics they're about, so we need to check whether the topics appear in different profiles.
- This shows us something that looks similar to the topic similarity we saw before, but this time:
    - We **don't** compare topics based on which words they use
    - We **do** compare topics based on how often they appear in the same profile as one another
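Co-occurrence here is measured with a correlation between topic columns. The sketch below uses made-up topic shares for five profiles to show what that calculation looks like; `toy_profile_topics` is a hypothetical stand-in for the real `profile_topics`.

```python
import pandas as pd

# Made-up topic shares for five profiles and three topics (each row sums to 1)
toy_profile_topics = pd.DataFrame(
    [[0.8, 0.1, 0.1],
     [0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.2, 0.7, 0.1],
     [0.1, 0.1, 0.8]],
)

# Pearson correlation between every pair of topic columns
toy_corr = toy_profile_topics.corr()
print(toy_corr.round(2))
```

In this toy data, profiles that are high in topic 0 are low in topic 1, so those two topics have a negative correlation.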

In [None]:
def topic_cooccurance(topics):
    co = pd.DataFrame(topics).corr()
    blue_matrix(co, xl='Topic number', yl='Topic number',
               t = 'Topic Co-occurrence in Profiles')

topic_cooccurance(profile_topics)

#### Note that the topics are mostly uncorrelated. 
- The cells in the figure above are mostly very light blue
- This doesn't mean that, for instance, topic 1 and 2 never show up in the same profile.
- It does mean, however, that seeing any particular topic doesn't mean we're especially likely to also see any other topic.

<div class="alert-info">
    
#### Short Answer No.8
- Why is the diagonal line so dark? (1-2 sentences)

- Would you say that this topic classification result is good? Why? (explain in 1-3 sentences)

- Do you think that the two figures above are sufficient to conclude that the topic quality is perfect? Why? (3-5 sentences)

</div>

🤔 **Write your response here:**
....

## Section 6: Topic Popularity

#### Helper functions to visualize and compare topics
- Run this code and scroll down. The details of how it works aren't our focus right now.

In [None]:
def common_topics_bars(topics):
    popularity = pd.DataFrame(topics).mean()
    popularity = popularity.rename_axis('Topic')
    popularity = popularity.sort_values(ascending=False)
    popularity.plot.bar(title='Topic popularity')
    return

def rank_groups(data, trait, topic):
    groups = data[trait].value_counts().index.values
    result = {}
    
    for g in groups:
        result[g] = data[data[trait] == g][topic].mean()
    
    r = pd.DataFrame.from_dict(result, orient='index')
    r.columns = [topic]
    r = r.sort_values(by=topic, ascending=False)
    
    return r.round(3)

def top_topics(data, trait, value, n_top_topics=3, distinctive=False):
    topics = [col for col in data if col.startswith('topic_')]
    vals = {}
    means = {}
    if distinctive:
        for t in topics:
            means[t] = data[t].mean()
    else:
        for t in topics:
            means[t] = 1
    
    data = data[data[trait] == value]
    
    for t in topics:
        vals[t] = data[t].mean() / means[t]
    vals = pd.DataFrame.from_dict(vals, orient='index')    
    vals = vals.sort_values(by=0, ascending=False).head(n_top_topics)

    return list(vals.index.values)

### Overall most common topics

In [None]:
common_topics_bars(profile_topics)

### Who is a topic most popular with?

#### Step 1: Merge our information about topics with our information about people

In [None]:
topic_info = pd.DataFrame(profile_topics).add_prefix('topic_')
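# profiles kept its original row labels after dropna, while topic_info is numbered 0..n-1,
# so reset the index first to line each profile up with its own topic row
together = profiles.reset_index(drop=True).merge(topic_info, left_index=True, right_index=True)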
together.head()

#### Step 2: See the groups that have the most text about a given topic
- The numbers here show how much of a profile, on average, is about a specific topic. For example, if you don't have a pet, you probably wouldn't be writing about your pet (topic X). 
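The group averages described here boil down to a group-by and a mean. Here is a small sketch on made-up data, assuming (purely for illustration) that `topic_0` is a topic about pets. It is a simplified equivalent of what `rank_groups` computes, not the helper itself.

```python
import pandas as pd

# Made-up data: one trait column and one topic column (illustrative only)
toy = pd.DataFrame({
    'pets_any': ['yes', 'yes', 'no', 'no', 'yes'],
    'topic_0':  [0.60, 0.40, 0.05, 0.15, 0.50],   # pretend topic_0 is about pets
})

# Average share of topic_0 within each group, highest first
toy_ranking = (
    toy.groupby('pets_any')['topic_0']
       .mean()
       .sort_values(ascending=False)
       .round(3)
)
print(toy_ranking)
```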

#### Play around with the code in the next few cells:
- Pick 3 of the traits we have data for. Here are the options (information we know about users from their profiles):
    - `age_group` categories: ['10', '20', '30', '40', '50']
    - `body` categories: ['average', 'fit', 'thin', 'overweight', 'unknown']
    - `alcohol_use` categories: ['yes', 'no']
    - `drug_use` categories: ['yes', 'no']
    - `edu` (highest degree completed) categories: ['`<HS`', 'HS', 'BA', 'Grad_Pro', 'unknown'] 
    - `race_ethnicity` categories: ['Asian', 'Black', 'Latinx', 'White', 'multiple', 'other']
    - `height_group` (whether someone is over or under six feet tall) categories: ['under_6', 'over_6']
    - `industry` (what field they work in) categories: ['STEM', 'business', 'education', 'creative', 'med_law', 'other'] 
    - `kids` (whether they have children) categories: ['yes', 'no']
    - `orientation` categories: ['straight', 'gay', 'bisexual']
    - `pets_likes` (what pets they like) categories: ['both', 'dogs', 'cats', 'neither']
    - `pets_has` (what pets they have) categories: ['both', 'dogs', 'cats', 'neither']
    - `pets_any` (whether they have pets or not) categories: ['yes', 'no']
    - `religion` categories: ['christianity', 'catholicism', 'judaism', 'buddhism', 'none', 'other'] 
    - `sex` categories: ['m', 'f']
    - `smoker` categories: ['yes', 'no']
    - `languages` categories: ['multiple', 'English_only'] 
- Change the topics and values in the code in the next few cells to explore how the trait you chose relates to the topics.
    
<div class="alert-info">
    
#### Short Answer No.9

- Before you run the code, predict what the result will be. (1-2 sentences)

- Write down which traits you chose to look at, and what you learned about different groups of people from the topics they used. What topics do they have in common? What topics make them different? Does this make sense given the groups? Why or why not? Remember: we interpreted the topics above, so explain your findings in terms of content, not just topic numbers. (1-2 paragraphs)

- Does the actual result match your prediction? 

</div>

In [None]:
rank_groups(together, trait='edu', topic='topic_5') ## we use this as an example, and ask students to find 3 more

#### Step 3: See the topics that are most common for a given group
- This example shows most common topics for different education groups.
- You can change the arguments to compare different groups.

In [None]:
#show most popular topics for High School graduates
top_topics(data=together, trait='edu', value='HS', n_top_topics=3)

In [None]:
#show most popular topics for college (BA) graduates
top_topics(data=together, trait='edu', value='BA', n_top_topics=3)

#### Step 4: See the topics that distinguish a group from other groups
- This example shows most distinctive topics for different education levels.
- You can change the arguments to compare different groups

In [None]:
top_topics(data=together, trait='edu', value='HS', n_top_topics=3, distinctive=True)

In [None]:
top_topics(data=together, trait='edu', value='Grad_Pro', n_top_topics=3, distinctive=True)
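The difference between a group's most common topic and its most distinctive topic can be seen on a tiny made-up table. This sketch mirrors the idea behind `distinctive=True` (divide the group's average by everyone's average); the data and variable names are hypothetical, and only two topic columns are shown.

```python
import pandas as pd

# Made-up data (illustrative only)
toy = pd.DataFrame({
    'edu':     ['HS', 'HS', 'BA', 'BA'],
    'topic_0': [0.50, 0.50, 0.50, 0.50],   # popular with everyone
    'topic_1': [0.30, 0.30, 0.10, 0.10],   # less popular overall, but favored by 'HS'
})
topic_cols = ['topic_0', 'topic_1']

overall_mean = toy[topic_cols].mean()
hs_mean = toy.loc[toy['edu'] == 'HS', topic_cols].mean()

# Biggest share within the group
most_common = hs_mean.idxmax()
# Biggest share relative to everyone
most_distinctive = (hs_mean / overall_mean).idxmax()
print(most_common, most_distinctive)
```

Here `topic_0` is the most common topic for the 'HS' group, but `topic_1` is the most distinctive, because that group writes about it 1.5 times as much as the average profile.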

🤔 **Write your response here:**
....

## Section 7: NMF Topic Model
- NMF is an alternative to LDA
- NMF stands for Non-Negative Matrix Factorization. 

### Expand for more on how NMF works

- `Factoring` is something you may have done in math class before, for example:
    - $ 10 $ can be factored as $ 2 \times 5 $
    - $ x^2+3x+2 $ can be factored as $ (x+2)(x+1)$
- When we convert the text into numbers for the computer, it gets stored as something called a `matrix`.
    - The matrix is `non-negative` because we can't have negative numbers of words: all the word counts are zero or more.
- It is not important right now how exactly we find factors for these matrices, but if you're curious, you can learn more about it in a Linear Algebra class.
- It turns out that finding factors for text is a really good way of finding topics. This makes sense intuitively: factors are simple things we can combine to get the more complicated output, and topics are simple things people combine to write profiles.
- [Learn more](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization#Text_mining) about NMF.
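To see what "factoring a matrix" means in practice, here is a minimal sketch that factors a tiny made-up matrix `V` into two smaller non-negative matrices `W` and `H`. The numbers and the names `V`, `W`, `H` are chosen for illustration; the settings are not the ones used on the real profiles.

```python
import numpy as np
from sklearn.decomposition import NMF

# A tiny non-negative "document x word" matrix (made-up numbers)
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

toy_nmf = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)    # documents x topics
H = toy_nmf.components_         # topics x words

print("W shape:", W.shape, " H shape:", H.shape)
print("W @ H, which approximates V:")
print(np.round(W @ H, 1))
```

`H` plays the role of the topics (which words go together) and `W` says how much of each topic is in each document.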

### Step 1: Convert text to numbers the computer understands
- NMF takes "tf-idf vectors" as input. Tf-idf stands for "term frequency - inverse document frequency." 
    - Term frequency is the same as the count vectors we used for LDA above: how often does each word appear in the text?
    - Inverse document frequency means we divide ("inverse") by the number of documents the word is in. (If everyone uses the word, it isn't very helpful for figuring out what makes people different. So this measurement looks for words that are used a lot in some documents, and not at all in others.)

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

print("Vectorizing text by TF-IDF...")
tfidf_text = tfidf_vectorizer.fit_transform(documents)

tmp = tfidf_text.get_shape()
print("Our transformed text has", tmp[0], "rows and", tmp[1], "columns.")

#### The features are mostly the same as count vectors, because they are just the common words in the text

In [None]:
tfidf_words = tfidf_vectorizer.get_feature_names_out()
print("The first few words (alphabetically) are:\n", tfidf_words[:20])

#### The values are different: the counts have been scaled down for words that show up in many documents

In [None]:
n = 4

print('Profile text:\n', documents[n])
print('\nTF-IDF vector:')
show_vector(tfidf_text[n], words=tfidf_words)

### Step 2: Build a topic model using NMF

- NMF is faster than LDA and often works a little better for small documents like we have here.

In [None]:
import sklearn
sklearn_major_ver = int(sklearn.__version__.split('.')[0])

if (sklearn_major_ver < 1):
    model = NMF(n_components=ntopics, alpha=.1, l1_ratio=.5, init='nndsvd', random_state= 7)
else:
    # As of scikit-learn 1.0, `alpha` is replaced by `alpha_W` and `alpha_H`, which scale the penalty differently.
    model = NMF(n_components=ntopics, alpha_W=.0001, alpha_H='same', l1_ratio=.5, init='nndsvd', random_state= 7)

print('Performing NMF on vectors...')
nmf = model.fit(tfidf_text)
nmf_topics = nmf.components_

print('Done!')

### Step 3: Show our topics with the top words in each

In [None]:
display_topics(nmf_topics, tfidf_words, nshow)

### Step 4: Compare topics to each other
We can compare topics visually by plotting the similarity of each topic's chosen words to each other topic.

In [None]:
plot_topics(nmf_topics)

#### Examine the words that make two topics similar or different
We can also compare two topics to each other by looking at words that are common in both, or words that are common in one but not the other. Try changing topic_a and topic_b to different topic numbers.

In [None]:
topic_a = 1
topic_b = 8

compare_topic_words(nmf_topics, topic_a, topic_b, tfidf_words, n_words=nshow, how='overlap')
compare_topic_words(nmf_topics, topic_a, topic_b, tfidf_words, n_words=nshow, how='difference')
compare_topic_words(nmf_topics, topic_b, topic_a, tfidf_words, n_words=nshow, how='difference')

### Step 5: Interpret these topics

<div class="alert-info">
    
#### Short Answer No.10
    
- This part is for you to do: code can't do it for you.
- Look at the list of important words for each topic, and think about these questions.
    - What do the words have in common?
    - What could someone write that would use most of those words?
    - What does this topic seem to be about?
- Try to come up with a short, catchy name for each topic.
    - For example, if the words were "san francisco city moved living born years raised lived live", you might call it "places lived" because the topic seems to be about where people currently live (San Francisco) and where they were born / raised / moved from. 
    
</div>

🤔 **Write your response here:**
....

### Step 6: Compare the topics from LDA and NMF

#### Helper function to make a graph for us

In [None]:
def plot_confusion(x, y, x_label='', y_label='', t_label = '', sort=True):
    n = x.shape[0]
    # rows follow y (drawn on the Y axis) and columns follow x (drawn on the X axis)
    corrs = cosine_similarity(y, x)
    topic_similarity = pd.DataFrame(corrs)
    new_order=None
    
    if sort:
        matches = []
        pairs = {}

        for i in range(n):
            for j in range(n):
                tmp = {}
                tmp['i'] = i
                tmp['j'] = j
                tmp['match'] = corrs[i][j]
                matches.append(tmp)

        matches = pd.DataFrame(matches).sort_values(by='match', ascending=False)

        for row in matches.iterrows():
            i = row[1]['i']
            j = row[1]['j']
            if i not in pairs.keys():
                if j not in pairs.values():
                    pairs[i] = row[1]['j']

        new_order = list(range(n))
        for k in pairs.keys():
            new_order[int(k)] = int(pairs[k])

        topic_similarity = topic_similarity[new_order]

    blue_matrix(topic_similarity, xl=x_label, yl=y_label,
                t = t_label,
                x_labels=new_order)
    return

#### See how similar the words in each topic from LDA are to the words in each topic from NMF
- The NMF topics are along the X axis and the LDA topics are along the Y axis.

In [None]:
plot_confusion(x = nmf_topics, x_label='NMF topic number',
               y = lda_topics, y_label='LDA topic number',
               t_label = 'Word Similarity between LDA and NMF Topics',
               sort=False)

#### We can also sort the topics so that the most similar ones are aligned

In [None]:
plot_confusion(x = nmf_topics, x_label='NMF topic number',
               y = lda_topics, y_label='LDA topic number',
               t_label = 'Word Similarity between LDA and NMF Topics',
               sort=True)
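
To see what sorting does, here is a minimal sketch of the greedy matching idea on a made-up 3x3 similarity matrix. The names `greedy_match`, `toy_sim`, and `toy_pairs` are hypothetical and only for illustration; the numbers are invented, not results from the lab data.

```python
import numpy as np

def greedy_match(sim):
    """Pair each row with a distinct column, taking the highest scores first."""
    sim = np.asarray(sim, dtype=float)
    n_cols = sim.shape[1]
    # Visit (row, column) cells from most to least similar.
    flat_order = np.argsort(sim, axis=None)[::-1]
    pairs = {}
    used_cols = set()
    for flat in flat_order:
        i, j = divmod(int(flat), n_cols)
        if i not in pairs and j not in used_cols:
            pairs[i] = j
            used_cols.add(j)
    return pairs

# Hypothetical similarities: rows are topics from one model,
# columns are topics from the other.
toy_sim = np.array([
    [0.1, 0.9, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.7],
])

toy_pairs = greedy_match(toy_sim)
new_order = [toy_pairs[i] for i in range(toy_sim.shape[0])]

# Reordering the columns puts each matched pair on the diagonal.
aligned = toy_sim[:, new_order]
print(toy_pairs)
print(aligned)
```

Greedy matching is simple, but it is not guaranteed to find the best overall pairing: an early match can block a combination that would have been better in total.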

<div class="alert-info">
    
#### Short Answer No.11

Look at the LDA and NMF topic words and the confusion matrix, and consider the following questions:
- Do any of the topics seem to be the same in both models?
- Are some topics in one model but not the other?
- Do the topics you get from one of the models make more sense than the ones you get from the other?

</div>

🤔 **Write your response here:**
....

## Section 8: What We Learned
- Two statistical methods for topic modeling
    - LDA
    - NMF
- How to think about and interpret the topics our models find
- How to compare and relate different topics
- Different ways to see the distribution of topics in profiles
- Which topics are most popular with social categories of people
- Which social categories of people discuss a topic most

<div class="alert-warning">

#### Reflection Question 1:
- How is what we learned in this lab, using topic modeling, different from what we learned in the last lab, using just word frequencies? How is it similar? Write a paragraph explaining. 
    
</div>

🤔 **Write your response here:**
....

<div class="alert-warning">

#### Reflection Question 2:
- For both the LDA and NMF models, we specified 15 topics. Now, try running the LDA with other numbers of topics.
    - If the topics seemed repetitive, you might want to try looking for fewer topics.
    - If the topics seem confusing or vague, you might want to try looking for more topics (so that they can be more specific).
- Run the code below and answer the following:
    - What different numbers of topics did you try?
    - How did your interpretations change in response to the number of topics? Were the topics easier or more difficult to interpret?
    - How did topic quality change in response to the number of topics? Were topics more similar or less similar?
    
</div>

🤔 **Write your response here:**
....

#### Step 1: Decide how many topics we want to find
- We must tell LDA how many topics we want it to look for (we did this above with the `ntopics` variable).
    - We suggest picking a few values between 2 and 50.

In [None]:
#how many topics we want our model to find
ntopics = 2

#how many top words we want to display for each topic
nshow = 10

#### Step 2: Run the LDA algorithm

In [None]:
model = LatentDirichletAllocation(n_components=ntopics, max_iter=10, 
                                  learning_method='online', n_jobs=-1)

print('Performing LDA on vectors. This may take a while...')
lda = model.fit(tf_text)
lda_topics = lda.components_

print('Done!')

#### Step 3: Show our topics with the top words in each

In [None]:
display_topics(lda_topics, tf_words, n_words=nshow)

#### Step 4: Examine the words that make two topics similar or different
- We can also compare two topics to each other by looking at words that are common in both, or words that are common in one but not the other.
- Try changing `topic_a` and `topic_b` to different topic numbers.
- Notice the `how` option will let you see either the `overlap` or `difference` between two topics
    - Notice also that the difference between topics a and b is not the same as between b and a
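
The idea of overlap and difference can be sketched with a tiny made-up example. This is not the code inside `compare_topic_words` (which is defined elsewhere in the lab); it is one plausible reading of the idea, based on each topic's top words. The helper `top_words`, the vocabulary, and the weights are all hypothetical.

```python
import numpy as np

def top_words(weights, vocab, n_words):
    """Return the n_words highest-weighted words for one topic."""
    order = np.argsort(weights)[::-1][:n_words]
    return [vocab[i] for i in order]

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words).
vocab = ['music', 'food', 'travel', 'books', 'hiking']
toy_topics = np.array([
    [9.0, 7.0, 5.0, 1.0, 0.5],   # topic a
    [8.0, 1.0, 6.0, 7.0, 0.5],   # topic b
])

a = top_words(toy_topics[0], vocab, 3)
b = top_words(toy_topics[1], vocab, 3)

overlap = [w for w in a if w in b]       # top words in both topics
a_not_b = [w for w in a if w not in b]   # top words in a but not in b
b_not_a = [w for w in b if w not in a]   # top words in b but not in a

print('overlap:', overlap)
print('a minus b:', a_not_b)
print('b minus a:', b_not_a)
```

The two difference lists contain different words, which is why the order of the two topics matters when looking at the difference.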

In [None]:
topic_a = 0
topic_b = 1

compare_topic_words(lda_topics, topic_a, topic_b, tf_words, how='overlap', n_words=nshow)
compare_topic_words(lda_topics, topic_a, topic_b, tf_words, how='difference', n_words=nshow)
compare_topic_words(lda_topics, topic_b, topic_a, tf_words, how='difference', n_words=nshow)

#### Step 5: Interpret these topics
- This part is for you to do: code can't do it for you.
- Try to come up with a short, catchy name for each topic (no writing required).

#### Step 6: Topic Quality
Let's see how good the topics we found are.

#### Step 6.1: See if the topics are each about different things.
We want each topic to be about something different from the other topics. We can check this by comparing the words in each topic to the words in all the others. How to interpret the plot:
- Each square shows how similar two topics are. Darker means more similar, and lighter means more different.
    - Because topic word weights are never negative, these similarity scores range from 0 (no words in common) to 1 (identical word weights).
- The square in the very top left shows how similar topic 0 is to topic 0 (i.e. how similar it is to itself). 
- The square next to it in the top row shows how similar topic 0 is to topic 1, and so on. 
- For any two topics, you can see how similar they are by finding their numbers on the edges and seeing where they intersect.
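
To make the numbers behind the squares concrete, here is a small sketch that computes cosine similarity by hand for three made-up topics. The function `cosine_sim_matrix` and the weights in `toy_topics` are hypothetical; the lab itself uses `cosine_similarity` from scikit-learn, which computes the same quantity.

```python
import numpy as np

def cosine_sim_matrix(components):
    """Cosine similarity between every pair of rows (topics)."""
    components = np.asarray(components, dtype=float)
    # Scale each topic to length 1, then take dot products.
    norms = np.linalg.norm(components, axis=1, keepdims=True)
    unit = components / norms
    return unit @ unit.T

# Hypothetical topic-word weights: 3 topics over a 4-word vocabulary.
toy_topics = np.array([
    [5.0, 4.0, 0.0, 0.0],   # topic 0: weight on words 0 and 1
    [4.0, 5.0, 0.0, 0.0],   # topic 1: nearly the same words as topic 0
    [0.0, 0.0, 3.0, 6.0],   # topic 2: entirely different words
])

toy_sim = cosine_sim_matrix(toy_topics)
print(np.round(toy_sim, 2))
```

In this toy matrix the diagonal is 1 (each topic compared with itself), topics 0 and 1 score close to 1 because they weight the same words, and topic 2 scores 0 against both because it shares no words with them.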

In [None]:
def plot_topics(components):
    sim = cosine_similarity(components)
    blue_matrix(sim, xl='Topic number', yl='Topic number',
                t = 'Word Similarity between Topics')
    return

plot_topics(lda_topics)

#### Step 6.2: See if different topics show up in different profiles
The point is to tell profiles apart based on what topics they're about, so we need to check whether the topics appear in different profiles.
- This shows us something that looks similar to the topic similarity we saw before, but this time:
    - We **don't** compare topics based on which words they use
    - We **do** compare topics based on how often they appear in the same profile as one another
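
Here is a minimal sketch of that idea on made-up data. The array `toy_profile_topics` is hypothetical: each row is a profile and each column is how strongly one topic appears in it. `np.corrcoef` with `rowvar=False` computes the same Pearson correlations between columns that `pd.DataFrame(...).corr()` does by default.

```python
import numpy as np

# Hypothetical profile-topic weights: 5 profiles (rows) x 3 topics (columns).
toy_profile_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
    [0.5, 0.1, 0.4],
])

# Correlate the columns (topics) across profiles.
toy_co = np.corrcoef(toy_profile_topics, rowvar=False)
print(np.round(toy_co, 2))
```

Correlations range from -1 to 1. In this toy data, profiles that are mostly about topic 0 contain little of topic 2 and vice versa, so those two topics have a strongly negative correlation: they rarely appear together.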

In [None]:
profile_topics = lda.transform(tf_text)

In [None]:
def topic_cooccurance(topics):
    co = pd.DataFrame(topics).corr()
    blue_matrix(co, xl='Topic number', yl='Topic number',
               t = 'Topic Co-occurrence in Profiles')

topic_cooccurance(profile_topics)