<a href="https://colab.research.google.com/github/jamiehadd/Math189AD-MathematicalDataScienceAndTopicModeling/blob/main/tutorials/Guided_NMF_Demo_Twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guided NMF Demo: Twitter
This notebook demonstrates the Guided NMF models on a dataset of tweet text from the 2016 presidential candidates:
> Littman, Justin; Wrubel, Laura; Kerchner, Daniel, 2016, "2016 United States Presidential Election Tweet Ids", https://doi.org/10.7910/DVN/PDI7IN, Harvard Dataverse, V3.

This dataset has been provided for you in "all_tweets_avg.npy" and "all_tweets_words.npy". Keep these files private: they were extracted via the Twitter API by a single user account and are not for distribution.

### Activity
We'll be applying NMF and Guided NMF to a dataset consisting of tweets from 2016 presidential candidates and investigating the results of these models!
- Run the NMF and Guided NMF models in the notebook.
- Interpret the results.
- Design your own experiments (consider different topics and words for guidance)!
- Report back with interesting findings!

### Run first: installs, imports, and function definitions

In [None]:
!pip install numpy
!pip install matplotlib
!pip install ssnmf
!pip install scipy

import numpy as np
from matplotlib import pyplot as plt
import ssnmf
from ssnmf import SSNMF
import scipy.sparse  # import the submodule explicitly; the loading cell relies on scipy.sparse
import random

In [None]:
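def d_to_v(d, verbose=True):
    """
    Given a dictionary d of the form {word: weight}, create the GT topic vector v.
    Words missing from idx_to_word are skipped (and reported if verbose). See writeup for details.
    """
    words = list(idx_to_word)
    v = np.zeros(idx_to_word.shape[0])

    for key, weight in d.items():
        try:
            i = words.index(key)
        except ValueError:
            # list.index raises ValueError for a missing word; it never returns a negative index
            if verbose:
                print("Could not find word '" + key + "' in list of words!")
            continue
        v[i] = weight

    return v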

### Data Preprocessing
In these cells, we load and format the tweet dataset and build the seed-topic matrix Y.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd '/content/drive/Shareddrives/Math 189AD FA22: Datasets'

In [None]:
# Load data
X = np.load("all_tweets_avg.npy", allow_pickle=True)
idx_to_word = np.load("all_tweets_words.npy", allow_pickle=True)
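# np.load returns a 0-d object array here; .item() unwraps the stored sparse matrix
X = X.item()
# Convert the sparse matrix to a dense NumPy array
X = X.toarray()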
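The loading pattern above (`allow_pickle=True`, then `.item()`, then a dense conversion) can be tried out on a small example. The sketch below is only an illustration: it assumes the data file holds a SciPy sparse matrix pickled inside a 0-d object array, and the toy matrix and in-memory buffer are made up for the demonstration.

```python
import io

import numpy as np
from scipy import sparse

# Toy stand-in for the tweet matrix (values are invented for illustration).
toy = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                  [2.0, 0.0, 0.0]]))

# Wrap the sparse matrix in a 0-d object array (assumed storage layout of the .npy file).
wrapped = np.empty((), dtype=object)
wrapped[()] = toy

# Save to an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
np.save(buf, wrapped, allow_pickle=True)
buf.seek(0)

loaded = np.load(buf, allow_pickle=True)  # 0-d object array
recovered = loaded.item()                 # the SciPy sparse matrix again
dense = recovered.toarray()               # dense NumPy array
print(type(recovered).__name__, dense.shape)
```

Only load pickled `.npy` files from sources you trust, since `allow_pickle=True` can execute arbitrary code during loading.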

In [None]:
# Create seed topics and Y matrix
obamacare_word = {"obamacare": 1}
economy_words = {"economy": 1}

gt_topic_words = [obamacare_word, economy_words]
gt_topic_vectors = [d_to_v(x) for x in gt_topic_words]

# One column per seed topic, one row per vocabulary word
Y = np.stack(gt_topic_vectors).T
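To see what the seed matrix looks like, here is a minimal self-contained sketch on a toy vocabulary. The names `vocab`, `seed_vector`, and `Y_toy` are hypothetical and stand in for `idx_to_word`, `d_to_v`, and `Y`; the second seed topic uses two weighted words to show that a seed need not be a single word.

```python
import numpy as np

# Hypothetical toy vocabulary standing in for idx_to_word.
vocab = np.array(["economy", "jobs", "obamacare", "taxes"])
word_to_idx = {w: i for i, w in enumerate(vocab)}

def seed_vector(weights, word_to_idx):
    """Return a vocabulary-length vector holding the given weight at each seed word."""
    v = np.zeros(len(word_to_idx))
    for word, weight in weights.items():
        if word in word_to_idx:  # silently skip words outside the vocabulary
            v[word_to_idx[word]] = weight
    return v

# Two seed topics: a single word, and two words with different weights.
seeds = [{"obamacare": 1}, {"economy": 1, "jobs": 0.5}]

# Stack the seed vectors as columns: shape (vocabulary size, number of seed topics).
Y_toy = np.stack([seed_vector(s, word_to_idx) for s in seeds]).T
print(Y_toy.shape)
print(Y_toy)
```

Each column of `Y_toy` is zero everywhere except at its seed words, which is the structure the guidance term uses to pull a topic towards those words.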

### Experiment 1: NMF on Twitter Data
In this experiment, we first train a Frobenius-norm NMF model on the tweets dataset (with no supervision or guidance). We then extract the top five keywords for each topic: these are the five largest entries in each column of `A` in the code below, which is each row of the model's own `S` factor.

In [None]:
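# seed the random matrix initialization for reproducible results
np.random.seed(1)
r = 8

# initialize and train the model
model = SSNMF(X.T, r)
N = 200
model.mult(numiters=N)

# access the factor matrices; the model factorizes X.T, so transposing gives
# X ≈ A @ S with A = model.S.T and S = model.A.T (the names are swapped on purpose)
S = model.A.T
A = model.S.T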

In [None]:
#collect and print the top keywords for each topic
keywords = np.empty((7,r), dtype=object)

for i in range(keywords.shape[1]):
    keywords[0,i] = "Topic " + str(i+1)
    keywords[1,i] = "-------"

for i in range(A.shape[1]):
    col = A[:,i]
    top = col.argsort()
    top = top[-5:][::-1]

    keywords[2:,i] = idx_to_word[top]


col_widths = [max([len(keywords[i][j]) for i in range(keywords.shape[0])])+2 for j in range(keywords.shape[1])]
for row in keywords:
    print("".join(row[i].ljust(col_widths[i]) for i in range(len(row))))
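To make the experiment above more concrete, the following is a minimal sketch of Frobenius-norm NMF with the classical multiplicative updates, run on a made-up documents-by-words matrix. It is not the `ssnmf` implementation; the function `nmf_mult`, the toy data, and the orientation (rows are documents, columns are words, so topics are rows of `S_toy`) are assumptions chosen for illustration.

```python
import numpy as np

def nmf_mult(X, r, num_iters=200, eps=1e-10, seed=0):
    """Approximate X (n1 x n2, nonnegative) as A @ S with A >= 0 and S >= 0."""
    rng = np.random.default_rng(seed)
    n1, n2 = X.shape
    A = rng.random((n1, r))
    S = rng.random((r, n2))
    for _ in range(num_iters):
        # Multiplicative updates for the objective ||X - A S||_F^2;
        # eps guards against division by zero.
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        S *= (A.T @ X) / (A.T @ A @ S + eps)
    return A, S

# Toy data: two documents about the economy, two about health care.
vocab = np.array(["tax", "jobs", "economy", "health", "care", "insurance"])
X_toy = np.array([[3, 2, 4, 0, 0, 0],
                  [2, 3, 3, 0, 0, 0],
                  [0, 0, 0, 4, 3, 2],
                  [0, 0, 0, 3, 4, 3]], dtype=float)

A_toy, S_toy = nmf_mult(X_toy, r=2)
rel_err = np.linalg.norm(X_toy - A_toy @ S_toy) / np.linalg.norm(X_toy)
print("relative reconstruction error:", rel_err)

# Top keywords per topic: the largest entries in each row of S_toy.
for k in range(S_toy.shape[0]):
    top = S_toy[k].argsort()[-3:][::-1]
    print(f"Topic {k + 1}: " + ", ".join(vocab[top]))
```

The keyword extraction is the same `argsort` pattern used in the cell above, applied to rows of `S_toy` here because this sketch does not transpose the data.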

### Experiment 2: Guided NMF on Twitter Data
In this experiment, we train a Frobenius-norm Guided NMF model on the tweets dataset with guidance towards the words "obamacare" and "economy". We again extract the top five keywords for each topic: these are the five largest entries in each column of `A` in the code below, which is each row of the model's own `S` factor.

In [None]:
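# seed the random matrix initialization for reproducible results
np.random.seed(1)

# initialize and train the model; Y.T holds the seed topics and lam weights the guidance term
model_3 = SSNMF(X.T, r, Y=Y.T, lam=0.5, modelNum=3)
N = 200
model_3.mult(numiters=N)

# access the factor matrices; as in Experiment 1, the names are swapped because
# the model factorizes X.T, so X ≈ A @ S with A = model_3.S.T and S = model_3.A.T
S = model_3.A.T
A = model_3.S.T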

In [None]:
#collect and print the top keywords for each topic
keywords = np.empty((7,r), dtype=object)

for i in range(keywords.shape[1]):
    keywords[0,i] = "Topic " + str(i+1)
    keywords[1,i] = "-------"

for i in range(A.shape[1]):
    col = A[:,i]
    top = col.argsort()
    top = top[-5:][::-1]

    keywords[2:,i] = idx_to_word[top]

col_widths = [max([len(keywords[i][j]) for i in range(keywords.shape[0])])+2 for j in range(keywords.shape[1])]
for row in keywords:
    print("".join(row[i].ljust(col_widths[i]) for i in range(len(row))))
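As a rough illustration of how guidance enters the model, the sketch below minimizes an objective of the form ||X - A S||_F^2 + lam * ||Y - B S||_F^2 with multiplicative updates on toy data. This is an assumed form of the Frobenius-norm guided model, not the `ssnmf` code: the function `guided_nmf_mult`, the toy matrices, and the orientation (documents by words for `X_toy`, seed topics by words for `Y_toy`) are all hypothetical.

```python
import numpy as np

def guided_nmf_mult(X, Y, r, lam=0.5, num_iters=200, eps=1e-10, seed=0):
    """Sketch: minimize ||X - A S||_F^2 + lam * ||Y - B S||_F^2 over nonnegative A, B, S."""
    rng = np.random.default_rng(seed)
    n1, n2 = X.shape
    k = Y.shape[0]
    A = rng.random((n1, r))
    B = rng.random((k, r))
    S = rng.random((r, n2))
    history = []
    for _ in range(num_iters):
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        B *= (Y @ S.T) / (B @ S @ S.T + eps)
        # S is shared by both terms, so its update mixes the data and the seeds.
        S *= (A.T @ X + lam * B.T @ Y) / (A.T @ A @ S + lam * B.T @ B @ S + eps)
        obj = np.linalg.norm(X - A @ S) ** 2 + lam * np.linalg.norm(Y - B @ S) ** 2
        history.append(obj)
    return A, B, S, history

vocab = np.array(["tax", "jobs", "economy", "health", "care", "insurance"])
X_toy = np.array([[3, 2, 4, 0, 0, 0],
                  [2, 3, 3, 0, 0, 0],
                  [0, 0, 0, 4, 3, 2],
                  [0, 0, 0, 3, 4, 3]], dtype=float)

# One seed topic guided towards the word "economy".
Y_toy = np.zeros((1, len(vocab)))
Y_toy[0, 2] = 1.0

A_toy, B_toy, S_toy, history = guided_nmf_mult(X_toy, Y_toy, r=2, lam=0.5)

# B_toy[j, k] indicates how strongly learned topic k is tied to seed topic j.
seeded_topic = int(np.argmax(B_toy[0]))
top = S_toy[seeded_topic].argsort()[-3:][::-1]
print("topic most associated with the seed:", seeded_topic + 1)
print("its top words:", ", ".join(vocab[top]))
```

Raising `lam` puts more weight on matching the seed words, while `lam = 0` leaves `S_toy` driven by the data term alone.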