# Week 10

Text Processing and Analysis

## Setup

Run the following 2 cells to import all necessary libraries and helpers for this week's exercises.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/text_utils.py
!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/datasets/text/movie_reviews.tar.gz | tar xz

### All Scikit-Learn Now!

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import string

from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD, PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from data_utils import MinMaxScaler
from data_utils import display_confusion_matrix

from text_utils import get_top_words

## Text Classification

Let's try to predict whether a given review expresses positive or negative feelings towards a movie.

We have a dataset that basically has $2$ features per record: `review` and `sentiment`.

Let's load and look:

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")
reviews_df

### Features

Text is a very different kind of feature...

We do want to turn it into numbers somehow in order to apply some of the methods and models we've been learning about, but how to do that exactly is not entirely obvious.

We can try to extract some numerical information about the review text. Maybe something like the length of the review or the relative amount of punctuation marks or digits can be indicative of its sentiment.

Let's define some helper functions.

In [None]:
def count_characters(st):
  return len("".join(st.split()))

def count_words(st):
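  # split() with no argument splits on any run of whitespace, so double spaces don't create empty "words"
  return len(st.split())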

def count_punctuation(st):
  return len([c for c in st if c in string.punctuation])

def count_digits(st):
  return len([c for c in st if c in string.digits])

def get_punctuation_pct(st):
  return count_punctuation(st) / count_characters(st)

def get_digit_pct(st):
  return count_digits(st) / count_characters(st)

Now, let's apply some of these to our `DataFrame` to create numerical features that we can eventually use in a classifier.

In [None]:
reviews_df["char_count"] = reviews_df["review"].apply(count_characters)
reviews_df["word_count"] = reviews_df["review"].apply(count_words)
reviews_df["punctuation_pct"] = reviews_df["review"].apply(get_punctuation_pct)
reviews_df["digit_pct"] = reviews_df["review"].apply(get_digit_pct)

reviews_df

Before we model this data, let's look at some of these features and see if we can visually identify the negative and positive reviews on plots.

In [None]:
plt.scatter(reviews_df["word_count"], reviews_df["punctuation_pct"], c=reviews_df["sentiment"])
plt.title("Punctuation % x Word Count")
plt.show()

plt.scatter(reviews_df["digit_pct"], reviews_df["punctuation_pct"], c=reviews_df["sentiment"])
plt.title("Punctuation % x Digit %")
plt.show()

plt.scatter(reviews_df["word_count"], reviews_df["char_count"], c=reviews_df["sentiment"])
plt.title("Character Count x Word Count")
plt.show()

This is not very promising. It doesn't seem like these features contain enough information to help us extract meaning from the reviews.

Let's just confirm this suspicion by creating a classifier.

We'll use a `MinMaxScaler` since some of the features are already in a $[0,1]$ range.

In [None]:
mScaler = MinMaxScaler()

simple_feats_df = reviews_df.drop(columns=["review", "sentiment"])
simple_feats_scaled_df = mScaler.fit_transform(simple_feats_df)

simple_feats_scaled_df["sentiment"] = reviews_df["sentiment"]

simple_feats_scaled_df

Train/Test splitting should've been done before scaling, but this is just a quick experiment.
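For reference, a leak-free ordering might look like the sketch below. It is only an illustration: it assumes `Scikit-Learn`'s own `sklearn.preprocessing.MinMaxScaler` (imported under the alias `SkMinMaxScaler`) instead of the `data_utils` wrapper used above, and the small `toy_df` values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler as SkMinMaxScaler

# Toy stand-in for the numeric review features (values are made up).
toy_df = pd.DataFrame({
    "word_count": [120, 45, 300, 80, 210, 60, 150, 95, 500, 30],
    "punctuation_pct": [0.05, 0.02, 0.08, 0.03, 0.06, 0.01, 0.04, 0.07, 0.09, 0.02],
    "sentiment": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})
feat_cols = ["word_count", "punctuation_pct"]

# 1. Split first, so the test rows never influence the scaler.
toy_train_df, toy_test_df = train_test_split(toy_df, test_size=0.2, random_state=0)

# 2. Fit the scaler on the training features only.
toy_scaler = SkMinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train_df[feat_cols])

# 3. Re-use the training min/max to transform the test features.
toy_test_scaled = toy_scaler.transform(toy_test_df[feat_cols])

print(toy_train_scaled.min(axis=0), toy_train_scaled.max(axis=0))
print(toy_test_scaled)
```

Because the test rows are scaled with the training minimum and maximum, some of their values can land slightly outside $[0,1]$. That is expected.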

In [None]:
reviews_train_df, reviews_test_df = train_test_split(simple_feats_scaled_df, test_size=0.2)

reviews_train_df

In [None]:
mClassifier = RandomForestClassifier()

train_feats = reviews_train_df.drop(columns=["sentiment"])
train_labels = reviews_train_df["sentiment"]

mClassifier.fit(train_feats, train_labels)

train_preds = mClassifier.predict(train_feats)

accuracy_score(train_labels, train_preds)

In [None]:
test_feats = reviews_test_df.drop(columns=["sentiment"])
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(test_feats)

accuracy_score(test_labels, test_preds)

# 🤔

Our model is just as good as a random guess.

We'll have to use something else.

Back to the drawing board.

### Bag-of-Words (BoW)

This is a way of modeling sentences as a function of their words. We can think of it as a specialized version of One-Hot Encoding, where we turn our single-column `review` feature into a series of numbers that represent which words are present in that review. If a word is not present, that column gets a $0$, if the word is present, the column gets an integer that represents the total number of times that word was used in the review.

There are some specificities to keep in mind when we encode text this way. We need to figure out what constitutes a _word_ and what kind of words we want to ignore.

Do we consider the words `type`, `types`, `typed` as the same word or $3$ different words ?

Do we consider words like `a`, `the`, `of`, `in`, etc ... in our encoding ? What other kinds of words should be treated differently ?

The first consideration is part of the process of _tokenization_, or, how we turn sequences (of words) into their constituent components (tokens). There are libraries and pre-trained models that can help us with that task.

To answer part of the second question: it's best to remove common words from text before processing it because they don't add meaning or variance to our data. These words are commonly referred to as _stop words_, or _negative dictionary_, and, again, we can find lists of common _stop words_ for different languages in text-processing libraries and packages.

<img src="./imgs/tokens-00.jpg" width="720px" />

Our dataset can have other words that show up very frequently, but aren't generally considered _stop words_. For example, a dataset about movie reviews might have the words `movie`, `good`, `director`, etc in every single review. While not typical _stop words_, they should be ignored during encoding because they add no meaningful differentiation to our data.

The same is true for words that are rare and only show up in a small fraction of our sentences/reviews.

This process of encoding text sequences by the count of their words is called Vectorization. This method of encoding keeps track of which words are present in a series of words, and how common they are, but without any significant information about the order of the words or where they occurred in the original text.

That's why models created this way are called _Bag-of-Words_ models: they model _what_ words are there, but not _where_ they occurred.
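To make the "what, but not where" idea concrete, here is a minimal sketch in plain Python. The function name `toy_bag_of_words`, the regular expression used for tokenizing, and the tiny stop word list are all assumptions made for illustration; real tokenizers and stop word lists are more involved.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
TOY_STOP_WORDS = {"the", "a", "was", "it", "but"}

def toy_bag_of_words(text):
    # Lowercase, keep runs of letters as tokens, drop stop words, then count.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in TOY_STOP_WORDS)

bag_a = toy_bag_of_words("The plot was good but the acting was bad")
bag_b = toy_bag_of_words("The acting was good but the plot was bad")

print(bag_a)
print(bag_a == bag_b)
```

The two sentences say opposite things about the plot and the acting, but they end up with identical bags because word order is thrown away.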

### Vectorize by Count

Let's use the `Scikit-Learn` class `CountVectorizer` to encode our reviews.

The `min_df` and `max_df` parameters to the class constructor determine the minimum and maximum document frequencies to consider when encoding our data.

With `min_df=5` and `max_df=0.75`, the vectorizer ignores words that show up in fewer than $5$ reviews and words that show up in more than $75\%$ of reviews.

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")

reviews_train_df, reviews_test_df = train_test_split(reviews_df, test_size=0.2, random_state=1010)
reviews_train_df

In [None]:
# stop_words - ignore common English words (the, a, etc.)
# min_df - minimum document frequency (ie if word shows up in fewer than 5 documents then ignore it) - too specific
# max_df - maximum document frequency (ie if word shows up in more than 3/4 of documents then ignore it) - too common
# max_features - limit number of features for performance, optimization (ie only keep the 30,000 most frequent words across all reviews)
mCV = CountVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=30_000)

reviews_train_vct = mCV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mCV.transform(reviews_test_df["review"])

If we print our encoded features, we should see something like this:

<img src="./imgs/vector-00.jpg" width="720px" />

In [None]:
reviews_train_vct

# 20,000 rows: one for each of the 20,000 training reviews
# 24,066 columns: one for each word (feature) in the vocabulary
# only the non-zero entries are stored, not the full 20,000 x 24,066 grid

In [None]:
# can turn Compressed Sparse Row sparse matrix into array, but is extremely big
reviews_train_vct.toarray()[0]

But we don't.

What !?

### Sparse Matrices

This is why we have to move beyond `DataFrames` for text encoding.

We have thousands of reviews and thousands of possible words in our vocabulary. Encoding this information using a `DataFrame` would be extremely inefficient and wasteful because most of the columns for any given row are most likely $0$. Even if a review used $1\text{,}000$ unique words, that would still mean that no more than about $10\%$ of our columns would have non-zero values (for a vocabulary of $10\text{,}000$ or more words).
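To get a feel for the savings, we can compare a dense `numpy` array with its sparse version. This is a small sketch with made-up sizes and values; it uses `scipy.sparse.csr_matrix`, the Compressed Sparse Row format, and only counts the bytes of the three arrays that a CSR matrix stores.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-empty count matrix: 1,000 "reviews" x 5,000 "words".
toy_dense = np.zeros((1000, 5000), dtype=np.int64)
toy_dense[0, 10] = 2
toy_dense[0, 42] = 1
toy_dense[999, 4999] = 3

toy_sparse = csr_matrix(toy_dense)

# The dense array stores every cell, zeros included.
dense_bytes = toy_dense.nbytes

# The sparse matrix stores the non-zero values, their column indices,
# and one pointer per row.
sparse_bytes = (toy_sparse.data.nbytes
                + toy_sparse.indices.nbytes
                + toy_sparse.indptr.nbytes)

print(toy_sparse.shape, toy_sparse.nnz)
print(dense_bytes, sparse_bytes)
```

Both objects describe the same $1\text{,}000 \times 5\text{,}000$ matrix, but the sparse one only pays for the $3$ entries that are actually there.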

### Using Sparse Matrices

The `CountVectorizer` object has functions that give us information about the words it encountered.

`get_feature_names_out()`: returns a list of the words seen in the dataset and encoded.

`inverse_transform()`: can be used to turn a sequence of word counts back into actual words, but without the order information of the original sentence.

And, we can index into our vector of reviews with `[]` to get a specific review. These are encoded as sparse matrices, so we have to do a bit of work to get to the actual words and their counts:

- It helps to use the `nonzero()` function to get a list of the indices of words that are actually present in that review.

- Once we have the indices, we can use them to access the review vector, and get the non-zero word counts from specific locations in the sparse matrix.

In [None]:
vocab = mCV.get_feature_names_out()

print(len(vocab))
display(vocab)

Get indices of non-zero counts for words in the first review:

In [None]:
reviews_train_vct[0].nonzero()

# reviews_train_vct[:2].nonzero()

# result is 2 lists
# first list is the rows the nonzero elements are in
# second are the nonzero columns

# if only want to see the columns of nonzero since we're only getting 1 row anyway...
_, feature_idxs = reviews_train_vct[0].nonzero()
print(feature_idxs)

# now lets see what words are at each nonzero idx
vocab[feature_idxs]

Get counts from those indices:

In [None]:
reviews_train_vct[reviews_train_vct[0].nonzero()]

Get words in a review:

In [None]:
mCV.inverse_transform(reviews_train_vct[0])

Or, using the non-zero indices to index into the `vocab` list:

In [None]:
vocab[reviews_train_vct[0].nonzero()[1]]

We can use these functions to order the words of a review by frequency.

The process is:

- Get a `review` by indexing into our list of encoded reviews
- Count the number of tokens/words in the review
- Use [argsort()](https://numpy.org/doc/2.1/reference/generated/numpy.argsort.html) to get the order of indices that would sort the word counts
  - Use negative counts to get the counts ordered from largest to smallest
- Use the first `word_count` items of this array to index into our vocab and get the actual words of the review

In [None]:
review = reviews_train_vct[0]

word_count = len(review.nonzero()[0])

sorted_idxs = (-review.toarray()[0]).argsort()

vocab[sorted_idxs[:word_count]]

This seems like a useful enough operation, that maybe it should be a function that we can use on any sparse matrix of frequency counts...

The `get_top_words(cnt, vocab, n)` function in `text_utils` will return the top `n` words of `cnt`, a count vector or count matrix (list of vectors).

Omitting `n` makes the function return all of the words present in the sequences, ordered by frequency.

The returned value is a tuple of words and their counts.
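We don't need to read the source of `text_utils` to use it, but a function like this could be put together from the steps above. The sketch below is not the actual `get_top_words()` implementation: `top_words_sketch` is a hypothetical name, it only handles a single count vector, and its return format is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_words_sketch(count_row, vocab, n=None):
    # Accept a 1-row sparse matrix or a plain 1-D sequence of counts.
    if hasattr(count_row, "toarray"):
        counts = count_row.toarray()[0]
    else:
        counts = np.asarray(count_row)
    # Number of words actually present; used as the limit when n is omitted.
    present = int(np.count_nonzero(counts))
    n = present if n is None else min(n, present)
    # Negate the counts so argsort puts the largest ones first.
    order = (-counts).argsort(kind="stable")[:n]
    return vocab[order], counts[order]

toy_vocab = np.array(["acting", "bad", "good", "plot"])
toy_row = csr_matrix(np.array([[1, 0, 3, 2]]))

toy_words, toy_word_counts = top_words_sketch(toy_row, toy_vocab)
print(toy_words, toy_word_counts)

top2_words, top2_counts = top_words_sketch(toy_row, toy_vocab, 2)
print(top2_words, top2_counts)
```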

In [None]:
from text_utils import get_top_words

words, counts = get_top_words(reviews_train_vct[0], vocab)

display(words)
display(counts)

### Classifying by Count

Ok. After that little bit of a detour to explore vector count sparse matrices, we are back to our classification problem.

Seems like we should be able to classify whether a review is positive or negative by looking at the words used...

Let's train a `RandomForestClassifier` and validate with our test dataset.

In [None]:
mClassifier = RandomForestClassifier()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

accuracy_score(train_labels, train_preds)

Not bad. Promising.

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

accuracy_score(test_labels, test_preds)

Ok! This is not bad.

After learning about count vectorization and sparse matrices, the code for doing this is actually quite simple.

We could adjust parameters of the classifier or the vectorizer to improve this, but using a `RandomForestClassifier` for this task is quite inefficient.

Let's look at a different kind of classifier before we continue exploring vectorization.

### Naive Bayes

Bayesian statistics is a complete and complex field of math and philosophy. At a high level, it's a theory that allows for probabilities (of events, measurements, classifications, etc) to be updated based on the presence of (new) data.

We are going to look at a very slim portion of Bayesian statistics to get a basic understanding of how this theory can be applied within Machine Learning algorithms.

The Naive Bayes methods are a set of supervised learning algorithms based on a version of Bayes' theorem that assumes that all of our features are independent of each other once we know the class.

As applied to a classification problem, this theorem has the following form:

$$P\left(y \middle| x_1, x_2, \ldots, x_n\right) = \frac{P\left(y\right)P\left(x_1, x_2, \ldots, x_n \middle| y \right)}{P\left(x_1, x_2, \ldots, x_n\right)}$$

This is an eyeful, but given that $y$ is the class associated with feature measurements $x_1, x_2, \ldots, x_n$, it reads:

The probability that a given set of measurements ($x_1, x_2, \ldots, x_n$) represents an object of class $y$ is equal to the probability of seeing an object of class $y$ in our dataset, multiplied by the probability that an object of class $y$ has measurements $x_1, x_2, \ldots, x_n$, divided by how common that particular set of measurements is.

$P\left(y\right)$ is calculated by measuring what fraction of the items in our dataset represent an object of class $y$. If we have $10$ objects that are $y$ in a dataset of $50$ objects, our $P\left(y\right) = \frac{10}{50}$.

Likewise, $P\left(x_1, x_2, \ldots, x_n\right)$ represents how often this exact combination of measurements shows up in our dataset. If only one row out of $50$ has this combination of input features, then $P\left(x_1, x_2, \ldots, x_n\right) = \frac{1}{50}$.

$P\left(x_1, x_2, \ldots, x_n \middle| y \right)$ is the trickier bit, but it gets simplified by the _naive_ assumption of feature independence and can be split into multiple terms:

$P\left(x_1 \middle| y \right) \cdot P\left(x_2 \middle| y \right) \cdot\ldots\cdot P\left(x_n \middle| y \right)$

These are the probabilities that items of class $y$ have specific values for $x_1, x_2, \ldots x_n$. For example, if in our dataset of $50$ elements, $10$ have class $y$, and $2$ out of those $10$ have a particular value $x_1$ for the first feature, $P\left(x_1 \middle| y \right) = \frac{2}{10}$.

#### Naive Bayes Text Example

Let's pretend we want to calculate the probability that a review with the words `awful`, `bloody`, `guns` and `park` is **negative**.

This is equivalent to calculating:
$$P\left(\text{negative} \middle| \text{awful}, \text{bloody}, \text{guns}, \text{park} \right) = \frac{P\left(\text{negative}\right) P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park} \middle| \text{negative} \right)}{P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park}\right)}$$

$P\left(\text{negative}\right)$ is equal to the proportion of **negative** reviews in the dataset. If half are positive and half are negative, $P\left(\text{negative}\right) = 0.5$.

$P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park}\right)$ is the proportion of the number of reviews in the dataset that have all four words `awful`, `bloody`, `guns` and `park`.

$P\left(\text{awful}, \text{bloody}, \text{guns}, \text{park} \middle| \text{negative} \right)$ can be simplified to $P\left(\text{awful} \middle| \text{negative} \right) \cdot P\left(\text{bloody} \middle| \text{negative} \right) \cdot P\left(\text{guns} \middle| \text{negative} \right) \cdot P\left(\text{park} \middle| \text{negative} \right)$

$P\left(\text{awful} \middle| \text{negative} \right)$ is the proportion of negative reviews that have the word `awful`, $P\left(\text{bloody} \middle| \text{negative} \right)$ is the proportion of negative reviews with `bloody` in them, etc etc etc.
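We can run these numbers by hand on a tiny made-up dataset. Everything in the sketch below is an assumption for illustration: the $8$ toy reviews, the function names, and the choice to get the denominator by adding up the numerators of both classes, which is a common shortcut that guarantees the two class probabilities sum to $1$.

```python
from fractions import Fraction

# Made-up toy dataset: each review is (set of words, label).
toy_reviews = [
    ({"awful", "bloody", "guns"}, "negative"),
    ({"awful", "park"}, "negative"),
    ({"bloody", "guns", "park"}, "negative"),
    ({"awful", "bloody", "guns", "park"}, "negative"),
    ({"park", "guns"}, "positive"),
    ({"bloody", "park"}, "positive"),
    ({"park"}, "positive"),
    ({"awful", "guns"}, "positive"),
]
query_words = ["awful", "bloody", "guns", "park"]

def prior(label):
    # P(label): proportion of reviews that have this label.
    n_label = sum(1 for _, lab in toy_reviews if lab == label)
    return Fraction(n_label, len(toy_reviews))

def likelihood(word, label):
    # P(word | label): proportion of reviews with this label that contain the word.
    in_class = [words for words, lab in toy_reviews if lab == label]
    return Fraction(sum(1 for words in in_class if word in words), len(in_class))

def naive_score(label):
    # Numerator of Bayes' theorem under the naive independence assumption.
    score = prior(label)
    for word in query_words:
        score *= likelihood(word, label)
    return score

neg_score = naive_score("negative")
pos_score = naive_score("positive")

# Both classes share the same denominator, so normalizing the two
# numerators is enough to compare them.
p_negative = neg_score / (neg_score + pos_score)

print(neg_score, pos_score, p_negative)
```

Every step is just counting, multiplying and dividing. That is what makes fitting this kind of model cheap.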

### Why ????

It might not be obvious at first, but when used for classification of datasets with sparse feature vectors, the process described above can be extremely efficient because _fitting_ the model means calculating a few probability constants from our training dataset. All of the $P()$ terms on the right hand side of the Bayes equation are basic proportions calculated with addition and division operations.

`Scikit-Learn` has different flavors of Naive Bayes classifiers that make further assumptions about the distributions of the input features and how $P\left(X \middle| y \right)$ can be simplified.

- [Gaussian](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) Naive Bayes assumes the features have gaussian distributions. This is good for datasets with continuous-valued inputs.
- [Bernoulli](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) Naive Bayes assumes the features are all binary values (One-Hot Encoding).
- [Categorical](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html) Naive Bayes assumes our features are integers that represent categories (Ordinal Encoding).
- [Multinomial](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) Naive Bayes assumes our features are discrete measurements.

Given that the feature vectors computed with `CountVectorizer` represent word counts, it makes sense for us to use a Multinomial classifier for this task.
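Before running this on the full dataset, it can help to watch a `MultinomialNB` work on a handful of made-up sentences. This is only a sketch: the `toy_texts` are invented, and the label encoding ($0$ for negative, $1$ for positive) is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_texts = ["awful boring film", "awful plot", "great fun film", "great acting"]
toy_labels = [0, 0, 1, 1]  # assumed encoding: 0 = negative, 1 = positive

# Turn the sentences into a sparse matrix of word counts.
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_texts)

# "Fitting" computes the class and word proportions from the counts.
toy_nb = MultinomialNB()
toy_nb.fit(toy_X, toy_labels)

# Encode a new sentence with the same vocabulary and classify it.
new_X = toy_cv.transform(["awful boring"])
toy_pred = toy_nb.predict(new_X)
toy_proba = toy_nb.predict_proba(new_X)

print(toy_pred)
print(toy_proba)
```

The columns of `predict_proba()` follow the order of `toy_nb.classes_`, so the first column is the probability of the negative class.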

In [None]:
# TODO: repeat classification using the appropriate Bayes model

mClassifierB = MultinomialNB()

# train_labels = reviews_train_df["sentiment"]

mClassifierB.fit(reviews_train_vct, train_labels)

train_preds_B = mClassifierB.predict(reviews_train_vct)

accuracy_score(train_labels, train_preds_B)

### N-Grams

Now that we have an efficient classifier for sparse count feature vectors we can finally experiment with n-grams.

In its simplest form, the Bag-of-Words method doesn't take into consideration any information about the order or location of the words in a sequence of words. We can, however, set it up to count pairs (or triplets, or quadruplets, etc) of words instead of single words.

So, instead of breaking up "_it was a good movie_", like this:

|it|was|a|good|movie|
|-|-|-|-|-|

It breaks it up like this:

|it was|was a|a good|good movie|
|-|-|-|-|

These are the 2-grams (or bi-grams) of our sentence, but the concept can be extended to any integer value of $n$ to extract counts for different lengths of n-grams.

While this doesn't help with location information, it does extract some information about word order and common phrases.
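The idea is simple enough to write by hand. The sketch below uses a hypothetical helper, `toy_ngrams`, that slides a window of `n` tokens over a list of words; it splits on whitespace only, which is a simplification of what a real tokenizer does.

```python
def toy_ngrams(tokens, n):
    # Slide a window of n tokens across the list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = "it was a good movie".split()

toy_unigrams = toy_ngrams(toy_tokens, 1)
toy_bigrams = toy_ngrams(toy_tokens, 2)
toy_trigrams = toy_ngrams(toy_tokens, 3)

print(toy_bigrams)   # ['it was', 'was a', 'a good', 'good movie']
print(toy_trigrams)  # ['it was a', 'was a good', 'a good movie']
```

Keep in mind that `CountVectorizer` won't give exactly these bigrams for this sentence: by default it drops one-letter tokens like `a`, and stop words are removed before the n-grams are built.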

To extract n-grams during vectorization, we can give `CountVectorizer` a range of values to consider with the parameter `ngram_range`. A value of $(2,2)$ will only extract bigrams, while $(1,2)$ will extract counts for single words and pairs of words.

In [None]:
mCV = CountVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=50_000, ngram_range=(2, 2))

reviews_train_vct = mCV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mCV.transform(reviews_test_df["review"])

The `CountVectorizer` functions we saw above and our `get_top_words()` function will work the same way. The only difference is that right now our features represent counts for pairs of words.

In [None]:
vocab = mCV.get_feature_names_out()
print(len(vocab))
vocab

In [None]:
mCV.inverse_transform(reviews_train_vct[0])

In [None]:
get_top_words(reviews_train_vct[0], vocab)

### Train and Validate

Let's try it out !

In [None]:
mClassifier = MultinomialNB()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

accuracy_score(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

accuracy_score(test_labels, test_preds)

### TF-IDF

Another way to vectorize our reviews into a Bag-of-Words is to use a slightly smarter and more specific way of counting words in our reviews.

Term Frequency-Inverse Document Frequency is a technique used to "count" words and scale the counts by how important a word might be to a document/review.

It does this by calculating two values for each tokenized word in a review:
- _**Term Frequency**_: the relative frequency of the word within a document/review.
- _**Document Frequency**_: the fraction of documents in the dataset that contain this word. Its inverse measures how much information a word carries by calculating how rare it is. What gets used in the actual `tf-idf` calculation is the $log()$ of the inverse of this value.

In math, this is:

$$ tf(t, d) = \frac{Count(t)}{| d |} \hspace{20pt}\hspace{20pt} idf(t, D) = log\left(\frac{| D |}{Count(d : t \in d)}\right)$$

Where:
- $Count(t)$ is the number of times a word appears in a document.
- $| d |$ is the length of the document, in words.
- $| D |$ is the total number of documents in the dataset.
- $Count(d : t \in d)$ is the number of documents in the dataset that have the word $t$ in them.
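The two formulas above translate almost directly into Python. This is a minimal sketch on three made-up, already-tokenized documents; the function names are hypothetical, and `idf()` assumes the word appears in at least one document.

```python
import math

toy_docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]

def tf(term, doc):
    # Count(t) / |d|: relative frequency of the term within one document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(|D| / number of documents that contain the term).
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["good", "plot"]:
    print(term, round(tf_idf(term, toy_docs[2], toy_docs), 3))
```

In the last document, `good` is used twice but also shows up in another document, while `plot` is used once and shows up nowhere else, so `plot` ends up with the higher score. The values computed by `TfidfVectorizer` won't match these exactly, because by default it uses a smoothed version of the `idf` term and normalizes each document vector.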

Of course `Scikit-Learn` has a built-in tf-idf Vectorizer that will do this for us. We instantiate it just like the `CountVectorizer`:

In [None]:
mTfidV = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=50_000, ngram_range=(1, 1))

reviews_train_vct = mTfidV.fit_transform(reviews_train_df["review"])
reviews_test_vct = mTfidV.transform(reviews_test_df["review"])

The `TfidfVectorizer` has all the functions we saw above in the `CountVectorizer` object, and our `get_top_words()` function will work the same way with our tf-idf vectors. The difference is that now our features are not plain integer counts of words or n-grams, but our tf-idf importance metric. The higher the metric, the more significant the word (or n-gram) within our vocabulary.

In [None]:
vocab = mTfidV.get_feature_names_out()
print(len(vocab))
vocab

In [None]:
mTfidV.inverse_transform(reviews_train_vct[0])

In [None]:
get_top_words(reviews_train_vct[0], vocab, 10)

### Classification with tf-idf

This stays the same.

While the multinomial classifier normally requires integer features, in practice, fractional counts such as the ones computed with a `TfidfVectorizer` also work.

We could turn these into `int`s by multiplying them by $100$... but we don't have to. A `MultinomialNB` is still the best option because our tf-idf values represent a kind of count. They aren't continuous, unbounded, `float` values, so it wouldn't make sense to use a Gaussian Bayes classifier, for example.

In [None]:
mClassifier = MultinomialNB()

train_labels = reviews_train_df["sentiment"]

mClassifier.fit(reviews_train_vct, train_labels)

train_preds = mClassifier.predict(reviews_train_vct)

accuracy_score(train_labels, train_preds)

In [None]:
test_labels = reviews_test_df["sentiment"]

test_preds = mClassifier.predict(reviews_test_vct)

accuracy_score(test_labels, test_preds)

Not bad.

How does the choice of n-gram range affect classification by tf-idf ?

In [None]:
# TODO: Evaluate the effect of n-grams in the TfidfVectorizer

## Unsupervised Learning

Our movie reviews dataset is rich in information. It contains details and descriptions of movies, actors, directors, etc, along with other patterns and trends that weren't directly used while training our binary sentiment classifier. This is probably true of most natural language text datasets, and probably has to do with the nature of languages and how they evolved to have structure and carry dense amounts of information... Unlike a pixel, a single word by itself will mean something, even if ambiguously.

What this means is, there are usually other patterns and trends, that are independent of outcome variables, that we can try to extract from datasets like this.

How do we extract information when we don't have "answers" in our dataset ? Unsupervised Learning! And in this case we'll start by looking at Clustering.

### Clustering

Just like we clustered numerical data and pixels by finding locations in our feature space to represent, or capture, sections of our dataset, we can imagine finding specific locations in our feature space to represent sub-sets of our reviews.

When we clustered pixels, our original features were `R`, `G`, `B` channel values, and so our cluster centers could be considered a set of representative colors for our image.

When we clustered wines, the cluster centers had the same $10$ features as our original dataset, and represented meaningful characteristics for the wines in each cluster.

For reviews ... our dataset has $10\text{,}000$ to $50\text{,}000$ features after we vectorize our reviews. Even if most of these values are $0$, we still have a dataset with $50\text{,}000$ features. If we cluster based on these features, the resulting cluster centers will have the same features and values with the same meaning as the original data.

This means that instead of being representative colors or wine characteristics, our cluster centers will be sets of words that represent a subsection of our dataset.

Let's take a look. We'll start by clustering our reviews into $8$ groups:

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")

In [None]:
mVec = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.5, max_features=50_000, ngram_range=(1, 1))

reviews_vct = mVec.fit_transform(reviews_df["review"])

In [None]:
vocab = mVec.get_feature_names_out()
print(len(vocab))

In [None]:
mClust = KMeans(n_clusters=8, random_state=800)
reviews_km = mClust.fit_predict(reviews_vct)

We can check our cluster centers:

In [None]:
mClust.cluster_centers_

# 🤔
Maybe we can check the shape of these... 

In [None]:
mClust.cluster_centers_.shape

Ok. That makes sense. We have $8$ clusters with about $24\text{,}000$ features each.

We can think of these as the list of words that would've been in the $8$ most representative reviews of our dataset.

We can "unpack" them using our `get_top_words()` function. The cluster centers are the reviews, the `TfidfVectorizer` object has our vocabulary and we can look at the top $8$ - $10$ words in each review:

In [None]:
get_top_words(mClust.cluster_centers_, vocab, 8)[0]

### Interpretation

There's something here.

Some of the cluster center words seem to be indicative of the type of movies in those clusters, or even whether they are TV series.

Can we improve the legibility of our clusters ?

In [None]:
# TODO: Experiment with CountVectorizer, the TfidfVectorizer parameters and/or N-grams

### Decomposition

We're clustering over $20\text{,}000$ - $50\text{,}000$ features of very sparse data. KMeans clustering and other algorithms might benefit from a reduction in the number of features that they have to consider.

Last week we saw how to do something like this with `PCA`. `PCA` is the MVP of all decomposition algorithms, and we could use `PCA` here, but given our type of data, we should use something a little more specific.

If we read the documentation for `PCA` it will mention something about how it _centers the data_ before calculating the decomposition, meaning that it shifts all of the input features so their average is $0$. This is fine for continuous features, and in most of those cases our data will already have been shifted and normalized using something like `StandardScaler` by the time it gets to `PCA`.

But, it doesn't make sense to do that with our text features here due to the scale and sparseness of our data. First, it is very inefficient to go through the process of calculating averages and shifting these features. Second, even if we are using tf-idf values that aren't whole numbers, they don't really represent continuous values of a distribution; they're more like counters that use floating point values, so calculating these averages can introduce unwanted distortions to our data.
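A small `numpy` experiment shows the efficiency problem. The matrix below is made up, and subtracting the column means by hand is just a stand-in for the centering step described above.

```python
import numpy as np

# A tiny "sparse" dataset: only 3 of the 12 values are non-zero.
toy_X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
])

# Centering: subtract each column's mean so every column averages to 0.
toy_centered = toy_X - toy_X.mean(axis=0)

nonzero_before = np.count_nonzero(toy_X)
nonzero_after = np.count_nonzero(toy_centered)

print(nonzero_before, "non-zero values before centering")
print(nonzero_after, "non-zero values after centering")
```

After centering, every $0$ has turned into a small negative number, so nothing is "empty" anymore and a sparse matrix would have to store every single cell.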

### Singular Value Decomposition

We can use a more general form of the `PCA` algorithm called _Singular Value Decomposition_ that, like `PCA`, will decompose our dataset into smaller datasets of combined features ordered by importance, but unlike `PCA` it doesn't have to center our data before processing.

The computation for doing `SVD` and `PCA` decomposition is the same, but due to the centering of the data, `PCA` can take some shortcuts.

We can think of `PCA` decomposition as something that refactors our dataset into two dense matrices like this, where the first one holds our new features and can be _abbreviated_ by selecting the columns with the greatest amount of combined variance:

<img src="./imgs/pca-01.jpg" width="720px" />

Singular Value Decomposition does something like this:

<img src="./imgs/svd-01.jpg" width="720px" />

It refactors our data into $3$ matrices, where one of them only has elements on the diagonal. We get our transformed dataset by multiplying the first matrix by a _truncated_/_abbreviated_ version of this diagonal matrix.
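We can check this description with `numpy`'s own `np.linalg.svd()` on a small random matrix. This is a sketch with made-up data: `U`, `s` and `Vt` are the names used here for the $3$ factors, and `k = 2` is an arbitrary number of components to keep.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.random((6, 4))  # 6 rows, 4 features of made-up data

# Thin SVD: toy_X == U @ diag(s) @ Vt, with s holding the diagonal values.
U, s, Vt = np.linalg.svd(toy_X, full_matrices=False)
is_exact = np.allclose(toy_X, U @ np.diag(s) @ Vt)
print(is_exact)

# Keep only the k largest singular values (they come sorted, largest first).
k = 2
toy_reduced = U[:, :k] * s[:k]  # same as U[:, :k] @ np.diag(s[:k])

# Equivalent view: project the data onto the first k rows of Vt.
same_as_projection = np.allclose(toy_reduced, toy_X @ Vt[:k].T)
print(toy_reduced.shape, same_as_projection)
```

The rows of `Vt` play the same role as the components of a `PCA`: they are the new axes, and the transformed dataset holds the coordinates of each row along the first `k` of them.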

Same... but different.

The details of the math aren't crucial, since `Scikit-Learn` will handle all of the computations for us. We just have to remember that when decomposing sparse feature vectors or features that represent counts, it is better to use `SVD` instead of `PCA`.

### Latent Semantic Analysis

Using `TruncatedSVD` on a feature vector of tf-idf values is so common that it has its own name, _Latent Semantic Analysis_, and like `PCA` or clustering, we can use it to uncover some hidden patterns in our data.

We'll use it in a slightly different manner than how we used `PCA` to reduce our data before modeling.

Like `PCA`, the components of our `TruncatedSVD` decomposition represent new axes for our transformed data, and are linear combinations of the original features in our dataset.

Unlike the data we looked at with `PCA`, our feature space here has so many dimensions, that just looking at the top features that contribute to each of our components can give us an idea of the topics in our dataset.

This isn't always possible with non-sparse datasets because when every row of a dataset has a value for every feature, and we have few features, a lot of the `PCA` components might end up having the same contributing features, but with different weights.

For example, if we consider the diamond dataset, maybe the components from our `PCA` are all mostly made up of different combinations of `length`, `width` and `height`.

With a dataset of sparse word count features, maybe our first component is mostly made up of a combination of the words `car` and `bottle`, the second component is mostly a combination of the words `flower` and `water`, etc...

#### Reload Dataset

Let's start afresh: let's reload our dataset and run `tf-idf` vectorization.

Like with clustering, we won't worry about separating our dataset into train/test subsets since this is mostly exploratory data analysis and we're not so much interested in the predictive capabilities of our models right now.

In [None]:
reviews_df = pd.read_csv("./data/text/movie_reviews.csv")

In [None]:
mVec = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.9, max_features=50_000, ngram_range=(1, 2))

reviews_vct = mVec.fit_transform(reviews_df["review"])

In [None]:
vocab = mVec.get_feature_names_out()
print(len(vocab))

#### Decompose

We'll decompose our dataset into $10$ components.

This means that we'll transform our original $25\text{,}000 \times 50\text{,}000$ sparse tf-idf document matrix into a dense $25\text{,}000 \times 10$ matrix.

In [None]:
svd = TruncatedSVD(n_components=10, random_state=1010)
reviews_svd = svd.fit_transform(reviews_vct)

These are the first few rows of our transformed dataset. Like with `PCA`, the original meaning of our columns is gone and each of these $10$ columns is a linear combination of the original $50\text{,}000$ features.

In [None]:
reviews_svd[:5]
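To make the "linear combination" idea concrete, here is a sketch on toy sentences (made up for this example). It checks that the values returned by `transform()` are the same as multiplying the tf-idf matrix by the component weights stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for this illustration
toy_docs = [
  "the car drove past the bottle",
  "a flower needs water",
  "the car needs water",
  "a bottle of water",
]

toy_vct = TfidfVectorizer().fit_transform(toy_docs)
toy_svd = TruncatedSVD(n_components=2, random_state=1010).fit(toy_vct)

# Each new column is a weighted sum of the original tf-idf columns,
# with the weights given by one row of components_
via_transform = toy_svd.transform(toy_vct)
via_weights = np.asarray(toy_vct @ toy_svd.components_.T)

print(np.allclose(via_transform, via_weights))
```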

### Topic Extraction

Unlike with `PCA`, the $\text{top-}7$ features in each of these components should give us an idea of the kinds of documents/records we have in our dataset.

In [None]:
get_top_words(svd.components_, vocab, 7)[0]

### Interpretation

Just like with clustering above, further refinement of the dataset would be needed in order to get very separated and unique topics, but the lists above do show trends in the content of the reviews. We can see certain movie genres and even some indication of the sentiment of the reviews.

A possible next step would be to further filter the list of allowed words in our initial tokenization and remove certain common words like `movie`, `movies`, `film`, `good`, `bad`, `like`, etc.

## Classification for Other Datasets

Now that we know all of the tricks of working with text data, let's look at more significant text classification problems.

The datasets [HERE](https://github.com/PSAM-5020-2025S-A/5020-utils/tree/refs/heads/main/datasets/text/amazon_reviews) have review information for different categories of Amazon products.

Books is the largest of the datasets, so we can start with that one.

The dataset not only has the text of each review, and some irrelevant information about the reviewer, but also includes a numerical rating of the product. These ratings are whole numbers between $1$ and $5$, which means we're looking at a multi-class classification problem.

### Let's Load

Let's download, load and take a look at our data.

In [None]:
!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/raw/refs/heads/main/datasets/text/amazon_reviews/books.tar.gz | tar xz

In [None]:
reviews_full_df = pd.read_csv("./data/text/amazon_reviews/books.csv")
reviews_full_df

This is a pretty big dataset with $220\text{,}000$ rows of book reviews.

Let's take a closer look at our output label, the `rating` column, to see how its values are distributed.

In [None]:
reviews_full_df["rating"].value_counts()

# 🫤

That's pretty uneven. A model could just guess $5$ all the time and be correct $60\%$ of the time.

When training a classifier we want to keep the values of our output label balanced in order to avoid any kind of artificial biasing of the model during training.
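To see where a number like that comes from, here is a sketch with made-up ratings (not the book data). The accuracy of a model that always guesses the most common rating is simply the share of rows that have that rating.

```python
import pandas as pd

# Made-up ratings for illustration: 6 out of 10 are 5s
toy_ratings = pd.Series([5, 5, 5, 5, 5, 5, 4, 4, 3, 1])

# The most common rating, and how often "always guess it" would be right
majority_rating = toy_ratings.mode()[0]
baseline_accuracy = (toy_ratings == majority_rating).mean()

print(majority_rating, baseline_accuracy)
```

Any classifier we train should beat this kind of baseline before we consider it useful.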

Let's re-balance the dataset. We'll do this by storing the number of reviews that have the least common rating and then getting an equal number of random reviews for each of the other possible rating values.

We can group our `DataFrame` rows by one of the columns and then sample an equal number of reviews from each of these groups. The code for doing this is concise, but not very intuitive, requiring some extraneous parameters and function calls:

In [None]:
min_count = reviews_full_df["rating"].value_counts().min()

def sample_min(df):
  return df.sample(min_count, random_state=10010)

rg = reviews_full_df.groupby("rating")
reviews_balanced_df = rg[reviews_full_df.columns].apply(sample_min).reset_index(drop=True)

reviews_balanced_df
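For comparison, here is a sketch of a possibly simpler way to get the same kind of balanced sample, assuming a `pandas` version that provides `sample()` directly on grouped data (1.1 or later). It uses a small made-up `DataFrame`, and the `toy_` names are just for this example.

```python
import pandas as pd

# Made-up, unbalanced data: four 5s, three 4s, two 1s
toy_df = pd.DataFrame({
  "rating": [5, 5, 5, 5, 4, 4, 4, 1, 1],
  "review_text": [f"review {i}" for i in range(9)],
})

# Size of the smallest rating group
toy_min = toy_df["rating"].value_counts().min()

# Sample that many rows from every rating group
toy_balanced = (
  toy_df.groupby("rating")
  .sample(n=toy_min, random_state=10010)
  .reset_index(drop=True)
)

print(toy_balanced["rating"].value_counts())
```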

We should have a balanced dataset now:

In [None]:
reviews_balanced_df["rating"].value_counts()

But, there is one other thing we should check and fix before we start separating our input features and labels.

We should check if we have any reviews that don't have values in the `review_text` column.

We do this by using the `isna()` function to detect any empty/null values, and then getting the total count of these _na_ values per column:

In [None]:
reviews_balanced_df.isna().sum()
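If the text column did have missing values, one option would be to drop those rows before vectorizing. This is a sketch on a made-up `DataFrame`; dropping rows is only one possible choice.

```python
import pandas as pd

# Made-up data with one missing review text and one missing title
toy_df = pd.DataFrame({
  "title": ["Great", None, "Meh"],
  "review_text": ["loved it", "pretty bad", None],
  "rating": [5, 1, 3],
})

# Count missing values per column
print(toy_df.isna().sum())

# Keep only the rows that have a review text; other missing values are left alone
toy_clean = toy_df.dropna(subset=["review_text"]).reset_index(drop=True)
print(toy_clean)
```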

There seem to be some missing `reviewer`, `date` and `title` values, but all of our reviews have a `review_text`.

Let's start separating our input and output features from the full dataset.

We'll create a separate `DataFrame` that will basically have the `review_text` and `rating` values.

We just have to make sure `rating` is represented as a whole number (`int`) so we can use it as the class labels in a `MultinomialNB` classifier.

In [None]:
reviews_df = pd.DataFrame(reviews_balanced_df["rating"].astype(int))
reviews_df["review"] = reviews_balanced_df["review_text"]
reviews_df

### Classify !

Ok. The data is ready, we just have to:
- Split the data into train/test datasets
- Vectorize the text column into count or tf-idf features
- Train a classifier
- Look at confusion matrices and evaluate the classifier
- Adjust parameters in the vectorizer or classifier, maybe try n-grams
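
As a reminder of how these steps fit together, here is a minimal sketch on a tiny made-up dataset. It is not a solution for the book reviews: the toy reviews, the parameter values and the `toy_` names are all assumptions for illustration, and the real dataset will need its own choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up dataset with 3 rating classes
toy_reviews = [
  "terrible book, boring and slow",
  "awful writing, a boring plot",
  "boring, terrible, waste of time",
  "slow and awful, not worth it",
  "an okay read, nothing special",
  "average plot, okay characters",
  "nothing special but an okay story",
  "average book, fine for a weekend",
  "wonderful story, loved every page",
  "loved it, a wonderful read",
  "brilliant writing, wonderful characters",
  "a brilliant book, loved the plot",
]
toy_ratings = [1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5]

# Split, keeping the classes balanced in both subsets
X_train, X_test, y_train, y_test = train_test_split(
  toy_reviews, toy_ratings, test_size=0.25, stratify=toy_ratings, random_state=1010)

# Vectorize: fit on the training text only, then transform both subsets
toy_vec = TfidfVectorizer(ngram_range=(1, 2))
X_train_vct = toy_vec.fit_transform(X_train)
X_test_vct = toy_vec.transform(X_test)

# Train and predict
toy_model = MultinomialNB().fit(X_train_vct, y_train)
toy_pred = toy_model.predict(X_test_vct)

# Evaluate
toy_accuracy = accuracy_score(y_test, toy_pred)
toy_cm = confusion_matrix(y_test, toy_pred, labels=[1, 3, 5])
print(toy_accuracy)
print(toy_cm)
```

With so little data the scores here don't mean much. The point is only the order of the steps, and that the vectorizer is fit on the training subset alone.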

In [None]:
# TODO:
  # T/T Split
  # Vectorize
  # Classify
  # Evaluate
  # Try n-grams