### US Foods Assessment
#### Julian Carrasquillo
#### 2023-04-24

Resources

* [Data Understanding and Reading in](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
* [SO: Building combinations of dictionary of lists](https://stackoverflow.com/questions/38721847/how-to-generate-all-combination-from-values-in-dict-of-lists-in-python)
* [word2vec Tutorial](https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/)

In [214]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

import numpy as np
import pandas as pd
import random

random.seed(42)

### Base Tutorial

The tutorial gave a great look into the best ways to read in the data. Removing headers, footers, and quotes helps to make more generalizable models, since a classifier can otherwise glean a lot from these fields instead of from the article text. For example, having a university email in a signature tends to lean towards science-based articles.

The overall dataset is quite large. The cell below loads the full training split, covering all 20 categories.

In [3]:
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))

len(newsgroups_train['data'])

11314

From the tutorial, we can use sklearn's built-in `TfidfVectorizer`, which builds a sparse matrix whose values balance a token's term frequency against its inverse document frequency. This weights each token by its relevance to a specific document, discounted by how common the token is across the entire corpus. We can verify the matrix size by checking the shape of the output `vectors`. The number of rows matches the number of articles downloaded.

In [16]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(11314, 101631)
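
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy sentences. It assumes a whitespace tokenizer and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how the scikit-learn documentation describes the default settings; the function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document (toy whitespace tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # Term frequency times smoothed inverse document frequency
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        # L2-normalize each row so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_docs = ["the pitcher threw the ball",
            "the rocket reached orbit",
            "the ball left the park"]
toy_weights = tfidf(toy_docs)
```

In the first toy document, `pitcher` (found in one document) outweighs `ball` (found in two). The word `the` still ends up with the largest weight there because it occurs twice, which shows that TF-IDF dampens common words rather than removing them.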

The tutorial uses a multinomial naive Bayes classifier. Training amounts to counting how often each word appears in each category. At inference, a document's words are scored against each category, and the highest-scoring category becomes the prediction. The approach is considered naive because it assumes words are independent of one another within a category. Despite that, the model still does reasonably well here, reaching a test accuracy of ~0.70.

In [17]:
clf = MultinomialNB(alpha = .01)
clf.fit(vectors,  newsgroups_train.target)

newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))

vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
# accuracy_score expects (y_true, y_pred); accuracy is the same either way
print(metrics.accuracy_score(newsgroups_test.target, pred))

0.7002124269782263
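
The counting-and-scoring idea behind the classifier can be sketched in a few lines of plain Python. This is a simplified illustration, not sklearn's implementation: it works on raw word counts rather than TF-IDF weights, and the names `fit_nb`, `predict_nb`, and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    """Count words per class; return log priors and smoothed log likelihoods."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, counts in word_counts.items():
        # Additive smoothing keeps words unseen in a class from getting probability 0
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_like = {w: math.log((counts[w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Sum log probabilities of known words per class; return the best class."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        words = [w for w in doc.lower().split() if w in log_like]
        scores[label] = log_prior + sum(log_like[w] for w in words)
    return max(scores, key=scores.get)

toy_docs = ["pitcher threw the ball", "home run in the ninth",
            "rocket reached orbit", "shuttle launch delayed"]
toy_labels = ["baseball", "baseball", "space", "space"]
toy_model = fit_nb(toy_docs, toy_labels)
toy_prediction = predict_nb(toy_model, "the rocket launch")
```

Summing log probabilities word by word is exactly where the independence assumption shows up: each word contributes to the score without regard to its neighbors.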


## Grid Search on Hyperparameters

Between the `TfidfVectorizer` and `MultinomialNB` objects, we have a few hyperparameters we can explore to improve performance. We can build out our own grid search algorithm to try out different combinations and store the results. We'll leverage a function that takes in the various objects along with their parameter set. 

In [6]:
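def train_classifier(train, vectorizer, classifier, vectorizer_params = None, classifier_params = None):
    # Instantiate the vectorizer class with its parameters and fit it on the training text
    vectorizer = vectorizer(**(vectorizer_params or {}))
    vectors = vectorizer.fit_transform(train['data'])
    
    # Instantiate and fit the classifier on the vectorized documents
    clf = classifier(**(classifier_params or {}))
    clf.fit(vectors, train['target'])
    
    return vectorizer, clf

def test_classifier(test, vectorizer, classifier):
    # Reuse the fitted vectorizer so test features match the training vocabulary
    vectors = vectorizer.transform(test['data'])
    pred = classifier.predict(vectors)
    # accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(test['target'], pred)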

In [7]:
# verify we get the same results as above
vectorizer, clf = train_classifier(newsgroups_train, vectorizer = TfidfVectorizer, classifier = MultinomialNB, classifier_params = {'alpha' : .01}) 
test_classifier(newsgroups_test, vectorizer, clf)

0.7002124269782263

Using Python's built-in `itertools`, we can build combinations of parameters for both our vectorizer and algorithms. Each algorithm is keyed by its name as a string, with the actual `sklearn` class stored alongside its parameter grid so it can be passed to our code. Some explanations of the hyperparameters are below:

### `TfidfVectorizer`

#### `ngram_range`

We can see if opening up the word count of each token has any positive effects. Out of the box, `TfidfVectorizer` takes single words as tokens (unigrams). We can expand this to include any arbitrary group of words (2 word combinations = bigrams, 3 words = trigrams, etc) to get some more context. Some examples for why this can be helpful:

* we can better capture a baseball player's full name - `Jorge` & `Posada` mean more together and more heavily imply an article about baseball.
* we can differentiate between `climate change` and the current `political climate`
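
As a rough illustration of what `ngram_range` controls, the sketch below generates word n-grams for an inclusive range of lengths. It uses a plain whitespace split, so it is not identical to `TfidfVectorizer`'s tokenizer (whose default pattern, for instance, drops single-character tokens); `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams with lengths in the inclusive ngram_range."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

sentence = "Jorge Posada hit a home run"
unigrams_and_bigrams = word_ngrams(sentence, ngram_range=(1, 2))
bigrams_only = word_ngrams(sentence, ngram_range=(2, 2))
```

With `(1, 2)` the vocabulary keeps `jorge` and `posada` and adds `jorge posada`; with `(2, 2)` the single words disappear entirely, which is why that setting can lose information as well as gain it.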

#### `max_df`

This parameter essentially acts like a stop word list where the list is generated from the corpus itself. It removes any tokens that show up too many times. As a value between `0` and `1`, it removes any words that show up in more than that proportion of documents.
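
A small sketch of the filtering rule, assuming whitespace tokenization and a proportional `max_df`; `filter_by_max_df` and the toy documents are made up for illustration.

```python
from collections import Counter

def filter_by_max_df(docs, max_df=1.0):
    """Drop tokens that occur in more than max_df (a proportion) of the documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Count each token once per document it appears in
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    limit = max_df * len(tokenized)
    kept_vocab = {tok for tok, df in doc_freq.items() if df <= limit}
    return [[tok for tok in toks if tok in kept_vocab] for toks in tokenized]

toy_docs = ["the pitcher threw the ball", "the rocket reached orbit",
            "the ball left the park", "a rocket launch"]
filtered = filter_by_max_df(toy_docs, max_df=0.7)
```

Here `the` appears in 3 of 4 documents (75%), so it is dropped at `max_df=0.7`, while `ball` (50%) survives. With `max_df=1.0` nothing is removed.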

### `MultinomialNB`

#### `alpha`

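This parameter controls additive smoothing. It handles words that never appeared in a given category's training documents, which would otherwise receive a probability of zero for that category and wipe out the score of any document containing them.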
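
The effect of `alpha` can be seen directly in the additive smoothing formula, `(count + alpha) / (total + alpha * vocab_size)`. The numbers below (1,000 tokens in a category, a 5,000-word vocabulary) are made up for illustration, and `smoothed_prob` is a hypothetical helper.

```python
def smoothed_prob(word_count, total_count, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Compare a word never seen in the category with one seen 50 times
for alpha in [0.01, 0.05, 0.1, 0.5, 1]:
    unseen = smoothed_prob(0, 1000, 5000, alpha)
    seen = smoothed_prob(50, 1000, 5000, alpha)
    print(f"alpha={alpha}: unseen={unseen:.6f}, seen 50x={seen:.6f}")
```

Larger values of `alpha` give unseen words more probability mass and pull frequent words down, flattening the differences between categories.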

In [8]:
import itertools

vector_grid = { 'ngram_range'  : [(1, 1), (1, 2), (2, 2)],
               'max_df' : [0.7, 0.8, 0.9, 1.0]}

algo_grid = {'MultinomialNB' : {'algo_obj' : MultinomialNB,
                                'params' : {'alpha' : [0.01, .05,  0.1, 0.5,  1]}}}

In [9]:
# From https://stackoverflow.com/questions/38721847/how-to-generate-all-combination-from-values-in-dict-of-lists-in-python
keys, values = zip(*vector_grid.items())
vector_groups = [dict(zip(keys, v)) for v in itertools.product(*values)]
vector_groups

[{'ngram_range': (1, 1), 'max_df': 0.7},
 {'ngram_range': (1, 1), 'max_df': 0.8},
 {'ngram_range': (1, 1), 'max_df': 0.9},
 {'ngram_range': (1, 1), 'max_df': 1.0},
 {'ngram_range': (1, 2), 'max_df': 0.7},
 {'ngram_range': (1, 2), 'max_df': 0.8},
 {'ngram_range': (1, 2), 'max_df': 0.9},
 {'ngram_range': (1, 2), 'max_df': 1.0},
 {'ngram_range': (2, 2), 'max_df': 0.7},
 {'ngram_range': (2, 2), 'max_df': 0.8},
 {'ngram_range': (2, 2), 'max_df': 0.9},
 {'ngram_range': (2, 2), 'max_df': 1.0}]

In [10]:
algo_groups = {}
for item in algo_grid.keys():
    keys, values = zip(*algo_grid[item]['params'].items())
    algo_groups.update({item : [dict(zip(keys, v)) for v in itertools.product(*values)]})

algo_groups

{'MultinomialNB': [{'alpha': 0.01},
  {'alpha': 0.05},
  {'alpha': 0.1},
  {'alpha': 0.5},
  {'alpha': 1}]}

Now that our combinations are set up, we can loop through them with our train / test functions and keep track of our results. We extract a validation set from our training data in order to leave the test set as a true evaluator.

In [11]:
X_train, X_val, y_train, y_val = train_test_split(newsgroups_train.data, newsgroups_train.target,  test_size = 0.2, random_state = 42, stratify = newsgroups_train.target)

hp_train = {'data' : X_train, 'target' :  y_train}
hp_val = {'data' : X_val, 'target' :  y_val}

In [12]:
results = {}
i = 0
for vector_combo in vector_groups:
    for key, value in algo_groups.items():
        for algo_param in value:
            vectorizer, clf = train_classifier(hp_train, vectorizer = TfidfVectorizer, classifier = algo_grid[key]['algo_obj'], vectorizer_params = vector_combo, classifier_params = algo_param)
            results.update({i : {'vector_combo' : vector_combo, 'algo' : key, 'algo_param' : algo_param, 'accuracy' : test_classifier(hp_val, vectorizer, clf)}})
            i = i + 1

In [13]:
results_df = pd.DataFrame(results)
best_combo = results_df[np.argmax(results_df.loc['accuracy'])]
best_combo

vector_combo    {'ngram_range': (1, 1), 'max_df': 0.7}
algo                                     MultinomialNB
algo_param                             {'alpha': 0.01}
accuracy                                      0.771984
Name: 0, dtype: object

In [14]:
vectorizer, clf = train_classifier(newsgroups_train, vectorizer = TfidfVectorizer, classifier = MultinomialNB, vectorizer_params = best_combo['vector_combo'], classifier_params = best_combo['algo_param'])
test_classifier(newsgroups_test, vectorizer, clf)

0.700477960701009

This improved test accuracy by only about 0.0003 (0.7002 to 0.7005). It looks like increasing the word count per token did not add value, and the selected `max_df` of 0.7 made little difference on the test set. On the algorithm side, the best `alpha` was 0.01, the smallest value tried and the same as the baseline, so heavier smoothing did not help.

We can see which words are most associated with each category using the function below, taken from the sklearn tutorial. Many of the words are those you would find in a typical stop list. We can try the vectorizer's built-in `stop_words` parameter to see if it improves results.

In [40]:
# From https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

In [41]:
show_top10(clf, vectorizer, newsgroups_train.target_names)

alt.atheism: be are not in and it you is that of
comp.graphics: on that you in graphics it is for and of
comp.os.ms-windows.misc: that in file of you for and is it windows
comp.sys.ibm.pc.hardware: that have with scsi for of drive is it and
comp.sys.mac.hardware: you with that apple for of mac it and is
comp.windows.x: server motif for this it in of is and window
misc.forsale: it of or in shipping offer 00 and sale for
rec.autos: for on is that in it of you and car
rec.motorcycles: is my for that in of you it and bike
rec.sport.baseball: his they year was is that of in and he
rec.sport.hockey: is was hockey team that game of he and in
sci.crypt: encryption this in be it is that key and of
sci.electronics: this on that for in it you is and of
sci.med: be are this you that in it and is of
sci.space: you be for that it is in and space of
soc.religion.christian: we not you it in god and is that of
talk.politics.guns: this they it gun is you in and that of
talk.politics.mideast: are not it 

In [42]:
vectorizer, clf = train_classifier(newsgroups_train, vectorizer = TfidfVectorizer, classifier = MultinomialNB, vectorizer_params = {'stop_words' : 'english'}, classifier_params = best_combo['algo_param'])
test_classifier(newsgroups_test, vectorizer, clf)

0.7010090281465746

In [43]:
show_top10(clf, vectorizer, newsgroups_train.target_names)

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: looking format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac
comp.windows.x: using windows x11r5 use application thanks widget server motif window
misc.forsale: asking email sell price condition new shipping offer 00 sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: don thanks voltage used know does lik

In [47]:
vectors = vectorizer.transform(newsgroups_test['data'])
pred = clf.predict(vectors) 
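# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predictions per class.
print(metrics.classification_report(pred, newsgroups_test['target'], target_names = newsgroups_test['target_names'] ))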

                          precision    recall  f1-score   support

             alt.atheism       0.44      0.59      0.51       239
           comp.graphics       0.71      0.66      0.68       419
 comp.os.ms-windows.misc       0.53      0.69      0.60       304
comp.sys.ibm.pc.hardware       0.70      0.60      0.65       452
   comp.sys.mac.hardware       0.70      0.73      0.71       371
          comp.windows.x       0.74      0.80      0.77       365
            misc.forsale       0.72      0.80      0.76       352
               rec.autos       0.72      0.75      0.74       380
         rec.motorcycles       0.73      0.75      0.74       386
      rec.sport.baseball       0.81      0.93      0.87       347
        rec.sport.hockey       0.93      0.59      0.72       634
               sci.crypt       0.77      0.73      0.75       414
         sci.electronics       0.58      0.72      0.65       317
                 sci.med       0.78      0.84      0.81       365
         

This improved slightly on the last run in accuracy (0.7005 to 0.7010), and the top words now do far more to differentiate topics. Looking at individual categories, the model does best with baseball, Middle Eastern politics, and science articles.

## Word2Vec - Another way to Vectorize

The estimators in sklearn lean on traditional NLP techniques - bag of words and TF-IDF. We can leverage a newer, neural-network-based approach, Word2Vec, which learns a dense vector for each word from the words that surround it. Words used in similar contexts end up with similar vectors, which gives us some context that simple counts cannot.

In [30]:
from nltk.tokenize import sent_tokenize, word_tokenize

import gensim
from gensim.models import Word2Vec

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [108]:
data = []

for i, item in enumerate(newsgroups_train['data']):
    item_clean = item.replace("\n", " ")
    for j in sent_tokenize(item_clean):
        temp = []
        # tokenize the sentence into words
        for k in word_tokenize(j):
            temp.append(k.lower())
            
        data.append(temp)

In [110]:
newsgroups_w2v = gensim.models.Word2Vec(data, min_count = 1,
                              vector_size = 100, window = 5)
newsgroups_w2v.save("word2vec.wordvectors")

In [280]:
def convert_to_vector(data, w2vModel, dim):
    """
    Takes data from the newsgroups_20 set and converts each document into a single vector by summing its word vectors
    
    data: the ['data'] component from the train or test set 
    w2vModel: object representing word-vector lookup. Can be the .wv object or the KeyedVector object
    dim int: the dimension of the expected returned vector. This allows us to fill missing vocabulary with a vector that plays nice with the ultimate output 
    """
    doc_vectors = []
    for doc in data:
        word_vectors = []
        if doc.strip() == "":
            word_vectors.append(np.zeros((dim,)))
        else:
            for word in word_tokenize(doc):
                word_clean = word.replace("\n", " ").lower()
                try:
                    word_vectors.append(w2vModel[word_clean])
                except KeyError:
                    word_vectors.append(np.zeros((dim,)))
        doc_vectors.append(sum(word_vectors))
    return doc_vectors

def print_int_instance(a_list):
    """
    Prints out the index of each element of a list that is an integer 
    """
    for i, item in enumerate(a_list):
        if isinstance(item, int):
            print("Int at index:", i)
            print(item)
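
Because `convert_to_vector` sums word vectors, a longer document produces a larger-magnitude vector. One common alternative, which is not what this notebook does, is to average the vectors instead. The sketch below shows the idea with a plain dictionary standing in for the word-vector lookup; `mean_pool` and `toy_lookup` are hypothetical names.

```python
def mean_pool(tokens, lookup, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [lookup[t] for t in tokens if t in lookup]
    if not known:
        return [0.0] * dim
    # Average each dimension across the known word vectors
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

toy_lookup = {"rocket": [1.0, 0.0], "orbit": [0.0, 1.0]}
short_doc = mean_pool(["rocket", "orbit"], toy_lookup, dim=2)
long_doc = mean_pool(["rocket", "orbit"] * 50, toy_lookup, dim=2)
```

With averaging, the short and long toy documents map to the same vector, whereas a sum would differ by a factor of 50. Whether this helps on the newsgroups data is untested here.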

In [258]:
homebrew_train = convert_to_vector(data = newsgroups_train['data'], w2vModel = newsgroups_w2v.wv, dim = 100)

In [268]:
print_int_instance(homebrew_train)

In [269]:
from sklearn.ensemble import RandomForestClassifier 

clf = RandomForestClassifier(random_state = 42)
clf.fit(homebrew_train, newsgroups_train['target'])

In [263]:
homebrew_test = convert_to_vector(data = newsgroups_test['data'], w2vModel = newsgroups_w2v.wv, dim = 100)

In [270]:
print_int_instance(homebrew_test)

In [271]:
pred = clf.predict(homebrew_test)
print(metrics.accuracy_score(pred, newsgroups_test['target']))
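# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predictions per class.
print(metrics.classification_report(pred, newsgroups_test['target'], target_names = newsgroups_test['target_names']))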

0.25238980350504514
                          precision    recall  f1-score   support

             alt.atheism       0.24      0.23      0.23       336
           comp.graphics       0.24      0.20      0.22       455
 comp.os.ms-windows.misc       0.20      0.23      0.21       336
comp.sys.ibm.pc.hardware       0.27      0.25      0.26       425
   comp.sys.mac.hardware       0.14      0.16      0.15       318
          comp.windows.x       0.33      0.29      0.31       449
            misc.forsale       0.67      0.56      0.61       467
               rec.autos       0.20      0.13      0.16       628
         rec.motorcycles       0.23      0.18      0.20       529
      rec.sport.baseball       0.23      0.17      0.20       539
        rec.sport.hockey       0.39      0.36      0.37       439
               sci.crypt       0.28      0.25      0.27       448
         sci.electronics       0.12      0.18      0.14       257
                 sci.med       0.15      0.20      0.17

The above is not a strong model! It could be that our embeddings are not very strong. We can test this by looking at some cosine similarities.

In [272]:
print("apple vs mac ->", newsgroups_w2v.wv.similarity('apple', 'mac'))
print("baseball vs catcher ->", newsgroups_w2v.wv.similarity('baseball', 'catcher'))
print("space vs rocket ->", newsgroups_w2v.wv.similarity('space', 'rocket'))
print("gaza vs israel ->", newsgroups_w2v.wv.similarity('gaza', 'israel'))

apple vs mac -> 0.8163826
baseball vs catcher -> 0.6250329
space vs rocket -> 0.66309035
gaza vs israel -> 0.6368495
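
For reference, the similarity reported above is the cosine of the angle between two word vectors: their dot product divided by the product of their lengths. A minimal sketch on made-up two-dimensional vectors:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.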


Perhaps we could get some better performance with stronger embeddings.

In [245]:
import gensim.downloader

pretrained_w2v = gensim.downloader.load('word2vec-google-news-300')



In [273]:
pretrained_train = convert_to_vector(data = newsgroups_train['data'], w2vModel = pretrained_w2v, dim = 300)

In [274]:
print_int_instance(pretrained_train)

In [278]:
clf = RandomForestClassifier(random_state = 42)
clf.fit(pretrained_train, newsgroups_train['target'])

In [276]:
pretrained_test = convert_to_vector(data = newsgroups_test['data'], w2vModel = pretrained_w2v, dim = 300)

In [279]:
pred = clf.predict(pretrained_test)
print(metrics.accuracy_score(pred, newsgroups_test['target']))
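# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predictions per class.
print(metrics.classification_report(pred, newsgroups_test['target'], target_names = newsgroups_test['target_names']))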

0.43335103558151883
                          precision    recall  f1-score   support

             alt.atheism       0.29      0.27      0.28       340
           comp.graphics       0.36      0.31      0.33       454
 comp.os.ms-windows.misc       0.32      0.29      0.30       439
comp.sys.ibm.pc.hardware       0.34      0.30      0.32       449
   comp.sys.mac.hardware       0.18      0.25      0.21       279
          comp.windows.x       0.39      0.36      0.38       422
            misc.forsale       0.69      0.58      0.63       467
               rec.autos       0.51      0.32      0.40       620
         rec.motorcycles       0.49      0.38      0.43       508
      rec.sport.baseball       0.59      0.43      0.50       552
        rec.sport.hockey       0.65      0.63      0.64       412
               sci.crypt       0.37      0.50      0.42       291
         sci.electronics       0.26      0.44      0.33       235
                 sci.med       0.63      0.67      0.65

We increased accuracy from ~0.25 to ~0.43, but still have a very poor model. We'd likely benefit from a neural network using these vectors as embedding weights! However, since we are focusing on text featurization, we can leave that as a later exercise.

## Trying Out an LLM - Google's BERT

We can hop on the hype train and try out vectorization using a Large Language Model. [Google's BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) is a model built on the relatively new transformer architecture. It stands for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers. While typical transformer language models use everything to the left of a position to predict the next word, BERT looks in both directions to improve the contextual information it captures. It was pre-trained on two tasks: masked language modeling (hiding tokens and trying to predict them) and next sentence prediction.
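
To give a feel for the masking objective, here is a toy sketch that hides about 15% of the tokens in a sentence (the rate used in the BERT paper) and records what the model would be asked to recover. It is deliberately simplified: it splits on whitespace instead of using BERT's WordPiece tokenizer, and it always substitutes `[MASK]`, whereas the actual procedure sometimes keeps or randomly replaces the chosen token. `mask_tokens` is a hypothetical helper.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # The training targets are the original tokens at the masked positions
    targets = {i: tokens[i] for i in sorted(positions)}
    return masked, targets

tokens = "the catcher threw the ball back to the pitcher".split()
masked, targets = mask_tokens(tokens)
```

Because the model sees the words on both sides of each `[MASK]`, it has to use left and right context together, which is the bidirectional part of the name.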

The model was trained on the Toronto BookCorpus and English Wikipedia. We can use the openly available weights from TensorFlow Hub to embed our text and use that as input into an algorithm.

In [4]:
import tensorflow
import tensorflow_hub as hub
import tensorflow_text as text

In [5]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3") 
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")



In [19]:
X_train_bert, _ , y_train_bert,  _ = train_test_split(newsgroups_train.data, newsgroups_train.target,  train_size = 0.05, random_state = 42, stratify = newsgroups_train.target)
X_test_bert, _ , y_test_bert,  _ = train_test_split(newsgroups_test.data, newsgroups_test.target,  train_size = 0.05, random_state = 42, stratify = newsgroups_test.target)

In [7]:
train_preprocessed = bert_preprocess(X_train_bert) 
train_vectorized = bert_encoder(train_preprocessed)['pooled_output']

In [21]:
from sklearn.ensemble import RandomForestClassifier 

clf = RandomForestClassifier()
clf.fit(train_vectorized, y_train_bert)

In [22]:
test_preprocessed = bert_preprocess(X_test_bert) 
test_vectorized = bert_encoder(test_preprocessed)['pooled_output']
pred = clf.predict(test_vectorized)

In [23]:
metrics.accuracy_score(y_test_bert, pred)

0.2047872340425532

This low score is somewhat expected - we're unable to train on the full dataset due to memory constraints on