### US Foods Assessment
#### Julian Carrasquillo
#### 2023-04-24

Resources

* [Data Understanding and Reading in](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
* [SO: Building combinations of dictionary of lists](https://stackoverflow.com/questions/38721847/how-to-generate-all-combination-from-values-in-dict-of-lists-in-python)

In [213]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

import pandas as pd
import numpy as np

### Base Tutorial

The tutorial gave a great look into the best ways to read in the data. Removing headers, footers, and quotes helps to make more generalizable models, since a classifier can glean a lot from these fields without learning anything about the topic itself. For example, a university email address in a signature tends to point towards science-based articles.

The overall dataset is quite large, so I picked a random group of categories to develop.

In [41]:
categories = ['talk.politics.guns',
              'comp.windows.x',
              'rec.sport.baseball',
              'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

len(newsgroups_train['data'])

2329

From the tutorial, we can use sklearn's built-in `TfidfVectorizer`, which builds a sparse matrix based on a balance between a token's term frequency and its inverse document frequency. This weights each token by how relevant it is to a specific document, discounted by how common it is across the entire corpus. We can verify the matrix size by checking the shape of the output `vectors`. The number of rows matches the number of articles downloaded.

In [42]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(2329, 31568)
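
To make the weighting concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the `toy_` names are illustrative assumptions, not part of the newsgroups data). It shows that a token found in every document receives a lower inverse document frequency than a token found in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration: "the" appears in every document,
# "orbit" appears in only one.
docs = [
    "the pitcher threw the ball",
    "the rocket reached orbit",
    "the senate debated the bill",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct token.
print(toy_vectors.shape)

# vocabulary_ maps a token to its column index;
# idf_ holds the learned inverse document frequency for each column.
idf_the = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["the"]]
idf_orbit = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["orbit"]]
print(idf_the, idf_orbit)  # the token shared by all documents gets the smaller idf
```

Within the second sentence both tokens occur once, so the difference in their final weights comes entirely from the inverse document frequency.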

The tutorial uses a multinomial naive Bayes classifier. It trains by counting how often each word appears in each category. At inference time, a document's words are scored against every category, and the best-scoring category becomes the prediction. This approach is considered naive because it assumes the words are independent of one another. Despite that, the model still does fairly well with this subset, reaching an accuracy of ~0.91.

In [43]:
clf = MultinomialNB(alpha = .01)
clf.fit(vectors,  newsgroups_train.target)

newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
print(metrics.accuracy_score(newsgroups_test.target, pred))

0.9083870967741936
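
The scoring step can be reproduced by hand, which may help show what "naive" means in practice. The sketch below uses a made-up four-sentence training set (an assumption for illustration only) and relies on two fitted attributes of `MultinomialNB`, `feature_log_prob_` and `class_log_prior_`. Each class score is the log prior plus the sum of per-token log probabilities weighted by the document's tf-idf values, with no interaction terms between tokens.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, made up for illustration: 0 = baseball, 1 = space.
train_docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the rocket reached orbit",
    "nasa launched a satellite",
]
train_labels = [0, 0, 1, 1]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)
nb = MultinomialNB(alpha=0.01).fit(X_train, train_labels)

X_new = tfidf.transform(["the satellite reached orbit",
                         "the pitcher threw a home run"])

# score(class) = log P(class) + sum over tokens of weight * log P(token | class)
manual_scores = X_new.toarray() @ nb.feature_log_prob_.T + nb.class_log_prior_
manual_pred = nb.classes_[np.argmax(manual_scores, axis=1)]

print(manual_pred)
print(nb.predict(X_new))  # matches the manual computation
```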


### Increasing n-grams

Sticking with this classifier, we can check whether letting tokens span more than one word has any positive effect. Out of the box, `TfidfVectorizer` takes single words as tokens (unigrams). We can expand this to also include pairs of adjacent words (bigrams) to capture more context. Some examples of why this can be helpful:

* we can better capture a baseball player's full name - `Jorge` & `Posada` mean more together and more strongly imply an article about baseball.
* we can differentiate between `climate change` and the current `political climate`.
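
As a quick illustration of the two points above, the sketch below uses the vectorizer's `build_analyzer()` method to show the tokens produced for one made-up sentence (the sentence is an assumption for illustration) with and without bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn raw text into tokens.
unigram_analyzer = TfidfVectorizer(ngram_range=(1, 1)).build_analyzer()
bigram_analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

sentence = "Jorge Posada talked about the political climate"

unigrams = unigram_analyzer(sentence)
with_bigrams = bigram_analyzer(sentence)

print(unigrams)      # lowercased single words only
print(with_bigrams)  # the same single words plus every pair of adjacent words
```

With `ngram_range=(1, 2)` the token list contains `jorge posada` and `political climate` as single features, alongside the individual words.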


In [44]:
vectorizer = TfidfVectorizer(ngram_range = (1, 2))
vectors = vectorizer.fit_transform(newsgroups_train.data)

clf = MultinomialNB(alpha = .01)
clf.fit(vectors,  newsgroups_train.target)

vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
print(metrics.accuracy_score(newsgroups_test.target, pred))

0.9


This did not increase accuracy! It dropped slightly, from ~0.908 to 0.9. The added word pairs may have just introduced more sparsity and less signal to the tf-idf matrix.
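
One way to look at the sparsity side of this is to compare the number of columns and the share of non-zero entries for both settings. The sketch below does this on a made-up three-sentence corpus (an assumption for illustration, with a hypothetical `describe` helper). It shows the mechanism only and does not confirm that sparsity caused the drop on the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration.
docs = [
    "the pitcher threw the ball",
    "the rocket reached orbit",
    "the senate debated the bill",
]

def describe(ngram_range):
    """Return (number of features, share of non-zero cells) for a given ngram_range."""
    matrix = TfidfVectorizer(ngram_range=ngram_range).fit_transform(docs)
    n_rows, n_cols = matrix.shape
    density = matrix.nnz / (n_rows * n_cols)
    return n_cols, density

uni_cols, uni_density = describe((1, 1))
bi_cols, bi_density = describe((1, 2))

print(uni_cols, round(uni_density, 3))
print(bi_cols, round(bi_density, 3))  # more columns, lower share of non-zero cells
```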

### Trying different combinations

With this data, we have a recipe for building our models - translate the text data into a numeric form, then apply a classifier. We can streamline our workflow by building out a function.

In [37]:
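def train_classifier(train, vectorizer, classifier, vectorizer_params = None, classifier_params = None):
    # None defaults avoid sharing one mutable dict across calls
    # build the vectorizer from its class and fit it on the training text
    vectorizer = vectorizer(**(vectorizer_params or {}))
    vectors = vectorizer.fit_transform(train.data)
    
    # build the classifier from its class and fit it on the vectorized text
    clf = classifier(**(classifier_params or {}))
    clf.fit(vectors, train.target)
    
    return vectorizer, clf

def test_classifier(test, vectorizer, classifier):
    # reuse the fitted vectorizer so test columns line up with the training columns
    vectors = vectorizer.transform(test.data)
    pred = classifier.predict(vectors)
    return metrics.accuracy_score(test.target, pred)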

In [47]:
# verify we get the same results as above
vectorizer, clf = train_classifier(newsgroups_train, vectorizer = TfidfVectorizer, classifier = MultinomialNB, classifier_params = {'alpha' : .01}) 
test_classifier(newsgroups_test, vectorizer, clf)

0.9083870967741936
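
For comparison, the same vectorize-then-classify recipe can also be written with sklearn's `make_pipeline`. This is an alternative sketch, not the approach used in the rest of this notebook, and the training sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration: 0 = baseball, 1 = space.
train_docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the rocket reached orbit",
    "nasa launched a satellite",
]
train_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier in one call and
# reuses the fitted vectorizer automatically at prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.01))
model.fit(train_docs, train_labels)

test_docs = ["the pitcher threw a home run", "the satellite reached orbit"]
predictions = model.predict(test_docs)
print(predictions)
```

The hand-written functions keep the vectorizer and classifier as separate return values, which makes it easy to swap either one independently in the loops that follow.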

Using Python's built-in `itertools`, we can build combinations of parameters for both our vectorizer and algorithms. Each algorithm is keyed by its name as a string, and its entry stores the actual `sklearn` class alongside its parameter grid so the class can be passed to our functions.

In [216]:
import itertools

vector_grid = {'ngram_range' : [(1, 1), (1, 2)],
               'max_df' : [0.8, 1.0]}

algo_grid = {'MultinomialNB' : {'algo_obj' : MultinomialNB,
                                'params' : {'alpha' : [0.01, 0.1, 1]}},
             'RandomForestClassifier' : {'algo_obj' : RandomForestClassifier,
                                         'params' : {'max_depth' : [8, 12],
                                                     'n_estimators' : [100, 1000]}}}

In [217]:
# From https://stackoverflow.com/questions/38721847/how-to-generate-all-combination-from-values-in-dict-of-lists-in-python
keys, values = zip(*vector_grid.items())
vector_groups = [dict(zip(keys, v)) for v in itertools.product(*values)]

In [218]:
vector_groups

[{'ngram_range': (1, 1), 'max_df': 0.8},
 {'ngram_range': (1, 1), 'max_df': 1.0},
 {'ngram_range': (1, 2), 'max_df': 0.8},
 {'ngram_range': (1, 2), 'max_df': 1.0}]
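
As a cross-check, sklearn ships a `ParameterGrid` helper in `sklearn.model_selection` that performs the same expansion. The sketch below compares it with the `itertools` version on the vectorizer grid. The two may list the combinations in a different order, so the comparison ignores order.

```python
import itertools
from sklearn.model_selection import ParameterGrid

vector_grid = {"ngram_range": [(1, 1), (1, 2)],
               "max_df": [0.8, 1.0]}

# itertools version: one dict per element of the Cartesian product of the value lists
keys, values = zip(*vector_grid.items())
manual_groups = [dict(zip(keys, v)) for v in itertools.product(*values)]

# ParameterGrid version: iterating yields one dict per combination
grid_groups = list(ParameterGrid(vector_grid))

same_combinations = (len(manual_groups) == len(grid_groups)
                     and all(group in manual_groups for group in grid_groups))
print(len(grid_groups), same_combinations)
```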

In [219]:
algo_groups = {}
for item in algo_grid.keys():
    keys, values = zip(*algo_grid[item]['params'].items())
    algo_groups.update({item : [dict(zip(keys, v)) for v in itertools.product(*values)]})

In [220]:
algo_groups

{'MultinomialNB': [{'alpha': 0.01}, {'alpha': 0.1}, {'alpha': 1}],
 'RandomForestClassifier': [{'max_depth': 8, 'n_estimators': 100},
  {'max_depth': 8, 'n_estimators': 1000},
  {'max_depth': 12, 'n_estimators': 100},
  {'max_depth': 12, 'n_estimators': 1000}]}

Now that our combinations are set up, we can loop through them with our train / test functions and keep track of our results. 

In [221]:
results = {}
i = 0
for vector_combo in vector_groups:
    for key, value in algo_groups.items():
        for algo_param in value:
            vectorizer, clf = train_classifier(newsgroups_train,
                                               vectorizer = TfidfVectorizer,
                                               classifier = algo_grid[key]['algo_obj'],
                                               vectorizer_params = vector_combo,
                                               classifier_params = algo_param)
            results[i] = {'vector_combo' : vector_combo,
                          'algo' : key,
                          'algo_param' : algo_param,
                          'accuracy' : test_classifier(newsgroups_test, vectorizer, clf)}
            i = i + 1

We were able to get a bit more accuracy (~0.910 versus ~0.908) by updating the `max_df` parameter of the vectorizer and the `alpha` parameter of the MultinomialNB algorithm. `max_df` essentially introduces corpus-specific stop words: tokens that appear in more than that share of documents are dropped, which likely kept some of the more common words out of the weight calculations.

In [222]:
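results_df = pd.DataFrame(results)
# idxmax returns the column label of the best run (np.argmax returns a position, which only matches while labels are 0..n-1)
results_df[results_df.loc['accuracy'].astype(float).idxmax()]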

vector_combo    {'ngram_range': (1, 1), 'max_df': 0.8}
algo                                     MultinomialNB
algo_param                              {'alpha': 0.1}
accuracy                                      0.909677
Name: 1, dtype: object
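
To see `max_df` acting as a corpus-specific stop-word filter, the sketch below fits two vectorizers on a made-up three-sentence corpus (an assumption for illustration) and compares their vocabularies. With `max_df=0.8`, any token that appears in more than 80% of the documents is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration: "the" appears in all three documents.
docs = [
    "the pitcher threw the ball",
    "the rocket reached orbit",
    "the senate debated the bill",
]

keep_all = TfidfVectorizer(max_df=1.0).fit(docs)
drop_common = TfidfVectorizer(max_df=0.8).fit(docs)

# Tokens present with max_df=1.0 but filtered out with max_df=0.8
dropped = set(keep_all.vocabulary_) - set(drop_common.vocabulary_)
print(dropped)
```

Only `the` is removed here, because every other token appears in a single document.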

### Trying a new vectorizer