# LAB 6: Text classification with linear models

Objectives:

* Train and evaluate linear text classifiers using SGDClassifier
* Experiment with different feature extraction and training methods
* Log and evaluate experimental results using [mlflow](https://mlflow.org)

In [1]:
import numpy as np
import pandas as pd
from cytoolz import identity  # passed to CountVectorizer as the analyzer for pre-tokenized text
from tqdm.auto import tqdm

tqdm.pandas()

### Load and preprocess data

In [2]:
train = pd.read_parquet(
    "s3://ling583/rcv1-topics-train.parquet", storage_options={"anon": True}
)
test = pd.read_parquet(
    "s3://ling583/rcv1-topics-test.parquet", storage_options={"anon": True}
)

In [3]:
train.head()

|   | text | topics |
|---|------|--------|
| 0 | NZ bonds close well bid ahead of key U.S. data... | MCAT |
| 1 | Asia Product Swaps - Jet/gas oil regrade at di... | MCAT |
| 2 | U.S. public schools get a C report card in qua... | GCAT |
| 3 | Thunder Bay vessel clearances - May 12. Daily ... | MCAT |
| 4 | Amoco gains shares in Ula,Gyda N.Sea fields. A... | CCAT |


CCAT : CORPORATE/INDUSTRIAL  
ECAT : ECONOMICS  
GCAT : GOVERNMENT/SOCIAL  
MCAT : MARKETS

In [4]:
train["topics"].value_counts()

CCAT    5896
MCAT    3281
GCAT    3225
ECAT    1073
Name: topics, dtype: int64

In [5]:
import spacy

nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"],
)


def tokenize(text):
    doc = nlp.tokenizer(text)
    return [t.norm_ for t in doc if t.is_alpha]

In [6]:
import multiprocessing as mp

In [7]:
with mp.Pool() as p:
    train["tokens"] = pd.Series(p.imap(tokenize, tqdm(train["text"]), chunksize=100))
    test["tokens"] = pd.Series(p.imap(tokenize, tqdm(test["text"]), chunksize=100))

  0%|          | 0/13475 [00:00<?, ?it/s]

  0%|          | 0/3369 [00:00<?, ?it/s]

In [8]:
train.head()

|   | text | topics | tokens |
|---|------|--------|--------|
| 0 | NZ bonds close well bid ahead of key U.S. data... | MCAT | [nz, bonds, close, well, bid, ahead, of, key, ... |
| 1 | Asia Product Swaps - Jet/gas oil regrade at di... | MCAT | [asia, product, swaps, jet, gas, oil, regrade,... |
| 2 | U.S. public schools get a C report card in qua... | GCAT | [public, schools, get, a, c, report, card, in,... |
| 3 | Thunder Bay vessel clearances - May 12. Daily ... | MCAT | [thunder, bay, vessel, clearances, may, daily,... |
| 4 | Amoco gains shares in Ula,Gyda N.Sea fields. A... | CCAT | [amoco, gains, shares, in, ula, gyda, fields, ... |


---

### SGDClassifier

We start with a baseline SGDClassifier that uses the default hyperparameters.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

In [10]:
sgd = make_pipeline(CountVectorizer(analyzer=identity), SGDClassifier())
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.96      0.97      0.96      1475
        ECAT       0.93      0.86      0.89       268
        GCAT       0.96      0.98      0.97       806
        MCAT       0.95      0.95      0.95       820

    accuracy                           0.96      3369
   macro avg       0.95      0.94      0.94      3369
weighted avg       0.96      0.96      0.96      3369



In [11]:
import logger
import mlflow
from logger import log_search, log_test

In [12]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)

With the basic SGD classifier, our F1 improves from 0.922 to 0.944. We will now tune the hyperparameters to see whether we can improve the model further.
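To make the headline number easier to interpret, here is a small sketch of how a macro-averaged F1 relates to the per-class rows of the classification report above. It assumes the logged score is the macro average (the searches below use `scoring="f1_macro"`), and it starts from the two-decimal values printed in the report, so the result only approximates the full-precision score.

```python
# Per-class F1 and support copied from the baseline classification report above.
# The report rounds to two decimals, so these averages are approximate.
f1_by_class = {"CCAT": 0.96, "ECAT": 0.89, "GCAT": 0.97, "MCAT": 0.95}
support = {"CCAT": 1475, "ECAT": 268, "GCAT": 806, "MCAT": 820}

# Macro average: every class counts equally, however rare it is.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print(f"macro F1 (approx): {macro_f1:.3f}")
print(f"weighted F1 (approx): {weighted_f1:.3f}")
```

Because ECAT is the smallest and weakest class, it pulls the macro average down more than the weighted average, which is why macro F1 is a stricter target on this imbalanced data.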

---

### Hyperparameters

In [13]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:33595")
client

Client: Scheduler: tcp://127.0.0.1:33595, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 16.62 GB


In [14]:
from dask_ml.model_selection import RandomizedSearchCV
from scipy.stats.distributions import loguniform, randint, uniform

In [15]:
from warnings import simplefilter

simplefilter(action="ignore", category=FutureWarning)

In [16]:
mlflow.set_experiment("lab-6/sgd")

In [17]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "sgdclassifier__alpha": loguniform(1e-8, 100.0),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 7 s, sys: 437 ms, total: 7.44 s
Wall time: 1min 10s


In [18]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "sgdclassifier__alpha": loguniform(1e-8, 1e-1),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 7.04 s, sys: 404 ms, total: 7.44 s
Wall time: 1min 9s
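The ranges behind the search distributions in the two cells above are easy to misread, so here is a standard-library sketch of what each one samples. It is a stand-in, not the scipy implementation: `randint(1, 10)` draws integers from 1 to 9, `uniform(0.5, 0.5)` means loc 0.5 and scale 0.5 (so values in [0.5, 1.0]), and `loguniform(a, b)` is uniform in log space between `a` and `b`. The helper name `sample_params` is hypothetical.

```python
import math
import random


def sample_params(rng, alpha_low=1e-8, alpha_high=100.0):
    """Draw one candidate from stand-in versions of the search distributions."""
    return {
        # randint(1, 10): integers 1..9 (upper bound excluded)
        "min_df": rng.randrange(1, 10),
        # uniform(0.5, 0.5): loc=0.5, scale=0.5, i.e. the interval [0.5, 1.0]
        "max_df": rng.uniform(0.5, 1.0),
        # loguniform(a, b): uniform in log space, so each decade is equally likely
        "alpha": math.exp(rng.uniform(math.log(alpha_low), math.log(alpha_high))),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = [sample_params(rng) for _ in range(25)]  # mirrors n_iter=25
for candidate in candidates[:3]:
    print(candidate)
```

Sampling `alpha` in log space is what makes the second, narrower search worthwhile: once the first pass shows which decades score well, the upper bound can be lowered so that all 25 draws land in the useful region.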


----

#### 1.3) Optimized SGDClassifier

In [19]:
sgd = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=9, max_df=0.87),
    SGDClassifier(alpha=1e-2),
)
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.97      0.96      0.97      1475
        ECAT       0.89      0.87      0.88       268
        GCAT       0.96      0.97      0.97       806
        MCAT       0.95      0.97      0.96       820

    accuracy                           0.96      3369
   macro avg       0.94      0.94      0.94      3369
weighted avg       0.96      0.96      0.96      3369



In [20]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)
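The `min_df` and `max_df` settings in the optimized pipeline above prune the vocabulary by document frequency. The sketch below illustrates that idea in plain Python on a toy corpus. It assumes the usual scikit-learn convention that an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents; `prune_vocabulary` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count each token at most once per document.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_count = max_df * n_docs
    return sorted(tok for tok, count in doc_freq.items() if min_df <= count <= max_count)


docs = [
    ["oil", "prices", "the"],
    ["the", "bonds", "market"],
    ["the", "oil", "market"],
    ["the", "schools", "report"],
]
kept = prune_vocabulary(docs, min_df=2, max_df=0.75)
print(kept)  # → ['market', 'oil']
```

Here "the" is dropped for appearing in every document, and the words seen only once are dropped as too rare, which leaves the mid-frequency terms that tend to carry topic information.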

#### 2.1) SGDClassifier with Term-Frequency Times Inverse Document-Frequency (tf-idf)

In [21]:
from sklearn.feature_extraction.text import TfidfTransformer

In [22]:
sgd = make_pipeline(CountVectorizer(analyzer=identity), 
                    TfidfTransformer(), 
                    SGDClassifier())
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.97      0.97      0.97      1475
        ECAT       0.94      0.82      0.88       268
        GCAT       0.95      0.98      0.97       806
        MCAT       0.96      0.97      0.96       820

    accuracy                           0.96      3369
   macro avg       0.95      0.93      0.94      3369
weighted avg       0.96      0.96      0.96      3369



In [23]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)
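For intuition about what the tf-idf step adds to the pipeline above, here is a plain-Python sketch of the weighting on a toy corpus. It assumes scikit-learn's default `TfidfTransformer` settings (raw counts as term frequency, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization per document); the function `tfidf` is a hypothetical illustration, not the library code.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf: rarer tokens get larger weights; the +1 keeps ubiquitous tokens nonzero.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {tok: tf * idf[tok] for tok, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows


docs = [["oil", "oil", "the"], ["the", "bonds"], ["the", "market"]]
rows = tfidf(docs)
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

In the first toy document, "oil" ends up with a much larger weight than "the" because it is both repeated and rare across documents, which is the effect tf-idf is meant to have on the classifier's inputs.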

#### 2.2) Hyperparameters for SGDClassifier with tf-idf

In [24]:
mlflow.set_experiment("lab-6/sgd-tfidf")

In [25]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-8, 100.0),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 7.14 s, sys: 355 ms, total: 7.5 s
Wall time: 1min 6s


In [26]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-8, 1e-3),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 7.14 s, sys: 387 ms, total: 7.52 s
Wall time: 1min 6s


----

#### 2.3) Optimized SGDClassifier/tf-idf model

In [27]:
sgd = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=6, max_df=0.80),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1e-4),
)
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.97      0.97      0.97      1475
        ECAT       0.94      0.82      0.88       268
        GCAT       0.96      0.98      0.97       806
        MCAT       0.96      0.97      0.96       820

    accuracy                           0.96      3369
   macro avg       0.96      0.94      0.94      3369
weighted avg       0.96      0.96      0.96      3369



In [28]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)

#### 3.1) SGDClassifier with truncated SVD

In [29]:
from sklearn.decomposition import TruncatedSVD

In [30]:
sgd = make_pipeline(
    CountVectorizer(analyzer=identity),
    TfidfTransformer(),
    TruncatedSVD(n_components=100),
    SGDClassifier(),
)
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.96      0.95      0.96      1475
        ECAT       0.93      0.70      0.80       268
        GCAT       0.92      0.98      0.95       806
        MCAT       0.93      0.96      0.94       820

    accuracy                           0.94      3369
   macro avg       0.93      0.90      0.91      3369
weighted avg       0.94      0.94      0.94      3369



In [31]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)

#### 3.2) Hyperparameters for SGDClassifier with TruncatedSVD

In [32]:
mlflow.set_experiment("lab-6/sgd-tsvd")

In [33]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-8, 100.0),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 7.68 s, sys: 530 ms, total: 8.21 s
Wall time: 2min 41s


In [34]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True],
        "sgdclassifier__alpha": loguniform(1e-8, 1e-3),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 7.76 s, sys: 583 ms, total: 8.35 s
Wall time: 2min 49s


----

#### 3.3) Optimized SGDClassifier/T-SVD model

In [35]:
sgd = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=2, max_df=0.55),
    TfidfTransformer(use_idf=True),
    TruncatedSVD(n_components=100),
    SGDClassifier(alpha=1e-5),
)
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.96      0.95      0.96      1475
        ECAT       0.90      0.81      0.85       268
        GCAT       0.91      0.98      0.95       806
        MCAT       0.95      0.94      0.95       820

    accuracy                           0.94      3369
   macro avg       0.93      0.92      0.92      3369
weighted avg       0.94      0.94      0.94      3369



In [36]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)

#### 4.1) SGDClassifier with unigram/bigram

In [37]:
from nltk import bigrams


def unibigrams(toks):
    # Unigrams as 1-tuples plus adjacent-token pairs, so every feature is a tuple.
    return [(tok,) for tok in toks] + list(bigrams(toks))
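To show what features this analyzer produces, here is a dependency-free sketch that builds the same kind of output with `zip` instead of `nltk.bigrams`. The name `unibigrams_plain` is hypothetical and is used only for illustration.

```python
def unibigrams_plain(toks):
    """Return unigrams as 1-tuples followed by adjacent-token bigrams."""
    toks = list(toks)
    unigrams = [(tok,) for tok in toks]
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    pairs = list(zip(toks, toks[1:]))
    return unigrams + pairs


features = unibigrams_plain(["oil", "prices", "rise"])
print(features)
# → [('oil',), ('prices',), ('rise',), ('oil', 'prices'), ('prices', 'rise')]
```

A document with n tokens yields n unigrams and n - 1 bigrams, so the feature space grows considerably, which is consistent with the longer search times reported for this model.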

In [38]:
sgd = make_pipeline(CountVectorizer(analyzer=unibigrams), SGDClassifier())
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.97      0.96      0.96      1475
        ECAT       0.91      0.85      0.88       268
        GCAT       0.95      0.98      0.96       806
        MCAT       0.94      0.96      0.95       820

    accuracy                           0.95      3369
   macro avg       0.94      0.94      0.94      3369
weighted avg       0.95      0.95      0.95      3369



In [39]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)

#### 4.2) Hyperparameters for SGDClassifier with unigram/bigram

In [40]:
mlflow.set_experiment("lab-6/sgd-ug-bg")

In [41]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "sgdclassifier__alpha": loguniform(1e-8, 100.0),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 1min 10s, sys: 3.24 s, total: 1min 13s
Wall time: 7min 10s


In [42]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 10),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "sgdclassifier__alpha": loguniform(1e-8, 1e-3),
    },
    n_iter=25,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 1min 11s, sys: 3.41 s, total: 1min 14s
Wall time: 6min 56s


----

#### 4.3) Optimized SGDClassifier with Unigram/Bigram model

In [43]:
sgd = make_pipeline(
    CountVectorizer(analyzer=unibigrams, min_df=2, max_df=0.57),
    SGDClassifier(alpha=1e-3),
)
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        CCAT       0.97      0.97      0.97      1475
        ECAT       0.91      0.86      0.88       268
        GCAT       0.96      0.98      0.97       806
        MCAT       0.96      0.96      0.96       820

    accuracy                           0.96      3369
   macro avg       0.95      0.94      0.95      3369
weighted avg       0.96      0.96      0.96      3369



In [44]:
mlflow.set_experiment("lab-6")
log_test(sgd, test["topics"], predicted)

### Summary:

* MNB Classifier F1 Score = 0.944
* MNB Classifier Optimized F1 Score = 0.915
* SGD Classifier F1 Score = 0.945
* SGD Classifier Optimized F1 Score = 0.943
* SGD Classifier with tf-idf F1 Score = 0.943
* SGD Classifier with tf-idf Optimized F1 Score = 0.945
* SGD Classifier with truncated SVD F1 Score = 0.912
* SGD Classifier with truncated SVD Optimized F1 Score = 0.925
* SGD Classifier with unigram/bigram F1 Score = 0.94
* SGD Classifier with unigram/bigram Optimized F1 Score = 0.946

Of all the F1 scores, the optimized SGD Classifier with unigram/bigram features ranked first at 0.946, followed by the baseline SGD Classifier and the optimized SGD Classifier with tf-idf, tied at 0.945. The baseline MNB Classifier came next at 0.944, then the optimized SGD Classifier and the baseline SGD Classifier with tf-idf at 0.943, with the baseline unigram/bigram model close behind at 0.94. Because these scores are so close, I would not pick a single winner; I would keep using the leading models and monitor them over time.
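As a quick check on the ranking, the scores listed in the summary can be sorted programmatically. The sketch below copies the values from the list above; the shortened labels are mine.

```python
# F1 scores copied from the summary list above.
scores = {
    "MNB": 0.944,
    "MNB optimized": 0.915,
    "SGD": 0.945,
    "SGD optimized": 0.943,
    "SGD tf-idf": 0.943,
    "SGD tf-idf optimized": 0.945,
    "SGD truncated SVD": 0.912,
    "SGD truncated SVD optimized": 0.925,
    "SGD unigram/bigram": 0.94,
    "SGD unigram/bigram optimized": 0.946,
}

# Sort from best to worst; ties keep their listing order because sorted() is stable.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_f1 = ranked[0]
for name, f1 in ranked:
    print(f"{f1:.3f}  (gap to best: {best_f1 - f1:.3f})  {name}")
```

Seven of the ten models fall within 0.006 of the best score, and the top six within 0.003. Those gaps are small enough that differences between runs could plausibly reorder them.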