# An In-depth Evaluation of Approaches to Text Classification (IDEATC)

## II. Classic Supervised Baselines

_This notebook establishes baselines for text classification using "classic" supervised learning approaches: Naive Bayes, Logistic Regression and Support Vector Machines._

### Libraries

In [1]:
# standard library
import os
from pathlib import Path

# data wrangling
import numpy as np
import datasets

# local packages
import src
from src.experiments import supervised
from src.frameworks.sklearn import models, optimisation

# other settings
LOAD_PATH_DATASET = Path(os.pardir, 'data', 'processed')
SAVE_PATH_RESULTS = Path(os.pardir, 'data', 'results')
src.utils.show_datasets(LOAD_PATH_DATASET)

|   | dataset | num_rows_train | num_rows_valid | num_rows_test | num_classes |
|---|---------|----------------|----------------|---------------|-------------|
| 0 | yelp_review_full_processed | 650000 | 0 | 2500 | 5 |
| 1 | yelp_polarity_processed | 560000 | 0 | 1000 | 2 |
| 2 | dbpedia_14_processed | 560000 | 0 | 7000 | 14 |
| 3 | ag_news_processed | 120000 | 0 | 7600 | 4 |
| 4 | web_of_science_processed | 37589 | 0 | 9396 | 134 |
| 5 | imdb_processed | 25000 | 0 | 1000 | 2 |
| 6 | dynabench_dynasent_processed | 13065 | 720 | 720 | 3 |
| 7 | 20_newsgroups_processed | 11314 | 0 | 7532 | 20 |
| 8 | setfit_sst5_processed | 8544 | 1101 | 2210 | 5 |
| 9 | rotten_tomatoes_processed | 8530 | 1066 | 1066 | 2 |


## II. Dummy Classifier

In [2]:
for path in LOAD_PATH_DATASET.glob('*processed*'):
    dataset = datasets.load_from_disk(path)
    sample_sizes = src.experiments.utils.get_sample_sizes(dataset['train'])
    supervised.run_experiment(
        dataset_dict=dataset,
        feature='text_clean',
        get_model=models.get_dummy_model,
        search_params={},
        optimisation=optimisation,
        sample_sizes=sample_sizes,
        progress_bar=True,
        experiment_id='dummy_classifier',
        save_path=SAVE_PATH_RESULTS.joinpath(path.name),
    )
print('Done!')

Training size: 11,314 Run: 10/10 F1-score: 0.053: 100%|██████████| 7/7 [00:03<00:00,  1.79it/s]
Training size: 120,000 Run: 3/3 F1-score: 0.253: 100%|██████████| 10/10 [00:07<00:00,  1.31it/s]
Training size: 37,589 Run: 3/3 F1-score: 0.007: 100%|██████████| 9/9 [00:05<00:00,  1.50it/s]  
Training size: 560,000 Run: 1/1 F1-score: 0.467: 100%|██████████| 13/13 [00:30<00:00,  2.34s/it]
Training size: 13,065 Run: 10/10 F1-score: 0.327: 100%|██████████| 7/7 [00:01<00:00,  4.35it/s]
Training size: 25,000 Run: 3/3 F1-score: 0.489: 100%|██████████| 8/8 [00:02<00:00,  2.88it/s]  
Training size: 8,544 Run: 10/10 F1-score: 0.201: 100%|██████████| 7/7 [00:01<00:00,  4.74it/s]
Training size: 560,000 Run: 1/1 F1-score: 0.072: 100%|██████████| 13/13 [00:27<00:00,  2.10s/it]
Training size: 8,530 Run: 10/10 F1-score: 0.478: 100%|██████████| 7/7 [00:01<00:00,  5.12it/s]
Training size: 650,000 Run: 1/1 F1-score: 0.204: 100%|██████████| 13/13 [00:32<00:00,  2.49s/it]

Done!
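
For context, a dummy classifier ignores the input text and predicts from the label distribution alone, which gives a floor that any real model should beat. The sketch below uses scikit-learn's `DummyClassifier` on toy labels. It is an illustration only: the prediction strategy used by `models.get_dummy_model` and the F1 averaging used in `supervised.run_experiment` are not shown in this notebook, so the three strategies and the macro averaging here are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Toy data (assumption): a dummy classifier never looks at the features,
# so constant placeholders are enough.
rng = np.random.default_rng(0)
X_train = np.zeros((300, 1))
y_train = rng.integers(0, 3, size=300)
X_test = np.zeros((90, 1))
y_test = np.repeat([0, 1, 2], 30)  # perfectly balanced 3-class test set

results = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Macro F1 is an assumption; classes that are never predicted score 0.
    results[strategy] = f1_score(y_test, y_pred, average="macro", zero_division=0)

for strategy, score in results.items():
    print(f"{strategy}: {score:.3f}")
```

With a balanced test set of k classes, always predicting a single class yields a macro F1 of 2 / (k (k + 1)), which is why such baselines fall quickly as the number of classes grows.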





## III. Naive Bayes

In [2]:
params_vectoriser = {
    'vectoriser__min_df': [2, 5],
    'vectoriser__max_features': [10_000, 30_000, 50_000],
    'vectoriser__binary': [True, False],
}

params_clf = {
    'clf__alpha': [.1, .5, 1, 2, 5]  # additive (Laplace/Lidstone) smoothing
}
params = params_vectoriser | params_clf
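
The `vectoriser__` and `clf__` prefixes follow scikit-learn's `<step>__<parameter>` convention for addressing steps inside a `Pipeline`. The sketch below shows one pipeline that such a grid could tune. It is an assumption, not the code in `src.frameworks.sklearn.models`: the choice of `CountVectorizer`, `ComplementNB` (suggested only by the experiment id used later), `GridSearchCV`, the scoring and the toy corpus are all illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Step names must match the prefixes used in the parameter grid.
pipeline = Pipeline([
    ("vectoriser", CountVectorizer()),  # assumed vectoriser
    ("clf", ComplementNB()),            # assumed classifier
])

params = {
    "vectoriser__min_df": [2, 5],
    "vectoriser__max_features": [10_000, 30_000, 50_000],
    "vectoriser__binary": [True, False],
    "clf__alpha": [0.1, 0.5, 1, 2, 5],
}

# Toy corpus (assumption): repeated so that every term survives min_df=5
# inside each cross-validation fold.
texts = [
    "great film loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "awful acting terrible plot",
] * 10
labels = [1, 1, 0, 0] * 10

search = GridSearchCV(pipeline, params, scoring="f1_macro", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

An exhaustive grid over these values evaluates 2 x 3 x 2 x 5 = 60 combinations per fold, so a randomised search may be preferable on the larger datasets.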

In [4]:
for path in LOAD_PATH_DATASET.glob('*processed*'):
    dataset = datasets.load_from_disk(path)
    sample_sizes = src.experiments.utils.get_sample_sizes(dataset['train'])
    supervised.run_experiment(
        dataset_dict=dataset,
        feature='text_clean',
        get_model=models.get_naive_model,
        search_params=params,
        optimisation=optimisation,
        sample_sizes=sample_sizes,
        progress_bar=True,
        experiment_id='complement_naive_bayes',
        save_path=SAVE_PATH_RESULTS.joinpath(path.name),
    )
print('Done!')

Training size: 11,314 Run: 10/10 F1-score: 0.751: 100%|██████████| 7/7 [01:00<00:00,  8.59s/it]
Training size: 120,000 Run: 3/3 F1-score: 0.867: 100%|██████████| 10/10 [00:59<00:00,  5.97s/it]
Training size: 37,589 Run: 3/3 F1-score: 0.686: 100%|██████████| 9/9 [02:52<00:00, 19.21s/it]
Training size: 560,000 Run: 1/1 F1-score: 0.864: 100%|██████████| 13/13 [16:21<00:00, 75.48s/it] 
Training size: 13,065 Run: 10/10 F1-score: 0.520: 100%|██████████| 7/7 [00:04<00:00,  1.68it/s]
Training size: 25,000 Run: 3/3 F1-score: 0.856: 100%|██████████| 8/8 [01:11<00:00,  8.98s/it]
Training size: 8,544 Run: 10/10 F1-score: 0.360: 100%|██████████| 7/7 [00:05<00:00,  1.34it/s]
Training size: 560,000 Run: 1/1 F1-score: 0.943: 100%|██████████| 13/13 [05:30<00:00, 25.42s/it]
Training size: 8,530 Run: 10/10 F1-score: 0.765: 100%|██████████| 7/7 [00:05<00:00,  1.39it/s]
Training size: 650,000 Run: 1/1 F1-score: 0.493: 100%|██████████| 13/13 [17:40<00:00, 81.58s/it] 

Done!





## IV. Logistic Regression / SVM

In [3]:
params_clf = {
    'clf__loss': ['hinge', 'log_loss'],
    'clf__alpha': np.logspace(-6, 1, 8),  # i.e., 1e-6 to 10
}
params = params_vectoriser | params_clf

In [None]:
for path in LOAD_PATH_DATASET.glob('*processed*'):
    dataset = datasets.load_from_disk(path)
    sample_sizes = src.experiments.utils.get_sample_sizes(dataset['train'])
    supervised.run_experiment(
        dataset_dict=dataset,
        feature='text_clean',
        get_model=models.get_linear_model,
        search_params=params,
        optimisation=optimisation,
        sample_sizes=sample_sizes,
        progress_bar=True,
        experiment_id='sgd_classifier',
        save_path=SAVE_PATH_RESULTS.joinpath(path.name),
    )
print('Done!')

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  X, fitted_transformer = fit_transform_one_cached(

## V. Sanity Check

In [22]:
for path in LOAD_PATH_DATASET.glob('*_processed'):
    df_metrics = src.experiments.utils.read_metrics(SAVE_PATH_RESULTS, path.name)
    fig = src.plotting.plot_performance_overall(df_metrics=df_metrics, add_se=True)
    fig.update_layout(title=path.name)
    fig.show()
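
The `add_se=True` flag presumably adds standard-error bands computed over the repeated runs at each training size. How `plot_performance_overall` computes them is not shown here, so the sketch below is only the textbook calculation, applied to made-up scores.

```python
import numpy as np

# Hypothetical F1-scores from five repeated runs at one training size
# (made-up numbers, not results from this notebook).
f1_runs = np.array([0.70, 0.74, 0.72, 0.76, 0.73])

mean_f1 = f1_runs.mean()
# Standard error of the mean: sample standard deviation / sqrt(n).
se_f1 = f1_runs.std(ddof=1) / np.sqrt(len(f1_runs))

print(f"F1 = {mean_f1:.3f} +/- {se_f1:.3f}")
```

Datasets evaluated with a single run (the 1/1 cases in the progress bars above) have no spread to estimate, so no meaningful standard error can be computed for them.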