# 4. Supervised Analysis and Classifiers

Supervised algorithms (or **supervised** models) learn how the feature inputs `X` (e.g. function word frequencies of known texts) correlate with class outputs `y` (e.g. the authorship label of each text). A supervised model ‘observes’ correctly labelled, preclassified X-y pairs in order to register meaningful correlations (e.g. between certain function word patterns X and certain authors y). This process is called ‘training’, and the X-y pairs on which the model trains are commonly referred to as `training data`. Once this learning process has taken place, the model can be confronted with `test data`, comprising previously unobserved and unclassified texts. On the basis of what it has observed, the supervised model makes a prediction (classification) and assigns the unseen test data to a class, either by a hard decision or by outputting a probability score.
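
As a minimal sketch of this train-then-predict cycle, the toy example below fits a classifier on four labelled vectors and then classifies two unseen ones. The two-dimensional feature values and the author labels `A` and `B` are invented for illustration, and `LogisticRegression` is chosen here only because it returns probabilities out of the box; it is not the classifier used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (invented): relative frequencies of two function words
X_train_toy = np.array([[0.1, 0.9], [0.2, 0.8],   # texts by author A
                        [0.9, 0.1], [0.8, 0.2]])  # texts by author B
y_train_toy = np.array(["A", "A", "B", "B"])

# Toy test data: unseen texts of unknown authorship
X_test_toy = np.array([[0.15, 0.85], [0.85, 0.15]])

clf = LogisticRegression()
clf.fit(X_train_toy, y_train_toy)               # 'training' on labelled X-y pairs

hard_labels = clf.predict(X_test_toy)           # hard decision: one label per text
probabilities = clf.predict_proba(X_test_toy)   # soft decision: one probability per class

print(hard_labels)
print(clf.classes_)            # column order of the probabilities
print(probabilities.round(2))
```

The hard labels simply pick the class with the highest probability; the probabilities themselves show how confident the model is in each attribution.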

Classification is a considerable field of research in its own right. There are **many types** of classifiers, and it is not always clear which one will perform best, or why. Different types of classifiers tend to react differently to different problems, offer a variety of parametrization options and require different methods to optimize their performance during training. A big advantage of supervised machine-learning methods for ‘text classification’ (Sebastiani 2002) is the **possibility of evaluation**. By trying different combinations of parameters, such as the feature set, the vector length (number of features), sample length, vectorization method, scaling method, etc., and evaluating how well the resulting model predicts the class (author), scholars can fine-tune and optimize these parameters.

Before we proceed, we **repeat**, with the block of code below, some of **the steps from the previous notebook**.

These are:

1. Loading and segmentation of documents, containers `authors`, `titles`, `texts`
2. Vectorization of `texts` to matrix `X` containing vectors for all text segments
3. Scaling of `X` by applying `StandardScaler()`

Note that, as opposed to the three previous notebooks, we are now introducing a **test corpus** of allegedly **unknown authorship**. These files can be found in our `'corpus/test/'` folder. Below, we will apply a classifier to attribute each test text to one of the **known classes**, i.e. the authors represented in our training set from the `'corpus/train/'` folder.

In [1]:
from google.colab import files
from string import punctuation
import numpy as np
import pandas as pd
import re

# --- Parameters ---
sample_len = 5000  # length of each text segment in words
data_dict = {"train": [[], [], []], "test": [[], [], []]}

def process_uploaded(uploaded, dict_key):
    """Helper: clean, segment, and add uploaded texts to data_dict."""
    authors, titles, texts = data_dict[dict_key]

    for filepath, fileobj in uploaded.items():
        filename = filepath.split("/")[-1]  # e.g. "author_title.txt"

        # Extract author and title from filename
        author = filename.split(".")[0].split("_")[0]
        title = filename.split(".")[0].split("_")[1]

        # Decode uploaded bytes
        text = fileobj.decode("utf-8-sig")

        # Tokenize and clean
        bulk = []
        for word in text.strip().split():
            word = re.sub(r'\d+', '', word)  # remove digits
            word = re.sub('[%s]' % re.escape(punctuation), '', word)  # remove punctuation
            word = word.lower()
            if word != "":
                bulk.append(word)

        # Split text into equal-length samples
        bulk = [bulk[i:i+sample_len] for i in range(0, len(bulk), sample_len)]
        for index, sample in enumerate(bulk):
            if len(sample) == sample_len:
                authors.append(author)
                titles.append(title + "_{}".format(index + 1))
                texts.append(" ".join(sample))


# --- Phase 1: Upload training files ---
print("Upload training files...")
uploaded = files.upload()
process_uploaded(uploaded, "train")

# --- Phase 2: Upload test files ---
print("Upload test files...")
uploaded = files.upload()
process_uploaded(uploaded, "test")

# After both phases:
# data_dict["train"] = [authors, titles, texts]
# data_dict["test"]  = [authors, titles, texts]

Upload training files...


Saving Petrus-Abaelardus_Commentariorum-super-S-Pauli-Epistolam-ad-Romanos_cleaned.txt to Petrus-Abaelardus_Commentariorum-super-S-Pauli-Epistolam-ad-Romanos_cleaned (8).txt
Saving Bernardus-Claraevallensis_Sermones-super-cantica-canticorum_cleaned.txt to Bernardus-Claraevallensis_Sermones-super-cantica-canticorum_cleaned (8).txt
Saving Hildegardis-Bingensis_Liber-divinorum-operum_cleaned.txt to Hildegardis-Bingensis_Liber-divinorum-operum_cleaned (8).txt
Upload test files...


Saving Hildegardis-Bingensis_Scivias.txt to Hildegardis-Bingensis_Scivias (2).txt


In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler, LabelEncoder

# ==========================================================
# Vectorize and scale the training and test sets
# ==========================================================

for split_name, (authors, titles, texts) in data_dict.items():  # avoid shadowing the built-in `set`
    if split_name == 'train':  # key must be 'train'
        # Vectorize training texts (bag-of-words, top 250 unigrams)
        model = CountVectorizer(
            max_features=250,
            analyzer='word',
            ngram_range=(1, 1)
        )
        X_train = model.fit_transform(texts).toarray()

        # Sort features by frequency for consistent ordering
        feat_frequencies = np.asarray(X_train.sum(axis=0)).flatten()
        features = model.get_feature_names_out()
        feat_freq_df = pd.DataFrame({'feature': features, 'frequency': feat_frequencies})
        feat_freq_df = feat_freq_df.sort_values(by='frequency', ascending=False).reset_index(drop=True)
        sorted_features = feat_freq_df['feature'].tolist()
        sorted_indices = [model.vocabulary_[feat] for feat in sorted_features]
        X_train_sorted = X_train[:, sorted_indices]

        # Refit vectorizer with sorted vocabulary
        model = CountVectorizer(
            stop_words=[],
            analyzer='word',
            vocabulary=sorted_features,
            ngram_range=(1, 1)
        )
        X_train = model.fit_transform(texts).toarray()

        # Scale features (zero mean, unit variance)
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)

        # Encode authors as integer labels
        le = LabelEncoder()
        y_train = le.fit_transform(authors)

# ==========================================================
# Apply same models to the test set
# ==========================================================
for split_name, (authors, titles, texts) in data_dict.items():  # avoid shadowing the built-in `set`
    if split_name == 'test':  # key must be 'test'
        X_test = model.transform(texts).toarray()
        X_test = scaler.transform(X_test)
        test_titles = titles

# --- Inspect processed vectors ---

print("Training vectors shape:", X_train.shape)
print("First 5 training vectors:\n", X_train[:5])

print("\nTest vectors shape:", X_test.shape)
print("First 5 test vectors:\n", X_test[:5])

# Optional: check feature names and order
print("\nFirst 10 features in vectorizer:", model.get_feature_names_out()[:10])


Training vectors shape: (73, 250)
First 5 training vectors:
 [[-1.6679167  -1.2055242   0.02400279 ... -0.85138809 -0.1603732
  -0.15149293]
 [-1.46936238 -0.50842679  1.31510028 ... -0.48362874  0.3486374
  -1.11314372]
 [-1.02813057 -0.9097859   1.56870871 ... -0.11586939 -0.66938381
   0.32933246]
 [-1.29286966 -0.93091007  3.2286912  ... -0.85138809 -0.66938381
  -0.63231833]
 [-1.13843852 -0.17044017  1.49954278 ... -0.48362874 -0.1603732
  -1.11314372]]

Test vectors shape: (29, 250)
First 5 test vectors:
 [[ 0.18525692  1.64623792 -1.10570751 ... -0.48362874 -1.17839441
   2.25263403]
 [ 0.11907215  2.17434201 -0.5293247  ... -0.48362874 -1.17839441
   1.77180864]
 [ 0.14113374  0.39991225 -1.4515372  ...  0.98740867 -1.17839441
   0.32933246]
 [ 0.89122782  1.85747955 -0.75987783 ... -0.11586939 -1.17839441
  -0.63231833]
 [ 0.2293801   2.21659034 -0.87515439 ... -0.48362874 -1.17839441
   2.25263403]]

First 10 features in vectorizer: ['et' 'in' 'est' 'non' 'ad' 'quod' 'ut' 'q

## 4.1 Training a Classifier (by applying SVM)

Especially in recent years, which have witnessed the rise of machine learning and of computing power, classification algorithms such as **support vector machines** (SVMs) have become increasingly popular. An SVM is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the hyperplane that separates data points of different classes with the largest possible margin. SVMs are effective in high-dimensional feature spaces and, through different kernel functions, can also handle non-linear classification.
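
To make the role of the hyperplane and the kernel a little more tangible, here is a small sketch on invented, XOR-style toy data, which no straight line can separate. The data, the variable names and the values for `C` and `gamma` are assumptions chosen for illustration; the accuracies are measured on the training points themselves, purely to show what each kernel is able to fit.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style toy data (invented): no straight line separates the two classes
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([0, 0, 1, 1])

linear_svm = SVC(kernel="linear", C=10).fit(X_xor, y_xor)
rbf_svm = SVC(kernel="rbf", C=10, gamma=1.0).fit(X_xor, y_xor)

linear_acc = linear_svm.score(X_xor, y_xor)
rbf_acc = rbf_svm.score(X_xor, y_xor)

# For a linear kernel, the separating hyperplane is w . x + b = 0
print("w =", linear_svm.coef_, "b =", linear_svm.intercept_)
print("training accuracy, linear kernel:", linear_acc)
print("training accuracy, rbf kernel:", rbf_acc)
```

The linear kernel cannot classify all four points correctly, whereas the RBF kernel can, which is the sense in which kernels make SVMs suitable for non-linear problems.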

For now, we will not occupy ourselves too much with the hyperparameters of SVMs, and first take a look at some more general principles of training and evaluating classifiers.

### 4.1.1 Preparing the Dataset for Training → `train_test_split`

Above, we already declared a train and test set in separate folders. During training, however, it is considered good practice to hold out a **development set** (`X_dev`), also known as a **validation set**: a subset of the training set that is set aside while the model is trained. This held-out `X_dev` is not used for fitting, but serves to evaluate the model's performance during development. It allows you to assess how well your model is likely to perform on unseen data (i.e. `X_test`), which gives a better estimate of its generalization ability, offers a basis for tuning hyperparameters by comparing different settings, helps to detect overfitting early, and aids in selecting the best-performing model for production.

Here is a list of the variables that you will encounter in the process of partitioning our data set:

* `X_train` and `y_train`: The **full training set**: all vectorized text segments (`X_train`) labelled by authorship (`y_train`).
* `X_train_split` and `y_train_split`: The **remaining training set** after subtraction of the validation set.
* `X_dev` and `y_dev`: The **validation set**, a subset of the training data temporarily held out in order to function as a kind of stand-in test set.
* `X_test`: The *actual* **test set**, i.e. texts unseen by the model, for which authorship is (truly, this time) unknown.

Parameters to take into account when splitting `X_dev` off from `X_train` are the **split ratio** (`test_size=0.33`) and a **random seed** (`random_state`), which ensures that the split is reproducible.
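
The effect of both parameters can be sketched on synthetic data. The matrix below is random noise with the same dimensions as a 73-segment, 250-feature training set, and the class sizes are invented; only `train_test_split` and its `test_size`, `random_state` and `stratify` arguments are real scikit-learn API.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a scaled training matrix: 73 segments, 250 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(73, 250))
y_demo = np.array([0] * 30 + [1] * 30 + [2] * 13)   # invented class sizes

# Hold out one third as a development set; the seed fixes the shuffle
X_tr, X_dv, y_tr, y_dv = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1
)
print(X_tr.shape, X_dv.shape)

# Same seed, same split
_, X_dv_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)
same_split = np.array_equal(X_dv, X_dv_again)
print(same_split)

# Optional: stratify keeps the class proportions roughly equal in both parts
_, _, _, y_dv_strat = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1, stratify=y_demo
)
print(np.bincount(y_dv_strat))
```

Stratification is not used in the cell further below, but it may be worth considering when some authors contribute far fewer segments than others.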

### 4.1.2 Evaluation: Accuracy, Precision, Recall, F1 score

Once our model is trained on `X_train_split` and the known labels `y_train_split`, we can test the model's quality by having it predict on the held-out `X_dev` set. This yields a vector of **predictions** `y_dev_pred`, which can be readily compared to the known labels `y_dev`, commonly referred to as the "**ground truth**" or "**gold standard**".

During training, each of our authors (let's say for now, authors A, B, and C) is assigned an integer class label, e.g. `0`, `1`, `2`.
- The `y_dev` array will therefore look something like this: `[0, 0, 1, 1, 2, 2]`, where each of the 3 authors in the training set is represented by 2 text samples in the development set.
- Our model might output as its prediction (`y_dev_pred`) the array `[0, 1, 1, 1, 2, 2]`.

Clearly, it has made a mistake in misattributing the second text segment to class `1` (Author B) instead of class `0` (Author A).

When comparing the gold standard against the predictions, we can extract several evaluation metrics from a `classification_report`: `accuracy`, `precision`, `recall` and `f1-score`.

* `accuracy`: *"How often did we correctly attribute the text segment to a given author?"*  
  I.e. the percentage of correct predictions out of all the predictions made. The answer in our case may be obvious: 5 out of 6 times, 0.83.
* `precision`: *"When we positively identified a text segment as written by a given author, how often was that true?"*  
  Precision is calculated for each class separately and then averaged across classes. For Authors A and C, the precision is 100%: all positive identifications (1/1 for Author A and 2/2 for Author C) were indeed correct. For Author B, however, the model once predicted `1` where the outcome should have been `0` (= Author A). This is a **false positive**, and it lowers our model's precision for Author B to 2/3 (out of 3 positive predictions for Author B, only 2 were in fact correct), i.e. a score of 0.67. On average, our precision (when macro-averaged across all classes) is 0.89.
* `recall`: *"Were we able to attribute all text segments of a given author to that author?"*  
  Recall computes how many of the instances that should have been labelled positive were actually labelled as such. Again, each class is first looked at individually. For Author A, we only caught half of the `0`'s we should have caught, because we falsely attributed an observation belonging to class `0` to class `1`. The missed instance for class `0` is what we call a **false negative**, and it lowers the recall for that class to 1/2, i.e. 0.5. For Authors B and C, the results are 2/2 and 2/2, i.e. 100% in both cases. Averaged over all authors, the so-called macro recall comes to 0.83.
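* `f1-score`: A balanced score that combines precision and recall as their harmonic mean, `2 * precision * recall / (precision + recall)`, again computed per class and then averaged.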
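
The worked example above can be checked with scikit-learn directly. The sketch below uses the same toy arrays as the text; the variable names with a `_toy` suffix are introduced here for illustration only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_dev_toy = [0, 0, 1, 1, 2, 2]        # gold standard
y_dev_pred_toy = [0, 1, 1, 1, 2, 2]   # model predictions (one mistake)

accuracy = accuracy_score(y_dev_toy, y_dev_pred_toy)

# Per-class scores (average=None returns one value per class)
prec, rec, f1, support = precision_recall_fscore_support(
    y_dev_toy, y_dev_pred_toy, average=None
)
# Macro average: unweighted mean over the classes
macro_prec, macro_rec, macro_f1, _ = precision_recall_fscore_support(
    y_dev_toy, y_dev_pred_toy, average="macro"
)

print("accuracy:", round(accuracy, 2))
print("precision per class:", prec.round(2), "macro:", round(macro_prec, 2))
print("recall per class:   ", rec.round(2), "macro:", round(macro_rec, 2))
print("f1 per class:       ", f1.round(2), "macro:", round(macro_f1, 2))
```

Rounded to two decimals, this gives an accuracy of 0.83, a macro precision of 0.89, a macro recall of 0.83 and a macro F1 of 0.82, in line with the figures worked out by hand.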

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd

"""
Train on partitioned X_train_split, y_train_split
Test on validation set X_dev => yields y_dev_pred
"""

# Splits datasets into random train and test subsets.
X_train_split, X_dev, y_train_split, y_dev = train_test_split(X_train, y_train, test_size=0.33, random_state=1) # test_size: 1/3 of train data becomes validation set

print('Dimensions of original data set:')
print(X_train.shape)
print('Dimensions of partitions (train and dev) set:')
print(X_train_split.shape)
print(X_dev.shape)

# Initialize an SVM-classifier
svm_classifier = SVC(kernel='linear', C=1.0, random_state=42) # random seed ensures reproducibility
svm_classifier.fit(X_train_split, y_train_split)

# Make predictions with model
y_dev_pred = svm_classifier.predict(X_dev) # y_dev_pred = model predictions on the dev set

print(classification_report(y_dev, y_dev_pred)) # compare predictions to ground truth / gold standard

"""
Test on test set
Yields y_pred (predictions) of authorship
"""

y_pred = svm_classifier.predict(X_test)
predictions = le.inverse_transform(y_pred)

print()
print("Predicted authorship:")

df = pd.DataFrame(predictions) # structures the predicted author names as a DataFrame
df.columns = ['Prediction'] # assigns the column label
df.index = test_titles

print(df)

Dimensions of original data set:
(73, 250)
Dimensions of partitions (train and dev) set:
(48, 250)
(25, 250)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00         3

    accuracy                           1.00        25
   macro avg       1.00      1.00      1.00        25
weighted avg       1.00      1.00      1.00        25


Predicted authorship:
                               Prediction
Scivias (2)_1       Hildegardis-Bingensis
Scivias (2)_2       Hildegardis-Bingensis
Scivias (2)_3   Bernardus-Claraevallensis
Scivias (2)_4       Hildegardis-Bingensis
Scivias (2)_5       Hildegardis-Bingensis
Scivias (2)_6       Hildegardis-Bingensis
Scivias (2)_7       Hildegardis-Bingensis
Scivias (2)_8       Hildegardis-Bingensis
Scivias (2)_9       Hildegardis-Bingensis
Scivias (2)_10      Hildegardis-Bingensis
Scivias (2)_11      Hilde

## 4.2 `GridSearchCV()`: Tuning Parameters and Hyperparameters of the SVM (Advanced)

In this section, we repeat much of the above, but introduce the useful class `sklearn.model_selection.GridSearchCV`.
Think of what follows as a more advanced and specialized way of training your model. This time, we do not simply *choose* whatever parameters we think will work best; instead, we systematically evaluate a series of varying presets in order to gauge their performance on a more objective basis.

An SVM has quite a few hyperparameters, such as the regularization parameter (`'C'`) and the kernel type (e.g. `'linear'` or `'rbf'`).
Moreover, from a stylometric methodological perspective, we may want to experiment with varying feature types (function words, character n-grams, ...), feature vector lengths (`n_features`), and segment lengths (`sample_len`). A grid search can help us find the best-performing settings among those we try, so that the SVM model scores well in evaluation and is more likely to generalize to unseen data. This avoids the pitfalls of manual tuning (which can be subjective) and yields a more robust and reliable model.

Further below, we declare these various presets in containers, e.g. `sample_len_loop`, `feat_type_loop`, `feat_n_loop`, `c_options`, `kernel_options`, and `k_folds`.

**READ FIRST: Searching many parameters at the same time can be quite costly and take a long time. Try to start your gridsearch by focussing on only a few of these parameters at a time, just so you can get acquainted with them.**
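
To get a feeling for how quickly the cost grows, the sketch below counts the combinations in a small search space. The preset values and the name `search_space` are illustrative assumptions; `ParameterGrid` is scikit-learn's helper for enumerating all combinations of a dictionary of options.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative preset values, chosen only to show how the search space grows
search_space = {
    "sample_len": [500, 3000],
    "feat_type": ["raw_MFW", "tfidf_MFW"],
    "n_feats": [200, 600],
    "C": [1, 10],
    "kernel": ["linear", "rbf"],
}
n_folds = 3

combos = list(ParameterGrid(search_space))   # every combination of the options
n_fits = len(combos) * n_folds               # each combination is fitted once per fold

print(len(combos), "parameter combinations")
print(n_fits, "model fits with", n_folds, "cross-validation folds")
print(combos[0])
```

Five parameters with only two options each already amount to 32 combinations and 96 model fits; every additional option multiplies that number.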

In [5]:
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer, recall_score, accuracy_score, precision_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from string import punctuation
from tqdm.notebook import tqdm  # notebook-friendly progress bar
import numpy as np
import re

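# --- Upload training files ---
# The uploaded files are parsed in the loop below. We do not call
# process_uploaded() again: it would append duplicate segments to data_dict["train"].
print("Upload training files...")
uploaded = files.upload()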

# -------------------------------
# PREPROCESS UPLOADED FILES ONCE
# -------------------------------
preprocessed_texts = {}  # filename -> list of words
authors_dict = {}        # filename -> author
titles_dict = {}         # filename -> title

for filename in uploaded.keys():  # loop over uploaded files
    author, title = filename.split('/')[-1].split('.')[0].split('_')[:2]  # parse metadata
    text = uploaded[filename].decode('utf-8-sig').strip()                  # decode uploaded bytes
    words = re.sub(r'[\d%s]' % re.escape(punctuation), '', text.lower()).split()  # clean & tokenize
    preprocessed_texts[filename] = words
    authors_dict[filename] = author
    titles_dict[filename] = title

Upload training files...


Saving Petrus-Abaelardus_Commentariorum-super-S-Pauli-Epistolam-ad-Romanos_cleaned.txt to Petrus-Abaelardus_Commentariorum-super-S-Pauli-Epistolam-ad-Romanos_cleaned (9).txt
Saving Bernardus-Claraevallensis_Sermones-super-cantica-canticorum_cleaned.txt to Bernardus-Claraevallensis_Sermones-super-cantica-canticorum_cleaned (9).txt
Saving Hildegardis-Bingensis_Liber-divinorum-operum_cleaned.txt to Hildegardis-Bingensis_Liber-divinorum-operum_cleaned (9).txt


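By using `GridSearchCV()`, one can efficiently navigate through the classifier's parameter space. In the code that follows, we expand its functionality with a number of implementations particularly suitable for non-traditional authorship attribution. Note that the code itself relies on the closely related `RandomizedSearchCV`, which samples `n_iter_random` combinations from the parameter grid instead of trying all of them; with the settings used here (4 iterations for a grid of 4 combinations of `C` and kernel), every combination is still evaluated. The code block below is an example.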

1. **Preprocessing**: First, we open and preprocess our files again (a step you are familiar with by now), and store the resulting text segments (of length `sample_len`) in `texts`. Next, we label the segments by authorship and store the encoded labels in `y_train` (if there are three authors, the labels will be `0` for author A, `1` for author B, and `2` for author C).
2. **Vectorization**: We initialize various vectorization options, where we take into account the feature type (`word` or `char`), declare our n-gram range (`ngram_range`), and decide whether we want to input raw frequencies (`CountVectorizer()`) or TF-IDF weights (`TfidfVectorizer()`).
3. **Pipeline and Parameter Grid**: We build a `pipe` (`Pipeline`), a tool for chaining data preprocessing steps and a machine learning algorithm into a single object, so that data transformation and model training are handled together as one end-to-end workflow. We store our options in a dictionary `param_grid`, which specifies the hyperparameter values to search over. This allows for a systematic exploration of different hyperparameter combinations in order to identify the best configuration for the model.
4. **Grid Search and Cross-Validated Results**: Finally, we introduce an important new concept, that of cross-validation (CV). CV is a more advanced and more reliable alternative to the single `train_test_split` introduced in the block of code earlier. With CV, the dataset is divided into *k* (roughly) equal-sized **folds**. The model is then trained *k* times, each time on *k*−1 folds (which correspond to `X_train_split` above) and tested on the remaining fold (`X_dev`). The performance metrics (accuracy, precision, recall, f1-score) are averaged over the *k* trials to give a more reliable estimate of the model's performance (information you can extract from, e.g., `results['mean_test_accuracy_score']`).

This helps ensure that the model is not overly dependent on a particular subset of the data, and provides a more accurate estimate of the model’s performance on unseen data.
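
The interplay of pipeline, parameter grid and cross-validation described in steps 3 and 4 can be sketched in a few lines. This is a simplified illustration on synthetic data generated with `make_classification`, not the notebook's own search: the `_demo` and `_syn` names, the data dimensions and the choice of `f1_macro` as the single scoring metric are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for vectorized text segments
X_syn, y_syn = make_classification(
    n_samples=90, n_features=20, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),   # refit on the training folds of every split
    ("classifier", SVC()),
])
param_grid_demo = {
    "classifier__C": [1, 10],
    "classifier__kernel": ["linear", "rbf"],
}
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(pipe_demo, param_grid_demo, cv=cv_demo, scoring="f1_macro")
search.fit(X_syn, y_syn)

# One mean score per parameter combination, averaged over the 3 folds
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(score, 3))
print("best:", search.best_params_)
```

Because the scaler sits inside the pipeline, it is fitted only on the training folds of each split, so no information from the held-out fold leaks into the preprocessing.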

Try to tweak the various parameter settings in the `PARAMETERS` block at the top of the code below, and then run it.

In [8]:
# -------------------------------
# PARAMETERS
# -------------------------------
sample_len_loop = [500, 3000]         # test sample lengths
feat_type_loop = ['raw_MFW','tfidf_MFW']  # smaller subset for testing
feat_n_loop = [200, 600]              # number of features
c_options = [1, 10]                   # SVM C values
kernel_options = ['linear', 'rbf']    # SVM kernels, try 'poly' and 'sigmoid' too
k_folds = [3]                         # cross-validation folds
n_iter_random = 4                      # number of random hyperparameter combos

vectorizer_params = {
    'raw_MFW': (CountVectorizer, 'word', (1,1)),
    'tfidf_MFW': (TfidfVectorizer, 'word', (1,1)),
    'raw_4grams': (CountVectorizer, 'char', (4,4)),
    'tfidf_4grams': (TfidfVectorizer, 'char', (4,4))
}

# -------------------------------
# CONTAINERS FOR RESULTS
# -------------------------------
all_grid_scores = []          # store aggregated CV metrics per parameter combo
all_parameter_combos = []     # store corresponding parameter settings

# Use a stratified CV object to ensure balanced folds across classes
cv = StratifiedKFold(n_splits=k_folds[0], shuffle=True, random_state=42)

# --- Outer loop over sample lengths ---
for sample_len in tqdm(sample_len_loop, desc="Sample length"):
    # -------------------------------
    # SLICE PREPROCESSED TEXTS INTO SAMPLES (chunk texts once per sample_len)
    # -------------------------------
    texts, authors, titles = [], [], []
    for filename, words in preprocessed_texts.items():
        # split the text into consecutive chunks of length `sample_len`
        bulk = [words[i:i+sample_len] for i in range(0, len(words), sample_len)]
        for index, sample in enumerate(bulk):
            if len(sample) == sample_len:  # skip last chunk if shorter than sample_len
                texts.append(" ".join(sample))          # reconstruct sample as string
                authors.append(authors_dict[filename])  # append corresponding author
                titles.append(f"{titles_dict[filename]}_{index+1}")  # unique title per chunk

    # Encode labels for classifier
    y_train = LabelEncoder().fit_transform(authors)

    # Ensure each class has at least as many samples as CV folds
    _, class_counts = np.unique(y_train, return_counts=True)
    if class_counts.min() < k_folds[0]:
        print(f"[SKIP] sample_len={sample_len}: min class count {class_counts.min()} < cv={k_folds[0]}.")
        continue

    # --- Loop over feature extraction configurations ---
    for feat_type in feat_type_loop:
        for n_feats in feat_n_loop:
            # -------------------------------
            # INITIALIZE AND FIT VECTORIZER (specific to feat_type & n_feats)
            # -------------------------------
            vectorizer_class, analyzer, ngram_range = vectorizer_params[feat_type]
            vectorizer = vectorizer_class(
                analyzer=analyzer,
                ngram_range=ngram_range,
                max_features=n_feats,
                lowercase=False,     # texts already pre-lowercased
                dtype=np.float32     # reduces memory usage & speeds up computation
            )
            X = vectorizer.fit_transform(texts)   # sparse matrix representation

            # -------------------------------
            # DEFINE PIPELINE (scaler + classifier)
            # -------------------------------
            pipe = Pipeline([
                ('scaler', StandardScaler(with_mean=False)),          # preserves sparse format
                ('classifier', svm.SVC(probability=True))            # SVM classifier
            ])

            # -------------------------------
            # RANDOMIZED SEARCH CONFIGURATION
            # -------------------------------
            param_grid = {
                'classifier__C': c_options,
                'classifier__kernel': kernel_options
            }

            # define scoring metrics for evaluation
            scoring = {
                'accuracy_score': make_scorer(accuracy_score),
                'precision_score': make_scorer(precision_score, average='macro', zero_division=0),
                'recall_score': make_scorer(recall_score, average='macro', zero_division=0),
                'f1_score': make_scorer(f1_score, average='macro', zero_division=0)
            }

            # setup RandomizedSearchCV
            grid = RandomizedSearchCV(
                pipe,
                param_distributions=param_grid,
                n_iter=n_iter_random,               # number of random combos to try
                cv=cv,                               # stratified CV
                n_jobs=-1,                           # use all CPUs
                scoring=scoring,                     # multiple metrics
                refit='f1_score',                    # best model selected by macro F1
                verbose=0,
                error_score=np.nan                   # assign NaN on failure, avoid tracebacks
            )

            # -------------------------------
            # FIT MODEL ON TRAINING DATA
            # -------------------------------
            grid.fit(X, y_train)
            results = grid.cv_results_

            # -------------------------------
            # STORE CV RESULTS
            # -------------------------------
            for idx, params in enumerate(results['params']):
                # store mean scores for current parameter combination
                all_grid_scores.append((
                    results['mean_test_accuracy_score'][idx],
                    results['mean_test_precision_score'][idx],
                    results['mean_test_recall_score'][idx],
                    results['mean_test_f1_score'][idx]
                ))
                # store corresponding parameter settings
                all_parameter_combos.append((
                    feat_type, n_feats, sample_len,
                    params['classifier__C'], params['classifier__kernel']
                ))

# -------------------------------
# CREATE REPORT DATAFRAME
# -------------------------------
import pandas as pd
full_report = []
for (acc, prec, rec, f1), params in zip(all_grid_scores, all_parameter_combos):
    model_name = '-'.join([str(i) for i in params])   # create unique model name from parameters
    full_report.append((model_name, acc, prec, rec, f1))

# convert to DataFrame and sort by f1 score
df = pd.DataFrame(full_report, columns=['model', 'accuracy', 'precision', 'recall', 'f1 score'])
df_sorted = df.sort_values(by='f1 score', ascending=False)
print(df_sorted)

Sample length:   0%|          | 0/2 [00:00<?, ?it/s]

                           model  accuracy  precision    recall  f1 score
31     tfidf_MFW-600-3000-10-rbf  1.000000   1.000000  1.000000  1.000000
30  tfidf_MFW-600-3000-10-linear  1.000000   1.000000  1.000000  1.000000
29      tfidf_MFW-600-3000-1-rbf  1.000000   1.000000  1.000000  1.000000
28   tfidf_MFW-600-3000-1-linear  1.000000   1.000000  1.000000  1.000000
27     tfidf_MFW-200-3000-10-rbf  1.000000   1.000000  1.000000  1.000000
26  tfidf_MFW-200-3000-10-linear  1.000000   1.000000  1.000000  1.000000
25      tfidf_MFW-200-3000-1-rbf  1.000000   1.000000  1.000000  1.000000
24   tfidf_MFW-200-3000-1-linear  1.000000   1.000000  1.000000  1.000000
23       raw_MFW-600-3000-10-rbf  1.000000   1.000000  1.000000  1.000000
22    raw_MFW-600-3000-10-linear  1.000000   1.000000  1.000000  1.000000
21        raw_MFW-600-3000-1-rbf  1.000000   1.000000  1.000000  1.000000
20     raw_MFW-600-3000-1-linear  1.000000   1.000000  1.000000  1.000000
19       raw_MFW-200-3000-10-rbf  1.00