# Tutorial 3a - Medical text analysis

Original author: [Santeri Rytky](https://www.oulu.fi/fi/tutkijat/santeri-rytky) 

Natural language processing (NLP) is a subfield of artificial intelligence that covers techniques for processing linguistic and textual data. Practical examples include speech recognition, text-to-speech conversion, automatic grammar correction and junk email detection. In this tutorial, you will use an open-source dataset to classify medical abstracts into 5 different conditions. Topics covered:

- Basics of natural language processing
- Multiclass classification
- Pipelining multiple analysis steps
- Accounting for dataset bias
- Objective selection of model parameters (Hyperparameter optimization)

Original dataset:
https://www.kaggle.com/chaitanyakck/medical-text

References/more reading:
https://realpython.com/python-keras-text-classification/
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [None]:
'''
# Mount the Google drive folder to Colab
from google.colab import drive
drive.mount('/content/drive')

# Add the Tutorial folder to Python path
import sys
sys.path.append('/content/drive/My Drive/MLinMedicine/Tutorial2')
''';

Remember that you can press `ctrl` and hover over the module to see more details.

In [14]:
import numpy as np
import pandas as pd

# Timing and dates
import time

# Wrap long texts
from textwrap import fill

# Assign default variables to functions
from functools import partial

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    f1_score,
    classification_report,
)

# Combine analysis steps
from sklearn.pipeline import Pipeline

# NLP functions
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# 1. Data loading and preprocessing
Let's start by loading the data into the notebook environment. If you open the .dat file, you can see that each line consists of a number followed by a couple of sentences. The number represents the condition of the patient, and the sentences make up the medical abstract. The number and the abstract are separated by a tab character, `\t`. A collection of texts is often called a **corpus** in NLP.

In [3]:
# Read the abstracts data and the corresponding labels
with open('./train.dat', 'r') as f:
    text_data = f.readlines() 

# Split labels and text
corpus, labels = [], []
for line in text_data:
    # Split each line at the tab: the label comes first, the abstract text second
    corpus.append(line.split('\t')[1])
    labels.append(int(line.split('\t')[0]))

The data is divided into 5 classes:

- 1 = Neoplasms (tumor)
- 2 = Digestive system diseases
- 3 = Nervous system diseases
- 4 = Cardiovascular diseases
- 5 = General pathological conditions

In contrast to tutorial 2, we now need to distinguish between several disease classes (**multi-class classification**) instead of predicting patient death (**binary classification**). Let's verify the number of abstracts and classes. It is also a good idea to read a couple of abstracts to get a sense of their contents. You can change the value of `id` to get another set of abstracts.

In [5]:
print(f'The dataset consists of {len(corpus)} medical texts from {len(np.unique(labels))} classes')

# Are the classes balanced?
labels_array = np.array(labels)
print(f'Abstracts for \n class 1: {np.sum(labels_array == 1)}\n',
f'class 2: {np.sum(labels_array == 2)}\n',
f'class 3: {np.sum(labels_array == 3)}\n',
f'class 4: {np.sum(labels_array == 4)}\n',
f'class 5: {np.sum(labels_array == 5)}\n')

# Example abstracts
id = 7
print('Class 1:\n\n', fill(corpus[np.where(labels_array == 1)[0][id]][:100]))
print('\nClass 2:\n\n', fill(corpus[np.where(labels_array == 2)[0][id]][:100]))
print('\nClass 3:\n\n', fill(corpus[np.where(labels_array == 3)[0][id]][:100]))
print('\nClass 4:\n\n', fill(corpus[np.where(labels_array == 4)[0][id]][:100]))
print('\nClass 5:\n\n', fill(corpus[np.where(labels_array == 5)[0][id]][:100]))

The dataset consists of 14438 medical texts from 5 classes
Abstracts for 
 class 1: 3163
 class 2: 1494
 class 3: 1925
 class 4: 3051
 class 5: 4805

Class 1:

 Modified anterior compartment resection. In the majority of patients
with soft tissue sarcomas of th

Class 2:

 Postoperative pancreatic abscess due to Plesiomonas shigelloides.
Plesiomonas shigelloides is being

Class 3:

 Meralgia paresthetica after coronary bypass surgery. Meralgia
paresthetica is a neurologic disorder

Class 4:

 A rheolytic system for percutaneous coronary and peripheral plaque
removal. A method for plaque diss

Class 5:

 A controlled trial comparing vidarabine with acyclovir in neonatal
herpes simplex virus infection. I


Note that classes 2 and 3 are clearly less frequent in the data than the others, while class 5 is the most common. Next, we can split the data into training, validation and test sets.

In [7]:
# Fix random seed
seed = 2

# Train and test split
X_train, X_test, Y_train, Y_test = train_test_split(corpus, labels, test_size=0.33, random_state=seed, shuffle=True)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.33, random_state=seed, shuffle=True)
print(f'Training set consists of {len(X_train)} abstracts, while the validation set includes {len(X_val)} abstracts')
print(f'Test set includes {len(X_test)} abstracts')

Training set consists of 6480 abstracts, while the validation set includes 3193 abstracts
Test set includes 4765 abstracts
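The split sizes printed above can be checked by hand. The sketch below assumes that the held-out count is rounded up and the remainder goes to the other set, which reproduces the printed numbers. Note that the validation set is 33% of the remaining data, not of the full dataset.

```python
import math

n_total = 14438      # number of abstracts reported earlier
test_size = 0.33     # fraction passed to train_test_split

# First split: hold out the test set (count rounded up, an assumption)
n_test = math.ceil(test_size * n_total)
n_remaining = n_total - n_test

# Second split: hold out the validation set from what is left
n_val = math.ceil(test_size * n_remaining)
n_train = n_remaining - n_val

print(f'train={n_train}, val={n_val}, test={n_test}')
print(f'validation share of full dataset: {n_val / n_total:.2f}')
```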


# 2. Baseline

Let's start by making a simple baseline experiment. A baseline refers to a simple reference method that we can later try to improve. The results of the baseline experiment also describe how difficult the given task is. 

A conventional machine learning approach to classifying textual data is to first convert the text into a feature vector by counting the occurrences of each word. The features can then be used to create a machine learning model. This is known as a **bag-of-words** approach.

Next, we will do the **feature extraction**. In the simplest case, we assign an index to each unique word and count how often each word occurs. This means that we can represent each abstract as a vector. **Feature vectors** can then easily be fed to the machine learning model.

In [9]:
# Feature extraction:
# Count the word occurrences. 
# Note that the words are converted to lowercase by default
vectorizer = CountVectorizer(lowercase=True)

# .fit() creates the vocabulary based on X_train
vectorizer.fit(X_train)

# .transform() converts texts into feature vectors based on the fitted vocabulary
X_train_vector = vectorizer.transform(X_train)
X_val_vector = vectorizer.transform(X_val)
X_test_vector = vectorizer.transform(X_test)

# Which words occur in the first abstract?
words = np.nonzero(X_train_vector[0, :])[1]
f_names = vectorizer.get_feature_names_out()
print('Words used in first abstract:\n', np.take(f_names, words))
print(f'Feature vector type: {type(X_train_vector)}')

Words used in first abstract:
 ['21' '24' 'activation' 'activity' 'acute' 'administered' 'administration'
 'all' 'almost' 'amidase' 'an' 'and' 'animals' 'antiprotease'
 'antiproteases' 'approach' 'are' 'as' 'at' 'balance' 'be' 'been'
 'benefit' 'but' 'by' 'capacities' 'capacity' 'complete' 'confirmed'
 'consumption' 'continues' 'current' 'data' 'defences' 'degree' 'during'
 'dying' 'enzyme' 'enzymes' 'examined' 'exogenous' 'experimental'
 'exudate' 'exudates' 'factor' 'few' 'for' 'formed' 'from' 'fulminant'
 'has' 'human' 'in' 'indicating' 'inhibitory' 'instances'
 'intraperitoneal' 'intraperitoneally' 'intravenously' 'is' 'key' 'likely'
 'man' 'marked' 'may' 'not' 'occur' 'of' 'or' 'other' 'overwhelming'
 'pancreas' 'pancreatitis' 'patients' 'peritoneal' 'possessed' 'prolongs'
 'protease' 'proteolytic' 'rats' 'reduced' 'reduction' 'release'
 'released' 'responsible' 'role' 'sampling' 'shocked' 'showed' 'study'
 'such' 'suggested' 'suggests' 'survival' 'that' 'the' 'their' 'therapy'
 '

Note that the vectorizer sorts the words alphabetically. We have now created a **vocabulary** including all words in the training set. The feature vectors are stored as sparse matrices, which are efficient when only a few elements are nonzero (a single abstract contains only a small fraction of the vocabulary).

Next, we can train the classifier based on the word occurrences. Let's use a simple logistic regression model for the classification.

In [10]:
# Train the classifier, allowing up to 1000 solver iterations
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vector, Y_train)

# Predict on validation data
predictions = model.predict(X_val_vector)

# Predict on test data for later
predictions_bline = model.predict(X_test_vector)

# Show average accuracy with 3 decimal places
print(f'Accuracy: {accuracy_score(predictions, Y_val):.3f}')
# Print a report for multiclass prediction performance
print(classification_report(Y_val, predictions, labels=[1, 2, 3, 4, 5]))

Accuracy: 0.495
              precision    recall  f1-score   support

           1       0.61      0.61      0.61       705
           2       0.42      0.36      0.39       338
           3       0.43      0.36      0.39       427
           4       0.57      0.54      0.56       639
           5       0.42      0.49      0.45      1084

    accuracy                           0.49      3193
   macro avg       0.49      0.47      0.48      3193
weighted avg       0.50      0.49      0.49      3193



The classification report provides a lot of useful information:

- Precision = true positives / all predicted positives
- Recall = true positives / all actual positives
- F1 score = harmonic mean of precision and recall
- Support = number of true cases for the class
- Accuracy = correct predictions / all predictions
- macro average = unweighted average across the classes
- weighted average = average across the classes, weighted by support
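As a rough check of these definitions, the sketch below recomputes a few values from the rounded numbers in the baseline report above. Because the inputs are already rounded to two decimals, the results only match the report to about two decimals.

```python
import numpy as np

# Per-class values copied from the baseline classification report (rounded)
precision = np.array([0.61, 0.42, 0.43, 0.57, 0.42])
recall = np.array([0.61, 0.36, 0.36, 0.54, 0.49])
f1_reported = np.array([0.61, 0.39, 0.39, 0.56, 0.45])
support = np.array([705, 338, 427, 639, 1084])

# F1 is the harmonic mean of precision and recall
f1_recomputed = 2 * precision * recall / (precision + recall)

# Macro average: every class counts equally
macro_f1 = f1_reported.mean()
# Weighted average: classes count in proportion to their support
weighted_f1 = np.average(f1_reported, weights=support)

print(np.round(f1_recomputed, 2))
print(f'macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}')
```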

# 3. Improving the performance: stop words

What if we remove the stop words from the vocabulary? These are words that are very common in the English language and do not add much value for classification, such as: *the, and, is, in*.

In [43]:
# Count the word occurrences removing the stopwords 
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
vectorizer.fit(X_train)
X_train_vector = vectorizer.transform(X_train)
X_val_vector = vectorizer.transform(X_val)

# Train the classifier
model = LogisticRegression()
model.fit(X_train_vector, Y_train)

# Predict on validation data
predictions = model.predict(X_val_vector)
print(f'Accuracy: {accuracy_score(predictions, Y_val):.3f}')
print(classification_report(Y_val, predictions, labels=[1, 2, 3, 4, 5]))

Accuracy: 0.497
              precision    recall  f1-score   support

           1       0.62      0.61      0.62       705
           2       0.43      0.37      0.39       338
           3       0.43      0.37      0.40       427
           4       0.58      0.56      0.57       639
           5       0.42      0.48      0.45      1084

    accuracy                           0.50      3193
   macro avg       0.50      0.48      0.49      3193
weighted avg       0.50      0.50      0.50      3193



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


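We can already see that there was a small improvement (accuracy from 0.495 to 0.497) when removing uninformative words. This means that the model can focus on more informative features. The convergence warning appears because this model uses the default `max_iter=100` instead of the 1000 iterations used for the baseline.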

# 4. Word frequency

Our bag-of-words model can already detect some of the easier classes reasonably well. For example, precision for neoplasms (class 1) is 0.62, which means that a prediction of class 1 is correct about 62% of the time.

The feature vectors currently consist of raw counts of the vocabulary words. But what if an abstract is much longer than those in the training set and therefore contains more occurrences of each word? In this case, it is beneficial to scale the counts by text length. We can do that with the scikit-learn class `TfidfVectorizer`. Tf–idf stands for “Term Frequency times Inverse Document Frequency”: word counts are normalized for document length, and words that occur in many documents are given a smaller weight.

In [45]:
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
tfidf_vectorizer.fit(X_train)
X_train_vector = tfidf_vectorizer.transform(X_train)
X_val_vector = tfidf_vectorizer.transform(X_val)

# Train the classifier
model = LogisticRegression()
model.fit(X_train_vector, Y_train)

# Predict on validation data
predictions = model.predict(X_val_vector)
print(f'Accuracy: {accuracy_score(predictions, Y_val):.3f}')
print(classification_report(Y_val, predictions, labels=[1, 2, 3, 4, 5]))

Accuracy: 0.566
              precision    recall  f1-score   support

           1       0.69      0.70      0.69       705
           2       0.56      0.35      0.43       338
           3       0.54      0.32      0.40       427
           4       0.67      0.62      0.64       639
           5       0.47      0.61      0.53      1084

    accuracy                           0.57      3193
   macro avg       0.59      0.52      0.54      3193
weighted avg       0.58      0.57      0.56      3193



We can see that the model is again working better (accuracy from 0.497 to 0.566). It is clearly important to account for differences in document length and for commonly occurring words.

# 5. Pipelines

To conduct all of these steps in one go, we can combine the feature extraction and classification into a pipeline. This makes our job easier later.

In [41]:
abstract_clf = Pipeline([('vectorizer', TfidfVectorizer(lowercase=True, stop_words='english')),
                         ('classifier', LogisticRegression())])

abstract_clf.fit(X_train, Y_train)
predictions = abstract_clf.predict(X_val)
print(f'Accuracy: {accuracy_score(predictions, Y_val):.3f}')

Accuracy: 0.566


We got the same result using the pipeline approach, which shows that we did not change anything other than combining the steps.

Let's now try switching to a different classifier: the support vector machine (SVM). SVMs are generally considered very good for classifying text data.
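The equivalence between the manual steps and the pipeline can also be checked on a tiny made-up example. The texts and labels below are hypothetical and only serve to show that both routes give identical predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training texts with two classes (illustration only)
toy_texts = [
    "tumor resection margin",
    "malignant tumor biopsy",
    "cardiac arrest ecg",
    "coronary artery bypass",
]
toy_labels = [1, 1, 4, 4]
new_texts = ["tumor biopsy result", "coronary artery stent"]

# Manual route: vectorize first, then classify
vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(toy_texts), toy_labels)
manual_pred = clf.predict(vec.transform(new_texts))

# Pipeline route: the same two steps wrapped in one object
pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                 ('classifier', LogisticRegression())])
pipe.fit(toy_texts, toy_labels)
pipe_pred = pipe.predict(new_texts)

print(manual_pred, pipe_pred)
```

A practical advantage of the pipeline is that the vectorizer is always fitted only on the data passed to `.fit()`, so the vocabulary cannot accidentally leak from validation or test data.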

In [None]:
abstract_clf = Pipeline([('vectorizer', TfidfVectorizer(lowercase=True, stop_words='english')),
                         ('classifier', SVC(kernel='linear', random_state=seed))])

abstract_clf.fit(X_train, Y_train)
predictions = abstract_clf.predict(X_val)

# Predict on test for later
predictions_svm = abstract_clf.predict(X_test)

print(f'Accuracy: {accuracy_score(predictions, Y_val):.3f}')
print(classification_report(Y_val, predictions, labels=[1, 2, 3, 4, 5]))

# Create confusion matrix
conf = confusion_matrix(Y_val, predictions)
true_labels = ['True 1', 'True 2', 'True 3', 'True 4', 'True 5']
pred_labels = ['Predicted 1', 'Predicted 2', 'Predicted 3', 'Predicted 4', 'Predicted 5']
print(pd.DataFrame(conf, index=true_labels,
             columns=pred_labels))

Accuracy: 0.573
              precision    recall  f1-score   support

           1       0.68      0.73      0.71       705
           2       0.52      0.42      0.46       338
           3       0.53      0.40      0.45       427
           4       0.65      0.65      0.65       639
           5       0.49      0.55      0.52      1084

    accuracy                           0.57      3193
   macro avg       0.57      0.55      0.56      3193
weighted avg       0.57      0.57      0.57      3193

        Predicted 1  Predicted 2  Predicted 3  Predicted 4  Predicted 5
True 1          513           23           25           21          123
True 2           40          141            3           10          144
True 3           41            3          169           28          186
True 4           16            9           32          416          166
True 5          139           97           91          165          592


It seems that the SVM was a slightly better choice than logistic regression (accuracy 0.573 vs. 0.566). However, there are many other design choices available, and we will discuss how to deal with some of them next.
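The confusion matrix contains everything needed to recompute the report. As a short sketch, the numbers below are copied from the matrix printed above (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix copied from the SVM output above
conf = np.array([
    [513,  23,  25,  21, 123],
    [ 40, 141,   3,  10, 144],
    [ 41,   3, 169,  28, 186],
    [ 16,   9,  32, 416, 166],
    [139,  97,  91, 165, 592],
])

correct = np.diag(conf)                   # correctly classified per class
recall = correct / conf.sum(axis=1)       # divide by row sums (true counts)
precision = correct / conf.sum(axis=0)    # divide by column sums (predicted counts)
accuracy = np.trace(conf) / conf.sum()    # all correct / all predictions

print('recall:   ', np.round(recall, 2))
print('precision:', np.round(precision, 2))
print(f'accuracy:  {accuracy:.3f}')
```

The last column shows where most errors come from: many abstracts from classes 2, 3 and 4 are predicted as class 5, the general pathological conditions.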

# 6. Weighted sampling

Weighted sampling, implemented here as class weighting, allows us to account for the imbalance between the classes. This can be achieved easily by passing the `class_weight='balanced'` argument to the classifier, which sets the weight of each class inversely proportional to its frequency. Thus, errors on underrepresented classes are penalized more heavily.
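To get a feeling for the size of these weights, the sketch below applies scikit-learn's `compute_class_weight` helper to the class counts of the full dataset reported earlier. This is an illustration only: the classifier in this tutorial sees just the training split, so its actual weights differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts of the full dataset, as printed earlier in the tutorial
class_counts = {1: 3163, 2: 1494, 3: 1925, 4: 3051, 5: 4805}

# Rebuild a label vector with these counts (order does not matter here)
toy_labels = np.concatenate([np.full(n, c) for c, n in class_counts.items()])
classes = np.array(sorted(class_counts))

# 'balanced' weights as computed by scikit-learn
weights = compute_class_weight(class_weight='balanced', classes=classes, y=toy_labels)

# The same weights by hand: n_samples / (n_classes * class count)
counts = np.array([class_counts[c] for c in classes])
manual = len(toy_labels) / (len(classes) * counts)

for c, w in zip(classes, weights):
    print(f'class {c}: weight {w:.2f}')
```

The rarest class (class 2) gets the largest weight and the most common class (class 5) the smallest, which matches the shift in recall seen after enabling the option.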

In [None]:
abstract_clf = Pipeline([('vectorizer', TfidfVectorizer(lowercase=True, stop_words='english')),
                         ('classifier', SVC(kernel='linear', random_state=seed, class_weight='balanced'))])

abstract_clf.fit(X_train, Y_train)
predictions = abstract_clf.predict(X_val)

# Predict on test for later
predictions_bal = abstract_clf.predict(X_test)

print(f'Accuracy: {accuracy_score(predictions, Y_val):.3f}')
print(classification_report(Y_val, predictions, labels=[1, 2, 3, 4, 5]))

# Create confusion matrix
conf = confusion_matrix(Y_val, predictions)
true_labels = ['True 1', 'True 2', 'True 3', 'True 4', 'True 5']
pred_labels = ['Predicted 1', 'Predicted 2', 'Predicted 3', 'Predicted 4', 'Predicted 5']
print(pd.DataFrame(conf, index=true_labels,
             columns=pred_labels))

Accuracy: 0.593
              precision    recall  f1-score   support

           1       0.69      0.74      0.71       705
           2       0.51      0.62      0.56       338
           3       0.50      0.58      0.54       427
           4       0.64      0.72      0.68       639
           5       0.56      0.42      0.48      1084

    accuracy                           0.59      3193
   macro avg       0.58      0.62      0.59      3193
weighted avg       0.59      0.59      0.59      3193

        Predicted 1  Predicted 2  Predicted 3  Predicted 4  Predicted 5
True 1          522           42           47           20           74
True 2           37          211            9           12           69
True 3           36           11          246           31          103
True 4           17           14           42          462          104
True 5          146          137          148          202          451


We can immediately see from the classification report that recall improves clearly for classes 2 and 3 (from 0.42 to 0.62 and from 0.40 to 0.58). However, recall for the most common class (class 5) drops from 0.55 to 0.42.

# 7. Hyperparameter and feature selection
This time we optimize for the best set of hyperparameters.
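Two building blocks of this section can be looked at in isolation. The sketch below first counts how many parameter combinations the grid used in this section contains, and then shows on a made-up six-sample fold vector how a predefined split separates training and validation indices. The names `toy_grid` and `toy_fold` are illustrative.

```python
from sklearn.model_selection import ParameterGrid, PredefinedSplit

# Same parameter values as the grid searched in this section
toy_grid = {
    'classifier__kernel': ['linear', 'sigmoid'],
    'classifier__C': [1e-2, 1e-1, 1, 10],
    'classifier__max_iter': [-1, 100, 500, 1000],
    'classifier__class_weight': ['balanced', None],
}
# Every combination is tried once: 2 * 4 * 4 * 2 candidates
n_candidates = len(ParameterGrid(toy_grid))

# -1 marks samples that always stay in training, 0 marks the validation fold
toy_fold = [-1, -1, -1, -1, 0, 0]
splits = list(PredefinedSplit(test_fold=toy_fold).split())
train_idx, val_idx = splits[0]

print(f'{n_candidates} candidates, {len(splits)} split')
print('train indices:', train_idx, 'validation indices:', val_idx)
```

With a single predefined fold, each candidate is fitted exactly once, so the number of fits equals the number of candidates.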

In [53]:
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Define the parameter grid for SVC
param_dist = {
    'classifier__kernel': ['linear', 'sigmoid'],
    'classifier__C': [1e-2, 1e-1, 1, 10],
    'classifier__max_iter': [-1, 100, 500, 1000],
    'classifier__class_weight': ['balanced', None]
}

# Create the pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classifier', SVC(random_state=seed))
])

# Build a predefined train/validation split using existing X_train/X_val
X_train_val = X_train + X_val
Y_train_val = Y_train + Y_val

# -1 indicates training samples; 0 indicates validation fold
test_fold = [-1] * len(X_train) + [0] * len(X_val)
ps = PredefinedSplit(test_fold=test_fold)

# Grid search with predefined split
search = GridSearchCV(
    pipeline,
    param_grid=param_dist,
    scoring='accuracy',
    cv=ps,
    n_jobs=-1,
    verbose=10,
)

search.fit(X_train_val, Y_train_val)
print("Best parameters:", search.best_params_)
print(f'Best accuracy: {search.best_score_:.3f}')

Fitting 1 folds for each of 64 candidates, totalling 64 fits
[CV 1/1; 1/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1
[CV 1/1; 2/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100
[CV 1/1; 3/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500
[CV 1/1; 4/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000
[CV 1/1; 5/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1
[CV 1/1; 6/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100
[CV 1/1; 7/64] START classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500
[CV 1/1; 8/64] START classifier__C=0.01, 



[CV 1/1; 2/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100;, score=0.106 total time=   5.2s
[CV 1/1; 9/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1
[CV 1/1; 6/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.106 total time=   5.3s
[CV 1/1; 10/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 10/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100;, score=0.413 total time=   5.8s
[CV 1/1; 11/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500
[CV 1/1; 7/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.106 total time=  17.0s
[CV 1/1; 12/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000
[CV 1/1; 3/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500;, score=0.106 total time=  17.2s
[CV 1/1; 13/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1




[CV 1/1; 8/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.134 total time=  25.1s
[CV 1/1; 14/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100
[CV 1/1; 4/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000;, score=0.191 total time=  27.4s
[CV 1/1; 15/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500




[CV 1/1; 11/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500;, score=0.445 total time=  20.5s
[CV 1/1; 16/64] START classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000
[CV 1/1; 14/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.452 total time=   6.5s
[CV 1/1; 17/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1




[CV 1/1; 9/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1;, score=0.339 total time=  34.8s
[CV 1/1; 18/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 1/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1;, score=0.134 total time=  42.8s
[CV 1/1; 19/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500




[CV 1/1; 12/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000;, score=0.492 total time=  27.6s
[CV 1/1; 20/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000
[CV 1/1; 18/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100;, score=0.106 total time=   5.4s
[CV 1/1; 21/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1
[CV 1/1; 5/64] END classifier__C=0.01, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.134 total time=  46.3s
[CV 1/1; 22/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100
[CV 1/1; 15/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.459 t



[CV 1/1; 22/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.106 total time=   5.1s
[CV 1/1; 24/64] START classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000




[CV 1/1; 13/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.339 total time=  39.5s
[CV 1/1; 25/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1




[CV 1/1; 19/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500;, score=0.110 total time=  17.9s
[CV 1/1; 26/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 16/64] END classifier__C=0.01, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.479 total time=  31.7s
[CV 1/1; 27/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500




[CV 1/1; 23/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.111 total time=  17.5s
[CV 1/1; 28/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000
[CV 1/1; 26/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100;, score=0.420 total time=   5.7s
[CV 1/1; 29/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1




[CV 1/1; 20/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000;, score=0.299 total time=  26.5s
[CV 1/1; 30/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100
[CV 1/1; 17/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1;, score=0.571 total time=  41.9s
[CV 1/1; 31/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500




[CV 1/1; 30/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.448 total time=   5.6s
[CV 1/1; 32/64] START classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000
[CV 1/1; 24/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.300 total time=  26.5s
[CV 1/1; 33/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1
[CV 1/1; 27/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500;, score=0.512 total time=  19.4s
[CV 1/1; 34/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 21/64] END classifier__C=0.1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.574 total time=  41.7s
[CV 1/1; 35/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500
[CV 1/1; 34/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100;, score=0.222 total time=   5.3s
[CV 1/1; 36/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000




[CV 1/1; 31/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.515 total time=  19.2s
[CV 1/1; 37/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1
[CV 1/1; 25/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1;, score=0.469 total time=  37.0s
[CV 1/1; 38/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100
[CV 1/1; 28/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000;, score=0.530 total time=  30.8s
[CV 1/1; 39/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500




[CV 1/1; 38/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.189 total time=   5.3s
[CV 1/1; 40/64] START classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000




[CV 1/1; 29/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.470 total time=  35.9s
[CV 1/1; 41/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1
[CV 1/1; 35/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500;, score=0.521 total time=  17.3s
[CV 1/1; 42/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 32/64] END classifier__C=0.1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.528 total time=  30.3s
[CV 1/1; 43/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500




[CV 1/1; 33/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1;, score=0.593 total time=  31.8s
[CV 1/1; 44/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000
[CV 1/1; 42/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100;, score=0.392 total time=   5.5s
[CV 1/1; 45/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1
[CV 1/1; 36/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000;, score=0.565 total time=  25.2s
[CV 1/1; 46/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100
[CV 1/1; 39/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.521 total time=  17.3s
[CV 1/1



[CV 1/1; 46/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.457 total time=   5.7s
[CV 1/1; 48/64] START classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000




[CV 1/1; 40/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.558 total time=  24.0s
[CV 1/1; 49/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1
[CV 1/1; 37/64] END classifier__C=1, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.605 total time=  30.4s
[CV 1/1; 50/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 43/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500;, score=0.538 total time=  18.4s
[CV 1/1; 51/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500
[CV 1/1; 50/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=100;, score=0.432 total time=   4.7s
[CV 1/1; 52/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000




[CV 1/1; 47/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.558 total time=  18.4s
[CV 1/1; 53/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1
[CV 1/1; 41/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1;, score=0.573 total time=  31.2s
[CV 1/1; 54/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100




[CV 1/1; 44/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000;, score=0.578 total time=  26.3s
[CV 1/1; 55/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500




[CV 1/1; 54/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.423 total time=   4.8s
[CV 1/1; 56/64] START classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000
[CV 1/1; 45/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.592 total time=  28.4s
[CV 1/1; 57/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1
[CV 1/1; 51/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=500;, score=0.492 total time=  17.4s
[CV 1/1; 58/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100




[CV 1/1; 48/64] END classifier__C=1, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.592 total time=  27.2s




[CV 1/1; 59/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500




[CV 1/1; 58/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=100;, score=0.438 total time=   5.7s
[CV 1/1; 60/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000
[CV 1/1; 52/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=1000;, score=0.508 total time=  23.3s
[CV 1/1; 61/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1
[CV 1/1; 55/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.487 total time=  17.7s
[CV 1/1; 62/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100




[CV 1/1; 49/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=linear, classifier__max_iter=-1;, score=0.516 total time=  34.0s
[CV 1/1; 63/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500




[CV 1/1; 62/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=100;, score=0.444 total time=   4.9s
[CV 1/1; 64/64] START classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000
[CV 1/1; 53/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.502 total time=  27.0s
[CV 1/1; 56/64] END classifier__C=10, classifier__class_weight=balanced, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.498 total time=  22.9s
[CV 1/1; 59/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=500;, score=0.488 total time=  17.1s




[CV 1/1; 63/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=500;, score=0.490 total time=  13.1s




[CV 1/1; 60/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=1000;, score=0.519 total time=  21.9s
[CV 1/1; 57/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=linear, classifier__max_iter=-1;, score=0.522 total time=  34.8s
[CV 1/1; 61/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=-1;, score=0.499 total time=  24.6s
[CV 1/1; 64/64] END classifier__C=10, classifier__class_weight=None, classifier__kernel=sigmoid, classifier__max_iter=1000;, score=0.490 total time=  17.4s
Best parameters: {'classifier__C': 1, 'classifier__class_weight': 'balanced', 'classifier__kernel': 'sigmoid', 'classifier__max_iter': -1}
Best F1 score: 0.605


The best set of parameters is stored in the `search` object. We can now use it to train the SVM classifier on the training set, and see the results on the test set.

In [54]:
# Use the best pipeline found by the search from now on
abstract_clf = search.best_estimator_

# Refit on the training set, then predict on the test set
abstract_clf.fit(X_train, Y_train)
predictions_opt = abstract_clf.predict(X_test)
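Instead of reading the verbose log line by line, the search results can also be tabulated. The sketch below assumes a `GridSearchCV`-style search with a single validation split, as the `CV 1/1` lines in the log suggest. The toy data, the reduced grid, the `f1_macro` scoring and all `toy_*` names are illustrative assumptions, not the settings used in this tutorial.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the vectorized abstracts (hypothetical data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Single-step pipeline so that parameters are addressed as classifier__<name>
toy_pipeline = Pipeline([('classifier', SVC())])
toy_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'sigmoid'],
}

# One train/validation split, mirroring the "CV 1/1" lines in the log
single_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

toy_search = GridSearchCV(
    toy_pipeline, toy_grid, scoring='f1_macro', cv=single_split, refit=True
)
toy_search.fit(X_toy, y_toy)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(toy_search.cv_results_)
param_cols = [c for c in results.columns if c.startswith('param_')]
summary = results[param_cols + ['mean_test_score', 'rank_test_score']]
summary = summary.sort_values('rank_test_score')

print(summary.head())
print('Best parameters:', toy_search.best_params_)
```

With `refit=True` (the scikit-learn default), `best_estimator_` has already been refitted on all the data passed to `fit`, so an additional call to `fit` is only needed when the final model should be trained on a different set.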

It is time to make the final comparison to see which method performed best!

In [55]:
print('Baseline:')
print(f'Accuracy: {accuracy_score(Y_test, predictions_bline):.3f}')
print(classification_report(Y_test, predictions_bline, labels=[1, 2, 3, 4, 5]))
print('Confusion matrix: \n', 
      pd.DataFrame(confusion_matrix(Y_test, predictions_bline), 
                   index=true_labels, columns=pred_labels))

print('\n____________\nSupport vector machine:')
print(f'Accuracy: {accuracy_score(Y_test, predictions_svm):.3f}')
print(classification_report(Y_test, predictions_svm, labels=[1, 2, 3, 4, 5]))
print('Confusion matrix: \n', 
      pd.DataFrame(confusion_matrix(Y_test, predictions_svm), 
                   index=true_labels, columns=pred_labels))

print('\n____________\nBalanced SVM:')
print(f'Accuracy: {accuracy_score(Y_test, predictions_bal):.3f}')
print(classification_report(Y_test, predictions_bal, labels=[1, 2, 3, 4, 5]))
print('Confusion matrix: \n', 
      pd.DataFrame(confusion_matrix(Y_test, predictions_bal), 
                   index=true_labels, columns=pred_labels))

print('\n____________\nOptimized SVM:')
print(f'Accuracy: {accuracy_score(Y_test, predictions_opt):.3f}')
print(classification_report(Y_test, predictions_opt, labels=[1, 2, 3, 4, 5]))
print('Confusion matrix: \n', 
      pd.DataFrame(confusion_matrix(Y_test, predictions_opt), 
                   index=true_labels, columns=pred_labels))

Baseline:
Accuracy: 0.500
              precision    recall  f1-score   support

           1       0.63      0.63      0.63      1012
           2       0.40      0.35      0.38       462
           3       0.49      0.42      0.45       681
           4       0.62      0.53      0.57      1091
           5       0.39      0.46      0.42      1519

    accuracy                           0.50      4765
   macro avg       0.51      0.48      0.49      4765
weighted avg       0.51      0.50      0.50      4765

Confusion matrix: 
         Predicted 1  Predicted 2  Predicted 3  Predicted 4  Predicted 5
True 1          642           48           47           36          239
True 2           56          164           17           20          205
True 3           76           10          288           49          258
True 4           31           18           51          582          409
True 5          213          167          184          250          705

____________
Support vector mach

It seems that hyperparameter optimization still yielded a small improvement in the final results.
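The comparison cell repeats the same three print statements for every model. One possible way to reduce the repetition is a small helper, sketched below. The function name `print_report` and the toy labels are hypothetical and only serve to make the example runnable on its own.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def print_report(name, y_true, y_pred, labels):
    """Print accuracy, per-class metrics and a labeled confusion matrix."""
    matrix = pd.DataFrame(
        confusion_matrix(y_true, y_pred, labels=labels),
        index=[f'True {label}' for label in labels],
        columns=[f'Predicted {label}' for label in labels],
    )
    print(f'\n____________\n{name}:')
    print(f'Accuracy: {accuracy_score(y_true, y_pred):.3f}')
    # zero_division=0 avoids warnings for classes that are never predicted
    print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
    print('Confusion matrix: \n', matrix)
    return matrix


# Toy labels (hypothetical), using the same class ids 1-5 as the tutorial
toy_true = [1, 2, 3, 4, 5, 5, 1, 2]
toy_pred = [1, 2, 3, 5, 5, 5, 1, 3]
toy_matrix = print_report('Toy model', toy_true, toy_pred, labels=[1, 2, 3, 4, 5])
```

In the notebook, the same helper could be called once per model, for example with `Y_test` and `predictions_opt`.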

Reminder for the classes:

- 1 = Neoplasms (tumor)
- 2 = Digestive system diseases
- 3 = Nervous system diseases
- 4 = Cardiovascular diseases
- 5 = General pathological conditions

# Conclusion

From the confusion matrices, we can draw some **conclusions on the results**:

- Neoplasms and cardiovascular diseases (1 and 4) are the easiest to classify, and likely have the most distinctive vocabulary.

- Many texts are classified as general pathological conditions (5). It is a frequent class, but also a logical option for uncertain cases.

- Digestive and nervous system diseases (2 and 3) are less frequent and provide the worst results. However, balancing the class weights resolved this issue as seen in the results.
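To make the link between a confusion matrix and these observations concrete, the sketch below recomputes per-class recall and precision from the baseline confusion matrix printed above. The numbers are copied from that output; the variable names are illustrative.

```python
import numpy as np

# Baseline confusion matrix from the output above
# (rows = true class 1-5, columns = predicted class 1-5)
baseline_cm = np.array([
    [642,  48,  47,  36, 239],
    [ 56, 164,  17,  20, 205],
    [ 76,  10, 288,  49, 258],
    [ 31,  18,  51, 582, 409],
    [213, 167, 184, 250, 705],
])

correct = np.diag(baseline_cm)

# Recall: correct predictions divided by the number of true samples (row sums)
recall = correct / baseline_cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions (column sums)
precision = correct / baseline_cm.sum(axis=0)

# Overall accuracy: all correct predictions divided by all samples
accuracy = correct.sum() / baseline_cm.sum()

# Fraction of all abstracts that the model assigned to each class
predicted_share = baseline_cm.sum(axis=0) / baseline_cm.sum()

for label in range(5):
    print(f'Class {label + 1}: recall {recall[label]:.2f}, '
          f'precision {precision[label]:.2f}, '
          f'share of predictions {predicted_share[label]:.2f}')
print(f'Accuracy: {accuracy:.3f}')
```

The values agree with the baseline classification report, and the last column of the matrix shows how often abstracts from every other class end up in class 5.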

**General conclusions**:

- You should now have learned the basics of **multiclass classification** on **textual data**.

- Since the text needs to be converted into numbers for the machine learning models, the medical abstracts have to be **vectorized**. **Bag-of-words** is the conventional approach, although other methods, such as **embeddings**, are also used.

- **Pipelines** create an easy-to-use collection of the preprocessing and classification steps. Pipelines will be used in all upcoming tutorials.

- **Class weights** allow accounting for **bias** (class imbalance) in the data, whether binary or multiclass. Giving a higher weight to the less frequent classes helps the model learn features for every class. Always check the **class frequencies** in your data!

- When several hyperparameters need to be tuned, consider using **hyperparameter optimization**. In this case, one needs to be careful to avoid **overfitting** to the validation data.
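As a small illustration of the class-weight point, the sketch below shows how scikit-learn's `'balanced'` setting turns class frequencies into weights: `n_samples / (n_classes * class_count)`. The class counts are the test-set supports from the classification report above, used here only as a stand-in, since the weights in the tutorial are derived from the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.array([1, 2, 3, 4, 5])

# Test-set supports from the classification report (stand-in for training counts)
counts = np.array([1012, 462, 681, 1091, 1519])

# Rebuild a label vector with those frequencies
y_demo = np.repeat(classes, counts)

# Weights as computed by class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

# The same formula written out by hand
manual = len(y_demo) / (len(classes) * counts)

for label, weight in zip(classes, weights):
    print(f'Class {label}: weight {weight:.2f}')
```

The least frequent class (2) receives the largest weight and the most frequent class (5) the smallest, so errors on rare classes cost more during training.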