In [1]:
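import random

# np and pd are used in later cells; import them explicitly instead of relying on a star import
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE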

### Import utils functions

In [None]:
from utils import extract_all_sentences, clean_text
from utils_models import *
from finetuning_models import tfidf_tuner, genetic_algorithm_xgb_with_tfidf, genetic_algorithm_rf_with_tfidf


### Extract all sentences for each patient. The output, all_sentences, is a 2D list (one list of sentences per patient).

In [3]:
train_cc = "ADReSS-IS2020-data/train/transcription/cc"
train_cd = "ADReSS-IS2020-data/train/transcription/cd"
test = "ADReSS-IS2020-data-test/test/transcription"
all_sentences_cc = extract_all_sentences(train_cc)
all_sentences_cd = extract_all_sentences(train_cd)
all_sentences_test = extract_all_sentences(test)

### Apply the cleaning step to all_sentences for both the training and test datasets. The output is a 2D list.

In [4]:
random.seed(42)
np.random.seed(42)
cleaned_healthy_speech = [
    [clean_text(sentence) for sentence in sentence_list]
    for sentence_list in all_sentences_cc
]

cleaned_dementia_speech = [
    [clean_text(sentence) for sentence in sentence_list]
    for sentence_list in all_sentences_cd
]

cleaned_test_speech = [
    [clean_text(sentence) for sentence in sentence_list]
    for sentence_list in all_sentences_test
]

### Combine CC and CD to make training dataset

In [5]:
cleaned_train_speech = cleaned_healthy_speech + cleaned_dementia_speech

### Join the sentences into a single string for each patient

In [6]:
# TfidfVectorizer expects one string per document, so each patient's sentences are joined
clean_texts_train = [" ".join(sentences) for sentences in cleaned_train_speech]

### Initialize the TF-IDF vectorizer

In [7]:
random.seed(42)
np.random.seed(42)
# Initialize with arbitrary starting parameters (they are tuned later)
tfidf_vectorizer = TfidfVectorizer(
    max_features=200,
    ngram_range=(1, 3),
    stop_words="english",
)
X_train = tfidf_vectorizer.fit_transform(clean_texts_train)
# Create labels for the training data: 0 = control (cc), 1 = dementia (cd), 54 patients each
y_train = [0]*54 + [1]*54

# Test dataset
clean_texts_test = [" ".join(sentences) for sentences in cleaned_test_speech]
X_test = tfidf_vectorizer.transform(clean_texts_test)

test_data = pd.read_csv("ADReSS-IS2020-data-test/test/test_labels.txt", delimiter=";")
# Extract test labels
y_test = test_data["Label "]
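
To make the fit/transform split above concrete, here is a minimal pure-Python sketch of what a TF-IDF vectorizer computes. It is an illustration only, not scikit-learn's implementation: it handles unigrams, ignores `max_features` and stop words, and the helper names `fit_idf` and `tfidf_vector` and the toy sentences are assumptions. It does use the smoothed IDF formula that scikit-learn applies by default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Turn one document into an L2-normalised TF-IDF dict using a fitted IDF table."""
    # Terms that were never seen during fitting are dropped
    counts = Counter(tok for tok in doc.split() if tok in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Toy stand-ins for clean_texts_train / clean_texts_test (hypothetical data)
toy_train = ["boy take cookie", "mother wash dish", "boy fall stool"]
toy_test = ["girl take cookie jar"]

idf = fit_idf(toy_train)
test_vec = tfidf_vector(toy_test[0], idf)
print(test_vec)
```

The test document is encoded with the vocabulary and IDF weights learned from the training set, which is why the notebook calls `fit_transform` on the training texts and only `transform` on the test texts.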

### Visualization using t-SNE

In [8]:
tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='random')
X_tsne = tsne.fit_transform(X_train)
plot_tsne(X_tsne, y_train)

**The visualization above shows clearer class separation than the one obtained with handcrafted features. Because the clusters are more distinct, we expect classifiers trained on these TF-IDF features to outperform those trained on handcrafted features.**
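
A t-SNE layout depends on the perplexity and the random seed, so a simple numeric check can complement the visual impression. The sketch below is one possible check, not part of the original analysis: the helper `separation_ratio` and the toy points are assumptions. It divides the mean distance between points of different classes by the mean distance between points of the same class; values well above 1 suggest separated clusters.

```python
import math
from itertools import combinations


def separation_ratio(points, labels):
    """Mean between-class distance divided by mean within-class distance."""
    within, between = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        (within if label_p == label_q else between).append(math.dist(p, q))
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Toy 2D embedding standing in for X_tsne (hypothetical data)
toy_embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
toy_labels = [0, 0, 1, 1]

ratio = separation_ratio(toy_embedding, toy_labels)
print(f"separation ratio: {ratio:.2f}")
```

In the notebook this could be called as `separation_ratio(X_tsne.tolist(), y_train)`, assuming `X_tsne` is the NumPy array returned by `TSNE.fit_transform`.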

### Initialize all models with default parameters (LogisticRegression, SVM, XGB, RandomForest, Voting)

In [9]:
classifiers = all_models()

### Conduct cross validation for each model

In [10]:
cross_metrics = get_crossvalidation_metrics(classifiers, X_train, y_train)

In [20]:
plot_metrics_table(cross_metrics, title="Evaluation metrics for cross validation")
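
Because `y_train` lists all 54 controls first and all 54 dementia cases second, the way folds are built matters: contiguous, unshuffled folds would contain mostly one class. How `get_crossvalidation_metrics` splits the data is not shown here, so the following is only a sketch of a stratified, shuffled 5-fold assignment, with a hypothetical helper `stratified_folds`.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Assign each sample index to a fold so class proportions stay balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share of each class
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


labels = [0] * 54 + [1] * 54  # same layout as y_train
folds = stratified_folds(labels)
for fold_id, test_idx in enumerate(folds):
    n_dementia = sum(labels[i] for i in test_idx)
    print(f"fold {fold_id}: {len(test_idx)} samples, {n_dementia} dementia")
```

Each fold serves once as the validation set while the remaining folds are used for training.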

### Apply models on test data

In [21]:
test_metrics = get_model_metrics(classifiers, X_train, y_train, X_test, y_test)

In [22]:
plot_metrics_table(test_metrics, title="Evaluation metrics for test data")

In [23]:
# Plot confusion matrix and roc-auc curve for each model
plot_confusion_matrices_with_roc(classifiers, X_train, y_train, X_test, y_test, title="Confusion matrices & ROC Curves for test data")

**So far, all models except the XGB classifier perform better than their counterparts trained on handcrafted features; the XGB classifier's performance remains largely unchanged. Among the models, the SVM appears to be the most robust, with consistently high scores on both cross-validation and test data. Notably, the Voting Classifier produces only 2 false negatives, which is especially important given the higher risk associated with false negatives in the dementia classification task.**
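
The emphasis on false negatives can be expressed through sensitivity (recall on the dementia class). The sketch below shows how that number follows from the confusion-matrix counts. The helper names and the toy labels are assumptions for illustration and do not reproduce the notebook's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts


def sensitivity(counts):
    """Recall on the positive (dementia) class: TP / (TP + FN)."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


# Made-up labels for illustration only; these are not the notebook's results
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0]

counts = confusion_counts(toy_true, toy_pred)
print(counts, sensitivity(counts))
```

Every false negative is a dementia case classified as healthy, so with a fixed number of positive cases, fewer false negatives directly raises sensitivity.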

# Fine-tuning


**Fine-tuning uses Grid Search for the Logistic Regression and SVM models and a Genetic Algorithm for XGBoost and Random Forest, since the latter is generally faster than an exhaustive Grid Search over the larger hyperparameter spaces of these models. Each tuning run optimizes the TF-IDF parameters together with the model's hyperparameters. After each run, we assume that the optimized TF-IDF parameters are also suitable for the other models, so we reuse the same TF-IDF settings across all models for cross-validation and for evaluation on the test set. This gives four separate TF-IDF tuning sessions, which increases our chances of finding the best configuration. Because hyperparameter tuning is time-consuming, we comment out the tuning line and keep only the saved results in the following cells. The tfidf_tuner function uses a fixed random seed, so the results are reproducible if the supervisor chooses to rerun the code.**

## XGB Classifier

In [None]:
random.seed(42)
np.random.seed(42)

# genetic_algorithm_xgb_with_tfidf(clean_texts_train, y_train, number_of_population=20)

gen	nevals	avg     	min     	max    
0  	20    	0.758333	0.665801	0.82381
1  	13    	0.741905	0       	0.82381
2  	17    	0.678442	0       	0.8329 
3  	18    	0.818203	0.776623	0.8329 
4  	16    	0.819416	0.78658 	0.8329 
5  	15    	0.819048	0.787013	0.8329 
6  	16    	0.82368 	0.787013	0.8329 
7  	14    	0.827403	0.796537	0.8329 
8  	16    	0.826883	0.814286	0.8329 
9  	16    	0.825909	0.758874	0.841558
10 	16    	0.825   	0.787013	0.841558


{'n_estimators': 377,
 'learning_rate': 0.03428798938153844,
 'subsample': 0.9118142984115694,
 'colsample_bytree': 0.6175898783305428,
 'reg_alpha': 1.3953792852514386,
 'tfidf__max_features': 1779,
 'tfidf__ngram_range': (1, 5),
 'tfidf__stop_words': 'english'}
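
For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of a genetic algorithm over a mixed TF-IDF and model search space. It is not the implementation behind `genetic_algorithm_xgb_with_tfidf`: the search bounds, the `toy_fitness` placeholder (which stands in for a cross-validated score) and the use of elitism are all assumptions made for illustration.

```python
import random

# Hypothetical bounds; the real search space of the notebook's function is not shown
SEARCH_SPACE = {
    "n_estimators": (50, 500),
    "learning_rate": (0.01, 0.3),
    "tfidf__max_features": (100, 2000),
}


def random_individual(rng):
    # Values are kept as floats here; integer parameters would be rounded before use
    return {name: rng.uniform(low, high) for name, (low, high) in SEARCH_SPACE.items()}


def toy_fitness(ind):
    """Placeholder for a cross-validated score; peaks at the centre of each range."""
    score = 0.0
    for name, (low, high) in SEARCH_SPACE.items():
        centre = (low + high) / 2
        score -= abs(ind[name] - centre) / (high - low)
    return score


def evolve(pop_size=20, n_generations=10, mutation_rate=0.2, seed=42):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = [max(toy_fitness(ind) for ind in population)]
    for _ in range(n_generations):
        # Elitism (a choice made in this sketch): carry the best individual over unchanged
        children = [max(population, key=toy_fitness)]
        while len(children) < pop_size:
            # Tournament selection: the best of three random individuals, done twice
            parent_a = max(rng.sample(population, 3), key=toy_fitness)
            parent_b = max(rng.sample(population, 3), key=toy_fitness)
            # Uniform crossover: each gene comes from either parent
            child = {k: rng.choice((parent_a[k], parent_b[k])) for k in SEARCH_SPACE}
            # Mutation: occasionally resample a gene within its bounds
            for name, (low, high) in SEARCH_SPACE.items():
                if rng.random() < mutation_rate:
                    child[name] = rng.uniform(low, high)
            children.append(child)
        population = children
        history.append(max(toy_fitness(ind) for ind in population))
    return max(population, key=toy_fitness), history


best, history = evolve()
print(best)
print([round(score, 3) for score in history])
```

In a real run, `toy_fitness` would be replaced by a function that builds the TF-IDF features and the classifier from the individual's values and returns a cross-validation score.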

In [24]:
best_tfidf_xgb = {'n_estimators': 377,
 'learning_rate': 0.03428798938153844,
 'subsample': 0.9118142984115694,
 'colsample_bytree': 0.6175898783305428,
 'reg_alpha': 1.3953792852514386,
 'tfidf__max_features': 1779,
 'tfidf__ngram_range': (1, 5),
 'tfidf__stop_words': 'english'}
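
The saved dictionary mixes TF-IDF settings (keys prefixed with `tfidf__`) and classifier settings, and the following cells retype the values by hand. As an alternative, a small helper can split the dictionary by prefix. This is a sketch with a hypothetical helper `split_pipeline_params`; it applies to the flat dictionaries (XGB, logistic regression, SVM) and not to the nested random-forest result.

```python
def split_pipeline_params(best_params, tfidf_prefix="tfidf__"):
    """Separate TF-IDF settings from classifier settings in a flat parameter dict."""
    tfidf_params, model_params = {}, {}
    for key, value in best_params.items():
        if key.startswith(tfidf_prefix):
            tfidf_params[key[len(tfidf_prefix):]] = value
        else:
            # Drop an optional "<step>__" prefix such as "logreg__" or "svc__"
            model_params[key.split("__", 1)[-1]] = value
    return tfidf_params, model_params


best_tfidf_xgb = {
    "n_estimators": 377,
    "learning_rate": 0.03428798938153844,
    "subsample": 0.9118142984115694,
    "colsample_bytree": 0.6175898783305428,
    "reg_alpha": 1.3953792852514386,
    "tfidf__max_features": 1779,
    "tfidf__ngram_range": (1, 5),
    "tfidf__stop_words": "english",
}

tfidf_params, xgb_params = split_pipeline_params(best_tfidf_xgb)
print(tfidf_params)
print(xgb_params)
```

The two results could then be passed on as `TfidfVectorizer(**tfidf_params)` and `all_models(xgb_params=xgb_params)`, which avoids copy errors when the tuned values change.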

In [25]:
random.seed(42)
np.random.seed(42)

tfidf_vectorizer = TfidfVectorizer(
    max_features=1779,
    ngram_range=(1, 5),
    stop_words="english",
)
X_train = tfidf_vectorizer.fit_transform(clean_texts_train)
# create labels for train data
y_train = [0]*54 + [1]*54


clean_texts_test = [" ".join(sentences) for sentences in cleaned_test_speech]
X_test = tfidf_vectorizer.transform(clean_texts_test)

test_data = pd.read_csv("ADReSS-IS2020-data-test/test/test_labels.txt", delimiter=";")
# Extract test labels
y_test = test_data["Label "]

In [26]:
classifiers_tuned_xgb = all_models(xgb_params={'n_estimators': 377,
 'learning_rate': 0.03428798938153844,
 'subsample': 0.9118142984115694,
 'colsample_bytree': 0.6175898783305428,
 'reg_alpha': 1.3953792852514386})

### cross validation

In [27]:
metrics_cross_tuned_xgb = get_crossvalidation_metrics(classifiers_tuned_xgb, X_train, y_train)
plot_metrics_table(metrics_cross_tuned_xgb, title="Evaluation Metrics on cross validation after fine tuning")

### test data

In [28]:
metrics_test_tuned_xgb = get_model_metrics(classifiers_tuned_xgb, X_train, y_train, X_test, y_test)
plot_metrics_table(metrics_test_tuned_xgb, title="Evaluation Metrics on test data after fine tuning")

## Random forest

In [None]:
random.seed(42)
np.random.seed(42)
# genetic_algorithm_rf_with_tfidf(clean_texts_train, y_train, number_of_population=20)

gen	nevals	avg     	min     	max     
0  	20    	0.809264	0.758442	0.860173
1  	14    	0.826775	0.786147	0.860173
2  	15    	0.827987	0.804329	0.860173
3  	20    	0.829805	0.803896	0.860173
4  	18    	0.830649	0.794372	0.85974 
5  	20    	0.833398	0.803896	0.860173
6  	14    	0.838918	0.814719	0.869264
7  	17    	0.836602	0.804329	0.869264
8  	16    	0.83303 	0.804329	0.869264
9  	18    	0.831061	0.794372	0.85974 
10 	15    	0.838355	0.795238	0.860173


{'rf_params': {'n_estimators': 173,
  'max_depth': 10,
  'min_samples_split': 6,
  'min_samples_leaf': 5},
 'tfidf_params': {'max_features': 846,
  'ngram_range': (1, 3),
  'stop_words': 'english'}}

In [29]:
best_tfidf_rf = {'rf_params': {'n_estimators': 173,
  'max_depth': 10,
  'min_samples_split': 6,
  'min_samples_leaf': 5},
 'tfidf_params': {'max_features': 846,
  'ngram_range': (1, 3),
  'stop_words': 'english'}}

In [30]:
random.seed(42)
np.random.seed(42)

tfidf_vectorizer = TfidfVectorizer(
    max_features=846,
    ngram_range=(1, 3),
    stop_words="english",
)
X_train = tfidf_vectorizer.fit_transform(clean_texts_train)
# create labels for train data
y_train = [0]*54 + [1]*54


clean_texts_test = [" ".join(sentences) for sentences in cleaned_test_speech]
X_test = tfidf_vectorizer.transform(clean_texts_test)

test_data = pd.read_csv("ADReSS-IS2020-data-test/test/test_labels.txt", delimiter=";")
# Extract test labels
y_test = test_data["Label "]

In [31]:
classifiers_tuned_rf = all_models(rf_params={'n_estimators': 173,
  'max_depth': 10,
  'min_samples_split': 6,
  'min_samples_leaf': 5})

### cross validation

In [32]:
metrics_cross_tuned_rf = get_crossvalidation_metrics(classifiers_tuned_rf, X_train, y_train)
plot_metrics_table(metrics_cross_tuned_rf, title="Evaluation Metrics on cross validation after fine tuning")

### test data

In [33]:
metrics_test_tuned_rf = get_model_metrics(classifiers_tuned_rf, X_train, y_train, X_test, y_test)
plot_metrics_table(metrics_test_tuned_rf, title="Evaluation Metrics on test data after fine tuning")

## Logistic regression

In [None]:
best_tfidf_logreg = tfidf_tuner(clean_texts_train, y_train, model_name="logreg")

Fitting 5 folds for each of 168 candidates, totalling 840 fits
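
The message above reflects how an exhaustive grid search scales: every combination of parameter values is one candidate, and each candidate is fitted once per fold (168 × 5 = 840). The grid used inside `tfidf_tuner` is not shown, so the values below are invented purely so that the count matches; only the counting logic is the point of this sketch.

```python
from itertools import product

# Hypothetical grid, chosen only so that it yields 168 combinations
hypothetical_grid = {
    "logreg__C": [0.01, 0.1, 1, 10],
    "logreg__solver": ["liblinear", "saga"],
    "tfidf__max_features": [200, 230, 270, 300, 400, 500, 1000],
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

# One candidate per element of the Cartesian product of all value lists
candidates = [
    dict(zip(hypothetical_grid, combo))
    for combo in product(*hypothetical_grid.values())
]

n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Adding one more value to any list multiplies the number of fits, which is why grid search becomes costly for models with many hyperparameters.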


In [36]:
best_tfidf_logreg

{'logreg__C': 1,
 'logreg__solver': 'saga',
 'tfidf__max_features': 270,
 'tfidf__ngram_range': (1, 3)}

In [37]:
best_tfidf_logreg = {'logreg__C': 1,
 'logreg__solver': 'saga',
 'tfidf__max_features': 270,
 'tfidf__ngram_range': (1, 3)}

In [39]:
random.seed(42)
np.random.seed(42)

tfidf_vectorizer = TfidfVectorizer(
    max_features=270,
    ngram_range=(1, 3),
    stop_words="english",
)
X_train = tfidf_vectorizer.fit_transform(clean_texts_train)
# create labels for train data
y_train = [0]*54 + [1]*54


clean_texts_test = [" ".join(sentences) for sentences in cleaned_test_speech]
X_test = tfidf_vectorizer.transform(clean_texts_test)

test_data = pd.read_csv("ADReSS-IS2020-data-test/test/test_labels.txt", delimiter=";")
# Extract test labels
y_test = test_data["Label "]

In [40]:
classifiers_tuned_lg = all_models(lg_params={'C': 1, 'solver': 'saga'})

### cross validation

In [41]:
metrics_cross_tuned_lg = get_crossvalidation_metrics(classifiers_tuned_lg, X_train, y_train)
plot_metrics_table(metrics_cross_tuned_lg, title="Evaluation Metrics on cross validation after fine tuning")

### test data

In [42]:
metrics_test_tuned_lg = get_model_metrics(classifiers_tuned_lg, X_train, y_train, X_test, y_test)
plot_metrics_table(metrics_test_tuned_lg, title="Evaluation Metrics on test data after fine tuning")

## SVM

In [43]:
best_tfidf_svc = tfidf_tuner(clean_texts_train, y_train, model_name="svc")

Fitting 5 folds for each of 42 candidates, totalling 210 fits


In [44]:
best_tfidf_svc

{'svc__kernel': 'poly',
 'tfidf__max_features': 230,
 'tfidf__ngram_range': (1, 3)}

In [45]:
best_tfidf_svc = {'svc__kernel': 'poly',
 'tfidf__max_features': 230,
 'tfidf__ngram_range': (1, 3)}

In [9]:
random.seed(42)
np.random.seed(42)

tfidf_vectorizer = TfidfVectorizer(
    max_features=230,
    ngram_range=(1, 3),
    stop_words="english",
)
X_train = tfidf_vectorizer.fit_transform(clean_texts_train)
# create labels for train data
y_train = [0]*54 + [1]*54


clean_texts_test = [" ".join(sentences) for sentences in cleaned_test_speech]
X_test = tfidf_vectorizer.transform(clean_texts_test)

test_data = pd.read_csv("ADReSS-IS2020-data-test/test/test_labels.txt", delimiter=";")
# Extract test labels
y_test = test_data["Label "]

In [10]:
classifiers_tuned_svc = all_models(svc_params={'kernel': 'poly'})

### cross validation

In [11]:
metrics_cross_tuned_svc = get_crossvalidation_metrics(classifiers_tuned_svc, X_train, y_train)
plot_metrics_table(metrics_cross_tuned_svc, title="Evaluation Metrics on cross validation after fine tuning")

### test data

In [12]:
metrics_test_tuned_svc = get_model_metrics(classifiers_tuned_svc, X_train, y_train, X_test, y_test)
plot_metrics_table(metrics_test_tuned_svc, title="Evaluation Metrics on test data after fine tuning")

**Based on the evaluation metrics and plots, fine-tuning the SVM model together with TF-IDF yielded the best results. The SVM maintained consistent scores between the cross-validation and test sets and improved significantly after tuning. Its confusion matrix shows only one false negative, which matters because false negatives are considered the riskier error in the dementia challenge. Since the fine-tuned SVM-TFIDF combination performed notably better than the others, we focus our visual analysis on this model. We plot bar charts comparing the tuned model with its default configuration, and also include the confusion matrix and ROC curve for a more comprehensive evaluation.**

In [27]:
# Plot confusion matrix and roc-auc curve for each model
plot_confusion_matrices_with_roc(classifiers_tuned_svc, X_train, y_train, X_test, y_test, title="Confusion matrices & ROC Curves for test data")

## Comparison before and after fine-tuning (fine-tuning based on SVM-TFIDF only)

In [28]:
# cross validation
plot_metrics_comparison(cross_metrics, metrics_cross_tuned_svc)

In [29]:
# test data
plot_metrics_comparison(test_metrics, metrics_test_tuned_svc)