**Contextual word substitution using BERT is applied as a data augmentation technique to the training sets. The augmented sentences are stored separately, and their semantic similarity to the original sentences is assessed using cosine similarity.**

In [1]:
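import sys
import os
import random
import numpy as np  # used below for np.random.seed

# Make the parent directory importable so utils and utils_aug can be found
sys.path.append(os.path.abspath(".."))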

### Import utility functions

In [2]:
from utils import *
from utils_aug import augment_and_save_contextual_speech, cosine_sim, save_grouped_sentences_to_file


**Extract all sentences for each patient into a list. The output of `extract_all_sentences` is a 2D list: one inner list of sentences per patient.**

In [3]:
train_cc = "../ADReSS-IS2020-data/train/transcription/cc"
train_cd = "../ADReSS-IS2020-data/train/transcription/cd"
test = "../ADReSS-IS2020-data-test/test/transcription"
all_sentences_cc = extract_all_sentences(train_cc)
all_sentences_cd = extract_all_sentences(train_cd)
all_sentences_test = extract_all_sentences(test)

**Apply the cleaning step to every sentence in both the training and test datasets. Each output is again a 2D list.**

In [4]:
random.seed(42)
np.random.seed(42)
cleaned_healthy_speech = [
    [clean_text(sentence) for sentence in sentence_list]
    for sentence_list in all_sentences_cc
]

cleaned_dementia_speech = [
    [clean_text(sentence) for sentence in sentence_list]
    for sentence_list in all_sentences_cd
]

cleaned_test_speech = [
    [clean_text(sentence) for sentence in sentence_list]
    for sentence_list in all_sentences_test
]

### Save the cleaned datasets as txt files for later use

In [5]:
cleaned_healthy_path = 'aug_clean_txtfiles/clean_healthy.txt'
save_grouped_sentences_to_file(cleaned_healthy_speech, cleaned_healthy_path)

cleaned_dementia_path = 'aug_clean_txtfiles/clean_dementia.txt'
save_grouped_sentences_to_file(cleaned_dementia_speech, cleaned_dementia_path)

cleaned_test_path = 'aug_clean_txtfiles/clean_test.txt'
save_grouped_sentences_to_file(cleaned_test_speech, cleaned_test_path)


file saved at: aug_clean_txtfiles/clean_healthy.txt
file saved at: aug_clean_txtfiles/clean_dementia.txt
file saved at: aug_clean_txtfiles/clean_test.txt
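
The on-disk layout written by `save_grouped_sentences_to_file` is not shown in this notebook. As a rough sketch only, one plausible layout is one sentence per line with a blank line between patients. The names `save_groups_sketch` and `load_groups_sketch` are hypothetical and are not part of `utils_aug`; the sketch also assumes every sentence is non-empty and contains no newline.

```python
import tempfile
from pathlib import Path


def save_groups_sketch(groups, path):
    """Write one sentence per line, with a blank line between groups.

    Hypothetical layout: the real save_grouped_sentences_to_file may differ.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    blocks = ["\n".join(group) for group in groups]
    path.write_text("\n\n".join(blocks) + "\n", encoding="utf-8")


def load_groups_sketch(path):
    """Read the layout written by save_groups_sketch back into a 2D list."""
    text = Path(path).read_text(encoding="utf-8")
    return [block.split("\n") for block in text.strip().split("\n\n") if block]


# Round-trip check in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "aug_clean_txtfiles" / "clean_demo.txt"
    demo_groups = [
        ["window open", "see another building"],
        ["dish two cup saucer sink"],
    ]
    save_groups_sketch(demo_groups, demo_path)
    restored = load_groups_sketch(demo_path)

print(restored == demo_groups)
```

Keeping the per-patient grouping in the file means the 2D structure can be rebuilt later without re-running extraction and cleaning.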


### Contextual Augmentation

In [9]:
import nlpaug.augmenter.word as naw
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import torch

In [None]:
# Augment the cleaned training datasets only (not the test set) and save them for later use.
# The seed is set inside utils_aug.py for reproducibility.
path_healthy = "aug_clean_txtfiles/cont_augmented_sentences_healthy.txt"
path_dementia = "aug_clean_txtfiles/cont_augmented_sentences_dementia.txt"
# The two calls below take about 3 minutes in total on CPU. The output files are already saved in the aug_clean_txtfiles folder.
# Set save=True to write the files again.
augmented_healthy_speech_cont = augment_and_save_contextual_speech(cleaned_healthy_speech, path_healthy, save=False)
augmented_dementia_speech_cont = augment_and_save_contextual_speech(cleaned_dementia_speech, path_dementia, save=False)
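
The body of `augment_and_save_contextual_speech` lives in `utils_aug.py` and is not shown here. A minimal sketch of what BERT-based contextual substitution over a 2D sentence list could look like with `nlpaug` follows. The helper names `make_bert_substituter` and `augment_groups`, the `bert-base-uncased` checkpoint, and `aug_p=0.3` are illustrative assumptions, not the settings used in this project.

```python
from typing import Callable, List


def make_bert_substituter(aug_p: float = 0.3) -> Callable[[str], str]:
    """Build a sentence-level augmenter backed by nlpaug's BERT substitution.

    Assumption: the checkpoint and aug_p are illustrative choices only.
    """
    # Imported lazily so augment_groups stays usable without nlpaug installed.
    import nlpaug.augmenter.word as naw

    aug = naw.ContextualWordEmbsAug(
        model_path="bert-base-uncased", action="substitute", aug_p=aug_p
    )

    def augment_one(sentence: str) -> str:
        result = aug.augment(sentence)
        # Newer nlpaug releases return a list, older ones a plain string.
        if isinstance(result, list):
            return result[0] if result else sentence
        return result

    return augment_one


def augment_groups(
    groups: List[List[str]], augment_one: Callable[[str], str]
) -> List[List[str]]:
    """Apply augment_one to every sentence, keeping the per-patient grouping."""
    return [[augment_one(sentence) for sentence in group] for group in groups]


# Example with a stand-in augmenter (no model download needed):
demo = augment_groups([["window open"], ["see another building"]], str.upper)
print(demo)
```

With the real model, the call would be `augment_groups(cleaned_healthy_speech, make_bert_substituter())`. Because the grouping is preserved, flattening the clean and augmented lists yields sentences in matching order for pairwise comparison.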

### Check the Similarity of Augmented and Original Data

Cosine similarity measures how similar two non-zero vectors are. For text, each piece of text is first converted into a vector using a method such as TF-IDF or word embeddings. Cosine similarity is then the **cosine of the angle** between the two vectors:

**Cosine Similarity = (A · B) / (||A|| × ||B||)**

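A cosine similarity of 1, or close to 1, means the two vectors point in almost the same direction, so the texts are very similar under the chosen representation. A value of 0 means the vectors are orthogonal (with TF-IDF, the texts share no terms), while -1 means the vectors point in opposite directions. Negative values cannot occur with non-negative representations such as TF-IDF and are rare in text comparisons generally. Note that with TF-IDF the score reflects word overlap rather than meaning, whereas embedding-based vectors capture semantic similarity more directly.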
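
To make the formula concrete, here is a small self-contained sketch that applies it to raw term-count vectors. The function name `cosine_sim_counts` is hypothetical; it is a simplified stand-in and not the `cosine_sim` helper from `utils_aug`, so its scores will differ from those that helper returns.

```python
import math
from collections import Counter


def cosine_sim_counts(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts represented as term-count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product A · B: only terms present in both texts contribute.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    # Euclidean norms ||A|| and ||B||.
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Undefined for zero vectors; returning 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_sim_counts("window open", "window open"))  # identical texts, close to 1
print(cosine_sim_counts("window open", "i open"))       # one of two terms shared
print(cosine_sim_counts("window open", "dish sink"))    # no shared terms
```

Since term counts are never negative, this version can only return values between 0 and 1.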




In [7]:
flat_clean_healthy = [s for group in cleaned_healthy_speech for s in group]
flat_aug_healthy = [s for group in augmented_healthy_speech_cont for s in group]

In [8]:
# Measure similarity for the first few healthy sentence pairs
show = 6  # number of clean/augmented pairs to print
for clean, aug in zip(flat_clean_healthy[:show], flat_aug_healthy[:show]):
    sim = cosine_sim(clean, aug)
    print(f"Clean: {clean}\nAugmented: {aug}\nCosine Similarity: {sim:.4f}\n{'-'*50}")

Clean: well mother stand wash dish
Augmented: the mother stand wash ready
Cosine Similarity: 0.8706
--------------------------------------------------
Clean: window open
Augmented: i open
Cosine Similarity: 0.6983
--------------------------------------------------
Clean: outside window walk c curve walk
Augmented: the window walk c curve within
Cosine Similarity: 0.8855
--------------------------------------------------
Clean: see another building
Augmented: see another would
Cosine Similarity: 0.7683
--------------------------------------------------
Clean: look like garage something curtain grass
Augmented: sounds like garage something curtain over
Cosine Similarity: 0.9549
--------------------------------------------------
Clean: dish two cup saucer sink
Augmented: coffee in cup saucer sink
Cosine Similarity: 0.8051
--------------------------------------------------


