# V0: Mental State Classification using Linear-Kernel and RBF-Kernel SVM

We use the one-vs-rest approach: for each psychological symptom, we train an SVM that predicts that symptom (+1) versus all other cases (-1). For example, one SVM predicts whether a client's text would be labeled "anxiety" or "non-anxiety". Repeating this for every symptom yields one binary SVM classifier per symptom.
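As a minimal sketch of this setup, assuming a toy label matrix with hypothetical symptom names (not the real data), each column of a {0, 1} multi-label matrix can be turned into its own {-1, +1} target vector:

```python
import numpy as np

# Toy multi-label matrix: rows are sessions, columns are symptoms (hypothetical values).
symptom_names = ["anxiety", "anger", "sadness"]
toy_labels = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
])

def one_vs_rest_targets(labels, names):
    """Return {symptom: vector of +1 (has symptom) / -1 (any other case)}."""
    targets = {}
    for col, name in enumerate(names):
        # np.where builds a new array, so the input matrix is left untouched
        targets[name] = np.where(labels[:, col] == 1, 1, -1)
    return targets

targets = one_vs_rest_targets(toy_labels, symptom_names)
for name, y in targets.items():
    print(name, y)
```

Each vector in `targets` could then serve as the `y` for one binary classifier.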

In [13]:
import os
import sys
import json
import numpy as np
import matplotlib.pyplot as plt
import collections

from string import punctuation
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

# set working directory
os.chdir("/Users/junweisun/Documents/CS 229/final_project/Psychological-Therapeutic-Chatbot")

# set random seed for result reproducibility
np.random.seed(1234)

## Part I: Feature Extraction

The goal of this part is to form a feature matrix of shape (n, d) from all client texts in **Counseling and Psychotherapy Transcripts, Volume I**.

- **n**: the number of counseling/psychotherapy sessions. We treat each session as a single data point because a client's mental state may change from one session to the next.
- **d**: the number of unique words across the entire volume.
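
To make the (n, d) shape concrete, here is a small sketch on two made-up sentences (hypothetical text, not taken from the transcripts). It follows the same idea as the notebook's functions: strip punctuation, lowercase, index each unique word, then count occurrences per session.

```python
from collections import Counter
from string import punctuation

# Two toy "sessions" (hypothetical text).
toy_sessions = ["I feel tired. I feel stuck!", "Stuck again, tired again."]

def tokenize(text):
    """Replace punctuation with spaces, lowercase, and split on whitespace."""
    for c in punctuation:
        text = text.replace(c, " ")
    return text.lower().split()

# Vocabulary: each unique word gets a column index in order of first appearance.
vocab = {}
for text in toy_sessions:
    for word in tokenize(text):
        vocab.setdefault(word, len(vocab))

# Count matrix of shape (n, d): entry [i][j] is how often word j occurs in session i.
matrix = [[0] * len(vocab) for _ in toy_sessions]
for i, text in enumerate(toy_sessions):
    for word, freq in Counter(tokenize(text)).items():
        matrix[i][vocab[word]] = freq

print(vocab)
print(matrix)
```

Here n is 2 and d is 5; on the real data d grows to the size of the full vocabulary, so most entries in each row are zero.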

In [14]:
# load the JSON file as a dictionary
with open('processed/meta.json') as f:
    data = json.load(f)

# there are 1715 sessions in the training data
len(data)

1715

In [15]:
# observe a single counseling session; each session is itself stored as a Python dictionary
data['2086467']

# we want to focus on two inputs:
# -- Client_Text: this will be our training data input
# -- Psyc_Subjects: this will be our training data label

{'file_name': '1004967238.txt',
 'Abstract': "Client is frustrated with his family members; he feels like there is a lack of communication, and he isn't being heard. He expects financial assistance with no strings attached, and he is frustrated that he can't get that from his parents.",
 'Client_Age': '21-30 years',
 'Client_Gender': 'Male',
 'Client_Marital_Status': 'Engaged',
 'Previous_Session_ID': nan,
 'Next_Session_ID': nan,
 'Client_Sexual_Orientation': 'Heterosexual',
 'CTIV_category': 'Family and relationships',
 'Psyc_Subjects': 'Frustration; Depressive disorder',
 'Symptoms': nan,
 'Therapies': 'Psychotherapy',
 'Therapist': 'Tamara Feldman (1972)',
 'Race_of_Therapist': 'White',
 'Race_of_Client': 'White',
 'Client_Text': [" So I just want to go back to last week, and just you know, continue talking about, you know the frustrating part about, you know me not being heard.  And my feelings just kind of being on the outside of where, you know, every time I bring up my concerns

In [16]:
# There are 3 sessions with 'Client_Text' that cannot be processed, eliminate these from the input data
count = 0
session_to_delete = []

for session in data:
    if 'Client_Text' in data[session].keys():
        count += 1
    else:
        session_to_delete.append(session)

print(session_to_delete)

for session in session_to_delete:
    del data[session]

print(f'There are {count} counseling sessions with proper client texts.')
print(f'The final input data has {len(data)} sessions within.')

['2099872', '2476112', '2476114']
There are 1712 counseling sessions with proper client texts.
The final input data has 1712 sessions within.


In [18]:
# Take a single therapy session, remove punctuation from the client texts, and output a list of whitespace-separated words.
# Attention: this list of words is NOT unique!
#
def process_session(session_data):
    """
    Inputs:
        -- session_data: a dictionary representing all data from a single counseling/psychotherapy session (i.e. one training data point)
    
    Return:
        -- word_list: a list of words, with punctuation removed, from the client text of this particular therapy session
    """
    
    word_list = []

    num_conversation = len(session_data['Client_Text']) # how many lines has the client spoken
    for i in range(num_conversation): # loop over each line
        client_text = session_data['Client_Text'][i]
        # eliminate punctuations
        for c in punctuation:
            client_text = client_text.replace(c, ' ')
        client_text = client_text.lower().split()
        word_list += client_text
    
    return word_list


# Assemble a dictionary of unique words from client texts across ALL therapy sessions
#
def assemble_dictionary(json_obj):
    """
    Inputs:
        -- json_obj: a processed json file in the form of a dictionary
    
    Return:
        -- word_dict: a dictionary of words assembled from all client texts across ALL sessions
                this dictionary has (key, value) pairs representing (word, index):
                    -- word: a unique word
                    -- index: keeps track of how many unique words there are in the dictionary, NOT word frequency
    """

    word_dict = {}
    idx = 0

    for session in json_obj:
        # punctuation-eliminated words from client
        words = process_session(json_obj[session])
        for word in words:
            if word not in word_dict:
                word_dict[word] = idx
                idx += 1
    
    return word_dict


# Create a feature matrix that will be used as input to the training algorithm.
#
def generate_feature_matrix(json_obj, word_dict):
    """
    Inputs:
        -- json_obj: a processed json file in the form of a dictionary
        -- word_dict: the dictionary of unique words outputted by assemble_dictionary()
    
    Return:
        -- feature_matrix: a 2D numpy array of shape (n, d)
                Each row is a feature vector of word frequencies, indicating how many times each dictionary word appears in a session:
                    -- n: each row represents a single therapy session
                    -- d: each column represents a unique word from the 'Client_Text' section of the json file
    """

    n = len(json_obj)
    d = len(word_dict)
    feature_matrix = np.zeros((n, d))

    data_point = 0
    for session in json_obj:
        words = process_session(json_obj[session]) # a list of non-unique words from a single session
        counter = collections.Counter(words) # a dictionary of unique words with frequency for the session
        for word, frequency in counter.items():
            if word in word_dict.keys():
                value = word_dict[word]
                feature_matrix[data_point][value] += frequency
        data_point += 1
    
    return feature_matrix

We now generate the feature matrix that will be used as input to train the SVM classifier, as well as the unique words dictionary.

In [19]:
unique_words_dictionary = assemble_dictionary(data)
feature_matrix = generate_feature_matrix(data, unique_words_dictionary)

print(f'The dimension of the feature matrix is {feature_matrix.shape}')
print(f'A single feature vector inside the feature matrix looks like:\n {feature_matrix[0]}')

The dimension of the feature matrix is (1712, 34641)
A single feature vector inside the feature matrix looks like:
 [35. 95.  7. ...  0.  0.  0.]


In [20]:
np.savetxt("../feature_matrix.txt", feature_matrix)

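# serialize the dictionary to a JSON string
# (named json_str so the imported json module is not shadowed; json.load is needed again later)
json_str = json.dumps(unique_words_dictionary)
with open("unique_words_dictionary.json", "w") as f:
    f.write(json_str)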

## Part II: Import Label Matrix

The goal of this part is to load the label matrix used for supervised learning. Each row of the label matrix corresponds to a session (in the same order as the feature matrix), and each column represents a symptom. A value of 1 indicates that the symptom was diagnosed in that session; a value of 0 indicates that it was not.
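
The notebook loads `label_matrix.txt` ready-made, and its construction is not shown. As a hedged sketch of what such a multi-hot encoding could look like, assume each session carries a `;`-separated label string (like the `Psyc_Subjects` field) and that a symptom-to-column map is available. The function name `build_label_matrix` and the toy inputs are hypothetical.

```python
# Hypothetical inputs: a small symptom-to-column map and one label string per session.
toy_symptoms = {"Apathy": 0, "Frustration": 1, "Anxiety": 2}
toy_subjects = ["Frustration; Depressive disorder", "Anxiety", "Apathy; Anxiety"]

def build_label_matrix(subject_strings, symptom_to_col):
    """Return a list of rows; row[j] is 1 if the session lists symptom j, else 0."""
    rows = []
    for subjects in subject_strings:
        row = [0] * len(symptom_to_col)
        for name in subjects.split(";"):
            name = name.strip()
            # names missing from the map (e.g. diagnoses) are skipped
            if name in symptom_to_col:
                row[symptom_to_col[name]] = 1
        rows.append(row)
    return rows

toy_label_matrix = build_label_matrix(toy_subjects, toy_symptoms)
print(toy_label_matrix)
```

A session can have several 1s in its row, which is why the one-vs-rest classifiers are trained column by column.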

In [22]:
feature_matrix = np.loadtxt("../feature_matrix.txt")
feature_matrix

array([[35., 95.,  7., ...,  0.,  0.,  0.],
       [20., 93.,  2., ...,  0.,  0.,  0.],
       [28., 74.,  6., ...,  0.,  0.,  0.],
       ...,
       [13., 48.,  0., ...,  1., 10.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [23]:
label_matrix = np.loadtxt("label_matrix.txt")
label_matrix

array([[1., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [24]:
# observe the correspondence between the feature matrix and the label matrix
print(f"The shape of the feature matrix is {feature_matrix.shape}")
print(f"The (prior) shape of the label matrix is {label_matrix.shape}")

label_matrix_del = np.delete(label_matrix, [1429, 1710, 1711], 0)
print(f"The shape of the label matrix is {label_matrix_del.shape}")


The shape of the feature matrix is (1712, 34641)
The (prior) shape of the label matrix is (1715, 65)
The shape of the label matrix is (1712, 65)
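
The row indices passed to `np.delete` are hard-coded in the cell above. One possible alternative, sketched here on toy data with hypothetical session ids, derives them from the position of each dropped session in the metadata. It assumes the label rows follow the same order as the metadata keys.

```python
import numpy as np

# Toy metadata keyed by session id (hypothetical ids); one session lacks 'Client_Text'.
toy_data = {
    "s1": {"Client_Text": ["hello"]},
    "s2": {},
    "s3": {"Client_Text": ["bye"]},
}
toy_label_rows = np.array([[1, 0], [0, 1], [1, 1]])  # one row per session, same order

# Dicts preserve insertion order (Python 3.7+), so a key's position is its row index.
rows_to_drop = [i for i, session in enumerate(toy_data)
                if "Client_Text" not in toy_data[session]]
aligned_labels = np.delete(toy_label_rows, rows_to_drop, axis=0)

print(rows_to_drop, aligned_labels.shape)
```

The positions have to be collected before the sessions are deleted from the metadata dictionary, otherwise the indices no longer line up with the label rows.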


In [25]:
with open('symptom_dictionary') as f:
    symptom = json.load(f)
# observe label dictionary
symptom

{'Apathy': 0,
 'Frustration': 1,
 'Anxiety': 2,
 'Anger': 3,
 'Danger to self': 4,
 'Sadness': 5,
 'Low self-esteem': 6,
 'Crying': 7,
 'Obsessive behavior': 8,
 'Insomnia': 9,
 'Irritability': 10,
 'Fatigue': 11,
 'Hyperactivity': 12,
 'Restlessness': 13,
 'Depression (emotion)': 14,
 'Ambivalence': 15,
 'Shame': 16,
 'Suicidal ideation': 17,
 'Panic': 18,
 'Self-harm': 19,
 'Moodiness': 20,
 'Nightmares': 21,
 'Resentment': 22,
 'Compulsive behavior': 23,
 'Social inhibition': 24,
 'Confusion': 25,
 'Fearfulness': 26,
 'Guilt': 27,
 'Disorganized thoughts': 28,
 'Detached behavior': 29,
 'Fantasizing': 30,
 'Weight gain': 31,
 'Suicidal behavior': 32,
 'Dissociation': 33,
 'Sexual dysfunction': 34,
 'Weight loss': 35,
 'Paranoia': 36,
 'Racing thoughts': 37,
 'Apnea': 38,
 'Problems concentrating': 39,
 'Mania': 40,
 'Avoidance': 41,
 'Itching': 42,
 'Indecisiveness': 43,
 'Tics': 44,
 'Severe sensitivity': 45,
 'Numbness': 46,
 'Forgetfulness': 47,
 'Acting out': 48,
 'Inattentivene

## Part III: K-Fold CV

### Anxiety

The goal of this part is to find the best hyperparameters for the linear-kernel SVM (C) and the RBF-kernel SVM (C and gamma). We run K-fold cross-validation, average several performance metrics across the folds, and pick the hyperparameter value, or combination of values, that scores best on each metric.
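
For the RBF kernel the search runs over (C, gamma) pairs. The following is a minimal sketch, with made-up scores and a smaller grid than the notebook uses, of how the best pair can be recovered from a grid of mean CV scores:

```python
import numpy as np

# Candidate values (powers of ten, as in the notebook; the scores below are made up).
C_values = 10.0 ** np.arange(-1, 2)       # 0.1, 1, 10
gamma_values = 10.0 ** np.arange(-3, -1)  # 0.001, 0.01

# Hypothetical mean CV scores, one row per C and one column per gamma.
mean_scores = np.array([
    [0.50, 0.52],
    [0.61, 0.58],
    [0.55, 0.60],
])

# argmax on a 2D array returns an index into the flattened (row-major) array,
# i.e. flat = i * len(gamma_values) + j; unravel_index converts it back to (i, j).
flat_best = int(np.argmax(mean_scores))
i, j = np.unravel_index(flat_best, mean_scores.shape)
best_C, best_gamma = C_values[i], gamma_values[j]
print(flat_best, best_C, best_gamma)
```

The same row-major indexing is what lets a flat list of scores be mapped back to the pair that produced each one.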


In [26]:
# This function can calculate six different performance metrics for the predicted output. These are
# accuracy, F1-Score, AUROC, precision, sensitivity, and specificity.
#
def performance(y_true, y_pred, metric="accuracy"):
    """
    Inputs:
    @y_true: true labels of each example, of shape (n, )
    @y_pred: (continuous-valued) predicted labels of each example, of shape (n, )
    @metric: a string specifying one of the six performance measures.
             'accuracy', 'f1_score', 'auroc', 'precision', 'sensitivity', 'specificity'

    @return: a float representing performance score
    """
    # map continuous-valued predictions to binary labels
    y_label = np.sign(y_pred)
    # if a prediction is 0, treat that as 1
    y_label[y_label == 0] = 1

    # compute performance
    if metric == "accuracy":
        score = metrics.accuracy_score(y_true, y_label)
    elif metric == "f1_score":
        score = metrics.f1_score(y_true, y_label)
    elif metric == "auroc":
        # note: AUROC is computed on the thresholded labels, not the raw scores,
        # so it equals the mean of sensitivity and specificity
        score = metrics.roc_auc_score(y_true, y_label)
    elif metric == "precision":
        score = metrics.precision_score(y_true, y_label)
    elif metric in ("sensitivity", "specificity"):
        # fix the label order so the matrix is always 2x2: [[tn, fp], [fn, tp]]
        mcm = metrics.confusion_matrix(y_true, y_label, labels=[-1, 1])
        tn, fp, fn, tp = mcm.ravel()
        if metric == "sensitivity":  # true positive rate
            score = tp / (tp + fn)
        else:  # specificity: true negative rate
            score = tn / (tn + fp)
    else:
        raise ValueError(f"Unknown metric: {metric}")

    return score


# This function takes in a classifier, splits the data X and labels y into k folds, performs k-fold cross-validation,
# and calculates all specified performance metrics for the classifier by averaging the performance scores across folds.
#
def cv_performance(clf, X, y, kf, metric):
    """
    Inputs:
    @clf: an SVM classifier, i.e. an instance of SVC
    @X: the feature matrix we constructed with shape (n, d)
    @y: the labels of each data point with shape (n,), note this is binary labels {1,-1}
    @kf: an instance of sklearn.model_selection.KFold or sklearn.model_selection.StratifiedKFold
    @metric: a list of strings specifying the performance metrics to calculate for

    @return: a numpy array of floats representing the average CV performance across k folds for all metrics
    """

    metric_score = np.zeros((len(metric), kf.get_n_splits(X, y)))
    counter = 0

    # split data based on cross validation kf and loop for k times (aka k folds)
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # train SVM
        clf.fit(X_train, y_train)
        # predict using trained classifier
        y_pred = clf.decision_function(X_test)
        # metric score
        for j in range(len(metric)):
          metric_score[j][counter] = performance(y_test, y_pred, metric[j])
        counter += 1

    score = np.average(metric_score, axis=1)

    return score


# This function calls cv_performance and performs hyperparameter selection for the linear-kernel SVM
# by selecting the hyperparameter that maximizes each metric's average performance score across k-fold CV.
#
def select_param_linear(X, y, kf, metric):
    """
    Inputs:
    @X: the feature matrix we constructed with shape (n, d)
    @y: the labels of each data point with shape (n,), note this is binary labels {1,-1}
    @kf: an instance of sklearn.model_selection.KFold or sklearn.model_selection.StratifiedKFold
    @metric: a list of strings specifying the performance metrics to calculate for

    @return: a list of floats representing the optimal hyperparameter values for linear-kernel SVM based on each metric
    """

    print('Linear SVM Hyperparameter Selection based on ' + (', '.join(metric)) + ':')

    # pre-define a range of C values, C here is the hyperparameter used in linear-kernel SVM
    C_range = 10.0 ** np.arange(-3, 3)

    # train linear-kernel SVM using different C values and calculate average k-fold cross validation score
    c_score_T = np.zeros((len(C_range), len(metric)))

    for i in range(len(C_range)):
      clf = SVC(kernel='linear', C=C_range[i])  # define SVM instance
      c_score_T[i] = cv_performance(clf, X, y, kf, metric)
  
    # transpose the matrix
    c_score = c_score_T.T

    # obtain best score across c values for each metric
    best_index = np.argmax(c_score, axis=1)
    best_C = np.zeros(len(metric))
    for i in range(len(best_index)):
      best_C[i] = C_range[best_index[i]]
      print(f"For {metric[i]}, cv scores across different parameters are {c_score[i]}")

    np.savetxt("linear_SVM_c_score_matrix.txt", c_score)
    np.savetxt("linear_SVM_optimal_params.txt", best_C)

    return best_C

# Similar to above, this function calls cv_performance and performs hyperparameter selection for the RBF-kernel SVM
# by selecting the hyperparameter that maximizes each pairwise metric's average performance score across k-fold CV.
#
def select_param_rbf(X, y, kf, metric):
    """
    Inputs:
    @X: the feature matrix we constructed with shape (n, d)
    @y: the labels of each data point with shape (n,), note this is binary labels {1,-1}
    @kf: an instance of sklearn.model_selection.KFold or sklearn.model_selection.StratifiedKFold
    @metric: a list of strings specifying the performance metrics to calculate for
    
    @returns: a numpy array of shape (len(metric), 2) with each row represents a tuple of floats (C, gamma)
              which are the optimal hyperparameters for RBF-kernel SVM for each metric
    """

    print('\nRBF SVM Hyperparameter Selection based on ' + (', '.join(metric)) + ':')

    # pre-define a range of gamma and C values, which are both hyperparameters used in RBF-kernel SVM
    # construct a grid so that every combination of the two hyperparameters is tested
    C_range = 10.0 ** np.arange(-3, 4)
    gamma_range = 10.0 ** np.arange(-5, 2)
    tuple_score_T = np.zeros((len(C_range)*len(gamma_range), len(metric)))
    tuple_dict = {}

    # train an SVM classifier for every (C, gamma) pair and calculate average performance score
    for i in range(len(C_range)):
      for j in range(len(gamma_range)):
        clf = SVC(kernel='rbf', C=C_range[i], gamma=gamma_range[j])  # define SVM instance
        evaluate_row_num = i * len(gamma_range) + j  # row-major index of the (i, j) grid cell
        tuple_score_T[evaluate_row_num] = cv_performance(clf, X, y, kf, metric)
        tuple_dict[str(evaluate_row_num)] = np.array([C_range[i], gamma_range[j]])

    # transpose the matrix
    tuple_score = tuple_score_T.T
    np.savetxt("RBF_SVM_tuple_score_matrix.txt", tuple_score)

    # obtain best score across all pairwise (c, gamma) values for each metric
    best_index = np.argmax(tuple_score, axis=1)
    best_tuple = np.zeros((len(metric), 2))
    for z in range(len(best_index)):
      best_tuple[z] = tuple_dict[str(best_index[z])]
      print(f"For {metric[z]}, the best cv scores across different parameters is {tuple_score[z][best_index[z]]}")

    np.savetxt("RBF_SVM_optimal_params.txt", best_tuple)

    return best_tuple


# Finally, this is rather a trivial function that outputs the performance score of the final chosen models.
#
def performance_test(clf, X, y, metric="accuracy"):
    """
    Inputs:
    @clf: a TRAINED SVM classifier that has already been fitted to the data.
    @X: the feature matrix we constructed with shape (n, d)
    @y: the labels of each data point with shape (n,), note this is binary labels {1,-1}
    @metric: a string specifying the performance metric to calculate for

    @return: a float representing the performance score of the classifier
    """

    y_pred = clf.decision_function(X)
    score = performance(y, y_pred, metric)

    return score


# This function extracts the corresponding label column from the label matrix provided the symptom name and
# the symptom dictionary, and convert the labels from {0,1} to {-1,1}.
#
def extract_symptom_labels(y, symptom_dict, symptom="Anxiety"):
    """
    Inputs:
    @y: the label matrix with labels {0, 1}
    @symptom_dict: the symptom dictionary
    @symptom: the class / symptom we wish to build the one-vs-rest SVM classifier on, default "Anxiety"

    @return: a numpy array of shape (n,) holding the labels {1,-1} for the chosen symptom
    """
    
    index = symptom_dict[symptom]
    # copy the column so the original label matrix is not modified in place
    extract_label = y[:, index].copy()
    extract_label[extract_label == 0] = -1
    return extract_label


We now use 5-fold cross validation to obtain optimal hyperparameters.
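
Stratification matters when the positive class is rare. A small sketch on toy labels (hypothetical data, 20% positives) shows how `StratifiedKFold` spreads the positives across the held-out folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 20 positives (+1) and 80 negatives (-1), hypothetical data.
y_toy = np.array([1] * 20 + [-1] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5)
positives_per_fold = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # each held-out fold should keep roughly the 20% positive rate
    positives_per_fold.append(int(np.sum(y_toy[test_idx] == 1)))

print(positives_per_fold)
```

With a plain unshuffled `KFold` on the same sorted labels, all positives would land in the first fold, which would make per-fold precision and sensitivity hard to interpret.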

In [30]:
# define metric list: total 6 metrics we will check
metric_list = ["accuracy", "f1_score", "auroc",
               "precision", "sensitivity", "specificity"]

# since we are focusing on anxiety, extract y as the anxiety labels
X = feature_matrix
y = extract_symptom_labels(label_matrix_del, symptom, symptom="Anxiety")

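# hold out the tail of the data as a test set for now
# note: as written, these slices keep 1611 training rows (0-1610) and 99 test rows (1612-1710);
# rows 1611 and 1711 fall in neither set (X[:1612] and X[1612:] would give a 1612/100 split)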
X_training = X[0:1611]
y_training = y[0:1611]
X_test = X[1612:1711]
y_test = y[1612:1711]

# perform stratified k-fold, in which the folds are made by preserving the percentage of samples for each class
kf = StratifiedKFold(n_splits=5)

print(f"A single (neg) training example looks like: {X[0]}")
print(f"The corresponding label for that example looks like: {y[0]}")

print(f"A single (pos) training example looks like: {X[1]}")
print(f"The corresponding label for that example looks like: {y[1]}")


A single (neg) training example looks like: [35. 95.  7. ...  0.  0.  0.]
The corresponding label for that example looks like: -1.0
A single (pos) training example looks like: [20. 93.  2. ...  0.  0.  0.]
The corresponding label for that example looks like: 1.0


Linear-Kernel SVM Hyperparameter Tuning

In [31]:
# for each metric, select optimal hyperparameter for linear-kernel SVM
optimalC_each_metric = select_param_linear(X_training, y_training, kf, metric=metric_list)
print(f"Optimal C for each metric is {optimalC_each_metric}")

Linear SVM Hyperparameter Selection based on accuracy, f1_score, auroc, precision, sensitivity, specificity:
For accuracy, cv scores across different parameters are [0.46673846 0.485347   0.48347403 0.48595081 0.48595081 0.48595081]
For f1_score, cv scores across different parameters are [0.43521061 0.46811862 0.46414919 0.46560926 0.46560926 0.46560926]
For auroc, cv scores across different parameters are [0.46871167 0.48648141 0.48471227 0.48727637 0.48727637 0.48727637]
For precision, cv scores across different parameters are [0.47806818 0.49338259 0.49078888 0.49339217 0.49339217 0.49339217]
For sensitivity, cv scores across different parameters are [0.42024385 0.45972874 0.45491667 0.45491667 0.45491667 0.45491667]
For specificity, cv scores across different parameters are [0.51717949 0.51323408 0.51450786 0.51963606 0.51963606 0.51963606]
Optimal C for each metric is [1.   0.01 1.   1.   0.01 1.  ]
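
Several of the printed score rows end in three identical values, so the reported optimum depends on how ties are broken. The sketch below re-runs the selection step on the accuracy and f1_score rows copied from the output above (to the printed precision), to illustrate that `np.argmax` keeps the first maximum:

```python
import numpy as np

C_range = 10.0 ** np.arange(-3, 3)  # same grid as select_param_linear

# Mean CV scores copied from the printed output (accuracy and f1_score rows).
accuracy_scores = np.array([0.46673846, 0.485347, 0.48347403,
                            0.48595081, 0.48595081, 0.48595081])
f1_scores = np.array([0.43521061, 0.46811862, 0.46414919,
                      0.46560926, 0.46560926, 0.46560926])

# np.argmax returns the FIRST index of the maximum, so the three-way tie
# for accuracy resolves to the smallest tied C.
best_C_accuracy = C_range[np.argmax(accuracy_scores)]
best_C_f1 = C_range[np.argmax(f1_scores)]
print(best_C_accuracy, best_C_f1)
```

This agrees with the reported optima of 1 for accuracy and 0.01 for f1_score; C values of 10 and 100 score the same as 1 on accuracy but are never selected.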


In [32]:
# for each metric, select optimal hyperparameter for RBF-kernel SVM
optimalTuple_each_metric = select_param_rbf(X_training, y_training, kf, metric=metric_list)
print(f"Optimal C and gamma for each metric is {optimalTuple_each_metric}")


RBF SVM Hyperparameter Selection based on accuracy, f1_score, auroc, precision, sensitivity, specificity:
For accuracy, the best cv scores across different parameters is 0.5263484798953906
For f1_score, the best cv scores across different parameters is 0.6820962140752576
For auroc, the best cv scores across different parameters is 0.5268384317197767
For precision, the best cv scores across different parameters is 0.5489605102924375
For sensitivity, the best cv scores across different parameters is 1.0
For specificity, the best cv scores across different parameters is 0.5425806451612903
Optimal C and gamma for each metric is [[1.e-01 1.e-05]
 [1.e+00 1.e-01]
 [1.e-01 1.e-05]
 [1.e-01 1.e-05]
 [1.e-03 1.e-05]
 [1.e+00 1.e-05]]


## Part IV: Prediction

In [33]:
# train linear-kernel SVM and RBF-kernel SVM with the hyperparameters that maximized CV accuracy
linear_clf = SVC(kernel='linear', C=1)
linear_clf = linear_clf.fit(X_training, y_training)

rbf_clf = SVC(kernel='rbf', C=0.1, gamma=0.00001)
rbf_clf = rbf_clf.fit(X_training, y_training)

# test the performance of these two classifiers on the held-out test transcripts (99 rows with the slices used above)
linear_metric_score = np.zeros(len(metric_list))
rbf_metric_score = np.zeros(len(metric_list))

for i in range(len(metric_list)):
    linear_metric_score[i] = performance_test(
        linear_clf, X_test, y_test, metric_list[i])
    rbf_metric_score[i] = performance_test(
        rbf_clf, X_test, y_test, metric_list[i])

print(f"For the linear-kernel SVM, final metric scores are: {linear_metric_score}")
print(f"For the RBF-kernel SVM, final metric scores are: {rbf_metric_score}\n")
print("End of project.")


For the linear-kernel SVM, final metric scores are: [0.64646465 0.18604651 0.48644578 0.14814815 0.25       0.72289157]
For the RBF-kernel SVM, final metric scores are: [0.31313131 0.24444444 0.46423193 0.14864865 0.6875     0.24096386]

End of project.
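
As a sanity check on the linear-SVM scores printed above, the six numbers can be reproduced from a single confusion matrix. This is a sketch under an assumption: the counts below are not printed by the notebook; they are the integer counts consistent with the reported scores and a 99-row test slice.

```python
# Confusion counts inferred from the printed linear-SVM scores (an assumption:
# they are the integer counts consistent with a 99-row test slice).
tp, fp, fn, tn = 4, 23, 12, 60

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
f1 = 2 * tp / (2 * tp + fp + fn)
# AUROC computed on thresholded labels reduces to balanced accuracy
balanced_accuracy = (sensitivity + specificity) / 2

print(round(accuracy, 8), round(f1, 8), round(balanced_accuracy, 8))
print(round(precision, 8), sensitivity, round(specificity, 8))
```

Under these counts the test slice holds 16 positive and 83 negative sessions, so the linear model's 0.646 accuracy is lower than the roughly 0.84 that always predicting "non-anxiety" would reach on the same slice.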
