#  Assignment 3 Natural Language Processing - Part 1: traditional machine learning approach: SVM

The preprocessing is covered in a separate notebook, which explains the preprocessing steps and the reasoning behind them.

The purpose of this notebook is to build a traditional machine learning model that serves as a baseline for comparison in the remainder of the assignment.

For the traditional approach, we chose a support vector machine (SVM). SVMs tend to be effective in high-dimensional spaces, and text has to be vectorized into a high-dimensional space before a mathematical model can use it, which suggests that an SVM may perform well here. SVMs are also generally regarded as strong classifiers.

Before the SVM can use the text, the text needs to be vectorized, and multiple techniques exist for this. We therefore investigate TF-IDF, Part-of-Speech (PoS) tagging and Bag-of-Words (BoW) to see how each performs and whether one is clearly best. In the remainder of this notebook, we test these techniques separately, which gives three SVM-based approaches in total. For each method, we first convert the data according to the method, then run a hyperparameter search for every target variable, and finally evaluate the model's performance.

Once the performance has been evaluated, we can conclude how the SVM performs per target variable and how the vectorization methods affect that performance.


In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import (train_test_split, RandomizedSearchCV, RepeatedKFold,
                                     cross_val_score, StratifiedKFold)
from sklearn.svm import SVC


# Reading the data

In [2]:
df = pd.read_csv('processed_data_english_no_lowercasing.csv')
df

Unnamed: 0,TEXT,cEXT,cNEU,cAGR,cCON,cOPN
0,Well right woke midday nap Its sort weird ever...,0,1,1,0,1
1,Well stream consciousness essay used thing lik...,0,0,1,0,0
2,open keyboard button push The thing finally wo...,0,1,0,1,1
3,cant believe Its really happening pulse racing...,1,0,1,1,0
4,Well good old stream consciousness assignment ...,1,0,1,0,1
...,...,...,...,...,...,...
2958,motivated day day basis need provide little fa...,1,0,0,1,1
2959,son biggest part life without reckless person ...,1,1,0,0,0
2960,kid grandkids keep motivated everyday inspire ...,1,0,1,1,0
2961,biggest drive earn money retire beach schedule...,0,0,0,0,0


In [3]:
corpus = df['TEXT'].tolist()

# Modelling 

For part 1, we use an SVM as the traditional machine learning approach. The corpus can be vectorized for the SVM in various ways, so three vectorization techniques are considered: TF-IDF, Part-of-Speech (PoS) tagging and Bag-of-Words (BoW). Each method is applied in turn below, and the methods are compared at the end of the notebook.

## Using TF-IDF 

First, a train-test split is made per target variable. Afterwards, X_train and X_test are transformed using TF-IDF: the vectorizer is fitted on the training texts and then applied to the test texts.

In [4]:
# Ext split
X_train_ext, X_test_ext, y_train_ext, y_test_ext = train_test_split(corpus, df['cEXT'].tolist(), test_size=0.2, random_state=42)

# Neu split
X_train_neu, X_test_neu, y_train_neu, y_test_neu = train_test_split(corpus, df['cNEU'].tolist(), test_size=0.2, random_state=42)

# Agr split
X_train_agr, X_test_agr, y_train_agr, y_test_agr = train_test_split(corpus, df['cAGR'].tolist(), test_size=0.2, random_state=42)

# Con split
X_train_con, X_test_con, y_train_con, y_test_con = train_test_split(corpus, df['cCON'].tolist(), test_size=0.2, random_state=42)

# Opn split
X_train_opn, X_test_opn, y_train_opn, y_test_opn = train_test_split(corpus, df['cOPN'].tolist(), test_size=0.2, random_state=42)
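Because every call above passes the same corpus, `test_size` and `random_state`, the five splits assign the same rows to train and test; only the labels differ. A more compact variant, sketched here on a hypothetical toy frame with the same column layout (the names `toy_df`, `targets`, `y_train` and `y_test` are illustrative and not part of the notebook), splits the rows once and reads every label column from the resulting frames:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same column layout as the notebook's data.
toy_df = pd.DataFrame({
    "TEXT": [f"essay number {i}" for i in range(10)],
    "cEXT": [0, 1] * 5,
    "cNEU": [1, 0] * 5,
    "cAGR": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "cCON": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0],
    "cOPN": [0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
})
targets = ["cEXT", "cNEU", "cAGR", "cCON", "cOPN"]

# One split of the rows; texts and all labels stay aligned.
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=42)

X_train_texts = train_df["TEXT"].tolist()
X_test_texts = test_df["TEXT"].tolist()
y_train = {t: train_df[t].tolist() for t in targets}
y_test = {t: test_df[t].tolist() for t in targets}

print(len(X_train_texts), len(X_test_texts))
```

With this layout the texts only need to be vectorized once, and each target is then selected by name, for example `y_train["cEXT"]`.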

In [5]:
tfidf = TfidfVectorizer()  

# for Ext
X_train_ext_t = tfidf.fit_transform(X_train_ext)
X_test_ext_t = tfidf.transform(X_test_ext)

# for Neu
X_train_neu_t = tfidf.fit_transform(X_train_neu)
X_test_neu_t = tfidf.transform(X_test_neu)

# for Agr
X_train_agr_t = tfidf.fit_transform(X_train_agr)
X_test_agr_t = tfidf.transform(X_test_agr)

# for Con
X_train_con_t = tfidf.fit_transform(X_train_con)
X_test_con_t = tfidf.transform(X_test_con)

# for Opn
X_train_opn_t = tfidf.fit_transform(X_train_opn)
X_test_opn_t = tfidf.transform(X_test_opn)
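The cell above calls `fit_transform` on the training texts and only `transform` on the test texts, so the vocabulary and IDF weights come from the training data alone. The following minimal sketch, on hypothetical toy sentences, shows what that means for a word that appears only in the test set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, for illustration only.
train_texts = ["good morning sunshine", "good night moon"]
test_texts = ["good evening moon"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_te = vec.transform(test_texts)       # reuses them; "evening" is unseen and ignored

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
```

The test matrix has the same number of columns as the training matrix, which is what allows a model trained on one to predict on the other.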

### Hyperparameter tuning

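To tune the SVM, a randomized search is run over the 'C' and 'kernel' parameters, selecting the combination that yields the highest F1 score. This is done per target variable. Note that the grid `np.linspace(start=0, stop=2, num=5)` includes C=0, which SVC does not accept (C must be strictly positive); candidates sampled with C=0 therefore fail and are scored as nan, as the outputs show.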
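The sketch below illustrates the grid issue on synthetic stand-in data (an assumption, not the essay features). It prints the grid used in this notebook, checks that `SVC` refuses `C=0`, and runs the same kind of search with a grid that starts above zero. The lower bound of 0.5 and the names `X_demo`, `y_demo`, `params_pos` and `search_demo` are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# The grid used in this notebook starts at 0.
print(np.linspace(start=0, stop=2, num=5))

# Synthetic stand-in data (assumption: not the essay features).
X_demo, y_demo = make_classification(n_samples=80, n_features=10, random_state=0)

# SVC rejects C=0 at fit time.
try:
    SVC(C=0.0).fit(X_demo, y_demo)
    c_zero_failed = False
except ValueError:
    c_zero_failed = True
print("C=0 rejected:", c_zero_failed)

# One possible alternative: the same upper bound, but starting above zero.
params_pos = {
    "C": np.linspace(start=0.5, stop=2, num=4),
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
}
search_demo = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions=params_pos,
    scoring=["recall", "precision", "accuracy", "f1"],
    refit="f1",
    cv=5,
    n_iter=10,
    random_state=0,
)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_)
```

Setting `random_state` on the search makes the sampled candidates reproducible, which the cells in this notebook do not do.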

In [8]:
# Ext
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_ext_t,y_train_ext)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.553) f1: (test=0.562) precision: (test=0.553) recall: (test=0.571) total time=   8.4s
[CV 2/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.561) f1: (test=0.570) precision: (test=0.561) recall: (test=0.580) total time=   9.0s
[CV 3/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.544) f1: (test=0.548) precision: (test=0.548) recall: (test=0.548) total time=   7.7s
[CV 4/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.586) f1: (test=0.585) precision: (test=0.592) recall: (test=0.577) total time=   8.9s
[CV 5/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.559) f1: (test=0.552) precision: (test=0.566) recall: (test=0.540) total time=   9.5s
[CV 1/5] END C=1.5, kernel=sigmoid; accuracy: (test=0.574) f1: (test=0.576) precision: (test=0.576) recall: (test=0.576) total time=  11.1s
[CV 2/5] END C=1.5, kernel=sigmoid; accuracy: (test=0.549) f1: (test=0.560) precision: (test=0.548)

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_cons

{'kernel': 'poly', 'C': 0.5}

In [9]:
# Neu
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_neu_t, y_train_neu)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=2.0, kernel=rbf; accuracy: (test=0.589) f1: (test=0.566) precision: (test=0.585) recall: (test=0.547) total time=   7.0s
[CV 2/5] END C=2.0, kernel=rbf; accuracy: (test=0.527) f1: (test=0.513) precision: (test=0.518) recall: (test=0.509) total time=   7.7s
[CV 3/5] END C=2.0, kernel=rbf; accuracy: (test=0.576) f1: (test=0.560) precision: (test=0.569) recall: (test=0.552) total time=   9.0s
[CV 4/5] END C=2.0, kernel=rbf; accuracy: (test=0.574) f1: (test=0.545) precision: (test=0.573) recall: (test=0.519) total time=   8.1s
[CV 5/5] END C=2.0, kernel=rbf; accuracy: (test=0.561) f1: (test=0.536) precision: (test=0.558) recall: (test=0.515) total time=   8.5s
[CV 1/5] END C=0.0, kernel=linear; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 2/5] END C=0.0, kernel=linear; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.

[CV 1/5] END C=0.5, kernel=poly; accuracy: (test=0.511) f1: (test=0.000) precision: (test=0.000) recall: (test=0.000) total time=   7.7s
[CV 2/5] END C=0.5, kernel=poly; accuracy: (test=0.508) f1: (test=0.000) precision: (test=0.000) recall: (test=0.000) total time=   8.3s


[CV 3/5] END C=0.5, kernel=poly; accuracy: (test=0.511) f1: (test=0.000) precision: (test=0.000) recall: (test=0.000) total time=   7.0s
[CV 4/5] END C=0.5, kernel=poly; accuracy: (test=0.511) f1: (test=0.009) precision: (test=1.000) recall: (test=0.004) total time=   6.6s


[CV 5/5] END C=0.5, kernel=poly; accuracy: (test=0.508) f1: (test=0.000) precision: (test=0.000) recall: (test=0.000) total time=   5.8s
[CV 1/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 2/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 3/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 4/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 5/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 1/5] END C=1.5, kernel=linear; accuracy: (test=0.553) f1: (test=0.537) precision: (test=0.544) recall: (test=0.530) total time=   5.8s
[CV 2/5] END C=1.5, kernel=linear; accuracy: (test=0.538) f1: (t

15 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'sigmoid', 'C': 2.0}

In [10]:
# Agr
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_agr_t, y_train_agr)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=1.5, kernel=linear; accuracy: (test=0.582) f1: (test=0.610) precision: (test=0.601) recall: (test=0.620) total time=   7.2s
[CV 2/5] END C=1.5, kernel=linear; accuracy: (test=0.601) f1: (test=0.644) precision: (test=0.606) recall: (test=0.687) total time=   8.2s
[CV 3/5] END C=1.5, kernel=linear; accuracy: (test=0.546) f1: (test=0.589) precision: (test=0.562) recall: (test=0.618) total time=   8.0s
[CV 4/5] END C=1.5, kernel=linear; accuracy: (test=0.561) f1: (test=0.605) precision: (test=0.574) recall: (test=0.639) total time=   7.7s
[CV 5/5] END C=1.5, kernel=linear; accuracy: (test=0.582) f1: (test=0.629) precision: (test=0.589) recall: (test=0.675) total time=   8.1s
[CV 1/5] END C=2.0, kernel=linear; accuracy: (test=0.576) f1: (test=0.594) precision: (test=0.600) recall: (test=0.588) total time=   8.9s
[CV 2/5] END C=2.0, kernel=linear; accuracy: (test=0.599) f1: (test=0.632) precision: (test=0.610) recall

10 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'poly', 'C': 1.5}

In [11]:
# Con
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_con_t, y_train_con)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=1.5, kernel=linear; accuracy: (test=0.568) f1: (test=0.579) precision: (test=0.576) recall: (test=0.583) total time=   6.3s
[CV 2/5] END C=1.5, kernel=linear; accuracy: (test=0.551) f1: (test=0.564) precision: (test=0.559) recall: (test=0.570) total time=   7.7s
[CV 3/5] END C=1.5, kernel=linear; accuracy: (test=0.568) f1: (test=0.561) precision: (test=0.580) recall: (test=0.544) total time=   8.8s
[CV 4/5] END C=1.5, kernel=linear; accuracy: (test=0.589) f1: (test=0.611) precision: (test=0.588) recall: (test=0.635) total time=  10.1s
[CV 5/5] END C=1.5, kernel=linear; accuracy: (test=0.589) f1: (test=0.612) precision: (test=0.588) recall: (test=0.639) total time=   9.8s
[CV 1/5] END C=1.0, kernel=rbf; accuracy: (test=0.589) f1: (test=0.609) precision: (test=0.591) recall: (test=0.628) total time=   9.0s
[CV 2/5] END C=1.0, kernel=rbf; accuracy: (test=0.580) f1: (test=0.600) precision: (test=0.584) recall: (tes

15 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'poly', 'C': 0.5}

In [12]:
# Opn
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

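# The Opn search should use the Opn split. The original cell passed the Neu split
# (X_train_neu_t, y_train_neu), which is what produced the output recorded below.
search.fit(X_train_opn_t, y_train_opn)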
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.551) f1: (test=0.544) precision: (test=0.540) recall: (test=0.547) total time=   7.6s
[CV 2/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.546) f1: (test=0.536) precision: (test=0.537) recall: (test=0.534) total time=   6.9s
[CV 3/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.568) f1: (test=0.563) precision: (test=0.557) recall: (test=0.569) total time=   5.5s
[CV 4/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.568) f1: (test=0.551) precision: (test=0.562) recall: (test=0.541) total time=   5.4s
[CV 5/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.557) f1: (test=0.541) precision: (test=0.551) recall: (test=0.532) total time=   5.4s
[CV 1/5] END C=1.5, kernel=sigmoid; accuracy: (test=0.570) f1: (test=0.558) precision: (test=0.561) recall: (test=0.556) total time=   5.5s
[CV 2/5] END C=1.5, kernel=sigmoid; accuracy: (test=0.536) f1: (test=0.522) precision: (test=0.526)

20 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'linear', 'C': 2.0}

### Actual training and testing

With the best parameters found, the SVM is trained on the training set and then evaluated on the held-out test set.

In [6]:
def SVM_model(X_train, X_test, y_train, y_test, krnl, c):
    """Fit an SVC with kernel `krnl` and regularization `c`, print the test-set
    metrics and return them as [f1, accuracy, precision, recall]."""
    svm_model = SVC(kernel=krnl, C=c)
    svm_model.fit(X_train, y_train)

    # Predict on the held-out test set
    predictions = svm_model.predict(X_test)

    # Binary metrics; the positive class is label 1
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

    print("Accuracy: {:.2f}".format(accuracy))
    print("Precision: {:.2f}".format(precision))
    print("Recall: {:.2f}".format(recall))
    print("F1 Score: {:.2f}".format(f1))

    return [f1, accuracy, precision, recall]


In [7]:
# Ext
metrics_ext_t = SVM_model(X_train_ext_t, X_test_ext_t, y_train_ext, y_test_ext, 'poly', 0.5)

Accuracy: 0.50
Precision: 0.50
Recall: 1.00
F1 Score: 0.67
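
A recall of 1.00 with precision equal to accuracy is consistent with a model that predicts the positive class for nearly every test sample, although the output above does not confirm this. A confusion matrix would show it directly. The sketch below uses hypothetical labels and a deliberately degenerate predictor to show what that pattern looks like; it does not use the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 5 positives and 5 negatives.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
# A degenerate classifier that always predicts the positive class.
y_all_pos = np.ones_like(y_true)

acc = accuracy_score(y_true, y_all_pos)
prec = precision_score(y_true, y_all_pos)
rec = recall_score(y_true, y_all_pos)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_all_pos)
print(acc, prec, rec)
print(cm)
```

If the first column of the confusion matrix is empty, the model never predicts the negative class, and the F1 score mostly reflects the class balance rather than anything learned from the text.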


In [8]:
# Neu
metrics_neu_t = SVM_model(X_train_neu_t, X_test_neu_t, y_train_neu, y_test_neu, 'sigmoid', 2.0)

Accuracy: 0.52
Precision: 0.56
Recall: 0.52
F1 Score: 0.54


In [9]:
# Agr
metrics_agr_t = SVM_model(X_train_agr_t, X_test_agr_t, y_train_agr, y_test_agr, 'poly', 1.5)

Accuracy: 0.52
Precision: 0.51
Recall: 0.79
F1 Score: 0.62


In [10]:
# Con
metrics_con_t = SVM_model(X_train_con_t, X_test_con_t, y_train_con, y_test_con, 'poly', 0.5)

Accuracy: 0.53
Precision: 0.53
Recall: 1.00
F1 Score: 0.69


In [11]:
# Opn
metrics_opn_t = SVM_model(X_train_opn_t, X_test_opn_t, y_train_opn, y_test_opn, 'linear', 2.0)

Accuracy: 0.60
Precision: 0.57
Recall: 0.58
F1 Score: 0.58


## Using Part-of-Speech (PoS) tagging

As the second method, PoS tagging was used. To do so, the text inputs are first tagged.

In [12]:
# Download NLTK data if not already downloaded
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\maxma\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [13]:
pos_df = df.copy()

def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return [tag for word, tag in pos_tags]

pos_df['pos_tags'] = pos_df['TEXT'].apply(pos_tagging)

With the PoS tagging done, the tag sequences are converted to tag counts with CountVectorizer, and the train and test splits are determined per target variable.

In [14]:
# Ext
vectorizer = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
X = vectorizer.fit_transform(pos_df['pos_tags'])
y_ext = pos_df['cEXT']

X_train_ext_p, X_test_ext_p, y_train_ext_p, y_test_ext_p = train_test_split(X, y_ext, test_size=0.2, random_state=42)



In [15]:
# Neu
y_neu = pos_df['cNEU']

X_train_neu_p, X_test_neu_p, y_train_neu_p, y_test_neu_p = train_test_split(X, y_neu, test_size=0.2, random_state=42)

In [16]:
# Agr
y_agr = pos_df['cAGR']

X_train_agr_p, X_test_agr_p, y_train_agr_p, y_test_agr_p = train_test_split(X, y_agr, test_size=0.2, random_state=42)

In [17]:
# Con
y_con = pos_df['cCON']

X_train_con_p, X_test_con_p, y_train_con_p, y_test_con_p = train_test_split(X, y_con, test_size=0.2, random_state=42)

In [18]:
# Opn
y_opn = pos_df['cOPN']

X_train_opn_p, X_test_opn_p, y_train_opn_p, y_test_opn_p = train_test_split(X, y_opn, test_size=0.2, random_state=42)

### Hyperparameter tuning

To tune the SVM, a randomized search is run over the 'C' and 'kernel' parameters, selecting the combination that yields the highest F1 score. This is done per target variable. Candidates sampled with C=0 fail, because SVC requires a strictly positive C, and are scored as nan.
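Since the search scores four metrics but refits on F1 only, it can help to look at all candidates side by side instead of reading the verbose log. The sketch below does this through the `cv_results_` attribute on synthetic stand-in data; the reduced grid and the names `X_demo`, `y_demo`, `search_demo` and `summary` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the PoS features).
X_demo, y_demo = make_classification(n_samples=80, n_features=10, random_state=0)

search_demo = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions={"C": [0.5, 1.0, 1.5, 2.0], "kernel": ["linear", "rbf"]},
    scoring=["recall", "precision", "accuracy", "f1"],
    refit="f1",
    cv=5,
    n_iter=5,
    random_state=0,
)
search_demo.fit(X_demo, y_demo)

# One row per sampled candidate, with the mean of each metric across folds.
results = pd.DataFrame(search_demo.cv_results_)
summary = results[
    ["param_kernel", "param_C", "mean_test_f1", "mean_test_accuracy",
     "mean_test_precision", "mean_test_recall", "rank_test_f1"]
].sort_values("rank_test_f1")
print(summary)
```

A candidate that ranks first on F1 while its mean accuracy sits near 0.5 and its recall near 1.0 deserves a second look before it is used for the final model.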

In [26]:
# Ext
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_ext_p,y_train_ext_p)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=0.5, kernel=linear; accuracy: (test=0.538) f1: (test=0.588) precision: (test=0.532) recall: (test=0.655) total time=   7.8s
[CV 2/5] END C=0.5, kernel=linear; accuracy: (test=0.523) f1: (test=0.555) precision: (test=0.522) recall: (test=0.592) total time=   6.8s
[CV 3/5] END C=0.5, kernel=linear; accuracy: (test=0.532) f1: (test=0.581) precision: (test=0.529) recall: (test=0.644) total time=   7.9s
[CV 4/5] END C=0.5, kernel=linear; accuracy: (test=0.521) f1: (test=0.538) precision: (test=0.524) recall: (test=0.552) total time=   6.4s
[CV 5/5] END C=0.5, kernel=linear; accuracy: (test=0.489) f1: (test=0.496) precision: (test=0.494) recall: (test=0.498) total time=   6.7s
[CV 1/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.508) f1: (test=0.499) precision: (test=0.511) recall: (test=0.487) total time=   0.2s
[CV 2/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.513) f1: (test=0.528) precision: (test=0.514) reca

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_cons

{'kernel': 'poly', 'C': 1.0}

In [27]:
# Neu
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_neu_p,y_train_neu_p)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=1.5, kernel=rbf; accuracy: (test=0.544) f1: (test=0.376) precision: (test=0.570) recall: (test=0.280) total time=   0.4s
[CV 2/5] END C=1.5, kernel=rbf; accuracy: (test=0.504) f1: (test=0.402) precision: (test=0.491) recall: (test=0.341) total time=   0.3s
[CV 3/5] END C=1.5, kernel=rbf; accuracy: (test=0.534) f1: (test=0.324) precision: (test=0.558) recall: (test=0.228) total time=   0.3s
[CV 4/5] END C=1.5, kernel=rbf; accuracy: (test=0.504) f1: (test=0.319) precision: (test=0.491) recall: (test=0.236) total time=   0.4s
[CV 5/5] END C=1.5, kernel=rbf; accuracy: (test=0.542) f1: (test=0.348) precision: (test=0.580) recall: (test=0.249) total time=   0.4s
[CV 1/5] END C=0.5, kernel=rbf; accuracy: (test=0.549) f1: (test=0.272) precision: (test=0.645) recall: (test=0.172) total time=   0.4s
[CV 2/5] END C=0.5, kernel=rbf; accuracy: (test=0.517) f1: (test=0.295) precision: (test=0.516) recall: (test=0.207) total 

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_cons

{'kernel': 'sigmoid', 'C': 2.0}

In [28]:
# Agr
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_agr_p,y_train_agr_p)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 2/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 3/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 4/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 5/5] END C=0.0, kernel=sigmoid; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 1/5] END C=2.0, kernel=poly; accuracy: (test=0.532) f1: (test=0.685) precision: (test=0.531) recall: (test=0.964) total time=   0.6s
[CV 2/5] END C=2.0, kernel=poly; accuracy: (test=0.513) f1: (test=0.671) precision: (test=0.520) recall: (test=0.948) total time=   0.6s
[CV 3

10 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'rbf', 'C': 0.5}

In [29]:
# Con
svc = SVC()

params_dict = { 'C': np.linspace(start=0, stop=2, num=5),
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_con_p,y_train_con_p)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=1.5, kernel=linear; accuracy: (test=0.515) f1: (test=0.489) precision: (test=0.529) recall: (test=0.455) total time=  20.7s
[CV 2/5] END C=1.5, kernel=linear; accuracy: (test=0.506) f1: (test=0.555) precision: (test=0.514) recall: (test=0.603) total time=  23.6s
[CV 3/5] END C=1.5, kernel=linear; accuracy: (test=0.536) f1: (test=0.583) precision: (test=0.537) recall: (test=0.639) total time=  18.9s
[CV 4/5] END C=1.5, kernel=linear; accuracy: (test=0.496) f1: (test=0.563) precision: (test=0.503) recall: (test=0.639) total time=  21.3s
[CV 5/5] END C=1.5, kernel=linear; accuracy: (test=0.540) f1: (test=0.593) precision: (test=0.539) recall: (test=0.660) total time=  18.5s
[CV 1/5] END C=2.0, kernel=linear; accuracy: (test=0.519) f1: (test=0.493) precision: (test=0.534) recall: (test=0.459) total time=  22.4s
[CV 2/5] END C=2.0, kernel=linear; accuracy: (test=0.506) f1: (test=0.555) precision: (test=0.514) recall

10 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'poly', 'C': 2.0}

In [30]:
# Opn
svc = SVC()

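# Note: this grid includes C=0.0, which SVC rejects (C must be > 0); those candidates fail and score nan.
params_dict = { 'C': np.linspace(start=0, stop=2, num=5),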
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_opn_p,y_train_opn_p)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=2.0, kernel=linear; accuracy: (test=0.534) f1: (test=0.395) precision: (test=0.537) recall: (test=0.312) total time=  20.8s
[CV 2/5] END C=2.0, kernel=linear; accuracy: (test=0.536) f1: (test=0.424) precision: (test=0.533) recall: (test=0.352) total time=  25.9s
[CV 3/5] END C=2.0, kernel=linear; accuracy: (test=0.542) f1: (test=0.453) precision: (test=0.539) recall: (test=0.391) total time=  28.7s
[CV 4/5] END C=2.0, kernel=linear; accuracy: (test=0.568) f1: (test=0.462) precision: (test=0.583) recall: (test=0.383) total time=  25.8s
[CV 5/5] END C=2.0, kernel=linear; accuracy: (test=0.544) f1: (test=0.416) precision: (test=0.550) recall: (test=0.335) total time=  35.8s
[CV 1/5] END C=0.0, kernel=poly; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total time=   0.0s
[CV 2/5] END C=0.0, kernel=poly; accuracy: (test=nan) f1: (test=nan) precision: (test=nan) recall: (test=nan) total

20 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'rbf', 'C': 1.0}
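The failed fits reported above appear to come from the `C=0.0` candidate: the log shows `C=0.0` scoring `nan` in 0.0s, and the traceback ends in scikit-learn's parameter validation. The following is a minimal sketch, on synthetic data (`X_demo`, `y_demo` and `c_grid` are illustrative names, not part of the notebook), that reproduces the rejection and shows a grid that starts at 0.5 instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook's own feature matrices are not used here.
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

# SVC requires a strictly positive C, so fitting with C=0.0 raises an error
# (a ValueError subclass in recent scikit-learn versions).
try:
    SVC(C=0.0).fit(X_demo, y_demo)
    zero_c_failed = False
except ValueError as err:
    zero_c_failed = True
    print(f"C=0.0 rejected: {type(err).__name__}")

# An assumed alternative grid that keeps the same upper bound but excludes zero.
c_grid = np.linspace(start=0.5, stop=2, num=4)
print(c_grid)
```

With such a grid every sampled candidate is actually fitted, so none of the search iterations are spent on a setting that cannot be scored.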

### Actual training and testing

With the optimal parameters found, the SVM is trained on the training set and then evaluated on the held-out test set.
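The `SVM_model` helper used in the cells below is defined earlier in the notebook. As a rough illustration only, a helper with the same call signature could look like the sketch below; the name `svm_model_sketch`, the returned dictionary layout and the synthetic data are assumptions, not the notebook's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def svm_model_sketch(X_train, X_test, y_train, y_test, kernel, C):
    """Hypothetical stand-in: fit an SVC and report test-set metrics."""
    clf = SVC(kernel=kernel, C=C)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    metrics = {
        "f1": f1_score(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
    return metrics


# Synthetic binary data, used only to show the call pattern.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
demo_metrics = svm_model_sketch(X_tr, X_te, y_tr, y_te, "rbf", 1.0)
```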

In [19]:
# Ext
metrics_ext_p = SVM_model(X_train_ext_p, X_test_ext_p, y_train_ext_p, y_test_ext_p, 'poly', 1.0)

Accuracy: 0.50
Precision: 0.50
Recall: 0.93
F1 Score: 0.65


In [20]:
# Neu
metrics_neu_p = SVM_model(X_train_neu_p, X_test_neu_p, y_train_neu_p, y_test_neu_p, 'sigmoid', 2.0)

Accuracy: 0.53
Precision: 0.58
Recall: 0.46
F1 Score: 0.51


In [21]:
# Agr
metrics_agr_p = SVM_model(X_train_agr_p, X_test_agr_p, y_train_agr_p, y_test_agr_p, 'rbf', 0.5)

Accuracy: 0.49
Precision: 0.49
Recall: 0.97
F1 Score: 0.65


In [22]:
# Con
metrics_con_p = SVM_model(X_train_con_p, X_test_con_p, y_train_con_p, y_test_con_p, 'poly', 2.0)

Accuracy: 0.53
Precision: 0.53
Recall: 0.98
F1 Score: 0.69


In [23]:
# Opn
metrics_opn_p = SVM_model(X_train_opn_p, X_test_opn_p, y_train_opn_p, y_test_opn_p, 'rbf', 1.0)

Accuracy: 0.56
Precision: 0.53
Recall: 0.53
F1 Score: 0.53


## Using Bag-of-Words (BoW)

First, convert the texts into a Bag-of-Words representation and determine the train-test splits.

In [24]:
vectorizer = CountVectorizer()
X_b = vectorizer.fit_transform(df['TEXT'])

In [25]:
# Ext
X_train_ext_b, X_test_ext_b, y_train_ext_b, y_test_ext_b = train_test_split(X_b, y_ext, test_size=0.2, random_state=42)

# Neu
X_train_neu_b, X_test_neu_b, y_train_neu_b, y_test_neu_b = train_test_split(X_b, y_neu, test_size=0.2, random_state=42)

# Agr
X_train_agr_b, X_test_agr_b, y_train_agr_b, y_test_agr_b = train_test_split(X_b, y_agr, test_size=0.2, random_state=42)

# Con
X_train_con_b, X_test_con_b, y_train_con_b, y_test_con_b = train_test_split(X_b, y_con, test_size=0.2, random_state=42)

# Opn
X_train_opn_b, X_test_opn_b, y_train_opn_b, y_test_opn_b = train_test_split(X_b, y_opn, test_size=0.2, random_state=42)
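Since all five calls use the same feature matrix, `test_size` and `random_state` (and no stratification), they select the same rows for training and testing; only the labels differ. The sketch below checks this on random toy data and shows how the splits could be built in a loop. The names `X_demo`, `y_by_trait` and `splits` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))  # toy feature matrix
y_by_trait = {
    "ext": rng.integers(0, 2, size=10),  # toy labels per trait
    "neu": rng.integers(0, 2, size=10),
}

# One split per trait: (X_train, X_test, y_train, y_test).
splits = {
    trait: train_test_split(X_demo, y, test_size=0.2, random_state=42)
    for trait, y in y_by_trait.items()
}

# The row partition depends only on the sample count and the seed, not on the labels.
same_rows = np.array_equal(splits["ext"][0], splits["neu"][0])
print(same_rows)
```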

### Hyperparameter tuning

To find the best SVM, a randomized search is run over the 'C' and 'kernel' parameters, selecting the combination that yields the highest F1 score. This is done per target variable.
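The grid used in this notebook has 5 values of `C` and 4 kernels, so 20 combinations; `RandomizedSearchCV` samples `n_iter=10` of them by default, which matches the "10 candidates" reported in the logs. Because the same search is repeated for every trait, it could be wrapped in a small helper. The sketch below is one possible shape for it; `tune_svm`, `demo_grid`, the fixed seed and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC


def tune_svm(X, y, param_distributions, n_iter=10, seed=42):
    """Hypothetical helper: randomized search over SVC settings, refit on F1."""
    search = RandomizedSearchCV(
        estimator=SVC(),
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring=["recall", "precision", "accuracy", "f1"],
        refit="f1",
        cv=5,
        random_state=seed,  # makes the sampled candidates reproducible
    )
    search.fit(X, y)
    return search


# SVC needs a strictly positive C, so this demo grid starts at 0.5.
demo_grid = {
    "C": np.linspace(start=0.5, stop=2, num=4),
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
}
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)
demo_search = tune_svm(X_demo, y_demo, demo_grid)
print(demo_search.best_params_)
```

Fixing `random_state` means a rerun samples the same candidates, which makes the reported best parameters easier to reproduce.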

In [26]:
# Ext
svc = SVC()

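# Note: this grid includes C=0.0, which SVC rejects (C must be > 0); those candidates fail and score nan.
params_dict = { 'C': np.linspace(start=0, stop=2, num=5),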
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_ext_b,y_train_ext_b)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits


KeyboardInterrupt: 

In [39]:
# Neu
svc = SVC()

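# Note: this grid includes C=0.0, which SVC rejects (C must be > 0); those candidates fail and score nan.
params_dict = { 'C': np.linspace(start=0, stop=2, num=5),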
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_neu_b, y_train_neu_b)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=0.5, kernel=rbf; accuracy: (test=0.591) f1: (test=0.505) precision: (test=0.619) recall: (test=0.427) total time=  11.4s
[CV 2/5] END C=0.5, kernel=rbf; accuracy: (test=0.532) f1: (test=0.437) precision: (test=0.531) recall: (test=0.371) total time=   9.9s
[CV 3/5] END C=0.5, kernel=rbf; accuracy: (test=0.551) f1: (test=0.432) precision: (test=0.566) recall: (test=0.349) total time=   9.9s
[CV 4/5] END C=0.5, kernel=rbf; accuracy: (test=0.561) f1: (test=0.453) precision: (test=0.585) recall: (test=0.369) total time=   9.6s
[CV 5/5] END C=0.5, kernel=rbf; accuracy: (test=0.559) f1: (test=0.457) precision: (test=0.579) recall: (test=0.378) total time=   9.1s
[CV 1/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.515) f1: (test=0.500) precision: (test=0.504) recall: (test=0.496) total time=   9.2s
[CV 2/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.542) f1: (test=0.512) precision: (test=0.535) recall: (test=0.491

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_cons

{'kernel': 'rbf', 'C': 2.0}

In [40]:
# Agr
svc = SVC()

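# Note: this grid includes C=0.0, which SVC rejects (C must be > 0); those candidates fail and score nan.
params_dict = { 'C': np.linspace(start=0, stop=2, num=5),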
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_agr_b,y_train_agr_b)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.532) f1: (test=0.561) precision: (test=0.555) recall: (test=0.568) total time=   6.2s
[CV 2/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.515) f1: (test=0.534) precision: (test=0.539) recall: (test=0.530) total time=   6.2s
[CV 3/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.542) f1: (test=0.577) precision: (test=0.561) recall: (test=0.594) total time=   6.3s
[CV 4/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.521) f1: (test=0.554) precision: (test=0.542) recall: (test=0.566) total time=   5.7s
[CV 5/5] END C=2.0, kernel=sigmoid; accuracy: (test=0.534) f1: (test=0.568) precision: (test=0.553) recall: (test=0.582) total time=   6.3s
[CV 1/5] END C=0.5, kernel=poly; accuracy: (test=0.546) f1: (test=0.690) precision: (test=0.540) recall: (test=0.956) total time=   9.4s
[CV 2/5] END C=0.5, kernel=poly; accuracy: (test=0.515) f1: (test=0.666) precision: (test=0.522) recal

10 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'poly', 'C': 0.5}

In [41]:
# Con
svc = SVC()

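# Note: this grid includes C=0.0, which SVC rejects (C must be > 0); those candidates fail and score nan.
params_dict = { 'C': np.linspace(start=0, stop=2, num=5),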
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_con_b,y_train_con_b)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.551) f1: (test=0.546) precision: (test=0.564) recall: (test=0.529) total time=   9.2s
[CV 2/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.565) f1: (test=0.602) precision: (test=0.565) recall: (test=0.645) total time=   9.2s
[CV 3/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.549) f1: (test=0.580) precision: (test=0.550) recall: (test=0.614) total time=   9.2s
[CV 4/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.591) f1: (test=0.643) precision: (test=0.578) recall: (test=0.726) total time=   9.2s
[CV 5/5] END C=0.5, kernel=sigmoid; accuracy: (test=0.544) f1: (test=0.563) precision: (test=0.549) recall: (test=0.577) total time=  10.7s
[CV 1/5] END C=1.5, kernel=rbf; accuracy: (test=0.561) f1: (test=0.581) precision: (test=0.567) recall: (test=0.595) total time=  10.0s
[CV 2/5] END C=1.5, kernel=rbf; accuracy: (test=0.530) f1: (test=0.551) precision: (test=0.537) recall:

15 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_co

{'kernel': 'poly', 'C': 0.5}

In [42]:
# Opn
svc = SVC()

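# Note: this grid includes C=0.0, which SVC rejects (C must be > 0); those candidates fail and score nan.
params_dict = { 'C': np.linspace(start=0, stop=2, num=5),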
                'kernel':['linear', 'poly', 'rbf', 'sigmoid']
              }

search = RandomizedSearchCV(
         estimator=svc,
         param_distributions=params_dict,
         scoring = ['recall', 'precision', 'accuracy', 'f1'],
         refit = 'f1', 
         cv= 5,
         verbose=3)

search.fit(X_train_opn_b,y_train_opn_b)
search.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=0.5, kernel=linear; accuracy: (test=0.538) f1: (test=0.523) precision: (test=0.526) recall: (test=0.519) total time=   9.3s
[CV 2/5] END C=0.5, kernel=linear; accuracy: (test=0.574) f1: (test=0.563) precision: (test=0.560) recall: (test=0.565) total time=   9.1s
[CV 3/5] END C=0.5, kernel=linear; accuracy: (test=0.557) f1: (test=0.539) precision: (test=0.544) recall: (test=0.535) total time=   9.1s
[CV 4/5] END C=0.5, kernel=linear; accuracy: (test=0.536) f1: (test=0.509) precision: (test=0.523) recall: (test=0.496) total time=   9.0s
[CV 5/5] END C=0.5, kernel=linear; accuracy: (test=0.540) f1: (test=0.530) precision: (test=0.526) recall: (test=0.535) total time=  10.4s
[CV 1/5] END C=1.0, kernel=sigmoid; accuracy: (test=0.574) f1: (test=0.561) precision: (test=0.563) recall: (test=0.558) total time=   8.9s
[CV 2/5] END C=1.0, kernel=sigmoid; accuracy: (test=0.570) f1: (test=0.545) precision: (test=0.560) reca

5 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maxma\anaconda3\envs\nlp\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_cons

{'kernel': 'rbf', 'C': 1.0}

### Actual training and testing

With the optimal parameters found, the SVM is trained on the training set and then evaluated on the held-out test set.

In [27]:
# Ext
metrics_ext_b = SVM_model(X_train_ext_b, X_test_ext_b, y_train_ext_b, y_test_ext_b, 'poly', 1.0)

Accuracy: 0.53
Precision: 0.52
Recall: 0.90
F1 Score: 0.66


In [28]:
# Neu
metrics_neu_b = SVM_model(X_train_neu_b, X_test_neu_b, y_train_neu_b, y_test_neu_b, 'rbf', 2.0)

Accuracy: 0.56
Precision: 0.61
Recall: 0.50
F1 Score: 0.55


In [29]:
# Agr
metrics_agr_b = SVM_model(X_train_agr_b, X_test_agr_b, y_train_agr_b, y_test_agr_b, 'poly', 0.5)

Accuracy: 0.48
Precision: 0.48
Recall: 0.94
F1 Score: 0.64


In [30]:
# Con
metrics_con_b = SVM_model(X_train_con_b, X_test_con_b, y_train_con_b, y_test_con_b, 'poly', 0.5)

Accuracy: 0.53
Precision: 0.53
Recall: 0.94
F1 Score: 0.68


In [31]:
# Opn
metrics_opn_b = SVM_model(X_train_opn_b, X_test_opn_b, y_train_opn_b, y_test_opn_b, 'rbf', 1.0)

Accuracy: 0.62
Precision: 0.61
Recall: 0.55
F1 Score: 0.58


# Comparison of the different techniques per target variable

With all three vectorization techniques evaluated, the metrics can be collected per target variable. The resulting tables are used to compare the techniques and see which one works best.

In [32]:
# Ext
# One row per vectorization technique, one column per metric
metrics_ext = pd.DataFrame([metrics_ext_t, metrics_ext_p, metrics_ext_b],
                           columns=['f1', 'accuracy', 'precision', 'recall'],
                           index=['TF-IDF', 'PoS', 'BoW'])
metrics_ext

| method | f1 | accuracy | precision | recall |
|---|---|---|---|---|
| TF-IDF | 0.669663 | 0.504216 | 0.503378 | 1.0 |
| PoS | 0.648649 | 0.495784 | 0.499096 | 0.926174 |
| BoW | 0.659259 | 0.53457 | 0.521484 | 0.895973 |


In [33]:
# Neu
# One row per vectorization technique, one column per metric
metrics_neu = pd.DataFrame([metrics_neu_t, metrics_neu_p, metrics_neu_b],
                           columns=['f1', 'accuracy', 'precision', 'recall'],
                           index=['TF-IDF', 'PoS', 'BoW'])
metrics_neu

| method | f1 | accuracy | precision | recall |
|---|---|---|---|---|
| TF-IDF | 0.540717 | 0.524452 | 0.564626 | 0.51875 |
| PoS | 0.513986 | 0.531197 | 0.583333 | 0.459375 |
| BoW | 0.548885 | 0.556492 | 0.608365 | 0.5 |


In [34]:
# Agr
# One row per vectorization technique, one column per metric
metrics_agr = pd.DataFrame([metrics_agr_t, metrics_agr_p, metrics_agr_b],
                           columns=['f1', 'accuracy', 'precision', 'recall'],
                           index=['TF-IDF', 'PoS', 'BoW'])
metrics_agr

| method | f1 | accuracy | precision | recall |
|---|---|---|---|---|
| TF-IDF | 0.615591 | 0.517707 | 0.505519 | 0.786942 |
| PoS | 0.65358 | 0.494098 | 0.492174 | 0.972509 |
| BoW | 0.63785 | 0.477234 | 0.483186 | 0.938144 |


In [35]:
# Con
# One row per vectorization technique, one column per metric
metrics_con = pd.DataFrame([metrics_con_t, metrics_con_p, metrics_con_b],
                           columns=['f1', 'accuracy', 'precision', 'recall'],
                           index=['TF-IDF', 'PoS', 'BoW'])
metrics_con

| method | f1 | accuracy | precision | recall |
|---|---|---|---|---|
| TF-IDF | 0.688815 | 0.526138 | 0.525338 | 1.0 |
| PoS | 0.687783 | 0.53457 | 0.530541 | 0.977492 |
| BoW | 0.675958 | 0.529511 | 0.529091 | 0.935691 |


In [36]:
# Opn
# One row per vectorization technique, one column per metric
metrics_opn = pd.DataFrame([metrics_opn_t, metrics_opn_p, metrics_opn_b],
                           columns=['f1', 'accuracy', 'precision', 'recall'],
                           index=['TF-IDF', 'PoS', 'BoW'])
metrics_opn

| method | f1 | accuracy | precision | recall |
|---|---|---|---|---|
| TF-IDF | 0.575488 | 0.596965 | 0.572438 | 0.578571 |
| PoS | 0.531418 | 0.559865 | 0.534296 | 0.528571 |
| BoW | 0.578358 | 0.618887 | 0.605469 | 0.553571 |


Macro average for all personality traits

In [63]:
dfs = [metrics_ext, metrics_neu, metrics_agr, metrics_con, metrics_opn]
df_avg = pd.DataFrame(np.mean([df.values for df in dfs], axis=0), columns=metrics_ext.columns)
new_index_values = ['TF-IDF', 'PoS', 'BoW']
df_avg = df_avg.set_index(pd.Index(new_index_values))
df_avg

| method | f1 | accuracy | precision | recall |
|---|---|---|---|---|
| TF-IDF | 0.618055 | 0.533895 | 0.53426 | 0.776853 |
| PoS | 0.607083 | 0.523103 | 0.527888 | 0.772824 |
| BoW | 0.620062 | 0.543339 | 0.549519 | 0.764676 |
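The macro average above relies on all five tables having their rows and columns in the same order. A label-based alternative, sketched here on two small made-up tables (`toy_ext`, `toy_neu` and their values are purely illustrative), stacks the tables with `pd.concat` and averages per method, so that alignment is done by name rather than by position.

```python
import pandas as pd

# Made-up metric tables for two traits; the values are not results from this notebook.
toy_ext = pd.DataFrame({"f1": [0.6, 0.5], "accuracy": [0.5, 0.4]}, index=["TF-IDF", "BoW"])
toy_neu = pd.DataFrame({"f1": [0.4, 0.7], "accuracy": [0.7, 0.6]}, index=["TF-IDF", "BoW"])

# Stack the tables under a (trait, method) index, then average over traits per method.
stacked = pd.concat({"ext": toy_ext, "neu": toy_neu}, names=["trait", "method"])
macro_avg = stacked.groupby(level="method").mean()
print(macro_avg)
```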


# Conclusions

Inspecting the metrics per target variable, we see that performance varies considerably across traits. For Extraversion, Agreeableness and Conscientiousness, recall is very high (up to 1.0), so the model finds nearly every text that actually has these traits. However, precision and accuracy for these traits stay close to 0.5, which suggests that the model labels almost every text as positive rather than truly separating the classes. For Neuroticism and Openness, recall is much lower (roughly 0.46 to 0.58). Precision is low for all traits (at most about 0.61), meaning the model often predicts a trait for texts that do not have it. Consequently, the F1 scores are modest. The best vectorization technique also depends on the trait: by F1, TF-IDF scores highest for Extraversion and Conscientiousness, PoS for Agreeableness, and BoW for Neuroticism and Openness.
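One way to put the high-recall results in context is to compare them with a trivial classifier that always predicts the positive class. The sketch below computes the four metrics for such a baseline on hypothetical, perfectly balanced labels (`y_demo` is made up; the real class balance per trait may differ slightly).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def always_positive_metrics(y_true):
    """Metrics of a baseline that labels every sample as positive."""
    y_pred = np.ones_like(y_true)
    return {
        "f1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# Hypothetical labels with 50% positives.
y_demo = np.array([1, 0] * 50)
baseline = always_positive_metrics(y_demo)
print(baseline)  # recall 1.0, precision 0.5, accuracy 0.5, f1 = 2/3
```

On balanced data this baseline already reaches an F1 of about 0.67, so F1 scores in that range combined with recall near 1.0 do not by themselves show that a model has learned to separate the classes.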

Inspecting the macro averages over all five personality traits, we conclude that the SVM has mediocre performance overall: for every technique, macro F1 is about 0.61 to 0.62 and accuracy about 0.52 to 0.54. BoW has the highest macro-averaged F1, accuracy and precision, but its margin over TF-IDF is small (F1 of 0.620 versus 0.618), and as stated earlier the best technique depends on the trait being predicted.

The SVM results are thus weak and vary strongly per personality trait. Therefore, we want to investigate whether other machine learners can do better before comparing them to deep learners.