# FIT5149 Assessment 2: Authorship Profiling

Student information
- Student Name: Priscila Grecov
- Student ID: 29880858
- Student email: pgre0007@student.monash.edu

### Table of Contents - Part II: Training and Testing Classifier Models

* [Part II. Training and Testing Classifier Models](#part_2)
* [II.0. Libraries](#sec_2_0)
* [II.1. Loading training labels](#sec_2_1)
* [II.2. Loading testing labels](#sec_2_2)
* [II.3. Using TF-IDF preprocessing datasets](#sec_2_3)
* [II.4. Using Embedding with W2Vector preprocessing datasets](#sec_2_4)
* [II.5. Using Bag of Words preprocessing datasets with NGRAM](#sec_2_5)



## PART II - Training and Testing Classifier Models <a class="anchor" id="part_2"></a>

### 0. Libraries <a class="anchor" id="sec_2_0"></a>

In [2]:
# libraries to be used:
from sklearn.ensemble import *
from sklearn.linear_model import *
from sklearn.naive_bayes import *
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score

import pandas as pd
import numpy as np
import statistics
import random 
import warnings

### 1. Loading training labels <a class="anchor" id="sec_2_1"></a>

In [3]:
# Loading the train labels dataset
train_labels = pd.read_csv('train_labels.csv', index_col='id') 
train_labels.head() # checking the proper loading

id,gender
b91efc94c91ad3f882a612ae2682af17,male
ff91e6d4b79fc64072ae273aa3fed77e,male
7e199c5885131a2579429c07f3215cbc,female
cdc2d20d75f8187ee54caf56b2c77626,male
53259762a49f56f451605df3efa955e6,female


In [4]:
# Encoding the gender labels as a binary "male" column: female -> 0, male -> 1
train_labels["male"] = [0 if i == 'female' else 1 for i in train_labels["gender"]]
train_df = train_labels.drop(columns="gender")

In [5]:
# Checking the proper relabeling
train_df.head()

id,male
b91efc94c91ad3f882a612ae2682af17,1
ff91e6d4b79fc64072ae273aa3fed77e,1
7e199c5885131a2579429c07f3215cbc,0
cdc2d20d75f8187ee54caf56b2c77626,1
53259762a49f56f451605df3efa955e6,0


### 2. Loading testing labels <a class="anchor" id="sec_2_2"></a>

In [6]:
# Loading the indexes from testing dataset
test_index_list = np.load('test_index_list.npy', allow_pickle=True)

In [7]:
# Loading the test labels dataset
test_labels = pd.read_csv('test_labels.csv', index_col='id') 
test_labels.head() # checking the proper loading

id,gender
d6b08022cdf758ead05e1c266649c393,male
9a989cb04766d5a89a65e8912d448328,female
2a1053a059d58fbafd3e782a8f7972c0,male
6032537900368aca3d1546bd71ecabd1,male
d191280655be8108ec9928398ff5b563,male


In [8]:
# Encoding the gender labels as a binary "male" column: female -> 0, male -> 1
test_labels["male"] = [0 if i == 'female' else 1 for i in test_labels["gender"]]
test_labels = test_labels.drop(columns="gender")
test_labels.head()

id,male
d6b08022cdf758ead05e1c266649c393,1
9a989cb04766d5a89a65e8912d448328,0
2a1053a059d58fbafd3e782a8f7972c0,1
6032537900368aca3d1546bd71ecabd1,1
d191280655be8108ec9928398ff5b563,1


### 3. Using TF-IDF preprocessing datasets <a class="anchor" id="sec_2_3"></a>

Using the TF-IDF datasets built in Part I, we compare the performance of the following classifiers:
* SVC
* Gradient Boosting
* Random Forest
* Logistic Regression
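
Each classifier in this list goes through the same three steps in the cells that follow: 10-fold cross-validation on the training set, a fit on the full training set, and an accuracy check on the held-out test set. A minimal sketch of that loop as one reusable function is given here. The function name `evaluate_classifier` and the synthetic data from `make_classification` are assumptions made for illustration and are not part of the notebook.

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC


def evaluate_classifier(model, x_tr, y_tr, x_te, y_te, folds=10):
    """Hypothetical helper: cross-validate, refit on all training data, score on the test set."""
    cv_acc = cross_val_score(model, x_tr, y_tr, scoring='accuracy', cv=folds).tolist()
    model.fit(x_tr, y_tr)
    test_acc = accuracy_score(y_te, model.predict(x_te))
    return {
        'cv_mean': statistics.mean(cv_acc),
        'cv_sd': statistics.stdev(cv_acc),
        'test_accuracy_pct': 100 * test_acc,
    }


# Synthetic stand-in for the TF-IDF features and gender labels (illustration only)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {
    'LinearSVC': evaluate_classifier(LinearSVC(), x_tr, y_tr, x_te, y_te),
    'LogisticRegression': evaluate_classifier(LogisticRegression(), x_tr, y_tr, x_te, y_te),
}
for name, scores in results.items():
    print(name, scores)
```

Using `accuracy_score` instead of a hand-written `sum(...)/len(...)` expression avoids having to keep the numerator and denominator in step when the test index changes.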

#### 3.1. Loading TF-IDF Data

In [9]:
# Loading TF-IDF preprocessing data
x_train = np.load('tfs_train.npy', allow_pickle=True).item()
x_test = np.load('tfs_test.npy', allow_pickle=True).item()

In [10]:
# Checking the dimensions of training TF-IDF preprocessing data
x_train.shape

(3100, 1265)

In [11]:
# Checking the dimensions of testing TF-IDF preprocessing data
x_test.shape

(500, 1265)

#### 3.2.  Classifier Model creation and evaluation for TF-IDF Data


##### a) TFIDF - SVC Classifier
###### - Tuning process (this takes a long time; do not re-run)

In [12]:
# Tuning process for the SVC algorithm, to select the best/optimal parameters for SVC function
## Create hyperparameter options
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],'kernel': ['rbf', 'poly', 'sigmoid']}

## Running the cross-validation to hyperparameters selection
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
y_train = np.asarray(train_df['male'].values)
grid.fit(x_train,y_train)
print(grid.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=  18.1s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   18.1s remaining:    0.0s


[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=  18.7s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=  18.8s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=  19.9s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=  18.1s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] ...................... C=0.1, gamma=1, kernel=poly, total=  16.1s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] ...................... C=0.1, gamma=1, kernel=poly, total=  17.2s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] ...................... C=0.1, gamma=1, kernel=poly, total=  17.8s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] .

[CV] ............... C=0.1, gamma=0.001, kernel=sigmoid, total=  18.3s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=  14.4s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=  15.1s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=  14.1s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=  13.5s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=  13.2s
[CV] C=1, gamma=1, kernel=poly .......................................
[CV] ........................ C=1, gamma=1, kernel=poly, total=  14.3s
[CV] C=1, gamma=1, kernel=poly .......................................
[CV] .

KeyboardInterrupt: 

###### - Training process

The optimal parameter values calculated by the tuning process were: 

* kernel = rbf
* C = 100
* gamma = 0.01

Let's proceed to run the SVC algorithm using these selected parameter values.
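
With `refit=True`, a fitted `GridSearchCV` object exposes the winning combination through `best_params_`, its mean cross-validated score through `best_score_`, and a model already refitted on all training data through `best_estimator_`. The sketch below shows how values such as those listed above can be read off programmatically instead of being copied by hand. The reduced grid and the synthetic data are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (illustration only, not the notebook's TF-IDF matrix)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A reduced version of the notebook's grid so the search finishes quickly
small_grid = {'C': [1, 100], 'gamma': [0.01, 0.1], 'kernel': ['rbf', 'sigmoid']}
search = GridSearchCV(SVC(), small_grid, refit=True, cv=5)
search.fit(X, y)

best = search.best_params_            # dict with the selected 'C', 'gamma', 'kernel'
print(best, round(search.best_score_, 3))

# Reuse the selected values directly when building the final model
final_model = SVC(**best).fit(X, y)
```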

In [12]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model with the tuned parameters kernel=rbf, C=100 and gamma=0.01
model1 = SVC(kernel='rbf', C=100, gamma=0.01)
print('Cross Validation Results for SVC using the tuned parameters - TFIDF:')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model1, x_train, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model1.fit(x_train, y_train)

#getting predictions from model
y_predict_1 = model1.predict(x_test)

Cross Validation Results for SVC using the tuned parameters - TFIDF:
Doing 10-fold cross validation
sd(accuracy):0.022506277948947325
mean(accuracy):0.8009677419354839


In [13]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC tuned - TFIDF:')
sum(test_labels.loc[test_index_list,'male']==y_predict_1)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC tuned - TFIDF:


79.4

In [14]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model using the sigmoid kernel
model2 = SVC(kernel="sigmoid", C=100, gamma=0.01)
print('Cross Validation Results for SVC using the sigmoid kernel - TFIDF:')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model2, x_train, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model2.fit(x_train, y_train)

#getting predictions from model
y_predict_2 = model2.predict(x_test)

Cross Validation Results for SVC using the sigmoid kernel - TFIDF:
Doing 10-fold cross validation
sd(accuracy):0.032465284250504656
mean(accuracy):0.8006451612903226


In [15]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC sigmoid - TFIDF:')
sum(test_labels.loc[test_index_list,'male']==y_predict_2)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC sigmoid - TFIDF:


79.2

In [16]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model using the linear kernel
model3 = LinearSVC()
print('Cross Validation Results for SVC using the linear kernel - TFIDF:')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model3, x_train, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model3.fit(x_train, y_train)

#getting predictions from model
y_predict_3 = model3.predict(x_test)

Cross Validation Results for SVC using the linear kernel - TFIDF:
Doing 10-fold cross validation
sd(accuracy):0.0231220302033817
mean(accuracy):0.807741935483871


In [17]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC linear - TFIDF:')
sum(test_labels.loc[test_index_list,'male']==y_predict_3)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC linear - TFIDF:


80.0

##### b) TFIDF - Gradient Boosting Classifier

In [18]:
# Setting the seed:
random.seed(1234)

# Implementing the Gradient Boosting Classifier model
model4 = GradientBoostingClassifier()
print('Cross Validation Results for Gradient Boosting Classifier - TFIDF')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model4, x_train, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model4.fit(x_train, y_train)

#getting predictions from model
y_predict_4 = model4.predict(x_test)

Cross Validation Results for Gradient Boosting Classifier - TFIDF
Doing 10-fold cross validation
sd(accuracy):0.01742134596346199
mean(accuracy):0.782258064516129


In [19]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Gradient Boosting Classifier - TFIDF:')
sum(test_labels.loc[test_index_list,'male']==y_predict_4)/len(test_labels.index)*100

Accuracy on Testing Dataset - Gradient Boosting Classifier - TFIDF:


78.4

##### c) TFIDF -  Random Forest

In [20]:
import warnings
warnings.filterwarnings("ignore")

In [21]:
# Setting the seed:
random.seed(1234)

# Implementing the Random Forest Classifier model
model5 = RandomForestClassifier()
print('Cross Validation Results for Random Forest Classifier - TFIDF')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model5, x_train, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model5.fit(x_train, y_train)

#getting predictions from model
y_predict_5 = model5.predict(x_test)

Cross Validation Results for Random Forest Classifier - TFIDF
Doing 10-fold cross validation
sd(accuracy):0.020070404465066843
mean(accuracy):0.7729032258064515


In [22]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Random Forest Classifier - TFIDF:')
sum(test_labels.loc[test_index_list,'male']==y_predict_5)/len(test_labels.index)*100

Accuracy on Testing Dataset - Random Forest Classifier - TFIDF:


76.8

##### d) TFIDF - Logistic Regression

In [23]:
# Setting the seed:
random.seed(1234)

# Implementing the Logistic Regression model
model6 = LogisticRegression()
print('Cross Validation Results for Logistic Regression model - TFIDF')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model6, x_train, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model6.fit(x_train, y_train)

#getting predictions from model
y_predict_6 = model6.predict(x_test)

Cross Validation Results for Logistic Regression model - TFIDF
Doing 10-fold cross validation
sd(accuracy):0.026565886566335073
mean(accuracy):0.7858064516129032


In [24]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Logistic Regression - TFIDF:')
sum(test_labels.loc[test_index_list,'male']==y_predict_6)/len(test_labels.index)*100

Accuracy on Testing Dataset - Logistic Regression - TFIDF:


78.2

### 4. Using Embedding with W2Vector preprocessing datasets <a class="anchor" id="sec_2_4"></a>
Using the W2Vector embedding datasets built in Part I, we compare the performance of the following classifiers:
* SVC
* Gradient Boosting
* Random Forest
* Logistic Regression
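
Gradient Boosting and Random Forest in the list above are randomized learners. In scikit-learn their randomness is controlled by the estimator's `random_state` parameter (or by NumPy's global generator when it is left unset), so `random.seed(1234)` from Python's standard library, as used in the cells of this notebook, does not by itself make their results repeatable. A minimal sketch of a reproducible setup follows. The synthetic data is an assumption for illustration, and adding `random_state` to the notebook's models would be expected to give numbers that differ slightly from the recorded ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 100-dimensional embedding features (illustration only)
X, y = make_classification(n_samples=300, n_features=100, random_state=0)

# Two forests built with the same random_state grow identical trees
forest_a = RandomForestClassifier(n_estimators=50, random_state=1234).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=50, random_state=1234).fit(X, y)

proba_a = forest_a.predict_proba(X)
proba_b = forest_b.predict_proba(X)
same = np.array_equal(proba_a, proba_b)
print('identical predictions:', same)
```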

#### 4.1. Loading W2Vector Data

In [25]:
# Loading W2Vector preprocessing data
x_train2 = np.load('w2v_train.npy', allow_pickle=True)
x_test2 = np.load('w2v_test.npy', allow_pickle=True)

In [26]:
# Checking the proper loading of W2Vector preprocessing training data
x_train2

array([[ 0.01675258, -0.01564144,  0.01055635, ...,  0.00829905,
         0.00540817, -0.01534761],
       [-0.01528502, -0.00355074,  0.0069914 , ..., -0.00742738,
         0.01393083, -0.00563208],
       [-0.00867108,  0.00265902,  0.0072626 , ...,  0.0272107 ,
        -0.0251153 ,  0.00979084],
       ...,
       [ 0.00688194, -0.0158909 ,  0.00859467, ...,  0.00585373,
         0.00425363,  0.00157713],
       [ 0.01578771, -0.01474205,  0.01000684, ...,  0.002748  ,
        -0.00089474, -0.00452141],
       [-0.01139424, -0.00649966,  0.00864893, ...,  0.01603917,
        -0.00813043,  0.01037171]], dtype=float32)

In [27]:
# Checking the dimensions of training W2Vector preprocessing data
x_train2.shape

(3100, 100)

In [28]:
# Checking the dimensions of testing W2Vector preprocessing data
x_test2.shape

(500, 100)

#### 4.2.  Classifier Model creation and evaluation for W2Vector Data

##### a) W2V - SVC
###### - Tuning Process (this takes a long time; do not re-run)

In [30]:
# Tuning process for the SVC algorithm, to select the best/optimal parameters for SVC function
## Create hyperparameter options
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],'kernel': ['rbf', 'poly', 'sigmoid']}

## Running the cross-validation to hyperparameters selection
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
y_train = np.asarray(train_df['male'].values)
grid.fit(x_train2,y_train)
print(grid.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.9s
[CV] C=0.1, gamma=1, kernel=rbf ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s


[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   1.1s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   1.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.9s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.9s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] ...................... C=0.1, gamma=1, kernel=poly, total=   0.7s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] ...................... C=0.1, gamma=1, kernel=poly, total=   0.7s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] ...................... C=0.1, gamma=1, kernel=poly, total=   0.7s
[CV] C=0.1, gamma=1, kernel=poly .....................................
[CV] .

[CV] ............... C=0.1, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.7s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.7s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.7s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.7s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.7s
[CV] C=1, gamma=1, kernel=poly .......................................
[CV] ........................ C=1, gamma=1, kernel=poly, total=   0.7s
[CV] C=1, gamma=1, kernel=poly .......................................
[CV] .

[CV] ................. C=1, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=1, gamma=0.001, kernel=sigmoid ................................
[CV] ................. C=1, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=1, gamma=0.001, kernel=sigmoid ................................
[CV] ................. C=1, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total=   0.6s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total=   0.6s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total=   0.5s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total=   0.6s
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] .

[CV] ................ C=10, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=10, gamma=0.001, kernel=sigmoid ...............................
[CV] ................ C=10, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=10, gamma=0.001, kernel=sigmoid ...............................
[CV] ................ C=10, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=10, gamma=0.001, kernel=sigmoid ...............................
[CV] ................ C=10, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=10, gamma=0.001, kernel=sigmoid ...............................
[CV] ................ C=10, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] ....................... C=100, gamma=1, kernel=rbf, total=   0.5s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] ....................... C=100, gamma=1, kernel=rbf, total=   0.5s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] .

[CV] .................. C=100, gamma=0.001, kernel=poly, total=   0.7s
[CV] C=100, gamma=0.001, kernel=poly .................................
[CV] .................. C=100, gamma=0.001, kernel=poly, total=   0.7s
[CV] C=100, gamma=0.001, kernel=sigmoid ..............................
[CV] ............... C=100, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=100, gamma=0.001, kernel=sigmoid ..............................
[CV] ............... C=100, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=100, gamma=0.001, kernel=sigmoid ..............................
[CV] ............... C=100, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=100, gamma=0.001, kernel=sigmoid ..............................
[CV] ............... C=100, gamma=0.001, kernel=sigmoid, total=   0.8s
[CV] C=100, gamma=0.001, kernel=sigmoid ..............................
[CV] ............... C=100, gamma=0.001, kernel=sigmoid, total=   0.8s


[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  2.9min finished


SVC(C=100, gamma=1)


###### - Training Process

The optimal parameter values calculated by the tuning process were: 

* kernel = rbf
* C = 100
* gamma = 1

Let's proceed to run the SVC algorithm using these selected parameter values.
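
Both selected values, `C = 100` and `gamma = 1`, are the largest candidates in their respective grids, which may indicate that a better setting lies outside the searched range. One way to check is to rank all candidates from `cv_results_` and see how the score changes towards the boundary. The sketch below does this on synthetic data with a reduced grid; both are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid_values = {'C': [1, 10, 100], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), grid_values, cv=5)
search.fit(X, y)

# One row per candidate, best first
ranking = (
    pd.DataFrame(search.cv_results_)
    [['param_C', 'param_gamma', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(ranking.head())

# Flag a winner that sits on the upper edge of either parameter range
on_edge = bool(
    search.best_params_['C'] == max(grid_values['C'])
    or search.best_params_['gamma'] == max(grid_values['gamma'])
)
print('best candidate on upper grid edge:', on_edge)
```

If the flag is set, extending the grid in that direction and re-running the search is a cheap follow-up.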

In [29]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model with the tuned parameters kernel=rbf, C=100 and gamma=1
model7 = SVC(kernel="rbf", C=100, gamma=1)
print('Cross Validation Results for SVC using the tuned parameters - W2V:')
y_train = np.asarray(train_df['male'].values)

# cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model7, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

# training model
model7.fit(x_train2, y_train)

# getting predictions from model
y_predict_7 = model7.predict(x_test2)


Cross Validation Results for SVC using the tuned parameters - W2V:
Doing 10-fold cross validation
sd(accuracy):0.019618868174233944
mean(accuracy):0.7867741935483871


In [30]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC tuned - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_7)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC tuned - W2V:


77.60000000000001

In [31]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model using the sigmoid kernel
model8 = SVC(kernel="sigmoid", C=100, gamma=1)
print('Cross Validation Results for SVC using the sigmoid kernel- W2V:')
y_train = np.asarray(train_df['male'].values)
  
#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model8, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model8.fit(x_train2, y_train)

#getting predictions from model
y_predict_8 = model8.predict(x_test2)


Cross Validation Results for SVC using the sigmoid kernel- W2V:
Doing 10-fold cross validation
sd(accuracy):0.0261204021324628
mean(accuracy):0.7796774193548387


In [32]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC sigmoid - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_8)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC sigmoid - W2V:


79.2

In [33]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model using the linear kernel
model9 = LinearSVC()
print('Cross Validation Results for SVC using the linear kernel- W2V:')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model9, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model9.fit(x_train2, y_train)

#getting predictions from model
y_predict_9 = model9.predict(x_test2)

Cross Validation Results for SVC using the linear kernel- W2V:
Doing 10-fold cross validation
sd(accuracy):0.026937576378704052
mean(accuracy):0.7587096774193548


In [34]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC linear - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_9)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC linear - W2V:


74.4

##### b) W2V - Gradient Boosting Classifier

In [35]:
# Setting the seed:
random.seed(1234)

# Implementing the Gradient Boosting Classifier model
model10 = GradientBoostingClassifier()
print('Cross Validation Results for Gradient Boosting Classifier- W2V')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model10, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model10.fit(x_train2, y_train)

#getting predictions from model
y_predict_10 = model10.predict(x_test2)

Cross Validation Results for Gradient Boosting Classifier- W2V
Doing 10-fold cross validation
sd(accuracy):0.028629221565328964
mean(accuracy):0.7512903225806451


In [36]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Gradient Boosting Classifier - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_10)/len(test_labels.index)*100

Accuracy on Testing Dataset - Gradient Boosting Classifier - W2V:


75.6

##### c) W2V - Random Forest

In [37]:
# Setting the seed:
random.seed(1234)

# Implementing the Random Forest Classifier model
model11 = RandomForestClassifier()
print('Cross Validation Results for Random Forest Classifier- W2V')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model11, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model11.fit(x_train2, y_train)

#getting predictions from model
y_predict_11 = model11.predict(x_test2)

Cross Validation Results for Random Forest Classifier- W2V
Doing 10-fold cross validation
sd(accuracy):0.032402900655090786
mean(accuracy):0.7525806451612903


In [38]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Random Forest Classifier - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_11)/len(test_labels.index)*100

Accuracy on Testing Dataset - Random Forest Classifier - W2V:


72.8

In [39]:
# Setting the seed:
random.seed(1234)

# Implementing a second Random Forest Classifier (model B) with explicit max_features, min_samples_leaf and n_estimators
model12 = RandomForestClassifier(max_features=50, min_samples_leaf= 10, n_estimators=200)
print('Cross Validation Results for Random Forest Classifier B - W2V')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model12, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model12.fit(x_train2, y_train)

#getting predictions from model
y_predict_12 = model12.predict(x_test2)

Cross Validation Results for Random Forest Classifier B - W2V
Doing 10-fold cross validation
sd(accuracy):0.030506193112954688
mean(accuracy):0.7519354838709678


In [40]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Random Forest Classifier B - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_12)/len(test_labels.index)*100

Accuracy on Testing Dataset - Random Forest Classifier B - W2V:


75.0

##### d) W2V - Logistic Regression

In [41]:
# Setting the seed:
random.seed(1234)

# Implementing the Logistic Regression model
model13 = LogisticRegression()
print('Cross Validation Results for Logistic Regression model- W2V')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model13, x_train2, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model13.fit(x_train2, y_train)

#getting predictions from model
y_predict_13 = model13.predict(x_test2)

Cross Validation Results for Logistic Regression model- W2V
Doing 10-fold cross validation
sd(accuracy):0.024321049931392568
mean(accuracy):0.7187096774193549


In [42]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Logistic Regression model - W2V:')
sum(test_labels.loc[test_index_list,'male']==y_predict_13)/len(test_labels.index)*100

Accuracy on Testing Dataset - Logistic Regression model - W2V:


69.0

### 5. Using Bag of Words preprocessing datasets with NGRAM <a class="anchor" id="sec_2_5"></a>
Using the Bag of Words (n-gram) datasets built in Part I, we compare the performance of the following classifiers:
* SVC
* Gradient Boosting
* Random Forest
* Logistic Regression

#### 5.1. Loading Bag of Words Data with NGRAM

In [43]:
# Loading Bag of Words preprocessing data
x_train3 = np.load('bw_train.npy', allow_pickle=True)
x_test3 = np.load('bw_test.npy', allow_pickle=True)

In [44]:
# Checking the proper loading of Bag of Words preprocessing training data
x_train3

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [3, 3, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])

In [45]:
# Checking the dimensions of training Bag of Words preprocessing data
x_train3.shape

(3100, 44624)

In [46]:
# Checking the dimensions of testing Bag of Words preprocessing data
x_test3.shape

(500, 44624)

#### 5.2.  Classifier Model creation and evaluation for Bag of Words Data

##### a) BagWords - Logistic Regression
###### - Tuning process (this takes a long time; do not re-run)

In [49]:
# Tuning process for the Logistic Regression to select the best/optimal parameters for the model
## Create regularization penalty space
penalty = ['l1', 'l2']

## Create regularization hyperparameter space
C = [0.0001, 0.001, 0.01, 0.1]

## Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

## Running the cross-validation to hyperparameters selection
grid = GridSearchCV(LogisticRegression(max_iter=100), hyperparameters, cv=10, verbose=2)
y_train = np.asarray(train_df['male'].values)
grid.fit(x_train3,y_train)
print(grid.best_estimator_)

Fitting 10 folds for each of 8 candidates, totalling 80 fits
[CV] C=0.0001, penalty=l1 ............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................. C=0.0001, penalty=l1, total=   0.6s
[CV] C=0.0001, penalty=l1 ............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s


[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] ............................. C=0.0001, penalty=l1, total=   0.5s
[CV] C=0.0001, penalty=l1 ............................................
[CV] .

[CV] ............................... C=0.01, penalty=l2, total=  15.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.5s
[CV] C=0.1, penalty=l1 ...............................................
[CV] .

[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed: 10.5min finished


LogisticRegression(C=0.0001)
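
In the log above, every `penalty=l1` candidate finishes in about 0.5 s, which is consistent with those fits failing rather than training, because the default `lbfgs` solver does not support the `l1` penalty. The following is a minimal sketch of the same grid with a solver that accepts both penalties. It runs on a small synthetic dataset (`X_demo`, `y_demo` are hypothetical stand-ins for `x_train3` and `y_train`), and the choice of `liblinear` is an assumption rather than what was used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data; the notebook itself uses x_train3 / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Same hyperparameter space as in the tuning cell above
demo_hyperparameters = dict(C=[0.0001, 0.001, 0.01, 0.1], penalty=['l1', 'l2'])

# 'liblinear' supports both 'l1' and 'l2', so all 8 candidates can be scored
demo_grid = GridSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000),
    demo_hyperparameters,
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# Every candidate should now have a numeric mean test score
demo_scores = demo_grid.cv_results_['mean_test_score']
print(demo_grid.best_params_)
print(demo_scores)
```

With a compatible solver the comparison between `l1` and `l2` is based on real scores, so the selected penalty may differ from the one reported above.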


###### - Training process

The optimal parameter values calculated by the tuning process were: 

* penalty = l2 (default)
* C = 0.0001

Let's proceed to run the Logistic Regression using these selected parameter values.

In [47]:
# Setting the seed:
random.seed(1234)

# Implementing the Logistic Regression model
model14 = LogisticRegression(C=0.0001, fit_intercept=False, max_iter=1000)
print('Cross Validation Results for Logistic Regression model - NGRAM')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model14, x_train3, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model14.fit(x_train3, y_train)

#getting predictions from model
y_predict_14 = model14.predict(x_test3)

Cross Validation Results for Logistic Regression model - NGRAM
Doing 10-fold cross validation
sd(accuracy):0.023937715456936577
mean(accuracy):0.8058064516129032
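
A note on reproducibility: `random.seed` seeds Python's standard-library generator, whereas scikit-learn draws its randomness from NumPy and from the `random_state` arguments of its objects. With an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so the folds above are already deterministic. The sketch below shows one way to make shuffled folds repeatable. It uses hypothetical synthetic data (`X_demo`, `y_demo`) and is not the configuration used in this notebook.

```python
import statistics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in data; the notebook itself uses x_train3 / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Shuffled stratified folds, made repeatable through random_state
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)
demo_model = LogisticRegression(C=0.0001, fit_intercept=False, max_iter=1000)

# Two runs with the same fold definition give identical scores
scores_a = cross_val_score(demo_model, X_demo, y_demo, scoring='accuracy', cv=folds)
scores_b = cross_val_score(demo_model, X_demo, y_demo, scoring='accuracy', cv=folds)

print('sd(accuracy):' + str(statistics.stdev(scores_a.tolist())))
print('mean(accuracy):' + str(statistics.mean(scores_a.tolist())))
```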


In [48]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Logistic Regression model - NGRAM:')
sum(test_labels.loc[test_index_list,'male']==y_predict_14)/len(test_labels.index)*100

Accuracy on Testing Dataset - Logistic Regression model - NGRAM:


82.6
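
The accuracy expression above compares the labels, reordered by `test_index_list`, against the predictions position by position, so it is only meaningful if `test_index_list` follows the row order of `x_test3` (assumed here, since that list is built earlier in the notebook). A minimal sketch of the same calculation as a reusable helper, with hypothetical toy data and a hypothetical function name `accuracy_percent`:

```python
import numpy as np
import pandas as pd

def accuracy_percent(labels, index_order, predictions):
    """Percentage of predictions matching labels['male'] taken in index_order."""
    # .to_numpy() drops the index so the comparison is purely positional
    truth = labels.loc[index_order, 'male'].to_numpy()
    return float((truth == np.asarray(predictions)).mean() * 100)

# Hypothetical toy data: four documents whose prediction order differs from the label order
demo_labels = pd.DataFrame({'male': [1, 0, 1, 1]}, index=['a', 'b', 'c', 'd'])
demo_index_list = ['c', 'a', 'd', 'b']
demo_predictions = np.array([1, 1, 0, 0])

demo_accuracy = accuracy_percent(demo_labels, demo_index_list, demo_predictions)
print(demo_accuracy)  # 3 of 4 predictions match -> 75.0
```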

##### b) BagWords - SVC
###### - Tuning process (takes about 9.4 hours; there is no need to re-run it)

In [51]:
# Tuning process for the SVC algorithm, to select the best/optimal parameters for SVC function
## Create hyperparameter options
param_grid = {'C': [10, 100], 'gamma': [1, 0.1, 0.01, 0.001],'kernel': ['rbf', 'sigmoid']}

## Running the cross-validation to hyperparameters selection
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
y_train = np.asarray(train_df['male'].values)
grid.fit(x_train3,y_train)
print(grid.best_estimator_)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] C=10, gamma=1, kernel=rbf .......................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ........................ C=10, gamma=1, kernel=rbf, total= 7.6min
[CV] C=10, gamma=1, kernel=rbf .......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.6min remaining:    0.0s


[CV] ........................ C=10, gamma=1, kernel=rbf, total= 7.3min
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total= 7.3min
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total= 7.4min
[CV] C=10, gamma=1, kernel=rbf .......................................
[CV] ........................ C=10, gamma=1, kernel=rbf, total= 7.4min
[CV] C=10, gamma=1, kernel=sigmoid ...................................
[CV] .................... C=10, gamma=1, kernel=sigmoid, total= 7.2min
[CV] C=10, gamma=1, kernel=sigmoid ...................................
[CV] .................... C=10, gamma=1, kernel=sigmoid, total= 7.3min
[CV] C=10, gamma=1, kernel=sigmoid ...................................
[CV] .................... C=10, gamma=1, kernel=sigmoid, total= 7.3min
[CV] C=10, gamma=1, kernel=sigmoid ...................................
[CV] .

[CV] ................. C=100, gamma=0.1, kernel=sigmoid, total= 7.0min
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .................... C=100, gamma=0.01, kernel=rbf, total= 7.1min
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .................... C=100, gamma=0.01, kernel=rbf, total= 7.1min
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .................... C=100, gamma=0.01, kernel=rbf, total= 7.1min
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .................... C=100, gamma=0.01, kernel=rbf, total= 7.1min
[CV] C=100, gamma=0.01, kernel=rbf ...................................
[CV] .................... C=100, gamma=0.01, kernel=rbf, total= 7.1min
[CV] C=100, gamma=0.01, kernel=sigmoid ...............................
[CV] ................ C=100, gamma=0.01, kernel=sigmoid, total= 6.8min
[CV] C=100, gamma=0.01, kernel=sigmoid ...............................
[CV] .

[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed: 563.8min finished


SVC(C=10, gamma=0.001)
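
Because this search is so expensive, it is worth reusing its result directly instead of copying the values by hand. The sketch below rebuilds an `SVC` from `best_params_` on a small synthetic dataset (`X_demo`, `y_demo` and the reduced grid are hypothetical stand-ins for illustration). With `refit=True`, which is the default, `best_estimator_` is also already fitted on the full training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in data; the notebook itself uses x_train3 / y_train
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

# Reduced version of the grid above so the sketch runs in seconds
demo_param_grid = {'C': [10, 100], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['rbf', 'sigmoid']}
demo_search = GridSearchCV(SVC(), demo_param_grid, cv=5, refit=True)
demo_search.fit(X_demo, y_demo)

# Build the final model from the selected values rather than retyping them
tuned_svc = SVC(**demo_search.best_params_)
tuned_svc.fit(X_demo, y_demo)

print(demo_search.best_params_)
```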


###### - Training process

In [49]:
# Setting the seed:
random.seed(1234)

# Implementing the linear SVC model
model15 = LinearSVC()
print('Cross Validation Results for SVC Linear - NGRAM')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model15, x_train3, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model15.fit(x_train3, y_train)

#getting predictions from model
y_predict_15 = model15.predict(x_test3)

Cross Validation Results for SVC Linear - NGRAM
Doing 10-fold cross validation
sd(accuracy):0.022963976172296657
mean(accuracy):0.7751612903225806


In [50]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC Linear - NGRAM:')
sum(test_labels.loc[test_index_list,'male']==y_predict_15)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC Linear - NGRAM:


76.6

In [51]:
# Setting the seed:
random.seed(1234)

# Implementing the SVC model with kernel=rbf, C=0.1 and gamma=0.001
# (note: the grid search above reported C=10 and gamma=0.001 as the best estimator)
model16 = SVC(kernel="rbf", C=0.1, gamma=0.001)
print('Cross Validation Results for SVC using the tuned parameters - NGRAM:')
y_train = np.asarray(train_df['male'].values)

# cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model16, x_train3, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

# training model
model16.fit(x_train3, y_train)

# getting predictions from model
y_predict_16 = model16.predict(x_test3)

Cross Validation Results for SVC using the tuned parameters - NGRAM:
Doing 10-fold cross validation
sd(accuracy):0.0013601194237283154
mean(accuracy):0.5006451612903225


In [52]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - SVC rbf - NGRAM:')
sum(test_labels.loc[test_index_list,'male']==y_predict_16)/len(test_labels.index)*100

Accuracy on Testing Dataset - SVC rbf - NGRAM:


49.6
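
An accuracy of about 50% with an almost zero standard deviation across folds is what one would expect if the model assigned nearly every document to a single class, assuming the two classes are roughly balanced (not verified here). A minimal sketch of two quick checks, using hypothetical toy arrays and hypothetical helper names; in the notebook the inputs would be the true test labels and `y_predict_16`.

```python
import numpy as np

def majority_baseline_percent(y_true):
    """Accuracy (in %) obtained by always predicting the most frequent class."""
    counts = np.bincount(np.asarray(y_true))
    return float(counts.max() / counts.sum() * 100)

def predicted_class_counts(y_pred):
    """How many times each class label appears among the predictions."""
    values, counts = np.unique(np.asarray(y_pred), return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# Hypothetical toy example: balanced labels, predictions collapsed onto class 1
demo_truth = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
demo_pred = np.ones(10, dtype=int)

demo_baseline = majority_baseline_percent(demo_truth)
demo_counts = predicted_class_counts(demo_pred)
print(demo_baseline)  # 50.0
print(demo_counts)    # {1: 10}
```

If the prediction counts show only one class and the accuracy matches the majority baseline, the classifier has not learned a useful decision boundary for these parameter values.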

### 6. CONCLUSION <a class="anchor" id="sec_2_5"></a>

The best performance was obtained with the **Logistic Regression** model (with parameters chosen by the tuning process) on the dataset preprocessed with the **Bag of Words** approach, which reached 82.6% accuracy on the testing dataset.

In [54]:
# summary of results
print('Accuracy on Testing Dataset:') 
print('SVC tuned - TFIDF         : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_1)/len(test_labels.index)*100))
print('SVC sigmoid - TFIDF       : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_2)/len(test_labels.index)*100))
print('SVC linear - TFIDF        : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_3)/len(test_labels.index)*100))
print('Gradient Boosting - TFIDF : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_4)/len(test_labels.index)*100))
print('Random Forest - TFIDF     : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_5)/len(test_labels.index)*100))
print('Logistic Reg - TFIDF      : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_6)/len(test_labels.index)*100))
print('SVC tuned - W2V           : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_7)/len(test_labels.index)*100))
print('SVC sigmoid - W2V         : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_8)/len(test_labels.index)*100))
print('SVC linear - W2V          : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_9)/len(test_labels.index)*100))
print('Gradient Boosting - W2V   : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_10)/len(test_labels.index)*100))
print('Random Forest 1 - W2V     : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_11)/len(test_labels.index)*100))
print('Random Forest 2 - W2V     : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_12)/len(test_labels.index)*100))
print('Logistic Reg - W2V        : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_13)/len(test_labels.index)*100))
print('Logistic Reg - BagOfWords : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_14)/len(test_labels.index)*100))
print('SVC linear - BagOfWords   : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_15)/len(test_labels.index)*100))
print('SVC rbf - BagOfWords      : ' + str(sum(test_labels.loc[test_index_list,'male']==y_predict_16)/len(test_labels.index)*100))

Accuracy on Testing Dataset:
SVC tuned - TFIDF         : 79.4
SVC sigmoid - TFIDF       : 79.2
SVC linear - TFIDF        : 80.0
Gradient Boosting - TFIDF : 78.4
Random Forest - TFIDF     : 76.8
Logistic Reg - TFIDF      : 78.2
SVC tuned - W2V           : 77.60000000000001
SVC sigmoid - W2V         : 79.2
SVC linear - W2V          : 74.4
Gradient Boosting - W2V   : 75.6
Random Forest 1 - W2V     : 72.8
Random Forest 2 - W2V     : 75.0
Logistic Reg - W2V        : 69.0
Logistic Reg - BagOfWords : 82.6
SVC linear - BagOfWords   : 76.6
SVC rbf - BagOfWords      : 49.6
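
To make the ranking easier to read, the accuracies printed above can be collected into a sorted pandas Series. This is only a presentation sketch: the numbers are copied from the output above (with 77.60000000000001 written as 77.6), and the variable names are hypothetical.

```python
import pandas as pd

# Test accuracies (in %) copied from the summary output above
test_accuracy = {
    'SVC tuned - TFIDF': 79.4,
    'SVC sigmoid - TFIDF': 79.2,
    'SVC linear - TFIDF': 80.0,
    'Gradient Boosting - TFIDF': 78.4,
    'Random Forest - TFIDF': 76.8,
    'Logistic Reg - TFIDF': 78.2,
    'SVC tuned - W2V': 77.6,
    'SVC sigmoid - W2V': 79.2,
    'SVC linear - W2V': 74.4,
    'Gradient Boosting - W2V': 75.6,
    'Random Forest 1 - W2V': 72.8,
    'Random Forest 2 - W2V': 75.0,
    'Logistic Reg - W2V': 69.0,
    'Logistic Reg - BagOfWords': 82.6,
    'SVC linear - BagOfWords': 76.6,
    'SVC rbf - BagOfWords': 49.6,
}

# Sort from best to worst so the top model is on the first row
summary = pd.Series(test_accuracy, name='test_accuracy_percent').sort_values(ascending=False)
best_model = summary.idxmax()

print(summary)
print('Best model:', best_model)
```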
