# FIT5149 Assessment 2: Authorship Profiling

- Student Name: Priscila Grecov
- Student ID: 29880858
- Student email: pgre0007@student.monash.edu

### Table of Contents - Part III: Implementing the Best Classifier Model 

* [Part III. Implementing the Best Classifier Model](#part_3)
* [III.0. Libraries](#sec_3_0)
* [III.1. Loading training labels](#sec_3_1)
* [III.2. Loading testing labels](#sec_3_2)
* [III.3. Modelling with the Bag of Words preprocessing datasets with NGRAM and Logistic Regression](#sec_3_3)
* [III.4. Saving the predicted dataset for submission](#sec_3_4)



## Part III - Implementing the Best Classifier Model <a class="anchor" id="part_3"></a>

### 0. Libraries <a class="anchor" id="sec_3_0"></a>

In [1]:
# libraries to be used:
from sklearn.linear_model import LogisticRegression # to run the logistic regression model
from sklearn.model_selection import cross_val_score # to calculate the cross-validation accuracy scores

import pandas as pd # to read the .csv files
import numpy as np # to load the .npy files (preprocessed datasets format)
import statistics # to calculate the mean and standard deviation of the accuracies
import random # to set the seed

### 1. Loading training labels <a class="anchor" id="sec_3_1"></a>

In [2]:
# Loading the train labels dataset
train_labels = pd.read_csv('train_labels.csv', index_col='id') 
train_labels.head() # checking the proper loading

id,gender
b91efc94c91ad3f882a612ae2682af17,male
ff91e6d4b79fc64072ae273aa3fed77e,male
7e199c5885131a2579429c07f3215cbc,female
cdc2d20d75f8187ee54caf56b2c77626,male
53259762a49f56f451605df3efa955e6,female


In [3]:
# Encoding the labels as a binary "male" column: "female" -> 0, "male" -> 1
train_labels["male"] = [0 if i == 'female' else 1 for i in train_labels["gender"]]
train_df = train_labels.drop(columns="gender")

In [4]:
# Checking the proper relabeling
train_df.head()

id,male
b91efc94c91ad3f882a612ae2682af17,1
ff91e6d4b79fc64072ae273aa3fed77e,1
7e199c5885131a2579429c07f3215cbc,0
cdc2d20d75f8187ee54caf56b2c77626,1
53259762a49f56f451605df3efa955e6,0
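
The list comprehension above sends every value other than 'female' to 1, so a misspelled or missing label would be silently counted as male. As a minimal sketch of a stricter alternative (the toy data and the names `toy_labels`, `label_map` and `toy_df` are hypothetical and not part of the assignment files), an explicit dictionary mapping makes unexpected labels visible:

```python
import pandas as pd

# Hypothetical toy labels standing in for train_labels.csv
toy_labels = pd.DataFrame(
    {"gender": ["male", "female", "male"]},
    index=pd.Index(["a1", "b2", "c3"], name="id"),
)

# Explicit mapping: any value missing from the dictionary becomes NaN
label_map = {"female": 0, "male": 1}
encoded = toy_labels["gender"].map(label_map)

# Fail early if an unexpected label slipped through
if encoded.isna().any():
    raise ValueError("unexpected gender label found")

toy_df = encoded.astype(int).to_frame(name="male")
print(toy_df)
```

For clean data this gives the same 0/1 column as the cell above; the only difference is the early failure on unknown labels.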


### 2. Loading testing labels <a class="anchor" id="sec_3_2"></a>

In [5]:
# Loading the ids of the testing dataset
test_index_list = np.load('test_index_list.npy', allow_pickle=True)

In [6]:
# Loading the test labels dataset
test_labels = pd.read_csv('test_labels.csv', index_col='id') 
test_labels.head() # checking the proper loading

id,gender
d6b08022cdf758ead05e1c266649c393,male
9a989cb04766d5a89a65e8912d448328,female
2a1053a059d58fbafd3e782a8f7972c0,male
6032537900368aca3d1546bd71ecabd1,male
d191280655be8108ec9928398ff5b563,male


In [7]:
# Encoding the labels as a binary "male" column: "female" -> 0, "male" -> 1
test_labels["male"] = [0 if i == 'female' else 1 for i in test_labels["gender"]]
test_labels = test_labels.drop(columns="gender")
test_labels.head()

id,male
d6b08022cdf758ead05e1c266649c393,1
9a989cb04766d5a89a65e8912d448328,0
2a1053a059d58fbafd3e782a8f7972c0,1
6032537900368aca3d1546bd71ecabd1,1
d191280655be8108ec9928398ff5b563,1


### 3. Modelling with the Bag of Words preprocessing datasets with NGRAM and Logistic Regression <a class="anchor" id="sec_3_3"></a>

#### 3.1. Loading Bag of Words Data with NGRAM

In [8]:
# Loading the preprocessed Bag of Words data
x_train3 = np.load('bw_train.npy', allow_pickle=True)
x_test3 = np.load('bw_test.npy', allow_pickle=True)

In [9]:
# Checking the proper loading of Bag of Words preprocessing training data
x_train3

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [3, 3, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])

In [10]:
# Checking the dimensions of training Bag of Words preprocessing data
x_train3.shape

(3100, 44624)

In [11]:
# Checking the dimensions of testing Bag of Words preprocessing data
x_test3.shape

(500, 44624)
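
Both matrices share the same 44624 columns, which is what lets a model fitted on the training data score the test data. The sketch below shows one way such checks could be automated before modelling. The helper `check_alignment` and the toy arrays are hypothetical and only illustrate the idea:

```python
import numpy as np

def check_alignment(x_train, y_train, x_test, test_ids):
    """Return True if features, labels and ids line up; raise ValueError otherwise."""
    if x_train.shape[0] != len(y_train):
        raise ValueError("x_train rows and y_train length differ")
    if x_test.shape[0] != len(test_ids):
        raise ValueError("x_test rows and number of test ids differ")
    if x_train.shape[1] != x_test.shape[1]:
        raise ValueError("train and test matrices have different numbers of features")
    return True

# Hypothetical toy arrays standing in for the real .npy files
toy_x_train = np.zeros((6, 4), dtype=int)
toy_y_train = np.array([0, 1, 0, 1, 1, 0])
toy_x_test = np.zeros((2, 4), dtype=int)
toy_test_ids = np.array(["a1", "b2"])

aligned = check_alignment(toy_x_train, toy_y_train, toy_x_test, toy_test_ids)
print(aligned)
```

In this notebook the equivalent call would presumably take `x_train3`, the training labels, `x_test3` and `test_index_list`.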

#### 3.2. Implementing the best selected model - Logistic Regression over Bag of Words dataset

The tuning process, run with the “GridSearchCV( )” function of the “sklearn” library, selected the following optimal hyperparameter values:

* penalty = l2 (default)
* C = 0.0001
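
The tuning run itself is not part of this notebook. As a rough sketch of how such a search could look, the example below runs “GridSearchCV( )” on synthetic data. The grid values, the 5-fold setting and the toy dataset are assumptions made for illustration, not the grid that was actually used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Bag of Words matrix (assumption: not the real data)
X_toy, y_toy = make_classification(n_samples=120, n_features=15, random_state=0)

# Hypothetical grid over the inverse regularisation strength C
param_grid = {"C": [0.0001, 0.01, 1.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # l2 penalty is the default
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

A smaller C means stronger regularisation, which is a plausible choice when there are far more features (44624) than training documents (3100).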

Let's run the Logistic Regression with these selected hyperparameter values. The cell below also sets fit_intercept=False and max_iter=1000.

In [12]:
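# Setting the seed of Python's built-in random module
# (scikit-learn does not read this seed; it takes random_state arguments instead)
random.seed(1234)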

# Implementing the Logistic Regression model
model14 = LogisticRegression(C=0.0001, fit_intercept=False, max_iter=1000)
print('Cross Validation Results for Logistic Regression model - NGRAM')
y_train = np.asarray(train_df['male'].values)

#cross validation
print('Doing 10-fold cross validation')
accuracies = cross_val_score(model14, x_train3, y_train, scoring='accuracy', cv=10)
print('sd(accuracy):' + str(statistics.stdev(accuracies)))
print('mean(accuracy):' + str(statistics.mean(accuracies)))

#training model
model14.fit(x_train3, y_train)

#getting predictions from model
y_predict_14 = model14.predict(x_test3)

Cross Validation Results for Logistic Regression model - NGRAM
Doing 10-fold cross validation
sd(accuracy):0.023937715456936577
mean(accuracy):0.8058064516129032
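
When “cross_val_score( )” receives an integer cv and a classifier, it uses stratified folds without shuffling, so the scores above do not depend on any seed. If shuffled folds were wanted, a reproducible setup could look like the sketch below. The synthetic data and the names `splitter`, `scores_a` and `scores_b` are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (assumption: not the real data)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Shuffled, stratified 10-fold splitter with a fixed seed for reproducibility
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)
clf = LogisticRegression(C=0.0001, fit_intercept=False, max_iter=1000)

# Running the same evaluation twice to show that the folds are reproducible
scores_a = cross_val_score(clf, X_toy, y_toy, scoring="accuracy", cv=splitter)
scores_b = cross_val_score(clf, X_toy, y_toy, scoring="accuracy", cv=splitter)

print("mean(accuracy):", scores_a.mean())
print("sd(accuracy):", scores_a.std(ddof=1))
```

Here `std(ddof=1)` computes the sample standard deviation, the same quantity that `statistics.stdev` reports in the cell above.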


In [13]:
# Validating predictions on test dataset
print('Accuracy on Testing Dataset - Logistic Regression model - NGRAM:')
sum(test_labels.loc[test_index_list,'male']==y_predict_14)/len(test_labels.index)*100

Accuracy on Testing Dataset - Logistic Regression model - NGRAM:


82.6
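
The accuracy above is computed by hand after reordering the true labels to match the prediction order. A hedged sketch of the same check with “sklearn.metrics” follows. It also prints a confusion matrix, which shows whether the errors fall more on one class. The toy ids, labels and predictions are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels indexed by id, as in test_labels
toy_truth = pd.DataFrame(
    {"male": [1, 0, 1, 0]},
    index=pd.Index(["a", "b", "c", "d"], name="id"),
)
# Hypothetical prediction order and predictions, as in test_index_list / y_predict_14
toy_order = np.array(["c", "a", "d", "b"])
toy_pred = np.array([1, 1, 0, 1])

# Reordering the true labels so that row i matches prediction i
y_true = toy_truth.loc[toy_order, "male"].to_numpy()

acc = accuracy_score(y_true, toy_pred) * 100
cm = confusion_matrix(y_true, toy_pred, labels=[0, 1])  # rows: true, columns: predicted

print(acc)
print(cm)
```

Here three of the four toy predictions match, so `acc` is 75.0, and the single error is a female (0) document predicted as male (1).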

### 4. Saving the predicted dataset for submission <a class="anchor" id="sec_3_4"></a>

In [14]:
# checking the format of dataset with the predicted values
y_predict_14

array([0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,

In [15]:
# transforming the predicted values back to "female" and "male" labels, as requested by the assignment specification
y_predict_trasl = ['female' if i == 0 else 'male' for i in y_predict_14]
y_predict_trasl[0:6] # checking the proper transformation

['female', 'female', 'female', 'male', 'female', 'male']

In [16]:
# Building the predicted labels dataset for submission
y_predict_trasl_df = pd.DataFrame(list(zip(test_index_list, y_predict_trasl)), columns=['id', 'gender'])

In [17]:
# Checking that the predicted labels dataset for submission was built properly
y_predict_trasl_df

,id,gender
0,8ebb5b1633c16c5636f24bbfb70d26bb,female
1,22123328bbf3e81446d92641898c692f,female
2,79622e0db1c3b045b76a5907256dc84c,female
3,4799e811e7c2bf342a9f9fb062295456,male
4,64ab928caec0a27625c47c6bf261e475,female
...,...,...
495,54a209cddb213c282a76d87dc671ba53,male
496,59926d8bd9c721953a0cd95626913bf6,male
497,9f237ee76e2dfc2e73f7f55a69b634f5,female
498,97bd3c7963be72200baf45d3a4268a74,male


In [18]:
# Saving the predicted labels dataset for submission
y_predict_trasl_df.to_csv('pred_labels.csv', index=False)
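
Before submitting, it can help to read the saved file back and confirm its layout. The sketch below does this on a toy frame written to a temporary folder. The toy rows and the check names are hypothetical, and the expected columns are taken from the cell above rather than from the assignment specification itself:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy submission with the same columns as y_predict_trasl_df
toy_submission = pd.DataFrame(
    {"id": ["a1", "b2", "c3"], "gender": ["female", "male", "female"]}
)

# Writing to a temporary folder so that no real file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pred_labels.csv")
    toy_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Basic layout checks on the reloaded file
columns_ok = list(reloaded.columns) == ["id", "gender"]
labels_ok = set(reloaded["gender"]) <= {"female", "male"}
ids_unique = reloaded["id"].is_unique

print(columns_ok, labels_ok, ids_unique)
```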