# FIT5149 S1 2020 Assessment 2: Authorship Profiling


### Group information:
- Group Number: <b>46</b>
- Group Members: <b>Camilla Maulidina Shaquila</b>, <b>Ng Jade Kuan</b>, <b>Anish Rajan</b>

### Programming Environment:
- Python 3.7.1
- Jupyter Notebook

# Table of Contents
1. [Library Import](#import)<br><br>
2. [Dataset Reading](#read)<br><br>
3. [Preprocessing](#prep1)<br>
    3.1. [Batch processing of data](#prep1)<br>
    3.2. [Generalization](#prep2)<br>
    3.3. [Removal of unnecessary elements](#prep3)<br>
    3.4. [Tokenization, Vectorization and Feature Selection](#vec)<br><br>
4. [Model Development and Prediction](#mod)<br>
    4.1. [Logistic Regression](#LR)<br>
    4.2. [Support Vector Classifier](#SVC)<br>
    4.3. [XGBoost](#XG)<br>
    4.4. [Basic Ensemble Model](#bEn)<br>
5. [CSV File Creation](#csv)<br>

<h2 style="color:Purple"> Import necessary libraries</h2> <a name="import"></a>

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import sklearn
import xml.etree.ElementTree as et
from nltk.tokenize import TweetTokenizer
import os
import re
from bs4 import BeautifulSoup
from html import unescape
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\araja\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<h2 style="color:Purple">Load the dataset</h2><a name="read"></a>

In [2]:
train = pd.read_csv("train_labels.csv")                             # load 'train_labels.csv' as a dataframe
test_labels = pd.read_csv("test_labels.csv")                        # load 'test_labels.csv' as a dataframe
test_gender = test_labels.gender.tolist()                           # get the test labels as a list

<h2 style="color:Purple">Pre-processing #1</h2><a name="prep1"></a>

### In the cell below,
- Batch processing is being performed on the training and testing documents
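The batch key used in the cell below, `np.arange(len(train)) // 400`, assigns row `i` to batch `i // 400`, so consecutive rows end up in the same group. A minimal plain-Python sketch of the same idea follows; the helper names and the sample ids are hypothetical and only serve as illustration.

```python
def batch_ids(n_rows, batch_size):
    """Return the batch number of each row position (row i goes to batch i // batch_size)."""
    return [i // batch_size for i in range(n_rows)]


def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[start:start + batch_size] for start in range(0, len(items), batch_size)]


ids = ["a1", "b2", "c3", "d4", "e5"]      # hypothetical author ids
print(batch_ids(len(ids), 2))              # [0, 0, 1, 1, 2]
print(split_into_batches(ids, 2))          # [['a1', 'b2'], ['c3', 'd4'], ['e5']]
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size.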

In [3]:
%%time
# initialize 6 empty lists
file_name = []
tw = []
gen1 = []
file_name2 = []
tw2 = []
gen2 = []

def batch_func_train(batch):
    for file in batch['id']:
        f_name = ".\\data\\" + str(file) + ".xml"                   # get the name of the file     
        f_obj = et.parse(f_name)                                    # use ElementTree to parse the file
        tweets = f_obj.findall('documents/document')                # get the content present within the
                                                                    # '<document></document>' tags

        # get the gender value corresponding to each file
        gen = str(batch[batch['id'] == file]['gender'])
        # check whether the gender is female or male, and
        # update the variable accordingly
        if 'female' in gen:
            gen = 'female'
        else:
            gen = 'male'

        # append the extracted file name, document (tweet)
        # and gender to the corresponding lists
        for tweet in tweets:
            file_name.append(f_name)
            tw.append(tweet.text)
            gen1.append(gen)

def batch_func_test(batch):
    for file in batch['id']:
        f_name = ".\\data\\" + str(file) + ".xml"                   # get the name of the file     
        f_obj = et.parse(f_name)                                    # use ElementTree to parse the file
        tweets = f_obj.findall('documents/document')                # get the content present within the
                                                                    # '<document></document>' tags
            
        # get the gender value corresponding to each file
        gen = str(batch[batch['id'] == file]['gender'])
        # check whether the gender is female or male, and
        # update the variable accordingly
        if 'female' in gen:
            gen = 'female'
        else:
            gen = 'male'    
        
        
        for tweet in tweets:
            file_name2.append(f_name)
            tw2.append(tweet.text)
            gen2.append(gen)


# create batches of 400 records of the training data and pass it through the
# 'batch_func_train' function
for g, df in train.groupby(np.arange(len(train)) // 400):
        batch_func_train(df)

# create batches of 250 records of the test data and pass it through the
# 'batch_func_test' function
for g, df in test_labels.groupby(np.arange(len(test_labels)) // 250):
        batch_func_test(df)

d = {'File Name' : file_name, 'Tweet' : tw, 'Gender' : gen1}
train_df = pd.DataFrame(d)

e = {'id' : file_name2, 'Tweet' : tw2, 'Gender' : gen2}
test_df = pd.DataFrame(e)

Wall time: 3.5 s
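The parsing step in the cell above can be tried on a small in-memory document. This is a sketch only: the XML layout below (an `<author>` root with `<documents>/<document>` children wrapped in CDATA) is an assumption inferred from the `'documents/document'` path, and `extract_tweets` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as et

# Assumed file layout, inferred from the 'documents/document' search path
SAMPLE_XML = """<author lang="en">
  <documents>
    <document><![CDATA[First tweet &amp; more]]></document>
    <document><![CDATA[Second tweet]]></document>
  </documents>
</author>"""


def extract_tweets(xml_text):
    """Return the text of every <document> element under <documents>."""
    root = et.fromstring(xml_text)
    return [doc.text for doc in root.findall('documents/document')]


tweets = extract_tweets(SAMPLE_XML)
print(tweets)
```

If the files do use CDATA sections, HTML entities such as `&amp;` reach the notebook undecoded, which is why a later cell calls `unescape` on the text.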


<h2 style="color:Purple">Pre-processing #2</h2><a name="prep2"></a>

### In the cell below,
- File names are being normalized
- Duplicate instances of Gender are being removed
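Because every tweet row carries its author's gender, joining the rows of one file yields a string such as `"male, male, male"`. The de-duplication can be sketched in plain Python as below; `drop_duplicate_labels` is a hypothetical stand-in that uses `sorted(set(...))` in place of `np.unique`, which also returns sorted unique values.

```python
def drop_duplicate_labels(joined):
    """Split on ', ', keep each distinct label once (sorted), and join back."""
    return ', '.join(sorted(set(joined.split(', '))))


print(drop_duplicate_labels("male, male, male"))        # male
print(drop_duplicate_labels("female, male, female"))    # female, male
```

A file with conflicting labels would keep both values, which makes such inconsistencies visible instead of silently picking one.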

In [4]:
%%time
# FOR TRAIN DATA
# strip the ".\data\" prefix and the ".xml" suffix to recover the original file names
train_df['File Name'] = train_df['File Name'].str.replace(r"^\.\\data\\", '', regex=True)
train_df['File Name'] = train_df['File Name'].str.replace(r"\.xml$", '', regex=True)


# group the data by the file name
train_df_group = train_df.groupby('File Name').agg({'Tweet': ', '.join, 'Gender': ', '.join}).reset_index()

# define a function to split the string by ', ', drop duplicates and join back
def drop_duplicates(row):
    words = row.split(', ')
    return ', '.join(np.unique(words).tolist())

#  remove duplicate instances of the gender in each row using the 'drop_duplicates' function
train_df_group['Gender'] = train_df_group['Gender'].apply(drop_duplicates)


# FOR TEST DATA
# strip the ".\data\" prefix and the ".xml" suffix to recover the original file names
test_df['id'] = test_df['id'].str.replace(r"^\.\\data\\", '', regex=True)
test_df['id'] = test_df['id'].str.replace(r"\.xml$", '', regex=True)

# group the data by the file name
test_df_group = test_df.groupby('id').agg({'Tweet': ', '.join, 'Gender': ', '.join}).reset_index()

test_df_group['Gender'] = test_df_group['Gender'].apply(drop_duplicates)

Wall time: 1.39 s


<h2 style="color:Purple">Pre-processing #3</h2><a name="prep3"></a>

### In the cell below, the following pre-processing is being performed:
- Removal of `hashtags`
- Normalization of `unicode characters`
- Removal of `mentions`
- Removal of `links`
- Collapsing of `repeated whitespace`
- Removal of `non-alphabetic` characters
- Conversion of tokens into `lowercase`
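The steps listed above can be sketched as a single function applied to one tweet. This is a simplified illustration using only the standard library: it decodes HTML entities with `html.unescape` and leaves out the BeautifulSoup pass used in the notebook; `clean_tweet_text` and the sample tweet are hypothetical.

```python
import re
from html import unescape


def clean_tweet_text(text):
    """Lowercase a tweet and strip mentions, links, hashtags and non-letters."""
    text = text.lower()
    text = re.sub(r"(?:@|https?://)\S+", "", text)   # remove mentions and links
    text = re.sub(r"#\w+", "", text)                  # remove hashtags
    text = unescape(text)                             # decode HTML entities such as &amp;
    text = re.sub(r"[^a-z ]", "", text)               # keep only letters and spaces
    return " ".join(text.split())                     # collapse repeated whitespace


sample = "Loving the new #Python release!! Thanks @guido &amp; team https://example.com/x"
print(clean_tweet_text(sample))   # loving the new release thanks team
```

Note that the whole hashtag is removed, including its word, so topical information carried by hashtags does not reach the vectorizer.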

In [5]:
%%time
# initialize two empty lists
train_docs = []
test_docs = []

# the loop converts the text into a more generalized format (pre-processing)
for file in train_df_group['File Name']:
    string_list = train_df_group[train_df_group['File Name'] == file]['Tweet'].tolist()  # get the tweets of a file
    
    rem_links_mentions = re.sub(r"(?:\@|https?\://)\S+", '', str(string_list).lower())   # make tweets lowercase and
                                                                                         # remove twitter mentions, links
        
    rem_hashtags = re.sub(r"(?:\#\w+)", '', rem_links_mentions)                          # remove hashtags
    
    convert_unicode_chars = BeautifulSoup(unescape(rem_hashtags), 'lxml')                # convert unicode characters
                                                                                         # to their original form
        
    result_str = convert_unicode_chars.text                                              # get the converted text as a 
                                                                                         # string
        
    keep_words = re.sub(r"[^a-zA-Z ]", '', result_str)                                   # drop every character that is not
                                                                                         # a letter or a space
        
    keep_words = " ".join(keep_words.split())                                            # collapse repeated whitespace

    train_docs.append(keep_words)                                                        # append the result to 'train_docs'



# the loop converts the text into a more generalized format (pre-processing)
for file in test_df_group['id']:
    string_list = test_df_group[test_df_group['id'] == file]['Tweet'].tolist()           # get the tweets of a file
    
    rem_links_mentions = re.sub(r"(?:\@|https?\://)\S+", '', str(string_list).lower())   # make tweets lowercase and
                                                                                         # remove twitter mentions, links
        
    rem_hashtags = re.sub(r"(?:\#\w+)", '', rem_links_mentions)                          # remove hashtags
    
    convert_unicode_chars = BeautifulSoup(unescape(rem_hashtags), 'lxml')                # convert unicode characters
                                                                                         # to their original form
    
    result_str = convert_unicode_chars.text                                              # get the converted text as a 
                                                                                         # string
        
    keep_words = re.sub(r"[^a-zA-Z ]", '', result_str)                                   # drop every character that is not
                                                                                         # a letter or a space
        
    keep_words = " ".join(keep_words.split())                                            # collapse repeated whitespace

    test_docs.append(keep_words)                                                         # append the result to 'test_docs'

Wall time: 4.41 s


<h2 style="color:Purple">Vectorization and Feature Selection</h2><a name="vec"></a>

### In the cell below, the Tf-Idf Vectorizer is used with the following settings:
- Conversion of tokens into <b>lowercase</b> (done before vectorization, so the vectorizer's own `lowercase` option is turned off)
- Removal of <b>stop words</b> using the list of stop words provided by `NLTK`
- Removal of terms appearing in fewer than <b>5%</b> of the documents
- Removal of terms appearing in more than <b>95%</b> of the documents
- Creation of <b>Unigrams</b>, <b>Bigrams</b> and <b>Trigrams</b>
- Use of <b>TweetTokenizer()</b> to tokenize each document
- Selection of the <b>top 1800</b> features, ranked by term frequency across the training documents
<br/><br/>

The vocabulary and idf weights are learned from the training data, which is then transformed into a document-term matrix. The test data is transformed into a document-term matrix using that same vocabulary.
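To make the document-frequency thresholds concrete, here is a rough plain-Python sketch of n-gram generation followed by `min_df` / `max_df` filtering on already tokenized documents. It is an illustration of the idea, not the scikit-learn implementation: the function names and the toy corpus are hypothetical, and stop-word removal, idf weighting and the `max_features` cut are left out.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def filter_by_document_frequency(docs, min_df, max_df):
    """Keep n-grams whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens)))   # count each n-gram once per document
    return sorted(t for t, c in doc_freq.items() if min_df <= c / n_docs <= max_df)


toy_docs = [
    ["good", "morning", "world"],
    ["good", "night", "world"],
    ["good", "morning", "all"],
    ["good", "luck"],
]
vocab = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.95)
print(vocab)   # ['good morning', 'morning', 'world']
```

Here `good` is dropped because it appears in every document, while n-grams seen in only one of the four documents fall below the lower threshold.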

In [6]:
%%time
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True) # use TweetTokenizer() to tokenize every document
for idx in range(len(train_docs)):
    train_docs[idx] = train_docs[idx].lower()                   # make it lowercase
    train_docs[idx] = tokenizer.tokenize(train_docs[idx])       # split into words


for idx in range(len(test_docs)):
    test_docs[idx] = test_docs[idx].lower()                     # make it lowercase
    test_docs[idx] = tokenizer.tokenize(test_docs[idx])         # split into words


def identity_tokenizer(text):
    return text



encoder = LabelEncoder()                                        # get LabelEncoder()
y_train = encoder.fit_transform(train_df_group['Gender'])       # fit label encoder and return encoded labels for Gender
true_labels = encoder.transform(test_df_group['Gender'])        # encode the test labels with the mapping learned on train


vectorizer = TfidfVectorizer(analyzer='word',input='content',
                           lowercase=False,
                           stop_words= en_stop,                 # remove stop words
                           min_df=0.05,                         # remove words appearing in less than 5% of the documents
                           max_df=0.95,                         # remove words appearing in more than 95% of the documents
                           ngram_range=(1,3),                   # create unigrams, bigrams and trigrams
                           tokenizer=identity_tokenizer,        # tokenization
                           max_features = 1800)                 # feature selection

x_train = vectorizer.fit_transform(train_docs)                  # learn vocabulary and idf, return document-term matrix for
                                                                # train data

x_test = vectorizer.transform(test_docs)                        # apply the learned vocabulary and idf, return document-term
                                                                # matrix for test data

Wall time: 30.7 s


<h2 style="color:Purple">Model Development</h2><a name="mod"></a>

### The following models have been used:
- Logistic Regression
- Support Vector Classifier
- XGBoost
- Basic Ensemble Model (<b>Used to generate the output csv file</b>)*


\* Please note that to run the Basic Ensemble Model, the Logistic Regression, SVC and XGBoost models need to run. This is why those 3 models are included in this program.

<h3 style="color:Green">Logistic Regression</h3><a name="LR"></a>

#### In the cell below, Logistic Regression is being implemented

In [7]:
%%time
# load the LogisticRegression() class
LR_model = LogisticRegression()

# fit the training data to the model
LR_model.fit(x_train,y_train)

# use the model to predict the labels of the test data
pred_LR = LR_model.predict(x_test)

# compute the accuracy (in percent) on the test labels
acc_test_LR = accuracy_score(true_labels, pred_LR) * 100
print("The Accuracy on the test labels is: ", acc_test_LR)

The Accuracy on the test labels is:  81.0
Wall time: 88.8 ms
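The reported figure is the share of test documents whose predicted label matches the true label, scaled to percent. A minimal sketch of that calculation follows; `accuracy_percent` is a hypothetical helper and the label lists are made up for illustration.

```python
def accuracy_percent(y_true, y_pred):
    """Return the percentage of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


print(accuracy_percent([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))   # 80.0
```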


<h3 style="color:Green">Support Vector Classifier</h3><a name="SVC"></a>

#### In the cell below, Support Vector Classifier is being implemented with a polynomial kernel of degree 2 and the regularization parameter `C` set to 7

In [8]:
%%time
# load the SVC() class
SVC_model = SVC(kernel='poly', degree=2, C=7)

# fit the training data to the model
SVC_model.fit(x_train,y_train)

# use the model to predict the labels of the test data
pred_SVC = SVC_model.predict(x_test)

# compute the accuracy (in percent) on the test labels
acc_test_SVC = accuracy_score(true_labels, pred_SVC) * 100
print("The Accuracy on the test labels is: ", acc_test_SVC)

The Accuracy on the test labels is:  78.0
Wall time: 21.8 s


<h3 style="color:Green">XGBoost</h3><a name="XG"></a>

#### In the cell below, XGBoost is being implemented

In [9]:
%%time
# load the xgboost class


XGB_model = XGBClassifier(booster='gbtree', eta=0.05, reg_lambda = 0.1)

# fit the training data to the model
XGB_model.fit(x_train, y_train) 

# use the model to predict the labels of the test data
pred_XGB = XGB_model.predict(x_test)

# compute the accuracy (in percent) on the test labels
acc_test_XGB = accuracy_score(true_labels, pred_XGB) * 100
print("The Accuracy on the test labels is: ", acc_test_XGB)

The Accuracy on the test labels is:  75.8
Wall time: 4.02 s


<h3 style="color:Green">Basic Ensemble Model</h3><a name="bEn"></a>

#### In the cell below, the Basic Ensemble Model is being implemented as a stacking classifier over the three models above

In [10]:
%%time
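# stack the three models above; with no final_estimator given, StackingClassifier
# uses logistic regression to combine their cross-validated predictions (cv = 6)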
ensemble_model = sklearn.ensemble.StackingClassifier(cv = 6, 
                                                     estimators=[('lr', LR_model), ('svc', SVC_model), ('xgb', XGB_model)])

# fit the training data to the model
ensemble_model.fit(x_train,y_train)

# use the model to predict the labels of the test data
pred_ensemble = ensemble_model.predict(x_test)

# compute the accuracy (in percent) on the test labels
acc_test_ens = accuracy_score(true_labels, pred_ensemble) * 100
print("The Accuracy on the test labels is: ", acc_test_ens)

The Accuracy on the test labels is:  81.8
Wall time: 2min 23s


<h2 style="color:Purple">CSV Creation</h2><a name="csv"></a>

#### In the cell below, a CSV file titled "pred_labels.csv" is being created from the prediction results generated by the Basic Ensemble Model
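`LabelEncoder` assigns integer codes in sorted class order, so with the two classes used here `female` becomes 0 and `male` becomes 1, and the model's predictions are those integer codes. The sketch below shows one way to map codes back to label strings before writing the `id, gender` file. It uses only the standard library and writes to an in-memory buffer; the helper names and sample ids are hypothetical.

```python
import csv
import io

# Classes in sorted order, matching how LabelEncoder numbers them
CLASSES = ["female", "male"]


def decode_labels(codes, classes=CLASSES):
    """Map integer codes back to their label strings."""
    return [classes[c] for c in codes]


def predictions_to_csv(ids, codes):
    """Return CSV text with an 'id,gender' header and one row per prediction."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "gender"])
    writer.writerows(zip(ids, decode_labels(codes)))
    return buffer.getvalue()


csv_text = predictions_to_csv(["a1", "b2"], [1, 0])
print(csv_text)
```

In the notebook itself, `encoder.inverse_transform(pred_ensemble)` would perform the same decoding on the real predictions.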

In [11]:
# replace the values of the 'Gender' column with the predicted labels,
# decoded from the integer codes back to 'female' / 'male'
test_df_group['Gender'] = encoder.inverse_transform(pred_ensemble)

# keep only the 'id' and 'Gender' columns
test_df_group = test_df_group[['id', 'Gender']]

# rename the columns
test_df_group.columns = ['id', 'gender']

# export the dataframe as a csv file
test_df_group.to_csv("pred_labels.csv", index=False)