# Introduction

Here we provide a tutorial making use of the Contagious Naive Bayes package as described above. 

The objective here is to predict an individual's gender through the analysis of a tweet of their's. 

The dataset itself is available online at: https://www.kaggle.com/crowdflower/twitter-user-gender-classification. 

# Preamble

We import the required packages:

In [1]:
import pandas as pd
import numpy as np

We now read the data in:

In [2]:
datafile = pd.read_csv(r"C:\Users\danie\Desktop\University\Internship/General/gender.csv",encoding='latin-1')

Having read the data in, we inspect the dataset briefly:

In [3]:
print(datafile.head())

    _unit_id  _golden _unit_state  _trusted_judgments _last_judgment_at  \
0  815719226    False   finalized                   3    10/26/15 23:24   
1  815719227    False   finalized                   3    10/26/15 23:30   
2  815719228    False   finalized                   3    10/26/15 23:33   
3  815719229    False   finalized                   3    10/26/15 23:10   
4  815719230    False   finalized                   3     10/27/15 1:15   

   gender  gender:confidence profile_yn  profile_yn:confidence  \
0    male             1.0000        yes                    1.0   
1    male             1.0000        yes                    1.0   
2    male             0.6625        yes                    1.0   
3    male             1.0000        yes                    1.0   
4  female             1.0000        yes                    1.0   

          created  ...                                       profileimage  \
0    12/5/13 1:48  ...  https://pbs.twimg.com/profile_images/414342229...  

From the above output, we are only interested within two columns within the dataset. 

Namely, 'text' and 'gender'. 

-'text' contains the content of the individual's tweet.

-'gender' contains the gender of the individual. There are 3 categories that each individual can fall within, namely 'male', 'female' and 'brand'. 

For the purpose of this tutorial, we will ignore users who are tagged as a brand as the package was designed with binary classification in mind. 

# Preprocessing: 

We perform preprocessing to convert the observations flagged as 'male' as 0, and those flagged as female as '1' to simplify the classification. 

In [15]:
cols = ['text','gender']
datafile_conv = datafile[cols]
datafile_conv.columns = ['text','class_code']
interm = []

for i in datafile_conv['class_code']:
  if i == 'male':
    j = 0
  elif i =='female':
    j = 1
  interm.append(j)

datafile_conv['class_code'] = interm

# Sampling:

For tutorial purposes, to speed up the run time of the algorithm, rather than considering the entire dataset, we consider a subset of the data, of size 2000. 

In [16]:
datafile_conv = datafile_conv.sample(2000,random_state = 42)

Having subsetted the data, we now perform a standard test/train split of the data in order to train the model, and then validate it. This is done so making use of sklearn readily available package, whose documentation is available at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(datafile_conv['text'], datafile_conv['class_code'], test_size=0.2, random_state=42)

In [19]:
train_matrix = (X_train,y_train)
train_matrix = pd.DataFrame(train_matrix).transpose()
train_matrix_id = train_matrix.index
test_matrix = (X_test,y_test)
test_matrix = pd.DataFrame(test_matrix).transpose()

# Implementation

Having installed the package through the 'pip' command, we now utilize the package within a notebook enviroment. 

In [12]:
from Contagious_NB import Classification as func

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The package, as well as the function are now called, providing us with output. We will firstly perform the classification without making use of any normalization:

In [20]:
cnb = func.CNB(train_matrix,test_matrix, norm = False)

The Contagious Naive Bayes has executed.
The total runtime was:  174.6720905303955 seconds
The posteriors obtained are as follows: 
               0          1  Predicted  Actual
Index                                         
19518 -30.590401 -21.290037          1       0
3482   -4.922357  -4.356281          1       1
17305 -16.467453 -17.075635          0       1
9134  -29.957109 -27.062482          1       0
19435 -18.369295 -17.093212          1       0
5379   -0.303753  -0.298324          1       0
11752 -13.676371  -9.913059          1       1
19813 -11.693105 -14.263936          0       0
18037 -23.322129 -25.962303          0       1
16755 -17.510576 -24.454507          0       0
8937   -7.378762 -11.076001          0       0
20023  -7.546340 -13.919485          0       0
4639   -4.410793  -5.275328          0       0
16058 -13.279373 -18.869705          0       1
4345   -5.807669  -8.637453          0       0
19691  -2.511892  -2.511807          1       1
6224   -9.123099 -10.1

Having done the above, we now repeat the process making use of document length normalization: 

In [21]:
cnb_norm = func.CNB(train_matrix,test_matrix, norm = True)

The Contagious Naive Bayes has executed.
The total runtime was:  41.43412184715271 seconds
The posteriors obtained are as follows: 
               0          1  Predicted  Actual
Index                                         
19518 -25.249704 -15.647730          1       0
3482   -3.293720  -2.689334          1       1
17305  -9.787161 -10.324267          0       1
9134  -16.499229 -19.649555          0       0
19435 -14.066342 -12.535494          1       0
5379   -0.303196  -0.298875          1       0
11752 -12.262596 -11.247985          1       1
19813  -7.402014 -10.060195          0       0
18037 -18.475814 -24.220325          0       1
16755 -11.441411 -18.325111          0       0
8937   -4.635231  -7.867263          0       0
20023  -4.892830 -11.295643          0       0
4639   -2.764030  -3.548932          0       0
16058 -10.335930 -15.238321          0       1
4345   -4.039880  -6.868240          0       0
19691  -1.531674  -1.657447          0       1
6224   -5.602019  -6.6

The above shows us that the choice between whether or not to make use of document length normalization in this case is a subjective one. Which can be fine tuned through the use of the additional arguments available. 