## Twitter Gender Classification


In [2]:
import numpy as np
import pandas as pd
import sklearn

First we read in the data and take a look at the first few rows of the data frame:

In [3]:
# reading in the data
data = pd.read_csv('gender-classifier-DFE-791531.csv', encoding = 'latin1')

# taking a look at the data
pd.set_option('display.max_columns', 500)
data.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,description,fav_number,gender_gold,link_color,name,profile_yn_gold,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,i sing my own rhythm.,0,,08C2C2,sheezy0,,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,I'm the author of novels filled with family dr...,68,,0084B4,DavdBurnett,,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,louis whining and squealing and all,7696,,ABB8C2,lwtprettylaugh,,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.5873e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,"Mobile guy. 49ers, Shazam, Google, Kleiner Pe...",202,,0084B4,douggarland,,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,Ricky Wilson The Best FRONTMAN/Kaiser Chiefs T...,37318,,3B94D9,WilfordGemma,,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.5873e+17,,


From my understading of the data, the variable '_golden' indicates that for these profiles the gender has been verifed. So, we will take a look at how many observations are 'golden':

In [4]:
# inspect row count prior to removal of rows
print('number of rows in data: ', data['_golden'].count())

# inspect row count after removal of rows
dataGolden = data.loc[data['_golden'] == True]
print('number of rows in dataGolden: ', dataGolden['_golden'].count())

number of rows in data:  20050
number of rows in dataGolden:  50


The number of 'golden' observations is very low, so we cannot remove all the 'non-golden observations'.
Next, let's take a look at the following two variables:
- profile_yn: This variable indicates if a profile was available when contributors judged it. We will remove observations that were not available.

- gender:confidence: This variable indicates the confidence with which the profiles where judged as either male, female, or brand. We will use observations with a confidence > .80

In [5]:
# inspect row count for available profiles
print('number of rows in data: ', data['_golden'].count())

# inspect row count for available profiles after removal of 'no' values
dataAvailable = data.loc[data['profile_yn'] == 'yes']
print('number of rows in dataAvailable: ', dataAvailable['profile_yn'].count())

# count observation with confidence > .80
dataFinal = dataAvailable.loc[dataAvailable['gender:confidence'] > .80]
print('count confidence greater .80: ', (dataFinal['gender:confidence']).count())

number of rows in data:  20050
number of rows in dataAvailable:  19953
count confidence greater .80:  13939


Further, we can see that the following two variables contain text that we can also use for our prediction:
- description
- text

Before we extract features from the text in these variables we need to make sure there are no missing values in them:

In [6]:
# check if there are missing values in the dataset
print(dataFinal.isnull().any())
# remove all observations that have missing values in the variable 'description'
dataFinal = dataFinal[pd.notnull(dataFinal['description'])]

_unit_id                 False
_golden                  False
_unit_state              False
_trusted_judgments       False
_last_judgment_at         True
gender                   False
gender:confidence        False
profile_yn               False
profile_yn:confidence    False
created                  False
description               True
fav_number               False
gender_gold               True
link_color               False
name                     False
profile_yn_gold           True
profileimage             False
retweet_count            False
sidebar_color            False
text                     False
tweet_coord               True
tweet_count              False
tweet_created            False
tweet_id                 False
tweet_location            True
user_timezone             True
dtype: bool


Now that there are no more missing values in 'description', we can extract features from both variables.
To do this, we first combine both the 'text' and 'description' variables into one variable named 'alltext'. Then, we extract the features using the `TfidVectorizer`.


In [7]:
# import regular expression module, scipy, and TfidVectorizer from sklearn
import re
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# store content of 'text' and 'description' in one variable
dataFinal['alltext'] = dataFinal['text'] + dataFinal['description']

# create vectorizer instance
vectorizer = TfidfVectorizer()

# create matrix with counts of word occurences for variables 'text' and 'description'
textMatrix = vectorizer.fit_transform(dataFinal['alltext'])

Process the features that were extracted:

In [8]:
# count how many zeros occur for each word
columns = (textMatrix != 0).sum(0)
sparseColumns = sparse.csr_matrix(columns) #transform to scipy matrix

# look at shapes prior to appending the row containing the zero counts
print('Dimension of textMatrix: ', textMatrix.shape)
print('Dimension of Matrix containing zero counts: ', sparseColumns.shape)

# append row to a temporary matrix and check if it was appended
tempTextMatrix = sparse.vstack([sparseColumns, textMatrix])
print('Dimension of temp. matrix containing both: ', tempTextMatrix.shape)

Dimension of textMatrix:  (11858, 51140)
Dimension of Matrix containing zero counts:  (1, 51140)
Dimension of temp. matrix containing both:  (11859, 51140)


removing columns with lot of zero counts:

In [9]:
cooSparseColumns = sparseColumns.tocoo()

colList = sparseColumns.data.tolist()
print('Type of the object: ', type(colList))
print('') #print empty line

indexList = []

for index, value in enumerate(colList):
    if value > (11863/4):
       indexList.append(index) 


print('List indeces for which the amount of zeros in a column is > (nrows/4):')
print(indexList)
print('') #print empty line


# we see that there are only a few columns that contain many zeros
# so we will exclude these from the textMatrix and create a newTextMatrix

opIndexList=[]
for index, value in enumerate(colList):
    if value < (11863/4):
       opIndexList.append(index) 

newTextMatrix = sparse.lil_matrix(sparse.csr_matrix(tempTextMatrix)[:,opIndexList])

print('Dimension of newTextMatrix: ', newTextMatrix.shape)

# converting back to matrix csr format
newTextMatrix = newTextMatrix.tocsr()
print('Matrix type is:')
print(type(newTextMatrix))

# Finally we will remove the last row of the newTextMatrix, as this row stored the counts
# First we need to define a function that does this

def deleteRowCSR(mat, i):
    if not isinstance(mat, sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    n = mat.indptr[i+1] - mat.indptr[i]
    if n > 0:
        mat.data[mat.indptr[i]:-n] = mat.data[mat.indptr[i+1]:]
        mat.data = mat.data[:-n]
        mat.indices[mat.indptr[i]:-n] = mat.indices[mat.indptr[i+1]:]
        mat.indices = mat.indices[:-n]
    mat.indptr[i:-1] = mat.indptr[i+1:]
    mat.indptr[i:] -= n
    mat.indptr = mat.indptr[:-1]
    mat._shape = (mat._shape[0]-1, mat._shape[1])

# deleting the last row
print('')
deleteRowCSR(newTextMatrix, 11858)
print('Dimension of newTextMatrix after removal of last row: ', newTextMatrix.shape)

Type of the object:  <class 'list'>

List indeces for which the amount of zeros in a column is > (nrows/4):
[4606, 10529, 17480, 21480, 22315, 32035, 43453, 44314]

Dimension of newTextMatrix:  (11859, 51132)
Matrix type is:
<class 'scipy.sparse.csr.csr_matrix'>

Dimension of newTextMatrix after removal of last row:  (11858, 51132)


create feature matrix and outcome vector for the ML models:

In [52]:
# select relevant predictors from data frame that are not text
dataFinal=dataFinal.fillna("")
pred = dataFinal.iloc[:,[11, 13, 17, 18, 21, 25]].values

# encode these as categorical features (dummy coding)
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ohe = OneHotEncoder()
le = LabelEncoder()
pred[:, 1] = le.fit_transform(pred[:, 1])
pred[:, 3] = le.fit_transform(pred[:, 3])
pred[:, 5] = le.fit_transform(pred[:, 5])
# dummy coding
pred = ohe.fit_transform(pred)

# convert to csr matrix and append to newTextMatrix
csrpred = pred.tocsr()
newTextMatrix = sparse.hstack([csrpred, newTextMatrix])
x = newTextMatrix

# transform in order to enable slicing and remove last column (avoiding the dummy variable trap)
x_csr = x.tocsr()
x_csr = x_csr[:,:-1]

# get outcome variable and encode it
y = dataFinal.iloc[:,5].values
y = le.fit_transform(y)

Fitting ML Model:
Start with a multiclass logistic regression

In [53]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

# split into test and train data
x_train, x_test, y_train, y_test = train_test_split(x_csr, y, test_size = 0.2, random_state = 0)

# fit linear model with one regularization value (for simplicity reasons)
# to the training data
regstrength = 1e-2
reg = linear_model.LogisticRegression (penalty='l2', C=regstrength,
                                       solver ='lbfgs', multi_class = 'multinomial')
reg.fit(x_train, y_train)

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

make predictions on test data:

In [60]:
y_pred = reg.predict(x_test)
classifRate = np.mean(y_pred.ravel() == y_test.ravel()) * 100
print('Classification rate: ', classifRate)

Classification rate:  54.173693086


Do feature extraction:

In [79]:
from sklearn.feature_selection import SelectKBest

ch2 = SelectKBest(chi2, k="all")
x_train = ch2.fit_transform(x_train, y_train)
x_test = ch2.transform(x_test)

In [78]:
# fit linear model with one regularization value (for simplicity reasons)
# to the training data
reg.fit(x_train, y_train)

y_pred = reg.predict(x_test)
classifRate = np.mean(y_pred.ravel() == y_test.ravel()) * 100
print('Classification rate: ', classifRate)

Classification rate:  38.1956155143


Re-rund ML Model