# Decoding gender from resumes
This Jupyter notebook tries to predict a person's gender from their choice of words in a resume. The accuracy of several models is compared throughout the notebook.

## Importing the dataset of resumes

In [1]:
import pandas as pd
import re 
df = pd.read_csv('resumes3.csv')
df.head()

Unnamed: 0,name,gender,text
0,Amy,female,PROFILE\nFund accountant with nearly 2 years o...
1,Ben,male,Fund Accoutant\tSep 2016 - Present\n(Citco Fun...
2,Carrie,female,Professional Experience\nCitco Fund Services (...
3,Dickson,male,PROFESSIONAL EXPERIENCE\t\n\nConifer Financial...
4,Edwardo,male,QUALIFICATION SUMMARY\n\n•\tResults-driven ach...


### Removing all special characters from the text

In [2]:
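# Replace every run of non-alphanumeric characters with a single space.
# regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default.
df['text'] = df['text'].str.replace('[^A-Za-z0-9]+', ' ', regex=True)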
df.head()

Unnamed: 0,name,gender,text
0,Amy,female,PROFILE Fund accountant with nearly 2 years of...
1,Ben,male,Fund Accoutant Sep 2016 Present Citco Fund Ser...
2,Carrie,female,Professional Experience Citco Fund Services Si...
3,Dickson,male,PROFESSIONAL EXPERIENCE Conifer Financial Serv...
4,Edwardo,male,QUALIFICATION SUMMARY Results driven achiever ...
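
To make the effect of the pattern concrete, here is a minimal sketch on a single string. The sample text is modeled on the raw preview above and is only an illustration; `raw_sample` and `cleaned_sample` are hypothetical names, not part of the notebook.

```python
import re

# Sample string modeled on the raw resume text in the first preview (illustrative only).
raw_sample = "Fund Accoutant\tSep 2016 - Present\n(Citco Fund"

# Each run of characters that are not ASCII letters or digits becomes one space,
# so tabs, newlines, hyphens and brackets all collapse into single spaces.
cleaned_sample = re.sub(r"[^A-Za-z0-9]+", " ", raw_sample)
print(cleaned_sample)  # Fund Accoutant Sep 2016 Present Citco Fund
```

Note that the character class only keeps ASCII letters and digits, so accented letters would also be replaced by spaces.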


In [3]:
df.groupby(['gender']).count()

gender,name,text
female,167,167
male,167,167


In [4]:
df = df[['gender', 'text']]
df

Unnamed: 0,gender,text
0,female,PROFILE Fund accountant with nearly 2 years of...
1,male,Fund Accoutant Sep 2016 Present Citco Fund Ser...
2,female,Professional Experience Citco Fund Services Si...
3,male,PROFESSIONAL EXPERIENCE Conifer Financial Serv...
4,male,QUALIFICATION SUMMARY Results driven achiever ...
...,...,...
329,male,CAREER OBJECTIVE To secure a challenging posit...
330,male,OBJECTIVES Being a versatile individual in tac...
331,male,SUMMARY OF QUALIFICATIONS Solid working experi...
332,male,HSBC On boarding Customer Due diligence Analys...


## Use of Twitter data
Next, Twitter data was used to find words that are associated with a person's gender.

In [5]:
df2 = pd.read_csv('twitterData.csv', encoding='latin-1')
df2.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.5873e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.5873e+17,,


In [6]:
df2 = df2[['gender','text']]
df2

Unnamed: 0,gender,text
0,male,Robbie E Responds To Critics After Win Against...
1,male,ÛÏIt felt like they were my friends and I was...
2,male,i absolutely adore when louis starts the songs...
3,male,Hi @JordanSpieth - Looking at the url - do you...
4,female,Watching Neighbours on Sky+ catching up with t...
...,...,...
20045,female,"@lookupondeath ...Fine, and I'll drink tea too..."
20046,male,Greg Hardy you a good player and all but don't...
20047,male,You can miss people and still never want to se...
20048,female,@bitemyapp i had noticed your tendency to pee ...


## Merge twitter data with resume data
This was done to gather more data (words) for both genders.

In [7]:
resultdf = pd.concat([df,df2])
resultdf = resultdf.dropna()
#only save rows with people that are of a confirmed gender (just in case)
resultdf = resultdf[(resultdf['gender'] == 'female') | (resultdf['gender'] == 'male')]
resultdf

Unnamed: 0,gender,text
0,female,PROFILE Fund accountant with nearly 2 years of...
1,male,Fund Accoutant Sep 2016 Present Citco Fund Ser...
2,female,Professional Experience Citco Fund Services Si...
3,male,PROFESSIONAL EXPERIENCE Conifer Financial Serv...
4,male,QUALIFICATION SUMMARY Results driven achiever ...
...,...,...
20045,female,"@lookupondeath ...Fine, and I'll drink tea too..."
20046,male,Greg Hardy you a good player and all but don't...
20047,male,You can miss people and still never want to se...
20048,female,@bitemyapp i had noticed your tendency to pee ...


In [8]:
# Remove all digits from the text.
# Writing to `row` inside iterrows() is not guaranteed to modify the DataFrame,
# so a vectorized string replace is used instead of a row-by-row loop.
resultdf['text'] = resultdf['text'].str.replace(r'\d', '', regex=True)

resultdf

Unnamed: 0,gender,text
0,female,PROFILE Fund accountant with nearly years of ...
1,male,Fund Accoutant Sep Present Citco Fund Service...
2,female,Professional Experience Citco Fund Services Si...
3,male,PROFESSIONAL EXPERIENCE Conifer Financial Serv...
4,male,QUALIFICATION SUMMARY Results driven achiever ...
...,...,...
20045,female,"@lookupondeath ...Fine, and I'll drink tea too..."
20046,male,Greg Hardy you a good player and all but don't...
20047,male,You can miss people and still never want to se...
20048,female,@bitemyapp i had noticed your tendency to pee ...


## Stemming the data to improve accuracy

In [9]:
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

resultdf['unstemmed'] = resultdf['text'].str.split()
resultdf['unstemmed']

0        [PROFILE, Fund, accountant, with, nearly, year...
1        [Fund, Accoutant, Sep, Present, Citco, Fund, S...
2        [Professional, Experience, Citco, Fund, Servic...
3        [PROFESSIONAL, EXPERIENCE, Conifer, Financial,...
4        [QUALIFICATION, SUMMARY, Results, driven, achi...
                               ...                        
20045    [@lookupondeath, ...Fine,, and, I'll, drink, t...
20046    [Greg, Hardy, you, a, good, player, and, all, ...
20047    [You, can, miss, people, and, still, never, wa...
20048    [@bitemyapp, i, had, noticed, your, tendency, ...
20049    [I, think, for, my, APUSH, creative, project, ...
Name: unstemmed, Length: 13228, dtype: object

In [10]:
resultdf['text'] = resultdf['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
resultdf = resultdf.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
resultdf # Print dataframe.

Unnamed: 0,gender,text
0,female,"[profil, fund, account, with, near, year, of, ..."
1,male,"[fund, accout, sep, present, citco, fund, serv..."
2,female,"[profession, experi, citco, fund, servic, sing..."
3,male,"[profession, experi, conif, financi, servic, m..."
4,male,"[qualif, summari, result, driven, achiev, and,..."
...,...,...
20045,female,"[@lookupondeath, ...fine,, and, i'll, drink, t..."
20046,male,"[greg, hardi, you, a, good, player, and, all, ..."
20047,male,"[you, can, miss, peopl, and, still, never, wan..."
20048,female,"[@bitemyapp, i, had, notic, your, tendenc, to,..."


In [11]:
resultdf['text'] = resultdf['text'].str.join(" ")
resultdf

Unnamed: 0,gender,text
0,female,profil fund account with near year of experi i...
1,male,fund accout sep present citco fund servic sing...
2,female,profession experi citco fund servic singapor p...
3,male,profession experi conif financi servic may pre...
4,male,qualif summari result driven achiev and high m...
...,...,...
20045,female,"@lookupondeath ...fine, and i'll drink tea too..."
20046,male,greg hardi you a good player and all but don't...
20047,male,you can miss peopl and still never want to see...
20048,female,@bitemyapp i had notic your tendenc to pee on ...
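
As a small illustration of what the stemming step does, the sketch below stems a few words taken from the resume previews above. It assumes `nltk` is installed; `demo_stemmer`, `demo_words` and `demo_stems` are hypothetical names used only here.

```python
from nltk.stem.snowball import SnowballStemmer

demo_stemmer = SnowballStemmer("english")

# Words taken from the unstemmed previews earlier in the notebook.
demo_words = ["Professional", "Experience", "Services", "nearly"]

# stem() lowercases each word and strips common English suffixes.
demo_stems = [demo_stemmer.stem(word) for word in demo_words]
print(demo_stems)  # ['profession', 'experi', 'servic', 'near']
```

Because the text is split on whitespace only, tokens from the tweets such as `@lookupondeath` or `...fine,` keep their punctuation, as the previews show. Cleaning the tweets the same way as the resumes before stemming might reduce the vocabulary, but that has not been tested here.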


## Multinomial NB

In [12]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = resultdf['text'].values.astype('U') # Take the text from the df and convert it to Unicode
vect = CountVectorizer(stop_words='english') # Create the CountVectorizer object, with English stop words
vect = vect.fit(text) # Learn the vocabulary from the resume and tweet text
# Get the words from the vocabulary (in scikit-learn >= 1.2 this method is called get_feature_names_out())
feature_names = vect.get_feature_names()
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")
docu_feat = vect.transform(text) # transform() builds the document-term count matrix
docu_feat

There are 31627 words in the vocabulary. A selection: ['accountingadvanc', 'accountingg', 'accountingintermedi', 'accountright', 'accounts', 'accountserv', 'accountsprincipl', 'accountstaff', 'accout', 'accpac', 'accplus', 'accredit', 'accreditationsand', 'accret', 'accru', 'accrual', 'acct', 'acctrak', 'accu', 'accumul']


<13228x31627 sparse matrix of type '<class 'numpy.int64'>'
	with 193796 stored elements in Compressed Sparse Row format>

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tf = tfidf_transformer.fit_transform(docu_feat)
X_train_tf.eliminate_zeros()
X_train_tf.shape

(13228, 31627)

In [14]:
y = resultdf['gender'] # Gender is the target variable
X = X_train_tf  # The TF-IDF matrix of all documents is the feature matrix
X 

<13228x31627 sparse matrix of type '<class 'numpy.float64'>'
	with 193796 stored elements in Compressed Sparse Row format>

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# The following was tried but did not improve the accuracy and was therefore later commented out
# from sklearn.model_selection import KFold
# kf = KFold(n_splits=2)
# for train_index, test_index in kf.split(X):
#     print("TRAIN:", train_index, "TEST:", test_index)
#     X_train, X_test = X[train_index], X[test_index]
#     y_train, y_test = y[train_index], y[test_index]
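
The commented-out loop above overwrites the train/test split on each fold rather than averaging scores over folds. In addition, `y[train_index]` on a pandas Series selects by index label, and after `pd.concat` the labels are no longer unique, so it may not pick the intended rows (`y.iloc[train_index]` selects by position). A minimal sketch of cross-validation with `cross_val_score` follows. The toy corpus, labels and variable names are hypothetical stand-ins for `resultdf['text']` and `resultdf['gender']`, and wrapping the vectorizer in a pipeline is a suggestion, not what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the stemmed text and gender columns.
toy_texts = [
    "fund account experi servic", "profession experi citco fund",
    "qualif summari result driven", "career object secur posit",
    "good player miss peopl", "drink tea notic tendenc",
    "watch neighbour catch sky", "creativ project think song",
]
toy_labels = ["female", "male", "female", "male",
              "female", "male", "female", "male"]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer. Inside a pipeline
# it is refit on the training part of every fold, so no test text leaks into the vocabulary.
toy_model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())

toy_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
toy_scores = cross_val_score(toy_model, toy_texts, toy_labels, cv=toy_cv, scoring="accuracy")
print(toy_scores.mean())
```

Cross-validation does not make a model more accurate by itself; it gives a more stable estimate of the accuracy than a single split.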

In [16]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() #clf = classifier
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### How well did the model perform?

In [17]:
from sklearn.metrics import confusion_matrix
y_test_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred)
cm

array([[1588,  468],
       [1116,  797]])

In [18]:
clf.classes_

array(['female', 'male'], dtype='<U6')

In [19]:
# To make it easier to read, turn it into a DataFrame and add labels.
conf_matrix = pd.DataFrame(cm, index=['female', 'male'], columns=['predicted female', 'predicted male'])
conf_matrix

Unnamed: 0,predicted female,predicted male
female,1588,468
male,1116,797


In [20]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

      female       0.59      0.77      0.67      2056
        male       0.63      0.42      0.50      1913

    accuracy                           0.60      3969
   macro avg       0.61      0.59      0.58      3969
weighted avg       0.61      0.60      0.59      3969



In [21]:
from sklearn.metrics import accuracy_score
y_pred_valid = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_valid)
acc

0.6009070294784581

An accuracy of 0.60 is not great: randomly guessing a person's gender would already give an accuracy of about 50%. Therefore, other Naive Bayes variants are tried to see whether they improve the accuracy.
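
To put the 0.60 in context, the figures can be recomputed by hand from the Multinomial NB confusion matrix reported above and compared with a simple baseline. This is a minimal sketch: the counts are copied from the output above, and the variable names are hypothetical.

```python
# Confusion matrix from the Multinomial NB output above.
# Rows are true labels, columns are predictions, both in the order ['female', 'male'].
cm_mnb = [[1588, 468],
          [1116, 797]]

n_test = sum(sum(row) for row in cm_mnb)

# Accuracy: correctly classified rows (the diagonal) divided by all test rows.
accuracy_mnb = (cm_mnb[0][0] + cm_mnb[1][1]) / n_test

# Baseline: always predict the larger class in the test set (female, 2056 rows).
majority_baseline = max(sum(row) for row in cm_mnb) / n_test

# Recall per class: correct predictions divided by the true count of that class.
recall_female = cm_mnb[0][0] / sum(cm_mnb[0])
recall_male = cm_mnb[1][1] / sum(cm_mnb[1])

print(round(accuracy_mnb, 3), round(majority_baseline, 3))   # 0.601 0.518
print(round(recall_female, 2), round(recall_male, 2))        # 0.77 0.42
```

The model is about 8 percentage points better than always predicting "female", and most of its errors are male resumes and tweets classified as female.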

## Bernoulli NB

In [22]:
from sklearn.naive_bayes import BernoulliNB
clf_b = BernoulliNB()
clf_b.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [23]:
y_test_pred_b = clf_b.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred_b)
cm

array([[1853,  203],
       [1481,  432]])

In [24]:
print(classification_report(y_test,y_test_pred_b))

              precision    recall  f1-score   support

      female       0.56      0.90      0.69      2056
        male       0.68      0.23      0.34      1913

    accuracy                           0.58      3969
   macro avg       0.62      0.56      0.51      3969
weighted avg       0.62      0.58      0.52      3969



An accuracy of 0.58 is even lower than that of Multinomial Naive Bayes. Let's test a single word to see what the model predicts.

In [25]:
docs_new = ['Edward']
X_new_counts = vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf_b.predict(X_new_tfidf)
print(predicted)
# for doc, category in zip(docs_new, predicted):
#     print('%r => %s' % (doc, df.gender[category]))

['female']
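
A single-word prediction is only informative if the word is actually in the fitted vocabulary; otherwise the input row contains only zeros and the prediction says nothing about the word itself. The sketch below shows one way to check this. It uses a hypothetical toy vocabulary, not the notebook's `vect`, and `known_tokens` is a helper introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of stemmed text; the notebook's `vect` is fitted on far more data.
toy_corpus = ["fund account with experi", "profession experi citco fund servic"]
toy_vect = CountVectorizer(stop_words="english").fit(toy_corpus)

def known_tokens(doc):
    """Return the tokens of `doc` that exist in the fitted vocabulary."""
    analyzer = toy_vect.build_analyzer()  # lowercases, tokenizes, drops stop words
    return [token for token in analyzer(doc) if token in toy_vect.vocabulary_]

edward_row = toy_vect.transform(["Edward"])
print(known_tokens("Edward"), edward_row.nnz)   # [] 0
print(known_tokens("Fund services"))            # ['fund']
```

The second call also shows that new input should be stemmed the same way as the training text: the vocabulary contains `servic`, so the unstemmed `services` is not recognised.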


## Gaussian NB

In [26]:
from sklearn.naive_bayes import GaussianNB
train_data_features= X_train.toarray()

gnb = GaussianNB()
y_pred = gnb.fit(train_data_features, y_train).predict(X_test.toarray())
print(accuracy_score(y_test, y_pred))

0.5716805240614764
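
`GaussianNB` needs a dense array, which is why `toarray()` is called above. A rough estimate of the memory this takes for the training matrix, using the shapes reported earlier in the notebook (variable names are hypothetical):

```python
# Shapes taken from the outputs above: 13228 rows in total, 3969 of them in the test set.
n_train_rows = 13228 - 3969
n_features = 31627
bytes_per_value = 8  # float64

# Every cell is stored in a dense array, including the zeros.
dense_bytes = n_train_rows * n_features * bytes_per_value
dense_gib = dense_bytes / 1024**3
print(n_train_rows, round(dense_gib, 2))  # 9259 2.18
```

For comparison, the sparse matrix of all 13228 documents stores only 193796 non-zero values, so the dense copy is almost entirely zeros.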


Gaussian Naive Bayes also gives a disappointing result. Next, I will try a Random Forest to improve the accuracy.