# Setup

In [1]:
!pip install -r requirements.txt



In [17]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

seed = 0

data = pd.read_csv('name_gender.csv')
print(f'Size of dataset: {len(data)}')

Size of dataset: 95025


In [18]:
data.describe()

           name gender
count     95025  95025
unique    95025      2
top     Aaban&&      F
freq          1  60304


In [19]:
data.isnull().values.any() # No missing values

False

In [20]:
data['gender'].value_counts() # Class frequencies

F    60304
M    34721
Name: gender, dtype: int64
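The class counts above imply a majority-class baseline that any model should beat. The sketch below only redoes that arithmetic from the printed counts; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"F": 60304, "M": 34721}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting `F` would be right about 63.5% of the time, which puts later accuracy figures in context.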

# Data cleaning and preparation

In [21]:
# Names containing non-alphabetic characters (raw string avoids invalid-escape warnings)
print(data[data['name'].str.contains(r'\W|\d|_')])

            name gender
0        Aaban&&      M
1         Aabha*      F
4          Aada_      F
10       Aadhav+      M
13      Aadhira4      F
...          ...    ...
94826   Zyair770      M
94874  Zyheir887      M
94915    Zykir24      M
94957  Zymirah11      F
94995     Zyri*&      F

[65 rows x 2 columns]


The following step removes non-alphabetic characters (numbers, symbols) from the names.

In [22]:
data['name'] = data['name'].str.replace(r'\W|\d|_', '', regex=True)
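To see what the pattern removes, here is a minimal sketch that applies the same regular expression with the standard `re` module to a few of the names listed above. The helper `clean_name` is hypothetical and stands in for the pandas `str.replace` call.

```python
import re

# Same pattern as in the notebook: non-word characters, digits, underscores.
NON_ALPHA = re.compile(r"\W|\d|_")

def clean_name(name: str) -> str:
    """Strip everything that is not a letter from a name."""
    return NON_ALPHA.sub("", name)

samples = ["Aaban&&", "Aada_", "Aadhira4", "Zyair770"]
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

Because `\W` is Unicode-aware for Python strings, accented letters such as the `é` in `José` are kept.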

In [23]:
# Split the dataset, stratified by gender, before tokenizing so that no
# test-set information leaks into the training features
X_train, X_test, y_train, y_test = train_test_split(
    data['name'], data['gender'], test_size=0.1, random_state=seed, stratify=data['gender']
)
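As a rough illustration of what stratification does, the sketch below splits each class separately so that the test set keeps the class ratio. It is a simplified stand-in written for this note, not scikit-learn's implementation, and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["F"] * 60 + ["M"] * 40  # toy labels, not the real dataset
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(test_labels.count("F"), test_labels.count("M"))
```

With a 60/40 toy dataset and `test_size=0.1`, the test set holds 6 `F` and 4 `M` labels, the same ratio as the full set.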

# Modelling

Reference: https://arxiv.org/pdf/2102.03692.pdf

To do: error analysis.

Several models will be tried. Besides logistic regression, the candidates are naive Bayes, random forest, LSTM and char-BERT.

Additionally, for each model, several feature-engineering approaches will be compared. The first builds a set of character n-grams from each name with sklearn's `CountVectorizer`, which is essentially a bag-of-words (BOW) representation fed to a machine learning model. The second converts the count vectors into TF-IDF vectors. The third involves class scaling.

Feature types (see https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

1) BOW of ngrams

2) TF-IDF of ngrams: scales down the impact of tokens that are common in a corpus and hence less informative (does it apply here?)

3) Class scaling
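To make the TF-IDF option more concrete, the sketch below computes the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, for character bigrams in a toy corpus. The corpus and helper names are made up for illustration. The sketch ignores word-boundary padding, term frequency and normalization.

```python
import math

names = ["anna", "hanna", "joanna", "zyair"]  # toy corpus, not the real dataset

def bigrams(name):
    """Set of distinct character bigrams in a name."""
    return {name[i:i + 2] for i in range(len(name) - 1)}

n_docs = len(names)

# Document frequency: in how many names does each bigram occur?
doc_freq = {}
for name in names:
    for gram in bigrams(name):
        doc_freq[gram] = doc_freq.get(gram, 0) + 1

def smooth_idf(gram):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1

for gram in ("an", "zy"):
    print(gram, doc_freq[gram], round(smooth_idf(gram), 3))
```

The common bigram `an` gets a lower weight than the rare `zy`. Whether that helps here is an open question, since a common n-gram can still be informative about gender.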

In [24]:
ngram_upper = int(np.floor(np.mean(data['name'].apply(len))))
print(f'Average word length: {ngram_upper}')

Average word length: 6


N-grams are constructed with a minimum length of 2 characters and a maximum of 6 (the mean name length, rounded down).
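The following sketch shows which tokens such an n-gram range produces for a single name. It is a hand-written approximation of the `char_wb` analyzer (lowercase, pad each word with one space on both sides, slide a window of each size), not scikit-learn's own code, and the function name is hypothetical.

```python
def char_wb_ngrams(text, min_n=2, max_n=6):
    """Character n-grams inside word boundaries, with space padding."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break  # the padded word is shorter than the window
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams

print(char_wb_ngrams("Ada", min_n=2, max_n=3))
```

For `Ada` with sizes 2 to 3, the result is `[' a', 'ad', 'da', 'a ', ' ad', 'ada', 'da ']`. The padding lets the model tell a name-final `a ` apart from an `a` in the middle of a name.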

For hyperparameter tuning, stratified 5-fold CV will be used. Tokenization is performed only after splitting, separately for each fold, so that the validation fold contributes nothing to the features built from the training folds. This is achieved by wrapping the vectorizer and the model in a pipeline that is passed to `GridSearchCV`.
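The sketch below illustrates why the vectorizer has to be fitted inside each fold. The vocabulary is built from the training fold only, so n-grams that appear only in the validation fold are simply dropped. The helpers are simplified stand-ins (plain bigrams, no padding) and the fold contents are made up.

```python
def bigrams(name):
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]

def fit_vocabulary(names):
    """Map each bigram seen in `names` to a column index."""
    grams = sorted({g for name in names for g in bigrams(name)})
    return {g: i for i, g in enumerate(grams)}

def transform(names, vocab):
    """Count vectors restricted to the fitted vocabulary."""
    rows = []
    for name in names:
        row = [0] * len(vocab)
        for g in bigrams(name):
            if g in vocab:  # unseen bigrams are ignored
                row[vocab[g]] += 1
        rows.append(row)
    return rows

train_fold = ["anna", "hanna"]  # toy folds, not the real data
val_fold = ["zyanna"]

vocab = fit_vocabulary(train_fold)        # fitted on the training fold only
val_features = transform(val_fold, vocab)

print(sorted(vocab))
print(val_features)
```

The bigrams `zy` and `ya` occur only in the validation name, so they never become features. Fitting the vocabulary on all names before splitting would have let them leak in.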

## Logistic regression

### BOW

In [15]:
# This tuning section was run iteratively to search for good hyperparameters without overfitting.
# I avoided a single comprehensive search because my computer was overheating and crashing.

# GridSearchCV selects on the validation score alone and ignores the training score, so it can
# favour settings that overfit. The training scores are therefore returned as well and taken
# into account when manually selecting the hyperparameters to use.

# Pipeline: tokenization -> scaling -> model, tuned by grid search with 5-fold CV

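pipeline = make_pipeline(
    CountVectorizer(analyzer='char_wb', ngram_range=(2, ngram_upper)),
    StandardScaler(with_mean=False),  # with_mean=False keeps the sparse matrix sparse
    LogisticRegression(max_iter=200)
)

# The grid was changed between iterative runs; the output shown below comes from
# a run with C in [0.01, 0.03, 0.04].
parameters = {'logisticregression__C': [0.01, 0.1, 1]}
# return_train_score=True exposes the training scores to monitor for overfitting
lr_1 = GridSearchCV(pipeline, param_grid=parameters, cv=5, verbose=1, return_train_score=True)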
lr_1.fit(X_train, y_train)

pd.DataFrame(lr_1.cv_results_)[['params','mean_test_score','rank_test_score','mean_train_score']]

Fitting 5 folds for each of 3 candidates, totalling 15 fits


                            params  mean_test_score  rank_test_score  mean_train_score
0  {'logisticregression__C': 0.01}         0.876102                3          0.886120
1  {'logisticregression__C': 0.03}         0.890671                2          0.909175
2  {'logisticregression__C': 0.04}         0.894027                1          0.915706
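One way to bring the training scores into the comparison is to look at the gap between mean training and mean validation score for each candidate. The sketch below does this for the three rows printed above. The scores are copied from that output, and the list-of-dicts layout is only for illustration.

```python
# Scores copied from the cv_results_ table above.
cv_results = [
    {"C": 0.01, "mean_test_score": 0.876102, "mean_train_score": 0.886120},
    {"C": 0.03, "mean_test_score": 0.890671, "mean_train_score": 0.909175},
    {"C": 0.04, "mean_test_score": 0.894027, "mean_train_score": 0.915706},
]

# Train/validation gap: a larger gap suggests more overfitting.
for row in cv_results:
    row["gap"] = row["mean_train_score"] - row["mean_test_score"]
    print(f"C={row['C']}: validation={row['mean_test_score']:.4f}, gap={row['gap']:.4f}")
```

In this run both the validation score and the gap grow as `C` increases (weaker regularization), so the choice is a trade-off between the two.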


In [16]:
train_acc = lr_1.score(X_train, y_train)
test_acc = lr_1.score(X_test, y_test)
print(f'Scores for tuned LR model\nTrain: {train_acc:.3f}\nTest: {test_acc:.3f}')

Scores for tuned LR model
Train: 0.918
Test: 0.895


### TF-IDF

In [None]:
pipeline = make_pipeline(
        TfidfVectorizer(analyzer='char_wb', ngram_range=(2,ngram_upper)),
        LogisticRegression(max_iter=200)
        )

parameters = {'logisticregression__C':[0.2,0.4,1]}
lr = GridSearchCV(pipeline, param_grid=parameters, cv=5, verbose=2, return_train_score=True) # Monitor for overfitting
lr.fit(X_train, y_train)