## Setup

In [1]:
!pip install -r requirements.txt

Collecting pandas==1.4.3
  Downloading pandas-1.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting matplotlib==3.5.2
  Downloading matplotlib-3.5.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.3 MB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.34.4-py3-none-any.whl (944 kB)
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.4.4-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.2 MB)
Installing collected packages: kiwisolver, fonttools, cycler, pandas, matplotlib
Successfully installed cycler-0.11.0 fonttools-4.34.4 kiwisolver-1.4.4 matplotlib-3.5.2 pa

In [10]:
import pandas as pd
import numpy as np

data = pd.read_csv('name_gender.csv')
print(f'Size of dataset: {len(data)}')

data.head()

Size of dataset: 95025


        name gender
0    Aaban&&      M
1     Aabha*      F
2      Aabid      M
3  Aabriella      F
4      Aada_      F


In [11]:
data.describe()

           name gender
count     95025  95025
unique    95025      2
top     Aaban&&      F
freq          1  60304


In [12]:
data['gender'].value_counts()

F    60304
M    34721
Name: gender, dtype: int64
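
The counts above imply a class imbalance worth keeping in mind when evaluating models. As a quick worked check in plain Python, using only the two counts printed above, a classifier that always predicts the majority class would already be right a little under two thirds of the time, so that share is a reasonable baseline to beat.

```python
# Counts copied from the value_counts() output above.
counts = {'F': 60304, 'M': 34721}
total = sum(counts.values())

# Share of each class, and the accuracy of always predicting the majority class.
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total)
print(majority_label, round(baseline_accuracy, 4))
```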

In [13]:
data.isnull().values.any()

False

## Data cleaning

Remove non-alphabetic characters

In [14]:
# Flag names containing any non-word character, digit, or underscore
non_alpha_mask = data['name'].str.contains(r'\W|\d|_')

print(f'No. of non-alphabetic names: {non_alpha_mask.sum()}')
print(data[non_alpha_mask])

No. of non-alphabetic names: 65
            name gender
0        Aaban&&      M
1         Aabha*      F
4          Aada_      F
10       Aadhav+      M
13      Aadhira4      F
...          ...    ...
94826   Zyair770      M
94874  Zyheir887      M
94915    Zykir24      M
94957  Zymirah11      F
94995     Zyri*&      F

[65 rows x 2 columns]


In [15]:
data['name'] = data['name'].str.replace(r'\W|\d|_', '', regex=True)
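
To make the pattern concrete, the sketch below applies the same regular expression with Python's standard `re` module to three names taken from the output above, plus one accented name (`Zoë`) that is a hypothetical example and not from the dataset. It suggests that the pattern strips symbols, digits, and underscores, while accented letters survive because `\w` is Unicode-aware by default in Python 3. The helper name `clean_name` is an assumption made for illustration.

```python
import re

# Same pattern as in the cleaning cell above, written as a raw string.
NON_ALPHA = re.compile(r'\W|\d|_')

def clean_name(name: str) -> str:
    """Drop symbols, digits, and underscores from a name."""
    return NON_ALPHA.sub('', name)

# The first three samples appear in the notebook output; 'Zoë' is hypothetical.
samples = ['Aaban&&', 'Aada_', 'Zyair770', 'Zoë']
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

If strictly ASCII names are required, a stricter pattern would be needed; whether that matters depends on the dataset.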

In [18]:
np.mean(data['name'].apply(len))

6.5340699815837935

## Modelling

https://arxiv.org/pdf/2102.03692.pdf

What’s in a Name? – Gender Classification of Names with Character Based Machine Learning Models

The paper describes two ways of using names for gender classification: constructing name embeddings and character-based approaches. However, embeddings tend to work poorly for less common names.
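
A character-based approach can be sketched with scikit-learn as a pipeline of character n-gram counts feeding a naive Bayes classifier. This is an illustrative sketch, not the paper's model: the toy names and labels are hypothetical, and the `ngram_range` and the choice of classifier are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration only; the notebook itself would use
# data['name'] and data['gender'] instead.
toy_names = ['Anna', 'Maria', 'Sophia', 'Olivia', 'John', 'Mark', 'Peter', 'David']
toy_labels = ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M']

# Character n-grams within word boundaries, then multinomial naive Bayes.
model = make_pipeline(
    CountVectorizer(analyzer='char_wb', ngram_range=(2, 3)),
    MultinomialNB(),
)
model.fit(toy_names, toy_labels)

# Predicting on the training names only shows that the pipeline runs end to end;
# it says nothing about accuracy on unseen names.
predictions = model.predict(toy_names)
print(list(predictions))
```

Because the features are built from characters rather than whole names, a model like this can still produce a prediction for a name it has never seen.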

In [None]:
# Try script to secretly route request to commercial API

In [None]:
# TODO: error analysis on names, grouped by name frequency
# Maybe upsample rarer names in the training set; keep the test set unchanged
#
# Candidate models:
# - naive Bayes
# - logistic regression
# - LSTM
# - char-BERT
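
One way to read the note above about upsampling is sketched here: split first, then resample only the training portion so that rarer names are drawn more often, leaving the test set untouched. The records and their `freq` values are hypothetical (the dataset shown earlier has no frequency column), and inverse-frequency weighting is just one possible choice.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical (name, gender, freq) records; `freq` stands in for how common a name is.
records = [
    ('Anna', 'F', 900), ('Maria', 'F', 800), ('John', 'M', 950), ('Mark', 'M', 700),
    ('Aabha', 'F', 5), ('Aabid', 'M', 4), ('Zyri', 'F', 3), ('Zykir', 'M', 2),
]

# 1. Split BEFORE resampling so no upsampled copy can leak into the test set.
shuffled = random.sample(records, k=len(records))
test_set, train_set = shuffled[:2], shuffled[2:]

# 2. Upsample the training set with replacement, weighting each record by
#    inverse frequency so that rarer names are drawn more often.
weights = [1.0 / freq for _, _, freq in train_set]
upsampled_train = random.choices(train_set, weights=weights, k=3 * len(train_set))

print(len(test_set), len(upsampled_train))
```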

In [82]:
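from sklearn.feature_extraction.text import CountVectorizer

# Cap the n-gram length at the floor of the mean name length (6 for a mean of about 6.53)
ngram_upper = int(np.floor(np.mean(data['name'].apply(len))))

# Character n-grams restricted to word boundaries, from bigrams up to ngram_upper
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, ngram_upper))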

In [83]:
test = data.iloc[:100]['name']

In [84]:
counts = ngram_vectorizer.fit_transform(test)

In [85]:
ngram_vectorizer.get_feature_names_out()

array([' a', ' aa', ' aab', ..., 'ysia ', 'za', 'za '], dtype=object)
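
The leading and trailing spaces in the feature names above come from the `char_wb` analyzer, which lowercases each name and pads it with a space on both sides before slicing n-grams. The sketch below uses `build_analyzer()` to list the n-grams for a single name from the dataset; the shorter `(2, 3)` range is an assumption chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer setting as above, with a shortened n-gram range for readability.
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
analyze = vectorizer.build_analyzer()

# 'Aabid' is one of the names in the dataset.
grams = analyze('Aabid')
print(grams)
```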

In [86]:
counts.toarray().astype(int)

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])

In [88]:
counts

<100x1224 sparse matrix of type '<class 'numpy.int64'>'
	with 2603 stored elements in Compressed Sparse Row format>
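
The summary above can be turned into two quick numbers: the matrix density and the average number of distinct n-grams per name. This is plain arithmetic on the printed shape and stored-element count, nothing more.

```python
# Values copied from the sparse-matrix summary above.
n_rows, n_cols, n_stored = 100, 1224, 2603

density = n_stored / (n_rows * n_cols)   # fraction of non-zero cells
avg_ngrams_per_name = n_stored / n_rows  # distinct n-grams per name, on average

print(f'density: {density:.2%}')
print(f'avg distinct n-grams per name: {avg_ngrams_per_name:.2f}')
```

With only about 2% of cells non-zero, keeping the matrix in sparse form rather than calling `toarray()` is likely to matter once the vectorizer is fitted on the full dataset.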