# Basic gender prediction based on a dataset of Polish names

### Can be used for form filling and better customer targeting.

In [269]:
# tools, libraries, algorithms

import pandas as pd
import numpy as np

np.random.seed(2020)

from sklearn.linear_model import LogisticRegression

# success metric
from sklearn.metrics import accuracy_score


import matplotlib.pyplot as plt
%matplotlib inline

### Loading the data.

In [270]:
df = pd.read_csv('../input/polish_names.csv')

### Data review:

In [271]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    1705 non-null   object
 1   gender  1705 non-null   object
dtypes: object(2)
memory usage: 26.8+ KB


A simple dataset of 1705 rows and 2 columns: *__name__* and *__gender__* (target variable).
* no null values
* data type: object (both columns)

In [272]:
# show unique values for gender

df.gender.unique()

array(['m', 'f'], dtype=object)

In [273]:
# show totals of unique values

df.gender.value_counts()

m    1033
f     672
Name: gender, dtype: int64

There are about 1.5 times as many male names as female names (1033 vs. 672).
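The class balance can be checked directly from the counts printed above. This is a minimal sketch that rebuilds the `gender` column from those two counts (1033 and 672) rather than reading the CSV, so the names `gender`, `ratio`, and `shares` are illustrative only.

```python
import pandas as pd

# Rebuild the gender column from the counts printed above (1033 'm', 672 'f').
gender = pd.Series(['m'] * 1033 + ['f'] * 672, name='gender')

counts = gender.value_counts()
ratio = counts['m'] / counts['f']              # male-to-female ratio
shares = gender.value_counts(normalize=True)   # class proportions

print(round(ratio, 2))             # 1.54
print(shares.round(3).to_dict())
```

The male share (about 0.606) is worth keeping in mind: it is the accuracy a model would get by always predicting 'm'.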



In [274]:
df.sample(10)

| | name | gender |
|---|---|---|
| 861 | Ludomił | m |
| 942 | Mechtylda | f |
| 734 | Klaudian | m |
| 997 | Nadzieja | f |
| 1062 | Oswald | m |
| 820 | Leszek | m |
| 588 | Greta | f |
| 396 | Dorian | m |
| 1619 | Hiacynta | f |
| 1620 | Hieronim | m |


### Map object values into integers (1 == 'm', 0 == 'f')

In [275]:
df['target'] = df['gender'].map(lambda x: int(x == 'm'))

## Feature engineering 

First we would like to find a feature that will provide us with the highest score.

__Features to test__:

1. Length of the name.
2. Number of consonants.
3. Number of vowels.
4. The first letter is a consonant.
5. The last letter is a consonant.
6. The first letter is a vowel.
7. The last letter is a vowel.

We will prepare a function for each feature and map it over the `name` column of the pandas DataFrame.

#### 1. Length of the name

In [276]:
df['length'] = df['name'].map(len)

#### 2. Number of consonants

In [277]:
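# vowels used for the consonant and first/last-letter features
# (note: 'ó' is not in this list, so it is counted as a consonant)
all_vowels_pl = ['a', 'e', 'i', 'o', 'u', 'y', 'ą', 'ę']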

def count_consonants(name):
    consonants = sum(map(lambda x: int(x not in all_vowels_pl), name.lower()))
    return consonants

In [278]:
df['count_consonants'] = df['name'].map(count_consonants)

#### 3. Number of vowels

In [279]:
# note: unlike all_vowels_pl, this list omits 'ą' and 'ę'
vowels = ['a', 'e', 'i', 'o', 'u', 'y']

def how_many_vowels(name):
    return sum(map(lambda x: int(x in vowels), name.lower()))

In [280]:
df['how_many_vowels'] = df['name'].map(how_many_vowels)

#### 4. The first letter is a consonant.

In [281]:
def first_consonant(name):
    first = name.lower()[0] not in all_vowels_pl
    return int(first)

In [282]:
df['first_consonant'] = df['name'].map(first_consonant)

#### 5. The last letter is a consonant.

In [283]:
def last_consonant(name):
    last = name.lower()[-1] not in all_vowels_pl
    return int(last)

In [284]:
df['last_consonant'] = df['name'].map(last_consonant)

#### 6. The first letter is a vowel.

In [285]:

def first_vowel(name):
    first = name.lower()[0] in all_vowels_pl
    return int(first)

In [286]:
df['first_vowel'] = df['name'].map(first_vowel)

#### 7. The last letter is a vowel.

In [287]:
def last_vowel(name):
    return int(name.lower()[-1] in all_vowels_pl)

In [288]:
df['last_vowel'] = df['name'].map(last_vowel)
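The seven features above can also be computed without per-name Python functions by using the pandas `.str` accessor. The following is a sketch on a small hypothetical frame `demo`, built from four names in the sample shown earlier. It assumes one vowel list (`all_vowels_pl`) for every feature, so its `count_vowels` column (a hypothetical name) can differ from `how_many_vowels` for names containing 'ą' or 'ę'.

```python
import pandas as pd

all_vowels_pl = ['a', 'e', 'i', 'o', 'u', 'y', 'ą', 'ę']  # same list as above

# A few names taken from the sample shown earlier (illustrative only).
demo = pd.DataFrame({'name': ['Leszek', 'Greta', 'Oswald', 'Nadzieja']})

lower = demo['name'].str.lower()
pattern = '[' + ''.join(all_vowels_pl) + ']'  # regex character class of vowels

demo['length'] = lower.str.len()
demo['count_vowels'] = lower.str.count(pattern)
demo['count_consonants'] = demo['length'] - demo['count_vowels']
demo['first_vowel'] = lower.str[0].isin(all_vowels_pl).astype(int)
demo['last_vowel'] = lower.str[-1].isin(all_vowels_pl).astype(int)
# The consonant flags are the exact complement of the vowel flags.
demo['first_consonant'] = 1 - demo['first_vowel']
demo['last_consonant'] = 1 - demo['last_vowel']

print(demo)
```

Because `first_consonant` is always `1 - first_vowel` (and likewise for the last letter), each pair carries the same information, so a model trained on either member of a pair should reach the same score.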

## Model building

#### Logistic Regression

#### 1. Train & predict for length of the name.

In [289]:
X = df[['length']].values  # feature matrix
y = df['target'].values  # target vector

In [290]:
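def train_and_predict_model(X, y, model, success_metric=accuracy_score):
    """Fit the model and score it on the same data it was trained on."""
    model.fit(X, y)
    y_pred = model.predict(X)  # predictions on the training data
    return success_metric(y, y_pred)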

In [291]:
train_and_predict_model(X, y, LogisticRegression())

0.6058651026392962

#### 2. Train & predict for number of consonants.

In [292]:
train_and_predict_model(df[['count_consonants']], y, LogisticRegression())

0.6486803519061584

#### 3. Train & predict for number of vowels.

In [293]:
train_and_predict_model(df[['how_many_vowels']], y, LogisticRegression())

0.669208211143695

#### 4. Train & predict for the first letter is a consonant.

In [294]:
train_and_predict_model(df[['first_consonant']], y, LogisticRegression())

0.6058651026392962

#### 5. Train & predict for the last letter is a consonant.

In [295]:
train_and_predict_model(df[['last_consonant']], y, LogisticRegression())

0.9524926686217009

#### 6. Train & predict for the first letter is a vowel.


In [296]:
train_and_predict_model(df[['first_vowel']], y, LogisticRegression())

0.6058651026392962

#### 7. Train & predict for the last letter is a vowel.

In [297]:
train_and_predict_model(df[['last_vowel']], y, LogisticRegression())

0.9524926686217009
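Three of the scores above (length, first letter is a consonant, first letter is a vowel) are all 0.6058651026392962. That value is the share of male names, 1033 / 1705, which is what a model scores when it always predicts the majority class. The sketch below reproduces the baseline from the class counts reported earlier; `y_demo`, `majority_class`, and `baseline_acc` are illustrative names.

```python
import numpy as np

# Rebuild the target from the class counts reported earlier (1 == 'm', 0 == 'f').
y_demo = np.array([1] * 1033 + [0] * 672)

# Accuracy of always predicting the most frequent class.
majority_class = np.bincount(y_demo).argmax()
baseline_acc = np.mean(y_demo == majority_class)

print(majority_class)  # 1, i.e. 'm'
print(baseline_acc)    # equals 1033 / 1705
```

A feature that scores exactly at this baseline adds nothing beyond the class imbalance, while the two last-letter features clearly beat it.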

### Summary

We can see that for Polish names the golden feature is the last letter (vowel vs. consonant): on its own it reaches about 95% accuracy.
Polish female names usually end with -a, while a vowel at the end of a male name is rather rare (although possible).

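Note that every score above is computed on the same data the model was trained on, so it may overstate how well the model handles unseen names (possible overfitting).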
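One way to check how much of the score survives on unseen names is to evaluate on rows the model did not see during fitting. The following is a minimal sketch using scikit-learn's `cross_val_score`. The helper name `cross_validated_score` is hypothetical, and the toy data is just the ten names from `df.sample(10)` above, so the number it prints says nothing about the full dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

all_vowels_pl = ['a', 'e', 'i', 'o', 'u', 'y', 'ą', 'ę']

# Toy data: the ten names from df.sample(10) above (illustrative only).
names = ['Ludomił', 'Mechtylda', 'Klaudian', 'Nadzieja', 'Oswald',
         'Leszek', 'Greta', 'Dorian', 'Hiacynta', 'Hieronim']
genders = ['m', 'f', 'm', 'f', 'm', 'm', 'f', 'm', 'f', 'm']

# Single feature: 1 if the last letter is a vowel, else 0.
X_demo = np.array([[int(n.lower()[-1] in all_vowels_pl)] for n in names])
y_demo = np.array([int(g == 'm') for g in genders])

def cross_validated_score(X, y, model, cv=2):
    """Mean accuracy over cv folds; each fold is scored on held-out rows."""
    return cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()

score = cross_validated_score(X_demo, y_demo, LogisticRegression())
print(score)
```

With the full 1705-row dataset a larger value such as `cv=5` would be a more usual choice; `cv=2` is used here only because the toy sample is tiny.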

#### Polish names ending with -a & tagged as male.

In [299]:
df['last_a'] = df.name.map(lambda x: x[-1] == 'a')
df[ (df.gender == 'm') & df.last_a ]

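| | name | gender | target | length | count_consonants | how_many_vowels | first_consonant | last_consonant | first_vowel | last_vowel | last_a |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | Barnaba | m | 1 | 7 | 4 | 3 | 1 | 0 | 0 | 1 | True |
| 219 | Bonawentura | m | 1 | 11 | 6 | 5 | 1 | 0 | 0 | 1 | True |
| 765 | Kosma | m | 1 | 5 | 3 | 2 | 1 | 0 | 0 | 1 | True |
| 1574 | Batszeba | m | 1 | 8 | 5 | 3 | 1 | 0 | 0 | 1 | True |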
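The table above covers only male names ending in -a. An accuracy of 0.9524926686217009 on 1705 rows corresponds to 81 misclassified names, so other exceptions exist too (for example, male names ending in another vowel, or female names ending in a consonant). As a sketch of how to list every name that breaks the "last letter is a vowel means female" rule, the example below uses a small hypothetical frame `demo` built from names shown in this notebook.

```python
import pandas as pd

all_vowels_pl = ['a', 'e', 'i', 'o', 'u', 'y', 'ą', 'ę']

# Toy data built from names shown in this notebook (illustrative only).
demo = pd.DataFrame({
    'name':   ['Barnaba', 'Kosma', 'Leszek', 'Greta', 'Hiacynta', 'Oswald'],
    'gender': ['m', 'm', 'm', 'f', 'f', 'm'],
})

# Rule: a name ending in a vowel is predicted female, otherwise male.
ends_with_vowel = demo['name'].str.lower().str[-1].isin(all_vowels_pl)
rule_prediction = ends_with_vowel.map({True: 'f', False: 'm'})

# Rows where the rule disagrees with the label.
exceptions = demo[rule_prediction != demo['gender']]
print(exceptions['name'].tolist())  # ['Barnaba', 'Kosma']
```

Running the same comparison on the full `df` would show all rows the last-letter rule gets wrong, which also makes it easier to spot entries whose label may deserve a second look.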


#### END