## <img src="./analytics_logo.jpg" width="80" height="80"/> Mini Project: Predict Gender from Name

In this project I build a DataFrame from scratch with fake data generated by the Faker library, then use the names-dataset library to predict the gender of each name.

## Prerequisites

To do so, we use the following libraries:
- [`Faker`](https://github.com/joke2k/faker): To generate fake names
- [`names-dataset`](https://github.com/philipperemy/name-dataset): To get gender and country info

In [None]:
!pip install Faker
!pip install names-dataset

# Let's Start

#### import required libraries

In [20]:
import pandas as pd
import numpy as np
from faker import Faker
from faker.providers import internet
from names_dataset import NameDataset, NameWrapper

#### generate fake data by Faker

In [42]:
faker = Faker()

In [65]:
names = np.array([faker.name() for _ in range(1000)], dtype='U31')
names[:10]

array(['David Baker', 'Paige Long', 'Nicole Arias DDS', 'Lisa Brooks',
       'Melvin Bailey', 'Courtney Lynn', 'Dylan Vasquez', 'Keith Shaffer',
       'John Fowler', 'Kevin Johnson'], dtype='<U31')

In [44]:
address = np.array([faker.address() for _ in range(1000)])
address[:10]

array(['781 Laura Drive Apt. 593\nGarciastad, OR 96308',
       '10181 Christopher Forks\nKatelynstad, NH 04549',
       '4141 Connie Roads\nMillerbury, OK 41415',
       '782 Cortez Stream Apt. 139\nNorth Susanview, DE 40247',
       '7334 Nicole Forge Apt. 148\nFieldstown, NE 79103',
       'PSC 5403, Box 9622\nAPO AP 32279',
       '859 Lawrence Freeway\nWest Colechester, KY 03613',
       '804 Hall Run\nEast Juan, TN 06843',
       '93317 Alexandra Courts Apt. 284\nChristopherhaven, PA 02076',
       '8207 Joseph Camp Suite 409\nPatrickland, OH 00949'], dtype='<U62')

In [47]:
comment = np.array([faker.sentence() for _ in range(1000)])
comment[:10]

array(['Money threat whole live east political within.',
       'Clear fast change like yet step.',
       'Cell computer mother manager degree agent whose.',
       'Difficult if must serve.', 'Project candidate line.',
       'Response just today understand coach than her the.',
       'Center company news yard.',
       'Billion professional fire build recent.', 'Add guess your car.',
       'Share too young response each always.'], dtype='<U67')

In [50]:
# Use the `faker` instance created earlier (`fake` was never defined)
faker.add_provider(internet)
ip = np.array([faker.ipv4_private() for _ in range(1000)])
ip[:10]

array(['10.128.210.201', '172.17.239.64', '172.29.104.235',
       '10.2.219.223', '172.28.112.53', '192.168.184.9', '10.141.119.31',
       '192.168.63.189', '172.24.44.86', '10.73.221.32'], dtype='<U15')

In [53]:
age = np.random.randint(1,100,1000)
age[:10]

array([16, 55, 20, 11, 51,  4,  5, 94, 65, 72])

In [59]:
status = np.random.randint(0,2,1000,dtype=bool)
status[:10]

array([False,  True, False, False,  True,  True,  True,  True,  True,
        True])

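- So far I have six fully synthetic NumPy arrays: four from Faker (names, addresses, comments, IPs) and two from NumPy's random generator (ages and status flags). Now it's time to build a DataFrame from them.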

In [83]:
df_to = pd.DataFrame(data=[names, age, address, comment, ip, status])

In [84]:
df = df_to.T
df.head(2)

Unnamed: 0,0,1,2,3,4,5
0,David Baker,16,"781 Laura Drive Apt. 593\nGarciastad, OR 96308",Money threat whole live east political within.,10.128.210.201,False
1,Paige Long,55,"10181 Christopher Forks\nKatelynstad, NH 04549",Clear fast change like yet step.,172.17.239.64,True


In [85]:
df.columns = ['name', 'age', 'address', 'comment', 'ip', 'status']

In [86]:
df

Unnamed: 0,name,age,address,comment,ip,status
0,David Baker,16,"781 Laura Drive Apt. 593\nGarciastad, OR 96308",Money threat whole live east political within.,10.128.210.201,False
1,Paige Long,55,"10181 Christopher Forks\nKatelynstad, NH 04549",Clear fast change like yet step.,172.17.239.64,True
2,Nicole Arias DDS,20,"4141 Connie Roads\nMillerbury, OK 41415",Cell computer mother manager degree agent whose.,172.29.104.235,False
3,Lisa Brooks,11,"782 Cortez Stream Apt. 139\nNorth Susanview, D...",Difficult if must serve.,10.2.219.223,False
4,Melvin Bailey,51,"7334 Nicole Forge Apt. 148\nFieldstown, NE 79103",Project candidate line.,172.28.112.53,True
...,...,...,...,...,...,...
995,Matthew Owens,45,USCGC Jones\nFPO AP 55681,Respond meeting paper third.,10.244.14.20,True
996,Jamie Crawford,47,"80546 Guerra Brooks Apt. 443\nRiverahaven, CA ...",Common almost only check paper old cup.,10.133.251.246,False
997,Shannon Chandler,27,"9798 Blair Walk\nNorth Pamela, NH 28129",Away bring foreign become son available contro...,10.220.204.70,False
998,Hannah Cooper,20,"309 Morris Land\nEthanbury, MS 32060",Any low coach positive continue between also.,192.168.118.218,True
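Stacking the arrays as rows and then transposing, as above, passes everything through one mixed-type block, so the columns typically end up with `object` dtype. A minimal sketch of an alternative, assuming a small hand-made sample instead of the full 1000 rows: passing a dictionary of arrays names the columns up front and lets pandas keep each array's own dtype. The name `df_typed` and the shortened column list are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative sample; the names are copied from the output above.
names = np.array(['David Baker', 'Paige Long', 'Nicole Arias DDS'])
rng = np.random.default_rng(0)
age = rng.integers(1, 100, size=len(names))                    # integers in [1, 100)
status = rng.integers(0, 2, size=len(names)).astype(bool)      # random True/False

# One key per column: no transpose and no separate rename step.
df_typed = pd.DataFrame({'name': names, 'age': age, 'status': status})
print(df_typed.dtypes)
```

With numeric and boolean dtypes preserved, operations such as `df_typed['age'].mean()` work without an explicit cast.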


### Let's work with names-dataset to get the gender of each name

In [88]:
nd = NameDataset()

In [194]:
fn_info = nd.search('کوروش')['first_name']
fn_info

{'country': {'United Arab Emirates': 0.006,
  'Afghanistan': 0.003,
  'Belgium': 0.003,
  'Germany': 0.009,
  'Greece': 0.003,
  'Iraq': 0.022,
  'Iran, Islamic Republic of': 0.931,
  'Netherlands': 0.006,
  'Turkey': 0.013,
  'United States': 0.003},
 'gender': {'Female': 0.035, 'Male': 0.965},
 'rank': {'Iran, Islamic Republic of': 899,
  'United Arab Emirates': None,
  'Afghanistan': None,
  'Belgium': None,
  'Germany': None,
  'Greece': None,
  'Iraq': None,
  'Netherlands': None,
  'Turkey': None,
  'United States': None}}

#### So with this JSON-like schema we can find out three things:
   - country probability
   - gender prediction
   - rank of name in each country
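As a minimal sketch of reading these three pieces, the snippet below works on a hand-copied subset of the record shown above rather than on a live lookup. `top_key` is a hypothetical helper introduced here for illustration; it is not part of names-dataset.

```python
# Subset of the 'first_name' record shown above, copied by hand.
fn_info = {
    'country': {'Germany': 0.009, 'Iraq': 0.022,
                'Iran, Islamic Republic of': 0.931, 'Turkey': 0.013},
    'gender': {'Female': 0.035, 'Male': 0.965},
    'rank': {'Iran, Islamic Republic of': 899, 'Germany': None,
             'Iraq': None, 'Turkey': None},
}

def top_key(scores):
    """Return the key with the highest non-None value, or None if there is none."""
    known = {k: v for k, v in (scores or {}).items() if v is not None}
    return max(known, key=known.get) if known else None

likely_gender = top_key(fn_info['gender'])      # highest gender probability
likely_country = top_key(fn_info['country'])    # highest country probability
# Keep only the countries for which a rank is actually available.
ranked = {c: r for c, r in fn_info['rank'].items() if r is not None}
print(likely_gender, likely_country, ranked)
```

Filtering out `None` before calling `max` also covers the empty case, where `max` on an empty dictionary would otherwise raise `ValueError`.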

In [199]:
def cleanNullTerms(d):
    clean = {}
    for k, v in d.items():
        if isinstance(v, dict):
            nested = cleanNullTerms(v)
            if len(nested.keys()) > 0:
                clean[k] = nested
        elif v is not None:
            clean[k] = v
    return clean

In [213]:
# Dictionary with None values

d = {'rank': {'United Arab Emirates': 12168,
  'Canada': 5364,
  'Germany': 4650,
  'United Kingdom': 11656,
  'Iran, Islamic Republic of': 216,
  'Kuwait': 8123,
  'Netherlands': 9041,
  'Sweden': 1091,
  'Iraq': None,
  'Turkey': None}}

# Testing function
cleanNullTerms(d)

{'rank': {'United Arab Emirates': 12168,
  'Canada': 5364,
  'Germany': 4650,
  'United Kingdom': 11656,
  'Iran, Islamic Republic of': 216,
  'Kuwait': 8123,
  'Netherlands': 9041,
  'Sweden': 1091}}

In [207]:
def get_gender(name):
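    """Return the most likely gender ('Male' or 'Female') for a first name.

    Returns None when names-dataset has no first-name or gender data for it.
    """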
    fn_info = nd.search(name)['first_name']
    if fn_info is not None:
        if fn_info['gender'] is not None:
            if None not in list(fn_info['gender'].values()):
                return max(fn_info['gender'], key=fn_info['gender'].get)
            else:
                gender_dict = cleanNullTerms(fn_info['gender'])
                return max(gender_dict, key=gender_dict.get)
    
    return None


# Testing function
get_gender('اسما')

'Female'

In [210]:
def country_probability(name):
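    """Return the country with the highest probability for a first name.

    Returns None when names-dataset has no first-name or country data for it.
    """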
    fn_info = nd.search(name)['first_name']
    if fn_info is not None:
        if fn_info['country'] is not None:
            if None not in list(fn_info['country'].values()):
                return max(fn_info['country'], key=fn_info['country'].get)
            else:
                country_dict = cleanNullTerms(fn_info['country'])
                return max(country_dict, key=country_dict.get)
    
    return None


# Testing function
country_probability('اسما')

'Egypt'

In [216]:
def get_rank(name):
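    """Return a country name derived from the name's rank data, or None.

    NOTE: when no rank is missing, the first branch reads 'country' instead
    of 'rank', so the result equals country_probability(name). Otherwise the
    country with the numerically largest rank is returned, not the rank itself.
    """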
    fn_info = nd.search(name)['first_name']
    if fn_info is not None:
        if fn_info['rank'] is not None:
            if None not in list(fn_info['rank'].values()):
                return max(fn_info['country'], key=fn_info['country'].get)
            else:
                rank_dict = cleanNullTerms(fn_info['rank'])
                return max(rank_dict, key=rank_dict.get)
    
    return None

# Testing function
get_rank('اسما')

'Egypt'
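The `rank` dictionary holds integers per country, while `get_rank` above returns a country name. As a sketch of an alternative, the hypothetical helper `best_rank` below returns both the country and the rank number. It assumes that a lower rank number means a more popular name, which is an assumption about names-dataset and not something the outputs above confirm. The sample dictionary is the one used earlier to test `cleanNullTerms`.

```python
def best_rank(rank_by_country):
    """Return (country, rank) for the lowest rank number, or None if no rank is known."""
    known = {c: r for c, r in (rank_by_country or {}).items() if r is not None}
    if not known:
        return None
    country = min(known, key=known.get)   # assumption: smaller rank = more popular
    return country, known[country]

sample_rank = {
    'United Arab Emirates': 12168, 'Canada': 5364, 'Germany': 4650,
    'United Kingdom': 11656, 'Iran, Islamic Republic of': 216,
    'Kuwait': 8123, 'Netherlands': 9041, 'Sweden': 1091,
    'Iraq': None, 'Turkey': None,
}
print(best_rank(sample_rank))
```

In the notebook this could be called as `best_rank(fn_info['rank'])` once `fn_info` is known to be not `None`.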

## Now let's apply these functions to our dataframe

### First, let's split off the first names

In [226]:
df.insert(
    loc=2,
    column='first_name',
    value=df.name.apply(lambda name: name.split()[0])
)

### then apply functions

In [228]:
df.insert(
    loc=3,
    column='gender',
    value=df.first_name.apply(lambda fn: get_gender(fn))
)

In [230]:
df.insert(
    loc=4,
    column='country',
    value=df.first_name.apply(lambda fn: country_probability(fn))
)

In [233]:
df.insert(
    loc=5,
    column='rank',
    value=df.first_name.apply(lambda fn: get_rank(fn))
)
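Each of the three `apply` calls above looks up every first name again, and many first names repeat across 1000 rows. A minimal sketch of avoiding repeated lookups with `functools.lru_cache` follows. `slow_lookup` is a hypothetical stand-in for `nd.search(first_name)['first_name']`, so the example runs without the dataset loaded.

```python
from functools import lru_cache

calls = []  # records which names actually reach the underlying lookup

def slow_lookup(first_name):
    """Hypothetical stand-in for nd.search(first_name)['first_name']."""
    calls.append(first_name)
    return {'gender': {'Female': 0.035, 'Male': 0.965}}

@lru_cache(maxsize=None)
def cached_lookup(first_name):
    # The first call per name does the real work; repeats are served from the cache.
    return slow_lookup(first_name)

first_names = ['David', 'Paige', 'David', 'David', 'Paige']
results = [cached_lookup(fn) for fn in first_names]
print(len(results), len(calls))
```

The cache hands back the same dictionary object on every hit, so callers should treat the result as read-only.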

In [246]:
df.country.value_counts()

United States     780
United Kingdom     69
Italy              47
France             43
Colombia           29
Nigeria             7
Brazil              6
Germany             3
Mexico              3
Spain               2
Poland              1
South Africa        1
Name: country, dtype: int64

In [247]:
df

Unnamed: 0,name,age,first_name,gender,country,rank,address,comment,ip,status
0,David Baker,16,David,Male,United States,United States,"781 Laura Drive Apt. 593\nGarciastad, OR 96308",Money threat whole live east political within.,10.128.210.201,False
1,Paige Long,55,Paige,Female,United States,United States,"10181 Christopher Forks\nKatelynstad, NH 04549",Clear fast change like yet step.,172.17.239.64,True
2,Nicole Arias DDS,20,Nicole,Female,United States,United States,"4141 Connie Roads\nMillerbury, OK 41415",Cell computer mother manager degree agent whose.,172.29.104.235,False
3,Lisa Brooks,11,Lisa,Female,United States,United States,"782 Cortez Stream Apt. 139\nNorth Susanview, D...",Difficult if must serve.,10.2.219.223,False
4,Melvin Bailey,51,Melvin,Male,United States,United States,"7334 Nicole Forge Apt. 148\nFieldstown, NE 79103",Project candidate line.,172.28.112.53,True
...,...,...,...,...,...,...,...,...,...,...
995,Matthew Owens,45,Matthew,Male,United States,United States,USCGC Jones\nFPO AP 55681,Respond meeting paper third.,10.244.14.20,True
996,Jamie Crawford,47,Jamie,Male,United Kingdom,United Kingdom,"80546 Guerra Brooks Apt. 443\nRiverahaven, CA ...",Common almost only check paper old cup.,10.133.251.246,False
997,Shannon Chandler,27,Shannon,Female,United States,United States,"9798 Blair Walk\nNorth Pamela, NH 28129",Away bring foreign become son available contro...,10.220.204.70,False
998,Hannah Cooper,20,Hannah,Female,United Kingdom,United Kingdom,"309 Morris Land\nEthanbury, MS 32060",Any low coach positive continue between also.,192.168.118.218,True


# Recap with Persian names

In [289]:
faker_per = Faker('fa')

In [290]:
name = np.array([faker_per.name() for _ in range(25)], dtype='U31')
name

array(['محمدجواد زارع', 'سرکار خانم دکتر هليا هومن', 'پارسا علی پور',
       'سرکار خانم دکتر نرگس علی شاهی', 'امیرمهدی نیلوفری',
       'جناب آقای دکتر ابوالفضل فرجی', 'جناب آقای محمدیاسین تهرانی',
       'اميرعلي وثاق', 'نيايش نیلوفری', 'علیرضا رودگر', 'مهسا نعمتی',
       'یاسمین عبدالعلی', 'جناب آقای دکتر آرين دادفر', 'اسرا اکبر پور',
       'بهار همدانی', 'نازنین رودگر', 'آرتین حمیدی', 'محمدامین جلالی',
       'هستی عزیزی', 'سرکار خانم دکتر نازنین سعیدی', 'مهدیه حمیدی',
       'سبحان لاچینی', 'محيا رسته', 'سرکار خانم دکتر آوا روحانی',
       'نازنین زهرا روحانی'], dtype='<U31')

## define a new function

In [291]:
def get_gender_persian(name):
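    """Return the most likely gender as a Persian label.

    Returns 'مرد' (male) or 'زن' (female), or None when names-dataset has
    no first-name or gender data for the name.
    """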
    fn_info = nd.search(name)['first_name']
    if fn_info is not None:
        if fn_info['gender'] is not None:
            if None not in list(fn_info['gender'].values()):
                max_ = max(fn_info['gender'], key=fn_info['gender'].get)
                if max_ == 'Male':
                    return 'مرد'
                else:
                    return 'زن'
            else:
                gender_dict = cleanNullTerms(fn_info['gender'])
                max_ = max(gender_dict, key=gender_dict.get)
                if max_ == 'Male':
                    return 'مرد'
                else:
                    return 'زن'
    
    return None


# Testing function
get_gender_persian('مراد')

'مرد'
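`get_gender_persian` repeats the lookup logic of `get_gender` and differs only in the labels it returns. As a sketch of separating the two concerns, the translation can live in a small dictionary. `GENDER_FA` and `translate_gender` are hypothetical names introduced for this example.

```python
# English label -> Persian label
GENDER_FA = {'Male': 'مرد', 'Female': 'زن'}

def translate_gender(label):
    """Map an English gender label to Persian; unknown or None labels give None."""
    return GENDER_FA.get(label)

print(translate_gender('Male'))
```

In the notebook this would be combined as `translate_gender(get_gender(name))`, so a missing prediction stays `None` instead of falling through to the female label.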

In [292]:
df_per = pd.DataFrame(data=[name])
df_per = df_per.T
df_per.columns = ['name']

In [293]:
df_per.insert(
    loc=1,
    column='fn',
    value=df_per.name.apply(lambda name: name.split()[0])
)

In [294]:
df_per

Unnamed: 0,name,fn
0,محمدجواد زارع,محمدجواد
1,سرکار خانم دکتر هليا هومن,سرکار
2,پارسا علی پور,پارسا
3,سرکار خانم دکتر نرگس علی شاهی,سرکار
4,امیرمهدی نیلوفری,امیرمهدی
5,جناب آقای دکتر ابوالفضل فرجی,جناب
6,جناب آقای محمدیاسین تهرانی,جناب
7,اميرعلي وثاق,اميرعلي
8,نيايش نیلوفری,نيايش
9,علیرضا رودگر,علیرضا


In [295]:
df_per.insert(
    loc=2,
    column='gender',
    value=df_per.fn.apply(lambda name: get_gender_persian(name))
)

In [296]:
df_per

Unnamed: 0,name,fn,gender
0,محمدجواد زارع,محمدجواد,مرد
1,سرکار خانم دکتر هليا هومن,سرکار,
2,پارسا علی پور,پارسا,مرد
3,سرکار خانم دکتر نرگس علی شاهی,سرکار,
4,امیرمهدی نیلوفری,امیرمهدی,مرد
5,جناب آقای دکتر ابوالفضل فرجی,جناب,مرد
6,جناب آقای محمدیاسین تهرانی,جناب,مرد
7,اميرعلي وثاق,اميرعلي,مرد
8,نيايش نیلوفری,نيايش,
9,علیرضا رودگر,علیرضا,مرد
