# Evaluation of techniques for name conflation

A little background and details on early phonetic algorithms: https://stackabuse.com/phonetic-similarity-of-words-a-vectorized-approach-in-python/
* Metaphone: https://en.wikipedia.org/wiki/Metaphone#Metaphone_3
* Soundex: https://en.wikipedia.org/wiki/Soundex
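
For orientation, here is a minimal pure-Python sketch of the classic American Soundex rules described on the Wikipedia page above. It is an illustration only, not the code used by any of the libraries listed below, and the name `soundex_sketch` is my own.

```python
def soundex_sketch(name: str, length: int = 4) -> str:
    """Encode a name with the classic American Soundex rules (illustrative sketch)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = [letters[0]]                # the first letter is kept as-is
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = codes.get(ch, "")         # vowels (and Y) have no code
        if code and code != prev:
            result.append(code)
        prev = code                      # a vowel resets the "previous code"
    return ("".join(result) + "0" * length)[:length]


for n in ["Jamie", "Jenna", "Sara", "Robert"]:
    print(n, soundex_sketch(n))
```

Because only the first letter and up to three consonant classes survive, many distinct names collapse to the same code. Here both `Jamie` and `Jenna` encode to `J500`.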

Some Python libraries that implement these algorithms:
* Fuzzy - Has a problem with encoding
* Phonetics - https://pypi.org/project/phonetics/
* Jellyfish - Works great, though only supports a few different algorithms
* Metaphone: https://pypi.org/project/Metaphone/ A Python implementation of the Metaphone and Double Metaphone algorithms
* Pyphonetics - https://github.com/Lilykos/pyphonetics and https://pypi.org/project/pyphonetics/
* https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/featx/phonetics.py Soundex, Metaphone, NYSIIS, Caverphone
* Abydos - https://pypi.org/project/abydos/ seems to be the granddaddy of them all. Lots of algorithms implemented - from the early Soundex to more modern algorithms such as Beider-Morse Phonetic Matching; also includes some non-English algorithms. 

Some examples and research behind phonetic algorithms:
  * MetaSoundex: http://www.informaticsjournals.com/index.php/gjeis/article/view/19822
  * Comparison of Caverphone, DMetaphone, NYSIIS, Soundex: https://www.scitepress.org/Papers/2016/59263/59263.pdf, recommends use of Metaphone for English dictionary words and NYSIIS for street names.
  * "Analysis and Comparative Study on Phonetic Matching Techniques"  https://pdfs.semanticscholar.org/9cbc/abee9d8911c65d2d4847bb612bae2f0c83af.pdf
  * "Phonetic Matching: A Better Soundex" (aka Beider-Morse algorithm) https://www.stevemorse.org/phonetics/bmpm2.htm
  * "Study Existing Various Phonetic Algorithms and Designing and Development of a working model for the New Developed Algorithm and Comparison by implementing it with Existing Algorithm(s)" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.677.3003&rep=rep1&type=pdf

As an FYI, Apache Solr includes the following phonetic matching algorithms:
* Beider-Morse Phonetic Matching (BMPM)
* Daitch-Mokotoff Soundex
* Double Metaphone
* Metaphone
* Soundex
* Refined Soundex
* Caverphone
* Kölner Phonetik a.k.a. Cologne Phonetic
* NYSIIS

In [1]:
import pandas as pd
import numpy as np
import pickle

In [2]:
import fuzzy

In [3]:
soundex = fuzzy.Soundex(10)
soundex('fuzzy')

'F2'

In [4]:
dmeta = fuzzy.DMetaphone()
dmeta('fuzzy')

[b'FS', None]

In [5]:
fuzzy.nysiis('fuzzy')
'FASY'

'FASY'

In [6]:
"""
for name in names:
    print('Name: ', name)
    print('Soundex: ', soundex(name))
    print('Metaphone: ', dmeta(name))
    print('Nysiis: ', fuzzy.nysiis(name))
    """

"\nfor name in names:\n    print('Name: ', name)\n    print('Soundex: ', soundex(name))\n    print('Metaphone: ', dmeta(name))\n    print('Nysiis: ', fuzzy.nysiis(name))\n    "

In [7]:
name = 'Sarah'
n2 = name.encode(encoding='ascii',errors='strict')
n2

b'Sarah'

The fuzzy library has some encoding errors that seem to require modification of fuzzy source code to resolve.


```python
#names = ['Sophia', 'Sofia', 'Seth']
names = ['Sarah', 'Sara']
names = ['Jamie', 'Jenna', 'Joanna', 'Jenny', 'Jaime']
for n in names:
    name = n.encode(encoding='ascii',errors='strict')
    print(name)
    print('Soundex: ', soundex(name))
    print('Metaphone: ', dmeta(name))
    print('Nysiis: ', fuzzy.nysiis(n))
```


In [8]:
import jellyfish

names = ['Sarah', 'Sara', 'Seth', 'Beth', 'Aaron', 'Erin']
names = ['Jamie', 'Jenna', 'Joanna', 'Jenny', 'Jaime']
for name in names:
    print('Name: ', name)
    print('   Metaphone: ', jellyfish.metaphone(name))
    print('   Soundex: ', jellyfish.soundex(name))
    print('   Nysiis: ', jellyfish.nysiis(name))
    print('   Match Rating: ', jellyfish.match_rating_codex(name))

Name:  Jamie
   Metaphone:  JM
   Soundex:  J500
   Nysiis:  JANY
   Match Rating:  JM
Name:  Jenna
   Metaphone:  JN
   Soundex:  J500
   Nysiis:  JAN
   Match Rating:  JNN
Name:  Joanna
   Metaphone:  JN
   Soundex:  J500
   Nysiis:  JAN
   Match Rating:  JNN
Name:  Jenny
   Metaphone:  JN
   Soundex:  J500
   Nysiis:  JANY
   Match Rating:  JNNY
Name:  Jaime
   Metaphone:  JM
   Soundex:  J500
   Nysiis:  JAN
   Match Rating:  JM


In [9]:
from abydos import phonetic as ap

Abydos also includes some hybrid phonetic algorithms that employ multiple underlying phonetic algorithms:
* Oxford Name Compression Algorithm (ONCA)
* MetaSoundex

In [10]:
ap.ONCA()

<abydos.phonetic._onca.ONCA at 0x118f5cb00>

In [11]:
s = ap.Soundex()

In [12]:
s.encode("Sara")

'S600'

In [13]:
bm = ap.BeiderMorse()

In [14]:
print(bm.encode('Sarah', language_arg='english'))
print(bm.encode('Jean', language_arg='english'))
print(bm.encode('Jean', language_arg='french'))
print(bm.encode('Jean', match_mode='exact'))
print(bm.encode('Jean'))

siri siro sira sori soro sora sari saro sara
zn
zDn zian zion
jean jan dZean xean Zean Zjan
iDn ian iin ion zDn zan zin xDn xan xin zian zion


In [15]:
# go through all names, add column for metaphone, soundex, nysiis
# find where metaphone/soundex/nysiis don't match
# which algorithm results in the most reduction of unique values?

# once you have a name, calculating edit distance could be useful to identify common misspellings or mistakes
# soundex for similar-sounding words?

# in baby_names_analysis, I was only looking at data from 1990 to now.
# Would including more historical data help with prediction?
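
The edit-distance idea in the notes above can be sketched with a plain Levenshtein distance. This is a minimal, assumption-laden illustration (the function name `levenshtein` is mine); Jellyfish also provides `jellyfish.levenshtein_distance`, which would be the more practical choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Sarah", "Sara"))    # 1
print(levenshtein("Sophia", "Sofia"))  # 2
```

A small distance between two names that also share a phonetic code is a hint that they may be spelling variants of one another.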

In [16]:
import os
os.getcwd()

'/Users/seth/OneDrive - The University of Colorado Denver/Documents/Development/forecasting/forecasting'

## Combine names based on how they sound

Each entry in the babynames dataset is keyed by spelling, so evaluate different sound-based algorithms to see how well they combine spellings and reduce the number of distinct names.
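
As a small illustration of what "combining" means here, the sketch below groups spellings that share a phonetic code. The codes are the Metaphone values printed by Jellyfish earlier in this notebook; the helper `group_by_code` is a hypothetical name, not part of any library.

```python
from collections import defaultdict

# Metaphone codes copied from the Jellyfish output earlier in this notebook.
codes = {"Jamie": "JM", "Jenna": "JN", "Joanna": "JN", "Jenny": "JN", "Jaime": "JM"}


def group_by_code(name_to_code):
    """Map each phonetic code to the list of spellings that produce it."""
    groups = defaultdict(list)
    for name, code in name_to_code.items():
        groups[code].append(name)
    return dict(groups)


groups = group_by_code(codes)
print(groups)
print(len(codes), "spellings ->", len(groups), "groups")
```

The fewer groups an algorithm produces, the more aggressively it conflates names, which is why the unique-value counts below are a useful first comparison.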

In [17]:
# load name dataset of unique names exported from R package babynames
with open('ssn_names_only.pickle', 'rb') as f:
    names = pickle.load(f)
names.shape

(77092, 1)

### Evaluate Jellyfish Library

In [18]:
import jellyfish

df = names.copy()
df['metaphone'] = df.name.map(jellyfish.metaphone)
df['soundex'] = df.name.map(jellyfish.soundex)
df['nysiis'] = df.name.map(jellyfish.nysiis)
df['matchrating'] = df.name.map(jellyfish.match_rating_codex)

In [19]:
df.head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
0,Jessica,JSK,J220,JASAC,JSSC
1,Ashley,AXL,A240,ASLY,ASHLY
2,Brittany,BRTN,B635,BRATANY,BRTTNY
3,Amanda,AMNT,A553,ANAND,AMND
4,Samantha,SMN0,S553,SANANT,SMNTH


In [20]:
print("Unique values:")
print("    Names: ", df.name.nunique())
print("    Metaphone: ", df.metaphone.nunique())
print("    Soundex: ", df.soundex.nunique())
print("    NYSIIS: ", df.nysiis.nunique())
print("    Match Rating Codex: ", df.matchrating.nunique())

Unique values:
    Names:  77092
    Metaphone:  10739
    Soundex:  3413
    NYSIIS:  16694
    Match Rating Codex:  24109


In [21]:
# need to find values that are conflated to the same Soundex, Metaphone, etc for comparison.
# ? start with most conflated?

In [22]:
tf = df.copy()

In [23]:
tf[['name', 'soundex']].groupby(['soundex']).agg('count').reset_index().sort_values('name', ascending=False).head()
#tf.sort_values()

Unnamed: 0,soundex,name
2678,S500,679
1465,J500,641
1624,K500,562
1854,M200,520
2520,R500,492


In [24]:
tf[tf.soundex == 'J500'].head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
54,Jamie,JM,J500,JANY,JM
80,Jenna,JN,J500,JAN,JNN
137,Joanna,JN,J500,JAN,JNN
213,Jenny,JN,J500,JANY,JNNY
309,Jaime,JM,J500,JAN,JM


In [25]:
tf[['name', 'matchrating']].groupby(['matchrating']).agg('count').reset_index().sort_values('name', ascending=False).head()

Unnamed: 0,matchrating,name
10329,JLN,128
11870,KL,117
10494,JN,107
6138,DVN,106
11930,KLN,105


In [26]:
tf[tf.matchrating == 'KL'].head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
435,Kaila,KL,K400,CAL,KL
443,Kali,KL,K400,CAL,KL
542,Kala,KL,K400,CAL,KL
807,Kaela,KL,K400,CAL,KL
932,Kalie,KL,K400,CALY,KL


In [27]:
tf[tf.name.isin(['Sofie', 'Sophie'])].head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
605,Sophie,SF,S100,SAFY,SPH
11236,Sofie,SF,S100,SAFY,SF


In [28]:
df[df.metaphone == 'SF'].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
27458,Zeev,SF,Z100,ZAAF,ZV
47332,Safi,SF,S100,SAF,SF
49351,Soffia,SF,S100,SAF,SFF
47382,Sava,SF,S100,SAV,SV
22677,Zvi,SF,Z100,ZV,ZV
62917,Sophey,SF,S100,SAFY,SPHY
65465,Xavy,SF,X100,XAVY,XVY
62910,Saveah,SF,S100,SAV,SVH
64473,Zivah,SF,Z100,ZAV,ZVH
12562,Safaa,SF,S100,SAF,SF


In [29]:
df[df.soundex == 'S100'].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
73777,Shivya,XFY,S100,SAVY,SHVY
19938,Seve,SF,S100,SAF,SV
4345,Safiya,SFY,S100,SAFAY,SFY
35913,Svea,SF,S100,SV,SV
72839,Szofia,SSF,S100,SAF,SZF
21354,Shoaib,XB,S100,SAB,SHB
67236,Suheb,SHB,S100,SAHAB,SHB
48711,Sophya,SFY,S100,SAFY,SPHY
65343,Sevi,SF,S100,SAF,SV
29240,Safiah,SF,S100,SAF,SFH


In [30]:
df[df.nysiis == 'SAFY'].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
67057,Sophiee,SF,S100,SAFY,SPH
72432,Sophyia,SFY,S100,SAFY,SPHY
72833,Soffie,SF,S100,SAFY,SFF
59266,Shiffy,XF,S100,SAFY,SHFFY
48711,Sophya,SFY,S100,SAFY,SPHY
48709,Sofya,SFY,S100,SAFY,SFY
51613,Shafee,XF,S100,SAFY,SHF
605,Sophie,SF,S100,SAFY,SPH
56055,Sevy,SF,S100,SAFY,SVY
11236,Sofie,SF,S100,SAFY,SF


In [31]:
df[df.matchrating.isin(['SF', 'SPH'])].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
77045,Sefa,SF,S100,SAF,SF
29336,Sofi,SF,S100,SAF,SF
5485,Safa,SF,S100,SAF,SF
69318,Sifa,SF,S100,SAF,SF
32280,Sophi,SF,S100,SAF,SPH
67686,Safee,SF,S100,SAFY,SF
51392,Sephia,SF,S100,SAF,SPH
19930,Saif,SF,S100,SAF,SF
9955,Safia,SF,S100,SAF,SF
71093,Sopheia,SF,S100,SAF,SPH


### Looks like the NYSIIS and Match Rating Codex algorithms give the best results here

### Evaluate the Abydos library

In [32]:
from abydos import phonetic as ap

apdf = names.copy()

In [33]:
%%timeit -n1 -r1

apdf['onca'] = apdf.name.map(ap.ONCA().encode)

1.27 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [34]:
%%timeit -n1 -r1

apdf['metasoundex'] = apdf.name.map(ap.MetaSoundex().encode)

1.16 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [35]:
%%timeit -n1 -r1

apdf['caverphone'] = apdf.name.map(ap.Caverphone().encode)

953 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [36]:
%%timeit -n1 -r1

apdf['daitchmokotoff'] = apdf.name.map(ap.DaitchMokotoff().encode)

1.57 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Beider Morse is roughly 150x - 400x slower than the other algorithms, so split data set and run in multiple processes

%%timeit result for full data set if run serially
```
4min 1s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```

Using multiprocessing on a machine with 4 cores cuts the time to roughly a third:

```
1min 22s
```
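
As a quick check on those two timings, the speedup and per-core efficiency can be computed directly. This is only arithmetic on the figures quoted above; the helper `to_seconds` is my own.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timing into seconds."""
    return minutes * 60 + seconds


serial = to_seconds(4, 1)     # 4min 1s, serial run
parallel = to_seconds(1, 22)  # 1min 22s, multiprocessing run
cores = 4

speedup = serial / parallel
efficiency = speedup / cores
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# speedup: 2.94x, efficiency: 73%
```

The gap to a perfect 4x is expected: each worker process has to be started, and the DataFrame chunks have to be pickled to the workers and back.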

In [37]:
import multiprocessing
from multiprocessing import Pool
import psutil
from functools import partial

def process_df(function, label, df_in):
    # this partial is Beider-Morse specific; remove it and you can parallelize
    # any method that doesn't require additional parameters
    func = partial(function, language_arg = 'english')
    df_in[label] = df_in.name.map(func)
    return df_in

def parallelize(inputdf, function, label):
    num_processes = psutil.cpu_count(logical=False)
    num_partitions = num_processes * 2 #smaller batches to get more frequent status updates (if function provides them)
    func = partial(process_df, function, label)
    with Pool(processes=num_processes) as pool:
        df_split = np.array_split(inputdf, num_partitions)
        df_out = pd.concat(pool.map(func, df_split))
    return df_out

In [38]:
from datetime import datetime 
start_time = datetime.now() 

apdf = parallelize(apdf, ap.BeiderMorse().encode, 'beidermorse')

time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:01:59.806210


In [39]:
ap.BeiderMorse().encode('Seth', language_arg = 'english')

'sit'

In [40]:
%%timeit -n1 -r1

apdf['parmarkumbharana'] = apdf.name.map(ap.ParmarKumbharana().encode)

615 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [41]:
apdf.head()

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
0,Jessica,J220,1000,YSKA111111,"{444000, 445000, 144000, 145000}",zYsQki zYsiki zYsDki zisQki zisiki zisDki zYsi...,JSC
1,Ashley,A240,240,ASLA111111,{048000},izlii izlD izla izli ozlii ozlD ozla ozli azli...,ASL
2,Brittany,B635,7635,PRTNA11111,{793600},brQtini britini brDtini britoni brQtoni britan...,BRTN
3,Amanda,A553,530,AMNTA11111,{066300},imindi imndi imindo imndo iminda imnda imondi ...,AMND
4,Samantha,S553,4500,SMNTA11111,{466300},siminti simnti siminto simnto siminta simnta s...,SMNTH


In [42]:
repr(apdf['daitchmokotoff'][0])

"{'444000', '445000', '144000', '145000'}"

In [43]:
apdf.daitchmokotoff.map(repr).nunique()

6679
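
One caveat with counting `repr` strings: the text of a set's `repr` follows the set's iteration order, which Python does not guarantee to be identical for two equal sets. A hedged alternative, assuming each Daitch-Mokotoff value is a set of code strings as shown above, is to use a sorted tuple as the key. `code_key` is a hypothetical helper name.

```python
def code_key(codes):
    """Return a hashable, order-independent key for a set of codes."""
    return tuple(sorted(codes))


a = {"444000", "445000", "144000", "145000"}
b = {"145000", "144000", "445000", "444000"}

print(code_key(a))                    # ('144000', '145000', '444000', '445000')
print(code_key(a) == code_key(b))     # True
print(len({code_key(s) for s in [a, b, {"048000"}]}))  # 2
```

In the notebook this could be applied as `apdf.daitchmokotoff.map(code_key).nunique()`. I have not checked whether it changes the count reported above.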

In [44]:
print("Unique values:")
print("    Names:            ", apdf.name.nunique())
print("    ONCA:             ", apdf.onca.nunique())
print("    MetaSoundex:      ", apdf.metasoundex.nunique())
print("    Caverphone:       ", apdf.caverphone.nunique())
print("    Daitchmokotoff:   ", apdf.daitchmokotoff.map(repr).nunique())
print("    Beidermorse:      ", apdf.beidermorse.nunique())
print("    ParmarKumbharana: ", apdf.parmarkumbharana.nunique())

Unique values:
    Names:             77092
    ONCA:              3193
    MetaSoundex:       1279
    Caverphone:        6083
    Daitchmokotoff:    6679
    Beidermorse:       51501
    ParmarKumbharana:  13875


In [45]:
temp = apdf.copy()
temp['daitchmokotoff'] = temp.daitchmokotoff.map(repr)
for c in temp.columns:
    if c == 'name':
        continue
    top = temp[['name', c]].\
          groupby([c]).\
          agg('count').\
          reset_index().\
          sort_values('name', ascending=False).\
          rename(columns={'name':'count'})
    print(top.head())
    print('Sample of names matching the most common value')
    print(temp[temp[c] ==  top.head(1).values[0][0]]['name'].head())

      onca  count
469   C500    813
2482  S500    696
1455  J500    682
432   C400    614
456   C450    553
Sample of names matching the most common value
331    Cheyenne
482      Connie
613       Kenya
757       Kiana
784         Kim
Name: name, dtype: object
    metasoundex  count
711        5500   1881
280        1500   1426
800        6200   1224
419        3500   1120
681        5400   1051
Sample of names matching the most common value
29       Hannah
67      Shannon
249      Shawna
331    Cheyenne
343       Hanna
Name: name, dtype: object
      caverphone  count
1706  KLA1111111    711
5055  TNA1111111    698
4471  SNA1111111    662
1877  KNA1111111    592
194   ALA1111111    581
Sample of names matching the most common value
13      Kayla
40      Kelly
157    Claire
162     Carly
178     Kylie
Name: name, dtype: object
     daitchmokotoff  count
2924     {'460000'}    695
3846     {'560000'}    592
2125     {'360000'}    586
1039     {'086000'}    559
4188     {'586000'}    522

In [46]:
check = ['Sophia', 'Sofia']
#names[names.sort_values('name').name.astype(str).str[0:3] == 'Aar'
apdf[apdf.name.isin(check)]

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
250,Sophia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF


### Now look at the names matching each of the different phonetic algorithms

In [47]:
apdf[apdf.onca == 'S100'].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
21331,Savvas,S100,4120,SFS1111111,{474000},sivis sivos sivas sovis sovos sovas savis savo...,SVS
60814,Savea,S100,4100,SFA1111111,{470000},siivi,SV
72835,Sopia,S100,4100,SPA1111111,{470000},sopii sopio sopia sapii sapio sapia,SP
74117,Safiyo,S100,4100,SFA1111111,{470000},sifiio sifQio sifiia sifQia sofiio sofQio sofi...,SF
51392,Sephia,S100,4100,SFA1111111,{470000},sYfii sifii sYfio sifio sYfia sifia,SPH
4345,Safiya,S100,4100,SFA1111111,{470000},sifiii sifQii sifiio sifQio sifiia sifQia sofi...,SF
67687,Saheb,S100,4100,SP11111111,{457000},siiip,SHB
67058,Sophiyah,S100,4100,SFA1111111,{470000},sofiii sofQii sofiio sofQio sofiia sofQia safi...,SPHH
605,Sophie,S100,4100,SFA1111111,{470000},sofi safi,SPH
61710,Savva,S100,4100,SFA1111111,{470000},sivi sivo siva sovi sovo sova savi savo sava,SV


In [48]:
apdf[apdf.metasoundex == '4100'].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
55704,Sef,S100,4100,SF11111111,{470000},sif,SF
8489,Zahava,Z100,4100,SFA1111111,{457000},ziivi ziivo ziiva ziD ziovi ziovo ziova ziavi ...,ZHV
67058,Sophiyah,S100,4100,SFA1111111,{470000},sofiii sofQii sofiio sofQio sofiia sofQia safi...,SPHH
57951,Sibi,S100,4100,SPA1111111,{470000},sibi sQbi,SB
65343,Sevi,S100,4100,SFA1111111,{470000},sYvi sivi,SV
22677,Zvi,Z100,4100,SFA1111111,{470000},zvi,ZV
67874,Xophia,X100,4100,KFA1111111,{570000},zofii zofio zofia zafii zafio zafia,XPH
41205,Sakib,S210,4100,SKP1111111,{457000},sikip sokip sakip,SKB
49328,Seviah,S100,4100,SFA1111111,{470000},sYvii sivii sYvio sivio sYvia sivia,SVH
14957,Sobia,S100,4100,SPA1111111,{470000},sobii sobio sobia sabii sabio sabia,SB


In [49]:
apdf[apdf.caverphone == 'SFA1111111'].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
30874,Safire,S160,4160,SFA1111111,{479000},sifDr sifar sifir sifDrY sifDri sifarY sifari ...,SFR
333,Sylvia,S410,4410,SFA1111111,{487000},silvii slvii silvio slvio silvia slvia,SLV
29240,Safiah,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SFH
53809,Siva,S100,4100,SFA1111111,{470000},sQvi sivi sDvi sivo sQvo siva sQva,SV
67057,Sophiee,S100,4100,SFA1111111,{470000},sofi sofii safi safii,SPH
55291,Silvie,S410,4410,SFA1111111,{487000},silvi sQlvi slvi,SLV
39244,Shaefer,S160,5160,SFA1111111,{479000},siifir sDfir sofir sifir safir,SFR
609,Silvia,S410,4410,SFA1111111,{487000},silvii sQlvii slvii silvio sQlvio slvio silvia...,SLV
67947,Suhavi,S100,4100,SFA1111111,{457000},suivi suovi suavi saivi soivi saovi soovi saav...,SHV
60814,Savea,S100,4100,SFA1111111,{470000},siivi,SV


In [50]:
tdf = apdf.copy()
tdf['daitchmokotoff'] = apdf.daitchmokotoff.map(repr)
tdf[tdf.daitchmokotoff.str.contains('470000')].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
53809,Siva,S100,4100,SFA1111111,{'470000'},sQvi sivi sDvi sivo sQvo siva sQva,SV
67052,Sofee,S100,4100,SFA1111111,{'470000'},sDfi sofi sufi,SF
71866,Cap,C100,5100,KP11111111,"{'570000', '470000'}",kip kop kap,CP
15226,Zavia,Z100,4100,SFA1111111,{'470000'},zivii zivio zivia zovii zovio zovia zavii zavi...,ZV
35913,Svea,S100,4100,SFA1111111,{'470000'},svi,SV
16544,Jeb,J100,1100,YP11111111,"{'170000', '470000'}",zip,JB
41112,Joeb,J100,1100,YP11111111,"{'170000', '470000'}",zoip zaip,JB
12562,Safaa,S100,4100,SFA1111111,{'470000'},sifii sifio sifD sifia sifoi sifoo sifoa sifai...,SF
22069,Jsoeph,J100,1100,ASF1111111,"{'147000', '470000'}",zsoif zsaif,JSPH
55628,Jiwoo,J000,1000,YWA1111111,"{'170000', '470000'}",ziwu zQwu zivu zQvu,JW


In [51]:
checklist = apdf[apdf.name == 'Sofia'].beidermorse.values[0].split()
def check_fn(input):
    return any(x in input.split() for x in checklist)

apdf[apdf.beidermorse.map(check_fn)]

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
250,Sophia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
9955,Safia,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SF
12562,Safaa,S100,4100,SFA1111111,{470000},sifii sifio sifD sifia sifoi sifoo sifoa sifai...,SF
29240,Safiah,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SFH
40366,Saphia,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SPH
41050,Zsofia,Z100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,ZSF
48433,Sofhia,S100,4100,SFA1111111,{475000},sofii sofio sofia safii safio safia,SFH
48709,Sofya,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
48711,Sophya,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH


In [52]:
apdf[apdf.parmarkumbharana.isin(['SF', 'SPH'])].sample(n=20)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
51613,Shafee,S100,5100,SFA1111111,{470000},siifi,SF
74117,Safiyo,S100,4100,SFA1111111,{470000},sifiio sifQio sifiia sifQia sofiio sofQio sofi...,SF
47332,Safi,S100,4100,SFA1111111,{470000},sifi sofi safi,SF
69464,Saffiya,S100,4100,SFA1111111,{470000},sifiii sifQii sifiio sifQio sifiia sifQia sofi...,SF
48711,Sophya,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
11236,Sofie,S100,4100,SFA1111111,{470000},sofi safi,SF
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
39787,Shafi,S100,5100,SFA1111111,{470000},sifi sofi safi,SF
47133,Saffa,S100,4100,SFA1111111,{470000},sifi sifo sifa sofi sofo sofa safi safo safa,SF
63496,Sophiea,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH


### Results

This is highly subjective, but most phonetic algorithms appear to over-conflate the source names. They produce too many false positives: names that are not similar in pronunciation or spelling are assigned the same code.

```
    MetaSoundex:       1279
    ONCA:              3193
    Soundex:           3413
    Caverphone:        6083
    Daitchmokotoff:    6679
    Metaphone:        10739
    ParmarKumbharana: 13875
    NYSIIS:           16694
    Match Rating:     24109
    Beidermorse:      51501

    Total Unique Names: 77092
```
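
One way to read these counts is as an average number of names per code, where a higher number means more aggressive conflation. The sketch below only does arithmetic on the figures listed above.

```python
# Unique code counts copied from the summary above.
unique_codes = {
    "MetaSoundex": 1279,
    "ONCA": 3193,
    "Soundex": 3413,
    "Caverphone": 6083,
    "Daitchmokotoff": 6679,
    "Metaphone": 10739,
    "ParmarKumbharana": 13875,
    "NYSIIS": 16694,
    "Match Rating": 24109,
    "Beidermorse": 51501,
}
total_names = 77092

# Average number of distinct spellings that collapse into each code.
names_per_code = {alg: total_names / n for alg, n in unique_codes.items()}

for alg, ratio in sorted(names_per_code.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{alg:<17} {ratio:6.1f} names per code")
```

MetaSoundex averages about 60 names per code, while Beider-Morse averages about 1.5. This is only an average; the most common codes shown earlier absorb hundreds of names each.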

To me, as a native English speaker, the algorithms that produce more unique codes than Metaphone (roughly 10,000) work best. From Jellyfish these are NYSIIS and Match Rating Codex; from Abydos they are the Beider & Morse and Parmar & Kumbharana algorithms. Since Abydos includes all of the algorithms from Jellyfish as well as many others, Abydos seems the better library.

Match Rating Codex and the Parmar & Kumbharana algorithms appear to behave similarly: names that I think should be assigned a similar category often are not. However, when looking at the combined results of the multiple categories, the results appear pretty good.

Of course, if someone is a non-native English speaker in a community of non-native English speakers, different algorithms may work better.

Splitting the space-separated codes that Beider & Morse returns for each name, and combining names that share any of those codes, gives what I think are the best results, so I will use that for additional analysis.

### Perform sample analysis to see how the Beider & Morse algorithm might work

In [53]:
names = apdf[['name', 'beidermorse']].copy()
names['bmset'] = names['beidermorse'].str.split().apply(set)

In [54]:
# setup test dataset
data = [{'year': 1990.0,
  'sex': 'F',
  'name': 'Bayleigh',
  'n': 11,
  'prop': 5.36e-06},
 {'year': 1990.0,
  'sex': 'F',
  'name': 'Dyesha',
  'n': 8,
  'prop': 3.89e-06},
 {'year': 1990.0,
  'sex': 'F',
  'name': 'Latrivia',
  'n': 7,
  'prop': 3.41e-06},
 {'year': 1990.0,
  'sex': 'F',
  'name': 'Leinaala',
  'n': 9,
  'prop': 4.38e-06},
 {'year': 1990.0,
  'sex': 'F',
  'name': 'Michael',
  'n': 278,
  'prop': 0.00013535},
 {'year': 1990.0,
  'sex': 'M',
  'name': 'Cordarious',
  'n': 14,
  'prop': 6.51e-06},
 {'year': 1990.0,
  'sex': 'M',
  'name': 'Jeromy',
  'n': 115,
  'prop': 5.346e-05},
 {'year': 1990.0,
  'sex': 'M',
  'name': 'Kelcie',
  'n': 6,
  'prop': 2.79e-06},
 {'year': 1990.0,
  'sex': 'M',
  'name': 'Nelson',
  'n': 931,
  'prop': 0.00043279},
 {'year': 1990.0,
  'sex': 'M',
  'name': 'Shade',
  'n': 11,
  'prop': 5.11e-06},
 {'year': 1991.0,
  'sex': 'F',
  'name': 'Michael',
  'n': 42,
  'prop': 3e-05},
 {'year': 1990.0,
  'sex': 'F',
  'name': 'Mychael',
  'n': 42,
  'prop': 3e-05},
 {'year': 1990.0,
  'sex': 'M',
  'name': 'Mychael',
  'n': 42,
  'prop': 3e-05}]
alt = pd.DataFrame.from_dict(data)
alt.sort_values(['year', 'sex', 'name'], inplace=True)
alt.shape

(13, 5)

In [55]:
alt['counted'] = False
alt['alt_n'] = 0
alt['alt_prop'] = 0.0

# process each row of dataframe
def create_alt_n(row):
    print(row['name'])

    # do no further processing if this row has already been counted;
    # read the flag from alt, because row comes from a copy made before any updates
    if alt.at[row.name, 'counted']:
        return

    # find matching names
    checklist = names[names.name == row['name']].beidermorse.values[0].split()
    find = lambda i: any(x in i for x in checklist)
    found = names[names.bmset.map(find)].name
    
    # aggregate count, excluding counted names, for all found names into alt_name
    alt.loc[(alt.name == row['name']) &
            (alt.year == row.year) &
            (alt.sex == row.sex) ,
            'alt_n'] = alt[(alt.name.isin(found)) & 
                           (alt.year == row.year) &
                           (alt.sex == row.sex) &
                           (alt.counted == False)]['n'].sum()
    
    # set counted flag for found names in group
    # ? how to update just group ?
    alt.loc[(alt.name.isin(found)) & (alt.year == row.year) & (alt.sex == row.sex), 'counted'] = True

gdf = alt.groupby(['year', 'sex'])
for name, group in gdf:
    g = group.sort_values('n', ascending=False).copy()
    g.apply(create_alt_n, axis=1)

# create alt_prop
def create_alt_prop(row):
    alt.loc[(alt.name == row['name']) &
            (alt.year == row.year) &
            (alt.sex == row.sex) ,
            'alt_prop'] = row['alt_n'] / gsum

for name, group in gdf:
    print(name)
    gsum = group['alt_n'].sum()
    group.apply(create_alt_prop, axis=1)    
    
alt

Michael
Mychael
Bayleigh
Leinaala
Dyesha
Latrivia
Nelson
Jeromy
Mychael
Cordarious
Shade
Kelcie
Michael
(1990.0, 'F')
(1990.0, 'M')
(1991.0, 'F')


Unnamed: 0,n,name,prop,sex,year,counted,alt_n,alt_prop
0,11,Bayleigh,5e-06,F,1990.0,True,11,0.030986
1,8,Dyesha,4e-06,F,1990.0,True,8,0.022535
2,7,Latrivia,3e-06,F,1990.0,True,7,0.019718
3,9,Leinaala,4e-06,F,1990.0,True,9,0.025352
4,278,Michael,0.000135,F,1990.0,True,320,0.901408
11,42,Mychael,3e-05,F,1990.0,True,0,0.0
5,14,Cordarious,7e-06,M,1990.0,True,14,0.012511
6,115,Jeromy,5.3e-05,M,1990.0,True,115,0.10277
7,6,Kelcie,3e-06,M,1990.0,True,6,0.005362
12,42,Mychael,3e-05,M,1990.0,True,42,0.037534
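
To make the `alt_prop` column easier to verify, the sketch below recomputes it by hand for the (1990.0, 'F') group using the `alt_n` values from the table above. Note that this is each name's share of the sample group's total, not the population-level `prop` from the original data.

```python
# alt_n values for the (1990.0, 'F') group, copied from the table above.
# Michael's 320 is its own 278 plus the 42 from the matching spelling Mychael.
alt_n = {
    "Bayleigh": 11,
    "Dyesha": 8,
    "Latrivia": 7,
    "Leinaala": 9,
    "Michael": 320,
    "Mychael": 0,
}

group_total = sum(alt_n.values())
alt_prop = {name: n / group_total for name, n in alt_n.items()}

print(group_total)                      # 355
print(round(alt_prop["Michael"], 6))    # 0.901408
print(round(alt_prop["Bayleigh"], 6))   # 0.030986
```

The proportions within a group sum to 1, which is a cheap sanity check to run after aggregating the full dataset.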
