# Evaluation of techniques for name conflation

A little background and details on early phonetic algorithms: https://stackabuse.com/phonetic-similarity-of-words-a-vectorized-approach-in-python/
* Metaphone: https://en.wikipedia.org/wiki/Metaphone#Metaphone_3
* Soundex: https://en.wikipedia.org/wiki/Soundex
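
To make the idea concrete, here is a minimal pure-Python sketch of American Soundex following the rules on the Wikipedia page above. The name `soundex_sketch` is hypothetical, and this is not the implementation used by any of the libraries evaluated below; real libraries may treat edge cases (non-English letters, name prefixes) differently.

```python
def soundex_sketch(name: str) -> str:
    """Return a 4-character American Soundex code (illustrative sketch only)."""
    # Consonant groups and their digits; vowels, Y, H and W have no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = "".join(ch for ch in name.upper() if ch.isalpha())
    if not letters_only:
        return ""

    encoded = letters_only[0]                # the first letter is kept as-is
    prev = codes.get(letters_only[0], "")    # but its digit can suppress a repeat
    for ch in letters_only[1:]:
        if ch in "HW":
            continue                         # H and W do not separate equal codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        prev = digit                         # a vowel resets prev to ""
    return (encoded + "000")[:4]             # pad or truncate to 4 characters


for n in ["Jamie", "Jenna", "Joanna", "Sara", "Robert"]:
    print(n, soundex_sketch(n))
```

Because only the first letter and up to three consonant classes survive, names such as Jamie, Jenna, and Joanna all collapse to the same code.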

Some Python libraries that implement these algorithms:
* Fuzzy - Has a problem with encoding
* Phonetics - https://pypi.org/project/phonetics/
* Jellyfish - Works great, though only supports a few different algorithms
* Metaphone: https://pypi.org/project/Metaphone/ A Python implementation of the Metaphone and Double Metaphone algorithms
* Pyphonetics - https://github.com/Lilykos/pyphonetics and https://pypi.org/project/pyphonetics/
* https://github.com/japerk/nltk-trainer/blob/master/nltk_trainer/featx/phonetics.py Soundex, Metaphone, NYSIIS, Caverphone
* Abydos - https://pypi.org/project/abydos/ seems to be the granddaddy of them all. Lots of algorithms implemented - from the early Soundex to more modern algorithms such as Beider-Morse Phonetic Matching; also includes some non-English algorithms. 

Some examples and research behind phonetic algorithms:
  * MetaSoundex: http://www.informaticsjournals.com/index.php/gjeis/article/view/19822
  * Comparison of Caverphone, DMetaphone, NYSIIS, Soundex: https://www.scitepress.org/Papers/2016/59263/59263.pdf, recommends use of Metaphone for English dictionary words and NYSIIS for street names.
  * "Analysis and Comparative Study on Phonetic Matching Techniques"  https://pdfs.semanticscholar.org/9cbc/abee9d8911c65d2d4847bb612bae2f0c83af.pdf
  * "Phonetic Matching: A Better Soundex" (aka Beider-Morse algorithm) https://www.stevemorse.org/phonetics/bmpm2.htm
  * "Study Existing Various Phonetic Algorithms and Designing and Development of a working model for the New Developed Algorithm and Comparison by implementing it with Existing Algorithm(s)" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.677.3003&rep=rep1&type=pdf

As an FYI, Apache Solr includes the following phonetic matching algorithms:
* Beider-Morse Phonetic Matching (BMPM)
* Daitch-Mokotoff Soundex
* Double Metaphone
* Metaphone
* Soundex
* Refined Soundex
* Caverphone
* Kölner Phonetik a.k.a. Cologne Phonetic
* NYSIIS

In [1]:
import pandas as pd
import numpy as np

In [2]:
import fuzzy

In [3]:
soundex = fuzzy.Soundex(10)
soundex('fuzzy')

'F2'

In [4]:
dmeta = fuzzy.DMetaphone()
dmeta('fuzzy')

[b'FS', None]

In [5]:
fuzzy.nysiis('fuzzy')

'FASY'

In [6]:
"""
for name in names:
    print('Name: ', name)
    print('Soundex: ', soundex(name))
    print('Metaphone: ', dmeta(name))
    print('Nysiis: ', fuzzy.nysiis(name))
    """

"\nfor name in names:\n    print('Name: ', name)\n    print('Soundex: ', soundex(name))\n    print('Metaphone: ', dmeta(name))\n    print('Nysiis: ', fuzzy.nysiis(name))\n    "

In [7]:
name = 'Sarah'
n2 = name.encode(encoding='ascii',errors='strict')
n2

b'Sarah'

The fuzzy library has some encoding errors that seem to require modification of fuzzy source code to resolve.


```python
#names = ['Sophia', 'Sofia', 'Seth']
names = ['Sarah', 'Sara']
names = ['Jamie', 'Jenna', 'Joanna', 'Jenny', 'Jaime']
for n in names:
    name = n.encode(encoding='ascii',errors='strict')
    print(name)
    print('Soundex: ', soundex(name))
    print('Metaphone: ', dmeta(name))
    print('Nysiis: ', fuzzy.nysiis(n))
```
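
If the encoding errors come from non-ASCII characters in the input (an assumption; the exact cause inside fuzzy was not tracked down here), one workaround that avoids touching the library source is to fold names to plain ASCII bytes before encoding them. A minimal sketch using only the standard library; `to_ascii_bytes` is a hypothetical helper name:

```python
import unicodedata


def to_ascii_bytes(name: str) -> bytes:
    """Fold accented characters to ASCII base letters and return bytes."""
    # NFKD splits an accented letter into its base letter plus a combining
    # mark; errors="ignore" then drops the mark instead of raising
    # UnicodeEncodeError as errors="strict" would.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", errors="ignore")


for n in ["Sarah", "Zoë", "José"]:
    print(n, "->", to_ascii_bytes(n))
```

Characters that have no ASCII decomposition are silently dropped by this approach, so the results are worth spot-checking before they are fed to a phonetic encoder.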


In [8]:
import jellyfish

names = ['Sarah', 'Sara', 'Seth', 'Beth', 'Aaron', 'Erin']
names = ['Jamie', 'Jenna', 'Joanna', 'Jenny', 'Jaime']
for name in names:
    print('Name: ', name)
    print('   Metaphone: ', jellyfish.metaphone(name))
    print('   Soundex: ', jellyfish.soundex(name))
    print('   Nysiis: ', jellyfish.nysiis(name))
    print('   Match Rating: ', jellyfish.match_rating_codex(name))

Name:  Jamie
   Metaphone:  JM
   Soundex:  J500
   Nysiis:  JANY
   Match Rating:  JM
Name:  Jenna
   Metaphone:  JN
   Soundex:  J500
   Nysiis:  JAN
   Match Rating:  JNN
Name:  Joanna
   Metaphone:  JN
   Soundex:  J500
   Nysiis:  JAN
   Match Rating:  JNN
Name:  Jenny
   Metaphone:  JN
   Soundex:  J500
   Nysiis:  JANY
   Match Rating:  JNNY
Name:  Jaime
   Metaphone:  JM
   Soundex:  J500
   Nysiis:  JAN
   Match Rating:  JM


In [9]:
from abydos import phonetic as ap

Abydos also includes some hybrid phonetic algorithms that employ multiple underlying phonetic algorithms:
* Oxford Name Compression Algorithm (ONCA): `ap.ONCA`
* MetaSoundex: `ap.MetaSoundex`

In [10]:
ap.ONCA()

<abydos.phonetic._onca.ONCA at 0x1129c47f0>

In [11]:
s = ap.Soundex()

In [12]:
s.encode("Sara")

'S600'

In [13]:
bm = ap.BeiderMorse()

In [14]:
print(bm.encode('Sarah', language_arg='english'))
print(bm.encode('Jean', language_arg='english'))
print(bm.encode('Jean', language_arg='french'))
print(bm.encode('Jean', match_mode='exact'))
print(bm.encode('Jean'))

siri siro sira sori soro sora sari saro sara
zn
zDn zian zion
jean jan dZean xean Zean Zjan
iDn ian iin ion zDn zan zin xDn xan xin zian zion


In [15]:
# go through all names, add column for metaphone, soundex, nysiis
# find where metaphone/soundex/nysiis don't match
# which algorithm results in the most reduction of unique values?

# once you have a name, calculating edit distance could be useful to identify common misspellings or mistakes
# soundex similar sounding words/?

# in baby_names_analysis, I was only looking at data from 1990 to now.
# Would including more historical data help with prediction?
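
For the edit-distance idea in the notes above, a minimal pure-Python Levenshtein distance sketch follows. Jellyfish provides its own `levenshtein_distance`, which would be the practical choice; the function name `levenshtein` here is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b (classic dynamic-programming formulation)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


for pair in [("Sarah", "Sara"), ("Sophia", "Sofia")]:
    print(pair, levenshtein(*pair))
```

A small distance between two names that also share a phonetic code is a reasonable hint that one is a spelling variant of the other.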

In [16]:
import os
os.getcwd()

'/Users/seth/OneDrive - The University of Colorado Denver/Documents/Development/forecasting/forecasting'

## Combine names based on how they sound

As each entry in the babynames dataset is based on spelling, evaluate different sound-based algorithms to see how well they combine spelling variants and reduce the number of distinct names.

In [17]:
# load name dataset of unique names exported from R package babynames
import pickle
with open('ssn_names_only.pickle', 'rb') as f:
    names = pickle.load(f)
names.shape

(77092, 1)

### Evaluate Jellyfish Library

In [18]:
import jellyfish

df = names.copy()
df['metaphone'] = df.name.map(jellyfish.metaphone)
df['soundex'] = df.name.map(jellyfish.soundex)
df['nysiis'] = df.name.map(jellyfish.nysiis)
df['matchrating'] = df.name.map(jellyfish.match_rating_codex)

In [19]:
df.head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
0,Jessica,JSK,J220,JASAC,JSSC
1,Ashley,AXL,A240,ASLY,ASHLY
2,Brittany,BRTN,B635,BRATANY,BRTTNY
3,Amanda,AMNT,A553,ANAND,AMND
4,Samantha,SMN0,S553,SANANT,SMNTH


In [20]:
print("Unique values:")
print("    Names: ", df.name.nunique())
print("    Metaphone: ", df.metaphone.nunique())
print("    Soundex: ", df.soundex.nunique())
print("    NYSIIS: ", df.nysiis.nunique())
print("    Match Rating Codex: ", df.matchrating.nunique())

Unique values:
    Names:  77092
    Metaphone:  10739
    Soundex:  3413
    NYSIIS:  16694
    Match Rating Codex:  24109


In [21]:
# need to find values that are conflated to the same Soundex, Metaphone, etc for comparison.
# ? start with most conflated?
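
One way to find names that are conflated to the same code is to invert the name-to-code mapping. A minimal sketch using only the standard library, with five names and their codes copied from the Jellyfish output earlier in this notebook; `group_by_code` is a hypothetical helper:

```python
from collections import defaultdict

# (name, soundex, nysiis) copied from the Jellyfish output above
rows = [
    ("Jamie", "J500", "JANY"),
    ("Jenna", "J500", "JAN"),
    ("Joanna", "J500", "JAN"),
    ("Jenny", "J500", "JANY"),
    ("Jaime", "J500", "JAN"),
]


def group_by_code(pairs):
    """Map each code to the names that share it, largest group first."""
    groups = defaultdict(list)
    for name, code in pairs:
        groups[code].append(name)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


by_soundex = group_by_code((name, sx) for name, sx, _ in rows)
by_nysiis = group_by_code((name, ny) for name, _, ny in rows)
print(by_soundex)
print(by_nysiis)
```

On this sample Soundex puts all five names in one group, while NYSIIS splits them into two.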

In [22]:
tf = df.copy()

In [23]:
tf[['name', 'soundex']].groupby(['soundex']).agg('count').reset_index().sort_values('name', ascending=False).head()
#tf.sort_values()

Unnamed: 0,soundex,name
2678,S500,679
1465,J500,641
1624,K500,562
1854,M200,520
2520,R500,492


In [24]:
tf[tf.soundex == 'J500'].head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
54,Jamie,JM,J500,JANY,JM
80,Jenna,JN,J500,JAN,JNN
137,Joanna,JN,J500,JAN,JNN
213,Jenny,JN,J500,JANY,JNNY
309,Jaime,JM,J500,JAN,JM


In [25]:
tf[['name', 'matchrating']].groupby(['matchrating']).agg('count').reset_index().sort_values('name', ascending=False).head()

Unnamed: 0,matchrating,name
10329,JLN,128
11870,KL,117
10494,JN,107
6138,DVN,106
11930,KLN,105


In [26]:
tf[tf.matchrating == 'KL'].head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
435,Kaila,KL,K400,CAL,KL
443,Kali,KL,K400,CAL,KL
542,Kala,KL,K400,CAL,KL
807,Kaela,KL,K400,CAL,KL
932,Kalie,KL,K400,CALY,KL


In [27]:
tf[tf.name.isin(['Sofie', 'Sophie'])].head()

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
605,Sophie,SF,S100,SAFY,SPH
11236,Sofie,SF,S100,SAFY,SF


In [59]:
df[df.metaphone == 'SF'].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
9955,Safia,SF,S100,SAF,SF
27458,Zeev,SF,Z100,ZAAF,ZV
30105,Savio,SF,S100,SAV,SV
67057,Sophiee,SF,S100,SAFY,SPH
21330,Savoy,SF,S100,SAVY,SVY
47382,Sava,SF,S100,SAV,SV
57630,Xavi,SF,X100,XAV,XV
11382,Xavia,SF,X100,XAV,XV
47133,Saffa,SF,S100,SAF,SFF
68146,Zophie,SF,Z100,ZAFY,ZPH


In [60]:
df[df.soundex == 'S100'].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
42086,Shavy,XF,S100,SAVY,SHVY
25494,Sophea,SF,S100,SAF,SPH
7746,Safiyyah,SFY,S100,SAFAY,SFYYH
49504,Suhayb,SHB,S100,SAHAYB,SHYB
47315,Shubh,XB,S100,SAB,SHBH
45849,Shihab,XHB,S100,SAHAB,SHHB
26668,Suhaib,SHB,S100,SAHAB,SHB
57951,Sibi,SB,S100,SAB,SB
72833,Soffie,SF,S100,SAFY,SFF
72058,Savya,SFY,S100,SAVY,SVY


In [65]:
df[df.nysiis == 'SAFY'].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
605,Sophie,SF,S100,SAFY,SPH
11236,Sofie,SF,S100,SAFY,SF
23666,Sophy,SF,S100,SAFY,SPHY
34798,Shevy,XF,S100,SAFY,SHVY
42937,Sophee,SF,S100,SAFY,SPH
48709,Sofya,SFY,S100,SAFY,SFY
48711,Sophya,SFY,S100,SAFY,SPHY
51613,Shafee,XF,S100,SAFY,SHF
56055,Sevy,SF,S100,SAFY,SVY
59266,Shiffy,XF,S100,SAFY,SHFFY


In [66]:
df[df.matchrating.isin(['SF', 'SPH'])].sample(n=10)

Unnamed: 0,name,metaphone,soundex,nysiis,matchrating
77045,Sefa,SF,S100,SAF,SF
55305,Sufia,SF,S100,SAF,SF
55704,Sef,SF,S100,SAF,SF
69318,Sifa,SF,S100,SAF,SF
12562,Safaa,SF,S100,SAF,SF
32280,Sophi,SF,S100,SAF,SPH
25494,Sophea,SF,S100,SAF,SPH
71093,Sopheia,SF,S100,SAF,SPH
67057,Sophiee,SF,S100,SAFY,SPH
29336,Sofi,SF,S100,SAF,SF


### Looks like the NYSIIS and Match Rating Codex algorithms give the best results here

### Evaluate the Abydos library

In [28]:
from abydos import phonetic as ap

apdf = names.copy()

In [29]:
%%timeit -n1 -r1

apdf['onca'] = apdf.name.map(ap.ONCA().encode)

1.28 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [30]:
%%timeit -n1 -r1

apdf['metasoundex'] = apdf.name.map(ap.MetaSoundex().encode)

1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [31]:
%%timeit -n1 -r1

apdf['caverphone'] = apdf.name.map(ap.Caverphone().encode)

933 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [32]:
%%timeit -n1 -r1

apdf['daitchmokotoff'] = apdf.name.map(ap.DaitchMokotoff().encode)

1.59 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Beider-Morse is about 150x - 400x slower than the other algorithms, so split the data set and run in multiple processes

%%timeit result for full data set if run serially
```
4min 1s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```

Using multiprocessing on a machine with 4 cores cuts the time to roughly a third:

```
1min 22s
```
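
As a quick check on how well the work parallelizes, the two timings above can be turned into a speedup and a per-core efficiency. This is plain arithmetic on the reported numbers, assuming the 4 cores mentioned above:

```python
from datetime import timedelta

serial = timedelta(minutes=4, seconds=1)      # 4min 1s, serial run
parallel = timedelta(minutes=1, seconds=22)   # 1min 22s, multiprocessing run
cores = 4

speedup = serial / parallel        # dividing two timedeltas gives a float
efficiency = speedup / cores       # 1.0 would be perfect scaling

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Some of the gap to a full 4x is plausibly the cost of starting worker processes and copying the data frame chunks to and from them.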

In [33]:
import multiprocessing
from multiprocessing import Pool
import psutil
from functools import partial

def process_df(function, label, df_in):
    # this is BeiderMorse-specific; remove this partial and you can parallelize
    # any method that doesn't require additional parameters
    func = partial(function, language_arg = 'english')
    df_in[label] = df_in.name.map(func)
    return df_in

def parallelize(inputdf, function, label):
    num_processes = psutil.cpu_count(logical=False)
    num_partitions = num_processes * 2 #smaller batches to get more frequent status updates (if function provides them)
    func = partial(process_df, function, label)
    with Pool(processes=num_processes) as pool:
        df_split = np.array_split(inputdf, num_partitions)
        df_out = pd.concat(pool.map(func, df_split))
    return df_out

In [34]:
from datetime import datetime 
start_time = datetime.now() 

apdf = parallelize(apdf, ap.BeiderMorse().encode, 'beidermorse')

time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:02:04.173615


In [35]:
ap.BeiderMorse().encode('Seth', language_arg = 'english')

'sit'

In [36]:
%%timeit -n1 -r1

apdf['parmarkumbharana'] = apdf.name.map(ap.ParmarKumbharana().encode)

606 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [37]:
apdf.head()

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
0,Jessica,J220,1000,YSKA111111,"{144000, 145000, 445000, 444000}",zYsQki zYsiki zYsDki zisQki zisiki zisDki zYsi...,JSC
1,Ashley,A240,240,ASLA111111,{048000},izlii izlD izla izli ozlii ozlD ozla ozli azli...,ASL
2,Brittany,B635,7635,PRTNA11111,{793600},brQtini britini brDtini britoni brQtoni britan...,BRTN
3,Amanda,A553,530,AMNTA11111,{066300},imindi imndi imindo imndo iminda imnda imondi ...,AMND
4,Samantha,S553,4500,SMNTA11111,{466300},siminti simnti siminto simnto siminta simnta s...,SMNTH


In [38]:
repr(apdf['daitchmokotoff'][0])

"{'144000', '145000', '445000', '444000'}"

In [39]:
apdf.daitchmokotoff.map(repr).nunique()

6679

In [40]:
print("Unique values:")
print("    Names:            ", apdf.name.nunique())
print("    ONCA:             ", apdf.onca.nunique())
print("    MetaSoundex:      ", apdf.metasoundex.nunique())
print("    Caverphone:       ", apdf.caverphone.nunique())
print("    Daitchmokotoff:   ", apdf.daitchmokotoff.map(repr).nunique())
print("    Beidermorse:      ", apdf.beidermorse.nunique())
print("    ParmarKumbharana: ", apdf.parmarkumbharana.nunique())

Unique values:
    Names:             77092
    ONCA:              3193
    MetaSoundex:       1279
    Caverphone:        6083
    Daitchmokotoff:    6679
    Beidermorse:       51501
    ParmarKumbharana:  13875


In [45]:
temp = apdf.copy()
temp['daitchmokotoff'] = temp.daitchmokotoff.map(repr)
for c in temp.columns:
    if c == 'name':
        continue
    top = temp[['name', c]].\
          groupby([c]).\
          agg('count').\
          reset_index().\
          sort_values('name', ascending=False).\
          rename(columns={'name':'count'})
    print(top.head())
    print('Sample of names matching the most common value')
    print(temp[temp[c] ==  top.head(1).values[0][0]]['name'].head())

      onca  count
469   C500    813
2482  S500    696
1455  J500    682
432   C400    614
456   C450    553
Sample of names matching the most common value
331    Cheyenne
482      Connie
613       Kenya
757       Kiana
784         Kim
Name: name, dtype: object
    metasoundex  count
711        5500   1881
280        1500   1426
800        6200   1224
419        3500   1120
681        5400   1051
Sample of names matching the most common value
29       Hannah
67      Shannon
249      Shawna
331    Cheyenne
343       Hanna
Name: name, dtype: object
      caverphone  count
1706  KLA1111111    711
5055  TNA1111111    698
4471  SNA1111111    662
1877  KNA1111111    592
194   ALA1111111    581
Sample of names matching the most common value
13      Kayla
40      Kelly
157    Claire
162     Carly
178     Kylie
Name: name, dtype: object
     daitchmokotoff  count
2932     {'460000'}    695
3858     {'560000'}    592
2173     {'360000'}    586
1042     {'086000'}    559
4201     {'586000'}    522

In [46]:
check = ['Sophia', 'Sofia']
#names[names.sort_values('name').name.astype(str).str[0:3] == 'Aar'
apdf[apdf.name.isin(check)]

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
250,Sophia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF


### Now look at the names matched by each of the different phonetic algorithms

In [50]:
apdf[apdf.onca == 'S100'].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
56055,Sevy,S100,4100,SFA1111111,{470000},sYvi sivi,SV
67687,Saheb,S100,4100,SP11111111,{457000},siiip,SHB
48711,Sophya,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
47332,Safi,S100,4100,SFA1111111,{470000},sifi sofi safi,SF
29240,Safiah,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SFH
45011,Safah,S100,4100,SFA1111111,{470000},sifi sifo sifa sofi sofo sofa safi safo safa,SFH
3418,Saba,S100,4100,SPA1111111,{470000},sibi sibo siba sobi sobo soba sabi sabo saba,SB
62286,Suheib,S100,4100,SP11111111,{457000},siuiip siuDp siuap siuip siDp siup siip suiip ...,SHB
44405,Safeya,S100,4100,SFA1111111,{471000},siifD siifii siifio siifia,SF
34798,Shevy,S100,5100,SFA1111111,{470000},sYvi sivi,SV


In [53]:
apdf[apdf.metasoundex == '4100'].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
22677,Zvi,Z100,4100,SFA1111111,{470000},zvi,ZV
23666,Sophy,S100,4100,SFA1111111,{470000},sofi safi,SPH
47332,Safi,S100,4100,SFA1111111,{470000},sifi sofi safi,SF
68146,Zophie,Z100,4100,SFA1111111,{470000},zofi zafi,ZPH
9955,Safia,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SF
38961,Sabia,S100,4100,SPA1111111,{470000},sibii sibio sibia sobii sobio sobia sabii sabi...,SB
50144,Sakeef,S210,4100,SKF1111111,{457000},siikif,SKF
71093,Sopheia,S100,4100,SFA1111111,{471000},sofD sofii sofio sofia safD safii safio safia,SPH
61462,Suvi,S100,4100,SFA1111111,{470000},suvi savi sovi,SV
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF


In [54]:
apdf[apdf.caverphone == 'SFA1111111'].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
59743,Zaviar,Z160,4160,SFA1111111,{479000},ziviir zivQir zivior zivQor ziviar zivQar zovi...,ZVR
68761,Zaviyar,Z160,4160,SFA1111111,{479000},ziviiir zivQiir ziviior zivQior ziviiar zivQia...,ZVR
49351,Soffia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
32280,Sophi,S100,4100,SFA1111111,{470000},sofi safi,SPH
42344,Sofiya,S100,4100,SFA1111111,{470000},sofiii sofQii sofiio sofQio sofiia sofQia safi...,SF
38077,Saffire,S160,4160,SFA1111111,{479000},sifDr sifar sifir sifDrY sifDri sifarY sifari ...,SFR
61444,Sharvi,S610,5610,SFA1111111,{497000},sirvi sorvi sarvi,SRV
75357,Zavaeh,Z100,4100,SFA1111111,{470000},ziviY zivii zivoi zivai zDi zoviY zovii zovoi ...,ZVH
72100,Zaviere,Z160,4160,SFA1111111,{479000},zivir zivirY ziviri zovir zovirY zoviri zavir ...,ZVR


In [49]:
tdf = apdf.copy()
tdf['daitchmokotoff'] = apdf.daitchmokotoff.map(repr)
tdf[tdf.daitchmokotoff.str.contains('470000')].sample(n=10)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
76929,Jefe,J100,1100,YF11111111,"{'170000', '470000'}",zif zifY zifi,JF
64145,Jeeva,J100,1100,YFA1111111,"{'170000', '470000'}",zivi zivo ziva,JV
70680,Sepp,S100,4100,SP11111111,{'470000'},sip,SP
64641,Zeeva,Z100,4100,SFA1111111,{'470000'},zivi zivo ziva,ZV
51613,Shafee,S100,5100,SFA1111111,{'470000'},siifi,SF
63743,Saw,S000,4000,SA11111111,{'470000'},siw sif sow sof saw saf,SW
72557,Chavah,C100,5100,KFA1111111,"{'570000', '470000'}",tsivi tsivo tsiva tsD tsovi tsovo tsova tsavi ...,CHVH
19215,Cobey,C100,5100,KPA1111111,"{'570000', '470000'}",kDbii kobii kubii kDbD kDba kDbi kobD koba kob...,CB
36110,Shifa,S100,5100,SFA1111111,{'470000'},sQfi sifi sDfi sifo sQfo sifa sQfa,SF
27458,Zeev,Z100,4100,SF11111111,{'470000'},zif,ZV


In [44]:
checklist = apdf[apdf.name == 'Sofia'].beidermorse.values[0].split()
def check_fn(codes):
    return any(x in codes.split() for x in checklist)

apdf[apdf.beidermorse.map(check_fn)]

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
250,Sophia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
646,Sofia,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
9955,Safia,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SF
12562,Safaa,S100,4100,SFA1111111,{470000},sifii sifio sifD sifia sifoi sifoo sifoa sifai...,SF
29240,Safiah,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SFH
40366,Saphia,S100,4100,SFA1111111,{470000},sifii sifio sifia sofii sofio sofia safii safi...,SPH
41050,Zsofia,Z100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,ZSF
48433,Sofhia,S100,4100,SFA1111111,{475000},sofii sofio sofia safii safio safia,SFH
48709,Sofya,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SF
48711,Sophya,S100,4100,SFA1111111,{470000},sofii sofio sofia safii safio safia,SPH
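
The matching rule used by `check_fn` (two names match when their Beider-Morse outputs share at least one token) can also be written as a set intersection. A minimal sketch with codes copied from the untruncated rows of the tables above; the helper name `shared_codes` is hypothetical:

```python
# Beider-Morse outputs copied from the tables above
bm_codes = {
    "Sophia": "sofii sofio sofia safii safio safia",
    "Zsofia": "sofii sofio sofia safii safio safia",
    "Sophy": "sofi safi",
    "Safi": "sifi sofi safi",
}


def shared_codes(name_a, name_b, codes=bm_codes):
    """Return the Beider-Morse tokens that both names have in common."""
    return set(codes[name_a].split()) & set(codes[name_b].split())


print(shared_codes("Sophia", "Zsofia"))   # identical outputs, all tokens shared
print(shared_codes("Sophy", "Safi"))      # partial overlap
print(shared_codes("Sophia", "Sophy"))    # no overlap
```

Returning the shared tokens rather than a boolean makes it easy to see why two names matched, and the size of the overlap could serve as a rough match strength.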


In [58]:
apdf[apdf.parmarkumbharana.isin(['SF', 'SPH'])].sample(n=20)

Unnamed: 0,name,onca,metasoundex,caverphone,daitchmokotoff,beidermorse,parmarkumbharana
605,Sophie,S100,4100,SFA1111111,{470000},sofi safi,SPH
61459,Sufi,S100,4100,SFA1111111,{470000},sufi safi sofi,SF
68114,Safiye,S100,4100,SFA1111111,{470000},sifD sifDi sofD sofDi safD safDi,SF
48424,Safiyya,S100,4100,SFA1111111,{470000},sifiiii sifQiii sifiiio sifQiio sifiiia sifQii...,SF
72432,Sophyia,S100,4100,SFA1111111,{470000},sofiii sofiio sofiia safiii safiio safiia,SPH
42937,Sophee,S100,4100,SFA1111111,{470000},sofi safi,SPH
44405,Safeya,S100,4100,SFA1111111,{471000},siifD siifii siifio siifia,SF
42344,Sofiya,S100,4100,SFA1111111,{470000},sofiii sofQii sofiio sofQio sofiia sofQia safi...,SF
39787,Shafi,S100,5100,SFA1111111,{470000},sifi sofi safi,SF
71093,Sopheia,S100,4100,SFA1111111,{471000},sofD sofii sofio sofia safD safii safio safia,SPH


### Results

This is highly subjective, but it appears that most phonetic algorithms over-conflate source names. They produce too many false positives: names that are not similar in pronunciation or spelling are assigned the same code.

```
    MetaSoundex:       1279
    ONCA:              3193
    Soundex:           3413
    Caverphone:        6083
    Daitchmokotoff:    6679
    Metaphone:        10739
    ParmarKumbharana: 13875
    NYSIIS:           16694
    Match Rating:     24109
    Beidermorse:      51501

    Total Unique Names: 77092
```
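
To put the counts above on a common scale, each can be expressed as the average number of names that share one code (total unique names divided by unique codes). A small sketch of that arithmetic on the reported numbers:

```python
total_names = 77092
unique_codes = {
    "MetaSoundex": 1279, "ONCA": 3193, "Soundex": 3413, "Caverphone": 6083,
    "Daitchmokotoff": 6679, "Metaphone": 10739, "ParmarKumbharana": 13875,
    "NYSIIS": 16694, "Match Rating": 24109, "Beidermorse": 51501,
}

# Average number of distinct names collapsed into each code
names_per_code = {alg: total_names / n for alg, n in unique_codes.items()}

for alg, ratio in sorted(names_per_code.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{alg:<18}{ratio:6.1f} names per code")
```

Beider-Morse and Daitch-Mokotoff return several alternatives per name, and the counts above are for whole output values, so their ratios are not directly comparable with those of the single-code algorithms.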

As a native English speaker, it appears to me that the algorithms that produce more unique codes than Metaphone (approximately 10,000) work the best; from Jellyfish this includes NYSIIS and Match Rating Codex. From Abydos this includes the Beider & Morse and Parmar & Kumbharana algorithms. Since Abydos includes all of the algorithms from Jellyfish as well as many others, Abydos seems the better library.

Match Rating Codex and the Parmar & Kumbharana algorithms appear to behave similarly: names that I think should be assigned the same category are not (for example, Match Rating Codex assigns SPH to Sophie but SF to Sofie). However, when looking at the combined results of multiple categories, the results appear pretty good.

Of course, if someone is a non-native English speaker in a community of non-native English speakers, different algorithms may work better.

I think the results from splitting and combining the various options provided by Beider & Morse are the best, so I will use that algorithm for additional analysis.
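
One possible reading of "splitting and combining" (an assumption, not necessarily the approach used in the follow-up analysis) is to split each Beider-Morse output into its individual tokens and then merge names that share any token, directly or through a chain of other names. A minimal union-find sketch over codes copied from the tables above; all function names are hypothetical:

```python
# Beider-Morse outputs copied from the tables above
bm_codes = {
    "Sophia": "sofii sofio sofia safii safio safia",
    "Sofia": "sofii sofio sofia safii safio safia",
    "Sophie": "sofi safi",
    "Safi": "sifi sofi safi",
    "Zvi": "zvi",
}


def cluster_by_shared_token(codes):
    """Group names into clusters connected by at least one shared token."""
    parent = {name: name for name in codes}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # token -> first name seen with that token
    for name, tokens in codes.items():
        for token in tokens.split():
            if token in first_owner:
                # Merge this name's cluster into the token owner's cluster.
                parent[find(name)] = find(first_owner[token])
            else:
                first_owner[token] = name

    clusters = {}
    for name in codes:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


print(cluster_by_shared_token(bm_codes))
```

Chained merging like this can over-merge on a large data set, since one ambiguous name can bridge two otherwise separate groups, so the resulting clusters would need to be reviewed.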