## Fuzzy Matching and Fuzzy Pandas
      
[Max Harlow](https://twitter.com/maxharlow), a journalist at the Financial Times, wrote the `csvmatch` library, and he's been adding new algorithms to it to facilitate fuzzy matching across datasets. He's used it for a bunch of stories, including:
- https://www.theguardian.com/uk-news/2014/jul/09/offshore-tax-dealings-celebrities-sportsmen-leaked-jersey-files
- https://www.theguardian.com/politics/2014/jul/08/offshore-secrets-wealthy-political-donors

Similar techniques have also been used in other stories like:
- https://www.globalwitness.org/en/campaigns/oil-gas-and-mining/myanmarjade
- https://www.irinnews.org/investigation/2016/09/02/exclusive-un-paying-blacklisted-diamond-company-central-african-republic

But, wait, first: 

### What is Fuzzy Matching? 

Automating the look-up for names in documents is [inherently imprecise](https://www.elastic.co/blog/found-fuzzy-search). The computer can't _know_ that different representations of the same _thing_ refer to the same _thing_. For example: 
- _Apple Inc._; _Apple Computer Company_; _Apple Computer, Inc._; and _Apple_ all refer to the fruit company. 
- _Samuel Langhorne Clemens_, _Samuel L. Clemens_, _Samuel Clemens_ and _Mark Twain_ all refer to the same person. 
- _Robert Ford_, _Rob Ford_, and _Robert Frod_ **probably** refer to the same person. 

When you're working with unstructured data, you can't take anything for granted. Least of all, you can't assume that:
- documents will have correct spellings
- first, last, and middle names will exist in all documents
- the abbreviated/shortened names of people won't make an appearance (e.g. Jon instead of Jonathan, Tom instead of Thomas, Phil instead of Philip, etc.) 

So, when you're living in an uncertain world, you try to make things slightly more _certain_ with **Fuzzy Matching**. You might not hit 100 percent, but at least you'll hit more than what you would without fuzzy matching. 

There are multiple algorithms that try to minimise the uncertainty and enable fuzzy matching. The library we are going to be looking at today incorporates a bunch of these, instead of just doing one thing. 

This notebook's predominantly based on an [awesome NICAR2019 presentation](https://docs.google.com/presentation/d/1djKgqFbkYDM8fdczFhnEJLwapzmt4RLuEjXkJZpKves/) where Max Harlow (the aforementioned news app developer at the Financial Times) demonstrated [csvmatch](https://github.com/maxharlow/csvmatch). And, then, Soma basically created a library, `fuzzy_pandas`, to make it work with pandas. 

Worth remembering that there are no shortcuts in life, and few panaceas. Depending on the project you're working on, you might be more inclined to use one algorithm or the other. Or, you know, try a few of them and see what happens. And, also, remember: all computational tools you use need to go hand-in-hand with traditional reporting. People share names, there's more than one John Smith, etc. 


In [1]:
# Make sure you `pip install fuzzy_pandas` first. 

import pandas as pd
import fuzzy_pandas as fpd

### A Toy Example

We'll be working with two toy datasets first, just to get going and get an idea as to what's possible. The names of the files are not terribly imaginative: `data1.csv` and `data2.csv`. And, they both contain structured data: names, code names and locations of characters from John le Carré's spy thriller: Tinker Tailor Soldier Spy. 

Right, let's have a look. 


In [2]:
df1 = pd.read_csv("data/data1.csv")
df2 = pd.read_csv("data/data2.csv")

In [3]:
df1.sort_values("name")

Unnamed: 0,name,location,codename
5,Bill Haydon,London,Tailor
8,Connie Sachs,Oxford,none
0,George Smiley,London,Beggerman
7,Jim Prideaux,Slovakia,none
6,Oliver Lacon,London,none
1,Percy Alleline,London,Tinker
4,Peter Guillam,Brixton,none
2,Roy Bland,London,Soldier
3,Toby Esterhase,Vienna,Poorman


In [4]:
df2.sort_values("Person Name")

Unnamed: 0,Person Name,Location
8,Claus Kretzschmar,Hamburg
2,George SMILEY,London
4,Konny Saks,Oxford
0,Maria Andreyevna Ostrakova,Russia
1,Otto Leipzig,Estonia
3,Peter Guillam,Brixton
6,Sam Collins,Vietnam
5,Saul Enderby,London
7,Tony Esterhase,Vienna


### Exact matches

We start with doing "exact matches", i.e. both tables should have the exact same name. Capitalisation matters, accents matter. With this function, for example:
- John le Carre will not match with John le Carré
- George SMILEY will not match with George Smiley

Based on what you see in the data frames above, how many matches do you expect? 

In [5]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name')

Unnamed: 0,name,location,codename,Person Name,Location
0,Peter Guillam,Brixton,none,Peter Guillam,Brixton


Right, so, we only find one match as expected. But, are there any other matches that a _smarter_ algorithm could find? Let's try something called **Levenshtein**, a nifty, simple algorithm that's pretty common. It's the basis for a bunch of spellcheck algorithms, amongst other things. It works by counting the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one input into the other. If that _distance_ is small enough, the two inputs are treated as a match. 

For example, in the above two data frames, you have Toby Esterhase and Tony Esterhase, which means the Levenshtein distance is 1 (The 'b' v. 'n' in To(b,n)y.). 
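
To make the idea of _distance_ concrete, here is a minimal pure-Python sketch of plain Levenshtein distance. It is only an illustration: `levenshtein` is a name made up for this notebook, and the library itself relies on an external package for its distance calculation rather than on code like this.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Toby Esterhase", "Tony Esterhase"))  # 1
print(levenshtein("George Smiley", "George SMILEY"))    # 5
```

Note that the comparison here is case sensitive, so every upper-case letter in "SMILEY" that differs from "Smiley" counts as one edit.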

In [6]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method='levenshtein')

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Toby Esterhase,Vienna,Poorman,Tony Esterhase,Vienna
2,Peter Guillam,Brixton,none,Peter Guillam,Brixton


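The other thing you'll notice above is that _George Smiley_ now matches _George SMILEY_, so it looks as though the _Levenshtein_ method doesn't care about case. A more likely explanation is that the five differing letters simply count as five edits out of 13 characters, which gives a score of about 0.62 and just clears the default threshold described below. 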

However, are we still missing potential matches? 

When we work with any algorithms, we need a confidence threshold that we decide on. By default, the `csvmatch` algorithm has a `threshold` of 0.6, i.e. only if the algorithm returns a match score greater than or equal to 0.6 will it return a match. 

The score, in this case, is calculated using the below formula, where the distance is divided by the length of the longer of the two values: 

> `1 - (distance / max(len(value1), len(value2)))`

We can be slightly more lenient with the threshold, and we get a Brand New Result in our output. 
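
To see why the threshold matters, the score formula above can be sketched in a few lines. This is a hedged illustration rather than the library's own code: `similarity` and `is_match` are hypothetical helpers, and the distances are plugged in by hand (counted from the toy names) instead of being computed.

```python
def similarity(distance: int, a: str, b: str) -> float:
    """Score between 0 and 1: 1 - distance / length of the longer value."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - distance / longest


def is_match(distance: int, a: str, b: str, threshold: float = 0.6) -> bool:
    """A pair counts as a match when its score reaches the threshold."""
    return similarity(distance, a, b) >= threshold


# (left value, right value, hand-counted edit distance)
pairs = [
    ("Toby Esterhase", "Tony Esterhase", 1),
    ("George Smiley", "George SMILEY", 5),
    ("Connie Sachs", "Konny Saks", 5),
]

for left, right, distance in pairs:
    score = similarity(distance, left, right)
    print(left, "|", right, "|", round(score, 2),
          "| match at 0.6:", is_match(distance, left, right, 0.6),
          "| match at 0.5:", is_match(distance, left, right, 0.5))
```

The third pair scores a little under 0.6, so it only shows up once the threshold is lowered.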

In [7]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method='levenshtein', threshold=0.5)

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Toby Esterhase,Vienna,Poorman,Tony Esterhase,Vienna
2,Peter Guillam,Brixton,none,Peter Guillam,Brixton
3,Connie Sachs,Oxford,none,Konny Saks,Oxford


This is _cool_. By lowering the threshold, we picked up another match: Connie Sachs and Konny Saks, two spellings that sound the same but sit five edits apart (a score of about 0.58). **But, what could be cooler?**

In [8]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method='metaphone')

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Peter Guillam,Brixton,none,Peter Guillam,Brixton
2,Connie Sachs,Oxford,none,Konny Saks,Oxford


The **metaphone** algorithm does phonetic matching, and gives you results based on that. 
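
To get a feel for what phonetic matching means, here is a deliberately crude stand-in. It is **not** Metaphone (the real algorithm has many more rules); `toy_phonetic_key` is a hypothetical helper that only shows the general idea of reducing a name to a rough sound-alike key and then comparing keys exactly.

```python
import re


def toy_phonetic_key(name: str) -> str:
    """Very rough sound-alike key. An illustration only, not Metaphone."""
    key = name.lower()
    # Treat 'ch' and 'c' like 'k' (a big simplification of English spelling).
    key = key.replace("ch", "k").replace("c", "k")
    # Drop vowels (and 'y'), which vary a lot between spellings.
    key = re.sub(r"[aeiouy]", "", key)
    # Collapse runs of the same character: 'nn' -> 'n'.
    key = re.sub(r"(.)\1+", r"\1", key)
    return key


print(toy_phonetic_key("Connie Sachs") == toy_phonetic_key("Konny Saks"))
print(toy_phonetic_key("Toby Esterhase") == toy_phonetic_key("Tony Esterhase"))
```

With this toy key the Connie/Konny pair collapses to the same string, while Toby and Tony stay different because 'b' and 'n' are different consonants. That mirrors the result above, where the phonetic method finds Connie Sachs but not Toby Esterhase.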

Note: In theory, the documentation says that you can combine a couple of these algorithms if you're so inclined. But, it looks like when you combine two algorithms, it doesn't _quite_ work. ¯\_(ツ)_/¯

In [9]:
fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Person Name', method=['levenshtein', 
                                                                          'metaphone'])

## swap the methods around and then look at the results, too.

Unnamed: 0,name,location,codename,Person Name,Location
0,George Smiley,London,Beggerman,George SMILEY,London
1,Toby Esterhase,Vienna,Poorman,Tony Esterhase,Vienna
2,Peter Guillam,Brixton,none,Peter Guillam,Brixton


What do you think is happening here? 

This is important—you're often going to be using tools built by other folks, but where there's code, there are bugs. You should make sure that you play with the tool a bit to make sure it's doing _exactly_ what you think it's doing. And, if it's not, you know where it falls short. 

### Less Fictional Datasets

We are going to be using the same datasets Max Harlow used for this exercise. As he explains in his presentation [here](https://docs.google.com/presentation/d/1djKgqFbkYDM8fdczFhnEJLwapzmt4RLuEjXkJZpKves/edit#slide=id.g3512a0ce6b_1_22), there are a bunch of files: 
- a list of world billionaires published by Bloomberg
- a similar list published by Forbes
- a list also published by Forbes that only includes Chinese individuals
- a list published by the CIA of chiefs of state and cabinet members of foreign governments
- a list of all the people that attended the World Economic Forum conference in Davos this year
- a list of all the people and companies that have been sanctioned by the United Nations


In [10]:
## Read in the two billionaire lists (Forbes + Bloomberg)

df1 = pd.read_csv("data/forbes-billionaires.csv")
df2 = pd.read_csv("data/bloomberg-billionaires.csv")

Can you find out how many billionaires appear in both lists (exact matching)? 

In [11]:
df1.sample(30)

Unnamed: 0,name,lastName,uri,imageUri,worthChange,source,industry,gender,country,timestamp,realTimeWorth,realTimeRank,realTimePosition,squareImage
150,Liang Feng,Liang,liang-feng,no-pic,48.095,manufacturing,Manufacturing,M,China,1547574901334,1200.751,1772.0,1772.0,//specials-images.forbesimg.com/imageserve/5a7...
977,Douglas Leone,Leone,douglas-leone,douglas-leone,78.844,venture capital,Finance and Investments,M,United States,1547574901333,3541.548,593.0,593.0,//specials-images.forbesimg.com/imageserve/5ba...
1137,Zhang Bangxin,Zhang,zhang-bangxin,no-pic,165.393,after-school tutoring,Service,M,China,1547575201867,5402.953,307.0,307.0,//specials-images.forbesimg.com/imageserve/5bc...
1720,Martua Sitorus,Sitorus,martua-sitorus,martua-sitorus,2.892,palm oil,Manufacturing,M,Indonesia,1547574901334,1720.06,1308.0,1308.0,//specials-images.forbesimg.com/imageserve/583...
538,Randa Williams,Williams,randa-williams,randa-williams,36.413,pipelines,Energy,F,United States,1547575201867,5971.675,263.0,263.0,//specials-images.forbesimg.com/imageserve/08e...
2141,Liu Zhenguo,Liu,liu-zhenguo,no-pic,,sewage treatment,Service,M,China,1518125575873,,,,
197,Rocco Commisso,Commisso,rocco-commisso,no-pic,0.0,telecom,Telecom,M,United States,1547574901333,4197.971,455.0,455.0,//specials-images.forbesimg.com/imageserve/59e...
1939,Yin-Chun Wei,Wei,yin-chun-wei,yin-chun-wei,,"food, beverages",Food and Beverage,M,Taiwan,1538748670520,,,,//specials-images.forbesimg.com/imageserve/950...
1220,Joseph Grendys,Grendys,joseph-grendys,joseph-grendys,0.0,poultry processing,Food and Beverage,M,United States,1547574901333,2475.093,901.0,901.0,//specials-images.forbesimg.com/imageserve/55f...
101,Igor Rotenberg,Rotenberg,igor-rotenberg,no-pic,0.0,"construction, real estate",Diversified,M,Russia,1547574901334,1053.363,1947.0,1947.0,


In [None]:
df2.sample(30)

Unnamed: 0,Rank,Name,Total_net_worth,Country,Industry
246,247,Jeff Hildebrand,$6.26B,United States,Energy
7,8,Larry Page,$52.3B,United States,Technology
301,302,Lynn Schusterman,$5.19B,United States,Energy
138,139,Gordon Moore,$9.65B,United States,Technology
302,303,Micree Zhan,$5.19B,China,Technology
68,69,Takemitsu Takizaki,$14.7B,Japan,Technology
440,441,Peter-Alexander Wacker,$4.07B,Germany,Industrial
24,25,Hui Ka Yan,$30.1B,China,Real Estate
229,230,Scott Duncan,$6.41B,United States,Energy
52,53,Joseph Safra,$17.1B,Brazil,Finance


In [None]:
results = fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Name')

print("Found", results.shape)
results.head(5)

Found (354, 19)


Unnamed: 0,name,lastName,uri,imageUri,worthChange,source,industry,gender,country,timestamp,realTimeWorth,realTimeRank,realTimePosition,squareImage,Rank,Name,Total_net_worth,Country,Industry
0,Alexander Otto,Otto,alexander-otto,no-pic,2.12,real estate,Real Estate,M,Germany,1547575201867,10821.927,126.0,126.0,//specials-images.forbesimg.com/imageserve/5a7...,323,Alexander Otto,$4.94B,Germany,Real Estate
1,Ben Ashkenazy,Ashkenazy,ben-ashkenazy,no-pic,0.0,real estate,Real Estate,M,United States,1547574901333,4000.0,499.0,499.0,//specials-images.forbesimg.com/imageserve/59e...,447,Ben Ashkenazy,$4.05B,United States,Real Estate
2,Giovanni Ferrero,Ferrero,giovanni-ferrero,no-pic,0.0,"Nutella, chocolates",Food and Beverage,M,Italy,1547575201866,22673.165,38.0,38.0,//specials-images.forbesimg.com/imageserve/5b1...,33,Giovanni Ferrero,$22.6B,Italy,Food & Beverage
3,Henry Cheng,Cheng,henry-cheng-1,no-pic,3.542,property,Diversified,M,Hong Kong,1547574901334,1334.282,1630.0,1630.0,//specials-images.forbesimg.com/imageserve/5a7...,79,Henry Cheng,$14.1B,Hong Kong,Retail
4,Henry Laufer,Laufer,henry-laufer,no-pic,0.0,hedge funds,Finance and Investments,M,United States,1547574901333,2000.0,1141.0,1142.0,,463,Henry Laufer,$3.95B,United States,Finance


Now, can you find the ones where the ranks aren't the same across the two datasets? What about the ones that are the same?

In [None]:
results = fpd.fuzzy_merge(df1, df2, left_on='name', right_on='Name', 
                          keep_left=["realTimeRank", "name"], 
                          keep_right=["Rank"])
results[results.realTimeRank == results.Rank]

Unnamed: 0,realTimeRank,name,Rank
13,2.0,Bill Gates,2
14,3.0,Warren Buffett,3
15,1.0,Jeff Bezos,1
21,4.0,Bernard Arnault,4
23,10.0,Sergey Brin,10
27,23.0,Sheldon Adelson,23
37,17.0,Mukesh Ambani,17
47,54.0,Stefan Quandt,54
88,135.0,Aliko Dangote,135
131,177.0,Cyrus Poonawalla,177


### Fuzzy matching with non-fictional data

In the above couple of cells, we've conducted "exact matching", i.e. the equivalent of running a `Cmd+F`/`Ctrl+F` in your text editor. If anything, it's stricter than that, as it's case sensitive, i.e. "Tom" and "tom" are treated differently. 

We've gone through some of this already, but what are the things we can ignore when it comes to name-matching? Harlow, in his presentation, identified:
- case
- title (Mr., Mrs., etc.)
- non-latin characters (é, å, ß, etc.)
- the order of the names
- non-alphanumerics (e.g. hyphenated names)

Now, you don't _have to_ ignore _anything_, but sometimes, it might make your life far easier. Other times, you'll end up with false positives and whatnot. 

The library `csvmatch`, and by extension `fuzzy_pandas`, supports a bunch of the above as parameters, which you can just pass in to the function. Passing in a few of them would allow you to match `Orbán, Viktor` with `Viktor Orban`, which is quite useful. (Again, the example's from Harlow's slides.)
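
As a rough sketch of what "ignoring" these things amounts to, the snippet below normalises a name before comparing it. This is an assumption-laden illustration using only the standard library, not how `csvmatch` implements its options: `normalise` and the tiny `TITLES` set are made up for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of titles to drop.
TITLES = {"mr", "mrs", "ms", "dr"}


def normalise(name: str) -> str:
    """Lower-case, strip accents, punctuation and titles, then sort words."""
    # Decompose accented characters and drop the combining marks: "á" -> "a".
    # Characters with no decomposition (such as "ß") are dropped in this sketch.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only runs of letters and digits, which also splits hyphenated names.
    words = re.findall(r"[a-z0-9]+", unaccented.lower())
    # Drop titles, then sort so that word order no longer matters.
    return " ".join(sorted(word for word in words if word not in TITLES))


print(normalise("Orbán, Viktor"))
print(normalise("Orbán, Viktor") == normalise("Viktor Orban"))
```

Once both sides are normalised like this, an ordinary exact match does the rest. The trade-off is the one mentioned above: the more you throw away, the more likely two different people end up with the same key.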

For this bit, we'll move on to two of the other datasets: `cia-world-leaders.csv` and `davos-attendees-2019.csv`. As always, read in the data and figure out which columns the exact match should run on. 

In [None]:
cia_world_leaders = pd.read_csv('data/cia-world-leaders.csv')
davos_attendees = pd.read_csv('data/davos-attendees-2019.csv')
print(f"Our CIA World Leaders df has these columns: {cia_world_leaders.columns} \
      \n The Davos attendees have these: {davos_attendees.columns}")

Our CIA World Leaders df has these columns: Index(['country', 'role', 'name'], dtype='object')       
 The Davos attendees have these: Index(['full_name', 'position_short_name', 'org_name', 'org_country'], dtype='object')


In [None]:
cia_world_leaders.sort_values('name').head(20)

Unnamed: 0,country,role,name
384,Bangladesh,Min. of Foreign Affairs,A. H. Mahmood ALI
499,Belize,"Governor, Central Bank",A. Joy GRANT
4721,Somalia,Min. of Public Works & Reconstruction,"ABAS Abdullahi Sheikh ""Siraji"""
4451,Saudi Arabia,Min. of Interior,ABD AL-AZIZ bin Saud bin Nayif bin Abd al-Aziz...
2625,Jordan,King,ABDALLAH II
4193,Qatar,Deputy Amir,ABDALLAH bin Hamad Al Thani
4213,Qatar,Prime Min.,ABDALLAH bin Nasir bin Khalifa Al Thani
4205,Qatar,Min. of Interior,ABDALLAH bin Nasir bin Khalifa Al Thani
4194,Qatar,"Governor, Qatar Central Bank",ABDALLAH bin Saud Al Thani
5482,United Arab Emirates,Min. of Foreign Affairs and International Coop...,ABDALLAH bin Zayid Al Nuhayyan


In [None]:
davos_attendees.sort_values('full_name').head(20)

Unnamed: 0,full_name,position_short_name,org_name,org_country
1864,Aaron Karczmer,"Executive Vice-President; Chief Risk, Complian...",PayPal,USA
2532,Aaron Motsoaledi,Minister of Health of South Africa,Ministry of Health of South Africa,South Africa
422,Aarthi Subramanian,"Executive Director, Board of Directors",Tata Consultancy Services,India
2074,Abdelkader Messahel,Minister of Foreign Affairs of Algeria,Ministry of Foreign Affairs of Algeria,Algeria
897,Abdulaziz Al Judaimi,"Senior Vice-President, Downstream",Saudi Aramco,Saudi Arabia
888,Abdulaziz Al Subeaei,Chairman,Jabal Omar Development Company,Saudi Arabia
35,Abdulaziz Al-Helaissi,Group Chief Executive Officer (GIB),Gulf International Bank BSC,Bahrain
903,Abdulaziz Al-Jarbou,Chairman of the Board,Saudi Basic Industries Corporation,Saudi Arabia
2641,Abdulla Al Basti,"General Secretary, Executive Council of Dubai,...",Executive Council of Dubai,United Arab Emirates
809,Abdulla Al Khalifa,Chief Executive Officer,Qatar National Bank Q.P.S.C.,Qatar


In [None]:
# Let's start by seeing what an exact match would look like. 
results = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', right_on='full_name')

Apparently, there are no matches. What do you reckon? What _should_ the overlap between CIA's list of world leaders and Davos attendees be? 

Let's try some _fuzzy matching_. 

In [None]:
results = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', 
                          right_on='full_name', 
                          ignore_case=True)

In [None]:
print(results.shape)
results

(119, 7)


Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
0,Algeria,Min. of Foreign Affairs & International Cooper...,Abdelkader MESSAHEL,Abdelkader Messahel,Minister of Foreign Affairs of Algeria,Ministry of Foreign Affairs of Algeria,Algeria
1,Argentina,Min. of Production & Work,Dante SICA,Dante Sica,Minister of Industry and Labour of Argentina,Ministry of Industry and Labour of Argentina,Argentina
2,Argentina,"Pres., Central Bank",Guido SANDLERIS,Guido Sandleris,Governor of the Central Bank of Argentina,Central Bank of Argentina,Argentina
3,Armenia,Prime Min.,Nikol PASHINYAN,Nikol Pashinyan,Prime Minister of the Republic of Armenia,Office of the Prime Minister of the Republic o...,Armenia
4,Australia,Min. for Defense Industry,Steven CIOBO,Steven Ciobo,Minister of Defence Industry of Australia,Department of Defence of Australia,Australia
5,Australia,"Min. for Trade, Investment, & Tourism",Simon BIRMINGHAM,Simon Birmingham,"Minister for Trade, Tourism and Investment of ...",Department of Foreign Affairs and Trade of Aus...,Australia
6,Austria,Chancellor,Sebastian KURZ,Sebastian Kurz,Federal Chancellor of Austria,Office of the Federal Chancellor of Austria,Austria
7,Austria,"Min. for Europe, Integration, & Foreign Affairs",Karin KNEISSL,Karin Kneissl,"Federal Minister for Europe, Integration and F...","Federal Ministry for Europe, Integration and F...",Austria
8,Azerbaijan,Pres.,Ilham ALIYEV,Ilham Aliyev,President of the Republic of Azerbaijan,Administration of the President of the Republi...,Azerbaijan
9,Belgium,Dep. Prime Min.,Alexander DE CROO,Alexander De Croo,Deputy Prime Minister and Minister of Finance ...,"Ministry of Foreign Affairs, Foreign Trade and...",Belgium


OK, we have more matches, but this is also pretty boring. There's nothing super-smart about ignoring case to get matches. Your word processors have been doing that for _decades_. 

But, now, let's start adding some of the other parameters discussed above, and see what happens.

In [None]:
results = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', 
                          right_on='full_name', 
                          ignore_case=True, 
                          ignore_nonalpha=True,
                          ignore_nonlatin=True,
                          ignore_order_words=True,
                          ignore_titles=True,
                         )

In [None]:
print(results.shape)
results

(138, 7)


Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
0,Algeria,Min. of Foreign Affairs & International Cooper...,Abdelkader MESSAHEL,Abdelkader Messahel,Minister of Foreign Affairs of Algeria,Ministry of Foreign Affairs of Algeria,Algeria
1,Argentina,Min. of Production & Work,Dante SICA,Dante Sica,Minister of Industry and Labour of Argentina,Ministry of Industry and Labour of Argentina,Argentina
2,Argentina,Min. of Treasury & Finance,Nicolas DUJOVNE,Nicolás Dujovne,Minister of the Treasury of Argentina,Ministry of the Treasury of Argentina,Argentina
3,Argentina,"Pres., Central Bank",Guido SANDLERIS,Guido Sandleris,Governor of the Central Bank of Argentina,Central Bank of Argentina,Argentina
4,Armenia,Prime Min.,Nikol PASHINYAN,Nikol Pashinyan,Prime Minister of the Republic of Armenia,Office of the Prime Minister of the Republic o...,Armenia
5,Australia,Min. for Defense Industry,Steven CIOBO,Steven Ciobo,Minister of Defence Industry of Australia,Department of Defence of Australia,Australia
6,Australia,"Min. for Trade, Investment, & Tourism",Simon BIRMINGHAM,Simon Birmingham,"Minister for Trade, Tourism and Investment of ...",Department of Foreign Affairs and Trade of Aus...,Australia
7,Austria,Chancellor,Sebastian KURZ,Sebastian Kurz,Federal Chancellor of Austria,Office of the Federal Chancellor of Austria,Austria
8,Austria,"Min. for Europe, Integration, & Foreign Affairs",Karin KNEISSL,Karin Kneissl,"Federal Minister for Europe, Integration and F...","Federal Ministry for Europe, Integration and F...",Austria
9,Azerbaijan,Pres.,Ilham ALIYEV,Ilham Aliyev,President of the Republic of Azerbaijan,Administration of the President of the Republi...,Azerbaijan


Right, 19 more results. Baby steps, but at least steps in the right direction. Now, let's bring in one of the more _intelligent_ algorithms, this one named after a Russian mathematician: Levenshtein. 

All *Levenshtein* does is count how many single-character edits it takes to turn one input into the other. For example:

In [None]:
# Import from the package's public namespace rather than a private module.
from jellyfish import damerau_levenshtein_distance

damerau_levenshtein_distance("Évry", "Every")

2

The cell above calls the algorithm directly, via `jellyfish`, which is another package that `csvmatch` uses. (Strictly, it's the Damerau-Levenshtein variant, which also counts swapping two adjacent characters as a single edit.) Typically, you wouldn't call the function directly (you could if you wanted to), but this is just to give you an idea of how the algorithm works. 

Right, now let's use this with our above data, and see if we have better luck. 

Remember: the default threshold is 0.6, and the score is calculated by: `1 - (distance / max(len(value1), len(value2)))`. In the case of `Évry` and `Every` above, the distance is 2 and the longer value has 5 characters, so our calculation would be:

`1 - (2/5)` = `3/5` = `0.6`

So, in this case, the two would lead to a fuzzy match.

In [None]:
results = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', 
                          right_on='full_name', 
                          ignore_case=True, 
                          ignore_nonalpha=True,
                          ignore_nonlatin=True,
                          ignore_order_words=True,
                          ignore_titles=True,
                          method='levenshtein'
                         )

In [None]:
results.shape

(1952, 7)

**WAIT, WHAT?!** Have we just gone from 138 matches to 1952? Is that overly optimistic? 

In [None]:
results.sample(50)

Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
1733,Switzerland,Ambassador to the US,Martin DAHINDEN,Martin Baron,"Executive Editor, Washington Post, USA",The Washington Post,USA
1484,Papua New Guinea,Min. for Treasury,Charles ABEL,Bernard Charlès,Vice-Chairman and Chief Executive Officer,Dassault Systèmes SE,France
1431,Palau,Min. of Health,Emais ROBERTS,Robert Etman,Chief Financial Officer,Alghanim Industries,Kuwait
1811,Uganda,Min. of State for Public Service,David KARUBANGA,David Nabarro,Director,4SD,Switzerland
1457,Papua New Guinea,Min. for Fisheries,Patrick BASA,Patrick Allen,"Vice-President; International Managing Editor,...",CNBC,United Kingdom
1768,Taiwan,"Sec. Gen., Executive Yuan",CHEN Mei-ling,Chen Lifang,"Director; President, Public Affairs and Commun...",Huawei Technologies Co. Ltd,People's Republic of China
104,Australia,Min. for Human Services,Michael KEENAN,Michael Corbat,"Chief Executive Officer, Citigroup",Citi,USA
1473,Papua New Guinea,Min. for National Planning & Monitoring,Richard MARU,Richard Ambrose,"Executive Vice-President, Space Systems System...",Lockheed Martin Space,USA
1651,Slovakia,Min. of Finance,Peter KAZIMIR,Peter Altmaier,Federal Minister of Economic Affairs and Energ...,Federal Ministry of Economic Affairs and Energ...,Germany
1466,Papua New Guinea,Min. for Justice,Davis STEVEN,David Siegel,Co-Chairman and Co-Founder,"Two Sigma Investments, LP",USA


Why, yes. Yes, it is. This is why you _always_ confirm what an algorithm does. Right, maybe let's bump up our threshold and see what happens. 

In [None]:
results = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', 
                          right_on='full_name', 
                          ignore_case=True, 
                          ignore_nonalpha=True,
                          ignore_nonlatin=True,
                          ignore_order_words=True,
                          ignore_titles=True,
                          method='levenshtein', 
                          threshold=0.8
                         )

In [None]:
results.shape

(186, 7)

OK, that seems more reasonable. Let's sanity check.

In [None]:
results.sample(50)

Unnamed: 0,country,role,name,full_name,position_short_name,org_name,org_country
148,Slovakia,Min. of Foreign & European Affairs,Miroslav LAJCAK,Miroslav Lajčák,Minister of Foreign and European Affairs of th...,Ministry of Foreign Affairs of the Slovak Repu...,Slovakia
162,Tunisia,"Min. of Development, Investment & Internationa...",Zied LAADHARI,Zied Ladhari,"Minister of Development, Investment and Intern...","Ministry of Development, Investment and Intern...",Tunisia
61,Guyana,Min. of Public Infrastructure,David PATTERSON,Gavin Patterson,Chief Executive,BT Group Plc,United Kingdom
149,Slovakia,Prime Min.,Peter PELLEGRINI,Peter Pellegrini,Prime Minister of Slovakia,Office of the Prime Minister of Slovakia,Slovakia
70,Ireland,Min. for Education & Skills,Richard BRUTON,Richard Houston,Global Consulting Executive; Chief Executive O...,Deloitte,United Kingdom
91,Kenya,"Cabinet Sec. for Industry, Trade, & Cooperatives",Peter MUNYA,Peter Munya,"Cabinet Secretary for Trade, Industry and Coop...","Ministry of Trade, Industry and Cooperatives o...",Kenya
73,Ireland,Min. for Public Expenditure & Reform,Paschal DONOHOE,Paschal Donohoe,Minister for Finance of Ireland,Department of Finance of Ireland,Ireland
157,Switzerland,"Chief, Federal Dept. of Home Affairs",Alain BERSET,Alain Berset,Federal Councillor of Home Affairs of Switzerland,Federal Department of Home Affairs of Switzerland,Switzerland
60,Ghana,Vice Pres.,Mahamudu BAWUMIA,Mahamudu Bawumia,Vice-President of Ghana,"Office of the Vice-President, Vice-President S...",Ghana
101,Luxembourg,Min. of Communications & Media,Xavier BETTEL,Xavier Bettel,Prime Minister and Minister for Communications...,Office of the Prime Minister of Luxembourg,Luxembourg


Next question, one for you guys to do:

Which names from the CIA world leaders list are also on the Forbes billionaires list?

Who can get the best result?

In [None]:
forbes_df = df1

results = fpd.fuzzy_merge(cia_world_leaders, forbes_df, left_on='name', 
                          right_on='name', 
#                           ignore_case=True, 
#                           ignore_nonalpha=True,
#                           ignore_nonlatin=True,
#                           ignore_order_words=True,
#                           ignore_titles=True,
                          method='levenshtein', 
                         )

In [None]:
results.shape

(323, 17)

So in what scenarios do you reckon Levenshtein will perform badly? 

Next up: **metaphone**. 

Metaphone's great for names that sound similar but are spelled differently enough that Levenshtein wouldn't catch them. It's especially handy when you're working with transcript data. But, it too comes with its pitfalls. 

Let's look at an example and then discuss what the possible pitfalls could be. 

Which names from the CIA world leaders list are also on the United Nations sanctions list?

In [None]:
un_sanctions = pd.read_csv("data/un-sanctions.csv")

In [None]:
results = fpd.fuzzy_merge(cia_world_leaders, un_sanctions, left_on='name', 
                          right_on='name', 
                          method='metaphone', 
                         )

In [None]:
results.shape

(18, 22)

In [None]:
## How does this compare to our other algorithms? 
results = fpd.fuzzy_merge(cia_world_leaders, un_sanctions, left_on='name', 
                          right_on='name', 
                          method='levenshtein', 
                         )

In [None]:
results.shape

(11, 22)

In [None]:
## How does this compare to our other algorithms? 
results = fpd.fuzzy_merge(cia_world_leaders, un_sanctions, left_on='name', 
                          right_on='name', 
                          ignore_case=True
                         )

In [None]:
results.shape

Finally, we get to **Bilenko**. And, yes, this uses machine learning, where you train the model yourself by labelling pairs of records from your own data. So, now you have human *smarts* involved in the process of matching up names across documents. 

Let's look at an example: 

Which names from the CIA world leaders list are also on the Davos attendees list?

In [None]:
results = fpd.fuzzy_merge(cia_world_leaders, davos_attendees, left_on='name', 
                          right_on='full_name', 
                          method='bilenko', 
                         )


Answer questions as follows:
 y - yes
 n - no
 s - skip
 f - finished

name: Nguyen Xuan CUONG

name: Nguyen Xuan Phuc

Do these records refer to the same thing? [y/n/s/f] 

In [None]:
results.shape