# Using `fuzzup`

The `fuzzup` library clusters strings that refer to the same thing but are spelt with small variations, such as names that have been entered inconsistently.

From https://pypi.org/project/fuzzup/

See also https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe 

`fuzzup` isn't installed in Colab by default, so we have to install it first using `!pip`.

In [None]:
#install fuzzup
!pip install fuzzup
import fuzzup
#import a particular function
from fuzzup.gear import form_clusters_and_rank

A classic use case for this sort of algorithm is where you have a column full of names (e.g. organisations or people) that haven't been entered consistently. 

Below we create a list of such names to test the algorithm.

In [7]:
# strings we want to cluster
person_names = ['Donald Trump', 'Donald Trump', 
                    'J. biden', 'joe biden', 'Biden', 
                    'Bide', 'mark esper', 'Christopher c . miller', 
                    'jim mattis', 'Nancy Pelosi', 'trumps',
                    'Trump', 'Donald', 'miller']
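
To get a feel for why exact matching fails on a list like this, here is a minimal sketch that scores how similar two strings are. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in: this is not necessarily how `fuzzup` measures similarity internally, and the helper name `similarity` is made up for this illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a score between 0 and 1 for how alike two strings are, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# compare a few pairs taken from the list of names
pairs = [('Biden', 'Bide'), ('Donald Trump', 'donald trump'), ('Biden', 'mark esper')]
for a, b in pairs:
    print(a, "|", b, "|", round(similarity(a, b), 2))
```

Near-identical spellings score close to 1 and unrelated names score close to 0, which is the kind of signal a clustering step can work from.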


## Using the 'form clusters' function

We can now run the `form_clusters_and_rank()` function on that list. This returns a *list* of *dictionaries*, one per cluster, each with the keys `COUNT`, `PROMOTED_STRING`, `RANK` and `STRINGS`.

In [8]:

#use the function on that list
form_clusters_and_rank(person_names)

[{'COUNT': 4,
  'PROMOTED_STRING': 'joe biden',
  'RANK': 2,
  'STRINGS': ['Bide', 'Biden', 'J. biden', 'joe biden']},
 {'COUNT': 2,
  'PROMOTED_STRING': 'Christopher c . miller',
  'RANK': 3,
  'STRINGS': ['Christopher c . miller', 'miller']},
 {'COUNT': 5,
  'PROMOTED_STRING': 'Donald Trump',
  'RANK': 1,
  'STRINGS': ['Donald', 'Donald Trump', 'Trump', 'trumps']},
 {'COUNT': 1,
  'PROMOTED_STRING': 'Nancy Pelosi',
  'RANK': 6,
  'STRINGS': ['Nancy Pelosi']},
 {'COUNT': 1,
  'PROMOTED_STRING': 'jim mattis',
  'RANK': 6,
  'STRINGS': ['jim mattis']},
 {'COUNT': 1,
  'PROMOTED_STRING': 'mark esper',
  'RANK': 6,
  'STRINGS': ['mark esper']}]

In [None]:
#store that list of dictionaries
rankdict = form_clusters_and_rank(person_names)

In [None]:
#check the type of object we've created
print(type(rankdict))
#check how many items it has
print(len(rankdict))
#check if we can drill down into the first item, and access the value paired with the key 'STRINGS'
print(rankdict[0]['STRINGS'])

<class 'list'>
6
['Bide', 'Biden', 'J. biden', 'joe biden']
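
Because the result is an ordinary list of dictionaries, we can sort it like any other list. The sketch below orders the clusters by `RANK`. So that it runs without `fuzzup` installed, the variable `clusters` holds the first four dictionaries copied from the output above rather than a live result.

```python
# cluster data copied from the output above (first four clusters only)
clusters = [
    {'COUNT': 4, 'PROMOTED_STRING': 'joe biden', 'RANK': 2,
     'STRINGS': ['Bide', 'Biden', 'J. biden', 'joe biden']},
    {'COUNT': 2, 'PROMOTED_STRING': 'Christopher c . miller', 'RANK': 3,
     'STRINGS': ['Christopher c . miller', 'miller']},
    {'COUNT': 5, 'PROMOTED_STRING': 'Donald Trump', 'RANK': 1,
     'STRINGS': ['Donald', 'Donald Trump', 'Trump', 'trumps']},
    {'COUNT': 1, 'PROMOTED_STRING': 'Nancy Pelosi', 'RANK': 6,
     'STRINGS': ['Nancy Pelosi']},
]

# sort the clusters so the highest-ranked one comes first
ranked = sorted(clusters, key=lambda d: d['RANK'])

# print rank, preferred string, COUNT and the number of distinct strings
for d in ranked:
    print(d['RANK'], d['PROMOTED_STRING'], d['COUNT'], len(d['STRINGS']))
```

Note that the 'Donald Trump' cluster has a `COUNT` of 5 but only 4 entries in `STRINGS`. That is consistent with 'Donald Trump' appearing twice in the original list, which suggests `COUNT` counts every occurrence while `STRINGS` holds distinct values only.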


## Assigning the preferred string from the cluster

To grab the 'promoted' string for each original name, we can loop like this.

In [None]:
#create empty list
prefstrings = []

#loop through original names
for i in person_names:
  #loop through dicts of clusters of names
  for d in rankdict:
    #if the original name is in the list of strings
    if i in d['STRINGS']:
      #print that and the 'promoted' (preferred) one
      print(i, "=", d['PROMOTED_STRING'])
      #add promoted one to list
      prefstrings.append(d['PROMOTED_STRING'])



Donald Trump = Donald Trump
Donald Trump = Donald Trump
J. biden = joe biden
joe biden = joe biden
Biden = joe biden
Bide = joe biden
mark esper = mark esper
Christopher c . miller = Christopher c . miller
jim mattis = jim mattis
Nancy Pelosi = Nancy Pelosi
trumps = Donald Trump
Trump = Donald Trump
Donald = Donald Trump
miller = Christopher c . miller
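
An alternative to the nested loop is to build a lookup dictionary once and then map each name through it. This is only a sketch: `clusters` is a shortened copy of the output shown earlier, `lookup` is a name invented here, and 'unmatched name' is a hypothetical entry added to show what happens when a string is in no cluster.

```python
# shortened copy of the cluster output shown earlier
clusters = [
    {'PROMOTED_STRING': 'joe biden',
     'STRINGS': ['Bide', 'Biden', 'J. biden', 'joe biden']},
    {'PROMOTED_STRING': 'Donald Trump',
     'STRINGS': ['Donald', 'Donald Trump', 'Trump', 'trumps']},
]

# some of the original names, plus one hypothetical name that is in no cluster
names = ['Donald Trump', 'J. biden', 'Bide', 'trumps', 'unmatched name']

# map every string in every cluster to that cluster's promoted string
lookup = {s: d['PROMOTED_STRING'] for d in clusters for s in d['STRINGS']}

# fall back to the original name when there is no match,
# so the result always has the same length as the input
prefstrings = [lookup.get(name, name) for name in names]
print(prefstrings)
```

Keeping the two lists the same length matters for the next step, because a dataframe cannot be built from columns of different lengths.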


## Creating a dataframe to store the results

Now let's create a dataframe to store both the original list of names, and the list of the 'preferred' version (i.e. the consistent version) that is going to be used instead.

It's always a good idea to store both, because you can then move on to a further stage of checking the results in various ways (using code and manually).

In [None]:
import pandas as pd

In [None]:
#create a dataframe from the two lists
pd.DataFrame({'originaldata':person_names, 'cleandata': prefstrings})

| | originaldata | cleandata |
|---|---|---|
| 0 | Donald Trump | Donald Trump |
| 1 | Donald Trump | Donald Trump |
| 2 | J. biden | joe biden |
| 3 | joe biden | joe biden |
| 4 | Biden | joe biden |
| 5 | Bide | joe biden |
| 6 | mark esper | mark esper |
| 7 | Christopher c . miller | Christopher c . miller |
| 8 | jim mattis | jim mattis |
| 9 | Nancy Pelosi | Nancy Pelosi |
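
As one way of checking the results with code, the sketch below flags the rows where the clean value differs from the original, so those are the ones to review manually. It assumes `pandas` is available and uses a handful of rows copied from the results above. The column name `changed` is made up for this example.

```python
import pandas as pd

# a few rows copied from the results above
df = pd.DataFrame({
    'originaldata': ['Donald Trump', 'J. biden', 'joe biden', 'Biden', 'Bide', 'mark esper'],
    'cleandata': ['Donald Trump', 'joe biden', 'joe biden', 'joe biden', 'joe biden', 'mark esper'],
})

# True where the cleaning step changed the value
df['changed'] = df['originaldata'] != df['cleandata']

# keep only the rows that were altered, for manual review
changed = df[df['changed']]
print(changed)

# count how many distinct original spellings map to each clean value
print(df.groupby('cleandata')['originaldata'].nunique())
```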
