# **TheFuzz**
Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
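
To make the metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The function name `levenshtein` is illustrative only and is not part of TheFuzz, which relies on difflib or python-Levenshtein for this work.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (illustrative helper)."""
    # previous[j] holds the distance between the current prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The scorers used later turn a distance of this kind into a similarity percentage between 0 and 100.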

**Requirements**

* Python 2.7 or higher
* difflib
* python-Levenshtein (optional; provides a 4-10x speedup in string matching, though it may produce different results in certain cases)


In [1]:
!pip install thefuzz[speedup]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting thefuzz[speedup]
  Downloading thefuzz-0.19.0-py2.py3-none-any.whl (17 kB)
Collecting python-levenshtein>=0.12
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
     |████████████████████████████████| 50 kB 4.1 MB/s
Building wheels for collected packages: python-levenshtein
  Building wheel for python-levenshtein (setup.py) ... done
  Created wheel for python-levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149867 sha256=81613d2d0a3470161ff5286d3b4b7e87a91a4a754d7d9e5a86950d7e042ef934
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-levenshtein
Installing collected packages: thefuzz, python-levenshtein
Successfully installed python-levenshtein-0.12.2 thefuzz-0.19.0


In [3]:
import pandas as pd
import numpy as np
from thefuzz import process, fuzz

## Data Loading

As an example, we take a list of extracted brand names and run a fuzzy search for each one against the SP500 dataset, so that we get a match even though the terms on both sides are not exactly the same.

In [37]:
extracted_brands = ['Amazon',
                    'Electronic Arts',
                    'Sony',
                    'EA',
                    'Embracer',
                    'Microsoft',
                    'Comcast',
                    'Blizzard',
                    'CNBC',
                    'Apple',
                    'NBC Universal',
                    'Eidos',
                    'Disney',
                    'Bungie',
                    'EA Games',
                    'Activision',
                    'Crystal Dynamics']

In [6]:
# Load in the SP500 file
sp500 = pd.read_csv('/content/SP500_constituents.csv')

# Display the first rows
sp500.head()

| Unnamed: 0 | Symbol | Name | Sector |
|---|---|---|---|
| 0 | MMM | 3M | Industrials |
| 1 | AOS | A. O. Smith | Industrials |
| 2 | ABT | Abbott Laboratories | Health Care |
| 3 | ABBV | AbbVie | Health Care |
| 4 | ABMD | Abiomed | Health Care |


## FuzzyWuzzy

TheFuzz (formerly FuzzyWuzzy) offers four main scorers based on the Levenshtein distance between two strings: ratio, partial ratio, token sort ratio and token set ratio. In this example I look at the token sort ratio and the token set ratio, because I believe they suit this dataset better: the names may differ in word order or contain repeated words.<br>
I pick a few of the extracted brand names and look for similar entries in the Name column (and, for EA Games, the Symbol column) of the SP500 table. Because the brands are matched against a separate list, an exact match is not guaranteed: a score of 100 only shows up when the scorer considers the two strings equivalent.

### Token Sort Ratio
The token sort ratio scorer tokenizes both strings, cleans them by converting them to lower case and removing punctuation, and then sorts the tokens alphabetically. It then computes a Levenshtein-based similarity between the two sorted strings and returns it as a percentage (0 to 100).
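
The steps just described can be sketched with the standard library alone. This is an illustration under assumptions, not TheFuzz's actual code: the names `sorted_tokens` and `token_sort_ratio_sketch` are hypothetical, the cleaning step only keeps ASCII letters and digits, and `difflib.SequenceMatcher` stands in for the similarity measure.

```python
import re
from difflib import SequenceMatcher

def sorted_tokens(text):
    """Lower-case the text, replace punctuation with spaces, sort the tokens."""
    # Assumption: only ASCII letters and digits count as token characters
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(sorted(cleaned.split()))

def token_sort_ratio_sketch(a, b):
    """Similarity (0-100) between the sorted-token forms of a and b."""
    ratio = SequenceMatcher(None, sorted_tokens(a), sorted_tokens(b)).ratio()
    return int(round(100 * ratio))

print(sorted_tokens("Blizzard, Activision"))  # → activision blizzard
print(token_sort_ratio_sketch("Activision Blizzard", "blizzard activision"))  # → 100
```

Sorting makes word order irrelevant, but a string that is missing a token (for example Activision against Activision Blizzard) is still penalized for the missing characters.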

In [10]:
# OK
process.extract('Amazon', sp500.Name, scorer=fuzz.token_sort_ratio, limit=3)

[('Amazon', 100, 25), ('Aon', 67, 42), ('AutoZone', 57, 54)]

In [11]:
# Activision is now Activision Blizzard so it gets a 69 score at TSR
process.extract('Activision', sp500.Name, scorer=fuzz.token_sort_ratio, limit=3)

[('Activision Blizzard', 69, 6),
 ('Union Pacific', 61, 455),
 ('Verisign', 56, 463)]

In [12]:
# If we search for EA Games in the Name column, the matches are bad
process.extract('EA Games', sp500.Name, scorer=fuzz.token_sort_ratio, limit=3)

[('Cboe Global Markets', 52, 91),
 ('Global Payments', 52, 215),
 ('Baker Hughes', 50, 57)]

In [13]:
# Searching for EA Games in the SP500 Symbol column could match EA, but with this metric the score is too low
process.extract('EA Games', sp500.Symbol, scorer=fuzz.token_sort_ratio)

[('AES', 55, 11),
 ('AME', 55, 36),
 ('AKAM', 50, 15),
 ('EA', 40, 166),
 ('ES', 40, 180)]

In [15]:
# If we try to search Electronic Arts by Name the matches are OK
process.extract('Electronic Arts', sp500.Name, scorer=fuzz.token_sort_ratio, limit=3)

[('Electronic Arts', 100, 166),
 ('Crown Castle', 59, 131),
 ('American Electric Power', 58, 29)]

### Token Set Ratio
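The token set ratio scorer tokenizes and cleans the strings in the same way as the token sort ratio. It then splits the tokens into those common to both strings and those unique to each one, compares the resulting strings pairwise, and returns the highest similarity percentage. As a result, a string whose tokens are all contained in the other string scores 100.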
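
A rough standard-library sketch of this idea follows. It is an approximation under assumptions rather than TheFuzz's implementation: the helper names (`tokens`, `similarity`, `token_set_ratio_sketch`) are hypothetical, cleaning is limited to ASCII letters and digits, and `difflib.SequenceMatcher` is used as a stand-in similarity measure.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Lower-case the text, replace punctuation with spaces, return the token set."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def similarity(a, b):
    """Similarity (0-100) between two strings, using difflib as a stand-in."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def token_set_ratio_sketch(a, b):
    tokens_a, tokens_b = tokens(a), tokens(b)
    common = " ".join(sorted(tokens_a & tokens_b))   # tokens shared by both
    only_a = " ".join(sorted(tokens_a - tokens_b))   # tokens unique to a
    only_b = " ".join(sorted(tokens_b - tokens_a))   # tokens unique to b
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # Pairwise comparisons; the best one is the final score
    return max(similarity(common, combined_a),
               similarity(common, combined_b),
               similarity(combined_a, combined_b))

print(token_set_ratio_sketch("Activision", "Activision Blizzard"))  # → 100
print(token_set_ratio_sketch("EA Games", "EA"))                     # → 100
```

Because the shared tokens are compared against themselves whenever one string has no unique tokens, subset matches reach 100. This makes the scorer flexible, but also more permissive than the token sort ratio.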

In [20]:
# OK
process.extract('Amazon', sp500.Name, scorer=fuzz.token_set_ratio, limit=3)

[('Amazon', 100, 25), ('Aon', 67, 42), ('AutoZone', 57, 54)]

In [19]:
# Activision is now Activision Blizzard; with the token set ratio it gets a score of 100 (69 with the token sort ratio)
process.extract('Activision', sp500.Name, scorer=fuzz.token_set_ratio, limit=3)

[('Activision Blizzard', 100, 6),
 ('Union Pacific', 61, 455),
 ('Verisign', 56, 463)]

In [18]:
# If we search for EA Games in the Name column, the matches are still bad
process.extract('EA Games', sp500.Name, scorer=fuzz.token_set_ratio, limit=3)

[('Cboe Global Markets', 52, 91),
 ('Global Payments', 52, 215),
 ('Baker Hughes', 50, 57)]

In [17]:
# Searching for EA Games in the SP500 Symbol column now matches EA with a score of 100
process.extract('EA Games', sp500.Symbol, scorer=fuzz.token_set_ratio, limit=3)

[('EA', 100, 166), ('AES', 55, 11), ('AME', 55, 36)]

In [16]:
# If we try to search Electronic Arts by Name the matches are OK
process.extract('Electronic Arts', sp500.Name, scorer=fuzz.token_set_ratio, limit=3)

[('Electronic Arts', 100, 166),
 ('Crown Castle', 59, 131),
 ('American Electric Power', 58, 29)]

In [82]:
def get_filtered_standard_names(extracted_names, lut_names, threshold=0.95):
  '''
    Gets the standard name for each name extracted from some text, keeping
    only matches at or above a similarity threshold. The standard name is
    found by fuzzy search in a lookup table (LUT) of names.
    We use Token Set Ratio to be more flexible.

    extracted_names: names to be standardized by approx searching in lut_names
    lut_names: standardized names
    threshold: minimum similarity, between 0 and 1, to accept a match
    returns: list of (standard_name, score) tuples, with score from 0 to 100
  '''
  output = []
  for extracted_name in extracted_names:
    # Search in lut_names (not the global sp500) so the function is reusable
    name_match = process.extractOne(extracted_name, lut_names,
                                    scorer=fuzz.token_set_ratio)
    # extractOne returns None when there are no candidates to compare
    if name_match is None:
      continue
    similarity = name_match[1] / 100
    if similarity >= threshold:
      output.append(name_match[:2])

  return output


In [83]:
get_filtered_standard_names(extracted_brands, sp500.Name)

[('Amazon', 100),
 ('Electronic Arts', 100),
 ('Microsoft', 100),
 ('Comcast', 100),
 ('Activision Blizzard', 100),
 ('Apple', 100),
 ('The Walt Disney Company', 100),
 ('Activision Blizzard', 100)]
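
The list above does not record which extracted brand produced each match, and Activision Blizzard appears twice (once for Activision and once for Blizzard). One possible variant, sketched here under assumptions, keeps a mapping from each extracted name to its best match. The names `best_matches` and `simple_ratio` are hypothetical, the threshold is expressed on the 0 to 100 scale, and `simple_ratio` is a difflib stand-in so the block runs without TheFuzz; in this notebook one would pass a TheFuzz scorer such as `fuzz.token_set_ratio` instead.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Stand-in scorer (0-100); replace with a TheFuzz scorer in practice."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))

def best_matches(extracted_names, lut_names, scorer=simple_ratio, threshold=95):
    """Map each extracted name to its (standard_name, score) if score >= threshold."""
    matches = {}
    for name in extracted_names:
        scored = [(candidate, scorer(name, candidate)) for candidate in lut_names]
        if not scored:
            continue  # nothing to compare against
        best = max(scored, key=lambda pair: pair[1])
        if best[1] >= threshold:
            matches[name] = best
    return matches

lut = ["Amazon", "Activision Blizzard", "Apple"]  # toy data, not the real SP500 file
print(best_matches(["amazon", "Sony"], lut))  # → {'amazon': ('Amazon', 100)}
```

Keeping the extracted name as the key makes it easy to see which brands were dropped by the threshold and to deduplicate the standard names afterwards.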

In [94]:
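def get_company_sectors(standard_names_tuples, lut_company_sectors):
  '''
    Gets the sector of each standardized company name.

    standard_names_tuples: (standard_name, score) tuples
    lut_company_sectors: DataFrame with 'Name' and 'Sector' columns
    returns: list of (name, sector) tuples
  '''
  output = []
  for std_comp_name, _ in standard_names_tuples:
    # Use the lookup table passed as argument instead of the global sp500
    matches = lut_company_sectors.loc[lut_company_sectors.Name == std_comp_name,
                                      ['Name', 'Sector']].dropna()
    output += list(matches.itertuples(index=False, name=None))
  return output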

In [95]:
get_company_sectors(get_filtered_standard_names(extracted_brands, sp500.Name), sp500)

[('Amazon', 'Consumer Discretionary'),
 ('Electronic Arts', 'Communication Services'),
 ('Microsoft', 'Information Technology'),
 ('Comcast', 'Communication Services'),
 ('Activision Blizzard', 'Communication Services'),
 ('Apple', 'Information Technology'),
 ('The Walt Disney Company', 'Communication Services'),
 ('Activision Blizzard', 'Communication Services')]

## References
* Approximate string matching or Fuzzy searching - https://en.wikipedia.org/wiki/Approximate_string_matching
* Levenshtein distance - https://en.wikipedia.org/wiki/Levenshtein_distance
* The Fuzz - https://github.com/seatgeek/thefuzz (formerly known as fuzzywuzzy)
