# The Fuzzy Matching Algorithm to Merge Vacancy Postings with Compustat Data
Phai Phongthiengtham
***

This IPython notebook demonstrates the fuzzy matching algorithm used to match firm names in online job vacancy postings to firm names in the Compustat database.

## Import necessary modules

In [10]:
import os
import re
import pandas as pd
import fuzzywuzzy
from fuzzywuzzy import fuzz

## The Fuzzy matching algorithm

Because a company may not use its exact legal name when posting a vacancy, we use fuzzy matching to identify imperfect string matches. For example, suppose we are matching:

In [11]:
fuzz.token_sort_ratio("enterprise products partners lp", "enterprise products partners")

95

The score of 95 (out of 100) suggests that the two strings are close to each other (see [here](https://pypi.python.org/pypi/fuzzywuzzy) for more detail).
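
To make the idea behind `token_sort_ratio` more concrete, the sketch below approximates it using only the standard library: lowercase both strings, sort their whitespace-separated tokens, and compare the results with `difflib.SequenceMatcher`. The helper name `token_sort_ratio_sketch` is hypothetical, and the score is an approximation under these assumptions; it need not equal the value fuzzywuzzy returns in every case.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a, b):
    """Approximate a token-sort similarity score on a 0-100 scale.

    Assumption: this is a simplified stand-in, not fuzzywuzzy's own code.
    """
    # Sort the tokens so that word order does not affect the score.
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


# Same words in a different order: the sorted strings are identical.
score_reordered = token_sort_ratio_sketch(
    "partners enterprise products", "enterprise products partners")

# One extra token ("lp"): the score stays high but falls below 100.
score_suffix = token_sort_ratio_sketch(
    "enterprise products partners lp", "enterprise products partners")

print(score_reordered, score_suffix)
```

Sorting the tokens first is what makes the score insensitive to word order, which is useful when a posting lists the parts of a firm name in a different order than the legal name.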

## COMPUSTAT (North America) database

Publicly traded companies in the US are required to report accounting and balance sheet data, so the Compustat database provides detailed information on these firms. The full Compustat dataset contains more than 1,800 variables. For this exercise, we are interested in the company legal name (`conml`). According to the Compustat variable description, this is the official company name as reported in the company's SEC EDGAR filings.

In [12]:
compustat_file_name = 'compustat.txt'
df_compustat_name = pd.read_csv(compustat_file_name,sep='\t')
df_compustat_name.conml.head(50)

0     01 COMMUNIQUE LABORATORY INC
1                1-800-FLOWERS.COM
2      1347 PROPERTY INS HLDGS INC
3             180 DEGREE CPTL CORP
4               1PM INDUSTRIES INC
5                 1ST CAPITAL BANK
6       1ST CENTURY BANCSHARES INC
7         1ST CONSTITUTION BANCORP
8              1ST ENTERPRISE BANK
9                  1ST SOURCE CORP
10          1ST UNITED BANCORP INC
11          20-20 TECHNOLOGIES INC
12                 2050 MOTORS INC
13     21ST CENTURY ONCOLOGY HLDGS
14              21VIANET GROUP INC
15                 2242749 ONT LTD
16          22ND CENTURY GROUP INC
17                24/7 KID DOC INC
18                          2U INC
19                        30DC INC
20                    360 VOX CORP
21                  37 CAPITAL INC
22          3D PIONEER SYSTEMS INC
23               3D SIGNATURES INC
24                 3D SYSTEMS CORP
25                3DICON CORP -OLD
26              3DX INDUSTRIES INC
27                           3M CO
28                 3

## Job vacancy postings

In the previous step, we performed an initial cleaning of the job vacancy postings (see [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/initial_cleaning.ipynb)).



In [13]:
posting_file_name = 'structured_data.txt'

df_posting = pd.read_csv(posting_file_name,sep='\t')
df_posting = df_posting[['company_name']].drop_duplicates()
df_posting = df_posting.sort_values(['company_name'])
df_posting.head(50)

Unnamed: 0,company_name
0,A & Associates Inc
1,A-1 Roofing
12,Active Learners Academy
8,"Adesa Corporation, LLC"
2,"Adesa of Lexington, Inc"
13,"Agile Premier, LLC"
24,"American Academy of Dermatology, Inc."
27,American Cancer Society
28,American Management Resources Corporation
29,"Appareo Systems, LLC"


## Edit company names
First, we lightly edit the company names as follows:

1. Convert all characters to lowercase.
2. Replace all punctuation with whitespace.
3. Collapse extra whitespace.

For example:

In [14]:
original_name = 'Oceaneering International, Inc.'
name = original_name.lower()  # change to lowercase.
# replace all non-alphanumeric characters with whitespace.
name = re.sub('[^a-z0-9]', ' ', name)
# remove extra whitespace.
name = ' '.join([w for w in re.split(' ', name) if not w == ''])
print('original name: ' + original_name)
print('processed name: ' + name)
print('-------------------------------------------')

original name: Oceaneering International, Inc.
processed name: oceaneering international inc
-------------------------------------------
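
The three editing steps can be wrapped in a small reusable function. This is a minimal sketch: the function name `clean_company_name` is hypothetical, and the extra example names are taken from the posting sample shown earlier.

```python
import re


def clean_company_name(original_name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = original_name.lower()
    # Every character other than a-z and 0-9 becomes a space.
    name = re.sub(r'[^a-z0-9]', ' ', name)
    # split() with no argument drops empty tokens, collapsing repeated spaces.
    return ' '.join(name.split())


examples = ['Oceaneering International, Inc.',
            'A-1 Roofing',
            'Adesa Corporation, LLC']
cleaned = [clean_company_name(n) for n in examples]

for original, processed in zip(examples, cleaned):
    print(original, '->', processed)
```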


## Match Company Names

Then, we match the company names as follows:

1. Direct match: for each vacancy posting company name, we loop through the Compustat company names and check for an exact match.
2. Replace common abbreviations and re-match: we further edit the company names by removing common suffixes and abbreviations, then repeat step 1.
3. Fuzzy string matching: repeat step 2, but use fuzzy string matching and accept a match with a ratio of 90% or greater.

In [15]:
remove_word = ['inc','corp','ltd','etf','group','co','holdings',
               'resources','fd','intl','global','tr','international','plc']

regex = '|'.join(remove_word)

In [16]:
list_compustat_name = df_compustat_name.conml.unique()
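# clean1: lowercase and drop every non-word character (including spaces).
list_compustat_name_clean1 = [re.sub(r'\W', '', w.lower()) for w in list_compustat_name]
# clean2: additionally drop the common suffixes and abbreviations.
list_compustat_name_clean2 = [re.sub(regex, '', w) for w in list_compustat_name_clean1]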
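
Because whitespace is removed before the suffix pattern is applied, the pattern deletes these character sequences wherever they occur, not only when they are separate words. The sketch below mirrors the two cleaning passes and, for comparison, shows a whole-word variant. The whole-word variant and both helper names are assumptions for illustration and are not part of the notebook's method.

```python
import re

remove_word = ['inc', 'corp', 'ltd', 'etf', 'group', 'co', 'holdings',
               'resources', 'fd', 'intl', 'global', 'tr', 'international', 'plc']
regex = '|'.join(remove_word)


def strip_anywhere(name):
    """Mirror the notebook: squash the name, then remove the listed substrings."""
    squashed = re.sub(r'\W', '', name.lower())
    return re.sub(regex, '', squashed)


# Assumed alternative (not the notebook's method): remove whole words only,
# before the spaces are dropped.
whole_word = re.compile(r'\b(?:' + regex + r')\b')


def strip_whole_words(name):
    spaced = re.sub(r'[^a-z0-9]', ' ', name.lower())
    return ''.join(whole_word.sub(' ', spaced).split())


print(strip_anywhere('ONVIA INC'))
print(strip_anywhere('01 COMMUNIQUE LABORATORY INC'))
print(strip_whole_words('01 COMMUNIQUE LABORATORY INC'))
```

In the second call the leading `co` of `communique` is removed along with the trailing `inc`. Both sides of the comparison are cleaned the same way, so exact matches are unaffected, but short names can end up closer to each other than their originals were.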

In [17]:
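def fuzzy_match(name_to_match, list_all_name):
    # Return the single name whose token-sort ratio with name_to_match
    # is at least 90, or None if there is no such name.

    # Keep every candidate that passes the threshold, with its plain ratio.
    ratio = [(fuzz.ratio(name_to_match, w), w) for w in list_all_name
             if fuzz.token_sort_ratio(name_to_match, w) >= 90]

    # This demonstration expects at most one candidate above the threshold.
    assert len(ratio) <= 1

    if len(ratio) == 0:
        output = None
    else:
        output = ratio[0][1]

    return output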
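
The function above stops with an `AssertionError` when more than one candidate passes the threshold. One possible variant, sketched below, returns the highest-scoring candidate instead. To keep the sketch self-contained, it scores with `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzywuzzy scores; the function names and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale (assumed stand-in for a fuzzywuzzy score)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_fuzzy_match(name_to_match, list_all_name, threshold=90):
    """Return the best candidate scoring at least `threshold`, or None."""
    scored = [(similarity(name_to_match, w), w) for w in list_all_name]
    candidates = [pair for pair in scored if pair[0] >= threshold]
    if not candidates:
        return None
    # Several candidates may pass; keep the one with the highest score.
    return max(candidates, key=lambda pair: pair[0])[1]


# Hypothetical cleaned names; two of them pass the threshold.
names = ['enterpriseproductspartners', 'enterpriseproductspartnersllp', 'onvia']
best = best_fuzzy_match('enterpriseproductspartnerslp', names)
missing = best_fuzzy_match('zzz', names)
print(best, missing)
```

Whether to fail loudly on ambiguous matches or pick the top score is a design choice: the assertion makes ambiguity visible during a demonstration, while the variant keeps a long batch run going.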

In [18]:
list_posting_firm = ['onvia inc co','enterprise products partners lp']

for posting_firm in list_posting_firm:

    matching_indicator = False
    
    posting_firm_clean1 = re.sub(r'\W','',posting_firm.lower())
    posting_firm_clean2 = re.sub(regex,'',posting_firm_clean1)

    if posting_firm_clean1 in list_compustat_name_clean1:
        matching_indicator = True
        i = list_compustat_name_clean1.index(posting_firm_clean1)
        compustat_name = list_compustat_name[i]
        print('posting from "'+ posting_firm + '" is matched with compustat company "' + compustat_name + '"')

    if not matching_indicator:
        if posting_firm_clean2 in list_compustat_name_clean2:
            matching_indicator = True
            i = list_compustat_name_clean2.index(posting_firm_clean2)
            compustat_name = list_compustat_name[i]
            print('posting from "'+ posting_firm + '" is matched with compustat company "' + compustat_name + '"')

    if not matching_indicator:
        fuzzy = fuzzy_match(posting_firm_clean2, list_compustat_name_clean2)
        if fuzzy:
            matching_indicator = True
            i = list_compustat_name_clean2.index(fuzzy)
            compustat_name = list_compustat_name[i]
            print('posting from "'+ posting_firm + '" is matched with compustat company "' + compustat_name + '"')

posting from "onvia inc co" is matched with compustat company "ONVIA INC"
posting from "enterprise products partners lp" is matched with compustat company "ENTERPRISE PRODS PRTNRS  -LP"
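
Steps 1 and 2 are exact comparisons, and `list.index` scans the whole list for every posting. With many postings, a dictionary keyed by the cleaned name gives the same first-match behaviour with constant-time lookups. A minimal sketch, assuming a short hypothetical list of Compustat names (taken from the sample output above) and hypothetical helper names:

```python
import re


def clean1(name):
    """Lowercase and drop every non-word character, as in the direct-match step."""
    return re.sub(r'\W', '', name.lower())


# Hypothetical sample; in the notebook this would be list_compustat_name.
compustat_names = ['ONVIA INC', '3M CO', '1ST SOURCE CORP']

# Map each cleaned name to the first original name that produces it,
# which matches what list.index would return.
lookup = {}
for original in compustat_names:
    lookup.setdefault(clean1(original), original)


def direct_match(posting_name):
    """Return the Compustat name whose cleaned form equals the posting's, or None."""
    return lookup.get(clean1(posting_name))


print(direct_match('Onvia, Inc.'))
print(direct_match('A-1 Roofing'))
```

The same pattern applies to step 2 by building a second dictionary from the suffix-stripped names; the fuzzy step still needs to compare against every candidate.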
