In [1]:
import pandas as pd
import numpy as np
from module import *

### First look at the data:
By scanning through the company names on both csv files, there are following observations:
The matching could have these issues:
1. Varying ways to refer to a company: limited, ltd, plt, corp, inc, co., etc.
2. Acronyms 
3. White spaces
4. Reordered names
5. Some names can have completely different synonyms: i.e. Howmet Aerospace Inc and Arconic. Meaning combining name and synonyms could be helpful.


### I. Standardise the data:
Standised the data: remove '.', transform into lower case, remove stop words.

In [2]:
inte_names = pd.read_csv('data/IntegrumNamesSynonyms.csv')
inte_names.head()

Unnamed: 0,name,synonyms
0,ABB LTD-REG,
1,ABIOMED Inc.,ABIOMED Inc.
2,3i Group,
3,AB InBev SA-NV,Anheuser-Busch InBev SA/NV
4,Accor S.A,Accor


In [3]:
legal_names = pd.read_csv('data/LeiLegalname.csv')
legal_names.head()

Unnamed: 0,lei,legal_name
0,2138006OTCECA7V12D34,NATIONALE BORG REINSURANCE N.V.
1,549300LG53GXF359TQ39,BARING INVESTMENT FUNDS PLC - BARING ASIAN DEB...
2,213800U17THE1662Z496,CUCINA ACQUISITION (UK) LIMITED
3,213800CAIVXI95XYUC30,AVELLEMY FUNDS OEIC - AVELLEMY 3
4,213800VHRLPHSNZJ4314,PATCHWORK ENERGY LIMITED


In [4]:
# Delete missing values
legal_names[legal_names['legal_name'].isna()]

Unnamed: 0,lei,legal_name
312070,743700H59GD67TAURX08,
312071,743700LBRGRYJFUWTH23,
312072,743700YK0UAF6KQQ6N58,
312073,743700Z235FZDWL6CY78,
312074,743700HOBAYLLXIRJT97,


In [5]:
print(len(legal_names['legal_name']))
print(len(legal_names['legal_name'].unique()))

1735176
1021508


In [6]:
# Delete duplicates
legal_names.drop_duplicates(subset='legal_name', inplace=True)

In [7]:
# Drop NA in legal names
legal_names.dropna(axis=0, how='any', inplace=True)

In [8]:
len(legal_names)

1021507

In [9]:
standardise(inte_names, 'name')
standardise(inte_names, 'synonyms')
inte_names.head()

Unnamed: 0,name,synonyms,name_sd,synonyms_sd
0,ABB LTD-REG,,abb -reg,
1,ABIOMED Inc.,ABIOMED Inc.,abiomed,abiomed
2,3i Group,,3i,
3,AB InBev SA-NV,Anheuser-Busch InBev SA/NV,ab inbev sa-nv,anheuser-busch inbev sa/nv
4,Accor S.A,Accor,accor sa,accor


In [10]:
standardise(legal_names, 'legal_name')
legal_names.head()

Unnamed: 0,lei,legal_name,legal_name_sd
0,2138006OTCECA7V12D34,NATIONALE BORG REINSURANCE N.V.,nationale borg reinsurance nv
1,549300LG53GXF359TQ39,BARING INVESTMENT FUNDS PLC - BARING ASIAN DEB...,baring investment funds - baring asian debt fund
2,213800U17THE1662Z496,CUCINA ACQUISITION (UK) LIMITED,cucina acquisition (uk)
3,213800CAIVXI95XYUC30,AVELLEMY FUNDS OEIC - AVELLEMY 3,avellemy funds oeic - avellemy 3
4,213800VHRLPHSNZJ4314,PATCHWORK ENERGY LIMITED,patchwork energy


In [11]:
legal_names.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1021507 entries, 0 to 1552400
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   lei            1021507 non-null  object
 1   legal_name     1021507 non-null  object
 2   legal_name_sd  1021507 non-null  object
dtypes: object(3)
memory usage: 31.2+ MB


In [12]:
inte_names.to_csv('data/inte_names.csv', index=False)
legal_names.to_csv('data/legal_names.csv', index=False)

In [13]:
inte_names[inte_names['name_sd'].str.contains('settlement|portfolio|scheme|legacy|loan|children|grandchildren|deceased|discretionary|administration')]

Unnamed: 0,name,synonyms,name_sd,synonyms_sd


In [14]:
sum(legal_names['legal_name_sd'].str.contains('settlement|portfolio|scheme|legacy|loan|children|grandchildren|deceased|discretionary|administration'))

41939

I noticed:
There are a lot of private funds and settlements that do not appear in the list to be matched. It might make sense to exclude them, however, they only constitute less than 4% of the total count. I have kept them in.

### II. Fuzzywuzzy - Levenshtein distance

FuzzyWuzzy is a Python library using Levenshtein distance to match strings. Levenshtein distance uses the minium number of steps of simgle character change from one string to another. It's important to remove the stop words, so I used the standardised legal names.

In [23]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from time import time

In [50]:
# Use a dictionary to help map standardised names back to original names
choices = legal_names['legal_name_sd'].to_dict()
choices[0]

'nationale borg reinsurance nv'

In [51]:
time_start = time()
print(process.extract('abb -reg', choices, limit=5))
time_end = time()
print(f'Searched 1M records in {round(time_end-time_start,3) } seconds \n')

[('eg  ', 90, 273986), ('abb  ', 90, 456530), ('abb ', 90, 558802), ('  ab', 90, 670755), ('abb ', 90, 725397)]
Searched 1M records in 58.075 seconds 



It took about a minute to return matches for one record, and the result is OK. However, I don't want to return the matched standardised names, but original names. So let's change that!

In [53]:
result = process.extract('abb -reg', choices, limit=5)
for match in result:
    print(legal_names.loc[match[2], 'legal_name'], match[1])

EG Group Limited 90
ABB Company Limited 90
ABB Limited 90
CO Group AB 90
ABB INC. 90


Now match all the company names and save them. (It takes too long, so it is abandoned).

In [None]:
inte_names['fw'] = None
for idx, row in inte_names.iterrows():
    result = process.extract(row['name'], choices, limit=5)
    matches = []
    for match in result:
        matches.append((legal_names.loc[match[2], 'legal_name'], match[1]))
    inte_names.loc[idx,'fw'] = matches

In [None]:
inte_names.to_csv('data/inte_names.csv', index=False)

### III. BM25 "improved" TFIDF 

BM25 improves on TF-IDF. It was chosen because it is more sensitive towards more unique words, often like the keyword in a company's name. i.e. There are not many 'Integrum's in the world! <br><br>
You can also fine tune K to make it extra sensitive to unique words, but I won't be doing that here!

Before BM25, I want to add a column that combine the text from of original and synonym columns together.

In [36]:
inte_names.head()

Unnamed: 0,name,synonyms,name_sd,synonyms_sd
0,ABB LTD-REG,,abb -reg,
1,ABIOMED Inc.,ABIOMED Inc.,abiomed,abiomed
2,3i Group,,3i,
3,AB InBev SA-NV,Anheuser-Busch InBev SA/NV,ab inbev sa-nv,anheuser-busch inbev sa/nv
4,Accor S.A,Accor,accor sa,accor


In [38]:
inte_names['synonyms'].fillna(' ', inplace=True)
inte_names['comb_name'] = inte_names['name'] + ' ' + inte_names['synonyms']
inte_names.head()

Unnamed: 0,name,synonyms,name_sd,synonyms_sd,comb_name
0,ABB LTD-REG,,abb -reg,,ABB LTD-REG
1,ABIOMED Inc.,ABIOMED Inc.,abiomed,abiomed,ABIOMED Inc. ABIOMED Inc.
2,3i Group,,3i,,3i Group
3,AB InBev SA-NV,Anheuser-Busch InBev SA/NV,ab inbev sa-nv,anheuser-busch inbev sa/nv,AB InBev SA-NV Anheuser-Busch InBev SA/NV
4,Accor S.A,Accor,accor sa,accor,Accor S.A Accor


In [27]:
import spacy
from rank_bm25 import BM25Okapi
from tqdm import tqdm
import en_core_web_lg

nlp = en_core_web_lg.load()

legal_names_list = legal_names['legal_name'].str.lower().values
tokens = []

for doc in tqdm(nlp.pipe(legal_names_list, disable=['tagger', 'parser', 'ner'])):
    token = [t.text for t in doc if t.is_alpha]
    tokens.append(token)

1021507it [01:25, 11977.85it/s]


Check how different languages have been tokenised

In [28]:
print('English', tokens[637457])
print('Greek(?)', tokens[95710])
print('Spanish', tokens[96760])
print('Japanese', tokens[198943])
print('Swidish?', tokens[152754])
print('Chinese', tokens[144947])



English ['barclays', 'multi', 'manager', 'fund', 'public', 'limited', 'company', 'globalaccess', 'japan', 'fund']
Greek(?) ['φυρκο', 'ανωνυμη', 'βιομηχανικη', 'εμπορικη', 'εταιρεια', 'ζωοτροφων']
Spanish ['eólica', 'do', 'alto', 'da', 'lagoa']
Japanese []
Swidish? ['korunní', 'dvůr']
Chinese ['广州岭南集团控股股份有限公司']


It looks like some languages haven't been tokenised well. Because our query list doesn't seem to include these languages. We will ignore this for now.

In [29]:
bm25 = BM25Okapi(tokens)

In [30]:

query = inte_names.loc[5, 'name']
tokenised_q = query.lower().split(' ')

time_start = time()
results = bm25.get_top_n(tokenised_q, legal_names['legal_name'].values, n=10)
scores = np.sort(bm25.get_scores(tokenised_q))[::-1]
scores = scores/sum(scores)
time_end= time()

print(f'Searched 50,000 records in {round(time_end-time_start,3) } seconds \n')
print('Query:', query)
for i, match in enumerate(results):
    print(f'{match}: {scores[i]:.2f}')

Searched 50,000 records in 0.759 seconds 

Query: ServiceNow
ServiceNow Nederland B.V.: 0.29
SERVICENOW, INC.: 0.29
SERVICENOW UK LIMITED: 0.25
SERVICENOW SOFTWARE DEVELOPMENT INDIA PRIVATE LIMITED: 0.18
Työväen Opintorahasto sr: 0.00
Työttömyyskassojen tukisäätiö sr: 0.00
Työttömyyskassojen Tukikassa: 0.00
Työttömyyskassa Statia: 0.00
Työstökoneliike M. Koskela Oy: 0.00
Työttömyyskassa Pro: 0.00


In [39]:
# Test a few more with original and synonyms added names
trial_list = [1, 3, 15, 37, 67, 95, 102, 126, 134]
for i in trial_list:
    query = inte_names.loc[i, 'name']
    query_with_syn = inte_names.loc[i, 'comb_name']
    
    tokenised_q = query.lower().split(' ')
    tokenised_q_comb = query_with_syn.lower().split(' ')

    time_start = time()
    results = bm25.get_top_n(tokenised_q, legal_names['legal_name'].str.lower().values, n=10)
    scores = np.sort(bm25.get_scores(tokenised_q))[::-1]
    scores = scores/sum(scores)
    time_end= time()

    print(f'Searched 1M records in {round(time_end-time_start,3) } seconds \n')
    print('Query:', query)
    for i, match in enumerate(results):
        print(f'{match}: {scores[i]:.2f}')
        
    time_start = time()
    results = bm25.get_top_n(tokenised_q_comb, legal_names['legal_name'].str.lower().values, n=10)
    scores = np.sort(bm25.get_scores(tokenised_q_comb))[::-1]
    scores = scores/sum(scores)
    time_end= time()

    print(f'Searched 1M records in {round(time_end-time_start,3) } seconds \n')
    print('Query:', query)
    for i, match in enumerate(results):
        print(f'{match}: {scores[i]:.2f}')

Searched 1M records in 1.235 seconds 

Query: ABIOMED Inc.
abiomed, inc.: 1.00
the young men's christian association of metropolitan denver: 0.00
työstökoneliike m. koskela oy: 0.00
työväen teatterin talosäätiö sr: 0.00
työväen sivistysliitto tsl ry, ruotsiksi arbetarnas bildningsförbund abf rf: 0.00
työväen sivistysliiton riihimäen-lopen opintojärjestö ry: 0.00
työväen opintorahasto sr: 0.00
työttömyyskassojen tukisäätiö sr: 0.00
työttömyyskassojen tukikassa: 0.00
työttömyyskassa statia: 0.00
Searched 1M records in 1.984 seconds 

Query: ABIOMED Inc.
abiomed, inc.: 1.00
the young men's christian association of metropolitan denver: 0.00
työstökoneliike m. koskela oy: 0.00
työväen teatterin talosäätiö sr: 0.00
työväen sivistysliitto tsl ry, ruotsiksi arbetarnas bildningsförbund abf rf: 0.00
työväen sivistysliiton riihimäen-lopen opintojärjestö ry: 0.00
työväen opintorahasto sr: 0.00
työttömyyskassojen tukisäätiö sr: 0.00
työttömyyskassojen tukikassa: 0.00
työttömyyskassa statia: 0.00
Se

Not bad, but the scoring is all over the place for some reason. I will assign my own score as bellow until I find out what the problem is: <br>
Rank 1-5: 0.9-0.5 <br>

Adding synonyms possibly helped with names acronyms that have expanded synonyms, i.e. AGC Inc <-> Asahi Glass Co

In [40]:
# Match top 5 for all and save the results before standardisation
inte_names = pd.read_csv('data/inte_names.csv')
legal_names = pd.read_csv('data/legal_names.csv')

nlp = en_core_web_lg.load()

legal_names_list = legal_names['legal_name'].str.lower().values
tokens = []

for doc in nlp.pipe(legal_names_list, disable=['tagger', 'parser', 'ner']):
    token = [t.text for t in doc if t.is_alpha]
    tokens.append(token)

bm25 = BM25Okapi(tokens)
inte_names['bm25'] = None

for i, name in enumerate(inte_names['name'].values):
    tokenised_q = name.lower().split(' ')
    results = bm25.get_top_n(tokenised_q, legal_names['legal_name'].values, n=5)
    scores = (0.9, 0.8, 0.7, 0.6, 0.5)
    inte_names.loc[i, 'bm25'] = list(zip(results, scores))

In [52]:
inte_names.to_csv('data/inte_names.csv', index=False)

There are a few observations, from scanning through the .csv file:
1. Some match ranking is poor, 100% matches don't appear on top.
2. Can't deal with missing white space.
3. Some really weird matching going on: i.e. when there isn't a good match, ‘THE YOUNG MEN'S CHRISTIAN ASSOCIATION OF METROPOLITAN DENVER’ is often returned.

Possible solutions:
1. N-grams could help with missing white space and abbreviations
2. Adding FuzzyWuzzy on top of BM25 could possibly bring up the 100% matches to the top.
3. N-grams could also help with out of vocabulary key words that can't match. So the algorithm does not return weird matches.

### IV. Improve on BM25 with FuzzyWuzzy

The idea is to combat problems with BM producing matches like this: <br>
```Query: 4imprint Group plc<br>
h&h group plc: 0.00<br>
4imprint group plc: 0.00<br>
gocompare.com group plc: 0.00<br>
3i group plc: 0.00<br>
```
by layering FuzzyWuzzy on top of BM25

In [101]:
inte_names['bm25_fw'] = None
for idx, row in inte_names.iterrows():
    choices = [name for name, score in row['bm25']]
    result = process.extract(row['name'], choices, limit=5)
    inte_names.loc[idx,'bm25_fw'] = result

In [102]:
inte_names.to_csv('data/inte_names.csv', index=False)

### V. FastText (ngram tokenisation) and NMSLIB (faster matching)

I ran the second cell below and it was estimated to take a couple of days to finish encoding all the n-grams in the legal file. So I implemented a couple of solutions: <br>
1. Remove non-English legal names for the time being.
2. Remove the legal names that doesn't contain any of the words in the query dataset in their string.

There are a couple of drawbacks: alphabetic entries in the query dataset are not necessarily English and it could remove legal names that matches the query names by phonetics, i.e. PingAn and PingAn in Chinese.

In [86]:
from langdetect import detect

def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

legal_names['isEng'] = legal_names['legal_name'].apply(isEnglish)
legal_names_en = legal_names[legal_names['isEng']==True]

In [96]:
# create a vocab for inte_names
vocab = ' '.join(inte_names['name_sd'].to_list()).split()
flt = legal_names_en['legal_name'].str.lower().str.contains('|'.join(vocab))
legal_names_en = legal_names_en[flt]

In [97]:
legal_names_en

Unnamed: 0,lei,legal_name,legal_name_sd,isEng
0,2138006OTCECA7V12D34,NATIONALE BORG REINSURANCE N.V.,nationale borg reinsurance nv,True
1,549300LG53GXF359TQ39,BARING INVESTMENT FUNDS PLC - BARING ASIAN DEB...,baring investment funds - baring asian debt fund,True
2,213800U17THE1662Z496,CUCINA ACQUISITION (UK) LIMITED,cucina acquisition (uk),True
3,213800CAIVXI95XYUC30,AVELLEMY FUNDS OEIC - AVELLEMY 3,avellemy funds oeic - avellemy 3,True
4,213800VHRLPHSNZJ4314,PATCHWORK ENERGY LIMITED,patchwork energy,True
...,...,...,...,...
1021502,549300MHNI5OF7IRT407,SCHRODER INSTITUTIONAL PACIFIC FUND,schroder institutional pacific fund,True
1021503,549300O9LUQH4YGOF193,KELSEN GROUP A/S,kelsen a/s,True
1021504,549300XOOVZAG1ZLG344,DURABLE TECHNOLOGIES GROUP,durable technologies,True
1021505,549300PZDT9H1T3IQ454,CB-ACCENT LUX - ERASMUS BOND FUND,cb-accent lux - erasmus bond fund,True


In [98]:
legal_names_en.to_csv('data/legal_names_en.csv')

There is reduction in numbers, but not too much! Rethink!
Note: the code below is for use when there is enough compute/solution found for quicker encoding

In [99]:
# No standardisation
from gensim.models.fasttext import FastText
import pickle5 as pickle

fast_text = FastText(
    sg=1, # skip gram
    size=100, # embedding dimension
    window=10, # 10 windows before and after to get wider context
    min_count=5, # only consider tokens with at least n occurrences in corpus
    negative=15, # negative subsampling: the bigger the bigger noise
    min_n=2, # min character n-gram
    max_n=5 # max character n-gram
)

fast_text.build_vocab(tokens)

fast_text.train(
    tokens,
    epochs=6,
    total_examples=fast_text.corpus_count,
    total_words=fast_text.corpus_total_words
)

fast_text.save('model/_fasttext.model')
fast_text = FastText.load('model/_fasttext.model')

weighted_doc_vects = []

for i,doc in tqdm(enumerate(tokens)):
  doc_vector = []
  for word in doc:
    vector = fast_text[word]
    weight = (bm25.idf[word] * ((bm25.k1 + 1.0)*bm25.doc_freqs[i][word])) 
    / 
    (bm25.k1 * (1.0 - bm25.b + bm25.b *(bm25.doc_len[i]/bm25.avgdl))+bm25.doc_freqs[i][word])
    weighted_vector = vector * weight
    doc_vector.append(weighted_vector)
  doc_vector_mean = np.mean(doc_vector,axis=0)
  weighted_doc_vects.append(doc_vector_mean)
  
  pickle.dump( weighted_doc_vects, open( "weighted_doc_vects.p", "wb" ) ) #save the results to disc

  vector = fast_text[word]
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
12005it [07:01, 28.45it/s]


KeyboardInterrupt: 

In [None]:
import nmslib

data = np.vstack(weighted_doc_vects)

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2}, print_progress=True)

In [None]:
input = 'Fifth Third Bancorp'.lower().split()

query = [fast_text[vec] for vec in input]
query = np.mean(query,axis=0)

t0 = time.time()
ids, distances = index.knnQuery(query, k=10)
t1 = time.time()
print(f'Searched {df.shape[0]} records in {round(t1-t0,4) } seconds \n')
for i,j in zip(ids,distances):
  print(round(j,2))
  print(df.text.values[i])

### VI. Ensemble

Since I have only finished BM25 and BM25+FuzzyWuzzy, I will perform ensemble with these two.

In [115]:
from collections import Counter
import operator

inte_names['ensemble'] = None

for idx, row in inte_names.iterrows():
    bm25 = dict(row['bm25'])
    bm25_fw = dict(row['bm25_fw'])
    bm25_fw.update((k, v/100) for k, v in bm25_fw.items())
    row['ensemble'] = dict(Counter(bm25) + Counter(bm25_fw))

inte_names['match'] = inte_names['ensemble'].apply(lambda x: max(x.items(), key=operator.itemgetter(1))[0])

In [117]:
inte_names.to_csv('data/inte_names.csv', index=False)

In [2]:
import pandas as pd
inte_names = pd.read_csv('data/inte_names.csv')
inte_names

Unnamed: 0,name,synonyms,name_sd,synonyms_sd,bm25,fw,bm25_fw,ensemble,match
0,ABB LTD-REG,,abb -reg,,"[('ABB K.K.', 0.9), ('ABB B.V.', 0.8), ('ABB s...","[('EG Group Limited', 90), ('ABB Company Limit...","[('ABB K.K.', 86), ('ABB B.V.', 86), ('ABB d.o...","{'ABB K.K.': 1.76, 'ABB B.V.': 1.6600000000000...",ABB K.K.
1,ABIOMED Inc.,ABIOMED Inc.,abiomed,abiomed,"[('Abiomed, Inc.', 0.9), (""THE YOUNG MEN'S CHR...","[('Abiomed, Inc.', 90), ('NC Group, Ltd.', 90)...","[('Abiomed, Inc.', 96), (""THE YOUNG MEN'S CHRI...","{'Abiomed, Inc.': 1.8599999999999999, ""THE YOU...","Abiomed, Inc."
2,3i Group,,3i,,"[('T.B. GROUP S.R.L.', 0.9), ('SOL.EDIL GROUP ...","[('3I GROUP PLC', 90), ('3I PLC', 90), ('P COM...","[('T.B. GROUP S.R.L.', 86), ('SOL.EDIL GROUP S...","{'T.B. GROUP S.R.L.': 1.76, 'SOL.EDIL GROUP S....",T.B. GROUP S.R.L.
3,AB InBev SA-NV,Anheuser-Busch InBev SA/NV,ab inbev sa-nv,anheuser-busch inbev sa/nv,"[('AB INBEV UK LIMITED', 0.9), ('INBEV SPAIN, ...","[('INC S.A.', 90), ('SA GROUP LTD', 90), ('CO ...","[('AB INBEV CORPORATE SERVICES LIMITED', 86), ...","{'AB INBEV UK LIMITED': 1.5899999999999999, 'I...",AB INBEV UK LIMITED
4,Accor S.A,Accor,accor sa,accor,[('Accor Polska Spółka z ograniczoną odpowiedz...,"[('ACC LIMITED', 90), ('AC Corporation', 90), ...",[('Accor Polska Spółka z ograniczoną odpowiedz...,{'Accor Polska Spółka z ograniczoną odpowiedzi...,Accor Polska Spółka z ograniczoną odpowiedzial...
...,...,...,...,...,...,...,...,...,...
1781,Hilton Worldwide Holdings Inc,,hilton worldwide holdings,,"[('HILTON WORLDWIDE HOLDINGS INC.', 0.9), ('Hi...",,"[('HILTON WORLDWIDE HOLDINGS INC.', 100), ('Hi...","{'HILTON WORLDWIDE HOLDINGS INC.': 1.9, 'Hilto...",HILTON WORLDWIDE HOLDINGS INC.
1782,Wood Group Plc,John Wood Group PLC,wood,john wood,"[('STYLES & WOOD GROUP PLC', 0.9), ('WOOD-FORE...",,"[('STYLES & WOOD GROUP PLC', 90), ('WOOD-FORES...","{'STYLES & WOOD GROUP PLC': 1.8, 'WOOD-FOREST ...",STYLES & WOOD GROUP PLC
1783,Bluescope Steel,Bluescope Steel Limited,bluescope steel,bluescope steel,"[('Tata BlueScope Steel Limited', 0.9), ('Blue...",,"[('Tata BlueScope Steel Limited', 90), ('BlueS...","{'Tata BlueScope Steel Limited': 1.8, 'BlueSco...",Tata BlueScope Steel Limited
1784,Mitsubishi Heavy Industries,Mitsubishi Heavy Industries Ltd,mitsubishi heavy industries,mitsubishi heavy industries,"[('MITSUBISHI HEAVY INDUSTRIES EUROPE, LTD.', ...",,"[('MITSUBISHI HEAVY INDUSTRIES EUROPE, LTD.', ...","{'MITSUBISHI HEAVY INDUSTRIES EUROPE, LTD.': 1...","MITSUBISHI HEAVY INDUSTRIES EUROPE, LTD."


In [6]:
test_list = [1, 3, 15, 37, 67, 95, 102, 126, 134, 304, 334, 601, 663, 668, 902]
inte_names.iloc[test_list]

Unnamed: 0,name,synonyms,name_sd,synonyms_sd,bm25,fw,bm25_fw,ensemble,match
1,ABIOMED Inc.,ABIOMED Inc.,abiomed,abiomed,"[('Abiomed, Inc.', 0.9), (""THE YOUNG MEN'S CHR...","[('Abiomed, Inc.', 90), ('NC Group, Ltd.', 90)...","[('Abiomed, Inc.', 96), (""THE YOUNG MEN'S CHRI...","{'Abiomed, Inc.': 1.8599999999999999, ""THE YOU...","Abiomed, Inc."
3,AB InBev SA-NV,Anheuser-Busch InBev SA/NV,ab inbev sa-nv,anheuser-busch inbev sa/nv,"[('AB INBEV UK LIMITED', 0.9), ('INBEV SPAIN, ...","[('INC S.A.', 90), ('SA GROUP LTD', 90), ('CO ...","[('AB INBEV CORPORATE SERVICES LIMITED', 86), ...","{'AB INBEV UK LIMITED': 1.5899999999999999, 'I...",AB INBEV UK LIMITED
15,3M Company,3 M,3m,3 m,"[('G. & P. COMPANY S.R.L.', 0.9), ('A.G.B. COM...","[('ANY LTD.', 90), ('MPA LIMITED', 90), ('MP G...","[('G. & P. COMPANY S.R.L.', 86), ('A.G.B. COMP...","{'G. & P. COMPANY S.R.L.': 1.76, 'A.G.B. COMPA...",G. & P. COMPANY S.R.L.
37,AGC Inc,Asahi Glass Co,agc,asahi glass,"[('AGC S.R.L.', 0.9), ('AGC Capital, Inc.', 0....",,"[('AGC Capital, Inc.', 86), ('AGC PARTNERS, L....","{'AGC S.R.L.': 1.47, 'AGC Capital, Inc.': 1.66...","AGC Capital, Inc."
67,Howmet Aerospace Inc,Arconic,howmet aerospace,arconic,"[('HOWMET AEROSPACE INC.', 0.9), ('Howmet Aero...",,"[('HOWMET AEROSPACE INC.', 100), ('Howmet S.A....","{'HOWMET AEROSPACE INC.': 1.9, 'Howmet Aerospa...",HOWMET AEROSPACE INC.
95,4imprint Group plc,FOUR.LSE,4imprint,fourlse,"[('H&H GROUP PLC', 0.9), ('4IMPRINT GROUP PLC'...",,"[('4IMPRINT GROUP PLC', 100), ('H&H GROUP PLC'...","{'H&H GROUP PLC': 1.76, '4IMPRINT GROUP PLC': ...",4IMPRINT GROUP PLC
102,BBVA,Banco Bilbao Vizcaya Argentari,bbva,banco bilbao vizcaya argentari,"[('BBVA SA', 0.9), ('BBVA USA', 0.8), ('BBVA L...",,"[('BBVA SA', 90), ('BBVA USA', 90), ('BBVA Lux...","{'BBVA SA': 1.8, 'BBVA USA': 1.700000000000000...",BBVA SA
126,BRENNTAG AG,,brenntag ag,,"[('Brenntag Schweizerhall AG', 0.9), ('Brennta...",,"[('Brenntag Schweizerhall AG', 86), ('Brenntag...","{'Brenntag Schweizerhall AG': 1.76, 'Brenntag ...",Brenntag Schweizerhall AG
134,Burberry Group plc,,burberry,,"[('BURBERRY INDIA PRIVATE LIMITED', 0.9), ('3I...",,"[('BURBERRY INDIA PRIVATE LIMITED', 86), ('3I ...","{'BURBERRY INDIA PRIVATE LIMITED': 1.76, '3I G...",BURBERRY INDIA PRIVATE LIMITED
304,Gap Inc.,,gap,,"[('GAP 02 S.R.L.', 0.9), ('GAP S.P.A.', 0.8), ...",,"[('GAP 02 S.R.L.', 86), ('Gap (RHC) B.V.', 86)...","{'GAP 02 S.R.L.': 1.76, 'GAP S.P.A.': 1.37, 'G...",GAP 02 S.R.L.


From studying the tests above, the weights between BM25 and BM+FuzzyWuzzy need to be adjusted. So are the scores of BM25.