In [18]:
import pandas as pd
df = pd.read_csv("search.csv")

In [19]:
df=df[~df['query'].str.contains('{sub:', na=False, regex=False)]
df=df[~df['query'].str.contains('{cat:', na=False, regex=False)]

In [20]:
df['query'] = df['query'].astype(str)
all_names=df['query'].unique()

## Matching by Grouping
We then group the names by their first character. The list is too long to match every name against every other name at once (15889 x 15889 pairs to consider), so as a workaround we match within groups, on the assumption that two names that differ in their first character are unlikely to be the same name.
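
To illustrate why grouping helps, the following sketch counts candidate pairs with and without first-character groups. The `sample` list and the `blocked_pair_count` helper are hypothetical and not part of the notebook; the count uses unordered pairs, n(n-1)/2, rather than the full n x n grid mentioned above.

```python
from itertools import groupby


def blocked_pair_count(names):
    """Count pairs compared when only names sharing a first character are matched."""
    total = 0
    # groupby needs sorted input so that equal first characters are adjacent
    for _, group in groupby(sorted(names), key=lambda name: name[:1]):
        n = len(list(group))
        total += n * (n - 1) // 2  # unordered pairs inside one group
    return total


# Hypothetical sample names, not taken from search.csv
sample = ["Agora", "Agorecykling", "Alfa", "Beta", "Bielsko", "Gardenia"]
all_pairs = len(sample) * (len(sample) - 1) // 2

print("all pairs:", all_pairs)
print("pairs within first-character groups:", blocked_pair_count(sample))
```

The saving depends on how evenly the names spread across first characters: if most names start with the same character, grouping removes very few comparisons.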

In [21]:
all_main_name = pd.DataFrame(columns=['sort_gp','names','alias','score'])
all_names.sort()
all_main_name['names'] = all_names
all_main_name['sort_gp'] = all_main_name['names'].apply(lambda x: x[0])

In [22]:
# .copy() makes the truncated frame independent, so later assignments do not act on a view
all_main_name = all_main_name.head(10000).copy()
all_main_name['names'] = all_main_name['names'].astype(str)

## Fuzzy Matching
Within each group, we use `fuzz.token_sort_ratio` from `fuzzywuzzy` to match the names. Unlike the basic `fuzz.ratio`, which uses Levenshtein distance to measure the difference between two strings, it lets the tokens (words) in a name swap order and still gives a 'perfect' match. (ref: https://github.com/seatgeek/fuzzywuzzy)
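
The token-sorting idea can be sketched with the standard library alone. This is an approximation for illustration, not the `fuzzywuzzy` implementation: `simple_ratio` and `token_sort_ratio_sketch` are hypothetical names, the similarity comes from `difflib.SequenceMatcher`, and the library's own preprocessing is more thorough than the lower-casing and whitespace splitting done here.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(a, b):
    """Compare two strings after sorting their whitespace-separated tokens."""
    # Lower-case, split into tokens, sort the tokens, and rejoin them,
    # so that word order no longer affects the comparison
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(norm_a, norm_b)


# The same two words in a different order
left, right = "Gardenia Pakość", "Pakość Gardenia"
print("plain ratio:", simple_ratio(left.lower(), right.lower()))
print("token-sorted ratio:", token_sort_ratio_sketch(left, right))
```

Once the tokens are sorted, both strings normalise to the same text, so the token-sorted score is 100 while the plain score is lower.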

In [23]:
from fuzzywuzzy import fuzz

all_sort_gp = all_main_name['sort_gp'].unique()

for sortgp in all_sort_gp:
    # names are sorted, so each first-character group is a contiguous index range
    this_gp = all_main_name[all_main_name['sort_gp'] == sortgp]
    gp_start = this_gp.index.min()
    gp_end = this_gp.index.max()
    for i in range(gp_start, gp_end + 1):

        # if this name has no alias yet, make it its own alias
        # (.loc[row, column] avoids chained assignment, which may silently write to a copy)
        if pd.isna(all_main_name.loc[i, 'alias']):
            all_main_name.loc[i, 'alias'] = all_main_name.loc[i, 'names']
            all_main_name.loc[i, 'score'] = 100

        # any later name in the group that has no alias and fuzzy-matches this one
        # inherits this name's alias
        for j in range(i + 1, gp_end + 1):
            if pd.isna(all_main_name.loc[j, 'alias']):
                fuzz_score = fuzz.token_sort_ratio(all_main_name.loc[i, 'names'], all_main_name.loc[j, 'names'])
                if fuzz_score > 85:
                    all_main_name.loc[j, 'alias'] = all_main_name.loc[i, 'alias']
                    all_main_name.loc[j, 'score'] = fuzz_score
                    
        if i % (len(all_names)//100) == 0:
            print("progress: %.2f" % (100*i/len(all_names)) + "%")
                
all_main_name.to_csv('company_in_cambridge.csv')

progress: 0.00%
progress: 1.00%
progress: 1.99%
progress: 2.99%
progress: 3.99%
progress: 4.99%
progress: 5.98%
progress: 6.98%
progress: 7.98%
progress: 8.98%
progress: 9.97%
progress: 10.97%
progress: 11.97%
progress: 12.97%
progress: 13.96%
progress: 14.96%
progress: 15.96%
progress: 16.96%
progress: 17.95%


KeyboardInterrupt: 

In [17]:
len(all_main_name.alias.unique())

597

In [23]:
all_main_name[(all_main_name['names']!=all_main_name['alias']) & (all_main_name['alias'].notna())]

Unnamed: 0,sort_gp,names,alias,score
6,A,"Agorecykling, kupiecka","Agorecykling, kupie",92
7,A,"Agorecykling, kupiecka","Agorecykling, kupie",92
8,A,"Agorecykling, kupiecka 34, W","Agorecykling, kupie",89
9,A,"Agorecykling, kupiecka 34, Wa","Agorecykling, kupie",88
10,A,"Agorecykling, kupiecka 34, Warszawa","Agorecykling, kupie",88
13,C,Częstochowa siłowników 2,Częstochowa siłowników,95
14,C,Częstochowa siłowników 25,Częstochowa siłowników,93
16,D,Dąbrowszczaków trGdańsk,Dąbrowszczaków tGdańsk,98
22,F,"Falenica, Cmentarz w Alw","Falenica, Cmentarz w Ale",96
28,G,Gardenia Pakość,Gardenia Pakoś,97


The result is saved locally in a CSV file for later inspection and further experimentation. On inspection, the matches fall into 3 groups:

1. names that differ in spelling by 1 character: a missing 'L', 'I' or 'S'
2. highly similar names: 'No.3' instead of 'No.2', or 'EB' instead of 'EH'
3. fairly similar names: 'HAMMER AND THONGS PRODUCTIONS LIMITED' and 'HAMMER AND TONG PRODUCTIONS LIMITED'

Type 1 and type 2 matches could be the same company; the difference in names could be an intentional alteration or simply a typo. Type 3 matches are unlikely to be the same company and look more like a coincidence.

Manual work is still needed to confirm each match, but the program saves many hours of it.
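
One possible way to pre-sort matches before the manual check is to count how many single-character edits separate a name from its alias. The sketch below is an assumption about how that triage could be done, not something the notebook does: `edit_distance` is a hypothetical helper implementing the standard Levenshtein recurrence.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# A pair from the output table above and the type 3 example from the list
pairs = [
    ("Gardenia Pakość", "Gardenia Pakoś"),
    ("HAMMER AND THONGS PRODUCTIONS LIMITED", "HAMMER AND TONG PRODUCTIONS LIMITED"),
]
for name, alias in pairs:
    print(edit_distance(name, alias), name, "|", alias)
```

A distance of 1 corresponds to the type 1 matches; larger distances still need a human decision, since a small edit count alone does not show whether two names refer to the same entity.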

In [24]:
all_main_name[(all_main_name['names']!=all_main_name['alias']) & (all_main_name['alias'].notna())].shape[0]

153

In [32]:
all_main_name['alias'].value_counts()

komitet obrony robotnikow 24                                                                                                  51
7i88i78i                                                                                                                      39
cze 506 0                                                                                                                     32
hydeaulika cL                                                                                                                 30
warszawa Komitet obrony robotników 3                                                                                          24
via di villa casa alle mo                                                                                                     24
y6g6yyyh6hyhQ1QXAyyyh66yhhhbYyth6beri, Maakunta Pori, Lääni Satakunta, F&;&&^ihhhbbh^^&^^^^^^^&&&&&&&;&;&^^&;_^&&^^nlandia    24
modna butik wr                                                                                   

In [2]:
df = pd.read_csv("company_in_cambridge.csv")


In [7]:
dfd=df.alias

In [8]:
dfd=dfd.unique()


In [15]:
dfd  = pd.Series(dfd)


In [21]:
dfdd = dfd.tolist()

In [22]:
dfdd

['\tGenerała Józefa Hauke-Bosaka 9,',
 '\tGenerała Józefa Hauke-Bosaka 9, 25-217 Kielce',
 '\r\nSutter GmbH\r\nAm Gewerbepark 7\r\n14548 Schwielowsee',
 '\r\ne.winkemann gmbh\r\nbremcker linde 5\r\nde 58840 plettenberg\r\n51.199595, 7.837655',
 '  v. cccccbbcv 2bbgvcvnvbb222bvvbc bbbb12v2b2bb1xbsbb2bvvbsbdbxbsbbzcvvbdvvs  ,',
 ' ccm ccvvghf d sdtttttyyu,',
 ' idhchxXbxbxb"cbcbbzbbx',
 '#inoujście Orkana',
 '#inoujście o',
 '(37.0483961,-8.0629835)',
 '(49.9270210,20.2388650)',
 '(49°16\'52.2"N 19°56\'39.4"E)',
 '(49°16\'52.2"N 19°56\'39.4"E) droga do daniela',
 '(50.1023553,19.5870666',
 '(50.4864070,17.9128510)',
 '(51.1915013,18.0249457',
 '(51.2333000,17.4441120',
 '(51.9439710,17.4407790',
 '(52.1880407,20.8911467',
 '(53°53\'36.45"N,21°39\'46.59"E)',
 '(53°53\'36.454"N',
 '(54.0558993,22.8974588',
 '(„‚.',
 ', h zza zzxzV,cdużo dz zz',
 ', h zza zzxzV,cdużo dz zzZ',
 ', że mam się z tym nie możemy się umówić na spotkanie z cyklu',
 ',, ,',
 ',51149 koln_porz_eil friedrich naumann_

In [23]:
dfdd.to_csv('unikalne.csv')

AttributeError: 'list' object has no attribute 'to_csv'
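
The error arises because `dfdd` is a plain Python list, and `to_csv` is a method of pandas objects such as the `dfd` Series, not of lists. As a minimal sketch, a list can also be written with the standard `csv` module. The `write_names_csv` helper, the `alias` header, and the output file name below are assumptions for illustration; the two sample values are copied from the output above.

```python
import csv
import os
import tempfile


def write_names_csv(names, path):
    """Write one name per row under a single 'alias' header."""
    # newline="" lets the csv module control line endings; values with
    # commas or embedded newlines are quoted automatically
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["alias"])
        writer.writerows([name] for name in names)


sample = ["\tGenerała Józefa Hauke-Bosaka 9,", "(49.9270210,20.2388650)"]
out_path = os.path.join(tempfile.gettempdir(), "unikalne_sample.csv")
write_names_csv(sample, out_path)
```

Within the notebook itself, calling `to_csv` on the Series `dfd` rather than on the list would be the shorter route.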

After applying the fuzzy matching, 57 names are flagged as similar to another name, which is less than 1% of the total. With this program, the number of names that need checking drops drastically, from 15889 in total to only 57.