### Climate Match: matching entities to basic Companies House records

In [28]:
import tabula
import pandas as pd
import numpy as np

In [2]:
url = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/310599/legal_entities.pdf"

In [3]:
# Read the PDF and keep only the GB companies
# read_pdf returns a list of DataFrames; the table of interest is the first element

df = tabula.read_pdf(url, pages='all', multiple_tables=False, encoding='cp1252')
cldf = df[0][df[0]['COUNTRY'] == 'GB']

In [4]:
# Function to remove stop words from company names
# Matching is case-sensitive, so names must be upper-cased before calling it

def strip_stopwords(raw_name):
    company_stopwords = {'LIMITED', 'LTD', 'SERVICES', 'COMPANY', 'GROUP', 'PROPERTIES', 'CONSULTING',
        'HOLDINGS', 'UK', 'TRADING', 'LTD.'}
    return ' '.join(part for part in raw_name.split() if part not in company_stopwords)
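
A quick check of what the function does to a few names may help here. The sample names below are made up for illustration and are not taken from either dataset. The sketch repeats the stop-word set so it runs on its own.

```python
# Same stop-word set as in the cell above, repeated so this sketch is self-contained
COMPANY_STOPWORDS = {
    'LIMITED', 'LTD', 'SERVICES', 'COMPANY', 'GROUP', 'PROPERTIES',
    'CONSULTING', 'HOLDINGS', 'UK', 'TRADING', 'LTD.',
}

def strip_stopwords(raw_name):
    # Split on whitespace, drop stop words, and re-join with single spaces
    return ' '.join(part for part in raw_name.split() if part not in COMPANY_STOPWORDS)

# Hypothetical company names, chosen to show the edge cases
examples = ["ACME TRADING LTD", "Acme Trading Ltd", "HOLDINGS LIMITED", "NORTH  SEA   GROUP"]
cleaned = {name: strip_stopwords(name) for name in examples}

for raw, clean in cleaned.items():
    print(repr(raw), '->', repr(clean))
```

Three things stand out: mixed-case input passes through unchanged because the comparison is case-sensitive, runs of whitespace collapse to single spaces, and a name made up only of stop words becomes an empty string.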

In [6]:
# Rename columns, upper-case the text fields and strip stop words
# Add the unique id column Splink needs

# cldf was produced by filtering another DataFrame, so take an explicit copy
# before assigning columns; this avoids pandas' SettingWithCopyWarning
cldf = cldf.copy()
cldf['Postcode'] = cldf['POSTCODE']
cldf['Location'] = cldf['CITY'].str.upper()
cldf['CompanyName'] = cldf['LEGAL ENTITY'].str.upper().apply(strip_stopwords)
cldf['unique_id'] = cldf.index

In [7]:
# Remove unwanted columns
# Number of climate companies to match

cldf = cldf[['Postcode','CompanyName','Location','unique_id']]
len(cldf)

334

In [8]:
# Read basic Companies House data as prepared by the Download Data notebook

cdf = pd.read_csv('basic_slim.csv')
cdf = cdf.rename(columns={'RegAddress.PostCode': 'Postcode', 'RegAddress.PostTown': 'Location'})
cdf['CompanyName'] = cdf['CompanyName'].apply(strip_stopwords)
# Names made up entirely of stop words are now empty strings; treat them as missing.
# Assign the result back rather than using inplace=True on a column selection.
cdf['CompanyName'] = cdf['CompanyName'].replace('', np.nan)
cdf['unique_id'] = cdf.index

In [9]:
# Remove unwanted columns

cdf = cdf[['Postcode','CompanyName','Location','unique_id']]

In [10]:
# Number of exact matches on postcode and cleaned company name
# (the merge returns one row per matching pair)

exact = cldf.merge(cdf, on=['Postcode', 'CompanyName'], suffixes=('_left', '_right'))
len(exact)

56
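
The exact merge above is an inner join on the pair (Postcode, CompanyName). The following plain-Python sketch shows the same idea on a handful of invented rows. It is a stand-in for the pandas merge, not the code that produced the count above. It illustrates that the result counts matching pairs, which can exceed the number of distinct climate entities when the register holds duplicates.

```python
from collections import defaultdict

# Hypothetical rows: (unique_id, Postcode, CompanyName)
climate = [(0, "AB1 2CD", "ACME"), (1, "EF3 4GH", "BETA ENERGY"), (2, "ZZ9 9ZZ", "GAMMA")]
register = [
    (10, "AB1 2CD", "ACME"),
    (11, "AB1 2CD", "ACME"),          # same key registered twice
    (12, "EF3 4GH", "BETA ENERGY"),
    (13, "ZZ9 9ZZ", "GAMMA POWER"),   # same postcode, different name: no exact match
]

# Index the register by the join key
by_key = defaultdict(list)
for uid, postcode, name in register:
    by_key[(postcode, name)].append(uid)

# Inner join: one output pair per (climate row, register row) sharing the key
exact_pairs = [
    (uid, reg_uid)
    for uid, postcode, name in climate
    for reg_uid in by_key.get((postcode, name), [])
]
matched_climate_ids = {left for left, _ in exact_pairs}

print(len(exact_pairs), "pairs covering", len(matched_climate_ids), "climate entities")
```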

In [14]:
import recordlinkage

In [15]:
indexer = recordlinkage.Index()
indexer.block("Postcode")
candidate_links = indexer.index(cdf, cldf)
len(candidate_links)

101605
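
Blocking on Postcode means only pairs that share a postcode are compared. The number of candidate links is therefore the sum, over shared postcodes, of the product of the two group sizes. A minimal sketch of that count follows, using invented postcodes and plain Python in place of recordlinkage.

```python
from collections import Counter

# Hypothetical postcode columns for the two tables
register_postcodes = ["AB1 2CD", "AB1 2CD", "AB1 2CD", "EF3 4GH", "XY7 8ZZ"]
climate_postcodes = ["AB1 2CD", "EF3 4GH", "EF3 4GH", "QQ1 1QQ"]

left = Counter(register_postcodes)
right = Counter(climate_postcodes)

# Only postcodes present in both tables generate candidate pairs
candidate_pairs = sum(left[pc] * right[pc] for pc in left.keys() & right.keys())

# Without blocking, every record would be compared with every other record
full_product = len(register_postcodes) * len(climate_postcodes)

print(candidate_pairs, "candidate pairs instead of", full_product)
```

A postcode shared by many registered companies contributes many pairs for each climate entity at that address, so the candidate count can be far larger than the number of climate entities.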

In [16]:
compare_cl = recordlinkage.Compare()
compare_cl.string("CompanyName", "CompanyName", method='jarowinkler', threshold=0.85)
compare_cl.exact("Postcode", "Postcode")
features = compare_cl.compute(candidate_links, cdf, cldf)

# Column 0 is the name comparison (1 for an exact or approximate match at the 0.85 threshold),
# column 1 is the postcode comparison. Keep pairs where both are 1.
matches = features[(features[0] == 1) & (features[1] == 1)]
len(matches)

659
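
For readers unfamiliar with the Jaro-Winkler measure used above, here is a textbook-style sketch in plain Python. It is not the implementation recordlinkage calls, and libraries differ in details (for example, whether the prefix boost applies only above some base similarity), so scores may not agree to the last digit. The company names are invented.

```python
def jaro(s1, s2):
    """Jaro similarity: shared characters within a sliding window, penalised for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused matching character in s2 within the window around position i
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1 - sim)

THRESHOLD = 0.85  # same cut-off as the comparison above

# First pair is the classic test case; the others are hypothetical company names
pairs = [("MARTHA", "MARHTA"), ("ACME ENERGY", "ACME ENERGI"), ("ACME ENERGY", "ZENITH FOODS")]
for a, b in pairs:
    score = jaro_winkler(a, b)
    print(f"{a!r} vs {b!r}: {score:.4f} match={score >= THRESHOLD}")
```

Because the measure rewards a shared prefix, names that differ only towards the end score highly, which suits company names whose suffixes vary.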

In [17]:
# Set index names to allow join 
matches.index.names = ['cdf','cldf']
cdf.index.names= ['cdf']
cldf.index.names= ['cldf']

# Lookup both names
matches = matches.join(cdf, how='inner')
matches = matches.join(cldf, how='inner', rsuffix='_cldf')

# Keep only the approximate matches, dropping pairs whose names match exactly
approx = matches[matches['CompanyName'] != matches['CompanyName_cldf']]
len(approx)

603

In [19]:
# List of those companies matched at least once
found = matches.index.unique(level='cldf')
len(found)

89

In [20]:
# Select the climate companies that were not matched at all
notfound = cldf.loc[~cldf.index.isin(found)]
len(notfound)

245

In [21]:
import splink

In [22]:
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb import duckdb_comparison_library as cl

settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        "l.Postcode = r.Postcode",
    ],
    "comparisons": [
        cl.jaro_winkler_at_thresholds("CompanyName", distance_threshold_or_thresholds=[0.9]),
    ],
    "retain_intermediate_calculation_columns": True,
    "retain_matching_columns": True,
}

In [23]:
linker = DuckDBLinker([cldf, cdf], settings, input_table_aliases=["cldf", "cdf"])
linker.estimate_u_using_random_sampling(target_rows=1e6)

----- Estimating u probabilities using random sampling -----
u probability not trained for CompanyName - Exact match (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for CompanyName - jaro_winkler_similarity >= 0.9 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - CompanyName (some u values are not trained, no m values are trained).


In [24]:
linker.estimate_parameters_using_expectation_maximisation("l.Postcode = r.Postcode")


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.Postcode = r.Postcode

Parameter estimates will be made for the following comparison(s):
    - CompanyName

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 

Iteration 1: Largest change in params was -0.0266 in the m_probability of CompanyName, level `Exact match`
Iteration 2: Largest change in params was -0.0376 in the m_probability of CompanyName, level `Exact match`
Iteration 3: Largest change in params was -0.0534 in the m_probability of CompanyName, level `Exact match`
Iteration 4: Largest change in params was -0.0735 in the m_probability of CompanyName, level `Exact match`
Iteration 5: Largest change in params was -0.0955 in the m_probability of CompanyName, level `Exact match`
Iteration 6: Largest change in params was 0.115 in the m_probability of CompanyName, level `All other comparisons`
Iteration 7: Largest chan

<EMTrainingSession, blocking on l.Postcode = r.Postcode, deactivating comparisons >

In [29]:
results = linker.predict()


You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'CompanyName':
    u values not fully trained


In [30]:
resultsdf = results.as_pandas_dataframe()
bespoke = resultsdf[resultsdf['CompanyName_l']=='T&L SUGARS']
linker.waterfall_chart(bespoke.to_dict(orient='records'), filter_nulls=False)
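
A waterfall chart of this kind shows how a prior match weight and the weight of each comparison add up to the final score. The arithmetic behind it, under the Fellegi-Sunter model, can be sketched as below. The prior, m and u values are invented for illustration and are not estimates from this run. The helper name `match_probability` is hypothetical and is not a Splink function.

```python
import math

def match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-comparison Bayes factors (m / u).

    Returns the total match weight (log2 odds) and the resulting match probability.
    """
    weight = math.log2(prior / (1 - prior))  # prior match weight
    for k in bayes_factors:
        weight += math.log2(k)               # each comparison adds log2(m / u)
    odds = 2 ** weight
    return weight, odds / (1 + odds)

# Hypothetical parameters (assumptions, not values from the trained model)
prior = 0.0001                   # probability that two random records match
m_exact, u_exact = 0.8, 0.001    # P(exact name | match), P(exact name | non-match)

bayes_factor = m_exact / u_exact
weight, prob = match_probability(prior, [bayes_factor])
print(f"Bayes factor {bayes_factor:.0f}, match weight {weight:.2f}, probability {prob:.4f}")
```

The sketch makes clear why both m and u are needed for every comparison level: the Bayes factor is their ratio, so a level with no estimate for either one cannot contribute a meaningful weight.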

In [31]:
df_splink = linker.predict(threshold_match_probability = 0.001).as_pandas_dataframe()
len(df_splink)


You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'CompanyName':
    u values not fully trained


0