# Landscape Map - Stage 2

This stage attempts to cleanse the data, match it against other data sources, and tag entries with categories (such as Individuals).

In [1]:
import duckdb
import pandas as pd
from thefuzz.process import extractOne

## Make direct matches to the company data

First, let's load the company database.

In [2]:
db = duckdb.connect('../raw/company-data.db', read_only=True)

Then we'll load our raw longlist into a temporary table.

In [3]:
db.sql('''CREATE TEMP TABLE tRaw AS SELECT DISTINCT organisation FROM read_csv('../raw/landscape-longlist-raw.csv');''')

In [4]:
db.sql('''SELECT COUNT(*) AS Count FROM tRaw''')

┌───────┐
│ Count │
│ int64 │
├───────┤
│   502 │
└───────┘
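As a rough illustration of what `SELECT DISTINCT` does to the longlist, the sketch below deduplicates a small in-memory CSV using only the standard library. The sample rows are hypothetical, not taken from the real longlist. Note that the comparison is exact, so names differing only in case count as separate entries.

```python
import csv
import io

# Hypothetical sample standing in for landscape-longlist-raw.csv
sample = io.StringIO(
    "organisation\n"
    "Example Arts CIO\n"
    "Example Arts CIO\n"    # exact duplicate, collapsed by DISTINCT
    "example arts cio\n"    # differs only in case, so kept as a separate entry
    "Sample Theatre Ltd\n"
)

# A set keeps one copy of each exact string, which mirrors SELECT DISTINCT
distinct = {row["organisation"] for row in csv.DictReader(sample)}
count = len(distinct)
print(count)
```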

We'll create a table of direct matches by joining the longlist to the company data on the upper-cased organisation name.

In [5]:
db.sql('''
       CREATE TEMP TABLE tDirect as SELECT r.*,
              CompanyName as registered_name,
              CompanyNumber as company_number,
              "URI" as uri,
              "RegAddress.PostTown" as post_town,
              "RegAddress.PostCode" as postcode,
              CompanyCategory as company_category,
              CompanyStatus as company_status,
              [x for x in [
                     "SICCode.SicText_1",
                     "SICCode.SicText_2",
                     "SICCode.SicText_3",
                     "SICCode.SicText_4"
              ] if x is not NULL] as sic_code,
              IncorporationDate as incorporation_date,
              DissolutionDate as dissolution_date
       FROM tRaw r LEFT JOIN CompanyData c
       ON upper(r.organisation) == c.CompanyName;
''')
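To show how the direct match behaves, here is a minimal sketch of the same `LEFT JOIN` pattern, using the standard library's `sqlite3` as a stand-in for DuckDB. The table contents are hypothetical, and the sketch assumes (as the join condition above implies) that `CompanyName` is stored in upper case. Any name that is not an exact match after upper-casing comes back with a NULL company number.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw (organisation TEXT);
    CREATE TABLE company (CompanyName TEXT, CompanyNumber TEXT);
    -- Hypothetical rows for illustration only
    INSERT INTO raw VALUES
        ('Example Arts CIC'), ('Example Arts CIO'), ('Sample Theatre Ltd');
    INSERT INTO company VALUES
        ('EXAMPLE ARTS CIO', '00000001'), ('SAMPLE THEATRE LTD', '00000002');
""")

# LEFT JOIN keeps every longlist row; unmatched rows get NULL company columns
rows = con.execute("""
    SELECT r.organisation, c.CompanyNumber
    FROM raw r LEFT JOIN company c
    ON upper(r.organisation) = c.CompanyName
    ORDER BY r.organisation
""").fetchall()
con.close()

matched = [name for name, number in rows if number is not None]
unmatched = [name for name, number in rows if number is None]
print(matched)    # rows that found a company record
print(unmatched)  # rows left for the fuzzy-matching step
```

The unmatched list is where a one-letter difference such as "CIC" versus "CIO" ends up, which is why a fuzzy pass is worth trying afterwards.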

In [6]:
direct_matches = db.sql('SELECT * from tDirect WHERE company_number IS NOT NULL').df()

In [7]:
db.close()

In [8]:
direct_matches.sort_values(by='organisation').to_csv('../raw/landscape-map-company-data.csv', index=False)

## Fix typos in longlist

Having made the direct matches, let's see whether fuzzy matching can catch longlist entries that were missed, for example because of typos.

First, let's get a list of organisations that have been matched to Companies House data.

In [9]:
matched_organisations = direct_matches.organisation.unique().tolist()

Then load the raw longlist.

In [10]:
raw = pd.read_csv('../raw/landscape-longlist-raw.csv')

In [11]:
corrections = pd.concat(
    [
        raw,
        raw.organisation.map(
            lambda x: extractOne(x, matched_organisations, score_cutoff=90)
        ).apply(
            pd.Series, index=['match', 'score']
        )
    ], axis=1
).query(
    'score.notna() and score < 100'
).loc[:, ['organisation', 'match']].set_index('organisation')
corrections

| organisation | match |
|---|---|
| Monkfish Productions CIC | Monkfish Productions CIO |
| Moving Parts Arts | Moving Parts Arts CIO |
| tiny dragon Productins | tiny dragon Productions |

In [12]:
corrections.to_csv('../raw/landscape-map-corrections.csv')
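The logic of the correction step above can be sketched without pandas: find the best-scoring candidate for each name, discard anything under the cutoff, and keep only near misses (score below 100), since a score of 100 means the name is already correct. This sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer. The default scorer in `thefuzz` differs, so scores will not agree exactly and this version will not necessarily flag the same rows. The `best_match` helper and the "Unrelated Collective" entry are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(name, choices, score_cutoff=90):
    """Return (choice, score) for the most similar choice, or None if below the cutoff."""
    if not choices:
        return None
    scored = [
        (choice, 100 * SequenceMatcher(None, name, choice).ratio())
        for choice in choices
    ]
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= score_cutoff else None


# Names already matched directly (taken from the output above)
matched_names = [
    "Monkfish Productions CIO",
    "Moving Parts Arts CIO",
    "tiny dragon Productions",
]
# A typo, an exact match, and a hypothetical unrelated name
longlist = ["tiny dragon Productins", "tiny dragon Productions", "Unrelated Collective"]

corrections_sketch = {}
for name in longlist:
    result = best_match(name, matched_names)
    # Keep only near misses: above the cutoff but not an exact match
    if result is not None and result[1] < 100:
        corrections_sketch[name] = result[0]

print(corrections_sketch)
```

Excluding exact matches keeps the corrections file limited to entries that actually need changing in the longlist.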

## Fuzzy match company data

## Identify possible individuals