# Data science on the HMRC skilled worker data set
The CSV file loaded below lists the companies licensed to sponsor skilled workers in the UK.

In [171]:
# importing the libraries
import numpy as np
import pandas as pd

In [172]:
skilled_worker_data = pd.read_csv('./projects/python-ml/2025-06-19-Worker.csv')
print(skilled_worker_data.count())

Organisation Name    133028
Town/City            133025
County                45163
Type & Rating        133028
Route                133028
dtype: int64


### shape of the data

In [118]:
skilled_worker_data.shape

(133028, 5)

### info

In [119]:
skilled_worker_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133028 entries, 0 to 133027
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Organisation Name  133028 non-null  object
 1   Town/City          133025 non-null  object
 2   County             45163 non-null   object
 3   Type & Rating      133028 non-null  object
 4   Route              133028 non-null  object
dtypes: object(5)
memory usage: 5.1+ MB


### head / tail

In [137]:
skilled_worker_data.head()

Unnamed: 0,Organisation Name,Town/City,County,Type & Rating,Route
0,McMullan Shellfish,Ballymena,Co Antrim,Worker (A rating),Skilled Worker
1,(A1F1 Limited T/A ) Ultrasound Direct London,Croydon,London,Worker (A rating),Skilled Worker
2,(IECC Care) Independent Excel Care Consortium ...,Colchester,,Worker (A rating),Skilled Worker
3,*ABOUTCARE HASTINGS LTD,East Sussex,,Worker (A rating),Skilled Worker
4,.LITTLE NOORIYAH LTD,Smethwick,,Worker (A rating),Skilled Worker


In [138]:
skilled_worker_data.tail()

Unnamed: 0,Organisation Name,Town/City,County,Type & Rating,Route
133023,ZZA CONSULTING LIMITED,LONDON,,Worker (A rating),Skilled Worker
133024,ZZIY Ltd,High Wycombe,,Worker (A rating),Skilled Worker
133025,ZZN STUDIO LTD,HAMPTON,,Worker (A rating),Skilled Worker
133026,Zzoomm Plc,Oxford,,Worker (A rating),Skilled Worker
133027,ZZZ Limited,London,,Worker (A rating),Skilled Worker


### unique

In [141]:
skilled_worker_data['County'].unique()

array(['Co Antrim', 'London', nan, ..., 'Ascot', 'Gillingham',
       'Wes Yorkshire'], dtype=object)

### value_counts

In [142]:
skilled_worker_data['Town/City'].value_counts()

Town/City
London             37019
LONDON              3314
Birmingham          2814
Manchester          2616
Bristol             1262
                   ...  
South Ockenden         1
KILWINNING             1
Merthyr Tidfil         1
BULWELL                1
Northwood hills        1
Name: count, Length: 7931, dtype: int64

### value_counts(normalize=True)

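The output below tells me that rows whose Town/City is `London` make up about `27.8%` of all rows that have a Town/City value in the UK skilled worker data.

`LONDON` contributes a further `2.5%`, `Birmingham` `2.1%`, and so forth. `London` and `LONDON` are the same city entered with different casing, so the real share for London is higher than either figure on its own.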

In [144]:
skilled_worker_data['Town/City'].value_counts(normalize=True)

Town/City
London             0.278286
LONDON             0.024913
Birmingham         0.021154
Manchester         0.019665
Bristol            0.009487
                     ...   
South Ockenden     0.000008
KILWINNING         0.000008
Merthyr Tidfil     0.000008
BULWELL            0.000008
Northwood hills    0.000008
Name: proportion, Length: 7931, dtype: float64
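
Because `London` and `LONDON` are counted separately above, the shares are split across spellings. The following is a minimal sketch of one way to merge them, assuming that trimming whitespace and title-casing is acceptable for this column. The `towns` Series is made-up toy data, not the real file, and title-casing is only a rough normalisation (it would turn `Stoke-on-Trent` into `Stoke-On-Trent`, for example).

```python
import pandas as pd

# Toy data standing in for skilled_worker_data['Town/City'] (hypothetical values)
towns = pd.Series(["London", "LONDON", " london ", "Birmingham", None])

# Trim surrounding whitespace and unify casing before counting
normalised = towns.str.strip().str.title()

counts = normalised.value_counts()                # missing values are excluded by default
shares = normalised.value_counts(normalize=True)  # proportions of the non-missing rows

print(counts)
print(shares)
```

On the real data the same two `.str` calls could be applied to `skilled_worker_data['Town/City']` before calling `value_counts()`.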

### mode()

Calling `mode()` on a column returns its most frequently occurring value.

`data['column'].mode()` returns a pandas Series rather than a single value, because several values can tie for first place. If 'London' and another city 'Xyz' both appeared the most times it would return both; otherwise it returns just the one most frequent value. You then index into the result with `[0]` to get the first mode, as seen below.

In [153]:
skilled_worker_data['Town/City'].mode()[0]

'London'
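
To make the tie behaviour concrete, here is a small sketch on made-up data (the `cities` Series is hypothetical, not taken from the file). When two values tie, `mode()` returns both in sorted order, so `[0]` simply picks the first of them.

```python
import pandas as pd

# 'London' and 'Xyz' both appear twice, so they tie for the mode
cities = pd.Series(["London", "Xyz", "London", "Xyz", "Leeds"])

modes = cities.mode()   # a Series holding every tied value, sorted
first_mode = modes[0]   # the first entry of that Series

print(modes.tolist())
print(first_mode)
```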

In [154]:
skilled_worker_data['Type & Rating'].mode()[0]

'Worker (A rating)'

In [155]:
skilled_worker_data['Route'].mode()[0]

'Skilled Worker'

Interestingly, `Surrey` is the most frequent `County` in the HMRC data set.

Taken at face value, this means that more sponsoring companies are in Surrey than in any other single county. However, only 45,163 of the 133,028 rows have a `County` value at all, so the result may simply reflect which companies had that field filled in.

In [156]:
skilled_worker_data['County'].mode()[0]

'Surrey'
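
Since `mode()` ignores missing values by default, it helps to check how much of the column is empty before reading too much into the result. The sketch below uses a made-up `county` Series (hypothetical values) to show the mode next to the share of missing entries.

```python
import pandas as pd

# Toy column: half of the entries are missing
county = pd.Series(["Surrey", None, None, None, "Surrey", "Kent"], name="County")

top_county = county.mode()[0]          # missing values are dropped before counting
missing_share = county.isna().mean()   # fraction of rows with no County
top_share_of_known = county.value_counts(normalize=True).iloc[0]  # share among filled-in rows

print(top_county, missing_share, top_share_of_known)
```

The same three expressions could be run against `skilled_worker_data['County']` to put the `Surrey` result in context.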

# Fuzzy search on organisation names

### What is Levenshtein Distance?

The Levenshtein distance is a measure of the difference between two strings. It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

1.  **k**itten → **s**itten (substitution of "s" for "k")
2.  sitt**e**n → sitt**i**n (substitution of "i" for "e")
3.  sittin → sittin**g** (insertion of "g" at the end)
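
The definition above can be sketched in pure Python with the standard dynamic-programming approach. This is an illustrative implementation only, not the code used inside `fuzzywuzzy` or `python-Levenshtein`; the function name `levenshtein` is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially the empty string) and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the table at a time means memory grows with the length of `b` rather than with the product of both lengths.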

The `fuzzywuzzy` library uses the Levenshtein distance to calculate similarity ratios between strings. The `python-Levenshtein` library is a C implementation of the algorithm, which makes the calculations much faster than they would be in pure Python. This is why it is installed alongside `fuzzywuzzy` in the cell below.

In [221]:
# Install the fuzzy-matching libraries (notebook shell command)
!pip install -q fuzzywuzzy python-Levenshtein
from fuzzywuzzy import fuzz

company_name = input("Enter a company name: ")

def get_fuzzy_score(company, query):
    """Return a 0-100 token-set similarity between query and company, ignoring case."""
    # str() guards against non-string values such as NaN
    return fuzz.token_set_ratio(query.lower(), str(company).lower())

# Score every organisation name against the search term
skilled_worker_data['fuzzy_score'] = skilled_worker_data['Organisation Name'].apply(get_fuzzy_score, query=company_name)

# Set a threshold for what is considered a "close match"
threshold = 80
matching_rows = skilled_worker_data[skilled_worker_data['fuzzy_score'] >= threshold]

if not matching_rows.empty:
    print(f"Yes, a close match to '{company_name}' exists. Showing matches with a score of {threshold} or higher:")
    print(matching_rows[['Organisation Name', 'Town/City', 'County', 'fuzzy_score']])
else:
    print(f"No close match to '{company_name}' found.")


Yes, a close match to 'nsave' exists. Showing matches with a score of 80 or higher:
           Organisation Name Town/City County  fuzzy_score
73044   Masref Ltd t/a Nsave    London    NaN          100
124743                 USAVE  Bathgate    NaN           80


### groupby()

In [None]:
# SeriesGroupBy does not provide mode(), so this call raises the error shown below
skilled_worker_data.groupby(['Town/City'])['County'].mode()[0]

AttributeError: 'SeriesGroupBy' object has no attribute 'mode'
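
One possible workaround for the error above is to pass a small function to `agg()` that calls `Series.mode()` on each group. This is a sketch on made-up data; the `sample` DataFrame and the helper `first_mode` are hypothetical, and returning `None` for groups with no county at all is my own choice.

```python
import pandas as pd

# Toy data with the same two columns as the real file (hypothetical values)
sample = pd.DataFrame({
    "Town/City": ["London", "London", "London", "Guildford", "Guildford", "Oxford"],
    "County": ["Greater London", "Greater London", "London", "Surrey", None, None],
})


def first_mode(values: pd.Series):
    """Return the most frequent non-missing value, or None if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else None


# Most frequent County for each Town/City
county_by_town = sample.groupby("Town/City")["County"].agg(first_mode)
print(county_by_town)
```

On the real data, `sample` would be replaced by `skilled_worker_data`.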