<img src="https://www.councils.coop/wp-content/uploads/2018/04/nesta-logo.jpg" width="175" height="175">

# From SOC to SIC: assigning the industry to the job adverts

A notebook containing the code for assigning the industry to the TextKernel job adverts.

___

There are **three methods** which are used to assign the **Standard Industrial Classification (SIC) code** to the adverts:
1. Using the TextKernel Classifications
2. Matching to Companies House
3. Using the Standard Occupational Classification (SOC) code

## 1. Import necessary functions

In [1]:
%run ../notebook_preamble.ipy

In [90]:
import os
from dotenv import load_dotenv, find_dotenv
import pickle
from time import time as tt
import requests
import lxml.html as lh
import spacy
from collections import Counter
from itertools import repeat, chain
import string

In [3]:
nlp = spacy.load("en_core_web_lg")
from spacy.lang.en.stop_words import STOP_WORDS

In [6]:
from skill_demand.sic_analysis.sic_analysis_funcs import get_files, match_sic_letter_ch, all_jobs, create_sic_reference_table
from skill_demand.utils.textkernel_load_utils import light_clean_org_names
from skill_demand.utils.utils_nlp import lemmatization, noun_chunks, most_frequent



### 1.1 Add/remove stopwords

First, add words to the stop word list that should be ignored in later matching, and remove words that should be matched.

In [4]:
STOP_WORDS.add("professional")
STOP_WORDS.add("professionals")
STOP_WORDS.add("n.e.c")
STOP_WORDS.add("activity")
STOP_WORDS.add("development")

In [5]:
STOP_WORDS.remove("other")

### 1.2 Load the filepath which points to the data directory from the `.env` file

In [10]:
load_dotenv(find_dotenv())
DATA_PATH = os.getenv("data_path")

### 1.3 Load the unique categories from the dataset

In [11]:
with open(f"{DATA_PATH}/data/aux/full_categories_200330.pickle", 'rb') as infile:
    categories = pickle.load(infile)

### 1.4 Load a dataframe containing all unique jobs

If the `jobs_df.gz` compressed `csv` file exists, read it in. If not, use the `all_jobs` function to retrieve each unique job title with its SOC code and description and its TextKernel category.

In [14]:
if os.path.exists(f"{DATA_PATH}/data/processed/jobs_df.gz"):
    jobs_df = pd.read_csv(f"{DATA_PATH}/data/processed/jobs_df.gz", compression="gzip")
else:
    jobs_df = all_jobs(DATA_PATH)
    ## Cache the result so later runs can read it straight from disk.
    jobs_df.to_csv(f"{DATA_PATH}/data/processed/jobs_df.gz", compression="gzip", encoding='utf-8', index=False)

## 2. Methods for matching job IDs to SIC letters

### 2.1 Add industry labels to job IDs

First, we manually estimated the best-matching SIC letter for each industry label provided by TextKernel. To do this, we inspected `jobs_df` to see which types of jobs TextKernel assigned to each category.

 ***TextKernel Category*** | ***Estimated SIC letter equivalent***
 --- | ---
 **Accommodation / Food Services** | Accommodation and food service activities (I)
 **Administration / Call center** | Administrative and support service activities (N)
 **Agriculture / Fishing** | Agriculture, forestry and fishing (A)
 **Business services** | None
 **Construction** | Construction (F)
 **Culture / Recreation** | Arts, entertainment and recreation (R)
 **Education / Research** | Professional, scientific and technical activities (M); Education  (P)
 **Facility / Cleaning** | Administrative and support service activities (N)
 **Finance / Insurance** | Financial and insurance activities (K)
 **Healthcare / Welfare** | Human health and social work activities (Q)
 **IT** | Information and communication (J)
 **Logistics** | Transportation and storage (H)
 **Manufacturing / Industrial Facilities** | Manufacturing (C)
 **Media / Communication** | Information and communication (J)
 **Other / Unknown** | None
 **Personal services** | Other service activities (S); Real estate activities (L)
 **Pharmacy / Chemicals** | Manufacturing (C)
 **Public services / Non-profit** | Public administration and defence; compulsory social security (O); Other service activities (S)
 **Security / Fire / Police** | Administrative and support service activities (N); Public administration and defence; compulsory social security (O)
 **Staffing / Employment Agencies** | None
 **Trade / Retail** | Wholesale and retail trade; repair of motor vehicles and motorcycles (G)
 **Utilities** | Electricity, gas, steam and air conditioning supply (D); Water supply and waste management (E)

SIC letters which have **no match** to a TextKernel industry label:
- Mining and quarrying (B)
- Activities of households as employers; undifferentiated goods- and services-producing activities of households for own use (T)
- Activities of extraterritorial organisations and bodies (U)

___

The letters are put into arrays in the order the TextKernel categories appear.

In [76]:
industry_labels_sic_best_estimate = [['I'], ['N'], ['A'], [None], ['F'], ['R'], ['M', 'P'], ['N'], ['K'], ['Q'], ['J'], ['H'], ['C'], ['J'], [None], ['S', 'L'], ['C'], ['O', 'S'], ['N', 'O'], [None], ['G'], ['D', 'E']]

A dictionary is created with the TextKernel industry label as the key and the equivalent SIC letter as the value.

In [87]:
industry_label_to_sic = {}

for i in range(len(categories['organization_industry_label'])):
    industry_label_to_sic[categories['organization_industry_label'][i]] = industry_labels_sic_best_estimate[i]
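
The index-based loop above relies on `industry_labels_sic_best_estimate` having the same length and order as `categories['organization_industry_label']`. A minimal sketch of the same mapping with an explicit length check follows. The three sample rows are copied from the table above, and `build_label_to_sic` is a hypothetical helper name, not part of the `skill_demand` package.

```python
def build_label_to_sic(labels, sic_letters):
    """Pair each TextKernel label with its list of SIC letters, in order."""
    if len(labels) != len(sic_letters):
        raise ValueError(
            f"{len(labels)} labels but {len(sic_letters)} SIC letter lists"
        )
    return dict(zip(labels, sic_letters))


# Sample data only: three rows of the table above, in the same order.
sample_labels = [
    "Accommodation / Food Services",
    "Education / Research",
    "Business services",
]
sample_letters = [["I"], ["M", "P"], [None]]

sample_mapping = build_label_to_sic(sample_labels, sample_letters)
print(sample_mapping["Education / Research"])  # ['M', 'P']
```

Failing loudly on a length mismatch may be preferable to an `IndexError` part-way through the loop, or to a silently truncated mapping.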

### 2.2  Matching on SIC letter from Companies House

Import the organisation names that were fuzzy-matched to the Companies House dataset. Each company registered with Companies House assigns itself a SIC code, so we can estimate an advert's SIC code from the organisation that posted the job.

In [80]:
ch_match = pd.read_csv(f"{DATA_PATH}/data_tk_and_sic/clean_org_name_and_sic_letter.csv")

### 2.3 Using SOC codes

There is no direct translation from SOC code to SIC code. The following sections of code document the process of translating SOC codes into SIC codes. There are **three main methodologies**:

1. Match exact words in the SIC letter descriptions e.g. agriculture to words in the SOC code labels.
2. Use the `token similarity` and `lemmatisation` functions within the nlp package `spacy` to find the most similar four-digit SIC code descriptions and SOC code labels.
3. For the SOC code labels that fail to be matched e.g. draughtsperson, manually assign these a SIC letter.

___

Read in the reference table for the SIC codes which shows each of the high-level SIC letters and the four-digit SIC codes and descriptions included within this. If the `csv` file does not exist, create it and then read it in.

In [15]:
if os.path.exists(f"{DATA_PATH}/data/processed/sic_code_references.csv"):
    sic_reference_table = pd.read_csv(f"{DATA_PATH}/data/processed/sic_code_references.csv")
else:
    create_sic_reference_table(DATA_PATH)
    sic_reference_table = pd.read_csv(f"{DATA_PATH}/data/processed/sic_code_references.csv")

Replace certain words in the SOC code labels which make it easier to match. For example, we do not want to match the "human" of "human resources" to the "human" in "human health". On the other hand, we do want "IT" to be matched to words containing "information" and "technology".

In [17]:
soc_code_labels=list(jobs_df.profession_soc_code_label.value_counts().index)

soc_code_labels = [label if 'IT' not in label else label.replace('IT', 'information technology') for label in soc_code_labels]

soc_code_labels = [label.lower() for label in soc_code_labels]

soc_code_labels = [label if 'human resources' not in label else label.replace('human resources', 'hr') for label in soc_code_labels]

soc_code_labels = [label if 'human resource' not in label else label.replace('human resource', 'hr') for label in soc_code_labels]

soc_code_labels = [label if 'public relations' not in label else label.replace('public relations', 'pr') for label in soc_code_labels]
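
The replacements above can also be sketched as a single helper. This is an illustration rather than the notebook's method: `normalise_label` is a hypothetical name, and unlike the cell above it uses regular-expression word boundaries, so an upper-case `IT` inside a longer word would be left untouched.

```python
import re

# Hypothetical helper for illustration; not part of the skill_demand package.
_REPLACEMENTS = [
    (re.compile(r"\bhuman resources?\b"), "hr"),
    (re.compile(r"\bpublic relations\b"), "pr"),
]


def normalise_label(label):
    """Expand 'IT', lower-case the label, then abbreviate HR and PR phrases."""
    # Expand the acronym while the label still has its original casing,
    # so that the lower-case word "it" is left alone.
    label = re.sub(r"\bIT\b", "information technology", label)
    label = label.lower()
    for pattern, replacement in _REPLACEMENTS:
        label = pattern.sub(replacement, label)
    return label


print(normalise_label("IT project and programme managers"))
# information technology project and programme managers
print(normalise_label("Human resources and industrial relations officers"))
# hr and industrial relations officers
```

The order matters in both versions: the `IT` expansion has to happen before lower-casing, otherwise the acronym can no longer be told apart from the pronoun.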

In [18]:
sic_letter_reference_table = sic_reference_table[['SIC_letter','SIC_description']].drop_duplicates().reset_index().drop(columns='index')

Translate the descriptions to lower case so it is easier to match.

In [19]:
sic_letter_reference_table.SIC_description = sic_letter_reference_table.SIC_description.apply(lambda x : x.lower())

Ensure we also translate the "human resources" and "public relations" in the SIC reference table to "hr" and "pr" to match the SOC code descriptions.

In [20]:
sic_reference_table.Description = sic_reference_table.Description.apply(lambda x: x.lower().replace('human resources', 'hr'))
sic_reference_table.Description = sic_reference_table.Description.apply(lambda x: x.lower().replace('public relations', 'pr'))

#### 2.3.1 Match individual words from the SIC letter description to the SOC labels

There are certain words that appear in the SIC letter description that allow words in the SOC code labels to be directly matched to a SIC letter, for example "agriculture" or "construction".

In [26]:
sic_letter_match_words = []
sic_description_match_words = []
## Loop over the cleaned SOC labels directly; the SOC_to_SIC DataFrame is only created in section 2.3.4.
for label in soc_code_labels:
    temp_sic_letter_match_words = []
    temp_sic_description_match_words = []
    for word in label.split():
        for sic_description in list(sic_reference_table.SIC_description.apply(lambda x: x.lower()).value_counts().index):
            found_matching = False
            for individual_word in sic_description.split():
                if word == individual_word and word not in ['.', ';', ' ', 'and', 'of', 'other', 'management', "support"] and found_matching == False:
                    temp_sic_letter_match_words.append(list(sic_reference_table[sic_reference_table.SIC_description.apply(lambda x: x.lower()) == sic_description].SIC_letter.value_counts().index)[0])
                    temp_sic_description_match_words.append(list(sic_reference_table[sic_reference_table.SIC_description.apply(lambda x: x.lower()) == sic_description].SIC_description.value_counts().index)[0])
                    found_matching = True
    if len(temp_sic_letter_match_words) == 0:
        sic_letter_match_words.append(None)
        sic_description_match_words.append(None)
    else:
        sic_letter_match_words.append(temp_sic_letter_match_words)
        sic_description_match_words.append(temp_sic_description_match_words)
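
The exact-word matching can be sketched more compactly with a lookup from word to SIC letter. This is a sketch under stated assumptions, not the notebook's implementation: `build_word_index` and `match_label` are hypothetical names, the two sample descriptions are a small subset, and punctuation is stripped before splitting (which the cell above does not do) so that "agriculture," and "agriculture" compare equal.

```python
import string

# Words too generic to identify a SIC section (taken from the cell above).
EXCLUDED = {"and", "of", "other", "management", "support"}
_PUNCT = str.maketrans("", "", string.punctuation)


def build_word_index(letter_descriptions):
    """Map each informative word to the SIC letters whose description contains it."""
    index = {}
    for letter, description in letter_descriptions.items():
        for word in description.lower().translate(_PUNCT).split():
            if word in EXCLUDED:
                continue
            letters = index.setdefault(word, [])
            if letter not in letters:
                letters.append(letter)
    return index


def match_label(label, index):
    """Return the SIC letters matched by any word in the label, or None."""
    matches = []
    for word in label.lower().translate(_PUNCT).split():
        matches.extend(index.get(word, []))
    return matches or None


# Sample data only: two SIC letter descriptions.
sample = {"A": "Agriculture, forestry and fishing", "F": "Construction"}
word_index = build_word_index(sample)
print(match_label("managers and proprietors in agriculture and horticulture", word_index))  # ['A']
print(match_label("sales accounts and business development managers", word_index))  # None
```

Building the index once avoids re-splitting every SIC description for every word of every SOC label.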

#### 2.3.2 Token similarity

Using `spacy`'s [semantic similarity](https://spacy.io/usage/vectors-similarity) for individual tokens, the code below compares each word in a SOC label with each word in a four-digit SIC description, skipping stop words and punctuation. The pairwise similarities are summed and divided by the number of pairs to give an average similarity between the SOC label and the SIC description. If the average is over 0.25, the SIC description is treated as a potential match for the SOC label.

In [22]:
sic_letter_equivalent = []
sic_description_equivalent = []
similarity_matrix = []
## Loop through all of the SOC code descriptions
for i in range(len(soc_code_labels)):
    ## Apply spacy's natural language processing function to the SOC code description of index i
    doc_soc = nlp(soc_code_labels[i])
    temp_sic_letter_equivalent = []
    temp_sic_description_equivalent = []
    temp_similarity_matrix = []
    temp_similarity_array = []
    ## Loop through all of the SIC code descriptions
    for j in range(len(sic_reference_table)):
        ## Apply spacy's natural language processing function to the SIC code description of index j
        doc_sic = nlp(list(sic_reference_table.Description)[j])
        ## To find the average similarity for the description, we need the count of words and the total similarity of these words.
        count = 0
        similarity = 0
        ## For each token (word) in the SOC code description compare to each token in the SIC code description.
        for token_soc in doc_soc:
            for token_sic in doc_sic:
                ## If either token is a stop word or punctuation, ignore it.
                if str(token_soc) in STOP_WORDS or str(token_sic) in STOP_WORDS or token_soc.is_punct == True or token_sic.is_punct == True:
                    pass
                ## add the similarity of each token so we can calculate the average
                else:
                    count += 1
                    similarity += token_soc.similarity(token_sic)
        ## add the average similarity between the SOC and SIC descriptions of index i and j, respectively to the temporary similarity array
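        ## Guard against pairs where every token is a stop word or punctuation (count == 0).
        temp_similarity_array.append(similarity/count if count else 0)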
    temp = temp_similarity_array.copy()
    ## sort the array so we have the most similar SIC description to the SOC description of index i in descending order.
    temp.sort()
    ## Loop through each value to find those above the threshold of 0.25
    for k in range(1, len(temp)):
        nth_largest_value = temp[len(temp)-k]
        if nth_largest_value > 0.25:
            ## Convert the list to an array so the comparison is element-wise.
            index = (np.where(np.asarray(temp_similarity_array) == nth_largest_value))
            try:
                ## Append the SIC letter, SIC description and the similarity to arrays
                temp_sic_letter_equivalent.append(sic_reference_table.loc[int(index[0])].SIC_letter)
                temp_sic_description_equivalent.append((sic_reference_table.loc[int(index[0])].SIC_description))
                temp_similarity_matrix.append(nth_largest_value)
            except:
                temp_sic_letter_equivalent.append(None)
                temp_sic_description_equivalent.append(None)
                temp_similarity_matrix.append(None)
        else:
            temp_sic_letter_equivalent.append(None)
            temp_sic_description_equivalent.append(None)
            temp_similarity_matrix.append(None)
    ## Append the temporary arrays to a larger array so we have the SIC letters, SIC descriptions and similarities for each SOC code label.
    sic_letter_equivalent.append(temp_sic_letter_equivalent)
    sic_description_equivalent.append(temp_sic_description_equivalent)
    similarity_matrix.append(temp_similarity_matrix)
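
The averaging step can be illustrated without loading a language model. The sketch below uses made-up two-dimensional vectors in place of `spacy` word vectors (an assumption for illustration only) and hypothetical function names; it shows how the pairwise cosine similarities are averaged and compared with the 0.25 threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(soc_tokens, sic_tokens, vectors, stop_words=frozenset()):
    """Mean pairwise similarity between SOC and SIC tokens, ignoring stop words."""
    total, count = 0.0, 0
    for soc in soc_tokens:
        for sic in sic_tokens:
            if soc in stop_words or sic in stop_words:
                continue
            total += cosine(vectors[soc], vectors[sic])
            count += 1
    # Return 0.0 rather than dividing by zero when no pair survives the filter.
    return total / count if count else 0.0


# Toy vectors, invented for this example; real vectors come from the spacy model.
toy_vectors = {
    "nurses": [1.0, 0.0],
    "hospital": [1.0, 0.0],
    "activities": [0.0, 1.0],
}

score = average_similarity(["nurses"], ["hospital", "activities"], toy_vectors)
print(score)         # 0.5
print(score > 0.25)  # True
```

Because every pair contributes equally, a long SIC description with many unrelated words pulls the average down even when one word matches closely.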

#### 2.3.3 Match the lemmatised 4 digit SIC labels to lemmatised SOC descriptions

Reduce the words in the SIC descriptions to their root forms (lemmatisation).

In [28]:
sic_description_lemmatised = lemmatization(list(sic_reference_table.Description))

Add the lemmatised SIC descriptions to the SIC reference table.

In [29]:
sic_reference_table['lemmatised'] = sic_description_lemmatised

Find exact word matches between the lemmatised SIC descriptions and the lemmatised SOC descriptions. This will enable "manufacturing" and "manufacture" to be matched as they are reduced to the root word "manufacture".

In [30]:
sic_letter_match = []
sic_description_match = []
## Lemmatize each SOC code label and loop through them
for label in lemmatization(soc_code_labels):
    label_split = label.split()
    temp_sic_letter_match = []
    temp_sic_description_match = []
    ## Find matches with lemmatised words in the lemmatised SIC descriptions
    for i in range(len(sic_reference_table)):
        match_found = False
        sic_split = sic_reference_table.loc[i].lemmatised.split()
        for word in label_split:
            if word in sic_split and word not in ['.', ';', ' ', 'and', 'of', 'other']:
                match_found = True
        ## If a match is found append the SIC letter and description to a temporary array
        if match_found == True:
            temp_sic_letter_match.append(sic_reference_table.loc[i].SIC_letter)
            temp_sic_description_match.append(sic_reference_table.loc[i].SIC_description)
    if len(temp_sic_letter_match) > 0:
        ## Append the array of matches to a larger array so each SOC code description has a matched SIC letter and description
        sic_letter_match.append(temp_sic_letter_match)
        sic_description_match.append(temp_sic_description_match)
    else:
        sic_letter_match.append(None)
        sic_description_match.append(None)
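
The effect of lemmatisation on matching can be shown with a toy example. The lemma lookup below is a stand-in invented for illustration (the notebook uses the `lemmatization` helper from `skill_demand`), and `shares_lemma` is a hypothetical name.

```python
# Toy lemma lookup standing in for a real lemmatiser (assumption for illustration).
TOY_LEMMAS = {
    "manufacturing": "manufacture",
    "managers": "manager",
    "directors": "director",
}

# Words ignored when matching (the word entries from the cell above).
IGNORED = {"and", "of", "other"}


def lemmatise(text):
    """Lower-case, split on whitespace and map each word to its toy lemma."""
    return [TOY_LEMMAS.get(word, word) for word in text.lower().split()]


def shares_lemma(soc_label, sic_description):
    """True if the two texts have at least one non-ignored lemma in common."""
    soc_lemmas = set(lemmatise(soc_label)) - IGNORED
    sic_lemmas = set(lemmatise(sic_description)) - IGNORED
    return bool(soc_lemmas & sic_lemmas)


label = "production managers and directors in manufacturing"
print(shares_lemma(label, "manufacture of beer"))               # True
print(shares_lemma(label, "hotels and similar accommodation"))  # False
```

Without the lemma step, "manufacturing" and "manufacture" are different strings and the first comparison would fail.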

#### 2.3.4 Combine into a single SOC to SIC dataframe

Using all the arrays which match SOC descriptions to SIC letters using different methods, create a new DataFrame `SOC_to_SIC`.

In [24]:
SOC_to_SIC = pd.DataFrame({'soc_code_label' : soc_code_labels, 'soc_code' : list(jobs_df.profession_soc_code_value.value_counts().index), 'sic_letter_similarity' : sic_letter_equivalent, 'sic_description_similarity' : sic_description_equivalent, 'sic_letter_matching_words':sic_letter_match_words, 'sic_description_matching_words':sic_description_match_words, 'sic_letter_lemmatised':sic_letter_match, 'sic_description_lemmatised':sic_description_match})

#### 2.3.5 Combining to find best and second best estimate of SIC code

Using the `most_frequent` function, identify the best and second-best estimates of the SIC letter from the estimates produced by the three methods above. There are five steps:
1. Manually assign SOC labels containing words that fail to match or match the wrong section, e.g. "driver" would otherwise be matched to "Repair of motor vehicles", whereas it belongs in "Transportation and storage".
2. If there is a direct match to a word in the high-level SIC letter description, e.g. "agriculture", use this as the best estimate.
3. Combine the SIC letter arrays from the lemmatisation and the token similarity methods and count how often each SIC letter appears.
4. If a best estimate is already present from the high-level SIC letter match, assign the second-best estimate to the most frequent SIC letter in the combined lemmatisation and token similarity array.
5. If there is no best estimate from the high-level SIC letter match, assign the best estimate to the most frequent SIC letter in the combined array and the second-best estimate to the second most frequent SIC letter.

In [97]:
sic_letter_best_estimate = []
sic_description_best_estimate = []
sic_letter_second_best_estimate = []
sic_description_second_best_estimate = []

for i in range(len(SOC_to_SIC)):
    table = str.maketrans('', '', string.punctuation)
    s = "".join([w.translate(table) for w in SOC_to_SIC.loc[i].soc_code_label])
    soc_code_label_no_punctuation = s.split()
    ## First manually assign labels to the words which have no match or incorrectly match to the wrong SIC code.
    if "driver" in soc_code_label_no_punctuation or "drivers" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('H')
        sic_description_best_estimate.append('Transportation and storage')
        sic_letter_second_best_estimate.append('H')
        sic_description_second_best_estimate.append('Transportation and storage')
    elif "optician" in soc_code_label_no_punctuation or "opticians" in soc_code_label_no_punctuation or "counsellors" in soc_code_label_no_punctuation or "physiotherapists" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('Q')
        sic_description_best_estimate.append('Human health and social work activities')
        sic_letter_second_best_estimate.append('Q')
        sic_description_second_best_estimate.append('Human health and social work activities')
    elif "chefs" in soc_code_label_no_punctuation or "cooks" in soc_code_label_no_punctuation or "catering" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('I')
        sic_description_best_estimate.append('Accommodation and food service activities')
        sic_letter_second_best_estimate.append('I')
        sic_description_second_best_estimate.append('Accommodation and food service activities')
    elif "health" in soc_code_label_no_punctuation and "safety" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('N')
        sic_description_best_estimate.append('Administrative and support service activities')
        sic_letter_second_best_estimate.append('N')
        sic_description_second_best_estimate.append('Administrative and support service activities')
    elif "veterinarians" in soc_code_label_no_punctuation or "veterinary" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('M')
        sic_description_best_estimate.append('Professional, scientific and technical activities')
        sic_letter_second_best_estimate.append('M')
        sic_description_second_best_estimate.append('Professional, scientific and technical activities')
    ## Aircraft, air traffic, air travel and air transport labels all map to section H.
    elif "aircraft" in soc_code_label_no_punctuation or ("air" in soc_code_label_no_punctuation and ("traffic" in soc_code_label_no_punctuation or "travel" in soc_code_label_no_punctuation or "transport" in soc_code_label_no_punctuation)):
        sic_letter_best_estimate.append('H')
        sic_description_best_estimate.append('Transportation and storage')
        sic_letter_second_best_estimate.append('H')
        sic_description_second_best_estimate.append('Transportation and storage')
    elif "engineer" in soc_code_label_no_punctuation or "engineers" in soc_code_label_no_punctuation or "engineering" in soc_code_label_no_punctuation or "draughtspersons" in soc_code_label_no_punctuation or "architects" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('M')
        sic_description_best_estimate.append('Professional, scientific and technical activities')
        sic_letter_second_best_estimate.append('M')
        sic_description_second_best_estimate.append('Professional, scientific and technical activities')
    elif "hr" in soc_code_label_no_punctuation or "telephonists" in soc_code_label_no_punctuation or "receptionists" in soc_code_label_no_punctuation or "secretary" in soc_code_label_no_punctuation or "secretarial" in soc_code_label_no_punctuation or "secretaries" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('N')
        sic_description_best_estimate.append('Administrative and support service activities')
        sic_letter_second_best_estimate.append('N')
        sic_description_second_best_estimate.append('Administrative and support service activities')
    ## "estimators" is handled by its own branch further down (best estimate M, second-best F).
    elif "scientists" in soc_code_label_no_punctuation or "scientist" in soc_code_label_no_punctuation or "research" in soc_code_label_no_punctuation or "laboratory" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('M')
        sic_description_best_estimate.append('Professional, scientific and technical activities')
        sic_letter_second_best_estimate.append('M')
        sic_description_second_best_estimate.append('Professional, scientific and technical activities')
    elif "groundsmen" in soc_code_label_no_punctuation or "caretakers" in soc_code_label_no_punctuation or 'cleaners' in soc_code_label_no_punctuation or "housekeepers" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('N')
        sic_description_best_estimate.append('Administrative and support service activities')
        sic_letter_second_best_estimate.append('N')
        sic_description_second_best_estimate.append('Administrative and support service activities')
    elif "electroplaters" in soc_code_label_no_punctuation or "assemblers" in soc_code_label_no_punctuation or "bottlers" in soc_code_label_no_punctuation or "moulders" in soc_code_label_no_punctuation or "tailors" in soc_code_label_no_punctuation or "upholsterers" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('C')
        sic_description_best_estimate.append('Manufacturing')
        sic_letter_second_best_estimate.append('C')
        sic_description_second_best_estimate.append('Manufacturing')
    elif "electricians" in soc_code_label_no_punctuation or "scaffolders" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('F')
        sic_description_best_estimate.append('Construction')
        sic_letter_second_best_estimate.append('F')
        sic_description_second_best_estimate.append('Construction')
    elif "estimators" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('M')
        sic_description_best_estimate.append('Professional, scientific and technical activities')
        sic_letter_second_best_estimate.append('F')
        sic_description_second_best_estimate.append('Construction') 
    elif "travel" in soc_code_label_no_punctuation and "agents" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('N')
        sic_description_best_estimate.append('Administrative and support service activities')
        sic_letter_second_best_estimate.append('N')
        sic_description_second_best_estimate.append('Administrative and support service activities')
    elif "photographers" in soc_code_label_no_punctuation or "actors" in soc_code_label_no_punctuation or "library" in soc_code_label_no_punctuation or "librarians" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('R')
        sic_description_best_estimate.append('Arts, entertainment and recreation')
        sic_letter_second_best_estimate.append('R')
        sic_description_second_best_estimate.append('Arts, entertainment and recreation')
    elif "refuse" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('E')
        sic_description_best_estimate.append("Water supply, sewerage, waste management and remediation activities")
        sic_letter_second_best_estimate.append('E')
        sic_description_second_best_estimate.append("Water supply, sewerage, waste management and remediation activities")       
    elif "hairdressers" in soc_code_label_no_punctuation or "undertakers" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('S')
        sic_description_best_estimate.append("Other service activities")
        sic_letter_second_best_estimate.append('S')
        sic_description_second_best_estimate.append("Other service activities")
    elif "probation" in soc_code_label_no_punctuation or "prison" in soc_code_label_no_punctuation or "fire" in soc_code_label_no_punctuation or "police" in soc_code_label_no_punctuation or "ncos" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('O')
        sic_description_best_estimate.append("Public administration and defence; compulsory social security")
        sic_letter_second_best_estimate.append('O')
        sic_description_second_best_estimate.append("Public administration and defence; compulsory social security")
    elif "garage" in soc_code_label_no_punctuation or "weigher" in soc_code_label_no_punctuation or "merchandisers" in soc_code_label_no_punctuation or "salespersons" in soc_code_label_no_punctuation or "florists" in soc_code_label_no_punctuation or "fishmongers" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('G')
        sic_description_best_estimate.append("Wholesale and retail trade; repair of motor vehicles and motorcycles")
        sic_letter_second_best_estimate.append('G')
        sic_description_second_best_estimate.append("Wholesale and retail trade; repair of motor vehicles and motorcycles")        
    elif "shelf" in soc_code_label_no_punctuation and "fillers" in soc_code_label_no_punctuation:
        sic_letter_best_estimate.append('G')
        sic_description_best_estimate.append("Wholesale and retail trade; repair of motor vehicles and motorcycles")
        sic_letter_second_best_estimate.append('G')
        sic_description_second_best_estimate.append("Wholesale and retail trade; repair of motor vehicles and motorcycles")
    else:
        ## Next find the best estimate from the lemmatised and similarity SIC letter estimates, assigning the best estimate to be the most frequent SIC letter
        ## and the second-best estimate to be the second-most frequent SIC letter.
        if SOC_to_SIC.loc[i].sic_letter_similarity == None:
            estimate_1_sic_letter = []
            estimate_1_sic_description = []
        else:
            estimate_1_sic_letter = SOC_to_SIC.loc[i].sic_letter_similarity
            estimate_1_sic_description = SOC_to_SIC.loc[i].sic_description_similarity
        if SOC_to_SIC.loc[i].sic_letter_lemmatised == None:
            estimate_2_sic_letter = []
            estimate_2_sic_description = []
        else:
            estimate_2_sic_letter = SOC_to_SIC.loc[i].sic_letter_lemmatised
            estimate_2_sic_description = SOC_to_SIC.loc[i].sic_description_lemmatised
        sic_letter_estimate_combined = estimate_1_sic_letter + estimate_2_sic_letter
        sic_description_estimate_combined = estimate_1_sic_description + estimate_2_sic_description
        ## Use the directly matched words from the high level SIC descriptions as the first best estimate
        if SOC_to_SIC.loc[i].sic_letter_matching_words != None:
            sic_letter_best_estimate.append(SOC_to_SIC.loc[i].sic_letter_matching_words[0])
            sic_description_best_estimate.append(SOC_to_SIC.loc[i].sic_description_matching_words[0])
            if most_frequent(sic_letter_estimate_combined, 1) == SOC_to_SIC.loc[i].sic_letter_matching_words[0]:
                try:
                    sic_letter_second_best_estimate.append(most_frequent(sic_letter_estimate_combined, 2))
                    sic_description_second_best_estimate.append(most_frequent(sic_description_estimate_combined, 2))
                except:
                    sic_letter_second_best_estimate.append(most_frequent(sic_letter_estimate_combined, 1))
                    sic_description_second_best_estimate.append(most_frequent(sic_description_estimate_combined, 1))
            else:
                sic_letter_second_best_estimate.append(most_frequent(sic_letter_estimate_combined, 1))
                sic_description_second_best_estimate.append(most_frequent(sic_description_estimate_combined, 1))
        ## If there is no directly matched high level SIC letter description, use the lemmatised and similarity measures for both the first and second-best estimate.
        else:
            best_estimate_assigned = False
            second_best_estimate_assigned = False
            ## Count the number of times the letter appears and sort by most common.
            sorted_counter = Counter(sic_letter_estimate_combined).most_common()
            ## Use j here so the outer loop index i is not shadowed.
            for j in range(len(sorted_counter)):
                if sorted_counter[j][0] != None and best_estimate_assigned == False:
                    sic_letter_best_estimate.append(sorted_counter[j][0])
                    sic_description_best_estimate.append(Counter(sic_description_estimate_combined).most_common()[j][0])
                    best_estimate_assigned = True
                elif sorted_counter[j][0] != None and best_estimate_assigned == True and second_best_estimate_assigned == False:
                    sic_letter_second_best_estimate.append(sorted_counter[j][0])
                    sic_description_second_best_estimate.append(Counter(sic_description_estimate_combined).most_common()[j][0])
                    second_best_estimate_assigned = True
            if best_estimate_assigned == False:
                sic_letter_best_estimate.append(None)
                sic_description_best_estimate.append(None)
                sic_letter_second_best_estimate.append(None)
                sic_description_second_best_estimate.append(None)
            elif second_best_estimate_assigned == False:
                sic_letter_second_best_estimate.append(None)
                sic_description_second_best_estimate.append(None)         
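
The ranking used in the fallback branch above can be seen in isolation on a toy input. The sketch below is illustrative only: `top_two` is a hypothetical helper, not part of `skill_demand`, and the letters are made up. It ranks letters by frequency with `Counter.most_common()`, skips `None`, and pads with `None` when fewer than two candidates exist.

```python
from collections import Counter


def top_two(letters):
    """Return (best, second_best) letters by frequency, ignoring None.

    Hypothetical helper for illustration; pads with None when fewer
    than two distinct non-None letters are present.
    """
    ranked = [letter for letter, _ in Counter(letters).most_common()
              if letter is not None]
    ranked += [None, None]  # pad so two items are always available
    return ranked[0], ranked[1]


print(top_two(["J", "J", "G", "J", "C", "M", "M"]))  # ('J', 'M')
print(top_two([None, "Q", None]))                    # ('Q', None)
print(top_two([]))                                   # (None, None)
```

Letters with equal counts are returned in the order they were first encountered, so the order of the combined estimate list decides ties.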

Add these estimates to the `SOC_to_SIC` DataFrame.

In [98]:
SOC_to_SIC["SIC_letter_best_estimate"] = sic_letter_best_estimate
SOC_to_SIC['SIC_description_best_estimate'] = sic_description_best_estimate
SOC_to_SIC['SIC_letter_second_best_estimate'] = sic_letter_second_best_estimate
SOC_to_SIC['SIC_description_second_best_estimate'] = sic_description_second_best_estimate

Preview the resulting `SOC_to_SIC` DataFrame.

In [99]:
SOC_to_SIC.head()

| | soc_code_label | soc_code | sic_letter_similarity | sic_description_similarity | sic_letter_matching_words | sic_description_matching_words | sic_letter_lemmatised | sic_description_lemmatised | SIC_letter_best_estimate | SIC_description_best_estimate | SIC_letter_second_best_estimate | SIC_description_second_best_estimate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | programmers and software development professio... | 2136.0 | [J, J, G, J, J, C, M, M, J] | [Information and communication , Information a... | | | [G, G, J, J, J] | [Wholesale and retail trade; repair of motor v... | J | Information and communication | G | Wholesale and retail trade; repair of motor ve... |
| 1 | nurses | 2231.0 | [Q, Q, Q, Q, Q, P, P, Q, Q] | [Human health and social work activities , Hum... | | | | | Q | Human health and social work activities | P | Education |
| 2 | sales accounts and business development managers | 3545.0 | [M, K, J, M, K, K, M, K, S] | [Professional, scientific and technical activi... | | | [G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, ... | [Wholesale and retail trade; repair of motor v... | G | Wholesale and retail trade; repair of motor ve... | K | Financial and insurance activities |
| 3 | marketing and sales directors | 1132.0 | [M, M, K, M, K, K, K, J, L] | [Professional, scientific and technical activi... | | | [G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, ... | [Wholesale and retail trade; repair of motor v... | G | Wholesale and retail trade; repair of motor ve... | K | Financial and insurance activities |
| 4 | information technology business analysts, arch... | 2135.0 | [M, J, J, M, C, J, M, F, M] | [Professional, scientific and technical activi... | [J] | [Information and communication ] | [H, J, J, J, J, J, N, O, S] | [Transportation and storage, Information and c... | M | Professional, scientific and technical activities | M | Professional, scientific and technical activities |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 346 | fitness instructors | 3443.0 | [R, P, R, S, Q, R, Q, P, N] | [Arts, entertainment and recreation , Educatio... | | | | | R | Arts, entertainment and recreation | P | Education |
| 347 | metal plate workers, and riveters | 5214.0 | [C, C, C, C, C, C, C, C, C] | [Manufacturing , Manufacturing , Manufacturing... | | | [B, C, C, C, C, C, C, C, C, C, C, C, C, C, C, ... | [Mining and Quarrying , Manufacturing , Manufa... | C | Manufacturing | G | Wholesale and retail trade; repair of motor ve... |
| 348 | metal machining setters and setter-operators | 5221.0 | [C, C, C, C, C, C, C, C, C] | [Manufacturing , Manufacturing , Manufacturing... | | | [B, C, C, C, C, C, C, C, C, C, C, C, C, C, G, ... | [Mining and Quarrying , Manufacturing , Manufa... | C | Manufacturing | G | Wholesale and retail trade; repair of motor ve... |
| 349 | elementary sales occupations n.e.c. | 9259.0 | [P, P, P, P, P, P, N, S, P] | [Education, Education, Education, Education, E... | | | [G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, ... | [Wholesale and retail trade; repair of motor v... | G | Wholesale and retail trade; repair of motor ve... | P | Education |


Save the table to `data/processed/soc_to_sic.csv`.

In [100]:
SOC_to_SIC.to_csv(f"{DATA_PATH}/data/processed/soc_to_sic.csv")
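
One thing to bear in mind when reloading this file: `to_csv` writes list-valued cells (such as `sic_letter_lemmatised`) as their string representation, so they come back as text rather than lists. The sketch below shows one possible way to restore them, assuming the cells hold plain Python lists of strings. `parse_list_cell` is a hypothetical helper and the two-row frame is made up for illustration.

```python
import ast
import io

import pandas as pd


def parse_list_cell(cell):
    """Turn "['J', 'G']" back into a list; empty cells become None."""
    return ast.literal_eval(cell) if cell else None


# Toy stand-in for SOC_to_SIC with one list-valued column.
toy = pd.DataFrame({
    "soc_code": [2136.0, 2231.0],
    "sic_letter_lemmatised": [["G", "G", "J"], None],
})

buffer = io.StringIO()
toy.to_csv(buffer)  # lists are written as strings, None as an empty cell
buffer.seek(0)

reloaded = pd.read_csv(
    buffer,
    index_col=0,
    converters={"sic_letter_lemmatised": parse_list_cell},
)
print(reloaded["sic_letter_lemmatised"].tolist())
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.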

## 3. Run through each method to assign SIC codes to job adverts

Assign a SIC letter to each job advert using each of the three methods: the TextKernel categories, the fuzzy match to Companies House, and the SOC-to-SIC table.
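
As a rough illustration of how the three sources end up side by side on each advert, the sketch below runs the same steps on toy data. The column names follow the notebook, but every value, the `label_to_sic` mapping, and the `ch_toy` and `soc_toy` frames are made up. It uses `merge` rather than the notebook's `set_index(...).join(...)` pattern, which gives the same left-join result here.

```python
import pandas as pd

# Toy adverts; column names follow the notebook, values are made up.
adverts = pd.DataFrame({
    "job_id": [1, 2, 3],
    "organization_industry_label": ["IT", "Healthcare", None],
    "profession_soc_code_value": [2136, 2231, 9259],
})

# Method 1: map the TextKernel industry label to a SIC letter (toy mapping).
label_to_sic = {"IT": "J", "Healthcare": "Q"}
adverts["organization_industry_sic_letter"] = (
    adverts["organization_industry_label"].map(label_to_sic)
)

# Method 2: left-join the Companies House matches on job_id (toy matches).
ch_toy = pd.DataFrame({"job_id": [2], "fuzzy_match_sic_letter": ["Q"]})
adverts = adverts.merge(ch_toy, on="job_id", how="left")

# Method 3: left-join the SOC-to-SIC lookup on the SOC code (toy lookup).
soc_toy = pd.DataFrame({
    "soc_code": [2136, 2231, 9259],
    "SIC_letter_best_estimate": ["J", "Q", "G"],
})
adverts = adverts.merge(
    soc_toy, left_on="profession_soc_code_value", right_on="soc_code", how="left"
).drop(columns="soc_code")

print(adverts[["job_id", "organization_industry_sic_letter",
               "fuzzy_match_sic_letter", "SIC_letter_best_estimate"]])
```

Because every join is a left join, each advert is kept even when a method has no match for it; the corresponding column is simply left missing.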

Get a list of files to run through.

In [78]:
files = get_files(DATA_PATH)
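
`get_files` is imported from `skill_demand.sic_analysis.sic_analysis_funcs` and its implementation is not shown in this notebook. Purely as an illustration of what such a helper might do, the hypothetical `list_raw_files` below lists the gzipped files in a directory and skips any whose name starts with `sic_code_assigned_`, since the processing cell writes its output into the same `data/raw` folder. This is an assumption about useful behaviour, not a description of the real function.

```python
import os
import tempfile


def list_raw_files(raw_dir, skip_prefix="sic_code_assigned_"):
    """Hypothetical stand-in: sorted .gz file names, excluding outputs."""
    return sorted(
        name for name in os.listdir(raw_dir)
        if name.endswith(".gz") and not name.startswith(skip_prefix)
    )


# Demonstrate on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as raw_dir:
    for name in ["full_jobs_200330_1.gz", "full_jobs_200330_0.gz",
                 "sic_code_assigned_full_jobs_200330_0.gz", "notes.txt"]:
        open(os.path.join(raw_dir, name), "w").close()
    found = list_raw_files(raw_dir)

print(found)  # ['full_jobs_200330_0.gz', 'full_jobs_200330_1.gz']
```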

For each file, assign a SIC letter to every job advert using each of the three methods, then write the result to a new gzipped CSV.

In [101]:
for file in files:
    print(f"Reading in {file}...")
    t0 = tt()
    try:
        file_in = pd.read_csv(f"{DATA_PATH}/data/raw/{file}", compression="gzip")
        ## Map the TextKernel organization industry label to a SIC letter
        file_in['organization_industry_sic_letter'] = file_in["organization_industry_label"].map(industry_label_to_sic)
        ## Join on the job id from the Companies House fuzzy matched dataset
        tmp = file_in.set_index("job_id").join(ch_match.set_index("job_id"), how="left")
        tmp = tmp.reset_index().rename(columns={"index" : "job_id"})
        tmp = tmp.rename(columns={'sic_letter' : "fuzzy_match_sic_letter"})
        ## Join on the SOC to SIC table to match SOC codes to SIC letters
        sic_columns = [
            'SIC_letter_best_estimate', 'SIC_description_best_estimate',
            'SIC_letter_second_best_estimate', 'SIC_description_second_best_estimate',
        ]
        file_out = tmp.set_index("profession_soc_code_value").join(SOC_to_SIC[['soc_code'] + sic_columns].set_index("soc_code"), how="left")
        file_out = file_out.reset_index().rename(columns={"index" : "profession_soc_code_value"})
        file_out['clean_organization_name'] = light_clean_org_names(file_out['organization_name'])
        ## Keep only the identifying columns and the SIC letters from the three methods
        output_columns = [
            'job_id', 'clean_organization_name', 'profession_soc_code_value', 'profession_soc_code_label',
            'organization_industry_label', 'organization_industry_sic_letter', 'fuzzy_match_sic_letter',
        ] + sic_columns
        file_out[output_columns].to_csv(f"{DATA_PATH}/data/raw/sic_code_assigned_{file}", compression="gzip", encoding='utf-8')
    except Exception as e:
        print(f"Processing of {file} failed: {e}")
    print(f"Time spent on {file} took {(tt()-t0)/60:.2f} minutes")

Reading in full_jobs_200330_0.gz...
Time spent on full_jobs_200330_0.gz took 0.28 minutes
Reading in full_jobs_200330_1.gz...
Time spent on full_jobs_200330_1.gz took 0.25 minutes
Reading in full_jobs_200330_10.gz...
Time spent on full_jobs_200330_10.gz took 0.24 minutes
Reading in full_jobs_200330_100.gz...
Time spent on full_jobs_200330_100.gz took 0.25 minutes
Reading in full_jobs_200330_101.gz...
Time spent on full_jobs_200330_101.gz took 0.25 minutes
Reading in full_jobs_200330_102.gz...
Time spent on full_jobs_200330_102.gz took 0.25 minutes
Reading in full_jobs_200330_103.gz...
Time spent on full_jobs_200330_103.gz took 0.25 minutes
Reading in full_jobs_200330_104.gz...
Time spent on full_jobs_200330_104.gz took 0.25 minutes
Reading in full_jobs_200330_105.gz...
Time spent on full_jobs_200330_105.gz took 0.25 minutes
Reading in full_jobs_200330_106.gz...
Time spent on full_jobs_200330_106.gz took 0.25 minutes
Reading in full_jobs_200330_107.gz...
Time spent on full_jobs_200330_1