# Intro

In [1]:
import json as js
import itertools
import operator

import numpy as np
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher



# Loading the target data

In [2]:
data = pd.read_csv('data/kpmi.ru_190k.csv', usecols=['q61']).squeeze(axis=1)

In [3]:
data.isna().value_counts()

False    152856
True      41712
Name: q61, dtype: int64

It is handy to keep a separate copy of the data set to explore and experiment with.

In [4]:
q61 = data.copy().dropna()

Let's do some basic preprocessing using the built-in pandas string methods.

In [5]:
q61[q61.str.isdigit()]

170               0
436            1111
767               5
772              66
1086              1
            ...    
165791           42
165894            1
169885            2
170465            1
171164    123456789
Name: q61, Length: 91, dtype: object

In [6]:
# excluding entries where all characters in each string are digits
q61 = q61[~q61.str.isdigit()]

In [7]:
q61 = q61.str.lower()
q61 = q61.str.strip()

In [8]:
q61 = q61.drop_duplicates()
q61.count()

14931

At this point we have a lightly preprocessed data set (a `pd.Series`, to be precise) consisting of 14931 unique entries. We'll work on it further later on.
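To make the order of the steps explicit, here is a minimal plain-Python sketch of the same preprocessing chain. It is only an illustration on made-up sample entries: the function name `preprocess` and the use of `None` in place of NaN are assumptions, not part of the notebook.

```python
def preprocess(entries):
    """Mirror the notebook's steps: drop missing and all-digit entries,
    lowercase, strip whitespace, then keep the first occurrence of each entry."""
    seen = set()
    cleaned = []
    for entry in entries:
        if entry is None:            # None stands in for NaN in this toy version
            continue
        if entry.isdigit():          # same per-element check as Series.str.isdigit()
            continue
        normalized = entry.lower().strip()
        if normalized not in seen:   # keep first occurrence, like drop_duplicates()
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Made-up sample entries, not taken from the real data set
sample = ["Экономист", "экономист ", None, "1111", " 42 ", "Учитель"]
result = preprocess(sample)
print(result)
```

Because the digit check runs before `strip()`, an entry such as `" 42 "` is not recognized as all-digit and survives as `"42"`. Whether the real data contains such entries is not shown here; if it does, stripping before the digit filter would catch them.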

# Profession classifier

Here we make use of [HeadHunter's API](https://dev.hh.ru/), particularly [this part](https://api.hh.ru/openapi/redoc#tag/Obshie-spravochniki/paths/~1professional_roles/get) of it. Uncomment the lines below to download the JSON file again.

In [9]:
# # getting professional_roles
# !curl -o professional_roles.json -H 'User-Agent: api-test-agent' https://api.hh.ru/professional_roles

In [10]:
with open('professional_roles.json') as f:
    professional_roles = js.load(f)

In the following two cells we deal with the nested structure of the JSON file. The goal is to get one flat list of professional roles.

In [11]:
names = []
for category in professional_roles['categories']:
    for role in category.get('roles'):
        names.append(role.get('name').lower().split(','))

names[: 10]

[['автомойщик'],
 ['автослесарь', ' автомеханик'],
 ['мастер-приемщик'],
 ['менеджер по продажам', ' менеджер по работе с клиентами'],
 ['администратор'],
 ['делопроизводитель', ' архивариус'],
 ['курьер'],
 ['менеджер/руководитель ахо'],
 ['оператор пк', ' оператор базы данных'],
 ['офис-менеджер']]

In [12]:
# unpacking 'names' list of lists
roles = list(itertools.chain(*names))
roles = [role.strip() for role in roles]

roles[: 10]

['автомойщик',
 'автослесарь',
 'автомеханик',
 'мастер-приемщик',
 'менеджер по продажам',
 'менеджер по работе с клиентами',
 'администратор',
 'делопроизводитель',
 'архивариус',
 'курьер']

In [13]:
len(roles)

288
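The two flattening cells above can also be condensed into one function. The sketch below runs on a small hypothetical payload that only imitates the `categories` / `roles` / `name` nesting used above; the function name `flatten_roles` and the sample values are assumptions made for illustration.

```python
import itertools


def flatten_roles(payload):
    """Flatten categories -> roles -> comma-separated names into one list."""
    split_names = (
        role["name"].lower().split(",")
        for category in payload["categories"]
        for role in category["roles"]
    )
    # chain.from_iterable unpacks the list of lists without the * operator
    return [name.strip() for name in itertools.chain.from_iterable(split_names)]


# Hypothetical payload shaped like the nesting used in the notebook
toy_payload = {
    "categories": [
        {"roles": [{"name": "Автослесарь, автомеханик"}]},
        {"roles": [{"name": "Курьер"}]},
    ]
}
flat = flatten_roles(toy_payload)
print(flat)
```

`itertools.chain.from_iterable(names)` is equivalent to `itertools.chain(*names)` here, but it does not need to unpack the whole outer list into arguments first.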

# PhraseMatcher
Here we use `roles` as the basis for patterns that may give us matches in the target data. spaCy's `PhraseMatcher` is capable of efficient matching against large terminology lists.

In [14]:
nlp = spacy.load('ru_core_news_lg', disable=['parser', 'ner'])
print(nlp.pipe_names)

['tok2vec', 'morphologizer', 'attribute_ruler', 'lemmatizer']


In [15]:
# making Doc objects out of 'roles' list (nlp.pipe processes the texts as a batch)
docs_roles = list(nlp.pipe(roles))

In [16]:
# creating patterns 
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")

# 'patterns1' based on full entries of professional roles list items,
# whereas 'patterns2' takes only first token of the original item
patterns1 = docs_roles
patterns2 = list(nlp(doc[0].text) for doc in docs_roles)

matcher.add("ROLES-FULL", patterns1)
matcher.add("ROLES-SHORT", patterns2)
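To illustrate the difference between the full and the short patterns, here is a toy, library-free sketch of phrase matching. It is not how spaCy implements `PhraseMatcher`: it splits on whitespace and compares exact lowercase tokens, whereas the notebook matches on lemmas. The function name `find_phrases` and the sample strings are assumptions.

```python
def find_phrases(tokens, patterns):
    """Return (label, start, end) for every pattern phrase found in tokens."""
    hits = []
    for label, phrases in patterns.items():
        for phrase in phrases:
            n = len(phrase)
            # slide a window of the phrase length over the token list
            for start in range(len(tokens) - n + 1):
                if tokens[start:start + n] == phrase:
                    hits.append((label, start, start + n))
    return hits


# Two sample roles; the short pattern keeps only the first token of each role
sample_roles = ["менеджер по продажам", "курьер"]
full = [role.split() for role in sample_roles]
short = [phrase[:1] for phrase in full]
patterns = {"ROLES-FULL": full, "ROLES-SHORT": short}

tokens = "старший менеджер по закупкам".split()
hits = find_phrases(tokens, patterns)
print(hits)
```

In this example the full pattern does not match, because the entry names a different kind of manager, while the short pattern still picks up the head word. That is the trade-off of the short patterns: they match more entries but are less specific.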

In [17]:
# string from a pd.Series goes as an input to 'get_matches' function
def get_matches(entry: str):
    # making Doc object out of input data
    doc = nlp(entry)
    # creating list to store matches
    matches = []
    # extracting any matches found
    for match_id, start, end in matcher(doc):
        matches.append(doc[start:end])
    # when there are more than two matches, also keep the whole entry;
    # this may improve accuracy during the validation step
    if len(matches) > 2:
        matches.append(doc)
    return list(set(matches)) if len(matches) > 0 else np.nan

    
q61_matches = q61.apply(get_matches)

In [18]:
q61_matches

93                                            [(экономист)]
94                                                      NaN
95                                              [(учитель)]
97                                                      NaN
98                                                      NaN
                                ...                        
182938    [(главный), (топ, -, менеджмент,  , |, главный...
184169                                         [(директор)]
188174                                                  NaN
190777                                                  NaN
191076                                          [(бизнеса)]
Name: q61, Length: 14931, dtype: object

In [19]:
# 'get_similar' is designed to be applied to the sets of matches from the previous step
def get_similar(docs):
    suggestions = []
    for doc in docs:
        # estimating similarity between the match and every professional role
        scores = [doc.similarity(doc_role) for doc_role in docs_roles]
        # picking the best-scoring role
        index, value = max(enumerate(scores), key=operator.itemgetter(1))
        # setting score threshold
        if value > 0.7:
            suggestions.append((roles[index], value))
    # dropping duplicate suggestions
    suggestions = list(set(suggestions))
    return suggestions if len(suggestions) > 0 else np.nan

In [20]:
%%time
suggestions = q61_matches.dropna().apply(get_similar)


CPU times: user 1min 26s, sys: 49.9 ms, total: 1min 26s
Wall time: 1min 26s


In [21]:
suggestions = suggestions.dropna()
suggestions

93                                       [(экономист, 1.0)]
95                                         [(учитель, 1.0)]
102       [(менеджер/руководитель ахо, 0.7935711144609798)]
108                                          [(повар, 1.0)]
109                   [(фитнес-тренер, 0.7732876534767605)]
                                ...                        
178519    [(pr-менеджер, 1.0), (pr-менеджер, 0.849594620...
179527             [(начальник склада, 0.8047933596030894)]
182357    [(финансовый менеджер, 0.7060157768667648), (б...
182938    [(бухгалтер, 1.0), (финансовый менеджер, 0.706...
184169         [(генеральный директор, 0.8755402042989613)]
Name: q61, Length: 4339, dtype: object
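For each match, `get_similar` keeps the single best-scoring role, and only if its score exceeds 0.7. By default spaCy's `similarity` is the cosine similarity between the objects' vectors, so the selection can be sketched without spaCy as follows. The tiny 3-dimensional vectors and the names `cosine` and `best_role` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_role(vector, role_vectors, threshold=0.7):
    """Return (role, score) for the most similar role, or None below the threshold."""
    scored = ((role, cosine(vector, rv)) for role, rv in role_vectors.items())
    role, score = max(scored, key=lambda pair: pair[1])
    return (role, score) if score > threshold else None


# Made-up vectors; real ones would come from the language model
role_vectors = {"повар": [1.0, 0.0, 0.0], "курьер": [0.0, 1.0, 0.0]}

confident = best_role([3.0, 0.0, 0.0], role_vectors)
rejected = best_role([1.0, 1.0, 1.0], role_vectors)
print(confident, rejected)
```

The threshold only filters out weak best matches. It says nothing about whether a strong match is semantically right, which is one reason why high-scoring but wrong suggestions are possible.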

# Results review

In [22]:
dictionary = pd.concat([q61, suggestions], axis=1, join='inner')
dictionary.columns = ['q61', 'suggestions']
dictionary

|  | q61 | suggestions |
|---|---|---|
| 93 | экономист | [(экономист, 1.0)] |
| 95 | учитель | [(учитель, 1.0)] |
| 102 | руководитель участка | [(менеджер/руководитель ахо, 0.7935711144609798)] |
| 108 | повар | [(повар, 1.0)] |
| 109 | фитнес | [(фитнес-тренер, 0.7732876534767605)] |
| ... | ... | ... |
| 178519 | реклама и маркетинг \| pr-менеджер | [(pr-менеджер, 1.0), (pr-менеджер, 0.849594620... |
| 179527 | топ-менеджмент \| начальник финансового отдела | [(начальник склада, 0.8047933596030894)] |
| 182357 | топ-менеджмент \| главный бухгалтер | [(финансовый менеджер, 0.7060157768667648), (б... |
| 182938 | топ-менеджмент \| главный бухгалтер | [(бухгалтер, 1.0), (финансовый менеджер, 0.706... |


In [23]:
column1 = dictionary['q61']
column2 = dictionary['suggestions']

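# applying the same normalization as for 'q61' so that the entries match the dictionary keys
data = data.str.lower()
data = data.str.strip()

# mapping every entry to its suggestions; entries without a suggestion become NaN
q61_suggestions = data.map(dict(zip(column1, column2)))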

In [24]:
data_preprocessed = pd.concat([data, q61_suggestions], axis=1)
data_preprocessed.columns = ['q61', 'q61_suggestions']
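The mapping step works like a dictionary lookup on normalized entries, which is why the full data set has to be lowercased and stripped in the same way as `q61` first. Below is a minimal plain-Python sketch of that idea; the helper `normalize`, the sample dictionary and the sample entries are assumptions, and `None` stands in for NaN.

```python
def normalize(entry):
    """Lowercase and strip an entry; None (standing in for NaN) is passed through."""
    return None if entry is None else entry.lower().strip()


# Hypothetical dictionary: normalized entry -> list of (role, score) suggestions
suggestions_by_entry = {"экономист": [("экономист", 1.0)]}

# Made-up raw entries, including a repeat that differs only in case and whitespace
raw_entries = ["Экономист ", "экономист", "никем", None]
mapped = [suggestions_by_entry.get(normalize(entry)) for entry in raw_entries]
print(mapped)
```

Entries that differ only in case or surrounding whitespace end up with the same suggestions, while entries absent from the dictionary get no suggestion at all.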

In [25]:
pd.set_option("display.max_rows", 1000)

Let's take a look at the results so far.

In [26]:
data_preprocessed.dropna(subset=['q61']).drop_duplicates(subset=['q61'])[: 10]

|  | q61 | q61_suggestions |
|---|---|---|
| 93 | экономист | [(экономист, 1.0)] |
| 94 | данный момент не работаю | NaN |
| 95 | учитель | [(учитель, 1.0)] |
| 97 | никем | NaN |
| 98 | я школьник | NaN |
| 100 | рекламист | NaN |
| 102 | руководитель участка | [(менеджер/руководитель ахо, 0.7935711144609798)] |
| 104 | никнм | NaN |
| 105 | не кем | NaN |
| 107 | ученик | NaN |


# Conclusion
The algorithm handles clearly irrelevant entries well, leaving them without suggestions, but its accuracy is still far from perfect. Some of the flaws found:
- inaccurate suggestions (e.g. "топ-менеджмент | начальник финансового отдела" is mapped to "начальник склада")
- some original entries that contain actual professions are still ignored by the algorithm (e.g. "рекламист")
- misspellings mostly lead to no processing at all

Given all of the above, one must decide whether to fix the flaws in the existing algorithm or to try building a different one.