In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
import json

### NOTE
- I experimented with more complex models, but in the limited time available the results were not useful.
- In my opinion, the goal of an experiment like this should be something usable that generates insights for future improvement.
- I therefore focused not on the quality of the model (this would take too much time) but on delivering something that can be used quickly.

## General Idea

- There are several ways to utilize this dataset to help people write job titles
- I propose some kind of autocomplete feature.
- That means the user will type part of a word and receive a list of possible words to autocomplete the current word.
- This is simpler and more realistic than recommending complete job titles: as the data shows, job titles are written rather subjectively, sometimes with a motivational intro.
- Recommending single words is a less complex problem and might also be a precursor to a more sophisticated recommendation system.
- A character-based model that predicts the next character might also be interesting. However, depending on the number of data points, such models can predict gibberish, which is not acceptable for a user-facing feature. We therefore fall back to recommending single words from a known corpus.
- In cases like this it often makes sense to simplify the problem as much as possible, so that we can establish a baseline for the feature and improve it later.
- Essentially, this is a classification problem where the possible words are the target classes.
- The input to the model is the text the user has typed so far.
- We will start with a very simple lookup-based model, which can be improved in the future. There are several reasons for this:
    - Job titles seem to use a limited vocabulary, which we can leverage.
    - Training data is limited, so optimizing complex models might be difficult.
    - It makes more sense to use a very simple model and build the whole application, which users can then test.
    - With this approach we can gather feedback quickly and optimize the model step by step.
    - Since the dataset is small, a simple lookup-based approach should not be a performance problem for the application.
    - We cannot predict words that do not appear in the corpus of existing job titles. However, achieving this with a character-based model (without predicting gibberish) is really difficult.
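
To make the lookup idea concrete, here is a minimal sketch using only the standard library. The names `build_prefix_lookup` and `suggest`, the toy word list and its frequencies are assumptions for illustration, not part of the notebook's pipeline. Ties in frequency are broken alphabetically, which is also an assumption.

```python
from collections import Counter, defaultdict

def build_prefix_lookup(words):
    """Map every prefix of every word to its candidate words, most frequent first."""
    counts = Counter(words)
    lookup = defaultdict(list)
    for word in counts:
        for end in range(1, len(word) + 1):
            lookup[word[:end]].append(word)
    # Sort candidates by descending frequency; break ties alphabetically
    return {
        prefix: sorted(candidates, key=lambda w: (-counts[w], w))
        for prefix, candidates in lookup.items()
    }

def suggest(lookup, typed, k=5):
    """Return up to k completions for what the user has typed so far."""
    return lookup.get(typed.lower(), [])[:k]

# Hypothetical toy corpus (the frequencies are made up)
toy_words = ["assistant", "assistant", "analyst", "aargau", "aargau", "aarau"]
lookup = build_prefix_lookup(toy_words)
print(suggest(lookup, "aa"))      # ['aargau', 'aarau']
print(suggest(lookup, "a", k=2))  # ['aargau', 'assistant']
```

An unknown prefix simply returns an empty list, which matches the limitation noted above: words outside the corpus cannot be suggested.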

## Preprocess Data

- To be able to build such a system we want to extract the single words from the job titles
- After that we will do the following preprocessing steps:
    - Lowercase all words to reduce the number of target classes
    - Remove special characters
    - Remove "m/w" gender specification (special case, maybe it makes sense to keep it, can be evaluated in a second step)
    - Remove workloads from the title
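
As a sketch of what these steps do to a single title, the following uses plain `re` instead of pandas. The function name `title_to_words` and both example titles are hypothetical. The character class mirrors the one used in the cell below, and the minimum word length of 4 mirrors its short-word filter.

```python
import re

def title_to_words(title, min_length=4):
    """Lowercase a title, strip special characters and digits, and keep longer words."""
    title = title.lower()
    # Replace everything except letters, spaces, umlauts and a few accents with a space
    title = re.sub("[^a-z \u00e4\u00f6\u00fc\u00e1\u00e8\u00e9]", " ", title)
    # split() without arguments also drops the empty strings between repeated spaces
    return [word for word in title.split() if len(word) >= min_length]

print(title_to_words("Software Engineer (m/w) 80-100%"))  # ['software', 'engineer']
print(title_to_words("Verkäufer/in Zürich"))              # ['verkäufer', 'zürich']
```

Note that "m/w" becomes the two single letters "m" and "w" once the slash is replaced, so the length filter is what actually removes it here.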

In [16]:
job_titles = pd.read_csv('jobcloud_published_job_titles.csv', header=None)[0]
# Lowercase job titles
job_titles = job_titles.str.lower()
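# Replace special characters and numbers with spaces (umlauts and the accents á, è, é are kept)
# regex=True is needed since pandas 2.0, where str.replace treats the pattern literally by default
job_titles = job_titles.str.replace("[^A-Za-z \u00e4\u00f6\u00fc\u00e1\u00e8\u00e9]", ' ', regex=True)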
# Split titles into words
job_titles = job_titles.str.split(' ')
# Remove lone spaces/empty strings
job_titles = job_titles.map(lambda x: list(filter(lambda y: y != ' ' and y != '', x)))
# Remove gender specification (obsolete with next step, but better for visibility)
job_titles = job_titles.map(lambda x: list(filter(lambda y: y != 'mw', x)))
# Remove short words (limited usefulness in recommending stopwords and such)
job_titles = job_titles.map(lambda x: list(filter(lambda y: len(y) > 3, x)))
# Join everything into one series
job_titles = pd.Series([x for y in job_titles for x in y])
len(job_titles.unique())

5051

## Compute Statistics

- To make useful recommendations, we simply rank the most frequent words first.
- We therefore count how often each word (label) occurs.

In [17]:
labels = pd.DataFrame(job_titles, columns=['Label'])
labels['count'] = 1
labels = labels.groupby('Label').count()
labels.to_json('counts_by_label.json', orient='index')
labels.head()

Label,count
aarau,5
aargau,9
abacus,3
abap,2
abend,2


## Build Dataset

- Now we can build the dataset.
- For this we need to think about the problem a bit more.
- What we are actually doing is multilabel classification, since many words share a common prefix. For example, if the user types "hel", the possible labels are "help", "hell", "hello", etc.
- However, the more characters the user types, the fewer possible labels remain.
- To build this dataset, we map each prefix of a word to that word.
- So for a word abcdef we produce the following examples to train the model:
    - a, abcdef
    - ab, abcdef
    - abc, abcdef
    - abcd, abcdef
    - abcde, abcdef
    - abcdef, abcdef

In [18]:
def feature_and_label_generator(job_titles):
    unique_words = job_titles.unique()

    for word in unique_words:
        for i in range(len(word)):
            yield word[:i+1], word

gen = feature_and_label_generator(job_titles)

# The context manager closes the file even if writing fails
with open('dataset.csv', 'w', encoding='utf-8') as dataset_file:
    dataset_file.write('Input,Label\n')

    for feature, label in gen:
        dataset_file.write(f"{feature},{label}\n")

In [19]:
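# keep_default_na=False keeps prefixes such as "nan" or "null" as strings instead of parsing them as missing values
df = pd.read_csv('dataset.csv', keep_default_na=False)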
print('Inputs:', len(df))
print('Unique Labels:', len(df['Label'].unique()))
print('Unique Inputs:', len(df['Input'].unique()))

Inputs: 55079
Unique Labels: 5051
Unique Inputs: 27797


- As mentioned above, this is a multilabel classification, so we first group the labels by input to obtain all possible labels for each input.

In [20]:
df = df.groupby('Input').agg({'Label': list}).reset_index()
df.head()

Unnamed: 0,Input,Label
0,a,"[account, automatiker, automaticien, allrounde..."
1,aa,"[aarau, aargau]"
2,aar,"[aarau, aargau]"
3,aara,[aarau]
4,aarau,[aarau]


- For each input, we can now sort the candidate words by the counts computed earlier, most frequent first.

In [21]:
with open('counts_by_label.json') as counts_file:
    counts_by_label = json.load(counts_file)

df['Label'] = df['Label'].map(lambda x: sorted(x, reverse=True, key=lambda y: counts_by_label[y]['count']))
df.head()

Unnamed: 0,Input,Label
0,a,"[assistant, analyst, assistent, aussendienst, ..."
1,aa,"[aargau, aarau]"
2,aar,"[aargau, aarau]"
3,aara,[aarau]
4,aarau,[aarau]


## Produce output

- Now we can write this lookup to a JSON file, which the application can load as a simple dictionary.

In [22]:
df = df.set_index('Input')
df.to_json('labels_by_input.json', orient='index')

In [23]:
df.to_json('jobname_app/labels_by_input.json', orient='index')
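
On the application side, querying this file could look like the sketch below. With `orient='index'`, each input maps to an object with a single `Label` key. The function name `autocomplete` and the default of five suggestions are assumptions. To stay self-contained, the example parses a small hypothetical excerpt instead of reading `labels_by_input.json` from disk.

```python
import json

# Hypothetical excerpt in the same layout that to_json(orient='index') produces
sample_json = '{"aa": {"Label": ["aargau", "aarau"]}, "aara": {"Label": ["aarau"]}}'
labels_by_input = json.loads(sample_json)

def autocomplete(typed, lookup, k=5):
    """Return up to k suggestions for the typed prefix, or an empty list if unknown."""
    # The lookup keys are lowercase, so normalize the user input the same way
    entry = lookup.get(typed.lower())
    return entry["Label"][:k] if entry else []

print(autocomplete("Aa", labels_by_input))  # ['aargau', 'aarau']
print(autocomplete("zz", labels_by_input))  # []
```

In the real application, `json.loads(sample_json)` would be replaced by loading the exported file once at startup.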