In [30]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
import time

## General Idea

- There are several ways to utilize this dataset to help people write job titles
- I propose some kind of autocomplete feature.
- That means the user will type part of a word and receive a list of possible words to autocomplete the current word.
- This is a simpler but more realistic approach than recommending complete job titles: as the data shows, job titles are written rather subjectively, sometimes with a motivational intro.
- Recommending single words is a less complex problem and might also be a precursor to a more sophisticated recommendation system.
- A character-based model that predicts the next character would also be interesting. However, such models can produce gibberish, depending on the number of data points, which is not acceptable for a user-facing feature. Therefore we fall back to recommending single words from a known corpus.
- In cases like these it often makes sense to simplify the problem as much as possible, so that we can establish a baseline for such a feature and improve it later.
- Essentially this is a classification problem where the possible words are the target classes.
- The input to the model is the input of the user.
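
To make the intended behaviour concrete, the sketch below shows the input/output contract of such a feature using a plain prefix lookup over a known vocabulary. The function name `suggest`, the `limit` parameter, and the tiny vocabulary are hypothetical and only illustrate the interface; this is not the model built in this notebook.

```python
def suggest(prefix, vocabulary, limit=5):
    """Return up to `limit` known words that start with `prefix` (illustrative only)."""
    prefix = prefix.lower()
    matches = sorted(word for word in vocabulary if word.startswith(prefix))
    return matches[:limit]

# Hypothetical vocabulary; in practice it would come from the job titles.
vocabulary = {"manager", "management", "marketing", "account", "accountant"}
print(suggest("man", vocabulary))  # ['management', 'manager']
print(suggest("acc", vocabulary))  # ['account', 'accountant']
```

A learned classifier has the same contract: typed characters in, a list of candidate words out.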

## Preprocess Data

- To be able to build such a system we want to extract the single words from the job titles
- After that we will do the following preprocessing steps:
    - Lowercase all words to reduce the number of target classes
    - Remove special characters
    - Remove "m/w" gender specification
    - Remove workloads from the title
    - Remove words with 3 or fewer characters (the first 3 characters are the shortest input we will use to predict a word, so shorter words would yield no training examples)
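
As a minimal sketch, the steps above can be traced on a single title with the standard `re` module. The helper name `preprocess_title` and the example title are made up for illustration; the character class and the length filter mirror the ones used in the pandas code of this notebook.

```python
import re

def preprocess_title(title):
    """Apply the preprocessing steps to one title and return the remaining words."""
    title = title.lower()
    # Replace everything except letters, spaces, and a few accented characters.
    title = re.sub(r"[^a-z äöüáèé]", " ", title)
    # Split on whitespace and keep only words longer than 3 characters
    # (same condition as the `len(y) > 3` filter in the notebook).
    return [word for word in title.split() if len(word) > 3]

# Hypothetical title with a gender specification and a workload.
print(preprocess_title("Senior Java Developer (m/w) 80-100%"))
# ['senior', 'java', 'developer']
```

Note that "m/w" and the workload disappear as a side effect: the special characters and digits become spaces, and the leftover fragments "m" and "w" are too short to survive the length filter.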

In [24]:
job_titles = pd.read_csv('jobcloud_published_job_titles.csv', header=None)[0]
# Lowercase job titles
job_titles = job_titles.str.lower()
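# Remove special characters and numbers (except for umlauts and some French accents)
# regex=True is needed on recent pandas versions, where str.replace matches literally by default
job_titles = job_titles.str.replace("[^A-Za-z \u00e4\u00f6\u00fc\u00e1\u00e8\u00e9]", ' ', regex=True)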
# Split titles into words
job_titles = job_titles.str.split(' ')
# Remove lone spaces/empty strings
job_titles = job_titles.map(lambda x: list(filter(lambda y: y != ' ' and y != '', x)))
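# Remove gender specification (redundant with the next step, but kept for visibility)
# Note: "m/w" was already split into "m" and "w" by the replacement above,
# so this only catches a literal "mw"; the fragments are dropped by the length filter
job_titles = job_titles.map(lambda x: list(filter(lambda y: y != 'mw', x)))
# Remove words with 3 or fewer characters
job_titles = job_titles.map(lambda x: list(filter(lambda y: len(y) > 3, x)))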
# Join everything into one series
job_titles = pd.Series([x for y in job_titles for x in y])
len(job_titles.unique())

5051

## Build Training Dataset

- Now we can build the training dataset.
- For this we need to think about the problem a bit more.
- What we are actually doing is a multilabel classification, since many words share a common prefix. For example, if the user types "hel", the possible labels are "help", "hell", "hello", etc.
- However, the more characters the user types, the fewer possible labels remain.
- To build this dataset we map each prefix of a word (at least 3 characters long and shorter than the word itself) to the possible target words.
- So for a word abcdef we produce the following examples to train the model:
    - abc, abcdef
    - abcd, abcdef
    - abcde, abcdef

In [25]:
def feature_and_label_generator(job_titles):
    unique_words = job_titles.unique()

    for word in unique_words:
        for i in range(3, len(word)):
            yield word[:i], word

gen = feature_and_label_generator(job_titles)

# Write every (prefix, word) pair to a CSV file; the context manager closes the file
with open('dataset.csv', 'w', encoding='utf-8') as dataset_file:
    dataset_file.write('Input,Label\n')
    for feature, label in gen:
        dataset_file.write(f"{feature},{label}\n")

In [32]:
df = pd.read_csv('dataset.csv')
print('Inputs:', len(df))
print('Unique Labels:', len(df['Label'].unique()))
print('Unique Inputs:', len(df['Input'].unique()))

Inputs: 39926
Unique Labels: 5051
Unique Inputs: 23396


- Of course we cannot feed this directly into a model.
- For the labels, we have to map each possible label to a unique id.
- For the features, we need to think a bit more about how to represent the relevant properties of the user input:
    - The length of the input is probably relevant.
    - Which characters are used should be represented. Similar to a bag-of-words model, we could use a bag-of-characters (boc): a vector with one count per character.
    - The ordering of the characters should also be represented. This could be done with a vector of the same length as the boc that holds, for each character, its average position in the input.
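
A small worked example may help to see what these three features look like for one input. This is only a sketch on a hypothetical three-letter alphabet; the helper names `bag_of_characters` and `average_positions` are illustrative, and positions are counted from 1 so that 0 can stand for "character not present".

```python
def bag_of_characters(text, alphabet):
    """Count how often each character of `alphabet` occurs in `text`."""
    return [text.count(char) for char in alphabet]

def average_positions(text, alphabet):
    """Mean 1-based position of each character in `text`, or 0 if it is absent."""
    result = []
    for char in alphabet:
        positions = [idx + 1 for idx, c in enumerate(text) if c == char]
        result.append(sum(positions) / len(positions) if positions else 0)
    return result

alphabet = ["a", "g", "r"]  # hypothetical toy alphabet
text = "aara"

print(len(text))                          # length feature: 4
print(bag_of_characters(text, alphabet))  # [3, 0, 1]
# "a" occurs at positions 1, 2 and 4 (mean 7/3), "g" is absent, "r" is at position 3
print(average_positions(text, alphabet))
```

The bag-of-characters alone cannot distinguish "aara" from "raaa"; the average positions differ for the two, which is why both vectors are used.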

In [33]:
# Encode Labels
label_encoder = LabelEncoder()
label_encoder.fit(df['Label'])
df['LabelEncoded'] = label_encoder.transform(df['Label'])
df.head()

|   | Input | Label | LabelEncoded |
|---|---|---|---|
| 0 | acc | account | 34 |
| 1 | acco | account | 34 |
| 2 | accou | account | 34 |
| 3 | accoun | account | 34 |
| 4 | man | manager | 2697 |


- As mentioned above, this is a multilabel classification. Therefore we must first group the labels by input, so that we know all possible labels for each input.

In [34]:
df = df.groupby('Input').agg({'Label':lambda x: list(x), 'LabelEncoded': lambda x: list(x)}).reset_index()
df.head()

|   | Input | Label | LabelEncoded |
|---|---|---|---|
| 0 | aar | [aarau, aargau] | [0, 1] |
| 1 | aara | [aarau] | [0] |
| 2 | aarg | [aargau] | [1] |
| 3 | aarga | [aargau] | [1] |
| 4 | aba | [abap, abacus] | [3, 2] |


In [52]:
def generate_boc(chars, char_mapping):
    boc = [0]*len(char_mapping)
    for char in chars:
        boc[char_mapping[char]] += 1
    return boc

def generate_ordering(chars, char_mapping):
    ordering = [[] for _ in range(len(char_mapping))]
    for idx, char in enumerate(chars):
        ordering[char_mapping[char]].append(idx + 1)
    ordering = list(map(lambda x: np.mean(x) if x else 0, ordering))
    return ordering
    
# Generate char_mapping
df['chars'] = df['Input'].map(lambda x: list(x))
unique_chars = list(set([x for y in df['chars'] for x in y]))
char_mapping = dict([(y, x) for (x, y) in enumerate(unique_chars)])

# Generate Bag of Characters
df['boc'] = df['chars'].map(lambda x: generate_boc(x, char_mapping))
# Get length of input
df['length'] = df['chars'].map(lambda x: len(x))
# Generate ordering
df['ordering'] = df['chars'].map(lambda x: generate_ordering(x, char_mapping))
# Generate One-Hot encoding for labels
mlb = MultiLabelBinarizer()
mlb.fit(df['LabelEncoded'])
df['LabelOneHot'] = df['LabelEncoded'].map(lambda x: np.squeeze(mlb.transform([x])))
df.head()
df.to_csv('transformed_data.csv')

In [53]:
# Merge everything into one feature table: input length, bag of characters, ordering
colnames_boc = list(map(lambda x: 'boc_' + x, char_mapping.keys()))
colnames_ord = list(map(lambda x: 'ord_' + x, char_mapping.keys()))

# Build the new columns as separate frames and concatenate them, which avoids
# assigning into a slice of df (the cause of pandas' SettingWithCopyWarning)
boc_df = pd.DataFrame(df['boc'].values.tolist(), columns=colnames_boc, index=df.index)
ord_df = pd.DataFrame(df['ordering'].values.tolist(), columns=colnames_ord, index=df.index)
X = pd.concat([df[['length']], boc_df, ord_df], axis=1)

y = df['LabelOneHot']


In [54]:
X.head()

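|   | length | boc_x | boc_n | boc_ü | boc_ö | boc_f | boc_i | boc_ä | boc_g | boc_p | ... | ord_j | ord_e | ord_r | ord_y | ord_v | ord_s | ord_t | ord_b | ord_è | ord_a |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.5 |
| 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.333333 |
| 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.5 |
| 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.666667 |
| 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 |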


In [55]:
y.head()

0    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: LabelOneHot, dtype: object
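
Each row of `y` is a multi-hot vector: one position per known word, set to 1 for every word that is a valid completion of the input. The plain-Python sketch below reproduces the pattern of the rows shown above on a toy class count of 6 instead of 5051. The helper `multi_hot` is hypothetical and only illustrates the encoding; it assumes, as is the case here, that label ids run from 0 to the number of classes minus 1, so that the column index equals the label id.

```python
def multi_hot(label_ids, n_classes):
    """Return a 0/1 vector of length `n_classes` with ones at `label_ids`."""
    vector = [0] * n_classes
    for label_id in label_ids:
        vector[label_id] = 1
    return vector

# Encoded labels of three grouped inputs shown earlier: "aar", "aara" and "aba".
print(multi_hot([0, 1], 6))  # [1, 1, 0, 0, 0, 0]
print(multi_hot([0], 6))     # [1, 0, 0, 0, 0, 0]
print(multi_hot([3, 2], 6))  # [0, 0, 1, 1, 0, 0]
```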

## Baseline

- As mentioned above, it is important to establish a baseline before running experiments with more complex models.
- Therefore we will try to build a simple model based on random forests:
    - They are memory intensive but conceptually simple models.
    - Prediction with a random forest is rather fast.
    - As a baseline they usually give a good intuition of what is possible.
- We will hold out 5% of the data as the validation set.
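
Since memory is the main concern with this approach, a back-of-the-envelope estimate of the dense label matrix is useful before training. The sketch below uses the counts printed earlier in this notebook (23396 unique inputs, 5051 unique labels); the 8 bytes per entry are an assumption that holds for 64-bit integers and may differ on other platforms.

```python
n_inputs = 23396     # unique inputs reported earlier
n_labels = 5051      # unique labels reported earlier
bytes_per_entry = 8  # assumption: 64-bit integers

n_entries = n_inputs * n_labels
size_gib = n_entries * bytes_per_entry / 2**30
print(f"{n_entries:,} entries, about {size_gib:.2f} GiB")
# 118,173,196 entries, about 0.88 GiB
```

This covers only the label matrix; the trees themselves need additional memory on top of it, so the real footprint during training is likely to be considerably higher.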

In [72]:
def convert_df_to_np(series):
    # Stack the per-row label vectors into one 2D integer array (rows x labels)
    return np.stack(series.tolist()).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True)

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, convert_df_to_np(y_train))

MemoryError: 

In [None]:
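y_pred = clf.predict(X_test)
# classification_report expects a 2D indicator array, not a Series of arrays
print(classification_report(convert_df_to_np(y_test), y_pred))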