# Indeed Machine Learning Hackathon - Exploratory Analysis

This notebook walks through my initial exploratory analysis of the training data and the training scores of a few baseline models.

In [61]:
import preprocessing as pp
import re
from IPython.display import display
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd
import numpy as np

In [62]:
train = pp.JobDescriptionDataset("./data/train.tsv")

In [63]:
dt_matrix = train.getDTMatrix()

### Descriptive Analysis

First, I look at the tag frequency in the training set.

In [64]:
label_cooccurrence = train.getLabelCooccurrence()
# The diagonal of the cooccurrence matrix holds the per-tag counts; sort by descending count.
count_pretty = pd.DataFrame(sorted(zip(pp.LABEL_LIST, label_cooccurrence.diagonal()), key=lambda x: -x[1]))
count_pretty.columns = ["Tag", "Frequency"]
display(count_pretty)

| | Tag | Frequency |
|---|---|---|
| 0 | 2-4-years-experience-needed | 1043 |
| 1 | bs-degree-needed | 970 |
| 2 | full-time-job | 885 |
| 3 | supervising-job | 751 |
| 4 | salary | 669 |
| 5 | 5-plus-years-experience-needed | 636 |
| 6 | licence-needed | 524 |
| 7 | hourly-wage | 451 |
| 8 | 1-year-experience-needed | 331 |
| 9 | part-time-job | 328 |


In [65]:
print "Average # of tags per sample: ", 1.*sum(label_cooccurrence.diagonal())/dt_matrix.shape[0]

Average # of tags per sample:  1.57257142857
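The tag counts and the per-sample average above come from the `preprocessing` helper module, which is not shown here. As a rough sketch of the same computation, the snippet below counts tags directly from raw tag strings. The `raw_tags` list is made-up toy data, and the assumption that tags are space-separated is inferred from how the baseline predictions are written later in this notebook.

```python
from collections import Counter

# Toy tag strings (hypothetical), one per sample; '' means the sample has no tags.
raw_tags = [
    "full-time-job salary",
    "",
    "part-time-job hourly-wage",
    "full-time-job",
]

# str.split() with no argument returns [] for an empty string,
# so untagged samples contribute nothing to the counts.
tag_counts = Counter(tag for tags in raw_tags for tag in tags.split())

# Average number of tags per sample, counting untagged samples in the denominator.
avg_tags = sum(tag_counts.values()) / float(len(raw_tags))

# Number of samples with no tags at all.
n_untagged = sum(1 for tags in raw_tags if tags == "")
```

Untagged samples pull the average down, which is worth keeping in mind when reading the 1.57 figure above.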


A number of training samples have no tags at all. In some cases this looks like a labeling error: for example, several descriptions contain the string 'full time' but aren't tagged with 'full-time-job'. There are also generic job descriptions that advertise multiple roles and contain phrases like 'part and full time positions available'. It is unclear how these are supposed to be tagged.

In [66]:
print "Number of training samples without tags:", np.sum(np.array(train.getRawY()) == '')

Number of training samples without tags: 871


In [67]:
descriptions_missing_tags = np.array(train.getRawX())[np.array(train.getRawY()) == '']
# A plain substring test is equivalent to re.search(".*full time.*", d) here.
contain_full_time = ["full time" in d for d in descriptions_missing_tags]
contain_part_time = ["part time" in d for d in descriptions_missing_tags]
print "Training samples without tags that contain 'full time':", sum(contain_full_time)
print "Training samples without tags that contain 'part time':", sum(contain_part_time)

Training samples without tags that contain 'full time': 33
Training samples without tags that contain 'part time': 28
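The check above only matches the exact lowercase string with a single space. A slightly more tolerant pattern would also catch variants such as 'Full-time' or 'fulltime'. The sketch below is one possible pattern, not the one used for the counts above; the `FULL_TIME` name and the sample sentences are hypothetical.

```python
import re

# Hypothetical pattern: 'full', an optional hyphen or whitespace character, then 'time',
# matched case-insensitively and only on word boundaries.
FULL_TIME = re.compile(r"\bfull[-\s]?time\b", re.IGNORECASE)

# Toy descriptions (made up for illustration).
samples = [
    "Full-time position available",
    "part and full time contract positions",
    "This is a fulltime role",
    "Runs at full speed every time",
]

# True where the description mentions full-time work in any of the accepted spellings.
flags = [FULL_TIME.search(s) is not None for s in samples]
```

Applied to the untagged descriptions, a pattern like this would probably raise the counts above, since both the hyphenated and the spaced spellings occur in the data.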


In [68]:
descriptions_missing_tags[3]

'ByteManagers is seeking Drill tool industrial supply  experts for long-term part and full time contract positions. Are you a product expert? Can you look at a picture of a drill bit, tooling component or motor and describe the product you\xe2\x80\x99re looking at in minutes? Do you know what matters to a person looking to purchase a pair of industrial gloves or ball valve? If yes \xe2\x80\x93 we\xe2\x80\x99re looking for you. Skills:  \xe2\x80\xa2 Expert knowledge of industrial supplies \xe2\x80\xa2 Ability to identify products in detail by looking at an image \xe2\x80\xa2 Knowledge of product features that are critical to the customer \xe2\x80\xa2 Ability to learn new software   Headquarters: Chicago Candidate Location: Chicago (Preferable) Possibility of working remotely: ok but not desirable '

As mentioned in the problem description, several tags are mutually exclusive. The table below shows how often tags cooccur in the training set. Element (i, j) of the matrix is the proportion of samples carrying the j'th tag that also carry the i'th tag.

The following groups of tags are mutually exclusive in the training set. Note that 'full-time-job' and 'part-time-job' never cooccur, even though there are job descriptions, such as the one above, that advertise "part and full time positions".
* '1-year-experience-needed', '2-4-years-experience-needed', '5-plus-years-experience-needed'
* 'bs-degree-needed', 'associate-needed', 'ms-or-phd-needed', 'licence-needed'
* 'salary', 'hourly-wage'
* 'full-time-job', 'part-time-job'

In [69]:
label_cooccurrence_scaled = label_cooccurrence * 1./label_cooccurrence.diagonal()
label_cooccurrence_pretty = pd.DataFrame(label_cooccurrence_scaled)
label_cooccurrence_pretty.columns = pp.LABEL_LIST
label_cooccurrence_pretty.index = pp.LABEL_LIST
display(label_cooccurrence_pretty)

| | 1-year-experience-needed | 2-4-years-experience-needed | 5-plus-years-experience-needed | associate-needed | bs-degree-needed | full-time-job | hourly-wage | licence-needed | ms-or-phd-needed | part-time-job | salary | supervising-job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-year-experience-needed | 1.0 | 0.0 | 0.0 | 0.095694 | 0.06701 | 0.094915 | 0.079823 | 0.129771 | 0.060241 | 0.088415 | 0.074738 | 0.037284 |
| 2-4-years-experience-needed | 0.0 | 1.0 | 0.0 | 0.416268 | 0.358763 | 0.263277 | 0.146341 | 0.20229 | 0.253012 | 0.121951 | 0.272048 | 0.335553 |
| 5-plus-years-experience-needed | 0.0 | 0.0 | 1.0 | 0.119617 | 0.329897 | 0.149153 | 0.053215 | 0.080153 | 0.253012 | 0.018293 | 0.164425 | 0.304927 |
| associate-needed | 0.060423 | 0.083413 | 0.039308 | 1.0 | 0.0 | 0.055367 | 0.044346 | 0.0 | 0.0 | 0.045732 | 0.047833 | 0.046605 |
| bs-degree-needed | 0.196375 | 0.333653 | 0.503145 | 0.0 | 1.0 | 0.231638 | 0.08204 | 0.0 | 0.0 | 0.060976 | 0.267564 | 0.368842 |
| full-time-job | 0.253776 | 0.223394 | 0.207547 | 0.23445 | 0.21134 | 1.0 | 0.303769 | 0.204198 | 0.277108 | 0.0 | 0.38864 | 0.223702 |
| hourly-wage | 0.108761 | 0.063279 | 0.037736 | 0.095694 | 0.038144 | 0.154802 | 1.0 | 0.068702 | 0.0 | 0.347561 | 0.0 | 0.039947 |
| licence-needed | 0.205438 | 0.10163 | 0.066038 | 0.0 | 0.0 | 0.120904 | 0.079823 | 1.0 | 0.0 | 0.125 | 0.13154 | 0.083888 |
| ms-or-phd-needed | 0.015106 | 0.020134 | 0.033019 | 0.0 | 0.0 | 0.025989 | 0.0 | 0.0 | 1.0 | 0.006098 | 0.029895 | 0.035952 |
| part-time-job | 0.087613 | 0.038351 | 0.009434 | 0.07177 | 0.020619 | 0.0 | 0.252772 | 0.078244 | 0.024096 | 1.0 | 0.025411 | 0.015979 |
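To make the scaling step concrete, here is a small sketch of how such a conditional cooccurrence matrix can be built from a binary label matrix. The toy matrix `Y` is made up, and the sketch assumes plain NumPy arrays; `train.getLabelCooccurrence()` may build its matrix differently.

```python
import numpy as np

# Toy binary label matrix (hypothetical): rows are samples, columns are tags.
Y = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
])

# Raw cooccurrence counts: entry (i, j) counts samples carrying both tag i and tag j,
# so the diagonal holds the per-tag frequencies.
cooccurrence = Y.T.dot(Y)

# Dividing by the diagonal broadcasts along the last axis, so column j is divided
# by the count of tag j. Entry (i, j) becomes the share of tag-j samples that also have tag i.
conditional = cooccurrence / cooccurrence.diagonal().astype(float)

# Pairs that never cooccur show up as zeros off the diagonal.
never_together = np.argwhere(np.triu(cooccurrence == 0, k=1))
```

Because the division is column-wise, the resulting matrix is generally not symmetric: here tag 1 always appears with tag 2, but tag 2 appears with tag 1 only a third of the time.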


### Baselines

I start by generating training scores for a few baseline models for comparison. 

#### Frequency Model

These models ignore the job description entirely and predict the same N most frequent labels for every sample, for N in {1, 2, 3, 4}.

In [70]:
def baselineScore(tag, trueY):
    binarizer = MultiLabelBinarizer(classes = pp.LABEL_LIST)
    return pp.score(trueY, binarizer.fit_transform([tag.split(" ")]*trueY.shape[0]))

In [71]:
trueY = train.getBinarizedLabels()
tag_predictions = ['2-4-years-experience-needed', '2-4-years-experience-needed bs-degree-needed', 
                   '2-4-years-experience-needed bs-degree-needed full-time-job', 
                   '2-4-years-experience-needed bs-degree-needed full-time-job supervising-job']
for i, t in enumerate(tag_predictions):
    print "PopularTag%s -" % str(i + 1), baselineScore(t, trueY)

PopularTag1 - Precision: 0.2384, Recall: 0.1516, F1: 0.1853
PopularTag2 - Precision: 0.2301, Recall: 0.2926, F1: 0.2576
PopularTag3 - Precision: 0.2208, Recall: 0.4212, F1: 0.2897
PopularTag4 - Precision: 0.2085, Recall: 0.5304, F1: 0.2993
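The implementation of `pp.score` is not shown, but the numbers above are consistent with micro-averaged precision, recall and F1: for PopularTag1, 1043 correct predictions out of one prediction per sample gives 0.2384 for roughly 4375 samples, and 1043 out of roughly 6880 true tags gives 0.1516. The sketch below shows micro-averaging under that assumption; `micro_scores` and the toy arrays are hypothetical names and data.

```python
import numpy as np

def micro_scores(true_y, pred_y):
    """Micro-averaged precision, recall and F1 for binary indicator matrices."""
    true_y = np.asarray(true_y)
    pred_y = np.asarray(pred_y)
    # Pool true positives, predictions and true labels over all samples and tags.
    true_pos = float(np.sum((true_y == 1) & (pred_y == 1)))
    n_pred = pred_y.sum()
    n_true = true_y.sum()
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example (hypothetical): 3 samples, 3 tags, always predicting tag 0.
true_y = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
pred_y = np.tile([1, 0, 0], (3, 1))
precision, recall, f1 = micro_scores(true_y, pred_y)
```

Under micro-averaging, adding more always-predicted tags can only raise recall, while precision drifts toward the average tag frequency, which matches the trend from PopularTag1 to PopularTag4.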


#### Keyword Model

Next, I look at the terms (unigrams, bigrams and trigrams) most highly correlated with each label and build a model that predicts tags based on the presence of those terms. More specifically, a sample is assigned a tag only if it contains both of the two most highly correlated terms for that tag.

In [72]:
corr_matrix = train.getCorrelationMatrix()
corr_matrix_sort = (-corr_matrix).argsort()
term_names = train.getTermNames()

In [73]:
top_terms = list()
for i, t in enumerate(corr_matrix_sort[:, :5].tolist()):
    top_terms.append([pp.LABEL_LIST[i]] + [term_names[j] for j in t])

In [74]:
top_terms_table = pd.DataFrame(top_terms)
top_terms_table.columns = ["Tag"] + ["Term %s" % str(i) for i in range(1, 6)]
display(top_terms_table)

| | Tag | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 |
|---|---|---|---|---|---|---|
| 0 | 1-year-experience-needed | 1 year | year | 1 | year of | 1 year of |
| 1 | 2-4-years-experience-needed | years | 2 years | 3 years | years of | 2 |
| 2 | 5-plus-years-experience-needed | 5 years | 5 | years | 5 years of | 5 years experience |
| 3 | associate-needed | associates degree | associates | associates degree or | degree | associates degree in |
| 4 | bs-degree-needed | degree | bachelors | bachelors degree | degree in | bachelor |
| 5 | full-time-job | full-time | full time | a full-time | full | a full time |
| 6 | hourly-wage | hour | per hour | hourly | 00 | 00 per hour |
| 7 | licence-needed | nurse | rn | licensed | care | nursing |
| 8 | ms-or-phd-needed | masters | masters degree | masters degree in | clinical | of clinical |
| 9 | part-time-job | part time | part-time | a part time | a part-time | week |
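How `train.getCorrelationMatrix()` computes its scores is not shown. One plausible reading is a Pearson correlation between each binary label column and each binary term-presence column, which the sketch below illustrates. The function name and the toy matrices are hypothetical, and the sketch assumes small dense arrays.

```python
import numpy as np

def term_label_correlation(doc_term, labels):
    """Pearson correlation between each label column and each term column.

    Returns an array of shape (n_labels, n_terms).
    """
    X = np.asarray(doc_term, dtype=float)
    Y = np.asarray(labels, dtype=float)
    # Center each column, then take normalized dot products.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T.dot(Xc)
    norms = np.outer(np.sqrt((Yc ** 2).sum(axis=0)),
                     np.sqrt((Xc ** 2).sum(axis=0)))
    # Constant columns have zero norm; report a correlation of 0 for them.
    return np.divide(cov, norms, out=np.zeros_like(cov), where=norms > 0)

# Toy data (hypothetical): 4 samples, 3 terms, 2 labels.
doc_term = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

corr = term_label_correlation(doc_term, labels)
# Index of the single most correlated term for each label.
top_term_per_label = (-corr).argsort(axis=1)[:, :1]
```

Sorting the negated matrix along each row, as in the cell above, then yields term indices in descending order of correlation.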


In [75]:
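def topKeywordModel(dt_matrix, corr_matrix):
    """Assign a tag to a sample only if both of the tag's top two terms are present."""
    train_preds = list()
    # Row j holds term indices for label j, sorted by descending correlation.
    corr_matrix_sort = (-corr_matrix).argsort()
    for i in range(dt_matrix.shape[0]):
        pred = list()
        for j in range(len(pp.LABEL_LIST)):
            # .tolist()[0] unwraps the single row returned by the 2-D slice.
            top_terms = corr_matrix_sort[j, :2].tolist()[0]
            # Entries of dt_matrix are compared against 1, i.e. term presence.
            if dt_matrix[i, top_terms[0]] == 1 and dt_matrix[i, top_terms[1]] == 1:
                pred.append(pp.LABEL_LIST[j])
        train_preds.append(pred)
    return MultiLabelBinarizer(classes = pp.LABEL_LIST).fit_transform(train_preds)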

In [76]:
train_preds = topKeywordModel(dt_matrix, corr_matrix)
print "Top Correlated Keyword - ", pp.score(train.getBinarizedLabels(), train_preds)

Top Correlated Keyword -  Precision: 0.7556, Recall: 0.2359, F1: 0.3595
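The double loop above can also be written with NumPy fancy indexing. The sketch below is a possible vectorized variant of the same rule, assuming a dense binary document-term array and a plain 2-D correlation array; it is not the code that produced the score above, and `top_keyword_predictions` and the toy inputs are hypothetical.

```python
import numpy as np

def top_keyword_predictions(doc_term, corr, n_terms=2):
    """Predict label j for a sample only if all of label j's top terms are present."""
    X = np.asarray(doc_term)
    # top[j] holds the indices of the n_terms most correlated terms for label j.
    top = np.argsort(-np.asarray(corr), axis=1)[:, :n_terms]
    # X[:, top] has shape (n_samples, n_labels, n_terms); require every term to be present.
    return (X[:, top] == 1).all(axis=2).astype(int)

# Toy data (hypothetical): 3 samples, 4 terms, 2 labels.
corr = np.array([[0.9, 0.8, 0.1, 0.0],
                 [0.0, 0.1, 0.7, 0.6]])
doc_term = np.array([[1, 1, 0, 0],
                     [1, 0, 1, 1],
                     [1, 1, 1, 1]])

preds = top_keyword_predictions(doc_term, corr)
```

The result is already a binary indicator matrix with one column per label, so no `MultiLabelBinarizer` step is needed in this variant.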


In [77]:
print "Average # of tags per prediction: ", 1.*train_preds.sum()/train_preds.shape[0]

Average # of tags per prediction:  0.490971428571


The keyword-based model has high precision (0.76) but low recall (0.24). This makes sense because the model is fairly conservative: it only predicts a tag when a very specific and obvious pair of keywords is present. That limits the number of false positives, but at the expense of a large number of false negatives. Another consequence is that very few tags are predicted overall. The average number of tags per prediction (0.49) is only about 31% of the average number of true tags per sample in the training set (1.57).