# Fire up

In [4]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import sklearn
from ggplot import *
import re
import nltk
from nltk.corpus import stopwords
df = pd.read_csv('data1.csv')

Preview the data set.

In [6]:
df.head()

Unnamed: 0,Position,Experience,Description
0,SCM Operations Coordinator,1 - 3 Years,For our European Customer Care Center located ...
1,Internship| Online Partnership Manager | Head ...,Entry,Are you looking for an internship with possibi...
2,EUV Industrial Engineer,3 - 5 Years,Reference RC04506 Do you have a creative way o...
3,Senior Android Mobile Developer,3 - 5 Years,Job description We are the marketing agency f...
4,Medior Business Intelligence Developer bij eVi...,3 - 5 Years,As a Business Intelligence Developer within eV...


# Fast Facts

**Using the job title alone, overall accuracy is about 71%.**

**The way the voting classifiers are built, in this attempt as in the previous one, is not rigorous enough. However, I have not yet found a non-tree-based model that gives a good result on this data set, most likely because my parameter tuning is still immature.**

# Preparation

## Data Cleansing

As before, first filter the data to ensure its quality: keep only postings whose description is longer than 200 words.

In [9]:
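# Count the words in each description and keep only postings longer than 200 words.
df['length'] = df['Description'].apply(lambda x: len(str(x).split()))
df = df[df['length'] > 200].copy()  # .copy() so later column assignments act on an independent frame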

Then an extra step removes the messy non-ASCII characters from the Position column.

In [12]:
df['Position']=df['Position'].apply(lambda x: re.sub(r'[^\x00-\x7F]+',' ',x))
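
As a small illustration of what this substitution does, the sketch below applies the same pattern to a made-up title. It also shows one possible alternative, `fold_to_ascii` (a hypothetical helper, not part of the notebook), which folds accented letters to their plain form instead of splitting the word. Characters with no ASCII decomposition are simply dropped by that alternative, so it is not a drop-in replacement.

```python
import re
import unicodedata

# Made-up example title; \u00e9 is "e" with an acute accent.
title = "D\u00e9veloppeur Android (m/v)"

# The pattern used above: each run of non-ASCII characters becomes one space.
print(re.sub(r'[^\x00-\x7F]+', ' ', title))   # D veloppeur Android (m/v)

def fold_to_ascii(text):
    # Decompose accented characters, then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(fold_to_ascii(title))                   # Developpeur Android (m/v)
```

Splitting a word in two matters for the bag-of-words step later on, because each fragment would become its own token.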

Next, create a cleaned version of the description.

In [14]:
stops = set(stopwords.words("english"))  # build the stop-word set once, not on every call

def to_words(content):
    # Keep letters, digits and hyphens; replace everything else with a space.
    kept = re.sub(r"[^a-zA-Z0-9-]", " ", content)
    words = kept.lower().split()
    # Drop English stop words.
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

df['clean'] = df['Description'].apply(lambda x: to_words(str(x)))
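
To see what this cleaning does to a single string, here is a minimal self-contained sketch. It uses a tiny hard-coded stop-word set (`DEMO_STOPS`, an assumption for illustration only) in place of the NLTK list, so it runs without downloading any corpus. Note that the pattern keeps digits and hyphens as well as letters.

```python
import re

# Tiny stand-in for stopwords.words("english"); illustration only.
DEMO_STOPS = {"we", "are", "for", "a", "with", "of", "the", "and"}

def to_words_demo(content, stops=DEMO_STOPS):
    # Replace everything except letters, digits and hyphens with a space.
    kept = re.sub(r"[^a-zA-Z0-9-]", " ", content)
    words = kept.lower().split()
    # Drop the stop words and join the rest back into one string.
    return " ".join(w for w in words if w not in stops)

sample = "We are looking for a full-time Developer (3+ years) with SQL!"
print(to_words_demo(sample))  # looking full-time developer 3 years sql
```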

A small cleansing step is also applied to the job position, to remove stray characters such as '?'.

In [15]:
df['Position']=df['Position'].apply(lambda x: re.sub("[^a-zA-Z-0-9]", " ",x))

The data cleansing step is now done.

## Train, Test split

Using the same seed as before keeps the results comparable.

In [16]:
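# sklearn.cross_validation was removed in later scikit-learn releases; model_selection replaces it.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)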

# Classification based on position

## Feature Extraction

Since the position title tends to be short and informative, we can assume that every single word in the job title may be useful for classification. Therefore, a unigram (1-term) bag-of-words model is used to extract features from the position.

In [17]:
Title = train['Position'].tolist()

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
feature_title = v.fit_transform(Title)
print(feature_title.shape)

(1471, 1397)


Compared with the unigram bag-of-words matrix for the job description, this feature matrix is far smaller: only 1,397 columns.

In [20]:
Test_Title = test['Position'].tolist()
# Reuse the vocabulary learned on the training titles; unseen words are ignored.
test_feature_title = v.transform(Test_Title)
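
To make the bag-of-words step concrete, the following is a minimal pure-Python sketch of the idea, using made-up titles. The helper names `fit_vocabulary` and `transform` are hypothetical, and the tokenization is a plain whitespace split, which is simpler than `CountVectorizer`'s default (that one also splits on punctuation and drops single-character tokens).

```python
from collections import Counter

def fit_vocabulary(titles):
    # Map every distinct lower-cased word in the training titles to a column index.
    words = sorted({w for t in titles for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(titles, vocab):
    # One row per title, one count per vocabulary word; unknown words are skipped.
    rows = []
    for t in titles:
        counts = Counter(w for w in t.lower().split() if w in vocab)
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

train_titles = ["Senior Android Developer", "Internship Data Analyst"]
test_titles = ["Senior Data Scientist"]

vocab = fit_vocabulary(train_titles)
print(list(vocab))                     # column order of the feature matrix
print(transform(train_titles, vocab))
print(transform(test_titles, vocab))   # "scientist" is not in the vocabulary
```

The test row only has counts for "data" and "senior": a word that never appeared in a training title contributes nothing, which is the same reason the test titles are passed to `transform` rather than `fit_transform`.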

## Model Fitting

Similarly, tree-based models are fitted on the feature matrix.

In [22]:
# The unused sklearn.cross_validation import is dropped; that module was removed in later releases.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

In [49]:
clf1=DecisionTreeClassifier(random_state=999, min_samples_leaf=1)
clf2=RandomForestClassifier(n_estimators=1000,random_state=500,n_jobs=-1,min_samples_leaf =1 )
clf3=ExtraTreesClassifier(n_estimators=1000,random_state=500,n_jobs=-1,min_samples_leaf =1 )
eclf=VotingClassifier(estimators=[('rf', clf2), ('et', clf3)], voting='hard')
Classifiers = [clf1,clf2,clf3,eclf]

In [50]:
Model = []
Accuracy = []
for clf in Classifiers:
    fit = clf.fit(feature_title ,train['Experience'])
    pred = fit.predict(test_feature_title)
    Accu = accuracy_score(test['Experience'], pred)  # argument order is (y_true, y_pred)
    Accuracy.append(Accu)
    Model.append(clf.__class__.__name__)
    print('Accuracy of '+clf.__class__.__name__+' is '+str(Accu))

Accuracy of DecisionTreeClassifier is 0.6875
Accuracy of RandomForestClassifier is 0.709239130435
Accuracy of ExtraTreesClassifier is 0.70652173913
Accuracy of VotingClassifier is 0.711956521739


The voting classifier that combines the random forest and the extra-trees model produces a result that is better than any of the individual models that used the job description as features.
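
One detail worth keeping in mind with `voting='hard'` and only two estimators: whenever the two models disagree, the vote is tied. As far as I understand the scikit-learn documentation, ties are resolved by the ascending sort order of the class labels. The pure-Python sketch below mimics that rule; `hard_vote` is a hypothetical helper written for illustration, not the library's implementation.

```python
from collections import Counter

# Class labels in ascending sort order, as a label encoder would order them.
CLASSES = sorted(['Entry', '1 - 3 Years', '3 - 5 Years'])

def hard_vote(votes, classes=CLASSES):
    # Count the votes and find the highest count.
    counts = Counter(votes)
    top = max(counts.values())
    # classes is sorted, so the first class reaching the top count wins ties.
    return next(c for c in classes if counts[c] == top)

print(hard_vote(['3 - 5 Years', '3 - 5 Years']))  # agreement: 3 - 5 Years
print(hard_vote(['Entry', '3 - 5 Years']))        # tie: 3 - 5 Years
print(hard_vote(['Entry', '1 - 3 Years']))        # tie: 1 - 3 Years
```

Under this rule a two-model hard vote never favors 'Entry' in a disagreement, which is one reason a soft vote or a third, different model may be a more principled choice.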

In [51]:
pd.crosstab(test['Experience'], pred, rownames=['Actual'], colnames=['Predicted'])

| Actual / Predicted | 1 - 3 Years | 3 - 5 Years | Entry |
|---|---|---|---|
| 1 - 3 Years | 137 | 44 | 2 |
| 3 - 5 Years | 42 | 72 | 0 |
| Entry | 17 | 1 | 53 |


The biggest confusion is still between the 1 - 3 Years and 3 - 5 Years positions.
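
The counts in the table can be turned into per-class recall and precision with a few lines of arithmetic. The sketch below hard-codes the nine counts from the crosstab output (rows are actual classes, columns are predicted classes) and needs no libraries.

```python
labels = ['1 - 3 Years', '3 - 5 Years', 'Entry']
# Rows are actual classes, columns are predicted classes (counts from the crosstab output).
confusion = [
    [137, 44, 2],
    [42, 72, 0],
    [17, 1, 53],
]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(3)) / total
print('accuracy: %.3f' % accuracy)  # 0.712, matching the voting classifier

for i, label in enumerate(labels):
    # Recall: correct predictions over all actual members of the class.
    recall = confusion[i][i] / sum(confusion[i])
    # Precision: correct predictions over everything predicted as the class.
    precision = confusion[i][i] / sum(row[i] for row in confusion)
    print('%-12s recall %.3f  precision %.3f' % (label, recall, precision))
```

Read this way, 44 of the 183 actual 1 - 3 Years postings are predicted as 3 - 5 Years, and 42 of the 114 actual 3 - 5 Years postings are predicted as 1 - 3 Years, while predictions of Entry are almost always correct.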

In [53]:
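fit = clf2.fit(feature_title, train['Experience'])
# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it.
words = v.get_feature_names_out()
importance = clf2.feature_importances_
impordf = pd.DataFrame({'Word': words, 'Importance': importance})
# Sort by importance (descending), breaking ties alphabetically by word.
impordf = impordf.sort_values(['Importance', 'Word'], ascending=[False, True])
impordf.head(20)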

Unnamed: 0,Importance,Word
670,0.07083,internship
666,0.04548,intern
1125,0.041578,senior
754,0.014084,manager
1378,0.013972,wo
571,0.01264,hbo
423,0.011284,engineer
1183,0.011235,stage
53,0.011155,afstudeeropdracht
628,0.010414,informatica


**My data set is not large enough, so the positions from 'Thales' may bias the result, as this company consistently recruits intern/junior-level staff. With a sufficiently large data set, this problem may vanish.**

# Classification based on both position and job description

## Create New Features

What happens if the features from the job description and those from the job title are combined? **To find out something more interesting, I try a stacked model: the class probabilities predicted by the description model are appended to the job-title features, and a new model is trained on the combined features to make the new prediction. I am not at all sure how well this model will perform.**

In [61]:
train_desc = train['clean'].tolist()

c = CountVectorizer()
feature_1 = c.fit_transform(train_desc)

test_desc = test['clean'].tolist()
# Reuse the vocabulary learned on the training descriptions.
test_features_1 = c.transform(test_desc)

In [65]:
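clf = RandomForestClassifier(n_estimators=200, random_state=999, n_jobs=-1, min_samples_leaf=1)
fit = clf.fit(feature_1, train['Experience'])
# Class probabilities for the test set (out of sample).
pred_o = fit.predict_proba(test_features_1)
# Class probabilities for the training set (in sample: the model has already seen these rows).
pred_i = fit.predict_proba(feature_1)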

Then we append these probabilities to the title features to create the new feature arrays.

In [85]:
feature_t = feature_title.toarray()
test_feature_t = test_feature_title.toarray()
New_features = np.concatenate((feature_t,pred_i),axis=1)
New_test_features = np.concatenate((test_feature_t,pred_o),axis=1)
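
One caveat about `pred_i`: it is computed on the same rows the description model was trained on, so those probabilities are likely to be far more confident than the test-set probabilities in `pred_o`. A common way to reduce that mismatch is to use out-of-fold probabilities for the training rows. The sketch below illustrates the idea on synthetic data (the data set, sizes and parameters are assumptions for illustration, not the notebook's data), using scikit-learn's `cross_val_predict` with `method='predict_proba'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the description features and the experience labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold probabilities: each row is scored by a model that never saw it.
oof_proba = cross_val_predict(clf, X, y, cv=5, method='predict_proba')

# In-sample probabilities, computed the same way as pred_i, for comparison.
in_sample_proba = clf.fit(X, y).predict_proba(X)

in_acc = (in_sample_proba.argmax(axis=1) == y).mean()
oof_acc = (oof_proba.argmax(axis=1) == y).mean()
print('in-sample accuracy:  ', in_acc)
print('out-of-fold accuracy:', oof_acc)
```

If the in-sample accuracy is close to 1 while the out-of-fold accuracy is much lower, a second-stage model trained on in-sample probabilities learns to trust them more than is justified at test time. Whether this would change the results here is untested.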

## Model Fitting

One thing I worry about is potential overfitting, so I will spend more time on parameter tuning here.

In [110]:
clf1_2=DecisionTreeClassifier(random_state=999, min_samples_leaf=1)
clf2_2=RandomForestClassifier(n_estimators=1000,random_state=99,n_jobs=-1,min_samples_leaf =1,max_features='log2' )
clf3_2=ExtraTreesClassifier(n_estimators=1000,random_state=99,n_jobs=-1,min_samples_leaf =1,max_features='log2' )
eclf_2=VotingClassifier(estimators=[('rf', clf2_2), ('et', clf3_2)], voting='soft',weights=[1,1.1])
Classifiers_2 = [clf1_2,clf2_2,clf3_2,eclf_2]

In [111]:
Model_2 = []
Accuracy_2 = []
for clf_ in Classifiers_2:
    fit = clf_.fit(New_features ,train['Experience'])
    pred = fit.predict(New_test_features)
    Accu = accuracy_score(test['Experience'], pred)  # argument order is (y_true, y_pred)
    Accuracy_2.append(Accu)
    Model_2.append(clf_.__class__.__name__)
    print('Accuracy of '+clf_.__class__.__name__+' is '+str(Accu))

Accuracy of DecisionTreeClassifier is 0.682065217391
Accuracy of RandomForestClassifier is 0.703804347826
Accuracy of ExtraTreesClassifier is 0.717391304348
Accuracy of VotingClassifier is 0.711956521739


The new model performs about the same as the model built on the job title only: the best accuracy rises only from 0.712 to 0.717 (ExtraTreesClassifier), and the voting classifier is unchanged at 0.712.

# Conclusion

Several brief conclusions can be drawn:

a. The job title is short but clearly more informative than the description. Data integrity is also better with the job title as the feature source.

b. About ensemble models: an ensemble should typically combine relatively different models (such as logistic regression and a tree model) rather than closely related ones. However, I tested several other models (LR, SVC and NB) and none of them gave a satisfying result, so I used tree-based models only. I do believe that, with careful parameter tuning, good results can also be obtained from non-tree-based models.
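
As a rough sketch of what point b has in mind, the snippet below combines a logistic regression with a random forest in a soft-voting ensemble. It runs on synthetic data (an assumption for illustration, not the job-posting data), and the parameters are untuned defaults, so it shows only the mechanics and says nothing about how such an ensemble would score on this data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real features and labels.
X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Two structurally different models; soft voting averages their class probabilities.
mixed = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=0))],
    voting='soft')
mixed.fit(X_train, y_train)

acc = accuracy_score(y_test, mixed.predict(X_test))
print('accuracy of mixed ensemble:', acc)
```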