In [1]:
import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import pickle

In this notebook, I replicate and document the code that I wrote with another member of my team during the **#HackTheNorth** event that took place in **Manchester on the 23rd and 24th of November 2017** (see https://dwpdigital.blog.gov.uk/2017/11/09/hackthenorth-helping-manchester-work-and-grow/). This event, organised by DWP Digital, involved the design and creation of data-based solutions aimed at producing a big impact on people's lives, with a special focus on tackling the local unemployment problem.

My team consisted of members with different backgrounds: DWP data scientists, back-end and front-end developers, business analysts and work coaches, among others. Together we worked on different elements of a complete system aimed at improving the job search experience at different levels. The system consisted of the following elements:

- A machine learning model to classify jobs into either technical or non-technical solely based on the job description.
- An application with the following features: a) build a job seeker's profile in terms of digital skills, based on a casual aptitude test, b) match the user to a set of technical/non-technical jobs based on the results of the previous step, and c) facilitate job search by means of a Tinder-like interface.

The idea of the application was to reduce the frustration produced during job search when facing many different job offers that may not match our skills. 

Obviously, this was a very limited prototype due to the hack event's deadline, but many other ideas were considered, such as identifying spam jobs (by applying the same machine learning process used to classify jobs into technical and non-technical) or leveraging other sources of data (e.g. transport data to help filter jobs based on commute time and transport availability).

During the event, Izzy and I worked together on collecting and analysing job description data towards the creation of a **simple document classification model**. Collecting the data was by far the most time-consuming step, mainly because we had to download and label the job descriptions manually. But this was the lesser evil, because building a scraper wouldn't have left us with enough time to work on the classification model. In spite of this, and thanks to the scikit-learn module, we were able to get a simple model up and running and iteratively improve it to the point of reaching a reasonably high accuracy. 

## Basic model

As mentioned earlier, our text corpus was created by manually downloading job descriptions and manually labelling them as technical or non-technical. Our definition of technical or non-technical was based on whether any **digital skill** was required in order to perform the advertised job. Fortunately, we could look for jobs on **Universal Jobmatch** (https://jobsearch.direct.gov.uk/) based on skills. Other job search platforms like Monster or Indeed only allow searching by job title. Therefore, thanks to Universal Jobmatch, all we had to do was to think of skills that could be considered digital or non-digital, search for jobs based on those skills, and download the job description section of each job.

It must be noted that job descriptions were downloaded and stored without any alteration. Some job descriptions include the job title, or even contact details. As the validation results at the bottom of this notebook show, including the job title may bias the classifier, but we decided to minimise the preprocessing of this data. 

For the purpose of this notebook I downloaded 24 technical and 24 non-technical job descriptions. They were stored in the *data/training/technical* and *data/training/non_technical* folders. This folder structure was chosen so that we could use scikit-learn's *load_files* function to easily load the documents and automatically assign them a label (the folder names technical and non_technical are automatically used by load_files to label the data).


In [2]:
job_descriptions = load_files('data/training')
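To make the folder convention concrete, here is a minimal sketch of how *load_files* turns sub-folder names into labels. The temporary directory, the file name `job_01.txt` and the two one-line documents are made up for illustration; they are not part of the corpus used in this notebook.

```python
import tempfile
from pathlib import Path
from sklearn.datasets import load_files

# Throwaway stand-in for data/training: one sub-folder per class (hypothetical content)
with tempfile.TemporaryDirectory() as root:
    for label, text in [('technical', 'C# developer wanted'),
                        ('non_technical', 'Roofing tiler wanted')]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / 'job_01.txt').write_text(text, encoding='utf-8')
    toy = load_files(root)  # file contents are read into memory here

# Folder names become the class names, sorted alphabetically
target_names = toy.target_names
# Documents are returned as bytes unless an encoding is passed to load_files
labels_by_doc = {doc: toy.target_names[t] for doc, t in zip(toy.data, toy.target)}
print(target_names)
print(labels_by_doc)
```

Because the class names are sorted, `non_technical` gets label 0 and `technical` gets label 1, which matches the targets printed in the cells that follow.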

The code below shows an example of one technical and one non-technical job description along with an example on how the labels are assigned to documents:

In [3]:
job_descriptions.data[0]

b"Software Developer- C#, Visual Studio, .Net, MVC- Chorley\n\nAbout the role\n\nRapid growth, expansion into new areas and demand from our customers has led to the need to expand our in-house software development team. We are now seeking a Software Developer to join the team. You will be responsible for playing a critical role in the development of new and existing applications and integrations in an agile environment.\n\nAbout Capita- Parking Eye\n\nWe are Capita, the UK's leading provider of business process management and integrated professional support service solutions. Through bespoke, quality solutions, we've helped countless organisations unlock value and maximise their potential. With access to our range of unique and diverse opportunities, offering real career advancement and progression, we can unlock your potential too.\n\nParkingEye (part of Capita) is the market leading car park management company. ParkingEye not only provide full circle car park management services but 

In [4]:
print(job_descriptions.target[0])
print(job_descriptions.target_names[job_descriptions.target[0]])

1
technical


In [5]:
job_descriptions.data[31]

b'Redbooth Ltd require roofing joiners immediately \n\nRedbooth Ltd require roofing tilers immediately for works on new build housing projects around the stockton and north east areas. Must be time served experienced and proficient with slating and concrete roof tiling. \n\nPlease note this is an immediate start so please call us on 07510439507.\n'

In [6]:
print(job_descriptions.target[31])
print(job_descriptions.target_names[job_descriptions.target[31]])

0
non_technical


In the next cell we randomly split our training set into a training and a test set (70% and 30% of the data, respectively). We will train the model using the training set and produce a report of the classification results using the test set. 

In [7]:
X_train, X_test, y_train, y_test = train_test_split(job_descriptions.data, job_descriptions.target, test_size=0.3, random_state=0)

In order to build a document classifier, we first have to transform each document into a vector of numerical features. We first applied a **bag of words model**: each document is represented by means of a feature vector in which each feature is a word and the value of the feature is the number of occurrences of that word in the document. The *CountVectorizer* object below uses the whole set of words in the training set to represent each single document (after transforming to lower case). The resulting data is very sparse, but scikit-learn uses a sparse representation to avoid memory issues. 

We used raw words instead of normalised tokens (word roots, for instance) to build the bag of words representation. *CountVectorizer* accepts a custom tokenizer that could be used for this purpose. This is left for future work.  

In [8]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(33, 2475)
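As a small illustration of what the bag of words matrix contains, the sketch below applies *CountVectorizer* with default settings to two made-up sentences (they are not taken from the corpus). Each column corresponds to one lower-cased word and each cell holds the number of times that word occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents, for illustration only
docs = ["Python developer wanted",
        "Roofing tiler wanted, wanted now"]

vect = CountVectorizer()
counts = vect.fit_transform(docs).toarray()  # dense copy, fine for a toy example

# Column order follows the alphabetically sorted vocabulary
columns = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(columns)
print(counts)
```

The word "wanted" appears in both documents, so its column is non-zero in both rows, with a count of 2 in the second one.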

Using the raw word count may not be advisable when the number of words differs widely between documents. We observed that technical job descriptions are on average longer than non-technical job descriptions. Therefore, we decided to use proportions, and more specifically the *TfidfTransformer* class, which transforms the bag of words representation into a **tf-idf (term-frequency times inverse document-frequency) representation**. 

The goal of using tf-idf instead of the raw frequencies of occurrence of a word in a given document is to scale down the impact of words that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

In [9]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(33, 2475)
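To show what this weighting does, here is a hand-written sketch of the computation that *TfidfTransformer* performs with its default settings, as I understand them from the scikit-learn documentation: a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf` and the small count matrix are illustrative assumptions.

```python
import math

def tfidf(counts):
    """counts: list of documents, each a list of raw term counts (one column per term)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard against empty documents
        result.append([v / norm for v in weighted])
    return result

# Hypothetical counts: the last term occurs in both documents, the others in only one
counts = [[1, 0, 1, 0, 0, 1],
          [0, 1, 0, 1, 1, 2]]
weights = tfidf(counts)
print(weights[0])
```

In the first row, the term shared by both documents ends up with a lower weight than the terms that are unique to that document, even though all three have the same raw count.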

Once we had transformed the data into a feature-based representation, we fed this data into a **Multinomial Naive Bayes classifier** in order to build a classification model. Naive Bayes is extensively used in text-related machine learning tasks due to the tradeoff between its high performance in this domain and its simplicity. The assumption that the distribution of words in documents is multinomial is also very popular. 

In [10]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)

Once the model has been trained we can predict the class (technical/non-technical) of the test set. Notice that we need to transform the raw data (the job descriptions) into the same tf-idf representation that was applied to the training set. For this purpose we invoke the transform method of the **same** *CountVectorizer* and *TfidfTransformer* objects that were fitted above to build the representation of the training set. 

In [11]:
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)

And we can use the predicted labels to print a very handy report showing the precision, recall and f1-score for each class. There were no false positives in the case of non_technical documents, and there were no false negatives in the case of technical documents. Several non-technical documents were classified as technical, but all the documents that were classified as non-technical were indeed non-technical. 

We will use the f1-score as an average measure of prediction accuracy. The obtained value (0.63) is not bad considering the simplicity of the approach. It is better than the baseline random prediction (0.5).

In [12]:
print(classification_report(y_test, y_pred, labels=None, target_names=job_descriptions.target_names))

               precision    recall  f1-score   support

non_technical       1.00      0.38      0.55         8
    technical       0.58      1.00      0.74         7

  avg / total       0.81      0.67      0.63        15
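The figures in the report can be cross-checked by hand. The sketch below rebuilds them from confusion counts that I inferred from the report itself (3 of the 8 non-technical documents classified correctly, all 7 technical documents classified correctly); these counts are a reconstruction, not an output of the original run.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the report above (assumption, see lead-in)
non_tech = precision_recall_f1(tp=3, fp=0, fn=5)
tech = precision_recall_f1(tp=7, fp=5, fn=0)

# The "avg / total" row weights each class by its support (8 and 7 documents)
weighted_f1 = (8 * non_tech[2] + 7 * tech[2]) / 15
accuracy = (3 + 7) / 15

print(round(non_tech[2], 2), round(tech[2], 2), round(weighted_f1, 2), round(accuracy, 2))
```

Under these counts the plain accuracy is 10 out of 15, which coincides with the support-weighted recall shown in the last row of the report.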



## Using pipelines

The training and prediction process can be simplified by using pipelines. Our basic model above requires three stages which have to be executed in order: counting the number of words, computing proportions and weighting according to the tf-idf criterion, and finally classification itself.

We can represent this process by means of the pipeline below, in which we feed each stage with the output of the previous stage. The pipeline receives a list of stages as an input. Each stage consists of a name and an object. 

In [13]:
pip_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

Now we can use the pipeline definition to fit our basic model with one single line of code:

In [14]:
pip_clf.fit(X_train, y_train) 

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

Prediction also requires a single line of code:

In [15]:
y_pred = pip_clf.predict(X_test)

And we obtain exactly the same results as in the previous section:

In [16]:
print(classification_report(y_test, y_pred, labels=None, target_names=job_descriptions.target_names))

               precision    recall  f1-score   support

non_technical       1.00      0.38      0.55         8
    technical       0.58      1.00      0.74         7

  avg / total       0.81      0.67      0.63        15



## Model selection: cross validation

Rather than randomly splitting the data into training and test sets only once, we can of course apply 10-fold cross validation to make better use of our small dataset and get a better idea of what the actual accuracy rate may be:

In [17]:
cv_scores = cross_val_score(pip_clf, job_descriptions.data, job_descriptions.target, cv=10)
print([np.mean(cv_scores), np.std(cv_scores)])

[0.80000000000000004, 0.13540064007726602]
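The fairly large standard deviation is easier to understand when looking at the size of each test fold. When `cv` is an integer and the estimator is a classifier, *cross_val_score* uses stratified folds, so the sketch below uses *StratifiedKFold* directly on a placeholder dataset with the same class balance as our corpus (24 + 24 documents). The placeholder features are an assumption; only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 24 + [1] * 24)  # same class balance as the corpus
X = np.zeros((len(y), 1))          # placeholder features, not the real documents

skf = StratifiedKFold(n_splits=10)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
print(fold_sizes)
```

With only a handful of documents in each test fold, a single misclassified document changes the accuracy of that fold by roughly 0.2, so the per-fold scores are necessarily coarse.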


## Model selection: grid search

Our document classification pipeline is built by concatenating three different stages. The behaviour of each stage can be modified by means of a set of hyperparameters, but so far we used the default values. In this section, we investigate the effect of modifying the value of some of these hyperparameters, as well as the impact of using different elements in our pipeline, like a different classification algorithm (Support Vector Machines) or removing the *TfidfTransformer* step. 

Given a set of lists of candidate hyperparameter values, the *GridSearchCV* class applies cross validation to each combination of hyperparameter values. We can then use this information to decide which values to set for our model's hyperparameters before training the model that we will deploy with our application. 

*GridSearchCV* also lets us use different objects in a given stage of the pipeline. Unfortunately, we cannot "disable" a stage. In the context of our document classification problem we realised that the tf-idf step actually decreased the prediction accuracy in the case of the basic model. Should we remove this stage or keep it? Since we could not enable or disable stages of the pipeline, we had to define a wrapper for the *TfidfTransformer* class that would allow us to activate or deactivate it by means of a boolean parameter:

In [18]:
class TfidfTransformerOptional(TfidfTransformer):
    def __init__(self, activate=True):
        super().__init__()
        self.activate = activate
    
    def fit_transform(self, X, y=None):
        if self.activate:
            return super().fit_transform(X, y)
        else:
            return X
        
    def transform(self, X, y=None):
        if self.activate:
            return super().transform(X)  # the parent's second positional argument is `copy`, not `y`
        else:
            return X
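For readers on newer scikit-learn releases (0.21 and later, to the best of my knowledge), a pipeline step can also be replaced by the string `'passthrough'`, which skips that step without a wrapper class. The sketch below is an alternative to the wrapper above, not what we used during the event; the four-document corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = technical, 0 = non_technical
docs = ["python developer", "java developer", "roofing tiler", "car mechanic"]
labels = [1, 1, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# In a grid search, the same switch could be written as a param_grid entry
param_grid = {'tfidf': [TfidfTransformer(), 'passthrough']}

# Here the step is disabled by hand to show the effect on a single fit
pipe.set_params(tfidf='passthrough')
pipe.fit(docs, labels)
pred = pipe.predict(["senior python developer"])
print(pred)
```

With the tf-idf step bypassed, the classifier receives the raw counts, which is the same behaviour as `activate=False` in the wrapper.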

Now we can define a parameter grid (the *param_grid* variable below) to tell *GridSearchCV* about all the possibilities that we want to explore. In this example:

- We defined a set of values for the *min_df* and *max_df* parameters of *CountVectorizer*. By using these parameters we can filter out words from the bag of words model that are not frequent enough (*min_df*) or are too frequent (*max_df*).
- We compared two classification algorithms: Multinomial Naive Bayes and Support Vector Machines. In the case of Support Vector Machines we tested different values of the regularisation parameter (*C*).
- Finally, we gave the option to test the effect of including the *TfidfTransformer* stage or not. 

Notice how we can set the number of jobs to be run in parallel during the grid search process by setting a value for *GridSearchCV*'s *n_jobs* parameter.

In [19]:
pip_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformerOptional()),
                    ('clf', MultinomialNB()),
])

param_grid = [
    {
        'vect__min_df': np.arange(0,0.6,0.1),
        'vect__max_df': np.arange(0.6,1.1,0.1),
        'tfidf__activate': [True, False],
        'clf': [MultinomialNB()],
    },
    {
        'vect__min_df': np.arange(0,0.6,0.1),
        'vect__max_df': np.arange(0.6,1.1,0.1),
        'tfidf__activate': [True, False],
        'clf': [SVC()],
        'clf__C': [0.5, 1, 1.5]
    },
]

grid = GridSearchCV(pip_clf, cv=10, n_jobs=3, param_grid=param_grid)
grid.fit(job_descriptions.data, job_descriptions.target)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...formerOptional(activate=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=3,
       param_grid=[{'vect__min_df': array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5]), 'vect__max_df': array([ 0.6,  0.7,  0.8,  0.9,  1. ,  1.1]), 'tfidf__activate': [True, False], 'clf': [MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)]}, {'vect__min_df': array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5]),...ty=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)], 'clf__C': [0.5, 1, 1.5]}],
       pre_dispatch='2*n_jobs', refit=True, return_tr
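The printed grid above shows that the `vect__max_df` values end at 1.1 rather than 1.0. This is a floating point side effect of `np.arange` with a non-integer step, and a float `max_df` above 1.0 may be rejected by the stricter parameter validation of newer scikit-learn releases. A minimal sketch of a safer way to build the same ranges, assuming the intended bounds were 0.6 to 1.0 and 0.0 to 0.5, is to use `np.linspace`, which takes the number of points and includes both endpoints:

```python
import numpy as np

# np.arange with a float step may overshoot the intended upper bound
max_df_arange = np.arange(0.6, 1.1, 0.1)

# np.linspace fixes the number of points and includes both endpoints
max_df_values = np.linspace(0.6, 1.0, 5)
min_df_values = np.linspace(0.0, 0.5, 6)

print(max_df_arange)
print(max_df_values)
print(min_df_values)
```

The results reported in this notebook were obtained with the original `np.arange` grid, so switching to `np.linspace` would remove the 1.1 candidate from the search.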

As we can see below, the best results were obtained when using the Multinomial Naive Bayes classifier with *min_df* = 0.2 and *max_df* = 0.7 and without the tf-idf step:

In [20]:
mean_scores = list(grid.cv_results_['mean_test_score'])
index_max_score = mean_scores.index(max(mean_scores))
print(mean_scores[index_max_score])
print(grid.cv_results_['param_vect__min_df'][index_max_score])
print(grid.cv_results_['param_vect__max_df'][index_max_score])
print(grid.cv_results_['param_clf'][index_max_score])
print(grid.cv_results_['param_clf__C'][index_max_score])
print(grid.cv_results_['param_tfidf__activate'][index_max_score])

0.895833333333
0.2
0.7
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
--
False
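The same information can be read from attributes that *GridSearchCV* exposes after fitting: `best_score_`, `best_params_` and `best_index_`. The sketch below shows them on a made-up eight-document corpus with a deliberately tiny grid over the Naive Bayes `alpha` parameter; neither the corpus nor the grid is the one used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = technical, 0 = non_technical
docs = ["python developer", "java developer", "sql analyst", "web developer",
        "roofing tiler", "car mechanic", "kitchen porter", "delivery driver"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
toy_grid = GridSearchCV(pipe, param_grid={'clf__alpha': [0.1, 1.0]}, cv=2)
toy_grid.fit(docs, labels)

best_score = toy_grid.best_score_    # mean cross-validated score of the best candidate
best_params = toy_grid.best_params_  # dictionary with the winning parameter values
print(best_score, best_params)
```

`best_index_` points at the same row of `cv_results_` that the manual search for the maximum mean score locates.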


Now we are ready to train the document classification model that we will deploy as part of the application: we use the complete corpus of job descriptions that we downloaded from the Universal Jobmatch website and the pipeline definition that performed best in the process described above.

We store the model on disk so that it can be used later from different scripts. 

In [21]:
pip_clf = Pipeline([('vect', CountVectorizer(min_df = 0.2, max_df = 0.7)),
                    ('tfidf', TfidfTransformerOptional(activate=False)),
                    ('clf', MultinomialNB()),
])
pip_clf.fit(job_descriptions.data, job_descriptions.target) 

with open('models/model.p', 'wb') as f:  # the context manager closes the file once the dump is done
    pickle.dump([pip_clf, job_descriptions.target_names], f)
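For completeness, here is a minimal sketch of the round trip: saving a model together with its class names and loading both back. It uses a toy two-step pipeline and a temporary directory instead of `models/model.p`, both of which are assumptions made so that the example is self-contained.

```python
import pickle
import tempfile
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the real model and class names (hypothetical data)
toy_model = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
toy_model.fit(["python developer", "roofing tiler"], [1, 0])
toy_names = ['non_technical', 'technical']

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'model.p'
    with open(path, 'wb') as f:
        pickle.dump([toy_model, toy_names], f)
    with open(path, 'rb') as f:
        loaded_model, loaded_names = pickle.load(f)

predicted = loaded_names[loaded_model.predict(["python developer"])[0]]
print(predicted)
```

One caveat for the real model: pickle stores a reference to each class rather than its code, so a script that loads a pipeline containing *TfidfTransformerOptional* needs that class to be importable under the same module path it had when the model was saved.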

## Testing our model with the validation data

Let's test our document classification model! We downloaded a new corpus of job descriptions consisting of three technical and three non-technical job roles, according to our definition of what a technical job is (based on the requirement of digital skills). To build this validation set we searched for jobs using completely different skills from those used to build the training dataset.

The code below loads and classifies each document independently. 

In [22]:
files = ['non_technical_01.txt', 'non_technical_02.txt', 'non_technical_03.txt',
         'technical_01.txt', 'technical_02.txt', 'technical_03.txt']
for fil in files:
    with open('data/validation/' + fil, 'r') as f:
        text = [' '.join(f.readlines())] # The input to the pipeline has to be an array of documents
                                           # even if we are only planning to process a single document
        pred = pip_clf.predict(text)
        
        h = '------------------------------------------'
        print(h + ' ' + fil + ' ' + h)
        print('Predicted class: ' + job_descriptions.target_names[pred[0]])
        print(h + h)
        print(' ')
        print(text[0])

------------------------------------------ non_technical_01.txt ------------------------------------------
Predicted class: non_technical
------------------------------------------------------------------------------------
 
WE HAVE A VACANCY FOR A CAR MECHANIC AT OUR CAR SALES IN LLANTWIT FARDRE.EXPERIENCE IS ESSENTIAL AS WILL BE DOING ALL ASPECTS OF MECHANICS INCLUDING HEAD GASKETS GEARBOX CHANGE AND CLUTCHES.DUTIES ALSO INCLUDE SERVICING CHANGING TYRES AND WORKING AS PART OF A TEAM TO REACH OUR TARGETS.A FULL DRIVING LICENSE IS ALSO VITAL AS THE JOB INVOLVES DRIVING BETWEEN OUR CAR SALES AND TO THE MOT STATION.THE POSITION IS FULL TIME AND PERMANENT HOURS AND WAGES TO BE DISCUSSED AT INTERVIEW PROCESS.PLEASE CALL REVOLUTION CARS FOR MORE DETAILS

------------------------------------------ non_technical_02.txt ------------------------------------------
Predicted class: technical
------------------------------------------------------------------------------------
 
Over our 25 years o

## Discussion of the validation results and conclusions

The validation accuracy rate was 0.67 (four out of six documents). We misclassified one technical job and one non-technical job. 

- The non-technical lift engineer job (once again, we are assuming that a lift engineer does not need to use computers to do his/her job) was classified as technical. One possible explanation may be that the word "engineer" is frequent in technical jobs, but absent from all of the non-technical job descriptions. Therefore, our classification model is biased and will think that all engineers require digital skills in their jobs. We could fix this issue by adding non-technical job descriptions containing the word "engineer" or "engineering" to our training dataset. We already did this in the past when we detected that we had added some NHS-related jobs to the technical set: the solution was to add non-technical NHS-related jobs. 
- The misclassified technical job is more of a mystery. All the relevant words seem to be there: there are many programming language names or IT-related technologies. Could it be that these words were filtered out by the bag of words model because they are not that frequent? Another possibility is the length of that job description. We observed that, on average, technical job descriptions tend to be much longer and more detailed. Should we add shorter technical job descriptions to our training set?
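The first hypothesis for the misclassified technical job can be checked directly: after fitting, *CountVectorizer* keeps the retained words in its `vocabulary_` attribute, so any word missing from it was ignored by the model. With *min_df* = 0.2 and 48 training documents, a word has to occur in roughly 10 documents to be kept, which a specific programming language name may well not reach. The sketch below illustrates the effect on a made-up four-document corpus with *min_df* = 0.5; the corpus and the threshold are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: each language name occurs in a single document
docs = ["python developer role", "java developer role",
        "sql developer role", "developer role"]

vect = CountVectorizer(min_df=0.5)  # keep words found in at least half of the documents
vect.fit(docs)

kept = sorted(vect.vocabulary_)
dropped = [w for w in ("python", "java", "sql") if w not in vect.vocabulary_]
print(kept)
print(dropped)
```

Running the same membership test on the `vect` step of the deployed pipeline, with the words from the misclassified description, would show which of them the classifier actually saw.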

I am surprised that such a simple model seems to work that well. However, the results have to be taken with a pinch of salt due to the small size of the dataset. We have to consider random variation. Aside from increasing the size of the dataset, I can think of the following ways in which we could improve this work:

- Data scraping: we would still have to review the labels of the job descriptions manually, but we could at least assign initial labels automatically based on the skill names used to search for jobs. 
- Further model selection: we didn't test all the possibilities. For instance, we didn't test the impact of the alpha parameter in the case of the Naive Bayes classifier. And we only tested two classifiers. 
- Using normalised tokens (word roots, for instance) instead of raw words should help to group together related words that should have the same interpretation.