# Skill extraction with a Logistic Regression classifier

## Overview

In this notebook, we train a logistic regression classifier on two datasets: IMDB reviews and LinkedIn profiles. The purpose of the classifier is to identify whether a phrase is a skill or not. The process can be summarized as follows:

- extract noun chunks from IMDB review text and label them as 0 for "not skill"
- parse skills from the LinkedIn profile dataset and label them as 1 for "skill"
- merge the two processed datasets and split them into training and test sets
- set up an sklearn pipeline:
  - count vectorizer with lemmatization and stop word removal
  - logistic regression
- fit the pipeline on the training data and analyze model performance on the test set
- test with external job descriptions and random text

## Creating a training dataset from LinkedIn skills (skill) and IMDB reviews (not skill)

The datasets can be downloaded at:
- https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- https://www.kaggle.com/linkedindata/linkedin-crawled-profiles-dataset

### IMDB Reviews Dataset

In [1]:
import pandas as pd

In [2]:
imdb_df = pd.read_csv('/Users/justinnaing/Workspace/MDSI/datasets/kaggle/IMDB Dataset.csv')
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null object
dtypes: object(2)
memory usage: 781.3+ KB


In [3]:
imdb_df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | `A wonderful little production. <br /><br />The...` | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


We are going to get the `review` column and extract noun phrases using `spaCy`. The extracted noun chunks will be labeled as 0 for "not skill".

In [9]:
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Punctuation characters to filter out during tokenization
punctuations = string.punctuation

# Load the small English model and spaCy's built-in stop word list
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

In [71]:
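def get_train_data(reviews):
    # noun_chunks needs the dependency parse, so only NER is disabled.
    # (The parser component is named 'parser'; 'dep' is not a pipeline component.)
    for doc in nlp.pipe(reviews, disable=['ner']):
        for nc in doc.noun_chunks:
            # Skip chunks that consist of a single stop word, e.g. "it" or "we"
            if nc.lower_ not in stop_words:
                yield {'text': nc.text, 'label': 0}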

In [74]:
not_skills_df = pd.DataFrame(list(get_train_data(imdb_df['review'])))

### LinkedIn Skills Dataset

In [78]:
skills_file = '/Users/justinnaing/Workspace/MDSI/datasets/linkedin-crawled-profiles-dataset/linkedin.json'

In [79]:
import json
from contextlib import suppress

In [80]:
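def get_train_data_skills(skills_file):
    # The file is read line by line, one JSON object per line
    with open(skills_file) as f:
        for line in f:
            data = json.loads(line)
            # Profiles without a 'skills' key are skipped
            with suppress(KeyError):
                for skill in data['skills']:
                    yield {'text': skill, 'label': 1}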

In [82]:
skills_df = pd.DataFrame(list(get_train_data_skills(skills_file)))
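
To illustrate how the generator above behaves, here is a minimal sketch that runs the same logic on an in-memory stand-in for the file. The three records and the helper name `iter_skills` are hypothetical, and the sketch assumes the file holds one JSON object per line, as the line-by-line `json.loads` call implies.

```python
import io
import json
from contextlib import suppress

# Hypothetical records standing in for lines of linkedin.json;
# the second record has no 'skills' key.
sample = io.StringIO(
    '{"name": "A", "skills": ["Negotiation", "Team Leadership"]}\n'
    '{"name": "B"}\n'
    '{"name": "C", "skills": ["Market Planning"]}\n'
)

def iter_skills(lines):
    """Yield one labelled row per skill; skip records without 'skills'."""
    for line in lines:
        record = json.loads(line)
        with suppress(KeyError):
            for skill in record['skills']:
                yield {'text': skill, 'label': 1}

rows = list(iter_skills(sample))
print(rows)
```

The record without a `skills` key contributes no rows, so profiles that list no skills are dropped silently rather than raising an error.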

In [83]:
skills_df.head()

| | label | text |
|---|---|---|
| 0 | 1 | Key Account Development |
| 1 | 1 | Strategic Planning |
| 2 | 1 | Market Planning |
| 3 | 1 | Team Leadership |
| 4 | 1 | Negotiation |


### Merge data

In [85]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
train_df = pd.concat([not_skills_df, skills_df])

In [86]:
train_df['label'].value_counts()

1    16822119
0     2238940
Name: label, dtype: int64

In [60]:
train_df = train_df.dropna()

In [127]:
label_value_counts = train_df['label'].value_counts()

In [130]:
label_value_counts

1    16822112
0     2238940
Name: label, dtype: int64

In [129]:
label_value_counts[0] / label_value_counts[1]

0.13309505964530494

The classes are imbalanced: the ratio of not_skill to skill instances is 0.13, which is about 7.5 skills for every not-skill phrase.
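
As a rough sketch of what this imbalance means in practice, the snippet below recomputes the ratio from the label counts shown above and adds the accuracy a trivial "always predict skill" model would reach. The variable names are illustrative and not part of the notebook.

```python
# Label counts taken from the value_counts() output above.
n_skill = 16822112
n_not_skill = 2238940

ratio = n_not_skill / n_skill                 # minority / majority
skills_per_not_skill = n_skill / n_not_skill
# Accuracy of a trivial model that always predicts "skill".
majority_baseline = n_skill / (n_skill + n_not_skill)

print(f'ratio: {ratio:.3f}')
print(f'skills per not-skill: {skills_per_not_skill:.1f}')
print(f'always-"skill" accuracy: {majority_baseline:.3f}')
```

Because the baseline accuracy is already high, per-class precision and recall say more about the classifier than overall accuracy does.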

In [2]:
# train_df = pd.read_csv('/Users/justinnaing/Workspace/MDSI/datasets/job_skills/imdb_linkedin_training_dataset.csv')

## Training with Logistic Regression

There are about 19 million entries: roughly 17 million skills and 2 million not-skills.

In [131]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19061052 entries, 0 to 19061058
Data columns (total 2 columns):
label    int64
text     object
dtypes: int64(1), object(1)
memory usage: 436.3+ MB


We split the data into training and test sets with `test_size=0.3` and the `stratify` option. Passing the labels to `stratify` (here `stratify=y`) keeps the class proportions in both splits the same as in the original dataset.

In [61]:
from sklearn.model_selection import train_test_split

X = train_df['text']
y = train_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
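
The idea behind stratification can be sketched in plain Python: split each class separately, then combine the pieces. This is an illustration under simplifying assumptions, not scikit-learn's implementation. The function `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with a similar kind of imbalance: 80 "skill", 20 "not skill".
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.3)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Both splits keep the 4:1 class ratio of the toy data, which is the property `stratify=y` provides for the real dataset.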

To train a logistic regression model, we need to transform the text into a numeric representation that the algorithm can use. Here we convert the text into a bag of words that holds the count of each unigram. We also lemmatize the tokens and remove stop words during tokenization, before the counts are taken, to reduce the dimensionality of the feature space.

In [49]:
from spacy.lang.en import English

parser = English()

def spacy_tokenizer(text):
    my_tokens = parser(text)

    # Lemmatize and lowercase each token; '-PRON-' is the placeholder lemma
    # spaCy v2 uses for pronouns, so those keep their lowercased surface form
    my_tokens = [word.lemma_.lower().strip() if word.lemma_ != '-PRON-' else word.lower_ for word in my_tokens]

    # Remove stop words and punctuation
    my_tokens = [word for word in my_tokens if word not in stop_words and word not in punctuations]

    return my_tokens
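
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of unigram counting. It assumes the phrases have already been tokenized, lowercased, and stripped of stop words. The example phrases and the helper `to_counts` are hypothetical, and the real pipeline uses `CountVectorizer`, which produces a sparse matrix rather than nested lists.

```python
from collections import Counter

# Hypothetical, already-tokenized phrases (lowercased, stop words removed).
docs = [
    ['machine', 'learning'],
    ['deep', 'learning'],
    ['team', 'leadership'],
]

# Vocabulary: one column per distinct unigram, in sorted order.
vocab = sorted({token for doc in docs for token in doc})

def to_counts(tokens):
    """Map a token list to a vector of unigram counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocab]

matrix = [to_counts(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so merging word forms through lemmatization and dropping stop words directly reduces the number of columns.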

In [62]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time

bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1,1))
classifier = LogisticRegression()

pipe = Pipeline([('vectorizer', bow_vector),
                 ('classifier', classifier)])

t0 = time()
print('Fitting model...')
pipe.fit(X_train, y_train)
print('Time taken: ', time() - t0)

Fitting model...
Time taken:  2874.091591835022


Training, including vectorization and fitting, took about 48 minutes (2,874 seconds) on the roughly 13.3 million training instances (70% of the 19 million). The most time-consuming step is tokenization with spaCy during vectorization.

In [210]:
from sklearn.metrics import confusion_matrix, classification_report

predicted = pipe.predict(X_test)

print(confusion_matrix(y_test, predicted, labels=[1, 0]))
print(classification_report(y_test, predicted, labels=[1]))

[[4992475   54159]
 [  74842  596840]]
              precision    recall  f1-score   support

           1       0.99      0.99      0.99   5046634

   micro avg       0.99      0.99      0.99   5046634
   macro avg       0.99      0.99      0.99   5046634
weighted avg       0.99      0.99      0.99   5046634



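Despite the class imbalance, the model does well on the test data, with an F1-score of 0.99 for the `skill` class. The report covers only the majority class, though: the confusion matrix shows that 74,842 of the 671,682 not-skill instances (about 11%) were predicted as skill. We still need to test the model on real job descriptions.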
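
The figures in the report can be recomputed by hand from the confusion matrix. The sketch below does this for both classes, using the counts printed above; the helper `prf` is illustrative and not part of the notebook.

```python
# Confusion matrix from the held-out test set, laid out as labels=[1, 0]:
# rows are true classes, columns are predicted classes.
tp, fn = 4992475, 54159    # true skill:     predicted skill / not skill
fp, tn = 74842, 596840     # true not skill: predicted skill / not skill

def prf(hits, false_alarms, misses):
    """Precision, recall, and F1 for one class from its counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

skill = prf(tp, fp, fn)
# For the "not skill" class the roles swap: tn are its hits, fn are
# skills wrongly predicted as not skill, and fp are its misses.
not_skill = prf(tn, fn, fp)

print('skill     P/R/F1: %.3f %.3f %.3f' % skill)
print('not skill P/R/F1: %.3f %.3f %.3f' % not_skill)
```

Reporting the minority class alongside the majority class gives a fuller picture than the `labels=[1]` report alone.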

### Testing with some job post data

Here I test with some job descriptions from https://resources.workable.com/job-descriptions/, from which I have extracted the noun chunks with spaCy and labelled each instance as skill (1) or not skill (0).

In [72]:
ds_df = pd.read_csv('/Users/justinnaing/Workspace/MDSI/datasets/job_skills/data_scientist_job_post.csv')

In [73]:
ac_df = pd.read_csv('/Users/justinnaing/Workspace/MDSI/datasets/job_skills/accounting_job_post.csv')

In [121]:
ds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 2 columns):
text     85 non-null object
label    85 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.4+ KB


In [74]:
ds_df.head(10)

| | text | label |
|---|---|---|
| 0 | This Data Scientist job description template | 0 |
| 1 | online job boards | 0 |
| 2 | careers pages | 0 |
| 3 | your company | 0 |
| 4 | Post | 0 |
| 5 | Job Boards | 0 |
| 6 | Data Scientist Responsibilities | 0 |
| 7 | Undertaking data collection | 1 |
| 8 | preprocessing | 1 |
| 9 | analysis | 1 |


In [122]:
ac_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
text     100 non-null object
label    100 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.6+ KB


In [80]:
ac_df.head(10)

| | text | label |
|---|---|---|
| 0 | tax accountant responsibilities | 0 |
| 1 | tax payments | 1 |
| 2 | estimating | 1 |
| 3 | tax returns | 1 |
| 4 | regular (quarterly and annual) tax reports | 1 |
| 5 | job brief | 0 |
| 6 | a tax accountant | 0 |
| 7 | tax payments | 1 |
| 8 | returns | 0 |
| 9 | our company | 0 |


In [76]:
ds_df['text'] = ds_df['text'].apply(lambda x: x.replace('\n', '').strip())
ac_df['text'] = ac_df['text'].apply(lambda x: x.replace('\n', '').strip())

In [208]:
# data science
predicted_ds_lg = pipe.predict(ds_df['text'])

print(confusion_matrix(ds_df['label'], predicted_ds_lg, labels=[1, 0]))
print(classification_report(ds_df['label'], predicted_ds_lg, labels=[1]))

[[44  4]
 [22 15]]
              precision    recall  f1-score   support

           1       0.67      0.92      0.77        48

   micro avg       0.67      0.92      0.77        48
   macro avg       0.67      0.92      0.77        48
weighted avg       0.67      0.92      0.77        48



For the data science job description, recall for the `skill` class is 0.92 and precision is 0.67: the model finds most skills but also labels 22 of the 37 not-skill phrases as skills.

In [209]:
# accounting
predicted_ac_lg = pipe.predict(ac_df['text'])

print(confusion_matrix(ac_df['label'], predicted_ac_lg, labels=[1, 0]))
print(classification_report(ac_df['label'], predicted_ac_lg, labels=[1]))

[[44  3]
 [36 17]]
              precision    recall  f1-score   support

           1       0.55      0.94      0.69        47

   micro avg       0.55      0.94      0.69        47
   macro avg       0.55      0.94      0.69        47
weighted avg       0.55      0.94      0.69        47



For the accounting job description, recall for the `skill` class is 0.94 and precision is 0.55: the model labels 36 of the 53 not-skill phrases as skills.
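
To compare the two job posts side by side, the following sketch recomputes precision and recall from the confusion matrices above and adds the false positive rate, meaning the share of true not-skill phrases predicted as skill. The helper `summarize` and the dictionary layout are illustrative only.

```python
# (tp, fn, fp, tn) read from the two confusion matrices above (labels=[1, 0]).
job_posts = {
    'data scientist': (44, 4, 22, 15),
    'accounting': (44, 3, 36, 17),
}

def summarize(tp, fn, fp, tn):
    """Precision and recall for 'skill', plus the false positive rate."""
    return {
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
        # Share of true not-skill phrases that were predicted as skill.
        'fpr': fp / (fp + tn),
    }

summary = {name: summarize(*counts) for name, counts in job_posts.items()}
for name, metrics in summary.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

In both posts more than half of the not-skill phrases are predicted as skills, which suggests that the low precision comes from over-predicting the `skill` class rather than from missing skills.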

### Testing with random text

In [196]:
test_list = [
    ('NLP', 1),
    ('Natural Language Processing', 1),
    ('developed cross-platform mobile apps in React Native', 1),
    ('hardware manufacturer', 0),
    ('analyse trends in the Crypto market', 1),
    ('proficient in Python, and R', 1),
    ('manage a team of nine people in software development', 1),
    ('has a degree in Economics', 0),
    ('Apple', 0),
    ('Black Berry', 0),
    ('supervisor', 0),
    ('postgres', 1),
    ('mysql', 1),
    ('Power BI', 1),
    ('machine learning', 1),
    ('computer vision', 1),
    ('deep learning', 1),
    ('Standford', 0),
    ('Sydney', 0),
    ('Melbourne', 0),
    ('having breakfast with a cup of coffee', 0),
    ('brew coffee', 1),
    ('coffee', 0)
]
test_df = pd.DataFrame(test_list, columns=['text', 'label'])

In [197]:
test_predicted = pipe.predict(test_df['text'])

In [198]:
confusion_matrix(test_df['label'], test_predicted, labels=[1, 0])

array([[13,  0],
       [ 5,  5]])

In [206]:
print(classification_report(test_df['label'], test_predicted, labels=[1]))

              precision    recall  f1-score   support

           1       0.72      1.00      0.84        13

   micro avg       0.72      1.00      0.84        13
   macro avg       0.72      1.00      0.84        13
weighted avg       0.72      1.00      0.84        13



In [200]:
test_df['predicted'] = test_predicted

In [201]:
test_df

| | text | label | predicted |
|---|---|---|---|
| 0 | NLP | 1 | 1 |
| 1 | Natural Language Processing | 1 | 1 |
| 2 | developed cross-platform mobile apps in React ... | 1 | 1 |
| 3 | hardware manufacturer | 0 | 1 |
| 4 | analyse trends in the Crypto market | 1 | 1 |
| 5 | proficient in Python, and R | 1 | 1 |
| 6 | manage a team of nine people in software devel... | 1 | 1 |
| 7 | has a degree in Economics | 0 | 1 |
| 8 | Apple | 0 | 1 |
| 9 | Black Berry | 0 | 0 |
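
A quick way to see where the model goes wrong is to list the misclassified phrases. The sketch below does this in plain Python for the ten rows displayed above only. The full text of the two truncated rows is taken from `test_list`, and the remaining rows are left out because their predictions are not shown.

```python
# The ten rows displayed above: (text, true label, predicted label).
rows = [
    ('NLP', 1, 1),
    ('Natural Language Processing', 1, 1),
    ('developed cross-platform mobile apps in React Native', 1, 1),
    ('hardware manufacturer', 0, 1),
    ('analyse trends in the Crypto market', 1, 1),
    ('proficient in Python, and R', 1, 1),
    ('manage a team of nine people in software development', 1, 1),
    ('has a degree in Economics', 0, 1),
    ('Apple', 0, 1),
    ('Black Berry', 0, 0),
]

# Not-skill phrases predicted as skill, and skills predicted as not skill.
false_positives = [text for text, label, pred in rows if label == 0 and pred == 1]
false_negatives = [text for text, label, pred in rows if label == 1 and pred == 0]

print('false positives:', false_positives)
print('false negatives:', false_negatives)
```

With the full `test_df`, the same filter could be written as a boolean mask on the `label` and `predicted` columns.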


## Saving work

In [79]:
# Save the trained pipeline
from joblib import dump
dump(pipe, 'skills_classifier_log_reg.joblib')

['skills_classifier_log_reg.joblib']

In [143]:
# Save training dataset
train_df.to_csv(
    '/Users/justinnaing/Workspace/MDSI/datasets/job_skills/imdb_linkedin_training_dataset.csv', 
    index=False
)