# Segment Classifier

This notebook reports the methodology followed to build a machine learning-based classifier to automatically identify sections in job description documents. The model takes a sentence contained in a job description as input and produces as output the section that the sentence belongs to.

# Section 0. Preliminaries

## Load libraries

In [1]:
# update accordingly
run_on_google_colab = True
project_dir = '/content/drive/MyDrive/Personal/Applications/Avature/avature-solution'

if run_on_google_colab:
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
import sys
sys.path.append(project_dir)

In [3]:
import json
import joblib
import numpy as np
import os
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from text_processor import *
from utils import *

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Initialize constants

In [4]:
OVERWRITE_DOC2VEC = False
OVERWRITE_TRAINING = False
random_state = np.random.RandomState(1234)  # for reproducibility

## Load data

In [5]:
data_dir = f'{project_dir}/data'
data_fn = 'jobs_training.csv'
models_dir = f'{project_dir}/models'
output_dir = f'{project_dir}/outputs'

In [6]:
job_training_df = pd.read_csv(os.path.join(data_dir, data_fn))

## Preview data

Visualize the size of the data together with a small sample.

In [7]:
print(f'dataset size: {job_training_df.shape[0]}x{job_training_df.shape[1]}')
print('-'*10)
job_training_df.head()

dataset size: 3885x4
----------


Unnamed: 0,job_id,segment_index,segment,section_label
0,05b865e93e8e46579075562865973d3b,0,Abbott is a global healthcare leader that help...,About Company
1,05b865e93e8e46579075562865973d3b,94,Our portfolio of life-changing technologies sp...,About Company
2,05b865e93e8e46579075562865973d3b,287,"Our 109,000 colleagues\nserve people in more t...",About Company
3,05b865e93e8e46579075562865973d3b,353,"**Tissue Trainer – St. Paul, MN**",Job Title
4,05b865e93e8e46579075562865973d3b,388,Our business purpose is to restore health and ...,About Company


Distribution of segments by section label

In [8]:
job_training_df.section_label.value_counts()

Job Responsibilities/Summary    1453
Job Skills/Requirements         1012
Other                            506
About Company                    425
Benefits                         291
EOE/Diversity                    163
Job Title                         35
Name: section_label, dtype: int64

In [9]:
job_training_df.section_label.value_counts()/job_training_df.shape[0]

Job Responsibilities/Summary    0.374003
Job Skills/Requirements         0.260489
Other                           0.130245
About Company                   0.109395
Benefits                        0.074903
EOE/Diversity                   0.041956
Job Title                       0.009009
Name: section_label, dtype: float64

The previous output shows that there are `seven classes`, which are unbalanced. Now, check whether there are missing values.
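
To make the imbalance concrete, the sketch below computes the per-class weights implied by scikit-learn's `class_weight='balanced'` heuristic, `n_samples / (n_classes * class_count)`, which the classifiers in Section 2 request. It uses the full-dataset counts printed above purely as an illustration; the weights actually applied during training depend on the training split, and `balanced_class_weights` is a hypothetical helper name.

```python
# Class counts copied from the value_counts() output above (full dataset,
# before empty segments are dropped and before the train/test split).
class_counts = {
    'Job Responsibilities/Summary': 1453,
    'Job Skills/Requirements': 1012,
    'Other': 506,
    'About Company': 425,
    'Benefits': 291,
    'EOE/Diversity': 163,
    'Job Title': 35,
}


def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for every class (hypothetical helper)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * count) for name, count in counts.items()}


weights = balanced_class_weights(class_counts)
for name, weight in weights.items():
    print(f'{name:<30} {weight:.3f}')
```

Rare classes such as `Job Title` receive a weight far above 1, while the majority class is down-weighted below 1.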

In [10]:
# check for null values
job_training_df.isnull().sum()

job_id           0
segment_index    0
segment          0
section_label    0
dtype: int64

The previous output shows that there are no missing values.

---

# Section 1. Feature engineering

## Create syntactical features

Data are augmented with features related to the syntactical characteristics of job description sentences.

1. **Word count of sentences**: total number of words in sentences
2. **Character count of sentences**: total number of characters in sentences
3. **Average word density of sentences**: average length of words in sentences
4. **Upper case count in sentences**: total number of upper case words in sentences
5. **Title word count in sentences**: total number of title case words in sentences
6. **Noun count**: total number of nouns in sentences
7. **Verb count**: total number of verbs in sentences
8. **Adjective count**: total number of adjectives in sentences
9. **Adverb count**: total number of adverbs in sentences
10. **Pronoun count**: total number of pronouns in sentences
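
The count-based features can be sketched in plain Python as follows. This is a minimal illustration, not the project's `create_syntactical_features`: the helper name `basic_syntactic_features` is hypothetical, the part-of-speech counts are omitted because they need a tagger, and the `word_count + 1` denominator for `word_density` is inferred from the values displayed in this notebook rather than confirmed from the source code.

```python
def basic_syntactic_features(text):
    """Compute the count-based features of one segment (hypothetical helper)."""
    words = text.split()  # whitespace tokenization, an assumption
    char_count = len(text)
    word_count = len(words)
    return {
        'chart_count': char_count,  # column name kept as spelled in the notebook
        'word_count': word_count,
        # Assumed formula: characters divided by (words + 1)
        'word_density': char_count / (word_count + 1),
        'upper_case_word_count': sum(word.isupper() for word in words),
        'title_word_count': sum(word.istitle() for word in words),
    }


# The Job Title segment shown in the data preview
print(basic_syntactic_features('**Tissue Trainer – St. Paul, MN**'))
```

For this segment the sketch yields 33 characters, 6 words, a density of about 4.714, 1 upper case word, and 4 title case words, which agrees with the corresponding row of the feature table in this notebook.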

In [11]:
%%time
# Build one feature frame per segment, then concatenate once instead of growing a DataFrame inside the loop
syn_features_df = pd.concat(
  [create_syntactical_features(row['segment'], idx) for idx, row in job_training_df.iterrows()]
)

CPU times: user 51.1 s, sys: 880 ms, total: 52 s
Wall time: 1min 13s


In [12]:
syn_features_df.head()

Unnamed: 0,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
0,6,3,1,2,0,93,17,5.166667,0,1
1,10,2,4,0,0,192,24,7.68,0,1
2,3,1,1,0,0,63,10,5.727273,0,1
3,9,0,1,0,0,33,6,4.714286,1,4
4,14,3,0,0,0,184,29,6.133333,2,1


Concatenate the new features with the job description data.

In [13]:
job_training_df = pd.concat([job_training_df, syn_features_df.reindex(job_training_df.index)], axis=1)

In [14]:
job_training_df.head()

Unnamed: 0,job_id,segment_index,segment,section_label,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
0,05b865e93e8e46579075562865973d3b,0,Abbott is a global healthcare leader that help...,About Company,6,3,1,2,0,93,17,5.166667,0,1
1,05b865e93e8e46579075562865973d3b,94,Our portfolio of life-changing technologies sp...,About Company,10,2,4,0,0,192,24,7.68,0,1
2,05b865e93e8e46579075562865973d3b,287,"Our 109,000 colleagues\nserve people in more t...",About Company,3,1,1,0,0,63,10,5.727273,0,1
3,05b865e93e8e46579075562865973d3b,353,"**Tissue Trainer – St. Paul, MN**",Job Title,9,0,1,0,0,33,6,4.714286,1,4
4,05b865e93e8e46579075562865973d3b,388,Our business purpose is to restore health and ...,About Company,14,3,0,0,0,184,29,6.133333,2,1


## Pre-process text

Pre-process job description sentences so they can be used by the machine learning algorithms. Each sentence is converted to a list of `lower case tokens`, with `punctuation` and `digits` removed. The sample output below still contains common stop words (e.g., `is`, `a`, `of`), so `stop words` appear to be retained rather than removed.
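
The lower-casing and the removal of punctuation and digits can be sketched with a regular expression. This is only an approximation under stated assumptions: `tokenize_segment` is a hypothetical name, the real `preprocess_segments` lives in the project's `text_processor` module and may tokenize differently (for instance around hyphens or non-ASCII letters), and stop-word handling is left out.

```python
import re


def tokenize_segment(text):
    """Lower-case the text and keep only runs of ASCII letters (hypothetical helper)."""
    return re.findall(r'[a-z]+', text.lower())


print(tokenize_segment('The company began more than 100 years ago in Tulsa.'))
# Segments made only of punctuation or digits become empty token lists
print(tokenize_segment('****'), tokenize_segment('16.'))
```

Segments that consist solely of markup, bullets, or numbers produce an empty list under this scheme, which is why empty segments need to be checked for after pre-processing.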

In [15]:
processed_segs = preprocess_segments(job_training_df['segment'])

In [16]:
# Let's explore tokens of the first segment
processed_segs[0]

['abbott',
 'is',
 'a',
 'global',
 'healthcare',
 'leader',
 'that',
 'helps',
 'people',
 'live',
 'more',
 'fully',
 'at',
 'all',
 'stages',
 'of',
 'life']

In [17]:
# Check if processing tasks result in empty segments
idx_empty_segs = [idx for idx, seg in enumerate(processed_segs) if len(seg) == 0]
print(f'There are {len(idx_empty_segs)} empty segments')

There are 97 empty segments


In [18]:
# Let's look at what the empty segments look like
for idx in idx_empty_segs:
  print(f'[{idx}] {job_training_df.iloc[idx,2]}')

[171] ****
[175] ****
[178] ****
[505] **
[508] ****
[514] ****
[527] __
[535] :**
[552] 16.
[556] 19.
[558] 20.
[606] *
[657] *
[945] *
[951] *
[1101] *
[1105] *
[1191] •
[1199] •
[1203] •
[1354] *
[1365] *
[1413] :**
[1447] *
[1625] ·
[1627] ·
[1671] *
[1694] ·
[1697] ·
[1931] •
[1940] •
[1944] •
[1946] •
[2144] -
[2154] -
[2188] :**
[2685] ****
[2688] ****
[2690] ****
[2693] ****
[2696] ****
[2698] ****
[2818] *
[2873] ...
[2987] *
[2989] *
[3002] *
[3004] *
[3025] •
[3028] •
[3030] •
[3033] •
[3038] •
[3040] •
[3042] •
[3044] •
[3046] •
[3048] •
[3051] •
[3053] •
[3055] •
[3057] •
[3059] •
[3061] •
[3066] •
[3070] •
[3182] ****
[3249] _**
[3397] ?
[3423] *
[3428] *
[3433] *
[3435] *
[3440] *
[3442] *
[3447] *
[3451] *
[3453] *
[3455] *
[3457] *
[3461] *
[3488] *
[3492] *
[3600] *
[3603] *
[3693] -
[3694] 160199
[3699] ®
[3703] ®
[3704] .
[3712] :
[3739] 5
[3756] :
[3759] :
[3762] :
[3765] :
[3850] •


In [19]:
# Get rid of empty segments
processed_segs = [seg for seg in processed_segs if len(seg) > 0]
print(f'In total {len(processed_segs)} segments will be used')
# Get rid of rows with empty segments
processed_job_training_df = job_training_df.drop(idx_empty_segs).reset_index(drop=True)
assert len(processed_segs) == processed_job_training_df.shape[0], 'Number of rows in dataframe should be equal to the processed segments'

In total 3788 segments will be used


## Encode labels

The target variable `section_label` contains categorical data, which needs to be converted to numbers before using it to train machine learning algorithms.
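
A minimal sketch of such an encoding, assuming classes are numbered in alphabetical order (the behavior of scikit-learn's `LabelEncoder`). The helper names below are hypothetical and are not the project's `encode_labels` and `decode_labels`, whose implementation is not shown here.

```python
def fit_label_mapping(labels):
    """Return the sorted class names and a name-to-code mapping (hypothetical helper)."""
    classes = sorted(set(labels))
    return classes, {name: code for code, name in enumerate(classes)}


def encode(labels, mapping):
    """Convert label names to integer codes."""
    return [mapping[label] for label in labels]


def decode(codes, classes):
    """Convert integer codes back to label names."""
    return [classes[code] for code in codes]


section_labels = [
    'About Company', 'Job Title', 'Job Responsibilities/Summary',
    'Job Skills/Requirements', 'Other', 'Benefits', 'EOE/Diversity',
]
classes, mapping = fit_label_mapping(section_labels)
codes = encode(['Job Skills/Requirements', 'Benefits', 'Job Responsibilities/Summary'], mapping)
print(codes)
print(decode(codes, classes))
```

Under the alphabetical assumption these three labels map to `4`, `1`, and `3`, which is consistent with the encoded and decoded values printed for the training split later in this notebook.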

In [20]:
labels = list(processed_job_training_df['section_label'].unique())
label_values = list(processed_job_training_df['section_label'].values)
encoded_labels = encode_labels(labels, label_values, models_dir)
assert len(encoded_labels) == len(processed_segs), 'Number of labels should be equal to the processed segments'

## Split dataset

Split the dataset into train and test, holding 10% for testing
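
The idea behind a stratified split can be sketched without any library: sample the test fraction separately inside each class so that class proportions are preserved. This is an illustration only; `stratified_split` is a hypothetical helper, and scikit-learn's `train_test_split(..., stratify=labels)` used in this notebook allocates and rounds the per-class counts differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.10, seed=1234):
    """Return (train_indices, test_indices), sampling test_size from every class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example per class (a choice made for this sketch)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = ['a'] * 90 + ['b'] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print(sum(toy_labels[i] == 'b' for i in test_idx) / len(test_idx))
```

In the toy example the minority class makes up 10% of both the full set and the test split.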

In [21]:
features, labels = pd.concat([pd.Series(processed_segs, name='segment_tokens'), processed_job_training_df['segment_index'],
                              processed_job_training_df.iloc[:,4:]], axis=1), encoded_labels
features_train, features_test, y_train, y_test = train_test_split(features, labels, random_state=random_state,
                                                                  test_size=0.10, stratify=labels)

In [22]:
features_train.head(3)

Unnamed: 0,segment_tokens,segment_index,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
1356,"[minimum, of, years, security, experience, is,...",1177,4,2,1,0,0,54,8,6.0,0,1
647,"[why, kelly]",3000,3,1,1,0,0,20,5,3.333333,0,2
1234,"[perform, a, variety, of, standard, molecular,...",284,6,1,5,0,0,134,15,8.375,0,1


In [23]:
y_train[:3]

array([4, 1, 3])

Check if the proportion of target classes has been preserved

In [24]:
labels_train = decode_labels(y_train, os.path.join(models_dir, 'encoder_classes.npy'))
labels_train[:3]

['Job Skills/Requirements', 'Benefits', 'Job Responsibilities/Summary']

In [25]:
unique, counts = np.unique(labels_train, return_counts=True)
dict(zip(unique, counts/len(labels_train)))

{'About Company': 0.11088295687885011,
 'Benefits': 0.07509533587562335,
 'EOE/Diversity': 0.043121149897330596,
 'Job Responsibilities/Summary': 0.36931651510706953,
 'Job Skills/Requirements': 0.26195365209738924,
 'Job Title': 0.009093575828688765,
 'Other': 0.1305368143150484}

## Scale non-textual features

Scale numerical features (i.e., the syntactical features created before and the `segment_index`) so that their values are closer to those of the word vectors generated in the next steps.
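
The internals of `scale_features` are not shown in this notebook. The sketch below assumes z-score standardization, which is consistent with the scaled values displayed below, and illustrates the usual practice of estimating the mean and standard deviation on the training split and reusing them for any later data. The function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Estimate per-column mean and population standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Guard against constant columns, whose standard deviation is zero
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with previously estimated statistics."""
    return [[(value - m) / s for value, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(train_rows)
print(apply_standardizer(train_rows, means, stds))
# New data is scaled with the training statistics, not its own
print(apply_standardizer([[5.0, 12.0]], means, stds))
```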

In [26]:
scaled_features_array = scale_features(np.array(features_train.iloc[:,1:]), models_dir)
scaled_features_df = pd.DataFrame(scaled_features_array, columns=list(features_train.columns)[1:])

In [27]:
scaled_features_df.head()

Unnamed: 0,segment_index,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
0,-0.532704,-0.437331,-0.136614,-0.385304,-0.482161,-0.176252,-0.514814,-0.534461,-0.085476,-0.199011,-0.433468
1,0.399547,-0.549759,-0.531328,-0.385304,-0.482161,-0.176252,-0.784675,-0.704472,-1.689751,-0.199011,-0.160373
2,-0.989369,-0.212475,-0.531328,1.353352,-0.482161,-0.176252,0.120153,-0.137769,1.343331,-0.199011,-0.433468
3,-0.880956,1.136662,1.047526,1.353352,-0.482161,-0.176252,1.382149,1.052307,1.068952,-0.199011,-0.433468
4,-1.078861,0.349666,-0.531328,0.04936,-0.482161,-0.176252,-0.030652,-0.024429,0.148481,0.308459,0.932006


## Create embeddings using doc2vec

Job description sentences are used to learn sentence embeddings. `Doc2vec` is employed to learn the vector representation of job description sentences. The vector length and learning approach were decided based on the results of experiments with different combinations of sizes (`50`, `100`, `200`, `300`, `400`, `500`) and the doc2vec approaches `distributed memory` and `distributed bag of words` implemented by the [Gensim](https://radimrehurek.com/gensim/models/doc2vec.html) library.

In [28]:
%%time
doc2vec_model_fn = os.path.join(models_dir, 'doc2vec.model')
if OVERWRITE_DOC2VEC or not os.path.isfile(doc2vec_model_fn):
  hyper_params = {
      'vector_size': 500,
      'alpha': 0.025,
      'min_count': 5,
      'dm': 0,
      'epochs': 100
  }
  print('Preparing sentences for training...')
  train_doc2vec = [TaggedDocument((d), tags=[str(i)]) for i, d in enumerate(list(features_train['segment_tokens']))]
  print('-'*10)
  print('Training embeddings...')
  doc2vec_model = Doc2Vec(**hyper_params)
  doc2vec_model.build_vocab(train_doc2vec)
  doc2vec_model.train(train_doc2vec, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)
  print('-'*10)
  print('Doc2vec model successfully created!')
  print('-'*10)
  print('Saving model...')
  print('-'*10)
  doc2vec_model.save(doc2vec_model_fn)
else:
  print('Loading doc2vec model...')
  doc2vec_model = Doc2Vec.load(doc2vec_model_fn)

Loading doc2vec model...
CPU times: user 20.9 ms, sys: 15.8 ms, total: 36.7 ms
Wall time: 70.2 ms


## Create doc vector features

The trained doc2vec model is used to infer a vector for each sentence.

In [29]:
%%time
print('Creating vectors...')
if doc2vec_model:
  train_vectors = [doc2vec_model.infer_vector(seg_tokens) for seg_tokens in list(features_train['segment_tokens'])]
print('-'*10)

Creating vectors...
----------
CPU times: user 20.1 s, sys: 104 ms, total: 20.2 s
Wall time: 20.8 s


In [30]:
# let's explore an example of a generated doc vector
train_vectors[0]

array([ 2.65489995e-01,  7.48461410e-02,  2.16436442e-02,  5.75530380e-02,
        1.83454767e-01,  1.10845268e-01,  1.61977142e-01, -5.15189245e-02,
        1.29265651e-01,  1.39526457e-01,  5.63698933e-02, -1.74062196e-02,
       -1.41283497e-01, -5.58866635e-02, -5.31791560e-02, -5.94113171e-02,
        5.69341406e-02, -8.44651386e-02,  4.90214489e-02,  6.10183645e-03,
        2.45196387e-01,  2.52003938e-01,  3.63706470e-01, -8.13759714e-02,
        1.03102133e-01, -1.73137523e-02, -2.91860431e-01, -2.21537799e-01,
        6.97222948e-02,  9.21166688e-02,  3.48717541e-01, -1.95633899e-02,
       -3.80310789e-02,  5.78939728e-02, -1.62834227e-01,  7.13867396e-02,
       -2.47907102e-01,  1.07105471e-01, -1.37781322e-01, -1.47954002e-01,
        1.90390684e-02,  2.50297744e-04, -5.61065972e-03,  2.75049031e-01,
        7.93677047e-02,  7.79853575e-03,  1.18762083e-01,  1.81954414e-01,
       -1.74598128e-01, -4.03761715e-02, -2.14870516e-02, -7.11302310e-02,
        9.94674675e-03, -

---

# Section 2. Model building

## Build ML model

Here the machine learning classifier is built. The candidate algorithms were chosen among those reported to perform well on unbalanced, small, and textual datasets (e.g., [Text Classification with Extremely Small Datasets](https://towardsdatascience.com/text-classification-with-extremely-small-datasets-333d322caee2)). Therefore, we try the following algorithms: `Support Vector Machine` and `Logistic Regression`.

In building the model we proceed as follows:

* Grid search is used for training algorithms with different combinations of hyperparameters;
* The best model for each algorithm is pre-selected for testing;
* The test performance of the best models is compared, and the model that shows the best results is selected.
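
To give a sense of the search size, the sketch below expands the two hyperparameter grids defined in the cells that follow and counts the resulting fits under 5-fold cross-validation. `expand_grid` is a hypothetical helper, and the `C` values for logistic regression are generated in plain Python as an assumed equivalent of `np.logspace(0, 4, 10)`.

```python
from itertools import product

# Grids mirroring get_lr_parameters() and get_svm_parameters() in this notebook
lr_grid = {
    'penalty': ['l2', None],
    'C': [10 ** (4 * i / 9) for i in range(10)],  # assumed equal to np.logspace(0, 4, 10)
}
svm_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001, 'scale'],
    'kernel': ['linear', 'rbf', 'poly'],
}


def expand_grid(grid):
    """List every combination of the values in a parameter grid (hypothetical helper)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


n_folds = 5
for name, grid in [('LR', lr_grid), ('SVM', svm_grid)]:
    candidates = expand_grid(grid)
    print(f'{name}: {len(candidates)} candidates, {len(candidates) * n_folds} fits')
```

The logistic regression grid yields 20 candidates and 100 fits, which matches the grid search log shown in the training output; the SVM grid yields 75 candidates and 375 fits.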

Models are trained using stratified 5-fold cross-validation. The performance metric to be optimized is the weighted `F1`, which balances recall (coverage) against precision (correct identification of sections) and weights each class by its number of samples.

### Train classifiers

In [31]:
# concatenate word vectors with numerical features
X_train = np.concatenate((train_vectors, scaled_features_df), axis=1)

In [32]:
def get_lr_parameters():
    """
    Define hyper-parameters of
    Logistic Regression.

    Inspired by
    https://notebook.community/tpin3694/tpin3694.github.io/machine-learning/.ipynb_checkpoints/hyperparameter_tuning_using_grid_search-checkpoint
    """
    param_grid = {
        'penalty': ['l2', None],
        'C': np.logspace(0, 4, 10)
    }
    return param_grid

In [33]:
def get_svm_parameters():
    """
    Define hyper-parameters of
    Support Vector Machine.

    Inspired by
    https://www.vebuso.com/2020/03/svm-hyperparameter-tuning-using-gridsearchcv
    """

    param_grid = {
        'C': [0.01, 0.1, 1, 10, 100],
        'gamma': [1, 0.1, 0.01, 0.001, 'scale'],
        'kernel': ['linear', 'rbf', 'poly']
    }
    return param_grid

In [34]:
def filter_params(model_params, hyperparams):
  filtered_params = {}
  for param, value in model_params.items():
    if param in hyperparams:
      filtered_params[param] = value
  return filtered_params

In [35]:
%%time
output_training_file_path = os.path.join(output_dir, 'output_ml_training.json')
if not os.path.isfile(output_training_file_path) or OVERWRITE_TRAINING:
  kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
  metric = {
      'name': 'f1',
      'obj': make_scorer(f1_score, average='weighted')
  }
  training_outputs = []

  for algorithm in ['LR', 'SVM']:
    print(f'Training {algorithm} models with different combination of hyperparameters...')
    if algorithm == 'SVM':
      # class_weight='balanced' is used to indicate that classes are not balanced in the dataset
      classifier = SVC(random_state=random_state, class_weight='balanced')
      hyperparams = get_svm_parameters()
    else:
      classifier = LogisticRegression(random_state=random_state,
                                      class_weight='balanced', max_iter=200,
                                      solver='lbfgs')
      hyperparams = get_lr_parameters()
    # create grid search
    clf = GridSearchCV(classifier, param_grid=hyperparams, cv=kfold,
                       scoring=metric['obj'], verbose=2, error_score='raise',
                       n_jobs=1)
    # do grid search
    best_model = clf.fit(X_train, y_train)
    score = best_model.best_score_
    print(f'Finished training {algorithm} models, {metric["name"]} score of best model: {score}')
    print(f'Hyperparameters of the best {algorithm} model are:')
    print(filter_params(best_model.best_estimator_.get_params(), hyperparams))
    print(f'Saving the best {algorithm} model...')
    model_name = f'{algorithm.lower()}_model.joblib'
    model_file_path = os.path.join(models_dir, model_name)
    model_dict = dict(model=best_model)
    joblib.dump(model_dict, model_file_path)
    training_outputs.append(
      {
          'algorithm': algorithm,
          'hyperparms': filter_params(best_model.best_estimator_.get_params(), hyperparams),
          'metric': metric['name'],
          'score': score,
          'model_file_path': model_file_path
      }
    )
    print('-'*10)
  # save training outputs
  print('Saving outputs...')
  with open(output_training_file_path, 'w') as f:
      json.dump(training_outputs, f, indent=4)
  print('-'*10)
else:
  print('Skipping, training outputs already exist...')

Training LR models with different combination of hyperparameters...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END ..................................C=1.0, penalty=l2; total time=   3.5s
[CV] END ..................................C=1.0, penalty=l2; total time=   4.1s
[CV] END ..................................C=1.0, penalty=l2; total time=   4.3s
[CV] END ..................................C=1.0, penalty=l2; total time=   3.1s
[CV] END ..................................C=1.0, penalty=l2; total time=   2.7s
[CV] END ................................C=1.0, penalty=None; total time=   2.9s
[CV] END ................................C=1.0, penalty=None; total time=   4.1s
[CV] END ................................C=1.0, penalty=None; total time=   4.4s
[CV] END ................................C=1.0, penalty=None; total time=   3.5s
[CV] END ................................C=1.0, penalty=None; total time=   2.9s
[CV] END ...................C=2.7825594022071245, penalty=l2

### Evaluate best classifiers

#### Prepare test set

Scale numerical features

In [36]:
scaled_features_array = scale_features(np.array(features_test.iloc[:,1:]), models_dir)
scaled_features_df = pd.DataFrame(scaled_features_array, columns=list(features_train.columns)[1:])

Generate doc vectors of segments in the test split

In [37]:
# load doc2vec model
doc2vec_model_file_path = os.path.join(models_dir, 'doc2vec.model')
doc2vec_model = Doc2Vec.load(doc2vec_model_file_path)
test_vectors =  [doc2vec_model.infer_vector(seg_tokens) for seg_tokens in list(features_test.segment_tokens)]

Concatenate vectors with numerical features

In [38]:
X_test = np.concatenate((test_vectors, scaled_features_df), axis=1)

#### **Model**: Best Logistic Regression classifier

Load model

In [39]:
with open(output_training_file_path, 'r') as f:
    training_outputs = json.load(f)
for training in training_outputs:
  if training['algorithm'] == 'LR':
    model_file_path = training['model_file_path']
best_lr_model = joblib.load(model_file_path)['model']

Make predictions

In [40]:
preds = best_lr_model.predict(X_test)

Evaluate model performance

In [41]:
enconder_file_path = os.path.join(models_dir, 'encoder_classes.npy')
class_nums = list(range(0,7))
print(classification_report(y_test, preds, target_names=decode_labels(class_nums, enconder_file_path)))

                              precision    recall  f1-score   support

               About Company       0.70      0.71      0.71        42
                    Benefits       0.58      0.75      0.66        28
               EOE/Diversity       0.83      0.94      0.88        16
Job Responsibilities/Summary       0.85      0.70      0.77       140
     Job Skills/Requirements       0.78      0.74      0.76        99
                   Job Title       0.40      1.00      0.57         4
                       Other       0.65      0.82      0.73        50

                    accuracy                           0.74       379
                   macro avg       0.68      0.81      0.72       379
                weighted avg       0.76      0.74      0.75       379



#### **Model**: Best Support Vector Machine classifier

Load model

In [42]:
with open(output_training_file_path, 'r') as f:
    training_outputs = json.load(f)
for training in training_outputs:
  if training['algorithm'] == 'SVM':
    model_file_path = training['model_file_path']
best_svm_model = joblib.load(model_file_path)['model']

Make predictions

In [43]:
preds = best_svm_model.predict(X_test)

Evaluate model performance

In [44]:
enconder_file_path = os.path.join(models_dir, 'encoder_classes.npy')
class_nums = list(range(0,7))
print(classification_report(y_test, preds, target_names=decode_labels(class_nums, enconder_file_path)))

                              precision    recall  f1-score   support

               About Company       0.64      0.64      0.64        42
                    Benefits       0.83      0.71      0.77        28
               EOE/Diversity       0.93      0.88      0.90        16
Job Responsibilities/Summary       0.85      0.79      0.81       140
     Job Skills/Requirements       0.76      0.82      0.79        99
                   Job Title       0.50      0.25      0.33         4
                       Other       0.73      0.88      0.80        50

                    accuracy                           0.78       379
                   macro avg       0.75      0.71      0.72       379
                weighted avg       0.79      0.78      0.78       379



### Select best-performing model

> From the results above, the **`support vector machine`** model performs better, with an **`accuracy`** of **`0.78`** and a weighted average **`f1`** of **`0.78`**, than its **`logistic regression`** counterpart, which shows an **`accuracy`** of **`0.74`** and a weighted average **`f1`** of **`0.75`**. Therefore, the **`support vector machine`** model will be used for predictions.
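
As a sanity check on the comparison, the weighted average `f1` can be recomputed from the per-class `f1` scores and supports printed in the two classification reports. The inputs are the rounded values from those reports, so the result is only accurate to about two decimals; `weighted_f1` is a hypothetical helper.

```python
# (f1-score, support) per class, copied from the classification reports above
lr_report = [(0.71, 42), (0.66, 28), (0.88, 16), (0.77, 140), (0.76, 99), (0.57, 4), (0.73, 50)]
svm_report = [(0.64, 42), (0.77, 28), (0.90, 16), (0.81, 140), (0.79, 99), (0.33, 4), (0.80, 50)]


def weighted_f1(rows):
    """Average the per-class f1 scores, weighting each class by its support."""
    total_support = sum(support for _, support in rows)
    return sum(f1 * support for f1, support in rows) / total_support


print(f'LR  weighted f1: {weighted_f1(lr_report):.2f}')
print(f'SVM weighted f1: {weighted_f1(svm_report):.2f}')
```

Because the weighting follows class support, the four `Job Title` test examples have little influence on this metric, even though the two models differ sharply on that class.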

---

# Section 3. Use case

## Predict section of segments

The solution is checked by predicting the section of a given sentence (and its sentence index) taken from the dataset `jobs_test`.

In [45]:
segment = 'The company began more than 100 years ago in Tulsa and has successfully diversified into a variety of industries, businesses and geographies. .'
segment_idx = 341

In [46]:
segments_df = pd.DataFrame({'segment': [segment], 'segment_index': [segment_idx]})
segments_df.head()

Unnamed: 0,segment,segment_index
0,The company began more than 100 years ago in T...,341


Create syntactical features for the segment

In [47]:
num_features_df = create_syntactical_features(segment)
num_features_df.head()

Unnamed: 0,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
0,7,3,1,2,0,143,23,5.958333,0,2


In [48]:
num_features_df = pd.concat([segments_df['segment_index'], num_features_df], axis=1)
num_features_df.head()

Unnamed: 0,segment_index,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
0,341,7,3,1,2,0,143,23,5.958333,0,2


Scale numerical features

In [49]:
scaled_features_array = scale_features(np.array(num_features_df.iloc[:,:]), models_dir)
scaled_num_features_df = pd.DataFrame(scaled_features_array, columns=list(features_train.columns)[1:])
scaled_num_features_df.head()

Unnamed: 0,segment_index,noun_count,verb_count,adj_count,adv_count,pron_count,chart_count,word_count,word_density,upper_case_word_count,title_word_count
0,-0.96022,-0.100047,0.258099,-0.385304,2.426613,-0.176252,0.191587,0.315593,-0.110543,-0.199011,-0.160373


Pre-process segment

In [50]:
processed_segs = preprocess_segments(segments_df['segment'])
processed_segs[0]

['the',
 'company',
 'began',
 'more',
 'than',
 'years',
 'ago',
 'in',
 'tulsa',
 'and',
 'has',
 'successfully',
 'diversified',
 'into',
 'a',
 'variety',
 'of',
 'industries',
 'businesses',
 'and',
 'geographies']

Generate doc vector for segment

In [51]:
# load doc2vec model
doc2vec_model_file_path = os.path.join(models_dir, 'doc2vec.model')
doc2vec_model = Doc2Vec.load(doc2vec_model_file_path)
segment_vector =  [doc2vec_model.infer_vector(seg_tokens) for seg_tokens in processed_segs]

Concatenate vector with numerical features

In [52]:
seg_features = np.concatenate((segment_vector, scaled_num_features_df), axis=1)

Load model

In [58]:
with open(output_training_file_path, 'r') as f:
    training_outputs = json.load(f)
for training in training_outputs:
  if training['algorithm'] == 'SVM':
    model_file_path = training['model_file_path']
svm_model = joblib.load(model_file_path)['model']

Make prediction

In [59]:
pred = svm_model.predict(seg_features)

Output prediction result

In [60]:
class_names = decode_labels(list(range(0,7)), os.path.join(models_dir, 'encoder_classes.npy'))
# pred[0] is the encoded class number, so it indexes class_names directly
print('Prediction result')
print('-'*10)
print(f'Segment: {segment}')
print(f'Predicted section: {class_names[pred[0]]}')
print('-'*10)

Prediction result
----------
Segment: The company began more than 100 years ago in Tulsa and has successfully diversified into a variety of industries, businesses and geographies. .
Predicted section: About Company
----------


---