# Neural Network Segment Classifier

This notebook reports the methodology followed to build a neural network (LSTM) classifier to automatically identify sections in job description documents. The model takes a sentence contained in a job description as input and produces as output the section that the sentence belongs to.

# Section 0. Preliminaries

## Load libraries

In [1]:
# Update accordingly
run_on_google_colab = True
project_dir = '/content/drive/MyDrive'

if run_on_google_colab:
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
import sys
sys.path.append(project_dir)

In [3]:
import json
import numpy as np
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import warnings

warnings.filterwarnings('ignore')

from collections import Counter
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

from text_processor import *
from utils import *

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


## Initialize constants

In [4]:
OVERWRITE_TRAINING = True
random_state = np.random.RandomState(1234)  # for reproducibility

## Load data

In [5]:
data_dir = f'{project_dir}/data'
data_fn = 'jobs_training.csv'
models_dir = f'{project_dir}/models'
output_dir = f'{project_dir}/outputs'

In [6]:
job_training_df = pd.read_csv(os.path.join(data_dir, data_fn))

## Preview data

Visualize the size of the data together with a small sample.

In [7]:
print(f'dataset size: {job_training_df.shape[0]}x{job_training_df.shape[1]}')
print('-'*10)
job_training_df.head(10)

dataset size: 3885x4
----------


Unnamed: 0,job_id,segment_index,segment,section_label
0,05b865e93e8e46579075562865973d3b,0,Abbott is a global healthcare leader that help...,About Company
1,05b865e93e8e46579075562865973d3b,94,Our portfolio of life-changing technologies sp...,About Company
2,05b865e93e8e46579075562865973d3b,287,"Our 109,000 colleagues\nserve people in more t...",About Company
3,05b865e93e8e46579075562865973d3b,353,"**Tissue Trainer – St. Paul, MN**",Job Title
4,05b865e93e8e46579075562865973d3b,388,Our business purpose is to restore health and ...,About Company
5,05b865e93e8e46579075562865973d3b,573,We aim to lead the markets we serve by requiri...,About Company
6,05b865e93e8e46579075562865973d3b,951,**WHAT YOU’LL DO**,Job Responsibilities/Summary
7,05b865e93e8e46579075562865973d3b,971,We are recruiting for a Tissue Trainer located...,Job Responsibilities/Summary
8,05b865e93e8e46579075562865973d3b,1035,"You are\nresponsible for the coordination, imp...",Job Responsibilities/Summary
9,05b865e93e8e46579075562865973d3b,1238,Coordinates the ongoing and\nrecurring system ...,Job Responsibilities/Summary


Distribution of segments by section type

In [8]:
job_training_df.section_label.value_counts()

Job Responsibilities/Summary    1453
Job Skills/Requirements         1012
Other                            506
About Company                    425
Benefits                         291
EOE/Diversity                    163
Job Title                         35
Name: section_label, dtype: int64

In [9]:
job_training_df.section_label.value_counts()/job_training_df.shape[0]

Job Responsibilities/Summary    0.374003
Job Skills/Requirements         0.260489
Other                           0.130245
About Company                   0.109395
Benefits                        0.074903
EOE/Diversity                   0.041956
Job Title                       0.009009
Name: section_label, dtype: float64

The previous output shows that there are `seven classes` and that they are unbalanced. Next, check whether there are missing values.

In [10]:
# Check for null values
job_training_df.isnull().sum()

job_id           0
segment_index    0
segment          0
section_label    0
dtype: int64

The previous output shows that there are no missing values.

---

# Section 1. Feature engineering

## Pre-process text

Pre-process the job description sentences so they can be used by the machine learning algorithms. Each sentence is converted to a list of `lower-case tokens`, and `punctuation` and `digits` are removed in the process.
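`preprocess_segments` is imported from the project's `text_processor` module, whose source is not shown in this notebook. As a rough illustration only, the sketch below uses a hypothetical stand-in, `sketch_preprocess_segments`, that lowercases each segment and keeps only runs of letters, which drops punctuation and digits. The real function may tokenize differently, for example around apostrophes or hyphens.

```python
import re

def sketch_preprocess_segments(segments):
    """Hypothetical stand-in for text_processor.preprocess_segments."""
    processed = []
    for seg in segments:
        # Lowercase, then keep only runs of ASCII letters;
        # punctuation, symbols and digits are discarded.
        processed.append(re.findall(r'[a-z]+', seg.lower()))
    return processed

examples = sketch_preprocess_segments([
    'Abbott is a global healthcare leader.',
    '**WHAT YOU WILL DO**',
    '16.',
])
print(examples)
```

Under this assumption, a segment made only of symbols or digits (such as `16.`) comes back as an empty list, which is why the notebook checks for empty segments after pre-processing.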

In [11]:
processed_segs = preprocess_segments(job_training_df['segment'])

In [12]:
# Let's explore tokens of the first segment
processed_segs[0]

['abbott',
 'is',
 'a',
 'global',
 'healthcare',
 'leader',
 'that',
 'helps',
 'people',
 'live',
 'more',
 'fully',
 'at',
 'all',
 'stages',
 'of',
 'life']

In [13]:
# Check if processing tasks result in empty segments
idx_empty_segs = [idx for idx, seg in enumerate(processed_segs) if len(seg) == 0]
print(f'There are {len(idx_empty_segs)} empty segments')

There are 97 empty segments


In [14]:
# Let's see what the empty segments look like
for idx in idx_empty_segs:
  print(f'[{idx}] {job_training_df.iloc[idx,2]}')

[171] ****
[175] ****
[178] ****
[505] **
[508] ****
[514] ****
[527] __
[535] :**
[552] 16.
[556] 19.
[558] 20.
[606] *
[657] *
[945] *
[951] *
[1101] *
[1105] *
[1191] •
[1199] •
[1203] •
[1354] *
[1365] *
[1413] :**
[1447] *
[1625] ·
[1627] ·
[1671] *
[1694] ·
[1697] ·
[1931] •
[1940] •
[1944] •
[1946] •
[2144] -
[2154] -
[2188] :**
[2685] ****
[2688] ****
[2690] ****
[2693] ****
[2696] ****
[2698] ****
[2818] *
[2873] ...
[2987] *
[2989] *
[3002] *
[3004] *
[3025] •
[3028] •
[3030] •
[3033] •
[3038] •
[3040] •
[3042] •
[3044] •
[3046] •
[3048] •
[3051] •
[3053] •
[3055] •
[3057] •
[3059] •
[3061] •
[3066] •
[3070] •
[3182] ****
[3249] _**
[3397] ?
[3423] *
[3428] *
[3433] *
[3435] *
[3440] *
[3442] *
[3447] *
[3451] *
[3453] *
[3455] *
[3457] *
[3461] *
[3488] *
[3492] *
[3600] *
[3603] *
[3693] -
[3694] 160199
[3699] ®
[3703] ®
[3704] .
[3712] :
[3739] 5
[3756] :
[3759] :
[3762] :
[3765] :
[3850] •


In [15]:
# Get rid of the empty segments
processed_segs = [seg for seg in processed_segs if len(seg) > 0]
print(f'In total {len(processed_segs)} segments will be used')

In total 3788 segments will be used


### Create vocabulary

In [16]:
# Count word frequency
word_counts = Counter()
for seg in processed_segs:
  word_counts.update(seg)
print(f'Top-10 most frequent words: {word_counts.most_common(10)}')

Top-10 most frequent words: [('and', 3789), ('to', 2122), ('the', 1735), ('of', 1545), ('a', 1285), ('in', 1090), ('with', 799), ('for', 782), ('or', 619), ('is', 504)]


In [17]:
# Create vocabulary
vocab2index = {'': 0, 'UNK': 1}
words = ['', 'UNK']
for word in word_counts:
  vocab2index[word] = len(words)
  words.append(word)
vocab_size = len(words)
print(f'Vocabulary size: {vocab_size}')

Vocabulary size: 6601


### Encode sentences

In [18]:
# Transform list of tokens of sentences into vector of numbers
def enconde_sentences(segments):
  vec_tokens = []
  for seg in segments:
    vec_seg = []
    for token in seg:
      if token in vocab2index.keys():
        token_to_add = vocab2index[token]
      else:
        token_to_add = vocab2index['UNK']
      vec_seg.append(token_to_add)
    vec_tokens.append(vec_seg)
  return vec_tokens

In [19]:
vec_tokens = enconde_sentences(processed_segs)
# Check consistency
assert len(vec_tokens)==len(processed_segs), 'Vector of token numbers is incomplete'
# Print sample
print(f'Vector of the first sentence: {vec_tokens[0]}')

Vector of the first sentence: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
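To make the vocabulary and encoding steps concrete, here is a minimal sketch on a toy corpus. The names `toy_segments`, `toy_vocab` and `encode` are illustrative and not part of the notebook; the logic follows the same idea as above, with id `0` reserved for padding and id `1` for unknown words.

```python
from collections import Counter

toy_segments = [['team', 'player', 'wanted'], ['team', 'lead']]

# Count tokens, then assign ids in first-seen order.
counts = Counter()
for seg in toy_segments:
    counts.update(seg)

toy_vocab = {'': 0, 'UNK': 1}
for word in counts:
    toy_vocab[word] = len(toy_vocab)

def encode(segment, vocab):
    # dict.get falls back to the 'UNK' id for tokens missing from the vocabulary.
    return [vocab.get(token, vocab['UNK']) for token in segment]

encoded = encode(['team', 'lead', 'tulsa'], toy_vocab)
print(encoded)  # [2, 5, 1]
```

The word `tulsa` never appears in the toy corpus, so it maps to the `UNK` id, which is what happens to unseen words at prediction time.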


### Padding sentences

In [20]:
# Let's find out the length of the longest sentence
max_len = max([len(seg) for seg in processed_segs])
print(f'The longest sentences have {max_len} tokens')

The longest sentences have 228 tokens


Padding is used to adjust the length of all sentences to `200`, which is close to the length of the longest sentence. Shorter sentences are left-padded with zeros, and sentences longer than `200` tokens are truncated.

In [21]:
def padding_vectors(vec_tokens, seq_max_len):
  features = np.zeros((len(vec_tokens), seq_max_len), dtype=int)
  for i, vec in enumerate(vec_tokens):
    features[i, -len(vec):] = np.array(vec)[:seq_max_len]
  return features

In [22]:
# Maximum length of vector
seq_max_len = 200
features = padding_vectors(vec_tokens, seq_max_len)
# Check consistency
assert len(features)==len(vec_tokens), 'Features vector should have as many rows as sentences in the dataset.'
assert len(features[0])==seq_max_len, f'Each row in features should contain {seq_max_len} values.'
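The padding rule can be checked on small inputs. The sketch below is a plain-Python mirror of the idea (the helper `pad_left` is illustrative, not the notebook's `padding_vectors`): keep at most `seq_max_len` tokens from the start of the sentence and fill the remaining positions on the left with the padding id.

```python
def pad_left(vec, seq_max_len, pad_id=0):
    # Keep at most seq_max_len tokens from the start of the sentence ...
    kept = vec[:seq_max_len]
    # ... and fill the remaining positions on the left with the padding id.
    return [pad_id] * (seq_max_len - len(kept)) + kept

short_row = pad_left([5, 6, 7], 5)
long_row = pad_left([1, 2, 3, 4, 5, 6], 4)
print(short_row)  # [0, 0, 5, 6, 7]
print(long_row)   # [1, 2, 3, 4]
```

Left-padding keeps the real tokens at the end of the sequence, so the last LSTM time step (the one used for classification) always sees actual text rather than padding.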

### Encode categorical labels

The target variable `section_label` contains categorical data, which needs to be converted to numbers before it can be used to train machine learning algorithms. Before encoding, labels that correspond to empty segments are removed.

In [23]:
all_labels = list(job_training_df['section_label'].values)
# Get rid of labels related to empty segments
filtered_labels = [label for idx, label in enumerate(all_labels) if idx not in idx_empty_segs]
# Check consistency
assert len(filtered_labels)==len(features), 'Features vector and labels vector should have the same number of rows'

In [24]:
unique_labels = list(set(filtered_labels))
encoded_labels = encode_labels(unique_labels, filtered_labels, models_dir)
# Check consistency
assert len(encoded_labels)==len(features), 'Encoded labels vector and features vector should have the same number of rows'
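`encode_labels` comes from the project's `utils` module and its source is not shown here. As a hedged sketch of what label encoding amounts to, the hypothetical helper below maps each class name to an integer id, assuming classes are indexed in sorted order (the real helper may order or store classes differently).

```python
def sketch_encode_labels(labels):
    """Hypothetical stand-in for utils.encode_labels; ordering is an assumption."""
    classes = sorted(set(labels))
    class_to_id = {name: i for i, name in enumerate(classes)}
    return classes, [class_to_id[label] for label in labels]

classes, ids = sketch_encode_labels(['Other', 'Benefits', 'About Company', 'Benefits'])
print(classes)  # ['About Company', 'Benefits', 'Other']
print(ids)      # [2, 1, 0, 1]
```

Keeping the list of classes alongside the ids is what makes it possible to turn a predicted id back into a section name later.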

---

# Section 2. Datasets preparation

## Split dataset

Split the dataset into train and test sets, holding out 20% for testing.

In [25]:
x_train, x_test, y_train, y_test = train_test_split(features, encoded_labels,
                                                    random_state=random_state,
                                                    test_size=0.20,
                                                    stratify=encoded_labels)

In [26]:
print('-'*10)
print('Size train sets')
print(f'x_train size: {x_train.shape}')
print(f'y_train size: {y_train.shape}')
print('-'*10)
print('Size test sets')
print(f'x_test size: {x_test.shape}')
print(f'y_test size: {y_test.shape}')

----------
Size train sets
x_train size: (3030, 200)
y_train size: (3030,)
----------
Size test sets
x_test size: (758, 200)
y_test size: (758,)


Fifty percent (`50%`) of the test set is reserved for validation

In [27]:
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, random_state=random_state,
                                                test_size=0.50, stratify=y_test)

In [28]:
print('-'*10)
print('Size test sets')
print(f'x_test size: {x_test.shape}')
print(f'y_test size: {y_test.shape}')
print('-'*10)
print('Size validation sets')
print(f'x_val size: {x_val.shape}')
print(f'y_val size: {y_val.shape}')

----------
Size test sets
x_test size: (379, 200)
y_test size: (379,)
----------
Size validation sets
x_val size: (379, 200)
y_val size: (379,)
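The split sizes reported above can be reproduced with simple arithmetic. This is a sketch under the assumption that the held-out fraction is rounded up to a whole number of samples and the remainder goes to the other split, which matches the shapes printed in this notebook.

```python
import math

n_segments = 3885 - 97                      # dataset rows minus empty segments
n_held_out = math.ceil(0.20 * n_segments)   # first split: 20% held out
n_train = n_segments - n_held_out
n_val = math.ceil(0.50 * n_held_out)        # second split: half of the held-out set
n_test = n_held_out - n_val

print(n_segments, n_train, n_test, n_val)   # 3788 3030 379 379
```

The resulting proportions are roughly 80% train, 10% validation and 10% test.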


## Batching datasets

Create tensor datasets and dataloaders to be used in training the neural network.

In [29]:
# Create tensor datasets
train_data = TensorDataset(torch.from_numpy(x_train), torch.from_numpy(y_train))
val_data = TensorDataset(torch.from_numpy(x_val), torch.from_numpy(y_val))
test_data = TensorDataset(torch.from_numpy(x_test), torch.from_numpy(y_test))

# Define batch size
batch_size = 50

# Batching datasets
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [30]:
# Check one batch
sample_x, sample_y = next(iter(train_loader))

print('-'*10)
print(f'Size train sample: {sample_x.size()}')
print(f'Train sample: {sample_x}')
print('-'*10)
print(f'Size label sample: {sample_y.size()}', )
print(f'Sample label: {sample_y}')

----------
Size train sample: torch.Size([50, 200])
Train sample: tensor([[   0,    0,    0,  ...,  153, 2774,  343],
        [   0,    0,    0,  ..., 1300,  331, 1301],
        [   0,    0,    0,  ..., 4786, 1174, 1325],
        ...,
        [   0,    0,    0,  ...,    0, 1787,   56],
        [   0,    0,    0,  ...,  783, 5534,  847],
        [   0,    0,    0,  ...,   30,   29,  582]])
----------
Size label sample: torch.Size([50])
Sample label: tensor([4, 2, 6, 6, 4, 3, 3, 6, 1, 3, 3, 4, 3, 0, 3, 4, 1, 4, 3, 4, 4, 4, 3, 6,
        2, 6, 4, 0, 4, 3, 3, 3, 2, 0, 6, 3, 4, 6, 1, 3, 6, 1, 3, 0, 6, 4, 3, 3,
        3, 3])


---

# Section 3. Build LSTM classifier

In [31]:
# check if gpu is available to use it
if torch.cuda.is_available():
  device = torch.device('cuda')
  print('GPU available and will be used for training.')
else:
  device = torch.device('cpu')
  print('Only CPU available.')

GPU available and will be used for training.


## Create LSTM

Create the LSTM architecture

In [32]:
class Classifier(nn.Module):

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim,
                 n_layers, drop_prob=0.5):
        super(Classifier, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden):
        batch_size = x.size(0)

        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)

        # take output from last step
        lstm_last_output = lstm_out[:, -1, :]
        out = self.dropout(lstm_last_output)
        out = self.fc(out)

        return out, hidden

    # initialize hidden and cell states to zero
    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data

        if str(device)=='cuda':
            hidden = (
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda()
            )
        else:
            hidden = (
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_()
            )

        return hidden

Define method to train classifier

In [33]:
def train_classifier(model, lr, criterion, optimizer, train_loader, val_loader,
                     num_classes, epochs=5, val_freq=500):
  # NOTE: the optimizer is re-created here, so the `optimizer` argument is ignored
  parameters = filter(lambda p: p.requires_grad, model.parameters())
  optimizer = torch.optim.Adam(parameters, lr=lr)
  step = 0
  val_accs = []
  print('Training the model, please wait...')

  for epoch in range(epochs):
    avg_loss = 0.
    # loop over batch
    for x, y in train_loader:
      model.train()
      step += 1
      # initialize hidden states with the current batch size
      batch_size = x.size(0)
      h = model.init_hidden(batch_size)
      if str(device) == 'cuda':
        x, y = x.cuda(), y.cuda()
      # update hidden
      h = tuple([each.data for each in h])
      # zero accumulated gradients
      model.zero_grad()
      # get prediction
      y_pred, h = model(x, h)
      # calculate loss
      loss = criterion(y_pred, y)
      # perform backprop
      loss.backward()
      # accumulate loss
      avg_loss += loss.item() / len(train_loader)
      # update weights
      optimizer.step()
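      # evaluate
      # NOTE: `step % val_freq` is truthy whenever step is NOT a multiple of
      # val_freq, so validation runs on nearly every step; use
      # `step % val_freq == 0` to validate only once every val_freq steps
      if step % val_freq: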
        model.eval()
        val_losses, val_acc = validate_classifier(model, val_loader, num_classes)
        avg_val_loss = np.mean(val_losses)
        val_accs.append(val_acc)
        print('-'*10)
        print(f'epoch {epoch+1}/{epochs} loss={avg_val_loss} \t accuracy={val_acc}')

  return val_accs
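The validation check `if step % val_freq:` in the loop above relies on integer truthiness, so it fires whenever `step` is not a multiple of `val_freq`. The sketch below counts how often validation would run with the values used in this notebook (3030 training samples, batch size 50, 5 epochs, `val_freq=1000`), assuming the `DataLoader` keeps the last partial batch, which is its default.

```python
import math

n_train, batch_size, epochs, val_freq = 3030, 50, 5, 1000
steps_per_epoch = math.ceil(n_train / batch_size)   # last partial batch is kept
total_steps = steps_per_epoch * epochs

# Condition as written in the training loop: truthy for non-multiples.
truthy_runs = sum(1 for step in range(1, total_steps + 1) if step % val_freq)
# Condition that validates once every val_freq steps.
multiple_runs = sum(1 for step in range(1, total_steps + 1) if step % val_freq == 0)

print(total_steps, truthy_runs, multiple_runs)  # 305 305 0
```

With these settings the condition as written validates after every training step, while `step % val_freq == 0` would never validate within 305 steps, so `val_freq` would need to be lowered if the stricter check were used.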

Define method to validate classifier

In [34]:
def validate_classifier(model, val_loader, num_classes):
  # NOTE: relies on the global `criterion` defined in the training cell
  val_losses = []
  correct_preds, total_samples = 0, 0
  # gradients are not needed for validation
  with torch.no_grad():
    for x, y in val_loader:
      batch_size = x.size(0)
      # initialize hidden states with the current batch size
      h = model.init_hidden(batch_size)
      if str(device) == 'cuda':
        x, y = x.cuda(), y.cuda()
      y_pred, _ = model(x, h)
      val_loss = criterion(y_pred, y)
      val_losses.append(val_loss.item())
      # calculate accuracy
      _, pred = torch.max(y_pred, 1)
      total_samples += batch_size
      correct_preds += (pred == y).sum().item()

  avg_accu = correct_preds / total_samples
  return val_losses, avg_accu
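The accuracy computation in the validation routine boils down to an argmax per row followed by a comparison with the true label. The following plain-Python sketch (the helper `batch_accuracy` and the scores are made up for illustration) shows the same idea without tensors.

```python
def batch_accuracy(logits, labels):
    # argmax over the class scores of each row, then compare with the true label
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct, len(labels)

logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]  # made-up scores
labels = [1, 0, 1]
correct, total = batch_accuracy(logits, labels)
print(correct, total)  # 2 3
```

Accumulating `correct` and `total` across batches and dividing once at the end, as the notebook does, gives the exact accuracy even when the last batch is smaller than the others.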

## Train classifier

In [35]:
%%time
output_training_file_path = os.path.join(output_dir, 'output_lstm_training.json')
if not os.path.isfile(output_training_file_path) or OVERWRITE_TRAINING:
  # Define classifier hyperparameters
  v_size = vocab_size
  output_size = len(unique_labels)
  embedding_dim = seq_max_len
  hidden_dims = [64, 128, 256]
  v_n_layers = [2, 3, 4]

  trainings = []
  for hidden_dim in hidden_dims:
    for n_layers in v_n_layers:
      # Instantiate the LSTM class
      clf = Classifier(v_size, output_size, embedding_dim, hidden_dim, n_layers)

      # Define training parameters
      lr=0.001 # learning rate
      criterion = nn.CrossEntropyLoss() # loss function
      optimizer = torch.optim.Adam(clf.parameters(), lr=lr) # optimizer to update weights
      epochs = 5

      if str(device)=='cuda':
        clf.cuda()

      val_accs = train_classifier(clf, lr, criterion, optimizer, train_loader,
                                  valid_loader, output_size, epochs, val_freq=1000)

      # save trained model
      print('-'*10)
      print('Saving model...')
      model_name = f'lstm_model_{hidden_dim}_{n_layers}.pth'
      model_file_path = os.path.join(models_dir, model_name)
      torch.save(clf, model_file_path)
      print('-'*10)

      trainings.append(
          {
              'hyperparams': {
                  'vocabulary_size': v_size,
                  'num_classes': output_size,
                  'embedding_dim': embedding_dim,
                  'hidden_dim': hidden_dim,
                  'n_layers': n_layers
              },
              'training_parameters': {
                'epochs': epochs,
                'loss_fn': 'cross entropy',
                'optimizer': 'adam'
              },
              'model': {
                  'name': model_name,
                  'file_path': model_file_path,
                  'accuracy': np.mean(val_accs),
              }
          }
      )

  # save training outputs
  print('Saving outputs...')
  with open(output_training_file_path, 'w') as f:
      json.dump(trainings, f, indent=4)
  print('-'*10)

Streaming output truncated to the last 5000 lines.
epoch 5/5 loss=0.9935684651136398 	 accuracy=0.6754617414248021
----------
epoch 5/5 loss=1.0359019115567207 	 accuracy=0.6701846965699209
----------
epoch 5/5 loss=1.0505818203091621 	 accuracy=0.6754617414248021
----------
epoch 5/5 loss=1.0450584217905998 	 accuracy=0.6728232189973615
----------
epoch 5/5 loss=1.0334950163960457 	 accuracy=0.6728232189973615
----------
epoch 5/5 loss=1.0696634128689766 	 accuracy=0.6728232189973615
----------
epoch 5/5 loss=1.0569553598761559 	 accuracy=0.6728232189973615
----------
epoch 5/5 loss=1.0704049095511436 	 accuracy=0.6675461741424802
----------
epoch 5/5 loss=1.0755776092410088 	 accuracy=0.6701846965699209
----------
epoch 5/5 loss=1.05147323012352 	 accuracy=0.6701846965699209
----------
epoch 5/5 loss=1.074850931763649 	 accuracy=0.6649076517150396
----------
epoch 5/5 loss=1.0510712414979935 	 accuracy=0.662269129287599
----------
epoch 5/5 loss=1.0580564960837364 	 acc

## Evaluate most accurate classifier

In [36]:
def evaluate_classifier(model, test_loader, output_size):
  print('Evaluating model, please wait...')
  print('-'*10)

  preds = []
  ys = []
  for x, y in test_loader:
    batch_size = x.size(0)
    val_h = model.init_hidden(batch_size)
    # update hidden
    h = tuple([each.data for each in val_h])
    if str(device) == 'cuda':
      x, y = x.cuda(), y.cuda()
    y_pred, val_h = model(x, h)
    # accumulate predictions and labels
    _, pred = torch.max(y_pred.data, 1)
    preds.extend(pred.cpu().numpy())
    ys.extend(y.cpu().numpy())

  return preds, ys

In [40]:
# find most accurate classifier
with open(output_training_file_path, 'r') as f:
    trainings = json.load(f)
best_model = {'accuracy': -1, 'file_path': '', 'name': ''}
for training in trainings:
  if training['model']['accuracy'] > best_model['accuracy']:
    best_model['accuracy'] = training['model']['accuracy']
    best_model['file_path'] = training['model']['file_path']
    best_model['name'] = training['model']['name']
print(f'The most accurate model is {best_model["name"]}')

The most accurate model is lstm_model_256_2.pth
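The selection loop above can also be written with `max` and a key function. The records below are made-up placeholders that only mimic the structure of the saved training output; they are not results from this notebook.

```python
# Made-up records with the same nesting as the saved training output.
toy_trainings = [
    {'model': {'name': 'model_a.pth', 'accuracy': 0.61}},
    {'model': {'name': 'model_b.pth', 'accuracy': 0.67}},
    {'model': {'name': 'model_c.pth', 'accuracy': 0.64}},
]

# Pick the record whose stored accuracy is highest.
best = max(toy_trainings, key=lambda t: t['model']['accuracy'])
print(best['model']['name'])  # model_b.pth
```

Note that the stored `accuracy` is the mean of all validation accuracies recorded during training (`np.mean(val_accs)`), not the accuracy at the final step, so the ranking favors models that were accurate throughout training.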


In [41]:
# load model
clf = torch.load(best_model['file_path'])

In [42]:
preds, ys = evaluate_classifier(clf, test_loader, output_size)
# show classification report using evaluation results
preds = np.array(preds)
ys = np.array(ys)
encoder_file_path = os.path.join(models_dir, 'encoder_classes.npy')
class_nums = list(range(0,7))
report = classification_report(ys, preds, target_names=decode_labels(class_nums, encoder_file_path))
print(report)

Evaluating model, please wait...
----------
                              precision    recall  f1-score   support

               About Company       0.68      0.67      0.67        42
                    Benefits       0.67      0.55      0.60        29
               EOE/Diversity       0.87      0.81      0.84        16
Job Responsibilities/Summary       0.75      0.72      0.73       140
     Job Skills/Requirements       0.68      0.77      0.72        99
                   Job Title       0.00      0.00      0.00         3
                       Other       0.73      0.74      0.73        50

                    accuracy                           0.72       379
                   macro avg       0.62      0.61      0.61       379
                weighted avg       0.71      0.72      0.71       379
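The `macro avg` and `weighted avg` rows can be approximately reproduced from the per-class rows. The sketch below uses the rounded f1-scores and supports printed in the report, so small rounding differences from the library's internal values are possible.

```python
# Per-class f1-scores and supports copied from the report above.
f1 = [0.67, 0.60, 0.84, 0.73, 0.72, 0.00, 0.73]
support = [42, 29, 16, 140, 99, 3, 50]

# Macro average: every class counts equally.
macro_f1 = sum(f1) / len(f1)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.61 0.71
```

The gap between the two averages comes mostly from `Job Title`, which has only 3 test samples and an f1-score of 0.00: it pulls the macro average down but barely affects the weighted one.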



---

# Section 4. Use case

### Predict section of segments

The solution is checked by predicting the section of a sentence taken from the `jobs_test` dataset.

In [43]:
segment = 'The company began more than 100 years ago in Tulsa and has successfully diversified into a variety of industries, businesses and geographies. .'

Process segment

In [44]:
processed_seg = preprocess_segments([segment])
processed_seg[0]

['the',
 'company',
 'began',
 'more',
 'than',
 'years',
 'ago',
 'in',
 'tulsa',
 'and',
 'has',
 'successfully',
 'diversified',
 'into',
 'a',
 'variety',
 'of',
 'industries',
 'businesses',
 'and',
 'geographies']

Encode the sentence, that is, convert its words to numbers

In [45]:
vec_tokens = enconde_sentences(processed_seg)
print(f'Vector representing the segment: {vec_tokens[0]}')

Vector representing the segment: [24, 312, 2754, 12, 41, 266, 5936, 31, 1, 29, 889, 2294, 1229, 189, 4, 977, 17, 2551, 28, 29, 522]


Pad the sentence vector of numbers so it has the required length

In [46]:
features = padding_vectors(vec_tokens, seq_max_len)
print(f'Segment features: {features[0]}')

Segment features: [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0   24  312 2754
   12   41  266 5936   31    1   29  889 2294 1229  189    

Convert features to tensor

In [47]:
feature_tensor = torch.from_numpy(features)
print(f'Size feature tensor: {feature_tensor.size()}')

Size feature tensor: torch.Size([1, 200])


Make prediction

In [48]:
def make_prediction(clf, feature):
  batch_size = feature.size(0)
  # initialize hidden state
  h = clf.init_hidden(batch_size)
  if str(device)=='cuda':
    feature = feature.cuda()
  # make prediction
  y_pred, _ = clf(feature, h)
  _, pred = torch.max(y_pred.data, 1)
  return pred

In [49]:
# it is assumed that clf corresponds to the evaluated model
pred = make_prediction(clf, feature_tensor)

Output prediction result

In [50]:
class_names = decode_labels(list(range(0,7)), os.path.join(models_dir, 'encoder_classes.npy'))
for class_num, class_name in zip(list(range(0,7)), class_names):
  if class_num == pred[0]:
    print('Prediction result')
    print('-'*10)
    print(f'Segment: {segment}')
    print(f'Predicted section: {class_name}')
    print('-'*10)
    break

Prediction result
----------
Segment: The company began more than 100 years ago in Tulsa and has successfully diversified into a variety of industries, businesses and geographies. .
Predicted section: About Company
----------
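Because class ids are consecutive integers, the predicted section can also be looked up directly by index instead of looping over all classes. The sketch below assumes the class order shown in the classification report; the helper `predicted_section` and the scores are made up for illustration.

```python
# Class names in the order they appear in the classification report.
class_names = ['About Company', 'Benefits', 'EOE/Diversity',
               'Job Responsibilities/Summary', 'Job Skills/Requirements',
               'Job Title', 'Other']

def predicted_section(scores, names):
    # The index of the highest score is the predicted class id.
    class_id = max(range(len(scores)), key=scores.__getitem__)
    return names[class_id]

toy_scores = [2.3, 0.1, -0.4, 0.9, 0.5, -1.2, 0.2]  # made-up scores
section = predicted_section(toy_scores, class_names)
print(section)  # About Company
```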
