# NLP for Project Management
Model development to tag project management status reports with risk level tags "Red", "Amber", and "Green".

A sample dataset is used with the fastai (v2) library to demonstrate the steps involved in model development. fastai uses a transfer learning approach, which is relatively recent (2019) in NLP practice. Originating in image recognition modeling, transfer learning leverages pre-trained (usually large) models and adds or adjusts the final layers for a specific classification task. Classic NLP, before the advent of deep neural networks and transfer learning, involved tokenization and statistical analysis of word occurrences. Before jumping into the transfer learning demonstration, this notebook presents some classic statistical analysis. In addition to giving background on the history of NLP, the statistical analysis provides sanity checks on the dataset and adheres to the very modern principle that the more you check and recheck your data, the better.

## Setup
Run this notebook in the conda environment defined in the environment yaml file for the project.

Import the fastai.text modules.

In [None]:
from fastai.text.all import *
import pandas as pd

## The Dataset
The sample dataset for this notebook is in a CSV file 'pmd-nlp.csv'. First, simply read the data into a pandas dataframe and check to see how it looks.

In [None]:
df = pd.read_csv('pmd-nlp.csv')
df.head()

Notice the three columns **'label'**, **'text'**, and **'is_valid'**. This is a manually labeled dataset. The text of the status reports appears in the 'text' column, the Red-Amber-Green labels appear in the 'label' column, and the 'is_valid' column indicates the training/validation split. Reports marked for validation have 'True' in the 'is_valid' column. Note that fastai dataloaders can manage the training/validation split in a number of ways, some of which override the is_valid setting.

### Loading and Preprocessing
After loading the pandas dataframe, the next step is to scan the character strings and parse the text into word tokens. The tokens are then mapped to integers, which are simpler to work with than character strings. The integers also serve as column indices of a matrix of occurrence counts called the *document term* matrix. The numericalization step also builds the corpus *vocabulary*, which is a ranked list of words. The rank is by frequency of occurrence in the total dataset, cut off at a minimum frequency value, typically 3 or 2.
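As a rough illustration of the vocabulary and numericalization steps described above, here is a minimal pure-Python sketch. The toy reports and the helper names `build_vocab` and `numericalize` are hypothetical, and only a single `xxunk` placeholder is modeled; fastai's own `Numericalize` handles more special tokens than this.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical reports, already split into word tokens).
docs = [
    ["project", "is", "on", "track"],
    ["project", "is", "delayed"],
    ["budget", "is", "on", "hold"],
]

def build_vocab(tokenized_docs, min_freq=2, unknown="xxunk"):
    """Rank tokens by corpus frequency and drop those seen fewer than min_freq times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, n in counts.most_common() if n >= min_freq]
    # Index 0 is reserved for words that did not make the cutoff.
    return [unknown] + kept

def numericalize(tokens, vocab):
    """Map each token to its vocabulary index; rare or unseen words map to 0."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

vocab = build_vocab(docs)
ids = numericalize(docs[1], vocab)
print(vocab)
print(ids)
```

The integers in `ids` are the column indices that the document term matrix uses later in the notebook.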

#### The Dataframe
Notice the fundamental reference mechanism to the elements of the dataframe, **df**.

In [None]:
df.text[0]

In [None]:
df.label[0]

In [None]:
df.is_valid[0]

#### Tokenization
The default word tokenizer in fastai is spaCy. It is easy to find specific information on spaCy. A couple of starting points are https://spacy.io/usage/spacy-101 and https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/

Here is an example of the default tokenization of the text from the first status report in the dataset. Calling a WordTokenizer on a list of texts returns a generator, hence we need to use first() to get the list of tokens for the first text.

In [None]:
spacy = WordTokenizer()
tokens = first(spacy([df.text[0]]))
print(coll_repr(tokens, 30))

Define the tokenize function based on the spaCy generator

In [None]:
tokenize = Tokenizer(spacy)

Tokenize all (708 count) reports in the dataset

In [None]:
tokens_corpus = df.text.map(tokenize)

Check the count

In [None]:
len(tokens_corpus)

Look at the last report (document)

In [None]:
tokens_corpus[707]

#### Split training and validation
Split train and valid based on is_valid column. Also store the label in a tuple with the token text.

In [None]:
ttxt_train = [(tokens_corpus[i], df.label[i]) for i in range(len(tokens_corpus)) if not df.is_valid[i]]

Record the count of training reports (expect 606)

In [None]:
report_count_training = len(ttxt_train); report_count_training

In [None]:
ttxt_valid = [(tokens_corpus[i], df.label[i]) for i in range(len(tokens_corpus)) if df.is_valid[i]]

Record the count of validation reports (expect 102)

In [None]:
report_count_validation = len(ttxt_valid); report_count_validation

The first report in the training set, with label

In [None]:
ttxt_train[0]

and the last report in the training set

In [None]:
ttxt_train[605]

which is also the last report in the corpus. Remember that the training, validation, and corpus indices are not in sync

In [None]:
df.text[707]

but the df and corpus indices are in sync

In [None]:
tokens_corpus[707]

#### Numericalization
The word tokens are mapped to integers ranked in decreasing order of frequency of occurrence. The default minimum frequency is 3. Since the dataset is on the small side, we'll use a minimum frequency value of 2.

In [None]:
numerize = Numericalize(min_freq=2)
numerize.setup(tokens_corpus)

Check the length of the vocabulary. Note how the vocabulary list is referenced.

In [None]:
term_count = len(numerize.vocab); term_count

Set a friendlier handle and check the last 3 words in the vocabulary

In [None]:
full_vocab = numerize.vocab
full_vocab[-3:]

Is 'funding' in the vocabulary?

In [None]:
'funding' in full_vocab

In [None]:
full_vocab.index('funding')

Look both ways

In [None]:
full_vocab[908]

List the indices of all appearances of 'funding' to verify that the word is listed only once

In [None]:
[i for i,x in enumerate(full_vocab) if x=='funding']

The same check can be run over every word in the vocabulary; any word listed more than once is reported with its count

In [None]:
[(full_vocab.count(w),w) for w in full_vocab if full_vocab.count(w)>1]

### Assemble the doc (report) term matrix individually for the training and validation reports
report_count_training|validation = number of reports in the training|validation set\
term_count = number of words (terms) in vocabulary

The doc_term matrices will be report_count_training|validation rows by term_count columns.

#### Embedding vector
The document term matrix is just a stack of embedding vectors for the 606 | 102 documents in the training|validation document set. The number of elements in an embedding vector is equal to the length of the vocabulary list (term_count=1664). Each element of the vector records the occurrence count for the word at that index in the document the vector represents.

Make the embedding vector for the first document as an example.\
Start with the token text for ttxt_train[0]

In [None]:
ttxt_train[0][0]

next, numericalize the tokens

In [None]:
first_doc_nums = numerize(ttxt_train[0][0]); first_doc_nums

Then count the number of times each value occurs, and store the count at the word index tied to that value.

For example, 2 occurs once and corresponds to xxbos so store a 1 at index 2.

Note: this is not an efficient method for counting occurrences. For a small dataset and vocabulary it is okay to be inefficient, but change it if you have a big dataset.

In [None]:
np.array([list(first_doc_nums).count(i) for i in range(len(full_vocab))])

Define a function to assemble (stack) embedding vectors into a doc term matrix.

In [None]:
def assemble_docterm_matrix(ndocs, nterms, ttxt):
    # ndocs = report_count_training|validation
    # nterms = term_count
    # ttxt = tokenized text, one row per document
    # the first row of the matrix is from report ttxt[0]
    embv = numerize(ttxt[0][0])
    doc_term_matrix = np.array([list(embv).count(i) for i in range(nterms)])
    
    # append the remaining ndocs-1 embedding vectors to the matrix
    for di in range(1, ndocs):
        embv = numerize(ttxt[di][0])
        doc_term_matrix = np.vstack((doc_term_matrix,np.array([list(embv).count(i) for i in range(nterms)])))

    return doc_term_matrix

Create the matrix. Expect this to take 15 to 20 minutes.

In [None]:
doc_term_train = assemble_docterm_matrix(report_count_training, term_count, ttxt_train)

In [None]:
doc_term_train
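The `list(...).count(i)` call in `assemble_docterm_matrix` rescans each document once per vocabulary index, and `np.vstack` copies the whole matrix for every added row, which probably accounts for much of the run time. A minimal sketch of a single-pass alternative follows; the helper names `count_vector` and `assemble_docterm_rows` and the toy data are hypothetical, and plain Python lists are used so the example is self-contained.

```python
from collections import Counter

def count_vector(token_ids, nterms):
    """Return a length-nterms list of occurrence counts, tallied in one pass."""
    counts = Counter(token_ids)
    return [counts.get(i, 0) for i in range(nterms)]

def assemble_docterm_rows(docs_ids, nterms):
    """Build every row first; convert to an array once at the end if needed."""
    return [count_vector(ids, nterms) for ids in docs_ids]

# Hypothetical numericalized documents over a 6-term vocabulary.
docs_ids = [[2, 5, 3, 5], [2, 4], [2, 3, 3, 3]]
rows = assemble_docterm_rows(docs_ids, nterms=6)
for row in rows:
    print(row)
```

In the notebook, `np.array(rows)` would turn the result into a matrix in one step, and `np.bincount(ids, minlength=nterms)` is a NumPy call that produces the same per-document counts for an integer array `ids`.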

Saving the matrix to a file will save recreating it after the notebook kernel restarts

In [None]:
np.save('doc_term_train.npy', doc_term_train)

Use this to load the matrix from file if you have it

In [None]:
doc_term_train = np.load('doc_term_train.npy')

#### Sparse Matrix representation
The doc term matrix is very sparse: about 98% of its elements are zero, as computed below. Normally a sparse representation like CSR (compressed sparse row) would be used, but with a small dataset we won't bother.

In [None]:
dtm_non0 = np.count_nonzero(doc_term_train)
sparsity = (doc_term_train.size - dtm_non0)/doc_term_train.size
print(f'All the {doc_term_train.size} elements of the doc term matrix are zero except for {dtm_non0}')
print(f'that makes sparsity at {sparsity}')
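For reference, the CSR layout mentioned above keeps only the non-zero values plus two index arrays. The sketch below builds those three arrays by hand from a small dense matrix; `dense_to_csr` and the toy matrix are hypothetical and only illustrate the idea.

```python
def dense_to_csr(rows):
    """Convert a list of dense rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(col)     # the column it came from
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr

# Hypothetical 3 x 6 doc term matrix; the last document has no known words.
dense = [
    [0, 0, 1, 1, 0, 2],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
data, indices, indptr = dense_to_csr(dense)
sparsity = 1 - len(data) / (len(dense) * len(dense[0]))
print(data, indices, indptr, sparsity)
```

Row `r` of the matrix is recovered from `data[indptr[r]:indptr[r+1]]` and the matching slice of `indices`. In practice a library class such as `scipy.sparse.csr_matrix` would be used rather than hand-rolled lists.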

In [None]:
fig = plt.figure()
plt.spy(doc_term_train, markersize=0.10, aspect = 'auto')
fig.set_size_inches(8,6)
fig.savefig('doc_term_train_matrix.png', dpi=800)

Assemble the validation doc term matrix in the same manner.

In [None]:
doc_term_valid = assemble_docterm_matrix(report_count_validation, term_count, ttxt_valid)

In [None]:
doc_term_valid[0][:20]

Optionally save the matrix. This one is 1/6th the size of the training version, so it can easily be regenerated dynamically.

In [None]:
np.save('doc_term_valid.npy', doc_term_valid)

Use this to load the matrix from file if you have it

In [None]:
doc_term_valid = np.load('doc_term_valid.npy')

Also very sparse, 99%

In [None]:
dtm_non0 = np.count_nonzero(doc_term_valid)
sparsity = (doc_term_valid.size - dtm_non0)/doc_term_valid.size
print(f'All the {doc_term_valid.size} elements of the doc term matrix are zero except for {dtm_non0}')
print(f'that makes sparsity at {sparsity}')

### Data review summary
The data has been split into training and validation sets based on the column 'is_valid'.

#### Data items
**df** has 708 rows\
  df.label\
  df.text\
  df.is_valid

**tokens_corpus** is df.text tokenized by tokenize = Tokenizer(spacy)

**ttxt_train** is the training subset of tokens_corpus in tuple form with labels (tokens_corpus[i], df.label[i])\
  has report_count_training = 606 rows

**ttxt_valid** is the validation subset of tokens_corpus in tuple form with labels\
  has report_count_validation = 102 rows

**full_vocab** is the dataset vocabulary from numericalization with minimum frequency set to 2. The vocabulary is ordered in descending frequency of occurrence in the dataset.\
**term_count** is 1664, the size of the vocabulary

**doc_term_train** is the 606 x 1664 matrix of stacked embedding vectors for training reports giving the occurrence count of each word in the vocabulary per report

**doc_term_valid** is the 102 x 1664 matrix of stacked embedding vectors for validation reports giving the occurrence count of each word in the vocabulary per report

The label counts from the original csv dataset are:\
*GREEN* training = 275, validation = 38\
*AMBER* training = 195, validation = 31\
*RED* training = 136, validation = 33

### Statistical Analysis
Everything is now collected and prepared for some analysis. This starts with basic measures such as the proportion of reports with each label (*RED*, *AMBER*, and *GREEN*) and the most frequently occurring words in reports of each label. Building on that analysis, we'll see a way to make a classifier called Naive Bayes.

List the indices for GREEN, AMBER, and RED reports

In [None]:
green_t_indices = [i for i in range(len(ttxt_train)) if ttxt_train[i][1]=='GREEN']
green_v_indices = [i for i in range(len(ttxt_valid)) if ttxt_valid[i][1]=='GREEN']
amber_t_indices = [i for i in range(len(ttxt_train)) if ttxt_train[i][1]=='AMBER']
amber_v_indices = [i for i in range(len(ttxt_valid)) if ttxt_valid[i][1]=='AMBER']
red_t_indices = [i for i in range(len(ttxt_train)) if ttxt_train[i][1]=='RED']
red_v_indices = [i for i in range(len(ttxt_valid)) if ttxt_valid[i][1]=='RED']
print(f'GREEN training = {len(green_t_indices)}, validation = {len(green_v_indices)}\nAMBER training = {len(amber_t_indices)}, validation = {len(amber_v_indices)}\nRED training = {len(red_t_indices)}, validation = {len(red_v_indices)}')

The label counts match the values given for the input CSV file, which is the first data consistency check.

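#### Class priors
The proportion of the report count for each label value to the whole set.\
gpt = ratio of green reports to total in training\
gpv = ratio of green reports to total in validation, and likewise apt, apv, rpt, and rpv for amber and red.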

Should have:\
gpt + apt + rpt = 1\
gpv + apv + rpv = 1

In [None]:
gnt = len(green_t_indices)
gnv = len(green_v_indices)
ant = len(amber_t_indices)
anv = len(amber_v_indices)
rnt = len(red_t_indices)
rnv = len(red_v_indices)
gpt = gnt/report_count_training
gpv = gnv/report_count_validation
apt = ant/report_count_training
apv = anv/report_count_validation
rpt = rnt/report_count_training
rpv = rnv/report_count_validation
print(f'GREEN training proportion = {gpt}\nGREEN validation proportion = {gpv}\nAMBER training proportion = {apt}\nAMBER validation proportion = {apv}\nRED training proportion = {rpt}\nRED validation proportion = {rpv}')

In [None]:
gpt+apt+rpt

#### Occurrence counts
The occurrence count vectors per label value are the sum of the embedding vectors with that label.

Green

Occurrence count vector for GREEN training

In [None]:

OC_green = np.zeros(term_count)
for pos in range(len(green_t_indices)):
    OC_green += doc_term_train[green_t_indices[pos]]
OC_green[:44]

Notice the 275 count at index 2. That index corresponds to 'xxbos', the 'beginning of stream' token that occurs once at the start of each report. Since there are 275 *GREEN* reports in the training dataset, this is a consistency check.

Amber

Occurrence count vector for AMBER training

In [None]:
OC_amber = np.zeros(term_count)
for pos in range(len(amber_t_indices)):
    OC_amber += doc_term_train[amber_t_indices[pos]]
OC_amber[:45]

The 195 value at index 2 looks good for 'xxbos'.

Red

Occurrence count vector for RED training

In [None]:
OC_red = np.zeros(term_count)
for pos in range(len(red_t_indices)):
    OC_red += doc_term_train[red_t_indices[pos]]
OC_red[:44]

The 136 value at index 2 checks out.

Let's trace some occurrence counts to verify things look okay.

The first *Red* report in training is: (refer to df.head() also)

In [None]:
' '.join(ttxt_train[red_t_indices[0]][0])

Notice the report has the word 'table' four times and the word 'tables' one time -- check if this is correct in the doc term matrix

In [None]:
full_vocab.index('table')

In [None]:
doc_term_train[red_t_indices[0]][223]

In [None]:
print(f'There is {doc_term_train[red_t_indices[0]][full_vocab.index("tables")]} occurrence of "tables" in the first Red report')

Check how often 'project' appears in GREEN, AMBER and RED reports

In [None]:
term_index = full_vocab.index('project')
print(f'The word "project" appears {OC_red[term_index]} and {OC_amber[term_index]} and {OC_green[term_index]} times in red, amber and green documents, respectively')

Check how often 'completed' appears in GREEN, AMBER and RED reports

In [None]:
term_index = full_vocab.index('completed')
print(f'The word "completed" appears {OC_red[term_index]} and {OC_amber[term_index]} and {OC_green[term_index]} times in red, amber and green documents, respectively')

GREEN reports with the word project

In [None]:
term_index = full_vocab.index('project')
gr_prj_indices = [i for i in range(len(ttxt_train)) if (ttxt_train[i][1]=='GREEN') and (doc_term_train[i,term_index]>0)]
gr_prj_indices[-15:]

Look at the one at index 600

In [None]:

' '.join(ttxt_train[600][0]), ttxt_train[600][1]

Look at the counts. There are supposed to be 3 occurrences of "project" in report 600

In [None]:
[doc_term_train[i,term_index] for i in gr_prj_indices]

#### Conditional likelihood
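L(t|green), the likelihood that a term (word) appears in a GREEN report, is OC_green/gnt. The code below adds 1 to both the numerator and the denominator so that a word absent from one label does not produce a zero likelihood, which would break the log ratios that follow.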

Once we have conditional likelihood vectors, the commonly occurring Red, Amber, and Green words can be identified.

Conditional likelihood vectors for each label value

In [None]:
CL_green = (OC_green + 1) / (gnt + 1)
CL_amber = (OC_amber + 1) / (ant + 1)
CL_red = (OC_red + 1) / (rnt + 1)

Log ratios between each pair of label values

In [None]:
Rga = np.log(CL_green/CL_amber)
Rgr = np.log(CL_green/CL_red)
Rar = np.log(CL_amber/CL_red)
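To make the smoothed likelihoods and log ratios concrete, here is a small sketch with made-up numbers. The three-word vocabulary, the counts, and the helper names `conditional_likelihood` and `log_ratio` are hypothetical; the arithmetic mirrors the `(OC + 1) / (n + 1)` and `np.log(CL_a / CL_b)` expressions used above.

```python
import math

def conditional_likelihood(occurrence_counts, n_reports):
    """Add-one smoothed likelihood of each word for one label."""
    return [(c + 1) / (n_reports + 1) for c in occurrence_counts]

def log_ratio(cl_a, cl_b):
    """Positive values favor label a, negative values favor label b."""
    return [math.log(a / b) for a, b in zip(cl_a, cl_b)]

# Hypothetical counts for the vocabulary ["track", "delay", "blocked"].
oc_green = [9, 1, 0]   # occurrences across 9 green reports
oc_red = [0, 3, 4]     # occurrences across 4 red reports

cl_green = conditional_likelihood(oc_green, n_reports=9)
cl_red = conditional_likelihood(oc_red, n_reports=4)
r_green_red = log_ratio(cl_green, cl_red)
print(r_green_red)
```

Without the added 1, "track" would have a red likelihood of zero and the ratio would be undefined. With smoothing, "track" gets a positive (green-leaning) ratio while "delay" and "blocked" get negative (red-leaning) ratios.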

GREEN to RED comparative

In [None]:

n_tokens = 10
top_Rgr = np.argpartition(Rgr, -n_tokens)[-n_tokens:]
bot_Rgr = np.argpartition(Rgr, n_tokens)[:n_tokens]

In [None]:
print(f'Top {n_tokens} log-count ratios: {Rgr[list(top_Rgr)]}\n')
print(f'Bottom {n_tokens} log-count ratios: {Rgr[list(bot_Rgr)]}')

Green words in green to red comparison

In [None]:
[full_vocab[i] for i in top_Rgr]

Not necessarily all *Green* sounding words, but not unexpected either. 'well', 'soon', 'working', 'good', and 'track' sound *Green*

Red words in green to red comparison

In [None]:

[full_vocab[i] for i in bot_Rgr]

'cancelled', 'without', and 'unable' look the most *Red* here

AMBER to RED comparative

In [None]:
n_tokens = 10
top_Rar = np.argpartition(Rar, -n_tokens)[-n_tokens:]
bot_Rar = np.argpartition(Rar, n_tokens)[:n_tokens]

In [None]:
print(f'Top {n_tokens} log-count ratios: {Rar[list(top_Rar)]}\n')
print(f'Bottom {n_tokens} log-count ratios: {Rar[list(bot_Rar)]}')

Amber words in amber to red comparison

In [None]:
[full_vocab[i] for i in top_Rar]

Red words in amber to red comparison

In [None]:
[full_vocab[i] for i in bot_Rar]

'without' and maybe 'milestone', as in missed milestone are the only *Red* sounding words ...

GREEN to AMBER comparative

In [None]:

n_tokens = 10
top_Rga = np.argpartition(Rga, -n_tokens)[-n_tokens:]
bot_Rga = np.argpartition(Rga, n_tokens)[:n_tokens]

In [None]:
print(f'Top {n_tokens} log-count ratios: {Rga[list(top_Rga)]}\n')
print(f'Bottom {n_tokens} log-count ratios: {Rga[list(bot_Rga)]}')

Green words in green to amber comparison

In [None]:
[full_vocab[i] for i in top_Rga]

Amber words in green to amber comparison

In [None]:
[full_vocab[i] for i in bot_Rga]

'impacted', 'delay', 'divestiture', and 'risk' sound *Amber*

RED reports with the word risk

In [None]:
term_index = full_vocab.index('risk')
red_risk_indices = [i for i in range(len(ttxt_train)) if (ttxt_train[i][1]=='RED') and (doc_term_train[i,term_index]>0)]
red_risk_indices

RED reports with the word risks

In [None]:
term_index = full_vocab.index('risks')
red_risks_indices = [i for i in range(len(ttxt_train)) if (ttxt_train[i][1]=='RED') and (doc_term_train[i,term_index]>0)]
red_risks_indices

Dataset bias: the log ratio of green to red, amber to red, and green to amber labeled items in the training set

In [None]:
b_green_red = np.log(gpt/rpt)
b_amber_red = np.log(apt/rpt)
b_green_amber = np.log(gpt/apt)
print(f'bias values for green to red = {b_green_red}\n                amber to red = {b_amber_red}\n              green to amber = {b_green_amber}')

### A Naive Bayes Classifier
Use the validation doc term matrix, doc_term_valid, binarized so that the weights are 1 or 0: 1 if the word occurs one or more times in the report and 0 if it is absent. Add the appropriate bias to each pairwise score.

In [None]:
W = np.sign(doc_term_valid)

Predict labels for the validation data

In [None]:
preds_amber_red = ['AMBER' if p else 'RED' for p in (W @ Rar + b_amber_red) > 0]
preds_green_red = ['GREEN' if p else 'RED' for p in (W @ Rgr + b_green_red) > 0]
preds_green_amber = ['GREEN' if p else 'AMBER' for p in (W @ Rga + b_green_amber) > 0]
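Each of the three lines above is a pairwise decision: sum the log ratios of the words present in a report, add the bias, and pick the first label if the score is positive. A minimal sketch for a single report follows; `pairwise_predict`, the ratios, and the bias value are hypothetical illustrations, not values from this dataset.

```python
def pairwise_predict(doc_counts, log_ratios, bias, positive, negative):
    """Score one report against one label pair and return the winning label."""
    # Binarize the counts, as np.sign does for non-negative counts.
    present = [1 if c > 0 else 0 for c in doc_counts]
    score = sum(w * r for w, r in zip(present, log_ratios)) + bias
    return positive if score > 0 else negative

# Hypothetical green-vs-red log ratios for ["track", "delay", "blocked"].
log_ratios = [1.6, -1.4, -2.3]
bias = 0.7  # assumed log ratio of green to red report counts

label_a = pairwise_predict([2, 0, 0], log_ratios, bias, "GREEN", "RED")
label_b = pairwise_predict([1, 1, 1], log_ratios, bias, "GREEN", "RED")
print(label_a, label_b)
```

The matrix expression `(W @ Rgr + b_green_red) > 0` performs this same computation for every validation report at once.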

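Each report now has three pairwise predictions. Taking the label that wins two out of the three comparisons seems like a reasonable final prediction.

Collect the three pairwise predictions per report. Comparing the majority label to the validation labels gives an accuracy for the Naive Bayes classifier.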

In [None]:
triple_preds = [[preds_amber_red[i], preds_green_red[i], preds_green_amber[i]] for i in range(len(preds_amber_red))]
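One way to turn the triples into a final label and an accuracy figure is sketched below. The helper names `majority_vote` and `accuracy`, the tie handling, and the toy predictions are assumptions for illustration; note that three pairwise classifiers can disagree completely, so a three-way tie needs some rule.

```python
from collections import Counter

def majority_vote(votes, tie_label=None):
    """Return the label that wins at least two of the three pairwise votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else tie_label

def accuracy(predictions, targets):
    """Fraction of predictions that match the target labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical [amber_red, green_red, green_amber] predictions for 4 reports.
toy_triples = [
    ["AMBER", "GREEN", "GREEN"],
    ["RED", "RED", "AMBER"],
    ["AMBER", "RED", "GREEN"],   # three-way tie: no label gets two votes
    ["AMBER", "GREEN", "AMBER"],
]
toy_targets = ["GREEN", "RED", "AMBER", "GREEN"]

final = [majority_vote(v) for v in toy_triples]
acc = accuracy(final, toy_targets)
print(final, acc)
```

In the notebook, the same functions could be applied to `triple_preds`, with the targets taken from the labels stored in `ttxt_valid`, for example `[label for _, label in ttxt_valid]`. Ties are counted as misses here; falling back to the most common training label would be another reasonable choice.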

### Analysis Summary
The statistical analysis gives a feel for the quality of the data and the quality of the labelling. The Naive Bayes classifier is historically interesting. Distinction between *Green* and *Amber* words in the dataset may be problematic based on the statistical analysis.

## Neural Network NLP
The approach here, using fastai, is to develop a language model based on the vocabulary of the dataset.

dls_lm holds the dataloaders for the language model and is created from the dataframe **df** as shown below.

In [None]:
dls_lm = TextDataLoaders.from_df(df,
                                 text_col = 'text',
                                 label_col = 'label',
                                 valid_pct = 0.20,
                                 bs = 64,
                                 is_lm = True)

In [None]:
dls_lm.show_batch()

### Transfer learning
Next, the language model dataloaders are combined with the pretrained core model, AWD_LSTM. AWD_LSTM defines the model architecture, which is described in https://arxiv.org/abs/1708.02182

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

In [None]:
learn.fit_one_cycle(10, 2e-2)

Unfreeze the layers of the model to update the weights according to our specific dataset vocabulary

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

Save the fine-tuned encoder

In [None]:
learn.save_encoder('finetuned')

The retrained model can now be used as a predictor to predict next words. Here, given the initial text, 40 more words are generated, twice.

In [None]:
TEXT = "The project has some risks"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [None]:
print("\n".join(preds))

Now create the dataloaders for a classifier. The validation split is set at 20%, which ignores the 'is_valid' column in the dataset.

In [None]:
dls_clas = TextDataLoaders.from_df(df,
                        valid_pct = 0.2,
                        text_col = 'text',
                        label_col = 'label',
                        bs = 64,
                        text_vocab = dls_lm.vocab)

In [None]:
dls_clas.show_batch()

Train the classifier

In [None]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()

In [None]:
learn = learn.load_encoder('finetuned')

In [None]:
learn.fit_one_cycle(1, 2e-2)

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

The learning rate finder below can help select a learning rate. It is usually run before calling fit_one_cycle.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
learn.lr_find()

A confusion matrix helps point out ambiguities. Here we see *Green/Amber* fuzziness similar to what we saw in the statistical analysis of the data.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

### Sample classification results

In [None]:
learn.predict('There is a high risk of staffing loss.')

In [None]:
learn.predict('No work was completed and there is no support.')

In [None]:
learn.predict('Everything is on schedule and support is standing by if needed.')

Save the model in export.pkl

In [None]:
learn.export()

#### Running on CPU or GPU
Use the code below to check the CUDA device status for the execution environment

In [None]:
import torch
x = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
print(x)

Copyright (c) 2022, Oracle and/or its affiliates.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.