# Masked language prediction

#### Starting from a redacted PDF, can we guess the redactions?

### Setup

#### Start pdf-converter OCR service

Using this tool:
https://github.com/D2P-APPS/pdf-ocr-tool

Run the following command to start the container and its web service:

    docker-compose up

#### Set up BERT masked language prediction

In [17]:
# Predict candidate tokens for a masked word in a sentence using BertForMaskedLM.
# Candidates come from the BERT vocabulary (which includes subwords and full words).

import torch
from transformers import (BertForMaskedLM, BertTokenizer,
                          RobertaForMaskedLM, RobertaTokenizer)
from collections import OrderedDict
import logging

DEFAULT_MODEL_PATH='bert-base-cased'
DEFAULT_TO_LOWER=False
DEFAULT_TOP_K = 10
ACCRUE_THRESHOLD = 1

def init_model(model_path,to_lower):
    """
    Initialize the BertForMaskedLM model and its tokenizer.
    """
    logging.basicConfig(level=logging.INFO)
    tokenizer = BertTokenizer.from_pretrained(model_path,do_lower_case=to_lower)
    model = BertForMaskedLM.from_pretrained(model_path)
    #tokenizer = RobertaTokenizer.from_pretrained(model_path,do_lower_case=to_lower)
    #model = RobertaForMaskedLM.from_pretrained(model_path)
    model.eval()
    return model,tokenizer


def predict(model,tokenizer,top_k,accrue_threshold,text):
    """
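    Print the highest-scoring vocabulary tokens for the first [MASK] in text.
    Scores are raw logits, not probabilities; only logits above accrue_threshold are kept.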
    """
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    # Create the segments tensors.
    segments_ids = [0] * len(tokenized_text)

    masked_index = 0

    for i in range(len(tokenized_text)):
        if (tokenized_text[i] == "[MASK]"):
            masked_index = i
            break

    #print(tokenized_text)
    #print(masked_index)
    results_dict = {}

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    with torch.no_grad():
        # Pass segment ids by keyword: the second positional argument is attention_mask.
        predictions = model(tokens_tensor, token_type_ids=segments_tensors)
        for i in range(len(predictions[0][0,masked_index])):
            if (float(predictions[0][0,masked_index][i].tolist()) > accrue_threshold):
                tok = tokenizer.convert_ids_to_tokens([i])[0]
                results_dict[tok] = float(predictions[0][0,masked_index][i].tolist())

    k = 0
    sorted_d = OrderedDict(sorted(results_dict.items(), key=lambda kv: kv[1], reverse=True))
    for i in sorted_d:
        print(i,sorted_d[i])
        k += 1
        if (k > top_k):  # note: this stops after top_k + 1 entries have been printed
            break
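
The scores that `predict` prints are raw logits, so they are only comparable within a single run. A minimal sketch of how logits could be normalized with a softmax and reduced to the top k entries follows. The five-word vocabulary and its logits are made up for illustration; on the real model output, `torch.softmax` and `torch.topk` should do the same job on the tensor directly and avoid the Python loop over the whole vocabulary.

```python
import heapq
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(logits, vocab, k):
    """Return the k highest-probability (token, probability) pairs."""
    probs = softmax(logits)
    best_ids = heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)
    return [(vocab[i], probs[i]) for i in best_ids]


# Toy vocabulary and logits: assumptions for illustration, not model output.
toy_vocab = ["speed", "time", "height", "the", "##ing"]
toy_logits = [12.5, 11.9, 9.5, 0.3, -2.0]

for token, prob in top_k_tokens(toy_logits, toy_vocab, k=3):
    print(f"{token}\t{prob:.3f}")
```

Unlike the loop in `predict`, this returns exactly `k` entries (or the whole vocabulary if it is smaller than `k`).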

In [18]:
model,tokenizer = init_model(DEFAULT_MODEL_PATH, to_lower=DEFAULT_TO_LOWER)

INFO:transformers.tokenization_utils_base:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /home/kriesbeck/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /home/kriesbeck/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a67d7002391
INFO:transformers.configuration_utils:Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "

## Start with a redacted PDF

In [19]:
pdf_file = "neg-bop-vol1-part2.pdf"

## Get text from PDF with OCR tool

In [20]:
# convert pdf

# curl -X POST "http://localhost:5001/convert" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@neg-bop-vol1-part2.pdf;type=application/pdf"

In [21]:
# doc id

{
  "docId": "66f618ed9b53579cd4b3cffaaced1d12-0832735",
  "message": "File successfully converted"
}

{'docId': '66f618ed9b53579cd4b3cffaaced1d12-0832735',
 'message': 'File successfully converted'}

In [22]:
# get text

# curl -X GET "http://localhost:5001/download?docId=66f618ed9b53579cd4b3cffaaced1d12-0832735&type=txt" -H "accept: application/json"
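
The download step can also be scripted from Python instead of curl. The sketch below uses only the standard library and mirrors the `GET /download` call above. The helper names `build_download_url` and `download_text` are hypothetical, and it assumes the response body is the converted text itself, which the notebook does not confirm.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5001"  # host and port taken from the curl commands


def build_download_url(doc_id, file_type="txt", base_url=BASE_URL):
    """Build the /download URL used by the curl command."""
    query = urlencode({"docId": doc_id, "type": file_type})
    return f"{base_url}/download?{query}"


def download_text(doc_id, out_path, base_url=BASE_URL):
    """Fetch the converted text and save it (requires the OCR service to be running).

    Assumption: the response body is the text content and can be saved as-is.
    """
    with urlopen(build_download_url(doc_id, "txt", base_url)) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path


print(build_download_url("66f618ed9b53579cd4b3cffaaced1d12-0832735"))
```

With the service running, `download_text("66f618ed9b53579cd4b3cffaaced1d12-0832735", "bop.txt")` would write the file that the following cells read.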

In [23]:
txt_file = 'bop.txt'

In [34]:
with open(txt_file,"r") as f:
    text = f.read()

In [36]:
orig_text = text

In [37]:
print(orig_text)

 

 

 

APPROVED FOR
RELEASE DATE:

 

 

25-Jul-2011

. >
“ St ee
Shares oF 8

OFFICIAL HISTORY
OF THE
BAY OF PIGS OPERATION

VoLume |
Air OPERATIONS,

MarcH 1960 - Aprit 1961
(pages 303-506)

TOP ET

 

 

 

 

TIVES VUTTNTOUEH OY

CLA History Staff

 

 

 

 

Copy i or 2

TOP “SECRET

 

 

 

 

DCI-8
september 1979

Copy 1 of 2

PERMANENT HISTORICAL DOCUMENT

DO NOT DESTROY

 
 

 

 

 

ay

 
TOP SRCRET

VOLUME TIT

Contents
Page

Part IV Where Cuba Was Lost ........ . 303

A. Over the Beach -- 17 April 1961 . 303
B. Day of Grace -- 18 April 1961 . . 338
C. Four for A Cause -- 19 April

1961 ....6.2.48. ~ « « « 352

D. Bitter Recriminations: The Navy
CAP, 19 April 1961 ..... . . 362

E. Jet Fighters -- the Last Hope .. 390

F. Retrospective View of Air .

Operations ..... . 26. « « - 395
Appendixes

l. US Government's Anti-Castro Program . .. . 406
2. Memorandum for Chief, JMARC from Lt.

Col. George Gaines, Jr., sub: JMARC

Tactical B-26 Capability ....... . .. 418

3. An

In [26]:
# sample sentence from the Bay of Pigs history
text = """Although it cannot be determined accurately at
what height any of the Brigade's B-26's actually were
flying, Gar Thorsrud is of the opinion that they
probably would have been cruising at 8,000'-10,000'
for the early part of the trip, dropping down to
2,000' when approximately 15 miles off the target
by which time they would have been well past the
Essex."""

## Guess redactions with BERT masked language prediction

In [27]:
# mask "height"

text = """Although it cannot be determined accurately at
what [MASK] any of the Brigade's B-26's actually were
flying, Gar Thorsrud is of the opinion that they
probably would have been cruising at 8,000'-10,000'
for the early part of the trip, dropping down to
2,000' when approximately 15 miles off the target
by which time they would have been well past the
Essex."""

In [28]:
predict(model,tokenizer,DEFAULT_TOP_K,ACCRUE_THRESHOLD,text)

speed 12.544036865234375
time 11.864686965942383
rate 11.43674087524414
distance 10.931962966918945
degree 10.52767562866211
altitude 10.50318431854248
range 10.494011878967285
direction 10.279193878173828
level 10.113349914550781
position 9.554437637329102
height 9.454315185546875


In [29]:
# mask "Brigade's B-26's"

text = """Although it cannot be determined accurately at
what height any of the [MASK] actually were
flying, Gar Thorsrud is of the opinion that they
probably would have been cruising at 8,000'-10,000'
for the early part of the trip, dropping down to
2,000' when approximately 15 miles off the target
by which time they would have been well past the
Essex."""

In [30]:
predict(model,tokenizer,DEFAULT_TOP_K,ACCRUE_THRESHOLD,text)

men 10.376850128173828
bombers 9.472768783569336
boats 9.46126651763916
aircraft 9.459545135498047
ships 9.444518089294434
pilots 9.287113189697266
planes 9.121543884277344
three 8.551804542541504
two 8.54004192352295
tanks 8.48944091796875
four 8.458709716796875


In [31]:
# mask "target"

text = """Although it cannot be determined accurately at
what height any of the Brigade's B-26's actually were
flying, Gar Thorsrud is of the opinion that they
probably would have been cruising at 8,000'-10,000'
for the early part of the trip, dropping down to
2,000' when approximately 15 miles off the [MASK]
by which time they would have been well past the
Essex."""

In [32]:
predict(model,tokenizer,DEFAULT_TOP_K,ACCRUE_THRESHOLD,text)

ground 10.45962905883789
water 10.02135181427002
course 9.594242095947266
coast 9.505855560302734
river 8.977034568786621
track 8.895634651184082
air 8.824122428894043
target 8.814438819885254
wind 8.787298202514648
road 8.736257553100586
sea 8.680889129638672
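
One way to summarize the runs above is to check where the original word lands in each printed list. The sketch below parses the printed `token score` lines from the "height" and "target" runs (copied verbatim from the outputs) and reports the 1-based rank of the true word. The helper names are hypothetical.

```python
def parse_predictions(output):
    """Parse 'token score' lines as printed by predict() into an ordered list."""
    pairs = []
    for line in output.strip().splitlines():
        token, score = line.split()
        pairs.append((token, float(score)))
    return pairs


def rank_of(word, pairs):
    """Return the 1-based rank of word in the list, or None if it is absent."""
    for rank, (token, _) in enumerate(pairs, start=1):
        if token == word:
            return rank
    return None


height_output = """
speed 12.544036865234375
time 11.864686965942383
rate 11.43674087524414
distance 10.931962966918945
degree 10.52767562866211
altitude 10.50318431854248
range 10.494011878967285
direction 10.279193878173828
level 10.113349914550781
position 9.554437637329102
height 9.454315185546875
"""

target_output = """
ground 10.45962905883789
water 10.02135181427002
course 9.594242095947266
coast 9.505855560302734
river 8.977034568786621
track 8.895634651184082
air 8.824122428894043
target 8.814438819885254
wind 8.787298202514648
road 8.736257553100586
sea 8.680889129638672
"""

print(rank_of("height", parse_predictions(height_output)))  # 11
print(rank_of("target", parse_predictions(target_output)))  # 8
```

The middle run is left out because the masked phrase, "Brigade's B-26's", spans several tokens and so cannot appear as a single vocabulary entry.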


In [None]:
# TODO: how to identify redactions in the PDF-to-text conversion?
#  - identify redaction indices
#  - apply [MASK]s post-tokenization

# TODO: tokens vs phrases 
#  - https://stackoverflow.com/questions/61419089/use-bert-to-predict-multiple-tokens
#  - how much (% of text) do we need to redact before it's difficult to predict?

# TODO: use trained model / add domain language to vocab
# TODO: classification/boilerplate/junk text models
# TODO: can we exploit the size of redaction?  number of characters?
# TODO: bert vs roberta
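
As a starting point for the redaction-index and redaction-size TODOs, here is a minimal sketch. It assumes, hypothetically, that redactions reach the text as runs of a block character (`█`); the OCR output above does not show how redactions actually appear, so the pattern would need adapting. Each run is replaced with `[MASK]`, its character width is recorded, and the width is then used to filter candidate tokens.

```python
import re

# Hypothetical convention: each redaction appears as a run of block characters.
REDACTION_PATTERN = re.compile(r"█+")
MASK_TOKEN = "[MASK]"


def mask_redactions(text):
    """Replace each placeholder run with [MASK] and record its width in characters."""
    widths = []

    def _replace(match):
        widths.append(len(match.group(0)))
        return MASK_TOKEN

    return REDACTION_PATTERN.sub(_replace, text), widths


def filter_by_width(candidates, width, tolerance=1):
    """Keep candidate tokens whose length is within tolerance of the redaction width."""
    return [tok for tok in candidates if abs(len(tok) - width) <= tolerance]


# Illustrative input: a fragment of the sample sentence with "target" blocked out.
sample = "15 miles off the ██████ by which time"
masked, widths = mask_redactions(sample)
print(masked)
print(widths)

# Candidates taken from the "target" run above.
candidates = ["ground", "water", "course", "coast", "river", "track",
              "air", "target", "wind", "road", "sea"]
print(filter_by_width(candidates, widths[0]))
```

Two caveats: `predict` as written only scores the first `[MASK]`, so text with several redactions would need one pass per mask, and a width measured from a scanned page in a proportional font is only a rough proxy for character count.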