# **Grammar Error Correction using BERT**


***Use of BERT Masked Language Model (MLM) for Grammar Error Correction (GEC), without the use of annotated data***

Sunil Chomal | sunilchomal@gmail.com

In [None]:
%%html
<img src='/nbextensions/google.colab/GEC.png' />

**High-level workflow**

•	Tokenize the sentence using spaCy

•	Check for spelling errors using Hunspell

•	For all prepositions, determiners & helper verbs, create a set of probable sentences

•	Create a set of sentences with each word “masked”, deleted, or with an additional determiner, preposition or helper verb added

•	Use the BERT Masked Language Model to determine possible suggestions for the masks

•	Use the Grammar Error Detection (GED) model to select appropriate solutions


In [1]:
# install pytorch_pretrained_bert, the predecessor of pytorch-transformers
!pip install -U pytorch_pretrained_bert

# install torch
!pip install torch

# install tensorflow (provides keras, used below for pad_sequences)
!pip install tensorflow

Requirement already up-to-date: pytorch_pretrained_bert in /m/home/home0/02/raya1/unix/.local/lib/python3.8/site-packages (0.6.2)


In [2]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
# get_device_name raises an error on CPU-only machines, so guard the call
torch.cuda.get_device_name(0) if n_gpu > 0 else "cpu"

'Quadro P2200'

In [4]:
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

In [5]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /u/02/raya1/unix/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [6]:
from keras.utils import pad_sequences
import numpy as np

def check_GE(sents):
    """Check whether the input sentences have grammatical errors

    :param sents: list of sentences
    :return: predicted labels (1 is treated as "no error" by the callers),
             raw logits for each sentence
    :rtype: (numpy.ndarray, list of numpy.ndarray)
    """
    
    # Create sentence and label lists
    # We need to add special tokens at the beginning and end of each sentence
    # for BERT to work properly
    sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sents]
    labels =[0]

    tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

    # Padding Sentences
    # Set the maximum sequence length. The longest sequence in our training set
    # is 47, but we'll leave room on the end anyway.
    # In the original paper, the authors used a length of 512.
    MAX_LEN = 128

    predictions = []
    true_labels = []

    # Convert tokens to vocabulary ids, then pad (or truncate) every
    # sequence to MAX_LEN
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, 
                              dtype="long", truncating="post", padding="post")

    # Attention masks
    # Create attention masks
    attention_masks = []

    # Create a mask of 1s for each token followed by 0s for padding
    for seq in input_ids:
      seq_mask = [float(i > 0) for i in seq]
      attention_masks.append(seq_mask)

    prediction_inputs = torch.tensor(input_ids)
    prediction_masks = torch.tensor(attention_masks)
    prediction_labels = torch.tensor(labels)

    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits = modelGED(prediction_inputs, token_type_ids=None, 
                        attention_mask=prediction_masks)

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    # label_ids = b_labels.to("cpu").numpy()

    # Store predictions and true labels
    predictions.append(logits)
    # true_labels.append(label_ids)

  #   print(predictions)
    flat_predictions = [item for sublist in predictions for item in sublist]
  #   print(flat_predictions)
    prob_vals = flat_predictions
    flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
    # flat_true_labels = [item for sublist in true_labels for item in sublist]
  #   print(flat_predictions)
    return flat_predictions, prob_vals

2023-05-03 23:40:05.393964: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-03 23:40:05.457529: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
  from pandas.core.computation.check import NUMEXPR_INSTALLED
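
The callers of `check_GE` further down convert the two returned logits into a probability by hand each time. A minimal sketch of that step as a reusable helper follows. It assumes index 1 is the "no error" class, which is what the `softmax[1]` thresholds in this notebook imply; the name `grammatical_probability` is hypothetical and not part of the original code.

```python
import math

def grammatical_probability(logits):
    """Softmax over a pair of logits; returns the probability of index 1.

    Assumption: index 1 is the "no error" class, as implied by the
    softmax[1] thresholds in this notebook. The name is illustrative only.
    """
    # Subtract the maximum before exponentiating to avoid overflow
    # on large logits; the resulting probabilities are unchanged.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    return exps[1] / sum(exps)

print(round(grammatical_probability([0.0, 0.0]), 3))
print(grammatical_probability([-2.0, 3.0]) > 0.6)
```

With such a helper, a check like `softmax[1] > 0.6` could be written as `grammatical_probability(prob_val[0]) > 0.6`.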


In [19]:
# !wget https://github.com/prasmussen/gdrive/releases/download/2.1.1/gdrive_2.1.1_linux_amd64.tar.gz
# !gunzip gdrive_2.1.1_linux_amd64.tar.gz
# !sudo mkdir /usr/local/bin/gdrive
# !sudo cp gdrive-linux-amd64 /usr/local/bin/gdrive
# !sudo chmod a+x /usr/local/bin/gdrive

In [7]:
!pip install gdown
# !gdown 1-Oz7vSFor41eLoxqZzR83GCC5SUxPpr8
!gdown 1-CMJp5y-bJlccQua1SgG20q9Dp2xPZrt

Downloading...
From (uriginal): https://drive.google.com/uc?id=1-CMJp5y-bJlccQua1SgG20q9Dp2xPZrt
From (redirected): https://drive.google.com/uc?id=1-CMJp5y-bJlccQua1SgG20q9Dp2xPZrt&confirm=t&uuid=d27fd4e0-e5b0-4af4-b009-30ab4167032b
To: /m/home/home0/02/raya1/data/Desktop/SNLP/bert-base-uncased-GED.pth
100%|████████████████████████████████████████| 438M/438M [00:19<00:00, 22.9MB/s]


In [8]:
# remove

#
# CREDIT: https://stackoverflow.com/a/39225039
#

import requests

def download_file_from_google_drive(id, destination):
  print("Trying to fetch {}".format(destination))

  def get_confirm_token(response):
    for key, value in response.cookies.items():
      if key.startswith('download_warning'):
        return value

    return None

  def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
      for chunk in progress_bar(response.iter_content(CHUNK_SIZE)):
        if chunk: # filter out keep-alive new chunks
          f.write(chunk)

  URL = "https://docs.google.com/uc?export=download"

  session = requests.Session()

  response = session.get(URL, params = { 'id' : id }, stream = True)
  token = get_confirm_token(response)

  if token:
    params = { 'id' : id, 'confirm' : token }
    response = session.get(URL, params = params, stream = True)

  save_response_content(response, destination)

In [9]:
# remove 

def progress_bar(some_iter):
    try:
        from tqdm import tqdm
        return tqdm(some_iter)
    except ModuleNotFoundError:
        return some_iter

In [11]:
# remove


# load previously trained BERT Grammar Error Detection model

# download from public google drive link
# download_file_from_google_drive("1-Afp2trJBwwDNZmf0Hrq1Fro-g8Q3GcO", "./bert-based-uncased-GED.pth")

Trying to fetch ./bert-based-uncased-GED.pth


1it [00:00, 2013.59it/s]


In [10]:
# https://pytorch.org/tutorials/beginner/saving_loading_models.html

from pytorch_pretrained_bert import BertForSequenceClassification

modelGED = BertForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                      num_labels=2)

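# restore model; map_location="cpu" lets a checkpoint saved on a GPU
# load on a CPU-only machine as well
modelGED.load_state_dict(torch.load('bert-base-uncased-GED.pth', map_location="cpu"))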
modelGED.eval()

INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /u/02/raya1/unix/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.modeling:extracting archive file /u/02/raya1/unix/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpxu8lrdg2
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

INFO:pytorch_pretrained_bert.modeling:Weights of B

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
   

In [11]:
# Load pre-trained model (weights) for Masked Language Model (MLM)
model = BertForMaskedLM.from_pretrained('bert-large-uncased')
model.eval()

INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz from cache at /u/02/raya1/unix/.pytorch_pretrained_bert/214d4777e8e3eb234563136cd3a49f6bc34131de836848454373fa43f10adc5e.abfbb80ee795a608acbf35c7bf2d2d58574df3887cdd94b355fc67e03fddba05
INFO:pytorch_pretrained_bert.modeling:extracting archive file /u/02/raya1/unix/.pytorch_pretrained_bert/214d4777e8e3eb234563136cd3a49f6bc34131de836848454373fa43f10adc5e.abfbb80ee795a608acbf35c7bf2d2d58574df3887cdd94b355fc67e03fddba05 to temp dir /tmp/tmpbsnile02
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

INFO:pytorch_pretrained_bert.modeling:Weights fr

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
      

In [12]:
# Load pre-trained model tokenizer (vocabulary)
tokenizerLarge = BertTokenizer.from_pretrained('bert-large-uncased')

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt from cache at /u/02/raya1/unix/.pytorch_pretrained_bert/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [13]:
# install the packages for Hunspell

!apt-get install libhunspell-dev
# !sudo apt-get install hunspell
!sudo apt-get install libhunspell-1.6-0 
!pip install CyHunspell

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
[sudo] password for raya1: 


In [14]:
from hunspell import Hunspell
import os

# download the en_GB dictionary for hunspell
download_file_from_google_drive("1jC5BVF9iZ0gmRQNmDcZnhfFdEYv8RNok", "./en_GB-large.dic")
download_file_from_google_drive("1g8PO8kdw-YmyOY_HxjnJ5FfdJFX4bsPv", "./en_GB-large.aff")

gb = Hunspell("en_GB-large", hunspell_data_dir=".")

Trying to fetch ./en_GB-large.dic


27it [00:00, 403.95it/s]


Trying to fetch ./en_GB-large.aff


1it [00:00, 3231.36it/s]


In [15]:
# List of common determiners
# det = ["", "the", "a", "an"]
det = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 'my', 'your', 'his', 
       'her', 'its', 'our', 'their', 'all', 'both', 'half', 'either', 'neither', 
       'each', 'every', 'other', 'another', 'such', 'what', 'rather', 'quite']

# List of common prepositions
prep = ["about", "at", "by", "for", "from", "in", "of", "on", "to", "with", 
        "into", "during", "including", "until", "against", "among", 
        "throughout", "despite", "towards", "upon", "concerning"]

# List of helping verbs
helping_verbs = ['am', 'is', 'are', 'was', 'were', 'being', 'been', 'be', 
                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 
                 'shall', 'should', 'may', 'might', 'must', 'can', 'could']
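
The functions below test membership with `list(set(det + prep + helping_verbs))`, which rebuilds the combined list on every check. A minimal sketch of an alternative, assuming a set built once is acceptable: the names `FUNCTION_WORDS` and `is_function_word` are hypothetical, the lists are shortened copies for illustration, and the lower-casing is an addition that the original code does not perform.

```python
# Abbreviated copies of the notebook's lists, shortened for illustration.
det = ['the', 'a', 'an', 'this', 'that']
prep = ['about', 'at', 'by', 'for', 'from', 'in', 'of', 'on', 'to', 'with']
helping_verbs = ['am', 'is', 'are', 'was', 'were', 'be', 'have', 'has']

# Hypothetical name: a frozenset built once gives constant-time lookups.
FUNCTION_WORDS = frozenset(det + prep + helping_verbs)

def is_function_word(token):
    """Return True if the lower-cased token is in any of the three lists."""
    return token.lower() in FUNCTION_WORDS

print([w for w in "The cat sat at mat".split() if is_function_word(w)])
```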

In [16]:
# test sentences

org_text = []
org_text.append("They drank the pub .")
org_text.append("I am looking forway to see you soon .")
org_text.append("The cat sat at mat .")
org_text.append("Giant otters is an apex predator .")
org_text.append('There is no a doubt, tracking system has brought many benefits in this information age .')

In [17]:
# !pip install -U pip setuptools wheel
!pip install -U spacy==2.2.4
# !/usr/bin/python3 -m spacy download en_core_web_sm


Requirement already up-to-date: spacy==2.2.4 in /m/home/home0/02/raya1/unix/.local/lib/python3.8/site-packages (2.2.4)


In [18]:
# get Doc object from spaCy
# from spacy.tokens import Doc
import spacy
import numpy as np


def create_spelling_set(org_text):
  """ Create a set of sentences which have possible corrected spellings
  """
  
  sent = org_text
  sent = sent.lower()
  sent = sent.strip().split()


  nlp = spacy.load("en")
  proc_sent = nlp.tokenizer.tokens_from_list(sent)
  nlp.tagger(proc_sent)

  sentences = []

  for tok in proc_sent:
    # check spelling of purely alphabetic tokens only
    if tok.text.isalpha() and not gb.spell(tok.text):
      new_sent = sent[:]
      # append new sentences with possible corrections
      for sugg in gb.suggest(tok.text):
        new_sent[tok.i] = sugg
        sentences.append(" ".join(new_sent))

  spelling_sentences = sentences

  # retain new sentences which have a 
  # minimum chance of correctness using BERT GED
  new_sentences = []
  
  for sent in spelling_sentences:
    no_error, prob_val = check_GE([sent])
    exps = [np.exp(i) for i in prob_val[0]]
    sum_of_exps = sum(exps)
    softmax = [j/sum_of_exps for j in exps]
    if(softmax[1] > 0.6):
      new_sentences.append(sent)
  
  
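  # if no corrections, append the original sentence
  if len(spelling_sentences) == 0:
    spelling_sentences.append(" ".join(sent))

  # add the GED-filtered sentences and eliminate duplicates
  # note: every sentence in new_sentences is already in spelling_sentences,
  # so this step only removes duplicates (first occurrence is kept)
  spelling_sentences.extend(new_sentences)
  spelling_sentences = list(dict.fromkeys(spelling_sentences))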

  return spelling_sentences
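
The substitution step in `create_spelling_set` can be tried without spaCy or Hunspell. The sketch below is a simplified stand-in, not the function above: `spelling_candidates`, `is_correct` and `suggest` are hypothetical names, the spell checker is replaced by a toy dictionary, and the GED filtering is left out.

```python
def spelling_candidates(sentence, is_correct, suggest):
    """Return one candidate sentence per suggestion for each misspelled word.

    is_correct and suggest stand in for a spell checker's spell/suggest
    calls; they are parameters so the sketch runs without Hunspell.
    """
    words = sentence.lower().strip().split()
    candidates = []
    for idx, word in enumerate(words):
        # Only purely alphabetic tokens are spell-checked
        if word.isalpha() and not is_correct(word):
            for suggestion in suggest(word):
                new_words = words[:]
                new_words[idx] = suggestion
                candidates.append(" ".join(new_words))
    # Fall back to the lower-cased input when no candidate was produced
    return candidates or [" ".join(words)]

# Toy data, purely illustrative
toy_suggestions = {"forway": ["forward", "foray"]}
known = {"i", "am", "looking", "to", "see", "you", "soon"}

result = spelling_candidates(
    "I am looking forway to see you soon .",
    is_correct=lambda w: w in known,
    suggest=lambda w: toy_suggestions.get(w, []),
)
for line in result:
    print(line)
```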

In [19]:
def create_grammar_set(spelling_sentences):
  """ create a new set of sentences with deleted determiners, 
      prepositions & helping verbs
      
  """
  
  new_sentences = []

  for text in spelling_sentences:
    sent = text.strip().split()
    for i in range(len(sent)):
      new_sent = sent[:]
      
      if new_sent[i] not in list(set(det + prep + helping_verbs)):
        continue
      
      del new_sent[i]
      text = " ".join(new_sent)
      
      # retain new sentences which have a 
      # minimum chance of correctness using BERT GED
      no_error, prob_val = check_GE([text])
      exps = [np.exp(i) for i in prob_val[0]]
      sum_of_exps = sum(exps)
      softmax = [j/sum_of_exps for j in exps]
      if(softmax[1] > 0.6):
        new_sentences.append(text)
  
  # add the retained sentences and eliminate duplicates
  spelling_sentences.extend(new_sentences)
  spelling_sentences = list(dict.fromkeys(spelling_sentences))
  return spelling_sentences
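
To see which deletion variants are generated before the GED model filters them, the core loop can be isolated. This is a minimal sketch under stated assumptions: `deletion_candidates` is a hypothetical name, the word set is a small illustrative subset of the lists above, and no GED scoring is applied.

```python
def deletion_candidates(sentence, function_words):
    """Return one variant per function word, with that word removed."""
    words = sentence.strip().split()
    candidates = []
    for idx, word in enumerate(words):
        if word in function_words:
            # Join everything except the word at position idx
            candidates.append(" ".join(words[:idx] + words[idx + 1:]))
    return candidates

# Illustrative subset of determiners, prepositions and helping verbs
function_words = {"the", "a", "an", "at", "on", "is", "are"}

for cand in deletion_candidates("the cat sat at mat .", function_words):
    print(cand)
```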

In [20]:
def create_mask_set(spelling_sentences):
  """For each word position in each input sentence, create 2 sentences
     (1) the word replaced by [MASK]
     (2) a [MASK] inserted in the space before the word
  """
  sentences = []

  for sent in spelling_sentences:
    sent = sent.strip().split()
    for i in range(len(sent)):
      # (1) [MASK] each word
      new_sent = sent[:]
      new_sent[i] = '[MASK]'
      text = " ".join(new_sent)
      new_sent = '[CLS] ' + text + ' [SEP]'
      sentences.append(new_sent)

      # (2) [MASK] for each space between words
      new_sent = sent[:]
      new_sent.insert(i, '[MASK]')
      text = " ".join(new_sent)
      new_sent = '[CLS] ' + text + ' [SEP]'
      sentences.append(new_sent)

  return sentences
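
A small standalone version of the same masking scheme makes the output easy to inspect. The name `mask_variants` is hypothetical; the logic mirrors the two cases in `create_mask_set` for a single sentence.

```python
def mask_variants(sentence):
    """Return 2 masked sentences per word: replaced and inserted-before."""
    words = sentence.strip().split()
    variants = []
    for idx in range(len(words)):
        # (1) replace the word at idx with [MASK]
        replaced = words[:idx] + ["[MASK]"] + words[idx + 1:]
        variants.append("[CLS] " + " ".join(replaced) + " [SEP]")
        # (2) insert [MASK] in the space before the word at idx
        inserted = words[:idx] + ["[MASK]"] + words[idx:]
        variants.append("[CLS] " + " ".join(inserted) + " [SEP]")
    return variants

for variant in mask_variants("he go"):
    print(variant)
```

A sentence of n words yields 2n masked sentences, so a five-word input produces the "processing 10 possibilities" message seen in the runs below. In this scheme no [MASK] is ever inserted after the last word.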

In [21]:
import math
from difflib import SequenceMatcher

def check_grammar(org_sent, sentences, spelling_sentences):
  """ check grammar for the input sentences
  """
  
  n = len(sentences)
  
  # what is the tokenized value of [MASK]. Usually 103
  text = '[MASK]'
  tokenized_text = tokenizerLarge.tokenize(text)
  mask_token = tokenizerLarge.convert_tokens_to_ids(tokenized_text)[0]

  LM_sentences = []
  new_sentences = []
  i = 0 # current sentence number
  l = len(org_sent.strip().split())*2 # l is the number of masked sentences per original sentence
  mask = False # flag indicating if we are processing a space MASK

  for sent in sentences:
    i += 1
    
    print(".", end="")
    if i%50 == 0:
      print("")
    
    # tokenize the text
    tokenized_text = tokenizerLarge.tokenize(sent)
    indexed_tokens = tokenizerLarge.convert_tokens_to_ids(tokenized_text)

    # Create the segments tensors.
    segments_ids = [0] * len(tokenized_text)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Predict all tokens
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)

    # index of the masked token
    mask_index = (tokens_tensor == mask_token).nonzero()[0][1].item()
    # predicted token
    predicted_index = torch.argmax(predictions[0, mask_index]).item()
    predicted_token = tokenizerLarge.convert_ids_to_tokens([predicted_index])[0]
    
    # second best prediction. Can be used to create more options
#     second_index = torch.topk(predictions[0, mask_index], 2).indices[1].item()
#     second_prediction = tokenizer.convert_ids_to_tokens([second_index])[0]

    text = sent.strip().split()
    mask_index = text.index('[MASK]')

    if not mask:
      # case of MASKed words
      
      mask = True
      text[mask_index] = predicted_token
      try:
        # retrieve original word
        org_word = spelling_sentences[i//l].strip().split()[mask_index-1]
#         print(">>> " + org_word)
      except IndexError:
#         print(spelling_sentences[i%l - 1])
#         print(tokenized_text)
#         print("{0} {1} {2}".format(i, l, mask_index))
        print("!", end="")
        continue
  #     print("{0} - {1}".format(org_word, predicted_token))
      # check if the prediction is an inflection of the original word
  #   if org_word.isalpha() and predicted_token not in gb_infl[org_word]:
  #     continue
      # use SequenceMatcher to see if predicted word is similar to original word
      if SequenceMatcher(None, org_word, predicted_token).ratio() < 0.6:
        if org_word not in list(set(det + prep + helping_verbs)) or predicted_token not in list(set(det + prep + helping_verbs)):
          continue
      if org_word == predicted_token:
        continue
    else:
      # case for MASKed spaces
      
      mask = False
  #     print("{0}".format(predicted_token))
      # only allow determiners / prepositions  / helping verbs in spaces
      if predicted_token in list(set(det + prep + helping_verbs)) :
        text[mask_index] = predicted_token
      else:
        continue

  #   if org_word == "in":
  #     print(">>>>>> " + predicted_token)
  #   print(tokenized_text)
  #   print(mask_index)
  
    text.remove('[SEP]')
    text.remove('[CLS]')
    new_sent = " ".join(text)
    
  #   print(new_sent)
    # retain new sentences which have a 
    # minimum chance of correctness using BERT GED
    no_error, prob_val = check_GE([new_sent])
    exps = [np.exp(i) for i in prob_val[0]]
    sum_of_exps = sum(exps)
    softmax = [j/sum_of_exps for j in exps]
    if no_error and softmax[1] > 0.996:
  #     print(org_word)
  #     print(predicted_token)
  #     print(SequenceMatcher(None, org_word, predicted_token).ratio())
  #     print("{0} - {1}, {2}".format(prob_val[0][1], prob_val[0][0], prob_val[0][1] - prob_val[0][0]))

  #     print("{0} - {1:.2f}".format(new_sent, softmax[1]*100) )
      print("*", end="")
      new_sentences.append(new_sent)
  #   print("{0}\t{1}".format(predicted_token, second_prediction))

  print("")
  
  # remove duplicate suggestions (order-preserving)
  spelling_sentences = list(dict.fromkeys(new_sentences))
  
  return spelling_sentences
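
The acceptance rule for a predicted word in the masked-word case of `check_grammar` can be read in isolation. The sketch below restates it as a single predicate: `keep_replacement` and `min_ratio` are hypothetical names, and the word set is an illustrative subset of the lists defined earlier.

```python
from difflib import SequenceMatcher

def keep_replacement(org_word, predicted, function_words, min_ratio=0.6):
    """Decide whether a predicted token may replace the original word.

    A prediction is kept when it differs from the original and is either
    similar in spelling (ratio >= min_ratio) or both words are function
    words (determiners, prepositions, helping verbs).
    """
    if org_word == predicted:
        return False
    similar = SequenceMatcher(None, org_word, predicted).ratio() >= min_ratio
    both_function_words = (org_word in function_words
                           and predicted in function_words)
    return similar or both_function_words

# Illustrative subset of the notebook's word lists
function_words = {"the", "a", "at", "on", "is", "are"}

print(keep_replacement("go", "goes", function_words))
print(keep_replacement("pub", "beer", function_words))
print(keep_replacement("at", "on", function_words))
```

The similarity test lets through inflections of the same word, while the function-word exception allows swaps such as one preposition for another, which share few characters.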

In [52]:
# org_text = []
# with open("./drive/My Drive/Colab Notebooks/S89A/CoNLL_2013_DS.txt") as file:
#   org_text = file.readlines()

# predict for each of the test samples

for sent in org_text:
  
  print("Input Sentence >>> " + sent)
  
  sentences = create_spelling_set(sent)
  spelling_sentences = create_grammar_set(sentences)
  sentences = create_mask_set(spelling_sentences)
  
  print("processing {0} possibilities".format(len(sentences)))
  
  sentences = check_grammar(sent, sentences, spelling_sentences)

  print("Suggestions & Probabilities")
  
  if len(sentences) == 0:
    print("None")
    continue

  no_error, prob_val =  check_GE(sentences)

  for i in range(len(prob_val)):
    # softmax over the two logits; index 1 is the "no error" class
    exps = [np.exp(logit) for logit in prob_val[i]]
    sum_of_exps = sum(exps)
    softmax = [j/sum_of_exps for j in exps]
    print("{0} - {1:0.4f}%".format(sentences[i], softmax[1]*100))
  
  print("-"*60)
  print()

Input Sentence >>> They drank the pub .
processing 10 possibilities
..........
Suggestions & Probabilities
None
Input Sentence >>> I am looking forway to see you soon .
processing 126 possibilities
..................................................
.......................!...........................
.........!.................
Suggestions & Probabilities
None
Input Sentence >>> The cat sat at mat .
processing 12 possibilities
............
Suggestions & Probabilities
None
Input Sentence >>> Giant otters is an apex predator .
processing 14 possibilities
..............
Suggestions & Probabilities
None
Input Sentence >>> There is no a doubt, tracking system has brought many benefits in this information age .
processing 32 possibilities
................................
Suggestions & Probabilities
None


In [23]:
import pandas as pd
df = pd.read_csv("testing_data_cleaned.csv")
# fill missing values in col6 with the corresponding values from col5
df['col6'] = df['col6'].fillna(df['col5'])
df

Unnamed: 0,col5,col6
0,Wan na Learn English ?,Wan na Learn English ?
1,Not that much .,Not that much .
2,"A family , seemingly a father , a mother and t...","A family , seemingly a father , a mother and t..."
3,"A funky music starts , singing some worst obsc...","Funky music starts , playing some bad obscenit..."
4,It seems they do n't care so much about its ly...,It seems they do n't care so much about its ly...
...,...,...
9214,but today I went by foot so it rained .,but today I went on foot because it rained .
9215,The weather forecast says that it will rain to...,The weather forecast says that it will rain to...
9216,It is depressing .,It is depressing .
9217,Deaflympic,Deaflympic


In [51]:
# corr_sent = df['col6'][2]
# sent = df['col5'][2]
sent = "he go to the store"
corr_sent = "he goes to the store"

print('Input Sentence >>> ' + sent)
print('Correct Sentence >>> ' + corr_sent)
  
spelling_sentences = create_spelling_set(sent)
# spelling_sentences = create_grammar_set(spelling_sentences)
sentences = create_mask_set(spelling_sentences)

print("processing {0} possibilities".format(len(sentences)))

# pass the unmasked sentences as the third argument: check_grammar looks up
# the original word at each masked position in them
sentences = check_grammar(sent, sentences, spelling_sentences)

print("Suggestions & Probabilities")

if len(sentences) == 0:
  print("None")

else :
  no_error, prob_val =  check_GE(sentences)

  for i in range(len(prob_val)):
    # softmax over the two logits; index 1 is the "no error" class
    exps = [np.exp(logit) for logit in prob_val[i]]
    sum_of_exps = sum(exps)
    softmax = [j/sum_of_exps for j in exps]
    print("{0} - {1:0.4f}%".format(sentences[i], softmax[1]*100))


Input Sentence >>> he go to the store
Correct Sentence >>> he goes to the store
processing 10 possibilities
..........
Suggestions & Probabilities
None


In [33]:
org_text  = [
    "The cat sat at mat",
    "Giant otters is an apex predator",
    "I and my friend is going to the park."
]
for sent in org_text:
  
  print("Input Sentence >>> " + sent)
  
  sentences = create_spelling_set(sent)
  spelling_sentences = create_grammar_set(sentences)
  sentences = create_mask_set(spelling_sentences)
  
  print("processing {0} possibilities".format(len(sentences)))
  
  sentences = check_grammar(sent, sentences, spelling_sentences)

  print("Suggestions & Probabilities")
  
  if len(sentences) == 0:
    print("None")
    continue

  no_error, prob_val =  check_GE(sentences)

  for i in range(len(prob_val)):
    # softmax over the two logits; index 1 is the "no error" class
    exps = [np.exp(logit) for logit in prob_val[i]]
    sum_of_exps = sum(exps)
    softmax = [j/sum_of_exps for j in exps]
    print("{0} - {1:0.4f}%".format(sentences[i], softmax[1]*100))
  
  print("-"*60)
  print()

spelling ['a family , seemingly a father , a mother and two of their children , get on a car .']
original sentence: A family , seemingly a father , a mother and two of their children , get on a car .
spelling sentences: 
a family , seemingly a father , a mother and two of their children , get on a car .
family , seemingly a father , a mother and two of their children , get on a car .
a family , seemingly a father , mother and two of their children , get on a car .
a family , seemingly a father , a mother and two their children , get on a car .
a family , seemingly a father , a mother and two of children , get on a car .
a family , seemingly a father , a mother and two of their children , get a car .
**********
masked sentences: 
[CLS] [MASK] family , seemingly a father , a mother and two of their children , get on a car . [SEP]
[CLS] [MASK] a family , seemingly a father , a mother and two of their children , get on a car . [SEP]
[CLS] a [MASK] , seemingly a father , a mother and two of
