# Group 1 - NLP Assignment
## Probabilistic model for detection and correction of spelling errors

1. Design and develop a probabilistic model for spelling errors.
2. Evaluate the output.
3. Requirement: The spelling errors covered include non-words and real words.
4. Requirement: Use bigrams and minimum edit distance.
5. Output: Suggest a few words to correct each spelling error.
6. Output: A sorted list of dictionary words.

Mates: Lee, Vimal, Cha, Ser\
Start Date: 7-Nov-2022\
Last Edited: 22-Nov-2022

1. A spellchecker points to spelling errors and possibly suggests alternatives.

2. An autocorrector usually goes a step further and automatically picks the most likely word.

**Non-word Errors**: This is the most common type of error. You either miss a few keystrokes or let your fingers hurtle a bit longer, e.g. typing *langage* when you mean *language*.

**Real Word Errors**: If you have fat fingers, sometimes instead of creating a non-word, you end up creating a real word, but one you didn't intend, e.g. typing *buckled* when you meant *bucked*, or *three* instead of *there*.

Other types of errors: **Cognitive Errors**, **Short forms/Slang/Lingo**, **Intentional Typos**
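
The difference between the two main error types can be illustrated with a plain dictionary lookup. The sketch below is only an illustration, assuming a tiny hypothetical vocabulary `TOY_VOCAB` and a hypothetical helper `find_non_words` (neither is part of the project code): a token missing from the vocabulary is flagged as a non-word error, while a real-word error passes the lookup and can only be caught from context.

```python
import re

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"the", "horse", "bucked", "buckled", "language",
             "is", "over", "there", "three"}

def find_non_words(sentence, vocab):
    """Return the alphabetic tokens of `sentence` that are not in `vocab`."""
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
    return [t for t in tokens if t not in vocab]

# "langage" is not a dictionary word, so the lookup flags it.
print(find_non_words("The langage is over there", TOY_VOCAB))  # ['langage']

# "buckled" is a valid word, so a lookup alone cannot flag it.
print(find_non_words("The horse buckled", TOY_VOCAB))          # []
```

This is why the notebook uses two separate models: a dictionary-driven one for non-words and a context-driven one for real words.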

**Initial Method**
1. Candidate model: SymSpell
2. Language model: add-0.1 unigram and add-1 bigram (in log scale)
3. Error model: Noisy channel model
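
To make the interplay of these three parts concrete, here is a minimal sketch of noisy-channel ranking: each candidate `c` for a typo `x` is scored by `log P(c) + log P(x | c)`. All counts, probabilities and names below (`UNIGRAM_COUNTS`, `TOTAL_TOKENS`, `VOCAB_SIZE`, `CHANNEL`) are made-up assumptions for illustration, not values from the training corpus or the confusion matrices.

```python
import math

# Hypothetical toy counts, for illustration only.
UNIGRAM_COUNTS = {"language": 40, "langue": 2, "lineage": 5}
TOTAL_TOKENS = 1000   # assumed corpus size N
VOCAB_SIZE = 500      # assumed vocabulary size V

def log_unigram(word, k=0.1):
    """Add-k smoothed unigram probability, in log10 scale."""
    count = UNIGRAM_COUNTS.get(word, 0)
    return math.log10(count + k) - math.log10(TOTAL_TOKENS + k * VOCAB_SIZE)

# Hypothetical channel probabilities P(typo | candidate) for the typo "langage".
CHANNEL = {"language": 1e-3, "langue": 1e-4, "lineage": 1e-6}

def rank(candidates):
    """Rank candidates by log P(c) + log P(typo | c), best first."""
    scored = [(c, log_unigram(c) + math.log10(CHANNEL[c])) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for cand, score in rank(["lineage", "langue", "language"]):
    print(f"{cand:10s} {score:.3f}")
```

Working in log scale turns the product of probabilities into a sum, which avoids numerical underflow when many small probabilities are combined.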

# 1.0 Preamble


## Question/Limitation

1. Does not handle spelling errors at an edit distance greater than 2.
2. Does not handle a word that is missing from the dictionary and for which no candidates are generated.
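
The first limitation can be seen with a small, self-contained edit distance function. The sketch below is a plain-Python optimal string alignment distance (insert, delete, substitute, adjacent transposition); the function name `osa_distance`, the constant `MAX_DISTANCE` and the example words are assumptions for illustration and are independent of the notebook's own `getEditDistance`.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

MAX_DISTANCE = 2  # cutoff assumed to mirror the limitation above

for typo, target in [("experimnt", "experiment"),
                     ("teh", "the"),
                     ("exprmnt", "experiment")]:
    dist = osa_distance(typo, target)
    status = "reachable" if dist <= MAX_DISTANCE else "out of reach"
    print(f"{typo} -> {target}: distance {dist} ({status})")
```

Here `exprmnt` is three insertions away from `experiment`, so a model capped at distance 2 would never propose the intended word.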

## 1.1 Dependencies

In [1]:
!pip install prettytable
!pip install gdown






In [2]:
# Dependencies
import math
import pickle
import operator
from prettytable import PrettyTable
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1.2 Corpus

---

Science Journals - Elsevier OA CC-BY
---
The corpus is retrieved from this [link](https://elsevier.digitalcommonsdata.com/datasets/zm33cdndxs/2).
1. The dataset contains a compilation of 40,000 science journals.
2. 200 journals are randomly selected from the 40,000 journals.
3. Each sentence is separated.
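
The preparation steps above could be sketched roughly as follows. This is not the script used to build `Training Corpus.txt`; the function `build_corpus`, its parameters and the naive regex sentence splitter are assumptions for illustration (the notebook itself relies on NLTK for tokenization).

```python
import random
import re

def build_corpus(documents, sample_size=200, seed=0):
    """Sample documents at random and return one sentence per line."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = rng.sample(documents, min(sample_size, len(documents)))
    sentences = []
    for doc in chosen:
        # Naive splitter: break after '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", doc.strip())
        sentences.extend(p for p in parts if p)
    return "\n".join(sentences)

# Hypothetical stand-in for the 40,000 journal texts.
docs = ["First result. Second result!",
        "Only one sentence here.",
        "A question? An answer."]
print(build_corpus(docs, sample_size=2))
```

A regex splitter like this breaks on abbreviations such as "et al.", so a trained sentence tokenizer is usually the better choice for scientific text.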


# 2.0 Non-word correction model
Inspired by [symspell-nlp](https://github.com/nnakul/symspell-nlp)
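
For orientation, the core idea behind SymSpell-style candidate generation is to index every dictionary word under all strings obtained by deleting up to *k* characters, then apply the same deletes to the misspelled term and look them up. The sketch below is a simplified illustration under that assumption; the names `deletes`, `build_index`, `lookup` and `TOY_VOCAB` are hypothetical, and the usual final step of verifying each candidate with a true edit distance is omitted.

```python
from collections import defaultdict

def deletes(word, max_distance=2):
    """All strings reachable from `word` by deleting up to `max_distance` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(vocab, max_distance=2):
    """Map every delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in vocab:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index

def lookup(term, index, max_distance=2):
    """Collect dictionary words sharing at least one delete variant with `term`."""
    candidates = set()
    for variant in deletes(term, max_distance):
        candidates |= index.get(variant, set())
    return candidates

TOY_VOCAB = ["dog", "dogs", "dot", "log", "cat"]  # hypothetical dictionary
index = build_index(TOY_VOCAB)
print(sorted(lookup("dogo", index)))  # ['dog', 'dogs', 'dot', 'log']
```

Because the index is built once, each lookup only needs set membership tests instead of generating inserts and substitutions over the whole alphabet.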

In [4]:
"""
File dependencies - Download
"""

path_corpus = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\Training Corpus.txt'
path_dict = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\Training Dictionary.txt'
path_delMat = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\CONFUSION_MATRIX_DEL.txt'
path_insMat = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\CONFUSION_MATRIX_INS.txt'
path_subMat = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\CONFUSION_MATRIX_SUB.txt'
path_traMat = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\CONFUSION_MATRIX_TRAN.txt'
path_apos = 'C:\\Users\\PC\\Desktop\\Code\\Dataset\\Training Apos'
# Loading training corpus - 200 science journals
file = open(path_corpus, 'r', encoding="utf8")
CORPUS = file.read()
file.close()
print('Imported corpus.')


# Loading dictionary
VOCAB = list()
file = open( path_dict , "r", encoding="utf8" )
for x in file:
    # Strip only the newline, so a final line without one is not truncated
    VOCAB.append(x.rstrip('\n'))
file.close()
print('Imported dictionary.')
V = len( VOCAB )

# Loading DEL - matrix
file = open( path_delMat , "r", encoding="utf8" )
raw = list()
for x in file:
    x = int(x[:-1])
    raw.append(x+1)
file.close()
CMD = np.array(raw).reshape(27,26)
print('Imported deletion matrix.')

# Loading INS - matrix
file = open( path_insMat , "r", encoding="utf8" )
raw = list()
for x in file:
    x = int(x[:-1])
    raw.append(x+1)
file.close()
CMI = np.array(raw).reshape(27,26)
print('Imported insertion matrix.')

# Loading SUB - matrix
file = open( path_subMat , "r", encoding="utf8" )
raw = list()
for x in file:
    x = int(x[:-1])
    raw.append(x+1)
file.close()
CMS = np.array(raw).reshape(27,26)
print('Imported substitution matrix.')

# Loading TRANS - matrix
file = open( path_traMat , "r", encoding="utf8" )
raw = list()
for x in file:
    x = int(x[:-1])
    raw.append(x+1)
file.close()
CMT = np.array(raw).reshape(27,26)
print('Imported transposition matrix.')

# Loading words with apostrophe
file = open( path_apos , "rb" )
WORDS_W_APOS = pickle.load( file )
file.close()
print('Imported words with apostrophe file.')
APOS_OMITTED_WORDS = list()
for pair in WORDS_W_APOS:
    APOS_OMITTED_WORDS.append(pair[0])

Imported corpus.
Imported dictionary.
Imported deletion matrix.
Imported insertion matrix.
Imported substitution matrix.
Imported transposition matrix.
Imported words with apostrophe file.


In [7]:
DELIMITERS = [ '"' , '.' , '!' , '?' ]

N = len(word_tokenize(CORPUS))        # Number of tokens in corpus
S = len(sent_tokenize(CORPUS))        # Number of sentence in corpus
INFINITY = 10000  # Assumption for usage in edit distance matrix

# Function to apply log
def log( x ):
    if not x:
        return -1*float('inf')
    return math.log10(x)

# Generate edit distance matrix
def getEditDistance( x , y ):
    m = len(x)
    n = len(y)
    arr = np.array([0]*((m+1)*(n+1))).reshape( m+1 , n+1 )
    for a in range(m):
        arr[a+1][0] = a+1
    for a in range(n):
        arr[0][a+1] = a+1
    for i in range(1,m+1):
        for j in range(1,n+1):
            v1 = arr[i-1][j] + 1
            v2 = arr[i][j-1] + 1
            v3 = arr[i-1][j-1]
            if x[i-1] != y[j-1]:
                v3 += 1
            v4 = INFINITY
            if ( i > 1 and j > 1 and x[i-1] == y[j-2] and x[i-2] == y[j-1] ):
                v4 = arr[i-2][j-2] + 1
            v = [ v1 , v2 , v3 , v4 ]
            arr[i][j] = INFINITY
            for each in v:
                if each < arr[i][j]:
                    arr[i][j] = each
    return arr
"""
# SymSpell candidate generation based on deletion
def getCandidates( word , root , s , lev ):
    if ( not len(word) ):
        return
    for x in range(len(word)):
        st = word[:x] + word[x+1:]
        if ( st in VOCAB ):
            s.add(st)
        if ( lev == 1 ):
            getCandidates( st , root , s , 2 )
"""
# Old edits only 1 distance
"""
def getCandidates( word , root , s , lev ):
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    # Candidates with 1 edit distance
    if ( not len(word) ):
        return
    splits      = [(word[:i], word[i:])     for i in range(len(word) + 1)]
    deletes     = [L + R[1:]                for L, R in splits if R]
    transposes  = [L + R[1] + R[0] + R[2:]  for L, R in splits if len(R)>1]
    replaces    = [L + c + R[1:]            for L, R in splits if R for c in letters]
    inserts     = [L + c + R                for L, R in splits for c in letters]
    for w in deletes:
        if (w in VOCAB):
            s.append(w)
    for w in transposes:
        if (w in VOCAB):
            s.append(w)    
    for w in replaces:
        if (w in VOCAB):
            s.append(w)
    for w in inserts:
        if (w in VOCAB):
            s.append(w)
    # Candidates with 2 edit distance
    if lev == 1:
        for st in s:
            getCandidates(st, root, s, 2)
"""
def getCandidates( word , root , s , lev ):
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    store_1 = []
    # Candidates with 1 edit distance
    if ( not len(word) ):
        return
    splits      = [(word[:i], word[i:])     for i in range(len(word) + 1)]
    deletes     = [L + R[1:]                for L, R in splits if R]
    transposes  = [L + R[1] + R[0] + R[2:]  for L, R in splits if len(R)>1]
    replaces    = [L + c + R[1:]            for L, R in splits if R for c in letters]
    inserts     = [L + c + R                for L, R in splits for c in letters]
    for w in deletes:
        if (w in VOCAB):
            store_1.append(w)
    for w in transposes:
        if (w in VOCAB):
            store_1.append(w)    
    for w in replaces:
        if (w in VOCAB):
            store_1.append(w)
    for w in inserts:
        if (w in VOCAB):
            store_1.append(w)
    print('Candidates: ', store_1)
    # Candidates with 2 edit distance
    if lev == 1:
        for cand_1 in store_1:
            splits      = [(cand_1[:i], cand_1[i:])          for i in range(len(cand_1) + 1)]
            deletes     = [L + R[1:]                        for L, R in splits if R]
            transposes  = [L + R[1] + R[0] + R[2:]          for L, R in splits if len(R)>1]
            replaces    = [L + c + R[1:]                    for L, R in splits if R for c in letters]
            inserts     = [L + c + R                        for L, R in splits for c in letters]
            for w in deletes:
                if len(w) > 1:
                    if (w in VOCAB) and (w not in s):
                        s.append(w)
            for w in transposes:
                if len(w) > 0:
                    if (w in VOCAB) and (w not in s):
                        s.append(w)
            for w in replaces:
                if len(w) > 0:
                   if (w in VOCAB) and (w not in s):
                        s.append(w)
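            # Insert-based second edits are disabled; the whole block is commented out
            # so that its body no longer runs as a redundant check inside the replaces loop.
            #for w in inserts:
            #    if len(w) > 0:
            #        if (w in VOCAB) and (w not in s):
            #            s.append(w)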


# Returns edits operation
def getEditOperation( arr , x , y ):
    edits = list()
    i = len(x)
    j = len(y)
    while( i>=0 and j>=0 ):
        if ( not i and not j ):
            break
        p = 0
        if i>0 and j>0 and x[i-1] != y[j-1]:
            p = 1
        if i>0 and j>0 and arr[i][j] == arr[i-1][j-1] + p:
            if p:
                if i == 1:
                    edits.append(('S','#'+x[i-1]+x[i],'#'+y[j-1]+x[i]))
                elif i < len(x):
                    edits.append(('S',x[i-2]+x[i-1]+x[i],x[i-2]+y[j-1]+x[i]))
                else:
                    edits.append(('S',x[i-2]+x[i-1]+'#',x[i-2]+y[j-1]+'#'))
            i = i-1
            j = j-1
            continue
        elif i>0 and arr[i][j] == arr[i-1][j] + 1:
            if i == 1:
                edits.append(('D','#'+x[i-1]+x[i],'#'+x[i]))
            elif i < len(x):
                edits.append(('D',x[i-2]+x[i-1]+x[i],x[i-2]+x[i]))
            else:
                edits.append(('D',x[i-2]+x[i-1]+'#',x[i-2]+'#'))
            i = i-1
            continue
        elif j>0 and arr[i][j] == arr[i][j-1] + 1:
            if not i:
                edits.append(('I','#'+x[i],'#'+y[j-1]+x[i]))
            elif i < len(x) :
                edits.append(('I',x[i-1]+x[i],x[i-1]+y[j-1]+x[i]))
            else:
                edits.append(('I',x[i-1]+'#',x[i-1]+y[j-1]+'#'))
            j = j-1
            continue
        elif ( i > 1 and j > 1 and x[i-1] == y[j-2] and x[i-2] == y[j-1] and arr[i-2][j-2]+1==arr[i][j] ):
                edits.append(('T',x[i-2]+x[i-1],x[i-1]+x[i-2]))
                i = i-2
                j = j-2    
    return edits
 
 
def h( s , axis ):
    if axis == 'h':
        return ord(s[0])-ord('a')
    if s[0] == '#':
        return 0
    return ord(s[0])-ord('a')+1
  

# Likelihood of edits
def getLikelihood( edits ):
    L = 0
    for edit in edits:
        if edit[0] == 'D':
            if edit[1][0] == '#':
                length = N
            else:
                length = len( re.findall(edit[1][0],CORPUS) )
            L = L + log(CMI[h(edit[1][0],'v')][h(edit[1][1],'h')]) - log(length)
        elif edit[0] == 'I':
            if edit[1][0] == '#':
                length = len( re.findall(' '+edit[2][1],CORPUS) )
            else:
                length = len( re.findall(edit[2][0]+edit[2][1],CORPUS) )
            L = L + log(CMD[h(edit[2][0],'v')][h(edit[2][1],'h')]) - log(length)
        elif edit[0] == 'S':
            length = len( re.findall(edit[2][1],CORPUS) )
            L = L + log(CMS[h(edit[1][1],'v')][h(edit[2][1],'h')]) - log(length)
        elif edit[0] == 'T':
            length = len( re.findall(edit[2][0]+edit[2][1],CORPUS) )
            L = L + log(CMT[h(edit[2][0],'v')][h(edit[2][1],'h')]) - log(length)
    return L

# Unigram prob in log scale (add-0.1 smoothing)
def getUnigramProb( w ):
    if ( w == '' or w == '#' ):
        return log( S + 0.1 ) - log( N + 0.1*V )
    # Raw strings avoid the invalid '\s' escape; re.escape guards against regex metacharacters in w
    co = len( re.findall( r' ' + re.escape(w) + r'[.\s]' , CORPUS ) )
    ca = log( co + 0.1 ) - log( N + 0.1*V )
    return ca

# Bigram prob in log scale
def getBigramProb( w1 , w2 ):
    if ( w1 == '' or w1 == '#' ):
        # NOTE: the leading '.' is a regex wildcard (any character), not a literal period
        c12 = len( re.findall( r'. ' + re.escape(w2) + r'[.\s]' , CORPUS ) ) + len( re.findall( r'^' + re.escape(w2) + r'[.\s]' , CORPUS ) )
        c1 = S
        return log( c12 + math.pow(10,getUnigramProb(w2)) ) - log( c1 + 1 )
    if ( w2 == '' or w2 == '#' ):
        # NOTE: the trailing '.' is also a regex wildcard, so any character after w1 is counted
        c12 = len( re.findall( r' ' + re.escape(w1) + r'.' , CORPUS ) )
        c1 = len( re.findall( r' ' + re.escape(w1) + r'[.\s]' , CORPUS ) )
        return log( c12 + math.pow(10,getUnigramProb(w2)) ) - log( c1 + 1 )
    c12 = len( re.findall( r' ' + re.escape(w1) + r' ' + re.escape(w2) + r'[.\s]' , CORPUS ) )
    c1 = len( re.findall( r' ' + re.escape(w1) + r'[.\s]' , CORPUS ) )
    return log( c12 + math.pow(10,getUnigramProb(w2)) ) - log( c1 + 1 )

# Sequence prob using bigram
def getSequenceProb( words , pos ):
    l = len( words )
    if l == 1 and not pos:
        return getBigramProb( '#' , words[pos] ) + getBigramProb( words[pos] , '#' )
    if pos == l-1:
        if words[pos-1] in VOCAB:
            return getBigramProb( words[pos-1] , words[pos] ) + getBigramProb( words[pos] , '#' )
        return getBigramProb( words[pos] , '#' )
    if not pos:
        if words[pos+1] in VOCAB:
            return getBigramProb( '#' , words[pos] ) + getBigramProb( words[pos] , words[pos+1] )
        return getBigramProb( '#' , words[pos] )
    if words[pos-1] in VOCAB:
        if words[pos+1] in VOCAB:
            return getBigramProb( words[pos-1] , words[pos] ) + getBigramProb( words[pos] , words[pos+1] )
        return getBigramProb( words[pos-1] , words[pos] )
    if words[pos+1] in VOCAB:
        return getBigramProb( words[pos] , words[pos+1] )
    return getUnigramProb( words[pos] )
 
# Non-word model 
def bestCandidate( words , pos , choice , CHANGES , sen ):
    ORG = words[pos]
    print('\tNON-WORD ERROR : ' , ORG.upper()) # Identifies the non-word error.
    cand = list()
    getCandidates( ORG , ORG , cand , 1 )
    rows = list()
    for c in cand:
        arr = getEditDistance( ORG , c )
        ed = arr[len(ORG)][len(c)]
        if ( ed > 2 ):
            continue
        log_pc = getUnigramProb(c)
        words[pos] = c
        seq_pr = getSequenceProb( words , pos )
        edits = getEditOperation( arr , ORG , c )
        log_likel = getLikelihood( edits )
        rows.append([c, ed, edits, round(-1*log_pc,5), round(-1*seq_pr,5), round(-1*log_likel,5), round(-1*seq_pr-log_likel,5)])
    
    rows.sort( key = lambda x: (x[1],x[6]) )
    print( '\t        [ {} CANDIDATES SHORT-LISTED ]'.format(len(rows)) ) # Candidates
    for r in range(len(rows)):
        #print('Candidates', 'Edit distance')
        print('Candidate: ', rows[r][0],'   ', '\tEdit distance: ', rows[r][1])
    words[pos] = rows[0][0]
    CHANGES[(sen,pos)] = ( ORG, words[pos] )
    print( 'BEST CANDIDATE : {}'.format(words[pos].upper()) )
    print( '\n' , end = '' )
    print( )
    if choice == 'y':
        table = PrettyTable(["Correct Word Candidate (C)", "Edit Distance", 
                             "Edit Operations(s) [I->C]", "-log[P(C)]", "-log[P(SEQ)]", "-log[P(I|C)]", "-log[P(SEQ)P(I|C)]"])
        for row in rows:
            table.add_row(row)
        print(table)
        print( '\n' , end='' )
    return rows


def putApostrophe( w ):
    u = w.lower()
    for pair in WORDS_W_APOS:
        if u == pair[0]:
            if w[0].capitalize() == w[0]:
                return pair[1].capitalize()
            return pair[1]
    return w
    
# Output for non-word 
def correctQuery( text , edits ):
    query = text.replace('-',' - ').replace("'",'')
    SENT = nltk.tokenize.sent_tokenize(query)
    ORG_SENT = nltk.tokenize.sent_tokenize(text.replace('-',' - ').replace("'",'-'))
    KEYS = edits.keys()
    
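    if not len( KEYS ):
        # No corrections were made: echo the text and return the same
        # (final_sent, generation_string) pair as the corrected path,
        # so callers can always unpack two values.
        final_sent = []
        generation_string = []
        for s in range(len(ORG_SENT)):
            words = nltk.tokenize.word_tokenize(ORG_SENT[s])
            flag = True
            space = True
            final_sent = []
            for w in words:
                if w == 'i':
                    w = 'I'
                if flag:
                    w = w.capitalize()
                    flag = not flag
                if re.search( '^[a-zA-Z0-9]+' , w ):
                    if space:
                        print( ' ' , end='' )
                    if not '-' in w:
                        w = putApostrophe(w)
                    else:
                        if w[:2] == 'i-':   w = 'I' + w[1:]
                        w = w.replace( '-' , "'" )
                    print( w , end = '' )
                    space = True
                else:
                    if w in DELIMITERS: flag = True
                    if w == '-':    space = False
                    else:   space = True
                    print(  w , end = '' )
                final_sent.append(w)
            generation_string.append(final_sent)
        return final_sent, generation_string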
        
    sentences = set(list(zip(*edits.keys()))[0])
    generation_string = []
    for s in range(len(SENT)):
        N = -1
        p = -1
        words = nltk.tokenize.word_tokenize(SENT[s])
        ORG_WORDS = nltk.tokenize.word_tokenize(ORG_SENT[s])
        if s+1 in sentences:
            for w in nltk.tokenize.word_tokenize(SENT[s]):
                p = p + 1
                if re.search( '^[a-zA-Z]+$' , w ):
                    N = N + 1
                    if ( s+1 , N ) in KEYS:
                        if words[p][0] >= 'a':
                            words[p] = edits[(s+1,N)][1]
                        else:
                            words[p] = edits[(s+1,N)][1].capitalize()
                    elif '-' in ORG_WORDS[p]:
                        w = ORG_WORDS[p]
                        if ORG_WORDS[p][:2] == 'i-':   w = 'I' + w[1:]
                        words[p] = w.replace( '-' , "'" )
        else:
            for w in nltk.tokenize.word_tokenize(SENT[s]):
                p = p + 1
                if re.search( '^[a-zA-Z]+$' , w ):
                    if '-' in ORG_WORDS[p]:
                        w = ORG_WORDS[p]
                        if ORG_WORDS[p][:2] == 'i-':   w = 'I' + w[1:]
                        words[p] = w.replace( '-' , "'" )
        generation_string.append(words) 

        flag = True
        space = True
        final_sent = []
        for w in words:
            if w == 'i':
                w = 'I'
            if flag:
                w = w.capitalize()
                flag = not flag
            if re.search( '^[a-zA-Z0-9]+' , w ):
                if space:   print( ' ' , end='' )
                if "'" in w:    print( w , end = '' )
                else:   print( putApostrophe(w) , end = '' )
                space = True
            else:
                if w in DELIMITERS: flag = True
                if w == '-':    space = False
                else:   space = True
                print( w , end = '' )
            final_sent.append(w)
    return final_sent, generation_string

    

def main( text , choice ):
    query = text.replace('-',' - ').replace("'",'')
    CHANGES = dict()
    counter = 1
    print( '\n' )
    for sentence in nltk.tokenize.sent_tokenize(query):
        flag = True
        print( ' SENTENCE' , counter )
        counter = counter + 1
        words = list()
        for w in nltk.tokenize.word_tokenize(sentence):
            if re.search( '^[a-zA-Z]+$' , w ):
                words.append(w.lower())
        for x in range(len(words)):
            if not (words[x] in VOCAB or words[x] in APOS_OMITTED_WORDS):
                bestCandidate( words , x , choice , CHANGES , counter-1 )
                flag = False
        if flag:
            print( '\n' , end='' )

    print('Corrected text:')        
    output, corrected_corpus = correctQuery( text , CHANGES )
    return corrected_corpus

In [8]:
# Run non-word spelling check model
query = "I om a dogo. The experimnt is a succeess."
choice = 'y'

return_tokenized_sentences = main(query , choice[0].lower())



 SENTENCE 1
	NON-WORD ERROR :  OM
Candidates:  ['m', 'o', 'mo', 'am', 'cm', 'dm', 'km', 'mm', 'nm', 'pm', 'sm', 'tm', 'xm', 'oc', 'od', 'of', 'on', 'or', 'bom', 'com', 'dom']
Now editing:  m
Now editing:  o
Now editing:  mo
Now editing:  am
Now editing:  cm
Now editing:  dm
Now editing:  km
Now editing:  mm
Now editing:  nm
Now editing:  pm
Now editing:  sm
Now editing:  tm
Now editing:  xm
Now editing:  oc
Now editing:  od
Now editing:  of
Now editing:  on
Now editing:  or
Now editing:  bom
Now editing:  com
Now editing:  dom
	        [ 174 CANDIDATES SHORT-LISTED ]
Candidate:  on     	Edit distance:  1
Candidate:  of     	Edit distance:  1
Candidate:  or     	Edit distance:  1
Candidate:  m     	Edit distance:  1
Candidate:  cm     	Edit distance:  1
Candidate:  am     	Edit distance:  1
Candidate:  km     	Edit distance:  1
Candidate:  pm     	Edit distance:  1
Candidate:  bom     	Edit distance:  1
Candidate:  dom     	Edit distance:  1
Candidate:  com     	Edit distance:  1
Cand

# 3.0 Real word correction model - Gramformer
Inspiration from [link](https://github.com/PrithivirajDamodaran/Gramformer)

In [9]:
!pip install -U git+https://github.com/PrithivirajDamodaran/Gramformer.git
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en
!pip install torch

Collecting git+https://github.com/PrithivirajDamodaran/Gramformer.git
  Cloning https://github.com/PrithivirajDamodaran/Gramformer.git to c:\users\pc\appdata\local\temp\pip-req-build-racgcmhc
  Resolved https://github.com/PrithivirajDamodaran/Gramformer.git to commit ca5e685d05a0206e9c5b5d708a4d4b70ea77f26c
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/PrithivirajDamodaran/Gramformer.git 'C:\Users\PC\AppData\Local\Temp\pip-req-build-racgcmhc'



Collecting en_core_web_sm==2.3.1
  Using cached en_core_web_sm-2.3.1-py3-none-any.whl
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')





Collecting en_core_web_sm==2.3.1
  Using cached en_core_web_sm-2.3.1-py3-none-any.whl
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
symbolic link created for c:\Users\PC\anaconda3\lib\site-packages\spacy\data\en <<===>> c:\Users\PC\anaconda3\lib\site-packages\en_core_web_sm
✔ Linking successful
c:\Users\PC\anaconda3\lib\site-packages\en_core_web_sm -->
c:\Users\PC\anaconda3\lib\site-packages\spacy\data\en
You can now load the model via spacy.load('en')





In [10]:
# Dependencies for Gramformer
from gramformer import Gramformer
import torch
import spacy
#from huggingface_hub import notebook_login

Token: (Hugging Face access token redacted; supply your own if `notebook_login()` is enabled)

In [11]:
#notebook_login()

In [12]:
#nlp = spacy.load("en_core_web_sm")
#annotator = errant.load("en")

In [13]:
def set_seed(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(1212)

# Initialize Gramformer
gf = Gramformer(models = 1, use_gpu=False) # 1=corrector, 2=detector

[Gramformer] Grammar error correct/highlight model loaded..


## Run

In [14]:
# Sample run
# Input string sentences
influent_sentences = [
    "He are moving here.",
    "I am doing fine. How is you?",
    "How is they?",
    "Matt like fish",
    "the collection of letters was original used by the ancient Romans",
    "We enjoys horror movies",
    "Anna and Mike is going skiing",
    "I walk to the store and I bought milk",
    "what be the reason for everyone leave the company",
]   

# Initiate real-word correction
for influent_sentence in influent_sentences:
    corrected_sentences = gf.correct(influent_sentence, max_candidates=1)
    print("[Input] ", influent_sentence)
    for corrected_sentence in corrected_sentences:
      print("[Correction] ",corrected_sentence)
    print("-" *100)

[Input]  He are moving here.
[Correction]  He is moving here.
----------------------------------------------------------------------------------------------------
[Input]  I am doing fine. How is you?
[Correction]  I am doing fine. How are you?
----------------------------------------------------------------------------------------------------
[Input]  How is they?
[Correction]  How are they?
----------------------------------------------------------------------------------------------------
[Input]  Matt like fish
[Correction]  Matt likes fish.
----------------------------------------------------------------------------------------------------
[Input]  the collection of letters was original used by the ancient Romans
[Correction]  the collection of letters was originally used by the ancient Romans
----------------------------------------------------------------------------------------------------
[Input]  We enjoys horror movies
[Correction]  We enjoy horror movies.
------------------

In [15]:
"""
# With QE Estimator
# Produce error correction quality score to choose candidate
"""

"""
for influent_sentence in influent_sentences:
    corrected_sentences = gf.correct(influent_sentence, max_candidates=1)
    print("[Input] ", influent_sentence)
    for corrected_sentence in corrected_sentences:
      print("[Edits] ", gf.get_edits(influent_sentence, corrected_sentence))
    print("-" *100)
"""


# 4.0 Combined model
1. Assign the text to be checked to the variable `input_corpus`.


## Run

In [16]:
TB = TreebankWordDetokenizer()

#----------------------------------------------------------------------
# Non-word model
# Calling main() to initiate
#input_corpus = "I haven't bailed on writing. Look, I'm generating a random paragraph at this very moment in an attempt to get my writing back on track. I am making an effort. I will start writing consistently again."
input_corpus = "I dogo"
return_NW = main(input_corpus, 'n')
print("\nhere")
# Detokenize return_NW
print(return_NW)
detoken = []
for i in return_NW:
    print(i)
    detoken.append(TB.detokenize(i))
print("detoken")
print(detoken)
print('\n')
print("end")
#----------------------------------------------------------------------
# Real-word model
for sent in detoken:
    corrected_sentences = gf.correct(sent, max_candidates=1)
    print("[Input] ", sent)
    for corrected_sentence in corrected_sentences:
      print("[Correction] ", corrected_sentence)
    print("\n")



 SENTENCE 1
	NON-WORD ERROR :  DOGO
Candidates:  ['dog', 'dogs']
Now editing:  dog
Now editing:  dogs
	        [ 16 CANDIDATES SHORT-LISTED ]
Candidate:  dog     	Edit distance:  1
Candidate:  dogs     	Edit distance:  1
Candidate:  do     	Edit distance:  2
Candidate:  deg     	Edit distance:  2
Candidate:  does     	Edit distance:  2
Candidate:  log     	Edit distance:  2
Candidate:  dots     	Edit distance:  2
Candidate:  dot     	Edit distance:  2
Candidate:  dom     	Edit distance:  2
Candidate:  doi     	Edit distance:  2
Candidate:  doc     	Edit distance:  2
Candidate:  don     	Edit distance:  2
Candidate:  fogs     	Edit distance:  2
Candidate:  degs     	Edit distance:  2
Candidate:  dg     	Edit distance:  2
Candidate:  dof     	Edit distance:  2
BEST CANDIDATE : DOG


Corrected text:
 I dog
here
[['I', 'dog']]
['I', 'dog']
detoken
['I dog']


end
[Input]  I dog
[Correction]  I dog.




## Front End

In [17]:
!pip install spellchecker
!pip install pyspellchecker






In [18]:
wrongWordAndCandidate = dict()

def checkSpelling():
    for tag in textEditor.tag_names():
        textEditor.tag_remove(tag, "1.0", "end")
    text = textEditor.get("1.0","end-1c")
    if(len(wrongSpellingList.get(0,'end')) > 0):
        wrongSpellingList.delete(0,'end')
    if(len(potentialCandidateList.get(0,'end')) > 0):
        potentialCandidateList.delete(0,'end')
    choice = 'n'
    query = text.replace('-',' - ').replace("'",'')
    CHANGES = dict()
    counter = 1
    print( '\n' )
    for sentence in nltk.tokenize.sent_tokenize(query):
        flag = True
        print( ' SENTENCE' , counter )
        counter = counter + 1
        words = list()
        for w in nltk.tokenize.word_tokenize(sentence):
            if re.search( '^[a-zA-Z]+$' , w ):
                words.append(w.lower())
        for x in range(len(words)):
            if not (words[x] in VOCAB or words[x] in APOS_OMITTED_WORDS):
                wrongSpellingList.insert(x,words[x])
                wrongWord = words[x]
                wrongWordAndCandidate[wrongWord] = []
                try:
                    candidate = bestCandidate( words , x , choice , CHANGES , counter-1 )                   
                #if candidate == None: #--------------------------------------------
                except Exception:  # e.g. IndexError from bestCandidate when no candidate is short-listed
                    label = words[x]
                    idx = wrongSpellingList.get(0, tk.END).index(label)
                    wrongSpellingList.delete(idx)
                    break
                if wrongWord and candidate != None:
                    # start from the beginning (and when we come to the end, stop)
                    idx = '1.0'
                    while 1:
                        # find next occurrence, exit loop if no more
                        idx = textEditor.search(wrongWord, idx, nocase=1, stopindex="end-1c")
                        if not idx: break
                        # index right after the end of the occurrence
                        lastidx = '%s+%dc' % (idx, len(wrongWord))
                        # tag the whole occurrence (start included, stop excluded)
                        textEditor.tag_add('found', idx, lastidx)
                        # prepare to search for next occurrence
                        idx = lastidx
                    # use a red foreground for all the tagged occurrences
                    textEditor.tag_config('found', foreground='red')
                for i in range (len(candidate)):
                    wrongWordAndCandidate[wrongWord].append(candidate[i][0] + " ED: " + str(candidate[i][1]))
                flag = False
        if flag:
            print( '\n' , end='' )
    if len(wrongSpellingList.get(0, 'end')) == 0:
        wrongSpellingList.insert(1, "No error")
    print('Corrected text:')
    print(text)
    output, corrected_corpus = correctQuery( text , CHANGES )
    # print the trailing newline before returning (it was unreachable after the return)
    print('\n' , end='')
    return corrected_corpus

def chooseCandidate(evt):
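    potentialCandidateList.delete(0,'end')
    selection = wrongSpellingList.curselection()
    if not selection:
        # nothing is selected, so there is nothing to show
        return
    selected_wrongSpelling = wrongSpellingList.get(selection[0])
    # the "No error" placeholder has no candidates, so fall back to an empty list
    candidateWord = wrongWordAndCandidate.get(selected_wrongSpelling, [])
    for x in range(len(candidateWord)):
        potentialCandidateList.insert(x,candidateWord[x])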
    for tag in textEditor.tag_names():
        textEditor.tag_remove(tag, "1.0", "end")
    selectedCorpus = wrongSpellingList.get(wrongSpellingList.curselection())
    if selectedCorpus:
        # start from the beginning (and when we come to the end, stop)
        idx = '1.0'
        while 1:
            # find next occurrence, exit loop if no more
            idx = textEditor.search(selectedCorpus, idx, nocase=1, stopindex="end-1c")
            if not idx: break
            # index right after the end of the occurrence
            lastidx = '%s+%dc' % (idx, len(selectedCorpus))
            # tag the whole occurrence (start included, stop excluded)
            textEditor.tag_add('found', idx, lastidx)
            # prepare to search for next occurrence
            idx = lastidx
        # use a yellow background for all the tagged occurrences
        textEditor.tag_config('found', background='yellow')
    
        
def correctSpelling():
    selectedCandidate = potentialCandidateList.get(potentialCandidateList.curselection())
    onlyWord = selectedCandidate.split()
    onlyWord = onlyWord[0]
    print(onlyWord)
    text = textEditor.get("1.0","end-1c")
    wrongSpellingWord = wrongSpellingList.get(wrongSpellingList.curselection())
    text = text.replace(wrongSpellingWord,onlyWord)
    print(text)
    textEditor.delete("1.0","end-1c")
    textEditor.insert("1.0", text)
    potentialCandidateList.delete(0,'end')
    checkSpelling()
    
def highLightText(evt):
    for tag in textEditor.tag_names():
        textEditor.tag_remove(tag, "1.0", "end")
    selectedCorpus = corpusList.get(corpusList.curselection())
    if selectedCorpus:
        # start from the beginning (and when we come to the end, stop)
        idx = '1.0'
        while 1:
            # find next occurrence, exit loop if no more
            idx = textEditor.search(selectedCorpus, idx, nocase=1, stopindex="end-1c")
            if not idx: break
            # index right after the end of the occurrence
            lastidx = '%s+%dc' % (idx, len(selectedCorpus))
            # tag the whole occurrence (start included, stop excluded)
            textEditor.tag_add('found', idx, lastidx)
            # prepare to search for next occurrence
            idx = lastidx
        # use a yellow background for all the tagged occurrences
        textEditor.tag_config('found', background='yellow')
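
The same search-and-tag loop appears in `checkSpelling`, `chooseCandidate` and `highLightText`. As a rough sketch only, the position lookup could be factored into one pure-Python helper that returns Tk-style `line.column` index pairs, which a caller would then pass to `textEditor.tag_add`. The name `find_occurrences` is hypothetical and not part of the notebook, and the helper assumes lowercasing does not change the length of the text.

```python
def find_occurrences(text, word):
    """Return (start, end) Tk-style 'line.column' pairs for each match.

    Hypothetical helper: matching is case-insensitive, like nocase=1 in
    Text.search. Tk lines are 1-based and columns are 0-based.
    """
    spans = []
    if not word:
        return spans
    needle = word.lower()
    for line_no, line in enumerate(text.split("\n"), start=1):
        haystack = line.lower()
        col = haystack.find(needle)
        while col != -1:
            end_col = col + len(needle)
            spans.append((f"{line_no}.{col}", f"{line_no}.{end_col}"))
            # continue searching right after the end of this occurrence
            col = haystack.find(needle, end_col)
    return spans


if __name__ == "__main__":
    for start, end in find_occurrences("The cat\nsat on the mat", "the"):
        # in the GUI this would be: textEditor.tag_add('found', start, end)
        print(start, end)
```

Keeping the lookup free of widget calls means it can be checked without opening a window, and each handler would only differ in the tag name and colour it applies.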

In [19]:
import difflib
import re

wrongRealWordAndCandidate = dict()
allOriDiffSet = []
allCorDiffSet = []
allPotentialSuggestions = []

def highLightRealTextError(evt):
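    potentialSuggestionList.delete("1.0","end-1c")
    selection = realWordErrorList.curselection()
    # nothing selected, or the "No error" placeholder was clicked
    if not selection or selection[-1] >= len(allOriDiffSet):
        return
    index = selection[-1]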
    for tag in textEditor.tag_names():
        textEditor.tag_remove(tag, "1.0", "end")
    selectedCorpus = allOriDiffSet[index]
    selCorDiffSet = allCorDiffSet[index]
    potentialSuggestionList.insert("1.0", allPotentialSuggestions[index])
    string = textEditor.get("1.0","end-1c")
    for x in range(len(selectedCorpus)):
        realWordError = realWordErrorList.get(realWordErrorList.curselection())
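        # escape the sentence so punctuation such as "." or "?" is matched literally
        for match in re.finditer(re.escape(realWordError), string):
            # character offsets from the start of the text, valid across line breaks too
            start = "1.0+%dc" % match.start()
            end = "1.0+%dc" % match.end()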
        if selectedCorpus[x]:
            # start from the beginning of the selected sentence (and stop at its end)
            idx = start
            # find next occurrence, exit loop if no more
            idx = textEditor.search(selectedCorpus[x], idx, nocase=1, stopindex=end)
            if not idx: break
            # index right after the end of the occurrence
            lastidx = '%s+%dc' % (idx, len(selectedCorpus[x]))               
            # tag the whole occurrence (start included, stop excluded)
            textEditor.tag_add('found', idx, lastidx)
            # prepare to search for next occurrence
            idx = lastidx
            # use a yellow background for all the tagged occurrences
            textEditor.tag_config('found', background='yellow')
        for x in range(len(selCorDiffSet)):
            if selCorDiffSet[x]:
                # start from the beginning (and when we come to the end, stop)
                idx = '1.0'
                while 1:
                    # find next occurrence, exit loop if no more
                    idx = potentialSuggestionList.search(selCorDiffSet[x], idx, nocase=1, stopindex="end-1c")
                    if not idx: break
                    # index right after the end of the occurrence
                    lastidx = '%s+%dc' % (idx, len(selCorDiffSet[x]))
                    # tag the whole occurrence (start included, stop excluded)
                    potentialSuggestionList.tag_add('correct', idx, lastidx)
                    # prepare to search for next occurrence
                    idx = lastidx
                # use a light green background for all the tagged corrections
                potentialSuggestionList.tag_config('correct', background='#b8f2d0')
        
def checkError():
    allOriDiffSet.clear()
    allCorDiffSet.clear()
    allPotentialSuggestions.clear()
    realWordErrorList.delete(0,'end')
    potentialSuggestionList.delete("1.0","end-1c")
    sent = textEditor.get("1.0","end-1c")
    words = nltk.tokenize.sent_tokenize(sent)
    for x in range(len(words)):
        corrected_sentences = gf.correct(words[x], max_candidates=1)
        print("[Input] ", words[x])
        for corrected_sentence in corrected_sentences:
            print("[Correction] ", corrected_sentence)
            splitA = set(words[x].split(" "))
            splitB = set(corrected_sentence.split(" "))
            oriDiffSet = splitA.difference(splitB)
            corDiffSet = splitB.difference(splitA)
            oriDiffSet = list(oriDiffSet)
            corDiffSet = list(corDiffSet)
            if(len(oriDiffSet) != 0):
                print(corrected_sentence)
                realWordErrorList.insert(x,words[x])
                allPotentialSuggestions.append(corrected_sentence)
                allOriDiffSet.append(oriDiffSet)
                allCorDiffSet.append(corDiffSet)
            print("\n")
    if len(realWordErrorList.get(0, 'end')) == 0:
        realWordErrorList.insert(1, "No error")
        
def correctError():
    selectedSuggestion = potentialSuggestionList.get("1.0","end-1c")
    text = textEditor.get("1.0","end-1c")
    realWordError = realWordErrorList.get(realWordErrorList.curselection())
    text = text.replace(realWordError,selectedSuggestion)
    textEditor.delete("1.0","end-1c")
    textEditor.insert("1.0", text)
    realWordErrorList.delete(realWordErrorList.curselection())
    potentialSuggestionList.delete("1.0","end-1c")
    checkError()
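
`checkError` compares the original and corrected sentences with set differences, which drops word order and repeated words (for example, a doubled "the the" produces an empty difference). The cell imports `difflib` but does not use it; as a minimal sketch, assuming a word-level comparison is what is wanted, `difflib.SequenceMatcher` can return the changed words in sentence order. The function name `word_diff` is an assumption for illustration and is not part of the notebook.

```python
import difflib


def word_diff(original, corrected):
    """Return (removed, added) word lists in sentence order.

    Hypothetical helper: words are split on whitespace, so punctuation
    stays attached to the neighbouring word.
    """
    a, b = original.split(), corrected.split()
    removed, added = [], []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(a[i1:i2])  # words only in the original sentence
            added.extend(b[j1:j2])    # words only in the corrected sentence
    return removed, added


if __name__ == "__main__":
    print(word_diff("He go to school yesterday", "He went to school yesterday"))
```

The two returned lists play the same role as `oriDiffSet` and `corDiffSet`, so they could be appended to `allOriDiffSet` and `allCorDiffSet` without changing the highlighting code.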

In [20]:
# import the tkinter, csv, numpy and spellchecker modules
import tkinter as tk
import csv
import numpy as np
from spellchecker import SpellChecker

# create a simple screen and save to window variable
window = tk.Tk()

# give the GUI a name
window.title('Spelling Error')

# set the window size (width x height)
window.geometry('1000x500')

##column 1
#adding text editor
textEditor = tk.Text(window, width = 40)
textEditor.place(x = 30, y = 50)
#adding check spelling button
checkSpellingButton = tk.Button(window, text ="Check Spelling")
checkSpellingButton.config(command = checkSpelling)
checkSpellingButton.place(x = 30, y = 20)

checkErrorButton = tk.Button(window, text ="Check Real word error")
checkErrorButton.config(command = checkError)
checkErrorButton.place(x = 150, y = 20)
##------------------------------##

##column 2

# Wrong spelling label 
wrongSpellingLabel = tk.Label(window, text = "Wrong spelling word").place(x = 365, y = 20)
# Wrong Spelling list
wrongSpellingList = tk.Listbox(window, height = 15, width = 18, exportselection=False)

wrongSpellingList.bind('<<ListboxSelect>>', chooseCandidate)

## to be integrated
wrongSpellingList.place(x=365, y=50)

# Potential candidate label
searchLabel = tk.Label(window, text = "Potential candidate").place(x = 365, y = 300)
# Potential candidate list
potentialCandidateList = tk.Listbox(window, height = 5, width = 18)
potentialCandidateList.place(x=365, y=320)
correctSpellingButton = tk.Button(window, text ="Correct Spelling")
correctSpellingButton.config(command = correctSpelling)
correctSpellingButton.place(x = 365, y = 410)


##column 3

# Real-word error label
realWordErrorLabel = tk.Label(window, text = "Real word error").place(x = 500, y = 20)
# Real-word error list
realWordErrorList = tk.Listbox(window, height = 15, width = 45, exportselection=False)
realWordErrorList.bind('<<ListboxSelect>>', highLightRealTextError)

## to be integrated
realWordErrorList.place(x=500, y=50)

# Suggestion label
suggestionLabel = tk.Label(window, text = "Suggestion").place(x = 500, y = 300)
# Suggestion text box
potentialSuggestionList = tk.Text(window, height = 5, width = 34)
potentialSuggestionList.place(x=500, y=320)
correctErrorButton = tk.Button(window, text ="Correct Error")
correctErrorButton.config(command = correctError)
correctErrorButton.place(x = 500, y = 410)

##------------------------------##

##column 4
dest_file = "C:\\Users\\PC\\Desktop\\Code\\Pure_Corpus.csv"
data_array = []
# Spell checker used by the corpus search box
spell = SpellChecker()
# Corpus list label
dictionaryLabel = tk.Label(window, text = "Corpus list").place(x = 800, y = 20)
# adding corpus list box
corpusList = tk.Listbox(window, height = 15, width = 30)
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = ',')
    
    data_array = [data for data in data_iter]

data_array = data_array[0]
for i in range(len(data_array)):
    corpusList.insert(i,data_array[i])
corpusList.place(x = 800, y = 50)
corpusList.bind('<<ListboxSelect>>', highLightText)



# Search label 
#corpuseSearchText = tk.Entry(window, height = 1, width = 20)
searchLabel = tk.Label(window, text = "Search").place(x = 800, y = 300)
# filter the corpus list as the user types
def corpusSearching(text):
    # clear the list first so an empty search does not append duplicates
    corpusList.delete(0, 'end')
    if(text == ""):
        for i in range(len(data_array)):
            corpusList.insert(i,data_array[i])
    else:
        # candidates() may return None when nothing is found, depending on the pyspellchecker version
        candidateWord = list(spell.candidates(text) or [])
        # keep only the candidates that appear in the corpus
        filterWords = []
        for word in candidateWord:
            if word in data_array:
                filterWords.append(word)
        for i in range(len(filterWords)):
            corpusList.insert(i,filterWords[i])
    
# the <KeyRelease> binding below drives the search; a validatecommand would be called without the text argument
corpuseSearchText = tk.Entry(window)
corpuseSearchText.bind("<KeyRelease>",lambda x:corpusSearching(corpuseSearchText.get()))
corpuseSearchText.place(x = 800, y = 321)

# always include this line to let your code run
window.mainloop()



 SENTENCE 1
	NON-WORD ERROR :  BAILED
Candidates:  ['failed', 'tailed', 'boiled', 'baited']
Now editing:  failed
Now editing:  tailed
Now editing:  boiled
Now editing:  baited
	        [ 4 CANDIDATES SHORT-LISTED ]
Candidate:  boiled     	Edit distance:  1
Candidate:  baited     	Edit distance:  1
Candidate:  tailed     	Edit distance:  1
Candidate:  failed     	Edit distance:  1
BEST CANDIDATE : BOILED


	NON-WORD ERROR :  WROTE
Candidates:  ['rote', 'write']
Now editing:  rote
Now editing:  write
	        [ 10 CANDIDATES SHORT-LISTED ]
Candidate:  write     	Edit distance:  1
Candidate:  rote     	Edit distance:  1
Candidate:  white     	Edit distance:  2
Candidate:  rate     	Edit distance:  2
Candidate:  role     	Edit distance:  2
Candidate:  note     	Edit distance:  2
Candidate:  rot     	Edit distance:  2
Candidate:  rose     	Edit distance:  2
Candidate:  rope     	Edit distance:  2
Candidate:  vote     	Edit distance:  2
BEST CANDIDATE : WRITE


 SENTENCE 2
	NON-WORD ERROR 