# Motivation 

To my knowledge, no existing spell checker handles run-together compound words such as **studentdesigned** or **teacherdesigned**.

**studentdesigned** is a compound word that needs to be split into **'student designed'**.
Similarly, **teacherdesigned** should be split into **'teacher designed'**.
Many misspelled samples of this kind exist, and current English spell checkers do not handle them. This implementation attempts to tackle the compound word problem.

# Installing contractions and autocorrect offline

The autocorrect utilities were collected from [here](https://www.kaggle.com/datasets/kaggleqrdl/autocorrect).

In [1]:
!pip install ../input/pycontractions/results/anyascii-0.3.1-py3-none-any.whl
!pip install ../input/pycontractions/results/pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install ../input/pycontractions/results/textsearch-0.0.21-py2.py3-none-any.whl
!pip install ../input/pycontractions/results/contractions-0.1.72-py2.py3-none-any.whl
!pip install -q "../input/autocorrect/autocorrect-2.6.1.tar"


Processing /kaggle/input/pycontractions/results/anyascii-0.3.1-py3-none-any.whl
Installing collected packages: anyascii
Successfully installed anyascii-0.3.1
Processing /kaggle/input/pycontractions/results/pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: pyahocorasick
Successfully installed pyahocorasick-1.4.4
Processing /kaggle/input/pycontractions/results/textsearch-0.0.21-py2.py3-none-any.whl
Installing collected packages: textsearch
Successfully installed textsearch-0.0.21
Processing /kaggle/input/pycontractions/results/contractions-0.1.72-py2.py3-none-any.whl
Installing collected packages: contractions
Successfully installed contractions-0.1.72

# Imports

In [2]:
from autocorrect import Speller
import contractions,autocorrect
import nltk
# nltk.download('words')
# nltk.download('punkt')
from nltk.corpus import brown,words
import pandas as pd

# Compound word splitter

Part of the code below was taken from [compound-word-splitter](https://github.com/TimKam/compound-word-splitter), which splits words that are not recognized by pyenchant (a spell checker) into the largest possible compounds. My contributions are:
1. Replaced pyenchant with nltk and autocorrect, because of issues like [this](https://github.com/chiphuyen/sotawhat/issues/7).
2. Added better error handling.
3. Added a contractions fixer.
4. Added preprocessing.
5. Added an error analysis demo.
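
The "largest possible compounds" idea can be illustrated with a minimal sketch. It is not the notebook's `split` function: it uses a tiny hand-written vocabulary (`TOY_VOCAB`, an assumption made for illustration) instead of the NLTK word list, and it leaves out the capitalization and linking-"s" handling of the real code. The name `toy_split` is hypothetical.

```python
# Toy vocabulary standing in for the NLTK word list (an assumption for this sketch).
TOY_VOCAB = {"student", "teacher", "designed", "design", "winner", "take", "all", "name"}


def toy_split(word, vocab=TOY_VOCAB):
    """Return vocabulary words that concatenate to `word`, or [] if no split exists.

    Longer left-hand parts are tried first, so the first successful split
    starts with the largest possible leading compound.
    """
    if word in vocab:
        return [word]
    for cut in range(len(word) - 1, 0, -1):
        left, right = word[:cut], word[cut:]
        if left in vocab:
            rest = toy_split(right, vocab)
            if rest:
                return [left] + rest
    return []


print(toy_split("studentdesigned"))  # ['student', 'designed']
print(toy_split("winnertakeall"))    # ['winner', 'take', 'all']
print(toy_split("xyzzy"))            # []
```

Trying the longest prefix first is what keeps `studentdesigned` from being cut into many short dictionary words when one long match is available.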

In [3]:

print("nltk version -> ",nltk.__version__)

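# Build the speller once; constructing it on every call is slow.
_SPELLER = Speller(lang='en', fast=False)

def autospell(text):
    # Spell-correct each token and join the results back into one string.
    spells = [_SPELLER(w) for w in nltk.word_tokenize(text)]
    return " ".join(spells)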

def __concat(object1, object2):
    # Wrap bare strings in a list so both results can be concatenated as lists.
    if isinstance(object1, str):
        object1 = [object1]
    if isinstance(object2, str):
        object2 = [object2]
    return object1 + object2


def __capitalize_first_char(word):
    return word[0].upper() + word[1:]


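# Build the word set once; rebuilding it on every recursive call is very slow.
DICTIONARY = set(words.words())


def split(word, language='en'):
    dictionary = DICTIONARY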
    max_index = len(word)
    for index, char in enumerate(word):
        left_compound = word[0:max_index-index]
        right_compound_1 = word[max_index-index:max_index]
        right_compound_2 = word[max_index-index+1:max_index]
        if right_compound_1:
            right_compound1_upper = right_compound_1[0].isupper()
        if right_compound_2:
            right_compound2_upper = right_compound_2[0].isupper()
      
        if index > 0 and len(left_compound) > 1 and not left_compound in dictionary:
            left_compound = __capitalize_first_char(left_compound)
        is_left_compound_valid_word = len(left_compound) > 1 and left_compound in dictionary
        if is_left_compound_valid_word and \
                ((not split(right_compound_1, language) == '' and not right_compound1_upper) \
                or right_compound_1 == ''):
            return [compound for compound in __concat(left_compound, split(right_compound_1, language))\
                    if not compound == '']
        elif is_left_compound_valid_word and word[max_index-index:max_index-index+1] == 's' and \
            ((not split(right_compound_2, language) == '' and not right_compound2_upper) \
            or right_compound_2 == ''):
            return [compound for compound in __concat(left_compound, split(right_compound_2, language))\
                    if not compound == '']
  
    if not word == '' and word in dictionary:
        return [word]
    
    elif not word == '' and __capitalize_first_char(word) in dictionary:
        return [__capitalize_first_char(word)]
    else:
        return ""
    
def efficient_autospell(text):
    # Lowercase first: split() did not work with upper-case characters in the middle of a word.
    text = text.lower()

    # Expand contractions token by token.
    text = ' '.join(contractions.fix(word) for word in text.split())

    expanded_words = []
    for word in text.split():
        # Leave tokens with a typographic apostrophe and the token "sep" untouched.
        if "’" in word or word == "sep":
            expanded_words.append([word])
            continue
        output = split(word)
        if output == "":
            # No dictionary split found: fall back to the spell checker.
            try:
                output = [autospell(word)]
            except Exception:
                print(word)  # report the token that could not be corrected and skip it
                continue
        expanded_words.append(output)

    return ' '.join(' '.join(parts) for parts in expanded_words)



nltk version ->  3.7


The dictionary below was collected from [here](https://www.kaggle.com/code/dlaststark/fpe-no-fancy-stuff).

In [4]:
misspell_mapping = {
    'studentdesigned': 'student designed',
    'teacherdesigned': 'teacher designed',
    'genericname': 'generic name',
    'winnertakeall': 'winner take all',
    'studentname': 'student name',
    'driveless': 'driverless',
    'teachername': 'teacher name',
    'propername': 'proper name',
    'bestlaid': 'best laid',
    'genericschool': 'generic school',
    'schoolname': 'school name',
    'winnertakesall': 'winner take all',
    'elctoral': 'electoral',
    'eletoral': 'electoral',
    'genericcity': 'generic city',
    'elctors': 'electoral',
    'venuse': 'venue',
    'blimplike': 'blimp like',
    'selfdriving': 'self driving',
    'electorals': 'electoral',
    'nearrecord': 'near record',
    'egyptianstyle': 'egyptian style',
    'oddnumbered': 'odd numbered',
    'carintensive': 'car intensive',
    'elecoral': 'electoral',
    'oction': 'auction',
    'electroal': 'electoral',
    'evennumbered': 'even numbered',
    'mesalandforms': 'mesa landforms',
    'electoralvote': 'electoral vote',
    'relativename': 'relative name',
    '22euro': 'twenty two euro',
    'ellectoral': 'electoral',
    'thirtyplus': 'thirty plus',
    'collegewon': 'college won',
    'hisher': 'higher',
    'teacherbased': 'teacher based',
    'computeranimated': 'computer animated',
    'canadidate': 'candidate',
    'studentbased': 'student based',
    'gorethanks': 'gore thanks',
    'clouddraped': 'cloud draped',
    'edgarsnyder': 'edgar snyder',
    'emotionrecognition': 'emotion recognition',
    'landfrom': 'land form',
    'fivedays': 'five days',
    'electoal': 'electoral',
    'lanform': 'land form',
    'electral': 'electoral',
    'presidentbut': 'president but',
    'teacherassigned': 'teacher assigned',
    'beacuas': 'because',
    'positionestimating': 'position estimating',
    'selfeducation': 'self education',
    'diverless': 'driverless',
    'computerdriven': 'computer driven',
    'outofcontrol': 'out of control',
    'faultthe': 'fault the',
    'unfairoutdated': 'unfair outdated',
    'aviods': 'avoid',
    'momdad': 'mom dad',
    'statesbig': 'states big',
    'presidentswing': 'president swing',
    'inconclusion': 'in conclusion',
    'handsonlearning': 'hands on learning',
    'electroral': 'electoral',
    'carowner': 'car owner',
    'elecotral': 'electoral',
    'studentassigned': 'student assigned',
    'collegefive': 'college five',
    'presidant': 'president',
    'unfairoutdatedand': 'unfair outdated and',
    'nixonjimmy': 'nixon jimmy',
    'canadates': 'candidate',
    'tabletennis': 'table tennis',
    'himher': 'him her',
    'studentsummerpacketdesigners': 'student summer packet designers',
    'studentdesign': 'student designed',
    'limting': 'limiting',
    'electrol': 'electoral',
    'campaignto': 'campaign to',
    'presendent': 'president',
    'thezebra': 'the zebra',
    'landformation': 'land formation',
    'eyetoeye': 'eye to eye',
    'selfreliance': 'self reliance',
    'studentdriven': 'student driven',
    'winnertake': 'winner take',
    'alliens': 'aliens',
    '2000but': '2000 but',
    'electionto': 'election to',
    'candidatesas': 'candidates as',
    'electers': 'electoral',
    'winnertakes': 'winner takes',
    'isfeet': 'is feet',
    'incar': 'incur',
    'covid19': 'something',
    'aflcio': '',
    'outdatedand': 'outdated and',
    'httpswww': '',
    '51998': '',
    'iswing': '',
    'ascertainments': '',
    'athome': '',
    'risorius': '',
    'votes538': '',
    '41971': '',
    'palpabraeus': '',
    'figurelandform': 'figure landform',
    'possibleit': 'possible it',
    'takeall': 'take all',
    'inschool': 'in school',
    'fouces': 'focus',
    'presidentand': 'president and',
    'elecotrs': 'electoral',
    'formationwhich': 'formation which',
    'electorswho': 'electoral who',
    'presidnt': 'president',
    'eletors': 'electoral',
    'sinceraly': 'sincerely',
    'emotionshappiness': 'emotions happiness',
    'carterbob': 'carter bob',
    'donÃ£Ã¢t': 'do not',
    'eyesnose': 'eyes nose',
    'smartroad': 'smart road',
    'systemvoters': 'system voters',
    'emtions': 'emotions',
    'statedemocrats': 'state democrats',
    'lowcar': 'low car',
    'elcetoral': 'electoral',
    'expressivefor': 'expressive for',
    'animails': 'animals',
    'oppertonuty': 'opportunity',
    'tempetures': 'temperature',
    'recevies': 'receives',
    'twoseat': 'two seat',
    'consistution': 'constitution',
    'horsesyoung': 'horses young',
    'semidriverless': 'semi driverless',
    'presisdent': 'president',
    'exspression': 'expression',
    'valcanoes': 'volcano',
    'actiry': '',
    'lifejust': 'life just',
    'selfreliant': 'self reliant',
    'comcaraccidentcauseofaccidentcellphonecellphonestatistics': 'car accident cause of accident cellphone statistics',
    'vaubangermany': 'germany',
    'fourtyfour': 'fourty four',
    'atomspheric': 'atmospheric',
    'mid1990': '',
    'activitis': 'activities',
    'paragrpah': 'paragraph',
    'electora': 'electoral',
    'elcetion': 'election',
    'stressfree': 'stress free',
    'seegoing': 'see going',
    'coferencing': 'conferencing',
    'ctrdot': '',
    'segoing': '',
    'teacherdesign': 'teacher design',
    'kidsteens': 'kids teens',
    'elcetors': 'electoral',
    'poulltion': 'pollution',
    'surportive': 'supportive',
    'presisent': 'president',
    'technollogy': 'technology',
    'precidency': 'president',
    'voteswhile': 'votes while',
    'headformed': 'head formed',
    'swingstates': 'swing states',
    'candates': 'candidate',
    'locationname': 'location name',
    'venuss': 'venues',
    'astronmers': 'astronomers',
    'democtratic': 'democratic',
    'canadent': 'candidate',
    'cyndonia': '',
    'computure': 'computer',
    'nasas': 'nasa',
    'onehalf': 'one half',
    'preident': 'president',
    'ressons': 'reasons',
    'presidentvice': 'president vice',
    'nonswing': 'non swing',
    'thirtyeight': 'thirty eight',
    'processnot': 'process not',
    'facetoface': 'face to face',
    'teendriversource': 'teen driver source',
    'sadnessand': 'sadness and',
    'abloish': 'abolish',
    'driveing': 'driving',
    'navagating': 'navigating',
    'electorsthe': 'electoral',
    'vothing': 'voting',
    'callage': 'college',
    'senseit': 'sense it',
    'mercedesbenz': 'mercedes benz',
    'electorall': 'electoral'
}

In [5]:

df = pd.DataFrame(columns=['input', 'output'],index=range(1, len(misspell_mapping)))

len(df)

197

In [6]:
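if __name__ == "__main__":
    for idx, bad_word in enumerate(misspell_mapping):
        correct_text = efficient_autospell(bad_word)
        # Use .loc instead of chained assignment (df.input[idx] = ...).
        # df's index starts at 1, so label 0 is missing; .loc appends it as a
        # new row at the end instead of silently dropping the first entry.
        df.loc[idx, 'input'] = bad_word
        df.loc[idx, 'output'] = correct_text
    df.to_csv('error_analyst.csv', index=False)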

In [7]:
df.head()

|   | input | output |
|---|-------|--------|
| 1 | teacherdesigned | teacher designed |
| 2 | genericname | generic name |
| 3 | winnertakeall | winner take all |
| 4 | studentname | student name |
| 5 | driveless | drivel ess |
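
To turn this kind of output into a number, the predictions can be compared with the targets in `misspell_mapping`. The sketch below is an assumption about how such a check might look, not part of the original notebook. `score_corrections` is a hypothetical helper, and the two small dictionaries only copy the five rows of the `df.head()` output above together with their targets from the mapping.

```python
def score_corrections(expected, predicted):
    """Compare predicted corrections with the expected ones.

    Returns (accuracy, mismatches), where mismatches is a list of
    (input, expected, predicted) tuples.
    """
    mismatches = []
    for bad_word, target in expected.items():
        got = predicted.get(bad_word, "").strip()
        if got != target.strip():
            mismatches.append((bad_word, target, got))
    accuracy = 1 - len(mismatches) / len(expected) if expected else 0.0
    return accuracy, mismatches


# Targets taken from misspell_mapping for the five rows of the df.head() output.
expected = {
    "teacherdesigned": "teacher designed",
    "genericname": "generic name",
    "winnertakeall": "winner take all",
    "studentname": "student name",
    "driveless": "driverless",
}
# Outputs copied from the df.head() output.
predicted = {
    "teacherdesigned": "teacher designed",
    "genericname": "generic name",
    "winnertakeall": "winner take all",
    "studentname": "student name",
    "driveless": "drivel ess",
}

accuracy, mismatches = score_corrections(expected, predicted)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.80
print(mismatches)  # [('driveless', 'driverless', 'drivel ess')]
```

Running the same comparison over the full `df` would give an exact-match rate for the whole mapping and a list of the rows worth inspecting by hand.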


# Findings and scope for improvement
1. It works well for some bad samples, but it also converts a few good samples into bad ones, for example `driveless` becomes 'drivel ess' (the expected output is 'driverless').
2. It is slow and time consuming.
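
One possible way to reduce the run time is to build the vocabulary once and memoize the recursive split, so repeated substrings are not re-examined. The sketch below is a simplified stand-in and not the notebook's `split`: `VOCAB` is a toy set assumed for illustration (in the notebook it would be `set(words.words())`), `cached_split` is a hypothetical name, and how much time this saves on the real data has not been measured here.

```python
from functools import lru_cache

# Toy vocabulary (assumption); built once at import time rather than per call.
VOCAB = frozenset({"student", "summer", "packet", "designers", "design"})


@lru_cache(maxsize=None)
def cached_split(word):
    """Longest-prefix split returning a tuple of words, or () when no split exists."""
    if word in VOCAB:
        return (word,)
    for cut in range(len(word) - 1, 0, -1):
        left, right = word[:cut], word[cut:]
        if left in VOCAB:
            rest = cached_split(right)
            if rest:
                return (left,) + rest
    return ()


print(cached_split("studentsummerpacketdesigners"))
# ('student', 'summer', 'packet', 'designers')

cached_split("studentsummerpacketdesigners")  # second call is answered from the cache
print(cached_split.cache_info().hits > 0)     # True
```

Tuples are returned instead of lists so that cached results cannot be modified by a caller.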