# Misspelling Map Builder
This notebook fixes misspelled words in the training text up front. It proceeds in the following priority order:
* first, identify all unknown (out-of-vocabulary) words
* among those unknown words, try to suggest a fix based on the custom medical named entities we built specifically:
  * drug names
  * active ingredients
* then, try to suggest a fix based on the general-purpose vocabulary

We try domain-specific named entities first because their spelling is highly specific.

It generates three CSV files in the staging_data folder, each mapping a misspelled word to its suggested fix:
* mispelled_drug_names.csv
* mispelled_ingredient_names.csv
* mispelled_general_words.csv
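
As an illustration of how the three maps could be combined downstream, here is a minimal sketch that reads `misspelled,fix` rows and merges the maps so that the drug map wins over the ingredient map, which wins over the general map. The helper names `read_fix_map` and `merge_fix_maps` and the toy rows are hypothetical and not part of this notebook. The real files are written with `'\r'` row terminators, so they should be opened with `newline=''` before being passed to `csv.reader`.

```python
import csv
import io


def read_fix_map(handle):
    """Read `misspelled,fix` rows from an open text handle into a dict."""
    return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def merge_fix_maps(maps_by_priority):
    """Merge maps; on conflict, the earlier (higher-priority) map wins."""
    merged = {}
    for fix_map in maps_by_priority:
        for word, fix in fix_map.items():
            merged.setdefault(word, fix)
    return merged


# Hypothetical in-memory stand-ins for the three CSV files.
drug_csv = io.StringIO("cérazette,cerazette\ndolyprane,doliprane\n")
ingredient_csv = io.StringIO("tranéxamique,tranexamique\n")
general_csv = io.StringIO("dolyprane,dolmen\naujourd'hui,aujourdhui\n")

merged = merge_fix_maps([
    read_fix_map(drug_csv),        # highest priority
    read_fix_map(ingredient_csv),
    read_fix_map(general_csv),     # lowest priority
])
print(merged["dolyprane"])
```

Here the conflicting general-purpose suggestion for `dolyprane` is ignored because the drug map already provides a fix.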


The domain-specific named entity correction is inspired by the source code at https://norvig.com/spell-correct.html.
It relies on the Levenshtein edit distance and, in the original, on word frequency (learned from a large corpus) to compute the likelihood of a suggestion.
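
To make the frequency idea concrete, here is a minimal sketch of frequency-based ranking in the spirit of Norvig's corrector. The toy corpus and the names `WORD_COUNTS`, `probability` and `most_likely` are assumptions made for illustration; a real corpus would be far larger.

```python
import re
from collections import Counter


def words(text):
    return re.findall(r'\w+', text.lower())


# Toy corpus standing in for a large reference text (hypothetical data).
corpus = "deroxat deroxat deroxat seropram seropram seroplex"
WORD_COUNTS = Counter(words(corpus))
TOTAL = sum(WORD_COUNTS.values())


def probability(word):
    """Relative frequency of `word` in the corpus (0.0 if unseen)."""
    return WORD_COUNTS[word] / TOTAL


def most_likely(candidates):
    """Pick the candidate with the highest corpus frequency."""
    return max(candidates, key=probability)


best = most_likely({"seropram", "seroplex"})
print(best)
```

With this toy corpus, `seropram` is preferred over `seroplex` because it occurs twice rather than once.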

A correction is accepted only if its Levenshtein distance stays under a threshold. I prefer to threshold the ratio of the Levenshtein distance to the word length, because this indicator is better at identifying unintentional spelling mistakes: the same number of edits matters much more in a short word than in a long one.
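
The effect of the ratio can be sketched as follows. The pure-Python `levenshtein` function is only a stand-in for `Levenshtein.distance` so that the example runs without the package. `normalized_distance` and `accept_fix` are hypothetical helper names, and the 0.25 threshold is the one used later in this notebook.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(word, candidate):
    """Edit distance divided by the length of the misspelled word."""
    return levenshtein(candidate, word) / len(word)


def accept_fix(word, candidate, threshold=0.25):
    """Accept the fix when the normalized distance is within the threshold."""
    return normalized_distance(word, candidate) <= threshold


# Both pairs are two edits apart, but only the long word passes the threshold.
for word, candidate in [("mynocicline", "minocycline"), ("gygy", "gyno")]:
    print(word, candidate,
          round(normalized_distance(word, candidate), 4),
          accept_fix(word, candidate))
```

Two edits in an 11-letter word give a ratio of about 0.18, while two edits in a 4-letter word give 0.5, so only the first fix is accepted.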

For the general-purpose vocabulary, I use the **SpellChecker** package, which ships with French resources.

## Identify out of vocabulary words

Whether a word is unknown depends on the dictionary used as reference. Which dictionary should we consider?
The deep learning pipeline uses a pretrained embedding model (fastText), which copes poorly with misspelled words. To avoid that situation, we take the fastText vocabulary as the reference for deciding whether a word is out of vocabulary.
A dedicated Jupyter [notebook](outofvocabulary_identifier.ipynb) generates a file containing the list of unknown words: [file](../../data/staging_data/outofvocab_words.txt)
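
The principle can be sketched as a simple membership test. This is not the code of the dedicated notebook: the function `out_of_vocabulary` and the small `vocabulary` set are hypothetical, and the set merely stands in for the fastText vocabulary.

```python
import re


def tokenize(text):
    return re.findall(r'\w+', text.lower())


def out_of_vocabulary(text, vocabulary):
    """Distinct tokens of `text` that are absent from `vocabulary`, in first-seen order."""
    seen = set()
    unknown = []
    for token in tokenize(text):
        if token not in vocabulary and token not in seen:
            seen.add(token)
            unknown.append(token)
    return unknown


# Hypothetical stand-in for the fastText vocabulary.
vocabulary = {"je", "prends", "du", "depuis", "un", "mois"}
unknown = out_of_vocabulary("Je prends du dolyprane depuis un mois, du Dolyprane", vocabulary)
print(unknown)
```

Each unknown token is reported once, regardless of case, which keeps the list to correct short.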


## Custom spelling corrector

In [15]:
# custom code to fix domain-specific words that are incorrectly spelled
# inspiration source: https://norvig.com/spell-correct.html
import Levenshtein
import csv
import re
from collections import Counter


def words(text):
    return re.findall(r'\w+', text.lower())


def getCandidates(word, refDictionary): 
    "generate possible spelling corrections for word"
    return (known([word], refDictionary) or known(edits1(word), refDictionary) or known(edits2(word), refDictionary) or [word])

def known(words, refDictionary): 
    "the subset of `words` that appear in the dictionary."
    return set(w for w in words if w in refDictionary)

def edits1(word):
    "all edits that are one edit away from `word`"
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "all edits that are two edits away from `word`"
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

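def fixWord(word, refDictionary): 
    "best candidate for `word` and its Levenshtein distance normalized by len(word)"
    if not word:
        # an empty token cannot be normalized by its length; return it unchanged
        return (word, 0.)
    candidates = getCandidates(word, refDictionary)
    bestCandidate = None
    bestDistance = None
    for candidate in candidates:
        dist = Levenshtein.distance(candidate, word) / len(word)
        if bestDistance is None or dist < bestDistance:
            bestCandidate = candidate
            bestDistance = dist
    return (bestCandidate, bestDistance)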


def buildFixMap(words, refDictionary, distanceThreshold, exclusionList, fileName):
    "map each misspelled word to its accepted fix and save the map to `fileName`"
    misSpelledMap = {}
    for word in words:
        fixedWord, distance = fixWord(word, refDictionary)
        if fixedWord != word:
            if exclusionList is not None and word in exclusionList:
                print("{0} is already handled".format(word))
                continue
            # accept the fix only when the normalized distance is within the threshold
            acceptFix = distance <= distanceThreshold
            flag = "OK" if acceptFix else "KO"
            print("{3}  {0} => {1} with dist={2:.4f}".format(word, fixedWord, distance, flag))
            if acceptFix:
                misSpelledMap[word] = fixedWord
    # the context manager closes the file so that all rows are flushed to disk
    with open(fileName, "w", newline='', encoding='utf-8') as f:
        writer = csv.writer(f, lineterminator='\r')
        for key, val in misSpelledMap.items():
            writer.writerow([key, val])
    return misSpelledMap

## Drug name

In [16]:
with open('../../data/staging_data/fasttext_outofvocab_words.txt',encoding='utf-8') as f:
    unknownWords = f.read().splitlines()

In [17]:
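# assumes the file is UTF-8 encoded, like the out-of-vocabulary word list
with open('../../data/staging_data/drug_names.txt', encoding='utf-8') as f:
    drugNames = Counter(words(f.read()))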
misSpelledDrugMap = buildFixMap(unknownWords, drugNames, 0.25, None, "../../data/staging_data/mispelled_drug_names.csv")

OK  cérazette => cerazette with dist=0.1111
OK  adépal => adepal with dist=0.1667
OK  sterilet => sterilene with dist=0.2500
OK  lutéran => luteran with dist=0.1429
OK  cycléane => cycleane with dist=0.1250
KO  gygy => gyno with dist=0.5000
OK  lyoc => lyo with dist=0.2500
OK  seroquel => xeroquel with dist=0.1250
OK  triafémi => triafemi with dist=0.1250
OK  lutényl => lutenyl with dist=0.1429
OK  désobel => desobel with dist=0.1429
OK  déroxat => deroxat with dist=0.1429
KO  dexorat => deroxat with dist=0.2857
KO  qu’il => quixil with dist=0.4000
KO  anxiete => anxietum with dist=0.2857
OK  gelsenium => gelsemium with dist=0.1111
KO  apetit => apatite with dist=0.3333
OK  séropram => seropram with dist=0.1250
OK  prévenar => prevenar with dist=0.1250
OK  lévothyrox => levothyrox with dist=0.1000
OK  calcibronate => calcibronat with dist=0.0833
OK  seropam => seropram with dist=0.1429
OK  miréna => mirena with dist=0.1667
OK  bétaserc => betaserc with dist=0.1250
OK  dépamide => depam

KO  ethpo => ethyol with dist=0.4000
OK  gelsemiuim => gelsemium with dist=0.1000
KO  flushs => flucis with dist=0.3333
OK  déturgylone => deturgylone with dist=0.0909
OK  gerdazil => gardasil with dist=0.2500
OK  floxifral => floxyfral with dist=0.1111
OK  proitrine => prostine with dist=0.2222
OK  voltaréne => voltarene with dist=0.1111
OK  andorcur => androcur with dist=0.2500
KO  ingatia => ignatia with dist=0.2857
OK  mynocicline => minocycline with dist=0.1818
OK  naturland => natulan with dist=0.2222
KO  alyna => avena with dist=0.4000
OK  dolyprane => doliprane with dist=0.1111
OK  cytotek => cytotec with dist=0.1429
OK  ludeale => ludeal with dist=0.1429
KO  gracial => fractal with dist=0.2857
KO  ado' => aloe with dist=0.5000
OK  oeuphytoses => euphytose with dist=0.1818
OK  seoplex => seroplex with dist=0.1429
OK  floradil => foradil with dist=0.1250
OK  alverine => elvorine with dist=0.2500
KO  nizoral => neoral with dist=0.2857
OK  kethoderm => ketoderm with dist=0.1111
KO

OK  quétiapin => quetiapine with dist=0.2222
OK  lysensia => lysanxia with dist=0.2500
OK  sertalia => smectalia with dist=0.2500
OK  nourrison => normison with dist=0.2222
KO  c’est => cilest with dist=0.4000
OK  stederil => stediril with dist=0.1250
OK  evapar => evepar with dist=0.1667
KO  benivit => recivit with dist=0.2857
KO  gonglé => gonal with dist=0.3333
KO  normix => novomix with dist=0.3333
OK  roxytromycine => roxithromycine with dist=0.1538
OK  trophygil => trophigil with dist=0.1111
OK  trophigill => trophigil with dist=0.1000
OK  tridnordiol => trinordiol with dist=0.0909
OK  nocerton => nocertone with dist=0.1250
OK  engérix => engerix with dist=0.1429
OK  mienesse => minesse with dist=0.1250
OK  caffea => coffea with dist=0.1667
OK  zirtec => zyrtec with dist=0.1667
OK  florygynal => florgynal with dist=0.1000
KO  noctifs => noctium with dist=0.2857
OK  feldène => feldene with dist=0.1429
KO  immovan => imovax with dist=0.2857
OK  perlodel => parlodel with dist=0.1250

## Active Ingredient name

In [19]:
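# assumes the file is UTF-8 encoded, like the out-of-vocabulary word list
with open('../../data/staging_data/ingredient_names.txt', encoding='utf-8') as f:
    ingredientNames = Counter(words(f.read()))
# skip words already fixed as drug names, and correctly spelled drug names
exclusionList = set(misSpelledDrugMap) | set(drugNames)
misSpelledIngredientMap = buildFixMap(unknownWords, ingredientNames, 0.25, exclusionList, "../../data/staging_data/mispelled_ingredient_names.csv")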

norset is already handled
adepal is already handled
solian is already handled
tercian is already handled
adépal is already handled
clomid is already handled
cycléane is already handled
lumalia is already handled
inexium is already handled
lyoc is already handled
cycleane is already handled
skenan is already handled
tolexine is already handled
noroxine is already handled
evepar is already handled
depamide is already handled
KO  dexorat => dextran with dist=0.2857
gelsenium is already handled
KO  apetit => apatite with dist=0.3333
zoely is already handled
granions is already handled
dostinex is already handled
fucidine is already handled
depakine is already handled
dépamide is already handled
monazol is already handled
lamisilate is already handled
sertaline is already handled
cacit is already handled
pergotime is already handled
modopar is already handled
erythromycine is already handled
KO  c'est => cyst with dist=0.4000
ginkor is already handled
coversyl is already handled
imodium is 

roxytromycine is already handled
caffea is already handled
hexomedine is already handled
KO  encinte => escine with dist=0.2857
chamomilia is already handled
harpagopytum is already handled
KO  nuitce => nite with dist=0.3333
lacmital is already handled
arpagophytum is already handled
KO  lpmg => lom with dist=0.5000
metfomine is already handled
KO  revamil => rapamil with dist=0.2857
KO  jarsin => xarcin with dist=0.3333
oracilline is already handled
hypéricum is already handled
OK  nitr => titr with dist=0.2500
myambutol is already handled
verapamil is already handled
KO  genertº => genet with dist=0.2857
amiodorane is already handled
KO  adépale => adapal with dist=0.2857
KO  pdant => pent with dist=0.4000
polery is already handled
bisprolol is already handled
KO  alpraz => cloraz with dist=0.3333
OK  tranéxamique => tranexamique with dist=0.0833
KO  môi => ii with dist=0.6667
chlomadinone is already handled
cafeiné is already handled
KO  ginéco => ginkgo with dist=0.3333
KO  nomal 

## Common/Usual Term

In [20]:
exclusionList = exclusionList | set(misSpelledIngredientMap) | set(ingredientNames)

In [None]:
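from spellchecker import SpellChecker

spell = SpellChecker(language='fr')

misGeneralSpelledMap = {}
fileName = "../../data/staging_data/mispelled_general_words.csv"

for word in unknownWords:
    # skip empty tokens and words already covered by the drug and ingredient steps
    if not word or word in exclusionList:
        continue
    try:
        fixedWord = spell.correction(word)
    except Exception:
        print("fail to fix word " + word)
        continue

    # depending on the package version, correction() may return None when it finds no candidate
    if fixedWord is None or fixedWord == word:
        continue
    dist = Levenshtein.distance(fixedWord, word) / len(word)
    acceptFix = dist < 0.25
    flag = "OK" if acceptFix else "KO"
    print("{3}  {0} => {1} with dist={2}".format(word, fixedWord, dist, flag))
    if acceptFix:
        # repair mis-encoded accents coming back from the French dictionary
        fixedWord = fixedWord.replace("ã¨", "è")
        fixedWord = fixedWord.replace("ã©", "é")
        misGeneralSpelledMap[word] = fixedWord

# the context manager closes the file so that all rows are flushed to disk
with open(fileName, "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\r')
    for key, val in misGeneralSpelledMap.items():
        writer.writerow([key, val])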



OK  aujourd'hui => aujourdhui with dist=0.09090909090909091
