# Auto_Spell_Correcter


## What does an autocorrect system do?

An autocorrect system replaces a misspelled word with its correct spelling.

### Steps for implementing an autocorrect system

1. Identify the misspelled word.

2. Find strings that are n edits away from the misspelled word.

3. Filter the candidates to keep only those found in the vocabulary.

4. Order the filtered candidates by word probability.

5. Choose the most likely candidate.

### Identifying a misspelled word

A word is treated as misspelled if it is not found in the vocabulary of the corpus the autocorrect system is working with.

### Finding strings that are n edits away

An edit is an operation performed on a string to change it into another string. The edit distance between two strings is the number of such operations needed to turn one into the other.



**Types of edit operations**

1. INSERT (add a letter). Example: "to" => "top", "two"

2. DELETE (remove a letter). Example: "hat" => "at", "ha", "ht"

3. SWAP (swap 2 adjacent letters). Example: "eta" => "tea", "eat"

4. REPLACE (change one letter to another). Example: "jaw" => "jar", "paw"

**Calculating word probabilities**

The probability of a word is calculated with the formula P(w) = C(w)/V, where:

P(w) is the probability of the word w

C(w) is the number of times the word w appears in the corpus

V is the total number of words in the corpus
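
As a quick check of the formula, the sketch below computes P(w) on a tiny made-up corpus. The sentence and the helper name `probability` are assumptions for illustration only; they are not part of the notebook's PDF corpus or code.

```python
from collections import Counter

# Toy corpus, assumed for illustration only.
corpus = "i am happy because i am learning".split()

counts = Counter(corpus)        # C(w) for every word w
total = sum(counts.values())    # V: total number of words in the corpus

def probability(word):
    """Return P(w) = C(w) / V; unseen words get probability 0."""
    return counts[word] / total

print(probability("am"))      # 2 occurrences out of 7 words
print(probability("happy"))   # 1 occurrence out of 7 words
```

The same arithmetic applies to the notebook's corpus: "student" appears 2 times among 3266 words, so P(student) = 2/3266 ≈ 0.000612, which matches the value printed further down.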


**Minimum edit distance**

The minimum edit distance (MED) is the smallest number of edits needed to transform one string into another.

Applications of MED

1. Spelling correction

2. Document similarity

3. Machine translation

4. DNA sequencing
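
The notebook never computes the minimum edit distance directly, so here is a minimal dynamic-programming sketch of it. It assumes a cost of 1 for each insert, delete, and replace (the classic Levenshtein variant) and does not treat a swap as a single operation; the function name `min_edit_distance` is hypothetical. Other cost schemes, such as charging 2 for a replace, are also common.

```python
def min_edit_distance(source, target):
    """Levenshtein distance with unit cost for insert, delete, and replace."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edits needed to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i    # delete all i characters of the prefix
    for j in range(cols):
        dist[0][j] = j    # insert all j characters of the prefix
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                       # delete
                dist[i][j - 1] + 1,                       # insert
                dist[i - 1][j - 1] + (0 if same else 1),  # match or replace
            )
    return dist[-1][-1]

print(min_edit_distance("geder", "gender"))  # 1: insert "n"
print(min_edit_distance("eta", "tea"))       # 2: a swap counts as two replaces here
```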

In [1]:
import re
import string
from collections import Counter
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
import PyPDF2

def read_corpus(file_path):
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
        return text.split()



In [5]:
file_path = "/content/drive/MyDrive/text_file.pdf"
words = read_corpus(file_path)
print(f"There are {len(words)} total words in the corpus")

There are 3266 total words in the corpus


In [6]:
print(words)

['Effects', 'of', 'Age', 'and', 'Gender', 'on', 'Blogging', 'Jonathan', 'Schler1Moshe', 'Koppel1Shlomo', 'Argamon2James', 'Pennebaker3', '1', 'Dept.', 'of', 'Computer', 'Science,', 'Bar-Ilan', 'University,', 'Ramat', 'Gan', '52900,Israel', '2', 'Linguistic', 'Cognition', 'Lab,', 'Dept.', 'of', 'Computer', 'Science', 'Illinois', 'Institute', 'of', 'Technology,', 'Chicago,', 'IL', '606163', 'Dept.', 'of', 'Psychology,', 'The', 'University', 'of', 'Texas,', 'Austin,', 'TX', '78712', 'schlerj@cs.biu.ac.il,', 'koppel@cs.biu.ac.il,', 'argamon@iit.edu,', 'pennebaker@mail.utexas.edu', 'Abstract', 'Analysis', 'of', 'a', 'corpus', 'of', 'tens', 'of', 'thousands', 'of', 'blogs', '–', 'incorporating', 'close', 'to', '300', 'million', 'words', '–', 'indicates', 'significant', 'differences', 'in', 'writing', 'style', 'and', 'content', 'between', 'male', 'and', 'female', 'bloggers', 'as', 'well', 'as', 'among', 'authors', 'of', 'different', 'ages.', 'Such', 'differences', 'can', 'be', 'exploited', 't

In [7]:
vocabs = set(words)
print(f"There are {len(vocabs)} unique words in the vocabulary")

There are 1495 unique words in the vocabulary


In [8]:
print(vocabs)

{'Cognition', 'emphasize', 'borne', '671.5', 'further', 'our', 'not', '0.41±0.05', '0.92±0.05', 'initialized', 'various', 'Features', 'TX', '(too', 'Formal', '0.62±0.04', '1.25±0.06', '1.', '4743', '25.4±0.4', 'Taken', '300', 'j\uf0df', '99.2', '0.33±0.02', 'bias,', '28.4', 'types', 'cycles', '1,…,wi', 'sis', 'Average', 'work,', 'examples', 'democratic', 'lol,', '819', '0.47±0.04', '(normalized', 'maths', 'monotonically', '(1995).', ':', 'Pennebaker3', 'considered', 'On', 'male.', 'confirms', 'fiction', '0.27±0.03', '0.29±0.02', 'classification', 'neg-emotions', 'permit', '627.6', 'balanced', 'females,', '1.37±0.06', '87.3%.', '12287', 'Koppel1Shlomo', 'domains.', '196.9±2.4', 'Human', 'server', 'see', 'end', 'shown.', 'has', '1256.5', 'much', 'learn', 'Corney,', '0.14±0.02', 'Illinois', '0.62±0.03', '<x1,…,x', 'have', '0.85±0.03', '0.59±0.05', 'this', '13-17;', 'distribution', '(While', '“male”', 'about', 'earlier:', 'ur–', 'Such', 'study,', 'equal', 'syntactic', 'quite', 'work', '193
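
The vocabulary printed above still contains tokens with attached punctuation and mixed case (for example 'bias,' and '(too'), because `text.split()` only splits on whitespace. One possible normalisation, sketched below, lowercases the text and keeps runs of word characters, using the same regular expression as the `SpellChecker` class defined later in this notebook. The function name `tokenize` and the sample string are assumptions; applying this to the corpus would change the word and vocabulary counts reported here.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())

sample = "Science, bias, (too much) Gender"
print(tokenize(sample))
```

A caveat of this approach is that hyphenated or apostrophised tokens are split apart, so "Bar-Ilan" becomes "bar" and "ilan".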

In [9]:
word_counts = Counter(words)
print(word_counts["student"])

2


In [10]:
total_word_count = float(sum(word_counts.values()))
word_probas = {word: word_counts[word] / total_word_count for word in word_counts.keys()}

In [11]:
print(word_probas["student"])

0.000612369871402327


In [12]:
def split(word):
  return [(word[:i], word[i:]) for i in range(len(word) + 1)]

In [13]:
print(split("student"))

[('', 'student'), ('s', 'tudent'), ('st', 'udent'), ('stu', 'dent'), ('stud', 'ent'), ('stude', 'nt'), ('studen', 't'), ('student', '')]


In [14]:
def delete(word):
  return [l + r[1:] for l,r in split(word) if r]

In [15]:
print(delete("student"))

['tudent', 'sudent', 'stdent', 'stuent', 'studnt', 'studet', 'studen']


In [16]:
def swap(word):
  return [l + r[1] + r[0] + r[2:] for l, r in split(word) if len(r)>1]

In [18]:
print(swap("student"))

['tsudent', 'sutdent', 'stduent', 'stuednt', 'studnet', 'studetn']


In [19]:
string.ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

In [20]:
def replace(word):
  letters = string.ascii_lowercase
  return [l + c + r[1:] for l, r in split(word) if r for c in letters]

In [21]:
print(replace("student"))

['atudent', 'btudent', 'ctudent', 'dtudent', 'etudent', 'ftudent', 'gtudent', 'htudent', 'itudent', 'jtudent', 'ktudent', 'ltudent', 'mtudent', 'ntudent', 'otudent', 'ptudent', 'qtudent', 'rtudent', 'student', 'ttudent', 'utudent', 'vtudent', 'wtudent', 'xtudent', 'ytudent', 'ztudent', 'saudent', 'sbudent', 'scudent', 'sdudent', 'seudent', 'sfudent', 'sgudent', 'shudent', 'siudent', 'sjudent', 'skudent', 'sludent', 'smudent', 'snudent', 'soudent', 'spudent', 'squdent', 'srudent', 'ssudent', 'student', 'suudent', 'svudent', 'swudent', 'sxudent', 'syudent', 'szudent', 'stadent', 'stbdent', 'stcdent', 'stddent', 'stedent', 'stfdent', 'stgdent', 'sthdent', 'stident', 'stjdent', 'stkdent', 'stldent', 'stmdent', 'stndent', 'stodent', 'stpdent', 'stqdent', 'strdent', 'stsdent', 'sttdent', 'student', 'stvdent', 'stwdent', 'stxdent', 'stydent', 'stzdent', 'stuaent', 'stubent', 'stucent', 'student', 'stueent', 'stufent', 'stugent', 'stuhent', 'stuient', 'stujent', 'stukent', 'stulent', 'stument'
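
Note that the output above contains 'student' itself several times: replacing a letter with the same letter leaves the word unchanged. This is harmless once `edit1` collects everything into a set, but a stricter variant can skip those no-op substitutions. The sketch below is one way to do it; `replace_strict` is a hypothetical name, and `split` is repeated so the block runs on its own.

```python
import string

def split(word):
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]

def replace_strict(word):
    """Like replace(), but skip substitutions that leave the word unchanged."""
    letters = string.ascii_lowercase
    return [l + c + r[1:] for l, r in split(word) if r for c in letters if c != r[0]]

print("student" in replace_strict("student"))  # False
print(len(replace_strict("student")))          # 7 positions * 25 other letters = 175
```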

In [22]:
def insert(word):
  letters = string.ascii_lowercase
  return [l + c + r for l, r in split(word) for c in letters]

In [24]:
print(insert("student"))

['astudent', 'bstudent', 'cstudent', 'dstudent', 'estudent', 'fstudent', 'gstudent', 'hstudent', 'istudent', 'jstudent', 'kstudent', 'lstudent', 'mstudent', 'nstudent', 'ostudent', 'pstudent', 'qstudent', 'rstudent', 'sstudent', 'tstudent', 'ustudent', 'vstudent', 'wstudent', 'xstudent', 'ystudent', 'zstudent', 'satudent', 'sbtudent', 'sctudent', 'sdtudent', 'setudent', 'sftudent', 'sgtudent', 'shtudent', 'situdent', 'sjtudent', 'sktudent', 'sltudent', 'smtudent', 'sntudent', 'sotudent', 'sptudent', 'sqtudent', 'srtudent', 'sstudent', 'sttudent', 'sutudent', 'svtudent', 'swtudent', 'sxtudent', 'sytudent', 'sztudent', 'staudent', 'stbudent', 'stcudent', 'stdudent', 'steudent', 'stfudent', 'stgudent', 'sthudent', 'stiudent', 'stjudent', 'stkudent', 'stludent', 'stmudent', 'stnudent', 'stoudent', 'stpudent', 'stqudent', 'strudent', 'stsudent', 'sttudent', 'stuudent', 'stvudent', 'stwudent', 'stxudent', 'styudent', 'stzudent', 'stuadent', 'stubdent', 'stucdent', 'studdent', 'stuedent', 'st

In [25]:
def edit1(word):
  return set(delete(word) + swap(word) + replace(word) + insert(word))

In [27]:
print(edit1("student"))

{'studeit', 'stutdent', 'studeqnt', 'studendt', 'sludent', 'studqent', 'sctudent', 'ktudent', 'stuodent', 'bstudent', 'studmnt', 'studeny', 'srtudent', 'studenot', 'studkent', 'studlent', 'studend', 'studenpt', 'steudent', 'stuhdent', 'studyent', 'stndent', 'sqtudent', 'stjudent', 'stuedent', 'studqnt', 'saudent', 'studeut', 'studvnt', 'stucdent', 'sttdent', 'stuoent', 'studeno', 'swudent', 'rstudent', 'studegnt', 'gtudent', 'sktudent', 'ystudent', 'qstudent', 'ntudent', 'xtudent', 'stuadent', 'studejnt', 'stukent', 'studennt', 'stcdent', 'stbdent', 'studentk', 'sstudent', 'stoudent', 'studext', 'studenxt', 'stuident', 'studcnt', 'stuudent', 'studento', 'hstudent', 'studenk', 'studnet', 'studengt', 'stfdent', 'sgudent', 'studrent', 'studens', 'studwent', 'stiudent', 'stzdent', 'ptudent', 'studejt', 'stuqent', 'stugent', 'studnnt', 'stumdent', 'studentr', 'studena', 'vstudent', 'studsnt', 'stuyent', 'styudent', 'shudent', 'studelt', 'stuvent', 'studeint', 'studenct', 'studgent', 'sthden

In [28]:
def edit2(word):
  return set(e2 for e1 in edit1(word) for e2 in edit1(e1))

In [29]:
print(edit2("student"))

{'dtjudent', 'socdent', 'stxudentv', 'stavent', 'fstudvent', 'stvoent', 'studenstp', 'scudeznt', 'stbdenl', 'zstuqdent', 'studyxent', 'studevntp', 'qtuwent', 'ztudenit', 'studdnq', 'rtudenj', 'sbudert', 'stuudint', 'studednzt', 'sstudjnt', 'gtudennt', 'stukdvent', 'stpudwnt', 'stfnudent', 'studrenh', 'dtident', 'ishtudent', 'stcxudent', 'studpeng', 'studwen', 'sxudtent', 'stujentd', 'studetft', 'ghtudent', 'studeinti', 'sftudetnt', 'stqudeot', 'wtudenqt', 'stuocnt', 'sdtudebnt', 'stupnnt', 'sptyudent', 'studkentz', 'stuheut', 'sktudennt', 'stadenjt', 'stuejnt', 'stuwen', 'stqgudent', 'stusdejt', 'stuaknt', 'studxnnt', 'studenhxt', 'zshtudent', 'dtudenst', 'wtudeft', 'sttudekt', 'hstujent', 'szudvent', 'etudentv', 'pstudentn', 'sjtudenty', 'scudento', 'studensk', 'stuknt', 'studoejnt', 'zstudeni', 'sytudvnt', 'stvdentb', 'tsktudent', 'stldenv', 'stxdentj', 'satudwent', 'studelw', 'studenlbt', 'stndenm', 'soqtudent', 'stuzernt', 'etmdent', 'sbtudynt', 'studiept', 'studxefnt', 'stulpnt', 
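
To get a feel for how quickly the candidate set grows, the sketch below counts the one-edit and two-edit candidates for a word and then intersects them with a vocabulary. For a word of length n, `edit1` generates at most 54n + 25 strings before duplicates are removed (n deletes, n - 1 swaps, 26n replaces, 26(n + 1) inserts). The toy `vocabulary` is an assumption for illustration, and the edit functions are restated compactly so the block is self-contained.

```python
import string

def edit1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def edit2(word):
    return set(e2 for e1 in edit1(word) for e2 in edit1(e1))

word = "student"
one, two = edit1(word), edit2(word)
print(len(one), len(two))   # sizes of the one-edit and two-edit candidate sets

# Toy vocabulary, assumed for illustration only.
vocabulary = {"student", "students", "studied", "gender"}
print(sorted(two & vocabulary))   # only real words survive the filter
```

Almost all generated strings are not words, which is why filtering against the vocabulary comes before ranking by probability.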

In [30]:
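def correct_spelling(word, vocabulary, word_probabilities):
  if word in vocabulary:
    print(f"{word} is already correctly spelt")
    return

  # Filter each edit level against the vocabulary before falling back:
  # edit1(word) is never empty, so an unfiltered `or` chain never reaches edit2.
  one_edit = [w for w in edit1(word) if w in vocabulary]
  best_guesses = one_edit or [w for w in edit2(word) if w in vocabulary]
  return [(w, word_probabilities[w]) for w in best_guesses]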

In [32]:
word = "geder"
corrections = correct_spelling(word, vocabs, word_probas)

if corrections:
  print(corrections)
  probs = np.array([c[1] for c in corrections])
  best_ix = np.argmax(probs)
  correct = corrections[best_ix][0]
  print(f"{correct} is suggested for {word}")

[('gender', 0.005511328842620943)]
gender is suggested for geder
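
Picking the most likely candidate does not strictly need NumPy. As an alternative sketch, Python's built-in `max` with a `key` function selects the pair with the highest probability. The candidate list below is hypothetical; it only mimics the `(word, probability)` shape returned by `correct_spelling`.

```python
# Hypothetical candidates in the same (word, probability) shape as above.
corrections = [("gender", 0.0055), ("elder", 0.0003), ("geer", 0.0001)]

# max() compares the pairs by their second element, the probability.
best_word, best_prob = max(corrections, key=lambda pair: pair[1])
print(f"{best_word} is suggested (p = {best_prob})")
```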


In [35]:
class SpellChecker(object):

  def __init__(self, corpus_file_path):
    with open(corpus_file_path, "r") as file:
      lines = file.readlines()
      words = []
      for line in lines:
        words += re.findall(r'\w+', line.lower())

    self.vocabs = set(words)
    self.word_counts = Counter(words)
    total_words = float(sum(self.word_counts.values()))
    self.word_probas = {word: self.word_counts[word] / total_words for word in self.vocabs}

  def _level_one_edits(self, word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l,r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r)>1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]

    return set(deletes + swaps + replaces + inserts)

  def _level_two_edits(self, word):
    return set(e2 for e1 in self._level_one_edits(word) for e2 in self._level_one_edits(e1))

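  def check(self, word):
    # Keep only one-edit candidates that are real words; fall back to two-edit
    # candidates only when none exist. The unfiltered one-edit set is never
    # empty, so an `or` chain on the raw sets would never reach the second level.
    valid_candidates = [w for w in self._level_one_edits(word) if w in self.vocabs]
    if not valid_candidates:
      valid_candidates = [w for w in self._level_two_edits(word) if w in self.vocabs]
    return sorted([(c, self.word_probas[c]) for c in valid_candidates], key=lambda tup: tup[1], reverse=True)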


In [48]:
import PyPDF2

def convert_pdf_to_text(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in range(len(reader.pages)):
            page_obj = reader.pages[page]
            text += page_obj.extract_text()
    return text

pdf_file_path = "/content/drive/MyDrive/text_file.pdf"
text_content = convert_pdf_to_text(pdf_file_path)
print(text_content)

Effects of Age and Gender on Blogging
Jonathan Schler1Moshe Koppel1Shlomo Argamon2James Pennebaker3
1
Dept. of Computer Science, Bar-Ilan University, Ramat Gan 52900,Israel  2
Linguistic Cognition Lab, Dept. of Computer Science Illinois 
Institute of Technology, Chicago, IL 606163
Dept. of Psychology, The University of Texas, Austin, TX 78712
schlerj@cs.biu.ac.il, koppel@cs.biu.ac.il, argamon@iit.edu, pennebaker@mail.utexas.edu
Abstract
Analysis of a corpus of tens of thousands of blogs –
incorporating close to 300 million words – indicates 
significant differences in writing style and content between 
male and female bloggers as well as among authors of 
different ages. Such differences can be exploited to 
determine an unknown author’s age and gender on the basis 
of a blog’s vocabulary.
Introduction
The increasing popularity of publicly accessible blogs 
offers an unprecedented opportunity to harvest information 
from texts authored by hundreds of thousands of different 
authors. Co

In [51]:
pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.7.2-py3-none-any.whl (3.4 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.7.2


In [58]:
import PyPDF2
from spellchecker import SpellChecker

def convert_pdf_to_text(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return text

def check_spelling(text):
    spell = SpellChecker()
    words = text.split()
    # unknown() returns the words that are not in the library's dictionary
    misspelled = spell.unknown(words)
    corrected_text = ""
    for word in words:
        if word in misspelled:
            # Keep the original word when no correction is found
            correction = spell.correction(word)
            corrected_text += (correction if correction is not None else word) + " "
        else:
            corrected_text += word + " "
    return corrected_text

pdf_file_path = "/content/drive/MyDrive/text_file.pdf"
text_content = convert_pdf_to_text(pdf_file_path)

corrected_text = check_spelling(text_content)
print(corrected_text)


Effects of Age and Gender on Blogging Jonathan Schler1Moshe Koppel1Shlomo Argamon2James Pennebaker3 1 Dept. of Computer Science, Bar-Ilan University, Ramat Gan 52900,Israel 2 Linguistic Cognition Lab, Dept. of Computer Science Illinois Institute of Technology, Chicago, IL 606163 Dept. of Psychology, The University of Texas, Austin, TX 78712 schlerj@cs.biu.ac.il, koppel@cs.biu.ac.il, argamon@iit.edu, pennebaker@mail.utexas.edu Abstract Analysis of a corpus of tens of thousands of blogs i incorporating close to 300 million words i indicates significant differences in writing style and content between male and female bloggers as well as among authors of different ages Such differences can be exploited to determine an unknown authors age and gender on the basis of a blogs vocabulary Introduction The increasing popularity of publicly accessible blogs offers an unprecedented opportunity to harvest information from texts authored by hundreds of thousands of different authors Conveniently, man
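
In the output above, some punctuation has disappeared ("ages." became "ages") and the dash "–" was turned into "i". A likely cause is that `text.split()` leaves punctuation attached to the tokens, so the checker treats "ages." or "–" as a misspelled word. One way to avoid this, sketched below, is to peel leading and trailing punctuation off each token, correct only purely alphabetic cores, and then reattach the punctuation. The names `correct_text` and `TOKEN_RE` and the toy corrector are assumptions; with pyspellchecker, `SpellChecker().correction` could be passed as `correct_word`.

```python
import re

# Leading non-word characters, a core, then trailing non-word characters.
TOKEN_RE = re.compile(r"^(\W*)(.*?)(\W*)$", re.DOTALL)

def correct_text(text, correct_word):
    """Correct alphabetic tokens while preserving surrounding punctuation.

    correct_word is any callable that returns a corrected word or None.
    """
    out = []
    for token in text.split():
        lead, core, trail = TOKEN_RE.match(token).groups()
        # Skip numbers, e-mail addresses, and pure punctuation tokens.
        if core.isalpha():
            fixed = correct_word(core)
            core = fixed if fixed is not None else core
        out.append(lead + core + trail)
    return " ".join(out)

# Toy corrector, assumed for illustration: it knows a single misspelling.
toy_corrections = {"geder": "gender"}
print(correct_text("age and geder. – (geder)", toy_corrections.get))
```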

In [60]:
from spellchecker import SpellChecker

spell = SpellChecker()
word = "sentence"

# Check the spelling of the word
corrections = spell.correction(word)

# Print the corrected word
print(corrections)

sentence
