# Correct spelling
SymSpell is the [original implementation](https://github.com/wolfgarbe/SymSpell), but I am using a simple [Python port](https://github.com/mammothb/symspellpy), symspellpy.

In [10]:
from symspellpy.symspellpy import SymSpell

In [11]:
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 0
prefix_length = 7
# create object
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
# load dictionary
dictionary_path = "frequency_dictionary_en_82_765.txt"
term_index = 0  # column of the term in the dictionary text file
count_index = 1  # column of the term frequency in the dictionary text file
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

In [14]:
input_term = "I'd like toknowhow I'd done that!"  # alternative: "thequickbrownfoxjumpsoverthelazydog"

result = sym_spell.word_segmentation(input_term)
# display the segmented string, the summed edit distance, and the summed log probability
print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
                          result.log_prob_sum))

I'd like to know how I'd done that !, 10, -58.50888790936667
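
To make the idea behind word segmentation concrete, here is a minimal sketch of frequency-based segmentation using dynamic programming. It is not SymSpell's actual algorithm (which also corrects spelling while it segments). The `WORD_COUNTS` table, the unknown-word penalty, and the function names are my own assumptions for illustration.

```python
import math

# Toy unigram counts (assumed values); a real run would use the full frequency dictionary.
WORD_COUNTS = {"i": 500, "like": 200, "to": 800, "know": 150, "how": 180,
               "the": 1000, "quick": 40, "brown": 30, "fox": 20}
TOTAL = sum(WORD_COUNTS.values())


def word_log_prob(word):
    """Log-probability of a known word, or a length-scaled penalty for an unknown one."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text, max_word_len=20):
    """Split `text` (no spaces) into the most probable sequence of words."""
    # best[i] holds (log-probability, words) for the best split of text[:i]
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            prev_score, prev_words = best[start]
            word = text[start:end]
            candidates.append((prev_score + word_log_prob(word), prev_words + [word]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


print(" ".join(segment("toknowhow")))  # to know how
```

Every unknown chunk is penalised heavily, so a split made entirely of dictionary words wins whenever one exists.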


# Removing contractions:

I will be using the [pycontractions](https://pypi.org/project/pycontractions/) library, which expands contractions in three passes:
* First, contractions that have only a single expansion rule are replaced directly.
* Second, if any remaining contractions have multiple possible expansions, every combination of rules is applied to produce all candidate texts.
* Third, each candidate is passed through a grammar checker, and the Word Mover’s Distance (WMD) between it and the original text is calculated. The candidates are then sorted by fewest grammatical errors and shortest distance from the original text, and the top candidate is returned as the expanded form.
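
The three passes can be sketched roughly as follows. This is only an illustration of the control flow, not the pycontractions implementation. The rule tables are hypothetical, and `score` is a caller-supplied stand-in for the real ranking, which would return something like a (grammar error count, WMD) pair.

```python
import itertools
import re

# Hypothetical rule tables for illustration; pycontractions ships its own.
SINGLE_RULES = {"don't": "do not", "we're": "we are", "i'll": "I will"}
AMBIGUOUS_RULES = {"i'd": ["I would", "I had"]}


def expand_simple(text):
    """Pass 1: replace contractions that have exactly one expansion."""
    for contraction, expansion in SINGLE_RULES.items():
        pattern = r"\b" + re.escape(contraction) + r"\b"
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    return text


def hypotheses(text):
    """Pass 2: yield every combination of expansions for ambiguous contractions."""
    alternatives = "|".join(re.escape(c) for c in AMBIGUOUS_RULES)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    matches = list(pattern.finditer(text))
    options = [AMBIGUOUS_RULES[m.group(0).lower()] for m in matches]
    for choice in itertools.product(*options):
        pieces, last = [], 0
        for match, expansion in zip(matches, choice):
            pieces.append(text[last:match.start()])
            pieces.append(expansion)
            last = match.end()
        pieces.append(text[last:])
        yield "".join(pieces)


def expand(text, score):
    """Pass 3: return the hypothesis with the lowest score."""
    return min(hypotheses(expand_simple(text)), key=score)


candidates = list(hypotheses(expand_simple("I'd like to know how I'd done that!")))
print(len(candidates))  # 4
```

Two ambiguous contractions with two rules each give four candidate texts, which is why the ranking step matters: only one of them ("I would like to know how I had done that!") reads naturally.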

In [1]:
from pycontractions import Contractions

In [7]:
# Load your favorite semantic vector model in gensim keyedvectors format from disk
# cont = Contractions('GoogleNews-vectors-negative300.bin')
# or specify any model from the gensim.downloader api
cont = Contractions(api_key='glove-wiki-gigaword-50')
# optional, prevents loading on first expand_texts call
cont.load_models()





In [8]:
list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the zoo and she'll be home for dinner."]))

['I had like to know how I had done that!',
 'we are going to the zoo and I do not think I will be home for dinner.',
 'they are going to the zoo and she will be home for dinner.']

In [9]:
list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the zoo and she'll be home for dinner."], precise=True))

['I would like to know how I had done that!',
 'we are going to the zoo and I do not think I will be home for dinner.',
 'they are going to the zoo and she will be home for dinner.']

## Regular Expressions to clean text

In [1]:
import re

In [None]:
# Raw strings keep regex escapes such as \d and \s from being read as invalid string escapes.
email_pattern = re.compile(r'"?([-a-zA-Z0-9.`?{}]+@\w+\.\w+)"?')
dollar_and_decimals_pattern = re.compile(r'(\$[-\d]*\.*\d+)|(\d*\.\d+)')
us_phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}|\(\d{3}\)\d{3}-\d{4}')
date_pattern = re.compile(r'\s+\d{1,4}[/-]\d{1,2}[/-]*\d{0,4}|\s*\d{1,4}[/-]\d{1,2}[/-]*\d{0,4}\s+|\d{2}-\D{3}-\d{2,4}')
add_space_around_punct_patern = re.compile(r'([\[.,!?():;\]])')
remove_multiple_space_pattern = re.compile(r'\s{2,}|\t')
split_sentence_pattern = re.compile(r'[!&\.;?]|\*{2,}|\-{2,}|/{2,}|,[\s\w]{25,},')
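
As a quick usage sketch, three of the patterns can be chained into a small cleaning function. The `clean` helper and the `<EMAIL>` placeholder token are my own assumptions for illustration; the patterns are redefined here so the snippet runs on its own.

```python
import re

# Patterns copied from the cell above so this snippet is self-contained.
email_pattern = re.compile(r'"?([-a-zA-Z0-9.`?{}]+@\w+\.\w+)"?')
add_space_around_punct_patern = re.compile(r'([\[.,!?():;\]])')
remove_multiple_space_pattern = re.compile(r'\s{2,}|\t')


def clean(text):
    """Mask e-mail addresses, pad punctuation with spaces, and collapse whitespace."""
    # Mask e-mails first, so the dots inside an address are not padded with spaces.
    text = email_pattern.sub('<EMAIL>', text)
    text = add_space_around_punct_patern.sub(r' \1 ', text)
    text = remove_multiple_space_pattern.sub(' ', text)
    return text.strip()


print(clean("Mail me at jane.doe@example.com,   thanks!"))
# Mail me at <EMAIL> , thanks !
```

The order matters: padding punctuation before masking would break addresses apart at their dots, and collapsing whitespace last cleans up the extra spaces the padding step introduces.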