***
# <center>***Replacing and Correcting Words***</center>
***

## ***I learned the following natural language processing techniques:***

* **Text Normalization:**
    * [Stemming words](#stemming) 
    * [Lemmatizing words with WordNet](#lemmatization)
    * [Replacing words matching regular expressions](#regex-replacement)
    * [Removing repeating characters](#character-removal)
    * [Spelling correction with Enchant](#spelling-correction)
    * [Replacing synonyms](#synonym-replacement)
    * [Replacing negations with antonyms](#negation-replacement) 



***
## ***<a id="stemming"></a>Stemming words:***
***

**Stemming** is a technique for removing affixes from a word, leaving only the stem. For example, the stem of *cooking* is *cook*, and a good stemming algorithm knows that the *ing* suffix can be removed. **Stemming** is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.
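
To make the indexing idea concrete, here is a minimal sketch of an inverted index keyed by stems. The `naive_stem` helper, the `build_index` function, and the sample documents are hypothetical stand-ins, far cruder than a real stemming algorithm. They only illustrate how several word forms collapse into a single index entry.

```python
from collections import defaultdict

def naive_stem(word):
    """Hypothetical toy stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[naive_stem(token)].add(doc_id)
    return index

docs = {
    1: "cooking pasta",
    2: "she cooks daily",
    3: "he cooked rice",
}
index = build_index(docs)

# All three word forms share the single index entry 'cook'
print(sorted(index["cook"]))  # [1, 2, 3]
```

A query for *cooks* would be stemmed the same way before the lookup, so it would match all three documents.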

One of the most common stemming algorithms is the **Porter stemming algorithm** by Martin Porter. It is designed to remove and replace well-known suffixes of English words. NLTK comes with an implementation of the Porter stemming algorithm, which is very easy to use. Simply instantiate the `PorterStemmer` class and call the `stem()` method with the word you want to stem:

In [1]:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('cooking')
 

'cook'

In [7]:

stemmer.stem('runing')


'rune'

In [8]:

stemmer.stem('booking')


'book'

In [10]:

words = ["running", "jumps", "happily", "historical"]
# Stemming the words
stems = [stemmer.stem(word) for word in words]
# Output the stems
print(stems)


['run', 'jump', 'happili', 'histor']


The `PorterStemmer` class knows a number of regular word forms and suffixes and uses this knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, which has the same root meaning.

There are other stemming algorithms out there besides the Porter stemming algorithm, such as the **Lancaster stemming algorithm**, developed at Lancaster University. NLTK includes it as the LancasterStemmer class.

All the stemmers covered next inherit from the StemmerI interface, which defines the stem() method:

* **StemmerI (stem())**
  - PorterStemmer()
  - RegexpStemmer()
  - SnowballStemmer()
  - LancasterStemmer()

**The LancasterStemmer class:**

The `LancasterStemmer` class is used just like the `PorterStemmer` class, but it can produce slightly different results. It is known to be slightly more aggressive than `PorterStemmer`:

In [11]:

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
stemmer.stem('cooking')


'cook'

**The RegexpStemmer class**

It takes a single regular expression (either compiled or as a string) and removes every part of the word that matches it, so the expression is normally written to target a prefix or suffix. A `RegexpStemmer` class should only be used in very specific cases that are not covered by the `PorterStemmer` or the `LancasterStemmer` class, because it can only handle very specific patterns and is not a general-purpose algorithm.

In [14]:

from nltk.stem import RegexpStemmer
stemmer = RegexpStemmer('ing')
print(stemmer.stem('cooking'))
print(stemmer.stem('cookery'))
print(stemmer.stem('ingleside'))


cook
cookery
leside


In [15]:

stemmer = RegexpStemmer('oo')
print(stemmer.stem('cooking'))
print(stemmer.stem('cookery'))
print(stemmer.stem('ingleside'))


cking
ckery
ingleside
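
The outputs above show that the pattern is removed wherever it occurs, not only at the ends of a word. The sketch below uses plain `re.sub()` to imitate that behavior (it is an illustration, not NLTK's code) and shows how anchoring the pattern with `$` restricts removal to a true suffix. The `strip_pattern` helper is a hypothetical name.

```python
import re

def strip_pattern(pattern, word):
    """Remove every match of pattern from word (illustrative helper)."""
    return re.sub(pattern, "", word)

# Unanchored: 'ing' is removed from the start of 'ingleside' as well
print(strip_pattern(r"ing", "cooking"))     # cook
print(strip_pattern(r"ing", "ingleside"))   # leside

# Anchored with $: only a trailing 'ing' is removed
print(strip_pattern(r"ing$", "cooking"))    # cook
print(strip_pattern(r"ing$", "ingleside"))  # ingleside
```

Assuming `RegexpStemmer` applies the expression in the same way, passing an anchored pattern such as `'ing$'` should leave `ingleside` untouched.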


**The SnowballStemmer class**

The **SnowballStemmer** class supports more than a dozen non-English languages. It also provides two English stemmers: the original Porter algorithm as well as the newer English stemming algorithm. To use the `SnowballStemmer` class, create an instance with the name of the language you are using and then call the `stem()` method. The supported language names are available in `SnowballStemmer.languages`, and the example below uses the Spanish `SnowballStemmer` class:

In [27]:

from nltk.stem import SnowballStemmer
languages = SnowballStemmer.languages
spanish_stemmer = SnowballStemmer('spanish')
spanish_stemmer.stem('hola')


'hol'

***
## ***<a id="lemmatization"></a>Lemmatizing words with WordNet:*** 
***

**Lemmatization** is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word that means the same thing. However, the word you end up with can be completely different.

In [28]:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('cooking')


'cooking'

In [29]:

lemmatizer.lemmatize('cooking', pos='v')


'cook'

In [30]:

lemmatizer.lemmatize('cookbooks')


'cookbook'

The **WordNetLemmatizer** class is a thin wrapper around the WordNet corpus and uses the `morphy()` function of the **WordNetCorpusReader** class to find a lemma. If no lemma is found, or the word itself is a lemma, the word is returned as is. Unlike with stemming, knowing the part of speech of the word is important. As demonstrated previously, `cooking` does not return a different lemma unless you specify that the POS is a verb. This is because the default POS is a noun, and as a noun, `cooking` is its own lemma. On the other hand, `cookbooks` is a noun with its singular form, `cookbook`, as its lemma.
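
The fallback behavior and the role of the POS argument can be pictured with a toy lookup. This is only a sketch: the `LEMMAS` table and the `toy_lemmatize` function are hypothetical and hold two hand-written entries, whereas WordNet finds lemmas through its own morphology rules and data.

```python
# Hypothetical hand-written table: (word, part of speech) -> lemma
LEMMAS = {
    ("cooking", "v"): "cook",
    ("cookbooks", "n"): "cookbook",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if none is known."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("cooking"))           # cooking (the noun lookup finds nothing)
print(toy_lemmatize("cooking", pos="v"))  # cook
print(toy_lemmatize("cookbooks"))         # cookbook
```

The default `pos="n"` mirrors the noun default described above, which is why the first call returns the word unchanged.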

In [37]:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('believes')


'believ'

In [38]:

lemmatizer.lemmatize('believes')


'belief'

**Combining stemming with lemmatization**

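Stemming and lemmatization can be combined to compress words more than either process can by itself. These cases are somewhat rare, but they do exist. In the example below, stemming *buses* gives *buse* and lemmatizing it gives *bus*, while stemming that lemma shortens it further to *bu*: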

In [40]:

stemmer.stem('buses')


'buse'

In [41]:

lemmatizer.lemmatize('buses')


'bus'

In [42]:

stemmer.stem('bus')


'bu'

***
## ***<a id="regex-replacement"></a>Replacing words matching regular expressions:*** 
***

If **stemming** and **lemmatization** are a kind of linguistic compression, then word replacement can be thought of as error correction or text normalization.

In this phase, we will replace words based on regular expressions, with a focus on expanding contractions. Contractions are awkward for later processing steps such as tokenization, where many tokenizers split them into fragments such as *ca* and *n't*. This technique avoids the problem by replacing contractions with their expanded forms, for example, by replacing "won't" with "will not" or "would've" with "would have".

First, we need to define a number of replacement patterns. This will be a list of tuple pairs, where the first element is the pattern to match and the second element is the replacement.

Then, we will create a `RegexpReplacer` class that will compile the patterns and provide a `replace()` method to substitute all the found patterns with their replacements.

In [54]:

import re

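replacement_patterns = [
    # Specific contractions come first so the general rules below do not catch them
    (r"won't", 'will not'),
    (r"can't", 'can not'),
    (r"i'm", 'i am'),
    (r"ain't", 'is not'),
    # General rules; raw strings keep \g<1> from being read as a string escape
    (r"(\w+)'ll", r'\g<1> will'),
    (r"(\w+)n't", r'\g<1> not'),
    (r"(\w+)'ve", r'\g<1> have'),
    (r"(\w+)'s", r'\g<1> is'),
    (r"(\w+)'re", r'\g<1> are'),
    (r"(\w+)'d", r'\g<1> would')
]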


In [57]:

class RegexpReplacer(object):
    
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
        

In [58]:

replacer = RegexpReplacer()


In [59]:

replacer.replace("you can't read it")


'you can not read it'

In [60]:

replacer.replace("I should've done that thing I didn't do")


'I should have done that thing I did not do'
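
The order of the patterns matters, because each substitution runs on the output of the previous one. A short sketch with two of the patterns shows what happens if the general `n't` rule runs before the specific `won't` rule. The `expand` helper and the two list names are hypothetical, used only for this illustration.

```python
import re

# The same two rules in both orders
specific_first = [(r"won't", "will not"), (r"(\w+)n't", r"\g<1> not")]
general_first = list(reversed(specific_first))

def expand(text, patterns):
    """Apply each (regex, replacement) pair in order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

print(expand("I won't go", specific_first))  # I will not go
print(expand("I won't go", general_first))   # I wo not go
```

This is why the irregular contractions sit at the top of `replacement_patterns`.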

***
## ***<a id="character-removal"></a>Removing repeating characters:*** 
***

In casual conversation, people often stray from strict grammar rules. They might write phrases like **I looooooove it** to emphasize their feelings. However, computers don't understand that **looooooove** is just an exaggerated form of **love** unless explicitly instructed. This section introduces a method for collapsing these repeated characters so that the result is closer to a proper English word.

We will create a class modeled after the `RegexpReplacer` class from the previous section. This class will feature a **replace()** method that takes a single word and removes repeated characters one at a time until none remain. Note that this also shortens legitimate double letters, as the *goose* example shows.

In [7]:

import re

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

replacer = RepeatReplacer()
words = ["looooooove", "heeeeelllloooooo", "yesssssss", 'goose','oooooooooooooooooooooooooooooh']

# Removing repetitive characters
corrected_words = [replacer.replace(word) for word in words]

print(corrected_words)


['love', 'helo', 'yes', 'gose', 'oh']
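
The output shows the cost of collapsing every repeat: `goose` becomes `gose`. One possible guard, sketched here under stated assumptions, is to stop as soon as the word is recognized. The `GuardedRepeatReplacer` class and its `known_words` set are hypothetical; the set stands in for a real dictionary lookup.

```python
import re

class GuardedRepeatReplacer:
    """Collapse repeated characters, stopping once the word is recognized."""

    def __init__(self, known_words):
        # known_words is a hypothetical stand-in for a real dictionary lookup
        self.known_words = set(known_words)
        self.repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
        self.repl = r"\1\2\3"

    def replace(self, word):
        if word in self.known_words:
            return word  # already a valid word, leave it alone
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        return repl_word

guarded = GuardedRepeatReplacer({"love", "goose", "yes", "oh"})
print([guarded.replace(w) for w in ["looooooove", "goose", "yesssssss"]])
# ['love', 'goose', 'yes']
```

The guard only helps when the intended spelling shows up as an intermediate form. Because the regular expression removes the rightmost repeat first, `heeeeelllloooooo` still ends up as `helo` even if `hello` is in the set.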


***
## ***<a id="spelling-correction"></a>Spelling correction with Enchant:*** 
***

Removing repetitive characters is an intense form of spelling correction. In this section, we will tackle the less drastic task of fixing minor spelling errors using **Enchant** (a spelling correction API).

We first correct a sample sentence word by word, taking Enchant's first suggestion for each misspelled word. Then we wrap the same idea in a class named **SpellingReplacer**. Its **replace()** method uses Enchant to check whether the word is valid and, if it is not, accepts Enchant's first suggestion only when **nltk.metrics.edit_distance()** shows that it is close enough to the original word:

In [9]:

import enchant

# Initialize English dictionary
dictionary = enchant.Dict("en_US")

# Sample text
text = "I am lernin NLP with corect speling"

# Split the text into words
words = text.split()

# Correct each word if misspelled
corrected_words = []
for word in words:
    if dictionary.check(word):  # If the word is spelled correctly
        corrected_words.append(word)
    else:
        # Suggest the most probable correction
        suggestions = dictionary.suggest(word)
        corrected_words.append(suggestions[0] if suggestions else word)  # Use the first suggestion or keep the word

# Combine the corrected words into a sentence
corrected_text = ' '.join(corrected_words)
print("Original:", text)
print("Corrected:", corrected_text)


Original: I am lernin NLP with corect speling
Corrected: I am Lerner NIP with correct spieling


In [12]:

from nltk.metrics import edit_distance

class SpellingReplacer:
    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

# Create an instance of the replacer
replacer = SpellingReplacer()

# Replace misspelled words
print(replacer.replace('helloo'))  # Output: hello


hello


The `SpellingReplacer` class starts by creating a reference to an Enchant dictionary. Then, in the **replace()** method, it first checks whether the given word is present in the dictionary. If it is, no spelling correction is necessary and the word is returned. If the word is not found, it looks up a list of suggestions and returns the first suggestion, as long as its edit distance is less than or equal to max_dist. The edit distance is the number of character changes necessary to transform the given word into the suggested word. The max_dist value then acts as a constraint on the Enchant suggest function to ensure that no unlikely replacement words are returned. Here is an example showing all the suggestions for languege, a misspelling of language:

In [13]:

d = enchant.Dict('en')
d.suggest('languege')


['language', 'langue', 'Angela']

Of these, only the correct suggestion, language, is a single edit away from languege; the other suggestions are further away. The `edit_distance()` function computes such distances directly:

In [14]:

print(edit_distance('language', 'languege'))
print(edit_distance('language', 'languo'))


1
3
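
To show what "number of character changes" means, here is a plain dynamic-programming sketch of the standard Levenshtein distance, counting insertions, deletions, and substitutions at a cost of one each. The `levenshtein` function is a hypothetical name written for this note. It is not NLTK's implementation, although it should agree with the two results above.

```python
def levenshtein(a, b):
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("language", "languege"))  # 1
print(levenshtein("language", "languo"))    # 3
```

With `max_dist=2`, a suggestion at distance 3 would be rejected and the original word returned unchanged.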


In [15]:

enchant.list_languages()


['en_BW',
 'en_AU',
 'en_BZ',
 'en_GB',
 'en_JM',
 'en_DK',
 'en_HK',
 'en_GH',
 'en_US',
 'en_ZA',
 'en_ZW',
 'en_SG',
 'en_NZ',
 'en_BS',
 'en_AG',
 'en_PH',
 'en_IE',
 'en_NA',
 'en_TT',
 'en_IN',
 'en_NG',
 'en_CA']

The `en_US` dictionary can give you different results than `en_GB`, such as for the word theater. The word theater is the American English spelling whereas the British English spelling is theatre:

In [18]:

en_US = enchant.Dict('en_US')
print(en_US.check('theater'))

en_GB = enchant.Dict('en_GB')
print(en_GB.check('theater'))


True
False


In [23]:

us_replacer = SpellingReplacer('en_US')
print(us_replacer.replace('theater'))

gb_replacer = SpellingReplacer('en_GB')
print(gb_replacer.replace('theater'))


theater
theatre


***
## ***<a id="synonym-replacement"></a>Replacing synonyms:*** 
***

**Synonym replacement** is a common technique in Natural Language Processing (NLP) to augment data, normalize text, or modify input for various tasks. Below is an approach to replacing words with their synonyms.

You will need a defined mapping of a word to its synonym. This is a simple controlled vocabulary. We will start by hardcoding the synonyms as a Python dictionary, and then explore other options to store synonym maps.

We will first create a `WordReplacer` class that takes a word replacement mapping:

In [26]:

class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map
    def replace(self, word):
        return self.word_map.get(word, word)
        

Then, we can demonstrate its usage for simple word replacement:

In [28]:

replacer = WordReplacer({'bday': 'birthday'})
replacer.replace('bday')


'birthday'

In [29]:

replacer.replace('happy')


'happy'

The `WordReplacer` class is simply a class wrapper around a Python dictionary. The `replace()` method looks up the given word in its **word_map dictionary** and returns the replacement synonym if it exists. Otherwise, the given word is returned as is.

If you were only using the **word_map dictionary**, you would not need the WordReplacer class and could instead call **word_map.get()** directly. However, **WordReplacer** can act as a base class for other classes that construct the **word_map dictionary** from various file formats.

We can use the **WordReplacer** class to perform any kind of word replacement, even spelling correction for more complicated words that can't be automatically corrected.
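
As one illustration of the base-class idea, the sketch below builds the word map from CSV rows. The `CsvWordReplacer` name, the two-column `word,replacement` layout, and the sample rows are assumptions made for this example. `WordReplacer` is repeated so that the block runs on its own.

```python
import csv
import io

class WordReplacer:
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)

class CsvWordReplacer(WordReplacer):
    """Build the word map from CSV rows of the form: word,replacement."""

    def __init__(self, csv_file):
        word_map = {}
        for row in csv.reader(csv_file):
            if len(row) >= 2:  # skip blank or malformed rows
                word_map[row[0]] = row[1]
        super().__init__(word_map)

# An in-memory file stands in for a real CSV file on disk
sample = io.StringIO("bday,birthday\nsixpack,abs\n")
csv_replacer = CsvWordReplacer(sample)
print(csv_replacer.replace("bday"))   # birthday
print(csv_replacer.replace("happy"))  # happy
```

Only the constructor changes; the inherited `replace()` method works exactly as before.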

***
## ***<a id="negation-replacement"></a>Replacing negations with antonyms:***
***

The opposite of synonym replacement is **antonym replacement**. An antonym is a word that has the opposite meaning of another word. This time, instead of creating custom word mappings, we can use **WordNet** to replace words with unambiguous antonyms.

Let's say you have a sentence like `let's not uglify our code`. With antonym replacement, you can replace `not uglify` with `beautify`, resulting in the sentence `let's beautify our code`. To do this, we will create an AntonymReplacer class:

In [30]:

from nltk.corpus import wordnet

class AntonymReplacer:
    def replace(self, word, pos=None):
        """Replace a word with its antonym if one exists."""
        antonyms = set()
        for syn in wordnet.synsets(word, pos=pos):
            for lemma in syn.lemmas():
                for antonym in lemma.antonyms():
                    antonyms.add(antonym.name())
        if len(antonyms) == 1:
            return antonyms.pop()  # Return the only antonym
        else:
            return None  # Return None if no antonym or multiple antonyms are found

    def replace_negations(self, sent):
        """
        Replace negation phrases (e.g., 'not happy') with their antonyms
        (e.g., 'unhappy').
        """
        i, l = 0, len(sent)
        words = []
        while i < l:
            word = sent[i]
            if word == 'not' and i + 1 < l:  # Check for 'not' and the next word
                ant = self.replace(sent[i + 1])
                if ant:
                    words.append(ant)  # Replace 'not word' with the antonym
                    i += 2  # Skip the next word
                    continue
            words.append(word)  # Add the word as is
            i += 1
        return words


In [36]:

replacer = AntonymReplacer()
replacer.replace('good')
replacer.replace('uglify')
sent = ["let's", 'not', 'uglify', 'our', 'code']
replacer.replace_negations(sent)


["let's", 'beautify', 'our', 'code']

In [37]:

replacer = AntonymReplacer()
sent = ['not', 'happy']
replacer.replace_negations(sent)


['unhappy']

The `AntonymReplacer` class has two methods: `replace()` and `replace_negations()`. The `replace()` method takes a single word and an optional part-of-speech tag, then looks up the Synsets for the word in WordNet. Going through all the Synsets and every lemma of each Synset, it creates a set of all antonyms found. If only one antonym is found, then it is an unambiguous replacement. If there is more than one antonym, which can happen quite often, then we don't know for sure which antonym is correct. In the case of multiple antonyms (or no antonyms), replace() returns None as it cannot make a decision.

In **replace_negations()**, we look through a tokenized sentence for the word `not`. If `not` is found, then we try to find an antonym for the next word using `replace()`. If we find an antonym, then it is appended to the list of words, replacing `not` and the original word. All other words are appended as is, resulting in a tokenized sentence with unambiguous negations replaced by their antonyms.
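
The token-walking logic can be tried without WordNet by swapping the antonym lookup for a plain dictionary. This is a minimal sketch: the standalone `replace_negations` function and the `toy_antonyms` mapping are hypothetical and stand in for the WordNet-based `replace()` method.

```python
def replace_negations(tokens, antonyms):
    """Replace 'not <word>' with the antonym of <word> when one is known."""
    result = []
    i = 0
    while i < len(tokens):
        # Look ahead one token, if there is one
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tokens[i] == "not" and nxt in antonyms:
            result.append(antonyms[nxt])
            i += 2  # consume both 'not' and the negated word
        else:
            result.append(tokens[i])
            i += 1
    return result

toy_antonyms = {"uglify": "beautify", "happy": "unhappy"}
print(replace_negations(["let's", "not", "uglify", "our", "code"], toy_antonyms))
# ["let's", 'beautify', 'our', 'code']
print(replace_negations(["do", "not", "run"], toy_antonyms))
# ['do', 'not', 'run']
```

When no antonym is known, `not` and the following word are both kept, which matches the behavior described above.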