Stemming and lemmatization can be viewed as a kind of linguistic compression. In the same sense, word replacement can be viewed as text normalization or error correction.

Why do we need word replacement? Tokenizers often handle contractions (such as can’t and won’t) poorly, splitting them into fragments that are awkward to process. Word replacement addresses this: we replace each contraction with its expanded form before tokenizing.

### Word replacement using regular expression

First, import the necessary packages:

In [2]:
import re
from nltk.corpus import wordnet

Next, define the replacement patterns of your choice as follows:

In [4]:
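R_patterns = [
   # Specific contractions come first so the generic n't rule cannot mangle them
   (r"won't", r'will not'),
   (r"can't", r'cannot'),
   (r"i'm", r'i am'),
   # Raw strings keep \g<1> from being read as an invalid escape sequence
   (r"(\w+)'ll", r'\g<1> will'),
   (r"(\w+)n't", r'\g<1> not'),
   (r"(\w+)'ve", r'\g<1> have'),
   # Note: this rule also rewrites possessives ("John's" becomes "John is")
   (r"(\w+)'s", r'\g<1> is'),
   (r"(\w+)'re", r'\g<1> are'),
]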

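Then define a class that compiles the patterns once and applies them in order:

In [7]:
class REReplacer:
   """Expand contractions by applying regex replacement pairs in order."""
   def __init__(self, patterns=R_patterns):
      # Compile each pattern once so repeated calls to replace() stay cheap
      self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

   def replace(self, text):
      s = text
      for (pattern, repl) in self.patterns:
         s = pattern.sub(repl, s)
      return s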

In [8]:
rep_word = REReplacer()
rep_word.replace("I won't do it")

'I will not do it'
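
The patterns above are case-sensitive, so a capitalized "I'm" or a sentence-initial "Won't" passes through unchanged. The following is a minimal sketch of one way to handle that, assuming case-insensitive matching with word boundaries is acceptable for your text. The names `PATTERNS`, `_keep_case`, and `expand_contractions` are hypothetical and not part of the original code, and only a subset of the patterns is shown.

```python
import re

# Hypothetical subset of the replacement patterns, anchored on word boundaries.
PATTERNS = [
    (r"\bwon't\b", r"will not"),
    (r"\bcan't\b", r"cannot"),
    (r"\bi'm\b", r"i am"),
    (r"\b(\w+)n't\b", r"\g<1> not"),
]


def _keep_case(template):
    """Build a replacement callback that capitalizes the expansion
    whenever the matched contraction started with an uppercase letter."""
    def _sub(match):
        expanded = match.expand(template)  # resolves \g<1> style references
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _sub


def expand_contractions(text, patterns=PATTERNS):
    """Apply each pattern in order, ignoring case while matching."""
    for regex, template in patterns:
        text = re.sub(regex, _keep_case(template), text, flags=re.IGNORECASE)
    return text


print(expand_contractions("Won't you stay? I'm sure they don't mind."))
```

Keeping the specific patterns ahead of the generic `n't` rule matters here as well: if the generic rule ran first, "won't" would become "wo not".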

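### Replacement before text processing

Contraction replacement is most useful as a preprocessing step. NLTK's `word_tokenize` splits "won't" into the fragments `wo` and `n't`; expanding the contraction first yields ordinary words instead.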

In [9]:
from nltk.tokenize import word_tokenize

In [10]:
rep_word = REReplacer()
word_tokenize("I won't be able to do this now")

['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']

In [11]:
word_tokenize(rep_word.replace("I won't be able to do this now"))

['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']

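### Removal of repeating characters

Informal text often stretches words by repeating letters. The class below strips one repeated character per pass and stops as soon as WordNet recognizes the word, so legitimate double letters (as in "Hello") survive.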

In [19]:
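class Rep_word_removal(object):
    """Collapse repeated characters until WordNet recognizes the word."""

    def __init__(self):
        # Matches any character that is immediately followed by a copy of itself
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keeps the surrounding text but drops one of the two copies
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop once the word is a known WordNet entry (e.g. "Hello")
        if wordnet.synsets(word):
            return word

        # Remove one repeated character, then recurse until nothing changes
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word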

In [20]:
rep_word = Rep_word_removal()
rep_word.replace ("Hiiiiiiiiiiiiiiiiiiiii")

'Hi'

In [21]:
rep_word.replace("Hellooooooooooooooo")

'Hello'
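
To see why the dictionary check matters, here is a minimal sketch of the same loop that swaps WordNet for a plain Python set, so it runs without any NLTK data. The function `remove_repeats` and the vocabulary `KNOWN` are hypothetical stand-ins, not part of the original code.

```python
import re

# Same idea as the class above: find a character followed by a copy of itself.
REPEAT_RE = re.compile(r"(\w*)(\w)\2(\w*)")


def remove_repeats(word, known_words):
    """Drop one repeated character per pass until the word is in
    known_words or no doubled characters remain."""
    while word.lower() not in known_words:
        shorter = REPEAT_RE.sub(r"\1\2\3", word)
        if shorter == word:
            break  # nothing left to collapse
        word = shorter
    return word


# Hypothetical vocabulary standing in for the WordNet lookup.
KNOWN = {"hi", "hello", "good"}

print(remove_repeats("Hellooooo", KNOWN))  # stops at a known word
print(remove_repeats("goooood", KNOWN))    # keeps the legitimate "oo"
print(remove_repeats("goooood", set()))    # no vocabulary: over-collapses
```

With an empty vocabulary the loop keeps collapsing until no doubled letters remain, turning "goooood" into "god". The lookup is what lets the loop stop at a real word.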