# Table of Contents

1. Text Preprocessing Importance in NLP
2. Different Text Preprocessing Techniques
3. Converting to Lower case
4. Removal of HTML tags
5. Removal of URLs
6. Removing Numbers
7. Converting numbers to words
8. Apply spelling correction
9. Convert accented characters to ASCII characters
10. Converting chat conversation words to normal words
11. Expanding Contractions
12. Stemming
13. Lemmatization
14. Removal of Emojis
15. Removal of Emoticons
16. Converting Emojis to words
17. Converting Emoticons to words
18. Removing of Punctuations or Special Characters
19. Removing of Stopwords
20. Removing of Frequent words
21. Removing of Rare words
22. Removing single characters
23. Removing Extra Whitespaces

### Text Preprocessing Importance in NLP

As we said before, text preprocessing is the first step in the Natural Language Processing pipeline. Its importance keeps growing because the text we extract or collect from different sources is noisy and unclear.

Most text data is collected from reviews on e-commerce websites like Amazon or Flipkart, tweets from Twitter, comments from Facebook or Instagram, and other websites like Wikipedia.

In these comments and tweets, users write short forms, emojis, misspelled words, and so on.

We should not feed raw data to a model without preprocessing, because preprocessing the text directly improves the model's performance.

If we feed data without applying any text preprocessing techniques, the built models will not learn the real significance of the data. In some cases, the models will get confused and give random results.

In that confusion, the model learns harmful patterns that carry no value, and its performance drops significantly.

So we should remove this noise from the text and bring it into a clearer, more structured form before building models.

There is one thing to keep in mind here.

Text preprocessing techniques vary from problem to problem. We cannot blindly reuse the techniques chosen for one NLP problem on another NLP problem.

For example, in sentiment analysis classification problems, we can remove or ignore numbers within the text because numbers are not significant in this problem statement.

However, we should not ignore numbers when dealing with finance-related problems, because numbers play a key role in these kinds of problems.

So while applying text preprocessing techniques, we need to pay attention to the domain we are working in. The order in which the methods are applied also plays a key role.

Don't worry about the order of these techniques for now. We will give the generic order in which you need to apply these techniques.

Our suggestion is to first try the preprocessing techniques on a small subset of the data (take a few sentences randomly), so we can easily check whether the result is in the expected form. If it is, apply the techniques to the complete dataset; otherwise, change the order of the preprocessing techniques.

We will provide a Python file with a preprocess class containing all the preprocessing techniques at the end of this article.

You can download that file and import the class into your code. We get preprocessed text by calling the preprocess class with a list of sentences and the sequence of preprocessing techniques we need to use.

Again, the order of the techniques will differ from problem to problem.
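
The idea of trying an ordered sequence of techniques on a few random sentences can be sketched as follows. This is only a minimal illustration, not the preprocess class mentioned above: `apply_steps`, `lower_case`, `strip_extra_spaces`, and the sample sentences are hypothetical names chosen for this example.

```python
import random

def lower_case(text):
    # step 1: lower case conversion
    return text.lower()

def strip_extra_spaces(text):
    # step 2: collapse runs of whitespace into single spaces
    return ' '.join(text.split())

def apply_steps(sentences, steps):
    """Apply every step, in the given order, to each sentence."""
    results = []
    for sentence in sentences:
        for step in steps:
            sentence = step(sentence)
        results.append(sentence)
    return results

# hypothetical dataset; in practice this would be the full corpus
all_sentences = ["  Hello   WORLD ", "NLP is   Fun", "Text   PREPROCESSING matters"]

# inspect the result on a small random subset before processing everything
subset = random.Random(0).sample(all_sentences, k=2)
for before, after in zip(subset, apply_steps(subset, [lower_case, strip_extra_spaces])):
    print(repr(before), "->", repr(after))
```

Because the steps are passed as a plain list of functions, changing the order only means reordering that list and checking the printed subset again.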

### Different Text Preprocessing Techniques
Let us jump into the different types of text preprocessing techniques.

In the next few minutes, we will discuss and learn the importance and implementation of these techniques.

#### Converting to Lower case
Converting all our text to lower case is a simple and effective approach. If we do not apply lower case conversion, words like NLP, nlp, and Nlp are treated as three different words.

After lower casing, all three are treated as the single word nlp.


### Implementation of lower case conversion


In [1]:
def lower_case_convertion(text):
	"""
	Input :- string
	Output :- lowercase string
	"""
	lower_text = text.lower()
	return lower_text


ex_lowercase = "This is an example Sentence for LOWER case conversion"
lowercase_result = lower_case_convertion(ex_lowercase)
print(lowercase_result)



this is an example sentence for lower case conversion



### HTML Tag Removal

This is the second essential preprocessing technique. HTML tags are quite common in text data that we extract or scrape from different websites.

We don't get any valuable information from these HTML tags, so it is better to remove them from our text data. We can remove them with a regular expression, or with the BeautifulSoup class from the bs4 library.

Let us see the implementation using Python.

### HTML tags removal Implementation using regex module

In [2]:


import re
def remove_html_tags(text):
	"""
	Return :- String without Html tags
	input :- String
	Output :- String
	"""
	html_pattern = r'<.*?>'
	without_html = re.sub(pattern=html_pattern, repl=' ', string=text)
	return without_html

ex_htmltags = """ <body>
<div>
<h1>Hi, this is an example text with Html tags. </h1>
</div>
</body>
"""
htmltags_result = remove_html_tags(ex_htmltags)
print(f"Result :- \n {htmltags_result}")

## Output:: Hi, this is an example text with Html tags.

Result :- 
   
 
 Hi, this is an example text with Html tags.  
 
 



### Implementation of Removing HTML tags using bs4 library

In [4]:
# Implementation of Removing HTML tags using bs4 library

from bs4 import BeautifulSoup
def remove_html_tags_beautifulsoup(text):
	"""
	Return :- String without Html tags
	input :- String
	Output :- String
	"""
	parser = BeautifulSoup(text, "html.parser")
	without_html = parser.get_text(separator = " ")
	return without_html

ex_htmltags = """ <body>
<div>
<h1>Hi, this is an example text with Html tags. </h1>
</div>
</body>
"""
htmltags_result = remove_html_tags_beautifulsoup(ex_htmltags)
print(f"Result :- \n {htmltags_result}")

## Output:: Hi, this is an example text with Html tags.

Result :- 
   
 
 Hi, this is an example text with Html tags.  
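
When bs4 is not installed, a similar result can be obtained with Python's built-in `html.parser` module. The sketch below is an assumption-laden alternative, not part of the original notebook: `_TextExtractor` and `remove_html_tags_stdlib` are hypothetical names, and the function also collapses extra whitespace, which the two functions above do not.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for every piece of text outside of tags
        self.parts.append(data)

def remove_html_tags_stdlib(text):
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    # join the text pieces and collapse repeated whitespace
    return ' '.join(' '.join(extractor.parts).split())

ex_htmltags = """ <body>
<div>
<h1>Hi, this is an example text with Html tags. </h1>
</div>
</body>
"""
print(remove_html_tags_stdlib(ex_htmltags))
```

Unlike the plain regex, a parser also decodes character references such as `&amp;` while extracting the text.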
 
 



### Removal of URLs

URL is the short form of Uniform Resource Locator. URLs within the text point to the location of another website or resource.

If we are performing backlink analysis of websites, Twitter, or Facebook, the URLs are worth keeping in the text.

Otherwise, URLs give us no useful information, so we can remove them from the text using Python's re (regular expression) library.

#### Implementation of Removing URLs  using python regex

In the script below, we take an example text containing URLs and call the remove_urls function with it. In remove_urls, we assign a regular expression that matches URLs to url_pattern, and then replace every URL in the text with a space by calling the re library's sub function.

In [5]:
# Implementation of Removing URLs  using python regex

import re
def remove_urls(text):
	"""
	Return :- String without URLs
	input :- String
	Output :- String
	"""
	url_pattern = r'https?://\S+|www\.\S+'
	without_urls = re.sub(pattern=url_pattern, repl=' ', string=text)
	return without_urls

# example text which contain URLs in it
ex_urls = """
This is an example text for URLs like http://google.com & https://www.facebook.com/ etc.
"""

# calling remove_urls function with example text (ex_urls)
urls_result = remove_urls(ex_urls)
print(f"Result after removing URLs from text :- \n {urls_result}")

Result after removing URLs from text :- 
 
This is an example text for URLs like   &   etc.
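
For problems such as backlink analysis, where the URLs are the interesting part, the same pattern can be used to collect them instead of deleting them. This is a small sketch; `extract_urls` and `URL_PATTERN` are hypothetical names that do not appear in the original code.

```python
import re

# same regular expression as in remove_urls above
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def extract_urls(text):
    """Return every URL found in the text, in order of appearance."""
    return URL_PATTERN.findall(text)

ex_urls = "This is an example text for URLs like http://google.com & https://www.facebook.com/ etc."
print(extract_urls(ex_urls))
```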



### Removing Numbers
We can remove numbers from the text if our problem statement doesn't require them.

However, if we are working on finance-related problems, such as banking or insurance, the numbers may carry information.

In those cases, we shouldn't remove numbers.

#### Implementation of Removing numbers  using python regex
In the code below, we call the remove_numbers function with an example text that contains numbers.

Let's see how to implement it.

In [6]:
# Implementation of Removing numbers  using python regex

import re
def remove_numbers(text):
	"""
	Return :- String without numbers
	input :- String
	Output :- String
	"""
	number_pattern = r'\d+'
	without_number = re.sub(pattern=number_pattern, repl=" ", string=text)
	return without_number

# example text which contain numbers in it
ex_numbers = """
This is an example sentence for removing numbers like 1, 5,7, 4 ,77 etc.
"""
# calling remove_numbers function with example text (ex_numbers)
numbers_result = remove_numbers(ex_numbers)
print(f"Result after removing number from text :- \n {numbers_result}")


Result after removing number from text :- 
 
This is an example sentence for removing numbers like  ,  , ,   ,  etc.



In the remove_numbers function above, we define a pattern that recognizes numbers within the text and replace each match with a space using the re library's sub function.

The text without numbers is then returned and stored in the numbers_result variable.
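
The pattern `\d+` also removes digits that are part of a word, and the replacement leaves extra spaces behind. As a hedged variation, the sketch below removes only standalone numbers by adding word boundaries and then collapses the leftover whitespace. `remove_standalone_numbers` is a hypothetical name, and whether tokens like "covid19" should keep their digits depends on the problem statement.

```python
import re

def remove_standalone_numbers(text):
    # \b...\b matches digits only when they form a whole token
    without_numbers = re.sub(r'\b\d+\b', ' ', text)
    # collapse the spaces left behind by the substitution
    return ' '.join(without_numbers.split())

print(remove_standalone_numbers("covid19 had 3 waves in 2021"))
```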

### Converting numbers to words
If our problem statement needs the information carried by numbers, we can convert the numbers to words instead of removing them. This applies to the same kinds of problem statements discussed in the removing numbers section above.

### Implementation of Converting numbers to words using python num2words library
We can convert numbers to words with the num2words library. In the code below, we call the num_to_words function with an example text that contains numbers.

In [10]:
#Install the package
!pip install num2words




In [13]:
# function to convert numbers to words
 
from num2words import num2words

def num_to_words(text):
	"""
	Return :- text which have all numbers or integers in the form of words
	Input :- string
	Output :- string
	"""
	# splitting text into words with space
	after_spliting = text.split()

	for index in range(len(after_spliting)):
		if after_spliting[index].isdigit():
			after_spliting[index] = num2words(after_spliting[index])

	# joining list into string with space
	numbers_to_words = ' '.join(after_spliting)
	return numbers_to_words

# example text which contain numbers in it
ex_numbers = """
This is an example sentence for converting numbers to words like 1 to one, 5 to five, 74 to seventy-four, etc.
"""
# calling num_to_words function with example text (ex_numbers)
numbers_result = num_to_words(ex_numbers)
print(f"Result after converting numbers to its words from text :- \n {numbers_result}")

Result after converting numbers to its words from text :- 
 This is an example sentence for converting numbers to words like one to one, five to five, seventy-four to seventy-four, etc.


In the above code, the num_to_words function takes the text as input and splits it on whitespace with Python's split string method to get the individual words.

It then checks each word with isdigit; if the word consists only of digits, it is converted to words.

### Apply spelling correction

Spelling correction is another important preprocessing technique when working with tweets, comments, etc., because misspelled words are common in this kind of text. We need to replace those misspelled words with their correct spelling.

We can check and replace misspelled words using two Python libraries: pyspellchecker and autocorrect.

#### Implementation of spelling correction using python pyspellchecker library
Below we call the spell_correction function with an example text. The example text contains a misspelled word so we can check whether spell_correction returns the correct word.

In [16]:
#Install pyspellchecker package

!pip install pyspellchecker




In [17]:
# Implementation of spelling correction using python pyspellchecker library

from spellchecker import SpellChecker

spell_corrector = SpellChecker()

# spelling correction using spellchecker
def spell_correction(text):
	"""
	Return :- text which have correct spelling words
	Input :- string
	Output :- string
	"""
	# initialize empty list to save correct spell words
	correct_words = []
	# extract spelling incorrect words by using unknown function of spellchecker
	misSpelled_words = spell_corrector.unknown(text.split())

	for each_word in text.split():
		if each_word in misSpelled_words:
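			# correction() may return None in newer pyspellchecker versions
			# when no candidate is found, so fall back to the original word
			right_word = spell_corrector.correction(each_word) or each_word
			correct_words.append(right_word)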
		else:
			correct_words.append(each_word)

	# joining correct_words list into single string
	correct_spelling = ' '.join(correct_words)
	return correct_spelling


In [18]:
#example text with mis spelling words
ex_misSpell_words = """
This is an example sentence for spell corecton
"""
spell_result = spell_correction(ex_misSpell_words)
print(f"Result after spell checking :- \n{spell_result}")

Result after spell checking :- 
This is an example sentence for spell correction


### Implementation of spelling correction using python autocorrect library

In [19]:
# Install autocorrect package
!pip install autocorrect




In [20]:
# Implementation of spelling correction using python autocorrect library

from autocorrect import Speller
from nltk import word_tokenize

# spelling correction using autocorrect
def spell_autocorrect(text):
	"""
	Return :- text which have correct spelling words
	Input :- string
	Output :- string
	"""
	correct_spell_words = []

	# initialize Speller object for english language with 'en'
	spell_corrector = Speller(lang='en')
	for word in word_tokenize(text):
		# correct spell word
		correct_word = spell_corrector(word)
		correct_spell_words.append(correct_word)

	correct_spelling = ' '.join(correct_spell_words)
	return correct_spelling

# another example text with misSpelling words
ex_misSpell_words_1 = """
This is anoter exapl for spell correction
"""
spell_result = spell_autocorrect(ex_misSpell_words_1)
print(f"Result :- \n{spell_result}")

Result :- 
This is another example for spell correction
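
To give an intuition for what a spelling corrector does, the sketch below picks the most similar word from a known vocabulary using Python's built-in `difflib`. This is only an illustration of closest-match lookup and not how pyspellchecker or autocorrect are implemented; `KNOWN_WORDS`, `closest_known_word`, and the cutoff value are assumptions made for this example.

```python
from difflib import get_close_matches

# hypothetical vocabulary; a real one would come from a dictionary or corpus
KNOWN_WORDS = ["this", "is", "an", "example", "sentence", "for", "spell", "correction"]

def closest_known_word(word, vocabulary=KNOWN_WORDS, cutoff=0.75):
    """Return the most similar vocabulary word, or the word itself if none is close enough."""
    if word.lower() in vocabulary:
        return word
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

ex_text = "This is an example sentence for spell corecton"
print(' '.join(closest_known_word(word) for word in ex_text.split()))
```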


### Convert accented characters to ASCII characters
This is another common preprocessing technique in NLP. Accented characters are letters with a mark above or below them, as in résumé; on many keyboards we get them by long-pressing a key while typing.

If we do not normalize these characters, the model will treat resume and résumé as two different words, even though they are the same.

We can convert accented characters to ASCII characters by using the unidecode library.

#### Implementation of accented text to ASCII converter in python
In the script below, we define the accented_to_ascii function to convert accented characters to their ASCII equivalents.

We will call this function with an example text.

In [6]:
# Install the unidecode package

!pip install unidecode





In [3]:
# Implementation of accented text to ASCII converter in python

import unidecode

def accented_to_ascii(text):
	"""
	Return :- text after converting accented characters
	Input :- string
	Output :- string
	"""
	# apply unidecode function on text to convert
	# accented characters to ASCII values
	text = unidecode.unidecode(text)
	return text

# example text with accented characters
ex_accented = """
This is an example text with accented characters like dèèp lèarning ánd cömputer vísíön etc.
"""
accented_result = accented_to_ascii(ex_accented)
print(f"Result after converting accented characters to their ASCII values \n{accented_result}")

Result after converting accented characters to their ASCII values 

This is an example text with accented characters like deep learning and computer vision etc.



In the above code, we call the unidecode function of the unidecode library on the input text, which converts accented characters to their closest ASCII representation.
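
If installing unidecode is not an option, a similar effect for accented Latin letters can be sketched with the built-in `unicodedata` module. `strip_accents` is a hypothetical name. Note the difference in behaviour: this approach simply drops any character that has no ASCII base letter, whereas unidecode tries to transliterate it.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # encoding to ASCII with errors='ignore' drops the combining marks
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents("dèèp lèarning ánd cömputer vísíön"))
```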

### Converting chat conversation words to normal words
This is another essential preprocessing technique if we work with chat conversations or our problem statement requires chat conversation analysis. Nowadays, people use short-form words in their chats for simplicity, and we need to handle them.

A better way to work with those words is to replace the short forms with their original words.

We can find these short-form words and their full forms in the GitHub repo below. To save the file to our system, right-click on it and choose the save as option.


https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt

In [97]:



# example text for chat conversation short-form words
ex_chat = """
omg this is an example text for chat conversation.
"""

# read the slang file; each non-empty line has the form SHORT=full form
chat_words_map_dict = {}
with open('slang.txt', 'r') as short_form_file:
	for line in short_form_file.read().split("\n"):
		if "=" in line:
			cw, cw_expanded = line.split("=", 1)
			# store keys in upper case so that the lookup is case-insensitive
			chat_words_map_dict[cw.strip().upper()] = cw_expanded.strip()

def short_to_original(text):
	"""
	Return :- text with chat short-forms replaced by their full forms
	Input :- string
	Output :- string
	"""
	words = [chat_words_map_dict.get(word.upper(), word) for word in text.split()]
	return ' '.join(words)

# calling function
chat_result = short_to_original(ex_chat)
print(f"Result {chat_result}")

Result Oh My God this is an example text for chat conversation.



### Expanding Contractions

Contractions are words or combinations of words shortened by dropping a few letters and replacing them with an apostrophe.

Examples of contractions:

"don't" is "do not"
"should've" is "should have"

NLP models don't know about these contractions; they will treat "don't" and "do not" as two different words.

We apply this technique only if our problem statement requires it. Otherwise, we leave the text as it is.

#### Implementation of expanding contractions
In the code below, we import the contractions library and use its fix function to expand any contractions in our input text.

In [61]:
#Install the package

!pip install contractions





In [65]:
# import library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too? 
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.'''
 
# creating an empty list
expanded_words = []    
for word in text.split():
  # using contractions.fix to expand the shortened words
  expanded_words.append(contractions.fix(word))   
   
expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)

Original text: I'll be there within 5 min. Shouldn't you be there too? 
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.
Expanded_text: I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.


In [None]:
!pip install nltk

In the code above, we split the text into words and pass each word to contractions.fix, which returns the expanded form when the word is a contraction. As the output shows, the capitalization of the first character is preserved ("Shouldn't" becomes "Should not"), so we do not have to apply lower case conversion before this step.
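
The same idea can be sketched without an external library by matching the text against a contraction map. The map below is a tiny hypothetical sample, not the full dictionary a real project would need, and `expand_contractions` here is an illustrative function written for this example. It restores the capital first letter so that "Doesn't" becomes "Does not".

```python
import re

# hypothetical, deliberately small contraction map
CONTRACTION_MAP = {
    "don't": "do not",
    "doesn't": "does not",
    "should've": "should have",
    "it's": "it is",
}

# longest keys first so that longer contractions are tried before shorter ones
_keys = sorted(CONTRACTION_MAP, key=len, reverse=True)
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in _keys) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTION_MAP[word.lower()]
        # keep the capitalization of the first character
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return CONTRACTION_PATTERN.sub(replace, text)

print(expand_contractions("Doesn't matter, it's fine"))
```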

### Stemming


Stemming reduces words to their base or root form by removing a few suffix characters. It is a text normalization technique.

There are many stemming algorithms available, but the most widely used one is Porter stemming.

For example, the result of stemming books is book, and the result of stemming learning is learn.

But stemming doesn't always produce the correct form of a word, because it only follows rules such as removing suffix characters to get the base word.

Sometimes the stemmed words don't relate to the original ones, and sometimes they are non-dictionary or improper words.

Words like "caring" and "console/consoling" are examples where stemming can give such results. Because of this, the stemming technique is not suitable for all NLP tasks.
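
To see why purely rule-based suffix removal can produce odd results, consider the toy stemmer below. It is a deliberately naive sketch and not the Porter algorithm; `naive_stem` and its suffix list are assumptions made for illustration only.

```python
def naive_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for example in ["books", "learning", "caring", "languages"]:
    print(example, "->", naive_stem(example))
```

With these rules "books" and "learning" come out fine, while "caring" is cut down to "car" and "languages" to "languag", which shows how blind suffix stripping can change the meaning or leave a non-dictionary word.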

#### Implementation of Stemming using PorterStemming from nltk library
In the Python script below, we define the porter_stemmer function to implement the stemming technique and call it with an example text.

Before calling the function, we have to create an object of the PorterStemmer class so that we can use its stem function.

In [66]:
# Implementation of Stemming using PorterStemming from nltk library

from nltk import word_tokenize
from nltk.stem import PorterStemmer

def porter_stemmer(text):
	"""
	Result :- string after stemming
	Input :- String
	Output :- String
	"""
	# word tokenization
	tokens = word_tokenize(text)

	for index in range(len(tokens)):
		# stem word to each word
		stem_word = stemmer.stem(tokens[index])
		# update tokens list with stem word
		tokens[index] = stem_word

	# join list with space separator as string
	return ' '.join(tokens)

# initialize porter stemmer object
stemmer = PorterStemmer()
# example text for stemming technique
ex_stem = "Programers program with programing languages"
stem_result = porter_stemmer(ex_stem)
print(f"Result after stemming technique :- \n{stem_result}")

Result after stemming technique :- 
program program with program languag


In the porter_stemmer function, we tokenize the input using word_tokenize from the nltk library, apply the stem function to each token, and join the stemmed words back into a string.

### Lemmatization

The aim of lemmatization is similar to that of stemming: reducing inflected words to their original or base form. But the lemmatization process is different from the above approach.

Lemmatization does not just trim suffix characters; instead, it uses lexical knowledge bases to get the base words. Unlike stemming, the result of lemmatization is always a meaningful word.

Because of the disadvantages of stemming, people prefer lemmatization to get the base or root form of words. This preprocessing technique is also optional; we have to apply it based on our problem statement.

Suppose we are working on POS (parts-of-speech) tagging problems; there, the original words of the data carry more information. Also, compared to stemming, lemmatization is a little slower.

Let's see the implementation of lemmatization using the nltk library.

#### Implementation of lemmatization using nltk
In the script below, before calling the lemmatization function, we have to create a WordNetLemmatizer object.

In [67]:
## Implementation of lemmatization using nltk

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def lemmatization(text):
	"""
	Result :- string after lemmatization
	Input :- String
	Output :- String
	"""
	# word tokenization
	tokens = word_tokenize(text)

	for index in range(len(tokens)):
		# lemma word
		lemma_word = lemma.lemmatize(tokens[index])
		tokens[index] = lemma_word

	return ' '.join(tokens)

# initialize lemmatizer object
lemma = WordNetLemmatizer()
# example text for lemmatization
ex_lemma = """
Programers program with programing languages
"""
lemma_result = lemmatization(ex_lemma)
print(f"Result of lemmatization \n{lemma_result}")

Result of lemmatization 
Programers program with programing language


### Removal of Emojis

In today's online communication, emojis play a crucial role.

Emojis are small images that users add to express their feelings, and they are understood by people all over the world. For some problem statements, we need to remove emojis from the text.

Let's see how we can remove emojis for that type of problem statement.

In [68]:
# Implementation of emoji removing

import re

def remove_emojis(text):
	"""
	Result :- string without any emojis in it
	Input :- String
	Output :- String
	"""
	# NOTE: some of these ranges are very broad. \U000024C2-\U0001F251 and
	# \U00010000-\U0010ffff also cover non-emoji characters (for example CJK text),
	# so those characters are removed as well.
	emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector
                               u"\u3030"
                               "]+", flags=re.UNICODE)

	without_emoji = emoji_pattern.sub(r'',text)
	return without_emoji


# example text for emoji removing technique
ex_emoji = """
This is a test 😻 👍🏿"""

In [69]:
# calling function
emoji_result = remove_emojis(ex_emoji)
print(f"Result text after removing emojis :- \n{emoji_result}")

Result text after removing emojis :- 

This is a test  
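
The character ranges used in remove_emojis are quite broad and also match many non-emoji characters. When the text may contain other scripts, a narrower pattern can be used. The sketch below is an assumption, not a complete emoji definition: `NARROW_EMOJI_PATTERN` and `remove_emojis_narrow` are hypothetical names, and the chosen ranges cover only the most common emoji blocks.

```python
import re

NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related symbol blocks
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000200D"             # zero width joiner used in emoji sequences
    "\U0000FE0F"             # variation selector-16
    "]+"
)

def remove_emojis_narrow(text):
    return NARROW_EMOJI_PATTERN.sub('', text)

print(remove_emojis_narrow("This is a test \U0001F63B \U0001F44D\U0001F3FF"))
```

Text in other scripts, such as Chinese characters, is left untouched by this narrower pattern.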


### Removal of Emoticons

Emojis and emoticons are different things. An emoticon portrays a human facial expression using just keyboard characters, such as letters, numbers, and punctuation marks.

As with emojis, if the problem statement doesn't require emoticons, we can remove them.

#### Implementation of removing emoticons
To remove emoticons from the text, we need a list of emoticons; in this GitHub Repo, we can find all emoticons as a dictionary.

We take the EMOTICONS dictionary from that GitHub repo and save it in our system as emoticons_list.py. After that, we import that file into our preprocessing code.

The cells below use the emot package to detect the emojis in a text together with their positions and meanings.

In [73]:
# Install emot package

pip install emot





In [75]:
import emot 
emot_obj = emot.core.emot() 
text = "I love python ☮ 🙂 ❤ :-) :-( :-)))" 
emot_obj.emoji(text) 


{'value': ['☮', '🙂', '❤'],
 'location': [[14, 15], [16, 17], [18, 19]],
 'mean': [':peace_symbol:', ':slightly_smiling_face:', ':red_heart:'],
 'flag': True}

In [78]:
import emot 
emot_obj = emot.core.emot() 
bulk_test = ["I love python ☮ 🙂 ❤ :-) :-( :-)))", "I love python 🙂 ❤ :-) :-( :-)))", "I love python ☮ ❤ :-) :-( :-)))", "I love python ☮ 🙂 :-( :-)))"] 

emot_obj.bulk_emoji(bulk_test) 


[{'value': ['☮', '🙂', '❤'],
  'location': [[14, 15], [16, 17], [18, 19]],
  'mean': [':peace_symbol:', ':slightly_smiling_face:', ':red_heart:'],
  'flag': True},
 {'value': ['🙂', '❤'],
  'location': [[14, 15], [16, 17]],
  'mean': [':slightly_smiling_face:', ':red_heart:'],
  'flag': True},
 {'value': ['☮', '❤'],
  'location': [[14, 15], [16, 17]],
  'mean': [':peace_symbol:', ':red_heart:'],
  'flag': True},
 {'value': ['☮', '🙂'],
  'location': [[14, 15], [16, 17]],
  'mean': [':peace_symbol:', ':slightly_smiling_face:'],
  'flag': True}]
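
The cells above only detect emojis. As a minimal sketch of the removal step itself, assuming a small hand-written dictionary in place of the full one from the repo (the entries and descriptions in EMOTICONS below, and the function name remove_emoticons, are hypothetical), a regex built from the dictionary keys could look like this:

```python
import re

# Hypothetical subset of an emoticon dictionary (emoticon -> description).
# In practice this would be imported from emoticons_list.py.
EMOTICONS = {
    ":-)))": "Very happy",
    ":-)": "Happy face",
    ":-(": "Sad face",
    ":)": "Happy face",
    ":(": "Sad face",
}

# Sort longest first so ":-)))" is tried before its prefix ":-)".
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text):
    """Return text with every emoticon listed in EMOTICONS removed."""
    return EMOTICON_PATTERN.sub("", text)


print(remove_emoticons("I love python :-) :-( :-)))"))
```

The spaces that surrounded the emoticons are left in place; the extra-whitespace step at the end of the pipeline takes care of them.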

### Removing of Punctuations or Special Characters

Punctuation marks and special characters are all characters other than letters and digits. The full list used by Python's string library is [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].

It is better to remove or convert emoticons before removing punctuation and special characters. Emoticons are built from punctuation characters, so if we apply this technique first, we lose the emoticons in the text.

Removing punctuation blindly can also change the meaning of the text. For example, if we remove the period from text like "money 20.98", we get "money 2098", which is a different amount.

So we have to choose carefully which punctuation characters to remove.

In [81]:
# Implementation of removing punctuations using string library

from string import punctuation

def remove_punctuation(text):
	"""
	Return :- String after removing punctuations
	Input :- String
	Output :- String
	"""
	return text.translate(str.maketrans('', '', punctuation))


# example text for removing punctuations
ex_punct = """
this is an example text for punctuations like .?/*
"""
punct_result = remove_punctuation(ex_punct)
print(f"Result after removing punctuations :- \n{punct_result}")


Result after removing punctuations :- 

this is an example text for punctuations like 
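
To avoid the "money 20.98" problem, one option is to keep a period only when it sits between two digits. This is a sketch under that assumption, not the only possible rule; the function name remove_punctuation_keep_decimals is hypothetical:

```python
import re
from string import punctuation

# Every punctuation character except the period, escaped for use in a character class.
OTHER_PUNCT = re.escape(punctuation.replace(".", ""))

# Remove: any other punctuation, a period not followed by a digit,
# or a period not preceded by a digit.
PUNCT_PATTERN = re.compile(rf"[{OTHER_PUNCT}]|\.(?!\d)|(?<!\d)\.")


def remove_punctuation_keep_decimals(text):
    """Remove punctuation but keep periods that have a digit on both sides."""
    return PUNCT_PATTERN.sub("", text)


print(remove_punctuation_keep_decimals("money 20.98, please!"))
# money 20.98 please
```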



### Removing of Stopwords
Stopwords are very common words that carry little useful information for our model or problem statement.

A few examples of stopwords are "a", "an", and "the".

For example, we can usually ignore stopwords when we work on sentiment analysis or text classification problems. But for tasks such as POS (Parts-Of-Speech) tagging or language translation, stopwords carry grammatical information, so we have to consider whether removing them would hurt our problem statement.

In [83]:
# Installing important package

pip install gensim

Collecting gensim
  Using cached gensim-4.3.2-cp39-cp39-win_amd64.whl (24.0 MB)
Installing collected packages: gensim
Successfully installed gensim-4.3.2
Note: you may need to restart the kernel to use updated packages.




In [86]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)



In [98]:
# Implementation of removing stopwords using all stop words from nltk, spacy, gensim

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy
import gensim


def remove_stopwords(text):
	"""
	Return :- String after removing stopwords
	Input :- String
	Output :- String
	"""
	text_without_sw = []
	# tokenization
	text_tokens = word_tokenize(text)
	for word in text_tokens:
		# checking word is stopword or not
		if word not in all_stopwords:
			text_without_sw.append(word)

	# joining all tokens after removing stop words
	without_sw = ' '.join(text_without_sw)
	return without_sw


# list of stopwords from nltk
stopwords_nltk = list(stopwords.words('english'))
sp = spacy.load('en_core_web_sm')
# list of stopwords from spacy
stopwords_spacy = list(sp.Defaults.stop_words)
# list of stopwords from gensim
stopwords_gensim = list(gensim.parsing.preprocessing.STOPWORDS)

# unique stopwords from all three libraries
# (a set removes duplicates and gives fast membership checks)
all_stopwords = set(stopwords_nltk) | set(stopwords_spacy) | set(stopwords_gensim)
print(f"Total number of Stopwords :- {len(all_stopwords)}")

# example text for stop words removing
ex_sw = """
this is an example text for stopwords such as a, an, the etc.
"""
sw_result = remove_stopwords(ex_sw)

print(f"Result after removing stopwords :- \n{sw_result}")




Total number of Stopwords :- 412
Result after removing stopwords :- 
example text stopwords , , .


In the code above, we take stopwords from three libraries: nltk, spacy, and gensim.

We then combine the three lists and keep only the unique stopwords. In remove_stopwords, we check whether each token is a stopword; if it is not, we append it to the list of kept tokens, which are joined back into a string at the end.
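
One detail to watch: the membership check is case-sensitive, while the stopword lists are lowercase, so a capitalized "The" would survive. A minimal sketch of a case-insensitive variant, assuming a tiny hand-written stopword set and plain whitespace tokenization instead of word_tokenize (SAMPLE_STOPWORDS and remove_stopwords_ci are hypothetical names):

```python
# Hypothetical, deliberately small stopword set for illustration only.
SAMPLE_STOPWORDS = {"a", "an", "the", "is", "this", "for", "such", "as", "on"}


def remove_stopwords_ci(text, stopword_set=SAMPLE_STOPWORDS):
    """Remove stopwords, comparing each token in lowercase."""
    kept = [tok for tok in text.split() if tok.lower() not in stopword_set]
    return " ".join(kept)


print(remove_stopwords_ci("The cat is on the mat"))
# cat mat
```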

Frequent words can be handled with two functions: one counts the most frequent words and the other removes them from our corpus.

### Removing of Rare words
Removing rare words is a text preprocessing technique similar to removing frequent words. Here we remove the words that occur least often in the corpus.

#### Implementation of rare words removing
The approach mirrors frequent-word removal: one function finds the rare words and another removes them. We take only ten rare words for a sample text; this number may increase based on our text corpus. The cell below shows a related Counter-based routine that removes the words two sentences have in common.

In [91]:
# Python program for the above approach
from collections import Counter

# Function to remove common
# words from two strings
def removeCommonWords(sent1, sent2):

	# Store the words present
	# in both the sentences
	sentence1 = list(sent1.split())
	sentence2 = list(sent2.split())
	
	# Calculate frequency of words
	# using Counter() function
	frequency1 = Counter(sentence1)
	frequency2 = Counter(sentence2)


	# Keep only the words of the first sentence
	# that do not occur in the second sentence
	sentence1 = [w for w in sentence1 if w not in frequency2]

	# Keep only the words of the second sentence
	# that do not occur in the first sentence
	sentence2 = [w for w in sentence2 if w not in frequency1]
		
	# Print the remaining
	# words in the two sentences
	print(*sentence1)
	print(*sentence2)


# Driver Code

sentence1 = "sky is blue in color"
sentence2 = "raj likes sky blue color"

removeCommonWords(sentence1, sentence2)


is in
raj likes
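
The cell above works on two sentences at a time. As a minimal sketch of the two-function approach for rare words across a corpus, assuming whitespace tokenization and a corpus given as a list of strings (the names find_rare_words and remove_rare_words and the sample corpus are hypothetical):

```python
from collections import Counter


def find_rare_words(corpus, n_rare=10):
    """Return the n_rare least frequent words across a list of texts."""
    if n_rare <= 0:
        return set()
    counts = Counter(word for text in corpus for word in text.split())
    # most_common() sorts by descending count, so the tail holds the rarest words.
    return {word for word, _ in counts.most_common()[-n_rare:]}


def remove_rare_words(text, rare_words):
    """Return text without the words listed in rare_words."""
    return " ".join(w for w in text.split() if w not in rare_words)


corpus = ["sky is blue", "sky is clear", "raj likes blue sky"]
rare = find_rare_words(corpus, n_rare=3)
print([remove_rare_words(text, rare) for text in corpus])
# ['sky is blue', 'sky is', 'blue sky']
```

The same pair of functions covers frequent words if the head of most_common() is taken instead of the tail.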


### Removing single characters
After performing all other text preprocessing techniques, except removing extra spaces, it is better to remove any single characters left in our corpus. We can remove them using a regex.

Implementation of removing single characters

In [92]:
## Remove single characters

import re

def remove_single_char(text):
	"""
	Return :- string after removing single characters
	Input :- string
	Output:- string
	"""
	single_char_pattern = r'\s+[a-zA-Z]\s+'
	without_sc = re.sub(pattern=single_char_pattern, repl=" ", string=text)
	return without_sc

# example text for removing single characters
ex_sc = """
this is an example of single characters like a , b , and c .
"""
# calling remove_single_char function to remove single characters
sc_result = remove_single_char(ex_sc)
print(f"Result :-\n{sc_result}")

## Output:: this is an example of single characters like , , and .

Result :-

this is an example of single characters like , , and .
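
The pattern above consumes the whitespace on both sides of each match, so when single letters stand next to each other (as in "a b c") every second one is skipped, and a letter at the very start or end of the text is never matched. A sketch of a variant that uses lookarounds instead, so no whitespace is consumed (the name remove_single_char_lookaround is hypothetical):

```python
import re


def remove_single_char_lookaround(text):
    """Remove standalone single letters, including adjacent ones."""
    # (?<!\S) and (?!\S) require that no non-space character touches the letter.
    without_sc = re.sub(r"(?<!\S)[a-zA-Z](?!\S)", "", text)
    # Collapse the gaps left behind and trim both ends.
    return re.sub(r"\s+", " ", without_sc).strip()


print(remove_single_char_lookaround("this is a b c test"))
# this is test
```

Like the original, this also removes real one-letter words such as "a" and "I".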



### Removing Extra Whitespaces
This is the last preprocessing technique. Extra whitespace carries no information, so we can collapse all of it: one or more newlines, tabs, and repeated spaces.

Our suggestion is to apply this technique last, after performing all other text preprocessing techniques.

Implementation of removing extra whitespaces

In [7]:

# Removing Extra Whitespaces

import re

def remove_extra_spaces(text):
	"""
	Return :- string after removing extra whitespaces
	Input :- String
	Output :- String
	"""
	space_pattern = r'\s+'
	without_space = re.sub(pattern=space_pattern, repl=" ", string=text)
	return without_space


# example text for removing extra spaces
ex_space = """
this      is an


extra spaces        .
"""

space_result = remove_extra_spaces(ex_space)
print(f"Result :- \n{space_result}")

## Output:: this is an extra spaces .

Result :- 
 this is an extra spaces . 
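
The result above still starts and ends with a space, because the newlines at both ends of the text are each replaced by one space. If that is not wanted, a possible variant splits on whitespace and joins with single spaces (the name normalize_whitespace is hypothetical):

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace run
    # and drops empty strings at the start and end.
    return " ".join(text.split())


print(normalize_whitespace("\nthis      is an\n\n\nextra spaces        .\n"))
# this is an extra spaces .
```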
