<h1><b><font color = 'brown'>
Cleaning Text Data
</font></b></h1>

<h2>
<b>

<ul>
<font color = 'brown green'>

<li>
The text data discussed here is unstructured text, which
consists of written sentences.
</li><br>

<li>
Most of the time, this text cannot be analyzed as it is, because it contains noisy elements, that is, elements that contribute little to the meaning of the sentence.
</li><br>

<li>
These noisy elements should be removed before analysis, since they add little to the meaning and semantics of the text.
</li><br>

<li>
If they are not removed, they not only waste system memory and processing time but can also reduce the accuracy of the results.
</li><br>

<li>
Data cleaning is the art of extracting meaningful portions from data by eliminating unnecessary details.
</li><br>

<li>
Consider the sentence: <i>He tweeted, 'Live coverage of General Elections
available at this.tv/show/ge2019. _/\_ Please tune in :)'.</i>
</li><br>

<li>
In this example, to perform NLP tasks on the sentence, we need to remove the emoticons, punctuation, and stop words, and then reduce the remaining words to their base grammatical form.
</li><br>

<li>
To achieve this, methods such as stopword removal, tokenization, and stemming are used.
</li><br>

<li>
Before we do so, let's get acquainted with some basic NLP libraries that we will be using here:

<ul>
<font color = 'red'>
<li>
re:<br>
This is a standard Python library that's used for string searching and string
manipulation. It provides functions such as match(), search(), findall(), split(), and sub(), which are used for basic string matching, searching, splitting, replacing, and more, using regular expressions.
<br> A regular expression is a sequence of characters that describes a pattern. This pattern is then searched for in the text.
</li><br>

<li>
textblob:<br>
This is an open source Python library that provides methods for performing various NLP tasks such as tokenization and PoS tagging.
It is built on top of nltk and offers a simpler, easier-to-use interface along with excellent documentation. For projects that don't involve a lot of complexity, it is often preferable to nltk.
</li><br>

<li>
keras:<br>
This is an open source, high-level neural network library that runs
on top of another neural network library called TensorFlow.<br>
In addition to neural network functionality, it also provides methods for basic text processing and NLP tasks.
</li>

</font>
</ul>

</li><br>

</ul>
</b>
</h2>
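
As a rough illustration of the **re** functions listed above, the following sketch applies each of them to a shortened version of the example tweet. The variable names and the patterns are assumptions chosen for this illustration only; they are not part of the exercises that follow.

```python
import re

text = ("Live coverage of General Elections available at "
        "this.tv/show/ge2019. Please tune in :)")

# match() only succeeds when the pattern matches at the start of the string.
starts_with_live = re.match(r"Live", text) is not None

# search() scans the whole string and returns the first match (or None).
first_year = re.search(r"\d{4}", text).group()

# findall() returns every non-overlapping match as a list of strings.
capitalized = re.findall(r"\b[A-Z]\w*", text)

# split() breaks the string wherever the pattern matches.
parts = re.split(r"\s+", "Please   tune in")

# sub() replaces every match with the given replacement string.
no_digits = re.sub(r"\d+", "", "ge2019")

print(starts_with_live, first_year, capitalized, parts, no_digits)
```

Note that match() would return None for a pattern that only occurs later in the string, which is the main practical difference from search().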

<h1><b><font color = 'brown'>
Tokenization
</font></b></h1>

<h2>
<b>

<ul>
<font color = 'brown green'>

<li>
Tokenization is the process of splitting sentences into their constituents, that is, words and punctuation marks.
</li><br>

<li>
Let's perform a simple exercise to see how this can be done using various packages.
</li>

</font>
</ul>
</b>
</h2>
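
To make the definition concrete, here is a minimal sketch of a tokenizer that keeps both words and punctuation marks as separate tokens. The function name **simple_tokenize** and the pattern are assumptions made for this illustration; the libraries introduced earlier use more elaborate rules.

```python
import re

def simple_tokenize(sentence):
    # \w+ matches a run of letters, digits, or underscores (a word);
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace (a punctuation mark or symbol).
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("Please tune in, everyone!")
print(tokens)
```

Here the comma and the exclamation mark come out as tokens of their own instead of staying attached to the neighboring words.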

<h1><b><font color = 'brown'>
Exercise 1: Text Cleaning and Tokenization
</font></b></h1>

In this exercise, we will clean some text and extract the tokens from it. Follow these steps to complete this exercise:

1. Open a Jupyter Notebook.

2. Import the **re** package:

In [None]:
import re

3. Create a function called **clean_text()** that removes all characters other
than digits, alphabetical characters, and whitespace from the text and splits
the text into tokens. To do this, we match every run of non-alphanumeric
characters (including underscores), replace each run with a single space,
and then split the result on whitespace:

In [None]:
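def clean_text(sentence):
  # Replace each run of punctuation, symbols, or underscores with a space,
  # then split on whitespace to get the tokens.
  return re.sub(r'([^\s\w]|_)+', ' ', sentence).split()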

4. Store the sentence to be cleaned in a variable named **sentence** and pass it
to the preceding function:

In [None]:
sentence = ('Sunil tweeted, "Witnessing 70th Republic Day '
            'of India from Rajpath, New Delhi. '
            'Mesmerizing performance by Indian Army! '
            'Awesome airshow! @india_official '
            '@indian_army #India #70thRepublic_Day. '
            'For more photos ping me sunil@photoking.com :)"')

In [None]:
clean_text(sentence)

['Sunil',
 'tweeted',
 'Witnessing',
 '70th',
 'Republic',
 'Day',
 'of',
 'India',
 'from',
 'Rajpath',
 'New',
 'Delhi',
 'Mesmerizing',
 'performance',
 'by',
 'Indian',
 'Army',
 'Awesome',
 'airshow',
 'india',
 'official',
 'indian',
 'army',
 'India',
 '70thRepublic',
 'Day',
 'For',
 'more',
 'photos',
 'ping',
 'me',
 'sunil',
 'photoking',
 'com']
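
The output above splits **@india_official** into two tokens. The following sketch suggests why: **\w** also matches the underscore, so the pattern needs the extra **|_** alternative to treat it as a separator. The variable names are assumptions used only for this comparison.

```python
import re

handle = "@india_official"

# Without the "|_" alternative, the underscore survives because \w matches it.
without_underscore_rule = re.sub(r"[^\s\w]+", " ", handle).split()

# With "|_", the underscore is treated like punctuation and becomes a separator.
with_underscore_rule = re.sub(r"([^\s\w]|_)+", " ", handle).split()

print(without_underscore_rule)
print(with_underscore_rule)
```

Whether splitting handles and hashtags this way is desirable depends on the task; if they should stay intact, the simpler pattern may be the better choice.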

<h1><b><font color = 'brown'>
n-grams
</font></b></h1>

<h2>
<b>

<ul>
<font color = 'brown green'>

<li>
Often, extracting each token separately does not help.
</li><br>

<li>
For instance, consider the sentence, "I don't hate you, but your behavior."
</li><br>

<li>
Here, if we process tokens such as "hate" and "behavior" separately, the true meaning of the sentence is lost.
</li><br>

<li>
In this case, the context in which these tokens are present becomes
essential.
</li><br>

<li>
Thus, we consider n consecutive tokens at a time. An n-gram is a
group of n consecutive tokens.
</li><br>

</font>
</ul>
</b>
</h2>
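
As a small sketch of the idea, the bigrams of the example sentence can be built by pairing each token with its successor. The whitespace split and the stripping of trailing commas and periods are simplifying assumptions made here so that "don't" stays in one piece.

```python
sentence = "I don't hate you, but your behavior."

# Split on whitespace and strip trailing commas and periods from each token.
tokens = [word.strip(".,") for word in sentence.split()]

# Pair every token with the one that follows it to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))

for pair in bigrams:
    print(pair)
```

The pair ("don't", "hate") keeps the negation next to the verb, which is exactly the context that is lost when "hate" is processed on its own.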

<h1><b><font color = 'brown'>
Exercise 2: Extracting n-grams
</font></b></h1>

In this exercise, we will extract **n-grams** using three different methods. First, we will use **custom-defined functions**, and then the **nltk** and **textblob** libraries. Follow these steps to complete this exercise:

1. Open a Jupyter Notebook.

2. Import the **re** package and create a custom-defined function that we can use to extract **n-grams**:

In [None]:
import re

def n_gram_extractor(sentence, n):
  tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
  for i in range(len(tokens) - n + 1):
    print(tokens[i:i+n])

3. If **n** is 2, two consecutive tokens are taken at a time, resulting in **bigrams**. To check the bigrams, we call the function with the text and **n=2**:

In [None]:
n_gram_extractor('The cute little boy is playing with the kitten.', 2)

['The', 'cute']
['cute', 'little']
['little', 'boy']
['boy', 'is']
['is', 'playing']
['playing', 'with']
['with', 'the']
['the', 'kitten']


4. To check the trigrams, we call the function with the text and **n=3**:

In [None]:
n_gram_extractor('The cute little boy is playing with the kitten.', 3)

['The', 'cute', 'little']
['cute', 'little', 'boy']
['little', 'boy', 'is']
['boy', 'is', 'playing']
['is', 'playing', 'with']
['playing', 'with', 'the']
['with', 'the', 'kitten']
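
The custom function above prints each n-gram, which is convenient for inspection but awkward if the n-grams are needed later. A possible variant, sketched here under the hypothetical name **n_grams()**, returns them as a list of tuples instead:

```python
import re

def n_grams(sentence, n):
    # Same cleaning step as n_gram_extractor: punctuation and underscores
    # become spaces, then the text is split on whitespace.
    tokens = re.sub(r"([^\s\w]|_)+", " ", sentence).split()
    # Slide a window of size n over the tokens; the result is an empty
    # list when the sentence has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = n_grams("The cute little boy is playing with the kitten.", 2)
trigrams = n_grams("The cute little boy is playing with the kitten.", 3)
print(len(bigrams), len(trigrams))
```

Returning tuples makes the n-grams hashable, so they can be counted or stored in a set without further conversion.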


5. To check the bigrams using the **nltk** library, add the following code. Because the sentence is split on whitespace only, the last token keeps its period (**'kitten.'**):

In [None]:
from nltk import ngrams

list(ngrams('The cute little boy is playing with the kitten.'.split(), 2))

[('The', 'cute'),
 ('cute', 'little'),
 ('little', 'boy'),
 ('boy', 'is'),
 ('is', 'playing'),
 ('playing', 'with'),
 ('with', 'the'),
 ('the', 'kitten.')]

6. To check the **trigrams** using the **nltk** library, add the following code:

In [None]:
list(ngrams('The cute little boy is playing with the kitten.'.split(), 3))

[('The', 'cute', 'little'),
 ('cute', 'little', 'boy'),
 ('little', 'boy', 'is'),
 ('boy', 'is', 'playing'),
 ('is', 'playing', 'with'),
 ('playing', 'with', 'the'),
 ('with', 'the', 'kitten.')]

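7. To check the bigrams using the **textblob** library, first download the **punkt** tokenizer data from **nltk**, which textblob relies on for tokenization, and then add the following code: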

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from textblob import TextBlob

blob = TextBlob('The cute little boy is playing with the kitten.')
blob.ngrams(n = 2)

[WordList(['The', 'cute']),
 WordList(['cute', 'little']),
 WordList(['little', 'boy']),
 WordList(['boy', 'is']),
 WordList(['is', 'playing']),
 WordList(['playing', 'with']),
 WordList(['with', 'the']),
 WordList(['the', 'kitten'])]

8. To check the trigrams using the **textblob** library, add the following code:

In [None]:
blob.ngrams(n = 3)

[WordList(['The', 'cute', 'little']),
 WordList(['cute', 'little', 'boy']),
 WordList(['little', 'boy', 'is']),
 WordList(['boy', 'is', 'playing']),
 WordList(['is', 'playing', 'with']),
 WordList(['playing', 'with', 'the']),
 WordList(['with', 'the', 'kitten'])]