Once you have gathered the text, the next stage is to clean and consolidate it.

We need to standardize the text and remove noise so that it can be analyzed efficiently and yield meaningful insights.

The cleaning and processing steps depend heavily on the nature of the NLP project. If numbers are important for your project, for example, you may need a different set of steps, such as keeping the numbers rather than removing them.

We will go through the following techniques:

1. Convert text to lowercase
2. Tokenize paragraphs to sentences
3. Tokenize sentences to words
4. Remove numbers
5. Remove punctuation
6. Remove stop words
7. Remove whitespace

We will use the `nltk` library, along with Python's built-in `re` and `string` modules, to perform this normalization.

# Convert Text to Lowercase
***
Convert the text so that every word is lowercase:

In [2]:
text = 'This is an NLP article of FinTechExplained'

lower_case_text = text.lower()

print(lower_case_text)

this is an nlp article of fintechexplained
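For text that may contain non-English characters, `str.casefold()` is a more aggressive alternative to `lower()`. The short sketch below is an addition to the example above, and the sample word is an assumption chosen only to show the difference:

```python
# Illustrative sample (not from the original example): German "Straße".
word = 'Straße'

lowered = word.lower()      # keeps the sharp s character
folded = word.casefold()    # expands the sharp s to "ss" for caseless matching

print(lowered)
print(folded)
print(folded == 'STRASSE'.casefold())
```

For plain English text the two methods give the same result, so `lower()` is usually enough.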


# Tokenize Paragraphs to Sentences
***
Use the `nltk` library to perform tokenization. The `sent_tokenize` function relies on the `PunktSentenceTokenizer` model to split text into sentences by identifying the punctuation and characters that mark the end of each sentence.

In [4]:
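import nltk
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')  # one-time download of the Punkt model ('punkt_tab' on newer nltk releases)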

paragraph = '''FinTechExplained aims to explain how text processing works.  
Once we have gathered the text, the next stage is about cleaning and 
consolidating the text. It is important to ensure the text is standardised
and the noise is removed so that efficient analysis can be performed on the
text to derive meaningful insights.'''

sentences = sent_tokenize(paragraph)
print(sentences)

['FinTechExplained aims to explain how text processing works.', 'Once we have gathered the text, the next stage is about cleaning and \nconsolidating the text.', 'It is important to ensure the text is standardised\nand the noise is removed so that efficient analysis can be performed on the\ntext to derive meaningful insights.']
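The sentences above still contain the line breaks (`\n`) from the original paragraph. One possible way to flatten them, sketched here with only built-in string methods, is to split each sentence on whitespace and re-join it with single spaces. The variable names are illustrative:

```python
# Two sentences copied from the sent_tokenize output above.
sentences = [
    'FinTechExplained aims to explain how text processing works.',
    'Once we have gathered the text, the next stage is about cleaning and \nconsolidating the text.',
]

# split() with no arguments breaks on any run of whitespace, including "\n",
# so re-joining with a single space puts each sentence on one line.
flat_sentences = [' '.join(s.split()) for s in sentences]

for s in flat_sentences:
    print(s)
```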


# Tokenize Sentences to Words
***
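We can use the `TreebankWordTokenizer` from `nltk` to tokenize sentences into words. The tokenizer expects one sentence at a time, so when it is given a whole paragraph, as below, it only splits off the final period and leaves tokens such as `works.` and `text.` intact.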

In [5]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

print(tokenizer.tokenize(paragraph))

['FinTechExplained', 'aims', 'to', 'explain', 'how', 'text', 'processing', 'works.', 'Once', 'we', 'have', 'gathered', 'the', 'text', ',', 'the', 'next', 'stage', 'is', 'about', 'cleaning', 'and', 'consolidating', 'the', 'text.', 'It', 'is', 'important', 'to', 'ensure', 'the', 'text', 'is', 'standardised', 'and', 'the', 'noise', 'is', 'removed', 'so', 'that', 'efficient', 'analysis', 'can', 'be', 'performed', 'on', 'the', 'text', 'to', 'derive', 'meaningful', 'insights', '.']
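If tokens such as `works.` should always be separated from their punctuation, a rough, dependency-free alternative is a regular expression. The function below is a simplified stand-in and not the Treebank algorithm; the name `simple_word_tokenize` is hypothetical:

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A simplified stand-in, not the Treebank algorithm: contractions,
    abbreviations, and decimal numbers get no special handling.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize('how text processing works. Once we have gathered the text, the next stage'))
```

Another option is to run `sent_tokenize` first and pass each sentence to the Treebank tokenizer separately.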


# Remove Numbers
***
Use a regular expression to remove all numbers from a given string:

In [6]:
import re
# \d+ matches one or more consecutive digits
result = re.sub(r'\d+', '', '909FinTechExplained9876')
print(result)

FinTechExplained
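The `\d+` pattern leaves separators behind when a number contains a decimal point or a thousands comma. The sketch below extends the pattern to cover those simple cases and optionally substitutes a placeholder, which can help when the presence of a number matters but its value does not. The pattern, the function name, and the `<NUM>` token are assumptions made for illustration:

```python
import re

# Matches integers plus simple decimal or thousands-separated forms
# such as "12.5" or "1,200". An illustrative pattern, not a full number grammar.
NUMBER_PATTERN = re.compile(r'\d+(?:[.,]\d+)*')

def replace_numbers(text, placeholder=''):
    """Replace every number in text with placeholder (removed by default)."""
    return NUMBER_PATTERN.sub(placeholder, text)

sample = 'Revenue rose 12.5 percent to 1,200 units in 2020'
print(replace_numbers(sample))
print(replace_numbers(sample, placeholder='<NUM>'))
```

Removing numbers this way leaves extra spaces behind, so it is usually followed by a whitespace clean-up step.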


# Remove Punctuation
***
Now we can remove punctuation from the text:

In [10]:
import string

# A set gives an exact membership test for each token
punctuation = set(string.punctuation)
words = ['You','Are','Reading','FinTechExplained','!','NLP','.']

# Keep only the tokens that are not punctuation marks
clean_words = [w for w in words if w not in punctuation]

print(clean_words)
print("Cleaned sentence:", " ".join(clean_words))

['You', 'Are', 'Reading', 'FinTechExplained', 'NLP']
Cleaned sentence: You Are Reading FinTechExplained NLP
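The list comprehension above only drops tokens that are themselves punctuation marks. Punctuation that is still attached to a word, such as `FinTechExplained!`, would survive. A minimal sketch for that case, working on the raw string with `str.translate`, might look like this (the sample sentence is illustrative):

```python
import string

# Translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans('', '', string.punctuation)

sentence = 'You are reading FinTechExplained! This is NLP.'
cleaned = sentence.translate(remove_punct)

print(cleaned)
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would need to be added separately.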


# Remove Stop Words
***
Stop words are very common words that carry little meaning on their own. Some common English stop words are "a", "an", "the", "and", "but", "if", "or", and "because".

Use `nltk` to remove stop words:

In [13]:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download of the stop word lists

text = 'FinTechExplained is an important publication'
words = nltk.word_tokenize(text)
# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

clean_words = [w for w in words if w not in stop_words]

print(clean_words)

['FinTechExplained', 'important', 'publication']
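The `nltk` English stop word list is all lowercase, so a capitalized token such as `The` would pass through the filter above unless the text was lowercased first. The sketch below compares the lowercase form of each token instead. To stay self-contained it uses a small hand-picked set built from the examples listed earlier, which is an assumption and not the full `nltk` list:

```python
# Small hand-picked stop word set taken from the examples listed above;
# a real project would more likely use the full nltk list.
STOP_WORDS = {'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because'}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing."""
    return [w for w in words if w.lower() not in stop_words]

tokens = ['The', 'publication', 'And', 'an', 'article']
print(remove_stop_words(tokens))
```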


# Remove Whitespace
***
Remove whitespace such as spaces, tabs, carriage returns, and line feeds:

In [15]:
sentence = 'FinTechExplained Is A    Publication. \n This is about NLP'

# split() with no arguments splits on any run of whitespace and discards it
split_words = sentence.split()

print(split_words)

['FinTechExplained', 'Is', 'A', 'Publication.', 'This', 'is', 'about', 'NLP']
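With whitespace handled, the individual steps can be chained into a single function. The following is a minimal sketch under stated assumptions, not a definitive pipeline: it uses only the standard library, the `normalize` name and `keep_numbers` flag are hypothetical, and the tiny stop word set stands in for the `nltk` list so that no downloads are needed.

```python
import re
import string

# Hypothetical mini stop word list, used so the sketch needs no downloads.
STOP_WORDS = {'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'is', 'of'}

def normalize(text, keep_numbers=False):
    """Lowercase, strip numbers and punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    if not keep_numbers:
        text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = text.split()  # also discards tabs, newlines, and repeated spaces
    return [w for w in words if w not in STOP_WORDS]

print(normalize('FinTechExplained Is A    Publication. \n This is about NLP in 2020!'))
```

The `keep_numbers` flag reflects the earlier point that number removal depends on the project: passing `keep_numbers=True` leaves digits in place.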
