## Text Preprocessing and Wrangling

- Text wrangling (also called preprocessing or normalization) is a series of steps that clean and standardize textual data into a form that NLP and other intelligent systems powered by machine learning and deep learning can consume.

- Common preprocessing techniques include cleaning text, tokenizing text, removing special characters, case conversion, correcting spelling, removing stopwords and other unnecessary terms, stemming, and lemmatization.

## Removing HTML Tags

- Unstructured text contains a lot of noise, especially if you use techniques like web scraping or screen scraping to retrieve data from web pages, blogs, and online repositories.

- HTML tags, JavaScript, and iframe tags typically don’t add much value to understanding and analyzing text.

- Our main intent is to extract the meaningful textual content from the data retrieved from the web.

In [1]:
## web scraping - retrieve the contents of this web page in Python
import requests 
data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
content = data.content
print(content[1163:2200])

b'a name="generator" content="Ebookmaker 0.8.9 by Project Gutenberg"/>\r\n</head>\r\n  <body><p id="id00000">Project Gutenberg EBook The Bible, King James, Book 1: Genesis</p>\r\n\r\n<p id="id00001">Copyright laws are changing all over the world. Be sure to check the\r\ncopyright laws for your country before downloading or redistributing\r\nthis or any other Project Gutenberg eBook.</p>\r\n\r\n<p id="id00002">This header should be the first thing seen when viewing this Project\r\nGutenberg file.  Please do not remove it.  Do not change or edit the\r\nheader without written permission.</p>\r\n\r\n<p id="id00003">Please read the "legal small print," and other information about the\r\neBook and Project Gutenberg at the bottom of this file.  Included is\r\nimportant information about your specific rights and restrictions in\r\nhow the file may be used.  You can also find out about how to make a\r\ndonation to Project Gutenberg, and how to get involved.</p>\r\n\r\n<p id="id00004" style="mar

- We can clearly see from the preceding output that it is extremely difficult to decipher the actual textual content in the web page, due to all the unnecessary HTML tags. We need to remove those tags. 
- The BeautifulSoup library provides us with some handy functions that help us remove these unnecessary tags with ease.

In [0]:
import re 
from bs4 import BeautifulSoup

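def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
  # drop <iframe> and <script> elements together with their contents
  for s in soup(['iframe', 'script']):
    s.extract()
  stripped_text = soup.get_text()
  # collapse runs of carriage returns / newlines into a single newline
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
  return stripped_text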

In [3]:
## Call the function above
clean_content = strip_html_tags(content)
print(clean_content[1163:2045])

*** START OF THE PROJECT GUTENBERG EBOOK, THE BIBLE, KING JAMES, BOOK 1***
This eBook was produced by David Widger
with the help of Derek Andrew's text from January 1992
and the work of Bryan Taylor in November 2002.
Book 01        Genesis
01:001:001 In the beginning God created the heaven and the earth.
01:001:002 And the earth was without form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.
01:001:003 And God said, Let there be light: and there was light.
01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.
01:001:005 And God called the light Day, and the darkness he called
           Night. And the evening and the morning were the first day.
01:001:006 And God said, Let there be a firmament in the midst of the
           waters,


- We have successfully removed the unnecessary HTML tags. We now have a clean body of text that’s easier to interpret and understand.
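
When BeautifulSoup is not available, the same idea can be approximated with Python's standard-library `html.parser`. The following is only a minimal sketch under stated assumptions, not a replacement for the function above: the class name `_TextExtractor`, the function name `strip_html_tags_stdlib`, and the sample HTML string are all made up for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes while skipping the contents of <script> and <iframe>."""

    SKIP = {"script", "iframe"}  # hypothetical choice, mirroring the tags removed above

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html_tags_stdlib(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # collapse runs of carriage returns / newlines into a single newline
    return re.sub(r"[\r\n]+", "\n", parser.text())


sample_html = (
    '<html><body><p id="id1">Hello <b>world</b>.</p>\r\n'
    "<script>var x = 1;</script><p>Bye.</p></body></html>"
)
print(strip_html_tags_stdlib(sample_html))
```

The sketch expects a `str`; bytes such as `requests`' `data.content` would need to be decoded first. Real-world pages with malformed markup are usually better served by BeautifulSoup.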

#### Text Tokenization
- Tokenization is the process of breaking down or splitting textual data into smaller, more meaningful components called tokens.

#### Sentence Tokenization
- Sentence tokenization is the process of splitting a text corpus into sentences, which act as the first level of tokens the corpus is composed of. It is also known as sentence segmentation, since we try to segment the text into meaningful sentences.
- The main sentence tokenizers covered below are:
  - sent_tokenize
  - Pretrained sentence tokenization models
  - PunktSentenceTokenizer
  - RegexpTokenizer

In [4]:
## Use tokenize on some sample text and part of the Gutenberg corpus available in NLTK

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
from pprint import pprint
import numpy as np



[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [0]:
## loading text corpora 
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = ("US unveils world's most powerful supercomputer, beats China." 
               "The US has unveiled the world's most powerful supercomputer called 'Summit'," 
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight," 
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")


In [6]:
sample_text

"US unveils world's most powerful supercomputer, beats China.The US has unveiled the world's most powerful supercomputer called 'Summit',beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [7]:
# Total characters in Alice in Wonderland
print(len(alice))

# First 100 characters in the corpus
alice[0:100]

144395


"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was"

### Default Sentence Tokenizer 
- The nltk.sent_tokenize() function is the default sentence tokenization function that NLTK recommends and it uses an instance of the PunktSentenceTokenizer class internally.

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

In [10]:
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
print(np.array(sample_sentences))
print("------------------------")
print('\nTotal sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:-')
print(np.array(alice_sentences[0:5]))

Total sentences in sample_text: 3
Sample text sentences :-
["US unveils world's most powerful supercomputer, beats China.The US has unveiled the world's most powerful supercomputer called 'Summit',beating the previous record-holder China's Sunway TaihuLight."
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,which is capable of 93,000 trillion calculations per second.'
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']
------------------------

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I."
 "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or

- Note: the tokenizer doesn’t just use periods to delimit sentences; it also considers other punctuation and the capitalization of words. In the sample text, `China.The` is not split because the string has no whitespace after that period. We can also tokenize text in other languages using pretrained models that ship with NLTK.


### Pretrained Sentence Tokenizer Models
- Suppose we were dealing with German text. We can use sent_tokenize, which is already trained, or load a tokenization model pretrained on German text into a PunktSentenceTokenizer instance and perform the same operation.

In [11]:
nltk.download('europarl_raw')

[nltk_data] Downloading package europarl_raw to /root/nltk_data...
[nltk_data]   Unzipping corpora/europarl_raw.zip.


True

In [12]:
from nltk.corpus import europarl_raw
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
print(german_text[0:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


- Next, we tokenize the text corpus into sentences using the default sent_tokenize(...) tokenizer and a pretrained German language tokenizer loaded from the NLTK resources.

In [0]:
# default sentence tokenizer
german_sentences_def = default_st(german_text,language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

In [14]:
## check if the results obtained by using the two tokenizers match!

# verify the type of german_tokenizer
# should be PunktSentenceTokenizer
print(type(german_tokenizer))

# check if results of both tokenizers match
# should be True
print(german_sentences_def == german_sentences)


<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
True


- Observation: german_tokenizer is an instance of PunktSentenceTokenizer loaded with a model trained on German text, and it produces the same sentences as sent_tokenize with language='german'.

### PunktSentenceTokenizer

In [15]:
punkt_st= nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
for sent in sample_sentences:
  print(sent,"\n")
print(len(sample_sentences))

US unveils world's most powerful supercomputer, beats China.The US has unveiled the world's most powerful supercomputer called 'Summit',beating the previous record-holder China's Sunway TaihuLight. 

With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,which is capable of 93,000 trillion calculations per second. 

Summit has 4,608 servers, which reportedly take up the size of two tennis courts. 

3


### RegexpTokenizer

- We will use specific regular expression-based patterns to segment sentences.

In [16]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN,gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
for sent in sample_sentences:
  print(sent,"\n")

US unveils world's most powerful supercomputer, beats China.The US has unveiled the world's most powerful supercomputer called 'Summit',beating the previous record-holder China's Sunway TaihuLight. 

With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,which is capable of 93,000 trillion calculations per second. 

Summit has 4,608 servers, which reportedly take up the size of two tennis courts. 
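
To see what each lookbehind in the pattern contributes, the same expression can be tried with the standard `re` module on a small string. This is only a sketch: the example text is made up, and `re.split` stands in for the tokenizer's gap-based splitting.

```python
import re

# Same pattern as above: split on whitespace that follows '.', '?' or '!',
# unless that punctuation looks like part of an abbreviation.
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# Hypothetical example text: one title ("Mr."), one missing space, one question mark.
text = "Mr. Smith went home. He was tired.No space here? Yes."

sentences = re.split(SENTENCE_TOKENS_PATTERN, text)
for sentence in sentences:
    print(sentence)
```

The space after `Mr.` is protected by the `(?<![A-Z][a-z]\.)` lookbehind, and `tired.No` stays together because the pattern only ever splits on a whitespace character. That is the same reason `China.The` survives in the output above.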



<b> In the following section, we look at tokenizing these sentences into words using several techniques. </b>

### Word Tokenization

- Word tokenization is the process of splitting or segmenting sentences into their constituent words.
- A sentence is a collection of words, and with tokenization we essentially split a sentence into a list of words that can be used to reconstruct the sentence.
- Word tokenization is important in many processes, especially in cleaning and normalizing text, where operations like stemming and lemmatization work on each individual word to find its stem or lemma.
- NLTK provides various useful interfaces for word tokenization. We will touch on the following main interfaces:
  -  word_tokenize
  -  TreebankWordTokenizer
  -  ToktokTokenizer
  -  RegexpTokenizer
  -  Inherited tokenizers from RegexpTokenizer

### Default Word Tokenizer
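- The nltk.word_tokenize(...) function is the default and recommended word tokenizer, as specified by NLTK.
- Internally, it first splits the text into sentences with sent_tokenize and then applies a Treebank-style word tokenizer to each sentence, so it acts as a wrapper around that core tokenizer. This is why sentence-final periods such as the one after 'TaihuLight' come out as separate tokens in the output below.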

In [17]:
default_wt = nltk.word_tokenize
words = default_wt(sample_text)
print(len(words))
np.array(words)

79


array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China.The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

### TreebankWordTokenizer
- The TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to tokenize the text.
- One <b>primary assumption</b> here is that we have already performed sentence tokenization beforehand.
- Some of the main features of this tokenizer are:
     - Splits and separates out periods that appear at the end of a sentence
     - Splits and separates commas and single quotes when followed by whitespace
     - Most punctuation characters are split and separated into independent tokens
     - Splits words with standard contractions, such as don’t into do and n’t
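
The features listed above can be imitated with a handful of regular expressions. The following is a rough sketch using only the standard `re` module; it is not the rule set NLTK actually uses, and the function name `treebank_like_tokenize` and the example sentence are assumptions made for illustration.

```python
import re


def treebank_like_tokenize(sentence):
    """Very small imitation of a few Treebank-style rules (illustrative only)."""
    s = sentence
    # contractions: "doesn't" -> "does n't"
    s = re.sub(r"(\w)(n't)\b", r"\1 \2", s)
    # possessive / clitic 's: "world's" -> "world 's"
    s = re.sub(r"(\w)('s)\b", r"\1 \2", s)
    # commas followed by whitespace become separate tokens
    s = re.sub(r",(?=\s)", " ,", s)
    # only the sentence-final period is split off
    s = re.sub(r"\.\s*$", " .", s)
    return s.split()


tokens = treebank_like_tokenize("The world's fastest machine doesn't sleep, it computes.")
print(tokens)
```

Because only the final period is separated, a period in the middle of the input would stay attached to its word, which matches the assumption that the input is a single sentence.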


In [18]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sample_text)
print(len(words))
np.array(words)

77


array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China.The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway',
       'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

### ToktokTokenizer
- ToktokTokenizer is one of the newer tokenizers introduced by NLTK, available in the nltk.tokenize.toktok module.
- The tok-tok tokenizer is a general-purpose tokenizer that assumes the input has one sentence per line. Hence, only the final period is tokenized.
- If needed, we can remove the other periods from the words using regular expressions.

In [19]:
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
words = tokenizer.tokenize(sample_text)
print(len(words))
np.array(words)

81


array(['US', 'unveils', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China.The', 'US', 'has',
       'unveiled', 'the', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "'", ',', 'beating',
       'the', 'previous', 'record-holder', 'China', "'", 's', 'Sunway',
       'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

### RegexpTokenizer
- Two parameters matter most when building this tokenizer: the regex pattern, and the gaps parameter. If gaps is set to True, the pattern is used to find the gaps between the tokens; otherwise, it is used to find the tokens themselves.

In [20]:
# pattern to identify tokens themselves
TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,gaps=False)
words = regex_wt.tokenize(sample_text)
print(len(words))
np.array(words)

75


array(['US', 'unveils', 'world', 's', 'most', 'powerful', 'supercomputer',
       'beats', 'China', 'The', 'US', 'has', 'unveiled', 'the', 'world',
       's', 'most', 'powerful', 'supercomputer', 'called', 'Summit',
       'beating', 'the', 'previous', 'record', 'holder', 'China', 's',
       'Sunway', 'TaihuLight', 'With', 'a', 'peak', 'performance', 'of',
       '200', '000', 'trillion', 'calculations', 'per', 'second', 'it',
       'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight',
       'which', 'is', 'capable', 'of', '93', '000', 'trillion',
       'calculations', 'per', 'second', 'Summit', 'has', '4', '608',
       'servers', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts'], dtype='<U13')

In [21]:
# pattern to identify tokens by using gaps between tokens
GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,gaps=True)
words = regex_wt.tokenize(sample_text)
print(len(words))
np.array(words)

65


array(['US', 'unveils', "world's", 'most', 'powerful', 'supercomputer,',
       'beats', 'China.The', 'US', 'has', 'unveiled', 'the', "world's",
       'most', 'powerful', 'supercomputer', 'called', "'Summit',beating",
       'the', 'previous', 'record-holder', "China's", 'Sunway',
       'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second,', 'it', 'is', 'over',
       'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight,which', 'is',
       'capable', 'of', '93,000', 'trillion', 'calculations', 'per',
       'second.', 'Summit', 'has', '4,608', 'servers,', 'which',
       'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis',
       'courts.'], dtype='<U16')
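
The difference between matching tokens and matching gaps can be illustrated with the standard `re` module alone. This is a minimal sketch on a made-up string: `re.findall` plays the role of a token pattern, and `re.split` plays the role of a gap pattern.

```python
import re

text = "The US's 'Summit' has 4,608 servers."  # hypothetical example string

# Pattern describes the tokens themselves: runs of word characters.
tokens_by_match = re.findall(r"\w+", text)

# Pattern describes the gaps: runs of non-word characters (empty pieces dropped).
tokens_by_gaps = [t for t in re.split(r"\W+", text) if t]

# Gaps defined as whitespace only: punctuation stays attached to the words.
tokens_by_whitespace = re.split(r"\s+", text)

print(tokens_by_match)
print(tokens_by_gaps)
print(tokens_by_whitespace)
```

The first two lists are identical, since "everything that is a word character" and "everything between runs of non-word characters" describe the same tokens. The whitespace version keeps `US's` and `4,608` intact.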

### Inherited Tokenizers from RegexpTokenizer
- Besides the base RegexpTokenizer class, there are several derived classes that perform different types of word tokenization.
- The WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to tokenize sentences into independent alphabetic and non-alphabetic tokens.

In [22]:
wordpunkt_wt =nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sample_text)
print(len(words))
np.array(words)

92


array(['US', 'unveils', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "',", 'beating', 'the',
       'previous', 'record', '-', 'holder', 'China', "'", 's', 'Sunway',
       'TaihuLight', '.', 'With', 'a', 'peak', 'performance', 'of', '200',
       ',', '000', 'trillion', 'calculations', 'per', 'second', ',', 'it',
       'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight',
       ',', 'which', 'is', 'capable', 'of', '93', ',', '000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4', ',',
       '608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the',
       'size', 'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

In [23]:
### The WhitespaceTokenizer tokenizes sentences into words based on whitespace, like tabs, newlines, and spaces
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sample_text)
print(len(words))
np.array(words)

65


array(['US', 'unveils', "world's", 'most', 'powerful', 'supercomputer,',
       'beats', 'China.The', 'US', 'has', 'unveiled', 'the', "world's",
       'most', 'powerful', 'supercomputer', 'called', "'Summit',beating",
       'the', 'previous', 'record-holder', "China's", 'Sunway',
       'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second,', 'it', 'is', 'over',
       'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight,which', 'is',
       'capable', 'of', '93,000', 'trillion', 'calculations', 'per',
       'second.', 'Summit', 'has', '4,608', 'servers,', 'which',
       'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis',
       'courts.'], dtype='<U16')

In [24]:
# obtain the (start, end) boundaries of each token during the tokenize operation
# regex_wt is still the whitespace-gap tokenizer defined in In [21]
word_indices = list(regex_wt.span_tokenize(sample_text))
print(word_indices)
print(np.array([sample_text[start:end] for start,end in word_indices]))

[(0, 2), (3, 10), (11, 18), (19, 23), (24, 32), (33, 47), (48, 53), (54, 63), (64, 66), (67, 70), (71, 79), (80, 83), (84, 91), (92, 96), (97, 105), (106, 119), (120, 126), (127, 143), (144, 147), (148, 156), (157, 170), (171, 178), (179, 185), (186, 197), (198, 202), (203, 204), (205, 209), (210, 221), (222, 224), (225, 232), (233, 241), (242, 254), (255, 258), (259, 266), (267, 269), (270, 272), (273, 277), (278, 283), (284, 286), (287, 291), (292, 294), (295, 301), (302, 318), (319, 321), (322, 329), (330, 332), (333, 339), (340, 348), (349, 361), (362, 365), (366, 373), (374, 380), (381, 384), (385, 390), (391, 399), (400, 405), (406, 416), (417, 421), (422, 424), (425, 428), (429, 433), (434, 436), (437, 440), (441, 447), (448, 455)]
['US' 'unveils' "world's" 'most' 'powerful' 'supercomputer,' 'beats'
 'China.The' 'US' 'has' 'unveiled' 'the' "world's" 'most' 'powerful'
 'supercomputer' 'called' "'Summit',beating" 'the' 'previous'
 'record-holder' "China's" 'Sunway' 'TaihuLight.' '
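
Token boundaries of this kind can also be recovered with `re.finditer`, which exposes the start and end offset of every match. The sketch below uses only the standard library and a shortened, made-up prefix of the sample text.

```python
import re

text = "US unveils world's most powerful"  # shortened example string

# Each match of \S+ is a whitespace-delimited token; span() gives (start, end).
spans = [match.span() for match in re.finditer(r"\S+", text)]
print(spans)

# Slicing the original string with the spans gives back the tokens.
tokens = [text[start:end] for start, end in spans]
print(tokens)
```

Keeping spans rather than token strings is useful when later steps need to map results back to positions in the original text.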

### Building Robust Tokenizers with NLTK and spaCy

In [0]:
def tokenize_text(text):
  sentences = nltk.sent_tokenize(text)
  word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
  return word_tokens

In [26]:
sents = tokenize_text(sample_text)
np.array(sents)

array([list(['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China.The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the', 'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight', '.']),
       list(['With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.']),
       list(['Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.'])],
      dtype=object)

In [27]:
words = [word for sentence in sents for word in sentence]
print(len(words))
np.array(words)

79


array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China.The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

In [28]:
## In a similar way, we can leverage spaCy to perform sentence- and word-level tokenizations really quickly
import spacy
# the default pipeline already includes the tagger, parser, and entity recognizer
nlp = spacy.load('en_core_web_sm')
text_spacy =nlp(sample_text)

sents =np.array(list(text_spacy.sents))
print(len(sents))
sents

5


array([US unveils world's most powerful supercomputer, beats China.,
       The US has unveiled the world's most powerful supercomputer called 'Summit',beating the previous record-holder,
       China's Sunway TaihuLight.,
       With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,which is capable of 93,000 trillion calculations per second.,
       Summit has 4,608 servers, which reportedly take up the size of two tennis courts.],
      dtype=object)

In [29]:
sent_words = [[word.text for word in sent] for sent in sents]
np.array(sent_words)

array([list(['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.']),
       list(['The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'", "Summit',beating", 'the', 'previous', 'record', '-', 'holder']),
       list(['China', "'s", 'Sunway', 'TaihuLight', '.']),
       list(['With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.']),
       list(['Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.'])],
      dtype=object)

In [30]:
words = [word.text for word in text_spacy]
print(len(words))
np.array(words)

81


array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'", "Summit',beating", 'the',
       'previous', 'record', '-', 'holder', 'China', "'s", 'Sunway',
       'TaihuLight', '.', 'With', 'a', 'peak', 'performance', 'of',
       '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it',
       'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight',
       ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U15')

## Removing Accented Characters

- Text corpora often contain accented characters or letters. When the analysis targets English, we typically convert and standardize these characters to their closest ASCII equivalents.

In [31]:
import unicodedata
def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD',text).encode('ascii','ignore').decode('utf-8','ignore')
  return text 

remove_accented_chars('Sómě Áccěntěd těxt')

'Some Accented text'
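
The reason this works is that NFKD normalization decomposes an accented letter into a base letter followed by a combining mark, and the ASCII encoding step then drops the mark. The sketch below uses only `unicodedata`; the helper name `to_ascii` is hypothetical. It also shows a limitation: letters that have no decomposition are removed entirely.

```python
import unicodedata


def to_ascii(text):
    # decompose accented letters into base letter + combining mark,
    # then drop everything that is not plain ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = unicodedata.normalize("NFKD", e_acute)
print([unicodedata.name(ch) for ch in decomposed])

print(to_ascii("caf\u00e9"))    # accented e becomes plain e
print(to_ascii("Stra\u00dfe"))  # sharp s has no decomposition and is dropped
```

For text where such losses matter (for example German or Scandinavian names), a transliteration table may be a better fit than plain accent stripping.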

### Stemming
- Morphemes are the smallest meaningful units in any natural language.
- Morphemes consist of units that are stems and affixes.
- Affixes are units like prefixes, suffixes, and so on, which are attached to word stems to change their meaning or create a new word altogether.
- Word stems are also often known as the base form of a word, and we can create new words by attaching affixes to them. This process is known as inflection.
- The reverse of this is obtaining the base form of a word from its inflected form, and this is known as stemming.
- Consider the word “JUMP”. You can add affixes to it and form several new words like “JUMPS”, “JUMPED”, and “JUMPING”. In this case, the base word is “JUMP” and this is the word stem. If we were to carry out stemming on any of its three inflected forms, we would get the base form.
- Stemming helps us standardize words to their base stem irrespective of their inflections, which helps many applications like classifying or clustering text, or even information retrieval.


In [32]:
# Porter Stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem('jumping'),ps.stem('jumps'),ps.stem('jumped'),ps.stem('lying'),ps.stem('strange')

('jump', 'jump', 'jump', 'lie', 'strang')

In [33]:
# Lancaster Stemmer
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
ls.stem('jumping'),ls.stem('jumps'),ls.stem('jumped'),ls.stem('lying'),ls.stem('are'),ls.stem('strange')

('jump', 'jump', 'jump', 'lying', 'ar', 'strange')

- Observation: the behavior of this stemmer differs from that of the Porter stemmer. Besides these two, there are several other stemmers, including RegexpStemmer, which lets you build your own stemmer from user-defined rules, and SnowballStemmer, which supports stemming in more than a dozen languages besides English.

In [34]:
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)
rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped'),rs.stem('lying'),rs.stem('strange')

('jump', 'jump', 'jump', 'ly', 'strange')
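
The rule-based idea behind this stemmer is simple enough to sketch with the standard `re` module: strip a matching suffix, but leave words shorter than a minimum length untouched. The function name `regexp_stem` and its parameter names are assumptions for illustration, and the sketch only approximates what the NLTK class does.

```python
import re


def regexp_stem(word, pattern=r"ing$|s$|ed$", min_len=4):
    """Strip the first suffix matched by `pattern`; leave short words unchanged."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


words = ["jumping", "jumps", "jumped", "lying", "strange", "is"]
stems = [regexp_stem(w) for w in words]
print(stems)
```

The result for `lying` shows the weakness of blind suffix stripping: the stem `ly` is not a word, which is the kind of outcome lemmatization is designed to avoid.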

In [35]:
# Snowball Stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")
print('Supported Languages:', SnowballStemmer.languages)

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [36]:
# stemming on German words
# autobahnen -> highways, autobahn -> highway
# springen -> jumping, spring -> jump
ss.stem('autobahnen'),ss.stem('springen')

('autobahn', 'spring')

### Lemmatization
- Lemmatization is very similar to stemming: we remove word affixes to get to a base form of the word.
- In this case, however, the base form is known as the root word rather than the root stem.
- The <b> difference between the two </b> is that the root stem may not always be a lexicographically correct word, i.e., it may not be present in the dictionary, whereas the root word, also known as the lemma, will always be present in the dictionary.
- Lemmatization is considerably slower than stemming because of an additional step: the root form or lemma is formed by removing the affix from the word if and only if the lemma is present in the dictionary.
- NLTK has a robust lemmatization module that uses WordNet together with the word’s syntax and semantics, such as part of speech and context, to get the root word or lemma.

In [37]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [38]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars','n'))
print(wnl.lemmatize('men', 'n'))
print(wnl.lemmatize('lying', 'n'))
print(wnl.lemmatize('jumped', 'n'))
print(wnl.lemmatize('jumps', 'n'))
# lemmatize verbs
print(wnl.lemmatize('jumps', 'v'))
# with the noun POS, 'jumping' is returned unchanged
print(wnl.lemmatize('jumping','n'))
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

car
men
lying
jumped
jump
jump
jumping
run
eat
sad
fancy


- This snippet shows how each word is converted to its base form using lemmatization, which helps us standardize words.
- The code leverages the WordNetLemmatizer class, which internally uses the morphy() function belonging to the WordNetCorpusReader class.
- This function finds the base form or lemma for a given word and part of speech by checking the WordNet corpus, recursively removing affixes from the word until a match is found in WordNet.
  - If no match is found, the input word is returned unchanged. The part of speech is extremely important: if it is wrong, the lemmatization will not be effective, as you can see in the following snippet.

In [39]:
# ineffective lemmatization: wrong part of speech ('ate' as a noun, 'fancier' as a verb)
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

ate
fancier
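
The role of the dictionary check and of the part of speech can be illustrated with a toy lemmatizer. Everything in this sketch is hypothetical: the tiny `LEXICON`, `EXCEPTIONS`, and `SUFFIX_RULES` tables merely stand in for WordNet, and the logic is a simplified imitation of the lookup described above, not NLTK's implementation.

```python
# Toy stand-ins for WordNet; all entries are made up for illustration.
LEXICON = {
    "n": {"car", "jump"},
    "v": {"jump", "run", "eat"},
}
EXCEPTIONS = {
    "n": {},
    "v": {"ate": "eat", "ran": "run"},  # irregular forms listed explicitly
}
SUFFIX_RULES = {
    "n": ["s"],
    "v": ["s", "ing", "ed"],
}


def toy_lemmatize(word, pos):
    """Return a lemma only if it is found in the lexicon for the given POS."""
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    if word in LEXICON[pos]:
        return word
    for suffix in SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)]
            # accept the stripped form only when the dictionary knows it
            if candidate in LEXICON[pos]:
                return candidate
    return word  # no match: the input is returned unchanged


for word, pos in [("cars", "n"), ("jumping", "v"), ("jumping", "n"), ("ate", "v"), ("ate", "n")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

With the wrong part of speech, neither the exception table nor the suffix rules apply, so `ate` as a noun comes back unchanged, just as in the snippet above.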
