## 03_01 Tokenization

Tokenization is the process of splitting a text string into individual tokens. Tokens may be words or punctuation marks.

In [40]:
import nltk
import os


#Read the base file into a raw text variable
#(the with-block closes the file automatically, even if reading fails)
with open(os.path.join(os.getcwd(), "Description.txt"), 'rt') as base_file:
    raw_text = base_file.read()

#Extract tokens
token_list = nltk.word_tokenize(raw_text)
print("Token List : ",token_list[:20])
print("\n Total Tokens : ",len(token_list))

Token List :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', ',', 'process', ',', 'and', 'store', 'data', ',', 'data', 'engineers', 'and']

 Total Tokens :  110
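
To make the idea of splitting text into word and punctuation tokens concrete, here is a minimal sketch that uses a regular expression. It is a simplified stand-in for illustration, not the algorithm that `nltk.word_tokenize` uses; the function name `simple_tokenize` and the sample sentence are assumptions made for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens.

    Simplified illustration only; this is not NLTK's tokenizer.
    """
    # \w+(?:-\w+)* keeps hyphenated words such as "data-science" together;
    # [^\w\s] matches any single character that is neither a word character
    # nor whitespace, so each punctuation mark becomes its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

sample = "Data engineers stream, process, and store data."
tokens = simple_tokenize(sample)
print(tokens)
print("Total tokens:", len(tokens))
```

As with the NLTK output above, each comma and the final period come out as separate tokens.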


## 03_02 Cleansing Text

This section shows two cleansing steps: removing punctuation tokens and converting the remaining tokens to lower case.

#### Remove Punctuation

In [41]:
#Use Punkt's PunktToken.is_non_punct to keep only the tokens that are not punctuation
token_list2 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list))
print("Token List after removing punctuation : ",token_list2[:20])
print("\nTotal tokens after removing punctuation : ", len(token_list2))

Token List after removing punctuation :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', 'process', 'and', 'store', 'data', 'data', 'engineers', 'and', 'data-science', 'DevOps', 'specialists']

Total tokens after removing punctuation :  100


#### Convert to Lower Case

In [42]:
token_list3=[word.lower() for word in token_list2 ]
print("Token list after converting to lower case : ", token_list3[:20])
print("\nTotal tokens after converting to lower case : ", len(token_list3))

Token list after converting to lower case :  ['in', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', 'process', 'and', 'store', 'data', 'data', 'engineers', 'and', 'data-science', 'devops', 'specialists']

Total tokens after converting to lower case :  100
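
The two cleansing steps can also be combined into a single pass. The sketch below is a plain-Python approximation: it keeps a token when it contains at least one letter or digit, which is an assumption chosen for illustration and not Punkt's exact definition of punctuation. The helper name `cleanse` and the sample tokens are hypothetical.

```python
def cleanse(tokens):
    """Drop punctuation-only tokens and lowercase the rest in one pass."""
    # Keep a token if it has at least one alphanumeric character.
    # Simplified rule for illustration; not Punkt's exact definition.
    return [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]

sample_tokens = ['In', 'order', ',', 'data-science', 'DevOps', '.']
cleaned = cleanse(sample_tokens)
print(cleaned)
```

Hyphenated tokens such as `data-science` survive because they contain letters, while `,` and `.` are dropped.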


## 03_03 Stop word Removal

Remove stop words using the standard English stop word list available in NLTK.

In [43]:
#Download the standard stopword list
nltk.download('stopwords')
from nltk.corpus import stopwords

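#Remove stopwords (build the stop word set once instead of reloading the list for every token)
stop_words = set(stopwords.words('english'))
token_list4 = [token for token in token_list3 if token not in stop_words]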
print("Token list after removing stop words : ", token_list4[:20])
print("\nTotal tokens after removing stop words : ", len(token_list4))

Token list after removing stop words :  ['order', 'construct', 'data', 'pipelines', 'networks', 'stream', 'process', 'store', 'data', 'data', 'engineers', 'data-science', 'devops', 'specialists', 'must', 'understand', 'combine', 'multiple', 'big', 'data']

Total tokens after removing stop words :  62


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nehadurani/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
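
The order of the steps matters here: the membership test is case-sensitive, and the entries in NLTK's English stop word list are lowercase, so tokens should be lowercased before filtering. The sketch below illustrates this with a tiny hypothetical stop word set (`demo_stop_words`); it is not the NLTK list, and `remove_stop_words` is a helper name assumed for the example.

```python
# Hypothetical miniature stop word set, for illustration only.
# NLTK's English list is much longer, but its entries are likewise lowercase.
demo_stop_words = {'in', 'to', 'and', 'that'}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok not in stop_words]

mixed_case = ['In', 'order', 'to', 'construct', 'data', 'pipelines']

# Without lowercasing, the capitalized 'In' slips through the filter.
not_lowered = remove_stop_words(mixed_case, demo_stop_words)
print(not_lowered)

# After lowercasing, 'in' is recognized as a stop word and removed.
lowered = remove_stop_words([tok.lower() for tok in mixed_case], demo_stop_words)
print(lowered)
```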


## 03_04 Stemming

In [44]:
#Use NLTK's PorterStemmer class for stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#Stem data
token_list5 = [stemmer.stem(word) for word in token_list4 ]
print("Token list after stemming : ", token_list5[:20])
print("\nTotal tokens after Stemming : ", len(token_list5))

Token list after stemming :  ['order', 'construct', 'data', 'pipelin', 'network', 'stream', 'process', 'store', 'data', 'data', 'engin', 'data-sci', 'devop', 'specialist', 'must', 'understand', 'combin', 'multipl', 'big', 'data']

Total tokens after Stemming :  62
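
The output above shows that Porter stemming strips suffixes by rule, so a stem is not necessarily a dictionary word (for example `pipelin` and `engin`). The short sketch below applies the same `PorterStemmer` to a few words taken from the token list so the word-to-stem mapping can be inspected side by side; the variable names `words` and `stems` are assumptions made for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words taken from the token list above
words = ['pipelines', 'engineers', 'specialists', 'combine', 'multiple']

# Map each word to its stem so the pairs can be compared directly
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(word, "->", stem)
```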


## 03_05 Lemmatization

In [45]:
#Use the wordnet library to map words to their lemmatized form
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nehadurani/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nehadurani/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [46]:

lemmatizer = WordNetLemmatizer()


In [47]:
token_list6 = [lemmatizer.lemmatize(word) for word in token_list4 ]


LookupError: 
**********************************************************************
  Resource omw-1.4 not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('omw-1.4')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/omw-1.4

  Searched in:
    - '/Users/nehadurani/nltk_data'
    - '/Users/nehadurani/anaconda3/nltk_data'
    - '/Users/nehadurani/anaconda3/share/nltk_data'
    - '/Users/nehadurani/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
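
The `LookupError` above reports that the `omw-1.4` resource is missing from the local NLTK data directories. A minimal sketch of one way to resolve it, following the instruction in the error message, is to download `omw-1.4` alongside `wordnet` before lemmatizing. This assumes network access to the NLTK download server on the first run; the helper name `lemmatize_tokens` is hypothetical.

```python
import nltk
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens):
    """Download the WordNet resources if needed, then lemmatize each token."""
    # nltk.download skips packages that are already up to date,
    # so repeated calls are cheap after the first run.
    nltk.download('wordnet')
    nltk.download('omw-1.4')  # the resource named in the LookupError
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in tokens]

# Example use with the notebook's variables (needs network access on first run):
# token_list6 = lemmatize_tokens(token_list4)
```

If the error persists after the download, restarting the kernel and rerunning the cells may be necessary so that the corpus is reloaded.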


#### Comparison of tokens between raw, stemming and lemmatization

In [51]:
#Check for token Specialists 
print( "Raw : ", token_list4[13]," , Stemmed : ", token_list5[13] )


Raw :  specialists  , Stemmed :  specialist
