## Data Preprocessing
- Tokenization: split sentences into individual words (tokens).
- Removing unnecessary punctuation and markup tags.
- Removing stop words: frequent words such as "the" and "is" that carry little specific meaning.
- Stemming: reducing words to a root form by stripping characters, usually a suffix, to remove inflection. The result is not always a dictionary word.
- Lemmatization: another way to remove inflection, which determines the part of speech and looks the word up in a detailed lexical database of the language.
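
The punctuation and tag removal step is not demonstrated in the cells that follow. A minimal sketch using only the standard library might look like this. The helper name `clean_text` and the sample string are made up for illustration, and the regular expression treats anything between angle brackets as a tag, which is only adequate for simple markup.

```python
import re
import string

def clean_text(text):
    """Strip simple markup tags and ASCII punctuation (hypothetical helper)."""
    # Replace anything that looks like <tag> or </tag> with a space
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Delete every ASCII punctuation character
    no_punct = no_tags.translate(str.maketrans("", "", string.punctuation))
    # Collapse the runs of whitespace left behind
    return " ".join(no_punct.split())

sample = "<p>The quick, brown fox jumps over the lazy dog!</p>"
print(clean_text(sample))  # The quick brown fox jumps over the lazy dog
```

For real HTML, a proper parser is likely a safer choice than a regular expression.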

In [0]:
# NLTK library
import nltk
from nltk.tokenize import word_tokenize

In [4]:
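# Download the tokenizer models and the stop-word lists (needed once per environment).
# Newer NLTK releases may additionally need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')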

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
# Split the text into word tokens
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")

print(tokens)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


In [7]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

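# Drop stop words. The comparison is case-sensitive and the list is lowercase,
# so the capitalized 'The' is kept while 'over' and 'the' are removed.
tokens = [w for w in tokens if w not in stop_words]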
print(tokens)

['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
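
The capitalized 'The' survives in the output above because the stop-word list is lowercase and the membership test is case-sensitive. One possible adjustment, sketched here, is to lowercase each token for the comparison only. To stay self-contained, the snippet uses a tiny hand-written stop-word set as a stand-in for `stopwords.words('english')`; that set and the name `filtered` are assumptions for illustration.

```python
# Tiny stand-in for NLTK's English stop-word list (assumption for illustration)
stop_words = {"the", "is", "over"}

tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Compare the lowercased token, but keep the original spelling in the result
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```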


In [8]:
# NLTK provides several stemmers, including the Porter,
# Lancaster, and Snowball stemmers
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

# Stem each token; the Porter stemmer also lowercases its input by default
stems = [porter.stem(t) for t in tokens]
print(stems)

['the', 'quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
