# Tokenization :~

Tokenization is the process of splitting an input sequence into smaller units called tokens.

-> One can think of a token as a useful unit for semantic processing.

-> It can be a word, a sentence, a paragraph, and so on.
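
To make the idea of token granularity concrete, here is a minimal sketch that splits one string into paragraph, sentence and word tokens using only the standard `re` module. The sample `document` string and the splitting patterns are assumptions chosen for illustration; real sentence splitting needs to handle abbreviations and other edge cases that this naive version ignores.

```python
import re

# Hypothetical sample text with two paragraphs.
document = (
    "Tokens can be words. They can also be sentences!\n\n"
    "Or even whole paragraphs."
)

# Paragraph tokens: split on blank lines.
paragraphs = re.split(r"\n\s*\n", document)

# Sentence tokens: naive split after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", paragraphs[0])

# Word tokens: split the first sentence on whitespace.
words = sentences[0].split()

print(paragraphs)  # 2 paragraph tokens
print(sentences)   # 2 sentence tokens
print(words)       # 4 word tokens
```

The same text yields very different token lists depending on the unit we choose, which is why the choice of tokenizer depends on the task.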

In [3]:
import nltk
text = "Hello, I'm Parth Shingari!"

In [4]:
# An example of a simple whitespace tokenizer: splits on spaces, tabs and newlines only
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

['Hello,', "I'm", 'Parth', 'Shingari!']

In [5]:
# Treebank tokenizer: separates punctuation and splits contractions such as "I'm"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['Hello', ',', 'I', "'m", 'Parth', 'Shingari', '!']

In [6]:
# WordPunct tokenizer: splits text into runs of word characters and runs of punctuation
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

['Hello', ',', 'I', "'", 'm', 'Parth', 'Shingari', '!']
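
To see why the whitespace and word/punctuation outputs differ, the two behaviours can be approximated with plain Python. This is a rough sketch rather than NLTK's own code: it assumes that splitting on whitespace and matching the pattern `\w+|[^\w\s]+` are close enough stand-ins for the two tokenizers on this input. The Treebank tokenizer applies more elaborate rules and is not reproduced here.

```python
import re

text = "Hello, I'm Parth Shingari!"

# Whitespace tokenization: punctuation stays glued to the words.
whitespace_tokens = text.split()

# Word/punctuation tokenization: alternate between runs of word
# characters (\w+) and runs of non-word, non-space characters.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(word_punct_tokens)
```

Note how the apostrophe in "I'm" becomes its own token under the second approach, which is usually not what we want for English contractions.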

# Token normalization :~

When we normalize a natural language resource, we attempt to reduce the variation in it, so that different surface forms of the same content are treated alike.

When doing text normalization, we should know exactly what we want to normalize and why. The purpose of the input also helps shape the steps we apply to it. The two things we are most interested in normalizing are *sentence structure* and *vocabulary*.

In practice, we can normalize these two aspects by breaking the task into simpler problems, such as:

→ Removal of duplicate whitespaces and punctuation.

→ Removal or substitution of special characters/emojis.

→ Substitution of contractions (very common in English; e.g.: ‘I’m’→‘I am’).

→ Conversion of word numerals into numbers.

→ Spell correction (a word can be misspelled in countless ways, so spell correction reduces vocabulary variation by mapping misspellings to a single correct form). This is very important if you’re dealing with open user inputs, such as tweets, IMs and emails.

→ Removal of gender/tense/degree variation with Stemming or Lemmatization.

→ Substitution of rare words for more common synonyms.
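
A few of the steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the `CONTRACTIONS` and `NUMERALS` tables are tiny hypothetical lookups, `normalize` is a made-up helper name, and lowercasing is an extra simplification so that the lookups match. A real pipeline would need far larger resources and more careful rules.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
NUMERALS = {"one": "1", "two": "2", "three": "3"}


def normalize(text: str) -> str:
    """Apply a few simple normalization steps to a string."""
    # Simplifying assumption: lowercase everything so lookups match.
    text = text.lower()
    # Unify curly apostrophes so the contraction lookup matches.
    text = text.replace("\u2019", "'")
    # Collapse repeated punctuation marks ("!!!" -> "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Expand contractions and word numerals token by token.
    words = []
    for word in text.split():
        word = CONTRACTIONS.get(word, word)
        word = NUMERALS.get(word, word)
        words.append(word)
    # Joining on single spaces also removes duplicate whitespace.
    return " ".join(words)


print(normalize("I'm   buying  two   books!!!"))
```

Each step is small on its own, but together they map many surface variants of a sentence onto one form.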

# Stemming :~

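Stemming is the process of reducing a word to its stem, the base form to which affixes such as suffixes and prefixes attach. Lemmatization is a related process that reduces a word to its dictionary form, known as a lemma.

Ideally, a stemming algorithm would reduce the words “chocolates”, “chocolatey” and “choco” to the root word “chocolate”, and “retrieval”, “retrieved” and “retrieves” to the stem “retrieve”. In practice, the stems produced by rule-based algorithms such as the Porter stemmer are often truncated forms rather than dictionary words.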
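
The core idea of suffix stripping can be shown with a toy stemmer. This is a deliberately naive sketch and not the Porter algorithm: the `SUFFIXES` tuple, the minimum stem length of 3 and the name `naive_stem` are all assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["retrieved", "retrieves", "retrieving", "is"]:
    print(w, "->", naive_stem(w))
```

All three forms of "retrieve" collapse to the same stem, but that stem is a truncated string rather than a real word, which is typical of stemmers.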

In [7]:
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)

In [8]:
# Porter stemmer: rule-based suffix stripping
stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

"hello , I 'm parth shingari !"

In [9]:
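nltk.download('wordnet')  # WordNet data is required by the lemmatizer
# Lemmatization maps each token to its dictionary form (treated as a noun by default)
lemmatizer = nltk.stem.WordNetLemmatizer()
" ".join(lemmatizer.lemmatize(token) for token in tokens)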

[nltk_data] Downloading package wordnet to C:\Users\LENOVO
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


"Hello , I 'm Parth Shingari !"

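The lemmatized output above is identical to the input because none of these tokens has a different dictionary form when treated as a noun. Conceptually, a lemmatizer looks words up instead of chopping off suffixes. The sketch below illustrates that idea with a hypothetical miniature table, `LEMMAS`, keyed by word and part of speech. The real `WordNetLemmatizer` is more involved, and `lookup_lemma` is a made-up name.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
LEMMAS = {
    ("feet", "n"): "foot",
    ("mice", "n"): "mouse",
    ("was", "v"): "be",
    ("better", "a"): "good",
}


def lookup_lemma(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); unknown pairs come back unchanged."""
    return LEMMAS.get((word.lower(), pos), word)


tokens = ["Hello", ",", "I", "'m", "Parth", "Shingari", "!"]
print(" ".join(lookup_lemma(token) for token in tokens))

# The part of speech changes the result for the same word.
print(lookup_lemma("feet"), lookup_lemma("was", "v"), lookup_lemma("was"))
```

Because the lookup depends on the part of speech, passing the right tag matters: in this sketch "was" only maps to "be" when it is looked up as a verb.
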
# Liked this project?

Star it on [GitHub](https://github.com/parthshingari28/Text-Mining-and-Analytics-using-Python)!

# Reach out!

[@parthshingari28](https://github.com/parthshingari28)