-----------------------
#### Tokenization
- NLTK (Natural Language Toolkit) provides various tokenization methods to suit different use cases and requirements in Natural Language Processing (NLP). 
- The examples below cover the main tokenizers that NLTK supports.
----------------------------------------

#### What is Tokenization?
A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. Tokenization is the process of splitting a string into a list of tokens.

In [61]:
#pip install contractions

In [43]:
mystring = "My favorite color is blue"

mystring.split()

['My', 'favorite', 'color', 'is', 'blue']

In [44]:
mystring = "My favorite colors are blue, red, and green."

In [45]:
mystring.split()

['My', 'favorite', 'colors', 'are', 'blue,', 'red,', 'and', 'green.']

The punctuation marks are grouped with their adjacent words (e.g. `blue,`). This is problematic for NLP applications, because the goal of tokenization is generally to divide a set (corpus) of documents into a common set of building blocks that can then be used as a basis for comparison. It is no good if “blue” in "My favorite color is blue" does not match “blue” in "My favorite colors are blue, red, and green." because the latter is tokenized as `blue,` rather than `blue`.
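
One way to see the fix is to pull punctuation out into its own tokens so that "blue" matches in both sentences. The following is a minimal sketch using only Python's `re` module; the pattern and the helper name `simple_tokenize` are illustrative assumptions, not what NLTK does internally.

```python
import re

def simple_tokenize(text):
    # Runs of word characters become one token; every other
    # non-space character (punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

a = simple_tokenize("My favorite color is blue")
b = simple_tokenize("My favorite colors are blue, red, and green.")

print(b)
# "blue" is now an identical token in both sentences
print("blue" in a and "blue" in b)
```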

#### Word Tokenization (Word Tokenizer)
---------------------------------------

In [53]:
# word_tokenize applies punctuation and contraction rules,
# so it does more than str.split()
from nltk.tokenize import word_tokenize

In [54]:
text  = "NLTK provides word tokenization capabilities."
words = word_tokenize(text)

words

['NLTK', 'provides', 'word', 'tokenization', 'capabilities', '.']

Pros:
- Straightforward and widely applicable
- Handles most common cases well

Cons:
- May not handle domain-specific or noisy text effectively

In [56]:
# Text containing domain-specific terminology
text = "DXC The protein-protein interaction network is abc-xyz essential for understanding biological processes."

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['DXC', 'The', 'protein-protein', 'interaction', 'network', 'is', 'abc-xyz', 'essential', 'for', 'understanding', 'biological', 'processes', '.']


`word_tokenize` keeps the hyphenated terms 'protein-protein' and 'abc-xyz' as single tokens, which is the desired behavior here.

In [57]:
# Text containing abbreviations with periods
text = "Dr. Smith received his Ph.D. from NYU."

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['Dr.', 'Smith', 'received', 'his', 'Ph.D.', 'from', 'NYU', '.']


In [58]:
# Text with unconventional punctuation
text = "This is a sentence..with..multiple..periods..."

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['This', 'is', 'a', 'sentence', '..', 'with', '..', 'multiple', '..', 'periods', '...']


In [59]:
# Text containing special characters and emoticons
text = "I'm feeling 😊 today! #happy"

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['I', "'m", 'feeling', '😊', 'today', '!', '#', 'happy']


In this example `word_tokenize` keeps the emoji 😊 as a single token, but it splits the hashtag `#happy` into `#` and `happy` and the contraction `I'm` into `I` and `'m`, so social-media markers are no longer available as single tokens.

In [7]:
# Text containing contractions
text = "I don't know what I'd do without NLTK."

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['I', 'do', "n't", 'know', 'what', 'I', "'d", 'do', 'without', 'NLTK', '.']


Explanation:

- NLTK's `word_tokenize` splits contractions such as "don't" and "I'd" into two tokens (`do` + `n't`, `I` + `'d`), following Penn Treebank conventions instead of keeping each contraction whole.

In [63]:
import contractions
print(contractions.fix("you've"))
print(contractions.fix("don't"))

you have
do not
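
To illustrate the idea behind expanding contractions before tokenizing, here is a small standard-library sketch. The `EXPANSIONS` table and the `expand` helper are hypothetical and cover only a handful of forms; the `contractions` package handles far more cases.

```python
import re

# Hypothetical, deliberately tiny lookup table (lowercase keys).
# "I'd" is ambiguous ("I would" / "I had"); this table simply picks one.
EXPANSIONS = {
    "don't": "do not",
    "can't": "cannot",
    "i'd": "I would",
    "you've": "you have",
}

# Match any known contraction as a whole word, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in EXPANSIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand(text):
    # Replace each match with its expansion from the table.
    return PATTERN.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)

expanded = expand("I don't know what I'd do without NLTK.")
print(expanded)
```

The expanded string can then be passed to any tokenizer, which no longer has to deal with apostrophes in these words.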


In [64]:
# Text containing internet slang and abbreviations
text = "LOL!!, IDK, BRB, BTW, TTYL"

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['LOL', '!', '!', ',', 'IDK', ',', 'BRB', ',', 'BTW', ',', 'TTYL']


Explanation:

- NLTK's `word_tokenize` keeps each abbreviation (`LOL`, `IDK`, `BRB`, `BTW`, `TTYL`) as a single token, but it does not interpret them, and the repeated `!!` is split into two separate `!` tokens, so the output is dominated by punctuation.
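
If only the abbreviations matter downstream, the punctuation tokens can be filtered out after tokenizing. This is a minimal sketch that reuses the token list printed above as a literal, so it runs without NLTK; the variable names are illustrative.

```python
# Output of word_tokenize("LOL!!, IDK, BRB, BTW, TTYL"), copied from above
tokens = ['LOL', '!', '!', ',', 'IDK', ',', 'BRB', ',', 'BTW', ',', 'TTYL']

# Keep only tokens that contain at least one letter or digit
abbreviations = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(abbreviations)
```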

In [65]:
# Text containing technical terms and symbols
text = "The value of π is approximately 3.14159."

# Tokenize the text
words = word_tokenize(text)

print("Word Tokens:", words)

Word Tokens: ['The', 'value', 'of', 'π', 'is', 'approximately', '3.14159', '.']


**Use Cases of word_tokenize**
- Text classification
- Named entity recognition
- Part-of-speech tagging

--------------------------------------
#### Sentence Tokenization (Sentence Tokenizer)
----------------------------------

In [66]:
from nltk.tokenize import sent_tokenize

In [67]:
text = "NLTK provides sentence tokenization capabilities. This is an example sentence."
sentences = sent_tokenize(text)

sentences

['NLTK provides sentence tokenization capabilities.',
 'This is an example sentence.']

**Use Cases**
- Text summarization
- Machine translation
- Text segmentation

**Pros**
- Useful for segmenting text into meaningful units
- Handles various sentence structures and punctuation
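
To see why a dedicated sentence tokenizer is useful, compare it with a naive rule that splits after every `.`, `!`, or `?` followed by whitespace. This sketch uses only `re`; `naive_sent_split` is a hypothetical helper and is not how `sent_tokenize` works.

```python
import re

def naive_sent_split(text):
    # Split at whitespace that follows sentence-ending punctuation.
    return re.split(r"(?<=[.!?])\s+", text)

simple = naive_sent_split(
    "NLTK provides sentence tokenization capabilities. This is an example sentence."
)
tricky = naive_sent_split("Dr. Smith received his Ph.D. from NYU. He teaches NLP.")

print(simple)
print(tricky)
```

The first call gives the two expected sentences. The second call breaks after "Dr." and "Ph.D." as well, which is the kind of case a trained sentence tokenizer is designed to handle.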

-------------------------------
#### Treebank Tokenizer: 
-------------------------------------

Splits text into words using the Penn Treebank tokenization rules.


**Key Characteristics** 

- `Word Segmentation`: The Treebank tokenizer splits text into individual words or tokens based on whitespace characters, punctuation marks, and other language-specific rules.

- `Punctuation Handling`: It handles punctuation marks appropriately, considering them as separate tokens when not part of a word.

- `Hyphenated Words`: Hyphenated words like "state-of-the-art" are treated as single tokens, maintaining their integrity.

- `Abbreviations`: Periods inside the text stay attached to their word, so abbreviations such as "U.S.A." or "Mr." remain single tokens; only a period at the very end of the input is split off.

In [68]:
from nltk.tokenize import TreebankWordTokenizer

In [69]:
tokenizer = TreebankWordTokenizer()

text = "NLTK's Treebank tokenizer splits text into words accurately and state-of-the-art"
tokens = tokenizer.tokenize(text)

print(tokens)

['NLTK', "'s", 'Treebank', 'tokenizer', 'splits', 'text', 'into', 'words', 'accurately', 'and', 'state-of-the-art']


Try the following hyphenated words:

- Well-known: Familiar or widely recognized by many people.
- High-speed: Referring to something that moves or operates at a fast pace.
- Mother-in-law: The mother of one's spouse.
- Self-esteem: Confidence and satisfaction in oneself.
- Run-down: In poor or deteriorated condition.
- Second-hand: Previously owned or used by someone else.
- Up-to-date: Current or modern, reflecting the latest information or trends.
- Co-founder: One of the individuals who jointly establishes a company or organization.
- Off-the-shelf: Ready-made or pre-packaged, available for immediate use.
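
The words in the list can be checked in one loop. This sketch assumes NLTK is installed; `TreebankWordTokenizer` is rule-based, so no extra data download should be needed. The list name `hyphenated` is illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

hyphenated = ["Well-known", "High-speed", "Mother-in-law", "Self-esteem",
              "Run-down", "Second-hand", "Up-to-date", "Co-founder",
              "Off-the-shelf"]

for word in hyphenated:
    # Each hyphenated word is expected to come back as a one-element list
    print(tokenizer.tokenize(word))
```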

In [70]:
text = "i've, it can't be done"
tokens = tokenizer.tokenize(text)

print(tokens)

['i', "'ve", ',', 'it', 'ca', "n't", 'be', 'done']


In [71]:
text = "He's going to the park tomorrow."
tokens = tokenizer.tokenize(text)
print(tokens)

['He', "'s", 'going', 'to', 'the', 'park', 'tomorrow', '.']


In [22]:
text = "They'll arrive late tonight."
tokens = tokenizer.tokenize(text)
print(tokens)

['They', "'ll", 'arrive', 'late', 'tonight', '.']


In [23]:
text = "Can't you see what's happening?"
tokens = tokenizer.tokenize(text)
print(tokens)

['Ca', "n't", 'you', 'see', 'what', "'s", 'happening', '?']


- The `TreebankWordTokenizer` splits contractions into two tokens (for example `ca` + `n't` and `He` + `'s`), which is a problem when each contraction should count as one word.

- If you require contractions to be treated as single tokens, you may need to use a different tokenizer or perform additional pre-processing steps.
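
One possible post-processing step is to glue the split pieces back onto the preceding token. This is a heuristic sketch, not an NLTK feature: `merge_contractions` is a hypothetical helper, and it would also merge other apostrophe-initial tokens (such as `'em`) into the previous word.

```python
def merge_contractions(tokens):
    merged = []
    for tok in tokens:
        # Treebank-style clitics: "n't", or a piece starting with an apostrophe
        is_clitic = tok.lower() == "n't" or (tok.startswith("'") and len(tok) > 1)
        if is_clitic and merged:
            merged[-1] += tok      # attach to the previous token
        else:
            merged.append(tok)
    return merged

# Token list copied from the Treebank output for "i've, it can't be done"
tokens = ['i', "'ve", ',', 'it', 'ca', "n't", 'be', 'done']
merged = merge_contractions(tokens)
print(merged)
```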

Pros:
- Accurate tokenization based on well-defined syntactic conventions.
- Handles a wide range of text formats and styles effectively.
- Retains the integrity of hyphenated words and abbreviations.

Cons:
- May tokenize certain non-standard or domain-specific terms inaccurately.
- Requires additional preprocessing for specific tokenization requirements beyond standard English text.

-----------------------
#### WordPunct Tokenizer
-----------------------
- Splits text into runs of word characters and runs of punctuation.
- It is not based on the Penn Treebank tokenizer and has no special rules for contractions or abbreviations. The NLTK class is named `WordPunctTokenizer` ("Punct" for punctuation) and is unrelated to the Punkt sentence tokenizer.


- The `WordPunctTokenizer` uses a single regular expression, `\w+|[^\w\s]+`, so every run of punctuation characters (periods, commas, hyphens, parentheses, quotation marks, and so on) becomes its own token.
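
NLTK's documentation gives the pattern behind this tokenizer as `\w+|[^\w\s]+`. Assuming that pattern, the behavior can be tried with the standard `re` module alone; `wordpunct_like` is a hypothetical helper name.

```python
import re

def wordpunct_like(text):
    # Either a run of word characters, or a run of characters
    # that are neither word characters nor whitespace.
    return re.findall(r"\w+|[^\w\s]+", text)

contraction = wordpunct_like("I can't believe it!")
symbols = wordpunct_like("effectively. ! ! ! @@##")

print(contraction)
print(symbols)
```

The apostrophe in "can't" becomes its own token, while `@@##` stays together because it is one uninterrupted run of punctuation.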


In [72]:
from nltk.tokenize import WordPunctTokenizer

In [73]:
tokenizer = WordPunctTokenizer()

In [74]:
text = "NLTK's Word Punkt Tokenizer splits text into words effectively. ! ! ! @@##"
tokens = tokenizer.tokenize(text)
print(tokens)

['NLTK', "'", 's', 'Word', 'Punkt', 'Tokenizer', 'splits', 'text', 'into', 'words', 'effectively', '.', '!', '!', '!', '@@##']


In [75]:
# Handling Contractions
# not good
text = "I can't believe it!"
tokens = tokenizer.tokenize(text)
print(tokens)

['I', 'can', "'", 't', 'believe', 'it', '!']


In [77]:
text = "I'm feeling 😊😊😊😊 today!"
tokens = tokenizer.tokenize(text)
print(tokens)

['I', "'", 'm', 'feeling', '😊😊😊😊', 'today', '!']


In [78]:
# Abbreviations
# not good
text = "He graduated from St. John's University."
tokens = tokenizer.tokenize(text)
print(tokens)

['He', 'graduated', 'from', 'St', '.', 'John', "'", 's', 'University', '.']


In [79]:
# Hyphenated Words
# not good
text = "The quick brown fox jumped over the high-speed fence."
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'high', '-', 'speed', 'fence', '.']


Pros:
- Effectively tokenizes text based on punctuation marks.
- Suitable for basic tokenization tasks in informal text settings.

Cons:
- Does not handle contractions, abbreviations, or hyphenated words as single tokens.
- May split words unnecessarily at punctuation marks, resulting in additional tokens.
- Not suitable for handling complex linguistic structures or specific text formats.

---------------------------
#### Space Tokenizer: Tokenizes text based on spaces.
---------------------------------



In [40]:
from nltk.tokenize import SpaceTokenizer

In [41]:
tokenizer = SpaceTokenizer()

In [42]:
# Example text
text = "NLTK Space Tokenizer splits text based on spaces."

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Print tokens
print("Tokens:", tokens)

Tokens: ['NLTK', 'Space', 'Tokenizer', 'splits', 'text', 'based', 'on', 'spaces.']


Pros:
- Simple and lightweight tokenizer.
- Efficient for tokenizing text with uniform spacing between words.
- Splits only on the space character, so tabs and newlines stay inside the tokens.

Cons:
- Does not handle punctuation marks, contractions, or other non-whitespace delimiters.
- May not be suitable for text with irregular spacing or special characters within words.
- Does not handle complex linguistic structures or tokenization tasks beyond basic word splitting.
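
The effect of irregular spacing can be shown with plain string methods. The NLTK documentation describes `SpaceTokenizer` as equivalent to `s.split(' ')`, so this sketch uses that call directly; the sample text is made up for illustration.

```python
text = "NLTK  Space\tTokenizer splits on spaces."

# Splitting on the single space character: the double space yields an
# empty token and the tab stays inside a token.
by_space = text.split(" ")

# Splitting on any run of whitespace, for comparison
by_whitespace = text.split()

print(by_space)
print(by_whitespace)
```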

#### Example

In [80]:
compare_list = ['https://t.co/9z2J3P33Uc',
               'laugh/cry',
               '😬😭😓🤢🙄😱',
               "world's problems",
               "@datageneral",
                "It's interesting",
               "don't spell my name right",
               'all-nighter',
                "My favorite color is blue",
                "My favorite colors are blue, red, and green."]

**`NLTK word_tokenize`** separates words using whitespace and punctuation.

In [81]:
word_tokens = []

for sent in compare_list:

    word_tokens.append(word_tokenize(sent))

word_tokens

[['https', ':', '//t.co/9z2J3P33Uc'],
 ['laugh/cry'],
 ['😬😭😓🤢🙄😱'],
 ['world', "'s", 'problems'],
 ['@', 'datageneral'],
 ['It', "'s", 'interesting'],
 ['do', "n't", 'spell', 'my', 'name', 'right'],
 ['all-nighter'],
 ['My', 'favorite', 'color', 'is', 'blue'],
 ['My',
  'favorite',
  'colors',
  'are',
  'blue',
  ',',
  'red',
  ',',
  'and',
  'green',
  '.']]

When dealing with well-formed, formal text, this standard word tokenizer makes a lot of sense and is likely to be sufficient. 

However, the same cannot be said for cases when our text data comes from more casual, slang-ridden sources like Twitter.

**`WordPunctTokenizer`** splits all punctuation into separate tokens.

In [82]:
from nltk.tokenize import WordPunctTokenizer

In [83]:
punct_tokenizer = WordPunctTokenizer()

punct_tokens = []

for sent in compare_list:
    
    punct_tokens.append(punct_tokenizer.tokenize(sent))
punct_tokens

[['https', '://', 't', '.', 'co', '/', '9z2J3P33Uc'],
 ['laugh', '/', 'cry'],
 ['😬😭😓🤢🙄😱'],
 ['world', "'", 's', 'problems'],
 ['@', 'datageneral'],
 ['It', "'", 's', 'interesting'],
 ['don', "'", 't', 'spell', 'my', 'name', 'right'],
 ['all', '-', 'nighter'],
 ['My', 'favorite', 'color', 'is', 'blue'],
 ['My',
  'favorite',
  'colors',
  'are',
  'blue',
  ',',
  'red',
  ',',
  'and',
  'green',
  '.']]

This tokenizer successfully splits laugh/cry into 2 words. But the drawbacks are:
- The link 'https://t.co/9z2J3P33Uc' is split into 7 tokens
- world's is split into 3 tokens (world, ', s) at the "'" character
- @datageneral is split into @ and datageneral
- don't is split into don, ', and t

Since each of these should be considered one word, this tokenizer is not what we want either. A tokenizer designed for social-media text handles these cases better.

**`TweetTokenizer`**

In [84]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()

In [85]:
tweet_tokens = []

for sent in compare_list:
    
    print(tweet_tokenizer.tokenize(sent))
    
    tweet_tokens.append(tweet_tokenizer.tokenize(sent))

['https://t.co/9z2J3P33Uc']
['laugh', '/', 'cry']
['😬', '😭', '😓', '🤢', '🙄', '😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']
['My', 'favorite', 'color', 'is', 'blue']
['My', 'favorite', 'colors', 'are', 'blue', ',', 'red', ',', 'and', 'green', '.']
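
The output above keeps URLs, mentions, and contractions whole while still separating ordinary punctuation. A rough sketch of that idea with the standard `re` module follows. The pattern `TWEET_LIKE` and the helper `tweet_like_tokenize` are assumptions made for illustration; they are much simpler than, and not taken from, the actual `TweetTokenizer` implementation.

```python
import re

# Order matters: URLs and @mentions/#hashtags are tried before plain words.
TWEET_LIKE = re.compile(
    r"https?://\S+"          # URLs
    r"|[@#]\w+"              # mentions and hashtags
    r"|\w+(?:['\-]\w+)*"     # words, keeping internal apostrophes and hyphens
    r"|[^\w\s]"              # any other single symbol (punctuation, emoji)
)

def tweet_like_tokenize(text):
    # findall returns whole matches because the pattern has no capturing group
    return TWEET_LIKE.findall(text)

samples = ["https://t.co/9z2J3P33Uc",
           "@datageneral",
           "don't spell my name right",
           "laugh/cry",
           "I'm feeling 😊 today! #happy"]

for sample in samples:
    print(tweet_like_tokenize(sample))
```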
