# Tokenizer

In [1]:
import nltk

In [2]:
from nltk.tokenize import word_tokenize

word_tokenize("Artificial intelligence is the future of mankind.")

['Artificial', 'intelligence', 'is', 'the', 'future', 'of', 'mankind', '.']

In [3]:
# TreebankWordTokenizer (the Penn Treebank-style tokenizer that word_tokenize is based on)

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize("Artificial intelligence is the future of mankind.")

['Artificial', 'intelligence', 'is', 'the', 'future', 'of', 'mankind', '.']

In [4]:
tokenizer.tokenize("What's up?")

['What', "'s", 'up', '?']

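Notice that the Treebank tokenizer splits the contraction "What's" into 'What' and "'s". This is why NLTK has provided two alternative word tokenizers, namely **PunktWordTokenizer** and **WordPunctTokenizer**. PunktWordTokenizer is no longer available in recent NLTK releases, so only WordPunctTokenizer is shown here.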

## WordPunctTokenizer
> An alternate word tokenizer that splits all punctuation into separate tokens.

In [5]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

tokenizer.tokenize("What's up?")

['What', "'", 's', 'up', '?']
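
To make the splitting rule concrete, here is a minimal sketch using only the standard `re` module. It assumes that `WordPunctTokenizer` behaves like the pattern `\w+|[^\w\s]+` (a run of word characters, or a run of punctuation). The helper name `wordpunct_like` is hypothetical and not part of NLTK.

```python
import re

# Assumed pattern: a run of word characters OR a run of
# characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    """Hypothetical helper: split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT_PATTERN.findall(text)


print(wordpunct_like("What's up?"))
```

Because the apostrophe is not a word character, the contraction is broken into three tokens, which matches the output above.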

## Tokenizing into sentences

In [6]:
from nltk.tokenize import sent_tokenize

text = "Artificial intelligence is the future of mankind. What do you think?"

sent_tokenize(text)

['Artificial intelligence is the future of mankind.', 'What do you think?']

## Word tokenization using regular expressions

In [8]:
import nltk

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[\w']+")
# matches words and contractions

tokenizer.tokenize("What's up?")

["What's", 'up']

In [10]:
from nltk.tokenize import RegexpTokenizer

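# gaps=True: the pattern matches the separators (whitespace), not the tokens
tokenizer = RegexpTokenizer(r"\s+", gaps=True)

tokenizer.tokenize("What's up? How are you doing")

["What's", 'up?', 'How', 'are', 'you', 'doing']

In [12]:
# gaps=False: the pattern matches the tokens themselves, so only the whitespace is returned
tokenizer = RegexpTokenizer(r"\s+", gaps=False)
tokenizer.tokenize("What's up? How are you doing")


[' ', ' ', ' ', ' ', ' ']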
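
The effect of `gaps` can be reproduced with the standard `re` module. This is a rough sketch, assuming that `gaps=True` behaves like `re.split` with empty strings discarded and that `gaps=False` behaves like `re.findall`. The function name `regexp_tokenize_sketch` is hypothetical.

```python
import re


def regexp_tokenize_sketch(pattern, text, gaps=False):
    """Hypothetical stand-in illustrating the two modes of a regex tokenizer."""
    if gaps:
        # The pattern describes separators: split on it and drop empty pieces.
        return [piece for piece in re.split(pattern, text) if piece]
    # The pattern describes tokens: return every match.
    return re.findall(pattern, text)


text = "What's up? How are you doing"
print(regexp_tokenize_sketch(r"\s+", text, gaps=True))   # words between whitespace
print(regexp_tokenize_sketch(r"\s+", text, gaps=False))  # the whitespace itself
print(regexp_tokenize_sketch(r"[\w']+", text))           # words and contractions
```

The same pattern therefore gives opposite results depending on whether it is read as a description of the tokens or of the gaps between them.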

# Training our own sentence tokenizer

NLTK's default sentence tokenizer is a general-purpose tokenizer. It works very well on standard text, but it may not be a good choice for nonstandard text or for text with unusual formatting. To get the best results on such text, we should train our own sentence tokenizer.

In [13]:
from nltk.tokenize import PunktSentenceTokenizer

from nltk.corpus import webtext

In [16]:
import os

with open("training_tokenizer.txt", "r") as file:
    text = file.read()

In [21]:
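# Passing text to the constructor trains the Punkt model on that text
sent_tokenizer = PunktSentenceTokenizer(text)
# Tokenize the same text with the newly trained tokenizer
sents_1 = sent_tokenizer.tokenize(text)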

for item in sents_1:
    print(item)

Guy: How old are you?
Hipster girl: You know, I never answer that question.
Because to me, it's about 
how mature you are, you know?
I mean, a fourteen year old could be more mature 
than a twenty-five year old, right?
I'm sorry, I just never answer that 
question.
Guy: But, uh, you're older than eighteen, right?
Hipster girl: Oh, yeah.


To see how our own tokenizer differs from the default one, tokenize the same text with `sent_tokenize` and compare the results:

In [24]:
from nltk.tokenize import sent_tokenize

sents_2 = sent_tokenize(text)

for item in sents_2:
    print(item)
    
len(sents_2) == len(sents_1)

Guy: How old are you?
Hipster girl: You know, I never answer that question.
Because to me, it's about 
how mature you are, you know?
I mean, a fourteen year old could be more mature 
than a twenty-five year old, right?
I'm sorry, I just never answer that 
question.
Guy: But, uh, you're older than eighteen, right?
Hipster girl: Oh, yeah.


True
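
Comparing lengths only shows that both tokenizers found the same number of sentences. A stricter check compares the sentences themselves. The sketch below uses short made-up lists in place of `sents_1` and `sents_2`, and the helper `first_mismatch` is hypothetical.

```python
def first_mismatch(sents_a, sents_b):
    """Return the index of the first position where the lists differ, or None if identical."""
    for index, (a, b) in enumerate(zip(sents_a, sents_b)):
        if a != b:
            return index
    # No differing pair found, but one list may be a prefix of the other.
    if len(sents_a) != len(sents_b):
        return min(len(sents_a), len(sents_b))
    return None


# Made-up examples standing in for the two tokenizers' outputs.
same_a = ["How old are you?", "Oh, yeah."]
same_b = ["How old are you?", "Oh, yeah."]
different = ["How old are you? Oh,", "yeah."]

print(first_mismatch(same_a, same_b))     # None: identical lists
print(first_mismatch(same_a, different))  # 0: same length, different splits
```

The second call illustrates why equal lengths alone do not prove that two tokenizers agree.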

# Stopwords

Stopwords are common words that appear in text but contribute little to the meaning of a sentence. Such words are usually filtered out in information retrieval and natural language processing tasks. Typical examples are "the" and "a".

In [27]:
from nltk.corpus import stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [28]:
english_stopwords = set(stopwords.words("english"))

words = ["I", "am", "going", "to", "the", "store", "and", "park"]

[word for word in words if word not in english_stopwords]

['I', 'going', 'store', 'park']
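
Note that "I" survives the filter because NLTK's English stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. It uses a small hand-written set, `SAMPLE_STOPWORDS`, as a stand-in for `stopwords.words("english")` so that it runs without downloading any data; both that name and `remove_stopwords` are hypothetical.

```python
# Hand-written stand-in for set(stopwords.words("english")); not the full NLTK list.
SAMPLE_STOPWORDS = {"i", "am", "to", "the", "and", "a"}


def remove_stopwords(words, stopword_set):
    """Drop words whose lowercase form is in stopword_set, keeping the original casing."""
    return [word for word in words if word.lower() not in stopword_set]


words = ["I", "am", "going", "to", "the", "store", "and", "park"]
print(remove_stopwords(words, SAMPLE_STOPWORDS))
```

Lowercasing only for the comparison keeps the surviving tokens exactly as they appeared in the input.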

In [35]:
nepali_stopwords = stopwords.words("nepali")

nepali_stopwords[:20]

['छ',
 'र',
 'पनि',
 'छन्',
 'लागि',
 'भएको',
 'गरेको',
 'भने',
 'गर्न',
 'गर्ने',
 'हो',
 'तथा',
 'यो',
 'रहेको',
 'उनले',
 'थियो',
 'हुने',
 'गरेका',
 'थिए',
 'गर्दै']

# Complete list of supported languages

In [29]:
from nltk.corpus import stopwords

stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']