# Training a Sentence Tokenizer

If NLTK already ships a default sentence tokenizer, why would we train our own? The answer lies in what the default tokenizer is designed for: it is a general-purpose tokenizer. It works very well on standard text, but it may not be a good choice for nonstandard text, which ours may well be, or for text with unusual formatting. To get the best results on such text, we can train our own sentence tokenizer.

In [5]:
print("""Guy: How old are you?

Hipster girl: You know, I never answer that question. Because to me, it's about
how mature you are, you know? I mean, a fourteen year old could be more mature
than a twenty-five year old, right? I'm sorry, I just never answer that question.
Guy: But, uh, you're older than eighteen, right?

Hipster girl: Oh, yeah.""")

Guy: How old are you?

Hipster girl: You know, I never answer that question. Because to me, it's about
how mature you are, you know? I mean, a fourteen year old could be more mature
than a twenty-five year old, right? I'm sorry, I just never answer that question.
Guy: But, uh, you're older than eighteen, right?

Hipster girl: Oh, yeah.


We have saved this text in a file named training_tokenizer.txt. NLTK provides a class named <b><i>PunktSentenceTokenizer</i></b> that can be trained on raw text to produce a custom sentence tokenizer. We can get raw text either by reading in a file or from an NLTK corpus using the raw() method.

In [5]:
import nltk

In [6]:
from nltk.tokenize import PunktSentenceTokenizer

In [7]:
from nltk.corpus import webtext

In [31]:
nltk.download('webtext')

[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\webtext.zip.


True

In [39]:
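# Note: the argument is an absolute path to our own file, not a webtext file ID.
# Reading the file directly with open() would work as well.
text = webtext.raw(r'C:\Users\hp\NLP\B. Training Tokenization and Filtering stopword\training_tokenizer.txt')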

In [40]:
text

"Guy: How old are you?\r\nHipster girl: You know, I never answer that question. Because to me, it's about\r\nhow mature you are, you know? I mean, a fourteen year old could be more mature\r\nthan a twenty-five year old, right? I'm sorry, I just never answer that question.\r\nGuy: But, uh, you're older than eighteen, right?\r\nHipster girl: Oh, yeah."

In [53]:
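# Passing raw text to the constructor trains the tokenizer on that text.
sente_token = PunktSentenceTokenizer(text)
# Split the same text into sentences with the trained tokenizer.
sent_1 = sente_token.tokenize(text)
print(sent_1)
print('\n')
print(sent_1[5])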

['Guy: How old are you?', 'Hipster girl: You know, I never answer that question.', "Because to me, it's about\r\nhow mature you are, you know?", 'I mean, a fourteen year old could be more mature\r\nthan a twenty-five year old, right?', "I'm sorry, I just never answer that question.", "Guy: But, uh, you're older than eighteen, right?", 'Hipster girl: Oh, yeah.']


Guy: But, uh, you're older than eighteen, right?
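The text above mentions that raw text can also come from reading a file directly. A minimal sketch of that route follows, assuming the training text is UTF-8 and using a temporary file as a stand-in for training_tokenizer.txt; the helper name `train_tokenizer_from_file` and the shortened sample text are illustrative choices, not part of the original notebook.

```python
import tempfile
from pathlib import Path

from nltk.tokenize import PunktSentenceTokenizer

# Shortened version of the dialogue above; in practice this would be your own file.
SAMPLE = (
    "Guy: How old are you?\n"
    "Hipster girl: You know, I never answer that question. Because to me, it's about\n"
    "how mature you are, you know?\n"
    "Guy: But, uh, you're older than eighteen, right?\n"
    "Hipster girl: Oh, yeah.\n"
)


def train_tokenizer_from_file(path):
    """Read raw text from `path` and return (trained tokenizer, raw text)."""
    raw_text = Path(path).read_text(encoding="utf-8")
    # Passing text to the constructor trains Punkt on that text.
    return PunktSentenceTokenizer(raw_text), raw_text


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; replace with the path to your own training file.
    sample_path = Path(tmp_dir) / "training_tokenizer.txt"
    sample_path.write_text(SAMPLE, encoding="utf-8")
    custom_tokenizer, raw_text = train_tokenizer_from_file(sample_path)

sentences = custom_tokenizer.tokenize(raw_text)
for sentence in sentences:
    print(repr(sentence))
```

Training this way does not depend on any downloaded NLTK corpus, since the tokenizer learns only from the text it is given. A sample this small gives the trainer very little to learn from, so a larger file in the same style would be a better training source.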


To compare NLTK’s default sentence tokenizer with our own trained sentence tokenizer, let us tokenize the same text with the default sentence tokenizer, i.e. sent_tokenize().

In [52]:
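from nltk.tokenize import sent_tokenize
# sent_tokenize relies on NLTK's pretrained Punkt model; if that resource is
# missing, NLTK raises a LookupError that names the package to download.
sent_2 = sent_tokenize(text)
print(sent_2[1])
print(sent_2[5])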

Hipster girl: You know, I never answer that question.
Guy: But, uh, you're older than eighteen, right?
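The two sentences printed here are identical to what the trained tokenizer returned at the same indices, so spot-checking a couple of positions does not reveal much. One possible way to compare two tokenizations in full is sketched below. The helper `first_difference` is a hypothetical name, and the second list is hand-made purely to exercise the helper; it is not real tokenizer output.

```python
from itertools import zip_longest


def first_difference(sents_a, sents_b):
    """Return (index, a, b) for the first position where the lists differ, else None."""
    for index, (a, b) in enumerate(zip_longest(sents_a, sents_b)):
        if a != b:
            return index, a, b
    return None


# Three sentences taken from the trained tokenizer's output shown earlier.
custom_sents = [
    "Guy: How old are you?",
    "Hipster girl: You know, I never answer that question.",
    "Guy: But, uh, you're older than eighteen, right?",
]

# Hand-made list (NOT real tokenizer output) in which two sentences are merged.
made_up_sents = [
    "Guy: How old are you?",
    "Hipster girl: You know, I never answer that question. "
    "Guy: But, uh, you're older than eighteen, right?",
]

print(first_difference(custom_sents, custom_sents))
print(first_difference(custom_sents, made_up_sents))
```

In the notebook, the same helper could be called as `first_difference(sent_1, sent_2)` to check whether the two tokenizers disagree anywhere on the sample.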


# Stopwords

Stopwords are common words that are present in text but contribute little to the meaning of a sentence. Such words are usually of little importance for information retrieval or natural language processing, so they are often filtered out. Two of the most common stopwords are ‘the’ and ‘a’.

In [54]:
from nltk.corpus import stopwords

In [56]:
english_stops = set(stopwords.words('english'))
english_stops

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [61]:
from nltk.tokenize import word_tokenize

text_stop = 'I am a writer'

word = word_tokenize(text_stop)
word

['I', 'am', 'a', 'writer']

In [62]:
# The membership test is case-sensitive: the stopword list is lowercase, so 'I' is kept.
[x for x in word if x not in english_stops]

['I', 'writer']
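Note that 'I' survives the filter because the stopword set contains only the lowercase 'i'. A minimal sketch of a case-insensitive variant follows. To keep it runnable without downloading the stopwords corpus, it uses a small hard-coded subset named `sample_stops` (an illustrative stand-in, not the full NLTK list).

```python
# Small subset of NLTK's English stopwords, hard-coded so the example
# runs without the stopwords corpus being installed.
sample_stops = {"i", "am", "a", "the"}

words = ["I", "am", "a", "writer"]

# Compare the lowercased token against the lowercase stopword set,
# while keeping the original casing of the tokens that survive.
filtered = [w for w in words if w.lower() not in sample_stops]
print(filtered)  # ['writer']
```

In the notebook itself, the same idea would read `[x for x in word if x.lower() not in english_stops]`.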

The following Python script lists all the languages supported by the NLTK stopwords corpus:

In [63]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']