In [30]:
import pandas as pd
import numpy as np
import os
import time

import warnings
warnings.filterwarnings('ignore')

In [31]:
data = pd.read_csv('../input/Tweets.csv')
# Work on an explicit copy so that adding columns later does not modify a slice of `data`
df = data[['text']].copy()

## Tokenizing text into sentences
The given text is split into individual sentences with `sent_tokenize()`.

In [32]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [33]:
text = "welcome readers. Good muffins cost 3.88. I am a good boy and my name is unknown. I hope you find it interesting isn't it "

In [34]:
sent_tokenize(text)

['welcome readers.',
 'Good muffins cost 3.88.',
 'I am a good boy and my name is unknown.',
 "I hope you find it interesting isn't it"]

In [35]:
%%time
# `%%time` (cell magic) times the whole cell; a bare `%time` line times only an empty statement.
n = []
for p in df['text']:
    l = sent_tokenize(p)
    n.append(l)

df['text2'] = n


In [36]:
%%time
df['text2'] = df.text.apply(sent_tokenize)


`sent_tokenize()` relies on the Punkt tokenizer, described in the paper "Unsupervised Multilingual Sentence Boundary Detection" by Kiss and Strunk. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.
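
To make the abbreviation idea concrete, here is a toy sketch using only the standard library. It is not the Punkt algorithm: Punkt learns abbreviations from raw text without supervision, whereas the `ABBREVIATIONS` set and both helper functions below are hypothetical and hand-written for illustration.

```python
import re

# Hypothetical, hand-written abbreviation list (Punkt would learn this from data)
ABBREVIATIONS = {"dr.", "mr.", "e.g."}

def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

def abbreviation_aware_split(text):
    """Same idea, but never end a sentence on a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final period
        sentences.append(" ".join(current))
    return sentences

sample = "Dr. Smith bought muffins. They cost 3.88 each."
print(naive_split(sample))               # breaks wrongly after "Dr."
print(abbreviation_aware_split(sample))  # keeps "Dr. Smith ..." together
```

Once "Dr." is recognised as an abbreviation, the only remaining periods followed by whitespace are real sentence boundaries, which is the ambiguity the paragraph above refers to.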

## Sentences into words

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by `word_tokenize()`. It assumes that the text has already been segmented into sentences.

This tokenizer performs the following steps:
* split standard contractions, e.g. `don't` -> `do n't` and `they'll` -> `they 'll`
* treat most punctuation characters as separate tokens
* split off commas and single quotes when followed by whitespace
* separate periods that appear at the end of a line
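
The steps above can be approximated with a handful of regular expressions. The sketch below is a rough, standard-library imitation and not NLTK's actual rule set; the function name `toy_treebank_tokenize` and the specific patterns are assumptions made for illustration.

```python
import re

def toy_treebank_tokenize(sentence):
    """Rough imitation of the listed steps for a single sentence."""
    s = sentence
    # Split standard contractions: don't -> do n't, they'll -> they 'll
    s = re.sub(r"(\w)(n't)\b", r"\1 \2", s, flags=re.IGNORECASE)
    s = re.sub(r"(\w)('ll|'re|'ve|'s|'m|'d)\b", r"\1 \2", s, flags=re.IGNORECASE)
    # Treat some common punctuation characters as separate tokens
    s = re.sub(r"([;:?!])", r" \1 ", s)
    # Split off commas and single quotes when followed by whitespace (or the end)
    s = re.sub(r"([,'])(?=\s|$)", r" \1", s)
    # Separate a period only when it ends the string
    s = re.sub(r"\.\s*$", " .", s)
    return s.split()

print(toy_treebank_tokenize("They'll say don't worry, it costs 3.88."))
```

Because the period rule only fires at the end of the string, the decimal point in `3.88` stays inside its token, which mirrors the behaviour visible in the `word_tokenize()` output in this notebook.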


In [37]:
text = "welcome readers. Good muffins cost 3.88. I am a good boy and my name is unknown. I hope you find it interesting isn't it "

In [38]:
%%time
word_tokenize(text)

['welcome',
 'readers',
 '.',
 'Good',
 'muffins',
 'cost',
 '3.88',
 '.',
 'I',
 'am',
 'a',
 'good',
 'boy',
 'and',
 'my',
 'name',
 'is',
 'unknown',
 '.',
 'I',
 'hope',
 'you',
 'find',
 'it',
 'interesting',
 'is',
 "n't",
 'it']

In [39]:
%%time
df["text3"] = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)


In [40]:
%%time
df["text4"] = df.text.apply(word_tokenize)


### Using WordPunct Tokenizer

`WordPunctTokenizer` splits a string into runs of alphabetic characters and runs of non-alphabetic characters.

As a result, punctuation always becomes a separate token, even inside a contraction.
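
The same behaviour can be sketched with the standard `re` module. The pattern below is, to my understanding, the one `WordPunctTokenizer` is built on, but treat that as an assumption; the helper name `wordpunct_like` is hypothetical.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation and symbols)
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Return alternating word and punctuation tokens, skipping whitespace."""
    return WORD_OR_PUNCT.findall(text)

print(wordpunct_like("don't hesitate to ask questions"))
print(wordpunct_like("Good muffins cost $3.88!"))
```

Note that, unlike `word_tokenize()`, this approach also splits the decimal number `3.88` into three tokens.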

In [41]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

In [42]:
text = "don't hesitate to ask questions"

In [43]:
tokenizer.tokenize(text)

['don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

## Tokenizing with regular expressions

A regular expression can drive word tokenization in two ways:
* by matching the words (tokens) themselves
* by matching the spaces or gaps between them

In [59]:
from nltk.tokenize import RegexpTokenizer

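# Raw strings avoid invalid-escape warnings for `\s` and `\w`
tok1 = RegexpTokenizer(r'\s+', gaps=True)  # pattern matches the gaps between tokens
tok2 = RegexpTokenizer(r'[A-Z]\w+')        # pattern matches the tokens themselves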

* `RegexpTokenizer` splits a string into substrings using a regular expression
* The pattern is defined explicitly and passed as the first (positional) argument

In [60]:
print(tok1.tokenize("You're handsome Isn't it"))
print(tok2.tokenize("You're handsome Isn't it"))

["You're", 'handsome', "Isn't", 'it']
['You', 'Isn']


* It uses `re.findall()` to perform tokenization by matching tokens
* It uses `re.split()` to perform tokenization by matching gaps or spaces (`gaps=True`)
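
The two modes can be reproduced directly with the `re` module. This is a minimal sketch of the idea rather than the library's internal code; the variable names are illustrative.

```python
import re

sentence = "You're handsome Isn't it"

# Mode 1: the pattern describes the tokens, so collect every match
tokens_by_match = re.findall(r"[A-Z]\w+", sentence)

# Mode 2: the pattern describes the gaps, so split on it
# (empty strings are dropped, which matters for leading/trailing whitespace)
tokens_by_gap = [t for t in re.split(r"\s+", sentence) if t]

print(tokens_by_match)
print(tokens_by_gap)
```

Matching tokens keeps only what the pattern describes (here, capitalised word prefixes), while matching gaps keeps everything between the separators.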

## Line and Space Tokenizer

* `LineTokenizer` splits text into lines and can drop lines that are empty or contain only spaces
* similar to `text.split('\n')`
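
A plain-Python sketch of the same idea, assuming that "blank" means a line that is empty or contains only whitespace:

```python
text = "Happy birthday to you \n \n. Brother."

# Keep every line, including the whitespace-only one
all_lines = text.split("\n")

# Discard lines that are empty or contain only whitespace
non_blank_lines = [line for line in all_lines if line.strip()]

print(all_lines)
print(non_blank_lines)
```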

In [81]:
from nltk.tokenize import LineTokenizer, WhitespaceTokenizer

In [82]:
text = "Happy birthday to you \n \n. Brother."

In [83]:
print(LineTokenizer(blanklines='keep').tokenize(text))
print(LineTokenizer(blanklines='discard').tokenize(text))

['Happy birthday to you ', ' ', '. Brother.']
['Happy birthday to you ', '. Brother.']


* `WhitespaceTokenizer` tokenizes a string on whitespace (space, tab, newline). In general, users should use the string `split()` method instead
* The whitespace tokenizer works like `text.split()`
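
For comparison, the sketch below shows `str.split()` with and without an explicit separator. Only the no-argument form treats any run of whitespace as a single gap.

```python
text = "Happy birthday to you \n \n. Brother."

# No argument: split on any run of spaces, tabs, or newlines
whitespace_tokens = text.split()

# Explicit single space: newlines survive inside the tokens
space_tokens = text.split(" ")

print(whitespace_tokens)
print(space_tokens)
```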

In [84]:
WhitespaceTokenizer().tokenize(text)

['Happy', 'birthday', 'to', 'you', '.', 'Brother.']