In [2]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
nltk.__version__

'3.8.1'

## Tokenization



### **Tokenization of text into sentences**

In [4]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
tokenizer.tokenize(text)

[' Hello everyone.',
 'Hope all are fine and doing well.',
 'Hope you find the book interesting']
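For contrast, here is a minimal standard-library sketch of a naive sentence splitter. It is not how Punkt works: the helper `naive_sentence_split` is a hypothetical name used only for illustration, and it simply splits after `.`, `!` or `?` when whitespace follows. It gives the same result on the simple text above but breaks on abbreviations, which is the kind of case a trained sentence tokenizer is designed to handle.

```python
import re

def naive_sentence_split(text):
    """Naive baseline: split after ., ! or ? when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works on the simple example from the cell above.
simple = naive_sentence_split(
    " Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
)
print(simple)

# Fails on an abbreviation: "Dr." is wrongly treated as a sentence end.
tricky = naive_sentence_split("Dr. Smith arrived. He sat down.")
print(tricky)
```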

### **Tokenization of text in other languages**



In [5]:
french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
french_tokenizer.tokenize("Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire")

['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret.',
 'Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois.',
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire.",
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire"]

### **Tokenization of sentences into words**

In [6]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")

['Have',
 'a',
 'nice',
 'day.',
 'I',
 'hope',
 'you',
 'find',
 'the',
 'book',
 'interesting']

**TreebankWordTokenizer**  
Follows the conventions of the Penn Treebank corpus, which include splitting contractions into separate tokens. `nltk.word_tokenize` applies Treebank-style rules as well, as shown here:

In [7]:
text=nltk.word_tokenize("Don't hesitate to ask questions")
print(text)

['Do', "n't", 'hesitate', 'to', 'ask', 'questions']


**WordPunctTokenizer**

In [8]:
from nltk.tokenize import WordPunctTokenizer
tokenizer=WordPunctTokenizer()
tokenizer.tokenize(" Don't hesitate to ask questions")

['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
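The output above splits the apostrophe into its own token because `WordPunctTokenizer` separates runs of word characters from runs of punctuation. A rough standard-library sketch of that behavior, assuming the pattern `\w+|[^\w\s]+` (the pattern NLTK documents for this class), looks like this:

```python
import re

# Assumed pattern: runs of word characters, or runs of
# characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

word_punct_tokens = re.findall(WORD_PUNCT_PATTERN, " Don't hesitate to ask questions")
print(word_punct_tokens)
```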

**Tokenization using regular expressions**

In [9]:
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"\w+")  # raw string avoids an invalid-escape warning
tokenizer.tokenize("Don't hesitate to ask questions")

['Don', 't', 'hesitate', 'to', 'ask', 'questions']

Instead of instantiating the class, you can call the `regexp_tokenize` function directly:

In [10]:
from nltk.tokenize import regexp_tokenize
sent="Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))

['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
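To see why `'t` comes out as one token, it helps to read the pattern as three alternatives tried in order: a run of word characters, a dollar amount, or any run of non-whitespace. The sketch below reproduces the matching with `re.findall` from the standard library as an approximation of what the tokenizer does; the second sentence is a made-up example to exercise the dollar-amount branch.

```python
import re

PATTERN = r"\w+|\$[\d\.]+|\S+"

# "Don" matches \w+; the leftover "'t" only matches the \S+ fallback.
contraction_tokens = re.findall(PATTERN, "Don't hesitate to ask questions")
print(contraction_tokens)

# Hypothetical sentence: "$3.50" is captured whole by \$[\d\.]+.
price_tokens = re.findall(PATTERN, "It costs $3.50 today")
print(price_tokens)
```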


## Conversion into lowercase and uppercase

In [12]:
text='HARdWork IS KEy to SUCCESS'
print(text.lower())
print(text.upper())

hardwork is key to success
HARDWORK IS KEY TO SUCCESS


## Dealing with stop words

In [13]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stops = set(stopwords.words('english'))
words=["Don't", 'hesitate','to','ask','questions']
[word for word in words if word not in stops]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


["Don't", 'hesitate', 'ask', 'questions']
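Note that `"Don't"` survives the filter above: the comparison is case-sensitive and the stop word list is lowercase. A minimal sketch of a case-insensitive filter follows. To stay self-contained it uses a small hand-written set, `sample_stops`, as a stand-in for the NLTK list; that set is an assumption for illustration only.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
sample_stops = {"don't", "to", "the", "a", "is"}

words = ["Don't", "hesitate", "to", "ask", "questions"]

# Case-sensitive check: "Don't" is not equal to "don't", so it is kept.
case_sensitive = [w for w in words if w not in sample_stops]

# Lowercase each word before the lookup so capitalized stop words are removed too.
case_insensitive = [w for w in words if w.lower() not in sample_stops]

print(case_sensitive)
print(case_insensitive)
```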

### **WordListCorpusReader**

In [14]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

## Lemmatization

In [15]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer_output=WordNetLemmatizer()
lemmatizer_output.lemmatize('working')

[nltk_data] Downloading package wordnet to /root/nltk_data...


'working'

In [16]:
lemmatizer_output.lemmatize('working',pos='v')

'work'

In [17]:
lemmatizer_output.lemmatize('works')

'work'

**PorterStemmer compared with WordNetLemmatizer**

In [18]:
from nltk.stem import PorterStemmer
stemmer_output=PorterStemmer()
stemmer_output.stem('happiness')

'happi'

In [19]:
from nltk.stem import WordNetLemmatizer
lemmatizer_output=WordNetLemmatizer()
lemmatizer_output.lemmatize('happiness')

'happiness'

## Similarity measure

In [20]:
from nltk.metrics import edit_distance
edit_distance("relate","relation")

3

In [21]:
edit_distance("suggestion","calculation")

7
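Edit distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below is a plain dynamic-programming version of Levenshtein distance, assuming a cost of 1 for every operation and no transpositions; the function name `levenshtein` is illustrative and this is not NLTK's implementation.

```python
def levenshtein(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("relate", "relation"))
print(levenshtein("suggestion", "calculation"))
```

For "relate" and "relation", the three edits are one substitution (`e` to `i`) and two insertions (`o`, `n`).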

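**Jaccard's Coefficient**   
measures the overlap between two sets as the size of their intersection divided by the size of their union. `jaccard_distance` returns one minus this coefficient, so 0 means identical sets and 1 means no shared elements.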

In [23]:
from nltk.metrics import jaccard_distance
X={10,20,30,40}
Y={20,30,60}
print(jaccard_distance(X,Y))

0.6
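The value 0.6 can be checked by hand: the two sets share 2 elements (20 and 30) and their union has 5, so the distance is (5 - 2) / 5. A minimal pure-Python sketch of the same calculation is shown below; `jaccard_distance_sketch` is a hypothetical helper name, not part of NLTK.

```python
def jaccard_distance_sketch(a, b):
    """Jaccard distance: share of the union that is NOT in the intersection."""
    union = a | b
    intersection = a & b
    return (len(union) - len(intersection)) / len(union)

X = {10, 20, 30, 40}
Y = {20, 30, 60}
distance = jaccard_distance_sketch(X, Y)
print(distance)
```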
