Sentence splitter
Some NLP applications require splitting a large raw text into sentences
to extract more meaningful information. Intuitively, a sentence is an
acceptable unit of conversation.

In [47]:
import nltk

In [48]:
from nltk.tokenize import sent_tokenize
inputstring = "This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
all_sent = sent_tokenize(inputstring)
print(all_sent)

['This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !', '!']
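
To make "splitting on sentence markers" concrete, here is a minimal sketch of a naive rule-based splitter that uses only the standard library. It is not how sent_tokenize works internally (Punkt is a trained model); the helper name naive_sent_split and the regular expression are assumptions chosen purely for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]

inputstring = "This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
sentences = naive_sent_split(inputstring)
print(sentences)

# A known weakness of this rule: abbreviations are treated as sentence ends.
print(naive_sent_split("Dr. Smith arrived."))
```

A rule this simple splits "Dr. Smith arrived." after "Dr.", which is the kind of case a trained sentence tokenizer is meant to handle better.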


In [49]:
import nltk.tokenize.punkt
# The Punkt tokenizer class can also be instantiated directly (untrained here)
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

Tokenization
A word (token) is the minimal unit that a machine can understand and
process, so no text string can be processed further without first going
through tokenization. Tokenization is the process of splitting the raw
string into meaningful tokens. The complexity of tokenization varies
with the needs of the NLP application and the complexity of the
language itself. For example, in English it can be as simple as choosing
only words and numbers through a regular expression.
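
As a rough illustration of the regular-expression approach, the sketch below uses only Python's re module. The three patterns are illustrative choices, not the only reasonable ones: the first keeps word characters (letters, digits, underscore), the second keeps digits only, and the third also keeps runs of punctuation as separate tokens.

```python
import re

s = "Hi Everyone ! hola gr8"

# Runs of word characters: letters, digits and underscore.
words_and_numbers = re.findall(r"\w+", s)

# Runs of digits only.
digits_only = re.findall(r"\d+", s)

# Either a run of word characters or a run of punctuation (non-word, non-space).
with_punct = re.findall(r"\w+|[^\w\s]+", s)

print(words_and_numbers)
print(digits_only)
print(with_punct)
```

Which pattern is appropriate depends on the application: dropping punctuation is fine for a bag-of-words model, but loses information a parser or sentence splitter would need.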

In [50]:
s = "Hi Everyone ! hola gr8"  # simplest tokenizer
print(s.split())


['Hi', 'Everyone', '!', 'hola', 'gr8']


In [51]:
from nltk.tokenize import word_tokenize
word_tokenize(s)



['Hi', 'Everyone', '!', 'hola', 'gr8']

In [52]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
regexp_tokenize(s, pattern=r'\w+')  # raw string avoids invalid-escape warnings

['Hi', 'Everyone', 'hola', 'gr8']

In [53]:
regexp_tokenize(s, pattern=r'\d+')  # raw string avoids invalid-escape warnings

['8']

In [54]:
wordpunct_tokenize(s)

['Hi', 'Everyone', '!', 'hola', 'gr8']

In [55]:
blankline_tokenize(s)

['Hi Everyone ! hola gr8']

Stemming is a crude, rule-based process that groups together the
different variations of a token by reducing them to a common stem. For
example, the word eat has variations like eating, eaten, eats, and so
on.
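
The idea of rule-based suffix stripping can be sketched in a few lines. This toy version is not the Porter or Lancaster algorithm; the suffix list, the minimum stem length and the name toy_stem are all assumptions made up for this illustration.

```python
# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "en", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

variants = ["eat", "eating", "eaten", "eats"]
stems = [toy_stem(w) for w in variants]
print(stems)

# The crudeness shows quickly: the doubled consonant is left in place.
print(toy_stem("shopping"))
```

All four variants collapse to the same stem, but "shopping" comes out as "shopp". Handling such cases takes additional rules of the kind real stemmers include.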

In [56]:
from nltk.stem import PorterStemmer  # Porter stemmer
from nltk.stem.lancaster import LancasterStemmer  # Lancaster stemmer
pst = PorterStemmer()  # create a PorterStemmer object
lst = LancasterStemmer()  # create a LancasterStemmer object
lst.stem("eating")  # not displayed: only the cell's last expression is echoed
pst.stem("shopping")

'shop'

In [57]:
lst.stem("shopping")

'shop'

In [58]:
lst.stem("dogs")

'dog'

In [59]:
pst.stem("shopping")

'shop'

Lemmatization uses
context and part of speech to determine the inflected form of the word
and applies different normalization rules for each part of speech to get
the root word (lemma):


In [60]:
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
wlem.lemmatize("dogs")

'dog'

WordNetLemmatizer relies on WordNet, a semantic dictionary: it takes a
word and looks it up there. It also uses morphological analysis to cut
the word to its root and search for the specific lemma (variation of the word).

In [61]:
wlem.lemmatize("eating")

'eating'
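
The word comes back unchanged here because lemmatize treats its input as a noun unless told otherwise (its pos argument defaults to "n"); calling wlem.lemmatize("eating", pos="v") should give 'eat'. The sketch below mimics that part-of-speech dependence with a plain dictionary. TOY_LEMMAS and toy_lemmatize are hypothetical stand-ins for WordNet, not part of NLTK.

```python
# Hypothetical miniature lexicon keyed by (word, part of speech).
TOY_LEMMAS = {
    ("dogs", "n"): "dog",
    ("eating", "v"): "eat",
    ("ate", "v"): "eat",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("dogs"))             # looked up as a noun
print(toy_lemmatize("eating"))           # no noun entry, so returned unchanged
print(toy_lemmatize("eating", pos="v"))  # looked up as a verb
```

The same surface form can map to different lemmas depending on the part of speech, which is why a lemmatizer is usually fed the output of a POS tagger.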

In [65]:
# `tokens` must already hold the list of all tokens in the corpus
freq_dist = nltk.FreqDist(tokens)
# most_common() sorts by descending count, so its tail holds the rarest words
rarewords = {word for word, _ in freq_dist.most_common()[-50:]}
after_rare_words = [word for word in tokens if word not in rarewords]
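
The rare-word filter in this cell can also be sketched with the standard library alone. Note that this is a variant, not the same rule: instead of dropping a fixed number of the least frequent words, it drops every word at or below a count threshold, which avoids arbitrary choices among ties. The toy corpus and the threshold of 1 are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be the token list of a real corpus.
tokens = "the cat sat on the mat and the dog sat too".split()

counts = Counter(tokens)

# Treat every word seen at most `max_count` times as rare.
max_count = 1
rare = {word for word, count in counts.items() if count <= max_count}

# Keep the original token order, dropping the rare words.
filtered = [word for word in tokens if word not in rare]

print(sorted(rare))
print(filtered)
```

Building rare as a set keeps the membership test in the final list comprehension fast, which matters once the corpus has millions of tokens.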

In [66]:
from nltk.metrics import edit_distance

Calculate the Levenshtein edit-distance between two strings.
The edit distance is the number of characters that need to be
substituted, inserted, or deleted, to transform s1 into s2.  For
example, transforming "rain" to "shine" requires three steps,
consisting of two substitutions and one insertion:
"rain" -> "sain" -> "shin" -> "shine".  These operations could have
been done in other orders, but at least three steps are needed.
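
The description above corresponds to the classic dynamic-programming formulation of Levenshtein distance. The following is a minimal sketch of that textbook algorithm, not NLTK's own source; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character substitutions, insertions and deletions."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2 (free if equal)
            ))
        previous = current
    return previous[-1]

distance = levenshtein("rain", "shine")
print(distance)
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.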

In [67]:
edit_distance("rain","shine")

3

In [68]:
import nltk
from nltk import word_tokenize
s = "I was watching TV"
print(nltk.pos_tag(word_tokenize(s)))

[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]


In [69]:
tagged = nltk.pos_tag(word_tokenize(s))
allnoun = [word for word, pos in tagged if pos in ['NN', 'NNP']]

In [70]:
print(allnoun)

['TV']


In [71]:
from nltk.corpus import brown
import nltk
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(nltk.FreqDist(tags))

<FreqDist with 218 samples and 100554 outcomes>


In [72]:
brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = nltk.DefaultTagger('NN')

In [74]:
brown_tagged_sents

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

In [73]:
default_tagger

<DefaultTagger: tag=NN>

In [75]:
from nltk.tag import UnigramTagger

In [76]:
from nltk.tag import DefaultTagger

In [84]:
from nltk.tag import BigramTagger

In [77]:
from nltk.tag import TrigramTagger

In [78]:
train_data = brown_tagged_sents[:int(len(brown_tagged_sents) * 0.9)] 

In [79]:
test_data = brown_tagged_sents[int(len(brown_tagged_sents) * 0.9):]

In [80]:
unigram_tagger = UnigramTagger(train_data,backoff=default_tagger)

In [81]:
print(unigram_tagger.evaluate(test_data))

0.8361407355726104
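
What a unigram tagger with a default-tag backoff does can be sketched without NLTK: remember the most frequent tag seen for each word during training, and fall back to a fixed tag for unseen words. This is a simplified illustration under that assumption, not NLTK's implementation; train_unigram, tag_words, accuracy and the tiny data set are all hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default_tag="NN"):
    """Tag known words from the model; back off to default_tag for unseen words."""
    return [(word, model.get(word, default_tag)) for word in words]

def accuracy(tagged_sents, model, default_tag="NN"):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in tagged_sents:
        predicted = tag_words([word for word, _ in sent], model, default_tag)
        for (_, gold), (_, guess) in zip(sent, predicted):
            correct += int(gold == guess)
            total += 1
    return correct / total

# Hypothetical miniature training and test sets in Brown-style tags.
train = [[("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")]]
test = [[("the", "AT"), ("bird", "NN"), ("sings", "VBZ")]]

model = train_unigram(train)
tagged = tag_words(["the", "bird", "sings"], model)
score = accuracy(test, model)
print(tagged)
print(score)
```

In this toy run the unseen noun is tagged correctly only because the default happens to be 'NN', while the unseen verb is not. That is the gap the bigram and trigram taggers try to narrow by looking at preceding tags.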


In [85]:
bigram_tagger = BigramTagger(train_data,backoff=unigram_tagger)

In [86]:
print(bigram_tagger.evaluate(test_data))

0.8452108043456593


In [87]:
trigram_tagger = TrigramTagger(train_data,backoff=bigram_tagger)

In [88]:
print(trigram_tagger.evaluate(test_data))

0.843317053722715


In [90]:
import nltk
from nltk import ne_chunk
sent = "Mark is studying at Stanford University in California"
print(ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False))

(S
  (PERSON Mark/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  in/IN
  (GPE California/NNP))


In [1]:
import nltk

In [2]:
from nltk.tokenize import word_tokenize,sent_tokenize