# NLTK Intro

Natural Language Toolkit (NLTK)

Sentiment Analysis company: Sentdex

## Terms

* Tokenizing - splitting text into units (tokens): word tokenizers separate text into words; sentence tokenizers separate it into sentences.

* Lexicon - words and their meaning/value

* Corpora - bodies of text (singular: corpus). Ex: medical journals, presidential speeches, the English language

# Text Pre-Processing

## Tokenizing Words and Sentences

In [None]:
# nltk.download("all")

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [58]:
example_text = "Hello Mr. Smith, how are you today? The weather is great and Python is awesome. The sky is pinkish-blue"

In [59]:
print(sent_tokenize(example_text))

['Hello Mr. Smith, how are you today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue']


In [60]:
print(word_tokenize(example_text))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue']


In [62]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr.
Smith
,
how
are
you
today
?
The
weather
is
great
and
Python
is
awesome
.
The
sky
is
pinkish-blue


More advanced tokenizers can be used.
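
As one illustration of a different tokenizing strategy, the sketch below splits text with a regular expression using only Python's standard `re` module. The pattern and the `regex_tokenize` name are assumptions made for this example and are not part of NLTK; NLTK ships its own `RegexpTokenizer` and `TweetTokenizer` classes in `nltk.tokenize` for this kind of work.

```python
import re

# Hypothetical pattern: words (optionally hyphenated) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

tokens = regex_tokenize("Hello Mr. Smith, the sky is pinkish-blue")
print(tokens)
```

Unlike `word_tokenize` in the output above, this naive pattern separates `Mr` from its period, which is one reason a trained tokenizer is usually the safer choice.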

## Stop Words

## Stemming Words

## Parts of Speech Tagging

In [2]:
import nltk
from nltk.corpus import state_union # state of the union addresses
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text) # train on the 2005 address, then tokenize the 2006 one

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [71]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            print(tagged)
            
    except Exception as e:
        print(str(e))

In [77]:
# process_content()

The output is a list of tuples, each pairing a word with its part-of-speech (POS) tag.

__POS tag list:__

CC	coordinating conjunction

CD	cardinal digit

DT	determiner

EX	existential there (like: "there is" ... think of it like "there exists")

FW	foreign word

IN	preposition/subordinating conjunction

JJ	adjective	'big'

JJR	adjective, comparative	'bigger'

JJS	adjective, superlative	'biggest'

LS	list marker	1)

MD	modal	could, will

NN	noun, singular 'desk'

NNS	noun plural	'desks'

NNP	proper noun, singular	'Harrison'

NNPS	proper noun, plural	'Americans'

PDT	predeterminer	'all the kids'

POS	possessive ending	parent's

PRP	personal pronoun	I, he, she

PRP\$	possessive pronoun	my, his, hers

RB	adverb	very, silently

RBR	adverb, comparative	better

RBS	adverb, superlative	best

RP	particle	give up

TO	to	go 'to' the store.

UH	interjection	errrrrrrrm

VB	verb, base form	take

VBD	verb, past tense	took

VBG	verb, gerund/present participle	taking

VBN	verb, past participle	taken

VBP	verb, sing. present, non-3rd person	take

VBZ	verb, 3rd person sing. present	takes

WDT	wh-determiner	which

WP	wh-pronoun	who, what

WP\$	possessive wh-pronoun	whose

WRB	wh-adverb	where, when

POS tagging can run into problems with Twitter text: for example, a person's name written in lower case may not be tagged as a proper noun.
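
Because each tagged sentence is a plain list of `(word, tag)` tuples, it can be filtered with ordinary Python. The sketch below uses a hand-written tagged sentence (illustrative only, not actual `nltk.pos_tag` output) and a hypothetical helper named `words_with_tag_prefix`.

```python
# Hand-written example in the (word, tag) shape that nltk.pos_tag returns.
# These tags are illustrative and were not produced by running the tagger.
tagged = [("Harrison", "NNP"), ("took", "VBD"), ("the", "DT"),
          ("biggest", "JJS"), ("desks", "NNS"), ("quickly", "RB")]

def words_with_tag_prefix(tagged_words, prefix):
    """Return words whose POS tag starts with prefix ('NN' covers NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")
adjectives = words_with_tag_prefix(tagged, "JJ")
print(nouns)
print(adjectives)
```

Matching on a tag prefix keeps the helper short, since related tags in the list above share their first letters.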

# Section 2 (to name)

## Chunking

## Chinking

## Named Entity Recognition

## Lemmatizing

Similar to stemming, but the result is a real word.

In [15]:
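from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer() # create one instance and reuse it

print(lemmatizer.lemmatize("plants"))
print(lemmatizer.lemmatize("plant"), end = "\n \n")

# the default part of speech is noun; pass pos to lemmatize as adjective ("a") or verb ("v")
print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", pos = "a"))
print(lemmatizer.lemmatize("better", pos = "v"), end = "\n \n")

print(lemmatizer.lemmatize("eating"), end = "\n \n")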


plant
plant
 
better
good
better
 
eating
 


## The corpora

# Analysis

## Wordnet

In [3]:
# import nltk
# nltk.download("wordnet")

from nltk.corpus import wordnet as wn

In [19]:
syns = wn.synsets("program")
print(syns)

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]


In [10]:
print(syns[0])

Synset('plan.n.01')


In [11]:
print(syns[0].lemmas())

[Lemma('plan.n.01.plan'), Lemma('plan.n.01.program'), Lemma('plan.n.01.programme')]


In [13]:
print(syns[0].lemmas()[0].name()) # just the word

plan


In [14]:
print(syns[0].name()) # synset

plan.n.01


In [15]:
print(syns[0].definition()) # definition

a series of steps to be carried out or goals to be accomplished


In [16]:
print(syns[0].examples()) # examples

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [21]:
synonyms = []
antonyms = []

for syn in wn.synsets("park"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))

{'parking_area', 'ballpark', 'Mungo_Park', 'common', 'parkland', 'car_park', 'parking_lot', 'green', 'Park', 'commons', 'park'}


In [22]:
print(set(antonyms))

set()


In [24]:
# Semantic Similarity
w1 = wn.synset("tree.n.01")
w2 = wn.synset("leaf.n.01")

print(w1.wup_similarity(w2)) # Wu & Palmer (1994)

0.4444444444444444


In [25]:
# Semantic Similarity
w1 = wn.synset("boat.n.01")
w2 = wn.synset("ship.n.01")

print(w1.wup_similarity(w2)) # Wu & Palmer (1994)

0.9090909090909091


In [26]:
# Semantic Similarity
w1 = wn.synset("tree.n.01")
w2 = wn.synset("plant.n.01")

print(w1.wup_similarity(w2)) # Wu & Palmer (1994)

0.4444444444444444


In [27]:
# Semantic Similarity
w1 = wn.synset("tree.n.01")
w2 = wn.synset("cat.n.01")

print(w1.wup_similarity(w2)) # Wu & Palmer (1994)

0.5


In [29]:
# Semantic Similarity
w1 = wn.synset("tree.n.01")
w2 = wn.synset("plant.n.02")

print(w1.wup_similarity(w2)) # Wu & Palmer (1994)

0.8235294117647058


In [30]:
# Semantic Similarity
w1 = wn.synset("tree.n.01")
w2 = wn.synset("plant.n.03")

print(w1.wup_similarity(w2)) # Wu & Palmer (1994)

0.5714285714285714
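
The Wu & Palmer score relates the depth of two senses to the depth of their lowest common subsumer (the most specific shared ancestor): `2 * depth(lcs) / (depth(a) + depth(b))`. The sketch below works this out on a hypothetical six-node taxonomy, far smaller than WordNet, with depth counted in nodes so that the root has depth 1. All names in it are illustrative, and its scores are not expected to match the WordNet values above.

```python
# Toy taxonomy (hypothetical): child -> parent. The root has no parent.
PARENT = {"entity": None, "organism": "entity", "plant": "organism",
          "animal": "organism", "tree": "plant", "cat": "animal"}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lowest_common_subsumer(a, b):
    """Return the deepest ancestor shared by a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is the deepest
        if node in ancestors_of_a:
            return node

def wup_similarity(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("tree", "cat"))    # shared ancestor: organism
print(wup_similarity("tree", "plant"))  # shared ancestor: plant itself
```

The deeper the shared ancestor sits relative to the two senses, the closer the score gets to 1.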


## Text Classification

Building a text classifier for text analysis.

The features of a document are the words it contains.

In [82]:
import nltk
import random

from nltk.corpus import movie_reviews

# documents = [(list(movie_reviews.words(fileid)), category)
#            for category in movie_reviews.categories()
#            for fileid in movie_reviews.fileids(category)] # list of tuples

In [84]:
# same as above but easier to read

documents = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))
        
random.shuffle(documents)

In [86]:
print(documents[1])

(['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', 'review', 'damn', 'that', 'y2k', 'bug', '.', 'it', "'", 's', 'got', 'a', 'head', 'start', 'in', 'this', 'movie', 'starring', 'jamie', 'lee', 'curtis', 'and', 'another', 'baldwin', 'brother', '(', 'william', 'this', 'time', ')', 'in', 'a', 'story', 'regarding', 'a', 'crew', 'of', 'a', 'tugboat', 'that', 'comes', 'across', 'a', 'deserted', 'russian', 'tech', 'ship', 'that', 'has', 'a', 'strangeness', 'to', 'it', 'when', 'they', 'kick', 'the', 'power', 'back', 'on', '.', 'little', 'do', 'they', 'know', 'the', 'power', 'within', '.', '.', '.', 'going', 'for', 'the', 'gore', 'and', 'bringing', 'on', 'a', 'few', 'action', 'sequences', 'here', 'and', 'there', ',', 'virus', 'still', 'feels', 'very', 'empty', ',', 'like', 'a', 'movie', 'going', 'for', 'all', 'flash', 'and', 'no', 'substance', '.', 'we', 'don', "'", 't', 'know', 'why', 'the', 'crew', 'was', 'really', 'out', 'in', 'the', 'middle', 'of', 'nowhere', ',', 'we', 'don', "'", 't'

In [88]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]


In [89]:
print(all_words["stupid"])

253


## Converting Words to Features
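
One common way to turn words into features, sketched here under assumptions: pick a fixed vocabulary of frequent words and record, for each document, whether each vocabulary word is present. The tiny `documents` list, the vocabulary size of 3, and the `find_features` name are illustrative choices, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter

# Tiny hand-made stand-in for the (word_list, category) tuples built from movie_reviews.
documents = [
    (["a", "great", "movie", "great", "cast"], "pos"),
    (["a", "stupid", "movie"], "neg"),
]

# Count every word across all documents and keep the 3 most frequent as the vocabulary.
word_counts = Counter(w.lower() for words, _ in documents for w in words)
word_features = [w for w, _ in word_counts.most_common(3)]

def find_features(words):
    """Map each vocabulary word to True/False depending on whether it occurs in words."""
    present = set(words)
    return {w: (w in present) for w in word_features}

feature_sets = [(find_features(words), category) for words, category in documents]
print(word_features)
print(feature_sets[1])
```

Each document becomes a `(feature_dict, category)` pair, a shape that a classifier can be trained on.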

In [90]:
60*60*24

86400

# Testing

In [431]:
from nltk.corpus import wordnet as wn

word = "plant" # or try "vegetation"; the intended meaning is selected below

In [432]:
synset_w = wn.synsets(word)
synset_w

[Synset('plant.n.01'),
 Synset('plant.n.02'),
 Synset('plant.n.03'),
 Synset('plant.n.04'),
 Synset('plant.v.01'),
 Synset('implant.v.01'),
 Synset('establish.v.02'),
 Synset('plant.v.04'),
 Synset('plant.v.05'),
 Synset('plant.v.06')]

In [433]:
for i, synset in enumerate(synset_w):
    print(i, ")", synset.definition())

0 ) buildings for carrying on industrial labor
1 ) (botany) a living organism lacking the power of locomotion
2 ) an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
3 ) something planted secretly for discovery by another
4 ) put or set (seeds, seedlings, or plants) into the ground
5 ) fix or set securely or deeply
6 ) set up or lay the groundwork for
7 ) place into a river
8 ) place something or someone in a certain position in order to secretly observe or deceive
9 ) put firmly in the mind


In [434]:
# Meaning 1 (the botanical sense) is the one of interest. Now find its hyponyms.

word = wn.synsets(word)[1]

In [435]:
# level 1 hyponyms

hyponyms_1 = word.hyponyms()
hyponyms_1[:5]

[Synset('acrogen.n.01'),
 Synset('air_plant.n.01'),
 Synset('annual.n.01'),
 Synset('apomict.n.01'),
 Synset('aquatic.n.01')]

In [436]:
# level 2 hyponyms

hyponyms_2 = []

for i in range(len(hyponyms_1)):
    hyponyms_2.append(hyponyms_1[i].hyponyms())
    
hyponyms_2[:5]

[[],
 [Synset('aeschynanthus.n.01'),
  Synset('hemiepiphyte.n.01'),
  Synset('spanish_moss.n.01'),
  Synset('strangler.n.01'),
  Synset('waxflower.n.02')],
 [],
 [],
 []]

In [437]:
# flattened list
hyponyms_2 = [y for x in hyponyms_2 for y in x]

In [438]:
hyponyms_3 = []

for i in range(len(hyponyms_2)):
    hyponyms_3.append(hyponyms_2[i].hyponyms())
    
hyponyms_3 = [y for x in hyponyms_3 for y in x]

hyponyms_3[:5]

[Synset('lipstick_plant.n.01'),
 Synset('pitch_apple.n.01'),
 Synset('field_corn.n.01'),
 Synset('flamingo_flower.n.01'),
 Synset('canterbury_bell.n.01')]

In [439]:
hyponyms_4 = []

for i in range(len(hyponyms_3)):
    hyponyms_4.append(hyponyms_3[i].hyponyms())

hyponyms_4 = [y for x in hyponyms_4 for y in x]    
    
hyponyms_4[:5]

[Synset('dent_corn.n.01'),
 Synset('flint_corn.n.01'),
 Synset('soft_corn.n.01'),
 Synset('green_arrow_arum.n.01'),
 Synset('common_duckweed.n.01')]

In [440]:
hyponyms_all = []

hyponyms_all.append(hyponyms_1)
hyponyms_all.append(hyponyms_2)
hyponyms_all.append(hyponyms_3)
hyponyms_all.append(hyponyms_4)

hyponyms_all = [y for x in hyponyms_all for y in x]  

len(hyponyms_all)

2096

In [441]:
hyp_current = list(hyponyms_1)
hyp_lower = list()
hyp_all = list(hyponyms_1)

for level in range(3): # 4 levels (3 + initial one)
    for i in range(len(hyp_current)):
        hyp_lower.append(hyp_current[i].hyponyms())
        
    hyp_current = list(y for x in hyp_lower for y in x)
    hyp_all.extend(hyp_current)
    hyp_lower = list()

In [442]:
len(hyp_all)

2096

In [443]:
# Find all hyponyms function (8 steps in this case)

# Functions

## Find hyponyms

In [472]:
def find_hyponyms(word, meaning_n):
    word = wn.synsets(word)[meaning_n]
    hyp_current = word.hyponyms()
    hyp_all = list(hyp_current)
    hyp_lower = []
    
    count = 0

    while count < len(hyp_all):  
        count = len(hyp_all)
        for i in range(len(hyp_current)):
            hyp_lower.append(hyp_current[i].hyponyms())
        
        hyp_current = list(y for x in hyp_lower for y in x)
        hyp_all.extend(hyp_current)
        hyp_lower = list()    
    
    return hyp_all
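
The level-by-level loop above is a breadth-first traversal. If a synset can be reached along more than one path, a traversal without a visited set lists it more than once. The sketch below shows the same idea with a visited set on a hypothetical toy graph, where a plain dict stands in for `Synset.hyponyms()`. Whether duplicates actually occur in the WordNet results in this notebook has not been checked here.

```python
from collections import deque

# Hypothetical hyponym graph: node -> direct hyponyms.
# "fruit_tree" is reachable through both "tree" and "crop".
HYPONYMS = {
    "plant": ["tree", "crop"],
    "tree": ["oak", "fruit_tree"],
    "crop": ["fruit_tree"],
    "oak": [],
    "fruit_tree": ["apple_tree"],
    "apple_tree": [],
}

def all_hyponyms(root):
    """Breadth-first traversal returning each descendant of root exactly once."""
    seen = set()
    ordered = []
    queue = deque(HYPONYMS[root])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already reached through another path
        seen.add(node)
        ordered.append(node)
        queue.extend(HYPONYMS[node])
    return ordered

print(all_hyponyms("plant"))
```

NLTK's `Synset.closure` method, called with a function such as `lambda s: s.hyponyms()`, may serve as a ready-made alternative for this kind of transitive traversal.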

## Find Meronyms

In [483]:
def find_meronyms(word, meaning_n):
    word = wn.synsets(word)[meaning_n]
    mer_current = word.part_meronyms()
    mer_all = list(mer_current)
    mer_lower = []
    
    count = 0

    while count < len(mer_all):  
        count = len(mer_all)
        for i in range(len(mer_current)):
            mer_lower.append(mer_current[i].part_meronyms())
        
        mer_current = list(y for x in mer_lower for y in x)
        mer_all.extend(mer_current)
        mer_lower = list()    
    
    return mer_all

## Find hyponyms of meronyms and meronyms of hyponyms

In [500]:
def find_meronyms_hyponyms(word, meaning_n):
    word = wn.synsets(word)[meaning_n]
    current = word.part_meronyms()
    current.extend(word.hyponyms())
    
    lower = []
    all_related = list(current) # named to avoid shadowing the built-in all()
    
    count = 0
    
    # keep descending one level at a time until a pass adds nothing new
    while count < len(all_related):
        count = len(all_related)
        for i in range(len(current)):
            lower.append(current[i].part_meronyms())
            lower.append(current[i].hyponyms())
        
        current = list(y for x in lower for y in x)
        all_related.extend(current)
        lower = list()
            
    return all_related

## Morphological relations

## Get bag of Words

## Export to R

## Select correct meaning

# Explore word

## To find meaning number

In [506]:
word = "plant"

synset_w = wn.synsets(word)

for i, synset in enumerate(synset_w):
    print(i, ")", synset.definition())
    
meaning_n = 1

0 ) buildings for carrying on industrial labor
1 ) (botany) a living organism lacking the power of locomotion
2 ) an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
3 ) something planted secretly for discovery by another
4 ) put or set (seeds, seedlings, or plants) into the ground
5 ) fix or set securely or deeply
6 ) set up or lay the groundwork for
7 ) place into a river
8 ) place something or someone in a certain position in order to secretly observe or deceive
9 ) put firmly in the mind


## Find hypernym

In [507]:

word = wn.synsets(word)[meaning_n]
word.hypernyms()

[Synset('organism.n.01')]

# Apply Functions

In [481]:
len(find_hyponyms("plant", 1))

4699

In [485]:
len(find_meronyms("plant", 1))

2

In [501]:
len(find_meronyms_hyponyms("plant", 1))

7723

#### OTHER

In [291]:
word.hypernyms()

[Synset('organism.n.01')]

In [292]:
word

Synset('plant.n.02')

__Attribute search__

In [290]:
dir(wn)

['ADJ',
 'ADJ_SAT',
 'ADV',
 'MORPHOLOGICAL_SUBSTITUTIONS',
 'NOUN',
 'VERB',
 '_ENCODING',
 '_FILEMAP',
 '_FILES',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_compute_max_depth',
 '_data_file',
 '_data_file_map',
 '_encoding',
 '_exception_map',
 '_fileids',
 '_get_root',
 '_key_count_file',
 '_key_synset_file',
 '_lang_data',
 '_lemma_pos_offset_map',
 '_lexnames',
 '_load_exception_map',
 '_load_lang_data',
 '_load_lemma_pos_offset_map',
 '_max_depth',
 '_morphy',
 '_omw_reader',
 '_pos_names',
 '_pos_numbers',
 '_root',
 '_synset_from_pos_and_line',
 '_synset_from_pos_and_offset',
 '_synset_offset_cache',
 '_tagset',
 '_unload',
 'abspath',
 'abspaths',
 'all_lemma

In [279]:
# morphological substitutions, meronymies

In [284]:
for synset in wn.synsets('tree'):
    print(synset)
    nyms = ['hypernyms', 'hyponyms', 'meronyms', 'holonyms', 'part_meronyms', 'sisterm_terms', 'troponyms', 'inherited_hypernyms']
    for i in nyms:
        try:
            print(getattr(synset, i)())
        except AttributeError as e:
            # not every name in nyms is a Synset method; print the error and continue
            print(e)

Synset('tree.n.01')
[Synset('woody_plant.n.01')]
[Synset('aalii.n.01'), Synset('acacia.n.01'), Synset('african_walnut.n.01'), Synset('albizzia.n.01'), Synset('alder.n.02'), Synset('angelim.n.01'), Synset('angiospermous_tree.n.01'), Synset('anise_tree.n.01'), Synset('arbor.n.01'), Synset('aroeira_blanca.n.01'), Synset('ash.n.02'), Synset('australian_nettle.n.01'), Synset('balata.n.02'), Synset('bayberry.n.01'), Synset('bean_tree.n.01'), Synset('beech.n.01'), Synset('birch.n.02'), Synset('bitterwood_tree.n.01'), Synset('black_mangrove.n.01'), Synset('blackwood.n.02'), Synset('bloodwood_tree.n.01'), Synset('bonduc.n.02'), Synset('bonsai.n.01'), Synset('bottle-tree.n.01'), Synset('brazilian_ironwood.n.01'), Synset('brazilian_pepper_tree.n.01'), Synset('brazilwood.n.02'), Synset('breakax.n.01'), Synset('burma_padauk.n.01'), Synset('button_tree.n.01'), Synset('cabbage_tree.n.03'), Synset('calaba.n.01'), Synset('calabash.n.02'), Synset('camwood.n.01'), Synset('caracolito.n.01'), Synset('carib