**Stopwords** are common words that generally do not contribute to the meaning of a sentence,
at least for the purposes of information retrieval and natural language processing.

The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. As such, it has a words() method that can take a single argument for the file ID, which in this case is 'english'.  
You can see the list of all English stopwords using stopwords.words('english') or by
examining the word list file at nltk_data/corpora/stopwords/english. There are also
stopword lists for many other languages. You can see the complete list of languages using the
fileids() method as follows:


<img src="stopwords_and_NLTK.png" />

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ["Can't", 'is', 'a', 'contraction']

In [5]:
[word for word in words if word not in english_stops]

["Can't", 'contraction']
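
The membership test above is case-sensitive, so a capitalized token can slip through even when its lowercase form is a stopword. The following is a minimal sketch of that effect, assuming a small hand-made set (`toy_stops`, a hypothetical stand-in for `set(stopwords.words('english'))`) so that it runs without any downloaded corpus:

```python
# Hypothetical mini stopword set; in the notebook this role is played by
# set(stopwords.words('english')).
toy_stops = {"is", "a", "the", "of"}

words = ["The", "Knights", "of", "Ni", "is", "a", "sketch"]

# Case-sensitive filter: "The" survives because only "the" is in the set.
case_sensitive = [w for w in words if w not in toy_stops]

# Lowercasing each token before the membership test removes "The" as well.
case_insensitive = [w for w in words if w.lower() not in toy_stops]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase first depends on the task; for named entities, for example, the original casing may carry information worth keeping.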

***WordNet*** is a lexical database for the English language. In other words, it's a dictionary designed specifically for natural language processing.  
>Looking up a word with wordnet.synsets() returns a list of Synset instances, which are groupings of synonymous words that express the same concept.

***Synset for word cookbook***

In [None]:
# the wordnet files are in the \corpora\wordnet folder

In [6]:
from nltk.corpus import wordnet
syn = wordnet.synsets('cookbook')[0]
syn.name()

'cookbook.n.01'

In [7]:
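# the definition comes from cookbook.n.01, but the examples are taken
# from the first synset of 'cooking'
print('definition=', syn.definition(), '\nexamples= \n', 
      wordnet.synsets('cooking')[0].examples())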

definition= a book of recipes and cooking directions 
examples= 
 ['cooking can be a great art', 'people are needed who have experience in cookery', 'he left the preparation of meals to his wife']


Synsets are organized in a structure similar to that of an inheritance tree.  
More abstract terms are known as **hypernyms**;  
more specific terms are **hyponyms**.

In [8]:
syn.hypernyms()

[Synset('reference_book.n.01')]

In [9]:
syn.hypernyms()[0].hyponyms()

[Synset('annual.n.02'),
 Synset('atlas.n.02'),
 Synset('cookbook.n.01'),
 Synset('directory.n.01'),
 Synset('encyclopedia.n.01'),
 Synset('handbook.n.01'),
 Synset('instruction_book.n.01'),
 Synset('source_book.n.01'),
 Synset('wordbook.n.01')]

In [10]:
syn.root_hypernyms()

[Synset('entity.n.01')]

In [11]:
syn.hypernym_paths()

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('creation.n.02'),
  Synset('product.n.02'),
  Synset('work.n.02'),
  Synset('publication.n.01'),
  Synset('book.n.01'),
  Synset('reference_book.n.01'),
  Synset('cookbook.n.01')]]
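
The path above can be thought of as the result of following hypernym links upward until no parent is left. Here is a rough sketch of that walk on a hand-built dictionary; the `parent` mapping and the `hypernym_path` helper are illustrative names, not part of NLTK, and real Synsets can have more than one hypernym, which this toy version ignores:

```python
# Toy child -> parent (hypernym) links copied from the path printed above.
# This is a hand-built dict, not WordNet's internal representation.
parent = {
    "cookbook": "reference_book",
    "instruction_book": "reference_book",
    "reference_book": "book",
    "book": "publication",
    "publication": "work",
    "work": "product",
    "product": "creation",
    "creation": "artifact",
    "artifact": "whole",
    "whole": "object",
    "object": "physical_entity",
    "physical_entity": "entity",
}


def hypernym_path(node):
    """Return the chain from the root down to `node` (single-parent toy tree)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


path = hypernym_path("cookbook")
print(path)
print(len(path))
```

The root is the only node without an entry in `parent`, which mirrors what `root_hypernyms()` reports for this Synset.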

### Simplified part-of-speech tag

In [12]:
syn.pos()

'n'

POS tags can be used to look up specific Synsets for a word by passing the pos argument to wordnet.synsets().

In [13]:
print('number of synsets for "great" as a noun\n', 
      len(wordnet.synsets('great', pos='n')), '\n number as an adjective\n',
      len(wordnet.synsets('great', pos='a')))

number of synsets for "great" as a noun
 1 
 number as an adjective
 6


### Lemmas: in linguistics, a lemma is the canonical or morphological form of a word

In [14]:
from nltk.corpus import wordnet
syn = wordnet.synsets('cookbook')[0]
lemmas = syn.lemmas()

In [15]:
len(lemmas)

2

In [16]:
print('cookbook has 2 lemmas, and both belong to the same synset')
print(lemmas[0].name())
print(lemmas[1].name())
print(lemmas[1].synset() == lemmas[0].synset())

cookbook has 2 lemmas, and both belong to the same synset
cookbook
cookery_book
True


### All possible synonyms

In [17]:
synonyms = []
for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
len(synonyms)

38
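
The count above is taken over a plain list, so a lemma name that occurs in several Synsets is counted once per Synset. A small sketch of how the duplicates could be collapsed follows; `toy_synsets` is hand-written, hypothetical data standing in for `wordnet.synsets('book')`, so the numbers it prints say nothing about WordNet itself:

```python
from collections import Counter

# Hand-written toy data standing in for wordnet.synsets('book');
# the sense names and lemma lists here are hypothetical.
toy_synsets = {
    "sense_1": ["book"],
    "sense_2": ["book", "volume"],
    "sense_3": ["record", "record_book", "book"],
}

# Same flattening as the notebook loop: one entry per (synset, lemma) pair.
synonyms = [lemma for lemmas in toy_synsets.values() for lemma in lemmas]

# A set keeps each lemma name once; a Counter shows which names repeat.
unique_synonyms = set(synonyms)
counts = Counter(synonyms)

print(len(synonyms), len(unique_synonyms))
print(counts.most_common(1))
```

In the notebook itself, the same idea would be `len(set(synonyms))` applied to the list built in the cell above.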

### Antonyms

In [18]:
gn2 = wordnet.synset('good.n.02')


In [19]:
gn2.definition()

'moral excellence or admirableness'

In [20]:
evil = gn2.lemmas()[0].antonyms()[0]

In [21]:
evil.name()

'evil'

In [22]:
evil.synset().definition()

'the quality of being morally wrong in principle or practice'

In [23]:
ga1 = wordnet.synset('good.a.01')
ga1.definition()

'having desirable or positive qualities especially those suitable for a thing specified'

In [24]:
bad = ga1.lemmas()[0].antonyms()[0]
bad.name()

'bad'

In [25]:
bad.synset().definition()

'having undesirable or negative qualities'

### Calculate Synset similarity
The closer the two Synsets are in the tree, the more similar they are.

The **wup_similarity** method is short for Wu-Palmer Similarity. The core metric used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym.

In [26]:
from nltk.corpus import wordnet
cb = wordnet.synset('cookbook.n.01')
ib = wordnet.synset('instruction_book.n.01')
cb.wup_similarity(ib)

0.9166666666666666

In [27]:
ref = cb.hypernyms()[0]

In [28]:
cb.shortest_path_distance(ref)

1

In [29]:
ib.shortest_path_distance(ref)

1

In [30]:
cb.shortest_path_distance(ib)

2
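
The 0.9166... score can be checked by hand with the basic Wu-Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the lowest common hypernym. The sketch below assumes depths that count the root as 1, read off the hypernym path printed earlier, and assumes instruction_book sits one step below reference_book (it appears in the hyponyms list). The helper name `wu_palmer` is illustrative; NLTK's own method may handle additional cases, such as multiple hypernym paths, that this ignores:

```python
def wu_palmer(depth_lcs, depth_a, depth_b):
    """Basic Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    return 2 * depth_lcs / (depth_a + depth_b)


# Depths count the root (entity.n.01) as 1, read off the hypernym path above.
depth_reference_book = 11    # lowest common hypernym of the two Synsets
depth_cookbook = 12
depth_instruction_book = 12  # assumed: one step below reference_book

score = wu_palmer(depth_reference_book, depth_cookbook, depth_instruction_book)
print(score)
```

With these depths the formula gives 22/24, which agrees with the value returned by `cb.wup_similarity(ib)`.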

***Comparing two dissimilar words***

In [32]:
dog = wordnet.synsets('dog')[0]
print('because they share hypernyms such as entity, they still have some similarity:', dog.wup_similarity(cb))

because they share hypernyms such as entity, they still have some similarity: 0.38095238095238093


In [33]:
sorted(dog.common_hypernyms(cb))

[Synset('entity.n.01'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('whole.n.02')]
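
One way to picture common hypernyms is as the overlap between two root-to-node paths. The sketch below uses hand-written paths: the cookbook path is the one printed earlier, while `dog_path` is an assumption made for illustration and should be checked against `dog.hypernym_paths()` before relying on it:

```python
# Root-to-node paths written by hand. The cookbook path is the one printed
# earlier; the dog path is an assumption for illustration only.
cookbook_path = ["entity", "physical_entity", "object", "whole", "artifact",
                 "creation", "product", "work", "publication", "book",
                 "reference_book", "cookbook"]
dog_path = ["entity", "physical_entity", "object", "whole", "living_thing",
            "organism", "animal", "domestic_animal", "dog"]

# Common hypernyms are the nodes that appear on both paths.
common = set(cookbook_path) & set(dog_path)

# The deepest shared node is the last position where both paths still agree.
deepest = None
for a, b in zip(cookbook_path, dog_path):
    if a != b:
        break
    deepest = a

print(sorted(common))
print(deepest)
```

Because the deepest shared node lies near the root and far from both words, the resulting similarity is low.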

In [34]:
cook = wordnet.synset('cook.v.01')


In [35]:
bake = wordnet.synset('bake.v.02')
cook.wup_similarity(bake)

0.6666666666666666

### BigramCollocationFinder
***Collocations are two or more words that tend to appear frequently together.***  
The cells below find the most common bigrams in Monty Python and the Holy Grail.

In [36]:
import nltk

In [37]:
from nltk.corpus import webtext

In [38]:
from nltk.collocations import BigramCollocationFinder

In [39]:
from nltk.metrics import BigramAssocMeasures

In [41]:
nltk.download('webtext')

[nltk_data] Downloading package webtext to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.


True

In [42]:
words = [w.lower() for w in webtext.words('grail.txt')]

In [43]:
bcf = BigramCollocationFinder.from_words(words)

In [44]:
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't')]

***adding a word filter to remove punctuation***

In [45]:
"""BigramCollocationFinder constructs two frequency distributions: one for each word,
and another for bigrams. A frequency distribution, or FreqDist in NLTK, is basically an
enhanced Python dictionary where the keys are what's being counted, and the values are
the counts. Any filtering functions that are applied reduce the size of these two FreqDists
by eliminating any words that don't pass the filter. By using a filtering function to eliminate all
words that are one or two characters, and all English stopwords, we can get a much cleaner
result. After filtering, the collocation finder is ready to accept a generic scoring function for
finding collocations."""

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

[('black', 'knight'),
 ('clop', 'clop'),
 ('head', 'knight'),
 ('mumble', 'mumble')]
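
The two frequency distributions and the word filter described above can be approximated with the standard library. This is a rough sketch on a made-up token list with a hypothetical two-word stopword set; it ranks by raw count rather than by likelihood ratio, so it illustrates the filtering step only, not NLTK's scoring:

```python
from collections import Counter

# Made-up tokens and a hypothetical stopword set, for illustration only.
tokens = "the black knight and the black knight ' s horse went clop clop".split()
stopset = {"the", "and"}

# One distribution for single words, one for adjacent word pairs.
word_fd = Counter(tokens)
bigram_fd = Counter(zip(tokens, tokens[1:]))


def filter_stops(w):
    """True for words that should be dropped: very short words or stopwords."""
    return len(w) < 3 or w in stopset


# Drop every bigram in which at least one word is rejected by the filter.
filtered = Counter({
    bigram: count
    for bigram, count in bigram_fd.items()
    if not any(filter_stops(w) for w in bigram)
})

print(filtered.most_common(1))
```

Pairs such as `("'", 's')` disappear because one of their words is shorter than three characters, which is the same reason the punctuation-heavy results vanish from the notebook output.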

###  TrigramCollocationFinder


In [63]:
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
words = [w.lower() for w in webtext.words('singles.txt')]
tcf = TrigramCollocationFinder.from_words(words)
tcf.apply_word_filter(filter_stops)
tcf.apply_freq_filter(3)
tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 5)

[('long', 'term', 'relationship')]
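
The frequency filter used above discards candidate trigrams that occur fewer than a minimum number of times. A minimal sketch of that idea on made-up text follows; the token list and the `min_freq` name are illustrative, and no association measure is applied here:

```python
from collections import Counter

# Made-up text in which one phrase repeats three times.
tokens = ("seeking long term relationship . " * 3 + "seeking fun").split()

# Count every run of three adjacent tokens.
trigram_fd = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Keep only trigrams seen at least min_freq times.
min_freq = 3
kept = {trigram: n for trigram, n in trigram_fd.items() if n >= min_freq}

for trigram, n in sorted(kept.items()):
    print(trigram, n)
```

Combining a threshold like this with a word filter is what narrows the notebook result down to a single trigram.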

In [None]:
### Scoring ngrams