***
# <center>***Tokenizing Text and WordNet***</center>
***

## ***I learned the following natural language processing techniques:***

* **Tokenization:**
    * [Tokenizing text into sentences](#sentence-tokenization) 
    * [Tokenizing sentences into words](#word-tokenization)
    * [Tokenizing sentences using regular expressions](#regex-tokenization) 
    * [Training a sentence tokenizer](#training-tokenizer)
* **Text Cleaning:**
    * [Filtering stopwords in a tokenized sentence](#stop-word-filtering)
* **Lexical Semantics:**
    * [Synsets for a word in WordNet](#wordnet-synsets) 
    * [Lemmas and synonyms in WordNet](#wordnet-lemmas-synonyms)
    * [Calculating WordNet Synset similarity](#wordnet-similarity)
* **Collocation Analysis:** 
    * [Discovering word collocations](#word-collocations) 


***
## ***Introduction***
***

**Natural Language Toolkit (NLTK)** is a comprehensive Python library for natural language processing and text analytics. Originally designed for teaching, it has been adopted in industry for research and development due to its usefulness and breadth of coverage. NLTK is often used for rapid prototyping of text processing programs and can even be used in production applications.

**Tokenization** is a method of breaking up a piece of text into many pieces, such as sentences and words. **WordNet** is a dictionary designed for programmatic access by natural language processing systems. It has many different use cases, including:
- Looking up the definition of a word
- Finding synonyms and antonyms
- Exploring word relations and similarity
- Word sense disambiguation for words that have multiple uses and definitions

**NLTK** includes a WordNet corpus reader, which we will use to access and explore WordNet. A corpus is just a body of text, and corpus readers are designed to make accessing a corpus much easier than direct file access.

***
### ***<a id="sentence-tokenization"></a>Sentence Tokenization:***
***

**Tokenization** is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. We will start with sentence tokenization, or splitting a paragraph into a list of sentences.

We can start by creating a paragraph of text:

In [1]:

para = "Hello World. It's good to see you. Thanks forreading this Notebook."


Now we want to **split the paragraph into sentences**. First we need to import the **sentence tokenization** function, and then we can call it with the paragraph as an argument:

In [2]:

from nltk.tokenize import sent_tokenize
Sent_tokenize = sent_tokenize(para)
Sent_tokenize


['Hello World.', "It's good to see you.", 'Thanks forreading this Notebook.']

So now we have a list of sentences that we can use for further processing.

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages. So it knows what punctuation and characters mark the end of a sentence and the beginning of a new sentence.
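To see why a trained model matters, here is a minimal sketch of a naive splitter built on the standard `re` module. It is not how Punkt works; the `naive_sent_split` helper and the second example string are assumptions made for illustration. The rule handles the paragraph above but breaks on an abbreviation:

```python
import re

def naive_sent_split(text):
    # Hypothetical helper: split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

para = "Hello World. It's good to see you. Thanks forreading this Notebook."
print(naive_sent_split(para))

# An abbreviation fools the naive rule: "Dr." is treated as a sentence end.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

A trained Punkt model learns which tokens are likely abbreviations, which is exactly the kind of case this naive rule gets wrong.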

***
### ***<a id="word-tokenization"></a>Word Tokenization:***
***

In this step, **we will split a sentence into individual words**. The simple task of creating a list of words from a string is an essential part of all text processing.

In [3]:

from nltk.tokenize import word_tokenize
Word_tokenize = word_tokenize(para)
Word_tokenize

['Hello',
 'World',
 '.',
 'It',
 "'s",
 'good',
 'to',
 'see',
 'you',
 '.',
 'Thanks',
 'forreading',
 'this',
 'Notebook',
 '.']

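The `word_tokenize()` function is a wrapper that first splits the text into sentences and then calls `tokenize()` on a Treebank-style word tokenizer for each sentence. Calling `TreebankWordTokenizer` directly on the whole paragraph skips the sentence split, so a period that ends an earlier sentence stays attached to the preceding word: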

In [4]:

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
Treebank_word_tokenizer = tokenizer.tokenize(para)
Treebank_word_tokenizer


['Hello',
 'World.',
 'It',
 "'s",
 'good',
 'to',
 'see',
 'you.',
 'Thanks',
 'forreading',
 'this',
 'Notebook',
 '.']

***more...***

Ignoring the obviously named `WhitespaceTokenizer` and `SpaceTokenizer`, there are two other word **tokenizers** worth looking at: `PunktWordTokenizer` and `WordPunctTokenizer`. These differ from `TreebankWordTokenizer` by how they handle punctuation and contractions, but they all inherit from `TokenizerI`. Note that `PunktWordTokenizer` was removed in NLTK 3, so only `WordPunctTokenizer` is shown below.

***Separating contractions***
- One of the tokenizer's most significant conventions is to separate contractions. For example:

In [5]:

word_tokenize("Can't")


['Ca', "n't"]

***WordPunctTokenizer***
- Another alternative word tokenizer is WordPunctTokenizer. It splits all punctuation into separate tokens:

In [6]:

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
# Every run of punctuation becomes its own token, so "Can't" splits into three pieces.
word_punct_tokens = tokenizer.tokenize("Can't is a contraction.")
print(word_punct_tokens)


['Can', "'", 't', 'is', 'a', 'contraction', '.']
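As far as I can tell, `WordPunctTokenizer` is a thin `RegexpTokenizer` subclass whose pattern matches either a run of word characters or a run of punctuation. Assuming that pattern is `\w+|[^\w\s]+`, the same split can be sketched with the standard `re` module:

```python
import re

# Assumed pattern: runs of word characters, or runs of characters
# that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_PUNCT.findall("Can't is a contraction.")
print(tokens)
```

Because the apostrophe is not a word character, it always ends up as a token of its own, which explains the three-way split of "Can't".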


***
### ***<a id="regex-tokenization"></a>Tokenizing sentences using regular expressions***
***


First you need to decide how you want to tokenize a piece of text, as this will determine how you construct your regular expression. The choices are:
- Match on the tokens
- Match on the separators or gaps

We will create an instance of RegexpTokenizer, giving it a regular expression string to use for matching tokens:

In [7]:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids an invalid-escape warning for \w
Regexp_Tokenizer = tokenizer.tokenize(para)
Regexp_Tokenizer


['Hello',
 'World',
 "It's",
 'good',
 'to',
 'see',
 'you',
 'Thanks',
 'forreading',
 'this',
 'Notebook']

We can also use a simple helper function if we do not want to instantiate the class:

In [8]:

from nltk.tokenize import regexp_tokenize
simple_regexp_tokenize = regexp_tokenize(para, r"[\w']+")
simple_regexp_tokenize


['Hello',
 'World',
 "It's",
 'good',
 'to',
 'see',
 'you',
 'Thanks',
 'forreading',
 'this',
 'Notebook']

The **RegexpTokenizer** class works by compiling your pattern, then calling re.findall() on your text. You could do all this yourself using the re module, but RegexpTokenizer implements the TokenizerI interface, just like the word tokenizers above, so it can be used wherever NLTK expects a tokenizer.
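As a rough sketch of that idea, the token-matching mode can be reproduced with the standard `re` module alone. This only illustrates the `re.findall()` step and is not NLTK's actual implementation:

```python
import re

para = "Hello World. It's good to see you. Thanks forreading this Notebook."

# Compile the pattern once, then collect every non-overlapping match as a token.
pattern = re.compile(r"[\w']+")
tokens = pattern.findall(para)
print(tokens)
```

The result is a plain list of strings; what the class adds on top is the shared tokenizer interface.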

***more...***
 
**RegexpTokenizer** can also work by matching the gaps, as opposed to the tokens. Instead of using re.findall(), the RegexpTokenizer class will use re.split(). This is how the `BlanklineTokenizer` class in nltk.tokenize is implemented.

***Simple whitespace tokenizer***
- The following is a simple example of using **RegexpTokenizer** to tokenize on whitespace:

In [9]:

tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # the pattern describes the separators, not the tokens
tokenizer.tokenize(para)


['Hello',
 'World.',
 "It's",
 'good',
 'to',
 'see',
 'you.',
 'Thanks',
 'forreading',
 'this',
 'Notebook.']
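The gap-matching mode can be sketched in the same spirit with `re.split()`. The `split_on_gaps` helper below is hypothetical; it drops the empty strings that a plain `re.split()` returns when the text starts or ends with a separator:

```python
import re

def split_on_gaps(text, gap_pattern=r"\s+"):
    # Hypothetical helper: treat matches of gap_pattern as separators
    # and discard the empty strings that re.split() can produce.
    return [tok for tok in re.split(gap_pattern, text) if tok]

para = "Hello World. It's good to see you. Thanks forreading this Notebook."
print(split_on_gaps(para))
print(split_on_gaps("  leading and trailing spaces  "))
```

Since only whitespace counts as a gap here, punctuation stays attached to the neighbouring word, just as in the output above.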

***
### ***<a id="training-tokenizer"></a>Training a sentence tokenizer***
***


**NLTK's** default sentence tokenizer is general purpose, and usually works quite well. But sometimes it is not the best choice for your text. Perhaps your text uses nonstandard punctuation, or is formatted in a unique way. In such cases, training your own sentence tokenizer can result in much more accurate sentence tokenization.

In [10]:

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)


In [11]:

sents1 = sent_tokenizer.tokenize(text)
sents1[0]


'White guy: So, do you have any plans for this evening?'

In [21]:

from nltk.tokenize import sent_tokenize
sents2 = sent_tokenize(text)
sents2[67]


'Guy #1: Well, he sort of was, spiritually.'

In [16]:

sents1[678]


'Girl: But you already have a Big Mac...'

In [19]:

sents2[678]


'Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.'

While the first sentence is the same, you can see that the tokenizers disagree on how to tokenize sentence 679, at index 678 (this is the first sentence where the tokenizers diverge). The default tokenizer includes the next line of dialog, while our custom tokenizer correctly treats the next line as a separate sentence. This difference is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text does not follow the typical paragraph and sentence structure.

***
### ***<a id="stop-word-filtering"></a>Filtering stopwords in a tokenized sentence***
***


**Stopwords** are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. These are words such as ***the*** and ***a***. Most search engines will filter out stopwords from search queries and documents in order to save space in their index.

In [54]:

para_1 = para.split()
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = para_1
Stopwords = [word for word in words if word not in english_stops]
print(para)
print(" ".join(Stopwords))


Hello World. It's good to see you. Thanks forreading this Notebook.
Hello World. It's good see you. Thanks forreading Notebook.
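Notice that "It's" and "you." survive in the output above even though `it's` and `you` are in the English stopword list: the tokens come from `str.split()`, so they keep their capitalization and trailing punctuation, and the membership test is case-sensitive. Here is a minimal sketch of one way around this. It uses a tiny hand-picked stand-in for the NLTK stopword set, which is an assumption made so that the snippet runs without the corpus:

```python
import re

para = "Hello World. It's good to see you. Thanks forreading this Notebook."

# Hypothetical stand-in for set(stopwords.words('english')); a tiny subset only.
english_stops = {"it's", "to", "you", "this", "the", "a"}

# Tokenize so that punctuation is dropped, then compare in lowercase.
tokens = re.findall(r"[\w']+", para)
filtered = [tok for tok in tokens if tok.lower() not in english_stops]
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the words that remain.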


The **stopwords** corpus is an instance of `nltk.corpus.reader.WordListCorpusReader`. As such, it has a `words()` method that can take a single argument for the file ID, which in this case is **english**, referring to a file containing a list of **English stopwords**. You could also call **stopwords.words()** with no argument to get a list of all stopwords in every language available.

You can see the complete list of languages using the fileids method as follows:

In [57]:

stopwords.fileids()


['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

Any of these fileids can be used as an argument to the words() method to get a list of stopwords for that language.

In [59]:

stopwords.words('english')


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [61]:

stopwords.words('dutch')


['de',
 'en',
 'van',
 'ik',
 'te',
 'dat',
 'die',
 'in',
 'een',
 'hij',
 'het',
 'niet',
 'zijn',
 'is',
 'was',
 'op',
 'aan',
 'met',
 'als',
 'voor',
 'had',
 'er',
 'maar',
 'om',
 'hem',
 'dan',
 'zou',
 'of',
 'wat',
 'mijn',
 'men',
 'dit',
 'zo',
 'door',
 'over',
 'ze',
 'zich',
 'bij',
 'ook',
 'tot',
 'je',
 'mij',
 'uit',
 'der',
 'daar',
 'haar',
 'naar',
 'heb',
 'hoe',
 'heeft',
 'hebben',
 'deze',
 'u',
 'want',
 'nog',
 'zal',
 'me',
 'zij',
 'nu',
 'ge',
 'geen',
 'omdat',
 'iets',
 'worden',
 'toch',
 'al',
 'waren',
 'veel',
 'meer',
 'doen',
 'toen',
 'moet',
 'ben',
 'zonder',
 'kan',
 'hun',
 'dus',
 'alles',
 'onder',
 'ja',
 'eens',
 'hier',
 'wie',
 'werd',
 'altijd',
 'doch',
 'wordt',
 'wezen',
 'kunnen',
 'ons',
 'zelf',
 'tegen',
 'na',
 'reeds',
 'wil',
 'kon',
 'niets',
 'uw',
 'iemand',
 'geweest',
 'andere']

In [62]:

stopwords.words('greek')


['αλλα',
 'αν',
 'αντι',
 'απο',
 'αυτα',
 'αυτεσ',
 'αυτη',
 'αυτο',
 'αυτοι',
 'αυτοσ',
 'αυτουσ',
 'αυτων',
 'αἱ',
 'αἳ',
 'αἵ',
 'αὐτόσ',
 'αὐτὸς',
 'αὖ',
 'γάρ',
 'γα',
 'γα^',
 'γε',
 'για',
 'γοῦν',
 'γὰρ',
 "δ'",
 'δέ',
 'δή',
 'δαί',
 'δαίσ',
 'δαὶ',
 'δαὶς',
 'δε',
 'δεν',
 "δι'",
 'διά',
 'διὰ',
 'δὲ',
 'δὴ',
 'δ’',
 'εαν',
 'ειμαι',
 'ειμαστε',
 'ειναι',
 'εισαι',
 'ειστε',
 'εκεινα',
 'εκεινεσ',
 'εκεινη',
 'εκεινο',
 'εκεινοι',
 'εκεινοσ',
 'εκεινουσ',
 'εκεινων',
 'ενω',
 'επ',
 'επι',
 'εἰ',
 'εἰμί',
 'εἰμὶ',
 'εἰς',
 'εἰσ',
 'εἴ',
 'εἴμι',
 'εἴτε',
 'η',
 'θα',
 'ισωσ',
 'κ',
 'καί',
 'καίτοι',
 'καθ',
 'και',
 'κατ',
 'κατά',
 'κατα',
 'κατὰ',
 'καὶ',
 'κι',
 'κἀν',
 'κἂν',
 'μέν',
 'μή',
 'μήτε',
 'μα',
 'με',
 'μεθ',
 'μετ',
 'μετά',
 'μετα',
 'μετὰ',
 'μη',
 'μην',
 'μἐν',
 'μὲν',
 'μὴ',
 'μὴν',
 'να',
 'ο',
 'οι',
 'ομωσ',
 'οπωσ',
 'οσο',
 'οτι',
 'οἱ',
 'οἳ',
 'οἷς',
 'οὐ',
 'οὐδ',
 'οὐδέ',
 'οὐδείσ',
 'οὐδεὶς',
 'οὐδὲ',
 'οὐδὲν',
 'οὐκ',
 'οὐχ',
 'οὐχὶ',
 'οὓς',
 ...]

***
### ***<a id="wordnet-synsets"></a>Synsets for a word in WordNet***
***


**WordNet** is a ***lexical database*** for the English language. In other words, it's a dictionary designed specifically for natural language processing.

**NLTK** comes with a simple interface to look up words in **WordNet**. What you get is a list of Synset instances, which are groupings of synonymous words that express the same concept. Many words have only one Synset, but some have several. 

Synset for *cookbook*

In [66]:

from nltk.corpus import wordnet
syn = wordnet.synsets('cookbook')[0]
syn.name()


'cookbook.n.01'

In [67]:

syn.definition()


'a book of recipes and cooking directions'

You can look up any word in **WordNet** using ***wordnet.synsets(word)*** to get a list of Synsets. The list may be empty if the word is not found. The list may also have quite a few elements, as some words can have many possible meanings, and, therefore, many Synsets.

Each **Synset** in the list has a number of methods you can use to learn more about it. The **name()** method will give you a unique name for the Synset, which you can use to get the Synset directly:

In [68]:

wordnet.synset('cookbook.n.01')


Synset('cookbook.n.01')

The `definition()` method should be self-explanatory. Synsets also have an `examples()` method, which returns a list of phrases that use the word in context (the list is empty when WordNet has no examples for that Synset):

In [69]:

wordnet.synsets('cooking')[0].examples()


['cooking can be a great art',
 'people are needed who have experience in cookery',
 'he left the preparation of meals to his wife']

***
#### ***Working with hypernyms***
***


Synsets are organized in a structure similar to that of an inheritance tree. More abstract terms are known as **hypernyms** and more specific terms are **hyponyms**. This tree can be traced all the way up to a root hypernym. Hypernyms provide a way to categorize and group words based on their similarity to each other. The [Calculating WordNet Synset similarity](#wordnet-similarity) section details the functions used to calculate the similarity based on the distance between two words in the hypernym tree. The following cells look up the hypernyms of *cookbook*:

In [70]:

syn.hypernyms()


[Synset('reference_book.n.01')]

In [71]:

syn.hypernyms()[0].hyponyms()


[Synset('annual.n.02'),
 Synset('atlas.n.02'),
 Synset('cookbook.n.01'),
 Synset('directory.n.01'),
 Synset('encyclopedia.n.01'),
 Synset('handbook.n.01'),
 Synset('instruction_book.n.01'),
 Synset('source_book.n.01'),
 Synset('wordbook.n.01')]

In [72]:

syn.root_hypernyms()


[Synset('entity.n.01')]

As you can see, *reference_book* is a **hypernym** of *cookbook*, but cookbook is only one of the many hyponyms of *reference_book*. And all these types of books have the same root **hypernym**, which is entity, one of the most abstract terms in the English language. You can trace the entire path from entity down to cookbook using the **hypernym_paths()** method, as follows:

In [74]:

syn.hypernym_paths()


[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('creation.n.02'),
  Synset('product.n.02'),
  Synset('work.n.02'),
  Synset('publication.n.01'),
  Synset('book.n.01'),
  Synset('reference_book.n.01'),
  Synset('cookbook.n.01')]]

The `hypernym_paths()` method returns a list of lists, where each list starts at the root hypernym and ends with the original Synset. Most of the time, you'll only get one nested list of Synsets.
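To make the tree structure concrete, here is a small sketch that stores the cookbook chain shown above as a child-to-parent mapping and walks it up to the root. The `PARENT` dictionary and the `hypernym_path` helper are hypothetical and only mimic the single-parent case. Real Synsets can have more than one hypernym, which is why `hypernym_paths()` returns a list of lists:

```python
# Names copied from the hypernym path of cookbook.n.01 shown above, root first.
CHAIN = [
    "entity.n.01", "physical_entity.n.01", "object.n.01", "whole.n.02",
    "artifact.n.01", "creation.n.02", "product.n.02", "work.n.02",
    "publication.n.01", "book.n.01", "reference_book.n.01", "cookbook.n.01",
]

# Child -> parent links built from consecutive (parent, child) pairs.
PARENT = {child: parent for parent, child in zip(CHAIN, CHAIN[1:])}

def hypernym_path(name):
    # Hypothetical helper: follow parent links until the root is reached,
    # then reverse so the path runs from the root down to `name`.
    path = [name]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]

path = hypernym_path("cookbook.n.01")
print(path[0], "->", path[-1], "via", len(path), "nodes")
```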

You can also look up a simplified **part-of-speech** tag as follows:

In [76]:

syn.pos()


'n'

There are four common part-of-speech tags (or POS tags) found in WordNet, as shown in the following table:

In [82]:

import pandas as pd
pos = {'Part of speech': ['Noun', 'Adjective', 'Adverb', 'Verb'], 'Tag': ['n', 'a', 'r', 'v']}
pos_dataframe = pd.DataFrame(pos)
pos_dataframe


| | Part of speech | Tag |
|---|---|---|
| 0 | Noun | n |
| 1 | Adjective | a |
| 2 | Adverb | r |
| 3 | Verb | v |


These **POS** tags can be used to look up specific Synsets for a word. For example, the word ***great*** can be used as a noun or an adjective. In WordNet, ***great*** has 1 noun Synset and 6 adjective Synsets, as shown in the following code:

In [93]:

len(wordnet.synsets('great'))


7

In [94]:

len(wordnet.synsets('great', pos='n'))


1

In [95]:

len(wordnet.synsets('great', pos='a'))


6

***
### ***<a id="wordnet-lemmas-synonyms"></a>Lemmas and synonyms in WordNet***
***


We can also use lemmas in WordNet to find synonyms of a word. A **lemma** (in linguistics) is the canonical or morphological form of a word.

In the following code, we will find that there are two lemmas for the cookbook Synset using the **lemmas()** method:

In [96]:

from nltk.corpus import wordnet
syn = wordnet.synsets('cookbook')[0]
lemmas = syn.lemmas()
len(lemmas)


2

In [97]:

lemmas[0].name()


'cookbook'

In [98]:

lemmas[1].name()


'cookery_book'

In [99]:

lemmas[0].synset() == lemmas[1].synset()


True

As you can see, cookery_book and cookbook are two distinct lemmas in the same Synset. In fact, a lemma can only belong to a single Synset. In this way, a Synset represents a group of lemmas that all have the same meaning, while a lemma represents a distinct word form.

***
### ***<a id="wordnet-similarity"></a>Calculating WordNet Synset similarity***
***


***Synsets*** are organized in a hypernym tree. This tree can be used for reasoning about the similarity between the Synsets it contains. The closer the two Synsets are in the tree, the more similar they are.

In [100]:

from nltk.corpus import wordnet
cb = wordnet.synset('cookbook.n.01')
ib = wordnet.synset('instruction_book.n.01')
cb.wup_similarity(ib)


0.9166666666666666

The `wup_similarity` method is short for **Wu-Palmer Similarity**, which is a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym:

In [101]:

ref = cb.hypernyms()[0]
cb.shortest_path_distance(ref)


1

In [102]:

ib.shortest_path_distance(ref)


1

In [106]:

cb.shortest_path_distance(ib)


2
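These distances tie back to the similarity score above. Assuming the usual Wu-Palmer formula, 2 × depth(LCS) / (depth(a) + depth(b)), where the LCS is the deepest common hypernym and depth counts nodes starting from the root, the value can be checked by hand. The depth of reference_book is read off the `hypernym_paths()` output for cookbook shown earlier, and I assume instruction_book sits at the same depth as cookbook because it is also one step below reference_book:

```python
def wu_palmer(depth_a, depth_b, depth_lcs):
    # Hypothetical helper implementing 2 * depth(LCS) / (depth(a) + depth(b)).
    return 2 * depth_lcs / (depth_a + depth_b)

depth_lcs = 11            # reference_book.n.01 is the 11th node on the path from entity.n.01
depth_cb = depth_lcs + 1  # cookbook.n.01 is one step below reference_book
depth_ib = depth_lcs + 1  # assumed: instruction_book.n.01 is also one step below

score = wu_palmer(depth_cb, depth_ib, depth_lcs)
print(score)
```

The closer the common hypernym is to both Synsets, the closer the ratio gets to 1.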

***
### ***<a id="word-collocations"></a>Discovering word collocations***
***


**Collocations** are two or more words that tend to appear frequently together, such as United States. Of course, there are many other words that can come after United, such as United Kingdom and United Airlines. As with many aspects of natural language processing, context is very important. And for collocations, context is everything!

In the case of collocations, the context will be a document in the form of a list of words. Discovering collocations in this list of words means that we'll find common phrases that occur frequently throughout the text.

We create a list of all lowercased words in the text and then build a `BigramCollocationFinder`, which we can use to find bigrams (pairs of words). These bigrams are scored using association measure functions in the **nltk.metrics** package:

In [115]:

from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
words = [w.lower() for w in webtext.words('grail.txt')]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 8)


[("'", 's'),
 ('arthur', ':'),
 ('#', '1'),
 ("'", 't'),
 ('villager', '#'),
 ('#', '2'),
 (']', '['),
 ('1', ':')]

That's not very useful! Let's refine it a bit by adding a word filter to remove punctuation and stopwords:

In [114]:

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 8)


[('black', 'knight'),
 ('clop', 'clop'),
 ('head', 'knight'),
 ('mumble', 'mumble'),
 ('squeak', 'squeak'),
 ('saw', 'saw'),
 ('holy', 'grail'),
 ('run', 'away')]

`BigramCollocationFinder` constructs two frequency distributions: one for each word, and another for bigrams. A frequency distribution, or FreqDist in NLTK, is basically an enhanced Python dictionary where the keys are what's being counted, and the values are the counts. Any filtering functions that are applied reduce the size of these two FreqDists by eliminating any words that don't pass the filter. By using a filtering function to eliminate all words that are one or two characters, and all English stopwords, we can get a much cleaner result. After filtering, the collocation finder is ready to accept a generic scoring function for finding collocations.
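The bookkeeping described above can be sketched with `collections.Counter` from the standard library. This is not NLTK's implementation: the toy word list and stopword set are made up, the filter is applied to the bigram counts only, and bigrams are ranked by raw frequency instead of the likelihood ratio:

```python
from collections import Counter

# Hypothetical toy text standing in for a real corpus.
words = "the black knight fights the black knight and the holy grail".split()

word_fd = Counter(words)                    # one count per word
bigram_fd = Counter(zip(words, words[1:]))  # one count per adjacent pair

stopset = {"the", "and"}  # stand-in for the English stopword set

def filter_stops(w):
    # True means "drop": short words and stopwords are filtered out.
    return len(w) < 3 or w in stopset

# Keep only bigrams in which no word is rejected by the filter.
filtered_fd = Counter({
    bigram: count
    for bigram, count in bigram_fd.items()
    if not any(filter_stops(w) for w in bigram)
})

print(word_fd["black"], filtered_fd.most_common(1))
```

Raw counts favor pairs that are simply frequent, which is one reason a measure such as the likelihood ratio is preferred for real collocation finding.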

In addition to BigramCollocationFinder, there's also TrigramCollocationFinder, which finds triplets instead of pairs. 

In [128]:

from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
words = [w.lower() for w in webtext.words('singles.txt')]
tcf = TrigramCollocationFinder.from_words(words)
tcf.apply_word_filter(filter_stops)
tcf.apply_freq_filter(3)
tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)


[('long', 'term', 'relationship')]

In addition to the stopword filter, I also applied a frequency filter, which removed any trigrams that occurred fewer than three times. This is why only one result was returned when we asked for four: only one trigram occurred more than two times.