<a href="https://colab.research.google.com/github/liadmagen/NLP-Course/blob/master/exercises_notebooks_solutions/01_LM_NLP_python_basics__solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to use the package `NLTK`, the 'Natural Language Toolkit' (https://www.nltk.org/).

NLTK is a great package for research and for learning. However, it is generally not recommended for production use or real-world applications, because it is comparatively slow and therefore does not scale well.

# Setup

In [1]:
import random

import nltk

In [2]:
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    

True

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


# A Closer Look at Python: Texts as Lists of Words

We will use Herman Melville's great book 'Moby Dick' as the playground for our experiments.

The book is already tokenized and stored as a list of tokens in the variable `text1`.

We start, as always, by looking at our data.

Let's peek at the first 100 words:

In [4]:
text1[:100]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now',
 '.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '.',
 '"',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',']

**Note that punctuation marks are also considered tokens.**
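Peeking at the first 100 tokens relied on list slicing. As a minimal sketch of how slices behave, here is the same idea on a short, made-up token list (the list and variable names are illustrative and not part of the notebook):

```python
# A toy token list standing in for text1 (hypothetical data).
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "--"]

first_three = tokens[:3]   # items at positions 0, 1 and 2
last_two = tokens[-2:]     # the final two items
middle = tokens[2:5]       # positions 2, 3 and 4 (the end index is excluded)

print(first_three)
print(last_two)
print(middle)
```

A negative start index counts from the end of the list, which is handy when the total length is not known in advance.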

Exercise #1: Show the last 23 tokens in the book:

In [5]:
### your turn: Write code that shows the last sentence (23 tokens) of the book
text1[-23:]
### End

['It',
 'was',
 'the',
 'devious',
 '-',
 'cruising',
 'Rachel',
 ',',
 'that',
 'in',
 'her',
 'retracing',
 'search',
 'after',
 'her',
 'missing',
 'children',
 ',',
 'only',
 'found',
 'another',
 'orphan',
 '.']

In Python, an ordered collection that allows repetition is a list, written with square brackets: [].

An unordered collection in which repetitions are discarded is a set, written with curly braces, {}, or created with `set()`. Regular brackets, (), create a tuple instead.
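As a quick check of how these containers differ, here is a minimal sketch on a toy word list (all names are illustrative). Note that a set literal uses curly braces, while parentheses produce a tuple:

```python
words = ["the", "whale", "the", "sea", "whale"]  # list: ordered, keeps duplicates
unique_words = set(words)                         # set: unordered, duplicates dropped
same_set = {"the", "whale", "sea"}                # a set literal uses curly braces
pair = ("Moby", "Dick")                           # parentheses create a tuple, not a set

print(len(words), len(unique_words))
print(unique_words == same_set)
print(type(pair).__name__)
```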

When converting the list into a set, we get the vocabulary of the corpus:

In [6]:
vocab = set(text1)

# We can't get the 'last 25 words' of the vocabulary, since a set has no order.
# But sorted() returns a sorted list, which we can slice:
sorted(vocab)[-25:]

['yon',
 'yonder',
 'yore',
 'you',
 'young',
 'younger',
 'youngest',
 'youngish',
 'your',
 'yours',
 'yourselbs',
 'yourself',
 'yourselves',
 'youth',
 'youthful',
 'zag',
 'zay',
 'zeal',
 'zephyr',
 'zig',
 'zodiac',
 'zone',
 'zoned',
 'zones',
 'zoology']

Exercise #2: How many words does our vocabulary have?

In [7]:
### your turn: Write code that prints the size of the Moby Dick vocabulary
len(vocab)
### End

19317

# Text Analysis: Frequency Distribution

NLTK is a library with many research tools for working with frequencies and probabilities.

For example, it includes a class, `FreqDist`, that counts how often each word occurs in a text:

http://www.nltk.org/api/nltk.html?highlight=freqdist#module-nltk.probability
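In NLTK 3, `FreqDist` is a subclass of Python's `collections.Counter`, so the underlying idea can be sketched with the standard library alone. The token list below is made up for illustration; this is a sketch of the counting idea, not a replacement for `FreqDist`:

```python
from collections import Counter

# Hypothetical toy tokens standing in for a tokenized text.
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship"]

counts = Counter(tokens)                 # maps each token to its number of occurrences
top_two = counts.most_common(2)          # highest counts first
relative = counts["the"] / len(tokens)   # relative frequency of a single token

print(top_two)
print(counts["whale"], counts["harpoon"])  # a token that never occurs counts as 0
print(relative)
```

Dividing a count by the total number of tokens gives the relative frequency, which is the step that turns raw counts into an estimated probability.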

In [8]:
## Write code that calculates the frequency of words in text1 and prints the 20 most common ones.
## How many times do the words 'with', 'Moby', 'fish' and 'whale' appear in the book?
## Hint: a FreqDist is a smart dictionary that already has methods for these tasks,
## such as .most_common()

freqdist = FreqDist(text1)
print(freqdist.most_common(20))

for w in ['with', 'Moby', 'fish', 'whale']:
  print(f"{w}: {freqdist[w]}")

### End

[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632)]
with: 1659
Moby: 84
fish: 133
whale: 906


Some of the most common tokens are actually punctuation marks and 'stop words'. They don't really help us analyse the text, and should therefore be ignored.

Luckily, NLTK supplies a list of stop words, and Python has the punctuation characters built into the `string` module:

In [9]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [11]:
### Find and print the top 50 frequent words, without stop words or punctuation.
### Hint: like mathematical sets, a Python set can intersect, detect subsets and even subtract:
### See more in here: https://docs.python.org/3.8/library/stdtypes.html#set

punct = set(string.punctuation)
sw = set(stopwords.words('english'))
both = punct.union(sw)

freqdist = FreqDist(w for w in text1 if w.lower() not in both)
print(freqdist.most_common(50))


###

[('--', 1070), ('whale', 906), ('one', 889), ('like', 624), ('upon', 538), ('man', 508), ('ship', 507), ('Ahab', 501), ('."', 489), ('ye', 460), ('old', 436), ('sea', 433), ('would', 421), ('head', 335), ('though', 335), ('boat', 330), ('time', 324), ('long', 318), ('!"', 305), ('said', 302), (',"', 302), ('yet', 300), ('still', 299), ('great', 293), ('two', 285), ('seemed', 283), ('must', 282), ('Whale', 282), ('last', 277), ('way', 269), ('Stubb', 255), ('see', 253), ('?"', 252), ('Queequeg', 252), ('little', 247), ('round', 242), ('whales', 237), ('say', 237), ('three', 237), ('men', 236), ('thou', 232), ('may', 230), ('us', 228), ('every', 222), ('much', 218), ('could', 215), ('Captain', 215), ('first', 210), ('side', 208), ('hand', 205)]
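The result still contains tokens such as '--' and '."'. `set(string.punctuation)` holds single characters only, so a token made of several punctuation characters is not a member of it. One possible refinement, sketched here on toy data, is to drop every token whose characters are all punctuation. The helper name `is_content_token` and the tiny stop-word set are assumptions for illustration:

```python
import string

punct = set(string.punctuation)
stop_words = {"the", "of", "and"}  # tiny stand-in for stopwords.words('english')

def is_content_token(token: str) -> bool:
    """True if the token is neither a stop word nor made up only of punctuation."""
    if token.lower() in stop_words:
        return False
    return not all(ch in punct for ch in token)

# Hypothetical toy tokens, including multi-character punctuation.
tokens = ["The", "whale", "--", "and", "Ahab", '."', "sea", ","]
content = [t for t in tokens if is_content_token(t)]
print(content)
```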


FreqDist can be used even further. Let's analyse the text by word length.

Using a Python list comprehension (https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), we can easily get the length of every token in the text:

In [12]:
# For convenience of reading, showing here only the first 30
[len(w) for w in text1][:30]

[1,
 4,
 4,
 2,
 6,
 8,
 4,
 1,
 9,
 1,
 1,
 8,
 2,
 1,
 4,
 11,
 5,
 2,
 1,
 7,
 6,
 1,
 3,
 4,
 5,
 2,
 10,
 2,
 4,
 1]

In [13]:
### Write code to calculate the frequency of word lengths in `text1`.
### How often do the 20 longest words appear in the text?
### Extra: Find out what those 20 words are (hint: a python dictionary has a .keys() method)





### End
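One way to approach this exercise, sketched on a short made-up token list rather than on `text1`, is to count the token lengths and then sort the vocabulary by length. It uses `collections.Counter` from the standard library; the variable names are illustrative, and this is only one possible reading of the task:

```python
from collections import Counter

# Hypothetical toy tokens standing in for text1.
tokens = ["I", "see", "him", "now", ".", "He", "loved", "his", "old", "grammars"]

length_counts = Counter(len(t) for t in tokens)  # word length -> number of tokens
print(length_counts.most_common())

token_counts = Counter(tokens)                   # token -> number of occurrences
longest_first = sorted(set(tokens), key=len, reverse=True)
# The two longest distinct tokens, each with how often it occurs.
longest_with_counts = [(w, token_counts[w]) for w in longest_first[:2]]
print(longest_with_counts)
```

On the real text, the same steps would apply with `text1` in place of the toy list and a slice of 20 instead of 2.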

# Text Analysis: n-grams and collocations

As we learnt in class, a word is not always a single token. In cases such as 'New York', 'ice cream' or 'red wine', the meaning of each single word differs from the meaning of the combination.

A **collocation** is a sequence of words that occur together unusually often.

An `n-gram` is a sequence of 'n' consecutive tokens (i.e. words):

* When n=1: it is called a **unigram**
* When n=2: it is called a **bigram**
* When n=3: it is called a **trigram**
* When n>3: it is simply called an **n-gram** of that size, e.g. a 4-gram for n=4.
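To make the definition concrete, here is a minimal sketch of an n-gram generator in plain Python. The function name `make_ngrams` and the sample sentence are made up for illustration; it is not NLTK's implementation:

```python
from typing import Iterator, Sequence, Tuple

def make_ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens as a tuple."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])

sentence = ["I", "like", "ice", "cream", "."]
print(list(make_ngrams(sentence, 1)))  # unigrams
print(list(make_ngrams(sentence, 2)))  # bigrams
print(list(make_ngrams(sentence, 3)))  # trigrams
```

A sequence of length L contains L - n + 1 n-grams, so asking for an n larger than the sequence yields nothing.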


NLTK supports both: the `bigrams` function, and the `collocations` method of a text

In [14]:
list(bigrams([1,2,3,4,5]))

[(1, 2), (2, 3), (3, 4), (4, 5)]

In [15]:
## bigrams generates bi-grams from the text: every pair of adjacent tokens is collected together.
list(bigrams(text1))[:20]

[('[', 'Moby'),
 ('Moby', 'Dick'),
 ('Dick', 'by'),
 ('by', 'Herman'),
 ('Herman', 'Melville'),
 ('Melville', '1851'),
 ('1851', ']'),
 (']', 'ETYMOLOGY'),
 ('ETYMOLOGY', '.'),
 ('.', '('),
 ('(', 'Supplied'),
 ('Supplied', 'by'),
 ('by', 'a'),
 ('a', 'Late'),
 ('Late', 'Consumptive'),
 ('Consumptive', 'Usher'),
 ('Usher', 'to'),
 ('to', 'a'),
 ('a', 'Grammar'),
 ('Grammar', 'School')]

In [16]:
text1.collocations()

Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
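To give a feel for what "occurring together unusually often" can mean, here is a minimal sketch that ranks adjacent pairs by pointwise mutual information (PMI). PMI is just one common association score, chosen here for illustration; this sketch does not claim to reproduce the scoring that `collocations()` uses. The function name `pmi_bigrams`, the `min_count` parameter and the toy sentence are all assumptions:

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent token pairs by PMI, highest first."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = len(tokens) - 1
    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue  # very rare pairs give unreliable scores
        p_pair = count / n_bigrams
        p_first = unigram_counts[first] / n_tokens
        p_second = unigram_counts[second] / n_tokens
        # PMI compares the observed pair probability with what independence predicts.
        scored.append(((first, second), math.log2(p_pair / (p_first * p_second))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy text.
tokens = "new york is big and new york is old and the sea is big".split()
ranked = pmi_bigrams(tokens)
print(ranked[0])
```

In this toy text 'new' and 'york' only ever appear next to each other, so their pair scores higher than pairs involving the more widespread 'is'.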


# Python and NLP

Python has many strong built-in capabilities for string and text processing, especially in combination with list comprehensions.

Here are some examples of filtering the word list:

In [17]:
# Get all the words that end with 'ableness', sorted:
sorted(w for w in set(text1) if w.endswith('ableness'))

['comfortableness',
 'honourableness',
 'immutableness',
 'indispensableness',
 'indomitableness',
 'intolerableness',
 'palpableness',
 'reasonableness',
 'uncomfortableness']

In [18]:
# Get all the words that contain 'orate', sorted:
sorted(term for term in set(text1) if 'orate' in term)

['camphorated',
 'corroborated',
 'decorated',
 'elaborate',
 'elaborately',
 'evaporate',
 'evaporates',
 'incorporate',
 'incorporated']

In [19]:
# Get all the words in title case (first letter uppercase, remaining letters lowercase):
sorted(item for item in set(text1) if item.istitle())

['3D',
 'A',
 'Abashed',
 'Abednego',
 'Abel',
 'Abjectus',
 'Aboard',
 'Abominable',
 'About',
 'Above',
 'Abraham',
 'Academy',
 'Accessory',
 'According',
 'Accordingly',
 'Accursed',
 'Achilles',
 'Actium',
 'Acushnet',
 'Adam',
 'Adieu',
 'Adios',
 'Admiral',
 'Admirals',
 'Advance',
 'Advancement',
 'Adventures',
 'Adverse',
 'Advocate',
 'Affected',
 'Affidavit',
 'Affrighted',
 'Afric',
 'Africa',
 'African',
 'Africans',
 'Aft',
 'After',
 'Afterwards',
 'Again',
 'Against',
 'Agassiz',
 'Ages',
 'Ah',
 'Ahab',
 'Ahabs',
 'Ahasuerus',
 'Ahaz',
 'Ahoy',
 'Ain',
 'Air',
 'Akin',
 'Alabama',
 'Aladdin',
 'Alarmed',
 'Alas',
 'Albatross',
 'Albemarle',
 'Albert',
 'Albicore',
 'Albino',
 'Aldrovandi',
 'Aldrovandus',
 'Alexander',
 'Alexanders',
 'Alfred',
 'Algerine',
 'Algiers',
 'Alike',
 'Alive',
 'All',
 'Alleghanian',
 'Alleghanies',
 'Alley',
 'Almanack',
 'Almighty',
 'Almost',
 'Aloft',
 'Alone',
 'Alps',
 'Already',
 'Also',
 'Am',
 'Ambergriese',
 'Ambergris',
 'Amelia'

There are more. If `wrd` is a string, then, for example:

* `wrd.islower()` returns `True` if all the cased characters in the word are lowercase (and there is at least one)
* `wrd.isalpha()` returns `True` if all the characters in the string are letters

There are also `wrd.startswith('str')`, `wrd.isdigit()`, `wrd.isalnum()`, and more.
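These string methods can be tried out on a handful of sample tokens. The sample list below is made up for illustration:

```python
# Hypothetical sample tokens covering words, numbers and punctuation.
samples = ["whale", "Ahab", "1851", "3D", "--"]

# Show the result of each predicate for every sample token.
for wrd in samples:
    print(wrd, wrd.islower(), wrd.isalpha(), wrd.isdigit(), wrd.isalnum(), wrd.istitle())

# The same predicates work as filters in a list comprehension.
lowercase_words = [w for w in samples if w.islower()]
numeric_tokens = [w for w in samples if w.isdigit()]
alphanumeric = [w for w in samples if w.isalnum()]
```

Note that `'1851'.islower()` is `False`: a string without any cased characters is neither lowercase nor uppercase.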

In [20]:
from typing import List

### Exercise: 

def detect_string(tokens: List[str], search_str: str, search_position: int = 0):
  """Returns a sorted list of the vocabulary tokens which match the search conditions

  params:
    tokens: a document tokens list.
    search_str: a string to search in the token list 
    search_position: one of the following:
      0 - anywhere in the string
      1 - searches for the string at the beginning of the token
      2 - searches for the string at the end of the token
  """
  ### Fill in this function to return the result of searching for the
  ### given string in the vocabulary of the tokens, according to the
  ### position parameter
  
  vocab = set(tokens)
  
  if search_position == 1:
    return sorted([t for t in vocab if t.startswith(search_str)])

  if search_position == 2:
    return sorted([t for t in vocab if t.endswith(search_str)])

  return sorted([t for t in vocab if search_str in t])

### 

In [21]:
detect_string(text1, 'larg') 

['enlarge',
 'enlarged',
 'enlarges',
 'large',
 'largely',
 'largeness',
 'larger',
 'largest']

In [22]:
### Test:
assert detect_string(text1, 'tably', 2) == ['comfortably',
 'discreditably',
 'illimitably',
 'immutably',
 'indubitably',
 'inevitably',
 'inscrutably',
 'profitably',
 'unaccountably',
 'unwarrantably']

In [23]:
### Test:
assert detect_string(text1, 'argu', 1) == ['argue', 'argued', 'arguing', 'argument', 'arguments']

In [24]:
### Test:
assert detect_string(text1, 'arg', 2) == []

In [25]:
### Test
assert detect_string(text1, 'larg') == ['enlarge',
 'enlarged',
 'enlarges',
 'large',
 'largely',
 'largeness',
 'larger',
 'largest']