
# Python Basics

In this exercise, we'll explore Python's NLP capabilities with the help of the `nltk` package.

By the end of the exercise, you will:
* be introduced to the `nltk` package and its functionality
* understand the basics of text analysis and know how to approach this kind of unstructured data
* understand the terms 'n-gram' and 'collocation'

We are going to use the package `NLTK` - 'Natural Language Toolkit' (https://www.nltk.org/).

NLTK is a great package for research and for learning. However, it is generally not recommended for production use in real-world applications, as it is comparatively slow and therefore doesn't scale well.

# Setup

In [1]:
import random

import nltk

In [2]:
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    

True

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


# Exploratory Data Analysis (EDA)

## A Closer Look at Python: Texts as Lists of Words

We will use the great book 'Moby Dick' by Herman Melville, as our learning experiment playground.

The book is already tokenized and stored as a list of these tokens, under a variable with the excellent, well expressed name - `text1` (please do yourself - and me - a favor and name your variables in a more meaningful manner than that...).

We start, as you always should, by exploring and looking at our dataset.

Let's peek at the first 100 tokens:

In [4]:
text1[:100]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now',
 '.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '.',
 '"',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',']

Note that punctuation marks are also considered tokens here.

## Exercise #1: Show the last 23 tokens in the book:

In [5]:
### YOUR TURN:
### Write code that shows the last sentence (23 tokens) of the book

tokens = text1[-23:]  # a negative start index counts from the end
print(tokens)

### End

['It', 'was', 'the', 'devious', '-', 'cruising', 'Rachel', ',', 'that', 'in', 'her', 'retracing', 'search', 'after', 'her', 'missing', 'children', ',', 'only', 'found', 'another', 'orphan', '.']
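
Slicing is worth a quick look on its own. The following is a minimal sketch on a toy token list (the list is made up for illustration and merely stands in for `text1`), showing that a negative start index is shorthand for counting back from the end:

```python
# Toy token list standing in for text1 (illustrative only)
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "--"]

first_three = tokens[:3]                      # from the start up to, but not including, index 3
last_three = tokens[-3:]                      # a negative start counts from the end
same_last_three = tokens[len(tokens) - 3:]    # the explicit equivalent

print(first_three)
print(last_three)
print(last_three == same_last_three)
```

Both forms return the same slice; the negative index is simply shorter to read.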


## Lists vs Sets

In Python, an ordered collection that allows repetition is a `list`, written with square brackets: `[]`.

An unordered collection in which repetitions are *discarded* is a `set`, written with curly braces, e.g. `{1, 2, 3}` (note that a bare `{}` creates an empty dictionary; use `set()` for an empty set).

Converting a list into a set gives us the **vocabulary** of the corpus: the *unique* words that the dataset is built from:

In [6]:
vocab = set(text1)

# We can't get the 'last 50 words' of a set, since it has no order...
# But sorted() returns a sorted list, which we can then slice
sorted(vocab)[-50:]

['yawned',
 'yawning',
 'ye',
 'yea',
 'year',
 'yearly',
 'years',
 'yeast',
 'yell',
 'yelled',
 'yelling',
 'yellow',
 'yellowish',
 'yells',
 'yes',
 'yesterday',
 'yet',
 'yield',
 'yielded',
 'yielding',
 'yields',
 'yoke',
 'yoked',
 'yokes',
 'yoking',
 'yon',
 'yonder',
 'yore',
 'you',
 'young',
 'younger',
 'youngest',
 'youngish',
 'your',
 'yours',
 'yourselbs',
 'yourself',
 'yourselves',
 'youth',
 'youthful',
 'zag',
 'zay',
 'zeal',
 'zephyr',
 'zig',
 'zodiac',
 'zone',
 'zoned',
 'zones',
 'zoology']
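
To see the list/set distinction in isolation, here is a small sketch on a toy token list (the tokens are invented for illustration, not taken from the book):

```python
# Toy token list with repetitions (illustrative only)
tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

vocab = set(tokens)            # duplicates are discarded, order is not preserved
sorted_vocab = sorted(vocab)   # sorted() returns a new, alphabetically ordered list

print(len(tokens))        # number of tokens, repetitions included
print(len(vocab))         # number of unique tokens
print(sorted_vocab[-2:])  # slicing works again once we have a list
```

The set itself cannot be sliced; only the sorted list derived from it can.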

## Exercise #2: Vocabulary Length

How many words does our vocabulary contain?

In [7]:
### YOUR TURN:
### Write Python code that prints the size of the vocabulary of Moby Dick

wordcount = len(vocab)
print(wordcount)

# this prints the number of unique tokens in the book (case-sensitive, punctuation included)

### End

19317


# Text Analysis: Frequency Distribution

[nltk](http://www.nltk.org) is a library with many research tools for probabilistic information and dataset exploration.

For example, it includes a class, `FreqDist`, that counts how many times each word occurs in a text (its frequency distribution):

http://www.nltk.org/api/nltk.html?highlight=freqdist#module-nltk.probability
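
The idea behind a frequency distribution can be sketched with the standard library alone. To the best of our knowledge `FreqDist` builds on `collections.Counter`, so the toy example below (with made-up tokens) behaves in a similar way; treat it as an illustration of the concept rather than of NLTK's internals:

```python
from collections import Counter

# Toy token list (illustrative only)
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship", "."]

counts = Counter(tokens)                  # maps each token to its number of occurrences
top_two = counts.most_common(2)           # the two most frequent tokens with their counts
relative = counts["the"] / len(tokens)    # relative frequency of a single token

print(top_two)
print(counts["whale"], counts["shark"])   # a token that never occurs counts as 0
print(round(relative, 3))
```

Note the difference between a raw count and a relative frequency: the former is a number of occurrences, the latter a share of all tokens.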

In [8]:
### YOUR TURN:
## 1) Write a Python function named `get_most_frequent(n: int)` that calculates the frequency of words in text1 and returns the n most common ones (n is given as a parameter).
## 2) Write a Python function, `get_frequency(words: list[str])`, that, given a list of words, prints the frequency of each of those words in text1.
## 3) Use the functions to print how many times the words 'with', 'Moby', 'fish' and 'whale' appear in the book.
## hint: FreqDist is a smart Python dictionary that already has methods for these tasks, such as .most_common()



def get_most_frequent(n: int):
  frequency_distribution = FreqDist(text1)
  return frequency_distribution.most_common(n)

def get_frequency(words: list[str]):
  frequency_distribution = FreqDist(text1)
  for x in words:
    print(f"{x}: {frequency_distribution[x]}")


### End

get_frequency(["with", "Moby", "fish", "whale"])


with: 1659
Moby: 84
fish: 133
whale: 906


In [9]:
assert get_most_frequent(5) == [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024)]

Some of the most common tokens are actually punctuation marks and '**stop-words**'. They don't help us much with our text analysis and can therefore usually be ignored.

Luckily, NLTK supplies a list of stop words, and Python has the punctuation characters built into the `string` module:

In [10]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [12]:
### Write a function - get_most_frequent_filtered(n: int) - that returns the top
### n frequent words, after filtering out stop words and punctuation.

def get_most_frequent_filtered(n: int):
  stop_words = set(stopwords.words("english"))
  punctuation = set(string.punctuation)
  punctuation.add("--")
  filtered_words = [word for word in text1 if word.lower() not in stop_words and word not in punctuation]
  frequency_distribution = FreqDist(filtered_words)
  return frequency_distribution.most_common(n)


###
get_most_frequent_filtered(5)

[('whale', 906), ('one', 889), ('like', 624), ('upon', 538), ('man', 508)]

In [13]:
assert get_most_frequent_filtered(5) == [('whale', 906), ('one', 889), ('like', 624), ('upon', 538), ('man', 508)]
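
The filtering step can also be sketched without NLTK. The example below uses a tiny, hypothetical stop-word set (`toy_stop_words`, far shorter than NLTK's list) and a helper, `is_punctuation`, that is an assumption of this sketch. Checking every character of a token is one way to catch multi-character punctuation such as `--` without listing it by hand:

```python
import string
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only
toy_stop_words = {"the", "a", "of", "and"}

# Toy token list (illustrative only)
tokens = ["The", "whale", ",", "the", "white", "whale", "--", "and", "a", "ship", "."]

def is_punctuation(token):
    # True when every character of the token is a punctuation character,
    # so multi-character tokens such as "--" are caught as well
    return all(ch in string.punctuation for ch in token)

# Lowercase only for the stop-word comparison; the kept tokens are unchanged
content_words = [
    t for t in tokens
    if t.lower() not in toy_stop_words and not is_punctuation(t)
]

print(content_words)
print(Counter(content_words).most_common(1))
```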

`FreqDist` can be used even further. Let's analyse the text by word length.

Using a Python [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), we can easily get the length of every word in the text:

In [14]:
# For convenience of reading, showing here only the first 30
[len(w) for w in text1][:30]

[1,
 4,
 4,
 2,
 6,
 8,
 4,
 1,
 9,
 1,
 1,
 8,
 2,
 1,
 4,
 11,
 5,
 2,
 1,
 7,
 6,
 1,
 3,
 4,
 5,
 2,
 10,
 2,
 4,
 1]
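
Counting how often each length occurs is then a small step. A minimal sketch on a toy token list (invented for illustration), using `collections.Counter` as a stand-in for a frequency distribution:

```python
from collections import Counter

# Toy token list (illustrative only)
tokens = ["I", "see", "him", "now", ".", "He", "was", "ever", "dusting"]

# Feed the lengths, rather than the tokens themselves, into the counter
length_counts = Counter(len(t) for t in tokens)
longest = max(tokens, key=len)   # the token with the greatest length

print(sorted(length_counts.items()))   # (length, number of tokens) pairs
print(longest)
```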

## Exercise #3 (Advanced): Length Frequency

In [15]:
### Write code to calculate the frequency of each word length in `text1`.
### Find out what the 20 longest words are.
### How many times does each of those 20 longest words appear in the text?

# Step 1: Create dictionaries to hold the frequency of each word and word lengths
word_frequency = {}
length_frequency = {}

# Step 2: Count the frequency of each word and their lengths
for word in text1:
    if word.isalpha():  # Only consider alphabetic words
        word = word.lower()  # Normalize to lowercase for consistent counting

        # Update word frequency
        if word in word_frequency:
            word_frequency[word] += 1
        else:
            word_frequency[word] = 1

        # Update word length frequency
        word_length = len(word)
        if word_length in length_frequency:
            length_frequency[word_length] += 1
        else:
            length_frequency[word_length] = 1

# Step 3: Find the longest 20 words and their frequencies
longest_words = sorted(word_frequency.keys(), key=len, reverse=True)[:20]

# Step 4: Print the results for word length frequencies
print("Word Length Frequencies:")
for length, frequency in sorted(length_frequency.items()):
    print(f"Length: {length}, Frequency: {frequency}")

# Step 5: Print the results for the longest words
print("\nThe 20 longest words and their frequencies:")
for word in longest_words:
    print(f"Word: '{word}', Frequency: {word_frequency[word]}")



### End

Word Length Frequencies:
Length: 1, Frequency: 9168
Length: 2, Frequency: 35620
Length: 3, Frequency: 49556
Length: 4, Frequency: 42221
Length: 5, Frequency: 26590
Length: 6, Frequency: 17111
Length: 7, Frequency: 14399
Length: 8, Frequency: 9966
Length: 9, Frequency: 6428
Length: 10, Frequency: 3528
Length: 11, Frequency: 1873
Length: 12, Frequency: 1053
Length: 13, Frequency: 565
Length: 14, Frequency: 177
Length: 15, Frequency: 70
Length: 16, Frequency: 22
Length: 17, Frequency: 12
Length: 18, Frequency: 1
Length: 20, Frequency: 1

The 20 longest words and their frequencies:
Word: 'uninterpenetratingly', Frequency: 1
Word: 'characteristically', Frequency: 1
Word: 'uncomfortableness', Frequency: 1
Word: 'cannibalistically', Frequency: 1
Word: 'circumnavigations', Frequency: 1
Word: 'superstitiousness', Frequency: 2
Word: 'comprehensiveness', Frequency: 3
Word: 'preternaturalness', Frequency: 1
Word: 'indispensableness', Frequency: 1
Word: 'uncompromisedness', Frequency: 1
Word: 'subt

# Text Analysis: n-grams and collocation

As we saw in class, a `token` does not always correspond to a single word. In the case of 'New York', 'ice cream', 'red wine', etc., the meaning of each word on its own is different from the meaning of the combined phrase.

A **collocation** is a sequence of words that occur together unusually often.

An `n-gram` is a sequence of 'n' consecutive tokens (e.g. words):

* When n=1: it is called **unigram**
* When n=2: it is called **bigram**
* When n=3: it is called **trigram** ...
* When n>3: it is simply called an **n-gram** of that size (for example, a **4-gram** when n=4).
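
The definition above can be sketched as a sliding window over the token list. The helper name `ngrams_sketch` is hypothetical and exists only for this illustration:

```python
def ngrams_sketch(tokens, n):
    # Hypothetical helper: slide a window of size n over the token list
    # and collect each window as a tuple
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy token list (illustrative only)
tokens = ["New", "York", "is", "big"]

print(ngrams_sketch(tokens, 1))   # unigrams
print(ngrams_sketch(tokens, 2))   # bigrams
print(ngrams_sketch(tokens, 3))   # trigrams
print(ngrams_sketch(tokens, 5))   # n larger than the list: no n-grams at all
```

A list of k tokens yields k - n + 1 n-grams, which is why the last call returns an empty list.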


NLTK offers both: a `bigrams` function, and a `collocations` method on its `Text` objects (such as `text1`).

In [16]:
list(bigrams([1,2,3,4,5]))

[(1, 2), (2, 3), (3, 4), (4, 5)]

In [17]:
## bigrams() generates bigrams from the text: every pair of adjacent tokens is collected together.
list(bigrams(text1))[:20]

[('[', 'Moby'),
 ('Moby', 'Dick'),
 ('Dick', 'by'),
 ('by', 'Herman'),
 ('Herman', 'Melville'),
 ('Melville', '1851'),
 ('1851', ']'),
 (']', 'ETYMOLOGY'),
 ('ETYMOLOGY', '.'),
 ('.', '('),
 ('(', 'Supplied'),
 ('Supplied', 'by'),
 ('by', 'a'),
 ('a', 'Late'),
 ('Late', 'Consumptive'),
 ('Consumptive', 'Usher'),
 ('Usher', 'to'),
 ('to', 'a'),
 ('a', 'Grammar'),
 ('Grammar', 'School')]

In [32]:
### Your Turn ###
# Write code here that returns and prints the collocations in text1

from nltk import bigrams
from nltk.corpus import stopwords

def collocations(text):
  # Step 1: Keep only alphabetic tokens (this drops punctuation), lowercase them
  # and remove stop words
  stop_words = set(stopwords.words("english"))
  filtered_words = [word.lower() for word in text if word.isalpha() and word.lower() not in stop_words]
  # Step 2: Generate bigrams from the filtered words
  bigram_list = list(bigrams(filtered_words))
  # Step 3: Count occurrences of each bigram in a dictionary of bigrams (keys) and their frequencies (values)
  bigram_frequency = {}
  for bigram in bigram_list:
    if bigram not in bigram_frequency:
      bigram_frequency[bigram] = 1
    else:
      bigram_frequency[bigram] += 1
  # Step 4: Sort from highest to lowest frequency. sorted() over bigram_frequency.items()
  # returns a list of (bigram, frequency) tuples
  sorted_bigrams_according_to_frequency = sorted(bigram_frequency.items(), key=lambda item: item[1], reverse=True)
  # Return the candidate collocations, most frequent first
  return sorted_bigrams_according_to_frequency

# "result" is a list of tuples, each holding a bigram and its frequency
result = collocations(text1)

# Print the 20 most frequent bigrams in the expected format
print("Collocations:")
for bigram, frequency in result[:20]:
  print(f"{bigram[0]} {bigram[1]}; ", end="")

# Expected output:
# Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
# whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
# years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
# mate; white whale; ivory leg; one hand

### END ###########

Collocations:
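
Raw bigram frequency is only a rough proxy for "occurring together unusually often": a pair of very common words can be frequent purely by chance. One common alternative is to score each pair against what independence would predict, for instance with pointwise mutual information (PMI). The sketch below is an illustration under that assumption, on made-up tokens; the helper `pmi_collocations` is hypothetical, and this is not necessarily the measure NLTK's own `collocations` uses.

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    # Hypothetical helper: rank adjacent pairs by pointwise mutual information,
    # log2( P(x, y) / (P(x) * P(y)) )
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    scored = []
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue  # very rare pairs get inflated PMI scores, so skip them
        p_pair = count / n_pairs
        p_independent = (word_counts[first] / n_words) * (word_counts[second] / n_words)
        scored.append(((first, second), math.log2(p_pair / p_independent)))
    # Highest score first
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy token list (illustrative only)
tokens = ["new", "york", "is", "big", "and", "new", "york", "is", "old",
          "and", "the", "sea", "is", "big"]

ranked = pmi_collocations(tokens)
for pair, score in ranked:
    print(pair, round(score, 2))
```

Here 'new york', 'york is' and 'is big' each occur twice, yet 'new york' ranks first because 'is' is more common overall and so pairs with it are less surprising. The `min_count` filter matters: without it, a pair seen once between two rare words would outrank everything.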

# Python and NLP

Python has many strong built-in capabilities for string and text processing, which combine well with list comprehensions.

Here are some examples of filtering the word list:

In [33]:
# Get all the words that end with 'ableness', sorted:
sorted(w for w in set(text1) if w.endswith('ableness'))

['comfortableness',
 'honourableness',
 'immutableness',
 'indispensableness',
 'indomitableness',
 'intolerableness',
 'palpableness',
 'reasonableness',
 'uncomfortableness']

In [34]:
# Get all the words that contain 'orate', sorted:
sorted(term for term in set(text1) if 'orate' in term)

['camphorated',
 'corroborated',
 'decorated',
 'elaborate',
 'elaborately',
 'evaporate',
 'evaporates',
 'incorporate',
 'incorporated']

In [35]:
# Get all the title-cased words (str.istitle(): each run of letters starts with an uppercase letter and continues in lowercase):
sorted(item for item in set(text1) if item.istitle())

['3D',
 'A',
 'Abashed',
 'Abednego',
 'Abel',
 'Abjectus',
 'Aboard',
 'Abominable',
 'About',
 'Above',
 'Abraham',
 'Academy',
 'Accessory',
 'According',
 'Accordingly',
 'Accursed',
 'Achilles',
 'Actium',
 'Acushnet',
 'Adam',
 'Adieu',
 'Adios',
 'Admiral',
 'Admirals',
 'Advance',
 'Advancement',
 'Adventures',
 'Adverse',
 'Advocate',
 'Affected',
 'Affidavit',
 'Affrighted',
 'Afric',
 'Africa',
 'African',
 'Africans',
 'Aft',
 'After',
 'Afterwards',
 'Again',
 'Against',
 'Agassiz',
 'Ages',
 'Ah',
 'Ahab',
 'Ahabs',
 'Ahasuerus',
 'Ahaz',
 'Ahoy',
 'Ain',
 'Air',
 'Akin',
 'Alabama',
 'Aladdin',
 'Alarmed',
 'Alas',
 'Albatross',
 'Albemarle',
 'Albert',
 'Albicore',
 'Albino',
 'Aldrovandi',
 'Aldrovandus',
 'Alexander',
 'Alexanders',
 'Alfred',
 'Algerine',
 'Algiers',
 'Alike',
 'Alive',
 'All',
 'Alleghanian',
 'Alleghanies',
 'Alley',
 'Almanack',
 'Almighty',
 'Almost',
 'Aloft',
 'Alone',
 'Alps',
 'Already',
 'Also',
 'Am',
 'Ambergriese',
 'Ambergris',
 'Amelia'

And there are more. If `wrd` is a string, then, for example:

* `wrd.islower()` returns `True` if all the cased characters in the word are lowercase
* `wrd.isalpha()` returns `True` if all the characters in the string are letters

There are also `wrd.startswith('str')`, `wrd.isdigit()`, `wrd.isalnum()`
and [many others](https://www.w3schools.com/python/python_ref_string.asp).
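
A quick way to get a feel for these predicates is to run them over a handful of contrasting strings. The word list below is made up for illustration:

```python
# Contrasting toy strings (illustrative only)
words = ["Whale", "whale", "WHALE", "3D", "1851", "--", "A"]

title_cased = [w for w in words if w.istitle()]   # uppercase first letter, lowercase rest
alphabetic = [w for w in words if w.isalpha()]    # letters only
digits = [w for w in words if w.isdigit()]        # digits only
alphanumeric = [w for w in words if w.isalnum()]  # letters and/or digits, no punctuation

print(title_cased)
print(alphabetic)
print(digits)
print(alphanumeric)
```

Note that `'3D'.istitle()` is `True`, because the digit is uncased and the only letter is uppercase, whereas `'WHALE'.istitle()` is `False`. This is why '3D' shows up in the title-cased list above.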

## Exercise #4: Functions and substrings search

In [51]:
from typing import List

### Exercise:
### Fill in this function to return the result of searching for the
### given string "search_str" in the token vocabulary "tokens", according to
### the position parameter, as explained in the docstring

def detect_string(tokens: List[str], search_str: str, search_position: int = 0) -> List[str]:
  """Returns a sorted list of the vocabulary tokens which match the search conditions

  params:
    tokens: a document tokens list.
    search_str: a string to search in the token list
    search_position: one of the following:
      0 - anywhere in the string
      1 - searches for the string at the beginning of the token
      2 - searches for the string at the end of the token
  """
  # search_position defaults to 0, so omitting it searches anywhere in the token
  vocabulary = set(tokens)
  if search_position == 0:
    # Keep the tokens that contain search_str anywhere
    return sorted(word for word in vocabulary if search_str in word)
  elif search_position == 1:
    # Keep the tokens that start with search_str
    return sorted(word for word in vocabulary if word.startswith(search_str))
  elif search_position == 2:
    # Keep the tokens that end with search_str
    return sorted(word for word in vocabulary if word.endswith(search_str))
  # Any other value is invalid; raising avoids returning an undefined result
  raise ValueError("Invalid search position. Enter either 0, 1 or 2.")



###

In [47]:
### Test:
assert detect_string(text1, 'tably', 2) == ['comfortably',
 'discreditably',
 'illimitably',
 'immutably',
 'indubitably',
 'inevitably',
 'inscrutably',
 'profitably',
 'unaccountably',
 'unwarrantably']

In [48]:
### Test:
assert detect_string(text1, 'argu', 1) == ['argue', 'argued', 'arguing', 'argument', 'arguments']

In [49]:
### Test:
assert detect_string(text1, 'arg', 2) == []

In [50]:
### Test
assert detect_string(text1, 'larg') == ['enlarge',
 'enlarged',
 'enlarges',
 'large',
 'largely',
 'largeness',
 'larger',
 'largest']