<h1 style="text-align: center;">1 Setup</h1>

You should install Python first if you do not already have it:

Go to https://www.python.org/downloads/ and download Python 3.10/3.11

Install the required packages:

### pip install nltk
### pip install notebook
### pip install matplotlib
### pip install wordcloud
Then start the notebook server by typing:
### jupyter notebook
For more details, see https://www.nltk.org/book/ch01.html and https://www.nltk.org/book/ch02.html

<h2 style="text-align: center;">1.1 Getting Started with Python</h2>

In [None]:
print("hello world")

In [None]:
1 + 1 - 3

In [None]:
# dividing by zero raises a ZeroDivisionError
1 / 0

<h2 style="text-align: center;">1.2 Getting Started with NLTK</h2>

In [None]:
# NLTK must be installed before it can be imported (pip install nltk)
import nltk
# open the NLTK downloader to fetch the datasets
nltk.download()

In [None]:
# from NLTK's book module, load all items.
from nltk.book import *

In [None]:
# enter their names at the Python prompt
print(text1)

![Moby-Dick](figs/moby-dick.jpg)

In [None]:
print(text2)

![Sense and sensibility](figs/sense-and-sensibility.jpg)

In [None]:
print(sent1)

In [None]:
print(sent2)

<h1 style="text-align: center;">2 Python: Texts as Lists of Words</h1>


<h2 style="text-align: center;">2.1 Lists</h2>
What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here's how we represent text in Python, in this case the opening sentence of Moby Dick:


In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
# Task: get the length of sent1
len(sent1)

In [None]:
# Task: list out more defined sentences in nltk.book
print(sent2)
print(sent3)

In [None]:
# Task: make a concatenation of two lists
print(sent2 + sent3)

In [None]:
# Task: append a word to a list, then remove it again
print(sent1)
sent1.append("NLP")
print(sent1)
sent1.pop()
print(sent1)

<h2 style="text-align: center;">2.2 Indexing Lists</h2>

In [None]:
# Task: find the word at the position 173 in 
#       US presidential inaugural address texts (text4)
print(text4)
text4[173]

In [None]:
# Task: find the index of the first occurrence of the word "awaken"
text4.index('awaken')

In [None]:
# Task: look up the index of the word "NLP"
#       (it does not occur in text4, so this raises a ValueError)
text4.index('NLP')

In [None]:
# Task: slicing, index a sublist from 16715 to 16735 starting from 
#       the beginning of text5 (Chat Corpus)
print(text5[16715:16735])

In [None]:
# Task: slicing, index a sublist from 9850 to 9888 starting from the end of text4
print(' '.join(text4[-9888:-9850]))

In [None]:
# Task: index sublist from position 141525 to the last word
print(text2[141525:])

In [None]:
# Task: index the last 3 words
print(text2[-3:])
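
The negative indices used above count from the end of the list. As a small sketch on a made-up list (the names `words`, `last_three`, `same_slice`, and `middle` are illustrative, not from NLTK), a negative slice can be checked against the equivalent positive one:

```python
# Hypothetical toy list standing in for a long text.
words = ['w0', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6', 'w7', 'w8', 'w9']
n = len(words)

# A negative index -k refers to position n - k.
assert words[-3] == words[n - 3]

# So a negative slice selects the same items as the matching positive slice.
last_three = words[-3:]      # same items as words[n - 3:]
same_slice = words[n - 3:]
middle = words[-8:-5]        # same items as words[2:5]
print(last_three, same_slice, middle)
```

The same reasoning applies to `text4[-9888:-9850]`, which runs from 9888 tokens before the end up to (but not including) 9850 tokens before the end.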

In [None]:
# Task: index beyond the end of sent1
#       an out-of-range slice returns an empty list,
#       but an out-of-range index raises an IndexError
print(sent1[10:12])
print(sent1[10])

In [None]:
# Task: create a new variable, an artificial sentence
sent = [f'word{i+1}' for i in range(10)]
print(sent)

In [None]:
# Task: find the first 5 words
print(sent[:5])

In [None]:
# Task: modify the first and last element
print(sent)
sent[0] = 'First'
sent[9] = 'Last'
print(sent)

In [None]:
# Task: replace a sublist with a shorter list
sent[1:9] = ['Second', 'Third']
print(sent)

<h2 style="text-align: center;">2.3 Variables</h2>

In [None]:
# Task: define a variable and a sub list of my_sent,
#       and then sort the sublist
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]
print(noun_phrase)
print(sorted(noun_phrase))

In [None]:
# Task: build the vocabulary (the set of distinct tokens) of text1 (Moby-Dick),
#       list 10 of its items, and show the vocabulary size
vocab = set(text1)
print(list(vocab)[:10])
vocab_size = len(vocab)
print(vocab_size)

In [None]:
# Task: check whether the following variable name is valid
#       ('not' is a reserved word, so this raises a SyntaxError)
not = 'Camelot'

<h2 style="text-align: center;">2.4 Strings</h2>

In [None]:
# Task: Define a string and find first and last char of it.
name = 'Monty'
print(name)
print(name[0])
print(name[-1])

In [None]:
# Task: repeat name twice; concatenate it with '!!!':
print(name * 2)
print(name + "!!!")

In [None]:
# Task: join the words of a list to make a single string
joint_str = ' '.join(['Python', 'for', 'NLP'])
print(joint_str)

In [None]:
# Task: split a string into a list of words on a separator (whitespace by default)
print(joint_str.split())

<h1 style="text-align: center;">3 Computing with Language: Simple Statistics</h1>

<h2 style="text-align: center;">3.1 Frequency Distributions</h2>

A frequency distribution tells us how often each vocabulary item occurs in the text. It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them.
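
As a rough illustration of the idea, independent of NLTK, a frequency distribution can be sketched with Python's standard `collections.Counter`. The token list below is a made-up example, not data from any corpus, and this is not how `FreqDist` is meant to be replaced, only a way to see what it counts.

```python
from collections import Counter

# Hypothetical toy token list (not taken from any NLTK corpus).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# Count how many times each distinct token occurs.
counts = Counter(tokens)

print(counts['the'])          # count of one vocabulary item
print(counts.most_common(2))  # the two most frequent items with their counts
print(sum(counts.values()) == len(tokens))  # the counts add up to the number of tokens
```

The last line shows the "distribution" aspect: the total number of tokens is split across the vocabulary items.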

In [None]:
# Task: create a frequency distribution for text1 (Moby-Dick text)

# A frequency distribution for the outcomes of an experiment. A frequency distribution 
# records the number of times each outcome of an experiment has occurred. For example, 
# a frequency distribution could be used to record the frequency of each word type in 
# a document. 

fdist1 = FreqDist(text1)
print(fdist1)

In [None]:
# Task: find the 50 most frequent words of text1
print(fdist1.most_common(50))

In [None]:
# Task: find the frequency of word 'whale' in text1
print(fdist1['whale'])

In [None]:
# Task: plot the cumulative frequency of the 50 most frequent words in Moby-Dick
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'DejaVu Serif'
plt.rcParams['font.serif'] = ['Times New Roman']
font = {'weight' : 'bold', 'size'   : 18}
matplotlib.rc('font', **font)
plt.rcParams["figure.figsize"] = (15, 8)
fdist1.plot(50, cumulative=True)

In [None]:
# Task: plot the non-cumulative frequencies of the same 50 words
fdist1.plot(50, cumulative=False)
# What do you find ?

<h2 style="text-align: center;">3.2 Fine-grained Selection of Words</h2>

In [None]:
# Task: look at the long words of the book Moby-Dick (text1)
#       find all words that have more than 15 chars.
V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

In [None]:
# Task: look at the long words of the Inaugural Address Corpus (text4)
#       find all words that have more than 15 chars.
V = set(text4)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

# What do you find ?

In [None]:
# Task: find all long words of text5
print(sorted([w for w in set(text5) if len(w) > 15]))

# What do you find ?

In [None]:
# Task: find frequently occurring long words. 
fdist5 = FreqDist(text5)
print(sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7))

# What do you find ?

In [None]:
# Task: find hapaxes (hapax legomena: words that occur only once in the text)
fdist1 = FreqDist(text1)
hapax = fdist1.hapaxes()
print(hapax[:10])

# What do you find ?
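
To make the definition of a hapax concrete, here is a minimal sketch that does not use NLTK: it counts a made-up token list with `collections.Counter` and keeps the items whose count is exactly one. The list and the variable names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy token list.
tokens = ['call', 'me', 'ishmael', '.', 'call', 'me', 'later', '.']

counts = Counter(tokens)

# A hapax is a token that occurs exactly once.
hapaxes = [w for w, c in counts.items() if c == 1]
print(hapaxes)
```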

<h2 style="text-align: center;">3.3 Collocations and Bigrams</h2>

In [None]:
# Task: generate bigrams from a word list
# A collocation is a sequence of words that occur together unusually often. 
# Thus "red wine" is a collocation, whereas "the wine" is not.
list(bigrams(['more', 'is', 'said', 'than', 'done']))
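
A bigram is simply a pair of adjacent words. As a sketch of what the call above produces, the same pairs can be built in plain Python with `zip`; `simple_bigrams` is a hypothetical helper written for this illustration, not an NLTK function.

```python
def simple_bigrams(words):
    """Return the list of adjacent word pairs (illustrative helper, not part of NLTK)."""
    # Pair each word with the word that follows it.
    return list(zip(words, words[1:]))

pairs = simple_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```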

In [None]:
# Task: find frequent bigrams of text4 (Inaugural Address Corpus)
text4.collocations()

In [None]:
# Task: find frequent bigrams of text8 (Personals Corpus 
#       comes from personal ads posted on various online 
#       dating sites)
text8.collocations() 

<h2 style="text-align: center;">3.4 Counting Other Things</h2>

In [None]:
# Task: compute the frequency distribution of word lengths

# For example, we can look at the distribution of word lengths 
# in a text, by creating a FreqDist out of a long list of numbers, 
# where each number is the length of the corresponding word in the text:

fdist = FreqDist(len(w) for w in text1)
print(fdist)
fdist 

In [None]:
# Task: show the frequency of each word length
fdist.most_common()

In [None]:
# Task: find the most frequent word length, its count, and its relative frequency
print(fdist.max())
print(fdist[3])
print(fdist.freq(3))

![Functions Defined for NLTK's Frequency Distributions](figs/freq-dist.png)

<h1 style="text-align: center;">4 Python: Making Decisions and Taking Control</h1>

<h2 style="text-align: center;">4.1 Numerical Comparison Operators</h2>

![Numerical Comparison Operators](figs/operations.png)

In [None]:
# Task: find shorter and longer words
print(sent7)
print([w for w in sent7 if len(w) < 4])
print([w for w in sent7 if len(w) >= 4])
print([w for w in sent7 if len(w) == 4])
print([w for w in sent7 if len(w) != 4])

In [None]:
# Task: what is this ?
print(sorted(w for w in set(text1) if w.endswith('ableness'))[:10])

In [None]:
# Task: what is this ?
print(sorted(term for term in set(text4) if 'gnt' in term))

In [None]:
# Task: what is this ?
print(sorted(item for item in set(text6) if item.istitle())[:10])

In [None]:
# Task: what is this ?
print(sorted(item for item in set(sent7) if item.isdigit()))

In [None]:
print(sorted(w for w in set(text7) if '-' in w and 'index' in w))
print(sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10))
print(sorted(w for w in set(sent7) if not w.islower()))
print(sorted(t for t in set(text2) if 'cie' in t or 'cei' in t))

In [None]:
capital_words = [w.upper() for w in text1]
print(capital_words[:10])

In [None]:
print(text1)
print(len(text1))
print(len(set(text1)))
print(len(set(word.lower() for word in text1)))
# lowercasing merges words like "The" and "the".

In [None]:
# eliminate numbers and punctuation from the vocabulary count by 
# filtering out any non-alphabetic items:
print(len(set(word.lower() for word in text1 if word.isalpha())))
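
The effect of the two normalization steps above (lowercasing, then keeping only alphabetic items) can be seen on a small made-up token list. This is only a sketch; the tokens and variable names are assumptions, not corpus data.

```python
# Hypothetical toy token list mixing case variants, punctuation, and a number.
tokens = ['The', 'whale', ',', 'the', 'Whale', '1851', '!', 'sea']

raw_vocab = set(tokens)                                      # distinct tokens as written
lower_vocab = set(t.lower() for t in tokens)                 # case variants merged
alpha_vocab = set(t.lower() for t in tokens if t.isalpha())  # punctuation and numbers dropped

print(len(raw_vocab), len(lower_vocab), len(alpha_vocab))
print(sorted(alpha_vocab))
```

Each step shrinks the vocabulary, which is why the three counts for text1 differ.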

<h2 style="text-align: center;">4.2 Nested Code Blocks</h2>

In [None]:
# If condition
word = 'cat'
if len(word) < 5:
    print('word length is less than 5')

In [None]:
# For loop
for word in ['Call', 'me', 'Baojian', '.']:
    print(word)

In [None]:
# Check the word type for sent1
for token in sent1:
    if token.islower():
        print(f'{token:10} is a lowercase word')
    elif token.istitle():
        print(f'{token:10} is a titlecase word')
    else:
        print(f'{token:10} is punctuation')

In [None]:
# create a list of cie and cei words, 
# then we loop over each item and print it. 
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
for word in tricky:
    print(word, end=' ')

<h1 style="text-align: center;">5 Automatic Natural Language Understanding</h1>

<h2 style="text-align: center;">5.1 Searching Text</h2>

In [None]:
# Task: look up the context of word "monstrous" in Moby Dick (text1) 
print('-'*17)
text1.concordance("monstrous")
print('-'*17)
text1.concordance("fudan")
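
A concordance shows every occurrence of a word together with some surrounding context. The following is a minimal sketch of that idea on a made-up token list. `simple_concordance` and its `width` parameter are hypothetical, it ignores case as an assumption of this sketch, and it is not NLTK's implementation.

```python
def simple_concordance(tokens, target, width=2):
    """Return each match of `target` with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = tokens[max(0, i - width):i]     # context before the match
            right = tokens[i + 1:i + 1 + width]    # context after the match
            lines.append(' '.join(left + [tok] + right))
    return lines

# Hypothetical toy token list.
tokens = ['a', 'most', 'monstrous', 'size', '.', 'The', 'monstrous', 'pictures', 'hung']
matches = simple_concordance(tokens, 'monstrous')
for line in matches:
    print(line)
```

A word that never occurs simply yields an empty list of matches.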

In [None]:
# Task: search Sense and Sensibility (text2) for the word "affection"
print('-'*17)
text2.concordance("affection")

In [None]:
# Task: search the book of Genesis (text3) to find out how long some people lived
text3.concordance("lived")

In [None]:
# Task: look at text4, the Inaugural Address Corpus, to see examples of English going back to 1789
print('-'*17)
text4.concordance("nation")
print('-'*17)
text4.concordance("terror")
# see how these words have been used differently over time. 

In [None]:
# Task: find lol context in text5, the NPS Chat Corpus: 
#       search this for unconventional words like im, ur, lol.
text5.concordance("lol")

In [None]:
# Task: find similar context,
#       we saw that monstrous occurred in contexts such as
#       the ___ pictures and a ___ size. What other words
#       appear in a similar range of contexts?
text1.concordance("monstrous")
print('-'*17)
text1.similar("monstrous")

In [None]:
# Task: find similar context of monstrous in text2
text2.concordance("monstrous")
print('-'*17)
text2.similar("monstrous")

# Observe that we get different results for different texts. 
# Austen uses this word quite differently from Melville; 
# for her, monstrous has positive connotations, and sometimes 
# functions as an intensifier like the word very.

In [None]:
# Task: find common context of two words
# The term "common_contexts" allows us to examine 
# just the contexts that are shared by two or more words
text1.common_contexts(["monstrous", "very"])
text2.common_contexts(["monstrous", "very"])
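
One way to think about a "context" is the pair of words immediately before and after a target word. The sketch below, on a made-up token list, collects these pairs and intersects them for two words. `contexts_of` is a hypothetical helper that illustrates the concept only; it makes no claim about how NLTK computes `similar` or `common_contexts` internally.

```python
from collections import defaultdict

def contexts_of(tokens):
    """Map each word to the set of (previous word, next word) pairs it appears between."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((prev, nxt))
    return ctx

# Hypothetical toy token list.
tokens = ['a', 'monstrous', 'good', 'book', 'and', 'a', 'very', 'good', 'book',
          'but', 'a', 'monstrous', 'fine', 'day']

ctx = contexts_of(tokens)
shared = ctx['monstrous'] & ctx['very']   # contexts both words occur in
print(shared)
```

Here both words appear between "a" and "good", so that pair is their only shared context.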

In [None]:
# Task: determine the location of a word in the text: 
# how many words from the beginning it appears. 
# This positional information can be displayed using a dispersion plot. 
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

In [None]:
# Task: show the dispersion plot for Elinor, Edward, Marianne, Willoughby
text2.dispersion_plot(["Elinor", "Edward", "Marianne", "Willoughby"])
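# TODO: in some NLTK versions the built-in dispersion_plot labels the y-axis
#       in the wrong order; the dispersion_plot function defined later in this
#       section works around this.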

In [None]:
# Task: generate some random text in the various styles we have just seen.
text3.generate()

In [None]:
# Task: generate word cloud
#       install wordcloud first if it is not already available
import sys
!{sys.executable} -m pip install wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
words = [w for w in text1]
fd = nltk.FreqDist(words).most_common()
wc = WordCloud(background_color='white', max_words=2000, stopwords=STOPWORDS, max_font_size=50,
              random_state=17)
wc.generate(' '.join(words))
plt.rcParams["figure.figsize"] = (6, 4)
plt.imshow(wc)
plt.axis('off')
plt.show()

In [None]:
# Task: create a new function
def dispersion_plot(text, words, ignore_case=False, title="Lexical Dispersion Plot"):
    """
    Generate a lexical dispersion plot.

    :param text: The source text
    :type text: list(str) or iter(str)
    :param words: The target words
    :type words: list of str
    :param ignore_case: flag to set if case should be ignored when searching text
    :type ignore_case: bool
    :return: a matplotlib Axes object that may still be modified before plotting
    :rtype: Axes
    """

    try:
        import matplotlib.pyplot as plt
    except ImportError as e:
        raise ImportError(
            "The plot function requires matplotlib to be installed. "
            "See https://matplotlib.org/"
        ) from e

    word2y = {
        word.casefold() if ignore_case else word: y
        for y, word in enumerate(reversed(words))  # reversed so the first word is drawn at the top; the tick labels must use the same order
    }
    xs, ys = [], []
    for x, token in enumerate(text):
        token = token.casefold() if ignore_case else token
        y = word2y.get(token)
        if y is not None:
            xs.append(x)
            ys.append(y)

    _, ax = plt.subplots()
    ax.plot(xs, ys, "|")
    ax.set_yticks(list(range(len(words))), list(reversed(words)), color="C0")  # labels reversed to match word2y
    ax.set_ylim(-1, len(words))
    ax.set_title(title)
    ax.set_xlabel("Word Offset")
    return ax

In [None]:
import matplotlib.pyplot as plt
from nltk.corpus import gutenberg

words = ["Elinor", "Marianne", "Edward", "Willoughby"]
dispersion_plot(gutenberg.words("austen-sense.txt"), words)
plt.show()

<h2 style="text-align: center;">5.2 Counting Words</h2>

In [None]:
# Task: count how many words (including punctuation symbols) in the book of Genesis
len(text3)
# So Genesis has 44,764 words and punctuation symbols, or "tokens." 
# A token is the technical name for a sequence of characters — 
# such as hairy, his, or :) — that we want to treat as a group.

In [None]:
# Task: calculate a measure of the lexical richness of the text. 
print(f"{len(set(text3)) / len(text3) * 100:.2f}%")
# the number of distinct words is just 6% of the total number of words, 
# or equivalently that each word is used 16 times on average
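
The two calculations used in this section, the share of distinct tokens and a word's share of all tokens, can be wrapped in small reusable functions. This is a sketch on a made-up token list; the function names follow the ones used in the NLTK book chapter linked in the setup section, and the tokens are illustrative only.

```python
def lexical_diversity(tokens):
    """Fraction of the tokens that are distinct (number of types / number of tokens)."""
    return len(set(tokens)) / len(tokens)

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

# Hypothetical toy token list.
tokens = ['in', 'the', 'beginning', 'god', 'created', 'the', 'heaven', 'and', 'the', 'earth']

diversity = lexical_diversity(tokens)
the_share = percentage(tokens.count('the'), len(tokens))
print(f"{diversity * 100:.2f}%")
print(f"{the_share:.2f}%")
```

With the real corpus loaded, the same functions could be called as `lexical_diversity(text3)` or `percentage(text4.count('a'), len(text4))`.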

In [None]:
# Task: count how often a specific word occurs
print(text3.count("the"))
print('-'*17)
# percentage of the tokens in text4 that are the word "a"
val = 100 * text4.count("a") / len(text4)
print(f"{val:.2f}%")

In [None]:
# Task: counting lol
print(text1.count("lol"))
print(text2.count("lol"))
print(text3.count("lol"))
print(text4.count("lol"))
print(text5.count("lol"))
print(text6.count("lol"))
# text5 is a corpus of Online Chat Dialogs

<h2 style="text-align: center;">5.3 Gutenberg Corpus</h2>

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

In [None]:
import nltk
nltk.corpus.gutenberg.fileids()

In [None]:
# Let's pick out the first of these texts — Emma by Jane Austen — 
# and give it a short name, emma, then  find out how many words 
# it contains
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

In [None]:
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
# average word length, average sentence length,
# and the number of times each vocabulary item appears 
# in the text on average (our lexical diversity score). 

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
print(macbeth_sentences)
print(macbeth_sentences[1116])
longest_len = max(len(s) for s in macbeth_sentences)
print([s for s in macbeth_sentences if len(s) == longest_len])

<h2 style="text-align: center;">5.4 Web and Chat Text</h2>

In [None]:
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
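# spaCy and its small English model must be installed first:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)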