## Hello DIGIT world!
Welcome, DIGIT student Python adventurer! This is the **Jupyter Notebook** version of my [exploring-nltk.py](exploring-nltk.py) file. The cells in the notebook will help me demonstrate particular things in class without having to comment everything else out. Jupyter notebooks are written with a combination of markdown cells (like this) for documentation, and cells with executable scripts. You usually have to run the cells in order, but you can choose which ones to run. I'm splitting up my exploring-nltk.py file into tidy cells because my commenting was getting a bit out of hand inside the file.

### Installs before imports 
You only need to make installations ONCE in your Python virtual environment (your venv). After that, you're good to go.

To install things, open a shell (or use the terminal in PyCharm CE), and be sure the venv is activated: usually you see (venv) or (.venv) in parentheses at the start of your prompt. You can activate your venv like this, after navigating to the directory that contains your .venv folder:
* On Mac: **source .venv/bin/activate**
* On Windows: **.venv\Scripts\activate.bat**
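
If you're not sure whether the venv is really active, one quick check is to ask Python itself. This is a minimal sketch, not something from exploring-nltk.py, and the function name `in_virtual_environment` is just my own label for illustration. Inside a venv, `sys.prefix` points at the venv folder, while `sys.base_prefix` points at the Python installation the venv was built from.

```python
import sys

def in_virtual_environment():
    """Return True when this Python is running inside a venv."""
    # In a venv, sys.prefix is the venv folder and sys.base_prefix is the base install.
    # Outside a venv, the two are the same.
    return sys.prefix != sys.base_prefix

print(f"in_virtual_environment() = {in_virtual_environment()}")
print(f"sys.prefix = {sys.prefix}")
```

If the first line prints False, activate the venv and try again before running any pip installs.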

We need to install nltk first: in your shell or the PyCharm terminal with .venv activated, run the following:

**pip install nltk**

Some of you might need to write:
**pip3 install nltk**
(If this is you and you're annoyed about it, make an alias for pip in your .bashrc or .zshrc to point to pip3 every time you type pip.)

*If you, like me, have multiple versions of python on your machine*, run:
**python3.12 -m pip install nltk**

#### Other libraries to install:
These next ones are for plotting interactive graphs and making a simple user interface:
* **pip install matplotlib**
* **UPDATE/CORRECTION** Tkinter is code for a special simple interface package (lets people enter input, like a word, to run in your program.) It should come with your Python3.12 installation already. To smoke test, open a shell and run:

  **python3.12 -m tkinter**

If you have it, you'll see a little "click me" window open up. (Previously at this point, we had you run pip install tk, but that was installing tensorkit, used in machine learning applications. You might use that later on in this class, but it's not the same thing.)
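
To double-check the installs from inside Python, something like the following sketch can help. The helper name `is_installed` is hypothetical (my own). It only asks whether Python can *find* each module, so it's not a full guarantee that the module works. For tkinter in particular, the **python3.12 -m tkinter** smoke test above is still the real test.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Check the three libraries this notebook relies on
for name in ["nltk", "matplotlib", "tkinter"]:
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```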

#### For making Jupyter notebooks like this:
For me to make this Jupyter notebook, I needed to install:
* **python3.12 -m pip install notebook**
* **python3.12 -m pip install notebook ipykernel**

Then I just ran the following to launch the notebook in a web browser for editing:
**python3.12 -m jupyter notebook** or just **jupyter notebook** 

### After installs, time to start writing the script, with import lines
Then we can import these things in your Python script, which is what our first executable code cell is doing:

In [1]:
import nltk
import nltk.corpus
# The next line downloads all the example texts used in the NLTK book at https://www.nltk.org/book !
# You can comment out the download line after the first time you do it.
nltk.download('book')
from nltk.book import *
# The next line lets us do GET requests from remote URLs on the web:
from urllib import request
# The following import lines are for plotting interactive visualizations in Python
import matplotlib
import matplotlib.pyplot as plt
# import tk ebb: Sorry this was INCORRECT! We need to distinguish tkinter from tensorkit.

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/eeb4/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /Users/eeb4/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/eeb4/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/eeb4/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/eeb4/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/eeb4/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Tkinter! What is this and how do we use it?
We did this wrong in our first version of this notebook. So now we're going to experiment with tkinter more directly. Tkinter is used for making a simple (**desktop-only**) interactive user interface to allow user input to run Python. (You don't totally need this, but the NLTK book uses it occasionally, and we figured we'd check it out.) 

Tkinter should come with your installation of Python 3.12, and it's actually NOT something we install with pip (when we did a pip install of "tk" we got the tensorkit library which we could use in a different context for machine learning!) 
Here's what we need to import for it: 

In [2]:
# These imports will let us make a simple tkinter user input / output interface:
import tkinter as tk
from tkinter import scrolledtext
import io
import sys

### Smoke test for graphing libraries
After the imports, run the next cells to see if graphing works.

In [None]:
plt.plot(range(10))
plt.show()

In [None]:
### See how these words are dispersed in NLTK text 1 (Moby Dick)
words = ["whale", "sea", "ship", "captain"]
nltk.draw.dispersion_plot(text1, words)
plt.show()

In [None]:
# Another dispersion plot written closer to the NLTK example:
# Choose the text first (text 4 is Inaugural Addresses):
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
plt.show()


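### Take a look at some common contexts for uses of a pair of words
You can try this with word pairs such as "monstrous" and "very" in text1, or "find" and "seek" in text6 (used in the cell below).
Try changing these up for different words and texts.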

In [None]:
text6.common_contexts(["find","seek"])

### Look for similar words
This looks for words that appear in the same contexts as the word you enter.
I used text6 ("Monty Python and the Holy Grail") for this next example:

In [None]:
text6.similar('grail')
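
To get a feel for what "appearing in the same context" means, here is a rough plain-Python sketch of the idea. It is NOT how NLTK implements similar (NLTK's own scoring is more involved). The function name `similar_words` and the toy sentence are made up for illustration. Here a "context" is just the pair of words immediately before and after a word.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words that share (previous word, next word) contexts with target."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()
    contexts = defaultdict(set)  # word -> set of (left neighbor, right neighbor)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))
    target_contexts = contexts[target]
    scores = Counter()
    for word, word_contexts in contexts.items():
        shared = len(word_contexts & target_contexts)
        if word != target and shared:
            scores[word] = shared  # more shared contexts = more "similar"
    return [word for word, _ in scores.most_common(top_n)]

# Hypothetical toy sentence, not taken from any of the NLTK texts
toy = "we seek the grail now and we find the castle now".split()
print(f"similar to grail = {similar_words(toy, 'grail')}")
print(f"similar to seek = {similar_words(toy, 'seek')}")
```

In the toy sentence, "grail" and "castle" both sit between "the" and "now", so they count as similar. Likewise "seek" and "find" both sit between "we" and "the".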

### Pulling files from a web url
For this, let's pull *Blithedale Romance* direct from Project Gutenberg (just like David did with it while introducing invisible XML).
I'm also demonstrating **how to make "picture string" variables** (Python's f-strings) so you can easily tell what you're printing out in the console:

In [3]:
# Blithedale Romance text file on Project Gutenberg
bookurl= "https://www.gutenberg.org/cache/epub/2081/pg2081.txt"
response = request.urlopen(bookurl)
br = response.read().decode('utf8')
type(br)
print(len(br))
# make a variable
howLong = len(br)
# picture string version! 
print(f"howLong = {howLong}")
novelSlice = br[:500]
print(f"novelSlice = {novelSlice}")

splitEmUp = br.split()
print(f"splitEmUp = {splitEmUp[-100:]}")

for token in splitEmUp:
    if token.endswith('ing'):
        print(token)

464044
howLong = 464044
novelSlice = ﻿The Project Gutenberg eBook of The Blithedale Romance
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using t
splitEmUp = ['network', 'of', 'volunteer', 'support.', 'Project', 'Gutenberg™', 'eBooks', 'are', 'often', 'created', 'from', 'several', 'printed', 'editions,', 'all', 'of', 'which', 'are', 'confirmed', 'as', 'not', 'protected', 'by', 'copyright', 'in', 'the', 'U.S.', 'unless', 'a', 'copyright', 'notice', 'is', 'included.', 'Thus,', 'we', 'do', 'not', 'necessarily', 'keep', 'eBooks', 'in', 'compliance', 'with', 'any', 'particular', 'paper', 'edition.', 'Most', 'peopl
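
A small aside on picture strings: since Python 3.8 there is a shortcut that writes the label for you. This is just a sketch with stand-in values (the `novelTitle` variable is hypothetical, added so the cell runs on its own).

```python
# Stand-in values so this cell runs without the download above
howLong = 464044
novelTitle = "The Blithedale Romance"

# The picture string pattern from above: write the label yourself
print(f"howLong = {howLong}")

# Python 3.8+ shortcut: an = sign inside the braces prints the expression AND its value
print(f"{howLong = }")
print(f"{novelTitle = }")       # strings show up with quotes, because repr() is used
print(f"{len(novelTitle) = }")  # works for any expression, not just variable names
```

The shortcut saves typing, but the hand-written label is easier to customize, so use whichever reads better to you.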

## Make your own NLTK Text object from a file

This lets you use NLTK functions like similar on a text. 
First, you tokenize the text, and then you create an NLTK Text object.
There's a LOT you can do with a Text object, including running similar and common_contexts (like we did above). You can also look up regex patterns in your text! For details, see <https://www.nltk.org/api/nltk.text.Text.html>

In [4]:
# Create an NLTK Text object from the Blithedale Romance file in the previous cell.
# We'll start with splitEmUp (our tokenized version of the text)

blithedaleTextObject = nltk.Text(splitEmUp)
print(f"blithedaleTextObject = {blithedaleTextObject}")

# Now we can run things like NLTK's similar and common_contexts functions
blithedaleTextObject.similar("veil")

blithedaleTextObject = <Text: ﻿The Project Gutenberg eBook of The Blithedale Romance...>
and by a town little great heart warm back noble rich gray false
higher sad cold surrounding passionate tall project
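
One thing to notice about tokenizing with plain split(): punctuation stays glued to the words ("edition." and "Thus," in the output above), so "veil" and "veil," count as different tokens. Here's a minimal sketch of one alternative using a regular expression. The function name `simple_tokenize` and the sample sentence are my own, for illustration only. It also drops a byte-order mark if the text begins with one.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and separate punctuation tokens."""
    text = text.lstrip("\ufeff")  # drop a leading byte-order mark if present
    # \w+ grabs runs of letters/digits/underscores; [^\w\s] grabs one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample sentence (not a quotation from the novel)
sample = "\ufeffThe veil, the veil! What is behind it?"
print(f"split version = {sample.split()}")
print(f"regex version = {simple_tokenize(sample)}")
```

This simple pattern breaks contractions like "don't" into three tokens, so treat it as a starting point. You could pass its output to nltk.Text() exactly the way we passed splitEmUp.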


### Make a Text Concordance 
Use the concordance feature...In the NLTK book, they introduce this with the prefab text of Jane Austen's Emma (already in NLTK's text corpora). I bet we can do this with our split-up Blithedale text that we pulled in from Project Gutenberg...
**NOTE**: You need to execute the previous cell for the next one to know the variables it needs.

Basically, to make the concordance, you have to convert the list of tokens into a special NLTK **text object**, and then run the concordance feature.

In [None]:
# concordance() prints its matches directly and returns None,
# so we just call it instead of storing the result in a variable.
nltk.Text(splitEmUp).concordance("living")
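
If you're curious what a concordance is doing under the hood, the basic idea is "keyword in context": find each match and show a few tokens on either side. Here is a rough plain-Python sketch of that idea. The function name `kwic` and the toy sentence are made up for illustration, and NLTK's real concordance formats its lines differently.

```python
def kwic(tokens, word, window=4, max_lines=5):
    """Return keyword-in-context lines: a few tokens either side of each match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
            if len(lines) == max_lines:
                break  # stop once we have enough lines
    return lines

# Hypothetical toy sentence, just to show the output shape
toy = "the living room was full of living plants and one living cat".split()
for line in kwic(toy, "living", window=2):
    print(line)
```

Because this version returns a list of strings, you can store it in a variable and print it later.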

### Frequency Distributions
Here's an example of plotting frequency distributions. This is from the NLTK book, and you might be wondering why they didn't just use text4 for the corpus, which has all the addresses baked together in one file. They pulled from a **different** set, a collection of texts with each address stored in its own file, because the year of each address is in the fileid property! Being able to read the year from the fileid is what lets the code below group the counts by year.

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()
cfd = nltk.ConditionalFreqDist(
    (target,fileid[:4])
    for fileid in inaugural.fileids()
    if fileid[:4] > "1990"
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
plt.show()
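
To see what the ConditionalFreqDist is counting, here is a plain-Python sketch of the same logic using a tiny made-up corpus. Everything in `toy_corpus` is hypothetical (invented fileids and words, not the real speeches). The point is only to show the (target, year) pairs being tallied.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: fileid -> list of words (stand-in for the inaugural corpus)
toy_corpus = {
    "1989-Example.txt": ["America", "freedom"],
    "1993-Example.txt": ["America", "citizens", "Americans", "renewal"],
    "2009-Example.txt": ["America", "citizen", "hope"],
}

targets = ["america", "citizen"]
counts = defaultdict(Counter)  # condition (target word) -> Counter of years

for fileid, words in toy_corpus.items():
    year = fileid[:4]      # the first four characters of the fileid are the year
    if year > "1990":      # string comparison works because every year has four digits
        for w in words:
            for target in targets:
                if w.lower().startswith(target):
                    counts[target][year] += 1

for target in targets:
    print(f"{target} = {dict(counts[target])}")
```

Each target word is a "condition", and for each condition we get a count per year. That is the same shape of data that cfd.plot() draws as one line per target.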

## A desktop user interface widget 
### Try out tkinter to explore concordances
We'll work with our blithedaleTextObject here for *The Blithedale Romance*.
Credits to [tkinter documentation](https://docs.python.org/3/library/tkinter.html) and a little back-and-forth help from ChatGPT to get this working properly!
[RealPython's Python GUI Programming with Tkinter](https://realpython.com/python-gui-tkinter/) is a good place to keep exploring this, but NOTE: we can't build this out into a live website. It's just for tinkering at the local computer, or if people import your Jupyter notebook, set up a Python environment, and run your cells locally on their computers. 

In [None]:
# Playing with Tkinter for an input / output concordance window
def show_concordance():
    word = entry.get().strip()  # Get user input
    if not word:
        output_text.insert(tk.END, "Please enter a word.\n")
        return
    
    output_text.delete(1.0, tk.END)  # Clear previous results

    # Capture concordance output 
    # ebb: The next two lines help with delivery to the tkinter GUI window
    output_capture = io.StringIO()  # Create an in-memory text buffer
    original_stdout = sys.stdout  # Remember where print output normally goes (in Jupyter, this is not sys.__stdout__)
    sys.stdout = output_capture  # Redirect print output to this buffer
    
    try:
        blithedaleTextObject.concordance(word, lines=20)  # Show up to 20 matches
        result = output_capture.getvalue()  # Literally, "get" the captured output
        output_text.insert(tk.END, result)  # Deliver it into the Tkinter text box 
        # ebb: If you remove the result and output_text lines, this will just deliver to the console window
    except Exception:
        output_text.insert(tk.END, f"Sorry! No concordance found for '{word}'.\n")
    finally:
        sys.stdout = original_stdout  # Always put stdout back, even if something went wrong
    # ebb: The restore step above is also used for tkinter (we wouldn't need it for just delivering results to the console)

# Use tkinter here: it makes a GUI window pop up. 
root = tk.Tk()
root.title("NLTK Concordance Finder")

# Now, give tkinter an input field
tk.Label(root, text="Enter a word:").pack(pady=5)
entry = tk.Entry(root, width=50)
entry.pack(pady=5)

# Button
tk.Button(root, text="Deliver the Concordance!", command=show_concordance).pack(pady=5)

# Output area (Scrollable text box)
output_text = scrolledtext.ScrolledText(root, width=60, height=15)
output_text.pack(pady=5)

# Run the application
root.mainloop()

2025-04-02 10:28:10.704 Python[52639:19447611] +[IMKClient subclass]: chose IMKClient_Modern
2025-04-02 10:28:10.704 Python[52639:19447611] +[IMKInputSession subclass]: chose IMKInputSession_Modern
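
The trickiest part of the widget is capturing text that a function *prints* rather than returns. Python's standard library has a tool for exactly this, `contextlib.redirect_stdout`, which swaps stdout for you and puts it back afterwards. Here is a small stand-alone sketch. The names `capture_printed_text` and `noisy_greeting` are hypothetical, and this is one possible approach rather than the way the widget above has to be written.

```python
import contextlib
import io

def capture_printed_text(func, *args, **kwargs):
    """Run func and return whatever it printed, as a string."""
    buffer = io.StringIO()
    # redirect_stdout swaps sys.stdout for the buffer and restores it on exit,
    # even if func raises an exception
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()

def noisy_greeting(name):
    # A stand-in for any function that prints instead of returning a value
    print(f"Hello, {name}!")

captured = capture_printed_text(noisy_greeting, "DIGIT")
print(f"captured = {captured!r}")
```

Assuming blithedaleTextObject exists from the earlier cell, the same helper could be called as capture_printed_text(blithedaleTextObject.concordance, word, lines=20) inside show_concordance.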
