# Reading File Directories and Exploring WordNet

This notebook offers some guidance on working with file directories for a text corpus. It also introduces some things we can explore with the WordNet dictionary, accessible through NLTK.

Read about [WordNet in Chapter 2, section 5.2](https://www.nltk.org/book/ch02.html#wordnet) of the NLTK book. 
Credits: Some of our explanations are distilled from [David Birnbaum's introductory Wordnet notebook](https://github.com/djbpitt/wordnet/blob/master/Wordnet.ipynb), which we used to explore the ambiguity of spooky words in projects some years ago! 


In [1]:
# Imports
import os
import nltk
from nltk.corpus import wordnet as wn


In [2]:
# SMOKE TEST: Explore Wordnet for specific words.
wn.synsets('clear')

[Synset('clear.n.01'),
 Synset('open.n.01'),
 Synset('unclutter.v.01'),
 Synset('clear.v.02'),
 Synset('clear_up.v.04'),
 Synset('authorize.v.01'),
 Synset('clear.v.05'),
 Synset('pass.v.09'),
 Synset('clear.v.07'),
 Synset('clear.v.08'),
 Synset('clear.v.09'),
 Synset('clear.v.10'),
 Synset('clear.v.11'),
 Synset('clear.v.12'),
 Synset('net.v.02'),
 Synset('net.v.01'),
 Synset('gain.v.08'),
 Synset('clear.v.16'),
 Synset('clear.v.17'),
 Synset('acquit.v.01'),
 Synset('clear.v.19'),
 Synset('clear.v.20'),
 Synset('clear.v.21'),
 Synset('clear.v.22'),
 Synset('clear.v.23'),
 Synset('clear.v.24'),
 Synset('clear.a.01'),
 Synset('clear.s.02'),
 Synset('clear.s.03'),
 Synset('clear.a.04'),
 Synset('clear.s.05'),
 Synset('clear.s.06'),
 Synset('clean.s.03'),
 Synset('clear.s.08'),
 Synset('clear.s.09'),
 Synset('well-defined.a.02'),
 Synset('clear.a.11'),
 Synset('clean.s.02'),
 Synset('clear.s.13'),
 Synset('clear.s.14'),
 Synset('clear.s.15'),
 Synset('absolved.s.01'),
 Synset('clear.s.17

### Look at some synset data from WordNet
Choose one of the synsets by its identifier, and let's explore what you can see with it.
In the next couple of cells we explore how WordNet shares information: 

WordNet shows:
* A representative word that stands for a set of meanings (a "synset"). A synset may include other words, and the same representative word might be used in different senses (and so head multiple synsets).
* A part of speech (POS) identifier, like “n” for ‘noun’ or “v” for ‘verb’.
* A two-digit number that distinguishes different synsets that may have the same head word and the same POS, but that convey different meanings. For example, the synsets 'ghost.n.01' and 'ghost.n.02' are two different nominal meanings that can be expressed by the lexeme “ghost.”
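
The three-part identifier can be pulled apart with plain string handling. This is only a sketch: it assumes every identifier ends in `.pos.number`, and `parse_synset_name` is a hypothetical helper name, not part of NLTK.

```python
def parse_synset_name(name):
    """Split a synset identifier like 'ghost.n.01' into its three parts."""
    # Split from the right so that head words containing hyphens,
    # underscores, or dots stay in one piece.
    head, pos, number = name.rsplit(".", 2)
    return head, pos, int(number)


for name in ["ghost.n.01", "clear_up.v.04", "well-defined.a.02"]:
    print(parse_synset_name(name))
```

Inside the notebook you would not normally need to parse the string yourself: a synset object reports the same information through `synset.name()` and `synset.pos()`.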

In WordNet you can request:
* the available **lemmas**: the "lexemes," or set of words, associated with a particular synset.
* the **definition** associated with a synset.

In [3]:
wn.synset('acquit.v.01').lemmas()

[Lemma('acquit.v.01.acquit'),
 Lemma('acquit.v.01.assoil'),
 Lemma('acquit.v.01.clear'),
 Lemma('acquit.v.01.discharge'),
 Lemma('acquit.v.01.exonerate'),
 Lemma('acquit.v.01.exculpate')]

In [4]:
# For every lem in one of the synsets, isolate just the names without the extra stuff:
for lem in wn.synset('acquit.v.01').lemmas():
    print(lem.name())

acquit
assoil
clear
discharge
exonerate
exculpate


In [5]:
# Here we'll return the lemma names, and a count of them, for each synset of "sick".
# In this output we're just surveying the data as lists:
for synset in wn.synsets('sick'):
    print(synset, ": ", synset.lemma_names(), ": ", len(synset.lemma_names()))  

Synset('sick.n.01') :  ['sick'] :  1
Synset('vomit.v.01') :  ['vomit', 'vomit_up', 'purge', 'cast', 'sick', 'cat', 'be_sick', 'disgorge', 'regorge', 'retch', 'puke', 'barf', 'spew', 'spue', 'chuck', 'upchuck', 'honk', 'regurgitate', 'throw_up'] :  19
Synset('ill.a.01') :  ['ill', 'sick'] :  2
Synset('nauseated.s.01') :  ['nauseated', 'nauseous', 'queasy', 'sick', 'sickish'] :  5
Synset('brainsick.s.01') :  ['brainsick', 'crazy', 'demented', 'disturbed', 'mad', 'sick', 'unbalanced', 'unhinged'] :  8
Synset('disgusted.s.01') :  ['disgusted', 'fed_up', 'sick', 'sick_of', 'tired_of'] :  5
Synset('pale.s.02') :  ['pale', 'pallid', 'wan', 'sick'] :  4
Synset('sick.s.06') :  ['sick'] :  1
Synset('ghastly.s.01') :  ['ghastly', 'grim', 'grisly', 'gruesome', 'macabre', 'sick'] :  6


In [6]:
# Ask for the part of speech and the definition of each synset, alongside its lemma names
for synset in wn.synsets('sick'):
    print(synset.lemma_names(), ": Part of Speech: ", synset.pos(), ": Definition: ", synset.definition())
    

['sick'] : Part of Speech:  n : Definition:  people who are sick
['vomit', 'vomit_up', 'purge', 'cast', 'sick', 'cat', 'be_sick', 'disgorge', 'regorge', 'retch', 'puke', 'barf', 'spew', 'spue', 'chuck', 'upchuck', 'honk', 'regurgitate', 'throw_up'] : Part of Speech:  v : Definition:  eject the contents of the stomach through the mouth
['ill', 'sick'] : Part of Speech:  a : Definition:  affected by an impairment of normal physical or mental function
['nauseated', 'nauseous', 'queasy', 'sick', 'sickish'] : Part of Speech:  s : Definition:  feeling nausea; feeling about to vomit
['brainsick', 'crazy', 'demented', 'disturbed', 'mad', 'sick', 'unbalanced', 'unhinged'] : Part of Speech:  s : Definition:  affected with madness or insanity
['disgusted', 'fed_up', 'sick', 'sick_of', 'tired_of'] : Part of Speech:  s : Definition:  having a strong distaste from surfeit
['pale', 'pallid', 'wan', 'sick'] : Part of Speech:  s : Definition:  (of light) lacking in intensity or brightness; dim or feebl

## Ambiguity from WordNet's point of view
When words are ambiguous, they can have multiple shades of meaning or different ways we could interpret them. How we decide on the meaning depends on the context of the words. WordNet can't tell you everything about how you could interpret every word in your project, but when words are available in its vast database, you can explore how many synsets (or possible meaningful interpretations) it has on file for that word. You can find which words depend the most on specific contexts for their meaning. 

To find out WordNet's sense of how ambiguous a word is, you could just count the number of available synsets for it with Python's **len()**.
Note: when you want ALL the synsets for a word, use **wn.synsets** (plural). To retrieve one synset by its identifier, use **wn.synset** (singular).

In [7]:
hookey_words = ['cry', 'late', 'sick', 'homework']
for w in hookey_words:
    synsets = len(wn.synsets(w))
    print("The word ", w, "belongs to ", synsets, "synsets in WordNet.")

The word  cry belongs to  12 synsets in WordNet.
The word  late belongs to  11 synsets in WordNet.
The word  sick belongs to  9 synsets in WordNet.
The word  homework belongs to  1 synsets in WordNet.
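
Once you have the counts, sorting them shows which words lean hardest on context. A minimal sketch, with the counts copied by hand from the output above so that it runs without WordNet loaded:

```python
# Synset counts copied from the output of the cell above.
synset_counts = {"cry": 12, "late": 11, "sick": 9, "homework": 1}

# Sort from most to least ambiguous (highest synset count first).
ranked = sorted(synset_counts.items(), key=lambda pair: pair[1], reverse=True)

for word, count in ranked:
    print(f"{word}: {count}")
```

In a project, you could build the dictionary directly with `{w: len(wn.synsets(w)) for w in hookey_words}` instead of typing the counts in.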


### WordNet's morphy: for inflected forms like plural vs. singular
WordNet won't have an entry for plural forms of words (for example). Other forms, like -ing words, tend to have their own WordNet entries. 

When working with project data, **run your word tokens through WordNet's morphy in case they need to be looked up in a lemmatized form**, for example `wn.morphy('ghosts')`. Try it out on various word forms:

In [8]:
wn.morphy('cry')

'cry'

### A Workflow for projects exploring ambiguity
* Start by isolating a set of words of interest from your project. (One idea: Use NLTK's POS tagger to identify all the verbs in your files and retrieve a set of them.)
* Deliver them to WordNet to retrieve their synset counts.
* Output the information on each distinct word from Python to a simple XML, TSV, or JSON file. (We'll probably use XML.)
* Options:
    * Use the frequency distribution plotting in Python to see how frequently ambiguous words are used in your corpus (plotting is a bit limited).
    * Write XSLT to map that information back into your XML files: We can use this to **instantiate** the uses of ambiguous words: retrieve counts of how frequently they appear in the text, and plot with SVGs of your own design. (You could make an SVG representing info about the most ambiguous words. You could also make specific SVGs for ambiguous words of interest to show how they're distributed throughout your texts... what would be interesting to visualize?)
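
The output step of this workflow could be sketched as follows, using only Python's standard library. The counts are copied from the earlier cell, and the XML element and attribute names (`ambiguity`, `word`, `synsets`) are hypothetical choices for illustration, not a fixed schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Counts copied from the earlier cell; in a project these would come from
# len(wn.synsets(word)) for each word of interest.
synset_counts = {"cry": 12, "late": 11, "sick": 9, "homework": 1}

# TSV: one row per word, written to an in-memory buffer for this demo.
tsv_buffer = io.StringIO()
writer = csv.writer(tsv_buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["word", "synsets"])
for word, count in synset_counts.items():
    writer.writerow([word, count])
tsv_text = tsv_buffer.getvalue()
print(tsv_text)

# XML: one <word> element per word, with the count stored as an attribute.
root = ET.Element("ambiguity")
for word, count in synset_counts.items():
    entry = ET.SubElement(root, "word", synsets=str(count))
    entry.text = word
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To save real files instead of strings, you could pass an open file handle to `csv.writer` and call `ET.ElementTree(root).write(...)` with a filename of your choice.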


## Part of Speech (and other related kinds of) Tagging
NLTK (and spaCy and other NLP tools) offer Part of Speech tagging. We could use this to collect sets of distinct verbs, nouns, and adjectives. 
* We could learn more about the words in a set by sending them to WordNet.
* We could also look at words as a network: How often do specific characters rely on a word of interest? Which characters share a word of interest? Or which words of interest are shared by say, male vs. female characters?  (Here's [a student project](http://bamfs.obdurodon.org/allFilms_Results.xhtml) that explored a collection of colorful swear words used by characters in Quentin Tarantino's films.)

In our Text Analysis class, you didn't spend time marking specific words, though you could do that using regex search and replace over your texts. You could also get some help from NLTK's POS tagging, and map that information back to your XML files for further analysis. So let's explore that. We'll look at POS (part of speech) tagging, but note that it's related to Named Entity Recognition and sentiment analysis. 

(Full disclosure: We like POS tagging because it can reveal interesting patterns without overdetermining what the data means in context, especially when it's combined with WordNet analyses.)

## How to get NLTK to do POS tagging
* Let's open a file, tokenize it, and prepare it for NLTK to analyze.
* Then apply the tagger, `pos_tag()`.
* What we see in the output is a list of tuples: `(word, tag)`.

In [9]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

filepath = 'texts/sixteen.txt'
# The with-block closes the file automatically once it has been read
with open(filepath, 'r', encoding='utf8') as infile:
    f = infile.read()
# Make a list of tokens in your text. 
# tokenList = f.split()
tokenList = word_tokenize(f)
# How is NLTK's word_tokenize() different from just splitting on spaces? Here's an example of how it's different: 
# Look for "don't" in the output of this cell and see how it's split.
print(tokenList)



In [10]:
# Reduce the complexity by: 1) lowercasing them, and 2) returning the set() of words (remove multiple of the same value)
lowercaseTokens = [token.lower() for token in tokenList]
uniqueTokens = set(lowercaseTokens)
print(uniqueTokens)
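
Some tokens in the tagger output further down still carry punctuation (for example 'again.' and 'sam.'), and tokens like these are unlikely to match a WordNet entry. One possible clean-up step, sketched here on a hypothetical sample list with only the standard library, trims punctuation from the ends of each token before building the set:

```python
import string

# A few sample tokens, including some with trailing punctuation like those
# that show up in the tagger output.
sample_tokens = ["Again.", "life", "Sam.", "don't", "--", "LIFE", "1985"]

cleaned = set()
for token in sample_tokens:
    # Lowercase, then trim punctuation from both ends only, so that
    # word-internal apostrophes and hyphens are kept.
    word = token.lower().strip(string.punctuation)
    # Keep only tokens that still contain at least one letter.
    if any(ch.isalpha() for ch in word):
        cleaned.add(word)

print(sorted(cleaned))
```

Whether to drop numbers and punctuation-only tokens is a project decision; this sketch drops both.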



In [11]:
# Let's try out the NLTK POS tagger on the uniqueTokens
pos_tag(uniqueTokens)

[('bummed', 'VBN'),
 ('indeed', 'RB'),
 ('life', 'NN'),
 ('blouses', 'NNS'),
 ('eating', 'VBG'),
 ('surges', 'NNS'),
 ('bowl', 'NN'),
 ('offer', 'VBP'),
 ('semiclothed', 'VBN'),
 ('no', 'DT'),
 ('birth', 'NN'),
 ('sight', 'VBD'),
 ('lies', 'NNS'),
 ('cruise', 'VBP'),
 ('went', 'VBD'),
 ('grandfather', 'JJ'),
 ('situation', 'NN'),
 ('intercourse.', 'NN'),
 ('honk', 'NN'),
 ('knees', 'NNS'),
 ('why', 'WRB'),
 ('pictures', 'NNS'),
 ('power', 'NN'),
 ('red', 'JJ'),
 ('section', 'NN'),
 ('again.', 'IN'),
 ('ideal.', 'JJ'),
 ('sam.', 'NN'),
 ('divorce', 'NN'),
 ('shitfaced', 'VBD'),
 ('hijacked', 'VBN'),
 ('fell', 'VBD'),
 ('wimp', 'RB'),
 ('caught', 'VBN'),
 ('prom', 'JJ'),
 ('hand', 'NN'),
 ('jesus', 'NN'),
 ('chilling', 'NN'),
 ('adults', 'NNS'),
 ('hardly', 'RB'),
 ('relieved.', 'VBP'),
 ('ass', 'RB'),
 ('forward', 'RB'),
 ('sniffles', 'VBZ'),
 ('grab', 'NN'),
 ('first', 'RB'),
 ('foyer', 'JJ'),
 ('love', 'VB'),
 ('serene', 'JJ'),
 ('deadbolt', 'JJ'),
 ('moves', 'NNS'),
 ('him', 'PRP'),


### Wait...what is that POS category?
Find out what a category is by asking NLTK:
`nltk.help.upenn_tagset('VBZ')`

In [12]:
nltk.help.upenn_tagset('NNS')

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


We can limit our list of words of interest by asking for words of a particular kind, isolating them by POS tag:

In [13]:
POStagged = pos_tag(uniqueTokens)
tagsIwant = ['VB', 'VBZ']
# This is a Python list comprehension that'll help us with our list of tuples [(word, tag), (word, tag), ...]
shortList = [word for word, tag in POStagged if tag in tagsIwant]
print(len(shortList))
print(shortList)

# Fancy string formatting (not super useful here, just reviewing it): 
# f'My short list is {shortList} and it is this long: {len(shortList)}'

155
['sniffles', 'love', 'backs', 'makes', 'gives', 'scowls', 'takes', 'moans', 'smiles', 'shoo', 'tresses', 'laughs', 'tends', 'tall', 'changes', 'speaks', 'hands', 'take', 'puts', 'stop', 'lenses', 'snuggles', 'suitcase', 'worships', 'rests', 'leans', 'relatives', 'shove', 'is', 'lets', 'sees', 'lie', 'receives', 'fusses', 'screwed', 'winces', 'shakes', 'continues', 'agrees', 'embraces', 'day.', 'descends', 'bryce', 'walk', 'occurs', 'rolls', 'passes', 'helps', 'piles', 'rattles', 'remember', 'looks', 'damn', 'meet', 'happens', 'thank', 'falls', 'directs', 'hears', 'back', 'closes', 'tosses', 'executes', 'throw', 'ask', 'throws', 'insensitivity.', 'faces', 'rides', 'eyebrows', 'joins', 'flies', 'follows', 'stays', 'windows', 'hates', 'boyfriends', 'monogram', 'cigarette', 'fishtails', 'reaches', 'thanks', 'lays', 'shoves', 'decides', 'glasses', 'run', 'stands', 'trays', 'shuffles', 'tan', 'suspects', 'crawls', 'luncheon.', 'row', 'breaking', 'holds', 'snakes', 'craves', 'hit', 'begin
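
The cell above names two verb tags explicitly. In the Penn Treebank tagset, every verb tag begins with "VB" and every noun tag begins with "NN", so another option is to filter on the tag prefix. A small sketch on a few `(word, tag)` pairs copied from the tagger output earlier:

```python
# A few (word, tag) pairs copied from the tagger output above.
tagged_sample = [("bummed", "VBN"), ("life", "NN"), ("eating", "VBG"),
                 ("went", "VBD"), ("love", "VB"), ("sniffles", "VBZ"),
                 ("red", "JJ"), ("blouses", "NNS")]

# Match on the tag prefix to catch every verb form (VB, VBD, VBG, VBN, VBP, VBZ)
# or every noun form (NN, NNS, NNP, NNPS) at once.
verbs = sorted({word for word, tag in tagged_sample if tag.startswith("VB")})
nouns = sorted({word for word, tag in tagged_sample if tag.startswith("NN")})

print(verbs)
print(nouns)
```

With the notebook's own data, you would replace `tagged_sample` with `POStagged`.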

## Send them to WordNet for synset lookup

* Use list comprehension
* Use wn.morphy to simplify the word to its synset lemma form
* While we're at it, let's sort the shortList using `sorted()`.
* We noticed that WordNet has its own parts of speech, and they're a lot simpler than NLTK's.
    * WordNet's parts of speech are ADJ, ADJ_SAT, ADV, NOUN, VERB:  'a', 's', 'r', 'n', 'v'
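    * We also noticed that NLTK's part of speech tagging is rather shockingly unreliable on our texts. One likely contributor: we tagged a set of unique, lowercased tokens, while the tagger relies on the neighboring words of a sentence in their original order.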
    * We wonder if spaCy might be better, or if we really need to train the models on our texts by tagging them ourselves.
    * You might want to just make a list of all the words that have the same ending, like our [-ing smoke test](https://github.com/newtfire/textAnalysis-Hub/blob/main/Class-Examples/Python/orient-ing.py) from the NLTK book intro in our first Python assignment. 
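
Because the two tagsets differ, it can help to translate a Penn Treebank tag into WordNet's simpler letters before a lookup. This is a minimal sketch; `penn_to_wordnet` is a hypothetical helper name, and the mapping covers only the four broad categories WordNet knows about.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if no match."""
    if tag.startswith("JJ"):
        return "a"   # adjective (WordNet also files some adjectives under 's')
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("NN"):
        return "n"   # noun
    if tag.startswith("RB"):
        return "r"   # adverb
    return None      # determiners, pronouns, etc. have no WordNet POS


for tag in ["VBZ", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The resulting letter can be passed to WordNet as the `pos` argument, for example `wn.synsets(lemma, pos='v')`, to count only the verb senses of a word.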
  

In [14]:
for w in sorted(shortList):
    # wn.morphy() returns None when it finds no base form, so fall back to the word itself:
    lemma = wn.morphy(w) or w
    print(f"Word: {w} | Wordnet Lemma: {lemma}")
    synsets = wn.synsets(lemma)
    if synsets:
        # Collect the distinct WordNet parts of speech across this word's synsets
        pos = {synset.pos() for synset in synsets}
        print(f" Word: {w}, POS-according-to-WordNet {pos} Number of Synsets (Ambiguity): {len(synsets)}")
        # for syn in synsets:
        #     print(f"  Synset: {syn.name()}, Definition: {syn.definition()}")


Word: agrees | Wordnet Lemma: agree
 Word: agrees, POS-according-to-WordNet {'v'} Number of Synsets (Ambiguity): 7
Word: ask | Wordnet Lemma: ask
 Word: ask, POS-according-to-WordNet {'v'} Number of Synsets (Ambiguity): 7
Word: back | Wordnet Lemma: back
 Word: back, POS-according-to-WordNet {'a', 'r', 's', 'v', 'n'} Number of Synsets (Ambiguity): 28
Word: backs | Wordnet Lemma: back
 Word: backs, POS-according-to-WordNet {'a', 'r', 's', 'v', 'n'} Number of Synsets (Ambiguity): 28
Word: be | Wordnet Lemma: be
 Word: be, POS-according-to-WordNet {'v', 'n'} Number of Synsets (Ambiguity): 14
Word: begins | Wordnet Lemma: begin
 Word: begins, POS-according-to-WordNet {'v', 'n'} Number of Synsets (Ambiguity): 11
Word: bit | Wordnet Lemma: bit
 Word: bit, POS-according-to-WordNet {'v', 'n'} Number of Synsets (Ambiguity): 15
Word: boxes | Wordnet Lemma: box
 Word: boxes, POS-according-to-WordNet {'v', 'n'} Number of Synsets (Ambiguity): 13
Word: boyfriends | Wordnet Lemma: boyfriend
 Word: bo

## Read in just one file from a directory in your repo
Open one of the text files in your repo for reading.
In this example, we'll climb directories, so we'll use the os library to show you how to handle filepaths.

When your file isn't in the same folder as your Python script and you need to climb for it, start with the os library by getting the current working directory:

In [15]:
# cwd is my shorthand for "current working directory"
cwd = os.getcwd()
print(cwd)

C:\Users\raika\OneDrive\Desktop\GitHub\d210kranz\python\python_assignments


In [16]:
# climb up one directory and retrieve a file. (here's how you would do that)
# (ADAPT THIS CODE TO REACH DOWN OR UP AS NEEDED.)
filepath = '../ferris.txt'
print(filepath)

../ferris.txt
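
Relative paths can also be assembled with `os.path.join`, which inserts the right separator for your operating system. A sketch, assuming the same file and folder names used in this notebook (they may not exist on your machine, which is why the code only reports whether each path exists):

```python
import os

# os.pardir is the parent-directory marker ('..').
up_one = os.path.join(os.pardir, "ferris.txt")      # one level up
down_one = os.path.join("texts", "sixteen.txt")     # one level down

for path in (up_one, down_one):
    # abspath resolves the relative path against the current working directory.
    print(path, "->", os.path.abspath(path), "| exists:", os.path.exists(path))
```

Checking `os.path.exists()` before opening a file is a quick way to find out whether you climbed to the right place.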


In [17]:
# Now, Python must OPEN and READ the text file in order to process it.
# The with-block closes the file automatically once it has been read:
with open(filepath, 'r', encoding='utf8') as infile:
    f = infile.read()
print(f)

"FERRIS BUELLER'S DAY OFF"

                                     by

                                John Hughes

						SHOOTING SCRIPT
						July 24, 1985


                         "FERRIS BUELLER'S DAY OFF"

  1  BLACK SCREEN                                                  1

     MAIN TITLES

     IT'S SILENT. A BEAT...AND AN EXPLOSION OF SOUND.  A HOUSEHOLD
     IN THE MORNING. KIDS GETTING READY FOR SCHOOL. CLOCK RADIOS.
     KITCHEN APPLIANCES. SHOWERS. FIGHTING. PEOPLE YELLING. DOG
     BARKING. APPLIANCES BUZZING. CAR HORNS. IT SOUNDS JUST LIKE
     YOUR HOUSE DID. STREAMS OF ROCK'N ROLL FADE IN AND OUT. HUEY
     LEWIS TO LIONEL RITCHIE TO HUSKER DU. SURROUND MAKES IT FEEL
     LIKE YOU'RE IN THE ROOM. AN AURAL TOUR OF A HOUSE ON A
     SCHOOL MORNING. BEGINING IN THE KITCHEN AND MOVING UPSTAIRS.

                              FATHER'S VOICE (TOM)
               Where's my wallet?!

                              SEVEN YEAR OLD BOY (TODD)
               YOU IDIOT!!

         

## Read in some project data from a collection of text files
You have a file directory with some text files, probably near your Python script. 

First we list what's in the directory. Then, with a "for loop" over that listing, we can write code to process information about each file separately.

In [18]:
# Remember, we defined cwd as our current working directory holding this file.
# list directories:
os.listdir(cwd)


['.ipynb_checkpoints',
 'cytoscape',
 'downloaded_videos',
 'ferris.txt',
 'first.py',
 'first_notes.ipynb',
 'kranz-Words-to-Network-Data.ipynb',
 'kranz_Exploring-Vector-Similarity.ipynb',
 'kranz_webscrape.py',
 'kranz_WordNet-and-Files.ipynb',
 'networkData-2.tsv',
 'networkData.tsv.png',
 'texts',
 'Untitled.ipynb']

In [19]:
coll = os.path.join(cwd, 'texts')
os.listdir(coll)


['breakfast.txt', 'ferris', 'sixteen.txt']

## Processing the directory as one corpus
What if we want to create an NLTK corpus of these texts and process them as one unit? See the [NLTK book chapter 2 section 1.9](https://www.nltk.org/book/ch02.html#loading-your-own-corpus).

In [20]:
for file in os.listdir(coll):
    if file.endswith(".txt"):
        filepath = f"{coll}/{file}"
        print(filepath)

C:\Users\raika\OneDrive\Desktop\GitHub\d210kranz\python\python_assignments\texts/breakfast.txt
C:\Users\raika\OneDrive\Desktop\GitHub\d210kranz\python\python_assignments\texts/sixteen.txt
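
The same loop can go on to open and process each file. The sketch below builds a throwaway directory with hypothetical file names so that it runs anywhere, joins paths with `os.path.join` (which avoids the mixed separators visible in the output above), and uses whitespace splitting as a stand-in for `word_tokenize`.

```python
import os
import tempfile

# Build a throwaway directory with a few small files so the sketch runs anywhere;
# in the notebook, `coll` would point at the real 'texts' folder instead.
demo_dir = tempfile.mkdtemp()
samples = {"a.txt": "Where's my wallet?",
           "b.txt": "You idiot! You idiot!",
           "notes.md": "skip me"}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), "w", encoding="utf8") as out:
        out.write(text)

token_counts = {}
for file in sorted(os.listdir(demo_dir)):
    if file.endswith(".txt"):
        filepath = os.path.join(demo_dir, file)
        with open(filepath, "r", encoding="utf8") as infile:
            text = infile.read()
        # Whitespace splitting keeps this stdlib-only; word_tokenize(text)
        # could be swapped in when NLTK is loaded.
        token_counts[file] = len(text.split())

print(token_counts)
```

Storing results in a dictionary keyed by filename keeps each file's information separate for later comparison.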


In [21]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'texts'
corpus = PlaintextCorpusReader(corpus_root, '.*')
corpus.fileids()
# Check on one file in the collection
# corpus.words('breakfast.txt')



['breakfast.txt', 'ferris/ferris.txt', 'sixteen.txt']


## For Wednesday April 9: YOUR TURN!
What do you need to do to process texts in your directory with NLTK tools?
You will need to:
* Turn your text(s) into a list of tokens!
* Then you can process those tokens with NLTK

### Your assignment is...
Split up your texts into a list of tokens, and do some new NLTK processing of them.
Look Stuff Up: See if you can use NLTK to output a **set** of a certain kind of word. It could be...
* **part of speech** (find out how NLTK can retrieve POS (part-of-speech) information). Retrieve POS information!
* Pull a **set** of all the tokens that share a specific part of speech. Make a set() because it removes multiple instances (it's the same as taking distinct-values() in XPath). Want all the adjectives? All the verbs? etc.
* Do something interesting with NLTK over that set of words. Try out WordNet synsets, or experiment with frequency distributions, or something else that looks nifty in NLTK. 
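
As one starting point for the frequency idea, here is a sketch that counts how often a few words of interest appear in a token list. The token list and the words of interest are hypothetical, and it uses `collections.Counter` so that it runs without NLTK; `nltk.FreqDist` is built on `Counter` and offers the same `most_common()` method.

```python
from collections import Counter

# Hypothetical token list and set of words of interest.
tokens = ["she", "cries", "late", "he", "cries", "sick", "late", "cries"]
words_of_interest = {"cries", "late", "sick"}

# Count every token, then report only the words of interest,
# from most to least frequent.
freq = Counter(tokens)
for word, count in freq.most_common():
    if word in words_of_interest:
        print(word, count)
```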

## For Friday April 11: YOUR TURN!
Continue applying what you're learning from this notebook.
Take some project files, read and tokenize them, and explore them for WordNet data.
Retrieve info on a limited set of words of your choice (tagged for POS probably, but you can explore!)
Output ambiguity data on each word by finding the len() of its synsets. 