# Reading File Directories and Exploring WordNet

This notebook provides some guidance on working with file directories for a text corpus. It also introduces some things we can explore with the WordNet dictionary accessible through NLTK.

Read about [WordNet in Chapter 2, section 5.2](https://www.nltk.org/book/ch02.html#wordnet) of the NLTK book.

Credits: Some of our explanations are distilled from [David Birnbaum's introductory Wordnet notebook](https://github.com/djbpitt/wordnet/blob/master/Wordnet.ipynb), which we used to explore the ambiguity of spooky words in projects some years ago!


In [None]:
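# Imports
import os
import nltk
from nltk.corpus import wordnet as wn
# If a later cell raises a LookupError, NLTK's error message names the data
# package to fetch with nltk.download(...).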


In [None]:
# SMOKE TEST: Explore WordNet for specific words.
wn.synsets('clear')

### Look at some synset data from WordNet
Choose one of the synsets by its identifier, and let's explore what you can see with it.
In the next couple of cells we explore how WordNet shares information.

WordNet shows:
* A representative word (the head word) that stands for a set of words sharing one meaning (a "synset"). Other words may belong to the same synset, and the same head word may be used in different senses (and so have multiple synsets).
* A part of speech (POS) identifier, like “n” for ‘noun’ or “v” for ‘verb’.
* A two-digit number that distinguishes different synsets that may have the same head word and the same POS, but that convey different meanings. For example, the synsets 'ghost.n.01' and 'ghost.n.02' are two different nominal meanings that can be expressed by the lexeme “ghost.”
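
The three parts of a synset identifier can be pulled apart with plain string handling. This is only a small sketch: the function name `parse_synset_id` is our own invention for illustration and is not part of NLTK.

```python
def parse_synset_id(synset_id):
    """Split an identifier such as 'ghost.n.01' into (head word, POS, sense number)."""
    # Split from the right, so a head word that itself contains a period stays whole
    head_word, pos, sense_number = synset_id.rsplit('.', 2)
    return head_word, pos, int(sense_number)

print(parse_synset_id('ghost.n.01'))
print(parse_synset_id('ghost.n.02'))
```

In the notebook, the identifier string for any synset comes from its `name()` method, so you could pass `syn.name()` to a helper like this one.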

In WordNet you can request:
* the available **lemmas**: the "lexemes," or set of words, associated with a particular synset
* the **definition** associated with a synset

In [None]:
wn.synset('net.v.02').lemmas()

In [None]:
# For each lemma in one of the synsets, print just its name without the extra stuff:
for lem in wn.synset('net.v.02').lemmas():
    print(lem.name())

In [None]:
# For each synset of "clear", print the synset, its lemma names, and the count of those lemmas.
# In this output we're just surveying the data as lists:
for synset in wn.synsets('clear'):
    print(synset, ": ", synset.lemma_names(), ": ", len(synset.lemma_names()))  

In [None]:
# For each synset of a word, ask WordNet for its lemma names, part of speech, and definition
for synset in wn.synsets('clear'):
    print(synset.lemma_names(), ": Part of Speech: ", synset.pos(), ": Definition: ", synset.definition())
    

## Ambiguity from WordNet's point of view
When words are ambiguous, they can have multiple shades of meaning or be interpreted in different ways. How we decide on the meaning depends on the context of the words. WordNet can't tell you everything about how you could interpret every word in your project, but when words are available in its vast database, you can explore how many synsets (or possible meaningful interpretations) it has on file for that word. You can find which words depend the most on specific contexts for their meaning.

To find out WordNet's sense of how ambiguous a word is, you could just count the number of available synsets for it with Python's **len()**.
Note: when you want ALL the synsets for a word, use **wn.synsets** (plural). Use **wn.synset** (singular) to retrieve one synset by its identifier.
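
Once each word has a count, sorting puts the most ambiguous words first. Here is a minimal sketch: `rank_by_ambiguity` is a name we made up, and the counting function is passed in so the example runs without WordNet data. In the demo line, Python's `len` only counts letters and is purely a stand-in.

```python
def rank_by_ambiguity(words, count_senses):
    """Return (word, count) pairs, highest count first.

    count_senses is any function that maps a word to a number.
    """
    counts = [(word, count_senses(word)) for word in words]
    # Sort by count, descending; break ties alphabetically
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))

# Stand-in demo: len counts LETTERS here, not synsets
print(rank_by_ambiguity(['cry', 'ghost', 'scream'], len))
```

With WordNet loaded, you would pass `lambda w: len(wn.synsets(w))` as the counting function instead.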

In [None]:
spooky_words = ['creep', 'eldritch', 'horror', 'cry', 'scream', 'ghost', 'scare']
for w in spooky_words:
    synset_count = len(wn.synsets(w))
    print(f"The word {w} belongs to {synset_count} synsets in WordNet.")

### WordNet's morphy: for inflected forms like plural vs. singular
WordNet won't have a separate entry for inflected forms such as plurals. Other forms, like -ing words, tend to have their own WordNet entries.

When working with project data, **run your word tokens through WordNet's morphy in case they need to be looked up in a lemmatized form**, like this:
`wn.morphy('ghosts')`. Try it out on various word forms:

In [None]:
wn.morphy('cries')

### A Workflow for projects exploring ambiguity
* Start by isolating a set of words of interest from your project. (One idea: Use NLTK's POS tagger to identify all the verbs in your files and retrieve a set of them.)
* Deliver them to WordNet to retrieve their synset counts.
* Output the information on each distinct word from Python to a simple XML, TSV, or JSON file. (We'll probably use XML.)
* Options:
    * Use the frequency distribution plotting in Python to see how frequently ambiguous words are used in your corpus (plotting is a bit limited).
    * Write XSLT to map that information back into your XML files. We can use this to **instantiate** the uses of ambiguous words: retrieve counts of how frequently they appear in the text, and plot with SVGs of your own design. (You could make an SVG representing info about the most ambiguous words. You could also make specific SVGs for ambiguous words of interest to show how they're distributed throughout your texts. What would be interesting to visualize?)
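
The output step in the workflow above can be sketched with Python's built-in `xml.etree.ElementTree`. The element names (`words`, `word`), the `synsets` attribute, and the function name are all assumptions chosen for illustration, and the numbers in the demo are placeholders rather than real WordNet counts.

```python
import xml.etree.ElementTree as ET

def counts_to_xml(synset_counts):
    """Build a small XML string with one <word> element per dictionary entry."""
    root = ET.Element('words')
    for word, count in sorted(synset_counts.items()):
        # The count becomes an attribute; the word itself is the element text
        entry = ET.SubElement(root, 'word', synsets=str(count))
        entry.text = word
    return ET.tostring(root, encoding='unicode')

# Placeholder data for illustration only, NOT real WordNet counts
example_counts = {'alpha': 3, 'beta': 1}
print(counts_to_xml(example_counts))
```

A flat structure like this is easy to read back in with XSLT or XPath later. Adjust the element and attribute names to fit your own project's markup.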


## Part of Speech (and other related kinds of) Tagging
NLTK (and spaCy and other NLP tools) offer Part of Speech tagging. We could use this to collect sets of distinct verbs, nouns, and adjectives.
* We could learn more about the words in a set by sending them to WordNet.
* We could also look at words as a network: How often do specific characters rely on a word of interest? Which characters share a word of interest? Or which words of interest are shared by, say, male vs. female characters? (Here's [a student project](http://bamfs.obdurodon.org/allFilms_Results.xhtml) that explored a collection of colorful swear words used by characters in Quentin Tarantino's films.)

In our Text Analysis class, you didn't spend time marking specific words, though you could do that using regex search and replace over your texts. You could also get some help from NLTK's POS tagging, and map that information back to your XML files for further analysis. So let's explore that. We'll look at POS (part of speech) tagging, but note that it's related to Named Entity Recognition and sentiment analysis.

(Full disclosure: We like POS for reasons of interesting patterns it can display without overdetermining what the data means in context. You can show interesting patterns, especially by working with POS tagging together with WordNet analyses.)

## How to get NLTK to do POS tagging
* Let's open a file, tokenize it, and prepare it for NLTK to analyze.
* Then apply NLTK's `pos_tag` function.
* What we see in the output is a list of (word, tag) tuples.
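
If a list of tuples is new to you, here is a small sketch of how to take one apart. The sample list is typed by hand in the shape described above; it is not actual tagger output.

```python
# A hand-written sample in the shape [(word, tag), (word, tag), ...]
# (for illustration only, not real output from the tagger)
sample_tagged = [('the', 'DT'), ('ghost', 'NN'), ('cried', 'VBD')]

# Unpack a single tuple into two variables
first_word, first_tag = sample_tagged[0]
print(first_word, first_tag)

# Unpack every tuple while looping
for word, tag in sample_tagged:
    print(f"{word} -> {tag}")

# Keep only the words, dropping the tags
words_only = [word for word, tag in sample_tagged]
print(words_only)
```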

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

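filepath = 'hughes-txt/sixteen.txt'
# Open and read the file; "with" closes it for us when the block ends
with open(filepath, 'r', encoding='utf8') as infile:
    f = infile.read()
# Make a list of tokens in your text.
# tokenList = f.split()  # simpler alternative: splits on whitespace only
tokenList = word_tokenize(f)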

In [None]:
# Reduce the complexity by: 1) lowercasing the tokens, and 2) returning the set() of words (which removes duplicates)
lowercaseTokens = [token.lower() for token in tokenList]
uniqueTokens = set(lowercaseTokens)
print(uniqueTokens)

In [None]:
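# Let's try out the NLTK POS tagger on the uniqueTokens.
# Note: the tagger looks at neighboring words for context, so tags for an unordered set
# of isolated words can be less reliable than tags for the running text in tokenList.
pos_tag(uniqueTokens)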

### Wait...what is that POS category?
Find out what a category is by asking NLTK:
`nltk.help.upenn_tagset('VBZ')`

In [None]:
nltk.help.upenn_tagset('VBD')

We can limit our list of words of interest by asking for words of a particular kind, isolating them by POS tag.
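
Another way to survey the tagged words is to group all of them by tag at once, then pick out the groups you want. This is a sketch using a hand-typed sample list; `group_by_tag` is a hypothetical helper name, not an NLTK function.

```python
from collections import defaultdict

def group_by_tag(tagged_tokens):
    """Collect the distinct words found under each POS tag."""
    groups = defaultdict(set)
    for word, tag in tagged_tokens:
        groups[tag].add(word)
    return dict(groups)

# Hand-written sample for illustration, not real tagger output
sample_tagged = [('ran', 'VBD'), ('sat', 'VBD'), ('ghost', 'NN'), ('ran', 'VBD')]
grouped = group_by_tag(sample_tagged)
print(sorted(grouped['VBD']))
```

Because each group is a set, repeated words collapse to one entry, just like the set() of unique tokens made earlier.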

In [None]:
POStagged = pos_tag(uniqueTokens)
tagsIwant = ['VBD', 'VBN', 'VBZ']
# This is a Python list comprehension that'll help us with our list of tuples [(word, tag), (word, tag), ...]
shortList = [word for word, tag in POStagged if tag in tagsIwant]
print(len(shortList))
print(shortList)

# Fancy string formatting (not super useful here, just reviewing it): 
# f'My short list is {shortList} and it is this long: {len(shortList)}'

## Send them to WordNet for synset lookup

* Loop over the short list of words
* Use wn.morphy to reduce each word to the lemma form WordNet files it under
  

In [None]:
for w in shortList:
    # wn.morphy returns None when it finds no base form, and looking up None
    # in wn.synsets would fail, so fall back to the word itself:
    lemma = wn.morphy(w) or w
    print(f"Word: {w} | WordNet Lemma: {lemma}")
    synsets = wn.synsets(lemma)
    if synsets:
        print(f"  Number of Synsets (Ambiguity): {len(synsets)}")
        # Uncomment to list each synset with its definition:
        # for syn in synsets:
        #     print(f"    Synset: {syn.name()}, Definition: {syn.definition()}")


## Read in just one file from a directory in your repo
Open one of your text files in your repo for reading.
In this example, we'll climb directories, and that means we'll use the os library to show you how to handle filepaths.
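
Here is a rough sketch of building paths that reach up or down from the current working directory with `os.path`. The file and folder names (`example.txt`, `texts`) are hypothetical, used only to show how the pieces join together.

```python
import os

cwd = os.getcwd()

# '..' means "the parent directory"; normpath tidies the result into a clean path
parent = os.path.normpath(os.path.join(cwd, '..'))

# Hypothetical names, for illustration only
up_one = os.path.join(parent, 'example.txt')           # a file one level up
down_one = os.path.join(cwd, 'texts', 'example.txt')   # a file one folder down

print(up_one)
print(down_one)
# Check whether a path really exists before trying to open it
print(os.path.exists(down_one))
```

Using `os.path.join` instead of typing slashes by hand lets Python pick the right separator for your operating system.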

When your file isn't in the same folder as your Python script and you need to climb for it, start with the os library by getting the current working directory:

In [None]:
# cwd is my shorthand for "current working directory"
cwd = os.getcwd()
print(cwd)

In [None]:
# Climb up one directory and retrieve a file: '..' means "the parent directory".
# (ADAPT THIS CODE TO REACH DOWN OR UP AS NEEDED.)
filepath = '../grimm.txt'
print(filepath)

In [None]:
# Now, Python must OPEN and READ the text file in order to process it.
# Opening it with "with" closes the file for us when the block ends:
with open(filepath, 'r', encoding='utf8') as infile:
    f = infile.read()
print(f)

## Read in some project data from a collection of text files
You probably have a directory of text files somewhere near your Python script.

First we list what's in that directory. Then, with a "for loop" over the listing, we can write code to process information about each file separately.
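
One way to process each file separately is to read every text file into a dictionary keyed by filename. This is a minimal sketch: `read_text_files` is our own hypothetical helper, and it assumes your files end in `.txt` and are encoded as UTF-8.

```python
import os

def read_text_files(directory):
    """Return {filename: file contents} for every .txt file in the directory."""
    texts = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith('.txt'):
            path = os.path.join(directory, name)
            # "with" closes each file as soon as it has been read
            with open(path, 'r', encoding='utf8') as handle:
                texts[name] = handle.read()
    return texts

# Example call, assuming a folder named 'hughes-txt' sits beside this notebook:
# texts = read_text_files(os.path.join(os.getcwd(), 'hughes-txt'))
```

From there, each value in the dictionary can be tokenized and tagged on its own, so results stay linked to the file they came from.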

In [None]:
# Remember, we defined cwd as our current working directory holding this file.
# List everything (files and folders) inside it:
os.listdir(cwd)


In [None]:
coll = os.path.join(cwd, 'hughes-txt')
os.listdir(coll)


## Processing the directory as one corpus
What if we want to create an NLTK corpus of these texts and process them as one unit? See the [NLTK book chapter 2 section 1.9](https://www.nltk.org/book/ch02.html#loading-your-own-corpus). First, let's print the full path to each text file in the collection:

In [None]:
for file in os.listdir(coll):
    if file.endswith(".txt"):
        # os.path.join picks the right path separator for your operating system
        filepath = os.path.join(coll, file)
        print(filepath)

In [None]:
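from nltk.corpus import PlaintextCorpusReader
corpus_root = 'hughes-txt'
# The second argument is a regular expression selecting which files to include;
# this one takes only .txt files, so stray hidden files are left out.
corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')
corpus.fileids()
# Check on one file in the collection
# corpus.words('breakfast.txt')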



## YOUR TURN!
What do you need to do to process texts in your directory with NLTK tools?
You will need to:
* Turn your text(s) into a list of tokens!
* Then you can process those tokens with NLTK

### Your assignment is...
Split up your texts into a list of tokens, and do some new NLTK processing of them.
Look Stuff Up: See if you can use NLTK to output a **set** of a certain kind of word. It could be...
* **part of speech**: find out how NLTK can retrieve POS (part-of-speech) information. Retrieve POS information!
* Pull a **set** of all the tokens that share a specific part of speech. (Want all the adjectives? All the verbs? etc.) Make a set() because it removes multiple instances: it's the same as taking distinct-values() in XPath.
* Do something interesting with NLTK over that set of words. Try out WordNet synsets, or experiment with frequency distributions, or something else that looks nifty in NLTK.
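
As a starting point for the frequency idea, here is a small sketch that counts tokens with Python's built-in `Counter`. We use `Counter` here as a stand-in so the example runs without NLTK; NLTK's `FreqDist` is built on `Counter` and offers the same `most_common` method. The helper name `top_words` and the sample sentence are made up for illustration.

```python
from collections import Counter

def top_words(tokens, n=3):
    """Count lowercased tokens and return the n most frequent as (word, count) pairs."""
    counts = Counter(token.lower() for token in tokens)
    return counts.most_common(n)

# Made-up sample tokens, for illustration only
sample_tokens = ['The', 'ghost', 'saw', 'the', 'other', 'ghost', 'and', 'the', 'cat']
print(top_words(sample_tokens, 2))
```

Note that frequency counts need the full token list, not the set(): a set keeps only one copy of each word, so every count would come out as 1.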
