By this week, you are hopefully comfortable with lists (e.g. `[a, b, c, ...]`) and starting to get the hang of list comprehensions.

Two fundamental Python skills are worth knowing here. First, there is a general-purpose built-in function called `len()` that returns the length of an object: how many items are in a list, or how many characters are in a string. For example:

In [None]:
l = ['hello', 'text', 'mining']
len(l)

3

In [None]:
len("Text Mining")

11

Second, there is an object called a `set`, which is like a list but has no ordering and holds only unique elements. This is useful for us because it gives a quick way to see just the unique words in a list: the vocabulary.

In [None]:
l = ['Buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo']
s = set(l)
print("List:", l)
print("Set:", s)

List: ['Buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo']
Set: {'Buffalo', 'buffalo'}
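
Since `len()` works on sets too, the two ideas combine: the length of a set is the number of distinct items. Here is a minimal sketch reusing the Buffalo list from above (the names `tokens`, `vocabulary`, and `folded_vocabulary` are only illustrative):

```python
tokens = ['Buffalo', 'buffalo', 'Buffalo', 'buffalo',
          'buffalo', 'Buffalo', 'buffalo', 'buffalo']

vocabulary = set(tokens)      # duplicates collapse; order is not preserved
print(len(tokens))            # total number of items in the list
print(len(vocabulary))        # number of unique items

# Sets compare strings exactly, so 'Buffalo' and 'buffalo' count as different.
# Lowercasing each word first merges them into a single entry.
folded_vocabulary = set(word.lower() for word in tokens)
print(len(folded_vocabulary))
```

Whether 'Buffalo' and 'buffalo' should count as one word or two depends on the question you are asking of the text.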


# Normalization

This week, follow along with [Searching for Meaning](https://github.com/sgsinclair/alta/blob/41f389f3d9708573c44c883bcd95fd16bad54a24/ipynb/SearchingMeaning.ipynb) from the Art of Literary Text Analysis.

Use the trimmed version of Frankenstein from last week to try some of the concepts in the chapter. This should get you up to speed.

In [None]:
import nltk
with open('../data/frankenstein.txt') as f:
    frankensteinString = f.read()
frankensteinTokens = nltk.word_tokenize(frankensteinString)
cleanedTokens = [word.lower() for word in frankensteinTokens if word[0].isalpha()]
len(frankensteinTokens)

85440

Note that there are 85440 tokens in the text, a count that includes punctuation tokens. If we count just the _unique_ words in `cleanedTokens` (the _vocabulary size_), we find 7038:

In [None]:
len(set(cleanedTokens))

7038
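
To see what the cleaning step does before trusting these counts, it can help to run the same list comprehension on a handful of tokens. The sketch below uses a made-up token list (`sample_tokens` is hypothetical, standing in for tokenizer output) so it runs without NLTK or the Frankenstein file:

```python
# Hypothetical token list standing in for the output of a tokenizer
sample_tokens = ['The', 'monster', ',', 'the', 'Monster', '!', '1818', "'s"]

# Same filter as above: keep tokens whose first character is a letter,
# and lowercase whatever is kept
cleaned = [word.lower() for word in sample_tokens if word[0].isalpha()]

print(cleaned)
print("Tokens:", len(sample_tokens))
print("Cleaned tokens:", len(cleaned))
print("Vocabulary size:", len(set(cleaned)))
```

Punctuation, numbers, and fragments such as `'s` are dropped because their first character is not a letter, and lowercasing makes 'The' and 'the' count as the same word.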

**Questions**

- 1) Create a lemmatized version of cleanedTokens and count the unique lemmas. Share the code to do this: the answer that it gives you should be 6416.
- 2) Re-do the lemmatization after stopping words against the default NLTK stoplist, and tabulate the top ten words. Paste the code and output.
- 3) How does the tabulation of lemmas differ from the tabulation of the non-lemmatized (but still stopped and case-folded) tokens?
- 4) What are the WordNet synsets for 'monster'?
- 5) A synset has a method called `definition()`. Noting that the code for Q4 resulted in a list, write a list comprehension that extracts the definition of each synset. Share the code and output.
- 6) Each synset is a child of a more general synset. For example, `crab` is an example of a `decapod_crustacean`, which is more generally a `crustacean`, and so on. You can get at the paths to the root of this tree with `hypernym_paths()`. Paste the code and hypernym path for `freak.n.01`.
- 7) (for 2 points): We've already seen some corpora that NLTK can pull in, from the complex WordNet information to a basic stoplist. Using the NLTK information on male/female names, determine and paste in the unique female names in Frankenstein. This isn't in the ALTA book, but searching Google sometimes helps ;)

Our copy of Frankenstein is from Project Gutenberg, a collection of transcriptions of public domain (i.e. legally shareable) books. NLTK offers a small selection of those books through `nltk.corpus.gutenberg`.

Load the gutenberg corpus and convert it to what NLTK calls a TextCollection:

In [None]:
from nltk.text import TextCollection
gutenberg_docs = nltk.corpus.gutenberg
gutenberg_collection = TextCollection(gutenberg_docs)

Using a method of `gutenberg_collection` (remember auto-complete!), answer the final two-part question.

**Questions**

 - 8) For 2 points:
   - What is the TF-IDF for 'monster' in Frankenstein? You'll need the original (unnormalized) tokens.
   - Which of the following words has the highest TF-IDF: 'miserable', 'horror', or 'monster'? If you need it, you can compare numbers in Python with `>` (greater than) or `<` (less than).
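
If TF-IDF is new to you, the following sketch shows one common formulation on a toy corpus: term frequency (the share of a document's tokens that are the term) multiplied by inverse document frequency (the log of the number of documents divided by the number of documents containing the term). The documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. This is an assumption about the general technique, not a copy of NLTK's implementation, so check the NLTK documentation for the exact definition `TextCollection` uses.

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical data)
docs = [
    ['the', 'monster', 'fled', 'the', 'village'],
    ['the', 'horror', 'of', 'the', 'night'],
    ['a', 'miserable', 'monster', 'wept'],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(number of docs / docs containing term)
    matches = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / matches) if matches else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score three words against the third toy document
for term in ['miserable', 'horror', 'monster']:
    print(term, round(tf_idf(term, docs[2], docs), 3))
```

Under this formulation, a word scores high when it is frequent in one document but rare across the collection. Matching is exact, so 'Monster' and 'monster' would be counted separately.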