# Introduction to Natural Language Processing using NLTK

Many features of natural language are remarkably difficult for a computer to identify. Much early NLP work focused on structuring text according to how it functions as natural language, rather than as the series of characters a computer reads. Python's NLTK (Natural Language Toolkit) bundles many of these early tools together so we can apply them to text.

## Reminder of where we are

* values (e.g. `1.2`, `100`, `'Hello, Boston!'`)
* value types (e.g., `float`, `int`, `str`)
* variables, or objects (for storing and referencing values)
* operators (e.g., `=`, `+`, `-`)
* comparison operators (e.g., `==`, `>`, `<`, `>=`)
* statements and expressions (e.g. `10 + 500`)
* built-in functions (e.g. `print()`, `type()`)
* string functions and string methods (e.g., `string.lower()`, `string.islower()`)
* list functions and list methods (e.g., `len(mylist)`, `mylist.append()`)
* conditionals (e.g., `if`, `else`, `elif`)
* loops (e.g., `for` loops)
* user-defined functions (using `def`)
* tuples (e.g., `('education', 'high school')`)
* dictionaries, or key:value pairs
* list comprehension - a way to filter lists (and do other things)
* Pandas, the dataframe library
* Matplotlib, the visualization library
* Seaborn, another visualization library that makes everything easier
* Today: NLTK, the natural language processing library



## Natural Language Processing
* *pre-processing*
    * Transforming a human-language text into a computer-manipulable format. A typical pre-processing workflow includes <i>stop-word</i> removal, setting text in lower case, and <i>term frequency</i> counting.
* *token*
    * An individual unit of text within a sentence, typically a word or a punctuation mark.
* *stop words*
    * The function words in a natural language, such as <i>the</i>, <i>of</i>, <i>it</i>, etc. These are typically the most common words.
* *term frequency*
    * The number of times a term appears in a given text. This is either reported as a raw tally or it is <i>normalized</i> by dividing by the total number of words in a text.    
* *POS tagging*
    * One common task in NLP is the determination of a word's part-of-speech (POS). The label that describes a word's POS is called its <i>tag</i>. Specialized functions that make these determinations are called <i>POS Taggers</i>.
* *dependency parsing*
    * The analysis of the grammatical relationships between the words in a sentence, and of the type of each relationship.
* *NLTK (Natural Language Toolkit)*
    * A common Python package that contains many NLP-related functions.
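
To make the difference between a raw and a normalized term frequency concrete, here is a minimal sketch using only Python's standard library. The token list and the helper name `term_frequencies` are made up for illustration; NLTK has its own counting tools, which we use below.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return (raw counts, normalized frequencies) for a list of tokens."""
    raw = Counter(tokens)
    total = len(tokens)
    # normalized frequency = raw count / total number of tokens
    normalized = {term: count / total for term, count in raw.items()}
    return raw, normalized

# hypothetical, already lower-cased tokens
tokens = ['the', 'whale', 'and', 'the', 'ship', 'and', 'the', 'sea']
raw, normalized = term_frequencies(tokens)

print(raw['the'])         # raw tally of 'the'
print(normalized['the'])  # share of all tokens that are 'the'
```

Normalized frequencies are the ones to use when comparing texts of different lengths, since a raw tally grows with the size of the text.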

## Further Resources:

Check out the full range of techniques included in Python's nltk package here: http://www.nltk.org/book/

As with all of the sections this semester, we could spend an entire semester on NLP alone. This is just meant to give you a taste of what's possible, and equip you with enough knowledge to learn more on your own if you're interested!

In [3]:
#import string, which is where we get a list of punctuation
import string
#First import the Python package nltk (Natural Language Tool Kit)
import nltk

#NLTK is huge and relies on a bunch of data. These data need to be downloaded. 
# The two lines download the needed NLTK data to your computer

nltk_data = ["punkt", "words", "stopwords", "averaged_perceptron_tagger", 
             "maxent_ne_chunker", 'wordnet', 'vader_lexicon']
nltk.download(nltk_data)

#import the function to split the text into separate words from the NLTK package
from nltk import word_tokenize
#import the stopwords list
from nltk.corpus import stopwords
#dependency parser
from nltk.parse.stanford import StanfordDependencyParser


[nltk_data] Downloading package punkt to /Users/jinyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /Users/jinyang/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jinyang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jinyang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/jinyang/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/jinyang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jinyang/nltk_data...
[nltk_data]   Package vader_lexicon is alr

In [4]:
#Let's go back to our sentence from last week

sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."

#print the content
print(sentence)

For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media.


In [5]:
#We tokenized the sentence last week by splitting on the white space.
#NLTK has a much more sophisticated approach to tokenizing text
#note the difference

#create new variable that applies the word_tokenize function to our sentence.
sentence_tokens = word_tokenize(sentence)

#This new variable contains the tokenized text, and is now a list
print(type(sentence_tokens))
print(sentence_tokens)

<class 'list'>
['For', 'me', 'it', 'has', 'to', 'do', 'with', 'the', 'work', 'that', 'gets', 'done', 'at', 'the', 'crossroads', 'of', 'digital', 'media', 'and', 'traditional', 'humanistic', 'study', '.', 'And', 'that', 'happens', 'in', 'two', 'different', 'ways', '.', 'On', 'the', 'one', 'hand', ',', 'it', "'s", 'bringing', 'the', 'tools', 'and', 'techniques', 'of', 'digital', 'media', 'to', 'bear', 'on', 'traditional', 'humanistic', 'questions', ';', 'on', 'the', 'other', ',', 'it', "'s", 'also', 'bringing', 'humanistic', 'modes', 'of', 'inquiry', 'to', 'bear', 'on', 'digital', 'media', '.']


Notice that each token is either a word *or* a punctuation mark, which is different from splitting on white space. Careful: does the length of the list still indicate the word count?

In [6]:
#The number of tokens is the length of the list, or the number of elements in the list
print(len(sentence_tokens))

71
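
The 71 tokens include punctuation, so the length of the list overstates the word count. One rough way to count only the words, sketched here as an assumption rather than a standard NLTK recipe, is to keep the tokens that contain at least one letter or digit. The short token list is copied from the `word_tokenize` output above; the helper name `is_word` is hypothetical.

```python
# a run of tokens copied from the word_tokenize output above
sample_tokens = ['On', 'the', 'one', 'hand', ',', 'it', "'s", 'bringing',
                 'the', 'tools', 'and', 'techniques']

def is_word(token):
    """True if the token contains at least one letter or digit."""
    return any(char.isalnum() for char in token)

word_tokens = [t for t in sample_tokens if is_word(t)]

print(len(sample_tokens))  # all tokens, punctuation included
print(len(word_tokens))    # tokens that look like words
```

Note that `'s` still counts as a word under this rule. Whether a clitic like that should be its own word is a judgment call that depends on your research question.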


Last week we used dictionaries and tuples to produce word counts. NLTK has its own function to do the same.

In [7]:
#apply the nltk function FreqDist to count the number of times each token occurs.
word_frequency = nltk.FreqDist(sentence_tokens)

print(word_frequency)

#print out the 10 most frequent words using the function most_common
print(word_frequency.most_common(10))

<FreqDist with 44 samples and 71 outcomes>
[('the', 5), ('it', 3), ('to', 3), ('of', 3), ('digital', 3), ('media', 3), ('humanistic', 3), ('.', 3), ('on', 3), ('that', 2)]


Of course, we still need to do pre-processing.

## Pre-Processing: Lower Case, Removing Stop Words and Punctuation


To convert to lower case we use the string method `lower()` and a list comprehension.



In [8]:
sentence_tokens_lc = [word.lower() for word in sentence_tokens]

#see the result
print(sentence_tokens_lc)

['for', 'me', 'it', 'has', 'to', 'do', 'with', 'the', 'work', 'that', 'gets', 'done', 'at', 'the', 'crossroads', 'of', 'digital', 'media', 'and', 'traditional', 'humanistic', 'study', '.', 'and', 'that', 'happens', 'in', 'two', 'different', 'ways', '.', 'on', 'the', 'one', 'hand', ',', 'it', "'s", 'bringing', 'the', 'tools', 'and', 'techniques', 'of', 'digital', 'media', 'to', 'bear', 'on', 'traditional', 'humanistic', 'questions', ';', 'on', 'the', 'other', ',', 'it', "'s", 'also', 'bringing', 'humanistic', 'modes', 'of', 'inquiry', 'to', 'bear', 'on', 'digital', 'media', '.']


Words like "the", "to", and "and" are what text analysts call "stop words." Stop words are the most common words in a language and, while necessary and useful for some analysis purposes, do not tell us much about the *substance* of a text. Another common pre-processing step is to simply remove punctuation and stop words. NLTK contains a built-in stop word list, which we use to remove stop words from our list of tokens.

In [9]:
#take a look at what stop words are included:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
#load the stop words once, as a set, so each membership test is fast
stop_words = set(stopwords.words('english'))
#create a new variable that contains the sentence tokens without the stopwords
sentence_tokens_clean = [word for word in sentence_tokens_lc if word not in stop_words]

#see what words we're left with
print(sentence_tokens_clean)

['work', 'gets', 'done', 'crossroads', 'digital', 'media', 'traditional', 'humanistic', 'study', '.', 'happens', 'two', 'different', 'ways', '.', 'one', 'hand', ',', "'s", 'bringing', 'tools', 'techniques', 'digital', 'media', 'bear', 'traditional', 'humanistic', 'questions', ';', ',', "'s", 'also', 'bringing', 'humanistic', 'modes', 'inquiry', 'bear', 'digital', 'media', '.']


Punctuation also does not help us understand the substance of a text, so we'll remove punctuation in a similar fashion. [Again, think about tasks where we may not want to remove punctuation.] There are many ways to do this. For now, we'll create a list of punctuation tokens, similar to the list of stop words, and remove them from our list of tokens.

In [11]:
#create a list of punctuation symbols
punctuation = list(string.punctuation)

#see what punctuation is included
print(punctuation)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [12]:
#remove punctuation from our tokens
sentence_tokens_clean = [w for w in sentence_tokens_clean if w not in punctuation]

#see what's left
print(sentence_tokens_clean)


['work', 'gets', 'done', 'crossroads', 'digital', 'media', 'traditional', 'humanistic', 'study', 'happens', 'two', 'different', 'ways', 'one', 'hand', "'s", 'bringing', 'tools', 'techniques', 'digital', 'media', 'bear', 'traditional', 'humanistic', 'questions', "'s", 'also', 'bringing', 'humanistic', 'modes', 'inquiry', 'bear', 'digital', 'media']


Now, after our pre-processing steps, let's re-count the most frequent words in the sentence.

In [13]:
word_frequency_clean = nltk.FreqDist(sentence_tokens_clean)

print(word_frequency_clean.most_common(10))

[('digital', 3), ('media', 3), ('humanistic', 3), ('traditional', 2), ("'s", 2), ('bringing', 2), ('bear', 2), ('work', 1), ('gets', 1), ('done', 1)]


Ok, we replicated what we did last week! In slightly fewer steps, but in a more sophisticated way.
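
The three pre-processing steps can also be bundled into one reusable function. The sketch below is an illustration, not part of NLTK: the function name `preprocess` and the tiny stop word set are assumptions made so that the example runs without any downloaded data. In practice you would pass in NLTK's English stop word list instead.

```python
import string

def preprocess(tokens, stop_words):
    """Lower-case tokens, then drop stop words and punctuation tokens."""
    punctuation = set(string.punctuation)
    lowered = [token.lower() for token in tokens]
    return [t for t in lowered if t not in stop_words and t not in punctuation]

# tiny hypothetical stop word set, standing in for NLTK's English list
demo_stop_words = {'and', 'that', 'in'}
# tokens copied from the second sentence of our text
demo_tokens = ['And', 'that', 'happens', 'in', 'two', 'different', 'ways', '.']

print(preprocess(demo_tokens, demo_stop_words))
```

Lower-casing comes first on purpose: the stop word list is lower case, so `'And'` would slip through if the order were reversed.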

NLTK can do so much more.


## Part-of-Speech Tagging

You may have noticed that stop words are typically short function words. Intuitively, if we could identify the part of speech of a word, we would have another way of identifying words of substance. NLTK can do that too!

NLTK has a function that will tag the part of speech of every token in a text. For this, we go back to our original tokenized text, with the stop words and punctuation.

NLTK's default tagger uses the Penn Treebank tag set to label the part of speech of each word. You can find a list of all the part-of-speech tags here:

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [14]:
#use the nltk pos function to tag the tokens
tagged_sentence_tokens = nltk.pos_tag(sentence_tokens)

#view new variable
print(tagged_sentence_tokens)



[('For', 'IN'), ('me', 'PRP'), ('it', 'PRP'), ('has', 'VBZ'), ('to', 'TO'), ('do', 'VB'), ('with', 'IN'), ('the', 'DT'), ('work', 'NN'), ('that', 'WDT'), ('gets', 'VBZ'), ('done', 'VBN'), ('at', 'IN'), ('the', 'DT'), ('crossroads', 'NNS'), ('of', 'IN'), ('digital', 'JJ'), ('media', 'NNS'), ('and', 'CC'), ('traditional', 'JJ'), ('humanistic', 'JJ'), ('study', 'NN'), ('.', '.'), ('And', 'CC'), ('that', 'DT'), ('happens', 'VBZ'), ('in', 'IN'), ('two', 'CD'), ('different', 'JJ'), ('ways', 'NNS'), ('.', '.'), ('On', 'IN'), ('the', 'DT'), ('one', 'CD'), ('hand', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('bringing', 'VBG'), ('the', 'DT'), ('tools', 'NNS'), ('and', 'CC'), ('techniques', 'NNS'), ('of', 'IN'), ('digital', 'JJ'), ('media', 'NNS'), ('to', 'TO'), ('bear', 'VB'), ('on', 'IN'), ('traditional', 'JJ'), ('humanistic', 'JJ'), ('questions', 'NNS'), (';', ':'), ('on', 'IN'), ('the', 'DT'), ('other', 'JJ'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('also', 'RB'), ('bringing', 'VBG'

In [15]:
#check your variable type!

type(tagged_sentence_tokens)

list

In [16]:
#It's, of course, a list of tuples

type(tagged_sentence_tokens[0])

tuple

We can count the part-of-speech tags in much the same way as we counted words, to output the most frequent types of words in our text.

In [17]:
#Check your understanding of the line below, 
# it's a form of list comprehension to translate a list of tuples into a list
print([tag for (word, tag) in tagged_sentence_tokens])


['IN', 'PRP', 'PRP', 'VBZ', 'TO', 'VB', 'IN', 'DT', 'NN', 'WDT', 'VBZ', 'VBN', 'IN', 'DT', 'NNS', 'IN', 'JJ', 'NNS', 'CC', 'JJ', 'JJ', 'NN', '.', 'CC', 'DT', 'VBZ', 'IN', 'CD', 'JJ', 'NNS', '.', 'IN', 'DT', 'CD', 'NN', ',', 'PRP', 'VBZ', 'VBG', 'DT', 'NNS', 'CC', 'NNS', 'IN', 'JJ', 'NNS', 'TO', 'VB', 'IN', 'JJ', 'JJ', 'NNS', ':', 'IN', 'DT', 'JJ', ',', 'PRP', 'VBZ', 'RB', 'VBG', 'JJ', 'NNS', 'IN', 'NN', 'TO', 'VB', 'IN', 'JJ', 'NNS', '.']


In [18]:
#we can do a frequency distribution on a list of strings
tagged_frequency = nltk.FreqDist([tag for (word, tag) in tagged_sentence_tokens])

tagged_frequency.most_common(10)

[('IN', 11),
 ('JJ', 10),
 ('NNS', 9),
 ('DT', 6),
 ('VBZ', 5),
 ('PRP', 4),
 ('NN', 4),
 ('TO', 3),
 ('VB', 3),
 ('CC', 3)]

This sentence contains a lot of adjectives, so let's first look at the most frequent adjectives.

In [19]:
adjectives = [word for word,pos in tagged_sentence_tokens if pos in ('JJ', 'JJR', 'JJS')]

#print all of the adjectives
print(adjectives)

['digital', 'traditional', 'humanistic', 'different', 'digital', 'traditional', 'humanistic', 'other', 'humanistic', 'digital']


In [20]:
#calculate the frequency of the adjectives
freq_adjectives=nltk.FreqDist(adjectives)

#print the most frequent adjectives
print(freq_adjectives.most_common(5))

[('digital', 3), ('humanistic', 3), ('traditional', 2), ('different', 1), ('other', 1)]


Let's do the same for nouns.

In [21]:
nouns = [word for (word,pos) in tagged_sentence_tokens if pos in ('NN', 'NNS')]

#print all of the nouns
print(nouns)

['work', 'crossroads', 'media', 'study', 'ways', 'hand', 'tools', 'techniques', 'media', 'questions', 'modes', 'inquiry', 'media']


In [22]:
#calculate the frequency of the nouns
freq_nouns=nltk.FreqDist(nouns)

#print the most frequent nouns
print(freq_nouns.most_common(10))

[('media', 3), ('work', 1), ('crossroads', 1), ('study', 1), ('ways', 1), ('hand', 1), ('tools', 1), ('techniques', 1), ('questions', 1), ('modes', 1)]


And now verbs.

In [23]:
verbs = [word for word,pos in tagged_sentence_tokens if pos in ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ')]

#print all of the verbs
print(verbs)



['has', 'do', 'gets', 'done', 'happens', "'s", 'bringing', 'bear', "'s", 'bringing', 'bear']


In [24]:
#calculate the frequency of the verbs
freq_verbs=nltk.FreqDist(verbs)

#print the most frequent verbs
print(freq_verbs.most_common(10))

[("'s", 2), ('bringing', 2), ('bear', 2), ('has', 1), ('do', 1), ('gets', 1), ('done', 1), ('happens', 1)]


If we bring all of this together we get a pretty good summary of the sentence:

In [25]:
print(freq_adjectives.most_common(3))
print(freq_nouns.most_common(3))
print(freq_verbs.most_common(3))

[('digital', 3), ('humanistic', 3), ('traditional', 2)]
[('media', 3), ('work', 1), ('crossroads', 1)]
[("'s", 2), ('bringing', 2), ('bear', 2)]
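
Listing every verb tag by hand (`VB`, `VBD`, `VBG`, and so on) gets tedious. Because the Penn Treebank tags for nouns, verbs, and adjectives start with `NN`, `VB`, and `JJ` respectively, one possible shortcut is to group tokens by tag prefix. The helper name `words_by_prefix` is hypothetical, and the tagged tokens are copied from the tagger output above.

```python
from collections import Counter

# a slice of the (word, tag) pairs printed earlier
tagged = [('it', 'PRP'), ("'s", 'VBZ'), ('bringing', 'VBG'), ('the', 'DT'),
          ('tools', 'NNS'), ('and', 'CC'), ('techniques', 'NNS'), ('of', 'IN'),
          ('digital', 'JJ'), ('media', 'NNS'), ('to', 'TO'), ('bear', 'VB')]

def words_by_prefix(tagged_tokens, prefix):
    """Count the words whose tag starts with the given prefix."""
    return Counter(word for word, tag in tagged_tokens if tag.startswith(prefix))

print(words_by_prefix(tagged, 'VB').most_common())
print(words_by_prefix(tagged, 'NN').most_common())
print(words_by_prefix(tagged, 'JJ').most_common())
```

One caveat: the prefix `NN` also matches the proper-noun tags `NNP` and `NNPS`, which the noun list above deliberately left out, so check whether that is what you want.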


## Illustration: Compare Melville to Austen


In [26]:
#Define a function to output the most frequent words based on a part of speech
import string

def read_file(filename):
        
    with open(filename, 'r', encoding='utf-8') as myfile:
        mytext = myfile.read()
    return(mytext)

def freq_words_pos(filename, pos_tag_list):
    """
    This is called a docstring. It defines and explains what the function is doing.
    Any function more than a few lines should include a docstring.
    
    This function takes a filename containing text and a list of part-of-speech tags,
    and counts the words tagged with any of those parts of speech.
    
    Input: filename and path, list of Penn Treebank part-of-speech tags
    Output: an nltk.FreqDist of the matching words; call .most_common()
            on it for a list of (word, count) tuples in descending order
    """
    
    mytext = read_file(filename) #calling a function inside a function!
    
    #split the text into tokens, then tag each token with its part of speech
    tokens = word_tokenize(mytext)
    tagged = nltk.pos_tag(tokens)
    
    #keep only the words whose tag is in the requested list
    freq_words = [word for word,pos in tagged if pos in pos_tag_list]
    
    return nltk.FreqDist(freq_words)

In [27]:
freq_words_pos('../data/Melville_MobyDick.txt', ['NN', 'NNS']).most_common(20)

[('whale', 714),
 ('man', 472),
 ('ship', 440),
 ('sea', 346),
 ('time', 314),
 ('boat', 280),
 ('head', 262),
 ('way', 256),
 ('whales', 231),
 ('men', 228),
 ('hand', 196),
 ('thing', 184),
 ('side', 176),
 ('ye', 169),
 ('world', 167),
 ('water', 162),
 ('day', 157),
 ('deck', 157),
 ('eyes', 154),
 ('sort', 151)]

In [28]:
freq_words_pos('../data/Austen_SenseAndSensibility.txt', ['NN', 'NNS']).most_common(20)

[('sister', 266),
 ('mother', 248),
 ('time', 235),
 ('thing', 182),
 ('nothing', 163),
 ('house', 146),
 ('day', 143),
 ('heart', 126),
 ('man', 118),
 ('moment', 97),
 ('room', 97),
 ('mind', 95),
 ('kind', 94),
 ('world', 90),
 ('morning', 85),
 ('town', 85),
 ('family', 81),
 ('affection', 79),
 ('brother', 78),
 ('place', 76)]
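
One way to read the two lists against each other is to ask which nouns they share and which are distinctive to each novel. This sketch uses plain Python sets on the top-20 nouns copied from the two outputs above; the variable names are made up for illustration.

```python
# top-20 nouns copied from the outputs above
melville_top = ['whale', 'man', 'ship', 'sea', 'time', 'boat', 'head', 'way',
                'whales', 'men', 'hand', 'thing', 'side', 'ye', 'world',
                'water', 'day', 'deck', 'eyes', 'sort']
austen_top = ['sister', 'mother', 'time', 'thing', 'nothing', 'house', 'day',
              'heart', 'man', 'moment', 'room', 'mind', 'kind', 'world',
              'morning', 'town', 'family', 'affection', 'brother', 'place']

# nouns that appear in both top-20 lists
shared = set(melville_top) & set(austen_top)
# nouns unique to each list, kept in their original frequency order
only_melville = [w for w in melville_top if w not in shared]
only_austen = [w for w in austen_top if w not in shared]

print(sorted(shared))
print(only_melville)
print(only_austen)
```

The shared nouns are general ones, while the distinctive lists separate the vocabulary of the sea from the vocabulary of family and home.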

## Named Entity Recognition

We can also tag named entities: people, places, organizations, etc. The syntax is similar; we just use the `ne_chunk` function.

In [29]:
#tokenize our text
ner_tokens = word_tokenize('Google moved their headquarters from San Jose to Seattle, per spokesperson Sudhir')

#tag it with part-of-speech
ner_tokens_tagged = nltk.pos_tag(ner_tokens)

#add a named entity tag
namedEnt = nltk.ne_chunk(ner_tokens_tagged)
print(namedEnt)

(S
  (PERSON Google/NNP)
  moved/VBD
  their/PRP$
  headquarters/NNS
  from/IN
  (GPE San/NNP)
  Jose/NNP
  to/TO
  (GPE Seattle/NNP)
  ,/,
  per/IN
  spokesperson/NN
  (PERSON Sudhir/NNP))
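
The result of `ne_chunk` is an NLTK `Tree`: named entities are subtrees with a label such as `PERSON` or `GPE`, and everything else is a plain `(word, tag)` tuple. The sketch below rebuilds the tree printed above by hand, so it needs no model download, and pulls out the labeled chunks. The helper name `extract_entities` is an assumption for illustration.

```python
from nltk.tree import Tree

# hand-built copy of the tree printed above
tree = Tree('S', [
    Tree('PERSON', [('Google', 'NNP')]), ('moved', 'VBD'), ('their', 'PRP$'),
    ('headquarters', 'NNS'), ('from', 'IN'),
    Tree('GPE', [('San', 'NNP')]), ('Jose', 'NNP'), ('to', 'TO'),
    Tree('GPE', [('Seattle', 'NNP')]), (',', ','), ('per', 'IN'),
    ('spokesperson', 'NN'), Tree('PERSON', [('Sudhir', 'NNP')]),
])

def extract_entities(chunked):
    """Return (label, text) pairs for every named-entity subtree."""
    entities = []
    for child in chunked:
        # plain tokens are tuples; named entities are Tree objects
        if isinstance(child, Tree):
            text = ' '.join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities

print(extract_entities(tree))
```

The extracted pairs make the tagger's mistakes easy to spot: here Google is labeled a `PERSON` and "San Jose" is split in two, so named entity output is worth checking by eye.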


## Dependency Parsing

Dependency parsing finds grammatical relationships between words and the type of relationships. In the social sciences, this is typically focused on subject-verb-object relationships (who does what to whom).

You can train your own model, but we will use [Stanford's CoreNLP dependency parser](https://nlp.stanford.edu/software/stanford-dependencies.html). It's fast relative to other dependency parsers but, as you will find if you try it on a longer text, it is still quite slow.

Because a dependency parser parses the relationships among words, it can produce a lot of output. We'll try it on a simple sentence.

Stanford dependencies are triplets: the name of the relation, the governor, and the dependent. You can find a pdf that includes the definition of the Stanford typed relationships [here](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf).

In [33]:
#point to the Stanford model and software

path_to_jar = '../stanford_parser/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar'
path_to_models_jar = '../stanford_parser/stanford-corenlp-4.2.2-models-english.jar'

#use the Python NLTK wrapper to implement it
dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)

#parse a simple sentence
result = dependency_parser.raw_parse('I shot an elephant in my sleep')

#access the output (the parser returns an iterator of parses)
dep = next(result)

#print the relationships
list(dep.triples())

Please use nltk.parse.corenlp.CoreNLPDependencyParser instead.
  dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)


[(('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
 (('shot', 'VBD'), 'obj', ('elephant', 'NN')),
 (('elephant', 'NN'), 'det', ('an', 'DT')),
 (('shot', 'VBD'), 'obl', ('sleep', 'NN')),
 (('sleep', 'NN'), 'case', ('in', 'IN')),
 (('sleep', 'NN'), 'nmod:poss', ('my', 'PRP$'))]

In [35]:
#check your variable types!
print(type(dep.triples()))
type(list(dep.triples()))

<class 'generator'>


list

In [36]:
#look at the first element
list(dep.triples())[0]

(('shot', 'VBD'), 'nsubj', ('I', 'PRP'))

In [37]:
#Check your variable type!
type(list(dep.triples())[0])

tuple

In [38]:
#ah, the tuple. Let's find all the 'nsubj' dependencies (there's only one, but imagine with me)

nsubj = []

for gov, rel, depend in list(dep.triples()):
    print("Governor:")
    print(gov)
    print("Relationship")
    print(rel)
    print("Dependent")
    print(depend)
    if rel == 'nsubj':
        nsubj.append((gov, rel, depend))

print()
print("All nsubj relationships:")
nsubj

Governor:
('shot', 'VBD')
Relationship
nsubj
Dependent
('I', 'PRP')
Governor:
('shot', 'VBD')
Relationship
obj
Dependent
('elephant', 'NN')
Governor:
('elephant', 'NN')
Relationship
det
Dependent
('an', 'DT')
Governor:
('shot', 'VBD')
Relationship
obl
Dependent
('sleep', 'NN')
Governor:
('sleep', 'NN')
Relationship
case
Dependent
('in', 'IN')
Governor:
('sleep', 'NN')
Relationship
nmod:poss
Dependent
('my', 'PRP$')

All nsubj relationships:


[(('shot', 'VBD'), 'nsubj', ('I', 'PRP'))]
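
Since the interest is often in who does what to whom, the `nsubj` and `obj` triples can be paired up through the governor they share. This is a rough sketch under simple assumptions (at most one subject and one object per verb, active voice); the function name `subject_verb_object` is hypothetical, and the triples are copied from the parser output above.

```python
# triples copied from the parser output above
triples = [
    (('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
    (('shot', 'VBD'), 'obj', ('elephant', 'NN')),
    (('elephant', 'NN'), 'det', ('an', 'DT')),
    (('shot', 'VBD'), 'obl', ('sleep', 'NN')),
    (('sleep', 'NN'), 'case', ('in', 'IN')),
    (('sleep', 'NN'), 'nmod:poss', ('my', 'PRP$')),
]

def subject_verb_object(dep_triples):
    """Pair each verb's nsubj with its obj, matching on the governor."""
    subjects, objects = {}, {}
    for gov, rel, depend in dep_triples:
        if rel == 'nsubj':
            subjects[gov] = depend[0]
        elif rel == 'obj':
            objects[gov] = depend[0]
    # keep only the governors that have both a subject and an object
    return [(subjects[gov], gov[0], objects[gov])
            for gov in subjects if gov in objects]

print(subject_verb_object(triples))
```

Because the governors are matched as `(word, tag)` pairs, two occurrences of the same verb in one sentence would collide. That is fine for a single short sentence but would need more care on longer text.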

## Sentiment Analysis!

NLTK has a sentiment analysis tool called VADER. VADER is not great (no sentiment tool is), but it's OK, and it works well on more contemporary text.

## Sentiment analysis using Vader

You can find more information on Vader and what it can do [here](https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664).

VADER’s `SentimentIntensityAnalyzer` has a `polarity_scores()` method that takes in a string and returns a dictionary of scores in each of four categories:

* negative
* neutral
* positive
* compound (a single overall score, normalized to lie between -1 and +1)


In [39]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#create an analyzer object (an instance of the SentimentIntensityAnalyzer class)
sid = SentimentIntensityAnalyzer()

#Let's run it on our original sentence variable
sid.polarity_scores(sentence)

{'neg': 0.0, 'neu': 0.951, 'pos': 0.049, 'compound': 0.4939}

In [40]:
#It's a dictionary, so we can pull out the values separately

sid.polarity_scores(sentence)['pos']

0.049
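
The `compound` value is usually the most convenient single number. A common convention, suggested in VADER's own documentation, is to treat compound scores of 0.05 and above as positive and scores of -0.05 and below as negative. The helper below is a sketch of that rule; the function name `label_sentiment` and its default threshold are assumptions you may want to tune.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# compound score of our sentence, copied from the output above
print(label_sentiment(0.4939))
# a score of exactly zero falls in the neutral band
print(label_sentiment(0.0))
```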

# Exercises!

Whew, that was a lot, but take a few moments to practice. Then practice more over the week!

1. Print the most frequent adjective in Moby Dick and Sense and Sensibility. Does anything intrigue you?
    * Hint, remind yourself of the tags [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
2. Print the most frequent verbs in Moby Dick and Sense and Sensibility. Does anything intrigue you?
3. Who is more negative? Melville or Austen? Who is more positive? Before you calculate, think through a hypothesis (if you know these authors).
    * Hint: Vader works better on short texts. You probably want to calculate the sentiment for each sentence separately, and then take the average. NLTK has a sentence tokenizer! See if you can use the documentation to get it to work.
4. Parse a few sentences from either Melville or Austen:
    * Identify all nsubj or obj relationships. Anything interesting?
    * Extract named entities. Anything interesting?
        * Hint: don't try the full novel, it'll take too long and be far too much output.

1. Print the most frequent adjective in Moby Dick and Sense and Sensibility. Does anything intrigue you?  
Hint, remind yourself of the tags [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)


In [41]:
freq_words_pos('../data/Melville_MobyDick.txt', ['JJ', 'JJR','JJS']).most_common(20)

[('old', 427),
 ('other', 409),
 ('great', 290),
 ('last', 273),
 ('such', 256),
 ('more', 245),
 ('little', 238),
 ('same', 209),
 ('own', 201),
 ('long', 187),
 ('first', 174),
 ('good', 172),
 ('many', 160),
 ('white', 158),
 ('much', 134),
 ('small', 121),
 ('whole', 115),
 ('full', 110),
 ('poor', 99),
 ('thy', 93)]

In [42]:
freq_words_pos('../data/Austen_SenseAndSensibility.txt', ['JJ','JJR', 'JJS']).most_common(20)

[('own', 267),
 ('such', 259),
 ('more', 184),
 ('other', 182),
 ('much', 170),
 ('little', 148),
 ('great', 147),
 ('good', 131),
 ('first', 121),
 ('sure', 118),
 ('young', 103),
 ('last', 100),
 ('same', 100),
 ('many', 97),
 ('happy', 94),
 ('present', 77),
 ('few', 77),
 ('dear', 73),
 ('least', 64),
 ('better', 57)]

In [43]:
freq_words_pos('../data/Melville_MobyDick.txt', ['VB']).most_common(20)

[('be', 1024),
 ('have', 353),
 ('see', 164),
 ('go', 125),
 ('do', 110),
 ('let', 100),
 ('say', 87),
 ('make', 86),
 ('take', 83),
 ('get', 79),
 ('tell', 79),
 ('ye', 69),
 ('come', 67),
 ('know', 67),
 ('think', 57),
 ('keep', 57),
 ('give', 54),
 ('till', 52),
 ('seem', 50),
 ('look', 43)]

In [44]:
freq_words_pos('../data/Austen_SenseAndSensibility.txt', ['VB']).most_common(20)

[('be', 1289),
 ('have', 451),
 ('do', 152),
 ('see', 140),
 ('make', 132),
 ('say', 128),
 ('give', 122),
 ('think', 122),
 ('know', 90),
 ('go', 89),
 ('tell', 68),
 ('hear', 59),
 ('come', 55),
 ('speak', 54),
 ('find', 52),
 ('feel', 51),
 ('leave', 42),
 ('take', 41),
 ('believe', 38),
 ('till', 36)]

In [47]:
from nltk.tokenize import sent_tokenize
austen=read_file('../data/Austen_SenseAndSensibility.txt')
melville=read_file('../data/Melville_MobyDick.txt')
austen_sent = sent_tokenize(austen)
melville_sent= sent_tokenize(melville)

In [50]:
austen_pos=[]
for sent in austen_sent:
    austen_pos.append(sid.polarity_scores(sent)['pos'])

In [None]:
melville_pos = [sid.polarity_scores(sent)['pos'] for sent in melville_sent]

austen_neg = [sid.polarity_scores(sent)['neg'] for sent in austen_sent]
melville_neg = [sid.polarity_scores(sent)['neg'] for sent in melville_sent]

In [None]:
len(austen_pos)

In [None]:
austen_pos[:10]
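
To compare the two authors, the per-sentence scores still need to be summarized, for example by averaging them. Below is a minimal sketch that uses short, made-up score lists in place of the real per-sentence lists computed above (which take a while to build); the helper name `average_score` is hypothetical.

```python
from statistics import mean

def average_score(scores):
    """Average a list of per-sentence scores; an empty list gives 0.0."""
    return mean(scores) if scores else 0.0

# made-up per-sentence 'pos' scores, for illustration only
author_a_pos = [0.0, 0.2, 0.1, 0.3]
author_b_pos = [0.1, 0.0, 0.0, 0.1]

print(round(average_score(author_a_pos), 3))
print(round(average_score(author_b_pos), 3))
```

The same function works on the negative scores, so all four averages can be compared side by side.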