# NLP for Feature Extraction and Machine Learning

Today's workshop will address various concepts in the Natural Language Processing pipeline aimed at feature and information extraction for subsequent use in machine learning tasks. A fundamental understanding of Python is necessary. We will cover:

1. Pre-processing
2. Preparing and declaring your own corpus
3. POS-Tagging
4. Dependency Parsing
5. NER
6. Sentiment Analysis
7. Classification


You will need:

* NLTK (`$ pip install nltk`)
* the parser wrapper requires the [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml#Download) (in Java)
* the NER wrapper requires the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml#Download) (in Java)
* scikit-learn (`$ pip install scikit-learn`)
* keras (`$ pip install keras`)
* gensim (`$ pip install gensim`)

# 1) Pre-processing

This won't be covered much today, but regex and basic Python string methods are the most important tools in preprocessing tasks. NLTK does, however, offer an array of tokenizers and stemmers for various languages.

### The term-document model

This is also sometimes referred to as "bag-of-words" by those who don't think very highly of it. The term-document model looks at language as individual communicative efforts (documents) that contain one or more tokens. The kind and number of tokens in a document tell you something about what the document is trying to communicate, and the order of those tokens is ignored.

This is the primary method still used for most text analysis, although models utilizing word embeddings are beginning to take hold. We will discuss word embeddings briefly at the end.
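
To make the idea concrete, here is a minimal sketch of a term-document matrix built with nothing but the standard library. The two toy documents and the variable names are assumptions for illustration. Notice that the two sentences mean different things but end up with identical rows, because word order is thrown away.

```python
from collections import Counter

# Two toy "documents"; in practice these would be tokenized properly.
docs = {
    "doc_a": "the knight rides the horse",
    "doc_b": "the horse rides the knight",
}

# Each document becomes a mapping of token -> count; word order is discarded.
bags = {name: Counter(text.split()) for name, text in docs.items()}

# The shared vocabulary defines the columns of the term-document matrix.
vocab = sorted(set().union(*bags.values()))
matrix = {name: [bag[term] for term in vocab] for name, bag in bags.items()}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```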

To start with, let's import NLTK and load a document from their toy corpus.

### Python Regex Basics

In [1]:
import nltk
nltk.download('webtext')
document = nltk.corpus.webtext.open('grail.txt').read()

[nltk_data] Downloading package webtext to /Users/chench/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


Let's see what's in this document

In [2]:
print(document[:1000])

SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop clop clop] 
SOLDIER #1: Halt!  Who goes there?
ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!
SOLDIER #1: Pull the other one!
ARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.
SOLDIER #1: What?  Ridden on a horse?
ARTHUR: Yes!
SOLDIER #1: You're using coconuts!
ARTHUR: What?
SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.
ARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--
SOLDIER #1: Where'd you get the coconuts?
ARTHUR: We found them.
SOLDIER #1: Found them?  In Mercea?  The coconut's tropical!
ARTHUR: What do you mean?
SOLDIER #1: Well, this is a temperate zone.
AR

In [3]:
import re
snippet = document.split("\n")[8]
print(snippet)

SOLDIER #1: You're using coconuts!


In [4]:
re.search(r'coconuts', snippet)

<_sre.SRE_Match object; span=(25, 33), match='coconuts'>

Just as with `str.find`, we can search for plain text. But `re` also lets us search for patterns of characters, such as only alphabetic characters.

In [5]:
re.search(r'[a-z]', snippet)

<_sre.SRE_Match object; span=(13, 14), match='o'>

In this case, we've told `re` to search for the first character that is a lowercase letter between `a` and `z`. We could get the letter at the end of an exclamation by adding an exclamation mark to the end of the pattern.

In [6]:
re.search(r'[a-z]!', snippet)

<_sre.SRE_Match object; span=(32, 34), match='s!'>

There are two things happening here:

1. `[` and `]` do not mean 'bracket'; they are special characters which mean 'any one character of this class'
2. each pattern has matched only one letter

`re` is flexible about how many repetitions you ask for: you can match zero, one, a fixed number, a range, or any number of repetitions of a character, group, or character class.

character | meaning
----------|--------
`{x}`     | exactly x repetitions
`{x,y}`   | between x and y repetitions
`?`       | 0 or 1 repetition
`*`       | 0 or many repetitions
`+`       | 1 or many repetitions
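
As a quick sketch of the quantifiers in the table, the cell below applies each one to a line modelled on the script. The sample string and variable names are assumptions for illustration.

```python
import re

line = "ARTHUR: Whoa there!  [clop clop clop]"

four_letters = re.search(r'[a-z]{4}', line).group()  # exactly four lowercase letters in a row
one_or_two_o = re.search(r'o{1,2}', line).group()    # one or two consecutive 'o' characters
optional_s = re.search(r'clops?', line).group()      # 'clop', optionally followed by 's'
letter_runs = re.findall(r'[a-z]+', line)            # every run of one or more lowercase letters
speaker = re.findall(r'[A-Z]+:\s*', line)            # capitals, a colon, then zero or more whitespace

print(four_letters, one_or_two_o, optional_s)
print(letter_runs)
print(speaker)
```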

Part of the power of regular expressions is their special characters. Common ones that you'll see are:

character | meaning
----------|--------
`.`       | match any character except a newline
`^`       | match the start of the string (or of each line with `re.MULTILINE`)
`$`       | match the end of the string (or of each line with `re.MULTILINE`)
`\s`      | match any whitespace character, including newlines
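
The anchors behave differently depending on the `re.MULTILINE` flag, which matters for a script with one speaker per line. Here is a small sketch on a made-up three-line excerpt (the `script` string is an assumption, not the full document).

```python
import re

script = "ARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?"

# Without flags, ^ and $ anchor to the start and end of the whole string.
first_only = re.findall(r'^[A-Z]+', script)

# With re.MULTILINE they anchor to the start and end of every line.
speakers = re.findall(r'^[A-Z]+', script, flags=re.MULTILINE)
last_chars = re.findall(r'.$', script, flags=re.MULTILINE)

print(first_only)
print(speakers)
print(last_chars)
```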

What if we wanted to grab all of Arthur's speech without grabbing the name `ARTHUR` itself?

If we wanted to do this using base string manipulation, we would need to do something like:

```
split the document into lines
create a new list of just lines that start with ARTHUR
create a newer list with ARTHUR removed from the front of each element
```
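
The three steps above can be sketched with plain string methods as follows. The `sample` text is a shortened stand-in for `document`, and the variable names are assumptions for illustration.

```python
sample = (
    "ARTHUR: Whoa there!\n"
    "SOLDIER #1: Halt!  Who goes there?\n"
    "ARTHUR: It is I, Arthur.\n"
)
prefix = "ARTHUR: "

# 1. split the document into lines
lines = sample.split("\n")
# 2. keep only the lines that start with ARTHUR
arthur_lines = [line for line in lines if line.startswith(prefix)]
# 3. strip the speaker label from the front of each line
arthur_speech = [line[len(prefix):] for line in arthur_lines]

print(arthur_speech)
```

Note that `startswith` only catches lines that begin with exactly `ARTHUR: `, so a line labelled `KING ARTHUR: ` would be skipped by this version.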

Regex gives us a way of doing this in one line, by using something called groups. Groups are pieces of a pattern that can be ignored, negated, or given names for later retrieval.

character | meaning
----------|--------
`(x)`     | match x
`(?:x)`   | match x but don't capture it
`(?P<x>y)`| match y and give the group the name x
`(?=x)`   | match only if string is followed by x
`(?!x)`   | match only if string is not followed by x
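
The lookahead and named-group forms are easier to understand in action. This is a minimal sketch on a made-up line (the `line` string and variable names are assumptions).

```python
import re

line = "ARTHUR: What?  I am KING of the Britons!"

# (?=x): words immediately followed by '?'; the '?' is checked but not consumed.
before_question = re.findall(r'\w+(?=\?)', line)

# (?!x): all-caps words of two or more letters that are not followed by ':'.
shouted = re.findall(r'\b[A-Z]{2,}\b(?!:)', line)

# (?P<name>...): pull out pieces by name instead of by position.
named = re.search(r'(?P<name>[A-Z]+): (?P<line>.+)', line).groupdict()

print(before_question)
print(shouted)
print(named)
```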

In [7]:
re.findall(r'(?:ARTHUR: )(.+)', document)[0:10]

['Whoa there!  [clop clop clop] ',
 'It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!',
 'I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.',
 'Yes!',
 'What?',
 'So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--',
 'We found them.',
 'What do you mean?',
 'The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?',
 'Not at all.  They could be carried.']

Because we are using `findall`, the regex engine is capturing and returning the normal groups, but not the non-capturing group. For complicated, multi-piece regular expressions, you may need to pull groups out separately. You can do this with names.

In [9]:
p = re.compile(r'(?P<name>[A-Z ]+)(?::)(?P<line>.+)')
match = re.search(p, document)
print(match)

<_sre.SRE_Match object; span=(34, 77), match='KING ARTHUR: Whoa there!  [clop clop clop] '>


In [11]:
print(match.group('name'))
print(match.group('line'))

KING ARTHUR
 Whoa there!  [clop clop clop] 


Use the regex pattern `p` above to print the `set` of unique characters in *Monty Python*:

In [12]:
matches = re.findall(p, document)
chars = set([x[0] for x in matches])
print(chars, len(chars))

{'PATSY', 'CARTOON MONKS', 'RIGHT HEAD', ' GREEN KNIGHT', 'GALAHAD', 'ROBIN', 'WOMAN', 'OTHER FRENCH GUARD', 'SIR ROBIN', 'ARMY OF KNIGHTS', 'LOVELY', 'MASTER', ' CRAPPER', ' PARTY', 'GUESTS', 'GUEST', 'BRIDGEKEEPER', 'DINGO', 'CAMERAMAN', 'ALL HEADS', 'WINSTON', 'PRISONER', ' BEDEVERE', 'DEAD PERSON', 'VOICE', 'MAYNARD', 'BROTHER MAYNARD', 'ARTHUR', 'HEAD KNIGHT', 'S WIFE', 'NARRATOR', 'SUN', 'GIRLS', 'TIM THE ENCHANTER', 'PRINCE HERBERT', 'HISTORIAN', 'ROGER THE SHRUBBER', 'KNIGHT', 'PIGLET', 'AMAZING', 'SECOND BROTHER', 'SIR GALAHAD', 'HERBERT', 'ANIMATOR', 'RANDOM', 'BLACK KNIGHT', 'CARTOON CHARACTER', 'KNIGHTS', 'MAN', 'CRONE', 'FRENCH GUARD', 'BORS', 'MINSTREL', 'CROWD', 'CARTOON CHARACTERS', 'WITCH', 'STUNNER', 'HEAD KNIGHT OF NI', 'DENNIS', 'BEDEVERE', 'MIDDLE HEAD', 'LAUNCELOT', 'OLD CRONE', 'CUSTOMER', 'LEFT HEAD', 'FRENCH GUARDS', 'OLD MAN', 'FATHER', 'TIM', 'KNIGHTS OF NI', 'SIR BEDEVERE', 'CONCORDE', 'INSPECTOR', 'DIRECTOR', 'S FATHER', 'ROGER', 'MONKS', 'GREEN KNIGHT', 'G

You should have 84 different characters.

Now use the `set` you made above to gather all dialogue into a character `dictionary`, with each key being a character name and each value being a list of that character's lines:

In [13]:
char_dict = {}
for n in chars:
    char_dict[n] = re.findall(re.compile(r'(?:' + n + ': )(.+)'), document)

In [15]:
char_dict["PATSY"]

["It's only a model."]
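
An alternative is to build the dictionary in a single pass over the text instead of running one search per character. The sketch below assumes that a speaker label sits at the start of a line and consists of capitals, spaces, `#`, and digits. Both that pattern and the `sample` text are assumptions for illustration, not properties guaranteed for the whole script.

```python
import re
from collections import defaultdict

sample = (
    "KING ARTHUR: Whoa there!\n"
    "SOLDIER #1: Halt!  Who goes there?\n"
    "ARTHUR: It is I, Arthur.\n"
    "ARTHUR: Yes!\n"
)

# Speaker label at the start of a line, then ': ', then the rest of the line.
speaker_line = re.compile(r'^(?P<name>[A-Z][A-Z #0-9]*): (?P<line>.+)$', re.MULTILINE)

dialogue = defaultdict(list)
for m in speaker_line.finditer(sample):
    dialogue[m.group('name')].append(m.group('line'))

print(dict(dialogue))
```

Unlike the per-character search, this version keeps `KING ARTHUR` and `ARTHUR` as separate keys, which may or may not be what you want.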

### Tokenizing

In [16]:
text = '''Hello, my name is Chris. 
I'll be talking about the python library NLTK today. 
NLTK is a popular tool to conduct text processing tasks in NLP.'''

In [17]:
from nltk.tokenize import word_tokenize

print("Notice the difference!")
print()
print(word_tokenize(text))

print()
print("vs.")
print()

print(text.split())

Notice the difference!

['Hello', ',', 'my', 'name', 'is', 'Chris', '.', 'I', "'ll", 'be', 'talking', 'about', 'the', 'python', 'library', 'NLTK', 'today', '.', 'NLTK', 'is', 'a', 'popular', 'tool', 'to', 'conduct', 'text', 'processing', 'tasks', 'in', 'NLP', '.']

vs.

['Hello,', 'my', 'name', 'is', 'Chris.', "I'll", 'be', 'talking', 'about', 'the', 'python', 'library', 'NLTK', 'today.', 'NLTK', 'is', 'a', 'popular', 'tool', 'to', 'conduct', 'text', 'processing', 'tasks', 'in', 'NLP.']


You can also tokenize sentences.

In [18]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

['Hello, my name is Chris.', "I'll be talking about the python library NLTK today.", 'NLTK is a popular tool to conduct text processing tasks in NLP.']


In [19]:
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
print(tokenized_text)

[['Hello', ',', 'my', 'name', 'is', 'Chris', '.'], ['I', "'ll", 'be', 'talking', 'about', 'the', 'python', 'library', 'NLTK', 'today', '.'], ['NLTK', 'is', 'a', 'popular', 'tool', 'to', 'conduct', 'text', 'processing', 'tasks', 'in', 'NLP', '.']]


A list of sentences, each of which is a list of tokenized words, is the input format that most analysis libraries expect.
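
To see what goes into that nested format, here is a rough regex-only approximation using the standard library. It is a sketch, not NLTK's algorithm: the sentence split simply breaks after `.`, `!`, or `?`, and the word pattern treats every punctuation mark as its own token, so contractions come out differently from `word_tokenize`.

```python
import re

text = "Hello, my name is Chris. I'll be talking about NLTK today."

# Rough sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)

# Rough word split: runs of word characters, or single punctuation marks.
tokenized = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)
print(tokenized)
```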

### Stemming/Lemmatizing

In [20]:
from nltk import SnowballStemmer

snowball = SnowballStemmer('english')

print(snowball.stem('running'))
print(snowball.stem('eats'))
print(snowball.stem('embarassed'))

run
eat
embarass


But watch out for errors:

In [21]:
print(snowball.stem('cylinder'))
print(snowball.stem('cylindrical'))

cylind
cylindr


Or collision:

In [22]:
print(snowball.stem('vacation'))
print(snowball.stem('vacate'))

vacat
vacat


This is why lemmatizing is usually preferable when computing power and time allow:

In [23]:
from nltk import WordNetLemmatizer
wordnet = WordNetLemmatizer()

print(wordnet.lemmatize('vacation'))
print(wordnet.lemmatize('vacate'))

vacation
vacate


# 2) Declaring a corpus in NLTK

While you can use NLTK on strings and lists of sentences, it's better to formally declare your corpus, as this will take care of the tokenizing steps above for you and provide methods to access the results. For our purposes today, we'll use a corpus of [book summaries](http://www.cs.cmu.edu/~dbamman/booksummaries.html). I've changed them into a folder of .txt files for demonstration. The file below will convert the .tsv file.

In [24]:
! ls texts

'Art'.txt
'Tis: A Memoir.txt
003½: The Adventures of James Bond Junior.txt
1 Esdras.txt
1632.txt
1876.txt
1975 in Prophecy!.txt
1982, Janine.txt
2001: A Space Odyssey.txt
2010: Odyssey Two.txt
2061: Odyssey Three.txt
253.txt
3 Maccabees.txt
3001: The Final Odyssey.txt
31 Songs.txt
334.txt
4 Maccabees.txt
4.50 From Paddington.txt
69.txt
A Bend in the River.txt
A Bright Shining Lie: John Paul Vann and America in Vietnam.txt
A Burnt-Out Case.txt
A Calculus of Angels.txt
A Canticle for Leibowitz.txt
A Caribbean Mystery.txt
A Case of Conscience.txt
A Child Called "It".txt
A Christmas Carol.txt
A Clash of Kings.txt
A Clergyman's Daughter.txt
A Clockwork Orange.txt
A Connecticut Yankee in King Arthur's Court.txt
A Crown of Swords.txt
A Dance with Dragons.txt
A Darkness at Sethanon.txt
A Death in the Family.txt
A Deepness in the Sky.txt
A Delicate Balance.txt
A Descent into the Maelstrom.txt
A Dog of Flanders.txt
A Door Into Ocean.txt
A Fall of Moondust

In [25]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "texts/"  # relative path to texts.
my_texts = PlaintextCorpusReader(corpus_root, r'.*\.txt')  # only files ending in .txt

We now have a text corpus, on which we can run all the basic preprocessing methods. To list all the files in our corpus:

In [26]:
my_texts.fileids()[:10]

["'Art'.txt",
 "'Tis: A Memoir.txt",
 '003½: The Adventures of James Bond Junior.txt',
 '1 Esdras.txt',
 '1632.txt',
 '1876.txt',
 '1975 in Prophecy!.txt',
 '1982, Janine.txt',
 '2001: A Space Odyssey.txt',
 '2010: Odyssey Two.txt']

In [27]:
my_texts.words('To Kill A Mockingbird.txt')  # default word tokenizer is WordPunctTokenizer, so "'s" is split into "'" and "s"

['The', 'book', 'opens', 'with', 'the', 'Finch', ...]

In [28]:
my_texts.sents('To Kill A Mockingbird.txt')

[['The', 'book', 'opens', 'with', 'the', 'Finch', 'family', "'", 's', 'ancestor', ',', 'Simon', 'Finch', ',', 'a', 'Cornish', 'Methodist', 'fleeing', 'religious', 'intolerance', 'in', 'England', ',', 'settling', 'in', 'Alabama', ',', 'becoming', 'wealthy', 'and', ',', 'contrary', 'to', 'his', 'religious', 'beliefs', ',', 'buying', 'several', 'slaves', '.'], ['The', 'main', 'story', 'takes', 'place', 'during', 'three', 'years', 'of', 'the', 'Great', 'Depression', 'in', 'the', 'fictional', '"', 'tired', 'old', 'town', '"', 'of', 'Maycomb', ',', 'Alabama', '.'], ...]

It also adds a paragraph method:

In [29]:
my_texts.paras('To Kill A Mockingbird.txt')[0]

[['The',
  'book',
  'opens',
  'with',
  'the',
  'Finch',
  'family',
  "'",
  's',
  'ancestor',
  ',',
  'Simon',
  'Finch',
  ',',
  'a',
  'Cornish',
  'Methodist',
  'fleeing',
  'religious',
  'intolerance',
  'in',
  'England',
  ',',
  'settling',
  'in',
  'Alabama',
  ',',
  'becoming',
  'wealthy',
  'and',
  ',',
  'contrary',
  'to',
  'his',
  'religious',
  'beliefs',
  ',',
  'buying',
  'several',
  'slaves',
  '.'],
 ['The',
  'main',
  'story',
  'takes',
  'place',
  'during',
  'three',
  'years',
  'of',
  'the',
  'Great',
  'Depression',
  'in',
  'the',
  'fictional',
  '"',
  'tired',
  'old',
  'town',
  '"',
  'of',
  'Maycomb',
  ',',
  'Alabama',
  '.'],
 ['It',
  'focuses',
  'on',
  'six',
  '-',
  'year',
  '-',
  'old',
  'Scout',
  'Finch',
  ',',
  'who',
  'lives',
  'with',
  'her',
  'older',
  'brother',
  'Jem',
  'and',
  'their',
  'widowed',
  'father',
  'Atticus',
  ',',
  'a',
  'middle',
  '-',
  'aged',
  'lawyer',
  '.'],
 ['Jem',
  '

Let's save these to a variable to look at the next step on a low level:

In [30]:
m_sents = my_texts.sents('To Kill A Mockingbird.txt')
print (m_sents)

[['The', 'book', 'opens', 'with', 'the', 'Finch', 'family', "'", 's', 'ancestor', ',', 'Simon', 'Finch', ',', 'a', 'Cornish', 'Methodist', 'fleeing', 'religious', 'intolerance', 'in', 'England', ',', 'settling', 'in', 'Alabama', ',', 'becoming', 'wealthy', 'and', ',', 'contrary', 'to', 'his', 'religious', 'beliefs', ',', 'buying', 'several', 'slaves', '.'], ['The', 'main', 'story', 'takes', 'place', 'during', 'three', 'years', 'of', 'the', 'Great', 'Depression', 'in', 'the', 'fictional', '"', 'tired', 'old', 'town', '"', 'of', 'Maycomb', ',', 'Alabama', '.'], ...]


We now have a corpus, or text, from which we can get any of the statistics you learned in Day 3 of the Python workshop. We will review some of these functions once we have some more information.

# 3) POS-Tagging

There are many situations in which "tagging" words (or really anything) is useful for determining or calculating trends, or for further text analysis to extract meaning. NLTK contains several methods to achieve this, from simple regex to more advanced machine learning models.

It is important to note that in Natural Language Processing (NLP), POS (Part of Speech) tagging is the most common use for tagging, but the actual tag can be anything. Other applications include sentiment analysis and NER (Named Entity Recognition). Tagging is simply labeling a word to a specific category via a tuple.

Nevertheless, for training more advanced tagging models, POS tagging is nearly essential. If you are defining a machine learning model to predict patterns in your text, these patterns will most likely rely on, among other things, POS features. You will therefore first tag POS and then use the POS as a feature in your model.

### On a low-level

Tagging is creating a tuple of (word, tag) for every word in a text or corpus. For example: "My name is Chris" may be tagged for POS as:

My/PossessivePronoun name/Noun is/Verb Chris/ProperNoun ./Period

*NB: type 'nltk.data.path' to find the path on your computer to your downloaded nltk corpora. You can explore these files to see how large corpora are formatted.*

You'll notice how the text is annotated, using a forward slash to match the word to its tag. So how can we get this to a useful form for Python?

In [31]:
from nltk.tag import str2tuple

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged_sent = [str2tuple(t) for t in line.split()]

print (tagged_sent)

[('My', 'POSSESSIVE_PRONOUN'), ('name', 'NOUN'), ('is', 'VERB'), ('Chris', 'PROPER_NOUN'), ('.', 'PERIOD')]
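
There is no magic in this conversion. The sketch below reproduces the behaviour visible in the output above (split on the last `/`, upper-case the tag) with plain string methods. The function name `split_tagged_token` is hypothetical, and this is an illustration rather than NLTK's own source.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/tag' on the last separator and upper-case the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: treat the token as untagged.
        return (token, None)
    return (word, tag.upper())

line = "My/Possessive_Pronoun name/Noun is/Verb Chris/Proper_Noun ./Period"
tagged = [split_tagged_token(t) for t in line.split()]

print(tagged)
```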


Further analysis of tags with NLTK requires a *list* of sentences, otherwise you will get an index error on higher level methods.

Naturally, these tags are a bit verbose. The standard tagging conventions follow the Penn Treebank (more in a second): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

### Automatic Tagging

NLTK's stock English `pos_tag` tagger is a perceptron tagger:

In [32]:
from nltk import pos_tag
m_tagged_sent = pos_tag(m_sents[0])
print (m_tagged_sent)

[('The', 'DT'), ('book', 'NN'), ('opens', 'VBZ'), ('with', 'IN'), ('the', 'DT'), ('Finch', 'NNP'), ('family', 'NN'), ("'", "''"), ('s', 'JJ'), ('ancestor', 'NN'), (',', ','), ('Simon', 'NNP'), ('Finch', 'NNP'), (',', ','), ('a', 'DT'), ('Cornish', 'JJ'), ('Methodist', 'NNP'), ('fleeing', 'VBG'), ('religious', 'JJ'), ('intolerance', 'NN'), ('in', 'IN'), ('England', 'NNP'), (',', ','), ('settling', 'VBG'), ('in', 'IN'), ('Alabama', 'NNP'), (',', ','), ('becoming', 'VBG'), ('wealthy', 'NN'), ('and', 'CC'), (',', ','), ('contrary', 'JJ'), ('to', 'TO'), ('his', 'PRP$'), ('religious', 'JJ'), ('beliefs', 'NN'), (',', ','), ('buying', 'VBG'), ('several', 'JJ'), ('slaves', 'NNS'), ('.', '.')]


What do these tags mean?

In [33]:
from nltk import help
help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [34]:
m_tagged_all = [pos_tag(sent) for sent in m_sents]
print(m_tagged_all[:3])

[[('The', 'DT'), ('book', 'NN'), ('opens', 'VBZ'), ('with', 'IN'), ('the', 'DT'), ('Finch', 'NNP'), ('family', 'NN'), ("'", "''"), ('s', 'JJ'), ('ancestor', 'NN'), (',', ','), ('Simon', 'NNP'), ('Finch', 'NNP'), (',', ','), ('a', 'DT'), ('Cornish', 'JJ'), ('Methodist', 'NNP'), ('fleeing', 'VBG'), ('religious', 'JJ'), ('intolerance', 'NN'), ('in', 'IN'), ('England', 'NNP'), (',', ','), ('settling', 'VBG'), ('in', 'IN'), ('Alabama', 'NNP'), (',', ','), ('becoming', 'VBG'), ('wealthy', 'NN'), ('and', 'CC'), (',', ','), ('contrary', 'JJ'), ('to', 'TO'), ('his', 'PRP$'), ('religious', 'JJ'), ('beliefs', 'NN'), (',', ','), ('buying', 'VBG'), ('several', 'JJ'), ('slaves', 'NNS'), ('.', '.')], [('The', 'DT'), ('main', 'JJ'), ('story', 'NN'), ('takes', 'VBZ'), ('place', 'NN'), ('during', 'IN'), ('three', 'CD'), ('years', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Great', 'NNP'), ('Depression', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('fictional', 'JJ'), ('"', 'NN'), ('tired', 'VBD'), ('old', 'JJ'), ('

We can find and aggregate certain parts of speech too:

In [35]:
from nltk import ConditionalFreqDist
def find_tags(tag_prefix, tagged_text):
    """Return the five most common words for every tag starting with tag_prefix."""
    cfd = ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                              if tag.startswith(tag_prefix))
    # cfd.conditions() yields every tag that was seen
    return {tag: cfd[tag].most_common(5) for tag in cfd.conditions()}

In [36]:
m_tagged_words = [item for sublist in m_tagged_all for item in sublist]

tagdict = find_tags('JJ', m_tagged_words)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])

JJ [('s', 7), ('mysterious', 2), ('many', 2), ('old', 2), ('religious', 2)]
JJR [('older', 1)]
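
If the `ConditionalFreqDist` feels opaque, the same grouping can be sketched with the standard library: one `Counter` of words per tag. The function name `top_words_by_tag` and the toy tagged list are assumptions for illustration.

```python
from collections import Counter, defaultdict

def top_words_by_tag(tag_prefix, tagged_words, n=5):
    """Count words per tag, keeping only tags that start with tag_prefix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    return {tag: counter.most_common(n) for tag, counter in counts.items()}

# Hand-tagged toy data, not tagger output.
toy = [('The', 'DT'), ('old', 'JJ'), ('book', 'NN'), ('is', 'VBZ'),
       ('older', 'JJR'), ('than', 'IN'), ('the', 'DT'), ('old', 'JJ'),
       ('town', 'NN')]

result = top_words_by_tag('JJ', toy)
print(result)
```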


We can begin to quantify syntax by looking at the environments of words. For example, what commonly follows a verb?

In [37]:
import nltk
tags = [b[1] for (a, b) in nltk.bigrams(m_tagged_words) if a[1].startswith('VB')]
fd1 = nltk.FreqDist(tags)

print ("To Kill A Mockingbird")
fd1.tabulate(10)

To Kill A Mockingbird
 IN VBN NNP PRP  NN  DT  JJ  TO  RB VBG 
 27  13  12  11  10   8   7   5   4   3 
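
The bigram trick is just pairing each token with its successor. As a sketch of the same count without NLTK, `zip(seq, seq[1:])` produces those pairs. The short tagged list is hand-written for illustration.

```python
from collections import Counter

# Hand-tagged toy data, not tagger output.
tagged = [('The', 'DT'), ('book', 'NN'), ('opens', 'VBZ'), ('with', 'IN'),
          ('the', 'DT'), ('family', 'NN'), ('fleeing', 'VBG'),
          ('religious', 'JJ'), ('intolerance', 'NN')]

# Pair every (word, tag) with the one that follows it, then keep the
# following tag whenever the first tag is some kind of verb.
after_verbs = Counter(
    next_tag
    for (_, tag), (_, next_tag) in zip(tagged, tagged[1:])
    if tag.startswith('VB')
)

print(after_verbs.most_common())
```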


### Creating a tagged corpus

Now that we know how tagging works, we can quickly tag all of our documents, but we'll only do a few hundred from the much larger corpus.

In [38]:
tagged_sents = {}
for fid in my_texts.fileids()[::10]:
    tagged_sents[fid.split(".")[0]] = [pos_tag(sent) for sent in my_texts.sents(fid)]

In [39]:
tagged_sents.keys()

dict_keys(['Phaedo', "Why Didn't They Ask Evans?", 'Closer', 'The Sisterhood of the Traveling Pants', 'Elidor', 'Celebrated Cases of Judge Dee', 'The Dragonbone Chair', 'The Lord of the Rings', 'Death Be Not Proud', 'The Blackwater Lightship', 'David Starr, Space Ranger', 'Holes', "Lady Audley's Secret", 'The Fortune of War', 'The Wind in the Willows', 'The Great Hunt', 'The Sound of Waves', 'Rainbow Valley', 'A Thousand Acres', 'Magic, Inc', 'Efuru', 'Misery', 'I Am Charlotte Simmons', 'Pollyanna', 'A Yellow Raft in Blue Water', 'Quartet in Autumn', 'The Last of the Really Great Whangdoodles', 'Rabbit Is Rich', 'The Artemis Fowl Files', 'Protector', 'Tamburlaine', 'The Lions of Al-Rassan', 'Space', 'Asterix the Legionary', 'Iron Sunrise', "Harlot's Ghost", 'The Clan of the Cave Bear', 'The Man in the Brown Suit', 'The Cold Six Thousand', 'Once Were Warriors', 'Planet of the Apes', 'Timeline', 'Pericles, Prince of Tyre', 'Keep the Aspidistra Flying', 'Mutiny on the Bounty', 'Streets of

In [40]:
tagged_sents["Harry Potter and the Prisoner of Azkaban"]

[[('The', 'DT'),
  ('book', 'NN'),
  ('opens', 'VBZ'),
  ('on', 'IN'),
  ('the', 'DT'),
  ('night', 'NN'),
  ('before', 'IN'),
  ('Harry', 'NNP'),
  ("'", 'POS'),
  ('s', 'JJ'),
  ('thirteenth', 'JJ'),
  ('birthday', 'NN'),
  (',', ','),
  ('when', 'WRB'),
  ('he', 'PRP'),
  ('receives', 'VBZ'),
  ('gifts', 'NNS'),
  ('by', 'IN'),
  ('owl', 'NN'),
  ('post', 'NN'),
  ('from', 'IN'),
  ('his', 'PRP$'),
  ('friends', 'NNS'),
  ('at', 'IN'),
  ('school', 'NN'),
  ('.', '.')],
 [('The', 'DT'),
  ('next', 'JJ'),
  ('morning', 'NN'),
  ('at', 'IN'),
  ('breakfast', 'NN'),
  (',', ','),
  ('Harry', 'NNP'),
  ('sees', 'VBZ'),
  ('on', 'IN'),
  ('television', 'NN'),
  ('that', 'IN'),
  ('a', 'DT'),
  ('man', 'NN'),
  ('named', 'VBN'),
  ('Black', 'NNP'),
  ('is', 'VBZ'),
  ('on', 'IN'),
  ('the', 'DT'),
  ('loose', 'JJ'),
  ('from', 'IN'),
  ('prison', 'NN'),
  ('.', '.')],
 [('At', 'IN'),
  ('this', 'DT'),
  ('time', 'NN'),
  (',', ','),
  ('Aunt', 'NNP'),
  ('Marge', 'NNP'),
  ('comes', 'VBZ'

Absolute frequencies are available through NLTK's `FreqDist` class:

In [41]:
all_tags = []
all_tups = []

for k in tagged_sents.keys():
    for s in tagged_sents[k]:
        for t in s:
            all_tags.append(t[1])
            all_tups.append(t)

nltk.FreqDist(all_tags).tabulate(10)

   NN    IN   NNP    DT    JJ     ,   VBZ     .   NNS   PRP 
34614 27166 26943 23283 13346 13153 12640  9558  9257  8833 


In [42]:
tags = ['NN', 'VB', 'JJ']
for t in tags:
    tagdict = find_tags(t, all_tups)
    for tag in sorted(tagdict):
        print(tag, tagdict[tag])

NN [('s', 1422), ('time', 331), ('father', 313), ('man', 293), ('story', 287)]
NNP [('"', 451), ('’', 197), ('Mr', 149), ('John', 144), ('Richard', 126)]
NNPS [('Indians', 56), ('States', 21), ('Agnes', 10), ('Sirians', 8), ('Moties', 8)]
NNS [('years', 203), ('people', 196), ('children', 139), ('friends', 101), ('men', 100)]
VB [('be', 731), ('have', 191), ('take', 161), ('find', 159), ('get', 97)]
VBD [('was', 707), ('had', 407), ('were', 143), ('s', 69), ('did', 68)]
VBG [('being', 270), ('having', 117), ('including', 89), ('leaving', 58), ('using', 55)]
VBN [('been', 380), ('named', 140), ('killed', 100), ('known', 97), ('called', 93)]
VBP [('are', 863), ('have', 291), ('find', 59), ('"', 57), ('do', 49)]
VBZ [('is', 3529), ('has', 918), ('tells', 184), ('takes', 184), ('finds', 178)]
JJ [('s', 879), ('other', 275), ('"', 217), ('first', 207), ('old', 193)]
JJR [('more', 106), ('older', 35), ('better', 30), ('younger', 28), ('larger', 14)]
JJS [('most', 51), ('best', 39), ('least',

We can compare this to other genres:

In [43]:
from nltk.corpus import brown

for c in brown.categories():
    tagged_words = brown.tagged_words(categories=c)  # not universal tagset
    tag_fd = nltk.FreqDist(tag for (word, tag) in tagged_words)
    print(c.upper())
    tag_fd.tabulate(10)
    print()
    tags = ['NN', 'VB', 'JJ']
    for t in tags:
        tagdict = find_tags(t, tagged_words)
        for tag in sorted(tagdict):
            print(tag, tagdict[tag])
    print()
    print()

ADVENTURE
  NN   IN   AT    .  VBD    ,   RB   JJ  NNS   CC 
8051 5908 5531 5104 3702 3488 2845 2687 2302 2171 

NN [('man', 165), ('time', 127), ('face', 72), ('head', 68), ('door', 67)]
NN$ [("man's", 25), ("father's", 4), ("coroner's", 4), ("marine's", 4), ("night's", 3)]
NN$-TL [("Throat's", 5), ("Uncle's", 2), ("Eagle's", 1), ("Knife's", 1)]
NN+BEZ [("fire's", 2), ("name's", 1), ("fat's", 1), ("leg's", 1), ("wife's", 1)]
NN+BEZ-TL [("Knife's", 1)]
NN+HVZ [("boat's", 1)]
NN+HVZ-TL [("Knife's", 1)]
NN+MD [("sun'll", 1)]
NN-HL [('Chapter', 1), ('Attack', 1)]
NN-NC [('eromonga', 1), ('Commodore', 1)]
NN-TL [('Uncle', 12), ('Throat', 11), ('Woman', 7), ('Dr.', 6), ('Aunt', 6)]
NNS [('eyes', 90), ('men', 81), ('feet', 49), ('hands', 48), ('horses', 34)]
NNS$ [("men's", 2), ("grownups'", 1), ("guests'", 1), ("longshoremen's", 1), ("girls'", 1)]
NNS$-TL [("Stockgrowers'", 1)]
NNS+MD [("duds'd", 1)]
NNS-TL [('Eyes', 7), ('Highlands', 4), ('Gardens', 4), ('Riders', 4), ('Nations', 2)]
VB [(

We can also look at a word's linguistic environment at a low level. The cell below lists all the words preceding "love" in the romance category:

In [44]:
brown_romance_text = brown.words(categories='romance')
sorted(set(a for (a, b) in nltk.bigrams(brown_romance_text) if b == 'love'))

[',',
 'I',
 'My',
 'a',
 'always',
 'and',
 'could',
 'for',
 'if',
 'in',
 'is',
 'made',
 'make',
 'making',
 'mingled',
 'my',
 'of',
 'only',
 'our',
 'real',
 'that',
 'this',
 'to',
 'true',
 'undying',
 'wondrous']

# 4) Dependency Parsing

While tagging parts of speech can be helpful for certain NLP tasks, dependency parsing makes the grammatical relationships within a sentence explicit, for example which noun is the subject of which verb.

In [None]:
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(path_to_jar = "/Users/chench/Documents/stanford-parser-full-2015-12-09/stanford-parser.jar",
                                             path_to_models_jar = "/Users/chench/Documents/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar")

result = dependency_parser.raw_parse_sents(['I shot an elephant in my sleep.', 'It was great.'])

As the program takes longer to run, I will not run it on the entire corpus, but an example is below:

In [None]:
for r in result:
    for o in r:
        trips = list(o.triples())  # ((head word, head tag), rel, (dep word, dep tag))
        for t in trips:
            print(t)
            print()
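
Once you have triples in the shape noted in the comment above, extracting relationships is ordinary list processing. The sketch below uses hand-written triples for the first example sentence. They are illustrative, not actual parser output, and the relation labels `nsubj` and `dobj` are an assumption about the label set your parser version emits. The function name is hypothetical.

```python
# Hand-written triples in the ((head, head_tag), relation, (dependent, dep_tag))
# shape; illustrative only, NOT actual parser output.
triples = [
    (('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
    (('shot', 'VBD'), 'dobj', ('elephant', 'NN')),
    (('elephant', 'NN'), 'det', ('an', 'DT')),
]

def subject_verb_object(triples):
    """Collect (subject, verb, object) for verbs that have both a subject and an object."""
    subjects, objects = {}, {}
    for (head, _), rel, (dep, _) in triples:
        if rel == 'nsubj':
            subjects[head] = dep
        elif rel == 'dobj':
            objects[head] = dep
    return [(subjects[verb], verb, objects[verb]) for verb in subjects if verb in objects]

svo = subject_verb_object(triples)
print(svo)
```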

# 5) Named Entity Recognition

After tokenizing, tagging, and parsing, one of the last steps in the pipeline is NER. Identifying named entities can be useful in determining many different relationships, and often serves as a prerequisite to mapping textual relationships within a set of documents.

In [None]:
from nltk.tag.stanford import StanfordNERTagger

ner_tag = StanfordNERTagger(
        '/Users/chench/Documents/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz',
        '/Users/chench/Documents/stanford-ner-2015-12-09/stanford-ner.jar')

In [None]:
import pyprind

ner_sents = {}
books = ["To Kill A Mockingbird.txt", "Harry Potter and the Prisoner of Azkaban.txt"]

for fid in books:
    bar = pyprind.ProgBar(len(my_texts.sents(fid)), monitor=True, bar_char="#")
    ner_tagged = []  # separate name so the POS-tagged `tagged_sents` dict is not overwritten
    for sent in my_texts.sents(fid):
        ner_tagged.append(ner_tag.tag(sent))
        bar.update()
    ner_sents[fid.split(".")[0]] = ner_tagged
    print()

We can look on the low level at a single summary:

In [None]:
print(ner_sents["To Kill A Mockingbird"])

In [None]:
print(ner_sents["Harry Potter and the Prisoner of Azkaban"])

In [None]:
from itertools import groupby
from nltk import FreqDist

NER = {"LOCATION": [],
       "PERSON": [],
       "ORGANIZATION": [],
       }

for sentence in ner_sents["To Kill A Mockingbird"]:
    for tag, chunk in groupby(sentence, lambda x: x[1]):
        if tag != "O":
            NER[tag].append(" ".join(w for w, t in chunk))

if NER["LOCATION"]:
    print("Locations:")
    FreqDist(NER["LOCATION"]).tabulate()
    print()

if NER["PERSON"]:
    print("Persons:")
    FreqDist(NER["PERSON"]).tabulate()
    print()

if NER["ORGANIZATION"]:
    print("Organizations")
    FreqDist(NER["ORGANIZATION"]).tabulate()
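
The key step in the cell above is `groupby`, which merges consecutive tokens that share a tag into one entity. Here is a minimal sketch on a hand-tagged sentence (illustrative data, not tagger output):

```python
from itertools import groupby

# Hand-tagged toy sentence in the (word, tag) shape, NOT actual tagger output.
tagged = [('Scout', 'PERSON'), ('Finch', 'PERSON'), ('lives', 'O'), ('in', 'O'),
          ('Maycomb', 'LOCATION'), (',', 'O'), ('Alabama', 'LOCATION')]

# groupby yields one group per run of identical consecutive tags.
entities = [
    (tag, " ".join(word for word, _ in chunk))
    for tag, chunk in groupby(tagged, key=lambda pair: pair[1])
    if tag != 'O'
]

print(entities)
```

One caveat of this approach is that two different entities of the same type that sit directly next to each other are merged into a single chunk.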

Or aggregated across both summaries:

In [None]:
NER = {"LOCATION": [],
       "PERSON": [],
       "ORGANIZATION": [],
       }

for k in ner_sents.keys():
    for sentence in ner_sents[k]:
        for tag, chunk in groupby(sentence, lambda x: x[1]):
            if tag != "O":
                NER[tag].append(" ".join(w for w, t in chunk))

if NER["LOCATION"]:
    print("Locations:")
    FreqDist(NER["LOCATION"]).tabulate()
    print()

if NER["PERSON"]:
    print("Persons:")
    FreqDist(NER["PERSON"]).tabulate()
    print()

if NER["ORGANIZATION"]:
    print("Organizations:")
    FreqDist(NER["ORGANIZATION"]).tabulate()

# 6) Sentiment Analysis

Early sentiment analysis relied on simple dictionary look-ups that mark words as positive or negative, or assign numerical values to words. Newer methods are better able to take a word's or sentence's context into account. VADER (Valence Aware Dictionary and sEntiment Reasoner) is one such example: it is still lexicon-based, but adds rules for negation, intensifiers, punctuation, and capitalization.
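
To see why context matters, here is a toy sketch of a pure dictionary look-up next to a version with one crude negation rule. The lexicon values, the negation list, and the "flip the sign after a negation" rule are all assumptions for illustration. They are not VADER's actual lexicon or heuristics.

```python
# Tiny hypothetical lexicon; real lexicons hold thousands of scored words.
LEXICON = {"like": 1.0, "great": 2.0, "love": 2.0, "boring": -1.5, "terrible": -2.0}
NEGATIONS = {"not", "don't", "never", "no"}

def tokenize(sentence):
    return sentence.lower().replace(".", "").split()

def lookup_score(sentence):
    """Sum word scores, ignoring context entirely."""
    return sum(LEXICON.get(token, 0.0) for token in tokenize(sentence))

def negation_aware_score(sentence):
    """Same sum, but flip a word's score if the previous token is a negation."""
    tokens = tokenize(sentence)
    score = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

sentence = "I really don't like that book."
print(lookup_score(sentence), negation_aware_score(sentence))
```

The plain look-up scores the sentence as positive because it only sees "like", while the negation-aware version flips it.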

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np

sid = SentimentIntensityAnalyzer()

print(sid.polarity_scores("I really don't like that book.")["compound"])

In [None]:
for fid in books:
    print(fid.upper())
    # Score and print the same sentence list so every score lines up with its sentence
    sentences = sent_tokenize(my_texts.raw(fid))
    sent_pols = [sid.polarity_scores(s)["compound"] for s in sentences]
    for s, pol in zip(sentences, sent_pols):
        print(s, pol)
        print()
    
    print()
    print("Mean: ", np.mean(sent_pols))
    print()
    print("="*100)
    print()

# 7) Classification

We'll use the IMDB [movie review database](http://ai.stanford.edu/~amaas/data/sentiment/).

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
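# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import metrics, tree, model_selection
from sklearn.naive_bayes import MultinomialNB
# RandomizedLogisticRegression was removed in scikit-learn 0.21, so it is no longer imported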

In [None]:
import glob

data = {"train": {"pos": [], "neg": []},
        "test": {"pos": [], "neg": []}}

txt_types = [("train", "neg"), ("train", "pos"), ("test", "neg"), ("test", "pos")]

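for split, label in txt_types:
    for txt_file in glob.glob("data/" + split + "/" + label + "/*.txt"):
        # the reviews contain non-ASCII characters, so set the encoding explicitly
        with open(txt_file, "r", encoding="utf-8") as f:
            text = f.read()
        data[split][label].append(text)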

In [None]:
data["train"]["pos"][0]

In [None]:
data["train"]["neg"][0]

In [None]:
# get training + test data
import numpy as np

X_train = data["train"]["pos"] + data["train"]["neg"]
y_train = np.append(np.ones(len(data["train"]["pos"])), np.zeros(len(data["train"]["neg"])))

X_test = data["test"]["pos"] + data["test"]["neg"]
y_test = np.append(np.ones(len(data["test"]["pos"])), np.zeros(len(data["test"]["neg"])))

In [None]:
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))

### *tfidf*

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit_transform(X_train)
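
As a rough illustration of what a tf-idf weight expresses, the sketch below computes the textbook variant, term count times `log(N / df)`, in plain Python. The function name and toy documents are made up for this example. `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

toy_docs = ["the movie was great",
            "the movie was awful",
            "the plot was great fun"]
toy_weights = tfidf_weights(toy_docs)
for w in toy_weights:
    print(w)
```

Words that occur in every document ("the", "was") get a weight of zero, while words that are rare across the collection ("awful") get the highest weights.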

### Pipeline

In [None]:
# build a pipeline - SVC
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', OneVsRestClassifier(LinearSVC(random_state=0)))
                     ])

In [None]:
# fit using pipeline
clf = text_clf.fit(X_train, y_train)

In [None]:
# predict
predicted = clf.predict(X_test)
clf.score(X_test, y_test) 

In [None]:
# print metrics
print(metrics.classification_report(y_test, predicted)) 

In [None]:
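from sklearn import model_selection  # replaces sklearn.cross_validation, removed in scikit-learn 0.20

scores = model_selection.cross_val_score(text_clf, X_train + X_test, np.append(y_train, y_test), cv=5)
print(scores, scores.mean())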

### TPOT

In [None]:
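from tpot import TPOTClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# TPOT expects a numeric feature matrix rather than raw strings, so vectorize first
tpot_vect = TfidfVectorizer()
X_train_vec = tpot_vect.fit_transform(X_train)
X_test_vec = tpot_vect.transform(X_test)

# 'TPOT sparse' is the built-in configuration for sparse input in the classic (pre-1.0) TPOT API
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, config_dict='TPOT sparse')
tpot.fit(X_train_vec, y_train)
print(tpot.score(X_test_vec, y_test))
tpot.export('tpot_imdb_pipeline.py')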

### Neural Networks and word2vec

Word embeddings were an early and influential step away from the "bag of words" model of language. Instead of relying only on word frequencies and vocabulary usage, word embeddings represent each word by the contexts it occurs in, which lets them retain some syntactic and semantic information. Generally, a neural network model built on embeddings *will not* remove stopwords or punctuation, because they are part of the context the model learns from.

The first step is to turn each tokenized sentence into a sequence of integers, with each unique token assigned its own ID.

e.g.:

~~~
[["I", "like", "coffee", "."], ["I", "like", "my", "coffee", "without", "sugar", "."]]
~~~

is transformed to:

~~~
[[43, 75, 435, 98], [43, 75, 10, 435, 31, 217, 98]]
~~~

Notice that every "I", "like", "coffee", and "." receives the same ID wherever it occurs.
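
A minimal sketch of this mapping is shown here. The helper `build_vocab` is a hypothetical name, and it hands out IDs in order of first appearance, so the numbers will not match the arbitrary IDs in the example above.

```python
def build_vocab(tokenized_sentences):
    """Assign each unique token the next free integer ID, in order of first appearance."""
    vocab = {}
    for sentence in tokenized_sentences:
        for token in sentence:
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # start at 1 so that 0 stays free for padding
    return vocab

sentences = [["I", "like", "coffee", "."],
             ["I", "like", "my", "coffee", "without", "sugar", "."]]
vocab = build_vocab(sentences)
encoded = [[vocab[tok] for tok in sent] for sent in sentences]
print(vocab)
print(encoded)  # [[1, 2, 3, 4], [1, 2, 5, 3, 6, 7, 4]]
```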

The model then learns a dense, high-dimensional vector for each of these IDs by relating every word to the words that surround it. The result is a sort of "cloud" of words, in which words used in a similar syntactic, and often semantic, fashion cluster close together.
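
One way to picture "mapping every word to its surroundings" is the list of (target, context) pairs that the skip-gram variant of word2vec is trained on. The sketch below only generates those pairs; the function name and the window size are illustrative assumptions, and the actual training (including details such as subsampling and negative sampling) is left out.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["I", "like", "my", "coffee", "without", "sugar", "."]
pairs = context_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

Words that keep turning up with the same neighbours end up with similar pairs, which is what pushes their vectors close together during training.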

One of the drawbacks of word2vec is the large volume of text needed to train useful vectors.

Let's see how to code this into a classifier.

### One-hot encode text

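First we have to encode the text as integer word IDs, limiting the vocabulary to the 20,000 most common words. Strictly speaking, this is integer (index) encoding rather than one-hot encoding: each word is replaced by its ID, and the embedding layer later looks up a vector for that ID. The variable names below keep the `one_hot` label.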
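
For comparison, here is a minimal sketch of the two representations a word can take at this stage: an integer ID and a one-hot vector. The toy vocabulary and the helper `to_one_hot` are hypothetical. The cells that follow store integer IDs, which is the input an embedding layer expects.

```python
def to_one_hot(word_id, vocab_size):
    """Return a list of zeros with a single 1 at position `word_id`."""
    vector = [0] * vocab_size
    vector[word_id] = 1
    return vector

toy_ids = {"coffee": 1, "sugar": 2, "tea": 3}  # hypothetical vocabulary; 0 is reserved for padding
vocab_size = len(toy_ids) + 1

sentence = ["coffee", "sugar"]
as_ids = [toy_ids[w] for w in sentence]
as_one_hot = [to_one_hot(i, vocab_size) for i in as_ids]
print(as_ids)      # [1, 2]
print(as_one_hot)  # [[0, 1, 0, 0], [0, 0, 1, 0]]
```

With 20,000 words, every one-hot vector would have 20,000 entries, which is why the IDs are kept as plain integers and looked up in the embedding matrix instead.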

In [None]:
from collections import Counter

max_features = 20000
all_words = []

for text in X_train + X_test:
    all_words.extend(text.split())
unique_words_ordered = [x[0] for x in Counter(all_words).most_common()]
word_ids = {}
rev_word_ids = {}
for i, x in enumerate(unique_words_ordered[:max_features-1]):
    word_ids[x] = i + 1  # so we can pad with 0s
    rev_word_ids[i + 1] = x

In [None]:
X_train_one_hot = []
for text in X_train:
    t_ids = [word_ids[x] for x in text.split() if x in word_ids]
    X_train_one_hot.append(t_ids)
    
X_test_one_hot = []
for text in X_test:
    t_ids = [word_ids[x] for x in text.split() if x in word_ids]
    X_test_one_hot.append(t_ids)

### NN Classification

Now we can use Keras, a popular high-level neural network library (originally a wrapper around Theano and TensorFlow), to quickly build an NN classifier.

In [None]:
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.utils import pad_sequences  # Keras >= 2.9; older releases use keras.preprocessing.sequence.pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding
from keras.layers import LSTM, SimpleRNN, GRU

maxlen = 80  # keep at most this many word IDs per text (pad_sequences keeps the last ones by default)
batch_size = 32

print('Pad sequences (samples x time)')
X_train = pad_sequences(X_train_one_hot, maxlen=maxlen)
X_test = pad_sequences(X_test_one_hot, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(Dropout(0.2))  # Keras 2+ removed the `dropout` argument of Embedding
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # try using a GRU instead, for fun
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, epochs=15,  # `nb_epoch` was renamed to `epochs` in Keras 2
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
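
The padding step in the cell above is easy to misread, so here is a plain-Python sketch of what `pad_sequences` does with its default settings (pad and truncate at the front). The helper `pad_pre` is a hypothetical stand-in written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short_review = [5, 8, 2]
long_review = [1, 2, 3, 4, 5, 6, 7]
print(pad_pre(short_review, maxlen=5))  # [0, 0, 5, 8, 2]
print(pad_pre(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

With `maxlen = 80`, each review is therefore represented by its last 80 in-vocabulary word IDs, and shorter reviews are filled up with leading zeros, which is why ID 0 was reserved when the vocabulary was built.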