I'm reading _Natural Language Processing with Python_ this evening. I recently discovered that the NLTK has a sentence segmenter built in, so I thought I would return to the book and work through some of the materials.

I'm going to note up top here that there is a collection of NLTK [How-Tos](http://www.nltk.org/howto/), but they are not well documented.

The first thing to do is to grab some texts:

In [1]:
import pandas

# =-=-=-=-=-=
# Read CSV into DataFrame and then create lists
# =-=-=-=-=-=

# Create pandas dataframe
colnames = ['author', 'title', 'date' , 'length', 'text']
df = pandas.read_csv('../data/talks_3.csv', names=colnames)
texts = df.text.tolist()

Since I am working through these materials, I'm going to break off a small piece of the larger corpus. I create two test lists of words below, one with a single talk and one with five. The lines of code are a bit long:

```python
gore = re.sub(r"[^\w\d'\s]+",'', texts[0]).lower().split()
```

The real work of the line is in the regex call, `re.sub(r"[^\w\d'\s]+",'', texts[0])`, which takes the first text, `texts[0]`, and deletes every run of characters that is not a word character (a letter, digit, or underscore), an apostrophe, or whitespace. (The `\d` is redundant, since `\w` already covers digits.) I'm tempted to try the NLTK word tokenizer here. The call is followed by two string methods, which convert everything to lower case and then split the string into a list of words.

The only difference in the second line is that the five texts are in a list, which must first be joined, here with a single space, before the substitution, lowercasing, and split into a list.

In [None]:
import re

gore = re.sub(r"[^\w\d'\s]+",'', texts[0]).lower().split()
firsts = re.sub(r"[^\w\d'\s]+",'', ' '.join(texts[0:5])).lower().split()
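
As a quick check on what the cleaning step does, here is a minimal sketch run on a made-up sentence (the sample string is an assumption for illustration, not text from the talks):

```python
import re

# Made-up sample; the notebook passes texts[0] instead.
sample = "Thank you so much, Chris. And it's truly a great honor!"

# Delete every run of characters that is not a word character,
# an apostrophe, or whitespace; then lowercase and split on whitespace.
words = re.sub(r"[^\w\d'\s]+", '', sample).lower().split()
print(words)
```

Commas, periods, and the exclamation mark disappear, while the apostrophe in `it's` survives. That also means every sentence boundary is gone from the resulting list.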

In [None]:
past_tense = [w for w in gore if re.search('ed$', w)] # From Bird et al. "NLP with Python"
print(past_tense)

Okay, the next step was to use the NLTK to look for phrases and such, as per the example in the third chapter of the book: "For instance, searching a large text corpus for expressions of the form *x and other ys* allows us to discover hypernyms." This is not as easy as it sounds, and I had to download the Brown corpus to try to understand part of what's going on and then to check on a few things. 

Here are my steps:

In [7]:
# =-=-=-=-=-=-=-=-=-=-=
# Use to avoid ssl issues with nltk.download()
# =-=-=-=-=-=-=-=-=-=-= 

import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> taggers
    Error loading taggers: Package 'taggers' not found in index

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Informati

KeyboardInterrupt: 

Having downloaded the Brown corpus, I tried my hand at the example from the book:

In [None]:
import nltk
from nltk.corpus import brown

hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))

In [None]:
hobbies_learned.findall(r'<\w*> <and> <other> <\w*s>')

The key to this seems to be `nltk.Text`, which is used as below. It feels like round-tripping the text, but I assume it creates a special kind of object, one that offers methods like `findall`.

In [None]:
# The 'rU' mode was removed in Python 3.11; plain 'r' already handles universal newlines
with open('my-file.txt', 'r') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

In [None]:
import nltk

# x and other y: hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

#strung = nltk.Text(firsts)

test = ' '.join(texts[0:5])
tokens = nltk.word_tokenize(test)
text = nltk.Text(tokens)

# Text.findall() prints its matches and returns None, so there is nothing to assign
text.findall(r"<\w*> <as> <\w*>")

In [2]:
def textualize(string):
    import nltk
    tokens = nltk.word_tokenize(string)
    text = nltk.Text(tokens)
    return text

In [None]:
strung = ' '.join(texts)
tokens = nltk.word_tokenize(strung)
text = nltk.Text(tokens)
# Text.findall() prints its matches and returns None, so call it directly
text.findall(r'<\w*> <and> <other> <\w*s>')
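
Since `Text.findall()` prints its matches instead of returning them, a plain-`re` approximation can be handy when the matches are needed as data. The sketch below is a swapped-in technique, not the NLTK method: `x_and_other_ys` is a hypothetical helper, the sample sentence is made up, and it works on the raw string rather than on NLTK tokens.

```python
import re

def x_and_other_ys(raw):
    """Return (x, ys) pairs for phrases shaped like 'x and other ys'."""
    # Hypothetical helper, not part of NLTK: a plain-regex stand-in for
    # the token-based search, which returns its matches as a list.
    return re.findall(r"\b(\w+) and other (\w+s)\b", raw)

# Made-up sample sentence, not from the corpus.
sample = "We studied sparrows and other birds, then oaks and other trees."
pairs = x_and_other_ys(sample)
print(pairs)
```

In the notebook this could be called on `' '.join(texts)`, keeping in mind that it matches on characters, so its results may differ from the token-level search.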

**Tokenization** is "the task of cutting a string into identifiable linguistic units that constitute a piece of language data."

> The function `nltk.regexp_tokenize()` is similar to `re.findall()` (as we’ve been using it for tokenization). However, `nltk.regexp_tokenize()` is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special `(?x)` “verbose flag” tells Python to strip out the embedded whitespace and comments.

Here's what that looks like in practice. Note that there's nothing here, as far as I can tell, to preserve parentheticals. I'll need to test it.

In [None]:
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens (hyphen last, so it is not a range)
'''
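
A caveat on grouping in patterns like this one: when a pattern contains capturing groups, `re.findall()` returns the contents of the groups rather than the whole match, which is why non-capturing `(?:...)` groups matter. As far as I can tell, NLTK 3 hands the pattern to the same kind of matching, so the caveat would carry over to `nltk.regexp_tokenize()`, though that is an assumption not checked against every version. A minimal sketch with plain `re` and a made-up sample:

```python
import re

sample = "a well-known fact"

# With a capturing group, findall() returns only what the group captured.
capturing = re.findall(r"\w+(-\w+)*", sample)

# With a non-capturing group, findall() returns each whole match.
non_capturing = re.findall(r"\w+(?:-\w+)*", sample)

print(capturing)      # ['', '-known', '']
print(non_capturing)  # ['a', 'well-known', 'fact']
```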

In [None]:
pattern_2 = r'\w'

In [None]:
test = "You have to understand. (Laughter.) I was used to flying."

nltk.regexp_tokenize(test, pattern)

In [None]:
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

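# Tokenize the raw talk rather than `gore`: that list was lowercased and stripped of
# punctuation above, which left no sentence boundaries to find and is most likely why
# the earlier attempt did not break the text into individual sentences.
string = texts[0]
sents = sent_tokenizer.tokenize(string)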
for item in sents:
    print(item + "\n")

In [None]:
len(sents)

## April 18

This morning I am going to try my hand at parts-of-speech tagging. PoS tagging seems to take a long time: I'm running it on Gore's first TED talk, and it's taking a while. I should have assigned the output to an object, and I should also try the timing function.
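
One way to do the timing, sketched with the standard library only. `timed` is a hypothetical helper, and `sorted` is just a stand-in workload so the example runs without NLTK; in the notebook the call would be something like `timed(nltk.pos_tag, tokens)`.

```python
import time

def timed(func, *args, **kwargs):
    """Hypothetical helper: run func and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload, used here only to show the call shape.
result, seconds = timed(sorted, ["to", "be", "or", "not"])
print(result, f"{seconds:.6f}s")
```

Returning the result alongside the elapsed time also takes care of keeping the tagger's output in an object. In Jupyter, the `%time` and `%timeit` magics are the built-in alternative.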

In [27]:
import nltk

gore = nltk.word_tokenize(texts[0])

parts = nltk.pos_tag(gore[0:40])
print("The tagger produces a {0} of {1}s.".format(type(parts), type(parts[0])))
print(parts)

The tagger produces a <class 'list'> of <class 'tuple'>s.
[('Thank', 'NNP'), ('you', 'PRP'), ('so', 'RB'), ('much', 'JJ'), ('Chris', 'NNP'), ('.', '.'), ('And', 'CC'), ('it', 'PRP'), ("'s", 'VBZ'), ('truly', 'RB'), ('a', 'DT'), ('great', 'JJ'), ('honor', 'NN'), ('to', 'TO'), ('have', 'VB'), ('the', 'DT'), ('opportunity', 'NN'), ('to', 'TO'), ('come', 'VB'), ('to', 'TO'), ('this', 'DT'), ('stage', 'NN'), ('twice', 'RB'), ('I', 'PRP'), ("'m", 'VBP'), ('extremely', 'RB'), ('grateful', 'JJ'), ('.', '.'), ('I', 'PRP'), ('have', 'VBP'), ('been', 'VBN'), ('blown', 'VBN'), ('away', 'RB'), ('by', 'IN'), ('this', 'DT'), ('conference', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO')]


`nltk.bigrams` produces a generator object, so if you need to tweak anything, make sure to run the line below again to re-create the generator.

In [26]:
gore_tagged = nltk.pos_tag(gore)
word_tag_pairs = nltk.bigrams(gore_tagged)
list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'NN'))

['NNP',
 'IN',
 'VBG',
 'VBN',
 'VBZ',
 'JJR',
 'PRP',
 'PRP$',
 'JJ',
 'NN',
 'DT',
 'CC',
 'POS',
 'RB',
 'CD',
 'VBD',
 'VB',
 '.']
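
The note about re-creating the generator can be illustrated without NLTK. In this sketch `bigrams` is a hypothetical stand-in for `nltk.bigrams`, written only to show the behavior: a generator can be consumed once, and a second pass over it yields nothing.

```python
def bigrams(items):
    """Hypothetical stand-in for nltk.bigrams: lazily yield adjacent pairs."""
    items = list(items)
    return (pair for pair in zip(items, items[1:]))

pairs = bigrams(["a", "great", "honor"])
first_pass = list(pairs)   # consumes the generator
second_pass = list(pairs)  # nothing left to yield

print(first_pass)   # [('a', 'great'), ('great', 'honor')]
print(second_pass)  # []
```

Wrapping the call in `list(...)` once and reusing that list avoids having to rerun the cell.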

In [33]:
verbs = nltk.FreqDist(gore_tagged)

In [36]:
gore_tagged = nltk.pos_tag(gore)
type(gore_tagged)

list

In [32]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text 
                               if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

In [35]:
findtags('NN', gore_tagged)

TypeError: 'dict_keys' object is not subscriptable
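
The `TypeError` comes from Python 3, where `dict.keys()` returns a view that cannot be sliced. Below is a minimal sketch of the same idea using only the standard library: `defaultdict(Counter)` stands in for `nltk.ConditionalFreqDist`, the `n` parameter is an addition of mine, and the tagged sample is made up.

```python
from collections import Counter, defaultdict

def findtags(tag_prefix, tagged_text, n=5):
    """Map each tag starting with tag_prefix to its n most common words."""
    # defaultdict(Counter) is a stdlib stand-in for nltk.ConditionalFreqDist.
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    # most_common(n) returns a list, so no dict_keys view has to be sliced.
    return {tag: [w for w, _ in c.most_common(n)] for tag, c in counts.items()}

# Made-up tagged sample, not output from the talks.
sample = [("honor", "NN"), ("stage", "NN"), ("honor", "NN"),
          ("Chris", "NNP"), ("come", "VB")]
found = findtags("NN", sample)
print(found)
```

In the original function, since `FreqDist` is a `Counter` subclass in NLTK 3, swapping `cfd[tag].keys()[:5]` for `cfd[tag].most_common(5)` should do the same job, though I have not rerun it here.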