# Module 4: NLP and the Pipeline

* DS 6001
* Raf Alvarado

# Overview

We import a collection of texts and convert them to F2. Then we annotate the collection to create an F3-level model.

# Set Up

## Configs

In [None]:
OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
epub_dir = 'epubs'

## Imports

In [2]:
import pandas as pd
import numpy as np
from glob import glob
import re
import nltk

In [3]:
%matplotlib inline

## Import NLTK and download resources

If you need to install NLTK, see the [instructions here](https://www.nltk.org/install.html). You can also install this with Anaconda, like so:

`conda install nltk`

Once you have installed NLTK, you will need to download resources, which happens when you run the following cell. If the interactive window opens, you may need to set your NLTK Data Directory, as described in the [instructions here](https://www.nltk.org/data.html). To set the directory, click on the File menu and select Change Download Directory. For central installation, set this to `C:\nltk_data` (Windows), `/usr/local/share/nltk_data` (Mac), or `/usr/share/nltk_data` (Unix).

> If you did not install the data to one of the above central locations, you will need to set the NLTK_DATA environment variable to specify the location of the data. (On a Windows machine, right click on “My Computer” then select Properties > Advanced > Environment Variables > User Variables > New...)
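
As an alternative to the system dialog, the variable can also be set for the current Python process only. The sketch below assumes a hypothetical directory named `my_nltk_data` in your home folder; substitute the location you actually downloaded to. As far as I know, NLTK reads `NLTK_DATA` when it is first imported, so this would need to run before `import nltk`. If NLTK is already imported, appending the directory to the `nltk.data.path` list should have the same effect.

```python
import os
from pathlib import Path

# Hypothetical location; replace with the directory you downloaded the data to.
custom_dir = Path.home() / "my_nltk_data"

# Append to any existing value instead of overwriting it.
existing = os.environ.get("NLTK_DATA", "")
parts = [p for p in existing.split(os.pathsep) if p]
if str(custom_dir) not in parts:
    parts.append(str(custom_dir))
os.environ["NLTK_DATA"] = os.pathsep.join(parts)

print(os.environ["NLTK_DATA"])
```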

In [4]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('tagsets')

[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno 8] nodename nor servname provided, or not
[nltk_data]     known>
[nltk_data] Error loading stopwords: <urlopen error [Errno 8] nodename
[nltk_data]     nor servname provided, or not known>
[nltk_data] Error loading tagsets: <urlopen error [Errno 8] nodename
[nltk_data]     nor servname provided, or not known>


False

# Acquire

We download the novels of Jane Austen and Herman Melville from Project Gutenberg. I actually used a utility I created called PGTK:

* https://github.com/ontoligent-design/pgtk

# Inspect

Since Project Gutenberg texts vary widely in their markup, we define our chunking patterns by hand.

In [5]:
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"
chap_pats = {
    158: {
        'start_line': 37,
        'end_line': 16261,
        'volume': re.compile('^\s*VOLUME\s+{}\s*$'.format(roman)),
        'chapter': re.compile('^\s*CHAPTER\s+{}\s*$'.format(roman))
    },
    946: {
        'start_line': 38,
        'end_line': 2556,
        'chapter': re.compile("^\s*{}\s*$".format(roman))
    },
    1212: {
        'start_line': 77,
        'end_line': 3432,
        'chapter': re.compile("^\s*LETTER .* to .*$")
    },
    141: {
        'start_line': 40,
        'end_line': 15376,
        'chapter': re.compile("^CHAPTER\s+{}$".format(roman))
    },
    121: {
        'start_line': 57,
        'end_line': 7874,
        'chapter': re.compile("^CHAPTER\s+\d+$")
    },
    105: {
        'start_line': 48,
        'end_line': 8360,
        'chapter': re.compile("^Chapter\s+\d+$")
    },
    1342: {
        'start_line': 37,
        'end_line': 13061,
        'chapter': re.compile("^Chapter\s+\d+$")
    },
    161: {
        'start_line': 43,
        'end_line': 12654,
        'chapter': re.compile("^CHAPTER\s+\d+$")          
    },
    15422: {
        'start_line': 193,
        'end_line': 7501,
        'chapter': re.compile("^\s*CHAPTER\s+{}\.".format(roman))
    },
    13720: {
        'start_line': 187,
        'end_line': 11470,
        'chapter': re.compile("^\s*CHAPTER\s+{}\s*$".format(roman))
    },
    13721: {
        'start_line': 164,
        'end_line': 13135,
        'chapter': re.compile("^\s*CHAPTER\s+{}\s*$".format(roman))
    },
    2701: {
        'start_line': 52,
        'end_line': 21743,
        'chapter': re.compile("^(ETYMOLOGY|EXTRACTS|CHAPTER)")
    },
    4045: {
        'start_line': 138,
        'end_line': 11655,
        'volume': re.compile("^\s*PART\s+{}\s*$".format(roman)),
        'chapter': re.compile("^\s*CHAPTER\s+{}\.\s*$".format(roman))
    },
    34970: {
        'start_line': 234,
        'end_line': 16267,
        'volume': re.compile("^\s*BOOK\s+{}\.\s*$".format(roman)),
        'chapter': re.compile("^\s*{}\.\s*$".format(roman))
    },
    8118: {
        'start_line': 142,
        'end_line': 12300,
        'chapter': re.compile("^\s*{}\. .*$".format(roman))
    },
    53861: {
        'start_line': 129,
        'end_line': 6904,
        'chapter': re.compile('^\s*{}\s*$'.format(caps))
    },
    21816: {
        'start_line': 309,
        'end_line': 11023,
        'chapter': re.compile('^CHAPTER\s+{}\.?$'.format(roman))
    },
    15859: {
        'start_line': 77,
        'end_line': 8619,
        'chapter': re.compile('^\s*{}\s*$'.format(caps))
    },
    1900: {
        'start_line': 43,
        'end_line': 11216,
        'chapter': re.compile("^CHAPTER\s+\w+\s*$")
    },
    10712: {
        'start_line': 205,
        'end_line': 15487,
        'chapter': re.compile("^CHAPTER\s+{}\.\s*$".format(roman))
    }
}
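
Before running the full pipeline, it can help to try a pattern against a few lines. The sketch below rebuilds the chapter pattern used for book 158, written as a raw string so that `\s` is not treated as a string escape, and applies it to some hypothetical sample lines (these are made up for illustration, not taken from the files).

```python
import re

roman = '[IVXLCM]+'

# Same construction as the 'chapter' pattern for book 158, as a raw string
chapter_pat = re.compile(r'^\s*CHAPTER\s+{}\s*$'.format(roman))

# Hypothetical sample lines, not taken from the actual text
sample_lines = [
    'VOLUME I',
    'CHAPTER I',
    '  CHAPTER XIV  ',
    'Chapter 2',
    'CHAPTER IV.',
    'An ordinary line of prose',
]

hits = [line for line in sample_lines if chapter_pat.match(line)]
print(hits)
```

Only the second and third lines match. Variants such as `Chapter 2` or `CHAPTER IV.` need their own patterns, which is why each book gets its own entry.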

# Register and Chunk

In [11]:
def acquire_epubs(epub_list, chap_pats, OHCO=OHCO):
    
    my_lib = []
    my_doc = []

    for epub_file in epub_list:
        
        # Get PG ID from filename
        book_id = int(epub_file.split('-')[-1].split('.')[0].replace('pg',''))
        print("BOOK ID", book_id)
        
        # Import file as lines
        with open(epub_file, 'r', encoding='utf-8-sig') as f:
            lines = f.readlines()
        df = pd.DataFrame(lines, columns=['line_str'])
        df.index.name = 'line_num'
        df.line_str = df.line_str.str.strip()
        df['book_id'] = book_id
        
        # FIX CHARACTERS TO IMPROVE TOKENIZATION
        df.line_str = df.line_str.str.replace('—', ' — ')
        df.line_str = df.line_str.str.replace('-', ' - ')
        
        # Get book title and put into LIB table -- note problems, though
        book_title = re.sub(r"The Project Gutenberg eBook( of|,) ", "", df.loc[0].line_str, flags=re.IGNORECASE)
        book_title = re.sub(r"Project Gutenberg's ", "", book_title, flags=re.IGNORECASE)
        
        # Remove cruft
        a = chap_pats[book_id]['start_line'] - 1
        b = chap_pats[book_id]['end_line'] + 1
        df = df.iloc[a:b].copy() # Copy so later column assignments do not target a slice
        
        # Chunk by chapter
        chap_lines = df.line_str.str.match(chap_pats[book_id]['chapter'])
        chap_nums = [i+1 for i in range(df.loc[chap_lines].shape[0])]
        df.loc[chap_lines, 'chap_num'] = chap_nums
        df.chap_num = df.chap_num.ffill()

        # Clean up
        df = df[~df.chap_num.isna()] # Remove everything before Chapter 1
        df = df.loc[~chap_lines] # Remove chapter heading lines
        df['chap_num'] = df['chap_num'].astype('int')
        
        # Group -- Note that we exclude the book level in the OHCO at this point
        df = df.groupby(OHCO[1:2]).line_str.apply(lambda x: '\n'.join(x)).to_frame() # Make big string
        
        # Split into paragraphs
        df = df['line_str'].str.split(r'\n\n+', expand=True).stack().to_frame().rename(columns={0:'para_str'})
        df.index.names = OHCO[1:3] # MAY NOT BE NECESSARY UNTIL THE END
        df['para_str'] = df['para_str'].str.replace('\n', ' ').str.strip() # Literal newline; works whether or not regex is the default
        df = df[~df['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs
        
        # Set index
        df['book_id'] = book_id
        df = df.reset_index().set_index(OHCO[:3])

        # Register
        my_lib.append((book_id, book_title, epub_file))
        my_doc.append(df)

    docs = pd.concat(my_doc)
    library = pd.DataFrame(my_lib, columns=['book_id', 'book_title', 'book_file']).set_index('book_id')
    return library, docs
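
The chapter chunking step in `acquire_epubs` can be hard to follow inside the loop. Here is a minimal sketch of the same idea on a toy table: flag the heading lines, number them, forward-fill the numbers, then drop the front matter and the headings. The lines and the pattern are hypothetical stand-ins, not data from the corpus.

```python
import pandas as pd

# Hypothetical lines standing in for one book's text
lines = ['Front matter', 'CHAPTER I', 'First line.', 'Second line.',
         'CHAPTER II', 'Third line.']
df = pd.DataFrame({'line_str': lines})
df.index.name = 'line_num'

# Flag heading lines, number them, and carry each number forward
chap_lines = df.line_str.str.match(r'^CHAPTER\s+[IVXLCM]+$')
df.loc[chap_lines, 'chap_num'] = list(range(1, int(chap_lines.sum()) + 1))
df['chap_num'] = df.chap_num.ffill()

# Drop front matter (no chapter number yet) and the heading lines themselves
df = df[df.chap_num.notna() & ~chap_lines].copy()
df['chap_num'] = df.chap_num.astype(int)

# One big string per chapter, as in the grouping step
chapters = df.groupby('chap_num').line_str.apply('\n'.join)
print(chapters)
```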

In [7]:
epubs = sorted(glob(epub_dir+'/*.txt'))
LIB, DOC = acquire_epubs(epubs, chap_pats)

BOOK ID 158
BOOK ID 946
BOOK ID 1212
BOOK ID 141
BOOK ID 121
BOOK ID 105
BOOK ID 1342
BOOK ID 161
BOOK ID 15422
BOOK ID 13720
BOOK ID 13721
BOOK ID 2701
BOOK ID 4045
BOOK ID 34970
BOOK ID 8118
BOOK ID 53861
BOOK ID 21816
BOOK ID 15859
BOOK ID 1900
BOOK ID 10712
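
The book IDs printed above come from the file names. The sketch below isolates that expression from `acquire_epubs` and applies it to two hypothetical file names; the real names in `epubs/` may look different, but the expression assumes they end in `-pg<ID>.txt`.

```python
def pg_id_from_filename(path):
    """Extract the Project Gutenberg ID using the same expression as acquire_epubs."""
    return int(path.split('-')[-1].split('.')[0].replace('pg', ''))

# Hypothetical file names following the assumed '<anything>-pg<ID>.txt' shape
examples = [
    'epubs/AUSTEN_JANE_EMMA-pg158.txt',
    'epubs/MELVILLE_HERMAN_MOBY_DICK-pg2701.txt',
]

ids = [pg_id_from_filename(p) for p in examples]
print(ids)
```

Because the expression splits on the last hyphen, hyphens earlier in the name are harmless, but a file without the `-pg<ID>` suffix would raise a `ValueError`.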


In [8]:
DOC.sample(10)

book_id,chap_num,para_num,para_str
13721,28,17,"""Peace! everlasting foes,"" cried Media, interp..."
21816,25,3,"It was in the semicircular porch of a cabin, o..."
1342,43,71,Elizabeth excused herself as well as she could...
2701,122,26,The tableau all waned at last with the pallidn...
21816,2,29,"As pine, beech, birch, ash, hackmatack, hemloc..."
4045,76,1,UPON arriving home we fully laid open to Po - ...
2701,90,15,"Stripped to our shirts and drawers, we sprang ..."
10712,20,6,"During warm nights in the Tropics, your hammoc..."
34970,50,10,But her gentler sex returned to Isabel at last...
13721,32,39,"""Shall I adjourn the court then, my lord?"" sai..."


# Tokenize

We use NLTK this time. Note that this process takes some time, mainly because the NLTK functions are not optimized for dataframes.

In [9]:
def tokenize(doc_df, remove_pos_tuple=False, OHCO=OHCO):
    
    # Paragraphs to Sentences
    df = doc_df.para_str\
        .apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame()\
        .rename(columns={0:'sent_str'})
    
    # Sentences to Tokens
    # .apply(lambda x: pd.Series(nltk.pos_tag(nltk.word_tokenize(x))))\
    df = df.sent_str\
        .apply(lambda x: pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x))))\
        .stack()\
        .to_frame()\
        .rename(columns={0:'pos_tuple'})
    
    # Grab info from tuple
    df['pos'] = df.pos_tuple.apply(lambda x: x[1])
    df['token_str'] = df.pos_tuple.apply(lambda x: x[0])
    if remove_pos_tuple:
        df = df.drop(columns='pos_tuple')
    
    # Add index
    df.index.names = OHCO
    
    return df
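
To see how the OHCO index grows from paragraphs down to tokens without running NLTK, here is a plain-Python sketch. It swaps in a naive regex sentence splitter and `str.split()` in place of `nltk.sent_tokenize` and the whitespace tokenizer, and it skips POS tagging entirely. The paragraph text and its key are hypothetical.

```python
import re

# Hypothetical paragraph keyed by (book_id, chap_num, para_num)
paras = {
    (1, 1, 1): "The sea was calm. The ship sailed on.",
}

tokens = {}
for (book_id, chap_num, para_num), para_str in paras.items():
    # Naive stand-in for nltk.sent_tokenize: split after ., ! or ?
    sents = re.split(r'(?<=[.!?])\s+', para_str)
    for sent_num, sent_str in enumerate(sents):
        # Whitespace tokenization keeps punctuation attached to the word
        for token_num, token_str in enumerate(sent_str.split()):
            key = (book_id, chap_num, para_num, sent_num, token_num)
            tokens[key] = token_str

for key, token_str in tokens.items():
    print(key, token_str)
```

Each key is a full OHCO tuple, with `sent_num` and `token_num` counted from zero as in the `TOKEN` table. Note that tokens such as `calm.` keep their punctuation, which is the same effect visible in `Woodhouse,` in the output of `TOKEN.head()`.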

In [10]:
TOKEN = tokenize(DOC)

In [12]:
TOKEN.head()

book_id,chap_num,para_num,sent_num,token_num,pos_tuple,pos,token_str
158,1,1,0,0,"(Emma, NNP)",NNP,Emma
158,1,1,0,1,"(Woodhouse,, NNP)",NNP,"Woodhouse,"
158,1,1,0,2,"(handsome,, NN)",NN,"handsome,"
158,1,1,0,3,"(clever,, NN)",NN,"clever,"
158,1,1,0,4,"(and, CC)",CC,and


In [13]:
TOKEN[TOKEN.pos.str.match('^JJ')]

book_id,chap_num,para_num,sent_num,token_num,pos_tuple,pos,token_str
158,1,1,0,8,"(comfortable, JJ)",JJ,comfortable
158,1,1,0,11,"(happy, JJ)",JJ,happy
158,1,1,0,19,"(best, JJS)",JJS,best
158,1,1,0,27,"(twenty, JJ)",JJ,twenty
158,1,1,0,36,"(little, JJ)",JJ,little
158,1,2,0,3,"(youngest, JJS)",JJS,youngest
158,1,2,0,11,"(affectionate,, JJ)",JJ,"affectionate,"
158,1,2,0,30,"(early, JJ)",JJ,early
158,1,2,1,11,"(more, JJR)",JJR,more
158,1,2,1,14,"(indistinct, JJ)",JJ,indistinct


# Reduce

Extract a vocabulary from the TOKEN table.

In [14]:
TOKEN['term_str'] = TOKEN['token_str'].str.lower().str.replace(r'[\W_]', '', regex=True)

In [15]:
VOCAB = TOKEN.term_str.value_counts().sort_index()\
    .rename_axis('term_str').reset_index(name='n')
VOCAB.index.name = 'term_id'

In [16]:
VOCAB['num'] = VOCAB.term_str.str.match(r"\d+").astype('int')

In [17]:
VOCAB

term_id,term_str,n,num
0,,50493,0
1,0,2,1
2,1,18,1
3,10,6,1
4,100,2,1
5,1000,2,1
6,10000,3,1
7,1000000,1,1
8,10000000,1,1
9,10440,1,1
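
Row 0 of the VOCAB table has an empty `term_str` with a large count. A likely explanation is that punctuation-only tokens, such as the dashes that the acquisition step pads with spaces, are reduced to empty strings by the normalization. The sketch below applies the same lowercasing and `[\W_]` removal with the standard `re` module to a few tokens; the token list is a hypothetical sample.

```python
import re

# Hypothetical tokens; '\u2014' is an em dash, as padded out in acquire_epubs
tokens = ['Emma', 'Woodhouse,', 'handsome,', '\u2014', '-', "don't", 'Po']

# Same normalization as the term_str column: lowercase, drop non-word chars and underscores
terms = [re.sub(r'[\W_]', '', t.lower()) for t in tokens]
print(terms)

# Punctuation-only tokens collapse to the empty string
n_empty = sum(1 for t in terms if t == '')
print(n_empty)
```

If the empty term is not wanted in later analysis, one option is to drop it from VOCAB before computing statistics.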


# Annotate (VOCAB)

## Add Stopwords

We use NLTK's built-in stopword list for English. Note that we can add terms to this list or remove terms from it, or create our own list and keep it in our data model.

In [599]:
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1

In [600]:
sw.sample(10)

term_str,dummy
ourselves,1
whom,1
more,1
at,1
once,1
mightn,1
won,1
my,1
these,1
shouldn't,1


In [601]:
VOCAB['stop'] = VOCAB.term_str.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')
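
Adding to or removing from the stopword list can be sketched with plain Python sets. The base set below is only a small subset of the NLTK list (taken from the sample shown above), and the additions and removals are hypothetical choices for illustration.

```python
# A few entries from the NLTK English list, as seen in the sample above
base_stopwords = {'ourselves', 'whom', 'more', 'at', 'once', 'my', 'these'}

# Hypothetical customizations
extra = {'thee', 'thou'}   # terms we decide to treat as stopwords
keep = {'more'}            # terms we decide to keep as content words

custom_stopwords = (base_stopwords | extra) - keep

# Flag a few vocabulary terms the same way the 'stop' column does (1 or 0)
vocab_terms = ['more', 'thou', 'whale', 'my']
stop_flags = {term: int(term in custom_stopwords) for term in vocab_terms}
print(stop_flags)
```

The resulting set could be turned into a table like `sw` and mapped onto VOCAB in the same way as the NLTK list.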

In [606]:
VOCAB[VOCAB.stop == 1].sample(10)

term_id,term_str,n,num,stop,p_stem
24083,now,5773,0,1,now
16913,him,9194,0,1,him
34853,t,30,0,1,t
35475,these,2704,0,1,these
23982,nor,1219,0,1,nor
24504,only,3398,0,1,onli
17310,how,3070,0,1,how
16734,her,17020,0,1,her
32652,so,9843,0,1,so
10906,down,2451,0,1,down


## Add Stems

In [593]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
VOCAB['p_stem'] = VOCAB.term_str.apply(stemmer.stem)

In [594]:
VOCAB.sample(10)

term_id,term_str,n,num,stop,p_stem
33307,sprig,1,False,0,sprig
38270,upbraids,2,False,0,upbraid
36640,truxills,1,False,0,truxil
2161,askance,4,False,0,askanc
35953,tolerantly,1,False,0,tolerantli
11835,emphatic,6,False,0,emphat
7224,concernments,2,False,0,concern
3925,blocks,42,False,0,block
13099,fallen,102,False,0,fallen
37903,unrecorded,4,False,0,unrecord
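
One use of the `p_stem` column is to pool counts for terms that share a stem. The sketch below does this with a `Counter`. The first three rows reuse values from the sample above; the last two rows are hypothetical and are included only so that some stems have more than one term.

```python
from collections import Counter

# (term_str, n, p_stem)
vocab_rows = [
    ('upbraids', 2, 'upbraid'),      # from the sample above
    ('blocks', 42, 'block'),         # from the sample above
    ('concernments', 2, 'concern'),  # from the sample above
    ('block', 5, 'block'),           # hypothetical row
    ('concern', 7, 'concern'),       # hypothetical row
]

# Sum term counts under each stem
stem_counts = Counter()
for term_str, n, p_stem in vocab_rows:
    stem_counts[p_stem] += n

print(stem_counts.most_common())
```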


# Save

In [643]:
DOC.to_csv('DOC.csv')
LIB.to_csv('LIB.csv')
VOCAB.to_csv('VOCAB.csv')
TOKEN.to_csv('TOKEN.csv')
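
When these files are read back, the OHCO index has to be restored explicitly, since CSV stores index levels as ordinary columns. The sketch below round-trips a two-row stand-in for TOKEN through an in-memory buffer instead of a file; the rows are copied from the `TOKEN.head()` output, and `pos_tuple` is left out because tuples would come back as plain strings.

```python
import io
import pandas as pd

OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']

# Two rows standing in for the full TOKEN table
rows = [
    (158, 1, 1, 0, 0, 'NNP', 'Emma'),
    (158, 1, 1, 0, 1, 'NNP', 'Woodhouse,'),
]
TOKEN = pd.DataFrame(rows, columns=OHCO + ['pos', 'token_str']).set_index(OHCO)

# Write to an in-memory buffer instead of 'TOKEN.csv'
buffer = io.StringIO()
TOKEN.to_csv(buffer)
buffer.seek(0)

# Restore the multi-level index on read
restored = pd.read_csv(buffer, index_col=OHCO)
print(restored)
```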

# Appendix: Explore NER tagging

In [615]:
sents = TOKEN.groupby(OHCO[:4]).apply(lambda x: x.token_str.str.cat(sep=' '))\
    .to_frame().rename(columns={0:'sent_str'})

In [618]:
sents.head()

book_id,chap_num,para_num,sent_num,sent_str
105,1,1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
105,1,1,1,This was the page at which the favourite volum...
105,1,2,0,"""ELLIOT OF KELLYNCH HALL."
105,1,3,0,"""Walter Elliot, born March 1, 1760, married, J..."
105,1,3,1,"of South Park, in the county of Gloucester, by..."


In [641]:
for sent in sents.sample(10).sent_str.values:

    print(sent)
    print()
    
    x = nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(sent))
    print(x)
    print()
    
    y = nltk.ne_chunk(x)
    print(y)
    print('-' * 80)

Then arranging his person in the basket, he gave the word for them to hoist him to his perch, Starbuck being the one who secured the rope at last; and afterwards stood near it.

[('Then', 'RB'), ('arranging', 'VBG'), ('his', 'PRP$'), ('person', 'NN'), ('in', 'IN'), ('the', 'DT'), ('basket,', 'NN'), ('he', 'PRP'), ('gave', 'VBD'), ('the', 'DT'), ('word', 'NN'), ('for', 'IN'), ('them', 'PRP'), ('to', 'TO'), ('hoist', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('his', 'PRP$'), ('perch,', 'NN'), ('Starbuck', 'NNP'), ('being', 'VBG'), ('the', 'DT'), ('one', 'NN'), ('who', 'WP'), ('secured', 'VBD'), ('the', 'DT'), ('rope', 'NN'), ('at', 'IN'), ('last;', 'NN'), ('and', 'CC'), ('afterwards', 'NNS'), ('stood', 'VBD'), ('near', 'IN'), ('it.', 'NN')]

(S
  Then/RB
  arranging/VBG
  his/PRP$
  person/NN
  in/IN
  the/DT
  basket,/NN
  he/PRP
  gave/VBD
  the/DT
  word/NN
  for/IN
  them/PRP
  to/TO
  hoist/VB
  him/PRP
  to/TO
  his/PRP$
  perch,/NN
  Starbuck/NNP
  being/VBG
  the/DT
  one/NN
  who/WP


# POS Tagset

This is a token-level feature, not a vocab feature.
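
To make the distinction concrete: the same term can receive different tags in different sentences, so a tag belongs to each token occurrence. If a single tag per term is wanted in VOCAB, one option is to take the most frequent one. The sketch below does this with the standard library; the tagged pairs are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (term_str, pos) pairs, one per token occurrence
tagged = [
    ('will', 'MD'),
    ('will', 'NN'),
    ('will', 'MD'),
    ('whale', 'NN'),
]

# Count tags per term
pos_by_term = defaultdict(Counter)
for term_str, pos in tagged:
    pos_by_term[term_str][pos] += 1

# Most frequent tag for each term
max_pos = {term: counts.most_common(1)[0][0] for term, counts in pos_by_term.items()}
print(max_pos)
```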

In [15]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or