# NLTK tokenize pos lemmatize (en)

Code block that tokenizes, part-of-speech tags, and lemmatizes a text field in the input data frame using NLTK. First, the text is split into sentences with the English Punkt sentence tokenizer. Next, each sentence is split into words with the `TreebankWordTokenizer`. The words are then tagged with `nltk.pos_tag` and lemmatized with the `WordNetLemmatizer`.

The returned data frame contains the following fields:

* `token` (`str`): the word extracted by the tokenizer
* `tokennumber` (`int`): the zero-based index of the token in the sentence
* `sentencenumber` (`int`): the zero-based index of the sentence in the text
* `textid` (`str`): the text id
* `pos` (`str`): the part of speech tag of the word extracted by the tokenizer
* `lemma` (`str`): the lemma of the word extracted by the tokenizer

For example,

```
Row(token='Dear', tokennumber=0, sentencenumber=0, textid='text1', pos='NNP', lemma='Dear')
Row(token='phillip', tokennumber=1, sentencenumber=0, textid='text1', pos='NN', lemma='phillip')
Row(token=',', tokennumber=2, sentencenumber=0, textid='text1', pos=',', lemma=',')
```
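
Both index fields start at 0, and `tokennumber` restarts for every sentence. A minimal sketch of how the indexes are assigned is given here. It uses crude stand-ins for the NLTK tokenizers (a regex split on sentence-final punctuation and a whitespace split), and the names `IndexedToken` and `naive_index_tokens` are hypothetical, chosen only for this illustration:

```python
import re
from collections import namedtuple

# Same layout as the returned data frame, minus pos and lemma.
IndexedToken = namedtuple(
    'IndexedToken', ('token', 'tokennumber', 'sentencenumber', 'textid'))

def naive_index_tokens(textid, text):
    """Assign sentence and token indexes using crude stand-in tokenizers."""
    # Stand-in for the NLTK sentence tokenizer: split after '.', '!' or '?'.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    rows = []
    for i, sentence in enumerate(sentences):
        # Stand-in for TreebankWordTokenizer: plain whitespace split.
        for j, token in enumerate(sentence.split()):
            rows.append(IndexedToken(token, j, i, textid))
    return rows

rows = naive_index_tokens('text2', 'I received the drawings. They look good.')
for row in rows:
    print(row)
```

Unlike the real tokenizers, the whitespace split leaves punctuation attached to the preceding word, so the token boundaries here differ from the block's actual output.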

This block requires `nltk` and the NLTK data packages to be installed.

```
conda install nltk
python -m nltk.downloader all
```

**Please note**: This block requires Python 3 (in this environment, `nltk` is only installed under Python 3).

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext('local[*]')

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [2]:
data = [('text1', """Dear phillip,


This e-mail is automated notification of the availability of your
current Natural Gas Intelligence Newsletter(s). Please use your
username of "pallen" and your password to access


       NGI's Daily Gas Price Index


  http://intelligencepress.com/subscribers/index.html

If you have forgotten your password please visit
  http://intelligencepress.com/password.html
and we will send it to you.

If you would like to stop receiving e-mail notifications when your
publications are available, please reply to this message with
REMOVE E-MAIL in the subject line.

Thank you for your subscription.

For information about Intelligence Press products and services,
visit our web site at http://intelligencepress.com or
call toll-free (800) 427-5747.

ALL RIGHTS RESERVED. (c) 2000, Intelligence Press, Inc.
---"""),
           ('text2', """George,

 I received the drawings.  They look good at first glance.  I will look at 
them in depth this weekend.  The proforma was in the winmail.dat format which 
I cannot open.  Please resend in excel or a pdf format.  If you will send it 
to pallen70@hotmail.com, I will be able to look at it this weekend.  Does 
this file have a timeline for the investment dollars?  I just want to get a 
feel for when you will start needing money.  

 
Phillip""")]
data_in = sqlContext.createDataFrame(data)
data_in

DataFrame[_1: string, _2: string]

## Arguments

* `textid_field_index` (`int`): the index of the textid field in the row of the input data frame
* `text_field_index` (`int`): the index of the text field in the row of the input data frame; this field contains the text to tokenize

In [3]:
textid_field_index = 0
text_field_index = 1

## Block

In [4]:
import nltk.data
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer

from collections import namedtuple

fields = ('token', 'tokennumber', 'sentencenumber', 'textid', 'pos', 'lemma')
Token = namedtuple('Token', fields)

# convert penn treebank pos tags to wordnet pos tags
# source: http://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word
# adjective tags (J*) map to 'a'; N*, R* and V* tags map to their first letter; everything else falls back to 'n'
wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['j', 'n', 'r', 'v'] else 'n'

def tokenize_text(row):
    text = row[text_field_index]
    textid = row[textid_field_index]
  
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    word_tokenizer = TreebankWordTokenizer()
    lemmatizer = WordNetLemmatizer()
    
    tokens = []
    
    for i, s in enumerate(sent_tokenizer.tokenize(text)):
        sentence_pos = nltk.pos_tag(word_tokenizer.tokenize(s))
        for j, (token, pos) in enumerate(sentence_pos):
            lemma = lemmatizer.lemmatize(token, pos=wnpos(pos))
            tokens.append(Token(token, j, i, textid, pos, lemma))
    return tokens

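# go through the underlying RDD explicitly; newer Spark versions do not offer flatMap on DataFrame
data_out = data_in.rdd.flatMap(tokenize_text).toDF()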
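
The tag conversion that `wnpos` is meant to perform can be read more easily as a plain function. The following is a minimal sketch under the hypothetical name `penn_to_wordnet`. It needs no NLTK install and only illustrates the intended mapping from Penn Treebank tags to the WordNet letters `a`, `n`, `r` and `v`:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    first = tag[:1].lower()
    if first == 'j':
        return 'a'      # JJ, JJR, JJS: adjectives
    if first in ('n', 'r', 'v'):
        return first    # tags starting with N, R or V keep their first letter
    return 'n'          # anything else (DT, IN, punctuation, ...) falls back to noun

for tag in ('JJ', 'NN', 'RB', 'VBZ', 'DT', ','):
    print(tag, '->', penn_to_wordnet(tag))
```

The noun fallback matters because the lemmatizer is called with a `pos` argument for every token, including determiners and punctuation, which have no WordNet category of their own.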

In [5]:
print(data_out.count())
print(data_out.take(10))
data_out

231
[Row(token='Dear', tokennumber=0, sentencenumber=0, textid='text1', pos='NNP', lemma='Dear'), Row(token='phillip', tokennumber=1, sentencenumber=0, textid='text1', pos='NN', lemma='phillip'), Row(token=',', tokennumber=2, sentencenumber=0, textid='text1', pos=',', lemma=','), Row(token='This', tokennumber=3, sentencenumber=0, textid='text1', pos='DT', lemma='This'), Row(token='e-mail', tokennumber=4, sentencenumber=0, textid='text1', pos='NN', lemma='e-mail'), Row(token='is', tokennumber=5, sentencenumber=0, textid='text1', pos='VBZ', lemma='be'), Row(token='automated', tokennumber=6, sentencenumber=0, textid='text1', pos='VBN', lemma='automate'), Row(token='notification', tokennumber=7, sentencenumber=0, textid='text1', pos='NN', lemma='notification'), Row(token='of', tokennumber=8, sentencenumber=0, textid='text1', pos='IN', lemma='of'), Row(token='the', tokennumber=9, sentencenumber=0, textid='text1', pos='DT', lemma='the')]


DataFrame[token: string, tokennumber: bigint, sentencenumber: bigint, textid: string, pos: string, lemma: string]
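
Because every row carries `textid`, `sentencenumber` and `tokennumber`, the lemmas can be regrouped per sentence once the rows are collected. This is a minimal sketch that uses plain namedtuples copied from the output above in place of Spark `Row` objects. The helper name `lemmas_by_sentence` is hypothetical:

```python
from collections import OrderedDict, namedtuple

Token = namedtuple(
    'Token', ('token', 'tokennumber', 'sentencenumber', 'textid', 'pos', 'lemma'))

# First rows of the output shown above, as plain tuples instead of Spark Rows.
rows = [
    Token('Dear', 0, 0, 'text1', 'NNP', 'Dear'),
    Token('phillip', 1, 0, 'text1', 'NN', 'phillip'),
    Token(',', 2, 0, 'text1', ',', ','),
    Token('This', 3, 0, 'text1', 'DT', 'This'),
    Token('e-mail', 4, 0, 'text1', 'NN', 'e-mail'),
    Token('is', 5, 0, 'text1', 'VBZ', 'be'),
    Token('automated', 6, 0, 'text1', 'VBN', 'automate'),
]

def lemmas_by_sentence(rows):
    """Group lemmas by (textid, sentencenumber), ordered by tokennumber."""
    grouped = OrderedDict()
    ordered = sorted(
        rows, key=lambda r: (r.textid, r.sentencenumber, r.tokennumber))
    for row in ordered:
        grouped.setdefault((row.textid, row.sentencenumber), []).append(row.lemma)
    return grouped

sentences = lemmas_by_sentence(rows)
print(sentences[('text1', 0)])
```

With the real data frame, `data_out.collect()` could supply the rows, assuming the result is small enough to fit in driver memory.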