# AIAC536 Assignment 1 - Text Classification and POS Tagging Using NLTK
This assignment is taken from the book *Hands-On Python Natural Language Processing* by Aman Kedia and Mayank Rasu.

## 1 Setting up NLTK
Run the following commands to install NLTK and NumPy:
```bash
sudo pip install -U nltk
sudo pip install -U numpy
```

Furthermore, run the block below to install required collections, text corpora, NLTK models, and packages:
- brown
- gutenberg
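- maxent_ne_chunker
- movie_reviews
- product_reviews_1
- punkt
- stopwords
- treebank
- twitter_samples
- universal_tagset
- webtext
- wordnet
- words


In [22]:
import nltk
nltk.download('brown')
nltk.download('gutenberg')
# The NE chunker's package id is 'maxent_ne_chunker'; the id 'max_ne_chunker' is not in the NLTK index
nltk.download('maxent_ne_chunker')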
nltk.download('movie_reviews')
nltk.download('product_reviews_1')
nltk.download('punkt')
nltk.download('treebank')
nltk.download('twitter_samples')
nltk.download('universal_tagset')
nltk.download('webtext')
nltk.download('wordnet')
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Error loading max_ne_chunker: Package 'max_ne_chunker' not
[nltk_data]     found in index
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package product_reviews_1 to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package product_reviews_1 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/kristian.aars/nltk_data...
[nltk_data]   Package treebank is already up-t

True

## Tokenization
First, we import some sample Twitter data.

In [6]:
import nltk
from nltk.corpus import twitter_samples as ts
ts.fileids()

samples_tw = ts.strings('tweets.20150430-223406.json')


### Word Tokenization

In [7]:
from nltk.tokenize import word_tokenize as wtoken

wtoken(samples_tw[20])

['@',
 'B0MBSKARE',
 'the',
 'anti-Scottish',
 'feeling',
 'is',
 'largely',
 'a',
 'product',
 'of',
 'Tory',
 'press',
 'scaremongering',
 '.',
 'In',
 'practice',
 'most',
 'people',
 'wo',
 "n't",
 'give',
 'a',
 'toss',
 '!']

### Punctuation Tokenization

In [8]:
from nltk.tokenize import wordpunct_tokenize as wptoken

wptoken(samples_tw[20])

['@',
 'B0MBSKARE',
 'the',
 'anti',
 '-',
 'Scottish',
 'feeling',
 'is',
 'largely',
 'a',
 'product',
 'of',
 'Tory',
 'press',
 'scaremongering',
 '.',
 'In',
 'practice',
 'most',
 'people',
 'won',
 "'",
 't',
 'give',
 'a',
 'toss',
 '!']
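
The split of `won't` into `won`, `'`, and `t` follows from how this tokenizer is defined. The block below is a minimal sketch, assuming `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+` that NLTK documents for its `WordPunctTokenizer`. It uses only the standard `re` module, the name `wordpunct_sketch` is hypothetical, and the sample string is a hand-typed fragment of the tweet rather than corpus data.

```python
import re

# Pattern assumed to mirror NLTK's WordPunctTokenizer: either a run of
# word characters, or a run of characters that are neither word
# characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT_PATTERN.findall(text)


fragment = "anti-Scottish people won't give a toss!"
tokens = wordpunct_sketch(fragment)
print(tokens)
```

Because the apostrophe and the hyphen are not word characters, each becomes a token of its own, which is why contractions and hyphenated words are broken apart.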

### Custom Tokenizer Using RegEx

In [14]:
from nltk import regexp_tokenize

## Tokens made of word characters: letters, digits, and underscores (punctuation is dropped)
pattern = r'\w+'
print(regexp_tokenize(samples_tw[20], pattern))

## Also keep exclamation marks, commas, and hyphens as tokens
pattern = r'\w+|[!,\-]'
print(regexp_tokenize(samples_tw[20], pattern))


['B0MBSKARE', 'the', 'anti', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss']
['B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss', '!']
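
Both patterns above break `won't` apart, and the first one also splits `anti-Scottish`. As a rough sketch of how a custom pattern could keep such words whole, the block below allows a hyphen or apostrophe between runs of word characters. The pattern and the name `tokenize_keep_inner` are illustrative assumptions, the example uses the standard `re` module instead of NLTK, and the input is a hand-typed fragment of the tweet.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# groups of "hyphen or apostrophe plus more word characters".
# The group is non-capturing so findall returns the whole match.
INNER_PUNCT_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def tokenize_keep_inner(text):
    """Tokenize text, keeping hyphenated words and contractions intact."""
    return INNER_PUNCT_PATTERN.findall(text)


fragment = "anti-Scottish feeling, most people won't give a toss!"
tokens = tokenize_keep_inner(fragment)
print(tokens)
```

Punctuation that is not inside a word (the comma and the exclamation mark here) is still dropped, as with the plain `\w+` pattern.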


## Stemming
Stemming reduces different forms of a word to a common stem, for example *walking* to its base form *walk*. This example uses the `PorterStemmer`.

In [17]:
from nltk.stem import PorterStemmer

stemming = PorterStemmer()

print(stemming.stem('walking'))
print(stemming.stem('enjoys'))
print(stemming.stem('enjoyable'))

walk
enjoy
enjoy


### Custom Stemmer With RegEx

In [18]:
from nltk import RegexpStemmer

# Strip a trailing 'able' or 'ing'; words shorter than min=4 characters are left unchanged
regexp_stemmer = RegexpStemmer("able$|ing$", min=4)
print(regexp_stemmer.stem('flyable'))
print(regexp_stemmer.stem('flying'))


fly
fly
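
To make the role of the pattern and the `min` argument concrete, here is a minimal sketch of regex-based suffix stripping written with the standard `re` module. It assumes the stemmer simply deletes whatever the pattern matches and leaves words shorter than the minimum length alone; the function name `regexp_stem_sketch` and the extra test words are hypothetical.

```python
import re


def regexp_stem_sketch(word, pattern, min_len=0):
    """Delete every match of pattern from word.

    Words shorter than min_len are returned unchanged.
    """
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


suffixes = r"able$|ing$"
stems = {
    word: regexp_stem_sketch(word, suffixes, min_len=4)
    for word in ["flyable", "flying", "ing", "king"]
}
for word, stem in stems.items():
    print(word, "->", stem)
```

The last word shows the main weakness of this approach: `king` is four characters long, so its `ing` is stripped and only `k` remains. A rule-based stemmer such as `PorterStemmer` guards against many of these cases.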


## Removing Stop Words

In [24]:
from nltk.corpus import stopwords

sw_l = stopwords.words('english')
example_test = 'This is an example sentence to test stopwords'

## Keep only words that are not in the stop word list.
## The comparison is case-sensitive and the list is lowercase, so 'This' is kept.
[word for word in example_test.split() if word not in sw_l]

['This', 'example', 'sentence', 'test', 'stopwords']
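
`This` survives the filter because the comparison is case-sensitive while the stop word list is lowercase. The sketch below shows one possible way to handle this by lowercasing each word before the lookup. To stay self-contained it uses a small hand-picked set named `STOP_WORDS`, which is an assumption for illustration and not the NLTK list.

```python
# Hypothetical, hand-picked subset of English stop words (not the NLTK list).
STOP_WORDS = {"this", "is", "an", "to", "the", "a"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the whitespace-separated words of text that are not stop words.

    Each word is lowercased for the lookup only; the original casing
    is preserved in the result.
    """
    return [word for word in text.split() if word.lower() not in stop_words]


sentence = "This is an example sentence to test stopwords"
filtered = remove_stop_words(sentence)
print(filtered)
```

Using a `set` also makes each membership test constant-time on average, which matters once the input is a whole book instead of a single sentence.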

Now, let's calculate the percentage of tokens in Shakespeare's *Hamlet* that remain after removing stop words.

In [28]:
from nltk.corpus import gutenberg

words_in_hamlet = gutenberg.words('shakespeare-hamlet.txt')
words_in_hamlet_wo_sw = [word for word in words_in_hamlet if word not in sw_l]

## Percentage of tokens that remain after stop word removal (not the stop word share)
len(words_in_hamlet_wo_sw) * 100.0 / len(words_in_hamlet)

69.26124197002142
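
The value above is the share of tokens that survive the filter; the stop word share is its complement. The sketch below works through the same arithmetic on a toy token list so both numbers can be checked by hand. The token list and the `STOP_WORDS` set are illustrative assumptions, not data from the Gutenberg corpus, and unlike the cell above the lookup is lowercased.

```python
# Hypothetical stop word subset and toy tokens, chosen so the result is easy to verify.
STOP_WORDS = frozenset({"to", "be", "or", "not", "that", "is", "the"})
tokens = ["To", "be", "or", "not", "to", "be", "that", "is", "the", "question"]


def kept_percentage(tokens, stop_words):
    """Percentage of tokens that are not stop words (case-insensitive lookup)."""
    kept = [token for token in tokens if token.lower() not in stop_words]
    return len(kept) * 100.0 / len(tokens)


kept_share = kept_percentage(tokens, STOP_WORDS)
stop_share = 100.0 - kept_share  # the two shares always sum to 100
print(kept_share, stop_share)
```

Only `question` is kept, so one token in ten remains and the other nine are stop words.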