# NLTK: Data Engineering for Unstructured Text
#### Katon Minhas

Using the Amazon musical instrument review data on Kaggle (https://www.kaggle.com/eswarchandt/amazon-music-reviews), retrieve the summary column and perform tokenization, stemming, and lemmatization.

### Retrieve the summary column

In [1]:
# import libraries

import numpy as np
import pandas as pd
import nltk

# read in data
path = "C:/Users/Katon/Documents/JHU/CreatingAIEnabledSystems/Assignment3/Musical_instruments_reviews.csv"
mir = pd.read_csv(path)
summary = mir['summary']  # get summary column

# string concatenate all summary rows to a single string
summary = summary.str.cat(sep=' ')

# print first 500 characters of resulting string
print(summary[0:500])


good Jake It Does The Job Well GOOD WINDSCREEN FOR THE MONEY No more pops when I record my vocals. The Best Cable Monster Standard 100 - 21' Instrument Cable Didn't fit my 1996 Fender Strat... Great cable Best Instrument Cables On The Market One of the best instrument cables within the brand It works great but I hardly use it. HAS TO GET USE TO THE SIZE awesome It works! Definitely Not For The Seasoned Piano Player Durable Instrument Cable fender 18 ft. Cali clear... So far so good.  Will revisi


### Perform tokenization

In [2]:
# word tokenize
wtokens = nltk.word_tokenize(summary)

# casual tokenize
ctokens = nltk.casual_tokenize(summary)

# sentence tokenize
stokens = nltk.sent_tokenize(summary)


# print just first 100 tokens from each tokenization
print("word tokens = \n", wtokens[0:100], "\n\n")
print("casual tokens = ", ctokens[0:100], "\n\n")
print("sentence tokens = ", stokens[0:10], "\n\n") # sentences are longer, only take 10

word tokens = 
 ['good', 'Jake', 'It', 'Does', 'The', 'Job', 'Well', 'GOOD', 'WINDSCREEN', 'FOR', 'THE', 'MONEY', 'No', 'more', 'pops', 'when', 'I', 'record', 'my', 'vocals', '.', 'The', 'Best', 'Cable', 'Monster', 'Standard', '100', '-', '21', "'", 'Instrument', 'Cable', 'Did', "n't", 'fit', 'my', '1996', 'Fender', 'Strat', '...', 'Great', 'cable', 'Best', 'Instrument', 'Cables', 'On', 'The', 'Market', 'One', 'of', 'the', 'best', 'instrument', 'cables', 'within', 'the', 'brand', 'It', 'works', 'great', 'but', 'I', 'hardly', 'use', 'it', '.', 'HAS', 'TO', 'GET', 'USE', 'TO', 'THE', 'SIZE', 'awesome', 'It', 'works', '!', 'Definitely', 'Not', 'For', 'The', 'Seasoned', 'Piano', 'Player', 'Durable', 'Instrument', 'Cable', 'fender', '18', 'ft.', 'Cali', 'clear', '...', 'So', 'far', 'so', 'good', '.', 'Will', 'revisit'] 


casual tokens =  ['good', 'Jake', 'It', 'Does', 'The', 'Job', 'Well', 'GOOD', 'WINDSCREEN', 'FOR', 'THE', 'MONEY', 'No', 'more', 'pops', 'when', 'I', 'record', 'my', 'voca

#### Discussion
Three tokenization techniques were applied:

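Word Tokenization (`nltk.word_tokenize`) - Splits text into words and punctuation marks using Treebank-style rules rather than spaces alone, so contractions are split: "Didn't" becomes 'Did' and "n't"

Casual Tokenization (`nltk.casual_tokenize`) - Splits text into words with a tokenizer designed for informal text such as tweets; contractions like "Didn't" stay whole and emoticons are recognized as single tokens

Sentence Tokenization (`nltk.sent_tokenize`) - Splits text into sentences. Because the summaries were joined with spaces and many have no closing punctuation, several summaries can end up in one "sentence"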
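The difference between the word-level tokenizers is easiest to see on one summary from the data. The sketch below is a minimal illustration, not the notebook's own code: it uses `TreebankWordTokenizer` as a stand-in for `nltk.word_tokenize` (which applies closely related rules after sentence splitting and needs the `punkt` data download), and adds a plain `str.split` baseline for comparison.

```python
from nltk.tokenize import TreebankWordTokenizer, casual_tokenize

# One summary taken from the review data shown above
text = "Didn't fit my 1996 Fender Strat..."

# Naive baseline: split on whitespace only, punctuation stays attached
space_tokens = text.split()

# Treebank-style rules: contractions and punctuation are split off
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# Casual (tweet-oriented) rules: contractions stay whole
casual_tokens = casual_tokenize(text)

print("whitespace:", space_tokens)
print("treebank:  ", treebank_tokens)
print("casual:    ", casual_tokens)
```

Which behavior is preferable depends on the downstream step: splitting "n't" helps negation handling, while keeping "Didn't" whole gives a stemmer or lemmatizer one token to work with.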

##### Issues and Limitations:
1) Out-of-vocabulary words are included. The tokenizers are rule-based and make no attempt to handle words outside a known vocabulary

2) In both word and casual tokenization, punctuation and symbols are emitted as their own tokens

3) Duplicate words - not really an issue at this stage, but all duplicates are kept, including case variants such as 'good' and 'GOOD'
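If the punctuation tokens and duplicates noted above get in the way of later analysis, one possible cleanup is to lowercase the tokens, drop punctuation-only tokens, and count what remains. This is a minimal sketch using only the standard library; `normalize` is a hypothetical helper name, and the token list is a short excerpt of the output above rather than the full data.

```python
from collections import Counter
import string

# Short excerpt of the word tokens printed above
tokens = ['good', 'Jake', 'It', 'Does', 'The', 'Job', 'Well', 'GOOD',
          'WINDSCREEN', 'FOR', 'THE', 'MONEY', '.', 'The', 'Best', 'Cable']

def normalize(tokens):
    """Lowercase tokens and drop those made up only of punctuation."""
    return [t.lower() for t in tokens
            if not all(ch in string.punctuation for ch in t)]

# Count each distinct normalized token
counts = Counter(normalize(tokens))
print(counts.most_common(3))
```

Lowercasing merges 'The' and 'THE', which is usually what a frequency count wants, but it also discards capitalization that might mark a brand or product name.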


### Perform Stemming

In [3]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

# for stemming, use the casual tokenization ctokens
# create each stemmer once rather than once per token
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

# apply 3 different stemming functions
porter_stems = [porter.stem(t) for t in ctokens]
lanc_stems = [lancaster.stem(t) for t in ctokens]
snow_stems = [snowball.stem(t) for t in ctokens]

# print first 100 values
print("Porter Stemming: ", porter_stems[0:100], "\n\n")
print("Lancaster Stemming: ", lanc_stems[0:100], "\n\n")
print("Snowball Stemming: ", snow_stems[0:100], "\n\n")



Porter Stemming:  ['good', 'jake', 'It', 'doe', 'the', 'job', 'well', 'good', 'windscreen', 'for', 'the', 'money', 'No', 'more', 'pop', 'when', 'I', 'record', 'my', 'vocal', '.', 'the', 'best', 'cabl', 'monster', 'standard', '100', '-', '21', "'", 'instrument', 'cabl', "didn't", 'fit', 'my', '1996', 'fender', 'strat', '...', 'great', 'cabl', 'best', 'instrument', 'cabl', 'On', 'the', 'market', 'one', 'of', 'the', 'best', 'instrument', 'cabl', 'within', 'the', 'brand', 'It', 'work', 'great', 'but', 'I', 'hardli', 'use', 'it', '.', 'ha', 'TO', 'get', 'use', 'TO', 'the', 'size', 'awesom', 'It', 'work', '!', 'definit', 'not', 'for', 'the', 'season', 'piano', 'player', 'durabl', 'instrument', 'cabl', 'fender', '18', 'ft', '.', 'cali', 'clear', '...', 'So', 'far', 'so', 'good', '.', 'will', 'revisit'] 


Lancaster Stemming:  ['good', 'jak', 'it', 'doe', 'the', 'job', 'wel', 'good', 'windscreen', 'for', 'the', 'money', 'no', 'mor', 'pop', 'when', 'i', 'record', 'my', 'voc', '.', 'the', 'best'

#### Discussion
Three word stemming techniques were applied:

Porter Stemming: One of the oldest and most commonly used stemming algorithms

Lancaster Stemming: More aggressive than Porter

Snowball Stemming (Porter2): A revision of the Porter algorithm, written in Snowball, Porter's language for defining stemmers
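To compare the three stemmers directly, it can help to run them side by side on a handful of words. The sketch below assumes a small hand-picked word list drawn from the review summaries; the names `stemmers` and `table` are illustrative only.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Create each stemmer once and reuse it
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# A few words that appear in the review summaries
words = ["cables", "hardly", "definitely", "vocals"]

# Map each word to the stem produced by each algorithm
table = {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}

for w, stems in table.items():
    print(f"{w:12}", stems)
```

Printing the stems per word makes it easier to spot where one algorithm cuts deeper than another than scanning three separate 100-token lists.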

##### Issues and Limitations:
1) Aggressive overstemming (especially in Lancaster) can lead to incomprehensible output, such as 'voc' for 'vocals'

2) Doesn't consider word usage, so different words can be stemmed to the same result. For example, the verb in "He streamed the video" and the noun in "He sat by the stream" would both be stemmed to "stream"
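The second limitation can be made visible by grouping words under the stem they share. This is a small sketch with the Porter stemmer and an assumed word list; `groups` is an illustrative name, not part of the notebook.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Noun and verb forms that a stemmer cannot tell apart
words = ["stream", "streams", "streamed", "streaming"]

# Collect every surface form under its stem
groups = defaultdict(set)
for w in words:
    groups[porter.stem(w)].add(w)

for stem, forms in groups.items():
    print(stem, "<-", sorted(forms))
```

A group with many unrelated members is a sign of overstemming; here the forms are related, but the noun and verb senses can no longer be separated after stemming.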

### Perform Lemmatization

In [4]:
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()

lemmas = []
for f in wtokens:
    lemmas.append(lmtzr.lemmatize(f, 'v'))
    
print("Lemmatized Tokens: ", lemmas[0:100], "\n\n")

Lemmatized Tokens:  ['good', 'Jake', 'It', 'Does', 'The', 'Job', 'Well', 'GOOD', 'WINDSCREEN', 'FOR', 'THE', 'MONEY', 'No', 'more', 'pop', 'when', 'I', 'record', 'my', 'vocals', '.', 'The', 'Best', 'Cable', 'Monster', 'Standard', '100', '-', '21', "'", 'Instrument', 'Cable', 'Did', "n't", 'fit', 'my', '1996', 'Fender', 'Strat', '...', 'Great', 'cable', 'Best', 'Instrument', 'Cables', 'On', 'The', 'Market', 'One', 'of', 'the', 'best', 'instrument', 'cable', 'within', 'the', 'brand', 'It', 'work', 'great', 'but', 'I', 'hardly', 'use', 'it', '.', 'HAS', 'TO', 'GET', 'USE', 'TO', 'THE', 'SIZE', 'awesome', 'It', 'work', '!', 'Definitely', 'Not', 'For', 'The', 'Seasoned', 'Piano', 'Player', 'Durable', 'Instrument', 'Cable', 'fender', '18', 'ft.', 'Cali', 'clear', '...', 'So', 'far', 'so', 'good', '.', 'Will', 'revisit'] 




#### Discussion
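WordNetLemmatizer was used. It reduces each word to its dictionary form (lemma). Unlike stemming, it relies on WordNet's vocabulary and exception lists for a given part of speech, rather than simply removing letters from the string. The part of speech is not detected automatically; it is passed as an argument, and here every token was lemmatized as a verb ('v').
For example, as verbs, "go", "going", "went", and "gone" are all lemmatized to "go", even though the string "go" doesn't appear in the string "went".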

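##### Issues and Limitations
1) Removes verb tense and subject agreement, so the resulting sentences can be confusing

2) May incorrectly interpret word meaning. Because every token was treated as a verb, nouns such as 'vocals' are left unchanged, and capitalized tokens such as 'Does' and 'HAS' are not reduced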
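One way to reduce the second issue is to pass each token's own part of speech to the lemmatizer instead of 'v' for everything. The sketch below shows only the mapping step and is an assumption about how this could be wired up, not the notebook's method: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, the tagged tokens are hand-written, and a stand-in function replaces the real lemmatizer so the block runs without the WordNet data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also WordNetLemmatizer's default

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, Penn tag) pair.

    Lowercasing is an added assumption, since capitalized tokens
    were left unchanged in the output above.
    """
    return [lemmatize(word.lower(), penn_to_wordnet(tag))
            for word, tag in tagged_tokens]

# Stand-in for WordNetLemmatizer().lemmatize that just records its arguments
def show_call(word, pos):
    return f"{word}/{pos}"

# Hand-written (word, tag) pairs for illustration
tagged = [('It', 'PRP'), ('works', 'VBZ'), ('great', 'JJ'), ('Cables', 'NNS')]
demo = lemmatize_tagged(tagged, show_call)
print(demo)
```

In the notebook, the pairs could come from `nltk.pos_tag(wtokens)` (which needs its tagger data downloaded) and `lmtzr.lemmatize` could be passed in place of `show_call`.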
