# Homework 5 (Due Friday, 11:59pm PST, May 7th)

This homework is **optional, and worth 6 points**. These **six points will be added to your overall final homework average**. Any leftover points will be added to your midterm grade.

## Option 1: Build A Classification Model w/ Amazon

Build a classification model on the **Amazon toy reviews dataset** that predicts the sentiment of reviews in a hold-out set with at least 91% accuracy (do not round).

You may use as many samples as you wish (out of the original ~120,000 data points). However, **the class balance in your training and test set must be 50/50**.
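One way to satisfy the 50/50 requirement is to draw the same number of reviews from each class. The sketch below is only an illustration: `balanced_sample` is a hypothetical helper, and the two toy lists stand in for the good and poor review files.

```python
import numpy as np

def balanced_sample(good, bad, n_per_class, seed=2):
    """Draw n_per_class items from each list without replacement and build 0/1 labels."""
    rng = np.random.default_rng(seed)
    good_idx = rng.choice(len(good), size=n_per_class, replace=False)
    bad_idx = rng.choice(len(bad), size=n_per_class, replace=False)
    texts = [good[i] for i in good_idx] + [bad[i] for i in bad_idx]
    # 1 = positive review, 0 = negative review
    labels = np.concatenate([np.ones(n_per_class, dtype=int),
                             np.zeros(n_per_class, dtype=int)])
    return texts, labels

# Toy stand-ins for the two review files (hypothetical data).
good = [f"good review {i}" for i in range(10)]
bad = [f"bad review {i}" for i in range(6)]

texts, labels = balanced_sample(good, bad, n_per_class=4)
print(len(texts), labels.mean())  # a mean of 0.5 indicates a 50/50 balance
```

Sampling at random rather than taking the first `n` lines avoids depending on the order of the files, assuming that order carries no meaning.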

You will likely need to include some preprocessing techniques that we have learned about so far in this course.

If you are unable to achieve 91% accuracy, then please show in this notebook at least **3 different models** that you have tried (e.g., an RNN or LSTM with `word2vec`, `GloVe`, or `doc2vec` embeddings).

**Make sure to cite your sources if you use other people's code or ideas.**

In [1]:
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import wordnet
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical, plot_model  # plot_model is used by the model builders when plot=True
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from random import randint
from numpy import array, argmax, asarray, zeros
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import SimpleRNN, LSTM
from keras.layers import Flatten, Masking

In [2]:
np.random.seed(2)

In [3]:
# https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258
def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

### Import Data

In [4]:
with open("good_amazon_toy_reviews.txt") as f:
    good_reviews = f.readlines()
with open("poor_amazon_toy_reviews.txt") as f:
    bad_reviews = f.readlines()

num = 4000
sampled_good_reviews = good_reviews[:num]
sampled_bad_reviews = bad_reviews[:num]

reviews = sampled_good_reviews + sampled_bad_reviews
labels = np.concatenate([np.ones(num), np.zeros(num)])

### Remove punctuation and stopwords

In [5]:
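regex = re.compile(r'&#[0-9]{2,3};')  # numeric HTML entities such as &#34;
regex2 = re.compile(r'<br />')        # literal HTML line breaks
regex3 = re.compile(r'[^\w\s\d]')     # anything that is not a word character or whitespace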
reviews_re = []
for review in reviews:
    newline = regex.sub('',review)
    newline = regex2.sub('',newline)
    newline = regex3.sub('',newline)
    reviews_re.append(newline)
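
To see what the three patterns do, the sketch below applies them to a single made-up review string (the sample text is hypothetical, not taken from the dataset).

```python
import re

entity = re.compile(r'&#[0-9]{2,3};')   # numeric HTML entities such as &#34;
line_break = re.compile(r'<br />')       # literal HTML line breaks
punct = re.compile(r'[^\w\s\d]')         # anything that is not a word character or whitespace

sample = 'Great toy!<br />My son said &#34;wow&#34;, it&#39;s his favorite.'

# Same order as the cell above: entities, then line breaks, then punctuation.
cleaned = punct.sub('', line_break.sub('', entity.sub('', sample)))
print(cleaned)

# Variant that replaces the line break with a space instead of an empty string.
spaced = punct.sub('', line_break.sub(' ', entity.sub('', sample)))
print(spaced)
```

In this toy input the two sentences end up glued together as `toyMy` because `<br />` is replaced with an empty string. Substituting a single space keeps them apart, though that would change the preprocessing above and could shift the reported accuracy.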


In [6]:
stopword_set = set(stopwords.words('english'))

reviews_clean = []
for line in reviews_re:
    newline = []
    for word in word_tokenize(line):
        if word.lower() in stopword_set:
            continue
        else:
            newline.append(word.lower())
    review = ' '.join(newline)
    reviews_clean.append(review)


### Perform Lemmatization

In [7]:
lemmatizer = WordNetLemmatizer()

In [8]:
reviews_lemma = [' '.join(lemmatize_sentence(line)) for line in reviews_clean]

### Tokenization

In [9]:
max_num = 6000
tokenizer = Tokenizer(num_words=max_num, oov_token="UNKNOWN_TOKEN")
tokenizer.fit_on_texts(reviews_lemma)

In [10]:
encoded_docs = tokenizer.texts_to_sequences(reviews_lemma)
# each integer is the word's index in tokenizer.word_index

In [11]:
MAX_SEQUENCE_LENGTH = 64
padded_docs = pad_sequences(encoded_docs, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

In [12]:
padded_docs.shape

(8000, 64)
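
The padding step can be pictured with a small NumPy re-implementation. This is a sketch only: `pad_post` is a hypothetical helper that mimics `pad_sequences(..., padding='post')` together with its default `truncating='pre'`, and is not the Keras code itself.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Mimic pad_sequences(..., padding='post') with its default truncating='pre'."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # default truncation drops tokens from the front
        out[row, :len(kept)] = kept   # real tokens first, zeros fill the tail
    return out

padded = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(padded)
```

Short reviews are filled with zeros at the end, while reviews longer than `maxlen` lose their earliest tokens, so with `MAX_SEQUENCE_LENGTH = 64` only the last 64 tokens of a long review are kept.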

### Encode label and split data

In [13]:
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(labels))

In [14]:
X_train, X_test, y_train, y_test = train_test_split(padded_docs, labels, test_size=0.2, stratify = labels, random_state=1)

### Embedding

In [15]:
def load_glove_vectors():
    embeddings_index = {}
    with open('glove.6B.100d.txt') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))
    return embeddings_index


embeddings_index = load_glove_vectors()

Loaded 400000 word vectors.


In [16]:
VOCAB_SIZE = int(len(tokenizer.word_index) * 1.1)

In [17]:
embedding_matrix = zeros((VOCAB_SIZE, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
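
It can be useful to check how much of the vocabulary actually receives a pretrained vector, since every word without one is left as an all-zero row. The sketch below repeats the loop above on toy stand-ins (the `word_index` and `pretrained` dictionaries are hypothetical) and reports the coverage.

```python
import numpy as np

# Toy stand-ins (hypothetical): a tokenizer vocabulary and a tiny pretrained lookup table.
word_index = {'UNKNOWN_TOKEN': 1, 'toy': 2, 'love': 3, 'legoo': 4}
pretrained = {'toy': np.array([0.1, 0.2]), 'love': np.array([0.3, 0.4])}

matrix = np.zeros((len(word_index) + 1, 2))   # row 0 is reserved for padding
missing = []
for word, i in word_index.items():
    vec = pretrained.get(word)
    if vec is None:
        missing.append(word)      # this row stays all zeros
    else:
        matrix[i] = vec

coverage = 1 - len(missing) / len(word_index)
print(f'coverage: {coverage:.2f}, missing: {missing}')
```

A low coverage on the real data would suggest that misspellings or product names are being dropped, which may be worth knowing before tuning the model.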

### Models

In [18]:
def make_binary_classification_rnn_model(plot=False):
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    # Skip timesteps whose embedding vector is all zeros: padding positions and words without a GloVe vector.
    model.add(Masking(mask_value=0.0))
    # The input shape is inferred from the previous layer, so no input_shape argument is needed here.
    model.add(SimpleRNN(units=64))
    model.add(Dense(16))
    model.add(Dense(2, activation='softmax'))
    
    # Compile the model
    model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    # summarize the model
    model.summary()
    
    if plot:
        plot_model(model, to_file='model.png', show_shapes=True)
    return model

def make_lstm_classification_model(plot=False):
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    # Skip timesteps whose embedding vector is all zeros: padding positions and words without a GloVe vector.
    model.add(Masking(mask_value=0.0))
    # The input shape is inferred from the previous layer, so no input_shape argument is needed here.
    model.add(LSTM(units=32))
    model.add(Dense(16))
    model.add(Dense(2, activation='softmax'))
    
    # Compile the model
    model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    # summarize the model
    model.summary()
    
    if plot:
        plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [19]:
model = make_lstm_classification_model()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 64, 100)           1155600   
_________________________________________________________________
masking (Masking)            (None, 64, 100)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 16)                528       
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 34        
Total params: 1,173,186
Trainable params: 17,586
Non-trainable params: 1,155,600
_________________________________________________________________
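
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The number of embedding rows (11,556) is inferred from 1,155,600 / 100 rather than stated in the notebook.

```python
vocab_rows, embed_dim = 11556, 100   # rows inferred from 1,155,600 / 100
lstm_units, dense_units, n_classes = 32, 16, 2

embedding = vocab_rows * embed_dim                                 # frozen GloVe weights
# Four gates, each with input weights, recurrent weights, and a bias.
lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
dense = lstm_units * dense_units + dense_units                     # weights + biases
output = dense_units * n_classes + n_classes

trainable = lstm + dense + output
total = embedding + trainable
print(embedding, lstm, dense, output, trainable, total)
```

Because the embedding layer is created with `trainable=False`, only the LSTM and the two dense layers are updated during training.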


In [20]:
np.random.seed(2)
history = model.fit(X_train, y_train, validation_split=0.01, epochs=20, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [21]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 91.187501


Final model accuracy on the hold-out test set: 91.19% (91.187501% before rounding).
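
A single accuracy figure does not show whether errors are spread evenly across the two classes. As a rough sketch, accuracy and a confusion matrix can be derived from softmax outputs with NumPy. The `probs` and `y_true` arrays below are made-up values standing in for `model.predict(X_test)` and `y_test`.

```python
import numpy as np

# Hypothetical softmax outputs and one-hot labels for four reviews.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
y_true = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

pred = probs.argmax(axis=1)    # predicted class per review
true = y_true.argmax(axis=1)   # undo the one-hot encoding
accuracy = (pred == true).mean()

# 2x2 confusion matrix: rows are true classes, columns are predicted classes.
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (true, pred), 1)
print(accuracy, confusion.tolist())
```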

## Option 2: Build A Multi-Class Classification Model w/ BBC News Dataset

Perform the same classification exercise using the `bbc-text.csv` dataset. There are 5 distinct categories. You must achieve a baseline accuracy of at least 61% on a hold-out test set.
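
With five categories, the labels need five output columns instead of two, and the final `Dense` layer would need five units to match. The sketch below shows one way to one-hot encode string labels with NumPy. The category strings are assumed for illustration, and the real values should be read from `bbc-text.csv`.

```python
import numpy as np

# Hypothetical category strings; the real values come from bbc-text.csv.
categories = np.array(['sport', 'tech', 'business', 'sport', 'politics', 'entertainment'])

# np.unique sorts the class names and returns each sample's class index.
classes, class_ids = np.unique(categories, return_inverse=True)
one_hot = np.eye(len(classes))[class_ids]
print(classes.tolist(), one_hot.shape)
```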


### Random Seeds

Make sure to set the random seeds in your notebook so that I can rerun it and get exactly the same output:

```python
from numpy.random import seed
seed(42)

# TensorFlow 2.x; on TensorFlow 1.x use `from tensorflow import set_random_seed` instead
import tensorflow as tf
tf.random.set_seed(32)
```
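
Other sources of randomness, such as Python's built-in `random` module, can be seeded in the same place. The helper below is a hypothetical convenience wrapper, not part of the assignment. It uses one seed for everything and assumes the TensorFlow 2.x API when TensorFlow is installed.

```python
import os
import random
import numpy as np

def set_all_seeds(seed=42):
    """Seed Python's and NumPy's generators, plus TensorFlow if it is installed."""
    os.environ['PYTHONHASHSEED'] = str(seed)  # only affects interpreters started afterwards
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)   # TensorFlow 2.x API
    except (ImportError, AttributeError):
        pass  # TensorFlow missing or 1.x; seed it separately in that case

set_all_seeds(42)
first = np.random.rand(3)
set_all_seeds(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

Even with all seeds fixed, some GPU operations are not deterministic, so small differences between runs may remain.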