## Prompt
Use the following dataset
https://github.com/microsoft/ML-Server-Python-Samples/blob/master/microsoftml/202/data/sentiment_analysis/yelp_labelled.txt
to classify reviews to positive sentiment reviews or negative sentiment reviews.
Train a RNN and a convolutional network to this task. You should train them character-wise and word-wise (one model for each).
Complete the whole assignment in a single self-contained notebook/colab. Find a good architecture for each model and tune the model parameters.
Compare each methods accuracy.
Which one gives you the best results?
Write 1-2 paragraphs on the motivation and the way you came about in the design of the models and your observations.

### Step 1: Import data
- 0 for positive, 1 for negative, tab delimited

In [66]:
import pandas as pd
import numpy as np
from tensorflow.keras.datasets import imdb 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense, Flatten 
from tensorflow.keras.layers import LSTM 
from tensorflow.keras.layers import Embedding 
from tensorflow.keras.preprocessing import sequence

In [106]:
yelp_labelled = pd.read_csv('yelp_labelled.txt', sep='\t',header=None)
yelp_labelled.columns = ['text','label']
yelp_labelled


Unnamed: 0,text,label
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0


### Step 2: Word 2 Vec


In [68]:
from gensim.models import word2vec
from gensim.models.word2vec import Word2Vec

In [107]:
positive_posts = pd.Series.to_numpy(yelp_labelled[yelp_labelled.label == 1]['text'])
negative_posts = pd.Series.to_numpy(yelp_labelled[yelp_labelled.label == 0]['text'])
print(len(positive_posts))
print(len(negative_posts))
posts = positive_posts + negative_posts

500
500


In [71]:
w2v = Word2Vec(size=100, min_count=1)
w2v.build_vocab(map(lambda x: x.split(), posts), )
w2v.vocabulary

<gensim.models.word2vec.Word2VecVocab at 0x153db44f0>

In [72]:
w2v.train(posts, total_examples=w2v.corpus_count,epochs=10)

(78274, 583160)

In [73]:
w2v.wv.most_similar(positive=['crust'])

[('groups', 0.34418344497680664),
 ('Luke', 0.3424365818500519),
 ('go!Host', 0.33621126413345337),
 ('orders', 0.33029842376708984),
 ('eggs', 0.32191407680511475),
 ('amazing...rge', 0.3136864900588989),
 ('disappointed!', 0.29802176356315613),
 ('deeply', 0.2806062698364258),
 ('company', 0.2769371271133423),
 ('Steve', 0.2752118706703186)]

In [74]:
w2v.wv.similarity('I', 'you')

-0.18840203

In [75]:
w2v.wv.similarity('flavor', 'food')

0.05288045

## RNN with Words
using this tutorial:
https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ <br>
and the "LSTM Classification Generation" notebook<br>
From wikipedia: "Long short-term memory (LSTM) is an artificial recurrent neural network (RNN)"

In [115]:
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import one_hot, text_to_word_sequence
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM, GRU
from tensorflow.keras.preprocessing import sequence
from sklearn.model_selection import train_test_split


In [112]:

filtered_positive_posts = list(filter(lambda p: len(p) > 0, positive_posts))
filtered_negative_posts = list(filter(lambda p: len(p) > 0, negative_posts))


In [None]:
# text processing - one hot builds index of the words
pos_one_hot = []
neg_one_hot = []
n = 30000
for post in filtered_positive_posts:
    try:
        pos_one_hot.append(one_hot(post, n, split=" ", filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True))
    except:
        continue

for post in filtered_negative_posts:
    try:
        neg_one_hot.append(one_hot(post,n,split=" ",filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',lower=True))
    except:
        continue

In [None]:
# 0 for bad, 1 for good
concatenate_array_rnn = np.concatenate((np.zeros(len(neg_one_hot)),
                                        np.ones(len(pos_one_hot))))


In [None]:
from sklearn.model_selection import train_test_split

X_train_rnn, X_test_rnn, y_train_rnn, y_test_rnn = train_test_split(np.concatenate((neg_one_hot,pos_one_hot)),
                                                                    concatenate_array_rnn, 
                                                                    test_size=0.2)

In [None]:
# get max length review
maxlen = max([len(r) for r in pd.Series.tolist(yelp_labelled.text)])
X_train_rnn = sequence.pad_sequences(X_train_rnn, maxlen=maxlen)
X_test_rnn = sequence.pad_sequences(X_test_rnn, maxlen=maxlen)
print('X_train_rnn shape:', X_train_rnn.shape, y_train_rnn.shape)
print('X_test_rnn shape:', X_test_rnn.shape, y_test_rnn.shape)


In [None]:
max_features = 30000
dimension = 128
output_dimension = 128
model = Sequential()
model.add(Embedding(max_features, dimension))
model.add(LSTM(output_dimension))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

In [None]:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

In [None]:
model.fit(X_train_rnn, y_train_rnn, batch_size=32,
          epochs=4, validation_data=(X_test_rnn, y_test_rnn))

In [None]:
score, acc = model.evaluate(X_test_rnn, y_test_rnn, batch_size=32)

#### Using TFIDF Vectorizer as input instead of one hot

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer(decode_error='ignore', norm='l2', min_df=5)
tfidf_good = vectorizer.fit_transform(filtered_positive_posts)
tfidf_bad = vectorizer.fit_transform(filtered_negative_posts)

flattened_array_tfidf_good = tfidf_good.toarray()
flattened_array_tfidf_bad = tfidf_bad.toarray()

#0 bad, 1 good
y_rnn = np.concatenate((np.zeros(len(flattened_array_tfidf_bad)),
                                        np.ones(len(flattened_array_tfidf_good))))

X_train_rnn, X_test_rnn, y_train_rnn, y_test_rnn = train_test_split(np.concatenate((flattened_array_tfidf_male, 



In [None]:
X_train_rnn = sequence.pad_sequences(X_train_rnn, maxlen=maxlen)
X_test_rnn = sequence.pad_sequences(X_test_rnn, maxlen=maxlen)
print('X_train_rnn shape:', X_train_rnn.shape, y_train_rnn.shape)
print('X_test_rnn shape:', X_test_rnn.shape, y_test_rnn.shape)


In [None]:
max_features = 30000
dimension = 30
output_dimension =20
model = Sequential()
model.add(Embedding(max_features, dimension))
model.add(LSTM(output_dimension))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='mean_squared_error',optimizer='sgd', metrics=['accuracy'])

model.fit(X_train_rnn, y_train_rnn, 
          batch_size=32, epochs=4,
          validation_data=(X_test_rnn, y_test_rnn))

score,acc = model.evaluate(X_test_rnn, y_test_rnn, 
                           batch_size=32)

print(score, acc)

In [117]:
max_features = 30000
dimension = 30
output_dimension =20
model = Sequential()
model.add(Embedding(max_features, dimension))
model.add(LSTM(output_dimension))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='mean_squared_error',optimizer='sgd', metrics=['accuracy'])

model.fit(X_train_rnn, y_train_rnn, 
          batch_size=32, epochs=4,
          validation_data=(X_test_rnn, y_test_rnn))

score,acc = model.evaluate(X_test_rnn, y_test_rnn, 
                           batch_size=32)

print(score, acc)

In [119]:
# 0 for bad, 1 for good
concatenate_array_rnn = np.concatenate((np.zeros(len(neg_one_hot)),
                                        np.ones(len(pos_one_hot))))


In [121]:
from sklearn.model_selection import train_test_split

X_train_rnn, X_test_rnn, y_train_rnn, y_test_rnn = train_test_split(np.concatenate((neg_one_hot,pos_one_hot)),
                                                                    concatenate_array_rnn, 
                                                                    test_size=0.2)



In [133]:
# get max length review
maxlen = max([len(r) for r in pd.Series.tolist(yelp_labelled.text)])
X_train_rnn = sequence.pad_sequences(X_train_rnn, maxlen=maxlen)
X_test_rnn = sequence.pad_sequences(X_test_rnn, maxlen=maxlen)
print('X_train_rnn shape:', X_train_rnn.shape, y_train_rnn.shape)
print('X_test_rnn shape:', X_test_rnn.shape, y_test_rnn.shape)


X_train_rnn shape: (800, 149) (800,)
X_test_rnn shape: (200, 149) (200,)


In [124]:
max_features = 30000
dimension = 128
output_dimension = 128
model = Sequential()
model.add(Embedding(max_features, dimension))
model.add(LSTM(output_dimension))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

In [125]:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

In [126]:
model.fit(X_train_rnn, y_train_rnn, batch_size=32,
          epochs=4, validation_data=(X_test_rnn, y_test_rnn))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x154897430>

In [127]:
score, acc = model.evaluate(X_test_rnn, y_test_rnn, batch_size=32)



#### Using TFIDF Vectorizer as input instead of one hot

In [129]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [130]:
vectorizer = TfidfVectorizer(decode_error='ignore', norm='l2', min_df=5)
tfidf_good = vectorizer.fit_transform(filtered_positive_posts)
tfidf_bad = vectorizer.fit_transform(filtered_negative_posts)

flattened_array_tfidf_good = tfidf_good.toarray()
flattened_array_tfidf_bad = tfidf_bad.toarray()

#0 bad, 1 good
y_rnn = np.concatenate((np.zeros(len(flattened_array_tfidf_bad)),
                                        np.ones(len(flattened_array_tfidf_good))))

X_train_rnn, X_test_rnn, y_train_rnn, y_test_rnn = train_test_split(np.concatenate((flattened_array_tfidf_male, 



In [134]:
X_train_rnn = sequence.pad_sequences(X_train_rnn, maxlen=maxlen)
X_test_rnn = sequence.pad_sequences(X_test_rnn, maxlen=maxlen)
print('X_train_rnn shape:', X_train_rnn.shape, y_train_rnn.shape)
print('X_test_rnn shape:', X_test_rnn.shape, y_test_rnn.shape)


X_train_rnn shape: (800, 149) (800,)
X_test_rnn shape: (200, 149) (200,)


In [138]:
max_features = 30000
dimension = 30
output_dimension =20
model = Sequential()
model.add(Embedding(max_features, dimension))
model.add(LSTM(output_dimension))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='mean_squared_error',optimizer='sgd', metrics=['accuracy'])

model.fit(X_train_rnn, y_train_rnn, 
          batch_size=32, epochs=4,
          validation_data=(X_test_rnn, y_test_rnn))

score,acc = model.evaluate(X_test_rnn, y_test_rnn, 
                           batch_size=32)

print(score, acc)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
0.24996991455554962 0.5249999761581421
