# Problem description

Another type of less traditional data is text.
There is potentially a lot of information about a company in documents such as
- News articles
- Annual/quarterly filings
- Analyst reports
- Blogs

The key element about text is that a document is a *sequence* of words.
In other words, order matters.
Consider
- "Machine Learning is easy not hard"
- "Machine Learning is hard not easy"

Two sentences with identical words but different meaning.
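
To make the point concrete, the short sketch below (plain Python, not part of the assignment code) checks that the two sentences contain exactly the same words with the same counts, yet differ once order is taken into account.

```python
from collections import Counter

a = "Machine Learning is easy not hard".lower().split()
b = "Machine Learning is hard not easy".lower().split()

# A "bag of words" view only counts words and ignores their order.
same_bag = Counter(a) == Counter(b)

# A sequence view compares the words position by position.
same_sequence = a == b

print("Same bag of words:", same_bag)
print("Same sequence:", same_sequence)
```

A model that only sees word counts cannot tell these two sentences apart; a model that consumes the sequence can, at least in principle.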

In this assignment we will analyze text in the form of Tweets.
Our objective is: given a tweet about a company, does the tweet indicate Positive sentiment or Negative sentiment?

This assignment will also serve as a preview of Natural Language Processing: the use of Machine Learning to analyze text.
This will be the subject of a later lecture.

Our immediate objective is to use Recurrent Neural Networks to analyze a sequence of words (i.e., a tweet).




## Goal: problem set 1

There are two notebook files in this assignment:
- **`Sentiment_from_tweets.ipynb`**: First and only notebook you need to work on. Train your models and save them
- **`Model_test.ipynb`**: Test your results. After you complete `Sentiment_from_tweets.ipynb`, this notebook should be submitted

**Before you start working on this assignment, please change your kernel to Python 3.7**

In this `Sentiment_from_tweets.ipynb` notebook, you will need to create Sequential models in Keras to analyze the sentiment in tweets.
- Each example is a sequence of words
- The labels are integers: high values indicate Positive sentiment, low values indicate Negative sentiment


## Learning objectives
- Learn how to use Recurrent Layer types as part of a Keras Sequential model
- Appreciate how layer choices impact number of weights

In [None]:
from __future__ import print_function

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn import preprocessing

import os
import re
import math

%matplotlib inline


## Import tensorflow and check the version of tensorflow
import tensorflow as tf
print("Running TensorFlow version ",tf.__version__)

# Parse tensorflow version
version_match = re.match(r"([0-9]+)\.([0-9]+)", tf.__version__)
tf_major, tf_minor = int(version_match.group(1)) , int(version_match.group(2))
print("Version {v:d}, minor {m:d}".format(v=tf_major, m=tf_minor) )

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, LSTM
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import plot_model, to_categorical
import IPython


# API for students

We will define some utility routines.

This will simplify problem solving.

More importantly, it adds structure to your submission so that it can be graded easily.

**If you want to take a look at the API, you can open it by selecting "File->Open->RNN_helper.py"**

`helper = RNN_helper.rnn_helper()`

### Preprocess raw dataset
- getDataRaw: get raw data. 
  >`DIR` is the directory of data     
  >`tweet_file` is the name of data file     
  >`tweets_raw = helper.getDataRaw(DIR, tweet_file)`   
- getTextClean: clean text. 
  >`tweets_raw` is the raw data you get from `helper.getDataRaw()`, which is a pandas DataFrame     
  >`docs, sents = helper.getTextClean(tweets_raw)`     
- show: display data by reversing index back to word. 
  >`tok` is an object of `Tokenizer`     
  >`encoded_docs_padded` is the text data which you have encoded and padded      
  >`helper.show(tok, encoded_docs_padded)`      
- getExamples: one-hot encode samples. 
  >`encoded_docs_padded` is the text data which you have encoded and padded     
  >`sents` is the labels     
  >`max_features` is number of words in the vocabulary    
  >`X, y = helper.getExamples(encoded_docs_padded, sents, max_features)`
  
### Save model and load model
- save model (portable): save a model in `./models` directory
  >`helper.saveModel(model, modelName)`
- save history (non-portable): save a model history in `./models` directory
  >`helper.saveHistory(history, modelName)`

### Plot models and training results
- plotModel: plot your models
  >`helper.plotModel(model, model_name)`
- plot_training: plot your training results
  >`helper.plot_training(history, metric='acc')`

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

import RNN_helper
%aimport RNN_helper

helper = RNN_helper.rnn_helper()

## Get the tweets (as text)

In [None]:
# Directory and file name
DATA_DIR = "../resource/asnlib/publicdata/tweets/Data"
tweet_file = "Apple-Twitter-Sentiment-DFE-1.csv"

# Load raw data
tweets_raw = helper.getDataRaw(DATA_DIR, tweet_file)
tweets_raw[ ["text", "sentiment"] ].head(10)

print("Sentiment values (raw)", np.unique(tweets_raw["sentiment"]))

## Data preprocessing

There will be a number of preprocessing steps necessary to convert the raw tweets to a form
amenable to a Neural Network.

The next few cells will take you through the journey from **"raw" data** to the **X** (array of examples)
and **y** (array of labels for each example) arrays that you will need for your Neural Network.

In an academic setting you will often be given X and y.
This will rarely be the case in the real world.

So although this journey has little to do with our objective in learning about Recurrent Neural Networks,
we encourage you to follow along.

If you are anxious to get to the Recurrent Neural Network part: you can defer the journey until later
and skip to the cell that defines X and y.

As you can see, tweets have their own special notation that distinguishes them from ordinary language
- "Mentions" begin with "@" and refer to another user: "@kenperry"
- "Hash tags" begin with "#" and refer to a subject: #MachineLearning

This means that our vocabulary (set of distinct words) can be huge.  To manage the vocabulary size
and simplify the problem (perhaps losing information on the way), we will **not** distinguish between
individual mentions and hash tags.
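
One way such a simplification could be done is sketched below. This is only an illustration, not the code inside `helper.getTextClean()`: the function name `normalize` and the placeholder tokens `@mention` and `#hashtag` are assumptions made for this example.

```python
import re

def normalize(tweet):
    """Collapse every mention and every hash tag into a generic placeholder."""
    # Hypothetical placeholder tokens; the helper may use different ones.
    tweet = re.sub(r"@\w+", "@mention", tweet)
    tweet = re.sub(r"#\w+", "#hashtag", tweet)
    return tweet

normalized = normalize("@kenperry loves #MachineLearning and #AI")
print(normalized)
```

After this step, the three distinct tokens `@kenperry`, `#MachineLearning` and `#AI` contribute only two entries to the vocabulary.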

Let's also examine the possible sentiment values
- There is a "not_relevant" value; we should eliminate these examples
- The sentiment value is a string
- The strings represent non-consecutive integers
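
A cleanup of this kind could look roughly like the sketch below. The raw label values are hypothetical and this is not the code inside `helper.getTextClean()`; it only illustrates dropping the "not_relevant" rows and mapping the remaining integer strings to consecutive class indices.

```python
# Hypothetical raw labels: strings holding non-consecutive integers.
raw = ["1", "3", "not_relevant", "5", "3"]

# Drop the examples that carry no sentiment.
kept = [s for s in raw if s != "not_relevant"]

# Map the distinct integer values, in increasing order, to 0, 1, 2, ...
levels = sorted({int(s) for s in kept})
to_index = {value: i for i, value in enumerate(levels)}
labels = [to_index[int(s)] for s in kept]

print(labels)
```

Keeping the mapping in increasing order preserves the convention that higher values mean more positive sentiment.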

There is quite a bit of cleaning of the raw data necessary; fortunately, we will do that for you below. We will use `helper.getTextClean()` here

In [None]:
docs, sents = helper.getTextClean(tweets_raw)

print("Docs shape is ", docs.shape)
print("Sents shape is ", sents.shape)

print("Possible sentiment values: ",  np.unique(sents) ) 


## More data preprocessing

Great, our text is in much better shape and our sentiment labels (the target values for prediction) are now consecutive values.

But computers handle numbers much more readily than strings.
We will need to convert the text into a *sequence* of numbers
- Break text up into words
- Assign each word a distinct integer

Moreover, it will be easier if all sequences have the same length.
We can add a "padding" character to the front if necessary.

Again, we do this for you below.
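
The two steps can be sketched in plain Python as follows. This is a simplified stand-in for what `Tokenizer` and `pad_sequences` do, not their actual implementation, and the two example documents are made up. It follows the Keras defaults: index 0 is reserved for padding, more frequent words get smaller indices, and both padding and truncation happen at the front.

```python
from collections import Counter

docs = ["apple is great", "apple is not great at all"]  # made-up examples
maxlen = 5

# Rank words by frequency; index 0 is reserved for the padding symbol.
counts = Counter(word for doc in docs for word in doc.split())
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(doc, maxlen):
    ids = [word_index[word] for word in doc.split()]
    ids = ids[-maxlen:]                      # too long: keep the last maxlen words
    return [0] * (maxlen - len(ids)) + ids   # too short: pad with 0 at the front

encoded = [encode_and_pad(doc, maxlen) for doc in docs]
print(encoded)
```

Every row of `encoded` now has exactly `maxlen` entries, which is what allows the documents to be stacked into a single array.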

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

## set parameters
# max_features: number of words in the vocabulary (and hence, the length of the One Hot Encoding feature vector)
# maxlen: number of words in a tweet (after padding or truncation)
max_features = 1000
maxlen = 40

## Tokenize text
tok = Tokenizer(num_words=max_features)
tok.fit_on_texts(docs)

encoded_docs = tok.texts_to_sequences(docs)
# Tweets differ in length, so pad (or truncate) each sequence to exactly maxlen entries
encoded_docs_padded = sequence.pad_sequences(encoded_docs, maxlen=maxlen)

encoded_docs_padded[:5]

## Verify that our encoded documents are the same as the cleaned original

At this point, convince yourself that all we have done is encode words as integers and pad all text to the same length. The following cell demonstrates this using `helper.show()`.

In [None]:
helper.show(tok, encoded_docs_padded)

## Even more preprocessing

Although a word has been encoded as an integer, this integer doesn't have a particular meaning.

We will therefore convert each word to a One Hot Encoded (OHE) vector
- The length of the vector is equal to the length of the vocabulary (set of distinct words)
- The vector is all 0 except for a single location which will be 1
- If the word is the $k^{th}$ word of the vocabulary, the position of the sole 1 will be $k$

This representation is called One Hot Encoding
- A word is represented as a feature vector of length $V$, where $V$ is the number of words in the vocabulary
    - Feature $j$ is a binary indicator which is true if the word is the $j^{th}$ word in the vocabulary
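
As a small illustration, the sketch below one-hot encodes a short integer sequence in plain Python. The vocabulary size and the sequence are made up, and treating the padding index 0 as just another position is an assumption of this sketch; `helper.getExamples()` may handle it differently.

```python
def one_hot(k, vocab_size):
    """Return a vector of zeros with a single 1 at position k."""
    vec = [0] * vocab_size
    vec[k] = 1
    return vec

V = 5                      # hypothetical vocabulary size
sequence = [0, 0, 3, 1]    # hypothetical padded, integer-encoded tweet

ohe = [one_hot(k, V) for k in sequence]
for row in ohe:
    print(row)
```

A tweet of length 4 becomes a 4 x 5 grid: one row per word, one column per vocabulary entry, and exactly one 1 in each row.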
    
Finally: we can get the set of examples and associated labels in a form ready for processing by
the Neural Network.

At this point, they will be hard to recognize by a human being. We will use `helper.getExamples()` here

In [None]:
X, y = helper.getExamples(encoded_docs_padded, sents, max_features)
print(X[:5])

In [None]:
# save X and y for further testing
if not os.path.exists('./data'):
    os.mkdir('./data')
np.savez_compressed('./data/dataset.npz', X = X, y = y)

## A note on the representation of words as OHE vectors

There are *much better* representations of words than OHE vectors!

We will learn about this in our lecture on Natural Language Processing.

For now, the OHE representation will suffice.

# Split the data into test and training sets


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# How long is the sequence in a *single* example

**Question:** Compute the length and number of features of a sequence

In [None]:
# Set two variables
# example_sequence_len: length of the sequence
# example_num_features: number of features in a single element of the sequence (of a single example)

###
### YOUR CODE HERE
###

print('The length of a sequence is ', example_sequence_len)
print('Number of features is ', example_num_features)

Your length of sequence should be the **maxlen** you set  
Your number of features should be the **max_features** you set

# Part 1: Create a Keras Sequential model using a Recurrent layer type

You will create a model that
- takes as input: a sequence of one hot encodings of words (i.e., a representation of a tweet)
- predicts (outputs) a sentiment

**Note**
You should treat the sentiment as a Categorical (discrete) value, rather than a continuous one
- As we saw: the sentiment label values are not continuous
- We cannot really assign a "magnitude" to the sentiment
    - We cannot say that a sentiment of 5 is five times "higher" than a sentiment of 1
- We will thus treat the problem as one of Classification rather than Regression
- **We have not one hot encoded the labels** (i.e., the sents variable). 

## Create model 

**Question:** Build a very basic model with two layers
- A Recurrent layer (LSTM to be specific) with a hidden state size of 128, name it "lstm_1"
- A Head layer implementing multinomial Classification, name it "dense_head"

**Hint:**
Since this is a multi-class classification problem, you need to use the `softmax` activation function for your head layer

In [None]:
model_lstm = None

###
### YOUR CODE HERE
###

model_lstm.summary()

In [None]:
# Plot your model
plot_lstm = helper.plotModel(model_lstm, "lstm")
IPython.display.Image(plot_lstm) 

## Train model

**Question:** Now that you have built your first RNN model, you will compile and train it. The base requirements are as follows:
- Name your model "LSTM_sparsecat"
- Split your dataset `X_train, y_train` into 0.9 training data and 0.1 validation data. Set the `random_state` to be 42. You can use `train_test_split()`
- Metric: accuracy
- Training epochs is 15
- Save your training results in a variable named `history1`
- Plot your training results using API `helper.plot_training()`

**Note: about loss function**   
- As mentioned before, we haven't one-hot encoded our labels, so please use `sparse_categorical_crossentropy` as the loss function here.
- Alternatively, you can encode the labels using `to_categorical()` and use `categorical_crossentropy` as the loss function.
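
The two options differ only in how the label is presented, not in the value of the loss. The sketch below checks this by hand for a single example, using plain Python rather than Keras; the predicted probabilities are made up.

```python
import math

probs = [0.1, 0.7, 0.2]   # hypothetical softmax output for one example
label = 1                 # integer label (sparse form)
one_hot_label = [0, 1, 0] # the same label, one-hot encoded

# Sparse form: pick out the probability of the true class directly.
sparse_loss = -math.log(probs[label])

# Categorical form: sum over all classes; only the true class contributes.
categorical_loss = -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

print(sparse_loss, categorical_loss)
```

The sparse form simply avoids building the one-hot label vectors, which saves memory when there are many examples or classes.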

In [None]:
# Set parameters
model_name_1 = 'LSTM_sparsecat'
num_epochs = 15

# If you don't use one-hot encoded labels
loss_ = 'sparse_categorical_crossentropy'
metric = 'acc'

###
### YOUR CODE HERE
###


**Expected outputs (there may be some differences):**  
<table> 
    <tr> 
        <td>  
            Training accuracy
        </td>
        <td>
         0.9120
        </td>
    </tr>
    <tr> 
        <td>
            Validation accuracy
        </td>
        <td>
         0.7259
        </td>
    </tr>

</table>

The loss and accuracy graphs of the first model are similar to this:
<img src="./resource/asnlib/publicdata/tweets/Images/model1_acc.png" style="width:600px;height:300px;">
<img src="./resource/asnlib/publicdata/tweets/Images/model1_loss.png" style="width:600px;height:300px;">

We can see that the validation accuracy curve in the graph above goes up and down while the training accuracy curve keeps increasing, which suggests that our model may have an overfitting problem.

## Evaluate the model

**Question:** We have trained our model; the next step is to evaluate it using the test dataset. Please store the model score in a variable named `score1`.   
**Hint:** The method we should use is `evaluate()`. 

In [None]:
score1 = None

###
### YOUR CODE HERE
###

print("{n:s}: Test loss: {l:3.2f} / Test accuracy: {a:3.2f}".format(n=model_name_1, l=score1[0], a=score1[1]))

## Save the trained model_lstm and history1 for submission

In [None]:
helper.saveModel(model_lstm, model_name_1)
helper.saveHistory(history1, model_name_1)

## How many weights are in your recurrent model?

How many weights are in your model?

You should always be sensitive to how "big" your model is.
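
As a rough guide, the weight count of an LSTM layer followed by a dense head can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates and bias terms; the function names are made up for illustration and the sizes are small hypothetical values, not the ones used in this assignment.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per (input, output) pair plus one bias per output.
    return input_dim * units + units

num_features, hidden_size, num_classes = 10, 4, 3   # hypothetical sizes

total = (lstm_param_count(num_features, hidden_size)
         + dense_param_count(hidden_size, num_classes))
print("LSTM:", lstm_param_count(num_features, hidden_size), "total:", total)
```

The LSTM term grows with both the hidden state size and the number of input features, which is why a large one-hot vocabulary makes the recurrent layer expensive.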

In [None]:
# Set variable
# - num_weights_lstm: number of weights in your model
num_weights_lstm = 0

###
### YOUR CODE HERE
###

print("number of parameters is ", num_weights_lstm)

# Part 2: Create a model consisting only of a Classification head

The Recurrent layer type creates a fixed length (i.e., size of hidden state) encoding of a variable length input sequence
- No matter how long the input, the encoding will have fixed length
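
The fixed-length property can be seen in a toy recurrence. The sketch below is a hand-rolled, much simplified recurrent update with made-up weights (it is not an LSTM and not Keras code): the hidden state is updated once per sequence element, so its size depends only on `hidden_size`, never on the length of the input.

```python
import math

def encode(sequence, hidden_size=3):
    """Fold a sequence of scalars into a fixed-length hidden state."""
    h = [0.0] * hidden_size
    for x in sequence:
        # Each hidden unit mixes the current input with its own previous value.
        h = [math.tanh(0.5 * x + 0.1 * (j + 1) * h_j) for j, h_j in enumerate(h)]
    return h

short_code = encode([1.0, -1.0])
long_code = encode([1.0, -1.0, 0.5, 0.0, 2.0, -0.5])
print(len(short_code), len(long_code))
```

Both encodings have three entries even though one input is three times as long as the other.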

But it needs quite a few parameters, and seems to have an overfitting problem.

Let's compare this to a simple Classifier only model
- That reduces the sequence to a single feature vector
    - Length of the single feature vector is the same as any element of the sequence
- There are a couple of ways to do this
    - Take the sum or average (across the sequence) of each feature
    - Take the max (across the sequence) of each feature
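
Both reductions can be illustrated on a tiny, made-up one-hot sequence. The sketch below uses plain Python rather than Keras layers; each column is one feature and each row is one element of the sequence.

```python
# Hypothetical sequence of 3 one-hot vectors over a 3-word vocabulary.
sequence = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]

# zip(*sequence) iterates over features (columns) instead of time steps (rows).
max_pooled = [max(col) for col in zip(*sequence)]
avg_pooled = [sum(col) / len(col) for col in zip(*sequence)]

print(max_pooled)
print(avg_pooled)
```

With one-hot inputs, the max records whether a word occurs anywhere in the tweet and the average records how often it occurs. Either way the order of the words is discarded.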

**Question:** Create a Keras Sequential model (only 1 pooling layer + Head layer) that
- Reduces the sequence to a singleton with the same number of features
- Classifies directly on this singleton
- Name your head layer "dense_head"

**Hint:**
- Investigate the Keras `GlobalMaxPooling1D` and `GlobalAveragePooling1D` layer types

In [None]:
model_simple = None

###
### YOUR CODE HERE
###

# Plot model
plot_simple = helper.plotModel(model_simple, "simple")
IPython.display.Image(plot_simple) 

model_simple.summary()

## Train model

**Question:** Now that you have built your Classification model, next you will compile and train your model. The base requirements are as follows:
- Name your model "Only_head"
- Split your dataset `X_train, y_train` into 0.9 training data and 0.1 validation data. Set the `random_state` to be 42. You can use `train_test_split()`
- Metric: "accuracy"; loss function: "sparse_categorical_crossentropy" (don't use the one-hot encoded labels)
- Training epochs is 15
- Save your training results in a variable named `history2`
- Plot your training results using API `helper.plot_training()`

**loss function:** Do not one-hot encode labels, use `sparse_categorical_crossentropy` as loss function  

In [None]:
# Set parameters
model_name_2 = 'Only_head'
num_epochs = 15
metric = 'acc'
loss_ = 'sparse_categorical_crossentropy'


###
### YOUR CODE HERE
###


**Expected outputs (there may be some differences):**  
<table> 
    <tr> 
        <td>  
            Training accuracy
        </td>
        <td>
         0.7776
        </td>
    </tr>
    <tr> 
        <td>
            Validation accuracy
        </td>
        <td>
         0.7784
        </td>
    </tr>

</table>

The loss and accuracy graphs of the second model are similar to this:
<img src="./resource/asnlib/publicdata/tweets/Images/model2_acc.png" style="width:600px;height:300px;">
<img src="./resource/asnlib/publicdata/tweets/Images/model2_loss.png" style="width:600px;height:300px;">

We can see that both accuracy curves in the graph above are increasing, which means our model is learning even though it is very simple.

## Save the trained model_simple and history2 for submission

In [None]:
helper.saveModel(model_simple, model_name_2)
helper.saveHistory(history2, model_name_2)

## How many weights are in your Classifier only model?

How many weights are in your model?

You should always be sensitive to how "big" your model is.

In [None]:
# Set variable
# - num_weights_simple: number of weights in your model
num_weights_simple = 0

###
### YOUR CODE HERE
###

print("number of parameters is ", num_weights_simple)

Compared with the previous RNN model, we have far **fewer** parameters, but the validation accuracy is better than that of the RNN model!

# Discussion

- Was the increase in number of weights compensated by a gain in accuracy when using a Recurrent Layer type compared to the Classifier only model ?
- Can you speculate why this is so ?

## Now Submit your assignment!
Please click on the blue button <span style="color: blue;"> **Submit** </span> in this notebook. 