#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Natural Language Processing

Look almost anywhere around you and you'll see an application of natural language processing (NLP) at work. This broad field covers everything from spellcheck, to translation between languages, to full machine understanding of human language.

In this lesson, we'll work through the typical process of an NLP problem. We'll first use a bag-of-words approach to train a simple classifier model. Then we'll use a sequential approach (considering the order of words) to train an RNN model.

## Problem and Data

We will use the [Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences) from the UCI Machine Learning Repository. This dataset contains 3000 user reviews from IMDB, Amazon, and Yelp with the corresponding sentiment of each review (positive: 1 or negative: 0). This supervised problem of predicting sentiment is often called a "sentiment analysis task".

### Download data

In [0]:
# Set random seeds for reproducible results.
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

In [0]:
import zipfile
import io
import shutil
import urllib.request

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip'

# Download zip file from url.
zipdata = io.BytesIO()
zipdata.write(urllib.request.urlopen(url).read())

# Extract zip files.
zfile = zipfile.ZipFile(zipdata)
zfile.extractall()
zfile.close()

# Rename directory to "data".
shutil.rmtree('./data', ignore_errors=True)
shutil.move('sentiment labelled sentences', 'data')

### Process data into NumPy arrays

The downloaded data is split across three files: `amazon_cells_labelled.txt`, `imdb_labelled.txt`, and `yelp_labelled.txt`. Each file has two tab-separated columns, one containing the review text and one containing the sentiment label. Let's combine all the files into one DataFrame, then get a sense of what the data looks like.

Note: How would you split the two columns if there were tabs *within* the review text? We don't need to worry about it for this dataset, but you should generally check your labels to make sure everything is processed correctly.

In [0]:
import pandas as pd

filepath_dict = {
    'amazon': 'data/amazon_cells_labelled.txt',
    'imdb':   'data/imdb_labelled.txt',
    'yelp':   'data/yelp_labelled.txt'
}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['review', 'label'], sep='\t')
    # Add another column filled with the source name, which may be helpful for
    # analysis.
    df['source'] = source
    df_list.append(df)

df = pd.concat(df_list)
df.sample(n=10)
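The note above asks what to do if a review itself contained a tab. One possible approach, sketched here on made-up lines rather than on the real dataset, is to split only on the *last* tab, since the label is always the final column. The `raw_lines` data and the helper name `split_review_and_label` are illustrative assumptions, not part of the lesson's code.

```python
# Hypothetical raw lines; the second one has a tab inside the review text.
raw_lines = [
    "Great phone, works well.\t1",
    "Battery died\tafter two days.\t0",
]

def split_review_and_label(line):
    """Split on the LAST tab so tabs inside the review text are preserved."""
    review, label = line.rstrip("\n").rsplit("\t", 1)
    return review, int(label)

parsed = [split_review_and_label(line) for line in raw_lines]
for review, label in parsed:
    print(repr(review), label)
```

A quick sanity check after any parsing step is to confirm that every label is either 0 or 1.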

Machine learning models expect separate arrays for input features and labels. We must store all reviews in one array and all corresponding labels in another array.

In [0]:
reviews = df['review'].values
y = df['label'].values

print('first review: "{}"'.format(reviews[0]))
print('first label: {}'.format(y[0]))

Finally, let's split the consolidated dataset so that 80% is used for training and the other 20% is used for testing.

In [0]:
from sklearn.model_selection import train_test_split

reviews_train_raw, reviews_test_raw, y_train, y_test = train_test_split(
  reviews, y, test_size=0.2, random_state=1000)

print(len(y_train), len(y_test))
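To make the split less of a black box, here is a minimal standard-library sketch of the same idea: shuffle the indices once with a fixed seed, then hold out a fraction for testing. It is a simplified stand-in, not scikit-learn's actual implementation, and `simple_train_test_split` and the toy data are hypothetical names used only for illustration.

```python
import random

def simple_train_test_split(items, labels, test_size=0.2, seed=1000):
    """Shuffle indices with a fixed seed, then hold out a `test_size` fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index items and labels with the same indices so pairs stay aligned.
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

toy_items = ["review {}".format(i) for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

train_x, test_x, train_y, test_y = simple_train_test_split(toy_items, toy_labels)
print(len(train_x), len(test_x))
```

The important property is that each review keeps its own label after shuffling, and that no review appears in both sets.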

## Feature Extraction

We will manually extract features from the raw text to use as input vectors for our first model. Remember that a bag-of-words model does not consider the order of words in the text.

### spaCy: Industrial-strength NLP

[spaCy](https://spacy.io) is a library for advanced NLP tools. It's built based on state-of-the-art research and designed to be efficient for industry use. spaCy is extremely useful for extracting more complex linguistic features from text. Another mature and popular Python NLP toolkit is [NLTK](https://www.nltk.org/), which is a little more academic-oriented.

We must specify a linguistic model for spaCy to use. For this exercise, we'll use their "medium-sized" English language model. If you already have this model downloaded, you can skip to the `load` step below.

**Note:** This is a large file, and may take a few minutes to download and process.

In [0]:
# Download the en_core_web_md model, if you don't already have it downloaded.
!python -m spacy download en_core_web_md

In [0]:
# Load the model into our program.
import en_core_web_md
nlp = en_core_web_md.load()

spaCy language models process raw text into a `Doc` object, which is a collection of `Token` objects. Each `Token` contains many useful [linguistic annotations](https://spacy.io/usage/linguistic-features). For example, `.text` stores the raw text of a `Token` and `.pos_` stores its Part of Speech (pos) tag.

In [0]:
tokens = nlp(reviews[0])
for token in tokens:
  print(token.text, token.pos_)

For our relatively small sentiment analysis task, we will augment each review with Part-of-Speech tags for each word in the review. Since we are using a bag-of-words approach, we can add these tags anywhere in the review.

In [0]:
# Given a list of reviews, adds a part-of-speech tag after each word in each
# review.
def add_pos_tags(reviews_raw):
  reviews = []
  for review in reviews_raw:
    tokens = nlp(review)
    review_with_pos = []
    for token in tokens:
      review_with_pos.append(token.text)
      review_with_pos.append(token.pos_)
    reviews.append(' '.join(review_with_pos))
  return reviews

In [0]:
# These may take a little while to run, as spaCy needs to parse each review.
reviews_train = add_pos_tags(reviews_train_raw)
reviews_test = add_pos_tags(reviews_test_raw)

print(reviews_train[0])

## Bag-of-Words Model

We can't use these sentences directly to train models; we need to convert them into standardized-length vectors first. We will first use a bag-of-words (BOW) approach to vectorize the sentences. This means we will consider each review as a "bag of words," where the order of the words does not matter.

Sometimes our approaches must sacrifice potentially useful information (like the order of words in a sentence) in exchange for lower computational complexity. In other words, we judge the information carried by word order to be worth less than the extra computation time or memory needed to account for it. The relatively simple bag-of-words model has proven to be surprisingly effective for many problems.

The [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in scikit-learn is a very useful tool for performing a bag-of-words vectorization. `CountVectorizer` supports more advanced feature extraction as well, such as n-grams, but we will use the default parameters for now.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(reviews_train)

x_train = vectorizer.transform(reviews_train)
x_test = vectorizer.transform(reviews_test)

x_train.shape

Let's take a closer look at what `CountVectorizer` did. First, `fit` creates a dictionary ("vocabulary"), mapping each unique word to a word index. Then, `transform` converts each review to a list of *counts*, where the element at index $i$ corresponds to the number of times the word at index $i$ in the vocabulary appeared in that review.

In [0]:
len(vectorizer.vocabulary_)
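The fit and transform steps described above can be imitated on a toy example with the standard library alone. This is only a sketch of the idea: it splits on whitespace and assigns indices alphabetically, whereas `CountVectorizer` applies its own tokenization rules. The names `toy_reviews`, `vocabulary`, and `to_counts` are made up for this illustration.

```python
from collections import Counter

toy_reviews = ["good good phone", "bad phone"]

# "fit": map each unique word to an index (assigned alphabetically here).
unique_words = sorted({word for review in toy_reviews for word in review.split()})
vocabulary = {word: i for i, word in enumerate(unique_words)}

def to_counts(review):
    """The transform step: count how often each vocabulary word appears."""
    counts = Counter(review.split())
    return [counts[word] for word in unique_words]

vectors = [to_counts(review) for review in toy_reviews]
print(vocabulary)  # {'bad': 0, 'good': 1, 'phone': 2}
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```

Every review becomes a vector of the same length (the vocabulary size), no matter how many words it contains.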

### Logistic Regression model

A common first model to try with classification problems is [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), because it trains fairly quickly.
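As a rough picture of what logistic regression computes at prediction time, the sketch below applies a sigmoid to a weighted sum of word counts. The weights, bias, and three-word vocabulary are invented for illustration; a real model learns its weights from the training data.

```python
import math

def predict_positive_probability(counts, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the input features."""
    score = bias + sum(w * x for w, x in zip(weights, counts))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for a three-word vocabulary: ["bad", "good", "phone"].
weights = [-2.0, 1.5, 0.0]
bias = 0.0

p_good = predict_positive_probability([0, 2, 1], weights, bias)  # "good good phone"
p_bad = predict_positive_probability([1, 0, 1], weights, bias)   # "bad phone"
print(round(p_good, 3), round(p_bad, 3))  # 0.953 0.119
```

A probability above 0.5 is read as positive sentiment and one below 0.5 as negative.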

In [0]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')
model.fit(x_train, y_train)

print('Training accuracy: {}'.format(model.score(x_train, y_train)))
print('Testing accuracy: {}'.format(model.score(x_test, y_test)))

Notice that this model achieves almost perfect accuracy on the training set but much lower accuracy on the testing set. This is a result of our model overfitting to the training data. To reduce this effect, we could try changing parameters of the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model or reducing the number of features in the input data.

We could also try using a completely different model, which we'll do now, adding in the information that bag-of-words ignores.

## Sequential model

Much of the meaning of language depends on the order of words: "That movie was not really good" is not quite the same as "That movie was really not good". For more complicated NLP tasks, a bag-of-words approach does not capture enough useful information. In this section, we will instead work with a Recurrent Neural Network (RNN) model, which is specifically designed to capture information about the order of sequences. 
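A quick check makes the limitation concrete: the two example sentences above contain exactly the same words, so a bag-of-words representation cannot tell them apart. This sketch uses simple lowercasing and whitespace splitting as a stand-in for a real tokenizer.

```python
from collections import Counter

first = "That movie was not really good"
second = "That movie was really not good"

bag_first = Counter(first.lower().split())
bag_second = Counter(second.lower().split())

same_bag = bag_first == bag_second
same_sequence = first.lower().split() == second.lower().split()
print(same_bag, same_sequence)  # True False
```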

### Preprocessing

We can't use `CountVectorizer` here, so we will need to do some slightly different preprocessing. We can first use the `keras` `Tokenizer` to learn a vocabulary, and transform each review into a list of indices. Note that we will not include part-of-speech information for this model.

In [0]:
from tensorflow import keras

tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(reviews_train_raw)

x_train = tokenizer.texts_to_sequences(reviews_train_raw)
x_test = tokenizer.texts_to_sequences(reviews_test_raw)

print(reviews_train_raw[0])
print(x_train[0])
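For intuition, here is a simplified standard-library sketch of the two steps a tokenizer like this performs: learning a word-to-index vocabulary, then converting text to index sequences. It assumes, as we understand the Keras `Tokenizer` defaults, that more frequent words get smaller indices starting at 1 and that unseen words are dropped. It skips the punctuation filtering the real class does, and the toy data is made up.

```python
from collections import Counter

toy_reviews = ["good phone good price", "bad phone"]

# Count words across all training texts (simplified: lowercase + split on spaces).
word_counts = Counter(
    word for review in toy_reviews for word in review.lower().split())

# Most frequent word gets index 1; index 0 is left free for padding.
word_index = {word: i for i, (word, _) in
              enumerate(word_counts.most_common(), start=1)}

def text_to_sequence(text):
    """Replace each known word with its index; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

sequence = text_to_sequence("good phone terrible price")
print(word_index)  # {'good': 1, 'phone': 2, 'price': 3, 'bad': 4}
print(sequence)    # [1, 2, 3]
```

Unlike the count vectors from earlier, these sequences preserve word order and vary in length from review to review.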

We need to pad our input so all vectors have the same length. A quick histogram of review lengths shows that almost all reviews have fewer than 100 words. Let's take a closer look at the distribution of lengths less than 100 words.

In [0]:
import matplotlib.pyplot as plt

review_lengths = [len(review) for review in x_train if len(review) < 100]
plt.hist(review_lengths, density=True, cumulative=True)
plt.show()

Almost all reviews have fewer than 50 words! Therefore, we will pad to a maximum review length of 50 (the few longer reviews will be truncated to 50 tokens).

In [0]:
maxlen = 50

x_train = keras.preprocessing.sequence.pad_sequences(x_train, padding='post',
                                                     maxlen=maxlen)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, padding='post',
                                                    maxlen=maxlen)

print(x_train[0])
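The effect of padding with `padding='post'` and a fixed `maxlen` can be sketched in plain Python. This toy version assumes the behavior we expect from the default `truncating='pre'` setting, namely that a sequence longer than `maxlen` keeps its last `maxlen` entries. The helper name `pad_post` is hypothetical.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Pad at the end up to `maxlen`; keep only the last `maxlen` items if longer."""
    truncated = sequence[-maxlen:]
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([7, 3, 9], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [7, 3, 9, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

Because index 0 is never assigned to a real word, the padded positions can later be recognized and ignored by the model.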

### Pre-trained word embeddings

Word embeddings are foundational to most NLP tasks. It's common to experiment with embeddings, feature extraction, or a combination of both to determine what works best with your specific data and problem.

In practice, instead of training our own embeddings, we can often take advantage of existing embeddings that have already been trained. This is especially useful when we have a small dataset, and want or need the richer meaning that comes from embeddings trained on a larger dataset. 

There are a variety of extensively pre-trained word embeddings. One of the most powerful and widely-used is [GloVe (Global Vectors for Word Representation)](https://nlp.stanford.edu/projects/glove/). Luckily for us, the spaCy model we downloaded is already integrated with 300-dimensional GloVe embeddings. 

All we need to do is load these embeddings into an `embedding_matrix` so each word index properly matches with the words in our dataset. We can access the `tokenizer`'s vocabulary using `.word_index`.

*Note: This may take a few minutes to run.*


In [0]:
# Include an extra index for the "<PAD>" token.
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 300
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in tokenizer.word_index.items():
  token = nlp(word)[0]
  # Make sure spaCy has an embedding for this token.
  if not token.is_oov:
    embedding_matrix[i] = token.vector

print(embedding_matrix.shape)

Loading the embeddings may take a little while to run. When it's done, we'll have an `embedding_matrix` where each word index corresponds to a 300-dimensional GloVe vector. We can load this into an `Embedding` layer to train a model, or visualize the embeddings.

Note also that we have slightly more tokens now than from using `CountVectorizer`. This means that Keras's `Tokenizer` splits sentences into tokens using slightly different rules.
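To make the alignment between word indices and embedding rows concrete, here is a toy version of the same loop using plain lists. The three-dimensional vectors, the `pretrained` dictionary, and the tiny `word_index` are all invented stand-ins for the 300-dimensional vectors and the real vocabulary.

```python
# Hypothetical 3-dimensional vectors standing in for the 300-dimensional ones.
pretrained = {"good": [0.9, 0.1, 0.0], "bad": [-0.8, 0.2, 0.1]}
word_index = {"good": 1, "bad": 2, "waaaay": 3}  # index 0 is reserved for padding

embedding_dim = 3
vocab_size = len(word_index) + 1
embedding_matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]

for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # out-of-vocabulary words keep an all-zero row
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row 0 (padding) and row 3 (a word with no pre-trained vector) both stay at zero, while rows 1 and 2 hold the vectors for "good" and "bad".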

## RNN model

### Model setup

This model will have three layers:

1. `Embedding`

   We initialize its weights using the `embedding_matrix` of pre-trained GloVe embeddings. We set `trainable=False` to prevent the weights from being updated during training. You can instead set `trainable=True` to allow for additional training, or "fine-tuning", of these weights. We also set `mask_zero=True` to ensure we do not train parameters based on the `"<PAD>"` tokens.

2. `LSTM` (Long Short-Term Memory)

   This is a type of RNN architecture that is especially good at handling long sequences of information. This layer takes input of dimensions `(batch size, maxlen, embedding dimension)` and returns output of dimension `(batch size, 64)`. A larger output size means a more complex model; we have chosen 64 after tuning based on model performance.

3. `Dense`

   A final layer to return a prediction of either positive or negative sentiment.
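As a rough guide to the size of each layer, the arithmetic below estimates parameter counts for the three layers described above. It assumes a standard LSTM with biases (four weight blocks: three gates plus the candidate cell state) and uses a made-up vocabulary size, since the real value comes from the tokenizer.

```python
vocab_size_example = 5000   # hypothetical; the real value comes from the tokenizer
embedding_dim = 300
lstm_units = 64

# Embedding: one vector per word index (frozen when trainable=False).
embedding_params = vocab_size_example * embedding_dim

# LSTM: four weight blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM output plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)  # 1500000 93440 65
```

Most of the parameters sit in the embedding layer, which is one reason freezing it matters on a small dataset: far fewer weights are left to fit.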

In [0]:
model = keras.Sequential([
  keras.layers.Embedding(
    vocab_size,
    embedding_dim,
    weights=[embedding_matrix],
    trainable=False,
    mask_zero=True
  ),
  keras.layers.LSTM(64),
  keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

### Train and test model

Each epoch is relatively slow to train, and the model reaches high training accuracy after 10 epochs, so we will train for only 10 epochs. We use a batch size of 64 based on hyperparameter tuning.

In [0]:
model.compile(
  loss='binary_crossentropy',
  optimizer='adam',
  metrics=['accuracy']
)

history = model.fit(
    x_train,
    y_train,
    epochs=10,
    batch_size=64
)

In [0]:
loss, acc = model.evaluate(x_test, y_test)
print('Test accuracy: {}'.format(acc))

Note that the final testing set accuracy is not significantly higher than that of our Logistic Regression model. We are using a complex model on a small dataset, which is prone to overfitting -- you can usually achieve more generalizable results with a larger dataset.

# Exercises

Natural Language Processing is a broad and quickly-changing field. These are just a few of the questions you can ask yourself as you approach NLP tasks!

## Exercise 1: Feature Experimentation

Try to improve the test accuracy of either the Logistic Regression or RNN model. Techniques you could try include:

* Using bigrams or larger n-grams as features
* Ignoring "stop words" (very common words, e.g. "the", "and")
* Including other types of features from spaCy
* Adding regularization to either model

### Student Solution

**Your analysis here:**

*Comment on what you tried, and why it did or did not work.*

In [0]:
### YOUR CODE HERE ###


### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 2: Error Analysis

Find an example of a review where the RNN and Logistic Regression models made different predictions. Based on the review text and what you know about each model, why do you think this happened?

### Student Solution

**Your answer here:**

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 3: What's Your Source?

Our dataset uses reviews from 3 sources, but each of these sources may have different patterns of sentiment. Do our models make better predictions for one source versus the others? Can we achieve better performance for the IMDB reviews by just training on data from IMDB, or does having a variety of sources help?

### Student Solution

**Your analysis here:**


In [0]:
### YOUR CODE HERE ###


### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 4: Data Preprocessing

When loading word embeddings, we skipped all words that did not have a corresponding `GloVe` embedding. Let's take a closer look at these skipped words.



In [0]:
for word, i in tokenizer.word_index.items():
  token = nlp(word)[0]
  if token.is_oov:
    print(token.text)

Some of these are usernames or nonsense strings, and others are spelling errors. Some are unintentional errors, in which case it'd be useful to recover the original word, while others are intentional errors, like "waaaaaayyyyyyyyyy", that are probably strong indicators of sentiment.

In fact, this is a common problem when working with natural language: it's messy! That makes data preprocessing extremely difficult and vital. How could you improve our preprocessing to handle spelling errors, either intentional or unintentional?

### Student Solution

**Your answer here:**

In [0]:
### YOUR CODE HERE (if you write any) ###


### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO