# Practical 5.1 Modeling Text

## Basic data preprocessing for modeling text sequences

In [1]:
from __future__ import print_function

## 1. Data description

We will use the IMDB review data set to train a Recurrent Neural Network (RNN) model with two types of text sequences as model input: characters and words. The data can be downloaded from https://storage.googleapis.com/trl_data/imdb_dataset.zip. The training set contains 25000 reviews labelled 0 for "negative" sentiment and 1 for "positive" sentiment. For the validation set, the binary label (0 or 1) is derived from the attribute "id": the number after the character '\_' is the rating score. If the rating is greater than 7, the label is 1 ("positive"); any other rating, including every rating below 5, is labelled 0 ("negative").

Example of (part of) original text in data set:

```
id	sentiment	review

"7759_3"	0	"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger ."

```
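The labelling rule described above can be sketched as a small helper. This is only an illustration: the function name `label_from_id` and the second id are made up, and the threshold simply follows the rule as stated in this section.

```python
def label_from_id(review_id):
    # the rating is the number after the last underscore, e.g. "7759_3" -> 3
    rating = int(review_id.strip('"').rsplit("_", 1)[1])
    # rule as stated above: a rating greater than 7 is positive (1), anything else is negative (0)
    return 1 if rating > 7 else 0

print(label_from_id("7759_3"))  # 0, matches the example row above
print(label_from_id("1234_9"))  # 1 (hypothetical id)
```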

## 2. Problem Definition

Given a text (e.g. a movie review), we need to predict whether the review is positive (class label = 1) or negative (class label = 0). We will work with two types of preprocessing to create the input sequences for our model: character-level and word-level.

## 3. Data Preprocessing

Basic data preprocessing for text sequence:

* Cleaning raw text data
    - remove HTML tags
    - remove non-informative characters
* Tokenizing raw text into arrays of word tokens (for word-level sequences)
* Creating vocabulary indexes: character-based and word-based lookup dictionaries
* Transforming tokenized text into integer sequences (based on the vocabulary lookup index)
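
The steps listed above can be sketched end to end on two toy sentences. This is a minimal illustration, not the notebook's actual code: the `toy_*` names and the sample sentences are made up, and `collections.Counter` stands in for the frequency counting done later with nltk.

```python
import re
from collections import Counter

def toy_clean(text):
    # strip HTML tags, drop non-ASCII characters, lowercase
    text = re.sub(r"<.*?>", "", text)
    return re.sub(r"[^\x00-\x7f]", "", text).lower()

def toy_tokenize(text):
    # keep runs of lowercase letters and digits, split on everything else
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

toy_docs = ["<b>Great</b> camera, great price!", "Slow laptop."]
toy_tokens = [toy_tokenize(toy_clean(d)) for d in toy_docs]

# vocabulary: index 0 is reserved for padding, the last index for unknown words
counts = Counter(tok for doc in toy_tokens for tok in doc)
toy_vocab = ["<pad>"] + [w for w, _ in counts.most_common()] + ["<unk>"]
word_to_index = {w: i for i, w in enumerate(toy_vocab)}

def toy_encode(tokens):
    # words missing from the vocabulary fall back to the '<unk>' index
    unk = word_to_index["<unk>"]
    return [word_to_index.get(t, unk) for t in tokens]

toy_sequences = [toy_encode(t) for t in toy_tokens]
print(toy_tokens[0])     # ['great', 'camera', 'great', 'price']
print(toy_sequences[0])  # [1, 2, 1, 3]
```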

In [2]:
import os
import sys
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 100
import re
import nltk

DATA_PATH = 'amazon_data'
EMBEDDING_PATH = 'embedding'
MODEL_PATH = 'model'

Create the directories above under your current working directory. Download the data set provided and place it in the `DATA_PATH` directory ('amazon_data').

### 3.1. Read data

In [3]:
# functions to clean raw text data

def striphtml(html):
    # remove anything that looks like an HTML tag, e.g. <br />
    p = re.compile(r'<.*?>')
    return p.sub('', html)

def clean(s):
    # drop every non-ASCII character
    return re.sub(r'[^\x00-\x7f]', r'', s)
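
A quick check of the two cleaning helpers on a made-up string (the sample text is hypothetical; the functions are repeated so the snippet runs on its own):

```python
import re

def striphtml(html):
    # remove anything that looks like an HTML tag
    return re.sub(r'<.*?>', '', html)

def clean(s):
    # drop every non-ASCII character
    return re.sub(r'[^\x00-\x7f]', r'', s)

raw = 'Great lens!<br /><br />Caf\u00e9-quality photos.'
cleaned = clean(striphtml(raw))
print(cleaned)  # Great lens!Caf-quality photos.
```

Note that tags are replaced by the empty string, so words on either side of a tag are joined ("lens!Caf"). If word boundaries matter, substituting a space instead is one option.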

In [4]:
ex1_labelled_data = pd.read_csv(os.path.join(DATA_PATH,"example1_labelled.tsv"), header=0, delimiter="\t")

In [5]:
ex2_unlabelled_data = pd.read_csv(os.path.join(DATA_PATH,"example2_unlabelled.tsv"), header=0, delimiter="\t")
ex2_labelled_data = pd.read_csv(os.path.join(DATA_PATH,"example2_labelled.tsv"), header=0, delimiter="\t")

In [15]:
ex1_labelled_data

Unnamed: 0,label,review
0,camera,"My husband bought this camera about 3 months ago and we continue to love it...wow, what an impro..."
1,laptop,I got this notebook several months ago and I've had a great experience with it. I've had zero pr...
2,mobilephone,I have this phone for about 10 months. The calls are clear in many places where I can't get my ...


In [16]:
ex2_unlabelled_data

Unnamed: 0,review
0,"I purchased the 20d in Feb 2011, around 7 years after it was first introduced. The camera was so..."
1,It's been 3 weeks now and I've only had minor voice mail setup issues. The phone itself is the ...
2,"I purchased this Z Series laptop about 5 months ago, its my first Sony laptop, Im a Mac fan, but..."
3,"When I first got this laptop (at a garage sale, broken) It was really slow and leggy so I instal..."
4,I love this phone. I've had my own for over a year now and have since bought one for my son and ...
5,I purchased this camera to replace my Casio EX-Z4 which broke. This camera takes great photos. ...
6,"My perfect camera has to do two things very well. First, it has to deliver superior results. S..."


In [17]:
ex2_labelled_data

Unnamed: 0,label,review
0,camera,"I purchased the 20d in Feb 2011, around 7 years after it was first introduced. The camera was so..."
1,mobilephone,It's been 3 weeks now and I've only had minor voice mail setup issues. The phone itself is the ...
2,laptop,"I purchased this Z Series laptop about 5 months ago, its my first Sony laptop, Im a Mac fan, but..."
3,laptop,"When I first got this laptop (at a garage sale, broken) It was really slow and leggy so I instal..."
4,mobilephone,I love this phone. I've had my own for over a year now and have since bought one for my son and ...
5,camera,I purchased this camera to replace my Casio EX-Z4 which broke. This camera takes great photos. ...
6,camera,"My perfect camera has to do two things very well. First, it has to deliver superior results. S..."


### 3.2. Clean data

### Cleaning training set

In [6]:
# create a cleaned, lowercased version of the training set

train_docs = []
train_labels = []
for cont, label in zip(ex1_labelled_data.review, ex1_labelled_data.label):
    
    doc = clean(striphtml(cont))
    doc = doc.lower() 
    train_docs.append(doc)
    train_labels.append(label)

### Cleaning validation set

In [7]:
# create a cleaned, lowercased version of the validation set
# labels are read directly from the 'label' column

valid_docs =[]
valid_labels = []
for cont, label in zip(ex2_labelled_data.review, ex2_labelled_data.label):
    
    doc = clean(striphtml(cont))
    doc = doc.lower() 
    valid_docs.append(doc)
    valid_labels.append(label)

### 3.3. Build vocabulary index

### Word-level vocabulary index

In [8]:
# FUNCTION to tokenize documents into array list of words
# you may also use nltk tokenizer, sklearn tokenizer, or keras tokenizer - 
# but for the tutorial in text modeling, we will use below function: 

def tokenizeWords(text):
    
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return [str(strtokens) for strtokens in tokens]

# FUNCTION to create word-level vocabulary index

def indexingVocabulary(array_of_words):

    wordIndex = list(array_of_words)
    
    # we will later pad our sequence into fixed length, so
    # we will use '0' as the integer index of pad 
    wordIndex.insert(0,'<pad>')
    
    # index for word token '<start>' as a starting sign of sequence. We won't use it for this model,
    # but for a later model (sequence-to-sequence model)
    wordIndex.append('<start>')
    
    # index for word token '<end>' as an ending sign of sequence. We won't use it for this model,
    # but for a later model (sequence-to-sequence model)
    wordIndex.append('<end>')
    
    # index for word token '<unk>' or unknown words (out-of-vocabulary words)
    wordIndex.append('<unk>')
    
    # map each integer index to its word token
    vocab = dict(enumerate(wordIndex))
    
    return vocab
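
To see what these two functions produce, here is a small self-contained check. The function bodies are condensed re-statements of the ones above, and the sample sentence and word list are made up.

```python
import re

def tokenizeWords(text):
    # lowercase, replace every run of non-alphanumeric characters by a space, split
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def indexingVocabulary(array_of_words):
    # '<pad>' takes index 0; the special tokens go at the end
    wordIndex = ['<pad>'] + list(array_of_words) + ['<start>', '<end>', '<unk>']
    return dict(enumerate(wordIndex))

tokens = tokenizeWords("It's a GREAT camera -- 10/10!")
print(tokens)  # ['it', 's', 'a', 'great', 'camera', '10', '10']

vocab = indexingVocabulary(['camera', 'great'])
print(vocab)   # {0: '<pad>', 1: 'camera', 2: 'great', 3: '<start>', 4: '<end>', 5: '<unk>'}
```

The tokenizer splits contractions ("it's" becomes "it" and "s") and discards all punctuation, which is simple but loses some information compared with a dedicated tokenizer.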

### Tokenization (for word sequences as model input)

Create a list of tokenized documents and one merged list of all word tokens, from which the vocabulary index is generated. Note that we keep only the 10,000 most frequent words from the training set. Out-of-vocabulary (OOV) words will be represented as '<unk>' (unknown word).

In [9]:
# tokenize text from training set

train_str_tokens = []
all_tokens = []
for text in train_docs:
    
    # tokenize each document once and reuse the result
    tokens = tokenizeWords(text)
    
    # this will create our training corpus
    train_str_tokens.append(tokens)
    
    # this will be our merged array to create vocabulary index
    all_tokens.extend(tokens)

In [10]:
# likewise, tokenize text from validation set

valid_str_tokens = []
for i, text in enumerate(valid_docs):

    valid_str_tokens.append(tokenizeWords(text))

In [11]:
# use nltk to count word frequency and keep the 10,000 most frequent words for the vocabulary index

tf = nltk.FreqDist(all_tokens)
common_words = tf.most_common(10000)
arr_common = np.array(common_words)
words = arr_common[:,0]

# create vocabulary index

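# index -> word pairs (despite its name, words_indices maps an integer index to a word)
words_indices = indexingVocabulary(words)

# word -> index pairs (indices_words maps a word to its integer index)
indices_words = dict((v,k) for (k,v) in words_indices.items())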

In [27]:
list(words_indices.items())[:5]

[(0, '<pad>'), (1, 'i'), (2, 'it'), (3, 'and'), (4, 'a')]

In [28]:
list(indices_words.items())[:5]

[('<pad>', 0), ('i', 1), ('it', 2), ('and', 3), ('a', 4)]

In [29]:
# save vocabulary index

np.save(os.path.join(DATA_PATH,'words_indices.npy'), words_indices)
np.save(os.path.join(DATA_PATH,'indices_words.npy'), indices_words)
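
Because `np.save` stores a Python dictionary as a pickled object array, reading it back requires `allow_pickle=True` and a call to `.item()` to recover the dictionary. A minimal sketch with a toy dictionary and a temporary directory (both are stand-ins, not the notebook's files):

```python
import os
import tempfile
import numpy as np

demo_vocab = {0: '<pad>', 1: 'camera', 2: '<unk>'}  # stand-in for words_indices

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_vocab.npy')
    np.save(path, demo_vocab)  # stored as a pickled 0-d object array
    loaded = np.load(path, allow_pickle=True).item()  # .item() returns the dict

print(loaded == demo_vocab)  # True
```

Pickled files should only be loaded from sources you trust. For a plain lookup table, a JSON file is a common alternative, with the caveat that JSON turns integer keys into strings.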

### 3.4. Preparing model input - output

### Word-level sequences

In [30]:


# integer format of training input
# (indices_words maps word -> index; out-of-vocabulary words map to '<unk>')
train_int_input = []
for text in train_str_tokens:
    int_tokens = [indices_words.get(w, indices_words['<unk>']) for w in text]
    train_int_input.append(int_tokens)

In [31]:
# integer format of validation input
valid_int_input = []
for text in valid_str_tokens:
    int_tokens = [indices_words.get(w, indices_words['<unk>']) for w in text]
    valid_int_input.append(int_tokens)
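
A useful sanity check at this point is to map an integer sequence back to words and confirm that unknown words come out as '<unk>'. The sketch below uses toy lookup tables in place of the real `words_indices` and `indices_words`; the `encode` and `decode` helpers are hypothetical names.

```python
# toy stand-ins for the two lookup tables built above
words_indices = {0: '<pad>', 1: 'great', 2: 'camera', 3: '<unk>'}  # index -> word
indices_words = {w: i for i, w in words_indices.items()}            # word -> index

def encode(tokens):
    # words missing from the vocabulary fall back to the '<unk>' index
    unk = indices_words['<unk>']
    return [indices_words.get(w, unk) for w in tokens]

def decode(int_tokens):
    # look each index up in the index -> word table
    return [words_indices[i] for i in int_tokens]

encoded = encode(['great', 'tripod', 'camera'])
print(encoded)          # [1, 3, 2]
print(decode(encoded))  # ['great', '<unk>', 'camera']
```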

In [32]:
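# reviews have different lengths, so store them as object arrays of Python lists
# (recent NumPy versions raise an error for ragged input without dtype=object)
X_train_arr = np.array(train_int_input, dtype=object)
y_train = np.array(train_labels)

X_valid_arr = np.array(valid_int_input, dtype=object)
y_valid = np.array(valid_labels)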

#### Padding word sequences

We fix the length of every input sequence at 500 words: longer reviews are truncated and shorter ones are padded with 0. Here, we use the Keras padding utility, but you may also define your own padding function.

In [33]:
from keras.preprocessing import sequence

max_review_length = 500
X_train = sequence.pad_sequences(X_train_arr, maxlen=max_review_length)
X_valid = sequence.pad_sequences(X_valid_arr, maxlen=max_review_length)

Using TensorFlow backend.
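
For readers who prefer not to depend on Keras for this step, a custom padding function could look like the sketch below. The name `pad_left` is made up; the function is intended to mirror the default `padding='pre'`, `truncating='pre'` behaviour of `pad_sequences`, that is, zeros are added at the front and only the last `maxlen` tokens of a long sequence are kept.

```python
import numpy as np

def pad_left(sequences, maxlen, value=0):
    # start from a matrix filled with the pad value
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]  # keep the last maxlen tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed  # right-align the tokens
    return padded

demo = pad_left([[5, 6], [1, 2, 3, 4, 5], []], maxlen=4)
print(demo)
# [[0 0 5 6]
#  [2 3 4 5]
#  [0 0 0 0]]
```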


In [34]:
# save files

np.save(os.path.join(DATA_PATH,'X_train_word.npy'), X_train)
np.save(os.path.join(DATA_PATH,'y_train_word.npy'), y_train)

np.save(os.path.join(DATA_PATH,'X_valid_word.npy'), X_valid)
np.save(os.path.join(DATA_PATH,'y_valid_word.npy'), y_valid)