
This notebook uses the IMDB movie review data set to perform sentiment analysis with Keras, classifying each review as positive or negative.

In [2]:
import tensorflow as tf
import numpy as np

Load Data

In [11]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  58.9M      0  0:00:01  0:00:01 --:--:-- 58.9M


In [12]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [13]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [14]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [15]:
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

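The unsup folder holds unlabeled reviews. It is not required here, and text_dataset_from_directory would otherwise treat it as a third class, so it can be deleted.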

In [16]:
!rm -r aclImdb/train/unsup

In [17]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


Data Preprocessing

Note: 
You can use the utility tf.keras.preprocessing.text_dataset_from_directory to generate a labeled tf.data.Dataset object from a set of text files on disk organized into class-specific folders.

When using the validation_split and subset arguments, either specify a random seed or pass shuffle=False, so that the validation and training splits do not overlap.
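
To see why a fixed seed matters, here is a minimal sketch in plain Python. It is a simplified model of the split, not the library's implementation: the helper split_files and the file names are hypothetical. Both calls shuffle the same file list with the same seed, so the training slice and the validation slice are cut from one ordering and cannot overlap.

```python
import random

def split_files(files, validation_split, subset, seed):
    """Hypothetical helper: shuffle with a fixed seed, then slice one subset."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    if subset == "training":
        return shuffled[:-n_val]
    return shuffled[-n_val:]

files = [f"review_{i}.txt" for i in range(25000)]  # made-up file names
train_files = split_files(files, 0.2, "training", seed=1337)
val_files = split_files(files, 0.2, "validation", seed=1337)

print(len(train_files), len(val_files))
print(set(train_files).isdisjoint(val_files))

# With a different seed the validation slice is cut from a different ordering,
# so some validation files also appear in the training slice.
leaky_val = split_files(files, 0.2, "validation", seed=42)
print(len(set(train_files) & set(leaky_val)) > 0)
```

The last line illustrates the failure mode the note warns about: mismatched seeds leak training files into validation.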

In [18]:
batch_size=32
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [19]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [20]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/test",
    batch_size=batch_size
)

Found 25000 files belonging to 2 classes.


Print the number of batches in each dataset

In [21]:
print("no  of batches in raw_train_ds: %d"   % tf.data.experimental.cardinality(raw_train_ds))

no  of batches in raw_train_ds: 625


In [22]:
print("no of batches in raw_val_ds: %d" % tf.data.experimental.cardinality(raw_val_ds))
print("no of batches in raw_test_ds: %d" % tf.data.experimental.cardinality(raw_test_ds))

no of batches in raw_val_ds: 157
no of batches in raw_test_ds: 782
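
These counts follow from the file counts and the batch size. As a quick check in plain Python (the dictionary below just restates the numbers reported above), each dataset has ceil(files / batch_size) batches, because the final partial batch is kept rather than dropped:

```python
import math

batch_size = 32
sizes = {"train": 20000, "val": 5000, "test": 25000}  # file counts reported above

# The last batch may be smaller than batch_size, hence the ceiling.
batches = {name: math.ceil(n / batch_size) for name, n in sizes.items()}
print(batches)
```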


Data Preview:



In [None]:
# It's important to take a look at your raw data to ensure your normalization 
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.

In [23]:
for text, label in raw_train_ds.take(1):
  for i in range(5):
    print(text.numpy()[i])
    print(label.numpy()[i])

b'I\'ve seen tons of science fiction from the 70s; some horrendously bad, and others thought provoking and truly frightening. Soylent Green fits into the latter category. Yes, at times it\'s a little campy, and yes, the furniture is good for a giggle or two, but some of the film seems awfully prescient. Here we have a film, 9 years before Blade Runner, that dares to imagine the future as somthing dark, scary, and nihilistic. Both Charlton Heston and Edward G. Robinson fare far better in this than The Ten Commandments, and Robinson\'s assisted-suicide scene is creepily prescient of Kevorkian and his ilk. Some of the attitudes are dated (can you imagine a filmmaker getting away with the "women as furniture" concept in our oh-so-politically-correct-90s?), but it\'s rare to find a film from the Me Decade that actually can make you think. This is one I\'d love to see on the big screen, because even in a widescreen presentation, I don\'t think the overall scope of this film would receive its

Data Preprocessing

Note: the sample data above contains <br /> HTML tags, which must be removed.

In [24]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import string
import re

In [None]:
# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.

In [25]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
  return tf.strings.regex_replace(
      stripped_html, "[%s]" % re.escape(string.punctuation), ""
  )
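
To check what this standardization does without running TensorFlow, here is a plain-Python counterpart. The function name standardize_py and the sample string are illustrative assumptions, and the equivalence is only intended for plain ASCII text.

```python
import re
import string

def standardize_py(text):
    """Plain-Python counterpart of custom_standardization, for illustration."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    # Remove every ASCII punctuation character.
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

sample = "Great movie!<br /><br />Loved it."  # made-up review text
cleaned = standardize_py(sample)
print(repr(cleaned))
```

Each break tag becomes a single space, so two consecutive tags leave two spaces behind. That is harmless here because the default split function splits on whitespace.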


In [26]:
#model constants

max_features=20000
embedding_dim=128
sequence_length=500

In [None]:
# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.

In [27]:
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length,
)

In [None]:
# Now that the vocab layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.


In [28]:
# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)
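
As a rough mental model of what `adapt` and the layer do in 'int' mode, here is a simplified plain-Python sketch. It is not the layer's implementation: build_vocab, vectorize and the tiny corpus are hypothetical. It assumes the conventions of the real layer, where index 0 is reserved for padding, index 1 for out-of-vocabulary tokens, and the remaining indices go to the most frequent tokens.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; indices 0 (padding) and 1 (OOV) are reserved."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Map tokens to integers, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["the movie was great", "the movie was bad", "the movie", "the"]
vocab = build_vocab(corpus, max_tokens=5)
print(vocab)
print(vectorize("the movie was awful", vocab, sequence_length=6))
```

The unseen word "awful" maps to the out-of-vocabulary index, and the sequence is padded with zeros up to the requested length.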

Two options to vectorize the data
There are 2 ways we can use our text vectorization layer:

Option 1: Make it part of the model, so as to obtain a model that processes raw strings, like this:

In [None]:
from tensorflow.keras import layers  # needed by this snippet

text_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...

Option 2: Apply it to the text dataset to obtain a dataset of word indices, then feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So if you're training the model on GPU, you probably want to go with this option to get the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw strings as input, like in the code snippet for option 1 above. This can be done after training by wrapping the vectorization layer and the trained model in a new model.

In [29]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

Now vectorize the data

In [30]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

Do async prefetching / buffering of the data for best performance on GPU.

In [31]:
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
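
The idea behind prefetching can be illustrated with a conceptual sketch in plain Python. This is not how tf.data is implemented, and the prefetch function below is a hypothetical stand-in: a background thread prepares up to buffer_size items ahead of time while the consumer (the training step) works on the current one.

```python
import queue
import threading

def prefetch(iterable, buffer_size=10):
    """Yield items while a background thread fills a bounded buffer ahead of use."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the next item is ready
        if item is sentinel:
            return
        yield item

batches = list(prefetch(range(5), buffer_size=2))
print(batches)
```

Items arrive in their original order. The only change is that production and consumption can overlap in time.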

Model Building

The model is a 1D convnet that starts with an embedding layer.

In [32]:
from tensorflow.keras import layers

# Integer input for vocab indices
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# adding a layer to map those vocab indices into a space of dimensionality 'embedding_dim'.
# this is the embedding layer
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# adding 2 Conv1D layers and a GlobalMaxPooling1D layer
x = layers.Conv1D(128, 7, padding="valid", activation = "relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation = "relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

#adding a vanilla hidden layer
x = layers.Dense(128, activation = "relu")(x)
x = layers.Dropout(0.5)(x)

# Projecting to a single output layer with sigmoid function
predictions = layers.Dense(1, activation = "sigmoid", name = "predictions")(x)

model = tf.keras.Model(inputs, predictions)

# compiling the model using binary cross-entropy loss with the Adam optimizer
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
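
As a sanity check on the architecture, the arithmetic below works out the sequence length after each convolution and the number of trainable parameters. It is a sketch in plain Python that assumes inputs padded to sequence_length=500, as produced by the vectorization layer; the helper conv1d_output_length is hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Output length of a Conv1D layer with padding="valid"."""
    return (input_length - kernel_size) // strides + 1

sequence_length, max_features, embedding_dim = 500, 20000, 128

after_conv1 = conv1d_output_length(sequence_length, kernel_size=7, strides=3)
after_conv2 = conv1d_output_length(after_conv1, kernel_size=7, strides=3)
print(after_conv1, after_conv2)

# Trainable parameters: weights plus biases for each layer.
embedding_params = max_features * embedding_dim
conv1_params = 7 * embedding_dim * 128 + 128
conv2_params = 7 * 128 * 128 + 128
dense_params = 128 * 128 + 128
output_params = 128 * 1 + 1
total_params = (embedding_params + conv1_params + conv2_params
                + dense_params + output_params)
print(total_params)
```

Most of the parameters sit in the embedding table. GlobalMaxPooling1D then reduces the remaining time steps to a single 128-dimensional vector per review.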


Train the Model

In [33]:
epoch = 3

# fitting the model on the training set, validating on the validation set
model.fit(train_ds, validation_data=val_ds, epochs=epoch)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f1bec4aeef0>

Model evaluation on test data

In [34]:
model.evaluate(test_ds)



[0.49929508566856384, 0.8471999764442444]
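
The list returned by model.evaluate holds the loss followed by the metrics passed to compile, so the model reaches a binary cross-entropy of about 0.50 and an accuracy of about 84.7% on the test set. The sketch below shows how these two numbers are computed, in plain Python with made-up labels and probabilities. The function names are illustrative, and the 0.5 threshold is the default used for binary accuracy.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]                 # made-up labels
probabilities = [0.9, 0.2, 0.4, 0.6]  # made-up sigmoid outputs
loss = binary_crossentropy(labels, probabilities)
accuracy = binary_accuracy(labels, probabilities)
print([loss, accuracy])
```

The loss penalizes confident wrong predictions more heavily than accuracy does, which is why the two numbers can move somewhat independently.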