<a href="https://colab.research.google.com/github/mirucar/dlprojects/blob/main/text_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [26]:
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
import re
import string
import tensorflow as tf



In [27]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1",url,untar=True,cache_dir=".",cache_subdir="")

dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
os.listdir(dataset_dir)

['README', 'test', 'imdbEr.txt', 'train', 'imdb.vocab']

In [28]:
train_dir = os.path.join(dataset_dir,"train")
os.listdir(train_dir)

['unsup',
 'pos',
 'unsupBow.feat',
 'labeledBow.feat',
 'urls_unsup.txt',
 'neg',
 'urls_pos.txt',
 'urls_neg.txt']

In [29]:
remove_dir = os.path.join(train_dir,"unsup")
shutil.rmtree(remove_dir)

In [30]:
sample_file = os.path.join(train_dir,"pos/10002_7.txt")
sample_file

'./aclImdb/train/pos/10002_7.txt'

In [31]:
with open(sample_file) as f:
  print(f.read())

This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).


In [32]:
batch_size = 32
seed = 42
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size,
    validation_split = 0.2,
    subset = "training",
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [33]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review:", text_batch.numpy()[i])
    print("Label: ", label_batch.numpy()[i])

Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label:  0
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get i

In [36]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size,
    validation_split = 0.2,
    subset = "validation",
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [37]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size,
    validation_split = 0.2,
    subset = "validation",
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [40]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
  stripped_punc = tf.strings.regex_replace(stripped_html, "[%s]" % re.escape(string.punctuation), "")
  return stripped_punc


In [41]:
for text_batch, label_batch in raw_train_ds.take(1):
  print("Review: ", custom_standardization(text_batch[0]).numpy())
  break

Review:  b'silent night deadly night 5 is the very last of the series and like part 4 its unrelated to the first three except by title and the fact that its a christmasthemed horror flick  except to the oblivious theres some obvious things going on heremickey rooney plays a toymaker named joe petto and his creepy sons name is pino ring a bell anyone now a little boy named derek heard a knock at the door one evening and opened it to find a present on the doorstep for him even though it said dont open till christmas he begins to open it anyway but is stopped by his dad who scolds him and sends him to bed and opens the gift himself inside is a little red ball that sprouts santa arms and a head and proceeds to kill dad oops maybe he should have left wellenough alone of course derek is then traumatized by the incident since he watched it from the stairs but he doesnt grow up to be some killer santa he just stops talking  theres a mysterious stranger lurking around who seems very interested 

In [42]:
max_features = 10000
sequence_length = 250

vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize = custom_standardization,
    max_tokens = max_features,
    output_mode="int",
    output_sequence_length = sequence_length

)

In [44]:
train_text = raw_train_ds.map(lambda x,y: x)
vectorize_layer.adapt(train_text)


In [52]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text),label

In [53]:
tf.expand_dims(text_batch, -1)

<tf.Tensor: shape=(32, 1), dtype=string, numpy=
array([[b'Recipe for one of the worst movies of all time: a she-male villain who looks like it escaped from the WWF, has terrible aim with a gun that has inconsistent effects (the first guy she shoots catches on fire but when she shoots anyone else they just disappear) and takes time out to pet a deer. Then you got the unlikable characters, 30 year old college students, a lame attempt at a surprise ending and lots, lots more. Avoid at all costs.'],
       [b"Icy and lethal ace hit-man Tony Arzenta (a divinely smooth and commanding performance by Alain Delon) wants to quit the assassination business, but the dangerous mobsters he works for won't let him. After his wife and child are killed, Arzenta declares open season on everyone responsible for their deaths. Director Duccio Tessari relates the absorbing story at a constant snappy pace, maintains a properly serious and no-nonsense tone throughout, stages the stirring shoot-outs and exciti

In [54]:
text_batch, label_batch = next(iter(raw_train_ds))
print("Review: ", text_batch[0])
print("Label: ", label_batch[0])
print("Vectorize", vectorize_text(text_batch[0],label_batch[0]))

Review:  tf.Tensor(b"I went to see Fever Pitch with my Mom, and I can say that we both loved it. It wasn't the typical romantic comedy where someone is pining for the other, and blah blah blah... You weren't waiting for the climatic first kiss or for them to finally get together. It was more real, because you saw them through the relationship, rather than the whole movie be about them getting together. People could actually relate to the film, because it didn't seem like extraordinary circumstances, or impossible situations. It was really funny, and I think it was Jimmy Fallon's best performance. All in all... I would definitely recommend it!", shape=(), dtype=string)
Label:  tf.Tensor(1, shape=(), dtype=int32)
Vectorize (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[  10,  426,    6,   67, 3775, 3322,   16,   54, 1611,    3,   10,
          68,  131,   12,   71,  192,  446,    9,    9,  269,    2,  769,
         736,  220,  114,  282,    7,    1,   15,    2,   78,    3, 2642

In [55]:
print(vectorize_layer.get_vocabulary()[2])

the


In [56]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [58]:
train_ds = train_ds.cache().prefetch(buffer_size = tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size = tf.data.AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size = tf.data.AUTOTUNE)

In [61]:
embedding_dim = 16

In [72]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(max_features+1, embedding_dim),
                             tf.keras.layers.LSTM(64),
                             tf.keras.layers.Dense(64, activation = "relu"),
                             tf.keras.layers.Dense(1)
])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 16)          160016    
                                                                 
 lstm (LSTM)                 (None, 64)                20736     
                                                                 
 dense_2 (Dense)             (None, 64)                4160      
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 184,977
Trainable params: 184,977
Non-trainable params: 0
_________________________________________________________________


In [73]:
model.compile(optimizer="adam",
              loss= tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics = ["accuracy"])

In [75]:
history = model.fit(train_ds,validation_data=val_ds,epochs=15)

Epoch 1/15
 18/625 [..............................] - ETA: 1:27 - loss: 0.6791 - accuracy: 0.5295

KeyboardInterrupt: ignored

In [67]:
model.evaluate(test_ds)



[0.2873171269893646, 0.8762000203132629]

In [80]:
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(max_features+1, embedding_dim),
                             tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
                             tf.keras.layers.Dense(64, activation ="relu"),
                             tf.keras.layers.Dense(1)
])
model.summary()

model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics = ["accuracy"])


history = model.fit(train_ds, validation_data=val_ds, epochs = 15)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, None, 16)          160016    
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              41472     
 nal)                                                            
                                                                 
 dense_6 (Dense)             (None, 64)                8256      
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 209,809
Trainable params: 209,809
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
106/625 [====>.........................] - ETA: 1:53 - loss: 0.6912 - accuracy: 0.5100

KeyboardInterrupt: ignored

In [79]:
model.evaluate(test_ds)

RuntimeError: ignored