# Stoneburner, Kurt
- ## DSC 650 - Assignment 10


Links to Deep Learning Sample Code:
- https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part01_introduction.ipynb

- https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part02_sequence-models.ipynb

- https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part03_transformer.ipynb

- https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part04_sequence-to-sequence-learning.ipynb

ngram reference:
- https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/
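
As a quick illustration of the n-gram idea from the reference above, here is a minimal sketch that slides a window of size `n` over a token list. The helper name `make_ngrams` and the sample sentence are hypothetical and are not part of the assignment code.

```python
def make_ngrams(tokens, n):
    """Return the list of space-joined n-grams found in `tokens`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip over n staggered views of the token list; zip stops at the shortest view,
    # so no window ever runs past the end of the list
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

sample_tokens = "the cat sat on the mat".split()
bigrams = make_ngrams(sample_tokens, 2)
print(bigrams)
```

With `n=1` the function returns the tokens unchanged, which matches the unigram default used later in the notebook.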



In [1]:
import os
from pathlib import Path
import sys
# //*** Imports and Load Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from tensorflow import keras

#//*** Reusing Code from assignment 04
from chardet.universaldetector import UniversalDetector
from bs4 import BeautifulSoup


import re

#//*** Use the whole window in the IPYNB editor
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

#//*** Maximize columns and rows displayed by pandas
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

In [2]:
#//*** Get Working Directory
current_dir = Path(os.getcwd()).absolute()

#//*** Go up Two folders
project_dir = current_dir.parents[2]

#//*** IMDB Data Path
imdb_path = project_dir.joinpath("dsc650/data/external/imdb/aclImdb")

file_path = imdb_path.joinpath("train/pos")

#//*** Grab the first positive review text for testing
file_path = file_path.joinpath(os.listdir(file_path)[0])

with open(file_path,'r') as f:
    sample_text = f.read()

print(sample_text)


Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


In [3]:
#//*** Randomly select 20% of the training data and move it to a validation folder
import random
import shutil

val_dir = imdb_path.joinpath("val")
train_dir = imdb_path.joinpath("train")
test_dir = imdb_path.joinpath("test")

for category in ("neg", "pos"):
    #//*** Skip this category if its val folder exists (delete the folder to resample)
    if os.path.exists(val_dir.joinpath(category)):
        continue
    
    os.makedirs(val_dir.joinpath(category))
    files = os.listdir(train_dir.joinpath(category))
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
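
The split logic can be checked on plain lists without moving any files. The sketch below assumes a hypothetical helper, `split_filenames`, and made-up file names. It sorts before shuffling so the result does not depend on the order `os.listdir` happens to return, and it guards against a slice of `[-0:]`, which would otherwise select every file.

```python
import random

def split_filenames(files, val_fraction=0.2, seed=1337):
    """Return (train_files, val_files) using a seeded, reproducible shuffle."""
    shuffled = sorted(files)  # sort first so the result is independent of listing order
    random.Random(seed).shuffle(shuffled)
    num_val = int(val_fraction * len(shuffled))
    if num_val == 0:
        # Too few files to hold any out; avoid the [-0:] slice, which returns everything
        return shuffled, []
    return shuffled[:-num_val], shuffled[-num_val:]

names = [f"{i}_7.txt" for i in range(10)]
train_files, val_files = split_filenames(names)
print(len(train_files), len(val_files))
```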



# Load IMDB Dataset #

In [104]:
#//*** Use Universal Detector to determine file encoding.
#//*** Borrowed from Assignment04
def read_file_with_encoding(filepath):

    detector = UniversalDetector()
    
    try:
        with open(filepath) as f:
            return f.read()
    except UnicodeDecodeError:
        detector.reset()
        with open(filepath, 'rb') as f:
            for line in f.readlines():
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        encoding = detector.result['encoding']
        with open(filepath, encoding=encoding) as f:
            return f.read()

#//*** Borrowed from Assignment04
def parse_html_payload(payload):
    """
    This function uses Beautiful Soup to read HTML data
    and return the text.  If the payload is plain text, then
    Beautiful Soup will return the original content
    """
    soup = BeautifulSoup(payload, 'html.parser')
    return str(soup.get_text()).encode('utf-8').decode('utf-8')

def load_dataset(dir_path):
    
    text = []
    targets = []
    
    #//*** Crawl the neg and pos folders
    for category in ("neg", "pos"):
        files = os.listdir(dir_path.joinpath(category))
        
        #//*** Loop through each file in the folder
        for file in files:
            try:
                #//*** Add processed file to text
                text.append(
                    #//*** Strip HTML Tags
                    parse_html_payload(
                        #//*** Read File from disk. Function uses Universal Detector to determine file encoding
                        read_file_with_encoding(
                            dir_path.joinpath(category).joinpath(file))))

                #//*** Append Target Value
                if category == 'neg':
                    targets.append(0)
                else:
                    targets.append(1)
            except Exception:
                print(f"Dropping File: {file} due to decoding issues")
    return text,targets
print("Loading Raw Validation Set")
raw_val_text, val_targets = load_dataset(val_dir)

print("Loading Raw Train Data")
raw_train_text, train_targets = load_dataset(train_dir)

print("Loading Raw Test Data")
raw_test_text, test_targets = load_dataset(test_dir)
print("Done")

Loading Raw Validation Set
Loading Raw Train Data
Dropping File: 7714_1.txt due to decoding issues
Dropping File: 11351_9.txt due to decoding issues
Dropping File: 8263_9.txt due to decoding issues
Loading Raw Test Data
Dropping File: 4414_1.txt due to decoding issues
Dropping File: 6973_1.txt due to decoding issues
Dropping File: 2464_10.txt due to decoding issues
Dropping File: 5281_10.txt due to decoding issues
Done
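
The log above shows a handful of files being dropped. One possible alternative, sketched here on the assumption that losing a few malformed characters is acceptable, is to decode with `errors="replace"` so every review is kept. The helper name `read_text_lenient` is hypothetical, and the demo writes its own temporary file rather than touching the IMDB data.

```python
import os
import tempfile

def read_text_lenient(path, encoding="utf-8"):
    """Read a file as text, substituting U+FFFD for bytes that cannot be decoded."""
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()

# Demonstrate on a temporary file containing a byte (0xE9) that is invalid UTF-8 here
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "review.txt")
    with open(demo_path, "wb") as f:
        f.write(b"caf\xe9 scene")
    lenient_text = read_text_lenient(demo_path)

print(lenient_text)
```

The trade-off is that replaced characters stay in the text as U+FFFD and may need to be stripped during tokenization.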


# Assignment 10.1 #

In [120]:
#//*** Vectorize a corpus
class Vectorizer:
    def __init__(self,**kwargs):
        self.corpus_tokens = []
        self.corpus_ngrams = []

        self.max_tokens = None
        self.ngram_size = 1
        self.tidyup = True
        
        for key,value in kwargs.items():
            if key =="max_tokens":
                self.max_tokens = value
                
            if key == "ngrams":
                self.ngram_size = value
            
            if key == "tidyup":
                self.tidyup = value
        
        
        #//*** One Hot Encoding Dictionaries
        #//*** Key = Token Index, Value = Word
        self.ngram_index = {}
        
        #//*** Key = Word, Value = Token Index
        self.vocabulary_index = {}
        
    def tokenize(self,raw_text):
        #//*** Initialize Output Tokens
        tokens = []

        #//*** Split Text into words
        for x in re.split(r"\s",raw_text):

            #//*** Find all non-word characters in each word
            non_text = re.findall(r"\W",x)

            #//*** Remove non_text Characters
            for i in non_text:
                x = x.replace(i,"")

            #//*** If X has length, append out
            if len(x) > 0:
                tokens.append(x.lower())
        return tokens

    def build_ngrams(self):
        if self.ngram_size <= 0:
            print("Ngram size must be an integer > 0")
            print("Quitting!")
            return None
        
        #//*** Using unigrams, use tokens
        if self.ngram_size == 1:
            self.corpus_ngrams = self.corpus_tokens
            return

        self.corpus_ngrams = []
        
        #//*** Get each token group from corpus_tokens
        for token in self.corpus_tokens:
            
            loop_ngram = []
            
            #//*** Use an index based range to loop through tokens
            for x in range(0,len(token) ):

                #//*** Check if index + ngram_size exceeds the length of tokens
                if x+self.ngram_size <= len(token):

                    result = ""

                    #//*** Build the ngram
                    for y in range(self.ngram_size):
                        #print(self.tokens[x+y])
                        result += token[x+y] + " "

                    loop_ngram.append(result[:-1])

                else:
                    break
            
            #//*** Token group ngram is built. Add loop_ngram to corpus_ngram
            self.corpus_ngrams.append(loop_ngram)
        

    def one_hot_encode(self,tokens):
        
        #//*** Encoded Results
        result = []
        
        #//*** Set the Max array size to the total number of items in self.ngram_index
        array_size = len(self.ngram_index.keys())
        

        #//*** hot encode each ngram
        for ngram in tokens:
            
            #//*** Skip words not in self.vocabulary_index
            #//*** These are skipped due to max_tokens limitations
            if ngram not in self.vocabulary_index.keys():
                continue
            
            #//*** Generate Array of Zeroes of Vocabulary length 
            encoded_text = list(np.zeros(array_size,dtype = int))
            
            #//*** Set Index of Vocabulary Word to 1
            encoded_text[ self.vocabulary_index[ngram] ] = 1
            
            #//*** Add the one-hot-encoded word to encoded text
            result.append(encoded_text)
        
        #//*** END for ngram in tokens:
        
        return result
    
    def encode(self,corpus):
        
        if not isinstance(corpus,list) :
            print("Vectorizer Requires a corpus (list of text):")
            return None
        
        #//*** Reset tokens so repeated calls do not accumulate earlier text
        self.corpus_tokens = []
        
        #//*** Tokenize each text entry in the corpus
        for raw_text in corpus:
            self.corpus_tokens.append(self.tokenize(raw_text))
        
        #//*** Build ngrams (Defaults to unigrams)
        self.build_ngrams()
        
        word_freq = {}
        
        #//*** Build dictionary of unique words
        #//*** Loop through each element of the corpus
        for element in self.corpus_ngrams:
        
            #//*** Process each individual ngram
            for ngram in element:

                #//*** Add unique words to dictionaries
                if ngram not in self.vocabulary_index:
                    index = len(self.ngram_index)
                    self.ngram_index[ index ] = ngram
                    self.vocabulary_index [ ngram ] = index
                    
                    #//*** Initialize Word Frequency
                    word_freq[ ngram ] = 1
                else:
                    #//*** Increment Word Frequency
                    word_freq[ ngram ] += 1

        #//*** END for element in self.corpus_ngrams:
        if self.max_tokens is not None:
            
            #//*** Check if token count exceeds max tokens
            if self.max_tokens < len(self.ngram_index.items()):
                
                #//*** Sort the Word Frequency Dictionary. Keep the highest frequency words
                word_freq = dict(sorted(word_freq.items(), key=lambda x: x[1], reverse=True))
                
                
                #//*** Get list of keys that are lowest frequency
                for key in list(word_freq.keys())[self.max_tokens:]:
                    #//*** Delete Low Frequency ngrams
                    del word_freq[ key ]
                
                self.ngram_index = {}
                self.vocabulary_index = {}
                
                #//*** Rebuild ngram_index & vocabulary_index
                for ngram in word_freq.keys():
                    index = len(self.ngram_index.values())
                    self.ngram_index[ index ] = ngram
                    self.vocabulary_index [ ngram ] = index        
            
            #//*** END Trim Low Frequency ngrams
        self.word_freq = word_freq
        
        #//**** List of Encoded Values
        encoded = []
        
        #//*** One hot encode each text element
        for element in self.corpus_ngrams:
            encoded.append( self.one_hot_encode(element) )
            
        #//*** TidyUp (Delete) ngrams and Tokens
        if self.tidyup:
            self.corpus_tokens = []
            self.corpus_ngrams = []
            
        return encoded
    
    #//*** Convert One-Hot-Encoding to text
    def decode(self,corpus):
        
        results = []
        
        #//*** For Each element in Corpus
        for elements in corpus:
            
            decoded = ""
            
            #//*** For Each ngram (word(s)) in Elements
            for ngram in elements:
                
                
                
                decoded += self.ngram_index[ ngram.index(1) ] + " "
                
            #//*** END for ngram in elements:
            results.append( decoded[:-1])
            
        #//*** END for elements in corpus:
        return results


#//*** Test the Vectorizer with some sample data
vectorizer = Vectorizer(max_tokens=100,ngrams=2, tidyup=False)

temp_vals = vectorizer.encode(raw_val_text[:5])

print("Sample Text: (First 500 Chars)")
for element in raw_val_text[:5]:
    print(element[:500])
    print("====")
print()
print()

print("Tokens: (First 100 tokens)")
for token in vectorizer.corpus_tokens:
    print(token[:100])
    print("====")
print()
print()

print("ngrams: (First 100 ngrams)")
for token in vectorizer.corpus_ngrams:
    print(token[:100])
    print("====")
print()
print()
print("Small one hot encoded Sample:")
print(temp_vals[0][:10])
print()
print()
print("Encoded Vocabulary")
print(vectorizer.vocabulary_index)
print()
print()
print("Decoded Text from vocabulary (limited by max tokens)")
for result in vectorizer.decode(temp_vals):
    print(result)
    print()


Sample Text: (First 500 Chars)
Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markh
====
This film lacked something I couldn't put my finger on at first: charisma on the part of the leading actress. This inevitably translated to lack of chemistry when she shared the screen with her leading man. Even the romantic scenes came across as being merely the actors at play. It could very well have been the director who miscalculated what he needed from the actors. I just don't know.But could it have been the screenplay? Just exactly who was the chef in l
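
The `one_hot_encode` method builds each vector as a Python list. A more compact alternative, offered as a sketch rather than a replacement for the class above, is to index an identity matrix with the token indices. The vocabulary and n-grams below are made-up toy data.

```python
import numpy as np

toy_vocab = {"the plane": 0, "plane is": 1, "is loaded": 2}
toy_ngrams = ["plane is", "the plane", "not in vocab", "is loaded"]

# Keep only n-grams present in the vocabulary, as the Vectorizer does
indices = [toy_vocab[g] for g in toy_ngrams if g in toy_vocab]

# Row i of the identity matrix is the one-hot vector for index i
one_hot = np.eye(len(toy_vocab), dtype=int)[indices]
print(one_hot)

# Decoding: argmax recovers the position of the single 1 in each row
index_to_ngram = {i: g for g, i in toy_vocab.items()}
decoded = [index_to_ngram[int(i)] for i in one_hot.argmax(axis=1)]
print(decoded)
```

This keeps the result in a single 2-D array per document, which is usually cheaper than a list of lists for large vocabularies.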

In [6]:
dir(keras.utils)

['CustomObjectScope',
 'GeneratorEnqueuer',
 'OrderedEnqueuer',
 'Progbar',
 'Sequence',
 'SequenceEnqueuer',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_sys',
 'array_to_img',
 'custom_object_scope',
 'deserialize_keras_object',
 'experimental',
 'get_custom_objects',
 'get_file',
 'get_registered_name',
 'get_registered_object',
 'get_source_inputs',
 'image_dataset_from_directory',
 'img_to_array',
 'load_img',
 'model_to_dot',
 'normalize',
 'pack_x_y_sample_weight',
 'plot_model',
 'register_keras_serializable',
 'save_img',
 'serialize_keras_object',
 'text_dataset_from_directory',
 'timeseries_dataset_from_array',
 'to_categorical',
 'unpack_x_y_sample_weight']

In [7]:
print(raw_val_text[:10])

["Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for

In [8]:

batch_size = 32

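#//*** NOTE: train/ also holds the unlabeled "unsup" folder from the aclImdb archive,
#//*** which is why 3 classes are reported for the training set below
print("Build: Training Data Set")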
train_ds = keras.utils.text_dataset_from_directory(
    imdb_path.joinpath("train"), batch_size=batch_size
)

print("Build: Validation Data Set")
val_ds = keras.utils.text_dataset_from_directory(
    imdb_path.joinpath("val"), batch_size=batch_size
)

print("Build: Test Data Set")
test_ds = keras.utils.text_dataset_from_directory(
    imdb_path.joinpath("test"), batch_size=batch_size
)




Build: Training Data Set
Found 70000 files belonging to 3 classes.
Build: Validation Data Set
Found 5000 files belonging to 2 classes.
Build: Test Data Set
Found 25000 files belonging to 2 classes.
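
One way to confirm which subfolders Keras will treat as classes is to list them before building the datasets. The sketch below uses a hypothetical helper, `list_class_dirs`, and runs against a temporary directory that imitates the layout, not against the real dataset.

```python
import tempfile
from pathlib import Path

def list_class_dirs(data_dir):
    """Return the sorted names of the immediate subdirectories of `data_dir`."""
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())

with tempfile.TemporaryDirectory() as tmp:
    # Imitate a train/ folder: two labeled classes, one extra folder, one loose file
    for name in ("neg", "pos", "unsup"):
        Path(tmp, name).mkdir()
    Path(tmp, "urls_pos.txt").write_text("not a class folder")
    found = list_class_dirs(tmp)
    unexpected = [d for d in found if d not in ("neg", "pos")]

print(found, unexpected)
```

Any name in `unexpected` would need to be moved or removed before the labels can be read as binary sentiment.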


In [9]:
#//*** Displaying the shapes and dtypes of the first batch

for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

len(train_ds)

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"I saw this about 14 years ago in a stroke of luck ( a local TV station had picked up a print, and my mother, horror buff that she is, decided to tape it), and the film has stuck with me ever since. It's not your typical horror film, and has more of a tragic element which was so very common to films of the genre in this particular era. The dark and dirty imagery only serves to enhance the premise, and the shrine the Hook children build to their mother is downright creepy. The children do a very decent job of portraying children ( something that is increasingly rare these days) and Dirk Bogarde does a fantastic job of portraying their scumbag father. And to boot, we've got a heavy incest theme going on. If you can get a hold of this one, go for it: it's very much of its time, but the opportunity is well worth any trouble.", shape=(), dtype=string)
targets[0]: tf

2188

In [10]:
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
)
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
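
With `output_mode="multi_hot"`, each review becomes one fixed-length vector with a 1 at every vocabulary index that occurs in it. The pure-Python sketch below mimics that idea on a made-up vocabulary. It is not the Keras implementation and leaves out details such as how the layer handles out-of-vocabulary tokens.

```python
def multi_hot(tokens, vocab_index):
    """Return a 0/1 list marking which vocabulary entries occur in `tokens`."""
    vector = [0] * len(vocab_index)
    for token in tokens:
        idx = vocab_index.get(token)
        if idx is not None:      # tokens outside the vocabulary are ignored in this sketch
            vector[idx] = 1      # repeated tokens still yield a single 1
    return vector

toy_vocab_index = {"great": 0, "film": 1, "boring": 2, "plot": 3}
encoded_review = multi_hot("great film great cast".split(), toy_vocab_index)
print(encoded_review)
```

Unlike the per-token one-hot vectors in Assignment 10.1, this representation discards word order and counts, which is why the vectorized batches have shape `(batch_size, max_tokens)`.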

In [11]:


for inputs, targets in binary_1gram_val_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break



inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


In [12]:
# //*** CODE HERE