<a href="https://colab.research.google.com/github/SauravMaheshkar/trax/blob/SauravMaheshkar-example-1/examples/Deep_N_Gram_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title
# Copyright 2020 Google LLC.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Author - [@SauravMaheshkar](https://github.com/SauravMaheshkar)

# Downloading the Trax Package

[Trax](https://trax-ml.readthedocs.io/en/latest/) is an end-to-end library for deep learning that focuses on clear code and speed. It is actively used and maintained in the [Google Brain team](https://research.google/teams/brain/). This notebook ([run it in colab](https://colab.research.google.com/github/google/trax/blob/master/trax/intro.ipynb)) shows how to use Trax and where you can find more information.

In [None]:
import os
import sys

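# Optionally point Python at a local Trax checkout (for example, if trax is
# inside a 'src' directory) by setting the TRAX_PROJECT_ROOT environment variable
project_root = os.environ.get('TRAX_PROJECT_ROOT', '')
if project_root:
    sys.path.insert(0, project_root)

# Option to verify the import path
print(f"First entry on the import path: {sys.path[0]}")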

# Import trax
import trax

# Verify the source of the imported package
print(f"Imported trax from: {trax.__file__}")

# Importing Packages

In this notebook we will use the following packages:

* [**shutil**](https://docs.python.org/3/library/shutil.html) module offers high-level file operations; here it removes stale model output directories before training.
* [**os**](https://docs.python.org/3/library/os.html) module provides a portable way of using operating system dependent functionality.
* [**trax**](https://trax-ml.readthedocs.io/en/latest/trax.html) is an end-to-end library for deep learning that focuses on clear code and speed.
* [**random**](https://docs.python.org/3/library/random.html) module implements pseudo-random number generators for various distributions.
* [**itertools**](https://docs.python.org/3/library/itertools.html) module implements a number of iterator building blocks inspired by constructs from APL, Haskell, and SML. Each has been recast in a form suitable for Python.

In [None]:
import shutil
import trax.fastmath.numpy as np
import random as rnd
from trax import layers as tl

# Loading the Data

For this project, I've used the [gothic-literature](https://www.kaggle.com/charlesaverill/gothic-literature), [shakespeare-plays](https://www.kaggle.com/kingburrito666/shakespeare-plays) and [shakespeareonline](https://www.kaggle.com/kewagbln/shakespeareonline) datasets from the Kaggle library.

We perform the following steps for loading in the data:

* Download each dataset as a zip archive with `curl`
* Extract every archive into its own subdirectory
* Walk the extracted directories and keep only the `.txt` files
* Build an `all_lines` list containing the stripped, non-empty lines from all the datasets combined

In [None]:
import os
import subprocess
import zipfile


def download_datasets(download_dir):
    os.makedirs(download_dir, exist_ok=True)

    # Define the datasets with output filename and download URL
    datasets = [
        {
            "filename": "gothic-literature.zip",
            "url": "https://www.kaggle.com/api/v1/datasets/download/charlesaverill/gothic-literature"
        },
        {
            "filename": "shakespeare-plays.zip",
            "url": "https://www.kaggle.com/api/v1/datasets/download/kingburrito666/shakespeare-plays"
        },
        {
            "filename": "shakespeareonline.zip",
            "url": "https://www.kaggle.com/api/v1/datasets/download/kewagbln/shakespeareonline"
        }
    ]

    # Download each dataset using curl
    for dataset in datasets:
        output_path = os.path.join(download_dir, dataset["filename"])
        # Build the curl command (using -L for following redirects)
        cmd = [
            "curl",
            "-L",
            "-o", output_path,
            dataset["url"]
        ]
        print(f"Downloading {dataset['filename']}...")
        subprocess.run(cmd, check=True)
        print(f"Downloaded to {output_path}")


def extract_zip_files(download_dir, extract_dir):
    os.makedirs(extract_dir, exist_ok=True)

    # Iterate through the zip files in the download directory
    for file in os.listdir(download_dir):
        if file.lower().endswith(".zip"):
            zip_path = os.path.join(download_dir, file)
            # Create a subdirectory for each zip file (optional)
            extract_subdir = os.path.join(extract_dir, os.path.splitext(file)[0])
            os.makedirs(extract_subdir, exist_ok=True)
            print(f"Extracting {zip_path} to {extract_subdir}...")
            with zipfile.ZipFile(zip_path, 'r') as z:
                z.extractall(extract_subdir)
            print("Extraction completed.")


def read_text_files(extracted_dir):
    lines = []

    # Walk through the unzipped directories and process each .txt file
    for root, _, files in os.walk(extracted_dir):
        for filename in files:
            if filename.lower().endswith(".txt"):
                file_path = os.path.join(root, filename)
                print(f"Reading {file_path}...")
                with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                    for line in f:
                        processed_line = line.strip()
                        if processed_line:
                            lines.append(processed_line)
    return lines

In [None]:
# Set download and extraction directories
download_dir = os.path.expanduser("~/Downloads")
extract_dir = os.path.join(download_dir, "extracted_datasets")

# Download datasets using curl
download_datasets(download_dir)

# Extract downloaded zip files
extract_zip_files(download_dir, extract_dir)

# Read text files from extracted data
all_lines = read_text_files(extract_dir)

print(f"Total non-empty lines read: {len(all_lines)}")
# For example purposes, printing first 10 lines
print("\nFirst 10 lines:")
for line in all_lines[:10]:
    print(line)
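
If a download fails, `curl` may still write a file with a `.zip` name that is not a real archive, and extraction would then raise an error. As a minimal sketch, assuming you want to check the downloads before extracting them, the standard library's `zipfile.is_zipfile` can flag such files. The helper name `find_invalid_zips` is hypothetical and not part of the original notebook, and the demonstration runs on a throwaway temporary directory rather than the real downloads.

```python
import os
import tempfile
import zipfile


def find_invalid_zips(download_dir):
    """Hypothetical helper: names of .zip files that are not valid archives."""
    invalid = []
    for name in sorted(os.listdir(download_dir)):
        if name.lower().endswith(".zip"):
            path = os.path.join(download_dir, name)
            if not zipfile.is_zipfile(path):
                invalid.append(name)
    return invalid


# Demonstration on a temporary directory (not the real download directory).
with tempfile.TemporaryDirectory() as tmp:
    # A genuine archive containing one small text file.
    with zipfile.ZipFile(os.path.join(tmp, "good.zip"), "w") as z:
        z.writestr("sample.txt", "hello\n")
    # A file with a .zip name but non-zip contents, such as an error page.
    with open(os.path.join(tmp, "bad.zip"), "w") as f:
        f.write("<html>not a zip</html>")
    invalid_demo = find_invalid_zips(tmp)

print(invalid_demo)  # ['bad.zip']
```

Calling `find_invalid_zips(download_dir)` before `extract_zip_files` would list any archives worth downloading again.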

## Pre-Processing

### Converting to Lowercase

Converting all the characters in the `all_lines` list to **lowercase**.

In [None]:
for i, line in enumerate(all_lines):
    all_lines[i] = line.lower()

### Converting into Tensors

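Creating a function to convert each line into a tensor by converting each character into its integer code point with `ord()` (the ASCII value for ASCII characters) and appending an `EOS` (**end of sentence**) marker, whose value can be changed through the `EOS_int` parameter.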

In [None]:
def line_to_tensor(line, EOS_int=1):
    tensor = []
    for c in line:
        c_int = ord(c)
        tensor.append(c_int)

    tensor.append(EOS_int)

    return tensor
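
As a quick illustration of the conversion, the sketch below repeats the same logic in compact form and shows what it returns for a short string. It also includes a hypothetical helper, `out_of_vocab_chars`, that is not part of the original notebook. It lists characters whose code point would not fit a vocabulary of 256 entries, the default `vocab_size` of the models defined later. Whether such characters occur depends on the downloaded text, so treat this as a diagnostic sketch rather than a required step.

```python
def line_to_tensor(line, EOS_int=1):
    """Same logic as the cell above: one integer per character, then EOS."""
    return [ord(c) for c in line] + [EOS_int]


def out_of_vocab_chars(line, vocab_size=256):
    """Hypothetical helper: characters whose code point is >= vocab_size."""
    return sorted({c for c in line if ord(c) >= vocab_size})


print(line_to_tensor("abc"))       # [97, 98, 99, 1]
print(out_of_vocab_chars("café"))  # [] because ord('é') == 233
print(out_of_vocab_chars("it’s"))  # ['’'] because ord('’') == 8217
```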

### Creating a Batch Generator

Here, we create a `data_generator()` function that yields batches together with their masks. It performs the following steps:

* Shuffle the line indices if `shuffle` is set
* Skip lines that are not shorter than `max_length`
* Convert each line into a tensor
* Pad each tensor with zeros up to `max_length`
* Generate a mask that is 1 for real characters and 0 for padding

In [None]:
def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
    index = 0
    cur_batch = []
    num_lines = len(data_lines)
    lines_index = [*range(num_lines)]

    if shuffle:
        rnd.shuffle(lines_index)

    while True:

        if index >= num_lines:
            index = 0
            if shuffle:
                rnd.shuffle(lines_index)

        line = data_lines[lines_index[index]]

        if len(line) < max_length:
            cur_batch.append(line)

        index += 1

        if len(cur_batch) == batch_size:

            batch = []
            mask = []

            for li in cur_batch:
                tensor = line_to_tensor(li)

                pad = [0] * (max_length - len(tensor))
                tensor_pad = tensor + pad
                batch.append(tensor_pad)

                example_mask = [0 if t == 0 else 1 for t in tensor_pad]
                mask.append(example_mask)

            batch_np_arr = np.array(batch)
            mask_np_arr = np.array(mask)

            yield batch_np_arr, batch_np_arr, mask_np_arr

            cur_batch = []


In [None]:
generator = data_generator(2, 10, all_lines, line_to_tensor=line_to_tensor, shuffle=True)
print(next(generator))
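
The padding and masking step can be seen in isolation with plain Python lists. This is a minimal sketch of that one step, assuming a padding value of 0 as in the generator above; `pad_and_mask` is a hypothetical helper name, and the real generator additionally converts the results into arrays.

```python
def pad_and_mask(tensors, max_length, pad_id=0):
    """Hypothetical helper: right-pad each tensor and build a matching 0/1 mask."""
    batch, mask = [], []
    for tensor in tensors:
        padded = tensor + [pad_id] * (max_length - len(tensor))
        batch.append(padded)
        # 1 marks a real character (or EOS), 0 marks padding.
        mask.append([0 if value == pad_id else 1 for value in padded])
    return batch, mask


# "hi" and "o", each followed by the EOS value 1.
demo_batch, demo_mask = pad_and_mask([[104, 105, 1], [111, 1]], max_length=5)
print(demo_batch)  # [[104, 105, 1, 0, 0], [111, 1, 0, 0, 0]]
print(demo_mask)   # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
```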

# Defining the Model

## Gated Recurrent Unit

This function generates a GRU Language Model, consisting of the following layers:

* ShiftRight()
* Embedding()
* GRU layers (number specified by the `n_layers` parameter)
* Dense() Layer
* LogSoftmax() Activation

In [None]:
def GRULM(vocab_size=256, d_model=512, n_layers=2, mode='train'):
    model = tl.Serial(
        tl.ShiftRight(mode=mode),
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),
        [tl.GRU(n_units=d_model) for _ in range(n_layers)],
        tl.Dense(n_units=vocab_size),
        tl.LogSoftmax()
    )
    return model
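
The data generator yields the same array as both input and target, so the model relies on `ShiftRight` to avoid seeing the character it is asked to predict. The idea can be sketched in plain Python. This is an illustration of the concept for a single sequence, not Trax's implementation, and `shift_right` is a hypothetical helper.

```python
def shift_right(sequence, pad_id=0):
    """Hypothetical helper: prepend pad_id and drop the last element."""
    return [pad_id] + list(sequence[:-1])


targets = [104, 105, 1]        # "hi" followed by the EOS value 1
inputs = shift_right(targets)  # what the recurrent layers would see

# Pair each input position with the target to be predicted at that position.
pairs = list(zip(inputs, targets))
print(inputs)  # [0, 104, 105]
print(pairs)   # [(0, 104), (104, 105), (105, 1)]
```

At every position the input holds only earlier characters, which is what makes next-character prediction a meaningful task.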

## Long Short Term Memory

This function generates an LSTM Language Model, consisting of the following layers:

* ShiftRight()
* Embedding()
* LSTM layers (number specified by the `n_layers` parameter)
* Dense() Layer
* LogSoftmax() Activation

In [None]:
def LSTMLM(vocab_size=256, d_model=512, n_layers=2, mode='train'):
    model = tl.Serial(
        tl.ShiftRight(mode=mode),
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),
        [tl.LSTM(n_units=d_model) for _ in range(n_layers)],
        tl.Dense(n_units=vocab_size),
        tl.LogSoftmax()
    )
    return model

## Simple Recurrent Unit

This function generates an SRU Language Model, consisting of the following layers:

* ShiftRight()
* Embedding()
* SRU layers (number specified by the `n_layers` parameter)
* Dense() Layer
* LogSoftmax() Activation

In [None]:
def SRULM(vocab_size=256, d_model=512, n_layers=2, mode='train'):
    model = tl.Serial(
        tl.ShiftRight(mode=mode),
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),
        [tl.SRU(n_units=d_model) for _ in range(n_layers)],
        tl.Dense(n_units=vocab_size),
        tl.LogSoftmax()
    )
    return model

In [None]:
GRUmodel = GRULM(n_layers=5)
LSTMmodel = LSTMLM(n_layers=5)
SRUmodel = SRULM(n_layers=5)
print(GRUmodel)
print(LSTMmodel)
print(SRUmodel)

## Hyperparameters

Here, we declare the `batch_size` and `max_length` hyperparameters for the model.

In [None]:
batch_size = 32
max_length = 64

# Creating Evaluation and Training Dataset

In [None]:
eval_lines = all_lines[-1000:]  # Create a holdout validation set
lines = all_lines[:-1000]  # Leave the rest for training
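
The slice above takes the last 1000 lines in file order, so the evaluation set comes from whichever files were read last, and a corpus with fewer than 1000 lines would leave nothing for training. One possible alternative, shown only as a sketch, is to shuffle with a fixed seed before splitting and to check the sizes. The helper `split_holdout` and its parameters are hypothetical and not part of the original notebook.

```python
import random


def split_holdout(lines, n_eval, seed=0):
    """Hypothetical helper: shuffle a copy of lines, then hold out n_eval of them."""
    if n_eval <= 0 or n_eval >= len(lines):
        raise ValueError("n_eval must be between 1 and len(lines) - 1")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    return shuffled[:-n_eval], shuffled[-n_eval:]


demo_lines = [f"line {i}" for i in range(10)]
train_demo, eval_demo = split_holdout(demo_lines, n_eval=3)
print(len(train_demo), len(eval_demo))  # 7 3
```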

# Training the Models

Here, we create a function to train the models. This function does the following:

* Create training and evaluation generators that cycle indefinitely using the `itertools` module
* Train the model using the Adam optimizer
* Evaluate with the cross-entropy loss and accuracy metrics

In [None]:
from trax.learning.supervised import training
from trax import optimizers
import itertools


def train_model(model, data_generator, batch_size=32, max_length=64, lines=lines, eval_lines=eval_lines, n_steps=10,
                output_dir='model/'):
    bare_train_generator = data_generator(batch_size, max_length, data_lines=lines)
    infinite_train_generator = itertools.cycle(bare_train_generator)

    bare_eval_generator = data_generator(batch_size, max_length, data_lines=eval_lines)
    infinite_eval_generator = itertools.cycle(bare_eval_generator)

    train_task = training.TrainTask(
        labeled_data=infinite_train_generator,
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=optimizers.Adam(0.0005),
        n_steps_per_checkpoint=1
    )

    eval_task = training.EvalTask(
        labeled_data=infinite_eval_generator,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
        n_eval_batches=1
    )

    training_loop = training.Loop(model,
                                  train_task,
                                  eval_tasks=[eval_task],
                                  output_dir=output_dir
                                  )

    training_loop.run(n_steps=n_steps)

    return training_loop
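
The third array yielded by `data_generator` appears intended to act as per-position weights, so that padded positions do not count toward the loss or the accuracy. The effect of such a mask on accuracy can be sketched in plain Python. This illustrates the idea only and is not Trax's implementation of `tl.Accuracy`; `masked_accuracy` is a hypothetical helper.

```python
def masked_accuracy(predictions, targets, mask):
    """Hypothetical helper: accuracy over positions where mask is non-zero."""
    correct = 0
    counted = 0
    for pred_row, target_row, mask_row in zip(predictions, targets, mask):
        for pred, target, keep in zip(pred_row, target_row, mask_row):
            if keep:
                counted += 1
                correct += int(pred == target)
    return correct / counted if counted else 0.0


targets = [[104, 105, 1, 0, 0]]      # "hi", EOS, then two padding positions
predictions = [[104, 106, 1, 0, 0]]  # one wrong character
mask = [[1, 1, 1, 0, 0]]
no_mask = [[1, 1, 1, 1, 1]]

print(round(masked_accuracy(predictions, targets, mask), 3))     # 0.667
print(round(masked_accuracy(predictions, targets, no_mask), 3))  # 0.8
```

Without the mask, the trivially correct padding positions inflate the score, which is why the mask is worth passing along with each batch.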


In [None]:
shutil.rmtree(os.path.expanduser('model/GRU'), ignore_errors=True)
GRU_training_loop = train_model(GRUmodel, data_generator, n_steps=10, output_dir='model/GRU')

In [None]:
shutil.rmtree(os.path.expanduser('model/LSTM'), ignore_errors=True)
LSTM_training_loop = train_model(LSTMmodel, data_generator, n_steps=10, output_dir='model/LSTM')

In [None]:
shutil.rmtree(os.path.expanduser('model/SRU'), ignore_errors=True)
SRU_training_loop = train_model(SRUmodel, data_generator, n_steps=50_000, output_dir='model/SRU')