# Understanding the Character-Level LSTM Model

This notebook uses a character-level language model built on Long Short-Term Memory (LSTM) networks, implemented in `src/myprogram.py`. A brief overview follows.

**Model Architecture:**

The core of the model (`CharLSTM` class) consists of:
1.  **Embedding Layer:** Converts input characters into dense vector representations. Each unique character in the training data gets its own vector.
2.  **LSTM Layer(s):** These recurrent neural network layers process the sequence of character embeddings. LSTMs are designed to capture dependencies and context over varying lengths in sequential data, making them suitable for text.
3.  **Fully Connected (Linear) Layer:** Takes the LSTM's output and projects it onto the vocabulary space, producing a score (logit) for each possible next character.
4.  **Dropout:** Incorporated to prevent overfitting during training.
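
The four components above can be sketched in PyTorch as follows. This is a minimal illustration, not the code in `src/myprogram.py`: the class name `CharLSTMSketch`, the default sizes, the dropout rate, and the choice to read only the final time step are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class CharLSTMSketch(nn.Module):
    """Illustrative stand-in for CharLSTM; the real class may differ."""

    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # nn.LSTM applies dropout only between stacked layers, so it is
        # disabled for a single layer to avoid a warning.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers,
                            batch_first=True,
                            dropout=dropout if num_layers > 1 else 0.0)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_length) integer character indices
        embedded = self.embedding(x)                # (batch, seq, embedding_dim)
        output, _ = self.lstm(embedded)             # (batch, seq, hidden_dim)
        last_step = self.dropout(output[:, -1, :])  # keep the final time step
        return self.fc(last_step)                   # (batch, vocab_size) logits


model = CharLSTMSketch(vocab_size=100)
batch = torch.randint(0, 100, (4, 32))  # 4 sequences of 32 character indices
logits = model(batch)
print(logits.shape)  # torch.Size([4, 100])
```

Each row of `logits` holds one score per vocabulary entry, which is the shape a cross-entropy loss expects when there is a single target character per sequence.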

**Data Processing (`CharDataset` class):**
*   Text data is transformed into sequences of a fixed length (`seq_length`).
*   A vocabulary is built, mapping each unique character to an integer index.
*   For each input sequence, the target is the character immediately following it.
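
The three steps above amount to a sliding window over the encoded text. The sketch below uses plain Python and hypothetical helper names (`build_vocab`, `make_examples`); the actual `CharDataset` class may organize this differently.

```python
def build_vocab(text):
    """Map each unique character to an integer index (sorted for determinism)."""
    chars = sorted(set(text))
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def make_examples(text, seq_length, char_to_idx):
    """Return (input_indices, target_index) pairs using a sliding window."""
    encoded = [char_to_idx[ch] for ch in text]
    examples = []
    for start in range(len(encoded) - seq_length):
        window = encoded[start:start + seq_length]
        target = encoded[start + seq_length]  # the character right after the window
        examples.append((window, target))
    return examples


text = "hello world"
char_to_idx, idx_to_char = build_vocab(text)
examples = make_examples(text, seq_length=4, char_to_idx=char_to_idx)

first_input, first_target = examples[0]
print("".join(idx_to_char[i] for i in first_input), "->", idx_to_char[first_target])
# prints: hell -> o
print(len(examples))  # 7
```

A text of length `n` yields `n - seq_length` examples, so short files contribute few training pairs.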

**Functionality (`MyModel` class):**
*   **Training (`run_train`):** The model learns to predict the next character by minimizing the cross-entropy loss between its predictions and the actual next characters in the training data. It uses the Adam optimizer and supports checkpointing to save and resume training.
*   **Prediction (`run_pred`):** Given an input string, the model processes the last `seq_length` characters and outputs the top 3 most probable characters that could follow.
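
To make the two modes concrete, the sketch below computes a cross-entropy loss and a top-3 selection from a single vector of logits in plain Python. The function names and the scores are made up for illustration; the real code presumably does the same work on batched tensors.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (max-shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_idx):
    """Negative log-probability assigned to the true next character."""
    return -math.log(softmax(logits)[target_idx])


def top_k_chars(logits, idx_to_char, k=3):
    """Return the k characters with the highest scores, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return "".join(idx_to_char[i] for i in ranked[:k])


idx_to_char = {0: "a", 1: "e", 2: "o", 3: "u"}
logits = [0.5, 2.0, 1.0, 3.0]  # made-up scores for illustration

print(top_k_chars(logits, idx_to_char))  # ueo
print(cross_entropy(logits, target_idx=3) < cross_entropy(logits, target_idx=0))  # True
```

The loss is smallest when the true character has the highest score, which is what training pushes toward.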

**Use Case:**

The primary use case is **next character prediction**. Given a segment of text, the model attempts to predict which character is most likely to appear next. This has applications in:
*   **Text Autocompletion:** Suggesting subsequent characters or words.
*   **Generative Text Models:** While this model predicts one character at a time, this is a fundamental building block for more complex text generation systems.
*   **Understanding Language Structure:** Character-level modeling helps capture fundamental patterns and structures within different languages.
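
As a rough illustration of how one-character prediction becomes a building block for generation, the loop below feeds each predicted character back in as context. `predict_next` is a hypothetical callable standing in for the trained model; the notebook itself does not provide a generation function.

```python
def generate(seed, predict_next, num_chars, seq_length):
    """Extend `seed` one character at a time using `predict_next`.

    `predict_next` is a hypothetical callable: it takes the most recent
    context string and returns a single next character.
    """
    text = seed
    for _ in range(num_chars):
        context = text[-seq_length:]  # the model only sees the last seq_length chars
        text += predict_next(context)
    return text


# Toy predictor used only to make the sketch runnable: it repeats the
# character seen two positions back, so "ab" continues as "ababab...".
def toy_predictor(context):
    return context[-2]


print(generate("ab", toy_predictor, num_chars=4, seq_length=8))  # ababab
```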

The model is designed to be **multilingual**. By training it on a corpus containing texts from various languages (as demonstrated in the data download step of this notebook), it can learn to predict characters across those languages.

# Running Multilingual Character-Level LSTM in Google Colab

This notebook guides you through running the character-level LSTM model in Google Colab.

## Step 1: Check GPU Availability

First, make sure Colab has assigned a GPU to your notebook: go to **Runtime > Change runtime type**, select **GPU** as your hardware accelerator, then run the cell below to confirm.

In [None]:
# Check if GPU is available
import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

### Mount Google Drive for Persistent Storage

To ensure your project files, training data, and model checkpoints are saved persistently across sessions, we will work from your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Define the base path on your Google Drive where the project will reside.
# Change GDRIVE_BASE_PATH to any folder you prefer; it is created below if it does not exist.
import os
GDRIVE_BASE_PATH = "/content/drive/My Drive/Colab_Projects_CSE517P" # CHANGE THIS TO YOUR PREFERRED GDRIVE PATH
PROJECT_DIR_NAME = "cse517p-project"
GDRIVE_PROJECT_PATH = os.path.join(GDRIVE_BASE_PATH, PROJECT_DIR_NAME)

# Create the base directory on Drive if it doesn't exist
if not os.path.exists(GDRIVE_BASE_PATH):
    os.makedirs(GDRIVE_BASE_PATH)
    print(f"Created base directory: {GDRIVE_BASE_PATH}")

print(f"Project will be set up in: {GDRIVE_PROJECT_PATH}")
# Note: The actual project directory (cse517p-project) will be created by cloning or manually in the next steps.

## Step 2: Set Up Project Repository on Google Drive

To ensure persistence, your project files (including code, data, and saved models/checkpoints) must reside on your Google Drive.

**Choose one of the options below:**

In [None]:
# Option 1: Clone your GitHub repository into Google Drive

# Ensure the GDRIVE_PROJECT_PATH is defined from the cell above.
# If the project directory already exists from a previous run, this cell might show an error or skip cloning.
# You might want to remove the existing directory if you want a fresh clone:
# !rm -rf "$GDRIVE_PROJECT_PATH"

if not os.path.exists(GDRIVE_PROJECT_PATH):
    print(f"Cloning repository into {GDRIVE_PROJECT_PATH}...")
    # Clone into the base directory under the name PROJECT_DIR_NAME, then change into the clone
    %cd $GDRIVE_BASE_PATH
    !git clone https://github.com/jamevaalet/cse517p-project.git $PROJECT_DIR_NAME # Replace with your repo URL
    %cd $PROJECT_DIR_NAME
    print(f"Successfully cloned and changed directory to: {os.getcwd()}")
else:
    print(f"Project directory {GDRIVE_PROJECT_PATH} already exists. Skipping clone.")
    %cd $GDRIVE_PROJECT_PATH
    print(f"Changed directory to: {os.getcwd()}")

# Verify current directory
!pwd
!ls

**Option 2: Upload files manually to Google Drive**

If you haven't pushed to GitHub, create the project structure on your Google Drive and upload files.
Run the cell below to create the directory structure. Then, use the Colab file browser (left sidebar) to navigate to `/content/drive/My Drive/Your_Path/cse517p-project/` and upload your files into the `src`, `data`, etc., subdirectories.

In [None]:
# Option 2: Create project directory structure on Google Drive

# Ensure the GDRIVE_PROJECT_PATH is defined from a cell above.
print(f"Creating project structure in: {GDRIVE_PROJECT_PATH}")

# Create the main project directory if it doesn't exist
if not os.path.exists(GDRIVE_PROJECT_PATH):
    os.makedirs(GDRIVE_PROJECT_PATH)
    print(f"Created project directory: {GDRIVE_PROJECT_PATH}")

# Create subdirectories
!mkdir -p "$GDRIVE_PROJECT_PATH/src"
!mkdir -p "$GDRIVE_PROJECT_PATH/data"
!mkdir -p "$GDRIVE_PROJECT_PATH/work"  # work directory for checkpoints and models
!mkdir -p "$GDRIVE_PROJECT_PATH/example"
!mkdir -p "$GDRIVE_PROJECT_PATH/output" # For test outputs

%cd $GDRIVE_PROJECT_PATH
print(f"Changed directory to: {os.getcwd()}")
print("Please upload your files to the respective subdirectories (e.g., src, data) using the Colab file browser.")

# Verify current directory and structure
!pwd
!ls -l

## Step 3: Install Required Dependencies

In [None]:
# Install dependencies
!pip install numpy tqdm

## Step 4: Download Multilingual Training Data

Let's download sample multilingual data for training.

In [None]:
# Download sample data
# This will download data into the 'data' subdirectory of your project on Google Drive
# (assuming you have successfully cd'd into GDRIVE_PROJECT_PATH)
!mkdir -p data

# English sample (Pride and Prejudice)
!curl -s https://www.gutenberg.org/files/1342/1342-0.txt > data/english_pride_prejudice.txt
# Additional English samples
!curl -s https://www.gutenberg.org/files/1661/1661-0.txt > data/english_sherlock_holmes.txt
!curl -s https://www.gutenberg.org/files/2701/2701-0.txt > data/english_moby_dick.txt

# Spanish sample (Don Quixote)
!curl -s https://www.gutenberg.org/files/2000/2000-0.txt > data/spanish_don_quijote.txt
# Additional Spanish samples
!curl -s https://www.gutenberg.org/files/5946/5946-0.txt > data/spanish_la_regenta.txt # La Regenta by Clarín
!curl -s https://www.gutenberg.org/files/1701/1701-0.txt > data/spanish_fortunata_y_jacinta.txt # Fortunata y Jacinta by Benito Pérez Galdós


# French sample (Les Misérables - Tome I)
!curl -s https://www.gutenberg.org/files/17489/17489-0.txt > data/french_les_miserables.txt
# Additional French samples
!curl -s https://www.gutenberg.org/files/2413/2413-0.txt > data/french_madame_bovary.txt # Madame Bovary by Flaubert
!curl -s https://www.gutenberg.org/files/10003/10003-0.txt > data/french_le_rouge_et_le_noir.txt # Le Rouge et le Noir by Stendhal

# German sample (Also sprach Zarathustra)
!curl -s https://www.gutenberg.org/cache/epub/1998/pg1998.txt > data/german_zarathustra.txt
# Additional German samples
!curl -s https://www.gutenberg.org/files/5200/5200-0.txt > data/german_die_verwandlung.txt # Die Verwandlung by Kafka
!curl -s https://www.gutenberg.org/files/2229/2229-0.txt > data/german_faust_part1.txt # Faust I by Goethe

# Portuguese sample (Os Lusíadas)
!curl -s https://www.gutenberg.org/files/3333/3333-0.txt > data/portuguese_lusiadas.txt
# Additional Portuguese samples
!curl -s https://www.gutenberg.org/files/5518/5518-0.txt > data/portuguese_dom_casmurro.txt # Dom Casmurro by Machado de Assis
!curl -s https://www.gutenberg.org/files/54706/54706-0.txt > data/portuguese_memorias_postumas_bras_cubas.txt # Memórias Póstumas de Brás Cubas by Machado de Assis


# Commenting out other Latin-script languages to focus on the top 5 spoken ones:
# Italian sample (La Divina Commedia di Dante)
# !curl -s https://www.gutenberg.org/files/1001/1001-0.txt > data/italian_divina_commedia.txt
# Dutch sample (Max Havelaar)
# !curl -s https://www.gutenberg.org/files/36000/36000-0.txt > data/dutch_max_havelaar.txt
# Swedish sample (Röda rummet)
# !curl -s https://www.gutenberg.org/files/5381/5381-0.txt > data/swedish_roda_rummet.txt
# Finnish sample (Seitsemän veljestä)
# !curl -s https://www.gutenberg.org/files/11961/11961-0.txt > data/finnish_seitseman_veljesta.txt
# Danish sample (Niels Lyhne)
# !curl -s https://www.gutenberg.org/files/19099/19099-0.txt > data/danish_niels_lyhne.txt
# Norwegian sample (Peer Gynt)
# !curl -s https://www.gutenberg.org/files/2339/2339-0.txt > data/norwegian_peer_gynt.txt
# Polish sample (Pan Tadeusz)
# !curl -s https://www.gutenberg.org/files/20933/20933-0.txt > data/polish_pan_tadeusz.txt
# Hungarian sample (Az arany ember)
# !curl -s https://www.gutenberg.org/files/20925/20925-0.txt > data/hungarian_az_arany_ember.txt
# Latin sample (Commentarii de Bello Gallico)
# !curl -s https://www.gutenberg.org/files/10657/10657-0.txt > data/latin_bello_gallico.txt

# Non-Latin script languages removed from download:
# Russian, Chinese, Japanese, Korean, Arabic, Hindi

# Create example input file for testing
# This will create files in the 'example' subdirectory of your project on Google Drive
!mkdir -p example
!echo "Hello, how are you" > example/input.txt
!echo "Bonjour mon ami" >> example/input.txt
!echo "Hola, ¿cómo estás" >> example/input.txt
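
Because `curl -s` suppresses error output, a failed download can leave an empty file or an HTML error page in `data/` without any visible warning. The helper below is one possible sanity check; the function name `check_downloads` and the 10,000-byte threshold are arbitrary choices for this sketch. It cannot tell whether an ebook is in the intended language or is a translation, so it is still worth opening each file and reading the first few lines.

```python
import os


def check_downloads(data_dir, min_bytes=10_000):
    """Return a list of (filename, problem) pairs for suspicious .txt files."""
    problems = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(data_dir, name)
        size = os.path.getsize(path)
        if size < min_bytes:
            problems.append((name, f"only {size} bytes"))
            continue
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            head = f.read(500).lower()
        # A failed request often leaves an HTML error page in place of the text.
        if "<html" in head or "<!doctype" in head:
            problems.append((name, "looks like an HTML page"))
    return problems


if os.path.isdir("data"):
    for name, problem in check_downloads("data"):
        print(f"{name}: {problem}")
```

An empty result means every `.txt` file passed both checks; any file that is reported should be downloaded again or removed before training.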

### Languages in the Training Data

The preceding cell downloads sample texts from Project Gutenberg to serve as training data. The downloads target five widely spoken languages written in the Latin script:

*   English
*   Spanish
*   French
*   German
*   Portuguese

The `MyModel.load_training_data()` method in `src/myprogram.py` will load all `.txt` files from the `data/` directory. The character vocabulary and subsequent training will be based on the combined content of these files.

#### Script Usage in Downloaded Languages

All of the languages downloaded for training use the **Latin script**. Restricting the corpus to a single script keeps the vocabulary small, which may let the model learn the patterns of these languages better for a fixed model capacity and dataset size.

#### Languages in the Training Set

The training set for this notebook consists only of English, Spanish, French, German, and Portuguese. The commented-out downloads and all non-Latin-script languages are excluded, so the model is exposed to text from these five languages alone.

## Step 5: If needed, create or upload the Python files

If you cloned from GitHub, skip this step. Otherwise, you need to upload or create your Python files in the `src` directory.

Click on the folder icon on the left sidebar, navigate to the `src` directory, and upload your `myprogram.py` and `predict.sh` files.

### Note on Training Timeouts and Resuming

To mitigate Colab timeouts, the training script saves a checkpoint (`work/checkpoint.pt`) after each epoch. This checkpoint includes:
*   The model's learned weights.
*   The state of the optimizer.
*   The vocabulary (character-to-index mapping) used during training.
*   The number of the last completed epoch.

**How Resuming Works:**
*   If your session disconnects, **ensure your Google Drive is remounted** if necessary (run the Drive mount cell again). Then, navigate back to your project directory on Drive (`%cd /content/drive/My Drive/Your_Path/cse517p-project`).
*   Simply re-run the training cell (`!python src/myprogram.py train --work_dir work`).
*   The script automatically looks for `work/checkpoint.pt` on your Google Drive. If the file is found, it loads the saved progress and resumes training from the next epoch.
*   **Data Handling on Resume:**
    *   The **vocabulary** from the checkpoint is reloaded. This ensures character encodings remain consistent.
    *   The raw **training data files are re-read** from the `data/` directory (on your Google Drive) at the start of the resumed session. If you've modified the contents of the `data/` directory (e.g., added more text files), the resumed training will use this updated set of files. However, any new characters in these files not present in the loaded vocabulary will be filtered out.
    *   The `DataLoader` shuffles the dataset at the beginning of each epoch, including resumed epochs.
*   **Final Model:** Once all epochs are completed, the final trained model will be saved as `work/model.pt` and `work/vocab.pt` on your Google Drive.
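
The data-handling rules above can be illustrated with a small sketch. The checkpoint here is a plain dictionary with hypothetical key names (`epoch`, `char_to_idx`) and a 0-based epoch counter; the actual layout of `work/checkpoint.pt` is defined in `src/myprogram.py` and may differ.

```python
def encode_with_saved_vocab(text, char_to_idx):
    """Encode text with a fixed vocabulary, dropping unseen characters."""
    encoded = [char_to_idx[ch] for ch in text if ch in char_to_idx]
    dropped = sorted({ch for ch in text if ch not in char_to_idx})
    return encoded, dropped


def next_epoch(checkpoint):
    """Epoch to resume from, given the last completed epoch (0-based)."""
    return checkpoint["epoch"] + 1


# Hypothetical checkpoint contents; key names are assumptions for this sketch.
checkpoint = {"epoch": 2, "char_to_idx": {"a": 0, "b": 1, "c": 2}}

encoded, dropped = encode_with_saved_vocab("abcñab", checkpoint["char_to_idx"])
print(encoded)                 # [0, 1, 2, 0, 1]
print(dropped)                 # ['ñ']
print(next_epoch(checkpoint))  # 3
```

Filtering keeps the indices consistent with the saved embedding table, at the cost of the model never learning characters that first appear in newly added files.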

## Step 6: Train the Model

In [None]:
# Train the model
!python src/myprogram.py train --work_dir work

### Inspect Model Vocabulary (After Training)

After training, the model saves its vocabulary. Let's load it to see what characters it learned.
If you haven't trained the model yet in this session, this cell might show information from a previously saved `vocab.pt` or fail if no such file exists.

In [None]:
import torch
import os

# Ensure you are in the project directory
if os.path.basename(os.getcwd()) != PROJECT_DIR_NAME and os.path.exists(GDRIVE_PROJECT_PATH):
    %cd $GDRIVE_PROJECT_PATH

vocab_path = os.path.join('work', 'vocab.pt')
if os.path.exists(vocab_path):
    vocab_info = torch.load(vocab_path, map_location='cpu') # Load to CPU for inspection
    char_to_idx = vocab_info['char_to_idx']
    idx_to_char = vocab_info['idx_to_char']
    vocab_size = vocab_info['vocab_size']
    
    print(f"Vocabulary Size: {vocab_size}")
    print("\nSample of idx_to_char mapping (first 100 entries):")
    for i in range(min(100, vocab_size)):
        print(f"{i}: '{idx_to_char[i]}'", end='  ')
        if (i+1) % 10 == 0:
            print() # Newline every 10 chars
    print("\n\nChecking for specific characters (example):")
    # Multi-character strings (e.g. "Привет") are checked one character at a time
    example_strings = ['a', 'z', 'A', 'Z', ' ', '.', 'ñ', 'ç', 'ü', 'Привет', '你好', 'こんにちは', '안녕하세요', 'مرحبا', 'नमस्ते']
    for example in example_strings:
        for char_to_check in example:
            present = "Present" if char_to_check in char_to_idx else "Absent"
            idx = char_to_idx.get(char_to_check, "N/A")
            print(f"Character '{char_to_check}': {present} (Index: {idx})")
else:
    print(f"Vocabulary file not found at {vocab_path}. Train the model first or ensure the path is correct.")

# Alternative: If you want to inspect vocab from raw data without relying on a trained model's vocab.pt
# This requires access to the CharDataset class and training data.
# from src.myprogram import CharDataset, MyModel
# print("\nInspecting vocabulary from raw training data:")
# raw_training_data = MyModel.load_training_data() # Loads and cleans
# if raw_training_data:
#     temp_dataset = CharDataset(raw_training_data)
#     print(f"Raw Data Vocab Size: {temp_dataset.vocab_size}")
#     print("\nSample of raw data idx_to_char (first 100):")
#     for i in range(min(100, temp_dataset.vocab_size)):
#         print(f"{i}: '{temp_dataset.idx_to_char[i]}'", end='  ')
#         if (i+1) % 10 == 0:
#             print()
#     print()
# else:
#     print("Could not load raw training data for vocab inspection.")
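
For a quicker overview than checking characters one by one, the vocabulary can be grouped by a rough script label using the standard `unicodedata` module. This is a heuristic sketch: taking the first word of a character's Unicode name is an approximation, and the names `script_of` and `summarize_vocab` are introduced here for illustration. In the notebook you would pass `char_to_idx.keys()` instead of the sample list.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label: first word of the Unicode name, letters only."""
    if not unicodedata.category(ch).startswith("L"):
        return "NON-LETTER"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNNAMED"


def summarize_vocab(chars):
    """Count vocabulary characters by rough script label."""
    return Counter(script_of(ch) for ch in chars)


sample_vocab = ["a", "Z", "ñ", "ç", "ü", "П", "你", " ", "."]
counts = summarize_vocab(sample_vocab)
for label, count in counts.most_common():
    print(label, count)
```

A vocabulary trained only on the Latin-script downloads should be dominated by `LATIN` and `NON-LETTER` entries; other labels usually point to stray characters in the source files.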


## Step 7: Test the Model

In [None]:
# Create output directory
!mkdir -p output

# Run prediction
!python src/myprogram.py test --work_dir work --test_data example/input.txt --test_output output/pred.txt

# Display predictions
print("Input text:")
!cat example/input.txt

print("\nPredictions (top 3 next characters):")
!cat output/pred.txt

## Step 7.1: Comprehensive Multilingual Test (Optional)

This step uses a more diverse set of input strings covering the languages downloaded in Step 4.

In [None]:
# Create the multilingual input and answer files on Google Drive (within the Colab environment)

input_multi_content = """Hello, how are yo
Hola, ¿cómo está
Bonjour mon am
Guten Tag, wie geh
Olá, como est"""

answer_multi_content = """u
s
i
t
á"""

with open("example/input_multi.txt", "w", encoding="utf-8") as f:
    f.write(input_multi_content)

with open("example/answer_multi.txt", "w", encoding="utf-8") as f:
    f.write(answer_multi_content)

print("Created example/input_multi.txt and example/answer_multi.txt (Top 5 Latin-script focused)")

In [None]:
# Run prediction with the new multilingual test file
!python src/myprogram.py test --work_dir work --test_data example/input_multi.txt --test_output output/pred_multi.txt

# Display results
print("Multilingual Input text (example/input_multi.txt):")
!cat example/input_multi.txt

print("\nMultilingual Predictions (output/pred_multi.txt - top 3 next characters):")
!cat output/pred_multi.txt

print("\nMultilingual Gold Answers (example/answer_multi.txt):")
!cat example/answer_multi.txt

In [None]:
# Grade the multilingual predictions
# Ensure the grader script is available. If you cloned the repo, it should be in `grader/grade.py`.
# If not, you might need to upload it or adjust the path.
# Assuming GDRIVE_PROJECT_PATH is your project root where 'grader' directory exists.

GRADER_SCRIPT_PATH = "grader/grade.py" # Path relative to GDRIVE_PROJECT_PATH

if os.path.exists(GRADER_SCRIPT_PATH):
    print("\nGrading multilingual predictions:")
    !python $GRADER_SCRIPT_PATH output/pred_multi.txt example/answer_multi.txt --verbose
else:
    print(f"\nGrader script not found at {GRADER_SCRIPT_PATH}. Skipping multilingual grading.")
    print(f"Current directory: {os.getcwd()}")
    print("Ensure 'grader/grade.py' exists in your project directory on Google Drive.")
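
If the grader script is unavailable, a comparable score can be approximated by hand. The sketch below assumes each prediction line holds the model's three guesses and each answer line holds one gold character; it is not the logic of `grader/grade.py`, which is not shown here and may, for example, normalize case.

```python
def top3_success_rate(pred_lines, gold_lines):
    """Fraction of lines whose gold character appears among the guesses."""
    if len(pred_lines) != len(gold_lines):
        raise ValueError("prediction and answer files differ in length")
    if not gold_lines:
        return 0.0
    hits = sum(1 for pred, gold in zip(pred_lines, gold_lines) if gold in pred)
    return hits / len(gold_lines)


# Made-up predictions for the five inputs above, not real model output.
preds = ["uea", "nsa", "ie ", "enr", "aeo"]
golds = ["u", "s", "i", "t", "á"]
print(top3_success_rate(preds, golds))  # 0.6
```

When reading the real files, strip only the trailing newline from each line (for example with `line.rstrip("\n")`), since a space can be a legitimate prediction.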

## Step 8: Save the Trained Model

If you want to save your trained model from Colab to your local machine:

In [None]:
# Zip the work directory which contains your model (now on Google Drive)
# The zip file will be created in the current directory (your project root on Drive)
!zip -r trained_model.zip work/

# Download the model (click the link that appears)
# This downloads the zip file from your Colab environment's view of Google Drive
from google.colab import files
files.download('trained_model.zip')

## Optional: Experiment with Hyperparameters

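You can change the training hyperparameters by editing `src/myprogram.py` directly or by overriding `run_train` in a separate script. The command-line interface shown in this notebook exposes only `mode`, `--work_dir`, `--test_data`, and `--test_output`, so hyperparameters cannot be passed as arguments without extending the parser. Here's an example of how to modify key parameters: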

In [None]:
%%writefile src/modified_program.py
# Modified entry point that reuses the original program with different hyperparameters.
# `python src/modified_program.py` puts `src/` (not the project root) on sys.path,
# so the original module is imported as `myprogram`.
import os
import random
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter

import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from myprogram import CharDataset, CharLSTM, MyModel


# Replacement for MyModel.run_train that uses different hyperparameters.
# Note: this simplified loop does not write per-epoch checkpoints.
def custom_run_train(self, data, work_dir):
    # Create dataset with smaller sequence length
    seq_length = 32  # Smaller sequence length
    self.dataset = CharDataset(data, seq_length)

    # Create model with different parameters
    vocab_size = self.dataset.vocab_size
    embedding_dim = 64     # Smaller embedding size
    hidden_dim = 128       # Smaller hidden dimension
    num_layers = 1         # Fewer layers
    self.model = CharLSTM(vocab_size, embedding_dim, hidden_dim, num_layers)
    self.model.to(self.device)

    # Modified training parameters
    batch_size = 32        # Smaller batch size
    num_epochs = 5         # Fewer epochs
    learning_rate = 0.005  # Higher learning rate

    # Create DataLoader
    dataloader = DataLoader(self.dataset, batch_size=batch_size, shuffle=True)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)

    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0
        for batch_idx, (sequences, labels) in enumerate(dataloader):
            sequences, labels = sequences.to(self.device), labels.to(self.device)

            # Forward pass
            outputs = self.model(sequences)
            loss = criterion(outputs, labels)

            # Backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

            if batch_idx % 100 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        avg_loss = total_loss / len(dataloader)
        print(f'Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}')


# Apply the monkey patch
MyModel.run_train = custom_run_train

# Same command-line entry point as the original program, with different defaults
if __name__ == '__main__':
    parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument('mode', choices=('train', 'test'), help='what to run')
    parser.add_argument('--work_dir', help='where to save', default='work_modified')
    parser.add_argument('--test_data', help='path to test data', default='example/input.txt')
    parser.add_argument('--test_output', help='path to write test predictions', default='pred_modified.txt')
    args = parser.parse_args()

    random.seed(0)

    if args.mode == 'train':
        if not os.path.isdir(args.work_dir):
            print('Making working directory {}'.format(args.work_dir))
            os.makedirs(args.work_dir)
        print('Instantiating model')
        model = MyModel()
        print('Loading training data')
        train_data = MyModel.load_training_data()
        print('Training')
        model.run_train(train_data, args.work_dir)
        print('Saving model')
        model.save(args.work_dir)
    elif args.mode == 'test':
        print('Loading model')
        model = MyModel.load(args.work_dir)
        print('Loading test data from {}'.format(args.test_data))
        test_data = MyModel.load_test_data(args.test_data)
        print('Making predictions')
        pred = model.run_pred(test_data)
        print('Writing predictions to {}'.format(args.test_output))
        assert len(pred) == len(test_data), 'Expected {} predictions but got {}'.format(len(test_data), len(pred))
        model.write_pred(pred, args.test_output)
    else:
        raise NotImplementedError('Unknown mode {}'.format(args.mode))

Run the modified version:

In [None]:
# Train with modified hyperparameters
!python src/modified_program.py train --work_dir work_modified
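
To get a feel for how these hyperparameters affect model size, the parameter count can be worked out by hand. The sketch below assumes a plain embedding, stacked unidirectional `torch.nn.LSTM`, and linear output layer, which is how the overview describes `CharLSTM` but has not been checked against the source; `vocab_size=200` is a placeholder.

```python
def lstm_param_count(vocab_size, embedding_dim, hidden_dim, num_layers):
    """Trainable parameters for embedding -> stacked LSTM -> linear."""
    embedding = vocab_size * embedding_dim
    lstm = 0
    for layer in range(num_layers):
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # Four gates, each with input weights, recurrent weights, and the
        # two bias vectors that torch.nn.LSTM keeps per gate.
        lstm += 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    linear = hidden_dim * vocab_size + vocab_size
    return embedding + lstm + linear


# vocab_size=200 is a placeholder; use the value printed by the vocabulary cell.
print(lstm_param_count(200, embedding_dim=64, hidden_dim=128, num_layers=1))  # 137928
```

With these settings the LSTM layer accounts for 99,328 of the parameters, so `hidden_dim` is the setting with the largest effect on model size.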

## Running with Docker in Google Colab

You can also run your model using Docker inside Google Colab, which ensures the exact same environment as your local setup.

### 1. Install Docker in Colab

First, we need to install Docker in the Colab environment:

In [None]:
# Remove any old Docker installations (-y avoids an interactive prompt that would stall the cell)
!apt-get remove -y docker docker-engine docker.io containerd runc

# Install prerequisites
!apt-get update
!apt-get install -y apt-transport-https ca-certificates curl gnupg lsb-release

# Add Docker's official GPG key (--yes overwrites the keyring if the cell is re-run)
!curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --yes --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Set up the stable repository
!echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
!apt-get update
!apt-get install -y docker-ce docker-ce-cli containerd.io

# Verify Docker installation
!docker --version

### 2. Start Docker service

In [None]:
# Start the Docker service
!service docker start

# Check Docker status
!service docker status
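
Since the daemon may not start in every Colab runtime, it can help to check that Docker is actually reachable before running the remaining cells. The helper below is a small sketch using only the standard library; the function name `docker_daemon_ready` and the 20-second timeout are choices made for this example.

```python
import shutil
import subprocess


def docker_daemon_ready(timeout=20):
    """Return True only if the docker CLI exists and can reach a daemon."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if docker_daemon_ready():
    print("Docker daemon is reachable; the Docker cells should work.")
else:
    print("Docker daemon is not reachable; skip the Docker cells or run them locally.")
```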

### 3. Create Project Files for Docker

We need to create all necessary files for our Docker container:

In [None]:
%%writefile Dockerfile
# PyTorch 2.6.0 runtime image with CUDA 12.4 and cuDNN 9
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime

RUN mkdir /job
WORKDIR /job
VOLUME ["/job/data", "/job/src", "/job/work", "/job/output"]

# Install dependencies using requirements.txt
COPY requirements.txt /job/
RUN pip install -r requirements.txt

In [None]:
%%writefile requirements.txt
# Python dependencies installed into the Docker image
numpy>=1.20.0
tqdm>=4.64.0

In [None]:
%%writefile src/predict.sh
#!/usr/bin/env bash
# Run prediction: $1 is the input file, $2 is the output file
set -e
set -v
python src/myprogram.py test --work_dir work --test_data "$1" --test_output "$2"

### 4. Build Docker Image

Now we can build the Docker image:

In [None]:
# Build the Docker image
!docker build -t cse517-proj/mylstm -f Dockerfile .

### 5. Run Training with Docker

Now we can train our model using Docker:

In [None]:
# Ensure directories exist (they should, if set up on Drive)
# These commands will operate relative to your project directory on Drive
!mkdir -p data work output example

# Make the prediction script executable
!chmod +x src/predict.sh

# Run training with Docker
# Note: Docker volume mounts will now point to paths on your Google Drive via Colab's mount
!docker run --rm \
  -v "$PWD/src":/job/src \
  -v "$PWD/data":/job/data \
  -v "$PWD/work":/job/work \
  cse517-proj/mylstm bash -c "cd /job && python src/myprogram.py train --work_dir work"

### 6. Run Testing with Docker

Now we can test our model using Docker:

In [None]:
# Create test data if it doesn't exist (in 'example' on Drive)
!mkdir -p example
!echo "Hello, how are you" > example/input.txt
!echo "Bonjour mon ami" >> example/input.txt
!echo "Hola, ¿cómo estás" >> example/input.txt

# Run testing with Docker
# Volume mounts point to paths on your Google Drive
!docker run --rm \
  -v "$PWD/src":/job/src \
  -v "$PWD/work":/job/work \
  -v "$PWD/example":/job/data \
  -v "$PWD/output":/job/output \
  cse517-proj/mylstm bash /job/src/predict.sh /job/data/input.txt /job/output/pred.txt

# Display results (from files on Drive)
print("Input:")
!cat example/input.txt
print("\nPredictions:")
!cat output/pred.txt

### 7. Compare Docker vs. Direct Execution

You can now compare the results between running directly in Colab versus running in Docker. Both should produce similar results, but Docker ensures better reproducibility and consistency with your local environment.

### Notes on Running Docker in Colab

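1. Docker needs root privileges. Colab gives you root inside the runtime, but the runtime is itself sandboxed, so the Docker daemon may fail to start. If `service docker start` reports an error, run the Docker steps on a local machine instead.
2. The Docker installation process might take a few minutes.
3. If you encounter memory issues, try reducing batch sizes or sequence lengths.
4. Docker containers are ephemeral: anything written inside a container is lost when the container is removed unless the path is mounted as a volume.
5. Colab sessions have time limits, so save your model frequently.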