# Running Multilingual Character-Level LSTM in Google Colab

This notebook guides you through running the character-level LSTM model in Google Colab.

## Step 1: Check GPU Availability

First, verify that Colab has assigned a GPU to your notebook. Go to **Runtime > Change runtime type** and select **GPU** as your hardware accelerator.

In [None]:
# Check if GPU is available
import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

## Step 2: Clone the Project Repository

Option 1: Clone your GitHub repository (if you've pushed the code to GitHub)

In [None]:
# Replace with your actual repository URL
!git clone https://github.com/yourusername/cse517p-project.git
%cd cse517p-project

Option 2: Upload files manually

If you haven't pushed to GitHub, you'll need to create the project structure and upload the necessary files. Run the following cell and then upload the required files through the Colab file browser.

In [None]:
# Create project directory structure
!mkdir -p cse517p-project/src cse517p-project/data cse517p-project/work cse517p-project/example
%cd cse517p-project

## Step 3: Install Required Dependencies

In [None]:
# Install dependencies
!pip install numpy tqdm
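
After the install finishes, a quick import check can catch a missing package before training starts. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the module list assumes the program needs `torch`, `numpy`, and `tqdm` (`torch` is normally preinstalled in Colab, which is why it is not in the `pip` command above).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed dependency list; adjust it to whatever myprogram.py actually imports.
print("Not importable:", missing_modules(["torch", "numpy", "tqdm"]) or "none")
```

If the list is not empty, rerun the install cell with the missing package names added.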

## Step 4: Download Multilingual Training Data

Let's download sample multilingual data for training.

In [None]:
# Download sample data
!mkdir -p data

# English sample (Pride and Prejudice)
!curl -s https://www.gutenberg.org/files/1342/1342-0.txt > data/english_pride_prejudice.txt

# Spanish sample (Don Quixote)
!curl -s https://www.gutenberg.org/files/2000/2000-0.txt > data/spanish_don_quijote.txt

# Additional multilingual samples
# French sample (Les Misérables - Tome I)
!curl -s https://www.gutenberg.org/files/17489/17489-0.txt > data/french_les_miserables.txt

# German sample (Also sprach Zarathustra)
!curl -s https://www.gutenberg.org/cache/epub/1998/pg1998.txt > data/german_zarathustra.txt

# Italian sample (La Divina Commedia di Dante)
!curl -s https://www.gutenberg.org/files/1001/1001-0.txt > data/italian_divina_commedia.txt

# Portuguese sample (Os Lusíadas)
!curl -s https://www.gutenberg.org/files/3333/3333-0.txt > data/portuguese_lusiadas.txt

# Dutch sample (Max Havelaar)
!curl -s https://www.gutenberg.org/files/36000/36000-0.txt > data/dutch_max_havelaar.txt

# Swedish sample (Röda rummet)
!curl -s https://www.gutenberg.org/files/5381/5381-0.txt > data/swedish_roda_rummet.txt

# Finnish sample (Seitsemän veljestä)
!curl -s https://www.gutenberg.org/files/11961/11961-0.txt > data/finnish_seitseman_veljesta.txt

# Danish sample (Niels Lyhne)
!curl -s https://www.gutenberg.org/files/19099/19099-0.txt > data/danish_niels_lyhne.txt

# Norwegian sample (Peer Gynt)
!curl -s https://www.gutenberg.org/files/2339/2339-0.txt > data/norwegian_peer_gynt.txt

# Polish sample (Pan Tadeusz)
!curl -s https://www.gutenberg.org/files/20933/20933-0.txt > data/polish_pan_tadeusz.txt

# Hungarian sample (Az arany ember)
!curl -s https://www.gutenberg.org/files/20925/20925-0.txt > data/hungarian_az_arany_ember.txt

# Latin sample (Commentarii de Bello Gallico)
!curl -s https://www.gutenberg.org/files/10657/10657-0.txt > data/latin_bello_gallico.txt

# Russian sample (War and Peace)
# Note: Project Gutenberg ebook 2600 is an English translation, so swap in a
# Russian-language source if you need Cyrillic text
!curl -s https://www.gutenberg.org/files/2600/2600-0.txt > data/russian_war_and_peace.txt

# Chinese sample (The Art of War)
# Note: Project Gutenberg ebook 132 is an English translation as well
!curl -s https://www.gutenberg.org/files/132/132-0.txt > data/chinese_art_of_war.txt

# Japanese sample (The Tale of Genji)
!curl -s https://www.gutenberg.org/files/23643/23643-0.txt > data/japanese_genji_monogatari.txt

# Korean sample (The Tale of Hong Gildong)
!curl -s https://www.gutenberg.org/files/22060/22060-0.txt > data/korean_hong_gildong.txt

# Arabic sample (One Thousand and One Nights)
!curl -s https://www.gutenberg.org/files/555/555-0.txt > data/arabic_one_thousand_and_one_nights.txt

# Hindi sample (The Ramayana)
!curl -s https://www.gutenberg.org/files/24899/24899-0.txt > data/hindi_ramayana.txt


# Create example input file for testing
!mkdir -p example
!echo "Hello, how are you" > example/input.txt
!echo "Bonjour mon ami" >> example/input.txt
!echo "Hola, ¿cómo estás" >> example/input.txt
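
Because `curl -s` saves whatever the server returns, a failed download can leave an HTML error page in `data/`, and successful downloads still carry the Project Gutenberg license header and footer. The sketch below shows one way to detect the first case and strip the second. Both helpers are hypothetical, and the `*** START OF` / `*** END OF` marker lines are an assumption about how the downloaded files are formatted; inspect a file before relying on it.

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the text between the '*** START OF' and '*** END OF' marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if line.startswith("*** START OF")), None)
    end = next((i for i, line in enumerate(lines) if line.startswith("*** END OF")), None)
    if start is None or end is None or end <= start:
        return text
    return "\n".join(lines[start + 1:end]).strip("\n")


def looks_like_html(text):
    """Heuristic: an error page saved by curl usually starts with an HTML tag."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Tiny made-up sample that mimics the assumed file layout.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "Chapter 1\nText\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "License footer"
)
print(strip_gutenberg_boilerplate(sample))
```

To apply this to the real files, read each one with `open(path, encoding="utf-8")`, skip it if `looks_like_html` returns `True`, and write the stripped text back.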

## Step 5: If needed, create or upload the Python files

If you cloned from GitHub, skip this step. Otherwise, you need to upload or create your Python files in the `src` directory.

Click on the folder icon on the left sidebar, navigate to the `src` directory, and upload your `myprogram.py` and `predict.sh` files.
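
Before training, it can help to confirm that the upload put everything where later steps expect it. The following is a minimal sketch: `missing_project_files` is a hypothetical helper, and the expected paths are assumptions taken from the file and directory names used elsewhere in this notebook.

```python
from pathlib import Path

# Paths the later steps refer to; adjust if your project is laid out differently.
EXPECTED_DIRS = ("src", "data", "work", "example")
EXPECTED_FILES = ("src/myprogram.py", "src/predict.sh")


def missing_project_files(root="."):
    """Return the expected directories and files that do not exist under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Run from inside cse517p-project after uploading your files.
print("Missing:", missing_project_files() or "nothing")
```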

## Step 6: Train the Model

In [None]:
# Train the model
!python src/myprogram.py train --work_dir work

## Step 7: Test the Model

In [None]:
# Create output directory
!mkdir -p output

# Run prediction
!python src/myprogram.py test --work_dir work --test_data example/input.txt --test_output output/pred.txt

# Display predictions
print("Input text:")
!cat example/input.txt

print("\nPredictions (top 3 next characters):")
!cat output/pred.txt
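
Printing the two files one after the other makes it hard to see which prediction belongs to which input. The sketch below pairs them line by line, assuming the output file has exactly one prediction line per input line (the same condition the program asserts before writing predictions). `pair_predictions` and `read_lines` are hypothetical helpers, and the sample values are illustrative, not real model output.

```python
def read_lines(path):
    """Read a UTF-8 text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def pair_predictions(input_lines, pred_lines):
    """Pair each input line with its prediction line; raise if the counts differ."""
    if len(input_lines) != len(pred_lines):
        raise ValueError(
            f"Expected {len(input_lines)} predictions but got {len(pred_lines)}"
        )
    return list(zip(input_lines, pred_lines))


# Illustrative values only, not real model output.
for text, guesses in pair_predictions(["Hello, how are you"], [" ?!"]):
    print(f"{text!r} -> {guesses!r}")
```

With the real files, call `pair_predictions(read_lines("example/input.txt"), read_lines("output/pred.txt"))`.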

## Step 8: Save the Trained Model

If you want to save your trained model from Colab to your local machine:

In [None]:
# Zip the work directory which contains your model
!zip -r trained_model.zip work/

# Download the model (click the link that appears)
from google.colab import files
files.download('trained_model.zip')

## Optional: Experiment with Hyperparameters

You can modify the training hyperparameters by editing the code or passing additional arguments. Here's an example of how to modify key parameters:

In [None]:
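%%writefile src/modified_program.py
# A temporary variant of the program that trains with different hyperparameters
import os
import sys

# Running `python src/modified_program.py` puts src/ on sys.path rather than the
# project root, so add the current directory to make `src.myprogram` importable
sys.path.insert(0, os.getcwd())

# Import the original program
from src.myprogram import *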

# Override the run_train method to use different hyperparameters
def custom_run_train(self, data, work_dir):
    # Create dataset with smaller sequence length
    seq_length = 32  # Smaller sequence length
    self.dataset = CharDataset(data, seq_length)
    
    # Create model with different parameters
    vocab_size = self.dataset.vocab_size
    embedding_dim = 64     # Smaller embedding size
    hidden_dim = 128       # Smaller hidden dimension
    num_layers = 1         # Fewer layers
    self.model = CharLSTM(vocab_size, embedding_dim, hidden_dim, num_layers)
    self.model.to(self.device)
    
    # Modified training parameters
    batch_size = 32        # Smaller batch size
    num_epochs = 5         # Fewer epochs
    learning_rate = 0.005  # Higher learning rate
    
    # Create DataLoader
    dataloader = DataLoader(self.dataset, batch_size=batch_size, shuffle=True)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
    
    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0
        for batch_idx, (sequences, labels) in enumerate(dataloader):
            sequences, labels = sequences.to(self.device), labels.to(self.device)
            
            # Forward pass
            outputs = self.model(sequences)
            loss = criterion(outputs, labels)
            
            # Backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}, Loss: {loss.item():.4f}')
        
        avg_loss = total_loss / len(dataloader)
        print(f'Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}')

# Apply the monkey patch
MyModel.run_train = custom_run_train

# Run with the modified script
if __name__ == '__main__':
    # Same command-line interface as the original program, with different defaults
    parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
    parser.add_argument('mode', choices=('train', 'test'), help='what to run')
    parser.add_argument('--work_dir', help='where to save', default='work_modified')
    parser.add_argument('--test_data', help='path to test data', default='example/input.txt')
    parser.add_argument('--test_output', help='path to write test predictions', default='pred_modified.txt')
    args = parser.parse_args()

    random.seed(0)

    if args.mode == 'train':
        if not os.path.isdir(args.work_dir):
            print('Making working directory {}'.format(args.work_dir))
            os.makedirs(args.work_dir)
        print('Instantiating model')
        model = MyModel()
        print('Loading training data')
        train_data = MyModel.load_training_data()
        print('Training')
        model.run_train(train_data, args.work_dir)
        print('Saving model')
        model.save(args.work_dir)
    elif args.mode == 'test':
        print('Loading model')
        model = MyModel.load(args.work_dir)
        print('Loading test data from {}'.format(args.test_data))
        test_data = MyModel.load_test_data(args.test_data)
        print('Making predictions')
        pred = model.run_pred(test_data)
        print('Writing predictions to {}'.format(args.test_output))
        assert len(pred) == len(test_data), 'Expected {} predictions but got {}'.format(len(test_data), len(pred))
        model.write_pred(pred, args.test_output)
    else:
        raise NotImplementedError('Unknown mode {}'.format(args.mode))

Run the modified version:

In [None]:
# Train with modified hyperparameters
!python src/modified_program.py train --work_dir work_modified
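
To get a feel for how much smaller the modified model is, you can estimate its parameter count by hand. The sketch below assumes that `CharLSTM` consists of an embedding layer, a unidirectional `nn.LSTM` stack, and a linear output layer over the vocabulary; the real class may differ, so treat the numbers as a rough guide. The function name and the vocabulary size of 100 are hypothetical.

```python
def char_lstm_param_count(vocab_size, embedding_dim, hidden_dim, num_layers):
    """Estimate trainable parameters for an embedding + LSTM + linear model."""
    embedding = vocab_size * embedding_dim
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; deeper layers read the previous hidden state.
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # Four gates, each with input weights, recurrent weights, and two bias vectors.
        lstm += 4 * hidden_dim * (input_dim + hidden_dim + 2)
    output = hidden_dim * vocab_size + vocab_size
    return embedding + lstm + output


# Hypothetical vocabulary size; use your dataset's vocab_size for a real estimate.
small = char_lstm_param_count(vocab_size=100, embedding_dim=64, hidden_dim=128, num_layers=1)
print(f"Modified configuration: about {small:,} parameters")
```

Calling the function again with the original program's settings gives a direct comparison of model sizes.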

## Running with Docker in Google Colab

You can also run your model using Docker inside Google Colab, which gives you the same environment as your local setup, provided both are built from the same Dockerfile.

### 1. Install Docker in Colab

First, we need to install Docker in the Colab environment:

In [None]:
# Remove any old Docker installations (-y avoids an interactive prompt)
!apt-get remove -y docker docker-engine docker.io containerd runc

# Install prerequisites
!apt-get update
!apt-get install -y apt-transport-https ca-certificates curl gnupg lsb-release

# Add Docker's official GPG key
!curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Set up the stable repository
!echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
!apt-get update
!apt-get install -y docker-ce docker-ce-cli containerd.io

# Verify Docker installation
!docker --version

### 2. Start Docker service

In [None]:
# Start the Docker service
!service docker start

# Check Docker status
!service docker status
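
The status output can be easy to misread, and the later Docker steps fail in confusing ways if the daemon is not actually running. The sketch below is one way to check from Python: it looks for the `docker` CLI and then runs `docker info`, which exits with a non-zero code when it cannot reach the daemon. `docker_daemon_running` is a hypothetical helper, not part of the project.

```python
import shutil
import subprocess


def docker_daemon_running(timeout=20):
    """Return True if the `docker` CLI exists and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Docker daemon reachable:", docker_daemon_running())
```

If this prints `False`, the build and run cells below will not work in the current runtime, and the direct execution steps earlier in the notebook are the safer path.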

### 3. Create Project Files for Docker

We need to create all necessary files for our Docker container:

In [None]:
%%writefile Dockerfile
# Base image: PyTorch 2.6.0 with CUDA 12.4 and cuDNN 9 (runtime variant)
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime

RUN mkdir /job
WORKDIR /job
VOLUME ["/job/data", "/job/src", "/job/work", "/job/output"]

# Install dependencies using requirements.txt
COPY requirements.txt /job/
RUN pip install -r requirements.txt

In [None]:
%%writefile requirements.txt
# Python dependencies installed into the Docker image
numpy>=1.20.0
tqdm>=4.64.0

In [None]:
%%writefile src/predict.sh
#!/usr/bin/env bash
# Usage: predict.sh <test_data> <test_output>
set -e
set -v
python src/myprogram.py test --work_dir work --test_data "$1" --test_output "$2"

### 4. Build Docker Image

Now we can build the Docker image:

In [None]:
# Build the Docker image
!docker build -t cse517-proj/mylstm -f Dockerfile .

### 5. Run Training with Docker

Now we can train our model using Docker:

In [None]:
# Ensure directories exist
!mkdir -p data work output example

# Make the prediction script executable
!chmod +x src/predict.sh

# Run training with Docker
!docker run --rm -v "$PWD/src":/job/src -v "$PWD/data":/job/data -v "$PWD/work":/job/work cse517-proj/mylstm bash -c "cd /job && python src/myprogram.py train --work_dir work"

### 6. Run Testing with Docker

Now we can test our model using Docker:

In [None]:
# Create test data if it doesn't exist
!mkdir -p example
!echo "Hello, how are you" > example/input.txt
!echo "Bonjour mon ami" >> example/input.txt
!echo "Hola, ¿cómo estás" >> example/input.txt

# Run testing with Docker
!docker run --rm -v "$PWD/src":/job/src -v "$PWD/work":/job/work -v "$PWD/example":/job/data -v "$PWD/output":/job/output cse517-proj/mylstm bash /job/src/predict.sh /job/data/input.txt /job/output/pred.txt

# Display results
print("Input:")
!cat example/input.txt
print("\nPredictions:")
!cat output/pred.txt
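
The training and testing commands differ only in which host directories are mounted and which command runs inside the container. The sketch below makes that structure explicit by building the `docker run` argument list from a mapping of host directories to container paths. `docker_run_args` is a hypothetical helper; the image name, mounts, and command are copied from the testing cell above.

```python
import os
import shlex


def docker_run_args(image, mounts, command, project_root="."):
    """Build a `docker run --rm` argument list with one -v flag per mount.

    `mounts` maps a host directory (relative to project_root) to a container path.
    """
    root = os.path.abspath(project_root)
    args = ["docker", "run", "--rm"]
    for host_dir, container_path in mounts.items():
        args += ["-v", f"{os.path.join(root, host_dir)}:{container_path}"]
    return args + [image] + list(command)


test_args = docker_run_args(
    "cse517-proj/mylstm",
    {"src": "/job/src", "work": "/job/work", "example": "/job/data", "output": "/job/output"},
    ["bash", "/job/src/predict.sh", "/job/data/input.txt", "/job/output/pred.txt"],
)
# Print the command instead of running it, so it can be inspected first.
print(shlex.join(test_args))
```

Passing the resulting list to `subprocess.run(test_args, check=True)` would execute it. Note that `example` is mounted at `/job/data` here, so the container sees the test input rather than the training corpus.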

### 7. Compare Docker vs. Direct Execution

You can now compare the results between running directly in Colab versus running in Docker. Both should produce similar results, but Docker ensures better reproducibility and consistency with your local environment.

### Notes on Running Docker in Colab

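1. Docker needs root privileges. Colab gives you root inside its runtime, but the Docker daemon may still fail to start there because the runtime is itself a restricted container; if `service docker status` shows that the daemon is not running, fall back to the direct execution steps above.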
2. The Docker installation process might take a few minutes.
3. If you encounter memory issues, try reducing batch sizes or sequence lengths.
4. Docker containers are ephemeral: data is lost when the container stops unless it lives in a mounted volume.
5. Colab sessions have time limits, so save your model frequently.
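
For the last point, one option is to write timestamped archives of the `work` directory so that an earlier snapshot survives if a later run is interrupted. The sketch below uses `shutil.make_archive` from the standard library; `snapshot_work_dir` is a hypothetical helper, and it assumes the model files live in `work/` as in the steps above.

```python
import shutil
import time
from pathlib import Path


def snapshot_work_dir(work_dir="work", dest_dir="."):
    """Zip work_dir into dest_dir under a timestamped name and return the archive path."""
    work_dir = Path(work_dir)
    if not work_dir.is_dir():
        raise FileNotFoundError(f"{work_dir} is not a directory")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = Path(dest_dir) / f"{work_dir.name}-{stamp}"
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        str(base_name), "zip", root_dir=str(work_dir.parent), base_dir=work_dir.name
    )


# Example (run after training): print(snapshot_work_dir("work"))
```

If you have mounted Google Drive with `google.colab.drive.mount("/content/drive")`, passing a folder under `/content/drive/MyDrive` as `dest_dir` keeps the snapshots after the session ends.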