# Reproducible Guide: CPU Training of GPT-2 (from Scratch)

This Jupyter Notebook walks through setting up a Python environment and training a small GPT-2 language model on the CPU, starting from the pretrained `gpt2` checkpoint. It lists every dependency and the exact install commands so that the setup can be reproduced. It also addresses the earlier GPU interference issue by forcing CPU-only training.

**Important:** This guide trains on a *very small* example dataset and uses the base `gpt2` model, which is not designed for high-quality text generation. The goal here is to establish a working CPU training setup. To generate better text, you will need to use a larger model variant (like `gpt2-medium`, `gpt2-large`, or `gpt2-xl`) and train on a much more substantial and relevant dataset (see "Next Steps" at the end).

## 1. Environment Setup: Create a New Conda Environment

It's best practice to create a dedicated Conda environment to isolate project dependencies. We'll create an environment named `cpu_gpt2_env` with Python 3.10 (you can adjust the Python version if needed).
**Open your terminal or Anaconda Prompt and run:**

In [None]:
conda create -n cpu_gpt2_env python=3.10
conda activate cpu_gpt2_env

## 2. Install Dependencies with Explicit Versions

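We will now install the required Python packages: PyTorch for CPU, `transformers` from the Hugging Face GitHub development branch (to get the latest features and CPU-related fixes), and the `datasets` library. Note that the commands below do not pin version numbers, so they install whatever is current at install time. For an exactly reproducible environment, record the resulting versions afterwards (for example with `pip freeze > requirements.txt`).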
**Run the following `pip install` commands in your activated `cpu_gpt2_env` environment:**

In [None]:
# Install PyTorch (CPU version). Get the latest stable CPU wheel URL from pytorch.org if needed.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install transformers from Hugging Face GitHub development branch (for latest features and fixes)
pip install --no-cache-dir --force-reinstall git+https://github.com/huggingface/transformers.git

# Install datasets library (latest stable version)
pip install datasets
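
Because the install commands above do not pin version numbers, it can help to record what was actually installed. The sketch below is one way to do that using only the Python standard library. The helper name `installed_versions` is a hypothetical name introduced here for illustration and is not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package_name: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The three packages installed in this section.
    for name, ver in installed_versions(["torch", "transformers", "datasets"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Saving this output next to the trained model gives a simple record of the environment that produced it.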

## 3. Python Code for CPU Training (`cpu_trainer.py`)

This is the Python code that performs the CPU-based training of GPT-2. Save it as `cpu_trainer.py` in your project directory. It forces CPU execution and uses a custom data collator that prints detailed debugging output for every batch.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset

# --- Configuration ---
MODEL_NAME = "gpt2"  # A small, readily available model
OUTPUT_DIR = "./cpu_trained_model"

# --- Minimal Example Data ---
mini_data = [
    {"text": "The quick brown fox"},
    {"text": "jumps over the lazy"},
    {"text": "dog. This is a test."},
    {"text": "Another example sentence."}
]

# --- Tokenizer Setup ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Crucial for padding

def tokenize_function(example):
    """Tokenizes a single example."""
    tokenized_inputs = tokenizer(
        example["text"],
        max_length=32,
        truncation=True,
        padding="max_length",
        return_tensors="pt", # Request PyTorch tensors
    )
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].clone() #labels are input ids
    return {key: value.squeeze(0) for key, value in tokenized_inputs.items()}

# --- Create and Tokenize Dataset ---
dataset = Dataset.from_list(mini_data)
tokenized_dataset = dataset.map(tokenize_function)
tokenized_dataset.set_format("torch")
tokenized_dataset = tokenized_dataset.remove_columns(['text'])

# --- TrainingArguments ---
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_steps=10,
    report_to="none",
    save_strategy="epoch",
    use_cpu=True,  # Run on the CPU even if a GPU is present (supersedes the deprecated no_cuda)
    # NO GPU-related settings here!
)

# --- Load Model (on CPU) ---
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.to("cpu")  # Explicitly put the model on the CPU

def custom_data_collator(features):
    """Collates a batch of tokenized examples with detailed debugging."""
    batch = {}
    print("\n--- Inside data collator ---")
    print(f"Type of 'features': {type(features)}")
    if features:
        print(f"Length of 'features': {len(features)}")
        print(f"Type of 'features[0]': {type(features[0])}")

        if isinstance(features[0], dict):
            for key in features[0].keys():
                print(f"\n  --- Key: '{key}' ---")
                print(f"  Type of 'features[0][key]': {type(features[0][key])}")

                elements_for_stack = [f[key] for f in features]
                print(f"  Type of 'elements_for_stack': {type(elements_for_stack)}")
                if elements_for_stack:
                    print(f"  Type of 'elements_for_stack[0]': {type(elements_for_stack[0])}")

                    if isinstance(elements_for_stack[0], list):
                        print(f"  Type of 'elements_for_stack[0][0]': {type(elements_for_stack[0][0]) if elements_for_stack[0] else 'empty list'}")
                        print(f"  Value of 'elements_for_stack[0]': {elements_for_stack[0]}")
                    elif isinstance(elements_for_stack[0], torch.Tensor):
                        print(f"  Shape of 'elements_for_stack[0]': {elements_for_stack[0].shape}")
                    else:
                        print(f"  Value of 'elements_for_stack[0]': {elements_for_stack[0]}")

                try:
                    batch[key] = torch.stack(elements_for_stack)
                    print(f"  'batch[key]' stacked successfully. Type: {type(batch[key])}, Shape: {batch[key].shape}")
                except TypeError as e:
                    print(f"  TypeError during torch.stack for key '{key}': {e}")
        else:
            print("  'features[0]' is NOT a dictionary!")

    print("--- End data collator ---\n")
    return batch

# --- Trainer ---
trainer = Trainer(
    model=model, # Model is already on CPU
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=custom_data_collator,
)

# --- Train (on CPU) ---
trainer.train()
print("Training complete!")

# --- Inference (Example) ---
model.eval()
prompt = "The quick brown"
encoded = tokenizer(prompt, return_tensors="pt")  # Tensors are created on the CPU by default
input_ids = encoded.input_ids

with torch.no_grad():
    output = model.generate(
        input_ids,
        attention_mask=encoded.attention_mask,
        max_new_tokens=10,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"Prompt: {prompt}")
print(f"Generated Text: {generated_text}")

# --- Save ---
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
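
One refinement to consider: `tokenize_function` copies every `input_ids` position into `labels`, so the loss is also computed on padding positions. Hugging Face causal language models skip label positions set to `-100`. The following is a minimal sketch of that masking, assuming the tokenizer is called without `return_tensors` so that the values are plain Python lists. `mask_pad_labels` and `IGNORE_INDEX` are hypothetical names introduced here.

```python
IGNORE_INDEX = -100  # Label value skipped by the cross-entropy loss


def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# Example: a 2-token sequence padded to length 4.
# 50256 is GPT-2's EOS id, which the script reuses as the pad token.
labels = mask_pad_labels([10, 11, 50256, 50256], [1, 1, 0, 0])
print(labels)  # [10, 11, -100, -100]
```

The attention mask is used rather than the token id because the pad token and the EOS token share the same id in this script, so masking by id would also hide genuine end-of-text tokens.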

## 4. Execution Instructions

1.  **Save the code:** Make sure you have saved the Python code from section 3 as `cpu_trainer.py` in your current working directory.
2.  **Activate the Conda environment:** If you haven't already, activate the `cpu_gpt2_env` environment in your terminal:
    ```bash
    conda activate cpu_gpt2_env
    ```
3.  **Set environment variables (forces CPU execution):** Before running the script, hide all GPUs from PyTorch by setting `CUDA_VISIBLE_DEVICES` to an empty string in the same terminal session. `PYTORCH_CUDA_ALLOC_CONF` only tunes the CUDA memory allocator, so it has no effect on a CPU-only run and can be omitted:
    ```bash
    export CUDA_VISIBLE_DEVICES=""
    ```
4.  **Run the Python script:** Execute the training script from your terminal:
    ```bash
    python cpu_trainer.py
    ```
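
If setting the variable in the shell is inconvenient (for example in a shell where `export` is not available), the same effect can be approximated from Python, provided the assignment happens before CUDA is initialized. The safest place is before `torch` is imported at all. This is a sketch only, and `hide_gpus` is a hypothetical helper name.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries loaded later in this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    return os.environ["CUDA_VISIBLE_DEVICES"]


hide_gpus()  # Call at the very top of cpu_trainer.py, before `import torch`

# After this point, `import torch` followed by `torch.cuda.is_available()`
# is expected to report False.
```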

## 5. Verification of CPU Usage

To ensure that the training and inference are indeed running on the CPU (and not accidentally using the GPU), you can monitor your system's resource usage during the script execution.

**Using System Monitoring Tools:**

*   **Linux/macOS:** Open a separate terminal and run `top` or `htop`. The Python process running `cpu_trainer.py` should show high CPU utilization. These tools do not report GPU activity; on a machine with an NVIDIA GPU, run `nvidia-smi` and confirm that GPU utilization stays close to 0% and that the Python process does not appear in its process list.
*   **Windows:** Use Task Manager (Ctrl+Shift+Esc). Go to the "Performance" tab and monitor CPU and GPU usage while the script is running. CPU usage should be significant, and GPU usage should be minimal.

**Script Output:**

Examine the output of `cpu_trainer.py` in your terminal. You should see:

*   Training progress (progress bar and loss values).
*   The "Prompt:" and "Generated Text:" output at the end, indicating successful inference.
*   **Crucially, the absence of any `RuntimeError` or CUDA-related warnings** (especially the `RuntimeError: Expected all tensors to be on the same device...` error we were previously encountering).
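
The device check can also be done programmatically instead of by watching a monitor. The sketch below collects the device of every model parameter. It relies only on the `parameters()` method and the `.device` attribute that PyTorch modules and tensors provide. `parameter_devices` is a hypothetical helper name.

```python
def parameter_devices(model):
    """Return the set of device names (e.g. {"cpu"}) that hold the model's parameters."""
    return {str(param.device) for param in model.parameters()}


# Possible use at the end of cpu_trainer.py, after trainer.train():
#     assert parameter_devices(model) == {"cpu"}, parameter_devices(model)
```

If any parameter had been moved to a GPU, the returned set would contain an entry such as `cuda:0` and the assertion would fail with the offending device names.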


## 6. Next Steps and Improvements (Text Quality)

As mentioned earlier, the text generated by the base `gpt2` model trained on `mini_data` will be of limited quality. To improve the generated text:

*   **Use a Larger GPT-2 Model Variant:** Try `MODEL_NAME = "gpt2-medium"`, `"gpt2-large"`, or `"gpt2-xl"` in the code (be mindful of CPU RAM and training speed).
*   **Use a Larger and More Relevant Dataset:** Explore the Hugging Face Datasets Hub ([https://huggingface.co/datasets](https://huggingface.co/datasets)) for datasets relevant to the type of text you want to generate.
*   **Train for More Epochs:**  Consider increasing `num_train_epochs` in `TrainingArguments` (but monitor training time on CPU).
*   **Experiment with Generation Parameters:**  Adjust parameters like `temperature`, `top_k`, `top_p`, and `num_beams` in the `model.generate()` function to control the style and quality of generated text.
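
As a starting point for the generation-parameter experiments, the sampling settings can be kept in one dictionary and overridden per call. The values below are illustrative assumptions, not recommendations from this guide, and `DEFAULT_SAMPLING` and `sampling_kwargs` are hypothetical names. The keys themselves are standard keyword arguments of `model.generate()`.

```python
# Illustrative starting values only; none of these numbers come from the guide.
DEFAULT_SAMPLING = {
    "max_new_tokens": 30,
    "do_sample": True,   # Sample instead of greedy decoding
    "temperature": 0.8,  # < 1.0 sharpens the distribution, > 1.0 flattens it
    "top_k": 50,         # Keep only the 50 most likely next tokens
    "top_p": 0.95,       # Then keep the smallest set with cumulative probability >= 0.95
}


def sampling_kwargs(**overrides):
    """Return a copy of DEFAULT_SAMPLING with any keyword overrides applied."""
    return {**DEFAULT_SAMPLING, **overrides}


# Possible use in cpu_trainer.py (model and input_ids as defined there):
#     output = model.generate(input_ids, **sampling_kwargs(temperature=1.0))
```

Keeping the defaults in one place makes it easier to compare runs that differ in a single parameter.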

## 7. Conclusion

Congratulations! You have successfully set up a reproducible environment for CPU-based training of GPT-2 and have a working script. By following these steps, you should be able to train and run inference on your CPU without encountering GPU-related device errors. Remember that improving text quality requires using larger models and, most importantly, larger and more relevant datasets. Happy experimenting!