## AI-powered Code Autocompletion


Code autocompletion models have revolutionized software development by predicting and suggesting code as developers type. Tools like Cursor have gained massive popularity in record time. In this notebook, I aimed to "fine-tune" a pre-trained LLM trained to complete code based on input on some python functions from `Code_Search_Net/code_search_net` (huggingface) that it had not seen before.


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
!pip install transformers
!pip install datasets
from datasets import load_dataset

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
import random
import torch
from torch.utils.data import Dataset, DataLoader

### Load the dataset


I run into memory issues trying to load all the data from `code-search-net`'s hugging face, so I had to manually download portion of the data and re-upload to my personal profile.


In [None]:
train_ds = load_dataset("ieadoboe/python-function-examples", split="train")
val_ds = load_dataset("ieadoboe/python-function-examples", split="validation")
test_ds = load_dataset("ieadoboe/python-function-examples", split="test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

python_train_0.jsonl:   0%|          | 0.00/141M [00:00<?, ?B/s]

python_valid_0.jsonl:   0%|          | 0.00/98.6M [00:00<?, ?B/s]

python_test_0.jsonl:   0%|          | 0.00/90.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/30000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23107 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/22176 [00:00<?, ? examples/s]

In [4]:
train_ds[0]

{'repo': 'ageitgey/face_recognition',
 'path': 'examples/face_recognition_knn.py',
 'func_name': 'train',
 'original_string': 'def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo=\'ball_tree\', verbose=False):\n    """\n    Trains a k-nearest neighbors classifier for face recognition.\n\n    :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n     (View in source code to see train_dir example tree structure)\n\n     Structure:\n        <train_dir>/\n        ├── <person1>/\n        │   ├── <somename1>.jpeg\n        │   ├── <somename2>.jpeg\n        │   ├── ...\n        ├── <person2>/\n        │   ├── <somename1>.jpeg\n        │   └── <somename2>.jpeg\n        └── ...\n\n    :param model_save_path: (optional) path to save model on disk\n    :param n_neighbors: (optional) number of neighbors to weigh in classification. Chosen automatically if not specified\n    :param knn_algo: (optional) underlying data structure to support

### Code examples from dataset


In [None]:
def print_code_examples(dataset, num_examples=5):
    count = 0
    for example in dataset:
        print(f"\n--- Example {count+1} ---")
        print(f"Function name: {example.get('func_name', 'N/A')}")
        print(
            f"Docstring: {example.get('docstring', 'N/A')[:100]}..."
        )  # Print first 100 chars
        print(f"Code:\n{example.get('code', 'N/A')}")
        count += 1
        if count >= num_examples:
            break

In [6]:
# Print training set examples
print_code_examples(train_ds, 3)


--- Example 1 ---
Function name: train
Docstring: Trains a k-nearest neighbors classifier for face recognition.

    :param train_dir: directory that ...
Code:
def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):
    """
    Trains a k-nearest neighbors classifier for face recognition.

    :param train_dir: directory that contains a sub-directory for each known person, with its name.

     (View in source code to see train_dir example tree structure)

     Structure:
        <train_dir>/
        ├── <person1>/
        │   ├── <somename1>.jpeg
        │   ├── <somename2>.jpeg
        │   ├── ...
        ├── <person2>/
        │   ├── <somename1>.jpeg
        │   └── <somename2>.jpeg
        └── ...

    :param model_save_path: (optional) path to save model on disk
    :param n_neighbors: (optional) number of neighbors to weigh in classification. Chosen automatically if not specified
    :param knn_algo: (optional) underlying data structure

In [7]:
print_code_examples(val_ds, 3)


--- Example 1 ---
Function name: learn
Docstring: Train a deepq model.

    Parameters
    -------
    env: gym.Env
        environment to train on
  ...
Code:
def learn(env,
          network,
          seed=None,
          lr=5e-4,
          total_timesteps=100000,
          buffer_size=50000,
          exploration_fraction=0.1,
          exploration_final_eps=0.02,
          train_freq=1,
          batch_size=32,
          print_freq=100,
          checkpoint_freq=10000,
          checkpoint_path=None,
          learning_starts=1000,
          gamma=1.0,
          target_network_update_freq=500,
          prioritized_replay=False,
          prioritized_replay_alpha=0.6,
          prioritized_replay_beta0=0.4,
          prioritized_replay_beta_iters=None,
          prioritized_replay_eps=1e-6,
          param_noise=False,
          callback=None,
          load_path=None,
          **network_kwargs
            ):
    """Train a deepq model.

    Parameters
    -------
    env: gym.E

### Extract few training examples


In [None]:
def collect_training_data_from_dataset(dataset, max_examples=1000):
    examples = []
    for example in dataset:
        code = example.get("code", "")
        examples.append(
            {
                "function_name": example.get("func_name", ""),
                "docstring": example.get("docstring", ""),
                "code": code,
                "language": example.get("language", "python"),
            }
        )

        if len(examples) >= max_examples:
            break

    print(f"Collected {len(examples)} examples")
    return examples


# Collect training data from the dataset
training_data = collect_training_data_from_dataset(train_ds, max_examples=1000)

Collected 1000 examples


In [9]:
training_data[1]

{'function_name': 'predict',
 'docstring': "Recognizes faces in given image using a trained KNN classifier\n\n    :param X_img_path: path to image to be recognized\n    :param knn_clf: (optional) a knn classifier object. if not specified, model_save_path must be specified.\n    :param model_path: (optional) path to a pickled knn classifier. if not specified, model_save_path must be knn_clf.\n    :param distance_threshold: (optional) distance threshold for face classification. the larger it is, the more chance\n           of mis-classifying an unknown person as a known one.\n    :return: a list of names and face locations for the recognized faces in the image: [(name, bounding box), ...].\n        For faces of unrecognized persons, the name 'unknown' will be returned.",
 'code': 'def predict(X_img_path, knn_clf=None, model_path=None, distance_threshold=0.6):\n    """\n    Recognizes faces in given image using a trained KNN classifier\n\n    :param X_img_path: path to image to be recogni

## Salesforce/Code-350M-mono


### Create training tokenized samples from extracted examples

Programming languages require specialized tokenization that preserves code syntax. The tokenizer must recognize language-specific tokens like brackets, keywords, indentation, and operators. Since I will be using the `Salesforce/codegen-350M-mono` model, I go ahead and use their tokenizer as well.


In [None]:
# Load a code-optimized tokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")


def prepare_training_examples(examples, tokenizer, max_length=256):
    training_samples = []

    for example in examples:
        code = example["code"]
        # Skip very short functions
        if len(code.strip()) < 20:
            continue

        # Tokenize the code
        tokenized = tokenizer(code, truncation=True, max_length=max_length)
        input_ids = tokenized["input_ids"]

        # Create training examples
        seq_length = len(input_ids)
        if seq_length > 20:
            for _ in range(3):
                # Decide how much to keep (50-90% of tokens)
                keep_percent = random.uniform(0.5, 0.9)
                keep_tokens = int(seq_length * keep_percent)

                # Create input/target pairs
                input_sample = input_ids[:keep_tokens]
                target_sample = input_ids[keep_tokens:]

                training_samples.append(
                    {
                        "input_ids": input_sample,
                        "labels": target_sample,
                    }
                )

    print(
        f"Created {len(training_samples)} training samples from {len(examples)} examples"
    )
    return training_samples


# Prepare the training data
training_samples = prepare_training_examples(training_data, tokenizer)

tokenizer_config.json:   0%|          | 0.00/240 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/90.0 [00:00<?, ?B/s]

Created 3000 training samples from 1000 examples


In [11]:
training_samples[0]

{'input_ids': [4299,
  4512,
  7,
  27432,
  62,
  15908,
  11,
  2746,
  62,
  21928,
  62,
  6978,
  28,
  14202,
  11,
  299,
  62,
  710,
  394,
  32289,
  28,
  14202,
  11,
  638,
  77,
  62,
  282,
  2188,
  11639,
  1894,
  62,
  21048,
  3256,
  15942,
  577,
  28,
  25101,
  2599,
  198,
  50284,
  37811,
  198,
  50284,
  2898,
  1299,
  257,
  479,
  12,
  710,
  12423,
  12020,
  1398,
  7483,
  329,
  1986,
  9465,
  13,
  628,
  50284,
  25,
  17143,
  4512,
  62,
  15908,
  25,
  8619,
  326,
  4909,
  257,
  850,
  12,
  34945,
  329,
  1123,
  1900,
  1048,
  11,
  351,
  663,
  1438,
  13,
  628,
  50283,
  7,
  7680,
  287,
  2723,
  2438,
  284,
  766,
  4512,
  62,
  15908,
  1672,
  5509,
  4645,
  8,
  628,
  50283,
  1273,
  5620,
  25,
  198,
  50280,
  27,
  27432,
  62,
  15908,
  29,
  14,
  198,
  50280,
  6552,
  250,
  8418,
  1279,
  6259,
  16,
  29,
  14,
  198,
  50280,
  6552,
  224,
  50285,
  6552,
  250,
  8418,
  1279,
  82,
  3674,
  480,
  16,

### Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA)

PEFT with LoRA helps modify only a small subset of the model's parameters, making training more efficient while maintaining performance.


In [None]:
# Load a pre-trained code model
model_name = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(model_name)  # for causal language modeling

# Configure LoRA adapter
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,  # for fine-tuning
    r=8,  # controls the number of trainable parameters
    lora_alpha=32,
    lora_dropout=0.1,
)

# Create PEFT model
model = get_peft_model(model, peft_config)
print(
    f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}"
)
print(f"Total parameters: {sum(p.numel() for p in model.parameters())}")
print(
    f"Percentage of trainable parameters: {100 * sum(p.numel() for p in model.parameters() if p.requires_grad) / sum(p.numel() for p in model.parameters()):.2f}%"
)

config.json:   0%|          | 0.00/999 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/797M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/797M [00:00<?, ?B/s]

Some weights of the model checkpoint at Salesforce/codegen-350M-mono were not used when initializing CodeGenForCausalLM: ['transformer.h.0.attn.causal_mask', 'transformer.h.1.attn.causal_mask', 'transformer.h.10.attn.causal_mask', 'transformer.h.11.attn.causal_mask', 'transformer.h.12.attn.causal_mask', 'transformer.h.13.attn.causal_mask', 'transformer.h.14.attn.causal_mask', 'transformer.h.15.attn.causal_mask', 'transformer.h.16.attn.causal_mask', 'transformer.h.17.attn.causal_mask', 'transformer.h.18.attn.causal_mask', 'transformer.h.19.attn.causal_mask', 'transformer.h.2.attn.causal_mask', 'transformer.h.3.attn.causal_mask', 'transformer.h.4.attn.causal_mask', 'transformer.h.5.attn.causal_mask', 'transformer.h.6.attn.causal_mask', 'transformer.h.7.attn.causal_mask', 'transformer.h.8.attn.causal_mask', 'transformer.h.9.attn.causal_mask']
- This IS expected if you are initializing CodeGenForCausalLM from the checkpoint of a model trained on another task or with another architecture (e

Trainable parameters: 655360
Total parameters: 357367808
Percentage of trainable parameters: 0.18%


### Testing the pre-trained model (Before training)


In [13]:
text = "def read_json(filename, print=False):"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


def read_json(filename, print=False):
    """
    Reads a JSON file and returns a list of dictionaries.
    """
    with open(filename, 'r') as f:
        return json.load(f)

def write_json(filename, data):
    """
    Writes a JSON file.
    """
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)

def read_csv(filename, print=False):
    """
    Reads a CSV


## Training

The `Salesforce/codegen-350M-mono` implementation used PyTorch and has no tensorflow implementation. I tried to train the model with Tensorflow but run into issues constantly. The training is performed in `torch`


In [None]:
class CodeCompletionDataset(Dataset):
    def __init__(self, samples, tokenizer, max_length=256):
        self.samples = samples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]

        # Get input and target
        input_ids = sample["input_ids"]
        labels = sample["labels"]

        # Combine input with labels for training
        combined_ids = input_ids + labels

        # Handle truncation if needed
        if len(combined_ids) > self.max_length:
            combined_ids = combined_ids[: self.max_length]

        # Create attention mask
        attention_mask = [1] * len(combined_ids)

        # Pad sequences if needed
        padding_length = self.max_length - len(combined_ids)
        if padding_length > 0:
            combined_ids = combined_ids + [self.tokenizer.pad_token_id] * padding_length
            attention_mask = attention_mask + [0] * padding_length

        # Set up labels (set to -100 for input portion to ignore in loss)
        labels = [-100] * len(input_ids) + combined_ids[len(input_ids) :]

        # Ensure all sequences have the right length
        if len(labels) > self.max_length:
            labels = labels[: self.max_length]
        elif len(labels) < self.max_length:
            labels = labels + [-100] * (self.max_length - len(labels))

        return {
            "input_ids": torch.tensor(combined_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }

In [None]:
def train_code_completion_model(
    training_samples,
    model_name="Salesforce/codegen-350M-mono",
    output_dir="./code-completion-model",
    num_epochs=3,
    batch_size=4,
    grad_accum_steps=4,
    learning_rate=5e-5,
    weight_decay=0.01,
    max_length=256,
):
    # Load tokenizer and model
    print(f"Loading model and tokenizer: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Ensure padding token is set
    if tokenizer.pad_token_id is None:
        if tokenizer.eos_token_id is not None:
            tokenizer.pad_token_id = tokenizer.eos_token_id
            print(f"Setting pad_token_id to eos_token_id: {tokenizer.pad_token_id}")
        else:
            tokenizer.pad_token_id = 0
            print("Setting pad_token_id to 0")

    # Split data into train and validation
    train_size = int(0.9 * len(training_samples))
    train_samples = training_samples[:train_size]
    val_samples = training_samples[train_size:]

    print(f"Training samples: {len(train_samples)}")
    print(f"Validation samples: {len(val_samples)}")

    # Create datasets
    train_dataset = CodeCompletionDataset(train_samples, tokenizer, max_length)
    val_dataset = CodeCompletionDataset(val_samples, tokenizer, max_length)

    # Set up training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=grad_accum_steps,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir=f"{output_dir}/logs",
        logging_steps=10,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        fp16=True,
        load_best_model_at_end=True,
    )

    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    # Train the model
    print("Starting training...")
    trainer.train()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    return model, tokenizer

In [None]:
# Prepare the training samples
training_samples = prepare_training_examples(training_data, tokenizer)

# Train the model
model, tokenizer = train_code_completion_model(training_samples)

Created 3000 training samples from 1000 examples
Loading model and tokenizer: Salesforce/codegen-350M-mono


Some weights of the model checkpoint at Salesforce/codegen-350M-mono were not used when initializing CodeGenForCausalLM: ['transformer.h.0.attn.causal_mask', 'transformer.h.1.attn.causal_mask', 'transformer.h.10.attn.causal_mask', 'transformer.h.11.attn.causal_mask', 'transformer.h.12.attn.causal_mask', 'transformer.h.13.attn.causal_mask', 'transformer.h.14.attn.causal_mask', 'transformer.h.15.attn.causal_mask', 'transformer.h.16.attn.causal_mask', 'transformer.h.17.attn.causal_mask', 'transformer.h.18.attn.causal_mask', 'transformer.h.19.attn.causal_mask', 'transformer.h.2.attn.causal_mask', 'transformer.h.3.attn.causal_mask', 'transformer.h.4.attn.causal_mask', 'transformer.h.5.attn.causal_mask', 'transformer.h.6.attn.causal_mask', 'transformer.h.7.attn.causal_mask', 'transformer.h.8.attn.causal_mask', 'transformer.h.9.attn.causal_mask']
- This IS expected if you are initializing CodeGenForCausalLM from the checkpoint of a model trained on another task or with another architecture (e

Setting pad_token_id to eos_token_id: 50256
Training samples: 2700
Validation samples: 300




Starting training...


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mieadoboe[0m ([33mieadoboe-memorial-university-of-newfoundland[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss
1,0.0489,0.264018
2,0.0009,0.343611


### Generate some coding examples


In [None]:
def generate_completion(model, tokenizer, function_prefix, max_new_tokens=100):
    # Move model to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Tokenize and move to appropriate device
    inputs = tokenizer(function_prefix, return_tensors="pt")
    # Move each tensor to the device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate completion
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],  # Use dictionary indexing
            attention_mask=inputs["attention_mask"],  # Use dictionary indexing
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.95,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode the generated tokens
    completed_code = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Return only the newly generated part
    return completed_code[len(function_prefix) :]


# Test with some examples
test_prefixes = [
    "def train_model(X_train, y_train):\n    # Create TensorFlow model\n    model = tf.keras",
    "def process_image(image_path):\n    # Load and preprocess image\n    import numpy as np\n    img = ",
    "def create_bert_classifier():\n    # Initialize a BERT model from HuggingFace\n    from transformers import ",
]

for prefix in test_prefixes:
    completion = generate_completion(model, tokenizer, prefix)
    print(f"\nPrefix:\n{prefix}")
    print(f"\nCompletion:\n{completion}")
    print("-" * 50)


Prefix:
def train_model(X_train, y_train):
    # Create TensorFlow model
    model = tf.keras

Completion:
.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=
--------------------------------------------------

Prefix:
def process_image(image_path):
    # Load and preprocess image
    import numpy as np
    img = 

Completion:
ia.load_image(image_path)
    img = img.resize((512, 512), ia.ResampleMethod.NEAREST)
    img = np.expand_dims(img, 0)
    return img
--------------------------------------------------

Prefix:
def create_bert_classifier():
    # Initialize a BERT model from HuggingFace
    from transformers import 

Completion:

    model = AutoModelWithLMHead.from_pretrained(
        "bert-base-uncased",
        cache_dir=PYTORCH_PRETRAINED_BERT_CAC