# Description
The goal is to build a model that classifies Python code snippets.

It outputs one of four classes:
- Data Processing
- Web/API Code
- Algorithms/Logic
- Machine Learning

# Dataset
Ideas:
- Use something like `CodeSearchNet`
- Scrape GitHub repos and use the information in their tags
- Create a synthetic dataset from tutorials or documentation

Goal: Have 100 to 500 data points per category.

# Preprocessing
- Use a tokenizer such as Hugging Face's `AutoTokenizer`

# Training
- Fine-tune something like CodeBERT
- Loss: CrossEntropyLoss
- Optimizer: Adam
- Epochs: 5–10 (enough for an initial demo)

## Evaluation
- Accuracy
- Confusion matrix
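
Both metrics can be computed from two parallel lists of labels. The block below is a minimal plain-Python sketch, not the evaluation code used in this notebook; the helper names `accuracy` and `confusion_matrix` and the example labels are made up for illustration.

```python
from collections import Counter

LABELS = ["Data Processing", "Web/API Code", "Algorithms/Logic", "Machine Learning"]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]

# Made-up labels, for illustration only
y_true = ["Data Processing", "Web/API Code", "Machine Learning", "Machine Learning"]
y_pred = ["Data Processing", "Data Processing", "Machine Learning", "Algorithms/Logic"]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
```

With only four classes, the confusion matrix is small enough to read directly: off-diagonal cells show which pairs of categories get mixed up.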

# Questions:
- How to store/load model?
- How to compare models?
- Baseline vs. mine vs. huggingface/CodeBERTa-small-v1 vs. DistilBERT vs. ??
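
One possible answer to the baseline question is a keyword counter scored with the same accuracy function as every other model. This is only a sketch: the keyword lists, `baseline_classify`, `compare`, and the four example snippets are all hypothetical, and accuracy on hand-picked examples says nothing about real performance.

```python
import re

# Hypothetical keyword lists; tune them against real data before trusting the numbers.
KEYWORDS = {
    "Machine Learning": ["torch", "tensorflow", "sklearn", "keras", "transformers"],
    "Web/API Code": ["flask", "fastapi", "requests", "django", "http"],
    "Data Processing": ["pandas", "numpy", "csv", "dataframe", "sqlite3"],
}
FALLBACK = "Algorithms/Logic"

def baseline_classify(code):
    """Return the label whose keywords appear most often, or FALLBACK if none match."""
    words = re.findall(r"[a-z0-9_]+", code.lower())
    scores = {label: sum(words.count(k) for k in kws) for label, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else FALLBACK

def compare(models, examples):
    """Accuracy of each named classifier on the same (code, label) pairs."""
    return {
        name: sum(fn(code) == label for code, label in examples) / len(examples)
        for name, fn in models.items()
    }

# Hand-written examples, for illustration only
examples = [
    ("import pandas as pd\ndf = pd.read_csv('x.csv')", "Data Processing"),
    ("import requests\nr = requests.get(url)", "Web/API Code"),
    ("def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)", "Algorithms/Logic"),
    ("import torch\nmodel = torch.nn.Linear(4, 2)", "Machine Learning"),
]
results = compare({"keyword_baseline": baseline_classify}, examples)
```

Any fine-tuned model can be added to the `models` dict as a function that maps a code string to a label, so all candidates are scored on identical held-out examples.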

In [3]:
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [4]:
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dr

In [8]:
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens = [tokenizer.cls_token] + code_tokens + [tokenizer.eos_token]
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[0,
 9232,
 19220,
 1640,
 102,
 6,
 428,
 3256,
 114,
 10,
 15698,
 428,
 35,
 671,
 10,
 1493,
 671,
 741,
 2]

In [10]:
model(torch.tensor(token_ids).to(device)[None,:])

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1685,  0.3331,  0.0392,  ..., -0.2262, -0.3359,  0.3277],
         [-1.0436,  0.3191,  0.3959,  ..., -0.4708, -0.1289,  0.5579],
         [-0.9022,  0.5009,  0.1820,  ..., -0.4935, -0.5855,  0.6971],
         ...,
         [-0.4663,  0.2088,  0.5154,  ..., -0.1752, -0.3702,  0.5890],
         [-0.4513,  0.4893,  0.4857,  ..., -0.3150, -0.6229,  0.3867],
         [-0.1703,  0.3353,  0.0404,  ..., -0.2282, -0.3384,  0.3300]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 4.9750e-01, -3.6809e-01, -5.7454e-01,  1.0966e-01,  3.4282e-01,
          7.0422e-02,  4.8956e-01, -3.5818e-01,  1.0147e-01, -3.2847e-01,
          4.4498e-01,  2.5206e-02, -2.6024e-01,  1.1288e-01, -6.0998e-02,
          5.9084e-01,  4.4048e-01, -5.2328e-01,  9.5334e-02,  3.2977e-01,
         -3.4737e-01,  5.2274e-01,  3.5409e-01,  8.1222e-04, -4.9782e-02,
          2.3763e-01,  1.2317e-01,  9.8236e-02,  4

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
sample_code = "def max(a,b): if a>b: return a else return b"


In [13]:
tokenizer(sample_code, padding="max_length", truncation=True, max_length=256)

{'input_ids': [0, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

# Fine-tuning v0
## Create dataset
This also creates the label2id and id2label maps.

In [95]:
label_list = ["Data Processing", "Web/API Code", "Algorithms/Logic", "Machine Learning"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

In [96]:
from datasets import load_dataset
dataset = load_dataset("json", data_files="data/balanced_data.jsonl", split="train")
dataset = dataset.map(lambda x: {"label": label2id[x["label"]]})
dataset = dataset.train_test_split(test_size=0.2, seed=42)
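
With so few examples per class, a purely random split can leave the test set unbalanced even when the full dataset is balanced. The block below is a plain-Python sketch of a stratified split, not what the cell above does; `stratified_split` and the toy records are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(records, test_size=0.2, seed=42, label_key="label"):
    """Split records so each label keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # At least one test example per label, even for very small groups
        n_test = max(1, round(len(group) * test_size))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy records: 8 per class, for illustration only
records = [{"code": f"snippet {i}", "label": i % 4} for i in range(32)]
train, test = stratified_split(records)
```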


## Create tokenizer


In [97]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

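def tokenize(example):
    # Called with batches by Dataset.map; plain lists are enough here because
    # the Trainer's collator converts them to tensors when building batches.
    return tokenizer(
        example["code"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )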

tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [36]:
tokenized_dataset, tokenizer

(DatasetDict({
     train: Dataset({
         features: ['code', 'label', 'input_ids', 'attention_mask'],
         num_rows: 36
     })
     test: Dataset({
         features: ['code', 'label', 'input_ids', 'attention_mask'],
         num_rows: 9
     })
 }),
 RobertaTokenizerFast(name_or_path='microsoft/codebert-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
 	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 	3: AddedToken("<unk>", rstrip=False, lst

## Define model

In [98]:
import torch
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=4,
    id2label=id2label,
    label2id=label2id
).to(device)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train

In [99]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./codebert-finetuned",
    num_train_epochs=6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_on_start=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=2,
)

In [56]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument (transformers >= 4.46)
)


In [57]:
trainer.evaluate()

{'eval_loss': 1.1917643547058105,
 'eval_model_preparation_time': 0.0036,
 'eval_runtime': 0.1914,
 'eval_samples_per_second': 47.03,
 'eval_steps_per_second': 5.226}
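
As a reference point for reading `eval_loss`: with four classes, a model that assigns equal probability to every class has a cross-entropy of ln 4 ≈ 1.386, so a loss near that value suggests the classifier is still close to guessing. The block below is a plain-Python sketch of that arithmetic, not the loss implementation the `Trainer` uses; `softmax` and `cross_entropy` are helper names made up here.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=0)    # ln(4), about 1.386
confident_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], target=0)  # well below ln(4)
```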

## Inference function

In [44]:
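def classify_code(code_snippet):
    """Return the predicted label and its softmax probability for one snippet."""
    inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True, max_length=256)
    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs.to(device))
    probs = outputs.logits.softmax(dim=1)
    pred = probs.argmax(dim=1).item()
    return id2label[pred], probs[0][pred].item()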

In [58]:
sample_code = """
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
)
"""
classify_code(sample_code)

('Data Processing', 0.335357129573822)

It would be nice for the scraper to accumulate snippets instead of deleting them. It should store each snippet's hash to avoid duplicates, and there should be a review system. Ideally it would show the formatted code, with a button to approve and move on to the next snippet, or to reject and soft-delete it before moving on. It has to be a soft delete so that snippets that were already reviewed are not added and reviewed again.

It might even be worth not processing the same page twice: store a URI rather than just a piece of code (or its hash), and never take two pieces of code from the same place. That way, diversity is maximized.

But all of this can wait for a second revision.
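
The storage side of these ideas could look roughly like the sketch below. It uses an in-memory SQLite database; the schema, the `status` values, the `add_snippet` and `review` helpers, and the `example.com` URIs are all assumptions for illustration, not the existing scraper's design.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB for the sketch only
conn.execute(
    """
    CREATE TABLE snippets (
        hash TEXT PRIMARY KEY,         -- SHA-256 of the code, used to skip duplicates
        uri TEXT NOT NULL UNIQUE,      -- one snippet per source location
        code TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'  -- 'pending', 'approved' or 'rejected'
    )
    """
)

def add_snippet(conn, uri, code):
    """Insert a snippet unless its hash or URI is already known. Returns True if added."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO snippets (hash, uri, code) VALUES (?, ?, ?)",
        (digest, uri, code),
    )
    conn.commit()
    return cur.rowcount == 1

def review(conn, uri, approve):
    """Approve a snippet, or soft-delete it by marking it rejected."""
    conn.execute(
        "UPDATE snippets SET status = ? WHERE uri = ?",
        ("approved" if approve else "rejected", uri),
    )
    conn.commit()

added_first = add_snippet(conn, "https://example.com/a", "print('hello')")
added_dup_code = add_snippet(conn, "https://example.com/b", "print('hello')")
added_dup_uri = add_snippet(conn, "https://example.com/a", "print('other')")
review(conn, "https://example.com/a", approve=False)
# The rejected row is still in the table, so the same snippet is not re-added
added_after_reject = add_snippet(conn, "https://example.com/a", "print('hello')")
rows = conn.execute("SELECT uri, status FROM snippets").fetchall()
```

Because rejected rows stay in the table, the uniqueness constraints keep blocking them on later scraper runs, which is the point of using a soft delete. A training export would then select only rows with `status = 'approved'`.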

# Migrate to PyTorch


# Clean the dataset


In [None]:
import pandas as pd
import sqlite3 as sqlite
import json

db_path = "schemas/code_snippets.db"
random_seed = 42

conn = sqlite.connect(db_path)
df = pd.read_sql("SELECT * FROM snippets ORDER BY hash", conn)


This is likely to end up imbalanced. We need to make sure that is not the case in the training dataset.


In [87]:
counts = df["label"].value_counts()
min_count = counts.min()

print(f"\n{min_count} examples per class will be taken to balance the dataset.")
counts


8 examples per class will be taken to balance the dataset.


label
Web/API Code        13
Data Processing     13
Algorithms/Logic     9
Machine Learning     8
Name: count, dtype: int64

In [None]:

balanced = (
    df.groupby("label", group_keys=False)
    .apply(lambda x: x.sample(n=min_count, random_state=random_seed))
    .reset_index(drop=True)
)


In [None]:

output_file = "data/balanced_data.jsonl"

with open(output_file, "w", encoding="utf-8") as f:
    for _, row in balanced.iterrows():
        record = {
            "code": row["code"],
            "label": row["label"]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")