# Generating code

This notebook runs the [DeepSeek Coder 1.3B base model](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) on a custom dataset and saves its outputs for later evaluation (see the ```evaluation``` folder).

It contains a simple implementation that generates samples from a structured dataset built with the helpers in ```utils```.

## Env setup

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from utils import RepoExtractor, CodeDataset
import random
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json

random.seed(1) # Adding this for reproducibility

## Creating the dataset

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = RepoExtractor("https://github.com/gp-1108/snake_rl")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True).to(device)
dataset = CodeDataset(repo.get_files(), min_lengths=(100,10,50), max_lengths=(200,200,200))

Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
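
The internals of ```CodeDataset``` live in ```utils``` and are not shown in this notebook. As a rough illustration of how a source file could be cut into a (prefix, middle, suffix) triple under minimum and maximum length bounds, here is a minimal sketch. The function `split_fim_sample` is hypothetical, and it assumes that the bounds are character counts given in (prefix, middle, suffix) order, which the notebook does not confirm.

```python
import random


def split_fim_sample(text, rng, min_lengths=(100, 10, 50), max_lengths=(200, 200, 200)):
    """Cut one (prefix, middle, suffix) triple out of `text`.

    Hypothetical helper: lengths are assumed to be character counts.
    Returns None when `text` is too short to satisfy the minimum lengths.
    """
    n = len(text)
    min_p, min_m, min_s = min_lengths
    max_p, max_m, max_s = max_lengths
    if n < min_p + min_m + min_s:
        return None

    # Draw each length within its bounds while leaving room for the remaining parts.
    p_len = rng.randint(min_p, min(max_p, n - min_m - min_s))
    m_len = rng.randint(min_m, min(max_m, n - p_len - min_s))
    s_len = rng.randint(min_s, min(max_s, n - p_len - m_len))

    # Pick a random window of the required total size inside the text.
    start = rng.randint(0, n - (p_len + m_len + s_len))
    prefix = text[start:start + p_len]
    middle = text[start + p_len:start + p_len + m_len]
    suffix = text[start + p_len + m_len:start + p_len + m_len + s_len]
    return prefix, middle, suffix


# Example on synthetic text; a seeded generator keeps the split reproducible.
sample_text = "".join(str(i % 10) for i in range(1000))
prefix, middle, suffix = split_fim_sample(sample_text, random.Random(1))
```

Concatenating the three parts always gives back a contiguous slice of the original text, so the middle is a genuine "hole" between the prefix and the suffix.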


In [None]:
@torch.no_grad()
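def generate_code(prompt: str) -> str:
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)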

Let's start with a simple test run to check that generation works:

In [None]:
out = generate_code("def sum(a: int, b:int):")
print(out)

Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.



    return a + b

def sub(a: int, b:int):
    return a - b

def mul(a: int, b:int):
    return a * b

def div(a: int, b:int):
    return a / b

def mod(a: int, b:int):
    return a % b

def exp(a: int, b:int):
    return a ** b

def main():
    print("Welcome to the calculator")
    print("Enter 'q' to quit")
    while True:
        try:
            a = int(input("Enter the first number: "))
            b = int(input("Enter the second number: "))
            print("Enter '+' for addition")
            print("Enter '-' for subtraction")
            print("Enter '*' for multiplication")
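
In the output above the model keeps writing new functions until it reaches the `max_new_tokens` limit. If only the body of the prompted function is wanted, one possible post-processing step is to cut the completion at the first line that starts a new top-level statement. The helper below, `truncate_to_first_block`, is a hypothetical sketch and is not used elsewhere in this notebook.

```python
def truncate_to_first_block(completion: str) -> str:
    """Keep the completion up to the first new top-level statement.

    Hypothetical helper: a line counts as "top-level" when it is non-empty
    and does not start with whitespace.
    """
    kept = []
    seen_content = False
    for line in completion.splitlines():
        starts_top_level = bool(line) and not line[0].isspace()
        # Stop once real content has been seen and a new unindented line begins.
        if seen_content and starts_top_level:
            break
        if line.strip():
            seen_content = True
        kept.append(line)
    return "\n".join(kept).rstrip()


raw = "\n    return a + b\n\ndef sub(a: int, b:int):\n    return a - b\n"
body = truncate_to_first_block(raw)
```

This heuristic relies on indentation only, so it suits Python-like completions and would need adjusting for other languages.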


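The model also supports fill-in-the-middle (FIM) completion, where it generates the missing middle given the surrounding prefix and suffix. The cell below prints the prefix, the generated middle, the reference middle, and the suffix, separated by rows of asterisks: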

In [None]:
prefix, middle, suffix = dataset[0]
input_text = f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"
out = generate_code(input_text)
print(prefix)
print("*"*10)
print(out)
print("*"*10)
print(middle)
print("*"*10)
print(suffix)

Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.


import BaseAgent
from .BaselineAgent import BaselineAgent
from .RandomAgent import RandomAgent
from .DQN
**********
Agent import DQNAgent

__all__ = 
**********
Agent import DQNAgent
from .HybridDQNAgent import HybridDQNAgent

__all__ = 
**********
["BaseAgent", "BaselineAgent", "RandomAgent", "DQNAgent", 
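
The prompt format used above can be wrapped in a small helper so that the sentinel tokens are written only once. The name `build_fim_prompt` is an assumption introduced here; the sentinel strings are copied verbatim from the cell above.

```python
# Sentinel tokens copied from the prompt used in this notebook.
FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap a prefix and suffix in the fill-in-the-middle prompt layout.

    Hypothetical helper: the model is expected to generate the text that
    belongs at the position of the hole token.
    """
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
```

The resulting string can be passed to `generate_code` in place of the inline f-string.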


## Generating all answers

Here we generate an answer for every sample in the dataset and save them all to a JSON file.

Since I am running all of this on Colab (because of GPU constraints), I manually downloaded the file afterwards and moved it to ```evaluation/model_outputs.json```.

In [None]:
# Generate a completion for every sample and collect the results as a list of dictionaries
data = []
for i in range(len(dataset)):
    prefix, middle, suffix = dataset[i]
    input_text = f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"
    data.append({
        "prefix": prefix,
        "generated": generate_code(input_text),
        "correct_middle": middle,
        "suffix": suffix,
    })

# Save everything to a JSON file
with open("output.json", "w") as json_file:
    json.dump(data, json_file, indent=4)


Setting `pad_token_id` to `eos_token_id`:32014 for open-end generation.
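
Once the JSON file exists, each entry can be compared against its reference. The actual metrics live in the ```evaluation``` folder and are not shown here; the sketch below assumes a plain exact-match comparison after stripping surrounding whitespace, purely as an illustration of how the saved fields could be consumed. `exact_match_rate` is a hypothetical name.

```python
import json


def exact_match_rate(entries):
    """Fraction of entries whose generated text equals the reference middle.

    Hypothetical metric: both strings are stripped of surrounding whitespace
    before comparison. Returns 0.0 for an empty list.
    """
    if not entries:
        return 0.0
    hits = sum(
        entry["generated"].strip() == entry["correct_middle"].strip()
        for entry in entries
    )
    return hits / len(entries)


# Small in-memory example using the same keys as the saved JSON entries.
sample_entries = [
    {"prefix": "def add(a, b):\n    ", "generated": "return a + b",
     "correct_middle": "return a + b\n", "suffix": ""},
    {"prefix": "def sub(a, b):\n    ", "generated": "return a + b",
     "correct_middle": "return a - b\n", "suffix": ""},
]
rate = exact_match_rate(sample_entries)

# The JSON round trip keeps the entries unchanged.
restored = json.loads(json.dumps(sample_entries, indent=4))
```

With the real file, the entries would come from `json.load` on the path where the output was saved.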