## Dataset
* https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca/viewer
* https://www.atyun.com/datasets/files/sahil2801/code_instructions_120k.html (not accessible at the time of writing)

## Pandas Dataframe
### Strengths
* Data Analysis
* Data Manipulation

In [3]:
import pandas as pd

code_df = pd.read_parquet("train-00000-of-00001-d9b93805488c263e.parquet")

In [5]:
code_df.shape

(121959, 4)

In [9]:
code_df.columns

Index(['instruction', 'input', 'output', 'prompt'], dtype='object')

In [20]:
code_df.iloc[0]

instruction    Create a function to calculate the sum of a se...
input                                            [1, 2, 3, 4, 5]
output         # Python code\ndef sum_sequence(sequence):\n  ...
prompt         Below is an instruction that describes a task....
Name: 0, dtype: object
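
The two strengths listed above can be illustrated with a minimal sketch. It does not read the real parquet file; `sample_df` is a hypothetical two-row stand-in with the same four columns as `code_df`. The first row is made up for illustration and the `prompt` values are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for code_df: same columns, illustrative rows only.
sample_df = pd.DataFrame(
    {
        "instruction": [
            "Reverse a string.",  # made-up example row
            'Create a MySQL query to find the most expensive product from the table "products".',
        ],
        "input": ["hello", ""],
        "output": [
            "print('hello'[::-1])",
            "SELECT * FROM products ORDER BY price DESC LIMIT 1;",
        ],
        "prompt": ["<placeholder prompt>", "<placeholder prompt>"],
    }
)

# Data analysis: count the examples that come without an input.
empty_input_count = int((sample_df["input"].str.strip() == "").sum())

# Data manipulation: add a derived column with each output's length in characters.
sample_df["output_chars"] = sample_df["output"].str.len()

# Data manipulation: keep only the rows that have a non-empty input.
with_input_df = sample_df[sample_df["input"].str.strip() != ""]

print(empty_input_count, len(with_input_df))
```

The same expressions should work on `code_df` itself, assuming the `input` and `output` columns contain only strings.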

## Pyarrow Table
### Strengths
* Reading specific columns
* Filtering
* Working with raw data structure

In [2]:
import pyarrow.parquet as pq

code_table = pq.read_table("train-00000-of-00001-d9b93805488c263e.parquet")

In [4]:
code_table.shape

(121959, 4)

In [11]:
code_table.schema

instruction: string
input: string
output: string
prompt: string
-- schema metadata --
huggingface: '{"info": {"features": {"instruction": {"dtype": "string", "' + 165

In [29]:
code_table.take([8])

pyarrow.Table
instruction: string
input: string
output: string
prompt: string
----
instruction: [["Create a MySQL query to find the most expensive product from the table "products"."]]
input: [[""]]
output: [["SELECT * FROM products ORDER BY price DESC LIMIT 1;"]]
prompt: [["Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a MySQL query to find the most expensive product from the table "products".

### Input:


### Response:
SELECT * FROM products ORDER BY price DESC LIMIT 1;"]]

# Tasks
* Get number of examples for each language
* Get list of large Code LLMs (including the used ones)
    * Analyze their performance
* Find the code that can generate the malicious code (look through the paper and its details)
* ***Extra*** -- How does gpt-3.5-turbo determine whether code is malicious?

## Number of Examples for Each Language
### Ideas
* Get a model to identify what language a snippet is written in
* Implement rules/checks that are specific to each language
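
The second idea (language-specific rules) could start from something like the sketch below. Everything in it is an assumption made for illustration: the `LANGUAGE_RULES` table, the `guess_language` helper, and the four languages covered are not taken from the dataset or the paper. A real rule set would need far more patterns.

```python
import re
from collections import Counter

# Hypothetical, deliberately small rule set: each language maps to regexes
# that hint at it. SQL patterns are case-insensitive via the (?i) flag.
LANGUAGE_RULES = {
    "python": [r"^\s*def \w+\(.*\):", r"^\s*import \w+", r"\bprint\("],
    "sql": [r"(?i)\bselect\b.+\bfrom\b", r"(?i)\binsert into\b", r"(?i)\bcreate table\b"],
    "javascript": [r"\bfunction \w*\(", r"\bconsole\.log\(", r"\b(const|let) \w+ ="],
    "java": [r"\bpublic (static )?\w+", r"\bSystem\.out\.println\("],
}


def guess_language(snippet):
    """Return the language with the most matching rules, or "unknown"."""
    scores = Counter()
    for language, patterns in LANGUAGE_RULES.items():
        for pattern in patterns:
            if re.search(pattern, snippet, flags=re.MULTILINE):
                scores[language] += 1
    if not scores:
        return "unknown"
    return scores.most_common(1)[0][0]


# Illustrative snippets; the SQL one is the output shown earlier in this notebook.
snippets = [
    "SELECT * FROM products ORDER BY price DESC LIMIT 1;",
    "def sum_sequence(sequence):\n    return sum(sequence)",
    "console.log('hi');",
    "hello world",
]

# Number of examples per guessed language.
language_counts = Counter(guess_language(snippet) for snippet in snippets)
print(language_counts)
```

Applied to the dataset, the same counting step could run over the `output` column. Snippets that match no rule fall into `"unknown"` and show where more rules, or a model-based classifier, would be needed.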
#### Dataset with General Code Examples
* https://huggingface.co/datasets/bigcode/the-stack


## Large Code LLMs
#### From The Paper
* CodeLlama (7B)
    * https://github.com/meta-llama/codellama
* DeepSeek-Coder (6.7B)
    * https://github.com/deepseek-ai/DeepSeek-Coder
    * https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct
* StarCoder2 (7B)
    * https://github.com/bigcode-project/starcoder2
#### Others
* Codex
    * https://github.com/openai/human-eval
    * Restricted access
    * Python
* GitHub Copilot
    * Not open source
* CodeT5
    * https://huggingface.co/Salesforce/codet5-base
    * Snippets:
        * Use a pipeline as a high-level helper:

              from transformers import pipeline
              pipe = pipeline("text2text-generation", model="Salesforce/codet5-base")

        * Load the model directly:

              from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
              tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
              model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

* AlphaCode
    * Not public
    * Not open source

##### CodeLlama (*In model-repos directory*)

##### DeepSeek-Coder (torch.bfloat16)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model in bfloat16, then move the model to the GPU.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

messages = [
    {"role": "user", "content": "write a quick sort algorithm in python."}
]
# Build the chat prompt and place the token ids on the same device as the model.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# tokenizer.eos_token_id is the id of the <|EOT|> token.
# do_sample=False selects greedy decoding, so top_k and top_p have no effect here.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens (everything after the prompt).
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

##### StarCoder2 (*In model-repos directory*)


##### CodeT5