# Model Evaluation

This notebook evaluates the following models:

- FLAN-T5
- CodeGen
- CodeTrans
- CodeBERT
- StarEncoder

For each model, the notebook discusses the architecture, dataset, and training
approach, and generates evaluation metrics.

## Criteria

- Trained on C/C++
- Trained on Natural Language
  - Preferably also with Git commits
- Architecture
  - Encoder (preferred)
  - Decoder
- Learning Objective
  - Either Masked Language Modelling (MLM) or Causal Language Modelling (CLM)
  - Both can be fine-tuned for text classification
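
The architecture and objective criteria are linked: an encoder trained with MLM lets every token attend to tokens on both sides, while a decoder trained with CLM restricts each token to earlier positions. A minimal sketch of the two attention patterns in plain Python; the function names are illustrative and do not come from any library:

```python
def bidirectional_mask(n):
    """Encoder-style (MLM): every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style (CLM): position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def render(mask):
    """Render a 0/1 mask as text, one row per query position."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)


print("Encoder (bidirectional):")
print(render(bidirectional_mask(4)))
print("Decoder (causal):")
print(render(causal_mask(4)))
```

The full context available at every position is one reason an encoder is listed as preferred for classification here.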

## Metrics Output

### `results.csv`

```csv
masked_input,prediction,actual
```

### `confusion_matrix.csv`

```csv
tp,fp,tn,fn
```

### `metrics.csv`

```csv
accuracy,precision,recall,f1
```
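
The three files can be derived from one another: `confusion_matrix.csv` from the rows of `results.csv`, and `metrics.csv` from the confusion counts. The sketch below shows one way this might be done with the standard library. It assumes the `prediction` and `actual` columns hold binary labels with `"1"` as the positive class, which the notebook does not state; the sample rows and helper names are made up for illustration.

```python
import csv
import io

# Hypothetical rows in the results.csv layout; the values are invented.
RESULTS_CSV = """masked_input,prediction,actual
a,1,1
b,1,0
c,0,0
d,0,1
e,1,1
"""


def confusion_counts(rows, positive="1"):
    """Count tp, fp, tn, fn, treating `positive` as the positive label (assumption)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for row in rows:
        predicted = row["prediction"] == positive
        actual = row["actual"] == positive
        if predicted and actual:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(c):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    precision = safe_div(c["tp"], c["tp"] + c["fp"])
    recall = safe_div(c["tp"], c["tp"] + c["fn"])
    return {
        "accuracy": safe_div(c["tp"] + c["tn"], sum(c.values())),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def to_csv(record):
    """Serialise one dict as a header line plus a single data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))
counts = confusion_counts(rows)
print(to_csv(counts))
print(to_csv(metrics(counts)))
```

If the predictions are tokens rather than binary labels, the counts would need a per-class or exact-match definition instead.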


## FLAN-T5


## CodeGen

- [Paper](https://arxiv.org/pdf/2203.13474.pdf)
- [GitHub](https://github.com/salesforce/CodeGen)
- [HuggingFace](https://huggingface.co/docs/transformers/model_doc/codegen)

### Overview

- Released 2022
- Architecture
  - Decoder, Autoregressive
- Learning Objective
  - Next-token Prediction (CLM)
- 3 Dataset Stages:
  1. CodeGen-NL
     - Dataset: The Pile
     - Natural Language `1159.04 GB`, `354.7B Tokens`
     - Code `95.16 GB`, `31.6B Tokens`
  2. CodeGen-Multi
     - Dataset: Google BigQuery
     - Code `340 GB`, `119.3B Tokens`
       - C/C++ `119 GB`, `19.B Tokens`
  3. CodeGen-Mono
     - Dataset: BigPython
     - Python-only code, so not needed for our use case
- 4 Checkpoints per variant
  - 350M, 2.7B, 6.1B, 16.1B

#### Pros

- Trained on a lot of C/C++
- Small checkpoints are available
  - 350M, 2.7B
  - Can run on consumer GPUs
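
The consumer-GPU claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the memory needed for the weights alone at each listed checkpoint size. The bytes-per-parameter values are assumptions about the chosen dtype, and activations, gradients, and optimizer state during fine-tuning add more on top.

```python
# Checkpoint sizes listed in the overview, in parameters.
CHECKPOINTS = {"350M": 350e6, "2.7B": 2.7e9, "6.1B": 6.1e9, "16.1B": 16.1e9}

# Assumption: storage cost per parameter for two common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in CHECKPOINTS.items():
    fp16 = weight_memory_gb(params, "fp16")
    fp32 = weight_memory_gb(params, "fp32")
    print(f"{name}: {fp16:.1f} GB in fp16, {fp32:.1f} GB in fp32")
```

Comparing these figures against the memory of the available GPU gives a rough lower bound; it does not guarantee that fine-tuning fits.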

#### Cons

- Decoder architecture is less suited to text classification fine-tuning
- Learning objective is CLM, not MLM
  - However, there is a newer version `CodeGen2`, 2023
    - Adds MLM training objective
    - Encoder-Decoder architecture
    - Trained on more languages and more data, including C and C++, from `The Stack` dataset


## CodeTrans


## CodeBERT


## StarEncoder

- [Blog](https://huggingface.co/blog/starcoder)
- [Paper](https://arxiv.org/pdf/2305.06161.pdf)
- [GitHub](https://github.com/bigcode-project/bigcode-encoder)
- [HuggingFace](https://huggingface.co/bigcode/starencoder)

### Pros

- Encoder architecture (BERT)
  - Better for text classification
  - Much more memory efficient, i.e. can run on consumer GPUs
- Trained on source code and Git commits
- Trained on a lot of C & C++ code
- Trained on identified repos: torvalds/linux, D
  - Verified using [this](https://huggingface.co/spaces/bigcode/search) and [this](https://stack.dataportraits.org/)


MLM_MASKING_PROBABILITY = 0.15
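
`MLM_MASKING_PROBABILITY` presumably sets the fraction of tokens hidden from the model when building the `masked_input` column. A minimal sketch of how such a constant could be applied, assuming whitespace-separated tokens and a `<mask>` placeholder; a real pipeline would use the model's own tokenizer and mask token, and BERT-style training additionally leaves some selected tokens unchanged or swaps them for random ones, which is omitted here:

```python
import random

MLM_MASKING_PROBABILITY = 0.15  # same value as the notebook constant
MASK_TOKEN = "<mask>"  # assumption: placeholder; real tokenizers define their own


def mask_tokens(tokens, probability=MLM_MASKING_PROBABILITY, rng=None):
    """Independently replace each token with MASK_TOKEN with the given probability.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should recover.
    """
    if rng is None:
        rng = random.Random()
    masked, targets = [], {}
    for index, token in enumerate(tokens):
        if rng.random() < probability:
            masked.append(MASK_TOKEN)
            targets[index] = token
        else:
            masked.append(token)
    return masked, targets


# Hypothetical C snippet, split on whitespace for illustration only.
tokens = "int main ( void ) { return 0 ; }".split()
masked, targets = mask_tokens(tokens, rng=random.Random(0))
print(" ".join(masked))
print(targets)
```

Passing a seeded `random.Random` keeps the masking reproducible between evaluation runs, which makes results from different models comparable on the same masked inputs.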