# Finetuning using MLX 

Apple's recent Macs use Apple silicon chips, which are based on the ARM architecture and include an integrated GPU. Apple provides a framework called MLX for running machine learning workloads on this hardware, including the GPU. This notebook shows how to use MLX to finetune a pretrained model.

## Special tokens 

Language models use special tokens to mark the boundaries of a sequence. These are typically called `bos_token` (beginning of sequence) and `eos_token` (end of sequence). A common convention is to use `<s>` for `bos_token` and `</s>` for `eos_token`.
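
As a small illustration of why the end token matters, the sketch below cuts a generated string at the first `</s>`. The helper name `truncate_at_eos` and the sample text are hypothetical, and real inference code would usually stop on the token id rather than on the string.

```python
BOS = "<s>"
EOS = "</s>"


def truncate_at_eos(text: str, eos: str = EOS) -> str:
    """Return the part of `text` that precedes the first end-of-sequence marker."""
    head, _, _ = text.partition(eos)
    return head


# Hypothetical raw output that continues past the end token.
raw = "A: colourless solid</s>Q: unrelated text"
answer = truncate_at_eos(raw)
print(answer)  # A: colourless solid
```

If the marker is absent, `str.partition` returns the whole string as the first element, so the text is left unchanged.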

## LoRA

To reduce computational cost, we use the [low-rank adaptation (LoRA) technique](https://huggingface.co/docs/diffusers/training/lora). LoRA reduces the number of *trainable* parameters rather than the size of the network itself: the pretrained weights stay frozen, and only small low-rank adapter matrices added on top of them are trained.

![](https://lightningaidev.wpengine.com/wp-content/uploads/2023/04/lora-4-300x226@2x.png)
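
To make the savings concrete, the sketch below counts trainable parameters for a single weight matrix. It assumes the usual LoRA formulation, in which a frozen matrix of shape `d_out × d_in` receives an update `B @ A`, with `B` of shape `d_out × r` and `A` of shape `r × d_in`. The matrix size and rank are illustrative choices, not settings taken from this notebook.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two adapter matrices B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes (assumptions for this example).
d_out, d_in, rank = 4096, 4096, 8

full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)

print(full)                  # 16777216
print(lora)                  # 65536
print(f"{lora / full:.2%}")  # 0.39%
```

The adapter count grows linearly with the rank, so the rank is the main knob trading adapter capacity against training cost.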



## Mistral 7B

According to Mistral AI's announcement, [Mistral 7B is an open LLM that outperforms Llama 2 13B on all benchmarks](https://mistral.ai/news/announcing-mistral-7b/#:~:text=Outperforms%20Llama%202%2013B%20on%20all%20benchmarks).




## Data

As a dataset, we use data compiled by [Luc Patiny](https://scholar.google.ch/citations?user=FaGwMp4AAAAJ&hl=en) and [Guillaume Godin](https://scholar.google.com/citations?user=o9BCO5IAAAAJ&hl=fr).

In [1]:
import json

In [2]:
with open('data.json', 'r') as handle:
    data = json.load(handle)

Some sanity checks have already been run on the data. We keep only the entries that passed them, which are marked with the status `info`.

In [6]:
logs_w_info = list(filter(lambda x: x['logs']['status'] == 'info', data))
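
Before filtering, it can be useful to see which status values occur and how often. The sketch below uses a few made-up records that only mimic the `logs` → `status` layout. The status values other than `info` are assumptions, since the real file may use different ones.

```python
from collections import Counter

# Made-up records for illustration only; the real entries carry many more keys.
toy_data = [
    {"logs": {"status": "info"}},
    {"logs": {"status": "warn"}},
    {"logs": {"status": "info"}},
    {"logs": {"status": "error"}},
]

# Tally how often each status appears.
status_counts = Counter(entry["logs"]["status"] for entry in toy_data)
print(status_counts)

# Same selection as the filter above, written as a list comprehension.
kept = [entry for entry in toy_data if entry["logs"]["status"] == "info"]
print(len(kept), "of", len(toy_data), "entries kept")
```

Running the same tally on the loaded `data` would show how much of the dataset the filter discards.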

Now, we can build the dataset, which will be strings of the form:
```
<s>Q: {question}
A: {answer}</s>
```

In [10]:
logs_w_info[0]

{'parsed': {'name': '2-(2-Nitrobenzyloxy)benzaldehyde',
  'yield': '13%',
  'appearance': 'colourless solid',
  'mp': {'low': 103, 'high': 105, 'units': '°C'},
  '1Hnmr': {'signals': [{'delta': '5.45',
     'multiplicities': ['s'],
     'nbH': '2H',
     'assignment': 'Ar-CH2'},
    {'delta': '7.09-7.17',
     'multiplicities': ['m'],
     'nbH': '1H',
     'assignment': 'Ar-H'},
    {'delta': '7.30-7.37',
     'multiplicities': ['m'],
     'nbH': '1H',
     'assignment': 'Ar-H'},
    {'delta': '7.64-7.72',
     'multiplicities': ['m'],
     'nbH': '1H',
     'assignment': 'Ar-H'},
    {'delta': '7.72-7.78',
     'multiplicities': ['m'],
     'nbH': '2H',
     'assignment': 'Ar-H'},
    {'delta': '7.97-8.05',
     'multiplicities': ['m'],
     'nbH': '1H',
     'assignment': 'Ar-H'},
    {'delta': '8.18-8.25',
     'multiplicities': ['m'],
     'nbH': '1H',
     'assignment': 'Ar-H'},
    {'delta': '8.40',
     'multiplicities': ['s'],
     'nbH': '1H',
     'assignment': 'Ar-H'},
    

In [11]:
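# Build one training string per filtered entry.
# Note: this rebinds `data`, replacing the raw records loaded from data.json.
data = []
for log in logs_w_info:
    q = log['source']['prompt']    # prompt text, used as the question
    c = log['chatgpt']['content']  # response stored under the 'chatgpt' key, used as the answer
    string = f"<s>Q: {q}\nA: {c}</s>"
    data.append(string)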

In [12]:
data[0]

'<s>Q: \nWe have a scientific experimental part describing a chemical molecule and we need to parse all the available properties to the following YAML format:\n\n {\n  name?: str\n  procedure?: str\n  yield?: str\n  appearance?: str\n  alphaD?: {value?: num, concentration?: num, solvent?: str}\n  ir?:\n    conditions?: str\n    peaks?: [{wavenum: num, assignment?: str}]\n  bp?: {low?: num, high?: num, units?: str, pressure?: num, pressureUnits?: str, otherInfo?: str}\n  mp?: {low?: num, high?: num, units?: str, otherInfo?: str}\n  1Hnmr?: {\n    solvent?: str\n    frequency?: num\n    signals?: [{delta?: str, multiplicities?: str[], couplings?: num[], nbH?: str, assignment?: str}]\n  },\n  13Cnmr?:\n    solvent?: str\n    frequency?: num\n    signals?: [{delta?: str, otherInfo?: str}]\n  mass?:\n    ionization?: str\n    mode?: str\n    analyser?: str\n    conditions?: str\n    peaks?: [{mass?: num, intensity?: num, otherInfo?: str}]\n  hrms?: {mf?: str, assignment?: str, expected?: nu

Since the dataset is relatively small, we keep most of it for training for now.