# Qwen Fine-Tuning Script
- Fine-tune a Qwen model
- Show how to export the model weights to GGUF format so they can be served through Ollama (useful if we want to host these local models on the StockSensei deployment)

## Part 1: Model Fine-Tuning (Hugging Face, MLX)

In [1]:
# First, install necessary dependencies
%pip install datasets
%pip install transformers
%pip install huggingface_hub
%pip install -U mlx_lm
%pip install pandas

Note: you may need to restart the kernel to use updated packages.
Collecting mlx_lm
  Using cached mlx_lm-0.20.1-py3-none-any.whl.metadata (9.0 kB)
INFO: pip is looking at multiple versions of mlx-lm to determine which version is compatible with other requirements. This could take a while.
  Using cached mlx_lm-0.19.3-py3-none-any.whl.metadata (8.9 kB)
  Using cached mlx_lm-0.19.2-py3-none-any.whl.metadata (8.1 kB)
  Using cached mlx_lm-0.19.1-py3-none-any.whl.metadata (8.1 kB)
  Using cached mlx_lm-0.19.0-py3-none-any.whl.metadata (7.3 kB)
  Using cached mlx_lm-0.18.2-py3-none-any.whl.metadata (7.3 kB)
  Using cached mlx_lm-0.18.1-py3-none-any.whl.metadata (6.9 kB)
  Using cached mlx_lm-0.17.1-py3-none-any.whl.metadata (5.8 kB)
INFO: pip is still looking at multiple versions of mlx-lm to determine which version is compatible with other requirements. This 

Before downloading the model, log in to Hugging Face by running the following command in your terminal:

```bash
huggingface-cli login
# Paste your Hugging Face access token when prompted
```


In [2]:
# Download the model from huggingface
!huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir='../models/'

Fetching 10 files: 100%|██████████████████████| 10/10 [00:00<00:00, 3023.79it/s]
/Users/willferguson/Downloads/GT Fall 2024/CS 6220/cs6220-project/models


Now we need to create the datasets. MLX expects train, validation, and test datasets (`train.jsonl`, `valid.jsonl`, `test.jsonl`), with one JSON object per line in the following form:

```json
{"prompt": "What is your name?", "completion": "Will Ferguson"}
```
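
As a quick sanity check, a minimal sketch like the one below can confirm that every line parses as JSON and carries both keys. The helper name `check_jsonl_lines` and the sample lines are hypothetical and only illustrate the format described above.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def check_jsonl_lines(lines):
    """Return (line_number, problem) pairs for lines that do not match the expected format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems


# Hypothetical sample: one valid line, one with single quotes, one without a completion
sample = [
    '{"prompt": "What is your name?", "completion": "Will Ferguson"}',
    "{'prompt': 'single quotes are not valid JSON'}",
    '{"prompt": "no completion here"}',
]
problems = check_jsonl_lines(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check a real file, the same function can be fed an open file handle, for example `check_jsonl_lines(open('../tmp/train.jsonl'))`.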

In [3]:
import pandas as pd
import json
import os

In [4]:
def ingest_train_data(directory: str):
    # Metadata keys that should not become training examples
    del_keys = [
        'state_location',
        'cik',
        'company',
        'filing_type',
        'filing_date',
        'period_of_report',
        'sic',
        'state_of_inc',
        'fiscal_year_end',
        'filing_html_index',
        'htm_filing_link',
        'complete_text_filing_link',
        'filename'
    ]
    all_data = []
    for file in os.listdir(directory):
        path = os.path.join(directory, file)
        with open(path, 'r') as f:
            data = json.load(f)
        company_name = data['company']
        # Keep (text, section name, company) for every remaining key
        all_data.extend([(data[text], text, company_name) for text in data.keys() if text not in del_keys])
    return all_data


def create_master_dataset(directory: str):
    # Build one prompt/completion row per filing section
    data = ingest_train_data(directory)
    rows = []
    for completion, item, company in data:
        prompt = f"What did {item} in {company}'s SEC 10-K filing say?"
        rows.append({"prompt": prompt, "completion": completion})
    return pd.DataFrame(rows, columns=["prompt", "completion"])

directory = '../data/'
master_set = create_master_dataset(directory=directory)
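
To make the mapping concrete, here is a small sketch of how one filing record turns into prompt/completion rows. The record below is invented for illustration: the section keys `item_1` and `item_7`, the company name, and the helper `filing_to_rows` are assumptions, not taken from the actual data files.

```python
# Hypothetical filing record; real files may use different section keys.
filing = {
    "company": "Example Corp",
    "filing_type": "10-K",
    "item_1": "Example Corp makes widgets.",
    "item_7": "Revenue grew during the year.",
}

# Subset of the metadata keys that are dropped before building prompts
METADATA_KEYS = {"company", "filing_type"}


def filing_to_rows(record, metadata_keys):
    """Turn every non-metadata section of a filing into a prompt/completion row."""
    company = record["company"]
    return [
        {
            "prompt": f"What did {section} in {company}'s SEC 10-K filing say?",
            "completion": text,
        }
        for section, text in record.items()
        if section not in metadata_keys
    ]


rows = filing_to_rows(filing, METADATA_KEYS)
for row in rows:
    print(row)
```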

In [5]:
%pip install scikit-learn
from sklearn.model_selection import train_test_split

Note: you may need to restart the kernel to use updated packages.


In [6]:

def split(df: pd.DataFrame):
    # 80% train; the remaining 20% is split 75/25 into test (15%) and validation (5%)
    train, test = train_test_split(df, test_size=0.2, random_state=143)
    test, val = train_test_split(test, test_size=0.25, random_state=143)
    # Make sure the output directory exists before writing
    os.makedirs('../tmp/', exist_ok=True)
    train.to_json('../tmp/train.jsonl', orient='records', lines=True)
    test.to_json('../tmp/test.jsonl', orient='records', lines=True)
    val.to_json('../tmp/valid.jsonl', orient='records', lines=True)

split(master_set)
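
The two chained `train_test_split` calls give roughly 80% train, 15% test, and 5% validation. The sketch below works that arithmetic out with exact fractions. The helper `split_fractions` is hypothetical, and the actual row counts may differ by a row or so because of rounding inside scikit-learn.

```python
from fractions import Fraction


def split_fractions(first_test_size, second_test_size):
    """Share of the full dataset in each file after two chained splits."""
    holdout = Fraction(first_test_size)            # held out by the first split
    train = 1 - holdout                            # stays in the training set
    valid = holdout * Fraction(second_test_size)   # second split's "test" part becomes validation
    test = holdout - valid                         # remainder of the holdout is the test set
    return {"train": train, "test": test, "valid": valid}


parts = split_fractions("0.2", "0.25")
for name, share in parts.items():
    print(f"{name}: {share} of the rows")
```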

In [2]:

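# Note: `!export` runs in a temporary subshell, so the variable does not persist
# in the kernel or in later `!` commands; `%env SYSTEM_VERSION_COMPAT=0` would persist.
!export SYSTEM_VERSION_COMPAT=0
import platform; print(platform.mac_ver())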

('10.16', ('', '', ''), 'x86_64')
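
The reported version `10.16` is below the 13.5 minimum named in the MLX `ImportError` later in this notebook, and `x86_64` suggests an Intel or Rosetta Python build. A minimal sketch of a pre-flight check follows. The helpers `parse_version` and `environment_problems` are hypothetical, the minimum version is copied from that error message, and the `arm64` requirement is an assumption based on MLX targeting Apple silicon.

```python
import platform

MIN_MACOS = (13, 5)  # taken from the ImportError message in this notebook


def parse_version(version: str) -> tuple:
    """Convert a string such as '13.5.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def environment_problems(mac_version: str, machine: str) -> list:
    """List the reasons this environment is unlikely to run MLX (assumed requirements)."""
    problems = []
    if parse_version(mac_version) < MIN_MACOS:
        problems.append(f"macOS {mac_version or 'unknown'} is older than 13.5")
    if machine != "arm64":
        problems.append(f"machine type is {machine or 'unknown'}, expected arm64")
    return problems


# The values recorded in the output above
print(environment_problems("10.16", "x86_64"))

# The values for the interpreter running this cell
print(environment_problems(platform.mac_ver()[0], platform.machine()))
```

If the second list is not empty, it may be worth checking whether the notebook kernel is a native arm64 Python before attempting the fine-tuning step.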


Now we can fine-tune the model with a single MLX call:

In [8]:
data = '../tmp/'
model = '/Users/willferguson/Downloads/GT Fall 2024/CS 6220/cs6220-project/models'

# Fine-tune the model (IPython expands {model} and {data} inside `!` commands)
!python -m mlx_lm.lora \
  --model "{model}" \
  --data "{data}" \
  --train \
  --batch-size 4 \
  --lora-layers 16 \
  --iters 1000

Traceback (most recent call last):
    raise ImportError(
ImportError: Only macOS 13.5 and newer are supported, not 10.16

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/Users/willferguson/miniconda3/envs/cs_6220/lib/python3.11/site-packages/mlx_lm/__init__.py", line 3, in <module>
    from .utils import convert, generate, load
  File "/Users/willferguson/miniconda3/envs/cs_6220/lib/python3.11/site-packages/mlx_lm/utils.py", line 14, in <module>
    import mlx.core as mx
ImportError: initialization failed
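
Because the model path contains spaces, one option is to build the command as an argument list instead of a shell string. The following is only a sketch: `build_lora_command` is a hypothetical helper, the example paths are placeholders, and the flags are the ones used in the cell above, which may differ in other `mlx_lm` versions.

```python
import shlex


def build_lora_command(model_dir, data_dir, batch_size=4, lora_layers=16, iters=1000):
    """Assemble the mlx_lm.lora invocation used in this notebook as an argument list."""
    return [
        "python", "-m", "mlx_lm.lora",
        "--model", model_dir,
        "--data", data_dir,
        "--train",
        "--batch-size", str(batch_size),
        "--lora-layers", str(lora_layers),
        "--iters", str(iters),
    ]


# Placeholder paths; the first one contains a space on purpose
command = build_lora_command("/path with spaces/models", "../tmp/")

# shlex.join quotes each argument so the printed command is safe to paste into a shell
print(shlex.join(command))

# To launch the run from Python instead:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues entirely, since each argument reaches the program unchanged.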
