# Download a Fine-Tuned Model
This notebook demonstrates how to download a fine-tuned model that you've created using LLM Engine and upload it to the Hugging Face Hub.

**This notebook is an extension of the previous fine-tuning notebook on ScienceQA.**

# Packages Required
For this demo, we'll use the `scale-llm-engine` package, the `datasets` package for downloading our fine-tuning dataset, `transformers`, and `huggingface_hub` for uploading our model to the Hugging Face Hub.


In [1]:
%pip install scale-llm-engine
%pip install transformers
%pip install datasets


# Data Preparation
Let's load the datasets with the Hugging Face `datasets` library. We also install `smart_open`, which a later cell uses to write CSVs to S3.

In [3]:
%pip install smart_open


In [5]:
from datasets import load_dataset
from smart_open import smart_open
import pandas as pd

physics_dataset = load_dataset('camel-ai/physics')
chemistry_dataset = load_dataset('camel-ai/chemistry')
biology_dataset = load_dataset('camel-ai/biology')
maths_dataset = load_dataset('camel-ai/math')


Downloading readme: 100%|██████████| 2.19k/2.19k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 39.8M/39.8M [00:13<00:00, 2.90MB/s]
Generating train split: 50000 examples [10:19, 80.77 examples/s] 


In [18]:
def format_prompt(r):
    return r['message_1']

def format_label(r):
    return r['message_2']

def convert_dataset(ds):
    prompts = [format_prompt(i) for i in ds]
    labels = [format_label(i) for i in ds]
    df = pd.DataFrame.from_dict({'prompt': prompts, 'response': labels})
    return df

physics_train_data = convert_dataset(physics_dataset['train'])
maths_train_data = convert_dataset(maths_dataset['train'])
chemistry_train_data = convert_dataset(chemistry_dataset['train'])
biology_train_data = convert_dataset(biology_dataset['train'])

In [20]:
df_train = pd.concat([physics_train_data, chemistry_train_data, maths_train_data, biology_train_data])
df_train

| | prompt | response |
|---|---|---|
| 0 | What is the probability of finding a particle ... | To find the probability of finding a particle ... |
| 1 | What is the time-independent Schrödinger equat... | The time-independent Schrödinger equation is a... |
| 2 | Determine the wave function and energy eigenva... | To determine the wave function and energy eige... |
| 3 | What are the possible energy levels and wave f... | To solve this problem, we need to apply the ti... |
| 4 | If a particle is located in a one-dimensional ... | Yes, I can help you find the possible energy l... |
| ... | ... | ... |
| 19995 | What are the current conservation efforts and ... | Current conservation efforts and methods being... |
| 19996 | What are the current strategies being used for... | The conservation and preservation of endangere... |
| 19997 | How can we effectively preserve and conserve e... | To effectively preserve and conserve endangere... |
| 19998 | How can the preservation of endangered fungal ... | The preservation of endangered fungal species ... |
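
Note that the row labels shown above stop below 20,000 even though four datasets were combined. `pd.concat` keeps each frame's own labels by default, so the labels repeat across subjects. The sketch below uses two made-up toy frames (not the real data) to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Toy frames standing in for two of the subject DataFrames.
a = pd.DataFrame({'prompt': ['p1', 'p2'], 'response': ['r1', 'r2']})
b = pd.DataFrame({'prompt': ['p3', 'p4'], 'response': ['r3', 'r4']})

kept = pd.concat([a, b])                      # row labels repeat: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # row labels renumbered: 0, 1, 2, 3

print(list(kept.index))
print(list(fresh.index))
```

Because the CSVs in this notebook are written with `index=False`, the repeated labels do not reach the training files; they only matter for label-based lookups such as `df_train.loc[0]`.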


In [32]:
# Shuffle, then split 80/20 into training and validation sets.
shuffled = df_train.sample(frac=1, random_state=42)
split_at = int(0.8 * len(shuffled))
train, validate = shuffled.iloc[:split_at], shuffled.iloc[split_at:]


In [34]:
train.to_csv('train.csv', index=False)  
validate.to_csv('val.csv', index=False)  
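
Before handing these files to LLM Engine, it can be worth checking that each CSV really has `prompt` and `response` columns and no empty cells. The helper below is a minimal sketch using only the standard library; the name `check_finetune_csv` and the "no empty cells" rule are assumptions made for illustration, not part of LLM Engine.

```python
import csv

REQUIRED_COLUMNS = ('prompt', 'response')

def check_finetune_csv(path):
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f'{path}: missing columns {missing}')
        n_rows = 0
        for row_no, row in enumerate(reader, start=1):
            # Short rows yield None for absent fields, hence the `or ''`.
            if any(not (row[c] or '').strip() for c in REQUIRED_COLUMNS):
                raise ValueError(f'{path}: empty prompt or response in data row {row_no}')
            n_rows += 1
    return n_rows
```

Calling `check_finetune_csv('train.csv')` and `check_finetune_csv('val.csv')` after the cell above returns the row counts, or raises if a file is malformed.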

LLM Engine expects each dataset as a CSV file with `prompt` and `response` columns. The cell below shows the equivalent formatting for the multiple-choice ScienceQA dataset from the previous notebook.

In [None]:
choice_prefixes = [chr(ord('A') + i) for i in range(26)] # A-Z
def format_options(options, choice_prefixes):
    return ' '.join([f'({c}) {o}' for c, o in zip(choice_prefixes, options)])

def format_prompt(r, choice_prefixes):
    options = format_options(r['choices'], choice_prefixes)
    return f'''Context: {r["hint"]}\nQuestion: {r["question"]}\nOptions:{options}\nAnswer:'''

def format_label(r, choice_prefixes):
    return choice_prefixes[r['answer']]

def convert_dataset(ds):
    prompts = [format_prompt(i, choice_prefixes) for i in ds if i['hint'] != '']
    labels = [format_label(i, choice_prefixes) for i in ds if i['hint'] != '']
    df = pd.DataFrame.from_dict({'prompt': prompts, 'response': labels})
    return df

# `dataset` is not defined in this notebook; this cell assumes the ScienceQA
# DatasetDict from the previous notebook has been loaded under that name.
save_to_s3 = False
df_train = convert_dataset(dataset['train'])
if save_to_s3:
    train_url = 's3://...'
    val_url = 's3://...'
    with smart_open(train_url, 'wb') as f:
        df_train.to_csv(f)

    df_val = convert_dataset(dataset['validation'])
    with smart_open(val_url, 'wb') as f:
        df_val.to_csv(f)
else:
    # Gists of the already processed datasets
    train_url = 'https://gist.githubusercontent.com/jihan-yin/43f19a86d35bf22fa3551d2806e478ec/raw/91416c09f09d3fca974f81d1f766dd4cadb29789/scienceqa_train.csv'
    val_url = 'https://gist.githubusercontent.com/jihan-yin/43f19a86d35bf22fa3551d2806e478ec/raw/91416c09f09d3fca974f81d1f766dd4cadb29789/scienceqa_val.csv'

df_train
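
To make the prompt layout produced by `format_prompt` concrete, here is a self-contained sketch that applies the same formatting to a single made-up record. The record contents are invented for illustration; only the field names (`hint`, `question`, `choices`, `answer`) come from the cell above.

```python
import string

CHOICE_PREFIXES = list(string.ascii_uppercase)  # 'A' .. 'Z'

def format_options(options):
    # "(A) first (B) second ..."
    return ' '.join(f'({c}) {o}' for c, o in zip(CHOICE_PREFIXES, options))

def format_prompt(record):
    options = format_options(record['choices'])
    return (f"Context: {record['hint']}\n"
            f"Question: {record['question']}\n"
            f"Options:{options}\n"
            "Answer:")

def format_label(record):
    # `answer` is the zero-based index of the correct choice.
    return CHOICE_PREFIXES[record['answer']]

# Made-up record using the same field names as the cell above.
example = {
    'hint': 'Ice melts when it is heated.',
    'question': 'What happens to ice left in the sun?',
    'choices': ['It freezes', 'It melts'],
    'answer': 1,
}
print(format_prompt(example))
print(format_label(example))  # B
```

The model is trained to continue the text after `Answer:` with a single letter, which is why the label is only the choice prefix.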

# Fine-tune
Now, we can fine-tune the model using LLM Engine.

In [21]:
import os
os.environ['SCALE_API_KEY'] = 'YOUR_SCALE_API_KEY'  # replace with your own key

from llmengine import FineTune

# Use the CSVs written above. Depending on your deployment, LLM Engine may
# need these to be URLs it can reach rather than local paths.
train_url = './train.csv'
val_url = './val.csv'
response = FineTune.create(
    model="llama-2-7b",
    training_file=train_url,
    validation_file=val_url,
    hyperparameters={
        'lr': 2e-4,
    },
    suffix='science-qa-llama'
)
run_id = response.id

We can poll the job status once a minute until the job completes.

In [None]:
import time

while True:
    job_status = FineTune.get(run_id).status
    print(job_status)
    if job_status == 'SUCCESS':
        break
    time.sleep(60)

fine_tuned_model = FineTune.get(run_id).fine_tuned_model
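
The loop above only exits on `SUCCESS`, so a failed job would keep it polling indefinitely. The sketch below also stops on failure or after a timeout. The helper name `wait_for_job`, the failure status names `FAILURE` and `CANCELLED`, and the six-hour default are assumptions for illustration; check the statuses your LLM Engine version actually reports.

```python
import time

# Assumed terminal failure statuses; adjust to what your deployment returns.
TERMINAL_FAILURES = {'FAILURE', 'CANCELLED'}

def wait_for_job(get_status, poll_seconds=60, timeout_seconds=6 * 3600,
                 sleep=time.sleep):
    """Poll get_status() until it returns 'SUCCESS'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print(status)
        if status == 'SUCCESS':
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f'fine-tune job ended with status {status}')
        if time.monotonic() >= deadline:
            raise TimeoutError(f'job not finished after {timeout_seconds} seconds')
        sleep(poll_seconds)
```

In this notebook it would be called as `wait_for_job(lambda: FineTune.get(run_id).status)`. Passing the status lookup as a function keeps the helper independent of the client library.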

# Downloading our Fine-Tuned Model
Let's download the weights for the new fine-tuned model using LLM Engine.

In [None]:
from llmengine import Model

response = Model.download(fine_tuned_model, download_format="hugging_face")
print(response.urls)

`response.urls` is a dictionary mapping file names to URLs for the file(s) that make up our fine-tuned model. We can download these files either synchronously or asynchronously; the cell below does so synchronously, one file at a time.

In [None]:
import os
import requests

def download_files(url_dict, directory):
    """
    Download files from given URLs to specified directory.
    
    Parameters:
    - url_dict: Dictionary of {file_name: url} pairs.
    - directory: Directory to save the files.
    """
    os.makedirs(directory, exist_ok=True)

    for file_name, url in url_dict.items():
        file_path = os.path.join(directory, file_name)
        # Stream the response so large weight files are not held in memory.
        with requests.get(url, stream=True) as response:
            response.raise_for_status()  # Raise an exception for HTTP errors
            with open(file_path, 'wb') as file:
                for chunk in response.iter_content(chunk_size=8192):
                    file.write(chunk)

    

In [None]:
output_directory = "YOUR_MODEL_DIR"
download_files(response.urls, output_directory) 
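
The `download_files` helper above fetches one file at a time. If the model is split across several files, downloading them concurrently may be faster. The sketch below is one possible variant using a thread pool and the standard library's `urllib` instead of `requests`; the function names and the default of four workers are illustrative choices, not part of LLM Engine.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download_one(file_name, url, directory):
    """Stream one URL to directory/file_name and return the saved path."""
    file_path = os.path.join(directory, file_name)
    with urllib.request.urlopen(url) as response, open(file_path, 'wb') as out:
        shutil.copyfileobj(response, out)
    return file_path

def download_files_concurrently(url_dict, directory, max_workers=4):
    """Download every {file_name: url} pair using a small thread pool."""
    os.makedirs(directory, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_one, name, url, directory)
                   for name, url in url_dict.items()]
        # .result() re-raises any exception raised in a worker thread.
        return [f.result() for f in futures]
```

It takes the same arguments as `download_files`, for example `download_files_concurrently(response.urls, output_directory)`. Threads suit this task because the work is dominated by network I/O rather than computation.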

Lastly, we can upload our downloaded model to the Hugging Face Hub.

In [None]:
%pip install huggingface_hub

In [None]:
from huggingface_hub import HfApi

HF_USERNAME = "YOUR_HUGGINGFACE_USERNAME"
HF_TOKEN = "YOUR_HUGGINGFACE_TOKEN"

def upload_to_huggingface(directory, model_name, token):
    """
    Upload files from a directory to the Hugging Face Hub as a model repository.

    Parameters:
    - directory: Directory containing the files to be uploaded.
    - model_name: Name of the new model.
    - token: Your Hugging Face authentication token.
    """
    repo_id = f"{HF_USERNAME}/{model_name}"
    api = HfApi()

    # Create the repository if it does not exist yet.
    # HfApi is used here in place of the deprecated git-based Repository class.
    api.create_repo(repo_id, token=token, exist_ok=True)

    # Upload every file in the directory as a single commit.
    api.upload_folder(folder_path=directory, repo_id=repo_id, token=token)

model_name = "my-new-model"

upload_to_huggingface(output_directory, model_name, HF_TOKEN)