From https://wiki.corp.ebay.com/display/COREAI/Athena+AI+Conference+2024

# Prerequisites

- Use the Docker image `hub.tess.io/gen-ai/ellement:latest` for your workspace.
- Enable Hadoop access to apollo-rno, using your team's batch account.
- If you have pip-installed packages in your workspace or use a conda environment, you might need to deactivate them.
  - To back up your locally installed packages: `mv ~/.local ~/.local.bkp`

# Creating & Registering the training dataset

In [27]:
import pandas as pd
import pyarrow, json
import pyarrow.parquet as pq

In [28]:
# The file containing the training data. It is in JSON format; each row has the keys "input" and "output".
# Example row: {"input": "I am happy.", "output": "Happy, I am."}
json_file = "dataset.json"

# Please change to your directory
# The HDFS folder in which we will store the training dataset for the Yoda project
output_parquet_folder = "/user/ppetrushkov/athena-aiconf2024/yoda/"
# The HDFS file name 
output_parquet_file = f"{output_parquet_folder}train.hf.parquet"
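
A note on the path above: the file name is appended to `output_parquet_folder` by plain string formatting, so the result is only correct when the folder ends with `/`. Below is a minimal sketch of a more forgiving join that uses only the standard library. The folder name and the `hdfs_join` helper are hypothetical placeholders for illustration, not part of the workshop setup.

```python
import posixpath

# Hypothetical folder, written once with and once without a trailing slash.
folder_with_slash = "/user/my_batch_account/athena-aiconf2024/yoda/"
folder_without_slash = "/user/my_batch_account/athena-aiconf2024/yoda"


def hdfs_join(folder, filename):
    """Join a folder and a relative file name, adding "/" only if the folder lacks one."""
    # posixpath always uses "/" regardless of the local operating system.
    # Caveat: a filename that starts with "/" would replace the folder entirely.
    return posixpath.join(folder, filename)


path_a = hdfs_join(folder_with_slash, "train.hf.parquet")
path_b = hdfs_join(folder_without_slash, "train.hf.parquet")
print(path_a)
print(path_a == path_b)
```

Either spelling of the folder then yields the same HDFS path.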

In [29]:
# Open the dataset file and load as Python object
with open(json_file) as f:
    js = json.load(f)

In [None]:
# Transform the dataset into a pandas DataFrame
df_train = pd.DataFrame(js)
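
The real `dataset.json` is not included in this notebook. Here is a minimal sketch of a file the loading cells would accept, assuming the file holds a list of rows with string `input` and `output` fields. The sample rows and the `check_rows` helper are hypothetical, added only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows in the assumed layout: a list of {"input", "output"} objects.
sample_rows = [
    {"input": "I am happy.", "output": "Happy, I am."},
    {"input": "You must learn patience.", "output": "Patience, you must learn."},
]


def check_rows(rows):
    """Raise ValueError unless every row is a dict with string "input" and "output"."""
    for i, row in enumerate(rows):
        for key in ("input", "output"):
            if not isinstance(row, dict) or not isinstance(row.get(key), str):
                raise ValueError(f"row {i} is missing a string {key!r} field")
    return rows


# Write the sample to a temporary file, then read it back the same way the notebook does.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "dataset.json"
    sample_path.write_text(json.dumps(sample_rows, indent=2), encoding="utf-8")
    with open(sample_path, encoding="utf-8") as f:
        loaded = check_rows(json.load(f))

print(len(loaded), "rows loaded")
```

Running a check like this before building the DataFrame surfaces malformed rows early, rather than at training time.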

### Formatting the dataset to match training requirements

In [30]:
# Our dataset has two columns: input and output.
# Convert each row into messages in the OpenAI chat format:
# one "user" turn with the instruction and context, and
# one "assistant" turn with the expected response.

df_messages = pd.DataFrame({"messages": [[ {"role": "user", "content": f"{row.input}"}, 
                                           {"role": "assistant", "content": row.output}
                                         ] 
                                         for row in df_train.itertuples()]
                           })

In [43]:
df_messages

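
The per-row conversion can also be sketched without pandas, which makes the target shape easier to inspect. This is a minimal illustration that assumes the two-turn layout described in this section (one `user` turn, then one `assistant` turn). The helper names `to_chat_example` and `is_valid_example` are hypothetical and not part of any SDK.

```python
def to_chat_example(input_text, output_text):
    """Build one training example: a user turn followed by an assistant turn."""
    return {
        "messages": [
            {"role": "user", "content": str(input_text)},
            {"role": "assistant", "content": str(output_text)},
        ]
    }


def is_valid_example(example):
    """Check the two-turn shape used in this notebook (user, then assistant)."""
    messages = example.get("messages", [])
    roles = [m.get("role") for m in messages]
    # Every turn needs non-empty string content.
    has_text = all(isinstance(m.get("content"), str) and bool(m["content"]) for m in messages)
    return roles == ["user", "assistant"] and has_text


# Hypothetical sample row in the input/output layout of the training file.
rows = [{"input": "I am happy.", "output": "Happy, I am."}]
examples = [to_chat_example(r["input"], r["output"]) for r in rows]
print(examples[0])
```

Each element of `examples` corresponds to one row of the `messages` column built above.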

### Creating the Parquet file and uploading it to HDFS

In [7]:
# Use pyarrow to convert the pandas DataFrame into an Arrow table, ready to be written as Parquet
pdf = pyarrow.Table.from_pandas(df_messages)

In [31]:
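# Connect to HDFS (with batch account set)
# pyarrow.hdfs.connect() was removed from recent pyarrow releases; pyarrow.fs.HadoopFileSystem replaces it.
# The host "default" picks up fs.defaultFS from the Hadoop configuration (core-site.xml).
from pyarrow import fs as pafs
fs = pafs.HadoopFileSystem("default")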

In [9]:
# Write the file into the HDFS folder
pq.write_table(pdf, output_parquet_file, filesystem=fs)

2024-06-03 18:22:08,229 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.


In [10]:
!hadoop fs -ls {output_parquet_file}

-rw-r--r--   3 ppetrushkov hdmi-technology      16175 2024-06-03 18:23 /user/ppetrushkov/athena-aiconf2024/yoda/train.hf.parquet


### Registering the dataset with AIHub

#### Option 1: Go to [AIHub Datasets](https://aip.vip.ebay.com/data/data-set?projectName=athena-aiconf2024) to register the dataset

#### Option 2: Use the Python API

In [41]:
from pykrylov.batch.dataset import create_dataset, HdfsSpecific
from pykrylov.batch.consts import DatasetFormat, DatasetSource

# The project on AIHub. In our case "athena-aiconf2024" is a shared project where all Workshop participants are members.
aip_project = "athena-aiconf2024"

# Setting up the name with which the dataset will be registered on AIHub
dataset_name = "athena-yoda-v1.0"

# Register dataset
hdfs_dataset = create_dataset(
    name=dataset_name,
    owner_domain="CoreAI",
    source=DatasetSource.HDFS,
    data_format=DatasetFormat.PARQUET,
    project=aip_project,
    is_public=False,
    hdfs_specific=HdfsSpecific(paths=[output_parquet_file]),
    digest=None,
    labeled_fields=["messages"],
    description="Yoda style finetuning using Athena"
)

2025-03-09 18:49:14,208 - pykrylov.batch.dataset [MainThread  ] [ERROR]  Dataset creation failed. Error 403 {'detail': 'User is not allowed to update on given resource', 'instance': 'class com.ebay.taichi.trainingmanagement.service.DatasetService', 'status': 403, 'title': 'ForBidden', 'type': 'ClientError'}


In [125]:
hdfs_dataset

Dataset(
    id: "b26df8c6-206b-49b7-9164-f8a80e95484b",
    name: "athena-yoda-style-output-v3.0",
    owner_domain: "CoreAI",
    source: DatasetSource.HDFS,
    format: DatasetFormat.PARQUET,
    storage_managed: False,
    project: "athena-aiconf2024",
    is_public: False,
    hdfs_specific: HdfsSpecific(
        paths: ['/user/b_pynlp/athena-aiconf2024/yoda/train.hf.parquet'],
    ),
    labeled_fields: ['messages'],
    description: "Yoda style finetuning using Athena",
    created_by: "ayouroukov",
    created_time: 1717150376254,
    updated_by: "ayouroukov",
    updated_time: 1717150376254,
)

# We're done! The dataset has been successfully registered and we're ready to begin fine-tuning our base LLM

----------------

# Next step: Fine-tuning the base model on our Yoda dataset using Athena
### Please go to [AIHub Athena](https://aip.vip.ebay.com/data/athena?projectName=athena-aiconf2024) to begin the process.

-----------

Has training finished successfully?
## Congratulations! You just fine-tuned an LLM!
### Let's use the Chomsky SDK to interact with our newly trained LLM

In [3]:
# The AIHub project in which the adapter was trained. Please change to your project.
aip_project = "ppetrushkov-project"

# Enter the adapter name you used for training
adapter_name = "athena-yoda"

# Enter the adapter version you used for training
adapter_version = "1"

# Base model used for training. Please see the adapter detail page for the exact name of the model
base_model = "ebay-internal-chat-completions-athena-lilium2-7b-chat"

# Let's construct the adapter's full name/path. Its structure is <aip_project>/<adapter_name>/<adapter_version>
adapter = f"{aip_project}/{adapter_name}/{adapter_version}"
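
Since the adapter path is assembled from three strings, a small helper can catch empty or malformed parts before they reach the chat wrapper. This is a minimal sketch: `build_adapter_path` is a hypothetical helper, and the rule that no part may contain `/` is an assumption inferred from the `<aip_project>/<adapter_name>/<adapter_version>` structure, not a documented constraint.

```python
def build_adapter_path(project, name, version):
    """Return "<project>/<name>/<version>", rejecting empty parts or parts containing "/"."""
    parts = [str(project), str(name), str(version)]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid adapter path component: {part!r}")
    return "/".join(parts)


# Hypothetical project name; the version may be passed as an int or a string.
print(build_adapter_path("my-project", "athena-yoda", 1))
```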

In [4]:
from pychomsky.chchat import EbayLLMChatWrapper
from langchain.schema.messages import HumanMessage
chat = EbayLLMChatWrapper(
    model_name = base_model,
    model_adapter = adapter,
    max_tokens = 256,
    temperature = 0.2,
    top_p = 0.98,
    presence_penalty = 0.0,
    frequency_penalty = 0.0
)

In [5]:
print(chat([HumanMessage(content='What is athena?')]))

content=' Athena is what? ' response_metadata={'model_name': 'ebay-internal-chat-completions-athena-lilium2-7b-chat', 'token_usage': {'completion_tokens': 7, 'prompt_tokens': 15, 'total_tokens': 22}} id='run-61a39b0d-5032-45e2-bac3-f095bd6142bd-0'
