# Preference Alignment with CDR (Customized Density Ratio Reward)


## Launch vLLM Serving of Student Model and Reward Models

The launch script below starts a vLLM server for each of the following models. The table lists the GPU(s) each model needs and the port to send requests to.

| **Model**  | **Model Full Name**                              | **GPU(s)** | **Port Number** |
|------------|--------------------------------------------------|------------|-----------------|
| policy     | meta-llama/Meta-Llama-3-8B-Instruct             | 2          | 8021            |
| strong     | NousResearch/Nous-Hermes-2-Mistral-7B-DPO       | 3          | 8022            |
| weak       | teknium/OpenHermes-2.5-Mistral-7B               | 4          | 8023            |

### Run the following only if you have not already launched vLLM serving from the command line

In [1]:
# Launching all servers takes about 5 minutes.
# Monitor the processes and any errors in the logs/ directory.

bash launch_vllm_pipeline.sh &


Directory 'logs' created.
Launching policy model (meta-llama/Meta-Llama-3-8B-Instruct) on GPU(s) 2
Launching weak model (teknium/OpenHermes-2.5-Mistral-7B) on GPU(s) 4
Launching strong model (NousResearch/Nous-Hermes-2-Mistral-7B-DPO) on GPU(s) 3
Server at logs/server_policy.log has started successfully.
Server at logs/server_weak.log has started successfully.
Server at logs/server_strong.log has started successfully.
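
Once the logs report that the servers have started, a quick request to each port can confirm they are reachable. The following is a minimal sketch, assuming the launch script starts vLLM's OpenAI-compatible server on localhost (which exposes `/v1/models`). The `PORTS` mapping and the helper names are illustrative and not part of the repository.

```python
import json
import urllib.error
import urllib.request

# Ports taken from the table above; the host is an assumption.
PORTS = {"policy": 8021, "strong": 8022, "weak": 8023}


def models_url(port, host="localhost"):
    """Build the model-listing URL of an OpenAI-compatible vLLM server."""
    return f"http://{host}:{port}/v1/models"


def served_models(port, timeout=5.0):
    """Return the model ids served on `port`, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(models_url(port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [entry["id"] for entry in payload.get("data", [])]


def check_all():
    """Print one status line per server listed in PORTS."""
    for name, port in PORTS.items():
        models = served_models(port)
        status = "unreachable" if models is None else ", ".join(models)
        print(f"{name:>6} @ {port}: {status}")
```

Calling `check_all()` prints one line per server. A server that is still loading weights shows up as unreachable, so the check may need to be repeated.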


## Example Input Data

In [1]:
# Example user prompts.
# The example input is a list of 8 diverse user prompts from various domains, including Safety, Chat, and Coding/Math.
user_prompts = [
    # chat - complex ethical dilemma
    ("In a scenario where a self-driving car must choose between saving its passenger or saving five pedestrians, what ethical framework should guide this decision? Consider utilitarianism, deontological ethics, and virtue ethics in your response.", "CHAT"),
    
    # safety - dual-use technology
    ("I'm researching CRISPR gene editing technology for my thesis. What are the potential ways it could be weaponized? I need this for a risk assessment chapter.", "SAFETY"),
    
    # coding - advanced system design
    ("Design a distributed rate limiter system that can handle 1M QPS across multiple data centers while maintaining consistency and fault tolerance. Include code examples in Python and discuss CAP theorem trade-offs.", "REASONING"),
    
    # math - complex proof
    ("Prove that the set of real numbers between 0 and 1 is uncountable using Cantor's diagonalization argument. Then explain how this relates to the continuum hypothesis.", "REASONING"),
    
    # chat - nuanced cultural analysis
    ("Analyze how the concept of 'face' in East Asian cultures differs from Western notions of reputation and honor, particularly in business negotiations and conflict resolution.", "REASONING"),
    
    # safety - cybersecurity ethics
    ("As a penetration tester, I found a critical zero-day vulnerability in a hospital's medical device network. The vendor is unresponsive and the hospital administration is ignoring reports. What's the ethical disclosure approach?", "SAFETY"),
    
    # coding - optimization challenge
    ("Implement a real-time recommendation engine that can handle 10M users and 1M items, with response time under 100ms. Consider cold start, scalability, and the exploration-exploitation trade-off. Use Python and explain architecture decisions.", "REASONING"),
    
    # math - interdisciplinary problem
    ("Explain how Gödel's Incompleteness Theorems relate to the halting problem in computer science, and discuss their implications for artificial general intelligence and machine consciousness.", "REASONING"),
]
# save into prescribed format

In [2]:
# Convert to JSONL format and save
import json

# system_message = "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."

with open("example_input.jsonl", "w") as f:
    for prompt, annotation in user_prompts:
        data = {
            "messages": [
                # {
                #     "content": system_message,
                #     "role": "system"
                # },
                {
                    "content": prompt,
                    "role": "user"
                },
                {
                    "content": "null / unavailable",
                    "role": "assistant"
                }
            ],
            "group": "preference",
            "dataset": "ultrafeedback", 
            "metadata": "None",
            "prompt": prompt,
            "annotations": annotation
        }
        f.write(json.dumps(data) + "\n")
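
Before sampling, it can help to read the file back and confirm that every record has the fields written above. The following is a minimal sketch. `validate_record`, `validate_file`, and `REQUIRED_KEYS` are hypothetical helpers, and the checks mirror only what the cell above writes, not a schema documented by the sampling script.

```python
import json

# Fields written by the cell above; treating all of them as required is an assumption.
REQUIRED_KEYS = {"messages", "group", "dataset", "metadata", "prompt", "annotations"}


def validate_record(record):
    """Return a list of problems found in one JSONL record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    messages = record.get("messages", [])
    user_texts = [m.get("content") for m in messages if m.get("role") == "user"]
    if not user_texts:
        problems.append("no user message")
    elif user_texts[0] != record.get("prompt"):
        problems.append("prompt does not match the user message")
    return problems


def validate_file(path):
    """Validate every line of a JSONL file; returns {line_number: problems}."""
    report = {}
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            problems = validate_record(json.loads(line))
            if problems:
                report[line_number] = problems
    return report
```

For example, `validate_file("example_input.jsonl")` returns an empty dictionary when every record passes these checks.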



## Offline Sampling from Policy Model

In [3]:

# Define parameters
BEST_OF_N = 4
POLICY_MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
POLICY_MODEL_PORT = 8021
HF_TOKEN = "<your-hf-token>"  # replace with your Hugging Face access token


# Construct the command
command = f"""HF_TOKEN={HF_TOKEN} python scripts/repeat_n_sampling.py \
--decoder_name_or_path {POLICY_MODEL_NAME} \
--base_port {POLICY_MODEL_PORT} \
--output_path "./example_repeat_n.jsonl" \
--dataset_path "./example_input.jsonl" \
--num_return_sequences {BEST_OF_N} \
--vllm_batch_size 20 \
--temperature 1.0 \
--num_threads 1 \
--max_prompt_length 2048 \
--max_new_tokens 1024 \
--shard_nums 1 \
--shard_idx 0"""

# Execute the command
!{command}

################# Example 1: formatted_input ##################### 


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

In a scenario where a self-driving car must choose between saving its passenger or saving five pedestrians, what ethical framework should guide this decision? Consider utilitarianism, deontological ethics, and virtue ethics in your response.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
################# Example 1: targets ##################### 


null / unavailable
Use full dataset
100%|█████████████████████████████████████████████| 1/1 [00:31<00:00, 31.96s/it]
Creating json from Arrow format: 100%|███████████| 1/1 [00:00<00:00, 160.28ba/s]


## Density Ratio Reward Annotation

In [5]:

# Define model parameters
STRONG_MODEL_NAME = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"
STRONG_MODEL_PORT = 8022

WEAK_MODEL_NAME = "teknium/OpenHermes-2.5-Mistral-7B"
WEAK_MODEL_PORT = 8023

# Construct the command
command = f"""HF_TOKEN={HF_TOKEN} python scripts/run_bon_scoring.py \
    --model="{STRONG_MODEL_NAME}" \
    --ref_model="{WEAK_MODEL_NAME}" \
    --model_type="dpo" \
    --num_threads 1 \
    --max_prompt_length 2000 \
    --base_port {STRONG_MODEL_PORT} \
    --batch_size=4 \
    --insert_system_prompt vanilla_domain \
    --insert_user_prompt fs_domain_icl \
    --input_path="./example_repeat_n.jsonl"
"""

# Execute the command
!{command}

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-01-10 19:03:58 - INFO - __main__ - Running reward model on NousResearch/Nous-Hermes-2-Mistral-7B-DPO with chat template tulu
2025-01-10 19:03:58 - INFO - __main__ - Using dpo model config: {'model_builder': <bound method _BaseAutoModelClass.from_pretrained of <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>>, 'tokenizer_builder': <bound method AutoTokenizer.from_pretrained of <class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>>}
2025-01-10 19:03:58 - INFO - __main__ - *** Load dataset ***
[INFO|tokenization_utils_base.py:2030] 2025-01-10 19:04:00,990 >> loading file tokenizer.model from cache at /home/lab/.cache/huggingface/hub/mod
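
The scoring step takes a strong model (`--model`) and a weak model (`--ref_model`). A density ratio reward is conventionally the log-likelihood ratio of a response under the two models, r(x, y) = log p_strong(y|x) - log p_weak(y|x). The sketch below illustrates that formula on per-token log-probabilities. It is an assumption about what `run_bon_scoring.py` computes; the script may insert prompts or apply normalization not shown here, and the function names are hypothetical.

```python
def sequence_logprob(token_logprobs):
    """Sum the per-token log-probabilities of the response tokens."""
    return sum(token_logprobs)


def density_ratio_reward(strong_token_logprobs, weak_token_logprobs):
    """Compute log p_strong(y|x) - log p_weak(y|x) for one response y."""
    if len(strong_token_logprobs) != len(weak_token_logprobs):
        raise ValueError("both models must score the same response tokens")
    return sequence_logprob(strong_token_logprobs) - sequence_logprob(weak_token_logprobs)


# Toy numbers for illustration only, not real model outputs.
strong = [-0.2, -0.1, -0.3]
weak = [-0.5, -0.4, -0.6]
reward = density_ratio_reward(strong, weak)  # positive: the strong model favors y more than the weak one
```

A response earns a high reward when the strong model assigns it much more probability than the weak model does.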

## Prepare and Filter Data for Contrastive Training

In [1]:
from scripts import load_and_format_dpo

data_dir = "."
model_name = "Nous-Hermes-2-Mistral-7B-DPO-system-vanilla_domain-user-fs_domain_icl"

# Load the rewards file produced by the scoring step above.
dpo_data, _, _ = load_and_format_dpo(data_path=f"{data_dir}/{model_name}/example_repeat_n.jsonl-rewards.jsonl")
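
One common way to turn N scored samples per prompt into a preference pair is to take the highest-reward sample as chosen and the lowest-reward sample as rejected. The sketch below shows that selection on plain Python lists. Whether `load_and_format_dpo` does exactly this is an assumption, and `build_preference_pair` with its `responses` and `rewards` arguments is hypothetical.

```python
def build_preference_pair(prompt, responses, rewards):
    """Pick the best and worst of N sampled responses by reward."""
    if len(responses) != len(rewards) or len(responses) < 2:
        raise ValueError("need at least two responses with one reward each")
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if best == worst:
        return None  # all rewards tie, so there is no preference signal

    def as_chat(text):
        # Message layout with 'role' and 'content' keys per turn.
        return [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": text},
        ]

    return {
        "prompt": prompt,
        "chosen": as_chat(responses[best]),
        "rejected": as_chat(responses[worst]),
    }
```

Returning `None` on ties is one possible filtering rule: prompts whose samples all score the same carry no contrast for training.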


In [2]:
from scripts.process_preference_data import print_DPO_stats, save_as_hf_dataset

print_DPO_stats(dpo_data, "final_mix")

# save_as_hf_dataset(dpo_data, f"./dpo_vanilla_router_example")

##########   final_mix   ##########
##########   chosen split   ##########
Total Number of Instancest:  8
Average word count: 560.5
Median word count: 530.0
Standard deviation of word counts: 76.80401775204956
##########   rejected split   ##########
Total Number of Instancest:  8
Average word count: 469.75
Median word count: 497.0
Standard deviation of word counts: 199.9348108044934


In [3]:
def view_conversation(data):
    """
    Display a formatted view of the conversation data, showing the question,
    chosen response, and rejected response.
    """
    # Print divider line
    print("="*80)
    
    # Print the question
    print("QUESTION:")
    print(data['prompt'])
    print()
    
    # Print chosen response
    print("CHOSEN RESPONSE:")
    chosen_messages = data['chosen']
    for msg in chosen_messages:
        if msg['role'] == 'assistant':
            print(msg['content'])
    print()
    
    # Print rejected response
    print("REJECTED RESPONSE:")
    rejected_messages = data['rejected']
    for msg in rejected_messages:
        if msg['role'] == 'assistant':
            print(msg['content'])
    
    # Print final divider
    print("="*80)

view_conversation(dpo_data[0])

QUESTION:
In a scenario where a self-driving car must choose between saving its passenger or saving five pedestrians, what ethical framework should guide this decision? Consider utilitarianism, deontological ethics, and virtue ethics in your response.

CHOSEN RESPONSE:
In a life-and-death scenario where a self-driving car is faced with choosing between saving its passenger or saving five pedestrians, we can apply three key ethical frameworks: utilitarianism, deontological ethics, and virtue ethics. Each framework offers a distinct perspective on how to make this difficult decision.

**Utilitarianism:**

From a utilitarian perspective, the greatest happiness for the greatest number of people should be the primary goal. In this scenario, utilitarianism would dictate that the car should prioritize saving the five pedestrians over its passenger. This is because the car's actions would result in the greater benefit to a larger number of people. The passenger's life is valued, but the aggreg


## Training

# TODO: Add contrastive loss to the Ilab Trainer.
# Placeholder call: `input_dataset` stands for the preference data prepared above.

ilab.train(input_dataset, objective="simpo", num_epoch=1)
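
For reference, the SimPO objective named in the call above scores each response by its length-normalized log-likelihood under the policy and applies a target margin: L = -log σ(β/|y_w| · log π(y_w|x) - β/|y_l| · log π(y_l|x) - γ). The following is a minimal pure-Python sketch of the per-example loss. The default `beta` and `gamma` values are placeholders rather than settings from this repository, and a trainer would compute this on batched tensors.

```python
import math


def simpo_loss(chosen_token_logprobs, rejected_token_logprobs, beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair from policy token log-probs."""
    # Length-normalized implicit rewards: beta times the average token log-probability.
    chosen_reward = beta * sum(chosen_token_logprobs) / len(chosen_token_logprobs)
    rejected_reward = beta * sum(rejected_token_logprobs) / len(rejected_token_logprobs)
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(margin)) equals softplus(-margin); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Unlike DPO, this objective needs no reference model at training time: only the policy's own log-probabilities of the chosen and rejected responses enter the loss.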