**DATA CLEANING**

In [70]:
import pandas as pd

df = pd.read_csv('missionData_test.csv', encoding='latin-1')
df = df.replace('km', 'nm', regex=True)
df.to_csv('missionData_test.csv', index=False)

print(df.head())

                                            positive  \
0  ##Training Scenario: Operation Steel Rain\n**C...   
1  ##Training Scenario: Operation Silent Thunder\...   
2  ##Training Scenario: Operation Steel Tempest\n...   
3  ##Training Scenario: Operation Silent Thunder\...   
4  ##Training Scenario: Operation Emerald Dawn\n*...   

                                            negative  
0  ##Training Scenario: Operation Steel Rain\n**C...  
1  ##Training Scenario: Operation Silent Thunder\...  
2  ##Training Scenario: Operation Steel Tempest\n...  
3  ##Training Scenario: Operation Silent Thunder\...  
4  ##Training Scenario: Operation Emerald Dawn\n*...  
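
The `replace('km', 'nm', regex=True)` call above rewrites every occurrence of the letters "km", including inside longer words, and it changes the unit label without converting the number. A minimal sketch of a stricter alternative follows, assuming distances are written as a number followed by `km`. The helper names `relabel_km` and `convert_km` are hypothetical and not part of the original notebook.

```python
import re

KM_PER_NM = 1.852  # one nautical mile is defined as exactly 1.852 km

# A number (optionally with decimals) followed by the standalone unit "km".
_KM_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*km\b")


def relabel_km(text: str) -> str:
    """Swap only the unit label, leaving the number untouched."""
    return _KM_PATTERN.sub(r"\1 nm", text)


def convert_km(text: str) -> str:
    """Convert the value too, rounding to the nearest nautical mile."""
    def _to_nm(match: re.Match) -> str:
        return f"{round(float(match.group(1)) / KM_PER_NM)} nm"

    return _KM_PATTERN.sub(_to_nm, text)
```

Whether relabelling or converting is right depends on how the scenarios were generated; the notebook itself only relabels.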


In [71]:
data_list = df.values.tolist()
new_data_list = [] 

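# Insert a space after the leading '##' of each entry, then normalise the markdown markers
for i in range(15):
    new_row = []
    for entry in data_list[i]:  
        new_entry = entry[0:2] + ' ' + entry[2:]
        # NOTE: the character searched for in the last replace() was lost in the export;
        # '\x92' (a stray cp1252 apostrophe) is an assumption based on the output below.
        new_entry = new_entry.replace('**\n', '**').replace('**', '\n**').replace(':\n**', ':**').replace('*', '-').replace('--', '**').replace('\x92', "'")
        # Collapse doubled blank lines before section headers, skipping the first 100 characters
        new_entry = new_entry[:100] + new_entry[100:].replace('\n\n**', '\n**')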
        new_row.append(new_entry)
    new_data_list.append(new_row)

new_column_names = list(df.columns)
new_df = pd.DataFrame(new_data_list, columns=new_column_names)

print(new_df.head())

                                            positive  \
0  ## Training Scenario: Operation Steel Rain\n\n...   
1  ## Training Scenario: Operation Silent Thunder...   
2  ## Training Scenario: Operation Steel Tempest\...   
3  ## Training Scenario: Operation Silent Thunder...   
4  ## Training Scenario: Operation Emerald Dawn\n...   

                                            negative  
0  ## Training Scenario: Operation Steel Rain\n\n...  
1  ## Training Scenario: Operation Silent Thunder...  
2  ## Training Scenario: Operation Steel Tempest\...  
3  ## Training Scenario: Operation Silent Thunder...  
4  ## Training Scenario: Operation Emerald Dawn\n...  
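
The first step of the loop above turns `##Training` into `## Training`, but only for the first 15 rows and without checking what the entry starts with. A small sketch of the same step applied to every row, assuming each cell is a string; `add_heading_space` and `clean_rows` are hypothetical names.

```python
def add_heading_space(text: str) -> str:
    """Insert a space after a leading '##' unless one is already there."""
    if text.startswith("##") and not text.startswith("## "):
        return "## " + text[2:]
    return text


def clean_rows(rows):
    """Apply the heading fix to every cell of every row, whatever the row count."""
    return [[add_heading_space(cell) for cell in row] for row in rows]
```

Because the function is idempotent, re-running the cell does not add a second space.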


In [72]:
print(new_df.iloc[0]['positive'])

## Training Scenario: Operation Steel Rain

**CommanderÂ's Intent:**
- Neutralize the enemy air defense network.
- Destroy the primary radar installations.
- Achieve air superiority with minimal losses.
- 100% precision on primary targets.

**Force Composition:**
- REDLAND Air Force: six SA-10 Grumble surface-to-air missile launchers, four MiG-29 fighter jets.
- REDLAND Ground Forces: 800 troops equipped with small arms, anti-aircraft machine guns, and two T-72 tanks.

**Geography:**
- Distance to REDLAND threats: SA-10 sites: 12 nm west and 18 nm north of the target, Airbase: 25 nm southeast of the target.
- Environmental Conditions: Dawn mission with fog reducing visibility to 3 nm. Target is located in a mountainous region with dense forest cover.

**Surrounding Conditions:**
- REDLAND's Air Force has increased surveillance patrols due to recent BLUELAND incursions.
- REDLAND's ground forces have fortified positions around the radar installations but are spread thin due to ongoing o

In [73]:
from trl import KTOConfig, KTOTrainer, ModelConfig, get_peft_config, setup_chat_format
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-7b-it",
    device_map="cuda",
    torch_dtype=torch.bfloat16
)

prompt_text = """
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. This should be specific, i.e exact model of aircraft or missiles (eg: F-15 fighter jets, SA-7 surface-to-air missile launchers, ZSU-23-4 Shilka self-propelled anti-aircraft gun systems, etc). 
3.	Geography: Distance to target and distance to REDLAND threats in nautical miles (nm). Specification of environmental conditions which are of interest (eg: day/night/dawn, buildings, vegetation, ditches in surroundings, etc) 
4.	Surrounding Conditions: Overview of all scenario-related information of interest, i.e surrounding military situation, REDLAND readiness state, etc.
5.	Target(s) Details:  Any relevant information about target(s), i.e type and number of targets. 
6.	Rules of Engagement: Directives that outline when and how BLUELAND can engage hostile forces.
"""
prompt_dict = [{"role": "user", "content": prompt_text}]

formatted_df = pd.DataFrame(columns=['prompt', 'completion', 'label'])
prompt_lst = []
completion_lst = [] 
label_lst = [] 

for row in new_df['positive']:
    prompt = tokenizer.apply_chat_template(prompt_dict, tokenize=False, add_generation_prompt=True)
    completion = prompt + row + "<eos>"
    prompt_lst.append(prompt)
    completion_lst.append(completion)
    label_lst.append(True) 

for row in new_df['negative']:
    prompt = tokenizer.apply_chat_template(prompt_dict, tokenize=False, add_generation_prompt=True)
    completion = prompt + row + "<eos>"
    prompt_lst.append(prompt)
    completion_lst.append(completion)
    label_lst.append(False) 

formatted_df['prompt'] = prompt_lst
formatted_df['completion'] = completion_lst
formatted_df['label'] = label_lst

print(formatted_df.head())


                                              prompt  \
0  <bos><start_of_turn>user\nBLUELAND is a make-b...   
1  <bos><start_of_turn>user\nBLUELAND is a make-b...   
2  <bos><start_of_turn>user\nBLUELAND is a make-b...   
3  <bos><start_of_turn>user\nBLUELAND is a make-b...   
4  <bos><start_of_turn>user\nBLUELAND is a make-b...   

                                          completion  label  
0  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
1  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
2  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
3  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
4  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
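
The two loops above differ only in the label they attach. A compact sketch of the same record layout using plain lists, assuming (as in the cell above) that each completion is the prompt followed by the response and the `<eos>` marker; `build_records` is a hypothetical helper.

```python
def build_records(prompt, positives, negatives, eos="<eos>"):
    """Return one dict per response with the prompt, full completion, and label."""
    labelled = [(text, True) for text in positives] + [(text, False) for text in negatives]
    return [
        {"prompt": prompt, "completion": prompt + text + eos, "label": label}
        for text, label in labelled
    ]
```

The resulting list can be passed straight to `pd.DataFrame(...)`, which keeps the positives-then-negatives row order that the later cells rely on.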


In [74]:
print(formatted_df['prompt'][0])

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. 
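
The printed prompt shows what `tokenizer.apply_chat_template` produced for a single user turn. A hand-written approximation is sketched below for illustration only. The leading `<bos><start_of_turn>user` matches the output above, while the closing `<end_of_turn>` and `<start_of_turn>model` markers are not visible in the truncated output and are an assumption based on Gemma's published chat format. `format_single_turn` is a hypothetical helper, not a library function.

```python
def format_single_turn(user_text: str) -> str:
    """Approximate the Gemma chat layout for one user message (assumed format)."""
    return (
        "<bos><start_of_turn>user\n"
        + user_text.strip()  # the printed prompt has no blank line after 'user'
        + "<end_of_turn>\n<start_of_turn>model\n"
    )
```

In practice the tokenizer's own template remains the source of truth; the sketch only makes the structure of the string easier to see.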

In [75]:
print(formatted_df['completion'][0])

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. 

In [76]:
formatted_df = formatted_df.replace('km', 'nm', regex=True)

In [77]:
formatted_df.to_csv('missionData_test_llm.csv', index=False)

**MODEL INFERENCE**

In [78]:
df = pd.read_csv('missionData_test_llm.csv')
print(df.head())

                                              prompt  \
0  <bos><start_of_turn>user\nBLUELAND is a make-b...   
1  <bos><start_of_turn>user\nBLUELAND is a make-b...   
2  <bos><start_of_turn>user\nBLUELAND is a make-b...   
3  <bos><start_of_turn>user\nBLUELAND is a make-b...   
4  <bos><start_of_turn>user\nBLUELAND is a make-b...   

                                          completion  label  
0  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
1  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
2  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
3  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  
4  <bos><start_of_turn>user\nBLUELAND is a make-b...   True  


In [79]:
completion_lst = df['completion'].tolist()

In [80]:
new_df = pd.DataFrame(columns=['chosen_response', 'rejected_response'])
chosen_lst = []
rejected_lst = [] 

# The CSV stores all positive rows first, then all negative rows, so row i pairs with row i + half
half = len(completion_lst) // 2
for i in range(half): 
    chosen_lst.append(completion_lst[i])
    rejected_lst.append(completion_lst[i+half])

new_df['chosen_response'] = chosen_lst
new_df['rejected_response'] = rejected_lst

In [81]:
print(new_df.head())

                                     chosen_response  \
0  <bos><start_of_turn>user\nBLUELAND is a make-b...   
1  <bos><start_of_turn>user\nBLUELAND is a make-b...   
2  <bos><start_of_turn>user\nBLUELAND is a make-b...   
3  <bos><start_of_turn>user\nBLUELAND is a make-b...   
4  <bos><start_of_turn>user\nBLUELAND is a make-b...   

                                   rejected_response  
0  <bos><start_of_turn>user\nBLUELAND is a make-b...  
1  <bos><start_of_turn>user\nBLUELAND is a make-b...  
2  <bos><start_of_turn>user\nBLUELAND is a make-b...  
3  <bos><start_of_turn>user\nBLUELAND is a make-b...  
4  <bos><start_of_turn>user\nBLUELAND is a make-b...  


In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

reward_tokenizer = AutoTokenizer.from_pretrained("./rewardmodel1")

reward_model = AutoModelForSequenceClassification.from_pretrained("./rewardmodel1")

def get_score(model, tokenizer, response):
    """Return the reward model's scalar score for one response string."""
    # Tokenize the input; anything past 512 tokens is truncated.
    # Each completion begins with the full prompt, so long responses may be only partly scored.
    inputs = tokenizer.encode_plus(response, truncation=True, padding="max_length", max_length=512, return_tensors="pt")

    # Forward pass without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # The single output logit serves as the reward
    logits = outputs.logits

    return logits.item()

In [83]:
# Count the pairs in which the chosen response outscores the rejected one
acc_score = 0 
for index, row in new_df.iterrows():
    pos_score = get_score(reward_model, reward_tokenizer, row['chosen_response'])
    neg_score = get_score(reward_model, reward_tokenizer, row['rejected_response'])
    if pos_score > neg_score: 
        acc_score += 1 

print("Test Accuracy: " + str(acc_score/len(new_df)))

Test Accuracy: 1.0
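
The accuracy above is a pairwise ranking accuracy: the fraction of pairs in which the chosen response scores strictly higher than the rejected one. A sketch of that metric as a standalone function operating on precomputed scores; `pairwise_accuracy` is a hypothetical name and the scores in the usage note are made up.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) pairs where chosen scores strictly higher."""
    if not score_pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return wins / len(score_pairs)
```

For example, `pairwise_accuracy([(3.4, -3.5), (1.0, 2.0)])` counts one win out of two pairs. Ties count as misses, matching the strict `>` in the cell above.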


In [84]:
for index, row in new_df.iterrows():
    pos_score = get_score(reward_model, reward_tokenizer, row['chosen_response'])
    neg_score = get_score(reward_model, reward_tokenizer, row['rejected_response'])
    print(pos_score)
    print(neg_score)

3.474468946456909
-3.5136191844940186
2.1519851684570312
-3.3686885833740234
3.777466058731079
-3.212291717529297
3.503709077835083
-3.234174966812134
3.280758857727051
-3.1601288318634033
2.0340657234191895
-3.095118522644043
3.1709370613098145
-3.381279468536377
0.7367986440658569
-3.483863353729248
3.4497804641723633
-3.5544815063476562
3.697136402130127
-3.5514039993286133
3.03928279876709
-3.557669162750244
0.8609474897384644
-3.4640491008758545
1.2381253242492676
-3.514378070831299
0.8609474897384644
-3.488250732421875
1.0660895109176636
-3.4851527214050293


Even though I included responses from the Llama 7B model, the reward model still picked the upvoted response with 100% accuracy. 
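
One way to check how robust this result is: compare the lowest upvoted score with the highest downvoted score. The sketch below uses the scores printed above, rounded to three decimals; `worst_case_margin` is a hypothetical helper.

```python
# Scores copied from the output above, rounded to three decimals.
CHOSEN_SCORES = [3.474, 2.152, 3.777, 3.504, 3.281, 2.034, 3.171, 0.737,
                 3.450, 3.697, 3.039, 0.861, 1.238, 0.861, 1.066]
REJECTED_SCORES = [-3.514, -3.369, -3.212, -3.234, -3.160, -3.095, -3.381, -3.484,
                   -3.554, -3.551, -3.558, -3.464, -3.514, -3.488, -3.485]


def worst_case_margin(chosen, rejected):
    """Gap between the weakest chosen score and the strongest rejected score."""
    return min(chosen) - max(rejected)


margin = worst_case_margin(CHOSEN_SCORES, REJECTED_SCORES)
```

A positive margin means every upvoted response outranks every downvoted one, so the upvoted response wins no matter which downvoted responses it is grouped with.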

Finding optimal N: 
1. As N increases, can the reward model still find the upvoted response (1 upvoted, N-1 downvoted)? From the scores above, every downvoted response scores lower than every upvoted response, so accuracy stays at 100% as N increases from 2 to 15. 
2. Multiple runs with the actual model: as N increases, does the overall score improve, and at what point does it plateau? (Plot cost vs. performance.)
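
The second question can be sketched offline once a pool of reward scores exists. The code below computes the exact expected best-of-N reward by averaging the maximum over every size-N subset of the pool, and pairs it with a cost that grows linearly in N. The names `expected_best_of_n` and `sweep`, and the per-generation time passed in, are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean


def expected_best_of_n(scores, n):
    """Average of the maximum reward over every size-n subset of `scores`."""
    if not 1 <= n <= len(scores):
        raise ValueError("n must be between 1 and len(scores)")
    return mean(max(subset) for subset in combinations(scores, n))


def sweep(scores, minutes_per_generation):
    """Return (n, expected best reward, cost in minutes) for each n."""
    return [
        (n, expected_best_of_n(scores, n), n * minutes_per_generation)
        for n in range(1, len(scores) + 1)
    ]
```

Plotting the second column against the third gives the cost vs. performance curve; the plateau is where adding another sample no longer raises the expected maximum by a meaningful amount.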

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-7b-it",
    device_map="cuda",
    torch_dtype=torch.bfloat16
)

input_text = """
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. This should be specific, i.e exact model of aircraft or missiles (eg: F-15 fighter jets, SA-7 surface-to-air missile launchers, ZSU-23-4 Shilka self-propelled anti-aircraft gun systems, etc). 
3.	Geography: Distance to target and distance to REDLAND threats  in nautical miles (nm). Specification of environmental conditions which are of interest (eg: day/night/dawn, buildings, vegetation, ditches in surroundings, etc) 
4.	Surrounding Conditions: Overview of all scenario-related information of interest, i.e surrounding military situation, REDLAND readiness state, etc.
5.	Target(s) Details:  Any relevant information about target(s), i.e type and number of targets. 
6.	Rules of Engagement: Directives that outline when and how BLUELAND can engage hostile forces.
"""
# Build the single-turn chat (avoid naming a variable `dict`, which shadows the built-in)
chat = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)


In [3]:
# Vary N from 1 to 5.
# Generate 15 responses with the actual model, i.e. 3 sets of 5, so every N value gets at least 3 trials.
# For each set of 5, pick N responses at random and take the maximum reward.
# Compare the average scores across the 3 sets; cost should grow linearly with N when generation runs sequentially (no multitasking).

import time

# Open the file in append mode ('a')
with open('llm_test.txt', 'a') as file:
    
    for i in range(15): 
        start_time = time.time()
        
        inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
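        # max_length=500 has no effect here: when both are set, max_new_tokens takes precedence (see the warning in the output)
        outputs = model.generate(input_ids=inputs.to("cuda"), max_length=500, max_new_tokens=8000)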
        output_text = tokenizer.decode(outputs[0])
        
        end_time = time.time()
        # Calculate the execution time
        execution_time = (end_time - start_time)/60
        print("Execution time:", execution_time, "minutes\n")

        print(output_text)
        # Append additional text to the file
        file.write(output_text)
        file.write('\n\n\n\n\n')

Both `max_new_tokens` (=8000) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Execution time: 11.271374301115673 minutes

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communi



Execution time: 11.155831968784332 minutes



Execution time: 11.16315327088038 minutes



Execution time: 11.184504735469819 minutes



Execution time: 11.143348455429077 minutes



Execution time: 11.36975249449412 minutes



Execution time: 11.201283371448516 minutes



Execution time: 11.07943328221639 minutes



Execution time: 11.178550736109417 minutes



Execution time: 10.984466644128164 minutes



Execution time: 10.971317088603973 minutes



Execution time: 10.956599529584249 minutes



Execution time: 10.963533810774486 minutes

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communi

Both `max_new_tokens` (=8000) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Execution time: 10.953450282414755 minutes

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communi

In [7]:
import pandas as pd

df = pd.read_csv('llmData_test.csv')
print(df.head())

                                              output
0  <bos><start_of_turn>user\nBLUELAND is a make-b...
1  <bos><start_of_turn>user\nBLUELAND is a make-b...
2  <bos><start_of_turn>user\nBLUELAND is a make-b...
3  <bos><start_of_turn>user\nBLUELAND is a make-b...
4  <bos><start_of_turn>user\nBLUELAND is a make-b...


In [8]:
lst = df['output'].tolist()

In [10]:
import random 

for n in range(1, 6): 
    score3 = 0 
    for i in range(0, 3): 
        score = 0 
        for j in range(n): 
            indx = random.randint(0, 4) 
            score += get_score(reward_model, reward_tokenizer, lst[i*5 + indx])
        score3 += score 
    print(score3/(3*n))

3.6959242820739746
3.6959242820739746
3.6959242820739746
3.6959242820739746
3.6959242820739746


All five averages are identical, so the model's output shows little variation between reruns, and increasing N would not help significantly.
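One way to confirm this observation is to look at the spread of the individual scores rather than only their average. The following is a minimal sketch; `summarize_scores` is a hypothetical helper, and the example list simply repeats the value printed above as a stand-in for the 15 per-output scores, which the notebook does not record individually.

```python
import statistics

def summarize_scores(scores):
    """Return simple spread statistics for a list of reward scores."""
    return {
        "n": len(scores),
        "distinct": len(set(scores)),       # number of different score values
        "mean": statistics.fmean(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
    }

# Stand-in data (assumption): 15 identical scores, as the averages suggest.
example_scores = [3.6959242820739746] * 15
summary = summarize_scores(example_scores)
print(summary)
```

In the notebook, the list would be built with `get_score` over every entry of `lst`. A `distinct` count of 1 would mean that sampling more outputs cannot change the average.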

In [12]:
# Tokenize the text
tokens = tokenizer.encode(lst[0])

# Count the number of tokens
num_tokens = len(tokens)
print(num_tokens)

737


In [13]:
print(lst[0])

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. 

**Testing out 4-bit quantised model**

In [1]:
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import time 

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-7b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-7b-it",
    quantization_config=quantization_config
)

input_text = """
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. This should be specific, i.e exact model of aircraft or missiles (eg: F-15 fighter jets, SA-7 surface-to-air missile launchers, ZSU-23-4 Shilka self-propelled anti-aircraft gun systems, etc). 
3.	Geography: Distance to target and distance to REDLAND threats  in nautical miles (nm). Specification of environmental conditions which are of interest (eg: day/night/dawn, buildings, vegetation, ditches in surroundings, etc) 
4.	Surrounding Conditions: Overview of all scenario-related information of interest, i.e surrounding military situation, REDLAND readiness state, etc.
5.	Target(s) Details:  Any relevant information about target(s), i.e type and number of targets. 
6.	Rules of Engagement: Directives that outline when and how BLUELAND can engage hostile forces.
"""
# Build a single-turn chat; the variable is named `message` so that it does not shadow the built-in `dict`
message = {"role": "user", "content": input_text}
chat = [message]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

start_time = time.time()

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to("cuda"), max_length=500, max_new_tokens=8000)
output_text = tokenizer.decode(outputs[0])

end_time = time.time()
# Calculate the execution time
execution_time = (end_time - start_time)/60
print("Execution time:", execution_time, "minutes\n")

print(output_text)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Both `max_new_tokens` (=8000) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
  attn_output = torch.nn.functional.scaled_dot_product_attention(


Execution time: 0.4417193055152893 minutes

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communi
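The timing code in these cells is repeated for every run. It could be factored into a small context manager; the sketch below is one possible shape, where `timed` is a hypothetical name and the summation is only a stand-in for the `tokenizer.encode` / `model.generate` / `tokenizer.decode` calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label="Execution time"):
    """Measure the wall-clock time of the enclosed block and print it in minutes."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result["minutes"] = (time.perf_counter() - start) / 60
        print(f"{label}: {result['minutes']:.3f} minutes")

# Stand-in workload (assumption); replace with the generation calls.
with timed() as t:
    total = sum(range(100_000))
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.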

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

reward_tokenizer = AutoTokenizer.from_pretrained("./rewardmodel1")
reward_model = AutoModelForSequenceClassification.from_pretrained("./rewardmodel1")

def get_score(model, tokenizer, response):
    # Tokenize the input sequences
    inputs = tokenizer.encode_plus(response, truncation=True, padding="max_length", max_length=512, return_tensors="pt")

    # Perform forward pass
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract the logits
    logits = outputs.logits

    return logits.item()
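Because `get_score` tokenizes with `truncation=True` and `max_length=512`, any input longer than 512 tokens is scored on its prefix only. The sketch below quantifies how much would be dropped; `truncation_report` is a hypothetical helper, and the 737-token figure is the count reported earlier for `lst[0]`. That count came from the generator's tokenizer, so the reward tokenizer may give a different number.

```python
def truncation_report(num_tokens, max_length=512):
    """Describe how much of an input survives truncation to max_length tokens."""
    kept = min(num_tokens, max_length)
    dropped = num_tokens - kept
    fraction_kept = kept / num_tokens if num_tokens else 1.0
    return {"kept": kept, "dropped": dropped, "fraction_kept": fraction_kept}

# 737 is the token count printed for lst[0]; treat it as an approximation here.
report = truncation_report(737)
print(report)
```

If a large share of the input is dropped, the score may reflect the shared prompt more than the generated scenario.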

In [3]:
output_text='''<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. This should be specific, i.e exact model of aircraft or missiles (eg: F-15 fighter jets, SA-7 surface-to-air missile launchers, ZSU-23-4 Shilka self-propelled anti-aircraft gun systems, etc). 
3.	Geography: Distance to target and distance to REDLAND threats  in nautical miles (nm). Specification of environmental conditions which are of interest (eg: day/night/dawn, buildings, vegetation, ditches in surroundings, etc) 
4.	Surrounding Conditions: Overview of all scenario-related information of interest, i.e surrounding military situation, REDLAND readiness state, etc.
5.	Target(s) Details:  Any relevant information about target(s), i.e type and number of targets. 
6.	Rules of Engagement: Directives that outline when and how BLUELAND can engage hostile forces.<end_of_turn>
<start_of_turn>model
## BLUELAND AIR STRIKE TRAINING SCENARIO

**Commander's Intent:**
- Achieve 100% destruction of REDLAND's central war production facility (WPF).
- Minimize BLUELAND casualties to less than 20%.

**Force Composition:**
- 4 x F-18E Hornets
- 2 x MQ-9 Reaper drones
- 1 x C-130 Hercules transport aircraft

**Geography:**
- Target distance: 50nm
- REDLAND border distance: 20nm
- Environmental conditions: Night, clear skies, moderate wind.

**Surrounding Conditions:**
- REDLAND forces deployed in surrounding area, including:
    - 2 x SAM-6 surface-to-air missile batteries
    - 4 x ZSU-23-4 Shilka batteries
    - 10 x T-72 main battle tanks
- Moderate industrial and residential development in surrounding area.

**Target Details:**
- Large, fortified industrial complex with multiple production facilities.
- Reinforced concrete bunkers housing critical equipment and personnel.

**Rules of Engagement:**
- Engage only REDLAND forces within the designated operational area.
- Minimize civilian casualties and damage to non-military infrastructure.<eos>'''

print(get_score(reward_model, reward_tokenizer, output_text))

3.231126308441162
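The text passed to `get_score` contains both the prompt and the model's turn. If only the generated scenario should be scored, the reply can be split off using the turn markers visible in the decoded output. This is a sketch under that assumption; `extract_model_reply` is a hypothetical helper, and whether the reward model expects the full text or only the reply depends on how it was trained, which this notebook does not show.

```python
MODEL_TURN = "<start_of_turn>model\n"
END_MARKERS = ("<end_of_turn>", "<eos>")

def extract_model_reply(decoded):
    """Return the text of the last model turn, without trailing end markers."""
    _, sep, reply = decoded.rpartition(MODEL_TURN)
    if not sep:
        return ""  # no model turn found in the text
    for marker in END_MARKERS:
        reply = reply.split(marker, 1)[0]
    return reply.strip()

# Shortened stand-in for a decoded generation (assumption).
sample = (
    "<bos><start_of_turn>user\nWrite a scenario.<end_of_turn>\n"
    "<start_of_turn>model\n## SCENARIO\n- Target distance: 50nm<eos>"
)
reply = extract_model_reply(sample)
print(reply)
```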


In [4]:
from peft import PeftConfig, PeftModel
import time 

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-7b-it",
    quantization_config=quantization_config
)
model = PeftModel.from_pretrained(model, "./kto-aligned-model-lora-2")

start_time = time.time()
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
# Note: unlike the first run, the inputs are not moved to the GPU explicitly here
outputs = model.generate(input_ids=inputs, max_length=500, max_new_tokens=1000)
output_text = tokenizer.decode(outputs[0])

end_time = time.time()
# Calculate the execution time
execution_time = (end_time - start_time)/60
print("Execution time:", execution_time, "minutes\n")

print(output_text)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Both `max_new_tokens` (=1000) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Execution time: 6.704605635007223 minutes

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communic

In [3]:
output_text='''<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. This should be specific, i.e exact model of aircraft or missiles (eg: F-15 fighter jets, SA-7 surface-to-air missile launchers, ZSU-23-4 Shilka self-propelled anti-aircraft gun systems, etc). 
3.	Geography: Distance to target and distance to REDLAND threats  in nautical miles (nm). Specification of environmental conditions which are of interest (eg: day/night/dawn, buildings, vegetation, ditches in surroundings, etc) 
4.	Surrounding Conditions: Overview of all scenario-related information of interest, i.e surrounding military situation, REDLAND readiness state, etc.
5.	Target(s) Details:  Any relevant information about target(s), i.e type and number of targets. 
6.	Rules of Engagement: Directives that outline when and how BLUELAND can engage hostile forces.<end_of_turn>
<start_of_turn>model
## BLUELAND AIRSTRIKE MISSION: REDLAND BUNKER COMPLEX

**Commander's Intent:**
- Achieve 95% target destruction (TOTs)
- Limit BLUELAND losses to 20%
- Ensure complete destruction of REDLAND's airbase infrastructure and all personnel.

**Force Composition:**
- REDLAND Air Force:
    - 4 x MiG-29 Fulcrum fighter jets
    - 2 x Su-35 Flanker-B fighter jets
    - 6 x SA-13 Buk surface-to-air missile launchers
    - 8 x ZSU-23-4 Shilka self-propelled anti-aircraft gun systems
    - Ground-based air defense radar network

**Geography:**
- Target: 12nm from BLUELAND border, 5nm from REDLAND border
- REDLAND airbase: 8nm from target
- Environmental conditions: Night, clear skies, 50% cloud cover at 10,000ft

**Surrounding Conditions:**
- Densely forested mountainous terrain
- River system running through the target area
- REDLAND forces deployed in surrounding hills, armed with anti-aircraft guns and MANPADS

**Target Details:**
- 3 hardened underground bunkers, each housing 500 personnel and advanced weaponry
- Reinforced concrete perimeter wall surrounding the complex

**Rules of Engagement:**
- Engage only REDLAND forces within the designated engagement area.
- Prioritize air-to-air engagements over ground targets.
- Limit damage to non-military infrastructure.

**Additional Information:**
- BLUELAND intel estimates REDLAND's air defense network is operating at 80% efficiency.
- BLUELAND forces have access to precision-guided munitions and long-range air-to-ground missiles.<eos>'''

print(get_score(reward_model, reward_tokenizer, output_text))

3.5538506507873535


The results above show an improvement in score when the adapter is used (3.55, compared with 3.23 for the base model). Each score comes from a single generation.
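The size of the difference can be made explicit with a short calculation. The two values are copied from the cell outputs above; the variable names are illustrative only.

```python
base_score = 3.231126308441162      # 4-bit base model, from the output above
adapter_score = 3.5538506507873535  # base model with the LoRA adapter, from the output above

delta = adapter_score - base_score
relative = delta / base_score
print(f"absolute gain: {delta:.4f}, relative gain: {relative:.1%}")
```

Since each score is based on one generated scenario, the gain is indicative rather than conclusive.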

**Loading the trained model directly, without attaching it as an adapter**

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftConfig, PeftModel
import time 

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "./kto-aligned-model-lora-2",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-7b-it")

input_text = """
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communication networks and communication channels. This should be specific, i.e exact model of aircraft or missiles (eg: F-15 fighter jets, SA-7 surface-to-air missile launchers, ZSU-23-4 Shilka self-propelled anti-aircraft gun systems, etc). 
3.	Geography: Distance to target and distance to REDLAND threats  in nautical miles (nm). Specification of environmental conditions which are of interest (eg: day/night/dawn, buildings, vegetation, ditches in surroundings, etc) 
4.	Surrounding Conditions: Overview of all scenario-related information of interest, i.e surrounding military situation, REDLAND readiness state, etc.
5.	Target(s) Details:  Any relevant information about target(s), i.e type and number of targets. 
6.	Rules of Engagement: Directives that outline when and how BLUELAND can engage hostile forces.
"""
# Build a single-turn chat; the variable is named `message` so that it does not shadow the built-in `dict`
message = {"role": "user", "content": input_text}
chat = [message]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

start_time = time.time()
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
# Note: unlike the first run, the inputs are not moved to the GPU explicitly here
outputs = model.generate(input_ids=inputs, max_length=500, max_new_tokens=1000)
output_text = tokenizer.decode(outputs[0])

end_time = time.time()
# Calculate the execution time
execution_time = (end_time - start_time)/60
print("Execution time:", execution_time, "minutes\n")

print(output_text)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Both `max_new_tokens` (=1000) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
  attn_output = torch.nn.functional.scaled_dot_product_attention(


Execution time: 12.103922462463379 minutes

<bos><start_of_turn>user
BLUELAND is a make-believe country. REDLAND is another make-believe country, an enemy of BLUELAND. 

Write me a training scenario for an air strike mission by BLUELAND to destroy a specified target. This scenario will be given to BLUELAND, who will devise a strategy to execute the plan. The scenario should be non-descriptive, concise and factual. The information should be sufficient to help BLUELAND to decide their loadout and strategy. The scenario should include the following information about the target and REDLAND: 
1.	Commander’s Intent: Captures the broad mission objectives (eg: 100% TOT, 50% BLUELAND losses, 100% REDLAND destruction, destroy bunkers, etc)
2.	Force Composition: REDLAND’s units and objects participating in the scenario, Order of Battle, i.e the identification, strength, command structure, disposition for the personnel, units, equipment of any military force, command & control hierarchies, communi

In [9]:
print(get_score(reward_model, reward_tokenizer, output_text))

-0.37795165181159973


The reward model does not seem to work here (the score is -0.38), but both approaches appear to work: using the trained model as an adapter and loading it directly. Also remember to set `max_length` and `max_new_tokens` to the values used during training.

In [None]:
'''
# deepspeed doesn't work with windows 
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live
model = AutoModelForCausalLM.from_pretrained("google/gemma-1.1-7b-it")
print(estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1))
'''