# Auto Gen Tutorial - LLM Rap Battle
Notebook written by John Adeojo
Founder and Chief Data Scientist at [Data-centric Solutions](https://www.data-centric-solutions.com/).

---
# License

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

## How to Credit

If you use this work or adapt it, please credit the author and the company as follows:

"Auto Gen Tutorial: Proprietary vs Opensource " by John Adeojo from Data-Centric Solutions, used under CC BY 4.0 / Desaturated from original


## Define Model End Point

In [7]:
import logging
logging.basicConfig(level=logging.WARNING)

In [8]:
import autogen
import openai 

# import autogen
# autogen.oai.ChatCompletion.start_logging()
config_list = [
        {
            'model': 'meta-llama/Llama-2-70b-chat-hf', # Change to the name of the model you're using
            'api_key': 'sk-111111111111111111111111111111111111111111111111',
            'api_type': 'openai',
            'api_base': 'https://a5bk76vnu0jy4j-8000.proxy.runpod.net/v1', # You will need to change this to your runpod endpoint
            'api_version': 'Tutorial'
        }
]
llm_config = {"config_list": config_list, "temperature":0.7, "seed":42, "request_timeout":480}
model = config_list[0]["model"]


# Perform Completion for testing endpoint
if model == 'meta-llama/Llama-2-70b-chat-hf':
    question = '''
    <s>[INST] <<SYS>>
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    <</SYS>>

    Who are you? [/INST]
    '''
else:
    question = 'Who are you?'

response = autogen.oai.Completion.create(config_list=config_list, prompt=question, max_tokens=1000, temperature=0)
ans = autogen.oai.Completion.extract_text(response)[0]

print("Model response:", ans)

Model response:  I'm an AI assistant trained to provide helpful and informative responses to your questions while adhering to ethical and moral guidelines. My purpose is to assist and provide accurate information to the best of my abilities. I am not capable of providing harmful or offensive responses, and I strive to ensure that my answers are socially unbiased and positive in nature. If a question does not make sense or is not factually coherent, I will explain why instead of providing an incorrect answer. If I don't know the answer to a question, I will not provide false information. I am here to help and provide assistance to the best of my abilities.
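
The Llama 2 test prompt above follows the model's `[INST]` / `<<SYS>>` chat template. A minimal sketch of a helper that assembles that template for a single turn is shown below. `build_llama2_prompt` is a hypothetical name, not part of AutoGen or of this notebook's `messages` module, and the leading `<s>` token is written out only because the cell above does so.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn Llama 2 chat prompt (hypothetical helper)."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


prompt = build_llama2_prompt(
    system_prompt="You are a helpful, respectful and honest assistant.",
    user_message="Who are you?",
)
print(prompt)
```

Other models in this notebook take the plain question, so a helper like this would only be used in the Llama 2 branch.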


## Prompt Templates

In [9]:
from messages import system_message_judge, generate_message_gpt4, generate_message_opensource, generate_host, generate_message_llama

# If you're using a different model, please add information about the model as a variable.

facts_about_wizard = '''
    Training large language models (LLMs) with open-domain instruction following
    data brings colossal success. However, manually creating such instruction data
    is very time-consuming and labor-intensive. Moreover, humans may struggle to
    produce high-complexity instructions. In this paper, we show an avenue for creating
    large amounts of instruction data with varying levels of complexity using LLM
    instead of humans. Starting with an initial set of instructions, we use our proposed
    Evol-Instruct to rewrite them step by step into more complex instructions. Then, we
    mix all generated instruction data to fine-tune LLaMA. We call the resulting model
    WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna’s
    testset show that instructions from Evol-Instruct are superior to human-created
    ones. By analyzing the human evaluation results of the high complexity part, we
    demonstrate that outputs from our WizardLM model are preferred to outputs from
    OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than
    90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still
    lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with
    AI-evolved instructions is a promising direction for enhancing LLMs. Our code
    and data are public at https://github.com/nlpxucan/WizardLM.

    This paper presented Evol-Instruct, an evolutionary algorithm that generates diverse and complex instruction data for LLM. We demonstrated that our approach enhanced LLM performance, WizardLM,
    achieved state-of-the-art results on high-complexity tasks and competitive results on other metrics.
    Limitations. This paper acknowledges the limitations of our automatic GPT-4 and human evaluation
    methods. This method poses challenges for scalability and reliability. Moreover, our test set may not
    represent all the scenarios or domains where LLM can be applied or compared with other methods.
    Broader Impact. Evol-Instruct could enhance LLM performance and interaction in various domains
    and applications, but it could also generate unethical, harmful, or misleading instructions. Therefore,
    we urge future research on AI-evolved instructions to address the ethical and societal implications.

'''

facts_about_Orca = '''
    
        Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform
        conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval.
        In Orca 2, we continue exploring how improved training signals can enhance smaller LMs’
        reasoning abilities. Research on training small LMs has often relied on imitation learning
        to replicate the output of more capable models. We contend that excessive emphasis on
        imitation may restrict the potential of smaller models. We seek to teach small LMs to
        employ different solution strategies for different tasks, potentially different from the one used
        by the larger model. For example, while larger models might provide a direct answer to a
        complex task, smaller models may not have the same capacity. In Orca 2, we teach the model
        various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct
        answer, etc.). Moreover, we aim to help the model learn to determine the most effective
        solution strategy for each task. We evaluate Orca 2 using a comprehensive set of 15 diverse
        benchmarks (corresponding to approximately 100 tasks and over 36K unique prompts). Orca
        2 significantly surpasses models of similar size and attains performance levels similar or better
        to those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning
        abilities in zero-shot settings. We make Orca 2 weights publicly available at aka.ms/orca-lm
        to support research on the development, evaluation, and alignment of smaller LMs.
        Our study has demonstrated that improving the reasoning capabilities of smaller language

        models is not only possible, but also attainable through training on tailored synthetic data.
        Orca 2 models, by implementing a variety of reasoning techniques and recognizing the most
        effective solution strategy for each task, achieve performance levels comparable to, and often
        exceeding, models that are much larger, especially on zero-shot reasoning tasks. Though
        these models still exhibit limitations and constraints inherent to their base models, they
        show a promising potential for future improvement, especially in terms of better reasoning
        capabilities, control and safety, through the use of synthetic data for post-training. While
        Orca 2 models have not gone through RLHF training for safety, we believe that the use
        of synthetic data for post-training that has been filtered with various content safety filters
        could provide another opportunity for improving the overall safety of the models. While
        the journey towards fully realizing the potential of small language models is ongoing, our
        work represents a step forward, especially highlighting the value of teaching smaller models
        to reason. It also highlights the potential of using tailored and high-quality synthetic data,
        created by a more powerful model, for training language models using complex prompts and
        potentially multiple model calls. While frontier models will continue to demonstrate superior
        capabilities, we believe that research toward building more capable smaller models will help
        pave the way for new applications that require different deployment scenarios and trade offs
        between efficiency and capability.

'''

facts_about_mistral = '''
    We introduce Mistral 7B, a 7–billion-parameter language model engineered for
    superior performance and efficiency. Mistral 7B outperforms the best open 13B
    model (Llama 2) across all evaluated benchmarks, and the best released 34B
    model (Llama 1) in reasoning, mathematics, and code generation. Our model
    leverages grouped-query attention (GQA) for faster inference, coupled with sliding
    window attention (SWA) to effectively handle sequences of arbitrary length with a
    reduced inference cost. We also provide a model fine-tuned to follow instructions,
    Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
    automated benchmarks. Our models are released under the Apache 2.0 license.
'''

facts_about_llama = '''
    In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
    large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
    Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
    models outperform open-source chat models on most benchmarks we tested, and based on
    our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source
    models. We provide a detailed description of our approach to fine-tuning and safety
    improvements of Llama 2-Chat in order to enable the community to build on our work and
    contribute to the responsible development of LLMs.

    In this study, we have introduced Llama 2, a new family of pretrained and fine-tuned models with scales
    of 7 billion to 70 billion parameters. These models have demonstrated their competitiveness with existing
    open-source chat models, as well as competency that is equivalent to some proprietary models on evaluation
    sets we examined, although they still lag behind other models like GPT-4. We meticulously elaborated on the
    methods and techniques applied in achieving our models, with a heavy emphasis on their alignment with the
    principles of helpfulness and safety. To contribute more significantly to society and foster the pace of research,
    we have responsibly opened access to Llama 2 and Llama 2-Chat. As part of our ongoing commitment to
    transparency and safety, we plan to make further improvements to Llama 2-Chat in future work.
'''


facts_about_yi = '''
    The Yi series models are large language models trained from scratch by developers at 01.AI.

    This release contains two chat models based on previous released base models, 
    two 8-bits models quantized by GPTQ, two 4-bits models quantized by AWQ.

    Yi-34B-Chat

    The released chat model has undergone exclusive training using Supervised Fine-Tuning (SFT). Compared to other standard chat models, our model produces more diverse responses, making it suitable for various downstream tasks, such as creative scenarios. Furthermore, this diversity is expected to enhance the likelihood of generating higher quality responses, which will be advantageous for subsequent Reinforcement Learning (RL) training.

    However, this higher diversity might amplify certain existing issues, including:

    Hallucination: This refers to the model generating factually incorrect or nonsensical information. 
    With the model's responses being more varied, there's a higher chance of hallucination that are not 
    based on accurate data or logical reasoning.
    Non-determinism in re-generation: When attempting to regenerate or sample responses, 
    inconsistencies in the outcomes may occur. The increased diversity can lead to varying results even 
    under similar input conditions.
    Cumulative Error: This occurs when errors in the model's responses compound over time. 
    As the model generates more diverse responses, the likelihood of small inaccuracies building 
    up into larger errors increases, especially in complex tasks like extended reasoning, mathematical problem-solving, etc.
    To achieve more coherent and consistent responses, it is advisable to adjust generation 
    configuration parameters such as temperature, top_p, or top_k.
    These adjustments can help in the balance between creativity and coherence in the model's outputs.

'''

# If you're using a different model, please update this dictionary.
model_mapping = {
    "WizardLM/WizardLM-70B-V1.0": [facts_about_wizard, "Big_Wizzy"], # No space allowed so Big Wizzy must be Big_Wizzy
    "microsoft/Orca-2-13b": [facts_about_Orca, "Lil_Orca"],
    "mistralai/Mistral-7B-Instruct-v0.1": [facts_about_mistral, "Mistral_Elliot"],
    "meta-llama/Llama-2-70b-chat-hf": [facts_about_llama, "Kendrick_Llama"],
    "01-ai/Yi-34B-Chat": [facts_about_yi, "Yi_Zee"]
    }

facts_about_rival_model = model_mapping[model][0]
# facts_about_rival_model = facts_about_rival_model.replace('\n', ' ') 
model_rap_name = model_mapping[model][1]

system_message_gpt4 = generate_message_gpt4(facts_about_rival_model, model_rap_name)

# Llama 2 has an unusual prompt template
if model == "meta-llama/Llama-2-70b-chat-hf":
    # Assumption: generate_message_llama takes the rap name, like generate_message_opensource
    system_message_opensource = generate_message_llama(model_rap_name)
else:
    system_message_opensource = generate_message_opensource(model_rap_name)
system_message_host = generate_host(model_rap_name)

# Writing to a text file to be used later for speak.py script
with open('rap_name.txt', 'w', encoding='utf-8') as file:
    file.write(model_rap_name)



ImportError: cannot import name 'generate_message_llama' from 'messages' (g:\My Drive\Data-Centric Solutions\07. Blog Posts\LLM vs LLM\llm_vs\messages.py)
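
The lookup `model_mapping[model]` raises a bare `KeyError` when the endpoint serves a model that has not been added to the dictionary. A small sketch of a guarded lookup follows. `lookup_model` and the shortened `demo_mapping` are hypothetical and only illustrate the idea; the no-spaces check reflects the comment next to `Big_Wizzy` in the cell above.

```python
def lookup_model(mapping: dict, model_name: str) -> tuple:
    """Return (facts, rap_name) for model_name or raise a descriptive error."""
    try:
        facts, rap_name = mapping[model_name]
    except KeyError:
        known = ", ".join(sorted(mapping))
        raise KeyError(
            f"No entry for model '{model_name}'. Add it to the mapping. Known models: {known}"
        ) from None
    # The notebook notes that agent names must not contain spaces
    if " " in rap_name:
        raise ValueError(f"Rap name '{rap_name}' must not contain spaces")
    return facts, rap_name


# Shortened stand-in for model_mapping (facts text abbreviated for illustration)
demo_mapping = {
    "meta-llama/Llama-2-70b-chat-hf": ["Llama 2 facts...", "Kendrick_Llama"],
    "01-ai/Yi-34B-Chat": ["Yi facts...", "Yi_Zee"],
}

facts, rap_name = lookup_model(demo_mapping, "01-ai/Yi-34B-Chat")
print(rap_name)
```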

In [None]:
import json 

with open("locations.json", 'r', encoding='utf-8') as f:
    path = json.load(f)

configurations_path = path[0]["configurations_path"]
print(configurations_path)

# configurations_path = "G:/My Drive/Data-Centric Solutions/07. Blog Posts/AutoGen 2 - Flights/"

config_list_gpt = autogen.config_list_from_json(
    env_or_file="configurations.json",
    file_location=configurations_path,
    filter_dict={
        "model": ["gpt-4-1106-preview"],
        # "model": ["gpt-3.5-turbo-16k"]
    },

)
api_key = config_list_gpt[0]['api_key']
llm_config_gpt4 = {"config_list": config_list_gpt, "seed":42}
openai.api_key = api_key
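
The cell above reads two JSON files. Judging from how they are indexed, `locations.json` holds a list whose first item has a `configurations_path` key, and `configurations.json` holds a list of model entries that are filtered by `model`. The sketch below writes placeholder versions of both into a temporary directory and applies the same filter by hand with the standard library only. The directory, the key values, and the `filter_by_model` helper are illustrative assumptions, not the author's real files.

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()

# Placeholder contents; replace with your own path and API key
locations = [{"configurations_path": workdir}]
configurations = [
    {"model": "gpt-4-1106-preview", "api_key": "sk-your-key-here"},
    {"model": "gpt-3.5-turbo-16k", "api_key": "sk-your-key-here"},
]

with open(os.path.join(workdir, "locations.json"), "w", encoding="utf-8") as f:
    json.dump(locations, f, indent=4)
with open(os.path.join(workdir, "configurations.json"), "w", encoding="utf-8") as f:
    json.dump(configurations, f, indent=4)


def filter_by_model(config_list, models):
    """Keep only entries whose 'model' is in models (hand-rolled stand-in for filter_dict)."""
    return [cfg for cfg in config_list if cfg.get("model") in models]


with open(os.path.join(workdir, "locations.json"), "r", encoding="utf-8") as f:
    configurations_path = json.load(f)[0]["configurations_path"]
with open(os.path.join(configurations_path, "configurations.json"), "r", encoding="utf-8") as f:
    selected = filter_by_model(json.load(f), ["gpt-4-1106-preview"])

print([cfg["model"] for cfg in selected])
```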

## Define Agents

In [None]:
import autogen 
logging.basicConfig(level=logging.WARNING)
# Configure logging level for specific loggers
logging.getLogger('httpx').setLevel(logging.WARNING)
logging.getLogger('openai').setLevel(logging.WARNING)

judge = autogen.UserProxyAgent(
    name="judge",
    human_input_mode="NEVER",
    llm_config=llm_config_gpt4,
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
    max_consecutive_auto_reply=2,
    code_execution_config=False,
    system_message=system_message_judge,
)

challenger = autogen.AssistantAgent(
    name=model_rap_name,
    system_message=system_message_opensource,
    llm_config=llm_config,
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
)

g_turbo = autogen.AssistantAgent(
    name="G_Turbo",
    system_message=system_message_gpt4,
    llm_config=llm_config_gpt4,
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
    
)

groupchat = autogen.GroupChat(
    agents=[g_turbo, challenger, judge], 
    messages=[], 
    max_round=13
    )

host = autogen.GroupChatManager(
    name="host",
    groupchat=groupchat, 
    llm_config=llm_config_gpt4,
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
    system_message=system_message_host
    )

host.initiate_chat(
    host, 
    message=f'''
    Competitive rap battle between {model_rap_name} and G-Turbo. 
    Destroy your opponent (lyrically). 
    Go back and forth for three rounds and make it personal.
    The Judge will announce the winner at the end.
''',
    silent=True
    )
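
Every agent above repeats the same `is_termination_msg` lambda. A sketch of the same check as a named function is given below. Defining it as a standalone function is a suggestion rather than part of the original notebook, and this version always returns a boolean and treats a `None` content value as non-terminating.

```python
def is_termination_msg(message: dict) -> bool:
    """Return True when the message content ends with the word TERMINATE."""
    content = message.get("content") or ""
    return content.rstrip().endswith("TERMINATE")


print(is_termination_msg({"content": "Good battle. TERMINATE\n"}))
print(is_termination_msg({"content": "Round two, let's go"}))
print(is_termination_msg({"content": None}))
```

The function could then be passed as `is_termination_msg=is_termination_msg` to each agent in place of the repeated lambda.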

In [None]:
message_dict =  host._oai_messages
message_dict

In [None]:
import os

def ensure_directory_exists(file_path):
    # Create the parent directory of file_path if it does not exist yet
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)

def save_specific_key_content_to_file(ddict, filename):
    keys_list = list(ddict.keys())
    last_key = keys_list[-1]
    specific_content = ddict[last_key]
    file_path = os.path.join(os.getcwd(), filename)

    ensure_directory_exists(file_path)
    
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(specific_content, f, ensure_ascii=False, indent=4)

    print(f"Content saved to {file_path}")

# Use the function
save_specific_key_content_to_file(message_dict, f'{model_rap_name}/specific_conversation_data.json')
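
Once the conversation is saved, it can be loaded again for inspection or for a later script. The sketch below assumes the saved JSON is a list of message dictionaries with `name` (or `role`) and `content` keys. The sample messages are made up for illustration, and `format_transcript` is a hypothetical helper.

```python
import json
import os
import tempfile


def format_transcript(messages):
    """Turn a list of message dicts into 'speaker: text' lines."""
    lines = []
    for msg in messages:
        # Fall back to the role when a message carries no speaker name
        speaker = msg.get("name") or msg.get("role", "unknown")
        text = (msg.get("content") or "").strip()
        lines.append(f"{speaker}: {text}")
    return lines


# Made-up sample standing in for specific_conversation_data.json
sample = [
    {"role": "user", "name": "G_Turbo", "content": "First verse goes here"},
    {"role": "user", "name": "Kendrick_Llama", "content": "Reply verse goes here"},
]

path = os.path.join(tempfile.mkdtemp(), "specific_conversation_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=4)

with open(path, "r", encoding="utf-8") as f:
    transcript = format_transcript(json.load(f))

print("\n".join(transcript))
```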
