# TASK OVERVIEW
Here is the detailed context for the next phase of your task: fine-tuning an assistant that can help with developing and coding Gradio apps.
To generate synthetic data for fine-tuning a Gradio app development assistant, we can follow a two-step process of distillation from a stronger model:

1. Generate diverse instructions and input context related to Gradio app development
2. Generate high-quality responses to those instructions

For step 1, start by creating a small seed set of human-written instructions covering key aspects of Gradio app development, such as:

Designing the user interface and layout
Defining input and output components
Integrating machine learning models
Handling events and interactivity
Customizing the look and feel
Deploying the app

Example seed instructions:
"Design a simple Gradio interface that allows a user to input text and displays the sentiment (positive, negative, neutral)."
"Explain how to load a pretrained machine learning model from Hugging Face and use it to make predictions in a Gradio app."
"What are the different options for deploying a Gradio app and how do you configure them?"
Then, use a strong language model like GPT-3.5 or GPT-4 to generate additional diverse instructions by few-shot prompting with the seed examples. Encourage the model to be creative and cover a wide range of Gradio development topics. Also have it generate relevant input context for each instruction.
Filter the generated instructions to remove any that are invalid, irrelevant, or duplicates. Aim to build a dataset of hundreds to thousands of unique instructions spanning beginner to advanced Gradio development.
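The prompting and filtering in step 1 can be sketched roughly as follows. This is a minimal illustration, not a fixed recipe: the prompt wording, the function names (`build_few_shot_prompt`, `normalize`, `filter_instructions`), and the normalization rule used for duplicate detection are all assumptions introduced here. Only the seed instructions come from the text above.

```python
import re

# Seed instructions quoted from the text above.
SEED_INSTRUCTIONS = [
    "Design a simple Gradio interface that allows a user to input text and displays the sentiment (positive, negative, neutral).",
    "Explain how to load a pretrained machine learning model from Hugging Face and use it to make predictions in a Gradio app.",
    "What are the different options for deploying a Gradio app and how do you configure them?",
]

def build_few_shot_prompt(seeds, n_new=10):
    """Assemble a few-shot prompt asking a model for new instructions (hypothetical wording)."""
    examples = "\n".join(f"{i}. {s}" for i, s in enumerate(seeds, start=1))
    return (
        "Here are example instructions about Gradio app development:\n"
        f"{examples}\n\n"
        f"Write {n_new} new, diverse instructions in the same style, one per line."
    )

def normalize(text):
    """Lowercase and collapse punctuation/whitespace so near-identical strings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def filter_instructions(candidates, seen=()):
    """Drop empty lines and duplicates (after normalization), preserving order."""
    seen_keys = {normalize(s) for s in seen}
    kept = []
    for cand in candidates:
        key = normalize(cand)
        if not key or key in seen_keys:
            continue
        seen_keys.add(key)
        kept.append(cand.strip())
    return kept
```

Passing the seed set as `seen` keeps the model from simply echoing its own few-shot examples back. Relevance and validity checks would still need a separate pass, since this sketch only handles empty and duplicate lines.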
For step 2, use the same strong language model to generate high-quality responses for each instruction-input pair. Provide the model with the instruction and input context, and sample multiple candidate responses using techniques like nucleus sampling to encourage diversity while maintaining coherence.
Have the model self-evaluate each candidate response and select the best one, or rank them. Criteria for evaluation can include:

Directly addressing the instruction
Correctness of the information
Clarity of explanation
Conciseness
Helpfulness for a developer
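
Once each candidate has been scored against the criteria above, picking or ranking the best one is a small aggregation step. The sketch below assumes the self-evaluation call returns one numeric score per criterion. The criterion keys, the equal weighting, and the function names are hypothetical choices made for illustration.

```python
# Hypothetical keys, one per evaluation criterion listed above.
CRITERIA = (
    "addresses_instruction",
    "correctness",
    "clarity",
    "conciseness",
    "helpfulness",
)

def total_score(scores):
    """Sum per-criterion scores; a missing criterion counts as 0."""
    return sum(scores.get(name, 0) for name in CRITERIA)

def rank_candidates(candidates):
    """Return (response_text, scores) pairs sorted best-first by total score."""
    return sorted(candidates, key=lambda pair: total_score(pair[1]), reverse=True)

# Illustrative scores only; real values would come from the evaluating model.
candidates = [
    ("response A", {"addresses_instruction": 4, "correctness": 3, "clarity": 4}),
    ("response B", {"addresses_instruction": 5, "correctness": 5, "clarity": 3,
                    "conciseness": 4, "helpfulness": 4}),
]
best_text, best_scores = rank_candidates(candidates)[0]
```

Equal weights are the simplest option. If correctness matters more than conciseness, a weighted sum would be a natural variation.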

The final result will be a synthetic dataset of instruction-input-response triples covering a wide range of topics and skills needed for Gradio app development. This can then be used to fine-tune the target LLaMA model with a LoRA adapter.
Some key considerations during the data generation process:

Ensure a good mix of high-level and low-level instructions, covering both conceptual topics as well as specific coding details
Aim for diversity of instructions to teach a broad skillset, but also generate multiple examples for key concepts to reinforce them
Prompt the language model to include code snippets and examples in the responses to make them more actionable
Experiment with different prompts, sampling parameters, and filtering criteria to optimize the quality of the synthetic data
Spot check samples of the generated data and iterate to improve quality

By distilling high-quality instruction-tuning data from a strong language model, we can imbue the target LLaMA model with comprehensive knowledge of Gradio app development best practices. The resulting fine-tuned model should be able to engage in substantive conversations about Gradio and provide detailed, helpful guidance to developers working on Gradio projects.

In [15]:
#!pip install anthropic

In [34]:
import os
from anthropic import Anthropic

In [17]:
import pandas as pd

In [11]:
#os.environ['ANTHROPIC_API_KEY'] = ''

In [21]:
from glob import glob

# Use ** to match all files in all subdirectories
files = glob('../data/latest-docs/**/*.md', recursive=True)

In [22]:
files = pd.Series(files).rename('fp').to_frame()

In [29]:
files['fn'] = files.fp.str.rsplit("/", n=1).str[-1]  # file name: last path component

In [32]:
# Index 3 is the first path component below '../data/latest-docs/'
files['category'] = files.fp.str.split("/", expand=True)[3]
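
To see what the two string operations above extract, here is the same logic applied to a single path with plain Python strings. The path itself is hypothetical. It only follows the shape of the `'../data/latest-docs/**/*.md'` glob pattern.

```python
# Hypothetical path matching the glob pattern used above.
fp = "../data/latest-docs/guides/quickstart.md"

parts = fp.split("/")
# parts == ['..', 'data', 'latest-docs', 'guides', 'quickstart.md']

fn = fp.rsplit("/", 1)[-1]  # file name: last path component
category = parts[3]         # first component below 'latest-docs'

print(fn, category)
```

One consequence of indexing by position: a markdown file sitting directly in `latest-docs` would get its own file name as its category, because index 3 is then the file itself.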

In [35]:

client = Anthropic(
    # This is the default and can be omitted
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
)


In [47]:
# Prompt template
prompt_template = """
<prompt_template>
You will be acting as a coding assistant to help answer questions about the Gradio Python framework. I will provide you with a chunk of documentation about Gradio. Your task is to identify the core concepts covered in this documentation and generate questions that the documentation would be key in answering. Then, you will provide concise, reformatted answers to these questions based on the documentation.
You will return only the JSONL question-answer pairs, with no preamble.

Here is the chunk of Gradio documentation:
<documentation_chunk>
{{DOCUMENTATION_CHUNK}}
</documentation_chunk>

Please carefully read through the documentation chunk and identify the core concepts it covers. For each core concept, simulate a question that a user might ask where this part of the documentation would be essential to providing a good answer. Try to generate at least one question per core concept.

Once you have your list of questions, answer each one by reformatting the relevant parts of the documentation into an optimal, concise answer. Focus on capturing the key information needed to address the question, but don't simply copy and paste from the documentation - aim to rephrase things in a more accessible way.

Please return your output in JSONL format, with each line containing a JSON object representing a simulated question and answer pair. The JSON object should have a "question" field with the simulated question, and an "answer" field with the concise, reformatted answer you generated from the documentation.
</prompt_template>
"""



In [38]:
import re

def read_markdown(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def parse_markdown(content):
    # Regex pattern to match markdown headings
    heading_pattern = re.compile(r'^(#{1,6})\s+(.*)', re.MULTILINE)
    headings = []
    for match in heading_pattern.finditer(content):
        level = len(match.group(1))
        title = match.group(2).strip()
        start = match.start()
        headings.append((level, title, start))
    return headings

def get_text_chunks(content, headings):
    chunks = []
    for i in range(len(headings)):
        level, title, start = headings[i]
        if i + 1 < len(headings):
            end = headings[i + 1][2]
        else:
            end = len(content)
        chunk = content[start:end].strip()
        chunks.append((level, title, chunk))
    return chunks

def process_chunks(chunks):
    hierarchy = {1: None, 2: None, 3: None, 4: None, 5: None, 6: None}
    processed_chunks = []

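    for level, title, chunk in chunks:
        hierarchy[level] = title
        # Clear deeper levels so a stale subheading from an earlier section
        # cannot leak into a later heading path when heading levels are skipped.
        for deeper in range(level + 1, 7):
            hierarchy[deeper] = None
        upper_headings = [hierarchy[i] for i in range(1, level) if hierarchy[i] is not None]
        full_title = " > ".join(upper_headings + [title])
        processed_chunks.append((full_title, chunk))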

    return processed_chunks
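
As a quick illustration of what the heading-based chunking produces, the following condensed sketch applies the same idea to a tiny markdown string. It is a self-contained stand-in written for this example, not the notebook's own functions. The name `chunk_by_heading` and the sample text are assumptions, and the sketch clears deeper heading levels each time a heading is seen.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)

def chunk_by_heading(content):
    """Split markdown at headings and label each chunk with its heading path."""
    matches = list(HEADING.finditer(content))
    hierarchy = {}
    chunks = []
    for i, m in enumerate(matches):
        level, title = len(m.group(1)), m.group(2).strip()
        # A chunk runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        hierarchy[level] = title
        for deeper in range(level + 1, 7):
            hierarchy.pop(deeper, None)  # forget subheadings of earlier sections
        parents = [hierarchy[l] for l in range(1, level) if l in hierarchy]
        chunks.append((" > ".join(parents + [title]), content[m.start():end].strip()))
    return chunks

sample = "# Guide\nIntro.\n## Install\npip install gradio\n## Usage\nCall launch().\n"
for full_title, chunk in chunk_by_heading(sample):
    print(full_title)
```

A caveat that applies to any line-anchored heading regex of this kind: a line beginning with `#` inside a fenced code block, such as a Python comment, also matches and would be treated as a heading.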



In [41]:
all_chunks = []
for index, row in files.iterrows():
    content = read_markdown(row.fp)
    headings = parse_markdown(content)
    chunks = get_text_chunks(content, headings)
    processed_chunks = process_chunks(chunks)
    
    for title, chunk in processed_chunks:
        all_chunks.append(f"Heading: {title}\n Content:\n{chunk}\n")

In [43]:
#chunk = all_chunks[0]

In [50]:
from tqdm import tqdm

In [None]:
# List to store the responses
responses = []
# Iterate through the documentation chunks
counter_incase_crash = 0
for chunk in tqdm(all_chunks):
    # Insert the chunk into the prompt template
    prompt = prompt_template.replace("{{DOCUMENTATION_CHUNK}}", chunk)
    
    # Send the prompt to Claude Opus
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )

    # response.content is a list of content blocks (TextBlock objects), not a plain string
    assistant_response = response.content
    
    # Append the response to the list
    responses.append(assistant_response)
    counter_incase_crash+=1

 49%|███████████████████████████████████████▋                                         | 271/553 [48:19<49:00, 10.43s/it]
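
The loop above takes well over an hour for the full set of chunks, and `counter_incase_crash` hints at the risk of losing progress partway through. One possible way to make the run resumable is to append each result to a JSONL checkpoint file as it arrives. The sketch below is an assumption-laden illustration: `generate_with_checkpoint`, the checkpoint record layout, and the `generate` callable (a stand-in for the API call) are all hypothetical names introduced here.

```python
import json
import os
import tempfile

def generate_with_checkpoint(chunks, generate, checkpoint_path):
    """Call generate(chunk) per chunk, appending each result to a JSONL checkpoint.

    On a rerun, chunks whose index is already in the checkpoint are skipped.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[record["index"]] = record["text"]

    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for index, chunk in enumerate(chunks):
            if index in done:
                continue
            text = generate(chunk)
            f.write(json.dumps({"index": index, "text": text}) + "\n")
            f.flush()  # push each result to the file before the next call
            done[index] = text

    return [done[i] for i in range(len(chunks))]

# Usage with a stand-in generator; no API call is made here.
calls = []

def fake_generate(chunk):
    calls.append(chunk)
    return chunk.upper()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "responses.jsonl")
    first = generate_with_checkpoint(["a", "b"], fake_generate, path)
    second = generate_with_checkpoint(["a", "b", "c"], fake_generate, path)
```

In the second call only the new chunk reaches `fake_generate`, because the first two results are read back from the checkpoint. The sketch keys results by position, so it assumes the chunk list is built in the same order on every run.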

In [None]:
# Print the responses
for response in responses:
    print(response)

In [55]:
responses_df = pd.Series(responses).rename('all_data')
responses_df.to_frame().to_parquet('../datasets/raw/opus_doc_qas.parquet')

[[TextBlock(text='{"question": "What are the two main types of guides that can be contributed to Gradio?", "answer": "The two main types of guides that can be contributed to Gradio are:\\n1. Use cases: step-by-step guides on how to build a particular type of machine learning demo or app using Gradio.\\n2. Feature explanation: detailed guides that describe a particular feature of Gradio."}\n{"question": "Can you give an example of a \'use case\' style guide in the Gradio documentation?", "answer": "Yes, an example of a \'use case\' style guide in the Gradio documentation is the guide titled \\"Creating a Chatbot\\". This guide covers step-by-step how to build a chatbot demo or app using Gradio."}\n{"question": "What\'s an example of a \'feature explanation\' style guide from the Gradio docs?", "answer": "An example of a \'feature explanation\' style guide in the Gradio documentation is the guide titled \\"Using Flagging\\". This guide describes the flagging feature of Gradio in detail."
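
The output above shows that each stored response is a list of `TextBlock` objects whose `text` attribute holds the JSONL string requested by the prompt. A possible way to turn that text into question-answer records is sketched below. The helper name `parse_jsonl_pairs` and the placeholder sample are assumptions. The sketch works on a plain string, so the `text` attribute would need to be read from each block first.

```python
import json

def parse_jsonl_pairs(text):
    """Parse JSONL text into question/answer dicts, skipping unusable lines."""
    pairs = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a final line cut off by the max_tokens limit
        if isinstance(record, dict) and "question" in record and "answer" in record:
            pairs.append({"question": record["question"], "answer": record["answer"]})
    return pairs

# Placeholder text: one complete record followed by a truncated one.
sample_text = (
    '{"question": "Q1", "answer": "A1"}\n'
    '{"question": "Q2", "answer": "cut off mid-str'
)
pairs = parse_jsonl_pairs(sample_text)
print(pairs)
```

Skipping malformed lines silently keeps the sketch short. Counting or logging them would make it easier to spot responses that hit the token limit.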

In [None]:
index=0
opus_response = responses_df.iloc[0]

In [None]:
for index, opus_response in responses_df.items():
    