# Generation

In this notebook we dive deeper into prompting. We give the model better context, drawn from real user questions and from the documentation files, so that it generates better answers.

#### Install dependencies

In [1]:
%pip install -Uqq rich openai tiktoken wandb tenacity pandas wget

Note: you may need to restart the kernel to use updated packages.


#### Import libraries

In [2]:
import os, shutil, wget, random
from pathlib import Path
import pandas as pd
import openai
import tiktoken
from pprint import pprint
from rich.markdown import Markdown
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential
)
import wandb
from wandb.integration.openai import autolog

#### Loading OPENAI API key

In [3]:
from getpass import getpass

if os.getenv('OPENAI_API_KEY') is None:
    if any(['VSCODE' in x for x in os.environ.keys()]):
        print('Please enter password in the VS Code prompt at the top of your VS Code window!')
    os.environ['OPENAI_API_KEY'] = getpass('Paste your OpenAI key from: https://platform.openai.com/account/api-key\n')

assert os.getenv('OPENAI_API_KEY', '').startswith('sk-'), "This doesn't look like a valid OpenAI API key"
print('OpenAI API key configured')

OpenAI API key configured


#### Enable W&B to track our experiment

In [5]:
# start logging to W&B
os.environ['WANDB_NOTEBOOK_NAME'] = 'generation.ipynb'
# autolog(init={'project':'llmapps', 'job_type': 'generation'})
wandb.init(project='llmapps', job_type='generation')

#### Download necessary files

In [6]:
import stat
class Utils:
    # Function to clone the repository
    def clone_repo(self, repo_url, target_dir):
        if not os.path.exists(target_dir):
            os.system('git config --global http.postBuffer 1048576000')
            os.system(f'git clone --depth=1 {repo_url} {target_dir}')
        else:
            print(f"Directory {target_dir} already exists.")

    # Function to list the names of the files in a directory
    def get_file_name(self, dir):
        file_list = []
        for file in os.listdir(dir):
            if os.path.isfile(os.path.join(dir, file)):
                file_list.append(file)
        return file_list
    
    # Function to copy files from one directory to another
    def copy_dir(self, src_dir, dst_dir):
        file_list = self.get_file_name(src_dir)
        if not os.path.isdir(dst_dir):
            os.makedirs(dst_dir)
        if len(file_list) != 0:
            for file_name in file_list:
                src = os.path.join(src_dir, file_name)
                dst = os.path.join(dst_dir, file_name)
                print(f"Copying {src} to {dst}")
                self.move_or_copy_file(src, dst, False)
                
    # Function to move or copy a single file
    def move_or_copy_file(self, src_file, dst_file, move_file=True):
        try:
            if move_file:
                shutil.move(src_file, dst_file)
                print(f"{src_file} moved successfully.")
            else:
                shutil.copy(src_file, dst_file)
                print(f"{src_file} copied successfully.")
        # If source and destination are same
        except shutil.SameFileError:
            print("Source and destination represent the same file.")
        # If there is any permission issue
        except PermissionError:
            print("Permission denied.")
        # For other errors
        except Exception as e:
            print(f"Error occurred while copying file: {e}")

    # Function to handle read-only files
    def handle_remove_readonly(self, func, path, exc_info):
        os.chmod(path, stat.S_IWRITE)
        func(path)

    # Function to remove a directory with its files and folders inside it
    def remove_dir(self, dir_path):
        try:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path, onerror=self.handle_remove_readonly)
                print(f"Directory {dir_path} and all its contents have been removed successfully.")
            else:
                print(f"Directory {dir_path} does not exist.")
        except PermissionError:
            print(f"Permission denied while trying to remove {dir_path}.")
        except Exception as e:
            print(f"Error occurred while removing directory {dir_path}: {e}")

In [7]:
# Download the example and template files if they are missing (e.g. on Colab)
if not Path("../files/examples.txt").exists():
    Path('../files').mkdir(parents=True, exist_ok=True)  # make sure the target directory exists
    for file_name in ['examples.txt','prompt_template.txt','system_template.txt']:
        wget.download(f'https://raw.githubusercontent.com/wandb/edu/main/llm-apps-course/notebooks/{file_name}')
        Utils().move_or_copy_file(file_name, f'../files/{file_name}')

In [8]:
Utils().clone_repo('https://github.com/wandb/edu.git', '../edu')
Utils().copy_dir(src_dir='../edu/llm-apps-course/docs_sample', dst_dir='../files/docs_sample')
Utils().remove_dir('../edu')

Directory ../edu does not exist.


## Generating synthetic support questions
We add retry behavior in case we hit the [API rate limit](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb).

In [9]:
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.chat.completions.create(**kwargs)
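
To make the retry behavior concrete, here is a minimal standard-library sketch of random exponential backoff. It is an illustration of the idea, not tenacity's exact algorithm; `retry_with_backoff` and `flaky_request` are hypothetical names introduced only for this example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func(); on failure wait a random, exponentially growing time and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # the upper bound doubles with every attempt, capped at max_wait
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


# Hypothetical flaky call that fails twice before succeeding
calls = {'count': 0}


def flaky_request():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('rate limited')
    return 'ok'


waits = []
# record the waits instead of actually sleeping
result = retry_with_backoff(flaky_request, sleep=waits.append)
print(result, calls['count'], len(waits))
```

The random jitter spreads retries from concurrent clients over time, so they are less likely to hit the rate limit again at the same moment.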

In [10]:
MODEL = 'gpt-3.5-turbo'

## Prompting
![prompting levels](../images/prompting.png "Prompting Levels")

### Zero-shot prompting
In this type of prompting we give the model no examples and no context. We simply ask it to do some work (here, generating a support question).

In [11]:
system_prompt = 'You are a helpful assistant.'
user_prompt = 'Generate a support question from a W&B user'

def generate_and_print(system_prompt, user_prompt, n=5):
    messages = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt}
    ]
    response = completion_with_backoff(
        model=MODEL,
        messages=messages,
        n=n
    )
    # iterate over the n completions without shadowing `response`
    for choice in response.choices:
        generation = choice.message.content
        display(Markdown(generation))

generate_and_print(system_prompt, user_prompt)

### Few-shot prompting
Let's read some user-submitted queries from the file `examples.txt`.
This file contains multiline questions separated by tabs.

In [12]:
delimiter = '\t' # tab-separated queries
with open('../files/examples.txt', 'r', encoding='utf-8') as file:
    data = file.read()
    real_queries = data.split(delimiter)

pprint(f'We have {len(real_queries)} real queries:')
Markdown(f'Sample one: \n\'{random.choice(real_queries)}\'')

'We have 228 real queries:'


We can now use those real user questions to guide our model to produce synthetic questions like those.

In [13]:
def generate_few_shot_prompt(queries, n=3):
    prompt = 'Generate a support question from a W&B user\n Below you will find a few examples of real user queries:\n'
    for i in range(n):
        prompt += random.choice(queries) + '\n'
    prompt += "Let's start!"
    return prompt

generation_prompt = generate_few_shot_prompt(real_queries)
print(generation_prompt)

Generate a support question from a W&B user
 Below you will find a few examples of real user queries:
I am logging the score of my LightGBM regression model by doing `run = wandb.init(project=project_name)` and then `wandb.log({'dev_score': dev_score})`. The problem is that it is logged as a chart with step-score x-y axes, and I only want the scalar value. As I do not have steps, it is difficult to visualize the score. How can I add a Scalar chart instead?
how do i load the latest model from a specific project to continue training? Im using Pytorch.
how can i get all the versions of an aritfact of a particular type?
Let's start!
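
Calling `random.choice` inside a loop can pick the same query more than once. A minimal sketch of a variant that samples without replacement is shown here. The function name `generate_few_shot_prompt_unique`, the `seed` parameter, and the sample queries are assumptions made up for illustration.

```python
import random


def generate_few_shot_prompt_unique(queries, n=3, seed=None):
    """Build a few-shot prompt from n distinct example queries."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    n = min(n, len(queries))   # never ask for more examples than we have
    examples = rng.sample(queries, n)  # sampling without replacement
    prompt = 'Generate a support question from a W&B user\n'
    prompt += 'Below you will find a few examples of real user queries:\n'
    prompt += '\n'.join(examples) + '\n'
    prompt += "Let's start!"
    return prompt


# Made-up stand-ins for the real queries loaded from examples.txt
sample_queries = ['How do I log a scalar?', 'How do I resume a run?', 'How do I share a report?']
few_shot_prompt = generate_few_shot_prompt_unique(sample_queries, n=3, seed=0)
print(few_shot_prompt)
```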


In [14]:
generate_and_print(system_prompt=system_prompt, user_prompt=generation_prompt)

### Add Context & Response
Let's create a function that finds all the Markdown files in a directory and returns their paths and contents.

In [15]:
def find_md_files(directory):
    'Find all markdown files in a directory and return their content and path'
    md_files = []
    for file in Path(directory).rglob('*.md'):
        with open(file, 'r', encoding='utf-8') as md_file:
            content = md_file.read()
        md_files.append((file.relative_to(directory), content))
    return md_files

documents = find_md_files('../files/docs_sample')

Let's check whether the documents fit in our context window. To do that, we compute the number of tokens in each document.

In [16]:
# Check how long our documents are
tokenizer = tiktoken.encoding_for_model(MODEL)
tokens_per_document = [len(tokenizer.encode(document)) for _, document in documents]
pprint(tokens_per_document)

[365, 2596, 2940, 4179, 803, 1206, 537, 956, 2093, 2529, 1644]
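
The counts above can be compared against a token budget to see which documents would not fit. The 2048-token budget below is an assumption chosen for illustration, not a limit stated in this notebook.

```python
# Token counts printed above, one entry per document
tokens_per_document = [365, 2596, 2940, 4179, 803, 1206, 537, 956, 2093, 2529, 1644]

# Assumption: the budget we are willing to spend on documentation per prompt
MAX_CONTEXT_TOKENS = 2048

# keep (document index, token count) for every document over the budget
too_long = [(i, n) for i, n in enumerate(tokens_per_document) if n > MAX_CONTEXT_TOKENS]
print(f'{len(too_long)} of {len(tokens_per_document)} documents exceed {MAX_CONTEXT_TOKENS} tokens')
for index, n_tokens in too_long:
    print(f'  document {index}: {n_tokens} tokens')
```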


Some of them are too long. Instead of using entire documents, we'll extract a random chunk from each.

In [17]:
# extract a random chunk from a document
def extract_random_chunk(document, max_tokens=512):
    tokens = tokenizer.encode(document)
    if len(tokens) <= max_tokens:
        return document
    start = random.randint(0, len(tokens) - max_tokens)
    end = start + max_tokens
    return tokenizer.decode(tokens[start:end])
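
The windowing logic can be tried without a tokenizer. The sketch below applies the same idea to whitespace-separated words, which stand in for tiktoken ids. `extract_random_window` and its `rng` parameter are hypothetical names used only here.

```python
import random


def extract_random_window(tokens, max_tokens=512, rng=random):
    """Return a random contiguous slice of at most max_tokens items."""
    if len(tokens) <= max_tokens:
        return tokens
    # randint is inclusive on both ends, so the last valid start is len - max_tokens
    start = rng.randint(0, len(tokens) - max_tokens)
    return tokens[start:start + max_tokens]


# Stand-in "tokens": whitespace-separated words instead of tiktoken ids
words = 'the quick brown fox jumps over the lazy dog'.split()
window = extract_random_window(words, max_tokens=4, rng=random.Random(0))
print(' '.join(window))
```

Because the window starts at an arbitrary token, a chunk may begin or end mid-sentence. That is usually acceptable for generating questions, but worth keeping in mind when reading the results.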

Now we will use the extracted chunk to create a question that the document can answer. This way we generate questions that our current documentation is capable of answering.

In [18]:
def generate_context_prompt(chunk):
    prompt = 'Generate a support question from a W&B user\n The question should be answerable by provided fragment of W&B documentation.\n Below you will find a fragment of W&B documentation:\n'+\
    chunk + "\nLet's start!"
    return prompt

chunk = extract_random_chunk(documents[0][1])
generation_prompt = generate_context_prompt(chunk)

In [19]:
Markdown(generation_prompt)

In [20]:
generate_and_print(system_prompt, generation_prompt, n=3)

### Level 5 prompting
A complex directive that includes the following:
- Description of high-level goal
- A detailed bulleted list of sub-tasks
- An explicit statement asking LLM to explain its own output
- A guideline on how LLM output will be evaluated
- Few-shot examples

In [21]:
with open('../files/system_template.txt', 'r') as file:
    system_prompt = file.read()

In [22]:
Markdown(system_prompt)

In [23]:
with open('../files/prompt_template.txt', 'r') as file:
    prompt_template = file.read()

In [24]:
Markdown(prompt_template)

In [25]:
def generate_context_prompt(chunk, n_questions=3):
    questions = '\n'.join(random.sample(real_queries, n_questions))
    user_prompt = prompt_template.format(QUESTIONS=questions, CHUNK=chunk)
    return user_prompt

user_prompt = generate_context_prompt(chunk)
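
To illustrate how the `QUESTIONS` and `CHUNK` placeholders are filled, here is a minimal sketch with a made-up template. The real template lives in `../files/prompt_template.txt` and is not reproduced here; `fill_template` is a hypothetical helper that also guards against asking `random.sample` for more items than exist.

```python
import random

# Hypothetical template; the real one lives in ../files/prompt_template.txt
template = (
    'Here are some examples of real user questions:\n'
    '{QUESTIONS}\n\n'
    'Here is a fragment of documentation:\n'
    '{CHUNK}\n\n'
    "Let's start!"
)


def fill_template(template, queries, chunk, n_questions=3, rng=random):
    # random.sample raises ValueError when asked for more items than exist
    n_questions = min(n_questions, len(queries))
    questions = '\n'.join(rng.sample(queries, n_questions))
    return template.format(QUESTIONS=questions, CHUNK=chunk)


filled = fill_template(template, ['q1', 'q2'], 'W&B Reports can be shared.', n_questions=3)
print(filled)
```

One thing to watch with `str.format`: literal curly braces in the template itself must be doubled (`{{` and `}}`), while braces inside the substituted chunk are left untouched.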

In [26]:
Markdown(user_prompt)

In [27]:
def generate_questions(documents, n_questions=3, n_generations=5):
    questions = []
    for _, document in documents:
        chunk = extract_random_chunk(document)
        user_prompt = generate_context_prompt(chunk, n_questions)
        messages = [
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt}
        ]
        response = completion_with_backoff(
            model=MODEL,
            messages = messages,
            n = n_generations
        )
        questions.extend([choice.message.content for choice in response.choices])
    return questions

A note about the system role: for GPT-4 based pipelines you probably want to move part of the context prompt into the system message. As we are using gpt-3.5-turbo here, you can put the instructions in the user prompt. You can read more about this in the [OpenAI docs](https://platform.openai.com/docs/guides/chat-completions/overview).
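
The two layouts can be sketched as follows. `build_messages` is a hypothetical helper and the strings are placeholders; which layout works better for a given model is something to verify experimentally.

```python
def build_messages(instructions, task, instructions_in_system=True):
    """Return chat messages with the instructions in the system or the user turn."""
    if instructions_in_system:
        return [
            {'role': 'system', 'content': instructions},
            {'role': 'user', 'content': task},
        ]
    # keep a generic system message and prepend the instructions to the user turn
    return [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': f'{instructions}\n\n{task}'},
    ]


in_system = build_messages('Follow the guidelines.', 'Generate a question.')
in_user = build_messages('Follow the guidelines.', 'Generate a question.', instructions_in_system=False)
print(in_system[0]['content'])
print(in_user[1]['content'])
```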

In [28]:
import re
# Function to parse model generation and extract CONTEXT, QUESTION and ANSWER
def parse_generation(generation):
    lines = generation.split('\n\n')
    context = []
    question = []
    answer = []

    for line in lines:
        if 'CONTEXT:' in line:
            context_pattern = re.compile(r'^\s*\*{0,2}CONTEXT:?\*{0,2}:?\s*')
            line = context_pattern.sub('', line).strip()
            context.append(line)
        elif 'QUESTION:' in line:
            question_pattern = re.compile(r'^\s*\*{0,2}QUESTION:?\*{0,2}:?\s*')
            line = question_pattern.sub('', line).strip()
            question.append(line)
        elif 'ANSWER:' in line:
            answer_pattern = re.compile(r'^\s*\*{0,2}ANSWER:?\*{0,2}:?\s*')
            line = answer_pattern.sub('', line).strip()
            answer.append(line)
    
    context = '\n'.join(context)
    question = '\n'.join(question)
    answer = '\n'.join(answer)
    return context, question, answer
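
A condensed sketch of the same parsing idea, run on a made-up generation, shows the expected input and output shapes. `parse_sections` is a hypothetical variant: unlike `parse_generation`, it only accepts a label at the start of a paragraph.

```python
import re

# Matches an optional markdown-bold label such as "CONTEXT:" or "**QUESTION:**"
LABEL = re.compile(r'^\s*\*{0,2}(CONTEXT|QUESTION|ANSWER):?\*{0,2}:?\s*')


def parse_sections(generation):
    """Split a generation on blank lines and group paragraphs by their label."""
    sections = {'CONTEXT': [], 'QUESTION': [], 'ANSWER': []}
    for paragraph in generation.split('\n\n'):
        match = LABEL.match(paragraph)
        if match:
            # drop the label and keep the remaining text of the paragraph
            sections[match.group(1)].append(paragraph[match.end():].strip())
    return tuple('\n'.join(sections[key]) for key in ('CONTEXT', 'QUESTION', 'ANSWER'))


# Made-up generation in the format the parser expects
sample = 'CONTEXT: A user trains a model.\n\n**QUESTION:** How do I log metrics?\n\nANSWER: Call wandb.log.'
parsed = parse_sections(sample)
print(parsed)
```

In both versions, a paragraph without a label is dropped, so an answer that spans several blank-line-separated paragraphs keeps only its first paragraph.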

In [29]:
generations = generate_questions([documents[0]], n_questions=3, n_generations=5)

In [30]:
parse_generation(generations[1])

('A user is working on a collaborative project using Weights & Biases and is interested in sharing reports with team members or the public. They want to understand the process of sharing reports and managing permissions within team projects.',
 'How can I share a report in Weights & Biases and manage permissions within team projects?',
 "To share a report in Weights & Biases, you can select the **Share** button located in the upper right-hand corner of the report. From there, you have the option to provide an email account for invitation or copy a magic link. Users invited by email will need to log into Weights & Biases to view the report, while users given a magic link do not need to log in. Reports created within an individual's private project are only visible to that user until shared with a team or made public. In team projects, both the administrator and the member who created the report can toggle permissions between edit or view access for other team members. Team members with 

In [31]:
generations = generate_questions(documents, n_questions=3, n_generations=5)

In [32]:
parsed_generations = []
for generation in generations:
    context, question, answer = parse_generation(generation)
    parsed_generations.append({'context': context, 'question': question, 'answer': answer})

# let's convert parsed_generations to a pandas dataframe and save it locally
df = pd.DataFrame(parsed_generations)
csv_path = '../files/generated_examples.csv'
df.to_csv(csv_path, index=False)
# log df as a table to W&B for inter
wandb.log({'generated_examples': wandb.Table(dataframe=df)})

# log csv file as an artifact to W&B for later use
# artifact is a dataset and model versioning tool
artifact = wandb.Artifact('generated_examples', type='dataset')
artifact.add_file(csv_path)
wandb.log_artifact(artifact)

<Artifact generated_examples>
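
Since the parser returns empty strings when a generation does not follow the expected format, it may be worth filtering incomplete rows before reusing the dataset. A minimal sketch follows; the rows and the `REQUIRED` tuple are made up for illustration.

```python
# Made-up parsed rows; the second mimics a generation the parser could not split
rows = [
    {'context': 'c1', 'question': 'q1', 'answer': 'a1'},
    {'context': '', 'question': '', 'answer': ''},
    {'context': 'c3', 'question': 'q3', 'answer': ''},
]

# Assumption: a row is usable only if it has both a question and an answer
REQUIRED = ('question', 'answer')

complete = [row for row in rows if all(row[key].strip() for key in REQUIRED)]
dropped = len(rows) - len(complete)
print(f'kept {len(complete)} rows, dropped {dropped}')
```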

In [33]:
wandb.finish()