# DIGI405 - Generate a corpus with an LLM

Geoff Ford  
[https://geoffford.nz](https://geoffford.nz/)  

There are five examples in this notebook. Work through them and think about the relevance for your assignment.  

The last cells in the notebook demonstrate how to delete a corpus or zip a corpus for download.

Consult the [README](README.md) for detailed information and the [CHANGELOG](CHANGELOG.md) for changes.

Run this cell to install required Python libraries.

In [None]:
!pip install -r requirements.txt

Run the following cell to import relevant Python libraries used in this notebook and set the logging level.

In [None]:
import logging
import requests
import json
import time
import getpass
from slugify import slugify
import shutil
import os
import csv

logging.basicConfig(level=logging.INFO)

The [README](README.md) file discusses how to generate an OpenRouter API key. Configure the key by running this cell ...

In [None]:
OPENROUTER_API_KEY = getpass.getpass()

The following cell contains a function to query OpenRouter and generate LLM text. Just run it to make the function available. Change it if you know what you are doing.

Note: you can use other API endpoints compatible with the OpenAI chat completions endpoint by passing the relevant `api_url` to `query_llm` calls. For example, if you have software to run LLMs locally, like [Ollama](https://ollama.com/), you can specify its `api_url` (e.g. `http://127.0.0.1:11434/v1/chat/completions`).

In [None]:
def query_llm(prompt:str, # prompt to send to LLM
            model: str, # model name e.g. google/gemma-2-9b-it:free
            system_prompt: str = None, # system prompt to send to LLM
            max_tokens: int = 2048, # maximum number of tokens to generate (includes prompt tokens)
            response_format: str = None, # response format: json or None
            temperature: float = None, # temperature for sampling
            api_url: str = None # OpenAI completion endpoint compatible API to query, defaults to OpenRouter's API 
            ) -> str: # generated text from LLM call
    """ Query LLM with prompt """

    if api_url is None or api_url.strip() == '':
        api_url = "https://openrouter.ai/api/v1/chat/completions"
    
    if OPENROUTER_API_KEY is None:
        logging.error("OPENROUTER_API_KEY not set. Not querying llm.")
        return None
    api_key = OPENROUTER_API_KEY
    
    if prompt.strip() == '':
        logging.error('No prompt provided. Not querying llm.')
        return None
    
    messages = []
    if system_prompt is not None and system_prompt.strip() != '':
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    request_data = {
                "model": model, 
                "messages": messages,
                'max_tokens': max_tokens
                    }

    if temperature is not None:
        request_data['temperature'] = temperature
    
    if response_format == "json":
        request_data['response_format'] = {"type": "json_object"}
        
    text = None
    response = None # stays None if the request fails before any response arrives

    try:
        response = requests.post(
            url=api_url,
            headers={
                "Authorization": f"Bearer {api_key}",
            },
            data=json.dumps(request_data)
            )
        response.raise_for_status() 
        text = response.json()['choices'][0]['message']['content']
    except Exception as e:
        # covers network errors, HTTP error codes and responses without generated content
        logging.error(f"Error querying LLM: {e}")
        if response is not None:
            print(response.text) # show the raw API response to help with debugging
        raise

    return text


## Set the model (and a note about OpenRouter free models)

When using OpenRouter's free models it is possible that specific models will be unavailable at times. If you get errors querying OpenRouter you can look up the [message or error codes in their documentation](https://openrouter.ai/docs). The `query_llm` function will raise errors when the API responds with an error code or if the JSON data returned by the API does not include generated content. If you get an error when using a free model, it is likely that this is temporary. Changing to [another free model](https://openrouter.ai/models?max_price=0) will typically resolve the issue. Look for models with '(free)' in their name. The next cell is where you can set the model to use for generation.

In [None]:
model = 'meta-llama/llama-3-8b-instruct:free'
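
Because free models can be temporarily unavailable, one option is to try a short list of models in turn. The following is a minimal sketch and not part of the notebook's workflow: `query_with_fallback`, `query_fn` and the model names are hypothetical, and `fake_query` stands in for `query_llm` so the example runs without an API key.

```python
from typing import Callable, Optional, Sequence


def query_with_fallback(prompt: str,
                        models: Sequence[str],
                        query_fn: Callable[[str, str], Optional[str]]) -> tuple:
    """Try each model in order; return (model, text) for the first success."""
    last_error = None
    for candidate in models:
        try:
            text = query_fn(prompt, candidate)
            if text:  # treat None or empty output as a failure
                return candidate, text
        except Exception as e:  # e.g. the errors raised by query_llm
            last_error = e
    raise RuntimeError(f'All models failed. Last error: {last_error}')


# Stand-in for query_llm so the sketch runs offline (an assumption for illustration)
def fake_query(prompt: str, model: str) -> str:
    if model == 'unavailable-model:free':
        raise ConnectionError('model unavailable')
    return f'[{model}] response to: {prompt}'


used_model, text = query_with_fallback(
    'Hello', ['unavailable-model:free', 'working-model:free'], fake_query)
print(used_model)
```

In the notebook you could pass something like `lambda p, m: query_llm(p, m)` as `query_fn`. Keep in mind that every fallback attempt counts as a request against your rate limits.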

## Example 1: generate a corpus with one LLM prompt

This is a very basic example that uses a single prompt to generate a small corpus. Examples 2 and 3 will probably be more helpful for your assignment.

Set the path to save the corpus ...

In [None]:
corpus_path = 'example1-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

Below are the settings we will use to generate a tiny corpus of three documents. 

### Note about the temperature setting

The only sampling parameter implemented in the `query_llm` function is `temperature` ([video](https://www.youtube.com/watch?v=ezgqHnWvua8)). Feel free to implement [other parameters available in OpenRouter's API](https://openrouter.ai/docs/parameters) if you are confident doing this, but this is neither expected nor supported for class activities.  

Lower temperature values give similar or identical responses. Higher values produce more varied responses.
The default value of temperature is 1.0. It can vary between 0.0 and 2.0.
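
To build some intuition for why temperature changes variety, here is a small sketch of how temperature is commonly applied when turning a model's raw scores into probabilities. The scores are made up, `softmax_with_temperature` is a hypothetical helper, and the exact way any particular provider applies temperature is an assumption here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At lower temperatures the top-scoring token takes a larger share of the probability, so sampling tends to repeat the same choices. At higher temperatures the probabilities flatten out. This sketch divides by the temperature, so it does not handle a value of exactly 0.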

### Warning

You can change the number of texts to generate, but DON'T do this during the lab times as this may affect the class's ability to run the notebook. Remember there are limits (200 requests for free models per day, and a maximum of 20 requests per minute).

In [None]:
number_of_texts_to_generate = 3 # PLEASE LEAVE THIS FOR NOW!

system_prompt = '''
'''

prompt = '''
Write a short children's story imagining the future with AI.
'''

max_tokens = 1024

temperature = 1.5

response_format = ''

Running the next cell queries the API and generates text. A preview of each generated text is displayed and the generated text is saved as a .txt file.

In [None]:
for i in range(number_of_texts_to_generate):
    response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format, temperature = temperature)
    print(f'Text {i} preview: {response[0:200]} ...')
    
    filename = f'text-{i}.txt'
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    print('------')
        
    time.sleep(10) # always leave a delay!
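
The delay in the loop relates to the rate limits mentioned in the warning above. The arithmetic below is a rough sketch using those stated limits and the 10 second delay. It ignores the time each request itself takes, so real throughput will be lower.

```python
DAILY_LIMIT = 200       # requests per day for free models (from the warning above)
PER_MINUTE_LIMIT = 20   # requests per minute (from the warning above)
DELAY_SECONDS = 10      # the delay used in this notebook's loops

# Shortest delay that would still respect the per-minute limit
min_delay = 60 / PER_MINUTE_LIMIT

# Upper bound on requests per minute with the notebook's delay
requests_per_minute = 60 / DELAY_SECONDS

# Minutes of continuous generation needed to use up the daily limit
minutes_to_use_daily_limit = DAILY_LIMIT * DELAY_SECONDS / 60

print(f'Minimum delay between requests: {min_delay} seconds')
print(f'Requests per minute at a {DELAY_SECONDS} second delay: {requests_per_minute}')
print(f'Minutes to reach the daily limit: {minutes_to_use_daily_limit:.1f}')
```

The 10 second delay therefore leaves plenty of headroom under the per-minute limit, but a single long run can still use up the daily allowance.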

Inspect your txt files in the corpus path you specified above!

## Example 2: generate a corpus by seeding the prompt with some other generated data

In this example we first generate some data, and then we use this data to prompt the LLM. For this example the generated data is simply a title, but this could be more complex. For example, if you wanted to generate biographies with specific information, you could generate the required information (e.g. names of people, their life histories). If you wanted to create texts on a topic with very different points of view, you could generate a number of structured personas (e.g. name, where they live, job title, political allegiance) and then use this information as input to produce a variety of different perspectives.
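
As an illustration of the persona idea, the sketch below turns structured persona data into prompts. The personas, the field names and the `persona_to_prompt` helper are all hypothetical. In practice the persona data could come from an earlier LLM call.

```python
# Hypothetical persona data, made up for illustration
personas = [
    {'name': 'Aroha', 'location': 'Wellington', 'job_title': 'Nurse', 'allegiance': 'centre-left'},
    {'name': 'Sam', 'location': 'Timaru', 'job_title': 'Farmer', 'allegiance': 'centre-right'},
]


def persona_to_prompt(persona: dict, topic: str) -> str:
    """Build a user prompt that asks for a text from one persona's perspective."""
    return (
        f"Write a short opinion piece about {topic} from the point of view of "
        f"{persona['name']}, a {persona['job_title'].lower()} living in {persona['location']} "
        f"whose political allegiance is {persona['allegiance']}."
    )


prompts = [persona_to_prompt(p, 'artificial intelligence') for p in personas]

for p in prompts:
    print(p)
```

Each prompt could then be passed to `query_llm` in a loop, in the same way the titles are used later in this example.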

Set the path to save the corpus ...

In [None]:
corpus_path = 'example2-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

First, we are going to generate some data that we will then use to generate the corpus. In this case we are going to generate some titles of opinion pieces on AI. Note: here I am generating JSON data. The system prompt advises that JSON output is required, and the user prompt includes the required format. This may not always work with smaller models! If it doesn't, just run it again. If you wanted to generate more titles you would change the number in the user prompt. For now, leave it as it is.

In [None]:
system_prompt = '''
Always respond with JSON data.
'''

prompt = '''
Generate a list of 3 editorial titles about artificial intelligence. 
Vary the length and theme of each title.
JSON format: 
{
    "titles": [
        "title 1",
        "title 2"
    ]
}
'''

max_tokens = 4098

temperature = 1.0 

response_format = 'json'

Query the API to generate the titles ...

In [None]:
response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format, temperature = temperature)
print(response)

If the next cell runs and outputs the titles you have some valid JSON. If not, run the cell above again. Sometimes LLMs don't follow output instructions! If you were generating different structured data, you may have to modify this.  

In [None]:
try:
    seed_data = json.loads(response)
    for seed in seed_data['titles']:
        print(seed)
except json.JSONDecodeError as e:
    print('Error decoding JSON. Try regenerating. A lower temperature value may help.')
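
Some models wrap their JSON in extra text, such as an introduction or a closing remark, which makes `json.loads` fail. One possible workaround, sketched below, is to parse only the span from the first `{` to the last `}`. The `extract_json` helper is hypothetical and the sample response is made up. It assumes the response contains a single JSON object.

```python
import json


def extract_json(text: str) -> dict:
    """Parse the JSON object found between the first '{' and the last '}'."""
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end == -1 or end < start:
        raise ValueError('No JSON object found in the response.')
    return json.loads(text[start:end + 1])


# Made-up response with chatter around the JSON
sample = 'Here is the data:\n{"titles": ["A", "B"]}\nHope this helps!'

seed_data = extract_json(sample)
for seed in seed_data['titles']:
    print(seed)
```

`json.JSONDecodeError` is a subclass of `ValueError`, so catching `ValueError` covers both the case where no object is found and the case where the object is invalid.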


Now generate your tiny corpus of three documents based on the generated title data.  
Note: I have specified a system prompt, but there is no user prompt specified. The prompt for each editorial is one of the titles we generated above. 

In [None]:
system_prompt = '''
Write an editorial based on the title provided. The editorial should be written for a general audience.
It will appear in a major news outlet. Do not include the title as part of the output.
'''

max_tokens = 2048

temperature = 1.0 

response_format = ''

Run this to generate the corpus ...

In [None]:
for i, prompt in enumerate(seed_data['titles']): # enumerate provides the text number for the preview
    print(f'Generating text based on: {prompt}')
    response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format, temperature = temperature)
    print(f'Text {i} preview: {response[0:200]} ...')

    filename = slugify(prompt, max_length=25) + '.txt' # Note: this creates a nice filename from the title
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    print('-----------------')
    time.sleep(10) # always use a time delay
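
One thing to watch with truncated filenames: two titles that share their first 25 characters produce the same filename, so the second text overwrites the first. The sketch below shows the problem and one possible fix, prefixing an index as Example 4 does. `simple_slug` is a standard-library stand-in written for this sketch, not the real `slugify`, whose output may differ in detail. The titles are made up.

```python
import re


def simple_slug(text: str, max_length: int = 25) -> str:
    """Stdlib-only stand-in for slugify so this sketch has no dependencies."""
    slug = re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')
    return slug[:max_length].rstrip('-')


# Made-up titles that share their first 25 characters
titles = [
    'Artificial intelligence and the future of work',
    'Artificial intelligence and the future of art',
]

plain = [simple_slug(t) + '.txt' for t in titles]
numbered = [f'{i:03}-{simple_slug(t)}.txt' for i, t in enumerate(titles)]

print(plain)     # both filenames are identical
print(numbered)  # the index prefix keeps them distinct
```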

Inspect your txt files in the corpus path you specified above!

## Example 3: generate a corpus with prompts from a CSV file

If you are using another data source, whether that is scraped or a corpus you have found online, you may want to generate a comparable corpus using this data. For example, if you want to compare human vs generated news stories and have the human-authored stories collected, you could generate texts based on the titles of the human-authored texts.

Set the path to save the corpus ...

In [None]:
corpus_path = 'example3-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

Specify the CSV file name here and the field name you are using for prompting. You can also specify the encoding of the file. Leave encoding as it is for files encoded with UTF-8.  

The example file contains titles from some recent Radio New Zealand stories about AI.

In [None]:
csv_file = 'sample-for-example3.csv'
field_name = 'title'
encoding = 'utf-8-sig'
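
The `utf-8-sig` encoding matters because some tools (Excel's "CSV UTF-8" option is a common case) save files with a byte order mark (BOM) at the start. The sketch below uses made-up CSV bytes and a hypothetical `header_fields` helper to show what happens to the header row with each encoding.

```python
import csv
import io

# Made-up CSV content saved as UTF-8 with a byte order mark (BOM)
raw = b'\xef\xbb\xbftitle,url\nAI in schools,https://example.com/1\n'


def header_fields(raw_bytes: bytes, encoding: str) -> list:
    """Decode CSV bytes with the given encoding and return the header fields."""
    text = raw_bytes.decode(encoding)
    reader = csv.DictReader(io.StringIO(text, newline=''))
    return list(reader.fieldnames)


for encoding in ('utf-8', 'utf-8-sig'):
    print(encoding, header_fields(raw, encoding))
```

With plain `utf-8` the BOM becomes part of the first field name, so a check for `title` in the header would fail. With `utf-8-sig` the BOM is removed, and files without a BOM are still read correctly.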

Read the data for field_name and preview it ...

In [None]:
with open(csv_file, 'r', encoding = encoding, newline = '') as file:
    csv_reader = csv.DictReader(file)
    
    header = csv_reader.fieldnames
    print(f'Header fields: {header}')
    if field_name not in header:
        print(f'The field name {field_name} is not in the header row!')
    else:
        seed_data = []
        for i, row in enumerate(csv_reader):
            seed_data.append(row[field_name]) # use the configured field rather than a hard-coded name
            
print(f'Data for prompting: {seed_data}')

Now we will generate a tiny corpus of three documents based on this title field.  
Note: we are specifying a system prompt, but there is no separate user prompt.  
The prompt for each story is a title read from the CSV file above.  

In [None]:
system_prompt = '''
Write a news story for Radio New Zealand based on the title provided. This should be written for a general audience. 
Do not include the title as part of the output.
'''

max_tokens = 2048

temperature = 1.5 

response_format = ''

Run this to generate the corpus ...

In [None]:
for i, prompt in enumerate(seed_data): # enumerate provides the text number for the preview
    print(f'Generating text based on: {prompt}')
    response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format, temperature = temperature)
    print(f'Text {i} preview: {response[0:200]} ...')

    filename = slugify(prompt, max_length=25) + '.txt' # Note: this creates a nice filename from the title
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    print('-----------------')
    time.sleep(10) # always use a time delay

## Example 4: interactive chatbot-like example, generating text based on memory of previously generated text

This example illustrates how to use previously generated text as context to generate an interactive chat exchange.

Set the path to save the corpus ...

In [None]:
corpus_path = 'example4-corpus/'

Check if the path exists, if not create it!

In [None]:
if not os.path.exists(corpus_path):
    print(f'Creating path: {corpus_path}')
    os.makedirs(corpus_path)
else:
    print(f'Path already exists: {corpus_path}')

Below are the settings we will use to generate a tiny corpus of five documents. 

In [None]:
number_of_texts_to_generate = 5 # PLEASE LEAVE THIS FOR NOW!

system_prompt = '''
Generate exchanges in an imaginary panel discussion on the political consequences of artificial intelligence. 
The people involved in the discussion are:
- Mr Big, Fortune 500 CEO
- Ms X, Political Activist
- Dr Y, Former AI researcher turned tech bro podcaster
- Ms Z, Journalist leading the discussion

For each exchange in the panel discussion, choose a different person to speak and generate their response to the prompt based on the context 
provided. 

The generated text should have the person's name on the first line in all caps, then a new line, then their response. 

<PERSON NAME>: 
<person's response>

Only one person should respond to a prompt.

CONTEXT:
'''

prompt = '''
MS Z: 
I've been following the debates, and it's clear that there are sharp divisions. There are those who see AI as an opportunity 
and others who see it as a threat. What are the political consequences of artificial intelligence?
'''

max_tokens = 10000

temperature = 1.0

response_format = ''

Running the next cell queries the API and generates text. A preview of each generated text is displayed and the generated text is saved as a .txt file.

In [None]:
context = ''

print(f'Initial question: {prompt[0:200]} ...')
print('------')

for i in range(number_of_texts_to_generate):
    system_prompt_with_context = system_prompt + context
    response = query_llm(prompt, model, system_prompt_with_context, max_tokens, response_format = response_format, temperature = temperature)
    print(f'Text {i} preview: {response[0:200]} ...')
    
    probable_speaker = response.split('\n')[0]
    if probable_speaker.isupper():
        speaker = probable_speaker
    else:
        speaker = 'UNKNOWN'

    filename = f'{i:03}-{slugify(speaker, max_length = 25)}.txt'
    with open(os.path.join(corpus_path, filename), 'w', encoding='utf8') as f:
        print(f'Saving to {os.path.join(corpus_path, filename)}')
        f.write(response)
    
    context += prompt + '\n'
    prompt = response
    
    print('------')
        
    time.sleep(10) # always leave a delay!
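
To see how the loop above builds up memory, here is an offline sketch of the same pattern. `fake_llm` is a hypothetical stand-in for `query_llm` that simply reports how many earlier turns it was given, and the prompts are made up.

```python
def fake_llm(prompt: str, system_prompt: str) -> str:
    """Stand-in for query_llm: reports how much context it was given."""
    turns_in_context = system_prompt.count('SPEAKER')
    return f'SPEAKER {turns_in_context + 1}:\nReply after seeing {turns_in_context} earlier turn(s).'


system_prompt = 'CONTEXT:\n'
prompt = 'SPEAKER 0:\nOpening question.'
context = ''
transcript = []

for i in range(3):
    response = fake_llm(prompt, system_prompt + context)
    transcript.append(response)
    context += prompt + '\n'  # the turn just answered becomes part of the context
    prompt = response         # the new response is the next turn to answer

for turn in transcript:
    print(turn)
    print('------')
```

Each request carries all earlier turns in the system prompt, so the number of prompt tokens grows with every exchange. For long discussions this may eventually run into a model's context limit.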

Inspect your txt files in the corpus path you specified above!

## Example 5: generate structured data and save to a CSV file

You may want to save some generated structured data from an LLM. You could save the JSON response directly, but if you want a format you can open and edit in Excel, CSV is a helpful choice.

In this bonus example, I generate structured data for members of an imaginary focus group. Each 'voter' has a name, age, location where they live, a job title and a party they favour.

Set the filename you want to save your data to. You should probably leave the encoding as it is.

In [None]:
csv_file = 'example5.csv'
encoding = 'utf-8-sig'

Here are the prompts and other settings used to generate the structured data.

If you modify the prompt, make sure you are generating an array of data formatted in a similar way. The rest of this example relies on the generated JSON being an array of objects.

In [None]:
system_prompt = '''
Always respond with JSON data.
'''

prompt = '''
Generate a list of 3 New Zealand voters.
Each voter has a name, age, location where they live, job title, and a political party they support. 
Valid parties are National, Labour, Green, ACT, New Zealand First, Other, and Undecided.
JSON format: 
{
    "voters": [
        {
            "name": "",
            "age": "",
            "location": "",
            "job_title": "",
            "party": ""
        }
    ]
}
'''

max_tokens = 4098

temperature = 1.0 

response_format = 'json'

Query the API to generate the voter data ...

In [None]:
response = query_llm(prompt, model, system_prompt, max_tokens, response_format = response_format, temperature = temperature)
print(response)

If the next cell runs and outputs the expected data you have some valid JSON and your CSV file will be populated. If not, run the cell above again. Sometimes LLMs don't follow output instructions! If you were generating different structured data, you may have to modify this. 

In [None]:
try:
    seed_data = json.loads(response)
    voters = seed_data['voters']
except (json.JSONDecodeError, KeyError, TypeError) as e:
    print(f'Error reading JSON ({e}). Try regenerating. A lower temperature value may help.')
else:
    # only open (and overwrite) the CSV file once the JSON has been read successfully
    with open(csv_file, 'w', encoding = encoding, newline = '') as file:
        for i, seed in enumerate(voters):
            if i == 0:
                # use the keys of the first object as the CSV header
                fieldnames = list(seed.keys())
                print(f'Field names: {fieldnames}')
                csv_writer = csv.DictWriter(file, fieldnames = fieldnames)
                csv_writer.writeheader()

            csv_writer.writerow(seed)
            print(seed)


Inspect your CSV file.
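
LLM output does not always include every requested key, and sometimes adds extra ones. The sketch below shows one way to make the CSV writing more tolerant: fix the column names in advance and let `csv.DictWriter` fill in blanks and ignore unexpected keys. The `voters_to_csv` helper and the sample response are hypothetical, and the column names follow the format requested in the prompt above.

```python
import csv
import io
import json

FIELDNAMES = ['name', 'age', 'location', 'job_title', 'party']

# Made-up response: the second voter lacks 'party' and has an extra key
response_text = json.dumps({'voters': [
    {'name': 'Voter One', 'age': '34', 'location': 'Napier',
     'job_title': 'Teacher', 'party': 'Undecided'},
    {'name': 'Voter Two', 'age': '51', 'location': 'Gore',
     'job_title': 'Farmer', 'nickname': 'V2'},
]})


def voters_to_csv(text: str) -> str:
    """Convert a JSON response with a 'voters' array into CSV text."""
    voters = json.loads(text)['voters']
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES,
                            restval='',             # blank for missing keys
                            extrasaction='ignore')  # skip unexpected keys
    writer.writeheader()
    writer.writerows(voters)
    return buffer.getvalue()


csv_text = voters_to_csv(response_text)
print(csv_text)
```

Writing to an in-memory buffer first also means you can inspect the result before saving anything to disk.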

## Delete your corpus (if you need to)

If you made a mistake or want to remove your corpus files for some other reason, you can use this part of the notebook. 

First, make sure the `corpus_path` below matches the corpus files you want to delete! 

In [None]:
corpus_path = 'example1-corpus/' # set this to whatever path makes sense!

### Warning: this will delete files!

If you want to delete your corpus files, change `i_want_to_delete_my_files` from `False` to `True` and run the cell. 

Make sure you change it back again afterwards so you don't run it accidentally and delete your files!

In [None]:
i_want_to_delete_my_files = False

if i_want_to_delete_my_files:
    for filename in os.listdir(corpus_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(corpus_path, filename)
            os.remove(file_path)
            print(f"Deleted: {filename}")

## Save a zip file of your corpus

If you want to download your corpus, you can zip it and download a .zip file. Set the filename for the zip file using `corpus_file_name`. Make sure the `corpus_path` matches the location of the files you want to download.  

In [None]:
corpus_file_name = 'example1-corpus.zip'
corpus_path = 'example1-corpus/'

shutil.make_archive(corpus_file_name[:-4], 'zip', corpus_path)
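
If you want to confirm what ended up in a zip file, the standard library's `zipfile` module can list its contents. The sketch below is self-contained: it builds a throwaway corpus in a temporary directory (the paths and file contents are made up), zips it with `shutil.make_archive`, and lists the archive. It uses `os.path.splitext` to drop the `.zip` extension, an alternative to slicing the filename.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Create a tiny throwaway corpus
    corpus_path = os.path.join(workdir, 'demo-corpus')
    os.makedirs(corpus_path)
    for i in range(2):
        with open(os.path.join(corpus_path, f'text-{i}.txt'), 'w', encoding='utf8') as f:
            f.write(f'Document {i}')

    # make_archive adds the .zip extension itself, so pass the name without it
    corpus_file_name = os.path.join(workdir, 'demo-corpus.zip')
    base_name = os.path.splitext(corpus_file_name)[0]
    archive_path = shutil.make_archive(base_name, 'zip', corpus_path)

    # List the files stored in the archive
    with zipfile.ZipFile(archive_path) as zf:
        zipped_files = sorted(zf.namelist())

print(zipped_files)
```

Note that the zip file is written outside the folder being archived, which avoids the archive trying to include itself.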