# My own attempt at parallelizing OpenAI API requests

Found this after I made an earlier version of this: https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py 

The cookbook script requires data in JSONL format, which is inconvenient at this time. I might move to it in the future to take advantage of being able to set rate limits explicitly and abide by them. In this implementation, I'm eyeballing the rate/token limits and trying to stay within 50-75% of each so the results don't get messed up... ymmv
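
For a rough sanity check on the eyeballing, the worker count can be estimated from the limits instead of guessed. This is only a sketch, not something this notebook uses: `estimate_workers` is a hypothetical helper, and every number in the example call is made up, so plug in the limits from your own account and the latencies and token counts you actually observe.

```python
import math


def estimate_workers(rpm_limit: float, tpm_limit: float, avg_tokens: float,
                     avg_latency_s: float, target_fraction: float = 0.6) -> int:
    """Estimate how many concurrent requests keep usage near target_fraction of the limits."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    # Requests per minute allowed by each limit at the target utilization
    by_requests = rpm_limit * target_fraction
    by_tokens = tpm_limit * target_fraction / avg_tokens
    sustainable_rpm = min(by_requests, by_tokens)
    # Little's law: requests in flight = request rate * time each request takes
    workers = sustainable_rpm * avg_latency_s / 60
    return max(1, math.floor(workers))


# Hypothetical limits and averages, for illustration only
workers = estimate_workers(rpm_limit=3500, tpm_limit=90000,
                           avg_tokens=2500, avg_latency_s=40)
print(workers)
```

The estimate is only as good as the averages fed into it, so it is a starting point for `max_workers` rather than a guarantee of staying under the limits.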

In [2]:
from bs4 import BeautifulSoup
import re
import os
import openai
import pickle
import time
import json
import random
from itertools import chain
import concurrent.futures
import requests

"""Load in data"""

test = True # sample or run on full dataset?
sample_size = 4 # if test=True

with open('data/scraper_output.p', 'rb') as f:
    scape_output = pickle.load(f)
openai.api_key = os.getenv("OPENAI_API_KEY")

if test: 
    keys = random.sample(list(scape_output.keys()), sample_size)
    scape_output = {key: scape_output[key] for key in keys}

# Things to note.

This step specifically has room for improvement, and changes here could improve the generated dataset. More recently, the paper [Textbooks Are All You Need](https://arxiv.org/pdf/2306.11644.pdf) suggests that data quality may in fact be the key to success in projects like this. Knowing that, here are some ideas to improve the data generation process in this project.

1. Clean up the context provided to GPT more. Not much effort was put into this step, under the presupposition that GPT would be able to interpret it regardless. In practice, that did turn out to be true. However, it is difficult to gauge how much of an effect, if any, the excess text from buttons and such had on the quality of responses. It is possible that by removing it and instead feeding in something similar to the chunks generated for the vector database (see [relevant code](./wrangling%20for%20vectordb.py)), we could get better quality responses. This could also create more diverse questions, or at least less similar ones. I hypothesize that since there is contact info at the bottom of each page, GPT could easily be writing questions about contact info many times when it struggles to reach its target question count on short webpages.
2. Use GPT-4 for higher quality questions.
3. Use more targeted webpages, i.e. avoid professor webpages and try to keep it as general as possible.
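
As one way to approach the first idea, lines that show up on most pages (footers, contact info, button text) could be dropped before the text is sent to GPT. This is an untested sketch and not part of the pipeline here: `drop_shared_lines` and the `max_share` threshold are hypothetical, and the three pages below are made-up examples standing in for the cleaned `context` strings.

```python
from collections import Counter


def drop_shared_lines(pages: dict[str, str], max_share: float = 0.5) -> dict[str, str]:
    """Remove lines that appear on more than max_share of the pages."""
    # Count on how many pages each distinct (stripped) line appears
    line_pages = Counter()
    for text in pages.values():
        line_pages.update({line.strip() for line in text.splitlines() if line.strip()})

    threshold = max_share * len(pages)
    cleaned = {}
    for url, text in pages.items():
        kept = [
            line for line in text.splitlines()
            if line.strip() and line_pages[line.strip()] <= threshold
        ]
        cleaned[url] = "\n".join(kept)
    return cleaned


# Made-up page texts; the contact line is repeated on every page
pages = {
    "page_a": "Apply by May 1\nContact us: 555-0100",
    "page_b": "Housing is required for first-year students\nContact us: 555-0100",
    "page_c": "Fees are listed per semester\nContact us: 555-0100",
}
cleaned = drop_shared_lines(pages)
```

With only a handful of pages (for example when `test = True`), the threshold is too coarse to be meaningful, so this would only make sense on the full scrape.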


In [3]:
def process_item(key: str, value: requests.models.Response, question_count: int) -> tuple[str, str]:
    """
    Generates a set of questions based on the HTML provided. Not much effort was put into cleaning
    the HTML before passing it to GPT-3.5 (there's a lot of excess text from buttons and such);
    the idea is that GPT-3.5 will handle the excess and still return a good result. In practice, this
    has seemed to work, but question generation could likely be improved by cleaning up this part
    specifically.

    Args:
        key: URL of the response webpage
        value: Response object from scraping
        question_count: How many questions GPT-3.5 should aim for

    Returns:
        The URL (key) and the raw text of the GPT-3.5 response
    """
    start_time = time.time()

    # Clean up context to give gpt
    soup = BeautifulSoup(value.content, "html.parser")
    context = re.sub('[\n]+', '\n', soup.text.strip())

    # Main portion of the prompt
    prompt = f"""Based on the cleaned HTML given below, generate as many questions as possible with their answers.
    Try to make the questions relevant from the perspective of a prospective or current student, as well as faculty and staff.
    Format your response in JSON, with the "instruction" field containing the question, an empty "input" field, and the answer in the "output" field.
    Include up to {question_count} questions, each being a sentence or two long in length. Do not include question number.
    Keep answers somewhat brief, but be enthusiastic in your response!\n\n"""
    
    # API call
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613", # Try GPT-4?
        messages=[
            {"role": "system", "content": "You are a helpful question answer generator."},
            {"role": "user", "content": f"{prompt}\n This is the cleaned HTML: \n{context}\n Start:."},
        ]
    )

    qa = completion.choices[0].message.content
    tokens = completion.usage["total_tokens"]

    print(f"Success! Complete in {time.time() - start_time:.2f}s with {tokens} tokens for {key}")

    return key, qa

# Parallelize Requests

Since the ChatCompletion API does not yet have support for batching, sending in many requests is slow if done one at a time. To overcome this, the step below parallelizes the requests. Basically, up to 64 requests will go through concurrently, and each prints a line as it completes. I chose 64 because it seems to use the majority of my request limit without going over. If this is ever used for other projects, consider adjusting this number based on the size of your prompt relative to the one used here.

In [5]:
gpt_output = {}
# How many questions should GPT aim to make per webpage. Default: 25
question_count = 25

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
    future_to_item = {executor.submit(process_item, key, value, question_count): key for key, value in scape_output.items()}
    for future in concurrent.futures.as_completed(future_to_item):
        key = future_to_item[future]
        try:
            key, qa = future.result()
            gpt_output[key] = qa
        except Exception as exc:
            print('%r generated an exception: %s' % (key, exc))

Success! Complete in 27.04s with 1559 tokens for https://www2.brockport.edu/live/profiles/5237-miscellaneous-student-fees-fees-rentals-and-other
Success! Complete in 40.20s with 4097 tokens for https://www2.brockport.edu/life/residential_life/residency_requirement
Success! Complete in 42.37s with 1985 tokens for https://www2.brockport.edu/support/policies/adopted/aa_graduate_fulltime_status_enrollment_verification.html
Success! Complete in 49.09s with 2251 tokens for https://www2.brockport.edu/live/profiles/5386-equity-in-athletics-disclosure-act-policy
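
The loop above only prints an exception when a request fails, so that page ends up missing from `gpt_output`. One option, sketched here as an assumption rather than something this notebook does, is to wrap each call in a retry with exponential backoff. `call_with_backoff` and its parameters are hypothetical names, and `flaky` is a stand-in for `process_item` so the example runs without touching the API.

```python
import random
import time


def call_with_backoff(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff plus jitter on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller's except block report it
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus up to 1s of jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.random())


# Stand-in for process_item: fails twice, then succeeds
attempts = {"count": 0}

def flaky(x):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated failure")
    return x * 2

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, 21, sleep=delays.append)
```

In the executor this would be submitted as `executor.submit(call_with_backoff, process_item, key, value, question_count)`. In practice you would probably want to retry only on rate limit or timeout errors rather than on every exception.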


In [4]:
with open('data/gpt_output.p', 'wb') as f:
    pickle.dump(gpt_output, f)

# Parse GPT output to JSON

While the prompt above does ask for the question/answer pairs in JSON format, GPT-3.5 does not always produce it perfectly. However, it always gets close. Instead of trying to fix the JSON output from GPT, in this step I'm using regex to pull out all the instructions (questions) and outputs (answers). These are collected into a Python list of dictionaries, which is extended for each webpage. Eventually I shuffle this so the questions are all mixed up instead of grouped by webpage, and dump it to a JSON file.

For any questions which seem off, investigate the original webpage. I've left gpt_output as a python dictionary specifically for this reason, so that we can always refer back to the data and know exactly where it came from.

In [3]:
def generate_json(gpt_output: dict[str, str], filename: str) -> None:
    """
    Parses GPT output into a JSON file. This function uses regex to parse for all the instructions (questions), and 
    outputs (answers) from the GPT output which is a dictionary. The parsed data is returned into a Python list of 
    dictionaries, which is appended for each webpage. This list is shuffled to mix up the questions and then dumped 
    to a JSON file.

    Args:
        gpt_output (dict): The dictionary containing the GPT output.
        filename (str): The JSON filename to write the parsed and shuffled list of dictionaries to.
    """

    # The regular expression pattern for a JSON object with "instruction" and "output"
    pattern = r'"instruction":\s*"(.*?)",.*?"output":\s*"(.*?)"'

    def extract_data(s):
        matches = re.findall(pattern, s, flags=re.DOTALL)
        # Keep only pairs where both fields are non-empty and contain no stray double quotes
        data = [{"instruction": m[0], "output": m[1]} for m in matches if m[0] and m[1] and '"' not in m[0] and '"' not in m[1]]
        return data

    jsonqa = []

    for value in gpt_output.values():
        clean_value = extract_data(value)
        jsonqa.append(clean_value)

    jsonqa = list(chain(*jsonqa))

    random.shuffle(jsonqa)

    # Write to a JSON file
    with open('data/' + filename + '.json', 'w') as f:
        json.dump(jsonqa, f, indent=4)  # Dump the entire list at once
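
To see what the pattern in `generate_json` does on an imperfect response, here is a small standalone example. The `sample` string is made up for illustration (it is not real GPT output): each object looks fine, but there is no enclosing list and there is a trailing comma, so `json.loads` would reject it while the regex still recovers both pairs.

```python
import json
import re

# Same pattern and filter as in generate_json
pattern = r'"instruction":\s*"(.*?)",.*?"output":\s*"(.*?)"'

# Made-up response that is close to JSON but not valid
sample = '''{"instruction": "When is the deadline?", "input": "", "output": "May 1!"},
{"instruction": "Who can I contact?", "input": "", "output": "The admissions office."},'''

try:
    json.loads(sample)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

matches = re.findall(pattern, sample, flags=re.DOTALL)
pairs = [
    {"instruction": m[0], "output": m[1]}
    for m in matches
    if m[0] and m[1] and '"' not in m[0] and '"' not in m[1]
]
print(parsed_as_json, pairs)
```

One limitation worth knowing: because the pattern stops at the first double quote, an answer containing an escaped quote (`\"`) gets cut off at that point, and the filter does not catch it.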

In [9]:
with open('data/gpt_output.p', 'rb') as f:
    gpt_output = pickle.load(f)

generate_json(gpt_output, "full_dataset_v4")

# Small Dataset

Since training will take a while on the full dataset, we can pick out a fraction of the data to train on first. This is helpful for mocking up the end state and for experimenting with how training reacts to different data. In this case, I am filtering to all URLs which contain ".edu/admissions/" (basically just pages directly under the admissions page).

In [12]:
filtered_urls = [url for url in gpt_output.keys() if '.edu/admissions/' in url]
print("Filtering to", len(filtered_urls), "webpages (" + str(round(len(filtered_urls)/ len(gpt_output) * 100, 2)) + "% of full dataset)")

filtered_dict = {link : gpt_output[link] for link in filtered_urls}

generate_json(filtered_dict, "admissions_dataset")

Filtering to 117 webpages (2.21% of full dataset)
