Use GPT4 and GPT3.5 on inputs of unlimited size. Uses multithreading to process multiple chunks in parallel. Useful for tasks like Named Entity Recognition, information extraction on large books, datasets, etc.
Use cases:
- Translating a large body of text
- Extracting geographic entities from a book on the history of wars
- Summarizing a long article, textbook, or other file bit by bit.
It is designed to handle large files that may exceed OpenAI's token limits if processed as a whole. The script splits the input file into manageable pieces and sends each chunk to the OpenAI API separately at the same time. The responses are then collected and saved into an output file.
If the OpenAI rate limit is reached, the code uses exponential backoff with jitter to keep retrying until success. It is by default set to give up after three failures.
Usage:
python main.py -i INPUT_FILE -o OUTPUT_FILE -l LOG_FILE -m MODEL -c CHUNKSIZE -t TOKENS -v TEMPERATURE -p PROMPT
Skip to bottom for usage instructions.
- Python 3.6 or above
- OpenAI API key (either set using echo or hard-code into the main.py script)
- Basic understanding of the command-line interface (Terminal for macOS and Linux, CMD or PowerShell for Windows)
- Clone the GitHub repository to your local machine. (OR might just be easier to download the main.py file and use it directly)
git clone https://github.com/your_username/openai-text-processor.git
- Change directory to the cloned repository.
cd openai-text-processor
- Install the required packages.
openai
tiktoken
tqdm
- The script requires an OpenAI API key, which should be set as an environment variable. You can do this in bash by running the following command:
export OPENAI_KEY=your_openai_key
Replace your_openai_key
with your actual OpenAI API key.
Note: The way to set environment variables can vary depending on your operating system and shell. Please consult the appropriate documentation if the above method does not apply to your situation.
You can use the OpenAI Text Processor through the command-line interface. The usage is as follows:
python main.py -i INPUT_FILE -o OUTPUT_FILE -l LOG_FILE -m MODEL -c CHUNKSIZE -t TOKENS -v TEMPERATURE -p PROMPT
Where:
INPUT_FILE
is the path to the input file. This argument is required.OUTPUT_FILE
is the path to the output file. This argument is required.LOG_FILE
is the path to the log file. This argument is required.MODEL
is the OpenAI model to use (default is 'gpt-3.5-turbo-0301'). Alternative: gpt-4-0314. Better quality but slower and more expensive.CHUNKSIZE
is the maximum number of tokens per chunk (default is 1000). This shouldn't be too large (>4000) or OpenAI will be overloaded. A safe size is under 3000 tokens. Your prompt length also counts for the OpenAI token limit.TOKENS
is the maximum tokens per API call (default is 100). shorter will be faster. but could terminate too early.TEMPERATURE
is the variability (temperature) for OpenAI model (default is 0.0). 0.0 is probably best if you are going for highest accuracyPROMPT
is the prompt for the OpenAI model. This argument is required. Counts towards the 4k token limit for OpenAI API calls.
python main.py -i input.txt -o output.txt -l log.txt -m 'gpt-3.5-turbo' -c 500 -t 200 -v 0.5 -p 'Translate English to French:'
This will process the file input.txt
, using the model 'gpt-3.5-turbo', a chunk size of 500 tokens, a maximum of 200 tokens per API call, a temperature of 0.5, and the prompt 'Translate English to French:'. The results will be saved in output.txt
and the logs in log.txt
.
Inspired by https://github.com/emmethalm/infiniteGPT from Emmet Halm.