# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific subtopics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the generated topics by reprompting the model so that the topic assignment is grounded in the topic list. 

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- Our package supports OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. 
- Please refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex API pricing. 

In [None]:
# Run in a shell (prefix the pip command with `!` to run it from a notebook cell instead)
pip install --upgrade topicgpt_python

# Needed only for the OpenAI API deployment
export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1

# Needed only for Gemini deployment
export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
export AZURE_OPENAI_API_KEY={your_azure_api_key}
export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
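
Before running the pipeline, it can help to confirm that the variables for your chosen backend are actually set. The following is a minimal sketch, not part of `topicgpt_python`: the variable names come from the setup cell above, while the backend labels used as dictionary keys and the helper `missing_env_vars` are hypothetical.

```python
import os

# Environment variables named in the setup cell, grouped by backend.
# The backend labels used as keys are illustrative, not package API values.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env_vars(backend, environ=None):
    """Return the required variables for `backend` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[backend] if not environ.get(name)]


if __name__ == "__main__":
    for backend in REQUIRED_ENV:
        print(backend, "missing:", missing_env_vars(backend))
```

An empty list for your backend means every variable it needs is present in the current environment.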

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 

- Prepare your `.jsonl` data file in the following format:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
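
If your documents start out as Python objects, a file in this format could be produced as follows. This is a small sketch using only the standard library; the helper names `write_jsonl` and `read_jsonl` are hypothetical, and the only field treated as required is `text`, as in the format above.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line; every record must have a 'text' field."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            if "text" not in record:
                raise ValueError("each record needs a 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl([{"id": "1", "text": "A bill about trade."}], "data/input/my_data.jsonl")` writes a one-document file (the path is only an illustration and its directory must already exist).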

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

In [3]:
print(config)

{'verbose': True, 'data_sample': 'data/input/sample.jsonl', 'generation': {'prompt': 'prompt/generation_1.txt', 'seed': 'prompt/seed_1.md', 'output': 'data/output/sample/generation_1.jsonl', 'topic_output': 'data/output/sample/generation_1.md'}, 'refining_topics': True, 'refinement': {'prompt': 'prompt/refinement.txt', 'output': 'data/output/sample/refinement.jsonl', 'topic_output': 'data/output/sample/refinement.md', 'mapping_file': 'data/output/sample/refinement_mapping.json', 'remove': True}, 'generate_subtopics': True, 'generation_2': {'prompt': 'prompt/generation_2.txt', 'output': 'data/output/sample/generation_2.jsonl', 'topic_output': 'data/output/sample/generation_2.md'}, 'assignment': {'prompt': 'prompt/assignment.txt', 'output': 'data/output/sample/assignment.jsonl'}, 'correction': {'prompt': 'prompt/correction.txt', 'output': 'data/output/sample/assignment_corrected.jsonl'}}
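
A quick sanity check on the loaded configuration can catch a missing path before a long run starts. The sketch below is hypothetical and not part of the package; the expected keys are copied from the printed config above.

```python
# Keys taken from the printed config; the helper itself is hypothetical.
EXPECTED_KEYS = {
    "generation": ["prompt", "seed", "output", "topic_output"],
    "refinement": ["prompt", "output", "topic_output", "mapping_file", "remove"],
    "generation_2": ["prompt", "output", "topic_output"],
    "assignment": ["prompt", "output"],
    "correction": ["prompt", "output"],
}


def missing_config_keys(config):
    """Return dotted names of expected keys that are absent from `config`."""
    missing = [] if "data_sample" in config else ["data_sample"]
    for section, keys in EXPECTED_KEYS.items():
        block = config.get(section) or {}
        missing += [f"{section}.{key}" for key in keys if key not in block]
    return missing
```

Calling `missing_config_keys(config)` on the config shown above should return an empty list.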


### Topic Generation 
Generate high-level topics using `generate_topic_lvl1`. 
- Define the API type and model. 
- Define your seed topics in `prompt/seed_1.md`.
- (Optional) Modify few-shot examples in `prompt/generation_1.txt`.
- Expect the generated topics in `data/output/{data_name}/generation_1.md` and `data/output/{data_name}/generation_1.jsonl`.
- Right now, early stopping is set to 100, meaning that if no new topic has been generated in the last 100 iterations, the generation process will stop.
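
The early-stopping rule described in the last bullet can be pictured with the sketch below. It is an illustration of the idea only, not the package's implementation: `run_with_early_stopping`, its `batches` input (one list of topic labels per processed document), and the `patience` parameter are all hypothetical names.

```python
def run_with_early_stopping(batches, patience=100):
    """Collect unique topics, stopping once `patience` consecutive
    batches contribute no new topic. Returns (topics, batches_processed)."""
    seen = []
    stale = 0       # consecutive batches without a new topic
    processed = 0
    for batch in batches:
        processed += 1
        added = False
        for topic in batch:
            if topic not in seen:
                seen.append(topic)
                added = True
        stale = 0 if added else stale + 1
        if stale >= patience:
            break
    return seen, processed
```

With `patience=2`, the input `[["A"], ["A"], ["A"], ["B"]]` stops after the third batch, so topic `"B"` is never reached.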

In [4]:
import topicgpt_python
from topicgpt_python.generation_1 import generate_topic_lvl1
import importlib

importlib.reload(topicgpt_python.generation_1)

INFO 04-14 12:38:37 [__init__.py:239] Automatically detected platform cuda.


<module 'topicgpt_python.generation_1' from '/storage/ice1/4/2/mshin90/hum/topicGPT/topicgpt_python/generation_1.py'>

In [7]:
generate_topic_lvl1(
    "ollama",
    "llama3.2-vision",
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic generation...
Model: llama3.2-vision
Data file: data/input/sample.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: data/output/sample/generation_1.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


 20%|██        | 1/5 [00:00<00:01,  2.20it/s]

Prompt token usage: 613 ~$0.003065
Response token usage: 51 ~$0.000765
Topics: [1] Conservation: Mentions policies relating to preserving natural resources and protecting the environment.
[1] Agriculture: Mentions policies relating to agricultural practices and products.
[1] Trade: Mentions the exchange of capital, goods, and services.
--------------------


 40%|████      | 2/5 [00:01<00:01,  1.66it/s]

Prompt token usage: 1289 ~$0.006445
Response token usage: 72 ~$0.00108
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Invalid topic format: . Skipping...
Topics: [1] Compensation: Mentions compensation for land use and resources.

[1] Land Use: Mentions the use of land for hydropower generation and transfer of administrative jurisdiction.

[1] Government Funding: Mentions appropriations to carry out the Act.

[1] Tribal Affairs: Mentions tribal member benefits, resource development, and cultural preservation.
--------------------


 60%|██████    | 3/5 [00:01<00:01,  1.45it/s]

Prompt token usage: 860 ~$0.0043
Response token usage: 98 ~$0.00147
Topics: [1] Conservation: Mentions protection of marine habitats and ecosystems.
[1] Government Funding: Mentions establishment of a Reef Maintenance Fund and payment into it by owners of rigs enrolled in the artificial reef program.
[1] Land Use: Mentions exemption from platform removal deadlines for lessees who commit to entering a particular platform in the artificial reef program.
[1] Tribal Affairs: No mention of tribal affairs, but there is no relevant topic missing from the provided set.
--------------------


 80%|████████  | 4/5 [00:02<00:00,  1.79it/s]

Prompt token usage: 668 ~$0.00334
Response token usage: 41 ~$0.000615
Invalid topic format: . Skipping...
Topics: [1] Transportation: Mentions aerotropolis transportation systems and multimodal freight and passenger networks.

[1] Government: Directs the Secretary of Transportation to establish an aerotropolis grant program.
--------------------


100%|██████████| 5/5 [00:02<00:00,  1.87it/s]

Prompt token usage: 613 ~$0.003065
Response token usage: 35 ~$0.000525
Topics: [1] Government: Mentions policies and regulations related to government agencies and their actions.
[1] Immigration: Mentions requirements for citizenship or lawful immigration status verification.
--------------------





<topicgpt_python.utils.TopicTree at 0x155429516e50>
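
The topic lines in the log above follow the shape `[level] Label: description`, and blank lines in a response appear to be what gets reported as `Invalid topic format`. The sketch below shows one way such lines could be parsed; the regular expression and the helper `parse_topics` are assumptions made for illustration, not the package's actual parser.

```python
import re

# Assumed pattern for lines such as "[1] Trade: Mentions the exchange ...".
TOPIC_LINE = re.compile(r"^\[(\d+)\]\s+([^:]+):\s*(.+)$")


def parse_topics(text):
    """Split model output into (level, label, description) tuples,
    collecting lines that do not fit the pattern separately."""
    topics, rejected = [], []
    for line in text.splitlines():
        match = TOPIC_LINE.match(line.strip())
        if match:
            level, label, description = match.groups()
            topics.append((int(level), label.strip(), description.strip()))
        else:
            rejected.append(line)
    return topics, rejected
```

Under this pattern, a response with four topics separated by three blank lines yields four parsed topics and three rejected lines.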

### Topic Refinement
Topics generated by a weaker model sometimes include irrelevant or redundant entries. This module: 
- Merges similar topics.
- Removes overly specific or redundant topics that occur < 1% of the time (you can skip this by setting `remove` to False in `config.yml`).
- Expect the refined topics in `data/output/{data_name}/refinement.md` and `data/output/{data_name}/refinement.jsonl` (the paths set under `refinement` in `config.yml`). If the topic list comes back unchanged, it is already coherent.
- If you're unsatisfied with the refined topics, call the function again, passing the refined generation file (`.jsonl`) and the refined topic file (`.md`) from the previous iteration as inputs.
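
The removal step can be pictured as a frequency filter over topic counts. The sketch below is illustrative only: it assumes frequency means a topic's share of all topic occurrences, which may differ from how the package computes it, and `drop_rare_topics` is a hypothetical name.

```python
def drop_rare_topics(counts, threshold=0.01):
    """Keep topics whose share of all topic occurrences is at least `threshold`."""
    total = sum(counts.values())
    if total == 0:
        return dict(counts)
    return {topic: n for topic, n in counts.items() if n / total >= threshold}
```

For counts of `{"Trade": 150, "Agriculture": 49, "Aerotropolis": 1}`, the last topic accounts for 0.5% of occurrences and is dropped at the default threshold.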

In [9]:
from topicgpt_python.refinement import refine_topics

# Optional: Refine topics if needed
if config["refining_topics"]:
    refine_topics(
        "ollama",
        "llama3.2-vision",
        config["refinement"]["prompt"],
        config["generation"]["output"],
        config["generation"]["topic_output"],
        config["refinement"]["topic_output"],
        config["refinement"]["output"],
        verbose=config["verbose"],
        remove=config["refinement"]["remove"],
        mapping_file=config["refinement"]["mapping_file"]
    )

-------------------
Initializing topic refinement...
Model: llama3.2-vision
Input data file: data/output/sample/generation_1.jsonl
Prompt file: prompt/refinement.txt
Output file: data/output/sample/refinement.md
Topic file: data/output/sample/generation_1.md
-------------------
Prompting model to merge topics:
You will receive a list of topics that belong to the same level of a topic hierarchy. Your task is to merge topics that are paraphrases or near duplicates of one another. Return "None" if no modification is needed. 

Here are some examples: 
[Example 1]
Topic List: 
<pairs of similar topics>

Your response: 
<topics being merged into an existing topic>

[Example 2]
<pairs of similar topics>

Your response: 
<topics being merged into a new topic>

[Rules]
- Each line represents a topic, with a level indicator and a topic label. 
- Perform the following operations as many times as needed: 
    - Merge relevant topics into a single topic.
    - Do nothing and return "None" if no mod

### Subtopic Generation 
Generate subtopics using `generate_topic_lvl2`.
- This function iterates over each high-level topic and generates subtopics based on a few example documents associated with the high-level topic.
- Expect the generated topics in `data/output/{data_name}/generation_2.md` and `data/output/{data_name}/generation_2.jsonl`.

In [12]:
from topicgpt_python.generation_2 import generate_topic_lvl2

# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        "ollama",
        "llama3.2-vision",
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

-------------------
Initializing topic generation (lvl 2)...
Model: llama3.2-vision
Data file: data/output/sample/generation_1.jsonl
Prompt file: prompt/generation_2.txt
Seed file: data/output/sample/generation_1.md
Output file: data/output/sample/generation_2.jsonl
Topic file: data/output/sample/generation_2.md
-------------------
Number of remaining documents for prompting: 5


  0%|          | 0/10 [00:00<?, ?it/s]

Current topic: [1] Conservation


 10%|█         | 1/10 [00:04<00:39,  4.43s/it]

Subtopics: [1] Conservation
    [2] National Forest Management (Document: 1): Refers to managing roadless areas within national forests.
    [2] Offshore Oil and Gas Platform Decommissioning (Document: 2): Discusses the assessment and potential reefing of offshore oil and gas platforms.

Since the provided top-level topic "Conservation" is specific enough, no additional subtopics are added. The existing relevant topics from the documents are returned as second-level topics.
National Forest Management (Count: 0): Refers to managing roadless areas within national forests.
Offshore Oil and Gas Platform Decommissioning (Count: 0): Discusses the assessment and potential reefing of offshore oil and gas platforms.
Not a match: 
Not a match: Since the provided top-level topic "Conservation" is specific enough, no additional subtopics are added. The existing relevant topics from the documents are returned as second-level topics.
--------------------------------------------------
Current topic: 

 20%|██        | 2/10 [00:04<00:15,  1.97s/it]

Subtopics: [1] Agriculture
    [2] Conservation (Document: 1): Mentions managing roadless areas for conservation purposes.
Conservation (Count: 0): Mentions managing roadless areas for conservation purposes.
--------------------------------------------------
Current topic: [1] Trade


 30%|███       | 3/10 [00:05<00:08,  1.25s/it]

Subtopics: [1] Trade
    [2] Exports (Document 1, 3): Mentions export policies on goods.
    [2] Tariff (Document 2): Mentions tax policies on imports or exports of goods.
Not a match: [2] Exports (Document 1, 3): Mentions export policies on goods.
Not a match: [2] Tariff (Document 2): Mentions tax policies on imports or exports of goods.
--------------------------------------------------
Current topic: [1] Compensation


 40%|████      | 4/10 [00:05<00:05,  1.06it/s]

Subtopics: [1] Compensation
    [2] Equitable Compensation (Document: 1): Mentions compensation for land use.
    [2] Repayment Credit (Document: 1): Mentions repayment provisions.
Equitable Compensation (Count: 0): Mentions compensation for land use.
Repayment Credit (Count: 0): Mentions repayment provisions.
--------------------------------------------------
Current topic: [1] Land Use


 50%|█████     | 5/10 [00:06<00:04,  1.06it/s]

Subtopics: [1] Land Use
    [2] Tribal Land Management (Document: 1): Refers to the management of land held in trust for the Spokane Tribe of Indians.
    [2] Offshore Oil and Gas Platform Decommissioning (Document: 2): Discusses the assessment and potential reefing of offshore oil and gas platforms in the Gulf of Mexico.

Note: The provided top-level topic "Land Use" is specific enough, but I added two second-level topics as they are more specific and relevant to the documents.
Tribal Land Management (Count: 0): Refers to the management of land held in trust for the Spokane Tribe of Indians.
Offshore Oil and Gas Platform Decommissioning (Count: 0): Discusses the assessment and potential reefing of offshore oil and gas platforms in the Gulf of Mexico.
Not a match: 
Not a match: Note: The provided top-level topic "Land Use" is specific enough, but I added two second-level topics as they are more specific and relevant to the documents.
--------------------------------------------------
C

 60%|██████    | 6/10 [00:07<00:03,  1.18it/s]

Subtopics: [1] Government Funding
    [2] Tribal Funding (Document: 1): Mentions funding allocation to the Spokane Tribe of Indians.
    [2] Offshore Oil and Gas Platform Management (Document: 2): Mentions management policies for offshore oil and gas platforms.
Tribal Funding (Count: 0): Mentions funding allocation to the Spokane Tribe of Indians.
Offshore Oil and Gas Platform Management (Count: 0): Mentions management policies for offshore oil and gas platforms.
--------------------------------------------------
Current topic: [1] Tribal Affairs


 70%|███████   | 7/10 [00:07<00:02,  1.34it/s]

Subtopics: [1] Tribal Affairs
    [2] Land Management (Document: 1, 9): Mentions land transfer and jurisdiction.
    [2] Environmental Protection (Document: 2): Mentions reef ecosystem protection.
Land Management (Count: 0): Mentions land transfer and jurisdiction.
Environmental Protection (Count: 0): Mentions reef ecosystem protection.
--------------------------------------------------
Current topic: [1] Transportation


 80%|████████  | 8/10 [00:07<00:01,  1.69it/s]

Subtopics: [1] Transportation
    [2] Aviation (Document: 1): Mentions aerotropolis development and airport transportation networks.
Aviation (Count: 0): Mentions aerotropolis development and airport transportation networks.
--------------------------------------------------
Current topic: [1] Government


 90%|█████████ | 9/10 [00:08<00:00,  1.88it/s]

Subtopics: [1] Government
    [2] Transportation (Document: 1): Mentions transportation systems and airport development.
    [2] Licensing (Document: 2): Mentions driver's licenses and identification documents.
Transportation (Count: 0): Mentions transportation systems and airport development.
Licensing (Count: 0): Mentions driver's licenses and identification documents.
--------------------------------------------------
Current topic: [1] Immigration


100%|██████████| 10/10 [00:08<00:00,  1.11it/s]

Subtopics: [1] Immigration
    [2] Visa and Documentation (Document: 1): Mentions requirements for issuing driver's licenses or identification documents to individuals. 

Since the provided top-level topic "Immigration" is specific enough, I did not add any subtopics. However, upon reviewing the document, I found that it actually pertains to a more specific aspect of immigration, which is visa and documentation policies.
Visa and Documentation (Count: 0): Mentions requirements for issuing driver's licenses or identification documents to individuals.
Not a match: 
Not a match: Since the provided top-level topic "Immigration" is specific enough, I did not add any subtopics. However, upon reviewing the document, I found that it actually pertains to a more specific aspect of immigration, which is visa and documentation policies.
--------------------------------------------------
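
In the log above, subtopic lines written as `[2] Label (Document: 1): description` are accepted, while lines such as `[2] Exports (Document 1, 3): ...` (no colon after `Document`) and free-text commentary are reported as `Not a match`. The sketch below reproduces that behavior with a regular expression; the pattern and `parse_subtopic` are assumptions for illustration, not the package's own code.

```python
import re

# Assumed pattern: "[2] Label (Document: 1, 9): description".
SUBTOPIC_LINE = re.compile(
    r"^\[2\]\s+(.+?)\s+\(Document:\s*([\d,\s]+)\):\s*(.+)$"
)


def parse_subtopic(line):
    """Return (label, document_ids, description), or None if the line
    does not follow the assumed subtopic format."""
    match = SUBTOPIC_LINE.match(line.strip())
    if match is None:
        return None
    label, docs, description = match.groups()
    doc_ids = [int(d) for d in docs.split(",") if d.strip()]
    return label, doc_ids, description
```

If many subtopics are rejected for a formatting reason like the missing colon, tightening the few-shot examples in `prompt/generation_2.txt` is one thing to try.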





### Topic Assignment
Assign the generated topics to the input text using `assign_topics`. Each assignment is supported by a quote from the input text.
- Expect the assigned topics in `data/output/{data_name}/assignment.jsonl`. 
- The model used here is often a weaker model to save cost, so the topics may not be grounded in the topic list. To correct this, use the `correct_topics` module. If there are still errors/hallucinations, run the `correct_topics` module again.
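
To make "grounded in the topic list" concrete, the sketch below flags assigned labels that do not appear among the generated topics. It is a simplified stand-in for the check that `correct_topics` reports on, not its implementation; `ungrounded_topics` is a hypothetical helper, and the case-insensitive comparison is an assumption.

```python
def ungrounded_topics(assigned, topic_list):
    """Return assigned labels that do not appear in the generated topic list."""
    known = {label.strip().lower() for label in topic_list}
    return [label for label in assigned if label.strip().lower() not in known]
```

For instance, with a topic list of `["Trade", "Agriculture"]`, an assignment of `["Trade", "Healthcare"]` yields `["Healthcare"]` as the label needing correction.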

In [5]:
import importlib

import topicgpt_python.assignment

# Reload so that local edits to the module are picked up
importlib.reload(topicgpt_python.assignment)

# Assignment
topicgpt_python.assignment.assign_topics(
    "ollama",
    "llama3.2-vision",
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # TODO: change to config["generation_2"]["topic_output"] if you have subtopics,
    # or config["refinement"]["topic_output"] if you refined topics
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic assignment...
Model: llama3.2-vision
Data file: data/input/sample.jsonl
Prompt file: prompt/assignment.txt
Output file: data/output/sample/assignment.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


  0%|          | 0/5 [00:00<?, ?it/s]

seed_str='[1] Conservation: Mentions policies relating to preserving natural resources and protecting the environment.\n[1] Agriculture: Mentions policies relating to agricultural practices and products.\n[1] Trade: Mentions the exchange of capital, goods, and services.\n[1] Compensation: Mentions compensation for land use and resources.\n[1] Land Use: Mentions the use of land for hydropower generation and transfer of administrative jurisdiction.\n[1] Government Funding: Mentions appropriations to carry out the Act.\n[1] Tribal Affairs: Mentions tribal member benefits, resource development, and cultural preservation.\n[1] Transportation: Mentions aerotropolis transportation systems and multimodal freight and passenger networks.\n[1] Government: Directs the Secretary of Transportation to establish an aerotropolis grant program.\n[1] Immigration: Mentions requirements for citizenship or lawful immigration status verification.'
Prompt: You will receive a document and a topic hierarchy. As

  0%|          | 0/5 [00:17<?, ?it/s]


KeyboardInterrupt: 
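
The TODO comments in the assignment and correction calls describe which topic file to pass depending on the optional stages you ran. A hypothetical helper along these lines could make that choice from the config flags; the precedence shown (subtopics first, then refined topics) is an assumption.

```python
def pick_topic_file(config):
    """Choose the topic file to pass on: subtopics if generated,
    else refined topics if refinement ran, else the level-1 topics."""
    if config.get("generate_subtopics"):
        return config["generation_2"]["topic_output"]
    if config.get("refining_topics"):
        return config["refinement"]["topic_output"]
    return config["generation"]["topic_output"]
```

The returned path can then be passed wherever the calls currently use `config["generation"]["topic_output"]`.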

In [6]:
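# Assumes correct_topics lives in topicgpt_python.correction,
# mirroring the module layout of the other stages
from topicgpt_python.correction import correct_topics

# Correction
correct_topics(
    "openai",
    "gpt-4o-mini",
    config["assignment"]["output"],
    config["correction"]["prompt"],
    # TODO: change to config["generation_2"]["topic_output"] if you have subtopics,
    # or config["refinement"]["topic_output"] if you refined topics
    config["generation"]["topic_output"],
    config["correction"]["output"],
    verbose=config["verbose"],
)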

-------------------
Initializing topic correction...
Model: gpt-4o-mini
Data file: data/output/sample/assignment.jsonl
Prompt file: prompt/correction.txt
Output file: data/output/sample/assignment_corrected.jsonl
Topic file: data/output/sample/generation_1.md
-------------------
Number of errors: 0
Number of hallucinated topics: 0
All topics are correct.
