# RAG 'Auto-Tag' using a Controlled Vocabulary

This is a follow-up to the `media_topics_structured_output.ipynb` notebook that demonstrated AI 'auto-tagging' using the IPTC Media Topics controlled vocabulary and the Google Gemini API. Here, we take tagging in a new direction by using a Retrieval Augmented Generation (RAG) architecture. We'll use the same Media Topics vocabulary, but will store embeddings of the terms in a vector database.

Because we found a limit on the number of enum values a model's structured output can be constrained to, we will load only the relevant part of the vocabulary dynamically.

Before tagging content with an LLM, we'll retrieve the most relevant vocabulary terms from a vector database. We'll break our content (text at this point) into chunks and use them to query the database. The most frequent tags from all queries will become our LLM's enum values.


## Import Python Packages

In [1]:
import json
import os
import requests
from IPython.display import display
from collections import Counter
from dotenv import load_dotenv
from google import genai
import chromadb
import pandas as pd

load_dotenv()

GOOGLE_AI_API_KEY = os.getenv("GOOGLE_AI_API_KEY")

## IPTC Media Topics Controlled Vocabulary

Media Topics is a constantly updated taxonomy of over 1,200 terms with a focus on categorising text.

Originally based on the IPTC Subject Codes taxonomy, the Media Topics taxonomy was first released in 2010 and is updated at least once a year.

https://iptc.org/standards/media-topics/


### Download and Read JSON


In [2]:
MEDIATOPICS_URL = "https://cv.iptc.org/newscodes/mediatopic?lang=en-US&format=json"

MEDIATOPICS_PATH = "./schema/mediatopic_cptall-en-US.json"

# Function to download the Media Topics JSON file
def download_mediatopics_json():
    try:
        # request media topics
        response = requests.get(MEDIATOPICS_URL, timeout=30)
        # check if the request was successful
        response.raise_for_status()
        # parse the JSON content into a dictionary
        data = response.json()
        # create the schema directory if it doesn't exist
        os.makedirs("./schema", exist_ok=True)
        # write the data to the JSON file
        with open(MEDIATOPICS_PATH, 'w') as f:
            json.dump(data, f, indent=4)

    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
    except IOError as e:
        print(f"Error writing to file: {e}")

# Download the Media Topics JSON file if it doesn't exist
if not os.path.exists(MEDIATOPICS_PATH):
    print("Downloading Media Topics Controlled Vocabulary from IPTC web")
    download_mediatopics_json()
# Load the Media Topics JSON file
with open(MEDIATOPICS_PATH, "r") as file:
    media_topics = json.load(file)

In [3]:
# Let's load the concepts into a dictionary for easier access later
concepts_dict = {concept['qcode']: concept for concept in media_topics['conceptSet']}

## Build Vector DB

### Prepare Ids and Documents

Let's build the ids and documents that we will store in the vector db. The ids are the Media Topics `qcode` values, and the documents are the concept definitions.

In [4]:
ids = []
documents = []
for key, val in concepts_dict.items():
    ids.append(key)
    documents.append(val.get('definition').get('en-US'))

print(ids[:5])
print(documents[:5])

['medtop:01000000', 'medtop:02000000', 'medtop:03000000', 'medtop:04000000', 'medtop:05000000']
['All forms of arts, entertainment, cultural heritage and media', 'The establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organizations and bodies involved in these activities', 'Man made or natural event resulting in loss of life or injury to living creatures and/or damage to inanimate objects or property', 'All matters concerning the planning, production and exchange of wealth.', 'All aspects of furthering knowledge, formally or informally']
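
The cell above assumes every concept carries an `en-US` definition. As a minimal sketch, assuming some concepts might lack one, the hypothetical helper `build_ids_and_documents` below falls back to the preferred label and skips concepts that have neither. The sample concepts and their `medtop:A`-style codes are invented for illustration and are not real Media Topics entries.

```python
def build_ids_and_documents(concepts: dict) -> tuple:
    """Collect qcodes and document texts, tolerating missing fields."""
    out_ids, out_documents = [], []
    for qcode, concept in concepts.items():
        definition = (concept.get("definition") or {}).get("en-US")
        label = (concept.get("prefLabel") or {}).get("en-US")
        # Prefer the definition; fall back to the label if it is absent.
        text = definition or label
        if not text:
            continue  # nothing to embed for this concept
        out_ids.append(qcode)
        out_documents.append(text)
    return out_ids, out_documents


# Invented sample data (hypothetical codes, not from the vocabulary)
sample_concepts = {
    "medtop:A": {"prefLabel": {"en-US": "arts"}, "definition": {"en-US": "All forms of arts"}},
    "medtop:B": {"prefLabel": {"en-US": "label only"}},
    "medtop:C": {},
}
sample_ids, sample_documents = build_ids_and_documents(sample_concepts)
print(sample_ids)
print(sample_documents)
```

Whether a label alone makes a useful embedding document is a judgment call; skipping such concepts entirely would be an equally reasonable choice.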


In [5]:
# What's the average number of words in a media topic definition?
average_number_of_words = sum([len(definition.split(" ")) for definition in documents]) / len(documents)
print(average_number_of_words)

15.38433908045977


### Create a Database Client

In [6]:
chroma = chromadb.Client()

### Create a Database Collection

In [7]:
collection = chroma.get_or_create_collection(name="media_topics")

### Add Media Topics Vocabulary to Database

In [8]:
collection.upsert(
    ids=ids,
    documents=documents
)

### Query the Database

Let's do a simple search for "Science" to see what the results look like.

In [9]:
results = collection.query(
    query_texts=["Science"],
    n_results=5
)

In [10]:
# Let's look at our query results
display(results)

{'ids': [['medtop:20000717',
   'medtop:20000711',
   'medtop:20000735',
   'medtop:20000756',
   'medtop:20000441']],
 'embeddings': None,
 'documents': [['The sciences that deal with matter, energy and the physical world, including physics, biology, chemistry and astronomy',
   'The scientific manipulation of living organisms and biological processes for scientific, medical or agricultural purposes',
   'The scientific and methodical investigation of events, procedures and interactions to explain why they occur, or to find solutions for problems',
   'The study and practice of industrial or applied sciences such as physics, hydrodynamics or thermodynamics',
   'The natural world']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None, None, None, None]],
 'distances': [[0.8212347626686096,
   0.9833488464355469,
   0.9998847842216492,
   1.0518594980239868,
   1.1157684326171875]]}

### Convert qcodes to labels

In [11]:
tags = []
for qcode in results.get('ids', [[]])[0]:
    concept = concepts_dict.get(qcode)
    tags.append(concept.get('prefLabel', {}).get('en-US'))

print(tags)

['natural science', 'biotechnology', 'scientific research', 'technology and engineering', 'nature']
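
The query result also includes a distance for each hit, where a smaller distance means a closer match. A small sketch, assuming the result layout shown above (one inner list per query), pairs each label with its distance. The helper name `label_hits` is hypothetical, and the demo data reuses two ids and labels from the output above with rounded distances.

```python
def label_hits(results: dict, concepts: dict, query_index: int = 0) -> list:
    """Return (label, distance) pairs for one query, closest first."""
    qcodes = results["ids"][query_index]
    distances = results["distances"][query_index]
    hits = []
    for qcode, distance in zip(qcodes, distances):
        concept = concepts.get(qcode, {})
        # Fall back to the raw qcode if no en-US label is available.
        label = concept.get("prefLabel", {}).get("en-US", qcode)
        hits.append((label, distance))
    return sorted(hits, key=lambda hit: hit[1])


demo_results = {
    "ids": [["medtop:20000717", "medtop:20000711"]],
    "distances": [[0.82, 0.98]],
}
demo_concepts = {
    "medtop:20000717": {"prefLabel": {"en-US": "natural science"}},
    "medtop:20000711": {"prefLabel": {"en-US": "biotechnology"}},
}
demo_hits = label_hits(demo_results, demo_concepts)
print(demo_hits)
```

Seeing the distances next to the labels makes it easier to judge whether a fixed `n_results` is pulling in weak matches.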


## Tagging Content

This is the text we will be tagging.

In [12]:
text_to_classify = """The digital commons are burning. Scroll through any LinkedIn comment section under an AI-generated image, and you will feel the heat.
"You created nothing."
"This is theft."
"You are just a commissioner, not an artist."
The anger is palpable. It is the anger of the craftsman watching the factory rise. It is the defense of "sweat equity." The belief that art must be difficult, that the soul leaks onto the canvas only through physical exhaustion.
But this picket line is being walked sixty years too late.
The definition of art was not broken by a tech bro in Silicon Valley in 2024. It was dismantled, piece by piece, in the lofts of New York City in the 1960s. A group of artists called Fluxus already severed the hand from the tool. They taught us that the art is not in the execution.
The art is in the choice.
The Prompt is just a Score (George Brecht)
Critics argue that typing a text prompt isn't creating. You are just giving orders.
In 1961, George Brecht wrote a piece titled Word Event. The entire artwork consisted of one word: "EXIT."
That was it. The viewer had to enact it. The "score" was a set of instructions; the reality was the rendering. When you type “imagine a red cube” into Midjourney, you are not cheating. You are writing a Fluxus Event Score. You provide the code; the machine provides the rendering. Brecht realized that the artist is not the builder. The artist is the architect of the situation.
The Machine plays itself (Joe Jones)
"But the machine does all the work!" they shout. "You don't even know how to hold a brush."
Joe Jones, the "Music Machine Man," didn't hold a violin bow. He built mechanical orchestras; violins fitted with motors, drums beaten by rubber bands. He opened the Tone Deaf Music Store in 1969, where the public could push a button and make the art happen.
Jones removed the virtuoso. He proved that music wasn't about the dexterity of fingers on a fretboard; it was about the organization of sound. AI removes the illustrator. It asks: is art the movement of the wrist, or the organization of the pixel?
The Idea is the Engine (Sol LeWitt)
The most damning accusation against AI is that it is "lazy." It bypasses the struggle.
Sol LeWitt, the father of Conceptual Art, handed us the defense decades ago: "The idea becomes a machine that makes the art."
LeWitt would write instructions for wall drawings—"Draw a line from the left corner to the center"—and let assistants execute them. He never touched the wall. Was he lazy? No. He understood that authorship lies in the concept, not the carpentry. If LeWitt can claim the wall drawn by an assistant, the modern creator can claim the image drawn by the algorithm. The assistant has simply changed from carbon to silicon.
Silence is just Latent Space (John Cage)
Finally, there is the fear that AI is just rearranging old data. That it is random noise.
John Cage sat at a piano for 4 minutes and 33 seconds and played nothing. He framed the silence. He forced the audience to listen to the ambient noise of the room and called it music.
An AI model is infinite noise. It is a "latent space" of chaos. The artist’s job is no longer to apply paint. The artist's job is to frame the noise. To reach into the chaos and pull out a specific moment of clarity. Selection is creation.
The Verdict
Anthropologist Ellen Dissanayake calls art "making special." It is the act of taking the ordinary and making it significant.
If a user types "cat" and posts it, they have made nothing special. That is a cheap signal. But the creator who wrestles with the prompt, who curates the output, who forces the machine to visualize a new reality—they are walking the path paved by Fluxus.
The anger you feel is real. It is the pain of a paradigm shift. But do not blame the software.
The ghost in the machine isn't a thief. It's just the spirit of 1960s avant-garde, finally accessible to everyone.
The brush is dead. Long live the idea.
"""

### Chunking

Let's split our text into overlapping chunks. Each chunk is used as a query, and each query returns the top n tags for that chunk.

In [13]:
# Define a simple chunking strategy with a sliding window
def chunking_strategy(text_to_chunk: str, chunk_size: int) -> list[str]:
    word_list = []

    for line in text_to_chunk.split("\n"):
        for word in line.split(" "):
            word_list.append(word.strip())

    res = []

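    # Sliding window: each chunk holds chunk_size words, and the window
    # advances 5 words at a time, so consecutive chunks overlap.
    # The loop stops once the window's end reaches the end of the word list,
    # so the last few words may not be covered by any chunk.
    low, hi = 0, chunk_size
    while hi < len(word_list):
        res.append(" ".join(word_list[low:hi]))
        low += 5
        hi += 5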

    return res

In [14]:
chunks = chunking_strategy(
    text_to_chunk=text_to_classify,
    chunk_size=int(average_number_of_words)
    )

display(chunks[:10])

['The digital commons are burning. Scroll through any LinkedIn comment section under an AI-generated image,',
 'Scroll through any LinkedIn comment section under an AI-generated image, and you will feel the',
 'section under an AI-generated image, and you will feel the heat. "You created nothing." "This',
 'and you will feel the heat. "You created nothing." "This is theft." "You are just',
 'heat. "You created nothing." "This is theft." "You are just a commissioner, not an artist."',
 'is theft." "You are just a commissioner, not an artist." The anger is palpable. It',
 'a commissioner, not an artist." The anger is palpable. It is the anger of the',
 'The anger is palpable. It is the anger of the craftsman watching the factory rise.',
 'is the anger of the craftsman watching the factory rise. It is the defense of',
 'craftsman watching the factory rise. It is the defense of "sweat equity." The belief that']
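
The chunker above hard-codes a step of 5 words and stops before the tail of the text. As a hedged alternative sketch (not the notebook's method), the hypothetical function `sliding_window_chunks` below makes the step a `stride` parameter and always emits a final window that reaches the last word.

```python
def sliding_window_chunks(text: str, chunk_size: int, stride: int = 5) -> list:
    """Split text into overlapping word windows that cover the whole text."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    words = text.split()  # splits on any whitespace and drops empty strings
    if not words:
        return []
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


demo_chunks = sliding_window_chunks(
    "one two three four five six seven", chunk_size=4, stride=2
)
print(demo_chunks)
```

With this variant the last chunk can be shorter than `chunk_size`, and a text shorter than one window still yields a single chunk.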

### Query the Database with Chunks

In [15]:
results = collection.query(query_texts=chunks, n_results=10)

In [16]:
# Iterate through the results and count the tag occurrences
tags = Counter()
for r in results.get('ids', []):
    for qcode in r:
        tags[qcode] += 1

### Tags and Their Counts

In [17]:
pd.set_option("display.max_rows", 100)

In [18]:
# Lookup the labels from the tags and display their counts
labels_and_counts = []
for qcode, count in tags.most_common(n=90):
    concept = concepts_dict.get(qcode)
    label = concept.get('prefLabel', {}).get('en-US')
    labels_and_counts.append((label, count))

display(pd.DataFrame(labels_and_counts, columns=["Tag", "Count"]))

| | Tag | Count |
|---|---|---|
| 0 | photography | 44 |
| 1 | dance | 43 |
| 2 | design (visual arts) | 41 |
| 3 | visual arts | 37 |
| 4 | painting | 34 |
| 5 | drawing | 31 |
| 6 | theater | 29 |
| 7 | fiction | 26 |
| 8 | sculpture | 26 |
| 9 | opera | 22 |
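
The counts above give every hit one vote, regardless of how close the match was. One possible variation, sketched here as an assumption rather than something the notebook does, weights each vote by `1 / (1 + distance)` so closer matches count for more. It assumes the query result includes non-negative `distances` aligned with `ids`; the function name and demo data are hypothetical.

```python
from collections import Counter


def weighted_tag_scores(results: dict) -> Counter:
    """Sum a distance-based weight per qcode across all chunk queries."""
    scores = Counter()
    for qcodes, distances in zip(results["ids"], results["distances"]):
        for qcode, distance in zip(qcodes, distances):
            # Closer matches (smaller distance) contribute a larger weight.
            scores[qcode] += 1.0 / (1.0 + distance)
    return scores


# Invented demo data: two chunk queries with two hits each
demo_query_results = {
    "ids": [["a", "b"], ["a", "c"]],
    "distances": [[0.0, 1.0], [1.0, 3.0]],
}
demo_scores = weighted_tag_scores(demo_query_results)
print(demo_scores.most_common())
```

Because the result is still a `Counter`, `most_common(n=...)` works the same way as with the plain counts.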


## Gemini Classifier

Let's use Google Gemini to classify text using the concepts from our controlled vocabulary.

In [19]:
# Define the response schema for classification
def load_json_response_schema(concepts: list[str]) -> dict:
    """Load a JSON schema for the given concepts."""

    response_schema = {
        '$defs': {
            'Tags': {
                'enum': concepts, 
                'title': 'Tags', 
                'type': 'string'
                }
            }, 
        'properties': {
            'keywords': {
                'items': {
                    '$ref': '#/$defs/Tags'
                    }, 
                'title': 'Keywords', 
                'type': 'array'
                }
            }, 
        'required': ['keywords'], 
        'title': 'Metadata', 
        'type': 'object'
        }
    
    return response_schema


### Create Controlled Vocab Response Schema

The most frequent tags from the database queries (up to 90 here) become our LLM's controlled vocabulary.

In [20]:
response_enum_vals = []
for qcode, count in tags.most_common(n=90):
    concept = concepts_dict.get(qcode)
    label = concept.get('prefLabel', {}).get('en-US')
    response_enum_vals.append(label)


# Create the response schema
response_schema = load_json_response_schema(response_enum_vals)

In [21]:
# Initialize the GenAI client
client = genai.Client(api_key=GOOGLE_AI_API_KEY)

### Classify the text!

Send our text to the model for classification using the most relevant Media Topics concepts.

In [22]:
# Function to classify media topics using GenAI
def classify_media_topics(content: str, response_schema: dict) -> genai.types.GenerateContentResponse:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[content],
        config=genai.types.GenerateContentConfig(
            system_instruction="Extract relevant media topics from the text based on the IPTC Media Topics Controlled Vocabulary. Respond with a JSON object containing an array of 'keywords' drawn from the Media Topics labels allowed by the response schema. Only include keywords that are directly relevant to the content provided. Do not include any additional text or explanation outside of the JSON object.",
            temperature=1.0,
            response_mime_type="application/json",
            response_schema=response_schema
        )
    )
    return response

In [23]:
response_object = classify_media_topics(content=text_to_classify, response_schema=response_schema)

### Classification Results

Let's see how our classifier tagged our text. The SDK can parse the response object directly, giving us the resulting list of keywords.

In [24]:
classification_results = response_object.parsed
display(classification_results)

{'keywords': ['visual arts',
  'music',
  'musical instrument',
  'artificial intelligence',
  'musical performance',
  'computing and information technology',
  'arts and entertainment',
  'cultural development',
  'architecture',
  'scientific innovation',
  'music industry',
  'design and engineering',
  'social media',
  'culture',
  'mass media']}

In [25]:
print(f"Our text was classified into {len(classification_results['keywords'])} media topics.")

Our text was classified into 15 media topics.
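
The model returns preferred labels, since those are the enum values in the schema. If the `qcode` identifiers are needed downstream, the labels can be mapped back. This is a minimal sketch with a hypothetical helper `labels_to_qcodes`; it assumes `en-US` labels are unique enough that keeping the first qcode seen per label is acceptable, and the demo reuses one id and label pair from the query output earlier.

```python
def labels_to_qcodes(keywords: list, concepts: dict) -> dict:
    """Map each returned label to its qcode, or None if it is unknown."""
    label_index = {}
    for qcode, concept in concepts.items():
        label = concept.get("prefLabel", {}).get("en-US")
        if label is not None:
            # Keep the first qcode seen for a label (assumption: labels are unique).
            label_index.setdefault(label, qcode)
    return {keyword: label_index.get(keyword) for keyword in keywords}


demo_vocab = {
    "medtop:20000717": {"prefLabel": {"en-US": "natural science"}},
}
demo_mapping = labels_to_qcodes(["natural science", "not a real label"], demo_vocab)
print(demo_mapping)
```

A `None` value flags a keyword that is not in the vocabulary, which should not happen when the enum constraint is honored but is cheap to check.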
