# IPTC Media Topics Adjacency List 

This notebook demonstrates the steps in building an adjacency list from the Media Topics controlled vocabulary.

An adjacency list is a list of lists where each index in the outer list represents a parent node, and the inner list contains the integer indices of its direct child nodes.
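As a minimal sketch of that definition, the toy tree below (hypothetical data, not real IPTC concepts) shows how parent indices map to child lists. The helper names `children_of` and `leaves` are illustrative only.

```python
# Toy tree (hypothetical): node 0 has children 1 and 2, node 1 has child 3.
toy_adjacency = [
    [1, 2],  # children of node 0
    [3],     # children of node 1
    [],      # node 2 is a leaf
    [],      # node 3 is a leaf
]

def children_of(adjacency, node):
    """Direct children of `node` are stored at index `node` of the outer list."""
    return adjacency[node]

def leaves(adjacency):
    """Leaves are the nodes whose inner list is empty."""
    return [node for node, kids in enumerate(adjacency) if not kids]

print(children_of(toy_adjacency, 0))
print(leaves(toy_adjacency))
```

Every node needs an entry in the outer list, even when it has no children, so that indices stay aligned with node ids.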

Python libraries for extreme multi-label text classification, such as `napkinXC`, expect a custom tree to be passed as an adjacency list when constructing Probabilistic Label Tree (PLT) models.

`napkinXC` [docs](https://napkinxc.readthedocs.io/en/latest/index.html)

```python
from napkinxc.models import PLT

# 1. Load your training data (X_train) and your mapped integer labels (Y_train)
# 2. Build your IPTC adjacency list 
# Example: node 0 is the root, containing child branches 1, 2, and 3.
iptc_custom_tree = [
    [1, 2, 3],  # Root node's children (e.g., top-level IPTC categories)
    [4, 5],     # Children of node 1
    [6],        # Children of node 2
    # ... continue for all 1,200+ nodes
]

# 3. Initialize the PLT model with the custom IPTC tree
model = PLT("iptc_media_classifier", tree=iptc_custom_tree)

# 4. Fit the model
model.fit(X_train, Y_train)

# 5. Predict the top-k results
predictions = model.predict(X_test, top_k=5)
```



## Import Python Packages

In [2]:
import json
import os
import requests
from IPython.display import display, Markdown
from collections import deque

## IPTC Media Topics Controlled Vocabulary

Media Topics is a constantly updated taxonomy of over 1,200 terms with a focus on categorising text.

Originally based on the IPTC Subject Codes taxonomy, the Media Topics taxonomy was first released in 2010 and is updated at least once a year.

https://iptc.org/standards/media-topics/


### Download and Read JSON


In [3]:
MEDIATOPICS_URL = "https://cv.iptc.org/newscodes/mediatopic?lang=en-US&format=json"

MEDIATOPICS_PATH = "./schema/mediatopic_cptall-en-US.json"

# Function to download the Media Topics JSON file
def download_mediatopics_json():
    try:
        # request media topics (the timeout stops the call from hanging indefinitely)
        response = requests.get(MEDIATOPICS_URL, timeout=30)
        # check if the request was successful
        response.raise_for_status()
        # parse the JSON content into a dictionary
        data = response.json()
        # create the target directory if it doesn't exist
        os.makedirs(os.path.dirname(MEDIATOPICS_PATH), exist_ok=True)
        # write the data to the JSON file
        with open(MEDIATOPICS_PATH, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=4)

    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
    except IOError as e:
        print(f"Error writing to file: {e}")

# Download the Media Topics JSON file if it doesn't exist
if not os.path.exists(MEDIATOPICS_PATH):
    print("Downloading Media Topics Controlled Vocabulary from IPTC web")
    download_mediatopics_json()
# Load the Media Topics JSON file
with open(MEDIATOPICS_PATH, "r", encoding="utf-8") as file:
    media_topics = json.load(file)

In [6]:
concepts_dict = {concept['qcode']: concept for concept in media_topics['conceptSet']}

## IPTC Edges

Define IPTC relationships as a list of (parent, child) tuples.

In [None]:
# iptc_edges = [
#     ("arts_and_entertainment", "animation"),
#     ("arts_and_entertainment", "cartoon"),
#     ("crime_law_and_justice", "court"),
#     ("court", "civil_court")
# ]

In [None]:
# Initialize root nodes of the vocabulary tree
root_nodes = []
for uri in media_topics['hasTopConcept']:
    medtopid = "medtop:" + uri.split("/")[-1]
    root_nodes.append(medtopid)

print(root_nodes)

['medtop:01000000', 'medtop:02000000', 'medtop:03000000', 'medtop:04000000', 'medtop:05000000', 'medtop:06000000', 'medtop:07000000', 'medtop:08000000', 'medtop:09000000', 'medtop:10000000', 'medtop:11000000', 'medtop:12000000', 'medtop:13000000', 'medtop:14000000', 'medtop:15000000', 'medtop:16000000', 'medtop:17000000']
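The URI-to-qcode conversion used above can be pulled into a small helper. This is a sketch only: `uri_to_qcode` is a hypothetical name, and the example URI is an assumption about the shape of the entries in `media_topics['hasTopConcept']`.

```python
def uri_to_qcode(uri, prefix="medtop"):
    """Turn a concept URI into a prefixed qcode using its last path segment."""
    # rstrip guards against a trailing slash, which would leave an empty last segment
    concept_id = uri.rstrip("/").split("/")[-1]
    return f"{prefix}:{concept_id}"

# Assumed example URI, not copied from the downloaded file
example_uri = "http://cv.iptc.org/newscodes/mediatopic/01000000"
print(uri_to_qcode(example_uri))
```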


In [None]:
# Traverse the tree and record parent, child relationships
iptc_edges = []
q = deque(root_nodes)
while q:
    parent = q.popleft()
    children = concepts_dict.get(parent, {}).get('narrower', [])
    for child in children:
        iptc_edges.append((parent, child))
        q.append(child)

print(iptc_edges)

[('medtop:01000000', 'medtop:20000002'), ('medtop:01000000', 'medtop:20000038'), ('medtop:01000000', 'medtop:20000045'), ('medtop:02000000', 'medtop:20000082'), ('medtop:02000000', 'medtop:20000106'), ('medtop:02000000', 'medtop:20000119'), ('medtop:02000000', 'medtop:20000121'), ('medtop:02000000', 'medtop:20000129'), ('medtop:03000000', 'medtop:20000139'), ('medtop:03000000', 'medtop:20000148'), ('medtop:03000000', 'medtop:20000160'), ('medtop:03000000', 'medtop:20000167'), ('medtop:03000000', 'medtop:20000168'), ('medtop:04000000', 'medtop:20000170'), ('medtop:04000000', 'medtop:20000209'), ('medtop:04000000', 'medtop:20000344'), ('medtop:04000000', 'medtop:20000349'), ('medtop:04000000', 'medtop:20000385'), ('medtop:05000000', 'medtop:20000398'), ('medtop:05000000', 'medtop:20000399'), ('medtop:05000000', 'medtop:20000400'), ('medtop:05000000', 'medtop:20000410'), ('medtop:05000000', 'medtop:20000411'), ('medtop:05000000', 'medtop:20000412'), ('medtop:05000000', 'medtop:20000413'),
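The traversal above enqueues a child every time it is encountered, so a concept listed under more than one parent would be visited again and its subtree edges recorded twice. Whether Media Topics contains such concepts is not established here; the sketch below, with hypothetical helper names and toy edges, is one way to check an edge list before relying on it as a tree.

```python
from collections import Counter

def find_multi_parent_nodes(edges):
    """Return {child: sorted parents} for children that have more than one distinct parent."""
    parents_by_child = {}
    for parent, child in edges:
        parents_by_child.setdefault(child, set()).add(parent)
    return {c: sorted(p) for c, p in parents_by_child.items() if len(p) > 1}

def find_duplicate_edges(edges):
    """Return the (parent, child) pairs that occur more than once."""
    counts = Counter(edges)
    return sorted(edge for edge, n in counts.items() if n > 1)

# Toy edges (hypothetical): "c" has two parents and ("a", "b") is repeated
toy_edges = [("a", "b"), ("a", "c"), ("d", "c"), ("a", "b")]
print(find_multi_parent_nodes(toy_edges))
print(find_duplicate_edges(toy_edges))
```

If both results are empty for `iptc_edges`, each concept was reached through exactly one parent and the edge list describes a forest under the top-level nodes.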

## Vocabulary Size

In [13]:
# Find all unique nodes to build our vocabulary size
all_nodes = set()
parents = set()
children = set()

for parent, child in iptc_edges:
    all_nodes.add(parent)
    all_nodes.add(child)
    parents.add(parent)
    children.add(child)

# Top-level nodes are those that act as parents but never appear as children.
# In the real IPTC vocabulary, there are 17 of these.
top_level_nodes = parents - children

In [14]:
top_level_nodes

{'medtop:01000000',
 'medtop:02000000',
 'medtop:03000000',
 'medtop:04000000',
 'medtop:05000000',
 'medtop:06000000',
 'medtop:07000000',
 'medtop:08000000',
 'medtop:09000000',
 'medtop:10000000',
 'medtop:11000000',
 'medtop:12000000',
 'medtop:13000000',
 'medtop:14000000',
 'medtop:15000000',
 'medtop:16000000',
 'medtop:17000000'}
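The set arithmetic above can also report the sizes themselves. The following is a minimal sketch with a hypothetical helper, `summarize_edges`, run on the small illustrative edge list from the IPTC Edges section rather than on the real vocabulary.

```python
def summarize_edges(edges):
    """Count nodes, top-level nodes, leaves, and edges in a (parent, child) edge list."""
    parents = {p for p, _ in edges}
    children = {c for _, c in edges}
    return {
        "nodes": len(parents | children),
        "top_level": len(parents - children),  # parents that are never a child
        "leaves": len(children - parents),     # children that are never a parent
        "edges": len(edges),
    }

example_edges = [
    ("arts_and_entertainment", "animation"),
    ("arts_and_entertainment", "cartoon"),
    ("crime_law_and_justice", "court"),
    ("court", "civil_court"),
]
print(summarize_edges(example_edges))
```

Because everything is derived from edges, a top-level concept with no narrower terms would not appear in any of these counts.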

## Map String IDs to Integer Indices

In [None]:
# We reserve index 0 for our single "Dummy Root" required by the PLT.
node_to_int = {"dummy_root": 0}
int_to_node = {0: "dummy_root"}

current_idx = 1
for node in sorted(all_nodes):
    node_to_int[node] = current_idx
    int_to_node[current_idx] = node
    current_idx += 1

total_nodes = len(node_to_int)
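The same mapping can be reused to turn per-document qcode labels into integer ids for training. This is a sketch under assumptions: `encode_labels` is a hypothetical helper, the toy mapping and documents are made up, and how napkinXC relates label ids to tree node ids should be confirmed against its documentation.

```python
def encode_labels(label_sets, mapping):
    """Map each document's qcodes to integer ids; collect qcodes missing from the mapping."""
    encoded, unknown = [], set()
    for labels in label_sets:
        ids = []
        for qcode in labels:
            if qcode in mapping:
                ids.append(mapping[qcode])
            else:
                unknown.add(qcode)
        encoded.append(sorted(ids))
    return encoded, unknown

# Toy mapping and documents (hypothetical)
toy_node_to_int = {"dummy_root": 0, "medtop:01000000": 1, "medtop:20000002": 2}
toy_docs = [["medtop:20000002", "medtop:01000000"], ["medtop:99999999"]]
encoded, unknown = encode_labels(toy_docs, toy_node_to_int)
print(encoded)
print(unknown)
```

Reporting unknown qcodes instead of raising makes it easier to spot labels that were retired or never part of the downloaded vocabulary.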

## Initialize the Adjacency List

In [None]:

# Create a list of empty lists. The outer list's index represents the parent node.
adjacency_list = [[] for _ in range(total_nodes)]
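The list comprehension matters here. The shorter `[[]] * n` form repeats a reference to one inner list, so appending a child to any parent would appear under every parent. A small demonstration with throwaway names:

```python
n = 3

# Repeats ONE inner list n times: all entries are the same object
shared = [[]] * n
shared[0].append(99)

# Builds n separate inner lists
independent = [[] for _ in range(n)]
independent[0].append(99)

print(shared)       # [[99], [99], [99]]
print(independent)  # [[99], [], []]
```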

## Populate the Adjacency List

In [28]:
for parent, child in iptc_edges:
    p_idx = node_to_int[parent]
    c_idx = node_to_int[child]
    
    # Add the child's integer ID to the parent's list
    adjacency_list[p_idx].append(c_idx)

## Connect the 17 top-level IPTC nodes to the dummy root (Index 0)


In [None]:
for top_node in top_level_nodes:
    top_idx = node_to_int[top_node]
    adjacency_list[0].append(top_idx)

# Print the first few nodes to verify
print("Dummy Root's Children:", adjacency_list[0])
print("medtop:01000000:", adjacency_list[node_to_int["medtop:01000000"]])

# You can now pass 'adjacency_list' directly into the napkinXC PLT model.

Dummy Root's Children: [13, 17, 8, 5, 15, 3, 4, 10, 11, 16, 9, 6, 12, 2, 1, 7, 14]
medtop:01000000: [18, 54, 61]
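Before handing the list to a model, it may be worth confirming that it really is a tree rooted at index 0. The sketch below uses a hypothetical helper, `validate_tree`, that walks the list breadth-first and reports out-of-range children, nodes reached more than once, and nodes never reached. It does not check anything specific to napkinXC.

```python
from collections import deque

def validate_tree(adjacency, root=0):
    """Return a list of problems; an empty list means every node is reached exactly once."""
    n = len(adjacency)
    seen = [0] * n
    seen[root] = 1
    queue = deque([root])
    problems = []
    while queue:
        node = queue.popleft()
        for child in adjacency[node]:
            if not 0 <= child < n:
                problems.append(f"node {node} has out-of-range child {child}")
                continue
            seen[child] += 1
            if seen[child] == 1:
                queue.append(child)  # expand each node only once, so cycles terminate
            else:
                problems.append(f"node {child} reached more than once")
    problems.extend(
        f"node {i} unreachable from root" for i, count in enumerate(seen) if count == 0
    )
    return problems

print(validate_tree([[1, 2], [3], [], []]))  # well-formed toy tree
print(validate_tree([[1, 2], [2], []]))      # node 2 has two parents
```

Calling `validate_tree(adjacency_list)` on the list built above should return an empty list if the vocabulary forms a proper tree under the dummy root.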