### New ID Extraction

#### 1. Context & Purpose
TMDB provides daily ID snapshots but does not offer a direct "delta" file (a list of only newly added IDs). To avoid re-crawling records that have already been collected, this script compares two snapshots (**12/31/2025** and **02/02/2026**) and keeps only the entries that appear in the newer snapshot but not in the older one.

#### 2. Extraction Logic
The script processes multiple categories, including **Movie**, **Person**, and **TV Series**:

* **Step 1:** The older ID list is loaded into a **Set** data structure, which serves as the baseline. A set gives average-case $O(1)$ membership lookups, which is essential for handling millions of records efficiently.
* **Step 2:** The script iterates through each line of the newer ID snapshot.
* **Step 3:** For each ID in the new list, the script checks if it exists within the baseline Set.
* **Step 4:** If an ID is not present in the baseline Set, it is identified as a new entry and its line is written to the output file.
* **Step 5:** The same logic is applied to each primary entity (**Movie, Person, TV Series**) by changing `IDX_LABEL` and re-running the notebook, so that the incremental update covers every category.
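
The steps above can be sketched on in-memory data before touching the real snapshot files. This is a minimal illustration, not the notebook's actual function: the helper name `new_records` and the sample records are made up, and the only assumption carried over from the snapshots is that each line is a JSON object with an `"id"` field.

```python
import io
import json

def new_records(baseline_lines, newer_lines):
    """Yield lines from newer_lines whose "id" is absent from baseline_lines."""
    # Steps 1: collect the baseline ids in a set for fast membership tests
    baseline_ids = {json.loads(line)["id"] for line in baseline_lines if line.strip()}
    # Steps 2-4: stream the newer snapshot and keep only unseen ids
    for line in newer_lines:
        if line.strip() and json.loads(line)["id"] not in baseline_ids:
            yield line

# Made-up sample snapshots, one JSON object per line
old_snapshot = io.StringIO('{"id": 1}\n{"id": 2}\n')
new_snapshot = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n{"id": 5}\n')

fresh = [json.loads(line)["id"] for line in new_records(old_snapshot, new_snapshot)]
print(fresh)  # [3, 5]
```

Only the older snapshot is held in memory; the newer one is streamed line by line, so the output keeps the order of the newer file.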

In [11]:
import json
import tqdm

In [12]:
ID_LIST_LABEL = ["movie", "tv_series", "person"]
IDX_LABEL = 1   # Index into ID_LIST_LABEL: 0 = movie, 1 = tv_series, 2 = person

FIRST_INPUT_PATH = f"../data/id_list/{ID_LIST_LABEL[IDX_LABEL]}_ids_12_31_2025.json"
SECOND_INPUT_PATH = f"../data/id_list/{ID_LIST_LABEL[IDX_LABEL]}_ids_02_02_2026.json"

OUTPUT_PATH = f"../data/id_list/{ID_LIST_LABEL[IDX_LABEL]}_new_ids.jsonl"

In [13]:
def count_lines(file_path):
    with open(file_path, "rb") as f:
        return sum(1 for _ in f)

In [14]:
def extract_new_ids(first_input_path, second_input_path, output_path):
    """Write every record of the second snapshot whose id is absent from the first."""
    total_lines = count_lines(second_input_path)

    print("Start loading ids from first input file")
    first_ids = set()
    with open(first_input_path, "r", encoding="utf-8") as f:
        for line in f:
            record = line.strip()
            if record:  # skip blank lines
                first_ids.add(json.loads(record)["id"])

    print(f"Start extracting new ids from second input file ({total_lines} lines)")
    # The context managers close both files and the progress bar, even on error
    with open(second_input_path, "r", encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8") as f_out, \
         tqdm.tqdm(total=total_lines, desc="Extracting new ids", unit="line") as pbar:
        for line in f_in:
            record = line.strip()
            # Keep the original line when its id is not in the baseline set
            if record and json.loads(record)["id"] not in first_ids:
                f_out.write(line)
            pbar.update(1)

In [15]:
extract_new_ids(FIRST_INPUT_PATH, SECOND_INPUT_PATH, OUTPUT_PATH)

Start loading ids from first input file
Start extracting new ids from second input file (214192 lines)


Extracting new ids: 100%|██████████| 214192/214192 [00:01<00:00, 171484.76line/s]
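
The run above covers a single category (`tv_series`). One possible way to cover all three without editing `IDX_LABEL` by hand is to build the paths per label in a loop. The helper `build_paths` below is hypothetical and not part of the notebook; it only reproduces the file-name pattern used in the configuration cell, and the call to `extract_new_ids` is left commented out so the sketch runs without the data files.

```python
ID_LIST_LABEL = ["movie", "tv_series", "person"]

def build_paths(label, old_date="12_31_2025", new_date="02_02_2026",
                base_dir="../data/id_list"):
    """Return (older snapshot, newer snapshot, output) paths for one label.

    Hypothetical helper: mirrors the naming pattern of the configuration cell.
    """
    return (
        f"{base_dir}/{label}_ids_{old_date}.json",
        f"{base_dir}/{label}_ids_{new_date}.json",
        f"{base_dir}/{label}_new_ids.jsonl",
    )

jobs = {label: build_paths(label) for label in ID_LIST_LABEL}

for label, (first_path, second_path, output_path) in jobs.items():
    print(f"{label}: {first_path} vs {second_path} -> {output_path}")
    # extract_new_ids(first_path, second_path, output_path)  # function defined in the notebook
```

Keeping the dates as parameters means the same loop could be reused for a later pair of snapshots by passing different `old_date` and `new_date` values.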
