# Calling the Insilicom Data Harmonization API

This notebook demonstrates how to call the Insilicom data harmonization API.

The input data is in JSON format.

Update: 2025-03-20

## 0. Prep

Get an API key at https://service.insilicom.com/

After registering or logging in, details can be found at https://service.insilicom.com/docs/metadata_normalization.

In [1]:
import json
import os
import requests
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")

imeta_key = os.getenv('IMETA_API_KEY')

benchmark_dir = "/home/jhuang/project/insilicom_harmonization_benchmark"

# Input json file.
input_json_path = os.path.join(benchmark_dir, "data", "geo_input_20.json")

# Save all results to a JSON file
output_file = os.path.join(benchmark_dir, "result", "api_all_sample_results.json")
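
The cell above reads the key from the IMETA_API_KEY environment variable, and os.getenv returns None when that variable is unset. A minimal sketch of an early check follows; require_api_key is a hypothetical helper name for illustration and is not part of the Insilicom API.

```python
import os


def require_api_key(var_name: str = "IMETA_API_KEY") -> str:
    """Return the API key from the environment, or fail early with a clear message.

    Hypothetical helper for illustration; not part of the Insilicom API.
    """
    key = os.getenv(var_name)
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it before running the notebook."
        )
    return key
```

Calling require_api_key() in place of os.getenv('IMETA_API_KEY') would surface a missing key before any request is sent.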


In [2]:
# Wrapper for the imeta API
def call_data_harmonization_api_v2(input_metadata: list, llm: str, api_key: str) -> requests.Response:
    """POST one batch of (sample_id, metadata) pairs to the metadata normalization endpoint."""
    url = "https://service.insilicom.com/open_api/bdft/v2/metadata_normalization"
    headers = {
        'Content-Type': 'application/json',
        'X-API-KEY': api_key
    }
    body = {
        # Convert the batch list to a dict and serialize it to a JSON string for the API
        'meta_data': json.dumps(dict(input_metadata)),
        'llm_model': llm
    }
    return requests.post(url, json=body, headers=headers)
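
To make the request shape concrete, the sketch below builds the same body as the wrapper without sending it. The sample record is made up for illustration, and build_request_body is a hypothetical helper name. Note that meta_data is itself a JSON string, so the payload is JSON nested inside JSON.

```python
import json


def build_request_body(batch: list, llm: str) -> dict:
    """Build the request body the wrapper sends (no network call is made)."""
    return {
        # (sample_id, metadata) pairs -> dict -> JSON string
        "meta_data": json.dumps(dict(batch)),
        "llm_model": llm,
    }


# Made-up record for illustration only.
toy_batch = [("SAMPLE_1", {"tissue": "lung", "treatment": "none"})]
body = build_request_body(toy_batch, llm="GPT")

# Decoding meta_data gives back a dict keyed by sample ID.
decoded = json.loads(body["meta_data"])
print(decoded["SAMPLE_1"]["tissue"])
```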

## 1. Load input data

In [3]:
# Load input
with open(input_json_path, 'r') as file:
    luad_data = json.load(file)

In [4]:
# Print one (sample ID, metadata) pair from luad_data for illustration
example_item = next(iter(luad_data.items()))
print("Example JSON item from luad_data:")

print(json.dumps(example_item, indent=4))

Example JSON item from luad_data:
[
    "GSM6928120",
    {
        "title": [
            "A549 cells,shikonin2"
        ],
        "geo_accession": [
            "GSM6928120"
        ],
        "status": [
            "Public on Jan 10 2024"
        ],
        "submission_date": [
            "Jan 11 2023"
        ],
        "last_update_date": [
            "Jan 10 2024"
        ],
        "type": [
            "SRA"
        ],
        "channel_count": [
            "1"
        ],
        "source_name_ch1": [
            "cancer cell"
        ],
        "organism_ch1": [
            "Homo sapiens"
        ],
        "taxid_ch1": [
            "9606"
        ],
        "tissue": "cancer cell",
        "cell line": "A549",
        "cell type": "lung carcinoma cell",
        "genotype": "WT",
        "treatment": "shikonin",
        "treatment_protocol_ch1": [
            "A549 Cells were treated with 3\u03bcm shikonin for 24 hours,then collected cells and dissolved in trizol"
    

## 2. Divide input into batches

Prepare batches of 10 samples each.

Calculate and print the total number of input tokens for each batch.

In [5]:
# Convert to list of pairs
luad_data_items = list(luad_data.items())

# Break into batches of 10
batch_size = 10

luad_data_batches = [luad_data_items[i:i + batch_size] for i in range(0, len(luad_data_items), batch_size)]
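
The slicing pattern above can be checked on toy data. This is a small sketch; chunk_pairs is a hypothetical helper name and the 25 records are made up, chosen so the last batch is shorter than the rest.

```python
def chunk_pairs(pairs: list, size: int) -> list:
    """Split a list of (key, value) pairs into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]


# Made-up records for illustration only.
toy_items = [(f"S{i}", {"n": i}) for i in range(25)]
toy_batches = chunk_pairs(toy_items, 10)

print([len(b) for b in toy_batches])  # [10, 10, 5]
```

The final batch holds whatever remains, so no sample is dropped when the total is not a multiple of the batch size.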

In [6]:
for batch_idx, batch in enumerate(luad_data_batches):
    # Serialize the batch the same way the API wrapper does, then count tokens
    batch_json_str = json.dumps(dict(batch))
    batch_token_count = len(encoding.encode(batch_json_str))

    print(f"Batch {batch_idx + 1} total input tokens: {batch_token_count}")

Batch 1 total input tokens: 10506
Batch 2 total input tokens: 5213


## 3. Run imeta API.

API documentation: https://service.insilicom.com/docs/metadata_normalization

The llm argument can be "GPT" or "Claude".

The run takes about 2-3 minutes.

In [7]:
# Process batches and combine results
combined_results = {}

for batch_idx, batch in enumerate(luad_data_batches):
    response = call_data_harmonization_api_v2(batch, llm="GPT", api_key=imeta_key)

    if response.status_code == 200:
        out_json = response.json()['Result']
        combined_results.update(out_json)  # Merge batch results into a single dict

        # Count output tokens for this batch
        token_count = len(encoding.encode(json.dumps(out_json)))

        print(f"Batch {batch_idx + 1}:")
        print(f"Number of output tokens in batch {batch_idx + 1}: {token_count}")
    else:
        print(f"Batch {batch_idx + 1} failed with Error {response.status_code}: {response.text}")
        combined_results[f"batch_{batch_idx + 1}_error"] = f"Error {response.status_code}: {response.text}"

Batch 1:
Number of output tokens in batch 1: 3613
Batch 2:
Number of output tokens in batch 2: 4231
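
The loop above stores failure messages in the same dict as the per-sample results, so later steps that iterate over combined_results also see error strings. One alternative, sketched here as an assumption rather than the notebook's method, keeps failures in a separate dict. record_batch is a hypothetical helper and the responses are made up.

```python
def record_batch(results: dict, errors: dict, batch_idx: int,
                 status_code: int, payload) -> None:
    """Store a batch outcome: successes go to `results`, failures to `errors`."""
    if status_code == 200:
        # On success, payload is the decoded JSON body with a 'Result' dict.
        results.update(payload["Result"])
    else:
        # On failure, payload is the response text.
        errors[f"batch_{batch_idx + 1}"] = f"Error {status_code}: {payload}"


results, errors = {}, {}

# Made-up responses for illustration only.
record_batch(results, errors, 0, 200, {"Result": {"SAMPLE_1": {"sex": "female"}}})
record_batch(results, errors, 1, 500, "server error")

print(sorted(results))
print(errors)
```

With this split, the results dict contains only sample records, and the errors dict can be inspected or retried separately.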


## 4. Save API result.

Rename any gender field to sex, then write the combined results to api_all_sample_results.json.

In [8]:
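# Rename 'gender' to 'sex'; skip non-dict entries such as batch error strings
for gsm, data in combined_results.items():
    if isinstance(data, dict) and 'gender' in data:
        data['sex'] = data.pop('gender')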

In [9]:
with open(output_file, 'w') as f:
    json.dump(combined_results, f, indent=4)

print(f"All batch results saved to {output_file}")

All batch results saved to /home/jhuang/project/insilicom_harmonization_benchmark/result/api_all_sample_results.json
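
After saving, it can be useful to confirm that every input sample has a result. The sketch below assumes the API returns results keyed by the same sample IDs as the input, which the loop over combined_results suggests but the notebook does not state. find_missing_samples is a hypothetical helper and the data is made up.

```python
def find_missing_samples(input_data: dict, results: dict) -> list:
    """Return the input sample IDs that have no entry in the results, sorted."""
    return sorted(set(input_data) - set(results))


# Made-up data for illustration only.
toy_input = {"SAMPLE_1": {}, "SAMPLE_2": {}, "SAMPLE_3": {}}
toy_results = {"SAMPLE_1": {"sex": "female"}, "SAMPLE_3": {"sex": "male"}}

print(find_missing_samples(toy_input, toy_results))  # ['SAMPLE_2']
```

In this notebook the call would be find_missing_samples(luad_data, combined_results).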
