# 05. Creating Dictionary Tables

In this notebook, we'll learn how to transform data and create lookup tables that are useful for dictionaries and character composers like Radically.


In [None]:
import pandas as pd
import json


## Adding New Columns

Assigning to a column name that does not exist yet creates that column. The right-hand side is usually an expression built from existing columns:


In [None]:
# Load variant data
df_variants = pd.read_csv('../cjkvi-variants/joyo-variants.txt',
                          sep=',',
                          comment='#',
                          names=['character', 'type', 'variant'],
                          encoding='utf-8')

# Add a new column indicating if this is a variant relationship
df_variants['is_variant'] = df_variants['type'] == 'joyo/variant'

# Add a column with the length of the variant character
df_variants['variant_length'] = df_variants['variant'].str.len()

df_variants.head(10)
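As a minimal sketch of the same two assignments on data that needs no file on disk, the block below builds a three-row frame by hand. The rows are illustrative stand-ins shaped like the three-column layout above, not an excerpt from `joyo-variants.txt`; `df_demo` and the `placeholder/other` type value are made up for the example.

```python
import pandas as pd

# Illustrative rows in the same three-column layout (not read from the file).
# 'placeholder/other' is a made-up type value, included only for contrast.
df_demo = pd.DataFrame({
    'character': ['亜', '悪', '圧'],
    'type': ['joyo/variant', 'joyo/variant', 'placeholder/other'],
    'variant': ['亞', '惡', None],
})

# Boolean column from an element-wise comparison
df_demo['is_variant'] = df_demo['type'] == 'joyo/variant'

# .str.len() leaves missing values as NaN instead of raising an error
df_demo['variant_length'] = df_demo['variant'].str.len()

print(df_demo)
```

Both assignments work on the whole column at once, so no explicit loop is needed.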


## Transforming Data with .apply()

The `.apply()` method calls a function on each row (`axis=1`) or each column (`axis=0`, the default) of a DataFrame:


In [None]:
# Example: a function that checks whether a row has a variant.
# pd.notna() is False for both None and NaN, so no separate None check is needed
def has_variant(row):
    return pd.notna(row['variant'])

# Apply the function to each row
df_variants['has_variant'] = df_variants.apply(has_variant, axis=1)

# Same check written inline as a lambda function
df_variants['has_variant_lambda'] = df_variants.apply(
    lambda row: pd.notna(row['variant']),
    axis=1
)

df_variants[['character', 'variant', 'has_variant']].head()
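Row-wise `.apply(..., axis=1)` calls a Python function once per row. For a simple missing-value check like the one above, a column-level method should give the same answer with a single call. The sketch below compares the two on a small hand-made frame; `df_demo` is a hypothetical name and the rows are illustrative.

```python
import pandas as pd

# Illustrative rows; the second one has no variant
df_demo = pd.DataFrame({
    'character': ['亜', '悪', '圧'],
    'variant': ['亞', None, '壓'],
})

# Row-wise: the lambda runs once per row
row_wise = df_demo.apply(lambda row: pd.notna(row['variant']), axis=1)

# Column-wise: one call on the whole column
vectorized = df_demo['variant'].notna()

print(row_wise.tolist(), vectorized.tolist())
```

Row-wise `.apply()` remains the more flexible tool when the logic needs several columns or branching that has no column-level equivalent.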


## String Operations with .str

The `.str` accessor applies string methods to every value in a column at once:


In [None]:
# Load IDS data
# Note: Some characters have multiple IDS decompositions in extra columns,
# so usecols keeps only the first 3 columns (the first decomposition)
df_ids = pd.read_csv('../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids.txt',
                     sep='\t',
                     skiprows=2,  # Skip copyright header
                     usecols=[0, 1, 2],  # Only read first 3 columns
                     names=['unicode', 'character', 'ids'],
                     encoding='utf-8',
                     nrows=5000)  # Read the first 5,000 rows, enough for varied examples

# Extract the first character of the IDS string. For a composed character this
# is the structure type; for an undecomposed one it is the character itself
df_ids['ids_structure'] = df_ids['ids'].str[0]

# Count components in IDS: every character that is neither an IDS structure
# character nor ASCII (the same filter extract_components() uses further down)
# IDS structure characters: ⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻
structure_chars = '⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻'
df_ids['component_count'] = df_ids['ids'].apply(
    lambda x: sum(1 for char in str(x) if char not in structure_chars and ord(char) > 127)
    if pd.notna(x) else 0
)

df_ids[['character', 'ids', 'ids_structure', 'component_count']].head(10)
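A few more `.str` operations, sketched on four hand-written IDS strings so the results can be checked by eye. The frame `df_demo` and its column names are hypothetical. The regular expression relies on the twelve structure characters listed above occupying the contiguous range U+2FF0 to U+2FFB.

```python
import pandas as pd

# Hand-written IDS strings for illustration (一 is not decomposed)
df_demo = pd.DataFrame({
    'character': ['休', '林', '想', '一'],
    'ids': ['⿰亻木', '⿰木木', '⿱相心', '一'],
})

# First character of each IDS string
df_demo['first_char'] = df_demo['ids'].str[0]

# True where the IDS string itself contains 木 (a plain substring test,
# so 木 nested inside 相 is not found)
df_demo['has_tree'] = df_demo['ids'].str.contains('木', regex=False)

# Number of structure characters (U+2FF0 to U+2FFB) in each IDS string
df_demo['operator_count'] = df_demo['ids'].str.count('[\u2ff0-\u2ffb]')

print(df_demo)
```

Note that `.str.contains()` only sees the characters written in the string: 想 is reported as not containing 木 even though its component 相 does.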


## Creating Lookup Tables

### Example 1: Character → Variants Mapping

Create a dictionary-style lookup table:


In [None]:
# Group variants by character
variant_lookup = df_variants.groupby('character')['variant'].apply(list).to_dict()

# Show some examples
print("Character → Variants lookup:")
for char, variants in list(variant_lookup.items())[:5]:
    print(f"  {char}: {variants}")
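The grouping step can be sketched on a few hand-written rows, where one character has several variants. The rows are illustrative rather than copied from the file, and `df_demo` and `lookup` are hypothetical names.

```python
import pandas as pd

# Illustrative rows: 弁 is listed with three variants, 亜 with one
df_demo = pd.DataFrame({
    'character': ['弁', '弁', '弁', '亜'],
    'variant': ['辨', '瓣', '辯', '亞'],
})

# One list of variants per character, in row order
lookup = df_demo.groupby('character')['variant'].apply(list).to_dict()

# .get() with a default avoids a KeyError for characters that have no entry
print(lookup.get('弁', []))
print(lookup.get('山', []))
```

Using `.get()` with an empty-list default lets calling code treat "no variants" and "some variants" the same way.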


### Example 2: Variant → Standard Form Mapping

Create a reverse lookup (variant to standard character):


In [None]:
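# Create variant → character mapping
# Note: a dict keeps only the last character seen for a repeated variant,
# and a missing variant would become a NaN key, so drop those rows first
pairs = df_variants.dropna(subset=['variant'])
variant_to_standard = dict(zip(pairs['variant'], pairs['character']))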

print("Variant → Standard lookup:")
for variant, standard in list(variant_to_standard.items())[:5]:
    print(f"  {variant}: {standard}")
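A plain `dict` holds one value per key, so a reverse lookup built with `dict(zip(...))` keeps only one standard form when a variant is listed under more than one character. The sketch below shows the difference between a one-to-one and a one-to-many reverse map. It uses placeholder strings instead of real characters, and whether such collisions actually occur in the variant file is not checked here.

```python
import pandas as pd

# Placeholder strings stand in for characters: 'v1' is listed under both 'A' and 'B'
df_demo = pd.DataFrame({
    'character': ['A', 'B', 'C'],
    'variant': ['v1', 'v1', 'v2'],
})

# One-to-one dict: the later row ('B') silently replaces the earlier one ('A')
one_to_one = dict(zip(df_demo['variant'], df_demo['character']))

# One-to-many dict: every standard form is kept in a list
one_to_many = df_demo.groupby('variant')['character'].apply(list).to_dict()

print(one_to_one)
print(one_to_many)
```

A quick way to see whether this matters for real data is `df['variant'].duplicated().any()`, which reports whether any variant appears in more than one row.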


### Example 3: Component → Characters (Forward Map)

From IDS data, create a forward map showing which characters contain each component:


In [None]:
# Extract components from IDS strings
# This is a simplified version - in reality, IDS parsing is more complex
def extract_components(ids_string):
    """Extract CJK characters from IDS string (simplified version)"""
    if pd.isna(ids_string):
        return []
    # Remove IDS structure characters
    structure_chars = '⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻'
    components = [char for char in str(ids_string) if char not in structure_chars and ord(char) > 127]
    return components

# Create component → characters mapping
component_map = {}
for _, row in df_ids.iterrows():
    char = row['character']
    # set() drops repeats, so a character like 林 (⿰木木) is listed once under 木
    for component in set(extract_components(row['ids'])):
        component_map.setdefault(component, []).append(char)

# Show example: characters containing "人"
if '人' in component_map:
    print(f"Characters containing '人': {len(component_map['人'])}")
    print(component_map['人'][:10])
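The same forward map can be sketched without pandas on a handful of hand-written pairs, which makes the result easy to verify by eye. The names `ids_rows` and `STRUCTURE_CHARS` are hypothetical, and the map records only the components written directly in each IDS string, not components nested deeper.

```python
from collections import defaultdict

# Hand-written (character, IDS) pairs for illustration
ids_rows = [
    ('休', '⿰亻木'),
    ('林', '⿰木木'),
    ('森', '⿱木林'),
    ('体', '⿰亻本'),
]

STRUCTURE_CHARS = set('⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻')

# defaultdict(list) creates the empty list on first access to a new key
component_map = defaultdict(list)
for char, ids in ids_rows:
    # dict.fromkeys() removes repeats while keeping first-seen order
    components = dict.fromkeys(c for c in ids if c not in STRUCTURE_CHARS)
    for component in components:
        component_map[component].append(char)

print(component_map['木'])
print(component_map['亻'])
```

Here 体 is not listed under 木 even though 本 is built on 木, because only direct components are indexed. Finding characters at any depth would need a recursive expansion of each component's own IDS.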


## Exporting Data

Once you've created your lookup tables, you'll want to save them:


In [None]:
# Export DataFrame to CSV
df_variants.to_csv('variant_table.csv', index=False, encoding='utf-8')
print("Exported to variant_table.csv")

# Export DataFrame to JSON
# First, convert to a more JSON-friendly format
variant_dict = df_variants.groupby('character')['variant'].apply(list).to_dict()
with open('variant_lookup.json', 'w', encoding='utf-8') as f:
    json.dump(variant_dict, f, ensure_ascii=False, indent=2)
print("Exported to variant_lookup.json")
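To check that an export can be read back unchanged, a round trip is a useful habit. The sketch below writes a small hand-written dictionary to a temporary directory and loads it again; the dictionary contents are illustrative and `loaded` and `raw_text` are hypothetical names.

```python
import json
import os
import tempfile

# Illustrative lookup table
variant_dict = {'弁': ['辨', '瓣', '辯'], '亜': ['亞']}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'variant_lookup.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(variant_dict, f, ensure_ascii=False, indent=2)

    # Read the file back, once as parsed JSON and once as raw text
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)
    with open(path, encoding='utf-8') as f:
        raw_text = f.read()

print(loaded == variant_dict)
print(raw_text)
```

With `ensure_ascii=False` the characters are stored as themselves rather than as `\uXXXX` escapes, which keeps the file readable in a text editor.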


## Summary

| Operation | Method | Example |
|-----------|--------|---------|
| Add column | `df['new'] = ...` | `df['len'] = df['col'].str.len()` |
| Apply function | `.apply(func, axis=1)` | `df.apply(lambda row: row['a'] + row['b'], axis=1)` |
| String operations | `.str.method()` | `df['col'].str.contains('text')` |
| Group and aggregate | `.groupby().apply()` | `df.groupby('char')['var'].apply(list)` |
| Export CSV | `.to_csv()` | `df.to_csv('file.csv', encoding='utf-8')` |
| Export JSON | `json.dump()` | `json.dump(data, f, ensure_ascii=False)` |

## What's Next?

In the next notebook, we'll learn:
- Basic statistics and aggregation
- GroupBy operations
- Counting and summarizing data

## Try It Yourself

1. Create a character → all variants lookup table
2. Build a component → characters forward map from IDS data
3. Export your lookup tables to JSON or CSV
4. Experiment with different string operations on IDS data
