# 01. Loading Data

In this notebook, we'll learn how to load data from different file formats. This is the first step in any data analysis workflow!

## File Formats We'll Work With

When working with CJK character data, you'll encounter several file formats:

1. **CSV files** (Comma-Separated Values) - Common for variant data
2. **JSON files** (JavaScript Object Notation) - Used for structured data like variant tables
3. **TSV files** (Tab-Separated Values) - Common for IDS (Ideographic Description Sequence) data

Let's learn how to load each type!


In [2]:
import pandas as pd


## 1. Loading CSV Files

CSV files are text files where values are separated by commas. However, many "CSV" files actually use different separators or have special formatting.

### Example: Simplified/Traditional Chinese Variants

The `cjkvi-simplified.txt` file contains simplified and traditional Chinese character mappings.

**Important**: Before loading a file, it's good practice to inspect it first! Let's see what the file looks like:


In [3]:
# First, let's inspect the file to see its structure
# Open the file and look at the first few lines
with open('../cjkvi-variants/cjkvi-simplified.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    print("First 40 lines of the file:")
    print("=" * 50)
    for i, line in enumerate(lines[:40], 1):
        print(f"{i:3d}: {line.rstrip()}")
    print("=" * 50)
    print("\nNotice:")
    print("- Line 1: Copyright comment (starts with #)")
    print("- Lines 2-5: Type definitions (not actual data)")
    print("- Lines 6-38: Comments explaining the format (start with #)")
    print("- Line 39+: Actual data rows start here!")


First 40 lines of the file:
  1: # Copyright (c) 2014 CJKVI Database
  2: cjkvi/simplified,<name>,简化字
  3: cjkvi/variant-simplified,<name>,异体简化字
  4: cjkvi/pseudo-simplified,<name>,拟似简化字
  5: cjkvi/traditional,<name>,繁体字
  6: #
  7: #     Simplified         Traditional         Variants
  8: #
  9: #            <--simplified--
 10: #         锺 <--------+----> 鍾
 11: #                     |
 12: #         钟 <----+---+----> 鐘
 13: #                 |
 14: #                 +--------> 鈡
 15: #            --traditional-->
 16: #
 17: #                     <--simplified--
 18: #         扬 <-------------> 揚 <------+------> 颺
 19: #            --traditional-->          |
 20: #                                      | (official)
 21: #                     <--simplified--  |
 22: #         飏 <-------------------------+
 23: #                    --traditional-->
 24: #
 25: #                     <--simplified--
 26: #         辉 <-------------> 輝 <------+------> 煇
 27: #            --traditional-->

Now that we know the data starts at line 39, we can load it correctly using `skiprows`:


In [14]:
# Load CSV file with comma separator
# Important parameters:
# - sep=',' : comma separator
# - skiprows=38 : skip the first 38 rows (rows 0-37), data starts at row 38 (line 39 in text editor)
# - comment='#' : skip lines starting with # (for any remaining comment lines)
# - names=[...] : specify column names (file doesn't have headers)
# - encoding='utf-8' : essential for CJK characters!

df_simplified = pd.read_csv('../cjkvi-variants/cjkvi-simplified.txt',
                            sep=',',
                            skiprows=38,  # Skip header/metadata lines (first 38 rows)
                            comment='#',
                            names=['variant', 'type', 'target'],
                            encoding='utf-8')

print(f"Loaded {len(df_simplified)} rows")
df_simplified.head(10)


Loaded 13681 rows


| | variant | type | target |
|---|---------|------|--------|
| 0 | 㑮 | cjkvi/simplified | 𫝈 |
| 1 | 㑯 | cjkvi/simplified | 㑔 |
| 2 | 㑳 | cjkvi/simplified | 㑇 |
| 3 | 㑶 | cjkvi/simplified | 㐹 |
| 4 | 㑺 | cjkvi/simplified | 俊 |
| 5 | 㒑 | cjkvi/simplified | ⿰亻汇 |
| 6 | 㒒 | cjkvi/simplified | ⿰亻业 |
| 7 | 㒓 | cjkvi/simplified | 𠉂 |
| 8 | 㒘 | cjkvi/simplified | ⿰亻竖 |
| 9 | 㒜 | cjkvi/simplified | 𠇐 |
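
Counting header lines by eye is error-prone. As a rough sketch, a `skiprows` value could instead be derived by finding the last line that starts with `#` near the top of the file. The helper name `last_comment_line` and its `lookahead` limit are hypothetical, the sample text only mimics the layout of `cjkvi-simplified.txt`, and the approach assumes that no comment lines appear once the data rows begin.

```python
import io

import pandas as pd


def last_comment_line(lines, comment="#", lookahead=100):
    """Return the 1-based number of the last comment line among the first
    `lookahead` lines, or 0 if there is none."""
    last = 0
    for number, line in enumerate(lines[:lookahead], start=1):
        if line.startswith(comment):
            last = number
    return last


# Made-up sample that mimics the layout of cjkvi-simplified.txt
sample = (
    "# Copyright notice\n"
    "cjkvi/simplified,<name>,简化字\n"
    "#\n"
    "# explanatory comment\n"
    "㑮,cjkvi/simplified,𫝈\n"
    "㑯,cjkvi/simplified,㑔\n"
)

skip = last_comment_line(sample.splitlines())
df_sample = pd.read_csv(
    io.StringIO(sample),
    skiprows=skip,  # skip everything up to and including the last comment line
    comment="#",
    names=["variant", "type", "target"],
)
print(skip, len(df_sample))
```

For a real file, the list of lines would come from `f.readlines()` as in the inspection cell above. It is still worth printing the first few loaded rows to confirm the heuristic picked the right starting point.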


### Understanding the Data

This file contains:
- **variant**: A variant character
- **type**: The type of variant (e.g., "cjkvi/simplified", "cjkvi/traditional")
- **target**: The target/standard character

Let's see what types of variants we have:


In [15]:
# See what variant types are in the data
print("Variant types:")
print(df_simplified['type'].value_counts())


Variant types:
type
cjkvi/simplified            8399
cjkvi/traditional           5097
cjkvi/variant-simplified     112
cjkvi/pseudo-simplified       73
Name: count, dtype: int64


## 2. Loading JSON Files

JSON files are structured data files that are commonly used for variant tables and lookup data.

### Example: Shinjitai (Modern Japanese) Variants

The `shinjitai.json` file in the `shinjitai-table` folder contains Japanese character variants in JSON format:


In [4]:
# Load JSON file
# The file contains a dictionary where keys are shinjitai characters
# and values are lists of kyujitai (traditional) forms
# Note: pd.read_json() doesn't work well with dictionaries that have
# lists of different lengths, so we use json.load() instead

import json
with open('../shinjitai-table/shinjitai.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    
# Convert to a more usable format
# Create a list of dictionaries for each character-variant pair
shinjitai_list = []
for shinjitai, kyujitai_list in data.items():
    if kyujitai_list:  # Only if there are variants
        for kyujitai in kyujitai_list:
            shinjitai_list.append({
                'shinjitai': shinjitai,
                'kyujitai': kyujitai
            })
    else:
        # Characters with no variants
        shinjitai_list.append({
            'shinjitai': shinjitai,
            'kyujitai': None
        })

df_shinjitai = pd.DataFrame(shinjitai_list)
print(f"Loaded {len(df_shinjitai)} character entries")
df_shinjitai.head(10)


Loaded 2138 character entries


| | shinjitai | kyujitai |
|---|-----------|----------|
| 0 | 亜 | 亞 |
| 1 | 哀 | |
| 2 | 挨 | |
| 3 | 愛 | |
| 4 | 曖 | |
| 5 | 悪 | 惡 |
| 6 | 握 | |
| 7 | 圧 | 壓 |
| 8 | 扱 | |
| 9 | 宛 | |
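
The explicit loop above makes each step visible. As an alternative sketch, pandas can do the same flattening with `Series.explode()`, which turns each list element into its own row. The dictionary below is a made-up sample with the same shape as the JSON file (each key maps to a possibly empty list), not the real data.

```python
import pandas as pd

# Made-up sample with the same shape as shinjitai.json:
# each key maps to a (possibly empty) list of older forms
data = {"亜": ["亞"], "哀": [], "弁": ["辨", "瓣", "辯"]}

df_pairs = (
    pd.Series(data, name="kyujitai")
    .explode()                 # one row per list element; empty lists become NaN
    .rename_axis("shinjitai")  # name the index so reset_index() creates that column
    .reset_index()
)
print(df_pairs)
```

One difference from the loop version: characters without variants end up with `NaN` rather than `None` in the `kyujitai` column. Both count as missing values for `isna()`.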


## 3. Loading Tab-Separated Files (TSV)

Tab-separated files use tabs instead of commas. This is common for IDS (Ideographic Description Sequence) data.

### Example: IDS Character Decomposition Data

The IDS files show how characters are built from components:


In [22]:
# Load tab-separated file
# Important: sep='\t' for tab separator
# The file format is: Unicode_code  Character  IDS_string(s) [optional additional IDS...]
# Note: Some characters have multiple IDS decompositions (separated by tabs)
# Note: This file has a 2-line header (copyright), so we skip it

# Use usecols to only read the first 3 columns (some lines have 4+ columns)
df_ids = pd.read_csv('../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids.txt',
                     sep='\t',
                     skiprows=2,  # Skip copyright header
                     usecols=[0, 1, 2],  # Only read first 3 columns (unicode, character, first IDS)
                     names=['unicode', 'character', 'ids'],
                     encoding='utf-8',
                     nrows=1000)  # Load first 1000 rows for demonstration

print(f"Loaded {len(df_ids)} rows (showing first 1000)")
print(f"Total file has ~89,000 rows of IDS data!")
print("\nNote: Some characters have multiple IDS decompositions in the file.")
print("We're showing only the first IDS for each character here.")
df_ids.head(10)


Loaded 1000 rows (showing first 1000)
Total file has ~89,000 rows of IDS data!

Note: Some characters have multiple IDS decompositions in the file.
We're showing only the first IDS for each character here.


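| | unicode | character | ids |
|---|---------|-----------|-----|
| 0 | U+03B1 | α | α |
| 1 | U+2113 | ℓ | ℓ |
| 2 | U+2460 | ① | ① |
| 3 | U+2461 | ② | ② |
| 4 | U+2462 | ③ | ③ |
| 5 | U+2463 | ④ | ④ |
| 6 | U+2464 | ⑤ | ⑤ |
| 7 | U+2465 | ⑥ | ⑥ |
| 8 | U+2466 | ⑦ | ⑦ |
| 9 | U+2467 | ⑧ | ⑧ |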


### Understanding IDS Data

The IDS (Ideographic Description Sequence) shows how characters are structured:
- **unicode**: Unicode code point (e.g., U+4E1D)
- **character**: The actual character
- **ids**: The decomposition showing how the character is built (e.g., ⿰纟纟 means "left-right structure with 纟 on both sides")
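
As a small illustration of these three columns, the sketch below converts a code point string to its character and separates the layout operators from the components of a simple IDS string. The helper names `code_point_to_char` and `split_ids` are hypothetical, and the sketch assumes an IDS made only of single characters; entries that carry extra annotations would need more careful parsing.

```python
# Ideographic Description Characters occupy the Unicode block U+2FF0..U+2FFF
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFF


def code_point_to_char(code_point):
    """Turn a string such as 'U+4E00' into the character it names."""
    return chr(int(code_point[2:], 16))


def is_operator(ch):
    """True if `ch` is a layout operator such as ⿰ (left-right)."""
    return IDC_FIRST <= ord(ch) <= IDC_LAST


def split_ids(ids):
    """Separate layout operators from components in a simple IDS string."""
    operators = [ch for ch in ids if is_operator(ch)]
    components = [ch for ch in ids if not is_operator(ch)]
    return operators, components


print(code_point_to_char("U+4E00"))
print(split_ids("⿰纟纟"))
```

Applied to a loaded table, `df_ids['unicode'].map(code_point_to_char)` should reproduce the `character` column, which makes for a quick sanity check after loading.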

## Key Parameters for Loading Data

Here's a summary of important parameters when loading data:

| Parameter | Purpose | Example |
|-----------|---------|---------|
| `sep` | Separator character | `sep=','` or `sep='\t'` |
| `encoding` | Character encoding | `encoding='utf-8'` (essential for CJK!) |
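| `comment` | Skip comment lines | `comment='#'` |
| `skiprows` | Skip header/metadata lines at the top of the file | `skiprows=38` |
| `names` | Column names | `names=['col1', 'col2']` |
| `usecols` | Read only the selected columns | `usecols=[0, 1, 2]` |
| `nrows` | Load only first N rows | `nrows=100` (for testing) |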

## Common Issues and Solutions

### Issue: Junk Rows at the Beginning
**Solution**: Always inspect the file first! Use a text editor or Python to check where the actual data starts, then use `skiprows=N` to skip header/metadata lines

### Issue: Encoding Errors
**Solution**: Always specify `encoding='utf-8'` for CJK character data, especially with Python's built-in `open()`, whose default encoding depends on the platform
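
To see why the encoding matters, the following minimal sketch decodes the same UTF-8 bytes with three different codecs. It uses only built-in string methods, and the sample text is one of the type names from `cjkvi-simplified.txt`.

```python
text = "简化字"
raw = text.encode("utf-8")  # the bytes as they would sit in a UTF-8 file

# Decoding with the right codec restores the text
restored = raw.decode("utf-8")

# A single-byte codec "succeeds" but yields one garbage character per byte
garbled = raw.decode("latin-1")
print(len(text), len(garbled))

# A strict codec such as ASCII refuses outright
ascii_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    ascii_failed = True
    print("decode failed:", err.reason)
```

The silent case is the more dangerous one: no error is raised, yet every CJK character is replaced by several meaningless ones.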

### Issue: Wrong Separator
**Solution**: Check the file first, then use `sep=','`, `sep='\t'`, or `sep=';'` as needed
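
Checking the separator can also be done in code. The sketch below is a simple heuristic, not a pandas feature: the hypothetical helper `guess_separator` picks the candidate that appears in every data line, ignoring blank lines and comment lines. The sample rows are copied from the outputs shown earlier in this notebook.

```python
def guess_separator(lines, candidates=(",", "\t", ";"), comment="#"):
    """Return the candidate separator that occurs in every data line,
    preferring the one with the highest minimum count per line.
    Returns None if no candidate appears in all data lines."""
    data_lines = [ln for ln in lines if ln.strip() and not ln.startswith(comment)]
    if not data_lines:
        return None
    best, best_count = None, 0
    for sep in candidates:
        count = min(ln.count(sep) for ln in data_lines)
        if count > best_count:
            best, best_count = sep, count
    return best


tsv_lines = ["# header comment", "U+03B1\tα\tα", "U+2113\tℓ\tℓ"]
csv_lines = ["㑮,cjkvi/simplified,𫝈", "㑯,cjkvi/simplified,㑔"]
print(repr(guess_separator(tsv_lines)), repr(guess_separator(csv_lines)))
```

The result can be passed straight to `pd.read_csv(..., sep=...)`. For messier files, the standard library's `csv.Sniffer` offers a more general heuristic.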

### Issue: Comment Lines
**Solution**: Use `comment='#'` to skip comment lines

### Issue: No Headers
**Solution**: Use `names=[...]` to specify column names

## What's Next?

In the next notebook, we'll learn how to:
- Explore and inspect DataFrames
- Understand the structure of your data
- Select specific columns and rows

## Try It Yourself

1. Try loading `../cjkvi-variants/jp-old-style.txt` (hint: it's tab-separated!)
2. Load the full IDS file (remove `nrows=1000`)
3. Experiment with different encoding options (see what happens without `encoding='utf-8'`)


## Daniel fixes it

All of these different files in different formats are a pain in the ass, so Daniel is going to help you cut corners and save some large tables to CSV format in the `daniel_tables` folder:

In [1]:
import glob
import json

import pandas as pd

# get file paths
file_paths = sorted(glob.glob('../cjkvi-ids-unicode/rawdata/manual_ids/*.txt'))

file_paths += [
    '../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids-analysis.txt',
    '../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids-cdp.txt',
    '../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids-ext-cdef.txt',
    '../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Basic.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Compat-Supplement.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Compat.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-A.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-B-1.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-B-2.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-B-3.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-B-4.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-B-5.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-B-6.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-C.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-D.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-E.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-F.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-G.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-H.txt',
    '../cjkvi-ids-unicode/rawdata/ids/IDS-UCS-Ext-I.txt'
]

# Create ids DataFrame combining all files
frames = []
for file_path in file_paths:
    try:
        # usecols keeps only the first 3 columns, which prevents errors
        # when some lines have extra fields
        df = pd.read_csv(
            file_path,
            sep='\t',
            comment='#',
            names=['code_point', 'character', 'components'],
            usecols=[0, 1, 2],  # Only read first 3 columns
            encoding='utf-8',
            on_bad_lines='skip'  # Skip lines that can't be parsed (pandas 1.3+)
        )
        frames.append(df)
    except Exception as e:
        print(f"Error loading {file_path}: {e}")

# Concatenate once at the end; fall back to an empty DataFrame if nothing loaded
ids_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Drop duplicate rows
# ids_df = ids_df.drop_duplicates()

# Save to csv
ids_df.to_csv('../daniel_tables/ids_df.csv', index=False)

# Create shin-kyu table
shin_kyu_df = pd.read_csv(
    '../cjkvi-variants/jp-old-style.txt',
    sep='\t',
    comment='#',
    skiprows=22,
    names=['shin', 'kyu'],
    usecols=[0, 1],  # Only read first 2 columns
    encoding='utf-8',
    on_bad_lines='skip'  # Skip lines that can't be parsed (pandas 1.3+)
)
temp = pd.read_csv(
    '../cjkvi-variants/joyo-variants.txt',
    sep=',',
    comment='#',
    skiprows=5,
    names=['shin', 'd', 'kyu'],
    usecols=[0, 1, 2],  # Only read first 3 columns
    encoding='utf-8',
    on_bad_lines='skip'  # Skip lines that can't be parsed (pandas 1.3+)
)
temp = temp[['shin', 'kyu']]

# From shinjitai-table
with open('../shinjitai-table/shinjitai.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    
# Convert to a more usable format
# Create a list of dictionaries for each character-variant pair
shinjitai_list = []
for shinjitai, kyujitai_list in data.items():
    if kyujitai_list:  # Only if there are variants
        for kyujitai in kyujitai_list:
            shinjitai_list.append({
                'shin': shinjitai,
                'kyu': kyujitai
            })

df_shinjitai = pd.DataFrame(shinjitai_list)

# Concatenate DataFrames
shin_kyu_df = pd.concat([shin_kyu_df, temp, df_shinjitai])

# Drop duplicate rows
shin_kyu_df = shin_kyu_df.drop_duplicates()

# Save to csv
shin_kyu_df.to_csv('../daniel_tables/shin_kyu_df.csv', index=False)

# Stroke count table
stroke_count_df = pd.read_csv(
    '../cjkvi-ids/ucs-strokes.txt',
    sep='\t',
    comment='#',
    names=['code_point', 'character', 'stroke_count'],
    usecols=[0, 1, 2],  # Only read first 3 columns
    encoding='utf-8',
    on_bad_lines='skip'  # Skip lines that can't be parsed (pandas 1.3+)
)

# Save to csv
stroke_count_df.to_csv('../daniel_tables/stroke_count_df.csv', index=False)

print('[DONE]')

[DONE]
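
Because the saved tables have a header row and no comment lines, loading them later needs none of the special parameters from this notebook. The sketch below shows the round trip on a small stand-in table written to a temporary folder; the two sample rows are illustrative, and in practice the path would be something like `../daniel_tables/stroke_count_df.csv`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for one of the combined tables
stroke_count_df = pd.DataFrame({
    "code_point": ["U+4E00", "U+4E8C"],
    "character": ["一", "二"],
    "stroke_count": [1, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "stroke_count_df.csv")
    stroke_count_df.to_csv(csv_path, index=False, encoding="utf-8")
    # Header row present, no comments: the defaults are enough
    reloaded = pd.read_csv(csv_path, encoding="utf-8")

print(reloaded)
```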
