zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers

Chinese version

Overview

zi2zi-JiT is a conditional variant of JiT (Just image Transformer) designed for Chinese font style transfer. Given a source character and a style reference, it synthesizes the character in the target font style.

The architecture extends the base JiT model with three components:

  • Content Encoder — a CNN that captures the structural layout of the input character, adapted from FontDiffuser.
  • Style Encoder — a CNN that extracts stylistic features from a reference glyph in the target font.
  • Multi-Source In-Context Mixing — instead of conditioning on a single category token as in the original JiT, font, style, and content embeddings are concatenated into a unified conditioning sequence.
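
The mixing idea above can be sketched with NumPy. The token counts and hidden size here are illustrative assumptions, not the repository's actual configuration:

```python
import numpy as np

# Assumed dimensions for illustration; the real sizes live in the repo's model config.
d = 768                                      # transformer hidden size
font_emb = np.random.randn(1, d)             # learned per-font embedding (one token)
style_feat = np.random.randn(4, d)           # style-encoder CNN features as tokens
content_feat = np.random.randn(16, d)        # content-encoder CNN features as tokens

# Multi-source in-context mixing: rather than a single category token,
# font, style, and content tokens form one conditioning sequence that the
# diffusion transformer attends to.
cond_seq = np.concatenate([font_emb, style_feat, content_feat], axis=0)
print(cond_seq.shape)  # (21, 768)
```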

Training

Two model variants are available, JiT-B/16 and JiT-L/16, both trained for 2,000 epochs on a corpus of 400+ fonts (70% simplified Chinese, 20% traditional Chinese, 10% Japanese) totalling 300k+ character images. For each font, the number of characters used for training is capped at 800.

Evaluation

Generated glyphs are evaluated against ground-truth references following the protocol in FontDiffuser. All metrics are computed over 2,400 pairs.

Model      FID ↓   SSIM ↑   LPIPS ↓   L1 ↓
JiT-B/16   53.81   0.6753   0.2024    0.1071
JiT-L/16   56.01   0.6794   0.1967    0.1043

How To Use

Environment Setup

conda env create -f environment.yaml
conda activate zi2zi-jit
pip install -e .

Download

Pretrained checkpoints are available on Google Drive:

Download Models

Download the desired checkpoint and place it under models/:

mkdir -p models
# zi2zi-JiT-B-16.pth  (Base variant)
# zi2zi-JiT-L-16.pth  (Large variant)

Dataset Generation

From font files

Generate a paired dataset from a source font and a directory of target fonts:

python scripts/generate_font_dataset.py \
    --source-font data/思源宋体light.otf \
    --font-dir   data/sample_single_font \
    --output-dir data/sample_dataset

This produces the following structure:

data/sample_dataset/
├── train/
│   ├── 001_FontA/
│   │   ├── 00000_U+XXXX.jpg
│   │   ├── 00001_U+XXXX.jpg
│   │   ├── ...
│   │   └── metadata.json
│   ├── 002_FontB/
│   │   └── ...
│   └── ...
├── test/
│   ├── 001_FontA/
│   │   └── ...
│   └── ...
└── test.npz

Each .jpg is a 1024x256 composite: source (256) | target (256) | ref_grid_1 (256) | ref_grid_2 (256).
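
A composite in this layout can be split back into its panels with a few lines of Pillow. `split_composite` is a hypothetical helper for illustration, not part of the repository:

```python
from PIL import Image

def split_composite(img):
    """Split a 1024x256 training composite into its four 256x256 panels,
    left to right: source, target, ref_grid_1, ref_grid_2."""
    assert img.size == (1024, 256), f"unexpected size {img.size}"
    return [img.crop((i * 256, 0, (i + 1) * 256, 256)) for i in range(4)]

source, target, ref1, ref2 = split_composite(Image.new("L", (1024, 256)))
```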

From rendered glyph images

Alternatively, build a dataset from a directory of rendered character images. Each file should be a 256x256 PNG named by its character:

data/sample_glyphs/
├── 万.png
├── 上.png
├── 中.png
├── 人.png
├── 大.png
└── ...
python scripts/generate_glyph_dataset.py \
    --source-font data/思源宋体light.otf \
    --glyph-dir   data/sample_glyphs \
    --output-dir  data/sample_glyph_dataset \
    --train-count 200

LoRA Fine-Tuning

Fine-tune a pretrained model with LoRA on a single GPU; fine-tuning on one font typically takes under an hour on an H100. The example below uses JiT-B/16 with batch size 16, which requires roughly 4 GB of VRAM:

python lora_single_gpu_finetune_jit.py \
    --data_path       data/sample_dataset/train/ \
    --test_npz_path   data/sample_dataset/test.npz \
    --output_dir      run/lora_ft_sample_single/ \
    --base_checkpoint models/zi2zi-JiT-B-16.pth \
    --model           JiT-B/16 \
    --num_fonts       1000 \
    --num_chars       20000 \
    --max_chars_per_font 200 \
    --img_size        256 \
    --lora_r          32 \
    --lora_alpha      32 \
    --lora_targets    "qkv,proj,w12,w3" \
    --epochs          200 \
    --batch_size      16 \
    --blr             8e-4 \
    --warmup_epochs   1 \
    --save_last_freq  10 \
    --proj_dropout    0.1 \
    --P_mean          -0.8 \
    --P_std           0.8 \
    --noise_scale     1.0 \
    --cfg             2.6 \
    --online_eval \
    --eval_step_folders \
    --eval_freq       10 \
    --gen_bsz         16 \
    --num_images      400 \
    --seed            42

Key parameters:

  • --num_fonts, --num_chars — tied to the pretrained model's embedding size. Do not change unless pretraining from scratch.
  • --max_chars_per_font — caps the number of characters used from each font.
  • --lora_r, --lora_alpha — LoRA capacity. Higher values give more capacity at the cost of memory.
  • --batch_size — 16 uses roughly 4 GB of VRAM.
  • --cfg — conditioning strength. Use 2.6 for JiT-B/16, 2.4 for JiT-L/16.
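
For intuition on --lora_r and --lora_alpha, here is a minimal NumPy sketch of a LoRA linear layer. The shapes are assumptions for illustration; the training script applies this idea to the transformer's qkv, proj, w12, and w3 layers:

```python
import numpy as np

d_in, d_out = 768, 768
r, alpha = 32, 32                          # matches --lora_r / --lora_alpha above

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus a low-rank update scaled by alpha / r. With B = 0,
    # fine-tuning starts exactly at the pretrained weights.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((1, d_in))
assert np.allclose(lora_forward(x), x @ W.T)  # identity at initialization
```

Raising r grows A and B (more trainable parameters and memory); alpha rescales the update so learning-rate tuning transfers across ranks.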

Generation

Generate characters from a fine-tuned checkpoint:

python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz   data/sample_dataset/test.npz \
    --output_dir run/generated_chars/

Notes for generate_chars.py:

  • Supported samplers are euler, heun, and ab2.
  • If --num_sampling_steps is not set, the script uses method-specific defaults: euler -> 20, heun -> 50, ab2 -> 20.
  • If neither --sampling_method nor --num_sampling_steps is overridden, the script keeps the checkpoint's saved inference settings.
  • Current recommended fast setting: --sampling_method ab2 --cfg 2.6 and let the default 20 steps apply.
  • heun-50 is kept as a conservative legacy/reference baseline. In the current 50-sample MPS benchmark, ab2-20 and euler-20 were both faster and scored better than heun-50 on SSIM, LPIPS, and L1.
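
The euler sampler is a plain fixed-step ODE solve; a minimal NumPy sketch of the loop is below. `velocity_fn` is a stand-in for the model, and the uniform 0-to-1 time grid is an assumption — the actual samplers live in generate_chars.py:

```python
import numpy as np

def euler_sample(velocity_fn, x, num_steps=20):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (image)
    with fixed-step Euler updates."""
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity_fn(x, t0)
    return x
```

Heun adds a second model evaluation per step for a corrector, and ab2 reuses the previous step's velocity instead, which is why ab2-20 costs roughly the same as euler-20.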

Example fast generation command:

python generate_chars.py \
    --checkpoint run/lora_ft_sample_single/checkpoint-last.pth \
    --test_npz   data/sample_dataset/test.npz \
    --output_dir run/generated_chars_ab2/ \
    --sampling_method ab2

Metrics

Compute pairwise metrics (SSIM, LPIPS, L1, FID) on the generated comparison grids:

python scripts/compute_pairwise_metrics.py \
    --device cuda \
    run/lora_ft_sample_single/heun-steps50-cfg2.6-interval0.0-1.0-image400-res256/step_10/compare/
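
Of the four metrics, L1 is simple enough to sketch directly: mean absolute pixel difference on images normalized to [0, 1]. `l1_metric` is an illustrative helper with an assumed uint8 input range; the script computes all four metrics with their reference implementations:

```python
import numpy as np

def l1_metric(a, b):
    """Mean absolute pixel difference between two uint8 images, on [0, 1]."""
    a = np.asarray(a, dtype=np.float64) / 255.0
    b = np.asarray(b, dtype=np.float64) / 255.0
    return float(np.abs(a - b).mean())

print(l1_metric(np.zeros((4, 4)), np.full((4, 4), 255)))  # 1.0
```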

Works

Fonts created with zi2zi-JiT:

Gallery

Ground truth is shown on the left, the generated output on the right.

Cartoonish Cursive
Geometric Thin
Zhuan Shu Brush

License

Code is licensed under MIT. Generated font outputs are additionally subject to the "Font Artifact License Addendum" in LICENSE:

  • commercial use is allowed
  • attribution is required when distributing a font product that uses more than 200 characters created from repository artifacts

References / Thanks

This project builds on code and ideas from:

  • FontDiffuser — content/style encoder design and evaluation protocol
  • JiT — base diffusion transformer architecture

Citation

@article{zi2zi-jit,
  title   = {zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers},
  author  = {Yuchen Tian},
  year    = {2026},
  url     = {https://github.com/kaonashi-tyc/zi2zi-jit}
}
