Saiki

Saiki (採記) is a small toolkit for Anki-based language learning workflows: listening playlists, word mining, YouTube transcript mining, TTS sentence imports, and known/new word comparison.

The name is a coined Japanese compound of 採 (to gather or collect) and 記 (to remember or record). It is pronounced saiki, roughly "sigh-key".

./saiki.py --help

Requirements

Python 3.12 recommended
Anki with AnkiConnect
ffmpeg
Python dependencies from requirements.txt
spaCy models for word mining:

python -m spacy download es_core_news_sm
python -m spacy download ja_core_news_lg

Setup example:

python3.12 -m venv ~/.venv/saiki
source ~/.venv/saiki/bin/activate
python3 -m pip install -U pip
python3 -m pip install -r requirements.txt
sudo dnf install ffmpeg  # Fedora; use your distribution's package manager elsewhere

Configuration

Defaults are built in, but you can override them with a YAML file at:

~/.config/saiki/config.yaml

Or pass a config explicitly:

./saiki.py --config ./config.yaml words jp

Example:

anki_connect_url: http://localhost:8765
media_dir: ~/.var/app/net.ankiweb.Anki/data/Anki2/User 1/collection.media
audio_output_root: ~/Languages/Anki/anki-audio
word_output_root: ~/Languages/Anki/anki-words
sentence_dir: ~/Languages/Anki
note_model: Basic
fields:
  front: Front
  back: Back
languages:
  jp:
    name: japanese
    transcript_code: ja
    tts_code: ja
    tts_tld: com
    tts_tempo: 1.35
    decks: ["日本語"]
    field: Back
    word_model: ja_core_news_lg
    sentence_file: sentences_jp.txt
  es:
    name: spanish
    transcript_code: es
    tts_code: es
    tts_tld: es
    tts_tempo: 1.25
    decks: ["Español"]
    field: Back
    word_model: es_core_news_sm
    sentence_file: sentences_es.txt

A copyable template is also available at examples/config.yaml.
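
To illustrate how built-in defaults and a YAML override could combine, here is a minimal sketch of a recursive merge. The name merge_config and the merge rule (nested mappings merge key by key, any other value is replaced) are assumptions for illustration, not necessarily how Saiki does it. The user_config dict stands in for whatever a YAML parser would return.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults updated with overrides, merging nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults, shaped like the example config above.
DEFAULTS = {
    "anki_connect_url": "http://localhost:8765",
    "note_model": "Basic",
    "languages": {"es": {"tts_tempo": 1.25, "decks": ["Español"]}},
}

# Stand-in for the parsed contents of ~/.config/saiki/config.yaml.
user_config = {"languages": {"es": {"tts_tempo": 1.1}}}

config = merge_config(DEFAULTS, user_config)
print(config["languages"]["es"])
```

Under this rule a user file only needs to list the keys it changes; everything else keeps its default.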

Supported language codes by default:

jp
es

CLI

Audio

Extract audio referenced by [sound:...] tags from configured decks and create an .m3u playlist.

./saiki.py audio jp
./saiki.py audio es --concat
./saiki.py audio jp --media-dir ~/.local/share/Anki2/User\ 1/collection.media --copy-only-new

Outputs go to ~/Languages/Anki/anki-audio/<language>/ by default.
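
As a rough sketch of the idea behind this command, the snippet below pulls file names out of [sound:...] tags and builds the text of a simple .m3u playlist. The helper names, the sample file names, and the choice to drop duplicates are assumptions for illustration; the real command also handles copying and concatenation, which are not shown.

```python
import re

# Matches Anki sound tags such as [sound:example.mp3] and captures the file name.
SOUND_TAG = re.compile(r"\[sound:([^\]]+)\]")


def sound_files(field_text: str) -> list[str]:
    """Return the media file names referenced by [sound:...] tags, in order."""
    return SOUND_TAG.findall(field_text)


def playlist_text(file_names: list[str]) -> str:
    """Build simple .m3u content: one file name per line, duplicates dropped."""
    unique = list(dict.fromkeys(file_names))  # keeps first-seen order
    return "".join(f"{name}\n" for name in unique)


# Hypothetical field contents from two notes.
fields = ["[sound:jp_001.mp3] 行く", "見る [sound:jp_002.mp3][sound:jp_001.mp3]"]
names = [name for field in fields for name in sound_files(field)]
print(playlist_text(names), end="")
```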

Words

Extract frequent words from Anki notes using AnkiConnect and spaCy.

./saiki.py words jp
./saiki.py words es --deck "Español"
./saiki.py words es --query 'deck:"Español" tag:youtube'
./saiki.py words jp --min-freq 3 --out words_jp.txt
./saiki.py words jp --full-field

Output format:

word frequency

Examples:

comer 12
hablar 9
行く (行き) 8
見る (見た) 6
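
The counting step behind this output can be sketched with the standard library alone. The example assumes the lemmas have already been produced (in Saiki that is spaCy's job) and that --min-freq acts as an inclusive lower bound; frequency_lines is a hypothetical helper, not part of the tool.

```python
from collections import Counter


def frequency_lines(lemmas: list[str], min_freq: int = 1) -> list[str]:
    """Count lemmas and format them as 'word frequency', most frequent first."""
    counts = Counter(lemmas)
    return [
        f"{word} {freq}"
        for word, freq in counts.most_common()
        if freq >= min_freq  # assumed meaning of --min-freq
    ]


# Pre-lemmatized sample input.
lemmas = ["comer", "hablar", "comer", "ir", "comer", "hablar"]
for line in frequency_lines(lemmas, min_freq=2):
    print(line)
```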

YouTube

Mine vocabulary or sentence rows from YouTube subtitles.

./saiki.py youtube es VIDEO_ID
./saiki.py youtube es VIDEO_ID --top 50
./saiki.py youtube jp VIDEO_ID --mode sentences
./saiki.py youtube es VIDEO_ID --raw --no-stopwords

Export Anki-ready sentence rows:

./saiki.py youtube es VIDEO_ID --mode sentences --out youtube.tsv

Export only rows that appear to contain unknown vocabulary:

./saiki.py youtube es VIDEO_ID \
  --mode sentences \
  --out youtube_new.tsv \
  --known-words ~/Languages/Anki/anki-words/spanish/words_es.txt \
  --only-new

Sentence exports contain:

sentence    timestamp    video_url    vocab_guess
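
The sketch below shows one way rows with these four columns could be filtered and written as TSV using the csv module. It assumes a header row, a space-separated vocab_guess column, and an HH:MM:SS timestamp, none of which the text above confirms; rows_to_tsv and only_new are hypothetical names, and VIDEO_ID is left as a placeholder.

```python
import csv
import io

COLUMNS = ["sentence", "timestamp", "video_url", "vocab_guess"]


def rows_to_tsv(rows: list[dict]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def only_new(rows: list[dict], known_words: set[str]) -> list[dict]:
    """Keep rows whose vocab_guess has at least one word not in known_words."""
    return [
        row
        for row in rows
        if any(word not in known_words for word in row["vocab_guess"].split())
    ]


url = "https://www.youtube.com/watch?v=VIDEO_ID"
rows = [
    {"sentence": "Quiero comer algo.", "timestamp": "00:01:05",
     "video_url": url, "vocab_guess": "querer comer"},
    {"sentence": "Hay que madrugar.", "timestamp": "00:02:10",
     "video_url": url, "vocab_guess": "madrugar"},
]

new_rows = only_new(rows, known_words={"querer", "comer"})
print(rows_to_tsv(new_rows), end="")
```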

Import

Generate TTS audio and add sentence cards to Anki.

./saiki.py import es
./saiki.py import jp ~/Languages/Anki/sentences_jp.txt
./saiki.py import es youtube.tsv --tags youtube,manual

The importer accepts plain text sentence files and TSV/CSV files with a sentence column. The text-to-speech tag is always added. If --tags is not provided, the AI-generated tag is added as well.
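
For orientation, this is roughly what an AnkiConnect addNote request for one sentence card could look like, following the tagging rule above. build_add_note, the tag order, and the audio file name are assumptions for illustration. The Front and Back field names come from the default configuration, and nothing is sent over the network here.

```python
import json


def build_add_note(sentence, audio_file, deck, tags=None, model="Basic"):
    """Build an AnkiConnect addNote request body for one sentence card."""
    # text-to-speech is always present; AI-generated only when no tags are given.
    note_tags = ["text-to-speech"]
    note_tags += list(tags) if tags else ["AI-generated"]
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": model,
                # Audio on Front, target-language sentence on Back.
                "fields": {"Front": f"[sound:{audio_file}]", "Back": sentence},
                "tags": note_tags,
            }
        },
    }


# tts_0001.mp3 is a made-up file name for the generated audio.
payload = build_add_note("Quiero comer algo.", "tts_0001.mp3", deck="Español")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Sending the request would mean POSTing this JSON to the configured anki_connect_url.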

Known/New Words

Compare any generated word list against an existing known list:

./saiki.py compare-words transcript_words.txt ~/Languages/Anki/anki-words/spanish/words_es.txt

This prints entries from the first file whose word key does not appear in the second file.
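
A minimal sketch of that comparison, assuming the word key is the first whitespace-separated token on each line (the text above does not define the key, so treat this as a guess); word_key and new_entries are hypothetical helpers.

```python
def word_key(line: str) -> str:
    """Return the first whitespace-separated token of a word-list line."""
    parts = line.split()
    return parts[0] if parts else ""


def new_entries(candidate_lines: list[str], known_lines: list[str]) -> list[str]:
    """Return candidate lines whose word key is missing from the known list."""
    known = {word_key(line) for line in known_lines if line.strip()}
    return [
        line
        for line in candidate_lines
        if line.strip() and word_key(line) not in known
    ]


transcript = ["comer 4", "madrugar 2", "行く (行き) 3"]
known = ["comer 12", "hablar 9", "行く (行った) 8"]
print(new_entries(transcript, known))
```

With this key, entries that differ only in frequency or in the parenthesized form still count as known.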

Card Assumptions

The default configuration assumes Basic notes with audio on Front and the target-language sentence on Back. Word mining reads only the first visible line by default; use --full-field to process the whole field.
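
One plausible reading of "first visible line" is sketched below: treat HTML line breaks as newlines, drop remaining tags and [sound:...] references, and return the first non-empty line. The exact rules Saiki applies may differ; first_visible_line and its patterns are assumptions for illustration.

```python
import html
import re

# Markup that ends a visual line in an Anki field.
LINE_BREAK = re.compile(r"<br\s*/?>|</div>|</p>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
SOUND = re.compile(r"\[sound:[^\]]+\]")


def first_visible_line(field_html: str) -> str:
    """Return the first non-empty line of a field after stripping markup."""
    text = LINE_BREAK.sub("\n", field_html)
    text = SOUND.sub("", TAG.sub("", text))
    text = html.unescape(text)
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


field = "<div>Quiero comer algo.</div><div>I want to eat something.</div>"
print(first_visible_line(field))
```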

To Do

Add support for different Anki note/card types, including configurable field mappings per language and per import workflow.
Support multiple import profiles, such as sentence cards, vocab cards, audio cards, and cloze cards.
Let YouTube exports map directly into configurable note fields, not just a fixed sentence column.
Add richer transcript filtering, such as minimum/maximum sentence length, duplicate removal, and punctuation cleanup.
Add optional audio slicing from videos when timestamp data is available.
Improve known/new word matching with better lemmatization for transcript vocabulary.
Add more language profiles beyond Japanese and Spanish.
Add a dry-run mode for imports that previews notes before sending anything to AnkiConnect.
Build a GUI for common workflows like transcript review, sentence selection, import previews, and configuration editing.
Add integration tests with mocked AnkiConnect responses.
Add shell completion or a small installed command once packaging becomes useful.

Tests

Pure logic tests use the standard library test runner:

python -m unittest discover -s tests

License

This project is licensed under the MIT License. See LICENSE.
