notHulK11/CantoCaptions

CantoCaptions

Overview

This project aims to collect and create accurate written Cantonese subtitles for educational purposes. Accurate subtitles are those that match the spoken dialogue. Written Cantonese is the written form of the Cantonese language, in contrast to Standard Written Chinese, which is what is typically used. Written Cantonese subtitles are seldom used but are very powerful learning resources.

If you would like to contribute to transcripts or subtitles, make a donation, find out about current projects, or simply learn more, please join our Discord server.

Important

Since many of the characters used in these subtitles fall outside the coverage of typical fonts, it is HIGHLY recommended that you install a Cantonese-specific font. We recommend installing one of the fonts from https://github.com/chiron-fonts/chiron-hei-hk.

Character Conventions

Written Vernacular Cantonese has no accepted standards. However, establishing our own conventions will make this resource even more useful to learners. With these conventions we can be more specific than written Cantonese normally is. For example, we have chosen a set of sentence-final particles (SFP) such that each character represents a certain syllable/tone pair. In this way, learners can build a deeper level of understanding of the language.

There are 3 main resources that served as a starting point.

  1. https://www.cantonese.com.hk/cantonese/sfp/ - The sentence-final particles are largely based on the table used here, with some modifications. 𠵝 is dropped in favor of 呀 because the former is unsupported by almost all fonts and 呀 is far more common. 可 is dropped in favor of 嗬 for disambiguation. Beyond those exceptions, some additional particles (gaa5, laa2, laa6, and zaa6) exist but are not mentioned in their table, so we devised our own conventions for them.
  2. https://jyutping.org/en/blog/typo/ - Many characters for disambiguation are taken directly from the list here.
  3. https://words.hk/ - A guiding principle behind the conventions is that they are searchable in words.hk, which is the most comprehensive and accessible Cantonese dictionary. There are a few exceptions for rare SFP, but nearly all selected characters must be searchable. Character variants are also taken directly from what words.hk considers to be the correct Hong Kong variants.

Note

The conventions have been evolving over time and many of the existing subtitles have not been updated in accordance with the latest standards.

Simple Sentence-final Particles|單一句尾助詞

Syllable\Tone 1 2 3 4 5 6
aa 𠻺
aak 𡅅
baa
bo
gaa 𠺢 𠿪
gaak 𠺝
ge 𠸏
gwaa
haa
he
ho
laa 𠸎
laak
le
lo
lok
lu
maa
me
tim 𠻹
waa
wo 𡁜
zaa 𠾵
ze
zek

Compound Sentence-final Particles|複合句尾助詞

Jyutping|粵拼 Honzi|漢字
a1 maa3 吖嘛
a1 naa4 吖嗱
a3 ho2 啊嗬
a3 haa2 啊吓
a6 maa5 𠻺嗎
a6 le5 𠻺哩
baa2 laa1 罷啦
ding2 laa1 定啦
ga1 maa3 𠺢嘛
ga3 wo3 㗎喎
ge3 ne1 嘅呢
ge3 ze1 嘅啫
ge3 zek1 嘅唧
ha6 waa5 下哇
la1 maa3 啦嘛
la3 wo3 喇喎
la6 maa5 嚹嗎
za1 maa3 吒嘛
za6 maa5 咤嗎

Affixes|詞綴

Jyutping|粵拼 Honzi|漢字 Examples|例子
aa3 爸、伯、
can1 隻腳、跌
dei2 嘛嘛、悶悶
di1 、靚
dou2 、做唔
faan1 、畀
gam2 樣、係
gam3 多、
haa5 、試、行行
kiu1 Q Q線、做乜Q
maai4 、畀、交
saai3 多謝、辛苦

Interjections|感嘆詞

Jyutping|粵拼 Honzi|漢字 Explanation|解釋
aai1, aai2 sigh of exasperation
ai1 jaa3/5/6, ai1 jaak3 哎吔
ai1 jo3 哎喲
bai6 laa3 弊喇 "oh no"
ce1 "tsk"
e2, ei2
e4, e6 "uh"
hei1 as a greeting / shows satisfaction
hei5 shows discontent
hng6 "hmph"
hou2 je5 好嘢 woohoo; yeah
ji2
m2 "mmm"; sound of enjoyment of food
m2, m3, m6 "hmm"; "um"; "mhmm"
naa4 "look"; call for attention
o1
o2, o3, o4, o5, o6
oi2, oi3 variant of 喂
ou3
syu4 𭉝 "shh"
u1 "ooo"; sound of interest/wonder
waa1 "wah"; sound of crying
waa3, waa4 "wow"
wai2, wai3

Character Variants

Where applicable, these Hong Kong variants are used. These map 1:1.

✅ Selected Variant ❌ Other Variants Jyutping
wai4
wan2
gaan2
syut3
cong4
kwan4
leoi5
min6
gaau3
bei3
巿 si5
zung3
sap1
gai1
gou3
wu1
sit3
maa6
sau3
ngau1
wai6
cung1
jim6
joek6
wui6
kai2
zoeng2

Other Required CantoCaptions Conventions

✅ Selected Variant ❌ Other Variants Jyutping Explanation
bei2
gaau2 指「做」;搞錯
打攪晒 打搞晒 daa2 gaau2 saai3
gui6
juk1
𦧷 舔、lem lem2 用條脷輕輕力掃
𦧲 lur loe1*2
唯有 惟有 wai4 jau5
只係 衹係 zi2 hai6
之不過 只不過 zi1 bat1 gwo3
只不過 之不過 zi2 bat1 gwo3
唔止 唔只 m4 zi2
唔單止 唔單只 m4 daan1 zi2
而家 宜家 ji4 gaa1
唔使 唔駛 m4 sai2
即係 姐係、啫係、唧係 zik1 hai6
淨係 剩係 zing6 hai6
呢個 依個 ni1 go3
依個 𠵱個 ji1 go3
傾偈 傾計 king1 gai2
丼、揼 dam2
𢱕 溚、揼 dap6
dam1, dam3, dam6
zoek3, zoek6
zyu3
zing3
篤、督、厾 duk1 係動詞,指「刺」、「戳」
涿 duk1 係量詞,指「一涿屎」或「一涿尿」
duk1 用於「監督」、「都督」等
duk1 用於「篤信」、「篤定」等
𡰪 duk1 指最尾或末端,例如「行到㞘」
便 bin1, bin6 用於「邊度」、「入邊」
tau2 休息;歇息(早唞、等等)
gam6
爹哋 爹地、爹啲 de1 di4
BB 啤啤 bi4 bi1
𠹷 ngo4 好煩噉樣批評或者抱怨
wan3 局限喺一個地方之內,唔出嚟
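
As an illustration of how a few of the conventions above could be applied mechanically, here is a minimal Python sketch. The names `VARIANT_FIXES` and `normalize` are hypothetical and not part of this project. It only includes rows from the table whose two forms are written differently and do not depend on pronunciation. Pairs such as 之不過/只不過 or 呢個/依個 are deliberately left out, because the correct form depends on what is actually said.

```python
# Hypothetical helper: maps a rejected spelling to the selected variant.
# Only pronunciation-independent rows from the table above are included.
VARIANT_FIXES = {
    "宜家": "而家",
    "唔駛": "唔使",
    "剩係": "淨係",
    "傾計": "傾偈",
    "打搞晒": "打攪晒",
    "惟有": "唯有",
}


def normalize(text: str) -> str:
    """Replace each rejected spelling with the selected variant.

    This is plain substring replacement, so results should still be
    reviewed by a person; it cannot tell when a match is coincidental.
    """
    for wrong, right in VARIANT_FIXES.items():
        text = text.replace(wrong, right)
    return text
```

For example, `normalize("宜家唔駛傾計")` gives `"而家唔使傾偈"`.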

Recommended CantoCaptions Conventions (Disambiguation)

✅ Selected Variant ❌ Other Variants Jyutping Explanation
𠹻 zam6 氣味、風嘅量詞
𡃴 ceoi4 臭味
瀨屎、瀨尿 賴屎、賴尿 laai6 si2, laai6 niu6
dau3 1. 對打 2. 分勝負 3. 花工夫去整一樣嘢
dau3 摸;掂
𢯎 R、摳、撓、𢲷 ngaau1
鮓、謯、苴 zaa2
zi1 指植物或木嘅嘢
不嬲 不溜、不留 bat1 lau1, bat1 lau2 一直
𢫏 kam2 遮住
kam2 掌摑
ham2 撞到
kam2 用嚟遮住底下嘅嘢(量詞:個)
ham6 全部; 接口閂得實
ham6 引起強烈感受
hung1 兇猛、兇手、兇某人
hung1 泛指一啲不祥嘅嘢(凶兆)
搭、塔 taap3 用手銬;鎖
jau4 用油漆或顏料填上顏色、覆蓋表面
𨈇 𨂾、揇、檻 laam3
laan2 扮做;自命
大部份 大部分 daai6 bou6 fan6
過份 過分 gwo3 fan6
kaat1 例如:信用卡
car, carat, 黐住 kaat1
矇、蒙 mung4 朦朧;模糊
賜予 賜與 ci3 jyu5
joeng2 揮動一件軟軟地嘅物件
joeng4 傳揚;張揚
啃、鯁、骾 kang2 夾硬吞落喉嚨;有啲嘢食卡咗喺喉嚨
𬒔 ang2 一啲突起嘅嘢頂住,令人唔舒服或痛
濕𣲷𣲷 濕立立 sap1 nap6 nap6
嗱嗱聲 拿拿聲、啦啦聲 laa4 laa2 seng1
gwat6 執著;鈍
籮柚 囉柚 lo1 jau2
批、𠜱 pai1 1. 刀法 2. 削走啲嘢
lam1 1. 甜蜜、氹人 2. 花植物嘅一部分 3. 冧歌
lam3, lam6 1. 跌倒 2. 堆起 3. 連續
橋、蹺、巧 kiu2 表示咁啱
騎呢怪 奇離怪 ke4 le4 gwaai3, ke4 le4 gwaai2
fea、啡、fe fe4
gat1 用尖而幼細嘅嘢插入
咖哩雞 咖喱雞 gaa3 lei1 gai1
mang1
lim、令、捻 lim1 紙嘅單位,通常指500張
nin2 雙手或者多隻手指夾住一嚿嘢
吼住 睺住、喉住 hau1 zyu6, hau4 zyu6 望住
biu1
故仔、故事 古仔、古事 gu3 zai2, gu3 si6
𠼮、誽、𠱓 ngai1, ai1 央求
𠱁、𧨾 tam3 1. 令人開心 2. 哄騙
tam5 1. 水喺凹陷地方 2. 陷阱
zek6 用竹片等材料製成嘅墊
zik6
𥄫 gup gap6 1. 偷窺 2. 凝視
deoi2 1. 捅 2. 短時間內攝取好多嘢
盟塞 盲塞、萌塞 mang4 sik1, mang4 sak1
軟腍腍 軟淋淋 jyun5 nam4 nam4
倔頭路 掘頭路 gwat6 tau4 lou6
𣲷懦 𥹉懦 nap6 no6
裝、𥅾、𥊙 zong1 偷窺
係咁歹 係咁大 hai6 gam3 daai2
係噉咦 係咁意 hai6 gam2 ji2
urk oet4, oet6, oek4
lau3
gip、喼 gip1
鋅盤 sink盤、星盤、等等 sing1 pun2

Subtitle Style Guide

This section details how the subtitles should look. In general, Traditional characters are used rather than Simplified characters, since Traditional text can always be converted to Simplified with relative ease.

General Guiding Principles

The goal of these subtitles is to be as useful to learners as possible. The goal is NOT to be as faithful as possible to the literal utterances of the actors or voice actors. Put another way, we want to capture the intended, correct speech, not slips of the tongue or ungrammatical speech. Furthermore, while the subtitles do aim at comprehensive coverage of what is said, grunts, yells, laughter, and miscellaneous expressive noises should in general be transcribed sparingly and, in some cases, not at all. Such subtitles, broadly speaking, don't contribute to building understanding of the language. To this end, it is recommended to transcribe most interjections only insofar as they are followed by or form part of a longer utterance.

Formatting

  • .srt format
  • single line max length is 17.5 characters

The .srt subtitle format is chosen because of its wide compatibility, especially with language learning tools such as pop-up dictionaries.
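
The fractional limit of 17.5 suggests that some characters count as half a character. The Python sketch below checks line length under that assumption: wide or full-width characters (Unicode East Asian Width `W` or `F`) count as 1, and everything else (Latin letters, digits, ASCII punctuation) counts as 0.5. This counting rule and the names `display_width` and `too_long` are assumptions made for illustration, not something the guide defines.

```python
import unicodedata

MAX_WIDTH = 17.5  # single-line limit from the formatting rules above


def display_width(line: str) -> float:
    """Assumed rule: wide/full-width characters count 1, all others 0.5."""
    return sum(
        1.0 if unicodedata.east_asian_width(ch) in ("W", "F") else 0.5
        for ch in line
    )


def too_long(line: str, limit: float = MAX_WIDTH) -> bool:
    """True when a single subtitle line exceeds the limit."""
    return display_width(line) > limit
```

Under this rule, `我個名叫Tom` has a width of 5.5 (four Chinese characters plus three half-width letters).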

Timing

  • lines can appear 0-50ms before the start of the speech
  • lines should slightly trail the end of speech (50-100ms) when possible (e.g. no scene change or interruption)
  • lines that end within roughly 50ms of a scene change should be synced with the scene change
  • lines with a length of 3 characters or more need a minimum duration of 750ms
    • this can be shorter in the case of a scene change or based on other factors such as lots of speech or interrupted speech
  • lines with a length of 2 or fewer characters don't have to follow that minimum
  • background dialogue does not have to be subtitled
    • if subtitled, use the {\an8} tag to place the line at the top of the screen
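
The minimum-duration rule above can be checked mechanically. The Python sketch below parses a standard .srt timing line and flags cues of 3 or more characters that last less than 750ms. The function names are hypothetical, and counting characters by ignoring whitespace is an assumption. A flagged cue is only a hint, since the rules allow shorter durations around scene changes or dense speech.

```python
import re

# Matches an .srt timing line such as "00:00:01,000 --> 00:00:01,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def duration_ms(timing_line: str) -> int:
    """Duration of one cue, in milliseconds."""
    match = TIMING.match(timing_line.strip())
    if match is None:
        raise ValueError(f"not an .srt timing line: {timing_line!r}")
    fields = match.groups()
    return to_ms(*fields[4:]) - to_ms(*fields[:4])


def violates_min_duration(timing_line: str, text: str, min_ms: int = 750) -> bool:
    """Flag cues of 3+ characters (whitespace ignored) shorter than min_ms."""
    length = len("".join(text.split()))
    return length >= 3 and duration_ms(timing_line) < min_ms
```

For example, a cue timed `00:00:01,000 --> 00:00:01,500` lasts 500ms, so it is flagged for the text `我明喇` but not for a single character such as `好`.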

Punctuation

Explanation Examples
Written or background info is enclosed in Chinese parentheses. (三年前)
The titles (of episodes, works, etc.) are enclosed in Chinese double arrow brackets. 《進擊的巨人》
Secondary titles are separated with a Chinese colon. 《哈利波特:神秘的魔法石》
Episode titles are enclosed with Chinese square brackets. [戰士]
Miscellaneous titles, such as those in on-screen text, are enclosed with lenticular brackets. 【Sub Topic】
A Chinese comma is placed after all SFP, except when followed by 你 without a pause. ❌好啦我明喇。
✅好啦,我明喇
❌好春廢啊,你
✅好春廢啊你
Dialogue between multiple speakers uses two lines, each beginning with a hyphen that is not followed by a space. -speaker 1
-speaker 2
Direct speech styling uses Chinese colon followed by dialogue enclosed in left and right Chinese quotation characters. 我媽媽話:「唔准去嗰度」
When a question is followed by the name of the person being addressed, the question mark is used as the separator, rather than a comma followed by a question mark at the end. ❌你仲喺度,阿明?
✅你仲喺度?阿明
Only 1 Chinese ellipsis character is used (never 2 as in ……). ❌……
✅…
When an utterance is repeated, transcribe only 1 instance with a trailing Chinese ellipsis character. ❌ 喂喂喂
✅ 喂…
In the case of interrupted speech, a Chinese ellipsis character is used to mark where the speaker is cut off and a new line begins with the new speech. -點解你…
-唔知啊
In the case of trailing speech, a Chinese ellipsis character is used. ❌佢唔可以嘅話~~
✅佢唔可以嘅話…
In the case of stammering, the start is separated by a Chinese ellipsis, but this is only done once. ❌只只不過
❌只…只…只不過
✅只…只不過
When listing with 同 or 同埋, the Chinese enumeration comma (、) separates the elements that are not joined by the conjunction. A、B、C同埋D
Subtitles never end in a period, and the Chinese period (。) is never used. ❌我個名叫Tom。
✅我個名叫Tom
The middle period is never used. ❌哈利·波特
✅哈利波特
Italics are never used.
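
A few of the punctuation rules above are simple enough to check automatically. The Python sketch below covers only four of them: no Chinese period, no trailing period, no doubled ellipsis, and no middle period. The function name `lint_line` and the message strings are hypothetical. Treating both U+00B7 and U+2027 as the "middle period" is an assumption.

```python
def lint_line(text: str) -> list[str]:
    """Return the names of the punctuation rules a subtitle line breaks."""
    problems = []
    if "\u3002" in text:  # 。 Chinese period
        problems.append("Chinese period")
    if text.rstrip().endswith("."):
        problems.append("trailing period")
    if "\u2026\u2026" in text:  # …… two ellipsis characters in a row
        problems.append("double ellipsis")
    # Assumed code points for the middle period: U+00B7 and U+2027.
    if any(dot in text for dot in "\u00b7\u2027"):
        problems.append("middle period")
    return problems
```

For example, `lint_line("我個名叫Tom。")` returns `["Chinese period"]`, while `lint_line("我個名叫Tom")` returns an empty list.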

Untranscribed Speech

In general, these subtitles are a learning resource. The goal is not to transcribe every utterance verbatim and in its entirety. The goal is to have a complete subtitle that contains information useful to the learner. We do not want to include very minor, incidental speech or sounds, or unintentionally incorrect speech. Sentence-final particles, however, are transcribed as accurately as possible to benefit the learner.

Speech Example
The sound of hesitation, e.g. "uh" (a6 / e6), is only transcribed when drawn out and precedes a longer utterance. When directly following a word, it should not be transcribed and an ellipsis should be used instead. ✅誒…你係邊個啊?
❌你誒…係邊個啊?
✅你…係邊個啊?
❌佢…誒…佢係…誒…我唔知
✅佢…佢係…我唔知
The Chinese exclamation point is used sparingly, for example for exceptionally loud or declarative yells, or for emphasis among quieter speech, such as when calling someone's name. Even if a character is yelling, ending every line with an exclamation point is discouraged.
Miscellaneous grunts, yells, screams, and the like are not transcribed.
Ah, oh, hmm, huh, mhmm and other acknowledgement noises are transcribed sparingly and primarily in the case that they form part of other utterances. ❌吓?
✅吓?你講咩啊?
