Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.ass file using UTF-16LE encoding with a BOM cannot be detected #11482

Open
TieBaMma opened this issue Mar 21, 2023 · 12 comments
Open

.ass file using UTF-16LE encoding with a BOM cannot be detected #11482

TieBaMma opened this issue Mar 21, 2023 · 12 comments

Comments

@TieBaMma
Copy link

The following is an example file:

[WLGO&PPX][Itazura_na_Kiss][DVDRip][05][x264_AAC][880x480][FB7AD9EF].gb.zip

After saving as a file using e.g. UTF-8 encoding, the problem disappears, but it'll be more convenient if UTF-16LE with a BOM is supported.

@dyphire
Copy link
Contributor

dyphire commented Mar 21, 2023

Can be reproduced, it seems that the file decoding failed. The UTF-8 encoding with a BOM will also fail.

Update: When I try to load the subtitle using ffplay, it also fails. so it's most likely an ffmpeg issue and should be reported there.

@dyphire
Copy link
Contributor

dyphire commented Mar 22, 2023

If use --demuxer-lavf-format=ass to forcing the ffmpeg ass subtitle demuxer, mpv will successfully open this subtitle.
The issue is that FFmpeg cannot detect the subtitle format correctly.

This appears to be an issue with FFmpeg’s ass_probe, which rejects all files that don’t start with optionally preceded.

@dyphire
Copy link
Contributor

dyphire commented Mar 23, 2023

After that, I tried a dirty patch to fix the issue in FFmpeg, which can also fix another related issue #2846

From 4e88f228ec52d84fa84cfb892da51911b100150e Mon Sep 17 00:00:00 2001
From: dyphire <qimoge@gmail.com>
Date: Wed, 27 Sep 2023 22:16:48 +0800
Subject: [PATCH] avformat/assdec: more lenient probing

---
 libavformat/assdec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavformat/assdec.c b/libavformat/assdec.c
index 6b37af15e8..e2ad3f9a34 100644
--- a/libavformat/assdec.c
+++ b/libavformat/assdec.c
@@ -34,7 +34,7 @@ typedef struct ASSContext {
 
 static int ass_probe(const AVProbeData *p)
 {
-    char buf[13];
+    char buf[1000];
     FFTextReader tr;
     ff_text_init_buf(&tr, p->buf, p->buf_size);
 
@@ -43,7 +43,7 @@ static int ass_probe(const AVProbeData *p)
 
     ff_text_read(&tr, buf, sizeof(buf));
 
-    if (!memcmp(buf, "[Script Info]", 13))
+    if (strstr(buf, "[Script Info]") || strstr(buf, "[V4 Styles]") || strstr(buf, "[V4+ Styles]"))
         return AVPROBE_SCORE_MAX;
 
     return 0;
-- 

test build for windows: https://github.com/dyphire/mpv-winbuild/actions/runs/4493477864

I'm not satisfied with the way this patch works, but it's surprising that this issue has been around for over 7 years and still hasn't been fixed in FFmpeg.

@llyyr
Copy link
Contributor

llyyr commented Sep 23, 2023

I reported this to trac https://trac.ffmpeg.org/ticket/10583

While the above diff should work, ffmpeg already has code for stripping bom marks so that patch is unlikely to be accepted upstream since it doesn't address the root cause of the issue.

@llyyr
Copy link
Contributor

llyyr commented Sep 23, 2023

There are no bugs, this file just has a double BOM. I've verified that the first 4 bytes are fffe fffe, removing the second BOM allows the file to be loaded as utf16le correctly.

There are no bugs, however it might be possible to hack lavf to load these types of files, but I'd prefer not to submit such a patch. Are there a ton of files like this?

The UTF-8 encoding with a BOM will also fail.

This is not true, I tested utf-8, utf16be and utf16le BOMs they all work correctly.

You can also work around it like so

mpv [video] --sub-file=lavf://subfile,,start,2,end,0,,:[video]

@dyphire
Copy link
Contributor

dyphire commented Sep 24, 2023

There are no bugs, however it might be possible to hack lavf to load these types of files, but I'd prefer not to submit such a patch. Are there a ton of files like this?

In fact, there are many subtitles files with this issue, and I have also seen feedback elsewhere.
For this reason, I have written a python script to handle subtitle encoding conversion. see https://github.com/dyphire/subtitle-convert

However, what I am more concerned about is the issue #2846
There are much more subtitle files missing this header of [Script Info], and the users who are troubled by this are more widespread.
As mentioned there libass/libass#325 (comment), it is a more reasonable behavior to additionally detect the information of [V4+ Styles] for judgment, and no functioning ASS subtitles will be missing it.

@llyyr
Copy link
Contributor

llyyr commented Sep 24, 2023

There are much more subtitle files missing this header of [Script Info],

Even if you make ffmpeg probe such broken files correctly, it's pointless because libass doesn't know how to parse PlayResX/PlayResY without the section header. So your subtitles will appear broken anyway.

I could make such a change to lavf if libass became dumber in parsing like VSFilter, but currently there's not much point.

In fact, there are many subtitles files with this issue, and I have also seen feedback elsewhere.

Hopefully it gets merged lol https://ffmpeg.org/pipermail/ffmpeg-devel/2023-September/314767.html

Also if you have a sample of a broken UTF-8 BOM file it'd be helpful

/cc @TheOneric

@dyphire
Copy link
Contributor

dyphire commented Sep 24, 2023

Also if you have a sample of a broken UTF-8 BOM file it'd be helpful

The broken UTF-8 BOM file was only encoded from the example file, so it should be the same bug.

Even if you make ffmpeg probe such broken files correctly, it's pointless because libass doesn't know how to parse PlayResX/PlayResY without the section header. So your subtitles will appear broken anyway.

I could make such a change to lavf if libass became dumber in parsing like VSFilter, but currently there's not much point.

As mentioned in the comments above libass/libass#325 (comment), libass’ own whole-file parser handles files without [Script Info] just fine (In the past, mpv was work with use it), as vsfilter did (mpc-hc/be).

This is an ffmpeg quirk rather than a libass issue. There are thousands or more of these subtitle files on the internet (I suspect they were generated by similar scripts), and it seems foolish to not attempt to fix this compatibility issue.

@llyyr
Copy link
Contributor

llyyr commented Sep 24, 2023

libass’ own whole-file parser handles files without [Script Info] just fine, as vsfilter did (mpc-hc/be).

It can detect that it's an ass file, and playback dialogue correctly. However it won't read options without a header, unlike VSFilter.

You just get this message after fixing the ffmpeg bug, making the whole thing kind of pointless

[sub/ass] Neither PlayResX nor PlayResY defined. Assuming 384x288

@dyphire
Copy link
Contributor

dyphire commented Sep 24, 2023

libass’ own whole-file parser handles files without [Script Info] just fine, as vsfilter did (mpc-hc/be).

It can detect that it's an ass file, and playback dialogue correctly. However it won't read options without a header, unlike VSFilter.

You just get this message after fixing the ffmpeg bug, making the whole thing kind of pointless

[sub/ass] Neither PlayResX nor PlayResY defined. Assuming 384x288

I don't think this is completely meaningless. If the ass tag is not used in these subtitles (It is likely that the srt subtitles were converted using the wrong script), then it has almost no impact.
Of course, perhaps missing parsing should be fixed in libass.

I have ASS subtitles downloaded from various internet sites that do not have [Script Info] section completely (the first line is [V4+ Styles]).

In addition, there are extreme examples here libass/libass#325 (comment), these ass files completely lack PlayResX and PlayResY. libass and vsfilter will do the same thing.
Anyway, it's better than failing to parse without rendering.

@llyyr
Copy link
Contributor

llyyr commented Sep 24, 2023

I'll look into this in a few days, but it'd be better if libass attempted to find "PlayResX/Y:" even if it couldn't find the "[Script Info]" section header.

Of course, if there's a valid reason to not do this, then that's fair as well.

@dyphire
Copy link
Contributor

dyphire commented Sep 27, 2023

I'll look into this in a few days, but it'd be better if libass attempted to find "PlayResX/Y:" even if it couldn't find the "[Script Info]" section header.

Of course, if there's a valid reason to not do this, then that's fair as well.

I noticed that only a simple libass patch is needed to parse missing "PlayResX/Y:" when the "[Script Info]" header does not exist.

From 4cfbbe2f70c02e156785de9bc12d124b6914bd29 Mon Sep 17 00:00:00 2001
From: dyphire <qimoge@gmail.com>
Date: Wed, 27 Sep 2023 22:48:04 +0800
Subject: [PATCH] Try to parse script properties even if missing "[Script Info]"

---
 libass/ass.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libass/ass.c b/libass/ass.c
index 0679483..c0734d2 100644
--- a/libass/ass.c
+++ b/libass/ass.c
@@ -1143,7 +1143,7 @@ mem_fail:
 static int process_line(ASS_Track *track, char *str)
 {
     skip_spaces(&str);
-    if (!ass_strncasecmp(str, "[Script Info]", 13)) {
+    if (!ass_strncasecmp(str, "[Script Info]", 13) || !ass_strncasecmp(str, "ScriptType:", 11)) {
         track->parser_priv->state = PST_INFO;
     } else if (!ass_strncasecmp(str, "[V4 Styles]", 11)) {
         track->parser_priv->state = PST_STYLES;
-- 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants