Encode QR code data to UTF-8 #24350

Merged
merged 6 commits into from
Oct 12, 2023

Conversation

dkurt
Member

@dkurt dkurt commented Oct 2, 2023

Pull Request Readiness Checklist

Merge with extra: opencv/opencv_extra#1105

resolves #23728

This is the first PR in a series. Here we just return raw Unicode. Later I will try to expand the QR code decoding methods to use the ECI assignment number and return a string with the proper encoding, not only UTF-8 or raw Unicode.

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@dkurt dkurt changed the title Decode string as raw unicode if UTF-8 failed Decode Python string as raw unicode if UTF-8 failed Oct 2, 2023
@dkurt dkurt force-pushed the py_return_non_utf8_string branch 3 times, most recently from 1d2404c to 609a3ed Compare October 2, 2023 14:28
@dkurt dkurt force-pushed the py_return_non_utf8_string branch 2 times, most recently from 99aefb8 to 03e470e Compare October 2, 2023 16:47
@dkurt dkurt marked this pull request as draft October 3, 2023 03:48
@dkurt dkurt changed the title Decode Python string as raw unicode if UTF-8 failed Encode QR code data to UTF-8 Oct 3, 2023
@dkurt dkurt marked this pull request as ready for review October 3, 2023 11:38
@opencv-alalek opencv-alalek requested review from SinM9 and removed request for VadimLevin October 3, 2023 13:41
@@ -2760,6 +2802,9 @@ bool QRDecode::decodingProcess()
{
    result_info += qr_code_data.payload[i];
}
if (qr_code_data.data_type == QUIRC_DATA_TYPE_BYTE && !checkUTF8(result_info)) {
Contributor

The data type check should go before the first loop on line 2801.

Do we really need checkUTF8? Which test cases fail without it?

Member Author

@dkurt dkurt Oct 5, 2023

The QR code from #23728 was created in Byte mode, but its byte sequence is not valid UTF-8 (the generator probably stored just the raw byte array of the Unicode string):

qr code content (qr_code_data.payload):

83, 80, 67, 13, 10, 48, 50, 48, 48, 13, 10, 49, 13, 10, 67, 72, 48, 52, 51, 48, 48, 48, 53, 50, 51, 48, 50, 50, 50, 52, 52, 57, 48, 49, 72, 13, 10, 83, 13, 10, 69, 109, 105, 108, 32, 70, 114, 101, 121, 32, 66, 101, 116, 114, 105, 101, 98, 115, 32, 65, 71, 13, 10, 66, 97, 104, 110, 104, 111, 102, 115, 116, 114, 97, 115, 115, 101, 32, 49, 55, 13, 10, 13, 10, 53, 55, 52, 53, 13, 10, 83, 97, 102, 101, 110, 119, 105, 108, 13, 10, 67, 72, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 51, 50, 50, 56, 46, 53, 48, 13, 10, 67, 72, 70, 13, 10, 83, 13, 10, 83, 105, 120, 116, 32, 114, 101, 110, 116, 32, 97, 32, 67, 97, 114, 32, 65, 71, 32, 13, 10, 77, 252, 108, 108, 104, 101, 105, 109, 115, 116, 114, 97, 115, 115, 101, 32, 49, 57, 53, 13, 10, 13, 10, 52, 48, 53, 55, 13, 10, 66, 97, 115, 101, 108, 13, 10, 67, 72, 13, 10, 81, 82, 82, 13, 10, 50, 54, 55, 50, 55, 52, 48, 51, 53, 56, 49, 48, 49, 48, 52, 56, 51, 48, 48, 48, 57, 54, 51, 57, 52, 51, 48, 13, 10, 75, 100, 110, 114, 32, 57, 54, 51, 57, 52, 51, 44, 32, 48, 51, 53, 56, 49, 45, 48, 49, 48, 52, 56, 51, 48, 13, 10, 69, 80, 68,

byte array of the text:

text = u"""
SPC
0200
1
CH043000523022244901H
S
Emil Frey Betriebs AG
Bahnhofstrasse 17

5745
Safenwil
CH







3228.50
CHF
S
Sixt rent a Car AG
Müllheimstrasse 195

4057
Basel
CH
QRR
267274035810104830009639430
Kdnr 963943, 03581-0104830
EPD
"""

print([int(v) for v in bytearray(text.encode('ISO-8859-1'))])

[83, 80, 67, 10, 48, 50, 48, 48, 10, 49, 10, 67, 72, 48, 52, 51, 48, 48, 48, 53, 50, 51, 48, 50, 50, 50, 52, 52, 57, 48, 49, 72, 10, 83, 10, 69, 109, 105, 108, 32, 70, 114, 101, 121, 32, 66, 101, 116, 114, 105, 101, 98, 115, 32, 65, 71, 10, 66, 97, 104, 110, 104, 111, 102, 115, 116, 114, 97, 115, 115, 101, 32, 49, 55, 10, 10, 53, 55, 52, 53, 10, 83, 97, 102, 101, 110, 119, 105, 108, 10, 67, 72, 10, 10, 10, 10, 10, 10, 10, 10, 51, 50, 50, 56, 46, 53, 48, 10, 67, 72, 70, 10, 83, 10, 83, 105, 120, 116, 32, 114, 101, 110, 116, 32, 97, 32, 67, 97, 114, 32, 65, 71, 10, 77, 252, 108, 108, 104, 101, 105, 109, 115, 116, 114, 97, 115, 115, 101, 32, 49, 57, 53, 10, 10, 52, 48, 53, 55, 10, 66, 97, 115, 101, 108, 10, 67, 72, 10, 81, 82, 82, 10, 50, 54, 55, 50, 55, 52, 48, 51, 53, 56, 49, 48, 49, 48, 52, 56, 51, 48, 48, 48, 57, 54, 51, 57, 52, 51, 48, 10, 75, 100, 110, 114, 32, 57, 54, 51, 57, 52, 51, 44, 32, 48, 51, 53, 56, 49, 45, 48, 49, 48, 52, 56, 51, 48, 10, 69, 80, 68, 10]

However, the UTF-8 byte array is different:

print([int(v) for v in bytearray(text.encode('UTF-8'))])

[83, 80, 67, 10, 48, 50, 48, 48, 10, 49, 10, 67, 72, 48, 52, 51, 48, 48, 48, 53, 50, 51, 48, 50, 50, 50, 52, 52, 57, 48, 49, 72, 10, 83, 10, 69, 109, 105, 108, 32, 70, 114, 101, 121, 32, 66, 101, 116, 114, 105, 101, 98, 115, 32, 65, 71, 10, 66, 97, 104, 110, 104, 111, 102, 115, 116, 114, 97, 115, 115, 101, 32, 49, 55, 10, 10, 53, 55, 52, 53, 10, 83, 97, 102, 101, 110, 119, 105, 108, 10, 67, 72, 10, 10, 10, 10, 10, 10, 10, 10, 51, 50, 50, 56, 46, 53, 48, 10, 67, 72, 70, 10, 83, 10, 83, 105, 120, 116, 32, 114, 101, 110, 116, 32, 97, 32, 67, 97, 114, 32, 65, 71, 10, 77, 195, 188, 108, 108, 104, 101, 105, 109, 115, 116, 114, 97, 115, 115, 101, 32, 49, 57, 53, 10, 10, 52, 48, 53, 55, 10, 66, 97, 115, 101, 108, 10, 67, 72, 10, 81, 82, 82, 10, 50, 54, 55, 50, 55, 52, 48, 51, 53, 56, 49, 48, 49, 48, 52, 56, 51, 48, 48, 48, 57, 54, 51, 57, 52, 51, 48, 10, 75, 100, 110, 114, 32, 57, 54, 51, 57, 52, 51, 44, 32, 48, 51, 53, 56, 49, 45, 48, 49, 48, 52, 56, 51, 48, 10, 69, 80, 68, 10]
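The two arrays differ only at the "ü" in "Müllheimstrasse": a single byte 252 in ISO-8859-1 versus the pair 195, 188 in UTF-8. A minimal sketch of why the payload cannot be decoded as UTF-8 follows; the helper name `is_valid_utf8` is illustrative and is not OpenCV's `checkUTF8`.

```python
# Encode the one character that differs between the two byte arrays.
latin1 = "ü".encode("ISO-8859-1")
utf8 = "ü".encode("UTF-8")
print(list(latin1))  # [252]
print(list(utf8))    # [195, 188]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("UTF-8")
        return True
    except UnicodeDecodeError:
        return False


# "Müll" as it appears in the QR payload: 77, 252, 108, 108.
payload_fragment = bytes([77, 252, 108, 108])
print(is_valid_utf8(payload_fragment))        # False
print(is_valid_utf8("Müll".encode("UTF-8")))  # True
```

Byte 252 (0xFC) is never a valid lead byte in UTF-8, which is consistent with the `UnicodeDecodeError` reported in the linked issue.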

The ISO standard states that storing a byte array is generally fine, but choosing the encoding is up to the user. The alternative is to create the QR code in ECI mode, which records the encoding standard, but it seems that not all generators offer it:

In closed-system national or application-specific implementations of QR Code, an alternative 8-bit character set, for example as defined in an appropriate part of ISO/IEC 8859, may be specified for Byte mode. When an alternative character set is specified, however, the parties intending to read the QR Code 2005 symbols require to be notified of the applicable character set in the application specification or by bilateral agreement.
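As a rough illustration of the fallback discussed here, the sketch below keeps a Byte-mode payload untouched when it is already valid UTF-8 and otherwise transcodes it from an assumed 8-bit character set. The function name `payload_to_utf8` and the choice of ISO-8859-1 as the default fallback are assumptions for this example, not the code merged in the PR.

```python
def payload_to_utf8(payload: bytes, fallback: str = "ISO-8859-1") -> bytes:
    """Return the payload as UTF-8 bytes (hypothetical helper).

    If the payload is already valid UTF-8 it is returned unchanged.
    Otherwise it is interpreted in the assumed fallback character set
    and re-encoded as UTF-8.
    """
    try:
        payload.decode("UTF-8")
        return payload
    except UnicodeDecodeError:
        return payload.decode(fallback).encode("UTF-8")


# Latin-1 bytes for "Müll" become the UTF-8 sequence 77, 195, 188, 108, 108.
print(list(payload_to_utf8(bytes([77, 252, 108, 108]))))
# A payload that is already UTF-8 passes through unchanged.
print(payload_to_utf8("Müll".encode("UTF-8")) == "Müll".encode("UTF-8"))
```

The fallback character set is a guess whenever the symbol carries no ECI information, which is the limitation the quoted passage of the standard describes.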

Member Author

According to our docstring, OpenCV should return the result in UTF-8:

/** @brief Decodes graphical code in image once it's found by the detect() method.
Returns UTF8-encoded output string or empty string if the code cannot be decoded.
@param img grayscale or color (BGR) image containing graphical code.
@param points Quadrangle vertices found by detect() method (or some other algorithm).
@param straight_code The optional output image containing binarized code, will be empty if not found.
*/
CV_WRAP std::string decode(InputArray img, InputArray points, OutputArray straight_code = noArray()) const;

Member Author

@opencv-alalek, perhaps I misunderstood the question. Do you mean: can we apply the encoding right in the loop, without the checkUTF8 method?

Member Author

Without checkUTF8, the following tests fail:

[  FAILED  ] Objdetect_QRCode.regression/24, where GetParam() = "russian.jpg"
[  FAILED  ] Objdetect_QRCode.regression/25, where GetParam() = "kanji.jpg"
[  FAILED  ] Objdetect_QRCode_Multi.regression/6, where GetParam() = ("4_qrcodes.png", "contours_based")
[  FAILED  ] Objdetect_QRCode_Multi.regression/7, where GetParam() = ("4_qrcodes.png", "aruco_based")
[  FAILED  ] Objdetect_QRCode_Multi.regression/8, where GetParam() = ("5_qrcodes.png", "contours_based")
[  FAILED  ] Objdetect_QRCode_Multi.regression/12, where GetParam() = ("7_qrcodes.png", "contours_based")
[  FAILED  ] Objdetect_QRCode_Multi.regression/13, where GetParam() = ("7_qrcodes.png", "aruco_based")
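One plausible reason for failures of this kind, not confirmed in the thread, is double encoding: if a payload that is already UTF-8 is unconditionally treated as an 8-bit character set and converted again, the text is corrupted. A small sketch, assuming ISO-8859-1 as the 8-bit set and an arbitrary Cyrillic sample string:

```python
original = "Привет"  # sample text, chosen for illustration only
utf8_payload = original.encode("UTF-8")

# Unconditional conversion: every byte is treated as a Latin-1 character
# and encoded to UTF-8 a second time.
double_encoded = utf8_payload.decode("ISO-8859-1").encode("UTF-8")

print(double_encoded.decode("UTF-8") == original)  # False
print(len(utf8_payload), len(double_encoded))      # 12 24
```

Checking for valid UTF-8 first leaves such payloads untouched and converts only the ones that fail the check.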

Contributor

Code for proper handling of the data type and ECI information is completely missing.
Details: https://en.wikipedia.org/wiki/Extended_Channel_Interpretation

Unfortunately, it sometimes requires code-page maps.

P.S. Kanji is not properly handled (UTF-8 conversion is still required)
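To make the idea of ECI-driven decoding concrete, here is a minimal sketch that maps a few ECI assignment numbers to Python codecs. The table is deliberately partial, the names `ECI_TO_CODEC` and `decode_with_eci` are hypothetical, and it assumes the payload already holds the raw bytes in the designated character set (for Kanji, the reconstructed Shift JIS bytes).

```python
# Partial, illustrative map from ECI assignment number to a Python codec.
ECI_TO_CODEC = {
    3: "iso-8859-1",
    20: "shift_jis",
    25: "utf-16-be",
    26: "utf-8",
}


def decode_with_eci(payload: bytes, eci: int) -> str:
    """Decode payload using the character set designated by the ECI number."""
    codec = ECI_TO_CODEC.get(eci)
    if codec is None:
        raise ValueError(f"unsupported ECI assignment number: {eci}")
    return payload.decode(codec)


print(decode_with_eci(bytes([77, 252, 108, 108]), 3))    # Müll
print(decode_with_eci("漢字".encode("shift_jis"), 20))   # 漢字
```

Python ships the required code-page tables with its standard codecs; a C++ implementation would have to carry its own, which is the cost mentioned above.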

Member Author

Agreed, I wanted to take a look at the Kanji test later too.

@dkurt dkurt marked this pull request as draft October 5, 2023 14:42
@dkurt dkurt marked this pull request as ready for review October 6, 2023 06:17
Contributor

@opencv-alalek opencv-alalek left a comment

API should be extended to return metadata (ECI) for decoded streams.

@dkurt dkurt requested a review from asmorkalov October 10, 2023 09:34
    }
    result_info.assign((const char*)qr_code_data.payload, qr_code_data.payload_len);
} else if (qr_code_data.eci == 25/*ECI_UTF_16BE*/) {
    CV_LOG_INFO(NULL, "QR: UTF-16BE ECI is not supported");
Contributor

I propose to make it CV_LOG_WARNING. INFO is not printed in regular builds.

Contributor

We should not spam with that message. QR detector is usually called for each frame.

Contributor

@asmorkalov asmorkalov left a comment

👍

@asmorkalov asmorkalov merged commit 5ddf3de into opencv:4.x Oct 12, 2023
24 checks passed
@asmorkalov asmorkalov mentioned this pull request Oct 17, 2023
@dkurt dkurt deleted the py_return_non_utf8_string branch October 18, 2023 19:18
@dkurt dkurt mentioned this pull request Oct 18, 2023
6 tasks
IskXCr pushed a commit to Haosonn/opencv that referenced this pull request Dec 20, 2023
thewoz pushed a commit to thewoz/opencv that referenced this pull request Jan 4, 2024
thewoz pushed a commit to thewoz/opencv that referenced this pull request May 29, 2024
Development

Successfully merging this pull request may close these issues.

QR code detecting in python with OpenCV raises UnicodeDecodeError: 'utf-8' codec can't decode byte
5 participants