Metadata: Automatically sanitize bad Unicode strings #2897

Closed
mrichtarsky opened this issue Nov 15, 2022 · 5 comments
Labels
bug Something isn't working released Available in the stable release

Comments

@mrichtarsky

1. What is not working as documented?

After a scratch indexing run on a fresh instance (sidecar/cache dirs/DB dir empty) I get these errors in the logs:

time="2022-11-14T23:09:42Z" level=error msg="details: record not found (find or create 56854)"
time="2022-11-14T23:09:42Z" level=error msg="photo: Error 1366: Incorrect string value: '\\xCA\\xFD\\xC2\\xEB\\xB3\\xC9...' for column `photoprism`.`details`.`software` at row 1 (save details for 56854)"

This is one occurrence, I have about 15 (indexing 60k pictures).
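To look at the bytes quoted in a log line like the one above, a minimal Python sketch follows. The helper name `offending_bytes` is made up for illustration. MariaDB truncates the quoted value (the trailing `...`), so the result is only a prefix of the rejected string.

```python
import re

# The msg part of the log line above; the backslashes are doubled in the log.
LOG_MESSAGE = (
    r"photo: Error 1366: Incorrect string value: "
    r"'\\xCA\\xFD\\xC2\\xEB\\xB3\\xC9...' for column "
    r"`photoprism`.`details`.`software` at row 1 (save details for 56854)"
)


def offending_bytes(message: str) -> bytes:
    """Collect the hex escapes quoted in an 'Incorrect string value' message."""
    match = re.search(r"Incorrect string value: '([^']*)'", message)
    if match is None:
        return b""
    # Accept one or more backslashes before each xHH pair.
    hex_pairs = re.findall(r"\\+x([0-9A-Fa-f]{2})", match.group(1))
    return bytes(int(pair, 16) for pair in hex_pairs)


print(offending_bytes(LOG_MESSAGE))
```

The returned bytes can then be passed to `bytes.decode` to test candidate encodings.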

2. How can we reproduce it?

Index the attached image.

3. What behavior do you expect?

No error in the logs.

4. What could be the cause of your problem?

I took a look at the picture from the log above.

$ exiftool -j e567c2dd874328b7689b427f79e07205.jpg
[{                                                                                                                
  "SourceFile": e567c2dd874328b7689b427f79e07205.jpg",
  "ExifToolVersion": 12.40,
  "FileName": "e567c2dd874328b7689b427f79e07205.jpg",
  "Directory": "/pictures",
  "FileSize": "344 KiB",
  "FileModifyDate": "2016:03:01 11:33:24+00:00",
  "FileAccessDate": "2022:11:15 12:06:57+00:00",
  "FileInodeChangeDate": "2021:01:07 18:20:35+00:00",
  "FilePermissions": "-rwxrwxrwx",
  "FileType": "JPEG",                                    
  "FileTypeExtension": "jpg",              
  "MIMEType": "image/jpeg",                              
  "ExifByteOrder": "Little-endian (Intel, II)",
  "Make": "Canon",                                       
  "Model": "Canon EOS 30D",                              
  "Orientation": "Horizontal (normal)",
  "XResolution": 72,                                                                                              
  "YResolution": 72,                                     
  "ResolutionUnit": "inches",                     
  "Software": "ACD Systems ????????", 
  "ModifyDate": "2010:09:15 16:50:46",
  "YCbCrPositioning": "Centered",                                                                                 
  "ExposureTime": 6,                                     
  "FNumber": 18.0,                                       
  "ExposureProgram": "Manual",
  "ISO": 400,                                            
  "ExifVersion": "",                                     
  "DateTimeOriginal": "2009:07:30 19:33:27",   
  "CreateDate": "2009:07:30 19:33:27",     
  "ComponentsConfiguration": "Y, Cb, Cr, -",                                                                      
  "ShutterSpeedValue": 6,                                                                                         
  "ApertureValue": 18.0,                                 
  "ExposureCompensation": 0,                       
  "MeteringMode": "Partial",         
  "FocalLength": "10.0 mm",                              
  "UserComment": "",                                     
  "SubSecTime": 295,                                     
  "SubSecTimeOriginal": 0,                               
  "SubSecTimeDigitized": 0,                              
  "FlashpixVersion": "",                                 
  "ColorSpace": "sRGB",                                  
  "ExifImageWidth": 1000,                                                                                         
  "ExifImageHeight": 667,                                                                                         
  "InteropIndex": "R98 - DCF basic file (sRGB)",                                                                  
  "InteropVersion": "0100",                                                                                       
  "FocalPlaneXResolution": 3959.322034,
  "FocalPlaneYResolution": 3959.322034,
  "FocalPlaneResolutionUnit": "inches",
  "CustomRendered": "Normal",                       
  "ExposureMode": "Manual",
  "WhiteBalance": "Auto",
  "SceneCaptureType": "Standard",        
  "Compression": "JPEG (old-style)",
  "ThumbnailOffset": 1029,
  "ThumbnailLength": 12574,
  "XMPToolkit": "Public XMP Toolkit Core 3.5",
  "NativeDigest": "256,257,258,259,262,274,277,284,530,531,282,283,296,301,318,319,529,532,306,270,271,272,305,315
,33432;82EAA3AEAA15175DC79AD064028FC263",     
  "CreatorTool": "Adobe Photoshop CS2 Windows",     
  "MetadataDate": "2010:05:03 13:06:48+08:00",  
  "DateTimeDigitized": "2009:07:30 19:33:27.0+8:00",                                                              
  "FlashFired": false,                                   
  "FlashReturn": "No return detection",
  "FlashMode": "Off",
  "FlashFunction": false,                                                                                         
  "FlashRedEyeMode": false,                              
  "DocumentID": "uuid:7F07266C7156DF11BDC5C4708C4E9C7E",
  "InstanceID": "uuid:8007266C7156DF11BDC5C4708C4E9C7E",
  "DerivedFromInstanceID": "uuid:8A1CCDB4A27EDE11B9749B75F8DD63FD",
  "DerivedFromDocumentID": "uuid:0A46978BA27EDE11B9749B75F8DD63FD",
  "Format": "image/jpeg",                                
  "Title": " ",                                          
  "ColorMode": "RGB",                                    
  "ICCProfileName": "sRGB IEC61966-2.1",       
  "History": "",                                         
  "ProfileCMMType": "Linotronic",                                                                                 
  "ProfileVersion": "2.1.0",                                                                                      
  "ProfileClass": "Display Device Profile",              
  "ColorSpaceData": "RGB ",                              
  "ProfileConnectionSpace": "XYZ ",  
  "ProfileDateTime": "1998:02:09 06:49:00",              
  "ProfileFileSignature": "acsp",                        
  "PrimaryPlatform": "Microsoft Corporation",            
  "CMMFlags": "Not Embedded, Independent",               
  "DeviceManufacturer": "Hewlett-Packard",               
  "DeviceModel": "sRGB",                                 
  "DeviceAttributes": "Reflective, Glossy, Positive, Color",
  "RenderingIntent": "Perceptual",                                                                                
  "ConnectionSpaceIlluminant": "0.9642 1 0.82491",                                                                
  "ProfileCreator": "Hewlett-Packard",                                                                            
  "ProfileID": 0,                                                                                                 
  "ProfileCopyright": "Copyright (c) 1998 Hewlett-Packard Company",
  "ProfileDescription": "sRGB IEC61966-2.1",
  "MediaWhitePoint": "0.95045 1 1.08905",
  "MediaBlackPoint": "0 0 0",                       
  "RedMatrixColumn": "0.43607 0.22249 0.01392",
  "GreenMatrixColumn": "0.38515 0.71687 0.09708",
  "BlueMatrixColumn": "0.14307 0.06061 0.7141",
  "DeviceMfgDesc": "IEC http://www.iec.ch",
  "DeviceModelDesc": "IEC 61966-2.1 Default RGB colour space - sRGB",
  "ViewingCondDesc": "Reference Viewing Condition in IEC61966-2.1",
  "ViewingCondIlluminant": "19.6445 20.3718 16.8089",
  "ViewingCondSurround": "3.92889 4.07439 3.36179",                                                               
  "ViewingCondIlluminantType": "D50",         
  "Luminance": "76.03647 80 87.12462",              
  "MeasurementObserver": "CIE 1931",            
  "MeasurementBacking": "0 0 0",                                                                                  
  "MeasurementGeometry": "Unknown",                      
  "MeasurementFlare": "0.999%",        
  "MeasurementIlluminant": "D65",
  "Technology": "Cathode Ray Tube Display",                                                                       
  "RedTRC": "(Binary data 2060 bytes, use -b option to extract)",
  "GreenTRC": "(Binary data 2060 bytes, use -b option to extract)",
  "BlueTRC": "(Binary data 2060 bytes, use -b option to extract)",
  "CurrentIPTCDigest": "10b3e7199b0ccc607c7875ebfd6f05ef",         
  "ObjectName": " ",
  "ImageWidth": 1000,
  "ImageHeight": 667,
  "EncodingProcess": "Baseline DCT, Huffman coding",
  "BitsPerSample": 8,
  "ColorComponents": 3,
  "YCbCrSubSampling": "YCbCr4:2:2 (2 1)",
  "Aperture": 18.0,
  "ImageSize": "1000x667",
  "Megapixels": 0.667,
  "ScaleFactor35efl": 1.6,
  "ShutterSpeed": 6,
  "SubSecCreateDate": "2009:07:30 19:33:27.0",
  "SubSecDateTimeOriginal": "2009:07:30 19:33:27.0",
  "SubSecModifyDate": "2010:09:15 16:50:46.295",
  "ThumbnailImage": "(Binary data 12574 bytes, use -b option to extract)",
  "Flash": "Off, Did not fire",
  "CircleOfConfusion": "0.019 mm",
  "FOV": "96.7 deg",
  "FocalLength35efl": "10.0 mm (35 mm equivalent: 16.0 mm)",
  "HyperfocalDistance": "0.30 m",
  "LightValue": 3.8
}]

Looking at the error message, I suppose the problematic EXIF field is "Software": "ACD Systems ????????". I'm not sure whether the question marks are contained verbatim in the EXIF data or just stand in for some garbage data, which is then perhaps stored in MariaDB as is. This happens for pictures related to China, which may have passed through some software that writes such strings. In any case, the bytes in the error message are not valid UTF-8:

$ python
>>> a=b'\xCA\xFD\xC2\xEB\xB3\xC9'
>>> a.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte

Is my understanding correct that the column details.software should contain UTF-8? If so, does PhotoPrism or MariaDB ensure its validity?
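The same check can be scripted, for example to scan many values or to guess where the bytes came from. This is only a sketch: `is_valid_utf8` and `try_encodings` are illustrative names, and trying GBK is an assumption based on the pictures being related to China. A clean decode under a legacy encoding does not prove that it was the original encoding.

```python
from typing import Dict, Optional, Sequence

# The six bytes quoted in the MariaDB error message.
RAW = b"\xca\xfd\xc2\xeb\xb3\xc9"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def try_encodings(
    data: bytes, candidates: Sequence[str] = ("utf-8", "gbk")
) -> Dict[str, Optional[str]]:
    """Map each candidate encoding to the decoded text, or None on failure."""
    results: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


print(is_valid_utf8(RAW))
print(try_encodings(RAW))
```

For these bytes the strict UTF-8 attempt fails, while the GBK attempt decodes without error to three characters. That would fit a string written by software using a Chinese legacy code page, but it remains a guess.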

5. Can you provide us with example files for testing, error logs, or screenshots?

Attached

6. Which software versions do you use?

(a) AMD64, PhotoPrism® CE Build 221105-7a295cab4

(b) MariaDB, stock settings from docker-compose.yml, i.e.

    image: mariadb:10.6
    security_opt:
      - seccomp:unconfined
      - apparmor:unconfined
    command: mysqld --innodb-buffer-pool-size=128M --transaction-isolation=READ-COMMITTED --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci --max-connections=512 --innodb-rollback-on-timeout=OFF --innodb-lock-wait-timeout=120

(c) Linux

Linux immens 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy

7. On what kind of device is PhotoPrism installed?

Shouldn't matter but:

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz
4GB 

(This is a pretty low-powered system, 10+ years old, indexing took a couple of days, 5k videos in addition to the pictures, but no other issues)

(Attached image: e567c2dd874328b7689b427f79e07205)

@mrichtarsky mrichtarsky added the bug Something isn't working label Nov 15, 2022
@lastzero
Member

The software column supports Unicode, and its validity is verified by MariaDB. PhotoPrism also does some sanitization, but not with a focus on Unicode-specific constraints. So this can be an issue with invalid Unicode in the metadata. This is the first time someone has reported it, though. Feel free to suggest improvements, e.g. via pull request.

see https://github.com/photoprism/photoprism/blob/develop/internal/entity/details.go

lastzero added a commit that referenced this issue Nov 15, 2022
Signed-off-by: Michael Mayer <michael@photoprism.app>
lastzero added a commit that referenced this issue Nov 15, 2022
Signed-off-by: Michael Mayer <michael@photoprism.app>
@lastzero lastzero self-assigned this Nov 15, 2022
@lastzero
Member

Alright, I fixed the error by making sure that all strings extracted from metadata are valid Unicode. You are welcome to test this with the upcoming preview build!
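As an illustration of what such a sanitization step can look like, here is a minimal Python sketch. It is not PhotoPrism's actual implementation (PhotoPrism is written in Go, and the fix is in the referenced commits). The function name `sanitize_unicode`, the `max_len` limit, and the sample value are assumptions made for the example.

```python
import unicodedata


def sanitize_unicode(raw: bytes, max_len: int = 1024) -> str:
    """Return a clean str: invalid UTF-8 dropped, control characters removed."""
    # errors="ignore" silently drops byte sequences that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    # Remove control characters (Unicode category Cc), e.g. NUL, tab, newline.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Trim surrounding whitespace and cap the length (hypothetical limit).
    return text.strip()[:max_len]


# Assumed sample: the readable prefix from the exiftool output followed by
# the six bytes quoted in the MariaDB error message.
software = b"ACD Systems " + b"\xca\xfd\xc2\xeb\xb3\xc9"
print(sanitize_unicode(software))  # prints "ACD Systems"
```

Dropping the invalid bytes loses whatever the original text was. Decoding with `errors="replace"` would instead leave U+FFFD markers, and transcoding from a guessed legacy encoding could recover the text but risks guessing wrong.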

@lastzero
Member

An updated preview build will be available for testing soon:

We hope you have a few minutes to let us know if it works so we can release the update tomorrow!

@lastzero lastzero added the please-test Ready for acceptance test label Nov 15, 2022
@lastzero lastzero changed the title photo: Error 1366: Incorrect string value: '\\xCA\\xFD\\xC2\\xEB\\xB3\\xC9...' for column photoprism.details.software Metadata: Automatically sanitize bad Unicode strings Nov 15, 2022
@lastzero
Member

Happy testing! 🎁

@mrichtarsky
Author

Thanks for the quick fix! Tested and can confirm the issue is fixed.

time="2022-11-15T22:05:28Z" level=info msg="index: found no .ppignore file"
time="2022-11-15T22:05:28Z" level=info msg="index: added folder /"
time="2022-11-15T22:05:29Z" level=info msg="media: created 9 thumbnails for e567c2dd874328b7689b427f79e07205.jpg [329.658595ms]"
time="2022-11-15T22:05:30Z" level=info msg="media: e567c2dd874328b7689b427f79e07205.jpg was taken at 2009-07-30 19:33:27 +0000 UTC (meta)"
time="2022-11-15T22:05:39Z" level=info msg="index: matched 1 label with e567c2dd874328b7689b427f79e07205.jpg [9.388298935s]"
time="2022-11-15T22:05:40Z" level=info msg="index: added main jpg file e567c2dd874328b7689b427f79e07205.jpg"
time="2022-11-15T22:05:40Z" level=info msg="indexing completed in 19 s"

@lastzero lastzero added tested Changes have been tested successfully released Available in the stable release and removed please-test Ready for acceptance test tested Changes have been tested successfully labels Nov 15, 2022
lastzero added a commit that referenced this issue Nov 17, 2022
Signed-off-by: Michael Mayer <michael@photoprism.app>
chain710 pushed a commit to chain710/photoprism that referenced this issue Nov 28, 2022
* merge-221118: (66 commits)
  Frontend: Update deps in package-lock.json
  Frontend: Update translations.json
  UI: Add Electra theme photoprism#2916
  MariaDB: Make version check compatible with 10.10 photoprism#2913
  Weblate: Update backend translations
  Weblate: Update frontend translations
  Backend: Upgrade golang.org/x/crypto in go.mod and go.sum
  Develop: Upgrade base image from 221116-jammy to 221117-jammy
  CI: Update "docker-develop-latest" target in Makefile
  CI: Update deploy-develop.sh script
  MariaDB: Upgrade pre-installed client from v10.6 to v10.9
  Videos: Add "intel" init target to force driver installation photoprism#2700
  Metadata: Improve data parsing and sanitization photoprism#2897
  Frontend: Update translations.json and package-lock.json
  Weblate: Update frontend translations
  Develop: Upgrade base image from 221102-jammy to 221116-jammy
  Frontend: Update translations.json
  Frontend: update options.js
  Weblate: Update frontend translations
  Weblate: Update backend translations
  ...
@lastzero lastzero moved this to Released 🌈 in Roadmap 🚀✨ Jun 8, 2023