Metadata: Automatically sanitize bad Unicode strings #2897

@mrichtarsky

1. What is not working as documented?

After indexing from scratch on a fresh instance (sidecar, cache, and database directories all empty), I get these errors in the logs:

time="2022-11-14T23:09:42Z" level=error msg="details: record not found (find or create 56854)"
time="2022-11-14T23:09:42Z" level=error msg="photo: Error 1366: Incorrect string value: '\\xCA\\xFD\\xC2\\xEB\\xB3\\xC9...' for column `photoprism`.`details`.`software` at row 1 (save details for 56854)"

This is one occurrence; I see about 15 in total when indexing 60k pictures.
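
To inspect the offending value without opening each file, the escaped bytes can be pulled straight out of the log line. The following is only a rough sketch: `extract_bytes` is a hypothetical helper, and it assumes the value consists solely of `\\xNN` escapes (printable characters mixed in between would be skipped). The error message truncates the value with `...`, so only the leading bytes are recovered.

```python
import re

# Log line copied from the report above (raw string keeps the backslashes).
log_line = r'''time="2022-11-14T23:09:42Z" level=error msg="photo: Error 1366: Incorrect string value: '\\xCA\\xFD\\xC2\\xEB\\xB3\\xC9...' for column `photoprism`.`details`.`software` at row 1 (save details for 56854)"'''


def extract_bytes(line: str) -> bytes:
    """Collect every \\xNN escape in the line and return the raw bytes (hypothetical helper)."""
    hex_pairs = re.findall(r"\\+x([0-9A-Fa-f]{2})", line)
    return bytes(int(pair, 16) for pair in hex_pairs)


raw = extract_bytes(log_line)
print(raw)
```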

2. How can we reproduce it?

Index the attached image.

3. What behavior do you expect?

No error in the logs.

4. What could be the cause of your problem?

I took a look at the picture from the log above.

$ exiftool -j e567c2dd874328b7689b427f79e07205.jpg
[{
  "SourceFile": "e567c2dd874328b7689b427f79e07205.jpg",
  "ExifToolVersion": 12.40,
  "FileName": "e567c2dd874328b7689b427f79e07205.jpg",
  "Directory": "/pictures",
  "FileSize": "344 KiB",
  "FileModifyDate": "2016:03:01 11:33:24+00:00",
  "FileAccessDate": "2022:11:15 12:06:57+00:00",
  "FileInodeChangeDate": "2021:01:07 18:20:35+00:00",
  "FilePermissions": "-rwxrwxrwx",
  "FileType": "JPEG",                                    
  "FileTypeExtension": "jpg",              
  "MIMEType": "image/jpeg",                              
  "ExifByteOrder": "Little-endian (Intel, II)",
  "Make": "Canon",                                       
  "Model": "Canon EOS 30D",                              
  "Orientation": "Horizontal (normal)",
  "XResolution": 72,                                                                                              
  "YResolution": 72,                                     
  "ResolutionUnit": "inches",                     
  "Software": "ACD Systems ????????", 
  "ModifyDate": "2010:09:15 16:50:46",
  "YCbCrPositioning": "Centered",                                                                                 
  "ExposureTime": 6,                                     
  "FNumber": 18.0,                                       
  "ExposureProgram": "Manual",
  "ISO": 400,                                            
  "ExifVersion": "",                                     
  "DateTimeOriginal": "2009:07:30 19:33:27",   
  "CreateDate": "2009:07:30 19:33:27",     
  "ComponentsConfiguration": "Y, Cb, Cr, -",                                                                      
  "ShutterSpeedValue": 6,                                                                                         
  "ApertureValue": 18.0,                                 
  "ExposureCompensation": 0,                       
  "MeteringMode": "Partial",         
  "FocalLength": "10.0 mm",                              
  "UserComment": "",                                     
  "SubSecTime": 295,                                     
  "SubSecTimeOriginal": 0,                               
  "SubSecTimeDigitized": 0,                              
  "FlashpixVersion": "",                                 
  "ColorSpace": "sRGB",                                  
  "ExifImageWidth": 1000,                                                                                         
  "ExifImageHeight": 667,                                                                                         
  "InteropIndex": "R98 - DCF basic file (sRGB)",                                                                  
  "InteropVersion": "0100",                                                                                       
  "FocalPlaneXResolution": 3959.322034,
  "FocalPlaneYResolution": 3959.322034,
  "FocalPlaneResolutionUnit": "inches",
  "CustomRendered": "Normal",                       
  "ExposureMode": "Manual",
  "WhiteBalance": "Auto",
  "SceneCaptureType": "Standard",        
  "Compression": "JPEG (old-style)",
  "ThumbnailOffset": 1029,
  "ThumbnailLength": 12574,
  "XMPToolkit": "Public XMP Toolkit Core 3.5",
  "NativeDigest": "256,257,258,259,262,274,277,284,530,531,282,283,296,301,318,319,529,532,306,270,271,272,305,315,33432;82EAA3AEAA15175DC79AD064028FC263",
  "CreatorTool": "Adobe Photoshop CS2 Windows",     
  "MetadataDate": "2010:05:03 13:06:48+08:00",  
  "DateTimeDigitized": "2009:07:30 19:33:27.0+8:00",                                                              
  "FlashFired": false,                                   
  "FlashReturn": "No return detection",
  "FlashMode": "Off",
  "FlashFunction": false,                                                                                         
  "FlashRedEyeMode": false,                              
  "DocumentID": "uuid:7F07266C7156DF11BDC5C4708C4E9C7E",
  "InstanceID": "uuid:8007266C7156DF11BDC5C4708C4E9C7E",
  "DerivedFromInstanceID": "uuid:8A1CCDB4A27EDE11B9749B75F8DD63FD",
  "DerivedFromDocumentID": "uuid:0A46978BA27EDE11B9749B75F8DD63FD",
  "Format": "image/jpeg",                                
  "Title": " ",                                          
  "ColorMode": "RGB",                                    
  "ICCProfileName": "sRGB IEC61966-2.1",       
  "History": "",                                         
  "ProfileCMMType": "Linotronic",                                                                                 
  "ProfileVersion": "2.1.0",                                                                                      
  "ProfileClass": "Display Device Profile",              
  "ColorSpaceData": "RGB ",                              
  "ProfileConnectionSpace": "XYZ ",  
  "ProfileDateTime": "1998:02:09 06:49:00",              
  "ProfileFileSignature": "acsp",                        
  "PrimaryPlatform": "Microsoft Corporation",            
  "CMMFlags": "Not Embedded, Independent",               
  "DeviceManufacturer": "Hewlett-Packard",               
  "DeviceModel": "sRGB",                                 
  "DeviceAttributes": "Reflective, Glossy, Positive, Color",
  "RenderingIntent": "Perceptual",                                                                                
  "ConnectionSpaceIlluminant": "0.9642 1 0.82491",                                                                
  "ProfileCreator": "Hewlett-Packard",                                                                            
  "ProfileID": 0,                                                                                                 
  "ProfileCopyright": "Copyright (c) 1998 Hewlett-Packard Company",
  "ProfileDescription": "sRGB IEC61966-2.1",
  "MediaWhitePoint": "0.95045 1 1.08905",
  "MediaBlackPoint": "0 0 0",                       
  "RedMatrixColumn": "0.43607 0.22249 0.01392",
  "GreenMatrixColumn": "0.38515 0.71687 0.09708",
  "BlueMatrixColumn": "0.14307 0.06061 0.7141",
  "DeviceMfgDesc": "IEC http://www.iec.ch",
  "DeviceModelDesc": "IEC 61966-2.1 Default RGB colour space - sRGB",
  "ViewingCondDesc": "Reference Viewing Condition in IEC61966-2.1",
  "ViewingCondIlluminant": "19.6445 20.3718 16.8089",
  "ViewingCondSurround": "3.92889 4.07439 3.36179",                                                               
  "ViewingCondIlluminantType": "D50",         
  "Luminance": "76.03647 80 87.12462",              
  "MeasurementObserver": "CIE 1931",            
  "MeasurementBacking": "0 0 0",                                                                                  
  "MeasurementGeometry": "Unknown",                      
  "MeasurementFlare": "0.999%",        
  "MeasurementIlluminant": "D65",
  "Technology": "Cathode Ray Tube Display",                                                                       
  "RedTRC": "(Binary data 2060 bytes, use -b option to extract)",
  "GreenTRC": "(Binary data 2060 bytes, use -b option to extract)",
  "BlueTRC": "(Binary data 2060 bytes, use -b option to extract)",
  "CurrentIPTCDigest": "10b3e7199b0ccc607c7875ebfd6f05ef",         
  "ObjectName": " ",
  "ImageWidth": 1000,
  "ImageHeight": 667,
  "EncodingProcess": "Baseline DCT, Huffman coding",
  "BitsPerSample": 8,
  "ColorComponents": 3,
  "YCbCrSubSampling": "YCbCr4:2:2 (2 1)",
  "Aperture": 18.0,
  "ImageSize": "1000x667",
  "Megapixels": 0.667,
  "ScaleFactor35efl": 1.6,
  "ShutterSpeed": 6,
  "SubSecCreateDate": "2009:07:30 19:33:27.0",
  "SubSecDateTimeOriginal": "2009:07:30 19:33:27.0",
  "SubSecModifyDate": "2010:09:15 16:50:46.295",
  "ThumbnailImage": "(Binary data 12574 bytes, use -b option to extract)",
  "Flash": "Off, Did not fire",
  "CircleOfConfusion": "0.019 mm",
  "FOV": "96.7 deg",
  "FocalLength35efl": "10.0 mm (35 mm equivalent: 16.0 mm)",
  "HyperfocalDistance": "0.30 m",
  "LightValue": 3.8
}]

Judging by the error message, I suppose the problematic EXIF value is "Software": "ACD Systems ????????". I'm not sure whether the question marks are contained verbatim in EXIF or merely stand in for garbage data that is perhaps then stored in MariaDB as is. This happens for pictures related to China, which may have passed through software that writes such strings. In any case, the bytes in the error message are not valid UTF-8:

$ python
>>> a=b'\xCA\xFD\xC2\xEB\xB3\xC9'
>>> a.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
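
Since the affected pictures are related to China, one way to check whether the bytes are garbage or just text in a legacy encoding is to try a few candidate codecs. This is a minimal sketch; `try_decodings` is a hypothetical helper and the list of candidates is my assumption. A clean decode does not prove which encoding the writing software actually used.

```python
# The six bytes shown (truncated) in the MariaDB error message.
raw = b"\xCA\xFD\xC2\xEB\xB3\xC9"


def try_decodings(data: bytes, encodings=("utf-8", "gbk", "gb18030")) -> dict:
    """Return {encoding: text} for every candidate codec that decodes without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    return results


decoded = try_decodings(raw)
for name, text in decoded.items():
    print(f"{name}: {text}")
```

With these bytes, UTF-8 is rejected while GBK yields readable Chinese characters (数码成), which suggests the Software tag was written in a Chinese Windows code page rather than being random data.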

Is my understanding correct that the details.software column should contain UTF-8? If so, does PhotoPrism or MariaDB ensure that the data is valid?
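
To illustrate the kind of validation I have in mind, here is a minimal sketch of sanitizing a metadata value before it is written to a utf8mb4 column. It is not PhotoPrism code (PhotoPrism is written in Go); `sanitize_utf8` is a hypothetical name, and replacing invalid sequences with U+FFFD is just one possible policy (dropping them or trying a legacy encoding first would be alternatives).

```python
def sanitize_utf8(raw: bytes) -> str:
    """Return text that is always valid UTF-8 (hypothetical helper).

    Valid input is returned unchanged; invalid byte sequences are replaced
    with U+FFFD so that the database never sees malformed data.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: lossy replacement is acceptable for descriptive fields
        # such as the EXIF Software tag.
        return raw.decode("utf-8", errors="replace")


value = sanitize_utf8(b"ACD Systems \xCA\xFD\xC2\xEB\xB3\xC9")
print(value)
```

The result can be encoded back to UTF-8 without errors, so an INSERT into details.software should no longer fail with Error 1366, at the cost of losing the original characters.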

5. Can you provide us with example files for testing, error logs, or screenshots?

Attached

6. Which software versions do you use?

(a) AMD64, PhotoPrism® CE Build 221105-7a295cab4

(b) MariaDB, stock settings from docker-compose.yml, i.e.

    image: mariadb:10.6
    security_opt:
      - seccomp:unconfined
      - apparmor:unconfined
    command: mysqld --innodb-buffer-pool-size=128M --transaction-isolation=READ-COMMITTED --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci --max-connections=512 --innodb-rollback-on-timeout=OFF --innodb-lock-wait-timeout=120

(c) Linux

Linux immens 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy

7. On what kind of device is PhotoPrism installed?

Shouldn't matter but:

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz
4GB 

(This is a fairly low-powered system, more than 10 years old. Indexing took a couple of days, with 5k videos in addition to the pictures, but there were no other issues.)

e567c2dd874328b7689b427f79e07205

Labels

bug: Something isn't working
released: Available in the stable release

Projects

Status

Release 🌈
