Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata: Preserve stopwords in existing keywords #1153

Closed
graciousgrey opened this issue Mar 25, 2021 · 25 comments
Closed

Metadata: Preserve stopwords in existing keywords #1153

graciousgrey opened this issue Mar 25, 2021 · 25 comments
Assignees
Labels
bug Something isn't working released Available in the stable release

Comments

@graciousgrey
Copy link
Member

Current behaviour:
If an image has flash and keyword information in exif. Only flash is set as keyword, not the keywords itself.

Related code:
internal/meta/exif.go

Related test:
internal/photoprism/index_mediafile_test.go

Expected behaviour:
Both flash and the keywords MUST be added as keywords

@graciousgrey graciousgrey added the bug Something isn't working label Mar 25, 2021
@uLUViL3gCs
Copy link

I can confirm same behavior.
I'm in process to import my whole photo DB into PP and as such I'm running dry-runs with a random selection to see that none information (including keywords) gets lost.
For some of my photos, none of its keywords got reflected, while for others all of the keywords were shown as desired.

I am glad to see this bug report, as I miserably failed in figuring out the trigger of this event.
Having checked the failed photos, I can confirm that all of those have "flash" keyword reflected in PP.

Here a shortened summary of exif flash info of the impacted photos:

xxx@yyy:~/Pics$ exiftool -all -a -G0:1 -s images/* | grep -i flash
[EXIF:ExifIFD] Flash : On, Fired
[EXIF:ExifIFD] FlashpixVersion : 0100
[EXIF:ExifIFD] LightSource : Flash
[EXIF:ExifIFD] Flash : Auto, Fired
[EXIF:ExifIFD] Flash : Auto, Fired, Red-eye reduction

Thanks for reporting the bug and all your efforts.

I'm running PhotoPrism® 210222-ac5a9d5e-Linux-x86_64 on Mac OS.

@lastzero lastzero self-assigned this Apr 25, 2021
@lastzero lastzero added the in-progress Somebody is working on this label Apr 25, 2021
@lastzero lastzero added please-test Ready for acceptance test and removed in-progress Somebody is working on this labels Apr 25, 2021
@lastzero lastzero changed the title Indexer / Flash information from exif overwrites exif keywords Metadata: Merge existing keywords Apr 25, 2021
@lastzero
Copy link
Member

This is part of our latest Development Preview and needs comprehensive testing before we can release it:

https://docs.photoprism.org/release-notes/

@lastzero lastzero changed the title Metadata: Merge existing keywords Metadata: Merge keywords from different sources Apr 25, 2021
@uLUViL3gCs
Copy link

I started verifying this fix with those pictures that I stumbled across earlier and from what I can say so far, the fix seems to work for at least those two scenarios:
I) related files are getting imported to a PP version having the fix
II) related files that got imported on PP version having #1153 and being re-indexed on PP version proving the fix (complete Rescan)
I am going to continue importing photos next days/weeks and will report failures if and when occuring...

What I have done:
Scenario I)

  1. take some jpg files which do have “flash” as keyword, as well as some other keywords
  2. trigger import on PP ":latest" (210422-97e75b04-Linux-x86_64)
  3. verify keywords
    ==> NOK:
    — keyword “flash” is shown (OK)
    — PP generated keywords shown (OK)
    — remaining additional keywords from source file not shown (NOK)
  4. trigger import of files from step 1) on a “fresh” (ie. new parent location for storage/originals/DB) ":preview" (210425-dd652ce4-Linux-x86_64)
  5. verify keywords
    ==> OK:
    — keyword “flash” is shown (OK)
    — PP generated keywords shown (OK)
    — remaining additional keywords from source file shown as well (OK)
    ====>> Bug fix OK

Scenario II)

  1. start with a “fresh” PP (ie. new parent location for storage/originals/DB) ":latest" (210422-97e75b04-Linux-x86_64)
  2. repeat step 1)
  3. trigger import
  4. verify keywords
    ==> NOK (as expected; same results as from step 3)
  5. shutdown system (docker-compose down)
  6. change PP yml from :latest to :preview (keep other locations like storage/originals/DB)
  7. start-up system (docker-compose up)
  8. check that system is ":preview" (210425-dd652ce4-Linux-x86_64)
    ==> **OK (**browser informed about upload, hit “reload”)
  9. check keywords:
    ==> NOK (kind of as expected; same results as from step 3)
  10. trigger re-index of files imported in step 12) (Complete Rescan == true)
    ==> OK
  11. verify keywords:
    ==> OK:
    — keyword “flash” is shown (OK)
    — PP generated keywords shown (OK)
    — remaining additional keywords from source file shown as well (OK)
    ====>> Bug fix OK (a re-index of originals seems to take the fix into place for already imported files)

@uLUViL3gCs
Copy link

One observation though (irrespective of #1153): For those files of mine, which do have the word "done" (besides others) set as keyword, this word is not shown in PP. Other keywords (next to "done") instead are correctly shown.
Any clue?
Is there probably a blacklist for keywords?

@lastzero
Copy link
Member

Yes, PhotoPrism has a stopwords list (like many full-text search engines). Those words were previously shown but were not searchable. Since we're parsing and sorting keywords now in order to merge them, this becomes visible. I understand it can be an issue for certain use cases like when having a "done" tag. We can remove that from the stoplist as a quick fix.

@uLUViL3gCs
Copy link

There is no urgent need - at least for me - to fix this right away.
Nonetheless, is there a way to get the full list? Having this, I could circumvent this situation from a different angle.

@graciousgrey
Copy link
Member Author

@lastzero
Copy link
Member

Removed a few words... in the next release, we should not apply the stopwords list when adding keywords, just for indexing / searching.

@graciousgrey graciousgrey added released Available in the stable release and removed please-test Ready for acceptance test labels Apr 30, 2021
@uLUViL3gCs
Copy link

@uLUViL3gCs this is the stopwords list: https://github.com/photoprism/photoprism/blob/develop/pkg/txt/stopwords.go

Thanks for sharing! 👍🏼
(Quite many words though 😱😉)

lastzero added a commit that referenced this issue May 1, 2021
@lastzero lastzero changed the title Metadata: Merge keywords from different sources Metadata: Preserve stopwords in existing keywords May 1, 2021
@lastzero lastzero added the please-test Ready for acceptance test label May 1, 2021
@lastzero lastzero reopened this May 1, 2021
@lastzero
Copy link
Member

lastzero commented May 1, 2021

@uLUViL3gCs We've improved the indexer so that stopwords should not get removed from existing keywords anymore. Let us know if this works for you! It's part of our latest Development Preview.

@uLUViL3gCs
Copy link

Running 210501-d8322a59-Linux-x86_64 on MAC OS, I have

  • taken a random list of files of mine,
  • added purpusely some additional keywords to some files taken from https://github.com/photoprism/photoprism/blob/develop/pkg/txt/stopwords.go and
  • checked the keywords of the photos in PP afterwards.

I grouped the results into:

  1. OK == all keywords from file plus PP generated keywords are shown:
xxx@yyy:~/Pictures/photoprism_1153/import$ exiftool -a -s -G0:1 -Flash -keywords ../../photoprism/import/* 
======== ../../photoprism/import/01032008789.jpg
[EXIF:ExifIFD]  Flash                           : Auto, Did not fire
[IPTC]          Keywords                        : PPexported, photo
======== ../../photoprism/import/IMG_0436.HEIC
[EXIF:ExifIFD]  Flash                           : Off, Did not fire
[XMP:XMP-pdf]   Keywords                        : done, PPexported
======== ../../photoprism/import/IMG_7765.JPG
[EXIF:ExifIFD]  Flash                           : Auto, Fired, Red-eye reduction
[IPTC]          Keywords                        : PPexported
======== ../../photoprism/import/P1070475.JPG
[EXIF:ExifIFD]  Flash                           : Auto, Fired, Red-eye reduction
[IPTC]          Keywords                        : PPexported, var
  1. probably OK == all keywords from file except "39" and "lx", plus PP generated keywords are shown (is it so that 2-letter keywords are by default omitted?):
======== ../../photoprism/import/15072007391.jpg
[EXIF:ExifIFD]  Flash                           : Auto, Did not fire
[IPTC]          Keywords                        : group, videoBest, PPexported, 39
======== ../../photoprism/import/P1100808.JPG
[EXIF:ExifIFD]  Flash                           : Auto, Fired
[IPTC]          Keywords                        : done, PPexported, film, flickr, lx
  1. NOK == only PP generated keywords are shown, none of the file based keywords is shown:
======== ../../photoprism/import/IMG_1379.HEIC
[EXIF:ExifIFD]  Flash                           : Auto, Fired
[XMP:XMP-pdf]   Keywords                        : done, PPexported
======== ../../photoprism/import/IMG_1789.HEIC
[EXIF:ExifIFD]  Flash                           : Auto, Fired
[XMP:XMP-pdf]   Keywords                        : selfie, PPexported
    8 image files read

From what I can see, the original "flash" problem AND the "stopwords" seems to be solved for *JPG (irrespective of flash) and for such *HEIC files, where flash is off.
For *HEIC files for which flash is on (ie. true), however, PP is not reflecting any file based keyword.

@lastzero
Copy link
Member

lastzero commented May 2, 2021

Thanks a lot for the detailed test!

probably OK == all keywords from file except "39" and "lx", plus PP generated keywords are shown (is it so that 2-letter keywords are by default omitted?):

Yes, single latin letters and two letter "words" are omitted as they are most likely noise and won't be useful for performing a search (unless you do an exact match instead of searching for all keywords that contain the term). This does not apply for Chinese and other non-latin languages where single symbols are more expressive.

NOK == only PP generated keywords are shown, none of the file based keywords is shown:

======== ../../photoprism/import/IMG_1379.HEIC
[EXIF:ExifIFD]  Flash                           : Auto, Fired
[XMP:XMP-pdf]   Keywords                        : done, PPexported
======== ../../photoprism/import/IMG_1789.HEIC
[EXIF:ExifIFD]  Flash                           : Auto, Fired
[XMP:XMP-pdf]   Keywords                        : selfie, PPexported
    8 image files read

These are all HEIC files. If there are related JPEGs that contain metadata as well (typically the case), their metadata keywords are indexed and not the (now updated) ones in HEIC (or RAW) files. That's the default as we can't directly read metadata from many less common file formats - so JPEGs are searched first, instead of last after checking every other file. Obviously this is much faster as well.

@uLUViL3gCs
Copy link

Yes, single latin letters and two letter "words" are omitted

Alright, got it.

Regarding the NOK HEIC files from above:

so JPEGs are searched first, instead of last after checking every other file

I understand. In my use case there is no additional file coming along with HEIC files:

Use case:

Import .HEIC files which
a) do not have any sidecar files nor any related JPEGs containing metadata
b) do have Keywords set within HEIC (raw) files only
Expectation: PP to reflect keywords

Could you please help me with following questions I do have:

  • Is this use case from PP point of view "valid" (I think you rather "denied" this commenting above, right)?
  • How come use case still seems to work for files where flash is not set, while it is not working when flash is set (true for both "Convert to JPG" set and not set)?

FYI: Having done some more testing, I was able to get keywords reflected for HEIC files with flash == true: This seems to work if some sidecar files (here: xmp) is imported coming along raw file. Obviously, this is a slightly different use case...

Thanks.

@graciousgrey
Copy link
Member Author

@uLUViL3gCs I can reproduce it. Keywords can be extracted from HEIC directly when flash is off but not if it is on.
We will have a look at it.
Thanks for testing :)

lastzero added a commit that referenced this issue May 4, 2021
Also reduces length limit for latin words to 2 letters.
@lastzero
Copy link
Member

lastzero commented May 4, 2021

My last change will merge keywords from HEIC & JPEG if metadata priority is the same. You can still overwrite existing keywords manually or via XMP sidecar files. The length limit for latin words has been reduced to 2 letters.

@uLUViL3gCs
Copy link

I have re-tested mentioned Use Case from above #1153 (comment) and all expected keywords were now shown for HEIC and JPG files with flash ON and OFF. Excellent!

Observations made:

  • 2 letter keywords are omitted (tested with "39", "lx")
  • 3 letter numeric keywords got dropped, too (tested with "832")

Scenarios tested successfully:

  • fresh import into 210504-1fea0b13-Linux-x86_64 (providing the fix)
  • import into 210426-da6e948f-Linux-x86_64 (false keywords expected), followed by PP upgrade to 210504-1fea0b13-Linux-x86_64, followed by re-index (index -a)

Scenarios not tested:

  • using xmp sidecar files to overwrite

Files used:

xxx@yyy:~/Pictures/photoprism_1153$ exiftool -flash -keywords -ext jpg -ext heic import/
======== import/IMG_1379.HEIC
Flash : Auto, Fired
Keywords : done, PPexported
======== import/IMG_0436.HEIC
Flash : Off, Did not fire
Keywords : done, PPexported
======== import/15072007391.jpg
Flash : Auto, Did not fire
Keywords : group, videoBest, PPexported, 39, 832
======== import/IMG_1580.HEIC
Flash : Auto, Fired
Keywords : selfie, PPexported
======== import/P1100808.JPG
Flash : Auto, Fired
Keywords : done, PPexported, film, flickr, lx
======== import/P1070475.JPG
Flash : Auto, Fired, Red-eye reduction
Keywords : PPexported, var
======== import/IMG_1789.HEIC
Flash : Auto, Fired
Keywords : selfie, PPexported
======== import/IMG_7765.JPG
Flash : Auto, Fired, Red-eye reduction
Keywords : PPexported
======== import/01032008789.jpg
Flash : Auto, Did not fire
Keywords : PPexported, photo
1 directories scanned
9 image files read

@lastzero
Copy link
Member

lastzero commented May 4, 2021

Do you need numeric words? What language? XD

@uLUViL3gCs
Copy link

No, I personally do not use numeric words, thanks.
(That was rather a test to see how different keywords and lengths a currently being treated.)
I can't really judge the overall need for purely numeric words. Same time, I do not see any good reason to prevent the same, too. Esp. not if user is able to set it in a manual fashion (overwrite) anyway.

@lastzero
Copy link
Member

lastzero commented May 4, 2021

If someone needs it, we can do it. Gets us to the point of what a word is and how to separate them. Maybe a space or tab is a letter for some people and a word is a sentence, who knows? ;)

@uLUViL3gCs
Copy link

Esp. not if user is able to set it in a manual fashion (overwrite) anyway.

I just checked, and it looks that manual setting a pure numeric keyword is prevented as well. In my opinion this is "good", because from user point of view, I'd not like to see different system behavior for setting keywords (or other meta data which is merged), just depending on the way the data is coming into the system (manual or via import, from photo or sidecar). At least, I would not understand why my data is being omitted one way, but allowed the other and vice versa.

Regarding what is "allowed" and "what not":
I am not an expert in terms of photography, but then I am wondering is there a specification/standard out there, which could be used as a reference?
I can only say that both, single letter (numeric or non-numeric) and pure numeric letter keywords seem to be supported by at least exiftool and Apple Photos.
IPTC keywords did not give me any further clue either.

@lastzero
Copy link
Member

lastzero commented May 5, 2021

Obviously single characters can be stored (doubt there is an official standard that doesn't technically support it) and we do it e.g. for Chinese. On the other hand, data is sometimes corrupted or just not useful - like when we extract keywords from multiple sources that may contain bad data. Note that our typical use case is not that users manually enter keywords, we have a lot of automation. So it makes sense to have a sanity check. Also in case people confuse the field and enter complete sentences as "keywords". So "a" and "I" would get filtered out. That's also good for our index as indexing less useful stopwords incl. single characters reduces performance and consumes storage. Just because something can be done doesn't mean it should be done.

@uLUViL3gCs
Copy link

Thanks for this background explanation, esp. with respect to automation and impact on system performance.
Looking at my source files collected over decades as a non-pro user, I can confirm corrupt metadata in place. I, however, only figured it out now as a preparation to import to PP (I started semi-automated correction of corrupt meta-data prior PP import.).

If someone needs it, we can do it.

I henceforth would agree to this :)... Thanks!

@lastzero
Copy link
Member

lastzero commented May 5, 2021

Thanks a lot for testing!

@lastzero
Copy link
Member

lastzero commented May 5, 2021

Probably better to trim - from keyword endings, what do you think @graciousgrey @uLUViL3gCs?

truncate

@uLUViL3gCs
Copy link

I think the likelihood for a trailing '-' to derive from corrupt/faulty data is higher than from a user purposely having set it. If so, I'd vote to trim it.

@lastzero lastzero removed the please-test Ready for acceptance test label May 5, 2021
@lastzero lastzero closed this as completed May 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working released Available in the stable release
Projects
Status: Release 🌈
Development

No branches or pull requests

3 participants