extract AppleDict meta-info (langs, title, author) #418

Merged (4 commits) on Jan 28, 2023

soshial
Contributor

@soshial soshial commented Jan 28, 2023

Done in this PR:

  • Simplify logic of finding filepath of Body.data
  • This is the data that we can extract from the Info.plist file:

Oxford Dictionary of English

{
   "DCSDictionaryCSS": "DefaultStyle.css",
   "CFBundleName": "English",
   "DCSDictionaryUseSystemAppearance": true,
   "IDXDictionaryIndexes": [
      {
         "IDXIndexSupportDataID": false,
         "IDXIndexWritable": false,
         "IDXIndexPath": "KeyText.index",
         "IDXIndexAccessMethod": "com.apple.TrieAccessMethod",
         "IDXIndexName": "DCSKeywordIndex",
         "IDXIndexDataSizeLength": 2,
         "IDXIndexBigEndian": false,
         "TrieIndexCompressionType": 1,
         "IDXIndexDataFields": {
            "IDXFixedDataFields": [
               {
                  "IDXDataFieldName": "DCSPrivateFlag",
                  "IDXDataSize": 2
               }
            ],
            "IDXExternalDataFields": [
               {
                  "IDXDataFieldName": "DCSExternalBodyID",
                  "IDXDataSize": 8,
                  "IDXIndexName": "DCSBodyDataIndex"
               }
            ],
            "IDXVariableDataFields": [
               {
                  "IDXDataFieldName": "DCSKeyword",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSHeadword",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSEntryTitle",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSAnchor",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSYomiWord",
                  "IDXDataSizeLength": 2
               }
            ]
         },
         "IDXIndexKeyMatchingMethods": [
            "IDXExactMatch",
            "IDXPrefixMatch",
            "IDXCommonPrefixMatch",
            "IDXWildcardMatch",
            "IDXAllMatch"
         ],
         "TrieAuxiliaryDataOptions": {
            "HeapDataCompressionType": 3,
            "IDXIndexPath": "KeyText.data"
         }
      },
      {
         "IDXIndexSupportDataID": false,
         "IDXIndexWritable": false,
         "IDXIndexPath": "EntryID.index",
         "IDXIndexAccessMethod": "com.apple.TrieAccessMethod",
         "IDXIndexName": "DCSReferenceIndex",
         "IDXIndexDataSizeLength": 2,
         "IDXIndexBigEndian": false,
         "TrieIndexCompressionType": 1,
         "IDXIndexDataFields": {
            "IDXExternalDataFields": [
               {
                  "IDXDataFieldName": "DCSExternalBodyID",
                  "IDXDataSize": 8,
                  "IDXIndexName": "DCSBodyDataIndex"
               }
            ]
         },
         "IDXIndexKeyMatchingMethods": [
            "IDXExactMatch"
         ],
         "TrieAuxiliaryDataOptions": {
            "IDXIndexPath": "EntryID.data"
         }
      },
      {
         "IDXIndexPath": "Body.data",
         "IDXIndexAccessMethod": "com.apple.HeapAccessMethod",
         "IDXIndexName": "DCSBodyDataIndex",
         "IDXIndexBigEndian": false,
         "IDXIndexDataFields": {
            "IDXVariableDataFields": [
               {
                  "IDXDataFieldName": "DCSBodyData",
                  "IDXDataSizeLength": 4
               }
            ]
         },
         "IDXIndexWritable": false,
         "HeapDataCompressionType": 2,
         "IDXIndexSupportDataID": true
      }
   ],
   "DCSDictionaryManufacturerName": "Apple Inc.",
   "CFBundleDevelopmentRegion": "en",
   "DCSDictionaryFrontMatterReferenceID": "fbm_index",
   "CFBundleShortVersionString": "2.6",
   "DCSElementXPath": {
      "pos": "//span[@class=\"sg\"]/descendant::span[@class=\"pos\"][1]",
      "definitions": "//span[contains(@class,\"x_xd0\")][1]/descendant::span[contains(@class,\"x_xd1\") and not(contains(@class,\"t_subsense\"))]/descendant::span[@class=\"df\" or @class=\"xrg\"][1][not(@d:prtl)]",
      "pronunciation": "(//span[@class=\"pr\" or @class=\"prx\"]/descendant::span[contains(@class,\"ph\")])[1]"
   },
   "DCSDictionaryDetailedDisplayName": "British English",
   "CFBundleInfoDictionaryVersion": "6.0",
   "CFBundleIdentifier": "com.apple.dictionary.ODE",
   "DCSDictionaryPrimaryLanguage": "en_GB",
   "DCSDictionaryDefaultPrefs": {
      "version": "1",
      "pronunciation": "0"
   },
   "DCSDictionaryPreviewMarkupVersion": 1,
   "CFBundleDisplayName": "Oxford Dictionary of English",
   "DCSDictionaryCopyright": "Oxford Dictionary of English<br/>Copyright © 2010, 2022 by Oxford University Press. All rights reserved.",
   "IDXDictionaryVersion": 3,
   "DCSDictionaryNativeDisplayName": "Dictionary",
   "DCSBuildToolVersion": 3,
   "DCSDictionaryLanguages": [
      {
         "DCSDictionaryDescriptionLanguage": "en_GB",
         "DCSDictionaryIndexLanguage": "en_GB"
      }
   ]
}
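
As a rough illustration of the first bullet, the dump above suggests one way the body file could be located: scan `IDXDictionaryIndexes` for the entry named `DCSBodyDataIndex` and take its `IDXIndexPath`. This is only a sketch under that assumption, not necessarily what the PR does; the function name `find_body_index_path` is hypothetical.

```python
from typing import Any, Optional


def find_body_index_path(info: dict[str, Any]) -> Optional[str]:
    """Return IDXIndexPath of the body-data index, or None if it is absent."""
    for index in info.get("IDXDictionaryIndexes", []):
        if index.get("IDXIndexName") == "DCSBodyDataIndex":
            return index.get("IDXIndexPath")
    return None


# Trimmed-down copy of the index list from the dump above.
info = {
    "IDXDictionaryIndexes": [
        {"IDXIndexName": "DCSKeywordIndex", "IDXIndexPath": "KeyText.index"},
        {"IDXIndexName": "DCSReferenceIndex", "IDXIndexPath": "EntryID.index"},
        {"IDXIndexName": "DCSBodyDataIndex", "IDXIndexPath": "Body.data"},
    ],
}
body_path = find_body_index_path(info)
```

The returned value is a file name relative to wherever the index files live inside the bundle; resolving that directory is left out here.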

Custom example (make up one's mind, make it)

{
   "DCSDictionaryCSS": "DefaultStyle.css",
   "CFBundleName": "English",
   "DCSDictionaryUseSystemAppearance": true,
   "IDXDictionaryIndexes": [
      {
         "IDXIndexSupportDataID": false,
         "IDXIndexWritable": false,
         "IDXIndexPath": "KeyText.index",
         "IDXIndexAccessMethod": "com.apple.TrieAccessMethod",
         "IDXIndexName": "DCSKeywordIndex",
         "IDXIndexDataSizeLength": 2,
         "IDXIndexBigEndian": false,
         "TrieIndexCompressionType": 1,
         "IDXIndexDataFields": {
            "IDXFixedDataFields": [
               {
                  "IDXDataFieldName": "DCSPrivateFlag",
                  "IDXDataSize": 2
               }
            ],
            "IDXExternalDataFields": [
               {
                  "IDXDataFieldName": "DCSExternalBodyID",
                  "IDXDataSize": 8,
                  "IDXIndexName": "DCSBodyDataIndex"
               }
            ],
            "IDXVariableDataFields": [
               {
                  "IDXDataFieldName": "DCSKeyword",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSHeadword",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSEntryTitle",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSAnchor",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSYomiWord",
                  "IDXDataSizeLength": 2
               }
            ]
         },
         "IDXIndexKeyMatchingMethods": [
            "IDXExactMatch",
            "IDXPrefixMatch",
            "IDXCommonPrefixMatch",
            "IDXWildcardMatch",
            "IDXAllMatch"
         ],
         "TrieAuxiliaryDataOptions": {
            "HeapDataCompressionType": 3,
            "IDXIndexPath": "KeyText.data"
         }
      },
      {
         "IDXIndexSupportDataID": false,
         "IDXIndexWritable": false,
         "IDXIndexPath": "EntryID.index",
         "IDXIndexAccessMethod": "com.apple.TrieAccessMethod",
         "IDXIndexName": "DCSReferenceIndex",
         "IDXIndexDataSizeLength": 2,
         "IDXIndexBigEndian": false,
         "TrieIndexCompressionType": 1,
         "IDXIndexDataFields": {
            "IDXExternalDataFields": [
               {
                  "IDXDataFieldName": "DCSExternalBodyID",
                  "IDXDataSize": 8,
                  "IDXIndexName": "DCSBodyDataIndex"
               }
            ]
         },
         "IDXIndexKeyMatchingMethods": [
            "IDXExactMatch"
         ],
         "TrieAuxiliaryDataOptions": {
            "IDXIndexPath": "EntryID.data"
         }
      },
      {
         "IDXIndexPath": "Body.data",
         "IDXIndexAccessMethod": "com.apple.HeapAccessMethod",
         "IDXIndexName": "DCSBodyDataIndex",
         "IDXIndexBigEndian": false,
         "IDXIndexDataFields": {
            "IDXVariableDataFields": [
               {
                  "IDXDataFieldName": "DCSBodyData",
                  "IDXDataSizeLength": 4
               }
            ]
         },
         "IDXIndexWritable": false,
         "HeapDataCompressionType": 2,
         "IDXIndexSupportDataID": true
      }
   ],
   "DCSDictionaryManufacturerName": "Apple Inc.",
   "CFBundleDevelopmentRegion": "en",
   "DCSDictionaryFrontMatterReferenceID": "fbm_index",
   "CFBundleShortVersionString": "2.6",
   "DCSElementXPath": {
      "pos": "//span[@class=\"sg\"]/descendant::span[@class=\"pos\"][1]",
      "definitions": "//span[contains(@class,\"x_xd0\")][1]/descendant::span[contains(@class,\"x_xd1\") and not(contains(@class,\"t_subsense\"))]/descendant::span[@class=\"df\" or @class=\"xrg\"][1][not(@d:prtl)]",
      "pronunciation": "(//span[@class=\"pr\" or @class=\"prx\"]/descendant::span[contains(@class,\"ph\")])[1]"
   },
   "DCSDictionaryDetailedDisplayName": "British English",
   "CFBundleInfoDictionaryVersion": "6.0",
   "CFBundleIdentifier": "com.apple.dictionary.ODE",
   "DCSDictionaryPrimaryLanguage": "en_GB",
   "DCSDictionaryDefaultPrefs": {
      "version": "1",
      "pronunciation": "0"
   },
   "DCSDictionaryPreviewMarkupVersion": 1,
   "CFBundleDisplayName": "Oxford Dictionary of English",
   "DCSDictionaryCopyright": "Oxford Dictionary of English<br/>Copyright © 2010, 2022 by Oxford University Press. All rights reserved.",
   "IDXDictionaryVersion": 3,
   "DCSDictionaryNativeDisplayName": "Dictionary",
   "DCSBuildToolVersion": 3,
   "DCSDictionaryLanguages": [
      {
         "DCSDictionaryDescriptionLanguage": "en_GB",
         "DCSDictionaryIndexLanguage": "en_GB"
      }
   ]
}

Custom PL-RU

{
   "CFBundleDevelopmentRegion": "zh-Hans",
   "CFBundleDisplayName": "Wielki Słownik Polsko-Rosyjski",
   "CFBundleIdentifier": "com.apple.dictionary.pol-rus",
   "CFBundleInfoDictionaryVersion": "6.0",
   "CFBundleName": "pol-rus",
   "CFBundleShortVersionString": "1.0",
   "DCSBuildToolVersion": 1,
   "DCSDictionaryCSS": "DefaultStyle.css",
   "DCSDictionaryCopyright": "Ver. 01d, based on Wielki Słownik Pol-Ros (Wiedza Powszechna) / coded and updated by Berk Bear",
   "DCSDictionaryManufacturerName": "Berk Bear",
   "IDXDictionaryIndexes": [
      {
         "IDXIndexAccessMethod": "com.apple.TrieAccessMethod",
         "IDXIndexBigEndian": false,
         "IDXIndexDataFields": {
            "IDXExternalDataFields": [
               {
                  "IDXDataFieldName": "DCSExternalBodyID",
                  "IDXDataSize": 4,
                  "IDXIndexName": "DCSBodyDataIndex"
               }
            ],
            "IDXFixedDataFields": [
               {
                  "IDXDataFieldName": "DCSPrivateFlag",
                  "IDXDataSize": 2
               }
            ],
            "IDXVariableDataFields": [
               {
                  "IDXDataFieldName": "DCSKeyword",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSHeadword",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSAnchor",
                  "IDXDataSizeLength": 2
               },
               {
                  "IDXDataFieldName": "DCSYomiWord",
                  "IDXDataSizeLength": 2
               }
            ]
         },
         "IDXIndexDataSizeLength": 2,
         "IDXIndexKeyMatchingMethods": [
            "IDXExactMatch",
            "IDXPrefixMatch",
            "IDXCommonPrefixMatch",
            "IDXWildcardMatch"
         ],
         "IDXIndexName": "DCSKeywordIndex",
         "IDXIndexPath": "KeyText.index",
         "IDXIndexSupportDataID": false,
         "IDXIndexWritable": false,
         "TrieAuxiliaryDataFile": "KeyText.data"
      },
      {
         "IDXIndexAccessMethod": "com.apple.TrieAccessMethod",
         "IDXIndexBigEndian": false,
         "IDXIndexDataFields": {
            "IDXExternalDataFields": [
               {
                  "IDXDataFieldName": "DCSExternalBodyID",
                  "IDXDataSize": 4,
                  "IDXIndexName": "DCSBodyDataIndex"
               }
            ]
         },
         "IDXIndexDataSizeLength": 2,
         "IDXIndexKeyMatchingMethods": [
            "IDXExactMatch"
         ],
         "IDXIndexName": "DCSReferenceIndex",
         "IDXIndexPath": "EntryID.index",
         "IDXIndexSupportDataID": false,
         "IDXIndexWritable": false,
         "TrieAuxiliaryDataFile": "EntryID.data"
      },
      {
         "HeapDataCompressionType": 1,
         "IDXIndexAccessMethod": "com.apple.HeapAccessMethod",
         "IDXIndexBigEndian": false,
         "IDXIndexDataFields": {
            "IDXVariableDataFields": [
               {
                  "IDXDataFieldName": "DCSBodyData",
                  "IDXDataSizeLength": 4
               }
            ]
         },
         "IDXIndexName": "DCSBodyDataIndex",
         "IDXIndexPath": "Body.data",
         "IDXIndexSupportDataID": true,
         "IDXIndexWritable": false
      }
   ],
   "IDXDictionaryVersion": 1
}
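
For reference, a minimal sketch of how the meta-info could be read from these dumps. The key names come from the examples above; the helper names (`load_info_plist`, `extract_meta`) and the output field names are hypothetical, and the fallback from `CFBundleDisplayName` to `CFBundleName` is an assumption. Note that the PL-RU example has no `DCSDictionaryLanguages` key, so every lookup has to tolerate missing keys.

```python
import plistlib
from typing import Any


def load_info_plist(path: str) -> dict[str, Any]:
    """Parse an Info.plist file; plistlib detects XML vs. binary by itself."""
    with open(path, "rb") as fp:
        return plistlib.load(fp)


def extract_meta(info: dict[str, Any]) -> dict[str, Any]:
    """Collect title, author and language codes; missing keys become None."""
    langs = info.get("DCSDictionaryLanguages") or [{}]
    first = langs[0]
    return {
        # Assumption: fall back to CFBundleName when no display name is set.
        "title": info.get("CFBundleDisplayName") or info.get("CFBundleName"),
        "author": info.get("DCSDictionaryManufacturerName"),
        "copyright": info.get("DCSDictionaryCopyright"),
        "index_lang": first.get("DCSDictionaryIndexLanguage"),
        "description_lang": first.get("DCSDictionaryDescriptionLanguage"),
    }


# Trimmed-down copies of two of the dumps above.
oxford = {
    "CFBundleDisplayName": "Oxford Dictionary of English",
    "DCSDictionaryManufacturerName": "Apple Inc.",
    "DCSDictionaryLanguages": [
        {
            "DCSDictionaryDescriptionLanguage": "en_GB",
            "DCSDictionaryIndexLanguage": "en_GB",
        }
    ],
}
pol_rus = {
    "CFBundleDisplayName": "Wielki Słownik Polsko-Rosyjski",
    "DCSDictionaryManufacturerName": "Berk Bear",
}

oxford_meta = extract_meta(oxford)
pol_rus_meta = extract_meta(pol_rus)
```

For a custom dictionary like PL-RU the language fields simply come back as `None`, so the caller would have to decide whether to leave them empty or guess from the title.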

Review thread on pyglossary/plugins/appledict/__init__.py (outdated, resolved)
Review thread on pyglossary/plugins/appledict_bin.py (outdated, resolved)
@ilius
Owner

ilius commented Jan 28, 2023

Please run ./scripts/gen.sh and add the changes.

Have you tested it on several glossaries?

That's all.
Thank you.

@soshial
Contributor Author

soshial commented Jan 28, 2023

I have several dicts that come bundled with macOS and a couple of custom dicts (generated by DictionaryKit), and those seem to work okay.
I have some plans for improving the binary AppleDict parsing code. I wish we had tests with binary AppleDict files to check that my changes don't break anything.

@soshial
Contributor Author

soshial commented Jan 28, 2023

I would also like to look at the exact dictionary files that led to adding

  1. these lines (it's a bit hard to imagine that some dictionary might have the chunk length encoded in fewer than 4 bytes)

    if plus < 4:
        bs = b"\x00" * (4 - plus) + bs

  2. and these lines (why do we need to search for the </d:entry> closing tag when we already have its length?)

    if chunkSize == 0:
        endI = self._buf[pos:].find(b"</d:entry>")
        if endI == -1:
            chunkSize = len(self._buf) - pos
        else:
            chunkSize = endI + 10
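
To make the two snippets concrete, here is a standalone sketch of what they appear to do. The function names are hypothetical, and the byte order is an assumption: left-padding with zero bytes only preserves the value when the bytes are then read as big-endian, which the thread does not confirm.

```python
import struct

ENTRY_END = b"</d:entry>"


def read_length_be(buf: bytes, pos: int, width: int) -> int:
    """Read an unsigned big-endian length of `width` bytes (1 to 4) at `pos`."""
    bs = buf[pos:pos + width]
    if len(bs) < width:
        raise ValueError("buffer too short for length prefix")
    # Same result as left-padding to 4 bytes and unpacking with ">I".
    return int.from_bytes(bs, "big")


def chunk_size_or_scan(buf: bytes, pos: int, chunk_size: int) -> int:
    """Use the stored size, or fall back to scanning for the closing tag."""
    if chunk_size != 0:
        return chunk_size
    end = buf.find(ENTRY_END, pos)
    if end == -1:
        # No closing tag: take everything up to the end of the buffer.
        return len(buf) - pos
    return end - pos + len(ENTRY_END)


# Left zero-padding a 2-byte prefix to 4 bytes gives the same value.
short_prefix = b"\x01\x02"
padded_value = struct.unpack(">I", b"\x00" * 2 + short_prefix)[0]
direct_value = read_length_be(short_prefix, 0, 2)

buf = b"<d:entry>a</d:entry><d:entry>b"
first_size = chunk_size_or_scan(buf, 0, 0)
tail_size = chunk_size_or_scan(buf, 20, 0)
```

Here `len(ENTRY_END)` is 10, which matches the literal `+ 10` in the second snippet.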

@ilius
Owner

ilius commented Jan 28, 2023

Let's keep those out of this PR.

The problem with automated testing is that we need small glossaries. If you can create or find small glossaries for testing, open a new issue and we can work it out.

@soshial
Contributor Author

soshial commented Jan 28, 2023

> Please run ./scripts/gen.sh and add the changes.

What will this script do? I am a bit cautious.

> The problem with automated testing is that we need small glossaries. If you can create or find small glossaries for testing, open a new issue and we can work it out.

I can send you some example dictionary files, but not many. Should I create a separate issue for that? Or could I send them to you via Telegram?

@ilius
Owner

ilius commented Jan 28, 2023

Just run it and you'll see!

Open a new issue (list your files and sizes) and I will explain.

@ilius
Owner

ilius commented Jan 28, 2023

gen.sh updates documents and index.json when options or dependencies are changed.

@ilius ilius merged commit 13e8978 into ilius:master Jan 28, 2023
@soshial
Contributor Author

soshial commented Jan 28, 2023

Hmm, we could have squashed these commits into one. Commit history would be cleaner.

ilius added a commit that referenced this pull request Jan 29, 2023
ilius added a commit that referenced this pull request Jan 29, 2023
@soshial
Contributor Author

soshial commented Jan 30, 2023

Thank you very much, @ilius, for fixing some issues that my code caused.

@soshial soshial deleted the appledict-meta-info branch February 16, 2023 12:18