Add book text to Spotlight #26

Merged
merged 12 commits into from Mar 6, 2012

Contributor

chrisridd commented Mar 4, 2012

This seems to correctly add the HTML text inside epub (2 and 3) books into Spotlight.

chrisridd added some commits Mar 3, 2012

Code cleanup
BOOL variables should use YES/NO symbols, not true/false etc.

The container object is autoreleased, so sending release is an error.

The content object is autoreleased so an extra retain/release is
unnecessary.

The cover object is retained already, so another retain will cause it to
be leaked.

The coverData object is autoreleased, so sending release is an error.

The retain of publicationDate is moved out of the loop, in case it gets
set multiple times in the loop. That would cause previous values to
leak.

The unit tests still pass.

Convert XHTML items in manifest into plain text
The Spotlight importer is going to use this, but doesn't yet.

The model is that it requests items (0..) from the manifest, which are
plaintext (not HTML) NSStrings. The manifest is lazily constructed by
searching the OPF for items with the right mime type. Getting nil back
means you've completed the manifest.
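
The lookup model described above can be sketched roughly as follows. This is a Python illustration, not the Objective-C in this pull request: the `Manifest` class, the `text_at` method and the `load_text` callback are hypothetical names, and the sketch assumes the OPF is already available as a string.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPF package documents in EPUB 2 and 3.
OPF_NS = "{http://www.idpf.org/2007/opf}"
XHTML_MIME_TYPE = "application/xhtml+xml"


class Manifest:
    """Hypothetical stand-in for the manifest access described above."""

    def __init__(self, opf_xml, load_text):
        self._opf_xml = opf_xml
        self._load_text = load_text  # assumed callback: href -> plain text
        self._hrefs = None           # built lazily on the first request

    def _xhtml_hrefs(self):
        # Search the OPF only once, and only when an item is first requested.
        if self._hrefs is None:
            root = ET.fromstring(self._opf_xml)
            self._hrefs = [
                item.get("href")
                for item in root.iter(OPF_NS + "item")
                if item.get("media-type") == XHTML_MIME_TYPE
            ]
        return self._hrefs

    def text_at(self, index):
        """Return the plain text of item `index`, or None past the end."""
        hrefs = self._xhtml_hrefs()
        if 0 <= index < len(hrefs):
            return self._load_text(hrefs[index])
        return None
```

A caller would request items 0, 1, 2 and so on, stopping at the first `None`.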

I'm not sure what it will return for encrypted books.

Conversion to plaintext is done using a streaming XML parser. I started
using the DOM parser, but that doesn't handle named entities like &foo;.
DOM parsers also use a lot of memory.

So I used Apple's NSXMLParser, which is a SAX streaming parser that uses
a lot less memory. It calls a delegate (set here to self) when element
tags are read and when characters are read. The text is accumulated into
JTPepub's capturing ivar.

The wrinkle is that it doesn't handle named entities either. But it is
easy to support them - see the
parser:resolveExternalEntityName:systemID: method. This lazily loads in
a dictionary of entity names to Unicode strings which I took from W3C.

http://www.w3.org/TR/xml-entity-names/xhtml1-lat1.html
http://www.w3.org/2003/entities/2007/xhtml1-special.ent
http://www.w3.org/2003/entities/2007/xhtml1-symbol.ent
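
For illustration, the same event-driven approach can be sketched in Python. This is not the code in this pull request: it swaps in the standard library's `html.parser.HTMLParser` for NSXMLParser and `html.entities.name2codepoint` for the W3C-derived dictionary. The `TextExtractor` and `xhtml_to_text` names are hypothetical, and limiting capture to the `<body>` element is an assumption.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulates character data as the parser streams through the markup."""

    def __init__(self):
        # Disable automatic conversion so the entity callbacks below are used.
        super().__init__(convert_charrefs=False)
        self.capturing = False  # assumption: only text inside <body> is kept
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)

    def handle_entityref(self, name):
        # Named entity such as &eacute;: look it up in the entity table,
        # and leave unknown names in the text unchanged.
        if self.capturing:
            codepoint = name2codepoint.get(name)
            if codepoint is not None:
                self.chunks.append(chr(codepoint))
            else:
                self.chunks.append("&" + name + ";")

    def handle_charref(self, name):
        # Numeric reference such as &#233; (decimal) or &#xE9; (hex).
        if self.capturing:
            if name[0] in "xX":
                self.chunks.append(chr(int(name[1:], 16)))
            else:
                self.chunks.append(chr(int(name, 10)))


def xhtml_to_text(xhtml):
    parser = TextExtractor()
    parser.feed(xhtml)
    parser.close()
    return "".join(parser.chunks)
```

As with the delegate approach described above, nothing is held in memory beyond the accumulated text, and entity names are resolved by a simple table lookup.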

Unit tests pass. I added another one which reads all the chapters from
the Untitled epub, which I modified to have 3 chapters with some
character entities.

Index book content for Spotlight
If there's no DRM, iterate through the manifest XHTML items and
accumulate the text contents. Add the text contents to
kMDItemTextContent.

For debugging, I've added the number of characters indexed to the
kMDItemComment. You can see this field in mdls and it gives a clue that
the book's been indexed.
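
The indexing step can be modelled roughly like this. It is a Python sketch only: the real importer fills in a Spotlight attribute dictionary from Objective-C, the function name `build_spotlight_attributes` and its parameters are hypothetical, and joining chapters with a newline is an assumption.

```python
def build_spotlight_attributes(chapter_texts, has_drm, debug=False):
    """Return the attributes an importer might set for one book."""
    attributes = {}
    if has_drm:
        # Encrypted content cannot be read, so no text is indexed.
        return attributes

    # Accumulate the plain text of every XHTML item in the manifest.
    text = "\n".join(chapter_texts)
    attributes["kMDItemTextContent"] = text

    if debug:
        # Debugging aid: visible in mdls output once the book is indexed.
        attributes["kMDItemComment"] = "Indexed %d characters" % len(text)
    return attributes
```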

After I use mdimport to rescan my ebooks (mds works in the background?),
I can mdfind text in them and use the Spotlight menu to find things.
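
A small Python helper for repeating that check might look like the sketch below. The function names are hypothetical; `mdimport`, `mdls -name` and `mdfind` are the standard macOS command-line tools, and the helper returns `None` on systems where they are not installed.

```python
import shutil
import subprocess


def mdimport_command(path):
    # Ask Spotlight to (re)import a single file.
    return ["mdimport", path]


def mdls_command(path, attribute):
    # Show one metadata attribute, e.g. kMDItemComment, for a file.
    return ["mdls", "-name", attribute, path]


def mdfind_command(query):
    # Search the Spotlight index for the given text.
    return ["mdfind", query]


def run_if_available(argv):
    """Run the command and return its output, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, `run_if_available(mdls_command("book.epub", "kMDItemComment"))` would show the debugging comment for a hypothetical `book.epub` once it has been imported.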

Unit tests still pass.

Remove readme file from Xcode template
The readme file was geared towards Core Data clients. This is not such a
client, and now that all the necessary changes have been made for the
plugin, this file is no longer useful.

More installation instructions
Installing the Spotlight importer is a bit tricky, or at least persuading it to start indexing things is.

The IDPF also capitalises the file format, so ePub should be EPUB.

Add a custom schema file
This has two benefits.

The first is that we can create custom attributes for things we can
extract (e.g. illustrators) that Spotlight has no concept of, i.e. that
have no built-in kMDItemFoo attribute.

So this creates attributes for illustrators and translators. Note the
name is reverse DNS style with "." replaced with "_". Note also that "-"
needs to be replaced or the schema file will be unparseable at runtime -
I used "_" for that too.
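
The naming rule amounts to a simple character substitution. A Python sketch, where the function name and the example input are hypothetical:

```python
def spotlight_attribute_name(reverse_dns_name):
    """Map a reverse-DNS style name to a schema-safe attribute name."""
    # "." separates the reverse-DNS components; "-" would make the schema
    # file unparseable at runtime, so both are replaced with "_".
    return reverse_dns_name.replace(".", "_").replace("-", "_")
```

For instance, a hypothetical name `org.idpf.epub-container.illustrators` would become `org_idpf_epub_container_illustrators`.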

The second benefit is that this allows us to customize the "More info"
panel in the Finder "Get Info" window. This is quite cool.

It would be nice if the More info panel could display our custom
attributes, but it doesn't seem to. There's no obvious way for it to map
the attribute name org_idpf_foo to a friendly label.

I commented out the "Indexed nnnn characters" debugging comment.

Add friendly names for our schema attributes
These should get displayed by Finder.

Add descriptions for custom schema attributes
These are shown in Finder when you start a smart folder (or do a
"Find"). Click the popup labelled "Kind" and choose "Other..." The table
in the sheet that appears uses these descriptions.

Put commands in separate code sections
I also made all the plugin names bold so they stand out a bit better.

Remove useless JTPepub init method
It wasn't called, and isn't necessary.

Unit tests still pass.

jaketmp added a commit that referenced this pull request Mar 6, 2012

Merge pull request #26 from chrisridd/master
Add book text to Spotlight

@jaketmp jaketmp merged commit 88da9b3 into jaketmp:master Mar 6, 2012

Owner

jaketmp commented Mar 6, 2012

Cool - do you think the spotlight importer is ready for prime-time?

Contributor

chrisridd commented Mar 6, 2012

I think so. The major hassle with it seems to be more down to Spotlight being awkward and not re-indexing things. More a Spotlight issue than one with the importer IMO.

Have you tried it at all? It is quite cool searching for text inside books from the Spotlight menu and seeing the QuickLook preview coming up on matching books :-)
