
Merge branch 'master' of git@github.com:internetarchive/epub

Mike McCabe committed Feb 11, 2010
2 parents da1faee + e859d3c commit db1e447c88cc1c48b0a3c8dd75fdc2269c327097
Showing with 978 additions and 319 deletions.
  1. +46 −29 TODO
  2. +230 −0 abbyy.py
  3. +76 −11 common.py
  4. +3 −3 condense_abbyy.py
  5. +42 −5 convert_iabook.py
  6. +30 −16 daisy.py
  7. +88 −10 epub.py
  8. +8 −3 epub_files/stylesheet.css
  9. +22 −5 font.py
  10. +38 −50 iabook_to_daisy.py
  11. +105 −121 iabook_to_epub.py
  12. +118 −18 iarchive.py
  13. +172 −48 visualize_abbyy.py → visualize_ocr.py
75 TODO
@@ -1,27 +1,65 @@
Mini to do:
-Get it working on the web
-Get daisy download happy
+
+
+finish checking in bits
+
+add bits from tweewee
+
+check stuff
+
+push to archive
+
+check with www-mccabe
+
+push?
+
+clear thru this file.
+
+Bug for: bad images on stanza
+
+
+Art of Caricaturing - needs cover?
+
+- check on getting rid of:
+try:
+    from lxml import etree
+except ImportError:
+    sys.path.append('/petabox/sw/lib/lxml/lib/python2.5/site-packages')
+    from lxml import etree
+from lxml import objectify
To do:
+general:
+- handle books with disjoint/repeating page number ranges,
+ e.g. libraries of the future, missing page 1 in romance.
+
+example of legit discontinuous numbering:
+http://www.archive.org/stream/snowimagechildis00hawtiala#page/n17/mode/2up
+
+- move all abbyy parsing into iabook; make json interface to same. (see booki)
+- make condense, visualize be output opts
+- infer [front, main] page types - maybe synthesize more types for repeated pages
+- handle catstorieshelen - seems to be compounded of multiple books
+
+- Handle .tar image stacks
+
+
epub:
-- images (cf. best practices?)
-- switch to incremental content generation
-- make sure temp files get deleted!
- Better chapter structure
+- link to streaming / details page / etc.
+- 'width', 'height' tags for images. (pamfile + tee? temp files?)
+
daisy:
- extract proper metadata (or pass it elseways)
- Copyright, legalese - add!
-- missing 1rst chapter headers
Maybe someday:
-
- Small-caps font emulation
- Auto portrait/landscape scaling/rotation of images
-- fix 'sample title' in process_abbyy
- Add code to iarchive.py to make sense of page numbers, e.g. unnumbered start pages, pages that start at '2', etc. Romance is an initial example. Also pageNumBox
- make sure required epub, daisy metadata is actually present
@@ -32,17 +70,12 @@ Skip line if too many chars are suspicious?
-rename things.
-
-convert_iabook
-
--daisy (opts - mogrify)
--epub (default) (opts -> -d)
--condense
--visualize
-http://ia301518.us.archive.org/2/items/whowillfeedthemi00amararch/
do we have a test/real eventual "/nls" (only) book/item now/yet?  i could dark the non-xml files (item mgr) and i could setup some sudoer rules for mike's script to be able to "sudo up" from "www-data" to root to access the files.  (or we could delay that for later?)
[10/13/09 5:21:04 PM] Tracey Jaquith: mike -- for books like that, a few simple lines in the script like this should get you close to the privs thing:
@@ -52,26 +85,10 @@ do we have a test/real eventual "/nls" (only) book/item now/yet?  i could dark
-
get that to you in a sec...
[10/13/09 8:39:35 PM] Hank Bromley: $_COOKIE['logged-in-user']
will be the user id (email address)
Auth::status() returns the current user's priv list (as an array), in a form suitable for being passed to the Auth::hasPriv() function tracey mentioned earlier. (you provide it no args; it looks up the current user via $_COOKIE['logged-in-user'])
[10/13/09 8:40:33 PM] Hank Bromley: you can see how all this is defined, if you're interested, in petabox/www/common/Auth.inc
-
-
-
-
-
-
-[10/13/09 8:27:45 PM] Hank Bromley: which would you like for that third param name?
-[10/13/09 8:29:21 PM] Mike McCabe: (identifier, dir, doc?)
-[10/13/09 8:30:28 PM] Hank Bromley: sounds good. where "doc" may, on occasion, start with subdir(s).
-[10/13/09 8:31:05 PM] Mike McCabe: been considering also tool mods to support password-wrapping daisy.zip files for protected books
-
-
-
-doing: making epubs work again!
-
230 abbyy.py
@@ -0,0 +1,230 @@
+#!/usr/bin/python
+
+import sys
+# import getopt
+import re
+# import gzip
+# import os
+# import zipfile
+
+import common
+
+from lxml import etree
+# try:
+#     from lxml import etree
+# except ImportError:
+#     sys.path.append('/petabox/sw/lib/lxml/lib/python2.5/site-packages')
+#     from lxml import etree
+# from lxml import objectify
+
+from debug import debug, debugging, assert_d
+
+
+aby_ns="{http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml}"
+
+
+
+# missing here:
+# multiple potential indicators per page
+# needed gaps/discrepancies in pagenum mapping - raj mentioned
+# re-implement inference from assertions?
+# -> at least review get_page_scandata
+
+# Big missing: analyze headers_footers
+
+# stepping stone there: current debug output of epub page trimmer.
+
+def par_is_pageno_header_footer(par):
+    # if:
+    #   it's the first on the page
+    #   there's only one line
+    #   on that line, there's a formatting tag, s.t.
+    #   - it has < 6 charParam kids
+    #   - each is wordNumeric
+    # then:
+    #   Skip it!
+    if len(par) != 1:
+        return False
+    line = par[0]
+    line_text = etree.tostring(line,
+                               method='text',
+                               encoding=unicode)
+    line_text = line_text.lower()
+
+    print line_text
+
+    mo = re.match('(preface)* *([xiv]*) *(preface)*',
+                  line_text)
+    if mo and mo.group(2):
+#         debug()
+        return common.rnum_to_int(mo.group(2))
+
+    line_text = line_text.replace('i', '1').replace('o', '0')
+    mo = re.match(r'[\[li] *[afhiklmnouvx^]*([0-9])[afhiklmnouvx^]* *[\]ijl1]',
+                  line_text)
+    if mo:
+        return mo.group(1)
+    mo = re.match(r'[\[li] *([xiv]*) *[\]ijl1]',
+                  line_text)
+    if mo and mo.group(1):
+#         debug()
+        return common.rnum_to_int(mo.group(1))
+
+    for fmt in line:
+        if len(fmt) > 6:
+            continue
+        saw_non_num = False
+        for cp in fmt:
+            if cp.get('wordNumeric') != 'true':
+                saw_non_num = True
+                break
+        fmt_text = etree.tostring(fmt,
+                                  method='text',
+                                  encoding=unicode)
+        if not saw_non_num:
+            return fmt_text
+        fmt_text = fmt_text.lower()
+        # roman numerals i..xxx (currently unused below)
+        rnums = ['i', 'ii', 'iii', 'iv',
+                 'v', 'vi', 'vii', 'viii',
+                 'ix', 'x', 'xi', 'xii',
+                 'xiii', 'xiv', 'xv', 'xvi',
+                 'xvii', 'xviii', 'xix', 'xx',
+                 'xxi', 'xxii', 'xxiii', 'xxiv',
+                 'xxv', 'xxvi', 'xxvii',
+                 'xxviii', 'xxix', 'xxx',
+                 ]
+        r = common.rnum_to_int(fmt_text)
+        if r:
+            return fmt_text
+        fmt_text = fmt_text.replace('i', '1').replace('o', '0')
+        if re.match('[0-9]+', fmt_text):
+            return int(fmt_text)
+        if re.match(r'[0-9afhiklmnouvx^]*[0-9][0-9afhiklmnouvx^]*',
+                    fmt_text):
+            return fmt_text
+        # common OCR errors
+#         if re.match(r'[0-9afhiklmnouvx^]+',
+#                     fmt_text):
+#             return fmt_text
+    return ''
+
+def get_hf_pagenos(page):
+    first_par = True
+    result = []
+    try:
+        for block in page:
+            for el in block:
+                if el.tag == aby_ns+'text':
+                    for par in el:
+
+                        # skip if it's the first line and it could be a header
+                        if first_par:  # replace with xpath??
+                            hdr = par_is_pageno_header_footer(par)
+                            if hdr:
+                                result.append(int(hdr))
+                                return result
+                            first_par = False
+
+                        if (block == page[-1]
+                            and el == block[-1]
+                            and par == el[-1]):
+                            ftr = par_is_pageno_header_footer(par)
+                            if ftr:
+                                result.append(int(ftr))
+                                return result
+    except ValueError:
+        pass
+
+    return result
+
+
+def analyze(aby_file, iabook):
+    context = etree.iterparse(aby_file,
+                              tag=aby_ns+'page',
+                              resolve_entities=False)
+    i = 0
+
+    pages = []
+    for event, page in context:
+        page_struct = {}
+        page_struct['number'] = i
+        page_struct['picture'] = []
+        page_struct['texts'] = []
+        for block in page:
+            if block.get('blockType') == 'Picture':
+                page_struct['picture'].append(((int(block.get('l')),
+                                                int(block.get('t'))),
+                                               (int(block.get('r')),
+                                                int(block.get('b')))))
+            if block.get('blockType') == 'Text':
+                bstr = etree.tostring(block,
+                                      method='text',
+                                      encoding=unicode)
+
+                page_struct['texts'].append(bstr)
+
+        pages.append(page_struct)
+
+        page.clear()
+        i += 1
+
+    print pages
+    return 'hi'
+
+
+
+
+
+def analyze_pages(aby_file, iabook):
+    context = etree.iterparse(aby_file,
+                              tag=aby_ns+'page',
+                              resolve_entities=False)
+
+    page_offsets = {}
+    i = 0
+    for event, page in context:
+        page_scandata = iabook.get_page_scandata(i)
+        pageno = None
+        max_score = 0
+        if page_scandata is not None:
+#             if i > 40:
+#                 debug()
+            found_pagenos = get_hf_pagenos(page)
+#             pageno = page_scandata.find(iabook.get_scandata_ns()
+#                                         + 'pageNumber')
+            for pageno in found_pagenos:
+#                 r = common.rnum_to_int(pageno.text)
+#                 if r > 0:
+#                     pageno_val = r
+#                 else:
+#                     pageno_val = int(pageno.text)
+                pageno_val = int(pageno)
+
+                # offset of 0th page in group, as indicated by this pageno
+                cur_off = i - pageno_val
+                if cur_off > 0:
+                    if cur_off not in page_offsets:
+                        page_offsets[cur_off] = 0
+                    page_offsets[cur_off] += 1
+                    if page_offsets[cur_off] > max_score:
+                        max_score = page_offsets[cur_off]
+#                     if max_score > 100:
+#                         return str(cur_off) + ' with max'
+        page.clear()
+        i += 1
+
+    offset_with_highest = None
+    highest_seen = 0
+    for offset in page_offsets:
+        print 'offset ' + str(offset) + ': ' + str(page_offsets[offset]) + ' votes'
+        if page_offsets[offset] > highest_seen:
+            highest_seen = page_offsets[offset]
+            offset_with_highest = offset
+    return str(offset_with_highest)
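Several branches above delegate to common.rnum_to_int to turn roman-numeral page labels into integers. common.py is touched in this commit (+76 −11) but its diff is not shown here, so the helper's exact shape isn't visible; the call sites (testing "if r:" after "r = common.rnum_to_int(fmt_text)", and the int() wrapping in get_hf_pagenos) imply it returns the numeral's value as an int and something falsy for non-numerals. A minimal sketch under those assumptions, not the committed implementation:

def rnum_to_int(rnum):
    """Integer value of a roman-numeral string, or 0 if it isn't one."""
    vals = {'i': 1, 'v': 5, 'x': 10, 'l': 50, 'c': 100, 'd': 500, 'm': 1000}
    rnum = rnum.strip().lower()
    if not rnum:
        return 0
    for c in rnum:
        if c not in vals:
            return 0
    total = 0
    for i, c in enumerate(rnum):
        # a smaller numeral before a larger one subtracts, e.g. 'iv' == 4
        if i + 1 < len(rnum) and vals[c] < vals[rnum[i + 1]]:
            total -= vals[c]
        else:
            total += vals[c]
    return total

With that behavior, rnum_to_int('xiv') gives 14 for preface-style page labels, while OCR noise like 'xoxo' gives 0, which par_is_pageno_header_footer treats as no match.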
