chardet to rchardet; README fixes
Dmitri Goutnik committed Jun 28, 2009
1 parent 3cf25e9 commit f2261c3
Showing 11 changed files with 72 additions and 60 deletions.
51 changes: 31 additions & 20 deletions README.txt
@@ -1,27 +1,31 @@
== DESCRIPTION:
= Repub

Repub is a simple HTML to ePub converter.
Simple HTML to ePub converter.

== FEATURES/PROBLEMS:

Few samples to get started: (TODO real description)
Few samples to get started:

* Git User's Manual

repub -x 'title://h1' -x 'toc://div[@class="toc"]/dl' -x 'toc_item:dt' -x 'toc_section:following-sibling::*[1]/dl' \
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html

* Project Gutenberg's THE ADVENTURES OF SHERLOCK HOLMES
repub -x 'title://div.book//h1' -x 'toc:body//table' -x 'toc_item://tr' \
-X 'body/pre,body//hr,body/h1,body/h2' \
http://www.gutenberg.org/dirs/etext99/advsh12h.htm

repub -x 'title:div[@class='book']//h1' -x 'toc://table' -x 'toc_item://tr' \
-X '//pre' -X '//hr' -X '//body/h1' -X '//body/h2' \
http://www.gutenberg.org/dirs/etext99/advsh12h.htm

* Project Gutenberg's ALICE'S ADVENTURES IN WONDERLAND
repub -x 'title:body/h1' -x 'toc:body//table' -x 'toc_item://tr' \
-X 'body/pre,body//hr,body/h4' \
http://www.gutenberg.org/files/11/11-h/11-h.htm

repub -x 'title:body/h1' -x 'toc://table' -x 'toc_item://tr' \
-X '//pre' -X '//hr' -X '//body/h4' \
http://www.gutenberg.org/files/11/11-h/11-h.htm

* The Gelug-Kagyu Tradition of Mahamudra from Berzin Archives
repub http://www.berzinarchives.com/web/x/prn/p.html_680632258.html

* Git User's Manual
repub -x 'title://h1' -x 'toc://div.toc/dl' -x 'toc_item:/dt' \
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html
repub http://www.berzinarchives.com/web/x/prn/p.html_680632258.html

== SYNOPSIS:

@@ -43,7 +47,7 @@ General options:
-h, --help Show this help message.

Parser options:
-x, --selector NAME:VALUE Set parser XPath or CSS selector NAME to VALUE.
-x, --selector NAME:VALUE Set parser XPath selector NAME to VALUE.
Recognized selectors are: [title toc toc_item toc_section]
-m, --meta NAME:VALUE Set publication information metadata NAME to VALUE.
Valid metadata names are: [creator date description
@@ -55,26 +59,31 @@ Parser options:
Post-processing options:
-s, --stylesheet PATH Use custom stylesheet at PATH to add or override existing
CSS references in the source document.
-X, --remove SELECTOR Remove source element using XPath or CSS selector.
-X, --remove SELECTOR Remove source element using XPath selector.
Use -X- to ignore stored profile.
-R, --rx /PATTERN/REPLACEMENT/ Edit source HTML using regular expressions.
Use -R- to ignore stored profile.
-B, --browse After processing, open resulting HTML in default browser.

== REQUIREMENTS:
== DEPENDENCIES:

wget or httrack
zip (Info-ZIP)
* Builder (https://rubyforge.org/projects/builder/)
* Nokogiri (http://nokogiri.rubyforge.org/nokogiri/)
* rchardet (https://rubyforge.org/projects/rchardet/)
* launchy (http://copiousfreetime.rubyforge.org/launchy/)

* wget or httrack
* zip (Info-ZIP)

== INSTALL:

gem install repub

== LICENSE:

The MIT License
(The MIT License)

Copyright (c) 2009 Invisible Llama
Copyright (c) 2009 Invisible Llama <dg@invisiblellama.net>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -93,3 +102,5 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

==
4 changes: 2 additions & 2 deletions Rakefile
@@ -24,7 +24,7 @@ PROJ.exclude = %w[tmp/ \.git/ \.DS_Store .*\.tmproj ^pkg/]

PROJ.spec.opts << '--color'

depend_on 'builder'
depend_on 'nokogiri'
depend_on 'chardet'
depend_on 'builder'
depend_on 'rchardet'
depend_on 'launchy'
2 changes: 1 addition & 1 deletion TODO.txt → TODO
@@ -1,3 +1,3 @@
√ add support for rx cleaning/modifying source doc
√ make -q/-v actually do something
a elements name to id attribute translation
more parser tokens: author(s) etc
2 changes: 1 addition & 1 deletion bin/repub
@@ -18,7 +18,7 @@ require 'repub/app'
# repub -x 'title://h2' -x 'toc://table' -x 'toc_item://a' -X 'div' -X 'table' -X '//hr' http://lib.ru/STERLINGB/shizmatrica.txt_with-big-pictures.html
#
# Айзек Азимов. Космические течения
# repub -B -v -x 'title://h2' -x 'toc://table' -x 'toc_item://a' -X 'div' -X 'table' -X '//hr' http://lib.ru/FOUNDATION/currspac.txt_with-big-pictures.html
# repub -x 'title://h2' -x 'toc://table' -x 'toc_item://a' -X 'div' -X 'table' -X '//hr' http://lib.ru/FOUNDATION/currspac.txt_with-big-pictures.html
#
# Git User's Manual
# repub -x 'title://h1' -x 'toc://div[@class="toc"]/dl' -x 'toc_item:dt' -x 'toc_section:following-sibling::*[1]/dl' http://www.kernel.org/pub/software/scm/git/docs/user-manual.html
2 changes: 1 addition & 1 deletion lib/repub.rb
@@ -1,7 +1,7 @@
module Repub

# :stopdoc:
VERSION = '0.2.1'
VERSION = '0.3.0'
LIBPATH = File.expand_path(File.dirname(__FILE__)) + File::SEPARATOR
PATH = File.dirname(LIBPATH) + File::SEPARATOR
# :startdoc:
12 changes: 4 additions & 8 deletions lib/repub/app/builder.rb
@@ -97,7 +97,7 @@ def postprocess_file(asset)
log.debug "-- Adding missing doctype"
source = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" + source
end
# Overwrite asset with fixed version
# Save processed file
File.open(asset, 'w') do |f|
f.write(source)
end
@@ -121,15 +121,11 @@ def postprocess_doc(asset)
doc.search(selector).remove
end
end
# XXX
# doc.search('//a[@name and not(@id)]') do |a|
# a[:id] = a[:name]
# end
# Save processed version
# Save processed doc
File.open(asset, 'w') do |f|
if @options[:fixup]
# HACK: Nokogiri seems to ignore the fact that xmlns and other attrs aleady present and adds them anyway
# So we just remove them here to avoid duplicates
# HACK: Nokogiri seems to ignore the fact that xmlns and other attrs aleady present
# in html node and adds them anyway. Just remove them here to avoid duplicates.
doc.root.attributes.each {|name, value| doc.root.remove_attribute(name) }
doc.write_xhtml_to(f, :encoding => 'UTF-8')
else
13 changes: 6 additions & 7 deletions lib/repub/app/fetcher.rb
@@ -3,11 +3,7 @@
require 'uri'
require 'iconv'
require 'rubygems'

# XXX: suppress warnings from chardet (until they fix them)
$VERBOSE=false
require 'UniversalDetector'
$VERBOSE=true
require 'rchardet'

module Repub
class App
@@ -128,9 +124,12 @@ def for_url(&block)
# detect encoding and convert to utf-8 if needed
@assets[:documents].each do |doc|
log.debug "-- Detecting encoding for #{doc}"
s = IO.read(doc)
s = File.open(doc) do |f|
# Detect encoding using first 100 lines...
100.times.inject('') { |s, n| s += f.gets }
end
raise FetcherException, "empty document" unless s
encoding = UniversalDetector::chardet(s)['encoding']
encoding = CharDet.detect(s)['encoding']
if encoding.downcase != 'utf-8'
log.debug "-- Looks like it's #{encoding}, will convert to UTF-8"
s = Iconv.conv('utf-8', encoding, s)
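
Review note on the sampling change above: `f.gets` returns nil at end of file, so for a document shorter than 100 lines `s += f.gets` would raise a TypeError instead of producing a shorter sample. The following is a minimal sketch of one way to read a bounded sample safely, not the code from this commit. The helper name `read_sample` and its `limit` parameter are hypothetical.

```ruby
require 'tempfile'

# Hypothetical helper, not part of repub: return up to `limit` lines
# of the file at `path` as one string, or nil when the file is empty.
def read_sample(path, limit = 100)
  sample = File.open(path) do |f|
    # each_line.first(limit) stops at EOF, so short files are fine
    f.each_line.first(limit).join
  end
  sample.empty? ? nil : sample
end

# Usage: a three-line file sampled with a limit of 2 and of 100
Tempfile.create('sample') do |tmp|
  tmp.write("line 1\nline 2\nline 3\n")
  tmp.flush
  puts read_sample(tmp.path, 2).inspect
  puts read_sample(tmp.path).inspect
end
```

Returning nil for an empty file is a design choice made here so that a guard of the form `raise FetcherException, "empty document" unless s` would still fire. Whether that matches the author's intent is an assumption.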
4 changes: 2 additions & 2 deletions lib/repub/app/options.rb
@@ -92,7 +92,7 @@ def parse_options(args)
opts.separator " Parser options:"

opts.on("-x", "--selector NAME:VALUE", String,
"Set parser XPath or CSS selector NAME to VALUE.",
"Set parser XPath selector NAME to VALUE.",
"Recognized selectors are: [title toc toc_item toc_section]"
) do |value|
begin
@@ -134,7 +134,7 @@ def parse_options(args)
) { |value| options[:css] = File.expand_path(value) }

opts.on("-X", "--remove SELECTOR", String,
"Remove source element using XPath or CSS selector.",
"Remove source element using XPath selector.",
"Use -X- to ignore stored profile."
) { |value| value == '-' ? options[:remove] = [] : options[:remove] << value }
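
Review note on the two options touched above: the sketch below illustrates how `-x NAME:VALUE` and `-X SELECTOR` might be handled with the standard `optparse` library, including the `-X-` form that clears removals loaded from a stored profile. The method `parse_sketch`, the `stored_remove` parameter and the `:selectors` key are hypothetical stand-ins. Splitting on the first colon only is an assumption about how repub parses NAME:VALUE, chosen because XPath values such as `following-sibling::*[1]/dl` contain colons themselves.

```ruby
require 'optparse'

# Hypothetical stand-in for repub's option parsing; only -x and -X are modeled.
# `stored_remove` plays the role of selectors loaded from a stored profile.
def parse_sketch(args, stored_remove = [])
  options = { :selectors => {}, :remove => stored_remove.dup }
  parser = OptionParser.new do |opts|
    opts.on('-x', '--selector NAME:VALUE', String) do |value|
      # Split on the first colon only; the XPath part may contain more colons
      name, xpath = value.split(':', 2)
      raise OptionParser::InvalidArgument, value if xpath.nil? || xpath.empty?
      options[:selectors][name.to_sym] = xpath
    end
    opts.on('-X', '--remove SELECTOR', String) do |value|
      if value == '-'
        options[:remove] = []        # -X- discards stored removals
      else
        options[:remove] << value    # otherwise append to the list
      end
    end
  end
  parser.parse(args)
  options
end

p parse_sketch(['-x', 'title://h1', '-X', '//hr'], ['//pre'])
p parse_sketch(['-X-', '-X', '//hr'], ['//pre'])
```

In the second call the stored `//pre` removal is dropped before `//hr` is added, which is the behavior the help text describes for `-X-`.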

18 changes: 12 additions & 6 deletions lib/repub/app/parser.rb
@@ -70,7 +70,7 @@ def parse_title
log.info "Found title \"#{@title}\""
else
@title = UNTITLED
log.warn "** Could not parse document title, using '#{@title}'"
log.warn "** Could not find document title, using '#{@title}'"
end
end

@@ -80,6 +80,8 @@ def parse_title_html
@title_html = el ? el.inner_html.gsub(/[\r\n]/, '') : UNTITLED
end

# Helper container for TOC items
#
class TocItem < Struct.new(
:title,
:uri,
@@ -108,27 +110,31 @@ def parse_toc
log.info "Found TOC with #{@toc.size} top-level items"
else
@toc = []
log.warn "** Could not parse document table of contents"
log.warn "** Could not find document table of contents"
end
end

def parse_toc_section(section)
toc = []
log.debug "-- Looking for TOC items with #{@selectors[:toc_item]}"
section.xpath(@selectors[:toc_item]).each do |item|
# Get item's anchor and href
a = item.name == 'a' ? item : item.at('a')
next if !a
href = a[:href]
next if !href
if item.children.empty?
title = item.inner_text
# Is this a leaf item or node ?
subsection = item.xpath(@selectors[:toc_section]).first
if subsection
# Item has subsection, use anchor text for title
title = a.inner_text
else
# Leaf item, glue inner_text from all children
title = item.children.map{|c| c.inner_text }.join(' ')
end
title = title.gsub(/[\r\n]/, '').gsub(/\s+/, ' ').strip
log.debug "-- Found item: #{title}"
subsection = item.xpath(@selectors[:toc_section]).first
#p subsection
# Parse sub-section
if subsection
log.debug "-- Found section with #{@selectors[:toc_section]}"
log.debug "-- >"
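
Review note on the title cleanup in `parse_toc_section`: the chain `gsub(/[\r\n]/, '').gsub(/\s+/, ' ').strip` deletes line breaks before collapsing whitespace, so two words separated only by a newline end up glued together. The sketch below reproduces that chain in a hypothetical helper, `normalize_title`, next to a possible alternative, `normalize_title_spaced`, which treats line breaks as ordinary whitespace. Both names are illustrative and neither exists in repub.

```ruby
# Hypothetical helper mirroring the normalization chain in the diff above.
def normalize_title(title)
  title.gsub(/[\r\n]/, '').gsub(/\s+/, ' ').strip
end

# Possible alternative (an assumption, not the author's code):
# collapse every whitespace run, line breaks included, to one space.
def normalize_title_spaced(title)
  title.gsub(/\s+/, ' ').strip
end

puts normalize_title("  Chapter 1\r\n   The   Beginning ")  # Chapter 1 The Beginning
puts normalize_title("Part\nOne")                            # PartOne
puts normalize_title_spaced("Part\nOne")                     # Part One
```

For the first input both variants agree, because the line break is surrounded by spaces. They differ only when a newline is the sole separator between words.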
Empty file removed lib/repub/mobi/.githidden
Empty file.
24 changes: 12 additions & 12 deletions repub.gemspec
@@ -2,46 +2,46 @@

Gem::Specification.new do |s|
s.name = %q{repub}
s.version = "0.2.1"
s.version = "0.3.0"

s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
s.authors = ["Dmitri Goutnik"]
s.date = %q{2009-06-26}
s.date = %q{2009-06-28}
s.default_executable = %q{repub}
s.description = %q{RePub is a simple HTML to ePub converter.}
s.description = %q{}
s.email = %q{dg@invisiblellama.net}
s.executables = ["repub"]
s.extra_rdoc_files = ["History.txt", "README.txt", "TODO.txt", "bin/repub", "lib/repub/mobi/.githidden"]
s.files = [".gitignore", "History.txt", "README.txt", "Rakefile", "TODO.txt", "bin/repub", "lib/repub.rb", "lib/repub/app.rb", "lib/repub/app/builder.rb", "lib/repub/app/fetcher.rb", "lib/repub/app/logger.rb", "lib/repub/app/options.rb", "lib/repub/app/parser.rb", "lib/repub/app/profile.rb", "lib/repub/app/utility.rb", "lib/repub/epub.rb", "lib/repub/epub/container.rb", "lib/repub/epub/content.rb", "lib/repub/epub/toc.rb", "lib/repub/mobi/.githidden", "test/epub/test_container.rb", "test/epub/test_content.rb", "test/epub/test_toc.rb", "test/test_builder.rb", "test/test_fetcher.rb", "test/test_logger.rb", "test/test_parser.rb"]
s.extra_rdoc_files = ["History.txt", "README.txt", "bin/repub"]
s.files = [".gitignore", "History.txt", "README.txt", "Rakefile", "TODO", "bin/repub", "lib/repub.rb", "lib/repub/app.rb", "lib/repub/app/builder.rb", "lib/repub/app/fetcher.rb", "lib/repub/app/logger.rb", "lib/repub/app/options.rb", "lib/repub/app/parser.rb", "lib/repub/app/profile.rb", "lib/repub/app/utility.rb", "lib/repub/epub.rb", "lib/repub/epub/container.rb", "lib/repub/epub/content.rb", "lib/repub/epub/toc.rb", "repub.gemspec", "test/epub/test_container.rb", "test/epub/test_content.rb", "test/epub/test_toc.rb", "test/test_builder.rb", "test/test_fetcher.rb", "test/test_logger.rb", "test/test_parser.rb"]
s.homepage = %q{http://github.com/invisiblellama/repub/tree/master}
s.rdoc_options = ["--main", "README.txt"]
s.require_paths = ["lib"]
s.rubyforge_project = %q{repub}
s.rubygems_version = %q{1.3.4}
s.summary = %q{RePub is a simple HTML to ePub converter}
s.summary = nil
s.test_files = ["test/epub/test_container.rb", "test/epub/test_content.rb", "test/epub/test_toc.rb", "test/test_builder.rb", "test/test_fetcher.rb", "test/test_logger.rb", "test/test_parser.rb"]

if s.respond_to? :specification_version then
current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
s.specification_version = 3

if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
s.add_runtime_dependency(%q<nokogiri>, [">= 1.3.2"])
s.add_runtime_dependency(%q<builder>, [">= 2.1.2"])
s.add_runtime_dependency(%q<hpricot>, [">= 0.8.1"])
s.add_runtime_dependency(%q<chardet>, [">= 0.9.0"])
s.add_runtime_dependency(%q<rchardet>, [">= 1.2"])
s.add_runtime_dependency(%q<launchy>, [">= 0.3.3"])
s.add_development_dependency(%q<bones>, [">= 2.5.1"])
else
s.add_dependency(%q<nokogiri>, [">= 1.3.2"])
s.add_dependency(%q<builder>, [">= 2.1.2"])
s.add_dependency(%q<hpricot>, [">= 0.8.1"])
s.add_dependency(%q<chardet>, [">= 0.9.0"])
s.add_dependency(%q<rchardet>, [">= 1.2"])
s.add_dependency(%q<launchy>, [">= 0.3.3"])
s.add_dependency(%q<bones>, [">= 2.5.1"])
end
else
s.add_dependency(%q<nokogiri>, [">= 1.3.2"])
s.add_dependency(%q<builder>, [">= 2.1.2"])
s.add_dependency(%q<hpricot>, [">= 0.8.1"])
s.add_dependency(%q<chardet>, [">= 0.9.0"])
s.add_dependency(%q<rchardet>, [">= 1.2"])
s.add_dependency(%q<launchy>, [">= 0.3.3"])
s.add_dependency(%q<bones>, [">= 2.5.1"])
end