Add an option for ignoring technically valid but probably wrong ISBNs
The original idea for this came from https://github.com/na--/ebook-tools/pull/8, thanks @niavasha
na-- committed Jul 1, 2018
1 parent d7f8021 commit 323b6a0
Showing 3 changed files with 23 additions and 0 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -108,6 +108,11 @@ All of these options are part of the common library and may affect some or all o
* `-i=<value>`, `--isbn-regex=<value>`; env. variable `ISBN_REGEX`; see default value in `lib.sh`
This is the regular expression used to match ISBN-like numbers in the supplied books. It is matched with `grep -P`, so look-ahead and look-behind can be used. It is also purposefully a bit loose (i.e. it can match some non-ISBN numbers), since the found numbers will be checked for validity. Due to unicode handling, the default value is too long for the readme; you can find it in `lib.sh`.
* `--isbn-blacklist-regex=<value>`; env. variable `ISBN_BLACKLIST_REGEX`; default value `^(0123456789|([0-9xX])\2{9})$`
Any ISBNs that were matched by the `ISBN_REGEX` above and pass the ISBN validation algorithm are normalized and passed through this regular expression. Any ISBNs that successfully match against it are discarded. The idea is to ignore technically valid but probably wrong numbers like `0123456789`, `0000000000`, `1111111111`, etc.
* `--isbn-direct-grep-files=<value>`; env. variable `ISBN_DIRECT_GREP_FILES`; default value `^text/(plain|xml|html)$`
This is a regular expression that is matched against the MIME type of the searched files. Matching files are searched directly for ISBNs, without converting or OCR-ing them to `.txt` first.
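
To make the blacklist description concrete, here is a minimal sketch of how the default `ISBN_BLACKLIST_REGEX` treats a few normalized ISBNs. The helper `is_blacklisted` is hypothetical and exists only for this illustration; it is not part of `lib.sh`. The sketch assumes a `grep` with `-P` (PCRE) support, as the option descriptions above already do.

```shell
#!/usr/bin/env bash
# Default value as documented above; single quotes keep the backslash and
# the trailing $ literal.
ISBN_BLACKLIST_REGEX='^(0123456789|([0-9xX])\2{9})$'

# Hypothetical helper (not in lib.sh): succeeds if the given normalized
# ISBN matches the blacklist and would therefore be discarded.
is_blacklisted() {
    printf '%s\n' "$1" | grep -qP "$ISBN_BLACKLIST_REGEX"
}

for isbn in 0123456789 0000000000 076532637X 9780756404079; do
    if is_blacklisted "$isbn"; then
        echo "$isbn: discarded"
    else
        echo "$isbn: kept"
    fi
done
```

The second alternative, `([0-9xX])\2{9}`, captures one character and requires nine more copies of it, so it only matches 10-character strings made of a single repeated character.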
13 changes: 13 additions & 0 deletions lib.sh
@@ -51,6 +51,12 @@ NC='\033[0m'
: "${ISBN_IGNORED_FILES:="^(image/(gif|svg.+)|application/(x-shockwave-flash|CDFV2|vnd.ms-opentype|x-font-ttf|x-dosexec|vnd.ms-excel|x-java-applet)|audio/.+|video/.+)\$"}"
: "${ISBN_RET_SEPARATOR:=,}"

# This is matched against normalized valid-looking ISBNs and any numbers that
# match it are discarded.
# The default value should match 0123456789 and any ISBN-10 that uses only one
# digit (e.g. 1111111111 or 3333333333)
: "${ISBN_BLACKLIST_REGEX="^(0123456789|([0-9xX])\\2{9})\$"}"

# These options specify if and how we should reorder ISBN_DIRECT_GREP files
# before passing them to find_isbns(). If true, the first
# ISBN_GREP_RF_SCAN_FIRST lines of the files are passed as is, then we pass
@@ -111,6 +117,7 @@ handle_script_arg() {

--tested-archive-extensions=*) TESTED_ARCHIVE_EXTENSIONS="${arg#*=}" ;;
-i=*|--isbn-regex=*) ISBN_REGEX="${arg#*=}" ;;
--isbn-blacklist-regex=*) ISBN_BLACKLIST_REGEX="${arg#*=}" ;;
--isbn-direct-grep-files=*) ISBN_DIRECT_GREP_FILES="${arg#*=}" ;;
--isbn-ignored-files=*) ISBN_IGNORED_FILES="${arg#*=}" ;;
--reorder-files-for-grep=*)
@@ -284,6 +291,12 @@ find_isbns() {
echo "$isbn"
fi
done
} | {
if [ "$ISBN_BLACKLIST_REGEX" != "" ]; then
grep -vP "$ISBN_BLACKLIST_REGEX" || true
else
cat
fi
} | stream_concat "$ISBN_RET_SEPARATOR"
}
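
The interplay between the default assignment and the new filter stage can be sketched in isolation. Note that the default uses `=` rather than `:=`, so it applies only when the variable is unset; an explicitly empty `ISBN_BLACKLIST_REGEX` is kept and disables the filter. In the sketch below, `blacklist_stage` and `default_regex` are hypothetical names used for illustration only; the body of the function mirrors the stage added to `find_isbns()`.

```shell
#!/usr/bin/env bash
# Hypothetical variable holding the default, to keep the quoting simple.
default_regex='^(0123456789|([0-9xX])\2{9})$'

# "=" assigns the default only if the variable is unset, so an explicitly
# empty value survives (":=" would replace it with the default).
: "${ISBN_BLACKLIST_REGEX=$default_regex}"

# Hypothetical stand-in for the pipeline stage added to find_isbns().
blacklist_stage() {
    if [ "$ISBN_BLACKLIST_REGEX" != "" ]; then
        # grep -v exits with status 1 when every line is filtered out;
        # "|| true" keeps that from being treated as a failure.
        grep -vP "$ISBN_BLACKLIST_REGEX" || true
    else
        cat
    fi
}

# With the default regex, 1111111111 is dropped.
printf '%s\n' 1111111111 076532637X | blacklist_stage

# With an empty regex, the stage passes everything through unchanged.
printf '%s\n' 1111111111 076532637X | ISBN_BLACKLIST_REGEX="" blacklist_stage
```

This matches the two new test cases: one run with the default blacklist and one with `ISBN_BLACKLIST_REGEX=""`.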

5 changes: 5 additions & 0 deletions tests/run.sh
@@ -60,4 +60,9 @@ assert_eq "076532637X" "$(echo "just an isbn 076532637X in some text" | find_isb
assert_eq "075640407X,9780756404079" "$(echo "075640407X (ISBN13: 9780756404079)" | find_isbns)"
assert_eq "9781610391849,1610391845" "$(echo "crazy!978-16–103⁻918 49 16 10—39¯1845-z" | find_isbns)"

wrong_but_valid="0123456789,0000000000,1111111111,2222222222,3333333333,4444444444,5555555555,6666666666,7777777777,8888888888,9999999999"

assert_eq "" "$(echo "$wrong_but_valid" | find_isbns)"
assert_eq "$wrong_but_valid" "$(echo "$wrong_but_valid" | ISBN_BLACKLIST_REGEX="" find_isbns)"

exit "$EXIT_CODE"

1 comment on commit 323b6a0

@niavasha

LGTM! Sorry for the delay and glad it gave some much larger inspiration!
