Some bash scripts that leverage on existing pdf tools. Script names should be self exaplanatory with the exception perhaps of pdfmark.sh which allows to change metadata in pdfs (example provided to add bookmarks).
xpdf and ghostscript. They are available for Linux, Unix, OS X and probably Windows. In MAC OS X you just:
brew install xpdf brew install ghostscript
After installing dependencies just clone the project:
git clone https://github.com/nestoru/pdf-bash-tools.git
Enter the directory and run the scripts without params to confirm you get the help for each of them, for example:
cd pdf-bash-tools/ ./find_pdf_pages_containing_regex.sh ./extract_pdf_pages.sh ./extract_pdf_pages_containing_regex.sh ./pdfmark.sh
Run the below commands and note how the original sample file has 5 names and cities. The name "Jorge" is repeated in two pages. As we pull only those pages that contain the name "Jorge" the result is a shorter file containing only 2 pages.
rm -f /tmp/names_and_cities_jorge.pdf ./extract_pdf_pages_containing_regex.sh samples/names_and_cities.pdf /tmp/names_and_cities_jorge.pdf "Jorge" open /tmp/names_and_cities_jorge.pdf
Run the below commands and note how even though encrypted with password "test", we are still able to produce a final pdf file. If you want that file encrypted then just run gs with proper flags as you can see explained inside the extract_pdf_pages.sh script:
rm -f /tmp/names_and_cities_jorge.pdf ./extract_pdf_pages_containing_regex.sh samples/names_and_cities_encrypted.pdf /tmp/names_and_cities_jorge.pdf "Jorge" test open /tmp/names_and_cities_jorge.pdf
Run the below commands and note how the resulting files (obtained from encrypted and not encrypted source pdf files) contain bookmarks. The echo lines basically build a pdfmark (http://partners.adobe.com/public/developer/en/acrobat/sdk/pdf/pdf_creation_apis_and_specs/pdfmarkReference.pdf) to insert bookmarks in the source pdf. I have also added a test for filenames containing spaces.
rm -f /tmp/names_and_cities_bookmarked.pdf rm -f /tmp/names_and_cities_decrypted_and_bookmarked.pdf echo '[/PageMode /UseOutlines /DOCVIEW pdfmark' > /tmp/pdfmark echo '[ /Page 1 /Title (Nestor Urquiza) /OUT pdfmark' >> /tmp/pdfmark echo '[ /Page 2 /Title (Jorge Ruiz) /OUT pdfmark' >> /tmp/pdfmark echo '[ /Page 3 /Title (Mark Johnson) /OUT pdfmark' >> /tmp/pdfmark echo '[ /Page 4 /Title (Jorge Gacia) /OUT pdfmark' >> /tmp/pdfmark echo '[ /Page 5 /Title (Jristo Jristov) /OUT pdfmark' >> /tmp/pdfmark ./pdfmark.sh samples/names_and_cities.pdf /tmp/names_and_cities_bookmarked.pdf /tmp/pdfmark open /tmp/names_and_cities_bookmarked.pdf cp samples/names_and_cities.pdf "/tmp/names and cities.pdf" rm -f "/tmp/names and cities bookmarked.pdf" ./pdfmark.sh "/tmp/names and cities.pdf" "/tmp/names and cities bookmarked.pdf" /tmp/pdfmark open "/tmp/names and cities bookmarked.pdf" ./pdfmark.sh samples/names_and_cities_encrypted.pdf /tmp/names_and_cities_decrypted_and_bookmarked.pdf /tmp/pdfmark test open /tmp/names_and_cities_decrypted_and_bookmarked.pdf
Run the below:
./find_pdf_pages_containing_regex.sh --printmatch samples/names_and_cities.pdf "Jo[^ ]*"
Assert the result is the page and the exact match separated by comma:
2,Jorge 3,Johnson 4,Jorge
Run the below commands and note how the file is encrypted with password "test" and a watermark shows up in all pages of the original pdf.
rm -f /tmp/names_and_cities_watermarked.pdf ./pdf_watermark.sh samples/names_and_cities.pdf /tmp/names_and_cities_watermarked.pdf "Nestor Urquiza - Sample, Inc" "test" open /tmp/names_and_cities_watermarked.pdf
Look at the names_and_cities2.pdf file which includes an apostrophe in the name. Of course you will need to be careful with quotation and escaping. Try to go through the different examples above with the new file to find out the changes you will need. Yes this is homework, not kidding.