NB: I keep trying things out and post the code snippets here when I find the process useful. Packages may be deprecated, or developers may change the method. So try it; if it fails, debug. Thanks!
1. OCR from PDF using TIFF2TXT
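- Install ImageMagick and Tesseract (ImageMagick is a system package, not a pip package; it uses Ghostscript to read PDFs).
sudo apt install imagemagick ghostscript tesseract-ocr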
- Run this to convert the PDFs into .tiff files while keeping the resolution intact.
convert -density 300 *.pdf -depth 8 -strip -background white -alpha off 2%05d.tiff
- Extract the text into a text file.
tesseract filename.tiff stdout -l eng > outtext  # single file
for i in *.tiff; do tesseract "$i" stdout -l eng >> outtext; done  # multiple files
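If you want one text file per PDF instead of a single combined outtext, the two steps above can be wrapped in a small function. This is only a sketch: the function name ocr_pdf and the page-%05d.tiff naming are my own choices, and it assumes ImageMagick's convert and tesseract are on the PATH.

```shell
# Sketch: OCR one PDF into a text file with the same base name.
# ocr_pdf and the page-%05d.tiff naming are illustrative, not part of either tool.
ocr_pdf() {
    local pdf="$1"
    local base="${pdf%.pdf}"
    local tmpdir
    tmpdir="$(mktemp -d)" || return 1

    # Render every page to a zero-padded TIFF so the glob below sorts in page order.
    convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off \
        "$tmpdir/page-%05d.tiff" || { rm -rf "$tmpdir"; return 1; }

    : > "$base.txt"   # start from an empty output file
    local page
    for page in "$tmpdir"/page-*.tiff; do
        tesseract "$page" stdout -l eng >> "$base.txt"
    done
    rm -rf "$tmpdir"
}
```

Usage: for f in *.pdf; do ocr_pdf "$f"; done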
2. Clone an Entire Website using Wget
wget --mirror --convert-links --wait=2 https://example.com/
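For a copy that browses well offline, a few more standard wget options may help. The sketch below wraps them in a function; the name mirror_site is made up, and whether you want every flag depends on the site.

```shell
# Sketch: mirror one site into the current folder for offline browsing.
# mirror_site is an illustrative name; the flags are standard wget options.
#   --page-requisites   also fetch the images/CSS/JS each page needs
#   --adjust-extension  save HTML pages with an .html suffix
#   --no-parent         never climb above the starting directory
mirror_site() {
    local url="$1"
    wget --mirror \
         --convert-links \
         --page-requisites \
         --adjust-extension \
         --no-parent \
         --wait=2 \
         "$url"
}
```

Usage: mirror_site https://example.com/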
3. Concatenate PDFs in a folder
sudo apt install pdftk
pdftk *.pdf cat output output.pdf
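One catch: if output.pdf already exists in the folder, *.pdf picks it up as an input as well. A possible workaround is sketched below; the function name merge_pdfs is hypothetical, and the files are merged in the shell's glob (alphabetical) order.

```shell
# Sketch: merge all PDFs in the current folder, skipping the output file itself.
# merge_pdfs is an illustrative name; only the final pdftk call is the real tool.
merge_pdfs() {
    local out="${1:-output.pdf}"
    local f
    set --                             # reuse the positional parameters as the input list
    for f in *.pdf; do
        [ -e "$f" ] || continue        # the glob matched nothing
        [ "$f" = "$out" ] && continue  # do not feed the output back in
        set -- "$@" "$f"
    done
    [ "$#" -gt 0 ] || { echo "no PDFs to merge" >&2; return 1; }
    pdftk "$@" cat output "$out"
}
```

Usage: merge_pdfs output.pdf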