This is the initial version of a spider to get the data from the FECAM website. I've created a spider class which is able to understand the FECAM website, navigate through the pages, and extract the gazette files.
Most of the files are DOCs. Thus, I've also added a new step in the pipeline to convert each DOC into a PDF. For that, I used LibreOffice Writer. It's the best option off the top of my head; if you have a better idea, let me know.
I'll continue testing and adding the missing cities which have their gazettes on the website. Currently, I'm just testing with my city.
The files are being downloaded from: https://diariomunicipal.sc.gov.br/site/
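The DOC-to-PDF step described above can be sketched with LibreOffice's headless conversion mode. This is only an illustration of the approach, not the PR's actual code; the function names are hypothetical, and it assumes the `soffice` binary is on the PATH:

```python
import subprocess
from pathlib import Path


def build_conversion_command(doc_path, output_dir):
    """Build the LibreOffice headless command that converts a DOC to a PDF."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        str(output_dir),
        str(doc_path),
    ]


def doc_to_pdf(doc_path, output_dir):
    """Run the conversion and return the path LibreOffice writes the PDF to."""
    subprocess.run(build_conversion_command(doc_path, output_dir), check=True)
    return Path(output_dir) / (Path(doc_path).stem + ".pdf")
```

LibreOffice writes the output next to the name of the input file (with a `.pdf` extension) inside `--outdir`, which is why the expected path can be derived from the DOC's stem.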
Hello @jvanz! Nice PR. Just so I understand, why are you converting a DOC to a PDF? Is it just so we can reuse the PDF parsing pipeline?
I'm not from the core team, but it is a strange strategy to convert from a "more machine-readable" format (doc) to a less readable format (pdf) before parsing the data. IMHO, it would be better to change the pipeline to support not only PDFs, but also DOCs. Especially because I imagine there will be other formats in the future (e.g. TXT?), and this current requirement that the gazettes are PDFs will have to be changed anyway.
I decided not to extract the text from the DOC directly because I didn't find a good library for that. I know there is the
I've never tried Apache Tika. Actually, I'd never heard of it before. It seems interesting indeed.
Many thanks for the contribution, @jvanz! Really great code and solutions IMHO.
I don't know Apache Tika either, but I have the feeling that it would be more trustworthy to extract text from a Word document than from a PDF – I mean, the quality of the extracted text would be considerably better coming from the Word file.
In other words, I would consider extracting the text from the Word document rather than from the converted PDF.
To ensure that the function is_doc works with all file paths, this commit adds a lower() call before the test. Thus, file paths in upper case will work as well. Signed-off-by: José Guilherme Vanz <firstname.lastname@example.org>
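The fix this commit describes amounts to normalizing the case before checking the extension. A minimal sketch (the exact signature in the repo may differ; this assumes is_doc receives the path as a string):

```python
def is_doc(file_path):
    """Return True when the file looks like a Word document.

    Lower-casing the path first means names such as "GAZETTE.DOC"
    are recognized as well.
    """
    return file_path.lower().endswith((".doc", ".docx"))
```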
To comply with PEP8, this commit changes the order of the import statements. Signed-off-by: José Guilherme Vanz <email@example.com>
To keep track of which file each PDF originates from, this commit removes the os.unlink call used to delete the DOC files. Furthermore, it keeps the DOC file extension in the PDF file path. Signed-off-by: José Guilherme Vanz <firstname.lastname@example.org>
Add a class to the pipeline able to detect the file type and extract the text using the most suitable tool. The current implementation uses the already in-place tool for PDF files. For DOC-like files, the newly added file type, the pipeline step uses Apache Tika to get the content. #136 Signed-off-by: José Guilherme Vanz <email@example.com>
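The dispatch this commit describes could look roughly like the sketch below. The function names are hypothetical, and the Apache Tika call assumes the `tika` Python package (whose `parser.from_file` returns a dict with a `"content"` key); the PDF branch is left as a stub since the repo already has a tool for it:

```python
def get_extractor(file_path):
    """Pick the text-extraction strategy based on the file extension."""
    lowered = file_path.lower()
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith((".doc", ".docx")):
        return "doc"
    raise ValueError(f"Unsupported file type: {file_path}")


def extract_text(file_path):
    """Extract text, delegating DOC-like files to Apache Tika."""
    if get_extractor(file_path) == "doc":
        # Lazy import so the PDF path keeps working without the tika package.
        from tika import parser

        return parser.from_file(file_path).get("content") or ""
    # For PDFs, delegate to the already in-place extraction tool.
    ...
```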
In the pipeline item that extracts the text from the PDF file, this commit adds the output text file path when building the command string. Thus, we avoid "No such file or directory" errors. Signed-off-by: José Guilherme Vanz <firstname.lastname@example.org>
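Assuming the in-place tool is pdftotext (the commit does not name it), passing the destination explicitly instead of letting the tool derive it could look like this hypothetical helper:

```python
def build_pdftotext_command(pdf_path):
    """Build the pdftotext command with an explicit output text path.

    Naming the destination explicitly avoids "No such file or directory"
    errors that can occur when the tool has to derive the output path itself.
    """
    txt_path = pdf_path.rsplit(".", 1)[0] + ".txt"
    return ["pdftotext", "-layout", pdf_path, txt_path]
```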