Fecam #135
Conversation
Hello @jvanz! Nice PR. Just so I understand: why are you converting a DOC to a PDF? Is it just so we can reuse the PDF parsing pipeline? I'm not from the core team, but it is a strange strategy to convert from a more machine-readable format (DOC) to a less readable one (PDF) before parsing the data. IMHO, it would be better to change the pipeline to support not only PDFs but also DOCs, especially because I imagine there will be other formats in the future (e.g. TXT?), and the current requirement that the gazettes be PDFs will have to be changed anyway. I imagine the …
Yes
I decided not to extract the text from the DOC because I didn't find a good library for that. I know there is the …
@jvanz Regarding extracting text from DOC files, have you tried Apache Tika? I've never used it, but I've heard good things. If it works well, I think it's the ideal tool, as it supports dozens of file formats (including PDF). What do you think, @cuducos / @Irio?
I have never tried Apache Tika. Actually, I had never even heard of it. It seems interesting indeed.
Many thanks for the contribution, @jvanz! Really great code and solutions IMHO. I don't know Apache Tika either, but I have the feeling that it would be more trustworthy to extract text from a Word document than from a PDF; I mean, the quality of the extracted text would be considerably better coming from the Word file. In other words, I would consider extracting the text from the Word document and not from the converted PDF.
To ensure that the function is_doc works with all file paths, this commit adds a lower() call before the test. Thus, file paths in upper case will work as well. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
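The change described in the commit can be sketched like this (an illustration based on the commit message, not the project's actual code):

```python
def is_doc(file_path: str) -> bool:
    """Return True when the path points to a DOC file, regardless of case.

    Lower-casing the path first means "GAZETTE.DOC" is detected the same
    way as "gazette.doc".
    """
    return file_path.lower().endswith(".doc")
```

With this, `is_doc("GAZETTE.DOC")` and `is_doc("gazette.doc")` both return `True`.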
To comply with PEP 8, this commit changes the order of the import statements. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
To keep track of which file originated each PDF, remove the os.unlink call used to delete the DOC files. Furthermore, keep the DOC file extension in the PDF file path. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
I believe all issues are solved in the last commits. Regarding Apache Tika, do you agree to leave it for a separate PR? I can work on that before adding spiders for other cities. I can open an issue for it and keep track there.
Add a class to the pipeline able to detect the file type and extract the text using the best tool. The current implementation uses the already in-place tool for PDF files. For DOC-like files, the newly added file type, the pipeline step uses Apache Tika to get the content. #136 Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
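A minimal sketch of what such a dispatch step could look like; the detection logic and names here are assumptions for illustration, not the PR's actual code:

```python
# Route each downloaded file to the right extractor based on its magic bytes
# rather than its extension (the pipeline described in the commit would send
# "pdf" files to the existing tool and "doc" files to Apache Tika, e.g. via
# tika-python's parser.from_file).
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .doc (OLE2) container
PDF_MAGIC = b"%PDF"

def detect_file_type(head: bytes) -> str:
    """Guess 'pdf', 'doc', or 'unknown' from the first bytes of a file."""
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(OLE_MAGIC):
        return "doc"
    return "unknown"
```

Detecting by content instead of extension also covers misnamed files, which is common in scraped gazette archives.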
Apart from the small comment I added inline, this is looking great! I haven't tried running it, but it seems ready to go.
Keep the PDF file extension in the txt file path. Thus, it's easier to keep track of which file originated the txt file. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
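The naming scheme in this commit can be illustrated like this (a hypothetical helper, not the PR's code):

```python
def text_file_path(pdf_path: str) -> str:
    """Append .txt instead of replacing the extension, so the source PDF's
    name (and, transitively, the original DOC's) stays visible in the path."""
    return pdf_path + ".txt"
```

For example, `text_file_path("gazette.doc.pdf")` yields `"gazette.doc.pdf.txt"`, which encodes the whole DOC → PDF → txt provenance chain in the filename.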
The ScGasparSpider just needs to import the FecamGazetteSpider spider. This commit removes all the other imports. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
In the pipeline item that extracts the text from the PDF file, add the text file path as well when building the command string. Thus, we avoid "No such file or directory" errors. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
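As a sketch, assuming the pipeline shells out to `pdftotext` (the builder function and flags below are illustrative assumptions), the fix amounts to passing the output path explicitly instead of letting the tool derive it:

```python
def build_extraction_command(pdf_path: str, txt_path: str) -> list:
    """Build the text-extraction command, including the output txt path
    explicitly so the tool never writes to an unexpected location."""
    # -layout asks pdftotext to preserve the page's physical layout.
    return ["pdftotext", "-layout", pdf_path, txt_path]
```

The resulting list can be handed to `subprocess.run(..., check=True)` without shell quoting concerns.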
So guys, what do you think? Is the PR on the right track? :)
Fixes a type error: an Exception object was being raised instead of the error string. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
I haven't executed the code, but it looks very good! I just found a tiny typo, and then it's good to go.
Fixes a typo in the method name. Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
Merged d84552a into okfn-brasil:master
jvanz commented Nov 12, 2019 (edited)
This is the initial version of a spider to get the data from the FECAM website. I've created a spider class able to understand the FECAM website, navigate through the pages, and extract the gazette files.
Most of the files are DOCs. Thus, I've also added a new step in the pipeline to convert the DOC into PDF. For that, I used LibreOffice Writer. It's the best option off the top of my head; if you have a better idea, let me know.
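For reference, headless LibreOffice conversion can be driven like this (a sketch under the assumption that the `libreoffice` binary is on the PATH; the exact invocation in the PR may differ):

```python
import subprocess

def build_libreoffice_command(doc_path: str, out_dir: str) -> list:
    # --headless runs without a GUI; --convert-to pdf uses Writer's PDF
    # export; --outdir controls where the converted file is written.
    return ["libreoffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, doc_path]

def convert_doc_to_pdf(doc_path: str, out_dir: str) -> None:
    """Convert a DOC file to PDF by shelling out to LibreOffice."""
    subprocess.run(build_libreoffice_command(doc_path, out_dir), check=True)
```

The output file keeps the input's base name with a `.pdf` extension, written into `out_dir`.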
I'll continue testing and adding the missing cities which have their gazettes on the website. Currently, I'm just testing with my city.
The files are being downloaded from: https://diariomunicipal.sc.gov.br/site/