
Fecam #135

Merged
merged 11 commits into from Nov 29, 2019

Conversation

@jvanz
Contributor

jvanz commented Nov 12, 2019

This is the initial version of a spider to get the data from the FECAM website. I've created a spider class that is able to understand the FECAM website, navigate through the pages, and extract the gazette files.

Most of the files are DOCs. Thus, I've also added a new step in the pipeline to convert the DOC into PDF. For that, I used LibreOffice Writer. It's the best option off the top of my head; if you have a better idea, let me know.

I'll continue testing and adding the missing cities which have their gazettes on the website. Currently, I'm just testing with my city.

The files are being downloaded from: https://diariomunicipal.sc.gov.br/site/
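For illustration, a minimal sketch of the DOC-to-PDF conversion step described above, using LibreOffice in headless mode (the function and path names here are assumptions, not the PR's actual code):

```python
# Hypothetical sketch of a DOC-to-PDF conversion step using LibreOffice
# Writer in headless mode; names are illustrative, not the PR's identifiers.
import os
import subprocess


def convert_doc_to_pdf(doc_path, output_dir):
    """Convert a .doc/.docx file to PDF and return the resulting path."""
    subprocess.run(
        [
            "libreoffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            output_dir,
            doc_path,
        ],
        check=True,
    )
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(output_dir, f"{base}.pdf")
```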

jvanz added 2 commits Nov 10, 2019
Adds a spider to crawl the data of the cities which are associated with FECAM

Issue #134

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
Adds a step in the pipeline to convert doc[x] files to PDF. Some gazette
files from SC cities are available in DOC formats.

Issue #134

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
@vitorbaptista
Member

vitorbaptista commented Nov 12, 2019

Hello @jvanz! Nice PR. Just so I understand, why are you converting a DOC to a PDF? Is it just so we can reuse the PDF parsing pipeline?

I'm not from the core team, but it is a strange strategy to convert from a "more machine-readable" format (doc) to a less readable format (pdf) before parsing the data. IMHO, it would be better to change the pipeline to support not only PDFs, but also DOCs. Especially because I imagine there will be other formats in the future (e.g. TXT?), and this current requirement that the gazettes are PDFs will have to be changed anyway.

I imagine the PdfParsingPipeline becoming something like a TextExtractorPipeline, and supporting PDFs, DOC, etc. Have you tried something like this?

@jvanz
Contributor Author

jvanz commented Nov 12, 2019

Hello @jvanz! Nice PR. Just so I understand, why are you converting a DOC to a PDF? Is it just so we can reuse the PDF parsing pipeline?

Yes

I'm not from the core team, but it is a strange strategy to convert from a "more machine-readable" format (doc) to a less readable format (pdf) before parsing the data. IMHO, it would be better to change the pipeline to support not only PDFs, but also DOCs. Especially because I imagine there will be other formats in the future (e.g. TXT?), and this current requirement that the gazettes are PDFs will have to be changed anyway.

I imagine the PdfParsingPipeline becoming something like a TextExtractorPipeline, and supporting PDFs, DOC, etc. Have you tried something like this?

I decided not to extract the text from the DOC because I didn't find a good lib for that. I know there is the python-docx library, but from previous experiences where it was not able to read all DOC-like files, I decided to convert to PDF. However, if you think that's the way to go, I can update the PR. :)

@vitorbaptista
Member

vitorbaptista commented Nov 12, 2019

@jvanz Regarding extracting text from DOC, have you tried Apache Tika? I've never used it, but heard good things. If it works well, I think it's the ideal tool, as it supports dozens of file formats (including PDFs). What do you think, @cuducos / @Irio ?
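If Tika were adopted, a minimal sketch using the tika-python client could look like the following (names are illustrative; the package spins up a local Tika server on first use):

```python
# Illustrative text extraction via Apache Tika using the tika-python package.
# Works for PDFs, DOC, DOCX, and many other formats.
from tika import parser


def extract_text(file_path):
    """Extract plain text from a file via Apache Tika."""
    parsed = parser.from_file(file_path)
    return parsed.get("content") or ""
```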

@jvanz
Contributor Author

jvanz commented Nov 12, 2019

@jvanz Regarding extracting text from DOC, have you tried Apache Tika? I've never used it, but heard good things. If it works well, I think it's the ideal tool, as it supports dozens of file formats (including PDFs). What do you think, @cuducos / @Irio ?

I have never tried Apache Tika. Actually, I had never heard about it. It seems interesting indeed.

@cuducos
Member

cuducos commented Nov 12, 2019

Many thanks for the contribution, @jvanz! Really great code and solutions IMHO.

I don't know Apache Tika either, but I have the feeling that it would be more trustworthy to extract text from a Word document than from a PDF – I mean, the quality of the extracted text would be considerably better coming from .doc than from .pdf. But it's only a guess.

In other words, I would consider extracting the text from the Word document and not from the converted PDF.

jvanz added 3 commits Nov 12, 2019
To ensure that the function is_doc works with all file paths, this commit
adds a lower() call before the test. Thus, file paths in upper case will
work as well.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
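A sketch of what the case-insensitive extension check described in this commit could look like (the actual is_doc implementation in the PR may differ):

```python
# Illustrative version of the check: lower() makes the extension test
# case-insensitive, so "GAZETTE.DOC" matches as well as "gazette.doc".
def is_doc(file_path):
    return file_path.lower().endswith((".doc", ".docx"))
```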
To be compliant with PEP8, this commit changes the order of the import
statements.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
To keep track of which file originated the PDF files, remove the
os.unlink call used to delete the doc files. Furthermore, keep the doc
file extension in the PDF file path.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
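As a small illustration of the naming convention this commit describes (the exact scheme in the PR may differ):

```python
def pdf_path_for(doc_path):
    # "gazette.doc" -> "gazette.doc.pdf": the source extension stays visible,
    # so it is easy to tell which DOC file originated each PDF.
    return f"{doc_path}.pdf"
```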
@jvanz
Contributor Author

jvanz commented Nov 12, 2019

I believe that all issues are solved in the last commits. Regarding Apache Tika, do you agree to leave this for a separate PR? I can work on that before adding spiders for other cities. I can open an issue for that and keep track there.

@jvanz jvanz requested a review from cuducos Nov 13, 2019
@jvanz jvanz mentioned this pull request Nov 13, 2019
Add a class in the pipeline able to detect the file type and extract the
text using the best tool. The current implementation uses the already
in-place tool for PDF files. For DOC-like files, the newly added file type,
the pipeline step uses Apache Tika to get the content.

#136

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
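A hedged sketch of the dispatch this commit describes (class and helper names are assumptions, and the existing PDF extraction is represented by a placeholder):

```python
# Illustrative text-extraction step: PDFs keep using the project's existing
# extraction path, DOC-like files go through Apache Tika. Identifiers here
# are assumptions, not the PR's actual names.
from tika import parser


class TextExtractionStep:
    def extract(self, file_path):
        lowered = file_path.lower()
        if lowered.endswith(".pdf"):
            return self._extract_from_pdf(file_path)
        if lowered.endswith((".doc", ".docx")):
            parsed = parser.from_file(file_path)
            return parsed.get("content") or ""
        raise ValueError(f"Unsupported file type: {file_path}")

    def _extract_from_pdf(self, file_path):
        # Placeholder for whatever PDF extraction the project already uses.
        raise NotImplementedError
```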
@vitorbaptista
Member

vitorbaptista commented Nov 15, 2019

Apart from the small comment I added inline, this is looking great! I haven't tried running it, but it seems ready to go 👍

Keep the PDF file extension in the txt file path. Thus, it's easier
to keep track of which file originated the txt file.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
@jvanz jvanz force-pushed the jvanz:fecam branch from 050659a to b0442a7 Nov 15, 2019
@jvanz jvanz requested a review from vitorbaptista Nov 17, 2019
The ScGasparSpider just needs to import the FecamGazetteSpider spider.
This commit removes all the other imports.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
@jvanz jvanz force-pushed the jvanz:fecam branch from 3f03268 to 34cfea4 Nov 17, 2019
In the pipeline item that extracts the text from the PDF file, when
building the command string, add the path to the text file as well.
Thus, we avoid "No such file or directory" errors.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
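Sketched with pdftotext as the assumed extraction tool (the project's real command may differ), passing the output path explicitly looks like this:

```python
# Illustrative command construction: giving pdftotext an explicit output
# path avoids it writing to a non-existent default location and the
# resulting "No such file or directory" errors. Tool choice is an assumption.
import subprocess


def extract_text_to_file(pdf_path, txt_path):
    subprocess.run(["pdftotext", "-layout", pdf_path, txt_path], check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()
```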
@jvanz
Contributor Author

jvanz commented Nov 27, 2019

So guys, what do you think? Is the PR on the right track? :)

Fixes a type error: an Exception object was being raised instead of the
error string.

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
Member

vitorbaptista left a comment

I haven't executed the code, but it looks very good! Just found a tiny typo, and that's good to go 👍

Fixes a typo in the method name

Signed-off-by: José Guilherme Vanz <jvanz@jvanz.com>
@sergiomario sergiomario merged commit d84552a into okfn-brasil:master Nov 29, 2019
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed