Backup pictures from all receipts #33

Closed
Irio opened this issue Sep 1, 2016 · 4 comments · Fixed by #35

Comments

Collaborator

Irio commented Sep 1, 2016

It is vital for the project to have a way of accessing all receipts, for every reimbursement since the first one available, without depending on the Chamber of Deputies.

Besides providing proof for legal reports, it's useful for offline analyses. #32 is one I have in mind; doing OCR to generate new structured data is another.

Here's a function that, based on a record from quota datasets, returns the picture URL from the Chamber of Deputies' website:

def document_url(record):
    # Build the receipt PDF URL from the applicant ID, year and document ID.
    return 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/%s/%s/%s.pdf' % \
        (record['applicant_id'], record['year'], record['document_id'])
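
As a quick illustration, the function can be exercised with a made-up record. The three field values below are hypothetical placeholders, not real reimbursement data, so the resulting URL is not expected to point at an actual receipt:

```python
def document_url(record):
    # Same URL pattern as in the comment above.
    return 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/%s/%s/%s.pdf' % \
        (record['applicant_id'], record['year'], record['document_id'])


# Hypothetical record: these IDs are invented for illustration only.
example_record = {'applicant_id': 1234, 'year': 2016, 'document_id': 5678901}
example_url = document_url(example_record)
print(example_url)
```
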
Collaborator

cuducos commented Sep 2, 2016

I can write a src/fetch_receipts.py (or even refactor src/fetch_datasets.py to include that). Not sure how long it would take, or how many GB it would require.

But let's find out ; )

If you don't mind, assign this issue to me and I'll tackle it soon.

(OCR would be awesome in the near future!)
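
A minimal sketch of what such a download routine could look like, assuming the `document_url` pattern from the first comment. The names `receipt_path`, `download` and `fetch_receipts`, the local folder layout, and the injectable `fetch` parameter are all assumptions made for illustration; this is not the script that was eventually merged:

```python
import os
from urllib.request import urlopen


def document_url(record):
    # URL pattern taken from the first comment in this thread.
    return 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/%s/%s/%s.pdf' % \
        (record['applicant_id'], record['year'], record['document_id'])


def receipt_path(record, target_dir):
    # Hypothetical layout: <target_dir>/<applicant_id>/<year>/<document_id>.pdf
    return os.path.join(target_dir, str(record['applicant_id']),
                        str(record['year']), '%s.pdf' % record['document_id'])


def download(url):
    # Default fetcher: returns the raw bytes of the response body.
    with urlopen(url) as response:
        return response.read()


def fetch_receipts(records, target_dir, fetch=download):
    """Save each receipt once and return how many files were written."""
    saved = 0
    for record in records:
        path = receipt_path(record, target_dir)
        if os.path.exists(path):
            continue  # already downloaded in a previous run
        content = fetch(document_url(record))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as handle:
            handle.write(content)
        saved += 1
    return saved
```

Skipping files that already exist keeps reruns cheap. Error handling is left out of the sketch: a receipt that is missing on the server would raise an exception from `urlopen`, so a real script would probably need to catch and log those cases.
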


andrewhr commented Sep 2, 2016

As I understand it, the initial idea is just to have some kind of routine that downloads everything to a given folder for local storage. I'm OK with that.

Is it within the scope of this issue to create a secondary, online source for this image bank? Something like S3? If that is desirable, the backup routine needs to be somewhat incremental and recurrent. What do you both think? Maybe split this task into a different issue?
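
To make "incremental" concrete, here is a rough sketch of how a recurrent backup could decide what still needs uploading. It assumes the remote store can list the keys it already holds; `local_receipts` and `pending_uploads` are hypothetical helper names, and the actual upload step (to S3 or anywhere else) is deliberately left out:

```python
import os


def local_receipts(root):
    """Return the relative path (with forward slashes) of every PDF under root."""
    found = set()
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                relative = os.path.relpath(os.path.join(folder, name), root)
                found.add(relative.replace(os.sep, '/'))
    return found


def pending_uploads(root, remote_keys):
    """List files that exist locally but are absent from the remote listing."""
    return sorted(local_receipts(root) - set(remote_keys))
```

Running this on a schedule and uploading only the returned paths would keep each backup run proportional to the number of new receipts rather than to the size of the whole archive.
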

Collaborator

cuducos commented Sep 2, 2016

@andrewhr Exactly : )

As we say in the CONTRIBUTING.md:

a copy of this data will be available elsewhere (just in case…).

The src/backup_data.py script, given the proper API keys, copies the files to an Amazon S3 bucket.

Collaborator

cuducos commented Sep 2, 2016

To complement my last comment: I don't mean that everything we have is working perfectly. I just wanted to say that we have a basis for what you're proposing, @andrewhr. We are in tune!

That said, any kind of improvement to this pipeline is welcome. Feel free to open a new issue to discuss and implement enhancements on that topic ; )

cuducos self-assigned this Sep 2, 2016
cuducos added a commit that referenced this issue Sep 14, 2016
Irio unassigned cuducos Feb 9, 2018
Irio pushed a commit that referenced this issue Feb 27, 2018
3 participants