Excel URL streamer #112
Comments
Yes, I agree that using tabulator is the better option.
You can't stream an Excel file from a URL with the Python Excel libs (or any others that I personally know about); you can only stream-read from a local file buffer. I don't know how tabulator handles this internally (@roll?), but when I wrote the handling of such cases in goodtables, I wrote from the remote file to a local buffer and then streamed out of that. Probably inefficiently (at least two passes over the file contents). Even without stream reading, you should likely use tabulator for this anyway, so just sayin'.
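For illustration, the "remote file to a local buffer, then stream out of that" approach could be sketched roughly as below. This is not the goodtables code; `buffer_locally` and `max_memory` are hypothetical names, and an in-memory `io.BytesIO` stands in for the HTTP response so the sketch runs without a network.

```python
import io
import shutil
import tempfile


def buffer_locally(remote, max_memory=10 * 1024 * 1024):
    """Copy a readable file-like object into a rewound local buffer.

    Hypothetical helper, not part of goodtables or tabulator.
    """
    # SpooledTemporaryFile stays in memory up to max_memory bytes and
    # spills to disk beyond that, so large workbooks do not exhaust RAM.
    local = tempfile.SpooledTemporaryFile(max_size=max_memory)
    shutil.copyfileobj(remote, local)  # first pass: source -> local buffer
    local.seek(0)  # rewind so a parser starts reading at byte 0
    return local


# Stand-in for an HTTP response body. In practice `remote` could be the
# object returned by urllib.request.urlopen(url), which is not seekable.
fake_response = io.BytesIO(b"PK\x03\x04 pretend workbook bytes")
buffered = buffer_locally(fake_response)
header = buffered.read(4)  # second pass: the parser reads from the buffer
```

The two passes mentioned above are visible here: one to copy the bytes, one for the parser to read them back.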
For xlsx, tabulator does true streaming over HTTP; for xls that's not possible, so it uses the local buffer approach.
@cyberbikepunk
@roll ok great!
@cyberbikepunk I suppose @akariv was working on this cast thing at the DP level, and that's expected behavior: DP ensures the data follows JTS or raises an error. It's a contract between the DP lib and a user: data is always compliant with the schema.
Can you share the complete stack trace? (Replying by email to @roll's comment of Wed, Sep 28, 2016, 10:41 PM.)
@cyberbikepunk ok. @roll @akariv in goodtables, when reading from Excel, I forced everything to string to make things "consistent". This has disadvantages, especially with dates and the way Excel represents those, but I wonder if, in tabulator, it would be useful to have a param on the Excel parser that forces it to emit values as strings?
@cyberbikepunk
@pwalsh the strict typing that comes out of Excel kind of breaks the "human" workflow that we've been using so far, in the sense that non-tech people can't reliably describe the column types. Bottom line: I would vote for a flag in tabulator that says: "strings please". If everyone finds this useful, I will file an issue for @roll cc @akariv
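To make the "strings please" idea concrete, here is a minimal sketch of what such a flag could do to each row. It is not a tabulator API; `cell_to_string` and `rows_as_strings` are hypothetical names, and the choices for empty cells and whole-number floats are assumptions.

```python
import datetime


def cell_to_string(value):
    """Render one typed cell value as a string (hypothetical helper)."""
    if value is None:
        return ""  # assumption: empty cells become empty strings
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, float) and value.is_integer():
        # Parsers such as xlrd return numbers as floats, so 2016.0 would
        # otherwise be rendered as "2016.0" rather than "2016".
        return str(int(value))
    return str(value)


def rows_as_strings(rows):
    """Lazily convert every cell of every row to a string."""
    for row in rows:
        yield [cell_to_string(cell) for cell in row]


typed_rows = [[2016.0, "Berlin", None, 12.5, datetime.date(2016, 9, 28)]]
string_rows = list(rows_as_strings(typed_rows))
# string_rows == [["2016", "Berlin", "", "12.5", "2016-09-28"]]
```

Because the conversion is a generator, it would keep the streaming behavior of whatever row source it wraps.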
@cyberbikepunk ok, no problem. The usual way to handle issues in dependencies of our own tooling is to write a patch and submit a PR. However, I think it will be quicker for @roll to do this himself, so yes, please file an issue and he can do a quick enhancement here. @roll you will encounter weirdness with dates:
Loic, can't you start with the simpler case of CSV, just to get things working end to end first? Otherwise you will get lost in details like this.
But the answer is that Excel stores dates as floats.
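As a rough illustration of what "dates as floats" means: in the default 1900 date system, the integer part of the float counts days and the fractional part is the time of day. The sketch below assumes that date system (workbooks can also use a 1904-based one) and the helper name `excel_serial_to_datetime` is made up for the example.

```python
import datetime

# Conversion base for the 1900 date system. Using 1899-12-30 compensates
# for Excel treating 1900 as a leap year, so the result is only correct
# for serials from 61 (1900-03-01) onwards.
EXCEL_EPOCH = datetime.datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (1900 date system) to a datetime."""
    return EXCEL_EPOCH + datetime.timedelta(days=serial)


converted = excel_serial_to_datetime(42641.5)
# converted == datetime.datetime(2016, 9, 28, 12, 0)
```

This is why a date cell that gets forced to a string without conversion shows up as something like "42641.5" instead of a readable date.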
@pwalsh honestly, I wasn't anticipating that Excel was going to be that tricky. I got the data in now by cleaning some of the bad cells manually, and I learned an important lesson: we absolutely need this feature of ingesting everything as strings, because the streamer will fail if the data is not perfectly clean, and it never is. Most of the files we've sourced so far are Excel files.
Need to stream Excel files into the pipeline from URLs.
So @akariv: should I implement this as a service like so?
Can't we just use the `iter` method of the `datapackage.Resource` class (cc @roll)?