
Excel URL streamer #112

Closed
cyberbikepunk opened this issue Sep 26, 2016 · 17 comments
cyberbikepunk (Collaborator) commented Sep 26, 2016

We need to stream Excel files into the pipeline from URLs.

So @akariv: should I implement this as a service, like so?

We start by supporting URLs for files other than CSV, e.g.:
e.g. https://raw.githubusercontent.com/openspending/os-fdp-adapters/master/os_fdp_adapters/excel-adapter/tests/test.xls

For every URL, we hit the fdp-adapter endpoint:
http://next.openspending.org/fdp-adapter/convert?url=https://raw.githubusercontent.com/openspending/os-fdp-adapters/master/os_fdp_adapters/excel-adapter/tests/test.xls

If the file type is not supported, we get a 404 and proceed as before (assuming a CSV/JSON file based on the extension).
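That dispatch is simple enough to sketch; assuming only the endpoint shape above (the helper names here are hypothetical, not project code):

```python
import urllib.parse

FDP_ADAPTER = 'http://next.openspending.org/fdp-adapter/convert'

def adapter_url(source_url):
    # Build the conversion endpoint URL for an arbitrary source file.
    return FDP_ADAPTER + '?' + urllib.parse.urlencode({'url': source_url})

def resolve_source(source_url, adapter_status):
    # A 404 from the adapter means the file type is unsupported, so
    # fall back to the original URL and let the existing CSV/JSON
    # handling guess from the file extension, as before.
    if adapter_status == 404:
        return source_url
    return adapter_url(source_url)
```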

Can't we just use the iter method of the datapackage.Resource class? (cc @roll)

    def iter(self):
        '''Lazily-iterates over rows in data.
        This method is useful when you don't want to load all data in memory at
        once.
        Returns:
            iter: An iterator that yields each row in this resource.
        Raises:
            ValueError: If the data isn't tabular, if the resource has
                no data, or if its specified encoding is incorrect
            IOError: If there was some problem opening the data file (e.g. it
                doesn't exist or we don't have permissions to read it).
        '''
        result = None
        inline_data = self.descriptor.get('data')
        if self.local_data_path and os.path.isfile(self.local_data_path):
            data_path_or_url = self.local_data_path
        else:
            data_path_or_url = self.remote_data_path

        if inline_data:
            inline_data = self._parse_inline_data()
            result = iter(inline_data)
        elif data_path_or_url:
            dialect = self.descriptor.get('dialect', {})
            parser_options = {}
            if 'delimiter' in dialect:
                parser_options['delimiter'] = dialect['delimiter']
            if 'lineTerminator' in dialect:
                parser_options['lineterminator'] = dialect['lineTerminator']
            if len(dialect) > 0:
                parser_options['constructor'] = tabulator.parsers.CSV

            try:
                table = tabulator.topen(data_path_or_url, headers='row1',
                                        encoding=self.descriptor.get('encoding'),
                                        parser_options=parser_options)
                result = self._iter_from_tabulator(table, self.descriptor.get('schema'))
            except tabulator.exceptions.TabulatorException as e:
                msg = 'Data at \'{0}\' isn\'t in a known tabular data format'
                six.raise_from(ValueError(msg.format(data_path_or_url)), e)

        if result is None:
            if self.descriptor.get('path'):
                # FIXME: This is a hack to throw an IOError when local data
                # exists but couldn't be loaded for some reason. If "path"
                # existed and there were no issues opening it, "result" would
                # never be None.
                raise IOError('Resource\'s data couldn\'t be loaded.')

            raise ValueError('Resource has no data')

        return result

akariv commented Sep 26, 2016

Yes, I agree that using tabulator is the better option.

pwalsh commented Sep 26, 2016

You can't stream an Excel file from a URL with the Python Excel libs (or any others that I personally know of); you can only stream-read from a local file buffer. I don't know how tabulator handles this internally (@roll?), but when I wrote the handling of such cases in goodtables, I wrote the remote file to a local buffer and then streamed out of that. Probably inefficiently (at least two passes over the file contents).

Even without stream reading, you should probably use tabulator for this anyway, so just sayin'.
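The local-buffer fallback described above fits in a few lines; this is a sketch of the general idea only (fetch_to_buffer is a hypothetical helper, not the goodtables or tabulator implementation):

```python
import io
import urllib.request

def fetch_to_buffer(url):
    # xls has no streamable layout, so the whole file is pulled into
    # memory first; an Excel reader can then seek around the buffer
    # while rows are iterated out of it lazily.
    with urllib.request.urlopen(url) as response:
        return io.BytesIO(response.read())
```

As noted, this costs at least two passes over the file contents: one to download, one to parse.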

roll commented Sep 26, 2016

For xlsx, tabulator does true streaming via HTTP; for xls that's not possible, so it takes the local-buffer approach.

roll commented Sep 26, 2016

@cyberbikepunk
Just for files, tabulator is really straightforward to use. No need for datapackage.

pwalsh commented Sep 26, 2016

@roll ok great!


cyberbikepunk commented Sep 28, 2016

@roll I've noticed that the datapackage.Resource._iter_from_tabulator method casts fields quite aggressively. I don't understand why I should get an error like:

jsontableschema.exceptions.InvalidStringType: 41640.0 is not of type <class 'str'>

Couldn't this also be a string?

cc @pwalsh @akariv

roll commented Sep 28, 2016

@cyberbikepunk I suppose @akariv was working on this casting at the DP level, and that's expected behavior: DP ensures data follows JTS or raises an error. It's a contract between the DP lib and the user: data is always compliant with the schema.


akariv commented Sep 29, 2016

Can you share the complete stack trace?


cyberbikepunk (Collaborator, Author) commented:

FYI @roll @pwalsh, I talked to Adam about this: I'm reading in from Excel, and I didn't realize that I was getting typed information out of it.

pwalsh commented Sep 29, 2016

@cyberbikepunk ok.

@roll @akariv in goodtables, when reading from Excel, I forced everything to string to make things "consistent". This has disadvantages, especially with dates and the way Excel represents them, but I wonder whether it would be useful for tabulator's Excel parser to take a parameter that forces it to emit values as strings?

roll commented Sep 29, 2016

@cyberbikepunk
@akariv
About the Excel-to-string parameter: it's easy to add, so please create an issue for tabulator if you decide you want it.

cyberbikepunk (Collaborator, Author) commented:

@pwalsh the strict typing that comes out of Excel rather breaks the "human" workflow we've had so far, in the sense that non-technical people can't reliably describe the column types. Bottom line: I would vote for a flag in tabulator that says "strings, please". If everyone finds this useful, I'll file an issue for @roll.
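To illustrate what a "strings please" flag would amount to, here is a post-processing sketch (illustrative only, not tabulator's actual parser code):

```python
def stringify_row(row):
    # Collapse Excel's typed cells into plain strings. Whole-number
    # floats are printed without the trailing '.0' that numeric
    # cells would otherwise produce (e.g. 41640.0 -> '41640').
    out = []
    for value in row:
        if value is None:
            out.append('')
        elif isinstance(value, float) and value.is_integer():
            out.append(str(int(value)))
        else:
            out.append(str(value))
    return out
```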

cc @akariv

pwalsh commented Sep 29, 2016

@cyberbikepunk OK, no problem. The common way to handle issues you hit in dependencies of our own tooling is to patch them and submit a PR. However, I think it will be quicker for @roll to do this himself, so yes, please file an issue and he can make a quick enhancement here.

@roll you will encounter weirdness with dates.


cyberbikepunk commented Sep 29, 2016

@roll @pwalsh do you know how a date like 01/01/2014 in an xls file can come out as 41640.0? I don't know what format this is.

pwalsh commented Sep 29, 2016

Loic, can't you start with the simpler case of CSV, just to get things working end to end first? Otherwise you will get lost in details like this.

pwalsh commented Sep 29, 2016

But the answer is that excel stores dates as floats.
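Specifically, 41640.0 is an Excel serial date: a count of days since Excel's 1900 epoch. The customary 1899-12-30 base date (as used by xlrd for the default date mode) also absorbs Excel's fictitious 1900-02-29 leap day. A stdlib-only sketch of the conversion:

```python
import datetime

# Base date for Excel's default (1900) date system; the two-day
# offset from 1900-01-01 accounts for Excel's off-by-one epoch and
# its nonexistent 1900-02-29 leap day.
EXCEL_EPOCH = datetime.date(1899, 12, 30)

def from_excel_serial(serial):
    # The integer part counts days; the fraction encodes time of
    # day, which is dropped here since only the date is needed.
    return EXCEL_EPOCH + datetime.timedelta(days=int(serial))
```

With this, 41640.0 maps back to 1 January 2014, matching the 01/01/2014 cell above.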


cyberbikepunk commented Sep 29, 2016

@pwalsh honestly, I wasn't anticipating that Excel would be this tricky. I've got the data in now, by cleaning some of the bad cells manually, and I learned an important lesson: we absolutely need this ingest-everything-as-strings feature, because the streamer fails if the data is not perfectly clean, and it never is. Most of the files we've sourced so far are Excel files.
