
DISCUSS: move online data reader subpackages to separate project? #8961

Closed
jorisvandenbossche opened this issue Dec 2, 2014 · 48 comments
Labels: Compat (pandas objects compatibility with Numpy or Python functions)

Comments

@jorisvandenbossche
Member

Opening an issue to discuss this: should we think about moving the functionality to read online data sources to a separate package?

This was mentioned recently in #8842 (comment) and #8631 (comment).

Some reasons why we would want to move it:

  • These readers often need much more urgent releases than pandas itself, to keep up with changes in the remote sources. A separate release cycle would speed up propagation of bugfixes to users.
  • More generally: it limits the scope of pandas as the core project, and encourages people to take ownership of this subproject (which is easier as a separate project?), publicize it separately, ..

Some questions that come to mind:

  • What should we move? (everything in remote_data.rst?)
    • Everything in io.data? (the DataReader function for Yahoo and Google Finance, FRED and Fama/French, and the Options class)
    • Also io.wb (World Bank data) and io.ga (the Google Analytics interface)?
  • Should we move it all to one separate project, or to multiple ones?
  • Do packages like this already exist outside of pandas? (possibilities to merge with those / collaborate / ..)
  • Who would be interested in taking this up? In being a maintainer of such a package?

Pinging some of the people who have worked on these subpackages (certainly add others if you know of them):
@dstephens99 @MichaelWS @kdiether @cpcloud @vincentarelbundock @jnmclarty
@jreback @immerrr @TomAugspurger @hayd @jtratner

@immerrr
Contributor

immerrr commented Dec 2, 2014

I was trying to think through the workflow of such a migration, and it hit me that there's one thing all those subprojects to be moved away get for free while under the umbrella of pandas: the utilities, aka the boring stuff, like setup scripts, docs, tests, benchmarks & CI infrastructure. Maybe we could think of a way to move crucial parts of those into a subrepo(-s) so that a fix in one place can be easily propagated everywhere.
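
For instance (just a sketch; the repo name is invented), the shared scaffolding could live in its own repo and be pulled into each plugin project as a submodule, so a fix lands everywhere via a one-line update:

```
# in each plugin repo: track a hypothetical shared-infrastructure repo
git submodule add https://github.com/pydata/pandas-devtools.git devtools
git commit -m "Reuse shared build/test/doc scaffolding via submodule"
# later, pick up an upstream fix to the scaffolding in one step
git submodule update --remote devtools
```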

Also, as food for thought, I think there are a lot of non-remote io packages that could benefit from moving away and being provided as plugins, e.g. the fresh and unstable stata, clipboard, maybe excel. If you ask me, I'd say pretty much every library that works through a third-party dependency could be worth separating and maintaining closer to that dependency package, by a person having experience with it. One organizational bonus of such a separation is that pandas would have a lot fewer optional dependencies.

@jreback
Contributor

jreback commented Dec 2, 2014

I am not in favor of moving any core IO packages, e.g. stata, clipboard, excel.

One of the main points of pandas is that it provides a consistent interface for IO.
Not sure having these as separate packages will help that; it would just add another dependency, where except for excel these don't actually have any deps.

@immerrr
Contributor

immerrr commented Dec 2, 2014

FTR, clipboard does have deps, they are simply hidden under a layer of pandas.util.clipboard:

  • gtk, PyQt4 and PySide modules:

```
$ grep import pandas/util/clipboard.py
#   import pyperclip
import platform, os
    import ctypes
                import gtk
                    import PyQt4 as qt4
                    import PyQt4.QtCore
                    import PyQt4.QtGui
                        import PySide as qt4
                        import PySide.QtCore
                        import PySide.QtGui
```
  • pbcopy/pbpaste and xclip/xsel commands:

```
$ grep "os\.\(popen\|system\)" pandas/util/clipboard.py
    outf = os.popen('pbcopy', 'w')
    outf = os.popen('pbpaste', 'r')
    outf = os.popen('xclip -selection c', 'w')
    outf = os.popen('xclip -selection c -o', 'r')
    outf = os.popen('xsel -i', 'w')
    outf = os.popen('xsel -o', 'r')
    xclipExists = os.system('which xclip > /dev/null') == 0
        xselExists = os.system('which xsel > /dev/null') == 0
```
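
The layer itself is essentially an import cascade - roughly like this sketch (simplified, not the actual pandas.util.clipboard code):

```python
import os

def _pick_clipboard_backend():
    # prefer a GUI toolkit if one is importable
    for mod in ('gtk', 'PyQt4', 'PySide'):
        try:
            return __import__(mod)
        except ImportError:
            pass
    # otherwise fall back to external commands (pbcopy/pbpaste, xclip, xsel)
    for cmd in ('pbcopy', 'xclip', 'xsel'):
        if os.system('which %s > /dev/null' % cmd) == 0:
            return cmd
    raise RuntimeError("no clipboard mechanism available")
```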

@jreback
Contributor

jreback commented Dec 2, 2014

@immerrr hah.. I knew you were going to say that. However, these are system-level tools for the most part and generally included on linux/mac/windows (except PyQt). So not that big of a deal.

@immerrr
Contributor

immerrr commented Dec 2, 2014

Also, it depends on what you call "core". I always thought of the pandas core as containers + time utils + split/apply/combine + high-perf queries/computations. The rest could (should) just feed data in or out via a high-throughput API.

That being said, I agree that a unified API is good, so it should be sorted out before starting to pull the pandas codebase apart. And yes, for practicality reasons, the most frequently used io modules could be considered core, too; I just don't have the data to figure out which of them are most popular.

@jorisvandenbossche
Member Author

Maybe we should limit the discussion to the remote data readers first? The other io modules are indeed much more fundamental -> those would require a more general discussion on the scope of pandas as the core project.
The remote data readers could also be a good exercise to see how smoothly this goes / what obstacles there are (things that would also come up in a broader discussion), and that will already be difficult enough to sort out now:

  • the things @immerrr raises on setup scripts, docs, tests, benchmarks & CI infrastructure. Very good points, I think. How can we facilitate this from the pandas side?
  • A better-defined internal API? (packages that build upon pandas will want to use certain utility functions, such as those in core.common)

@jreback
Contributor

jreback commented Dec 2, 2014

I think these sub-projects should absolutely use the existing tests (and the pandas testing infrastructure).
Doc building could actually be done independently via readthedocs (no cython in these projects).
Pandas could maintain links from its docs to the sub-projects, though I think they should physically be separate.

@immerrr
Contributor

immerrr commented Dec 3, 2014

I think these sub-projects should absolutely use the existing tests

It's not obvious from the phrasing; I hope you mean that they should re-use the infrastructure, e.g. the build matrix, but that they shouldn't run the tests of core pandas, nor should core pandas run the tests of its "plugin" packages. Which brings me to another issue, which I'll describe in a separate gh issue.

Speaking of the build matrix, I'd expect the plugin packages to be tested against pandas master and at least the last pandas major release. Or maybe the last two. Testing the previous release would ensure that, when rushing out a fix for a last-minute incompatible change in a newly released core version, one doesn't break the major release from several days ago and force users to upgrade both pandas-io-foo and pandas in lockstep. But then again, if one wants to use a bleeding-edge version of a pandas-io-foo package, they should be able to do the same for pandas itself.

Another issue is that I don't know of a way to include one yaml file inside another, which clashes with the fact that travis.yml must reside in the repository root. I'm not yet sure how to make travis/appveyor.yml generic enough to require minimal (ideally, no) attention after a project is bootstrapped, and yet rely on some common submodule so that exact versions can be changed with a single commit to some "shared" repo and a single submodule update. Symlinking travis.yml to a submodule might do, but I'm not sure if that works on windows.
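
For the pandas side of the matrix, each build entry could boil down to a couple of install lines (a sketch, with pandas-io-foo standing in for any plugin package):

```
# one matrix entry per pandas version under test
pip install pandas==0.15.2    # last major release (another entry: pandas master)
pip install -e .              # the pandas-io-foo package itself
nosetests pandas_io_foo       # run only the plugin's tests, not core pandas'
```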

@femtotrader

Hello,

I've been working on a more object-oriented approach to DataReader than the current code,

so it can use either urlopen or requests (for now I'm only using requests, as my goal was to use requests-cache) #8713

see https://github.com/femtotrader/pandas_datareaders
Sorry, I wasn't aware of this issue.

This is just a friendly "fork", please excuse me.

I would be very pleased to have your feedback on this.

@hayd
Contributor

hayd commented Dec 4, 2014

+1, it sucks that someone (stuck in the dark ages) using 0.10 or something can't use the data readers (if something's changed in the api since then) without upgrading and potentially breaking/changing lots of their code.

I suspect for the most part we're not using any wild/deprecated commands that aren't tested elsewhere in the pandas/datafeed codebase... so IMO testing against master isn't actually that important (and not doing so makes things much easier); just add the pandas version - e.g. 0.16.rc1 - to travis for the package right before a pandas release... and see if anything breaks. We could potentially provide a way to test all these packages from within pandas (for a while we should keep the tests around in pandas), but tbh I don't think this is so important: a failure right before a release would be a big edge case (that's why we have RCs), and all you'd have to do is fix up the package and release a new version.

For the backwards compat, these have to be dependencies of (and depend on*) pandas, right?

I think the easiest is to add this right here under the pydata umbrella group, or create another group here; that way it seems more "official".

*though it may be interesting to lift that direction and have pandas as a soft dependency (like https://github.com/changhiskhan/poseidon does).

Does this make sense?


@femtotrader did you copy and paste the classes and tests from pandas, or is this something different? (I worry a little about adding a requests dep at the same time as migrating, but +1 on using requests.)

@femtotrader

@hayd this is something different, as (except for the Yahoo Options class) in data.py https://github.com/pydata/pandas/blob/master/pandas/io/data.py almost everything is a function... no classes. Some people consider that we shouldn't depend on requests; I think that with these classes we can either depend on requests if we want, or use urlopen (from urllib or urllib2).
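
To illustrate the idea (a sketch with invented names, not the pandas_datareaders code verbatim): a small base class owns the reading step and each subclass picks the transport, so requests stays optional:

```python
class BaseReader(object):
    """Fetches raw text for a URL; subclasses choose the HTTP transport."""
    def read(self, url, params=None):
        raise NotImplementedError

class UrlopenReader(BaseReader):
    # stdlib-only transport: urllib2 on Python 2, urllib.request on Python 3
    def read(self, url, params=None):
        try:
            from urllib import urlencode        # Python 2
            from urllib2 import urlopen
        except ImportError:                     # Python 3
            from urllib.parse import urlencode
            from urllib.request import urlopen
        if params:
            url = "%s?%s" % (url, urlencode(params))
        return urlopen(url).read()

class RequestsReader(BaseReader):
    # optional transport; combines naturally with requests-cache
    def read(self, url, params=None):
        import requests
        return requests.get(url, params=params).text
```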

@jnmclarty
Contributor

One important consideration is the recruiting effect on newer contributors. I have points on each of the three options, speaking as a rookie myself. I comment on the docs as well, since @jreback pulled that into this thread.

Carve out Documentation
  • Barriers to Entry Reduced. Stripping out the need to compile the most recent commit in order to help fix typos or clarity makes complete sense. I would be more likely to contribute to the docs for 0.16 if I could leave 0.14 installed.
My recommendation: If anything gets carved out, documentation should be the highest priority.
Carve out Data Sources
  • Barriers to Entry Mixed. For new maintainers, barriers are higher, as mentioned above. For new contributors, entry barriers are lower, for similar reasons as the Documentation.
  • As others have said, +1 to finding a way to use the existing pandas framework for the boring stuff.
My recommendation: Add a note to the docs advising that maintainers are wanted. And, as they step up, the readers can be carved out one by one. For instance, I could likely learn, and volunteer, to be a maintainer for the World Bank stuff, but I wouldn't necessarily want to be a maintainer for the other stuff.
Carve out IO
  • Barriers to Entry Increased. My (potentially wrong) perception is that helping on something like this would be more complicated if it were a separate project. I would be dissuaded from trying. I would be intimidated by the implications of version mismatch, e.g. "For this commit, how do I test against all the necessary, and next, versions of pandas? That sounds hard."
My recommendation: It has its benefits, but I think the costs outweigh them. I think the "dark ages" reference above has merit, but it is trumped by how complicated pandas would become.

@hayd
Contributor

hayd commented Dec 6, 2014

@jnmclarty The problem with carving out the docs is IMO that docs and code are "hard-linked" (changes in the code api mean changing the docs... at the same time). I think @jreback is talking about docs for these sub-projects (not the pandas docs in general/core).

Note: you can change the docs without compiling, you can even do it within github!

I envision users of older pandas (dark ages) being able to monkey-patch (depending on their code):

```python
# import pandas.io.data as web  # previous way (and still works in newer pandas, but with latest data package)
import pandas_data_readers as web  # becomes this, to work in older pandas / use the latest data package
```
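
Code that has to run against both old and new pandas could also feel its way (a sketch; the package name follows the example above and isn't final):

```python
try:
    import pandas_data_readers as web   # standalone package, updated independently
except ImportError:
    import pandas.io.data as web        # readers still bundled with older pandas
```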

This reminds me of @jreback's presentation "pandas is pydata middleware":

Ripping it out and seeing what happens: https://github.com/hayd/pandas_data_readers (it works without change for pandas 0.13-0.15; there are a few compat import issues pre-0.13, see https://travis-ci.org/hayd/pandas_data_readers/builds/43218562).

Edit: it may be easy to also break apart the compat module - that would make fixing earlier versions easy - not sure if that's feasible though.

@jnmclarty
Contributor

@hayd Fair point on the docs. I admit I'm relatively new to OS planning concepts like this. I do find it all pretty interesting.

@bashtage
Contributor

bashtage commented Jan 6, 2015

I don't think stata should be considered unstable. It is heavily tested, written against the Stata file format spec, and hasn't had a show-stopper bug* in a number of releases (mostly small things, like increasing file size in a read/write cycle).

*Excluding big-endian issues, which probably matter only to users of big-endian systems, assuming those exist.

@femtotrader

For docs, "Read the Docs" is very convenient: https://readthedocs.org/. With a webhook, the docs are rebuilt server-side whenever you commit changes.

@jreback added the Compat (pandas objects compatibility with Numpy or Python functions) and Data Reader labels Jan 12, 2015
@jreback
Contributor

jreback commented Jan 12, 2015

@femtotrader and for a datareader package that will probably work. Readthedocs won't handle a full build that requires things like compiling (that's the reason pandas does not host its docs this way).

@hayd
Contributor

hayd commented Jan 12, 2015

@jreback how would readthedocs for a datareader package sit in the pandas docs (url-wise)? Tbh we could just keep the docs as they are on the pandas side; mostly, updates to the datareaders are just fixes for external APIs?

From the (pretty trivial) exercise I did splitting out the datareaders, I think we should split it - it means people can use "old" pandas with up-to-date (working) datareaders. I think we should do the same for pandas_compat. Happy to put a PR together and see if it just works™.

Not sure how this fits with @femtotrader's thoughts?

@femtotrader

You really want my opinion? ;-)

My DataReader version is better because it supports caching, and "that's useful for me"™ ;-) (but also for some other people)

but yes, I think we should split the datareaders into a separate GitHub project and a separate pip package, with separate testing and continuous integration, and an easy doc build using readthedocs,
so updating this fast-moving part will be easier; and have a link in the pandas docs for the datareaders

Maybe we should do this in 2 steps.

First move and keep exactly the same code.
Then add my wonderful cache features in a second step ;-)
@twiecki seems interested for zipline, see quantopian/zipline#398 (comment)

That's just my opinion.

PS: I can grant access to either http://pandas-datareaders.readthedocs.org/en/latest/ or https://github.com/femtotrader/pandas_datareaders
that's not a problem

@jreback
Contributor

jreback commented Jan 13, 2015

ok, let's create another package under pydata.

  • name: pandas-datareader, datareader? (it will be under pydata/..... but the actual package name will be user-visible)
  • I will add the pandas devs and other interested parties as committers
  • would suggest that 0.1 be exactly the same as the pandas tests/docs/API
  • would suggest that any api transitions be backward-compat for a while
  • this should be a released package before any changes to pandas
  • mark any relevant issues here and open corresponding ones in datareader; if these are bug fixes, then leave them open in pandas as well (for now)
  • if you want to add any deps, e.g. requests, that would be up to the new package

@jreback added this to the 0.17.0 milestone Jan 13, 2015
@femtotrader

I also forgot to say that I can also grant access to https://pypi.python.org/pypi/pandas_datareaders (even ownership, to @wesm and @jreback, because they are the pandas pypi package owners). I just want this part of pandas to be improved.

Please just tell me what to do now. I will be able to help Friday and part of next week.

Maybe we could move the doc first.

@hayd
Contributor

hayd commented Jan 13, 2015

@femtotrader what's the backwards-compat situation with using pandas_datareaders vs pandas?

That is: what we want, IMO, is to just drop pandas_datareaders (or whatever) into pandas.io; that way we don't break users' code (whilst it's independent, for now they must replace from pandas.io import ... with from pandas_datareader import ... and it should "just work"). If we do that, does your lib pass the tests?

Make sense?

(also, does your lib work on 3.X? you should add that to travis.)

IMO the docs can come last, since if we were to change them it could remain completely behind the scenes. I guess I don't know how much of the API you've changed/improved, and so what a good migration strategy is.

@hayd
Contributor

hayd commented Jan 15, 2015

Needs some readme + docs / readthedocs love...

I totally missed ga.py and gbq.py; should these be included? Should anything else?

@jorisvandenbossche
Member Author

I can look at the docs if you want.

But some other things to discuss:

  • How do we see its future in pandas?:
    • @hayd I have the impression you are suggesting that we still include the datareaders in pandas (at a certain version), but that users, if they want the latest updates, can install the separate package? Is that correct?
      And how would that be done? (by adding it as a git submodule?)
    • I originally had more the idea that we would deprecate the functions in pandas and point users to the new package, so that after some transition period everybody uses the separate package and the code can be removed from pandas (a sketch of such a deprecation follows this list).
  • What do we do with the other mentioned packages? (at least ga.py and gbq.py are mentioned above)
    • I personally would beware of making the scope of pandas-datareader too broad. We split this off because pandas is becoming too big, so let's keep the new package focused (which means we should actually try to define what this focus is). That said, I can't really judge ga.py, but I would say that gbq.py does not really belong in the new package, as it is really a storage platform to write to / read from?
  • Maybe we should put something on the mailing list about this?
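
For the deprecation route, the shim in pandas.io.data could be as small as this (an illustrative sketch only; the exact message and version are to be decided):

```python
# pandas/io/data.py during the transition period
import warnings

warnings.warn("pandas.io.data is deprecated and will be removed in a future "
              "version; install and use the pandas-datareader package instead",
              FutureWarning, stacklevel=2)
```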

@hayd
Contributor

hayd commented Jan 16, 2015

@jorisvandenbossche No, I was thinking of it as a dependency, e.g. have pandas.io.data = pandas_datareader.data; these can/would be updated by the user. That way there's no need to deprecate/break code (but we still remove the code from pandas).

These are the parts of pandas which periodically break due to external API changes; I kind of think they (data, wb, ga, gbq) belong together (though I share the concern about making it too big; maybe they can be separate - I don't think it matters too much, they're pretty stable atm).

Similarly, I think extracting pandas.compat could be useful for other packages (which depend on pandas).

Edit: Thinking about it, I agree gbq doesn't belong in pandas_datareader. But I think it could easily live in a separate package.

@femtotrader

In my mind ga.py and gbq.py should also be part of pandas-datareader.

I also think that pandas.io.data = pandas_datareader.data, with the datareader as a dependency, is a good idea, so it won't break any code for now (but I'm not a specialist)

@jreback
Contributor

jreback commented Jan 16, 2015

all of the modules moved to pandas_datareader pull from named web sources. I could buy ga.py, as this is similar, though it pulls from 'private' sources.

gbq.py OTOH is much more like sql/hdf/excel, in that it provides a 2-way connection to the data. Yes, it is web-based, but that is incidental.

@jorisvandenbossche
Member Author

I agree with @jreback on gbq.py, which is what I tried to say above. There is a clear difference between reading from existing remote data sources and writing to / reading from a storage place (even if it is remote).

About having pandas.io.data = pandas_datareader.data and then having pandas_datareader as a dependency: I am not fully convinced. Some concerns:

  • I don't know if we want more dependencies (even if this one is optional; we already have a huge list of those)
  • Isn't it clearer to users to just have it as an explicitly separate package? They will also have to update it separately, etc. E.g. you would get people updating their pandas and asking why the datareader was not updated.
  • What is the advantage of having it like this? That they don't have to change their import statements. But a one-time change of the import statement from from pandas.io import data to from pandas_datareader import data does not seem like much of a problem if we document/communicate it clearly.

@jreback
Contributor

jreback commented Jan 25, 2015

see the issue here: pydata/pandas-datareader#15

In a nutshell: I think it's important that the first release (say 0.1) of pandas-datareader and 0.16.0 be basically identical. Then pandas-datareader can update at its own pace. And at some point (say 0.17.0) pandas will not ship the data reader stuff and will instead show a nice error message telling you to import pandas-datareader.

@hayd
Contributor

hayd commented Jan 26, 2015

@jreback any reason for pandas not to do the import for you going forward?

```python
pandas.io.web = data_reader.web
```

Either as a dependency or as a soft dependency (e.g. throw an error that you need to pip install pandas-datareader to use pandas.io.web)? (One aim of pandas-datareader is to be backwards compatible and work on older versions of pandas.)
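
A soft dependency could then look something like this (a sketch; the module layout is hypothetical):

```python
# pandas/io/data.py once the code has moved out
try:
    from pandas_datareader.data import *   # re-export the moved readers
except ImportError:
    raise ImportError("pandas.io.data now lives in the pandas-datareader "
                      "package; install it with "
                      "`pip install pandas-datareader`")
```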

@jreback
Contributor

jreback commented Jan 26, 2015

I think a soft dep is fine,
but iirc the idea was to basically leave the code completely as-is till, say, 0.17, and then change it?

@hayd
Contributor

hayd commented Jan 26, 2015

@jreback sounds good!

@jorisvandenbossche
Member Author

Another question: what about the docs? (I started on that in pydata/pandas-datareader#18, but was not fully sure about the path to follow)

Should they also stay in the pandas docs? Or should we refer to the pandas-datareader docs? Which import do we use there? The 'real' pandas_datareader one, or pandas.io.data?
In any case, pandas-datareader itself should have docs, and having a duplicate version in pandas is not a good way to go, I think.
I still think that keeping pandas.io.data as the import path in the long run (also when the code is removed from the pandas codebase itself) will only cause confusion.

@jreback
Contributor

jreback commented Aug 26, 2015

closed by #10870
