DISCUSS: move online data reader subpackages to separate project? #8961
Comments
|
I was trying to think through the workflow of such a migration, and it hit me that there's one thing all those subprojects to be moved away get for free under the umbrella of pandas: the utilities, aka the boring stuff, like setup scripts, docs, tests, benchmarks & CI infrastructure. Maybe we could think of a way to move the crucial parts of those into a subrepo(-s) so that a fix in one place can be easily propagated everywhere. Also, as food for thought, I think there are a lot of non-remote |
|
I am not in favor of moving any core IO packages, e.g. stata, clipboard, excel. One of the main points of pandas is that it provides a consistent interface for IO. |
|
FTR,
|
|
@immerrr hah.. I knew you were going to say that. However, these are system-level for the most part and generally included on linux/mac/windows (except PyQt). So not that big of a deal. |
|
Also, it depends on what you call "core". I always thought of the pandas core as being containers + time utils + split/apply/combine + high-perf queries/computations. The rest could (should) just feed data in or out via a high-throughput API. That said, I agree that a unified API is good, so it should be sorted out before starting to pull the pandas codebase apart. And yes, for practicality reasons, the most frequently used io modules could be considered core too, I just don't have the data to figure out which of them are most popular. |
|
Maybe we should limit the discussion first to the remote data readers? As the other
|
|
I think absolutely these sub-projects should use the existing tests (and pandas testing infrastructure). |
It's not obvious from the phrasing, but I hope you mean that they should re-use the infrastructure (e.g. the build matrix): they shouldn't run core pandas tests, nor should core pandas run the tests of its "plugin" packages. Which brings me to another issue that I'll describe in a separate GH issue.

Speaking of the build matrix, I'd expect the latter to be tested against pandas master and at least the last pandas major release. Or maybe two: the second one would ensure that, when rushing a fix for a last-minute incompatible change in a newly released core version, one doesn't break the latest major release from several days ago and force users to upgrade both pandas-io-foo and pandas in lockstep. But then again, if one wants to use a bleeding-edge version of a pandas-io-foo package, they should be able to do the same for pandas itself.

Another issue is that I don't know of a way to include one yaml file inside another, which compounds the fact that travis.yml must reside in the repository root. I'm not yet sure how to make travis/appveyor.yml generic enough to require minimal (ideally, no) attention after a project is bootstrapped, and yet rely on some common submodule so that exact versions can be changed with a single commit to some "shared" repo and a single submodule update. Symlinking travis.yml to a submodule might do, but I'm not sure if that works on windows. |
jorisvandenbossche
referenced
this issue
Dec 3, 2014
Closed
Google Finance DataReader returns columns with object type instead of float64 #8980
femtotrader
commented
Dec 4, 2014
|
Hello, I've been working on a more object-oriented approach to DataReader than the current code; see https://github.com/femtotrader/pandas_datareaders. This is just a friendly "fork", please excuse me. I would be very pleased to have your feedback on it. |
femtotrader
referenced
this issue
in quantopian/zipline
Dec 4, 2014
Closed
Add Google Finance as intraday DataReader #215
|
+1, it sucks that someone (stuck in the dark ages) using 0.10 or something can't use data readers (if something's changed in the api since then) without upgrading and potentially breaking/changing lots of their code. I suspect for the most part the datafeed codebase isn't using any wild/deprecated calls that aren't tested elsewhere in pandas... so IMO testing against master isn't actually that important (and not doing so makes things much easier), just add the pandas version.

For backwards compat these have to be dependencies of (and depend on*) pandas, right? I think the easiest is to add this right here in the pydata umbrella group, or create another group here; that way it can seem more "official". *though it may be interesting to lift that restriction, and have pandas as a soft dependency (like https://github.com/changhiskhan/poseidon does). Does this make sense?

@femtotrader did you copy and paste the classes and tests from pandas or is this something different? (I worry a little about changing the requests dep at the same time as migrating, but +1 on using requests). |
femtotrader
commented
Dec 4, 2014
|
@hayd this is something different, as (except Yahoo Options) in data.py https://github.com/pydata/pandas/blob/master/pandas/io/data.py almost everything is a function... no classes. Some people consider that we shouldn't depend on |
|
One important consideration is the recruiting effect on newer contributors. I have points on each of the three, speaking as a rookie myself. I comment on the docs as well, as @jreback pulled that into this thread.

Carve out Documentation
My recommendation: If anything gets carved out, documentation should be the highest priority.

Carve out Data Sources
My recommendation: Add a note to the docs advising that maintainers are wanted. And, as they step up, one by one they can be carved out. For instance, I could likely learn, and volunteer, to be a maintainer for the World Bank stuff, but I wouldn't necessarily want to be a maintainer for the other stuff.

Carve out IO
My recommendation: It has its benefits, but I think the costs outweigh them. I think the "dark ages" reference above has merit, but it is trumped by how complicated pandas would become. |
|
@jnmclarty The problem with carving out docs is IMO that docs and code are "hard-linked" (changes in the code api mean changing the docs... at the same time). I think @jreback is talking about docs for these sub-projects (not pandas docs in general/core). Note: you can change the docs without compiling, you can even do it within github! I envision users of older pandas (dark ages) being able to monkey-patch (depending on their code):
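A minimal sketch of that monkey-patching idea (illustrative only: the module and function names below are stand-ins, not the actual split-out package's API). The trick is registering the standalone module in `sys.modules` under the legacy `pandas.io.data` path, so old import statements resolve to the new code:

```python
import importlib
import sys
import types

# Stand-in for the split-out readers package; in real use this would be
# the actual standalone module, not a stub built here.
standalone = types.ModuleType("standalone_readers")
standalone.DataReader = lambda name, source: "fetched %s from %s" % (name, source)

# Ensure the parent packages resolve, then register the standalone module
# under the legacy import path.
for pkg in ("pandas", "pandas.io"):
    sys.modules.setdefault(pkg, types.ModuleType(pkg))
sys.modules["pandas.io.data"] = standalone

# Legacy code importing the old path now gets the new module.
legacy = importlib.import_module("pandas.io.data")
print(legacy.DataReader("AAPL", "yahoo"))  # -> fetched AAPL from yahoo
```

Since `sys.modules` is checked before any loader runs, this works regardless of the pandas version installed.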
This reminds me of @jreback's presentation "pandas is pydata middleware". Ripping it out and seeing what happens: https://github.com/hayd/pandas_data_readers (it works without change for pandas 0.13-0.15; there are a few compat import issues pre-0.13, see https://travis-ci.org/hayd/pandas_data_readers/builds/43218562). Edit: it may be easy to also break apart the compat module, which would make fixing earlier versions easy; not sure if that's feasible though. |
|
@hayd Fair point, on the docs. I admit, I'm relatively new to OS planning concepts like this. I do find it all pretty interesting. |
This was referenced Dec 11, 2014
|
I don't think *Excluding big endian issues, which are probably not important for users of big endian systems, assuming these exist. |
femtotrader
referenced
this issue
in femtotrader/pandas_datareaders_unofficial
Jan 12, 2015
Closed
PR to pandas #3
femtotrader
commented
Jan 12, 2015
|
For docs, "Read the Docs" is very convenient: https://readthedocs.org/. With a webhook you can build the docs server-side whenever you commit changes. |
jreback
added Compat Data Reader
labels
Jan 12, 2015
|
@femtotrader and for a |
|
@jreback how would readthedocs for a datareader package sit in the pandas docs (url-wise)? Tbh we could just keep the docs as they are on the pandas side; mostly updates to datareaders are just fixes to external APIs? From the (pretty trivial) exercise I did splitting out the datareaders, I think we should split it: it means people can use "old" pandas and up-to-date (working) datareaders. I think we should do the same for

Not sure how this fits with @femtotrader's thoughts? |
femtotrader
commented
Jan 13, 2015
|
You really want my opinion ;-) But yes, I think we should split the datareaders into a separate GitHub project with a separate pip package, separate testing and continuous integration, and an easy doc build using readthedocs. Maybe we should do this in 2 steps: first move it and keep exactly the same code. That's just my opinion. PS: I can grant access to either http://pandas-datareaders.readthedocs.org/en/latest/ or https://github.com/femtotrader/pandas_datareaders |
|
ok, let's create another package under pydata.
|
jreback
added this to the
0.17.0
milestone
Jan 13, 2015
femtotrader
commented
Jan 13, 2015
|
I also forgot to say that I can also grant access to https://pypi.python.org/pypi/pandas_datareaders (even ownership to @wesm and @jreback because they are the pandas pypi package owners). I just want this part of pandas to be improved. Please just tell me what to do now. I will be able to help Friday and part of next week. Maybe we could first move the doc. |
|
@femtotrader what's the backwards compat situation with using pandas_datareaders vs pandas? That is: what we want IMO is to just drop pandas_datareaders (or whatever) into pandas.io; that way we don't break users' code (whilst it's independent / for now they must replace the import). Make sense? (also, does your lib work on 3.X? you should add that to travis.) IMO docs can be last, since if we were to change this it could remain completely behind the scenes. I guess I don't know how much API you've changed/improved and so what a good migration strategy is. |
|
FWIW, I have often thought about the fact that pandas is turning into a monolithic "kitchen-sink" package, and it might make sense to componentize a bit (data readers/writers is an obvious one) so that folks can get new data sources without having to deal with upgrading their code bases to account for little API changes (typically where they were accidentally relying on some undocumented, unspecified, or untested behavior) that break their code. Now, that's happening less and less. And nothing stops you from having all the |
femtotrader
commented
Jan 14, 2015
|
@hayd my lib won't pass the tests because in my lib
Another example of my DataReader object is here
Maybe that wasn't a good idea... In a second step, we will be able to add a cache mechanism anyway by passing
What is your opinion about it? |
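The session-passing idea being discussed could look something like this pure-Python sketch, with an in-memory cache standing in for a requests/requests-cache session (all names here are illustrative, not the actual API):

```python
class CachingSession:
    """Memoize responses by URL so a reader that accepts ``session=``
    gets caching for free (stand-in for a requests-cache session)."""

    def __init__(self, fetcher):
        self._fetcher = fetcher  # callable(url) -> response body
        self._cache = {}
        self.misses = 0          # count of real fetches, for illustration

    def get(self, url):
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetcher(url)
        return self._cache[url]


def data_reader(symbol, session):
    # Hypothetical reader that takes the session as a parameter, so
    # caching happens transparently inside it.
    return session.get("http://example.invalid/quote?s=%s" % symbol)


session = CachingSession(fetcher=lambda url: "payload for %s" % url)
first = data_reader("AAPL", session)
second = data_reader("AAPL", session)  # served from cache, no second fetch
assert first == second and session.misses == 1
```

The design point is that the reader stays cache-agnostic: swapping in an uncached session (or a real HTTP session) requires no change to the reader itself.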
|
Ok, let's do this. There have been 2 patches since I migrated it a month ago, so I will do the following.
Then @femtotrader can add in what you've learned from making pandas_datareaders (hopefully with minimal/no breaking API changes). Caching + sessions sound great! I'm not sure we can use |
femtotrader
commented
Jan 14, 2015
|
You can take the project on PyPI, it doesn't matter to me; we just need to use a version number > 0.0.2 for the official package. My (very few) users will have to use version 0.0.2 of my DataReader, or wait for the 0.2 official pandas datareader (because there will be no cache in 0.1). |
femtotrader
commented
Jan 14, 2015
|
I've just added @wesm and @jreback as owners of https://pypi.python.org/pypi/pandas_datareaders |
|
On the name, I prefer "pandas-datareader" over "pandas_datareader", and then e.g. just |
|
@hayd What do you mean by "make similar lib for compat"? And by "Replace on pandas"? -> I think we just leave it as it is for now, and if everything is in place just put a deprecation warning on it (and then remove it in a future release)? On "Get datareaders working on older pandas...", I think support down to 0.11 (is that where it is now?) is already very nice (if going further would be much effort). |
|
@jorisvandenbossche sorry, the current failure is on 0.12 and earlier (see https://travis-ci.org/hayd/pandas_data_readers/builds/43218562); it's basically just pandas.compat not having certain functions back then. If we make pandas-compat a separate package we wouldn't have that issue (and I think it would work on 0.10+ for free). I don't see the reason to not just include it; what's the value in deprecate/remove? |
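For flavor, a sketch of the kind of shim a separate pandas-compat package could provide: re-export a helper when the installed pandas has it, and define it locally otherwise, so datareader code runs unchanged on old versions. (`string_types` really existed in older `pandas.compat`; the fallback here is an illustrative assumption.)

```python
import sys

try:
    # Present in older pandas (removed in later releases).
    from pandas.compat import string_types
except ImportError:
    # Fallback for pandas versions (or environments) lacking it.
    string_types = (str,) if sys.version_info[0] >= 3 else (basestring,)  # noqa: F821
```

Downstream code then just does `isinstance(x, string_types)` without caring which pandas version is installed.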
|
I agree with joris here I will setup pandas-datareader later in pydata then I would suggest that the first version just be the exact version that pandas has if someone wants to modify it later on a PR to replace the current pandas calls with the new pandas-datareader can be done in 0.16 after the new repo is up tested and released on pypi / conda |
|
I created this new repo: https://github.com/pydata/pandas-datareader |
@femtotrader Uncanny timing. |
|
I'm going to force-push my repo to master, hope no-one complains too much... I can correct (whatever is the correct thing to do) if it's an issue. @jreback could you look over the setup.py, change the author etc. and push to pypi? I've fixed the tests and updated with recent commits, except f88f15b, as it's a (small) API change so it wasn't clear what to do there. Should probably just apply it. |
femtotrader
commented
Jan 15, 2015
|
I renamed my package |
|
Needs some readme + docs/ readthedocs love... I totally missed |
|
I can look at the docs if you want. But some other things to discuss:
|
|
@jorisvandenbossche No, I was thinking of it as a dependency, e.g. have
These are
Similarly, I think extracting out pandas.compat could be useful for other packages (which depend on pandas). Edit: Thinking about it, I agree gbq doesn't belong in pandas_datareader. But I think it could easily live in a separate package. |
femtotrader
commented
Jan 16, 2015
|
In my mind I also think that |
|
all of the modules moved to
|
|
agree with @jreback on About having
|
|
see issue here: pydata/pandas-datareader#15 in a nutshell. I think it's important that the first release (say 0.1) of |
|
@jreback any reason for pandas not to do the import for you going forward?
Either as a dependency or as a soft dependency (e.g. throw an error that you need to |
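The soft-dependency option could be sketched like this (illustrative; the function name and error message are assumptions, not what pandas ended up shipping):

```python
import importlib


def get_data_module(name="pandas_datareader"):
    """Import the external readers package, raising a helpful error
    telling the user how to install it when it's missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            "%s is required for remote data access; "
            "install it with 'pip install %s'" % (name, name)
        ) from exc
```

The core package carries no hard dependency, but users who call the data-reading entry points get a clear, actionable error instead of a bare ImportError.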
|
I think a soft dep is fine |
|
@jreback sounds good! |
|
Another question: what about the docs? (I started with that in pydata/pandas-datareader#18, but was not fully sure about the path to follow.) Should they also stay in the pandas docs? Or refer to the pandas-datareader docs? Which import do we use there? The 'real' |
This was referenced Jan 30, 2015
jreback
modified the milestone: 0.17.0, Next Major Release
Mar 26, 2015
jreback
modified the milestone: 0.17.0, Next Major Release
Jul 17, 2015
This was referenced Aug 11, 2015
|
closed by #10870 |
jorisvandenbossche commented Dec 2, 2014
Opening an issue to discuss this: should we think about moving the functionality to read online data sources to a separate package?
This was mentioned recently here pydata#8842 (comment) and pydata#8631 (comment)
Some reasons why we would want to move it:
Some questions that come to mind:
`io.data` (the `DataReader` function for Yahoo and Google Finance, FRED and Fama/French, and the `Options` class), `io.wb` (World Bank data) and `io.ga` (Google Analytics interface)?
Pinging some of the people who have worked on these subpackages (certainly add others if you know):
@dstephens99 @MichaelWS @kdiether @cpcloud @vincentarelbundock @jnmclarty
@jreback @immerrr @TomAugspurger @hayd @jtratner