Merge branch 'master' into debian
* master: (313 commits)
  TST: more Python 2.5 sadness
  TST: Python 2.5 float formatting changed
  TST: cast to i8 when checking margins
  BUG: DataFrame.join on keys produce wrong result, does not preserve order
  DOC: release notes
  ENH: xs level can take multiple levels, pass multiple levels to MultiIndex.droplevel, GH pandas-dev#371
  BUG: fix bugs related to comments in pandas-dev#371
  BUG: fix TextParser with list buglet, enable parsing of DataFrame output with index names
  BUG: convert tuples in concat to MultiIndex
  BUG: don't lose index names when adding row margin
  ENH: add margins to crosstab
  ENH: add crosstab function and test
  ENH: crosstab prototype function, API needs fleshing out, GH pandas-dev#170
  BUG: fix buglet with xs with level, GH pandas-dev#371
  TST: add test_sql.py module
  TST: testing, cleanup of io.sql module
  TST: indexing testing with minor Series.__getitem__ refactoring
  ENH: hack toward pandas-dev#629
  BUG: check for non-contiguous memory in SeriesGrouper, causing segfault
  ENH: add ability to pass list of dicts to DataFrame.append (GH pandas-dev#464)
  ...
yarikoptic committed Jan 17, 2012
2 parents 2ca93a1 + 195ec30 commit 77c017f
Showing 139 changed files with 16,081 additions and 4,680 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -11,6 +11,9 @@ MANIFEST
pandas/version.py
doc/source/generated
doc/source/_static
doc/source/vbench
doc/source/vbench.rst
*flymake*
scikits
.coverage
.coverage
pandas.egg-info
219 changes: 219 additions & 0 deletions RELEASE.rst
@@ -22,6 +22,224 @@ Where to get it
* Binary installers on PyPI: http://pypi.python.org/pypi/pandas
* Documentation: http://pandas.sourceforge.net

pandas 0.7.0
============

**Release date:** NOT YET RELEASED

**New features / modules**

- New ``merge`` function for efficiently performing full gamut of database /
relational-algebra operations. Refactored existing join methods to use the
new infrastructure, resulting in substantial performance gains (GH #220,
#249, #267)
- New ``concat`` function for concatenating DataFrame or Panel objects along
an axis. Can form union or intersection of the other axes. Improves
performance of ``DataFrame.append`` (#468, #479, #273)
- Handle differently-indexed output values in ``DataFrame.apply`` (GH #498)
- Can pass list of dicts (e.g., a list of shallow JSON objects) to DataFrame
constructor (GH #526)
- Add ``reorder_levels`` method to Series and DataFrame (PR #534)
- Add dict-like ``get`` function to DataFrame and Panel (PR #521)
- ``DataFrame.iterrows`` method for efficiently iterating through the rows of
a DataFrame
- Added ``DataFrame.to_panel`` with code adapted from ``LongPanel.to_long``
- ``reindex_axis`` method added to DataFrame
- Add ``level`` option to binary arithmetic functions on ``DataFrame`` and
``Series``
- Add ``level`` option to the ``reindex`` and ``align`` methods on Series and
DataFrame for broadcasting values across a level (GH #542, PR #552, others)
- Add attribute-based item access to ``Panel`` and add IPython completion (PR
#554)
- Add ``logy`` option to ``Series.plot`` for log-scaling on the Y axis
- Add ``index``, ``header``, and ``justify`` options to
``DataFrame.to_string``. Add option to (GH #570, GH #571)
- Can pass multiple DataFrames to ``DataFrame.join`` to join on index (GH #115)
- Can pass multiple Panels to ``Panel.join`` (GH #115)
- Can pass multiple DataFrames to ``DataFrame.append`` to concatenate (stack)
and multiple Series to ``Series.append`` too
- Added ``justify`` argument to ``DataFrame.to_string`` to allow different
alignment of column headers
- Add ``sort`` option to GroupBy to allow disabling sorting of the group keys
for potential speedups (GH #595)
- Can pass MaskedArray to Series constructor (PR #563)
- Add Panel item access via attributes and IPython completion (GH #554)
- Implement ``DataFrame.lookup``, fancy-indexing analogue for retrieving
values given a sequence of row and column labels (GH #338)
- Add ``verbose`` option to ``read_csv`` and ``read_table`` to show number of
NA values inserted in non-numeric columns (GH #614)
- Can pass a list of dicts or Series to ``DataFrame.append`` to concatenate
multiple rows (GH #464)
- Add ``level`` argument to ``DataFrame.xs`` for selecting data from other
MultiIndex levels. Can take one or more levels with potentially a tuple of
keys for flexible retrieval of data (GH #371, GH #629)
- New ``crosstab`` function for easily computing frequency tables (GH #170)
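
As a rough illustration of the ``merge``, ``concat`` and ``crosstab`` entries above, here is a minimal sketch. It is written against a present-day pandas, so exact keyword names and defaults in 0.7.0 may differ; the frames and column names are made up for the example.

```python
import pandas as pd

# Two small illustrative frames sharing a "key" column (hypothetical data)
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# Database-style joins: inner keeps matching keys only, outer keeps all keys
inner = pd.merge(left, right, on="key")
outer = pd.merge(left, right, on="key", how="outer")

# Concatenate along the row axis, keeping only the columns common to both
stacked = pd.concat([left, right], join="inner", ignore_index=True)

# Frequency table of two categorical columns, with row/column totals
df = pd.DataFrame({"A": ["x", "x", "y"], "B": ["u", "v", "u"]})
tab = pd.crosstab(df["A"], df["B"], margins=True)
```

Here ``inner`` has one row per key present in both frames, ``outer`` has one row per key present in either, and ``tab`` gains an ``All`` row and column holding the totals.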

**API Changes**

- Label-indexing with integer indexes now raises KeyError if a label is not
found instead of falling back on location-based indexing
- Label-based slicing via ``ix`` or ``[]`` on Series will now only work if
exact matches for the labels are found or if the index is monotonic (for
range selections)
- Label-based slicing and sequences of labels can be passed to ``[]`` on a
Series for both getting and setting (GH #86)
- ``[]`` operator (``__getitem__`` and ``__setitem__``) will raise KeyError
with integer indexes when a key is not contained in the index. The prior
behavior would fall back on position-based indexing if a key was not found
in the index, which would lead to subtle bugs. This is now consistent with
the behavior of ``.ix`` on DataFrame and friends (GH #328)
- Rename ``DataFrame.delevel`` to ``DataFrame.reset_index`` and add
deprecation warning
- ``Series.sort`` (an in-place operation) called on a Series which is a view on
a larger array (e.g. a column in a DataFrame) will generate an Exception to
prevent accidentally modifying the data source (GH #316)
- Refactor to remove deprecated ``LongPanel`` class (PR #552)
- Deprecated ``Panel.to_long``, renamed to ``to_frame``
- Deprecated ``colSpace`` argument in ``DataFrame.to_string``, renamed to
``col_space``
- Rename ``precision`` to ``accuracy`` in engineering float formatter (GH
#395)
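
The integer-label change described above can be sketched as follows. This assumes a present-day pandas, where the same rule still holds for a Series with an integer index; the values are illustrative only.

```python
import pandas as pd

# Series whose index holds integer labels, none of which is 0
s = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])

# Lookup by label succeeds
value = s[20]

# 0 is not a label, so [] raises KeyError rather than
# silently returning the element at position 0
try:
    s[0]
    fell_back = True
except KeyError:
    fell_back = False
```

With the older fallback behavior, ``s[0]`` would have returned the first element, which is the kind of subtle bug the note refers to.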

**Improvements to existing features**

- Better error message in DataFrame constructor when passed column labels
don't match data (GH #497)
- Substantially improve performance of multi-GroupBy aggregation when a
Python function is passed, reuse ndarray object in Cython (GH #496)
- Can store objects indexed by tuples and floats in HDFStore (GH #492)
- Don't print length by default in Series.to_string, add ``length`` option (GH
#489)
- Improve Cython code for multi-groupby to aggregate without having to sort
the data (GH #93)
- Improve MultiIndex reindexing speed by storing tuples in the MultiIndex,
test for backwards unpickling compatibility
- Improve column reindexing performance by using specialized Cython take
function
- Further performance tweaking of Series.__getitem__ for standard use cases
- Avoid Index dict creation in some cases (i.e. when getting slices, etc.),
regression from prior versions
- Friendlier error message in setup.py if NumPy not installed
- Use common set of NA-handling operations (sum, mean, etc.) in Panel class
also (GH #536)
- Default name assignment when calling ``reset_index`` on DataFrame with a
regular (non-hierarchical) index (GH #476)
- Use Cythonized groupers when possible in Series/DataFrame stat ops with
``level`` parameter passed (GH #545)
- Ported skiplist data structure to C to speed up ``rolling_median`` by about
5-10x in most typical use cases (GH #374)
- Some performance enhancements in constructing a Panel from a dict of
DataFrame objects
- Made ``Index._get_duplicates`` a public method by removing the underscore
- Prettier printing of floats, and column spacing fix (GH #395, GH #571)
- Add ``bold_rows`` option to DataFrame.to_html (GH #586)
- Improve the performance of ``DataFrame.sort_index`` by up to 5x or more
when sorting by multiple columns
- Substantially improve performance of DataFrame and Series constructors when
passed a nested dict or dict, respectively (GH #540, GH #621)
- Modified setup.py so that pip / setuptools will install dependencies (GH
#507, various pull requests)
- Unstack called on DataFrame with non-MultiIndex will return Series (GH
#477)
- Improve DataFrame.to_string and console formatting to be more consistent in
the number of displayed digits (GH #395)
- Use bottleneck if available for performing NaN-friendly statistical
operations that it implements (GH #91)
- Can pass a list of functions to aggregate with groupby on a DataFrame,
yielding an aggregated result with hierarchical columns (GH #166)
- Monkey-patch context to traceback in ``DataFrame.apply`` to indicate which
row/column the function application failed on (GH #614)
- Improved ability of read_table and read_clipboard to parse
console-formatted DataFrames (can read the row of index names, etc.)
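
To make the "list of functions" aggregation entry above concrete, here is a small sketch. It assumes a present-day pandas and passes function names as strings, which may not match the exact calling convention in 0.7.0; the data are made up.

```python
import pandas as pd

# Illustrative frame: one grouping column and two value columns
df = pd.DataFrame({"k": ["a", "a", "b"],
                   "v": [1, 2, 3],
                   "w": [10.0, 20.0, 30.0]})

# Passing a list of aggregations yields hierarchical (two-level) columns:
# the outer level is the original column, the inner level the function name
result = df.groupby("k").agg(["sum", "mean"])

v_sum_a = result.loc["a", ("v", "sum")]
w_mean_a = result.loc["a", ("w", "mean")]
```

Each original column fans out into one sub-column per function, so individual results are addressed with a ``(column, function)`` tuple.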

**Bug fixes**

- Raise exception in out-of-bounds indexing of Series instead of
seg-faulting, regression from earlier releases (GH #495)
- Fix error when joining DataFrames of different dtypes within the same
typeclass (e.g. float32 and float64) (GH #486)
- Fix bug in Series.min/Series.max on objects like datetime.datetime (GH
#487)
- Preserve index names in Index.union (GH #501)
- Fix bug in Index joining causing subclass information (like DateRange type)
to be lost in some cases (GH #500)
- Accept empty list as input to DataFrame constructor, regression from 0.6.0
(GH #491)
- Can output DataFrame and Series with ndarray objects in a dtype=object
array (GH #490)
- Return empty string from Series.to_string when called on empty Series (GH
#488)
- Fix exception passing empty list to DataFrame.from_records
- Fix Index.format bug (excluding name field) with datetimes with time info
- Fix scalar value access in Series to always return NumPy scalars,
regression from prior versions (GH #510)
- Handle rows skipped at beginning of file in read_* functions (GH #505)
- Handle improper dtype casting in ``set_value`` methods
- Unary '-' / __neg__ operator on DataFrame was returning integer values
- Unbox 0-dim ndarrays from certain operators like all, any in Series
- Fix handling of missing columns (was combine_first-specific) in
DataFrame.combine for general case (GH #529)
- Fix type inference logic with boolean lists and arrays in DataFrame indexing
- Use centered sum of squares in R-square computation if entity_effects=True
in panel regression
- Handle all NA case in Series.{corr, cov}, was raising exception (GH #548)
- Aggregating by multiple levels with ``level`` argument to DataFrame, Series
stat method, was broken (GH #545)
- Fix Cython bug when converter passed to read_csv produced a numeric array
(buffer dtype mismatch when passed to Cython type inference function) (GH
#546)
- Fix exception when setting scalar value using .ix on a DataFrame with a
MultiIndex (GH #551)
- Fix outer join between two DateRanges with different offsets that returned
an invalid DateRange
- Cleanup DataFrame.from_records failure where index argument is an integer
- Fix DataFrame.from_records failure when passed a dictionary
- Fix NA handling in {Series, DataFrame}.rank with non-floating point dtypes
- Fix bug related to integer type-checking in .ix-based indexing
- Handle non-string index name passed to DataFrame.from_records
- DataFrame.insert caused the columns name(s) field to be discarded (GH #527)
- Fix DataFrame.to_string to remove extra column white space (GH #571)
- Format floats to default to same number of digits (GH #395)
- Added decorator to copy docstring from one function to another (GH #449)
- Fix error in monotonic many-to-one left joins
- Fix __eq__ comparison between DateOffsets with different relativedelta
keywords passed
- Fix exception caused by parser converter returning strings (GH #583)
- Fix MultiIndex formatting bug with integer names (GH #601)
- Fix bug in handling of non-numeric aggregates in Series.groupby (GH #612)
- Fix TypeError with tuple subclasses (e.g. namedtuple) in
DataFrame.from_records (GH #611)
- Catch misreported console size when running IPython within Emacs
- Fix minor bug in pivot table margins, loss of index names and length-1
'All' tuple in row labels

Thanks
------
- Craig Austin
- Marius Cobzarenco
- Mario Gamboa-Cavazos
- Arthur Gerigk
- Yaroslav Halchenko
- Jeff Hammerbacher
- Matt Harrison
- Andreas Hilboll
- Luc Kesters
- Adam Klein
- Gregg Lind
- Solomon Negusse
- Wouter Overmeire
- Christian Prinoth
- Sam Reckoner
- Craig Reeson
- Jan Schulz
- Ted Square
- Graham Taylor
- Chris Uga
- Dieter Vandenbussche
- Texas P.
- Pinxing Ye

pandas 0.6.1
============

@@ -85,6 +303,7 @@ pandas 0.6.1
- MultiIndex.get_level_values can take the level name
- More helpful error message when DataFrame.plot fails on one of the columns
(GH #478)
- Improve performance of DataFrame.{index, columns} attribute lookup

**Bug fixes**

7 changes: 7 additions & 0 deletions TODO.rst
@@ -1,3 +1,8 @@
DOCS 0.7.0
----------
- no sort in groupby
- concat with dict

DONE
----
- SparseSeries name integration + tests
@@ -49,3 +54,5 @@ Performance blog
- Groupby
- joining
- Take

git log v0.6.1..master --pretty=format:%aN | sort | uniq -c | sort -rn
61 changes: 61 additions & 0 deletions bench/bench_groupby.py
@@ -0,0 +1,61 @@
from pandas import *
from pandas.util.testing import rands

import string
import random

k = 20000
n = 10

foo = np.tile(np.array([rands(10) for _ in xrange(k)], dtype='O'), n)
foo2 = list(foo)
random.shuffle(foo)
random.shuffle(foo2)

df = DataFrame({'A' : foo,
                'B' : foo2,
                'C' : np.random.randn(n * k)})

import pandas._sandbox as sbx

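def f():
    # Factorize column 'A' with the string-specialized hash table
    table = sbx.StringHashTable(len(df))
    ret = table.factorize(df['A'])
    return ret

def g():
    # Same factorization using the generic Python-object hash table
    table = sbx.PyObjectHashTable(len(df))
    ret = table.factorize(df['A'])
    return ret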

ret = f()

"""
import pandas._tseries as lib
f = np.std
grouped = df.groupby(['A', 'B'])
label_list = [ping.labels for ping in grouped.groupings]
shape = [len(ping.ids) for ping in grouped.groupings]
from pandas.core.groupby import get_group_index
group_index = get_group_index(label_list, shape).astype('i4')
ngroups = np.prod(shape)
indexer = lib.groupsort_indexer(group_index, ngroups)
values = df['C'].values.take(indexer)
group_index = group_index.take(indexer)
f = lambda x: x.std(ddof=1)
grouper = lib.Grouper(df['C'], np.ndarray.std, group_index, ngroups)
result = grouper.get_result()
expected = grouped.std()
"""