Modin Supported Methods

For your convenience, we have compiled a list of currently implemented APIs and methods available in Modin. This documentation is updated as new methods and APIs are merged into the master branch, so it is not necessarily accurate for the most recent release. To install the latest version of Modin, follow the directions on the installation page.

Questions on implementation details

If you have a question about the implementation details or would like more information about an API or method in Modin, please contact the Modin developer mailing list.

API Completeness

Currently, we support ~71% of the pandas API. The exact methods we have implemented are listed below.

We have taken a community-driven approach to implementing new methods. We studied pandas usage to learn which APIs are used most. Measured against that usage study, Modin currently supports 93% of the pandas API, and we are actively expanding the API.

Defaulting to pandas

The remaining unimplemented methods default to pandas. This allows users to continue using Modin even though their workloads contain functions not yet implemented in Modin. Here is a diagram of how we convert to pandas and perform the operation:

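[Diagram: a partitioned Modin DataFrame is converted to pandas, the operation is performed, and the result is converted back to a partitioned Modin DataFrame]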

We first convert to a pandas DataFrame, then perform the operation. There is a performance penalty for going from a partitioned Modin DataFrame to pandas because of the communication cost and single-threaded nature of pandas. Once the pandas operation has completed, we convert the DataFrame back into a partitioned Modin DataFrame. This way, operations performed after something defaults to pandas will be optimized with Modin.
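The round trip described above can be sketched with plain pandas. This is only an illustration of the idea, not Modin's internal code: the helpers `split_rows` and `default_to_pandas` are hypothetical names, and a list of row slices stands in for a partitioned Modin DataFrame.

```python
import pandas as pd


def split_rows(df, n_parts):
    """Split a pandas DataFrame into row-wise partitions (illustrative only)."""
    step = max(1, -(-len(df) // n_parts))  # ceiling division, at least 1
    return [df.iloc[i:i + step] for i in range(0, len(df), step)]


def default_to_pandas(partitions, op):
    """Gather the partitions, run `op` on one pandas DataFrame, re-partition."""
    n_parts = len(partitions)
    full = pd.concat(partitions)  # gather step: this is the communication cost
    result = op(full)             # ordinary single-threaded pandas operation
    return split_rows(result, n_parts)


df = pd.DataFrame({"a": [3, 1, 2, 1], "b": [10, 20, 30, 20]})
parts = split_rows(df, 2)

# drop_duplicates is listed below as defaulting to pandas, so it is used here.
deduped = default_to_pandas(parts, lambda d: d.drop_duplicates())
```

After the call, `deduped` is again a list of partitions, which mirrors how later operations can run on partitioned data once the pandas step has finished.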

DataFrame

The following table lists both implemented and not implemented methods. If you need an operation that is listed as not implemented, feel free to open an issue on the GitHub repository. Contributions are also welcome!

DataFrame method Implemented? Limitations/Notes for Current implementation
T Y
__abs__ Y
__add__ Y
__and__ Y
__array__ Y Will not result in a distributed object
__array_wrap__ Y Will not result in a distributed object
__bool__ Y
__contains__ Y
__copy__ Y Copy will always make a shallow copy
__deepcopy__ Y Copy will always make a shallow copy
__delitem__ Y
__div__ Y Requires shuffle when operating on two DataFrames
__eq__ Y Requires shuffle when operating on two DataFrames
__finalize__ N Defaults to pandas
__floordiv__ Y Requires shuffle when operating on two DataFrames
__ge__ Y Requires shuffle when operating on two DataFrames
__getitem__ Y Returns a pandas Series (see Series section below); key parameter as type DataFrame not yet supported; MultiIndex columns default to pandas
__getstate__ N Defaults to pandas
__gt__ Y Requires shuffle when operating on two DataFrames
__hash__ N Defaults to pandas
__iadd__ Y See __add__
__ifloordiv__ Y See __floordiv__
__imod__ Y See __mod__
__imul__ Y See __mul__
__invert__ N Defaults to pandas
__ipow__ Y See __pow__
__isub__ Y See __sub__
__iter__ Y
__itruediv__ Y See __truediv__
__le__ Y Requires shuffle when operating on two DataFrames
__len__ Y
__lt__ Y Requires shuffle when operating on two DataFrames
__mod__ Y Requires shuffle when operating on two DataFrames
__mul__ Y Requires shuffle when operating on two DataFrames
__ne__ Y Requires shuffle when operating on two DataFrames
__neg__ Y
__nonzero__ Y
__or__ Y
__pow__ Y Requires shuffle when operating on two DataFrames
__radd__ Y See __add__
__rdiv__ Y See __div__
__repr__ Y Blocking call: Must retrieve data from remote
__rfloordiv__ Y See __floordiv__
__rmod__ Y See __mod__
__rmul__ Y See __mul__
__round__ N Defaults to pandas
__rpow__ Y See __pow__
__rsub__ Y See __sub__
__rtruediv__ Y See __truediv__
__setitem__ Y Can only set if key parameter is type str
__setstate__ N Defaults to pandas
__sizeof__ N Defaults to pandas
__str__ Y Blocking call: Must retrieve data from remote
__sub__ Y Requires shuffle when operating on two DataFrames
__truediv__ Y Requires shuffle when operating on two DataFrames
__unicode__ N Defaults to pandas
__xor__ Y
abs Y
add Y See __add__
add_prefix Y
add_suffix Y
agg Y Not yet optimized: Can return DataFrame or Series; passing a dictionary for the func parameter not yet supported; passing the string name of a numpy operation for the func parameter defaults to pandas
aggregate Y See agg
align N Defaults to pandas
all Y
any Y
append Y Can be further optimized to be non-blocking
apply Y See agg
applymap Y
as_blocks N Defaults to pandas
as_matrix Y Will not result in a distributed object
asfreq N Defaults to pandas
asof N Defaults to pandas
assign N Defaults to pandas
astype Y
at N Defaults to pandas
at_time N Defaults to pandas
axes Y
between_time N Defaults to pandas
bfill Y
blocks N Defaults to pandas
bool Y
boxplot Y
clip Y
clip_lower Y
clip_upper Y
columns Y
combine N Defaults to pandas
combine_first N Defaults to pandas
compound N Defaults to pandas
consolidate N Defaults to pandas
convert_objects N Defaults to pandas
copy Y Copy will always make a shallow copy
corr N Defaults to pandas
corrwith N Defaults to pandas
count Y
cov N Defaults to pandas
cummax Y
cummin Y
cumprod Y
cumsum Y
describe Y
diff Y
div Y See __div__
divide Y See __div__
dot N Defaults to pandas
drop Y
drop_duplicates N Defaults to pandas
dropna Y
dtypes Y
duplicated N Defaults to pandas
empty Y
eq Y See __eq__
equals Y Requires shuffle, can be further optimized
eval Y
ewm N Defaults to pandas
expanding N Defaults to pandas
ffill Y
fillna Y value parameter of type DataFrame defaults to pandas
filter Y
first N Defaults to pandas
first_valid_index Y
floordiv Y See __floordiv__
from_csv Y
from_dict Y
from_items Y
from_records Y
ftypes Y
ge Y See __ge__
get Y
get_dtype_counts Y
get_ftype_counts Y
get_value N Defaults to pandas
get_values N Defaults to pandas
groupby Y Not yet optimized, will require Distributed Series; passing a list of columns for the by parameter defaults to pandas
gt Y See __gt__
head Y
hist N Defaults to pandas
iat N Defaults to pandas
idxmax Y
idxmin Y
iloc Y
index Y
infer_objects N Defaults to pandas
info Y
insert Y
interpolate N Defaults to pandas
is_copy N Defaults to pandas
isin Y
isna Y
isnull Y
items Y
iteritems Y
iterrows Y
itertuples Y
ix N Defaults to pandas
join Y
keys Y
kurt N Defaults to pandas
kurtosis N Defaults to pandas
last N Defaults to pandas
last_valid_index Y
le Y See __le__
loc Y
lookup N Defaults to pandas
lt Y See __lt__
mad N Defaults to pandas
mask N Defaults to pandas
max Y
mean Y
median Y
melt N Defaults to pandas
memory_usage Y
merge Y Only implemented for left_index=True and right_index=True, defaults to pandas otherwise
min Y
mod Y
mode Y
mul Y See __mul__
multiply Y See __mul__
ndim Y
ne Y See __ne__
nlargest N Defaults to pandas
notna Y
notnull Y
nsmallest N Defaults to pandas
nunique Y
pct_change N Defaults to pandas
pipe Y
pivot N Defaults to pandas
pivot_table N Defaults to pandas
plot Y
pop Y
pow Y See __pow__
prod Y
product Y
quantile Y
query Y Local variables not yet supported
radd Y See __add__
rank Y
rdiv Y See __div__
reindex Y
reindex_axis N Defaults to pandas
reindex_like N Defaults to pandas
rename Y
rename_axis Y
reorder_levels N Defaults to pandas
replace N Defaults to pandas
resample N Defaults to pandas
reset_index Y
rfloordiv Y See __floordiv__
rmod Y See __mod__
rmul Y See __mul__
rolling N Defaults to pandas
round Y
rpow Y See __pow__
rsub Y See __sub__
rtruediv Y See __truediv__
sample Y
select N Defaults to pandas
select_dtypes Y
sem N Defaults to pandas
set_axis Y
set_index Y
set_value N Defaults to pandas
shape Y
shift N Defaults to pandas
size Y
skew Y
slice_shift N Defaults to pandas
sort_index Y
sort_values Y Not optimized, will require a distributed Series
sortlevel N Defaults to pandas
squeeze N Defaults to pandas
stack N Defaults to pandas
std Y
style N Defaults to pandas
sub Y See __sub__
subtract Y See __sub__
sum Y
swapaxes N Defaults to pandas
swaplevel N Defaults to pandas
tail Y
take N Defaults to pandas
to_clipboard Y
to_csv Y
to_dense N Defaults to pandas
to_dict Y
to_excel Y
to_feather Y
to_gbq Y
to_hdf Y
to_html Y
to_json Y
to_latex Y
to_msgpack Y
to_panel N Defaults to pandas
to_parquet Y
to_period N Defaults to pandas
to_pickle Y
to_records Y
to_sparse N Defaults to pandas
to_sql Y
to_stata Y
to_string Y
to_timestamp N Defaults to pandas
to_xarray N Defaults to pandas
transform Y
transpose Y
truediv Y See __truediv__
truncate N Defaults to pandas
tshift N Defaults to pandas
tz_convert N Defaults to pandas
tz_localize N Defaults to pandas
unstack N Defaults to pandas
update Y raise_conflict=True not yet supported
values Y
var Y
where Y
xs N Defaults to pandas
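
Two of the limitations in the table, `merge` being implemented only for `left_index=True` and `right_index=True`, and `query` not supporting local variables, can often be worked around by rephrasing the call. The sketch below uses plain pandas so that it runs anywhere. Whether the rephrased calls stay on Modin's implemented path is an assumption based on `set_index`, `merge`, and `query` being listed as implemented above. The frames and column names are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge: move the join column into the index, then join index-on-index.
merged = left.set_index("key").merge(
    right.set_index("key"), left_index=True, right_index=True
)

# query: inline a simple literal instead of referencing a local with "@threshold".
threshold = 1
filtered = left.query(f"x > {threshold}")
```

Inlining works for simple numeric literals like the one shown; string values would need to be quoted inside the expression.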

Series

Currently, whenever a Series is used or returned, we use a pandas Series. In the future, we're going to implement a distributed Series, but until then there will be some performance bottlenecks. A pandas Series is fully compatible with every Modin operation that requires or returns one.

IO

A number of IO methods default to pandas. We have parallelized read_csv and read_parquet, and many of the remaining methods can be parallelized relatively easily. An operation that defaults to the pandas implementation reads the data serially into a single, non-distributed DataFrame and then distributes it, which affects performance.

IO method Implemented? Limitations/Notes for Current implementation
read_csv Y
read_table Y
read_parquet Y
read_json Y Defaults to pandas implementation
read_html Y Defaults to pandas implementation
read_clipboard Y Defaults to pandas implementation
read_excel Y Defaults to pandas implementation
read_hdf Y
read_feather Y Defaults to pandas implementation
read_msgpack Y Defaults to pandas implementation
read_stata Y Defaults to pandas implementation
read_sas Y Defaults to pandas implementation
read_pickle Y Defaults to pandas implementation
read_sql Y Defaults to pandas implementation
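
For the entries marked "Defaults to pandas implementation", the read-then-distribute behavior described above can be pictured roughly as follows. This is a conceptual sketch in plain pandas, not Modin's reader code: the in-memory CSV text, the partition count, and the row-slicing scheme are all assumptions chosen for illustration.

```python
import io

import pandas as pd

csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n"

# Step 1: a serial read produces one non-distributed pandas DataFrame.
whole = pd.read_csv(io.StringIO(csv_text))

# Step 2: only afterwards is the frame split into row partitions.
n_parts = 2
step = max(1, -(-len(whole) // n_parts))  # ceiling division, at least 1
partitions = [whole.iloc[i:i + step] for i in range(0, len(whole), step)]
```

The whole file has to pass through a single reader before any partition exists, which is why these methods are slower than the parallelized ones.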

List of Other Supported Operations Available on Import

If you import modin.pandas as pd, the following operations are available as pd.<op>, e.g. pd.concat. If an operation that pandas provides is missing and you would like to request it, feel free to open an issue. Make sure you tell us your primary use case so we can make it happen faster!

  • pd.concat
  • pd.eval
  • pd.unique
  • pd.value_counts
  • pd.cut
  • pd.to_numeric
  • pd.factorize
  • pd.test
  • pd.qcut
  • pd.match
  • pd.to_datetime
  • pd.get_dummies
  • pd.Panel
  • pd.date_range
  • pd.Index
  • pd.MultiIndex
  • pd.Series
  • pd.bdate_range
  • pd.DatetimeIndex
  • pd.to_timedelta
  • pd.set_eng_float_format
  • pd.set_option
  • pd.CategoricalIndex
  • pd.Timedelta
  • pd.Timestamp
  • pd.NaT
  • pd.PeriodIndex
  • pd.Categorical