For your convenience, we have compiled a list of the APIs and methods currently implemented in Modin. This documentation is updated as new methods and APIs are merged into the master branch, so it is not necessarily correct as of the most recent release. To install the latest version of Modin, follow the directions on the installation page.
If you have a question about the implementation details or would like more information about an API or method in Modin, please contact the Modin developer mailing list.
We have taken a community-driven approach to implementing new methods. We did a study on pandas usage to learn what the most-used APIs are. We currently support 93% of the pandas API based on usage, and are actively expanding the API.
The following table lists both implemented and not implemented methods. If you need an operation that is listed as not implemented, feel free to open an issue on the GitHub repository. Contributions are also welcome!
DataFrame method | Implemented? | Limitations/Notes for Current implementation |
T |
Y | |
__abs__ |
Y | |
__add__ |
Y | |
__and__ |
Y | |
__array__ |
Y | Will not result in a distributed object |
__array_wrap__ |
Y | Will not result in a distributed object |
__bool__ |
Y | |
__contains__ |
Y | |
__copy__ |
Y | Copy will always make a shallow copy |
__deepcopy__ |
Y | Copy will always make a shallow copy |
__delitem__ |
Y | |
__div__ |
Y | Requires shuffle when operating on two DataFrames |
__eq__ |
Y | Requires shuffle when operating on two DataFrames |
__finalize__ |
N | N/A, Not Yet Implemented |
__floordiv__ |
Y | Requires shuffle when operating on two DataFrames |
__ge__ |
Y | Requires shuffle when operating on two DataFrames |
__getitem__ |
Y | Returns a pandas Series (see Series section below) |
__getstate__ |
N | N/A, Not Yet Implemented |
__gt__ |
Y | Requires shuffle when operating on two DataFrames |
__hash__ |
N | N/A, Not Yet Implemented |
__iadd__ |
Y | See __add__ |
__ifloordiv__ |
Y | See __floordiv__ |
__imod__ |
Y | See __mod__ |
__imul__ |
Y | See __mul__ |
__invert__ |
N | N/A, Not Yet Implemented |
__ipow__ |
Y | See __pow__ |
__isub__ |
Y | See __sub__ |
__iter__ |
Y | |
__itruediv__ |
Y | See __truediv__ |
__le__ |
Y | Requires shuffle when operating on two DataFrames |
__len__ |
Y | |
__lt__ |
Y | Requires shuffle when operating on two DataFrames |
__mod__ |
Y | Requires shuffle when operating on two DataFrames |
__mul__ |
Y | Requires shuffle when operating on two DataFrames |
__ne__ |
Y | Requires shuffle when operating on two DataFrames |
__neg__ |
Y | |
__nonzero__ |
Y | |
__or__ |
Y | |
__pow__ |
Y | Requires shuffle when operating on two DataFrames |
__radd__ |
Y | See |
__rdiv__ |
Y | See |
__repr__ |
Y | Blocking call: Must retrieve data from remote |
__rfloordiv__ |
Y | See |
__rmod__ |
Y | See |
__rmul__ |
Y | See |
__round__ |
N | N/A, Not Yet Implemented |
__rpow__ |
Y | See |
__rsub__ |
Y | See |
__rtruediv__ |
Y | See |
__setitem__ |
Y | Can only set if key parameter is type str |
__setstate__ |
N | N/A, Not Yet Implemented |
__sizeof__ |
N | N/A, Not Yet Implemented |
__str__ |
Y | Blocking call: Must retrieve data from remote |
__sub__ |
Y | Requires shuffle when operating on two DataFrames |
__truediv__ |
Y | Requires shuffle when operating on two DataFrames |
__unicode__ |
N | N/A, Not Yet Implemented |
__xor__ |
Y | |
abs |
Y | |
add |
Y | See |
add_prefix |
Y | |
add_suffix |
Y | |
agg |
Y | Not yet optimized: Can return DataFrame or Series Passing a dictionary for the Passing the string name of a numpy operation for the |
aggregate |
Y | See agg |
align |
N | N/A, Not Yet Implemented |
all |
Y | level parameter not yet supported |
any |
Y | level parameter not yet supported |
append |
Y | Can be further optimized to be non-blocking |
apply |
Y | See agg |
applymap |
Y | |
as_blocks |
N | N/A, Not Yet Implemented |
as_matrix |
Y | Will not result in a distributed object |
asfreq |
N | N/A, Not Yet Implemented |
asof |
N | N/A, Not Yet Implemented |
assign |
N | N/A, Not Yet Implemented |
astype |
Y | |
at |
N | N/A, Not Yet Implemented |
at_time |
N | N/A, Not Yet Implemented |
axes |
Y | |
between_time |
N | N/A, Not Yet Implemented |
bfill |
Y | |
blocks |
N | N/A, Not Yet Implemented |
bool |
Y | |
boxplot |
Y | |
clip |
Y | |
clip_lower |
Y | |
clip_upper |
Y | |
columns |
Y | |
combine |
N | N/A, Not Yet Implemented |
combine_first |
N | N/A, Not Yet Implemented |
compound |
N | N/A, Not Yet Implemented |
consolidate |
N | N/A, Not Yet Implemented |
convert_objects |
N | N/A, Not Yet Implemented |
copy |
Y | Copy will always make a shallow copy |
corr |
N | N/A, Not Yet Implemented |
corrwith |
N | N/A, Not Yet Implemented |
count |
Y | level parameter not yet supported |
cov |
N | N/A, Not Yet Implemented |
cummax |
Y | |
cummin |
Y | |
cumprod |
Y | |
cumsum |
Y | |
describe |
Y | |
diff |
Y | |
div |
Y | See |
divide |
Y | See |
dot |
N | N/A, Not Yet Implemented |
drop |
Y | level parameter not yet supported |
drop_duplicates |
N | N/A, Not Yet Implemented |
dropna |
Y | |
dtypes |
Y | |
duplicated |
N | N/A, Not Yet Implemented |
empty |
Y | |
eq |
Y | See |
equals |
Y | Requires shuffle, can be further optimized |
eval |
Y | |
ewm |
N | N/A, Not Yet Implemented |
expanding |
N | N/A, Not Yet Implemented |
ffill |
Y | |
fillna |
Y | value parameter of type DataFrame not yet supported |
filter |
Y | |
first |
N | N/A, Not Yet Implemented |
first_valid_index |
Y | |
floordiv |
Y | See |
from_csv |
Y | |
from_dict |
Y | |
from_items |
Y | |
from_records |
Y | |
ftypes |
Y | |
ge |
Y | See |
get |
Y | |
get_dtype_counts |
Y | |
get_ftype_counts |
Y | |
get_value |
N | N/A, Not Yet Implemented |
get_values |
N | N/A, Not Yet Implemented |
groupby |
Y | Not yet optimized, will require Distributed Series |
gt |
Y | See |
head |
Y | |
hist |
N | N/A, Not Yet Implemented |
iat |
N | N/A, Not Yet Implemented |
idxmax |
Y | |
idxmin |
Y | |
iloc |
Y | |
index |
Y | |
infer_objects |
N | N/A, Not Yet Implemented |
info |
Y | |
insert |
Y | |
interpolate |
N | N/A, Not Yet Implemented |
is_copy |
N | N/A, Not Yet Implemented |
isin |
Y | |
isna |
Y | |
isnull |
Y | |
items |
Y | |
iteritems |
Y | |
iterrows |
Y | |
itertuples |
Y | |
ix |
N | N/A, Not Yet Implemented |
join |
Y | Specifying on parameter not yet supported |
keys |
Y | |
kurt |
N | N/A, Not Yet Implemented |
kurtosis |
N | N/A, Not Yet Implemented |
last |
N | N/A, Not Yet Implemented |
last_valid_index |
Y | |
le |
Y | See |
loc |
Y | |
lookup |
N | N/A, Not Yet Implemented |
lt |
Y | See |
mad |
N | N/A, Not Yet Implemented |
mask |
N | N/A, Not Yet Implemented |
max |
Y | level parameter not yet supported |
mean |
Y | level parameter not yet supported |
median |
Y | level parameter not yet supported |
melt |
N | N/A, Not Yet Implemented |
memory_usage |
Y | |
merge |
Y | Only implemented for left_index=True and right_index=True |
min |
Y | level parameter not yet supported |
mod |
Y | level parameter not yet supported |
mode |
Y | |
mul |
Y | See |
multiply |
Y | See |
ndim |
Y | |
ne |
Y | See |
nlargest |
N | N/A, Not Yet Implemented |
notna |
Y | |
notnull |
Y | |
nsmallest |
N | N/A, Not Yet Implemented |
nunique |
Y | |
pct_change |
N | N/A, Not Yet Implemented |
pipe |
Y | |
pivot |
N | N/A, Not Yet Implemented |
pivot_table |
N | N/A, Not Yet Implemented |
plot |
Y | |
pop |
Y | |
pow |
Y | See |
prod |
Y | level parameter not yet supported |
product |
Y | level parameter not yet supported |
quantile |
Y | |
query |
Y | Local variables not yet supported |
radd |
Y | See |
rank |
Y | |
rdiv |
Y | See |
reindex |
Y | level parameter not yet supported |
reindex_axis |
N | N/A, Not Yet Implemented |
reindex_like |
N | N/A, Not Yet Implemented |
rename |
Y | level parameter not yet supported |
rename_axis |
Y | |
reorder_levels |
N | N/A, Not Yet Implemented |
replace |
N | N/A, Not Yet Implemented |
resample |
N | N/A, Not Yet Implemented |
reset_index |
Y | level parameter not yet supported |
rfloordiv |
Y | See |
rmod |
Y | See |
rmul |
Y | See |
rolling |
N | N/A, Not Yet Implemented |
round |
Y | |
rpow |
Y | See |
rsub |
Y | See |
rtruediv |
Y | See |
sample |
Y | |
select |
N | N/A, Not Yet Implemented |
select_dtypes |
Y | |
sem |
N | N/A, Not Yet Implemented |
set_axis |
Y | |
set_index |
Y | |
set_value |
N | N/A, Not Yet Implemented |
shape |
Y | |
shift |
N | N/A, Not Yet Implemented |
size |
Y | |
skew |
Y | level parameter not yet supported |
slice_shift |
N | N/A, Not Yet Implemented |
sort_index |
Y | level parameter not yet supported |
sort_values |
Y | Not optimized, will require a distributed Series |
sortlevel |
N | N/A, Not Yet Implemented |
squeeze |
N | N/A, Not Yet Implemented |
stack |
N | N/A, Not Yet Implemented |
std |
Y | level parameter not yet supported |
style |
N | N/A, Not Yet Implemented |
sub |
Y | See |
subtract |
Y | See |
sum |
Y | level parameter not yet supported |
swapaxes |
N | N/A, Not Yet Implemented |
swaplevel |
N | N/A, Not Yet Implemented |
tail |
Y | |
take |
N | N/A, Not Yet Implemented |
to_clipboard |
Y | |
to_csv |
Y | |
to_dense |
N | N/A, Not Yet Implemented |
to_dict |
Y | |
to_excel |
Y | |
to_feather |
Y | |
to_gbq |
Y | |
to_hdf |
Y | |
to_html |
Y | |
to_json |
Y | |
to_latex |
Y | |
to_msgpack |
Y | |
to_panel |
N | N/A, Not Yet Implemented |
to_parquet |
Y | |
to_period |
N | N/A, Not Yet Implemented |
to_pickle |
Y | |
to_records |
Y | |
to_sparse |
N | N/A, Not Yet Implemented |
to_sql |
Y | |
to_stata |
Y | |
to_string |
Y | |
to_timestamp |
N | N/A, Not Yet Implemented |
to_xarray |
N | N/A, Not Yet Implemented |
transform |
Y | |
transpose |
Y | |
truediv |
Y | See |
truncate |
N | N/A, Not Yet Implemented |
tshift |
N | N/A, Not Yet Implemented |
tz_convert |
N | N/A, Not Yet Implemented |
tz_localize |
N | N/A, Not Yet Implemented |
unstack |
N | N/A, Not Yet Implemented |
update |
Y | raise_conflict=True not yet supported |
values |
Y | |
var |
Y | level parameter not yet supported |
where |
Y | level parameter not yet supported |
xs |
N | N/A, Not Yet Implemented |
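As a rough illustration of how a few of the notes in the table above translate into code, the following sketch stays within the listed limitations: it sets a column with a str key, merges on the index of both frames, and inlines a value into a query string instead of referencing a local variable. The frames, column names, and threshold are made up for the example, and the fallback to plain pandas is only an assumption added so the snippet runs where Modin is not installed.

```python
try:
    import modin.pandas as pd  # assumes Modin and an execution engine are installed
except ImportError:
    import pandas as pd  # fallback for illustration only; the calls below are the same

# Two small frames that share the same index (hypothetical data).
left = pd.DataFrame({"price": [10.0, 12.5, 9.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"qty": [3, 1, 7]}, index=["a", "b", "c"])

# merge: the table lists only left_index=True and right_index=True as implemented.
merged = left.merge(right, left_index=True, right_index=True)

# __setitem__: the table notes that the key must be a str.
merged["total"] = merged["price"] * merged["qty"]

# query: local variables are listed as unsupported, so the value is
# formatted into the expression instead of being referenced with `@`.
threshold = 9.5
expensive = merged.query("price > {}".format(threshold))

# __repr__ / __str__ are blocking calls: printing retrieves the data.
print(expensive)
```

Formatting the value into the expression is only safe for trusted values such as the numeric literal used here.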
Currently, whenever a Series is used or returned, we use a pandas Series. In the future, we're going to implement a distributed Series, but until then there will be some performance bottlenecks. The pandas Series is completely compatible with all operations that both require and return one in Modin.
A number of IO methods default to pandas. We have parallelized read_csv and read_parquet, though many of the remaining methods can be relatively easily parallelized. The other operations default to the pandas implementation, meaning the data is read serially into a single, non-distributed DataFrame and then distributed. This affects performance.
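To make the "read serially, then distribute" pattern concrete, here is a minimal conceptual sketch using only the Python standard library. It is not Modin's implementation: `read_serially`, `distribute`, and the sample CSV text are hypothetical names and data chosen for illustration. The point is that the whole input is parsed in one place before any partitioning happens, so the read itself gains nothing from parallelism.

```python
import csv
import io


def read_serially(text):
    """Parse the entire CSV in a single process, as a non-parallel reader would."""
    return list(csv.DictReader(io.StringIO(text)))


def distribute(rows, num_partitions):
    """Split already-loaded rows into contiguous, roughly equal partitions."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each.
        stop = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:stop])
        start = stop
    return partitions


csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"
rows = read_serially(csv_text)                   # serial step: one reader, all rows
partitions = distribute(rows, num_partitions=2)  # distribution happens afterwards
print([len(p) for p in partitions])
```

A parallelized reader would instead split the input first and parse each piece independently, which is where the speedup comes from.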
IO method | Implemented? | Limitations/Notes for Current implementation |
read_csv |
Y | |
read_parquet |
Y | |
read_json |
Y | Defaults to pandas implementation |
read_html |
Y | Defaults to pandas implementation |
read_clipboard |
Y | Defaults to pandas implementation |
read_excel |
Y | Defaults to pandas implementation |
read_hdf |
Y | Defaults to pandas implementation |
read_feather |
Y | Defaults to pandas implementation |
read_msgpack |
Y | Defaults to pandas implementation |
read_stata |
Y | Defaults to pandas implementation |
read_sas |
Y | Defaults to pandas implementation |
read_pickle |
Y | Defaults to pandas implementation |
read_sql |
Y | Defaults to pandas implementation |
If you import modin.pandas as pd, the following operations are available from pd.<op>, e.g. pd.concat. If you do not see an operation that pandas enables and would like to request it, feel free to open an issue. Make sure you tell us your primary use-case so we can make it happen faster!
- concat
- eval
- unique
- value_counts
- cut
- to_numeric
- factorize
- test
- qcut
- match
- to_datetime
- get_dummies
- Panel
- date_range
- Index
- MultiIndex
- Series
- bdate_range
- DatetimeIndex
- to_timedelta
- set_eng_float_format
- set_option
- CategoricalIndex
- Timedelta
- Timestamp
- NaT
- PeriodIndex
- Categorical
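A short usage sketch of a few of the module-level operations listed above, assuming they behave as their pandas counterparts do. The data is invented for the example, and the fallback import of plain pandas is an assumption added only so the snippet runs where Modin is not installed.

```python
try:
    import modin.pandas as pd  # assumes Modin and an execution engine are installed
except ImportError:
    import pandas as pd  # fallback for illustration only

first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"x": [3, 4]})

# concat: stack the two frames and renumber the index.
combined = pd.concat([first, second], ignore_index=True)

# to_numeric: convert string values to numbers.
numbers = pd.to_numeric(["1", "2", "3"])

# date_range: build a fixed-length range of daily timestamps.
days = pd.date_range("2018-01-01", periods=3)

print(combined)
print(list(numbers))
print(len(days))
```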