For your convenience, we have compiled a list of currently implemented APIs and methods available in Modin. This documentation is updated as new methods and APIs are merged into the master branch, so it is not necessarily accurate for the most recent release. To install the latest version of Modin, follow the directions found on the installation page.
If you have a question about the implementation details or would like more information about an API or method in Modin, please contact the Modin developer mailing list.
Currently, we support ~71% of the pandas API. The exact methods we have implemented are listed below.
We have taken a community-driven approach to implementing new methods. We did a study on pandas usage to learn what the most-used APIs are. Measured against that study, that is, counting methods by how often they are actually used rather than counting every method equally, Modin currently supports 93% of the pandas API, and we are actively expanding the API.
The remaining unimplemented methods default to pandas. This allows users to continue using Modin even though their workloads contain functions not yet implemented in Modin. Here is a diagram of how we convert to pandas and perform the operation:
We first convert to a pandas DataFrame, then perform the operation. There is a performance penalty for going from a partitioned Modin DataFrame to pandas because of the communication cost and single-threaded nature of pandas. Once the pandas operation has completed, we convert the DataFrame back into a partitioned Modin DataFrame. This way, operations performed after something defaults to pandas will be optimized with Modin.
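The round trip described above can be sketched in plain pandas. This is a minimal illustration, not Modin's actual implementation: the helper names `to_pandas`, `from_pandas`, and `default_to_pandas` are hypothetical, and a "partitioned DataFrame" is modeled simply as a list of row slices.

```python
import pandas as pd


def to_pandas(partitions):
    """Gather row partitions into one in-memory pandas DataFrame."""
    return pd.concat(partitions)


def from_pandas(df, num_partitions):
    """Split a pandas DataFrame into contiguous row partitions."""
    step = max(1, -(-len(df) // num_partitions))  # ceiling division
    return [df.iloc[i:i + step] for i in range(0, len(df), step)] or [df]


def default_to_pandas(partitions, operation):
    """Hypothetical helper: gather, run the pandas operation, re-partition."""
    gathered = to_pandas(partitions)   # the communication cost is paid here
    result = operation(gathered)       # single-threaded pandas call
    return from_pandas(result, len(partitions))


frame = pd.DataFrame({"a": [3, 1, 4, 1, 5, 9], "b": [2, 7, 1, 8, 2, 8]})
partitions = from_pandas(frame, num_partitions=3)

# `shift` is one of the methods the table lists as "Defaults to pandas".
shifted = default_to_pandas(partitions, lambda df: df.shift(1))
```

Because the result is partitioned again at the end, any later operation in this toy model would once more see a list of partitions, which mirrors why operations that follow a default-to-pandas call can still be optimized.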
The following table lists both implemented and not-yet-implemented methods. If you need an operation that is listed as not implemented, feel free to open an issue on the GitHub repository. Contributions are also welcome!
DataFrame method | Implemented? | Limitations/Notes for Current implementation |
T | Y | |
__abs__ | Y | |
__add__ | Y | |
__and__ | Y | |
__array__ | Y | Will not result in a distributed object |
__array_wrap__ | Y | Will not result in a distributed object |
__bool__ | Y | |
__contains__ | Y | |
__copy__ | Y | Copy will always make a shallow copy |
__deepcopy__ | Y | Copy will always make a shallow copy |
__delitem__ | Y | |
__div__ | Y | Requires shuffle when operating on two DataFrames |
__eq__ | Y | Requires shuffle when operating on two DataFrames |
__finalize__ | N | Defaults to pandas |
__floordiv__ | Y | Requires shuffle when operating on two DataFrames |
__ge__ | Y | Requires shuffle when operating on two DataFrames |
__getitem__ | Y | Returns a pandas Series (see Series section below) |
__getstate__ | N | Defaults to pandas |
__gt__ | Y | Requires shuffle when operating on two DataFrames |
__hash__ | N | Defaults to pandas |
__iadd__ | Y | See __add__ |
__ifloordiv__ | Y | See __floordiv__ |
__imod__ | Y | See __mod__ |
__imul__ | Y | See __mul__ |
__invert__ | N | Defaults to pandas |
__ipow__ | Y | See __pow__ |
__isub__ | Y | See __sub__ |
__iter__ | Y | |
__itruediv__ | Y | See __truediv__ |
__le__ | Y | Requires shuffle when operating on two DataFrames |
__len__ | Y | |
__lt__ | Y | Requires shuffle when operating on two DataFrames |
__mod__ | Y | Requires shuffle when operating on two DataFrames |
__mul__ | Y | Requires shuffle when operating on two DataFrames |
__ne__ | Y | Requires shuffle when operating on two DataFrames |
__neg__ | Y | |
__nonzero__ | Y | |
__or__ | Y | |
__pow__ | Y | Requires shuffle when operating on two DataFrames |
__radd__ | Y | See __add__ |
__rdiv__ | Y | See __div__ |
__repr__ | Y | Blocking call: Must retrieve data from remote |
__rfloordiv__ | Y | See __floordiv__ |
__rmod__ | Y | See __mod__ |
__rmul__ | Y | See __mul__ |
__round__ | N | Defaults to pandas |
__rpow__ | Y | See __pow__ |
__rsub__ | Y | See __sub__ |
__rtruediv__ | Y | See __truediv__ |
__setitem__ | Y | Can only set if key parameter is type str |
__setstate__ | N | Defaults to pandas |
__sizeof__ | N | Defaults to pandas |
__str__ | Y | Blocking call: Must retrieve data from remote |
__sub__ | Y | Requires shuffle when operating on two DataFrames |
__truediv__ | Y | Requires shuffle when operating on two DataFrames |
__unicode__ | N | Defaults to pandas |
__xor__ | Y | |
abs | Y | |
add | Y | See __add__ |
add_prefix | Y | |
add_suffix | Y | |
agg | Y | Not yet optimized: Can return DataFrame or Series Passing a dictionary for the Passing the string name of a numpy operation for the |
aggregate | Y | See agg |
align | N | Defaults to pandas |
all | Y | |
any | Y | |
append | Y | Can be further optimized to be non-blocking |
apply | Y | See agg |
applymap | Y | |
as_blocks | N | Defaults to pandas |
as_matrix | Y | Will not result in a distributed object |
asfreq | N | Defaults to pandas |
asof | N | Defaults to pandas |
assign | N | Defaults to pandas |
astype | Y | |
at | N | Defaults to pandas |
at_time | N | Defaults to pandas |
axes | Y | |
between_time | N | Defaults to pandas |
bfill | Y | |
blocks | N | Defaults to pandas |
bool | Y | |
boxplot | Y | |
clip | Y | |
clip_lower | Y | |
clip_upper | Y | |
columns | Y | |
combine | N | Defaults to pandas |
combine_first | N | Defaults to pandas |
compound | N | Defaults to pandas |
consolidate | N | Defaults to pandas |
convert_objects | N | Defaults to pandas |
copy | Y | Copy will always make a shallow copy |
corr | N | Defaults to pandas |
corrwith | N | Defaults to pandas |
count | Y | |
cov | N | Defaults to pandas |
cummax | Y | |
cummin | Y | |
cumprod | Y | |
cumsum | Y | |
describe | Y | |
diff | Y | |
div | Y | See __div__ |
divide | Y | See __div__ |
dot | N | Defaults to pandas |
drop | Y | |
drop_duplicates | N | Defaults to pandas |
dropna | Y | |
dtypes | Y | |
duplicated | N | Defaults to pandas |
empty | Y | |
eq | Y | See __eq__ |
equals | Y | Requires shuffle, can be further optimized |
eval | Y | |
ewm | N | Defaults to pandas |
expanding | N | Defaults to pandas |
ffill | Y | |
fillna | Y | value parameter of type DataFrame defaults to pandas |
filter | Y | |
first | N | Defaults to pandas |
first_valid_index | Y | |
floordiv | Y | See __floordiv__ |
from_csv | Y | |
from_dict | Y | |
from_items | Y | |
from_records | Y | |
ftypes | Y | |
ge | Y | See __ge__ |
get | Y | |
get_dtype_counts | Y | |
get_ftype_counts | Y | |
get_value | N | Defaults to pandas |
get_values | N | Defaults to pandas |
groupby | Y | Not yet optimized, will require Distributed Series |
gt | Y | See __gt__ |
head | Y | |
hist | N | Defaults to pandas |
iat | N | Defaults to pandas |
idxmax | Y | |
idxmin | Y | |
iloc | Y | |
index | Y | |
infer_objects | N | Defaults to pandas |
info | Y | |
insert | Y | |
interpolate | N | Defaults to pandas |
is_copy | N | Defaults to pandas |
isin | Y | |
isna | Y | |
isnull | Y | |
items | Y | |
iteritems | Y | |
iterrows | Y | |
itertuples | Y | |
ix | N | Defaults to pandas |
join | Y | |
keys | Y | |
kurt | N | Defaults to pandas |
kurtosis | N | Defaults to pandas |
last | N | Defaults to pandas |
last_valid_index | Y | |
le | Y | See __le__ |
loc | Y | |
lookup | N | Defaults to pandas |
lt | Y | See __lt__ |
mad | N | Defaults to pandas |
mask | N | Defaults to pandas |
max | Y | |
mean | Y | |
median | Y | |
melt | N | Defaults to pandas |
memory_usage | Y | |
merge | Y | Only implemented for left_index=True and right_index=True, defaults to pandas otherwise |
min | Y | |
mod | Y | |
mode | Y | |
mul | Y | See __mul__ |
multiply | Y | See __mul__ |
ndim | Y | |
ne | Y | See __ne__ |
nlargest | N | Defaults to pandas |
notna | Y | |
notnull | Y | |
nsmallest | N | Defaults to pandas |
nunique | Y | |
pct_change | N | Defaults to pandas |
pipe | Y | |
pivot | N | Defaults to pandas |
pivot_table | N | Defaults to pandas |
plot | Y | |
pop | Y | |
pow | Y | See __pow__ |
prod | Y | |
product | Y | |
quantile | Y | |
query | Y | Local variables not yet supported |
radd | Y | See __add__ |
rank | Y | |
rdiv | Y | See __div__ |
reindex | Y | |
reindex_axis | N | Defaults to pandas |
reindex_like | N | Defaults to pandas |
rename | Y | |
rename_axis | Y | |
reorder_levels | N | Defaults to pandas |
replace | N | Defaults to pandas |
resample | N | Defaults to pandas |
reset_index | Y | |
rfloordiv | Y | See __floordiv__ |
rmod | Y | See __mod__ |
rmul | Y | See __mul__ |
rolling | N | Defaults to pandas |
round | Y | |
rpow | Y | See __pow__ |
rsub | Y | See __sub__ |
rtruediv | Y | See __truediv__ |
sample | Y | |
select | N | Defaults to pandas |
select_dtypes | Y | |
sem | N | Defaults to pandas |
set_axis | Y | |
set_index | Y | |
set_value | N | Defaults to pandas |
shape | Y | |
shift | N | Defaults to pandas |
size | Y | |
skew | Y | |
slice_shift | N | Defaults to pandas |
sort_index | Y | |
sort_values | Y | Not optimized, will require a distributed Series |
sortlevel | N | Defaults to pandas |
squeeze | N | Defaults to pandas |
stack | N | Defaults to pandas |
std | Y | |
style | N | Defaults to pandas |
sub | Y | See __sub__ |
subtract | Y | See __sub__ |
sum | Y | |
swapaxes | N | Defaults to pandas |
swaplevel | N | Defaults to pandas |
tail | Y | |
take | N | Defaults to pandas |
to_clipboard | Y | |
to_csv | Y | |
to_dense | N | Defaults to pandas |
to_dict | Y | |
to_excel | Y | |
to_feather | Y | |
to_gbq | Y | |
to_hdf | Y | |
to_html | Y | |
to_json | Y | |
to_latex | Y | |
to_msgpack | Y | |
to_panel | N | Defaults to pandas |
to_parquet | Y | |
to_period | N | Defaults to pandas |
to_pickle | Y | |
to_records | Y | |
to_sparse | N | Defaults to pandas |
to_sql | Y | |
to_stata | Y | |
to_string | Y | |
to_timestamp | N | Defaults to pandas |
to_xarray | N | Defaults to pandas |
transform | Y | |
transpose | Y | |
truediv | Y | See __truediv__ |
truncate | N | Defaults to pandas |
tshift | N | Defaults to pandas |
tz_convert | N | Defaults to pandas |
tz_localize | N | Defaults to pandas |
unstack | N | Defaults to pandas |
update | Y | raise_conflict=True not yet supported |
values | Y | |
var | Y | |
where | Y | |
xs | N | Defaults to pandas |
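To make one of the notes in the table concrete: the merge row says that only merges with left_index=True and right_index=True are implemented, and that every other form defaults to pandas. The sketch below shows both call shapes. It imports plain pandas so that it runs without Modin installed; the assumption is that swapping the import for modin.pandas leaves the results unchanged and only changes which execution path is taken. The frames and column names are made up for illustration.

```python
import pandas as pd  # assumption: swap for `import modin.pandas as pd` to run under Modin

left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"y": [10, 20, 30]}, index=["b", "c", "d"])

# Index-on-index merge: the form the table lists as implemented.
by_index = left.merge(right, left_index=True, right_index=True)

# Column-key merge: per the table, this form defaults to pandas.
# reset_index() moves the unnamed index into a column called "index".
by_column = left.reset_index().merge(right.reset_index(), on="index")
```

Both calls perform the same inner join on the labels "b" and "c"; when the join key is already the index, the first form avoids the default-to-pandas penalty described earlier.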
Currently, whenever a Series is used or returned, we use a pandas Series. In the future, we're going to implement a distributed Series, but until then there will be some performance bottlenecks. The pandas Series is completely compatible with all operations that both require and return one in Modin.
A number of IO methods default to pandas. We have parallelized read_csv and read_parquet, though many of the remaining methods can be relatively easily parallelized. The operations that default to the pandas implementation read the data serially into a single, non-distributed DataFrame and then distribute it. Performance will be affected by this.
IO method | Implemented? | Limitations/Notes for Current implementation |
read_csv | Y | |
read_table | Y | |
read_parquet | Y | |
read_json | Y | Defaults to pandas implementation |
read_html | Y | Defaults to pandas implementation |
read_clipboard | Y | Defaults to pandas implementation |
read_excel | Y | Defaults to pandas implementation |
read_hdf | Y | |
read_feather | Y | Defaults to pandas implementation |
read_msgpack | Y | Defaults to pandas implementation |
read_stata | Y | Defaults to pandas implementation |
read_sas | Y | Defaults to pandas implementation |
read_pickle | Y | Defaults to pandas implementation |
read_sql | Y | Defaults to pandas implementation |
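For the readers marked "Defaults to pandas implementation", the read-serially-then-distribute behaviour can be pictured roughly as follows. This is only a sketch under stated assumptions: `read_then_distribute` is a hypothetical helper, the distributed frame is modeled as a list of row slices, and the JSON payload is invented for the example.

```python
import io

import pandas as pd


def read_then_distribute(reader, source, num_partitions):
    """Hypothetical helper: read with a pandas reader, then split by rows."""
    whole = reader(source)  # one serial read into a single DataFrame
    step = max(1, -(-len(whole) // num_partitions))  # ceiling division
    return [whole.iloc[i:i + step] for i in range(0, len(whole), step)]


json_text = (
    '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"},'
    ' {"a": 3, "b": "z"}, {"a": 4, "b": "w"}]'
)

parts = read_then_distribute(
    lambda src: pd.read_json(src, orient="records"),
    io.StringIO(json_text),
    num_partitions=2,
)
```

The whole file passes through a single pandas call before any partition exists, which is where the performance cost mentioned above comes from.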
If you import modin.pandas as pd, the following operations are available from pd.<op>, e.g. pd.concat. If you do not see an operation that pandas enables and would like to request it, feel free to open an issue. Make sure you tell us your primary use-case so we can make it happen faster!
pd.concat
pd.eval
pd.unique
pd.value_counts
pd.cut
pd.to_numeric
pd.factorize
pd.test
pd.qcut
pd.match
pd.to_datetime
pd.get_dummies
pd.Panel
pd.date_range
pd.Index
pd.MultiIndex
pd.Series
pd.bdate_range
pd.DatetimeIndex
pd.to_timedelta
pd.set_eng_float_format
pd.set_option
pd.CategoricalIndex
pd.Timedelta
pd.Timestamp
pd.NaT
pd.PeriodIndex
pd.Categorical
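A short usage sketch for a few of the operations listed above (pd.concat, pd.cut, and pd.to_datetime). It imports plain pandas so that it runs anywhere; the assumption is that replacing the import with modin.pandas is the only change needed to run it under Modin. The data values are invented for illustration.

```python
import pandas as pd  # assumption: swap for `import modin.pandas as pd` to run under Modin

first = pd.DataFrame({"score": [0.2, 0.9]})
second = pd.DataFrame({"score": [0.5, 0.7]})

# Stack the two frames and renumber the rows 0..3.
combined = pd.concat([first, second], ignore_index=True)

# Bucket each score into (0.0, 0.5] -> "low" or (0.5, 1.0] -> "high".
bins = pd.cut(combined["score"], bins=[0.0, 0.5, 1.0], labels=["low", "high"])

# Parse a date string into a Timestamp.
stamp = pd.to_datetime("2018-07-01")
```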