forked from xorbitsai/xorbits

Scalable Python data science, in an API compatible & lightning fast way.


hoarjour/xorbits

 
 



Xorbits: scalable Python data science, familiar & fast.


What is it?

Xorbits is a scalable Python data science framework that aims to scale the whole Python data science world, including numpy, pandas, scikit-learn and many other libraries. It can use multiple cores or GPUs to accelerate computation on a single machine, or scale out to thousands of machines to process terabytes of data. In our benchmark tests, Xorbits was the fastest among the most popular distributed data science frameworks.

The name xorbits has many meanings: you can read it as X-or-bits, X-orbits, or xor-bits. Have fun interpreting it in your own way.

Where to get it

The source code is currently hosted on GitHub at: https://github.com/xprobe-inc/xorbits

Binary installers for the latest released version are available at the Python Package Index (PyPI).

# PyPI
pip install xorbits

API compatibility

If you know how to use numpy, pandas and similar libraries, you probably already know how to use Xorbits.

Here is an example.

pandas:

import pandas as pd

ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

m = ratings.groupby(
    'MOVIE_ID', as_index=False).agg(
    {'RATING': ['mean', 'count']})
m.columns = ['MOVIE_ID', 'RATING', 'COUNT']
m = m[m['COUNT'] > 100]
top_100 = m.sort_values(
    'RATING', ascending=False)[:100]
top_100 = top_100.merge(
    movies[['MOVIE_ID', 'NAME']])
print(top_100)

Xorbits:

import xorbits.pandas as pd

ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

m = ratings.groupby(
    'MOVIE_ID', as_index=False).agg(
    {'RATING': ['mean', 'count']})
m.columns = ['MOVIE_ID', 'RATING', 'COUNT']
m = m[m['COUNT'] > 100]
top_100 = m.sort_values(
    'RATING', ascending=False)[:100]
top_100 = top_100.merge(
    movies[['MOVIE_ID', 'NAME']])
print(top_100)

The code is almost identical except for the import: replacing import pandas with import xorbits.pandas just works, and the same goes for numpy and other libraries.
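One way to see that only the import differs is to write the pipeline as a function that receives the pandas-like module as an argument. The following is a minimal sketch under stated assumptions: the helper name top_rated, its parameters, and the tiny in-memory dataset are hypothetical and used for illustration only. It is run here with plain pandas; passing xorbits.pandas instead is expected to behave the same according to the compatibility claim above, but that is not verified in this sketch.

```python
import pandas


def top_rated(pd, ratings_rows, movies_rows, min_count=1, limit=100):
    """Run the example pipeline with whichever pandas-like module is passed as `pd`.

    `top_rated`, `min_count` and `limit` are hypothetical names for this sketch.
    """
    ratings = pd.DataFrame(ratings_rows)
    movies = pd.DataFrame(movies_rows)

    # Mean rating and number of ratings per movie.
    m = ratings.groupby('MOVIE_ID', as_index=False).agg(
        {'RATING': ['mean', 'count']})
    m.columns = ['MOVIE_ID', 'RATING', 'COUNT']

    # Keep movies with more than `min_count` ratings, best rated first.
    m = m[m['COUNT'] > min_count]
    top = m.sort_values('RATING', ascending=False)[:limit]

    # Attach movie names.
    return top.merge(movies[['MOVIE_ID', 'NAME']])


# Made-up sample data, for illustration only.
ratings_rows = {'MOVIE_ID': [1, 1, 2, 2, 3], 'RATING': [5, 4, 3, 3, 5]}
movies_rows = {'MOVIE_ID': [1, 2, 3], 'NAME': ['Alpha', 'Beta', 'Gamma']}

# Plain pandas here; to try Xorbits (assumption: it is installed),
# import xorbits.pandas and pass that module as the first argument.
result = top_rated(pandas, ratings_rows, movies_rows)
print(result)
```

Movie 3 has a single rating, so the count filter drops it, and the remaining two movies come back ordered by mean rating.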

Lightning fast speed

According to our benchmark tests, Xorbits is faster than other popular frameworks.

We ran the TPC-H benchmark at scale factors 100 and 1000. The results are shown below.

Deployment

Xorbits can run on your local machine, or be deployed at scale to a cluster via the command line.

Deployment    Description
Local         Running Xorbits on a local machine, e.g. a laptop
Cluster       Deploying Xorbits to an existing cluster via the command line

License

Apache 2

Documentation

The official documentation is hosted on: https://doc.xorbits.io

Roadmaps

Main goals we want to achieve in the future include:

  • Transitioning from pandas-native to Arrow-native data storage,
    which will reduce memory cost substantially and is friendlier to the compute engine.
  • Introducing native engines that leverage technologies like vectorization and codegen to accelerate computations.
  • Scaling as many libraries and algorithms as possible!

More detailed roadmaps will be revealed soon. Stay tuned!

Relationship with Mars

The creators of Xorbits are mainly the creators of Mars. Xorbits is currently built on Mars to reduce duplicated work, but the vision of Xorbits means it is not appropriate to put everything into Mars; instead, we need a new project to better support the roadmap. In the future, we will replace some core internal components with new ones that we will propose. Stay tuned!

Getting involved

Platform           Purpose
Discourse Forum    Asking usage questions and discussing development.
GitHub Issues      Reporting bugs and filing feature requests.
Slack              Collaborating with other Xorbits users.
StackOverflow      Asking questions about how to use Xorbits.
Twitter            Staying up-to-date on new features.


Languages

  • Python 74.4%
  • JavaScript 25.3%
  • HTML 0.3%