Archlinux PKGBUILDs for Data Science, Machine Learning, Deep Learning, NLP and Computer Vision
Packages in this repository

  • bazel
  • kaggle-cli
  • lightgbm-git
  • magma-mkl
  • magma-openblas
  • mxnet-git
  • opencv-cuda-git
  • python-afinn
  • python-bcolz-git
  • python-categorical-encoders-git
  • python-cupy-cudnn-git
  • python-cymem-git
  • python-ete3-git
  • python-fastrlock
  • python-hdbscan-git
  • python-mitie
  • python-mlxtend
  • python-murmurhash-git
  • python-nervana-neon-git
  • python-numpy-mkl
  • python-paratext
  • python-pillow-simd-git
  • python-plac
  • python-preshed-git
  • python-prettytensor
  • python-pydicom-git
  • python-pytorch-mkl-magma-cudnn-git
  • python-pytorch-torchsample-git
  • python-pytorch-vision-git
  • python-ramp-workflow
  • python-rasa-nlu
  • python-scikit-cuda-git
  • python-scipy
  • python-sklearn-pandas
  • python-spacy-git
  • python-tensorflow-cudnn-git
  • python-thinc-cuda-git
  • python-tpot-git
  • python-zarr-git
  • xgboost-cuda-git

Data Science packages for Archlinux

Welcome to my repository for building Data Science, Machine Learning, Computer Vision, Natural Language Processing and Deep Learning packages from source.

Performance considerations

My aim is to squeeze the maximum performance out of my current configuration (Skylake Xeon + Nvidia Pascal GPU), so:

  • All packages are built with -O3 -march=native, even when a package ignores the /etc/makepkg.conf configuration.
  • I do not use fast-math unless it is the upstream default (e.g. OpenCV). You might want to enable it for GCC and NVCC (the Nvidia compiler), for example for Theano.
  • All CUDA packages are built with CUDA 8, cuDNN 5.1 and compute capability 6.1 (Pascal). The update to CUDA 9 and cuDNN 7.0 will be done at a later time (the Arch install is an LXC container, and nvidia-driver must stay in sync with its Debian host).
  • PyTorch is also built with MAGMA support. MAGMA is a linear algebra library for heterogeneous computing (CPU + GPU hybridization).
  • The BLAS library is MKL, except for Tensorflow (Eigen). Note: the previous BLAS was OpenBLAS; some libraries may temporarily still be built against it.
  • The parallelism library is OpenMP, except for Tensorflow (Eigen) and OpenCV (Intel TBB, Threading Building Blocks).
  • OpenCV is further optimized with Intel IPP (Integrated Performance Primitives).
  • Nvidia libraries (cuBLAS, cuFFT, cuSPARSE ...) are used wherever possible.
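
For reference, the makepkg settings behind the first bullet might look like this (a sketch of the relevant /etc/makepkg.conf lines; exact defaults vary between Arch releases):

```shell
# /etc/makepkg.conf -- compile for the local CPU at maximum optimization.
# -march=native lets GCC use every instruction set the build machine supports
# (e.g. AVX2/AVX-512 on a Skylake Xeon); the binaries are then NOT portable
# to older CPUs.
CFLAGS="-O3 -march=native -pipe -fno-plt"
CXXFLAGS="${CFLAGS}"
# Parallel make across all cores:
MAKEFLAGS="-j$(nproc)"
```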

My Data Science environment runs in an LXC container, so Tensorflow's build system, Bazel, must be built with its auto-sandboxing disabled.

Caveats

Please note that the current mxnet and lightgbm packages work but need improvement: they install their libraries into /usr/mxnet and /usr/lightgbm. The packages included here are those not available in the Archlinux AUR by default, or those that needed substantial modifications, so check the Archlinux AUR for standard packages like Numpy, Pandas or Theano.

Description of the Data Science Stack

Packages not described here are dependencies of others (bazel -> Tensorflow; murmurhash, plac, preshed, etc. -> spaCy)

  • General packages

    • Monitoring
      • htop - Monitor CPU, RAM, load, kill programs
      • nvidia-smi - Monitor Nvidia GPU
        1. nvidia-smi -q -g 0 -d TEMPERATURE,POWER,CLOCK,MEMORY -l #Flags can be UTILIZATION, PERFORMANCE (on Tesla) ...
        2. nvidia-smi dmon
        3. nvidia-smi -l 1
    • CSV manipulation from command-line
      • xsv - Fast, parallel CSV toolkit written in Rust
    • Computation, Matrix, Scientific libraries
      • OpenBLAS + LAPACK - Efficient matrix computation and linear algebra library (alternative: MKL)
      • Numpy - Matrix Manipulation in Python
      • Scipy - General scientific library for Python. Sparse matrices support
    • Rapid Development, Research
      • Jupyter - Code in Python, R, Haskell, Julia with direct feedback in your browser
      • jupyter_contrib_nbextensions - Extensions for jupyter (commenting code, ...)
    • GPU computation
      • CUDA - Nvidia API for GPGPU
      • CUDNN - Nvidia primitives for Neural Networks
      • Magma - Linear algebra for OpenCL, CUDA and heterogeneous many-core systems
    • Visualization, Exploratory Data Analysis
      • Matplotlib
      • Seaborn
  • Machine Learning

    • Data manipulation

      • Pandas - Dataframe library
      • Dask - Dataframe library for out-of-core processing (Data that doesn't fit in RAM)
      • Scikit-learn-pandas - Use Pandas Dataframes seamlessly in Scikit-learn pipelines
      • Numexpr
    • Multicore processing

      • joblib
      • Numba
      • concurrent.futures
      • Dask
      • paratext - fast CSV to pandas
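
As a minimal illustration of the multicore tools above, the standard-library concurrent.futures API distributes a CPU-bound function over a pool of worker processes (the workload below is a hypothetical toy):

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    """Toy CPU-bound task: sum of squares below n."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [10_000, 20_000, 30_000, 40_000]
    # executor.map fans the calls out over a pool of worker processes
    # (one per CPU core by default) and preserves input order in the results.
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_bound, inputs))
    print(results[0])  # same value as cpu_bound(10_000), computed in a child process
```

joblib and Dask offer the same pattern with extras (caching, chunked arrays); concurrent.futures is the zero-dependency baseline.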
    • Compressing, storing data

      • Bcolz - Compress Numpy arrays in memory or on-disk and use them transparently
      • Zarr - Compress Numpy arrays in memory or on-disk and use them transparently
    • Out-of-core processing

      • Bcolz, Zarr
      • Dask
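
The out-of-core principle shared by Bcolz, Zarr and Dask can be sketched in plain Python: stream the data in fixed-size chunks and fold each chunk into a running aggregate, so peak memory stays bounded by the chunk size (a toy sketch of the idea, not any of these libraries' actual API):

```python
def chunked(iterable, size):
    """Yield lists of at most `size` items -- stands in for reading one
    compressed chunk of a Bcolz/Zarr array from disk."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def out_of_core_mean(stream, chunk_size=1_000):
    """Mean over a stream of numbers without materializing it in RAM."""
    total, count = 0.0, 0
    for chunk in chunked(stream, chunk_size):
        total += sum(chunk)   # per-chunk reduction, as in dask.array.mean
        count += len(chunk)
    return total / count

print(out_of_core_mean(range(1_000_000)))  # 499999.5
```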
    • Structured Data - Classifier

      • Scikit-learn - General ML framework
      • XGBoost - Gradient Boosted tree library
      • LightGBM - Gradient Boosted tree library

      XGBoost and LightGBM classifiers should be preferred to Scikit-learn's.
    • Pipelines

      • Scikit-learn

      I don't recommend Scikit-learn pipelines as they are not flexible enough: it is not possible to use a validation set for XGBoost/LightGBM early stopping, and computation is wasted for features that don't use target labels.
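
The validation-set early stopping that XGBoost/LightGBM support (and that Scikit-learn pipelines get in the way of) boils down to this pattern: stop adding boosting rounds once the validation metric has not improved for `patience` rounds. A generic sketch with hypothetical `train_round`/`eval_validation` callbacks, not the libraries' actual API:

```python
def fit_with_early_stopping(train_round, eval_validation, max_rounds=1000, patience=10):
    """Run up to max_rounds boosting iterations; stop once the validation
    loss has not improved for `patience` consecutive rounds.
    `train_round(i)` fits one more round; `eval_validation()` returns the
    current validation loss (both hypothetical callbacks)."""
    best_loss, best_round, stalled = float("inf"), -1, 0
    for i in range(max_rounds):
        train_round(i)
        loss = eval_validation()
        if loss < best_loss:
            best_loss, best_round, stalled = loss, i, 0
        else:
            stalled += 1
            if stalled >= patience:
                break   # no improvement for `patience` rounds: stop training
    return best_round, best_loss
```

This is the behaviour XGBoost/LightGBM expose via their early-stopping options; a Scikit-learn pipeline offers no clean way to pass a separately transformed validation set into fit().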

    • Unsupervised Learning - High cardinality/dimensionality (PCA, SVD, ...)

      • Scikit-learn

      Scikit-learn manifold implementations like t-SNE are not recommended, for efficiency reasons (RAM, computation)

    • Geographical data, Clustering

      • scikit-learn
      • Geopy
      • Shapely
      • HDBSCAN - Density-based clustering
    • Categorical data

      • python-categorical-encoders - Encoding with One-Hot, Binary, N-ary, Feature hashes and other schemes.

      Scikit-learn's One-Hot Encoding, LabelEncoder and LabelBinarizer are a mess API-wise but are efficient if wrapped properly
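
At its core, one-hot encoding just maps each category to a unit indicator vector. A minimal pure-Python version (a sketch to make the scheme concrete, not python-categorical-encoders' or Scikit-learn's API):

```python
def one_hot_fit(values):
    """Learn the category -> column index mapping, sorted for determinism."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

def one_hot_transform(values, mapping):
    """Encode each value as a 0/1 indicator row of width len(mapping)."""
    width = len(mapping)
    rows = []
    for v in values:
        row = [0] * width
        row[mapping[v]] = 1   # a category unseen at fit time raises KeyError here
        rows.append(row)
    return rows

colors = ["red", "green", "red", "blue"]
mapping = one_hot_fit(colors)            # {'blue': 0, 'green': 1, 'red': 2}
print(one_hot_transform(colors, mapping))
# [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```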

    • Stacking

      • mlxtend

      I recommend writing your own stacking code so you control your folds
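
Hand-rolled stacking is mostly out-of-fold bookkeeping: for each fold, fit the base model on the other folds and predict the held-out fold, so every training row gets a prediction from a model that never saw it. A standard-library sketch with a hypothetical `fit_predict` callback (simplified contiguous folds, no shuffling or stratification):

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds (simplified; no shuffling)."""
    fold_size, folds, start = n // k, [], 0
    for i in range(k):
        size = fold_size + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(X, y, fit_predict, k=5):
    """Out-of-fold predictions to use as a stacking meta-feature.
    `fit_predict(X_tr, y_tr, X_val)` is a hypothetical callback that trains
    a base model on (X_tr, y_tr) and returns predictions for X_val."""
    oof = [None] * len(X)
    for val_idx in kfold_indices(len(X), k):
        held_out = set(val_idx)
        tr_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in tr_idx], [y[i] for i in tr_idx],
                            [X[i] for i in val_idx])
        for i, p in zip(val_idx, preds):
            oof[i] = p   # each row predicted by a model that never saw it
    return oof
```

Owning this loop is what lets you swap in stratified, grouped or time-ordered folds, which is exactly where canned stacking wrappers fall short.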

    • Time data

      • Workalendar - Business calendar for multiple countries
    • Automatic Machine Learning

      • tpot - Scikit-learn pipeline generated through genetic algorithm
  • Deep Learning

    • Frameworks

      • Theano
      • Tensorflow
      • Pytorch
      • Mxnet
      • (Not tested) Nervana Neon, Chainer, DyNet, MinPy
    • API

      • Keras
    • Vision

      • Keras - Data augmentation
      • Scikit-image - preprocessing, segmenting (single-core)
      • Opencv - preprocessing, segmenting
    • NLP

      • spaCy - Tokenization
      • gensim - word2vec
      • ete3 - NLP trees visualization

      NLTK is single-core, extremely slow, and not recommended

    • Video

      • Vapoursynth - Frameserver for video pre-processing