pabvald/julia-for-data-science

Julia-for-data-science course

This repository is based on the tutorial created for JuliaAcademy and taught by Huda Nassar.

The most important Julia packages for each topic are listed below; most of them are used in the course.

Datasets

  • RDatasets: collection of datasets included in the R language.
  • VegaDatasets: collection of datasets used in Vega and Vega-Lite examples.

Preprocessing

#TODO

Data

General file reading/writing libraries:

  • DelimitedFiles: part of the standard library. It can read and write complicated delimited files, and should be used only when the file is really complicated.
  • CSV: works with .csv files. It is faster than DelimitedFiles and can load the file directly into a DataFrame.
  • XLSX: a Julia package to read and write Excel spreadsheet files. It can read whole sheets as well as particular ranges of cells.
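
As a rough illustration of the standard-library option, the sketch below writes a small comma-delimited file with DelimitedFiles and reads it back. The sample values and the temporary file are made up for the example and are not part of the course material.

```julia
using DelimitedFiles  # standard library

# Hypothetical sample data: a header row plus two records.
header = ["name" "score"]
rows = ["ada" 9.5; "bob" 7.0]

path = tempname() * ".csv"           # temporary file, removed below
writedlm(path, [header; rows], ',')  # write as comma-delimited text

# With header=true, readdlm returns the data cells and the header cells separately.
data, hdr = readdlm(path, ',', header=true)
rm(path)

println(size(data))  # dimensions of the data without the header row
println(vec(hdr))    # column names
```

For regular .csv files, CSV together with DataFrames is usually the more convenient choice; this sketch only shows that the standard library is enough for simple cases.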

DataFrames in Julia:

Libraries to read/write files with specific formats:

  • JSON.jl: provides parsing and printing of JSON in pure Julia.
  • JLD.jl: the JLD module reads and writes "Julia data files" (*.jld files) using HDF5. The key characteristic is that objects of many types can be written, and upon later reading they maintain the proper type.
  • NPZ.jl: the NPZ package provides support for reading and writing NumPy .npy and .npz files in Julia.
  • RData.jl: reads R data files (.rda, .RData) and optionally converts the contents into Julia equivalents. It can read any R data archive, although not all R types can be converted into Julia.
  • MAT.jl: this library can read MATLAB .mat files, both in the older v5/v6/v7 format and in the newer v7.3 format.

Linear Algebra

  • LinearAlgebra.jl: in addition to (and as part of) its support for multi-dimensional arrays, Julia provides native implementations of many common and useful linear algebra operations.

  • SparseArrays.jl: Julia has support for sparse vectors and sparse matrices.
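
A minimal sketch of both standard libraries, using made-up matrices: it solves a small dense linear system, computes eigenvalues, and builds a sparse matrix from coordinate triplets.

```julia
using LinearAlgebra, SparseArrays  # both are standard libraries

A = [4.0 1.0; 1.0 3.0]  # small symmetric matrix (illustrative values)
b = [1.0, 2.0]

x = A \ b                     # solve the linear system A * x = b
vals = eigvals(Symmetric(A))  # real eigenvalues of the symmetric matrix
residual = norm(A * x - b)    # should be close to zero

# Sparse matrix built from (row, column, value) triplets; only nonzeros are stored.
S = sparse([1, 2, 3], [1, 2, 3], [2.0, 5.0, 7.0])
y = S * ones(3)               # sparse matrix times dense vector

println(residual)
println(nnz(S))               # number of stored entries
```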

Statistics

  • Statistics.jl: the Statistics standard library module contains basic statistics functionality (std, var, cor, cov, mean, median, middle, quantile).

  • StatsBase.jl: a Julia package that provides basic support for statistics. Particularly, it implements a variety of statistics-related functions, such as scalar statistics, high-order moment computation, counting, ranking, covariances, sampling, and empirical density estimation.

  • KernelDensity.jl: kernel density estimators for Julia.

  • Distributions.jl: provides everything needed to work with probability distributions.

  • HypothesisTests.jl: implements several hypothesis tests in Julia.

  • MLBase.jl: a Julia package that provides useful tools for machine learning applications (confusion matrix, ROC curves).

See also Visualization.
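
The functions listed for the Statistics standard library can be tried out as in the following sketch. The sample vector is invented for illustration only.

```julia
using Statistics  # standard library

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
y = 2 .* x .+ 1                                # exact linear function of x

m = mean(x)                      # arithmetic mean
med = median(x)                  # middle value of the sorted sample
s = std(x)                       # sample standard deviation (n - 1 denominator)
s_pop = std(x, corrected=false)  # population standard deviation (n denominator)
q = quantile(x, [0.25, 0.75])    # first and third quartiles
r = cor(x, y)                    # Pearson correlation

println((m, med, s, s_pop, q, r))
```

Note that `std` and `var` use the corrected (n - 1) estimator by default; pass `corrected=false` for the population version.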

Dimensionality Reduction

  • MultivariateStats: a Julia package for multivariate statistical analysis. It provides a rich set of useful analysis techniques, such as PCA, CCA, LDA, PLS, etc.
  • TSne: Julia implementation of L.J.P. van der Maaten and G.E. Hinton's t-SNE visualisation technique.
  • UMAP: a pure Julia implementation of the Uniform Manifold Approximation and Projection dimension reduction algorithm.
  • ScikitLearn.decomposition: matrix decomposition algorithms, including PCA, NMF and ICA, among others.

Clustering

  • Clustering.jl: a Julia package for data clustering. It covers two aspects of data clustering: algorithms (k-means, k-medoids, DBSCAN, hierarchical clustering) and validation (silhouettes, V-measure).
  • Distances.jl: a Julia package for evaluating distances (metrics) between vectors.
  • ScikitLearn.cluster: a module that gathers popular unsupervised clustering algorithms.

Nearest neighbors

  • NearestNeighbors.jl: a package written in Julia to perform high performance nearest neighbor searches in arbitrarily high dimensions.

SVM

Decision trees

  • DecisionTree.jl: Julia implementation of Decision Tree (CART) and Random Forest algorithms. Available via AutoMLPipeline.jl, CombineML.jl, MLJ.jl and ScikitLearn.jl.

Linear models

  • GLM.jl: linear and generalized linear models in Julia.
  • LsqFit.jl: basic least-squares fitting in pure Julia under an MIT license.
  • ScikitLearn.linear_model: a module that implements a variety of linear models.
  • ANOVA.jl: calculates ANOVA tables for linear models.

Graphs

#TODO

Numerical Optimization

#TODO

Neural Networks

Julia offers different possibilities to work with neural networks:

  • Flux.jl: the Julia Machine Learning Library.
  • Knet.jl: Koç University deep learning framework.
  • MLJ.jl: Julia machine learning framework by the Alan Turing Institute.
  • MXNet.jl: Apache MXNet Julia package.
  • TensorFlow.jl: a Julia wrapper for TensorFlow.
  • ScikitLearn: Julia implementation of the scikit-learn Python library.

From other languages

  • PyCall: imports any Python package, as well as your own Python code, into Julia.
  • RCall: facilitates communication between the R and Julia languages and allows the user to call R packages from within Julia.
  • Calling C and Fortran Code (in Julia documentation).
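
Calling C requires no extra package, since `ccall` is built into Julia. The sketch below wraps the C standard library function `strlen`; the wrapper name `c_strlen` is hypothetical, and the example assumes a platform where `strlen` can be resolved from the libraries already loaded in the process (typically the case on Linux and macOS).

```julia
# ccall takes: the C function name, the return type, a tuple of argument types,
# and then the argument values.
# `c_strlen` is a hypothetical wrapper name chosen for this example.
c_strlen(s::String) = ccall(:strlen, Csize_t, (Cstring,), s)

n = c_strlen("hello")  # length in bytes, as counted by C
println(n)
```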

Visualization

  • Plots.jl.
  • StatsPlots.jl: statistical plotting recipes for Plots.jl.
  • Makie: high level plotting library with a focus on interactivity and speed.

About

Exercises from the course Julia for Data Science. A guide to the most important Julia libraries for data science.
