
autoEDA-resources

A list of software and papers related to automated Exploratory Data Analysis, including

  • fast data exploration and visualization,
  • augmented analytics,
  • visualization recommendation and other tools that speed up data exploration (visual exploration in particular).

Software

R packages

My summary of R packages is available on arXiv.

Complete Packages

  • dataMaid (CRAN package) - automated checks of data validity.

  • DataExplorer (CRAN package) - automated data exploration (including univariate and bivariate plots, PCA) and treatment.

  • funModeling (CRAN package) - automated EDA, simple feature engineering and outlier detection.

  • SmartEDA (CRAN package) - automated generation of descriptive statistics, uni- and bivariate plots, and parallel coordinate plots. Details can be found in a dedicated paper.

  • autoEDA (GitHub package) - automated EDA with uni- and bivariate plots. An article with an introduction can be found on LinkedIn.

    • auto-EDA (GitHub package) - uni- and bivariate plots for data exploration in regression and classification problems. The package cleans data automatically to improve the plots. Another version of Xander Horn's package.

  • visdat (CRAN package) - 6 exploratory/diagnostic plots for initial data analysis.

  • dlookr (CRAN package) - tools for data quality diagnosis, basic exploration and feature transformations.

  • xray (CRAN package) - first look at the data - distributions and anomalies. More in the blog post.

  • arsenal (CRAN package) - statistical summaries (models and exploration) and quick reporting.

  • RtutoR (CRAN package) - learning material with an automatic reporting module. More at R-Bloggers.

  • exploreR (CRAN package) - exploration based on univariate linear regression.

  • summarytools (CRAN package) - tables that summarise datasets, plus simple uni- and bivariate analyses.

  • inspectdf (CRAN package) - tools for column-wise exploration and comparison of data frames. Examples are provided in the README of the GitHub repo.

Packages in Development

  • AEDA (GitHub package) - summary statistics, correlation analysis, cluster analysis, PCA & other projections.

  • dataexpks (GitHub package) - quick reports with basic data summaries.

  • automatic-data-explorer (GitHub package) - basic EDA and creating Markdown reports from multiple R scripts.

  • xda (GitHub package) - basic data summaries.

  • EDA - stub of a package.

  • modeler (GitHub package) - tools for exploration and pre-processing.

  • IEDA (GitHub package) - EDA simplified through interactive visualization.

  • seda (GitHub package) - fast EDA tool in active development.

Domain-specific packages

Related packages

  • vtreat (CRAN package) - data treatment (pre-processing) that includes dealing with missing data and large categorical variables. Details can be found in the paper about vtreat.

  • report - automated modeling report generation.

  • FactoInvestigate (CRAN package) - has an automatic reporting module which selects the best plots to summarise different projection techniques.

Python libraries

Complete Packages

  • Dora (pip library) - data cleaning, feature engineering and simple modeling tools.

  • statsmodels (pip library) - collection of statistical tools, including EDA.

  • TPOT (pip library) - autoML tool with feature engineering module.

  • HoloViews (pip library) - automated visualization based on short data annotations.

  • lens (pip library) - fast calculation of summary statistics and correlations. Presentation about the library.

  • pandas-profiling - popular library for quick data summaries and correlation analysis.

  • speedML (pip library) - large library for ML with a module dedicated to fast EDA.
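
Several of the libraries above start from the same building block: a per-column summary plus pairwise correlations. The sketch below illustrates that idea using only the Python standard library. It is not the API of any library listed here; the function names `summarize_column` and `pearson`, the use of `None` for missing values, and the toy dataset are all assumptions made for illustration.

```python
import math
import statistics


def summarize_column(values):
    """Return basic descriptive statistics for one column.

    Missing values are represented as None (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["mean"] = statistics.fmean(present)
        summary["median"] = statistics.median(present)
        summary["min"] = min(present)
        summary["max"] = max(present)
    else:
        # Non-numeric column: report the number of distinct values instead.
        summary["unique"] = len(set(present))
    return summary


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return float("nan")
    mean_x = statistics.fmean([x for x, _ in pairs])
    mean_y = statistics.fmean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy dataset stored column-wise.
data = {
    "age": [23, 35, None, 41, 52],
    "income": [21000, 38000, 45000, None, 61000],
    "city": ["Oslo", "Rome", "Oslo", None, "Lima"],
}

report = {name: summarize_column(col) for name, col in data.items()}
age_income_corr = pearson(data["age"], data["income"])
```

Here `report` maps each column name to its summary dictionary, and `age_income_corr` uses only the rows where both values are present. The libraries listed above add plots, report generation and many more statistics on top of summaries of this kind.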

Packages in Development

  • edaviz - Python library for fast data exploration, in a private beta testing phase. Will provide functions for dataset overviews, bivariate plots and finding good predictors.

  • basic-auto-EDA (GitHub library) - automatic report generation.

  • automated_EDA - stub of a library.

Web services

  • DIVE - MIT's tool for data exploration that tries to choose the best (most informative) visualizations.

  • Automatic Statistician - tool for automated EDA and modeling.

  • Several Shiny apps by R Squared Computing, including visulizer and descriptr.
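
Visualization recommendation, in its simplest form, can be thought of as mapping column types to chart types. The toy heuristic below only illustrates that general idea; it is not how DIVE or any other tool listed here works, and the names `column_kind`, `CHART_RULES` and `recommend_chart` are hypothetical.

```python
def column_kind(values):
    """Classify a column as 'numeric' or 'categorical' (None = missing)."""
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numeric" if is_numeric else "categorical"


# Rule table: chart suggested for each pair of column kinds.
CHART_RULES = {
    ("numeric", "numeric"): "scatter plot",
    ("numeric", "categorical"): "box plot per category",
    ("categorical", "numeric"): "box plot per category",
    ("categorical", "categorical"): "mosaic plot",
}


def recommend_chart(x, y=None):
    """Suggest a chart for one column, or for a pair of columns."""
    if y is None:
        return "histogram" if column_kind(x) == "numeric" else "bar chart"
    return CHART_RULES[(column_kind(x), column_kind(y))]


# Hypothetical columns.
ages = [23, 35, None, 41]
cities = ["Oslo", "Rome", "Oslo", None]

single = recommend_chart(ages)
pair = recommend_chart(ages, cities)
```

Real recommenders typically go further, for example by scoring candidate charts on statistical properties of the data, but a rule table like this is a reasonable mental model of the starting point.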

Standalone software

  • auto-eda - automatic EDA with SQL.

  • elycite - tools for exploration and modelling available (locally) as a web application. Designed for NLP problems.

Papers

Methods and tools for autoEDA

Visualization recommendation

Augmented analytics
