Introduction
The Explorer lets you perform various statistical analyses and data mining operations in an easy and intuitive way. As the name implies, the software is meant for exploring data and getting quick insight into the order of magnitude of the observed objects. It therefore focuses on graphical representation and mouse-driven operations, unlike traditional statistical tools cluttered with numerous dialog boxes and lists of figures with five decimal places. You can, however, get the detailed numbers once your analysis is completed.
Videos
- Overview
- Contingency table
- Weather data
- Animation
Screenshots
- Installation and run
- Build from Source
- Data loading
- Main window
- Graph
- Tools
- Selection
- Conversions
- Units
- Types of analyses
- In the browser
- Credits
- Contact
Installation and run
The Explorer is written in JavaScript and built with Electron.
OSX
Download the latest version for darwin from the release page.
Windows
Download the latest version corresponding to your system (32-bit or 64-bit) from the release page. The application is bundled into a single exe file, thanks to BoxedApp Packer.
Linux
Follow the "Build from source" instructions below.
Build from Source
Should you want to go the Build & Deploy route, you'll need node.js (developed on v6.1.0, confirmed to work on v4.7.3) and npm (comes with node.js, developed using v3.9.5, confirmed to work on v2.15.11).
Download and unzip the source files (zip or tar.gz) from the release page, or clone the repository:
git clone https://github.com/jfbouzereau/explorer.git
Enter the Explorer's directory with cd explorer-1.x/app (if you downloaded it from Releases) or cd explorer/app (if you cloned the repository).
Install the dependencies:
npm install
And launch the app:
npm start
Data loading
At launch time, the Explorer shows a window to choose the dataset to use. You can either drag and drop a file from your computer desktop, or click the clipboard button.
Various file formats are accepted:
Source | File extension | Remarks |
---|---|---|
Access | mdb , accdb | Access 2000 or higher |
ARFF / KEEL | * | No comments at the beginning of the file. The first line must be @relation |
BigQuery | * | A config file with a content like this: BigQuery client_secret:/full/path/to/my_private_key.json query:select * from lookerdata:cdc.project_tycho_reports limit 1000 timeout:60000 |
dBase | dbf | |
Excel | xlsx | The names of the fields are expected at the top of the columns |
JMP | jmp | |
JSON file | * | A JSON array of records |
LIMDEP / NLOGIT | lpj | |
MINITAB | mtw | |
MLwiN | ws | Uncompressed format only |
MongoDB | * | A config file with a content like this: mongodb host:192.168.0.121:27017 database:geo collection:countries query:{cont:{$eq:"EU"},pop:{$gt:50000000}} |
Mysql | * | A config file with a content like this: mysql host:192.168.0.2 user:bob password:secret database:test query:select * from mytable |
Postgres | * | A config file with a content like this: postgres host:192.168.0.2 user:bob password:secret database:test query:select * from mytable or: postgres connection:bob:secret@192.168.0.2/test query:select * from mytable |
R | rdb | Binary format only |
SAS | sas7bdat | Uncompressed format only |
SPLUS | sdd | |
SPSS | sav | Uncompressed format only |
SQL Server | * | A config file with a content like this: mssql host:192.168.0.121 username:bob password:secret query:select * from mytable |
Stata | dta | Stata 8 or higher |
Tabular file | * | The names of the fields are expected on the first line |
Bzip2 file | bz2 | The uncompressed file must be in one of the previous formats |
Gzip file | gz | The uncompressed file must be in one of the previous formats |
Web file | * | Contains the url of the data. The remote file must be in one of the previous formats |
If you click the clipboard button, the data must be in tabular form, with the name of the fields on the first line.
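As an illustration of the two simplest inputs above (a tabular file with the field names on the first line, and a JSON array of records), here is a minimal sketch that converts one into the other. The `parseTabular` helper and the tab delimiter are assumptions made for this example; they are not taken from the Explorer's source.

```javascript
// Hypothetical helper: turn tabular text (field names on the first line)
// into an array of records. The tab delimiter is an assumption.
function parseTabular(text, delimiter = "\t") {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  const fields = lines[0].split(delimiter);
  return lines.slice(1).map((line) => {
    const cells = line.split(delimiter);
    const record = {};
    fields.forEach((field, i) => {
      // Missing trailing cells become empty strings.
      record[field] = cells[i] !== undefined ? cells[i] : "";
    });
    return record;
  });
}

const sample = "ID\tCOLOR\tHEIGHT\n1\tBlue\t142\n2\tRed\t175\n";
const records = parseTabular(sample);

// The same dataset expressed as a JSON array of records.
console.log(JSON.stringify(records, null, 2));
```

Values are kept as strings here; deciding which fields are numerical is left to the loading step.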
Main window
Once the data have been successfully loaded, the main window is displayed.
Here are the elements of the interface:
- List of the categorical fields (aka "the pink zone"). By default only 10 fields are displayed. To resize the list, move the mouse just below the list and drag to shrink or extend the list. To scroll the list, move the mouse to the right of the list.
- Icons of the existing analyses (graphs). To run a new analysis, just drag its icon to the workspace.
- List of the numerical fields (aka "the blue zone"). By default only 10 fields are displayed. To resize the list, move the mouse just below the list and drag to shrink or extend the list. To scroll the list, move the mouse to the right of the list.
- Icons of the tools.
- Status bar. This area shows, at any time, details about the object under the mouse or the action you are about to perform.
- Dock. This area is used to keep graphs that are temporarily removed from the workspace.
- Version number.
- Memory usage.
- Workspace. This area is where the graphs are created and arranged.
Graph
To create a new graph, drag its icon to the workspace. Alternatively, if you don't know which icon to look for, you can right-click or control-click on the workspace to get a menu with all the possible analyses.
A graph is represented by an area with several distinct parts:
- Close box. Click on this box to close the graph. All the computations done will be lost.
- Option menu. Some graphs have different ways of representing the results. In that case, click on this sign to bring up the menu to choose from. Alternatively, right-click or control-click within the graph.
- Title bar. This area shows the current selection (see below). Click on this area to drag the graph around.
- Slots. These are the places where you define the parameters of the analysis. Depending on the graph, different combinations of slots are shown. You can drag a categorical field onto a pink slot, and a numerical field onto a blue slot. Parameters can be swapped by dragging from one slot to another (of the same graph, and of the same color).
- Resize box. Click on this box and drag to resize the graph.
To change the type of a graph, drag the icon of the new type onto the graph. The new analysis will retain the parameters and selection of the previous one.
Selection
Every analysis can be restricted to a part of the data only. The set of observations (records) currently processed by a graph is called the selection, and is displayed in the title bar. Initially, the selection consists of all the observations, and the title is blank.
Selection based on a categorical field
- Use a type of graph that splits the dataset into the desired groups: pie chart, bar chart, treemap.
- Drag the slice of the group to be processed out of the graph, onto the workspace.
- This creates a new pie chart with a selection equal to the slice's category.
- Drag the icon of the wanted analysis onto this second graph. It will change its type, but will retain the selection. The type of graph can be changed as many times as wished, all the analyses will be conducted on the same selection.
Conversely, the selection of an existing graph can be changed by dragging a pie slice onto its title. This makes it possible to run the same analysis successively on different parts of the data.
Selection based on a numerical field
- Drag a numerical field from the blue zone to the title of an existing graph. The selection will consist of all the observations with a non-null value of the field. Typically a dummy variable (with values 0 or 1) would be used for this, but not necessarily.
Combining selections
Dragging a slice to the title of a graph which already has a selection will combine the two sets.
If the two variables are the same, the resulting selection will be the union of the two sets. Example: a pie graph splits the data into Apples, Pears, Peaches, and Apricots. If you drag the apple slice to the title of another graph, the selection will be Apples. If you then drag the peach slice to the title of the graph, the selection will be Apples + Peaches.
If the two variables are not the same, the resulting selection will be the intersection of the two sets. Example: a pie graph splits the data into Apples, Pears, Peaches, and Apricots. If you drag the apple slice to the title of another graph, the selection will be Apples. If you change the variable defining the pie to split the data into Organic and Non-Organic, and drag the Organic slice to the title of the second graph, the selection will be Apples AND Organic.
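The union/intersection rule can be sketched as follows. This is only an illustration: the selection is modelled as a Map from variable name to a Set of accepted categories, which is an assumed representation and not necessarily the one the Explorer uses internally. The field names FRUIT and FARMING and the sample records are hypothetical.

```javascript
// Adding a category for a variable already in the selection widens it (union);
// adding a category for a new variable adds a constraint (intersection).
function combine(selection, variable, category) {
  const next = new Map();
  for (const [name, categories] of selection) {
    next.set(name, new Set(categories));
  }
  if (!next.has(variable)) next.set(variable, new Set());
  next.get(variable).add(category);
  return next;
}

// A record is selected when it satisfies every variable's constraint.
// An empty selection therefore matches all the observations.
function matches(selection, record) {
  for (const [name, categories] of selection) {
    if (!categories.has(record[name])) return false;
  }
  return true;
}

const data = [
  { FRUIT: "Apples", FARMING: "Organic" },
  { FRUIT: "Apples", FARMING: "Non-Organic" },
  { FRUIT: "Peaches", FARMING: "Organic" },
  { FRUIT: "Pears", FARMING: "Organic" },
];

const apples = combine(new Map(), "FRUIT", "Apples");

// Same variable: Apples + Peaches.
const applesOrPeaches = combine(apples, "FRUIT", "Peaches");
const unionRows = data.filter((r) => matches(applesOrPeaches, r));

// Different variable: Apples AND Organic.
const organicApples = combine(apples, "FARMING", "Organic");
const intersectionRows = data.filter((r) => matches(organicApples, r));

console.log(unionRows.length, intersectionRows.length);
```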
Conversions
When loading the data, the Explorer identifies fields containing only numbers as numerical, and all other fields as categorical. Sometimes it is desirable to change this. Several possibilities exist.
- Drag a numerical field to the pink zone. The field is converted to categorical; the values are the same, but stored as strings of characters.
- Drag a categorical field to the blue zone. Each category gives a dummy variable of the same name. There are therefore as many dummies as categories of the initial field, and the dummies are mutually exclusive. Example, with COLOR as the categorical field being converted:
Original data:
ID | COLOR |
---|---|
1 | Blue |
2 | Red |
3 | Green |
4 | Red |
Data after the conversion:
ID | Blue | Red | Green |
---|---|---|---|
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 1 | 0 |
- Drag the special numerical field "1" to the pink zone. This "pivots" the data. Each numerical field becomes a category of a new PIVOT field, whose value is in a new COUNT field. Each original record gives as many records as the number of numerical fields. Example: HEIGHT, WIDTH and DEPTH are the numerical fields.
Original data:
ID | COLOR | HEIGHT | WIDTH | DEPTH |
---|---|---|---|---|
1 | Blue | 142 | 25 | 11 |
2 | Red | 175 | 12 | 16 |
3 | Green | 109 | 48 | 14 |
Data after the pivot:
ID | COLOR | PIVOT | COUNT |
---|---|---|---|
1 | Blue | HEIGHT | 142 |
1 | Blue | WIDTH | 25 |
1 | Blue | DEPTH | 11 |
2 | Red | HEIGHT | 175 |
2 | Red | WIDTH | 12 |
2 | Red | DEPTH | 16 |
3 | Green | HEIGHT | 109 |
3 | Green | WIDTH | 48 |
3 | Green | DEPTH | 14 |
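The dummy-variable conversion and the pivot described above can be reproduced on plain arrays of records. The sketch below is an illustration of the two transformations using the example data from this section; `toDummies` and `pivot` are hypothetical helper names, not functions of the Explorer.

```javascript
// Categorical -> numerical: one 0/1 dummy field per category,
// in order of first appearance. The original field is dropped.
function toDummies(records, field) {
  const categories = [...new Set(records.map((r) => r[field]))];
  return records.map((r) => {
    const { [field]: value, ...rest } = r;
    for (const c of categories) rest[c] = value === c ? 1 : 0;
    return rest;
  });
}

// Pivot: each numerical field becomes a category of PIVOT,
// and its value goes into COUNT. Other fields are repeated.
function pivot(records, numericFields) {
  const out = [];
  for (const r of records) {
    const kept = {};
    for (const key of Object.keys(r)) {
      if (!numericFields.includes(key)) kept[key] = r[key];
    }
    for (const f of numericFields) {
      out.push({ ...kept, PIVOT: f, COUNT: r[f] });
    }
  }
  return out;
}

const colors = [
  { ID: 1, COLOR: "Blue" },
  { ID: 2, COLOR: "Red" },
  { ID: 3, COLOR: "Green" },
  { ID: 4, COLOR: "Red" },
];
const dummies = toDummies(colors, "COLOR");

const boxes = [
  { ID: 1, COLOR: "Blue", HEIGHT: 142, WIDTH: 25, DEPTH: 11 },
  { ID: 2, COLOR: "Red", HEIGHT: 175, WIDTH: 12, DEPTH: 16 },
  { ID: 3, COLOR: "Green", HEIGHT: 109, WIDTH: 48, DEPTH: 14 },
];
const pivoted = pivot(boxes, ["HEIGHT", "WIDTH", "DEPTH"]);

console.log(dummies);
console.log(pivoted);
```

Each original record yields as many pivoted records as there are numerical fields, so three records with three numerical fields give nine rows.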
Units
- All the analyses applied to categorical fields (whose icon is pink) count the observations. For example, in a pie chart the slices are proportional to the number of observations in each category. Sometimes the counts have to be weighted. This is done by changing the "unit" of the graph, by dragging a numerical field onto the graph. The title of the graph turns blue to indicate that the counts are weighted. The status bar also shows the values or percentages in the new unit. To remove the unit and go back to normal counting, drag the special field "1" onto the graph.
- All the analyses that represent datapoints in a 2D plane (scatter plot, PCA, discriminant analysis, ternary plot, etc.) can also be modified. If a numerical field is set as unit, the datapoints are displayed as circles whose size is proportional to the unit.
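The difference between plain counting and counting with a unit can be sketched as below. This is an assumed model of the behaviour described above, not the Explorer's code: without a unit every observation weighs 1 (which is what the special field "1" stands for), and with a unit each observation contributes the value of that numerical field. The QUANTITY field and the sample records are hypothetical.

```javascript
// Sum the weights of the observations in each category of `field`.
// With no unit, each observation counts for 1.
function countBy(records, field, unit) {
  const totals = new Map();
  for (const r of records) {
    const weight = unit === undefined ? 1 : Number(r[unit]);
    totals.set(r[field], (totals.get(r[field]) || 0) + weight);
  }
  return totals;
}

const sales = [
  { COLOR: "Blue", QUANTITY: 10 },
  { COLOR: "Red", QUANTITY: 2 },
  { COLOR: "Red", QUANTITY: 3 },
];

const plainCounts = countBy(sales, "COLOR");
const weightedCounts = countBy(sales, "COLOR", "QUANTITY");

console.log(plainCounts);    // Red has more observations
console.log(weightedCounts); // Blue has the larger total quantity
```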
Tools
Here are the tools offered by the toolbar at the bottom of the screen:
- Sort: Drag this icon onto a field, or drag a field onto this icon, to sort the data in ascending order. Repeat the same sort to sort in descending order. The sort is stable: to sort the data by a key consisting of field1,field2,field3, sort by field3 first, then field2, and finally field1.
- Clone: Drag this icon onto a graph to get a copy of it, with the same parameters. If the computation is slow, this avoids running it a second time.
- Add: Drag this icon to the pink or blue zone to create a new field. See below.
- Help: Drag this icon onto a graph to get information about the analysis, the results produced, the representation options, and the possible actions.
- Picture: Drag this icon onto a graph to get its image in png format.
- Table: Drag this icon to the pink or blue zone to get a table of the values of the dataset. Drag this icon onto a graph to get a table of the numerical results. They can be copied to the clipboard (with control-C or command-C) and pasted into another application.
- Dustbin: Drag this icon onto a field, or drag a field onto this icon, to permanently remove the field (a field used by some graphs cannot be removed). Drag a pie slice, a bar, or a treemap slice onto this icon to permanently remove the corresponding records. The original input file is not modified.
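The multi-key recipe given for the Sort tool relies on the sort being stable: ties keep the order produced by the previous pass. Here is a minimal sketch of the same idea with plain JavaScript arrays, assuming a runtime where `Array.prototype.sort` is stable (guaranteed since ES2019). The `sortBy` helper, the `id` field, and the sample rows are hypothetical.

```javascript
// Ascending sort on one field, returning a new array.
function sortBy(records, field) {
  return [...records].sort((a, b) =>
    a[field] < b[field] ? -1 : a[field] > b[field] ? 1 : 0
  );
}

const rows = [
  { id: "A", field1: "b", field2: 1, field3: "y" },
  { id: "B", field1: "a", field2: 2, field3: "x" },
  { id: "C", field1: "a", field2: 1, field3: "z" },
  { id: "D", field1: "a", field2: 1, field3: "y" },
  { id: "E", field1: "b", field2: 1, field3: "x" },
];

// Least significant key first, most significant key last.
let sorted = sortBy(rows, "field3");
sorted = sortBy(sorted, "field2");
sorted = sortBy(sorted, "field1");

const order = sorted.map((r) => r.id);
console.log(order);
```

The result is ordered by field1, then field2 within equal field1, then field3 within equal field1 and field2.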
Types of analyses
- Pie chart
- Bar chart
- Line chart
- Association diagram
- Word cloud
- Arc diagram
- Contingency table
- Multiple Correspondence analysis
- 3-variable graph
- Treemap
- Chi-2 tests
- Pearson's chi-square test
- Yates' chi-square test
- G-test
- Fisher's exact test
- Gini impurity
- Entropy
- Repartition curve
- Distribution curve
- Scatter plot
- Ternary plot
- Andrews curves
- Survey plot
- 3D plot
- Correlations
- Autocorrelation plot
- Probability plot
- Tukey-lambda PPCC plot
- Lag plot
- General statistics
- Normality tests
- Shapiro-Wilk test
- Anderson-Darling test
- Lilliefors test
- D'Agostino test
- Anscombe test
- Omnibus test
- Jarque-Bera test
- Analysis of variance
- Bartlett's test
- F-test
- Levene test
- Brown Forsythe test
- Box's M test
- Student's T-test
- Welch T-test
- Hotelling's test
- Wilks' lambda
- Lawley-Hotelling trace
- Pillai trace
- Two-way ANOVA
- Non-parametric tests
- Kolmogorov-Smirnov test
- Kruskal-Wallis test
- Jonckheere test
- Cochran Q test
- Durbin test
- Friedman test
- Mantel-Haenszel test
- Breslow-Day test
- Woolf test
- Principal components
- Canonical correlation analysis
- K-means
- K-medoids
- Fuzzy C-means
- Huen diagram
- Dendrogram
- Radviz
- Discriminant analysis
- Regressions
- Linear regression
- Poisson regression
- Negative binomial regression
- Logistic regression
- Least angle regression
- Influence plot
- QQ plot
- Box plot
- Parallel coordinates
- Neural network (perceptron)
In the browser
The Explorer can also be executed in any modern browser. Open app/index.html, paste the data from the clipboard, and click OK.
Credits
The Explorer takes advantage of some very useful npm modules:
- gapitoken: Node.js module for Google API service account authorization
- mongodb: The official MongoDB driver for Node.js
- pg: Pure JavaScript PostgreSQL client for node.js
- lzma-purejs: Pure JavaScript LZMA de/compression, for node.js
- mysql: A node.js driver for MySQL
- request: Simplified HTTP request client
- synaptic: Architecture-free neural network library for node.js and the browser
- tedious: A TDS driver, for connecting to MS SQL Server databases