# Exploratory data analysis
These are certainly useful, but there are some extra tools we can use.

- `skimr` produces a more useful version of `summary()`
- `DataExplorer` gives us a data profile report in a single line of code!

In [2]:
# Install the exploration packages used in this section
install.packages(c("skimr", "naniar", "DataExplorer", "visdat", "UpSetR"))
library(tidyverse)
# Read the heroes dataset into a tibble
heroes <- read_csv("data/heroes_information.csv")

Installing packages into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)
also installing the dependencies ‘rlang’, ‘tinytex’, ‘tidyselect’, ‘data.table’, ‘rmarkdown’, ‘networkD3’



ERROR: Error: package or namespace load failed for ‘tidyverse’ in library.dynam(lib, package, package.lib):
 shared object ‘readxl.so’ not found


## Data profiling
`skimr` is a great way to improve upon the `summary()` function.

In [2]:
library(skimr)
print(skim(heroes))


Attaching package: ‘skimr’

The following objects are masked from ‘package:dplyr’:

    contains, ends_with, everything, matches, num_range, one_of,
    starts_with



Skim summary statistics
 n obs: 734 
 n variables: 11 

── Variable type:character ─────────────────────────────────────────────────────
   variable missing complete   n min max empty n_unique
  Alignment       0      734 734   1   7     0        4
  Eye color       0      734 734   1  23     0       23
     Gender       0      734 734   1   6     0        3
 Hair color       0      734 734   1  16     0       30
       name       0      734 734   1  25     0      715
  Publisher      15      719 734   4  17     0       24
       Race       0      734 734   1  18     0       62
 Skin color       0      734 734   1  14     0       17

── Variable type:integer ───────────────────────────────────────────────────────
 variable missing complete   n  mean     sd p0    p25   p50    p75 p100     hist
       X1       0      734 734 366.5 212.03  0 183.25 366.5 549.75  733 ▇▇▇▇▇▇▇▇

── Variable type:numeric ───────────────────────────────────────────────────────
 variable missing complete   n 
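
To make the character-column statistics in the skim output easier to read, here is a minimal base R sketch that computes the same kind of numbers by hand. The `publisher` vector is hypothetical toy data, not taken from the heroes file, and the definitions (for example, `min`/`max` as string lengths and `n_unique` ignoring `NA`) are my reading of the column names rather than a statement of how `skimr` implements them.

```r
# Hypothetical character column with one missing value and one empty string
publisher <- c("Marvel Comics", "DC Comics", NA, "", "DC Comics")

# Non-missing values only
present <- publisher[!is.na(publisher)]

char_summary <- c(
  missing  = sum(is.na(publisher)),      # number of NA values
  complete = length(present),            # number of non-NA values
  n        = length(publisher),          # total number of values
  min      = min(nchar(present)),        # shortest string length
  max      = max(nchar(present)),        # longest string length
  empty    = sum(present == ""),         # number of empty strings
  n_unique = length(unique(present))     # distinct non-NA values
)
print(char_summary)
```
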

The `skim()` result is also a data.frame behind the scenes, so we can do much more with it than just view the printed summary.

In [3]:
skim(heroes)

variable,type,stat,level,value,formatted
X1,integer,missing,.all,0.0000,0
X1,integer,complete,.all,734.0000,734
X1,integer,n,.all,734.0000,734
X1,integer,mean,.all,366.5000,366.5
X1,integer,sd,.all,212.0318,212.03
X1,integer,p0,.all,0.0000,0
X1,integer,p25,.all,183.2500,183.25
X1,integer,p50,.all,366.5000,366.5
X1,integer,p75,.all,549.7500,549.75
X1,integer,p100,.all,733.0000,733
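
Because the result is a regular data.frame, it can be filtered like any other. The sketch below uses a few rows copied from the output above rather than the real `skim(heroes)` object, and assumes the long `variable`/`stat`/`value` layout shown there; other `skimr` versions may return a different shape.

```r
# Rows copied from the skim(heroes) output above (X1 only)
skimmed <- data.frame(
  variable = rep("X1", 4),
  type     = "integer",
  stat     = c("missing", "complete", "n", "mean"),
  level    = ".all",
  value    = c(0, 734, 734, 366.5),
  stringsAsFactors = FALSE
)

# Keep only the missing-value counts, one row per variable
missing_rows <- skimmed[skimmed$stat == "missing", c("variable", "value")]
print(missing_rows)

# Share of complete observations for X1
complete_rate <- skimmed$value[skimmed$stat == "complete"] /
  skimmed$value[skimmed$stat == "n"]
print(complete_rate)
```
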


The `visdat` and `naniar` packages are great for visualising the data, particularly missing values. `naniar` uses `UpSetR` to great effect for finding overlaps in missing values. This is really useful with survey data, for instance, where it shows which questions are part of conditional branches!

In [1]:
library(visdat)
vis_dat(heroes)

ERROR: Error in library(visdat): there is no package called ‘visdat’
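
An error like this means the package is not installed in the current library. As a small sketch, assuming the package list from the install cell earlier in this section, we can check which packages are available before calling `library()`:

```r
# Packages used in this section
wanted <- c("skimr", "naniar", "DataExplorer", "visdat", "UpSetR")

# TRUE for each package whose namespace can be loaded, FALSE otherwise
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# Names of the packages that still need install.packages()
missing_pkgs <- wanted[!is_installed]
print(missing_pkgs)
```
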


In [None]:
library(naniar)
library(UpSetR)

heroes %>%
  as_shadow_upset() %>%
  upset()
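
To show what "overlaps in missing values" means, here is a base R sketch that counts the combinations of missing columns, which is roughly the information an UpSet plot draws. The `survey` data is hypothetical, with `q2` and `q3` standing in for questions on a conditional branch; it is not how `naniar` computes its plot.

```r
# Hypothetical survey responses: q2 and q3 are only asked when q1 is "yes"
survey <- data.frame(
  q1 = c("yes", "no", "yes", "no", NA),
  q2 = c(5, NA, 3, NA, NA),
  q3 = c("a", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(survey))
print(missing_per_column)

# Label each row by the set of columns that are missing in it
missing_pattern <- apply(is.na(survey), 1, function(row) {
  if (any(row)) paste(names(survey)[row], collapse = " + ") else "none"
})

# How often each combination of missing columns occurs
pattern_counts <- table(missing_pattern)
print(pattern_counts)
```

Columns that are usually missing together, like `q2` and `q3` here, are a hint that they sit on the same branch.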

These packages help with visual and ad-hoc exploration. A nifty, quick way of getting a profile of a dataset is with the `DataExplorer` package.

In [7]:
library(DataExplorer)
create_report(heroes, output_file="heroes.html", output_dir="outputs")




Output created: outputs/heroes.html


Report is generated at "outputs/heroes.html".


> Check out the report produced by `DataExplorer`, then explore the powers dataset (`data/super_hero_powers.csv`) with the tools we've used here.