# Exploring Data in R
Let's load up some of our data and explore it!

I'm going to use plain `read_csv()` calls with no extra arguments here, so we can see later how to clean things up.

In [1]:
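install.packages("tidyverse")  # only needed once per environment
library(tidyverse)

# Plain reads for now; readr guesses the column types from the data
heroes <- read_csv("data/heroes_information.csv")
powers <- read_csv("data/super_hero_powers.csv")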

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.1     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  X1 = col_integer(),
  name = col_character(),
  Gender = col_character(),
  `Eye color` = col_character(),
  Race = col_character(),
  `Hair color` = col_character(),
  Height = col_double(),
  Publisher = col_character(),
  `Skin color` = col_character(),
  Alignment = col_character(),
  Weight = col_double()
)
Parsed with column specification:
cols(
  .default = col_character()
)
See spec(...) for full column specifications.


## Out of the box tools

R has some built-in functions that help us understand our data.

- `head()` returns the first few rows of the dataset (six by default)
- `summary()` gives us an overview of the values in each column
- `str()` shows the structure of the dataset: each column's data type and its first few values
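
If you want to try these three functions without the CSV files, a small stand-in data frame is enough. The sketch below uses a hypothetical data frame called `mini` (not part of the original notebook), with three columns and six rows copied from the first records of `heroes`. It also passes `n` to `head()` to change how many rows come back.

```r
# Hypothetical stand-in for `heroes`: three columns, six rows,
# copied from the first records of the dataset.
mini <- data.frame(
  name = c("A-Bomb", "Abe Sapien", "Abin Sur",
           "Abomination", "Abraxas", "Absorbing Man"),
  Publisher = c("Marvel Comics", "Dark Horse Comics", "DC Comics",
                "Marvel Comics", "Marvel Comics", "Marvel Comics"),
  Height = c(203, 191, 185, 203, -99, 193),
  stringsAsFactors = FALSE
)

top3 <- head(mini, n = 3)  # first three rows instead of the default six
print(top3)

print(summary(mini$Height))  # min, quartiles, median and mean of one column
str(mini)                    # column types and the first few values of each
```

Base R's `data.frame()` is used here only so the example has no package dependencies; the same calls work on the tibbles that `read_csv()` returns.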

In [2]:
head(heroes)

X1,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,A-Bomb,Male,yellow,Human,No Hair,203,Marvel Comics,-,good,441
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191,Dark Horse Comics,blue,good,65
2,Abin Sur,Male,blue,Ungaran,No Hair,185,DC Comics,red,good,90
3,Abomination,Male,green,Human / Radiation,No Hair,203,Marvel Comics,-,bad,441
4,Abraxas,Male,blue,Cosmic Entity,Black,-99,Marvel Comics,-,bad,-99
5,Absorbing Man,Male,blue,Human,No Hair,193,Marvel Comics,-,bad,122


In [3]:
summary(heroes)

       X1            name              Gender           Eye color        
 Min.   :  0.0   Length:734         Length:734         Length:734        
 1st Qu.:183.2   Class :character   Class :character   Class :character  
 Median :366.5   Mode  :character   Mode  :character   Mode  :character  
 Mean   :366.5                                                           
 3rd Qu.:549.8                                                           
 Max.   :733.0                                                           
                                                                         
     Race            Hair color            Height       Publisher        
 Length:734         Length:734         Min.   :-99.0   Length:734        
 Class :character   Class :character   1st Qu.:-99.0   Class :character  
 Mode  :character   Mode  :character   Median :175.0   Mode  :character  
                                       Mean   :102.3                     
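
The `Height` summary above has a minimum and a first quartile of -99, which is not a plausible height. One possible reading is that -99 is a placeholder for "unknown". That is an assumption, since nothing in the data documents it. Under that assumption, this minimal sketch shows how much the placeholder distorts a mean, using only the six `Height` values from the `head(heroes)` output.

```r
# Heights of the first six heroes, copied from the head(heroes) output.
height <- c(203, 191, 185, 203, -99, 193)

# Mean with the placeholder left in: the -99 drags the result down.
mean_raw <- mean(height)

# Assumption: -99 encodes a missing value, so recode it as NA.
height_clean <- height
height_clean[height_clean == -99] <- NA

# Mean over the known heights only.
mean_clean <- mean(height_clean, na.rm = TRUE)

print(c(raw = mean_raw, clean = mean_clean))
```

On these six values the mean moves from 146 to 195 once the placeholder is recoded, so it is worth checking for values like this before trusting any summary statistic.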
                                      

In [4]:
str(heroes)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	734 obs. of  11 variables:
 $ X1        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ name      : chr  "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
 $ Gender    : chr  "Male" "Male" "Male" "Male" ...
 $ Eye color : chr  "yellow" "blue" "blue" "green" ...
 $ Race      : chr  "Human" "Icthyo Sapien" "Ungaran" "Human / Radiation" ...
 $ Hair color: chr  "No Hair" "No Hair" "No Hair" "No Hair" ...
 $ Height    : num  203 191 185 203 -99 193 -99 185 173 178 ...
 $ Publisher : chr  "Marvel Comics" "Dark Horse Comics" "DC Comics" "Marvel Comics" ...
 $ Skin color: chr  "-" "blue" "red" "-" ...
 $ Alignment : chr  "good" "good" "good" "bad" ...
 $ Weight    : num  441 65 90 441 -99 122 -99 88 61 81 ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 11
  .. ..$ X1        : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ name      : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Gende

## Additional tools
These are certainly useful, but there are some extra tools we can use.

- `skimr` produces a more useful version of `summary()`
- `DataExplorer` gives us a data profile report in a single line of code!

In RStudio, there's also the `View()` function, which opens an interactive grid view of the data.

In [5]:
install.packages(c("skimr", "DataExplorer"))

Installing packages into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)


In [8]:
library(skimr)
# print() is needed here; without it, the notebook renders the table that sits behind the skim object rather than this text summary
print(skim(heroes))

Skim summary statistics
 n obs: 734 
 n variables: 11 

── Variable type:character ─────────────────────────────────────────────────────
   variable missing complete   n min max empty n_unique
  Alignment       0      734 734   1   7     0        4
  Eye color       0      734 734   1  23     0       23
     Gender       0      734 734   1   6     0        3
 Hair color       0      734 734   1  16     0       30
       name       0      734 734   1  25     0      715
  Publisher      15      719 734   4  17     0       24
       Race       0      734 734   1  18     0       62
 Skin color       0      734 734   1  14     0       17

── Variable type:integer ───────────────────────────────────────────────────────
 variable missing complete   n  mean     sd p0    p25   p50    p75 p100
       X1       0      734 734 366.5 212.03  0 183.25 366.5 549.75  733
     hist
 ▇▇▇▇▇▇▇▇

── Variable type:numeric ───────────────────────────────────────────────────────
 variable missing complete   n 

In [7]:
library(DataExplorer)
create_report(heroes, output_file="heroes.html", output_dir="outputs")




Output created: outputs/heroes.html


Report is generated at "outputs/heroes.html".


> Go download the output and let's see what's in it. Can you write a report creation step for the superpowers data (`powers`)?