Flagging outliers in R #2

Open
jessicasmiller opened this issue Jan 16, 2017 · 2 comments


@jessicasmiller

Ben mentioned in class today that when examining your data, there are ways to get R to flag any data values that it identifies as outliers or that you designate as outside an expected value range. Are there any resources that elaborate on that? I think it would be really helpful to be able to flag and then remove or mask values that you’ve identified as outliers.

@bbolker
Collaborator

bbolker commented Jan 16, 2017

If you just do summary(), R will tell you (among other things) the min and max values, as well as the number of NA values, if any. (Here I'm using summary() just for the mpg column in the built-in mtcars data set; summary(mtcars) will give you the summaries for every column.)

summary(mtcars$mpg)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 10.40   15.42   19.20   20.09   22.80   33.90 

If you have a range of values in mind for a variable, you can use filter() to select just the rows that fall outside this range: in this case I'm going to look for mpg values outside the range (12,32).

library(dplyr)
badrows <- (mtcars %>%
  filter(mpg<12 | mpg>32)
)

(I'm putting parentheses around the whole expression here so Jonathan doesn't yell at me)

  • I can look at this filtered data set by clicking on the little spreadsheety-looking icon in the Data window in RStudio
    [screenshot of the RStudio Data window]

  • If I'm working in the console and want to look only at a few columns, I could quickly select() a few:

(badrows  %>%
  select(mpg,cyl,disp))
#    mpg cyl  disp
# 1 10.4   8 472.0
# 2 10.4   8 460.0
# 3 32.4   4  78.7
# 4 33.9   4  71.1

In fact, the view of the data that I get in RStudio actually has a Filter button that I can use to do this interactively ...

[screenshot of the Filter button in the RStudio data view]
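
If the goal is to flag values rather than pull them out into a separate data set, one option is to add a logical column and keep every row. This is only a minimal sketch, assuming the same illustrative (12,32) bounds; the names lower, upper, car and mpg_flag are made up for the example and are not part of mtcars or dplyr.

```r
library(dplyr)

# Assumed bounds: the same illustrative (12, 32) range used above
lower <- 12
upper <- 32

flagged <- (mtcars %>%
  mutate(car = rownames(mtcars),               # keep the car names as an ordinary column
         mpg_flag = mpg < lower | mpg > upper)  # TRUE where mpg is outside the bounds
)

# Count the flagged rows, then list only those rows
sum(flagged$mpg_flag)
(flagged %>%
  filter(mpg_flag) %>%
  select(car, mpg, mpg_flag))
```

With the built-in mtcars data this flags the same four rows as the filter() call above, but nothing is dropped, so the decision about what to do with them stays separate from the step that identifies them.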

If I want to get rid of rows, I can use filter() in the opposite sense:

goodrows <- (mtcars %>% filter(mpg>=12 & mpg<=32))
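
Masking is an alternative to removing rows: the suspect values are set to NA and the rest of each row is kept. The following is a sketch under the same assumed bounds of 12 and 32; masked is a hypothetical name, and if_else() is dplyr's type-strict version of ifelse().

```r
library(dplyr)

# Replace out-of-range mpg values with NA instead of dropping the rows
masked <- (mtcars %>%
  mutate(mpg = if_else(mpg < 12 | mpg > 32, NA_real_, mpg))
)

nrow(masked)                    # same number of rows as mtcars
sum(is.na(masked$mpg))          # number of values that were masked
mean(masked$mpg, na.rm = TRUE)  # NA values have to be dropped explicitly in summaries
```

The original mtcars data frame is left untouched, so the masking can always be undone or revisited.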

However, I/we do want to caution you very strongly that you always need a good reason to exclude data: you should never exclude data automatically. You need to use human judgement to establish that data points should be excluded.

@jessicasmiller
Author

jessicasmiller commented Jan 17, 2017 via email
