---
title: "Intro to R 03: EDA"
author: "Kyla McConnell"
output: html_document
---
Set-up:
Load the tidyverse package and read in the penguins data:
```{r}
library(tidyverse)
penguins <- read_csv("data/penguin_data_1.csv")
```
## Summary tables
With `filter()`, you can view or save only the rows whose values in a column match a certain value or fall within a certain range.
With `summarize()`, you can split the data into groups and return a summary of each group's values (mean, count, etc.).
### filter()
Use `filter()` to return all items that fit a certain condition. For example, you can use:
- equal to: ==
- not equal to: !=
- greater than: >
- greater than or equal to: >=
- less than: <
- less than or equal to: <=
- in (i.e. in a vector): %in%
Syntax: filter(columnname logical-operator condition)
```{r}
penguins %>%
filter(bill_length_mm > 55)
penguins %>%
filter(bill_length_mm < 35)
penguins %>%
filter(bill_depth_mm < 14)
```
You can also select rows where a numeric column has a specific value:
```{r}
penguins %>%
filter(year == 2009)
```
Or you can use it to select all items in a given category. Notice here that you have to use quotation marks to show you're matching a character string.
Look at the error below:
```{r}
penguins %>%
filter(island == Biscoe)
```
The correct syntax is: (because you're matching to a string)
```{r}
penguins %>%
filter(island == "Biscoe")
penguins %>%
filter(species == "Gentoo")
```
To use %in%, give a vector of options (formatted in the correct way based on whether the column is a character or numeric):
```{r}
penguins %>%
filter(species %in% c("Gentoo", "Chinstrap"))
penguins %>%
filter(year %in% c(2008, 2009))
```
Note that filter is case-sensitive, so capitalization matters.
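As a small illustration of case sensitivity, here is a sketch using a made-up tibble (`toy_islands` and its values are hypothetical, not part of the penguins data). Lower-casing the column with `tolower()` is one possible way to match regardless of capitalization:

```{r}
library(dplyr)

# Hypothetical toy data, not part of the penguins dataset
toy_islands <- tibble(
  island = c("Biscoe", "biscoe", "Dream"),
  count  = c(10, 5, 7)
)

# Only the exact spelling "Biscoe" matches
exact_match <- toy_islands %>%
  filter(island == "Biscoe")

# Lower-casing the column first makes the comparison case-insensitive
any_case <- toy_islands %>%
  filter(tolower(island) == "biscoe")

nrow(exact_match)
nrow(any_case)
```

The first filter keeps one row, while the second keeps both spellings.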
#### Try it out
1. Load in the spr_example dataset again (update the path if you have to).
```{r}
spr <- read_csv("data/spr_example.csv")
```
2. Reading times lower than 150ms are pretty unusual and are sometimes removed from self-paced reading data. You decide that you want to remove these response times from your data. Filter out all RTs below 150ms from the spr dataset and save your result.
```{r}
nrow(spr)
spr <- spr %>%
filter(RT >= 150)
nrow(spr)
```
3. You want to take a look at the results from Par_C only: Filter these data points but don't save the result.
```{r}
spr %>%
filter(participant == "Par_C")
```
4. You're not interested in the practice items in the spr dataframe. Filter the dataset to include everything BUT the practice items and save your result.
```{r}
spr <- spr %>%
filter(cond != "practice")
```
### group_by() & summarize()
First, load the ikea dataset again:
```{r}
ikea <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv')
```
To look at summary statistics for specific groupings, we have to use a two- (more like three-) step process.
First, group by your grouping variable using `group_by()`
Then, summarize, which creates a column based on a transformation to another column, using `summarize()` or `summarise()`
Finally, ungroup with `ungroup()`, so that R forgets the grouping and carries on as normal
- usually this makes no difference to what you see but is an important fail-safe
You can use multiple different operations in the summarize part, including:
- mean(col_name)
- median(col_name)
- max(col_name)
- min(col_name)
For example, to return the average price in each category:
```{r}
ikea %>%
group_by(category) %>%
summarize(mean(price)) %>%
ungroup()
```
You can also give a name to your new summary column:
```{r}
ikea %>%
group_by(category) %>%
summarize(avg_price = mean(price)) %>%
ungroup()
```
Or further manipulate it:
```{r}
ikea %>%
group_by(category) %>%
summarize(avg_price_eur = mean(price) * 0.22) %>%
ungroup()
```
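Several of the summary operations listed above can also be combined in a single `summarize()` call, one named column per operation. The sketch below uses a made-up tibble (`toy_prices` is hypothetical) so the numbers are easy to check by hand:

```{r}
library(dplyr)

# Hypothetical toy data with two categories
toy_prices <- tibble(
  category = c("Beds", "Beds", "Chairs", "Chairs", "Chairs"),
  price    = c(100, 300, 20, 40, 90)
)

# One row per category, one column per summary
toy_summary <- toy_prices %>%
  group_by(category) %>%
  summarize(
    avg_price    = mean(price),
    median_price = median(price),
    min_price    = min(price),
    max_price    = max(price)
  ) %>%
  ungroup()

toy_summary
```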
For categorical columns, you can also count how many rows are in each category using `count()` instead of `summarize()`
```{r}
ikea %>%
group_by(category) %>%
count() %>%
ungroup()
```
Similarly, you can use distinct() to return the unique items in the category, for example:
```{r}
ikea %>%
group_by(category) %>%
distinct(name) %>%
ungroup()
ikea %>%
distinct(name)
```
You can call `distinct()` without an argument, which removes duplicate rows (rows that are identical across all columns), or you can call it on a specific column, which will return only that column and all the unique items in it.
You can also combine `distinct()` with `count()`, for example to count the unique names in each category:
```{r}
ikea %>%
group_by(category) %>%
distinct(name) %>%
count()
```
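To make the difference between the two ways of calling `distinct()` concrete, here is a minimal sketch on a made-up tibble (`toy_items` is hypothetical and only loosely modeled on the ikea data):

```{r}
library(dplyr)

# Hypothetical toy data: the first two rows are exact duplicates
toy_items <- tibble(
  name     = c("BILLY", "BILLY", "BILLY", "KALLAX"),
  category = c("Bookcases", "Bookcases", "Cabinets", "Bookcases")
)

# No arguments: drops rows that exactly duplicate an earlier row
unique_rows <- toy_items %>%
  distinct()

# One column: returns only that column, with one row per unique value
unique_names <- toy_items %>%
  distinct(name)

unique_rows
unique_names
```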
#### Try it out
1. Starting with the spr dataset, group by participant and use summarize to find the max RT for each participant.
```{r}
spr %>%
group_by(participant) %>%
summarize(max(RT)) %>%
ungroup()
```
2. With the same dataset, group by condition and return the mean RT for each condition. This is already a first step towards statistical analysis of your data! Do conditions look like they have similar or very different means?
```{r}
spr %>%
group_by(cond) %>%
summarize(mean(RT)) %>%
ungroup()
```
3. Summarize based on condition and count how many full sentences are in each condition. (Here you will realize that this example dataset doesn't make a ton of sense :D)
```{r}
spr %>%
group_by(cond) %>%
count(full_sentence) %>%
ungroup()
spr %>%
count(full_sentence)
```
### ifelse() & mutate()
Now for something fancy. You can also make new columns based on "if" conditions using the call `ifelse()`. The syntax of ifelse is: ifelse(this_is_true, this_happens, else_this_happens). For example:
```{r}
ikea %>%
mutate(price_categorical = ifelse(price > 2000, "expensive", "not expensive")) %>%
select(name, price, price_categorical)
```
Here is another example (notice the output isn't being saved since we don't really need it):
```{r}
ikea %>%
mutate(height_categorical = ifelse(height <= 100, "shorter than 1m", "taller than 1m")) %>%
select(name, height, height_categorical)
```
You can also use ifelse on categorical / character columns:
```{r}
ikea %>%
mutate(fav_designer = ifelse(designer == "Carina Bengs", "Yes", "No")) %>%
select(name, designer, fav_designer)
```
You can also use this to condense the groups in a certain column. For example, say we want to combine the categories "Beds", "Wardrobes", and "Chests of drawers & drawer units" into one category, "Bedroom":
```{r}
ikea %>%
mutate(new_category = ifelse(category %in% c("Beds", "Wardrobes","Chests of drawers & drawer units"), "Bedroom", category)) %>%
select(item_id, category, new_category)
```
Notice here that if we want to affect the entry only if the condition is TRUE, leaving it unchanged if it is false, we leave the column name as the else condition. This tells R to take the value that is currently in that column if it finds the condition to be FALSE.
We can use the same idea in-place, replacing the existing column, by just assigning it to the identical column name:
```{r}
ikea %>%
mutate(category = ifelse(category %in% c("Beds", "Wardrobes","Chests of drawers & drawer units"), "Bedroom", category))
```
Finally, you can also combine multiple ifelse statements by nesting them. This is pretty complicated, so get ready and try to follow the logic below:
```{r}
ikea %>%
mutate(category = ifelse(category %in% c("Beds", "Wardrobes","Chests of drawers & drawer units"), "Bedroom",
ifelse(category %in% c("Cabinets & shelving units", "Café furniture", "Chairs"), "Kitchen", "Other")))
penguins %>%
mutate(categorical_weight = ifelse(body_mass_g > 3000, "chunky",
ifelse(body_mass_g > 2000, "normal",
ifelse(flipper_length_mm > 200, "long fins",
"small")
)))
```
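Nested `ifelse()` calls are checked from the outside in, and the first condition that is TRUE decides the label. A stripped-down sketch on a plain vector (`toy_mass` and its values are made up for illustration) may make the order easier to follow:

```{r}
# Hypothetical body masses in grams
toy_mass <- c(1500, 2500, 3500)

# 3500 is also greater than 2000, but it is labeled "chunky"
# because the outer condition is checked first
toy_label <- ifelse(toy_mass > 3000, "chunky",
                    ifelse(toy_mass > 2000, "normal",
                           "small"))

toy_label
```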
Most of the examples above involve labeling different groupings with text, but you can use ifelse for all sorts of operations. Say the price of wood has gone up, so we want to add 100SAR to the price of any furniture taller than 1m. We could:
```{r}
ikea %>%
mutate(new_price = ifelse(height > 100, price + 100, price)) %>%
select(item_id, height, price, new_price)
```
Remember, you could also do this in-place by assigning the new column to "price" instead of "new_price".
#### Try it out
1. Create a new column that shows whether or not the word is in the first half of a sentence. If the word number is less than or equal to 3, this new column should read 'first half', if not, it should read 'second half'.
```{r}
spr %>%
mutate(sentence_half = ifelse(word_num <= 3, "first half", "second half"))
```
2. Create a new column that has the part of speech for a word. If the word is in the following list, it should be labeled "content", otherwise, it should be labeled "function": c("practice", "sentence", "relative", "clause.", "cats.")
```{r}
spr %>%
mutate(POS = ifelse(word %in% c("practice", "sentence", "relative", "clause.", "cats."), "content", "function"))
```
3. Create a new column that shows how fast a word was read. First, find the mean reading time over the entire dataset. You can do this in the console most easily using the base-R syntax `mean(spr$RT)`. If the RT is greater than or equal to this number, your new column should read "slow", otherwise, it should read "fast".
```{r}
mean(spr$RT)
spr %>%
mutate(fastness = ifelse(RT >= mean(spr$RT), "slow", "fast"))
```
### Helper functions
#### Helper functions for select()
Some helper functions allow you to get more use out of `select()` and `filter()`.
For example, `ends_with()` matches every column whose name ends with the given characters.
Here's how we can select both the `price` and the `old_price` column by specifying `ends_with("price")` in the `select()` call:
```{r}
ikea %>%
select(ends_with("price"))
```
The opposite, `starts_with()`, is also possible, e.g.
```{r}
ikea %>%
select(starts_with("s"))
```
`contains` is another helper function. Here, we're using it to show all columns that contain an underscore:
```{r}
ikea %>%
select(contains("_"))
```
If we try to select a column that doesn't exist, we get an error message:
```{r}
# ikea %>%
# select(name, translation)
```
But if we combine select with any_of, we don't see an error. Instead, R just returns the columns that do exist:
```{r}
ikea %>%
select(any_of(c("name", "translation")))
```
Note here that you have to use a vector and enclose the names in quotes!
We can also select a range of variables using a colon. This works both with variables and (a range of) numbers:
```{r}
ikea %>%
select(item_id:category)
ikea %>%
select(1:3) # first three columns
```
Here, the order of the columns matters!
Other helper functions for select() are:
- matches: similar to contains, but can use regular expressions
- num_range: in a dataset with the variables X1, X2, X3, X4, Y1, and Y2, select(num_range("X", 1:3)) returns X1, X2, and X3
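
The `num_range()` example from the list above can be tried out directly. The sketch below builds a made-up one-row tibble (`toy_wide` is hypothetical) with exactly those column names, and adds a `matches()` call with a simple regular expression for comparison:

```{r}
library(dplyr)

# Hypothetical toy data with the column names from the example above
toy_wide <- tibble(X1 = 1, X2 = 2, X3 = 3, X4 = 4, Y1 = 5, Y2 = 6)

# num_range("X", 1:3) selects X1, X2 and X3
x_cols <- toy_wide %>%
  select(num_range("X", 1:3))
names(x_cols)

# matches() takes a regular expression: here, any column name ending in "1"
one_cols <- toy_wide %>%
  select(matches("1$"))
names(one_cols)
```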
#### Helper functions for filter()
You can combine more than one logical condition using & (and) as well as | (or).
For example, to see beds which are available in other colors:
```{r}
ikea %>%
filter(category == "Beds" & other_colors == "Yes")
```
...or tables, desks, or chairs (again, taking care to match the spelling in the data frame exactly):
```{r}
ikea %>%
filter(category == "Chairs" | category == "Tables & desks")
```
...or tables, desks, and chairs that each cost less than 1000SAR:
```{r}
ikea %>%
filter((category == "Chairs" | category == "Tables & desks") & price < 1000)
```
Here, we need to add brackets around the two category options to make sure the `price` filter is applied to both categories. In R, `&` is evaluated before `|`, so without the brackets the filter would return tables and desks that cost less than 1000SAR plus all chairs, regardless of their price.
Another option would be to use %in%:
```{r}
ikea %>%
filter(category %in% c("Chairs", "Tables & desks") & price < 1000)
```
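To see why the brackets discussed above matter, here is a small sketch on a made-up tibble (`toy_furniture` and its prices are hypothetical). It runs the same filter with and without brackets and compares the number of rows returned:

```{r}
library(dplyr)

# Hypothetical toy data: one cheap and one expensive item per category
toy_furniture <- tibble(
  category = c("Chairs", "Chairs", "Tables & desks", "Tables & desks"),
  price    = c(500, 1500, 500, 1500)
)

# With brackets: the price condition applies to both categories
with_brackets <- toy_furniture %>%
  filter((category == "Chairs" | category == "Tables & desks") & price < 1000)

# Without brackets, R reads this as:
# Chairs | (Tables & desks & price < 1000)
without_brackets <- toy_furniture %>%
  filter(category == "Chairs" | category == "Tables & desks" & price < 1000)

nrow(with_brackets)
nrow(without_brackets)
```

The version without brackets also keeps the expensive chair, so it returns one more row.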
One additional option for `filter` is to find NAs (`is.na()`) or exclude NAs (`!is.na()`).
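A minimal sketch of both NA filters, using a made-up tibble (`toy_sizes` is hypothetical) with one missing height:

```{r}
library(dplyr)

# Hypothetical toy data with one missing value
toy_sizes <- tibble(
  name   = c("A", "B", "C"),
  height = c(120, NA, 80)
)

# Keep only the rows where height is missing
missing_height <- toy_sizes %>%
  filter(is.na(height))

# Keep only the rows where height is known
known_height <- toy_sizes %>%
  filter(!is.na(height))

missing_height
known_height
```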
#### Try it out
1. From the ikea dataset, select all columns that start with the letter "d". (Case-sensitive!)
```{r}
ikea %>%
select(starts_with("d"))
```
2. Select all the columns of ikea that contain "th".
```{r}
ikea %>%
select(contains("th"))
```
3. You would like to buy a really cheap chair -- Filter the dataset to include only chairs that cost less than 100 SAR.
```{r}
ikea %>%
filter(category == "Chairs" & price < 100)
```
4. Show two different ways to filter the name column to include only items named "BILLY" or "KALLAX".
```{r}
ikea %>%
filter(name == "BILLY" | name == "KALLAX")
ikea %>%
filter(name %in% c("BILLY", "KALLAX"))
```
# Communicating with R
## Understanding warnings and errors
R will often "talk" to you when you're running code. For example, when you install a package, it'll tell you e.g. where it is downloading a package from, and when it's done. That's nothing to worry about!
When there's a mistake in the code (e.g. misspelling a variable name, forgetting to close quotation marks or brackets), R will give an *error* and be unable to run the line. The error message will give you important information about what went wrong.
```{r}
hello <- "hi"
Hello
```
In contrast, *warnings* are shown when R thinks there could be an issue with your code or input, but it still runs the line. R is generally just telling you that you MIGHT be making a logical error, not that the code is impossible.
```{r}
c(1, 2, 3) + c(1, 2)
```
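The difference between the two can also be checked in code. The sketch below is only an illustration: it uses base R's `suppressWarnings()` and `tryCatch()`, which are not otherwise covered in this tutorial, and `undefined_object_name` is a hypothetical name that was never created:

```{r}
# A warning: the line still runs and returns a result
# (the shorter vector is recycled to match the longer one)
warned_result <- suppressWarnings(c(1, 2, 3) + c(1, 2))
warned_result

# An error: the line cannot run, so no result is produced.
# tryCatch() captures the error message instead of stopping.
error_message <- tryCatch(
  undefined_object_name,
  error = function(e) conditionMessage(e)
)
error_message
```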
## Reading function documentation
First, load the cowsay package again (make sure it's installed):
```{r}
library(cowsay)
```
Try the following code again (making sure the cowsay package is installed **and** loaded):
```{r}
say(
what = "Good luck learning R!",
by = "buffalo")
```
But what other options are there for this command - which other animals, for example, or can you change the colour? To see the documentation for the `say` command, you can either run this line of code:
```{r}
?say
```
...or type in `say` in the Help tab on the bottom right.
This will show you the documentation for the command.
- *Usage* shows an overview of the function arguments and their defaults (e.g. if you typed in `say()` without any arguments in the brackets, you'd get the defaults, i.e. a cat saying "Hello world!")
```{r}
say()
```
- *Arguments* provides more information on each argument.
Arguments are the options you can use within a function. Check out the cowsay::say() arguments (note this syntax of package::command()).
- what
- by
- type
- what_color
etc.
Each of these can be fed to the `say()` function to slightly alter what it does.
- *Examples* at the bottom of the help page lists a few examples you can copy-paste into your code to better understand how a function works.
Don't worry if you don't understand everything in the documentation when you're first starting out. Just try to get an idea for which arguments there are and which options for those arguments. It's good practice to look at help documents often -- this will also help you get more efficient at extracting the info you need from them.
#### Try it out
1. All of the following code snippets produce errors. What have I done wrong? Correct each snippet.
```{r}
party_invites <- c("Tracey", "Karen", "Sandra")
```
```{r}
c(1, 4, 5) + 1
```
```{r}
max(ikea$price) #from the ikea tibble
```
```{r}
ikea %>%
filter(name == "BILLY")
```
```{r}
ikea %>%
select(starts_with("item"))
```
```{r}
ikea %>%
mutate(new_price = price * 2)
```
```{r}
summary(ikea)
```
```{r}
42 + 5
```
2. Call the `say()` function including the arguments what, by, and what_color. Try at least a few different options for each of these and find your favorite look. Try using "random" with the argument "by" or "catfact" or "fortune" with the argument "what".
```{r}
```
### Writing files to disk
To write files to disk, you need to choose which separator to use: do you want a comma-separated file or a tab-separated file? In theory, this shouldn't matter much; you just have to know which one you picked.
CSVs are often saved as .csv (this is also openable in Excel) and TSVs are often saved as .txt
You can save CSVs and TSVs like so:
```{r}
# write_tsv(ikea, "ikea_df.txt")
# write_csv(ikea, "ikea_df.csv")
```
Syntax: write_tsv(dfname, "filename.ext")
Remember that R-specific information such as a column's data type (e.g. character vs. factor) is not stored in these files: they contain only the column names and the values, separated by the chosen separator. So you'll still have to do the conversions the next time you read the file into R.
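The round trip below sketches this with a made-up data frame (`toy_df` is hypothetical) written to a temporary file, so nothing is saved into your project folder. The `col_types = cols()` argument is only there to silence the column-type message when reading:

```{r}
library(readr)

# Hypothetical toy data with a factor column
toy_df <- data.frame(
  size = factor(c("small", "large", "small")),
  n    = c(1, 2, 3)
)
class(toy_df$size)

# Write to a temporary CSV file and read it back in
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_df, toy_path)
toy_reread <- read_csv(toy_path, col_types = cols())

# The factor comes back as a plain character column
reread_class <- class(toy_reread$size)
reread_class

# So the conversion has to be repeated after reading
toy_reread$size <- as.factor(toy_reread$size)
class(toy_reread$size)
```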
### Chaining commands
Thinking tidyverse: Now you have a lot of tidy tools that can all be combined or "chained".
See if you can figure out the following challenges. (Some may be a bit tricky, it's alright if you have to skip a couple or come back to them later.)
Penguins:
1. Make a new column converting body mass to kg, then keep only the penguins under 6kg and save the result to a new tibble called "small_penguins".
```{r}
small_penguins <- penguins %>%
mutate(body_mass_kg = body_mass_g / 1000) %>%
filter(body_mass_kg < 6)
```
2. Which penguin species occur on Biscoe island? Filter first to include only this island, then return the distinct species that are found there.
```{r}
penguins %>%
filter(island == "Biscoe") %>%
distinct(species)
```
3. Remember that the flipper_length column is represented in mm. Return how many penguins measured in this dataset had a flipper length longer than 2.5cm. Try to do this in one step, without having to permanently add the column to the dataframe (and by converting to cm inside `filter()` rather than comparing against the equivalent value in mm).
```{r}
penguins %>%
filter(flipper_length_mm / 10 > 2.5) %>% # divide by 10 to convert mm to cm
count()
```
SPR:
1. "Center" the response times around the average. This means take the mean RT and subtract it from every RT, so that 0 is now the average RT and the numbers reflect how much faster or slower the RT was from average. Save these centered RTs as a column called RT_c
```{r}
spr <- spr %>%
mutate(RT_c = RT - mean(spr$RT))
```
2. Create a new column that is called word_length and assign it to the number of characters in the word. Save the output.
```{r}
spr <- spr %>%
mutate(word_length = nchar(word))
```
Ikea:
1. Return the most expensive bed.
```{r}
ikea %>%
filter(category == "Beds") %>%
arrange(desc(price))
```
2. Count how many chairs are available in other colors.
```{r}
ikea %>%
filter(category == "Chairs" & other_colors == "Yes") %>%
count()
```
3. There is some junk in the designer column. Look at it using `distinct()`, then remove it with the following code: `ikea <- ikea %>% filter(!str_detect(designer, "[0-9]"))`
Note: `str_detect()` checks whether a string matches a regular expression. Here, the regular expression `[0-9]` matches any digit, so `str_detect()` returns TRUE for every designer entry that contains a number. However, it is preceded by `!`, which means NOT. Thus, the filter keeps a row only if `str_detect()` returns FALSE, i.e. there are no digits present.
Then, find which designer makes the most expensive furniture. Return the average price for each designer's products, and arrange them from largest to smallest. (Hint, give your summary column a name.)
```{r}
ikea <- ikea %>%
filter(!str_detect(designer, "[0-9]"))
ikea %>%
group_by(designer) %>%
summarise(avg_price = mean(price)) %>%
arrange(desc(avg_price))
```
4. How many products did each designer design? You can use count, but can also just save this variable as a factor and look at a summary.
```{r}
ikea %>%
group_by(designer) %>%
count()
ikea$designer <- as.factor(ikea$designer)
summary(ikea$designer)
```