

In [53]:
install.packages('nycflights13')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# Lecture 2: More on data transformations

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Continue to learn how to manipulate data, including:
    * Vectors in R: what they are and basic operations
    * Adding new variables
    * Grouping and summarizing data
    
This lecture note corresponds to sections 4.3-4.6 of your book.
</div>

In [54]:
library(tidyverse)  # always!
library(nycflights13)  # install if needed

## Review of last lecture
Last lecture we learned how to alter the rows and columns of a dataframe:
- `filter()` to keep certain rows that satisfy logical conditions.
- `arrange()` to sort rows according to certain column values.
- `distinct()` to keep only rows that are distinct on some combination of columns.
- `select()` to drop/rename/rearrange columns

## What's a data frame?

Our main goal in R is to work with data, and one of the most fundamental objects in R is the *data frame*. Think of a data frame as a container for a bunch of *vectors* of data:

![dataframe](https://garrettgman.github.io/images/tidy-2.png)

## What's a vector?

- In programming speak: a *vector* is an ordered sequence of values of the same type.
- In statistical speak: a vector is a collection of observations (aka data).

Let's create a vector and work with it:

## Poll
How old are you?
<ol style="list-style-type: upper-alpha;">
    <li>19 or younger</li>
    <li>20</li>
    <li>21</li>
    <li>22 or older</li>
    <li>I forget</li>
</ol>
(This question will be graded.)<br />
(↑ This is a joke.)

The function for creating a vector in R is simply called `c()` (short for "combine").

In [56]:
# Create a vector of ages
ages = c(20, 21, 22, 22, 24, 25, 21, 20, 19, 18)
ages 
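
Here is a minimal sketch of a few basic vector operations, using the same `ages` values (repeated so the snippet runs on its own). Everything is base R; the names `ages_next_year` and `is_21_or_older` are made up for illustration.

```r
# Same values as the ages vector above, repeated so this block is self-contained
ages = c(20, 21, 22, 22, 24, 25, 21, 20, 19, 18)

length(ages)   # number of elements in the vector
ages[1]        # R indexes from 1, so this is the first element
ages[2:4]      # elements 2 through 4

# Arithmetic and comparisons are applied element by element
ages_next_year = ages + 1
is_21_or_older = ages >= 21

# A logical vector can be used to pick out elements
ages[is_21_or_older]
sum(is_21_or_older)   # TRUE counts as 1, so this counts how many are 21 or older
```

The element-by-element behavior is what later lets us write `dep_delay - arr_delay` on whole columns at once.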

## Functions that operate on vectors

Many summary functions are available:

* Center: `mean(), median()`
* Spread: `sd(), IQR(), mad()`
* Range: `min(), max(), quantile()`
* Position: `first(), last(), nth()`
* Count: `n(), n_distinct()`
* Logical: `any(), all()`

In [57]:
# all() returns TRUE only when every element is TRUE
log1 = c(TRUE, TRUE, TRUE)
all(log1)

In [61]:
# examples of functions we can use on the ages vector
print(sd(ages))
print(mean(ages))

[1] 2.149935
[1] 21.2
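
A few more of the functions listed above, tried on the same vector. This is a sketch in base R only: `first()`, `last()`, `nth()`, `n()` and `n_distinct()` come from dplyr (and `n()` only works inside dplyr verbs such as `summarize()`), so `length(unique(...))` is used here as a base-R stand-in for `n_distinct()`.

```r
# Same values as the ages vector above, repeated so this block is self-contained
ages = c(20, 21, 22, 22, 24, 25, 21, 20, 19, 18)

median(ages)                    # center: middle value of the sorted vector
range(ages)                     # smallest and largest value together
quantile(ages, c(0.25, 0.75))   # first and third quartiles
n_unique = length(unique(ages)) # number of distinct values
n_unique
any(ages < 18)                  # is anyone under 18?
all(ages >= 18)                 # is everyone at least 18?
```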


## Column operation #2: `mutate()`

`mutate()` creates new columns in a dataframe that are calculated from the existing columns.

For example, let's define the **gain** of a flight to be the difference between the departure delay and the arrival delay:

$$\text{gain} = \text{dep. delay} - \text{arr. delay}$$

So, the gain is positive if the flight made up time in the air, resulting in a less-delayed arrival.
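
To make the sign convention concrete, here is a small sketch with the delays (in minutes) of two flights typed in by hand. The first flight left 2 minutes late and arrived 11 minutes late; the second left 1 minute early and arrived 18 minutes early.

```r
# Delays in minutes for two flights, entered by hand for illustration
dep_delay = c(2, -1)
arr_delay = c(11, -18)

# Subtraction works element by element, one result per flight
gain = dep_delay - arr_delay
gain
```

The first flight has a negative gain (it lost time in the air), and the second has a positive gain (it made up time).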

To add a column called `gain` to flights, we call `mutate()` as follows:

In [60]:
# before adding it, check that flights has no gain column yet
# (select() raises an error when the column does not exist)
select(flights, gain)

ERROR: ignored

In [62]:
new_df = mutate(flights, gain = dep_delay - arr_delay)

In [63]:
head(new_df, 1)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,gain
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>,<dbl>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00,-9


The returned data set has a new column called `gain` added to the very end. If you want to bring it to the front, you could use `select()` like we learned last lecture:

In [64]:
# use select() to move gain, dep_delay, arr_delay to the front of the columns

print(select(new_df, gain, dep_delay, arr_delay, everything()))


# A tibble: 336,776 × 20
    gain dep_delay arr_delay  year month   day dep_time sched_…¹ arr_t…² sched…³
   <dbl>     <dbl>     <dbl> <int> <int> <int>    <int>    <int>   <int>   <int>
 1    -9         2        11  2013     1     1      517      515     830     819
 2   -16         4        20  2013     1     1      533      529     850     830
 3   -31         2        33  2013     1     1      542      540     923     850
 4    17        -1       -18  2013     1     1      544      545    1004    1022
 5    19        -6       -25  2013     1     1      554      600     812     837
 6   -16

## Quiz
What was the most amount of time gained by any flight?
<ol style="list-style-type: upper-alpha;">
    <li>2 hours</li>
    <li>109 minutes</li>
    <li>37 minutes</li>
    <li>37 seconds</li>
    <li>2 days</li>
</ol>

In [66]:
# sort by gain in descending order; head() shows 6 rows by default, so we ask for 5 explicitly
head(
  arrange(
    select(new_df, gain, dep_delay, arr_delay, everything()), 
    desc(gain)
  ),
  5
)

gain,dep_delay,arr_delay,year,month,day,dep_time,sched_dep_time,arr_time,sched_arr_time,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
109,235,126,2013,6,13,1907,1512,2134,1928,EV,4377,N19554,EWR,JAX,126,820,15,12,2013-06-13 15:00:00
87,60,-27,2013,2,26,1000,900,1513,1540,HA,51,N382HA,JFK,HNL,584,4983,9,0,2013-02-26 09:00:00
80,206,126,2013,2,23,1226,900,1746,1540,HA,51,N389HA,JFK,HNL,599,4983,9,0,2013-02-23 09:00:00
79,17,-62,2013,5,13,1917,1900,2149,2251,DL,1465,N721TW,JFK,SFO,313,2586,19,0,2013-05-13 19:00:00
76,24,-52,2013,2,27,924,900,1448,1540,HA,51,N389HA,JFK,HNL,589,4983,9,0,2013-02-27 09:00:00


## Filtering extreme values
In the previous question we needed to find rows that had a large value of a certain column (`gain`). This occurs frequently, so the designers of tidyverse wrote a special function:

    top_n(<DATA FRAME>, n, <COLUMN>, ...)
    

In [67]:
# use top_n to find the flights with the highest gain
top_n(new_df, 1, gain)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,gain
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>,<dbl>
2013,6,13,1907,1512,235,2134,1928,126,EV,4377,N19554,EWR,JAX,126,820,15,12,2013-06-13 15:00:00,109
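
In dplyr 1.0.0 and later, `top_n()` is marked as superseded in favor of `slice_max()` and `slice_min()`. Here is a minimal sketch of those two on a small made-up table; the tibble `toy` and its values are hypothetical and not taken from `flights`.

```r
library(dplyr)

# Hypothetical toy data: three flights with made-up gains (minutes)
toy = tibble(
  flight = c("A", "B", "C"),
  gain   = c(5, 42, -3)
)

# Row with the largest gain; similar in spirit to top_n(toy, 1, gain)
best = slice_max(toy, gain, n = 1)
best

# Row with the smallest gain
worst = slice_min(toy, gain, n = 1)
worst
```

Like `top_n()`, both functions keep ties by default, so you can get back more than `n` rows.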


## Grouping data
Very frequently our data have natural groupings. For example, in flights, we might be interested in studying differences in flights depending on the month of departure. We use the `group_by()` function to tell R how to group data.

For example, `mtcars` is a dataset of cars and the gas mileage they get:

In [68]:
head(mtcars, 5) %>% print

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2


In [69]:
help(mtcars)

Let's try grouping `mtcars` by `cyl` (the number of engine cylinders):

In [70]:
gp = group_by(mtcars, cyl)
print(gp)

# A tibble: 32 × 11
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.

This has not changed the data in any way. But now watch what happens when we use `mutate()` on the grouped data frame:

In [71]:
# add the mean mpg of each cyl group as a new column
print(mutate(gp, mean(mpg)))

# A tibble: 32 × 12
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb `mean(mpg)`
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4        19.7
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4        19.7
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1        26.7
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1        19.7
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2        15.1
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1        19.7

In [72]:
distinct(mtcars, cyl)

,cyl
,<dbl>
Mazda RX4,6
Datsun 710,4
Hornet Sportabout,8


Notice that the mean is now constant within different groups. It's easier to see if we first sort the table by `cyl`:

In [73]:
# add the group mean to the grouped data, then sort the result by cyl
arrange(mutate(gp, mean(mpg)), cyl) %>% print

# A tibble: 32 × 12
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb `mean(mpg)`
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1        26.7
 2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2        26.7
 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2        26.7
 4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1        26.7
 5  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2        26.7
 6  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1        26.7

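A grouped `mutate()` is handy whenever you want to compare each row to its own group. Here is a minimal sketch on a made-up table (the tibble `toy` and the column names `group_mean` and `diff_from_mean` are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two groups with made-up scores
toy = tibble(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

result = toy %>%
  group_by(group) %>%
  mutate(
    group_mean     = mean(score),        # computed separately within each group
    diff_from_mean = score - group_mean  # how far each row is from its group mean
  ) %>%
  ungroup()                              # drop the grouping when done

result
```

The table keeps all four rows; only the way `mean()` is evaluated changes.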
## Pipes
We have already seen the pipe operator `%>%` a few times, and we will make extensive use of it going forward. Consider the previous exercise:

In [74]:
# add gain with mutate(), move gain, dep_delay, arr_delay to the front with select(),
# sort by gain (descending) with arrange(), then print
print(arrange(
  select(mutate(flights, gain = dep_delay - arr_delay), 
  gain, dep_delay, arr_delay, everything()
  ), desc(gain)
))

# A tibble: 336,776 × 20
    gain dep_delay arr_delay  year month   day dep_time sched_…¹ arr_t…² sched…³
   <dbl>     <dbl>     <dbl> <int> <int> <int>    <int>    <int>   <int>   <int>
 1   109       235       126  2013     6    13     1907     1512    2134    1928
 2    87        60       -27  2013     2    26     1000      900    1513    1540
 3    80       206       126  2013     2    23     1226      900    1746    1540
 4    79        17       -62  2013     5    13     1917     1900    2149    2251
 5    76        24       -52  2013

This is not very nice. To figure out what the command is doing you have to work from the inside out, which is not the order in which we are accustomed to reading. A slight improvement might be:

In [75]:
# the same steps, one at a time, storing each intermediate result
new_df1 = mutate(flights, gain = dep_delay - arr_delay)
new_df2 = select(new_df1, gain, dep_delay, arr_delay, everything())
new_df3 = arrange(new_df2, desc(gain))
print(new_df3)

# A tibble: 336,776 × 20
    gain dep_delay arr_delay  year month   day dep_time sched_…¹ arr_t…² sched…³
   <dbl>     <dbl>     <dbl> <int> <int> <int>    <int>    <int>   <int>   <int>
 1   109       235       126  2013     6    13     1907     1512    2134    1928
 2    87        60       -27  2013     2    26     1000      900    1513    1540
 3    80       206       126  2013     2    23     1226      900    1746    1540
 4    79        17       -62  2013     5    13     1917     1900    2149    2251
 5    76        24       -52  2013

This is better, but now you've created a bunch of useless temporary variables, and it requires a lot of typing. 
Instead, we are going to use the operator `%>%` (pronounced "pipe"):

In [77]:
# the same steps again, this time chained together with pipes
mutate(flights, gain = dep_delay - arr_delay) %>% 
  select(gain, dep_delay, arr_delay, everything()) %>% 
  arrange(desc(gain)) %>%
  print

# A tibble: 336,776 × 20
    gain dep_delay arr_delay  year month   day dep_time sched_…¹ arr_t…² sched…³
   <dbl>     <dbl>     <dbl> <int> <int> <int>    <int>    <int>   <int>   <int>
 1   109       235       126  2013     6    13     1907     1512    2134    1928
 2    87        60       -27  2013     2    26     1000      900    1513    1540
 3    80       206       126  2013     2    23     1226      900    1746    1540
 4    79        17       -62  2013     5    13     1917     1900    2149    2251
 5    76        24       -52  2013

This is much better. We can read the command from left to right and know exactly what is going on.
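
The rule behind the pipe is that `x %>% f(y)` is evaluated as `f(x, y)`: whatever is on the left becomes the first argument of the function on the right. A minimal sketch with a plain numeric vector (the names `nested` and `piped` are just for illustration):

```r
library(dplyr)   # loading dplyr also makes %>% available

x = c(1, 4, 9)

# These two lines compute the same thing
nested = round(mean(sqrt(x)), 1)                  # read from the inside out
piped  = x %>% sqrt() %>% mean() %>% round(1)     # read from left to right

identical(nested, piped)
```

R 4.1 and later also ship a built-in pipe, `|>`, which behaves the same way for simple calls like these.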

## Column operation #3: `summarize()`ing data

`summarize()` can be used to summarize entire data frames by collapsing them into single-number summaries. The syntax is:

    summarize(<grouped data frame>, 
              <new variable> = <formula for new variable>,
              <other new variable> = <other formula>)

The most basic use of summarize is to compute statistics over the whole data set:

In [78]:
# summarize flights by mean of departure delay
# note: na.rm = TRUE removes all NA values before computing the mean;
# without it, the result is NA because dep_delay has missing values

summarise(flights, mean_dep_delay = mean(dep_delay, na.rm = TRUE)) 

mean_dep_delay
<dbl>
12.63907


`summarize()` applies a summary function to each group of data. Remember that it always returns **one row per group**. In the above example, there was only one group (the whole data set), so the resulting data frame had only one row.
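
Here is a small sketch of both points (missing values and one row per group) on a made-up table. The tibble `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy = tibble(
  month = c(1, 1, 2, 2),
  delay = c(10, NA, 4, 6)
)

# The whole table is one group, so one row comes back
overall = summarise(
  toy,
  mean_with_na = mean(delay),                # NA, because one value is missing
  mean_no_na   = mean(delay, na.rm = TRUE)   # the NA is dropped first
)
overall

# Two groups, so two rows come back
by_month = toy %>%
  group_by(month) %>%
  summarise(mean_delay = mean(delay, na.rm = TRUE))
by_month
```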

### Grouping observations
`summarize()` is most useful when combined with `group_by()` to group observations before calculating the summary statistic. Let's summarize flights by the mean departure delay in each month.

In [79]:
# summarize average departure delay by month.
group_by(flights, month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  print

# A tibble: 12 × 2
   month mean_dep_delay
   <int>          <dbl>
 1     1          10.0
 2     2          10.8
 3     3          13.2
 4     4          13.9
 5     5          13.0
 6     6          20.8
 7     7          21.7
 8     8          12.6
 9     9           6.72
10    10           6.24
11    11           5.44
12    12          16.6


In [80]:
# summarize average mpg of mtcars by number of cylinders.
# In the class this was shown
group_by(mtcars, cyl) %>% 
  summarise(mean_mpg = mean(mpg)) %>% 
  print

# A tibble: 3 × 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1


### Example: counting the number of rows
The `n()` function calculates the number of rows in each group:

In [81]:

group_by(mtcars, cyl) %>% 
  summarise(mean_mpg = mean(mpg), n = n()) %>% 
  print

# A tibble: 3 × 3
    cyl mean_mpg     n
  <dbl>    <dbl> <int>
1     4     26.7    11
2     6     19.7     7
3     8     15.1    14


In [82]:
# mean departure delay and number of rows (flights) for each month
group_by(flights, month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%
  print

# A tibble: 12 × 3
   month mean_dep_delay     n
   <int>          <dbl> <int>
 1     1          10.0  27004
 2     2          10.8  24951
 3     3          13.2  28834
 4     4          13.9  28330
 5     5          13.0  28796
 6     6          20.8  28243
 7     7          21.7  29425
 8     8          12.6  29327
 9     9           6.72 27574
10    10           6.24 28889
11    11           5.44 27268
12    12          16.6  28135


### A shortcut
`summarize(n = n())` occurs so often that there is a shortcut for it:

In [83]:
# use count() instead of group_by() followed by summarize(n = n())
count(mtcars, cyl)

cyl,n
<dbl>,<int>
4,11
6,7
8,14
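
To see that the shortcut really does the same job, here is a minimal sketch comparing the two approaches on a made-up table (the tibble `toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: one row per departure
toy = tibble(origin = c("EWR", "JFK", "EWR", "LGA", "EWR", "JFK"))

long_way  = toy %>% group_by(origin) %>% summarise(n = n())
short_way = count(toy, origin)

identical(long_way$n, short_way$n)   # same counts either way

# sort = TRUE puts the largest groups first
sorted = count(toy, origin, sort = TRUE)
sorted
```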


Let's think about how to answer the following question using `summarize`:

## Which days of the year, and which airports, are the busiest for flying?

To figure this out, I like to think about/visualize the table we would want to have in order to easily answer this question. Ideally, it would look something like this:

    # A tibble: 1,095 x 4
       month   day airport       n_departures
       <int> <int> <chr>                <int>
     1     1     1 EWR                    305
     2     1     1 JFK                    297
     3     1     1 LGA                    240
     4     1     2 EWR                    350
     5     1     2 JFK                    321
     6     1     2 LGA                    272
     7     1     3 EWR                    336
     8     1     3 JFK                    318
     9     1     3 LGA                    260
    10     1     4 EWR                    339
    # … with 1,085 more rows

Then, to get the answer, I could sort the table to find the row that had the largest `n_departures`.

How do I reach the table shown above? There is one row per ... what? (This tells me how to group the data.)

In [84]:
# summarize flights to get number of departures by day and by airport.
group_by(flights, month, day, origin) %>% 
  summarise(n = n()) %>% 
  print

`summarise()` has grouped output by 'month', 'day'. You can override using the
`.groups` argument.


# A tibble: 1,095 × 4
# Groups:   month, day [365]
   month   day origin     n
   <int> <int> <chr>  <int>
 1     1     1 EWR      305
 2     1     1 JFK      297
 3     1     1 LGA      240
 4     1     2 EWR      350
 5     1     2 JFK      321
 6     1     2 LGA      272
 7     1     3 EWR      336
 8     1     3 JFK      318
 9     1     3 LGA      260
10     1     4 EWR      339
# … with 1,085 more rows

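To finish answering the question we still need to sort by the count. Here is a minimal sketch of the full pattern on a made-up table; the tibble `toy` is hypothetical, and `.groups = "drop"` is one way to get an ungrouped result and avoid the message about grouped output.

```r
library(dplyr)

# Hypothetical toy data: one row per departure (values are made up)
toy = tibble(
  month  = c(1, 1, 1, 1, 1, 2),
  day    = c(1, 1, 1, 2, 2, 1),
  origin = c("EWR", "EWR", "JFK", "EWR", "JFK", "LGA")
)

busiest = toy %>%
  group_by(month, day, origin) %>%
  summarise(n_departures = n(), .groups = "drop") %>%  # "drop" returns an ungrouped table
  arrange(desc(n_departures))                          # largest count first

head(busiest, 1)
```

Swapping `toy` for `flights` should give the real answer.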

Here is another question we can answer:

## Who is the greatest (baseball) batter of all time?
The `Lahman` package contains data on baseball players.

In [85]:
install.packages("Lahman")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [86]:
# install.packages("Lahman")
library(Lahman)
as_tibble(Batting) %>% print
# what do all these columns mean?

# A tibble: 110,495 × 22
   playerID  yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
   <chr>      <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int> <int>
 1 abercda01   1871     1 TRO    NA        1     4     0     0     0     0     0
 2 addybo01    1871     1 RC1    NA       25   118    30    32     6     0     0
 3 allisar01   1871     1 CL1    NA       29   137    28    40     4     5     0
 4 allisdo01   1871     1 WS3    NA       27   133    28    44    10     2     2
 5 ansonca01   1871     1 RC1    NA       25   120    29    39    11     3     0
 6 armstbo01   1871     1 FW1    NA       12    49     9

The second player is `addybo01`. We can get information about this player by typing:

In [87]:
Lahman::playerInfo('addybo01')

,playerID,nameFirst,nameLast
,<chr>,<chr>,<chr>
111,addybo01,Bob,Addy


Bob Addy was active in the years 1871-1877. During that time he had $118+51+152+213+310+142+245=1231$ at-bats, and $32+16+54+51+80+40+68=341$ hits. Therefore his career batting average was $341/1231=0.277$.
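
As a quick check of that arithmetic, here is the same calculation with plain vectors. A career batting average is total hits divided by total at-bats; the vector names are made up for illustration.

```r
# At-bats and hits for each of Bob Addy's rows, copied from the sums above
at_bats = c(118, 51, 152, 213, 310, 142, 245)
hits    = c(32, 16, 54, 51, 80, 40, 68)

total_ab    = sum(at_bats)
total_hits  = sum(hits)
batting_avg = total_hits / total_ab

total_ab
total_hits
round(batting_avg, 3)
```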

In [89]:
filter(Batting, playerID == 'addybo01') %>% print

  playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO IBB
1 addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0  NA
2 addybo01   1873     1    PH2   NA 10  51 12 16   1   0  0  10  1  1  2  0  NA
3 addybo01   1873     2    BS1   NA 31 152 37 54   6   3  1  32  6  5  2  1  NA
4 addybo01   1874     1    HR1   NA 50 213 25 51   9   2  0  22  4  2  1  1  NA
5 addybo01   1875     1    PH2   NA 69 310 60 80   8   4  0  43 16  8  0  2  NA
6 addybo01   1876     1    CHN   NL 32 142 36 40   4   1  0  16 NA NA  5  0  NA
7 addybo01   1877     1    CN1   NL 57 245 27 68   2   3  0  31 NA NA  6  5  NA
  HBP SH SF GIDP
1  NA NA NA    0
2  NA NA NA    0
3  NA NA NA    0
4  NA NA NA    0
5  NA NA NA    0
6  NA NA NA   NA
7  NA NA NA   NA


Let's use `group_by()` and `summarize()` to calculate the "career" batting average for every player in the dataset. That is, I want a table that looks like:

    # A tibble: 20166 × 2
      playerID batting_avg
      <chr>          <dbl>
    1 addybo01       0.277
    .    .             .
    .    .             .
    .    .             .

In [90]:
# calculate the career batting average for each player, then keep the 10 highest
# (top_n() keeps ties, so more than 10 rows can come back)
group_by(Batting, playerID) %>% 
  summarise('batting avg' = sum(H)/sum(AB)) %>% 
  top_n(10) %>%
  print

Selecting by batting avg


# A tibble: 95 × 2
   playerID  `batting avg`
   <chr>             <dbl>
 1 abramge01             1
 2 alberan01             1
 3 banisje01             1
 4 bartocl01             1
 5 bassdo01              1
 6 birasst01             1
 7 bruneju01             1
 8 burnscb01             1
 9 cammaer01             1
10 campsh01              1
# … with 85 more rows


What has happened? Let's look at the first player in the table:

In [91]:
filter(Batting, playerID == 'abramge01')

playerID,yearID,stint,teamID,lgID,G,AB,R,H,X2B,⋯,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
<chr>,<int>,<int>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
abramge01,1923,1,CIN,NL,3,1,0,1,0,⋯,0,0,0,0,0,,0,0,,


### Always include counts
It is a good idea to include counts of each group when you do a summary. Some groups may have very low numbers of observations, resulting in high variance for the summary statistics. 
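
A minimal sketch of what "including counts" can look like, on a made-up table. The tibble `toy`, the player IDs, and the column names `at_bats` and `seasons` are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data: made-up hits (H) and at-bats (AB) per player-season
toy = tibble(
  playerID = c("p1", "p2", "p2", "p3"),
  H        = c(1, 40, 50, 30),
  AB       = c(1, 150, 160, 120)
)

career = toy %>%
  group_by(playerID) %>%
  summarise(
    batting_avg = sum(H) / sum(AB),
    at_bats     = sum(AB),   # how much data is behind each average
    seasons     = n()        # number of rows in the group
  ) %>%
  arrange(desc(batting_avg))

career
```

With the counts next to the average, it is obvious that the "best" batter here had a single at-bat.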

What happens if we restrict our batting average calculation to players that had at least 100 at-bats?

## Quiz
Among players who had at least 100 at-bats, who had the highest career batting average?
<ol style="list-style-type: upper-alpha;">
    <li>Ty Cobb</li>
    <li>Babe Ruth</li>
    <li>Prof. Terhorst</li>
    <li>Ted Williams</li>
    <li>Hank Williams</li>
</ol>

In [92]:
# highest batting average among players, using only rows with 100 or more at-bats
# (filter() works row by row, so the cutoff applies to each season/stint, not to the career total)
filter(Batting, AB >= 100) %>%
  group_by(playerID) %>% 
  summarise(batting_avg = sum(H)/sum(AB)) %>% 
  arrange(desc(batting_avg)) %>%
  print

# A tibble: 7,158 × 2
   playerID  batting_avg
   <chr>           <dbl>
 1 hazlebo01       0.403
 2 daviscu01       0.381
 3 fishesh01       0.374
 4 woltery01       0.370
 5 cobbty01        0.366
 6 barnero01       0.363
 7 hornsro01       0.362
 8 meyerle01       0.358
 9 jacksjo01       0.357
10 harveza01       0.353
# … with 7,148 more rows

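One subtlety: `filter(Batting, AB >= 100)` is applied to individual rows before grouping, so it drops any season with fewer than 100 at-bats instead of dropping players with fewer than 100 career at-bats. The sketch below contrasts the two orderings on a made-up table (the tibble `toy` and its values are hypothetical).

```r
library(dplyr)

# Hypothetical toy data: player "p1" has 120 career at-bats, but never 100 in one row
toy = tibble(
  playerID = c("p1", "p1", "p2"),
  H        = c(20, 25, 30),
  AB       = c(60, 60, 100)
)

# Filter rows first: only rows with AB >= 100 survive, so p1 disappears
row_filter = toy %>%
  filter(AB >= 100) %>%
  group_by(playerID) %>%
  summarise(batting_avg = sum(H) / sum(AB))

# Summarise first, then filter on the career total: p1 is kept
career_filter = toy %>%
  group_by(playerID) %>%
  summarise(batting_avg = sum(H) / sum(AB), at_bats = sum(AB)) %>%
  filter(at_bats >= 100)

row_filter
career_filter
```

Which ordering is right depends on the question being asked, so it is worth deciding on purpose.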

## Quiz
Among players who had at least 100 at-bats in a season, who had the highest batting average in a single season?
<ol style="list-style-type: upper-alpha;">
    <li>Ted Williams</li>
    <li>Stephen Colbert</li>
    <li>Chonky Squirrel</li>
    <li>Levi Meyerle</li>
    <li>Tom Riddle</li>
</ol>

In [None]:
# highest seasonal batting average
# I will let you folks figure this out

## The Steroid Era of Baseball

> [Baseball] remained relatively the same until the 90s when steroid use became rampant. Famous sluggers like Barry Bonds, Mark McGwire, and Sammy Sosa rose to fame during this era. They were beloved at the time until we later found out that they were cheating.

https://www.wagerbop.com/how-home-runs-and-batting-averages-have-changed-over-the-last-30-years/

![Barry Bonds](https://cdn.ebaumsworld.com/mediaFiles/picture/2605038/87087115.jpg)

## Can we see the steroid era reflected in the data?

In [None]:
# summarize the dataset in order to investigate steroid era in batters
# this one too - exercise for you folks