Copyright 2022 Dale Bowman, Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Descriptive statistics

One of the most important components of a data science project is to examine your data using descriptive statistics. 
We call this phase of the project *Exploratory Data Analysis (EDA)*. 

There are two types of EDA: *graphical* EDA and *numerical* EDA. 
Graphical EDA involves plotting the features (variables) in the data to look for symmetry, skewness, and multimodality. 
**We will discuss numerical EDA in this session.**

## What you will learn

In the sections that follow you will learn about descriptive statistics, in particular numerical EDA, and how they can help us learn about our data and what types of analyses may be appropriate.  We will study the following:

- level of measure
- measures of central tendency
- measures of dispersion
- sampling

## When to use numerical EDA

Descriptive statistics, both numerical and graphical, are useful when you begin a data science project and you want to explore the data.  Often insights will be gained that can be useful in further analyses.

## Level of Measure

The type of features (variables) you have in the data determines which graphical and numerical techniques are appropriate. 
One characteristic of the features is their level of measure. 
There are four levels: nominal, ordinal, interval, and ratio. 
The characteristics of these levels are detailed in the table below.

| Level    | Description                              | Example                      |
|:---------|:-----------------------------------------|:-----------------------------|
| nominal  | categorical data that can be *named*     | eye color                    |
| ordinal  | categorical data with a natural ordering |  grades: A, B, C, D, F       |
| interval | numerical data without a true zero       | Fahrenheit temperature scale |
| ratio    | numerical data with a true zero          | Kelvin temperature scale     |
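
The four levels can be sketched in R as follows. This is a minimal illustration with made-up values (the eye colors, grades, and temperatures are hypothetical, not from any dataset used here): an unordered `factor` behaves like nominal data, an ordered `factor` like ordinal data, and the temperature lines show why ratios only make sense when the scale has a true zero.

```r
# Nominal: categories that can be named but not ranked (hypothetical values)
eye_color <- factor(c("brown", "blue", "green", "brown"))

# Ordinal: categories with a natural order, listed from lowest to highest
grade <- factor(c("B", "A", "F", "C"),
                levels = c("F", "D", "C", "B", "A"),
                ordered = TRUE)

is.ordered(eye_color)  # FALSE: no ranking among eye colors
is.ordered(grade)      # TRUE
grade[1] > grade[4]    # TRUE: B ranks above C

# Interval vs. ratio: the same two temperatures on both scales
temp_f <- c(40, 80)                          # Fahrenheit (interval)
temp_k <- (temp_f - 32) * 5 / 9 + 273.15     # Kelvin (ratio)

ratio_f <- temp_f[2] / temp_f[1]  # 2, but 80 F is not "twice as hot" as 40 F
ratio_k <- temp_k[2] / temp_k[1]  # about 1.08, the physically meaningful ratio
```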


## Descriptive Statistics, Numerical EDA

Descriptive measures fall into one of two categories:

- Measuring central tendencies of a variable
- Measuring the spread of a variable

To explore these, we'll use the `iris` dataset:

| Variable    | Type    | Description           |
|:-------------|:---------|:-----------------------|
| SepalLength | Ratio   | the sepal length (cm) |
| SepalWidth  | Ratio   | the sepal width (cm)  |
| PetalLength | Ratio   | the petal length (cm) |
| PetalWidth  | Ratio   | the petal width (cm)  |
| Species     | Nominal | the flower species    |

<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/iris">UCI Machine Learning Repository library
    </a></div>
<br>


We can calculate central tendency and spread using several R packages.
Let's start by importing `readr`:

- `library readr`

In [22]:
library(readr)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="YL8X8;LE`^hm(/Pe7$zz">readr</variable></variables><block type="import_R" id="Y!K^DxXEJ]YvUify+0uF" x="16" y="10"><field name="libraryName" id="YL8X8;LE`^hm(/Pe7$zz">readr</field></block></xml>

Now let's load a dataset into a dataframe:

- Set `dataframe` to `with readr do read_csv using "datasets/iris.csv"`
- Display `dataframe`

In [None]:
dataframe = readr::read_csv("datasets/iris.csv")

dataframe

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable><variable id="YL8X8;LE`^hm(/Pe7$zz">readr</variable></variables><block type="variables_set" id="i*U..F)p9r]#n*e(./*x" x="28" y="272"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field><value name="VALUE"><block type="varDoMethod_R" id="d~47.[8ahP7Ab1vu=Vee"><mutation items="1"></mutation><field name="VAR" id="YL8X8;LE`^hm(/Pe7$zz">readr</field><field name="MEMBER">read_csv</field><data>readr:read_csv</data><value name="ADD0"><block type="text" id="~Mspn1jJRE!J8ISd3!#V"><field name="TEXT">datasets/iris.csv</field></block></value></block></value></block><block type="variables_get" id="xI`MabBVXfmZ`j7Agkuf" x="23" y="343"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></xml>

Now we're ready to calculate the measures of central tendency in the next section.

### Measures of Central Tendency

The measures of central tendency most commonly used to describe data are **mean, median and mode**. 

A powerful way of calculating these measures of central tendency (along with other things) is the `dplyr` function `summarize`.

So let's import `dplyr`:

- `library dplyr`

In [2]:
library(dplyr)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="eNrJ9[?:!)MH8@C/C@}4">dplyr</variable></variables><block type="import_R" id="}brvEys-]icw;s--;]Tx" x="16" y="10"><field name="libraryName" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field></block></xml>


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




#### Mean
The **mean** is the numerical average of a variable's values. 
Let $X_1, X_2, \ldots, X_n$ represent the data;
then the mean is found as $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$

A `summarize` with `mean` will do this calculation for you:

- with `dplyr` do `summarize` 
    - using `dataframe`
    - and `"Mean_SL" = mean(SepalLength)`
    - and `"Mean_PW" = mean(PetalWidth)`

In [3]:
dplyr::summarize(dataframe,"Mean_SL" = mean(SepalLength),"Mean_PW" = mean(PetalWidth))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="eNrJ9[?:!)MH8@C/C@}4">dplyr</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod_R" id="Y=ShNEczz|ztdA3~t.6D" x="8" y="176"><mutation items="3"></mutation><field name="VAR" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field><field name="MEMBER">summarize</field><data>dplyr:summarize</data><value name="ADD0"><block type="variables_get" id="i-xG?$r6Nqf0uD?|ND+F"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="MqWLSIeUWqc4BDLmV;R7"><field name="CODE">"Mean_SL" = mean(SepalLength)</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock_R" id="`7Ic_@^75hgnes36FzQT"><field name="CODE">"Mean_PW" = mean(PetalWidth)</field></block></value></block></xml>

Mean_SL,Mean_PW
<dbl>,<dbl>
5.843333,1.198667


We only summarized two of the variables, but you could summarize more by adding slots to `summarize` if you wanted.
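
To connect the formula for $\bar{X}$ with the `mean` function, here is a small sketch that computes the mean both ways. The values in `x` are hypothetical toy data, not taken from the iris dataset.

```r
# Hypothetical toy data
x <- c(4.9, 5.1, 5.8, 6.3, 7.0)

# The formula: add up the values and divide by how many there are
n <- length(x)
manual_mean <- sum(x) / n

# The built-in function performs the same calculation
builtin_mean <- mean(x)

manual_mean
builtin_mean
```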

#### Median
The **median** is the number in the middle of the data once the values are sorted. 
By definition, one half of the data points are below the median and one half are above. 
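
As a quick sketch of that definition, the example below uses two hypothetical vectors. With an odd number of values the median is the single middle value; with an even number, R's `median` averages the two middle values.

```r
# Hypothetical toy data, deliberately unsorted
odd_data  <- c(7, 1, 5, 3, 9)       # sorted: 1 3 5 7 9
even_data <- c(7, 1, 5, 3, 9, 11)   # sorted: 1 3 5 7 9 11

median_odd  <- median(odd_data)     # 5, the middle value
median_even <- median(even_data)    # 6, the average of 5 and 7
```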

We can calculate the median just like the mean.
Copy the blocks above and replace "mean" with "median".

In [4]:
dplyr::summarize(dataframe,"Median_SL" = median(SepalLength),"Median_PW" = median(PetalWidth))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="eNrJ9[?:!)MH8@C/C@}4">dplyr</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod_R" id="Y=ShNEczz|ztdA3~t.6D" x="8" y="176"><mutation items="3"></mutation><field name="VAR" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field><field name="MEMBER">summarize</field><data>dplyr:summarize</data><value name="ADD0"><block type="variables_get" id="i-xG?$r6Nqf0uD?|ND+F"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="MqWLSIeUWqc4BDLmV;R7"><field name="CODE">"Median_SL" = median(SepalLength)</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock_R" id="`7Ic_@^75hgnes36FzQT"><field name="CODE">"Median_PW" = median(PetalWidth)</field></block></value></block></xml>

Median_SL,Median_PW
<dbl>,<dbl>
5.8,1.3


Notice that the mean and the median are almost the same for the first variable (`SepalLength`) but are different for the second variable (`PetalWidth`).
What do you think that means?

#### Mode

The **mode** is the value in the data that shows up the most often.
Mode is somewhat neglected in R, so we need to load a new package, `modeest`:

- `library modeest`

In [5]:
library(modeest)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="GX/Uo@ZFZu^h,-ci@W#@">modeest</variable></variables><block type="import_R" id=":+@BDUz.`%SQaxZ0U30_" x="-241" y="103"><field name="libraryName" id="GX/Uo@ZFZu^h,-ci@W#@">modeest</field></block></xml>

Now you can copy the code above and use `summarize` again, except this time use the function `mfv`, e.g. `Mode_SL=mfv(SepalLength)`.

In [6]:
dplyr::summarize(dataframe,"Mode_SL" = mfv(SepalLength),"Mode_PW" = mfv(PetalWidth))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="eNrJ9[?:!)MH8@C/C@}4">dplyr</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod_R" id="Y=ShNEczz|ztdA3~t.6D" x="8" y="176"><mutation items="3"></mutation><field name="VAR" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field><field name="MEMBER">summarize</field><data>dplyr:summarize</data><value name="ADD0"><block type="variables_get" id="i-xG?$r6Nqf0uD?|ND+F"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="MqWLSIeUWqc4BDLmV;R7"><field name="CODE">"Mode_SL" = mfv(SepalLength)</field></block></value><value name="ADD2"><block type="dummyOutputCodeBlock_R" id="`7Ic_@^75hgnes36FzQT"><field name="CODE">"Mode_PW" = mfv(PetalWidth)</field></block></value></block></xml>

Mode_SL,Mode_PW
<dbl>,<dbl>
5,0.2
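
If `modeest` is not available, the same idea can be sketched in base R by counting how often each value occurs. This is only an illustration on hypothetical values; it is not how `mfv` is implemented.

```r
# Hypothetical toy data
x <- c(5, 5.1, 5, 4.9, 5, 5.1)

# Count the occurrences of each distinct value
counts <- table(x)

# Keep every value whose count equals the largest count (there may be ties)
modes <- as.numeric(names(counts)[counts == max(counts)])

modes  # 5
```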


#### Summary

The type of variable determines the measures of central tendency available.

For categorical data, the mode is the only measure of central tendency that can be computed unless the data are ordinal, in which case you can use either the mode or the median.

In the table below, X indicates where a variable type and a measure of central tendency can be used together.

|          | mode | median | mean |
|----------|------|--------|------|
| nominal  | X    |        |      |
| ordinal  | X    | X      |      |
| interval | X    | X      | X    |
| ratio    | X    | X      | X    |

### Example Categorical Data 

The grades in a large statistics course occurred with the following frequency.


| Grade     | F | D  | C  | B  | A  |
|:-----------|---|----|----|----|----|
| Frequency | 45 | 10 | 25 | 15 | 5 |

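The mode of this data is the grade with the highest frequency, in this case F. 
Since this data is ordinal, we can also compute the median. 
There are a total of 100 grades, so the median lies between the 50th and 51st grades when they are sorted from lowest to highest. 
The 45 F's fill positions 1 to 45 and the 10 D's fill positions 46 to 55, so both middle positions are D's. 
This puts the median grade at **D**.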
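
The grade example can be reproduced with a short sketch. It expands the frequency table into 100 individual grades stored as an ordered factor; the variable names are illustrative choices, not part of the original notebook.

```r
# Grade levels from lowest to highest, with the frequencies from the table
grade_levels <- c("F", "D", "C", "B", "A")
frequency <- c(45, 10, 25, 15, 5)

# Expand the table into 100 individual grades as an ordered factor
grades <- factor(rep(grade_levels, times = frequency),
                 levels = grade_levels, ordered = TRUE)

# Mode: the level with the largest count
mode_grade <- names(which.max(table(grades)))

# Median: the grades sitting at the two middle positions of the sorted data
middle_grades <- as.character(sort(grades)[c(50, 51)])

mode_grade     # "F"
middle_grades  # "D" "D"
```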

For numerical data, the mode, median and mean can all be used to measure central tendency. 
Sometimes one measure will be more useful than another. 
For example, when outliers exist in the data, the mean can be skewed towards the outliers. 
Think of measuring incomes where one of the incomes is that of a professional basketball player. 
The player's extremely high income is very different from most of the other incomes. 
It is called an *outlier* and will affect the mean more than the median. 
As a simple example consider the following incomes.

$30,000 ~~ 40,000 ~~ 50,000~~60,000~~4,000,000$

The mean of these incomes is 

$\frac{30000+40000+50000+60000+4000000}{5} = \$836,000.$

Notice that \\$4 million is more than 10 times greater than the next highest value, \\$60 thousand.
As a result, the mean is pulled far above the four typical incomes and lands between \\$60 thousand and \\$4 million.

In contrast, the median of the incomes is the number with 2 incomes below and 2 incomes above, \\$50,000, which is a much more reasonable estimate of the central tendency of the majority of these incomes.

Let's take a closer look at this in R by creating a list of values:
- Set `salary` to `create list with` the following slots
    - 30000
    - 40000
    - 50000
    - 60000
    - 4000000
    
This is what it looks like:

![image.png](attachment:image.png)

In [7]:
salary = list(30000, 40000, 50000, 60000, 4000000)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id=":CSTVHy4liG@-G^A}4h7">salary</variable></variables><block type="variables_set" id="lS,4d1|x$va@tz75+6s{" x="-41" y="54"><field name="VAR" id=":CSTVHy4liG@-G^A}4h7">salary</field><value name="VALUE"><block type="lists_create_with" id="F2@)9b~Nc{R$c@SA8zG0"><mutation items="5"></mutation><value name="ADD0"><block type="math_number" id="A%#8PWhokwv3Y,92!Y+a"><field name="NUM">30000</field></block></value><value name="ADD1"><block type="math_number" id="+N1A1l~2Kl~z`pvDg/G~"><field name="NUM">40000</field></block></value><value name="ADD2"><block type="math_number" id="gHay^p}{V)k.xIGFZpwU"><field name="NUM">50000</field></block></value><value name="ADD3"><block type="math_number" id="%tat7.nu|j69cJmd_eFn"><field name="NUM">60000</field></block></value><value name="ADD4"><block type="math_number" id="OkM`UXA=KCTsHQk/xK`k"><field name="NUM">4000000</field></block></value></block></value></block></xml>

Now try the following using the `sum of list` block from MATH:

- `average` of list `salary`
- `median` of list `salary`

In [10]:
mean(unlist(salary))

median(unlist(salary))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id=":CSTVHy4liG@-G^A}4h7">salary</variable></variables><block type="math_on_list" id="CQj/OK@JMS-L8mGVA/VP" x="-139" y="308"><mutation op="AVERAGE"></mutation><field name="OP">AVERAGE</field><value name="LIST"><block type="variables_get" id="RI0(a*t,|,{NX^3p296}"><field name="VAR" id=":CSTVHy4liG@-G^A}4h7">salary</field></block></value></block><block type="math_on_list" id="U{dnjTo_=]/y+5UgYf.E" x="-140" y="351"><mutation op="MEDIAN"></mutation><field name="OP">MEDIAN</field><value name="LIST"><block type="variables_get" id="?~O(Utl)Vh:FY,9(yco%"><field name="VAR" id=":CSTVHy4liG@-G^A}4h7">salary</field></block></value></block></xml>

As you can see, this matches the example above.
The mean is far bigger than the median.

Now go back up to your list and take out the $4 million block (remember to take the blank spot out of your list as well) and run the mean/median again.

When you do this, you should find that the mean and the median are **exactly** the same.
What does this mean?
When the data is **symmetric** (like the plot below), the mean and the median are the same, so matching values suggest the data is balanced around its center.

### Measures of Dispersion (spread)

Even when two different variables have similar means (or medians, or modes) they can still be quite different depending on how the data are spread out around the center. 
In Figure 1 below both distributions have the same mean (0) but different spreads. 
The red curve has most of its points close to the center while the blue curve has points spread further from the mean.


![spread2.png](attachment:spread2.png)

**Figure 1:** Two distributions with the same center but different spread

One measure of dispersion that can be used with ordered categorical data (ordinal level) or numerical data (interval/ratio level) is the **five number summary**.
The five number summary is useful for comparing the center and spread of multiple variables. 
You use the numbers in the five number summary to construct a box and whiskers plot. 
The five numbers are: 

- minimum
- first quartile
- median
- third quartile
- maximum

The first quartile is the median of the values below the median and the third quartile is the median of the
values above the median.

To use a football analogy, quartiles are like the 4 quarters in a game, and the median is like halftime.
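
The definition above can be sketched directly in R. The example below uses a hypothetical vector with an even number of values so that the "below the median" and "above the median" halves are unambiguous. It also compares the result with `fivenum` from the `stats` package, which follows the same median-of-halves idea, and with `quantile`, which interpolates by default and can therefore give slightly different quartiles.

```r
# Hypothetical toy data with an even number of values
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Split the data into the values below and above the median
lower_half <- x[x < median(x)]   # 1 2 3 4
upper_half <- x[x > median(x)]   # 5 6 7 8

# Five number summary built from the definition
by_hand <- c(min(x), median(lower_half), median(x),
             median(upper_half), max(x))

by_hand                         # 1.0 2.5 4.5 6.5 8.0
tukey <- fivenum(x)             # same five numbers
interpolated <- quantile(x, c(0.25, 0.75))  # 2.75 and 6.25 with the default method
```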

We can get the five number summary easily with the `base` R package:

- `library base`

In [11]:
library(base)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="6C!l+@S8,SvtzakmsxQD">base</variable></variables><block type="import_R" id="E-D0N=RjAyN*|Y!@+3rn" x="-140" y="10"><field name="libraryName" id="6C!l+@S8,SvtzakmsxQD">base</field></block></xml>

Now use the `summary` function from `base`.
The `base` library is so big, you might have to wait a bit for it to load:

- with `base` do `summary` using `dataframe`

In [12]:
base::summary(dataframe)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="6C!l+@S8,SvtzakmsxQD">base</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="varDoMethod_R" id="3IhuOZY]Z7e~R%5$LI7q" x="-148" y="176"><mutation items="1"></mutation><field name="VAR" id="6C!l+@S8,SvtzakmsxQD">base</field><field name="MEMBER">summary</field><data>base:summary</data><value name="ADD0"><block type="variables_get" id="{0rtTGKWM!}`q[9heh6_"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></value></block></xml>

  SepalLength      SepalWidth     PetalLength      PetalWidth   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
   Species         
 Length:150        
 Class :character  
 Mode  :character  
                   
                   
                   

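For every numeric variable, `summary` returns the five number summary along with the mean.
`Species` is stored as text, so only its length and type are reported.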

### Example Five Number Summary 

For the data shown below the minimum is 3, the median is 9 and the maximum is 22. 
We find the first quartile as the median of the lower five numbers: here it is 6. 
The third quartile is 13, the median of the numbers above the median. 
So the five number summary for this data is $\{ 3,6,9,13,22\}$.

![summary.png](attachment:summary.png)

Other measures of the spread for numerical data include the range, the interquartile range, and the variance. 

The **range** is simply the maximum value minus the minimum. 
When outliers are present they may inflate the range. 
For example, in our income example the range would be $4,000,000-30,000=3,970,000$, which is not representative of the spread of the majority of incomes. 

To reduce the effect of outliers on the measure of dispersion, the interquartile range is often used. 
The **interquartile range** is defined as the third quartile minus the first quartile.
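
Here is a small sketch comparing the two measures on the income example. It stores the incomes in a plain numeric vector rather than a list, and it uses base R's `IQR`, which relies on the default interpolating quartiles from `quantile`.

```r
# The five incomes from the example above
salary <- c(30000, 40000, 50000, 60000, 4000000)

# Range: maximum minus minimum, dominated by the outlier
salary_range <- max(salary) - min(salary)   # 3970000

# Interquartile range: third quartile minus first quartile
salary_iqr <- IQR(salary)                   # 20000

salary_range
salary_iqr
```

The interquartile range ignores the \\$4 million income entirely, which is why it describes the spread of the typical incomes much better than the range does.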

The most commonly used measures of dispersion for numerical data are the **variance** and its square root, the **standard deviation**. 
The variance is (roughly) the average of the squared differences between the data and the mean.
Squaring the differences may seem complicated but makes sense when you realize that the sum of differences about the mean is zero.

Again, let $X_1, X_2, \ldots, X_n$ be the data whose variance you want to compute. 
The formula for the variance is given by $S^2 = \frac{\sum_{i=1}^n (X_i  - \bar{X})^2}{n-1}.$ 
The standard deviation is the square root of the variance.

You can calculate standard deviation using the same block from MATH that we used for mean and median.
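
The variance formula can also be checked step by step in R. The sketch below uses hypothetical toy data, confirms that the plain differences from the mean sum to zero, and then compares the hand calculation with the built-in `var` and `sd` functions.

```r
# Hypothetical toy data with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Differences from the mean cancel out, which is why we square them
deviations <- x - mean(x)
sum(deviations)                       # 0

# Sample variance: sum of squared differences divided by n - 1
manual_var <- sum(deviations^2) / (length(x) - 1)   # 32 / 7

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

manual_var
var(x)       # same value as manual_var
manual_sd
sd(x)        # same value as manual_sd
```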

## Sampling

The descriptive statistics discussed here all assume that the data we have is a **random sample** from some larger population. 
The population mean, $\mu$, and the population variance, $\sigma^2$ are unknown and the sample is typically taken to gain information about them. 
The population mean and variance are **parameters** while the sample mean ($\bar{X}$) and sample variance ($S^2$) are called **statistics**. 
Since the sample mean and sample variance are computed from a random sample from the population, each time we take a different sample, we expect to get different values of the sample mean and sample variance. 

We would like to know how much difference there would be in say $\bar{X}$ over different samples. 
The *standard error* can be used to estimate how much a statistic varies from sample to sample. 
For the sample mean, it is known that the variance of $\bar{X}$ is directly proportional to the population variance, $\sigma^2$, and inversely proportional to the sample size, $n$. 
So we can reduce the variation in $\bar{X}$ by increasing our sample size, $n$. 
The standard error of $\bar{X}$ can be estimated by $\displaystyle \sqrt{\frac{S^2}{n}}.$
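
The estimate above can be sketched in a few lines of R. As an assumption for the sake of a self-contained example, this uses R's built-in `iris` data (whose column is named `Sepal.Length`) as a stand-in for the `datasets/iris.csv` file loaded earlier.

```r
# Assumption: the built-in iris data stands in for datasets/iris.csv
x <- datasets::iris$Sepal.Length

n  <- length(x)   # sample size
s2 <- var(x)      # sample variance S^2

# Estimated standard error of the sample mean: sqrt(S^2 / n)
se_mean <- sqrt(s2 / n)

se_mean           # roughly 0.068
```

Because $n$ sits in the denominator, quadrupling the sample size would cut the standard error in half.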

The best way to begin to understand this is to sample some rows from your dataframe (remember the rows are just data points) and calculate the mean of that sample.

When we do this in `dplyr`, the sample is just another (smaller) dataframe:

- Set `sample` to `with dplyr do sample_n`
    - using `dataframe`
    - and `10`
- `sample`

In [13]:
sample = dplyr::sample_n(dataframe,10)

sample

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="X~V70-15t9L^b}[IjBoW">sample</variable><variable id="eNrJ9[?:!)MH8@C/C@}4">dplyr</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="variables_set" id="oLvPn#H_#KD]A.8$CUa#" x="-114" y="195"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field><value name="VALUE"><block type="varDoMethod_R" id="4_82p3|C@:,3[NUCW.:R"><mutation items="2"></mutation><field name="VAR" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field><field name="MEMBER">sample_n</field><data>dplyr:sample_n</data><value name="ADD0"><block type="variables_get" id="p=-t+dH7[;}r7zQiPtd+"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></value><value name="ADD1"><block type="math_number" id="jdV9~GKS6nUY/*Yq0n3W"><field name="NUM">10</field></block></value></block></value></block><block type="variables_get" id="nr5WrQE`[TvHgq(3vKz:" x="-117" y="308"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field></block></xml>

SepalLength,SepalWidth,PetalLength,PetalWidth,Species
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
5.6,2.5,3.9,1.1,versicolor
6.0,2.9,4.5,1.5,versicolor
6.0,3.0,4.8,1.8,virginica
6.7,3.1,4.4,1.4,versicolor
5.6,2.7,4.2,1.3,versicolor
6.3,3.3,6.0,2.5,virginica
6.0,2.2,4.0,1.0,versicolor
4.3,3.0,1.1,0.1,setosa
6.3,2.5,4.9,1.5,versicolor
6.7,2.5,5.8,1.8,virginica


Rerun the sampling cell above a couple of times to see how it randomly draws 10 rows from the original dataframe.
You can get a sense of the randomness by looking at how the values change each time.

Once you're ready, do this:

- Set `sample` to `with dplyr do sample_n`
    - using `dataframe`
    - and `10`
- `with dplyr do summarize_all`
    - using `sample`
    - and `mean`
    
*Note: `summarize_all` applies the same function to all variables*

In [20]:
sample = dplyr::sample_n(dataframe,10)

dplyr::summarize_all(sample,mean)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="X~V70-15t9L^b}[IjBoW">sample</variable><variable id="eNrJ9[?:!)MH8@C/C@}4">dplyr</variable><variable id="[V~uW+0L/4GW;45ulv+l">dataframe</variable></variables><block type="variables_set" id="oLvPn#H_#KD]A.8$CUa#" x="-114" y="195"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field><value name="VALUE"><block type="varDoMethod_R" id="4_82p3|C@:,3[NUCW.:R"><mutation items="2"></mutation><field name="VAR" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field><field name="MEMBER">sample_n</field><data>dplyr:sample_n</data><value name="ADD0"><block type="variables_get" id="p=-t+dH7[;}r7zQiPtd+"><field name="VAR" id="[V~uW+0L/4GW;45ulv+l">dataframe</field></block></value><value name="ADD1"><block type="math_number" id="jdV9~GKS6nUY/*Yq0n3W"><field name="NUM">10</field></block></value></block></value></block><block type="varDoMethod_R" id="f}0S5efYQ,_GPgHvH;Y^" x="-134" y="430"><mutation items="2"></mutation><field name="VAR" id="eNrJ9[?:!)MH8@C/C@}4">dplyr</field><field name="MEMBER">summarize_all</field><data>dplyr:summarize_all</data><value name="ADD0"><block type="variables_get" id="*j;b?VC.k3lvtk-YO41/"><field name="VAR" id="X~V70-15t9L^b}[IjBoW">sample</field></block></value><value name="ADD1"><block type="dummyOutputCodeBlock_R" id="`UImL@LRZn=Il-Zm]Z+K"><field name="CODE">mean</field></block></value></block></xml>

“argument is not numeric or logical: returning NA”


SepalLength,SepalWidth,PetalLength,PetalWidth,Species
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.84,3.11,3.99,1.37,
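
The warning and the empty `Species` value appear because `mean` cannot average a text column. One possible way to avoid this, sketched below, is to summarize only the numeric columns with `across` and `where` (available in `dplyr` 1.0.0 and later). As an assumption for a self-contained example, the built-in `iris` data is renamed to match the column names used here, and the name `sample_rows` is an illustrative choice.

```r
library(dplyr)

# Assumption: built-in iris, renamed to match the columns of datasets/iris.csv
dataframe <- setNames(datasets::iris,
                      c("SepalLength", "SepalWidth",
                        "PetalLength", "PetalWidth", "Species"))

# Draw 10 random rows, as before
sample_rows <- dplyr::sample_n(dataframe, 10)

# Average only the numeric columns, so Species is skipped without a warning
numeric_means <- dplyr::summarize(sample_rows,
                                  dplyr::across(where(is.numeric), mean))

numeric_means
```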


Run this code cell a couple of times (trick: do Ctrl + Enter).

Notice that the means change each time, just a bit.

That difference in means over different samples is what the standard error of the mean is measuring.
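
Rerunning the cell by hand can be automated with a small simulation. The sketch below draws 1000 samples of 10 values and looks at how much the sample means vary. It assumes the built-in `iris` data as a stand-in for `datasets/iris.csv`, and the seed and the number of repetitions are arbitrary choices.

```r
set.seed(42)   # arbitrary seed so the run is repeatable

# Assumption: built-in iris sepal lengths play the role of the population
population <- datasets::iris$Sepal.Length

# Draw 10 values at random and record their mean, 1000 times
sample_means <- replicate(1000, mean(sample(population, 10)))

# Spread of the sample means across the repeated samples
spread_of_means <- sd(sample_means)

# Value suggested by the formula sqrt(S^2 / n) with n = 10
formula_se <- sd(population) / sqrt(10)

spread_of_means
formula_se
```

The two numbers should be close but not identical, partly because of random variation and partly because the rows are drawn without replacement from only 150 data points.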
