`c(a, b, c, ...)` concatenates its arguments into a vector

In [1]:
myFamilyAges <- c(43, 42, 12, 8, 5)
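Two properties of `c()` are worth seeing once: it splices vector arguments into one flat vector, and it coerces mixed types to a common type. The sketch below is illustrative only; the names `parents`, `kids`, `ages`, and `mixed` are hypothetical and not used elsewhere in this notebook.

```r
# c() flattens its arguments: vectors passed in are spliced, not nested.
parents <- c(43, 42)
kids <- c(12, 8)
ages <- c(parents, kids, 5)
print(ages)
print(length(ages))  # five elements, not three

# Mixing types coerces every element to the most general type (here, character).
mixed <- c("Dad", 43)
print(class(mixed))
```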

Common statistical operations on a vector include `sum`, `mean`, and `range` (which returns the minimum and the maximum)

In [5]:
print(sum(myFamilyAges))
print(mean(myFamilyAges))
print(range(myFamilyAges))

[1] 110
[1] 22
[1]  5 43


A few more vectors to make up the family stats

In [7]:
myFamilyNames <- c("Dad", "Mom", "Sis", "Bro", "Dog")
myFamilyGenders <- c("Male", "Female", "Female", "Male", "Female")
myFamilyWeights <- c(188, 136, 83, 61, 44)

Combine the vectors into a data frame, one column per vector

In [8]:
myFamily <- data.frame(myFamilyNames, myFamilyAges, myFamilyGenders, myFamilyWeights)

In [9]:
myFamily

myFamilyNames,myFamilyAges,myFamilyGenders,myFamilyWeights
<fct>,<dbl>,<fct>,<dbl>
Dad,43,Male,188
Mom,42,Female,136
Sis,12,Female,83
Bro,8,Male,61
Dog,5,Female,44


`data.frame` stored myFamilyNames as a factor (`<fct>` above), but it should be character type (strings), so use `as.character` to convert it

In [26]:
myFamily$myFamilyNames <- as.character(myFamily$myFamilyNames)

In [27]:
str(myFamily)

'data.frame':	5 obs. of  4 variables:
 $ myFamilyNames  : chr  "Dad" "Mom" "Sis" "Bro" ...
 $ myFamilyAges   : num  43 42 12 8 5
 $ myFamilyGenders: Factor w/ 2 levels "Female","Male": 2 1 1 2 1
 $ myFamilyWeights: num  188 136 83 61 44
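An alternative to converting after the fact is to stop the conversion when the data frame is built. The sketch below assumes the `stringsAsFactors` argument of `data.frame` (its default was `TRUE` before R 4.0.0 and is `FALSE` since) and uses a hypothetical name, `myFamilyAlt`, so the original `myFamily` is left untouched. The vectors are repeated so the block runs on its own.

```r
# Same vectors as defined earlier, repeated to keep the example self-contained.
myFamilyNames <- c("Dad", "Mom", "Sis", "Bro", "Dog")
myFamilyAges <- c(43, 42, 12, 8, 5)
myFamilyGenders <- c("Male", "Female", "Female", "Male", "Female")
myFamilyWeights <- c(188, 136, 83, 61, 44)

# Keep all strings as character at construction time ...
myFamilyAlt <- data.frame(myFamilyNames, myFamilyAges, myFamilyGenders,
                          myFamilyWeights, stringsAsFactors = FALSE)

# ... then convert only the genuinely categorical column to a factor.
myFamilyAlt$myFamilyGenders <- factor(myFamilyAlt$myFamilyGenders)

str(myFamilyAlt)
```

Being explicit about `stringsAsFactors` makes the result independent of the R version in use.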


## Some Statistics
- The __mean__ (technically the arithmetic mean), a measure of central
tendency that is calculated by adding together all of the observations
and dividing by the number of observations.
- The __median__, another measure of central tendency, but one that
has no arithmetic formula. Instead, you make a sorted list of
all of the observations in the sample, then go halfway up that
list. Whatever the value of the observation is at the halfway
point, that is the median. With an even number of observations,
the median is the mean of the two middle values.
- A __quartile__ is a type of _quantile_ that divides the data points into four parts of more-or-less equal size. The data must be ordered from smallest to largest to compute quantiles; as such, quantiles are a form of order statistic. The three main quartiles are as follows:
> - The first quartile ($Q_1$) is defined as the middle number between the smallest number (_minimum_) and the _median_ of the data set. Also known as the lower or 25th empirical quartile, as 25% of the data lies below this point.
> - The second quartile ($Q_2$) is the median of a data set; thus 50% of the data lies below this point.
> - The third quartile ($Q_3$) is the middle value between the median and the highest value (_maximum_) of the data set. Also known as the upper or 75th empirical quartile, as 75% of the data lies below this point.
- The __range__, which is a measure of "dispersion" - how spread out a
bunch of numbers in a sample are - calculated by subtracting the
lowest value from the highest value.
- The __mode__, another measure of central tendency. The mode is the
value that occurs most often in a sample of data. Like the median,
the mode has no arithmetic formula. You just have to count up how
many times each value occurs and then pick the value that has the most.
- The __variance__, a measure of dispersion. Like the range, the variance
describes how spread out a sample of numbers is. Unlike
the range, though, which just uses two numbers to calculate dispersion,
the variance is obtained from all of the numbers
through a simple calculation that compares each number to the
mean.
- The __standard deviation__, a measure of the amount of deviation, or dispersion, of a set of values. A low standard deviation indicates that the values tend to be close to the _mean_ (also called the _expected value_) of the set, while a high standard deviation indicates that the values are spread out over a wider range. For a _normally distributed_ random variable $X$ with mean $\mu$ and standard deviation $\sigma$, the _68-95-99.7 rule_ states:

> - $\Pr(\mu - 1\sigma \leq X \leq \mu + 1\sigma) \approx 68.27\%$
> - $\Pr(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 95.45\%$
> - $\Pr(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 99.73\%$
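The measures listed above can be tried directly in R. The following is a minimal sketch on the family ages: `median`, `quantile`, `range`, `diff`, `table`, and `pnorm` are base R functions, while `ages` and the helper `stat_mode` are hypothetical names introduced here. Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type), so the helper counts values with `table`. Note that `quantile` uses an interpolation method by default, so its quartiles can differ from those found by hand with the "middle value" description.

```r
ages <- c(43, 42, 12, 8, 5)  # same values as myFamilyAges

print(median(ages))        # middle of the sorted values 5 8 12 42 43
print(quantile(ages))      # minimum, Q1, median, Q3, maximum (default method)
print(diff(range(ages)))   # range as dispersion: highest minus lowest

# Hypothetical helper: return the most frequent value(s) as character strings.
# Ties return every value that reaches the maximum count.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}
print(stat_mode(c("Male", "Female", "Female", "Male", "Female")))

# 68-95-99.7 rule: probability that a standard normal variable lies
# within k standard deviations of its mean, for k = 1, 2, 3.
for (k in 1:3) {
  print(pnorm(k) - pnorm(-k))
}
```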


### As an example

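The sample standard deviation is the square root of the sample variance: $$ SD = \sqrt{\, \frac{ \sum_{i=1}^{n} (x_i - \overline{x})^2 }{n - 1}} $$ where $\overline{x}$ is the sample mean and $n$ is the number of observations. The table applies this to the family ages, whose mean is 22.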

| WHO | AGE | AGE - MEAN | (AGE - MEAN)<SUP>2</SUP> |
| --- | --- | --- | --- |
| Dad | $$43$$ | $$43 - 22 = 21$$ | $$21 \times 21 = 441$$ |
| Mom | $$42$$ | $$42 - 22 = 20$$ | $$20 \times 20 = 400$$ |
| Sis | $$12$$ | $$12 - 22 = -10$$ | $$(-10) \times (-10) = 100$$ |
| Bro | $$8$$ | $$8 - 22 = -14$$ | $$(-14) \times (-14) = 196$$ |
| Dog | $$5$$ | $$5 - 22 = -17$$ | $$(-17) \times (-17) = 289$$ |
| &nbsp; | &nbsp; | <div style="text-align: right">__Total:__</div> | 1426 |
| &nbsp; | &nbsp; | <div style="text-align: right">__Variance = Total/(5-1):__</div>| 356.5 |
| &nbsp; | &nbsp; | <div style="text-align: right">__SD = √Variance:__</div>| 18.88 |


In [32]:
print(var(myFamily$myFamilyAges))
print(sd(myFamily$myFamilyAges))

[1] 356.5
[1] 18.88121
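The table's steps can also be reproduced by hand in R and compared against `var` and `sd`. This is a minimal sketch; `ages`, `deviations`, `squared`, `manual_var`, and `manual_sd` are hypothetical names introduced for illustration.

```r
ages <- c(43, 42, 12, 8, 5)           # same values as myFamilyAges

deviations <- ages - mean(ages)       # AGE - MEAN column
squared <- deviations^2               # (AGE - MEAN)^2 column
manual_var <- sum(squared) / (length(ages) - 1)  # total divided by n - 1
manual_sd <- sqrt(manual_var)

print(manual_var)
print(manual_sd)

# Both comparisons should agree with the built-in functions.
print(all.equal(manual_var, var(ages)))
print(all.equal(manual_sd, sd(ages)))
```

Dividing by `n - 1` rather than `n` gives the sample variance, which is what `var` and `sd` compute.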
