# Probability and Statistics for Computer Science

*Author: David Forsyth*

**ISBN: 978-3-319-64410-3**

## Notation

- $\{x\}$ - **Dataset**
- $x_i$ - ***i*th Data Item**
- $x_i^{(j)}$ - ***j*th Component of *i*th Data Item**
- $\text{mean}(\{x\})$ - **Mean**
- $\text{std}(\{x\})$ - **Standard Deviation**
- $\text{var}(\{x\})$ - **Variance**
- $\text{median}(\{x\})$ - **Median**
- $\text{percentile}(\{x\}, k)$ - **$k\%$ Percentile**
- $\text{iqr}(\{x\})$ - **Interquartile Range**
- $\{\hat{x}\}$ - **Dataset Transformed to Standard Coordinates**
- $\text{corr}(\{(x, y)\})$ - **Correlation**
- $\emptyset$ - **Empty Set**
- $\Omega$ - **Set of All Possible Experiment Outcomes**
- $\mathcal{A}$ - **Set**
- $\mathcal{A}^c = \Omega - \mathcal{A}$ - **Set Complement**
- $\mathcal{E}$ - **Event**
- $P(\{\mathcal{E}\})$ - **Probability of Event $\mathcal{E}$**
- $P(\{\mathcal{E}\} \vert \{\mathcal{F}\})$ - **Probability of Event $\mathcal{E}$, Conditioned on Event $\mathcal{F}$**
- $p(x)$ - **Probability That Random Variable $X$ Equals Value $x$**
- $p(x, y)$ - **Probability That Random Variable $X$ Equals Value $x$ And Random Variable $Y$ Equals Value $y$**
- $\arg\max_{x} f(x)$ - **Value of $x$ That Maximizes $f(x)$**
- $\arg\min_{x} f(x)$ - **Value of $x$ That Minimizes $f(x)$**
- $\hat{\theta}$ - **Estimated Value of $\theta$**

## 1. First Tools for Looking at Data

### Datasets

- **Dataset**: A collection of descriptions (or $d$-tuples) of different instances of the same phenomenon.
    - *Categorical*
    - *Continuous*

### Summarizing 1D Data

- A **location parameter** tells where the data lies along a number line.
- A **scale parameter** tells how wide the spread of data is.

### Mean

- Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The mean of this dataset is:
$$\text{mean}(\{x\}) = \frac{1}{N} \sum_{i = 1}^{N} x_i$$

#### Properties of Mean

- **Scaling**: $\text{mean}(\{k \cdot x_i\}) = k \cdot \text{mean}(\{x_i\})$
    - *Yes Effect*
- **Translation**: $\text{mean}(\{x_i + c\}) = \text{mean}(\{x_i\}) + c$
    - *Yes Effect*
- **Sum of Signed Differences**: $\sum_{i = 1}^{N} \left( x_i - \text{mean}(\{x_i\}) \right) = 0$
- **Sum of Squared Distances to $\mu$**: $\arg\min_{\mu} \sum_{i} \left( x_i - \mu \right)^2 = \text{mean}(\{x_i\})$
    - *Mean*: $\mu$
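
A short derivation of the sum-of-squared-distances property, sketched under the assumption that we may differentiate the cost with respect to $\mu$ (it is a quadratic in $\mu$, so this is safe):

$$
\begin{align}
\frac{d}{d\mu} \sum_{i = 1}^{N} \left( x_i - \mu \right)^2 &= -2 \sum_{i = 1}^{N} \left( x_i - \mu \right) = 0 \\
\sum_{i = 1}^{N} x_i &= N \mu \\
\mu &= \frac{1}{N} \sum_{i = 1}^{N} x_i = \text{mean}(\{x\})
\end{align}
$$

The second derivative is $2N > 0$, so this stationary point is a minimum rather than a maximum.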

#### Interpretation of Mean

- The mean is a *location parameter* that summarizes the dataset with a value that is as close as possible to each datum.

### Standard Deviation

- Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The standard deviation of this dataset is:
$$
\begin{align}
\text{std}(\{x\}) 
& = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} \left( x_i - \text{mean}(\{x\}) \right)^2} \\
& = \sqrt{\text{mean}\left( \{ \left( x_i - \text{mean}(\{x\}) \right)^2 \} \right)}
\end{align}
$$

#### Properties of Standard Deviation

- **Scaling**: $\text{std}(\{k \cdot x_i\}) = \vert k \vert \cdot \text{std}(\{x_i\})$
    - *Yes Effect*
- **Translation**: $\text{std}(\{x_i + c\}) = \text{std}(\{x_i\})$
    - *No Effect*
- For any dataset, there can be only a few items that are many standard deviations away from the mean. For $N$ data items, $x_i$, whose standard deviation is $\sigma$, there are at most $\frac{N}{k^2}$ data points lying $k$ or more standard deviations away from the mean.
- For any dataset, there must be at least one data item that is at least one standard deviation away from the mean.
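
As a minimal sketch, the definition and the properties above can be checked numerically in plain Python. The helper names `mean`, `std`, and `count_far`, and the sample data, are illustrative assumptions rather than anything taken from the book.

```python
import math


def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)


def std(xs):
    """Standard deviation with the 1/N normalisation used in these notes."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def count_far(xs, k):
    """Number of items lying k or more standard deviations from the mean."""
    m, s = mean(xs), std(xs)
    return sum(1 for x in xs if abs(x - m) >= k * s)


# Hypothetical sample data, chosen so the mean is 5 and the std is 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("std:", std(data))
print("std after scaling by -3:", std([-3 * x for x in data]))
print("std after translating by 10:", std([x + 10 for x in data]))
print("items at least 2 std away:", count_far(data, 2), "bound:", len(data) / 2 ** 2)
```

Scaling by a negative constant is used on purpose: the standard deviation is never negative, so it scales by the absolute value of the constant.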

#### Interpretation of Standard Deviation

- The standard deviation is the root mean square of the offsets from the mean.
- The standard deviation is a *scale parameter* that measures the size of the average deviation from the mean for a dataset.
- When the standard deviation is large, there are many items with values much larger than, or much smaller than, the mean.
- When the standard deviation is small, most data items have values close to the mean.

### Variance

- Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The variance of this dataset is:
$$
\begin{align}
\text{var}(\{x\}) 
& = \frac{1}{N} \sum_{i = 1}^{N} \left( x_i - \text{mean}(\{x\}) \right)^2 \\
& = \text{mean}\left( \{ \left( x_i - \text{mean}(\{x\}) \right)^2 \} \right)
\end{align}
$$

#### Properties of Variance

- **Scaling**: $\text{var}(\{k \cdot x_i\}) = k^2 \cdot \text{var}(\{x_i\})$
    - *Yes Effect*
- **Translation**: $\text{var}(\{x_i + c\}) = \text{var}(\{x_i\})$
    - *No Effect*
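
Both properties follow directly from the definition together with the scaling and translation properties of the mean. A brief worked version:

$$
\begin{align}
\text{var}(\{k \cdot x_i\}) &= \frac{1}{N} \sum_{i = 1}^{N} \left( k \cdot x_i - k \cdot \text{mean}(\{x\}) \right)^2 = k^2 \cdot \text{var}(\{x_i\}) \\
\text{var}(\{x_i + c\}) &= \frac{1}{N} \sum_{i = 1}^{N} \left( x_i + c - \text{mean}(\{x\}) - c \right)^2 = \text{var}(\{x_i\})
\end{align}
$$

Taking square roots gives the corresponding properties of the standard deviation, with $\sqrt{k^2} = \vert k \vert$.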
    
#### Interpretation of Variance

- Variance is the mean-square error you would incur if you replaced each data item with the mean.

### Median

- Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$.
    - If $N$ is odd, the median of this dataset is the middle item of the sorted data (positions counted from $1$):
$$\text{median}(\{x\}) = \text{sort}(\{x\})\left[ \frac{N + 1}{2} \right]$$
    - If $N$ is even, the median of this dataset is the average of the two middle items of the sorted data:
$$\text{median}(\{x\}) = \frac{1}{2} \left( \text{sort}(\{x\})\left[ \frac{N}{2} \right] + \text{sort}(\{x\})\left[ \frac{N}{2} + 1 \right] \right)$$
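
A minimal sketch of the median in Python. With 1-based positions the middle item for odd $N$ is at $(N + 1)/2$, and for even $N$ the two middle items are at $N/2$ and $N/2 + 1$; Python lists are 0-based, so each index below is shifted down by one. The function name `median` is an illustrative choice.

```python
def median(xs):
    """Median of a non-empty list: middle item, or mean of the two middle items."""
    if not xs:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: 1-based position (N + 1) / 2 is 0-based index N // 2.
        return ordered[mid]
    # Even N: 1-based positions N / 2 and N / 2 + 1 are indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))
print(median([4, 1, 3, 2]))
```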

#### Properties of Median

- **Scaling**: $\text{median}(\{k \cdot x_i\}) = k \cdot \text{median}(\{x_i\})$
    - *Yes Effect*
- **Translation**: $\text{median}(\{x_i + c\}) = \text{median}(\{x_i\}) + c$
    - *Yes Effect*

#### Interpretation of Median

- Generally, approximately half the data is smaller than the median, and approximately half the data is larger than the median.
- The median is an alternative to the mean because it is also a *location parameter*.

### Interquartile Range

- **Percentile**: The $k$th percentile is the value such that $k\%$ of the data is less than or equal to that value.
    - $\text{percentile}(\{x\}, k)$
- **Quartile**:
    - The first quartile is the value such that $25\%$ of the data is less than or equal to that value: $\text{percentile}(\{x\}, 25)$.
    - The second quartile is the value such that $50\%$ of the data is less than or equal to that value: $\text{percentile}(\{x\}, 50)$.
    - The third quartile is the value such that $75\%$ of the data is less than or equal to that value: $\text{percentile}(\{x\}, 75)$.
- **Interquartile Range**: The interquartile range of a dataset $\{x\}$ is:
$$\text{iqr}(\{x\}) = \text{percentile}(\{x\}, 75) - \text{percentile}(\{x\}, 25)$$
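
Percentiles can be computed in several slightly different ways for a finite dataset. The sketch below assumes the "nearest-rank" convention (the smallest data item such that at least $k\%$ of the data is less than or equal to it), which matches the definition above but is only one of the conventions in common use; other tools may interpolate between items and give slightly different values. The function names are illustrative.

```python
import math


def percentile(xs, k):
    """Nearest-rank k-th percentile (0 <= k <= 100) of a non-empty list."""
    if not xs:
        raise ValueError("percentile of an empty dataset is undefined")
    if not 0 <= k <= 100:
        raise ValueError("k must lie between 0 and 100")
    ordered = sorted(xs)
    # 1-based rank of the smallest item with at least k% of the data <= it.
    rank = max(1, math.ceil(k * len(ordered) / 100))
    return ordered[rank - 1]


def iqr(xs):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(xs, 75) - percentile(xs, 25)


data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical sample data
print("quartiles:", percentile(data, 25), percentile(data, 50), percentile(data, 75))
print("iqr:", iqr(data))
```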

#### Properties of Interquartile Range

- **Scaling**: $\text{iqr}(\{k \cdot x_i\}) = \vert k \vert \cdot \text{iqr}(\{x_i\})$
    - *Yes Effect*
- **Translation**: $\text{iqr}(\{x_i + c\}) = \text{iqr}(\{x_i\})$
    - *No Effect*
    
#### Interpretation of Interquartile Range

- The interquartile range is an alternative to the standard deviation because it is also a *scale parameter*.

### Online Algorithms for Mean and Standard Deviation

#### Mean

- Let $\hat{\mu}_{k}$ be an estimate for the mean of the dataset after seeing $k$ elements.

$$
\begin{align}
\hat{\mu}_{1} &= x_1 \\
\hat{\mu}_{k + 1} &= \frac{k \cdot \hat{\mu}_{k} + x_{k + 1}}{k + 1}
\end{align}
$$
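
A minimal sketch of an online mean in Python, keeping only the current estimate and the count of items seen. The new estimate is the old estimate weighted by the number of items it already summarizes, plus the new item, divided by the new count. The generator name `online_mean` is an illustrative choice.

```python
def online_mean(stream):
    """Yield the running mean after each item of an iterable of numbers."""
    mu = 0.0  # current estimate; the k = 0 weight below makes this start value irrelevant
    k = 0     # number of items seen so far
    for x in stream:
        mu = (k * mu + x) / (k + 1)
        k += 1
        yield mu


print(list(online_mean([1, 2, 3, 4])))
```

Each step uses constant memory, so the whole dataset never has to be stored.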

#### Standard Deviation

- Let $\hat{\sigma}_{k}$ be an estimate for the standard deviation of the dataset after seeing $k$ elements.

$$
\begin{align}
\hat{\sigma}_{1} &= 0 \\
\hat{\sigma}_{k + 1} &= \sqrt{\frac{(k \cdot \hat{\sigma}_{k}^2) + (x_{k + 1} - \hat{\mu}_{k})(x_{k + 1} - \hat{\mu}_{k + 1})}{k + 1}}
\end{align}
$$
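
A minimal sketch of an online standard deviation in Python. It assumes the exact Welford-style update, in which the squared term is the product of the new item's offset from the old mean and its offset from the new mean; with that product the running value equals the batch standard deviation (with $1/N$ normalisation) at every step. The generator name `online_mean_std` is an illustrative choice.

```python
import math


def online_mean_std(stream):
    """Yield (running mean, running std) after each item of an iterable."""
    k = 0      # number of items seen so far
    mu = 0.0   # running mean
    var = 0.0  # running variance with 1/N normalisation
    for x in stream:
        new_mu = (k * mu + x) / (k + 1)
        # (x - mu) and (x - new_mu) always share a sign, so var stays >= 0.
        var = (k * var + (x - mu) * (x - new_mu)) / (k + 1)
        mu = new_mu
        k += 1
        yield mu, math.sqrt(var)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample data
for step, (m, s) in enumerate(online_mean_std(data), start=1):
    print(step, round(m, 4), round(s, 4))
```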

### Mean vs. Median and Standard Deviation vs. Interquartile Range

#### Mean and Standard Deviation

- The mean and the standard deviation are strongly affected by **outliers**.
- The mean and the standard deviation are inexpensive to calculate exactly.
- Generally, the mean and the standard deviation are sensible for continuous data.

#### Median and Interquartile Range

- The median and the interquartile range are weakly affected by **outliers**.
- The median and the interquartile range are expensive to calculate exactly, because the data must be sorted.
- Generally, the median and the interquartile range are sensible for categorical data that has a natural order.

### Histograms

- **Bar Chart**: A set of bars, one per category, where the height of each bar is proportional to the number of items in that category.
- **Histogram**: A generalization of a bar chart for continuous-valued data.
    1. Divide the range of data into even or uneven intervals.
    2. Associate each interval with a pigeonhole.
    3. Associate each datum with a pigeonhole.
    4. Visualize the histogram as a set of boxes, one per interval, in which each box sits on its interval on the horizontal axis, and its height is determined by the amount of data in the corresponding pigeonhole.
- **Conditional Histogram**: A histogram that only plots part of a dataset.
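
The histogram steps above can be sketched in plain Python. This version assumes equal-width intervals spanning the minimum to the maximum of the data, with the largest item placed in the last interval; the function name `histogram` and the text rendering are illustrative choices.

```python
def histogram(xs, num_bins):
    """Return (edges, counts) for num_bins equal-width intervals over xs."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    lo, hi = min(xs), max(xs)
    if lo == hi:
        raise ValueError("all items are equal, so the interval width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins  # one pigeonhole per interval
    for x in xs:
        # Clamp so that x == hi falls in the last interval rather than past it.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


edges, counts = histogram(list(range(11)), 5)  # hypothetical data: 0, 1, ..., 10
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}) " + "#" * count)
```

A conditional histogram can be obtained from the same function by filtering the data first and passing only the selected items.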

#### Modes and Histograms

- A histogram is **unimodal** if there is only one peak.
- A histogram is **bimodal** if there are two peaks.
- A histogram is **multimodal** if there are many peaks.

![Modes of Histogram](images/Figure_1_4.png)

#### Skew and Histograms

- The **tails** of a histogram are the relatively uncommon values that are significantly larger or smaller than the value at the peak.
- If the histogram is not symmetric, then the histogram is **skewed**.

![Skew of Histogram](images/Figure_1_5.png)

### Standard Coordinates and Normal Data

- Assume we have a dataset $\{x\}$ of $N$ data items, $x_1, ..., x_N$. The standard coordinates of this dataset are:
$$\hat{x}_i = \frac{x_i - \text{mean}(\{x\})}{\text{std}(\{x\})}$$
- Data is **standard normal data** if, when we have a lot of data, the histogram of the data in standard coordinates is a close approximation to the **standard normal curve**:
$$y(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2 / 2}$$
- Data is **normal data** if, when we subtract the mean and divide by the standard deviation, it becomes standard normal data.
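
A minimal sketch of the transformation to standard coordinates in Python, assuming the $1/N$ standard deviation used in these notes. The function name `standardize` and the sample data are illustrative.

```python
import math


def standardize(xs):
    """Return xs in standard coordinates: subtract the mean, divide by the std."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    if s == 0:
        raise ValueError("standard deviation is zero; standard coordinates are undefined")
    return [(x - m) / s for x in xs]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample data
z = standardize(data)
print(z)
```

The result is unitless, so two datasets measured in different units can be compared after both are standardized.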

#### Interpretation of Standard Coordinates

- A dataset expressed in standard coordinates is unitless with a mean of $0$ and a standard deviation of $1$.
    - This allows different datasets to be compared if they are expressed in standard coordinates.
- Many datasets expressed in standard coordinates are symmetric and unimodal, so they tend to be normal data.
    
#### Figure of Standard Normal Curve

![Standard Normal Curve](images/Figure_1_7.png)

#### Properties of Normal Data

- Approximately 68% of data lie within one standard deviation of the mean.
- Approximately 95% of data lie within two standard deviations of the mean.
- Approximately 99% of data lie within three standard deviations of the mean.
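
These fractions are areas under the standard normal curve. As a hedged sketch, the area between $-k$ and $k$ can be evaluated with the error function, using the standard identity that this area equals $\text{erf}(k / \sqrt{2})$; the function name `normal_coverage` is an illustrative choice.

```python
import math


def normal_coverage(k):
    """Area under the standard normal curve between -k and k."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
```

The three values come out at roughly 0.683, 0.954, and 0.997, in line with the rounded percentages above.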

### Box Plots

- A **box plot** is a way to plot data that simplifies comparison.
    - Dataset = Vertical Display.
    - Vertical Box = Interquartile Range.
    - Horizontal Line = Median.
    - Whiskers = Range of Non-Outlier Data.
    - Crosses = Outliers.
    
![Box Plot](images/Figure_1_8.png)
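
The quantities a box plot displays can be sketched in Python. The notes above do not say how outliers are identified, so this sketch assumes a common convention: an item is an outlier if it lies more than 1.5 interquartile ranges below the first quartile or above the third quartile. It also assumes nearest-rank quartiles. The function names and the `whisker_factor` parameter are illustrative.

```python
import math
import statistics


def percentile(xs, k):
    """Nearest-rank k-th percentile of a non-empty list (assumed convention)."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(k * len(ordered) / 100))
    return ordered[rank - 1]


def box_plot_summary(xs, whisker_factor=1.5):
    """Median, box, whisker ends, and outliers under the assumed 1.5 * IQR rule."""
    q1, q3 = percentile(xs, 25), percentile(xs, 75)
    spread = q3 - q1
    low_fence = q1 - whisker_factor * spread
    high_fence = q3 + whisker_factor * spread
    inliers = [x for x in xs if low_fence <= x <= high_fence]
    outliers = sorted(x for x in xs if x < low_fence or x > high_fence)
    return {
        "median": statistics.median(xs),
        "box": (q1, q3),
        # Whiskers reach the most extreme items that are not outliers.
        "whiskers": (min(inliers), max(inliers)),
        "outliers": outliers,
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # hypothetical data with one extreme item
summary = box_plot_summary(data)
print(summary)
```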