# Partitions

In this chapter, we will use a relatively simple approach to assessing whether a particular socioeconomic factor is associated with different numbers of Covid-19 cases. We will use the values of that socioeconomic factor to *partition* the Covid case data into two groups. Below, we provide some definitions to clarify exactly what we mean by a partition. We start by introducing the concept of *disjoint* collections of data:

````{card}
DEFINITION
^^^
disjoint
: collections of data are disjoint if no data point belongs to more than one of the collections
````

If the data points are unique, then we can use the mathematics of sets to formalize these ideas. If you are unfamiliar with sets and their operations, please review **Appendix A** for a brief introduction. For our purposes, we can always treat the data points in a collection as unique by including a unique index value with each data point.

The sets $A_0, A_1, \ldots, A_{n-1}$ are disjoint if $A_i \cap A_j = \emptyset$ for every pair of indices $i, j \in \{0,1,\ldots, n-1\}$ with $i \ne j$.
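The condition above can be checked directly with Python's built-in `set` type. The following is a minimal sketch, not part of the original text: the helper name `are_disjoint` and the example sets of index values are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def are_disjoint(collections):
    """Return True if no item appears in more than one of the given sets."""
    # Compare every pair (A_i, A_j) with i != j exactly once
    return all(a.isdisjoint(b) for a, b in combinations(collections, 2))

# Hypothetical index values standing in for data points
A0 = {0, 1, 2}
A1 = {3, 4}
A2 = {4, 5}

print(are_disjoint([A0, A1]))      # True
print(are_disjoint([A0, A1, A2]))  # False: index 4 is in both A1 and A2
```

Checking each unordered pair once is sufficient because set intersection is symmetric.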



<!--
MOVE THIS TO APPENDIX A

````{card}
DEFINITION
^^^
set
: a set is an unordered collection of unique items
````

Note that sets can contain any type of item, such as number, labels, names, types of dogs, weather conditions, etc.
-->





````{card}
DEFINITION
^^^
```{glossary}
partition
    a partition of a data set is a group of **disjoint** collections of data, such that every data point in the original data set belongs to exactly one of the collections.
```
````

If $S$ is a set of data, then the collections $A_0, A_1, \ldots, A_{n-1}$ partition $S$ if

$$
\bigcup_{i=0}^{n-1} A_i = S
$$

and $A_0, A_1, \ldots, A_{n-1}$ are disjoint.
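Both conditions in this definition (the union equals $S$, and the collections are disjoint) can be tested in a few lines of Python. This is an illustrative sketch only: the function name `is_partition` and the example sets are assumptions introduced here, not something defined elsewhere in this book.

```python
from itertools import combinations

def is_partition(collections, S):
    """Return True if the sets in `collections` are disjoint and their union is S."""
    disjoint = all(a.isdisjoint(b) for a, b in combinations(collections, 2))
    covers = set().union(*collections) == S
    return disjoint and covers

# Hypothetical set of index values
S = {0, 1, 2, 3, 4, 5}

print(is_partition([{0, 1, 2}, {3, 4}, {5}], S))   # True
print(is_partition([{0, 1, 2}, {3, 4}], S))        # False: 5 is not covered
print(is_partition([{0, 1, 2}, {2, 3, 4, 5}], S))  # False: 2 appears twice
```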

A *binary partition* of a set $S$ is a pair of sets $A_0$ and $A_1$ such that $A_0 \cup A_1 =S$ and $A_0 \cap A_1 = \emptyset$.
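One simple way to build a binary partition is to choose any subset $A_0$ of $S$ and take $A_1$ to be everything else in $S$. The sketch below illustrates this with Python sets; the rule used to choose `A0` (even index values) is arbitrary and purely hypothetical.

```python
# Hypothetical set of index values
S = {0, 1, 2, 3, 4, 5}

# Choose A0 by an arbitrary rule; A1 is the set difference S - A0
A0 = {i for i in S if i % 2 == 0}
A1 = S - A0

# The two conditions from the definition of a binary partition
print(A0 | A1 == S)      # True: the union is S
print(A0 & A1 == set())  # True: the intersection is empty
```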

To partition our data using a particular socioeconomic metric, we will divide the data into two sets: one with higher values of that metric and one with lower values.

To determine what constitutes a "higher value" or a "lower value", we compute a threshold from the data and compare each value to it. Any such value that is computed from the data is called a *summary statistic*.
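As a rough sketch of threshold-based partitioning, the example below splits a small collection of values into "higher" and "lower" groups. Everything specific here is an assumption made for illustration: the metric values are made up (not real Covid or socioeconomic data), and the median is used as the threshold only as one possible choice of summary statistic.

```python
from statistics import median

# Hypothetical metric values keyed by a unique index (made-up numbers)
metric = {0: 12.5, 1: 7.0, 2: 9.8, 3: 15.1, 4: 7.0, 5: 11.2}

# Assumed threshold: the median of the metric values
threshold = median(metric.values())

# Values equal to the threshold go to the "lower" group, so every
# index lands in exactly one of the two sets
higher = {i for i, v in metric.items() if v > threshold}
lower = {i for i, v in metric.items() if v <= threshold}

print(sorted(higher))  # [0, 3, 5]
print(sorted(lower))   # [1, 2, 4]
```

Because the two comparisons `v > threshold` and `v <= threshold` are mutually exclusive and cover every case, `higher` and `lower` form a binary partition of the index set. Note that the partition is built from index values rather than the metric values themselves, since the metric values may repeat.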

## Terminology Review

Use the flashcards below to help you review the terminology introduced in this section.

```python
from jupytercards import display_flashcards

# Base URL of the flashcard files for this chapter
github = 'https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/03-first-data/flashcards/'
display_flashcards(github + 'partitions.json')
```