# Descriptive Statistics
## Learning Goals

After completing this tutorial, the interns should be able to:

- Recognize, describe, and calculate the measures of spread of a dataset: variance, standard deviation, and range.
- Recognize, describe, and calculate the measures of central tendency of a dataset: mean, median, and mode.
    

To explore these, we'll use the `flower-data-2020` dataset:

| Variable    | Type    | Description                   |
|:------------|:--------|:------------------------------|
| PetalColor  | Nominal | multicolor, unicolor          |
| PetalShape  | Nominal | rounded, unrounded            |
| Size        | Ordinal | 1. small, 2. medium, 3. large |


<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/iris">UCI Machine Learning Repository</a>.</div>
<br>

We can calculate measures of central tendency and spread using `pandas` dataframes.
Let's start by importing `pandas`:


<span><h3>
 Step 1. Read CSV Data into Pandas Dataframe

#### Substep. Import Pandas Library (if Needed)

- `import pandas` &nbsp;as&nbsp; `pd`
</h3></span>

<details>
  <summary>Blockly Hints</summary>
  <ol>
    <li><code>import .. as</code> block found under IMPORT</li>
    <li>Variable <code>pd</code> must be created under VARIABLES before it appears in the <code>as</code> dropdown</li>
  </ol>
</details>

<span><h4>
 Substep. Read CSV data and Save in Variable

- Create variable &nbsp;`dataframe`
- `set dataframe to` &nbsp;:&nbsp; `with pd do read_csv` &nbsp;using&nbsp; `datasets/flower-data-2020.csv`
</h4></span>

<details>
  <summary>Blockly Hints</summary>
  <ol>
    <li>Variable <code>dataframe</code> must be created under VARIABLES before it can be used</li>
    <li><code>set .. to</code> block found under VARIABLES</li>
    <li><code>with .. do .. using</code> block found under VARIABLES</li>
    <li>If <code>do</code> dropdown in <code>with .. do .. using</code> block will not populate, try "Run All Above Selected Cell" from the "Run" menu in top-left.</li>
    <li>You need a <code>".."</code> block found under TEXT to fill in the <code>using</code> part of the <code>with .. do .. using</code> block</li>
    <li>If <code>with .. do .. using</code> block does not want to snap together nicely with the <code>set .. to</code> block, try dragging the <code>set .. to</code> block instead.</li>
  </ol>
</details>



#### Substep. Display Dataframe Contents

- Go to VARIABLES and select the `dataframe` block
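
For readers who want to see the text equivalent, the blocks in Step 1 correspond roughly to the Python below. This is a sketch under assumptions: the rows in `csv_text` are made up for illustration (they are not the real file contents), and the data is read from an in-memory string so the example runs without the CSV file. In the notebook you would pass `datasets/flower-data-2020.csv` to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical rows that only mimic the column layout of flower-data-2020.csv.
csv_text = """PetalColor,PetalShape,Size
unicolor,rounded,1
multicolor,rounded,2
unicolor,unrounded,2
unicolor,rounded,3
multicolor,unrounded,3
"""

# In the notebook: dataframe = pd.read_csv("datasets/flower-data-2020.csv")
dataframe = pd.read_csv(io.StringIO(csv_text))

# Display the contents (in a notebook, a bare `dataframe` on the last line does this).
print(dataframe)
```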




### Step 2. Measures of Central Tendency



#### Mean
The **mean** is the numerical average of a variable's values: the sum of the values divided by how many there are.

A dataframe will do this calculation for you:


<span><h4>
 Substep. Compute the Mean

</h4></span>


- Create variable &nbsp;`size`
- `set size to` &nbsp;:&nbsp; `dataframe [ .. ]` &nbsp;:&nbsp; `create list with` &nbsp;:
    - `" Size "`
- `with size do mean using`

<details>
  <summary>Blockly Hints</summary>
  <ol>
      <li>Using the LISTS menu in the Blockly palette, click on <code>create empty list</code>. Then click on the <code>+</code> sign.</li>
      <li>To add an element to the list, drag a text block and connect it to the list block. Inside the text block, write <b>Size</b>.</li>
      <li>Using the LISTS menu in the Blockly palette, click on the {dictVariable}[ ] block, change {dictVariable} to dataframe, and drop a create list with ["Size"] inside it.</li>      
     <li>If <code>do</code> dropdown in <code>with .. do .. using</code> block will not populate, try "Run All Above Selected Cell" from the "Run" menu in top-left.</li>

  </ol>
</details>

Selecting columns with a list like this returns a smaller dataframe, and it is the preferred way to calculate the mean of multiple columns at once.

However, the result is a table-like object with one mean per selected column, not a plain number that we could reuse. To get a single number, we need to select just one column by name instead of passing a list.



- Create variable &nbsp;`size`
- `set size to` &nbsp;:&nbsp; `dataframe [ .. ]` &nbsp;<-&nbsp; `" Size "`
- `with size do mean using`
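
The difference between the two selections can be sketched in Python as follows. The `Size` values here are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Hypothetical Size values, not the real dataset.
dataframe = pd.DataFrame({"Size": [1, 2, 2, 3, 3, 1]})

# Selecting with a list of names returns a DataFrame, so mean() returns a
# Series with one entry per selected column.
mean_per_column = dataframe[["Size"]].mean()

# Selecting with a single name returns a Series, so mean() returns one number.
size = dataframe["Size"]
mean_value = size.mean()

print(mean_per_column)
print(mean_value)  # 2.0
```

The single number in `mean_value` can be reused in later arithmetic, which is what the standard deviation steps rely on.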

<span><h3>
 Median
The **median** is the middle value of the sorted data. By definition, at least half of the data points are at or below the median and at least half are at or above it.

#### Substep. Compute the Median  


We can get this from a dataframe too:

</h3></span>


- `with size do median using`
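
As a rough Python illustration of what `median` does, the sketch below uses two small hypothetical lists: one with an odd number of values and one with an even number.

```python
import pandas as pd

# Hypothetical values for illustration.
odd_count = pd.Series([3, 1, 2, 3, 2])      # sorted: 1, 2, 2, 3, 3
even_count = pd.Series([3, 1, 2, 3, 1, 3])  # sorted: 1, 1, 2, 3, 3, 3

median_odd = odd_count.median()    # the single middle value
median_even = even_count.median()  # the average of the two middle values

print(median_odd)   # 2.0
print(median_even)  # 2.5
```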

<span><h3>
 Mode
The **mode** is the value in the data that shows up the most:

#### Substep. Compute the Mode

</h3></span>

- `with dataframe do mode using`

This output looks like it could be one of the rows from our dataset, **but it isn't!** It's actually the most frequent value from each column smushed into a row.

For each column, what `pandas` did is:

 - get the unique values
 - count the number of times each of those values appeared
 - take the max of the counts
 - find the name of the unique values with that count

If several unique values tie for the highest count, `pandas` returns all of them, so the resulting mode table has more than one row. In this case, only one row means that there was only one mode value for each column.
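
The counting steps listed above can be sketched by hand and compared with the built-in `mode`. The data and the helper name `modes_by_counting` are hypothetical, chosen so that `Size` has a tie and `PetalColor` does not.

```python
import pandas as pd

# Hypothetical data: PetalColor has one clear winner, Size has a tie (2 and 3).
dataframe = pd.DataFrame({
    "PetalColor": ["unicolor", "multicolor", "unicolor", "unicolor"],
    "Size": [2, 3, 3, 2],
})


def modes_by_counting(column):
    """Illustrative helper that follows the four steps in the list above."""
    counts = column.value_counts()       # unique values and how often each appears
    highest = counts.max()               # the max of the counts
    winners = counts[counts == highest]  # values whose count equals that max
    return sorted(winners.index)


color_modes = modes_by_counting(dataframe["PetalColor"])  # ['unicolor']
size_modes = modes_by_counting(dataframe["Size"])         # [2, 3]

# The built-in mode() agrees. The tie in Size produces a second row, and
# PetalColor is padded with NaN in that row.
mode_table = dataframe.mode()
print(mode_table)
```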

To confirm this, let's make subsets from the `PetalColor` and `PetalShape` columns as well, and take the mode of each column on its own.



- Create variable &nbsp;`petal_color`
- `set petal_color to` &nbsp;:&nbsp; `dataframe [ .. ]` &nbsp;<-&nbsp; `" PetalColor "`
- Create variable &nbsp;`petal_shape`
- `set petal_shape to` &nbsp;:&nbsp; `dataframe [ .. ]` &nbsp;<-&nbsp; `" PetalShape "`

- `with petal_color do mode using`

- `with petal_shape do mode using`

- `with size do mode using`

Now, we can see that the mode values for each column do match with the original mode table.

### Measures of Dispersion (spread)

Even when two different variables have similar means (or medians, or modes) they can still be quite different depending on how the data are spread out around the center. 

One measure of dispersion that can be used with ordered categorical data (ordinal level) or numerical data (interval/ratio level) is the **five number summary**.
The five number summary is useful for comparing the center and spread of multiple variables. 
**You use the numbers in the five number summary to construct a box-and-whisker plot.**
The five numbers are: 

- minimum
- first quartile
- median
- third quartile
- maximum

The first quartile is the median of the values below the median and the third quartile is the median of the
values above the median.

To use a football analogy, quartiles are like the 4 quarters in a game, and the median is like halftime.

We can get the five number summary easily from `pandas`:



- `with dataframe do describe using`

We can get the five number summary from the `min`, `25%`, `50%`, `75%`, and `max` rows. The five number summary for this dataset is $\{1,2,2,3,3\}$.

Note we've also got a few extra stats here: count, mean, and std.
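
In Python, pulling just the five number summary out of `describe()` might look like the sketch below. The `Size` values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical Size values chosen for illustration.
size = pd.Series([1, 2, 2, 2, 3, 3, 3], name="Size")

summary = size.describe()  # count, mean, std, min, 25%, 50%, 75%, max

# Keep only the five rows that make up the five number summary.
five_numbers = summary.loc[["min", "25%", "50%", "75%", "max"]]
print(five_numbers.tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0]
```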

### Range, Variance, and Standard Deviation

Other measures of the spread for numerical data include the range, the interquartile range, and the variance. 

The **range** is simply the maximum value minus the minimum. 
When outliers are present they may inflate the range. 
For example, consider a dataset containing incomes of all company employees. The CEO's income is $\$4,000,000$, but an intern only makes $\$30,000$. In this example, the range would be $4,000,000-30,000=3,970,000$, but this would not be representative of the spread of the majority of incomes. 

To reduce the effect of outliers on the measure of dispersion, the interquartile range is often used. 
The **interquartile range** is defined as the third quartile minus the first quartile.
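
To see how much less the interquartile range reacts to an outlier, here is a small sketch. Only the CEO and intern figures come from the example above; the other incomes are invented for illustration.

```python
import pandas as pd

# Hypothetical incomes; only 30,000 and 4,000,000 come from the text.
incomes = pd.Series([30_000, 45_000, 50_000, 55_000, 60_000, 4_000_000])

# Range: maximum minus minimum, dominated by the single outlier.
income_range = incomes.max() - incomes.min()

# Interquartile range: third quartile minus first quartile.
iqr = incomes.quantile(0.75) - incomes.quantile(0.25)

print(income_range)  # 3970000
print(iqr)           # 12500.0
```

With these made-up numbers the range is in the millions, while the interquartile range stays close to the spread of the typical incomes.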

The most commonly used measures of dispersion for numerical data are the **variance** and its square root, the **standard deviation**. 
The variance measures the typical squared difference of the data from the mean: for a sample, it is the sum of squared differences divided by the number of values minus one.
Squaring the differences may seem complicated but makes sense when you realize that the sum of differences about the mean is zero.

![image.png](attachment:image.png)

When we did `describe()` above, it gave us the standard deviation already (`std`), but let's work through the **sum of the differences**, the **variance**, and the **standard deviation** ourselves to see how they differ and what each one means.

![image.png](attachment:image.png)
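
In symbols, and assuming the sample convention (dividing by $n - 1$) that `pandas` uses by default for `std`, these three quantities can be written as follows, where $x_i$ are the $n$ values and $\bar{x}$ is their mean. The notation is a common textbook choice and may differ from the figures.

$$
\sum_{i=1}^{n} (x_i - \bar{x}) = 0, \qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
$$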

<span><h4>
 Substep. Compute the Standard Deviation  

</h4></span>

First, let's get the sum of the differences which we are told should equal zero. Following the formula above, we should get the mean of the `Size` column, then subtract the mean from the `Size` value of each row in the column. This will give us a new list of values (the "differences") which we can then take the sum of.



- Create variable &nbsp;`mean`
- `set mean to` &nbsp;:&nbsp; `with size do mean using`
- Create variable &nbsp;`differences`
- `set differences to` &nbsp;:&nbsp; ` .. - .. ` &nbsp;<-
    - `dataframe [ .. ]` &nbsp;<-&nbsp; `" Size "`
    - variable &nbsp;`mean`
- `sum of list` &nbsp;:&nbsp; `differences`

<details>
  <summary>Blockly Hints</summary>
  <ol>
      <li>To perform subtraction, select the <code>1 + 1</code> block from the MATH menu of the Blockly palette.</li>
      <li>In the dropdown showing <code>+</code>, select <code>-</code>. Then fill the block with the value to subtract from (left) and the value to subtract (right).</li>
      <li>To select the <code>Size</code> column of the dataframe, refer to the Blockly hints above.</li>


  </ol>
</details>

**Wait! That's not a 0!** But, actually, in programming land, it is. It's a floating-point rounding error. The number shown is $2.66 \times 10^{-14}$, or $0.0000000000000266$, which is very small. So, the answer we got is not exactly zero but is **effectively zero**.
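
A Python sketch of the same check is shown below, using hypothetical `Size` values. Because the exact leftover depends on the data, the example tests that the sum is close to zero instead of comparing it with `0` directly.

```python
import pandas as pd

# Hypothetical Size values, not the real dataset.
size = pd.Series([1, 2, 2, 2, 3, 3, 3])

mean = size.mean()
differences = size - mean  # subtracts the mean from every row
total = differences.sum()

print(total)              # may be a tiny nonzero value due to rounding
print(abs(total) < 1e-9)  # True
```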

Now, we can use the list of differences to calculate the variance by squaring the value in each row, taking the sum of the whole column, and dividing that sum by the number of values minus one.



- `set differences to` &nbsp;:&nbsp; `with differences do pow using` &nbsp;<-&nbsp; `2`
- Create variable &nbsp;`variance`
- `set variance to` &nbsp;:&nbsp; `sum of list` &nbsp;:&nbsp; `differences`
- `set variance to` &nbsp;:&nbsp; ` .. / .. ` &nbsp;<-
    - variable &nbsp;`variance`
    - ` .. - .. ` &nbsp;<-
        - `with size do count using`
        - `1`
- variable &nbsp;`variance`

<details>
  <summary>Blockly Hints</summary>
  <ol>
      <li>To compute the square, go to VARIABLES and select the <code>with differences do ...</code> block.</li>
      <li>In the <code>do</code> dropdown, select <code>pow</code>.</li>
      <li>To compute the sum of the differences, which is a list, go to MATH and select the <code>sum of list</code> block. Go to VARIABLES and select <code>differences</code>. Connect the two blocks and set the variable <code>variance</code> to the result.</li>
      <li>To compute the count, go to VARIABLES and select the <code>with size do ...</code> block.</li>
      <li>In the <code>do</code> dropdown, select <code>count</code>.</li>
      <li>If <code>do</code> dropdown in <code>with .. do .. using</code> block will not populate, try "Run All Above Selected Cell" from the "Run" menu in top-left.</li>

  </ol>
</details>
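
The variance steps can be sketched in Python like this. The data is hypothetical, and the comparison with `size.var()` assumes the pandas default of dividing by the count minus one.

```python
import pandas as pd

# Hypothetical Size values, not the real dataset.
size = pd.Series([1, 2, 2, 2, 3, 3, 3])

differences = size - size.mean()
squared = differences.pow(2)  # square each difference

# Sum of squared differences divided by (count - 1): the sample variance.
variance = squared.sum() / (size.count() - 1)

print(variance)
print(size.var())  # pandas' built-in sample variance, for comparison
```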

Once we have calculated the variance, we can take the square root of it to get the standard deviation.


- `square root` &nbsp;:&nbsp; `variance`

<details>
  <summary>Blockly Hints</summary>
  <ol>
      <li>To compute the square root, go to MATH and select the <code>square root</code> block.</li>
      <li>From VARIABLES, select the <code>variance</code> block and connect it to <code>square root</code>.</li>
      <li>If <code>do</code> dropdown in <code>with .. do .. using</code> block will not populate, try "Run All Above Selected Cell" from the "Run" menu in top-left.</li>

  </ol>
</details>

**We did it!** This value should match the one in the `std` row of the `describe()` output, because `pandas` also divides by the count minus one by default.
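
As a final check, the whole calculation can be sketched in a few lines of Python and compared against the built-in results. The values are hypothetical; with the real `Size` column the numbers will differ, but the three results should still agree.

```python
import math

import pandas as pd

# Hypothetical Size values, not the real dataset.
size = pd.Series([1, 2, 2, 2, 3, 3, 3])

# Sample variance by hand, then its square root.
variance = (size - size.mean()).pow(2).sum() / (size.count() - 1)
standard_deviation = math.sqrt(variance)

print(standard_deviation)
print(size.std())              # built-in standard deviation
print(size.describe()["std"])  # the same value from describe()
```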