# STA130 Tutorial 3 (with \<Your Favorite TA\>): <br>Data Visualization
### ***Samples*** come from ***Populations*** (often called ***Distributions***) $\quad$  
Today we're interested in describing the "shape" or "distribution" of data

|![](im/3/Normal_Distribution.png)|
|:-:|
|Illustration credit: Allison Horst|

### Normal Populations are described by their Parameters (10 minutes)

$$\large \mathcal{N}(\mu, \sigma)   \color{gray}{\quad {\normalsize f(x)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma} \right)^2}} \leftarrow \text{math just for fun}} $$
- $\mu$: **mean** (location or center parameter)
- $\sigma$: **standard deviation** (scale or spread parameter)
    - often referenced as **variance** $\sigma^2$
    
| Understand how $\mu$ and $\sigma$ work "Normally" below | Understand ***Samples*** come from ***Populations*** |
|-|-|
|<img src='https://www.scribbr.de/wp-content/uploads/2023/01/Standard-normal-distribution.webp' style="width:600px">|`from scipy import stats`<br>`mu,std = 0,1`<br>`a_normal_distribution = \`<br>`  stats.norm(loc=mu, scale=std)`<br>`n = 100`<br>`a_sample_of_size_n100 = \`<br>`  a_normal_distribution.rvs(size=n)`<br>`# rvs: random variates (i.e., samples)`|

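As a minimal sketch of the idea that samples come from populations, the code below draws a sample from a normal population and compares the sample statistics with the population parameters. It uses NumPy's random generator instead of `scipy.stats` purely for brevity; the seed, the values of `mu` and `sigma`, and the sample size are illustrative assumptions.

```python
import numpy as np

# Illustrative (assumed) population parameters and sample size
mu, sigma, n = 10.0, 2.0, 10_000

rng = np.random.default_rng(130)  # fixed seed so the sketch is reproducible
a_sample = rng.normal(loc=mu, scale=sigma, size=n)

x_bar = a_sample.mean()       # sample mean, estimates mu
s = a_sample.std(ddof=1)      # sample standard deviation, estimates sigma

# Fraction of the sample within 2 standard deviations of the population mean
within_2sd = np.mean(np.abs(a_sample - mu) <= 2 * sigma)

print(f"x_bar = {x_bar:.2f} (mu = {mu}), s = {s:.2f} (sigma = {sigma})")
print(f"fraction within 2 sigma: {within_2sd:.3f}")
```

For a normal population, roughly 95% of values fall within two standard deviations of the mean, so the printed fraction should land close to 0.95.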

# Tutorial Activity: Quiz Part I (3 minutes)

*Turn this in for your Tutorial Activity mark*

0. Your name
1. Could you estimate the variance of a data distribution based on just a histogram of the data; and, if so, what knowledge would you use and how would you estimate the variance? <br>(Hint: **variance** is **standard deviation** squared) <!-- For symmetric, normally shaped data, most of the data lies within 2-3 standard deviations of the mean, so the standard deviation could be roughly estimated that way, and then that number could be squared to estimate the variance -->  

> - Answer will be reviewed later after Part II of the Quiz
> - Question credit for attempting to provide an answer: answers will not be reviewed in detail during marking


### Describing Distributions More Generally (10 minutes)

## (First Order) Distributional Characteristics: Location/Center 

| | | | |
|-|-|-|-|
| **Mean** | Average or $\bar x$ |`n=len(my_samp)` | `my_samp.sum()/n `<br>`# my_samp.mean()`|
| | 
| **Median** | 50th percentile | `import numpy as np` | `np.percentile(my_samp, 50)`<br>`# sorted(my_samp)[int(n/2)]`<br>`# np.quantile(my_samp, 0.5)`|
| | 
| **Mode** | Most Common | `from collections import Counter` | `Counter(my_samp).most_common()`

> These can be **parameters** when talking about a **population** but above they are **statistics** since they're calculated on a **sample**...

> **Statistics** are "functions of data" which characterize the "distribution of the sample"; and, they're good estimates of **parameters** of the **population** the **sample** was drawn from.

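The location statistics in the table above can be tried on a small made-up sample. This is a minimal sketch: the values in `my_samp` are invented for illustration and do not come from any course dataset.

```python
import numpy as np
from collections import Counter

# Hypothetical sample, chosen so that the three statistics differ
my_samp = np.array([2, 3, 3, 5, 7, 8, 30])
n = len(my_samp)

mean = my_samp.sum() / n               # same as my_samp.mean()
median = np.percentile(my_samp, 50)    # same as np.quantile(my_samp, 0.5)

# most_common(1) returns a list with one (value, count) pair
mode, mode_count = Counter(my_samp.tolist()).most_common(1)[0]

print(f"mean = {mean:.2f}, median = {median}, mode = {mode} (seen {mode_count} times)")
```

Note that `sorted(my_samp)[int(n/2)]` matches the median here only because `n` is odd; for even `n`, `np.percentile` averages the two middle values while the indexing shortcut picks the upper one.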



### Describing Distributions More Generally (10 minutes)

## (Second Order) Distributional Characteristics: Scale/Spread

| | |
|-|-|
| **Interquartile Range (IQR)** | `np.quantile(my_samp, 0.75) - np.quantile(my_samp, 0.25)` |
| | |
| **Range** | `my_samp.max() - my_samp.min()` |
| | |
| **Variance** | `my_samp.var(ddof=1) #'ddof=1' specifies division by "n-1"`|
| |$\underset{\color{gray}{\text{Estimates $\sigma^2$}}}{s^2} = \frac{1}{n-1}\displaystyle\sum_{i=1}^{n} (x_i - \bar x)^2 \quad\quad \underset{\color{gray}{\text{Estimates $\mu$}}}{\bar x} = \frac{1}{n} \sum_{i=1}^n x_i$<br>The variance is the (almost) average squared distance from the mean|
| **Standard Deviation** | `my_samp.std(ddof=1) #'ddof=1' specifies division by "n-1"`|
| | $\underset{\color{gray}{\text{Estimates $\sigma$}}}{s} = \sqrt{s^2} \quad$ is the square root of the **variance** so it's more interpretable<br>$\color{white}{\underset{\text{Estimates $\sigma$}}{s} = \sqrt{s^2}} \quad$ since it's now back in the same units as the original data.<br> Most normally distributed data is within 2-3 **standard deviations** of the **mean**.

> These can be **parameters** when talking about a **population** but above they are **statistics** since they're calculated on a **sample**...

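The spread statistics above can be checked by hand on a tiny sample. The sketch below uses invented values for `my_samp` and compares the formula for $s^2$ with NumPy's `var(ddof=1)`.

```python
import numpy as np

# Hypothetical sample with mean exactly 5
my_samp = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(my_samp)
x_bar = my_samp.mean()

iqr = np.quantile(my_samp, 0.75) - np.quantile(my_samp, 0.25)
data_range = my_samp.max() - my_samp.min()

# Sample variance written out as in the formula: divide by n-1
s2_manual = ((my_samp - x_bar) ** 2).sum() / (n - 1)
s2 = my_samp.var(ddof=1)   # should agree with s2_manual
s = my_samp.std(ddof=1)    # square root of s2, in the units of the data

print(f"IQR = {iqr}, range = {data_range}")
print(f"s^2 = {s2:.3f} (manual: {s2_manual:.3f}), s = {s:.3f}")
```

Leaving out `ddof=1` on a NumPy array divides by $n$ instead of $n-1$, whereas `pandas` uses $n-1$ by default, so it is worth being explicit.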
<!--
> IQR is fairly simple to understand (right?); and standard deviation also usually has the easy interpretation (especially for symmetric unimodal data) that most (or all) of the data is within 2 (or 3) standard deviations to the left and to the right of the mean.

> **Statistics** are "functions of data" which characterize the "distribution of the sample"; and, they're good estimates of **parameters** of the **population** the **sample** was drawn from.
-->

#### Describing Distributions More Generally (5 minutes) *[next click is down not right]*

### (Higher Order) Distributional Characteristics: *Skewness* and Modality

![](im/3/skew.JPG)

- What causes the order of the median and the mean to be this way?

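One way to explore the question above is to stretch a single value into a long tail and watch which statistic moves. The two tiny samples below are invented for illustration.

```python
import numpy as np

symmetric = np.array([1, 2, 3, 4, 5])
right_skewed = np.array([1, 2, 3, 4, 50])  # largest value stretched into a long right tail

for name, x in [("symmetric", symmetric), ("right skewed", right_skewed)]:
    print(f"{name:>12}: mean = {x.mean():.1f}, median = {np.median(x):.1f}")
```

The median only depends on which value sits in the middle, so it stays put, while the mean uses every value and is pulled toward the tail.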
#### Describing Distributions More Generally (continued...) *[next click is down not right]*

### (Higher Order) Distributional Characteristics: Skewness and *Modality*

<center><img src='im/3/skew_modality2.png' style="width:1000px"></center>


#### Describing Distributions More Generally (continued...)

### (Higher Order) Distributional Characteristics: *Skewness and Modality*

<center><img src='im/3/skew_modality.png' style="width:1000px"></center>

- These terms also describe the distributional shapes of samples!


# Practice: group 1 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

<sub><sup>The histogram below shows the distribution of aftertaste scores for a sample of 1338 coffee samples.  Suppose you obtained 2 new cups of coffee, rated their aftertaste on a scale from 0 to 10, and re-calculated the mean, median, standard deviation, and variance of all 1340 aftertaste scores (1338 original values + 2 new values). </sup></sub>

|![](im/3/Rplot06.png)|
|:-:|
| |

If the two new coffees got scores of 7 and 10, how would the **recalculated mean and median change** compared to their original values (for just the first 1338 samples)?

<!-- mean increases but median is the same -->

# Practice: group 2 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

<sub><sup>The histogram below shows the distribution of aftertaste scores for a sample of 1338 coffee samples.  Suppose you obtained 2 new cups of coffee, rated their aftertaste on a scale from 0 to 10, and re-calculated the mean, median, standard deviation, and variance of all 1340 aftertaste scores (1338 original values + 2 new values). </sup></sub>

|![](im/3/Rplot06.png)|
|:-:|
| |

If the two new coffees both got scores of 10, how would the **recalculated standard deviation change** compared to its original value (for just the first 1338 samples)?
<!-- standard deviation will increase slightly -->

# Practice: group 3 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

Which ONE of the statements below is most accurate for these variables?
    
![](im/3/Quiz2-Question2.png)


A. Variables have similar means and similar variances

B. Variables have similar means but quite different variances <!-- B -->

C. Variables have similar variances but quite different means

D. Variables have quite different means and quite different variances

*Hints: which of the variances is the smallest? **Which of the variances is the largest?***


# Practice: group 4 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

Which of the descriptors below are appropriate in describing the variable **VarA**? 

| | | | |
|-|-|-|-|
|left skewed|right skewed|symmetric|
|unimodal|bimodal|multimodal|
|mean>median|mean<median|mean$\approx$median| (Roughly guess mean/median?)|

|![](im/3/Quiz2-Question3.png)|
|:-:|
| Follow up question(s): what is the IQR and range of this distribution, approximately? |

<!-- left skewed unimodal mean<median -->
<!-- range is easy to read from this, but IQR we would have to approximate -->

# Practice: group 5 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

Which of the descriptors below are appropriate in describing the variable **b**? 

| | | | |
|-|-|-|-|
|left skewed|right skewed|symmetric|
|unimodal|bimodal|multimodal|
|mean>median|mean<median|mean$\approx$median|(Roughly guess mean/median?)|

|![](im/3/Quiz2-Question4.png)|
|:-:|
| Follow up question(s): what is the IQR and range of this distribution, approximately? |
<!-- left skewed mean>median and MAYBE unimodal? -->
<!-- IQR and range easy to read from this -->

# Practice: group 6 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

Which of the descriptors below are appropriate in describing the variable **Aftertaste scores**? 

| | | | |
|-|-|-|-|
|left skewed|right skewed|symmetric|
|unimodal|bimodal|multimodal|
|mean>median|mean<median|mean$\approx$median| (Roughly guess mean/median?)|

|![](im/3/Rplot06.png)|
|:-:|
| Follow up question(s): what is the IQR and range of this distribution, approximately? |
<!-- symmetric unimodal mean about equals median -->
<!-- range is easy to read from this, but IQR we would have to approximate -->

# Practice: group 7 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon

The three aspects that should be reported when describing the distribution of a numerical variable are listed below along with an example. 

*Name another example of each of these.*

| | |
|-|:-|
| Location/Center|**Median**, the 50th percentile (or halfway point) of the distribution |
| Scale/Spread   |**Standard deviation**, a measure of spread of a distribution | 
| Shape          |**Skewness**, an imbalance away from symmetry in distribution tails |


# Practice: group 8 of 8 answers (<4 minutes)

- Break into 8 groups: confer, agree, then volunteer or be called upon


What two types of data can be visualized with plotly `px.histogram`? <!-- continuous numerical data; and, discrete numerical data, in which case the bins get little spaces between them. This latter plot is often called a "bar" plot: these are demonstrated on the next slide --> 
And what parameter controls the number of bars used on the plot on the right? <!-- nbins --> 


|<img src='im/3/graphs_from_HW3.JPG' style="height:250px">|
|:-:|
| |

> Hints: what's the difference between left and right figures?
> One is called a **barplot** while the other is called a **histogram**... 
>
> ...though both are still created with `px.histogram`



# Tutorial Activity: Quiz (Part II: 7 minutes)

2. Which of the characteristics below are immediately clear from a boxplot compared to a histogram, and vice-versa? Which are pretty easy to assess from both boxplots and histograms?
 
  **Mean, Median, Mode, IQR, Range, Modality, Skewness, <u>OUTLIERS</u>**

<!-- median and IQR are part of boxplots; modality would not show on a boxplot but could be seen in a histogram with appropriately/reasonably sized bins; mode could also be identified with a histogram for discretely-valued data when bins each capture the presence of a single observation value; outliers, range, skewness should all be fairly discernible from both boxplots and histograms, as may be the mean depending on the skewness of the data distribution with mean close to median for nearly symmetric distributions -->

3. Describe what the code below does <!-- gets the height variable, removes missing values, transforms them to a list, and then makes a kernel density estimation of the 'height' variable overlaid on a histogram of the data. An example of this kind of KDE figure is shown later in the slides <img src='images/kde_HW3.JPG' style="height:250px"> -->

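```python
import plotly.figure_factory as ff
# Assumes `amazonbooks` is a pandas DataFrame that has already been loaded
fig = ff.create_distplot(
        [amazonbooks.Height.dropna().values.tolist()], 
        ['Height'])
fig.show()
```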

4. For the continuous and discrete variables `variable` and `category` in the `pandas` data frame `df`, write `plotly` code to compare the distribution of `variable` across the levels of `category`. 
<!-- fig = px.box(df, y="category", x="variable"); fig.show() -->

# Tutorial Activity: Quiz Review (15 minutes)

> - Refer to questions on slide 3 and last slide as needed
> - Question credit for attempting to provide an answer: answers will not be reviewed in detail during marking

1. <sub><sup>For symmetric, normally shaped data, most of the data lies within 2-3 standard deviations of the mean, so the standard deviation could be roughly estimated that way, and then that number could be squared to estimate the variance</sup></sub>

2. <sub><sup>Boxplots show median and IQR; modality (and mode for categorical data) are easily seen from histograms; Range, Skewness, and **<u>OUTLIERS</u>** are easily enough seen in both;  mean is probably fairly easy to guess for both; **<u>Briefly discuss OUTLIERS generally and as indicated in boxplots as this may be a new term for students</u>**</sup></sub>

3. <sub><sup>Gets the height variable, removes missing values, transforms them to a list, and then plots a kernel density estimate of the 'height' variable overlaid on a histogram of the data (like the example in later slides)</sup></sub>

4. <sub><sup>`fig = px.box(df, y="category", x="variable"); fig.show()`</sup></sub>

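The reasoning in answer 1 can be sketched numerically. The data below are simulated from a normal distribution with made-up parameters (they are not the coffee data), and the "middle 95%" cutoffs stand in for what you would eyeball from a histogram as the region holding most of the data.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical, roughly normal data; loc and scale are illustrative assumptions
data = rng.normal(loc=7.5, scale=0.4, size=5000)

# Where does "most" (here, the middle 95%) of the data lie?
lo, hi = np.quantile(data, [0.025, 0.975])

# That interval spans about 2 standard deviations on each side of the mean
sd_guess = (hi - lo) / 4
var_guess = sd_guess ** 2   # variance is standard deviation squared

print(f"guessed sd = {sd_guess:.3f}, computed sd = {data.std(ddof=1):.3f}")
print(f"guessed variance = {var_guess:.3f}, computed variance = {data.var(ddof=1):.3f}")
```

This is only a rough estimate, and it relies on the data being approximately symmetric and normally shaped; for skewed or multimodal data the guess may be noticeably off.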
# Reworked question structure/use; these are leftover questions that ended up on the cutting room floor


2. If the distribution is left-skewed, how does the median compare to the mean? [because I think this was well-addressed earlier]  <!-- the median is greater than the mean -->

3. What’s the difference between mean and median (calculations) that causes the difference above? [because I think this was well-addressed earlier]  <!-- making some of the small numbers in an average smaller would make the average smaller, but if they're less than the median this would not make the median smaller... this is what left-skewed data does relative to a symmetric distribution where the mean/median are the same --> 

4. What does thin-tailed vs. heavy-tailed describe about a data distribution? [because I think it's more important to focus on getting a sense of the concept of outliers as opposed to the statistical notion of "heavy tailed-ness" which seems too technically advanced to be necessary and I think is sufficiently addressed through the notion of outliers which could just be given superlatives like "crazy" or "extreme" to make the point that "heavy tailed-ness" is really getting at] <!-- the proportion of observations that fall within the "tails" of the distribution relative to the "central mass" of the distribution -->



# Full Class Discussion (15 minutes) *[click down...]*

You've learned about the different types of variables and techniques for visualizing and describing the distribution of variables. The `coffee_ratings.csv` dataset contains information about various samples of coffee and their ratings. On the next slide, there are four plots that visualize variables in this dataset.

1. What types of variables are displayed in each plot? 
    - Briefly justify your choice of Continuous or Discrete
    <!-- country is obviously categorical, so, discrete. For the other two variables both answers could be argued; but, probably `total_cup_points` is an integer, so technically discrete; and, `flavor` appears discrete with steps every 1/12, but there are some data points that don't follow this, so, it seems like it could theoretically be any numerical/decimal value, so, perhaps this is more naturally viewed as being continuous -->
2. The upper two graphs visualize the same data. <br>Which one do you like better and why?
    - Which makes comparisons across the groups easier? <!-- boxplots -->
    - Which better conveys the relative amount of data in each group? <!-- histograms -->
    
    
3. How could you best convey the relative amount of data in each group? <!-- perhaps sort the bottom right bar plot least to greatest? --> 
4. What could coloring the points in the bottom left figure by country show?  <!-- if there are different "sub-associations" between flavor and points within countries --> 


<!-- 2. What type of plot is used in each of the four figures?
3. Which plots are appropriate for which types of variables? 
- What features of the variable can you see from the plot? -->


# Full Class Discussion (continued...) *[click down...]*

![](im/3/coffee_plots.JPG)

# Full Class Discussion (continued...)

1. Continuous or Discrete? 
2. Boxplots allow for easy comparisons; but, don't show data sizes <br>(Address with text information? barplots? Re: 3.)... also, what about this?

![](im/3/better.JPG)

4. Coloring the scatterplot points might show different "sub-associations" of flavor and points in different countries, or general differences in scores

## Group Activity: Review 

`Q17` from this week's homework was: **Choose which of histograms, boxplots, and KDEs are your favorite way to represent distributions of data and explain why**

Check out these other cool ways to visualize data distributions: https://plotly.com/python/violin/

![](im/3/kde_HW3.JPG)

# Hedging (10 mins)
Hedging is helpful whenever you can’t say something is 100% one way or another, as is often the case.
In statistics, hedging should always be used with respect to the limitations of data and the strength and
generalizability of the conclusions.

Play this video for students: [https://web.microsoftstream.com/video/22f20d20-f096-4934-bfb4-86c0caf9da85](https://web.microsoftstream.com/video/22f20d20-f096-4934-bfb4-86c0caf9da85)

> We hope a **sample** is representative of a **population**; but, small sample sizes mean generalizations -- such as the accuracy of **sample statistics** estimating **population parameters** -- should be viewed cautiously and not be used overconfidently

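A small simulation may help show why small samples call for hedged language. The sketch below repeatedly draws samples from a standard normal population (an assumption chosen for simplicity) and looks at how much the sample mean moves around for two illustrative sample sizes.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_repeats = 2000  # number of repeated samples per sample size (illustrative)

def sample_means(n):
    """Means of n_repeats independent samples of size n from N(0, 1)."""
    return rng.normal(loc=0.0, scale=1.0, size=(n_repeats, n)).mean(axis=1)

means_small = sample_means(10)
means_large = sample_means(1000)

print(f"n=10:   sample means vary with sd {means_small.std(ddof=1):.3f}")
print(f"n=1000: sample means vary with sd {means_large.std(ddof=1):.3f}")
```

The population mean is 0 in both cases, but the sample means from small samples scatter much more widely around it, so a single small-sample estimate deserves more cautious wording.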
## Tutorial Assignment *[next click is down not right]*

- Submit your work for the assignment through Quercus
- Include the `.ipynb` file upon which your written submission is based

#### Marking is based on whether you address the points on the next slide (relative to the prompt on the slide after that) and on whether the TA can get a good sense of the figure(s) you're describing without referring to your submitted `.ipynb` file

- Don't spend more than 60 minutes on this assignment (unless really needed...)    

    - Aim for something close to 200 to 500 words
    - Grammar is *not* the main focus of the assessment, but it is important that you communicate in a clear and professional manner; so, 
        - use full sentences (without slang or emojis) 
    




## Tutorial Assignment *[next click is down not right]*

- Describe the data source, acknowledging any notable limitations of the data<br>(such as sample size, missing data, etc.)
- State the type of graph you're making (and identify the x- and y-axes) 
- Describe the key features and characteristics of the data distribution
    - Where is the data located/centered (approximate values if possible)?
        - Note key relative frequencies in categorical data contexts
    
    - What is the scale/spread of the data (approximate values if possible)? <!-- (Characterized relative to location/center of the data if helpful) -->
    - What is the shape of the data? <br>Symmetric? Left or right skewed? Multiple modes (and how many)?
    - Note the presence or absence of any potential outliers/extreme data: <br> **Provide description of the nature of outliers in the tails of the distribution if doing so is sensible and helpful**

## Tutorial Assignment (complete at home if needed)

Suppose you're on the phone with your friend, and for whatever reason you're describing some of the data and data visualization techniques you've been working with for STA130.  

Use `plotly` to construct 1-2 plots from the `coffee_ratings.csv` dataset and prepare a small paragraph with your description of the graph(s) for your friend keeping in mind that they cannot see the graph(s). Suppose your friend has not taken STA130, so they will not be as familiar with the statistical vocabulary as you are; so, explain any terms you use in plain language as you describe the data and graph(s) to your friend.

- Do not include any code in your written submission; but, separately include your `.ipynb` file along with your written submission
    - Annotate your notebook with comments where it's helpful for understanding what your code is doing