# STA130 Tutorial 3 (with \<Your Favorite TA\>): <br>Data Visualization

![](images/Normal_Distribution.png)    
Illustration credit: Allison Horst

# Describing Distributions (15 mins)
## (First Order) Distributional Characteristics: Center/Location Statistics
- Median: the 50th percentile of the data
- Mean: the average value in the data
- Mode: the most frequent data value
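
As a minimal sketch, the three center statistics above can be computed with Python's built-in `statistics` module. The `heights` list is a made-up sample used only for illustration; it does not come from any course dataset.

```python
import statistics

# Hypothetical sample (an assumption for illustration): heights in cm of seven books
heights = [18, 20, 20, 21, 23, 24, 35]

mean = statistics.mean(heights)      # sum of the values divided by how many there are
median = statistics.median(heights)  # middle value of the sorted data (50th percentile)
mode = statistics.mode(heights)      # most frequent value

print(f"mean={mean:.2f}, median={median}, mode={mode}")
# prints: mean=23.00, median=21, mode=20
```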


## (Second Order) Distributional Characteristics: Spread/Scale Statistics
- Interquartile Range (IQR): 75th percentile - 25th percentile
- Range: maximum - minimum
- Variance: (almost) average squared distance from the mean $$ s^2 = \frac{1}{n-1}\displaystyle\sum_{i=0}^{n-1} (x_i - \bar x)^2$$
- Standard Deviation: square root of the variance (more interpretable, since same units) $$ s = \sqrt{s^2} $$
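
The spread statistics above can be worked through step by step, as in the sketch below. The `data` list is a made-up sample, and the quartiles come from `statistics.quantiles` with its default settings; other tools may use slightly different quartile conventions, so the IQR can differ a little between libraries.

```python
import math
import statistics

# Hypothetical sample (an assumption for illustration)
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Range: maximum - minimum
data_range = max(data) - min(data)           # 9 - 2 = 7

# IQR: 75th percentile - 25th percentile (default quartile convention)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Variance: sum of squared distances from the mean, divided by n - 1
x_bar = sum(data) / n                        # 40 / 8 = 5.0
squared_devs = [(x - x_bar) ** 2 for x in data]
s2 = sum(squared_devs) / (n - 1)             # 32 / 7, about 4.57

# Standard deviation: square root of the variance (same units as the data)
s = math.sqrt(s2)                            # about 2.14

print(f"range={data_range}, IQR={iqr}, variance={s2:.2f}, sd={s:.2f}")
```

Dividing by `n - 1` rather than `n` is why the variance is described as "(almost)" the average squared distance; `statistics.variance(data)` uses the same `n - 1` divisor.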

## (Higher Order) Distributional Characteristics: Skewness + Modality
|||
|-|-|
|![](images/skew_modality.png)|![](images/skew_modality2.png)|


![](images/skew.JPG)

# Quiz (15 mins)

0. What is your name?
1. Consider the following distribution characteristics: mean, median, mode, range, IQR, modality, skewness, outliers. Which ones can be easily observed/estimated from a boxplot? What about a histogram?  <!-- Boxplot: median, IQR, range, outliers; Histogram: mode, range, modality, skewness -->
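2. What does it mean for a distribution to be thin-tailed vs. heavy-tailed? <!-- it describes how many observations/points fall within the tails of the histogram: heavy-tailed means relatively many observations far from the center, thin-tailed means relatively few -->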
3. What type of data is typically visualized using a histogram? <!-- continuous numerical data -->
4. If the distribution is left-skewed, how does the median compare to the mean?  <!-- the median is greater than the mean -->
5. What parameter is used to set the number of bins in a Plotly histogram? <!-- nbins -->
6. What does the following code do?

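```python
import plotly.figure_factory as ff

# `amazonbooks` is assumed to be a pandas DataFrame loaded earlier
group_labels = ['Height']
fig = ff.create_distplot([amazonbooks.Height.dropna().values.tolist()], group_labels)
fig.show()
```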
<!-- creates a distribution plot of the `Height` variable: by default, a histogram with a kernel density estimate curve overlaid and a rug plot underneath -->
7. Write `plotly` code that creates a single boxplot for `df` that plots the distribution of `variable`. 
<!-- fig = px.box(df, x="variable")
fig.show()
-->

# Discussion (15-20 mins)
- Review the quiz and address any open questions or concerns

# Discussion  (20 mins) 

- *Break into groups of 4*

You've learned about the different types of variables and techniques for visualizing and describing the distribution of variables. The `coffee_ratings.csv` dataset contains information about various samples of coffee and their ratings. On the next slide, there are four plots that describe variables in this dataset.

- What type of variable is displayed in the plot?
- What type of plot is used to visualize the variable?
- Which plots are appropriate for which types of variables?
- What features of the variable can you see from the plot?
- The upper two graphs plot the same relationship. Which one do you like better? Why?

![](images/coffee_plots.JPG)

# Hedging (10 mins)
Hedging is helpful whenever you can’t say something is 100% one way or another, as is often the case.
In statistics, hedging should always be used with respect to the limitations of data and the strength and
generalizability of the conclusions.

Play this video for students: [https://web.microsoftstream.com/video/22f20d20-f096-4934-bfb4-86c0caf9da85](https://web.microsoftstream.com/video/22f20d20-f096-4934-bfb4-86c0caf9da85)

# Tutorial Exercise (20-25 mins)

Pretend that you are on the phone with your friend, and you want to share some of the cool new data visualization techniques that you have been learning in STA130. Use `plotly` to construct 1-2 plots from the `coffee_ratings.csv` dataset and prepare a short paragraph on how you would describe the graph(s) to your friend (keeping in mind that they cannot see the graph). Your friend has not taken STA130, so they will not be as familiar with the statistical vocabulary as you are; make sure you explain any terms you use in plain language.
- Do not include any code in your written submission.
- Include your code in a `.ipynb` notebook along with your written submission.
    - Annotate your notebook with comments where it's helpful for understanding what you are doing.

## When describing a figure, it is important to:
- Describe the data source
- State the type of graph
- Identify what is on the x- and y-axes (if appropriate)
- Describe the distribution
- Make note of potential outliers

Make sure you include some of the following: where the data is centered (approximate values if available), spread (relative to what?), shape (symmetric, left-skewed, right-skewed), distribution tails (heavy-tailed, thin-tailed), modality (how many?, where?), outliers/extreme values, frequency. Use hedging to acknowledge the limitations of the data.

### Notes on approaching the writing prompt

- Hand in the assignment on Quercus
- Use full sentences
- Grammar is *not* the main focus of the assessment, but it is important that you communicate in a clear and professional manner (without slang or emojis) 
- Aim for 200 - 500 words
- Do not spend more than 90 minutes on the prompt (unless you really need to...)