# Session 4: Intro to Visualization

*Joachim Kahr Rasmussen*

# Recap (I/II)

*Why is it again that pandas has gained so much traction?*


Many reasons! Clear advantages over standard Python and NumPy are:
- Data representation: Allows naming columns (and rows), making it much easier to navigate in large sets of data
- Features and speed: Has lots of features for data analysis that are both *simple* and *fast* to apply
- Method chaining: We can write something fairly complicated in just a few lines
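
A minimal sketch of what method chaining with named columns can look like. The data and the column names (`city`, `year`, `sales`) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
df = pd.DataFrame({
    "city": ["Aarhus", "Aarhus", "Odense", "Odense"],
    "year": [2019, 2020, 2019, 2020],
    "sales": [10.0, 12.0, 7.0, 9.0],
})

# Method chaining: filter, derive a column, aggregate and sort in one expression
result = (
    df
    .query("year == 2020")                          # keep rows from 2020 only
    .assign(sales_k=lambda d: d["sales"] * 1000)    # add a derived column
    .groupby("city")["sales_k"]                     # split by city
    .sum()                                          # aggregate within each city
    .sort_values(ascending=False)                   # largest first
)
print(result)
```

Each step returns a new pandas object, which is what makes it possible to attach the next method directly.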

*What data types do we typically work with?*

We have covered five data types:
- Boolean data (binary true/false variables): Often used for data selection
- Numeric variables in general: Lots of built-in methods for analysis (`describe`, `cut` and much, much more)
- Strings: Data consisting of *alphanumeric* characters. Many types of operations possible with pandas.
- Categorical data: Data that can only take a limited (often strictly fixed) number of values.
- Time Series: Data that has an explicit time dimension (a *time stamp* so to say).
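
As a rough illustration, the five data types could appear together in one DataFrame as follows. All values and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one column per data type
df = pd.DataFrame({
    "is_member": [True, False, True],                          # boolean
    "spend": [12.5, 3.0, 7.25],                                # numeric
    "name": ["Ann", "Bo", "Cy"],                               # string
    "shirt": pd.Categorical(["S", "L", "S"],
                            categories=["S", "M", "L"],
                            ordered=True),                     # categorical
    "joined": pd.to_datetime(["2020-01-01", "2020-02-15",
                              "2021-03-31"]),                  # time stamps
})
print(df.dtypes)

members = df[df["is_member"]]       # boolean column used for selection
upper = df["name"].str.upper()      # string methods live under .str
years = df["joined"].dt.year        # datetime methods live under .dt
```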

# Recap (II/II)

*OK, so I have a collection of data that I want to analyze. How to get my data ready for analysis?*


If your data comes in different subsets:
- Using `merge`: Combining through one or multiple keys
- Using `concat` or `join`: Combining through the index
- Inner join? Outer join? Left join? Might create missing values.
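
A small sketch of how the join type decides which rows survive and where missing values appear. The two tables and the key name `id` are invented for the example.

```python
import pandas as pd

# Two hypothetical tables that only partly overlap on the key "id"
left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [20.0, 30.0, 40.0]})

inner = left.merge(right, on="id", how="inner")      # only ids found in both tables
outer = left.merge(right, on="id", how="outer")      # all ids; gaps become NaN
left_join = left.merge(right, on="id", how="left")   # all ids from the left table

# join combines on the index instead of a column (left join by default)
by_index = left.set_index("id").join(right.set_index("id"))

# concat stacks tables on top of each other
stacked = pd.concat([left, left], ignore_index=True)

print(outer)
```

Comparing row counts before and after a merge is a cheap way to check that the join did what you expected.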

Think about how to deal with missings or duplicates:
- Missings: Should these be dropped (`.dropna()`) or imputed (`.fillna()`)?
- Duplicates: What is a duplicate really? And should they be dropped (`.drop_duplicates()`)?
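
The choices above could be sketched as follows, using made-up data. Note how the answer to "what is a duplicate?" changes the result: identical rows versus rows that merely share an `id`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and repeated ids
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "value": [1.0, 2.0, 2.0, np.nan, 5.0],
})

dropped = df.dropna(subset=["value"])                # drop rows where value is missing
imputed = df.fillna({"value": df["value"].mean()})   # impute with the column mean

deduped = df.drop_duplicates()                       # drop fully identical rows
deduped_id = df.drop_duplicates(subset=["id"])       # keep first row per id
```

Mean imputation is only one option and is used here purely as an example.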

Think about whether your data has the right shape:
- Wide format or long format? Use `.stack()` or `.unstack()`
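
A minimal sketch of moving between the two shapes, with hypothetical numbers (one row per city, one column per year):

```python
import pandas as pd

# Wide format: one row per city, one column per year
wide = pd.DataFrame(
    {"2019": [10, 7], "2020": [12, 9]},
    index=pd.Index(["Aarhus", "Odense"], name="city"),
)
wide.columns.name = "year"

long = wide.stack()      # long format: Series indexed by (city, year)
back = long.unstack()    # and back to wide format

print(long)
```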

Create some aggregate results for your data to hold observations up against:
- Use *split-apply-combine* to create all kinds of subgroup characteristics (mean, variance, median, etc.)
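
One way split-apply-combine might look in practice, on invented data. `agg` gives one row per group, while `transform` returns a value for every original row, which makes it easy to hold observations up against their group.

```python
import pandas as pd

# Hypothetical data with two groups
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split by group, apply several statistics, combine into one table
stats = df.groupby("group")["value"].agg(["mean", "median", "var"])

# Broadcast the group mean back to each row and compare
df["group_mean"] = df.groupby("group")["value"].transform("mean")
df["deviation"] = df["value"] - df["group_mean"]

print(stats)
```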

# Questions from This Morning

I have tried to gather some questions that seemed to address more general issues:
- *4432*

Other questions?


# Overview of Session 4

Today, we will look at how to make plots in Python. In particular, we will cover:
1. Understanding Plotting (live)
    - What are you plotting?
    - Why we plot
    - Why are you plotting?
2. Plotting in Python: Packages and Grammar (live)
    - Intro to `matplotlib` and `seaborn`
    - The "Grammar of Graphics"
3. Plotting the Tips Data (video + notebook)
    - Plots for one variable (Series)
        - Numeric data
        - Categorical
    - Plots for two or more variables (DataFrame)
        - Numeric data
        - Mixed numerical and categorical data
    - Advanced exploratory plotting

# Associated Readings

Wickham (2010), sections 1-3
- Fundamentals of plotting
- "Grammar of Graphics"

PDA, chapter 9:
- Basic syntax and fundamental concepts with matplotlib
- Combining matplotlib with pandas and using seaborn package

Moffitt (2017):
- Strengths and weaknesses of matplotlib
- Intro to `figure` and `axes`
- Using functions in order to improve formatting

# Understanding Plotting

*What are we plotting?*

In the last sessions, we worked with generating, cleaning and performing operations on data using pandas.
- When we plot, we essentially want to make a *visual* and *digestible* representation of these data.

*What are some guidelines on making plots in **general**?*

Be aware of *what* you plot
- numerical vs. non-numeric (categorical)
- raw data vs. model results 

## Why We Plot

An English adage
> A picture is worth a thousand words

Is that always the case?


## What Values Do A, B, C and D Have?
<center><img src='https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/excel1.png'></center>

## The Shocking Answer
<center><img src='https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/excel2.png'></center>


## Why Are You Plotting?
*Who's the audience?*

You / your team:

- **Exploratory** plots: Figures for understanding data
    - Quick to produce $\sim$ minimal polishing
    - Interesting features may be implied by the producer
    - Be careful showing these out of context

Others:

- **Explanatory** plots: Figures to convey a message
    - Polished figures
    - Direct attention to interesting feature in the data
    - Minimize risk of misunderstanding

## How Should You Plot?
*What are some tips for making **explanatory** plots in a report?*  ***<font color="red">(Exam relevant!)</font>***

- Clear narratives - should convey key point(s)
  - If you want to show differences between groups in the data, make sure it is easy to distinguish them.

- Self-explanatory
  - Contains axis labels, a title, and footnotes with relevant information.

- Nice appearance
  - Choose the right plot type.
  - Make sure font type, font size, colors and line width are readable and consistent.

- Keep it simple.
  - Anything unnecessary should be removed, see [this post](https://www.darkhorseanalytics.com/blog/data-looks-better-naked/).

*Some practical advice on making **explanatory** plots?*

1. Try out a few plot types, using exploratory analysis - use what works.
1. Apply the *layered grammar of graphics*.
    - Start with an empty canvas
    - Fill in the necessary things (axis, ticks, bars/lines, labels)

<img src="https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png" alt="drawing" width="600"/>
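
The layered approach could be sketched in `matplotlib` like this: start from an empty figure and add one element at a time. The numbers and labels are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up numbers for illustration
years = [2018, 2019, 2020, 2021]
sales = [8, 10, 12, 11]

# 1. Empty canvas: a figure containing a single axes
fig, ax = plt.subplots(figsize=(6, 4))

# 2. Data layer: a line with markers
ax.plot(years, sales, marker="o", label="Sales")

# 3. Axis labels and title
ax.set_xlabel("Year")
ax.set_ylabel("Sales (units)")
ax.set_title("Sales over time")

# 4. Ticks: one tick per observed year
ax.set_xticks(years)

# 5. Legend
ax.legend()
```

Every call adds or adjusts one layer, so you can stop as soon as the figure conveys its message.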

# Plotting in Python: Packages and Grammar

## How Are You Plotting?
There are two overall approaches to plotting:

- make a fast, decent figure
    - iteratively adjust if necessary
    - start out in `seaborn`, continue in `matplotlib`


- from empty canvas to figure
    - iteratively add material and layers
    - performed in `matplotlib`
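
A sketch of the first approach: let `seaborn` produce a decent figure in one call, then adjust the returned `matplotlib` axes. The data below is invented; the column names only mimic the tips data used later in the session.

```python
import pandas as pd
import seaborn as sns

# Invented data; column names mimic the tips data
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
    "tip": [1.5, 3.0, 4.5, 2.0, 4.0, 5.0],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
})

# Fast, decent figure: one seaborn call, colored by subgroup
ax = sns.scatterplot(data=df, x="total_bill", y="tip", hue="time")

# Iterative adjustment: the returned object is a matplotlib axes
ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")
ax.set_title("Tips by total bill")
```

The second approach starts from `plt.subplots()` and adds each layer by hand.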
   

## Packages for Python Plotting (I/II)
*What is the fundamental tool for making plots in Python?*

**Matplotlib** is the fundamental plotting module
- Can make almost any 2D plot.
- Can build publication-ready figures.
- Caveat: 
    - requires time-consuming customization;
    - requires practice.

```python
import matplotlib.pyplot as plt
# allow printing in notebook
%matplotlib inline
```

## Packages for Python Plotting (II/II)
*What are good tools for fast, exploratory plots?*

`seaborn` has built-in capabilities to make plots for
- Analyzing data, e.g. splitting by subsets
- Interpolating data to smooth out noise

`pandas` can easily convert Series and DataFrames to plots
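
For instance, a Series or DataFrame can be turned into a plot through its `.plot` method, which uses `matplotlib` behind the scenes. The data here is made up, with column names borrowed from the tips data.

```python
import pandas as pd

# Made-up data; column names borrowed from the tips data
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 15.0],
    "tip": [1.5, 3.0, 4.5, 2.0],
})

# One variable (Series): histogram
ax_hist = df["tip"].plot(kind="hist", bins=4, title="Distribution of tips")

# Two variables (DataFrame): scatter plot
ax_scatter = df.plot(kind="scatter", x="total_bill", y="tip")
```

Both calls return `matplotlib` axes, so the figures can be polished further with the usual `ax.set_...` methods.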