In [None]:
import pandas as pd

import altair as alt
alt.data_transformers.disable_max_rows()

import matplotlib.pyplot as plt


from vega_datasets import data
mtcars = data.cars()

## Set Altair default size

def theme_fm(*args, **kwargs):
    return {'height': 220,
            'width' : 220,
            'config': {'style': {'circle': {'size': 400},
                                'point': {'size': 30},
                                'square': {'size': 400},
                                },
                       'legend': {'symbolSize': 20, 'titleFontSize': 20, 'labelFontSize': 20}, 
                       'axis': {'titleFontSize': 20, 'labelFontSize': 20}},
            }

alt.themes.register('theme_fm', theme_fm)
alt.themes.enable('theme_fm')

# Exploratory Data Analysis with Altair

<img src="imgs/viz.jpg" align="center" width=100%>

<p style="text-align:left;">
    Photo by <a href="https://www.pexels.com/@rodnae-prod">RODNAE Productions</a> from <a href="https://www.pexels.com/photo/magnifying-glass-on-white-paper-7948038/">Pexels</a>
    <span style="float:right;">
        March 25, 2022 <br>
        Firas Moosvi
    </span>
</p>

In [None]:
## We'll be using the mtcars dataset for the first part of this lecture

mtcars.head()

## Starting with the punchline!

By the end of class today, you will learn how to make this chart using the `mtcars` dataset:

In [None]:
base = alt.Chart(mtcars).mark_point().encode(
    alt.X('Horsepower'),
    alt.Y('Miles_per_Gallon'),
    alt.Color('Origin'),
    alt.Column('Origin')
) 

base.interactive()

### In matplotlib:

If you're familiar with `matplotlib`, this should illustrate to you **how** Altair is different - not better or worse, just *differently sane* (h/t [Greg Wilson](https://tidynomicon.tech)).

In [None]:
colour_map = dict(zip(mtcars['Origin'].unique(), ['red','lightblue','orange']))
n_panels = len(colour_map)

fig, ax = plt.subplots(1, n_panels, figsize=(n_panels * 6, 5),
                       sharex = True, sharey = True)

for i, (country,group) in enumerate(mtcars.groupby('Origin')):
    ax[i].scatter(group['Horsepower'],
                  group['Miles_per_Gallon'],
                  label = country,
                  color = colour_map[country])
    ax[i].legend(title='Origin')
    ax[i].grid()
    ax[i].set_xlabel('Horsepower')
    ax[i].set_ylabel('Miles_per_Gallon')

### 1. Tabular Data

Data in Altair is built around the [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

The fundamental object in Altair is the ``Chart``. It takes the dataframe as a single argument:

```chart = alt.Chart(DataFrame)```

Let's create a simple `DataFrame` to visualize, with categorical data in the `Letters` column and numerical data in the `Numbers` column:

In [None]:
df = pd.DataFrame({'Letters': list('CCCDDDEEE'),
                   'Numbers': [2, 7, 4, 1, 2, 6, 8, 4, 7]})
df.T

In [None]:
plot = alt.Chart(df)

#plot 

### 2. Chart Marks

Next we can decide what sort of *mark* we would like to use to represent our data.

Here are some of the more commonly used `mark_*()` methods supported in Altair and Vega-Lite; for more detail see [Marks](https://altair-viz.github.io/user_guide/marks.html) in the Altair documentation:

|Mark|
|------|
|`mark_area()`|
|`mark_bar()`|
|`mark_circle()`, `mark_point()`, `mark_square()`|
|`mark_rect()`|
|`mark_line()`|
|`mark_rule()`|
|`mark_text()`|
|`mark_image()`|

Let's add a `mark_point()` to our plot:

In [None]:
plot = alt.Chart(df).mark_point()

plot

😒 

We have a plot now, but clearly we're being pranked: all the data points collapsed to one location! Why?

<img src="imgs/Visualization-Grammar4.jpeg" align="center">
    
Slide used with permission from [Eitan Lees](https://eitanlees.github.io/altair-stack/)

### 3. Chart Encoding

Let's add an encoding so the data is mapped to the x axis:

In [None]:
plot = alt.Chart(df).mark_point().encode(alt.X('Numbers'))

plot

# We still haven't encoded any of the data to the Y-axis!

#### Activity : You Try!

Encode the `Letters` column at the `y` position to make the visualization more useful.

In [None]:
plot = alt.Chart(df).mark_point().encode(alt.X('Numbers'),
                                         alt.Y('Letters'),
                                         )
# first chart

#### Activity : You Try!

Change the `mark` from `mark_point()` to `mark_circle()` or `mark_square()`.

In [1]:
plot = plot ## YOUR SOLUTION HERE



#### Activity : You Try!

What do you think will happen when you change the `mark_circle()` to a `mark_bar()`?

In [None]:
plot ## YOUR SOLUTION HERE

### 4. Transforms

Though Altair supports a few built-in data transformations and aggregations, in general I **do not suggest** you use them.

Some reasons why:

- Not all functions are available
- You already know how to do complex wrangling using pandas
- No opportunity to write tests if wrangling is done within plots
- Single point of failure
- Syntax is non-trivial and not very "pythonic"
- Code is less readable and harder to document

### 5. Scale

The scale parameter controls axis limits and axis types (`log`, `semi-log`, etc.).

For a complete description of the available options, see the [Scales and Guides](https://altair-viz.github.io/user_guide/scale_resolve.html) section of the documentation.

In [None]:
plot = alt.Chart(df).mark_point().encode(
            alt.X('Numbers'),
            alt.Y('Letters'))

plot.encode(alt.X('Numbers', 
                  scale = alt.Scale(type='log')))

<img src="imgs/Visualization-Grammar7.jpeg" align="center">
    
Slide used with permission from [Eitan Lees](https://eitanlees.github.io/altair-stack/)

### 6. Guide

The guides component deals with legends and annotations that "guide" our interpretation of the data. In most cases you will not need to work with this component very much as the defaults are pretty good!

For a complete description of the available options, see the [Scales and Guides](https://altair-viz.github.io/user_guide/scale_resolve.html) section of the documentation.

## Interactive Altair (5 mins)

In [None]:
# The same chart as the punchline, but faceted into rows with `alt.Row`
# instead of columns with `alt.Column`

first_chart = alt.Chart(mtcars).mark_point().encode(
    alt.X('Horsepower'),
    alt.Y('Miles_per_Gallon'),
    alt.Color('Origin'),
    alt.Row('Origin')
)
first_chart.interactive()

### One more thing...

In [None]:
chart = alt.Chart(mtcars).mark_point().encode(
            alt.Y('Horsepower'),
            alt.X('Miles_per_Gallon')).interactive()

# Combine multiple charts together

chart

## Motivating the need for EDA (20 mins)

- Let's put our new skills to work with an example dataset!

### Scenario

You have been given a dataset and tasked with trying to solve a problem.
In WW2, expensive fighter planes were going down quite frequently due to bullet fire.
The military decided to conduct an analysis and surveyed all the surviving planes in an effort to catalogue which regions of the plane should be reinforced.

With limited resources, the military could only reinforce a maximum of two zones.

**Your task is to look at the bullet data for the planes and help determine which areas of the plane should be reinforced.**

You're given a schematic of the plane. The workers added a grid to the schematic, divided it into regions A, B, C, D, and E, and recorded a value of 1 wherever there was a bullet hole across all the planes that returned.

Areas without bullet holes are marked as 0.
They gave you a csv file with this information called [`bullet_data.csv`](https://github.com/firasm/bits/raw/master/bullet_data.csv).

Yes, these WW2 workers are very sophisticated and had access to a fancy computer!

<img src="imgs/plane.png" width=50% align="center">

`bullet_data.csv` is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv

In [None]:
df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()

In [None]:
# Use our standard tool first:

df.describe().T

`describe()` didn't quite organize the data the way we wanted, so let's try to figure out some more info manually.

In [None]:
# Unique zone labels, sorted alphabetically
print(f"The zones are: {sorted(set(df['zone']))}\n")

# Names of all columns in the dataset
print(f"Columns are: {list(df.columns)}\n")

# Distinct values that appear in the 'bullet' column
print(f"Values for 'bullet' column are: {sorted(df['bullet'].unique())}\n")

Let's wrangle the data a bit to try and see what's going on:

In [None]:
# First, only consider the bullet 'hits':

## YOUR SOLUTION HERE

#### Activity: Produce a `hits_df` that groups by Zone and reports the bullet hits in each zone

In [3]:
## YOUR SOLUTION HERE

#### Activity: Visualize the `hits_df` as a bar graph

In [4]:
## YOUR SOLUTION HERE

#### Activity: Visualize the 2D hits on the plane

In [None]:
## YOUR SOLUTION HERE

### Debrief

- Look at your data!
- Talk to someone about your data!
- Look at your data another way!
- Think about your data and what it means!
- Exploratory Data Analysis is **essential**!

## Activity: You Try (homework)

- Task: Select a plot from the ["Interactive Charts" section of the Altair gallery](https://altair-viz.github.io/gallery/index.html#interactive-charts) and reproduce the plot in a Jupyter Notebook.

<img src="imgs/altair_gallery.png" width=60% align="center">

We will review these examples at the start of next class!