# Visualization Workshop: Day 1, Introducing Vega-Altair

## Learning Outcomes


1. **Chart Selection**: You will be able to choose and create suitable chart types (e.g., line, bar, scatter) to explore different data relationships.

2. **Encoding and Aggregation**: You will be able to apply encodings and aggregations to represent key data features effectively.

3. **Pattern Identification**: You will be able to recognize and highlight patterns and distributions in data using Altair visualizations.


## Part 1: Setting Up Altair

In [None]:
# Install Altair if not already installed

!pip install altair vega_datasets

In [None]:
# import the libraries that we will be using
import altair as alt
import pandas as pd



In [None]:
from vega_datasets import data

# short Altair program to make sure you have the environment configured

alt.Chart(data.cars()).mark_tick().encode(
    x='Horsepower',
    y='Cylinders:O'
)

**CONGRATULATIONS**, you've created your first Altair chart!!!

## Part 2: Scattering Cars

This task introduces Altair's foundational concepts, including encodings and marks, and progressively adds complexity by using multiple channels.

### Cars Dataset

In [None]:
# use the Vega Altair Dataset cars

cars = data.cars()


#### Understanding the dataset

How big is the dataset?

What are the attributes in the dataset?

What is the data type of each attribute?

In [None]:
# Preview the dataset: first rows, (rows, columns), and column types
print(cars.head())
print(cars.shape)
cars.info()  # info() prints its own report and returns None


### First Altair Chart
 - Attach the dataset to the chart object
 - Select the `circle` mark
 - Specify what attribute will be encoded on the `x` channel (i.e., `Horsepower`)
    

#### Chart Object

We start by creating the chart object with `alt.Chart(cars)`
and calling `mark_circle()` to draw each car as a circle.

#### `x` Channel
Let's encode `Horsepower` on the `x` channel

In [None]:
...


#### Learn by Doing (LBD) Task
In the cells below, modify the code above 
and try out different marks (e.g., 'tick', 'line', 'point', 'bar', 'rect')



In [None]:
...

#### `y` Channel
Let's encode data on the `y` channel. 
We will encode the `Miles_per_Gallon` attribute. 


In [None]:
...

#### Reflect

What insights can be inferred from the scatter plot?


#### LBD Task
In the cells below, change which attributes are encoded on the `x` and `y` channels.

 - What questions can you answer with the scatter plots you created?
 - What insights can you observe?

In [None]:
...

#### `color` Channel
Encode the geographic region where the car was manufactured on the color channel. 


In [None]:
...

#### Pause

Color is NOT just aesthetic; there is a science behind which colors are used.
More on this tomorrow ooooo

#### `size` Channel
Encode `Horsepower` on the `size` channel


In [None]:
...

#### LBD Task
Try encoding the remaining **unused** attributes on the `size` channel.

Which attributes do NOT make sense to use for the `size` channel?


In [None]:
...

## Part 3: Stocks Through Time

#### Stocks Dataset


In [None]:
stocks = data.stocks()
stocks.head()

How large is the dataset?

How many companies are featured in the dataset?

In [None]:
stocks.info()  # info() prints its own report and returns None
print(stocks['symbol'].unique())

#### Line Chart
Let's filter the data to only include Apple. 


In [None]:
apple_stock = stocks[stocks['symbol'] == 'AAPL']

Using `mark_line` let's create a line chart
where we encode
 - date on the `x` channel
 - price on the `y` channel

In [None]:
...

#### Multi-Line Chart
- visualize the stock price for all companies

In [None]:
...

##### Filter out Google

Interesting: we don't have data for Google before 2004.
Let's remove Google and visualize the rest.


In [None]:
stock_wo_goog = stocks[stocks['symbol'].isin(['AAPL', 'IBM', 'MSFT', 'AMZN'])]

Use the filtered dataset to create a multi-line chart.

In [None]:
...

#### Area Chart 
Let's create an area chart for Apple stock.
Encode `date` on the `x` channel and `price` on the `y` channel.


In [None]:
...

Let's create an area chart for all the stocks (excluding Google).
Encode `date` on the `x` channel, `price` on the `y` channel, and `symbol` on the `color` channel.

In [None]:
...

In [None]:
...

#### Normalized Area Chart
 - Stacking: By setting `stack="normalize"`, Altair rescales the stacked areas so they always add up to 100%. Each segment then shows a stock's share of the combined price at that date rather than its absolute price.


## Part 4: Moving Bars
Using bar charts to make sense of the movie dataset. 


In [None]:
movies = data.movies()

#### Understanding the dataset

How big is the dataset?

What are the attributes in the dataset?

What is the data type of each attribute?

In [None]:
...

#### First Bar Chart

 - Attach the movies dataset to the Chart object
 - Specify `mark_bar` as the chart's mark
 - Encode the movie's `Major_Genre` on the `x` channel
 - Encode the movie's `IMDB_Rating` on the `y` channel

In [None]:
...

#### SLOW DOWN
 - What is this?
 - Why is the rating so high?
 - What do the values on the y-axis actually mean?

We will add a **simple** interaction to help us understand what this bar chart is actually representing.

Add the `tooltip` channel
and assign it a list of attributes
    
    tooltip = ['Major_Genre', 'Release_Date', 'IMDB_Rating']

In [None]:
...

#### Reflect

Now that you have the tooltip, what does the y-axis **really** represent?


### Aggregations
Aggregations are a shortcut that allows us to summarize or transform the data. They can be used to count, sum, average, or perform other calculations on data fields within the chart.

You may already know how to do this in pandas.
Vega-Altair has a shorthand that supports your vizzing process: you name the aggregation directly in the encoding.


#### Common Aggregations

##### count
Counts the number of records in each group.
Example: Counting occurrences in a categorical field, such as the number of movies by genre.

In [None]:
alt.Chart(movies).mark_bar().encode(
    x='count()',
    y='Major_Genre:N'
)

##### average
Computes the average of a quantitative field.
Example: Averaging `IMDB_Rating` for movies by genre.

In [None]:
alt.Chart(movies).mark_bar().encode(
    x='Major_Genre:N',
    y='average(IMDB_Rating):Q'
)

#### Other Aggregations
 - sum: Calculates the total of a quantitative field.
 - median: Finds the median value of a quantitative field.
 - min: Identifies the minimum value within a quantitative field.
 - max: Identifies the maximum value within a quantitative field.

These aggregation functions allow Altair users to quickly summarize data within charts, making it easier to interpret trends, distributions, and patterns within datasets.



#### LBD: Using Aggregations

Create three bar charts using these aggregations:

 - sum on one attribute
 - max on another attribute
 - median on a third attribute

In [None]:
...

### Bar Chart Variations 

#### Stacked Bar Chart
Let's create a stacked bar chart showing the count of movies in each genre, with each bar color-coded by MPAA_Rating. This will help visualize the distribution of ratings within each genre.

 - encode the number of records (i.e., `count()`) on the `x` channel
 - encode `Major_Genre` on the `y` channel
 - encode `MPAA_Rating` on the `color` channel
 

In [None]:
...

What do you learn from this visualization?


#### Normalized Bar Chart
Let's create a normalized stacked bar chart showing the percentage distribution of MPAA_Rating within each Major_Genre. 

This will allow you to see the relative rating distribution across different genres.


WHATTTTTT

 - Aggregation: We use sum(count) to calculate the total count of movies within each combination of Major_Genre and MPAA_Rating. The :Q indicates this is a quantitative measure.

 - Stacking: By setting stack="normalize", Altair adjusts each bar to represent percentages rather than absolute counts. This means that the total height of each bar becomes 100%, with segments proportionally representing each MPAA_Rating.


In [None]:
...


WOW
If you are still sitting, I'm so very very very proud of you oooo.

We have barely scratched the surface. 
But now that you have the fundamentals of how to create a visualization, tomorrow we can focus on making visualizations interactive and on creating dashboards.


Breathe. 

You are still standing or sitting :)

Dr. K