# Intro to Altair
Altair is a statistical visualization library for Python. It builds visualizations from a *grammar* of **marks** and **channels**.

More information can be found at [Altair's documentation](https://altair-viz.github.io/getting_started/overview.html)! Parts of this assignment are taken from its tutorial.

## Installation
Follow these instructions to install Altair on your computer.

In your terminal, run `pip install altair vega-datasets`  
This will automatically install some dependencies as well. You'll be ready to start once this command finishes!

If you are using a Windows computer, you might have to run this command instead: `py -m pip install altair vega-datasets`
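
To confirm that the installation worked, a quick check along these lines may help. This is only a sketch: `is_installed` is a hypothetical helper written for this example, not part of Altair. Note that the package installed as `vega-datasets` is imported as `vega_datasets`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report whether each package from the pip command can be found
for name in ("altair", "vega_datasets"):
    status = "installed" if is_installed(name) else "NOT found"
    print(f"{name}: {status}")
```

If either line reports "NOT found", try rerunning the install command in the same environment you use to launch your notebook.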

## Background
We'll be working with the *US Employment* dataset. Below is the description:

> In the mid 2000s the global economy was hit by a crippling recession. One result: Massive job losses across the United States. The downturn in employment, and the slow recovery in hiring that followed, was tracked each month by the [Current Employment Statistics](https://www.bls.gov/ces/) program at the U.S. Bureau of Labor Statistics.

> This file contains the monthly employment total in a variety of job categories from January 2006 through December 2015. The numbers are seasonally adjusted and reported in thousands. The data were downloaded on Nov. 11, 2018, and reformatted for use in this library.

> Totals are included for the [22 "supersectors"](https://download.bls.gov/pub/time.series/ce/ce.supersector) tracked by the BLS. The "nonfarm" total is the category typically used by economists and journalists as a stand-in for the country's employment total.

> A calculated "nonfarm_change" column has been appended with the month-to-month change in that supersector's employment. It is useful for illustrating how to make bar charts that report both negative and positive values.

This mini-lab will give you experience analyzing this dataset.

### Part 1: Data Exploration
The first few exercises review `pandas` and show common techniques for getting to know a new dataset.


In [2]:
import altair as alt # import Altair
from vega_datasets import data # import starter datasets

emp_df = data.us_employment() # US Employment dataframe


`pandas` has some handy commands to help you explore a dataframe. If there's something you haven't encountered before, check out the documentation or try running it to see what happens!

  * `dataframe.head()`
  * `dataframe.sample(10)`
  * `dataframe.shape`
  * `dataframe.columns`
  * `dataframe.describe()`
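
As a rough illustration of what these commands return, here is a sketch on a small made-up dataframe. The names `toy_df`, `widgets`, and `gadgets` are invented for this example and are not part of the employment dataset.

```python
import pandas as pd

# Hypothetical toy data, not the employment dataset
toy_df = pd.DataFrame({
    "month": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "widgets": [10, 12, 9],
    "gadgets": [3, 5, 4],
})

print(toy_df.head(2))        # first 2 rows
print(toy_df.sample(2))      # 2 randomly chosen rows
print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names
print(toy_df.describe())     # summary statistics for the numeric columns
```

Notice that `shape` and `columns` are attributes, so they take no parentheses, while the others are method calls.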


In [3]:
# Explore the functions listed above using emp_df in this code block


### Question 1: How many rows and columns are in this dataset?
*Show your work in the code block below.*

In [4]:
# Replace with your answer to #1

### Question 2: How many employees worked in the Private sector in June 2009?

In [5]:
# Replace with your answer to #2

### Question 3: Which month experienced the greatest decrease in employment? Which month experienced the greatest increase?
*Hint: `iloc`, `idxmin`, and `idxmax` will be handy here.*
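
If these tools are new to you, the sketch below shows how they fit together on a small made-up dataframe. The `scores` data and its column names are invented for illustration and are unrelated to the employment dataset.

```python
import pandas as pd

# Hypothetical toy data, not the employment dataset
scores = pd.DataFrame({
    "player": ["Ana", "Ben", "Cy", "Di"],
    "delta": [4, -7, 12, -2],
})

lowest_idx = scores["delta"].idxmin()   # index label of the smallest value
highest_idx = scores["delta"].idxmax()  # index label of the largest value

# With the default 0, 1, 2, ... index, the label is also the row position,
# so iloc can retrieve the full row
print(scores.iloc[lowest_idx])
print(scores.iloc[highest_idx])
```

If a dataframe has a custom index, `loc` (label-based) is the safer way to look up the row that `idxmin` or `idxmax` points to.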

In [6]:
# Replace with your answer to #3

## Data Visualization with Altair
As mentioned before, Altair runs on the grammar of marks and channels. In other words, you'll have to specify the marks and channels you want to use in your resulting visualization.

### The Chart object
The fundamental object in Altair is the `Chart`, which takes a dataframe as a single argument.

`chart = alt.Chart(data)`

While the Chart object is defined, we haven't told the chart to *do* anything with the data yet.

### Marks
We first have to specify which marks we'd like to use to represent the data. Let's say we wanted to make a bar plot, in which case we would use *bars* as our marks. We do so as follows:

`chart = alt.Chart(data).mark_bar()`

More examples of marks can be found at [Altair's mark documentation](https://altair-viz.github.io/user_guide/marks.html).

### Channels
Once we've decided on our marks, we specify which channels to use via the `encode` method. Each channel maps a column to a visual property, and the shorthand code after the colon tells Altair the column's data type (see the table below). If we correctly specify the *type* of data we pass in, Altair handles most of the rest. Examine the cell after the table to see an example of how to encode your visualization.

| Data Type | Shorthand Code | Description |
|---|:---:|---|
| quantitative | Q | a continuous real-valued quantity |
| ordinal | O | a discrete ordered quantity |
| nominal | N | a discrete unordered category |
| temporal | T | a time or date value |

More customization options are offered -- check out [Altair's `encode` documentation](https://altair-viz.github.io/user_guide/encoding.html).


In [7]:
alt.Chart(emp_df).mark_bar().encode(
    x='month:T',
    y='nonfarm_change:Q'
)

### Conditions
You can adjust the color (or other channels) of the chart depending on a value. Examine the cell below for an example.

In [8]:
# Color bars green if positive and red if negative
alt.Chart(emp_df).mark_bar().encode(
    x='month:T',
    y='nonfarm_change:Q',
    color=alt.condition(
        alt.datum["nonfarm_change"] > 0,    # Specify the condition
        alt.value("green"),                 # Condition true
        alt.value("red")                    # Condition false
    )
)

### Properties
You can also change the chart's properties and labels. Look through the documentation for more examples!

In [9]:
# Same colored bar chart, now with axis titles, a chart title, and a custom size
alt.Chart(emp_df).mark_bar().encode(
    x=alt.X("month:T", title="Month"),
    y=alt.Y("nonfarm_change:Q", title="Change in Employment (in 1000s)"),
    color=alt.condition(
        alt.datum["nonfarm_change"] > 0,    # Specify the condition
        alt.value("green"),                 # Condition true
        alt.value("red")                    # Condition false
    )
).properties(title="Change in Employment", width=700, height=200)

### Long-form vs Wide-form
Sometimes you might need your data in a different format. Altair works best with long-form data, whereas our employment dataset is in wide-form. More information about the distinction can be found [here](https://altair-viz.github.io/user_guide/data.html?highlight=wide#long-form-vs-wide-form-data), but I convert it to long-form for you in the code cell below. You can use this as an example if you ever need to convert your data in the future.

In [10]:
# Convert emp_df from wide-form to long-form
emp_lf = emp_df.melt("month", var_name="sector", value_name="employment")

# Create a bar chart of average employment in each sector, excluding the nonfarm totals
_data = emp_lf[~emp_lf['sector'].isin(["nonfarm", "nonfarm_change"])] # Drop rows for nonfarm and nonfarm_change
alt.Chart(_data).mark_bar().encode(
    x=alt.X("mean(employment):Q", title="Average Employment (in 1000s)"),
    y=alt.Y("sector:N", sort='x', title="Sector"),
).properties(width=600, height=250)
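
To see what `melt` does to the shape of the data, here is a sketch on a small made-up wide-form dataframe. The names `wide_df`, `product`, and `units` are invented for this example.

```python
import pandas as pd

# Hypothetical wide-form data: one column per product
wide_df = pd.DataFrame({
    "month": ["2020-01-01", "2020-02-01"],
    "widgets": [10, 12],
    "gadgets": [3, 5],
})

# Keep "month" as the identifier and stack every other column into rows
long_df = wide_df.melt("month", var_name="product", value_name="units")

print(wide_df.shape)  # (2, 3): 2 months, with month + 2 product columns
print(long_df.shape)  # (4, 3): one row per month/product pair
print(long_df)
```

In long-form, the former column names become values in a single column, which is what lets an encoding like `y="sector:N"` split the data by category.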

### Another Example
Here's another example of a visualization you could do with this dataset. I've included plenty of comments for you to reason through the code (alongside searching documentation). You don't *need* to know how to recreate this, but it'll be a good example if this is something you're interested in!

In [15]:
# Create an additional column holding the month-to-month difference in employment
heatmap_df = _data.copy(deep=True)
heatmap_df['diff_employment'] = _data.groupby('sector')['employment'].diff().fillna(0) # Difference from the previous month, within each sector
heatmap_df = heatmap_df[heatmap_df['month'].str.contains("2008|2009")] # Keep only months in 2008 or 2009

# Create a heatmap of employment differences in each sector during 2008 and 2009
alt.Chart(heatmap_df).mark_rect().encode(
    x="month:O",
    y="sector:N",
    color=alt.Color("diff_employment:Q", scale=alt.Scale(scheme="blues"), title="Employment Delta")
).properties(title="Change in Employment During 2008-2009", height=350)

# Color scale information can be found via Altair/Vega documentation
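
The `groupby(...).diff()` step is the least obvious part of the cell above, so here is a sketch of the same pattern on a small made-up long-form dataframe. The names `toy_lf`, `product`, and `units` are invented for this example.

```python
import pandas as pd

# Hypothetical long-form data with two groups
toy_lf = pd.DataFrame({
    "month": ["2020-01-01", "2020-02-01", "2020-03-01"] * 2,
    "product": ["widgets"] * 3 + ["gadgets"] * 3,
    "units": [10, 12, 9, 3, 5, 4],
})

# diff() runs separately inside each group, so the first row of every
# group has no previous value (NaN), which fillna(0) replaces with 0
toy_lf["diff_units"] = toy_lf.groupby("product")["units"].diff().fillna(0)

# Keep only the rows whose month string contains "2020-02" or "2020-03"
later_months = toy_lf[toy_lf["month"].str.contains("2020-02|2020-03")]
print(later_months)
```

Without the `groupby`, the first `gadgets` row would be compared against the last `widgets` row, which would not be a meaningful difference.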

## Your Turn to Practice!
You'll find [Altair's example gallery](https://altair-viz.github.io/gallery/index.html) very useful! You should fill in the cell blocks so that they create the described visualizations. Use either `emp_df` or `emp_lf`.

In [12]:
# Change the employment-change bar graph to use circles instead


In [13]:
# Create a histogram for employment in "education and health services" sector, with a bin-width of 250


# Critical Thinking Questions
# We can create this histogram since we have many values for a single variable, but does it make sense?
# How would we interpret this histogram? Does the shape of the histogram tell us anything useful?

In [14]:
# Create a line graph for employment in "service_providing" sector over time
