# Storytelling with Data! in Altair

by Maisa de Oliveira Fraiz

## Introduction

This project aims to replicate selected examples from Cole Nussbaumer Knaflic's book, "Storytelling with Data - Let's Practice!", using the Python library `Altair`. The primary objective is to document the reasoning behind the modifications proposed by the author, while also highlighting the challenges that arise when moving from the book's Excel-based approach to a programmatic one.

`Altair` was selected for this project because of its declarative syntax, built-in interactivity, grammar-of-graphics foundation, and compatibility with `Streamlit` and other web tools, all within the user-friendly Python environment. Anticipated challenges include Altair's smaller documentation base and developer community relative to more established libraries such as `Matplotlib`, `Seaborn`, or `Plotly`, and the difficulty of translating tasks that appear straightforward in Excel.

In addition to replicating the graphs from the book, the objective is to extend the functionality by creating interactive versions, fully leveraging Altair's capabilities.

## Imports

In [14]:
import pandas as pd
import numpy as np
import altair as alt

## Chapter 2 - Choose an effective visual

*"When I have some data I need to show, how do I do that in an effective way?"* - Cole Nussbaumer

### Exercise 2.5 - how would you show this data?

The data for this exercise can be found here: https://www.storytellingwithdata.com/letspractice/downloads

In [15]:
# Skip the empty rows and columns left by the Excel formatting (they would load as NaN)
table = pd.read_excel(r"..\..\Data\2.5 EXERCISE.xlsx", usecols = [1, 2], header = 5)
table

| | Year | Attrition Rate |
|---|---|---|
| 0 | 2019 | 0.091 |
| 1 | 2018 | 0.082 |
| 2 | 2017 | 0.045 |
| 3 | 2016 | 0.123 |
| 4 | 2015 | 0.056 |
| 5 | 2014 | 0.151 |
| 6 | 2013 | 0.07 |
| 7 | 2012 | 0.01 |
| 8 | 2011 | 0.02 |
| 9 | 2010 | 0.097 |


First, we will drop the "AVG" (average) row at index 10, as it will not be a data point in our graphs.

In [16]:
table.drop(10, inplace = True)

#### Dot plot

On the first attempt at a scatter plot, Altair infers the wrong data type for the "Year" column. This can be fixed by specifying the correct data type explicitly (`:O`, for ordinal).

In [17]:
# Without data type

alt.Chart(table).mark_point(filled = True).encode(
    x = alt.X('Year'),
    y = alt.Y('Attrition Rate')
    )

In [18]:
# With data type set to temporal

alt.Chart(table).mark_point(filled = True).encode(
    x = alt.X('Year:T'),
    y = alt.Y('Attrition Rate')
    )

In [19]:
# With data type set to ordinal

alt.Chart(table).mark_point(filled = True).encode(
    x = alt.X('Year:O'),
    y = alt.Y('Attrition Rate')
    )
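
A likely explanation for the temporal attempt, sketched here in plain Python rather than Altair: when a temporal field holds plain numbers, Vega-Lite (the grammar Altair compiles to) reads them as Unix timestamps in milliseconds. Under that reading the integer 2019 lands about two seconds after midnight on 1 January 1970 instead of in the year 2019. The helper names below are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_epoch_milliseconds(value: int) -> datetime:
    """Read an integer as a Unix timestamp in milliseconds (hypothetical helper)."""
    return EPOCH + timedelta(milliseconds=value)


def as_calendar_year(value: int) -> datetime:
    """Read an integer as a calendar year (hypothetical helper)."""
    return datetime(value, 1, 1, tzinfo=timezone.utc)


years = [2010, 2019]  # first and last year in the table

# How the years look when treated as millisecond timestamps
misread = [as_epoch_milliseconds(y) for y in years]

# How the years were meant to be read
intended = [as_calendar_year(y) for y in years]

print(misread)
print(intended)
```

Under this reading, all ten years collapse into a window of 9 milliseconds, which is why the ordinal type (or a genuine date column) is the safer choice for year labels.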

With the data type settled, we create a dot plot to represent the data over time, adding an average line to facilitate comparison.

In [20]:
base = alt.Chart(table, title = alt.Title(
                 "Attrition rate over time",
                 fontSize = 18,
                 fontWeight = 'normal',
                 anchor = 'start',
                 offset = 10))

dots = base.mark_point(filled = True, size = 50, color = '#2c549d').encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              ),
    opacity = alt.value(1)
    ) 

rule = base.mark_rule(color = "#2c549d", strokeDash = [3,3]).encode(
    x = alt.value(0),
    x2 = alt.value(315),
    y = 'mean(Attrition Rate)'
)

label = alt.Chart({"values": 
                    [{"text":  ['AVERAGE: 7.5%']}]
                    }
                    ).mark_text(size = 10, 
                                align = "left", 
                                dx = -170, dy = 0, 
                                color = '#2c549d',
                                fontWeight = 'bold'
                                ).encode(text = "text:N")

final = dots + rule + label
final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)

![Alt text](\Images\2_5b.png)
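
The label text `AVERAGE: 7.5%` is typed by hand in the chart code. As a minimal sketch, assuming we would rather derive it from the data, the string can be built from the mean of the rates. The variable names are illustrative, and the values are copied from the table above.

```python
from statistics import mean

# Attrition rates for 2019 down to 2010, copied from the table above
attrition_rates = [0.091, 0.082, 0.045, 0.123, 0.056,
                   0.151, 0.07, 0.01, 0.02, 0.097]

average = mean(attrition_rates)

# The ".1%" format multiplies by 100 and keeps one decimal place
average_label = f"AVERAGE: {average:.1%}"
print(average_label)
```

The mean is very close to 0.0745, so the formatted value sits on a rounding boundary; it is worth checking the printed text against the book's 7.5% before using it in the chart.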

#### Line graph

Next, we will link the dots with a line, which makes the differences between consecutive values easier to compare.

Once more, omitting the data type, this time in the label specification, causes the labels to pile up on the right side of the graph.

In [21]:
# Without data type

line = base.mark_line(color = '#2c549d').encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              )
    )

label = base.mark_text(align = 'left', dx = 3).encode(
    x= alt.X('Year', aggregate = 'max'),
    y = alt.Y('Attrition Rate', aggregate = {'argmax': 'Year'}),
    text = alt.Text('Attrition Rate')
)

final = line + rule + label

final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)

In [22]:
# With data type
line = base.mark_line(color = '#2c549d').encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              )
    )

label = base.mark_text(align = 'left', dx = 3, color = '#2c549d').encode(
    x = alt.X('Year:O', aggregate = 'max'),
    y = alt.Y('Attrition Rate', aggregate = {'argmax': 'Year'}),
    text = alt.Text('Attrition Rate')
)

final = line + rule + label

final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)

The default method for placing the end label proved ineffective, as it failed to single out 2019 as the maximum value in the Year column. This can be fixed by filtering the data behind the label down to the rows where `Year == 2019`.

In [23]:
line = base.mark_line(color = '#2c549d').encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              )
    )

label = base.mark_text(align = 'left', dx = 3, color = '#2c549d', fontWeight = 'bold').encode(
    x = alt.X('Year:O'),
    y = alt.Y('Attrition Rate'),
    text = alt.Text('Attrition Rate', format = ".1%"),
    xOffset = alt.value(-10),
    yOffset = alt.value(-10)
).transform_filter(
    alt.FieldEqualPredicate(field='Year', equal=2019)
    )

label2 = alt.Chart({"values": 
                    [{"text":  ['AVG: 7.5%']}]
                    }
                    ).mark_text(size = 10, 
                                align = "left", 
                                dx = 96, dy = 15, 
                                color = '#2c549d',
                                fontWeight = 'bold'
                                ).encode(text = "text:N")

point = base.mark_point(filled = True).encode(
    x = alt.X('Year:O'),
    y = alt.Y('Attrition Rate', aggregate = {'argmax': 'Year'})
).transform_filter(
    alt.FieldEqualPredicate(field='Year', equal=2019)
    )

final = line + rule + label + label2 + point

final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)

![Alt text](\Images\2_5c.png)
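
The filter in the cell above hard-codes 2019. A minimal sketch of the same selection in plain Python, assuming we want the last year to be derived from the data instead; the names `rows`, `latest_year`, and `end_points` are illustrative.

```python
# (year, attrition rate) pairs copied from the table above
rows = [
    (2019, 0.091), (2018, 0.082), (2017, 0.045), (2016, 0.123), (2015, 0.056),
    (2014, 0.151), (2013, 0.07), (2012, 0.01), (2011, 0.02), (2010, 0.097),
]

# Find the most recent year instead of typing it by hand
latest_year = max(year for year, _ in rows)

# Keep only the row(s) for that year: these are the points that get a label
end_points = [(year, rate) for year, rate in rows if year == latest_year]

print(latest_year, end_points)
```

The resulting `latest_year` could then stand in for the literal in `alt.FieldEqualPredicate(field='Year', equal=latest_year)`, so the label should follow the data if a new year is added.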


Shading the region below the average line may help highlight the values that fall under it.

In [24]:
avg = table['Attrition Rate'].mean()

rect = alt.Chart(pd.DataFrame({'y': [0], 'y2':[avg]})).mark_rect(
    opacity = 0.2
).encode(y='y', y2='y2', x = alt.value(0), x2 = alt.value(315))

label2 = alt.Chart({"values": 
                    [{"text":  ['AVG:', '7.5%']}]
                    }
                    ).mark_text(size = 10, 
                                align = "left", 
                                dx = 113, dy = 15, 
                                color = '#9fb5db',
                                fontWeight = 'bold'
                                ).encode(text = "text:N")

final = line + rect + label + label2 + point

final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)

![Alt text](\Images\2_5d.png)

#### Area graph

We also explored an area graph; however, it gives the impression that the area under the line is meaningful, which is not the case for this dataset. This graph type is probably not the most suitable choice for this data.

We also chose to depart from the book's example and connect the area to the y-axis. It is unclear whether the gap in the book, which appears in this graph only, was intentional or an error.

In [25]:
area = base.mark_area().encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              )
    )

rule_light = base.mark_rule(color = "#9fb5db", strokeDash = [3,3]).encode(
    x = alt.value(0),
    x2 = alt.value(315),
    y = 'mean(Attrition Rate)'
)

final = area + rule_light + label2
final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)

![Alt text](\Images\2_5e.png)

#### Bar plot

Finally, we can do a classic bar plot.

In [26]:
bar = base.mark_bar(size = 25).encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False, domain = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              )
    )
    
label = alt.Chart({"values": 
                    [{"text":  ['AVG: 7.5%']}]
                    }
                    ).mark_text(size = 10, 
                                align = "left", 
                                dx = -130, dy = 0, 
                                color = '#2c549d',
                                fontWeight = 'bold'
                                ).encode(text = "text:N")

final = bar + rule + label
final.properties(
    width = 320,
    height = 200
).configure_view(stroke = None)

![Alt text](\Images\2_5f.png)

#### Interactive dot plot

To take advantage of Altair's interactivity, we return to the dot plot and add a slider that sets a cutoff value: dots below the cutoff change color, and a second dashed line marks the cutoff itself.

In [189]:
# Create the slider
slider = alt.binding_range(min = 0, max = 0.16, step = 0.005, name ='CUT: ')
selector = alt.param(name = 'SelectorName', value = 0.03, bind = slider)

# Copy the column under a name without spaces so it can be referenced as alt.datum.AttRate
table['AttRate'] = table['Attrition Rate']

base = alt.Chart(table, title = alt.Title(
                 "Attrition rate over time",
                 fontSize = 18,
                 fontWeight = 'normal',
                 anchor = 'start',
                 offset = 10))

dots = base.mark_point(filled = True, size = 50).encode(
    x = alt.X('Year:O',
              axis = alt.Axis(labelAngle = 0, labelColor = '#888888', ticks = False),
              title = None,
              scale = alt.Scale(align = 0)
              ), 
    y = alt.Y('Attrition Rate',
              axis = alt.Axis(grid = False, titleAnchor = 'end', 
                              labelColor = "#888888", titleColor = '#888888', 
                              tickCount = 9, format = "%", titleFontWeight = 'normal'), 
              title = "ATTRITION RATE"
              ),
    color = alt.condition(alt.datum.AttRate < selector, alt.value("#ef476f"), alt.value("#118ab2")),
    opacity = alt.value(1)
    ).add_params(selector)

rule = base.mark_rule(color = "#118ab2", strokeDash = [3,3]).encode(
    x = alt.value(0),
    x2 = alt.value(315),
    y = "mean(AttRate)"
)

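# Position the cutoff line in pixels: the chart is 200 px tall and, assuming the
# y-axis spans 0 to 0.16, one unit of rate corresponds to 200 / 0.16 = 1250 px
rule2 = base.mark_rule(color = "#ef476f", strokeDash = [3,3]).encode(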
    x = alt.value(0),
    x2 = alt.value(315),
    y = alt.value(200 - selector*1250),
    opacity = alt.value(0.1)
).add_params(selector)


label = alt.Chart({"values": 
                    [{"text":  ['AVERAGE: 7.5%']}]
                    }
                    ).mark_text(size = 10, 
                                align = "left", 
                                dx = -170, dy = 0, 
                                color = '#118ab2',
                                fontWeight = 'bold'
                                ).encode(text = "text:N")

label2 = alt.Chart({"values": 
                    [{"text":  ['CUT']}]
                    }
                    ).mark_text(size = 10, 
                                align = "left", 
                                dx = 110, 
                                color = '#ef476f',
                                fontWeight = 'bold'
                                ).encode(text = "text:N", y = alt.value(195 - selector*1250)).add_params(selector)

final = dots + rule + label + rule2 + label2
final.properties(
    width = 350,
    height = 200
).configure_view(stroke = None)
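
The expression `200 - selector*1250` appears to convert the slider value into a pixel position measured from the top of the chart. A minimal sketch of that arithmetic in plain Python, assuming the y-axis spans 0 to 0.16 (the slider's maximum) over the 200 px chart height; the function name and constants are illustrative.

```python
import math

CHART_HEIGHT_PX = 200  # matches height = 200 in the chart properties
Y_DOMAIN_MAX = 0.16    # assumption: the y-axis runs from 0 to 0.16


def rate_to_pixel_y(rate, height_px=CHART_HEIGHT_PX, domain_max=Y_DOMAIN_MAX):
    """Return the distance in pixels from the top of the chart for a given rate."""
    pixels_per_unit = height_px / domain_max  # 200 / 0.16 = 1250
    # Pixel y grows downward, so larger rates sit closer to the top
    return height_px - rate * pixels_per_unit


# Position of the cutoff line for the slider's starting value of 0.03
default_cut_px = rate_to_pixel_y(0.03)
print(default_cut_px)
```

Because the factor 1250 is tied to both the chart height and the axis range, changing either one means the constant in the chart code has to be recomputed as well.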
