# Storytelling with Data! in Altair

by Maisa de Oliveira Fraiz

## Introduction

This project replicates the examples from Cole Nussbaumer Knaflic's book, "Storytelling with Data: Let's Practice!", using Python's `Altair` library. The primary objective is to document the reasoning behind the modifications the author proposes, while also highlighting the challenges that arise when moving from the book's Excel-based approach to a programming environment.

`Altair` was selected for its declarative syntax, interactivity, grammar-of-graphics foundation, and compatibility with `Streamlit` and other web tools, all within the user-friendly Python environment. Anticipated challenges include Altair's smaller documentation base and developer community compared with more established libraries such as `Matplotlib`, `Seaborn`, or `Plotly`. Furthermore, tasks that appear straightforward in Excel may take several iterations to translate into code.


## Imports

In [3]:
import pandas as pd
import numpy as np
import altair as alt

## Chapter 2 - Choose an effective visual

*"When I have some data I need to show, how do I do that in an effective way?"*

This chapter's exercises encourage evaluating different graphs for the same data in order to understand the strengths and constraints of each, which helps in finding the best medium for the information you want to highlight.

### Exercise 2.1 - improve this table

The data for this exercise can be found here: https://www.storytellingwithdata.com/letspractice/downloads

The first problem with the Excel-to-Altair translation arises from the data itself: the spreadsheet is padded with titles and text that aid readability in Excel but get in the way in Python, so we should be careful when loading it.

In [4]:
# Example of wrong loading
table = pd.read_excel(r"..\..\Data\2.1 EXERCISE.xlsx")
table

Unnamed: 0,EXERCISE 2.1,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,,,,,,
1,,FIG 2.1a,,,,
2,,,,,,
3,,New client tier share,,,,
4,,,,,,
5,,Tier,# of Accounts,% Accounts,Revenue ($M),% Revenue
6,,A,77,0.070772,4.675,0.25
7,,A+,19,0.017463,3.927,0.21
8,,B,338,0.310662,5.984,0.32
9,,C,425,0.390625,2.805,0.15


In [5]:
del table

In [6]:
# Right loading
table = pd.read_excel(r"..\..\Data\2.1 EXERCISE.xlsx", usecols = [1, 2, 3, 4, 5], header = 6)
table

Unnamed: 0,Tier,# of Accounts,% Accounts,Revenue ($M),% Revenue
0,A,77,0.070772,4.675,0.25
1,A+,19,0.017463,3.927,0.21
2,B,338,0.310662,5.984,0.32
3,C,425,0.390625,2.805,0.15
4,D,24,0.022059,0.374,0.02
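
The `header = 6` value above was found by inspecting the wrongly loaded table. As a rough sketch, the header row could also be located programmatically by reading the sheet without a header and searching for a known column label. The helper `find_header_row` and the small in-memory frame standing in for the Excel sheet are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def find_header_row(raw: pd.DataFrame, marker: str) -> int:
    """Return the position of the first row that contains `marker` in any cell."""
    hits = raw.eq(marker).any(axis=1)
    if not hits.any():
        raise ValueError(f"No row contains {marker!r}")
    return int(hits.to_numpy().argmax())

# Stand-in for pd.read_excel(path, header=None): titles first, table later.
raw = pd.DataFrame([
    [None, "EXERCISE 2.1", None],
    [None, None, None],
    [None, "Tier", "# of Accounts"],
    [None, "A", 77],
    [None, "A+", 19],
])

header_pos = find_header_row(raw, "Tier")

# Keep only the rows below the header and drop the empty first column.
clean = raw.iloc[header_pos + 1:, 1:].reset_index(drop=True)
clean.columns = raw.iloc[header_pos, 1:].tolist()
```

With the real file, the position found this way could then be passed as `header=` to `pd.read_excel`, assuming the sheet is first read with `header=None`.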


For the first alterations, the book suggests enhancing readability by making the following improvements:

* Order the tiers.
* Add a row containing the total value.
* Because the tiers' percentages do not sum to 100%, add a category called "All other" to cover the values not listed.
* Round the numbers and express the fractions as percentages.

In [7]:
# Ordering the tiers

table = table.loc[[1, 0, 2, 3, 4]]

In [8]:
# Fixing the percentages

table['% Accounts'] = table['% Accounts'] * 100
table['% Revenue'] = table['% Revenue'] * 100

In [9]:
# Calculating and adding "All other" values

other_account_per = 100 - table['% Accounts'].sum()
other_revenue_per = 100 - table['% Revenue'].sum()

# Row label 0 is tier A; its value and percentage give the scale for "All other"
other_account_num = (other_account_per*table.loc[0, '# of Accounts'])/table.loc[0, '% Accounts']
other_revenue_num = (other_revenue_per*table.loc[0, 'Revenue ($M)'])/table.loc[0, '% Revenue']

table.loc[len(table)] = ["All other", other_account_num, other_account_per, other_revenue_num, other_revenue_per]
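
The calculation above rests on a simple proportion: if a tier with a known count represents a known share of the whole, the whole is the count divided by the share, and "All other" is whatever share the listed tiers leave uncovered. A minimal check of that reasoning, using the figures printed in the loaded table (the column names `accounts` and `accounts_share` are made up for this sketch, and the shares are rounded, so tiny differences are expected):

```python
import pandas as pd

# Figures copied from the loaded table shown earlier (shares are fractions).
tiers = pd.DataFrame({
    "Tier": ["A+", "A", "B", "C", "D"],
    "accounts": [19, 77, 338, 425, 24],
    "accounts_share": [0.017463, 0.070772, 0.310662, 0.390625, 0.022059],
})

# Any listed tier gives an estimate of the grand total: count / share.
total_accounts = tiers.loc[1, "accounts"] / tiers.loc[1, "accounts_share"]

# The share not covered by the listed tiers belongs to "All other".
other_share = 1 - tiers["accounts_share"].sum()
other_accounts = other_share * total_accounts

print(round(total_accounts), round(other_accounts))
```

The rounded results agree with the 1088 total accounts and 205 "All other" accounts that appear in the finished table.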


In [10]:
# The charts need the unrounded values and must not include the total row,
# so we keep a copy of the table before making the following alterations

table_charts = table.copy()

In [11]:
# Adding total values row

table.loc[len(table)] = ["Total", table['# of Accounts'].sum(), table['% Accounts'].sum(),
                        table['Revenue ($M)'].sum(), table['% Revenue'].sum()]

In [12]:
# Rounding the numbers

table['% Accounts'] = table['% Accounts'].apply(lambda x: round(x))
table['Revenue ($M)'] = table['Revenue ($M)'].apply(lambda x: round(x, 1))

The new table is as follows:

In [13]:
table

Unnamed: 0,Tier,# of Accounts,% Accounts,Revenue ($M),% Revenue
1,A+,19.0,2,3.9,21.0
0,A,77.0,7,4.7,25.0
2,B,338.0,31,6.0,32.0
3,C,425.0,39,2.8,15.0
4,D,24.0,2,0.4,2.0
5,All other,205.0,19,0.9,5.0
6,Total,1088.0,100,18.7,100.0


or, for even better readability:

In [14]:
table.set_index("Tier")

Unnamed: 0_level_0,# of Accounts,% Accounts,Revenue ($M),% Revenue
Tier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A+,19.0,2,3.9,21.0
A,77.0,7,4.7,25.0
B,338.0,31,6.0,32.0
C,425.0,39,2.8,15.0
D,24.0,2,0.4,2.0
All other,205.0,19,0.9,5.0
Total,1088.0,100,18.7,100.0


* Note: the author adds the % symbol next to the numbers in the percentage columns. Doing this in pandas converts the column from numeric to string values, so it is not recommended for data that will still be used in calculations or charts.
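
If the % symbol is wanted purely for presentation, one option is to format a copy of the table and leave the numeric one untouched. This is only a sketch with a two-row stand-in table; `display_table` is a name chosen for the example.

```python
import pandas as pd

numeric = pd.DataFrame({"Tier": ["A+", "A"], "% Accounts": [2, 7]})

# Format a copy for display; the original column stays numeric.
display_table = numeric.copy()
display_table["% Accounts"] = display_table["% Accounts"].astype(str) + "%"
```

The formatted copy holds strings such as "2%", while `numeric` can still be summed or passed to a chart.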


As you can see, modifications that take just a few clicks in Excel may require some effort in a programming language.

The author also suggests changes such as row colors, text alignment, and graphs embedded in the table to represent the percentage columns. A plain pandas DataFrame holds only data, so this kind of formatting is not part of the table itself; pandas' `Styler` (`DataFrame.style`) can reproduce some of it in notebook output, but it is not used in this project.

Considering that percentages depict a fraction of a whole, the next proposal is to employ a pie chart. 
Here is the default Altair graph version:

In [15]:
# Default pie chart

alt.Chart(table_charts).mark_arc().encode(
    theta = "% Accounts",
    color = alt.Color('Tier:N'),
)

Here are some needed adjustments to bring it closer to the original:


* Restore the order of the tiers, which Altair has rearranged alphabetically.

* By default, the tiers' labels appear in the legend, whereas in the book they are displayed next to the corresponding slice of the pie.

* Altair doesn't automatically add a title.

In [16]:
# % of Accounts Pie Chart

base = alt.Chart(table_charts, title="% of Total Accounts").encode(
    theta = alt.Theta("% Accounts:Q").stack(True),
    color = alt.Color('Tier:N').legend(None),
    order = alt.Order(field ='Tier:O'))



pie = base.mark_arc(outerRadius = 115)
text = base.mark_text(radius = 140, size = 15).encode(text = alt.Text("Tier"))


acc_pie = pie + text
acc_pie

The pie chart above is easily modified to represent the percentage of total revenue.

In [17]:
# % of Revenue Pie Chart

base = alt.Chart(table_charts, title="% of Total Revenue").encode(
    theta = alt.Theta("% Revenue:Q").stack(True),
    color = alt.Color('Tier:N').legend(None),
    order = alt.Order(field ='Tier:O'))



pie = base.mark_arc(outerRadius = 115)
text = base.mark_text(radius = 135, size = 14, align = "left").encode(text = alt.Text("Tier"))


rev_pie = pie + text
rev_pie 

With both graphs available, we can add them next to each other and include a main title.

In [18]:
# Finished Pie Chart

pies = acc_pie | rev_pie

pies.properties(
    title = alt.Title('New Client Tier Share', offset = 40, fontSize = 30)
)



![Alt text](\Images\2_1e.png)

* To do: add text explaining why pie charts aren't preferable for this type of data (with reference).

The next graph proposed to tackle this data is a horizontal bar chart. 

This is the default representation in Altair:

In [19]:
# Default altair bar chart

alt.Chart(table_charts).mark_bar().encode(
    y = alt.Y('Tier'),
    x = alt.X('% Accounts'))

Here are the adjustments needed:

* Move the "Tier" label to the upper-left corner.

* Show the values next to the bars instead of on an x-axis.

* Add a title and rearrange the tiers.

In [20]:
# Alterations as per book

title = alt.TitleParams('% OF TOTAL ACCOUNTS', dy=12)

base = alt.Chart(table_charts, title = title).encode(
    y = alt.Y('Tier', sort = ["A+"], title = "TIER", axis = alt.Axis(titleY = 0, titleAlign = "left", titleAngle = 0)),
    x = alt.X('% Accounts').axis(None),
    text = alt.Text("% Accounts", format = ".0f"))

final_acc = base.mark_bar() + base.mark_text(align = 'left', dx = 2)
final_acc

Now we do the same for the revenue column.

* In addition, the y-axis is removed so that it isn't repeated when the charts are combined.

In [21]:
base = alt.Chart(table_charts, title = "% OF TOTAL REVENUE").encode(
    y = alt.Y('Tier', sort = ["A+"]).axis(None),
    x = alt.X('% Revenue').axis(None),
    text = alt.Text("% Revenue", format = ".0f"))

final_rev = base.mark_bar() + base.mark_text(align = 'left', dx = 2)
final_rev

Similar to the pie chart, we can arrange these graphs side by side and include a main title.

In [22]:
final = final_acc | final_rev
final.configure_view(stroke=None).properties(
    title = alt.Title('New Client Tier Share', offset = 25, fontSize = 20)
)

![Alt text](\Images\2_1f.png)

We can now merge these two bar charts into a single graph.

In [23]:
# Default by Altair

alt.Chart(table_charts).mark_bar().encode(
    x = alt.X('value:Q'),
    y = alt.Y('variable:N'),
    color = alt.Color('variable:N', legend = alt.Legend(title = 'Metric')),
    row = alt.Row('Tier:O')
).transform_fold(
    fold = ['% Accounts', '% Revenue'],
    as_ = ['variable', 'value']
)

Needed alterations:

* Remove grid.

* Fix the labels' position and redundancy.

* Add title and subtitle.

* Change color palette. 

In [33]:
# Proper alterations

alt.Chart(table_charts).mark_bar().encode(
    x = alt.X('value:Q', axis = alt.Axis(title = None, grid = False, orient = "top")),
    y = alt.Y('variable:N', axis = alt.Axis(title = None, labels = False, ticks = False)),
    color = alt.Color('variable:N', 
                      legend = alt.Legend(title = 'Metric'),
                      scale = alt.Scale(range = ['#b4c6e4', '#4871b7'])
                      ),
    row = alt.Row(
                'Tier:O', 
                header = alt.Header(labelAngle = 0, labelAlign = "left"), 
                title = None,
                sort = ['A+'],
                spacing = 10
                )
).transform_fold(
    fold = ['% Accounts', '% Revenue'],
    as_ = ['variable', 'value']
).properties(title = {
      "text": ["New client tier share"], 
      "subtitle": ["% OF TOTAL ACCOUNTS vs REVENUE", " ", "Tier"]
    }
).configure_view(stroke = None)

![Alt text](\Images\2_1g.png)

We should now modify this chart to be in a vertical orientation. This can be done by swapping the `x` and `y` encodings and replacing the `Row` class with the `Column` class. We will also need to reorient the labels.

In [35]:
alt.Chart(table_charts).mark_bar().encode(
    y = alt.Y('value:Q', axis = alt.Axis(title = '%', grid = False)),
    x = alt.X('variable:N', axis = alt.Axis(title = None, labels = False)),
    color = alt.Color('variable:N', 
                      legend = alt.Legend(title = 'Metric'),
                      scale = alt.Scale(range = ['#b4c6e4', '#4871b7'])
                      ),
    column = alt.Column(
        'Tier:O', 
        header = alt.Header(labelOrient = 'bottom', titleOrient = "bottom", titleAnchor = "start"),
        sort = ['A+']
        )
).transform_fold(
    fold = ['% Accounts', '% Revenue'],
    as_ = ['variable', 'value']
).properties(title = {
      "text": ["New client tier share"], 
      "subtitle": ["% OF TOTAL ACCOUNTS vs REVENUE"],
    }
).configure_view(stroke = None)



![Alt text](\Images\2_1h.png)

Instead of a legend, the author matches the color of the words in the title to the color of the corresponding bars. While this is a more elegant approach, Altair titles cannot color individual words. In a notebook, you can work around this by using ``LaTeX`` in a ``Markdown`` cell, but it won't be seamlessly integrated with the chart. It would look something like this:

**New client tier share**

"% OF TOTAL $\textcolor{#b4c6e4}{ACCOUNTS}$ vs $\textcolor{#4871b7}{REVENUE}$"

In the code above, we used the `transform_fold` method to build the grouped bar chart because our data is in 'wide form', the usual Excel layout, with one column per metric. Altair (as well as many other visualization libraries) is designed to work with 'long form' data, with one row per observation.

The `transform_fold` method performs this conversion inside the chart specification, which lets us create the graph. However, it can obscure what is happening to the data, so it is often preferable to perform the transformation in pandas before passing the data to Altair.

In [26]:
# Transforms our data to the long-form format.

melted_table = pd.melt(table_charts, id_vars = ['Tier'], var_name = 'Metric', value_name = 'Value')
melted_table

Unnamed: 0,Tier,Metric,Value
0,A+,# of Accounts,19.0
1,A,# of Accounts,77.0
2,B,# of Accounts,338.0
3,C,# of Accounts,425.0
4,D,# of Accounts,24.0
5,All other,# of Accounts,205.0
6,A+,% Accounts,1.746324
7,A,% Accounts,7.077206
8,B,% Accounts,31.066176
9,C,% Accounts,39.0625
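
As a variation, `pd.melt` accepts a `value_vars` argument that limits which columns are unpivoted, so only the percentage columns end up in the long table. A minimal sketch on a two-row stand-in for the wide table (values rounded for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    "Tier": ["A+", "A"],
    "# of Accounts": [19, 77],
    "% Accounts": [1.7, 7.1],
    "% Revenue": [21.0, 25.0],
})

# Melt only the percentage columns; the other value columns are dropped.
long_pct = pd.melt(
    wide,
    id_vars=["Tier"],
    value_vars=["% Accounts", "% Revenue"],
    var_name="Metric",
    value_name="Value",
)
```

Each tier now has one row per metric, which is the shape the grouped bar chart expects.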


We can now use this table to remake the bar chart without the ``transform_fold`` method.

In [37]:
selected_rows = melted_table[melted_table['Metric'].isin(['% Accounts', '% Revenue'])]

alt.Chart(selected_rows).mark_bar().encode(
    y = alt.Y('Value', axis = alt.Axis(title='%', grid = False)),
    x = alt.X('Metric', axis = alt.Axis(title = None, labels = False)),
    color = alt.Color('Metric', scale = alt.Scale(range = ['#b4c6e4', '#4871b7'])),
    column = alt.Column('Tier',
                        header = alt.Header(labelOrient = 'bottom', titleOrient = "bottom", titleAnchor = "start"),
                        sort = ['A+']
                        )
    ).properties(title = {
      "text": ["New client tier share"], 
      "subtitle": ["% OF TOTAL ACCOUNTS vs REVENUE"],
    }
).configure_view(stroke = None)


The next proposed graph is an extension of the previous bar chart, featuring the addition of lines to accentuate the endpoints of the columns within the same tier.

However, attempting to layer the lines on top of the faceted chart raises an error (*ValueError: Faceted charts cannot be layered. Instead, layer the charts before faceting*). Encoding `column` turns the chart into a set of separate panels, one per tier, and Altair does not allow another chart to be layered on top of an already faceted chart.

Now that our data is in long form, we can work around this problem by building the chart without the `column` encoding, and therefore without faceting. Instead of encoding `x` as 'Metric', `y` as 'Value', `color` as 'Metric', and `column` as 'Tier', we encode `x` as 'Tier', `y` as 'Value', `color` as 'Metric', and add `xOffset` as 'Metric'. In essence, `column` splits the chart into one panel per tier, whereas `xOffset` shifts the bars horizontally within each tier's band on a single shared x-axis.

The following chart incorporates the alterations we discussed and yields a graph that closely resembles the previous one.

In [38]:
# New bar chart

alt.Chart(selected_rows).mark_bar().encode(
    y = alt.Y('Value', axis = alt.Axis(title ='%', grid = False)),
    x = alt.X('Tier', axis = alt.Axis(labelAngle = 0, titleAnchor = "start"), sort = ['A+']),
    color = alt.Color('Metric',scale = alt.Scale(range = ['#b4c6e4', '#4871b7'])),
    xOffset = 'Metric'
    ).properties(title = {
      "text": ["New client tier share"], 
      "subtitle": ["% OF TOTAL ACCOUNTS vs REVENUE"]
    }
).configure_view(stroke = None)

Now, we can layer the graph and introduce the lines. It's worth noting that creating the lines in Altair is not a straightforward task and a considerable amount of documentation searching was necessary to achieve it.

In [50]:
base = alt.Chart(selected_rows).properties(title = {
      "text": ["New client tier share"], 
      "subtitle": ["% OF TOTAL ACCOUNTS vs REVENUE"],
    }
)


bars = base.mark_bar().encode(
    y = alt.Y('Value', axis = alt.Axis(title='%', grid = False)),
    x = alt.X('Tier', axis = alt.Axis(labelAngle = 0, titleAnchor = "start"), sort = ['A+']),
    color = alt.Color('Metric', scale = alt.Scale(range = ['#b4c6e4', '#4871b7'])),
    xOffset = 'Metric'
    )

# Lines that are ascending
rule_asc = base.mark_rule(x2Offset = 10, xOffset = -10
).encode(
    x = alt.X('Tier', sort = ['A+']),
    x2 = alt.X2('Tier'),
    y = alt.Y('min(Value)'),
    y2 = alt.Y2('max(Value)'),
    strokeWidth = alt.value(2), 
    opacity = alt.condition(
        (alt.datum.Tier == 'A+') | 
        (alt.datum.Tier == 'A')  |
        (alt.datum.Tier == 'B'), 
        alt.value(1), alt.value(0)
        )
    )

# Lines that are descending
rule_desc = base.mark_rule(x2Offset = 10, xOffset = -10
).encode(
    x = alt.X('Tier', sort = ['A+']),
    x2 = alt.X2('Tier'),
    y = alt.Y('max(Value)'),
    y2 = alt.Y2('min(Value)'),
    strokeWidth = alt.value(2), 
    opacity = alt.condition(
        (alt.datum.Tier == 'A+') | 
        (alt.datum.Tier == 'A')  |
        (alt.datum.Tier == 'B'), 
        alt.value(0), alt.value(1)
        )
    )

# Points of % Revenue where % Revenue > % Accounts
points1 = base.mark_point(filled = True, xOffset = 10, color = "black").encode(
    x = alt.X('Tier', sort = ['A+']),
    y = alt.Y('max(Value)'),
    opacity = alt.condition(
        (alt.datum.Tier == 'A+') | 
        (alt.datum.Tier == 'A')  |
        (alt.datum.Tier == 'B'), 
        alt.value(1), alt.value(0)
        )
    )

# Points of % Revenue where % Revenue < % Accounts
points2 = base.mark_point(filled = True, xOffset = 10, color = "black").encode(
    x = alt.X('Tier', sort = ['A+']),
    y = alt.Y('min(Value)'),
    opacity = alt.condition(
        (alt.datum.Tier == 'A+') | 
        (alt.datum.Tier == 'A')  |
        (alt.datum.Tier == 'B'), 
        alt.value(0), alt.value(1)
        )
    )

# Points of % Accounts where % Revenue < % Accounts
points3 = base.mark_point(filled = True, xOffset = -10, color = "black").encode(
    x = alt.X('Tier', sort = ['A+']),
    y = alt.Y('max(Value)'),
    opacity = alt.condition(
        (alt.datum.Tier == 'A+') | 
        (alt.datum.Tier == 'A')  |
        (alt.datum.Tier == 'B'), 
        alt.value(0), alt.value(1)
        )
    )

# Points of % Accounts where % Revenue > % Accounts
points4 = base.mark_point(filled = True, xOffset = -10, color = "black").encode(
    x = alt.X('Tier', sort = ['A+']),
    y = alt.Y('min(Value)'),
    opacity = alt.condition(
        (alt.datum.Tier == 'A+') | 
        (alt.datum.Tier == 'A')  |
        (alt.datum.Tier == 'B'), 
        alt.value(1), alt.value(0)
        )
    )

final = bars + rule_asc + rule_desc + points1 + points2 + points3 + points4
final.configure_view(stroke = None)


![Alt text](\Images\2_1i.png)

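Two attempts to make the code above less repetitive did not work. Putting the condition on `y2` failed because `Y2` has no parameter named 'condition', and making the mark's `xOffset` conditional raised:

SchemaValidationError: '{'condition': {'test': "(((datum.Tier === 'A+') || (datum.Tier === 'A')) || (datum.Tier === 'B'))", 'value': 10}, 'value': -10}' is an invalid value for `xOffset`. Valid values are of type 'number'.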

In [30]:
rule = rule_asc + rule_desc + points1 + points2 + points3 + points4
rule.properties(
    width = 350,
    height = 350
).configure_axisX(
    labelAngle = 0, titleAnchor = "start"
).configure_axisY(
    title = '%', grid = False
).configure_view(stroke = None)

![Alt text](\Images\2_1j.png)

In [41]:
base = alt.Chart(selected_rows)

line = base.mark_line(point = True, color = "black").encode(
    x = alt.X('Metric', axis = alt.Axis(title = None, labelAngle = 0)),
    y = alt.Y('Value', axis = alt.Axis(grid = False, title = "%")),
    color = alt.Color('Tier', 
                      scale = alt.Scale(range = ['black', 'black', 'black', 'black', 'black', 'black']),
                      legend = None)
).properties(
    width = 300,
    height = 350
)

labels = base.mark_text(
    align='left', dx = 100
).encode(
    y = alt.Y('Tier', axis = None, sort = ["A+"]),
    text ='Value',
)

final = line #+ labels

final.configure_view(stroke = None)

![Alt text](\Images\2_1k.png)

Points to add:

- Text about pie charts.

- Explain why making the code less repetitive didn't work.

- Add labels to slopegraph.