<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

# IFN619 :: B3-Visualisations

### What is data visualisation?

Data visualisation is the process of transforming data into a compelling story using graphical tools.

<img src="./graphics/b3-vis-overview.png">

> **Tip:** Tables can also be considered data visualisations if they are used to tell a compelling story

### Benefits of data visualisations

- Data visualisations highlight patterns in the data
- Data visualisations make it possible to represent complex data graphically
- Data visualisations support decision-making for different stakeholders

### Can you see any problems with the following visualisations?

<img src="./graphics/b3-bad-example.png" style="width: 49%; height: 300px; float: left;">
<img src="./graphics/b3-better-example.png" style="width: 49%; height: 300px; float: right">

Both visualisations use exactly the same underlying data. However, the order of the data and the design decisions behind each visualisation are different:
- The visualisation on the left:
    - The data is not sorted
    - The title is missing
    - The legend is not sorted
    
- The visualisation on the right:
    - The data is sorted
    - It has a meaningful title
    - The legend is sorted

The data is only one part of building a data visualisation. The other part is helping people interpret the data and make sense of what is being presented to them.

---

## Data visualisation structure

### Coordinate systems

The coordinate system specifies the positions and the scale used to place the data. The most common is the *cartesian coordinate system*, which is composed of two axes commonly named X and Y. This system gives a two-dimensional linear position. Most data visualisations use this coordinate system.

<img src="./graphics/b3-cartesian.jpg" style="width: 500px">

The second most common is the *radial coordinate system*, which gives positions based on a circle. The most common data visualisations that use this system are the pie chart and the donut chart.

<img src="./graphics/b3-radial.png" style="width: 500px">


### Aesthetics

Aesthetics are visual elements of the visualisation that can be mapped to quantifiable data. The most important aesthetics are the following:

1. *Position* is the most powerful aesthetic, as it is the fastest and easiest graphical feature to differentiate
2. *Colour* provides an easy way to group data belonging to the same category. Colours also carry meaning: for instance, red is commonly used to depict danger while green usually depicts safety
3. *Shape* can be used like colour, but only a limited number of shapes can be easily differentiated
4. *Size* can be used like position, but it is more difficult for the visual system to judge a difference in size accurately
5. *Line type* is similar to shape: only a limited number of line types can be easily differentiated
6. *Line width* is rarely used, because only a limited number of widths can be told apart and differences in width are difficult to judge

<img src="./graphics/b3-aesthetics.png" style="width: 500px">


### How to visualise complex data

Vision is one of the fastest human senses. A data visualisation therefore needs to support the visual system fully to deliver a meaningful message.

Different data types are better represented in different formats:

- *Temporal data* is usually represented in line graphs and scatter plots
- *Hierarchical data* is usually represented in tree maps and tree diagrams
- *Network data* is usually represented in matrix and node-link diagrams
- *Multidimensional data* is usually represented in scatter plots, stacked bar charts and parallel coordinate plots
- *Geographical data* is usually represented in choropleth maps and cartograms

> **Tip:** The list above suggests the charts normally used for each data type. However, other charts could be valuable if the visualisation has a different intention
---

### Visualising datasets with different visualisation types

In the next section we are going to compare how the same data can look different using different visualisations. The most important thing to remember is to choose the type of visualisation that best supports your narrative.


#### Temporal data

Let's visualise temporal data in different charts to see how it can be easier to understand. The dataset is the [Daily minimum temperatures in Melbourne](https://www.kaggle.com/datasets/paulbrabban/daily-minimum-temperatures-in-melbourne) from Kaggle.

In [1]:
import pandas as pd
import plotly.express as px # Data visualisation library

In [4]:
# Load the data from the CSV file
path = "data/"
file_name = "b3-daily-minimum-MEL.csv"
temp_df = pd.read_csv(f"{path}{file_name}")
temp_df

Unnamed: 0,Date,"Daily minimum temperatures in Melbourne, Australia, 1981-1990"
0,1/01/1981,20.7
1,2/01/1981,17.9
2,3/01/1981,18.8
3,4/01/1981,14.6
4,5/01/1981,15.8
...,...,...
3645,27/12/1990,14.0
3646,28/12/1990,13.6
3647,29/12/1990,13.5
3648,30/12/1990,15.7


When working with dates, it is important to check that the imported data is in the correct format. Otherwise, calculations and operations on the dates will not work as expected, because the values are strings (object type). The most common problem appears when trying to sort dates that are stored as text.
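The sorting problem can be reproduced with three hypothetical dates written in the same day/month/year style as the CSV file. This is only a small sketch, and `dates_df` is an invented name:

```python
import pandas as pd

# Hypothetical dates stored as day/month/year text
dates_df = pd.DataFrame({"Date": ["2/01/1981", "10/01/1981", "1/02/1981"]})

# Sorting text compares character by character, so the result is not chronological
text_order = dates_df.sort_values(by="Date")["Date"].tolist()

# After conversion to datetime, the values sort chronologically
dates_df["Date"] = pd.to_datetime(dates_df["Date"], format="%d/%m/%Y")
date_order = dates_df.sort_values(by="Date")["Date"].dt.strftime("%Y-%m-%d").tolist()
```

As text, 1 February sorts before 10 January and 2 January. As datetime values, the order is 2 January, 10 January, 1 February.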

In [5]:
# Checking the data types of our dataset
temp_df.dtypes

Date                                                              object
Daily minimum temperatures in Melbourne, Australia, 1981-1990    float64
dtype: object

In [7]:
# Convert the Date column to a datetime type, matching the string format
temp_df["Date"] = pd.to_datetime(temp_df["Date"], format="%d/%m/%Y")
temp_df.dtypes

Date                                                             datetime64[ns]
Daily minimum temperatures in Melbourne, Australia, 1981-1990           float64
dtype: object

In [12]:
# Plot a bar chart using the temperature dataframe
temp_bar = px.bar(temp_df, x="Date", y="Daily minimum temperatures in Melbourne, Australia, 1981-1990")
temp_bar.show()

In [19]:
# Plot the same bar chart with a readable axis label and a title
temp_bar = px.bar(temp_df, x="Date", y="Daily minimum temperatures in Melbourne, Australia, 1981-1990",
                  labels={
                      "Daily minimum temperatures in Melbourne, Australia, 1981-1990": "Minimum (degrees Celsius)"
                  },
                  title="Daily minimum temperatures in Melbourne, Australia, 1981-1990")
temp_bar.show()

In [21]:
# An easier way may be to rename the column
temp2_df = temp_df.rename(columns={"Daily minimum temperatures in Melbourne, Australia, 1981-1990": "Minimum (degrees Celsius)"})
temp2_df.head()

Unnamed: 0,Date,Minimum (degrees Celsius)
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8


In [23]:
# Plot a scatter plot using the temperature dataframe
temp_scatter = px.scatter(temp2_df, x="Date", y="Minimum (degrees Celsius)")
temp_scatter.show()

In [24]:
# Plot a line chart using the temperature dataframe
temp_line = px.line(temp2_df, x="Date", y="Minimum (degrees Celsius)")
temp_line.show()

The line graph provides the best graphical description for this type of data for the following reasons:
- The line gives a sense of continuity in time
- The line provides an easy way to follow the trend
- The line highlights the patterns

#### Hierarchical data

Let's visualise hierarchical data in different charts to see how it can be easier to understand. The dataset is the [Formula 1 pilots of all times](https://www.kaggle.com/datasets/bvovczak/f1-pilots) from Kaggle.

In [25]:
# Load the data from the CSV file
path = "data/"
file_name = "b3-f1_data.csv"
f1_df = pd.read_csv(f"{path}{file_name}")
f1_df

Unnamed: 0,Driver,Nationality,Seasons,Championships,Race_Entries,Race_Starts,Pole_Positions,Race_Wins,Podiums,Fastest_Laps,...,Championship Years,Decade,Pole_Rate,Start_Rate,Win_Rate,Podium_Rate,FastLap_Rate,Points_Per_Entry,Years_Active,Champion
0,Carlo Abate,Italy,"[1962, 1963]",0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,,1960,0.0,0.000000,0.0,0.0,0.000000,0.000000,2,False
1,George Abecassis,United Kingdom,"[1951, 1952]",0.0,2.0,2.0,0.0,0.0,0.0,0.0,...,,1950,0.0,1.000000,0.0,0.0,0.000000,0.000000,2,False
2,Kenny Acheson,United Kingdom,"[1983, 1985]",0.0,10.0,3.0,0.0,0.0,0.0,0.0,...,,1980,0.0,0.300000,0.0,0.0,0.000000,0.000000,2,False
3,Andrea de Adamich,Italy,"[1968, 1970, 1971, 1972, 1973]",0.0,36.0,30.0,0.0,0.0,0.0,0.0,...,,1970,0.0,0.833333,0.0,0.0,0.000000,0.166667,5,False
4,Philippe Adams,Belgium,[1994],0.0,2.0,2.0,0.0,0.0,0.0,0.0,...,,1990,0.0,1.000000,0.0,0.0,0.000000,0.000000,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
863,Emilio Zapico,Spain,[1976],0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,,1980,0.0,0.000000,0.0,0.0,0.000000,0.000000,1,False
864,Zhou Guanyu,China,[2022],0.0,23.0,23.0,0.0,0.0,0.0,2.0,...,,2020,0.0,1.000000,0.0,0.0,0.086957,0.260870,1,False
865,Ricardo Zonta,Brazil,"[1999, 2000, 2001, 2004, 2005]",0.0,37.0,36.0,0.0,0.0,0.0,0.0,...,,2000,0.0,0.972973,0.0,0.0,0.000000,0.081081,5,False
866,Renzo Zorzi,Italy,"[1975, 1976, 1977]",0.0,7.0,7.0,0.0,0.0,0.0,0.0,...,,1980,0.0,1.000000,0.0,0.0,0.000000,0.142857,3,False


In [28]:
# Plot a stacked bar chart using the f1 dataframe
f1_bar = px.bar(f1_df, x="Nationality", y="Race_Wins", color="Champion")
f1_bar.show()

We need a visualisation that makes it easier to compare what we care about. A treemap, or a closely related chart such as the icicle chart used below, can be helpful for hierarchical data.

In [26]:
# Plot an icicle chart (a close relative of the treemap) using the f1 dataframe
# px.Constant creates a common root so the whole hierarchy starts from a single node
f1_tree = px.icicle(f1_df, path=[px.Constant("F1 Wins"), "Champion", "Nationality"], values="Race_Wins")
f1_tree.update_traces(root_color="lightgrey")
f1_tree.update_layout(margin = dict(t=25, l=25, r=25, b=25))
f1_tree.show()

#### Multidimensional data

Let's visualise multidimensional data in different charts to see how it can be easier to understand. The data set is [Walmart orders](https://www.kaggle.com/datasets/matthewcornfield/wallmart) from Kaggle.

In [30]:
# Load the data from the CSV file
path = "data/"
file_name = "b3-walmart_data.csv"
walmart_df = pd.read_csv(f"{path}{file_name}")
walmart_df

Unnamed: 0,Order ID,Order Date,Ship Date,Customer Name,Country,City,State,Category,Product Name,Sales,Quantity,Profit
0,CA-2013-138688,13-06-2013,17-06-2013,Darrin Van Huff,United States,Los Angeles,California,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2.0,6.87
1,CA-2011-115812,09-06-2011,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Furnishings,Eldon Expressions Wood and Plastic Desk Access...,48.86,7.0,14.17
2,CA-2011-115812,09-06-2011,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Art,Newell 322,7.28,4.0,1.97
3,CA-2011-115812,09-06-2011,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Phones,Mitel 5320 IP Phone VoIP phone,907.15,4.0,90.72
4,CA-2011-115812,09-06-2011,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Binders,DXL Angle-View Binders with Locking Rings by S...,18.50,3.0,5.78
...,...,...,...,...,...,...,...,...,...,...,...,...
3198,CA-2013-125794,30-09-2013,04-10-2013,Maris LaWare,United States,Los Angeles,California,Accessories,Memorex Mini Travel Drive 64 GB USB 2.0 Flash ...,36.24,1.0,15.22
3199,CA-2014-121258,27-02-2014,04-03-2014,Dave Brooks,United States,Costa Mesa,California,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.96,2.0,15.63
3200,CA-2014-121258,27-02-2014,04-03-2014,Dave Brooks,United States,Costa Mesa,California,Phones,Aastra 57i VoIP phone,258.58,2.0,19.39
3201,CA-2014-121258,27-02-2014,04-03-2014,Dave Brooks,United States,Costa Mesa,California,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.60,4.0,13.32


In [31]:
# Check the dataframe data types
walmart_df.dtypes

Order ID          object
Order Date        object
Ship Date         object
Customer Name     object
Country           object
City              object
State             object
Category          object
Product Name      object
Sales            float64
Quantity         float64
Profit           float64
dtype: object

In [32]:
# Transform the Order Date to a DateTime format
walmart_df["Order Date"] = pd.to_datetime(walmart_df["Order Date"], format="%d-%m-%Y")
walmart_df.dtypes

Order ID                 object
Order Date       datetime64[ns]
Ship Date                object
Customer Name            object
Country                  object
City                     object
State                    object
Category                 object
Product Name             object
Sales                   float64
Quantity                float64
Profit                  float64
dtype: object

In [33]:
# Filter rows where Walmart made a profit
walmart_df = walmart_df[walmart_df["Profit"] > 0]
walmart_df

Unnamed: 0,Order ID,Order Date,Ship Date,Customer Name,Country,City,State,Category,Product Name,Sales,Quantity,Profit
0,CA-2013-138688,2013-06-13,17-06-2013,Darrin Van Huff,United States,Los Angeles,California,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2.0,6.87
1,CA-2011-115812,2011-06-09,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Furnishings,Eldon Expressions Wood and Plastic Desk Access...,48.86,7.0,14.17
2,CA-2011-115812,2011-06-09,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Art,Newell 322,7.28,4.0,1.97
3,CA-2011-115812,2011-06-09,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Phones,Mitel 5320 IP Phone VoIP phone,907.15,4.0,90.72
4,CA-2011-115812,2011-06-09,14-06-2011,Brosina Hoffman,United States,Los Angeles,California,Binders,DXL Angle-View Binders with Locking Rings by S...,18.50,3.0,5.78
...,...,...,...,...,...,...,...,...,...,...,...,...
3198,CA-2013-125794,2013-09-30,04-10-2013,Maris LaWare,United States,Los Angeles,California,Accessories,Memorex Mini Travel Drive 64 GB USB 2.0 Flash ...,36.24,1.0,15.22
3199,CA-2014-121258,2014-02-27,04-03-2014,Dave Brooks,United States,Costa Mesa,California,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.96,2.0,15.63
3200,CA-2014-121258,2014-02-27,04-03-2014,Dave Brooks,United States,Costa Mesa,California,Phones,Aastra 57i VoIP phone,258.58,2.0,19.39
3201,CA-2014-121258,2014-02-27,04-03-2014,Dave Brooks,United States,Costa Mesa,California,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.60,4.0,13.32


In [34]:
# Plot a bar chart using the Walmart dataframe
walmart_bar = px.bar(walmart_df, x="Order Date", y="Sales", color="Quantity")
walmart_bar.show()

In [35]:
# Plot a line chart using the Walmart dataframe
walmart_line = px.line(walmart_df, x="Order Date", y="Sales", color="Quantity")
walmart_line.show()


The line chart does not look right. The reason is that the dataframe is not sorted by date, so the lines go forwards and backwards in time following the row order. To fix it, let's sort the dataframe.

In [36]:
# Sort the dataframe by Order Date
walmart_sorted_df = walmart_df.sort_values(by="Order Date")
walmart_sorted_df

Unnamed: 0,Order ID,Order Date,Ship Date,Customer Name,Country,City,State,Category,Product Name,Sales,Quantity,Profit
1707,CA-2011-130813,2011-01-07,09-01-2011,Lycoris Saunders,United States,Los Angeles,California,Paper,Xerox 225,19.44,3.0,9.33
1579,CA-2011-157147,2011-01-14,19-01-2011,Brian Dahlen,United States,San Francisco,California,Storage,Tennsco 6- and 18-Compartment Lockers,1325.85,5.0,238.65
1581,CA-2011-157147,2011-01-14,19-01-2011,Brian Dahlen,United States,San Francisco,California,Art,4009 Highlighters by Sanford,19.90,5.0,6.57
1580,CA-2011-157147,2011-01-14,19-01-2011,Brian Dahlen,United States,San Francisco,California,Bookcases,"O'Sullivan Elevations Bookcase, Cherry Finish",334.00,3.0,3.93
1719,CA-2011-123477,2011-01-19,22-01-2011,David Wiener,United States,Springfield,Oregon,Appliances,Fellowes Mighty 8 Compact Surge Protector,64.86,4.0,6.49
...,...,...,...,...,...,...,...,...,...,...,...,...
1757,CA-2014-130631,2014-12-30,03-01-2015,Bruce Stewart,United States,Edmonds,Washington,Furnishings,Hand-Finished Solid Wood Document Frame,68.46,2.0,20.54
1641,CA-2014-146626,2014-12-30,06-01-2015,Ben Peterman,United States,Anaheim,California,Furnishings,Nu-Dell Executive Frame,101.12,8.0,37.41
1756,CA-2014-130631,2014-12-30,03-01-2015,Bruce Stewart,United States,Edmonds,Washington,Fasteners,Acco Glide Clips,19.60,5.0,9.60
389,CA-2014-115427,2014-12-31,04-01-2015,Erica Bern,United States,Fairfield,California,Binders,GBC Binding covers,20.72,2.0,6.48


In [37]:
# Plot a line chart using the sorted Walmart dataframe
walmart_line_sorted = px.line(walmart_sorted_df, x="Order Date", y="Sales", color="Quantity")
walmart_line_sorted.update_layout(
    title_font_size=25, # Update the title font size
    title_x=0.5, # Centre the title horizontally
    legend_title_font_size=15 # Update the legend title font size
)
walmart_line_sorted.show()

In [38]:
# Plot a scatter plot using the Walmart dataframe
walmart_scatter = px.scatter(walmart_sorted_df, x="Order Date", y="Sales", size="Profit", color="Quantity")
walmart_scatter.show()

The scatter plot is the chart that supports the most easily differentiated aesthetics at once. It can support the following aesthetics:
- Position
- Colour
- Size

---

### Data visualisation structure with Plotly

Certain elements of a visualisation can be generated automatically from the data structure. However, it is useful to know how to adjust them to suit the narrative we want to tell our stakeholders.

This exercise uses the [Iris dataset](https://www.kaggle.com/datasets/uciml/iris). The Iris dataset contains the sepal and petal width and length of three different iris species. Each species has 50 samples. All the measurements are in centimetres.

In [39]:
# Load the Iris dataframe included in the plotly library
iris = px.data.iris()
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,3
146,6.3,2.5,5.0,1.9,virginica,3
147,6.5,3.0,5.2,2.0,virginica,3
148,6.2,3.4,5.4,2.3,virginica,3


In [40]:
# Plot a scatter plot using the Iris dataframe
iris_fig = px.scatter(iris, 
    x="sepal_length", 
    y="sepal_width", 
    color="species",
    title="Iris Sepals")
iris_fig.show()

1. Manipulating title, labels and legend ([Plotly layout documentation](https://plotly.com/python/figure-labels/))
    - The argument title sets a title for the chart
    - The argument labels maps the original column names from the dataframe to manually set labels
    - The method update_layout lets you modify the general look and feel of the visualisation, including the title and the legend. For instance, you can update the visualisation width, the title font size and the title position

In [41]:
# Update the labels to include relevant information
iris_fig = px.scatter(iris, 
    x="sepal_length", 
    y="sepal_width", 
    color="species",
    title="Iris Sepals",
    labels={
        "sepal_length": "Sepal Length (cm)", # Include the units of measurement
        "sepal_width": "Sepal Width (cm)", # Include the units of measurement
        "species": "Species of Iris" # Provide a complete title
    })
iris_fig.update_layout(
    title_font_size=25, # Update the title font size
    title_x=0.5, # Update the title horizontal position top middle
    legend_title_font_size=15, # Update the legend title font size
    width=750 # Specify the width of the chart
)
iris_fig.show()

2. Manipulating the axes ([Plotly axes layout documentation](https://plotly.com/python/axes/))
    - The method update_xaxes lets you modify the look and feel of the x axis, for instance the title font size and the tick font size. It can also change the number of ticks and their style
    - The method update_yaxes lets you modify the look and feel of the y axis. The options are exactly the same as for the x axis

In [42]:
# Update the font size and ticks in the axes
iris_fig = px.scatter(iris, 
    x="sepal_length", 
    y="sepal_width", 
    color="species",
    title="Iris Sepals",
    labels={
        "sepal_length": "Sepal Length (cm)",
        "sepal_width": "Sepal Width (cm)",
        "species": "Species of Iris"
    })
iris_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    width=750
)
iris_fig.update_xaxes(
    title_font_size=12, # Update x axis label font size
    tickfont_size=10, # Update x axis tick font size
    tick0=0, # Set the value of the first tick
    dtick=1 # Set the distance between ticks
)
iris_fig.update_yaxes(
    title_font_size=12,
    tickfont_size=10,
    dtick=0.25
)
iris_fig.show()

> **Tip:** If you need to download the plot as an image, you can use the toolbar that appears in the top right corner of the visualisation. The camera icon (the first one from the left) downloads the image.

To import an image into the notebook you can use the following code:

```html
<img src="path/file_name.png">
```

For example:

```html
<img src="./images/vis.png">
```

---

### Guidelines to design data visualisations

1. Every chart needs a title, labels and legends
2. Keep it simple. Too many aesthetics in a single chart might confuse the reader
3. Use white space
4. Design with your audience in mind
5. Double check that all the calculations have been done correctly
6. Use colour meaningfully
7. Be mindful of the starting points of the axes

> **Tip:** Remember that data visualisations need to support your story. Data storytelling is critical in design

### Do it yourself

Using the [Earthquake dataset](https://www.kaggle.com/datasets/warcoder/earthquake-dataset), try to create meaningful charts that depict the relationship between magnitude, significance and location. You can use several charts and bring in other variables such as date_time.

What and how would you visualise this dataset?