<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: B3-Visualisations

### What is data visualisation?

Data visualisation is the process of transforming data into a compelling story using graphical tools.

<img src="./graphics/b3-vis-overview.png">

> **Tip:** Tables can also be considered data visualisations if they are used to tell a compelling story

### Benefits of data visualisations

- Data visualisations highlight patterns in the data
- Data visualisations allow complex data to be represented graphically
- Data visualisations support decision-making for different stakeholders

### Can you see any problems with the following visualisations?

<img src="./graphics/b3-bad-example.png" style="width: 49%; height: 300px; float: left;">
<img src="./graphics/b3-better-example.png" style="width: 49%; height: 300px; float: right">

Both visualisations have exactly the same underlying data. However, the order of the data and the decisions on how to build the visualisation are different:
- The visualisation on the left:
    - The data is not sorted
    - The title is missing
    - The legend is not sorted
    
- The visualisation on the right:
    - The data is sorted
    - It has a meaningful title
    - The legend is sorted

The data is only one part of building a data visualisation. The other part is supporting people as they interpret the data and make sense of what is being presented to them.

---

## Data visualisations structure

### Coordinate systems

The coordinate system specifies the position and the scale used to place the data. The most common is the *cartesian coordinate system*, which is composed of two axes commonly named X and Y. This system gives a two-dimensional linear position. Most data visualisations use this coordinate system.

<img src="./graphics/b3-cartesian.jpg" style="width: 500px">

The second most common is the *radial coordinate system*, which gives positions based on a circle. The most common data visualisations that use this system are the pie chart and the donut chart.

<img src="./graphics/b3-radial.png" style="width: 500px">


### Aesthetics

Aesthetics are visual elements of the visualisation that can be mapped to quantifiable data. The most important aesthetics are the following:

1. *Position* is the most powerful aesthetic, as it is the fastest and easiest graphical feature to differentiate
2. *Colour* provides an easy way to group data belonging to the same category. Colours also carry meaning: for instance, red is commonly used to depict danger while green usually depicts safety
3. *Shape* can be used like colour, but only a limited number of shapes can be easily differentiated
4. *Size* can be used like position, but it is more difficult for the visual system to judge a difference in size accurately
5. *Line type* is similar to shape: only a limited number of line types can be easily differentiated
6. *Line width* is rarely used, as only a limited number of widths can be told apart and differences between them are difficult to judge

<img src="./graphics/b3-aesthetics.png" style="width: 500px">


### How to visualise complex data

Vision is one of the fastest human senses, so data visualisations need to support it fully in order to deliver meaningful messages.

Different data types are better represented in different formats:

- *Temporal data* is usually represented in line graphs and scatter plots
- *Hierarchical data* is usually represented in tree maps and tree diagrams
- *Network data* is usually represented in matrix and node-link diagrams
- *Multidimensional data* is usually represented in scatter plots, stacked bar charts and parallel coordinate plots
- *Geographical data* is usually represented in choropleth maps and cartograms

> **Tip:** The previous list suggests what is normally used for each data type. However, other charts could be valuable if the visualisation has a different intention
---

### Visualising datasets with different visualisation types

In the next section we are going to compare how the same data can look different using different visualisations. The most important thing to remember is to choose the type of visualisation that best supports your narrative.


#### Temporal data

Let's visualise temporal data in different charts to see how it can be easier to understand. The dataset is the [Daily minimum temperatures in Melbourne](https://www.kaggle.com/datasets/paulbrabban/daily-minimum-temperatures-in-melbourne) from Kaggle.

In [None]:
import pandas as pd
import plotly.express as px # Data visualisation library

In [None]:
# Load the data from the CSV file
path = "data/"
file_name = "b3-daily-minimum-MEL.csv"
temp_df = pd.read_csv(???)
temp_df

When working with dates, it is important to check that the imported data is in the correct format. Otherwise, calculations and operations will not work as expected because the values are strings (object type). The most common problem shows up when trying to sort dates that are still stored as strings.
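
The sorting problem can be reproduced on a few made-up dates (these values are illustrative and are not taken from the Melbourne dataset). Sorted as strings, the dates are ordered character by character; sorted as DateTime values, they are ordered chronologically.

```python
import pandas as pd

# Hypothetical dates stored as text in day/month/year format
dates_as_text = pd.Series(["15/01/1981", "02/03/1981", "10/02/1981"])

# Sorting strings compares characters, so the day of the month decides the order
text_sorted = dates_as_text.sort_values().tolist()
# ['02/03/1981', '10/02/1981', '15/01/1981']

# After conversion, sorting follows the calendar
dates_parsed = pd.to_datetime(dates_as_text, format="%d/%m/%Y")
date_sorted = dates_parsed.sort_values().dt.strftime("%d/%m/%Y").tolist()
# ['15/01/1981', '10/02/1981', '02/03/1981']
```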

In [None]:
# Checking the data types of our dataset
temp_df.dtypes

In [None]:
# Convert the date column to a DateTime type matching the string format "%d/%m/%Y"
temp_df["Date"] = pd.to_datetime(temp_df[???], format=???)
temp_df.dtypes

In [None]:
# Plot a bar chart using the temperature dataframe
temp_bar = px.bar(???, x="Date", y="Daily minimum temperatures in Melbourne, Australia, 1981-1990")
temp_bar.show()

In [None]:
# Fix the y-axis label and add a title
temp_bar = px.bar(temp_df, x="Date", y="Daily minimum temperatures in Melbourne, Australia, 1981-1990",
                  labels={
                      ???: "Minimum (degrees Celsius)"
                  },
                  title=???)
temp_bar.show()

In [None]:
# An easier way may be to rename the column
temp2_df = temp_df.rename(columns={"Daily minimum temperatures in Melbourne, Australia, 1981-1990": "Minimum (degrees Celsius)"})
temp2_df.head()

In [None]:
# Plot a scatter plot using the temperature dataframe
temp_scatter = px.scatter(temp2_df, x=???, y=???)
temp_scatter.show()

In [None]:
# Plot a line graph using the temperature dataframe
temp_line = px.line(???, x=???, y=???)
temp_line.show()

The line graph provides the best graphical description for this type of data for the following reasons:
- The line gives a sense of continuity in time
- The line provides an easy way to follow the trend
- The line highlights the patterns

#### Hierarchical data

Let's visualise hierarchical data in different charts to see how it can be easier to understand. The dataset is the [Formula 1 pilots of all times](https://www.kaggle.com/datasets/bvovczak/f1-pilots) from Kaggle.

In [None]:
# Load the data from the CSV file
path = "data/"
file_name = "b3-f1_data.csv"
f1_df = pd.read_csv(f"{path}{file_name}")
f1_df

In [None]:
# Plot a stacked bar chart using the f1 dataframe
f1_bar = px.bar(???, x="Nationality", y="Race_Wins", color="Champion")
f1_bar.show()

We need a visualisation that makes it easier to compare the values we care about. A treemap can be helpful for hierarchical data.

In [None]:
# Plot a treemap using the f1 dataframe
# px.Constant creates a common root for the data, which treemaps need
f1_tree = px.treemap(???, path=[px.Constant("F1 Wins"), "Champion", "Nationality"], values="Race_Wins")
f1_tree.update_traces(root_color="lightgrey")
f1_tree.update_layout(margin=dict(t=25, l=25, r=25, b=25))
f1_tree.show()

#### Multidimensional data

Let's visualise multidimensional data in different charts to see how it can be easier to understand. The data set is [Walmart orders](https://www.kaggle.com/datasets/matthewcornfield/wallmart) from Kaggle.

In [None]:
# Load the data from the CSV file
path = "data/"
file_name = "b3-walmart_data.csv"
walmart_df = pd.read_csv(f"{path}{file_name}")
walmart_df

In [None]:
# Check the dataframe data types
walmart_df.dtypes

In [None]:
# Transform the Order Date to a DateTime format
walmart_df["Order Date"] = pd.to_datetime(walmart_df["Order Date"], format=???)
walmart_df.dtypes

In [None]:
# Filter rows where Walmart made a profit
walmart_df = walmart_df[walmart_df["Profit"] > 0]
walmart_df

In [None]:
# Plot a bar chart using the Walmart dataframe
walmart_bar = px.bar(walmart_df, x="Order Date", y="Sales", color="Quantity")
walmart_bar.show()

In [None]:
# Plot a line chart using the Walmart dataframe
walmart_line = px.line(walmart_df, x=???, y=???, color=???)
walmart_line.show()

The line chart does not look properly formatted. The reason is that the dataframe is not sorted by date, so the lines go forwards and backwards along the timeline depending on the order of the rows. To fix it, let's sort the dataframe.

In [None]:
# Sort the dataframe by Order Date
??? = walmart_df.sort_values(by="Order Date")
???

In [None]:
# Plot a line chart using the Walmart dataframe
walmart_line_sorted = px.line(???, x=???, y=???, color=???)
walmart_line_sorted.update_layout(
    title_font_size=25, # Update the title font size
    title_x=0.5, # Update the title horizontal position top middle
    legend_title_font_size=15
)
walmart_line_sorted.show()

In [None]:
# Plot a scatter plot using the Walmart dataframe
walmart_scatter = px.scatter(???, x="Order Date", y="Sales", size="Profit", color="Quantity")
walmart_scatter.show()

Of the charts above, the scatter plot supports the most aesthetics that are easily differentiated:
- Position
- Colour
- Size
---

### Data visualisation structure with Plotly

Certain elements of a visualisation can be generated automatically from the data structure. However, it is useful to know how to manipulate them to suit the narrative we want to tell our stakeholders.

This exercise uses the [Iris dataset](https://www.kaggle.com/datasets/uciml/iris). The Iris dataset contains the sepal and petal width and length of three iris species. Each species has 50 samples. All measurements are in centimetres.

In [None]:
# Import the Iris dataframe included in the plotly library
iris = px.data.iris()
iris

In [None]:
# Plot a scatter plot using the Iris dataframe
iris_fig = px.scatter(iris, 
    x="sepal_length", 
    y="sepal_width", 
    color="species",
    title="Iris Sepals")
iris_fig.show()

1. Manipulating title, labels and legend ([Plotly layout documentation](https://plotly.com/python/figure-labels/))
    - The property title sets the title of the chart
    - The property labels maps the original column names from the DataFrame to manually set labels
    - The method update_layout allows you to modify the overall look and feel of the visualisation, including the title and legend. For instance, you can update the visualisation width, the title font size, the title position, etc.

In [None]:
# Update the labels to include relevant information
iris_fig = px.scatter(iris, 
    x="sepal_length", 
    y="sepal_width", 
    color="species",
    title="Iris Sepals",
    labels={
        "sepal_length": "Sepal Length (cm)", # Include the units of measurement
        "sepal_width": "Sepal Width (cm)", # Include the units of measurement
        "species": "Species of Iris" # Provide a complete title
    })
iris_fig.update_layout(
    title_font_size=25, # Update the title font size
    title_x=0.5, # Update the title horizontal position top middle
    legend_title_font_size=15, # Update the legend title font size
    width=750 # Specify the width of the chart
)
iris_fig.show()

2. Manipulating the axes ([Plotly axes layout documentation](https://plotly.com/python/axes/))
    - The method update_xaxes allows you to modify the look and feel of the x axis, for instance the title font size and the tick font size. The number of ticks and their style can also be modified
    - The method update_yaxes allows you to modify the look and feel of the y axis. The options are exactly the same as for the x axis

In [None]:
# Update the font size and ticks in the axes
iris_fig = px.scatter(iris, 
    x="sepal_length", 
    y="sepal_width", 
    color="species",
    title="Iris Sepals",
    labels={
        "sepal_length": "Sepal Length (cm)",
        "sepal_width": "Sepal Width (cm)",
        "species": "Species of Iris"
    })
iris_fig.update_layout(
    title_font_size=25,
    title_x=0.5,
    legend_title_font_size=15,
    width=750
)
iris_fig.update_xaxes(
    title_font_size=12, # Update x axis label font size
    tickfont_size=10, # Update x axis tick font size
    tick0=0, # Set the value of the first tick
    dtick=1 # Set the distance between ticks
)
iris_fig.update_yaxes(
    title_font_size=12,
    tickfont_size=10,
    dtick=0.25
)
iris_fig.show()

> **Tip:** If you need to download the plot as an image, you can use the tools that appear in the top right corner of the visualisation. The camera icon (the first from the left) allows you to download the image.

To import an image into the notebook you can use the following code:

```html
<img src="path/file_name.png">
```

For example:

```html
<img src="./images/vis.png">
```

---

### Guidelines to design data visualisations

1. Every chart needs a title, labels and legends
2. Keep it simple. Too many aesthetics in a single chart might confuse the reader
3. Use white space
4. Design with your audience in mind
5. Double check that all the calculations have been done correctly
6. Use colour meaningfully
7. Be mindful of the starting points of the axes

> **Tip:** Remember that data visualisations need to support your story. Data storytelling is critical in design

### Do it yourself

Using the [Earthquake dataset](https://www.kaggle.com/datasets/warcoder/earthquake-dataset), try to create meaningful charts that depict the relationship between magnitude, significance and location. You can use several charts and bring in other variables such as date_time.

What and how would you visualise this dataset?