# Interactive Dashboard with HoloViz

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 20/03/2025   | Martin | Create  | Created notebook | 
| 21/03/2025   | Martin | Update  | Completed data cleaning process | 
| 22/03/2025   | Martin | Update  | Completed initial 4 plots for visualisation | 
| 23/03/2025   | Martin | Update  | Completed dashboard and documentation | 

# Content

* [Introduction](#introduction)
* [Setup](#setup)
* [The Data](#the-data)
* [The Dashboard](#the-dashboard)
* [Limitations](#limitations)
* [Future Works](#future-works)

# Introduction

This notebook complements the associated article on creating interactive dashboards using components from the HoloViz suite, specifically Panel, hvPlot and HoloViews. These dashboards are data visualisations that let users interact with the data and dive deeper into it.

In this tutorial, we will first clean the data, then build the visualisations and explore the Graduate Employment Survey (GES) conducted by universities in Singapore from 2013 to 2022.

# Setup

If you have `poetry` installed, you can install the dependencies directly after cloning the project:

```
poetry install
```

Otherwise, create a virtual environment and install from the `requirements.txt` in the repo using `pip`:

```
pip install -r requirements.txt
```

# The Data

_More details about the data can be found in the article or from the [source](https://data.gov.sg/datasets/d_3c55210de27fcccda2ed0c63fdd2b352/view)_

The Graduate Employment Survey (GES) is jointly conducted every year by NTU, NUS, SMU, SIT, SUTD and SUSS (universities in Singapore) to survey the employment conditions of graduates about six months after their final examinations. The results are published by the Ministry of Education (MOE).

Here we will perform some data cleaning to ensure a good format for visualisation later.

_If you would like to skip past this part and jump straight to the visualisation, there is a `clean()` function in the Dashboard section._

In [1]:
import pandas as pd
import numpy as np
import re
import hvplot.pandas
import panel as pn
import holoviews as hv
from datetime import datetime

pn.extension()



In [2]:
df = pd.read_csv("data/GraduateEmploymentSurveyNTUNUSSITSMUSUSSSUTD.csv")
df.head()

Unnamed: 0,year,university,school,degree,employment_rate_overall,employment_rate_ft_perm,basic_monthly_mean,basic_monthly_median,gross_monthly_mean,gross_monthly_median,gross_mthly_25_percentile,gross_mthly_75_percentile
0,2013,Nanyang Technological University,College of Business (Nanyang Business School),Accountancy and Business,97.4,96.1,3701,3200,3727,3350,2900,4000
1,2013,Nanyang Technological University,College of Business (Nanyang Business School),Accountancy (3-yr direct Honours Programme),97.1,95.7,2850,2700,2938,2700,2700,2900
2,2013,Nanyang Technological University,College of Business (Nanyang Business School),Business (3-yr direct Honours Programme),90.9,85.7,3053,3000,3214,3000,2700,3500
3,2013,Nanyang Technological University,College of Business (Nanyang Business School),Business and Computing,87.5,87.5,3557,3400,3615,3400,3000,4100
4,2013,Nanyang Technological University,College of Engineering,Aerospace Engineering,95.3,95.3,3494,3500,3536,3500,3100,3816


First we'll perform some basic data sanitisation - removing NA values and declaring the column data types

In [3]:
# Remove the rows with na (str)
df = df[df['employment_rate_overall'] != 'na']

# Specify the datatypes of columns
cols_to_change = [col for col in df.columns if col not in 
                  ['year', 'university', 'school', 'degree']]
df = df.astype(dtype={
  col: 'float64' for col in cols_to_change
})
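As an alternative to filtering on a single column, the same sanitisation can be sketched with `pd.to_numeric`. This is a minimal illustration on a made-up three-row frame (the values and the choice of columns are assumptions, not taken from the full dataset); it coerces every `"na"` string to `NaN` and then drops the affected rows.

```python
import pandas as pd

# Toy frame mimicking the raw file, where missing values are the string "na"
raw = pd.DataFrame({
    "degree": ["Accountancy", "Business", "Law"],
    "employment_rate_overall": ["97.4", "na", "90.9"],
    "gross_monthly_median": ["3350", "na", "3000"],
})

numeric_cols = ["employment_rate_overall", "gross_monthly_median"]

# errors="coerce" turns anything unparseable (here "na") into NaN
converted = raw.copy()
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Drop rows where any numeric column failed to parse
converted = converted.dropna(subset=numeric_cols).reset_index(drop=True)

print(converted.dtypes)
```

This approach also catches an `"na"` that appears in a numeric column other than `employment_rate_overall`, at the cost of dropping more rows.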

Next, we observe that for each University there might be different names for the degrees and schools, and for degrees there are distinctions between the level of degree obtained (Hons, Cum Laude, etc.). As such, we attempt to remove these distinctions and separate them into different columns to ensure better separation

_Note: The cleaning might not be 100% accurate, so please excuse some missed distinctions_

In [4]:
# Change the datatype of year column
df['year'] = pd.to_datetime(df['year'], format="%Y")

In [5]:
# Cleaning the school column
# Remove any details from brackets
df['school'] = df['school'].str.replace(r'\(.*?\)', '', regex=True)

# Remove any special characters from the back
df['school'] = df['school'].str.replace(r'[\*|\\|\#]+', '', regex=True)

# Remove the white space after dash
df['school'] = df['school'].str.replace(r'-\s', '-', regex=True)

# Remove leading and trailing whitespace
df['school'] = df['school'].str.strip()
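To see what these four steps do to a single value, here is a small sketch that applies the same patterns with the standard `re` module. The helper name `clean_school` and the second input string are made up for illustration.

```python
import re

def clean_school(name: str) -> str:
    """Apply the same four cleaning steps as the cell above to one string."""
    name = re.sub(r'\(.*?\)', '', name)       # drop bracketed details
    name = re.sub(r'[\*|\\|\#]+', '', name)   # drop markers such as * and # (the class also matches a literal |)
    name = re.sub(r'-\s', '-', name)          # remove the white space after a dash
    return name.strip()                       # trim leading and trailing whitespace

print(clean_school("College of Business (Nanyang Business School)"))
print(clean_school("Multi- Disciplinary Programme *"))  # hypothetical value
```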

In [6]:
# Cleaning the degree column
# Remove special characters from the back
df['degree'] = df['degree'].str.replace(r'[\*|\\|\#|\^|\.]+', '', regex=True)

# Extract out if they were honours or cum laude programs
df['advanced'] = np.where(df['degree'].str.contains(r'Honours|\(Hons\)|Cum\s+Laude'), 1, 0)
remove_advanced = r'\s+with\s+Honours|\(Hons\)|\(?Cum\sLaude\sand\sabove\)?'
df['degree'] = df['degree'].str.replace(remove_advanced, '', regex=True)

# Remove the length of degree
df['degree'] = df['degree'].str.replace(r'\([^()]*\d[^()]*\)', '', regex=True)

# Remove non-degree related terms
df['degree'] = df['degree'].str.replace(r'\(LLB\)|\(MBBS\)|\(Land\)', '', regex=True)

# Some degree types are hidden between brackets so we extract them
temp = df['degree'].str.extract(r'\(([^)]+)\)')
df.loc[temp[~temp[0].isna()].index, 'degree'] = temp[~temp[0].isna()][0]

# Some degrees are also only expressed after the word "in"
temp = df['degree'].str.extract(r'\bin\b\s+(.*?)$')
df.loc[temp[~temp[0].isna()].index, 'degree'] = temp[~temp[0].isna()][0]

# Remove term "Bachelor of"
df['degree'] = df['degree'].str.replace(r'Bachelor\sof\s?', '', regex=True, case=False)

# Replace some special characters with their word equivalents
df['degree'] = df['degree'].str.replace('&', 'and')
df['degree'] = df['degree'].str.replace('/', ' and ')
df['degree'] = df['degree'].str.replace('with', '')
df['degree'] = df['degree'].str.replace(r'\s+', ' ', regex=True)
df['degree'] = df['degree'].str.replace(r's$', '', regex=True)

# Remove leading and trailing whitespace
df['degree'] = df['degree'].str.strip()

# Reset the index
df = df.reset_index(drop=True)
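One side effect worth knowing about: the `r's$'` step removes any trailing "s", not only plural endings. The sketch below reproduces that behaviour and shows one possible variant, which is an assumption of ours and not part of the original cleaning: a negative lookbehind that leaves words ending in "ss" untouched.

```python
import re

def strip_trailing_s(degree: str) -> str:
    # Same pattern as the cell above: removes any final "s"
    return re.sub(r's$', '', degree)

def strip_plural_s(degree: str) -> str:
    # Hypothetical variant: skip a final "s" that follows another "s"
    return re.sub(r'(?<!s)s$', '', degree)

for name in ["Accountancy and Business", "Social Sciences", "Arts"]:
    print(name, "->", strip_trailing_s(name), "|", strip_plural_s(name))
```

The variant is still a heuristic. A name such as "Economics" would still lose its final letter, so a lookup table of known degree names may be more reliable.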

In [7]:
df.head()

Unnamed: 0,year,university,school,degree,employment_rate_overall,employment_rate_ft_perm,basic_monthly_mean,basic_monthly_median,gross_monthly_mean,gross_monthly_median,gross_mthly_25_percentile,gross_mthly_75_percentile,advanced
0,2013-01-01,Nanyang Technological University,College of Business,Accountancy and Busines,97.4,96.1,3701.0,3200.0,3727.0,3350.0,2900.0,4000.0,0
1,2013-01-01,Nanyang Technological University,College of Business,Accountancy,97.1,95.7,2850.0,2700.0,2938.0,2700.0,2700.0,2900.0,1
2,2013-01-01,Nanyang Technological University,College of Business,Business,90.9,85.7,3053.0,3000.0,3214.0,3000.0,2700.0,3500.0,1
3,2013-01-01,Nanyang Technological University,College of Business,Business and Computing,87.5,87.5,3557.0,3400.0,3615.0,3400.0,3000.0,4100.0,0
4,2013-01-01,Nanyang Technological University,College of Engineering,Aerospace Engineering,95.3,95.3,3494.0,3500.0,3536.0,3500.0,3100.0,3816.0,0


---

# The Dashboard

Now that the data is prepared, we'll create individual plots and describe how each of them can be used to interpret the data set. Then we'll combine them into a complete dashboard for viewing.

In [8]:
# You can load the data using this function if you want to jump straight to the dashboard
from utils import clean
df = pd.read_csv('data/GraduateEmploymentSurveyNTUNUSSITSMUSUSSSUTD.csv')
df = clean(df)

## Create the control scheme

We create controls for the different columns so that users can filter the data according to what they want to observe.

In [9]:
# Remove the unnecessary warnings from panel
import warnings

with warnings.catch_warnings(): 
    warnings.simplefilter("ignore", category=UserWarning) #specifies the type of warnings to ignore
    pn.config.layout_compatibility = 'warn'

In [10]:
# Here we define the set of values that are available for each dashboard control widget
universities = list(df['university'].unique())
years = list(df['year'].dt.date.unique())
degrees = list(df['degree'].unique())

COLOURS = ['#264653', '#e9c46a', '#e76f51', '#2a9d8f', '#f4a261']

When defining the controls, we select which widget to display and then define its configuration. Generally each widget takes the following:
* `name` - Defines the name (and sometimes title) of the widget
* `value` - The default value when the widget is first created
* `options` - (if categorical) Defines the set of options available for that widget

In [11]:
# Dropdown for universities
## Allows users to select a single university to inspect
uni_select = pn.widgets.Select(
  name='Select Universities:',
  value='National University of Singapore',
  options=universities,
)

## Allow users to select the range of years they want to see
year_slider = pn.widgets.DateRangeSlider(
  name='Graduate Date Range',
  start=years[0],
  end=years[-1],
  value=(years[0], years[-1]),
  step=365,
  # format='%Y'
)

## Allow users to select multiple degrees at the same time
degree_multi_select = pn.widgets.MultiChoice(
  name='Degrees to Compare (limit 5)',
  options=degrees,
  value=['Accountancy', 'Business Administration', 'Social Science', 'Art'],
  max_items=5
)

pn.Column(
  uni_select,
  pn.Row(year_slider, degree_multi_select)
)


## Plot 1: Scatter + Line Plot 

We'll first build a scatter and line plot that showcases the changing mean monthly salary for fresh graduates across the different years. This plot is good for showcasing the historical changes for different categorical groups of data. Since our data is a time-series, it would be good to observe the change over time.

The plot should be filterable by university and degree. This allows the user to pick a university and compare the salary trends of its degrees over the selected years.

In [12]:
@pn.depends(uni_select, year_slider, degree_multi_select)
def scatter_line_plot(uni_select, year_slider, degree_multi_select):
  df1 = df.copy()

  # Build the list of selected years as strings, inclusive of both the start and end year
  start_year, end_year = year_slider[0].year, year_slider[1].year
  year_range = [str(year) for year in range(start_year, end_year + 1)]
  df1['year'] = df1['year'].dt.strftime("%Y")

  # Prepare the data for the plot
  df1 = df1[
    (df1['university'] == uni_select) &
    (df1['year'].isin(year_range)) &
    (df1['degree'].isin(degree_multi_select)) &
    (df1['advanced'] == 0) # automatically filter out advanced further studies
  ]
  df1 = df1.sort_values('year')

  # Pivot the dataframe to match hvplot format
  df2 = df1.pivot(index='year', columns='degree', values='basic_monthly_mean').reset_index()
  cols_for_line = list(df2.columns.drop('year'))

  # Plot the scatter points
  scatter_plot = df1.hvplot.scatter(
    x='year',
    y='basic_monthly_mean',
    by='degree',
    size=70,
    hover_cols=['year', 'basic_monthly_mean'],
    title=f'Mean Monthly Salary for Jobs in {uni_select}',
    xlabel='Year',
    ylabel='Mean Monthly Salary',
    color=COLOURS[:len(degree_multi_select)],
    legend='top'
  )

  line_plot = df2.hvplot.line(
    x='year',
    y=cols_for_line,
    by='degree',
    color=COLOURS[:len(degree_multi_select)]
  )

  # Combine the scatter plot and line plot together
  plot1 = line_plot * scatter_plot

  return plot1

# Create layout
layout = pn.Column(
  pn.Row(uni_select),
  pn.Row(year_slider, degree_multi_select),
  scatter_line_plot
)

# Display
layout


One observation is that it's difficult to control the colour assigned to each category when the categories are drawn by separate plots, i.e. scatter and line. As seen above, the dots and lines are not always the same colour. This appears to happen because `hvplot` assigns colours based on the order of categories in the dataframe it is given, and that order differs between the scatter and line plots.

But from here we can see that the mean monthly salary for each degree has been increasing over time since 2014. Of note, Business Administration has seen the largest increase, while Accountancy experienced a sharp drop in 2021, but rebounded significantly in 2022.
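The year filter inside `scatter_line_plot` can be sketched in isolation. The helper below is hypothetical (the name `years_in_range` is ours) and assumes the intended behaviour is to include both ends of the slider range.

```python
from datetime import date

def years_in_range(start: date, end: date) -> list[str]:
    """Return every calendar year from start to end, inclusive, as strings."""
    return [str(year) for year in range(start.year, end.year + 1)]

print(years_in_range(date(2013, 1, 1), date(2016, 1, 1)))
# ['2013', '2014', '2015', '2016']
```

Returning strings keeps the result directly comparable with a `year` column formatted via `strftime("%Y")`.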

## Plot 2: Error Bars

The error bars plot shows the 25th percentile, median and 75th percentile for each degree. It is a simplified boxplot (here we are limited by the availability of data). This plot is good for showing the distribution of data across multiple categories, since a single median value might not describe clearly enough how wide the variation is.

It should allow a filter on the university and degree. This allows the user to have a better understanding of the distribution of the Gross Salary for each degree.
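To make the asymmetric errors concrete, here is a small worked sketch using the gross salary figures from the first row of the data preview shown earlier (median 3350, 25th percentile 2900, 75th percentile 4000). The lower and upper errors are the distances from the median to each percentile.

```python
import pandas as pd

row = pd.DataFrame({
    "degree": ["Accountancy and Business"],
    "gross_monthly_median": [3350.0],
    "gross_mthly_25_percentile": [2900.0],
    "gross_mthly_75_percentile": [4000.0],
})

# Distance from the median down to the 25th percentile
row["lower_error"] = row["gross_monthly_median"] - row["gross_mthly_25_percentile"]
# Distance from the median up to the 75th percentile
row["upper_error"] = row["gross_mthly_75_percentile"] - row["gross_monthly_median"]

print(row[["lower_error", "upper_error"]])
```

The two errors differ (450 below, 650 above), which is why the plot needs separate lower and upper values instead of a single symmetric error.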

In [14]:
@pn.depends(uni_select, year_slider, degree_multi_select)
def error_bars_plot(uni_select, year_slider, degree_multi_select):
  df2 = df.copy()

  # Prepare the data for the plot
  df2['year'] = df2['year'].dt.strftime("%Y")
  df2 = df2[
    (df2['university'] == uni_select) &
    (df2['year'] == str(year_slider[1].year)) &
    (df2['degree'].isin(degree_multi_select)) &
    (df2['advanced'] == 0) # automatically filter out advanced further studies
  ]
  # Get the difference between the 25th/ 75th percentile and the median
  df2['lower_error'] = df2['gross_monthly_median'] - df2['gross_mthly_25_percentile']
  df2['upper_error'] = df2['gross_mthly_75_percentile'] - df2['gross_monthly_median']

  # Create the scatter plot point for median
  median = df2.hvplot.scatter(
    x='degree',
    y='gross_monthly_median',
    color='#39b9e8',
    size=100,
    marker='s',
    ylabel='Gross Monthly Salary',
    xlabel='Degree',
    title=f'Errors for Gross Monthly Salary in {year_slider[1].year}',
    legend=False
  )

  # Create the error bars using the calculated errors
  errors = df2.hvplot.errorbars(
    x='degree',
    y='gross_monthly_median',  # Median as central value
    yerr1='lower_error',
    yerr2="upper_error",
    color='black',
    line_width=2,
    legend=False
  )

  plot2 = errors * median

  return plot2

# Create layout
layout = pn.Column(
  pn.Row(uni_select),
  pn.Row(year_slider, degree_multi_select),
  error_bars_plot
)

# Display
layout


Even though Business Administration has the highest median Gross Monthly Salary amongst the listed degrees, the 75th percentile for Accountancy sits at roughly that median, which suggests that the higher-paying Accountancy roles only reach about the median Business Administration salary. Meanwhile, Art and Social Science show less variation in the salary offered to fresh graduates.

## Plot 3: Bar Graph

Next we'll build a bar graph to compare the employment rates between the various degrees in the __latest year__ selected. The bar graph is a good way to compare numerical values side by side for each category, since it provides a common axis to compare them against.

The visualisation should filter on the university and degree. This will allow users to compare the employment rates of the degrees without having to dig into the data and sort the values themselves.

In [15]:
@pn.depends(uni_select, year_slider, degree_multi_select)
def bar_graphs_plot(uni_select, year_slider, degree_multi_select):
  df3 = df.copy()
  # Prepare the data for the plot
  df3['year'] = df3['year'].dt.strftime("%Y")
  df3 = df3[
    (df3['university'] == uni_select) &
    (df3['year'] == str(year_slider[1].year)) &
    (df3['degree'].isin(degree_multi_select)) &
    (df3['advanced'] == 0) # automatically filter out advanced further studies
  ]
  df3 = df3.sort_values('employment_rate_overall')

  # Plot the bar graph
  plot3 = df3.hvplot.barh(
    y='employment_rate_overall',
    x='degree',
    color='#39b9e8',
    ylabel='Overall Employment Rate (%)',
    xlabel='Degree',
    title=f'Overall Employment Rate in \n{uni_select} ({year_slider[1].year})'
  )

  return plot3

# Create layout
layout = pn.Column(
  pn.Row(uni_select),
  pn.Row(year_slider, degree_multi_select),
  bar_graphs_plot
)

# Display
layout


From this, we see that although Business Administration had the highest median salary, it has the lowest employment rate of the selected degrees. One possible explanation is that higher pay means fewer hires, since each hire costs the company more. But the overall employment rate is still relatively high (>85%) for each degree.

## Plot 4: Dumbbell Plot

The dumbbell plot is used to emphasise gaps in values between two groups of data. In this case, we use it to showcase the difference in mean salary between graduates of the standard degree and graduates of the same degree with honours or cum laude distinctions.

This allows users to directly compare those degrees that have different tiers.
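The reshaping this plot relies on can be sketched on toy data: keep only the degrees that have both a standard and an advanced record, then pivot so each tier becomes its own column. The salary values below are made up, and `nunique()` is used here as one possible way to detect both tiers.

```python
import pandas as pd

toy = pd.DataFrame({
    "degree": ["Accountancy", "Accountancy", "Social Science"],
    "advanced": [0, 1, 0],
    "basic_monthly_mean": [3000.0, 2850.0, 3200.0],  # made-up values
})

# Degrees that appear with both advanced == 0 and advanced == 1
tiers = toy.groupby("degree")["advanced"].nunique()
both = tiers[tiers == 2].index

# One row per degree, one column per tier
wide = (
    toy[toy["degree"].isin(both)]
    .pivot(index="degree", columns="advanced", values="basic_monthly_mean")
    .rename(columns={0: "start", 1: "end"})
    .reset_index()
)
wide["gap"] = wide["end"] - wide["start"]

print(wide)
```

Each row of `wide` then describes one dumbbell: `start` and `end` are the two points, and `gap` is the length of the segment between them.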

In [18]:
@pn.depends(uni_select, year_slider, degree_multi_select)
def dumbbell_plot(uni_select, year_slider, degree_multi_select):
  df4 = df.copy()

  # Prepare the data for the plot
  df4['year'] = df4['year'].dt.strftime("%Y")
  df4 = df4[
    (df4['university'] == uni_select) &
    (df4['year'] == str(year_slider[1].year)) &
    (df4['degree'].isin(degree_multi_select))
  ]

  # Keep only degrees that have both a standard and an advanced record
  adv_filter = df4.groupby('degree').count().reset_index()
  adv_filter = list(adv_filter[adv_filter['advanced'] == 2]['degree'])
  df4 = df4[df4['degree'].isin(adv_filter)]

  # Pivot dataframe into the advanced and not advanced columns
  df4 = df4.pivot(index='degree', values='basic_monthly_mean', columns='advanced').reset_index()
  df4 = df4.rename({0: 'start', 1: 'end'}, axis=1)

  # Create the dumbbell plot
  ## Scatter plot points for advanced and not advanced values
  start_points = df4.hvplot.scatter(
    x='start',
    y='degree',
    color="#39b9e8",
    size=100,
    label='Without Higher Qualifications'
  )
  end_points = df4.hvplot.scatter(
    x='end',
    y='degree',
    color="#91dc4c",
    size=100,
    label='With Higher Qualifications'
  )

  # Create line segments connecting the points
  segments = hv.Segments([
    (
      df4.loc[i, "start"],
      df4.loc[i, "degree"], 
      df4.loc[i, "end"],
      df4.loc[i, "degree"]
    ) 
  for i in df4.index]).opts(color="black", line_width=2)

  # Combine plots
  plot4 = (segments * start_points * end_points) \
    .opts(
      title=f"Difference in Mean Monthly Salary for Jobs \nwith Different Qualification Levels in {year_slider[1].year}",
      xlabel="Mean Monthly Salary",
      ylabel="Degree",
      legend_position="top"
    )

  return plot4

# Create layout
layout = pn.Column(
  pn.Row(uni_select),
  pn.Row(year_slider, degree_multi_select),
  dumbbell_plot
)

# Display
layout


Here Social Science is not included since it doesn't have an honours program or data for its cum laude graduates. Surprisingly, the degrees with higher qualifications (Business Administration and Accountancy) have a lower Mean Monthly Salary.

## Putting it All Together

Finally, we combine all the plots we have created into one large view. This lets users see all the information at once and drive it from a single control panel.

In [21]:
layout = pn.Column(
  pn.Row("# Singapore Graduate Employment Survey (2013-2022)"),
  pn.Row("## Control Panel"),
  pn.Row(uni_select),
  pn.Row(year_slider, degree_multi_select),
  pn.Row("## Plots"),
  pn.Row(scatter_line_plot, error_bars_plot),
  pn.Row(bar_graphs_plot, dumbbell_plot)
)
layout


In [None]:
# To launch it on browser in a separate window
# layout.show()


Each plot provides details about graduate outcomes for each degree at the selected university. The Scatter + Line plot provides historical information on salary performance; although this leaves it slightly isolated from the rest of the plots, it provides key insights into each degree's performance over time. The remaining plots focus on the latest year selected and show how the degrees stack up against each other, providing comparisons of employability, the distribution of salaries and whether a higher qualification lands you higher pay.

All in all, this dashboard serves as a good view for prospective students to compare degrees that they are interested in for any of the available universities in Singapore.

# Limitations

🚨 __ALERT: There is an `UnknownReferenceError` that pops up every time the date values are changed. This is a [known issue](https://discourse.holoviz.org/t/panel-unknownreferenceerror-cant-resolve-reference/8417) that has not been resolved when working with different date steps. Please refresh the notebook if the error logs get too long.__

While the dashboard is comprehensive, it has some key limitations:

* Unable to compare between degrees from different Universities
* Data cleaning limitations - some degrees are the same but have different names
* Limited to 5 degrees due to space constraint
* Limited to 1 year for some plots due to the plots' nature

# Future Works

Suggestions to improve dashboard:
* Provide an additional filter layer to compare between Universities
* Invest more time into data cleaning efforts to ensure better comparisons
* `hvplot` has other library extensions that can potentially be used for more detailed plots
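As a rough sketch of the first suggestion, the filter could accept a list of universities instead of a single one, for example fed by a multi-select widget. The helper name `filter_rows` and the toy data are hypothetical.

```python
import pandas as pd

def filter_rows(data: pd.DataFrame, universities: list[str], degrees: list[str]) -> pd.DataFrame:
    """Keep rows whose university and degree are both in the selected lists."""
    mask = data["university"].isin(universities) & data["degree"].isin(degrees)
    return data[mask]

toy = pd.DataFrame({
    "university": ["Uni A", "Uni B", "Uni C", "Uni A"],
    "degree": ["Accountancy", "Accountancy", "Accountancy", "Law"],
    "basic_monthly_mean": [3000.0, 3100.0, 2900.0, 4500.0],  # made-up values
})

subset = filter_rows(toy, ["Uni A", "Uni B"], ["Accountancy"])
print(subset)
```

With a filter like this, the plots would also need to distinguish universities visually, for instance by grouping on both `university` and `degree`.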