The paper [The Visual Language of Multidimensional Data Projection: A Visualization Taxonomy and Informed Insights](https://) discusses the need for alternative ways of visualizing Multi-Dimensional Projections (MDPs). This project demonstrates such alternatives through three use cases. Because data instances are related in complex ways, different visualization techniques (encodings and interactions) need to be explored to make these complexities clear. The goal of the project is not to propose new visualizations but to explore the design space of alternative visualization techniques.



## Use case 1: Principal Component Analysis (PCA)

In this notebook, we start by projecting multidimensional data using PCA. The analytical tasks we consider in this use case are pattern identification (detecting clusters) and membership disambiguation (counting the number of objects in a cluster).

Here, we used the [Ecoli](https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.340110203) dataset from the [UCI machine learning](https://archive.ics.uci.edu/dataset/39/ecoli) repository.


<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 data-toggle="list"  role="tab" aria-controls="home"><p style="font-size : 30px"><font color="darkgrey">Content</font></p></h3>

1. [<font color="darkgrey">One Dimension</font>](#1)   
    - 1.1 [<font color="darkgrey"> 1D scatterplot</font>](#1.1)
    - 1.2 [<font color="darkgrey"> Density plot and its variations</font>](#1.2)
    - 1.3 [<font color="darkgrey"> Histogram and its variations</font>](#1.3)
2. [<font color="darkgrey"> Two Dimensions</font>](#2)
    - 2.1 [<font color="darkgrey">Scatterplot</font>](#2.1)
    - 2.2 [<font color="darkgrey"> Boxplot</font>](#2.2)
    - 2.3 [<font color="darkgrey"> Violin plot</font>](#2.3)
3. [<font color="darkgrey"> N Dimensions</font>](#3)
    - 3.1 [<font color="darkgrey"> Area chart</font>](#3.1)
    - 3.2 [<font color="darkgrey"> Line chart</font>](#3.2)    
4. [<font color="darkgrey">Multiple views</font>](#4)
    

In [1]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
import altair as alt

#import scanpy as sc

from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
# from seurat_py import Seurat

In [2]:
# The CSV file has no header row, so read it with header=None and name the columns directly.
# (DataFrame.append, used previously to restore the first row, was removed in pandas 2.0.)
data = pd.read_csv(
    "https://github.com/jbrownlee/Datasets/raw/master/ecoli.csv",
    header=None,
    names=["mcg", "gvh", "lip", "chg", "aac", "alm1", "alm2", "loc_site"],
)


In [3]:
data

Unnamed: 0,mcg,gvh,lip,chg,aac,alm1,alm2,loc_site
0,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
1,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
2,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
3,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp
4,0.23,0.32,0.48,0.5,0.55,0.25,0.35,cp
...,...,...,...,...,...,...,...,...
331,0.74,0.56,0.48,0.5,0.47,0.68,0.3,pp
332,0.71,0.57,0.48,0.5,0.48,0.35,0.32,pp
333,0.61,0.6,0.48,0.5,0.44,0.39,0.38,pp
334,0.59,0.61,0.48,0.5,0.42,0.42,0.37,pp


In [4]:
data.to_csv('ecoli.csv', index=False)

In [5]:
X_data = data.drop(columns='loc_site')
all_features = X_data.columns
y= data['loc_site']

In [6]:
all_features

Index(['mcg', 'gvh', 'lip', 'chg', 'aac', 'alm1', 'alm2'], dtype='object')

PCA allows us to get an idea of the dimensionality of the dataset.
We check for the cumulative variance of the attributes to see how many attributes are needed to explain the variance in the dataset.
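
As a rough sketch of what "cumulative variance" means here, the snippet below computes the explained variance ratio from the singular values of a standardized matrix and accumulates it. The synthetic data, the variable names, and the 90% threshold are assumptions made for illustration only; they are not taken from the Ecoli dataset or from this notebook's results.

```python
import numpy as np

# Synthetic stand-in for the feature matrix (assumption: 200 samples, 7 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7)) @ rng.normal(size=(7, 7))

# Standardize each column (zero mean, unit variance), as StandardScaler does
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Squared singular values of the centered data are proportional to the
# variance captured along each principal component
singular_values = np.linalg.svd(X_std, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Running total: share of variance explained by the first k components
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # hypothetical cut-off, not taken from the notebook
n_components_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3), n_components_needed)
```

The cumulative curve is non-decreasing and ends at 1, so the number of components to keep can be read off where the curve first crosses the chosen threshold.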

In [7]:
#explain the variance of the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_data)



In [8]:
X_scaled

array([[-0.0517614 , -1.41953086, -0.17514236, ...,  0.49078096,
        -1.20771743, -0.7160837 ],
       [-2.21287637, -0.67596708, -0.17514236, ...,  0.32710612,
        -0.69711074, -0.28566488],
       [ 0.30842443, -0.67596708, -0.17514236, ..., -0.08208098,
        -0.60427317, -0.19001625],
       ...,
       [ 0.56570002,  0.67596708, -0.17514236, ..., -0.49126808,
        -0.51143559, -0.57261076],
       [ 0.46278978,  0.74356378, -0.17514236, ..., -0.65494292,
        -0.37217922, -0.62043507],
       [ 1.23461656,  1.62232098, -0.17514236, ..., -1.55515454,
         0.13842746,  0.09692964]])

In [9]:
pca = PCA().fit(X_scaled)
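# np.cumsum turns the per-component ratios into the cumulative share explained by the first k components
data_plot = pd.DataFrame({'Component Number': 1+np.arange(X_data.shape[1]),
                            'Cumulative explained variance': np.cumsum(pca.explained_variance_ratio_)})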


In [10]:

base = alt.Chart(data_plot).encode().properties(
    title='Cumulative explained variance'
).properties(width=500)

plot = base.mark_bar().encode(
    x=alt.X('Component Number:N', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('Cumulative explained variance:Q', axis=alt.Axis(format='%', title='Percentage')),
    tooltip=[alt.Tooltip('Cumulative explained variance', format='.0%')]
)

plot

In [11]:
principal_components = PCA(n_components=2).fit_transform(X_scaled)
principal_components

array([[-1.29035151e+00, -3.24912472e-01],
       [-1.58601216e+00, -1.03468292e+00],
       [-5.30483123e-01, -1.30495184e-01],
       [-2.61720798e-01,  3.38264075e-01],
       [-1.82436437e+00, -7.31834460e-01],
       [-6.40359260e-01,  9.14984789e-02],
       [-2.14346042e+00, -6.66199646e-01],
       [-1.76945641e+00, -7.93057269e-01],
       [-6.93270938e-01, -1.00629762e+00],
       [-1.57120123e+00,  3.51996438e-02],
       [-1.36023410e+00, -6.84405890e-01],
       [-2.22120086e+00,  3.75520260e-02],
       [-1.75082107e+00, -3.82457321e-01],
       [-9.85480780e-01,  3.58866415e-01],
       [-2.30316491e+00, -1.67055951e-01],
       [-9.62038422e-01, -8.85316631e-01],
       [-1.89186232e+00, -3.39888583e-02],
       [-1.08494954e-01, -1.14888887e+00],
       [-1.99821878e+00, -4.97585720e-01],
       [-1.68384927e+00,  5.60048731e-01],
       [-2.67100999e+00,  7.88527376e-01],
       [-2.35030982e+00, -3.66607105e-01],
       [-8.14334188e-01,  3.31973563e-01],
       [-2.

In [12]:
df_ecoli = data.copy()
df_ecoli['pca1'] = principal_components[:,0]
df_ecoli['pca2'] = principal_components[:,1]
df_ecoli['loc_site'] = y
#df_ecoli['pca3'] = principal_components[:,2]  # requires PCA(n_components=3) or more

In [13]:
df_ecoli

Unnamed: 0,mcg,gvh,lip,chg,aac,alm1,alm2,loc_site,pca1,pca2
0,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp,-1.290352,-0.324912
1,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp,-1.586012,-1.034683
2,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp,-0.530483,-0.130495
3,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp,-0.261721,0.338264
4,0.23,0.32,0.48,0.5,0.55,0.25,0.35,cp,-1.824364,-0.731834
...,...,...,...,...,...,...,...,...,...,...
331,0.74,0.56,0.48,0.5,0.47,0.68,0.3,pp,0.538242,0.811114
332,0.71,0.57,0.48,0.5,0.48,0.35,0.32,pp,-0.366133,0.993589
333,0.61,0.6,0.48,0.5,0.44,0.39,0.38,pp,-0.397212,0.763576
334,0.59,0.61,0.48,0.5,0.42,0.42,0.37,pp,-0.423182,0.761684


As you may notice, the line for the third principal component is commented out, but it can be visualized as well. Note that, by convention, the first and second principal components are examined first because their eigenvectors have the largest eigenvalues, that is, they capture the most variance.
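
To make the link between eigenvectors and the ordering of components concrete, here is a minimal sketch that performs PCA by hand on synthetic data. The data and all variable names are assumptions for illustration; the notebook itself relies on scikit-learn's `PCA`.

```python
import numpy as np

# Synthetic data (assumption: 300 samples, 5 correlated features)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the covariance matrix (eigh is meant for symmetric matrices)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA orders components by descending variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Projecting onto the first two eigenvectors gives the PC 1 and PC 2 scores
scores = X_centered @ eigenvectors[:, :2]

# The variance of each score column equals the corresponding eigenvalue
score_variance = scores.var(axis=0, ddof=1)
print(eigenvalues.round(3), score_variance.round(3))
```

Because the eigenvalues are sorted in descending order, each additional component captures no more variance than the one before it, which is why the first two are usually plotted.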

#### Visualizing projection (MDP)

It can be useful to visualize principal components individually, especially the first and second principal components which contain the most important information. Visualizing these components can help us understand the overall structure of the data and identify any patterns or trends that may be present. 

Visualizing other principal components can also be informative, especially if you are interested in exploring more complex relationships. However, it is important to keep in mind that the higher-order (e.g., third, fourth and so on) principal components typically capture smaller amounts of variability and may be more difficult to interpret.


### Taxonomy

![FlowMDP](sankey_full.png)


<font size="+3" color="grey"><b>1. One Dimension </b></font><br><a id="1"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

<font size="+2" color="grey"><b>1.1 1D scatterplot  </b></font><br><a id="1.1"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

Based on convention, plotting 1D scatterplot
- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), points, color
- Interaction: selection and filter, tooltip
- Layout: Juxtaposition (horizontal concatenation)


NB: The selections in this notebook support single selection (click) and multi-selection (Shift+Click) on the legend; filtering is shown by dimming the alternatives that are not selected.

In [15]:
#binding the marks to the legend using the loc_site variable
selection = alt.selection_multi(fields=['loc_site'], bind='legend')

base = alt.Chart(df_ecoli).encode().properties(
    width=1000,)

#plot for 1D scatterplot
pca1_circle = base.mark_circle(size=60).encode(
    x=alt.X('pca1', title="PC 1"),
    color=alt.Color('loc_site:N'),
    #tooltip=list(all_features),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_selection(
    selection
)

pca2_circle = base.mark_circle(size=60).encode(
    x=alt.X('pca2', title="PC 2"),
    color=alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20') ),
    #tooltip=list(all_features),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
)


(pca1_circle & pca2_circle).properties(
    title= 'Principal Components in 1-dimensional plot',
).configure_title(
    anchor= 'middle',
    fontSize=16  
)



<font size="+2" color="grey"><b>1.2 Density plot and its variations </b></font><br><a id="1.2"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

From the taxonomy, plotting the density plot
- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), area
- Interaction: no interaction used here
- Layout: Juxtaposition (horizontal concatenation)
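
For readers who want to see what a density estimate computes, the sketch below evaluates a Gaussian kernel density estimate with plain NumPy. The bandwidth rule used here is a common normal-reference rule of thumb and is an assumption; it may differ in detail from the default that Altair and Vega-Lite apply in `transform_density`. The data is synthetic.

```python
import numpy as np

def gaussian_kde_1d(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    values = np.asarray(values, dtype=float)
    n = values.size
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption, see the note above)
        bandwidth = 1.06 * values.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per observation, averaged over all observations
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

# Synthetic stand-in for one principal-component column
rng = np.random.default_rng(2)
pc_scores = rng.normal(loc=0.0, scale=1.5, size=336)

grid = np.linspace(pc_scores.min() - 5, pc_scores.max() + 5, 1000)
density = gaussian_kde_1d(pc_scores, grid)

# A density integrates to approximately 1 over its support
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

The area mark in the density plot encodes exactly this curve: the x position is the grid value and the height is the estimated density.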




In [56]:
#Plotted using the area mark
pca1_density= alt.Chart(df_ecoli).transform_density(
    'pca1',
    as_=['pca1', 'density'],
).mark_area().encode(
    x=alt.X('pca1:Q', title="Principal Component 1", scale=alt.Scale(domain=[-4, 5])),
    y=alt.Y('density:Q', scale=alt.Scale(domain=[0.00, 0.50])),
)

pca2_density= alt.Chart(df_ecoli).transform_density(
    'pca2',
    as_=['pca2', 'density'],
).mark_area().encode(
    x=alt.X('pca2:Q', title="Principal Component 2"),
    y=alt.Y('density:Q', scale=alt.Scale(domain=[0.00, 0.50])),
)


(pca1_density | pca2_density).properties(
    title= 'Density plot for Principal Components 1 and 2'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)

#### Creating density plot for the principal components separated by the target variable (localization site).

The densities are computed per localization site, using the `loc_site` variable from the dataset.

- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), area, color, opacity
- Interaction: Select and filter
- Layout: Superimposition(layered) and Juxtaposition (horizontal concatenation)

In [66]:
#plotting each principal component by localization site
selection = alt.selection_multi(fields=['loc_site'], bind='legend')


pca1_group = alt.Chart(df_ecoli).transform_density(
    'pca1',
    groupby=['loc_site'],
    as_=['pca1', 'density'],
).mark_area(fillOpacity=0.5).encode(
    x=alt.X('pca1:Q', title="Principal Component 1",),
    y=alt.Y('density:Q',),
    color=alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_selection(
    selection
)

pca2_group = alt.Chart(df_ecoli).transform_density(
    'pca2',
    groupby=['loc_site'],
    as_=['pca2', 'density'],
).mark_area(fillOpacity=0.5).encode(
    x=alt.X('pca2:Q', title="Principal Component 2", ),
    y=alt.Y('density:Q', scale=alt.Scale(domain=[0,6]) ),
    color=alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
)

(pca1_group | pca2_group).properties(
    title= 'Principal Components separated by the localization site'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)






- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), area, color
- Interaction: Select and filter
- Layout: Small multiples(facet), Superimposition(layered) and Juxtaposition (horizontal concatenation)



In [19]:
#plotting principal component 1 in small multiples
pca1_facet = alt.Chart(df_ecoli, width=150, height=150).transform_density(
    'pca1',
    groupby=['loc_site'],
    as_=['pca1', 'density'],
).mark_area().encode(
    x=alt.X('pca1:Q', title="PC 1"),
    y=alt.Y('density:Q'),
    color=alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20'))
    #opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).facet(
    'loc_site:N',
    columns=4
).properties(
    title= 'Small multiples of density plot for PC 1'
)



(pca1_facet).configure_title(
    anchor= 'middle',
    fontSize=16  
)

In [20]:
#plotting principal component 2 in small multiples
pca2_facet = alt.Chart(df_ecoli, width=150, height=150).transform_density(
    'pca2',
    groupby=['loc_site'],
    as_=['pca2', 'density'],
).mark_area().encode(
    x=alt.X('pca2:Q', title="PC 2"),
    y=alt.Y('density:Q'),
    color=alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20'))
    #opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).facet(
    'loc_site:N',
    columns=3
).properties(
    title= 'Small multiples of density plot for PC 2'
)

(pca2_facet).configure_title(
    anchor= 'middle',
    fontSize=16  
)

<font size="+2" color="grey"><b>1.3 Histogram and its variations </b></font><br><a id="1.3"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), length, color
- Interaction: tooltip (detail on demand)
- Layout: Juxtaposition (horizontal concatenation)

In [57]:
#plotting each principal component as a histogram
pca1_hist = alt.Chart(df_ecoli).mark_bar().encode(
    x=alt.X('pca1', bin=alt.Bin(maxbins=20), title="Principal Component 1"),
    y=alt.Y('count()', scale=alt.Scale(domain=[0, 180]) ),
    #color=alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
    tooltip='count()'
).interactive()

pca2_hist = alt.Chart(df_ecoli).mark_bar().encode(
    x=alt.X('pca2', bin=alt.Bin(maxbins=20), title="Principal Component 2"),
    y=alt.Y('count()', ),
    #color=alt.Color('loc_site:N',scale=alt.Scale(scheme='tableau20')),
    tooltip='count()'
).interactive()

(pca1_hist | pca2_hist).properties(
    title= 'Histogram for Principal Components 1 and 2'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)

- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), length, color
- Interaction: Select and filter
- Layout: Superimposition(layered) and Juxtaposition (horizontal concatenation)



In [22]:
#plotting each principal component using overlapped histograms bound to the loc_site variable
selection = alt.selection_multi(fields=['loc_site'], bind='legend')

pca1_overlap = alt.Chart(df_ecoli).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('pca1:Q', bin=alt.Bin(maxbins=20), title="PC 1"),
    alt.Y('count()', stack=None),
    alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
    opacity=alt.condition(selection, alt.value(0.7), alt.value(0.1)),
).add_selection(
    selection
).interactive()


pca2_overlap = alt.Chart(df_ecoli).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('pca2:Q', bin=alt.Bin(maxbins=20), title="PC 2"),
    alt.Y('count()', stack=None),
    alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
    opacity=alt.condition(selection, alt.value(0.7), alt.value(0.1)),
).interactive()

(pca1_overlap | pca2_overlap).properties(
    title= 'Overlapped histogram for PC 1 and 2'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)


- Dimension: D = 1
- Data abstraction: one numeric variable
- Encoding: position (x-axis), length, color
- Interaction: No interaction
- Layout: Small multiples(facet)


In [23]:
#plotting principal component 1 in a small multiples layout with histograms
pca1_small= alt.Chart(df_ecoli, width=150, height=150).mark_bar(
).encode(
    alt.X('pca1:Q', bin=alt.Bin(maxbins=20), title="PC 1"),
    alt.Y('count()', stack=None),
    alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
).facet(
    'loc_site:N',
    columns=4
).properties(
    title= 'Small multiples of histogram for PC 1'
)


(pca1_small).configure_title(
    anchor= 'middle',
    fontSize=16  
)

In [24]:
#plotting principal component 2 in a small multiples layout with histograms
pca2_small= alt.Chart(df_ecoli, width=150, height=150).mark_bar(
).encode(
    alt.X('pca2:Q', bin=alt.Bin(maxbins=20), title="PC 2"),
    alt.Y('count()', stack=None),
    alt.Color('loc_site:N', scale=alt.Scale(scheme='tableau20')),
).facet(
    'loc_site:N',
    columns=4
).properties(
    title= 'Small multiples of histogram for PC 2'
)

(pca2_small).configure_title(
    anchor= 'middle',
    fontSize=16  
)

### Further points

Here, each principal component (PC) is treated as one dimension. First, we represented the PCs as points on a one-dimensional axis. In this representation, occlusion is present and can hide many data points. However, labels can still be identified using the color encoding together with interaction idioms such as select and filter.


Going further, it is important to consider alternative representations that show the distribution of points; the density plot is used for this purpose. It shows how the data points are distributed along each PC. Additionally, the histogram provides insight into the number of points per label as well as their distribution.

A histogram or a density plot is an effective way to visualize numeric data and its distribution. With these, we can understand how the data is distributed along each principal component. Additionally, we can count the number of objects that exist in a cluster or label.
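
The counting task mentioned above (membership disambiguation) can also be done directly on the data frame. The following is a small sketch on a hypothetical miniature frame that reuses the column names `pca1` and `loc_site`; the values and bin edges are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with the same column names as df_ecoli (values are made up)
df = pd.DataFrame({
    "pca1": [-1.3, -1.6, -0.5, 0.4, 0.9, 1.2, 2.1, 2.4],
    "loc_site": ["cp", "cp", "cp", "im", "im", "im", "pp", "pp"],
})

# Membership disambiguation: number of objects per label
counts_per_label = df["loc_site"].value_counts().sort_index()

# Histogram view: number of objects per (bin, label) pair
bin_edges = np.linspace(-2.0, 3.0, 6)  # five bins of width 1 (hypothetical choice)
df["bin"] = pd.cut(df["pca1"], bins=bin_edges)
counts_per_bin = (
    df.groupby(["bin", "loc_site"], observed=True).size().unstack(fill_value=0)
)

print(counts_per_label)
print(counts_per_bin)
```

The first table corresponds to reading totals off a legend-filtered chart, and the second to the bar heights of a histogram colored by label.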





<font size="+3" color="grey"><b>2 Two Dimensions </b></font><br><a id="2"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>


<font size="+2" color="grey"><b>2.1 Scatterplot </b></font><br><a id="2.1"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>


- Dimension: D = 2
- Data abstraction: two numeric variables
- Encoding: position (x & y-axis), color, points
- Interaction: select/filter, zoom in/out, tooltip
- Layout: No layout

In [25]:
#plotting the principal components using a scatterplot
base = alt.Chart(df_ecoli).encode().properties(
    width=700,
    height=500)

pca_scatter = base.mark_circle(size=60).encode(
    x=alt.X('pca1', title='PC 1'),
    y=alt.Y('pca2', title='PC 2'),
    color=alt.Color('loc_site:N'),
    tooltip=['loc_site', 'pca1', 'pca2'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_selection(
    selection
).interactive().properties(
    title='Scatterplot of the principal components',
)

(pca_scatter).configure_title(
    anchor= 'middle',
    fontSize=16  
)


- Dimension: D = 2
- Data abstraction: one numeric, one categorical
- Encoding: position (x & y-axis), points
- Interaction: No interaction
- Layout: Juxtaposition (horizontal concatenation)




In [26]:
#plotting the principal components to show clusters by localization site
base = alt.Chart(df_ecoli).encode().properties(
    width=600,
    height=300)

pca1_circle_quality = base.mark_point(size=60).encode(
    x=alt.X('pca1:Q', title="PC 1"),
    y=alt.Y('loc_site:O'),
    #opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
)

pca2_circle_quality = base.mark_point(size=60).encode(
    x=alt.X('pca2:Q', title="PC 2"),
    y=alt.Y('loc_site:O')
    #opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
)

(pca1_circle_quality | pca2_circle_quality).properties(
    title= 'Localization site vs. principal components 1 & 2'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)

### Further Points 

When the 'loc_site' column is encoded on the y-axis (position), occlusion is still visible, but the clusters and their distributions along each principal component can be observed. Opacity may be used to tackle the occlusion.

Compared to the classical scatterplot, there is less cognitive load in reading how the labels are clustered, and patterns can be identified easily. For example, localization sites such as 'cp' and 'im' are likely the largest clusters. Outliers can also be pinpointed easily.

The main takeaway is that data points within the range of interest do not compete with outliers. Because the points are spread out by label, data points inside the distribution are easier to tell apart from outliers.




<font size="+2" color="grey"><b>2.2 Boxplot </b></font><br><a id="2.2"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>


- Dimension: D = 2
- Data abstraction: one numeric, one categorical
- Encoding: position (x & y-axis), lines, length, color
- Interaction: Tooltip
- Layout: Juxtaposition (horizontal concatenation)
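
As a sketch of what a boxplot summarizes, the function below computes the quartiles, whisker ends, and outliers using the Tukey rule with a 1.5 × IQR extent. The function name and the sample values are hypothetical, and the quartile interpolation used by NumPy may differ slightly from the one used by the plotting library.

```python
import numpy as np

def tukey_boxplot_stats(values, extent=1.5):
    """Quartiles, whisker ends, and outliers under the Tukey (extent * IQR) rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - extent * iqr, q3 + extent * iqr
    # Whiskers reach to the most extreme observations that are still inside the fences
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": values[(values < lower_fence) | (values > upper_fence)],
    }

# Hypothetical PC scores for one localization site, with one distant value
stats = tukey_boxplot_stats([-1.2, -0.8, -0.5, -0.3, 0.0, 0.2, 0.4, 0.7, 6.0])
print(stats)
```

Points beyond the fences are drawn individually, which is why a single distant observation stays visible in a boxplot.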



In [37]:
#plotting the principal components using boxplots
pca1_boxplot= alt.Chart(df_ecoli).mark_boxplot().encode(
    x='loc_site:N',
    y=alt.Y('pca1:Q', title='PC 1'),
    color='loc_site:N'
).properties(
    width=700,
    height=400
)

pca2_boxplot= alt.Chart(df_ecoli).mark_boxplot().encode(
    x='loc_site:N',
    y=alt.Y('pca2:Q', title='PC 2', scale=alt.Scale(domain=[-6, 14])),
    color='loc_site:N'
).properties(
    width=700,
    height=400
)

(pca1_boxplot | pca2_boxplot).properties(
    title= 'Principal Components grouped by location site'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)

<font size="+2" color="grey"><b>2.3 Violin plot </b></font><br><a id="2.3"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>


- Dimension: D = 2
- Data abstraction: one numeric, one categorical
- Encoding: position (x & y-axis), area,  color
- Interaction: None
- Layout: Juxtaposition (horizontal concatenation)

In [28]:
#plotting the principal components using violin plots
pca1_violin = alt.Chart(df_ecoli).transform_density(
    'pca1',
    as_= ['pca1', 'density'],
    groupby=['loc_site'],
    #kernel='gaussian',
    #bandwidth=0.2
).mark_area(orient='horizontal').encode(
    y= alt.Y('pca1:Q', title='PC 1'),
    x= alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
    color='loc_site:N',
    column=alt.Column(
        'loc_site:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=20,
        ),
    )
).properties(
    width=100,
)

pca2_violin = alt.Chart(df_ecoli).transform_density(
    'pca2',
    as_= ['pca2', 'density'],
    groupby=['loc_site'],
    #kernel='gaussian',
    #bandwidth=0.2
).mark_area(orient='horizontal').encode(
    y= alt.Y('pca2:Q', title='PC 2', scale=alt.Scale(domain=[-6, 14])),
    x= alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
    color='loc_site:N',
    column=alt.Column(
        'loc_site:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=20,
        ),
    )
).properties(
    width=100,
)

(pca1_violin | pca2_violin).properties(
    title= 'Principal Components grouped by localization site using Violin plot'
).configure_title(
    anchor= 'middle',
    fontSize=16  
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)



### Further points

- Here, the projection values are displayed in two dimensions with a violin plot and a box plot. Does this make sense? Can you get any insight about the projections?

- For 'PC 1, loc_site omL', there is a disconnection. What is causing it? Could it be a methodological result, or a consequence of how a violin plot is rendered?

Thoughts: Most likely it comes from how the technique renders the data.

- Comparing the violin plot with the boxplot above, we see a large outlier for this localization site (PC 1, loc_site omL), which has most of its distribution on the negative part of the scale.

- For other localization sites, the outliers are shown as a line, which forms when all the points are connected. For example, for 'PC 1, loc_site im', the outliers are connected and form a single line. Is the data therefore under-represented or over-represented when a violin plot is rendered?

- An interesting question is how the techniques working in the background shape the visualization that is generated from the data, in this case the violin plot.
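
One possible way to probe the questions above is to look at the density estimate that underlies a violin. The sketch below uses a made-up group with a tight cluster and a single distant value, and a fixed bandwidth chosen for illustration. It is not the notebook's data and does not settle the cause of the disconnection; it only shows that a kernel density estimate can drop to practically zero between a cluster and an outlier, which renders as a thin line or a gap.

```python
import numpy as np

def kde(values, grid, bandwidth):
    """Gaussian kernel density estimate with a fixed bandwidth."""
    z = (grid[:, None] - values[None, :]) / bandwidth
    norm = values.size * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

# Hypothetical group: a tight cluster plus one distant outlier (values are made up)
group = np.array([-2.1, -2.0, -1.9, -1.8, -1.7, 4.0])
grid = np.linspace(-4, 6, 501)
density = kde(group, grid, bandwidth=0.3)

# Density at the cluster, in the empty stretch, and at the outlier
at_cluster = density[np.argmin(np.abs(grid + 1.9))]
in_gap = density[np.argmin(np.abs(grid - 1.0))]
at_outlier = density[np.argmin(np.abs(grid - 4.0))]
print(at_cluster, in_gap, at_outlier)
```

In this toy case the estimate has two separate bumps, so the violin would appear split even though the boxplot shows one box and one outlier point.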



<font size="+3" color="grey"><b>3 N-Dimensions </b></font><br><a id="3"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>



In [29]:
# Creating a DataFrame with the PCA loadings
loadings = pd.DataFrame(pca.components_.T,
                        columns=[f'PC{i}' for i in range(1, len(all_features) + 1)],
                        index=all_features)

# Melting the DataFrame to a long format suitable for Altair
loadings = loadings.reset_index().melt(id_vars='index')

# Creating the heatmap
heatmap = alt.Chart(loadings).mark_rect().encode(
    y=alt.Y('variable', title='Principal Components'),
    x=alt.X('index', title='Features'),
    color='value'
).properties(
    width=1000,
    height=500,
    title= 'Heatmap of principal components and original features'
)

# Create a text layer to display the loading values
text = alt.Chart(loadings).mark_text().encode(
    x='index',
    y='variable',
    text=alt.Text('value:Q', format='.2f'),
    color=alt.condition(
        alt.datum.value > 0.5,
        alt.value('black'),
        alt.value('white')
    )
)

(heatmap + text).configure_title(
    anchor= 'middle',
    fontSize=16  
)



<font size="+2" color="grey"><b>3.1 Area chart </b></font><br><a id="3.1"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

- Dimension: D > 2
- Data abstraction: two numeric, one categorical
- Encoding: position (x & y-axis), area,  color
- Interaction: Tooltip
- Layout: None

In [30]:
area_plot = alt.Chart(df_ecoli).mark_area(opacity=0.3).encode(
    x=alt.X("pca2:Q", title="PC 2"),
    y=alt.Y("pca1:Q", title="PC 1"),
    color="loc_site:N",
    tooltip=['pca1', 'pca2'],
).properties(
    width=700,
    height=500
)

(area_plot).properties(
    title= 'Decomposition of PCA using area chart'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)


- Dimension: D > 2
- Data abstraction: two numeric, one categorical
- Encoding: position (x & y-axis), area,  color
- Interaction: None
- Layout: small multiples

In [31]:
area_plot_facet = alt.Chart(df_ecoli,).mark_area().encode(
    x=alt.X("pca2:Q", title="PC 2"),
    y=alt.Y("pca1:Q", stack=None, title="PC 1"),
    color="loc_site:N",
).facet(
    'loc_site:N',
    columns=4
)

(area_plot_facet).properties(
    title= 'Decomposition of PCA faceted with PC 1 & 2 using small multiples of area chart'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)

<font size="+2" color="grey"><b>3.2 Line chart </b></font><br><a id="3.2"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>


- Dimension: D > 2
- Data abstraction: two numeric, one categorical
- Encoding: position (x & y-axis), point, line, color
- Interaction: pan and zoom
- Layout: None

In [32]:
line_chart = alt.Chart(df_ecoli).mark_line(point=alt.OverlayMarkDef(color="red")).encode(
    x=alt.X('pca1', title="PC 1"),
    y=alt.Y('pca2', title="PC 2"),
    color='loc_site:N',
).properties(
    width=700,
    height=500
).interactive()

(line_chart).properties(
    title= 'Decomposition of PCA using line chart overlaid with points'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)


- Dimension: D > 2
- Data abstraction: two numeric, one categorical
- Encoding: position (x & y-axis), lines,  color
- Interaction: pan and zoom, tooltip
- Layout: small multiples and superimposition

In [33]:
line_chart_facet = alt.Chart(df_ecoli,).mark_line(point=alt.OverlayMarkDef(color="red")).encode(
    x=alt.X('pca1', title="PC 1"),
    y=alt.Y('pca2', title="PC 2"),
    color='loc_site:N',
    tooltip= list(all_features)
).facet(
    'loc_site:N',
    columns=4
).interactive()

(line_chart_facet).properties(
    title= 'Decomposition of PCA using line chart overlaid with points (small multiples)'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)

### Further points

- One of the main aims of this project is to explore the design space of MDPs. It is also important to figure out what is and is not suitable for a specific task. Here we considered multiple dimensions (D > 2) with a line chart and an area chart. We encoded the projections using the x and y positions and the localization site using the color channel. Does this make sense? Can we get any insight about the data from these representations?

- Representing the projections without a layout or interaction can make it difficult to gain insight into the data. This is the case for the area chart and the line chart.

- With a small-multiples layout, however, a better representation of the data can be achieved. Superimposing the point mark makes it easier to identify distances between data points. Although points are abundant for certain localization sites such as 'cp' and 'im', the views still provide insight and indicate how sparse or dense the resulting clusters are.



<font size="+3" color="grey"><b>4 Multiple Views </b></font><br><a id="4"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

- Dimension: D > 2
- Data abstraction: n numeric, one categorical
- Encoding: position (x & y-axis), area,  color, points, length
- Interaction: tooltip, brushing and linking
- Layout: small multiples and juxtaposition
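
Conceptually, brushing and linking amounts to filtering the rows that fall inside the brushed rectangle and redrawing the linked views from that subset. The sketch below mimics this with a boolean mask in pandas. The miniature frame, the `apply_brush` helper, and the brush ranges are hypothetical and only illustrate the idea; in the notebook the filtering is handled by the interval selection.

```python
import pandas as pd

# Hypothetical miniature of df_ecoli (values are made up for illustration)
df = pd.DataFrame({
    "pca1": [-1.3, -0.5, 0.4, 1.2, 2.1],
    "pca2": [-0.3, -0.1, 0.8, 0.9, -1.0],
    "mcg": [0.49, 0.56, 0.61, 0.74, 0.33],
    "loc_site": ["cp", "cp", "im", "pp", "om"],
})

def apply_brush(frame, x_range, y_range):
    """Keep the rows whose (pca1, pca2) fall inside the brushed rectangle."""
    inside = frame["pca1"].between(*x_range) & frame["pca2"].between(*y_range)
    return frame[inside]

# Brush a rectangle in the PC 1 / PC 2 plane, then summarize an original feature
brushed = apply_brush(df, x_range=(0.0, 1.5), y_range=(0.5, 1.0))
print(brushed["loc_site"].tolist(), brushed["mcg"].median())
```

The linked boxplots of the original features then only need the brushed subset, which connects a region of the projection back to the attribute values behind it.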

In [86]:

df_ecoli[['mcg', 'gvh', 'lip', 'chg', 'aac', 'alm1', 'alm2']] = df_ecoli[['mcg', 'gvh', 'lip', 'chg', 'aac', 'alm1', 'alm2']].apply(pd.to_numeric)
df_ecoli.dtypes

mcg         float64
gvh         float64
lip         float64
chg         float64
aac         float64
alm1        float64
alm2        float64
loc_site     object
pca1        float64
pca2        float64
dtype: object

In [91]:
brush = alt.selection_interval()


points = alt.Chart(df_ecoli).mark_rect().encode(
    x=alt.X('pca1', bin=alt.Bin(maxbins=20), title="PC 1"),
    y=alt.Y('pca2',bin=alt.Bin(maxbins=20), title="PC 2"),
    color = alt.Color('count()', scale=alt.Scale(scheme='greenblue')),
    #tooltip = list(all_features) 
).add_selection(brush)


scatter = base.mark_circle(size=60,).encode(
    x=alt.X('pca1', title="PC 1"),
    y=alt.Y('pca2', title="PC 2"),
    color=alt.Color('loc_site:N', scale=alt.Scale(scheme='greenblue')),
    tooltip=list(all_features),
    #opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).properties(
    width=400,
    height=300
).transform_filter(
    brush
)

point_scatter = points | scatter

point_scatter

boxplots = alt.vconcat()  # the boxplots are stacked vertically with &= below
for measure in all_features:
    boxplot = alt.Chart(df_ecoli).mark_boxplot().encode(
            x =alt.X(measure, axis=alt.Axis(titleX=470, titleY=0)),
            
    ).transform_filter(
            brush
    )
    boxplots &= boxplot

chart = alt.vconcat(point_scatter, boxplots).properties(
    title= 'Decomposition of PCA into Multiple views'
).configure_title(
    anchor= 'middle',
    fontSize=16  
)
chart





### Further points

- Layout and interaction techniques go a long way toward providing clear and concise insight into MDPs.
- Where occlusion is visible, opacity may be used to tackle it.
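
To illustrate why opacity helps with occlusion, the sketch below applies the standard alpha-compositing rule for identical semi-transparent marks drawn on top of each other. The helper name and the opacity value of 0.2 are chosen for illustration, and the rule assumes marks of the same color that overlap fully.

```python
def stacked_opacity(alpha, k):
    """Combined opacity of k identical marks drawn on top of each other."""
    return 1.0 - (1.0 - alpha) ** k

# Denser regions appear darker, so the number of overlapping points becomes visible
for k in (1, 2, 5, 10):
    print(k, round(stacked_opacity(0.2, k), 3))
```

With a low per-mark opacity, isolated points stay faint while stacked points become progressively more opaque, so density can be read from darkness.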

## Citation

If you found the examples in this notebook useful and you have used these alternatives in your research, please cite...