The paper ([The Visual Language of Multidimensional Data Projection: A Visualization Taxonomy and Informed Insights](https://)) discusses the need for alternative ways to think about visualizing multidimensional projections (MDPs). This project demonstrates such alternatives through three use cases. Because the relationships between data instances are complex, different visualization techniques (encodings and interactions) need to be explored to make those relationships clear. The goal of the project is not to propose new visualizations but to explore the design space of alternative visualization techniques.


## Use case 2:  Cluster Identification
#### Cluster Identification through Visualization

In this notebook, we start exploring cluster identification using visualization techniques. Clustering is a key unsupervised learning technique that groups data points into clusters based on their similarities. Understanding cluster formation is essential for gaining insights into the underlying structure of the data and identifying patterns that may not be immediately apparent.


##### Dataset Description
<br/>
1. Title: Protein Localization Sites

2. Creator and Maintainer:
	Kenta Nakai, Institute of Molecular and Cellular Biology, Osaka University
	1-3 Yamada-oka, Suita 565 Japan nakai@imcb.osaka-u.ac.jp
    http://www.imcb.osaka-u.ac.jp/nakai/psort.html 
    Donor: Paul Horton (paulh@cs.berkeley.edu)
    Date:  September, 1996
    See also: yeast database

Attribute Information.

  1. Sequence Name: Accession number for the SWISS-PROT database
  2. mcg: McGeoch's method for signal sequence recognition.
  3. gvh: von Heijne's method for signal sequence recognition.
  4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
  5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
  6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7. alm1: score of the ALOM membrane spanning region prediction program.
  8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.


<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 data-toggle="list"  role="tab" aria-controls="home"><p style="font-size : 30px"><font color="darkgrey">Content</font></p></h3>

1. [<font color="darkgrey"> Dataset</font>](#1)
    - 1.1 [<font color="darkgrey"> Dataset Overview</font>](#1.1)
    - 1.2 [<font color="darkgrey"> Preprocessing</font>](#1.2)
    - 1.3 [<font color="darkgrey"> Dimensionality Reduction</font>](#1.3)
    - 1.4 [<font color="darkgrey"> Unsupervised Machine Learning for labels</font>](#1.4)

2. [<font color="darkgrey">Visual Perspectives</font>](#2)
    - 2.1. [<font color="darkgrey"> One Dimension (1D)</font>](#2.1)
        - 2.1.1 [<font color="darkgrey"> 1D Strip plot </font>](#2.1.1)
        - 2.1.2 [<font color="darkgrey"> Box Plot </font>](#2.1.2)
        - 2.1.3 [<font color="darkgrey"> Histogram </font>](#2.1.3)
        - 2.1.4 [<font color="darkgrey"> Violin Plot </font>](#2.1.4)
    - 2.2. [<font color="darkgrey"> Two Dimensions (2D)</font>](#2.2)
        - 2.2.1 [<font color="darkgrey">Scatterplot</font>](#2.2.1)
        - 2.2.2 [<font color="darkgrey"> Contour</font>](#2.2.2)
        - 2.2.3 [<font color="darkgrey"> Heatmap</font>](#2.2.3)
        - 2.2.4 [<font color="darkgrey"> Dendrogram </font>](#2.2.4)
        - 2.2.5 [<font color="darkgrey"> Parallel Coordinate Plot </font>](#2.2.5)

    

<font size="+3" color="grey"><b>1. Dataset </b></font><br><a id="1"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

In [3]:
import sys, os
import pandas as pd
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Create a directory
output_dir = "images"
os.makedirs(output_dir, exist_ok=True)

# Replace with the path to your .data file
data_file_path = '../Data/Dataset_2/E_Coli/ecoli.data'

column_headers = [
    "Sequence Name", "mcg", "gvh", "lip", "chg", "aac", "alm1",  "alm2", "Class"
]

# Read the .data file and set the headers
df = pd.read_csv(data_file_path,  delimiter=r'\s+', header=None, names=column_headers)

<font size="+2" color="grey"><b>1.1. Dataset Overview </b></font><br>

In [4]:
df = df[["mcg", "gvh", "lip", "chg", "aac", "alm1",  "alm2"]]
df

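| | mcg | gvh | lip | chg | aac | alm1 | alm2 |
|---|---|---|---|---|---|---|---|
| 0 | 0.49 | 0.29 | 0.48 | 0.5 | 0.56 | 0.24 | 0.35 |
| 1 | 0.07 | 0.40 | 0.48 | 0.5 | 0.54 | 0.35 | 0.44 |
| 2 | 0.56 | 0.40 | 0.48 | 0.5 | 0.49 | 0.37 | 0.46 |
| 3 | 0.59 | 0.49 | 0.48 | 0.5 | 0.52 | 0.45 | 0.36 |
| 4 | 0.23 | 0.32 | 0.48 | 0.5 | 0.55 | 0.25 | 0.35 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 331 | 0.74 | 0.56 | 0.48 | 0.5 | 0.47 | 0.68 | 0.30 |
| 332 | 0.71 | 0.57 | 0.48 | 0.5 | 0.48 | 0.35 | 0.32 |
| 333 | 0.61 | 0.60 | 0.48 | 0.5 | 0.44 | 0.39 | 0.38 |
| 334 | 0.59 | 0.61 | 0.48 | 0.5 | 0.42 | 0.42 | 0.37 |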


<font size="+2" color="grey"><b>1.2. Preprocessing </b></font><br>

In [5]:
from Implementations.imputation import Preprocessor
# Initialize the Preprocessor class with the dataset
preprocessor = Preprocessor(df)
# The target variable is exempted from preprocessing.
# Note: "Class" was already dropped from df in the previous cell.
exempt_columns = ["Class"]
processed_data = preprocessor.preprocess(
    strategy='mean', 
    remove_missing=False, 
    exempt_columns=exempt_columns
)
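
The `Preprocessor` class is project-specific and its internals are not shown in this notebook. As a minimal sketch, assuming that `strategy='mean'` means each missing value is replaced by the mean of its column, the idea can be illustrated with NumPy. The `impute_mean` helper and the toy array are hypothetical and only serve the illustration.

```python
import numpy as np


def impute_mean(values):
    """Replace NaN entries with the mean of their column (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Column means computed while ignoring missing entries
    col_means = np.nanmean(values, axis=0)
    # Broadcast the column means into the positions that are NaN
    return np.where(np.isnan(values), col_means, values)


# Toy data with two features and two missing values
toy = np.array([
    [0.49, 0.29],
    [np.nan, 0.40],
    [0.56, np.nan],
])
filled = impute_mean(toy)
print(filled)
```

Mean imputation keeps every row, which matches `remove_missing=False` in spirit, at the cost of shrinking the variance of the imputed columns.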

<font size="+2" color="grey"><b>1.3. Dimensionality Reduction </b></font><br>


<font size="+1" color="grey"><b> Dimensionality Reduction Techniques </b></font><br>

Dimensionality reduction simplifies complex data by reducing the number of features, making it easier to visualize and understand.

**PCA (Principal Component Analysis)**
- **Purpose**: Transforms data into a new space, capturing the most important patterns.
- **Method**: Projects high-dimensional data onto a lower-dimensional space while retaining essential information.

**t-SNE (t-Distributed Stochastic Neighbor Embedding)**
- **Purpose**: Visualizes high-dimensional data in a lower dimension.
- **Method**: Keeps similar points close together to help identify clusters and patterns.

**UMAP (Uniform Manifold Approximation and Projection)**
- **Purpose**: Reduces dimensions while preserving both global and local structures in data.
- **Method**: Effective for larger datasets, showing both major patterns and finer details.

After applying these techniques, visualizations in 2D plots make it easier to see and interpret complex data patterns.
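
To make the PCA description concrete, here is a minimal sketch of a linear projection to two components using a singular value decomposition. It is not the implementation behind `DimensionalityReduction.apply_pca`, which is project-specific. The `pca_project` helper and the synthetic seven-feature data are assumptions for illustration. t-SNE and UMAP are iterative, non-linear methods and are not sketched here.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centered data
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-in with 7 features, mirroring the 7 numeric attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
embedding = pca_project(X)
print(embedding.shape)  # (100, 2)
```

The first output column carries at least as much variance as the second, which is why 2D PCA plots conventionally put component 1 on the horizontal axis.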


In [23]:
from Implementations.dimensionality_reduction import DimensionalityReduction
from Implementations.dimensionality_reduction import InteractivePlot
import pandas as pd
# Initialize DimensionalityReduction class
dr = DimensionalityReduction(data=df, target_column='')
# Apply different dimensionality reduction techniques
pca_df = dr.apply_pca()
tsne_df = dr.apply_tsne()
umap_df = dr.apply_umap()
complete_dataset = pd.concat([pca_df, tsne_df, umap_df], axis=1)

# combine reduced dimension into one
merged_datasets = complete_dataset[[
    # "PCA_Component_1", "PCA_Component_2", 
    "t-SNE_Component_1", "t-SNE_Component_2", 
    # "UMAP_Component_1", "UMAP_Component_2"
]]
# Collapse duplicate column names, keeping the first occurrence of each
# (transpose-based form, since groupby(..., axis=1) is deprecated in pandas)
merged_datasets = merged_datasets.T.groupby(level=0).first().T
# merged_datasets['Class2'] = merged_datasets['Class'].replace({"cp": 0, "im": 1, "pp": 2, "imU": 3, "om": 4, "omL": 5, "imL": 6, "imS": 7})
# merged_datasets['Class2'] = merged_datasets['Class'].replace({0: "Group 1", 1: "Group 2", 2: "Group 3", 3: "Group 4", 4: "Group 5"})


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.




<font size="+2" color="grey"><b>1.4. Unsupervised Machine Learning for labels </b></font><br>


In [24]:
from Implementations.unsupervised import UnsupervisedLearning
import numpy as np

# Initialize the unsupervised learning class
unsupervised = UnsupervisedLearning(tsne_df)
"""
# Tune K-Means hyperparameters
param_grid = {'n_clusters': list(range(3, 10))}
best_params, best_score, df_kmeans = unsupervised.tune_kmeans(param_grid=param_grid, search_method='grid')

# Tune DBSCAN hyperparameters
param_grid = {
    'eps': np.linspace(0.1, 1.0, 10),
    'min_samples': list(range(3, 10))
}
best_params, best_score, df_dbscan = unsupervised.tune_dbscan(param_grid=param_grid, search_method='random')

"""
df_agglomertive = None
# Tune clustering hyperparameters. K-Means is used here; the variable keeps its
# original "agglomertive" name because later cells refer to it.
param_grid = {
    'n_clusters': list(range(3, 10)),
    # 'linkage': ['ward', 'complete', 'average', 'single']
}
best_params, best_score, df_agglomertive = unsupervised.tune_kmeans(param_grid=param_grid)
# Agglomerative alternative:
# best_params, best_score, df_agglomertive = unsupervised.tune_agglomerative(param_grid=param_grid, search_method='grid')


Best K-Means Parameters: {'n_clusters': 3}, Best Silhouette Score: 0.5290709674358368
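
The output above reports a silhouette score of about 0.53 for three clusters. How `tune_kmeans` computes it is not shown in this notebook, so the following is only a sketch of the standard silhouette definition: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. The `silhouette` helper and the toy points are hypothetical.

```python
import numpy as np


def silhouette(X, labels):
    """Mean silhouette coefficient; needs at least two clusters (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score is defined as 0
        # Mean distance to the other members of the own cluster (self distance is 0)
        a = dist[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = silhouette(points, [0, 0, 1, 1])   # labels follow the two blobs
bad = silhouette(points, [0, 1, 0, 1])    # labels cut across the blobs
print(good, bad)
```

Scores close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned points.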


In [25]:
print(df_agglomertive["KMeans_Labels_CV"].unique())
df_agglomertive.rename(columns={'KMeans_Labels_CV': 'Class'}, inplace=True)
df_agglomertive['Class2'] = df_agglomertive['Class'].replace({0: 'Group 1', 1: 'Group 2', 2: 'Group 3'})
df_agglomertive




[0 1 2]


| | t-SNE_Component_1 | t-SNE_Component_2 | Class | Class2 |
|---|---|---|---|---|
| 0 | -11.242157 | 0.421885 | 0 | Group 1 |
| 1 | -7.521434 | 0.403877 | 0 | Group 1 |
| 2 | -5.041004 | -2.691878 | 1 | Group 2 |
| 3 | -3.667037 | -3.763310 | 1 | Group 2 |
| 4 | -8.782049 | 0.078858 | 0 | Group 1 |
| ... | ... | ... | ... | ... |
| 331 | 0.567117 | -4.579026 | 1 | Group 2 |
| 332 | -0.944844 | -5.599624 | 1 | Group 2 |
| 333 | -1.513767 | -5.417622 | 1 | Group 2 |
| 334 | -1.708600 | -5.516223 | 1 | Group 2 |


<font size="+9" color="grey"><b>2.1. 1D</b></font><font size="+1" color="grey"><b> ( Cluster Identification) </b></font><br>

In [26]:
from Implementations.visualization import create_combined_chart
scatter_plot_single, gaussian_jitter, box_with_jitter, histogram_shade, combined_chart = create_combined_chart(
    df_agglomertive, "t-SNE_Component_1", "t-SNE_Component_2",
    color_column='Class2', 
    main_color_domain=["Group 1", "Group 2", "Group 3"], 
    main_attr_color_range=["#1f78b4", "#b2df8a"],
    main_color_range=["#66c2a5", "#fc8d62", "#8da0cb"], 
    attr_color_domain=[
        "Group 1 t-SNE 1", 
        "Group 2 t-SNE 1", 
        "Group 3 t-SNE 1", 
        "Group 1 t-SNE 2", 
        "Group 2 t-SNE 2",
        "Group 3 t-SNE 2", 
    ], 
    attr_color_range=["#66c2a5", "#fc8d62", "#8da0cb", "#66c2a5", "#fc8d62", "#8da0cb"],
    attribute_nomeclature=["t-SNE 1", "t-SNE 2"],
    width_single=800,
    height_single=400, 
    jitter_size=40
    # greyed_out_option=False
)
combined_chart.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case2_combined.png', format='png', scale_factor=2)

In [27]:
scatter_plot_single.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case2_scatter_plot.png', format='png', scale_factor=2)

<font size="+1" color="grey"><b> 2.1.1 Strip Plot </b></font>

In [28]:
gaussian_jitter.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case2_strip_plot.png', format='png', scale_factor=2)
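
A strip plot places every point along one axis, so points with similar values overlap. Jitter spreads them along the unused axis with random offsets. The sketch below assumes Gaussian offsets with a fixed standard deviation; how `create_combined_chart` applies `jitter_size` internally is not shown here, and the `gaussian_jitter` function below is a hypothetical stand-alone helper, not the chart object of the same name.

```python
import random


def gaussian_jitter(values, sigma=0.1, seed=0):
    """Pair each value with a random vertical offset drawn from N(0, sigma)."""
    rng = random.Random(seed)  # seeded so the layout is reproducible
    return [(v, rng.gauss(0.0, sigma)) for v in values]


# A few t-SNE component 1 values, rounded from the table above
points = gaussian_jitter([-11.24, -7.52, -5.04, -3.67, -8.78])
for x, y in points:
    print(f"x={x:7.2f}  jitter={y:+.3f}")
```

The offsets carry no meaning of their own; they only reduce overplotting so that dense regions, and therefore potential clusters, become visible.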

<font size="+1" color="grey"><b> 2.1.2 Box Plot </b></font>

In [12]:
box_with_jitter.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case2_box_plot_with_strip_plot.png', format='png', scale_factor=2)
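
A box plot summarizes each group by its quartiles, with whiskers reaching to the most extreme points inside a fence. The sketch below assumes the common 1.5 × IQR fence rule and the "inclusive" quantile method from the standard library. Plotting libraries differ in their quantile conventions, so the numbers may not match a given chart exactly. The `box_stats` helper is hypothetical.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends and outliers for one box (hypothetical helper)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whisker * iqr
    hi_fence = q3 + whisker * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme points within the fences
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

Overlaying the jittered points on the box, as the chart in this section does, shows both the summary and the raw distribution that the summary hides.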

<font size="+1" color="grey"><b> 2.1.3 Histogram Plot</b></font>

In [13]:
# configure_* returns a new chart, so keep the result for the display line below
histogram_shade = histogram_shade.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case2_histogram_plot.png', format='png', scale_factor=2)
histogram_shade
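
A histogram counts how many values fall into each of a set of equal-width bins. The following sketch shows that binning step in plain Python. The bin range, the bin count and the `histogram_counts` helper are illustrative assumptions and are not taken from `create_combined_chart`.

```python
def histogram_counts(values, n_bins, lo, hi):
    """Count values per equal-width bin over [lo, hi] (hypothetical helper)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value exactly on the right edge goes into the last bin
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


# Rounded t-SNE component 1 values; bins are [-12, -7), [-7, -2), [-2, 3]
values = [-11.2, -7.5, -5.0, -3.7, -8.8, 0.6, -0.9]
counts = histogram_counts(values, n_bins=3, lo=-12, hi=3)
print(counts)  # [3, 2, 2]
```

Computing the counts separately per group and overlaying them is what lets a histogram hint at cluster boundaries: separated clusters show up as distinct peaks.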

<font size="+1" color="grey"><b> 2.1.4 Violin Plot</b></font><br/>

In [14]:
from Implementations.visualization import create_plotly_violin_plots

violin_plot = create_plotly_violin_plots(df_agglomertive, 't-SNE_Component_1', 't-SNE_Component_2')
violin_plot.update_layout(
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))

)
# .write_image("../../images/case2_violin_plot.png", width=800, height=400, scale=2)
violin_plot
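
A violin plot draws a smoothed density estimate of each distribution, mirrored around a central axis. As a minimal sketch of that density estimate, the code below evaluates a Gaussian kernel density on a grid. The bandwidth of 1.0, the grid and the `gaussian_kde_1d` helper are assumptions for illustration; Plotly chooses its own bandwidth internally.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on grid (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the estimate integrates to 1
    return kernel.mean(axis=1) / bandwidth


samples = np.array([-11.2, -7.5, -8.8, -9.8, -8.2])
grid = np.linspace(-15, 5, 201)
density = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# Riemann sum as a sanity check that the density integrates to roughly 1
area = float((density * (grid[1] - grid[0])).sum())
print(round(area, 3))
```

A smaller bandwidth produces a bumpier violin that follows individual points, while a larger one smooths over gaps that might separate clusters.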

<font size="+9" color="grey"><b>2.2. 2D</b></font><font size="+1" color="grey"><b> ( Cluster Identification) </b></font><br>

<h2 style="font-size: 1.2em;">2D (or Higher) Plot Types</h2>
<ul>
    <li>Scatter Plot</li>
    <li>Density Contour Plot</li>
    <li>Heatmap Plot</li>
    <li>Dendrogram Plot</li>
    <li>Parallel Coordinate Plot</li>
</ul>

In [15]:
from Implementations.visualization import create_2Dinteractive_plots
scatter_widget, contour_widget, density_widget, parallel_widget, dendro_widget, grid_layout = create_2Dinteractive_plots(df_agglomertive, 't-SNE_Component_1', 't-SNE_Component_2', target_numeric="Class", target="Class2")


<font size="+1" color="grey"><b> 2.2.1 Scatter Plot</b></font><br/>

In [16]:
scatter_plot_2d = scatter_widget.update_layout(
    # width=800,
    # height=600,
    xaxis=dict(
        range=[-15, 20],
        gridcolor='LightGray',
        showgrid=True,
        zeroline=True,         # Show 0 line
        zerolinecolor="gray",  # Set color for 0 line
        zerolinewidth=1,
        title_font=dict(size=20), 
        tickfont=dict(size=16)  
    ),
    yaxis=dict(
        range=[-10, 8],
        gridcolor='LightGray',
        showgrid=True,
        zeroline=True,         # Show 0 line
        zerolinecolor="gray",  # Set color for 0 line
        zerolinewidth=1,
        title_font=dict(size=20), 
        tickfont=dict(size=16)  
    ),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'

)

# scatter_plot_2d.write_image("../../images/case2_2d_scatter_plot.png", width=800, height=400, scale=2)

<font size="+1" color="grey"><b> 2.2.2 Contour Plot</b></font><br/>

In [17]:
contour_plot_2d = contour_widget.update_traces(line=dict(width=2)).update_layout(
    width=800,
    height=600,
    xaxis=dict(
        range=[-25, 25],
        gridcolor='grey',
        showgrid=True,
        zeroline=True,         # Show 0 line
        zerolinecolor="gray",  # Set color for 0 line
        zerolinewidth=1,        # Set width for 0 line
        title_font=dict(size=20), 
        tickfont=dict(size=16) 
    ),
    yaxis=dict(
        range=[-15, 10],
        gridcolor='grey',
        showgrid=True,
        zeroline=True,         # Show 0 line
        zerolinecolor="gray",  # Set color for 0 line
        zerolinewidth=1,        # Set width for 0 line
        title_font=dict(size=20), 
        tickfont=dict(size=16) 
    ),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'

)

# contour_plot_2d.write_image("../../images/case2_2d_contour_plot.png", width=800, height=400, scale=2)
contour_plot_2d

[Output: FigureWidget density contour plot of t-SNE_Component_1 vs. t-SNE_Component_2 built from `histogram2dcontour` traces (first trace: Group 1, line color #66c2a5, width 2); repr truncated]

<font size="+1" color="grey"><b> 2.2.3 Heatmap Plot</b></font><br/>

In [18]:

heatmap_plot_2d = (density_widget.update_layout(
    # width=800,
    # height=600,
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
))
# heatmap_plot_2d.write_image("../../images/case2_2d_heatmap_plot.png", width=800, height=400, scale=2)
heatmap_plot_2d

[Output: FigureWidget heatmap (`histogram2d` trace) of point counts over t-SNE_Component_1 and t-SNE_Component_2, white-to-black color scale; repr truncated]
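
The heatmap is a `histogram2d` trace, so its color encodes the number of points falling into each cell of a 2D grid. A minimal sketch of that counting step with `numpy.histogram2d` follows. The synthetic points and the choice of 10 bins per axis are assumptions; Plotly picks its own binning.

```python
import numpy as np

# Synthetic stand-in for the two t-SNE components
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)

# counts[i, j] is the number of points in x-bin i and y-bin j
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)       # (10, 10)
print(int(counts.sum()))  # 500
```

Dense cells appear dark on the white-to-black scale, so compact clusters show up as dark blobs separated by light regions.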
           

<font size="+1" color="grey"><b> 2.2.4 Dendrogram Plot</b></font><br/>

In [19]:
dendrogram_plot_2d = (dendro_widget.update_layout(
    # width=800,
    # height=600,
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
))

# dendrogram_plot_2d.write_image("../../images/case2_2d_treemap_plot.png", width=800, height=400, scale=2)
dendrogram_plot_2d

[Output: FigureWidget dendrogram drawn as black line traces, one per merge (first merge heights approx. 0.358, 0.964, 1.781); repr truncated]
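
Each bracket in a dendrogram records one merge of two clusters, drawn at the height of the distance at which they were joined. Which linkage `create_2Dinteractive_plots` uses is not shown in this notebook. The sketch below assumes single linkage (distance between the closest pair of members) and uses a deliberately naive search, so it only suits small inputs. The `single_linkage` helper and the toy points are hypothetical.

```python
import math


def single_linkage(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)  # merged clusters get fresh ids
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = single_linkage(points)
for a, b, d in merges:
    print(f"merge {a} + {b} at height {d:.3f}")
```

Cutting the tree at a chosen height yields a flat clustering: here, any cut between 1.5 and the final merge height separates the two pairs of points.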

<font size="+1" color="grey"><b> 2.2.5 Parallel Coordinate Plot</b></font><br/>

In [20]:
parallel_plot_2d = (parallel_widget.update_layout(
    # width=800,
    # height=600,
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
))
# parallel_plot_2d.write_image("../../images/case2_2d_parallel_plot.png", width=800, height=400, scale=2)
parallel_plot_2d

[Output: FigureWidget parallel coordinates plot (`parcoords` trace) with axes t-SNE_Component_1 and t-SNE_Component_2, lines colored by Class; repr truncated]
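
In a parallel coordinates plot every data row becomes a polyline that crosses one vertical axis per dimension, and each axis is scaled to its own value range. A minimal sketch of that mapping, assuming simple min-max scaling per axis, is shown below. The `to_polylines` helper is hypothetical and the rows are rounded values from the cluster table above.

```python
def to_polylines(rows):
    """Map each row to (axis_index, scaled_value) vertices (hypothetical helper)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for row in rows:
        lines.append([
            # Scale to [0, 1] per axis; a constant column is placed mid-axis
            (axis, (v - lo) / (hi - lo) if hi > lo else 0.5)
            for axis, (v, lo, hi) in enumerate(zip(row, lows, highs))
        ])
    return lines


rows = [(-11.24, 0.42), (-7.52, 0.40), (-5.04, -2.69)]
lines = to_polylines(rows)
for line in lines:
    print(line)
```

With only the two t-SNE components as axes, each polyline is a single segment, so the plot mainly shows how groups order along each component. The technique becomes more informative as more dimensions are added.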
                   

##### All in one view

In [21]:
# display(grid_layout)

## To be continued ...