The paper [Exploring Chart Choices for High Dimensional Projections](https://) discusses the need for alternative ways of visualizing Multi-Dimensional Projections (MDP). This project demonstrates alternative chart types for high-dimensional projections through three use cases. Because data instances are related in complex ways, different design choices need to be explored to make these complexities clear. The goal of the project is not to propose a new visualization but to explore the design space of alternative chart types.


## Use case 3:  Anomaly Detection

#### Anomaly Detection through Visualization

In this notebook, we start exploring anomaly detection using visualization techniques. Anomaly detection, also known as outlier detection, is the process of identifying data points that deviate significantly from the majority of the data. Detecting anomalies is crucial in various domains, including fraud detection, network security, and quality control, as it helps uncover potentially critical issues.
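
As a minimal sketch of that definition, the snippet below flags values whose z-score exceeds a threshold. The function name `zscore_outliers`, the sample values, and the threshold of 2.0 are illustrative assumptions and are not taken from the project's code.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical measurements: six similar values and one that deviates strongly
sample = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50, 0.95]
flagged = zscore_outliers(sample)
print(flagged)
```

A single extreme value inflates both the mean and the standard deviation, so z-score thresholds tend to be less sensitive on small samples. This is one reason to complement such rules with visual inspection.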

##### Dataset Description

1. Dataset: Yeast
Link: https://archive.ics.uci.edu/dataset/110/yeast

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 data-toggle="list"  role="tab" aria-controls="home"><p style="font-size : 30px"><font color="darkgrey">Content<font/></p></h3>

1. [<font color="darkgrey"> Dataset<font/>](#1)
    - 1.1 [<font color="darkgrey"> Overview<font/>](#1.1)
    - 1.2 [<font color="darkgrey"> Preprocessing<font/>](#1.2) 
    - 1.3 [<font color="darkgrey"> Dimensionality Reduction<font/>](#1.3) 

2. [<font color="darkgrey">Visual Perspectives<font/>](#2)   
    - 2.1. [<font color="darkgrey"> One Dimension (1D)<font/>](#2.1)
        - 2.1.1 [<font color="darkgrey"> 1D Strip plot <font/>](#2.1.1)
        - 2.1.2 [<font color="darkgrey"> 1D Scatter plot <font/>](#2.1.2)
        - 2.1.3 [<font color="darkgrey"> Box Plot <font/>](#2.1.3)
        - 2.1.4 [<font color="darkgrey"> Histogram <font/>](#2.1.4)
        - 2.1.5 [<font color="darkgrey"> Violin Plot <font/>](#2.1.5)
    - 2.2. [<font color="darkgrey"> Two Dimensions (2D)<font/>](#2.2)
        - 2.2.1 [<font color="darkgrey">Scatterplot<font/>](#2.2.1)
        - 2.2.2 [<font color="darkgrey"> Heatmap<font/>](#2.2.2)
        - 2.2.3 [<font color="darkgrey"> Parallel Coordinate Plot <font/>](#2.2.3)
        - 2.2.4 [<font color="darkgrey"> Radar Chart<font/>](#2.2.4)
        - 2.2.5 [<font color="darkgrey"> 2D Histogram Scatter Plot<font/>](#2.2.5)

    

<font size="+3" color="grey"><b>1. Dataset </b></font><br><a id="1"></a>
<a href="#top" class="btn-xs btn-danger" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go back to the TOP</a>

In [25]:
import sys, os
import pandas as pd
# Move one step up from the current working directory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
# Create a directory
output_dir = "images"
os.makedirs(output_dir, exist_ok=True)

# Replace with the path to your .data file
data_file_path = '../Data/Dataset_2/yeast/yeast.data'
column_headers = [
    "Sequence Name", "mcg", "gvh", "alm", "mit", "erl", "pox",  "vac", "nuc", "localization_site"
]

# Read the .data file and set the headers
df = pd.read_csv(data_file_path,  delimiter=r'\s+', header=None, names=column_headers)


<font size="+2" color="grey"><b>1.1. Overview </b></font><br>

In [26]:
print(df["localization_site"].unique())
df[["mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"]]

['MIT' 'NUC' 'CYT' 'ME1' 'EXC' 'ME2' 'ME3' 'VAC' 'POX' 'ERL']


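| | mcg | gvh | alm | mit | erl | pox | vac | nuc | localization_site |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.58 | 0.61 | 0.47 | 0.13 | 0.5 | 0.0 | 0.48 | 0.22 | MIT |
| 1 | 0.43 | 0.67 | 0.48 | 0.27 | 0.5 | 0.0 | 0.53 | 0.22 | MIT |
| 2 | 0.64 | 0.62 | 0.49 | 0.15 | 0.5 | 0.0 | 0.53 | 0.22 | MIT |
| 3 | 0.58 | 0.44 | 0.57 | 0.13 | 0.5 | 0.0 | 0.54 | 0.22 | NUC |
| 4 | 0.42 | 0.44 | 0.48 | 0.54 | 0.5 | 0.0 | 0.48 | 0.22 | MIT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1479 | 0.81 | 0.62 | 0.43 | 0.17 | 0.5 | 0.0 | 0.53 | 0.22 | ME2 |
| 1480 | 0.47 | 0.43 | 0.61 | 0.40 | 0.5 | 0.0 | 0.48 | 0.47 | NUC |
| 1481 | 0.67 | 0.57 | 0.36 | 0.19 | 0.5 | 0.0 | 0.56 | 0.22 | ME2 |
| 1482 | 0.43 | 0.40 | 0.60 | 0.16 | 0.5 | 0.0 | 0.53 | 0.39 | NUC |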


*** The filter that would keep only two classes (MIT and NUC) is commented out in the next cell, so all ten classes are retained ***

In [27]:
new_df = df # [df["localization_site"].isin(['MIT', 'NUC'])]

<font size="+2" color="grey"><b>1.2. Preprocessing </b></font><br>

In [28]:
from Implementations.imputation import Preprocessor
# Initialize the Preprocessor class with the dataset
preprocessor = Preprocessor(new_df)
# Exempt the identifier and the target variable from preprocessing
exempt_columns = ['Sequence Name', 'localization_site']
processed_data = preprocessor.preprocess(
    strategy='mean', 
    remove_missing=False, 
    exempt_columns=exempt_columns
)

<font size="+2" color="grey"><b>1.3. Dimensionality Reduction </b></font><br>


<font size="+1" color="grey"><b> Dimensionality Reduction Techniques </b></font><br>

Dimensionality reduction simplifies complex data by reducing the number of features, making it easier to visualize and understand.

**PCA (Principal Component Analysis)**
- **Purpose**: Transforms data into a new space, capturing the most important patterns.
- **Method**: Projects high-dimensional data onto a lower-dimensional space while retaining essential information.

**t-SNE (t-Distributed Stochastic Neighbor Embedding)**
- **Purpose**: Visualizes high-dimensional data in a lower dimension.
- **Method**: Keeps similar points close together to help identify clusters and patterns.

**UMAP (Uniform Manifold Approximation and Projection)**
- **Purpose**: Reduces dimensions while preserving both global and local structures in data.
- **Method**: Effective for larger datasets, showing both major patterns and finer details.

After applying these techniques, visualizations in 2D plots make it easier to see and interpret complex data patterns.
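
To make the PCA description concrete, here is a minimal sketch that projects data onto its top two principal components using a singular value decomposition. It assumes NumPy is available. The helper `pca_project` and the random demo matrix are hypothetical stand-ins and do not reproduce the project's `DimensionalityReduction` class. t-SNE and UMAP are non-linear and have no comparably short closed form, so they are not sketched here.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (computed via SVD)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Hypothetical data: 100 samples with eight numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
embedding, explained_ratio = pca_project(X_demo)
print(embedding.shape, explained_ratio)
```

The explained-variance ratios indicate how much of the original spread survives the projection. Low values are a hint that a 2D view may hide structure, including anomalies.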


In [29]:
processed_data

| | Sequence Name | mcg | gvh | alm | mit | erl | pox | vac | nuc | localization_site |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ADT1_YEAST | 0.581981 | 0.888481 | -0.346645 | -0.957203 | -0.09759 | -0.099131 | -0.344175 | -0.527919 | MIT |
| 1 | ADT2_YEAST | -0.510891 | 1.372811 | -0.231226 | 0.064312 | -0.09759 | -0.099131 | 0.521219 | -0.527919 | MIT |
| 2 | ADT3_YEAST | 1.019130 | 0.969203 | -0.115808 | -0.811272 | -0.09759 | -0.099131 | 0.521219 | -0.527919 | MIT |
| 3 | AAR2_YEAST | 0.581981 | -0.483786 | 0.807542 | -0.957203 | -0.09759 | -0.099131 | 0.694298 | -0.527919 | NUC |
| 4 | AATM_YEAST | -0.583749 | -0.483786 | -0.231226 | 2.034375 | -0.09759 | -0.099131 | -0.344175 | -0.527919 | MIT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1479 | YUR1_YEAST | 2.257718 | 0.969203 | -0.808320 | -0.665341 | -0.09759 | -0.099131 | 0.521219 | -0.527919 | ME2 |
| 1480 | ZIP1_YEAST | -0.219458 | -0.564507 | 1.269217 | 1.012861 | -0.09759 | -0.099131 | -0.344175 | 1.820499 | NUC |
| 1481 | ZNRP_YEAST | 1.237705 | 0.565595 | -1.616251 | -0.519411 | -0.09759 | -0.099131 | 1.040456 | -0.527919 | ME2 |
| 1482 | ZUO1_YEAST | -0.510891 | -0.806672 | 1.153799 | -0.738307 | -0.09759 | -0.099131 | 0.521219 | 1.069005 | NUC |


In [30]:
from Implementations.dimensionality_reduction import DimensionalityReduction
import pandas as pd
# Initialize DimensionalityReduction class
processed_data = processed_data.drop(columns=["Sequence Name"], errors='ignore')
dr = DimensionalityReduction(data=processed_data, target_column='localization_site')
# Apply different dimensionality reduction techniques
pca_df = dr.apply_pca()
tsne_df = dr.apply_tsne()
umap_df = dr.apply_umap()
complete_dataset = pd.concat([pca_df, tsne_df, umap_df], axis=1)

# combine reduced dimension into one
merged_datasets = complete_dataset[[
    # "PCA_Component_1", "PCA_Component_2", 
    # "t-SNE_Component_1", "t-SNE_Component_2", 
    "UMAP_Component_1", "UMAP_Component_2", "localization_site"
]]
# The concat can leave duplicate column names (e.g. a repeated "localization_site"); keep the first copy of each
merged_datasets = merged_datasets.loc[:, ~merged_datasets.columns.duplicated()].copy()
# merged_datasets['Class2'] = merged_datasets['localization_site'].replace({0.: 'Group 1', 1.: 'Group 2'})



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [31]:
import altair as alt
base = alt.Chart(merged_datasets).encode().properties(
    width=1000,)
base.mark_circle(size=60).encode(
    x=alt.X('UMAP_Component_1', title="UMAP 1"),
    y=alt.Y('UMAP_Component_2', title="UMAP 2"),
    color=alt.Color('localization_site:N', scale=alt.Scale(scheme='tableau20') )
)


<font size="+9" color="grey"><b>2.1. 1D</b></font><font size="+1" color="grey"><b> (Anomaly Detection) </b></font><br>

In [32]:
from Implementations.visualization import create_combined_chart

scatter_plot_single, gaussian_jitter, box_with_jitter, histogram_shade, combined_chart = create_combined_chart(
    merged_datasets, "UMAP_Component_1", "UMAP_Component_2",
    color_column='localization_site', 
    main_color_domain=['MIT', 'NUC', 'CYT', 'ME1', 'EXC', 'ME2', 'ME3', 'VAC', 'POX', 'ERL'],
    main_color_range=["#543005", "#8c510a", "#bf812d", "#dfc27d", "#f6e8c3", 
    "#f5f5f5", "#c7eae5", "#80cdc1", "#35978f", "#01665e"],
    attr_color_domain=[
        'MIT Attr 1', 'MIT Attr 2', 
        'NUC Attr 1', 'NUC Attr 2', 
        'CYT Attr 1', 'CYT Attr 2', 
        'ME1 Attr 1', 'ME1 Attr 2', 
        'EXC Attr 1', 'EXC Attr 2', 
        'ME2 Attr 1', 'ME2 Attr 2', 
        'ME3 Attr 1', 'ME3 Attr 2', 
        'VAC Attr 1', 'VAC Attr 2', 
        'POX Attr 1', 'POX Attr 2', 
        'ERL Attr 1', 'ERL Attr 2'
    ],
    attr_color_range=["#66c2a5", "#fc8d62", "#66c2a5", "#fc8d62"],
    attribute_nomeclature=["Attr 1", "Attr 2"],
    width_single=800,
    height_single=400, 
    jitter_size=40
    # greyed_out_option=False
)
combined_chart.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case3_combined.png', format='png', scale_factor=5)

In [33]:
scatter_plot_single.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case3_scatter_plot.png', format='png', scale_factor=5)

<font size="+1" color="grey"><b> 2.1.1 Strip Plot </b></font>

In [34]:
gaussian_jitter.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case3_strip_plot.png', format='png', scale_factor=5)
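
The idea behind a jittered strip plot can be sketched independently of the plotting library: each 1D value keeps its position on the main axis and receives a small random offset on the other axis, so overlapping points separate visually. The helper `jitter_1d`, the offset scale, and the sample values below are hypothetical. How `create_combined_chart` computes its jitter is not shown in this notebook.

```python
import random

def jitter_1d(values, scale=0.1, seed=0):
    """Pair each value with a Gaussian offset so that overlapping points spread out."""
    rng = random.Random(seed)  # seeded so the layout is reproducible
    return [(v, rng.gauss(0.0, scale)) for v in values]

# Hypothetical 1D projection values with exact ties
points = jitter_1d([9.5, 9.5, 9.5, 11.7, -6.6])
for x, offset in points:
    print(f"x={x:6.2f}  jitter={offset:+.3f}")
```

The offsets carry no information. They only reduce overplotting, so an isolated point along the main axis remains a candidate anomaly regardless of its vertical position.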

<font size="+1" color="grey"><b> 2.1.2 1D Scatter Plot </b></font>

In [35]:
import altair as alt

color_scheme = [
    "#543005", "#8c510a", "#bf812d", "#dfc27d", "#f6e8c3", 
    "#f5f5f5", "#c7eae5", "#80cdc1", "#35978f", "#01665e"
]
# Bind the selection to the legend so that clicking a localization site highlights its points
selection = alt.selection_point(fields=['localization_site'], bind='legend')

base = alt.Chart(merged_datasets).encode().properties(
    width=1000,)

#plot for 1D scatterplot
UMAP1_circle = base.mark_circle(size=60).encode(
    x=alt.X('UMAP_Component_1', title="UMAP 1"),
    color=alt.Color('localization_site:N', scale=alt.Scale(domain=merged_datasets['localization_site'].unique(), range=color_scheme)),
    #tooltip=list(all_features),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(
    selection
)

UMAP2_circle = base.mark_circle(size=60).encode(
    x=alt.X('UMAP_Component_2', title="UMAP 2"),
    color=alt.Color('localization_site:N', scale=alt.Scale(domain=merged_datasets['localization_site'].unique(), range=color_scheme)),
    #tooltip=list(all_features),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
)


(UMAP1_circle & UMAP2_circle).properties(
    title= '1-dimensional Plot',
).configure_title(
    anchor= 'middle',
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case3_1D_scatter_plot.png', format='png', scale_factor=5)



<font size="+1" color="grey"><b> 2.1.3 Box Plot for UMAP grouped by localization site </b></font>

In [36]:
# Plot the UMAP components as box plots, one box per localization site

import altair as alt

# Define your custom color scheme
color_scheme = [
    "#543005", "#8c510a", "#bf812d", "#dfc27d", "#f6e8c3", 
    "#f5f5f5", "#c7eae5", "#80cdc1", "#35978f", "#01665e"
]

# Create the box plot for UMAP_Component_1
UMAP1_boxplot = alt.Chart(merged_datasets).mark_boxplot().encode(
    x='localization_site:N',
    y=alt.Y('UMAP_Component_1:Q', title='UMAP 1'),
    color=alt.Color('localization_site:N', scale=alt.Scale(domain=merged_datasets['localization_site'].unique(), range=color_scheme))
).properties(
    width=300,
    height=300
)

# Create the box plot for UMAP_Component_2
UMAP2_boxplot = alt.Chart(merged_datasets).mark_boxplot().encode(
    x='localization_site:N',
    y=alt.Y('UMAP_Component_2:Q', title='UMAP 2', scale=alt.Scale(domain=[-6, 14])),
    color=alt.Color('localization_site:N', scale=alt.Scale(domain=merged_datasets['localization_site'].unique(), range=color_scheme))
).properties(
    width=300,
    height=300
)

# Combine both box plots and configure the title
final_plot = (UMAP1_boxplot | UMAP2_boxplot).properties(
    title='UMAP Components grouped by localization site'
).configure_title(
    anchor='middle',
    fontSize=18  
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)

# Save the final plot
# final_plot.save('../../images/case3_box_plot.png', format='png', scale_factor=5)


# box_with_jitter.save('../../images/case3_box_plot_with_strip_plot.png', format='png', scale_factor=5)


<font size="+1" color="grey"><b> 2.1.4 Histogram Plot</b></font>

In [37]:
histogram_shade.configure_title(
    fontSize=18
).configure_axis(
    labelFontSize=15,
    titleFontSize=18
).configure_legend(
    labelFontSize=15,
    titleFontSize=18,
    orient="bottom", title=None, padding=5, labelLimit=200, 
    # columns=3
)
# .save('../../images/case3_histogram_plot.png', format='png', scale_factor=5)

<font size="+1" color="grey"><b> 2.1.5 Violin Plot</b></font><br/>

In [38]:
from Implementations.visualization import create_plotly_violin_plots
violin_plot = create_plotly_violin_plots(merged_datasets, 'UMAP_Component_1', 'UMAP_Component_2', target="localization_site", base_colors=color_scheme)
violin_plot.update_layout(
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))

)
# .write_image("../../images/case3_violin_plot.png", width=800, height=400, scale=2)
violin_plot

In [39]:
from Implementations.visualization import create_outlier_plots
outlier_plot = create_outlier_plots(merged_datasets, 'UMAP_Component_1', 'UMAP_Component_2', contamination=0.1)

# Display the grid plot
outlier_plot
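
The `Outlier_Z` and `Outlier_IQR` columns that appear in `merged_datasets` further down suggest rule-based flags. A minimal sketch of an interquartile-range rule follows. The helper `iqr_labels`, the conventional factor `k=1.5`, the `"Outlier"` label, and the sample values are assumptions, and the code is not the implementation inside `create_outlier_plots`. The `Outlier_IF` column and the `contamination` argument point to an Isolation Forest, which is not sketched here.

```python
from statistics import quantiles

def iqr_labels(values, k=1.5):
    """Label each value 'Outlier' if it lies beyond k * IQR from the quartiles, else 'Normal'."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ["Outlier" if (v < lower or v > upper) else "Normal" for v in values]

# Hypothetical 1D projection values: most lie near 9 to 12, one lies far away
sample = [9.5, 9.3, 9.6, 11.7, 14.2, 9.4, 8.8, 8.9, 11.9, -6.6]
labels = iqr_labels(sample)
print(list(zip(sample, labels)))
```

Because quartiles are robust to extreme values, this rule is less affected by the anomalies themselves than a mean-based z-score.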

<font size="+9" color="grey"><b>2.2. 2D</b></font><font size="+1" color="grey"><b> ( Anomaly Identification) </b></font><br>

<h2 style="font-size: 1.2em;">2D Types or More</h2>
<ul>
    <li>Scatter Plot</li>
    <li>Density Contour Plot</li>
    <li>Heatmap Plot</li>
    <li>Parallel Coordinate Plot</li>
    <li>Dendrogram Plot</li>
</ul>

In [40]:
merged_datasets['localization_site'].unique()

array(['MIT', 'NUC', 'CYT', 'ME1', 'EXC', 'ME2', 'ME3', 'VAC', 'POX',
       'ERL'], dtype=object)

In [41]:
merged_datasets

| | UMAP_Component_1 | UMAP_Component_2 | localization_site | Z1 | Z2 | Outlier_Z | Outlier_IQR | Outlier_IF |
|---|---|---|---|---|---|---|---|---|
| 0 | 9.542462 | 4.898843 | MIT | -0.285094 | -0.628717 | Normal | Normal | Normal |
| 1 | 9.299116 | 5.506402 | MIT | -0.363783 | -0.381550 | Normal | Normal | Normal |
| 2 | 9.571456 | 4.387489 | MIT | -0.275719 | -0.836745 | Normal | Normal | Normal |
| 3 | 11.698423 | 6.441824 | NUC | 0.412056 | -0.001003 | Normal | Normal | Normal |
| 4 | 14.248354 | 4.603813 | MIT | 1.236600 | -0.748741 | Normal | Normal | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1479 | 9.386407 | 2.922579 | ME2 | -0.335556 | -1.432699 | Normal | Normal | Normal |
| 1480 | 8.758707 | 9.586633 | NUC | -0.538529 | 1.278365 | Normal | Normal | Normal |
| 1481 | 8.877038 | 3.127717 | ME2 | -0.500266 | -1.349245 | Normal | Normal | Normal |
| 1482 | 11.942624 | 9.899110 | NUC | 0.491020 | 1.405487 | Normal | Normal | Normal |


In [42]:
from Implementations.visualization import create_2Dinteractive_plots
# Map each class label to an integer code; Series.map avoids the deprecated downcasting behavior of replace
merged_datasets['Class2'] = merged_datasets['localization_site'].map({
    'MIT': 0, 'NUC': 1, 'CYT': 2, 'ME1': 3, 'EXC': 4, 'ME2': 5, 'ME3': 6, 'VAC': 7, 'POX': 8, 'ERL': 9})
scatter_widget, contour_widget, density_widget, parallel_widget, dendro_widget, grid_layout = create_2Dinteractive_plots(
    merged_datasets, 'UMAP_Component_1', 'UMAP_Component_2', target_numeric="Class2", target="localization_site"
)



<font size="+1" color="grey"><b> 2.2.1 Scatter Plot</b></font><br/>

In [43]:
scatter_plot_2d = (scatter_widget.update_layout(
    width=800,
    height=600,
    xaxis=dict(
        range=[-8, 15],
        gridcolor='LightGray',
        showgrid=True,
        zeroline=True,         # Show 0 line
        zerolinecolor="gray",  # Set color for 0 line
        zerolinewidth=1,
        title_font=dict(size=20), 
        tickfont=dict(size=16)  
    ),
    yaxis=dict(
        range=[0, 12],
        gridcolor='LightGray',
        showgrid=True,
        zeroline=True,         # Show 0 line
        zerolinecolor="gray",  # Set color for 0 line
        zerolinewidth=1 ,
        title_font=dict(size=20), 
        tickfont=dict(size=16) 
    ),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'

))

# scatter_plot_2d.write_image("../../images/case3_2d_scatter_plot.png", width=800, height=400, scale=2)

<font size="+1" color="grey"><b> 2.2.2 Heatmap Plot</b></font><br/>

In [44]:

heatmap_2d = (density_widget.update_layout(
    # width=800,
    # height=600,
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
))

# heatmap_2d.write_image("../../images/case3_2d_heatmap_plot.png", width=800, height=400, scale=2)

<font size="+1" color="grey"><b> 2.2.3 Parallel Coordinate Plot</b></font><br/>

In [45]:
parallel_plot_2d = (parallel_widget.update_layout(
    # width=800,
    # height=600,
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
))

# parallel_plot_2d.write_image("../../images/case3_2d_parallel_plot.png", width=800, height=400, scale=2)

<font size="+1" color="grey"><b> 2.2.4 Radar Plot</b></font><br/>

In [46]:
import plotly.graph_objects as go

# Custom color scheme
color_scheme = [
    "#543005", "#8c510a", "#bf812d", "#dfc27d", "#f6e8c3", 
    "#f5f5f5", "#c7eae5", "#80cdc1", "#35978f", "#01665e"
]

# Drop rows with NaN values in specified columns
df = df.dropna(subset=["mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"])

# Group by localization_site and calculate the mean for each component
grouped_data = df[["mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"]].groupby('localization_site').mean().reset_index()

# Prepare data for the radar chart
categories = list(grouped_data.columns[1:])  # Exclude 'localization_site'
values = grouped_data[categories].values  # Values for radar chart
values = [list(row) + [row[0]] for row in values]  # Close the radar chart
categories += [categories[0]]  # Close the categories


# Create the radar chart
fig = go.Figure()

# Add traces for each localization_site with custom colors
for i, row in grouped_data.iterrows():
    fig.add_trace(go.Scatterpolar(
        r=list(row.iloc[1:]) + [row.iloc[1]],  # Close the polygon by repeating the first value
        theta=categories,
        fill='toself',
        name=row['localization_site'],
        line=dict(color=color_scheme[i % len(color_scheme)])  # Apply custom color
    ))

# Update layout
fig.update_layout(
    title='Radar Chart of Localization Site',
    polar=dict(
        radialaxis=dict(
            visible=True,
            # range=[0, 1]
        )
    ),
    showlegend=True,
    # width=700,
    # height=700,
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
)

# Show the figure
# fig.show()
# fig.write_image("../../images/case3_2d_radar_original_data_plot.png", width=800, height=400, scale=2)


In [47]:
import plotly.graph_objects as go

# Custom color scheme
color_scheme = [
    "#543005", "#8c510a", "#bf812d", "#dfc27d", "#f6e8c3", 
    "#f5f5f5", "#c7eae5", "#80cdc1", "#35978f", "#01665e"
]

# Group by localization_site and calculate the mean for each component
grouped_data = merged_datasets[["UMAP_Component_1", "UMAP_Component_2", "localization_site"]].groupby('localization_site').mean().reset_index()

# Prepare data for the radar chart
categories = list(grouped_data.columns[1:])  # Exclude 'localization_site'
values = grouped_data[categories].values  # Values for radar chart
values = [list(row) + [row[0]] for row in values]  # Close the radar chart by repeating the first value
categories += [categories[0]]  # Close the categories

# Create the radar chart
fig = go.Figure()

# Add traces for each localization_site with custom colors
for i, row in grouped_data.iterrows():
    fig.add_trace(go.Scatterpolar(
        r=list(row.iloc[1:]) + [row.iloc[1]],  # Close the polygon by repeating the first value
        theta=categories,
        fill='toself',
        name=row['localization_site'],
        line=dict(color=color_scheme[i % len(color_scheme)]),  # Apply custom color
        marker=dict(size=10)  # Increase the size of the points
    ))

# Update layout
fig.update_layout(
    title='Radar Chart of Localization Site',
    polar=dict(
        radialaxis=dict(
            visible=True,
        ),
        angularaxis=dict(
            tickfont=dict(size=12),  # Increase the font size of the labels
            rotation=90,  # Rotate the labels if necessary
            direction='clockwise'  # Change the direction of the labels
        )
    ),
    showlegend=True,
    # width=800,  # Increase width to accommodate more space
    # height=800,  # Increase height
    margin=dict(l=20, r=20, t=50, b=20),  # Adjust margins to give more room
    font=dict(size=18),
    title_font=dict(size=24),
    xaxis=dict(title_font=dict(size=20), tickfont=dict(size=16)),
    yaxis=dict(title_font=dict(size=20), tickfont=dict(size=16))
)

# Show the figure
# fig.show()
# fig.write_image("../../images/case3_2d_radar_reduced_dim_plot.png", width=800, height=400, scale=2)


In [48]:
merged_datasets[["UMAP_Component_1", "UMAP_Component_2", "localization_site", "Class2"]]
grouped_data = merged_datasets[["UMAP_Component_1", "UMAP_Component_2", "localization_site"]].groupby('localization_site').mean().reset_index()
grouped_data

| | localization_site | UMAP_Component_1 | UMAP_Component_2 |
|---|---|---|---|
| 0 | CYT | 11.037064 | 7.220031 |
| 1 | ERL | -6.627049 | 4.135227 |
| 2 | EXC | 10.382649 | 2.71557 |
| 3 | ME1 | 9.592963 | 2.322833 |
| 4 | ME2 | 9.384328 | 3.729452 |
| 5 | ME3 | 7.555572 | 5.926231 |
| 6 | MIT | 12.25763 | 4.991294 |
| 7 | NUC | 10.620154 | 7.925319 |
| 8 | POX | 2.535629 | 2.853205 |
| 9 | VAC | 9.970539 | 5.714844 |


## To be continued ...

In [50]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

grouped_data = merged_datasets[["UMAP_Component_1", "UMAP_Component_2", "localization_site"]].groupby('localization_site').mean().reset_index()

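# Polar coordinates of each class centroid in UMAP space, used as stand-ins for the Taylor diagram metrics:
# the radius takes the place of the standard deviation and the cosine of the angle takes the place of
# the correlation (neither is computed from the spread of the data)
grouped_data['Standard Deviation'] = np.sqrt(grouped_data['UMAP_Component_1']**2 + grouped_data['UMAP_Component_2']**2)
grouped_data['Angle'] = np.degrees(np.arctan2(grouped_data['UMAP_Component_2'], grouped_data['UMAP_Component_1']))
grouped_data['Correlation'] = np.cos(np.radians(grouped_data['Angle']))  # Cosine of the centroid angle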

# Define diagram range and tick settings
float_max_r = grouped_data["Standard Deviation"].max() * 1.5
np_angular_ticks = np.arange(0, 100, 10)
np_angular_labels = np.round(np.cos(np.radians(np_angular_ticks)), 2)

# Plotting Taylor Diagram
fig = go.Figure()

# Add each data point to the polar plot
for index, row in grouped_data.iterrows():
    fig.add_trace(
        go.Scatterpolar(
            r=[row["Standard Deviation"]],
            theta=[row["Angle"]],
            mode="markers",
            name=row["localization_site"],  # Using localization_site as label
            marker=dict(size=10, opacity=0.8),
            hovertemplate=f"Localization: {row['localization_site']}<br>"
                          f"Std Dev: {row['Standard Deviation']:.2f}<br>"
                          f"Correlation: {row['Correlation']:.2f}<extra></extra>"
        )
    )

# Update layout for the polar plot
fig.update_layout(
    polar=dict(
        radialaxis=dict(range=[0, float_max_r]),
        angularaxis=dict(
            tickvals=np_angular_ticks,
            ticktext=np_angular_labels,
            direction="counterclockwise",
        ),
    ),
    title="Taylor Diagram for UMAP Components by Localization Site"
)

fig.show()


In [51]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go


grouped_data = merged_datasets[["UMAP_Component_1", "UMAP_Component_2", "localization_site"]].groupby('localization_site').mean().reset_index()

# Same polar-coordinate metrics as in the previous cell: radius and cosine of the angle of each class centroid
grouped_data['Standard Deviation'] = np.sqrt(grouped_data['UMAP_Component_1']**2 + grouped_data['UMAP_Component_2']**2)
grouped_data['Angle'] = np.degrees(np.arctan2(grouped_data['UMAP_Component_2'], grouped_data['UMAP_Component_1']))
grouped_data['Correlation'] = np.cos(np.radians(grouped_data['Angle']))  # Cosine of the centroid angle

# Define diagram range and tick settings
float_max_r = grouped_data["Standard Deviation"].max() * 1.5
np_angular_ticks = np.arange(0, 100, 10)
np_angular_labels = np.round(np.cos(np.radians(np_angular_ticks)), 2)

# Define Tableau 10 colors
tableau_colors = [
    '#1f77b4',  # Blue
    '#ff7f0e',  # Orange
    '#2ca02c',  # Green
    '#d62728',  # Red
    '#9467bd',  # Purple
    '#8c564b',  # Brown
    '#e377c2',  # Pink
    '#7f7f7f',  # Gray
    '#bcbd22',  # Olive
    '#17becf'   # Cyan
]

# Create a mapping of localization sites to colors
color_mapping = {site: tableau_colors[i % len(tableau_colors)] for i, site in enumerate(grouped_data['localization_site'])}

# Plotting Taylor Diagram
fig = go.Figure()

# Add each data point to the polar plot
for index, row in grouped_data.iterrows():
    fig.add_trace(
        go.Scatterpolar(
            r=[row["Standard Deviation"]],
            theta=[row["Angle"]],
            mode="markers",
            # mode="lines+markers",
            name=row["localization_site"],  # Using localization_site as label
            marker=dict(size=10, opacity=0.8, color=color_mapping[row["localization_site"]]),
            hovertemplate=f"Localization: {row['localization_site']}<br>"
                          f"Std Dev: {row['Standard Deviation']:.2f}<br>"
                          f"Correlation: {row['Correlation']:.2f}<extra></extra>"
        )
    )

# Update layout for the polar plot
fig.update_layout(
    polar=dict(
        radialaxis=dict(range=[0, float_max_r]),
        angularaxis=dict(
            tickvals=np_angular_ticks,
            ticktext=np_angular_labels,
            direction="counterclockwise",
        ),
    ),
    width=1000,
    height=600,
    # title="Taylor Diagram for UMAP Components by Localization Site"
)

fig.show()


In [52]:
import polar_diagrams as diag


grouped_data = processed_data[["mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"]].groupby('localization_site').mean().reset_index()

df_pivot = grouped_data.pivot_table(index=None, columns='localization_site', values=["mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "localization_site"], aggfunc='first')

# Reset index to make it a standard DataFrame
df_pivot.reset_index(drop=True, inplace=True)
df_pivot = df_pivot.reset_index().rename(columns={'index': 'localization_site'})

df_pivot['Reference Model'] = df_pivot['localization_site'].replace({
    'MIT': 0, 'NUC': 1, 'CYT': 2, 'ME1': 3, 'EXC': 4, 'ME2': 5, 'ME3': 6, 'VAC': 7, 'POX': 8, 'ERL': 9})

string_mid_type = 'scaled'
string_corr_method='pearson'
bool_discrete_measures=False
string_ref_model = 'Reference Model'
INT_RANDOM_SEED = 42
INT_NUM_OF_JOBS=12

dict_mi_parameters_features_continous_target_continous = dict(
    string_entropy_method='auto',
    int_mi_n_neighbors=3,
    bool_discrete_reference_model=False,
    discrete_models=False,
    int_random_state=INT_RANDOM_SEED)
df_pivot["Reference Model"] = df_pivot["Reference Model"].astype(float)
df_taylor_res = diag.df_calculate_all_properties(
    df_pivot[['CYT', 'ERL', 'EXC', 'ME1', 'ME2', 'ME3', 'MIT', 'NUC', 'POX', 'VAC', "Reference Model"]], string_reference_model=string_ref_model, 
    string_corr_method=string_corr_method, 
    dict_mi_parameters=dict_mi_parameters_features_continous_target_continous
)
df_taylor_res


'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function'root_mean_squared_error'.



| | Model | Standard Deviation | Correlation | Angle | CRMSE | Normalized CRMSE | Normalized Standard Deviation | Entropy | Mutual Information | Fixed_MI | Normalized Entropy | Normalized MI | Angle_NMI | Root Entropy | Joint_entropies | Scaled MI | Angle_SMI | Normalized Root Entropy | VI | RVI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CYT | 0.198725 | -0.440118 | 116.11141 | 2.385434 | 1.041089 | 0.086731 | -0.711999 | 0.053274 | 0.263297 | -0.213048 | 0.263297 | 74.734219 | -1.0 | 2.366663 | -0.26188 | 180.0 | -0.547015 | 2.103367 | 1.450299 |
| 1 | ERL | 3.279972 | -0.446641 | 116.528356 | 4.766713 | 2.080364 | 1.431497 | 2.728314 | 0.041667 | 0.205931 | 0.816382 | 0.068198 | 86.089494 | 1.651761 | 5.864342 | 0.132448 | 137.316041 | 0.903538 | 5.658412 | 2.378742 |
| 2 | EXC | 0.886863 | -0.396693 | 113.371613 | 2.765634 | 1.207022 | 0.387059 | 1.382772 | 0.0 | 0.0 | 0.413761 | 0.0 | 90.0 | 1.175913 | 4.724731 | 0.0 | 180.0 | 0.643242 | 4.724731 | 2.173645 |
| 3 | ME1 | 1.106369 | 0.084117 | 85.174747 | 2.459182 | 1.073275 | 0.482859 | 1.702892 | 0.024107 | 0.119146 | 0.509549 | 0.049944 | 87.137226 | 1.304949 | 4.925705 | 0.103124 | 142.537611 | 0.713827 | 4.806559 | 2.192387 |
| 4 | ME2 | 0.718997 | 0.086618 | 85.030932 | 2.341274 | 1.021816 | 0.313796 | 1.194488 | 0.06875 | 0.339785 | 0.357421 | 0.170064 | 80.208438 | 1.092926 | 4.196661 | 0.357212 | 106.593301 | 0.597847 | 3.856876 | 1.963893 |
| 5 | ME3 | 0.511214 | 0.646506 | 49.72132 | 1.999196 | 0.872521 | 0.223112 | 0.084387 | 0.084524 | 0.417745 | 0.025251 | 0.786636 | 38.127736 | 0.290494 | 3.008601 | 4.456576 | 0.0 | 0.158904 | 2.590856 | 1.609614 |
| 6 | MIT | 0.377664 | -0.10946 | 96.284166 | 2.36264 | 1.031141 | 0.164826 | 0.127752 | 0.0 | 0.0 | 0.038227 | 0.0 | 90.0 | 0.357424 | 3.469711 | 0.0 | 180.0 | 0.195516 | 3.469711 | 1.862716 |
| 7 | NUC | 0.295331 | -0.065077 | 93.731263 | 2.329226 | 1.016558 | 0.128893 | 0.138664 | 0.0 | 0.0 | 0.041492 | 0.0 | 90.0 | 0.372377 | 3.480623 | 0.0 | 180.0 | 0.203696 | 3.480623 | 1.865643 |
| 8 | POX | 1.837118 | 0.388995 | 67.108037 | 2.313041 | 1.009494 | 0.801784 | 0.318121 | 0.0 | 0.0 | 0.09519 | 0.0 | 90.0 | 0.564022 | 3.66008 | 0.0 | 180.0 | 0.308529 | 3.66008 | 1.913133 |
| 9 | VAC | 0.31089 | 0.324832 | 71.04462 | 2.209948 | 0.964501 | 0.135684 | 0.455672 | 0.0 | 0.0 | 0.136349 | 0.0 | 90.0 | 0.675035 | 3.797631 | 0.0 | 180.0 | 0.369254 | 3.797631 | 1.948751 |
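
The Taylor-diagram columns in this table are linked by simple geometry: `Angle` is consistent with the arccosine of `Correlation` (for CYT, arccos(-0.440118) ≈ 116.11°), and `CRMSE` with the law of cosines applied to the two standard deviations. The sketch below recomputes these statistics for a toy pair of series. It is not the `diag` library's implementation; the function name `taylor_statistics`, the population (ddof=0) standard deviation, and the sample values are assumptions made for illustration.

```python
import math


def taylor_statistics(model, reference):
    """Return Taylor-diagram statistics of `model` against `reference`."""
    n = len(reference)
    mean_m = sum(model) / n
    mean_r = sum(reference) / n

    # Population (ddof=0) statistics; a library may use ddof=1 instead.
    std_m = math.sqrt(sum((m - mean_m) ** 2 for m in model) / n)
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in reference) / n)
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(model, reference)) / n

    # Pearson correlation, clamped to guard acos against rounding error.
    corr = max(-1.0, min(1.0, cov / (std_m * std_r)))
    angle = math.degrees(math.acos(corr))

    # Centered RMSE: RMSE after removing each series' own mean.
    crmse = math.sqrt(
        sum(((m - mean_m) - (r - mean_r)) ** 2 for m, r in zip(model, reference)) / n
    )
    return {
        "std": std_m,
        "std_reference": std_r,
        "correlation": corr,
        "angle": angle,
        "crmse": crmse,
        "normalized_std": std_m / std_r,
        "normalized_crmse": crmse / std_r,
    }


# Toy data, not taken from the Yeast dataset.
stats = taylor_statistics([1.0, 2.0, 2.5, 4.0], [1.2, 1.9, 3.1, 3.8])
```

Because the correlation fixes the angle and the two standard deviations fix the side lengths, the centered RMSE is the third side of the triangle, which is why a single polar plot can show all three quantities at once.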


In [53]:
# Plotly display and export options, reused by the charts below.
DICT_PLOTLY_CONFIG = {
    'displaylogo': False,
    'toImageButtonOptions': {
        'format': 'png',  # one of png, svg, jpeg, webp
        'filename': 'polar_diagram',
        # 'height': 500,
        # 'width': 700,
        'scale': 6,  # Multiply title/legend/axis/canvas sizes by this factor
    },
}


# Taylor diagram: compare each class column against the reference model.
chart_taylor_res = diag.chart_create_taylor_diagram(
    df_pivot[['CYT', 'ERL', 'EXC', 'ME1', 'ME2', 'ME3', 'MIT', 'NUC', 'POX', 'VAC', "Reference Model"]],
    string_reference_model=string_ref_model,
    string_corr_method=string_corr_method
)
chart_taylor_res.show(config=DICT_PLOTLY_CONFIG)



In [54]:
# Make sure the reference column is numeric (float) before building the diagrams.
df_pivot["Reference Model"] = df_pivot["Reference Model"].astype(float)

# Combined figure: Taylor diagram plus mutual-information diagram (scaled variant).
chart_both = diag.chart_create_all_diagrams(
    df_pivot[['CYT', 'ERL', 'EXC', 'ME1', 'ME2', 'ME3', 'MIT', 'NUC', 'POX', 'VAC', "Reference Model"]],
    string_reference_model='Reference Model',
    string_corr_method='pearson',
    string_mid_type='scaled',
    dict_mi_parameters=dict_mi_parameters_features_continous_target_continous
)
chart_both.show(config=DICT_PLOTLY_CONFIG)
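
The information-theoretic columns of the results table follow a few identities that can be checked by hand: `VI` equals `Joint_entropies` minus `Fixed_MI`, and `RVI` is the square root of `VI` (for CYT, 2.366663 − 0.263297 ≈ 2.103367 and √2.103367 ≈ 1.450299). The sketch below reproduces those relations on small discrete series. It is only an illustration: the notebook's columns are continuous and the library's estimator is not shown here, so the plug-in entropy, the base-2 logarithm, and the names `entropy` and `information_statistics` are all assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Plug-in Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_statistics(x, y):
    """Entropy, mutual information and variation of information for x and y."""
    h_x = entropy(x)
    h_y = entropy(y)
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mi = h_x + h_y - h_xy            # I(X; Y)
    vi = h_xy - mi                   # VI = H(X, Y) - I(X; Y)
    return {
        "entropy_x": h_x,
        "entropy_y": h_y,
        "joint_entropy": h_xy,
        "mutual_information": mi,
        "vi": vi,
        "rvi": math.sqrt(max(vi, 0.0)),  # clamp tiny negative rounding errors
    }


# Two independent binary series: they share no information, so VI is maximal.
info = information_statistics([0, 0, 1, 1], [0, 1, 0, 1])
```

With these toy inputs the mutual information is zero and the variation of information equals the joint entropy, the same pattern as the rows of the table where `Fixed_MI` is 0.0 and `VI` matches `Joint_entropies`.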

