
In [1]:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='/kaggle/input/iris-dataset-extended/iris_extended.csv')
df.head()

Unnamed: 0,species,elevation,soil_type,sepal_length,sepal_width,petal_length,petal_width,sepal_area,petal_area,sepal_aspect_ratio,...,sepal_to_petal_length_ratio,sepal_to_petal_width_ratio,sepal_petal_length_diff,sepal_petal_width_diff,petal_curvature_mm,petal_texture_trichomes_per_mm2,leaf_area_cm2,sepal_area_sqrt,petal_area_sqrt,area_ratios
0,setosa,161.8,sandy,5.16,3.41,1.64,0.26,17.5956,0.4264,1.513196,...,3.146341,13.115385,3.52,3.15,5.33,18.33,53.21,4.194711,0.652993,41.265478
1,setosa,291.4,clay,5.48,4.05,1.53,0.37,22.194,0.5661,1.353086,...,3.581699,10.945946,3.95,3.68,5.9,20.45,52.53,4.711051,0.752396,39.205087
2,setosa,144.3,sandy,5.1,2.8,1.47,0.38,14.28,0.5586,1.821429,...,3.469388,7.368421,3.63,2.42,5.66,24.62,50.25,3.778889,0.747395,25.56391
3,setosa,114.6,clay,4.64,3.44,1.53,0.17,15.9616,0.2601,1.348837,...,3.03268,20.235294,3.11,3.27,4.51,22.91,50.85,3.995197,0.51,61.367166
4,setosa,110.9,loamy,4.85,2.87,1.23,0.26,13.9195,0.3198,1.689895,...,3.943089,11.038462,3.62,2.61,4.03,21.56,40.57,3.730885,0.565509,43.525641


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   species                          1200 non-null   object 
 1   elevation                        1200 non-null   float64
 2   soil_type                        1200 non-null   object 
 3   sepal_length                     1200 non-null   float64
 4   sepal_width                      1200 non-null   float64
 5   petal_length                     1200 non-null   float64
 6   petal_width                      1200 non-null   float64
 7   sepal_area                       1200 non-null   float64
 8   petal_area                       1200 non-null   float64
 9   sepal_aspect_ratio               1200 non-null   float64
 10  petal_aspect_ratio               1200 non-null   float64
 11  sepal_to_petal_length_ratio      1200 non-null   float64
 12  sepal_to_petal_width

In [3]:
from plotly.express import histogram
for column in df.columns:
    if column not in ['species', 'soil_type']:
        histogram(data_frame=df, x=column, color='species').show()

Clearly the setosa petals are very different from those of the other two species, which are harder to tell apart from each other based on the available data.
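
To put numbers on that impression, one option is to compare the per-species range of each petal measurement: if two species' ranges overlap, that feature alone cannot separate them. The sketch below is not part of the original notebook. The helper name `species_ranges` is hypothetical, and the small `toy` frame is made up purely so the example runs on its own; with the real data we would pass `df` instead.

```python
import pandas as pd


def species_ranges(frame: pd.DataFrame, features: list) -> pd.DataFrame:
    """Return the per-species minimum and maximum of each listed feature."""
    return frame.groupby('species')[features].agg(['min', 'max'])


# Made-up illustrative values, NOT taken from iris_extended.csv.
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.6, 4.0, 4.9, 4.8, 6.0],
    'petal_width': [0.2, 0.4, 1.2, 1.6, 1.5, 2.3],
})

ranges = species_ranges(toy, ['petal_length', 'petal_width'])
print(ranges)
```

In this toy frame the setosa maximum sits well below every other species' minimum, while the versicolor and virginica ranges overlap. Whether the real data shows the same pattern is something to check by calling `species_ranges(df, ['petal_length', 'petal_width'])`.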

We have a lot of dimensions; let's do some dimension reduction.

In [4]:
from sklearn.manifold import TSNE
from plotly.express import scatter

def tsne_plot(input_df: pd.DataFrame, columns_to_drop: list):
    # we want our code to be reentrant: work on a copy and discard any
    # 'x'/'y' embedding columns left over from an earlier call
    result_df = input_df.drop(columns=['x', 'y'], errors='ignore')
    tsne = TSNE(n_components=2, verbose=1, random_state=2023)
    # embed the remaining columns in two dimensions
    result_df[['x', 'y']] = tsne.fit_transform(X=result_df.drop(columns=columns_to_drop))
    # draw the same embedding once per categorical label
    for color in ['species', 'soil_type']:
        scatter(data_frame=result_df, x='x', y='y', color=color).show()

tsne_plot(input_df=df, columns_to_drop=['species', 'soil_type', ])

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1200 samples in 0.001s...
[t-SNE] Computed neighbors for 1200 samples in 0.108s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1200
[t-SNE] Computed conditional probabilities for sample 1200 / 1200
[t-SNE] Mean sigma: 8.003756
[t-SNE] KL divergence after 250 iterations with early exaggeration: 53.045162
[t-SNE] KL divergence after 1000 iterations: 0.435911


Clearly t-SNE easily separates setosa from the other two species; it does not separate the soil types nearly as well.

In [5]:
tsne_plot(input_df=df, columns_to_drop=['species', 'soil_type', 'elevation'])

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1200 samples in 0.001s...
[t-SNE] Computed neighbors for 1200 samples in 0.045s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1200
[t-SNE] Computed conditional probabilities for sample 1200 / 1200
[t-SNE] Mean sigma: 2.734062
[t-SNE] KL divergence after 250 iterations with early exaggeration: 59.698608
[t-SNE] KL divergence after 1000 iterations: 0.747908


We get much better separation among our clusters if we drop the elevation, which, as we learned from the histograms above, is distributed more or less uniformly across the dataset.
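
A likely reason elevation hurts so much is scale: in the rows shown by `df.head()` it runs into the hundreds while most other features are in the single or double digits, and t-SNE in scikit-learn works from Euclidean distances by default, so the largest-scale column dominates. An alternative to dropping such a column is to standardize every feature first. The sketch below is an assumption-laden illustration, not something the notebook ran: `standardize` and `variance_share` are hypothetical helpers, the `toy` frame is randomly generated, and plain pandas z-scoring stands in for a dedicated scaler.

```python
import numpy as np
import pandas as pd


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Z-score every column: subtract its mean and divide by its standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)


def variance_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of the total variance contributed by each column."""
    variances = frame.var(ddof=0)
    return variances / variances.sum()


# Randomly generated stand-in data, NOT the real dataset: one wide-range
# column and two narrow-range columns.
rng = np.random.default_rng(2023)
toy = pd.DataFrame({
    'elevation': rng.uniform(100, 300, size=200),
    'petal_length': rng.normal(4.0, 1.5, size=200),
    'petal_width': rng.normal(1.2, 0.7, size=200),
})

print(variance_share(toy).round(3))               # the wide-range column dominates
print(variance_share(standardize(toy)).round(3))  # every column contributes equally
```

On the real data, the standardized numeric columns could be passed to `tsne.fit_transform` in place of the raw ones. Whether that separates versicolor from virginica as well as dropping elevation does is an open question this notebook does not test.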

In [6]:
from plotly.express import imshow
imshow(img=df.drop(columns=['species', 'soil_type',]).corr(), height=1000)
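
A 19-by-19 heatmap is hard to read at a glance, so it can help to list the most strongly correlated pairs directly. Several columns in this dataset are derived from others (in the first row shown above, `sepal_area` equals `sepal_length * sepal_width` and `sepal_area_sqrt` is its square root), so we should expect some near-perfect correlations. The sketch below is an illustration only: `top_correlations` is a hypothetical helper, it expects a frame with numeric columns only, and the `toy` frame is randomly generated rather than read from the CSV.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Indices of the strict upper triangle: each pair once, diagonal excluded.
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    return pairs.sort_values(ascending=False).head(n)


# Randomly generated stand-in data with two derived columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'sepal_length': rng.uniform(4, 8, 100),
    'sepal_width': rng.uniform(2, 4.5, 100),
})
toy['sepal_area'] = toy['sepal_length'] * toy['sepal_width']
toy['sepal_area_sqrt'] = np.sqrt(toy['sepal_area'])

top = top_correlations(toy, n=3)
print(top)
```

With the real data the call would be `top_correlations(df.drop(columns=['species', 'soil_type']))`. Highly correlated derived columns add little new information and may be candidates for dropping before dimension reduction.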

In [7]:
from plotly.express import treemap
# nest soil types inside each species
treemap(data_frame=df[['species', 'soil_type']], path=['species', 'soil_type'])

In [8]:
treemap(data_frame=df[['soil_type', 'species']], path=['soil_type', 'species'], )

In [9]:
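# would really like this to be an area plot
# name the count column explicitly so this does not depend on the pandas version
counts_df = df[['species', 'soil_type']].value_counts().reset_index(name='count')
imshow(img=counts_df.pivot(columns='soil_type', index='species', values='count'))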
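
The same species-by-soil table can also be built in one step with `pd.crosstab`, which fills missing combinations with 0 rather than NaN and can normalize each row to proportions. This is offered as a sketch of an alternative, not as what the notebook did: `species_by_soil` is a hypothetical helper and the `toy` frame is made up so the example runs without the Kaggle file.

```python
import pandas as pd


def species_by_soil(frame: pd.DataFrame, normalize: bool = False) -> pd.DataFrame:
    """Cross-tabulate species (rows) against soil type (columns).

    With normalize=True each row is rescaled to sum to 1.
    """
    return pd.crosstab(
        index=frame['species'],
        columns=frame['soil_type'],
        normalize='index' if normalize else False,
    )


# Made-up illustrative rows, NOT taken from iris_extended.csv.
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'virginica'],
    'soil_type': ['sandy', 'clay', 'sandy', 'loamy', 'clay', 'sandy'],
})

counts = species_by_soil(toy)
shares = species_by_soil(toy, normalize=True)
print(counts)
print(shares)
```

Either table can be handed to `imshow` as in the cell above. The row-normalized version makes species with different sample counts comparable.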