# Explorytics Basic Usage Tutorial with Wine Dataset

This notebook demonstrates the core functionality of Explorytics using the wine dataset from scikit-learn. The dataset contains 13 chemical measurements for each of 178 wines grown in the same region of Italy but derived from three different cultivars.

In [9]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from explorytics import DataAnalyzer

# Load the wine dataset
wine = load_wine()

# Create a pandas DataFrame
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Add target variable (wine class)
df['wine_class'] = wine.target

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFeature Names:")
for name in wine.feature_names:
    print(f"- {name}")
print("\nFirst few rows:")
df.head()

Dataset Shape: (178, 14)

Feature Names:
- alcohol
- malic_acid
- ash
- alcalinity_of_ash
- magnesium
- total_phenols
- flavanoids
- nonflavanoid_phenols
- proanthocyanins
- color_intensity
- hue
- od280/od315_of_diluted_wines
- proline

First few rows:


| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | wine_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


## Basic Analysis

Let's analyze the wine dataset using Explorytics:

In [10]:
# Initialize the analyzer
analyzer = DataAnalyzer(df)

# Perform comprehensive analysis
results = analyzer.analyze()

# Display basic statistics
print("Basic Statistics for Wine Features:")
for col, stats in results.basic_stats['numeric_summary'].items():
    if col != 'wine_class':  # Skip the target variable
        print(f"\n{col}:")
        for stat, value in stats.items():
            print(f"{stat}: {value:.2f}")

Basic Statistics for Wine Features:

alcohol:
count: 178.00
mean: 13.00
std: 0.81
min: 11.03
25%: 12.36
50%: 13.05
75%: 13.68
max: 14.83

malic_acid:
count: 178.00
mean: 2.34
std: 1.12
min: 0.74
25%: 1.60
50%: 1.87
75%: 3.08
max: 5.80

ash:
count: 178.00
mean: 2.37
std: 0.27
min: 1.36
25%: 2.21
50%: 2.36
75%: 2.56
max: 3.23

alcalinity_of_ash:
count: 178.00
mean: 19.49
std: 3.34
min: 10.60
25%: 17.20
50%: 19.50
75%: 21.50
max: 30.00

magnesium:
count: 178.00
mean: 99.74
std: 14.28
min: 70.00
25%: 88.00
50%: 98.00
75%: 107.00
max: 162.00

total_phenols:
count: 178.00
mean: 2.30
std: 0.63
min: 0.98
25%: 1.74
50%: 2.35
75%: 2.80
max: 3.88

flavanoids:
count: 178.00
mean: 2.03
std: 1.00
min: 0.34
25%: 1.21
50%: 2.13
75%: 2.88
max: 5.08

nonflavanoid_phenols:
count: 178.00
mean: 0.36
std: 0.12
min: 0.13
25%: 0.27
50%: 0.34
75%: 0.44
max: 0.66

proanthocyanins:
count: 178.00
mean: 1.59
std: 0.57
min: 0.41
25%: 1.25
50%: 1.56
75%: 1.95
max: 3.58

color_intensity:
count: 178.00
mean: 5.06
std:
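
The statistic names printed above (count, mean, std, min, 25%, 50%, 75%, max) are the same ones pandas reports from `DataFrame.describe()`. As a rough cross-check, the sketch below builds a similar nested dictionary with plain pandas. It assumes, without confirmation from the Explorytics source, that `numeric_summary` has this shape; the five-row `demo` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's df: the first five alcohol values
demo = pd.DataFrame({"alcohol": [14.23, 13.20, 13.16, 14.37, 13.24]})

# describe() returns one column per feature; to_dict() nests it as
# {feature: {statistic: value}}
summary = demo.describe().to_dict()

for col, col_stats in summary.items():
    print(f"\n{col}:")
    for stat, value in col_stats.items():
        print(f"{stat}: {value:.2f}")
```

Running the same two lines on the full `df` should give numbers comparable to the output above, which is a quick way to sanity-check any summary tool.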

## Distribution Analysis

Let's examine the distribution of various chemical properties:

In [11]:
# Select some interesting features to visualize
features_to_plot = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash']

for feature in features_to_plot:
    print(f"\nAnalyzing distribution of {feature}")
    dist_plot = analyzer.visualizer.plot_distribution(feature, kde=True)
    dist_plot.show()


Analyzing distribution of alcohol



Analyzing distribution of malic_acid



Analyzing distribution of ash



Analyzing distribution of alcalinity_of_ash


## Correlation Analysis

Let's examine relationships between different chemical properties:

In [12]:
# Display correlation matrix
print("Correlation Matrix of Wine Features:")
correlations = results.correlations.round(2)
print(correlations)

# Plot correlation heatmap
corr_plot = analyzer.visualizer.plot_correlation_matrix()
corr_plot.show()

# Find highly correlated features (|correlation| > 0.5)
print("\nStrong Correlations (|correlation| > 0.5):")
for i in range(len(correlations.columns)):
    for j in range(i+1, len(correlations.columns)):
        corr = correlations.iloc[i,j]
        if abs(corr) > 0.5:
            print(f"{correlations.index[i]} vs {correlations.columns[j]}: {corr:.2f}")

Correlation Matrix of Wine Features:
                              alcohol  malic_acid   ash  alcalinity_of_ash  \
alcohol                          1.00        0.09  0.21              -0.31   
malic_acid                       0.09        1.00  0.16               0.29   
ash                              0.21        0.16  1.00               0.44   
alcalinity_of_ash               -0.31        0.29  0.44               1.00   
magnesium                        0.27       -0.05  0.29              -0.08   
total_phenols                    0.29       -0.34  0.13              -0.32   
flavanoids                       0.24       -0.41  0.12              -0.35   
nonflavanoid_phenols            -0.16        0.29  0.19               0.36   
proanthocyanins                  0.14       -0.22  0.01              -0.20   
color_intensity                  0.55        0.25  0.26               0.02   
hue                             -0.07       -0.56 -0.07              -0.27   
od280/od315_of_diluted_wine


Strong Correlations (|correlation| > 0.5):
alcohol vs color_intensity: 0.55
alcohol vs proline: 0.64
malic_acid vs hue: -0.56
alcalinity_of_ash vs wine_class: 0.52
total_phenols vs flavanoids: 0.86
total_phenols vs proanthocyanins: 0.61
total_phenols vs od280/od315_of_diluted_wines: 0.70
total_phenols vs wine_class: -0.72
flavanoids vs nonflavanoid_phenols: -0.54
flavanoids vs proanthocyanins: 0.65
flavanoids vs hue: 0.54
flavanoids vs od280/od315_of_diluted_wines: 0.79
flavanoids vs wine_class: -0.85
proanthocyanins vs od280/od315_of_diluted_wines: 0.52
color_intensity vs hue: -0.52
hue vs od280/od315_of_diluted_wines: 0.57
hue vs wine_class: -0.62
od280/od315_of_diluted_wines vs wine_class: -0.79
proline vs wine_class: -0.63
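
The nested loop above can also be written without explicit indexing. The following is a minimal sketch in plain pandas and NumPy, not part of Explorytics: `strong_pairs` is a hypothetical helper that keeps the strict upper triangle of a correlation matrix, so each pair appears once, and filters by absolute value. It applies the threshold to unrounded values, which avoids dropping pairs that sit just above the cutoff before rounding.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return each feature pair whose |correlation| exceeds the threshold."""
    # Boolean mask for the strict upper triangle (excludes the diagonal)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked-out cells become NaN; stack() turns the rest into (row, col) pairs
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() > threshold]

# Hypothetical toy data: b is exactly 2 * a, c is only weakly related
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
result = strong_pairs(demo.corr())

for (left, right), value in result.items():
    print(f"{left} vs {right}: {value:.2f}")
```

With the tutorial's data, passing `df.corr()` in place of `demo.corr()` should list the same pairs as the loop, apart from any pair whose value rounds to exactly 0.50.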


## Feature Relationships

Let's explore relationships between some key features:

In [13]:
# Create scatter plots for interesting feature pairs
feature_pairs = [
    ('alcohol', 'color_intensity'),
    ('total_phenols', 'flavanoids'),
    ('malic_acid', 'ash')
]

for x, y in feature_pairs:
    print(f"\nAnalyzing relationship between {x} and {y}")
    scatter_plot = analyzer.visualizer.plot_scatter(
        x=x,
        y=y,
        color='wine_class'  # Color points by wine class
    )
    scatter_plot.show()


Analyzing relationship between alcohol and color_intensity



Analyzing relationship between total_phenols and flavanoids



Analyzing relationship between malic_acid and ash


## Outlier Analysis

Let's identify outliers in key chemical properties:

In [14]:
# Display outlier information for key features
print("Outlier Analysis for Key Features:")
key_features = ['alcohol', 'malic_acid', 'ash', 'total_phenols']

for feature in key_features:
    stats = results.outliers[feature]
    print(f"\n{feature}:")
    print(f"Lower bound: {stats['lower_bound']:.2f}")
    print(f"Upper bound: {stats['upper_bound']:.2f}")
    print(f"Number of outliers: {stats['count']}")
    
    # Create box plot to visualize outliers
    box_plot = analyzer.visualizer.plot_boxplot(feature)
    box_plot.show()

Outlier Analysis for Key Features:

alcohol:
Lower bound: 10.39
Upper bound: 15.65
Number of outliers: 0



malic_acid:
Lower bound: -0.62
Upper bound: 5.30
Number of outliers: 3



ash:
Lower bound: 1.69
Upper bound: 3.08
Number of outliers: 3



total_phenols:
Lower bound: 0.16
Upper bound: 4.39
Number of outliers: 0
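
The bounds printed above are consistent with the conventional 1.5 × IQR rule. For malic_acid, the quartiles 1.60 and 3.08 give an interquartile range of 1.48, so the bounds are 1.60 − 2.22 = −0.62 and 3.08 + 2.22 = 5.30. The sketch below implements that rule in plain pandas. It is an assumption that Explorytics computes its bounds exactly this way; `iqr_outliers` and the sample values are hypothetical, and only the dictionary keys are taken from the output above.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> dict:
    """Flag points lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A point is an outlier if it falls strictly outside [lower, upper]
    is_outlier = (values < lower) | (values > upper)
    return {
        "lower_bound": float(lower),
        "upper_bound": float(upper),
        "count": int(is_outlier.sum()),
    }

# Hypothetical sample with one obvious outlier
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
result = iqr_outliers(sample)
print(result)
```

Calling `iqr_outliers(df['malic_acid'])` on the tutorial's data is one way to check whether the library's bounds follow this rule.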


## Summary of Findings

From our analysis of the wine dataset, we can observe:

1. The dataset contains 13 chemical measurements for each of 178 wines
2. There are three different classes of wines (0, 1, 2)
3. Some features show strong correlations, particularly between:
   - Total phenols and flavanoids (0.86)
   - Flavanoids and od280/od315_of_diluted_wines (0.79)
   - Alcohol and proline (0.64), and alcohol and color intensity (0.55)
4. Of the four features checked for outliers, malic_acid and ash each have three, while alcohol and total_phenols have none; the flagged points may merit further investigation

This demonstrates how Explorytics can be used to quickly gain insights into a complex dataset with multiple features.