## 👨‍💻 Creator

**Created by [Jukka-Matti Turtiainen](https://www.rdmaic.com)**
- Lean Six Sigma Expert & Trainer
- Website: [rdmaic.com](https://www.rdmaic.com)

# 📊 ESTIEM EDA Toolkit - Google Colab Quick Start

**Exploratory Data Analysis - Professional Statistical Tools**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jukka-matti/ESTIEM-eda/blob/main/notebooks/ESTIEM_EDA_Quick_Start.ipynb)

---

## ⚡ Installation & Setup

Run this cell first to install the ESTIEM EDA toolkit:

In [None]:
# Install ESTIEM EDA Toolkit
!pip install git+https://github.com/jukka-matti/ESTIEM-eda.git -q

# Import required libraries
import pandas as pd
import numpy as np
from estiem_eda.tools.process_analysis import ProcessAnalysisTool
from estiem_eda.tools.anova import ANOVATool
from estiem_eda.tools.pareto import ParetoTool

print("✅ ESTIEM EDA Toolkit installed successfully!")
print("📊 Available tools: Process Analysis, ANOVA, Pareto Analysis")
print("🔬 Process Analysis combines: I-Chart + Capability + Distribution Assessment")

## 📁 Sample Data Generation

Generate sample datasets for immediate testing:

In [None]:
# Generate manufacturing process data
np.random.seed(42)

# Process measurements
n_samples = 100
measurements = np.random.normal(10.0, 0.3, n_samples)
lines = np.random.choice(['Line_A', 'Line_B', 'Line_C'], n_samples)

# Quality defects data
defect_types = ['Surface', 'Dimensional', 'Assembly', 'Material', 'Electrical']
defect_counts = [45, 32, 18, 12, 8]  # Descending counts: a few categories dominate

# Create sample DataFrames
process_data = pd.DataFrame({
    'measurement': measurements,
    'line': lines,
    'sample_id': range(1, n_samples + 1)
})

quality_data = pd.DataFrame({
    'defect_type': defect_types,
    'count': defect_counts
})

print("📊 Sample data generated:")
print(f"   Process data: {len(process_data)} measurements")
print(f"   Quality data: {len(quality_data)} defect categories")
print("\n🔍 Process data preview:")
display(process_data.head())
print("\n🔍 Quality data preview:")
display(quality_data)

## 🛠️ Professional Six Sigma Analysis Tools

### 1. Process Analysis (Comprehensive Assessment)

In [None]:
# Process Analysis - Comprehensive Assessment
process_tool = ProcessAnalysisTool()

results = process_tool.execute({
    'data': measurements.tolist(),
    'specification_limits': {
        'lsl': 9.4,
        'usl': 10.6,
        'target': 10.0
    },
    'distribution': 'normal',
    'title': 'Manufacturing Process Analysis'
})

if results['success']:
    # Process Summary
    summary = results['process_summary']
    print("📈 Process Analysis Results:")
    print(f"   Sample Size: {summary['sample_size']}")
    range_info = summary['measurement_range']
    print(f"   Process Mean: {range_info['mean']:.4f}")
    print(f"   Process Std Dev: {range_info['std_dev']:.4f}")
    
    # Stability Assessment
    stability = results['stability_analysis']
    print(f"\n🔒 Stability Assessment:")
    print(f"   Control Status: {stability['control_status'].replace('_', ' ').title()}")
    if 'statistics' in stability:
        stats = stability['statistics']
        print(f"   UCL: {stats.get('ucl', 0):.4f}")
        print(f"   LCL: {stats.get('lcl', 0):.4f}")
    
    # Capability Assessment
    capability = results['capability_analysis']
    if 'capability_indices' in capability:
        indices = capability['capability_indices']
        print(f"\n🎯 Capability Assessment:")
        print(f"   Cp:  {indices['cp']:.4f}")
        print(f"   Cpk: {indices['cpk']:.4f}")
        
        defects = capability['defect_analysis']
        print(f"   Expected PPM: {defects['ppm_total']:.0f}")
        print(f"   Sigma Level: {defects['sigma_level']:.1f}")
    
    # Distribution Assessment
    distribution = results['distribution_analysis']
    if 'goodness_of_fit' in distribution:
        gof = distribution['goodness_of_fit']
        print(f"\n📊 Distribution Assessment:")
        print(f"   Distribution: {distribution['distribution'].title()}")
        print(f"   Correlation: {gof['correlation_coefficient']:.4f}")
        print(f"   Fit Quality: {gof['interpretation']}")
    
    # Overall Assessment
    if 'overall_assessment' in results:
        assessment = results['overall_assessment']
        print(f"\n🏆 Overall Assessment: {assessment['overall_status'].replace('_', ' ').title()}")
        if assessment['recommendations']:
            print("   Recommendations:")
            for rec in assessment['recommendations']:
                print(f"   • {rec}")
    
    print(f"\n🎯 {results['interpretation']}")
    
    # Display chart if available
    if 'visualization' in results:
        from IPython.display import HTML
        display(HTML(results['visualization']))
else:
    print(f"❌ Analysis failed: {results.get('error')}")
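
The stability assessment above reports UCL and LCL values. As a rough cross-check, the sketch below computes individuals-chart limits from the average moving range. It assumes the conventional constant 2.66 (3 / 1.128); the helper name `i_chart_limits` is hypothetical, and the toolkit's internal calculation may differ.

```python
import numpy as np

def i_chart_limits(values):
    """Return (center, lcl, ucl) for an individuals chart (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    # Moving range: absolute difference between consecutive observations
    mr_bar = np.abs(np.diff(x)).mean()
    center = x.mean()
    # 2.66 = 3 / d2, with d2 = 1.128 for moving ranges of size 2 (assumed convention)
    return center, center - 2.66 * mr_bar, center + 2.66 * mr_bar

# Illustrative data only, similar in shape to the sample measurements
rng = np.random.default_rng(42)
sample = rng.normal(10.0, 0.3, 100)

center, lcl, ucl = i_chart_limits(sample)
outside = sample[(sample < lcl) | (sample > ucl)]
print(f"Center: {center:.4f}  LCL: {lcl:.4f}  UCL: {ucl:.4f}")
print(f"Points outside limits: {outside.size}")
```

Points outside the limits are only one signal of instability; run rules (trends, shifts) are not covered by this sketch.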

### Legacy: Process Capability Analysis
*This functionality is now included in Process Analysis above*

In [None]:
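# Process Capability Analysis
# Assumption: CapabilityTool is importable from estiem_eda.tools.capability.
# It is not imported in the setup cell, so adjust this path to your installed version.
from estiem_eda.tools.capability import CapabilityTool
from IPython.display import HTML, display

capability_tool = CapabilityTool()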

results = capability_tool.execute({
    'data': measurements.tolist(),
    'lsl': 9.4,  # Lower Specification Limit
    'usl': 10.6, # Upper Specification Limit
    'target': 10.0
})

if results['success']:
    indices = results['capability_indices']
    defects = results['defect_analysis']
    
    print("🎯 Process Capability Results:")
    print(f"   Cp:  {indices['cp']:.4f}")
    print(f"   Cpk: {indices['cpk']:.4f}")
    print(f"   Pp:  {indices['pp']:.4f}")
    print(f"   Ppk: {indices['ppk']:.4f}")
    print(f"\n📊 Defect Analysis:")
    print(f"   Expected PPM: {defects['ppm_total']:.0f}")
    print(f"   Sigma Level: {defects['sigma_level']:.1f}")
    print(f"\n🎯 {results['interpretation']}")
    
    # Display chart if available
    if 'visualization' in results:
        display(HTML(results['visualization']))
else:
    print(f"❌ Analysis failed: {results.get('error')}")
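
The capability cell prints Cp, Cpk, Pp and Ppk. The sketch below shows the textbook formulas behind these indices, assuming Cp/Cpk use a within (short-term) sigma estimated from the average moving range and Pp/Ppk use the overall sample standard deviation. The helper `capability_indices` is hypothetical; the toolkit may estimate sigma differently.

```python
import numpy as np

def capability_indices(values, lsl, usl):
    """Textbook capability indices (hypothetical helper, not the toolkit's code)."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    sigma_overall = x.std(ddof=1)                    # long-term (overall) sigma
    sigma_within = np.abs(np.diff(x)).mean() / 1.128  # short-term sigma from MR-bar / d2

    def potential(sigma):
        # Spec width relative to the 6-sigma process spread
        return (usl - lsl) / (6 * sigma)

    def actual(sigma):
        # Distance from the mean to the nearest spec limit, in 3-sigma units
        return min(usl - mean, mean - lsl) / (3 * sigma)

    return {
        'cp': potential(sigma_within), 'cpk': actual(sigma_within),
        'pp': potential(sigma_overall), 'ppk': actual(sigma_overall),
    }

# Illustrative data only
rng = np.random.default_rng(42)
sample = rng.normal(10.0, 0.3, 100)
for name, value in capability_indices(sample, lsl=9.4, usl=10.6).items():
    print(f"{name.capitalize():>3}: {value:.4f}")
```

Cpk can never exceed Cp, and the two are equal only when the process mean sits exactly midway between the specification limits.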

### 2. ANOVA Analysis (Group Comparison)

In [None]:
# ANOVA Analysis - Compare production lines
from IPython.display import HTML, display  # needed for the chart at the end of this cell

anova_tool = ANOVATool()

# Prepare groups data
groups = {}
for line in process_data['line'].unique():
    groups[line] = process_data[process_data['line'] == line]['measurement'].tolist()

results = anova_tool.execute({'groups': groups})

if results['success']:
    anova_stats = results['anova_results']
    
    print("📊 ANOVA Results:")
    print(f"   F-statistic: {anova_stats['f_statistic']:.4f}")
    print(f"   p-value: {anova_stats['p_value']:.6f}")
    print(f"   Significant: {'Yes' if anova_stats['significant'] else 'No'}")
    
    if 'post_hoc' in results:
        print(f"\n📈 Post-hoc Comparisons:")
        for comparison in results['post_hoc']['comparisons']:
            print(f"   {comparison['groups']}: p={comparison['p_value']:.4f} {'*' if comparison['significant'] else ''}")
    
    print(f"\n🎯 {results['interpretation']}")
    
    # Display chart if available
    if 'visualization' in results:
        display(HTML(results['visualization']))
else:
    print(f"❌ Analysis failed: {results.get('error')}")
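
To see what the F-statistic in the ANOVA output measures, the sketch below computes a one-way F by hand (between-group mean square divided by within-group mean square) and compares it with `scipy.stats.f_oneway`. The data and the helper `one_way_f` are illustrative assumptions, not part of the toolkit.

```python
import numpy as np
from scipy import stats

def one_way_f(samples):
    """One-way ANOVA F-statistic computed from sums of squares (hypothetical helper)."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    all_values = np.concatenate(samples)
    grand_mean = all_values.mean()
    k, n = len(samples), all_values.size
    # Variation of group means around the grand mean
    ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
    # Variation of observations around their own group mean
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Illustrative groups only: Line_C is shifted slightly upward
rng = np.random.default_rng(0)
groups = {
    'Line_A': rng.normal(10.0, 0.3, 30),
    'Line_B': rng.normal(10.0, 0.3, 30),
    'Line_C': rng.normal(10.2, 0.3, 30),
}

f_manual = one_way_f(groups.values())
f_scipy, p_value = stats.f_oneway(*groups.values())
print(f"Manual F: {f_manual:.4f}  SciPy F: {f_scipy:.4f}  p-value: {p_value:.6f}")
```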

### Note: Individual Tools (Legacy)

The Process Capability and Probability Plot cells in this notebook show individual (legacy) tool usage. For a streamlined analysis, use the Process Analysis tool above, which combines I-Chart, Capability, and Probability Plot into a single workflow.

### 3. Pareto Analysis (Priority Identification)

In [None]:
# Pareto Analysis - Quality defects
from IPython.display import HTML, display  # needed for the chart at the end of this cell

pareto_tool = ParetoTool()

# Convert quality data to dictionary
defect_dict = dict(zip(quality_data['defect_type'], quality_data['count']))

results = pareto_tool.execute({'data': defect_dict})

if results['success']:
    vital_few = results['vital_few']
    
    print("📉 Pareto Analysis Results:")
    print(f"   Total Categories: {len(defect_dict)}")
    print(f"   Vital Few: {len(vital_few['categories'])} categories")
    print(f"   Impact: {vital_few['percentage']:.1f}% of total defects")
    print(f"   Top Categories: {', '.join(vital_few['categories'])}")
    
    gini = results['gini_coefficient']
    print(f"   Gini Coefficient: {gini['value']:.3f} ({gini['interpretation']})")
    
    print(f"\n🎯 {results['interpretation']}")
    
    # Display chart if available
    if 'visualization' in results:
        display(HTML(results['visualization']))
else:
    print(f"❌ Analysis failed: {results.get('error')}")
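
The "vital few" reported above can be reproduced approximately with pandas. This sketch assumes the common 80% cumulative threshold; the toolkit's own rule for selecting the vital few may differ.

```python
import pandas as pd

# Same defect counts as the sample quality data
defects = pd.Series({
    'Surface': 45, 'Dimensional': 32, 'Assembly': 18, 'Material': 12, 'Electrical': 8,
})

ordered = defects.sort_values(ascending=False)
cumulative_pct = ordered.cumsum() / ordered.sum() * 100

# Smallest set of top categories whose cumulative share reaches 80% (assumed threshold)
n_vital = int((cumulative_pct < 80).sum()) + 1
vital_few = ordered.index[:n_vital].tolist()

print(pd.DataFrame({'count': ordered, 'cumulative_%': cumulative_pct.round(1)}))
print(f"Vital few: {', '.join(vital_few)}")
```

With these counts, the top three categories account for about 82.6% of the 115 defects.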

### Legacy: Probability Plot
*This functionality is now included in Process Analysis above*

In [None]:
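# Probability Plot - Assess normality
# Assumption: ProbabilityPlotTool is importable from estiem_eda.tools.probability_plot.
# It is not imported in the setup cell, so adjust this path to your installed version.
from estiem_eda.tools.probability_plot import ProbabilityPlotTool
from IPython.display import HTML, display

probability_tool = ProbabilityPlotTool()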

results = probability_tool.execute({
    'data': measurements.tolist(),
    'distribution': 'normal'
})

if results['success']:
    gof = results['goodness_of_fit']
    outliers = results['outliers']
    
    print("📋 Probability Plot Results:")
    print(f"   Distribution: Normal")
    print(f"   Correlation: {gof['correlation_coefficient']:.4f}")
    print(f"   Fit Quality: {gof['interpretation']}")
    print(f"   Outliers Detected: {outliers['count']}")
    
    if 'normality_test' in results:
        norm_test = results['normality_test']
        print(f"   Anderson-Darling: {norm_test['statistic']:.4f} (p={norm_test['p_value']:.4f})")
    
    print(f"\n🎯 {results['interpretation']}")
    
    # Display chart if available
    if 'visualization' in results:
        display(HTML(results['visualization']))
else:
    print(f"❌ Analysis failed: {results.get('error')}")
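
The correlation coefficient in the probability plot output measures how closely the ordered data follow a straight line against theoretical normal quantiles. A minimal sketch with `scipy.stats.probplot` illustrates the idea; the data are illustrative, and the toolkit's plotting positions may differ from SciPy's.

```python
import numpy as np
from scipy import stats

# Illustrative data only
rng = np.random.default_rng(42)
sample = rng.normal(10.0, 0.3, 100)

# probplot returns the quantile pairs and a least-squares line fitted through them
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist='norm')

print(f"Correlation (r): {r:.4f}")
print(f"Fitted line: value = {intercept:.3f} + {slope:.3f} * z")
```

For normally distributed data, `r` is close to 1, the slope approximates the standard deviation, and the intercept approximates the mean.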

## 📤 Upload Your Data

Upload a CSV file to run the analyses on your own data:

In [None]:
# File upload widget
from google.colab import files
import io

print("📁 Upload your CSV file:")
uploaded = files.upload()

# Process uploaded files
for filename in uploaded.keys():
    print(f"\n✅ Processing file: {filename}")
    
    # Read CSV
    df = pd.read_csv(io.BytesIO(uploaded[filename]))
    
    print(f"📊 Data shape: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"📋 Columns: {', '.join(df.columns.tolist())}")
    
    # Show preview
    print("\n🔍 Data preview:")
    display(df.head())
    
    # Show summary statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print("\n📈 Numeric columns summary:")
        display(df[numeric_cols].describe())
    
    # Store for analysis (if several files are uploaded, the last one is kept)
    uploaded_data = df
    print("\n💾 Data stored as 'uploaded_data' - ready for analysis!")

## 🔧 Analyze Your Data

Modify the cell below to analyze your uploaded data with the 3 core tools:

In [None]:
# Example: Analyze your uploaded data with 3 core tools
# Modify these parameters for your data:

# For Process Analysis:
# column_name = 'your_measurement_column'
# data_values = uploaded_data[column_name].dropna().tolist()
# your_lsl = 9.0  # Your lower spec limit
# your_usl = 11.0 # Your upper spec limit

# For ANOVA:
# value_column = 'measurement'
# group_column = 'group'

# For Pareto:
# category_column = 'defect_type'
# count_column = 'count'  # or set to None to count occurrences

print("💡 Uncomment and modify the code above to analyze your data")
if 'uploaded_data' in globals():
    print("📋 Available data: 'uploaded_data'")
    print(f"📊 Columns: {uploaded_data.columns.tolist()}")
else:
    print("⚠️ No data uploaded yet - run the upload cell first")
print("📊 3 Core Tools: Process Analysis, ANOVA, Pareto Analysis")
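
As a starting point, the sketch below shows one way to reshape an uploaded DataFrame into the input shapes used earlier in this notebook: a list of values for Process Analysis, a dict of group lists for ANOVA, and a dict of category counts for Pareto. The small DataFrame and its column names are hypothetical stand-ins for your own file.

```python
import pandas as pd

# Hypothetical stand-in for an uploaded CSV; replace with your own 'uploaded_data'
uploaded_data = pd.DataFrame({
    'measurement': [10.1, 9.8, None, 10.3, 9.9, 10.0],
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'defect_type': ['Surface', 'Surface', 'Assembly', 'Surface', 'Material', 'Assembly'],
})

# Process Analysis: a plain list of measurements with missing values dropped
data_values = uploaded_data['measurement'].dropna().tolist()

# ANOVA: one list of measurements per group
groups = {
    name: frame['measurement'].dropna().tolist()
    for name, frame in uploaded_data.groupby('group')
}

# Pareto: count occurrences of each category
defect_dict = uploaded_data['defect_type'].value_counts().to_dict()

print(data_values)
print(groups)
print(defect_dict)
```

These objects can then be passed in the same way as the sample data, for example `{'groups': groups}` for ANOVA or `{'data': defect_dict}` for Pareto.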

---

**📊 Built by ESTIEM for 60,000+ Industrial Engineering students**

*Professional Six Sigma toolkit with 3 core tools • Free forever for educational use • Apache 2.0 License*