# ML Model Training: Predicting variety

This notebook was auto-generated by Intent2Model.

**Task:** classification
**Target Column:** variety
**Model:** logistic_regression


## PLANNING SOURCE

**Planning Method:** LLM

**Target Confidence:** 1.00
**Task Confidence:** 1.00
**Plan Quality:** High Confidence



## STEP 0 — TASK INFERENCE

*   **Target Variable**: The user specified `variety` as the target. This column is present in the dataset.
*   **Task Type**: The target column `variety` is of `object` dtype and has 3 unique values ('Setosa', 'Versicolor', 'Virginica'). The features are all numeric. This is a classic **multiclass classification** problem.
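The inference above can be sketched as a small heuristic. This is only an illustration, not Intent2Model's actual logic: `infer_task`, its `max_classes` threshold, and the `load_iris_frame` helper (which rebuilds the table from scikit-learn's bundled Iris data) are assumptions introduced here.

```python
import pandas as pd
from sklearn.datasets import load_iris

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


def infer_task(target: pd.Series, max_classes: int = 20) -> str:
    """Hypothetical heuristic: many distinct numeric values -> regression, else classification."""
    n_unique = target.nunique(dropna=True)
    if pd.api.types.is_numeric_dtype(target) and n_unique > max_classes:
        return 'regression'
    return 'binary_classification' if n_unique == 2 else 'multiclass_classification'


df = load_iris_frame()
task = infer_task(df['variety'])
print(sorted(df['variety'].unique()), '->', task)
```

A production heuristic would likely also look at the column's semantics (for example, integer-coded categories), which this sketch ignores.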


## STEP 1 — DATASET INTELLIGENCE

*   **Shape**: The dataset is small, with 150 rows and 5 columns.
*   **Data Types**: There are 4 numeric features (`float64`) and 1 categorical target (`object`).
*   **Missingness**: There are no missing values in any column, which simplifies preprocessing.
*   **Feature Analysis**:
    *   The numeric features (`sepal.length`, `sepal.width`, `petal.length`, `petal.width`) have low skewness values, indicating they are fairly symmetric.
    *   The value ranges are not dramatically different, but scaling is still recommended for distance-based or gradient-based models.
    *   There is a very minor outlier hint in `sepal.width` based on the IQR method, but it's not significant enough to warrant complex outlier removal for this initial plan.
*   **Target Analysis**: The target variable `variety` is perfectly balanced, with each of the 3 classes having exactly 50 samples. This is ideal and means we don't need to employ class imbalance handling techniques like SMOTE or class weights. Accuracy can be used as a reliable primary metric.
*   **Identifiers/Leakage**: No columns with high cardinality or other signs of being identifiers were detected.
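The checks listed above can be reproduced with a few pandas calls. The sketch below assumes the data matches scikit-learn's bundled Iris dataset; `load_iris_frame` is a hypothetical helper, and the 1.5 × IQR fence is the conventional outlier rule, which may differ from the one the planner used.

```python
import pandas as pd
from sklearn.datasets import load_iris

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


df = load_iris_frame()
numeric = df[FEATURES]

shape = df.shape                      # rows, columns
missing = df.isna().sum()             # missing values per column
skewness = numeric.skew()             # sample skewness per numeric feature

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outside = numeric.lt(lower, axis='columns') | numeric.gt(upper, axis='columns')
outlier_counts = outside.sum()

class_counts = df['variety'].value_counts()

print('Shape:', shape)
print('Missing values:', int(missing.sum()))
print('Skewness:', skewness.round(2).to_dict())
print('IQR outliers:', outlier_counts.to_dict())
print('Class counts:', class_counts.to_dict())
```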


## STEP 2 — TRANSFORMATION STRATEGY

*   **Numeric Features (`sepal.length`, `sepal.width`, `petal.length`, `petal.width`)**: 
    *   **Decision**: `StandardScaler`.
    *   **Justification**: Tree-based models do not need scaling, but Logistic Regression, SVM, and KNN do: without it, features with larger scales dominate the fit. `StandardScaler` centers each feature to a mean of 0 and scales it to a standard deviation of 1. Because scaling does not harm tree-based models, it can be applied to all planned model candidates.

*   **Target Feature (`variety`)**: 
    *   **Decision**: `Drop`.
    *   **Justification**: This is the target variable (y) and must be excluded from the feature matrix (X) to prevent data leakage.
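As a quick illustration of the two decisions above, the sketch below drops the target from the feature matrix and verifies that `StandardScaler` yields zero-mean, unit-variance columns. It assumes the data matches scikit-learn's bundled Iris dataset; `load_iris_frame` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


df = load_iris_frame()

# The target is excluded from X so it cannot leak into the features
X = df.drop(columns=['variety'])
y = df['variety']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0), as does ndarray.std()
print('Means after scaling:', np.round(X_scaled.mean(axis=0), 6))
print('Std devs after scaling:', np.round(X_scaled.std(axis=0), 6))
```

Here the scaler is fitted on the full table only to show its effect. Inside a `Pipeline` it is fitted on the training portion of each split, which avoids leaking validation statistics.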


## STEP 3 — MODEL CANDIDATE SELECTION

Given the small, clean, and balanced nature of the dataset, a few well-established models are sufficient.

*   **`LogisticRegression`**: 
    *   **Justification**: Included as a simple, fast, and highly interpretable linear baseline. It performs well when the decision boundary is roughly linear.
*   **`RandomForestClassifier`**: 
    *   **Justification**: Included as a powerful, non-linear ensemble model. It is robust to the data's scale (though scaling doesn't hurt) and captures complex interactions between features. It also provides feature importances for free.
*   **`SVC`** (support vector classifier): 
    *   **Justification**: Included because SVMs fit maximum-margin decision boundaries, which suits the well-separated classes of the Iris dataset. Performance depends on proper scaling.
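One way to hold the three candidates side by side is a dictionary of pipelines that share the same scaling step. This is a sketch, not the generator's code: `build_candidates`, the dictionary keys, and the hyperparameters other than `max_iter=1000` and `random_state=42` are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_candidates(random_state: int = 42) -> dict:
    """Hypothetical helper: wrap each candidate model in a scaler + model pipeline."""
    estimators = {
        'logistic_regression': LogisticRegression(max_iter=1000, random_state=random_state),
        'random_forest': RandomForestClassifier(n_estimators=100, random_state=random_state),
        # probability=True enables predict_proba, which ROC AUC scoring needs
        'svc': SVC(probability=True, random_state=random_state),
    }
    return {
        name: Pipeline([('scaler', StandardScaler()), ('model', estimator)])
        for name, estimator in estimators.items()
    }


candidates = build_candidates()
for name, pipe in candidates.items():
    print(name, '->', type(pipe.named_steps['model']).__name__)
```

Keeping the scaler inside each pipeline means every candidate sees identically preprocessed data during cross-validation.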


## STEP 4 — TRAINING & VALIDATION

*   **Validation Strategy**: `Stratified 5-Fold Cross-Validation`.
    *   **Justification**: With only 150 rows, a single train-test split gives a noisy performance estimate; k-fold cross-validation averages over several splits and is more robust. Stratified k-fold keeps the equal class balance in every fold (10 samples per class in each 30-row validation fold), which is best practice for classification tasks.
*   **Performance Metrics**:
    *   **Primary Metric**: `accuracy`.
        *   **Justification**: The classes are perfectly balanced, making accuracy a direct and easy-to-interpret measure of overall correctness.
    *   **Additional Metrics**: `f1_macro`, `roc_auc_ovr`.
        *   **Justification**: `f1_macro` calculates the F1-score for each class and finds their unweighted mean, which is a good holistic measure for multiclass problems. `roc_auc_ovr` (One-vs-Rest) provides a measure of the model's ability to discriminate between each class and the rest, which is useful for understanding per-class performance.
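The validation plan above can be sketched with `StratifiedKFold` and `cross_validate`. The example assumes the data matches scikit-learn's bundled Iris dataset and uses a scaler plus logistic regression pipeline as a stand-in for the notebook's pipeline; `load_iris_frame` is a hypothetical helper, and shuffling with `random_state=42` is a choice made here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
METRICS = ['accuracy', 'f1_macro', 'roc_auc_ovr']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


df = load_iris_frame()
X = df[FEATURES]
y = LabelEncoder().fit_transform(df['variety'])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Class counts in each validation fold, to confirm that stratification holds
fold_class_counts = [
    np.bincount(y[test_idx], minlength=3).tolist() for _, test_idx in cv.split(X, y)
]

scores = cross_validate(pipeline, X, y, cv=cv, scoring=METRICS)
summary = {metric: float(scores[f'test_{metric}'].mean()) for metric in METRICS}

print('Validation fold class counts:', fold_class_counts)
for metric, value in summary.items():
    print(f'{metric}: {value:.4f}')
```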


## STEP 5 — ERROR & BEHAVIOR ANALYSIS

To understand model performance beyond single metrics, the following analyses are critical:

*   **Confusion Matrix**: The most direct view of classification errors. It will be plotted to show which classes are confused with which (e.g., is the model struggling to distinguish between 'Versicolor' and 'Virginica'?).
*   **Classification Report**: This will be generated to provide a per-class breakdown of precision, recall, and F1-score, giving a detailed view of performance for each of the three flower species.
*   **Feature Importance Plot**: For the `RandomForestClassifier`, a bar plot of feature importances will be created to show which features (e.g., 'petal.width') are most influential in the model's predictions.
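A confusion matrix and classification report along these lines could be produced as follows. The sketch uses out-of-fold predictions from `cross_val_predict` so that every row is scored by a model that did not train on it; that choice, the stand-in pipeline, and the hypothetical `load_iris_frame` helper are assumptions rather than part of the generated plan.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
CLASS_NAMES = ['Setosa', 'Versicolor', 'Virginica']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


df = load_iris_frame()
X, y = df[FEATURES], df['variety']

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each prediction comes from the fold in which that row was held out
y_pred = cross_val_predict(pipeline, X, y, cv=cv)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_pred, labels=CLASS_NAMES)
cm_df = pd.DataFrame(
    cm,
    index=[f'true {name}' for name in CLASS_NAMES],
    columns=[f'pred {name}' for name in CLASS_NAMES],
)
report = classification_report(y, y_pred)

print(cm_df)
print(report)
```

For a plotted version, `cm_df` can be passed to `seaborn.heatmap` with `annot=True`.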


## STEP 6 — EXPLAINABILITY

*   **Feature Name Alignment**: The feature transformations consist only of scaling. The post-transformation feature names will directly correspond to the original features (`sepal.length`, `sepal.width`, `petal.length`, `petal.width`), so no complex name mapping is required.
*   **Explainability Method**: The primary method will be analyzing the `feature_importances_` attribute from the trained `RandomForestClassifier`. 
    *   **Justification**: This provides a straightforward, global explanation of which features are driving the model's predictions on average. For a simple dataset like this, it is often sufficient to understand the key drivers of the classification.


## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Load Data

In [None]:
# Load your dataset (replace the placeholder path with the location of your CSV file)
df = pd.read_csv('your_dataset.csv')

print(f'Dataset shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
df.head()

## 3. Prepare Data

In [None]:
# Separate features and target
X = df.drop(columns=['variety'])
y = df['variety']

# Handle categorical target if needed
le = LabelEncoder()
y = le.fit_transform(y.astype(str))

# Split data (stratified so the 30-row hold-out set keeps the class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training set: {X_train.shape}')
print(f'Test set: {X_test.shape}')

## 4. Build Preprocessing Pipeline (from AutoMLPlan)

In [None]:
# Preprocessing compiled from AutoMLPlan
# Each feature transform is based on plan.feature_transforms

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer

# The plan produced no explicit transformers, so the numeric columns fall back to
# median imputation followed by standard scaling. The dataset has no missing
# values, so the imputer leaves the data unchanged and acts only as a safeguard.
numeric_cols = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
transformers = [('num_scaled', numeric_pipeline, numeric_cols)]
preprocessor = ColumnTransformer(transformers, remainder='drop')



## 5. Build Model (from AutoMLPlan)

In [None]:
# Model compiled from AutoMLPlan.model_candidates

# Selected model: logistic_regression (from plan.model_candidates)
# Reason: Selected based on plan recommendations.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000, random_state=42)



## 6. Assemble Pipeline (from AutoMLPlan)

In [None]:
# Assemble pipeline from plan-driven components
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])



## 7. Train Model

In [None]:
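# Train the model
pipeline.fit(X_train, y_train)

# Make predictions (class labels and class probabilities)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)

# Evaluate using metrics from AutoMLPlan
# Primary metric: accuracy
# Additional metrics: ['f1_macro', 'roc_auc_ovr']

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    roc_auc_score
)

# Primary metric
primary_score = accuracy_score(y_test, y_pred)
print(f'Primary Metric (Accuracy): {primary_score:.4f}')

# Additional metrics from plan
f1_macro = f1_score(y_test, y_pred, average='macro')
roc_auc_ovr = roc_auc_score(y_test, y_proba, multi_class='ovr')
print(f'F1 (macro): {f1_macro:.4f}')
print(f'ROC AUC (one-vs-rest): {roc_auc_ovr:.4f}')

# Per-class precision, recall, and F1, labelled with the original class names
print(classification_report(y_test, y_pred, target_names=le.classes_))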


## 8. Feature Importance (from plan.explainability_md)

In [None]:
# Get feature importance (aligned with plan)
if hasattr(pipeline.named_steps['model'], 'feature_importances_'):
    importances = pipeline.named_steps['model'].feature_importances_
    
    # Get feature names after preprocessing (aligned with plan, not dtype-based)
    # NOTE: Do NOT use numeric_cols or categorical_cols - they may not be defined
    try:
        preprocessor = pipeline.named_steps['preprocessor']
        feature_names = []
        # Get feature names from preprocessor transformers
        if hasattr(preprocessor, 'transformers_'):
            for name, transformer, cols in preprocessor.transformers_:
                if hasattr(transformer, 'get_feature_names_out'):
                    feature_names.extend(transformer.get_feature_names_out(cols))
                elif hasattr(transformer, 'named_steps'):
                    # Pipeline transformer
                    for step_name, step_transformer in transformer.named_steps.items():
                        if hasattr(step_transformer, 'get_feature_names_out'):
                            feature_names.extend(step_transformer.get_feature_names_out(cols))
                            break
                else:
                    # Fallback: use column names
                    feature_names.extend([f'{name}_{col}' for col in cols])
        else:
            # Preprocessor not fitted yet - use generic names
            feature_names = [f'feature_{i}' for i in range(len(importances))]
    except Exception as e:
        # Fallback: use generic feature names
        feature_names = [f'feature_{i}' for i in range(len(importances))]
    
    # Create importance DataFrame
    importance_df = pd.DataFrame({
        'feature': feature_names[:len(importances)],
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    # Plot
    plt.figure(figsize=(10, 6))
    sns.barplot(data=importance_df.head(10), x='importance', y='feature')
    plt.title('Top 10 Feature Importance')
    plt.tight_layout()
    plt.show()
    
    print(importance_df)
else:
    print('Feature importance not available for this model type.')
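`LogisticRegression` does not expose `feature_importances_`, so the cell above reports that importance is unavailable for the selected model. A rough analogue, sketched below, is to inspect the fitted coefficients on the standardized features. This is a swapped-in technique, not the plan's method; the stand-in pipeline and the hypothetical `load_iris_frame` helper are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


df = load_iris_frame()
X, y = df[FEATURES], df['variety']

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
lr_pipeline.fit(X, y)
lr_model = lr_pipeline.named_steps['model']

# One row of coefficients per class, one column per standardized feature
coef_df = pd.DataFrame(lr_model.coef_, index=lr_model.classes_, columns=FEATURES)

# Mean absolute coefficient across classes as a simple global ranking
mean_abs_coef = coef_df.abs().mean(axis=0).sort_values(ascending=False)

print(coef_df.round(3))
print(mean_abs_coef.round(3))
```

Because the features are standardized before fitting, coefficient magnitudes are comparable across features, though they remain model-specific and are not interchangeable with tree-based importances.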

## 9. Save Model

In [None]:
# Save the trained model
with open('model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Save label encoder if used
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

print('Model saved successfully!')

## 10. Make Predictions

In [None]:
# Load model for predictions
# with open('model.pkl', 'rb') as f:
#     loaded_model = pickle.load(f)

# Example prediction
# new_data = pd.DataFrame({...})
# prediction = loaded_model.predict(new_data)
# print(f'Prediction: {prediction}')
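A runnable version of the commented template above might look like the following sketch. To stay self-contained it trains a stand-in pipeline on scikit-learn's bundled Iris data and round-trips it through `pickle` in memory instead of reading `model.pkl`; the `load_iris_frame` helper and the measurements in `new_data` are illustrative assumptions.

```python
import pickle

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

FEATURES = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']


def load_iris_frame() -> pd.DataFrame:
    """Hypothetical helper: rebuild the notebook's table from scikit-learn's Iris data."""
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=FEATURES)
    df['variety'] = [iris.target_names[i].capitalize() for i in iris.target]
    return df


df = load_iris_frame()
X = df[FEATURES]
le = LabelEncoder()
y = le.fit_transform(df['variety'])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])
pipeline.fit(X, y)

# Serialize and restore in memory; a file-based round trip works the same way
loaded_model = pickle.loads(pickle.dumps(pipeline))

# New rows must use the same column names the model was trained on
new_data = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=FEATURES)  # illustrative values

prediction = loaded_model.predict(new_data)
predicted_label = le.inverse_transform(prediction)[0]
print(f'Prediction: {predicted_label}')
```

The label encoder has to be saved and loaded alongside the model, since the pipeline predicts encoded integers rather than class names. Pickle files should only be loaded from trusted sources.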