# Scottish Haggis Analysis

# Final Project: Data Mining Analysis of Scottish Haggis Population

## 1. Introduction

### Brief Description of the Dataset

This dataset contains 344 recorded sightings of the elusive Scottish haggis, a rare animal recently discovered on three Scottish islands: Iona, Skye, and Shetland. Wildlife volunteers have documented three distinct species: the **Wild Rambler**, the **Macduff**, and the **Bog Sniffler**.

Each observation includes:

- **Morphological measurements**: nose length (mm), eye size (mm), tail length (mm), and body mass (g)
- **Demographic data**: sex of the specimen
- **Temporal and geographical context**: island location and year of sighting (2023–2025)

The dataset represents a unique opportunity to understand the physical characteristics and distributions of these newly monitored species across different island ecosystems.

---

### Brief Description of the Task

The objective of this project is to demonstrate a comprehensive understanding of the **data mining lifecycle** by applying multiple machine learning techniques to this single dataset. Rather than simply running algorithms, the focus is on building a coherent analytical narrative—from initial data exploration through to predictive modeling—while making informed decisions at each stage.

**Key Questions This Analysis Will Address:**

- What patterns emerge from the physical measurements of different haggis species?
- Can we identify natural groupings within the population using unsupervised learning?
- How accurately can we predict species classification based on physical traits?
- What relationships exist between specific features (e.g., body mass and morphological traits)?

This investigation aims to extract meaningful biological insights while demonstrating proper application of data mining methodologies.

---

### What Techniques Will Be Applied

This analysis employs a progressive approach, moving from exploratory understanding to supervised prediction:

#### **Stage 1: Exploratory Data Analysis (EDA)**
- Data loading, inspection, and quality assessment
- Visualization of feature distributions and relationships
- Handling missing values and data type corrections
- Feature scaling and encoding preparation

#### **Stage 2: Unsupervised Learning (Clustering)**
- **K-Means clustering** to discover natural groupings in the data
- Optimal k selection using Elbow Method and Silhouette Score
- Cluster characterization and interpretation
- *Optional*: Comparison with density-based clustering (DBSCAN)

#### **Stage 3: Supervised Learning - Classification (Decision Trees)**
- Decision Tree classifier implementation
- Model evaluation using accuracy, confusion matrix, and classification metrics
- Feature importance analysis
- *Optional*: Hyperparameter tuning and ensemble methods (Random Forest, XGBoost)

#### **Stage 4: Comparative Classification Analysis**
- **K-Nearest Neighbors (KNN)** implementation with optimal k determination
- **Logistic Regression** with coefficient interpretation
- Performance comparison across all three classification methods
- Analysis of which algorithm performs best for this dataset

#### **Stage 5: Supervised Learning - Regression**
- **Linear Regression** to model relationships between continuous features
- Model evaluation using R², MAE, and RMSE
- Interpretation of regression coefficients and model fit

---

### Brief Outline of the Workflow

The analysis follows a structured, end-to-end data mining pipeline:

**1. Data Preparation & Understanding**
- Load the haggis dataset and perform initial inspection
- Assess data quality (missing values, outliers, data types)
- Create comprehensive visualizations to understand feature distributions

**2. Data Cleaning & Transformation**
- Handle missing values with justified approaches
- Encode categorical variables (species, island, sex)
- Scale numerical features where appropriate for specific algorithms

**3. Unsupervised Exploration**
- Apply K-Means to identify natural clusters
- Validate clustering quality and interpret biological meaning
- Explore whether clusters align with known species boundaries

**4. Supervised Classification**
- Split data into training and testing sets
- Build Decision Tree, KNN, and Logistic Regression models
- Compare performance and identify the most suitable classifier
- Extract insights from feature importances and coefficients

**5. Regression Analysis**
- Select appropriate continuous features for regression modeling
- Build and evaluate a linear regression model
- Interpret relationships between physical characteristics

**6. Synthesis & Conclusions**
- Integrate findings across all analytical stages
- Discuss biological implications of discovered patterns
- Identify limitations and potential future work

---

Throughout this notebook, each decision—from choosing a value for k to selecting features for regression—will be **explicitly justified** with reference to the data, statistical principles, or domain context. The goal is not just to apply algorithms, but to tell a coherent story about what the data reveals about Scottish haggis populations.

# 2. Stage 1 — Data Preparation & Exploratory Data Analysis

# 2.1 Load Data & Initial Inspection

# TODO:
# - Load CSV
# - .head()
# - .info()
# - .describe()
# - Identify numeric + categorical features
# - Check missing values
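
The inspection steps above can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the inline rows and every column name (`nose_length_mm`, `body_mass_g`, and so on) are assumptions standing in for the real CSV, whose path and schema are not given here.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real file's column names may differ.
csv_text = """species,island,sex,nose_length_mm,eye_size_mm,tail_length_mm,body_mass_g,year
Macduff,Skye,female,39.1,18.7,181,3750,2023
Wild Rambler,Iona,male,46.5,17.9,192,3500,2024
Bog Sniffler,Shetland,,50.0,15.2,218,5700,2025
Macduff,Skye,male,,,,,2023
"""
# For the real data, pass the CSV path instead of io.StringIO(csv_text).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows
df.info()             # dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns

# Separate numeric and categorical features by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column and show only columns that have any.
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])
```

Note that `year` is read as numeric; whether to treat it as a number or a category is a modelling decision to justify later.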

# 2.2 Exploratory Data Analysis (EDA)

# TODO:
# - Histograms for numeric features
# - Boxplots for numeric features
# - Countplots for categorical features
# - Scatterplots / pairplots
# - Correlation heatmap

# WRITE:
# - Observations from each visualization
# - Interpret any patterns
# - Identify potential outliers
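
As a numeric companion to the plots listed above, the sketch below computes class counts, a correlation matrix, per-species summaries, and IQR-based outlier flags. It runs on synthetic data from `make_blobs` as a stand-in for the haggis measurements; the column names are assumed, and the 1.5 × IQR rule is one common convention rather than a requirement of the project.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Assumed column names; synthetic stand-in for the real measurements.
features = ["nose_length_mm", "eye_size_mm", "tail_length_mm", "body_mass_g"]
X, y = make_blobs(n_samples=344, centers=3, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=features)
df["species"] = pd.Series(y).map({0: "Wild Rambler", 1: "Macduff", 2: "Bog Sniffler"})

species_counts = df["species"].value_counts()                    # class balance
corr = df[features].corr()                                       # input for a heatmap
per_species = df.groupby("species")[features].agg(["mean", "std"])

# Flag values outside 1.5 * IQR of each feature as potential outliers.
q1, q3 = df[features].quantile(0.25), df[features].quantile(0.75)
iqr = q3 - q1
outlier_mask = (df[features] < q1 - 1.5 * iqr) | (df[features] > q3 + 1.5 * iqr)

print(species_counts)
print(corr.round(2))
print(outlier_mask.sum())  # number of flagged values per feature
```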

# 2.3 Data Cleaning

# TODO:
# - Decide how to handle missing values
# - Fix incorrect types
# - Outlier detection & decision

# WRITE:
# - Justify every cleaning choice
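
One possible cleaning strategy is sketched below on a tiny hypothetical frame: fix a mistyped column, drop rows with no measurements at all, impute remaining numeric gaps with the species median, and keep missing `sex` as an explicit category. These are illustrative choices under assumed column names; the right choices depend on how much data is actually missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical rows illustrating common problems.
df = pd.DataFrame({
    "species": ["Macduff", "Macduff", "Bog Sniffler", "Bog Sniffler", "Wild Rambler"],
    "sex": ["female", "male", "male", None, "female"],
    "nose_length_mm": [39.1, np.nan, 50.0, 48.2, 46.5],
    "body_mass_g": [3750.0, np.nan, 5700.0, np.nan, 3500.0],
    "year": ["2023", "2023", "2025", "2024", "2024"],  # stored as text
})
measure_cols = ["nose_length_mm", "body_mass_g"]

# Fix incorrect types.
df["year"] = df["year"].astype(int)

# Rows with no measurements carry no morphological information: drop them.
df = df.dropna(subset=measure_cols, how="all").copy()

# Impute remaining numeric gaps with the median of the same species.
df[measure_cols] = df.groupby("species")[measure_cols].transform(
    lambda s: s.fillna(s.median())
)

# Keep missing sex as its own category instead of guessing a value.
df["sex"] = df["sex"].fillna("unknown")
print(df)
```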

# 2.4 Encoding

# TODO:
# - One-hot encode island + sex
# - Decide label encoding strategy for species (only used in supervised learning)

# WRITE:
# - Why one-hot vs label encoding
# - Why species must be removed for clustering
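
A minimal sketch of the two encodings, using hypothetical rows and assumed column names. `island` and `sex` have no natural order, so they are one-hot encoded; `species` is the prediction target, so integer labels are enough and it is kept out of the feature matrix.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; column names are assumptions.
df = pd.DataFrame({
    "species": ["Macduff", "Wild Rambler", "Bog Sniffler", "Macduff"],
    "island": ["Skye", "Iona", "Shetland", "Skye"],
    "sex": ["female", "male", "male", "female"],
    "body_mass_g": [3750, 3500, 5700, 3800],
})

# One-hot encode the nominal predictors; species is excluded from X.
X = pd.get_dummies(df.drop(columns="species"), columns=["island", "sex"], dtype=int)

# Integer-encode the target for supervised learning only.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["species"])

print(X.columns.tolist())
print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))
```

For linear models, passing `drop_first=True` to `pd.get_dummies` avoids perfectly collinear dummy columns; for trees and distance-based methods it matters less.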

# 2.5 Scaling

# TODO:
# - Choose scaler (e.g., StandardScaler)
# - Apply to appropriate models only (not trees)

# WRITE:
# - Why clustering/KNN/Logistic Regression need scaling
# - Why trees don't
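
The sketch below shows one way to apply `StandardScaler` without leaking test-set information: the scaler is fitted on the training portion only and then applied to both portions. The two synthetic columns mimic a millimetre-scale and a gram-scale feature; the values are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins: nose length (mm) and body mass (g) on very different scales.
X = np.column_stack([rng.normal(45, 5, 344), rng.normal(4200, 800, 344)])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# Fit on training data only, then transform both sets with the same parameters.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

Without scaling, Euclidean distances in K-Means and KNN would be dominated by body mass simply because its numbers are larger. Tree splits compare one feature against a threshold at a time, so they are unaffected by monotonic rescaling.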

# Optional (for A1 upgrade)
# - Attempt PCA
# - Visualize PCA 2D scatterplot

# WRITE:
# - What variance the components capture
# - Whether PCA aids interpretation
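
A possible starting point for the optional PCA step, again on synthetic stand-in data. The explained variance ratio answers the first WRITE prompt directly; the two component columns are what a 2D scatterplot would display.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four morphological measurements.
X, _ = make_blobs(n_samples=344, centers=3, n_features=4, random_state=0)

# PCA is variance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)  # shape (n_samples, 2), ready for a scatterplot

print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total share kept in 2D
```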

# 3. Stage 2 — Clustering (K-Means)

# 3.1 Prepare Data for Clustering

# TODO:
# - Remove species
# - Encode + scale features

# 3.2 Determine Optimal k

# TODO:
# - Elbow plot (inertia vs k)
# - Silhouette score plot

# WRITE:
# - Decide best k
# - Justify choice
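
The two selection criteria can be computed in a single loop, as in this sketch on synthetic data. The range of k and the `n_init` and `random_state` values are arbitrary assumptions. Inertia always falls as k grows, so the elbow is read visually, while the silhouette score has a maximum that can be picked programmatically.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; species labels are deliberately discarded.
X, _ = make_blobs(n_samples=344, centers=3, n_features=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_                                 # for the elbow plot
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)   # for the silhouette plot

best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)
```

If the elbow and the silhouette maximum disagree, that disagreement is itself worth discussing in the justification.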

# 3.3 Apply K-Means

# TODO:
# - Fit model
# - Add cluster labels to dataframe

# 3.4 Cluster Analysis

# TODO:
# - Summary statistics per cluster
# - Visualize clusters using original features
# - Compare clusters to species distribution (OPTIONAL but A-level)

# WRITE:
# - Characteristics of each cluster
# - What each cluster represents biologically
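
One way to characterize clusters and compare them with the known species is sketched below, on synthetic data with assumed column names. The contingency table shows which species fall into which cluster, and the adjusted Rand index summarizes the agreement in a single number (1 means identical partitions, values near 0 mean chance-level agreement).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

features = ["nose_length_mm", "eye_size_mm", "tail_length_mm", "body_mass_g"]  # assumed names
X, y = make_blobs(n_samples=344, centers=3, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=features)
df["species"] = pd.Series(y).map({0: "Wild Rambler", 1: "Macduff", 2: "Bog Sniffler"})

# Cluster on scaled features only; species is never shown to K-Means.
X_scaled = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

cluster_profile = df.groupby("cluster")[features].mean()   # profile in original units
contingency = pd.crosstab(df["cluster"], df["species"])    # clusters vs species
ari = adjusted_rand_score(df["species"], df["cluster"])

print(cluster_profile.round(2))
print(contingency)
print("adjusted Rand index:", round(ari, 3))
```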

# Optional Extra Credit
# - DBSCAN
# - Fit and compare
# - Discuss differences
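
For the optional DBSCAN comparison, a minimal sketch follows. The `eps` and `min_samples` values are placeholders, not tuned settings; on real data they would typically be chosen from a k-distance plot. Unlike K-Means, DBSCAN does not take a cluster count and labels low-density points as noise (`-1`).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled haggis features.
X, _ = make_blobs(n_samples=344, centers=3, n_features=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are illustrative guesses.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters found:", n_clusters, "| noise points:", n_noise)
```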

# 4. Stage 3 — Supervised Learning (Decision Tree Classification)

# 4.1 Train/Test Split

# TODO:
# - Encode categoricals
# - DO NOT scale
# - Create train/test split

# 4.2 Fit Decision Tree

# TODO:
# - Train model
# - Visualize tree

# WRITE:
# - Comment on structure
# - Which splits are most important

# 4.3 Evaluate Model

# TODO:
# Generate:
# - Accuracy
# - Confusion matrix
# - Classification report

# WRITE:
# - Where model performs well
# - Where it struggles
# - Any imbalance issues
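
Sections 4.1 to 4.3 can be sketched together as below. The data is synthetic (with overlapping classes so the metrics are not trivially perfect), the feature names are assumed, and `max_depth=4` is an arbitrary starting value. The split is stratified so each species keeps its proportion in both sets, and `export_text` gives a plain-text view of the tree structure.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["nose_length_mm", "eye_size_mm", "tail_length_mm", "body_mass_g"]  # assumed names
species_names = ["Wild Rambler", "Macduff", "Bog Sniffler"]
X, y = make_blobs(n_samples=344, centers=3, n_features=4, cluster_std=3.0, random_state=0)

# Stratified split; no scaling is applied for the tree.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
print(export_text(tree, feature_names=features))  # text view of the splits

y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true species, columns: predicted

print("accuracy:", round(accuracy, 3))
print(cm)
print(classification_report(y_test, y_pred, target_names=species_names))
```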

# 4.4 Feature Importance

# TODO:
# - Plot feature importances

# WRITE:
# - Which features matter, why it makes sense biologically
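
Feature importances are available directly on a fitted tree. The sketch below, on synthetic data with assumed feature names, sorts them into a series that can be passed to a bar plot. These are impurity-based importances, which sum to 1 and can favour features with many possible split points.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

features = ["nose_length_mm", "eye_size_mm", "tail_length_mm", "body_mass_g"]  # assumed names
X, y = make_blobs(n_samples=344, centers=3, n_features=4, cluster_std=3.0, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Impurity-based importances, sorted from most to least influential.
importances = pd.Series(tree.feature_importances_, index=features).sort_values(ascending=False)
print(importances.round(3))
# importances.plot.barh() would draw the chart in a notebook.
```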

# Extra credit (recommended for A1)
# - Hyperparameter tuning
# - Pre-pruning
# - Post-pruning
# - Random Forest
# - Compare performance
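
A sketch of the extra-credit items, assuming a small illustrative grid: `max_depth` and `min_samples_leaf` act as pre-pruning, `ccp_alpha` applies cost-complexity (post-)pruning, and a Random Forest is fitted for comparison. The grid values are placeholders and the data is synthetic.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=344, centers=3, n_features=4, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, None],   # pre-pruning: limit depth
    "min_samples_leaf": [1, 5, 10], # pre-pruning: minimum leaf size
    "ccp_alpha": [0.0, 0.01],       # post-pruning: cost-complexity penalty
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)  # tuning uses cross-validation on the training set only

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

tuned_acc = grid.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print("best params:", grid.best_params_)
print("tuned tree:", round(tuned_acc, 3), "| random forest:", round(forest_acc, 3))
```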

# 5. Stage 4 — Comparative Classification (KNN & Logistic Regression)
# Reuse the same train/test split as Stage 3

# 5.1 KNN Classifier

# TODO:
# - Scale numeric features
# - Try multiple k
# - Accuracy vs k plot

# WRITE:
# - Why your chosen k is optimal
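
One way to choose k without touching the test set is cross-validation on the training data, as sketched below on synthetic data. Wrapping the scaler and the classifier in a pipeline means scaling is refitted inside each fold. The candidate range (odd values from 1 to 21) is an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=344, centers=3, n_features=4, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

k_values = list(range(1, 22, 2))  # odd k reduces tied votes
cv_scores = {}
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "| CV accuracy:", round(cv_scores[best_k], 3))
# Plotting cv_scores (k on the x-axis) gives the accuracy-vs-k curve.
```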

# 5.2 Logistic Regression

# TODO:
# - Scale features
# - Fit model
# - Evaluate

# 5.3 Compare All Three Models
# Decision Tree vs KNN vs Logistic Regression

# WRITE:
# - Which is most accurate
# - Which generalizes best
# - Which is most interpretable
# - Tradeoffs
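
The three classifiers can be compared on the same split with a small loop like the one below. The data is synthetic and the hyperparameters are placeholders. Reporting cross-validated accuracy next to test accuracy gives a rough view of generalization: a large gap between the two is a warning sign.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=344, centers=3, n_features=4, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling is applied only to the models that need it.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

rows = []
for name, model in models.items():
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    rows.append({"model": name, "cv_accuracy": cv_acc, "test_accuracy": test_acc})

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```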

# 5.4 Interpret Logistic Regression

# TODO:
# - Plot / examine coefficients

# WRITE:
# - Which features increase likelihood of each species
# - Biological interpretation
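
For a multiclass logistic regression, `coef_` holds one row of coefficients per class. The sketch below arranges them into a labelled table, using synthetic data and assumed feature and species names. Because the features are standardized, coefficient magnitudes are comparable across features; a positive value means higher values of that feature raise the log-odds of that species relative to the others.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = ["nose_length_mm", "eye_size_mm", "tail_length_mm", "body_mass_g"]  # assumed names
species_names = ["Wild Rambler", "Macduff", "Bog Sniffler"]  # order matches labels 0, 1, 2
X, y = make_blobs(n_samples=344, centers=3, n_features=4, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)
log_reg = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# One row per species, one column per standardized feature.
coefficients = pd.DataFrame(log_reg.coef_, index=species_names, columns=features)
print(coefficients.round(2))
```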

# 6. Stage 5 — Regression (Linear Regression)

# 6.1 Choose target
# Likely: body_mass or nose_length
# Up to you.

# 6.2 Prepare Data

# TODO:
# - Select meaningful predictors
# - Train/test split
# - Scale if needed

# 6.3 Fit Model

# TODO:
# - Train linear regression
# - Predict

# 6.4 Evaluate

# Compute:
# - R²
# - MAE
# - MSE / RMSE
# - Residual plot

# 6.5 Interpret

# WRITE:
# - Meaning of coefficients
# - Whether model fits well
# - What affects the target most
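
Stage 5 as a whole can be sketched as below, assuming body mass is the chosen target. The data is synthetic, generated from a known linear relationship plus noise purely for illustration, and the predictor names are assumed. RMSE is computed as the square root of MSE so the code does not depend on a particular scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 344
df = pd.DataFrame({
    "nose_length_mm": rng.normal(45, 5, n),
    "tail_length_mm": rng.normal(200, 14, n),
})
# Synthetic target with known structure (illustration only, not real haggis biology).
df["body_mass_g"] = (
    40 * df["nose_length_mm"] + 15 * df["tail_length_mm"] - 600 + rng.normal(0, 300, n)
)

predictors = ["nose_length_mm", "tail_length_mm"]
X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df["body_mass_g"], test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
residuals = y_test - y_pred  # plot against y_pred to check for structure

coefficients = pd.Series(model.coef_, index=predictors)
print(f"R2={r2:.3f}  MAE={mae:.1f} g  RMSE={rmse:.1f} g")
print(coefficients.round(2))  # grams of body mass per unit of each predictor
```

Unscaled coefficients are in the target's units per unit of predictor, which makes them easy to interpret; to compare which predictor matters most, standardize the predictors first.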

# 7. Final Conclusion

# WRITE:
# - Summary of data insights
# - Overall finding in clustering
# - Best-performing classifier
# - How regression explains traits
# - Limits of dataset
# - Recommendations for future research