### Project Topic

This project applies supervised learning techniques to classify mushrooms as either **edible** or **poisonous**. It is a **binary classification problem**: the task is to predict one of two discrete classes.

The primary goals of this project are to:
1.  Demonstrate the process of supervised learning by applying multiple machine learning algorithms to a real-world classification task.
2.  Learn and implement advanced techniques such as hyperparameter tuning, feature engineering, and model evaluation.
3.  Showcase performance metrics and comparisons to determine the most effective model for the task.
4.  Highlight the importance of accurate classification in practical scenarios, such as identifying potentially dangerous mushrooms.

### Data 

The dataset used is the **Secondary Mushroom Dataset** from the UCI Machine Learning Repository.
> Dua, D., & Graff, C. (2019). UCI Machine Learning Repository [Secondary Mushroom Dataset]. Retrieved from [https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset).


This is a simulated dataset created for binary classification tasks, specifically predicting the edibility of mushrooms from their features. It combines categorical and continuous feature types with a large number of samples, which makes it suitable for exploring advanced machine learning techniques and for evaluating model performance comprehensively.

The data is loaded into this notebook using the `ucimlrepo` package.

#### Data Description

- **Number of Samples:** 61,068
- **Number of Features:** 20
- **Feature Types:**
  - Categorical features (e.g., cap shape, surface, and color)
  - Continuous features (numerical measurements, e.g., cap diameter, stem height, and stem width)
- **Task Type:** Binary classification (edible or poisonous)
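
As a rough illustration of the two feature types listed above, the sketch below separates continuous from categorical columns by dtype. The toy frame and its values are made up for illustration (only the column names mirror the dataset), and it assumes categorical features are stored as strings and continuous ones as numbers.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset's naming style.
toy = pd.DataFrame({
    "cap-diameter": [5.1, 7.3, 2.4],   # continuous
    "stem-height": [6.0, 4.2, 3.8],    # continuous
    "cap-shape": ["x", "f", "b"],      # categorical (single-letter codes)
    "cap-color": ["n", "w", "y"],      # categorical (single-letter codes)
})

# Numeric dtypes are treated as continuous; everything else as categorical.
continuous_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Continuous:", continuous_cols)
print("Categorical:", categorical_cols)
```

The same two `select_dtypes` calls could be applied to the feature frame `X` once it has been fetched.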

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from ucimlrepo import fetch_ucirepo 

In [18]:
# fetch dataset 
secondary_mushroom = fetch_ucirepo(id=848) 
# data (as pandas dataframes) 
X = secondary_mushroom.data.features 
y = secondary_mushroom.data.targets 
  
# metadata 
print(secondary_mushroom.metadata) 
# variable information 
print(secondary_mushroom.variables) 

{'uci_id': 848, 'name': 'Secondary Mushroom', 'repository_url': 'https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/848/data.csv', 'abstract': 'Dataset of simulated mushrooms for binary classification into edible and poisonous.', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 61068, 'num_features': 20, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2021, 'last_updated': 'Wed Apr 10 2024', 'dataset_doi': '10.24432/C5FP5Q', 'creators': ['Dennis Wagner', 'D. Heider', 'Georges Hattab'], 'intro_paper': {'ID': 259, 'type': 'NATIVE', 'title': 'Mushroom data creation, curation, and simulation to support classification tasks', 'authors': 'Dennis Wagner, D. Heider, Georges Hattab', 'venue': 'Scientific Reports', 'year': 2021, 'journal': None, '

In [19]:
'''
Exploratory Data Analysis

The variable information printed above shows that several features have missing values.
To avoid outright breaking the logistic regression model or degrading the performance of the other models,
I inspected the percentage of samples missing each feature. I decided to remove features with more than 30% of values missing
to avoid causing collinearity among the features. For the remaining features with missing values, I add a new, unique value ("?") to represent "missing".
'''

missing_proportions = X.isnull().mean() * 100
missing_features = missing_proportions[missing_proportions > 0].sort_values(ascending=False)
print("Missing values (%) from source dataset:")
print(missing_features)
print('')
print('Cleaning data...')

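# Drop features with more than 30% of their values missing.
features_to_drop = missing_proportions[missing_proportions > 30].index
X_cleaned = X.drop(columns=features_to_drop)

# For the remaining features that have missing values, fill the gaps with the sentinel "?".
features_to_mod = missing_proportions[(missing_proportions > 0) & (missing_proportions <= 30)].index
X_cleaned = X_cleaned.fillna({f: "?" for f in features_to_mod})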

missing_proportions_check = X_cleaned.isnull().mean() * 100
missing_features_check = missing_proportions_check[missing_proportions_check > 0]
assert len(missing_features_check) == 0, 'Data is not cleaned as expected'
X = X_cleaned
print('Data is clean and ready to explore, train, and test!')

Missing values (%) from source dataset:
veil-type            94.797688
spore-print-color    89.595376
veil-color           87.861272
stem-root            84.393064
stem-surface         62.427746
gill-spacing         41.040462
cap-surface          23.121387
gill-attachment      16.184971
ring-type             4.046243
dtype: float64

Cleaning data...
Data is clean and ready to explore, train, and test!
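
The cleaning rule used above (drop features that are mostly missing, flag the rest with a sentinel) can be sketched as a small reusable function. This is a minimal sketch and not part of the original notebook: the helper name `drop_or_flag_missing`, its parameters, and the toy data are all hypothetical, and it assumes the columns being filled are categorical, so that a string sentinel makes sense.

```python
import numpy as np
import pandas as pd


def drop_or_flag_missing(df, threshold_pct=30.0, sentinel="?"):
    """Hypothetical helper: drop columns whose missing share exceeds
    threshold_pct and fill the remaining gaps with a sentinel value."""
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > threshold_pct].index
    to_fill = missing_pct[(missing_pct > 0) & (missing_pct <= threshold_pct)].index
    cleaned = df.drop(columns=to_drop)
    cleaned = cleaned.fillna({col: sentinel for col in to_fill})
    return cleaned, list(to_drop), list(to_fill)


# Toy frame with made-up values (not taken from the dataset).
toy = pd.DataFrame({
    "cap-diameter": [5.1, 7.3, 2.4, 9.0, 4.4],            # no missing values
    "cap-surface": ["s", np.nan, "y", "s", "g"],          # 20% missing: filled
    "veil-type": [np.nan, np.nan, "u", np.nan, np.nan],   # 80% missing: dropped
})

cleaned, dropped, filled = drop_or_flag_missing(toy)
print("Dropped:", dropped)
print("Filled:", filled)
print(cleaned)
```

Treating "missing" as its own category keeps every row in the data, at the cost of one extra level per affected feature once the categorical features are encoded.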
