## Mushroom Classification Project
### Author: Kate Huntsman
### Date: March 21st, 2025

## Introduction
This project aims to classify mushrooms as edible or poisonous using the UCI Mushroom Dataset.
We will explore the dataset, preprocess the data, and train machine learning models to make predictions.

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

## Section 1. Import and Inspect the Data
### 1.1 Load the dataset and display the first 10 rows

In [None]:
# load the dataset
data = pd.read_csv('mushrooms.csv')

# Display the first 10 rows
data.head(10)

### 1.2 Check for missing values and display summary statistics.

In [None]:
# Check for missing values
print(data.isnull().sum())

# Display summary statistics
print(data.describe(include='all'))

### Reflection 1: What do you notice about the dataset? Are there any data issues?
Every column in the dataset is categorical, and `isnull()` reports no missing values.
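
Because `isnull()` only counts NaN, a placeholder string standing in for a missing value would go unnoticed. The following is a minimal sketch of checking for one, assuming missing entries might be coded as `'?'`. The toy frame and its values are hypothetical and are not taken from `mushrooms.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "c", "?"],
    "odor": ["p", "a", "l", "p"],
})

# isnull() finds nothing because '?' is an ordinary string
nan_counts = toy.isnull().sum()

# Count the assumed '?' placeholder per column instead
placeholder_counts = (toy == "?").sum()

# Optionally convert the placeholder to NaN so pandas treats it as missing
toy_clean = toy.replace("?", np.nan)

print(placeholder_counts)
```

Running the same count on the real `data` frame would show whether any column needs imputation before encoding.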

## Section 2. Data Exploration and Preparation
### 2.1 Explore Data Patterns and Distributions
Create histograms, boxplots, and count plots for categorical variables (as applicable).
Identify patterns, outliers, and anomalies in feature distributions.
Check for class imbalance in the target variable (as applicable).

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='class', data=data, palette='coolwarm')
plt.title('Class Distribution')
plt.show()

# Explore categorical feature distributions
for column in data.columns[1:]:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=column, data=data, palette='viridis')
    plt.title(f'Distribution of {column}')
    plt.xticks(rotation=45)
    plt.show()

# Check for class imbalance
class_counts = data['class'].value_counts()
print("Class Distribution:")
print(class_counts)
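
Raw counts can be hard to compare at a glance, so it may help to express the class balance as proportions. This sketch uses a hypothetical label series rather than the real `class` column; `imbalance_ratio` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical labels: 'e' = edible, 'p' = poisonous
labels = pd.Series(["e", "e", "e", "p", "p", "e", "p", "e"], name="class")

# normalize=True returns each class's share instead of a raw count
class_share = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests that plain accuracy is a reasonable metric; a much larger ratio would argue for stratified splitting and per-class metrics.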

### 2.2 Handle missing values and clean data
Impute or drop missing values (as applicable).
Remove or transform outliers (as applicable).
Convert categorical data to numerical format using encoding (as applicable).

In [None]:
# Encode categorical features
label_encoders = {}
for column in data.columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Verify encoding
print("Encoded Data Sample:")
display(data.head())
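
`LabelEncoder` maps each category to an integer, which implies an ordering that nominal features do not have. Tree-based models usually tolerate this, but linear models such as logistic regression may not. One alternative, shown here only as a sketch on a hypothetical toy frame, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame with two nominal features
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "cap-color": ["n", "y", "w", "n"],
})

# Binary target: 1 = poisonous, 0 = edible
y_toy = (toy["class"] == "p").astype(int)

# One indicator column per category, so no ordering is implied
X_toy = pd.get_dummies(toy.drop(columns=["class"]), dtype=int)

print(X_toy.columns.tolist())
print(X_toy.shape)
```

The trade-off is width: every category becomes its own column, so the feature matrix grows with the number of distinct values.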

### 2.3 Feature selection and engineering
Create new features (as applicable).
Transform or combine existing features to improve model performance (as applicable).
Scale or normalize data (as applicable).

In [None]:
# Select features and target
X = data.drop(columns=['class'])
y = data['class']

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
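
Fitting the scaler on the full dataset lets test-set statistics influence the training features. A common way to avoid that, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable for this project, is to split first and let the pipeline fit the scaler on the training rows only. The data below is synthetic and stands in for the mushroom features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the mushroom features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when it transforms the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Scaling matters mostly for the logistic regression option; decision trees and random forests split on thresholds and are largely unaffected by it.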

### Reflection 2: What patterns or anomalies do you see? Do any features stand out? What preprocessing steps were necessary to clean and improve the data? Did you create or modify any features to improve performance?

## Section 3. Feature Selection and Justification
### 3.1 Choose features and target
Select two or more input features (numerical for regression, numerical and/or categorical for classification)
Select a target variable (as applicable)
Regression: Continuous target variable (e.g., price, temperature).
Classification: Categorical target variable (e.g., gender, species).
Clustering: No target variable.

In [None]:
selected_features = X.columns.tolist()
print("Selected Features:", selected_features)
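
Keeping every column is a valid choice, but ranking features can help justify it. The sketch below, which uses synthetic data and hypothetical feature names, ranks features by the impurity-based importances of a random forest. This is one possible approach and not necessarily the method intended for this section.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(6)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort so the strongest come first
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

Features near the bottom of the ranking are candidates for removal, though impurity-based scores can be biased toward features with many distinct values.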

### 3.2 Define X and y
Assign input features to X
Assign target variable to y (as applicable)

### Reflection 3: Why did you choose these features? How might they impact predictions or accuracy?

## Section 4. Train a Model (Classification: Choose 1: Decision Tree, Random Forest, Logistic Regression)
### 4.1 Split the data into training and test sets using train_test_split (or StratifiedShuffleSplit if class imbalance is an issue).
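
For reference, the `StratifiedShuffleSplit` alternative named in the heading can be sketched as follows. The arrays are hypothetical and deliberately imbalanced so that the preserved class ratio is easy to see.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# One stratified 80/20 split; each side keeps the 80:20 class ratio
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]

print("Train share of class 1:", y_tr.mean())
print("Test share of class 1:", y_te.mean())
```

Passing `stratify=y` to `train_test_split` gives a comparable single split; `StratifiedShuffleSplit` is mainly useful when several repeated splits are wanted.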

In [None]:
# Split into stratified training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

# Train a Decision Tree model
model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(X_train, y_train)

# Evaluate the model
y_pred_dt = model_dt.predict(X_test)
print("Decision Tree Classifier Performance:")
print(classification_report(y_test, y_pred_dt))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))

# Store performance metrics
dt_accuracy = accuracy_score(y_test, y_pred_dt)
dt_precision = precision_score(y_test, y_pred_dt)
dt_recall = recall_score(y_test, y_pred_dt)
dt_f1 = f1_score(y_test, y_pred_dt)

### 4.2 Train model using Scikit-Learn model.fit() method.

### 4.3 Evaluate performance, for example:
Regression: R^2, MAE, RMSE (scikit-learn 1.4 added `root_mean_squared_error` and deprecated the `squared` argument of `mean_squared_error`)
Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix
Clustering: Inertia, Silhouette Score
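
To make the classification metrics concrete, the sketch below derives them by hand from a confusion matrix and checks the results against scikit-learn. The labels are hypothetical, with 1 treated as the positive (poisonous) class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = poisonous (positive class), 0 = edible
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}, Accuracy: {accuracy:.2f}")
```

For this task recall on the poisonous class deserves particular attention, since a false negative means a poisonous mushroom labeled as edible.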

### Reflection 4: How well did the model perform? Any surprises in the results?

## Section 5. Improve the Model or Try Alternates (Implement a Second Option)
### 5.1 Train an alternative classifier (e.g., Decision Tree, Random Forest, Logistic Regression) OR adjust hyperparameters on the original model.

In [None]:
# Train a Random Forest model
model_rf = RandomForestClassifier(random_state=42, n_estimators=100)
model_rf.fit(X_train, y_train)

# Evaluate the alternative model
y_pred_rf = model_rf.predict(X_test)
print("Random Forest Classifier Performance:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

# Store performance metrics
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
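
Section 5.1 also allows adjusting hyperparameters on the original model. A minimal sketch of that option using `GridSearchCV` is shown below; the data is synthetic and the parameter grid is an illustrative assumption, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the mushroom features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Illustrative grid: 3 depths x 2 leaf sizes = 6 candidate settings
param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}

# 5-fold cross-validation, selecting the setting with the best mean F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

The search should be fit on the training split only, so the held-out test set still gives an unbiased estimate of the tuned model.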

### 5.2 Compare performance of all models across the same performance metrics.

In [None]:
# Compare models
print("\nModel Performance Comparison:")
print(f"Decision Tree - Accuracy: {dt_accuracy:.4f}, Precision: {dt_precision:.4f}, Recall: {dt_recall:.4f}, F1-score: {dt_f1:.4f}")
print(f"Random Forest - Accuracy: {rf_accuracy:.4f}, Precision: {rf_precision:.4f}, Recall: {rf_recall:.4f}, F1-score: {rf_f1:.4f}")
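
As the number of models grows, a table can be easier to read than formatted print statements. This sketch collects the same four metrics into a DataFrame; the labels, predictions, and model names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions from two models
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "Model A": [1, 0, 1, 0, 0, 0, 1, 0],
    "Model B": [1, 0, 1, 1, 0, 1, 1, 0],
}

# One row per model, one column per metric
rows = {
    name: {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
}
comparison = pd.DataFrame.from_dict(rows, orient="index")

print(comparison.round(4))
```

In this toy case both models reach the same accuracy while trading precision for recall, which is the kind of difference a side-by-side table makes visible.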

### Reflection 5: Which model performed better? Why might one classifier be more effective in this specific case?

## Section 6. Final Thoughts & Insights
### 6.1 Summarize findings.

### 6.2 Discuss challenges faced.

### 6.3 If you had more time, what would you try next?

### Reflection 6: What did you learn from this project?