<a href="https://colab.research.google.com/github/matthewpblock/ml_classification_block/blob/main/block_classification_midterm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Midterm Project: Classification Analysis - Matthew Block
- 11 November 2025

[insert brief introduction]

In [18]:
!pip install ucimlrepo



In [19]:
import pandas as pd
from ucimlrepo import fetch_ucirepo
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


## Section 1. Import and Inspect the Data
### 1.1 Load the dataset and display the first 10 rows.

In [20]:
# fetch dataset
mushroom = fetch_ucirepo(id=73)

# data (as pandas dataframes)
X = mushroom.data.features
y = mushroom.data.targets
df_shrooms = mushroom.data

# Combine X and y into a single DataFrame
df = pd.concat([X, y], axis=1)

# Display the first 10 rows of the combined DataFrame
print(df.head(10))

# Save original data to the data folder, creating the folder first if it does not exist
import os
os.makedirs('data', exist_ok=True)
df.to_csv('data/mushroom_data.csv', index=False)

  cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing  \
0         x           s         n       t    p               f            c   
1         x           s         y       t    a               f            c   
2         b           s         w       t    l               f            c   
3         x           y         w       t    p               f            c   
4         x           s         g       f    n               f            w   
5         x           y         y       t    a               f            c   
6         b           s         w       t    a               f            c   
7         b           y         w       t    l               f            c   
8         x           y         w       t    p               f            c   
9         b           s         y       t    a               f            c   

  gill-size gill-color stalk-shape  ... stalk-color-above-ring  \
0         n          k           e  ...                      w  

In [None]:
# variable information for decoder ring
print(mushroom.variables)

### 1.2 Check for missing values and display summary statistics.

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

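# Display column types and non-null counts
print("\nDataFrame info:")
df.info()

# Display summary statistics (count, unique, top, freq for categorical columns)
print("\nSummary statistics:")
print(df.describe())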

### Reflection 1:
- What do you notice about the dataset?
  - Stalk-root has many missing values, so I probably won't use it as a first option (although it may be salvageable).
- Are there any data issues?
  - Mainly the missing stalk-root values. Also, the values are encoded as single letters, which keeps the file size small but makes the dataset slightly less intuitive to work with.
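
To put a number on "a lot of missing values", the share of missing entries per column can be computed directly. This is a minimal sketch on a hypothetical toy frame (the `toy` values are made up for illustration); on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data (illustrative values only)
toy = pd.DataFrame({
    "stalk-root": ["b", None, "e", None, "c"],
    "odor": ["a", "n", "p", "n", "l"],
})

# Fraction of missing entries per column, sorted from most to least missing
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column with a high missing share can then be dropped, imputed, or given its own "missing" category, depending on how much signal it appears to carry.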

## Section 2. Data Exploration and Preparation
### 2.1 Explore data patterns and distributions
- Create histograms, boxplots, and count plots for categorical variables (as applicable).
- Identify patterns, outliers, and anomalies in feature distributions.
- Check for class imbalance in the target variable (as applicable).
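
A count plot shows class balance visually; a quick numeric check can complement it. The sketch below uses a hypothetical label series (the values are invented, with 'e' for edible and 'p' for poisonous as in the dataset's coding) and would be applied to `df['poisonous']` in practice.

```python
import pandas as pd

# Hypothetical labels for illustration: 'e' = edible, 'p' = poisonous
labels = pd.Series(["e", "p", "e", "e", "p", "e", "p", "e"], name="poisonous")

# Share of each class in the target
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the rarest class to the most common class; values near 1 suggest balance
imbalance_ratio = class_share.min() / class_share.max()
print(f"Minority/majority ratio: {imbalance_ratio:.2f}")
```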


In [None]:
# Select a few categorical columns to plot
categorical_cols = ['cap-shape', 'cap-color', 'odor', 'habitat', 'poisonous']

# Create count plots for selected columns
for col in categorical_cols:
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index)
    plt.title(f'Distribution of {col}')
    plt.show()


In [None]:
# Cross-tabulate each feature against 'poisonous' (category counts per class) as a rough check of association
for col in df.columns:
    if col != 'poisonous':
        print(f"\nCorrelation with poisonous for: {col}")
        display(df.groupby([col, 'poisonous']).size().unstack(fill_value=0))

### 2.2 Handle missing values and clean data
- Impute or drop missing values (as applicable).
- Remove or transform outliers (as applicable).
- Convert categorical data to numerical format using encoding (as applicable).
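
Label encoding assigns each category an integer, which implies an ordering that nominal features such as odor do not really have. One common alternative is one-hot encoding. The following is a minimal sketch on a hypothetical toy frame, not the encoding used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with two nominal columns (illustrative values only)
toy = pd.DataFrame({"odor": ["a", "n", "p", "n"], "habitat": ["g", "u", "g", "d"]})

# One indicator column per category; no ordering is implied between categories
toy_onehot = pd.get_dummies(toy, columns=["odor", "habitat"])
print(toy_onehot.columns.tolist())
print(toy_onehot.shape)
```

The trade-off is width: each category becomes its own column, so features with many categories add many columns.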


In [None]:
# Handle missing values in 'stalk-root' by encoding them as a separate category
df['stalk-root'] = df['stalk-root'].fillna('missing')

# Create a copy of the DataFrame to store label encoded values
df_encoded = df.copy()

# Convert categorical data to numerical format using Label Encoding
# Iterate over each column in the DataFrame
for column in df_encoded.columns:
    # Check if the column is of object type (categorical)
    if df_encoded[column].dtype == 'object':
        le = LabelEncoder()
        df_encoded[column] = le.fit_transform(df_encoded[column])

# Display the first 5 rows of the encoded DataFrame
print(df_encoded.head())

### 2.3 Feature selection and engineering
- Create new features (as applicable).
- Transform or combine existing features to improve model performance (as applicable).
- Scale or normalize data (as applicable).


In [None]:
# Identify all columns containing 'color' in their name from the encoded DataFrame
color_features = [col for col in df_encoded.columns if 'color' in col]

# Exclude the target variable 'poisonous' if it was accidentally included
if 'poisonous' in color_features:
    color_features.remove('poisonous')

print(f"Color features selected for engineering: {color_features}")

# Determine the maximum number of digits needed for padding.
# LabelEncoder assigns values from 0 to n-1. So max value is n-1.
# Max unique categories for 'gill-color' is 12 (values 0-11), requiring 2 digits.
max_label_value = df_encoded[color_features].max().max() # Find the max label across all color columns
padding_width = len(str(max_label_value)) # Determine padding width based on max label

# Create a new feature by combining the 'color' features
# Convert each feature's label-encoded value to a padded string and concatenate
df_encoded['combined_color_feature'] = df_encoded[color_features].astype(str).apply(
    lambda x: ''.join(x.str.zfill(padding_width)), axis=1
).astype(int)

print("\nFirst 5 rows with the new 'combined_color_feature':")
print(df_encoded[['combined_color_feature'] + color_features].head())

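# Save transformed data to the data folder (create the folder first if it does not exist)
import os
os.makedirs('data', exist_ok=True)
df_encoded.to_csv('data/mushroom_data_transformed.csv', index=False)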

### Reflection 2:
- What patterns or anomalies do you see?
- Do any features stand out?
  - A few features have several categories with a zero count in one of the 'poisonous' classes, especially odor and gill-color.
- What preprocessing steps were necessary to clean and improve the data?
- Did you create or modify any features to improve performance?
  - Yes. In order to reduce all of the color features down to one dimension, I created a combined_color_feature.
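
To make the combined feature concrete, here is a small worked example of the zero-padded concatenation on hypothetical label-encoded values (the `toy` frame and its numbers are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical label-encoded color columns (illustrative values only)
toy = pd.DataFrame({"cap-color": [4, 9], "gill-color": [11, 0], "veil-color": [2, 2]})

# The widest label is 11, so every label is padded to 2 digits
width = len(str(int(toy.max().max())))

# Zero-pad each label to the same width, then concatenate left to right
combined = toy.astype(str).apply(lambda row: "".join(row.str.zfill(width)), axis=1)
print(combined.tolist())              # ['041102', '090002']
print(combined.astype(int).tolist())  # [41102, 90002]
```

Because every label has the same width, each color combination maps to a distinct number even after the leading zero is dropped by the integer conversion. The resulting number acts as an identifier for the combination rather than a meaningful quantity, which is worth keeping in mind for models that treat inputs as ordered.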

## Section 3. Feature Selection and Justification
### 3.1 Choose features and target

- Select two or more input features (numerical for regression, numerical and/or categorical for classification)
  - combined_color_feature
  - odor
- Select a target variable (as applicable)
    - poisonous
- Justify your selection with reasoning.
  - The number one question when it comes to mushrooms is "Can I eat it?"
  - The input features selected showed particular promise in the correlation comparison review (highlighted earlier). While gill-color looked like the strongest color indicator, several color-based features looked useful, so I wanted to reduce them to one dimension.
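
One way to quantify how "promising" a feature looks is to normalize the cross-tabulation by row, so each category shows the share of edible and poisonous samples it contains. This is a sketch on hypothetical toy data, not output from the actual dataset.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    "odor": ["n", "n", "n", "f", "f", "a"],
    "poisonous": ["e", "e", "p", "p", "p", "e"],
})

# Share of each class within every odor category (each row sums to 1)
purity = pd.crosstab(toy["odor"], toy["poisonous"], normalize="index")
print(purity)
```

Categories whose rows are close to 0 or 1 separate the classes cleanly; rows near 0.5 carry little information on their own.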

### 3.2 Define X and y
- Assign input features to X
- Assign target variable to y (as applicable)

In [None]:
# Assign input features to X
X = df_encoded[['combined_color_feature', 'odor']]

# Assign target variable to y
y = df_encoded['poisonous']

print("X head:")
print(X.head())
print("\ny head:")
print(y.head())

### Reflection 3:
- Why did you choose these features?
- How might they impact predictions or accuracy?

## Section 4. Train a Model
### 4.1 Split the data into training and test sets using train_test_split

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
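
If the target classes are uneven, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. The sketch below uses hypothetical toy data (`X_toy`, `y_toy` are made up) to show the effect; it is an optional variation, not the split used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 samples of class 0 and 8 of class 1
X_toy = pd.DataFrame({"odor": range(20)})
y_toy = pd.Series([0] * 12 + [1] * 8)

# stratify=y_toy preserves the 60/40 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(f"Class 1 share in train: {y_tr.mean():.2f}")
print(f"Class 1 share in test:  {y_te.mean():.2f}")
```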

### 4.2 Train model using Scikit-Learn model.fit() method.

In [None]:
# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42)

# Train the model using the training data
model.fit(X_train, y_train)

print("Model training complete.")

### 4.3 Evaluate performance

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

print("Predictions made on the test set. First 10 predictions:")
print(y_pred[:10])

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Calculate precision (positive class is 1, i.e. 'p' after label encoding)
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")

# Calculate recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.4f}")

# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")

# Calculate Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

### Reflection 4:
- How well did the model perform?
- Any surprises in the results?


## Section 5. Improve the Model or Try Alternates (Implement a Second Option)
### 5.1 Train an alternative classifier (e.g., Decision Tree, Random Forest, Logistic Regression) OR adjust hyperparameters on the original model.

In [None]:
# Initialize the Decision Tree Classifier model
decision_tree_model = DecisionTreeClassifier(random_state=42)

# Train the model using the training data
decision_tree_model.fit(X_train, y_train)

print("Decision Tree model training complete.")

# Make predictions on the test set with the Decision Tree model
y_pred_dt = decision_tree_model.predict(X_test)

print("\n--- Decision Tree Classifier Performance ---")

# Calculate accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Accuracy (Decision Tree): {accuracy_dt:.4f}")

# Calculate precision
precision_dt = precision_score(y_test, y_pred_dt)
print(f"Precision (Decision Tree): {precision_dt:.4f}")

# Calculate recall
recall_dt = recall_score(y_test, y_pred_dt)
print(f"Recall (Decision Tree): {recall_dt:.4f}")

# Calculate F1-score
f1_dt = f1_score(y_test, y_pred_dt)
print(f"F1-Score (Decision Tree): {f1_dt:.4f}")

# Calculate Confusion Matrix
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
print("\nConfusion Matrix (Decision Tree):")
print(conf_matrix_dt)
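
For a binary problem, the confusion matrix can be unpacked into its four cells, which makes the most costly error here (a poisonous mushroom predicted as edible) easy to read off. The sketch uses hypothetical labels and assumes 1 means poisonous, which matches how `LabelEncoder` orders 'e' and 'p'.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = edible, 1 = poisonous)
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# False negatives are poisonous mushrooms predicted as edible
false_negative_rate = fn / (fn + tp)
print(f"False negative rate: {false_negative_rate:.2f}")
```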

### 5.2 Compare performance of all models across the same performance metrics.

In [None]:
# Create a dictionary to store the metrics
metrics = {
    'Model': ['Logistic Regression', 'Decision Tree'],
    'Accuracy': [accuracy, accuracy_dt],
    'Precision': [precision, precision_dt],
    'Recall': [recall, recall_dt],
    'F1-Score': [f1, f1_dt]
}

# Create a DataFrame from the metrics dictionary
metrics_df = pd.DataFrame(metrics)

print("\n--- Model Performance Comparison ---")
display(metrics_df)


### Reflection 5:
- Which model performed better?
- Why might one classifier be more effective in this specific case?
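
One possible reason, offered as an assumption rather than a finding from the results here: label-encoded nominal features give logistic regression a single ordered axis, while a decision tree can split that axis at several points. The toy example below (hypothetical data, not the mushroom set) shows a category pattern that a tree fits exactly but a logistic regression on one encoded feature cannot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical label-encoded feature: categories 0 and 2 belong to class 1, category 1 to class 0
X_toy = np.array([[0], [1], [2]] * 10)
y_toy = np.array([1, 0, 1] * 10)

# Logistic regression can only place one threshold along the encoded axis
lr_acc = LogisticRegression().fit(X_toy, y_toy).score(X_toy, y_toy)

# A decision tree can split the axis twice and isolate the middle category
dt_acc = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy).score(X_toy, y_toy)

print(f"Logistic Regression accuracy: {lr_acc:.2f}")
print(f"Decision Tree accuracy: {dt_acc:.2f}")
```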

## Section 6. Final Thoughts & Insights

### 6.1 Summarize findings.
### 6.2 Discuss challenges faced.
### 6.3 If you had more time, what would you try next?

### Reflection 6:
- What did you learn from this project?
