## **Assignment 3 (2024/2): ML1**
**Safe to eat or deadly poison?**



This homework is a classification task to identify whether a mushroom is edible or poisonous.

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.


Step 1. Load 'mushroom2020_dataset.csv' from the "Attachment" (note: this dataset has already been partially prepared).

Step 2. Drop rows where the target (label) variable is missing.

Step 3. Drop the following variables:
'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'

Step 4. Examine the number of rows, the number of variables (columns), and whether any values are missing.

Step 5. Fill missing values with the mean for numeric variables and the mode for nominal variables.

Step 6. Convert the label variable, e (edible) to 1 and p (poisonous) to 0, and report the class counts as class0:class1.

Step 7. Convert the nominal variables to numeric using dummy coding with drop_first=True.

Step 8. Split the data into train and test sets with a 20% test size, stratified on the label, with seed = 2020.

Step 9. Fit a Random Forest with GridSearchCV on the training data, using 5-fold cross-validation and the following settings:
- 'criterion': ['gini', 'entropy']
- 'max_depth': [2, 3]
- 'min_samples_leaf': [2, 5]
- 'n_estimators': [100]
- 'random_state': 2020

Step 10. Predict on the test set and evaluate the predictions with classification_report.


**Complete the class MushroomClassifier from the code template given below.**

In [86]:
#import your other libraries here
import pandas as pd
# hint
# import 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score


In [94]:
file_path = 'mushroom2020_dataset.csv'
df = pd.read_csv(file_path)
df

Unnamed: 0,id,label,cap-shape,cap-surface,bruises,odor,gill-attachment,gill-spacing,gill-size,stalk-shape,...,ring-number,ring-type,spore-print-color,population,habitat,cap-color-rate,gill-color-rate,veil-color-rate,stalk-color-above-ring-rate,stalk-color-below-ring-rate
0,1,p,x,s,t,p,f,c,n,e,...,o,p,k,s,u,1.0,3.0,1.0,1.0,1.0
1,2,e,x,s,t,a,f,c,b,e,...,o,p,n,n,g,2.0,3.0,1.0,1.0,1.0
2,3,e,b,s,t,l,f,c,b,e,...,o,p,n,n,m,3.0,1.0,1.0,1.0,1.0
3,4,p,x,y,t,p,f,c,n,e,...,o,p,k,s,u,3.0,1.0,1.0,1.0,1.0
4,5,e,x,s,f,n,f,w,b,t,...,o,e,n,a,g,4.0,3.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819,5820,e,k,s,f,n,a,c,b,e,...,o,p,b,c,l,1.0,10.0,2.0,7.0,8.0
5820,5821,e,x,s,f,n,a,c,b,e,...,o,p,b,v,l,1.0,10.0,1.0,7.0,8.0
5821,5822,e,f,s,f,n,a,c,b,e,...,o,p,b,c,l,1.0,1.0,2.0,7.0,8.0
5822,5823,p,k,y,f,y,f,c,n,t,...,o,e,w,v,l,1.0,9.0,1.0,1.0,1.0


In [95]:
df['gill-size'].isna().sum()

np.int64(121)

In [96]:
df.dropna(subset=['label'], inplace=True)
df.shape

(5764, 24)

In [97]:
df.drop(columns=['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'],
            inplace=True)
df.shape

(5764, 12)
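
Step 4 also asks whether any values are missing, which `df.shape` alone does not show. A minimal sketch of that inspection follows, using a small made-up frame (`toy` and its values are assumptions for illustration, not the mushroom data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the mushroom data (values are made up).
toy = pd.DataFrame({
    "cap-shape": ["x", "b", None, "x"],
    "odor": ["p", "a", "l", None],
    "cap-color-rate": [1.0, np.nan, 3.0, 4.0],
})

n_rows, n_cols = toy.shape                   # (rows, variables)
missing_per_col = toy.isna().sum()           # missing count per column
has_missing = bool(toy.isna().any().any())   # True if any cell is missing

print(n_rows, n_cols)
print(missing_per_col)
print(has_missing)
```

The same three expressions can be applied to `df` directly after the drop in the cell above.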

In [98]:
number_type = ['float', 'int']
numeric_cols = df.select_dtypes(include=number_type).columns
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.fillna(x.mean()), axis=0)

categorical_cols = df.select_dtypes(exclude=number_type).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]), axis=0)

df['label'] = df['label'].map({'e': 1.0, 'p': 0.0})
class_counts = df['label'].value_counts()
# class_counts.to_dict()
print(tuple(class_counts.to_list()))

(3660, 2104)
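
To see what the mean/mode imputation in the cell above does, here is a small sketch on a made-up frame (`toy` and its values are hypothetical). It uses the vectorised `fillna` form rather than `apply` with a lambda; both are expected to give the same result here:

```python
import numpy as np
import pandas as pd

# Made-up data: one nominal and one numeric column, each with a gap.
toy = pd.DataFrame({
    "odor": ["p", "a", "p", None],
    "cap-color-rate": [1.0, np.nan, 3.0, 5.0],
})

numeric_cols = toy.select_dtypes(include="number").columns
nominal_cols = toy.select_dtypes(exclude="number").columns

# The mean of the observed values, (1 + 3 + 5) / 3 = 3.0, replaces the NaN.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# mode() can return several rows when there are ties; iloc[0] takes the first one.
toy[nominal_cols] = toy[nominal_cols].fillna(toy[nominal_cols].mode().iloc[0])

print(toy)
```

Note that the imputation statistics here are computed on the whole frame, as in the assignment steps, before any train/test split.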


In [46]:
df_encoded = pd.get_dummies(df, drop_first=True)
X = df_encoded.drop(columns=['label'])
y = df_encoded['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2020
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print((X_train.shape, X_test.shape))

Training set size: 4611 samples
Testing set size: 1153 samples
((4611, 42), (1153, 42))
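
The effect of `drop_first=True` and `stratify=y` is easier to see on a tiny example. The sketch below uses a made-up 20-row frame (`toy`, its labels, and its odor values are assumptions): one dummy column per category is dropped, and the 75/25 class ratio is preserved in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 15 edible (1) and 5 poisonous (0) samples, one nominal feature.
toy = pd.DataFrame({
    "label": [1] * 15 + [0] * 5,
    "odor": ["a", "l", "n"] * 5 + ["p", "f", "p", "f", "p"],
})

# Five odor categories (a, f, l, n, p) become four dummy columns; "a" is the dropped baseline.
encoded = pd.get_dummies(toy, columns=["odor"], drop_first=True)
X = encoded.drop(columns=["label"])
y = encoded["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2020
)

print(list(X.columns))
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of class 1 in each split
```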


In [48]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3],
    'min_samples_leaf': [2, 5],
    'n_estimators': [100],
    'random_state': [2020]  # Single value: the same fixed seed that is set on the estimator below
}

# Initialize the Random Forest model
rf = RandomForestClassifier(random_state=2020)

# Perform GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    verbose=1,  # Optional: shows progress of the grid search
    n_jobs=-1   # Use all processors for faster computation
)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Best parameters and the corresponding score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best Parameters: {'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 5, 'n_estimators': 100, 'random_state': 2020}
Best Cross-Validation Score: 0.9718048991428967
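
The "8 candidates" in the log come from the 2 x 2 x 2 combinations of the three parameters that actually vary; single-valued entries such as `n_estimators` and `random_state` do not add candidates and can equally be set on the estimator. A sketch of how to inspect every candidate through `cv_results_` follows. It runs on synthetic data from `make_classification` and uses `n_estimators=20` only to keep it quick; both are assumptions for illustration, not the assignment's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=2020)

# Only the parameters that vary are listed in the grid.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3],
    "min_samples_leaf": [2, 5],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=2020),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate, best rank first.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Looking at the full table, rather than only `best_params_`, shows how close the candidates are to one another.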


In [50]:
best_params.values()

dict_values(['gini', 3, 5, 100, 2020])

In [None]:
best_rf_model = grid_search.best_estimator_  # Retrieve the best model
y_pred = best_rf_model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

class_report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(class_report)

macro_f1 = f1_score(y_test, y_pred, average='macro')

macro_f1_rounded = round(macro_f1, 2)
print(f"\nMacro F1 Score (2 decimal places): {macro_f1_rounded}")
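
Macro F1 is the unweighted mean of the per-class F1 scores, and Python's built-in `round` resolves exact ties to the nearest even digit rather than always rounding up. The sketch below checks both points on made-up labels (`y_true`, `y_pred`, and the helper `round_half_up` are hypothetical); the half-up helper is only relevant if the intended rounding rule sends exact halves upward.

```python
from decimal import Decimal, ROUND_HALF_UP
from sklearn.metrics import f1_score

# Made-up labels: 4 samples of class 1, 6 samples of class 0.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one score per class, ordered [0, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of the scores above

def round_half_up(value):
    """Round to 2 decimal places, sending exact halves upward."""
    return Decimal(str(float(value))).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(per_class_f1, macro_f1)
print(round(macro_f1, 2), round_half_up(macro_f1))

# The two rules only differ on exact ties such as 0.125.
print(round(0.125, 2), round_half_up(0.125))
```

For values that are not exact ties, such as the macro F1 here, both approaches agree.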

In [99]:
class MushroomClassifier:
    def __init__(self, data_path): # DO NOT modify this line
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def Q1(self): # DO NOT modify this line
        """
            1. (From step 1) Before doing the data prep., how many "na" are there in "gill-size" variables?
        """
        # remove pass and replace with your code
        return self.df['gill-size'].isna().sum()


    def Q2(self): # DO NOT modify this line
        """
            2. (From step 2-4) How many rows of data, how many variables?
            - Drop rows where the target (label) variable is missing.
            - Drop the following variables:
            'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'
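            - Examine the number of rows, the number of variables (columns), and whether any values are missing.
        """
        # remove pass and replace with your code
        # Reload the raw data so that repeated calls start from the same state
        # (Q3 to Q6 all run Q2 again, and dropping the columns twice would fail).
        self.df = pd.read_csv(self.data_path)
        self.df.dropna(subset=['label'], inplace=True)
        self.df.drop(columns=['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'],
            inplace=True)
        # Return the shape of this instance's frame, not the notebook-level `df`.
        return self.df.shape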


    def Q3(self): # DO NOT modify this line
        """
            3. (From step 5-6) What are the class counts, class0:class1?
            - Fill missing values with the mean for numeric variables and the mode for nominal variables.
            - Convert the label variable, e (edible) to 1 and p (poisonous) to 0, and report the class counts as class0:class1.
            - Note: You need to reproduce the process (code) from Q2 to obtain the correct result.
        """
        self.Q2()
        number_type = ['float', 'int']
        numeric_cols = self.df.select_dtypes(include=number_type).columns
        self.df[numeric_cols] = self.df[numeric_cols].apply(lambda x: x.fillna(x.mean()), axis=0)

        categorical_cols = self.df.select_dtypes(exclude=number_type).columns
        self.df[categorical_cols] = self.df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]), axis=0)

        self.df['label'] = self.df['label'].map({'e': 1.0, 'p': 0.0})
        class_counts = self.df['label'].value_counts()
        return tuple(class_counts.to_list())


    def Q4(self): # DO NOT modify this line
        """
            4. (From step 7-8) What are the shapes of the training and testing sets?
            - Convert the nominal variables to numeric using dummy coding with drop_first=True.
            - Split train/test with 20% test, stratify, and seed = 2020.
            - Note: You need to reproduce the process (code) from Q2, Q3 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.Q3()
        df_encoded = pd.get_dummies(self.df, drop_first=True)
        X = df_encoded.drop(columns=['label'])
        y = df_encoded['label']
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=2020
        )
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test
        return (X_train.shape, X_test.shape)


    def Q5(self):
        """
            5. (From step 9) Best params after doing random forest grid search.
            Create a Random Forest with GridSearch on training data with 5 CV.
            - 'criterion':['gini','entropy']
            - 'max_depth': [2,3]
            - 'min_samples_leaf':[2,5]
            - 'n_estimators':[100]
            - 'random_state': 2020
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.Q4()
        param_grid = {
            'criterion': ['gini', 'entropy'],
            'max_depth': [2, 3],
            'min_samples_leaf': [2, 5],
            'n_estimators': [100],
            'random_state': [2020]
        }

        rf = RandomForestClassifier(random_state=2020)

        grid_search = GridSearchCV(
            estimator=rf,
            param_grid=param_grid,
            scoring='accuracy',
            cv=5,
            verbose=1,
            n_jobs=-1
        )

        grid_search.fit(self.X_train, self.y_train)

        self.grid_search = grid_search

        return grid_search.best_params_.values()


    def Q6(self):
        """
            6. (From step 10) What is the value of macro f1 (2 digits)?
            Predict the testing data set with confusion_matrix and classification_report,
            using scientific rounding (less than 0.5 is dropped, more than 0.5 is rounded up)
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4, Q5 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.Q5()
        best_rf_model = self.grid_search.best_estimator_  # Retrieve the best model
        y_pred = best_rf_model.predict(self.X_test)

        macro_f1 = f1_score(self.y_test, y_pred, average='macro')

        macro_f1_rounded = round(macro_f1, 2)
        return macro_f1_rounded


        


Run the code below to check that your code works.

In [100]:
hw = MushroomClassifier('mushroom2020_dataset.csv')

print(hw.Q1())
# print(hw.Q2())
print(hw.Q3())
# print(hw.Q4())
# print(hw.Q5())
# print(hw.Q6())

121
(3660, 2104)
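
The tuple above follows the order of `value_counts()`, which sorts by descending frequency, so its first element is the larger class and not necessarily class 0. If the answer is expected strictly in class0:class1 order, one option is to reindex the counts by class label. A minimal sketch on made-up labels (`labels` is hypothetical):

```python
import pandas as pd

# Made-up labels: four edible (1) and two poisonous (0).
labels = pd.Series(["e", "p", "e", "e", "p", "e"]).map({"e": 1, "p": 0})

# Default order: most frequent class first.
by_frequency = tuple(labels.value_counts().to_list())

# Explicit order: class 0 first, then class 1; a class with no rows counts as 0.
class0_class1 = tuple(labels.value_counts().reindex([0, 1], fill_value=0).to_list())

print(by_frequency)   # (4, 2)
print(class0_class1)  # (2, 4)
```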
