1) Fit Random Forest models using each possible input on its own to predict edibility. Evaluate the quality of fit by using the predict function to calculate the predicted class for each mushroom (edible or poisonous). Which input fits best? (i.e. which classifies the most mushrooms correctly?) (0.5 marks)

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load data
data = pd.read_csv(r"C:\Users\hughd\Desktop\machine learning\mushrooms.csv")

# Split the data into input and output variables
X = data[['CapShape', 'CapSurface', 'CapColor', 'Odor', 'Height']].copy()
y = data['Edible']

# Encode the categorical variables as numerical values
le = LabelEncoder()
for col in X.columns:
    X[col] = le.fit_transform(X[col])

# Fit Random Forest models using each input variable on its own
best_acc = 0
best_feature = ''
for feature in X.columns:
    rf = RandomForestClassifier()
    rf.fit(X[[feature]], y)
    # Accuracy on the same data the model was fitted to (quality of fit, not held-out accuracy)
    acc = (rf.predict(X[[feature]]) == y).mean()
    if acc > best_acc:
        best_acc = acc
        best_feature = feature
    print(f"Accuracy of {feature} model: {acc:.3f}")

# Print the best input variable
print(f"\nBest feature: {best_feature} (accuracy: {best_acc:.3f})")


Accuracy of CapShape model: 0.564
Accuracy of CapSurface model: 0.581
Accuracy of CapColor model: 0.595
Accuracy of Odor model: 0.985
Accuracy of Height model: 0.518

Best feature: Odor (accuracy: 0.985)


In the Random Forest model, Odor is by far the most accurate single feature, classifying 98.5% of mushrooms correctly. No other feature on its own exceeds 59.5%.

2) Using cross-validation, perform a model selection to determine which features are useful for making predictions using a Random Forest. As above, use the number of mushrooms correctly classified as the criterion for deciding which model is best. You might try to find a way to loop over all 32 possible models. Or select features ‘greedily’, by picking one at a time to add to the model. Present your results in the most convincing way you can. (2 marks)

In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import itertools

# Load the data
data = pd.read_csv(r"C:\Users\hughd\Desktop\machine learning\mushrooms.csv")

# Split the data into input and output variables
X = data[['CapShape', 'CapSurface', 'CapColor', 'Odor', 'Height']].copy()
y = data['Edible']

# Encode the categorical variables as numerical values
le = LabelEncoder()
for col in X.columns:
    X[col] = le.fit_transform(X[col])


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit Random Forest models using each combination of input variables
best_acc = [0]*5
best_feature_sets = [[] for i in range(5)]
for i in range(1, 6):
    for feature_subset in itertools.combinations(X.columns, i):
        rf = RandomForestClassifier()
        rf.fit(X_train[list(feature_subset)], y_train)
        acc = (rf.predict(X_test[list(feature_subset)]) == y_test).mean()
        if acc > best_acc[i-1]:
            best_acc[i-1] = acc
            best_feature_sets[i-1] = list(feature_subset)

# Print the best input variables for each subset size
for i in range(1, 6):
    print(f"Best {i}feature: {best_feature_sets[i-1]} (accuracy: {best_acc[i-1]:.3f})")

Best 1feature: ['Odor'] (accuracy: 0.984)
Best 2feature: ['CapColor', 'Odor'] (accuracy: 0.987)
Best 3feature: ['CapShape', 'CapColor', 'Odor'] (accuracy: 0.991)
Best 4feature: ['CapShape', 'CapSurface', 'CapColor', 'Odor'] (accuracy: 0.991)
Best 5feature: ['CapShape', 'CapSurface', 'CapColor', 'Odor', 'Height'] (accuracy: 0.990)


Here I searched over every feature combination (the code enumerates all 31 non-empty subsets with itertools.combinations rather than adding features greedily) and scored each one on a single 20% hold-out set. Odor on its own is still very accurate (0.984). As more features are added the accuracy rises slowly, reaching 0.991 with CapShape, CapColor and Odor, and it does not improve beyond that. After running this a few times, Height seems to be the least important feature, slightly decreasing the accuracy of the five-feature set (0.990). Odor appears to matter less as more features are added, but as a standalone feature it is clearly the best. The three cap attributes become very accurate when taken together with Odor, and all three appear in the best subsets of size four and five.
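The question asks for cross-validation, whereas the cell above scores each subset on one random hold-out split. A minimal sketch of the same exhaustive search using `cross_val_score` follows. It runs on a small synthetic stand-in table (the column names mirror the mushroom data, but the values are made up for illustration), and the helper name `cv_subset_search` is hypothetical.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def cv_subset_search(X, y, cv=5):
    """Return {feature_subset: mean cross-validated accuracy} for every non-empty subset."""
    results = {}
    for k in range(1, len(X.columns) + 1):
        for subset in itertools.combinations(X.columns, k):
            rf = RandomForestClassifier(n_estimators=20, random_state=0)
            scores = cross_val_score(rf, X[list(subset)], y, cv=cv, scoring="accuracy")
            results[subset] = scores.mean()
    return results


# Synthetic stand-in data (NOT the real mushrooms.csv): integer-coded features,
# where only 'Odor' carries information about the label.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "CapShape": rng.integers(0, 4, n),
    "CapColor": rng.integers(0, 5, n),
    "Odor": rng.integers(0, 4, n),
})
y_demo = (X_demo["Odor"] >= 2).astype(int).to_numpy()
flip = rng.random(n) < 0.05  # 5% label noise
y_demo = np.where(flip, 1 - y_demo, y_demo)

results = cv_subset_search(X_demo, y_demo)
best_subset = max(results, key=results.get)
for subset, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{subset}: {acc:.3f}")
```

Averaging over folds makes the ranking of subsets less dependent on which rows happen to land in the test set, which matters here because the top subsets differ by less than one percentage point.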

3) Would you use this classifier if you were foraging for mushrooms? Discuss with reference to factors that you identified as important and the probability of poisoning yourself. (0.5 marks)

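I would rely primarily on odor alone, because the accuracy changes only slightly as more features are added. Judging by odor is much simpler for a human forager to learn, and it gives almost the same accuracy as using all of the features (0.984 versus 0.990 on the hold-out set). Even so, an accuracy of about 98.5% means that roughly 15 in every 1000 mushrooms are misclassified, and some of those may be poisonous mushrooms labelled as edible, so I would not rely on the classifier without an additional check.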
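To put the accuracy figure in terms of risk, a rough calculation helps. The sketch below assumes, as a simplification, that every mushroom is misclassified independently with probability 1 - accuracy. It does not separate poisonous-labelled-edible errors from the harmless kind, so under that assumption it gives an upper bound on the chance of at least one dangerous mistake. The function name is hypothetical.

```python
def prob_at_least_one_error(accuracy, n_mushrooms):
    """Probability of one or more misclassifications among n independent mushrooms."""
    return 1.0 - accuracy ** n_mushrooms


for n in (1, 10, 50):
    p = prob_at_least_one_error(0.985, n)
    print(f"{n:>2} mushrooms: {p:.3f}")
```

Under these assumptions, ten mushrooms already carry roughly a 14% chance of at least one misclassification.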

4) Fit an ANN module using each possible input on its own to predict edibility. Evaluate the quality of fit by using the predict function to calculate the predicted class for each mushroom (edible or poisonous). Which input fits best? (i.e. which classifies the most mushrooms correctly?) (0.5 marks)

In [7]:
from sklearn.neural_network import MLPClassifier

# Separate the target variable and input features
X = data.drop(columns=["Edible"])
y = data["Edible"]

# Train an ANN for each column separately and evaluate its performance
for col in X.columns:
    print(f"Training ANN for column: {col}")
   
    # Create a new column with one-hot encoded categorical data
    if X[col].dtype == "object":
        X_one_hot = pd.get_dummies(X[col], prefix=col)
        X_col = X_one_hot.values
    else:
        X_col = X[[col]].values
   
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_col, y, test_size=0.2, random_state=42)
   
    # Create MLP classifier with one hidden layer of 5 neurons
    model = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500)
   
    # Train model
    model.fit(X_train, y_train)
   
    # Evaluate model on test data
    score = model.score(X_test, y_test)
    print(f"Accuracy for column {col}: {score}")


Training ANN for column: CapShape
Accuracy for column CapShape: 0.5612307692307692
Training ANN for column: CapSurface
Accuracy for column CapSurface: 0.5784615384615385
Training ANN for column: CapColor
Accuracy for column CapColor: 0.5969230769230769
Training ANN for column: Odor
Accuracy for column Odor: 0.9846153846153847
Training ANN for column: Height
Accuracy for column Height: 0.5187692307692308


The results for this method are very similar to those of the Random Forest on the same question. Again, Odor was by far the best single feature, with an accuracy of about 0.985 compared with 0.52 to 0.60 for every other feature on its own.
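The cell above one-hot encodes each categorical column before passing it to the network. As a small illustration of what that step produces, the sketch below encodes a made-up odor column (the category letters are invented for the example, not taken from the dataset).

```python
import pandas as pd

odor = pd.Series(["a", "n", "f", "n"], name="Odor")

# One binary indicator column per category; each row has exactly one 1.
one_hot = pd.get_dummies(odor, prefix="Odor").astype(int)
print(one_hot)
```

Unlike a single integer code, this representation does not impose an ordering on the categories, which is usually what a neural network needs for nominal data.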

5) Explore how the performance depends on the architecture of the ANN. Vary the number and the sizes of the hidden layers. For large networks you may want to increase the number of the stochastic gradient descent iterations. (1 mark)

In [10]:
# Set up parameters for the ANNs
num_iter = 1000  # Increase for large networks
hidden_layer_sizes = [(5,)]  # Vary number and sizes of hidden layers

# Train an ANN for each column separately and evaluate its performance
for col in X.columns:
    #print(f"Training ANNs for column: {col}")
   
    # Create a new column with one-hot encoded categorical data
    if X[col].dtype == "object":
        X_one_hot = pd.get_dummies(X[col], prefix=col)
        X_col = X_one_hot.values
    else:
        X_col = X[[col]].values
   
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_col, y, test_size=0.2)
   
    # Train and evaluate ANNs with different architectures
    for hidden_sizes in hidden_layer_sizes:
        #print(f"Training ANN with hidden layer sizes: {hidden_sizes}")
       
        # Create MLP classifier with specified hidden layer sizes
        model = MLPClassifier(hidden_layer_sizes=hidden_sizes, max_iter=num_iter)
       
        # Train model
        model.fit(X_train, y_train)
       
        # Evaluate model on test data
        score = model.score(X_test, y_test)
        print(f"Accuracy for column {col}, hidden layer sizes {hidden_sizes}: {score}")

Accuracy for column CapShape, hidden layer sizes (5,): 0.556923076923077
Accuracy for column CapSurface, hidden layer sizes (5,): 0.5476923076923077
Accuracy for column CapColor, hidden layer sizes (5,): 0.5723076923076923
Accuracy for column Odor, hidden layer sizes (5,): 0.984
Accuracy for column Height, hidden layer sizes (5,): 0.5144615384615384


After exploring different hidden-layer configurations, I found that some features are more heavily influenced by the architecture than others: the accuracy for cap colour varied more than that for cap shape, which did not change at all. With larger networks the training struggled to run with fewer than 500 iterations, and the accuracy plateaued at approximately 700 iterations.
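The cell above lists a single architecture, `(5,)`. A minimal sketch of how several architectures could be compared in one run is shown below. It uses a synthetic one-hot encoded feature rather than the real data, and the particular list of layer sizes is only an illustrative assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a one-hot encoded categorical feature (NOT the real data).
rng = np.random.default_rng(1)
codes = rng.integers(0, 4, 300)
X_demo = pd.get_dummies(pd.Series(codes.astype(str)), prefix="Odor").to_numpy(dtype=float)
y_demo = (codes >= 2).astype(int)

# Candidate architectures: one tuple entry per hidden layer.
architectures = [(2,), (5,), (20,), (5, 5), (20, 10)]

arch_scores = {}
for hidden in architectures:
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0)
    arch_scores[hidden] = cross_val_score(model, X_demo, y_demo, cv=3).mean()
    print(f"hidden layers {hidden}: mean CV accuracy {arch_scores[hidden]:.3f}")
```

Fixing `random_state` and averaging over folds helps separate the effect of the architecture from run-to-run noise in the split and the weight initialisation.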

6) Using cross-validation, perform a model selection to determine which features are useful for making predictions using the ANN. As above, use the number of mushrooms correctly classified as the criterion for deciding which model is best. You might try to find a way to loop over all 32 possible models. Or select features ‘greedily’, by picking one at a time to add to the model. Present your results in the most convincing way you can. (2 marks)

In [12]:
import numpy as np
from sklearn.metrics import accuracy_score


# Convert categorical data to numerical values
encoder = LabelEncoder()
for col in data.columns:
    data[col] = encoder.fit_transform(data[col])

# Separate the features and target
X = data.drop('Edible', axis=1)
y = data['Edible']

# Initialize an empty list to store the selected features
select_feat = []

# Loop over the features and select the one that gives the highest accuracy score
for i in range(len(X.columns)):
    best_feature, best_score = None, 0
    for feat in X.columns:
        if feat not in select_feat:
            # Combine the previously selected features with the current feature
            feats = select_feat + [feat]
            # Split the data into training and validation sets
            X_train, X_val, y_train, y_val = train_test_split(X[feats], y, test_size=0.3)
            # Train an ANN on the training set
            model = MLPClassifier(hidden_layer_sizes=(5,), max_iter=3000)
            model.fit(X_train, y_train)
            # Predict the classes of the validation set
            y_pred = model.predict(X_val)
            # Compute the accuracy score
            score = accuracy_score(y_val, y_pred)
            # Update the best feature and score if this one is better
            if score > best_score:
                best_feature, best_score = feat, score
    # Add the best feature to the selected features list
    select_feat.append(best_feature)
    # Print the current subset and its accuracy score
    print(f"Best {i+1}-feature subset: {select_feat} (accuracy: {best_score:.3f})")


Best 1-feature subset: ['Odor'] (accuracy: 0.600)
Best 2-feature subset: ['Odor', 'CapShape'] (accuracy: 0.986)
Best 3-feature subset: ['Odor', 'CapShape', 'CapSurface'] (accuracy: 0.988)
Best 4-feature subset: ['Odor', 'CapShape', 'CapSurface', 'Height'] (accuracy: 0.987)
Best 5-feature subset: ['Odor', 'CapShape', 'CapSurface', 'Height', 'CapColor'] (accuracy: 0.772)


Here I used a greedy method for selecting features and found that, as with the Random Forest, Odor is the first feature selected. However, the ANN scores it much lower on its own (0.600). This is probably an effect of label-encoding Odor as a single integer in this cell, since the one-hot encoded Odor model in question 4 reached about 0.985. As more features are added the accuracy rises, peaking at 0.988 with three features, and then falls at the final step. After running this a few times, cap colour seems to be the least useful feature, decreasing the accuracy significantly in the five-feature set (0.772). Odor stays the most important feature as more are added. The final accuracy is lower than that of the Random Forest model (0.990).
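The cell above draws a fresh random split for every candidate feature and label-encodes the inputs. A minimal sketch of greedy forward selection that instead uses cross-validation and one-hot encoding is given below. The data are synthetic, the helper name `greedy_forward_selection` is hypothetical, and `solver="lbfgs"` is an assumption chosen because it tends to converge quickly on small toy data, not the setting used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier


def greedy_forward_selection(X_cat, y, cv=3):
    """Add one categorical feature at a time, keeping the one with the best CV accuracy."""
    selected, history = [], []
    remaining = list(X_cat.columns)
    while remaining:
        scores = {}
        for feat in remaining:
            # One-hot encode only the columns in the candidate subset.
            subset = X_cat[selected + [feat]].astype(str)
            X_enc = pd.get_dummies(subset).to_numpy(dtype=float)
            model = MLPClassifier(hidden_layer_sizes=(5,), solver="lbfgs",
                                  max_iter=500, random_state=0)
            scores[feat] = cross_val_score(model, X_enc, y, cv=cv).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), scores[best]))
    return history


# Synthetic stand-in data (NOT the real mushrooms.csv); only 'Odor' determines the label.
rng = np.random.default_rng(2)
n = 300
X_demo = pd.DataFrame({
    "CapShape": rng.integers(0, 3, n),
    "Odor": rng.integers(0, 4, n),
    "Height": rng.integers(0, 2, n),
})
y_demo = (X_demo["Odor"] >= 2).astype(int).to_numpy()

history = greedy_forward_selection(X_demo, y_demo)
for subset, acc in history:
    print(f"{len(subset)}-feature subset: {subset} (CV accuracy: {acc:.3f})")
```

Because every candidate is scored on the same folds, differences between candidates reflect the features rather than the luck of the split.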

7) Compare the performance of Random Forest and ANN models. For example, which data types, do you think, the two ML models are most suited to describe. (0.5 marks)

I think the ANN results are likely to be more accurate, as an ANN, which loosely simulates neural pathways, is better suited to non-numerical data. Random Forest is great for numerical data, but I feel it is less accurate in this case. Overall, for this particular problem I would trust the ANN model more.
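One way to check this impression is to score both models on identical folds with identical encoding. The following is a minimal sketch on made-up data: the category letters and the "Edible"/"Poisonous" labels are invented for the example, and the pipeline layout is an assumption rather than the setup used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic categorical data (NOT the real mushrooms.csv); 'Odor' determines the label.
rng = np.random.default_rng(3)
n = 300
X_demo = pd.DataFrame({
    "CapColor": rng.choice(list("brwy"), n),
    "Odor": rng.choice(list("afn"), n),
})
y_demo = np.where(X_demo["Odor"] == "f", "Poisonous", "Edible")

# Both models see the same folds and the same one-hot encoding.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=0),
}

comparison = {}
for name, estimator in models.items():
    pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), estimator)
    comparison[name] = cross_val_score(pipe, X_demo, y_demo, cv=folds).mean()
    print(f"{name}: mean CV accuracy {comparison[name]:.3f}")
```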