<a href="https://colab.research.google.com/github/laurenzalt/project_mushroom/blob/main/PROJECT_Mushroom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mushroom Dataset

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The data contains only two class labels (edible and poisonous), so the last group is counted as poisonous.



Attribute information (classes: edible=e, poisonous=p):

- cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
- cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
- cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
- bruises: bruises=t, no=f
- odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
- gill-attachment: attached=a, descending=d, free=f, notched=n
- gill-spacing: close=c, crowded=w, distant=d
- gill-size: broad=b, narrow=n
- gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
- stalk-shape: enlarging=e, tapering=t
- stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
- stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
- stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
- stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
- stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
- veil-type: partial=p, universal=u
- veil-color: brown=n, orange=o, white=w, yellow=y
- ring-number: none=n, one=o, two=t
- ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
- spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
- population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
- habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d


In [None]:
!wget https://www.spataru.at/students/course_files/stdm/mushrooms.csv

The command "wget" is either misspelled or
could not be found.
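
The error output above indicates that `wget` was not available in the shell where the notebook ran. One portable alternative is to download the file with Python's standard library. The following is a minimal sketch, not part of the original notebook; the helper name `download_if_missing` and the default file name are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://www.spataru.at/students/course_files/stdm/mushrooms.csv"

def download_if_missing(url: str, dest: str = "mushrooms.csv") -> Path:
    """Download url to dest unless the file already exists; return the path."""
    path = Path(dest)
    if not path.exists():
        # Standard-library download, no external command-line tool required
        urlretrieve(url, str(path))
    return path

# Usage (performs a network request the first time it is called):
# download_if_missing(DATA_URL)
```

Because the helper skips the download when the file is already present, re-running the cell does not fetch the data again.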


# Libs

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import plotly.express as px

# Data Preprocessing

In [None]:
df = pd.read_csv("mushrooms.csv")
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


# Data Cleaning

In [None]:
duplicate_rows = df.duplicated().sum()

if duplicate_rows > 0:
    df = df.drop_duplicates()

placeholders = ['?', 'NA', None, '']
missing_values = df.isin(placeholders).sum()

print(f"Duplicate rows: {duplicate_rows}")
print(f"Missing values:\n{missing_values}\n")

Duplicate rows: 0
Missing values:
class                          0
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64



In [None]:
# Replace the '?' placeholder with np.nan so it can be imputed
df.replace('?', np.nan, inplace=True)

# Fill the missing stalk-root values with the most frequent category
impute = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
df['stalk-root'] = impute.fit_transform(df[['stalk-root']]).ravel()

# The letter codes stay in df so the analysis plots can map them to readable
# names; integer encoding is only needed for the models.
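
When the letter codes are turned into integers for modeling, a single shared `LabelEncoder` only remembers the last column it was fitted on, so the codes cannot be translated back to letters afterwards. The sketch below keeps one fitted encoder per column instead. It is an illustration under assumptions: the helper `encode_columns` and the toy frame are hypothetical and stand in for the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_columns(frame: pd.DataFrame):
    """Return an integer-encoded copy of frame plus one fitted encoder per column."""
    encoders = {}
    encoded = frame.copy()
    for column in frame.columns:
        encoder = LabelEncoder()
        encoded[column] = encoder.fit_transform(frame[column])
        encoders[column] = encoder
    return encoded, encoders

# Toy stand-in for the mushroom data (letter codes as in the attribute list)
toy = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})
toy_encoded, toy_encoders = encode_columns(toy)

# Codes follow the sorted order of the letters and can be mapped back
decoded_odor = toy_encoders["odor"].inverse_transform(toy_encoded["odor"])
```

Working on a copy leaves the original letter-coded frame available for plots and tables.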

# Data Analysis

### 1. Proportion of Edible vs. Poisonous Mushrooms

In [None]:
class_counts = df['class'].value_counts()
class_counts.index = class_counts.index.map({'e': 'edible', 'p': 'poisonous'})
fig = px.bar(class_counts, x=class_counts.index, y=class_counts.values, title="Edible vs. poisonous mushrooms")
fig.show()

### 2. Most Common Cap Colors

In [None]:
cap_color_counts = df['cap-color'].value_counts()
cap_color_counts.index = cap_color_counts.index.map({'n': 'brown', 'g': 'gray', 'e': 'red', 'y': 'yellow', 'w': 'white', 'b': 'buff', 'p': 'pink', 'c': 'cinnamon', 'u': 'purple', 'r': 'green'})
fig = px.bar(x=cap_color_counts.index, y=cap_color_counts.values, labels={'x': 'Cap Color', 'y': 'Count'}, title='Distribution of Cap Colors')
fig.show()


### 3. Distribution of Mushrooms Across Different Habitats

In [None]:
habitat_counts = df['habitat'].value_counts()
habitat_counts.index = habitat_counts.index.map({'d': 'woods', 'g': 'grasses', 'p': 'paths', 'l': 'leaves', 'u': 'urban', 'm': 'meadows', 'w': 'waste'})
fig = px.bar(x=habitat_counts.index, y=habitat_counts.values, labels={'x': 'Habitat', 'y': 'Count'}, title='Mushroom Distribution Across Habitats')
fig.show()


### 4. Common Odors Among Mushrooms

In [None]:
odor_counts = df['odor'].value_counts()
odor_counts.index = odor_counts.index.map({'n': 'none', 'f': 'foul', 's': 'spicy', 'y': 'fishy', 'l': 'anise', 'a': 'almond', 'p': 'pungent', 'c': 'creosote', 'm': 'musty'})
fig = px.bar(x=odor_counts.index, y=odor_counts.values, labels={'x': 'Odor', 'y': 'Count'}, title='Distribution of Mushroom Odors')
fig.show()


### 5. Distribution of Cap Shapes

In [None]:
cap_shape_counts = df['cap-shape'].value_counts()
cap_shape_counts.index = cap_shape_counts.index.map({'x': 'convex', 'f': 'flat', 'k': 'knobbed', 'b': 'bell', 's': 'sunken', 'c': 'conical'})
fig = px.pie(names=cap_shape_counts.index, values=cap_shape_counts.values, title='Distribution of Cap Shapes')
fig.show()


### 6. Relationship Between Cap Color and Edibility

In [None]:
cap_class = pd.crosstab(df['cap-color'], df['class'])
cap_class.index = cap_class.index.map({'n': 'brown', 'g': 'gray', 'e': 'red', 'y': 'yellow', 'w': 'white', 'b': 'buff', 'p': 'pink', 'c': 'cinnamon', 'u': 'purple', 'r': 'green'})
cap_class.columns = cap_class.columns.map({'e': 'edible', 'p': 'poisonous'})
fig = px.bar(cap_class, barmode='stack', title='Cap Color vs Mushroom Edibility')
fig.show()


### 7. Relationship Between Odor and Edibility

In [None]:
odor_class = pd.crosstab(df['odor'], df['class'])
odor_class.index = odor_class.index.map({'n': 'none', 'f': 'foul', 's': 'spicy', 'y': 'fishy', 'l': 'anise', 'a': 'almond', 'p': 'pungent', 'c': 'creosote', 'm': 'musty'})
odor_class.columns = odor_class.columns.map({'e': 'edible', 'p': 'poisonous'})
fig = px.bar(odor_class, barmode='stack', title='Odor vs Mushroom Edibility')
fig.show()


### 8. Impact of Gill Color on Edibility

In [None]:
gill_color_class = pd.crosstab(df['gill-color'], df['class'])
gill_color_class.index = gill_color_class.index.map({'k': 'black', 'n': 'brown', 'g': 'gray', 'p': 'pink', 'w': 'white', 'h': 'chocolate', 'u': 'purple', 'e': 'red', 'b': 'buff', 'y': 'yellow', 'o': 'orange', 'r': 'green'})
gill_color_class.columns = gill_color_class.columns.map({'e': 'edible', 'p': 'poisonous'})
fig = px.bar(gill_color_class, barmode='group', title='Gill Color Impact on Mushroom Edibility')
fig.show()


### 9. Correlation Between Stalk Shape and Habitat

In [None]:
stalk_habitat = pd.crosstab(df['stalk-shape'], df['habitat'])
stalk_habitat.index = stalk_habitat.index.map({'e': 'enlarging', 't': 'tapering'})
stalk_habitat.columns = stalk_habitat.columns.map({'d': 'woods', 'g': 'grasses', 'p': 'paths', 'l': 'leaves', 'u': 'urban', 'm': 'meadows', 'w': 'waste'})
fig = px.imshow(stalk_habitat, title='Stalk Shape vs Habitat')
fig.show()


### 10. Veil Color Variation with Ring Types

In [None]:
veil_ring = pd.crosstab(df['veil-color'], df['ring-type'])
veil_ring.index = veil_ring.index.map({'n': 'brown', 'o': 'orange', 'w': 'white', 'y': 'yellow'})
veil_ring.columns = veil_ring.columns.map({'e': 'evanescent', 'f': 'flaring', 'l': 'large', 'n': 'none', 'p': 'pendant', 's': 'sheathing', 'z': 'zone'})
fig = px.imshow(veil_ring, title='Veil Color vs Ring Type')
fig.show()


## Analysis Summary of Mushroom Edibility

### Cap Color
- **Brown** and **yellow** cap colors are more commonly associated with edible mushrooms.
- **Buff** and **pink** cap colors tend to indicate poisonous mushrooms.

### Odor
- **Almond** and **anise** odors are exclusive to edible mushrooms, suggesting a strong correlation between these odors and non-toxicity.
- **Foul**, **creosote**, **pungent**, **spicy**, and **fishy** odors are predominantly linked to poisonous mushrooms.

### Gill Color
- **Orange** and **purple** gill colors are found only in edible mushrooms, making them reliable markers for edibility.
- **Buff** colored gills are exclusively observed in poisonous mushrooms, indicating a high risk of toxicity.
- Other gill colors present a mix of edible and poisonous characteristics, requiring further attributes for accurate edibility classification.

From this analysis, it's evident that specific features are highly indicative of a mushroom's class.
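
To put a number on how indicative a feature value is, the class share per value can be read from a row-normalized crosstab. The sketch below uses a small made-up frame, not the real dataset, so the values only illustrate the technique.

```python
import pandas as pd

# Toy stand-in: a few (odor, class) pairs using the dataset's letter codes
toy = pd.DataFrame({
    "odor":  ["a", "a", "n", "n", "n", "f", "f"],
    "class": ["e", "e", "e", "e", "p", "p", "p"],
})

# Row-normalized crosstab: each row sums to 1, giving the class share per odor
share = pd.crosstab(toy["odor"], toy["class"], normalize="index")

# Feature values that occur in only one class
exclusive = share[(share == 1.0).any(axis=1)].index.tolist()
```

Applied to the real columns, the same two lines would list the values that appear in a single class only, which is a direct check of the "exclusive" statements in the summary.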



# Data Modeling

Split the data into features and target variable

In [None]:
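# Label-encode each column separately (a no-op for columns that are already integer codes)
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
X = encoded.drop('class', axis=1)  # features
y = encoded['class']  # target variable: edible=0, poisonous=1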

Split the dataset into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
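
The split above shuffles the rows at random. If the class balance should be the same in the training and test parts, `train_test_split` accepts a `stratify` argument. The following sketch shows the effect on made-up data; it is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 samples, 30% of them in class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the 30/70 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
```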

Initialize the machine learning models

In [None]:
decision_tree_model = DecisionTreeClassifier(random_state=42)
gaussian_nb_model = GaussianNB()
knn_model = KNeighborsClassifier(n_neighbors=5)
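
Integer label codes impose an arbitrary order on nominal categories, which can matter for distance-based models such as K-Nearest Neighbors. A common alternative is one-hot encoding inside a pipeline. The sketch below is an assumption-laden illustration on made-up data and is not the approach used in this notebook.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in with two categorical features (letter codes as in the attribute list)
toy_X = pd.DataFrame({
    "odor":      ["a", "l", "n", "f", "p", "f", "a", "p"],
    "gill-size": ["b", "b", "b", "n", "n", "n", "b", "n"],
})
toy_y = [0, 0, 0, 1, 1, 1, 0, 1]  # 0 = edible, 1 = poisonous

# One-hot encode the categories, then classify by the 3 nearest neighbors
onehot_knn = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    KNeighborsClassifier(n_neighbors=3),
)
onehot_knn.fit(toy_X, toy_y)
pred = onehot_knn.predict(pd.DataFrame({"odor": ["f"], "gill-size": ["n"]}))
```

Putting the encoder in the pipeline means it is refitted on each training fold when the pipeline is passed to `cross_validate`.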

Train and evaluate the Decision Tree model using cross-validation

In [None]:
dt_scores = cross_validate(decision_tree_model, X_train, y_train, cv=5,
                           scoring=('accuracy', 'precision', 'recall', 'f1'),
                           return_train_score=False)

Train and evaluate the Gaussian Naive Bayes model using cross-validation

In [None]:
gnb_scores = cross_validate(gaussian_nb_model, X_train, y_train, cv=5,
                            scoring=('accuracy', 'precision', 'recall', 'f1'),
                            return_train_score=False)

Train and evaluate the K-Nearest Neighbors model using cross-validation

In [None]:
knn_scores = cross_validate(knn_model, X_train, y_train, cv=5,
                            scoring=('accuracy', 'precision', 'recall', 'f1'),
                            return_train_score=False)

In [None]:
dt_scores, gnb_scores, knn_scores
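
The `cross_validate` results are dictionaries of per-fold arrays, and the held-out test split has not been used at this point. The sketch below shows one way to average the fold scores and then score a fitted model on the test part. It runs on synthetic data from `make_classification`, so the numbers it produces say nothing about the mushroom dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and target
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
scores = cross_validate(model, X_tr, y_tr, cv=5,
                        scoring=("accuracy", "precision", "recall", "f1"))

# Average the per-fold validation metrics (keys look like 'test_accuracy')
summary = pd.Series({k: v.mean() for k, v in scores.items() if k.startswith("test_")})

# cross_validate fits clones of the estimator, so fit the model itself
# on the full training part before scoring the held-out test part
model.fit(X_tr, y_tr)
test_pred = model.predict(X_te)
test_accuracy = accuracy_score(y_te, test_pred)
test_f1 = f1_score(y_te, test_pred)
```

The same pattern applies to the Gaussian Naive Bayes and K-Nearest Neighbors models, which allows a side-by-side table of mean scores.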