#### 1. Business Understanding (Goal Setting):
a) Can we accurately predict whether a mushroom is edible or poisonous based on its physical characteristics?

b) Which features are the most important in determining the edibility of a mushroom?

c) Can we create simple, interpretable rules for identifying poisonous mushrooms, similar to those mentioned in the dataset description?

d) How does our model's performance compare to the previously reported results (e.g., STAGGER's 95% accuracy)?

#### 2. Data Understanding (Data Description):
Let's load and examine the dataset:

In [1]:
import pandas as pd

# List of attribute names from the dataset description
attribute_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
    'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
    'stalk-surface-below-ring', 'stalk-color-above-ring',
    'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
    'ring-type', 'spore-print-color', 'population', 'habitat'
]

# Load the data, specifying no header and assigning column names
df = pd.read_csv('agaricus-lepiota.csv', header=None, names=attribute_names)

print("Dataset Shape:", df.shape)
print("\nFeature Names:")
print(df.columns.tolist())
print("\nSample Data:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nSummary Statistics:")
print(df.describe(include='all'))
print("\nClass Distribution:")
print(df['class'].value_counts(normalize=True))
print("\nMissing Values:")
print(df.isnull().sum())
print("\nUnique Values per Column:")
for column in df.columns:
    print(f"{column}: {df[column].nunique()}")

Dataset Shape: (8124, 23)

Feature Names:
['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']

Sample Data:
  class cap-shape cap-surface cap-color bruises odor gill-attachment  \
0     p         x           s         n       t    p               f   
1     e         x           s         y       t    a               f   
2     e         b           s         w       t    l               f   
3     p         x           y         w       t    p               f   
4     e         x           s         g       f    n               f   

  gill-spacing gill-size gill-color  ... stalk-surface-below-ring  \
0            c         n          k  ...                        s   
1  

Based on this output, we can check several things:

a) The dataset shape confirms 8124 instances and 23 columns (22 features plus the target variable 'class').

b) All features are categorical, as stated in the dataset description.

c) The class distribution should be close to 51.8% edible and 48.2% poisonous.

d) We should check whether there are any missing values, especially in the 'stalk-root' feature (attribute #11 in the description). The dataset description marks missing values with '?', so df.isnull().sum() does not count them.
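
Because a '?' placeholder is an ordinary string to pandas, a null check alone can miss it. The following is a minimal sketch of counting the placeholder per column. The `sample` frame and the helper `count_placeholder` are hypothetical stand-ins for illustration, not part of the original notebook, and the toy values are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only
sample = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "c", "?"],
    "odor": ["p", "a", "l", "p"],
})

def count_placeholder(frame, placeholder="?"):
    """Return, per column, how many cells equal the placeholder string."""
    return (frame == placeholder).sum()

placeholder_counts = count_placeholder(sample)
print(placeholder_counts)      # per-column count of '?' cells
print(sample.isnull().sum())   # null check, which does not see '?'
```

On the real file, the same helper could be applied to df. Alternatively, passing na_values='?' to pd.read_csv should convert the placeholder to NaN at load time, so that df.isnull().sum() reports it directly.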

#### 3. Data Preparation (Data Preprocessing):
We need to prepare the data for analysis:

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Separate features and target
X = df.drop('class', axis=1)
y = df['class']

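# Encode each categorical feature as integer codes.
# The encoder is refit on every column, so per-column mappings are not kept.
le = LabelEncoder()
X_encoded = X.apply(le.fit_transform)

# Encode the target; LabelEncoder sorts labels, so 'e' -> 0 and 'p' -> 1
y_encoded = le.fit_transform(y)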

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (6499, 22)
Testing set shape: (1625, 22)
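
The cell above reuses a single LabelEncoder across all columns, so after encoding there is no record of which integer stands for which letter in a given feature. If those mappings are needed later, for example to read interpretable rules back in terms of the original codes, one option is to build them with pandas categoricals. This is a sketch under assumptions: `toy` and `encode_with_mappings` are hypothetical names introduced here, and the toy values are illustrative. Categories are sorted alphabetically, which should give the same integer codes as LabelEncoder for string columns.

```python
import pandas as pd

# Toy stand-in for a few mushroom columns; the values are illustrative only
toy = pd.DataFrame({
    "odor": ["p", "a", "l", "p", "n"],
    "bruises": ["t", "t", "t", "t", "f"],
})

def encode_with_mappings(frame):
    """Integer-encode every column and keep each column's code -> label mapping."""
    encoded = pd.DataFrame(index=frame.index)
    mappings = {}
    for column in frame.columns:
        categorical = frame[column].astype("category")
        encoded[column] = categorical.cat.codes
        mappings[column] = dict(enumerate(categorical.cat.categories))
    return encoded, mappings

toy_encoded, toy_mappings = encode_with_mappings(toy)
print(toy_encoded)
print(toy_mappings["odor"])
```

With the mapping dictionary in hand, a split on an encoded feature can be translated back to the original category letter. Integer codes still impose an arbitrary order on nominal features, so this helps readability rather than changing how a model treats the values.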
