# 🌟 **<span style="color:orange;">MUSHROOM DATASET OVERVIEW</span>** 🍄

---

## 📌 **<span style="color:yellow;">About the Dataset</span>**

The **Mushroom Dataset** is a classic dataset from the UCI Machine Learning Repository. It records **morphological characteristics** of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family.  
Each row describes one mushroom sample, and the **goal** is to predict whether it is **edible** or **poisonous** from its attributes.

---

## 🔍 **<span style="color:lightgreen;">Key Facts</span>**
- **Number of Instances:** `8,124`
- **Number of Features:** `22` (all categorical), plus the `class` target column
- **Target Variable:** `class` → `e` (edible) or `p` (poisonous)
- **Data Type:** Categorical (needs encoding before ML models)
- **Source:** [UCI Machine Learning Repository, via its Kaggle mirror](https://www.kaggle.com/datasets/uciml/mushroom-classification)

---

## 🧾 **<span style="color:lightblue;">Feature Description</span>**
1. **class** → Edible (`e`) or Poisonous (`p`)
2. **cap-shape** → bell, conical, convex, flat, knobbed, sunken
3. **cap-surface** → fibrous, grooves, scaly, smooth
4. **cap-color** → brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow
5. **bruises** → bruises or no bruises
6. **odor** → almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
7. **gill-attachment** → attached, descending, free, notched
8. **gill-spacing** → close, crowded, distant
9. **gill-size** → broad, narrow
10. **gill-color** → black, brown, buff, chocolate, gray, green, orange, pink, purple, red, white, yellow
11. **stalk-shape** → enlarging or tapering
12. **stalk-root** → bulbous, club, cup, equal, rhizomorphs, rooted, missing (stored as `?`)
13. **stalk-surface-above-ring** → fibrous, scaly, silky, smooth
14. **stalk-surface-below-ring** → fibrous, scaly, silky, smooth
15. **stalk-color-above-ring** → brown, buff, cinnamon, gray, orange, pink, red, white, yellow
16. **stalk-color-below-ring** → brown, buff, cinnamon, gray, orange, pink, red, white, yellow
17. **veil-type** → partial or universal
18. **veil-color** → brown, orange, white, yellow
19. **ring-number** → none, one, two
20. **ring-type** → cobwebby, evanescent, flaring, large, none, pendant, sheathing, zone
21. **spore-print-color** → black, brown, buff, chocolate, green, orange, purple, white, yellow
22. **population** → abundant, clustered, numerous, scattered, several, solitary
23. **habitat** → grasses, leaves, meadows, paths, urban, waste, woods

---

## 🎯 **<span style="color:pink;">Objective</span>**
The main objective is to **predict whether a mushroom is edible or poisonous** based on its characteristics.  
The task also shows how individual features (such as **odor**, **gill-size**, and **cap-color**) relate to edibility.

---

## 📊 **<span style="color:red;">Why This Dataset is Interesting</span>**
- All features are **categorical**, which makes preprocessing essential.
- The dataset is nearly **balanced**: 4,208 edible versus 3,916 poisonous samples (about 52% to 48%).
- Useful for testing **tree-based ensemble methods** such as AdaBoost, whose decision stumps can split directly on one-hot encoded categories.
- Real-world relevance: mushroom identification matters for health and safety.

---


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# ===============================
# 🍄 MUSHROOM DATASET FULL EDA
# ===============================

In [None]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML WORK/ADABOOST ALGORITHUM/mushrooms.csv')
df.head()

# ---------------------------------
# 1. BASIC INFO
# ---------------------------------

In [None]:
print("🔍 Dataset Info")
print(df.info())

In [None]:
print("\n❓ Missing Values")
print(df.isnull().sum())
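
`isnull()` only counts real `NaN` entries. In the UCI encoding, missing `stalk-root` values are stored as the character `?`, so they would not show up in the count above. The sketch below illustrates the difference on a small made-up frame; `toy` and its values are hypothetical, and it assumes the CSV keeps the `?` placeholder.

```python
import pandas as pd

# Toy stand-in for the mushroom frame; the '?' placeholder is an assumption
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?", "c"],
    "odor": ["n", "f", "n", "a", "l"],
})

nan_counts = toy.isnull().sum()           # counts real NaN values only
placeholder_counts = (toy == "?").sum()   # counts '?' entries per column

print(nan_counts)
print(placeholder_counts)
```

Running `(df == "?").sum()` on the real frame would show which columns, if any, use the placeholder.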

In [None]:
df.columns

# ---------------------------------
# 2. TARGET VARIABLE DISTRIBUTION
# ---------------------------------

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='class', palette='Set2')
plt.title("Class Distribution (Edible vs Poisonous)", fontsize=14, fontweight='bold')
plt.xlabel("Class", fontsize=12)
plt.ylabel("Count")
plt.show()

In [None]:
print("\nClass Value Counts:")
print(df['class'].value_counts())
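
Raw counts are easier to compare as proportions. A minimal sketch using `value_counts(normalize=True)`; the `labels` series is made up for illustration and would be replaced by `df['class']` in this notebook.

```python
import pandas as pd

# Hypothetical labels; use df['class'] in the notebook
labels = pd.Series(["e", "p", "e", "e", "p", "e", "p", "e"], name="class")

# normalize=True returns the share of each class instead of the raw count
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```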

# ---------------------------------
# 3. UNIVARIATE ANALYSIS
# ---------------------------------

In [None]:
categorical_features = df.columns.tolist()
categorical_features.remove('class')
categorical_features

In [None]:
for col in categorical_features:
    plt.figure(figsize=(6,4))
    sns.countplot(data=df, x=col, palette='Set3')
    plt.title(f"Distribution of {col}", fontsize=14, fontweight='bold')
    plt.xticks(rotation=45)
    plt.show()

# ------------------------------------------------------------------
# 4. BIVARIATE ANALYSIS (Feature vs Target)
# ------------------------------------------------------------------

In [None]:
for col in categorical_features:
    plt.figure(figsize=(6,4))
    sns.countplot(data=df, x=col, hue='class', palette='Set1')
    plt.title(f"{col} vs Class", fontsize=14, fontweight='bold')
    plt.xticks(rotation=45)
    plt.show()
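
The grouped count plots can be backed by numbers. One possible approach is a row-normalised cross-tabulation, which gives the share of each class within every category of a feature. The frame below is invented for illustration; in the notebook the same call would take `df[col]` and `df['class']`.

```python
import pandas as pd

# Made-up sample; not taken from the real dataset
toy = pd.DataFrame({
    "odor":  ["n", "n", "n", "f", "f", "a", "n", "f"],
    "class": ["e", "e", "p", "p", "p", "e", "e", "p"],
})

# normalize="index" makes every row sum to 1
share = pd.crosstab(toy["odor"], toy["class"], normalize="index")
print(share)
```

A category whose row is close to 0 or 1 separates the classes almost on its own.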

# ===================================
# 🍄 MUSHROOM DATASET PREPROCESSING
# ===================================

# ---------------------------------------------
#  SPLIT FEATURES & TARGET
# ---------------------------------------------

In [None]:
# Separate the feature matrix X from the target y
X=df.drop('class',axis=1)
y=df['class']

## ONE HOT ENCODING

In [None]:
# One-hot encode the features; drop_first=True drops one redundant column per feature
X=pd.get_dummies(X,drop_first=True)
X.head()
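
To see what `drop_first=True` does, here is a small sketch on a hypothetical two-column frame. A feature with k categories becomes k - 1 indicator columns, because the dropped category is implied when all the others are 0.

```python
import pandas as pd

# Hypothetical mini-frame with a 2-level and a 3-level feature
toy = pd.DataFrame({
    "gill-size": ["b", "n", "b"],
    "ring-number": ["n", "o", "t"],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```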

## LABEL ENCODING THE TARGET

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
# Encode y itself: it was split off from df earlier, so re-encoding df['class'] would leave y unchanged
y=le.fit_transform(y)
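
`LabelEncoder` assigns integer codes in sorted label order, so `e` maps to 0 and `p` to 1. A short sketch to confirm the mapping and reverse it; `le_demo` is a separate, hypothetical encoder so the notebook's `le` is left untouched.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["p", "e", "e", "p"])

# classes_ holds the original labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(le_demo.inverse_transform(encoded))
```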

# ---------------------------------------
# FEATURE SCALING
# ----------------------------------------

Split the data into training and test sets before scaling, so that the scaler is fit on the training data only and no test-set statistics leak into it.

## TRAIN TEST SPLIT

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
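
The split above is purely random, which is usually fine for a roughly balanced target. As an optional variation (not what the notebook does), `stratify` keeps the class ratio the same in both parts. The arrays below are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 52 of class 0 and 48 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 52 + [1] * 48)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The share of class 1 should be close to 0.48 in both splits
print(y_tr.mean(), y_te.mean())
```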

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
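
The fit/transform pattern above can be checked on a tiny example: the scaler learns its mean and standard deviation from the training rows and reuses them unchanged on the test rows. The arrays here are made up. Note that decision stumps split on thresholds, so scaling should not change their predictions; for AdaBoost with tree stumps this step is optional.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [0.0], [1.0], [1.0]])
test = np.array([[1.0], [1.0]])

sc = StandardScaler()
train_s = sc.fit_transform(train)  # statistics learned from train only
test_s = sc.transform(test)        # same statistics applied to test

print(sc.mean_, sc.scale_)
print(test_s.ravel())
```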

# -----------------------------------------------
# 5. DATA READY FOR MODELING
# -----------------------------------------------

# **1. First Model — One Decision Stump**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
model1=AdaBoostClassifier(n_estimators=1,random_state=42)
model1.fit(X_train,y_train)

In [None]:
y_pred=model1.predict(X_test)

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
# Grab the index of the most important feature in the single decision stump
imp_feat=model1.feature_importances_.argmax()
print(imp_feat)

In [None]:
# Look up the column name of the most important feature (index 22 in the original run)
X.columns[imp_feat]

In [None]:
sns.countplot(data=df,x='odor',hue='class')
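
With `n_estimators=1`, AdaBoost with its default base learner reduces to a single depth-1 decision tree, so all the importance falls on the one feature the stump splits on. A minimal sketch on synthetic data; `informative` and `noise` are hypothetical columns, not mushroom features.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

y_demo = np.array([0] * 50 + [1] * 50)
informative = y_demo.copy()
informative[:5] = 1                 # 5 deliberate mismatches with the label
noise = np.arange(100) % 2          # alternates 0/1, carries no class information
X_demo = np.column_stack([noise, informative])

stump = AdaBoostClassifier(n_estimators=1, random_state=42).fit(X_demo, y_demo)

print(stump.estimators_[0].get_depth())     # depth of the single base tree
print(stump.feature_importances_.argmax())  # column the stump splits on
print(stump.score(X_demo, y_demo))          # training accuracy
```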

# **2. Number of Estimators vs Error**

In [None]:
len(X.columns)

In [None]:
# Refit AdaBoost with 1..N estimators (N = number of encoded columns) and record the test error
error_rates=[]
for n in range(1,len(X.columns)+1):
    model=AdaBoostClassifier(n_estimators=n,random_state=42)
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    error_rates.append(1-accuracy_score(y_test,y_pred))


In [None]:
plt.plot(range(1,len(error_rates)+1),error_rates)
plt.xlabel('Number of Estimators')
plt.ylabel('Error Rate')
plt.title('Error Rate vs Number of Estimators')
plt.show()
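
Refitting a model for every value of `n_estimators` repeats a lot of work. As an alternative sketch, `staged_predict` yields the ensemble's predictions after each boosting round from a single fit, which should trace a comparable error curve. The data below is synthetic (`make_classification`), and the names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

booster = AdaBoostClassifier(n_estimators=30, random_state=42).fit(X_tr, y_tr)

# One test error per boosting round, computed from a single fitted model
staged_errors = [np.mean(pred != y_te) for pred in booster.staged_predict(X_te)]
print(len(staged_errors), staged_errors[-1])
```

Plotting `staged_errors` against `range(1, len(staged_errors) + 1)` gives the same kind of chart as above.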

In [None]:
# `model` is the last classifier fitted in the loop above (the one with the most estimators)
fets=pd.DataFrame(index=X.columns,data=model.feature_importances_,columns=['Importance'])
fets

In [None]:
imp_feat=fets[fets['Importance']>0.0].sort_values(by='Importance',ascending=True)
imp_feat

In [None]:
# Plot the features with non-zero importance
plt.figure(figsize=(10,6))
sns.barplot(x=imp_feat.index,y=imp_feat['Importance'],data=imp_feat)
plt.title('Feature Importance')
plt.xticks(rotation=90);
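
Because the features were one-hot encoded, each importance belongs to a single category column such as `odor_n`. One way to read the result per original feature is to sum the dummy columns that share a prefix. The sketch below uses made-up importance values and assumes the category suffix follows the last underscore, as produced by `get_dummies`.

```python
import pandas as pd

# Made-up importances for three dummy columns; real values come from the fitted model
demo_imp = pd.Series(
    {"odor_n": 0.4, "odor_f": 0.3, "gill-size_n": 0.3}, name="Importance"
)

# Strip the category suffix, then sum the importances per original feature
original_feature = demo_imp.index.str.rsplit("_", n=1).str[0]
grouped = demo_imp.groupby(original_feature).sum().sort_values(ascending=False)
print(grouped)
```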

# -----------------------------------------------------------
# 3. MODEL TRAINING USING GRIDSEARCH-CV
# -----------------------------------------------------------

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1],
    'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=3)]
}


final_model=GridSearchCV(estimator=ada,param_grid=param_grid,cv=3,verbose=2,n_jobs=-1)
final_model.fit(X_train,y_train)
y_final_pred=final_model.predict(X_test)

In [None]:
final_model.best_params_
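
`best_params_` shows only the winning combination. To compare all candidates, `cv_results_` can be loaded into a DataFrame and sorted by rank. This is a sketch on synthetic data with a deliberately small grid; the names `search` and `results` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "learning_rate": [0.5, 1.0]},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best mean CV score first
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
print(search.best_score_)
```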

In [None]:
print(classification_report(y_test,y_final_pred))
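
For this task the costliest mistake is a poisonous mushroom predicted as edible. The `confusion_matrix` function imported earlier exposes that count directly. The labels below are hypothetical; in the notebook, pass `y_test` and `y_final_pred`, with `labels` matching whichever encoding is in use.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 'e' = edible, 'p' = poisonous
y_true_demo = ["e", "e", "p", "p", "p", "e"]
y_pred_demo = ["e", "p", "p", "e", "p", "e"]

# Rows are true classes, columns are predicted classes, both in the order given by labels
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["e", "p"])
missed_poisonous = cm[1, 0]   # true 'p' predicted as 'e'

print(cm)
print(missed_poisonous)
```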