 Naïve Bayes model** in **scikit-learn**, covering **training, prediction, evaluation, feature engineering, and hyperparameter tuning** step by step. 🚀  

Let's dive into the code! 🐍  

---

### **📌 Step 1: Import Required Libraries**
First, we need to import the necessary Python libraries.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, log_loss, roc_auc_score
```
🔹 `numpy` & `pandas`: For handling data.  
🔹 `train_test_split`: To split data into training and testing sets.  
🔹 `GaussianNB`, `MultinomialNB`, `BernoulliNB`, `ComplementNB`: The four Naïve Bayes classifiers compared below.  
🔹 `StandardScaler`, `LabelEncoder`, `TfidfVectorizer`: For feature engineering.  
🔹 `metrics`: For model evaluation.  

---

### **📌 Step 2: Generate a Dummy Dataset**
Let's create a **fake dataset** with **3 feature types**:  
1. **Numerical Features** (Age, Salary)  
2. **Categorical Features** (Job Type, City)  
3. **Text Feature** (Customer Review)  

```python
# Creating a dummy dataset
np.random.seed(42)  # Setting seed for reproducibility

data = pd.DataFrame({
    'Age': np.random.randint(18, 60, 100),  # Random ages from 18 to 59 (upper bound is exclusive)
    'Salary': np.random.randint(20000, 100000, 100),  # Random salaries from 20,000 to 99,999
    'Job Type': np.random.choice(['Engineer', 'Doctor', 'Teacher', 'Lawyer'], 100),  # Categorical feature
    'City': np.random.choice(['New York', 'San Francisco', 'Chicago', 'Los Angeles'], 100),  # Another categorical
    'Review': np.random.choice(['Great service!', 'Terrible experience.', 'Okay, but not great.', 'Loved it!'], 100),  # Text data
    'Purchased': np.random.choice([0, 1], 100)  # Target variable (0 = No, 1 = Yes)
})

print(data.head())  # Display first 5 rows
```

🎯 **Explanation:**  
- The dataset contains **100 rows** and **6 columns**.  
- The **target variable** is `Purchased` (binary classification problem).  
- **Categorical columns (`Job Type`, `City`) need encoding** before training the Naïve Bayes model.  
- **Text data (`Review`) needs vectorization** using **TF-IDF**.  

---

### **📌 Step 3: Preprocessing Data**
#### **🔹 Encode Categorical Features**
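scikit-learn's Naïve Bayes estimators expect **numerical** input, so we **convert the categorical columns** to integer codes with `LabelEncoder`. Keep in mind that `LabelEncoder` is designed for target labels, and the integer codes imply an ordering that job types and cities do not have; for nominal features, one-hot encoding (or `CategoricalNB`) is usually a better fit.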

```python
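# One encoder per column, so each fitted mapping stays available for inverse_transform
job_encoder = LabelEncoder()
city_encoder = LabelEncoder()

data['Job Type'] = job_encoder.fit_transform(data['Job Type'])
data['City'] = city_encoder.fit_transform(data['City'])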

print(data.head())  # Checking processed data
```

🎯 **What happens here?**  
- Job types are **converted into integer codes** in alphabetical order: `Doctor` → 0, `Engineer` → 1, `Lawyer` → 2, `Teacher` → 3.  
- Cities are encoded the same way: `Chicago` → 0, `Los Angeles` → 1, `New York` → 2, `San Francisco` → 3.  
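
To see which code belongs to which category, the fitted encoder can be inspected directly. The snippet below is a small standalone sketch; the list `jobs` and the names `encoder` and `mapping` are illustrative and not part of the dataset above.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative input, not the tutorial's DataFrame
jobs = ['Engineer', 'Doctor', 'Teacher', 'Lawyer', 'Doctor']

encoder = LabelEncoder()
codes = encoder.fit_transform(jobs)

# classes_ stores the original labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)   # {'Doctor': 0, 'Engineer': 1, 'Lawyer': 2, 'Teacher': 3}
print(codes.tolist())   # [1, 0, 3, 2, 0]

# inverse_transform recovers the original strings from the codes
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Keeping the fitted encoder around is what makes this lookup possible later, for example when reporting results in terms of the original category names.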

---

#### **🔹 Convert Text Data using TF-IDF**
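For **text features**, we apply **TF-IDF Vectorization**. For simplicity the vectorizer is fit on the full dataset here; in a real project, fit it on the training split only so that no information from the test rows leaks into the features.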

```python
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data['Review']).toarray()  # Convert to array

# Add the TF-IDF features to the dataset
tfidf_df = pd.DataFrame(tfidf_matrix, columns=vectorizer.get_feature_names_out())

# Drop the original text column and concatenate TF-IDF data
data = data.drop(columns=['Review']).reset_index(drop=True)
data = pd.concat([data, tfidf_df], axis=1)

print(data.head())  # Display new dataset
```

🎯 **What happens here?**  
- `TfidfVectorizer()` converts each review into a vector of **term weights**: a word scores higher when it is frequent in that review and rare across the other reviews.  
- The original `Review` column is **removed** and replaced with **TF-IDF features**.  

---

### **📌 Step 4: Split Dataset into Training and Testing Sets**
Now, we divide our dataset **80% for training** and **20% for testing**.

```python
X = data.drop(columns=['Purchased'])  # Features
y = data['Purchased']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

🎯 **Why?**  
- The model needs unseen data for evaluation, so we keep **20% as the test set**.  
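
One caveat with the order of steps so far: the encoders and the vectorizer were fit before the split, so the test rows influenced preprocessing. A possible alternative, sketched below under some assumptions, is to split the raw data first and let a `Pipeline` with a `ColumnTransformer` fit all preprocessing on the training rows only. This sketch swaps `LabelEncoder` for `OneHotEncoder` and regenerates a similar dummy dataset; the names `raw`, `X_tr`, `X_te`, and `leak_free_model` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Regenerate an untransformed dummy dataset (same columns as in Step 2)
rng = np.random.default_rng(42)
n = 100
raw = pd.DataFrame({
    'Age': rng.integers(18, 60, n),
    'Salary': rng.integers(20000, 100000, n),
    'Job Type': rng.choice(['Engineer', 'Doctor', 'Teacher', 'Lawyer'], n),
    'City': rng.choice(['New York', 'San Francisco', 'Chicago', 'Los Angeles'], n),
    'Review': rng.choice(['Great service!', 'Terrible experience.',
                          'Okay, but not great.', 'Loved it!'], n),
    'Purchased': rng.choice([0, 1], n),
})

# Split BEFORE any fitting, so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    raw.drop(columns=['Purchased']), raw['Purchased'],
    test_size=0.2, random_state=42)

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Salary']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Job Type', 'City']),
        ('text', TfidfVectorizer(), 'Review'),  # single column name -> 1-D text input
    ],
    sparse_threshold=0.0,  # always return a dense array, which GaussianNB needs
)

leak_free_model = Pipeline([('preprocess', preprocess), ('nb', GaussianNB())])
leak_free_model.fit(X_tr, y_tr)  # preprocessing is fit on training rows only
test_accuracy = leak_free_model.score(X_te, y_te)
print("Pipeline accuracy:", test_accuracy)
```

Calling `fit` on the pipeline fits every transformer on `X_tr` alone, and `score` or `predict` only applies the already-fitted transformers to `X_te`.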

---

### **📌 Step 5: Apply Different Naïve Bayes Models**
#### **1️⃣ Gaussian Naïve Bayes (For Continuous Data)**
```python
scaler = StandardScaler()  # Standardize features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)
y_pred_gnb = gnb.predict(X_test_scaled)

print("GaussianNB Accuracy:", accuracy_score(y_test, y_pred_gnb))
```
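✔ **Used for:** **Continuous numerical data** (e.g., Age, Salary). Note that the code above also treats the encoded and TF-IDF columns as Gaussian.  
✔ **Feature scaling is optional:** each feature gets its own per-class Gaussian, so standardizing is not strictly required. It mainly helps by keeping the `var_smoothing` term, a fraction of the largest feature variance, from being inflated by a large-scale column such as `Salary`.  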

---

#### **2️⃣ Multinomial Naïve Bayes (For Text Data)**
```python
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred_mnb = mnb.predict(X_test)

print("MultinomialNB Accuracy:", accuracy_score(y_test, y_pred_mnb))
```
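✔ **Used for:** **Text-based classification** (e.g., spam detection) on count-like, **non-negative** features. The code above runs because every column is non-negative, but the unscaled `Age` and `Salary` values are not counts and tend to dominate the TF-IDF weights.  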
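
For comparison, a more typical use of `MultinomialNB` feeds it word counts only. The sketch below is a minimal text-only example; the sentiment `labels` are hypothetical and invented for illustration, since the dataset above has no sentiment column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['Great service!', 'Loved it!', 'Terrible experience.', 'Okay, but not great.']
labels = [1, 1, 0, 0]  # hypothetical sentiment labels (1 = positive, 0 = negative)

# CountVectorizer produces word counts, the input MultinomialNB is designed for
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(texts, labels)

text_predictions = text_clf.predict(['Loved the service', 'Terrible experience'])
print(text_predictions)  # [1 0]
```

Words that were never seen in training (such as "the" here) are simply ignored by the vectorizer, and the default `alpha=1.0` smoothing keeps unseen word-class combinations from producing zero probabilities.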

---

#### **3️⃣ Bernoulli Naïve Bayes (For Binary Data)**
```python
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred_bnb = bnb.predict(X_test)

print("BernoulliNB Accuracy:", accuracy_score(y_test, y_pred_bnb))
```
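✔ **Used for:** **Binary feature data** (e.g., presence/absence of a word in text). With the default `binarize=0.0`, every value above 0 becomes 1, so `Age` and `Salary` turn into constant columns here and carry no information.  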
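
The effect of that thresholding can be previewed with `Binarizer`, which applies the same "greater than threshold" rule. The two rows below are illustrative values in the column order Age, Salary, Job Type code, City code.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Illustrative rows: [Age, Salary, Job Type code, City code]
rows = np.array([[25, 30000, 0, 2],
                 [47, 85000, 3, 0]])

# Values strictly greater than the threshold become 1, everything else 0
binarized = Binarizer(threshold=0.0).fit_transform(rows)
print(binarized)
# [[1 1 0 1]
#  [1 1 1 0]]
```

Age and Salary collapse to 1 in every row, and each encoded category is reduced to "code 0 or not", which is rarely what is intended for these columns.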

---

#### **4️⃣ Complement Naïve Bayes (For Imbalanced Data)**
```python
cnb = ComplementNB()
cnb.fit(X_train, y_train)
y_pred_cnb = cnb.predict(X_test)

print("ComplementNB Accuracy:", accuracy_score(y_test, y_pred_cnb))
```
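✔ **Better for imbalanced datasets** (e.g., fraud detection): it estimates each class's parameters from the samples of all *other* classes. Like `MultinomialNB`, it requires non-negative features.  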
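
All four models above use default settings. For the hyperparameter tuning mentioned in the introduction, the main knob is the smoothing strength: `alpha` for `MultinomialNB`, `BernoulliNB`, and `ComplementNB`, and `var_smoothing` for `GaussianNB`. The sketch below shows one possible way to search over `alpha` with cross-validation; the synthetic arrays `X_counts` and `y_counts` and the grid values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features and binary labels (illustrative only)
rng = np.random.default_rng(0)
X_counts = rng.integers(0, 5, size=(60, 8))
y_counts = rng.integers(0, 2, size=60)

alpha_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    MultinomialNB(),
    param_grid={'alpha': alpha_grid},
    cv=5,                # 5-fold cross-validation on the training data
    scoring='accuracy',
)
search.fit(X_counts, y_counts)

print("Best alpha:", search.best_params_['alpha'])
print("Best CV accuracy:", search.best_score_)
```

The same pattern applies to `GaussianNB` by swapping the estimator and using a grid such as `{'var_smoothing': [1e-9, 1e-6, 1e-3]}`.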

---

### **📌 Step 6: Model Evaluation**
```python
print("Classification Report:\n", classification_report(y_test, y_pred_gnb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gnb))
print("Log Loss:", log_loss(y_test, gnb.predict_proba(X_test_scaled)))
print("ROC-AUC Score:", roc_auc_score(y_test, gnb.predict_proba(X_test_scaled)[:, 1]))
```
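✔ **Accuracy** → Fraction of test samples classified correctly (also shown in the classification report).  
✔ **Confusion Matrix** → Counts of actual vs. predicted labels; rows are actual classes, columns are predicted classes.  
✔ **Log Loss** → Penalizes confident but wrong probability estimates (lower is better).  
✔ **ROC-AUC** → Probability that a random positive sample is ranked above a random negative one (0.5 = chance, 1.0 = perfect).  
🎯 **Note:** `Purchased` was generated at random, so expect scores close to chance on this dummy dataset.  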
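
To show what these metrics actually compute, the sketch below evaluates them on four hand-picked predictions and checks log loss against its formula. The arrays `y_true` and `p_pos` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss, roc_auc_score

y_true = np.array([0, 1, 1, 0])
p_pos = np.array([0.1, 0.9, 0.4, 0.3])  # predicted probability of class 1
y_hat = (p_pos >= 0.5).astype(int)      # hard labels at a 0.5 threshold -> [0, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)     # 3 of 4 correct -> 0.75
cm = confusion_matrix(y_true, y_hat)    # [[2, 0], [1, 1]]

# Log loss by hand: average negative log-probability assigned to the true class
manual_ll = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
sk_ll = log_loss(y_true, p_pos)

# ROC-AUC uses only the ranking of the scores, not the 0.5 threshold
auc = roc_auc_score(y_true, p_pos)      # both positives outrank both negatives -> 1.0

print(acc, cm.tolist(), manual_ll, sk_ll, auc)
```

The third sample is misclassified at the 0.5 threshold, which lowers accuracy, yet ROC-AUC stays at 1.0 because the positives are still ranked above the negatives.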

---

### **🎯 Conclusion**
✅ We implemented **four Naïve Bayes variants** (Gaussian, Multinomial, Bernoulli, and Complement).  
✅ Used **TF-IDF for text** and **scaling for GaussianNB**.  
✅ Evaluated models using **accuracy, confusion matrix, log loss, and AUC-ROC**.  

 🚀