### 🌳 **Decision Tree (DT)** – *Basic Overview*

* A flowchart-like structure for **classification** or **regression**.
* Splits data based on **feature values** using rules.
* Each **node** represents a feature test; each **leaf** represents an output.
* Goal: Make the dataset as "pure" as possible after each split.
* Common splitting criteria: **Gini Impurity**, **Entropy** (for classification), **MSE** (for regression).

✅ **Pros**: Easy to interpret, no feature scaling needed
❌ **Cons**: Can overfit on noisy data

---

### 🌲 **Random Forest (RF)** – *Basic Overview*

* An **ensemble** of many Decision Trees.
* Uses **bagging** (bootstrap sampling) + **random feature selection**.
* Each tree votes → final prediction is by **majority** (classification) or **average** (regression).
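* Averaging many decorrelated trees reduces variance, which usually lowers overfitting and improves accuracy compared with a single tree.

✅ **Pros**: Usually more accurate than a single tree, scales to large datasets
❌ **Cons**: Slower to train and predict, less interpretable than a single tree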

---

### 🆚 Quick Comparison

| Feature          | Decision Tree | Random Forest             |
| ---------------- | ------------- | ------------------------- |
| Model Type       | Single tree   | Multiple trees (ensemble) |
| Accuracy         | Moderate      | Higher                    |
| Overfitting Risk | High          | Low                       |
| Interpretability | High          | Lower                     |

---

In [None]:
from google.colab import files
uploaded = files.upload()

Saving shop data (2).csv to shop data (2).csv


In [None]:
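import pandas as pd

# Colab renames repeated uploads (e.g. 'shop data (2).csv'),
# so check that this name matches the file you intend to load.
df = pd.read_csv('shop data.csv')
df1 = df.copy()  # work on a copy so the raw data stays untouched
df1.head()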

Unnamed: 0,age,income,gender,m_status,buys
0,<25,high,male,single,no
1,<25,high,male,married,no
2,25-35,high,male,single,yes
3,>35,medium,male,single,yes
4,>35,low,female,single,yes


## 🔤 Label Encoding


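**Label Encoding** is a technique to convert **categorical values** into **numeric codes**. It is often used when preprocessing data for machine learning algorithms that require numerical input. Note that scikit-learn's `LabelEncoder` is designed for target labels; for input features, `OrdinalEncoder` is the intended tool.

Each unique category is assigned an integer. `LabelEncoder` assigns the codes in sorted (alphabetical) order, so they do not reflect the natural order low < medium < high. For example: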

```
["low", "medium", "high"] → [1, 2, 0]
```
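The mapping above can be checked directly. The snippet below is a minimal sketch: the `levels` list is illustrative, and the `OrdinalEncoder` part is an optional alternative (not used in this notebook) for cases where the natural order of the categories should be preserved.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ["low", "medium", "high"]  # illustrative values

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ stores the categories in sorted order; a category's code is its index there
print(list(encoder.classes_))
mapping = {label: int(code) for label, code in zip(levels, codes)}
print(mapping)

# Alternative: state the category order explicitly so that low < medium < high
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered = ordinal.fit_transform([[value] for value in levels]).ravel()
print(ordered)
```

Here `high` gets code 0 only because it sorts first alphabetically, which is why the codes should not be read as a ranking.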

In [None]:
from sklearn.preprocessing import LabelEncoder
la = LabelEncoder()

# Encode a single column
df1['income'] = la.fit_transform(df1['income'])
print(df1.head())

# Encode all columns (if all are categorical).
# The same encoder is refit for every column, so afterwards
# la.classes_ only holds the categories of the last column ('buys').
df1 = df1.apply(la.fit_transform)

# Show the result
print(df1.head())

     age  income  gender m_status buys
0    <25       0    male   single   no
1    <25       0    male  married   no
2  25-35       0    male   single  yes
3    >35       2    male   single  yes
4    >35       1  female   single  yes
   age  income  gender  m_status  buys
0    1       0       1         1     0
1    1       0       1         0     0
2    0       0       1         1     1
3    2       2       1         1     1
4    2       1       0         1     1


In [None]:
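# df1 was already fully encoded in the previous cell,
# so re-applying the encoder leaves the values unchanged.
df1 = df1.apply(la.fit_transform)
df1.head()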

Unnamed: 0,age,income,gender,m_status,buys
0,1,0,1,1,0
1,1,0,1,0,0
2,0,0,1,1,1
3,2,2,1,1,1
4,2,1,0,1,1


In [None]:
X = df1.drop('buys', axis=1)  # features: every column except the target
y = df1['buys']               # target: whether the customer buys

## 📚 **Theory Behind `train_test_split()`**

### 🔄 **Purpose**

* To **split your dataset** into:

  * **Training set**: used to train the model.
  * **Test set**: used to evaluate the model's performance on unseen data.

---

## 🧩 **Components Explained**

### 1. **`X` and `y`**

* `X`: Features (input variables)
* `y`: Target (output variable/class label)

---

### 2. **`train_test_split()`**

* A function from **Scikit-learn** that **randomly splits** the data into training and testing subsets.

---

### 3. **`test_size=0.2`**

* Means **20% of the data** will be used for **testing**.
* The remaining **80% is used for training**.
* You can adjust this ratio (e.g., `0.3` for a 70/30 split).

---

### 4. **`random_state=42`**

* Sets a **seed** for the random number generator.
* Ensures that **the split is the same every time** you run the code.
* Any integer will work; `42` is commonly used by convention.

---

## 📌 Final Result:

| Variable  | Meaning                      |
| --------- | ---------------------------- |
| `X_train` | Training features (80%)      |
| `X_test`  | Testing features (20%)       |
| `y_train` | Training target labels (80%) |
| `y_test`  | Testing target labels (20%)  |
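To see these four outputs concretely, here is a small sketch on made-up data. The `X_demo` and `y_demo` arrays are hypothetical and stand in for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# The same seed reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```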

---

## ✅ Why This Matters:

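* Holding out a test set keeps evaluation data out of training, which is the first guard against **data leakage** (leakage can still happen if preprocessing is fit on the full dataset before splitting).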
* Ensures your model is evaluated on **data it has never seen**.
* Helps in assessing **generalization performance**.

---

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

## 🌳 **Decision Tree: Basic Concept and Theory**

A **Decision Tree** is a supervised machine learning algorithm used for both **classification** and **regression**.

It works like a flowchart:

* **Internal nodes** represent tests on a feature/attribute,
* **Branches** represent decisions/rules,
* **Leaves** represent final outcomes or predictions.

---

## 🧠 **How a Decision Tree Works**

1. **Start at the root node** (the whole dataset).
2. **Split** the dataset based on the feature that gives the **best separation** between classes or values.
3. Repeat this **recursively** on each subset until:

   * A stopping criterion is met (e.g., max depth),
   * All samples in a node belong to the same class.
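Step 2 can be sketched for a single numeric feature as follows. This is a simplified illustration, not scikit-learn's actual implementation: the helper names `gini` and `best_threshold` and the toy `ages`/`buys` data are assumptions, and candidate thresholds are taken as midpoints between consecutive distinct values.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted child impurity) of the best split."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    values = np.unique(feature)
    best = (None, np.inf)
    # Candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (float(t), float(score))
    return best

# Made-up data: customers older than about 35 buy
ages = [22, 25, 31, 38, 45, 52]
buys = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(ages, buys)
print(threshold, impurity)
```

A full tree builder would run this search over every feature, split on the winner, and recurse into each child until a stopping criterion is met.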

---

### 🎯 Goal of a Split

To find the best feature and value that:

* In **classification**, results in **pure groups** (e.g., mostly "yes" or "no").
* In **regression**, reduces the variance (makes groups similar in value).

---

## 🔍 Criteria for Splitting

### Classification Trees:

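* **Gini Impurity**:

  $$
  Gini = 1 - \sum_i p_i^2
  $$

  where $p_i$ is the proportion of class $i$ in the node. A lower weighted Gini across the child nodes = better split.

* **Entropy (Information Gain)**:

  $$
  Entropy = -\sum_i p_i \log_2(p_i)
  $$

  Information Gain = Entropy(before) - weighted average Entropy of the child nodes (after)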
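Both criteria are easy to compute by hand. The sketch below uses hypothetical helper names and a made-up four-sample node; the "after" entropy is taken as the size-weighted average over the two child nodes.

```python
import numpy as np

def class_proportions(labels):
    """Proportion p_i of each class present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    n = len(parent)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - after

# A perfectly mixed node that a split separates into two pure children
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```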

### Regression Trees:

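* Use **Mean Squared Error (MSE)**, which is equivalent to reducing the variance within each child node, or **Mean Absolute Error (MAE)**, which minimizes the absolute deviation from each node's median.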
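As a small illustration of the MSE criterion, the sketch below compares a node's MSE before and after a split. The helper names and the target values are made up; each node is assumed to predict the mean of its targets, so its MSE equals the variance of those targets.

```python
import numpy as np

def node_mse(values):
    """MSE of a node that predicts the mean of its targets (= their variance)."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

def weighted_child_mse(left, right):
    n = len(left) + len(right)
    return (len(left) / n) * node_mse(left) + (len(right) / n) * node_mse(right)

# Made-up targets: the split groups similar values together
parent = [10, 12, 30, 32]
left, right = [10, 12], [30, 32]

print(node_mse(parent))                 # 101.0
print(weighted_child_mse(left, right))  # 1.0
```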

---

## 🏗️ Structure of a Decision Tree

```
                 [Is Age > 30?]
                /             \
             Yes               No
          [Income > 50k?]     [Buy=No]
          /          \
      Yes             No
  [Buy=Yes]        [Buy=No]
```

Each internal node is a **test**, and each leaf node is a **decision**.
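A fitted scikit-learn tree can be printed in a similar textual form with `export_text`. The rows below are invented to roughly follow the diagram (age above 30 and income above 50 means "buys"); the thresholds the tree actually learns depend on the data and need not match the diagram.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [age, income in thousands]; label 1 = buys, 0 = does not buy
X_demo = [[25, 60], [35, 40], [45, 80], [50, 30], [28, 90], [60, 70]]
y_demo = [0, 0, 1, 0, 0, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# export_text renders the learned tests and leaves as indented text
rules = export_text(tree, feature_names=["age", "income"])
print(rules)
```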

---

## ✅ Advantages of Decision Trees

* Easy to understand and visualize.
* No need for feature scaling or normalization.
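* Can handle both numerical and categorical data in principle (scikit-learn's trees need categorical features encoded as numbers first, as done above).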
* Can model non-linear relationships.

---

## ❌ Disadvantages

* **Overfitting**: Deep trees can memorize training data.
* **Unstable**: Small data changes can change the structure.
* **Greedy**: Makes locally optimal decisions (not globally optimal).
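The overfitting point can be explored with a quick experiment. This is a sketch on synthetic data (`make_classification` with 20% of labels flipped at random), not on the shop dataset; the exact scores depend on the data and seed, but an unrestricted tree typically reaches perfect training accuracy while its test accuracy lags behind.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) to simulate a noisy problem
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare how well each tree fits the training data versus unseen data
for name, tree in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(name, "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))
```

Limiting `max_depth` (or `min_samples_leaf`) is one common way to trade a little training accuracy for better generalization.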

---

## ✨ Use Cases

* Credit scoring
* Medical diagnosis
* Marketing segmentation
* Customer churn prediction

---


In [None]:
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(X_train,y_train)

In [None]:
print("Accuracy:", model.score(X_test, y_test))

Accuracy: 0.6


# 🌳 Forest Tree

A **Forest Tree** isn't a separate algorithm. It's an informal term for any one of the **multiple decision trees** that make up a **Random Forest**.

So when we say "forest tree," we usually mean:

> One of the many **decision trees** inside a **Random Forest**.

---

## 🧠 Key Concepts of Forest Trees in Random Forest

### 1. **Decision Tree (Base Learner)**

A **decision tree** splits the data based on feature values to make predictions. Each "tree" learns patterns from the training data.

* Root → Internal Nodes → Leaf Nodes
* Splits are based on **Gini impurity**, **entropy**, or **variance**.

### 2. **Forest = Many Trees**

A **Random Forest** is a **collection (ensemble)** of **many decision trees** trained on different subsets of the data.

* Each tree is trained independently.
* Each tree sees a slightly **different version of the data**.

### 3. **Randomness in Forest Trees**

To make the trees diverse (and not identical), Random Forest uses **two types of randomness**:

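* **Bootstrap Sampling**: Each tree is trained on a random sample of the training data drawn **with replacement** (by default the same size as the original set), so some rows repeat and others are left out.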
* **Feature Subset Selection**: At each split, the tree considers only a **random subset of features**, not all of them.
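Both kinds of randomness can be sketched with NumPy alone. The sizes below (10 rows, 4 features, 2 features per split) are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, max_features = 10, 4, 2  # arbitrary illustration sizes

# Bootstrap sampling: draw row indices with replacement, same size as the data
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn are "out-of-bag" for this tree
out_of_bag = np.setdiff1d(np.arange(n_samples), bootstrap_idx)

# Feature subset selection: at a split, consider only a few features (no repeats)
feature_subset = rng.choice(n_features, size=max_features, replace=False)

print("bootstrap rows:", bootstrap_idx)
print("out-of-bag rows:", out_of_bag)
print("features considered at this split:", feature_subset)
```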

### 4. **Combining Trees (Ensemble Output)**

Each tree makes a prediction, and the forest combines them:

* **Classification**: Uses **majority vote** (most common class from trees).
* **Regression**: Uses the **average** of all tree outputs.
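The two combination rules can be illustrated on hypothetical per-tree outputs (three trees, rows are trees and columns are samples):

```python
import numpy as np

# Hypothetical class predictions: rows = trees, columns = samples
tree_class_votes = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])
# Majority vote: most frequent class in each column
majority = np.array([np.bincount(col).argmax() for col in tree_class_votes.T])
print(majority)  # [1 0 1]

# Hypothetical regression outputs: rows = trees, columns = samples
tree_regression_outputs = np.array([
    [2.0, 10.0],
    [4.0, 14.0],
    [6.0, 12.0],
])
# Average over the trees for each sample
average = tree_regression_outputs.mean(axis=0)
print(average)  # [ 4. 12.]
```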

---

## 🔁 Process Summary (Forest Tree Lifecycle)

1. **Draw random samples** (with replacement) from training data → bootstrap sample.
2. **Train a decision tree** on this sample.
3. At each split in the tree, use a **random subset of features**.
4. Repeat steps 1–3 `n_estimators` times (e.g., 100 times).
5. **Combine predictions** from all trees to produce the final output.
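The five steps can be written out by hand as a rough sketch. This is a simplified stand-in for what `RandomForestClassifier` does internally, run on synthetic data; `n_trees`, `X_demo`, and `y_demo` are illustrative names, and the vote is a plain majority over binary 0/1 labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie

trees = []
for i in range(n_trees):                                    # step 4: repeat
    idx = rng.integers(0, len(X_demo), size=len(X_demo))    # step 1: bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt",      # step 3: random feature
                                  random_state=i)           #         subset per split
    tree.fit(X_demo[idx], y_demo[idx])                      # step 2: train the tree
    trees.append(tree)

# Step 5: combine predictions by majority vote
all_votes = np.array([t.predict(X_demo) for t in trees])    # shape (n_trees, n_samples)
forest_pred = (all_votes.mean(axis=0) >= 0.5).astype(int)

print("ensemble training accuracy:", (forest_pred == y_demo).mean())
```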


In [None]:
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(n_estimators=155)
model.fit(X_train,y_train)

### 🔢 What is `n_estimators` in `RandomForestClassifier`?

➡️ `n_estimators` **specifies the number of trees** in the **random forest**.

---

### 📌 Why Is It Important?

* A **Random Forest** is made up of multiple **Decision Trees**.
* The parameter `n_estimators=155` means the model will build **155 individual decision trees**.
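* Each tree contributes to the final prediction, conceptually by **majority voting** (in classification). scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

---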
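After fitting, the individual trees are available in the `estimators_` attribute, so the effect of `n_estimators` can be verified directly. The sketch below uses synthetic data rather than the shop dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=155, random_state=0)
forest.fit(X_demo, y_demo)

# One fitted DecisionTreeClassifier per requested estimator
print(len(forest.estimators_))  # 155

# Class probabilities for the first row, averaged over all trees
proba = forest.predict_proba(X_demo[:1])
print(proba)
```

More trees generally make predictions more stable at the cost of longer training and prediction time.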

In [None]:
print("Accuracy:", model.score(X_test, y_test))

Accuracy: 0.8
