# 🧠 Unsupervised Machine Learning – Anomaly Detection (Isolation Forest)

---

## 🔍 Introduction

We’re going to continue our discussion on **Unsupervised Machine Learning**, focusing today on **<span style="color:orange">Anomaly Detection</span>**.

Anomaly detection basically means detecting **<span style="color:red">outliers</span>**: data points that deviate significantly from the rest.

---

## 💡 Real-World Examples

1. **Bank Account Security**  
   Imagine your bank account is in India.  
   If someone tries to log in using your credentials from another country, your bank immediately sends a security alert.  
   How is this detected?  
   → By identifying this login as an **<span style="color:red">anomaly</span>**, since it’s an unusual event.

2. **Cancer Detection**  
   Only a small subset of patients have cancer.  
   Those rare cases act as **outliers**.  
   Detecting these anomalies helps save lives.

3. **Cybersecurity (Fake IPs)**  
   If a hacker uses a suspicious IP to access a server, anomaly detection can flag it as an outlier.

4. **Cricket Example (IPL)**  
   Suppose in IPL, the runs per over are:  
   15, 10, 12, **100**.  
   Scoring **100 runs in one over** is not realistic.  
   Hence, it’s an **outlier**, i.e. an anomaly.

---

## 🌲 Isolation Forest (Concept)

The **<span style="color:orange">Isolation Forest</span>** is an **unsupervised anomaly detection technique** based on **decision trees**.

Even though it uses trees internally, it **does not require labeled data**.

---

### 🔸 Basic Idea

Consider two features:  
$$
f_1, f_2
$$

Now, plot your data points: most will form **clusters**, but a few will be far away.  
Those isolated points are **potential anomalies**.

Isolation Forest works by **isolating** these data points.

- It builds multiple **Isolation Trees**.  
- Each tree repeatedly picks a **random feature** and a **random split value** between that feature’s minimum and maximum, until every data point ends up alone in a **leaf node** (or a height limit is reached).

If a data point is **isolated in fewer splits**, it’s more likely an **outlier**.

---

### 🧩 Example Visualization

Let’s imagine a 2D feature space:

- Clustered region → Normal points  
- A few distant points → Outliers

The algorithm splits the data like a decision tree:
- Normal points sit in dense regions, so it takes many splits to isolate them and they end up **deep** in the tree.  
- Outliers are isolated **quickly**, after **few splits (shallow depth)**.

Hence:
- **Fewer splits → Higher anomaly likelihood**
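To make the "fewer splits" idea concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: `isolation_depth` and `mean_depth` are hypothetical helper names, the data is synthetic, and the subsampling and tree-height limit that a real Isolation Forest uses are left out.

```python
import random


def isolation_depth(target, points, rng, max_splits=100):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(points)
    depth = 0
    for _ in range(max_splits):
        if len(current) <= 1:
            break
        feature = rng.randrange(len(target))      # pick a random feature
        lo = min(p[feature] for p in current)
        hi = max(p[feature] for p in current)
        if lo == hi:
            continue                              # this feature cannot separate the points
        split = rng.uniform(lo, hi)               # pick a random split value
        # Keep only the side of the split that contains the target point
        if target[feature] < split:
            current = [p for p in current if p[feature] < split]
        else:
            current = [p for p in current if p[feature] >= split]
        depth += 1
    return depth


rng = random.Random(42)
# Synthetic data (an assumption for this example): one dense cluster plus one far-away point
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
points = cluster + [outlier]
# Use the cluster point closest to the origin as a "typical" normal point
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)


def mean_depth(target, n_trees=200):
    """Average isolation depth of `target` over many independent random trees."""
    return sum(isolation_depth(target, points, rng) for _ in range(n_trees)) / n_trees


outlier_depth = mean_depth(outlier)
inlier_depth = mean_depth(inlier)
print(f"average depth of the outlier: {outlier_depth:.2f}")
print(f"average depth of the inlier:  {inlier_depth:.2f}")
```

With this setup, the far-away point should come out with a much smaller average depth than the point in the middle of the cluster.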

---

## 🧮 Mathematical Formulation

To compute the **anomaly score**, we use:

$$
s(x, m) = 2^{-\frac{E[h(x)]}{c(m)}}
$$

---

### ✳️ Where:

| Symbol | Meaning |
|:--|:--|
| $$x$$ | Data point for which we calculate anomaly score |
| $$m$$ | Sample size (number of data points used to build each tree) |
| $$h(x)$$ | Path length (number of splits from the root to the leaf that isolates $$x$$ in one isolation tree) |
| $$E[h(x)]$$ | Average path length for $$x$$ across all trees |
| $$c(m)$$ | Average path length of unsuccessful searches in a Binary Search Tree of size $$m$$ |
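
As a sketch of how the pieces in the table fit together, the snippet below computes $$c(m)$$ and the score in plain Python. It assumes the closed form from the original Isolation Forest paper, $$c(m) = 2H(m-1) - \frac{2(m-1)}{m}$$, with the harmonic number approximated as $$H(i) \approx \ln(i) + 0.5772$$. The function names and the example path lengths (4 and 12) are illustrative assumptions, not part of any library.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(m):
    """c(m): average path length of an unsuccessful BST search over m points."""
    if m <= 1:
        return 0.0
    if m == 2:
        return 1.0
    harmonic = math.log(m - 1) + EULER_GAMMA  # approximation of H(m - 1)
    return 2.0 * harmonic - 2.0 * (m - 1) / m


def anomaly_score(mean_path_length, m):
    """s(x, m) = 2 ** (-E[h(x)] / c(m)); assumes m >= 2 so that c(m) > 0."""
    return 2.0 ** (-mean_path_length / average_path_length(m))


m = 256                           # sample size per tree (illustrative)
shallow = anomaly_score(4.0, m)   # point isolated after few splits
deep = anomaly_score(12.0, m)     # point isolated after many splits
print(f"c({m}) = {average_path_length(m):.2f}")
print(f"E[h(x)] = 4  -> s = {shallow:.2f}")
print(f"E[h(x)] = 12 -> s = {deep:.2f}")
```

For $$m = 256$$ this gives $$c(m) \approx 10.24$$, so a point isolated after 4 splits on average scores roughly 0.76, while one that needs 12 splits scores roughly 0.44.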

---

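### 🧠 Interpretation

- If a point is isolated **quickly**:

  $$
  E[h(x)] \ll c(m)
  $$

  then

  $$
  s(x, m) \approx 1
  $$

  → meaning **high anomaly score** → **outlier**

- If a point needs about the **average** number of splits:

  $$
  E[h(x)] \approx c(m)
  $$

  then

  $$
  s(x, m) \approx 0.5
  $$

  → meaning **no clear evidence** either way

- If a point is isolated **slowly**:

  $$
  E[h(x)] \gg c(m)
  $$

  then

  $$
  s(x, m) \to 0 \quad (\text{well below } 0.5)
  $$

  → meaning **normal point**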

---

### ⚙️ Threshold Rule

We can set a **threshold (τ)** for anomaly classification:

$$
\text{If } s(x, m) > \tau \Rightarrow \text{Outlier}
$$

A common starting point is  
$$
\tau = 0.5
$$  
because 0.5 is the score of a point whose average path length equals $$c(m)$$, but it can be tuned depending on the dataset.
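
Here is a sketch of applying this rule with scikit-learn, on synthetic data invented for the example. One detail to be aware of: `score_samples` returns the *negative* of the score defined above (lower means more abnormal), so the sign is flipped to get back to $$s(x, m)$$. The variable names and the two planted outliers are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic data: 200 clustered points plus two planted far-away points
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    [[8.0, 8.0], [-9.0, 7.0]],
])

clf = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
clf.fit(X)

# score_samples gives the negated anomaly score, so flip the sign to recover s(x, m)
paper_scores = -clf.score_samples(X)

tau = 0.5                          # threshold from the rule above; tune per dataset
is_outlier = paper_scores > tau

print("scores of the planted points:", paper_scores[-2:])
print("number of points flagged:", int(is_outlier.sum()))
```

With `contamination="auto"`, scikit-learn's `predict` uses an offset of -0.5, which corresponds to τ = 0.5 on the flipped score. Passing a float `contamination` instead picks the threshold from a percentile of the training scores.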

---

## 🧱 Implementation Example (Python)

Let’s use the **Isolation Forest** from **scikit-learn**.

```python
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Example health dataset; assumed to contain two numeric feature columns
df = pd.read_csv("health_data.csv")

# contamination=0.2 tells the model to flag roughly 20% of the rows as anomalies
clf = IsolationForest(contamination=0.2, random_state=42)

# Fit the model
clf.fit(df)

# Predict anomalies: 1 → normal, -1 → anomaly
pred = clf.predict(df)
print(pred)

# Row positions of the outliers; np.where returns a tuple, so take element 0
outlier_index = np.where(pred < 0)[0]
print(outlier_index)
```

### Visualizing Outliers

```python
plt.figure(figsize=(8, 6))

# Plot every data point in blue
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], color='blue', label='Normal Data')

# Circle the detected anomalies in red
plt.scatter(df.iloc[outlier_index, 0],
            df.iloc[outlier_index, 1],
            edgecolors='red',
            facecolors='none',
            s=80,
            label='Outliers')

plt.title("Isolation Forest – Anomaly Detection")
plt.legend()
plt.show()
```

---

### 🎯 Output Interpretation

- Blue dots → Normal data points
- Red-circled dots → Detected anomalies

These isolated points are the ones identified by the Isolation Forest as potential outliers.

---

## 🧭 Summary

| Concept | Description |
|:--|:--|
| Algorithm | Isolation Forest |
| Type | Unsupervised |
| Core Idea | Isolate anomalies quickly using random splits |
| Output | Anomaly score between 0 and 1 |
| Threshold | > 0.5 → Outlier |
| Advantage | Fast, efficient, and scalable for high-dimensional data |
