## Anomaly Detection
## Question: 
Implement a function that identifies outliers in a dataset using statistical measures like mean and standard deviation. 
### Follow-up: 
Explain how your algorithm would handle large-scale data and how to optimize it.

### 1. Statistical Method - Z-Score

A data point whose absolute Z-score exceeds the specified threshold is considered an outlier.
#### Algorithm:
1. **Calculate the Mean ($\mu$)**: Compute the average of the dataset.
2. **Calculate the Standard Deviation ($\sigma$)**: Measure the spread of the data around the mean.
3. **Compute Z-Scores ($Z$)**: For each data point ($x$), calculate the Z-score, which indicates how many standard deviations the point lies from the mean:

$$
Z = \frac{x - \mu}{\sigma}
$$

4. **Set Threshold**: Define a threshold (e.g., 3). Data points with absolute Z-scores greater than the threshold are considered outliers.
#### Use Case:
- Best suited for data that is approximately **normally distributed**.
- Sensitive to **extreme outliers**: they inflate the mean and standard deviation, which shrinks every Z-score and can hide the very points being looked for.
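
A quick way to see this sensitivity is to compare the spread of the full sample with the spread of its central cluster. The sketch below uses the same sample values as this notebook's example; the `core` filter (values between 0 and 20) is an assumption chosen by eye for illustration, not part of the method.

```python
import statistics

data = [10, 12, 12, 11, -15, 13, 15, 100, 11, 12, 14, 120, -10]
core = [x for x in data if 0 <= x <= 20]  # central cluster, picked by eye (assumption)

mean_all = statistics.fmean(data)
std_all = statistics.pstdev(data)    # population standard deviation of every point
std_core = statistics.pstdev(core)   # spread once the extreme values are set aside

# Largest absolute Z-score when the extremes are allowed to shape mean and std.
max_abs_z = max(abs(x - mean_all) / std_all for x in data)

print(f"std of all points:   {std_all:.2f}")
print(f"std of core cluster: {std_core:.2f}")
print(f"largest |Z|:         {max_abs_z:.2f}")  # stays below 3
```

Because the extreme values dominate the standard deviation, even the most extreme point stays under a threshold of 3 on this sample.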

In [47]:
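import numpy as np

def detect_outliers_z_score(data, threshold):
    """Return the indices of points whose absolute Z-score exceeds threshold."""
    values = np.asarray(data, dtype=float)
    mean = values.mean()
    std_dev = values.std()  # population standard deviation (ddof=0)
    if std_dev == 0:
        return []  # all values are identical, so nothing can be an outlier
    z_scores = (values - mean) / std_dev
    return [i for i, z in enumerate(z_scores) if abs(z) > threshold]

#---------------------
# Example
#---------------------
data = [10, 12, 12, 11, -15, 13, 15, 100, 11, 12, 14, 120, -10]
# A threshold of 0.9 is used because the extreme values inflate the standard
# deviation: the largest |Z| here is about 2.53, so a threshold of 3 flags nothing.
outliers = detect_outliers_z_score(data, 0.9)
for index in outliers:
    print(f"Index: {index}, Value: {data[index]}")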

Index: 4, Value: -15
Index: 7, Value: 100
Index: 11, Value: 120
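
For the follow-up question on large-scale data, one option is to avoid holding the whole dataset in memory: the mean and standard deviation can be accumulated in a single streaming pass (Welford's algorithm), and a second pass flags the outliers. The sketch below is a minimal illustration under that assumption. `RunningStats` and `detect_outliers_streaming` are hypothetical names introduced here, and the data source is assumed to be re-iterable (for example, a file that can be read twice).

```python
import math

class RunningStats:
    """Accumulates count, mean, and squared deviations (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std_dev(self):
        # Population standard deviation, matching np.std's default (ddof=0).
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def detect_outliers_streaming(make_stream, threshold):
    """make_stream is a zero-argument callable that returns a fresh iterator."""
    stats = RunningStats()
    for x in make_stream():  # pass 1: statistics in O(1) memory
        stats.update(x)
    std_dev = stats.std_dev()
    if std_dev == 0:
        return []
    # Pass 2: re-read the stream and flag points beyond the threshold.
    return [i for i, x in enumerate(make_stream())
            if abs(x - stats.mean) / std_dev > threshold]


data = [10, 12, 12, 11, -15, 13, 15, 100, 11, 12, 14, 120, -10]
streamed = detect_outliers_streaming(lambda: iter(data), 0.9)
print(streamed)  # [4, 7, 11], the same indices as the in-memory version
```

Both passes run in O(n) time and constant extra memory apart from the list of flagged indices. When the data does fit in memory, vectorized NumPy operations are usually the simpler optimization.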


### 2. Statistical Method - IQR (Interquartile Range)
Data points that fall outside the range defined by a lower bound and an upper bound, both derived from the quartiles, are considered outliers.
#### Algorithm:
1. **Calculate Quartiles**:
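   - **Q1 (25th Percentile)**: The value below which 25% of the data falls (roughly the median of the lower half of the dataset).
   - **Q3 (75th Percentile)**: The value below which 75% of the data falls (roughly the median of the upper half of the dataset).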

2. **Calculate IQR**:
   $$
   \text{IQR} = Q3 - Q1
   $$
   - Measures the range of the middle 50% of the data.

3. **Determine Outlier Bounds**:
   - **Lower Bound**:
     $$
     Q1 - 1.5 \times \text{IQR}
     $$
   - **Upper Bound**:
     $$
     Q3 + 1.5 \times \text{IQR}
     $$

4. **Identify Outliers**:
   - Data points outside these bounds are classified as outliers.

#### Use Case:
- Effective for **non-normal or skewed distributions**.
- Robust to **extreme values**.
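
The robustness claim can be checked with a small experiment: making the largest value a thousand times larger should leave the IQR bounds untouched while the standard deviation grows enormously. This is a sketch using the standard library's `statistics.quantiles` with `method="inclusive"`, which should match the linear interpolation that `np.percentile` applies by default. `iqr_bounds` is a hypothetical helper name.

```python
import statistics

def iqr_bounds(values):
    """Return (lower, upper) outlier bounds from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 12, 11, -15, 13, 15, 100, 11, 12, 14, 120, -10]
inflated = [x * 1000 if x == 120 else x for x in data]  # 120 becomes 120000

print(iqr_bounds(data))      # (6.5, 18.5)
print(iqr_bounds(inflated))  # (6.5, 18.5): the quartiles do not move
print(statistics.pstdev(data) < statistics.pstdev(inflated))  # True
```

The quartiles depend only on the ranks of the middle values, so a single extreme point cannot shift the bounds.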


In [59]:
def detect_outliers_iqr(data):
    """Return the indices of points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = np.percentile(data, 25)    # 25th percentile (linear interpolation by default)
    q3 = np.percentile(data, 75)    # 75th percentile
    iqr = q3 - q1                   # range of the middle 50% of the data
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return [i for i, x in enumerate(data) if x < lower_bound or x > upper_bound]

#---------------------
# Example (reuses `data` from the Z-score example)
#---------------------
outliers_iqr = detect_outliers_iqr(data)
for index in outliers_iqr:
    print(f"index: {index}, value: {data[index]}")

index: 4, value: -15
index: 7, value: 100
index: 11, value: 120
index: 12, value: -10
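
For data too large to load at once, exact quartiles are harder to obtain than a running mean, since they depend on the ordering of the values. One possible approach, sketched here as an assumption rather than a definitive answer, is to estimate the quartiles from a fixed-size uniform random sample drawn with reservoir sampling, then flag outliers in a second pass. The function names, the sample size `k=500`, and the synthetic `make_stream` source are all hypothetical.

```python
import random
import statistics

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x
    return sample

def approximate_iqr_bounds(stream, k=500, seed=0):
    """Estimate the IQR outlier bounds from a reservoir sample of the stream."""
    sample = reservoir_sample(stream, k, seed)
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def make_stream():
    # Hypothetical data source: 10,000 values in 0..99 plus two planted extremes.
    yield from (i % 100 for i in range(10_000))
    yield 10_000
    yield -10_000

lower, upper = approximate_iqr_bounds(make_stream())
flagged = [x for x in make_stream() if x < lower or x > upper]
print(flagged)
```

Memory use is bounded by the sample size `k`. The trade-off is that the bounds are estimates, so points very close to a bound may be classified differently than with exact quartiles.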


### 3. Machine Learning - Isolation Forest
Isolation Forest identifies outliers by how easily they can be isolated from the rest of the data through random splits.
#### Algorithm:
1. **Tree-Based Isolation**:
   - Builds an ensemble of binary trees; each tree isolates data points by repeatedly selecting a random feature and splitting the data at a random value between that feature's minimum and maximum.
   - Anomalies are isolated more quickly because they lie in sparse regions.

2. **Calculate Path Length**:
   - The number of splits required to isolate a data point is called the path length.
   - Anomalies have shorter average path lengths compared to normal points.

3. **Score Calculation**:
   - An anomaly score is computed for each point based on its path length.
   - A higher score indicates a higher likelihood of being an anomaly.

4. **Threshold**:
   - Based on a contamination parameter (fraction of anomalies expected in the dataset), classify points as normal or anomalous.
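
The steps above can be sketched in plain Python for one-dimensional data. This is a simplified illustration, not a full Isolation Forest: it assumes a single feature, it draws fresh random splits for each point instead of sharing fitted trees, and it uses the path-length normalization $c(n)$ from the original Isolation Forest formulation. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def isolation_depth(x, points, rng, max_depth, depth=0):
    """Count random splits until x is alone (or the depth limit is reached)."""
    lo, hi = min(points), max(points)
    if len(points) <= 1 or lo == hi or depth >= max_depth:
        # Unresolved points are charged the average depth of a subtree of that size.
        return depth + average_path_length(len(points))
    split = rng.uniform(lo, hi)
    same_side = [p for p in points if (p < split) == (x < split)]
    return isolation_depth(x, same_side, rng, max_depth, depth + 1)

def anomaly_scores(data, n_trees=200, seed=0):
    """Scores between 0 and 1; shorter average paths give scores closer to 1."""
    n = len(data)
    if n < 2:
        return [0.0] * n
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(n))
    scores = []
    for x in data:
        mean_depth = sum(isolation_depth(x, data, rng, max_depth)
                         for _ in range(n_trees)) / n_trees
        scores.append(2.0 ** (-mean_depth / average_path_length(n)))
    return scores

data = [10, 12, 12, 11, -15, 13, 15, 100, 11, 12, 14, 120, -10]
scores = anomaly_scores(data)
ranked = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
for i in ranked[:3]:
    print(f"index {i}: value {data[i]}, score {scores[i]:.2f}")
```

Points in sparse regions, such as 100 and 120 here, tend to be separated after very few splits and therefore receive higher scores than points inside the dense cluster.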

#### Use Case:
- Handles **high-dimensional data** and **non-linear relationships**.
- Suitable for **large-scale datasets**.
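
In practice a library implementation is the usual choice. The sketch below assumes scikit-learn is installed and uses its `IsolationForest` estimator on the same sample values as the earlier examples. The `contamination=0.15` setting is an assumption made for this small dataset, not a recommended default.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

data = [10, 12, 12, 11, -15, 13, 15, 100, 11, 12, 14, 120, -10]
X = np.asarray(data, dtype=float).reshape(-1, 1)  # one feature per row

model = IsolationForest(n_estimators=100, contamination=0.15, random_state=42)
labels = model.fit_predict(X)      # -1 marks anomalies, 1 marks normal points

# score_samples returns lower values for more abnormal points, so negate it
# to match the "higher score = more anomalous" convention used above.
scores = -model.score_samples(X)

outliers = np.flatnonzero(labels == -1).tolist()
for index in outliers:
    print(f"index: {index}, value: {data[index]}, score: {scores[index]:.3f}")
```

For large datasets, each tree is fitted on a subsample (controlled by `max_samples`), which keeps training cost low, and `n_jobs` can spread the trees across CPU cores.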
