# Outlier Detection Methods in Python

This notebook demonstrates four ways to identify outliers in a dataset: the Z-score method, the IQR method, box plot visualization, and DBSCAN clustering.

## 1. Import Required Libraries

We import NumPy and pandas for data manipulation, Matplotlib for visualization, and scikit-learn for clustering.

In [None]:
# Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

## 2. Create Sample DataFrame

Let's create a pandas DataFrame with invoice IDs and corresponding amounts to be used for outlier detection.

In [None]:
# Create the DataFrame
data = {
    'Invoice ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'Amount': [185, 188, 192, 191, 187, 190, 194, 195, 1, 189, 120, 450, 189, 190, 193, 191, 187, 188, 186, 189]
}
df = pd.DataFrame(data)
df.head()

### Visualization

In [None]:
# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Invoice ID'], df['Amount'])
plt.title('Invoice Amounts')
plt.xlabel('Invoice ID')
plt.ylabel('Amount')
plt.show()


## 3. Identify Outliers Using Z-Score Method

We calculate the mean and standard deviation of the 'Amount' column, compute a Z-score for each invoice, and flag a data point as an outlier when the absolute value of its Z-score exceeds a threshold.
The Z-score measures how many standard deviations a value lies from the mean. Common thresholds are 2 or 3.
Values beyond the chosen threshold are treated as potential outliers.

In [None]:
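# Calculate the mean and standard deviation
# Note: np.std defaults to the population standard deviation (ddof=0),
# whereas pandas' Series.std() defaults to the sample version (ddof=1)
mean = np.mean(df['Amount'])
std_dev = np.std(df['Amount'])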

# Calculate the Z-score for each data point
df['Z-Score'] = (df['Amount'] - mean) / std_dev

# Define a threshold for identifying outliers
threshold = 2

# Identify outliers
df['Z-Score Outlier'] = np.abs(df['Z-Score']) > threshold

# Display the results
df[['Invoice ID', 'Amount', 'Z-Score', 'Z-Score Outlier']]
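
The choice of threshold matters on a sample this small, because the extreme values themselves inflate the mean and standard deviation. The sketch below repeats the calculation on the same amounts with thresholds of 2 and 3. `flag_by_threshold` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

# Same invoice amounts as in the DataFrame above
amounts = np.array([185, 188, 192, 191, 187, 190, 194, 195, 1, 189,
                    120, 450, 189, 190, 193, 191, 187, 188, 186, 189])

# Population Z-scores (ndarray.std defaults to ddof=0)
z_scores = (amounts - amounts.mean()) / amounts.std()


def flag_by_threshold(threshold):
    """Hypothetical helper: sorted amounts whose |Z-score| exceeds the threshold."""
    return sorted(amounts[np.abs(z_scores) > threshold].tolist())


flagged_at_2 = flag_by_threshold(2)
flagged_at_3 = flag_by_threshold(3)
print("threshold 2:", flagged_at_2)
print("threshold 3:", flagged_at_3)
```

On this data a threshold of 2 flags the amounts 1 and 450, a threshold of 3 flags only 450, and the amount 120 is flagged by neither, even though it sits far below the bulk of the invoices.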

### Visualization

In [None]:
# Create a scatter plot, highlighting outliers
plt.figure(figsize=(10, 6))
plt.scatter(df['Invoice ID'], df['Amount'], c=df['Z-Score Outlier'], cmap='coolwarm', label='Data Points')
plt.title('Invoice Amounts with Outliers Highlighted')
plt.xlabel('Invoice ID')
plt.ylabel('Amount')
plt.axhline(y=mean + threshold * std_dev, color='r', linestyle='--', label='Upper Threshold')
plt.axhline(y=mean - threshold * std_dev, color='g', linestyle='--', label='Lower Threshold')
plt.legend()
plt.show()

## 4. Identify Outliers Using IQR Method

We will compute the first and third quartiles (Q1, Q3), calculate the interquartile range (IQR), and identify outliers outside the lower and upper bounds.

In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Amount'].quantile(0.25)
Q3 = df['Amount'].quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
df['IQR Outlier'] = (df['Amount'] < lower_bound) | (df['Amount'] > upper_bound)

# Display the results
df[['Invoice ID', 'Amount', 'IQR Outlier']]
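
To make the bounds concrete, the following sketch wraps the same calculation in a small function and prints the numbers for this dataset. The name `iqr_bounds` and its `k` parameter are assumptions made for illustration. The quartiles rely on pandas' default linear interpolation.

```python
import pandas as pd

# Same invoice amounts as in the DataFrame above
amounts = pd.Series([185, 188, 192, 191, 187, 190, 194, 195, 1, 189,
                     120, 450, 189, 190, 193, 191, 187, 188, 186, 189])


def iqr_bounds(values, k=1.5):
    """Hypothetical helper: return (Q1, Q3, lower bound, upper bound)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr


q1, q3, lower, upper = iqr_bounds(amounts)
iqr_outliers = sorted(amounts[(amounts < lower) | (amounts > upper)].tolist())

print(f"Q1={q1}, Q3={q3}, IQR={q3 - q1}")
print(f"lower bound={lower}, upper bound={upper}")
print("outliers:", iqr_outliers)
```

Because the quartiles ignore how extreme the tails are, the bounds stay close to the bulk of the data (roughly 180.6 to 197.6 here), so the amount 120 is flagged along with 1 and 450.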

### Visualization

In [None]:
# Create a scatter plot, highlighting outliers by IQR method
plt.figure(figsize=(10, 6))
plt.scatter(df['Invoice ID'], df['Amount'], c=df['IQR Outlier'], cmap='coolwarm', label='Data Points')
plt.title('Invoice Amounts with Outliers Highlighted (IQR Method)')
plt.xlabel('Invoice ID')
plt.ylabel('Amount')
plt.axhline(y=upper_bound, color='r', linestyle='--', label='Upper Bound')
plt.axhline(y=lower_bound, color='g', linestyle='--', label='Lower Bound')
plt.legend()
plt.show()

## 5. Visualize Outliers with Box Plot

Let's plot a box plot of the 'Amount' column to visually inspect and highlight outliers.

In [None]:
plt.figure(figsize=(10, 6))
plt.boxplot(df['Amount'], vert=False)
plt.xlabel('Amount')
plt.title('Box Plot of Invoice Amounts')
plt.show()
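
The points drawn beyond the whiskers can also be read off numerically. As a minimal sketch, `matplotlib.cbook.boxplot_stats` returns the statistics Matplotlib uses to draw a box plot, with whiskers at 1.5 times the IQR by default. The variable names below are illustrative.

```python
import numpy as np
from matplotlib import cbook

# Same invoice amounts as in the DataFrame above
amounts = np.array([185, 188, 192, 191, 187, 190, 194, 195, 1, 189,
                    120, 450, 189, 190, 193, 191, 187, 188, 186, 189])

# One dict of statistics per dataset; we passed a single 1-D dataset
stats = cbook.boxplot_stats(amounts)[0]

# 'fliers' holds the values that fall outside the whiskers
fliers = sorted(stats['fliers'].tolist())

print("whiskers:", stats['whislo'], "to", stats['whishi'])
print("fliers:", fliers)
```

The whiskers end at the most extreme data points still inside the 1.5 × IQR fences, so the fliers should match the outliers found by the IQR method.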

## 6. Identify Outliers Using DBSCAN

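We apply the DBSCAN clustering algorithm to the 'Amount' data. DBSCAN groups points that have at least `min_samples` neighbors within a distance of `eps`, and it assigns the label -1 to noise points that belong to no cluster. We treat those noise points as outliers.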

In [None]:
# Reshape the 'Amount' column to be a 2D array
X = df['Amount'].values.reshape(-1, 1)

# Apply DBSCAN
dbscan = DBSCAN(eps=10, min_samples=3)
dbscan.fit(X)

# Add the DBSCAN labels to the DataFrame
df['DBSCAN_Label'] = dbscan.labels_

# Identify outliers (DBSCAN labels outliers as -1)
df['DBSCAN Outlier'] = df['DBSCAN_Label'] == -1

# Display the results
df[['Invoice ID', 'Amount', 'DBSCAN_Label', 'DBSCAN Outlier']]
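
The result depends heavily on `eps`. The sketch below compares the value used above (`eps=10`) with a much larger, purely illustrative value (`eps=80`). `noise_amounts` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same invoice amounts as in the DataFrame above
amounts = np.array([185, 188, 192, 191, 187, 190, 194, 195, 1, 189,
                    120, 450, 189, 190, 193, 191, 187, 188, 186, 189])
X = amounts.reshape(-1, 1)  # DBSCAN expects a 2D array


def noise_amounts(eps, min_samples=3):
    """Hypothetical helper: sorted amounts that DBSCAN labels as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return sorted(amounts[labels == -1].tolist())


noise_eps_10 = noise_amounts(eps=10)
noise_eps_80 = noise_amounts(eps=80)
print("eps=10:", noise_eps_10)
print("eps=80:", noise_eps_80)
```

With `eps=10` the amounts 1, 120, and 450 are all noise. With `eps=80` the amount 120 is close enough to the main group to be absorbed into its cluster, so only 1 and 450 remain flagged.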

### Visualization

In [None]:
# Create a scatter plot to visualize the outliers identified by DBSCAN.
# X: Invoice ID, Y: Amount, Color: DBSCAN label (-1 = noise/outlier)
plt.figure(figsize=(10, 6))
plt.scatter(df['Invoice ID'], df['Amount'], c=df['DBSCAN_Label'], cmap='coolwarm')
plt.title('DBSCAN Clustering of Invoice Amounts')
plt.xlabel('Invoice ID')
plt.ylabel('Amount')
plt.colorbar(label='DBSCAN Label')
plt.show()
