## Using AI for Anomaly Detection in Data Quality
**Description**: Use a machine-learning model to flag anomalous records as potential data quality issues.

**Steps**:
1. Use an Anomaly Detection Algorithm:
    - Use sklearn's Isolation Forest for anomaly detection.
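
As a rough illustration of what the model returns, the sketch below fits an Isolation Forest on five complete rows (the example values with the missing income already filled in as 67500) and prints both the hard labels and the continuous scores. The variable names are chosen for illustration; `fit_predict` and `decision_function` are standard scikit-learn methods.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative values only: complete rows, no missing data
X = np.array(
    [[25, 50000], [30, 60000], [35, 75000], [40, 67500], [45, 100000]],
    dtype=float,
)

clf = IsolationForest(n_estimators=100, contamination=0.2, random_state=42)
labels = clf.fit_predict(X)        # 1 = inlier, -1 = outlier
scores = clf.decision_function(X)  # negative = outlier; lower = more anomalous

# Rank rows from most to least anomalous
for idx in np.argsort(scores):
    print(f"row {idx}: score={scores[idx]:+.3f} label={labels[idx]}")
```

Looking at the scores rather than only the labels shows how far each row is from the threshold, which is useful when deciding whether a flagged row is actually worth an alert.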

**Example data** (columns are age and income; one income value is missing):

data = np.array([[25, 50000], [30, 60000], [35, 75000], [40, None], [45, 100000]])
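
Because the array contains `None`, NumPy stores it with `dtype=object`, and the missing value is itself a data quality finding. A minimal sketch, assuming it is worth recording missing counts before any imputation replaces them (the names `numeric` and `missing_per_column` are illustrative):

```python
import numpy as np
import pandas as pd

data = np.array([[25, 50000], [30, 60000], [35, 75000], [40, None], [45, 100000]])
print(data.dtype)  # object, because None is not a number

df = pd.DataFrame(data, columns=["Age", "Income"])

# Coerce each column to a numeric dtype; None becomes NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Count missing values per column before they are imputed away
missing_per_column = numeric.isna().sum()
print(missing_per_column)
```

Median imputation fills the gap with a plausible number, so the detector is unlikely to flag that row; keeping the missing count separately preserves the information.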

2. Integrate with Great Expectations:
    - Generate alerts if anomalies are detected. The example below only prints an alert; it does not call Great Expectations itself:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

# Example data with a missing value (None as object)
data = np.array([[25, 50000], [30, 60000], [35, 75000], [40, None], [45, 100000]])

# Convert to DataFrame
df = pd.DataFrame(data, columns=["Age", "Income"])

# Ensure numeric types (very important!)
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # Convert to numeric, set invalids as NaN

# Drop rows that are fully NaN (optional, depending on strategy)
df.dropna(how='all', inplace=True)

# Check if DataFrame is empty after cleaning
if df.empty:
    raise ValueError("DataFrame is empty after cleaning. No valid numeric data to process.")

# Drop columns with no valid values: SimpleImputer discards such columns,
# which would make the array shape disagree with df.columns below
df = df.dropna(axis=1, how='all')

# Impute remaining missing values using the column median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

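# Isolation Forest for anomaly detection.
# contamination=0.2 tells the model to treat about 20% of rows as anomalies,
# so with 5 rows roughly one row is flagged regardless of how unusual it is
clf = IsolationForest(n_estimators=100, contamination=0.2, random_state=42)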
df_imputed["Anomaly"] = clf.fit_predict(df_imputed)

# Map output to labels for clarity
df_imputed["Anomaly_Label"] = df_imputed["Anomaly"].map({1: "Normal", -1: "Anomaly"})

# Show results
print("Detected Anomalies:")
print(df_imputed)

# Alert if any anomalies are found
if (df_imputed["Anomaly_Label"] == "Anomaly").any():
    print("ALERT: Data quality issues detected due to anomalies.")
else:
    print("Data quality is good.")

Detected Anomalies:
    Age    Income  Anomaly Anomaly_Label
0  25.0   50000.0        1        Normal
1  30.0   60000.0        1        Normal
2  35.0   75000.0        1        Normal
3  40.0   67500.0        1        Normal
4  45.0  100000.0       -1       Anomaly
ALERT: Data quality issues detected due to anomalies.
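
The alert above is a plain `print`. One way to make the result easier to pass to a validation or alerting tool is to reduce the labels to a pass/fail summary. The sketch below is a hypothetical helper, not Great Expectations API: the function name `summarize_anomalies`, the `max_anomaly_fraction` parameter, and the dictionary keys are all assumptions made for illustration. The labels are copied from the output shown above.

```python
import pandas as pd


def summarize_anomalies(labels: pd.Series, max_anomaly_fraction: float = 0.0) -> dict:
    """Hypothetical helper: turn Isolation Forest labels (1 / -1) into a pass/fail summary."""
    is_anomaly = labels == -1
    anomaly_count = int(is_anomaly.sum())
    total = int(labels.size)
    fraction = anomaly_count / total if total else 0.0
    return {
        "success": fraction <= max_anomaly_fraction,
        "anomaly_count": anomaly_count,
        "anomaly_fraction": fraction,
        "anomaly_rows": labels.index[is_anomaly].tolist(),
    }


# Labels taken from the "Anomaly" column in the output above
labels = pd.Series([1, 1, 1, 1, -1], name="Anomaly")
result = summarize_anomalies(labels)

if not result["success"]:
    print(f"ALERT: {result['anomaly_count']} anomalous row(s) at index {result['anomaly_rows']}")
else:
    print("Data quality is good.")
```

A tolerance such as `max_anomaly_fraction` matters here because a fixed `contamination` value makes the model flag a share of rows on every run; alerting only above a chosen fraction is one possible way to avoid constant alerts.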
