### What is Data Drift in Machine Learning?

**Data drift** refers to changes in the statistical properties of input data over time. Machine learning models are trained with the assumption that the data distribution remains stable. When this distribution shifts, model performance can degrade, leading to less accurate predictions.

### Why Does Data Drift Happen?

Data drift can occur due to several reasons:

- **Changes in user behavior**: Users may change their preferences or actions over time.
- **Seasonality**: Periodic changes, such as holiday seasons, can impact data patterns.
- **External factors**: Economic conditions, new regulations, or unforeseen events can alter the data distribution.
- **Measurement changes**: A change in how data is collected (e.g., sensors or instruments) can cause drift.

### Types of Data Drift

1. **Covariate Shift**: The distribution of input features (X) changes, but the relationship between the input and the target remains the same.
2. **Prior Probability Shift**: The distribution of the target variable (y) changes over time. For example, the proportion of different classes in a classification task may shift.
3. **Concept Drift**: The relationship between input features and the target variable changes. This is more complex, as it means the model's learned patterns no longer apply (e.g., changes in customer behavior due to external events).
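
To make the three types concrete, here is a minimal synthetic sketch. The `label` rule, the thresholds, and the sample sizes are illustrative assumptions, not taken from any real dataset. A fixed "model" (the original labeling rule) is scored on each shifted sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x, threshold=0.0):
    """Hypothetical labeling rule: the target is 1 when x exceeds the threshold."""
    return (x > threshold).astype(int)

# Reference period: X ~ N(0, 1), y follows the original rule
x_ref = rng.normal(0.0, 1.0, 10_000)
y_ref = label(x_ref)

# Covariate shift: P(X) moves, the rule P(y | X) is unchanged
x_cov = rng.normal(1.0, 1.0, 10_000)
y_cov = label(x_cov)

# Prior probability shift: class proportions change, P(X | y) is unchanged.
# Simulated here by keeping every class-0 row and about 20% of class-1 rows.
keep = (y_ref == 0) | (rng.random(y_ref.size) < 0.2)
x_prior, y_prior = x_ref[keep], y_ref[keep]

# Concept drift: P(X) is unchanged, but the rule linking X to y changes
x_con = rng.normal(0.0, 1.0, 10_000)
y_con = label(x_con, threshold=1.0)

def rule_accuracy(x, y):
    """Accuracy of a 'model' that still applies the original rule (threshold 0)."""
    return float(np.mean(label(x) == y))

acc_cov = rule_accuracy(x_cov, y_cov)
acc_prior = rule_accuracy(x_prior, y_prior)
acc_con = rule_accuracy(x_con, y_con)

print(f"Covariate shift: mean(X) {x_ref.mean():.2f} -> {x_cov.mean():.2f}, accuracy {acc_cov:.2f}")
print(f"Prior shift:     mean(y) {y_ref.mean():.2f} -> {y_prior.mean():.2f}, accuracy {acc_prior:.2f}")
print(f"Concept drift:   accuracy {acc_con:.2f}")
```

In this toy setup the model is the true rule, so only concept drift lowers its accuracy. A real model is an approximation fitted on the reference data, and covariate shift can still hurt it by moving inputs into regions it saw rarely during training.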

### How to Identify Data Drift

You can detect data drift using the following methods:

- **Statistical Tests**:
  - **Kolmogorov-Smirnov Test**: Compares the distributions of a continuous feature in two samples (for example, training data versus recent production data).

In [2]:
import pandas as pd
from scipy import stats

# Example Series representing one feature at two different times
data1 = pd.Series([1.2, 2.3, 3.5, 4.6, 5.9, 7.2, 8.3])
data2 = pd.Series([1.1, 2.1, 3.6, 4.5, 6.0, 7.1, 8.5])

# Perform the KS test
ks_statistic, p_value = stats.ks_2samp(data1, data2)

print(f"KS Statistic: {ks_statistic}, P-value: {p_value}")
if p_value < 0.05:
    print("Data drift detected (significant difference in distributions).")
else:
    print("No data drift detected.")

KS Statistic: 0.14285714285714285, P-value: 0.9999609537692629
No data drift detected.
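
In practice a dataset has many features, so the same test is usually run per column. The following is a small sketch of that idea. The helper name `ks_drift_report`, the column names, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_drift_report(reference, current, alpha=0.05):
    """Runs a two-sample KS test on every column shared by both DataFrames."""
    rows = []
    for column in reference.columns:
        statistic, p_value = stats.ks_2samp(reference[column], current[column])
        rows.append({
            "feature": column,
            "ks_statistic": statistic,
            "p_value": p_value,
            "drift": p_value < alpha,
        })
    return pd.DataFrame(rows).set_index("feature")

# Synthetic example: "age" is drawn from the same distribution in both
# periods, while the mean of "income" is shifted in the current period.
rng = np.random.default_rng(1)
reference = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50_000, 8_000, 500),
})
current = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(56_000, 8_000, 500),
})

report = ks_drift_report(reference, current)
print(report)
```

Testing many features at the same `alpha` increases the chance of a false alarm on at least one of them, so a stricter threshold (or a multiple-testing correction) may be appropriate when the feature count is large.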


  - **Chi-Square Test**: Often used for categorical variables.

In [3]:
from scipy.stats import chi2_contingency

# Example categorical data at two different time points
observed_data = pd.Series(['A', 'B', 'A', 'A', 'B', 'C', 'A'])
new_data = pd.Series(['A', 'B', 'B', 'B', 'C', 'A', 'A'])

# Create a frequency table
observed_freq = observed_data.value_counts().sort_index()
new_freq = new_data.value_counts().sort_index()

# Align the indexes and fill missing values
observed_freq, new_freq = observed_freq.align(new_freq, fill_value=0)

# Perform the Chi-Square test
chi2_stat, p_value, _, _ = chi2_contingency([observed_freq, new_freq])

print(f"Chi-Square Statistic: {chi2_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("Data drift detected (significant difference in categorical distribution).")
else:
    print("No data drift detected.")

Chi-Square Statistic: 0.34285714285714286, P-value: 0.8424604416167714
No data drift detected.


  - **Population Stability Index (PSI)**: Measures how much a feature's distribution has shifted.

In [4]:
import numpy as np

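# Function to calculate PSI
def calculate_psi(expected, actual, buckets=10):
    """Calculates PSI for a single feature scaled to the range [0, 1].

    Values outside [0, 1] fall outside every bucket and are not counted.
    """
    breakpoints = np.linspace(0, 1, buckets + 1)  # Equal-width bucket edges
    expected_perc = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    actual_perc = np.histogram(actual, bins=breakpoints)[0] / len(actual)

    # Empty buckets produce inf or nan terms; silence the NumPy warnings
    # here and drop those terms below.
    with np.errstate(divide='ignore', invalid='ignore'):
        psi_values = (expected_perc - actual_perc) * np.log(expected_perc / actual_perc)
    psi = np.sum(psi_values[np.isfinite(psi_values)])  # Ignore nan or inf
    return psi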

# Example data
expected_data = np.random.normal(0.5, 0.1, 1000)  # Original data
actual_data = np.random.normal(0.6, 0.2, 1000)    # New data with drift

# Calculate PSI
psi_value = calculate_psi(expected_data, actual_data)
print(f"PSI Value: {psi_value}")

# PSI thresholds:
if psi_value < 0.1:
    print("No significant drift.")
elif psi_value < 0.2:
    print("Slight drift.")
else:
    print("Significant drift detected.")

PSI Value: 0.6216175723509778
Significant drift detected.
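
The function above uses fixed buckets on [0, 1] and drops terms from empty buckets, which can understate drift for features on other scales. One possible variant, sketched here under the assumption that the feature is continuous, takes the bucket edges from quantiles of the reference sample and floors empty buckets at a small `eps`. The name `psi_quantile` and the value of `eps` are illustrative choices.

```python
import numpy as np

def psi_quantile(expected, actual, buckets=10, eps=1e-6):
    """PSI with bucket edges taken from quantiles of the expected sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior edges only, so the first and last buckets are open-ended
    # and every value lands in some bucket.
    edges = np.quantile(expected, np.linspace(0, 1, buckets + 1)[1:-1])

    expected_counts = np.bincount(
        np.searchsorted(edges, expected, side="right"), minlength=buckets)
    actual_counts = np.bincount(
        np.searchsorted(edges, actual, side="right"), minlength=buckets)

    # Floor the proportions at eps so empty buckets still contribute
    expected_perc = np.clip(expected_counts / len(expected), eps, None)
    actual_perc = np.clip(actual_counts / len(actual), eps, None)

    return float(np.sum((actual_perc - expected_perc)
                        * np.log(actual_perc / expected_perc)))

rng = np.random.default_rng(42)
reference = rng.normal(0.5, 0.1, 1000)   # Original data
same = rng.normal(0.5, 0.1, 1000)        # New sample, same distribution
shifted = rng.normal(0.6, 0.2, 1000)     # New sample with drift

psi_same = psi_quantile(reference, same)
psi_shifted = psi_quantile(reference, shifted)
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (shifted distribution): {psi_shifted:.4f}")
```

Because each reference bucket holds roughly the same number of points, no bucket of the reference sample is empty, and values far outside the reference range are still counted in the outermost buckets.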


- **Model Performance Monitoring**: Degrading performance (accuracy, precision, recall) can be a sign of data drift.

In [5]:
from sklearn.metrics import accuracy_score

# Example: Accuracy of a model on two datasets (before and after drift)
true_labels = [0, 1, 1, 0, 1, 1, 0]
predicted_labels_before = [0, 1, 1, 0, 1, 1, 0]
predicted_labels_after = [0, 0, 1, 0, 0, 1, 0]

# Calculate accuracy before and after
accuracy_before = accuracy_score(true_labels, predicted_labels_before)
accuracy_after = accuracy_score(true_labels, predicted_labels_after)

print(f"Accuracy Before Drift: {accuracy_before}")
print(f"Accuracy After Drift: {accuracy_after}")

if accuracy_before - accuracy_after > 0.1:  # Arbitrary threshold for drop
    print("Significant performance drop detected, possible data drift.")
else:
    print("No significant performance drop.")

Accuracy Before Drift: 1.0
Accuracy After Drift: 0.7142857142857143
Significant performance drop detected, possible data drift.


- **Visualization**: Comparing data distributions across time periods can help visually detect drift.
- **Drift Detection Algorithms**:
  - **DDM (Drift Detection Method)** and **EDDM (Early Drift Detection Method)**: Algorithms that monitor error rates for drift detection.

In [6]:
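import numpy as np

class DDM:
    """Simplified Drift Detection Method (after Gama et al., 2004).

    Consumes one prediction outcome at a time (1 = error, 0 = correct).
    """

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # Warm-up before any alarm can fire
        self.n = 0                      # Predictions seen so far
        self.errors = 0                 # Errors seen so far
        self.p_min = float('inf')       # Error rate at the best point so far
        self.s_min = float('inf')       # Standard deviation at that point

    def add_error(self, is_error):
        """Updates the running error rate; returns 'stable', 'warning', or 'drift'."""
        self.n += 1
        self.errors += is_error
        p = self.errors / self.n           # Running error rate
        s = np.sqrt(p * (1 - p) / self.n)  # Its binomial standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s  # New best point
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"

# Simulate a stream of prediction outcomes: one error in every 10 predictions,
# then one error in every 3 (a simulated drift in the error rate)
errors = [int(i % 10 == 9) for i in range(300)] + [int(i % 3 == 2) for i in range(300)]

ddm = DDM()
previous_status = "stable"
for i, is_error in enumerate(errors):
    status = ddm.add_error(is_error)
    if status == "drift":
        print(f"Drift detected at sample {i}.")
        break
    if status == "warning" and previous_status != "warning":
        print(f"Warning: Possible drift at sample {i}.")
    previous_status = status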

### How to Mitigate the Effects of Data Drift

1. **Regular Model Retraining**: Retrain your model periodically to adapt to new data distributions.
2. **Online Learning**: Use algorithms that can learn incrementally, updating the model as new data arrives.
3. **Feature Engineering**: Adjust or create new features dynamically to account for changes in data patterns.
4. **Deploy Drift Detection Systems**: Implement systems that automatically detect drift and alert you to take action.
5. **Monitor Input Data**: Regularly check for shifts in the distribution of input features.
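
As a rough illustration of the online learning option, the sketch below updates a scikit-learn `SGDClassifier` incrementally with `partial_fit`. The synthetic batches, the `make_batch` helper, and the drift (a change in the labeling threshold) are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_batch(threshold, n=500):
    """Hypothetical data source: y = 1 when x0 + x1 exceeds the threshold."""
    X = rng.normal(0.0, 1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > threshold).astype(int)
    return X, y

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full set of classes up front

# Initial training on batches collected before the drift
for _ in range(20):
    X, y = make_batch(threshold=0.0)
    model.partial_fit(X, y, classes=classes)

# After the drift, the relationship between X and y has changed
X_new, y_new = make_batch(threshold=1.0, n=2000)
accuracy_stale = model.score(X_new, y_new)

# Keep updating the same model as post-drift batches arrive
for _ in range(20):
    X, y = make_batch(threshold=1.0)
    model.partial_fit(X, y, classes=classes)

accuracy_updated = model.score(X_new, y_new)

print(f"Accuracy before updating: {accuracy_stale:.3f}")
print(f"Accuracy after updating:  {accuracy_updated:.3f}")
```

Incremental updates need fresh labels to arrive with the new data. When labels are delayed or unavailable, periodic retraining or input monitoring may be the more practical option.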

### What to Do When Data Drift Happens

When data drift is detected, take the following actions:

1. **Diagnose the Drift**: Identify whether the drift is due to covariate shift, prior probability shift, or concept drift.
2. **Retrain the Model**: Use recent data to retrain the model. If it's concept drift, you may need to re-engineer features or even change the model architecture.
3. **Model Refresh**: If the drift is severe, consider replacing the model outright instead of only retraining it.
4. **Deploy Adaptive Models**: In cases of continuous drift, online learning or adaptive models may be more suitable.
5. **Investigate External Factors**: If the drift is driven by external events (e.g., new regulations), further adjustments might be required.


## Evidently 

[Evidently](https://github.com/evidentlyai/evidently) is an open-source Python library for testing and monitoring data and ML models, including detecting data drift.
