# Concept Drift: An Illustration

**Concept drift** (at least for me) is one of those mathy terms that combines two familiar words into a kind of metafictional, sci-fi phrase that doesn't mean anything at first glance. It sounds like something a Replicant would be accused of in *Blade Runner*. Unfortunately, concept drift is both real and difficult to pinpoint. (Which muddies the *Blade Runner* metaphor a bit. Alas.)

In machine learning terms, **concept drift** describes a scenario where the observed input data continues to look stable and familiar, but the relationship between inputs and outcomes has shifted over time. It's somewhat related to **data drift** - when a model's training data diverges from the real-world data it's meant to predict - but it's sneakier than that. To continue the sci-fi theme, concept drift is closer to *Invasion of the Body Snatchers*: everything looks the same and even still follows the rules, but the thing you think you're dealing with is no longer what it was.
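One common way to write the distinction down is sketched below. The notation is an assumption of this sketch rather than something defined in this notebook: $X$ stands for the inputs, $Y$ for the outcome, $P_{\text{train}}$ for the period the model was trained on, and $P_{\text{now}}$ for the present.

```latex
\begin{aligned}
\text{data drift:} \quad & P_{\text{now}}(X) \neq P_{\text{train}}(X) \\
\text{concept drift:} \quad & P_{\text{now}}(Y \mid X) \neq P_{\text{train}}(Y \mid X),
  \quad \text{here with } P_{\text{now}}(X) \approx P_{\text{train}}(X)
\end{aligned}
```

In words: with data drift the inputs themselves look different, while with concept drift the inputs can look the same and the mapping from inputs to outcomes is what has changed.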

## Example
First, we'll create a reference data set ("the good ole days"), which the model was initially trained on (*df_initial*). For this scenario, we have a random selection of incomes from $30,000 per year to $100,000 per year. 

Let's say that when this model was trained, it was a given that anyone earning more than $65,000 per year repaid their loans.

Next, we'll create a dataframe representing data collected more recently (*df_current*). Let's say that since the model was created and trained, a recession has hit. In "the good ole days", anyone earning over $65,000 repaid their loans. But these days, inflation is up and life is more expensive, while incomes have stayed the same. Now, only people earning more than *$85,000* actually repay their loans.


In [None]:
import pandas as pd
import numpy as np

# fix the seed so the random incomes are reproducible
np.random.seed(17)

# creating the initial data set
df_initial = pd.DataFrame(np.random.randint(30000, 100000, size=1000), columns=['income'])
# in the good ole days, anyone earning above 65000 repaid their loan
df_initial['repaid'] = (df_initial['income'] > 65000).astype(int)
# the model learned the same >65k rule from this training data
df_initial['predicted_repaid'] = (df_initial['income'] > 65000).astype(int)

# creating the current data set - same income range
df_current = pd.DataFrame(np.random.randint(30000, 100000, size=1000), columns=['income'])
# now only people earning above 85000 repay their loan
df_current['repaid'] = (df_current['income'] > 85000).astype(int)
# model keeps predicting repayment at >65k
df_current['predicted_repaid'] = (df_current['income'] > 65000).astype(int)

In [38]:
print(f"Good Ole Days Income Mean: ${df_initial['income'].mean():.2f}")
print(f"Good Ole Days Repayment Rate: {df_initial['repaid'].mean():.2%}")
print(f"Good Ole Days Predicted Repayment Rate: {df_initial['predicted_repaid'].mean():.2%}")
print("--------------------------------")
print(f"Current Income Mean: ${df_current['income'].mean():.2f}")
print(f"Current Repayment Rate: {df_current['repaid'].mean():.2%}")
print(f"Current Predicted Repayment Rate: {df_current['predicted_repaid'].mean():.2%}")

Good Ole Days Income Mean: $65934.32
Good Ole Days Repayment Rate: 51.50%
Good Ole Days Predicted Repayment Rate: 51.50%
--------------------------------
Current Income Mean: $65403.77
Current Repayment Rate: 23.00%
Current Predicted Repayment Rate: 50.20%


The model is still applying the rule it learned from the training data, but the world has moved on. It predicts that 50.20% of current borrowers will repay, while only 23.00% actually do.
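To put a number on how far the predictions have drifted from reality, one option is to compare outcomes and predictions row by row. The sketch below is illustrative only: `prediction_report` is a hypothetical helper (not part of pandas or of this notebook), and the four incomes are made-up toy values chosen so the result can be checked by hand.

```python
def prediction_report(actual, predicted):
    """Compare 0/1 outcomes with 0/1 predictions.

    Accepts any two equal-length iterables of 0/1 values, so it should
    also work on columns such as 'repaid' and 'predicted_repaid'.
    """
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("need at least one observation")
    n = len(pairs)
    correct = sum(1 for a, p in pairs if a == p)
    # predicted repayment, but the loan was not repaid
    false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)
    # predicted default, but the loan was repaid
    false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "accuracy": correct / n,
        "false_positive_share": false_positives / n,  # share of all rows
        "false_negative_share": false_negatives / n,  # share of all rows
    }


# toy data (hypothetical): same incomes in both periods
incomes = [40_000, 70_000, 80_000, 90_000]
predicted = [int(income > 65_000) for income in incomes]   # the model's rule
actual_then = [int(income > 65_000) for income in incomes]  # old reality
actual_now = [int(income > 85_000) for income in incomes]   # new reality

report_then = prediction_report(actual_then, predicted)
report_now = prediction_report(actual_now, predicted)
print(report_then)
print(report_now)
```

On the toy data, every error in the second report is a false positive: the model keeps approving borrowers in the 65k to 85k range who no longer repay.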

Note that the **income did not meaningfully change** - the mean income is practically the same in both data sets ($65,934.32 then, $65,403.77 now). However, the **relationship** between income and repayment has shifted.
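One way to see where the relationship shifted is to compare repayment rates within income bands across the two periods. This is a minimal sketch under stated assumptions: `repayment_rate_by_band` is a hypothetical helper, the band edges are picked to line up with the 65k and 85k thresholds in this example, and the incomes are toy values rather than the random data above.

```python
def repayment_rate_by_band(incomes, repaid, edges):
    """Return {(low, high): repayment rate} for each band low < income <= high.

    A band with no observations gets None instead of a rate.
    """
    rates = {}
    for low, high in zip(edges[:-1], edges[1:]):
        outcomes = [r for i, r in zip(incomes, repaid) if low < i <= high]
        rates[(low, high)] = sum(outcomes) / len(outcomes) if outcomes else None
    return rates


# toy data (hypothetical): identical incomes in both periods
incomes = [40_000, 60_000, 70_000, 80_000, 90_000, 95_000]
repaid_then = [int(income > 65_000) for income in incomes]
repaid_now = [int(income > 85_000) for income in incomes]

# band edges are an assumption chosen to match the two thresholds
edges = [0, 65_000, 85_000, 100_000]

rates_then = repayment_rate_by_band(incomes, repaid_then, edges)
rates_now = repayment_rate_by_band(incomes, repaid_now, edges)

# bands whose repayment rate differs between the two periods
changed_bands = [band for band in rates_then if rates_then[band] != rates_now[band]]
print(changed_bands)
```

The same function can be called on the notebook's data, for example `repayment_rate_by_band(df_current['income'], df_current['repaid'], edges)`. The inputs are the same in both periods, so any band that shows up in `changed_bands` points at a change in the income-to-repayment relationship rather than in the incomes themselves.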

## Next steps
- KDE plots for distributions of successful repayments