## Predictive Power Score (PPS)
The __PPS__ is an alternative to __correlation__ that finds patterns in data.

## Calculating the Predictive Power Score (PPS)
Suppose we have two columns, A and B, and want to calculate the __predictive power score__ of A predicting B. In this case, we treat B as the __target variable__ and A as the (only) __feature__. We then fit a __cross-validated Decision Tree__ and compute a suitable evaluation metric. When the target is _numeric_, we can use a __Decision Tree Regressor__ and the __Mean Absolute Error__ (MAE). When the target is _categorical_, we can use a __Decision Tree Classifier__ and the weighted F1 score. Other metrics, such as ROC AUC, can be used as well. The raw metric is then normalized against a naive baseline (the `baseline_score` in the library output) so that the final score lies between 0 and 1.
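The regression case described above can be sketched with scikit-learn. This is a minimal illustration, not the `ppscore` implementation: the function name `pps_regression_sketch`, the 4-fold split, the median baseline and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def pps_regression_sketch(x, y, cv=4, random_state=0):
    """Illustrative PPS of one numeric feature x predicting a numeric target y."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)  # single feature as a column
    y = np.asarray(y, dtype=float)

    # Out-of-fold predictions from a cross-validated decision tree.
    tree = DecisionTreeRegressor(random_state=random_state)
    predictions = cross_val_predict(tree, x, y, cv=cv)
    model_mae = np.mean(np.abs(y - predictions))

    # Assumed naive baseline: always predict the median of the target.
    baseline_mae = np.mean(np.abs(y - np.median(y)))

    # Normalize so that 0 means "no better than the baseline" and 1 means "perfect".
    if baseline_mae == 0 or model_mae >= baseline_mae:
        return 0.0
    return 1.0 - model_mae / baseline_mae


# Synthetic data (made up for this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=400)
y_quadratic = x ** 2             # non-linear, symmetric relationship
y_noise = rng.normal(size=400)   # unrelated to x

score_quadratic = pps_regression_sketch(x, y_quadratic)
score_noise = pps_regression_sketch(x, y_noise)
print(score_quadratic, score_noise)
```

The quadratic target has a linear correlation with `x` close to zero, yet the tree predicts it well, so the sketch should return a high score for it and a score near 0 for the pure-noise target.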

## Applications of the PPS and the PPS matrix
1. __Find patterns in the data__: The PPS finds every relationship that the correlation finds — and more. Thus, you can use the PPS matrix as an alternative to the correlation matrix to detect and understand linear or nonlinear patterns in your data. This is possible across data types using a single score that always ranges from 0 to 1.

2. __Feature selection__: In addition to your usual feature selection mechanism, you can use the predictive power score to find good predictors for your target column. You can also eliminate features that only add random noise; such features sometimes still score high in feature importance metrics. Likewise, you can eliminate features that can be predicted by other features, because they do not add new information. Finally, you can identify pairs of mutually predictive features in the PPS matrix; this includes strongly correlated features but also covers non-linear relationships.

3. __Detect information leakage__: Use the PPS matrix to detect information leakage between variables — even if the information leakage is mediated via other variables.

4. __Data Normalization__: Find entity structures in the data by interpreting the PPS matrix as a directed graph. The result can be surprising when the data contains latent structures that were previously unknown. For example, the TicketID in the Titanic dataset is often an indicator for a family.
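Reading a PPS matrix as a directed graph, and picking out mutually predictive pairs, can be sketched as follows. The matrix values, the helper names `directed_edges` and `mutual_pairs`, the 0.5 threshold, and the layout (row = feature, column = target) are assumptions for illustration; transpose the matrix first if your PPS output uses the opposite layout.

```python
import pandas as pd

# Hypothetical 3x3 PPS matrix; the values are made up for illustration.
# Assumed layout: row = feature (predictor), column = target.
scores = pd.DataFrame(
    [[1.0, 0.9, 0.0],
     [0.8, 1.0, 0.1],
     [0.0, 0.7, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)


def directed_edges(matrix, threshold=0.5):
    """Return (feature, target, score) triples at or above the threshold, skipping the diagonal."""
    edges = []
    for feature in matrix.index:
        for target in matrix.columns:
            score = matrix.loc[feature, target]
            if feature != target and score >= threshold:
                edges.append((feature, target, float(score)))
    return edges


def mutual_pairs(edges):
    """Return sorted pairs of columns that predict each other in both directions."""
    links = {(feature, target) for feature, target, _ in edges}
    return sorted({tuple(sorted(link)) for link in links if link[::-1] in links})


edges = directed_edges(scores)
pairs = mutual_pairs(edges)
print(edges)
print(pairs)
```

One-directional edges (here `c` predicting `b`) hint at a hierarchy between columns, while mutual pairs (here `a` and `b`) are candidates for redundant features.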

```python
import pandas as pd
import ppscore as pps
import matplotlib.pyplot as plt
```

```python
df = pd.read_csv('AmesHousing.csv')
df.head(3)
```

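| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | | IR1 | Lvl | ... | 0 | | | | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | | IR1 | Lvl | ... | 0 | | | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |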


```python
# Calculating the PPS for a given pandas dataframe:
pps.score(df, "Lot Area", "SalePrice")
```

```
{'x': 'Lot Area',
 'y': 'SalePrice',
 'task': 'regression',
 'ppscore': 0,
 'metric': 'mean absolute error',
 'baseline_score': 56054.2313993174,
 'model_score': 63460.467765339214,
 'model': DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=None, splitter='best')}
```
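The `ppscore` of 0 in this output is consistent with the reported errors: the model's MAE (`model_score`) is larger than the baseline MAE (`baseline_score`), so `Lot Area` alone predicts `SalePrice` no better than the naive baseline. A minimal sketch of that normalization, assuming the score is `1 - model_mae / baseline_mae` clipped at 0 as described in the referenced article (the function name is hypothetical):

```python
def normalized_mae_score(model_mae, baseline_mae):
    """Map an MAE to [0, 1]: 0 = no better than the baseline, 1 = perfect prediction."""
    if baseline_mae == 0 or model_mae >= baseline_mae:
        return 0.0
    return 1.0 - model_mae / baseline_mae


# Values copied from the pps.score output above.
baseline_score = 56054.2313993174
model_score = 63460.467765339214

ppscore = normalized_mae_score(model_score, baseline_score)
print(ppscore)  # 0.0, because the model error exceeds the baseline error
```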

```python
# You can also calculate the whole PPS matrix:
pps.matrix(df)
```

| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000194 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006323 | 0.790471 | 0.015548 | 0.015445 | 0.000000 |
| PID | 0.993091 | 1.000000 | 0.165221 | 0.161765 | 0.068605 | 0.192030 | 0.000000 | 0.089376 | 0.000000 | 0.016797 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.004739 | 0.049474 |
| MS SubClass | 0.084553 | 0.311771 | 1.000000 | 0.000000 | 0.187608 | 0.081994 | 0.000000 | 0.113382 | 0.000000 | 0.000000 | ... | 0.000000 | 0.442352 | 0.000761 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| MS Zoning | 0.738777 | 0.947956 | 0.252383 | 1.000000 | 0.304377 | 0.347835 | 0.003192 | 0.426724 | 0.000044 | 0.000044 | ... | 0.000044 | 1.000000 | 0.000333 | 0.013567 | 0.000000 | 0.000044 | 0.000044 | 0.001060 | 0.000044 | 0.055448 |
| Lot Frontage | 0.078757 | 0.327156 | 0.249809 | 0.079173 | 1.000000 | 0.249117 | 0.000000 | 0.092385 | 0.006030 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.002788 | 0.001836 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mo Sold | 0.084364 | 0.077644 | 0.034448 | 0.010640 | 0.065892 | 0.068982 | 0.001211 | 0.007492 | 0.014351 | 0.002617 | ... | 0.000051 | 0.000000 | 0.048555 | 0.003942 | 0.002119 | 1.000000 | 0.039340 | 0.017742 | 0.018548 | 0.065946 |
| Yr Sold | 0.998875 | 0.145407 | 0.122210 | 0.061598 | 0.129276 | 0.131958 | 0.001274 | 0.008434 | 0.011809 | 0.012502 | ... | 0.001539 | 0.169130 | 0.060339 | 0.013007 | 0.004606 | 0.119834 | 1.000000 | 0.044377 | 0.069013 | 0.123426 |
| Sale Type | 0.105106 | 0.086333 | 0.000087 | 0.000000 | 0.027134 | 0.000000 | 0.000087 | 0.009328 | 0.000087 | 0.000087 | ... | 0.000087 | 0.215734 | 0.001418 | 0.088487 | 0.000000 | 0.000087 | 0.000087 | 1.000000 | 0.594052 | 0.023334 |
| Sale Condition | 0.070522 | 0.042610 | 0.000019 | 0.002152 | 0.032500 | 0.000000 | 0.000019 | 0.002840 | 0.000019 | 0.000019 | ... | 0.000000 | 0.000000 | 0.001516 | 0.062473 | 0.000000 | 0.000019 | 0.000019 | 0.511430 | 1.000000 | 0.045786 |


## How fast is the PPS in comparison to the correlation?
Although the __PPS__ has many advantages over __correlation__, it has one drawback: it takes longer to calculate. Calculating a single PPS with the Python library is rarely a problem because it usually takes around 10–500 ms. The calculation time mostly depends on the data types, the number of rows, and the implementation used. However, calculating the whole PPS matrix for 40 columns requires 40*40=1600 individual calculations, which might take 1–10 minutes.
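The quadratic growth in the number of calculations can be made concrete with a rough estimate. This sketch simply multiplies the number of column pairs by the per-score timings quoted above; the function name is hypothetical, and real runtimes depend on the data and the implementation.

```python
def matrix_runtime_bounds(n_columns, min_seconds=0.010, max_seconds=0.500):
    """Return (number of scores, lower bound in seconds, upper bound in seconds)."""
    n_scores = n_columns * n_columns  # every column is scored against every column
    return n_scores, n_scores * min_seconds, n_scores * max_seconds


n_scores, low, high = matrix_runtime_bounds(40)
print(f"{n_scores} scores, roughly {low:.0f} s to {high:.0f} s")
```

Doubling the number of columns quadruples the number of scores, so restricting the matrix to the columns of interest is the most direct way to keep the runtime manageable.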

## Limitations
1. The calculation is slower than the correlation (matrix).

2. The score cannot be interpreted as easily as the correlation because it does not tell you anything about the type of relationship that was found. Thus, the PPS is better for finding patterns, while the correlation is better for communicating linear relationships once they are found.

3. You cannot compare the scores for different target variables in a strict mathematical way because they are calculated using different evaluation metrics. The scores are still valuable in the real world, but you need to keep this in mind.

4. The components used under the hood have their own limitations. Keep in mind that you can exchange these components, e.g. use a GLM instead of a Decision Tree, or ROC AUC instead of F1 for binary classification.

5. If you use the PPS for feature selection, you should still perform forward and backward selection as well. Also, the PPS cannot detect interaction effects between features towards your target, because each score uses a single feature.
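The interaction limitation in point 5 can be illustrated with an XOR target, where neither feature is informative on its own but both together determine the target exactly. This sketch uses in-sample majority-vote accuracy as a simple stand-in for predictive power rather than the PPS itself; the data and the helper name `best_accuracy` are made up for this example.

```python
from collections import Counter, defaultdict
from itertools import product

# 100 rows of (a, b, target) where target = a XOR b.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25


def best_accuracy(rows, feature_indices):
    """Accuracy when predicting the majority target within each group of feature values."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in feature_indices)
        groups[key][row[2]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(rows)


baseline = best_accuracy(rows, [])      # no features: always predict the majority class
acc_a = best_accuracy(rows, [0])        # feature a alone
acc_b = best_accuracy(rows, [1])        # feature b alone
acc_both = best_accuracy(rows, [0, 1])  # a and b together
print(baseline, acc_a, acc_b, acc_both)
```

Each single feature does no better than the baseline, so a per-feature score would rate both as useless even though the pair predicts the target perfectly.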

### References
1. https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598