The Cool Tool uses the association rules to characterize subpopulations, with the LHS value defining the subpopulation and the RHS characterizing it. If the LHS occurs at a frequency lower than the minimum support threshold, this method returns no results for that subpopulation.
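The minimum-support cutoff described above can be illustrated with a minimal sketch. The toy data, the `MIN_SUPPORT` value and the variable names are assumptions made for illustration only. They are not taken from the Cool Tool or from the Season Stats data set.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the Season Stats data set
toy = pd.DataFrame({"pos": ["G"] * 2 + ["F"] * 58 + ["C"] * 40})

MIN_SUPPORT = 0.05  # assumed threshold, for illustration only

# Relative frequency (support) of each candidate LHS value
lhs_support = toy["pos"].value_counts(normalize=True)

# Only LHS values at or above the threshold can appear in the rule output
eligible = lhs_support[lhs_support >= MIN_SUPPORT]
print(eligible.index.tolist())
```

In this toy data `pos=G` covers 2% of the rows, so it falls below the assumed threshold and no rule could be reported for it.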

This experiment tests whether the results of other algorithms can be used to characterize these relatively small subpopulations.

Three algorithm results are tested:  __Conditional Probabilities__, __Classification Coefficients__ and __Frequency Counts__.

The __Season Stats__ data set is used with __pos=G__ defining the subpopulation.

In [None]:
import pandas as pd

# Location of data sets
DATA_DIR_NAME = '/Users/karenblakemore/merck/data/'

# Classification Results
For this experiment, the __dependent value__ defines the subpopulation. __Independent values__ that are strongly associated with the dependent value characterize the subpopulation.

The classification results include a set of __coefficients__ for each dependent value.  The results also include __counts__ of the co-occurrence of dependent and independent values. 

## Load Classification Results and create subpopulation

In [None]:
# classification results
DATA_SET_NAME = 'Seasons_Stats_noindex_classification_coefficients'

# Read classification results
pdf = pd.read_csv(DATA_DIR_NAME + DATA_SET_NAME + '.csv')

# Create subpopulation defined by dependent value
pos_g = pdf[pdf['dependent_value'] == 'pos=G']

## Top classification coefficients for subpopulation
The top coefficients belong to infrequent values. Higher-frequency values would provide more general characterizations. Also, the relationships may not be linear, in which case the coefficients cannot be interpreted as the importance of an independent value.
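One way to favor higher-frequency values is to drop rare values before ranking by coefficient. The sketch below assumes a toy stand-in for `pos_g`. The `independent_value` column name, the `MIN_COUNT` cutoff and all numbers are hypothetical. Only the `coefficient` and `count` columns are known from this notebook.

```python
import pandas as pd

# Toy stand-in for pos_g; all values are hypothetical
toy_pos_g = pd.DataFrame({
    "independent_value": ["a=1", "b=2", "c=3", "d=4"],  # assumed column name
    "coefficient": [2.5, -1.8, 0.9, 0.4],
    "count": [3, 5, 120, 300],
})

MIN_COUNT = 50  # assumed cutoff, for illustration only

# Keep only values that co-occur with the dependent value often enough
frequent = toy_pos_g[toy_pos_g["count"] >= MIN_COUNT]

# Rank the remaining rows by absolute coefficient, largest first
ranked = frequent.reindex(
    frequent["coefficient"].abs().sort_values(ascending=False).index
)
print(ranked.to_string(index=False))
```

The two rows with the largest coefficients are removed because their counts are small. This does not address the linearity caveat, which still applies to the rows that remain.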

In [None]:
coefficients = pos_g.reindex(pos_g['coefficient'].abs().sort_values(ascending=False).index)
print(coefficients[:10].to_string())

## Top frequency counts for subpopulation
The __count__ field is the number of co-occurrences of the dependent and independent values. This corresponds to the support metric in the association rules and conditional probabilities, with the added benefit that no minimum support is applied.

A variable value of __nan__ means that the field was blank.
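To compare counts with support values, a count can be divided by the total number of rows. The sketch below assumes that support is expressed as a fraction of all rows. `TOTAL_ROWS`, the `independent_value` column name and the counts are hypothetical.

```python
import pandas as pd

TOTAL_ROWS = 1000  # hypothetical size of the full data set

# Toy co-occurrence counts; all values are hypothetical
toy_counts = pd.DataFrame({
    "independent_value": ["x=1", "y=nan"],  # assumed column name
    "count": [20, 5],
})

# Express each count as a fraction of all rows
toy_counts["support"] = toy_counts["count"] / TOTAL_ROWS
print(toy_counts.to_string(index=False))
```

A row with a count of 5 out of 1000 has a support of 0.005. A method with a higher minimum support would omit it, while the count table keeps it.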

In [None]:
counts = pos_g.reindex(pos_g['count'].abs().sort_values(ascending=False).index)
print(counts[:30].to_string())

# Conditional Probabilities
As with the Association Rules, the __LHS__ defines the subpopulation and the __RHS__ characterizes it.

In the Season Stats data set, pos=G occurs with a frequency of 0.026. Since the minimum support for conditional probabilities is 0.01, the results can capture rules for the subpopulation.

## Load Conditional Probabilities and create subpopulation

In [None]:
# conditional probability results
DATA_SET_NAME = 'Seasons_Stats_noindex_conditional_probabilities'

# Read conditional probability results
pdf = pd.read_csv(DATA_DIR_NAME + DATA_SET_NAME + '.csv')

# Create subpopulation defined by LHS value
pos_g = pdf[pdf['LHS'] == 'pos=G']

## Top ranking conditional probabilities by support
The top-ranking conditional probabilities match those produced by the frequency count method, but the results include two additional metrics: __confidence__ and __lift__. Confidence indicates how often the RHS value occurs in the subpopulation relative to other values of the same RHS variable. Lift indicates the relative frequency of the RHS in the subpopulation versus the general population.
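The three metrics can be worked through with raw counts. The sketch below uses the standard textbook definitions of support, confidence and lift. Whether the tool that produced these results computes them in exactly this way is an assumption. The function name `rule_metrics` and all numbers are hypothetical.

```python
def rule_metrics(n_total, n_lhs, n_rhs, n_both):
    """Return (support, confidence, lift) for a rule LHS -> RHS.

    n_total: rows in the data set
    n_lhs:   rows where the LHS value occurs (the subpopulation)
    n_rhs:   rows where the RHS value occurs
    n_both:  rows where both occur together
    """
    support = n_both / n_total            # P(LHS and RHS)
    confidence = n_both / n_lhs           # P(RHS | LHS)
    lift = confidence / (n_rhs / n_total)  # P(RHS | LHS) / P(RHS)
    return support, confidence, lift


# Hypothetical counts, for illustration only
support, confidence, lift = rule_metrics(
    n_total=1000, n_lhs=26, n_rhs=100, n_both=13
)
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.1f}")
```

With these counts, half of the subpopulation has the RHS value, compared with one tenth of the general population, so the RHS value is five times as common in the subpopulation.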

In [None]:
counts = pos_g.reindex(pos_g['support'].sort_values(ascending=False).index)
print(counts[:30].to_string())

# Conclusion
Of the algorithms currently implemented, conditional probabilities is the best method for characterizing small subpopulations defined by a single variable value. This is because conditional probabilities can use a lower support threshold than association rules, and the results include lift and confidence metrics.