DataCost

Calculate cost-based metrics about binary classification data.

These cost metrics are used at the core of the classification algorithms CSTree, CSForest, BCSForest, and BCF.

Full documentation available here.

Quick Guide

Step 1: Installing the Library

The library can be downloaded and installed using pip. Enter this at the terminal (Mac/Linux/Unix) or command prompt (Windows):

pip install datacost

Step 2: Importing the Library

At a Python command line or in a Python script, write this to load the library:

import datacost

Step 3: Defining a Cost-Matrix

Cost matrices in datacost use the format shown in the code block below: a dictionary with the keys 'TP', 'TN', 'FP', and 'FN'.

cost_matrix = {'TP':1, 'TN':0, 'FP':1, 'FN':5}

Step 4: Calculating a Metric

A list of available metrics appears further down in this README. The following example calculates the expected cost for a dataset with 4 positive records and 10 negative records, using the cost_matrix defined in Step 3. It should output approximately 16.47. Try it out!

datacost.expected_cost(4, 10, cost_matrix)

Basic Notation Used in this README

  • NTN: Number of true negative predictions.
  • NTP: Number of true positive predictions.
  • NFN: Number of false negative predictions.
  • NFP: Number of false positive predictions.
  • CTN: Cost incurred by true negative predictions.
  • CTP: Cost incurred by true positive predictions.
  • CFN: Cost incurred by false negative predictions.
  • CFP: Cost incurred by false positive predictions.

What Can Be Calculated Using datacost:

Cost of Labelling a Set of Data Points as Either Negative or Positive

The cost incurred by labelling as negative is calculated as:

CN = NTN × CTN + NFN × CFN

The cost incurred by labelling as positive is calculated as:

CP = NTP × CTP + NFP × CFP
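
As a minimal sketch of these two formulas in plain Python (labelling_costs is a hypothetical helper name for illustration, not part of the datacost API), using the cost_matrix format from Step 3:

def labelling_costs(num_positive, num_negative, cost_matrix):
    # Labelling every record negative turns negatives into true negatives
    # and positives into false negatives.
    cost_negative = num_negative * cost_matrix['TN'] + num_positive * cost_matrix['FN']
    # Labelling every record positive turns positives into true positives
    # and negatives into false positives.
    cost_positive = num_positive * cost_matrix['TP'] + num_negative * cost_matrix['FP']
    return cost_negative, cost_positive

cost_matrix = {'TP': 1, 'TN': 0, 'FP': 1, 'FN': 5}
print(labelling_costs(4, 10, cost_matrix))  # (20, 14) for the Step 4 example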

Expected Cost

The expected cost estimates how much a set of data points can be expected to cost a business. It is represented by the symbol E and is calculated as:

E = (CN / (CN + CP)) × CP + (CP / (CN + CP)) × CN
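
A self-contained plain-Python sketch of this formula (expected_cost_sketch is an illustrative name, not the library's implementation) reproduces the approximately 16.47 from Step 4:

def expected_cost_sketch(num_positive, num_negative, cost_matrix):
    # CN and CP as defined in the previous section.
    cn = num_negative * cost_matrix['TN'] + num_positive * cost_matrix['FN']
    cp = num_positive * cost_matrix['TP'] + num_negative * cost_matrix['FP']
    # Weight each labelling cost by the other cost's share of the total.
    return (cn / (cn + cp)) * cp + (cp / (cn + cp)) * cn

cost_matrix = {'TP': 1, 'TN': 0, 'FP': 1, 'FN': 5}
print(round(expected_cost_sketch(4, 10, cost_matrix), 2))  # 16.47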

Expected Cost After Split

After a split, a set of data points has several new sets of class supports, one for each split. The expected cost difference can be calculated as the difference between E for the original dataset and the summed E over all splits. The equation for expected cost after a split is as follows:

ESplit = E1 + E2 + ... + Ek

Where k is the number of splits, and Ei is the value of E for the i'th split, calculated from CPi and CNi, the values of CP and CN for that split.
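
Building on the expected_cost_sketch function from the previous sketch, the summed expected cost over a hypothetical list of splits (each given as a pair of positive and negative record counts) could be computed like this:

def expected_cost_after_split(splits, cost_matrix):
    # splits is a list of (num_positive, num_negative) pairs, one per split.
    return sum(expected_cost_sketch(p, n, cost_matrix) for p, n in splits)

# Splitting the 4 positive and 10 negative records from Step 4 into two subsets:
print(round(expected_cost_after_split([(4, 2), (0, 8)], cost_matrix), 2))  # about 9.23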

Expected Cost Per Record

The expected cost per data point is simply the expected cost for a dataset divided by the number of data points in the dataset. It is a way of normalizing expected cost so that meaningful comparisons can be made between the expected costs of two datasets of different sizes.

ER = E / |D|

Where |D| is the number of records in the dataset D.
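
Again building on expected_cost_sketch from above, a per-record version might look like:

def expected_cost_per_record(num_positive, num_negative, cost_matrix):
    # Normalise the expected cost by the dataset size |D|.
    dataset_size = num_positive + num_negative
    return expected_cost_sketch(num_positive, num_negative, cost_matrix) / dataset_size

print(round(expected_cost_per_record(4, 10, cost_matrix), 2))  # roughly 1.18 (16.47 / 14)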

Total Cost

The total cost for a set of records is calculated as either CN or CP, whichever is lower:

CT = min(CN, CP)
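
A self-contained plain-Python sketch of the total cost (total_cost_sketch is an illustrative name, not the library's API):

def total_cost_sketch(num_positive, num_negative, cost_matrix):
    # CN and CP as defined earlier; the total cost is the cheaper labelling.
    cn = num_negative * cost_matrix['TN'] + num_positive * cost_matrix['FN']
    cp = num_positive * cost_matrix['TP'] + num_negative * cost_matrix['FP']
    return min(cn, cp)

cost_matrix = {'TP': 1, 'TN': 0, 'FP': 1, 'FN': 5}
print(total_cost_sketch(4, 10, cost_matrix))  # min(20, 14) = 14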
