disinformation

This module contains methods for handling NaN-heavy data without having to guess the missing values.

The current methods are intentionally simple proofs of concept that aim to:

Perform a presence-absence encoding, so that a value being NaN becomes a non-problematic fact about the data, rather than "missing data" that we need to impute or infer before using certain models.
Put all features on a common scale, so as not to confuse statistical models that implicitly assume the data lives in a Euclidean or affine space in which "every direction has the same units."
Explicitly represent the frequency with which each feature column is present or absent, rather than just representing the fact of its presence or absence as a 0 or 1. That is, for a given input feature a, values of larger magnitude in the a_presence column correspond to rarer events in terms of the presence or absence of that feature. This follows the common idea in information theory (or in everyday language) that common things should have simple encodings, and the less common a thing, the "larger" the space occupied by its encoding. Since we typically use models that operate on floats rather than abstract code-words, the magnitude of the floats in the presence encodings below is intended to represent the rareness of the presence or absence of that feature in the dataset.
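The "rarer means larger" idea in the third aim can be illustrated with surprisal, the information-theoretic quantity -log2(p) for an event of probability p. The sketch below is only an illustration of that idea: `presence_surprisal` is a hypothetical helper, not a function of this module.

```python
import math

import pandas as pd


def presence_surprisal(column: pd.Series) -> dict:
    """Hypothetical helper: surprisal (in bits) of a value being present or absent."""
    p_present = column.notna().mean()  # fraction of rows that hold a value
    p_absent = 1.0 - p_present
    return {
        "present": -math.log2(p_present) if p_present > 0 else float("inf"),
        "absent": -math.log2(p_absent) if p_absent > 0 else float("inf"),
    }


column = pd.Series([1.0, 2.0, float("nan")], name="a")
bits = presence_surprisal(column)
# Presence occurs in 2 of 3 rows and absence in 1 of 3, so absence is the
# rarer event and receives the larger number (about 1.585 vs 0.585 bits).
print(bits)
```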

Other methods readily suggest themselves, but here's how the current ones work:

Example 1

The functions can take a dataframe or a series.

>>> import pandas as pd

>>> nan = float('nan')

>>> s = pd.Series([-100, nan, +1000], name='a')

encode_jaynes uses a CDF representation of the data, which handles outliers better than z-scoring but loses information about absolute scale.

>>> es.encode_jaynes(s)

   a_presence  a_value
0    0.333333     -1.0
1   -0.666667      0.0
2    0.333333      1.0
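The encode_jaynes output above is consistent with a mean-centered presence indicator and a rank-based value rescaled to [-1, 1]. The following is a minimal sketch under that assumption, reconstructed from the printed numbers rather than from the module's source; `sketch_cdf_encoding` is a hypothetical name, and the real CDF transform may behave differently on larger inputs.

```python
import pandas as pd


def sketch_cdf_encoding(column: pd.Series) -> pd.DataFrame:
    """Hypothetical reconstruction; not the module's actual implementation."""
    present = column.notna()
    # Centered indicator: 1 - p for present rows, -p for absent rows,
    # where p is the fraction of rows that hold a value.
    presence = present.astype(float) - present.mean()

    # Rank the observed values (NaN rows stay NaN), then map the ranks
    # linearly onto [-1, 1]. This mapping is an assumption.
    ranks = column.rank(method="average")
    n_observed = int(present.sum())
    if n_observed > 1:
        value = 2.0 * (ranks - 1.0) / (n_observed - 1.0) - 1.0
    else:
        value = ranks * 0.0  # a single observation carries no rank information

    return pd.DataFrame({
        f"{column.name}_presence": presence,
        f"{column.name}_value": value.fillna(0.0),  # absent values sit at 0
    })


s = pd.Series([-100, float("nan"), 1000], name="a")
encoded = sketch_cdf_encoding(s)
print(encoded)
```

Because only ranks are used, replacing 1000 with 10 in this sketch would give the same a_value column, which is the sense in which absolute scale is lost.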

encode_shannon uses a z-scored representation of the data, which handles outliers less well. It does not retain absolute scale or units, but it preserves each value's relative position in units of that column's standard deviation.

>>> es.encode_shannon(s)

   a_presence   a_value
0    0.764828 -0.707107
1   -1.258953  0.000000
2    0.764828  0.707107
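The encode_shannon output above is consistent with a presence column equal to the signed square root of the surprisal in bits (positive for present, negative for absent) and a value column z-scored with the sample standard deviation. The sketch below assumes exactly that; it is reconstructed from the printed numbers, and `sketch_zscore_encoding` is a hypothetical name rather than the module's implementation.

```python
import numpy as np
import pandas as pd


def sketch_zscore_encoding(column: pd.Series) -> pd.DataFrame:
    """Hypothetical reconstruction; not the module's actual implementation."""
    present = column.notna()
    p_present = present.mean()

    # Probability of each row's own state (present or absent).
    state_prob = np.where(present, p_present, 1.0 - p_present)
    sign = np.where(present, 1.0, -1.0)
    # Signed square root of the surprisal -log2(p); rarer states get
    # a larger magnitude.
    presence = sign * np.sqrt(-np.log2(state_prob))

    # z-score with the sample standard deviation (pandas default, ddof=1).
    std = column.std()
    if pd.notna(std) and std > 0:
        value = (column - column.mean()) / std
    else:
        value = column * 0.0  # fewer than two distinct observations

    return pd.DataFrame(
        {
            f"{column.name}_presence": presence,
            f"{column.name}_value": value.fillna(0.0),  # absent values sit at 0
        },
        index=column.index,
    )


s = pd.Series([-100, float("nan"), 1000], name="a")
encoded = sketch_zscore_encoding(s)
print(encoded)
```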

We can also pass them dataframes.

>>> df = pd.DataFrame({'a': [1,2,nan], 'b': [3, nan, nan]})

>>> df
     a    b
0  1.0  3.0
1  2.0  NaN
2  NaN  NaN

>>> es.encode_jaynes(df)

   a_presence  a_value  b_presence  b_value
0    0.333333     -1.0    0.666667      0.0
1    0.333333      1.0   -0.333333      0.0
2   -0.666667      0.0   -0.333333      0.0

>>> es.encode_shannon(df)

   a_presence   a_value  b_presence  b_value
0    0.764828 -0.707107    1.258953      0.0
1    0.764828  0.707107   -0.764828      0.0
2   -1.258953  0.000000   -0.764828      0.0
