Tools for working with categorical variables in pandas (a port of features from the R forcats package - https://cran.r-project.org/web/packages/forcats/index.html)
Currently in development (pre-alpha). Current version: 0.1.20
PyPI package reference: https://pypi.python.org/pypi/pycats/0.1.20
pip install pycats
pandas, numpy
import pandas as pd
import pycats
convert a series (column in data frame) to a category
x - series to convert to a category
the category representation of the provided series
x = pd.DataFrame({
'a': [4,1,9,6,2,3,5,7,2,9],
'b': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'baz', 'baz2']
})
x['b'] = pycats.as_cat(x['b'])
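For comparison, the same kind of conversion can be written with plain pandas. This is a minimal sketch that assumes as_cat behaves like astype('category'); that equivalence is an assumption and is not confirmed here.

```python
import pandas as pd

s = pd.Series(['foo', 'foo', 'bar', 'baz'])

# Plain-pandas conversion to a categorical dtype (assumed to be
# comparable to pycats.as_cat; this is not confirmed).
c = s.astype('category')

print(c.dtype)                 # category
print(list(c.cat.categories))  # ['bar', 'baz', 'foo']
```

The values are unchanged; only the dtype and the list of known levels are added.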
port of forcats fct_lump - Lump together least/most common factor levels in a categorical variable into "other" (or any custom name).
Positive n preserves the most common n values. Negative n preserves the least common -n values.
If there are ties for the most/least common factor levels, a random selection is made from the tying levels. For instance, if n = 3 and the level counts are [4,3,3,3,2,1], the level occurring 4 times is maintained, along with a random selection of 2 out of 3 of the levels which occur 3 times. Other tie resolution methods may be added (and parameterised) later.
x - category object to lump
n - number of levels to preserve. If n > 0, the n most common levels are preserved and all other levels are lumped together into the 'other' level. If n < 0, the -n least common levels are preserved. e.g. for n = 2, only the 2 most common levels in x are kept
other_level - name for the 'other' level which factor levels are converted to
the lumped version of the provided category object
x = pd.DataFrame({
'a': [4,1,9,6,2,3,5,7,2,9],
'b': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'baz', 'baz2']
})
x['b'] = x['b'].astype('category')
x['b'] = pycats.cat_lump(x['b'], 2)
print(x['b'])
0 foo
1 foo
2 foo
3 foo
4 foo
5 bar
6 bar
7 bar
8 Other
9 Other
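The selection rule described above (keep n levels by frequency, break ties at random) can be sketched with plain pandas. This is only an illustration under stated assumptions: lump_sketch and its seed argument are hypothetical names, not part of pycats, and the real cat_lump may differ in detail.

```python
import numpy as np
import pandas as pd


def lump_sketch(s, n, other_level='Other', seed=None):
    """Keep the n most common levels (n > 0) or the -n least common
    levels (n < 0) and replace every other level with other_level.
    Hypothetical helper, not the pycats implementation."""
    rng = np.random.default_rng(seed)
    counts = s.value_counts()
    counts = counts[counts > 0]  # ignore unused categories
    # Shuffle first, then sort with a stable algorithm, so that levels
    # with equal counts end up in random order.
    counts = counts.iloc[rng.permutation(len(counts))]
    counts = counts.sort_values(ascending=(n < 0), kind='mergesort')
    keep = list(counts.index[:abs(n)])
    # Missing values are left as they are (an assumption of this sketch).
    keep_mask = s.isin(keep) | s.isna()
    return s.astype(object).where(keep_mask, other_level).astype('category')


s = pd.Series(['foo'] * 5 + ['bar'] * 3 + ['baz', 'baz2']).astype('category')
out = lump_sketch(s, 2)
print(out.tolist())
```

With level counts [4,3,3,3,2,1] and n = 3, this sketch keeps the level occurring 4 times plus two of the three tied levels, and which two depends on the seed.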
port of forcats fct_other - replace levels with other
x - category object
drop - list of category levels to replace in x
other_level - name for the 'other' level which dropped factor levels are converted to
category object with dropped levels replaced by other_level
x = pd.DataFrame({
'a': [4,1,9,6,2,3,5,7,2,9],
'b': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'baz', 'baz2']
})
x['b'] = pycats.cat_other(x['b'], ['foo', 'baz'])
print(x['b'])
0 Other
1 Other
2 Other
3 Other
4 Other
5 bar
6 bar
7 bar
8 Other
9 baz2
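The replacement itself can be reproduced with plain pandas. A minimal sketch follows; other_sketch is a hypothetical name used only for illustration and is not how pycats necessarily implements cat_other.

```python
import pandas as pd


def other_sketch(s, drop, other_level='Other'):
    """Replace every level listed in drop with other_level.
    Hypothetical helper, not the pycats implementation."""
    replaced = s.astype(object).mask(s.isin(drop), other_level)
    return replaced.astype('category')


s = pd.Series(['foo'] * 5 + ['bar'] * 3 + ['baz', 'baz2'])
out = other_sketch(s, ['foo', 'baz'])
print(out.tolist())
```

Levels not listed in drop ('bar' and 'baz2' here) are left untouched.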
port of forcats fct_anon - replace category level names with random integers. Maintains level groupings, does not preserve order or values of original categories
x - category object
category object, with each category level replaced by a random number in range [0,10000000]
x = pd.DataFrame({
'a': [4,1,9,6,2,3,5,7,2,9],
'b': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'baz', 'baz2']
})
x['b'] = x['b'].astype('category')
x['b'] = pycats.cat_anon(x['b'])
print(x['b'])
0 1223194
1 1223194
2 1223194
3 1223194
4 1223194
5 6220873
6 6220873
7 6220873
8 2811679
9 582436
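One way to get this behaviour with plain pandas is to rename each level to a random integer. The sketch below is an assumption-laden illustration: anon_sketch, seed and high are hypothetical names, and drawing the integers without replacement (so two levels never collide) is a choice made for this sketch, not a documented property of cat_anon.

```python
import random

import pandas as pd


def anon_sketch(s, seed=None, high=10_000_000):
    """Rename each category level to a distinct random integer in
    [0, high]. Hypothetical helper, not the pycats implementation."""
    rng = random.Random(seed)
    levels = list(s.cat.categories)
    # Sample without replacement so that no two levels share an integer.
    new_names = rng.sample(range(high + 1), len(levels))
    return s.cat.rename_categories(dict(zip(levels, new_names)))


s = pd.Series(['foo'] * 5 + ['bar'] * 3 + ['baz', 'baz2']).astype('category')
out = anon_sketch(s, seed=1)
print(out.tolist())
```

Rows that shared a level before still share one afterwards; only the names change.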
port of forcats fct_collapse - Collapse factor levels into manually defined groups
x - category object
groups - dictionary, with each key a new category level to use, and each value the list of existing category levels that the new level should replace. The values are collapsed to the key.
category object, with levels matching the specification in groups.
x = pd.DataFrame({
'a': [4,1,9,6,2,3,5,7,2,9],
'b': ['foo', 'foo', 'foo', 'foo2', 'foo3', 'bar', 'bar', 'bar2', 'baz', 'baz2']
})
groups = {
'other': ['bar2', 'baz'],
'cool': ['foo','foo2']
}
x['b'] = x['b'].astype('category')
x['b'] = pycats.cat_collapse(x['b'], groups)
print(x['b'])
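The groups dictionary maps each new level to the old levels it absorbs, so collapsing amounts to inverting that mapping. A minimal pandas-only sketch follows; collapse_sketch is a hypothetical name, and leaving unmentioned levels unchanged is an assumption of this sketch rather than confirmed cat_collapse behaviour.

```python
import pandas as pd


def collapse_sketch(s, groups):
    """Collapse levels according to {new_level: [old_levels]}.
    Hypothetical helper, not the pycats implementation."""
    # Invert the mapping into {old_level: new_level}.
    lookup = {old: new for new, olds in groups.items() for old in olds}
    # Levels not mentioned in groups keep their original name (assumed).
    collapsed = s.astype(object).map(lambda v: lookup.get(v, v))
    return collapsed.astype('category')


s = pd.Series(['foo', 'foo', 'foo', 'foo2', 'foo3',
               'bar', 'bar', 'bar2', 'baz', 'baz2'])
groups = {
    'other': ['bar2', 'baz'],
    'cool': ['foo', 'foo2'],
}
out = collapse_sketch(s, groups)
print(out.tolist())
```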
thin wrapper around pandas remove_unused_categories. May be replaced later, depending on performance (TBD)
x - category object
in_place - whether to drop unused categories in place or return a copy of this categorical with unused categories dropped
category object, with unused levels dropped
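The underlying pandas call can be seen in isolation below. This sketch uses only pandas and shows the copy-returning form; it does not call the pycats wrapper.

```python
import pandas as pd

s = pd.Series(['foo', 'bar', 'baz']).astype('category')

# Filtering rows does not remove the level from the category list.
subset = s[s != 'baz']
print(list(subset.cat.categories))   # ['bar', 'baz', 'foo']

# remove_unused_categories returns a copy without the unused level.
trimmed = subset.cat.remove_unused_categories()
print(list(trimmed.cat.categories))  # ['bar', 'foo']
```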
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
cd test
python -m unittest discover -t ..