In [1]:
import pandas as pd
import numpy as np

# Part 1: Pandas - advanced

## Categorical Data

- [Using The Pandas Category Data Type](https://pbpython.com/pandas_dtypes_cat.html)
- [Use Categorical Data to Save on Time and Space](https://realpython.com/python-pandas-tricks/#5-use-categorical-data-to-save-on-time-and-space)

This section introduces the pandas Categorical type. I will show how you can achieve
better performance and memory use in some pandas operations by using it. I
also introduce some tools for using categorical data in statistics and machine learning
applications.

### Background and Motivation

Frequently, a column in a table may contain repeated instances of a smaller set of distinct
values. We have already seen functions like `unique` and `value_counts`, which
enable us to extract the distinct values from an array and compute their frequencies,
respectively:

In [24]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

In [25]:
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [26]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [27]:
values.value_counts()

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have
developed specialized approaches for representing data with repeated values for more
efficient storage and computation. In data warehousing, a best practice is to use so-called
dimension tables containing the distinct values and storing the primary observations
as integer keys referencing the dimension table:

In [28]:
values = pd.Series([0, 1, 0, 0] * 2)

In [29]:
dim = pd.Series(['apple', 'orange'])

In [30]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [31]:
dim

0     apple
1    orange
dtype: object

We can use the take method to restore the original Series of strings:

In [32]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded representation.
The array of distinct values can be called the categories, dictionary, or levels
of the data. In this book we will use the terms categorical and categories. The integer
values that reference the categories are called the category codes or simply codes.
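
As a minimal sketch of the same idea, `pd.factorize` produces the integer codes and the distinct values in one call, so the dictionary encoding does not have to be built by hand. The variable names `codes`, `categories`, and `restored` are only illustrative:

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns integer codes plus the distinct values they point to
codes, categories = pd.factorize(values)

# Rebuild the original strings by looking each code up in the categories
restored = pd.Series(categories.take(codes))
```

Here `codes` plays the role of the integer Series and `categories` the role of `dim` in the cells above.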

The categorical representation can yield significant performance improvements when
you are doing analytics. You can also perform transformations on the categories while
leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:
- Renaming categories
- Appending a new category without changing the order or position of the existing categories
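
A small sketch of both transformations, using the `.cat` accessor on a toy Series (the replacement names and the extra `'pear'` category are made up for illustration). In each case the per-row codes stay exactly as they were:

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'apple', 'apple'], dtype='category')
codes_before = fruit.cat.codes.tolist()

# Renaming touches only the small categories index, not the per-row codes
renamed = fruit.cat.rename_categories({'apple': 'APPLE', 'orange': 'ORANGE'})

# Appending a category keeps existing categories in their current positions
extended = fruit.cat.add_categories(['pear'])
```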

### Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based
categorical representation or encoding. Let’s consider the example Series from before:

In [77]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [78]:
N = len(fruits)

In [79]:
df = pd.DataFrame({'fruit': fruits,
    'basket_id': np.arange(N),
    'count': np.random.randint(3, 15, size=N),
    'weight': np.random.uniform(0, 4, size=N)},
    columns=['basket_id', 'fruit', 'count', 'weight'])

In [80]:
df

   basket_id   fruit  count    weight
0          0   apple      7   3.55535
1          1  orange     14  1.269247
2          2   apple      6   0.36279
3          3   apple     12  2.987721
4          4   apple     10  2.114636
5          5  orange     11  2.021017
6          6   apple      3  1.525508
7          7   apple      3  2.222021


Here, df['fruit'] is an array of Python string objects. We can convert it to categorical
by calling:

In [81]:
fruit_cat = df['fruit'].astype('category')

In [82]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical:

In [83]:
c = fruit_cat.values

In [84]:
type(c)

pandas.core.arrays.categorical.Categorical

The Categorical object has categories and codes attributes:

In [85]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [86]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

> Notice that the dtype is NumPy’s int8, an 8-bit signed integer that can take on values from -128 to 127. (Only a single byte is needed to represent a value in memory. 64-bit signed ints would be overkill in terms of memory usage.) Our hand-built example earlier stored its codes as int64 by default, whereas pandas downcasts category codes to the smallest integer dtype that can hold them.
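
To see the downcasting in action, the sketch below compares the codes dtype for a categorical with two categories against one with 1,000 categories. The sizes are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Two categories: the codes fit comfortably in int8
small = pd.Categorical(['apple', 'orange', 'apple'])

# 1,000 distinct categories: int8 is too small, so a wider dtype is used
large = pd.Categorical(np.arange(1000))

print(small.codes.dtype, large.codes.dtype)
```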

You can convert a DataFrame column to categorical by assigning the converted result:

In [87]:
df['fruit'] = df['fruit'].astype('category')

In [90]:
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
basket_id    8 non-null int32
fruit        8 non-null category
count        8 non-null int32
weight       8 non-null float64
dtypes: category(1), float64(1), int32(2)
memory usage: 312.0 bytes


You can also create pandas.Categorical directly from other types of Python
sequences:

In [15]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [16]:
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

If you have obtained categorical encoded data from another source, you can use the
alternative from_codes constructor:

In [17]:
categories = ['foo', 'bar', 'baz']

In [18]:
codes = [0, 1, 2, 0, 0, 1]

In [20]:
my_cats_2 = pd.Categorical.from_codes(codes, categories)

In [21]:
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

Unless explicitly specified, categorical conversions assume no specific ordering of the
categories. So the categories array may be in a different order depending on the
ordering of the input data. When using from_codes or any of the other constructors,
you can indicate that the categories have a meaningful ordering:

In [22]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

In [23]:
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering,
and so on. An unordered categorical instance can be made ordered with as_ordered:

In [24]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
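
One practical consequence of ordering, sketched here with the same codes and categories: an ordered categorical supports `<` and `>` comparisons based on category position, while an unordered one only supports equality checks. The names `below_baz` and `unordered_comparable` are just for illustration:

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

unordered = pd.Categorical.from_codes(codes, categories)
ordered = unordered.as_ordered()

# Ordered categoricals compare by category position: foo < bar < baz
below_baz = ordered < 'baz'

# Unordered categoricals reject inequality comparisons
try:
    unordered < 'baz'
    unordered_comparable = True
except TypeError:
    unordered_comparable = False
```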

As a last note, categorical data need not be strings, even though I have only shown
string examples. A categorical array can consist of any immutable value types.

Sorting will use the order defined by categories, not any lexical order present on the data type. This is even true for strings and numeric data:

In [63]:
s = pd.Series([1, 2, 3, 1], dtype="category")

In [64]:
s = s.cat.set_categories([2, 3, 1], ordered=True)

In [65]:
s

0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [66]:
s.sort_values(inplace=True)

In [67]:
s

1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [68]:
s.min(), s.max()

(2, 1)

### Better performance with categoricals

If you do a lot of analytics on a particular dataset, converting to categorical can yield
substantial overall performance gains. A categorical version of a DataFrame column
will often use significantly less memory, too. Let’s consider some Series with 10 million
elements and a small number of distinct categories:

In [130]:
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

Now we convert labels to categorical:

In [131]:
categories = labels.astype('category')

In [138]:
categories.head(10)

0    foo
1    bar
2    baz
3    qux
4    foo
5    bar
6    baz
7    qux
8    foo
9    bar
dtype: category
Categories (4, object): [bar, baz, foo, qux]

Now we note that labels uses significantly more memory than categories:

In [132]:
labels.memory_usage()

80000080

In [133]:
categories.memory_usage()

10000272
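
Note that for an object-dtype Series, `memory_usage()` without arguments counts only the 8-byte pointers to the strings, not the strings themselves. The sketch below repeats the comparison on a smaller Series with `deep=True`, which also accounts for the string data; the row count is an arbitrary choice to keep it quick:

```python
import pandas as pd

n = 100_000  # far fewer rows than the 10 million above, to keep this fast
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
categories = labels.astype('category')

# deep=True also counts the memory held by the string values themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = categories.memory_usage(deep=True)

print(f"strings: {string_bytes:,} bytes, categorical: {category_bytes:,} bytes")
```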

The conversion to category is not free, of course, but it is a one-time cost:

In [135]:
%timeit _ = labels.astype('category')

427 ms ± 8.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


GroupBy operations can be significantly faster with categoricals because the underlying
algorithms use the integer-based codes array instead of an array of strings.

A bonus is that computational efficiency gets a boost too: for a categorical Series, string operations are performed on the `.cat.categories` attribute rather than on each original element of the Series.

In other words, the operation is done once per unique category, and the results are mapped back to the values. Categorical data has a .cat accessor that is a window into attributes and methods for manipulating the categories.
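
The idea can be sketched by hand. This illustrates the principle rather than pandas' internal implementation: transform the handful of unique categories, then use the codes to map the results back to every row.

```python
import pandas as pd

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * 3)
cats = labels.astype('category')

# Run the string operation on the four unique categories only...
upper_categories = cats.cat.categories.str.upper()

# ...then map the results back to every row through the integer codes
upper_via_codes = pd.Series(upper_categories.take(cats.cat.codes.to_numpy()))

# Same values as transforming each element one by one
upper_elementwise = labels.str.upper()
```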

### Categorical Methods

Series containing categorical data have several special methods, similar to the `Series.str`
specialized string methods. These methods also provide convenient access to the categories
and codes. Consider the Series:

In [143]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [144]:
cat_s = s.astype('category')

In [145]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special attribute cat provides access to categorical methods:

In [146]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [147]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the
four values observed in the data. We can use the set_categories method to change
them:

In [150]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

In [151]:
cat_s2 = cat_s.cat.set_categories(actual_categories)

In [152]:
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

While it appears that the data is unchanged, the new categories will be reflected in operations that use them. For example, `value_counts` respects the categories, if present:

In [153]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [154]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After you filter a large DataFrame or Series, many of the categories may not appear in the data. To help with this, we can use the `remove_unused_categories` method to trim unobserved categories:

In [206]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [207]:
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [208]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

There are a few caveats, though. Categorical data is generally less flexible. For instance, before inserting a previously unseen value, you need to add it to the categories first (for example with `.cat.add_categories`):

In [217]:
colors = pd.Series(['periwinkle', 'mint green', 'burnt orange',
                     'periwinkle', 'burnt orange', 'rose', 
                     'rose', 'mint green', 'rose', 'navy'])

ccolors = colors.astype('category')

In [219]:
# ValueError: Cannot setitem on a Categorical with a new category, set the categories first

# ccolors.iloc[5] = 'a new color'

In [210]:
cat_s3 = cat_s3.cat.add_categories(['m'])

In [220]:
ccolors = ccolors.cat.add_categories(['a new color'])

In [221]:
ccolors.iloc[5] = 'a new color' 

In [222]:
ccolors

0      periwinkle
1      mint green
2    burnt orange
3      periwinkle
4    burnt orange
5     a new color
6            rose
7      mint green
8            rose
9            navy
dtype: category
Categories (6, object): [burnt orange, mint green, navy, periwinkle, rose, a new color]
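
A defensive version of the same pattern is sketched below on a shortened color Series. It assumes, as in the error shown above, that assigning an unknown value raises; the exact exception type has differed between pandas versions, so both `TypeError` and `ValueError` are caught here:

```python
import pandas as pd

ccolors = pd.Series(['periwinkle', 'rose', 'navy'], dtype='category')
new_value = 'a new color'

try:
    ccolors.iloc[1] = new_value
    assigned_directly = True
except (TypeError, ValueError):
    # The value is not a known category yet, so register it and retry
    assigned_directly = False
    if new_value not in ccolors.cat.categories:
        ccolors = ccolors.cat.add_categories([new_value])
    ccolors.iloc[1] = new_value
```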

### Example: Using The Pandas Category Data Type

> Link to the dataset somewhere on a cloud platform

Categorical data is data which takes on a finite number of possible values. For example, if we were talking about a physical product like a t-shirt, it could have categorical variables such as:
- Size (X-Small, Small, Medium, Large, X-Large)
- Color (Red, Black, White)
- Style (Short sleeve, long sleeve)
- Material (Cotton, Polyester)

The key takeaway is that whether or not a variable is categorical depends on its application. Since we only have 3 colors of shirts, color is a good categorical variable here. However, “color” could take on thousands of values in other situations, so it would not be a good choice there.

There is no hard and fast rule for how many values a categorical value should have. You should apply your domain knowledge to make that determination on your own data sets. In this article, we will look at one approach for identifying categorical values.

The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and to more efficiently store the data.

At the end of the day why do we care about using categorical values? There are 3 main reasons:
- We can define a custom sort order which can improve summarizing and reporting the data. In the example above, “X-Small” < “Small” < “Medium” < “Large” < “X-Large”. Alphabetical sorting would not be able to reproduce that order.
- Some of the Python visualization libraries can interpret the categorical data type to apply appropriate statistical models or plot types.
- Categorical data uses less memory which can lead to performance improvements.
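
The first point can be sketched with the t-shirt sizes from above. The `shirts` Series is made-up sample data; the ordered `CategoricalDtype` is what lets the sort follow size order instead of alphabetical order:

```python
import pandas as pd

size_order = ['X-Small', 'Small', 'Medium', 'Large', 'X-Large']
size_type = pd.CategoricalDtype(categories=size_order, ordered=True)

shirts = pd.Series(['Large', 'X-Small', 'Medium', 'X-Large', 'Small'])

# Plain strings sort alphabetically
alphabetical = shirts.sort_values().tolist()

# The ordered categorical sorts by the position of each size in size_order
by_size = shirts.astype(size_type).sort_values().tolist()
```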

While categorical data is very handy in pandas, it is not necessary for every type of analysis. In fact, there can be some edge cases where defining a column of data as categorical and then manipulating the dataframe can lead to some surprising results. Care must be taken to understand the data set and the necessary analysis before converting columns to categorical data types.

#### Data Preparation

One of the main use cases for categorical data types is more efficient memory usage. In order to demonstrate, we will use a large data set from the US Centers for Medicare and Medicaid Services. This data set includes a 500MB+ csv file that has information about research payments to doctors and hospitals in fiscal year 2017.

First, set up imports and read in all the data:

In [27]:
df_raw = pd.read_csv('data/OP_DTL_RSRCH_PGYR2017_P01182019.csv', 
                     low_memory=False)

> **low_memory : bool, default True**: 
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser).

In [28]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602530 entries, 0 to 602529
Columns: 176 entries, Change_Type to Context_of_Research
dtypes: float64(34), int64(3), object(139)
memory usage: 809.1+ MB


In [41]:
# df_raw.head(3)

The 500MB csv file fills about 809MB of memory, going by the `df_raw.info()` output above. This seems large but even a low-end laptop has several gigabytes of RAM so we are nowhere near the need for specialized processing tools.

One interesting thing about this data set is that it has 176 columns but many of them are mostly empty. I found a Stack Overflow solution to quickly drop every column that is not at least 90% populated (`thresh` sets the minimum number of non-missing values a column needs in order to be kept). I thought this might be handy for others as well.

In [33]:
drop_thresh = df_raw.shape[0]*0.9

In [92]:
# Recent pandas versions do not allow how= together with thresh=; thresh alone is enough
df = df_raw.dropna(thresh=drop_thresh, axis='columns').copy()

In [93]:
# df.info()
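
To make the `thresh` behavior concrete, here is a small sketch on a made-up ten-row frame (the column names are hypothetical). A column survives only if it has at least `thresh` non-missing values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'full': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'mostly_full': [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],  # 90% populated
    'mostly_empty': [1] + [np.nan] * 9,                  # 10% populated
})

# thresh is the minimum number of non-missing values a column needs to survive
min_non_missing = round(len(toy) * 0.9)
trimmed = toy.dropna(thresh=min_non_missing, axis='columns')
```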

Now that we only have 33 columns, taking 153MB of memory, let’s take a look at which columns might be good candidates for a categorical data type.

In order to make this a little easier, I created a small helper dataframe showing the number of unique values in each column.

In [94]:
unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
                                         columns=['Column_Name', 'Num_Unique'])

In [95]:
unique_counts.sort_values(by=['Num_Unique'], inplace=True)

In [96]:
# unique_counts

This table highlights a couple of items that will help determine which values should be categorical. First, there is a big jump in the number of unique values once we get above 554. This should be a useful threshold for this data set.

In addition, the date fields should not be converted to categorical.

The simplest way to convert a column to a categorical type is to use astype('category'). We can use a loop to convert all the columns we care about:

In [97]:
cols_to_exclude = ['Program_Year', 'Payment_Publication_Date', 'Date_of_Payment']

In [98]:
for col in df.columns:
    if df[col].nunique() < 600 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')

In [99]:
# df.info()

If we use df.info() to look at the memory usage, we have taken the 153 MB dataframe down to 81.7 MB. This is pretty impressive. We have cut the memory usage almost in half just by converting to categorical values for the majority of our columns.
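
The loop above can be wrapped in a reusable function. The following is a sketch on synthetic data: the function name `convert_low_cardinality`, its parameters, and the toy columns are all hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def convert_low_cardinality(frame, max_unique, exclude=()):
    """Return a copy in which low-cardinality columns are stored as category."""
    out = frame.copy()
    for col in out.columns:
        if col not in exclude and out[col].nunique() < max_unique:
            out[col] = out[col].astype('category')
    return out

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'state': rng.choice(['CA', 'NY', 'TX'], size=10_000),
    'year': 2017,
    'amount': rng.uniform(0, 1_000, size=10_000),
})

converted = convert_low_cardinality(toy, max_unique=600, exclude=['year'])

# Compare total memory, including the string data, before and after
before = toy.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
```

Returning a copy keeps the original frame untouched, which makes it easy to compare the two side by side.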

There is one other feature we can use with categorical data - defining a custom order. To illustrate, let’s do a quick summary of the total payments made, grouped by recipient type:

In [100]:
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

                                     Total_Amount_of_Payment_USDollars
Covered_Recipient_Type
Covered Recipient Physician                                 82374420.0
Covered Recipient Teaching Hospital                       1023229000.0
Non-covered Recipient Entity                              3451684000.0
Non-covered Recipient Individual                             2806649.0


If we want to change the order of the Covered_Recipient_Type , we need to define a custom `CategoricalDtype`:

In [101]:
cats_to_order = ["Non-covered Recipient Entity", "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician", "Non-covered Recipient Individual"]

In [102]:
covered_type = pd.CategoricalDtype(categories=cats_to_order, ordered=True)

In [103]:
covered_type

CategoricalDtype(categories=['Non-covered Recipient Entity',
                  'Covered Recipient Teaching Hospital',
                  'Covered Recipient Physician',
                  'Non-covered Recipient Individual'],
                 ordered=True)

Then, explicitly reorder the categories:

In [104]:
df['Covered_Recipient_Type'] = df['Covered_Recipient_Type'].cat.reorder_categories(cats_to_order, ordered=True)

Now, we can see the sort order in effect with the groupby:

In [105]:
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

                                     Total_Amount_of_Payment_USDollars
Covered_Recipient_Type
Non-covered Recipient Entity                              3451684000.0
Covered Recipient Teaching Hospital                       1023229000.0
Covered Recipient Physician                                 82374420.0
Non-covered Recipient Individual                             2806649.0


If you have this same type of data file that you will be processing repeatedly, you can specify this conversion when reading the csv by passing a dictionary of column names and types via the dtype parameter.

In [None]:
df_raw_2 = pd.read_csv('data/OP_DTL_RSRCH_PGYR2017_P01182019.csv',
                        dtype={'Covered_Recipient_Type':covered_type})
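
The same mechanism can be tried without the 500MB file. The sketch below reads a tiny in-memory CSV whose rows and amounts are made up to stand in for the real data, and checks that the column arrives as the ordered categorical:

```python
import io
import pandas as pd

cats_to_order = ["Non-covered Recipient Entity", "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician", "Non-covered Recipient Individual"]
covered_type = pd.CategoricalDtype(categories=cats_to_order, ordered=True)

# Made-up rows standing in for the real CMS file
csv_text = (
    "Covered_Recipient_Type,Total_Amount_of_Payment_USDollars\n"
    "Covered Recipient Physician,100.0\n"
    "Non-covered Recipient Entity,250.5\n"
)

# The column is parsed straight into the ordered categorical dtype
sample = pd.read_csv(io.StringIO(csv_text),
                     dtype={'Covered_Recipient_Type': covered_type})
```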

#### Performance

We’ve shown that the size of the dataframe is reduced by converting values to categorical data types. Does this impact other areas of performance? The answer is yes.

Here is an example of a groupby operation on the categorical vs. object data types. First, perform the analysis on the original input dataframe.

In [108]:
%%timeit
df_raw.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

59.5 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Now, on the dataframe with categorical data:

In [109]:
%%timeit
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

6.29 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In this case we sped up the code by roughly 10x (59.5 ms versus 6.29 ms). You can imagine that on much larger data sets, the speedup could be even greater.

#### Watch Outs

Categorical data seems pretty nifty. It saves memory and speeds up code, so why not use it everywhere? Well, Donald Knuth is correct when he warns about premature optimization:

> The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

In the examples above, the code is faster but it really does not matter when it is used for quick summary actions that are run infrequently. In addition, all the work to figure out and convert to categorical data is probably not worth it for this data set and this simple analysis.

In addition, categorical data can yield some surprising behaviors in real world usage. The examples below will illustrate a couple of issues.

Let’s build a simple dataframe with one ordered categorical variable that represents the status of the customer. This trivial example will highlight some potential subtle errors when dealing with categorical values. It is worth noting that this example shows how to use astype() to convert to the ordered category in one step instead of the two-step process used earlier.

In [110]:
sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Status': 'Silver', 'Jan': 50,  'Feb': 90,  'Mar': 95 }]

In [111]:
df_1 = pd.DataFrame(sales_1)

In [112]:
status_type = pd.CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)

In [113]:
df_1['Status'] = df_1['Status'].astype(status_type)

This yields a simple dataframe that looks like this:

In [114]:
df_1

   Feb  Jan  Mar  Status    account
0  200  150  140    Gold  Jones LLC
1  210  200  215    Gold   Alpha Co
2   90   50   95  Silver   Blue Inc


We can inspect the categorical column in more detail:

In [115]:
df_1['Status']

0      Gold
1      Gold
2    Silver
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]

All looks good. We see the data is all there and that Gold ranks above Silver.

Now, let’s bring in another dataframe and apply the same category to the status column:

In [116]:
sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
         {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]

In [117]:
df_2 = pd.DataFrame(sales_2)

In [118]:
df_2['Status'] = df_2['Status'].astype(status_type)

In [119]:
df_2

   Feb  Jan  Mar  Status   account
0  100  100   70  Silver  Smith Co
1   65  310   80     NaN     Bingo


Hmm. Something happened to our status. If we just look at the column in more detail:

In [120]:
df_2['Status']

0    Silver
1       NaN
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]

We can see that since we did not define “Bronze” as a valid status, we end up with a NaN value. Pandas does this for a perfectly good reason. It assumes that you have defined all of the valid categories and in this case, “Bronze” is not valid. You can just imagine how confusing this issue could be to troubleshoot if you were not looking out for it.

This scenario is relatively easy to see but what would you do if you had hundreds of values and the data was not cleaned and normalized properly?
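
One possible safeguard, sketched below with hypothetical variable names, is to compare the raw values against the declared categories before converting, and to count how many missing values the conversion itself introduces:

```python
import pandas as pd

status_type = pd.CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
raw_status = pd.Series(['Silver', 'Bronze'])

# Values that the dtype does not list would turn into NaN on conversion
unexpected = sorted(set(raw_status.dropna()) - set(status_type.categories))

converted = raw_status.astype(status_type)

# Count the missing values that the conversion itself introduced
new_missing = int(converted.isna().sum() - raw_status.isna().sum())
```

A non-empty `unexpected` list or a positive `new_missing` count is a signal to clean the data or extend the categories before going further.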

Here’s another tricky example where you can “lose” the category object:

In [121]:
sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Status': 'Silver', 'Jan': 50,  'Feb': 90,  'Mar': 95 }]

df_1 = pd.DataFrame(sales_1)

In [122]:
# Define an unordered category
df_1['Status'] = df_1['Status'].astype('category')

In [123]:
sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
     {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]

df_2 = pd.DataFrame(sales_2)

In [125]:
df_2['Status'] = df_2['Status'].astype('category')

In [126]:
# Combine the two dataframes into 1
df_combined = pd.concat([df_1, df_2])

In [127]:
df_combined

   Feb  Jan  Mar  Status    account
0  200  150  140    Gold  Jones LLC
1  210  200  215    Gold   Alpha Co
2   90   50   95  Silver   Blue Inc
0  100  100   70  Silver   Smith Co
1   65  310   80  Bronze      Bingo


Everything looks ok but upon further inspection, we’ve lost our category data type:

In [128]:
df_combined['Status']

0      Gold
1      Gold
2    Silver
0    Silver
1    Bronze
Name: Status, dtype: object

In this case, the data is still there but the type has been converted to object. Once again, this is pandas' attempt to combine the data without throwing errors and without making assumptions. If you want to convert to a category data type now, you can use astype('category').
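
An alternative is to give both frames the same categorical dtype before combining them, in which case the category type is preserved. This is a sketch with trimmed-down frames; the category list, including its order, is an assumption about which statuses are valid:

```python
import pandas as pd

df_1 = pd.DataFrame({'account': ['Jones LLC', 'Blue Inc'],
                     'Status': ['Gold', 'Silver']})
df_2 = pd.DataFrame({'account': ['Smith Co', 'Bingo'],
                     'Status': ['Silver', 'Bronze']})

# One dtype that lists every status appearing in either frame
shared_type = pd.CategoricalDtype(categories=['Bronze', 'Silver', 'Gold'])
df_1['Status'] = df_1['Status'].astype(shared_type)
df_2['Status'] = df_2['Status'].astype(shared_type)

# Matching categorical dtypes survive the concat instead of falling back to object
df_combined = pd.concat([df_1, df_2], ignore_index=True)
```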

#### General Guidelines

Now that you know about these gotchas, you can watch out for them. But I will give a few guidelines for how I recommend using categorical data types:

1. Do not assume you need to convert all categorical data to the pandas category data type.
2. If the data set starts to approach an appreciable percentage of your useable memory, then consider using categorical data types.
3. If you have very significant performance concerns with operations that are executed frequently, look at using categorical data.
4. If you are using categorical data, add some checks to make sure the data is clean and complete before converting to the pandas category type. Additionally, check for NaN values after combining or converting dataframes.



- https://janakiev.com/blog/pandas-multiindex-pivot/
- https://github.com/datacamp/community-hierarchical-indices/blob/master/hierarchical_indices_multiple_groupbys_and_pandas.ipynb

## Write Pandas Objects Directly to Compressed Format