# Python DataFrames vs Native Data Structures - Memory Consumption

In this experiment, we evaluate the memory footprint in Python 3 of data stored in various tabular formats. In particular, we compare pandas DataFrames, lists of dictionaries and dictionaries of lists.

These are three different ways to store table-like data, that is, data organised into rows and columns. In this examination, we ignore any questions about efficient reads, writes or lookups. We are purely concerned with one question today: which approach is the most memory intensive?

## Dataset 


We generate a nonsense (but very large) simple test dataset for this experiment, using a list of some popular dog breeds. Note that this list is definitely not biased at all, and all of them are definitely dogs.

To keep the test general enough for most use cases, the dataset contains three primitive data types: str, int and float.

In [30]:
import random
random.seed(123)

breeds = [
    'German Shepherd',
    'Golden Retriever',
    'Siberian Husky',
    'Japanese Spitz',
    'Sleeping Whippets',
    'Samoyed',
    'Hippogriff',
    'Standard Poodle',
    'Wallaby',
    'Dalmatian',
    'Dachshund',
    'Flying Fox',
    'Mandrake',
]

list_of_dictionaries = [ {
    'breed': random.choice(breeds),
    'count': random.randrange(0,200), 
    'barks': random.uniform(50,70),
} for _ in range(1000000) ]

## Experiments

### DataFrames

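Measuring the memory of a DataFrame is relatively simple, and can be done with the built-in method `DataFrame.memory_usage`. Passing `deep=True` asks pandas to also measure the Python objects held in object-typed columns (here, the breed strings) rather than only the references to them.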

In [31]:
import pandas as pd

df = pd.DataFrame(list_of_dictionaries)
mem = df.memory_usage(index=True, deep=True)
total_size = sum(mem)

print(f'{total_size:,} B')
print(f'{total_size/1000000:.2f} MB')

84,613,093 B
84.61 MB
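To see where the bytes go, it can help to compare the shallow and deep figures column by column. The following is a minimal sketch on a hypothetical three-row sample (the values are made up for illustration). The exact numbers depend on the pandas version and on how it stores strings, so no output is shown.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the dataset above.
sample = pd.DataFrame({
    'breed': ['Samoyed', 'Dalmatian', 'Samoyed'],
    'count': [3, 150, 42],
    'barks': [51.2, 68.9, 60.0],
})

# deep=False reports only the column buffers themselves;
# deep=True additionally inspects the Python objects in object columns.
shallow = sample.memory_usage(index=True, deep=False)
deep = sample.memory_usage(index=True, deep=True)

print(sample.dtypes)
print(shallow)
print(deep)
```

The numeric columns report the same size either way, since their values live directly in fixed-width arrays. Only the string column can differ between the two calls.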


### List of Dictionaries

Measuring a list of dictionaries is slightly more involved. This is because the reported memory footprint of a list or dictionary does not include the memory taken up by the objects it refers to. As such, one has to iterate over each dictionary in the list, and over each key-value pair within each dictionary, and add up the individual sizes.

We will use the function `sys.getsizeof` to get the memory footprint of each individual object.
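The shallow behaviour described above can be checked directly. This is a small sketch with two made-up lists, each holding a single string of very different length:

```python
import sys

small = ['a']
large = ['a' * 10_000]

# Each list stores exactly one reference, so the lists themselves
# report the same size regardless of how big the referenced string is.
print(sys.getsizeof(small), sys.getsizeof(large))

# The difference only shows up when the elements are measured.
print(sys.getsizeof(small[0]), sys.getsizeof(large[0]))
```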

In [32]:
import sys

total_size = 0
total_size += sys.getsizeof(list_of_dictionaries)
for dictionary in list_of_dictionaries:
    total_size += sys.getsizeof(dictionary)
    for key, value in dictionary.items():
        total_size += sys.getsizeof(key)
        total_size += sys.getsizeof(value)

print(f'{total_size:,} B')
print(f'{total_size/1000000:.2f} MB')

515,041,877 B
515.04 MB
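One caveat: the loop above adds an object's size every time it is referenced. In CPython, many of these references are likely to point at the same object: the breed strings come from a 13-element list, and small integers are cached by the interpreter. The figure above is therefore probably an upper bound. A minimal sketch of a variant that counts each distinct object once, using `id()`, might look like this. The helper names `deep_size_naive` and `deep_size_unique` are hypothetical, and both assume flat dictionaries whose values are not themselves containers.

```python
import sys

def deep_size_naive(rows):
    """Sum sizes per reference, as in the loop above."""
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row)
        for key, value in row.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
    return total

def deep_size_unique(rows):
    """Sum sizes, counting each distinct object only once."""
    seen = set()  # ids of objects that have already been counted
    total = 0

    def add(obj):
        nonlocal total
        if id(obj) not in seen:
            seen.add(id(obj))
            total += sys.getsizeof(obj)

    add(rows)
    for row in rows:
        add(row)
        for key, value in row.items():
            add(key)
            add(value)
    return total

# Made-up sample: every row refers to the same breed string object.
breed = 'Samoyed'
rows = [{'breed': breed, 'count': 7} for _ in range(1000)]

print(deep_size_naive(rows))
print(deep_size_unique(rows))
```

The per-row dictionaries are still counted individually in both versions, so they remain the dominant cost of this layout.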


### Dictionary of Lists

An alternative way of representing tabular data in a JSON-like format is the dictionary of lists. It consists of a dictionary in which each key represents a column and points to a list. An index into each list then corresponds to a row index.

An example of such a data structure is as follows:
```python
{
    'breed': [ ... ],
    'count': [ ... ],
    'barks': [ ... ],
}
```
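The relationship between the two layouts can be sketched in plain Python. The two rows below are made-up values, and the conversion assumes every row has the same keys as the first one:

```python
rows = [
    {'breed': 'Samoyed', 'count': 3, 'barks': 51.2},
    {'breed': 'Dalmatian', 'count': 150, 'barks': 68.9},
]

# Row-oriented to column-oriented: one list per key.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row i is recovered by reading index i from every column.
row_1 = {key: values[1] for key, values in columns.items()}

print(columns)
print(row_1)
```

The column layout needs only one dictionary and one list per column, whereas the row layout needs one dictionary per row.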

In [33]:
dictionary_of_lists = df.to_dict(orient='list')
total_size = 0
total_size += sys.getsizeof(dictionary_of_lists)
for key, column in dictionary_of_lists.items():
    total_size += sys.getsizeof(key)
    total_size += sys.getsizeof(column)
    for item in column:
        total_size += sys.getsizeof(item)

print(f'{total_size:,} B')
print(f'{total_size/1000000:.2f} MB')

136,593,711 B
136.59 MB


## Conclusion

We can see that pandas DataFrames, despite their added complexity, have a significantly smaller footprint than a list of dictionaries (about 6 times smaller) and a smaller footprint than a dictionary of lists (about 1.6 times smaller). We can therefore conclude that DataFrames can be a useful, non-trivial memory optimisation in certain use cases.
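The ratios follow directly from the three totals printed above. As a quick check of the arithmetic:

```python
# Totals in bytes, copied from the outputs of the three experiments.
sizes = {
    'DataFrame': 84_613_093,
    'list of dictionaries': 515_041_877,
    'dictionary of lists': 136_593_711,
}

baseline = sizes['DataFrame']
ratios = {name: size / baseline for name, size in sizes.items()}

for name, ratio in ratios.items():
    print(f'{name}: {ratio:.2f}x the DataFrame footprint')
```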