## Memory Utilization (the computer kind)

When scaling code up for big data analyses, it becomes important to predict RAM requirements and to monitor RAM consumption, so that you make full use of the available resources.  This also helps identify parts of an analysis that must be modified to run within the physical constraints of a particular system.

In [1]:
import numpy as np
import psutil
import sys
import os
import gc

In [2]:
big_array = np.zeros((1024, 1024, 128))

In [3]:
# Dimensions
big_array.shape

(1024, 1024, 128)

In [4]:
# Data being stored
big_array.dtype

dtype('float64')

In [5]:
# Storage per element
big_array.dtype.itemsize

8

In [6]:
# Total element count: N_x * N_y * N_z
np.prod(big_array.shape)

134217728

In [7]:
# Total storage
np.prod(big_array.shape) * big_array.dtype.itemsize

1073741824

In [8]:
# Human readable (here 1 GB means 1024**3 bytes):
def PrintArraySize(a):
  count = np.prod(a.shape)
  elem_size = a.dtype.itemsize
  print(f'{count*elem_size/1024**3} GB')

PrintArraySize(big_array)

1.0 GB
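
Because the size depends only on the shape and the dtype, the requirement can be predicted before anything is allocated.  The following is a minimal sketch, assuming a hypothetical helper named `estimate_nbytes` (it is not part of numpy).  For an array that already exists, the `nbytes` attribute reports the same number.

```python
import math
import numpy as np

def estimate_nbytes(shape, dtype=np.float64):
    """Predict the bytes needed for an array of this shape and dtype.

    Hypothetical helper for illustration: nothing is allocated here.
    """
    # math.prod multiplies Python ints, so very large shapes cannot overflow.
    return math.prod(shape) * np.dtype(dtype).itemsize

needed = estimate_nbytes((1024, 1024, 128))
print(f'{needed / 1024**3} GiB as float64')

# The same shape stored as float32 needs half the memory.
print(f'{estimate_nbytes((1024, 1024, 128), np.float32) / 1024**3} GiB as float32')

# For an existing array, nbytes gives the same figure directly.
small = np.zeros((4, 8))
print(small.nbytes == estimate_nbytes(small.shape, small.dtype))
```

Choosing a smaller dtype is often the cheapest way to fit an analysis into memory, provided the reduced precision is acceptable for the calculation.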


In [9]:
psutil.virtual_memory()

svmem(total=134775570432, available=110147788800, percent=18.3, used=20981387264, free=14947459072, active=75775299584, inactive=40267792384, buffers=0, cached=98846724096, shared=3005259776, slab=2565996544)
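
The `available` field can be compared against a predicted size before a large allocation is attempted.  This is a rough sketch only: `fits_in_memory` is a hypothetical helper, and the 0.8 safety margin is an arbitrary assumption.  On a shared cluster node the figure describes the whole node, which may be more than a single job is allowed to use.

```python
import psutil

def fits_in_memory(nbytes, safety_fraction=0.8):
    """Return True if nbytes is within a fraction of the available RAM.

    Hypothetical helper; safety_fraction=0.8 is an assumed margin.
    """
    available = psutil.virtual_memory().available
    return nbytes <= safety_fraction * available

one_gib = 1024**3
print(fits_in_memory(one_gib))
```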

In [10]:
psutil.Process(os.getpid())

psutil.Process(pid=105977, name='python', status='running', started='13:29:15')

In [11]:
# rss is "Resident Set Size", the physical memory the process occupies now.
# vms is "Virtual Memory Size", which also includes other kinds of memory, such
# as memory swapped to disk (not in use on Rhino) and memory that might be
# copy-on-write shared, such as shared library memory, which typically won't
# count against usage limits.
# The rss value here is low because of Linux "over-commit" features:
# Linux only assigns real memory once it's used, not when it's reserved.
psutil.Process(os.getpid()).memory_info()

pmem(rss=76439552, vms=2089861120, shared=14061568, text=3342336, lib=0, data=1900802048, dirty=0)

To check your active processes, ssh into rhino (or open a terminal) and run:
```squeue -u your_username```

Your JupyterLab server will appear with a name like "spawner-" and will have an associated node.  Your cluster jobs will have names based on the functions you launched.  You can connect to the node of an active job with ssh, for example:
```ssh node42```

Then you can run the program "top", where shift-P sorts processes by processor load and shift-M sorts them by memory use.  This is sometimes helpful for quickly seeing the load of a running process.

In [12]:
# Now let's actually use the memory!  We'll assign 1 to every entry.
big_array[:] = 1

In [13]:
# Suddenly Linux recognizes that this is real memory in real use,
# and gives it physical storage!
psutil.Process(os.getpid()).memory_info()

pmem(rss=1150914560, vms=2089861120, shared=14278656, text=3342336, lib=0, data=1900802048, dirty=0)

In [14]:
psutil.Process(os.getpid()).memory_info().rss / 1024**3

1.0718765258789062
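
The same expression is evaluated many times in this notebook, so it can be convenient to wrap it in a small function.  A minimal sketch, assuming a hypothetical helper named `rss_gib`:

```python
import os
import psutil

def rss_gib(pid=None):
    """Resident set size of a process in GiB (hypothetical helper).

    With no argument, the current Python process is measured.
    """
    process = psutil.Process(os.getpid() if pid is None else pid)
    return process.memory_info().rss / 1024**3

print(f'{rss_gib():.3f} GiB resident')
```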

In [15]:
# Clean up memory usage.  Note that this only sometimes works right away!
big_array = None
psutil.Process(os.getpid()).memory_info().rss / 1024**3

0.07188034057617188

In [16]:
big_array = None
# If you need a stronger guarantee of clean-up (rarely needed),
# you can try calling the garbage collector manually
gc.collect()
psutil.Process(os.getpid()).memory_info().rss / 1024**3

0.0718841552734375

In [17]:
def WorkFunction():
  big_array = np.zeros((1024, 1024, 128))
  big_array[:] = 1
  print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)

WorkFunction()
# Variables are cleaned up when the function returns!
print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)

1.0717277526855469
0.07190322875976562


In [18]:
# It's easy to build up memory utilization as you keep making modified copies.
big_array = np.zeros((1024, 1024, 128))
big_array[:] = 1
big_array2 = 2*big_array
print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)

2.071880340576172


In [19]:
# del removes the variables manually.  The clean-up is the same as assigning
# None, except that the names are no longer defined afterwards.
del big_array
del big_array2

In [20]:
# Modifying a numpy array in place avoids allocating a second copy.
big_array = np.zeros((1024, 1024, 128))
big_array[:] = 1
big_array *= 2
print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)

1.0719375610351562
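
The difference between making a copy and working in place can be checked directly on a small array.  This sketch uses `np.shares_memory` to test whether two arrays use the same buffer, and the `out=` argument that numpy ufuncs accept for writing a result into an existing array.

```python
import numpy as np

a = np.ones((256, 256))

# An expression like 2 * a allocates a second array of the same size.
b = 2 * a
print(np.shares_memory(a, b))

# Plain assignment only adds another name for the same buffer, no copy.
c = a
c *= 2  # in place, so a sees the change too
print(np.shares_memory(a, c), a[0, 0])

# Ufuncs accept out= to write the result into an existing array.
np.multiply(a, 2, out=a)
print(a[0, 0])
```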


In [21]:
# Dimensionally reduced data is minuscule (the mean here is only 1 MB);
# the total below is still dominated by big_array:
big_array = np.zeros((1024, 1024, 128))
big_array[:] = 1
big_array_mean = np.mean(big_array, axis=0)
print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)

1.0719757080078125


In [22]:
del big_array

In [23]:
# Smart approach!
def MakeReducedData():
  big_array = np.zeros((1024, 1024, 128))
  big_array[:] = 1
  big_array_mean = np.mean(big_array, axis=0)
  print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)
  return big_array_mean

big_array_mean = MakeReducedData()
print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)

1.0727462768554688
0.07293701171875
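
When even one full-size array does not fit, the same reduction can sometimes be built up piece by piece.  The following is a sketch under stated assumptions, not a general recipe: `chunked_mean_axis0` and the `make_chunk` callback are hypothetical names, and the data is assumed to arrive as chunks stacked along axis 0.  Only one chunk and the running sum are needed at a time.

```python
import numpy as np

def chunked_mean_axis0(make_chunk, n_chunks):
    """Mean over axis 0 of data delivered as n_chunks pieces along axis 0.

    make_chunk(i) is a hypothetical callback that returns chunk number i.
    """
    total = None
    count = 0
    for i in range(n_chunks):
        chunk = make_chunk(i)
        partial = chunk.sum(axis=0)
        total = partial if total is None else total + partial
        count += chunk.shape[0]
    return total / count

# Demonstration on small data.  The full array exists here only so the
# result can be checked; in real use each chunk would be loaded or
# computed on demand.
rng = np.random.default_rng(0)
full = rng.random((64, 16, 8))
result = chunked_mean_axis0(lambda i: full[i * 8:(i + 1) * 8], n_chunks=8)
print(np.allclose(result, full.mean(axis=0)))
```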


## File Input/Output (IO)

As you continue with big data analyses, you will need to think more carefully about file IO, because you shift from using premade data to generating data yourself.

### Temporary / Intermediate data

This kind of data is transient: it is used in the middle of a calculation to store a result for minutes, hours, or perhaps a couple of months.  It typically does not need to be backed up, and you are not concerned with whether it can be read years later, in a new environment, on another computer, or from another language.  You just need it to work cleanly and simply in the short term, usually to avoid recalculating things.

Good examples of this in Python are pickle files and numpy saved files.  Pickle is a "do not share it" data format, because loading pickled data can execute code contained in the file, which is a security risk.  But since there is little chance of you hacking yourself, you can gain the advantages of quickly streaming arbitrary Python objects to disk and reloading them.  Pickled data is NOT compatible if, for example, you save an object of a class type and then change the class!  The same type needs to be available when you reload.  This means it can stop working even for standard types or common library types if you upgrade to a new version of Python or switch to a new environment.  Within those limits, it is a powerful and useful format for temporary, local use.
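
The reason the same type must be available at load time is that pickle stores an instance by referring to its module and class name, not by storing the class definition.  A small sketch using a standard library type shows the name inside the pickled bytes:

```python
import datetime
import pickle

payload = pickle.dumps(datetime.date(2020, 1, 1))

# The bytes name the module and class; the class definition is not stored.
print(b'datetime' in payload)

# Loading looks the class up again by that name and rebuilds the object.
restored = pickle.loads(payload)
print(restored == datetime.date(2020, 1, 1))
```

If that lookup fails, or finds a class that has changed incompatibly, the load fails or produces an object that no longer matches the code.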

In [24]:
import pickle
d = {'Bob': 34, 'Alice': 43, 'Joe': 25, 'Susan': 36}
with open('my_dictionary.pkl', 'wb') as fw:
  pickle.dump(d, fw)

In [36]:
with open('my_dictionary.pkl', 'rb') as fr:
  d2 = pickle.load(fr)
d2

{'Bob': 34, 'Alice': 43, 'Joe': 25, 'Susan': 36}

Saving with numpy, using np.save or np.savez, is similar, because some things you might save with numpy (such as object arrays) are actually pickled.  Pure numerical data saved with numpy has somewhat better longevity, and non-pickled data saved with numpy can be shared, but it is probably not a format to trust for very long term storage.
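
Whether a file depends on pickle can be made explicit with the `allow_pickle` argument of `np.save` and `np.load`.  In this sketch (the file names and the temporary directory are illustrative choices), a numerical array is written and read without pickle, while an object array is refused:

```python
import os
import tempfile
import numpy as np

numbers = np.arange(6, dtype=np.float64).reshape(2, 3)
objects = np.array([{'a': 1}, 'text'], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'numbers.npy')
    # Pure numerical data never needs pickle.
    np.save(path, numbers, allow_pickle=False)
    loaded = np.load(path, allow_pickle=False)
    print(np.array_equal(loaded, numbers))

    # Object arrays can only be stored by pickling, so this is refused.
    try:
        np.save(os.path.join(tmp, 'objects.npy'), objects, allow_pickle=False)
        refused = False
    except ValueError as err:
        refused = True
        print('refused:', err)
```

Passing `allow_pickle=False` when saving is a simple way to confirm that a file you intend to share contains no pickled data.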

In [26]:
a = np.ones((1024, 32))
print(np.prod(a.shape)*a.dtype.itemsize)
np.save('my_array.npy', a)
os.path.getsize('my_array.npy')

262144


262272

Notice that numpy stores binary data compactly: the file is only 128 bytes larger than the raw array data.  This is an asset for large data.

### Long term archival of data

If your data is small, dimensionally reduced data, text formats like csv (comma-separated values) can be very convenient for long term archival (decades, up to a human lifetime).  This matters a lot for valuable data!

In [27]:
import pandas as pd
pd.DataFrame(a).to_csv('my_array.csv')
os.path.getsize('my_array.csv')

135169

Here csv looks smaller!  But that's deceptive: we're actually storing "1.0,1.0,1.0,...", which is very compact only because we used all ones.

In [28]:
a = np.random.random((1024, 32))
print(np.prod(a.shape)*a.dtype.itemsize)
np.save('my_array2.npy', a)
os.path.getsize('my_array2.npy')

262144


262272

In [29]:
pd.DataFrame(a).to_csv('my_array2.csv')
os.path.getsize('my_array2.csv')

635664

In [30]:
# Let's look at the first 5 lines of the file the manual way.
# (readlines keeps each line's newline, so joining with '\n' adds blank lines.)
with open('my_array2.csv', 'r') as fr:
  lines = fr.readlines()
  print('\n'.join(lines[0:5]))

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31

0,0.5221398373735353,0.878158461637144,0.5658874757670006,0.010533476771540151,0.7032463417597493,0.3636674395217909,0.8427889669099672,0.9756874821119245,0.2873606452730445,0.3658607204590304,0.6014833752915677,0.5361589981241498,0.9867838237161558,0.4175930924573791,0.20533108661435273,0.3758964185819238,0.8200142068561991,0.14242864197829197,0.6660212674650331,0.2608031340968813,0.9641711740101091,0.8396557898115048,0.3762816078386366,0.12489480992922397,0.9595306058939054,0.32874567373059904,0.5668151360207953,0.39923762491703074,0.5966142377002017,0.4105682342003333,0.6317588928539587,0.7242054680046272

1,0.7814038572307075,0.6307285124443849,0.9839196911329311,0.8104328668135102,0.10455734480111045,0.5500214038164026,0.1844991663792298,0.2708450157486192,0.769890366380933,0.8013905190940831,0.6602251826403915,0.38663607936008215,0.2631915858034609,0.8932694835533228,0.9174085506983124,0.02889

In [31]:
# Maybe we just wanted the data, and not the headers and indices...
pd.DataFrame(a).to_csv('my_array3.csv', header=False, index=False)
os.path.getsize('my_array3.csv')

631567

In [32]:
# Now let's look at the first 5 lines of the file the manual way:
with open('my_array3.csv', 'r') as fr:
  lines = fr.readlines()
  print('\n'.join(lines[0:5]))

0.5221398373735353,0.878158461637144,0.5658874757670006,0.010533476771540151,0.7032463417597493,0.3636674395217909,0.8427889669099672,0.9756874821119245,0.2873606452730445,0.3658607204590304,0.6014833752915677,0.5361589981241498,0.9867838237161558,0.4175930924573791,0.20533108661435273,0.3758964185819238,0.8200142068561991,0.14242864197829197,0.6660212674650331,0.2608031340968813,0.9641711740101091,0.8396557898115048,0.3762816078386366,0.12489480992922397,0.9595306058939054,0.32874567373059904,0.5668151360207953,0.39923762491703074,0.5966142377002017,0.4105682342003333,0.6317588928539587,0.7242054680046272

0.7814038572307075,0.6307285124443849,0.9839196911329311,0.8104328668135102,0.10455734480111045,0.5500214038164026,0.1844991663792298,0.2708450157486192,0.769890366380933,0.8013905190940831,0.6602251826403915,0.38663607936008215,0.2631915858034609,0.8932694835533228,0.9174085506983124,0.02889416547388346,0.6075402403373885,0.8607454097174535,0.5915069310006902,0.5560166541480497,0.6
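
A text file is only useful as an archive if the numbers can be read back unchanged.  The sketch below checks this using numpy's own text functions, `np.savetxt` and `np.loadtxt`, instead of pandas (a substitution made here to keep the example small).  The format `'%.17g'` writes 17 significant digits, which is enough to identify any float64 value exactly.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
values = rng.random((4, 3))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'values.csv')
    # 17 significant digits are enough to recover each float64 exactly.
    np.savetxt(path, values, delimiter=',', fmt='%.17g')
    restored = np.loadtxt(path, delimiter=',')
    text_size = os.path.getsize(path)

print(np.array_equal(restored, values))
print(text_size, 'bytes as text versus', values.nbytes, 'bytes in memory')
```

The size comparison shows the trade-off again: each value costs 8 bytes in binary but many more as decimal text.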

The json format is another human-readable text format, and it can be expected to remain usable for long term archival.  As a text format it is easy to work with but often not compact for numerical data, so it is an appropriate store for dimensionally reduced results and smaller data that you want to preserve for a long time.

In [33]:
import json
d = {'Bob': 34, 'Alice': 43, 'Joe': 25, 'Susan': 36, 'TheSmithBrothers': [29, 31]}
with open('the_ages.json', 'w') as fw:
  json.dump(d, fw)
os.path.getsize('the_ages.json')

78

In [34]:
with open('the_ages.json', 'r') as fr:
  d2 = json.load(fr)
print(d2)

{'Bob': 34, 'Alice': 43, 'Joe': 25, 'Susan': 36, 'TheSmithBrothers': [29, 31]}


In [35]:
# Let's see what we stored.
with open('the_ages.json', 'r') as fr:
  lines = fr.readlines()
  print('\n'.join(lines))

{"Bob": 34, "Alice": 43, "Joe": 25, "Susan": 36, "TheSmithBrothers": [29, 31]}


The json format stores data as human-readable text, and in fact looks a lot like Python code!  This is a very future-safe archival format for data that fits into the supported types.
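
Numpy arrays are not among the types the json module supports, so a reduced result has to be converted first.  A minimal sketch, using `tolist()` to turn a small array into plain Python lists (the key name `means` is only an illustration):

```python
import json
import numpy as np

means = np.array([[0.5, 1.5], [2.5, 3.5]]).mean(axis=0)

# Passing the array itself raises a TypeError.
try:
    json.dumps({'means': means})
    direct_ok = True
except TypeError as err:
    direct_ok = False
    print('not supported directly:', err)

# tolist() converts to nested Python lists of plain floats.
text = json.dumps({'means': means.tolist()})
print(text)

# Reading it back gives lists, which can be turned into an array again.
restored = np.array(json.loads(text)['means'])
print(np.array_equal(restored, means))
```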

For long term compact archiving of large binary data, you will want to read up on formats for the particular type of data you are storing, and choose one that you estimate will work well and remain supported for a long time.  There is no single answer for this.