Some of the localization files generated by Fang's software are 4 GB or larger. Reading and processing this much data is a problem because our machines have limited memory. One solution is out-of-core processing (also known as chunking): the data is split into chunks small enough to fit in memory, and each chunk is read, processed, and written to disk before the next one is read.

Python's Pandas library is well suited to out-of-core processing because reading a file in chunks is built into the library. However, DataSTORM's FiducialDriftCorrection processor normally runs its steps automatically on a complete dataset, so to use it with chunked data we need to call those FiducialDriftCorrection methods manually.
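As a rough illustration of the read, process, write cycle described above, here is a minimal sketch that does not use DataSTORM at all. The column names, the tiny in-memory CSV, the temporary output path, and the multiplication standing in for "processing" are all assumptions made for the example; only `pd.read_csv(..., chunksize=...)` and `DataFrame.to_csv` are real Pandas calls.

```python
import io
import os
import tempfile

import pandas as pd

# Synthetic stand-in for a large localization file (hypothetical columns).
csv_text = "x,y,frame\n" + "\n".join(f"{i},{2 * i},{i // 4}" for i in range(10))

# Hypothetical output location for the processed rows.
out_path = os.path.join(tempfile.mkdtemp(), "processed.csv")

rows_written = 0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=4)):
    # Placeholder processing step (an assumption, not the real drift correction).
    chunk["x"] = chunk["x"] * 100.0

    # Write the header once, then append every later chunk to the same file.
    chunk.to_csv(out_path,
                 mode="w" if i == 0 else "a",
                 header=(i == 0),
                 index=False)
    rows_written += len(chunk)

# Read the result back only to check it; a real 4 GB output would not be reloaded whole.
result = pd.read_csv(out_path)
```

Only one chunk is held in memory at a time, so peak memory use is set by the chunk size rather than by the file size.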

The purpose of this notebook is to demonstrate how to perform out-of-core processing on a large dataset.

In [1]:
# Load the necessary libraries
%pylab
import DataSTORM.processors as ds
import pandas               as pd
import importlib

from pathlib import Path

Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib


We will start by opening a connection to the file so that we may pull out subsets of data at any given time.

In [2]:
filePath      = Path('../test-data/MicroTubules_LargeFOV/FOV2_1500_10ms_1_MMStack_locResults.dat')
numRowsToRead = 200000

# Opens a connection to the file, reading in 200000 rows at a time.
reader        = pd.read_csv(str(filePath.resolve()), chunksize = numRowsToRead)

*reader* is a TextFileReader object that may be iterated over to extract the data one chunk at a time. Alternatively, we may use its *.get_chunk()* method to extract a chunk with a specific number of rows. Note that every time *.get_chunk(size)* is called, a pointer to the current row inside the file moves forward by *size* rows.

The pointer cannot be moved backwards, so once *.get_chunk()* is called, you must do something with those rows. Otherwise, you will need to re-create the reader (for example, by re-running the cell above or restarting the notebook) to read them again.
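The forward-only behaviour can be seen on a small example. This is a sketch on a synthetic ten-row CSV held in memory, with made-up column names, rather than on the real localization file.

```python
import io

import pandas as pd

# Synthetic ten-row CSV (hypothetical columns) standing in for the real file.
csv_text = "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Explicit size: take the first three rows.
first = reader.get_chunk(3)

# No size given: falls back to the chunksize passed to read_csv (four rows),
# starting where the previous call stopped.
second = reader.get_chunk()

# Iterating picks up whatever is left after the two calls above.
remaining = [chunk for chunk in reader]

# The rows in `first` cannot be read again from this reader;
# building a new reader starts over from the top of the file.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
restarted = reader.get_chunk(3)
```

Here `first` and `restarted` hold the same three rows, while `second` and `remaining` cover the rest of the file with no overlap.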

In [3]:
# If you uncomment the next line to try .get_chunk(), re-create the reader afterwards (re-run the cell above or restart the notebook).
#reader.get_chunk(50000).describe()

Now that we have opened a connection to the file, let's begin the dedrift process by interactively searching for fiducials in each chunk.