Some of the localization files generated by Fang's software are 4 GB or larger. Reading and processing this data is a problem because a single file can exceed the memory available on our machines. One solution is out-of-core processing (also known as chunking): the data is split into smaller pieces, and each piece is read into memory, processed, and written to disk before the next one is read.

Python's Pandas library is well suited to out-of-core processing because chunked file reading is built into it. However, to use chunking with DataSTORM's FiducialDriftCorrect processor, we need to manually call the methods that the processor would otherwise run automatically.

The purpose of this notebook is to demonstrate how to perform out-of-core processing on a large dataset.
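
As a minimal illustration of the idea, independent of DataSTORM, the sketch below reads a small in-memory CSV in chunks and accumulates running statistics without ever holding the whole table in memory. The table `csv_text` and its column names are made up for this example.

```python
import io
import pandas as pd

# A tiny stand-in for a large localization file (hypothetical data).
csv_text = "frame,x\n" + "\n".join(f"{i // 2},{i * 10.0}" for i in range(10))

total_rows = 0
x_sum = 0.0
min_frame, max_frame = float("inf"), float("-inf")

# Passing chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    x_sum += chunk["x"].sum()
    min_frame = min(min_frame, chunk["frame"].min())
    max_frame = max(max_frame, chunk["frame"].max())

# Statistics over the full table, computed one chunk at a time.
mean_x = x_sum / total_rows
```

Only quantities that can be updated incrementally (counts, sums, minima, maxima) are kept between iterations; each chunk itself is discarded once it has been processed.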

In [1]:
# Load the necessary libraries
%pylab
import DataSTORM.processors as ds
import pandas               as pd

from pathlib import Path

Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib


## User-adjustable parameters

In [2]:
# Input file
filePath   = Path('../test-data/MicroTubules_LargeFOV/FOV2_1500_10ms_1_MMStack_locResults.dat')

# Output file
outputFile = filePath.parent / Path(filePath.stem + '_DC' + filePath.suffix)


numRowsToReadInteractive = int(8e6) # Read 8 million rows per chunk when searching for fiducials.
numRowsToRead            = int(1e6) # Number of rows to read per chunk for all other operations.

# Create the FiducialDriftCorrect processor and set its properties.
dc = ds.FiducialDriftCorrect(mergeRadius           = 50,
                             offTime               = 1,
                             minSegmentLength      = 20,
                             minFracFiducialLength = 0.4,
                             neighborRadius        = 500,
                             smoothingWindowSize   = 500,
                             smoothingFilterSize   = 300)

# Open a file for out-of-core processing
We will start by opening a connection to the file so that we may pull out subsets of data at any given time.

In [3]:
# Opens a connection to the file, reading in numRowsToReadInteractive (8 million) rows at a time.
reader        = pd.read_csv(str(filePath.resolve()), chunksize = numRowsToReadInteractive)

*reader* is a TextFileReader object that may be iterated over to extract the data one chunk at a time. Alternatively, we may use its *.get_chunk()* method to extract a chunk with a specific number of rows. Note that every time *.get_chunk(size)* is called, the reader's position inside the file moves forward by *size* rows, so rows that have already been read are not returned again.
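
The behavior of *.get_chunk()* can be checked on a small example. This is only a sketch on made-up data: `csv_text` is a ten-row, in-memory stand-in for the real file.

```python
import io
import pandas as pd

# Ten rows of made-up data standing in for a localization file.
csv_text = "frame,x\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

first = reader.get_chunk(3)   # rows 0-2
second = reader.get_chunk(2)  # rows 3-4; the reader has moved forward
rest = list(reader)           # remaining rows, in chunks of up to 4 rows
```

Because the reader only moves forward, a new reader has to be created with `pd.read_csv` whenever the file must be traversed again from the beginning.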

## Overview of the fiducial-based drift correction using OOC processing
Now that we have opened a connection to the file, let's begin the dedrift process by interactively searching for fiducials in each chunk. The outline of the steps looks like this:

1. Create a FiducialDriftCorrect processor from DataSTORM. Note that in normal (on-core) processing we set its *interactiveSearch* flag to True so that the interactive search for fiducials is performed automatically. Here, since we'll direct the search process on each chunk, we will leave it at its default value of **False**.

2. Loop through each chunk of data. For each chunk allow the user to specify a subregion containing fiducials. The processor will remember each subregion that the user specified. This step is best performed on chunks as large as possible because the fiducials may drift between chunks.

3. Loop over all chunks again, filtering out localizations from the current chunk that do not lie within the search areas. Append these localizations to a DataFrame that is collecting all localizations that are fiducial candidates. This DataFrame will be fed to the FiducialDriftCorrect processor's regular routines to build the correction curve.

4. Use the *detectFiducials()* method of the processor class to look for fiducial trajectories within the localizations that are output from step 3.

5. Compute the fiducial drift correction curves from the identified fiducials.

6. Open the data in chunks one last time, apply the correction to each localization, and stream the corrected data to a new file.
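
Step 3 can be pictured with the sketch below. It is not DataSTORM's implementation: the helper `keep_in_regions`, the representation of regions as `(xmin, xmax, ymin, ymax)` tuples, and the sample data are all assumptions made for illustration.

```python
import io
import pandas as pd

def keep_in_regions(df, regions):
    """Return the rows of df whose (x, y) lies inside any rectangular region."""
    mask = pd.Series(False, index=df.index)
    for xmin, xmax, ymin, ymax in regions:
        mask |= df["x"].between(xmin, xmax) & df["y"].between(ymin, ymax)
    return df[mask]

# Hypothetical search regions and localizations.
regions = [(0, 100, 0, 100), (900, 1000, 900, 1000)]
csv_text = "x,y,frame\n50,50,0\n500,500,0\n950,950,1\n60,40,1\n300,20,2\n955,945,2\n"

# Keep only the fiducial candidates from each chunk, then combine them once.
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    pieces.append(keep_in_regions(chunk, regions))

candidates = pd.concat(pieces, ignore_index=True)
```

Since the search regions cover only a small part of the field of view, the combined candidates are expected to fit in memory even when the full dataset does not.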

# Interactively select regions containing fiducials

In [4]:
# Create a CleanUp processor to ensure data in each chunk is clean.
clean = ds.CleanUp()

When you select a region containing a fiducial in one chunk, that region is remembered for all subsequent chunks. So, if you have already selected a region containing a fiducial, you do not have to select the same region again in later chunks unless you believe that the fiducial has drifted out of the region you selected earlier.

**Reminder**: Use the zoom tool in the figure to zoom in on regions containing high counts. Deactivate the zoom tool and then click and drag around a bin with a large count, making sure the border of the selection rectangle is just a tiny bit bigger than the bin. Press space to add a region to the list of regions to search for fiducials. You typically will not need more than three fiducials.

In [5]:
# minFrame and maxFrame will hold the absolute min and maximum frame in all the chunks
minFrame = 1e7
maxFrame = 0

for chunk in reader:
    # Clean up the data in the chunks
    chunk = clean(chunk)
    
    # Update the minimum and maximum frames with each chunk
    minFrame = np.min([minFrame, chunk['frame'].min()])
    maxFrame = np.max([maxFrame, chunk['frame'].max()])
    
    # Rename the columns because trackpy does not accept column names with units
    chunk.rename(columns = {'x [nm]' : 'x', 'y [nm]' : 'y'}, inplace = True)
    
    # It's important to set resetRegions = False here.
    # Otherwise, the regions will be overwritten for each new chunk.
    dc.iSearch(chunk, resetRegions = False)



Now, we'll chunk and loop over the data again, keeping localizations in the previously defined search regions.

In [6]:
# fids will hold the localizations belonging to fiducials and will grow with each processed chunk
fids = pd.DataFrame()

# We don't need large chunks anymore, so let's set them to a smaller size (1 million in this case)
reader = pd.read_csv(str(filePath.resolve()), chunksize = numRowsToRead)
for chunk in reader:
    chunk = clean(chunk)
    
    # Rename the columns because trackpy does not accept column names with units
    chunk.rename(columns = {'x [nm]' : 'x', 'y [nm]' : 'y'}, inplace = True)
    
    # Filter out localizations that are not within any of the previously defined search regions.
    # Append the remaining ones to the fiducial DataFrame defined just before the start of the loop.
    currentFids = dc.reduceSearchArea(chunk)
    fids        = pd.concat([fids, currentFids], ignore_index = True)



# Detect the fiducials within the selected subregions

In [7]:
# Detect the fiducial trajectories from these localizations
dc.detectFiducials(fids)

Frame 49999: 3 trajectories present
3 fiducial(s) detected.


In [8]:
# Compute the correction curves
dc.fitSplines()
dc.combineSplines(None, startFrame = minFrame, stopFrame = maxFrame)

If there were detected fiducials and everything went well, we can now plot the fiducial tracks and the average spline to verify that we have a good drift correction curve.

In [9]:
dc.plotFiducials()

# Perform the actual drift correction on the data
Now that we have the drift curves, we need to perform the actual correction. This is achieved by chunking the same file as before and dynamically writing the corrected data to another file while each chunk is open.

In [12]:
reader = pd.read_csv(str(filePath.resolve()), chunksize = numRowsToRead) # We'll read fewer rows this time
headerSwitch = True
for chunk in reader:
    chunk = clean(chunk)
    
    # Change the column names
    chunk.rename(columns = {'x [nm]' : 'x', 'y [nm]' : 'y'}, inplace = True)
    
    # Remove fiducials from data.
    # Note: x and y are matched independently against the fiducial coordinates.
    chunk = chunk[~(chunk['x'].isin(fids['x']) & chunk['y'].isin(fids['y']))]
    
    # Correct the localizations
    chunk = dc.correctLocalizations(chunk)
    
    # Change the column names back
    chunk.rename(columns = {'x'  : 'x [nm]',
                            'y'  : 'y [nm]',
                            'dx' : 'dx [nm]',
                            'dy' : 'dy [nm]'},
                 inplace = True)
    
    # Write the contents to a file, writing the header on the first write only
    if headerSwitch:
        chunk.to_csv(str(outputFile), mode = 'w', header = True, index = False)
        headerSwitch = False
    else:
        chunk.to_csv(str(outputFile), mode = 'a', header = False, index = False)

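
The write pattern used above (create the file and write the header for the first chunk, then append without a header) can be tried in isolation. In this sketch the data, the temporary output path, and the `+ 1.0` placeholder that stands in for the drift correction are all made up.

```python
import io
import os
import tempfile
import pandas as pd

# Seven rows of made-up data and a temporary output location.
csv_text = "frame,x\n" + "\n".join(f"{i},{float(i)}" for i in range(7))
out_path = os.path.join(tempfile.mkdtemp(), "corrected.csv")

for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=3)):
    chunk["x"] = chunk["x"] + 1.0  # placeholder for the real correction
    first = (i == 0)
    # First chunk: create the file and write the header. Later chunks: append.
    chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)

# Read the streamed file back to check that it forms one consistent table.
result = pd.read_csv(out_path)
```

Opening the file with `mode="w"` for the first chunk also overwrites any output left over from an earlier run, so that rerunning the cell does not append to stale data.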
