Some of the localization files generated by Fang's software are 4 GB or larger. Reading and processing this data is a problem because the files can exceed the memory available on our machines. One solution is out-of-core processing (also known as chunking), in which the data is broken into smaller pieces; each piece is read into memory, processed, and written to disk before the next one is read.
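
As a minimal sketch of the idea, assume a small in-memory CSV stands in for a multi-gigabyte localization file (the column names and values below are made up for illustration). The loop holds only one chunk in memory at a time while it accumulates running statistics:

```python
import io
import pandas as pd

# Hypothetical stand-in for a large localization file.
csv_text = "frame,x\n" + "\n".join(f"{i},{i * 10.0}" for i in range(10))

row_count = 0
min_frame = None
max_frame = None

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    row_count += len(chunk)
    chunk_min = int(chunk["frame"].min())
    chunk_max = int(chunk["frame"].max())
    # Start from the first chunk's values rather than from a fixed guess.
    min_frame = chunk_min if min_frame is None else min(min_frame, chunk_min)
    max_frame = chunk_max if max_frame is None else max(max_frame, chunk_max)

print(row_count, min_frame, max_frame)
```

Only the running totals survive between iterations, so peak memory use is set by the chunk size rather than by the file size.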

Python's Pandas library is well suited to out-of-core processing because chunked reading is built into it. However, to use it with DataSTORM's FiducialDriftCorrect processor, we need to manually drive the methods that the processor would otherwise call automatically.

The purpose of this notebook is to demonstrate how to perform out-of-core processing on a large dataset.

In [1]:
# Load the necessary libraries
%pylab
import DataSTORM.processors as ds
import pandas               as pd
import importlib

from pathlib import Path

Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib


We will start by opening a connection to the file so that we may pull out subsets of data at any given time.

In [2]:
filePath      = Path('../test-data/MicroTubules_LargeFOV/FOV2_1500_10ms_1_MMStack_locResults.dat')
numRowsToRead = int(8e6) # Read 8 million rows at a time. This should be as large as is reasonable.

# Opens a connection to the file, reading in numRowsToRead rows at a time.
reader        = pd.read_csv(str(filePath.resolve()), chunksize = numRowsToRead)

*reader* is a TextFileReader object that may be iterated over to extract the data. Alternatively, we may use its *.get_chunk()* method to extract a chunk with a specific number of rows. Note that every time *.get_chunk(size)* is called, a pointer to the current row inside the file moves forward by *size* rows.

The pointer cannot be moved backwards, so once *.get_chunk()* is called, you must do something with those rows. Otherwise, you will need to recreate the reader (for example, by restarting the notebook) to read them again.
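
The forward-only behavior can be seen on a toy example. This is a small sketch that uses an in-memory CSV with a made-up *frame* column in place of the real file:

```python
import io
import pandas as pd

# Hypothetical six-row file held in memory.
csv_text = "frame\n" + "\n".join(str(i) for i in range(6))

demo_reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Each call continues from where the previous one stopped.
first = demo_reader.get_chunk(3)
second = demo_reader.get_chunk(3)

first_frames = first["frame"].tolist()
second_frames = second["frame"].tolist()

# A freshly created reader starts again from the first row.
fresh_reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
fresh_frames = fresh_reader.get_chunk(3)["frame"].tolist()

print(first_frames, second_frames, fresh_frames)
```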

In [3]:
# If you uncomment the next line to try .get_chunk(), you should restart the notebook.
#reader.get_chunk(50000).describe()

Now that we have opened a connection to the file, let's begin the dedrift process by interactively searching for fiducials in each chunk. The outline of the steps looks like this:

1. Create a FiducialDriftCorrect processor from DataSTORM. Note that in normal (on-core) processing we set its *interactiveSearch* flag to True so that the interactive search for fiducials is performed automatically. Here, since we'll direct the search process on each chunk, we will leave it at its default value of **False**.

2. Loop through each chunk. For each chunk, allow the user to specify a subregion containing fiducials. The processor will remember each subregion that the user specified.

3. Within the same loop iteration, discard the localizations in the current chunk that do not lie within the search areas. Append the remaining localizations to a DataFrame that collects all fiducial candidates.

4. Use the *detectFiducials()* method of the processor class to look for fiducial trajectories within the localizations that are output from step 3.
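
Steps 2 and 3 can be sketched with plain Pandas. The helper below is hypothetical and is not DataSTORM's *reduceSearchArea()*; it assumes that search regions are axis-aligned rectangles given as (x_min, x_max, y_min, y_max) tuples and that the coordinate columns are named *x* and *y*:

```python
import pandas as pd

def in_any_region(locs, regions):
    """Return the rows of locs that fall inside at least one rectangle.

    Hypothetical helper for illustration only.
    """
    mask = pd.Series(False, index=locs.index)
    for x_min, x_max, y_min, y_max in regions:
        mask |= locs["x"].between(x_min, x_max) & locs["y"].between(y_min, y_max)
    return locs[mask]

# One made-up search region and two made-up chunks.
regions = [(0, 10, 0, 10)]
chunks = [
    pd.DataFrame({"x": [1.0, 50.0], "y": [2.0, 60.0], "frame": [0, 1]}),
    pd.DataFrame({"x": [3.0, 70.0], "y": [4.0, 80.0], "frame": [2, 3]}),
]

# Keep only the candidates from each chunk, then combine them.
candidates = pd.concat([in_any_region(c, regions) for c in chunks],
                       ignore_index=True)
print(candidates)
```

Because only the localizations inside the regions are kept, the candidate DataFrame stays small even when the full dataset does not fit in memory.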

In [4]:
# Create the FiducialDriftCorrect processor.
dc = ds.FiducialDriftCorrect(mergeRadius           = 50,
                             offTime               = 1,
                             minSegmentLength      = 20,
                             minFracFiducialLength = 0.4,
                             neighborRadius        = 500,
                             smoothingWindowSize   = 500,
                             smoothingFilterSize   = 300)

# Create a CleanUp processor to ensure data in each chunk is clean.
clean = ds.CleanUp()

When you select a region containing a fiducial in a chunk, that region is remembered for all frames. So, if you have already selected a region containing a fiducial in one chunk, you do not have to select the same region again in other chunks unless you believe that the fiducial has drifted out of the region you selected in earlier chunks.

In [5]:
# fids will hold the localizations belonging to fiducials and will grow with each processed chunk
fids = pd.DataFrame()

# minFrame and maxFrame will hold the absolute min and maximum frame in all the chunks
minFrame = 0
maxFrame = 0

for chunk in reader:
    # Clean up the data in the chunks
    chunk = clean(chunk)
    
    # Update the minimum and maximum frames with each chunk
    minFrame = np.min([minFrame, chunk['frame'].min()])
    maxFrame = np.max([maxFrame, chunk['frame'].max()])
    
    # Rename the columns because trackpy does not accept column names with units
    chunk.rename(columns = {'x [nm]' : 'x', 'y [nm]' : 'y'}, inplace = True)
    
    # It's important to set resetRegions = False here.
    # Otherwise, the regions will be overwritten for each new chunk.
    dc.iSearch(chunk, resetRegions = False)
    
    # Keep only the localizations that lie within the previously-defined search regions.
    # Append these to the fiducial DataFrame defined just before the start of the loop.
    currentFids = dc.reduceSearchArea(chunk)
    fids        = pd.concat([fids, currentFids], ignore_index = True)



In [6]:
fids.describe()

,x,y,z [nm],frame,uncertainty [nm],intensity [photon],offset [photon],loglikelihood,sigma [nm]
count,72269.0,72269.0,72269,72269.0,72269.0,72269.0,72269.0,72269.0,72269.0
mean,37172.145429,22195.987848,0,25429.689203,6.130453,3749.025102,299.799022,228.021819,128.672439
std,16696.546577,19424.913094,0,16136.743263,2.321445,2508.708965,50.801959,1162.593001,17.978308
min,20606.0,1935.9,0,100.0,0.57666,738.22,98.01,30.063,78.444
25%,20862.0,2272.6,0,9484.0,3.9108,2087.4,267.53,95.1,113.35
50%,21055.0,41049.0,0,26816.0,6.2946,2845.5,287.14,138.24,128.49
75%,54268.0,41132.0,0,40193.0,8.1418,5354.1,325.65,219.86,143.06
max,54695.0,41620.0,0,49999.0,14.287,75098.0,1863.3,267490.0,249.81


In [7]:
# Detect the fiducial trajectories from these localizations
dc.detectFiducials(fids)

Frame 49999: 2 trajectories present
2 fiducial(s) detected.


In [8]:
# Fit splines to the fiducial trajectories and combine them into a single drift curve
dc.fitSplines()
dc.combineSplines(None, startFrame = minFrame, stopFrame = maxFrame)

If there were detected fiducials and everything went well, we can now plot the fiducial tracks and the average spline to verify that we have a good drift correction curve.

In [9]:
dc.plotFiducials()

### Performing the actual drift correction
Now that we have the drift curves, we need to perform the actual correction. This is achieved by chunking the same file as before and dynamically writing the corrected data to another file while each chunk is open.
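
The read-correct-append pattern can be sketched independently of DataSTORM. The example below is an illustration under stated assumptions: the input is a made-up in-memory CSV, the output goes to a temporary file, and the "correction" is a hypothetical linear drift of one unit per frame rather than a spline-based curve:

```python
import io
import os
import tempfile
import pandas as pd

# Hypothetical input: x drifts by one unit per frame away from 100.
csv_text = "frame,x\n" + "\n".join(f"{i},{100.0 + i}" for i in range(6))
out_path = os.path.join(tempfile.mkdtemp(), "corrected.csv")

for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=4)):
    # Stand-in correction: remove the assumed drift from this chunk.
    chunk["x"] = chunk["x"] - chunk["frame"] * 1.0

    # Write the header with the first chunk only, then append the rest.
    chunk.to_csv(out_path,
                 mode="w" if i == 0 else "a",
                 header=(i == 0),
                 index=False)

result = pd.read_csv(out_path)
print(result)
```

Writing the first chunk in write mode and later chunks in append mode keeps a single header row and avoids mixing output from a previous run into the file.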

In [23]:
filePath.parent / (filePath.stem + '_DC' + filePath.suffix)

PosixPath('../test-data/MicroTubules_LargeFOV/FOV2_1500_10ms_1_MMStack_locResults_DC.dat')

In [24]:
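outputFile = filePath.parent / (filePath.stem + '_DC' + filePath.suffix)

# We'll read fewer rows this time. chunksize must be an integer.
reader = pd.read_csv(str(filePath.resolve()), chunksize = int(1e6))
for i, chunk in enumerate(reader):
    # Assumption: dropFiducials() accepts the chunk to filter.
    chunk = dc.dropFiducials(chunk)
    chunk = dc._correctLocalizations(chunk)

    # Write the header with the first chunk only, then append the rest.
    chunk.to_csv(str(outputFile), mode = 'w' if i == 0 else 'a',
                 header = (i == 0), index = False)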