# Feature Extraction

In [1]:
from brainlit.utils.ngl_pipeline import NeuroglancerSession
from brainlit.preprocessing.features import *
import pandas as pd
import numpy as np
import glob

Using TensorFlow backend.


## Feature extraction using brainlit
The `LinearFeatures` and `NeighborhoodFeatures` classes convert SWC files into a dataframe of axon and background examples, one row per example, with the extracted features as columns.

First, instantiate the classes. Each takes a `url` to pull data from, a `size` that sets the bounding box around each point (a value of `i` along an axis gives a box `2i+1` wide along that axis), and an `offset` by which the bounding box is shifted to sample a background point.

In [2]:
lin = LinearFeatures(url="s3://open-neurodata/brainlit/brain1", size=[1,1,1], offset=[15,15,15])
nbr = NeighborhoodFeatures(url="s3://open-neurodata/brainlit/brain1", size=[1,1,1], offset=[15,15,15])
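
To make the `size` and `offset` arguments concrete, here is a minimal sketch of the box arithmetic described above. The helper `box_bounds` is hypothetical and not part of `brainlit`. It assumes the box is centered on the point and that the background box is the same box shifted by `offset`.

```python
import numpy as np

def box_bounds(center, size, offset=(0, 0, 0)):
    """Return (start, stop) indices of a box around `center`; `stop` is exclusive.

    Hypothetical helper for illustration, not part of brainlit.
    """
    center = np.asarray(center) + np.asarray(offset)
    size = np.asarray(size)
    start = center - size       # i positions below the center
    stop = center + size + 1    # i positions above the center, exclusive end
    return start, stop

point = (100, 100, 100)

# Box around the point itself.
start, stop = box_bounds(point, size=[1, 1, 1])
shape = stop - start            # 2*1 + 1 = 3 along each axis
n_values = int(np.prod(shape))  # 3 * 3 * 3 = 27 values in the box

# Assumed background box: the same box shifted by the offset.
bg_start, bg_stop = box_bounds(point, size=[1, 1, 1], offset=[15, 15, 15])
```

With `size=[1,1,1]` each box holds 27 values, which is consistent with the 27 feature columns (`0` to `26`) that `NeighborhoodFeatures` returns in this notebook.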

For the `LinearFeatures` class, you need to define the filters to convolve with the neighborhoods, using the `add_filter` method. Currently, `brainlit` supports Gaussian, Gaussian Gradient Magnitude, Gaussian Laplace, and Gabor filters.

In [3]:
lin.add_filter('gaussian', sigma=[1, 1, 0.3])
lin.add_filter('gaussian gradient', sigma=[1, 1, 0.3])
lin.add_filter('gaussian laplace', sigma=[1, 1, 0.3])
lin.add_filter('gabor', sigma=[1, 1, 0.3], phi=[0, 0], frequency=2)
# lin.add_filter('gabor', sigma=[1, 1, 0.3], phi=[0, np.pi/2], frequency=2)
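
For intuition about what the first three filters compute, the sketch below applies the corresponding `scipy.ndimage` functions to a small synthetic volume. This is an illustration only: whether `brainlit` calls these exact functions internally, and how it reduces a filtered volume to a single feature value, is not stated here, so reading off the center voxel is an assumption. The Gabor filter is left out.

```python
import numpy as np
from scipy import ndimage

sigma = [1, 1, 0.3]  # same per-axis sigma as in the cell above

# Synthetic stand-in for an image neighborhood (not real brain data).
rng = np.random.default_rng(0)
volume = rng.random((7, 7, 7))

responses = {
    "gaussian": ndimage.gaussian_filter(volume, sigma=sigma),
    "gaussian gradient": ndimage.gaussian_gradient_magnitude(volume, sigma=sigma),
    "gaussian laplace": ndimage.gaussian_laplace(volume, sigma=sigma),
}

# Assumption: one feature per filter, taken at the center voxel.
center = tuple(s // 2 for s in volume.shape)
features = {name: float(out[center]) for name, out in responses.items()}

# Sanity check: a constant volume is unchanged by Gaussian smoothing
# and has zero gradient magnitude.
flat = np.full((7, 7, 7), 5.0)
flat_smoothed = ndimage.gaussian_filter(flat, sigma=sigma)
flat_gradient = ndimage.gaussian_gradient_magnitude(flat, sigma=sigma)
```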

Call the `fit` method of each class with a list of segment IDs and the number of vertices to extract features from in each SWC. If the second argument is omitted (or passed as `None`), features are collected for every vertex.

In [11]:
df_lin = lin.fit(seg_ids=[2, 7], num_verts=5)
df_nbr = nbr.fit(seg_ids=[2, 7], num_verts=5)

The resulting dataframes have a `Segment` column, a `Vertex` column, a `Label` column (1 = axon, 0 = background), and feature columns numbered from 0.

In [12]:
df_lin

,Segment,Vertex,Label,0,1,2,3
0,2,0,0,65468.0,248.0,49148.0,22196.0
1,2,0,0,12289.0,179.0,26056.0,16470.0
2,2,1,0,26707.0,126.0,32086.0,35791.0
3,2,1,0,12393.0,99.0,5644.0,16609.0
4,2,2,0,14954.0,249.0,45051.0,20041.0
5,2,2,0,12254.0,105.0,45890.0,16422.0
6,2,3,0,21364.0,146.0,24849.0,28631.0
7,2,3,0,12318.0,89.0,6261.0,16508.0
8,2,4,0,18898.0,68.0,49350.0,25327.0
9,2,4,0,12455.0,248.0,50720.0,16693.0


In [6]:
df_nbr

,Segment,Vertex,Label,0,1,2,3,4,5,6,...,17,18,19,20,21,22,23,24,25,26
0,2,0,0,52070,65520,65520,50811,65520,65520,48358,...,65520,54167,65520,65520,53639,65520,65520,50633,65520,65520
1,2,0,0,12199,12130,12317,12242,12237,12304,12174,...,12242,12266,12528,12349,12409,12321,12181,12315,12263,12212
2,2,1,0,23385,25448,18951,16656,18795,19440,15958,...,24960,22352,20712,18062,31562,34366,26039,34962,41279,30076
3,2,1,0,12405,12582,12389,12245,12152,12296,12322,...,12364,12296,12591,12441,12287,12541,12598,12358,12508,12338
4,2,2,0,17392,16834,13865,17211,16742,14421,17905,...,13361,13452,13083,13023,14224,13248,12739,14989,13557,12861
5,2,2,0,12074,12255,12256,12222,12192,12271,12233,...,12415,12338,12147,12320,12377,12301,12370,12101,12325,12302
6,2,3,0,41719,37753,22465,25085,23521,20091,14124,...,20030,14861,16001,16468,14788,19751,23278,13164,16269,19436
7,2,3,0,12250,12262,12320,12399,12338,12391,12397,...,12269,12337,12427,12332,12275,12512,12411,12417,12264,12232
8,2,4,0,26145,26223,19846,20104,21345,15194,14866,...,13363,20166,23709,23565,14187,14691,14389,12479,12459,12788
9,2,4,0,12444,12473,12564,12380,12414,12301,12551,...,12516,12435,12432,12370,12610,12407,12411,12342,12490,12421
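
Given this column layout, separating the feature matrix from the labels for a downstream classifier takes a few lines of `pandas`. The sketch below builds a small stand-in dataframe by hand from the first two rows of the `df_lin` output. It assumes the feature columns are named with integers, which should be checked against the real dataframe.

```python
import pandas as pd

# Stand-in for df_lin: first two rows of the output shown above.
df = pd.DataFrame(
    {
        "Segment": [2, 2],
        "Vertex": [0, 0],
        "Label": [0, 0],
        0: [65468.0, 12289.0],
        1: [248.0, 179.0],
        2: [49148.0, 26056.0],
        3: [22196.0, 16470.0],
    }
)

# Everything except the bookkeeping columns is a feature.
meta_cols = ["Segment", "Vertex", "Label"]
X = df.drop(columns=meta_cols).to_numpy()  # shape: (n_samples, n_features)
y = df["Label"].to_numpy()                 # 1 = axon, 0 = background
```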


Neighborhood and linear features can be extracted together by passing `include_neighborhood=True` to the `fit` method of `LinearFeatures`.

In [13]:
df_lin = lin.fit(seg_ids=[2, 7], num_verts=5, include_neighborhood=True)
df_lin

,Segment,Vertex,Label,0,1,2,3,4,5,6,...,21,22,23,24,25,26,27,28,29,30
0,2,0,0,52070,65520,65520,50811,65520,65520,48358,...,53639,65520,65520,50633,65520,65520,65468.0,248.0,49148.0,22196.0
1,2,0,0,12199,12130,12317,12242,12237,12304,12174,...,12409,12321,12181,12315,12263,12212,12289.0,179.0,26056.0,16470.0
2,2,1,0,23385,25448,18951,16656,18795,19440,15958,...,31562,34366,26039,34962,41279,30076,26707.0,126.0,32086.0,35791.0
3,2,1,0,12405,12582,12389,12245,12152,12296,12322,...,12287,12541,12598,12358,12508,12338,12393.0,99.0,5644.0,16609.0
4,2,2,0,17392,16834,13865,17211,16742,14421,17905,...,14224,13248,12739,14989,13557,12861,14954.0,249.0,45051.0,20041.0
5,2,2,0,12074,12255,12256,12222,12192,12271,12233,...,12377,12301,12370,12101,12325,12302,12254.0,105.0,45890.0,16422.0
6,2,3,0,41719,37753,22465,25085,23521,20091,14124,...,14788,19751,23278,13164,16269,19436,21364.0,146.0,24849.0,28631.0
7,2,3,0,12250,12262,12320,12399,12338,12391,12397,...,12275,12512,12411,12417,12264,12232,12318.0,89.0,6261.0,16508.0
8,2,4,0,26145,26223,19846,20104,21345,15194,14866,...,14187,14691,14389,12479,12459,12788,18898.0,68.0,49350.0,25327.0
9,2,4,0,12444,12473,12564,12380,12414,12301,12551,...,12610,12407,12411,12342,12490,12421,12455.0,248.0,50720.0,16693.0
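
The combined output has 31 feature columns: the 27 neighborhood values (`0` to `26`) followed by the 4 filter responses (`27` to `30`). A sketch of that layout with `numpy`, using placeholder values and assuming the neighborhood is flattened in row-major order (the order `brainlit` actually uses is not stated here):

```python
import numpy as np

# Placeholder 3x3x3 neighborhood and placeholder filter responses.
neighborhood = np.arange(27, dtype=float).reshape(3, 3, 3)
filter_responses = np.array([0.1, 0.2, 0.3, 0.4])  # one value per added filter

# Neighborhood values first, filter responses appended at the end.
row = np.concatenate([neighborhood.ravel(), filter_responses])
```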


## Saving Extracted Features

When extracting features from many vertices, `brainlit` supports loading and writing the data in batches. Each batch is written in the binary `feather` format.

In [14]:
df_lin = lin.fit(seg_ids=[2, 7], num_verts=10, file_path='demo', batch_size=10)

Every `batch_size` samples are written to one `feather` file. The `file_path` argument sets the filename prefix, which is followed by the starting sample number, the last sample number, and the segment ID and vertex ID of the last sample in the batch, separated by underscores.

In [20]:
sorted(glob.glob('*.feather'))

['demo0_10_2_4.feather',
 'demo10_20_2_9.feather',
 'demo20_30_7_4.feather',
 'demo30_40_7_9.feather']
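
The naming scheme can be parsed back into its fields, for example to order batches numerically or to find where the last batch ended. The helper `parse_batch_filename` below is hypothetical and not part of `brainlit`. It assumes the prefix is known and that the four fields are separated by underscores, as in the listing above.

```python
from pathlib import Path

def parse_batch_filename(filename, prefix):
    """Split a batch filename into its numeric fields (hypothetical helper)."""
    stem = Path(filename).stem  # drop the ".feather" extension
    if not stem.startswith(prefix):
        raise ValueError(f"{filename!r} does not start with prefix {prefix!r}")
    start, end, seg_id, vert_id = (int(part) for part in stem[len(prefix):].split("_"))
    return {"start": start, "end": end, "segment": seg_id, "vertex": vert_id}

files = [
    "demo0_10_2_4.feather",
    "demo10_20_2_9.feather",
    "demo20_30_7_4.feather",
    "demo30_40_7_9.feather",
]

# Order batches by their starting sample number instead of by string.
ordered = sorted(files, key=lambda f: parse_batch_filename(f, "demo")["start"])

# Fields of the final batch: last sample number, segment ID, vertex ID.
last = parse_batch_filename(ordered[-1], "demo")
```

Sorting on the parsed start number keeps batches in order once sample numbers reach three digits, where a plain string sort would place `demo100_...` before `demo20_...`.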

The `start_seg` and `start_vert` arguments set the segment and vertex at which extraction starts.

In [6]:
df_lin = lin.fit(seg_ids=[2, 7], num_verts=10, file_path='demo_2_', batch_size=10, start_seg=7, start_vert=4)

In [7]:
sorted(glob.glob('demo_2*.feather'))

['demo_2_0_10_7_4.feather']