# Prepare hyperspectral data for regression in Python.

Spectral Python (the `spectral` package) reads an image in as a 3D array. It is unclear to me whether we need to reshape this into 2D before running regression or can just take a slice.

In [2]:
import spectral.io.envi as envi

In [3]:
import glob

In [4]:
files_list = glob.glob("/scratch2/NSF_GWAS/macroPhor_Array/T16_DEV_genes/EA/wk7/*.hdr")

In [5]:
i = 2

In [6]:
file_path = files_list[i]

This will go into its own file, called hypercube, and be imported by adding to ```__init__.py```:<br>
```from .hypercube import Hypercube```

In [7]:
class Hypercube:
    """A 3D hypercube containing spectra for each pixel
    
    :param file_path: path to the ENVI header (.hdr) file of the image
    :ivar hypercube: the image object returned by ``envi.open``
    :ivar wavelengths: wavelengths listed in the ENVI header
    
    """
    
    def __init__(self, file_path):
        # Open the image; pixel data are not read into memory at this point
        self.hypercube = envi.open(file_path)
        self.wavelengths = envi.read_envi_header(file_path)['wavelength']

In [23]:
test_cube = Hypercube(file_path)

Header parameter names converted to lower case.
Header parameter names converted to lower case.


#### Now to create another class (a child class?) that has this hypercube collapsed into a 2D format with which we can perform matrix algebra operations for regression

##### First, let's figure out how to collapse that hypercube into 2D with numpy.

Look at the dimensions of our image.

In [10]:
test_cube.hypercube.shape

(1571, 1419, 318)

This is (n, m, lambda): n rows by m columns of pixels, each with a signal intensity at each of lambda wavelength bands. Flattening this is similar to the problem described here (https://stackoverflow.com/questions/32838802/numpy-with-python-convert-3d-array-to-2d) but with hyperspectral instead of RGB data.
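A minimal sketch of that flattening on a small synthetic array. The shape and values here are made up for illustration and are not taken from the image above:

```python
import numpy as np

# Synthetic stand-in for a hypercube: 4 rows, 3 columns, 5 bands.
n_rows, n_cols, n_bands = 4, 3, 5
cube = np.arange(n_rows * n_cols * n_bands).reshape(n_rows, n_cols, n_bands)

# Collapse the two spatial axes into one; -1 lets numpy infer n_rows * n_cols.
flat = cube.reshape(-1, n_bands)

# Each row of `flat` is the spectrum of one pixel, in row-major pixel order.
assert flat.shape == (n_rows * n_cols, n_bands)
assert np.array_equal(flat[1 * n_cols + 2], cube[1, 2])

# The operation is reversible, so results can be mapped back to image space.
restored = flat.reshape(n_rows, n_cols, n_bands)
```

Keeping the band axis last means each pixel's spectrum stays contiguous, so the reshape does not need to reorder any values.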

In [13]:
import numpy as np

In [17]:
test_cube = np.asarray(test_cube)

In [18]:
test_cube.size

1

In [12]:
test_cube_flattened = test_cube.reshape(newshape = ((test_cube.shape[0]*test_cube.shape[1], test_cube.shape[2])))

AttributeError: 'BilFile' object has no attribute 'reshape'

##### Apparently the BilFile object still isn't read into memory. Need to read it in...

In [24]:
bands = test_cube.hypercube.read_bands()

TypeError: read_bands() missing 1 required positional argument: 'bands'

##### To follow DRY principle, take earlier code for getting wavelengths (in build_X) and convert to function that is used both here and in build_X

In [25]:
def read_wavelengths(file_path):
    # The ENVI header stores wavelengths as a list of strings
    h = envi.read_envi_header(file_path)
    wavelengths = h['wavelength']
    return wavelengths

In [26]:
wavelengths = read_wavelengths(file_path = file_path)

Header parameter names converted to lower case.


##### We also need to subset wavelengths to only the desired wavelengths. Use same function as before for this.

In [27]:
def find_desired_indices(wavelengths, min_desired_wavelength, max_desired_wavelength):
    # Header values are strings, so convert to float once before comparing
    wavelengths = np.asarray(wavelengths).astype(float)
    # https://stackoverflow.com/questions/13869173/numpy-find-index-of-the-elements-within-range
    wavelength_indices_desired = np.where(np.logical_and(wavelengths >= min_desired_wavelength,
                                                          wavelengths <= max_desired_wavelength))
    return wavelength_indices_desired

In [33]:
subset_indices = find_desired_indices(wavelengths, 550, 650)
subset_wavelengths = np.array(wavelengths)[subset_indices]

##### It should be much quicker to load in bands between 550-650 than ALL bands. Let's compare the relative speeds of each.

Let's first look at the documentation to figure out how to input our desired bands.

In [35]:
help(test_cube.hypercube.read_bands)

Help on method read_bands in module spectral.io.bilfile:

read_bands(bands, use_memmap=True) method of spectral.io.bilfile.BilFile instance
    Reads multiple bands from the image.
    
    Arguments:
    
        `bands` (list of ints):
    
            Indices of bands to read.
    
        `use_memmap` (bool, default True):
    
            Specifies whether the file's memmap interface should be used
            to read the data. Setting this arg to True only has an effect
            if a memmap is being used (i.e., if `img.using_memmap` is True).
            
    Returns:
    
       :class:`numpy.ndarray`
    
            An `MxNxL` array of values for the specified bands. `M` and `N`
            are the number of rows & columns in the image and `L` equals
            len(`bands`).



Now to benchmark reading in 550-650nm bands vs. all bands...

In [36]:
import time

In [40]:
time_pre_read_partial = time.perf_counter()
bands_partial = test_cube.hypercube.read_bands(bands=subset_indices)
time_post_read_partial = time.perf_counter() - time_pre_read_partial

ValueError: axes don't match array

In [41]:
subset_indices

(array([121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
        134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
        147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
        160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172,
        173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185,
        186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198,
        199, 200]),)

In [43]:
len(subset_indices)

1

In [45]:
subset_indices[0]

array([121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
       134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
       147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
       160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172,
       173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185,
       186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198,
       199, 200])

In [46]:
type(subset_indices)

tuple

Ahhh... the object subset_indices is not an array, but a tuple containing an array. What we want is the array at the ```[0]``` position of this object.

Try again with ```subset_indices[0]``` instead of ```subset_indices```
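As a small illustration of why this happens, here is a sketch with made-up wavelength strings (hypothetical values, not the ones in this header). `np.where` called with only a condition returns a tuple with one index array per dimension, while `np.flatnonzero` returns a plain 1D index array:

```python
import numpy as np

# Hypothetical wavelengths as strings, the way an ENVI header stores them.
wavelengths = np.asarray(["540.0", "550.0", "600.0", "650.0", "660.0"]).astype(float)
in_range = (wavelengths >= 550) & (wavelengths <= 650)

where_result = np.where(in_range)    # tuple holding one index array per axis
indices = np.flatnonzero(in_range)   # plain 1-D index array

print(type(where_result).__name__, len(where_result))  # tuple 1
print(indices)                                         # [1 2 3]
```

Either `where_result[0]` or `indices` can be used wherever a flat array of band indices is expected.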

In [48]:
time_pre_read_partial = time.perf_counter()
bands_partial = test_cube.hypercube.read_bands(bands=subset_indices[0])
time_post_read_partial = time.perf_counter() - time_pre_read_partial

To read every band for comparison, we need an array of every wavelength index (note: NOT the wavelength values themselves)

In [50]:
all_indices =  np.arange(0, len(wavelengths), step = 1)

In [51]:
all_indices

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

Ok. now let's pass this object to the ```bands``` argument of ```read_bands```

In [52]:
time_pre_read_full = time.perf_counter()
bands_full = test_cube.hypercube.read_bands(bands=all_indices)
time_post_read_full = time.perf_counter() - time_pre_read_full

##### Compare the runtimes recorded....

For loading wavelengths only within range 550-650nm, ran in this many seconds....

In [53]:
print(time_post_read_partial)

0.35270235361531377


For loading all wavelengths...

In [54]:
print(time_post_read_full)

7.346348533872515
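The three-line timing pattern used above could be wrapped in a small helper so it is not repeated for every call. This is only a sketch; `timed` is a hypothetical name, and the workload below is a stand-in rather than an actual file read:

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be a read_bands call.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

In this notebook the call would look something like `bands_partial, t = timed(test_cube.hypercube.read_bands, bands=subset_indices[0])`.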


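These are both pretty fast: about 0.35 s for the bands in range versus about 7.3 s for all 318 bands, roughly a 20-fold difference. Let's make sure the full objects are actually loaded into memory....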

In [55]:
bands_partial.shape

(1571, 1419, 80)

In [56]:
bands_partial[1:10,1:10,1]

array([[ 99, 116, 122, 101, 111, 114, 118, 112, 104],
       [ 99, 118, 120, 117, 110, 109, 126, 107, 117],
       [111, 111, 123,  97, 114, 112, 103, 112, 104],
       [110, 113, 123, 119, 109, 115, 110,  99, 115],
       [ 96, 112, 106, 120, 101, 106, 110, 110, 117],
       [108, 112, 116, 121, 113, 129, 120, 110, 118],
       [103, 109, 106, 109, 107, 114, 125, 114, 114],
       [107, 110, 118, 117, 112, 113,  93, 107, 113],
       [104, 103, 121, 121, 106, 109, 108, 117, 118]], dtype=uint16)

Pretty sure it is! Note, these runtimes are at least an order of magnitude faster than ```hyperSpec::read.ENVI()``` in ```R```
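Since these arrays are held in memory, it is worth knowing how much room they need. A rough sketch of the expected footprint, computed from the shapes and the `uint16` dtype shown above (`expected_nbytes` is a hypothetical helper, not part of `spectral`):

```python
import numpy as np

def expected_nbytes(shape, dtype):
    """Bytes needed to hold an array of this shape and dtype in memory."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

partial_bytes = expected_nbytes((1571, 1419, 80), np.uint16)
full_bytes = expected_nbytes((1571, 1419, 318), np.uint16)
print(f"partial: {partial_bytes / 1e6:.1f} MB, full: {full_bytes / 1e9:.2f} GB")
```

These figures can be compared against `bands_partial.nbytes` and `bands_full.nbytes`, and they grow quickly if the data are later converted to a floating-point dtype.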

##### Now to integrate this complete loading of data into the hypercube....

In [8]:
class Hypercube:
    """A 3D hypercube containing spectra for each pixel
    
    :param file_path: A string indicating the path to the header file (in ENVI .hdr format) corresponding to the hyperspectral image file (in ENVI .raw format) to be read in
    :param min_desired_wavelength: A numeric value indicating a threshold BELOW which spectral data is excluded
    :param max_desired_wavelength: A numeric value indicating a threshold ABOVE which spectral data is excluded
    :ivar hypercube: 3D numpy array containing a spectrum for each pixel
    :ivar wavelengths: wavelengths read from the header, trimmed to the desired range
    
    """
    
    def __init__(self, file_path, min_desired_wavelength, max_desired_wavelength):
        # Find the indices of the bands that fall within the desired range
        all_wavelengths = read_wavelengths(file_path)
        subset_indices = find_desired_indices(all_wavelengths, min_desired_wavelength, max_desired_wavelength)
        subset_wavelengths = np.array(all_wavelengths)[subset_indices[0]]
        
        # Open the image at file_path (not the global test_cube) and read only those bands
        self.hypercube = envi.open(file_path).read_bands(bands=subset_indices[0])
        self.wavelengths = subset_wavelengths

In [10]:
import time

In [11]:
time_pre_read_partial = time.perf_counter()
my_cube = Hypercube(file_path = file_path,
                    min_desired_wavelength = 550,
                    max_desired_wavelength = 650)
time_post_read_partial = time.perf_counter() - time_pre_read_partial

NameError: name 'read_wavelengths' is not defined

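Seconds it ran in (should be ~0.3s as before). The `NameError` above appears to come from re-running the cell in a fresh kernel before `read_wavelengths` had been defined; the timing below is from a run with all earlier cells executed.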

In [73]:
print(time_post_read_partial)

0.3633701531216502


Alright! Now for some more checks. Let's make sure we have the same number of bands in both the matrix and the wavelength array.

In [74]:
len(my_cube.wavelengths)

80

In [75]:
my_cube.hypercube.shape

(1571, 1419, 80)

##### Let's flatten this into the right shape

In [1]:
my_cube_flattened = my_cube.hypercube.reshape(-1, my_cube.hypercube.shape[2])

NameError: name 'my_cube' is not defined

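The `NameError` above comes from running this cell in a fresh kernel (`In [1]`) before `my_cube` was defined; with the earlier cells re-run, the flattened array should have one row per pixel and one column per band. Next up: documentation for the pre-regression functions, and ultimately regression!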
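To show why the 2D shape is convenient, here is a hedged sketch of an ordinary least-squares fit on a flattened cube. The regression design is not shown in this notebook, so the design matrix below (columns of made-up reference spectra) and all sizes are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a tiny cube and a hypothetical design matrix whose
# columns are reference spectra (one value per band).
n_rows, n_cols, n_bands, n_components = 4, 3, 6, 2
design = rng.random((n_bands, n_components))
true_weights = rng.random((n_rows, n_cols, n_components))
cube = true_weights @ design.T               # shape (n_rows, n_cols, n_bands)

# Flatten to (pixels, bands), then solve design @ w = spectrum for all pixels.
flat = cube.reshape(-1, n_bands)
weights, *_ = np.linalg.lstsq(design, flat.T, rcond=None)

# Map the fitted weights back onto the image grid.
weight_maps = weights.T.reshape(n_rows, n_cols, n_components)
```

Because every pixel shares the same design matrix, a single `np.linalg.lstsq` call fits all pixels at once, which is the main payoff of flattening first.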