# 1D CNN with Laplace-Beltrami Spectrum

## Binary Classification: Males vs. Females

Here we test whether a one-dimensional convolutional neural network can classify a subject as male or female based on the shape of their white matter tracts, as described by the Laplace-Beltrami (LB) spectrum.

The initial setup is very similar to the MLP notebook, because we want the data to remain a 1D vector. The main changes are the types of layers in the network itself.

### Import libraries
First, let's import the libraries we will use.

In [9]:
#to read in the data
import pickle
#for plotting, numbers etc.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#for splitting the data
from sklearn.model_selection import train_test_split
#keras functions
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv1D, GlobalAveragePooling1D, MaxPooling1D
from keras.utils import np_utils, plot_model, to_categorical
from keras.optimizers import Adam

#normalize the data
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegressionCV 


### Import and check the data
Now let's read in the data using pickle. The data was previously processed in Python and saved with pickle. These are the same steps used in the MLP notebook, so I will not include as many comments.

In [4]:
# eigenvalue dictionary with an entry for each tract, 600 evs per tract
tractev_dict_600 = pickle.load(open("tract_ev_dict_600.pk",'rb'))
# list of tracts we want to use
tractstouse = pickle.load(open('tractstouse.pk','rb'))
# subject list
HCP_subj_list = pickle.load(open('HCP_subj_list.pk','rb'))
# list of subject gender 1 = male, 2 = female
gender_id = pickle.load(open('gender_id.pk','rb'))
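
For reference, a pickle round trip looks like the sketch below. The dictionary contents and the file name are made-up placeholders, not the actual HCP data. The `with` blocks are one way to make sure the file handles get closed after reading and writing.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the eigenvalue dictionary: tract name -> rows of eigenvalues
toy_ev_dict = {"tract_a": [[0.1, 0.5], [0.2, 0.6]], "tract_b": [[0.3, 0.7], [0.4, 0.8]]}

# write into a temporary directory (placeholder path, removed again below)
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_ev_dict.pk")

# save the object in binary mode
with open(toy_path, "wb") as f:
    pickle.dump(toy_ev_dict, f)

# load it back in binary mode
with open(toy_path, "rb") as f:
    loaded = pickle.load(f)

# clean up the temporary file and directory
os.remove(toy_path)
os.rmdir(tmp_dir)

print(loaded == toy_ev_dict)
print(sorted(loaded.keys()))
```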

### Preprocess the data

The eigenvalue data is already in a vector format, so we do not need to vectorize it. However, we will need to combine the vectors of all the tracts so that we have a single vector per subject. 

We also need to normalize the data. Within each tract, every eigenvalue (each column of the tract's matrix) is standardized across subjects to a mean of 0 and a standard deviation of 1. We will write a function to do this using sklearn's `StandardScaler` class.

**Normalize the data**

In [5]:
def scale_ev_dict(ev_dict):
    scaled_dict = {}
    for tract in ev_dict.keys():
        scaler = StandardScaler()
        scaled_dict[tract] = scaler.fit_transform(ev_dict[tract])
    return scaled_dict
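
`StandardScaler` works column by column, so each eigenvalue index is standardized across subjects. Here is a small NumPy sketch of the same operation on a made-up matrix (rows stand in for subjects, columns for eigenvalues). It assumes the scaler's default settings, which center each column and divide by its population standard deviation.

```python
import numpy as np

# toy "tract": 4 subjects (rows) x 3 eigenvalues (columns); values are invented
toy_tract = np.array([[1.0, 10.0, 100.0],
                      [2.0, 20.0, 200.0],
                      [3.0, 30.0, 300.0],
                      [4.0, 40.0, 400.0]])

# standardize each column: subtract the column mean, divide by the column std (ddof=0)
col_mean = toy_tract.mean(axis=0)
col_std = toy_tract.std(axis=0)
toy_scaled = (toy_tract - col_mean) / col_std

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, the three columns are identical even though their raw magnitudes differ by factors of 10, which is the point of normalizing before training.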


In [6]:
# normalize all of the tracts so that each ev is centered on 0.
tractev_dict_600_scaled = scale_ev_dict(tractev_dict_600)


**Reorganize the data**

Currently the data is a dictionary of 2D matrices. We want to reorganize it into a single 2D matrix with the shape `(1013, n * 48)`, where `n` is the number of eigenvalues we use per tract and 48 is the number of tracts. 600 eigenvalues is likely far more than we need, but we do not know how many is optimal. We will write a function for this reorganization so we can easily try different numbers of eigenvalues.

In [7]:
# change the organization to be one vector per subject with all evs for all tracts
def reorganize_spectrums(ev_dict_scaled, numev, HCP_subj_list=HCP_subj_list, tractstouse=tractstouse):
    # create an empty numpy array of the shape we want
    # numev is the number of eigenvalues we want per tract
    allsubjs_alltracts_scaled = np.zeros([len(HCP_subj_list), numev*len(tractstouse)])
    for i in range(len(tractstouse)):
        allsubjs_alltracts_scaled[:, i*numev:i*numev+numev] = ev_dict_scaled[tractstouse[i]][:, 0:numev]
    return allsubjs_alltracts_scaled

In [None]:
numev=200
allsubjs_alltracts_scaled = reorganize_spectrums(tractev_dict_600_scaled, numev)
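
The resulting column layout can be checked on a toy example. The sketch below uses `np.hstack` instead of filling a preallocated array, which should produce the same arrangement. The tract names and values are invented for illustration.

```python
import numpy as np

# made-up inputs: 3 subjects, 2 tracts, 5 eigenvalues stored per tract
toy_tracts = ["tract_a", "tract_b"]
toy_dict = {
    "tract_a": np.arange(15, dtype=float).reshape(3, 5),
    "tract_b": np.arange(15, dtype=float).reshape(3, 5) + 100,
}
toy_numev = 2  # keep only the first 2 eigenvalues of each tract

# take the first toy_numev columns of every tract and place them side by side
toy_matrix = np.hstack([toy_dict[t][:, :toy_numev] for t in toy_tracts])

print(toy_matrix.shape)  # (subjects, toy_numev * number of tracts)
print(toy_matrix[0])     # first subject: tract_a values, then tract_b values
```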

**One hot encoding the labels**

To encode these labels, all we need to do is subtract 1 from every entry, so that 0 = male and 1 = female. With only two classes, this single 0/1 column carries the same information as a two-column one-hot encoding.

In [8]:
# binary (0/1) encoding for the gender ID
genderid_ohe = np.asarray(gender_id) - 1
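
To see how the 0/1 labels relate to a full one-hot encoding, here is a short NumPy sketch with invented labels. Indexing an identity matrix is just one convenient way to build the one-hot rows.

```python
import numpy as np

# invented labels in the original coding: 1 = male, 2 = female
toy_gender = [1, 2, 2, 1]

# binary encoding, as in the cell above: 0 = male, 1 = female
toy_binary = np.asarray(toy_gender) - 1

# full one-hot encoding: one column per class, a single 1 in each row
toy_onehot = np.eye(2)[toy_binary]

print(toy_binary)
print(toy_onehot)
```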

**Check datatype**

Finally, we need to make sure all inputs are of datatype `float32`.

In [None]:
allsubjs_alltracts_scaled = allsubjs_alltracts_scaled.astype('float32')
genderid_ohe = genderid_ohe.astype('float32')
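
One thing to keep in mind for later: Keras `Conv1D` layers expect 3D input of shape `(samples, steps, channels)`, so a 2D matrix needs an extra axis before it reaches the network. The sketch below shows one possible layout, a single channel, on a made-up array. Treating each tract as its own channel would be another option, and this is an illustration rather than the layout this notebook necessarily uses.

```python
import numpy as np

# made-up stand-in for the (subjects, features) matrix
toy_X = np.zeros((5, 12), dtype="float32")

# add a trailing channel axis: (samples, steps) -> (samples, steps, 1)
toy_X_cnn = toy_X[..., np.newaxis]

print(toy_X_cnn.shape)
print(toy_X_cnn.dtype)
```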

### Split the data

The input data is now preprocessed and ready to be fed into a neural network. However, we first have to split the data into training, validation, and testing sets. We do not have a ton of samples, so we will first try splitting the data into 3 subsets and then explore other cross-validation options.

In [None]:
X = allsubjs_alltracts_scaled
Y = genderid_ohe

# first, split the training/validation data from the testing data
trainvalX, testX, trainvalY, testY = train_test_split(X, Y, train_size=.8, test_size=.2, random_state=0)

# print() with parentheses works in both Python 2 and Python 3
print(len(trainvalX))
print(len(testX))

In [None]:
# second, split the validation data from the training data
trainX, valX, trainY, valY = train_test_split(trainvalX, trainvalY, train_size=.75, test_size=.25, random_state=0)

# print() with parentheses works in both Python 2 and Python 3
print(len(trainX))
print(len(valX))

Now we have 3 subsets of data: training data with 607 samples, validation data with 203 samples, and testing data with 203 samples. Again, this may be too few samples for training; if so, we can employ other cross-validation methods.
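
These subset sizes can be checked by hand. As far as we know, when both sizes are given as fractions, `train_test_split` rounds the test count up and the training count down. The sketch below applies that assumed rule to the 1013 subjects; `split_counts` is a helper made up for this illustration.

```python
import math

def split_counts(n_samples, train_frac, test_frac):
    # assumed rounding rule: ceil for the test set, floor for the training set
    n_test = int(math.ceil(test_frac * n_samples))
    n_train = int(math.floor(train_frac * n_samples))
    return n_train, n_test

# first split: training/validation vs. testing
n_trainval, n_test = split_counts(1013, 0.8, 0.2)
# second split: training vs. validation
n_train, n_val = split_counts(n_trainval, 0.75, 0.25)

print(n_trainval, n_test)
print(n_train, n_val)
```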

### Try Logistic Regression Classifier from sklearn for comparison

One of the reasons we are trying deep learning with this data is to see if we can improve upon 'basic' machine learning algorithms. Let's fit a simple logistic regression classifier so we have something to compare the deep learning accuracy to. We will do this twice: once training on the training split and evaluating on the validation split, and once training on the combined training/validation data and evaluating on the testing split.

In [None]:
#train on training data
lr = LogisticRegressionCV()
lr.fit(trainX, trainY)

#evaluate with validation data
print("Accuracy = {:.2f}".format(lr.score(valX, valY)))

In [None]:
#train on training and validation data
lr = LogisticRegressionCV()
lr.fit(trainvalX, trainvalY)

#evaluate with testing data
print("Accuracy = {:.2f}".format(lr.score(testX, testY)))
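
For a classifier, `score` returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the true label. Here is a minimal sketch with invented labels and predictions. It also computes the accuracy of always guessing the most common class, which is a useful reference point when reading any accuracy number.

```python
import numpy as np

# invented true labels and predictions (0 = male, 1 = female)
toy_true = np.array([0, 1, 1, 0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# mean accuracy: fraction of predictions that match the true labels
toy_accuracy = np.mean(toy_pred == toy_true)

# reference point: accuracy of always predicting the most common class
majority_class = np.bincount(toy_true).argmax()
toy_baseline = np.mean(toy_true == majority_class)

print("Accuracy = {:.2f}".format(toy_accuracy))
print("Majority-class baseline = {:.2f}".format(toy_baseline))
```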