<a href="https://colab.research.google.com/github/obarnstedt/LINdoscope2023/blob/main/notebooks/analysis_1p/lindoscope_1P_decoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Can we decode behaviors from cell responses?**

This notebook shows how to use logistic regression to estimate the probability of a behavior occurring, given a cell's response. We use manually annotated behaviors of an animal performing the helping behavior task to understand what is happening in the dorsal hippocampus (dHC). The dHC is imaged with a 1P miniscope, and CaImAn is used to obtain the deconvolved activity (inferred spikes) of the active cells in the dHC.

Logistic regressor:
Models the probability of a behavior event taking place as a function of the independent variables, here the spike-inferred activity obtained from the CaImAn analysis.
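
As a rough illustration of the model (not the notebook's own implementation, which lives in `logistic_regression_utils`), the sketch below fits a one-feature logistic regression by gradient descent on synthetic data. The function names, the learning rate and the synthetic activity values are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, n_iter=2000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(w * x + b)
        # Gradients of the mean log loss with respect to w and b
        w -= lr * np.mean((p - y) * x)
        b -= lr * np.mean(p - y)
    return w, b

# Synthetic example: a hypothetical cell that is more active during the behavior
rng = np.random.default_rng(0)
behavior = rng.integers(0, 2, size=500)               # 0 = no behavior, 1 = behavior
activity = rng.normal(loc=2.0 * behavior, scale=1.0)  # made-up activity values

w, b = fit_logistic(activity, behavior)
p_behavior = sigmoid(w * activity + b)       # estimated probability for each sample
predicted = (p_behavior > 0.5).astype(int)   # threshold the probability at 0.5
```

A positive weight `w` means that higher activity raises the estimated probability of the behavior.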

In [None]:
from google.colab import drive
drive.mount('/content/drive')

1. The first step is to create a design matrix that contains all the behavior variables you are interested in. Our design matrix here contains the manually annotated data. For each behavior, it holds a 1 wherever the animal is performing the behavior and a 0 wherever it is not.

2. Once we have the design matrix, we can choose which behaviors we want to decode. We use a logistic regression to look for the relationship between the stimulus (i.e. the behavior) and the response (i.e. the S matrix of inferred spikes).

3. We will then shuffle the S matrix activity and compute the shuffled decoder performance. From the shuffled performance and the decoder performance, we can calculate a percentile score, which tells us how many cells respond to the behavior of interest.
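
Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the annotation format (behavior name, start frame, stop frame) and the function `build_design_matrix` are hypothetical, and the way `backward` and `forward` widen each bout is a guess at what the jitter parameters of `create_design_matrix` do.

```python
import numpy as np

# Hypothetical annotations: (behavior, start_frame, stop_frame), stop is exclusive
annotations = [
    ('task behaviors', 2, 4),
    ('exploratory behaviors', 5, 7),
    ('task behaviors', 8, 10),
]
n_frames = 12

def build_design_matrix(annotations, n_frames, backward=0, forward=0):
    """Return the behavior names and a (frames x behaviors) matrix of 0/1 indicators."""
    behaviors = sorted({name for name, _, _ in annotations})
    matrix = np.zeros((n_frames, len(behaviors)), dtype=int)
    for name, start, stop in annotations:
        col = behaviors.index(name)
        # Assumed jitter: widen each bout by a few frames on either side
        lo = max(start - backward, 0)
        hi = min(stop + forward, n_frames)
        matrix[lo:hi, col] = 1
    return behaviors, matrix

behaviors_sketch, dm = build_design_matrix(annotations, n_frames, backward=1, forward=1)
```

Each column of `dm` is the binary time series that the decoder would try to predict for one behavior.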

In [None]:
!git clone https://github.com/obarnstedt/LINdoscope2023.git

Cloning into 'LINdoscope2023'...
remote: Enumerating objects: 399, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 399 (delta 20), reused 29 (delta 4), pack-reused 336
Receiving objects: 100% (399/399), 83.89 MiB | 14.93 MiB/s, done.
Resolving deltas: 100% (214/214), done.
Updating files: 100% (50/50), done.


In [None]:
#Imports
import numpy as np
import pandas as pd
import os
import glob
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
from LINdoscope2023.notebooks.analysis_1p.utils.designmatrix_utils import *
from LINdoscope2023.notebooks.analysis_1p.utils.logistic_regression_utils import *
from LINdoscope2023.notebooks.analysis_1p.utils.plotting_utils import *

In [None]:
drive_path = '/content/drive' #Change this to point to where you have your sample data
data_path = os.path.join(os.getcwd(), 'data')

if not os.path.isdir(data_path) or not os.listdir(data_path):
    print(f'{data_path} is missing or empty. Check that the data has been downloaded and placed in the correct folder, and that the file names and types are correct.')

manual_annotation_file = glob.glob(os.path.join(data_path, '*_manual_annotations.csv*'))[0]
path, name = os.path.split(manual_annotation_file)
s_mat_file = glob.glob(os.path.join(data_path, '*_spikes.npy*'))[0]

In [None]:
#Create a folder to save the outputs of this notebook
save_results_fold = os.path.join(data_path, 'output')

if not os.path.exists(save_results_fold):
    os.mkdir(save_results_fold)
    print('Folder %s created!' % save_results_fold)
else:
    print('Folder %s already exists!' % save_results_fold)

In [None]:
#Load the S_mat data and see how the raw traces look for some cells
s_mat = np.load(s_mat_file, allow_pickle=True)

cell_idx = 10 #The id of the cell you want to see the traces for
plot_smat(s_mat, cell_idx)

In [None]:
#Let's create our design matrix with some sample data here:
#Which behaviors are we interested in for the design matrix?
behaviors = ['exploratory behaviors', 'task behaviors', 'appraisel behaviors', 'defensive behaviors',
            'prosocial behaviors']

#To include some time before each behavior, set add_back_jitter=True and choose how many frames to add
#with the backward parameter (here backward=5). Similarly, you can include time after each behavior
#using the add_forward_jitter and forward parameters.

design_matrix = create_design_matrix(manual_annotation_file, save_results_fold, s_mat, n_beh_events=behaviors,
                         add_forward_jitter=True, forward=5, add_back_jitter=True, backward=5,
                         fps=30, to_use='Regrouped Behaviors')

#Plot an ethogram for the behaviors
plot_design_matrix(design_matrix)

Let's choose one behavior that we are interested in. Here we choose 'task behaviors'.
Feel free to change the 'behavior_decode' parameter below to see how the decoder behaves for different behaviors.

In [None]:
behavior_decode = 'task behaviors'
stimulus = (design_matrix[behavior_decode].values).reshape(-1, 1)

In [None]:
#Bin the data (the s_mat and the design matrix)

#Set some binning parameters
bin_step = 30
bin_start = 0
bin_stop = stimulus.shape[0]

#func parameter: the function applied to each bin. Change it to np.mean to see how this changes the bin values
binned_smat = bin_data(s_mat.T, bin_step, bin_start, bin_stop, func=np.max)
plot_binned_data(s_mat.T, binned_smat, cell_idx=10)

binned_stimulus = bin_data(stimulus, bin_step, bin_start, bin_stop, func=np.max)
plot_binned_data(stimulus, binned_stimulus, cell_idx=0)
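
The internals of `bin_data` are not shown in this notebook, so the following is only a sketch of one way binning with a reducing function could work. It assumes non-overlapping windows of `bin_step` frames and drops any incomplete final window; the name `bin_frames` is hypothetical.

```python
import numpy as np

def bin_frames(data, bin_step, func=np.max):
    """Reduce a (frames x features) array over non-overlapping windows of bin_step frames."""
    n_bins = data.shape[0] // bin_step            # drop any incomplete final bin
    trimmed = data[:n_bins * bin_step]
    windows = trimmed.reshape(n_bins, bin_step, -1)
    return func(windows, axis=1)                  # one value per bin and feature

frames = np.arange(12).reshape(6, 2)   # 6 frames, 2 hypothetical cells
binned_max = bin_frames(frames, bin_step=3, func=np.max)    # [[4, 5], [10, 11]]
binned_mean = bin_frames(frames, bin_step=3, func=np.mean)  # [[2., 3.], [8., 9.]]
```

With `fps=30` and `bin_step=30`, each bin covers one second. Under this scheme, applying `np.max` to the binary stimulus marks a bin as behavior if any frame inside it is annotated.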


In [None]:
#Split the data into test and train
X_train, X_test, Y_train, Y_test = data_splitter(binned_smat, binned_stimulus)

#Check to make sure our training data has enough values of each class.
#Class 0: behavior does not occur
#Class 1: behavior occurs
counter = Counter(Y_train.flatten())
print(counter)


In [None]:
#Synthetic Minority Over-sampling Technique (SMOTE)

#The two classes are not (even approximately) equally represented in our current dataset: only a small
#percentage of the samples are 'interesting' behavior samples. Undersampling the majority class on its own
#is one option. Combining it with oversampling of the minority class gives the decoder a more balanced
#dataset to train on, which tends to improve decoder performance.

#Only oversample when both classes have more than 4 samples
if counter[0] > 4 and counter[1] > 4:
    X_train, Y_train = oversampler(X_train, Y_train, sampling_strategy=0.4)
else:
    print('Not enough values to create synthetic values')
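
To make the oversampling idea concrete, here is a simplified, NumPy-only sketch of SMOTE-style interpolation. It is not the `oversampler` helper used above, whose internals are not shown. The function name, the `ratio` and `k` parameters, and the reading of `sampling_strategy=0.4` as a target minority-to-majority ratio are all assumptions.

```python
import numpy as np

def smote_like_oversample(X, y, ratio=0.4, k=3, seed=0):
    """Add synthetic class-1 samples until n_minority is about ratio * n_majority."""
    rng = np.random.default_rng(seed)
    minority = X[y == 1]
    n_majority = int(np.sum(y == 0))
    n_new = int(round(ratio * n_majority)) - len(minority)
    if n_new <= 0 or len(minority) <= k:
        return X, y   # already balanced enough, or too few samples to interpolate
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = minority[rng.integers(len(minority))]
        # k nearest minority neighbours of the base sample (position 0 is the base itself)
        dist = np.linalg.norm(minority - base, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        other = minority[rng.choice(neighbours)]
        # New point on the line segment between the base sample and its neighbour
        synthetic[i] = base + rng.random() * (other - base)
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out

rng = np.random.default_rng(1)
X = rng.normal(size=(110, 5))          # 110 samples, 5 hypothetical cells
y = np.array([0] * 100 + [1] * 10)     # 100 majority and 10 minority samples
X_res, y_res = smote_like_oversample(X, y, ratio=0.4)
```

Only the training set should be resampled in this way, so that the test scores still reflect the real class balance.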

In [None]:
print('Training and fitting model.')

Y_fit_train, Y_fit, accuracy_log_reg, recall_score_log_reg, score_f1_log_reg, auc_score, corr_vals = logistic_regression_decoder(X_train, Y_train, X_test, Y_test)

In [None]:
#Let's see what the decoder trains on, and what it predicts during training:
plot_train(X_train, Y_train, Y_fit_train, cell_idx=10)

In [None]:
#Let's see what the decoder predicts after training for one cell, and its performance:
cell_idx=90
plot_test(X_test, Y_test, Y_fit, cell_idx)
print('F1 score value:{0:.2f}'.format(score_f1_log_reg[cell_idx]))
print('Accuracy score value:{0:.2f}'.format(accuracy_log_reg[cell_idx]))
print('Recall score value:{0:.2f}'.format(recall_score_log_reg[cell_idx]))
print('AUC score value:{0:.2f}'.format(auc_score[cell_idx]))
print('Correlation score value:{0:.2f}'.format(corr_vals[cell_idx]))

In [None]:
#How does the decoder perform for this session
#F1 score: TP / (TP + 0.5 * (FP + FN))
plt.figure()
plt.plot(score_f1_log_reg)
plt.axhline(0.4, color='r')
plt.xlabel('Neurons')
plt.ylabel('F1 Score')
sns.despine()

#Recall score
plt.figure()
plt.plot(recall_score_log_reg)
plt.axhline(0.7, color='r')
plt.xlabel('Neurons')
plt.ylabel('Recall')
sns.despine()

#Accuracy score
plt.figure()
plt.plot(accuracy_log_reg)
plt.axhline(0.7, color='r')
plt.xlabel('Neurons')
plt.ylabel('Accuracy')
sns.despine()

#AUC score
plt.figure()
plt.plot(auc_score)
plt.axhline(0.7, color='r')
plt.xlabel('Neurons')
plt.ylabel('AUC score')
sns.despine()

#Correlation
plt.figure()
plt.plot(corr_vals)
plt.axhline(0.7, color='r')
plt.xlabel('Neurons')
plt.ylabel('Correlation value')
sns.despine()
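
For reference, the scores plotted above can be computed from the counts of true/false positives and negatives. The sketch below is a self-contained illustration on made-up labels; `binary_scores` is a hypothetical helper, not the function used inside `logistic_regression_decoder`.

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Compute F1, recall and accuracy for binary labels (1 = behavior occurs)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)      # behavior predicted and present
    fp = np.sum(~y_true & y_pred)     # behavior predicted but absent
    fn = np.sum(y_true & ~y_pred)     # behavior missed
    tn = np.sum(~y_true & ~y_pred)    # correctly predicted absence
    f1 = tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / y_true.size
    return {'f1': float(f1), 'recall': float(recall), 'accuracy': float(accuracy)}

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # made-up behavior labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # made-up decoder predictions
scores = binary_scores(y_true, y_pred)   # TP=2, FP=1, FN=1, TN=4
```

When the behavior is rare, accuracy can stay high even if the decoder misses most events, which is one reason to look at F1 and recall as well.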

In [None]:
#Let's see how the decoder behaves when we shuffle one of the inputs to the decoder
#(the X, which in this case is the S matrix).
#We compute the performance measures for 100 shuffles. To change the number of shuffles,
#change the n_shuffle parameter

print('Computing the shuffled decoder performance scores')
# Two different strategies can be used to compute the shuffled F1 scores
score_f1_log_reg_shuffled, recall_score_log_reg_shuffled, accuracy_log_reg_shuffled, auc_score_shuffled, \
corr_vals_shuffled = shuffle_log_reg(X_train, Y_train, X_test, Y_test, n_shuffle=100, \
                                     shuffle_strategy='smat_shuffle')
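
What the `'smat_shuffle'` strategy does internally is not shown here. One common way to build such a null distribution, sketched below as an assumption, is to permute each cell's activity independently in time. This keeps each cell's distribution of values but breaks its alignment with the behavior. The name `shuffle_activity` is hypothetical.

```python
import numpy as np

def shuffle_activity(X, rng):
    """Permute each column (cell) of a (time bins x cells) array independently in time."""
    shuffled = np.empty_like(X)
    for cell in range(X.shape[1]):
        shuffled[:, cell] = rng.permutation(X[:, cell])
    return shuffled

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)   # 10 time bins, 2 hypothetical cells
X_shuffled = shuffle_activity(X, rng)
```

Refitting the decoder on many such shuffled copies gives, for each cell, a set of scores expected when activity and behavior are unrelated.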


In [None]:
#From the shuffled performance scores and the unshuffled scores, let's compute a percentile score,
#which tells us how well the decoder performed on the real data compared to the shuffled data.

pct_f1_log_reg = compute_pct(score_f1_log_reg_shuffled, score_f1_log_reg)

fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
ax1.hist(pct_f1_log_reg, bins=20)
ax1.axvline(0.95, color='r')
ax1.set_xlabel('Bins')
ax1.set_ylabel('No. of Neurons')
ax1.set_title('Percentile scores')
ax2.hist(score_f1_log_reg_shuffled[0], bins=20)
ax3.hist(score_f1_log_reg_shuffled[10], bins=20)
ax2.set_title('F1 scores for shuffle 0')
ax3.set_title('F1 scores for shuffle 10')
plt.tight_layout()
sns.despine()
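
One plausible definition of the percentile score, given here as an assumption since `compute_pct` is defined elsewhere, is the fraction of shuffles whose score falls below the real score for each cell. The sketch assumes the shuffled scores are arranged as (shuffles x cells); `percentile_vs_shuffle` is a hypothetical name.

```python
import numpy as np

def percentile_vs_shuffle(shuffled_scores, real_scores):
    """Per cell, the fraction of shuffles whose score is strictly below the real score."""
    shuffled_scores = np.asarray(shuffled_scores)   # shape: (n_shuffles, n_cells)
    real_scores = np.asarray(real_scores)           # shape: (n_cells,)
    return np.mean(shuffled_scores < real_scores, axis=0)

shuffled = np.array([[0.1, 0.5],
                     [0.2, 0.6],
                     [0.3, 0.7],
                     [0.4, 0.8]])      # 4 shuffles, 2 hypothetical cells
real = np.array([0.9, 0.55])
pct = percentile_vs_shuffle(shuffled, real)   # [1.0, 0.25]
```

Under this definition, a cell with a value above 0.95 scores better on the real data than on more than 95% of the shuffles.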

In [None]:
responding_cells_idx = np.where(pct_f1_log_reg > 0.95)[0]

#Plot the predictions of the decoder for a cell with a percentile score > 0.95
cell_idx = np.random.choice(responding_cells_idx)

print('F1 score value:{0:.2f}'.format(score_f1_log_reg[cell_idx]))
fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(X_test[:, cell_idx], '-b', label='Observed spikes')
ax2.plot(Y_fit[cell_idx, :], '-r', label='Fitted events')
ax2.plot(Y_test, label='Behavior events')
sns.despine()
ax1.legend()
ax2.legend()

In [None]:
data_dict = {'F1 score': score_f1_log_reg, 'Accuracy': accuracy_log_reg, 'Recall': recall_score_log_reg,
             'AUC score': auc_score, 'Correlation value': corr_vals, 'Percentile score F1': pct_f1_log_reg}

results_df = pd.DataFrame.from_dict(data_dict, orient="columns")
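#Build the output file name from the annotation file name (without its extension) and the bin size
csv_file_path = os.path.join(save_results_fold, os.path.splitext(name)[0] + '_' + str(bin_step) + '_results.csv')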
results_df.to_csv(csv_file_path)

How many cells are responding to the behavior of interest?

In [None]:
print('Percentage of cells responding to behavior of interest: {0:.2f} %'.format((len(responding_cells_idx)/len(s_mat))*100))