# ABBA cell count analysis

This notebook is the last step in the ABBA whole-brain cell counting analysis.  
It assumes you have done the following steps:
- Alignment of brain slices in ABBA, exported to a QuPath project.
- Detection of cells of interest in QuPath. The detections should be exported to ```.csv``` files (one per slice) in a folder called ```results```. 
- If there are regions to exclude, you should have drawn them and exported them to ```.txt``` files (one per slice) in a folder called ```regions_to_exclude```.

Run this notebook to load the cell counts and do analysis on them. 

## Before we start ...
Most of the functions and classes we need are written in 3 files: ```brain_hierarchy.py```, ```readCSV_helpers.py``` and ```pls_helpers.py```. We now import the necessary functions and classes from these Python files into this notebook, so that we can use them later:

In [1]:
from brain_hierarchy import AllenBrainHierarchy
from readCSV_helpers import collect_and_analyze_cell_counts,save_results
from pls_helpers import PLS

And we'll need other python functions to easily read and manipulate data and make nice plots:

In [2]:
import pandas as pd
# import copy
# import json
import numpy as np
import os

# import matplotlib.pyplot as plt
import plotly.express as px
# import seaborn as sns
# import pickle
import plotly.graph_objects as go

## The Allen Brain Atlas

We start by importing the mouse Allen Brain Atlas, in which we find information about all brain regions (their parent region and children regions in the brain hierarchy, for example).

In [3]:
path_to_allen_json = "./data/AllenMouseBrainOntology.json"
AllenBrain = AllenBrainHierarchy(path_to_allen_json) 

edges = AllenBrain.edges_dict
tree = AllenBrain.tree_dict
brain_region_dict = AllenBrain.brain_region_dict
regions = list(brain_region_dict.keys())

We now have access to useful information about all brain regions. Below, we show the first three of them:

In [4]:
AllenBrain.df.head(3)

id,atlas_id,ontology_id,acronym,region_name,color_hex_triplet,graph_order,st_level,hemisphere_id,parent_structure_id,children,id,distance_from_root
997,-1.0,1,root,root,FFFFFF,0,0,3,,"[{'id': 8, 'atlas_id': 0, 'ontology_id': 1, 'a...",997,0
8,0.0,1,grey,Basic cell groups and regions,BFDAE3,1,1,3,997.0,"[{'id': 567, 'atlas_id': 70, 'ontology_id': 1,...",8,1
1009,691.0,1,fiber tracts,fiber tracts,CCCCCC,1101,1,3,997.0,"[{'id': 967, 'atlas_id': 686, 'ontology_id': 1...",1009,1


We can also visualize the hierarchy of brain regions as a network (a tree). **Note that running the cell below may take a few minutes**.

In [5]:
# Plot brain region hierarchy
# If you want to plot it, install PyDot (pydot)
# fig = AllenBrain.plot_plotly_graph()
# fig.show()

Based on the graph above, you might want to specify the regions on which you want to do further analysis:  
*Note: to see more information about the regions, hover over them with your mouse.*

- Specify a level. Analysis can only be done on one level (one horizontal slice) of the brain region hierarchy.

- To exclude brain regions that belong to a certain branch, add the *abbreviated* nodes at the beginning of the branches to the list below.  
Example:  
```branches_to_exclude = ['retina', 'VS']```  
means that **all the subregions that belong to the retina and the ventricular systems** are excluded from the analysis.

In [6]:
level = 6
branches_to_exclude = ['retina','VS','grv','fiber tracts']

Now, get the selected regions as a variable:

In [7]:
selected_regions = AllenBrain.list_regions_to_analyze(level, branches_to_exclude)
print(f'You selected {len(selected_regions)} regions to analyze.')

You selected 288 regions to analyze.
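
The selection above (one level of the hierarchy, minus whole branches) can be illustrated with a minimal sketch on a toy tree. This is not the code of ```list_regions_to_analyze```; the toy hierarchy, the function name ```regions_at_level``` and the assumption that "level" means distance from the root are all hypothetical.

```python
# Toy hierarchy: acronym -> list of child acronyms (hypothetical, not the real atlas).
toy_tree = {
    "root": ["grey", "fiber tracts", "VS"],
    "grey": ["CH", "BS"],
    "CH": ["CTX", "CNU"],
    "BS": ["IB", "MB"],
    "fiber tracts": ["cm", "lfbs"],
    "VS": ["VL", "V3"],
}


def regions_at_level(tree, root, level, branches_to_exclude):
    """Return regions exactly `level` steps below `root`, skipping excluded branches."""
    selected = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node in branches_to_exclude:
            continue  # drop this node and everything below it
        if depth == level:
            selected.append(node)
            continue  # do not descend past the requested level
        for child in tree.get(node, []):
            stack.append((child, depth + 1))
    return sorted(selected)


print(regions_at_level(toy_tree, "root", 2, ["VS", "fiber tracts"]))
```

Excluding a node at the start of a branch removes all of its descendants, because the traversal never descends into it.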


## Load data

Now, we're ready to read the ```.csv``` files with the cell counts, and also the exclusion files (if there were regions to exclude).  
Below, you have to specify:
- ```animals_root```: Absolute path to the folder that contains the animal folders.
- ```stress_dirs```: A list of the folder names of the animals in the Stress group. The results of each animal must be stored in its own folder.
- ```control_dirs```: A list of names of the folders corresponding to animals in the Control group.
- ```area_key```: The name of the column in the ```.csv``` files that holds the size of a brain area.
- ```tracer_key```: A string of the column in the ```.csv``` files that refers to the tracer number used to highlight the marker
- ```marker_key```: A string of the marker we would like to highlight (e.g. CFos)

TODO: try modifying this to obtain densities in mm^2 (converted from square microns).

In [8]:
# ####################################### SET PARAMETERS ####################################


animals_root = './data/results/'
stress_dirs = ['Stress_5S', 'Stress_8S', 'Stress_10S', 'Stress_13S', 'Resilient_1R', 'Resilient_2R', 'Resilient_3R', 'Resilient_4R', 'Resilient_11R']
control_dirs = ['Control_17C', 'Control_18C', 'Control_19C']
area_key='DAPI: DAPI area um^2'
tracer_key='Num AF647'
marker_key='CFos'


# ###########################################################################################
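
Because ```area_key``` refers to an area in um^2, densities computed from it are in cells per um^2. A minimal sketch of the conversion to cells per mm^2 is given below; the helper name ```density_per_mm2``` is hypothetical and is not part of ```readCSV_helpers.py```.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm = 1000 um, so 1 mm^2 = 10^6 um^2


def density_per_mm2(cell_count, area_um2):
    """Cells per mm^2 from a cell count and an area given in um^2."""
    if area_um2 <= 0:
        raise ValueError("area must be positive")
    return cell_count * UM2_PER_MM2 / area_um2


# 40 cells in a region of 100,000 um^2 (0.1 mm^2)
print(density_per_mm2(40, 100_000))
```

Equivalently, a density already expressed in cells per um^2 is multiplied by 10^6.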

Now, we load the Control and Stress results separately in two pandas dataframes, and save the results.

**Note**: regions to exclude are automatically excluded.

In [23]:
# Load cell counts
control_results = collect_and_analyze_cell_counts(animals_root, control_dirs, AllenBrain,
                                                    area_key, tracer_key, marker_key)
stress_results = collect_and_analyze_cell_counts(animals_root, stress_dirs, AllenBrain,
                                                    area_key, tracer_key, marker_key)

# Save results
output_path = os.path.join(animals_root, 'results')
save_results(control_results, output_path, 'results_cell_counts_C.txt')
save_results(stress_results, output_path, 'results_cell_counts_S.txt')

Importing slices in Control_17C...
Imported 53 slices.
Raw cell counts are saved to ./data/results/Control_17C/results_python
Importing slices in Control_18C...
Imported 29 slices.
Raw cell counts are saved to ./data/results/Control_18C/results_python
Importing slices in Control_19C...
Imported 34 slices.
Raw cell counts are saved to ./data/results/Control_19C/results_python
Importing slices in Stress_5S...
Imported 32 slices.
Raw cell counts are saved to ./data/results/Stress_5S/results_python
Importing slices in Stress_8S...
Imported 48 slices.
Raw cell counts are saved to ./data/results/Stress_8S/results_python
Importing slices in Stress_10S...
Imported 39 slices.
Raw cell counts are saved to ./data/results/Stress_10S/results_python
Importing slices in Stress_13S...
Imported 31 slices.
Raw cell counts are saved to ./data/results/Stress_13S/results_python
Importing slices in Resilient_1R...
Imported 41 slices.
Raw cell counts are saved to ./data/results/Resilient_1R/results_python
Im

True

The data are stored in ```control_results``` and ```stress_results```.

# Partial Least Squares  

The analysis done below is taken from the tutorial written by [Krishnan et al.](https://www.sciencedirect.com/science/article/pii/S1053811910010074).  
Run the 2 cells below to get started.

In [10]:
# PLS
animal_list = stress_dirs + control_dirs
normalization = 'Density'   # Normalize on Density rather than Percentage
rank = 1

# Create a PLS object
cfosPLS = PLS(control_results, stress_results, control_dirs, stress_dirs, selected_regions, 'CFos', normalization)

# Show the matrix X
cfosPLS.X



Unnamed: 0,CLA,LA,MEV,CUN,VTN,PPN,MO,SS,GU,VISC,...,CS,LDT,PRNr,RPO,CN,NTB,RM,CENT3,"CUL4, 5",ANcr1
Control_17C,0.000399,0.000294,0.000321,0.000173,6.6e-05,8.2e-05,0.000389,0.000771,0.000383,0.000381,...,0.000138,0.000308,9.9e-05,0.000253,0.000341,0.000634,0.000447,0.00043,0.000344,0.000628
Control_18C,3.1e-05,4.9e-05,0.000116,4.3e-05,1.2e-05,3.4e-05,2.8e-05,6e-05,4.9e-05,5.1e-05,...,3.2e-05,0.000127,4e-05,1e-05,1.4e-05,1.8e-05,1e-05,2.8e-05,3.3e-05,0.0
Control_19C,3e-05,8.4e-05,7.7e-05,0.000144,0.0,0.000185,0.000172,0.000223,0.000242,0.000111,...,0.000329,0.000296,0.000184,0.000403,0.000251,0.000214,0.000249,0.000401,0.000418,0.00048
Stress_5S,0.000262,0.000162,0.000124,6.5e-05,3.5e-05,6.5e-05,0.000109,0.000147,0.000235,0.000104,...,6.7e-05,0.000137,7.2e-05,6.3e-05,9.2e-05,0.000393,0.000171,4.1e-05,6e-05,7.2e-05
Stress_8S,0.000529,0.00029,0.000385,0.000244,0.000495,0.000325,0.000626,0.000859,0.00062,0.000502,...,0.000295,0.000347,0.000289,0.000253,0.000371,0.00041,0.000368,0.000306,0.000318,0.000501
Stress_10S,0.000277,0.000226,0.000578,0.000304,8.8e-05,0.000292,0.00047,0.000704,0.000527,0.000401,...,0.000203,0.000371,9.9e-05,0.000276,0.000289,0.0,0.000166,0.000507,0.000393,0.000338
Stress_13S,0.000274,0.000185,0.000385,7.7e-05,0.000195,9.7e-05,8.7e-05,0.000201,7.2e-05,8.7e-05,...,9.9e-05,0.000337,0.00011,0.000113,0.000306,0.000275,0.000124,0.000546,0.000379,0.000692
Resilient_1R,0.000204,0.000149,0.000311,0.000118,0.000111,9e-05,0.000185,0.000347,0.000312,0.000202,...,8.5e-05,0.000165,6.4e-05,0.000103,0.000296,9e-05,7.4e-05,0.000157,0.000201,0.000329
Resilient_2R,0.000119,0.000169,0.000289,0.000157,3.2e-05,0.000123,4.9e-05,0.000112,0.000161,8.6e-05,...,0.000146,0.000162,8.9e-05,7.1e-05,9.4e-05,2.8e-05,5.9e-05,4.9e-05,8.5e-05,0.000128
Resilient_3R,4.8e-05,7.8e-05,0.0,6e-05,3.2e-05,0.000111,2.9e-05,8.1e-05,3.7e-05,5.4e-05,...,8.9e-05,0.000182,9.4e-05,0.000427,0.0,9.2e-05,0.000201,0.0,2.7e-05,2.4e-05


In [11]:
# Show the matrix Y
pd.get_dummies(cfosPLS.y).rename(columns={0:'Stress',1:'Control'})


Unnamed: 0,Stress,Control
Control_17C,0,1
Control_18C,0,1
Control_19C,0,1
Stress_5S,1,0
Stress_8S,1,0
Stress_10S,1,0
Stress_13S,1,0
Resilient_1R,1,0
Resilient_2R,1,0
Resilient_3R,1,0


The two matrices printed above (X and Y) illustrate the data on which the PLS is done.  
- ```X:``` The rows in this matrix are the mice. The columns in the matrix are the regions selected for analysis. The values in the matrix are the **density of CFos+ cells in that region (cell count divided by region area).**
- ```Y:``` The rows in this matrix are the mice. The columns in the matrix are the 2 groups (Control or Stress). **A value in this matrix is 1 if the mouse belongs to the specified group**.

In brief, PLS analyzes the relationship (correlation) between the columns of ```X``` and ```Y```. In our specific case, there will be 2 important outputs:
- **Salience scores**: Each brain region has a salience score. A high salience score means that the brain region explains much of the correlation between ```X``` and ```Y```.  
- **Singular values**: These are the singular values of the correlation matrix $R = Y^TX$, i.e. the square roots of the nonzero eigenvalues of $RR^T$.
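
To make the two outputs concrete, here is a minimal NumPy sketch of the decomposition on made-up data. It is not the implementation in ```pls_helpers.py```; in particular, z-scoring the columns of ```X``` and ```Y``` before forming $R$ is an assumption based on the tutorial, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 6 animals x 4 regions, first 3 animals in group 0.
X = rng.random((6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
Y = np.eye(2)[y]                      # one-hot group matrix, shape (6, 2)


def zscore(M):
    """Center each column and scale it to unit variance."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)


R = zscore(Y).T @ zscore(X)           # cross-product matrix, shape (2, 4)
U, s, Vt = np.linalg.svd(R, full_matrices=False)

group_saliences = U[:, 0]             # one value per group
region_saliences = Vt[0, :]           # one value per brain region
print("singular values:", s)
print("region saliences:", region_saliences)
```

With only two groups, the two centered columns of ```Y``` are mirror images of each other, so only the first singular value is meaningfully different from zero.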

## Random permutations to see whether we can differentiate signal from noise. 
Here, we randomly shuffle the group to which a mouse belongs, and calculate the singular values of the permuted dataset.  
From [Krishnan et al.](https://www.sciencedirect.com/science/article/pii/S1053811910010074):  
> The set of all the (permuted) singular values provides a sampling distribution of the singular values under the null hypothesis and, therefore can be used as a null hypothesis test.

*Note: running the cell below will take a few minutes.*

In [12]:
num_permutations = 5000
print(f'Randomly permuting singular values {num_permutations} times ...')
s,singular_values = cfosPLS.randomly_permute_singular_values(num_permutations)
print('Done!\n')

Randomly permuting singular values 5000 times ...
Done!



In [13]:
# TODO: move to Plotly

# Plot distribution of singular values
# plt.figure(figsize=(10,4))
# plt.hist(singular_values[:,0],bins=10)
# plt.axvline(cfosPLS.s[0], color='r')
# plt.xlabel('First singular value')
# plt.ylabel('Frequency')
# plt.legend([f'Experiment','Sampling distribution\nunder H0 (%d permutations)'%num_permutations])
# plt.show()

In [14]:
# Calculate p-value = fraction of permuted first singular values that exceed the observed one
p = (singular_values[:,0] > s[0]).sum() / num_permutations
print('p-value = '+str(p))

p-value = 0.4054
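
The permutation test used above can be sketched in a few lines of NumPy. This is an illustration on made-up data, not the code behind ```randomly_permute_singular_values```; the function names are hypothetical. Unlike the cell above, this sketch counts ties (```>=```) and adds 1 to numerator and denominator, a common convention that keeps the p-value from being exactly 0.

```python
import numpy as np


def first_singular_value(X, y):
    """Largest singular value of R = Y^T X for centered X and one-hot Y."""
    Y = np.eye(int(y.max()) + 1)[y]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return np.linalg.svd(Yc.T @ Xc, compute_uv=False)[0]


def permutation_p_value(X, y, num_permutations=1000, seed=0):
    """Share of label shuffles whose singular value reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed = first_singular_value(X, y)
    null = np.array([first_singular_value(X, rng.permutation(y))
                     for _ in range(num_permutations)])
    # The +1 terms count the observed labelling itself.
    return (np.sum(null >= observed) + 1) / (num_permutations + 1)


# Toy data (made up): group 1 is shifted upward in every region.
rng = np.random.default_rng(1)
y = np.array([0] * 6 + [1] * 6)
X = rng.normal(size=(12, 5)) + 3.0 * y[:, None]
print(permutation_p_value(X, y))
```

A small p-value means that few random group assignments produce a first singular value as large as the observed one.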


## Bootstrap to identify stable salience scores

Here, we use [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) (= sampling of the mice in the dataset, with replacement) to get an estimate of which salience scores are stable.

From [Krishnan et al.](https://www.sciencedirect.com/science/article/pii/S1053811910010074):  
> When a vector of saliences is considered generalizable and is kept for further analysis, we need to identify its elements that are stable through resampling. In practice, the stability of an element is evaluated by dividing it by its standard error. [...] To estimate the standard errors, we create bootstrap samples which are obtained by sampling with replacement the observations in X and Y (Efron and Tibshirani, 1986). A salience standard error is then estimated as the standard error of the saliences from a large number of these bootstrap samples (say 1000 or 10000). **The ratios are akin to a Z-score, therefore when they are larger than 2 the corresponding saliences are considered significantly stable.**

*Note: Running the cell below will take a few minutes.*

In [15]:
num_bootstrap = 5000
print(f'Bootstrapping salience scores {num_bootstrap} times...')
u_salience_scores,v_salience_scores = cfosPLS.bootstrap_salience_scores(rank,num_bootstrap)
print('Done!')

Bootstrapping salience scores 5000 times ...



Done!
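
The bootstrap ratio described in the quote can be sketched as follows on made-up data. This is not the code of ```bootstrap_salience_scores```; the function names are hypothetical, and the sign alignment of each bootstrap sample (a simple stand-in for a rotation step) and the skipping of resamples that contain only one group are assumptions of this sketch.

```python
import numpy as np


def region_saliences(X, y):
    """First right singular vector of R = Y^T X (one salience per region)."""
    Y = np.eye(int(y.max()) + 1)[y]
    R = (Y - Y.mean(axis=0)).T @ (X - X.mean(axis=0))
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[0]


def bootstrap_ratios(X, y, num_bootstrap=500, seed=0):
    """Observed saliences divided by their bootstrap standard error."""
    rng = np.random.default_rng(seed)
    observed = region_saliences(X, y)
    samples = []
    while len(samples) < num_bootstrap:
        idx = rng.integers(0, len(y), size=len(y))  # resample animals with replacement
        if len(np.unique(y[idx])) < 2:
            continue                                 # need both groups in a resample
        v = region_saliences(X[idx], y[idx])
        if v @ observed < 0:
            v = -v                                   # SVD sign is arbitrary; align it
        samples.append(v)
    return observed / np.std(samples, axis=0, ddof=1)


# Toy data (made up): only region 0 differs between the two groups.
rng = np.random.default_rng(2)
y = np.array([0] * 8 + [1] * 8)
X = rng.normal(size=(16, 4))
X[:, 0] += 4.0 * y
print(np.round(bootstrap_ratios(X, y), 2))
```

Regions whose ratio exceeds 2 in absolute value would be treated as stable under the rule quoted above.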



In [16]:
output_path

'./data/results/results'

In [17]:
# Plot PLS salience scores
plot_threshold = 1.2 # Only brain regions with a salience higher than plot_threshold are shown. 2 is the significance threshold.

file_title = 'PLS_CFos' + '_' + normalization + '.png'
output_path = os.path.join(animals_root, 'results_python')

tp, salient_regions = cfosPLS.plot_salience_scores(plot_threshold, output_path, file_title, brain_region_dict,
                              fig_width=1000, fig_height=2000)

##### salient_regions.reset_index()['index']

In [18]:
df = salient_regions.reset_index()
df.columns = ['region', 'salience']
df['salience'] = df['salience'].abs()
df = df.sort_values(by='salience')
df.to_csv('./data/R_results/salient_regions.csv', sep=';', index=False)
df

Unnamed: 0,region,salience
11,IA,1.205501
6,COA,1.233459
41,APN,1.267545
2,PPN,1.42756
44,I5,1.434632
13,SI,1.490909
14,MA,1.566115
15,MSC,1.586086
10,SH,1.62416
43,SUT,1.628646


In [19]:
pls_filename = 'PLS_CFos_' + normalization + '_salience_scores.txt'
save_results(v_salience_scores.rename(columns={0:'salience score'}), output_path, pls_filename)


! A results_python folder already existed in root. I am overwriting previous results!

Results are saved in ./data/results/results_python

Done!


True

# Plot percentages

In [20]:
# control_results[('RAB','Percentage')].to_csv('plot.csv', index=True)

In [21]:
# In this case we wanted to normalize based on the Density, rather than the Percentage 
# The labels in the plot were not updated: the focus was on adapting the code to our dataset, rather than polishing it

tracer_to_plot = 'CFos'
normalization = 'Density' # 'Density','Percentage','RelativeDensity'
threshold = 1e-2 # Only plot bars with value larger than threshold (1e-6, 1e-2, 3)
y_axis_label = 'region_names' # change this to 'acronym' to have acronyms on the y-axis

# Calculate mean values
control_df = pd.DataFrame(control_results[(tracer_to_plot,normalization)].rename('cell counts'))
control_avg = control_df.reset_index().groupby('level_0').mean(numeric_only=True)
control_sem = control_df.reset_index().groupby('level_0').sem(numeric_only=True)

stress_df = pd.DataFrame(stress_results[(tracer_to_plot,normalization)].rename('cell counts'))
stress_avg = stress_df.reset_index().groupby('level_0').mean(numeric_only=True)
stress_sem = stress_df.reset_index().groupby('level_0').sem(numeric_only=True)

# Determine which regions to plot  
mean_sum = control_avg + stress_avg
#regs_to_plot = mean_sum[(mean_sum['cell counts']>threshold) & (mean_sum['cell counts'].notnull())].sort_values(by='cell counts').index.to_list()
regs_to_plot = cfosPLS.X.columns.to_list()

# y-axis, with separate values for each region
y_axis_il, ticklabels = pd.factorize(control_df.loc[regs_to_plot].reset_index()['level_0'])
y_axis_bla, ticklabels = pd.factorize(stress_df.loc[regs_to_plot].reset_index()['level_0'])
if(y_axis_label=='region_names'):
    ticklabels = [AllenBrain.brain_region_dict[reg] for reg in ticklabels]
     
fig = go.Figure()

# Barplot
fig.add_trace(go.Bar(
                     x = control_avg.loc[regs_to_plot]['cell counts'],
                     name = 'C mean',
                     error_x = dict(
                         type='data',
                         array=control_sem.loc[regs_to_plot]['cell counts']
                     )
              )
)

fig.add_trace(go.Bar(
                     x = stress_avg.loc[regs_to_plot]['cell counts'],
                     name = 'S mean',
                     error_x = dict(
                         type='data',
                         array=stress_sem.loc[regs_to_plot]['cell counts']
                     )
              )
)

fig.update_layout(barmode='group', colorway=['rgb(0,255,0)', 'rgb(255,0,0)'])

# Scatterplot (animals)
fig.add_trace(go.Scatter(
                    mode = 'markers',
                    y = y_axis_il - 0.2,
                    x = control_df.loc[regs_to_plot]['cell counts'],
                    name = 'C animals',
                    opacity=0.5,
                    marker=dict(
                        color='rgb(0,255,0)',
                        size=5,
                        line=dict(
                            color='rgb(0,0,0)',
                            width=1
                        )
                    )
              )
)

fig.add_trace(go.Scatter(
                    mode = 'markers',
                    y = y_axis_bla + 0.2,
                    x = stress_df.loc[regs_to_plot]['cell counts'],
                    name = 'S animals',
                    opacity=0.5,
                    marker=dict(
                        color='rgb(255,0,0)',
                        size=5,
                        line=dict(
                            color='rgb(0,0,0)',
                            width=1
                        )
                    )
              )
)

# Figure title
title = ''
if normalization=='RelativeDensity':
    title = '['+tracer_to_plot+ '(r) / area(r)] / ['+tracer_to_plot+'(brain) / area(brain)].'
elif normalization=='Density':
    title = '['+tracer_to_plot+ '(r) / area(r)]'
elif normalization=='Percentage':
    title = '['+tracer_to_plot+ '(r) / brain(r)]'

# Update layout
fig.update_layout(
    title = title,
    yaxis = dict(
        tickmode = 'array',
        tickvals = np.arange(0,len(regs_to_plot)),
        ticktext = ticklabels
    ),
    xaxis=dict(
        title = 'CFos density (relative to brain)'
    ),
    width=900, height=5000,
    hovermode="x unified",
    yaxis_range = [-1,len(regs_to_plot)+1]
)

fig.show()

# Save figure as PNG
output_path = os.path.join(animals_root, 'results_python')
os.makedirs(output_path, exist_ok=True)
file_title = 'barplot_' + tracer_to_plot + '_' + normalization + 'CvS.png'
output_file = os.path.join(output_path, file_title)
fig.write_image(output_file)
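
The three normalization options named in the cell above can be written out on a toy example. The formulas follow the figure titles in that cell; reading 'Percentage' as the region's share of all counted cells, and treating the sum over the listed regions as the "whole brain", are assumptions of this sketch, and the numbers are made up.

```python
import numpy as np

# Toy values (made up): cell counts and areas for three regions of one animal.
regions = ["CLA", "LA", "MEV"]
counts = np.array([30.0, 50.0, 20.0])
areas_um2 = np.array([1.0e6, 4.0e6, 5.0e6])

brain_count = counts.sum()
brain_area = areas_um2.sum()

density = counts / areas_um2                           # cells per um^2 in each region
percentage = counts / brain_count                      # share of all counted cells
relative_density = density / (brain_count / brain_area)  # density relative to whole brain

for name, d, p, rd in zip(regions, density, percentage, relative_density):
    print(name, d, p, rd)
```

A relative density above 1 means the region is denser in labelled cells than the brain as a whole.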


