# Upload soma table to CAVE
To keep the table structure efficient and user-friendly, we created (1) a soma table that holds all the information, and (2) cell-type-specific tables (neuron and glia) that expose the same information by referencing the soma table. With this approach, the CAVE server only needs to update the information in `somas_dec2022`, while users can simply run `client.materialize.query_table('neuron_somas_dec2022')` or `client.materialize.query_table(client.info.get_datastack_info()['soma_table'])` to get the updated `pt_root_id` of neurons.
![How soma table is organized on CAVE](./img/soma_table_CAVE.png)
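
To make the referencing idea concrete, here is a minimal offline sketch using pandas only. It is a conceptual illustration, not how the CAVE server implements reference tables: the ids, the `pt_root_id` values, and the variable names are all made up, and the column layout returned by a real query may differ.

```python
import pandas as pd

# Toy stand-in for the soma table (hypothetical ids and root ids).
somas = pd.DataFrame({
    "id": [101, 102, 103],
    "pt_root_id": [9001, 9002, 9003],
})

# Toy stand-in for a reference table: it only stores which soma ids it points to.
neuron_refs = pd.DataFrame({"id": [101, 103], "target_id": [101, 103]})

# Resolving the reference is a join from target_id to the soma table's id, so an
# update to pt_root_id in the soma table shows up in the joined result.
neurons = neuron_refs.merge(
    somas, left_on="target_id", right_on="id", suffixes=("", "_ref")
)
print(neurons[["id", "target_id", "pt_root_id"]])
```

Because the reference table holds no coordinates or root ids of its own, only one table has to be kept up to date.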

## 0. Prepare CAVE environment
You need to install CAVEclient and initialize it first. If this is your first time using CAVE, check [here](https://globalv1.daf-apis.com/info/) to see which datasets you have access to. You should also take a look at [CAVE's official documentation](https://caveclient.readthedocs.io/en/latest/index.html).

In [1]:
import numpy as np
import sys
import os
import pandas as pd
from tqdm import tqdm
from caveclient import CAVEclient
from fanc.upload import xyz_StringSeries2List
from fanc import ngl_info

In [2]:
datastack_name = 'fanc_production_mar2021'
client = CAVEclient(datastack_name)

## 1. Create (empty) tables
Uploading data to CAVE usually takes two steps: (1) create an empty table, and (2) upload annotations to the table you have just created. Here, we first created three tables: `somas_dec2022` stores the coordinates of all neuronal and glial nuclei, `neuron_somas_dec2022` stores which ids belong to neurons, and `glia_somas_dec2022` stores which ids belong to glia.

In [None]:
%%script false

client.annotation.create_table(table_name='somas_dec2022',
                               schema_name='nucleus_detection',
                               description = 'Information about all the somas identified by the nucleus detection pipeline. We segmented nuclei using 3D U-net (DeepEM) and excluded objects smaller than the size dimension of (1,376 nm, 1,376 nm, 1,800 nm) for (x, y, z). id : a 17-digit nucleus id, pt_position (x,y,z) : a center coordinate, volume (um^3) : a volume, bb (x,y,z) : a bounding box. [created/uploaded/managed by Sumiya Kuroda - s.kuroda@ucl.ac.uk]',
                               flat_segmentation_source=ngl_info.nuclei["path"],
                               write_permission="GROUP",
                               voxel_resolution =[4.3, 4.3, 45.0])
client.annotation.create_table(table_name='neuron_somas_dec2022',
                               schema_name='simple_reference',
                               description = 'This table points to neuronal somas in somas_dec2022. target_id: a nucleus id of a neuron. [created/uploaded/managed by Sumiya Kuroda - s.kuroda@ucl.ac.uk]',
                               reference_table = 'somas_dec2022',
                               track_target_id_updates=True,
                               flat_segmentation_source=ngl_info.nuclei["path"],
                               write_permission="GROUP",
                               voxel_resolution =[4.3, 4.3, 45.0])
client.annotation.create_table(table_name='glia_somas_dec2022',
                               schema_name='simple_reference',
                               description = 'This table points to glial somas in somas_dec2022. target_id: a nucleus id of a glial cell. [created/uploaded/managed by Sumiya Kuroda - s.kuroda@ucl.ac.uk]',
                               reference_table = 'somas_dec2022',
                               track_target_id_updates=True,
                               flat_segmentation_source=ngl_info.nuclei["path"],
                               write_permission="GROUP",
                               voxel_resolution =[4.3, 4.3, 45.0])

## 2. Load data and format tables
We also have to format the tables before uploading them to CAVE. You can check which columns CAVE requires by running `client.schema.schema_definition('nucleus_detection')` or by looking [here](https://globalv1.daf-apis.com/schema/views/type/nucleus_detection/view).
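
Before talking to the server, it can help to check locally that a DataFrame has the columns you intend to upload. The sketch below is an assumption-laden example: `check_columns` is a hypothetical helper, and the column list is simply the set of columns this notebook uploads to `somas_dec2022`, not something read from the schema service.

```python
import pandas as pd

# Columns this notebook uploads to the soma table; the helper name is hypothetical.
REQUIRED = ["id", "pt_position", "volume", "bb_start_position", "bb_end_position"]

def check_columns(df, required=REQUIRED):
    """Return the required columns that are missing from df."""
    return [c for c in required if c not in df.columns]

# Toy DataFrame that lacks the two bounding-box columns.
toy = pd.DataFrame({"id": [1], "pt_position": [[0, 0, 0]], "volume": [1.0]})
missing = check_columns(toy)
print(missing)
```

An empty list means every expected column is present. The schema definition remains the authoritative source for what CAVE accepts.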

In [3]:
df = pd.read_csv("../Output/soma_info_Aug2021ver5_1_new_column_of_center_for_CAVE.csv", header=0) # original detection result
newID_neuron = pd.read_csv("../Output/nucID_mapping_neuron_20221207.csv", header=0) # new nucleus ids for each label
newID_glia = pd.read_csv("../Output/nucID_mapping_glia_20221219.csv", header=0)
newID_fp = pd.read_csv("../Output/nucID_mapping_fp_20221219.csv", header=0)
newID_dup_neuron = pd.read_csv("../Output/nucID_mapping_dup_neuron.csv", header=0)
newID_dup_glia = pd.read_csv("../Output/nucID_mapping_dup_glia_20221219.csv", header=0)
newIDs = [newID_neuron, newID_glia, newID_fp, newID_dup_neuron, newID_dup_glia]

In [4]:
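def which_label(nuc_id, label, dfs):
    """Return the label of the mapping table that contains nuc_id exactly once, else None."""
    label_list = list(label.values())
    for i in range(len(label)):
        # dfs must be in the same order as the entries of `label`
        if sum(np.isin(dfs[i]["old_nucID"], np.array(nuc_id, dtype=np.uint64))) == 1:
            return label_list[i]
    return None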

In [5]:
df["new_nucID"] = 0
df["vol_um"] = 0.0  # float column, since volumes are fractional
df["label"] = ""
voxel2um = (0.0043*2**4)*(0.0043*2**4)*0.045  # um^3 per voxel

newID_all = pd.concat(newIDs)
label = {'Neurons': "neurons", 'Glia': "glias", 'False positive': "false_positives", 'Duplicated neuron': "dup_neurons", 'Duplicated glia': "dup_glias"}

for i in tqdm(range(len(df))):
    # update nucleus ids
    nucID = df.iloc[i]["nucID"]
    idx = np.isin(newID_all["old_nucID"], np.array(nucID, dtype=np.uint64))
    df.at[i, "new_nucID"] = newID_all[idx]["new_nucID"].values[0]

    # calculate volume in um^3
    voxel_size = df.iloc[i]["voxel_size"]
    df.at[i, "vol_um"] = voxel_size * voxel2um

    # label whether neuron or glia or false positive or duplicated neuron
    df.at[i, "label"] = which_label(nucID, label, newIDs)

100%|██████████| 17076/17076 [01:08<00:00, 248.79it/s]
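
The row-by-row loop above takes about a minute. A vectorized alternative is sketched below on toy data. It assumes every old nucleus id appears in exactly one mapping table (`Series.map` needs a unique index); the ids and the `df_toy`, `map_neuron`, and `map_glia` names are made up for illustration.

```python
import pandas as pd

# Toy inputs with hypothetical ids; the real ones come from the CSV files above.
df_toy = pd.DataFrame({"nucID": [1, 2, 3], "voxel_size": [10000, 20000, 30000]})
map_neuron = pd.DataFrame({"old_nucID": [1, 3], "new_nucID": [11, 13]})
map_glia = pd.DataFrame({"old_nucID": [2], "new_nucID": [12]})

# Tag each mapping table with its label, then stack them into one lookup table.
lookup = pd.concat(
    [map_neuron.assign(label="neurons"), map_glia.assign(label="glias")],
    ignore_index=True,
).set_index("old_nucID")

# Same conversion factor as in the notebook cell above (um^3 per voxel).
voxel2um = (0.0043 * 2**4) * (0.0043 * 2**4) * 0.045

# Look up the new id and the label for every row at once, then scale the volume.
df_toy["new_nucID"] = df_toy["nucID"].map(lookup["new_nucID"])
df_toy["label"] = df_toy["nucID"].map(lookup["label"])
df_toy["vol_um"] = df_toy["voxel_size"] * voxel2um
print(df_toy)
```

If an id were missing from every mapping table, `map` would leave `NaN` in that row instead of raising an error, so it is worth checking for nulls afterwards.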


In [6]:
# number of rows per label, in the order the labels are defined
for name in label.values():
    print(len(df[df["label"]==name]))

14628
2019
412
15
2


In [7]:
nucleus_df = df.reindex(columns=['new_nucID', 'center_for_CAVE', 'vol_um', 'bbx_min', 'bbx_max', 'label'])
nucleus_df = nucleus_df.rename(columns={'new_nucID': 'id', 'center_for_CAVE': 'pt_position', 'vol_um': 'volume', 'bbx_min': 'bb_start_position', 'bbx_max': 'bb_end_position'})

In [8]:
nucleus_df['pt_position'] = xyz_StringSeries2List(nucleus_df['pt_position'])
nucleus_df['bb_start_position'] = xyz_StringSeries2List(nucleus_df['bb_start_position'])
nucleus_df['bb_end_position'] = xyz_StringSeries2List(nucleus_df['bb_end_position'])
nucleus_df_s = nucleus_df.sort_values('volume', ascending=True).reset_index(drop=True)
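
For readers without the `fanc` package, the sketch below shows one way such a string-to-list conversion could look. It is a hypothetical stand-in, not the implementation of `fanc.upload.xyz_StringSeries2List`, and it assumes the coordinates are stored as comma-separated integers wrapped in brackets or parentheses.

```python
import pandas as pd

def parse_xyz(series):
    """Turn strings such as '[1, 2, 3]' or '(1, 2, 3)' into [1, 2, 3] lists."""
    # Drop the surrounding brackets, split on commas, and convert each piece to int.
    parts = series.str.strip("[]() ").str.split(",")
    return parts.apply(lambda xs: [int(x) for x in xs])

toy = pd.Series(["[25528, 84220, 2199]", "(27260, 86824, 2002)"])
print(parse_xyz(toy).tolist())
```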

## 3. Upload annotations
After formatting the tables, we uploaded them into the empty tables we had just created. You can upload them directly, but `stage_annotations` is a safer option because it warns you if your table is incorrectly formatted.

In [9]:
# we first uploaded the soma table
tmp = nucleus_df_s[nucleus_df_s['label'].isin(["neurons", "glias"])].reset_index(drop=True)
df_upload = tmp.reindex(columns=['id', 'pt_position', 'volume', 'bb_start_position', 'bb_end_position'])

In [10]:
df_upload

| | id | pt_position | volume | bb_start_position | bb_end_position |
|---|---|---|---|---|---|
| 0 | 72481181429400700 | [25528, 84220, 2199] | 1.589229 | [25136, 84000, 2167] | [25920, 84544, 2210] |
| 1 | 72481250081768945 | [27260, 86824, 2002] | 2.807190 | [27008, 86544, 1982] | [27408, 87104, 2023] |
| 2 | 72481319002571550 | [28028, 92796, 2662] | 3.072168 | [27696, 92448, 2634] | [28512, 93248, 2700] |
| 3 | 72623705288605712 | [35088, 188960, 791] | 3.302213 | [34832, 188656, 771] | [35344, 189264, 812] |
| 4 | 72481181362296023 | [26584, 83696, 2034] | 3.782539 | [26176, 83344, 2013] | [26992, 84048, 2060] |
| ... | ... | ... | ... | ... | ... |
| 16642 | 72412118355280021 | [23512, 160472, 2125] | 401.849957 | [22016, 159408, 2033] | [25008, 161536, 2217] |
| 16643 | 72623018630709815 | [34288, 149640, 3050] | 412.453123 | [33072, 148528, 2949] | [35504, 150752, 3151] |
| 16644 | 73044681272852922 | [60672, 116472, 2665] | 431.741133 | [59408, 115472, 2550] | [61936, 117472, 2781] |
| 16645 | 72833712814817408 | [47520, 124320, 3907] | 484.838331 | [46112, 123248, 3791] | [48928, 125392, 4023] |


In [None]:
%%script false

k = 10000  # CAVE accepts at most 10,000 annotations per upload
minidfs = [df_upload.loc[i:i+k-1, :] for i in range(0, len(df_upload), k)]
for dftmp in minidfs:
    stage = client.annotation.stage_annotations("somas_dec2022", id_field=True)
    stage.add_dataframe(dftmp)
    client.annotation.upload_staged_annotations(stage)
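
The batching step can be tried offline without a CAVE connection. The sketch below uses a hypothetical helper, `split_batches`, and a toy DataFrame; it slices by position with `iloc`, whereas the cell above slices by label with `loc`, which gives the same result only because the index was reset to 0, 1, 2, ...

```python
import pandas as pd

def split_batches(df, k):
    """Split df into consecutive pieces of at most k rows each."""
    return [df.iloc[i:i + k] for i in range(0, len(df), k)]

# 25 toy rows split into batches of 10 leaves a final, shorter batch.
toy = pd.DataFrame({"id": range(25)})
batches = split_batches(toy, k=10)
print([len(b) for b in batches])
```

Positional slicing keeps working even if the DataFrame carries a non-default index, which is why it is used in this sketch.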

In [11]:
# neuron table
tmp = nucleus_df_s[nucleus_df_s['label'].isin(["neurons"])].reset_index(drop=True)
df_upload = tmp.reindex(columns=['id'])
df_upload['target_id'] = df_upload['id']

In [12]:
df_upload

| | id | target_id |
|---|---|---|
| 0 | 72481181429400700 | 72481181429400700 |
| 1 | 72481250081768945 | 72481250081768945 |
| 2 | 72481319002571550 | 72481319002571550 |
| 3 | 72623705288605712 | 72623705288605712 |
| 4 | 72481181362296023 | 72481181362296023 |
| ... | ... | ... |
| 14623 | 72412118355280021 | 72412118355280021 |
| 14624 | 72623018630709815 | 72623018630709815 |
| 14625 | 73044681272852922 | 73044681272852922 |
| 14626 | 72833712814817408 | 72833712814817408 |


In [None]:
%%script false

k=10000
minidfs = [df_upload.loc[i:i+k-1, :] for i in range(0, len(df_upload), k)]
for dftmp in minidfs:
    stage = client.annotation.stage_annotations("neuron_somas_dec2022", id_field=True)
    stage.add_dataframe(dftmp)
    client.annotation.upload_staged_annotations(stage)

In [13]:
# glia table
tmp = nucleus_df_s[nucleus_df_s['label'].isin(["glias"])].reset_index(drop=True)
df_upload = tmp.reindex(columns=['id'])
df_upload['target_id'] = df_upload['id']

In [14]:
df_upload

| | id | target_id |
|---|---|---|
| 0 | 72904562259788371 | 72904562259788371 |
| 1 | 73327186840388111 | 73327186840388111 |
| 2 | 72341886848730335 | 72341886848730335 |
| 3 | 72342161793745692 | 72342161793745692 |
| 4 | 72834605496926408 | 72834605496926408 |
| ... | ... | ... |
| 2014 | 73186037169389815 | 73186037169389815 |
| 2015 | 72412049501587544 | 72412049501587544 |
| 2016 | 73114912443859674 | 73114912443859674 |
| 2017 | 73115393345978782 | 73115393345978782 |


In [None]:
%%script false

k=10000
minidfs = [df_upload.loc[i:i+k-1, :] for i in range(0, len(df_upload), k)]
for dftmp in minidfs:
    stage = client.annotation.stage_annotations("glia_somas_dec2022", id_field=True)
    stage.add_dataframe(dftmp)
    client.annotation.upload_staged_annotations(stage)
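
Since both reference tables point into `somas_dec2022`, a quick local consistency check before uploading may be worthwhile. The sketch below is only a suggestion and uses made-up ids; in the notebook, the three Series would be the `id` columns of the corresponding upload DataFrames.

```python
import pandas as pd

# Toy ids (hypothetical) standing in for the soma, neuron, and glia id columns.
soma_ids = pd.Series([101, 102, 103, 104])
neuron_targets = pd.Series([101, 103])
glia_targets = pd.Series([102, 104])

# Every reference should point at an id that exists in the soma table.
all_targets = pd.concat([neuron_targets, glia_targets])
dangling = all_targets[~all_targets.isin(soma_ids)]

# No nucleus should be listed as both a neuron and a glial cell.
overlap = set(neuron_targets) & set(glia_targets)

print(len(dangling), len(overlap))
```

Both counts should be zero; a non-zero value points to a reference with no matching soma or to a nucleus that carries two labels.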