# Ketamine VR5 - SINGLE LABELED DATA ONLY
#### Jonathan Ramos 1/26/2024

I'm glad these data came just as I finished the sleep dep set, so some of the code is still fresh in my brain. For these data, the format of the csvs is quite different (PIPSQUEAK and POLYGON spit out csvs differently). Col names are different and some label names need to be changed; in particular, the stain type is simply recorded as "hand-drawn" whenever the user added ROIs that were not detected by the POLYGON algorithm. This causes problems because hand-drawn ROIs of every stain type end up with the same "hand-drawn" label. This has been an ongoing issue with POLYGON, but we have a workaround.

In the filename col, all files follow a consistent naming scheme:
- *_2.tif : PV
- *_3.tif : cFos
- *_4.tif : Npas4
- *_5.tif : WFA

Additionally, since there is no subject ID col, we can construct one from the informatively named filenames. For this project, filenames use the following format:

*(rat number)*_*(brain region)*_*(bregma)*_*(n)*.tif
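
A minimal sketch of pulling these pieces out of a filename. It assumes the pattern seen in the loaded rows further down (e.g. `PE-11-7_PFC_3.9_A_2.tif`), where an extra single-letter field sits between the bregma and the final number, so the final number is taken from the last field rather than the fourth. The helper `parse_filename` and its field names are hypothetical and not part of the analysis code in this notebook.

```python
def parse_filename(filename):
    """Split an image filename into its underscore-separated fields.

    Hypothetical helper: assumes the first field is the rat number and
    the last field (before '.tif') is the channel number n.
    """
    # strip stray whitespace, then drop the '.tif' extension
    stem = filename.strip().rsplit('.', 1)[0]
    fields = stem.split('_')
    return {
        'rat_n': fields[0],
        'region': fields[1],
        'bregma': fields[2],   # kept as a string, e.g. '3.9'
        'n': fields[-1],       # last field, whatever sits in between
    }

parsed = parse_filename(' PE-11-7_PFC_3.9_A_2.tif')
print(parsed)
```

Splitting off the extension with `rsplit('.', 1)` first matters here because the bregma value itself contains a dot.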

In this notebook I will wrangle all the data into one spot (the data are distributed over ~600 small csvs), clean things up, normalize intensity, and compute mean cell counts (ns).



### Loading data, stitching sets together

In [70]:
import numpy as np
import pandas as pd
import glob

# load cohort key (there are a few empty rows at the end, which dropna removes)
df_key = pd.read_csv('KETAMINE_COHORT KEY.csv').dropna()

# load data; getting it all in one spot
df_full = pd.concat([pd.read_csv(f) for f in glob.glob('CSV_FILES/*/*.csv')])

# col names begin with a whitespace char; let's remove all ' ' chars from col names
df_full.columns = [col.replace(' ', '') for col in df_full.columns]

# let's take a look
df_full

Unnamed: 0,cell_number,roi_id,roi_source,roi_type,CoM_x,CoM_y,pixel_area,background,mean_intensity,median_intensity,...,feret_angle,feret_min,circularity,aspect_ratio,roundness,solidity,skewness,kurtosis,filename,analysis_date
0,1,000-00000,Parvalbumin,OVAL,120.64,307.06,665.0,285.8023,636.1711,677.3578,...,0.0,28.0,0.9387,1.1447,0.8205,0.9419,-0.4228,-0.6905,PE-11-7_PFC_3.9_A_2.tif,Thu Jan 25 15:09:00 PST 2024
1,2,000-00001,Parvalbumin,OVAL,403.99,379.52,468.0,285.8023,412.1381,395.8835,...,90.0,24.0,0.8789,1.1555,0.7613,0.8797,0.4731,-0.5155,PE-11-7_PFC_3.9_A_2.tif,Thu Jan 25 15:09:00 PST 2024
2,3,000-00002,Parvalbumin,OVAL,363.44,463.29,399.0,285.8023,369.5932,366.851,...,0.0,20.0,0.8698,1.4266,0.6299,0.9027,0.3051,-0.1709,PE-11-7_PFC_3.9_A_2.tif,Thu Jan 25 15:09:00 PST 2024
3,4,000-00003,Parvalbumin,OVAL,68.43,322.91,550.0,285.8023,518.8725,540.6805,...,0.0,22.0,0.8089,1.6476,0.535,0.8814,-0.4078,-0.4258,PE-11-7_PFC_3.9_A_2.tif,Thu Jan 25 15:09:00 PST 2024
4,5,000-00004,Parvalbumin,OVAL,386.72,251.18,524.0,285.8023,832.9809,920.6865,...,0.0,26.0,0.9146,1.0654,0.8539,0.9066,-0.6879,-0.6308,PE-11-7_PFC_3.9_A_2.tif,Thu Jan 25 15:09:00 PST 2024
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73,74,FFF-00073,hand-drawn,OVAL,491.39,60.52,130.0,541.3023,435.1857,415.7151,...,0.0,28.0,0.1835,1.1447,0.1604,0.1841,1.3612,1.5127,KET-10-4_PFC_3.5_C_4.tif,Mon Jan 22 16:14:32 PST 2024
74,75,FFF-00074,hand-drawn,OVAL,497.35,15.91,65.0,541.3023,413.2948,392.1778,...,0.0,28.0,0.0918,1.0617,0.0933,0.0921,3.3638,15.8851,KET-10-4_PFC_3.5_C_4.tif,Mon Jan 22 16:14:32 PST 2024
75,76,FFF-00075,hand-drawn,OVAL,437.04,376.98,130.0,541.3023,493.2017,462.339,...,0.0,28.0,0.1835,1.1447,0.1604,0.1841,1.4819,2.5275,KET-10-4_PFC_3.5_C_4.tif,Mon Jan 22 16:14:32 PST 2024
76,77,FFF-00076,hand-drawn,OVAL,472.11,432.2,104.0,541.3023,472.384,433.4397,...,0.0,28.0,0.1468,1.1447,0.1283,0.1473,1.6255,3.4102,KET-10-4_PFC_3.5_C_4.tif,Mon Jan 22 16:14:32 PST 2024


### Building the necessary cols

In [151]:
# creating a new rat_n col
df_full['rat_n'] = df_full.filename.apply(lambda x: x.split('_')[0])\
    .replace({' ': ''}, regex=True) # for some reason, filenames also carry stray leading whitespace chars

# some checks. we want to be sure that the structure of all our rat_n labels is consistent;
# in particular, we expect something of the form 'PE-12-7', that is, exactly
# two dashes '-' separating some letters followed by two numbers
assert df_full.rat_n.apply(lambda x: len(x.split('-')) == 3).all()
assert df_full.rat_n.apply(lambda x: x.split('-')[0].isalpha()).all()
assert df_full.rat_n.apply(lambda x: x.split('-')[1].isnumeric()).all()
assert df_full.rat_n.apply(lambda x: x.split('-')[2].isnumeric()).all()

# building a cohort key dictionary from df_key
treatment = dict(zip(df_key.Subject, df_key.TX.replace({' ': '_'}, regex=True)))

# creating new treatment col by mapping from cohort key dict
df_full['treatment'] = df_full.rat_n.map(treatment)

# creating new stain_type col from filename
# (raw strings with an escaped '.', so the dot before 'tif' is matched literally)
df_full['stain_type'] = df_full.filename.replace({
    r'.*_2\.tif$' : 'PV',
    r'.*_3\.tif$' : 'cFos',
    r'.*_4\.tif$' : 'Npas4',
    r'.*_5\.tif$' : 'WFA'
}, regex=True)
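
Because `Series.replace` leaves any value that matches none of the patterns unchanged, a filename with an unexpected suffix would silently survive as its own "stain type"; likewise, `Series.map` yields NaN for any rat missing from the cohort key. Below is a minimal sketch of a check for both cases, run on a small made-up frame standing in for `df_full`. The `_9.tif` row, the `toy_key` dict and the `Saline` label are illustrative assumptions only, not values from the cohort key.

```python
import pandas as pd

# toy stand-in for df_full; the third filename has a deliberately bad suffix
toy = pd.DataFrame({
    'filename': ['PE-11-7_PFC_3.9_A_2.tif',
                 'KET-10-4_PFC_3.5_C_4.tif',
                 'KET-10-4_PFC_3.5_C_9.tif'],
    'rat_n': ['PE-11-7', 'KET-10-4', 'KET-10-4'],
})
# hypothetical cohort key, deliberately missing KET-10-4
toy_key = {'PE-11-7': 'Saline'}

toy['stain_type'] = toy.filename.replace({
    r'.*_2\.tif$': 'PV',
    r'.*_3\.tif$': 'cFos',
    r'.*_4\.tif$': 'Npas4',
    r'.*_5\.tif$': 'WFA',
}, regex=True)
toy['treatment'] = toy.rat_n.map(toy_key)

# filenames whose stain_type is not one of the four expected labels
expected = {'PV', 'cFos', 'Npas4', 'WFA'}
unmapped_stains = toy.loc[~toy.stain_type.isin(expected), 'filename'].unique()

# rats that did not receive a treatment from the key
unmapped_rats = toy.loc[toy.treatment.isna(), 'rat_n'].unique()

print(unmapped_stains)
print(unmapped_rats)
```

On the real data, both arrays being empty would indicate that every row received a stain type and a treatment.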



In [155]:
df_full.columns

Index(['cell_number', 'roi_id', 'roi_source', 'roi_type', 'CoM_x', 'CoM_y',
       'pixel_area', 'background', 'mean_intensity', 'median_intensity',
       'mode_intensity', 'stdev', 'mean-background', 'integrated_density',
       'min', 'max', 'raw_integrated_density', 'ferets_diameter', 'feret_x',
       'feret_y', 'feret_angle', 'feret_min', 'circularity', 'aspect_ratio',
       'roundness', 'solidity', 'skewness', 'kurtosis', 'filename',
       'analysis_date', 'rat_n', 'stain_type', 'treatment'],
      dtype='object')

In [157]:
df_full.CoM_y

0     307.06
1     379.52
2     463.29
3     322.91
4     251.18
       ...  
73     60.52
74     15.91
75    376.98
76     432.2
77    214.76
Name: CoM_y, Length: 24613, dtype: object
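
The `dtype: object` above suggests that at least some `CoM_y` values are stored as strings rather than floats. One plausible cause, assumed here and not verified, is stray whitespace or non-numeric entries in some of the csvs. A minimal sketch of coercing such a column with `pd.to_numeric`, using made-up values:

```python
import pandas as pd

# made-up values mimicking a numeric column that was read in as strings
com_y = pd.Series(['307.06', ' 379.52', '432.2', 'n/a'], dtype=object)

# errors='coerce' turns anything unparseable into NaN instead of raising
com_y_num = pd.to_numeric(com_y.str.strip(), errors='coerce')

print(com_y_num.dtype)
print(com_y_num.isna().sum())  # how many entries failed to parse
```

Counting the NaNs after coercion gives a quick way to find out whether any entries were genuinely non-numeric, as opposed to merely stored as text.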

In [130]:
test

0     False
1     False
2     False
3     False
4     False
      ...  
73    False
74    False
75    False
76    False
77    False
Name: rat_n, Length: 24613, dtype: bool