# Distribution Estimation

In this notebook, we take a step back from the procedures in the "Non-Parametric Distribution Comparisons" notebook. Instead of looking only at the mean and variance of the data, we look at each file's data as a whole and ask whether we can estimate how it is distributed, given each file's sample size.

Some specific questions I aim to answer in this notebook:

- Are the mean and variance of the "peakVal" variable meaningful?
- Are the mean and variance valid summary measures for this data?
- Can we visualize the "peakVal" data with histograms?
- How do other columns contribute to the variance in our data?
- Is our data approximately normal? Uniform? Something different?
- How can we compare the distributions of the Control group versus the Thapsigargin group?
- Bonus: Does the sex of the mice examined have any influence on the data (if valid)?
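
As a rough illustration of the normality and group-comparison questions above, the sketch below runs a Shapiro-Wilk test per condition and a two-sample Kolmogorov-Smirnov test between conditions. The `demo` frame is synthetic and only mimics the "condition" and "peakVal" columns; `normality_by_group` is a hypothetical helper, not part of this analysis. Both tests assume independent observations, which spikes from the same recording may not satisfy, so the p-values should be read with caution.

```python
import numpy as np
import pandas as pd
import scipy.stats as sp

rng = np.random.default_rng(0)

# Synthetic stand-in for dfMaster: these values are made up for illustration only
demo = pd.DataFrame(
    {
        "condition": ["Control"] * 40 + ["TG"] * 40,
        "peakVal": np.concatenate(
            [rng.normal(1.0, 0.1, 40), rng.normal(1.5, 0.1, 40)]
        ),
    }
)


def normality_by_group(df, group_col="condition", value_col="peakVal"):
    """Return the Shapiro-Wilk statistic and p-value for each group (hypothetical helper)."""
    rows = []
    for name, values in df.groupby(group_col)[value_col]:
        stat, p = sp.shapiro(values)
        rows.append({group_col: name, "n": len(values), "W": stat, "p": p})
    return pd.DataFrame(rows)


normality = normality_by_group(demo)

# Two-sample KS test: compares the full empirical distributions, not just the means
control = demo.loc[demo["condition"] == "Control", "peakVal"]
tg = demo.loc[demo["condition"] == "TG", "peakVal"]
ks_stat, ks_p = sp.ks_2samp(control, tg)

print(normality)
print(f"KS statistic = {ks_stat:.3f}, p = {ks_p:.3g}")
```

A small Shapiro-Wilk p-value suggests departure from normality; a small KS p-value suggests the two conditions differ somewhere in their distributions.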

In [None]:
# Auto-format code cells with Black (nb_black extension)
%load_ext nb_black

In [17]:
import numpy as np
from numpy.random import default_rng
import pandas as pd
import scipy.stats as sp
import plotly.express as px
import plotly.graph_objects as go 
from plotly.subplots import make_subplots
from IPython.display import display
import seaborn as sns

In [14]:
# we can add to this as we get more data
pathList = []

# I changed these to URLs so that the data is accessible from any device
# pathList.append("dataGianni/masterDb-jan-12-2022.csv")
# pathList.append("dataGianni/masterDb-jan-18-2022.csv")
# pathList.append("dataGianni/masterDb-feb-15-2022.csv")
pathList.append(
    "https://raw.githubusercontent.com/gspiga/Cudmore/main/VarAnalysis/gianni_var_analysis/dataGianni/masterDb-jan-12-2022.csv"
)
pathList.append(
    "https://raw.githubusercontent.com/gspiga/Cudmore/main/VarAnalysis/gianni_var_analysis/dataGianni/masterDb-jan-18-2022.csv"
)
pathList.append(
    "https://raw.githubusercontent.com/gspiga/Cudmore/main/VarAnalysis/gianni_var_analysis/dataGianni/masterDb-feb-15-2022.csv"
)


# make a list of dataframes (one per csv file)
dfList = []
for fileIdx, path in enumerate(pathList):
    dfPath = pd.read_csv(path)
    dfPath["myFileIdx"] = fileIdx  # index of the source csv, for bookkeeping if necessary
    dfList.append(dfPath)

# make a single dataframe from all files in list
dfMaster = pd.concat(dfList, ignore_index=True)
dfMaster = dfMaster.drop("Unnamed: 0", axis = 1)
display(dfMaster)

Unnamed: 0,analysisVersion,interfaceVersion,file,detectionType,cellType,sex,condition,sweep,sweepSpikeNumber,spikeNumber,...,diastolicDuration_ms,widths,widths_10,widths_20,widths_50,widths_80,widths_90,myDateStr,fileIdx,myFileIdx
0,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,0,0,...,23.5704,"[{'halfHeight': 10, 'risingPnt': 96, 'fallingP...",769.9664,742.4676,557.8328,396.7684,302.4868,jan-12-2022,0,0
1,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,1,1,...,3.9284,"[{'halfHeight': 10, 'risingPnt': 347, 'falling...",785.6800,754.2528,589.2600,428.1956,345.6992,jan-12-2022,0,0
2,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,2,2,...,3.9284,"[{'halfHeight': 10, 'risingPnt': 598, 'falling...",809.2504,754.2528,581.4032,408.5536,282.8448,jan-12-2022,0,0
3,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,3,3,...,19.6420,"[{'halfHeight': 10, 'risingPnt': 849, 'falling...",895.6752,726.7540,581.4032,396.7684,259.2744,jan-12-2022,0,0
4,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,4,4,...,3.9284,"[{'halfHeight': 10, 'risingPnt': 1100, 'fallin...",813.1788,714.9688,601.0452,381.0548,243.5608,jan-12-2022,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,0,0,...,51.5350,"[{'halfHeight': 10, 'risingPnt': 169, 'falling...",688.6950,627.7900,505.9800,187.4000,182.7150,feb-15-2022,8,2
396,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,1,1,...,18.7400,"[{'halfHeight': 10, 'risingPnt': 346, 'falling...",773.0250,726.1750,580.9400,281.1000,79.6450,feb-15-2022,8,2
397,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,2,2,...,79.6450,"[{'halfHeight': 10, 'risingPnt': 541, 'falling...",754.2850,716.8050,557.5150,248.3050,182.7150,feb-15-2022,8,2
398,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,3,3,...,28.1100,"[{'halfHeight': 10, 'risingPnt': 720, 'falling...",482.5550,454.4450,351.3750,154.6050,84.3300,feb-15-2022,8,2
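
Because the questions in this notebook depend on each file's sample size, a per-file summary is a natural first check. The sketch below is one possible way to tabulate the count, mean, and variance of "peakVal" per file. The `demo` frame and its file names are made up; in the notebook the same `groupby` would be applied to `dfMaster`.

```python
import pandas as pd

# Synthetic stand-in for dfMaster: file names and values are illustrative only
demo = pd.DataFrame(
    {
        "file": ["a.tif"] * 3 + ["b.tif"] * 5,
        "condition": ["Control"] * 3 + ["TG"] * 5,
        "peakVal": [1.0, 1.2, 1.1, 2.0, 2.1, 1.9, 2.2, 2.0],
    }
)

# One row per (file, condition): sample size, mean, and sample variance of peakVal
summary = (
    demo.groupby(["file", "condition"])["peakVal"]
    .agg(n="count", mean="mean", var="var")
    .reset_index()
)

print(summary)
```

Files with very few spikes will have unstable variance estimates, which is worth keeping in mind when reading the histograms.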


In [62]:
# From meeting on 3/7, just trying different histograms
# subset = dfMaster["peakVal"][dfMaster["file"] == "220110n_0014.tif"]
# figure = px.histogram(subset)
# figure.show()

# subset = dfMaster["peakVal"][dfMaster["file"] == "2.5Hz_ctrl_0012.tif"]
# figure = px.histogram(subset)
# figure.show()

# dfMaster.file.unique()
filenames = dfMaster["file"].unique()
# One row per file; a two-column grid caused staggering problems with the plots
fig = make_subplots(rows=len(filenames), cols=1, subplot_titles=list(filenames))

for index, file in enumerate(filenames):
    dfFile = dfMaster.loc[dfMaster["file"] == file]
    # Plotly subplot rows are 1-indexed; add_trace replaces the deprecated append_trace
    fig.add_trace(go.Histogram(x=dfFile["peakVal"]), row=index + 1, col=1)

fig.update_layout(
    height=7000,
    width=700,
    showlegend=False,
    colorway=px.colors.qualitative.Vivid,
    title_text="Subplot for Each File's peakVal",
)

fig.show()

# dfFile =dfMaster.loc[dfMaster["file"] == "2.5Hz_ctrl_0012.tif"]
# display(dfFile)


Next steps: add a kernel density estimate (KDE) to each plot, and research whether KDE is a valid measure for small samples.
https://en.wikipedia.org/wiki/Kernel_density_estimation
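
As a starting point for the KDE step, here is a minimal sketch using `scipy.stats.gaussian_kde` on a small synthetic sample (the 30 values are made up, not taken from any file). It fits one KDE with the default bandwidth (Scott's rule) and one with a deliberately narrow bandwidth, since bandwidth choice tends to matter most when the sample is small.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Synthetic small sample standing in for one file's peakVal values
sample = rng.normal(loc=1.0, scale=0.2, size=30)

# Evaluation grid, padded well beyond the observed range
grid = np.linspace(sample.min() - 1.0, sample.max() + 1.0, 500)

kde_default = gaussian_kde(sample)  # default bandwidth: Scott's rule
kde_narrow = gaussian_kde(sample, bw_method=0.2)  # smaller factor, wigglier estimate

density_default = kde_default(grid)
density_narrow = kde_narrow(grid)

# Sanity check: a density should integrate to roughly 1 over the padded range
area_default = kde_default.integrate_box_1d(grid[0], grid[-1])

print(f"Scott factor = {kde_default.factor:.3f}, narrow factor = {kde_narrow.factor:.3f}")
print(f"Area under default KDE = {area_default:.4f}")
```

Plotting `density_default` and `density_narrow` over the histogram for the same file would show how sensitive the estimated shape is to the bandwidth at this sample size.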