# Distribution Estimation

In this notebook, we take a step back from the procedures in the "Non-Parametric Distribution Comparisons" notebook. Instead of looking only at the mean and variance of the data, we look at each file's data as a whole and ask whether we can estimate how it is distributed, given each file's sample size.

Some specific questions I aim to answer in this notebook: <br>
    - ~Can we visualize with histograms, the "peakVal" data?~ <br>
    - ~Is our data approximately normal? Uniform? Something different?~ <br>
    - ~How can we compare the distributions of the Control group versus Thapsigargin?~ <br>
    - ~How do other columns contribute to the variance in our data?~ <br>
    - Are the mean and variance of the "peakVal" variable meaningful?<br>
    - Are the mean and variance a valid measure? <br>
    - ~Bonus: Does the sex of the mice examined have any influence on the data (if valid)?~ UPDATE: Sex data is NaN <br>

In [73]:
# Auto formatting for Python to PEP8
%load_ext nb_black

The nb_black extension is already loaded. To reload it, use:
  %reload_ext nb_black



In [74]:
import numpy as np
from numpy.random import default_rng
import pandas as pd
import scipy.stats as sp
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display
import seaborn as sns
from sklearn.neighbors import KernelDensity


In [75]:
# we can add to this as we get more data
pathList = []

# I changed this to URLs so that it would be accessible from any device
# pathList.append("dataGianni/masterDb-jan-12-2022.csv")
# pathList.append("dataGianni/masterDb-jan-18-2022.csv")
# pathList.append("dataGianni/masterDb-feb-15-2022.csv")
pathList.append(
    "https://raw.githubusercontent.com/gspiga/Cudmore/main/VarAnalysis/gianni_var_analysis/dataGianni/masterDb-jan-12-2022.csv"
)
pathList.append(
    "https://raw.githubusercontent.com/gspiga/Cudmore/main/VarAnalysis/gianni_var_analysis/dataGianni/masterDb-jan-18-2022.csv"
)
pathList.append(
    "https://raw.githubusercontent.com/gspiga/Cudmore/main/VarAnalysis/gianni_var_analysis/dataGianni/masterDb-feb-15-2022.csv"
)


# make a list of dataframe (from csv files)
dfList = []
for fileIdx, path in enumerate(pathList):
    dfPath = pd.read_csv(path)
    dfPath["myFileIdx"] = fileIdx  # add for our bookkeeping if necessary
    dfList.append(dfPath)

# make a single dataframe from all files in list
dfMaster = pd.concat(dfList, ignore_index=True)
dfMaster = dfMaster.drop("Unnamed: 0", axis=1)
display(dfMaster)

Unnamed: 0,analysisVersion,interfaceVersion,file,detectionType,cellType,sex,condition,sweep,sweepSpikeNumber,spikeNumber,...,diastolicDuration_ms,widths,widths_10,widths_20,widths_50,widths_80,widths_90,myDateStr,fileIdx,myFileIdx
0,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,0,0,...,23.5704,"[{'halfHeight': 10, 'risingPnt': 96, 'fallingP...",769.9664,742.4676,557.8328,396.7684,302.4868,jan-12-2022,0,0
1,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,1,1,...,3.9284,"[{'halfHeight': 10, 'risingPnt': 347, 'falling...",785.6800,754.2528,589.2600,428.1956,345.6992,jan-12-2022,0,0
2,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,2,2,...,3.9284,"[{'halfHeight': 10, 'risingPnt': 598, 'falling...",809.2504,754.2528,581.4032,408.5536,282.8448,jan-12-2022,0,0
3,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,3,3,...,19.6420,"[{'halfHeight': 10, 'risingPnt': 849, 'falling...",895.6752,726.7540,581.4032,396.7684,259.2744,jan-12-2022,0,0
4,20210803a,20210803a,220110n_0003.tif,mv,,,Control,0,4,4,...,3.9284,"[{'halfHeight': 10, 'risingPnt': 1100, 'fallin...",813.1788,714.9688,601.0452,381.0548,243.5608,jan-12-2022,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,0,0,...,51.5350,"[{'halfHeight': 10, 'risingPnt': 169, 'falling...",688.6950,627.7900,505.9800,187.4000,182.7150,feb-15-2022,8,2
396,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,1,1,...,18.7400,"[{'halfHeight': 10, 'risingPnt': 346, 'falling...",773.0250,726.1750,580.9400,281.1000,79.6450,feb-15-2022,8,2
397,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,2,2,...,79.6450,"[{'halfHeight': 10, 'risingPnt': 541, 'falling...",754.2850,716.8050,557.5150,248.3050,182.7150,feb-15-2022,8,2
398,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,3,3,...,28.1100,"[{'halfHeight': 10, 'risingPnt': 720, 'falling...",482.5550,454.4450,351.3750,154.6050,84.3300,feb-15-2022,8,2



## Exploring Kernel Density Estimation

Most people perform density estimation all the time without knowing it. The histogram is the easiest way to see how data is distributed. However, a histogram's appearance depends on its bin width: different bins can present the same data in very different ways. Kernel density estimation (KDE) is a technique that estimates the distribution of the data mathematically by placing a smooth kernel on each observation and averaging them. There are different kernels to choose from, such as the Gaussian, tophat, and Epanechnikov.
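To make the idea concrete, here is a minimal sketch of a Gaussian KDE written directly in NumPy. The sample values, the grid, and the helper name `gaussian_kde_on_grid` are all hypothetical and only stand in for one file's peakVal data; this is not the scikit-learn implementation, just the textbook formula.

```python
import numpy as np


def gaussian_kde_on_grid(samples, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernel.sum(axis=1) / (samples.size * bandwidth)


# Hypothetical values standing in for one file's peakVal column
samples = np.array([1.02, 1.05, 1.07, 1.10, 1.11, 1.15, 1.21])
grid = np.linspace(0.8, 1.4, 601)

narrow = gaussian_kde_on_grid(samples, bandwidth=0.01, grid=grid)
wide = gaussian_kde_on_grid(samples, bandwidth=0.05, grid=grid)

# Both curves integrate to roughly 1, but the narrow one is far spikier
step = grid[1] - grid[0]
print(narrow.sum() * step, wide.sum() * step)
print(narrow.max(), wide.max())
```

Evaluating on a regular grid, rather than only at the observed points, gives a smooth curve even when a file has very few observations.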

Our goal is to understand how the data of each file is distributed. Some files have very small sample sizes (I believe n = 4 is the smallest). KDE is a non-parametric method, so it does not restrict us with distributional assumptions, but with so few points the estimate depends heavily on the chosen bandwidth.
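One common starting point for the bandwidth is Silverman's rule of thumb, sketched below. This is an illustration and not the procedure used in this notebook, where the bandwidth is picked by eye; the evenly spaced samples are hypothetical.

```python
import numpy as np


def silverman_bandwidth(samples):
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    std = samples.std(ddof=1)
    q75, q25 = np.percentile(samples, [75, 25])
    iqr = q75 - q25
    # Use the smaller of the two spread estimates to resist outliers
    spread = min(std, iqr / 1.34) if iqr > 0 else std
    return 0.9 * spread * n ** (-0.2)


# Hypothetical samples covering the same range with different sizes
h_small = silverman_bandwidth(np.linspace(1.0, 1.2, 5))
h_large = silverman_bandwidth(np.linspace(1.0, 1.2, 50))
print(h_small, h_large)
```

The rule suggests a wider kernel for the smaller sample, which is one way to see why a single fixed bandwidth may not suit files with very different sample sizes.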

In this notebook, we will use scikit-learn's **KernelDensity** class, which supports different types of kernels.

In [76]:
# Let's look at just one file
subset = dfMaster["peakVal"][dfMaster["file"] == "2.5Hz_ctrl_0012.tif"]
plotdf = pd.DataFrame({"subset": subset}).reset_index(drop=True)

# KDE
subset = subset.to_numpy().reshape(-1, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(subset)
# score_samples returns the log-density, so exponentiate to get the density itself
dens = np.exp(kde.score_samples(subset))
plotdf["dens"] = dens
plotdf = plotdf.sort_values(by=["subset"])

# Plot
figure = go.Figure(
    data=go.Histogram(
        x=plotdf["subset"], histnorm="probability density", name="2.5Hz_ctrl_0012.tif"
    )
)
figure.add_trace(
    go.Scatter(x=plotdf["subset"], y=plotdf["dens"], mode="lines", name="KDE")
)

figure.show()


In [77]:
# Using plotlys distplot
# THIS IS DEPRECATED
import plotly.figure_factory as ff

group_labels = ["distplot"]  # name of the dataset

fig = ff.create_distplot(
    [plotdf["subset"]], group_labels, curve_type="kde", bin_size=0.1
)
# fig.show()


In [78]:
# From meeting on 3/7, just trying different histograms
# subset = dfMaster["peakVal"][dfMaster["file"] == "220110n_0014.tif"]
# figure = px.histogram(subset)
# figure.show()

# subset = dfMaster["peakVal"][dfMaster["file"] == "2.5Hz_ctrl_0012.tif"]
# figure = px.histogram(subset)
# figure.show()

# dfMaster.file.unique()
filenames = dfMaster.file.unique()
# Two columns creates staggering problems with plots, so use a single column
fig = make_subplots(rows=len(filenames), cols=1, subplot_titles=filenames)

for index, file in enumerate(filenames):
    dfFile = dfMaster.loc[dfMaster["file"] == file]
    fig.append_trace(go.Histogram(x=dfFile["peakVal"]), row=index + 1, col=1)

fig.update_layout(
    height=7000,
    width=700,
    showlegend=False,
    colorway=px.colors.qualitative.Vivid,
    title_text="Subplot for Each File's peakVal",
)

fig.show()

# dfFile =dfMaster.loc[dfMaster["file"] == "2.5Hz_ctrl_0012.tif"]
# display(dfFile)



In [79]:
filenames = dfMaster.file.unique()
fig = make_subplots(
    rows=len(filenames), cols=1, subplot_titles=filenames
)  # Two columns creates staggering problems with plots

for index, file in enumerate(dfMaster.file.unique()):
    dfFile = dfMaster.loc[dfMaster["file"] == file]
    subset = dfFile["peakVal"]
    plotdf = pd.DataFrame({"subset": subset}).reset_index(drop=True)

    # KDE
    subset = subset.to_numpy().reshape(-1, 1)

    # Picking different bandwidth values, 0.01-0.025 seems to yield meaningful results
    kde = KernelDensity(kernel="gaussian", bandwidth=0.02).fit(subset)
    # score_samples returns the log-density; exponentiate to get the density
    dens = np.exp(kde.score_samples(subset))
    plotdf["dens"] = dens
    plotdf = plotdf.sort_values(by=["subset"])

    fig.append_trace(
        go.Histogram(
            x=plotdf["subset"],
            histnorm="probability density",
            name=dfFile.condition.unique()[0],
        ),
        row=index + 1,
        col=1,
    )
    fig.append_trace(
        go.Scatter(
            x=plotdf["subset"], y=plotdf["dens"], mode="lines", name="Gaussian KDE"
        ),
        row=index + 1,
        col=1,
    )


fig.update_layout(
    height=7500,
    width=700,
    showlegend=False,
    colorway=px.colors.qualitative.Vivid,
    title_text="Subplot for Each File's peakVal",
)
fig.show()
display(dfFile)

Unnamed: 0,analysisVersion,interfaceVersion,file,detectionType,cellType,sex,condition,sweep,sweepSpikeNumber,spikeNumber,...,diastolicDuration_ms,widths,widths_10,widths_20,widths_50,widths_80,widths_90,myDateStr,fileIdx,myFileIdx
395,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,0,0,...,51.535,"[{'halfHeight': 10, 'risingPnt': 169, 'falling...",688.695,627.79,505.98,187.4,182.715,feb-15-2022,8,2
396,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,1,1,...,18.74,"[{'halfHeight': 10, 'risingPnt': 346, 'falling...",773.025,726.175,580.94,281.1,79.645,feb-15-2022,8,2
397,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,2,2,...,79.645,"[{'halfHeight': 10, 'risingPnt': 541, 'falling...",754.285,716.805,557.515,248.305,182.715,feb-15-2022,8,2
398,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,3,3,...,28.11,"[{'halfHeight': 10, 'risingPnt': 720, 'falling...",482.555,454.445,351.375,154.605,84.33,feb-15-2022,8,2
399,20210803a,20210803a,thapsi 2.5_0047.tif,mv,,,TG,0,4,4,...,28.11,"[{'halfHeight': 10, 'risingPnt': 843, 'falling...",571.57,562.2,487.24,337.32,229.565,feb-15-2022,8,2



In [80]:
# Try with distplot. This has been deprecated in plotly, and the KDE procedure is unclear from the plotly documentation; I assume it's Gaussian
# for index, file in enumerate(dfMaster.file.unique()):
#     dfFile = dfMaster.loc[dfMaster["file"] == file]
#     distplot = ff.create_distplot(
#         [dfFile["peakVal"]],
#         group_labels,
#         curve_type="kde",
#         bin_size=0.03,
#         colors=px.colors.qualitative.Vivid,
#     )
#     distplot.show()


## Distribution Estimation using the Shapiro-Wilk and Kolmogorov-Smirnov Tests

Our KDE gives us an idea of how the data is distributed. However, we are at the mercy of two hyperparameters: the bin width of the histogram and the bandwidth of the kernel density estimate. This poses difficulties in interpretation. To make sure our conclusions are sound, let's use some non-graphical methods to test for normality and uniformity.

In the previous notebook on variance analysis, we used the two-sample Kolmogorov-Smirnov test to compare two distributions. We could also use the one-sample version of this test to check whether our data comes from a normally distributed population. However, while that test can work, the Shapiro-Wilk test is generally a more powerful test for normality. The null hypothesis $H_0$ for the Shapiro-Wilk test is that our data comes from a normally distributed population.

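For testing uniformity, we can use the one-sample Kolmogorov-Smirnov test. Note that this test needs a fully specified reference distribution, including the interval on which the uniform is defined. We could also use a chi-squared goodness-of-fit (GOF) test if that was of interest. There does not appear to be evidence that one performs better than the other.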
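A point worth checking when using `scipy.stats.kstest` with `"uniform"`: without extra arguments the reference distribution is the standard uniform on [0, 1]. The sketch below, on a hypothetical evenly spread sample, shows how `args=(loc, scale)` changes the reference to Uniform(loc, loc + scale). The bounds are assumed to be known in advance; estimating them from the same sample would make the reported p-value unreliable.

```python
import numpy as np
from scipy import stats

# Hypothetical sample spread evenly over [2, 3]; real peakVal values would replace it
sample = np.linspace(2.0, 3.0, 20)

# With no extra arguments, "uniform" means the standard uniform on [0, 1]
against_unit = stats.kstest(sample, "uniform")

# args=(loc, scale) sets the reference to Uniform(loc, loc + scale)
low, high = 2.0, 3.0  # assumed known bounds, not estimated from the sample
against_range = stats.kstest(sample, "uniform", args=(low, high - low))

print(against_unit.pvalue, against_range.pvalue)
```

The same data is rejected outright against [0, 1] and is entirely consistent with a uniform on [2, 3], so the choice of reference interval decides the outcome.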

In [81]:
# Declare empty lists to append values to
swlist = np.array([])
kslist = np.array([])
nlist = np.array([])


for index, file in enumerate(dfMaster.file.unique()):
    subset = dfMaster["peakVal"][dfMaster["file"] == file]
    plotdf = pd.DataFrame({"subset": subset}).reset_index(drop=True)
    n = len(plotdf.index)
    sw = sp.shapiro(plotdf["subset"])
    # NOTE: with no args, "uniform" is the standard uniform on [0, 1]
    ks = sp.kstest(plotdf["subset"], "uniform")
    nlist = np.append(nlist, n)
    swlist = np.append(swlist, sw.pvalue)
    kslist = np.append(kslist, ks.pvalue)


testdf = pd.DataFrame(
    {
        "file": dfMaster.file.unique(),
        "Sample Size": nlist.astype(int),
        "Shap. Wilk": swlist,
        "Kol. Smir.": kslist,
    }
)
display(testdf)

Unnamed: 0,file,Sample Size,Shap. Wilk,Kol. Smir.
0,220110n_0003.tif,10,0.428726,0.0
1,220110n_0005.tif,6,0.728396,0.0
2,220110n_0009.tif,5,0.872466,0.0
3,220110n_0010.tif,6,0.965881,0.0
4,220110n_0014.tif,11,0.568973,0.0
5,220110n_0017.tif,12,0.620454,0.0
6,220110n_0020.tif,10,0.78117,0.0
7,220110n_0021.tif,10,0.282813,0.0
8,220110n_0022.tif,10,0.288951,0.0
9,220110n_0023.tif,10,0.786499,0.0



Many of these samples fail to reject $H_0$, so we have no evidence against normality for them (with samples this small the test has limited power, so this is not proof of normality). Let's look at the samples for which we do reject the null (here we use $\alpha = 0.05$).
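To get a feel for how much sample size matters here, the following simulation sketch draws clearly non-normal (exponential) samples and counts how often Shapiro-Wilk rejects at $\alpha = 0.05$. The distribution, sample sizes, trial count, and seed are all assumptions chosen for illustration; none of this uses the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def rejection_rate(sample_size, n_trials=500, alpha=0.05):
    """Fraction of exponential (non-normal) samples that Shapiro-Wilk rejects."""
    rejections = 0
    for _ in range(n_trials):
        sample = rng.exponential(scale=1.0, size=sample_size)
        if stats.shapiro(sample).pvalue < alpha:
            rejections += 1
    return rejections / n_trials


rates = {n: rejection_rate(n) for n in (5, 10, 50)}
for n, rate in rates.items():
    print(n, rate)
```

If the rejection rate at n = 5 or n = 10 is low even for data this skewed, a non-rejection for one of our small files should be read as "no evidence either way" rather than as confirmation of normality.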

In [82]:
nonorm = testdf[testdf["Shap. Wilk"] < 0.05]
display(nonorm)

# For alpha = 0.1
display(testdf[testdf["Shap. Wilk"] < 0.1])

Unnamed: 0,file,Sample Size,Shap. Wilk,Kol. Smir.
20,220110n_0060.tif,7,0.00153,0.0
34,2.5Hz_ctrl_0017.tif,9,1.6e-05,0.0


Unnamed: 0,file,Sample Size,Shap. Wilk,Kol. Smir.
19,220110n_0055.tif,11,0.056977,0.0
20,220110n_0060.tif,7,0.00153,0.0
34,2.5Hz_ctrl_0017.tif,9,1.6e-05,0.0
37,cell.tif,8,0.086283,0.0
45,thapsi 2.5_0047.tif,5,0.058264,0.0



The two files listed first are the only files for which we reject $H_0$ at $\alpha = 0.05$. Their p-values are very small (less than 0.01), which is strong evidence that these files are not normally distributed. The Kolmogorov-Smirnov test for uniformity rejects the null hypothesis for every file (all p-values are 0.0). Since `kstest` was called with "uniform" and no further arguments, the reference is the standard uniform on [0, 1], so this tells us that none of our samples are uniform on [0, 1]; it does not by itself rule out a uniform distribution over some other range.

One last thing to note is that at the higher significance level of $\alpha = 0.1$, we reject $H_0$ for three additional files. This should be considered in how the distributions of these files are treated in any future studies.
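Because every file is tested separately, some rejections are expected by chance alone. The arithmetic below is a rough sketch of that effect with a Bonferroni adjustment; the count of 46 files is inferred from the row labels 0 to 45 in the outputs above and is an assumption, and Bonferroni is only one of several possible corrections.

```python
# Number of files tested; 46 is inferred from the 0-45 row labels in the outputs above
n_tests = 46
alpha = 0.05

# With every null true, this many rejections are expected by chance alone
expected_false_rejections = alpha * n_tests

# Bonferroni: compare each p-value against alpha / n_tests instead of alpha
bonferroni_threshold = alpha / n_tests

# The two smallest Shapiro-Wilk p-values reported above
p_values = {"220110n_0060.tif": 0.00153, "2.5Hz_ctrl_0017.tif": 1.6e-05}
still_rejected = [name for name, p in p_values.items() if p < bonferroni_threshold]

print(expected_false_rejections, bonferroni_threshold, still_rejected)
```

Under this stricter threshold only the smaller of the two p-values would still count as a rejection.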

### How does Control compare to Thapsigargin?

In [83]:
# Cleaning
condList = dfMaster["condition"].unique()
print(f"before condList:{condList}")
# do this
dfMaster.loc[dfMaster["condition"] == "TG", "condition"] = "Thapsigargin"
condList = dfMaster["condition"].unique()
print(f"after condList:{condList}")

before condList:['Control' 'Thapsigargin' 'TG']
after condList:['Control' 'Thapsigargin']



In [84]:
temp = dfMaster[["file", "condition"]]
temp = temp.drop_duplicates()
temp = temp.reset_index()
# temp


In [85]:
# Merge the two
testdf = temp.merge(testdf)
testdf

Unnamed: 0,index,file,condition,Sample Size,Shap. Wilk,Kol. Smir.
0,0,220110n_0003.tif,Control,10,0.428726,0.0
1,10,220110n_0005.tif,Control,6,0.728396,0.0
2,16,220110n_0009.tif,Control,5,0.872466,0.0
3,21,220110n_0010.tif,Control,6,0.965881,0.0
4,27,220110n_0014.tif,Control,11,0.568973,0.0
5,38,220110n_0017.tif,Control,12,0.620454,0.0
6,50,220110n_0020.tif,Control,10,0.78117,0.0
7,60,220110n_0021.tif,Control,10,0.282813,0.0
8,70,220110n_0022.tif,Control,10,0.288951,0.0
9,80,220110n_0023.tif,Control,10,0.786499,0.0



In [86]:
# Lets see more
# pd.set_option('display.max_rows', 10)
# testdf.head(25)

# The groups that are not normal at alpha = 0.05
display(testdf[testdf["Shap. Wilk"] < 0.05])
# and at 0.1
display(testdf[testdf["Shap. Wilk"] < 0.1])

Unnamed: 0,index,file,condition,Sample Size,Shap. Wilk,Kol. Smir.
20,175,220110n_0060.tif,Thapsigargin,7,0.00153,0.0
34,301,2.5Hz_ctrl_0017.tif,Control,9,1.6e-05,0.0


Unnamed: 0,index,file,condition,Sample Size,Shap. Wilk,Kol. Smir.
19,164,220110n_0055.tif,Thapsigargin,11,0.056977,0.0
20,175,220110n_0060.tif,Thapsigargin,7,0.00153,0.0
34,301,2.5Hz_ctrl_0017.tif,Control,9,1.6e-05,0.0
37,330,cell.tif,Control,8,0.086283,0.0
45,395,thapsi 2.5_0047.tif,Thapsigargin,5,0.058264,0.0



It appears that, apart from two files from the Control group and three from the Thapsigargin group (at $\alpha = 0.1$), we do not reject normality for any of the other files, regardless of condition. If curious, we could gauge how consistent each group is with normality by assessing which group has the higher proportion of large p-values.
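The comparison of p-value proportions could be sketched as follows. The rows are a handful of values copied from the merged table above, so the result is not representative of the full data; the cutoff of 0.5 for a "large" p-value is a hypothetical choice.

```python
import pandas as pd

# A few rows copied from the merged table above; the full testdf would be used in practice
sample_tests = pd.DataFrame(
    {
        "condition": [
            "Control",
            "Control",
            "Control",
            "Control",
            "Control",
            "Thapsigargin",
            "Thapsigargin",
            "Thapsigargin",
        ],
        "Shap. Wilk": [
            0.428726,
            0.728396,
            0.872466,
            1.6e-05,
            0.086283,
            0.00153,
            0.056977,
            0.058264,
        ],
    }
)

cutoff = 0.5  # hypothetical definition of a "large" p-value
share_large = (
    sample_tests.assign(large=sample_tests["Shap. Wilk"] > cutoff)
    .groupby("condition")["large"]
    .mean()
)
print(share_large)
```

Swapping `sample_tests` for the full `testdf` would give the per-condition proportions for all files.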

## Measuring the Contributions of Variables to the Variance of the Data

https://www.sportsci.org/resource/stats/correl.html

Next: Build a correlation heat map and calculate $R^2$ to measure how much of the variance in Y is explained by X
https://plotly.com/python/heatmaps/

In [87]:
# using pandas
# dfMaster.corr

display(dfMaster.dtypes)
# I am going to need to clean up this data..
dfMaster.count()

analysisVersion                object
interfaceVersion               object
file                           object
detectionType                  object
cellType                      float64
sex                           float64
condition                      object
sweep                           int64
sweepSpikeNumber                int64
spikeNumber                     int64
include                          bool
userType                        int64
errors                         object
dvdtThreshold                 float64
mvThreshold                   float64
medianFilter                    int64
halfHeights                    object
thresholdPnt                    int64
thresholdSec                  float64
thresholdVal                  float64
thresholdVal_dvdt             float64
dacCommand                    float64
peakPnt                         int64
peakSec                       float64
peakVal                       float64
peakHeight                    float64
timeToPeak_m

analysisVersion               400
interfaceVersion              400
file                          400
detectionType                 400
cellType                        0
sex                             0
condition                     400
sweep                         400
sweepSpikeNumber              400
spikeNumber                   400
include                       400
userType                      400
errors                        400
dvdtThreshold                 244
mvThreshold                   400
medianFilter                  400
halfHeights                   400
thresholdPnt                  400
thresholdSec                  400
thresholdVal                  400
thresholdVal_dvdt             400
dacCommand                    400
peakPnt                       400
peakSec                       400
peakVal                       400
peakHeight                    400
timeToPeak_ms                 400
preMinPnt                      44
preMinVal                      44
preLinearFitPn


In [88]:
# Sex, CellType, and lateDiastolicDuration are all NA, so lets drop them.
dfMaster = dfMaster.drop(["cellType", "sex", "lateDiastolicDuration"], axis=1)


In [89]:
# Lets explore some variables, what do some of the object columns look like?
display(dfMaster["halfHeights"])
display(dfMaster["widths"])
display(dfMaster["myDateStr"])

0      [10, 20, 50, 80, 90]
1      [10, 20, 50, 80, 90]
2      [10, 20, 50, 80, 90]
3      [10, 20, 50, 80, 90]
4      [10, 20, 50, 80, 90]
               ...         
395    [10, 20, 50, 80, 90]
396    [10, 20, 50, 80, 90]
397    [10, 20, 50, 80, 90]
398    [10, 20, 50, 80, 90]
399    [10, 20, 50, 80, 90]
Name: halfHeights, Length: 400, dtype: object

0      [{'halfHeight': 10, 'risingPnt': 96, 'fallingP...
1      [{'halfHeight': 10, 'risingPnt': 347, 'falling...
2      [{'halfHeight': 10, 'risingPnt': 598, 'falling...
3      [{'halfHeight': 10, 'risingPnt': 849, 'falling...
4      [{'halfHeight': 10, 'risingPnt': 1100, 'fallin...
                             ...                        
395    [{'halfHeight': 10, 'risingPnt': 169, 'falling...
396    [{'halfHeight': 10, 'risingPnt': 346, 'falling...
397    [{'halfHeight': 10, 'risingPnt': 541, 'falling...
398    [{'halfHeight': 10, 'risingPnt': 720, 'falling...
399    [{'halfHeight': 10, 'risingPnt': 843, 'falling...
Name: widths, Length: 400, dtype: object

0      jan-12-2022
1      jan-12-2022
2      jan-12-2022
3      jan-12-2022
4      jan-12-2022
          ...     
395    feb-15-2022
396    feb-15-2022
397    feb-15-2022
398    feb-15-2022
399    feb-15-2022
Name: myDateStr, Length: 400, dtype: object


In [90]:
# Let's drop these columns, some more object columns, and columns with mostly NaNs ('condition' is not converted to a dummy variable here)
dfRed = dfMaster.drop(
    [
        "analysisVersion",
        "interfaceVersion",
        "detectionType",
        "errors",
        "halfHeights",
        "widths",
        "myDateStr",
        "preMinPnt",
        "preMinVal",
        "cycleLength_pnts",
        "cycleLength_ms",
    ],
    axis=1,
)
display(dfRed)

Unnamed: 0,file,condition,sweep,sweepSpikeNumber,spikeNumber,include,userType,dvdtThreshold,mvThreshold,medianFilter,...,isi_ms,spikeFreq_hz,diastolicDuration_ms,widths_10,widths_20,widths_50,widths_80,widths_90,fileIdx,myFileIdx
0,220110n_0003.tif,Control,0,0,0,True,0,0.0,1.41,0,...,,,23.5704,769.9664,742.4676,557.8328,396.7684,302.4868,0,0
1,220110n_0003.tif,Control,0,1,1,True,0,0.0,1.41,0,...,986.0284,1.014170,3.9284,785.6800,754.2528,589.2600,428.1956,345.6992,0,0
2,220110n_0003.tif,Control,0,2,2,True,0,0.0,1.41,0,...,982.1000,1.018226,3.9284,809.2504,754.2528,581.4032,408.5536,282.8448,0,0
3,220110n_0003.tif,Control,0,3,3,True,0,0.0,1.41,0,...,989.9568,1.010145,19.6420,895.6752,726.7540,581.4032,396.7684,259.2744,0,0
4,220110n_0003.tif,Control,0,4,4,True,0,0.0,1.41,0,...,974.2432,1.026438,3.9284,813.1788,714.9688,601.0452,381.0548,243.5608,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,thapsi 2.5_0047.tif,Thapsigargin,0,0,0,True,0,,1.08,0,...,,,51.5350,688.6950,627.7900,505.9800,187.4000,182.7150,8,2
396,thapsi 2.5_0047.tif,Thapsigargin,0,1,1,True,0,,1.08,0,...,805.8200,1.240972,18.7400,773.0250,726.1750,580.9400,281.1000,79.6450,8,2
397,thapsi 2.5_0047.tif,Thapsigargin,0,2,2,True,0,,1.08,0,...,932.3150,1.072599,79.6450,754.2850,716.8050,557.5150,248.3050,182.7150,8,2
398,thapsi 2.5_0047.tif,Thapsigargin,0,3,3,True,0,,1.08,0,...,843.3000,1.185818,28.1100,482.5550,454.4450,351.3750,154.6050,84.3300,8,2



In [91]:
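# Move the identifier columns into the index so only measurement columns remain.
# set_index returns a new frame, so the result has to be assigned to take effect.
dfNum = dfRed.set_index(["file", "condition"])

# Spot check of a single pair (not displayed, since it is not the last expression)
np.corrcoef(dfNum["earlyDiastolicDuration_ms"], dfNum["peakVal"])

# Pearson correlation of every numeric column with peakVal
# (numeric_only is available in pandas >= 1.5)
peakValCor = dfNum.corr(numeric_only=True)["peakVal"]
peakValCor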

sweep                              NaN
sweepSpikeNumber             -0.043569
spikeNumber                  -0.043569
include                            NaN
userType                           NaN
dvdtThreshold                      NaN
mvThreshold                   0.838027
medianFilter                       NaN
thresholdPnt                  0.185874
thresholdSec                  0.173549
thresholdVal                  0.242629
thresholdVal_dvdt             0.167908
dacCommand                         NaN
peakPnt                       0.168305
peakSec                       0.155019
peakVal                       1.000000
peakHeight                    0.986120
timeToPeak_ms                -0.355022
preLinearFitPnt0              0.190657
preLinearFitPnt1              0.188468
earlyDiastolicDuration_ms    -0.385979
preLinearFitVal0              0.305526
preLinearFitVal1              0.284234
earlyDiastolicDurationRate    0.378514
preSpike_dvdt_max_pnt         0.174456
preSpike_dvdt_max_val    


In [92]:
# Let's see the R^2 value now, how much of the variance of peakVal is explained by other variables?
peakValCor ** 2

sweep                              NaN
sweepSpikeNumber              0.001898
spikeNumber                   0.001898
include                            NaN
userType                           NaN
dvdtThreshold                      NaN
mvThreshold                   0.702289
medianFilter                       NaN
thresholdPnt                  0.034549
thresholdSec                  0.030119
thresholdVal                  0.058869
thresholdVal_dvdt             0.028193
dacCommand                         NaN
peakPnt                       0.028327
peakSec                       0.024031
peakVal                       1.000000
peakHeight                    0.972433
timeToPeak_ms                 0.126041
preLinearFitPnt0              0.036350
preLinearFitPnt1              0.035520
earlyDiastolicDuration_ms     0.148980
preLinearFitVal0              0.093346
preLinearFitVal1              0.080789
earlyDiastolicDurationRate    0.143273
preSpike_dvdt_max_pnt         0.030435
preSpike_dvdt_max_val    


From the $R^2$ values collected above, we have a measure of variance explained. That is, we answer the question, "How much of the variance in peakVal is explained by the variance in another variable?" We can see that the variables "postSpike_dvdt_min_val" and "peakHeight" have extremely high $R^2$ values. Another highly correlated variable is "mvThreshold", with an $R^2$ of 0.702. While the meanings of these variables are beyond my direct understanding, they are important to point out for readers who do understand them. For example, for mvThreshold we can read the result as follows: about 70% of the variance in peakVal is explained by the variance of the threshold.
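The "variance explained" reading of a squared correlation can be checked numerically. In the sketch below, the squared Pearson correlation matches $1 - SS_{res}/SS_{tot}$ from a least-squares line fit. The six paired values are hypothetical and merely stand in for mvThreshold and peakVal.

```python
import numpy as np

# Hypothetical paired measurements standing in for mvThreshold (x) and peakVal (y)
x = np.array([1.08, 1.15, 1.22, 1.30, 1.41, 1.47])
y = np.array([1.90, 2.10, 2.05, 2.40, 2.65, 2.60])

r = np.corrcoef(x, y)[0, 1]

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

ss_res = np.sum(residuals**2)  # variation left over after the fit
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
explained = 1.0 - ss_res / ss_tot

print(r**2, explained)  # the two numbers agree
```

This equality holds for a straight-line fit with an intercept, which is why squaring the entries of `peakValCor` gives one $R^2$ per variable, each considered on its own.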

Some of the lowest contributors to the variance of peakVal are sweepSpikeNumber and spikeNumber. We can also see that thresholdVal_dvdt, peakPnt, peakSec, thresholdPnt, and thresholdSec explain a small proportion of the variance.

## Are Mean and Variance of 'peakVal' meaningful?

In [93]:
# doubled filenames https://www.pythonforbeginners.com/lists/repeat-each-element-in-a-list-in-python
doublefile = [file for file in filenames for _ in range(2)]

# One row per file, two columns: peakVal on the left, peakHeight on the right
fig2 = make_subplots(rows=len(filenames), cols=2, subplot_titles=doublefile)

for index, file in enumerate(dfMaster.file.unique()):
    dfFile = dfMaster.loc[dfMaster["file"] == file]

    # Both columns plot against thresholdSec: peakVal in column 1, peakHeight in column 2
    fig2.append_trace(
        go.Scatter(
            x=dfFile["thresholdSec"],
            y=dfFile["peakVal"],
            name=dfFile.condition.unique()[0],
        ),
        row=index + 1,
        col=1,
    )
    fig2.append_trace(
        go.Scatter(
            x=dfFile["thresholdSec"],
            y=dfFile["peakHeight"],
            name=dfFile.condition.unique()[0],
        ),
        row=index + 1,
        col=2,
    )
    fig2.update_xaxes(title_text="thresholdSec", row=index + 1, col=1)
    fig2.update_xaxes(title_text="thresholdSec", row=index + 1, col=2)
    fig2.update_yaxes(title_text="peakVal", row=index + 1, col=1)
    fig2.update_yaxes(title_text="peakHeight", row=index + 1, col=2)

fig2.update_layout(
    height=7500,
    width=1000,
    showlegend=False,
    colorway=px.colors.qualitative.Vivid,
    title_text="Subplot for Each File's peakVal and peakHeight",
)

fig2.show()


## References

Minimum Sample Size for KDE:
https://stats.stackexchange.com/questions/76948/what-is-the-minimum-number-of-data-points-required-for-kernel-density-estimation

General Intro to sklearn's density estimation:
https://scikit-learn.org/stable/modules/density.html

KernelDensity():
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity

Worked through Example:
https://stackabuse.com/kernel-density-estimation-in-python-using-scikit-learn/

Density Curve Plotting for Plotly:
https://stackoverflow.com/questions/63865209/plotly-how-to-show-both-a-normal-distribution-and-a-kernel-density-estimation-i

scipy.stats.shapiro:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

scipy.stats.kstest:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html

Testing for Normality with Small N:
https://stats.stackexchange.com/questions/13983/is-it-meaningful-to-test-for-normality-with-a-very-small-sample-size-e-g-n

Finding Correlation of One Variable with All Others:
https://datascience.stackexchange.com/questions/39137/how-can-i-check-the-correlation-between-features-and-target-variable

Fano Factor:
http://math.bu.edu/people/mak/Eden_Kramer_J_Comp_Neuro_2010.pdf