# Abstract

In this article we review the paper "Statistical analysis of numerical preclinical radiobiological data". The work is submitted as a term project for the graduate-level course on Statistical Modelling and Practices at the University of California, Berkeley. The authors of this review are graduate students from the departments of EECS and Civil & Environmental Engineering and have restricted their attention to the methods and analysis in the paper. The review attempts to reproduce the tests and results presented in the paper, and discusses some additional non-parametric tests, e.g. permutation tests, that avoid certain modelling assumptions while still revealing surprises in the data. No attempt has been made to examine the biological aspects of the paper or the validity of the assumptions related to them.

# Introduction

The paper begins by voicing a growing concern about "Scientific fraud and Plagiarism" in the scientific community and succeeds in presenting a strong message. To add more.....

Before proceeding further, we would like to comment that the organization of the paper could have been much better with the use of sections and subsections, and with some rearrangement of the material.

However, the focus of this review is not on readability and organizational structure; we pay more attention to reproducing the results and discussing further ways of identifying anomalies. In particular, in a separate section (Our Analysis) we perform tests that fit the spirit of the paper: promoting simple statistical tools for detecting anomalies.

## Problem Set Up

The authors of the paper analyze anomalous patterns in radiobiological data from a lab. In particular, they detect suspicious patterns in the data reported by one of the 10 researchers (whom we shall refer to as RTS, following their notation). They run three different tests to support their suspicion, and they validate their tests and assumptions on data obtained from three other sources.

We now describe the data in more detail. Each researcher had to report three measurements (a triplet) for each of two types of numbers: colony counts and Coulter counts. Each number represents an observation of the number of cells surviving some experiment, and the three measurements are presumably taken to make the observation more accurate. The authors' concern is that it is easy to fabricate a triplet with a desired mean: fix the mean, choose two roughly equal constants, and compute the other two values as the mean plus or minus those constants. Such fabrication can be flagged by looking at the triplets and counting how many of them contain their own mean. Having made these observations, the authors mainly focus "on developing a method to calculate bounds and estimates for the probability that a given set of n such triplicates contains k or more triples which contain their own mean" and mention that such probability bounds should be useful in various other areas. Under these models they show that RTS's data is very surprising: the chance of seeing such data is astronomically low.

Besides this specific setup (which requires some assumptions), they also apply two more general tests that have been used in the past to detect anomalous patterns: (1) the distribution of the last digit when it is insignificant, and (2) the frequency of equal pairs of terminal digits when they are not significant. Ideally, for (1), we expect a uniform distribution over $\{0, 1, \ldots, 9\}$ unless the distribution underlying the data suggests otherwise. Similarly, for (2), the ideal chance is 1 in 10.
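The two digit-based checks described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names `last_digit_test` and `equal_pair_test` are hypothetical, the chi-square and binomial tests are our choice of test statistic, and restricting the pair test to counts with at least two digits is an assumption.

```python
import numpy as np
from scipy.stats import binom, chisquare


def last_digit_test(counts):
    """Chi-square test of the last digits against a uniform law on 0..9."""
    digits = np.asarray(counts, dtype=int) % 10
    observed = np.bincount(digits, minlength=10)
    # chisquare() defaults to equal expected frequencies in every category.
    stat, p_value = chisquare(observed)
    return observed, stat, p_value


def equal_pair_test(counts):
    """Upper-tail binomial test for equal last-two-digit pairs (chance 1 in 10)."""
    values = np.asarray(counts, dtype=int)
    values = values[values >= 10]  # assumption: need at least two digits
    last = values % 10
    second_last = (values // 10) % 10
    n_equal = int(np.sum(last == second_last))
    # P(X >= n_equal) for X ~ Binomial(n, 0.1)
    p_value = float(binom.sf(n_equal - 1, len(values), 0.1))
    return n_equal, len(values), p_value
```

A small p-value from either function signals digits that are less uniform, or equal pairs that are more frequent, than the ideal model predicts; whether the ideal model applies to a given dataset is a separate question.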

Next we discuss three major concerns and justify why they were important to us:

- The paper begins with the RTS researcher already labelled as anomalous, and a probability model is then developed to determine the chances of seeing the mean in a triplet. The authors mention briefly that "Having observed what appeared to us to be an unusual frequency of triples in RTS data containing a value close to their mean, we used R to calculate the mid-ratios for all of the colony data triples that were available to us". The authors do not explain how they identified this particular researcher. Whether they partitioned the data into an exploration set and a validation set is unclear, and the tables hint otherwise. Standard practice is to split the data into training and test sets. Our concern rests on a well-known principle in statistics: if the data that raised a suspicion is also used to validate it, the result will most likely be heavily biased.

- The authors also ran tests for the last digit and for equality of the pair of terminal digits on the datasets, which can be seen as a validation of their suspicion. However, all the results are presented in the form "RTS vs The Rest". It would have been more convincing if the authors had presented some justification or experimental results supporting such a treatment. Ideally the results would have been presented in a "take-one-out" fashion, where each individual is compared to all the others pooled together. This is the core principle behind two-sample permutation tests, where we test the strong null hypothesis that each researcher's data is just a random sample from the population of all the data put together. We discuss this in detail in the "Our Analysis" section.

- There was no discussion of the number of data points per researcher. For some reason, RTS, the researcher in question, contributed more than twice as much data as the twelve other researchers put together. Such an overwhelming fraction of samples belonging to one researcher has implications that we explore in the "Our Analysis" section.
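The "take-one-out" comparison raised in the second concern can be sketched as a two-sample permutation test applied to each researcher in turn. This is an illustrative sketch only: the function names, the two-sided difference-of-statistics criterion, and the add-one p-value estimate are our assumptions, and `statistic` stands for any summary of interest (for example, the fraction of mid-ratios in a given range).

```python
import numpy as np


def permutation_p_value(group, rest, statistic, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for statistic(group) - statistic(rest)."""
    rng = np.random.default_rng(seed)
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    pooled = np.concatenate([group, rest])
    observed = statistic(group) - statistic(rest)
    count = 0
    for _ in range(n_perm):
        # Under the strong null, group labels are exchangeable.
        shuffled = rng.permutation(pooled)
        diff = statistic(shuffled[:len(group)]) - statistic(shuffled[len(group):])
        count += abs(diff) >= abs(observed)
    # Add-one estimate so the p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)


def leave_one_out(data_by_researcher, statistic, **kwargs):
    """Compare every researcher against all the others pooled together."""
    results = {}
    for name, values in data_by_researcher.items():
        rest = np.concatenate(
            [v for other, v in data_by_researcher.items() if other != name]
        )
        results[name] = permutation_p_value(values, rest, statistic, **kwargs)
    return results
```

Because one researcher contributes most of the pooled data, the "rest" group for every other researcher is dominated by that researcher's values, which is one reason the sample-size imbalance noted in the third concern matters for this kind of test.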


In the next section, we touch upon the reproducibility of results. In the "Our Analysis" section, we discuss our tests and their implications, and we then make some final remarks in the Conclusion section.



# Reproducibility of Results
 
In this section, we replicate the statistical experiments conducted by the authors. Our first implementation showed several mismatches because some steps in the paper leave room for interpretation. However, with some trial and error and fine tuning we were able to replicate most of their results. All our results and code are available at [https://github.com/ianno/stat215a_project1]


## Mid-Ratio Analysis

The authors first consider the histogram of the mid-ratio, which is defined for a triplet $(a, b, c)$ with $a<b<c$ as $\frac{b-a}{c-a}$, and show that the histogram for RTS concentrates abnormally in the $0.4$-$0.6$ range compared to everyone else put together. We tried to reproduce the histogram in Python using NumPy's histogram function and it looked very different. We then tweaked the histogram so that each bin includes its right edge, after which it looked very similar to Figure 1 of the paper. Some differences remained: for instance, the authors obtain a mid-ratio in 0.4-0.5 close to 50% of the time, while we get close to 44%. Also, we used 1361 values to compute the histogram after removing the triplets with missing values (in fact 1360, because one triplet had all values equal), while the authors used 1343/1361 and provided no justification for this choice. Similarly, we had 595 triplets for the histogram of the rest of the researchers (of the same lab). Nevertheless, after the bin adjustment our plots are qualitatively very similar to theirs, and we judged the remaining differences too minor to justify more time.
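The bin-edge issue is easy to reproduce. The sketch below computes mid-ratios and bins them with right-closed intervals; it is a minimal illustration under our own assumptions (the function names are hypothetical, triples with missing values or with all values equal are dropped, and ten equal-width bins on $[0, 1]$ are used), not the code from the repository.

```python
import numpy as np


def mid_ratios(triples):
    """Mid-ratio (b - a) / (c - a) of each sorted triple a <= b <= c."""
    t = np.asarray(triples, dtype=float)
    t = t[~np.isnan(t).any(axis=1)]  # drop triples with missing values
    t = np.sort(t, axis=1)
    a, b, c = t[:, 0], t[:, 1], t[:, 2]
    keep = c > a  # the ratio is undefined when all three values are equal
    return (b[keep] - a[keep]) / (c[keep] - a[keep])


def right_closed_histogram(values, n_bins=10):
    """Relative frequencies over bins (lo, hi]; a value of 0 goes to the first bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(edges, values, side="left") - 1
    idx = np.clip(idx, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    return counts / counts.sum(), edges
```

`np.histogram` uses half-open bins `[lo, hi)` (only the last bin is closed), so a mid-ratio of exactly 0.5 lands in the 0.5-0.6 bin there but in the 0.4-0.5 bin here. Since exact values of 0.5 are common for integer triples, this choice visibly changes the shape of the histogram.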



## Probability Model

In this section, we followed the equations provided by the authors in Appendix A to calculate the probability-lambda table. We could replicate Table 1 of the paper exactly. However, when we increased lambda to fairly large values (>1000), the result was always 0, even when we worked with logarithms to avoid numerical overflow. Thus, we could not verify the values for lambda from around 1000 to 2500 that are reported in the paper. We did observe that the probability starts to decrease after lambda reaches 4 and, after scanning the whole range of probability values, that 0.42, as stated by the authors, is a maximum threshold value.
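For small lambda, the probability that an i.i.d. Poisson triple contains its own rounded mean can also be checked by brute-force enumeration, independently of the Appendix A equations. The sketch below is such a cross-check under stated assumptions: `mid_prob` is a hypothetical name, the Poisson support is truncated at roughly ten standard deviations above lambda, and the enumeration is cubic in the support size, so it is only practical for modest lambda. Note that the mean of three integers is never exactly halfway between two integers, so the rounded mean is unambiguous and equals `(a + b + c + 1) // 3`.

```python
import numpy as np
from scipy.stats import poisson


def mid_prob(lam, tail_sd=10):
    """P(an i.i.d. Poisson(lam) triple contains its own rounded mean)."""
    # Truncate the support well into the upper tail (assumption).
    upper = int(lam + tail_sd * np.sqrt(lam) + 10)
    k = np.arange(upper + 1)
    pmf = poisson.pmf(k, lam)
    a, b, c = np.meshgrid(k, k, k, indexing="ij", sparse=True)
    # Integer arithmetic avoids floating-point rounding of the mean.
    rounded_mean = (a + b + c + 1) // 3
    contains = (a == rounded_mean) | (b == rounded_mean) | (c == rounded_mean)
    weights = pmf[:, None, None] * pmf[None, :, None] * pmf[None, None, :]
    return float((weights * contains).sum())
```

Calling `mid_prob` over a grid of small lambda values gives a table that can be compared against the one computed from the paper's equations. For the large lambda values of the Coulter counts this enumeration is not feasible, so it does not resolve the difficulty described above.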

However, we also have some reservations about the Poisson assumption for the microorganism counts. The samples were taken by the researchers, and if they counted the organisms in each sample manually, each experiment involves a good deal of subjectivity and bias. For example, the location of the field of view used for counting matters: if the researcher happens to choose a nutrient-rich location, the count will be much higher than at a relatively nutrient-poor location. In addition, certain microorganisms tend to cluster together, so without high-accuracy equipment one may mistakenly count several cells as one, or otherwise miscount based on subjective judgment. All of these situations bias the final cell count. Given that uncertainty, it may not be accurate to treat the triplet counts as Poisson triplets, as there might be dependence between the samples.

"
In this section, we tried to replicate what they did in the paper. We observed a slight difference between our final probabilities and their results: in our calculations the turning point of lambda is around 11, meaning that the probability keeps increasing up to its maximum at lambda equal to 11 and then decreases as lambda increases. However, we got a similar threshold value for the probability: the paper reports 0.42, and we got 0.433. As the detailed calculations are not included in the paper, we carried out the calculations based on their assumptions, namely that a triple is generated by independent, identically distributed Poisson variables with known parameter lambda, and computed the probability that such a triple includes its own (rounded) mean. We applied their i.i.d. assumption to the microorganism counts and obtained our results. However, this assumption may not hold in all situations and thus needs further consideration.
"



## Digits Analysis




## Our Analysis


As the data is acquired by different researchers in the biological lab, many biases could be introduced, such as the researcher's skill in taking samples, how practiced they are at this specific task, different growth conditions for the microorganisms, and the accuracy of their instruments. If such biases are present in the data, assumptions such as i.i.d. variables would not be valid, and the corresponding calculations would not be accurate enough. On this basis, the authors' conclusions about the difference between the RTS data and the data from outside-lab researchers should probably be put on hold.

A detailed calculation is in the Appendix. 

## Hypothesis testing II - using lambda to obtain p-values

In this section of the paper, the authors use the lambda values to calculate the corresponding p-values. They apply a heuristic method to estimate the actual probability that a given collection of n triples includes k mean-containing triples, and they apply it to legitimate experimental data to confirm the validity of their model, which is the Poisson model. As the true lambda values of the Poisson variables that generated the triples in the datasets are unknown, they use the lambda-MidProb table to estimate them, based on the fact that the mean of any actual triple is a reasonable estimate of the lambda parameter of its variables. Each triple then contributes a Bernoulli indicator (whether it contains its own mean) with its own success probability p, so the number of mean-containing triples follows a Poisson binomial distribution. For a Poisson binomial variable, the mean is the sum of all p values and the variance is the sum of all p*(1-p) values (so the standard deviation is the square root of that sum), which can be used for the corresponding hypothesis tests. They use this idea to test the RTS collection of 1343 colony triples and conclude that the probability is extremely small, in contrast to the results of the same test for other investigators.
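The Poisson binomial quantities mentioned above are simple to compute once a success probability has been assigned to each triple. The following is a minimal sketch, not the paper's procedure: the function names are hypothetical, the per-triple probabilities `p` are assumed to be given (for example, looked up from the lambda-MidProb table), and the exact tail is obtained by repeated convolution, which is quadratic in the number of triples and loses relative accuracy for extremely small tail probabilities.

```python
import numpy as np


def poisson_binomial_moments(p):
    """Mean and standard deviation of a sum of independent Bernoulli(p_i)."""
    p = np.asarray(p, dtype=float)
    mean = p.sum()
    sd = np.sqrt((p * (1.0 - p)).sum())  # variance is the sum of p_i * (1 - p_i)
    return float(mean), float(sd)


def poisson_binomial_upper_tail(p, k):
    """Exact P(X >= k) for X a sum of independent Bernoulli(p_i)."""
    pmf = np.array([1.0])  # distribution of an empty sum: P(X = 0) = 1
    for p_i in p:
        # Adding one Bernoulli trial convolves the pmf with [1 - p_i, p_i].
        pmf = np.convolve(pmf, [1.0 - p_i, p_i])
    return float(pmf[k:].sum())
```

For example, with two triples that each have probability 0.5, `poisson_binomial_upper_tail([0.5, 0.5], 1)` is the probability that at least one of them contains its mean.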

Several points should be considered when conducting these statistical tests. We need to check the underlying assumptions, such as the Poisson Bernoulli variables. One thing to note is that the data reflects underlying microbial phenomena, so when we treat the values as mere numbers we may lose many intrinsic characteristics of the data. By the same logic, the assumption that the triplets follow a Poisson distribution is also open to question: the samples may be taken at different growth stages of the organisms, introducing biases that invalidate the Poisson assumption, which in turn undermines the Poisson binomial assumption. In addition, the idea of using the existing, questionable data to fit the parameter lambda deserves further scrutiny. If the dataset already contains fraudulent values, it may not be wise to use it to fit parameters. Yet the authors calculate the mean of each triple, use it as the corresponding lambda value, and look up that value to obtain the p-values. This amounts to using questionable data to fit a parameter and then using the fitted parameter to check the same questionable data, which is circular and not very scientific.

This model works acceptably for the colonies, as the mean of the counts is less than 100, so we could use the mean value to represent lambda and find the corresponding probability values for the Poisson binomial model. However, when we ran the same experiments for the Coulter counts, computational limitations prevented us from obtaining reasonable probability values: the mean counts are much larger, and may not be a good representation of the lambda values, so we could not replicate the values for the Coulter data. We therefore performed a regression based on the statement in the paper that the probability is < 0.14 when lambda = 100 and 0.032 when lambda = 2000. We could not verify the probability value at lambda = 2000, so we assumed that it is good to use, and fitted the probability values on this basis.


Linear combination for probability values when lambda is very large (Coulter):

| Dataset | mean1 | probability | mean2 | probability | mean3 | probability |
|---|---|---|---|---|---|---|
| RTS Coulter | 998.6 | 0.042 | 1019.2 | 0.039 | 1039.8 | 0.040 |
| Others Coulter | 2918.6 | 0.013 | 2966.5 | 0.013 | 3012.5 | 0.011 |
| Outside Lab2 Coulter | 2135.2 | 0.028 | 2454.4 | 0.022 | 2748.2 | 0.019 |
| Outside Lab3 Coulter | 3322.1 | 0.011 | 3383.4 | 0.010 | 3450.1 | 0.009 |


## Hypothesis testing III - normal estimation of p-values

According to the paper, the authors obtained reasonable approximations of the upper-tail probabilities of a Poisson binomial random variable using normal probabilities. This idea follows from the Central Limit Theorem, so in our replication we directly applied it to calculate the corresponding upper-tail probability from Z scores. However, as the normal distribution cannot capture all characteristics of a Poisson binomial distribution, the authors took further considerations into account, such as implementing a second-order correction. Using the probability values obtained from the previous calculation, they calculated the mean and standard deviation of the Poisson binomial variables, which were then used as the mean and standard deviation of the approximating normal distribution. The authors also noted that the normal probabilities are not exact values for the Poisson binomial probabilities, so the bias introduced by approximating the original distribution with a normal distribution should be considered in any further analysis.
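The plain normal approximation used in our replication can be sketched as follows. This is an illustration under our own assumptions: `normal_upper_tail` is a hypothetical name, the per-triple probabilities `p` are taken as given, and neither a continuity correction nor the second-order correction mentioned above is applied.

```python
import numpy as np
from scipy.stats import norm


def normal_upper_tail(p, k):
    """Z score and normal approximation of P(X >= k) for a Poisson binomial X."""
    p = np.asarray(p, dtype=float)
    mean = p.sum()
    sd = np.sqrt((p * (1.0 - p)).sum())
    z = (k - mean) / sd
    # norm.sf is the survival function 1 - cdf, i.e. the upper-tail probability.
    return float(z), float(norm.sf(z))
```

As a sanity check, 100 triples with probability 0.5 each give a mean of 50 and a standard deviation of 5, so observing k = 60 corresponds to Z = 2.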

However, as we mentioned in previous sections, the earlier steps rest on questionable foundations: the lambda-probability table is computed under the Poisson process assumption, and the questionable existing data is used to fit the lambda values that yield the p-values. Using these uncertain, possibly inaccurate results to calculate the upper-tail probability introduces considerable uncertainty, even though the assumption for this section looks reasonable in light of the Central Limit Theorem. As a result, our results are not very close to those reported in the paper.

The authors specify that they used a rounded mean value to calculate No.mean, so initially we used Python's round() function directly; however, we could not get values of No.mean close to those stated in the paper. The corresponding table is denoted the Old "Round" value. We then tried Python's int() function, to check whether truncation is what they meant by rounding, but this was not the case either. In both situations, the corresponding No.mean, No.expected, Sd, Z, and p>=k values differ from those presented in the paper. Finally, we checked a third method, in which a triple counts if the deviation between its median and its mean is less than 1. With this rule we got results for No.mean similar to theirs.
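The three counting rules we tried can be written side by side. This is a sketch with hypothetical names (`count_mean_triples` and the rule labels are ours), assuming complete triples of non-negative integer counts; for such counts, `np.trunc` matches Python's int() and `np.round` matches round(), since the mean of three integers is never exactly halfway between two integers.

```python
import numpy as np


def count_mean_triples(triples, rule="round"):
    """Count triples that 'contain their mean' under one of three rules."""
    t = np.asarray(triples, dtype=float)
    mean = t.mean(axis=1)
    if rule == "round":
        # A value of the triple equals the mean rounded to the nearest integer.
        hits = (t == np.round(mean)[:, None]).any(axis=1)
    elif rule == "int":
        # A value of the triple equals the truncated mean.
        hits = (t == np.trunc(mean)[:, None]).any(axis=1)
    elif rule == "median":
        # The median is within one unit of the mean.
        hits = np.abs(np.median(t, axis=1) - mean) < 1
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(hits.sum())
```

The rules can disagree on the same data: the triple (1, 3, 3) has mean 2.33, so it is missed by both the rounding and the truncation rule but counted by the median rule, while (1, 3, 4) with mean 2.67 is counted by the rounding and median rules but not by truncation.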

As mentioned in the section "Hypothesis testing II", our probability calculations for the Coulter data are off, and we used a regression for the probabilities; thus we could not replicate the authors' results for these data. Details can be found in our tables below.



New "Round" value for Colony:

| Name | No.Complete | No.mean | No.expected | Sd | Z | p >= k |
|---|---|---|---|---|---|---|
| RTS Colonies | 1343 | 690 | 207.27 | 13.24 | 23.19 | 0.00 |
| Others Colonies | 577 | 109 | 92.7 | 8.82 | -1.06 | 0.855 |
| Outside Lab1 Colony | 48 | 3 | 8.0 | 2.58 | -1.78 | 0.962 |

The Coulter lambda is too large to calculate those statistics.

Linear combination for probability values when lambda is very large (Coulter):

| Name | No.Complete | No.mean | No.expected | Sd | Z | p >= k |
|---|---|---|---|---|---|---|
| RTS Coulter | 1725 | 176 | 69.58 | 7.37 | 5.89 | 1.01e-9 |
| Others Coulter | 928 | 73 | 11.44 | 3.36 | 4.93 | 4.14e-7 |
| Outside Lab2 Coulter | 95 | 0 | 2.19 | 1.46 | -1.5 | 0.933 |
| Outside Lab3 Coulter | 118 | 1 | 1.18 | 1.08 | -0.17 | 0.566 |

## Terminal digit analysis

## Equal digit Analysis 

# Our Analysis

In this section, having replicated the authors' statistical experiments, we double-check their underlying assumptions, such as the Poisson model and the normal approximation. In addition, we use cross-validation to show whether the mid-ratios of the combined data differ from the mid-ratios obtained when each data set is taken individually.



















## Permutation Tests

## Discussions

## Results interpolation

# Conclusion

Data fraud is an extremely important topic in science, engineering and many other subjects. Methods to detect manually manipulated data are needed to identify the existence of data fraud, data fabrication and falsification. In our review, we conducted a permutation test and other tests to determine whether there is any underlying data fabrication or falsification in the data analyzed in this paper.

From permutation test, we found...

From ... test, we found that...

