## Goodness of Fit Test in Histogram Fitting

In this example we perform several goodness-of-fit (GoF) tests on a histogram generated from a given distribution. 
We do the following GoF tests:  
* Chi-square test using the Neyman Chi-square (based on the observed bin error)
* Chi-square test using the Pearson Chi-square (based on the expected bin error) 
* Chi-square test using the Baker-Cousins likelihood ratio value. 
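
For reference, the three statistics listed above have the following standard forms. The notation is introduced here and is not taken from the notebook: $n_i$ is the observed content of bin $i$ and $\mu_i$ is the content expected from the fitted function.

$$
\chi^2_{N} = \sum_i \frac{(n_i - \mu_i)^2}{n_i}, \qquad
\chi^2_{P} = \sum_i \frac{(n_i - \mu_i)^2}{\mu_i}, \qquad
\chi^2_{\lambda} = 2 \sum_i \left[ \mu_i - n_i + n_i \ln\frac{n_i}{\mu_i} \right]
$$

In the Neyman sum, bins with $n_i = 0$ have to be left out, and in the Baker-Cousins sum the logarithmic term is taken as zero for $n_i = 0$.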

We study the distribution of each test statistic using pseudo-experiments, checking whether it follows a Chi-square distribution and, if so, with how many degrees of freedom. 

At the end we also study the bias in the fitted function parameters.

We start by creating the histogram for the data sample we are going to fit, and three different histograms for studying the three test statistics used for the GoF tests.  

In [1]:
%jsroot on  

In [2]:
// some configuration parameters
// number of bins used for the data histogram and data range
int nbin = 100;         
double xmin = 0;         
double xmax = 10; 
 
int nexp = 1000; // number of experiments
int n = 200;   // size of data sample for each experiment

auto hist = new TH1D("hist","Data Sample distribution",nbin,xmin,xmax);
auto hchi2N = new TH1D("hchi2N","Neyman chi-squared distribution",nbin,0,2*nbin);
auto hchi2P = new TH1D("hchi2P","Pearson chi-squared distribution",nbin,0,2*nbin);
auto hchi2L = new TH1D("hchi2L","Baker-Cousins LR distribution",nbin,0,2*nbin);

#### Data distribution

Here we define the function used to generate and then later fit the generated histograms. 
We use an Exponential function, but we could also use a Gaussian or even a Constant function.

In [3]:
TF1 f1("f1","[A]*exp(-x/[tau])",xmin,xmax);
double trueParams[] = {1,5};   // true parameter values used for generating the events
double params0[] = { 10, 4};  //  initial parameter values used for fitting

#### Toy Monte Carlo Study

We perform all the work in the code below. We generate a set of pseudo-experiments (e.g. 1000), each of which consists of a histogram filled with 200 events generated according to the given distribution (e.g. exponential). 

We perform the three different possible fits to the histogram and we study the resulting Chi-square distribution.
* Chi-square fit using the Neyman Chi-square (based on the observed bin error)
* Chi-square fit using the Pearson Chi-square (based on the expected bin error) 
* Binned likelihood fit assuming a Poisson p.d.f. for each bin.

We then study the resulting Chi-square (test statistic) values obtained by the three fits. In the case of the binned likelihood fit, the test statistic we use is the log-likelihood ratio obtained using a saturated model (Baker-Cousins procedure, see *S. Baker and R. D. Cousins, Nucl. Instrum. Meth. 221 (1984) 437*).
This log-likelihood ratio can be obtained directly in ROOT by multiplying by 2 the value of the negative log-likelihood function at its minimum. 
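
To make the three test statistics concrete, here is a minimal sketch in plain C++ (no ROOT needed) that evaluates them for given observed bin contents `n` and expected bin contents `mu`. The names `GofStats` and `computeGofStats` are hypothetical helpers introduced only for this illustration. Skipping empty bins in the Neyman sum is an assumption made here, since the observed error is zero there. This is not the code ROOT runs internally.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of ROOT): the three GoF statistics
// for one histogram, given observed counts n and expected contents mu.
struct GofStats {
    double neyman;        // sum (n - mu)^2 / n, skipping empty bins
    double pearson;       // sum (n - mu)^2 / mu
    double bakerCousins;  // 2 * sum [ mu - n + n*log(n/mu) ]
};

// Assumes n and mu have the same size and every mu[i] > 0.
GofStats computeGofStats(const std::vector<double>& n,
                         const std::vector<double>& mu) {
    GofStats s{0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < n.size(); ++i) {
        const double d = n[i] - mu[i];
        // Neyman: observed error sqrt(n); empty bins have zero error and are skipped
        if (n[i] > 0) s.neyman += d * d / n[i];
        // Pearson: expected error sqrt(mu)
        s.pearson += d * d / mu[i];
        // Baker-Cousins: the n*log(n/mu) term vanishes for n = 0
        double term = mu[i] - n[i];
        if (n[i] > 0) term += n[i] * std::log(n[i] / mu[i]);
        s.bakerCousins += 2.0 * term;
    }
    return s;
}
```

An empty bin contributes nothing to the Neyman sum in this sketch, while it contributes `mu` to the Pearson sum and `2*mu` to the Baker-Cousins sum, so the three statistics can differ noticeably when many bins have few entries.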

We then compare the test statistic values obtained with the three methods by collecting them in three separate histograms. 

Note that we use `TF1::GetRandom` to generate events according to a generic distribution (in this case one defined by a ROOT `TF1` object). 
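
As an illustration of what sampling from a distribution involves, the sketch below draws from the exponential used here, restricted to a range, by inverting its cumulative distribution analytically. `TF1::GetRandom` handles a generic function and does not rely on such a closed form, so this is only an illustration under that simplification, not ROOT's implementation. The function names are hypothetical.

```cpp
#include <cmath>
#include <random>

// Map a uniform number u in [0,1] to x in [xmin,xmax] distributed as exp(-x/tau),
// by inverting the cumulative distribution of the truncated exponential.
double sampleTruncatedExp(double u, double tau, double xmin, double xmax) {
    const double a = std::exp(-xmin / tau);
    const double b = std::exp(-xmax / tau);
    return -tau * std::log(a - u * (a - b));
}

// Draw one value using a standard-library generator (hypothetical helper).
double drawTruncatedExp(std::mt19937& rng, double tau, double xmin, double xmax) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return sampleTruncatedExp(uniform(rng), tau, xmin, xmax);
}
```

With the values used in this notebook one would call `drawTruncatedExp(rng, 5, 0, 10)` 200 times per pseudo-experiment.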

In [4]:
hchi2N->Reset(); // reset in case we run a second time
hchi2P->Reset(); 
hchi2L->Reset(); 
for (int iexp = 0; iexp < nexp; ++iexp) {
    f1.SetParameters(trueParams);  // set the true parameter values before generating
    hist->Reset();
    for (int i = 0; i < n; ++i){
        hist->Fill( f1.GetRandom() );
    }
    // set the initial function parameters 
    f1.SetParameters(params0);
    f1.SetLineColor(kRed);
    f1.SetTitle("Neyman Fit");
    // Neyman chisquare fit 
    // use option Q to avoid too much printing and option S to return the fit result
    auto rN = hist->Fit(&f1,"S Q ");  
    f1.SetParameters(params0);   // reset parameters 
    f1.SetLineColor(kBlack);
    f1.SetTitle("Pearson Fit");
    // Pearson chisquare fit
    auto rP = hist->Fit(&f1,"S P Q +");  // use option P 
    f1.SetParameters(params0);   // reset parameters 
    f1.SetLineColor(kBlue);
    f1.SetTitle("Likelihood Fit");
    // Baker-Cousins log-likelihood fit
    auto rL = hist->Fit(&f1,"S Q L +");  // use option L
    
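    // save the results in the histograms, but only for successful fits
    // (the fit result converts to the integer fit status, 0 meaning success)
    if (rN == 0) hchi2N->Fill( rN->Chi2 () );  
    if (rP == 0) hchi2P->Fill( rP->Chi2 () ); 
    // twice the minimum of the negative log-likelihood is the Baker-Cousins statistic
    if (rL == 0) hchi2L->Fill( 2.* rL->MinFcnValue() ); 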
    
}


Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1


As an example, we plot one of the data sets we have generated.

In [5]:
// just one example of the data 
hist->Draw();
legend = new TLegend(0.6,0.6,0.88,0.88);
functions = hist->GetListOfFunctions();
// loop on the list of histogram functions and add them in the legend
for ( auto * f : *functions) {
    if (f->IsA()==TF1::Class())
       legend->AddEntry(f,f->GetTitle(),"L");
}
legend->Draw();
gPad->Draw();

Here we plot the distributions of the three test statistics we have obtained.

In [6]:
canvas = new TCanvas("c","c",1000,400);
canvas->Divide(3,1);
canvas->cd(1); hchi2N->Draw();
canvas->cd(2); hchi2P->Draw();
canvas->cd(3); hchi2L->Draw();
canvas->Draw();

#### Fit of Obtained Test Statistic distribution

We study the obtained test statistics by fitting each of them with a Chi-square distribution function with the number of degrees of freedom as a free parameter. We also fit a normalization constant, which is irrelevant for this study. 
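
The Chi-square density used in such a fit can be written out explicitly. The following is a sketch in plain C++, assuming the standard formula for a non-integer number of degrees of freedom. The function names `chi2Pdf` and `scaledChi2` are hypothetical, and ROOT's own `ROOT::Math::chisquared_pdf` is what the notebook actually uses.

```cpp
#include <cmath>

// Chi-square probability density for x > 0 and ndf > 0 (ndf need not be an integer):
// f(x; k) = x^(k/2 - 1) * exp(-x/2) / (2^(k/2) * Gamma(k/2))
double chi2Pdf(double x, double ndf) {
    if (x <= 0 || ndf <= 0) return 0.0;
    const double h = 0.5 * ndf;
    // evaluate in log space to avoid overflow of Gamma(k/2) for large ndf
    const double logf = (h - 1.0) * std::log(x) - 0.5 * x
                        - h * std::log(2.0) - std::lgamma(h);
    return std::exp(logf);
}

// Model fitted to the histogram: a normalization constant times the density.
double scaledChi2(double x, double constant, double ndf) {
    return constant * chi2Pdf(x, ndf);
}
```

Since the mean of a Chi-square distribution equals its number of degrees of freedom, the histogram mean is a natural starting value for `ndf`, and the number of entries times the bin width is a natural starting value for the constant.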

We start by defining the Chi-square distribution function we use for fitting the test statistics.

In [7]:
fchi2 = new TF1("fchi2","[Constant]*ROOT::Math::chisquared_pdf(x,[ndf])",0,100);
// create also a new Canvas for plotting
gPad = new TCanvas();

- **Neyman Chi-Square fit** 

In [8]:
fchi2->SetParameters(hchi2N->GetEntries()*hchi2N->GetBinWidth(1), hchi2N->GetMean());
auto rchi2N = hchi2N->Fit(fchi2,"LS");
gStyle->SetOptFit(1111); 
gPad->Draw();

 FCN=209.97 FROM MIGRAD    STATUS=CONVERGED      30 CALLS          31 TOTAL
                     EDM=1.21372e-08    STRATEGY= 1      ERROR MATRIX ACCURATE 
  EXT PARAMETER                                   STEP         FIRST   
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
   1  Constant     2.00000e+03   6.32456e+01   6.33518e-01   1.67026e-08
   2  ndf          4.81117e+01   3.06981e-01   3.07518e-03  -3.58862e-04
                               ERR DEF= 0.5


- **Pearson Chi-Square fit**

In [9]:
fchi2->SetParameters(hchi2P->GetEntries()*hchi2P->GetBinWidth(1), hchi2P->GetMean());
auto rchi2P = hchi2P->Fit(fchi2,"LS");
gPad->Draw();

 FCN=79.4075 FROM MIGRAD    STATUS=CONVERGED      30 CALLS          31 TOTAL
                     EDM=1.32585e-09    STRATEGY= 1      ERROR MATRIX ACCURATE 
  EXT PARAMETER                                   STEP         FIRST   
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
   1  Constant     2.00000e+03   6.32456e+01   3.90352e-01   6.34497e-09
   2  ndf          4.86398e+01   3.08697e-01   1.90546e-03  -1.17948e-04
                               ERR DEF= 0.5


-  **Baker-Cousins Likelihood Ratio Chi-Square Fit**

In [10]:
fchi2->SetParameters(hchi2L->GetEntries()*hchi2L->GetBinWidth(1), hchi2L->GetMean());
auto rchi2L = hchi2L->Fit(fchi2,"LS");
gPad->Draw();

 FCN=41.8835 FROM MIGRAD    STATUS=CONVERGED      25 CALLS          26 TOTAL
                     EDM=2.42287e-07    STRATEGY= 1      ERROR MATRIX ACCURATE 
  EXT PARAMETER                                   STEP         FIRST   
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
   1  Constant     2.00000e+03   6.32456e+01   2.84290e-01  -8.47032e-09
   2  ndf          1.10092e+02   4.67109e-01   2.09776e-03  -1.05377e-03
                               ERR DEF= 0.5


In [11]:
std::cout << "Number of degree of freedom for Neyman chi2 distribution is " 
    << rchi2N->Parameter(1) << " +/- " << rchi2N->Error(1) << std::endl;

Number of degree of freedom for Neyman chi2 distribution is 48.1117 +/- 0.306981


In [12]:
std::cout << "Number of degree of freedom for Pearson chi2 distribution is " 
    << rchi2P->Parameter(1) << " +/- " << rchi2P->Error(1) << std::endl;

Number of degree of freedom for Pearson chi2 distribution is 48.6398 +/- 0.308697


In [13]:
std::cout << "Number of degree of freedom for Baker-Cousins chi2 distribution is " 
    << rchi2L->Parameter(1) << " +/- " << rchi2L->Error(1) << std::endl;

Number of degree of freedom for Baker-Cousins chi2 distribution is 110.092 +/- 0.467109


It is interesting to note that, in the Baker-Cousins case, the fitted number of degrees of freedom is larger than the number of bins. This is consistent with the study of the Poisson log-likelihood ratio in a CDF note by 
*Joel G. Heinrich, “The Log Likelihood Ratio of the Poisson Distribution for Small μ,” CDF/MEMO/CDF/CDFR/5718*, http://www-cdf.fnal.gov/physics/statistics/notes/cdf5718_loglikeratv2.ps.gz

One can study how the fitted number of degrees of freedom changes when increasing or decreasing the true expected bin contents of the data sample histogram.
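
The dependence on the expected bin content can also be explored without any fitting. The sketch below, in plain C++, sums over the Poisson distribution to get the expected value of a single bin's contribution to the Baker-Cousins statistic, for a bin whose true mean `mu` is known. The function name is hypothetical, and ignoring the effect of the fitted parameters is a simplifying assumption.

```cpp
#include <cmath>

// Expected value of the per-bin Baker-Cousins term
//   t(n; mu) = 2 * [ mu - n + n*log(n/mu) ]
// for n drawn from a Poisson distribution with mean mu > 0.
double expectedBakerCousinsTerm(double mu) {
    // sum far enough into the upper tail that the remainder is negligible
    const int nmax = static_cast<int>(mu + 20.0 * std::sqrt(mu) + 50.0);
    double sum = 0.0;
    for (int n = 0; n <= nmax; ++n) {
        // Poisson probability, computed in log space for numerical stability
        const double logp = -mu + n * std::log(mu) - std::lgamma(n + 1.0);
        double t = mu - n;
        if (n > 0) t += n * std::log(n / mu);
        sum += std::exp(logp) * 2.0 * t;
    }
    return sum;
}
```

For large `mu` the result approaches 1, the value expected for one degree of freedom per bin. For `mu` around one event per bin it is larger than 1, and for `mu` much smaller than 1 it drops well below 1. The sum over bins therefore need not match the naive number of degrees of freedom when bins are sparsely populated.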




### Fit Parameter Bias

We extend the exercise by also studying the distribution of the fitted parameter $\tau$ in the three different cases.
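
One simple way to quantify a bias from a set of fitted values, sketched here in plain C++, is to compare their mean with the true value and to attach the standard error of the mean. The names `BiasEstimate` and `estimateBias` are hypothetical. The notebook itself fits a Gaussian to the histogram of fitted values instead, so this is an alternative illustration rather than the method used below.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper type (not part of ROOT)
struct BiasEstimate {
    double bias;   // mean of fitted values minus true value
    double error;  // standard error of the mean
};

// Assumes at least two fitted values.
BiasEstimate estimateBias(const std::vector<double>& fitted, double trueValue) {
    const double n = static_cast<double>(fitted.size());
    double mean = 0.0;
    for (double v : fitted) mean += v;
    mean /= n;
    double var = 0.0;
    for (double v : fitted) var += (v - mean) * (v - mean);
    var /= (n - 1.0);  // unbiased sample variance
    return {mean - trueValue, std::sqrt(var / n)};
}
```

A bias is significant when it is large compared with its error. Here the true value would be the second entry of `trueParams`, that is $\tau = 5$.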

In [14]:
htauN = new TH1D("htauN","Distribution of fitted parameter tau for Neyman Chi-squared fits",50,0,10);
htauP = new TH1D("htauP","Distribution of fitted parameter tau for Pearson Chi-squared fits",50,0,10);
htauL = new TH1D("htauL","Distribution of fitted parameter tau for Log-Likelihood fits",50,0,10);

In [15]:
htauN->Reset(); // reset in case we run a second time
htauP->Reset(); 
htauL->Reset(); 
for (int iexp = 0; iexp < nexp; ++iexp) {
    f1.SetParameters(trueParams);  // set the true parameter values before generating
    hist->Reset();
    for (int i = 0; i < n; ++i){
        hist->Fill( f1.GetRandom() );
    }
    // set the initial function parameters 
    f1.SetParameters(params0);
    f1.SetLineColor(kRed);
    f1.SetTitle("Neyman Fit");
    // Neyman chisquare fit 
    // use option Q to avoid too much printing and option S to return the fit result
    auto rN = hist->Fit(&f1,"S  Q");  
    f1.SetParameters(params0);   // reset parameters 
    f1.SetLineColor(kBlack);
    f1.SetTitle("Pearson Fit");
    // Pearson chisquare fit
    auto rP = hist->Fit(&f1,"S P Q +");  // use option P 
    f1.SetParameters(params0);   // reset parameters 
    f1.SetLineColor(kBlue);
    f1.SetTitle("Likelihood Fit");
    // Baker-Cousins log-likelihood fit
    auto rL = hist->Fit(&f1,"S Q L +");  // use option L
    
    // get results from fit results and save them  in histograms 
    if (rN == 0) htauN->Fill( rN->Parameter(1) );  
    if (rP == 0) htauP->Fill( rP->Parameter(1) ); 
    if (rL == 0) htauL->Fill( rL->Parameter(1) ); 
    
}



In [16]:
canvas = new TCanvas("c","c",1000,400);
canvas->Divide(3,1);
canvas->cd(1); htauN->Fit("gaus","L");
canvas->cd(2); htauP->Fit("gaus","L");
canvas->cd(3); htauL->Fit("gaus","L");
canvas->Draw();

 FCN=23.1561 FROM MIGRAD    STATUS=CONVERGED      71 CALLS          72 TOTAL
                     EDM=1.00307e-06    STRATEGY= 1      ERROR MATRIX ACCURATE 
  EXT PARAMETER                                   STEP         FIRST   
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
   1  Constant     5.29177e+01   2.30338e+00   6.21169e-03  -3.12512e-04
   2  Mean         7.87887e+00   5.57839e-02   1.68863e-04  -8.87711e-03
   3  Sigma        1.30183e+00   4.38892e-02   2.95912e-05   3.46844e-02
                               ERR DEF= 0.5
 FCN=20.7605 FROM MIGRAD    STATUS=CONVERGED      59 CALLS          60 TOTAL
                     EDM=4.53479e-08    STRATEGY= 1      ERROR MATRIX ACCURATE 
  EXT PARAMETER                                   STEP         FIRST   
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
   1  Constant     7.11670e+01   2.89194e+00   7.37894e-03   4.92958e-05
   2  Mean         7.50856e+00   3.62607e-02   1.14272e-04   3



One can modify this exercise by using a normal distribution instead of an exponential one. In this case one can study the bias on the sigma of the Gaussian using the three different methods. 