# Compare BIC scores of a list of distributions

The Bayesian Information Criterion (BIC) ranks a list of models according to a penalized maximum likelihood criterion that accounts for the sample size and the number of parameters of each distribution. A lower BIC score is better.

Reference:
https://stackoverflow.com/questions/65972875/how-to-rank-a-list-of-distributions-with-bic-in-openturns
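
To make the criterion concrete, here is a minimal sketch that computes a BIC score by hand for a Gaussian fit, using only the standard library. The helper names `normalized_bic` and `normal_log_likelihood` are hypothetical and are not part of OpenTURNS. Dividing by the sample size is an assumption about the convention used by the library; it is consistent with the magnitude of the scores printed later in this notebook (around 2.9 for 100 points).

```python
import math
import random


def normalized_bic(log_likelihood, n_params, n_obs):
    """Return (-2 log L + k log n) / n.

    Assumption: the score is divided by the sample size n.
    """
    return (-2.0 * log_likelihood + n_params * math.log(n_obs)) / n_obs


def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of an i.i.d. Gaussian sample with mean mu and std sigma."""
    n = len(data)
    sum_of_squares = sum((x - mu) ** 2 for x in data)
    return (
        -0.5 * n * math.log(2.0 * math.pi * sigma**2)
        - sum_of_squares / (2.0 * sigma**2)
    )


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]

# Maximum likelihood estimates of the two Gaussian parameters.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

log_l = normal_log_likelihood(data, mu_hat, sigma_hat)
bic_normal = normalized_bic(log_l, n_params=2, n_obs=len(data))
print("Normalized BIC of the Gaussian fit: %.3f" % bic_normal)
```

The penalty term `n_params * log(n_obs)` grows with the number of parameters, so a more flexible distribution must improve the likelihood enough to compensate for it.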

In [1]:
import openturns as ot
import tqdm

In [2]:
sample = ot.Normal().getSample(100)

The `GetContinuousUniVariateFactories` static method returns a list of all available factories for continuous univariate distributions. We could use this list without further processing, but the histogram would come first in the ranking, because it is built to follow the sample closely. Hence, we do not include it in our computation of the BIC score.

In [3]:
factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
marginalFactories = []
for factory in factories:
    if str(factory) != "HistogramFactory":
        print(factory)
        marginalFactories.append(factory)
number_of_factories = len(marginalFactories)
print("Number of selected factories:", number_of_factories)

ArcsineFactory
BetaFactory
BurrFactory
ChiFactory
ChiSquareFactory
DirichletFactory
ExponentialFactory
FisherSnedecorFactory
FrechetFactory
GammaFactory
GeneralizedParetoFactory
GumbelFactory
InverseNormalFactory
LaplaceFactory
LogisticFactory
LogNormalFactory
LogUniformFactory
MeixnerDistributionFactory
NormalFactory
ParetoFactory
RayleighFactory
RiceFactory
StudentFactory
TrapezoidalFactory
TriangularFactory
TruncatedNormalFactory
UniformFactory
VonMisesFactory
WeibullMaxFactory
WeibullMinFactory
Number of selected factories: 30


In the following script, we loop over all the factories in the list that we previously created. We will later sort the BIC scores in increasing order, which is why we store both the marginal index and the BIC score in the `score_array` sample. The computation can take a long time for some distributions, so we use the `tqdm` module to print a progress bar. Finally, some distributions cannot be fitted to this specific sample. To avoid breaking the for loop, we wrap the call to the `BIC` method in a `try/except`. If the distribution fitting fails, we set the BIC score to the largest finite floating point number (this is `MaxScalar`), which is approximately equal to $1.8 \times 10^{308}$.

In [4]:
score_array = ot.Sample(number_of_factories, 2)
for i in tqdm.tqdm(range(number_of_factories)):
    try:
        factory = marginalFactories[i]
        fitted_dist, bic_score = ot.FittingTest.BIC(sample, factory)
        score_array[i] = [i, bic_score]
    except TypeError:
        print("Cannot build ", factory)
        score_array[i] = [i, ot.SpecFunc.MaxScalar]

100%|██████████| 30/30 [00:00<00:00, 782.76it/s]

Cannot build  BurrFactory
Cannot build  DirichletFactory
Cannot build  FisherSnedecorFactory
Cannot build  GeneralizedParetoFactory
Cannot build  InverseNormalFactory
Cannot build  LogUniformFactory
Cannot build  MeixnerDistributionFactory
Cannot build  RiceFactory





The key step is to sort the array containing the BIC scores.

In [5]:
sorted_BIC_scores = score_array.sortAccordingToAComponent(1)
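
The sorting step can be illustrated with plain Python lists. This is only a sketch of the idea, not of the `Sample` API: the four rows below are a hypothetical subset of the scores, `sys.float_info.max` plays the role of `MaxScalar` for a failed fit, and the names `FAILED`, `scores`, `names` and `ranking` are introduced for illustration.

```python
import sys

# Sentinel for a failed fit (stand-in for ot.SpecFunc.MaxScalar).
FAILED = sys.float_info.max

# Each row is [index, BIC score]; indices are stored as floats, as in a Sample.
scores = [[0.0, 3.033], [1.0, FAILED], [2.0, 2.902], [3.0, 2.945]]
names = ["BetaFactory", "BurrFactory", "NormalFactory", "LogisticFactory"]

# Sort the rows by their second column (the BIC score), in increasing order.
sorted_scores = sorted(scores, key=lambda row: row[1])

# The index must be cast to int before it can be used to index a list.
ranking = [names[int(row[0])] for row in sorted_scores]
print(ranking)
```

Because the sentinel is larger than any real score, factories that failed to fit end up at the bottom of the ranking without any special handling.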

A large number of distributions may be fitted to the sample (up to 30 here). We limit the list to the 10 distributions which have the lowest BIC scores. We will use Pandas to print the BIC scores nicely. To do this, we create the `BIC_data` list, which contains the factory and the corresponding BIC score. This is where the index of the distribution, stored in the first column of `sorted_BIC_scores`, is used. However, a `Sample` stores `float`s: we have to convert the index into an integer before using it.

In [6]:
BIC_data = []
rank = list(range(min(number_of_factories, 10)))
for i in rank:
    distribution_index = int(sorted_BIC_scores[i, 0])
    factory = marginalFactories[distribution_index]
    BIC_score = sorted_BIC_scores[i, 1]
    BIC_data.append([factory, BIC_score])
    print("%s, BIC = %.3f" % (factory, BIC_score))

NormalFactory, BIC = 2.902
WeibullMaxFactory, BIC = 2.934
WeibullMinFactory, BIC = 2.939
TruncatedNormalFactory, BIC = 2.939
LogisticFactory, BIC = 2.945
LogNormalFactory, BIC = 2.948
StudentFactory, BIC = 2.949
VonMisesFactory, BIC = 2.953
TriangularFactory, BIC = 2.968
BetaFactory, BIC = 3.033
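
One possible way to display this ranking with Pandas is sketched below. To keep the example self-contained, the hypothetical `bic_rows` list holds the factory names as strings, copied from the output above, instead of the factory objects stored in `BIC_data`. The column names are also chosen for illustration.

```python
import pandas as pd

# Factory names and BIC scores copied from the ranking printed above.
bic_rows = [
    ["NormalFactory", 2.902],
    ["WeibullMaxFactory", 2.934],
    ["WeibullMinFactory", 2.939],
    ["TruncatedNormalFactory", 2.939],
    ["LogisticFactory", 2.945],
    ["LogNormalFactory", 2.948],
    ["StudentFactory", 2.949],
    ["VonMisesFactory", 2.953],
    ["TriangularFactory", 2.968],
    ["BetaFactory", 3.033],
]

# One row per distribution; the default integer index is the rank (0 = best).
df = pd.DataFrame(bic_rows, columns=["Factory", "BIC"])
df.index.name = "Rank"

# Format the scores with three decimals, as in the printed ranking.
print(df.to_string(float_format=lambda value: "%.3f" % value))
```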
