# Percentage Binary Thresholding

In this notebook I illustrate a technique, and a recipe, that I developed for thresholding based not on explicit lower and upper gray values but on an intention such as "dim the darkest 20% and the lightest 20% of the image pixels", exposing (i.e. highlighting) only the pixels in between.

-----------------------
This notebook starts by exploring the histogram of a grayscale image:

In [6]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
import numpy as np
import glob
import re
%matplotlib inline

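# Fail early with a helpful message if the sample image is missing.
try:
    with open('test2.jpg', 'rb'):
        pass
except OSError as err:
    raise FileNotFoundError('Please make sure that the notebook is in the same folder as the test images.') from err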

# A flag value of zero (cv2.IMREAD_GRAYSCALE) means we're loading the image
# as a grayscale image, discarding any color information in the file.
img_gray = cv2.imread('test2.jpg', 0)

# Note that I chose 255 bins and explicitly expressed the range of values
# as going from 0 to 255 regardless of what gray levels might exist in the
# sample image. This ensures nice, whole-number bin edges (0, 1, ..., 255).
# Caveat: np.histogram closes the last bin on both sides, so gray levels
# 254 and 255 are counted together in the final bin.
histogram, bin_edges = np.histogram(img_gray, 255, (0.0, 255.0))
print("HISTOGRAM:\n", histogram)
print("BIN EDGES:\n", bin_edges)

HISTOGRAM:
 [  227   117    98    70    65    71    80    90   133   144   151   166
   215   207   228   262   287   266   300   259   295   337   348   331
   403   445   565   747  1115  1592  1765  2128  2911  3474  4214  4737
  4943  3764  2783  1894  1265  1018   881   868   798   805   845   844
   917  1002  1085  1118  1245  1304  1337  1426  1492  1691  2172  2627
  3126  3393  3678  3975  4922  6411  8599  9921  9675  9147  9837 10918
 12555 13764 13657 11958  9861  9004  9313 10257 10491  9864  8869  8371
  8034  8245  8381  8453  7761  6770  6258  5983  5788  5889  5631  5235
  5077  4804  4512  4275  4217  4103  3891  3766  3482  3285  3202  3216
  3115  2857  2750  2538  2458  2513  2796  3789  4594  5308  6185  7913
  7740  7527  7598  7752  8161  9157 10089 11074 11965 12301 12120 11054
 10928 12366 13521 13652 11961 10285  9424  9713 10263 10519 10235  9030
  8112  7584  7787  8347  7832  7780  7454  6286  5781  5241  5666  6507
  7940  7401  5994  5255  5270  6080  6

As we can see above, the bin edges are simply the whole numbers from 0 to 255, so each bin index doubles as the gray value at that bin's left edge. Hence we may discard the bin edges and work with the histogram counts alone.
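If you would rather have exactly one bin per gray level, a possible variation is sketched below. It uses a synthetic random image (`demo_img`, an assumption made only so the snippet runs without `test2.jpg`) and compares the notebook's 255-bin layout with a 256-bin layout over the range 0 to 256. It is an illustration, not a change to the recipe used in the rest of this notebook.

```python
import numpy as np

# Synthetic 8-bit "image" (an assumption; any uint8 array would do).
rng = np.random.default_rng(0)
demo_img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
demo_img[0, 0] = 254  # make sure both of the top gray levels occur
demo_img[0, 1] = 255

# The notebook's layout: 255 bins over [0, 255]. The last bin is closed on
# both sides, so it counts gray levels 254 and 255 together.
hist_255, edges_255 = np.histogram(demo_img, 255, (0.0, 255.0))

# Alternative layout: 256 bins over [0, 256], i.e. one bin per gray level.
hist_256, edges_256 = np.histogram(demo_img, 256, (0, 256))

print(len(hist_255), len(hist_256))
# The final 255-layout bin equals the sum of the last two 256-layout bins.
print(hist_255[-1] == hist_256[-2] + hist_256[-1])
```

Only the final bin differs between the two layouts; every other count is identical, which is why the thresholds found later in this notebook are unaffected.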

The next step is to accumulate the histogram so that it becomes a non-decreasing list of counts. This way, looking up index 10 in the list no longer tells us how many pixels in the image are at gray level 10, but rather how many pixels are at gray levels 0, 1, 2, ..., 10. In other words, each entry gives the number of pixels whose gray level is less than or equal to the index.

This is done using NumPy's np.cumsum function.
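Before applying it to the real histogram, here is a tiny illustration on a made-up four-level histogram (the numbers in `toy_histogram` are invented for the example):

```python
import numpy as np

# Invented counts for gray levels 0, 1, 2 and 3.
toy_histogram = np.array([2, 0, 3, 5])

# Running total: entry i is the number of pixels at gray level <= i.
toy_cumulative = np.cumsum(toy_histogram)

print(toy_cumulative.tolist())  # the running totals are 2, 2, 5, 10
```

The last entry always equals the total number of pixels, which is what makes the normalization in the next step possible.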

In [8]:
cumulative_histogram = np.cumsum(histogram)

print("CUMULATIVE HISTOGRAM:\n", cumulative_histogram)

CUMULATIVE HISTOGRAM:
 [   227    344    442    512    577    648    728    818    951   1095
   1246   1412   1627   1834   2062   2324   2611   2877   3177   3436
   3731   4068   4416   4747   5150   5595   6160   6907   8022   9614
  11379  13507  16418  19892  24106  28843  33786  37550  40333  42227
  43492  44510  45391  46259  47057  47862  48707  49551  50468  51470
  52555  53673  54918  56222  57559  58985  60477  62168  64340  66967
  70093  73486  77164  81139  86061  92472 101071 110992 120667 129814
 139651 150569 163124 176888 190545 202503 212364 221368 230681 240938
 251429 261293 270162 278533 286567 294812 303193 311646 319407 326177
 332435 338418 344206 350095 355726 360961 366038 370842 375354 379629
 383846 387949 391840 395606 399088 402373 405575 408791 411906 414763
 417513 420051 422509 425022 427818 431607 436201 441509 447694 455607
 463347 470874 478472 486224 494385 503542 513631 524705 536670 548971
 561091 572145 583073 595439 608960 622612 634573 6448

Fantastic. Now, instead of dealing with raw counts, we would like to convert them into something that is independent of the overall dimensions of the image: a fraction (or percentage) of the total number of pixels.

We call this the normalized cumulative histogram. It is obtained simply by dividing the array above by a scalar, namely the total pixel count.
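To see why this makes the result independent of image size, the following sketch compares a small synthetic image with a tiled copy that has sixteen times as many pixels but the same distribution of gray levels. The helper name `normalized_cumulative` and the random test images are assumptions made for this illustration only.

```python
import numpy as np

def normalized_cumulative(img):
    # Hypothetical helper: same binning as in this notebook, then cumsum / size.
    hist, _ = np.histogram(img, 255, (0.0, 255.0))
    return np.cumsum(hist) / img.size

rng = np.random.default_rng(1)
small = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
large = np.tile(small, (4, 4))  # 16x the pixels, identical gray-level mix

cdf_small = normalized_cumulative(small)
cdf_large = normalized_cumulative(large)

print(np.allclose(cdf_small, cdf_large))  # same curve despite different sizes
print(cdf_small[-1])                      # the final entry is the full image
```

The raw cumulative counts of the two images differ by a factor of sixteen, while the normalized curves coincide and both end at 1.0.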

In [9]:
normalized_histogram = cumulative_histogram / img_gray.size

print("NORMALIZED HISTOGRAM:\n", normalized_histogram)

NORMALIZED HISTOGRAM:
 [  2.46310764e-04   3.73263889e-04   4.79600694e-04   5.55555556e-04
   6.26085069e-04   7.03125000e-04   7.89930556e-04   8.87586806e-04
   1.03190104e-03   1.18815104e-03   1.35199653e-03   1.53211806e-03
   1.76540799e-03   1.99001736e-03   2.23741319e-03   2.52170139e-03
   2.83311632e-03   3.12174479e-03   3.44726562e-03   3.72829861e-03
   4.04839410e-03   4.41406250e-03   4.79166667e-03   5.15082465e-03
   5.58810764e-03   6.07096354e-03   6.68402778e-03   7.49457465e-03
   8.70442708e-03   1.04318576e-02   1.23470052e-02   1.46560330e-02
   1.78146701e-02   2.15842014e-02   2.61566840e-02   3.12966580e-02
   3.66601562e-02   4.07443576e-02   4.37641059e-02   4.58192274e-02
   4.71918403e-02   4.82964410e-02   4.92523872e-02   5.01942274e-02
   5.10601128e-02   5.19335937e-02   5.28504774e-02   5.37662760e-02
   5.47612847e-02   5.58485243e-02   5.70258247e-02   5.82389323e-02
   5.95898438e-02   6.10047743e-02   6.24555122e-02   6.40028212e-02
   6.562174

So now let's say that we want to discard the darkest 'dark_thresh' fraction of the pixels in the image and the lightest 'light_thresh' fraction, where both are expressed as fractions from 0.0 to 1.0. Or better, let's expose or highlight the pixels that are in between! For that it is more convenient to work with two positions on the cumulative scale, which we are going to call 'lower_thresh' and 'upper_thresh', respectively.

Note that 'lower_thresh' is simply 'dark_thresh', while 'upper_thresh' is equal to "1.0 - light_thresh", just to give you an idea of what they mean.
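As a small sketch of that relationship, the hypothetical helper below (`to_cumulative_bounds` is a name introduced here, not part of the notebook) converts the two "discard" fractions into the two cumulative bounds and rejects combinations that would leave nothing in between.

```python
def to_cumulative_bounds(dark_thresh, light_thresh):
    """Hypothetical helper: map discard fractions to cumulative bounds."""
    if not (0.0 <= dark_thresh <= 1.0 and 0.0 <= light_thresh <= 1.0):
        raise ValueError("both fractions must lie between 0.0 and 1.0")
    if dark_thresh + light_thresh >= 1.0:
        raise ValueError("the two fractions leave no pixels in between")
    # The lower bound is the dark fraction itself; the upper bound is what
    # remains after removing the light fraction from the top.
    return dark_thresh, 1.0 - light_thresh

lower_thresh, upper_thresh = to_cumulative_bounds(0.2, 0.2)
print(lower_thresh, upper_thresh)
```

Discarding the darkest 20% and the lightest 20% therefore corresponds to the bounds 0.2 and 0.8 used in the next cell.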

To do this, we use the two boundaries above to build a Boolean selection vector with NumPy, as follows:


In [10]:
lower_thresh = 0.2
upper_thresh = 0.8

selection_vector = (normalized_histogram >= lower_thresh) & (normalized_histogram <= upper_thresh)

print("SELECTION VECTOR FOR GRAY VALUES IN RANGE:\n", selection_vector)

SELECTION VECTOR FOR GRAY VALUES IN RANGE:
 [False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True False False False False False False False False
 False 

How do we convert that to the actual gray values within range? I am glad you asked!

In [11]:
gray_values = np.arange(0, 255)[selection_vector]

print("GRAY VALUES MEETING PERCENTAGE CRITERIA:\n", gray_values)

GRAY VALUES MEETING PERCENTAGE CRITERIA:
 [ 74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109
 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
 146 147]


As you can see, these values are contiguous, so they can easily be reduced to a lower and an upper comparison boundary.

In [13]:
lthresh = np.min(gray_values)
uthresh = np.max(gray_values)

print("ACTUAL GRAY BOUNDARIES:\n", lthresh, uthresh)

ACTUAL GRAY BOUNDARIES:
 74 147
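With the two gray boundaries in hand, the binary thresholding itself could look like the sketch below. It is a minimal NumPy-only illustration on a synthetic image (`demo_img` is an assumption, and the boundaries 74 and 147 are simply reused from the output above), not necessarily how the result is applied elsewhere.

```python
import numpy as np

# Synthetic stand-in for the grayscale image (an assumption for this sketch).
rng = np.random.default_rng(2)
demo_img = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)

# Boundaries reused from the output above, purely for illustration.
lthresh, uthresh = 74, 147

# True where the pixel lies within the inclusive gray range.
in_range = (demo_img >= lthresh) & (demo_img <= uthresh)

# Binary image: 255 for highlighted pixels, 0 for everything else.
binary = np.where(in_range, 255, 0).astype(np.uint8)

print(binary.shape, np.unique(binary).tolist())
```

OpenCV's `cv2.inRange` performs the same kind of inclusive range test and could be used instead; plain NumPy is used here only to keep the sketch dependency-free.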


# One Liner

Here is a more reusable implementation:

In [21]:
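def percentages_to_thresholds(img_gray, lower_percentage, upper_percentage):
    # Despite the names, both "percentages" are fractions from 0.0 to 1.0.
    # norm_hist[i] is the fraction of pixels with a gray value of at most i.
    norm_hist = np.cumsum(np.histogram(img_gray, 255, (0.0, 255.0))[0]) / img_gray.size
    values = np.arange(0, 255)[(norm_hist >= lower_percentage) & (norm_hist <= upper_percentage)]
    if values.size == 0:
        raise ValueError('No gray value falls within the requested fractions.')

    return np.min(values), np.max(values)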

print("ACTUAL GRAY BOUNDARIES (FUNCTION):\n", percentages_to_thresholds(img_gray, 0.2, 0.8))

ACTUAL GRAY BOUNDARIES (FUNCTION):
 (74, 147)
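A quick sanity check is to measure what fraction of the pixels actually ends up inside the returned range. The sketch below restates the recipe under a different, hypothetical name (`thresholds_from_fractions`) so that it runs on its own, and uses a synthetic uniformly random image instead of `test2.jpg`.

```python
import numpy as np

def thresholds_from_fractions(img, lower, upper):
    # Same recipe as above, restated so this sketch is self-contained.
    hist, _ = np.histogram(img, 255, (0.0, 255.0))
    cdf = np.cumsum(hist) / img.size
    values = np.arange(0, 255)[(cdf >= lower) & (cdf <= upper)]
    return int(values.min()), int(values.max())

# Synthetic image with roughly uniform gray levels (an assumption).
rng = np.random.default_rng(3)
demo_img = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

lo, hi = thresholds_from_fractions(demo_img, 0.2, 0.8)

# Fraction of pixels whose gray value lies in the inclusive range [lo, hi].
kept = float(np.mean((demo_img >= lo) & (demo_img <= hi)))
print(lo, hi, round(kept, 3))
```

The kept fraction lands close to, but generally not exactly at, 0.6, because whole gray levels are either included or excluded at each boundary.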
