In [1]:
import numpy as np

### Generate 30 fake observations of probabilities of 100 different features

In [29]:
x = np.random.rand(30,100)

### Take a look at our "data"

In [30]:
x

array([[0.99492465, 0.22095045, 0.25813016, ..., 0.3621739 , 0.23108548,
        0.12255272],
       [0.63811856, 0.82901642, 0.21632034, ..., 0.26698813, 0.20210668,
        0.05375498],
       [0.21408718, 0.35322647, 0.49409748, ..., 0.68262781, 0.54424569,
        0.93023409],
       ...,
       [0.47249648, 0.0292927 , 0.96718061, ..., 0.60663064, 0.92074738,
        0.78785331],
       [0.07200287, 0.7142433 , 0.82390776, ..., 0.88286222, 0.09071952,
        0.92917339],
       [0.93173402, 0.03685549, 0.41983396, ..., 0.2528376 , 0.22912984,
        0.35282777]])

### Now let's calculate a probability with this data using the chain rule, i.e. multiply the probabilities of the features together to get a total probability that our data matches a certain label. (Multiplying them directly like this treats the features as independent. Completely fictitious scenario here, but something you might do in real life in a machine learning algorithm.)

In [35]:
np.prod(x, axis=1)

array([8.59448242e-42, 2.82899442e-38, 3.92391137e-40, 3.44035677e-40,
       8.25180649e-41, 5.57610986e-40, 1.65642970e-41, 8.56778609e-42,
       7.32080243e-43, 1.86083195e-47, 1.58973469e-45, 1.28413097e-40,
       7.36936006e-43, 5.18968616e-42, 1.95749064e-39, 2.17969107e-44,
       1.51411730e-47, 1.04168833e-47, 2.13527862e-42, 4.33600948e-48,
       2.55300039e-36, 3.17857492e-39, 1.48198325e-42, 4.84412025e-53,
       3.95888512e-45, 3.72340823e-44, 9.08788657e-41, 3.18725998e-48,
       8.32885538e-43, 4.56648639e-49])

### Notice how small our values are here: when I ran this, they ranged from roughly 1e-36 down to 1e-53 (very, very small). This is what happened with only 100 features. What happens when we have more?

In [48]:
y = np.random.rand(30,900)

In [49]:
y

array([[0.46091631, 0.36235332, 0.56174749, ..., 0.34393905, 0.53237969,
        0.04561387],
       [0.72279088, 0.95532394, 0.52931943, ..., 0.03563779, 0.27869188,
        0.23943472],
       [0.10886475, 0.1345471 , 0.49532917, ..., 0.47744018, 0.39132254,
        0.58712641],
       ...,
       [0.27241278, 0.58220484, 0.22602247, ..., 0.44082575, 0.67906181,
        0.76278196],
       [0.50842952, 0.63971033, 0.05667392, ..., 0.15918026, 0.23386718,
        0.65989332],
       [0.29019685, 0.85948612, 0.73707668, ..., 0.09231485, 0.53978581,
        0.24006832]])

In [50]:
np.prod(y, axis=1)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
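
### The zeros above are floating-point underflow, not true zeros. Here is a rough sketch of why, assuming NumPy's default 64-bit floats. The natural log of a uniform random number on (0, 1) averages -1, so the log of a product of n such numbers is roughly -n. The variable names below are made up for illustration.

```python
import numpy as np

# Smallest positive values a 64-bit float can represent.
info = np.finfo(np.float64)
smallest_normal = info.tiny                   # about 2.2e-308
smallest_subnormal = np.nextafter(0.0, 1.0)   # about 4.9e-324

# Natural log of the smallest representable value (roughly -744.4).
# A product whose log falls well below this is rounded to 0.0.
log_limit = np.log(smallest_subnormal)

# E[log(U)] = -1 for U uniform on (0, 1), so the log of a product of
# n random "probabilities" is roughly -n.
for n_features in (100, 900):
    expected_log = -float(n_features)
    underflows = expected_log < log_limit
    print(n_features, expected_log, underflows)
```

### With 100 features the expected log (about -100) is comfortably above the limit, while with 900 features (about -900) it is far below it, which is consistent with the two results above.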

### Note that with random probabilities, when we get to 900 features being evaluated, the products underflow: they fall below the smallest positive value a 64-bit float can represent, so the computer reports them as zero. This is a problem when dealing with large, complex data. The solution? Log values! Let's take a look at what log does.

In [52]:
np.log(.00025)

-8.294049640102028

### Note that the log of a pretty small number ends up being a large negative number. 

In [54]:
np.log(.00000000000015)

-29.52814110081443

### As the number gets smaller, its log just becomes a larger, more negative value. We can add as many of these logs together as we want without the result ever being too small to evaluate.

### Log values also have some mathematical properties that make them easier to work with (and code) as well. For example, log(xy) - or the log of x times y - is actually equal to log(x) + log(y). So when you're doing multiplication a la the chain rule, with logs all of that multiplication converts to addition. Whereas in our example above for y, we used np.prod to multiply across the 900 different features, if we convert them to logs first, then we use np.sum instead of np.prod to get us to the answer. Example:

In [56]:
np.sum(np.log(y), axis=1)

array([-891.21099345, -929.46892397, -911.86458638, -845.46286087,
       -886.24867831, -912.22258844, -904.86232699, -897.66393888,
       -848.17801184, -938.87641622, -871.22172113, -923.08790726,
       -874.580211  , -836.3713478 , -856.29945082, -901.5391768 ,
       -888.14815201, -833.88300569, -861.19737342, -914.27544507,
       -865.87173013, -951.37417972, -886.90123542, -902.09150429,
       -892.55532516, -909.80394593, -942.3121585 , -914.14800378,
       -954.79831311, -867.50551094])
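
### As a small sanity check of the log(xy) = log(x) + log(y) property, here is a sketch using a handful of made-up probabilities (the array `p` and the seed are assumptions for illustration, not part of the data above). With only five values the product is still representable, so both routes can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only so the sketch is repeatable
p = rng.uniform(0.1, 1.0, size=5)     # five made-up probabilities, kept away from 0

log_of_product = np.log(np.prod(p))   # multiply first, then take the log
sum_of_logs = np.sum(np.log(p))       # take logs first, then add

print(log_of_product, sum_of_logs)
print(np.isclose(log_of_product, sum_of_logs))
```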

### So we're dealing with numbers smaller than a 64-bit float can represent, but because they're converted to log values, we can still use them.

### Note: You can also do log division - i.e. log(x/y) = log(x) - log(y).
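
### To illustrate the division rule, here is a sketch that compares two products that are each too small to represent. The arrays `a` and `b` are hypothetical stand-ins for the per-feature probabilities under two different labels.

```python
import numpy as np

rng = np.random.default_rng(1)             # seeded only so the sketch is repeatable
a = rng.uniform(0.1, 1.0, size=2000)       # made-up probabilities under label A
b = rng.uniform(0.1, 1.0, size=2000)       # made-up probabilities under label B

# Direct route: both products underflow to 0.0, so the ratio is 0/0 (nan).
with np.errstate(invalid="ignore", divide="ignore"):
    direct_ratio = np.prod(a) / np.prod(b)

# Log route: log(x/y) = log(x) - log(y), and each log is a sum of logs.
log_ratio = np.sum(np.log(a)) - np.sum(np.log(b))

print(direct_ratio)   # nan, because the information was lost to underflow
print(log_ratio)      # finite; positive means A is more likely, negative means B
```

### The sign of the log ratio is enough to tell which label is more probable, so there is no need to convert back to ordinary probabilities for that comparison.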

### But wait, how can we know for sure that this log addition is actually the equivalent of our multiplication of non-log probabilities? Well, the inverse of taking the log is exponentiation. So it follows that if we do our log math and then exponentiate the result, we should come up with the same number as taking the product of the non-log values. Let's go back to our x variable, where we had only 100 features so the numbers were still readable.

In [58]:
x.shape

(30, 100)

In [59]:
x

array([[0.99492465, 0.22095045, 0.25813016, ..., 0.3621739 , 0.23108548,
        0.12255272],
       [0.63811856, 0.82901642, 0.21632034, ..., 0.26698813, 0.20210668,
        0.05375498],
       [0.21408718, 0.35322647, 0.49409748, ..., 0.68262781, 0.54424569,
        0.93023409],
       ...,
       [0.47249648, 0.0292927 , 0.96718061, ..., 0.60663064, 0.92074738,
        0.78785331],
       [0.07200287, 0.7142433 , 0.82390776, ..., 0.88286222, 0.09071952,
        0.92917339],
       [0.93173402, 0.03685549, 0.41983396, ..., 0.2528376 , 0.22912984,
        0.35282777]])

### Let's run this through the same procedure that we used with y and generate the probabilities using the log method.

In [63]:
logx = np.sum(np.log(x), axis=1)

In [64]:
logx

array([ -94.55745349,  -86.45831221,  -90.73631477,  -90.86782854,
        -92.29555667,  -90.38491234,  -93.90132431,  -94.56056454,
        -97.02043906, -107.6004757 , -103.15276204,  -91.85332152,
        -97.01382813,  -95.06190068,  -89.12915526, -100.53456094,
       -107.80666674, -108.18065658,  -95.94997677, -109.05713001,
        -81.95579406,  -88.64438567,  -96.31519268, -120.45924428,
       -102.24036673,  -99.99910465,  -92.19904643, -109.36492286,
        -96.89143296, -111.30792549])

### Let's remember what our original chain rule prod of x got us:

In [62]:
np.prod(x, axis=1)

array([8.59448242e-42, 2.82899442e-38, 3.92391137e-40, 3.44035677e-40,
       8.25180649e-41, 5.57610986e-40, 1.65642970e-41, 8.56778609e-42,
       7.32080243e-43, 1.86083195e-47, 1.58973469e-45, 1.28413097e-40,
       7.36936006e-43, 5.18968616e-42, 1.95749064e-39, 2.17969107e-44,
       1.51411730e-47, 1.04168833e-47, 2.13527862e-42, 4.33600948e-48,
       2.55300039e-36, 3.17857492e-39, 1.48198325e-42, 4.84412025e-53,
       3.95888512e-45, 3.72340823e-44, 9.08788657e-41, 3.18725998e-48,
       8.32885538e-43, 4.56648639e-49])

### Now let's see if exponentiating our log sums gives the same thing as our product of the non-log numbers:

In [65]:
np.exp(logx)

array([8.59448242e-42, 2.82899442e-38, 3.92391137e-40, 3.44035677e-40,
       8.25180649e-41, 5.57610986e-40, 1.65642970e-41, 8.56778609e-42,
       7.32080243e-43, 1.86083195e-47, 1.58973469e-45, 1.28413097e-40,
       7.36936006e-43, 5.18968616e-42, 1.95749064e-39, 2.17969107e-44,
       1.51411730e-47, 1.04168833e-47, 2.13527862e-42, 4.33600948e-48,
       2.55300039e-36, 3.17857492e-39, 1.48198325e-42, 4.84412025e-53,
       3.95888512e-45, 3.72340823e-44, 9.08788657e-41, 3.18725998e-48,
       8.32885538e-43, 4.56648639e-49])

### Perfect! So now we see how using log can simplify the maths (using sums vs. products, or in the case of division, subtraction vs. division...) and allow us to work with significantly smaller real values than our computers could handle without taking the log.
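
### Rather than comparing the two printed arrays by eye, the same check can be done programmatically. This is a minimal sketch on a fresh random array (`x_demo` is a stand-in, not the original `x`).

```python
import numpy as np

rng = np.random.default_rng(2)                     # seeded only so the sketch is repeatable
x_demo = rng.uniform(0.01, 1.0, size=(30, 100))    # stand-in for the 30 x 100 data

direct = np.prod(x_demo, axis=1)                     # plain product of probabilities
via_logs = np.exp(np.sum(np.log(x_demo), axis=1))    # sum the logs, then exponentiate

# atol=0.0 matters: the default absolute tolerance (1e-8) is far larger than
# these values, so it would report a match no matter what.
print(np.allclose(direct, via_logs, rtol=1e-9, atol=0.0))
```

### The two results agree to within floating-point rounding rather than bit for bit, which is why a relative tolerance is used instead of an exact equality test.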