**In order to learn using gradient descent**, the error function has to be continuous and differentiable.
- The sigmoid function $sigmoid(x) = \frac{1}{1+e^{-x}}$ does not return a discrete value but a probability between 0 and 1 (probability space). The sigmoid activation function is used when we have 2 labels.
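
As a minimal sketch of the sigmoid described above (the scores and the 0.5 threshold are assumptions chosen for illustration, not values from these notes), the function maps any real score to a value in (0, 1), which can then be thresholded into one of the 2 labels:

```python
import numpy as np

def sigmoid(x):
    """Map a real-valued score (or array of scores) to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical scores produced by a linear model for three points
scores = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(scores)

# Assumed decision rule: threshold at 0.5 to get the two labels 0 and 1
labels = (probs >= 0.5).astype(int)

print(probs)
print(labels)
```

A score of 0 maps to exactly 0.5, and scores of opposite sign map to probabilities that add up to 1.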

**Multi-Class Classification and Softmax**

- The softmax function is the equivalent of the sigmoid activation function for problems with 3 or more classes.
- We calculate a score for each label, then turn the scores into probabilities using $exp$, which always gives a positive number, so every probability is positive. Dividing by the sum of the exponentials makes the probabilities add up to 1.
- Let us say we have N classes and a linear model that gives us the scores $Z_1, Z_2, \ldots, Z_N$:
   $$P(\text{class } i) = \frac{e^{Z_i}}{e^{Z_1} + e^{Z_2} + \cdots + e^{Z_N}}$$

In [4]:
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    return np.exp(L) / (np.sum(np.exp(L)))

softmax([0, 5])

array([0.00669285, 0.99330715])
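
One practical caveat with the `softmax` cell above: `np.exp` overflows for large scores. A common workaround, sketched here under the assumption that the scores fit in a 1-D array, is to subtract the largest score before exponentiating. The shift cancels between numerator and denominator, so the probabilities are unchanged. The name `stable_softmax` is hypothetical and not part of the original notes.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax that subtracts the max score before exponentiating."""
    z = np.asarray(scores, dtype=float)
    shifted = z - np.max(z)   # largest exponent becomes 0, so exp cannot overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = stable_softmax([0, 5])
# Same gap between scores, but np.exp(1000) on its own would overflow to inf
large = stable_softmax([1000, 1005])

print(probs)
print(large)
```

Both calls should give the same probabilities, since only the differences between scores matter.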

### One-hot encoding 

If the input data is not numeric (for example, text categories), we need to encode it as numbers.

In [2]:
# You can apply both transformations (from text categories to integer categories, 
#                                    then from integer categories to one-hot vectors)
# in one shot using the LabelBinarizer class:
from sklearn.preprocessing import LabelBinarizer
cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer()
new_cat_features = encoder.fit_transform(cat_features)
new_cat_features

array([[0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

In [7]:
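from sklearn.preprocessing import OneHotEncoder

# Mixed strings and ints: NumPy stores every entry as a string
X = np.array([['Cat', 1, 1], ['Dog', 3, 3], ['Mouse', 2, 2], ['Goat', 5, 5]])

oneHotEncoder = OneHotEncoder()

# Returns a sparse matrix with one column per (feature, category) pair
oneHotEncoder.fit_transform(X)

# Older scikit-learn releases called this get_feature_names()
oneHotEncoder.get_feature_names_out()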

array(['x0_Cat', 'x0_Dog', 'x0_Goat', 'x0_Mouse', 'x1_1', 'x1_2', 'x1_3',
       'x1_5', 'x2_1', 'x2_2', 'x2_3', 'x2_5'], dtype=object)

### Maximum Likelihood

Probability will be one of our best friends as we go through Deep Learning. In this lesson, we'll see how we can use probability to evaluate (and improve!) our models.

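We pick the model that gives the existing labels the highest probability; by maximizing this probability we pick the best possible model.

A model that classifies most points correctly gives a high P(all), so P(all) indicates how well the model fits the data. Treating the points as independent, it is the product of the probabilities the model assigns to each point's actual label: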

$$
P(\text{all}) = P(e_1) \cdot P(e_2) \cdot P(e_3) \cdots P(e_n)
$$

All we need to do is maximize this probability. We are not going to maximize the product directly: with thousands of points, each with a probability between 0 and 1, the product becomes extremely small (0.00000000000000000000 something). So we work with a sum instead. To turn the product into a sum, we take the logarithm.
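
The vanishing product is easy to see numerically. The sketch below assumes 2000 points that each receive probability 0.5 (made-up values, chosen only to trigger the effect): the product underflows to zero in 64-bit floats, while the sum of logarithms stays an ordinary number.

```python
import numpy as np

# Hypothetical per-point probabilities: 2000 points, each assigned 0.5
point_probs = np.full(2000, 0.5)

product = np.prod(point_probs)          # 0.5**2000 underflows to 0.0 in float64
log_sum = np.sum(np.log(point_probs))   # 2000 * ln(0.5), easily representable

print(product)
print(log_sum)
```

Because the logarithm is an increasing function, the model that maximizes the sum of logs is the same one that maximizes the product.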

### Cross-Entropy

We use the natural logarithm (ln, base e), which turns a product into a sum:
$$\ln(P) = \ln(P_1 \cdot P_2 \cdot P_3 \cdot P_4) = \ln(P_1) + \ln(P_2) + \ln(P_3) + \ln(P_4)$$

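A good model is one with a small cross-entropy, which is the sum of the negative logarithms of the probabilities. Each probability is at most 1, so each logarithm is negative or zero, and negating it gives a non-negative number. The problem goes from maximizing the probability to minimizing the cross-entropy.

Cross-entropy: if I have a bunch of events and a bunch of probabilities, how likely is it that those events happen based on the probabilities? If it is very likely, we have a small cross-entropy; if it is unlikely, we have a large cross-entropy.

The formula is the sum of the negatives of the logarithms, which is precisely the cross-entropy. For labels $y_i$ (0 or 1) and predicted probabilities $p_i$:
$$CE = -\sum_{i} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]$$
So the cross-entropy tells us whether two vectors (the labels and the predictions) are similar or different.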

In [60]:
import numpy as np

def cross_entropy(Y, P):
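    # np.float_ was removed in NumPy 2.0; convert to float arrays explicitly
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)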
    return - np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

cross_entropy(np.array([1, 1, 0]), np.array([0.9, 0.8, 0.1]))

0.43386458262986227
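
To illustrate the "likely events give a small cross-entropy" idea, the sketch below compares two sets of predictions for the same labels. The function name `binary_cross_entropy` and the second probability vector are assumptions made for this example. It also checks the link back to maximum likelihood: exponentiating the negative cross-entropy recovers the product of the probabilities assigned to the actual labels.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Sum of negative log-probabilities assigned to the actual labels."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

labels = [1, 1, 0]
mostly_right = [0.9, 0.8, 0.1]   # predictions that agree with the labels
mostly_wrong = [0.1, 0.2, 0.9]   # hypothetical predictions that disagree

good = binary_cross_entropy(labels, mostly_right)
bad = binary_cross_entropy(labels, mostly_wrong)

# exp(-cross_entropy) equals the likelihood: 0.9 * 0.8 * 0.9 for the first model
likelihood = np.exp(-good)

print(good, bad, likelihood)
```

The predictions that match the labels should score a much lower cross-entropy than the ones that contradict them.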