### [5 Marks] Briefly explain and implement from scratch the following functions: i) cross-entropy; ii) entropy; iii) mutual information; iv) conditional entropy; v) KL divergence. Take appropriate example toy data/distributions and explain the insights from calculating these quantities.

###### (The functions are not covered in the order listed above; they are presented in a sequence where each one builds on the previous one.)

#### importing useful libraries

In [1]:
import pandas as pd
import numpy as np
import sklearn

#### importing breast_cancer toy dataset from sklearn library

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import log_loss

#loading the breast cancer data
breast_cancer_data = load_breast_cancer()

In [3]:
#X will contain all the input variables
X = breast_cancer_data.data

#y is the target variable containing the values 0 and 1;
#in this dataset 0 means malignant and 1 means benign
y = breast_cancer_data.target 

In [4]:
#let's take an overview of all the input variables
df_inputs = pd.DataFrame(data = X, columns = breast_cancer_data.feature_names)
df_inputs

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [5]:
 #getting info about the input variables
df_inputs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

We can see that there are no null values in the input variables X.
Let's check the target variable.

In [6]:
#overviewing target variable
df_target = pd.DataFrame(data=y)
df_target

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
564,0
565,0
566,0
567,0


In [7]:
##getting info about the target variable
df_target.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       569 non-null    int32
dtypes: int32(1)
memory usage: 2.3 KB


We can see that the target variable y has no null values either.

### Now, let's implement our functions from scratch

#### ii) Entropy
Entropy is a measure of the amount of uncertainty or randomness in a system. In the context of machine learning, entropy is commonly used as a criterion for evaluating the purity of a split in a decision tree algorithm.

Also, according to Shannon’s information theory, entropy measures the average information content of a message: entropy is zero when all messages are identical. In Machine Learning, it is frequently used as an impurity measure.

The entropy of a node in a decision tree is defined as:
<span style="color:blue">$$ E = -\sum_{i} p(i) \log_{2} p(i) $$</span>
where p(i) is the probability of the i-th class in the data set.

In [8]:
#let's define function for entropy from scratch
def entropy(Y): #computes the entropy (in bits) of the values in Y
    #count how often each distinct value (or each distinct row, for 2-D input) occurs
    unique, count = np.unique(Y, return_counts=True, axis=0)
    
    #calculating probability
    prob = count/len(Y)
    
    #applying formula (np.dot already sums over the classes)
    ent = -np.dot(prob,np.log2(prob))
    return ent

#now, let's find the entropy of target variable y
print("Entropy is:",entropy(y))

Entropy is : 0.9526351224018599


So the entropy of the target variable "y" is about 0.95 bits. This is close to the 1-bit maximum for a binary variable, which tells us that the two classes are fairly, but not perfectly, balanced.
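To build intuition for these numbers, here is a minimal sketch that evaluates the same entropy formula on a few toy distributions. The function name `entropy_bits` and the distributions are hypothetical choices for illustration, and the sketch deliberately uses only the standard library.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability are skipped: they contribute nothing to the sum.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical toy distributions (not taken from the breast cancer data)
toy_distributions = {
    "fair coin": [0.5, 0.5],
    "biased coin": [0.9, 0.1],
    "certain outcome": [1.0, 0.0],
    "fair four-sided die": [0.25, 0.25, 0.25, 0.25],
}

for name, probs in toy_distributions.items():
    print(f"{name}: {entropy_bits(probs):.3f} bits")
```

The fair coin reaches the 1-bit maximum for two outcomes, the certain outcome has zero entropy, and the biased coin falls in between (about 0.47 bits). Entropy grows both with the number of outcomes and with how evenly the probability is spread over them.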

#### iv) Conditional Entropy
In machine learning, conditional entropy is a measure of the uncertainty that remains in a random variable Y once the value of another random variable X is known.

It can also be written as the joint entropy of Y and X minus the entropy of X:<span style="color:blue">$$H(Y|X) = H(Y,X) - H(X)  $$ </span>where H(Y|X) is the conditional entropy of Y given X, H(Y,X) is the joint entropy of Y and X, H(X) is the entropy of X, and Y and X are any two random variables.

Joint entropy measures the uncertainty of a pair of random variables, that is, the amount of information contained in both variables taken together.


In [9]:
#Joint Entropy
def joint_entropy(Y,X):
    YX = np.c_[Y,X]  #concatenating Y and X column-wise
    return entropy(YX)

#Conditional Entropy
def conditional_entropy(Y, X):
    return joint_entropy(Y, X) - entropy(X)

#Let's take two input variables from X and find their conditional entropy
#Let x1 be mean radius (first column of X) and x2 be mean texture (second column of X)
x1 = df_inputs["mean radius"]
x2 = df_inputs["mean texture"]

#converting x1 and x2 into numpy arrays
x1 = x1.values
x2 = x2.values

#finding conditional entropy of variable x1 given the value of x2
print("Conditional Entropy is: ",conditional_entropy(x1,x2)) 

Conditional Entropy is :  0.3282846880834285


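Here, the conditional entropy of x1 given x2 comes out to be about 0.33 bits, which is the estimated uncertainty left in x1 (mean radius) after taking into account the information provided by x2 (mean texture). Note that both features are continuous and are treated here as discrete symbols, so many (x1, x2) pairs occur only once; the small value therefore likely reflects the limited sample size more than a strong dependence between the two features.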
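The chain-rule definition is easier to check on a small discrete example. The sketch below uses hypothetical toy observations (`sky` and `rain` are made-up names and values) and compares the chain-rule form H(Y,X) - H(X) with the direct form, the weighted average of H(Y | X = v) over the values v of X. It is an illustration under these assumptions, not a replacement for the functions above.

```python
import math
from collections import Counter

def sample_entropy(samples):
    """Entropy in bits of the empirical distribution of a list of hashable samples."""
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

def conditional_entropy_chain(y, x):
    """H(Y|X) computed as H(Y,X) - H(X)."""
    return sample_entropy(list(zip(y, x))) - sample_entropy(x)

def conditional_entropy_direct(y, x):
    """H(Y|X) computed as the weighted average of H(Y | X = v)."""
    n = len(x)
    total = 0.0
    for value, count in Counter(x).items():
        y_given_value = [yi for yi, xi in zip(y, x) if xi == value]
        total += (count / n) * sample_entropy(y_given_value)
    return total

# Hypothetical toy observations: 8 days, sky condition and whether it rained
sky = ["cloudy", "cloudy", "cloudy", "cloudy", "clear", "clear", "clear", "clear"]
rain = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]

print("H(rain)       =", round(sample_entropy(rain), 4))
print("H(rain | sky) =", round(conditional_entropy_chain(rain, sky), 4))
print("direct form   =", round(conditional_entropy_direct(rain, sky), 4))
```

Both forms agree (about 0.41 bits), and the result is well below H(rain) (about 0.95 bits): knowing the sky removes more than half of the uncertainty about rain, because on clear days the outcome is always "no".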

#### iii) Mutual Information
Mutual information is a measure of the dependence between two random variables, often used in machine learning for feature selection and for assessing feature importance.

In particular, the mutual information between two random variables X and Y is defined as:
<span style="color:blue">$$ I(Y; X) = H(Y) - H(Y|X) $$</span>
where I(Y; X) is Mutual Information between Y and X, H(Y) is the Entropy of Y, and H(Y|X) is the Conditional Entropy of Y given X. Intuitively, mutual information measures how much knowing one variable reduces the uncertainty about the other variable.

In [10]:
#mutual information
def mutual_info(Y,X):
    return (entropy(Y) - conditional_entropy(Y,X))

#here, let's find out the mutual information between the target
#variable y and one of the input variables, x1 (mean radius)
print("Mutual Information is: ",mutual_info(y,x1))

Mutual Information is :  0.8607815854836038


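Here, 0.86 bits is the estimated reduction in uncertainty about the target variable y, or the expected reduction in the number of yes/no questions needed to guess y, after observing x1 (mean radius). Because x1 is continuous and many of its values occur only once, this counting-based estimate is pushed toward its upper bound H(y) ≈ 0.95 and should be read as optimistic.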
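A few toy cases make the behaviour of this estimator clearer. The sketch below is a minimal standard-library illustration with hypothetical data: a feature that copies the labels, a feature that is unrelated to them, and a feature whose values are all distinct, which is the situation a continuous column creates when its values are counted as discrete symbols.

```python
import math
from collections import Counter

def sample_entropy(samples):
    """Entropy in bits of the empirical distribution of a list of hashable samples."""
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

def mutual_information(y, x):
    """I(Y;X) = H(Y) + H(X) - H(Y,X), equivalent to H(Y) - H(Y|X)."""
    return sample_entropy(y) + sample_entropy(x) - sample_entropy(list(zip(y, x)))

# Hypothetical toy data
labels = [0, 1, 0, 1, 0, 1, 0, 1]
copy_of_labels = list(labels)                 # knows everything about the labels
unrelated = [0, 0, 1, 1, 0, 0, 1, 1]          # tells us nothing about the labels
all_distinct = [0.11, 0.52, 0.37, 0.94, 0.25, 0.68, 0.81, 0.49]  # every value unique

print("H(labels)           =", sample_entropy(labels))
print("I(labels; copy)     =", mutual_information(labels, copy_of_labels))
print("I(labels; unrelated)=", mutual_information(labels, unrelated))
print("I(labels; distinct) =", mutual_information(labels, all_distinct))
```

The copy recovers the full entropy of the labels and the unrelated feature gives zero, as expected. The all-distinct feature also reports the full entropy of the labels, even though its values were chosen arbitrarily: when every value is unique, each one identifies its sample, and therefore its label. This is why continuous features are usually binned (or handled with a dedicated estimator) before a counting-based mutual information is interpreted.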

#### i) Cross Entropy
Cross-entropy, also known as log loss, is a loss function commonly used in machine learning for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true probability distribution of the target variable.

In classification tasks, the model outputs a probability distribution over the possible classes for each input instance. The cross-entropy measures how well the predicted probability distribution matches the true probability distribution, which is typically represented as a one-hot encoded vector (i.e., a vector with a single 1 and all other elements 0, indicating the true class of the instance).

The formula for cross-entropy can be written as:
<span style="color:blue">$$ H(p, q) = -\sum_{i} p_{i} \log_{2}(q_{i}) $$</span>
where i ranges over all possible outcomes, and p_i and q_i are the probabilities of the i-th outcome according to the true distribution and the predicted distribution, respectively.
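As a quick check of the formula, the following sketch evaluates it directly on toy distributions. The function name `cross_entropy_bits` and the example distributions are hypothetical, and returning infinity when q assigns zero probability to an outcome that p allows is a convention assumed for this sketch.

```python
import math

def cross_entropy_bits(p, q):
    """Cross-entropy H(p, q) in bits between two discrete distributions of equal length."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # outcomes that never occur under p contribute nothing
        if q_i == 0:
            return math.inf   # q rules out an outcome that p allows
        total += p_i * math.log2(1.0 / q_i)
    return total

# Hypothetical toy distributions over three outcomes
p = [0.5, 0.25, 0.25]
q_exact = [0.5, 0.25, 0.25]   # prediction equal to the true distribution
q_off = [0.8, 0.1, 0.1]       # over-confident prediction

print("H(p, q_exact) =", cross_entropy_bits(p, q_exact))
print("H(p, q_off)   =", round(cross_entropy_bits(p, q_off), 4))

# One-hot true distribution: only the predicted probability of the true class matters
one_hot = [0.0, 1.0, 0.0]
prediction = [0.1, 0.7, 0.2]
print("one-hot case  =", round(cross_entropy_bits(one_hot, prediction), 4))
```

When the prediction equals the true distribution, the cross-entropy equals the entropy of p (1.5 bits here); any other prediction gives a larger value. With a one-hot target the sum collapses to the negative log of the probability assigned to the true class.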

In [11]:
#cross entropy
def cross_entropy(true_probability,predicted_probability):
    #log_loss averages the per-sample cross entropy and uses the natural logarithm
    return log_loss(true_probability,predicted_probability)

# Import necessary libraries
from sklearn.naive_bayes import GaussianNB

#creating a model for predicting the probability
model = GaussianNB()

#fitting the input variables X and target variable 
#y to the model
model.fit(X,y)

#predicting probability for input variable X
predicted_probability = model.predict_proba(X)[:,1] #predict_proba() returns a 2-D array with the
                                                    #probabilities of class 0 and class 1 in its
                                                    #two columns, so this slice keeps only the
                                                    #probability of class 1

#here y, the target variable, plays the role of the true probability,
#while we have calculated the predicted probability
print("Cross Entropy: ",cross_entropy(y,predicted_probability))

Cross Entropy:  0.5204082855966206


Here, the cross entropy comes out to be 0.52, which is the logarithmic loss of the model averaged over all samples. Since log_loss uses the natural logarithm, this value is in nats rather than bits. The lower the cross-entropy, the better the model's predicted probabilities match the true classes.
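For comparison, a per-sample binary cross-entropy can be written from scratch in a few lines. This is a minimal sketch on hypothetical toy labels and probabilities; the function name `binary_log_loss` and the clipping constant `eps` are assumptions of this sketch, chosen only to avoid taking the logarithm of zero.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy in nats.

    y_true holds 0/1 labels and p_pred the predicted probability of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # keep p away from exactly 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical toy labels and predicted probabilities
y_toy = [1, 0, 1, 0]
p_confident = [0.9, 0.1, 0.8, 0.4]   # mostly right and fairly confident
p_uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts 50/50
p_wrong = [0.1, 0.9, 0.2, 0.6]       # on the wrong side for every sample

print("confident :", round(binary_log_loss(y_toy, p_confident), 4))
print("uninformed:", round(binary_log_loss(y_toy, p_uninformed), 4))
print("wrong     :", round(binary_log_loss(y_toy, p_wrong), 4))
```

The uninformed predictor scores ln 2 ≈ 0.693 nats, which is a useful reference point: a binary classifier with a lower log loss is doing better than a coin flip, and confident mistakes are penalised heavily.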

#### v) KL Divergence
KL divergence (or Kullback-Leibler divergence) is a measure of dissimilarity between two probability distributions. In machine learning, KL divergence is often used as a loss function or as a measure of performance.

Formally, given two probability distributions p and q over the same space, the KL divergence KL(p||q) measures the expected number of extra bits required to code samples from p using a code optimized for q rather than a code optimized for p (according to Shannon's information theory). This can be written as:
<span style="color:blue">$$ KL(p||q) = \sum_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right) $$ </span>

Here, p(x) and q(x) are the probabilities of observing the value x under the distributions p and q, respectively. The result is in bits when the logarithm is base 2 and in nats when it is the natural logarithm. The KL divergence is asymmetric, meaning KL(p||q) is in general not equal to KL(q||p), and it is always non-negative.
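These properties can be verified on a toy pair of distributions. The sketch below is a minimal from-scratch version using base-2 logarithms; the function names and the two coin distributions are hypothetical choices for illustration.

```python
import math

def kl_divergence_bits(p, q):
    """KL(p || q) in bits for two discrete distributions of equal length."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue          # terms with p(x) = 0 contribute nothing
        if q_x == 0:
            return math.inf   # q rules out an outcome that p allows
        total += p_x * math.log2(p_x / q_x)
    return total

def entropy_bits(p):
    return sum(p_x * math.log2(1.0 / p_x) for p_x in p if p_x > 0)

def cross_entropy_bits(p, q):
    return sum(p_x * math.log2(1.0 / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

# Hypothetical toy distributions: a fair coin and a heavily biased coin
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print("KL(fair || biased) =", round(kl_divergence_bits(fair, biased), 4))
print("KL(biased || fair) =", round(kl_divergence_bits(biased, fair), 4))
print("KL(fair || fair)   =", kl_divergence_bits(fair, fair))
print("H(p,q) - H(p)      =", round(cross_entropy_bits(fair, biased) - entropy_bits(fair), 4))
```

The two directions give different values (about 0.74 and 0.53 bits), the divergence of a distribution from itself is zero, and KL(p||q) equals the cross-entropy H(p, q) minus the entropy H(p). That last identity ties this section to the cross-entropy and entropy sections: KL divergence is the part of the cross-entropy that is due to the mismatch between q and p.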

In [12]:
#importing necessary library
from sklearn.model_selection import train_test_split

#Split the dataset into training and testing sets
#(as defined in the cells above, X holds the input variables and y the target variable)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compute the true distribution of the labels (class frequencies in y_train)
p = np.bincount(y_train) / len(y_train)

#Define a function to compute the predicted 
#distribution of the labels
def get_predicted_distribution(X, y, model):
    
    # Compute the predicted probabilities of the positive class
    y_pred_prob = model.predict_proba(X)[:, 1]
    
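    # Sum the predicted positive-class probabilities separately for each true
    # label in y, then normalize by their total so that the entries of q sum to 1
    q = np.bincount(y, weights=y_pred_prob) / np.sum(y_pred_prob)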
    
    return q

# Train a logistic regression model on the training data
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

#Compute the predicted distribution of the labels using 
#the logistic regression model
q = get_predicted_distribution(X_test, y_test, model)

# Compute the KL divergence between the true and predicted distributions
#here, the two probability distributions are p (true distribution) and 
#q (predicted distribution); np.log is the natural logarithm, so the result is in nats
kl_div = np.dot(p , np.log(p / q))

print("KL divergence:", kl_div)

KL divergence: 0.5341040492226561


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


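Here, we first compute the true class distribution p from the training labels, then train a logistic regression model on the training split and build the distribution q from its predicted probabilities on the test split. The warning above indicates that the solver reached its iteration limit before converging; scaling the features or increasing max_iter addresses this.

Next, we calculate the KL divergence between the two distributions using the formula defined above.

The KL divergence comes out to be 0.53 nats, which quantifies how far the distribution q is from the true distribution p: a value of 0 would mean the two distributions are identical, and larger values mean they differ more.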
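There is more than one way to turn per-sample predictions into a single "predicted distribution of the labels". One alternative reading, sketched below, is to average the rows of the predicted class probabilities, so that q is the average class distribution the model predicts. This is an assumption about what such a distribution could mean, not the construction used in the cell above, and the probability rows and labels here are hypothetical toy values rather than model output.

```python
import math

def average_predicted_distribution(proba_rows):
    """Average the per-sample class probabilities (one row per sample)."""
    n_samples = len(proba_rows)
    n_classes = len(proba_rows[0])
    return [sum(row[j] for row in proba_rows) / n_samples for j in range(n_classes)]

def empirical_distribution(labels, n_classes):
    """Relative frequency of each class label 0..n_classes-1."""
    return [labels.count(c) / len(labels) for c in range(n_classes)]

def kl_divergence_nats(p, q):
    """KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

# Hypothetical toy values standing in for predict_proba output and true labels
toy_proba = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [0, 1, 1, 1]

p_toy = empirical_distribution(toy_labels, n_classes=2)
q_toy = average_predicted_distribution(toy_proba)

print("true distribution p     :", p_toy)
print("predicted distribution q:", [round(v, 3) for v in q_toy])
print("KL(p || q) in nats      :", round(kl_divergence_nats(p_toy, q_toy), 4))
```

Under this reading, a KL divergence near zero means the model predicts each class about as often as it really occurs. It says nothing about whether individual samples are classified correctly, which is what the cross-entropy measures.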

###### link to demonstration video - https://drive.google.com/file/d/1igoj_J3Duv_g0S-5MiHDKw0nuAbX59i4/view?usp=share_link