## Prepare python environment


**Note:** The pinned version of pomegranate (0.15.0) is required, so please do not change it. Installing these packages may take 5-10 minutes or longer, so please be patient.

In [None]:
# Installs required packages
!apt install libgraphviz-dev
!pip install pomegranate==0.15.0
!pip install matplotlib pygraphviz

# Press "Restart Runtime" after running this cell, before going to the rest of the code.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state=5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Credit Card Fraud Detection Dataset (2 points)

---


The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
See [here](https://www.kaggle.com/mlg-ulb/creditcardfraud) for details of the dataset. We will post-process the data to balance the two classes (fraud and genuine).

We will use a subset of the dataset for simplifying the experiments.

### Loading the dataset

#### There are a total of 284807 entries in this dataset with no missing values. The last column indicates whether the transaction is fraud or not while remaining columns are features.

In [None]:
# Download and load dataset
import os
from zipfile import ZipFile
if not os.path.exists('creditcard.csv'):
    # Download the archive only if it is not already present
    if not os.path.exists('creditcard.zip'):
        !wget https://github.com/jha-lab/ECE364_2025/raw/main/data/creditcard.zip
    # Extract all members of the zip into the current directory
    with ZipFile("creditcard.zip", 'r') as zObject:
        zObject.extractall(path=".")
df = pd.read_csv("creditcard.csv")

# Select a subset of features
ind_subset=np.array([0,2,3,4,5,-2,-1])
subset_features=df.columns[ind_subset]
df=df[subset_features]

# Display the first five instances in the dataset
print(df.head())

In [None]:
# Check the datatype of each column
df.info()

#### Look at some statistics of the data using the `describe` function in pandas. See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for details about this function.

In [None]:
# Display some statistics of the data
df.describe()

1. Count tells us the number of non-null (non-missing) values in a feature.

2. Mean tells us the mean value of that feature.

3. Std tells us the Standard Deviation Value of that feature.

4. Min tells us the minimum value of that feature.

5. 25%, 50%, and 75% are the 25th, 50th (median), and 75th percentiles (the quartiles) of each feature.

6. Max tells us the maximum value of that feature.
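
As a small illustration of the statistics listed above, the sketch below calls `describe` on a toy column. The DataFrame `toy` and its values are hypothetical and are not taken from the credit card dataset; the missing entry is included only to show that `count` ignores it.

```python
import pandas as pd

# Toy data (hypothetical values, not from the credit card dataset)
toy = pd.DataFrame({"Amount": [10.0, 20.0, 30.0, 40.0, None]})

# describe() skips the missing value, so count is 4 rather than 5
stats = toy["Amount"].describe()
print(stats)
```

Individual statistics can be read by label, for example `stats["mean"]` or `stats["50%"]`.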

#### Visualize the distribution of fraudulent vs genuine transactions

In [None]:
# Make a pie chart showing transaction type
fig, ax = plt.subplots(1, 1)
ax.pie(df.Class.value_counts(),autopct='%1.1f%%', labels=['Genuine','Fraud'], colors=['green','red'])
plt.axis('equal')
plt.ylabel('')

In [None]:
## Check fraudulent activity over time (note: total time is 48 hours)
df["Time_Hr"] = df["Time"]/3600 # convert to hours
fig, (ax1, ax2) = plt.subplots(2, 1, sharex = True, figsize=(6,3))
ax1.hist(df.Time_Hr[df.Class==0],bins=48,color='g',alpha=0.5)
ax1.set_title('Genuine')
ax2.hist(df.Time_Hr[df.Class==1],bins=48,color='r',alpha=0.5)
ax2.set_title('Fraud')
plt.xlabel('Time (hrs)')
plt.ylabel('# transactions')

In [None]:
# Remove 'Time' feature as it is already captured when converting to hours
df = df.drop(['Time'],axis=1)
FEATURE_NAMES=df.drop('Class',axis=1).columns

#### Create a balanced dataset with 50% from each class

In [None]:
fraud_indices = np.array(df[df.Class == 1].index) #indices corresponding to fraud transaction
genuine_ind = df[df.Class == 0].index #indices corresponding to genuine transaction
total_fraud_transactions = len(df[df.Class == 1]) # total transactions that were fraud
np.random.seed(0) # fix the random seed generator for consistent results
indices_genuine_transaction = np.random.choice(genuine_ind, total_fraud_transactions, replace = False)
indices_genuine_transaction = np.array(indices_genuine_transaction)
selected_balanced_indices = np.concatenate([fraud_indices,indices_genuine_transaction]) # indices for balanced data
balanced_data = df.loc[selected_balanced_indices,:] # the indices are index labels, so select with .loc

print("Fraction of genuine transactions: ",len(balanced_data[balanced_data.Class == 0])/len(balanced_data))
print("Fraction of fraud transactions: ",len(balanced_data[balanced_data.Class == 1])/len(balanced_data))

# Make a pie chart showing transaction type
fig, ax = plt.subplots(1, 1)
ax.pie(balanced_data.Class.value_counts(),autopct='%1.1f%%', labels=['Genuine','Fraud'], colors=['green','red'])
plt.axis('equal')
plt.ylabel('')

### Extract target and descriptive features (1 point)


In [None]:
# Store all the features from the balanced data in X (use balanced_data, not df)
X= # insert your code here
# Store all the labels in y (use balanced_data, not df)
y= # insert your code here

#### **NOTE**: After dropping the 'Time' and 'Class' columns, the column layout has changed, and 'Time_Hr' now appears at the end. Let's call `X.info()` again to see the current indices and info of the columns (before converting `X` to a numpy array):

In [None]:
X.info()

In [None]:
# Convert data to numpy array
X = # insert your code here
y = # insert your code here

### Create training and test datasets (1 point)


We will split the dataset into a training set and a test set. Generally in machine learning, we split the data into training,
validation, and test sets (this will be covered in later chapters). The model with the best performance on the validation set is then evaluated on
the test set, which is the unseen data. In this assignment, we will use the `train set` for training and evaluate performance on the `test set` for various
model configurations to determine the best hyperparameters (the parameter settings yielding the best performance).

Split the data into training and test sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results while splitting, set `random_state` to the value defined earlier. We use 80% of the data for training and 20% of the data for testing.
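
For orientation, here is a minimal sketch of how `train_test_split` behaves on toy arrays. The names `X_toy`, `y_toy`, `X_a`, `X_b`, `y_a`, and `y_b` are hypothetical stand-ins, and the data is synthetic; only the call pattern carries over to the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for a feature matrix and label vector (hypothetical data)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=5)
print(X_a.shape, X_b.shape)
```

Running the same call again with the same `random_state` should return the same partition, which is what makes results comparable across runs.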

In [None]:
X_train,X_test,y_train,y_test = # insert your code here

## Checking the dependency between features (2 points)

We want to check whether the two features, $V3$ and $V4$, are independent of each other. For simplicity, we probe $P(-0.5 \leq V3 \leq 0.5)$, $P(-0.5 \leq V4 \leq 0.5)$, and $P(-0.5 \leq V3 \leq 0.5, -0.5 \leq V4 \leq 0.5)$ to represent $P(V3)$, $P(V4)$, and $P(V3, V4)$, respectively. Calculate them and comment on the dependency between $V3$ and $V4$.

Remember that for independent features $A$ and $B$:

$$P(A \cap B) = P(A) \cdot P(B)$$

If $P(A \cap B) \approx P(A) \cdot P(B)$, then $A$ and $B$ can be considered independent. 

**Note:** Since $V3$ and $V4$ are continuous (float) features, we cannot directly calculate probabilities of exact values. Instead, we estimate probabilities by calculating the proportion of data points falling within specified ranges. The intersection $P(V3 \cap V4)$ represents the probability that both conditions are satisfied simultaneously.

**Hint:** Since `X` is a numpy array, you can use the bitwise AND (`&`) operator to combine multiple conditions. For example, `(X[:, 1] >= -0.5) & (X[:, 1] <= 0.5)` creates a boolean array indicating which rows have column 1 (V3) values in the range $[-0.5, 0.5]$. You can count True values using `.sum()`, then divide by the total number of rows `len(X)` to get the probability. 

**Hint:**  V3 is at index 1 and V4 is at index 2 in the numpy array `X`.
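
To see what the comparison looks like when two features really are independent, the sketch below applies the same range-counting idea to synthetic data. The array `Z` holds two independently drawn standard normal columns and is an assumption made purely for illustration; it is not the assignment data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two independent standard normal columns (not the assignment data)
Z = rng.standard_normal((100_000, 2))

# Boolean masks marking rows whose value falls in [-0.5, 0.5]
in_a = (Z[:, 0] >= -0.5) & (Z[:, 0] <= 0.5)
in_b = (Z[:, 1] >= -0.5) & (Z[:, 1] <= 0.5)

# Proportions of rows estimate the probabilities
p_a = in_a.sum() / len(Z)
p_b = in_b.sum() / len(Z)
p_ab = (in_a & in_b).sum() / len(Z)

print(f"P(A)={p_a:.3f}  P(B)={p_b:.3f}  P(A)P(B)={p_a * p_b:.3f}  P(A,B)={p_ab:.3f}")
```

For independent columns, the product and the joint estimate should agree up to sampling noise. A clear gap between the two on real data suggests dependence, at least for the ranges probed.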

In [None]:
# insert your code here

**ANS:**

## Training probability-based classifiers (16 points)


### Exercise 1: Learning a Naive Bayes Model (7 points)

#### We will use the `pomegranate` library to train a Naive Bayes Model. Review ch.6 and see [here](https://pomegranate.readthedocs.io/en/v0.8.1/NaiveBayes.html) for more details.

In [None]:
from pomegranate.distributions import NormalDistribution, ExponentialDistribution, DiscreteDistribution
from pomegranate.NaiveBayes import NaiveBayes
from pomegranate.BayesClassifier import BayesClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import KBinsDiscretizer
import math

np.random.seed(random_state)

#### Exercise 1a: Fit naive bayes model using a single distribution type (1 point)

#### Train one naive bayes model using a normal distribution for each feature. Train another naive bayes model using an exponential distribution for each feature. Hint: use NormalDistribution or ExponentialDistribution (continuous features) and NaiveBayes.from_samples() to fit the model to the data.

#### Report the training and test set accuracies for each model. Hint: use accuracy_score()
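
To make concrete what a Naive Bayes model with normal feature distributions computes, here is a from-scratch NumPy sketch. It does not use pomegranate and is not the library's implementation. The helper names `fit_gaussian_nb` and `predict_gaussian_nb` and the synthetic two-class data are assumptions for illustration only.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-feature mean and std."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero for constant features
    stds = np.array([X[y == c].std(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, stds

def predict_gaussian_nb(model, X):
    classes, priors, means, stds = model
    # Per-feature Gaussian log-density, shape (n_samples, n_classes, n_features)
    z = (X[:, None, :] - means[None, :, :]) / stds[None, :, :]
    log_pdf = -0.5 * z**2 - np.log(stds[None, :, :]) - 0.5 * np.log(2 * np.pi)
    # Naive Bayes: sum per-feature log-likelihoods, then add the log prior
    log_post = log_pdf.sum(axis=2) + np.log(priors)[None, :]
    return classes[np.argmax(log_post, axis=1)]

# Synthetic, well-separated two-class data (hypothetical, not the fraud dataset)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=2.0, scale=1.0, size=(200, 2))
X_syn = np.vstack([X0, X1])
y_syn = np.array([0] * 200 + [1] * 200)

model = fit_gaussian_nb(X_syn, y_syn)
acc = np.mean(predict_gaussian_nb(model, X_syn) == y_syn)
print(f"training accuracy on synthetic data: {acc:.3f}")
```

Summing log-likelihoods across features is where the conditional independence assumption enters. Swapping the Gaussian log-density for another density changes the distribution type while keeping the same structure.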


In [None]:
# insert your code here

#### Exercise 1b: Fit a naive bayes model using different feature distributions (2 points)

#### Visualize the feature distributions (done for you below) to determine which distribution (normal or exponential) better models a specific feature. 

#### Train a Naive Bayes classifier using this set of feature-specific distributions. Hint: use NormalDistribution or ExponentialDistribution and NaiveBayes.from_samples() to fit the model to the data.

#### Report the training and test set accuracies for the model. Hint: use accuracy_score()

In [None]:
# visualization code

num_cols=3
num_rows=math.ceil(len(FEATURE_NAMES)/num_cols) # enough rows to hold every feature
fig,ax=plt.subplots(num_rows,num_cols,figsize=(10,6))

for ft_index in np.arange(X_train.shape[1]):
    ax[ft_index//num_cols,ft_index%num_cols].hist(X_train[:,ft_index], color='blue')
    ax[ft_index//num_cols,ft_index%num_cols].hist(X_test[:,ft_index], color='red')
    ax[ft_index//num_cols,ft_index%num_cols].set_title(FEATURE_NAMES[ft_index])
    
fig.tight_layout()

In [None]:
# insert your code here: train a classifier

#### Comment on any performance difference between this model and the models trained in Ex. 1a. (1 point)

**ANS:**

#### Exercise 1c: Fit a naive bayes model on categorical features (2 points)

#### Besides fitting a naive bayes model on the continuous features, one can fit a naive bayes model on categorical features derived from binning the continuous features, and then compute a probability mass function for each categorical feature.

#### Bin the features by varying the strategy among {equal-width binning, equal-frequency binning}. For each binning strategy, vary the number of bins among {3,10,50}. Hint: use [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer.get_params) by modifying n_bins and strategy and setting encode="ordinal" to map the labels to numerical categories.

#### For each binning setting tried above, fit a naive bayes model on the binned version of the training set. Hint: use DiscreteDistribution to model the categorical features and NaiveBayes.from_samples() to fit the model to the data.

#### Report the training and test set accuracy for each model trained and evaluated on binned versions of the training and test sets respectively. 

**Note:** There may be some variability in the actual performance scores, but the overall trends should remain consistent.
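
The difference between the two binning strategies can be seen on a small skewed feature. The sketch below uses a hypothetical one-column array `x`. In `KBinsDiscretizer`, `strategy="uniform"` corresponds to equal-width binning and `strategy="quantile"` to equal-frequency binning.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy feature (hypothetical values): most points are small, one is large
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]])

counts = {}
for strategy in ["uniform", "quantile"]:
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes = binner.fit_transform(x).ravel().astype(int)
    # Number of points that landed in each of the 3 bins
    counts[strategy] = np.bincount(codes, minlength=3)
    print(strategy, codes, counts[strategy])
```

With equal-width bins, the outlier stretches the range so that most points share one bin. Equal-frequency bins instead place their edges where the data is dense. In the assignment, a discretizer should be fit on the training set and then applied to the test set with `transform`.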

In [None]:
# insert your code here

#### Briefly explain any performance difference between equal-width and equal-frequency binning. Also comment on the effect of increasing the number of bins (see ch.3). (1 point)

**ANS:**

### Exercise 2: Learning a Bayes Net (9 points)

#### We will use the `pomegranate` library to train a Bayes Net to assess whether relaxing the Naive Bayes assumption (i.e., all features are independent given the target feature) could improve the classification model. Review ch.6 and see [here](https://pomegranate.readthedocs.io/en/v0.8.1/BayesianNetwork.html) for more details.

#### Exercise 2a: Create a categorical version of the dataset (1 point)

#### Create categorical versions of the training and test sets by using equal-frequency binning with the number of bins set to 10 (as in Ex. 1c). 

#### *<u>Use these datasets for training and evaluating the Bayes net models in the following exercises.</u>*

**Note:** This is done because pomegranate currently supports Bayes nets only over categorical features.

In [None]:
# insert your code here

#### Exercise 2b: Construct a Bayes net (3 points)

#### Construct and train a Bayes net in which the V2 node is a parent of the target feature node (only these 2 nodes should be in the net). Use construct_and_train_bayes_net (defined below) by passing in the binned training dataset and specifying the index of the parent feature node.

#### Construct and train another Bayes net in which the Time_Hr node is a parent of the target feature node (only these 2 nodes should be in the net). Use construct_and_train_bayes_net (defined below) by passing in the binned training dataset and specifying the index of the parent feature node.

#### Report the training and test accuracies of each Bayes Net. Use get_performance (defined below) by passing in the trained bayes net, binned datasets, and specifying the index of the parent feature node.

In [None]:
from pomegranate import *

"""
X_train_binned: ndarray (# instances, # features) This is the binned version of the training set
y_train: 1darray (# instances,)
ind_chosen_parent_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are parent nodes of the target node. 
ind_chosen_child_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are children nodes of the target node.
                            
Returns a BayesianNetwork representing the trained bayes net
"""
def construct_and_train_bayes_net(X_train_binned,
                                  y_train,
                                  ind_chosen_parent_features=np.array([]), 
                                  ind_chosen_child_features=np.array([]),
                                ):
    # parent nodes of target

    dist_by_parent_feature=[]
    state_by_parent_feature=[]
    if len(ind_chosen_parent_features)>0:
        parent_feature_names_chosen=FEATURE_NAMES[ind_chosen_parent_features]

        for ft_index in ind_chosen_parent_features:
            ft_dist=DiscreteDistribution.from_samples(X_train_binned[:,ft_index])
            dist_by_parent_feature.append(ft_dist)
            state_by_parent_feature.append(State(ft_dist, str(FEATURE_NAMES[ft_index])))
        dist_by_parent_feature=np.array(dist_by_parent_feature)
        state_by_parent_feature=np.array(state_by_parent_feature)


    # target node
    if len(ind_chosen_parent_features)>0:
        X_train_parent_features_binned_with_labels=np.concatenate((X_train_binned[:,ind_chosen_parent_features],
                                                                   np.expand_dims(y_train,axis=1)),axis=1)
        target_dist=ConditionalProbabilityTable.from_samples(X_train_parent_features_binned_with_labels)
        # temporary workaround to properly initialize the distribution
        target_dist=ConditionalProbabilityTable(target_dist.parameters[0],dist_by_parent_feature.tolist())
    else:
        target_dist=DiscreteDistribution.from_samples(y_train)
    target_state=State(target_dist, "target")

    # children node of target

    dist_by_child_feature=[]
    state_by_child_feature=[]    
    if len(ind_chosen_child_features)>0:
        child_feature_names_chosen=FEATURE_NAMES[ind_chosen_child_features]

        for ft_index in ind_chosen_child_features:
            X_train_child_features_binned_with_labels=np.concatenate((np.expand_dims(y_train,axis=1),
                                                                        np.expand_dims(X_train_binned[:,ft_index],axis=1)),
                                                                     axis=1)
            ft_dist=ConditionalProbabilityTable.from_samples(X_train_child_features_binned_with_labels)
            ft_dist=ConditionalProbabilityTable(ft_dist.parameters[0],[target_dist])
            dist_by_child_feature.append(ft_dist)
            state_by_child_feature.append(State(ft_dist, str(FEATURE_NAMES[ft_index])))
        dist_by_child_feature=np.array(dist_by_child_feature)
        state_by_child_feature=np.array(state_by_child_feature)


    pom_model = BayesianNetwork()
    pom_model.add_states(*list(state_by_parent_feature))
    pom_model.add_states(target_state)
    pom_model.add_states(*list(state_by_child_feature))

    for parent_index in np.arange(len(ind_chosen_parent_features)):
        pom_model.add_edge(state_by_parent_feature[parent_index],target_state)

    for child_index in np.arange(len(ind_chosen_child_features)):
        pom_model.add_edge(target_state, state_by_child_feature[child_index])

    pom_model.bake()

    return pom_model


"""
pom_model: BayesianNetwork represents the trained bayes net model
X_train_binned: ndarray (# instances, # features) This is the binned training set
y_train: 1darray (# instances,)
X_test_binned: ndarray (# instances, # features) This is the binned test set
y_test: 1darray (# instances,)
ind_chosen_parent_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are parent nodes of the target node. 
ind_chosen_child_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are children nodes of the target node.
                            
Returns the training and test set accuracies attained by the bayes net model (pom_model)
"""
def get_performance(pom_model, X_train_binned, y_train, X_test_binned, y_test, 
                    ind_chosen_parent_features=np.array([]), ind_chosen_child_features=np.array([])):
    nones_array=np.expand_dims(np.array([None]*len(X_train_binned)),axis=1)
    ind_target_node=len(ind_chosen_parent_features)
    if len(ind_chosen_parent_features)>0:
        X_train_binned_with_none=X_train_binned[:,ind_chosen_parent_features]
        X_train_binned_with_none=np.concatenate((X_train_binned_with_none,nones_array),axis=1)
    else:
        X_train_binned_with_none=nones_array

    if len(ind_chosen_child_features)>0:
        X_train_binned_with_none=np.concatenate((X_train_binned_with_none,
                                                X_train_binned[:,ind_chosen_child_features]),
                                               axis=1)
    pred_labels=np.array(pom_model.predict(X_train_binned_with_none),dtype='int64')[:,ind_target_node]
    train_acc=accuracy_score(y_train, pred_labels)

    nones_array=np.expand_dims(np.array([None]*len(X_test_binned)),axis=1)
    if len(ind_chosen_parent_features)>0:
        X_test_binned_with_none=X_test_binned[:,ind_chosen_parent_features]
        X_test_binned_with_none=np.concatenate((X_test_binned_with_none,nones_array),axis=1)
    else:
        X_test_binned_with_none=nones_array

    if len(ind_chosen_child_features)>0:
        X_test_binned_with_none=np.concatenate((X_test_binned_with_none,
                                               X_test_binned[:,ind_chosen_child_features]),
                                               axis=1)
    pred_labels=np.array(pom_model.predict(X_test_binned_with_none),dtype='int64')[:,ind_target_node]
    test_acc=accuracy_score(y_test, pred_labels)
    
    return train_acc, test_acc

    



In [None]:
# insert your code here (for the case where V2 is the parent node)

In [None]:
# insert your code here (for the case where Time_Hr is the parent node)

#### Comment on which feature seems more informative for the prediction task. (1 point)

**ANS:**

#### Exercise 2c: Construct a Bayes net with multiple parent nodes (3 points)

Here, we'll implement a Bayes net with both parent nodes and children nodes.
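
As a sketch of what this structure encodes, write the parent features as $P_1, \dots, P_m$, the target as $T$, and the child features as $C_1, \dots, C_k$ (symbols introduced here only for illustration). With each parent modeled by its own marginal distribution, the target conditioned on all parents, and each child conditioned on the target, the joint distribution factorizes as:

$$P(P_1, \dots, P_m, T, C_1, \dots, C_k) = \left[\prod_{i=1}^{m} P(P_i)\right] \cdot P(T \mid P_1, \dots, P_m) \cdot \left[\prod_{j=1}^{k} P(C_j \mid T)\right]$$

When all parent and child features are observed, the posterior over the target is therefore:

$$P(T \mid P_1, \dots, P_m, C_1, \dots, C_k) \propto P(T \mid P_1, \dots, P_m) \cdot \prod_{j=1}^{k} P(C_j \mid T)$$

The child terms have the same form as the per-feature likelihoods in Naive Bayes, while the parent term replaces the class prior with a table conditioned jointly on all parents. With 3 parents of 10 bins each and a binary target, that table has $10^3 \times 2 = 2000$ entries.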



#### Construct and train a Bayes net in which:
#### - The features V2, V3, and Time_Hr are all parents of the target feature node.
#### - The features V4, V5, and Amount are all children of the target feature node.

#### Use construct_and_train_bayes_net by passing in the binned training dataset and specifying the indices of the parent feature nodes AND the children feature nodes. (this could take several minutes)

#### Report the training and test accuracy of the Bayes Net using get_performance by passing in the trained Bayes net, the binned datasets, and the indices of the parent and child feature nodes.

In [None]:
# insert your code here

#### Compare the performance of this Bayes net against the Bayes nets from Ex. 2b. (1 point)

**ANS:**