In [91]:
import pandas as pd
import numpy as np
import keras

from collections import Counter

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

from scipy.stats import sem, t

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

from IPython.display import display, Math, Latex
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# An Additional Lecture on Sampling

Sampling methods in machine learning play a crucial role in ensuring that the data used for training, validation, and testing are representative of the underlying population. Let's delve into some common types of sampling techniques used in machine learning:

1. **Random Sampling**:
   - Random sampling involves selecting a subset of data points from the entire dataset without any specific pattern or bias. 
   - Mathematically, each data point has an equal probability of being chosen.
   - Random sampling is widely used when the dataset is large and uniformly distributed, ensuring that each sample is representative of the population.

2. **Systematic Sampling**:
   - Systematic sampling involves selecting data points at regular intervals from an ordered list of the population.
   - The interval, also known as the sampling interval, is determined by dividing the total population size by the desired sample size.
   - This method is efficient and straightforward but can introduce bias if there is an underlying pattern in the ordering of the data.

3. **Stratified Sampling**:
   - In stratified sampling, the dataset is divided into several homogeneous subgroups called strata based on certain characteristics.
   - Samples are then randomly selected from each stratum proportionally to their size in the population.
   - This technique ensures that each subgroup is represented adequately in the sample, which is crucial when certain subgroups are underrepresented in the dataset.

4. **Cluster Sampling**:
   - Cluster sampling involves dividing the population into clusters or groups and then randomly selecting entire clusters to be included in the sample.
   - This method is particularly useful when it is impractical or expensive to sample individuals directly.
   - However, cluster sampling can introduce bias if the clusters are not representative of the population, and it loses precision when the members of a cluster are very similar to one another, because each selected cluster then adds little new information.

5. **Stratified Cluster Sampling**:
   - This method combines the principles of stratified and cluster sampling by first dividing the population into strata and then selecting clusters within each stratum.
   - It aims to capture the variability within strata while also accounting for the efficiency gained through cluster sampling.

Each sampling method has its advantages and limitations, and the choice of method depends on various factors such as the nature of the data, the research objectives, and resource constraints. Understanding these sampling techniques is crucial for ensuring the reliability and generalizability of machine learning models.

## Random Sampling

Ordinary random sampling, also known as simple random sampling, is one of the most straightforward and commonly used sampling techniques. In this method, each data point in the population has an equal probability of being selected, independently of other data points. This ensures that every possible sample of a given size has an equal chance of being selected, making the sample representative of the population as a whole.

Ordinary random sampling is useful when the population is homogeneous, and there is no need for additional stratification or clustering. It is straightforward to implement and provides an unbiased representation of the population. However, it may not be suitable for datasets with specific characteristics, such as stratified or clustered populations, where other sampling methods like stratified or cluster sampling may be more appropriate.
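
As a minimal sketch of simple random sampling, the snippet below draws rows without replacement using pandas' `DataFrame.sample`. The `population` dataframe and its columns are hypothetical and serve only as an illustration; they are not part of the lecture's data.

```python
import pandas as pd

# Hypothetical toy population of 100 rows (illustration only)
population = pd.DataFrame({'id': range(100), 'value': range(100, 200)})

# Draw 10 rows without replacement; every row has the same inclusion probability
srs = population.sample(n=10, replace=False, random_state=0)

# The sampling fraction n/N is the inclusion probability of each row
inclusion_probability = len(srs) / len(population)

print(srs.sort_index())
print("Inclusion probability:", inclusion_probability)
```

Fixing `random_state` makes the draw reproducible; omitting it gives a different sample on each run.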

## Systematic Sampling

Systematic sampling is a method of selecting data points from a population at regular intervals. In the context of time series data, systematic sampling involves selecting every kth data point from the time series, where k is the sampling interval. This method ensures that the selected samples are spread evenly across the entire time series, allowing for efficient and representative sampling.

Suppose we have a time series dataset with timestamps as index. Let's create a sample time series dataset for demonstration. Then we will use a sampling interval of length 3, i.e. we will use every 3rd data point.

In [2]:
data = {
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'timestamp': pd.date_range(start='2024-03-01', periods=10, freq='D')
}

df = pd.DataFrame(data)
df.set_index('timestamp', inplace=True)
df

timestamp,value
2024-03-01,10
2024-03-02,20
2024-03-03,30
2024-03-04,40
2024-03-05,50
2024-03-06,60
2024-03-07,70
2024-03-08,80
2024-03-09,90
2024-03-10,100


In [3]:
sampling_interval = 3
systematic_sample = df.iloc[::sampling_interval]

print("Systematic Sample:")
print(systematic_sample)

Systematic Sample:
            value
timestamp        
2024-03-01     10
2024-03-04     40
2024-03-07     70
2024-03-10    100


In this example:

- We have a time series dataset with timestamps as the index.
- We specify the sampling interval (`sampling_interval`) as every 3rd data point.
- We perform systematic sampling by selecting every 3rd data point from the time series using the `iloc` indexer with the step size specified by the sampling interval.

Systematic sampling is useful for time series data because it ensures that samples are evenly spaced throughout the time series, providing a representative sample of the entire dataset. This can be beneficial for various analyses and modeling tasks, such as trend analysis, forecasting, and anomaly detection.
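
The example above fixes the interval by hand and always starts at the first row. A common variant, sketched below under the assumption that a target sample size is given, derives the interval from the population size and picks a random starting offset. The names `desired_sample_size`, `k`, and `start` are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# Same toy series as above, rebuilt so the sketch is self-contained
ts = pd.DataFrame(
    {'value': range(10, 101, 10)},
    index=pd.date_range(start='2024-03-01', periods=10, freq='D'),
)

desired_sample_size = 3                  # hypothetical target size
k = len(ts) // desired_sample_size       # sampling interval: population size / sample size
rng = np.random.default_rng(0)
start = int(rng.integers(0, k))          # random offset in [0, k)

# Take every k-th row, beginning at the random offset
systematic = ts.iloc[start::k]

print(f"interval={k}, start={start}")
print(systematic)
```

A random start does not remove the risk mentioned earlier: if the series has a period that is a multiple of `k`, every sampled point falls on the same phase of the cycle.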

## Cluster Sampling

Suppose we have a dataset with a 'population_id' column representing different clusters. Let's create a sample dataset for demonstration:

In [4]:
data = {
    'population_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'feature2': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
    'class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B']
}
df = pd.DataFrame(data)
df

,population_id,feature1,feature2,class
0,1,1,11,A
1,1,2,12,A
2,1,3,13,A
3,2,4,14,B
4,2,5,15,B
5,2,6,16,B
6,3,7,17,B
7,3,8,18,B
8,3,9,19,B
9,4,10,20,B


Then, we randomly select two clusters, without replacement, from the unique population IDs. Finally, we filter the dataframe to include only the rows belonging to the selected clusters:

In [5]:
clusters_to_sample = np.random.choice(df['population_id'].unique(), 2, replace=False)
cluster_sample = df[df['population_id'].isin(clusters_to_sample)]

print("Cluster Sample:")
print(cluster_sample)

Cluster Sample:
    population_id  feature1  feature2 class
0               1         1        11     A
1               1         2        12     A
2               1         3        13     A
9               4        10        20     B
10              4        11        21     B
11              4        12        22     B


## Cluster Sampling with Imbalanced Data

In this example:

- We have an imbalanced dataset with two classes ('A' and 'B'); only cluster 1 contains the minority class 'A'.
- We compute a selection probability for each cluster. Note that the code below derives it from the number of rows per cluster (`population_id`), not from the class labels, so with four equally sized clusters every cluster gets probability 0.25.
- To give clusters containing minority classes a higher chance of being selected, these probabilities would have to be weighted by the class composition of each cluster.
- For comparison, we compute the same distribution for the cluster sample drawn above.

Adjusting the cluster selection probabilities based on the class distribution can make the resulting cluster sample more representative of the entire dataset, even in the presence of class imbalance.

Let us calculate the selection probabilities of the clusters:

In [6]:
class_distribution = df['population_id'].value_counts(normalize=True)
cluster_probabilities = class_distribution / class_distribution.sum()
cluster_probabilities

population_id
1    0.25
2    0.25
3    0.25
4    0.25
Name: proportion, dtype: float64

In [7]:
sample_distribution = cluster_sample['population_id'].value_counts(normalize=True)
sample_probabilities = sample_distribution / sample_distribution.sum()
sample_probabilities

population_id
1    0.5
4    0.5
Name: proportion, dtype: float64
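
One possible way to make the selection probabilities depend on the class composition is sketched below. It weights each cluster by its share of minority-class rows. The additive `smoothing` constant is an assumption introduced here so that clusters without minority rows remain selectable; it is not something the lecture prescribes.

```python
import numpy as np
import pandas as pd

# Same toy data as in the cluster sampling example
df_c = pd.DataFrame({
    'population_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
})

# Identify the minority class and its share of rows within each cluster
minority = df_c['class'].value_counts().idxmin()
minority_share = (df_c['class'] == minority).groupby(df_c['population_id']).mean()

# Hypothetical weighting: minority share plus a smoothing constant
smoothing = 0.5
weights = minority_share + smoothing
weighted_probabilities = weights / weights.sum()
print(weighted_probabilities)

# Draw two distinct clusters with the weighted probabilities
rng = np.random.default_rng(0)
chosen = rng.choice(
    weighted_probabilities.index.to_numpy(),
    size=2,
    replace=False,
    p=weighted_probabilities.to_numpy(),
)
weighted_cluster_sample = df_c[df_c['population_id'].isin(chosen)]
print(weighted_cluster_sample)
```

A larger `smoothing` value moves the probabilities back toward uniform selection, while a value near zero almost always selects the clusters that contain minority rows.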

## Stratified Sampling

Suppose we have a dataset with a 'target' column representing different classes. Let's create a sample dataset for demonstration:

In [8]:
data = {
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'target': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']
}

df = pd.DataFrame(data)
df

,feature1,feature2,target
0,1,11,A
1,2,12,A
2,3,13,B
3,4,14,B
4,5,15,B
5,6,16,C
6,7,17,C
7,8,18,C
8,9,19,C
9,10,20,C


We perform stratified sampling with `train_test_split` from sklearn, passing the 'target' column to the `stratify` parameter so that the split preserves the class proportions:

In [9]:
train, test = train_test_split(df, test_size=0.25, stratify=df['target'])

In [10]:
print("Training Set:")
print(train)

print("\nTest Set:")
print(test)

Training Set:
   feature1  feature2 target
5         6        16      C
1         2        12      A
6         7        17      C
3         4        14      B
4         5        15      B
9        10        20      C
7         8        18      C

Test Set:
   feature1  feature2 target
8         9        19      C
0         1        11      A
2         3        13      B


## Stratified Sampling with Imbalanced Data

Stratified sampling is a method used to ensure that different subgroups within a population are adequately represented in the sample. This is particularly useful when certain subgroups are underrepresented in the dataset, and we want to ensure that our sample reflects the diversity of the population.

In the example above:

- We have a dataset with two features ('feature1' and 'feature2') and a target variable ('target').
- We use the `train_test_split` function from the `sklearn.model_selection` module to split the dataset into training and test sets.
- We specify the `stratify` parameter and pass the 'target' column to indicate that we want to perform stratified sampling based on the 'target' variable.
- The `test_size` parameter specifies the proportion of the dataset to include in the test split.
- Finally, we print the training and test sets to observe the distribution of classes in both sets.

By using stratified sampling, we ensure that the distribution of classes in the training and test sets closely resembles the distribution of classes in the original dataset. This helps in building more robust and generalizable machine learning models.
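
To check this claim on the toy data, the sketch below compares the class proportions in the full dataset with those in the two splits. The names `df_s`, `train_s`, `test_s`, and `proportions` are illustrative, and `random_state=0` is added only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same class layout as the example above (one feature is enough here)
df_s = pd.DataFrame({
    'feature1': range(1, 11),
    'target': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
})

train_s, test_s = train_test_split(
    df_s, test_size=0.25, stratify=df_s['target'], random_state=0
)

# Side-by-side class proportions in the full data and in each split
proportions = pd.DataFrame({
    'full': df_s['target'].value_counts(normalize=True),
    'train': train_s['target'].value_counts(normalize=True),
    'test': test_s['target'].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

With only ten rows the proportions cannot match exactly, but every class appears in both splits, which a purely random split of this size would not guarantee.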

## Stratified Cluster Sampling

Stratified cluster sampling combines the principles of stratified sampling and cluster sampling. In this method, the population is first divided into homogeneous subgroups or strata based on certain characteristics (similar to stratified sampling). Then, within each stratum, clusters are randomly selected, and all individuals within the selected clusters are included in the sample (similar to cluster sampling). This approach ensures that the sample is representative of the entire population, with each subgroup and cluster being adequately represented.

Suppose we have a dataset with a 'population_id' column representing different clusters and a 'stratum' column representing different strata. Let's create a sample dataset for demonstration:


In [11]:
data = {
    'population_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'stratum': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'feature2': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
}

df = pd.DataFrame(data)
df

,population_id,stratum,feature1,feature2
0,1,A,1,11
1,1,A,2,12
2,1,B,3,13
3,2,B,4,14
4,2,B,5,15
5,2,C,6,16
6,3,C,7,17
7,3,C,8,18
8,3,D,9,19
9,4,D,10,20


Suppose we want to sample 1 cluster from each stratum. We first group the dataframe by stratum, initialize an empty dataframe to store the cluster sample, and iterate over each stratum:

In [12]:
num_clusters_to_sample_per_stratum = 1
grouped = df.groupby('stratum')
cluster_sample = pd.DataFrame()

for stratum, group in grouped:
    # Select clusters from the current stratum
    clusters_to_sample = np.random.choice(group['population_id'].unique(), num_clusters_to_sample_per_stratum, replace=False)
    # Filter the dataframe to include only the selected clusters from the current stratum
    stratum_cluster_sample = group[group['population_id'].isin(clusters_to_sample)]
    # Concatenate the current stratum cluster sample with the overall cluster sample
    cluster_sample = pd.concat([cluster_sample, stratum_cluster_sample])

print("Stratified Cluster Sample:")
print(cluster_sample)

Stratified Cluster Sample:
    population_id stratum  feature1  feature2
0               1       A         1        11
1               1       A         2        12
2               1       B         3        13
6               3       C         7        17
7               3       C         8        18
9               4       D        10        20
10              4       D        11        21
11              4       D        12        22


In this example:

- We have a dataset with a 'population_id' column representing different clusters and a 'stratum' column representing different strata.
- We want to sample a certain number of clusters from each stratum.
- We group the dataframe by stratum using the groupby function.
- Then, for each stratum, we randomly select clusters and filter the dataframe to include only the selected clusters from that stratum.
- Finally, we concatenate the cluster samples from all strata to obtain the overall stratified cluster sample.

By using stratified cluster sampling, we ensure that the resulting sample represents the diversity of the entire population, with each stratum and cluster being adequately represented in the sample. This approach is particularly useful when there are significant differences between subgroups in the population and when the population is naturally clustered into groups.
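
The loop above can be wrapped in a small reusable helper, sketched below. The function name `stratified_cluster_sample` and its parameters are hypothetical. The sketch adds a guard for strata that contain fewer clusters than requested, since drawing more items than are available without replacement raises an error in NumPy.

```python
import numpy as np
import pandas as pd

def stratified_cluster_sample(frame, stratum_col, cluster_col, n_clusters, seed=None):
    """Pick up to n_clusters clusters at random within each stratum and keep their rows."""
    rng = np.random.default_rng(seed)
    parts = []
    for _, group in frame.groupby(stratum_col):
        clusters = group[cluster_col].unique()
        # Guard: a stratum may contain fewer clusters than requested
        size = min(n_clusters, len(clusters))
        chosen = rng.choice(clusters, size=size, replace=False)
        parts.append(group[group[cluster_col].isin(chosen)])
    return pd.concat(parts)

# Same cluster and stratum layout as the example above
toy = pd.DataFrame({
    'population_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'stratum': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
})

sample = stratified_cluster_sample(toy, 'stratum', 'population_id', n_clusters=1, seed=0)
print(sample)
```

Collecting the per-stratum pieces in a list and concatenating once avoids repeatedly copying the growing dataframe inside the loop.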

## Stratified Cluster Sampling with Imbalanced Data

When dealing with imbalanced datasets, stratified cluster sampling can help ensure that both the minority and majority classes are adequately represented in the sample. Here's how stratified cluster sampling can be analyzed in the case of imbalanced datasets:

1. **Stratification based on Class Distribution**: In the case of an imbalanced dataset, we can stratify the population based on the class distribution. This means that each stratum will contain a mix of samples from all classes, with the proportion of samples from each class reflecting the overall class distribution in the dataset.

2. **Cluster Selection within Strata**: Within each stratum, we can perform cluster sampling to select clusters for inclusion in the sample. By randomly selecting clusters within each stratum, we ensure that the sample is representative of the entire population, with each class being adequately represented in the sample.

3. **Adjustment for Imbalance**: Depending on the severity of the class imbalance, we may need to adjust the cluster selection probabilities to ensure that clusters containing minority classes have a higher chance of being selected. This helps prevent the minority classes from being underrepresented or excluded from the sample.

Suppose we have an imbalanced dataset with a 'class' column representing different classes. Let's illustrate stratified cluster sampling on such a dataset:

In [13]:
data = {
    'population_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'stratum': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
    'class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B']
}

df = pd.DataFrame(data)
df

,population_id,stratum,class
0,1,A,A
1,1,A,A
2,1,B,A
3,2,B,B
4,2,B,B
5,2,C,B
6,3,C,B
7,3,C,B
8,3,D,B
9,4,D,B


Suppose we want to sample 1 cluster from each stratum. First, let us calculate class distribution within each stratum, initialize an empty dataframe to store the cluster sample and then iterate over each stratum:

In [14]:
num_clusters_to_sample_per_stratum = 1
stratum_class_distribution = df.groupby('stratum')['class'].value_counts(normalize=True)
cluster_sample = pd.DataFrame()

for stratum, group in df.groupby('stratum'):
    unique_pop_ids = group['population_id'].unique()
    cluster_probabilities = [stratum_class_distribution.get((stratum, group[group['population_id'] == pop_id]['class'].iloc[0]), 0) for pop_id in unique_pop_ids]
    cluster_probabilities /= np.sum(cluster_probabilities)
    clusters_to_sample = np.random.choice(unique_pop_ids, num_clusters_to_sample_per_stratum, replace=False, p=cluster_probabilities)
    stratum_cluster_sample = group[group['population_id'].isin(clusters_to_sample)]
    cluster_sample = pd.concat([cluster_sample, stratum_cluster_sample])

print("Stratified Cluster Sample:")
print(cluster_sample)

Stratified Cluster Sample:
   population_id stratum class
0              1       A     A
1              1       A     A
3              2       B     B
4              2       B     B
6              3       C     B
7              3       C     B
8              3       D     B


In this example:

- We have an imbalanced dataset with two classes ('A' and 'B') and a 'stratum' column representing different strata.
- We calculate the class distribution within each stratum.
- We set each cluster's selection probability from the within-stratum frequency of the class of its first row. As written, the weight is proportional to that frequency, so clusters whose class is more common in the stratum are more likely to be selected; to favor clusters containing minority classes, the weights would have to be inverted.
- We then randomly select clusters within each stratum using these probabilities.
- Finally, we concatenate the cluster samples from all strata to obtain the overall stratified cluster sample.

Adjusting the cluster selection probabilities based on the class distribution lets us control how strongly each class is represented in the sample, which helps reduce the bias that class imbalance would otherwise introduce into a model trained on it.
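
A sketch of weights that favor minority classes is given below. Each cluster is weighted by the inverse of the within-stratum frequency of its most common class. The helper `inverse_frequency_probabilities` and the use of the cluster's most common class are illustrative assumptions, not the method used in the cell above.

```python
import numpy as np
import pandas as pd

# Same toy data as the example above
toy = pd.DataFrame({
    'population_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'stratum': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
    'class': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
})

def inverse_frequency_probabilities(group):
    """Selection probability per cluster, inversely proportional to the
    within-stratum frequency of the cluster's most common class."""
    class_freq = group['class'].value_counts(normalize=True)
    dominant = group.groupby('population_id')['class'].agg(lambda s: s.mode().iloc[0])
    weights = 1.0 / dominant.map(class_freq)
    return weights / weights.sum()

# Stratum 'B' holds one minority row (cluster 1) and two majority rows (cluster 2)
probs_b = inverse_frequency_probabilities(toy[toy['stratum'] == 'B'])
print(probs_b)

# Draw one cluster from stratum 'B' with these probabilities
rng = np.random.default_rng(0)
chosen_b = rng.choice(probs_b.index.to_numpy(), size=1, p=probs_b.to_numpy())
print(chosen_b)
```

In stratum 'B' the cluster holding the minority row receives twice the probability of the other cluster, which is the opposite of what frequency-proportional weights produce.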

## Machine Learning Algorithms in the Context of Imbalanced Dataset

In [15]:
def experiment1(model, X, y, ts=0.25):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ts, stratify=y)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return classification_report(y_test, y_pred)

In [94]:
def experiment2(name, model, X, y):
    validation = RepeatedStratifiedKFold(n_splits=10, n_repeats=5)
    intervals = {'model': name}
    means = {'model': name}
    for method in ['accuracy', 'precision', 'recall']:
        val_scores = cross_val_score(model, X, y, scoring=method, cv=validation)
        score = np.mean(val_scores)
        sdev = sem(val_scores)
        df = len(val_scores)-1
        intervals.update({method: t.interval(0.95, df, loc=score, scale=sdev)})
        means.update({method: score})
    return means, intervals
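
The interval returned by `experiment2` is the usual Student t interval, mean ± t(0.975, n−1) · SEM. The sketch below reproduces it by hand on a few hypothetical scores to make the formula explicit; the values in `scores` are made up for illustration.

```python
import numpy as np
from scipy.stats import sem, t

# Hypothetical fold scores, for illustration only
scores = np.array([0.90, 0.92, 0.91, 0.89, 0.93])

mean = scores.mean()
standard_error = sem(scores)              # sample std (ddof=1) divided by sqrt(n)
dof = len(scores) - 1
half_width = t.ppf(0.975, dof) * standard_error

# Manual interval and the one computed by scipy should agree
manual_interval = (mean - half_width, mean + half_width)
scipy_interval = t.interval(0.95, dof, loc=mean, scale=standard_error)

print(manual_interval)
print(scipy_interval)
```

Scores from repeated cross-validation share training data and are therefore not independent, so an interval computed this way is likely to be narrower than the true uncertainty and is best read as a rough guide.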

In [95]:
X, y = make_classification(n_samples=5000, n_features=10, n_classes=2, weights=[0.9, 0.1])

Models created with the parameter `class_weight='balanced'` automatically assign class weights that are inversely proportional to the class frequencies in the training data.
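
The 'balanced' heuristic in scikit-learn computes each class weight as `n_samples / (n_classes * count_of_class)`. The sketch below checks that formula against `sklearn.utils.class_weight.compute_class_weight` on hypothetical labels with a 90/10 split; `y_demo` is an illustrative array, not the data generated above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 90/10 class split
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
from_sklearn = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), manual.tolist())))
print(dict(zip(classes.tolist(), from_sklearn.tolist())))
```

The minority class receives a weight nine times larger than the majority class, so errors on it count proportionally more in the loss.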

In [97]:
lr_model = LogisticRegression(class_weight='balanced')
print(experiment1(lr_model, X, y))
print(experiment2('LR', lr_model, X, y))

              precision    recall  f1-score   support

           0       0.98      0.88      0.93      1120
           1       0.46      0.86      0.60       130

    accuracy                           0.88      1250
   macro avg       0.72      0.87      0.77      1250
weighted avg       0.93      0.88      0.90      1250

({'model': 'LR', 'accuracy': 0.9039200000000001, 'precision': 0.5245011961875514, 'recall': 0.8930693815987933}, {'model': 'LR', 'accuracy': (0.9001532717869943, 0.9076867282130058), 'precision': (0.5138666578550104, 0.5351357345200924), 'recall': (0.8829874554659349, 0.9031513077316518)})


In [98]:
svm_model = SVC(class_weight='balanced',kernel='rbf')
print(experiment1(svm_model, X, y))
print(experiment2('SVM',svm_model, X, y))

              precision    recall  f1-score   support

           0       0.98      0.92      0.95      1121
           1       0.56      0.88      0.69       129

    accuracy                           0.92      1250
   macro avg       0.77      0.90      0.82      1250
weighted avg       0.94      0.92      0.93      1250

({'model': 'SVM', 'accuracy': 0.9289200000000001, 'precision': 0.6147530181399504, 'recall': 0.8543061840120663}, {'model': 'SVM', 'accuracy': (0.9255834973973867, 0.9322565026026135), 'precision': (0.6008561009822374, 0.6286499352976634), 'recall': (0.8390064652032819, 0.8696059028208506)})


In [99]:
dt_model = DecisionTreeClassifier(class_weight='balanced')
print(experiment1(dt_model, X, y))
print(experiment2('DT', dt_model, X, y))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1120
           1       0.72      0.76      0.74       130

    accuracy                           0.94      1250
   macro avg       0.85      0.86      0.86      1250
weighted avg       0.95      0.94      0.95      1250

({'model': 'DT', 'accuracy': 0.9455600000000001, 'precision': 0.7377931203230931, 'recall': 0.7285972850678735}, {'model': 'DT', 'accuracy': (0.9432539169243505, 0.9478660830756497), 'precision': (0.7214663656676171, 0.754119874978569), 'recall': (0.7123256563526125, 0.7448689137831345)})


In [103]:
knn_model = KNeighborsClassifier(n_neighbors=1,weights='distance')
print(experiment1(knn_model, X, y))
print(experiment2('KNN', knn_model, X, y))

              precision    recall  f1-score   support

           0       0.96      0.97      0.97      1121
           1       0.73      0.65      0.69       129

    accuracy                           0.94      1250
   macro avg       0.85      0.81      0.83      1250
weighted avg       0.94      0.94      0.94      1250

({'model': 'KNN', 'accuracy': 0.93928, 'precision': 0.7256072876054183, 'recall': 0.6513574660633484}, {'model': 'KNN', 'accuracy': (0.9361189449438391, 0.9424410550561609), 'precision': (0.7087240571597557, 0.7424905180510808), 'recall': (0.6336884563108381, 0.6690264758158587)})


In [100]:
rf_model = OneVsRestClassifier(RandomForestClassifier(n_estimators=10))
print(experiment1(rf_model, X, y))
print(experiment2('RF',rf_model, X, y))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1121
           1       0.93      0.75      0.83       129

    accuracy                           0.97      1250
   macro avg       0.95      0.87      0.91      1250
weighted avg       0.97      0.97      0.97      1250

({'model': 'RF', 'accuracy': 0.96268, 'precision': 0.8875801627400535, 'recall': 0.7521417797888387}, {'model': 'RF', 'accuracy': (0.9609666797154833, 0.9643933202845166), 'precision': (0.8754538109253168, 0.8997065145547901), 'recall': (0.7351807488592992, 0.7691028107183782)})


In [101]:
ab_model = OneVsRestClassifier(AdaBoostClassifier(algorithm='SAMME'))
print(experiment1(ab_model, X, y))
print(experiment2('AdaBoost',ab_model, X, y))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1121
           1       0.76      0.78      0.77       129

    accuracy                           0.95      1250
   macro avg       0.87      0.88      0.87      1250
weighted avg       0.95      0.95      0.95      1250

({'model': 'AdaBoost', 'accuracy': 0.9516400000000002, 'precision': 0.7491489005196972, 'recall': 0.8007088989441931}, {'model': 'AdaBoost', 'accuracy': (0.9488895429871295, 0.9543904570128708), 'precision': (0.7373338017768587, 0.7609639992625357), 'recall': (0.7851675516760798, 0.8162502462123064)})


In [107]:
nn_model = keras.Sequential([
      keras.layers.Dense(32, activation='relu', input_shape=(X.shape[-1],)),
      keras.layers.Dropout(0.25),
      keras.layers.Dense(1, activation='sigmoid')
  ])

nn_model.compile(
    optimizer=keras.optimizers.Adamax(learning_rate=1e-1),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()]
)



In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y.reshape(y.shape[-1],1,), test_size=0.25, stratify=y)
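# Caution: these weights emphasize the majority class 0; to counteract the imbalance,
# the minority class 1 would need the larger weight, e.g. {0: 0.1, 1: 0.9}.
nn_model.fit(X_train, y_train, epochs=200, batch_size=50, class_weight={0: 0.95, 1: 0.05})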

Epoch 1/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - binary_accuracy: 0.9030 - loss: 0.0471
Epoch 2/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - binary_accuracy: 0.9180 - loss: 0.0184
Epoch 3/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - binary_accuracy: 0.9053 - loss: 0.0197
Epoch 4/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - binary_accuracy: 0.9304 - loss: 0.0153
Epoch 5/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - binary_accuracy: 0.9106 - loss: 0.0194
Epoch 6/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - binary_accuracy: 0.9316 - loss: 0.0157
Epoch 7/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - binary_accuracy: 0.9213 - loss: 0.0180
Epoch 8/200
75/75 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - binary_accuracy: 0.9289 - los

<keras.src.callbacks.history.History at 0x763de16489b0>

In [109]:
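# Caution: the network has a single sigmoid output, so argmax over axis=1 is always 0;
# thresholding, e.g. (nn_model.predict(X_test) > 0.5).astype(int).ravel(), would yield 0/1 labels.
y_pred = np.argmax(nn_model.predict(X_test),axis=1)
print(classification_report(y_test.reshape(y_test.shape[0]), y_pred))

40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step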
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      1121
           1       0.00      0.00      0.00       129

    accuracy                           0.90      1250
   macro avg       0.45      0.50      0.47      1250
weighted avg       0.80      0.90      0.85      1250





In [111]:
1121/(1121+129)

0.8968

In [110]:
confusion_matrix(y_test,y_pred)

array([[1121,    0],
       [ 129,    0]])
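
The all-zero predictions in the report and confusion matrix are what `np.argmax(..., axis=1)` returns for any output with a single column, regardless of what the network has learned. The sketch below illustrates this with hypothetical sigmoid outputs in plain NumPy and contrasts it with thresholding at 0.5; the values in `probabilities` are made up.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as produced by a
# network whose last layer is Dense(1, activation='sigmoid')
probabilities = np.array([[0.05], [0.80], [0.40], [0.95]])

# argmax over an axis of length 1 can only return index 0
labels_argmax = np.argmax(probabilities, axis=1)

# Thresholding the probability gives the intended 0/1 labels
labels_threshold = (probabilities.ravel() >= 0.5).astype(int)

print(labels_argmax)
print(labels_threshold)
```

The 0.5 threshold is a conventional default; with imbalanced classes a different cut-off may give a better precision/recall trade-off.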

In [112]:
table = [
    {'model': 'LR', 'accuracy': 0.9039200000000001, 'precision': 0.5245011961875514, 'recall': 0.8930693815987933},
    {'model': 'SVM', 'accuracy': 0.9289200000000001, 'precision': 0.6147530181399504, 'recall': 0.8543061840120663}, 
    {'model': 'DT', 'accuracy': 0.9455600000000001, 'precision': 0.7377931203230931, 'recall': 0.7285972850678735},
    {'model': 'RF', 'accuracy': 0.96268, 'precision': 0.8875801627400535, 'recall': 0.7521417797888387}, 
    {'model': 'AdaBoost', 'accuracy': 0.9516400000000002, 'precision': 0.7491489005196972, 'recall': 0.8007088989441931},
    {'model': 'KNN', 'accuracy': 0.93928, 'precision': 0.7256072876054183, 'recall': 0.6513574660633484},
    # NN: class-1 precision and recall as shown in the classification report above
    {'model': 'NN', 'accuracy': 0.8968, 'precision': 0.0, 'recall': 0.0},
]

pd.DataFrame(table)

,model,accuracy,precision,recall
0,LR,0.90392,0.524501,0.893069
1,SVM,0.92892,0.614753,0.854306
2,DT,0.94556,0.737793,0.728597
3,RF,0.96268,0.88758,0.752142
4,AdaBoost,0.95164,0.749149,0.800709
5,KNN,0.93928,0.725607,0.651357
6,NN,0.8968,0.0,0.0


## A Real Dataset

The dataset is taken from [Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).

In [24]:
data = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')
data

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [30]:
# Drop both 'Time' and 'Amount' so that only V1..V28 and Class remain
del data['Time']
del data['Amount']

In [31]:
data

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0
1,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,0
2,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,4.356170,...,1.475829,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0
284803,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,-0.975926,...,0.059616,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,0
284804,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,-0.484782,...,0.001396,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,0
284805,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,-0.399126,...,0.127434,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,0


In [33]:
# Select features and target by name rather than by position
data_x, data_y = data.drop(columns='Class'), data['Class']

In [35]:
lr_model = LogisticRegression(max_iter=10000, solver='liblinear', class_weight='balanced')
print(experiment1(lr_model, data_x, data_y))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99     71079
           1       0.06      0.93      0.11       123

    accuracy                           0.97     71202
   macro avg       0.53      0.95      0.55     71202
weighted avg       1.00      0.97      0.98     71202
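
The `class_weight='balanced'` option used above reweights each class by `n_samples / (n_classes * class_count)`. The following is a minimal NumPy sketch of that heuristic, not scikit-learn's own code. The helper name `balanced_class_weights` is hypothetical, and the label counts are borrowed from the test support in the report above purely for illustration; the weights actually used are computed on the training split.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights inversely proportional to class frequency (the 'balanced' heuristic)."""
    counts = np.bincount(y)                    # samples per class
    return len(y) / (len(counts) * counts)     # n_samples / (n_classes * count)

# Illustrative labels only: class counts taken from the support column above (assumption)
y_demo = np.array([0] * 71079 + [1] * 123)
weights = balanced_class_weights(y_demo)
print(weights)
```

With these counts a fraud example weighs several hundred times more than a legitimate one, which may help explain the pattern above: high recall on class 1 at the cost of very low precision.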



In [36]:
svm_model = SVC(class_weight='balanced', kernel='rbf')
print(experiment1(svm_model, data_x, data_y))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.29      0.77      0.43       123

    accuracy                           1.00     71202
   macro avg       0.65      0.88      0.71     71202
weighted avg       1.00      1.00      1.00     71202



In [37]:
dt_model = DecisionTreeClassifier(class_weight='balanced')
print(experiment1(dt_model, data_x, data_y))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.75      0.69      0.72       123

    accuracy                           1.00     71202
   macro avg       0.87      0.85      0.86     71202
weighted avg       1.00      1.00      1.00     71202
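
The F1 values in these reports are the harmonic mean of precision and recall, and the `macro avg` row is the unweighted mean over the two classes. A small sketch that recomputes the decision-tree numbers from the rounded values printed above (the helper name `f1_from` is hypothetical; small discrepancies against other reports can arise because the printed precision and recall are already rounded):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Class 1 (fraud) row of the decision-tree report above
f1_fraud = f1_from(0.75, 0.69)

# Macro average: plain mean of the per-class F1 scores (class 0 is printed as 1.00)
macro_f1 = (1.00 + f1_fraud) / 2

print(round(f1_fraud, 2), round(macro_f1, 2))  # 0.72 0.86
```

Because the macro average treats both classes equally, it is far more sensitive to the minority class than accuracy or the weighted average, which stay at 1.00 throughout.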



In [88]:
knn_model = KNeighborsClassifier(n_neighbors=3)
print(experiment1(knn_model, data_x, data_y))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.98      0.82      0.89       123

    accuracy                           1.00     71202
   macro avg       0.99      0.91      0.95     71202
weighted avg       1.00      1.00      1.00     71202



In [116]:
rf_model = OneVsOneClassifier(RandomForestClassifier(n_estimators=10))
print(experiment1(rf_model, data_x, data_y))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.95      0.72      0.82       123

    accuracy                           1.00     71202
   macro avg       0.97      0.86      0.91     71202
weighted avg       1.00      1.00      1.00     71202



In [117]:
ab_model = OneVsRestClassifier(AdaBoostClassifier())
print(experiment1(ab_model, data_x, data_y))



              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.84      0.66      0.74       123

    accuracy                           1.00     71202
   macro avg       0.92      0.83      0.87     71202
weighted avg       1.00      1.00      1.00     71202



In [119]:
# Macro-averaged precision, recall and F1 (in percent) from the reports above
table = [
    {'model': 'LR', 'precision': 53, 'recall': 95, 'f1': 55},
    {'model': 'SVM', 'precision': 65, 'recall': 88, 'f1': 71},
    {'model': 'DT', 'precision': 87, 'recall': 85, 'f1': 86},
    {'model': 'RF', 'precision': 97, 'recall': 86, 'f1': 91},
    {'model': 'KNN', 'precision': 99, 'recall': 91, 'f1': 95},
    {'model': 'AdaBoost', 'precision': 92, 'recall': 83, 'f1': 87}
]

pd.DataFrame(table)

Unnamed: 0,model,precision,recall,f1
0,LR,53,95,55
1,SVM,65,88,71
2,DT,87,85,86
3,RF,97,86,91
4,KNN,99,91,95
5,AdaBoost,92,83,87


In [79]:
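nn_model = keras.Sequential([
    # Declare the input shape explicitly instead of passing input_dim to the first Dense layer
    keras.Input(shape=(data_x.shape[1],)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.5),
    # Note: softmax is unusual for a hidden layer; ReLU is the more common choice
    keras.layers.Dense(32, activation='softmax'),
    keras.layers.Dropout(0.25),
    # Single sigmoid unit: outputs the estimated P(Class == 1)
    keras.layers.Dense(1, activation='sigmoid')
])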

nn_model.compile(
    optimizer=keras.optimizers.Adamax(learning_rate=5e-2),
    loss=keras.losses.BinaryFocalCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()]
)
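
The focal loss chosen here down-weights examples the model already classifies confidently, so the rare fraud cases contribute more to the gradient. Below is a minimal NumPy sketch of the basic formula `FL = -(1 - p_t)^gamma * log(p_t)`, where `p_t` is the probability assigned to the true class. It is not the Keras implementation: the function name is hypothetical, `gamma=2.0` is an assumed, commonly cited setting, and the optional class-balancing term is omitted.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, eps=1e-7):
    """Per-example focal loss for binary labels (no alpha balancing)."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    p_t = np.where(y_true == 1, p, 1 - p)    # probability given to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

# A confidently correct negative versus a confidently wrong positive
easy = binary_focal_loss([0], [0.01])[0]
hard = binary_focal_loss([1], [0.01])[0]
print(easy, hard)
```

Setting `gamma=0` recovers ordinary binary cross-entropy, which is a quick sanity check for the sketch.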


In [80]:
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.25, stratify=data_y)
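
Passing `stratify=data_y` keeps the fraud rate the same in both splits, which matters when only a tiny fraction of rows are positive. The idea can be sketched in NumPy as follows: shuffle each class separately and take the same fraction from each. This is an illustration under simplifying assumptions, not scikit-learn's exact allocation rule; `stratified_split_indices` and the toy labels are hypothetical.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions in both."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_parts = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))  # shuffle within the class
        n_test = int(round(test_size * len(idx)))           # same fraction per class
        test_parts.append(idx[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Toy labels: 1% positives
y_demo = np.array([0] * 396 + [1] * 4)
train_idx, test_idx = stratified_split_indices(y_demo)
print(len(train_idx), len(test_idx), y_demo[test_idx].sum())
```

A purely random split of such data could easily leave the test set with no positives at all, making recall on the fraud class undefined.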

In [81]:
nn_model.fit(X_train, y_train, epochs=50, batch_size=100)

Epoch 1/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9963 - loss: 0.0062
Epoch 2/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9988 - loss: 0.0020
Epoch 3/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9989 - loss: 0.0016
Epoch 4/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9989 - loss: 0.0017
Epoch 5/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9987 - loss: 0.0019
Epoch 6/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9989 - loss: 0.0015
Epoch 7/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_accuracy: 0.9989 - loss: 0.0015
Epoch 8/50
2137/2137 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - binary_

<keras.src.callbacks.history.History at 0x763dbff29910>
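
The step counts in the Keras progress bars follow from the number of rows and the batch size: one step per batch, with a final partial batch rounded up. A short sketch of that arithmetic, assuming the training split is about three times the 71202-row test split (from `test_size=0.25`) and that `predict` uses its default batch size of 32; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover n_samples, counting a partial last batch."""
    return math.ceil(n_samples / batch_size)

n_test = 71202          # support reported for the held-out split
n_train = 3 * n_test    # assumption: 75% / 25% split

print(steps_per_epoch(n_train, 100))  # 2137, as in the training log
print(steps_per_epoch(n_test, 32))    # 2226, as reported by predict
```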

In [82]:
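# predict() returns shape (n, 1) for the sigmoid output, so np.argmax(..., axis=1) is always 0
# (the all-zero predictions recorded below); threshold the probability at 0.5 instead.
y_pred = (nn_model.predict(X_test) > 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred))

2226/2226 ━━━━━━━━━━━━━━━━━━━━ 2s 702us/step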
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.00      0.00      0.00       123

    accuracy                           1.00     71202
   macro avg       0.50      0.50      0.50     71202
weighted avg       1.00      1.00      1.00     71202
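
A zero recall for class 1 is what one gets when `np.argmax(..., axis=1)` is applied to a single-column sigmoid output: with only one column, the index of the maximum is always 0. A small sketch with made-up probabilities (not output from `nn_model`) contrasting that with thresholding at 0.5:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), as a 1-unit output layer would return
probs = np.array([[0.02], [0.97], [0.40], [0.81]])

always_zero = np.argmax(probs, axis=1)        # index of the only column: always 0
labels = (probs[:, 0] >= 0.5).astype(int)     # threshold the probability instead

print(always_zero)  # [0 0 0 0]
print(labels)       # [0 1 0 1]
```

`argmax` is the right tool for a softmax layer with one column per class; for a single sigmoid unit the threshold plays that role, and it can be lowered if recall on the rare class matters more than precision.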




In [83]:
confusion_matrix(y_test,y_pred)

array([[71079,     0],
       [  123,     0]])
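
The confusion matrix above makes the accuracy paradox concrete: every row is predicted as class 0, yet accuracy is still about 99.8%. A plain-Python sketch that recomputes the per-class numbers from that matrix (rows are true classes, columns are predicted classes; `per_class_metrics` is a hypothetical helper, and undefined ratios are set to 0 here):

```python
def per_class_metrics(cm):
    """Return a (precision, recall) pair for each class of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return metrics

cm = [[71079, 0],
      [123, 0]]

metrics = per_class_metrics(cm)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(metrics, round(accuracy, 4))
```

Accuracy is dominated by the 71079 legitimate transactions, so the macro-averaged recall of 0.50 is the more informative summary of this classifier.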