<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

## Isolation Forest

We first generate some sample data in two clusters (each one containing n_samples) by randomly sampling the standard normal distribution as returned by numpy.random.randn. One of them is spherical and the other one is slightly deformed.

For consistency with the IsolationForest notation, the inliers (i.e. the gaussian clusters) are assigned a ground truth label 1 whereas the outliers (created with numpy.random.uniform) are assigned the label -1.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_outliers = 120, 40
rng = np.random.RandomState(0)
covariance = np.array([[0.5, -0.1], [0.7, 0.4]])
cluster_1 = 0.4 * rng.randn(n_samples, 2) @ covariance + np.array([2, 2])  # general
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])  # spherical
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones((n_outliers), dtype=int)]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

We can visualize the resulting clusters:

In [None]:
import matplotlib.pyplot as plt

scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")
plt.show()

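Next, we train the Isolation Forest with 1000 estimators, a contamination of 0.2 and a sub-sampling size (max_samples) of 100. The commented-out line shows the alternative configuration with the default 100 estimators and the default contamination.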

In [None]:
from sklearn.ensemble import IsolationForest

#clf = IsolationForest(n_estimators=100, max_samples=100, contamination='auto', random_state=0)
clf = IsolationForest(n_estimators=1000, max_samples=100, contamination=0.2, random_state=0)
clf.fit(X_train)

We use the class DecisionBoundaryDisplay to visualize a discrete decision boundary. The background color represents whether a sample in that given area is predicted to be an outlier or not. The scatter plot displays the true labels.

In [None]:
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.show()

By setting the response_method="decision_function", the background of the DecisionBoundaryDisplay represents the score given by the path length averaged over a forest of random trees.

When a forest of random trees collectively produces short path lengths for isolating particular samples, those samples are highly likely to be anomalies and receive low scores. Conversely, long path lengths correspond to high scores and indicate likely inliers. In scikit-learn, decision_function returns negative values for predicted outliers and positive values for predicted inliers.
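
The relationship between the different scores can be checked on a small example. The sketch below uses an illustrative synthetic dataset (`X_demo`, `clf_demo` and the chosen parameters are assumptions made for this example, not part of the practical) to show that `decision_function` is `score_samples` shifted by the fitted `offset_`, and that `predict` returns -1 exactly where the shifted score is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data (an assumption for this sketch): one tight cluster plus scattered points.
rng = np.random.RandomState(0)
X_demo = np.concatenate([
    0.3 * rng.randn(200, 2),
    rng.uniform(low=-4, high=4, size=(20, 2)),
])

clf_demo = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
clf_demo.fit(X_demo)

raw = clf_demo.score_samples(X_demo)          # lower = more abnormal
shifted = clf_demo.decision_function(X_demo)  # raw score minus the fitted offset
pred = clf_demo.predict(X_demo)               # -1 for outliers, 1 for inliers

# decision_function is score_samples shifted so that the threshold sits at 0.
print(np.allclose(shifted, raw - clf_demo.offset_))
# predict flags exactly the samples whose shifted score is negative.
print(np.array_equal(pred, np.where(shifted < 0, -1, 1)))
```

The `contamination` setting only moves `offset_`; the underlying path-length scores returned by `score_samples` do not depend on it.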

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.colorbar(disp.ax_.collections[1])
plt.show()

## Exercise
Change the hyperparameters of the isolation forest and observe what happens to the decision boundaries.
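
As one possible starting point, the sketch below sweeps the `contamination` hyperparameter on an illustrative synthetic dataset and counts how many points are flagged each time. The data, the variable names and the list of values tried are assumptions for this example; the same loop could be adapted to `n_estimators` or `max_samples`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data (an assumption for this sketch), not the clusters used above.
rng = np.random.RandomState(0)
X_demo = np.concatenate([
    0.3 * rng.randn(240, 2),
    rng.uniform(low=-4, high=4, size=(40, 2)),
])

flagged_counts = {}
for contamination in (0.05, 0.1, 0.2, 0.3):
    clf_demo = IsolationForest(
        n_estimators=100,
        max_samples=100,
        contamination=contamination,
        random_state=0,
    )
    # fit_predict returns -1 for predicted outliers and 1 for predicted inliers.
    pred = clf_demo.fit_predict(X_demo)
    flagged_counts[contamination] = int((pred == -1).sum())
    print(f"contamination={contamination}: "
          f"{flagged_counts[contamination]} of {len(X_demo)} points flagged")
```

A larger contamination value moves the decision threshold so that more of the training points fall on the outlier side, which shrinks the region predicted as inliers.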

## Isolation Forest for Credit Card Fraud Detection

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation, as well as the time of transaction and the amount transacted.

We first import the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import IsolationForest

The data is in the file creditcard.csv. Write code to read the file and display the first 5 rows

<details>
<summary>
    Click here to see code
</summary>
    
    
```
df=pd.read_csv('datasets/creditcard.csv')
df.head(5)
```
</details>

In [None]:
#Enter code here


Run the code below to examine the data distribution. Most of the data are valid transactions, with only a small percentage of fraud cases

In [None]:
print(df.Class.value_counts())
df.Class.value_counts().plot(kind='bar')

As a first step, we can train the Isolation Forest using only the non-fraudulent class. For validation, we will use data from both fraudulent and non-fraudulent classes

In [None]:
seed = 1337
from sklearn.model_selection import train_test_split
def get_data(df, clean_train=True):
    """
        clean_train=True returns a train sample that only contains clean samples.
        Otherwise, it will return a subset of each class in train and test (10% outlier)
    """
    clean = df[df.Class == 0].copy().reset_index(drop=True)
    fraud = df[df.Class == 1].copy().reset_index(drop=True)
    print(f'Clean Samples: {len(clean)}, Fraud Samples: {len(fraud)}')

    if clean_train:
        train, test_clean = train_test_split(clean, test_size=len(fraud), random_state=seed)
        print(f'Train Samples: {len(train)}')

        test = pd.concat([test_clean, fraud]).reset_index(drop=True)

        print(f'Test Samples: {len(test)}')

        # shuffle the test data
        test = test.sample(frac=1, random_state=seed).reset_index(drop=True)
        
        train_X, train_y = train.loc[:, ~train.columns.isin(['Class'])], train.loc[:, train.columns.isin(['Class'])]
        test_X, test_y = test.loc[:, ~test.columns.isin(['Class'])], test.loc[:, test.columns.isin(['Class'])]
    else:
        clean_train, clean_test = train_test_split(clean, test_size=int(len(fraud)+(len(fraud)*0.9)), random_state=seed)
        fraud_train, fraud_test = train_test_split(fraud, test_size=int(len(fraud)*0.1), random_state=seed)
        print(len(clean_train))
        print(len(fraud_train))
        
        train_samples = pd.concat([clean_train, fraud_train]).reset_index(drop=True)
        test_samples = pd.concat([clean_test, fraud_test]).reset_index(drop=True)
        
        # shuffle (sample returns a new DataFrame, so the result must be assigned)
        train_samples = train_samples.sample(frac=1, random_state=seed).reset_index(drop=True)
        
        print(f'Train Samples: {len(train_samples)}')
        test_samples = test_samples.sample(frac=1, random_state=seed).reset_index(drop=True)
        
        print(f'Test Samples: {len(test_samples)}')
        train_X, train_y = train_samples.loc[:, ~train_samples.columns.isin(['Class'])], train_samples.loc[:, train_samples.columns.isin(['Class'])]
        test_X, test_y = test_samples.loc[:, ~test_samples.columns.isin(['Class'])], test_samples.loc[:, test_samples.columns.isin(['Class'])]
    
    return train_X, train_y, test_X, test_y



In [None]:
train_X, train_y, test_X, test_y = get_data(df)

model = IsolationForest(random_state=seed)
model.fit(train_X)

We can now make the predictions and print the classification report

In [None]:
from sklearn.metrics import classification_report
def predict(X):
    test_yhat = model.predict(X)
    # values are -1 and 1 (-1 for outliers and 1 for inliers), thus we will map it to 0 (inlier) and 1 (outlier) as this is our target variable
    test_yhat = np.array([1 if y == -1 else 0 for y in test_yhat])
    return test_yhat

test_yhat = predict(test_X)

In [None]:
def get_classification_report(test_y, test_yhat):
    labels = ['Legitimate','Fraudulent']
    print(classification_report(test_y, test_yhat, target_names=labels))
    
get_classification_report(test_y, test_yhat)

As seen in the classification report, the f1-score for the fraudulent class is 0.89, which is quite good. Let us see what happens if our training data contains both fraudulent and non-fraudulent data.

Write code to perform isolation forest on training data that contain both fraudulent and non-fraudulent data

<details>
<summary>
    Click here to see code
</summary>
    
    
```
train_X, train_y, test_X, test_y = get_data(df, clean_train=False)
model = IsolationForest(random_state=seed)
model.fit(train_X)
```
</details>

In [None]:
#insert code here


Write code to obtain the classification report for the model above

<details>
<summary>
    Click here to see code
</summary>
     
```
test_yhat = predict(test_X)
get_classification_report(test_y, test_yhat)
```
</details>

In [None]:
#insert code here


The f1-score for fraudulent cases has dropped to 0.67, although the f1-score for legitimate cases has increased. This indicates that including the outliers in the training set has shifted the threshold so that legitimate points are identified better, at the expense of the fraudulent cases.

## Mobile Payment Fraud Detection with some Feature Engineering and Explainability

Let us now try a synthetic dataset and add some feature engineering and explainability to the anomaly detection model. The dataset here was generated by a program called PaySim, which simulates mobile payments based on “aggregated transactional data” from a real company. Real transactional data is confidential, so publicly available datasets are difficult to obtain. The dataset comprises a total of 6,362,620 transactions which occurred over a simulated time span of 30 days.

The data is stored in the file financial.csv. Based on the code above, read the file into a pandas dataframe and display the first five rows.

<details><summary>Click here for answer</summary> 
<br/>

```
df=pd.read_csv('datasets/financial.csv')
df.head(5)
```
</details>


In [None]:
# Enter code here

There is both numerical data, such as the ‘amount’ and ‘oldbalanceOrg’ fields, as well as categorical data, such as the ‘type’ and ‘nameOrig’ fields. These records contain information about the original and the new account balance of each of the two parties involved in the transaction (origin and destination), as well as a separate record of the exact amount (supposed to be) transferred. The ‘step’ field denotes the number of hours passed since the start of the simulation. The ‘isFraud’ column tells us which transactions are indeed fraudulent, whereas the ‘isFlaggedFraud’ column is a simple indicator variable for whether the amount transferred in a given transaction exceeds the threshold of 200,000. This latter field represents the rule based strategy mentioned in the lecture.


All numerical features can easily be used as inputs to the model, so the fields ‘amount’, ‘oldbalanceOrg’, ‘newbalanceOrig’, ‘oldbalanceDest’ and ‘newbalanceDest’ will be used as features as they are.

In [None]:
features = pd.DataFrame(index=df.index)
numerical_columns = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
features[numerical_columns] = df[numerical_columns]

Since the ‘amount’ field seems to sometimes deviate from the difference between the original and the new balances of one or both of the transaction parties, we will include these differences in the data as two additional features: ‘changebalanceOrig’ and ‘changebalanceDest’.

Try to write some code to create the above data and store it in the 'changebalanceOrig' and 'changebalanceDest' columns of the features dataset

<details>
<summary>
    Click here to see code
</summary>
     
```

features['changebalanceOrig'] = features['newbalanceOrig'] - features['oldbalanceOrg']
features['changebalanceDest'] = features['newbalanceDest'] - features['oldbalanceDest']

```
</details>

In [None]:
# Write code here

Since the ‘step’ field gives us the relative timestamps of all transactions in an hourly resolution, we can derive the (hourly) time of the day when the transaction occurred. To do this we simply transform the ‘step’ field by applying the modulo of 24. Try to write the code for this and store the data in the 'hour' column of the features dataset.

<details>
<summary>
    Click here to see code
</summary>
     
```
features['hour'] = df['step'] % 24

```
</details>

In [None]:
# Write code here

Finally, we want to make use of the information provided in the ‘type’ column. Since our model will only be able to use numerical data, and since there is no logical ordering of the values that the ‘type’ field can assume, we will proceed by one-hot encoding the field into 5 columns, one for each of the possible values of ‘type’. The binary values in the columns indicate whether the content of ‘type’ is equal to the column’s corresponding value. The matrix of one-hot encodings is then appended to our feature matrix.

In [None]:
type_one_hot = pd.get_dummies(df['type'])
features = pd.concat([features, type_one_hot], axis=1)

We are now ready to use Isolation Forest for anomaly detection. To function as a fraud detection system that is as general as possible, we want the following properties in our model:

- Makes no assumptions about what an anomaly looks like.
- Does not require any flagged data (labels).
- Provides a continuous anomaly score, such that the number of identified anomalies can be adjusted depending on the desired strictness.

Isolation forest fulfills all of the above requirements and relies on two simple assumptions: Anomalies are few, and anomalies are different.
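
The "few and different" assumption can be illustrated with a simplified sketch of the isolation idea: repeatedly split the data at a random value of a random feature, keep the side containing the point of interest, and count how many splits are needed until the point is alone. This is an illustration only and not scikit-learn's implementation; the helper `isolation_depth` and the synthetic data are hypothetical names introduced for this example.

```python
import numpy as np

def isolation_depth(data, point, rng, max_depth=50):
    """Count the random splits needed to isolate `point` (a row of `data`)."""
    depth = 0
    current = data
    while len(current) > 1 and depth < max_depth:
        feature = rng.integers(current.shape[1])
        lo, hi = current[:, feature].min(), current[:, feature].max()
        if lo == hi:
            break  # nothing left to split on this feature
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point.
        if point[feature] < split:
            current = current[current[:, feature] < split]
        else:
            current = current[current[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
normal_points = rng.normal(size=(256, 2))   # dense cluster around the origin
outlier = np.array([[6.0, 6.0]])            # a single point far from the cluster
data = np.vstack([normal_points, outlier])

# Compare the outlier with the cluster point closest to the origin.
inlier_idx = np.argmin(np.linalg.norm(normal_points, axis=1))
mean_depth_outlier = np.mean([isolation_depth(data, data[-1], rng) for _ in range(200)])
mean_depth_inlier = np.mean([isolation_depth(data, data[inlier_idx], rng) for _ in range(200)])

print(f"mean splits to isolate the outlier: {mean_depth_outlier:.1f}")
print(f"mean splits to isolate the inlier:  {mean_depth_inlier:.1f}")
```

The point far from the cluster is separated after only a few random splits, whereas a point in the dense region needs many more. The isolation forest turns this average path length into its anomaly score.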

Try to use the code in the previous section to train the Isolation Forest

<details><summary>Click here for answer</summary> 
<br/>

```
from sklearn.ensemble import IsolationForest
forest = IsolationForest(random_state=0)
forest.fit(features)
```
</details>

In [None]:
# Enter code here

To get a continuous anomaly score for each data point, rather than a binary anomaly indicator that would depend on an arbitrarily chosen threshold value, we call the score_samples method.

In [None]:
scores = forest.score_samples(features)
print(scores)

In [None]:
# plot anomaly score distribution
plt.hist(scores, bins=50)
plt.ylabel('Number of transactions', fontsize=15)
plt.xlabel('Anomaly score', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

The raw output of the Isolation Forest is not a split of the dataset into anomalies and non-anomalies, but rather a list of continuous anomaly scores, one for every entry. This means that, depending on how many anomalies we want to detect (how wide we want to cast our net), we can set a different threshold which determines the data points that are considered as anomalies (i.e. data points with scores below the threshold). A lower anomaly score means that there is a higher chance that the data point is an anomaly.
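
As a minimal sketch of this thresholding idea, the example below picks the threshold as a quantile of the scores, so that roughly a chosen fraction of the points is flagged. The synthetic data, the variable names and the fractions 1% and 5% are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data (an assumption for this sketch): a normal cloud plus scattered points.
rng = np.random.RandomState(0)
X_demo = np.concatenate([
    rng.randn(1000, 3),
    rng.uniform(low=-6, high=6, size=(20, 3)),
])
scores_demo = IsolationForest(random_state=0).fit(X_demo).score_samples(X_demo)

n_flagged = {}
for fraction in (0.01, 0.05):
    # Scores below the chosen quantile are treated as anomalies.
    threshold = np.quantile(scores_demo, fraction)
    is_anomaly = scores_demo < threshold
    n_flagged[fraction] = int(is_anomaly.sum())
    print(f"fraction={fraction}: threshold={threshold:.3f}, "
          f"{n_flagged[fraction]} points flagged")
```

Raising the fraction casts a wider net: more points are flagged, at the cost of more false alarms.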

Run the code below to examine the top 5 outlier points identified by the isolation forest model.

In [None]:
top_n_outliers = 5
top_n_outlier_indices = np.argpartition(scores, top_n_outliers)[:top_n_outliers].tolist()
top_outlier_features = features.iloc[top_n_outlier_indices, :]
top_outlier_features


In [None]:
#alternatively
index2=np.argsort(scores)[:5]
top_outlier_features = features.iloc[index2, :]
top_outlier_features

One way to evaluate the result of our model without choosing one particular threshold is by computing the area under the ROC curve of the model output.

To have a baseline to compare the isolation forest to, we will use a naive method for anomaly detection in this dataset, which consists of treating the amount of money transferred as the anomaly score, where higher amounts represent a higher chance of being an anomaly. We will take this approach to compute a naive ROC area under the curve.

Finally, we will also add the AUC score that would be obtained by random guessing.

The predicted anomalies are evaluated against the 'isFraud' column, which represents the ground truth value of whether the given entry constitutes an anomaly or not.
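
Note that roc_auc_score expects higher scores to indicate the positive class, whereas the isolation forest assigns lower scores to anomalies, so the scores have to be negated before evaluation. The toy example below, with made-up labels and scores (all values are assumptions for illustration), shows the effect of the sign.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth: 1 marks an anomaly.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0])
# Made-up isolation-forest-style scores: lower means more anomalous.
scores_demo = np.array([-0.40, -0.42, -0.45, -0.41, -0.70, -0.43, -0.55, -0.60])

auc_raw = roc_auc_score(y_true, scores_demo)       # wrong orientation
auc_negated = roc_auc_score(y_true, -scores_demo)  # higher = more anomalous

print(f"AUC with raw scores:     {auc_raw:.3f}")
print(f"AUC with negated scores: {auc_negated:.3f}")
```

The two values add up to 1, so an AUC well below 0.5 is often a sign that the score orientation is reversed rather than that the model is worse than random guessing.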

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_curve
from sklearn.metrics import precision_recall_curve, roc_auc_score
# evaluate isolation forest anomaly scores
fpr_iforest, tpr_iforest, thresholds_iforest = roc_curve(df['isFraud'], -scores)
auc_score_iforest = roc_auc_score(df['isFraud'], -scores)

# evaluate naive (amount) anomaly scores
fpr_naive, tpr_naive, thresholds_naive = roc_curve(df['isFraud'], df['amount'])
auc_score_naive = roc_auc_score(df['isFraud'], df['amount'])

In [None]:
def plot_roc_curve(fpr, tpr, name, auc_score):
    plt.plot(fpr, tpr, label=name + ', AUC={}'.format(round(auc_score, 3)))

In [None]:
plot_roc_curve(fpr_iforest, tpr_iforest, 'Isolation Forest', auc_score_iforest)
plot_roc_curve(fpr_naive, tpr_naive, 'Naive', auc_score_naive)
plot_roc_curve([0, 1], [0, 1], 'Random guessing', 0.5)
plt.xlabel('False positive rate', fontsize=15)
plt.ylabel('True positive rate (recall)', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(prop={'size': 12})

What do the results above tell us?

We will now try to understand how the model generates the output. The explanation model we are going to use is called SHAP. For our purposes, what we need to understand about SHAP is that the explanation values it provides tell us about the effect that the value of a feature of a particular data point had on its associated anomaly score. In other words, if we look at a particular output of our model, SHAP values tell us how much each feature of the input contributed to that score, and in which direction (i.e. whether the feature contributed to a higher or a lower anomaly score).

First, we need to instantiate an appropriate Explainer model. Since we are using a tree-based model, it makes sense to use SHAP’s TreeExplainer.

In [None]:
#!pip install shap
import shap
explainer = shap.TreeExplainer(forest)

Next, we compute the SHAP values for a set of 5000 randomly chosen data points. 

In [None]:
random_indices = np.random.choice(len(features), 5000)
shap_values_random = explainer.shap_values(features.iloc[random_indices, :])
random_features = features.iloc[random_indices, :]

To visualize the explanation values of a single point, we can use the force_plot function. Let’s display the explanation for the first entry in the randomly chosen dataset.

In [None]:
from IPython.display import display, HTML
shap.initjs()
dis=shap.force_plot(explainer.expected_value, shap_values_random[0, :], random_features.iloc[0, :],matplotlib=False)
shap_html = f"{shap.getjs()}{dis.html()}"

with open("orig_shap.html", "w", encoding='utf8') as file:
    file.write(shap_html)


The ‘base value’ in the above plot corresponds to the average output of the model over the training set, whereas f(x) value in bold indicates the model output for this specific datapoint. The purpose of this plot is to show how the individual features of this data point contributed to shifting the model output from its expected (base) value, to the actual value. The values of these individual contributions are SHAP values. Blue bars represent SHAP values that are negative and contributed to a lower anomaly score (making it more likely for that data point to be an anomaly), whereas red bars represent positive SHAP values making the output higher (i.e. suggesting that this data point is normal). The sum of all SHAP values is equal to the difference between base value and model output value. Note that the SHAP explainer works with raw anomaly scores, whereas on the histogram earlier we were looking at normalized values. The meaning stays the same: a lower score implies higher chance of being an anomaly.

In the above example we can see that, while a couple of features were suggestive of this data point being anomalous, such as the amount field (presumably because it is a rather high amount), most of the features indicate that this data point is rather normal.

One issue with the previous plot is the fact that the different values of the ‘type’ field are treated as separate features with binary values due to the one-hot-encoding we applied in the beginning. To solve this, we can simply undo the one-hot-encoding in the feature matrix and add all of the corresponding SHAP values to a single value. 

In [None]:
def condense(features, shap_values, indices):
    features_condensed = features.drop(columns=['CASH_IN', 'PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT'])
    features_condensed['type'] = df['type']
    selected_features_condensed = features_condensed.iloc[indices, :]
    
    shap_values_without_type = shap_values[:, :-5]
    shap_values_type_sum = shap_values[:, -5:].sum(axis=1).reshape(-1, 1)

    shap_values_condensed = np.concatenate([shap_values_without_type, shap_values_type_sum], axis=1)

    return selected_features_condensed, shap_values_condensed

In [None]:
random_features_condensed, shap_values_random_condensed = condense(features, shap_values_random, random_indices)

Use the code above to generate a new SHAP plot using these condensed values and save it to 'condensed_shap.html'.

<details><summary>Click here for answer</summary> 
<br/>

```
from IPython.display import display, HTML
shap.initjs()
dis=shap.force_plot(explainer.expected_value, shap_values_random_condensed[0, :], random_features_condensed.iloc[0, :],matplotlib=False)
shap_html = f"{shap.getjs()}{dis.html()}"
#display(HTML(shap_html))
with open("condensed_shap.html", "w", encoding='utf8') as file:
    file.write(shap_html)
#display(HTML(shap_html))

```
</details>

In [None]:
#insert code here

Even though SHAP values are local explanations, i.e. they explain the contributions of features on single data points, we can gain more general insights about the decisions our model makes by aggregating many of these local explanations to discover global trends. Let us first take a look at a summary of the SHAP values for the features.


In [None]:
shap.summary_plot(shap_values_random_condensed, random_features_condensed)

One useful tool that the SHAP library provides to gain such insights is the dependence_plot function. It shows us how, across many data points, a SHAP value of a specific feature (y-axis) depends on the feature’s value (x-axis). Dots in the plot represent individual data points.

In [None]:
shap.dependence_plot(
 'changebalanceDest',
 shap_values_random,
 random_features,
 interaction_index=None,
 xmax='percentile(99)'
)

We can see that the SHAP value of ‘changebalanceDest’ is small for values close to zero, suggesting small changes in the receiver account’s balance are common. The more the absolute value increases, the more the SHAP value contributes to making the anomaly score lower (i.e. increasing the chance of being an anomaly). We can see that the most significant SHAP values are in the realm of very large positive numbers.

The dependence_plot also allows us to take into consideration the effect of another feature with the interaction_index argument.

In [None]:
shap.dependence_plot(
 'changebalanceDest',
 shap_values_random,
 random_features,
 interaction_index='CASH_OUT',
 xmax='percentile(99)'
)

This plot shows that whether the ‘type’ field of a data point is equal to CASH_OUT also correlates with how the SHAP value for ‘changebalanceDest’ behaves. It seems as though large sums of money become indicators for an anomaly more quickly if the ‘type’ field is equal to CASH_OUT.

In [None]:
shap.dependence_plot(
 'hour',
 shap_values_random,
 random_features,
 interaction_index='PAYMENT',
 xmax='percentile(99)'
)

We can also detect an interesting interaction between the SHAP value of the ‘hour’ feature and whether the ‘type’ feature is equal to PAYMENT.
While an ‘hour’ value between 0 and 5 is generally indicative of being an anomaly, this effect is more pronounced for transactions of the PAYMENT type. Interestingly, this effect is absent or even reversed during other times of the day.


## DBSCAN

Let us now use a different method, DBSCAN, for anomaly detection. Recall that DBSCAN is a density-based clustering approach that groups data points lying in continuous regions of high point density and determines the number of clusters automatically. In contrast to k-means, not all points are assigned to a cluster, and we are not required to declare the number of clusters (k). Instead, the two key parameters in DBSCAN are min_samples (the minimum number of data points required in a neighbourhood for a point to be a core point) and eps (the maximum distance between two points for them to be considered neighbours).
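
To make the roles of eps and min_samples concrete, the sketch below recomputes the core points by hand and compares them with scikit-learn's result. In scikit-learn, a point is a core point if at least min_samples points (counting the point itself) lie within distance eps of it. The synthetic data and the names `X_demo`, `eps_demo` and `min_samples_demo` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data (an assumption for this sketch): two blobs plus a few scattered points.
rng = np.random.RandomState(0)
X_demo = np.concatenate([
    0.3 * rng.randn(30, 2),
    0.3 * rng.randn(30, 2) + 3,
    rng.uniform(low=-2, high=5, size=(5, 2)),
])
eps_demo, min_samples_demo = 0.5, 4

# Pairwise Euclidean distances between all points.
diffs = X_demo[:, None, :] - X_demo[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Count the neighbours within eps; the count includes the point itself.
neighbour_counts = (distances <= eps_demo).sum(axis=1)
manual_core = np.flatnonzero(neighbour_counts >= min_samples_demo)

db_demo = DBSCAN(eps=eps_demo, min_samples=min_samples_demo).fit(X_demo)
noise_indices = np.flatnonzero(db_demo.labels_ == -1)

print("manual core points match DBSCAN:",
      np.array_equal(manual_core, db_demo.core_sample_indices_))
print("number of noise points:", len(noise_indices))
```

Points labelled -1 are neither core points nor within eps of a core point, which is what makes DBSCAN usable as an anomaly detector.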

Let's first try a toy example

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

X_train = np.array([[60,36], [100,36], [100,70], [60,70],
    [140,55], [135,90], [180,65], [240,40],
    [160,140], [190,140], [220,130], [280,150], 
    [200,170], [185, 170]])
plt.scatter(X_train[:,0], X_train[:,1], s=200)
plt.show()


We can use DBSCAN with eps=45 and min_samples=4 to perform anomaly detection. There are 6 core points found by the algorithm, 2 clusters and a couple of outliers (noise points).

In [None]:
eps = 45
min_samples = 4
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_train)
labels = db.labels_
print(labels)

print(db.core_sample_indices_)

We can visualize the clusters as below. Points in cluster 0 are colored red, points in cluster 1 are colored green, outlier points are colored black and core points are marked with '*'s. Two points are connected by an edge if they are within the epsilon neighbourhood.

In [None]:
def dist(a, b):
    return np.sqrt(np.sum((a - b)**2))

colors = ['r', 'g', 'b', 'k']
for i in range(len(X_train)):
    plt.scatter(X_train[i,0], X_train[i,1], 
                s=300, color=colors[labels[i]], 
                marker=('*' if i in db.core_sample_indices_ else 'o'))
                                                            
    for j in range(i+1, len(X_train)):
        if dist(X_train[i], X_train[j])  < eps:
            plt.plot([X_train[i,0], X_train[j,0]], [X_train[i,1], X_train[j,1]], '-', color=colors[labels[i]])
            
plt.title('Clustering with DBSCAN', size=15)
plt.show()

## DBSCAN on credit card fraud dataset

We first scale and normalize the train_X data from above. Recall from the previous practical how to do this.
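
Scaling and normalizing are two different operations: StandardScaler works per column (each feature gets zero mean and unit variance), whereas normalize works per row (each sample is rescaled to unit length). The toy sketch below demonstrates the difference; the data and variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Illustrative data (an assumption for this sketch): three features on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.uniform(low=[0, 0, 0], high=[1, 100, 10000], size=(50, 3))

X_scaled = StandardScaler().fit_transform(X_demo)  # column-wise: mean 0, std 1
X_unit = normalize(X_scaled)                       # row-wise: L2 norm 1 (the default)

print("column means after scaling:", X_scaled.mean(axis=0).round(6))
print("column stds after scaling: ", X_scaled.std(axis=0).round(6))
print("first row norms after normalizing:", np.linalg.norm(X_unit, axis=1)[:5].round(6))
```

Because DBSCAN relies on distances, putting all features on a comparable scale first prevents a single large-valued feature from dominating the choice of eps.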

<details><summary>Click here for answer</summary> 
<br/>

```
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize

scaler=StandardScaler().fit(train_X)
X_s = scaler.transform(train_X)
X_norm = pd.DataFrame(normalize(X_s))
X_norm.head()

```
</details>

In [None]:
# insert code here

We will fit the DBSCAN model with eps=0.65 and min_samples=5. Obtain the cluster label that each sample was assigned and store it in the variable "labels". Recall that anomalies are labelled ‘-1’.

<details><summary>Click here for answer</summary> 
<br/>

```
db_model = DBSCAN (eps=0.65, min_samples=5).fit(X_norm)
labels=db_model.labels_
np.unique(labels)

```
</details>

In [None]:
# insert code here

We can visualize a logarithmic histogram of the labels, and count the number of outliers identified

In [None]:
import matplotlib.pyplot as plt
plt.hist(labels, bins=len(np.unique(labels)),log=True)
plt.show()

In [None]:
n_clusters=len(np.unique(labels))-1
anomaly=list(labels).count(-1)
print(f'Clusters: {n_clusters}')
print(f'Abnormal points: {anomaly}')

Let us check the accuracy of DBSCAN on the training set. Try to write some code to do this.

<details><summary>Click here for answer</summary> 
<br/>

```

labels2=np.array([1 if y == -1 else 0 for y in labels])
def get_classification_report(test_y, test_yhat):
    label_name = ['Legitimate','Fraudulent']
    print(classification_report(test_y, test_yhat, target_names=label_name))
get_classification_report(train_y, labels2)

```
</details>

In [None]:
# insert code here

The accuracy of DBSCAN doesn't seem good with the current parameters. Isolation Forest has a better performance. The anomalies detected by DBSCAN from this dataset are not the actual anomalies.

## Exercise

Try adjusting some of the parameters of DBSCAN and observe what happens

## Exercise (Optional)

Try using DBSCAN on the mobile payment dataset

## Additional Resource (Optional): Pycaret
Pycaret is an Automated Machine Learning (AutoML) tool that can be used for both supervised and unsupervised learning. It contains many anomaly detection models.

We will be performing anomaly detection on the Wisconsin Breast Cancer (Diagnostic) dataset from the UCI Machine Learning Repository, which contains features computed from a digitized image of a fine needle aspirate of a breast mass, together with the diagnosis of whether the mass is benign (B) or malignant (M). This dataset is commonly used for demonstrating supervised machine learning, where a model is trained to predict the diagnosis. For the purpose of demonstrating unsupervised anomaly detection, we will ignore the diagnosis. We first split the data into the training set and reserve a small “unseen” set for scoring.

In [None]:
from pycaret.anomaly import *
from sklearn.datasets import load_breast_cancer
df = load_breast_cancer(as_frame=True)['data']
df_train = df.iloc[:-10]
df_unseen = df.tail(10)

df_unseen.head()


Next, we will set up Pycaret to use the dataset. To use Pycaret, we first need to call the setup function as below. Setting the silent parameter to True automatically confirms the input of data types when setup is executed. If silent is set to False, Pycaret requires the user to do manual confirmation of the input data types as shown in the image below.

In [None]:
anom = setup(data = df_train)

We can check the anomaly detection models available in Pycaret. The reference column indicates which source package the model was built from. 

In [None]:
models()

Next, we will train an anomaly detection model. Let's load the iforest model that we have seen previously, with fraction parameter = 0.05. The fraction parameter is the contamination parameter that we have seen previously and indicates the amount of outliers present in the dataset. It has a default value of 0.05

In [None]:
anom_model = create_model(model = 'iforest', fraction = 0.05)

We can now score the training data using the *assign_model* function. It applies the trained model to the training dataset and returns the model's predictions, concatenated with the training data. The Anomaly column is binary, where 1 indicates that the record is anomalous and 0 indicates that it is normal. The Anomaly_Score column gives the raw score for the record, where negative indicates that the record is normal.

In [None]:
results = assign_model(anom_model)

In [None]:
print(results)

We can visualize the high-dimensional results in lower dimensions using non-linear, graph-based visualization methods such as t-SNE or UMAP.

In [None]:
plot_model(anom_model, plot = 'tsne')

In [None]:
plot_model(anom_model, plot = 'umap')

Finally, we can save the model, load the saved model and use it to make predictions 

In [None]:
save_model(model = anom_model, model_name = 'iforest_model')
loaded_model = load_model('iforest_model')
loaded_model.predict(df_unseen)

We can also look at the probabilities and the anomaly scores, using the following two functions respectively.

In [None]:
loaded_model.predict_proba(df_unseen)

In [None]:
loaded_model.decision_function(df_unseen)

However, the results are not very accurate with the default fraction value

In [None]:
df2=load_breast_cancer(as_frame=True)['target']
df2_unseen = df2.tail(10)
print(df2_unseen)

## Additional Exercise (Optional)

1. Try changing the fraction/contamination value and see if the prediction accuracy increases. Discuss why or why not.
2. Try using the pycaret package on the credit card fraud or the mobile payment dataset and vice versa