# Anomaly Detection

1. [Supervised Learning](#Supervised-Learning)  
    - Dataset 1 - Credit Card Fraud Detection (source: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)  
    - Techniques used and results 

2. [Unsupervised Learning](#Unsupervised-Learning)  
    - Dataset 2 -   
    - Techniques used and results

### Summary




==================================================================================
### Types of Anomalies:
- Point Anomalies:  
    Individual data points that are significantly different from the rest of the data.
- Contextual Anomalies:  
    Data points that are anomalous only within a specific context (e.g., high network traffic at 3 AM may be normal if scheduled jobs run overnight, while the same volume at 3 PM might be an anomaly).
- Collective Anomalies:  
    A group of related data points that collectively represent an anomaly, even if individual points are not anomalous on their own.
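
The difference between a point anomaly and a contextual anomaly can be illustrated with a simple z-score check. This is only a minimal sketch: the readings, the `history` table, and the threshold of 3 are made-up assumptions, and collective anomalies are not covered because they need a sequence-aware method.

```python
from statistics import mean, stdev

def zscore(value, reference):
    """Absolute z-score of `value` relative to the `reference` sample."""
    mu, sigma = mean(reference), stdev(reference)
    return abs(value - mu) / sigma if sigma > 0 else 0.0

# Point anomaly: one reading far from all the others (values are made up).
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10] * 2 + [100]
point_anomalies = [i for i, v in enumerate(readings) if zscore(v, readings) > 3.0]

# Contextual anomaly: hypothetical traffic history keyed by hour of day.
# Hour 3 is busy here (matching the example above), hour 15 is quiet.
history = {3: [100, 95, 105, 98, 102, 100], 15: [5, 6, 4, 5, 6, 4]}
pooled = history[3] + history[15]

new_value = 100
global_z = zscore(new_value, pooled)       # unremarkable against all hours pooled
z_at_3am = zscore(new_value, history[3])   # typical for the 3 AM context
z_at_3pm = zscore(new_value, history[15])  # far outside the 3 PM baseline

print(point_anomalies, round(global_z, 2), round(z_at_3am, 2), round(z_at_3pm, 2))
```

The same value of 100 passes the global check and the 3 AM check but fails the 3 PM check, which is what makes it contextual rather than a point anomaly.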

### Machine Learning Approaches:  
Anomaly detection utilizes various machine learning techniques, broadly categorized as:
##### Supervised:
Requires labeled datasets containing both normal and anomalous data points. Algorithms such as the following can be trained to classify new data as normal or anomalous:
- logistic regression,
- decision trees and random forests, and
- neural networks.

This approach is effective when anomalies are well-defined and sufficient labeled data is available.
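
As a rough illustration of the supervised setting, the sketch below trains a logistic regression classifier on synthetic, heavily imbalanced data. The data, the 2,000 to 40 class ratio, and the use of `class_weight="balanced"` are assumptions chosen for the example, not the setup used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 2,000 normal points around (0, 0), 40 anomalies around (4, 4).
X_normal = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
X_anomaly = rng.normal(loc=4.0, scale=1.0, size=(40, 2))
X = np.vstack([X_normal, X_anomaly])
y = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

# Stratify so the rare class appears in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the rare anomalous class is not ignored during training.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision and recall are reported instead of accuracy because, with so few anomalies, a model that predicts "normal" for everything would still score a very high accuracy.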
##### Unsupervised:
Identifies anomalies without requiring labeled data by learning the underlying patterns of normal data and flagging deviations. This is particularly useful when anomalies are rare or unknown in advance. Common algorithms include:   
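- Isolation Forest: An ensemble of randomly built trees that isolates points through recursive random splits; anomalies tend to be isolated in fewer splits, so a short average path length indicates an anomaly.  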
- One-Class SVM (OCSVM): A variant of Support Vector Machines that learns a boundary around normal data, classifying points outside this boundary as anomalies.  
- K-Nearest Neighbors (KNN): Anomaly scores are based on the distance to the K-nearest neighbors, with distant points being potential anomalies.   
- Autoencoders: Neural networks that learn a compressed representation of data; high reconstruction error can indicate an anomaly.  
- Clustering-based methods (e.g., K-Means, DBSCAN): Flag data points that are far from every cluster centroid (K-Means) or that are not assigned to any cluster (DBSCAN noise points).   
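
Two of the methods listed above can be sketched with scikit-learn on unlabeled synthetic data. Everything here is illustrative: the planted outliers, `contamination=0.02`, and `k = 5` are assumed values, and the `planted` indices are used only to check the result, never for fitting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# 500 normal points plus 10 planted outliers (magnitude 6 to 12, random sign).
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
signs = rng.choice([-1.0, 1.0], size=(10, 2))
X_outliers = rng.uniform(6.0, 12.0, size=(10, 2)) * signs
X = np.vstack([X_normal, X_outliers])
planted = np.arange(500, 510)  # used only for checking, not for training

# Isolation Forest: fit_predict returns -1 for predicted anomalies, 1 for inliers.
forest = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
labels = forest.fit_predict(X)
flagged_by_forest = np.flatnonzero(labels == -1)

# kNN distance score: distance to the k-th nearest neighbour.
# k + 1 neighbours are requested because each point is its own nearest neighbour.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
knn_score = distances[:, -1]
flagged_by_knn = np.argsort(knn_score)[-10:]  # the 10 most distant points

forest_hits = np.intersect1d(flagged_by_forest, planted).size
knn_hits = np.intersect1d(flagged_by_knn, planted).size
print(f"Isolation Forest found {forest_hits}/10, kNN distance found {knn_hits}/10")
```

In practice the contamination rate and the number of points to flag are unknown, so both would need to be tuned or replaced with a score threshold.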
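##### Semi-supervised:
Combines aspects of both supervised and unsupervised learning, often using a small amount of labeled data to guide the learning process. In anomaly detection, this often means training only on data known to be normal and flagging points that deviate from that profile.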
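
One way to picture the semi-supervised setting is sketched below, under stated assumptions: a Gaussian profile is fitted to samples known to be normal, and a small labeled validation set is used only to pick the score threshold. The Mahalanobis-distance score, the F1-based threshold choice, and all data are illustrative choices, not a method taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: model "normal" using samples known to be normal (no anomalies needed).
X_known_normal = rng.normal(0.0, 1.0, size=(500, 2))
mu = X_known_normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_known_normal, rowvar=False))

def anomaly_score(X):
    """Mahalanobis distance of each row of X from the fitted normal profile."""
    d = np.atleast_2d(X) - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

# Step 2: a small labeled validation set (50 normal, 5 anomalous) guides the threshold.
X_val = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(6.0, 1.0, size=(5, 2))])
y_val = np.array([0] * 50 + [1] * 5)
val_scores = anomaly_score(X_val)

def f1_at(threshold):
    """F1 score on the validation set when flagging scores >= threshold."""
    pred = val_scores >= threshold
    tp = np.sum(pred & (y_val == 1))
    fp = np.sum(pred & (y_val == 0))
    fn = np.sum(~pred & (y_val == 1))
    return 2 * tp / (2 * tp + fp + fn)

candidates = np.sort(val_scores)
f1_values = np.array([f1_at(t) for t in candidates])
threshold = candidates[np.argmax(f1_values)]
best_f1 = f1_values.max()

# Step 3: score unseen points against the chosen threshold.
new_points = np.array([[0.1, -0.2], [9.0, 9.0]])
is_anomaly = anomaly_score(new_points) >= threshold
print(f"threshold={threshold:.2f} best_f1={best_f1:.2f} flags={is_anomaly.tolist()}")
```

Only five labeled anomalies are used here, which reflects the typical situation where labels for the anomalous class are scarce.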



<a name='Supervised-Learning'></a>
## Supervised Learning

1. [Data preparation and visualization](#step-1-data-preparation-and-visualization)
2. Model development
3. Model evaluation

In [11]:
# Split the original credit card fraud file (150.8 MB, too large to upload to GitHub) into smaller files

import os
from zipfile import ZipFile

import pandas as pd

# ==========================
# Optional: print all files under the current directory
# for dirname, _, filenames in os.walk('./'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))
# ==========================

# Step 1: unzip the archive into the data directory
with ZipFile('./data/credit-card-fraud/creditcard.csv.zip', 'r') as zip_object:
    zip_object.extractall('./data/credit-card-fraud/')

# Step 2: split the CSV into files of at most `rows_per_file` rows each
input_csv_file_path = './data/credit-card-fraud/creditcard.csv'
rows_per_file = 50_000
output_prefix = 'data/credit-card-fraud/creditcard_part_'

# With chunksize set, read_csv yields DataFrames one chunk at a time
# instead of loading the whole file into memory
csv_reader = pd.read_csv(input_csv_file_path, chunksize=rows_per_file)
for i, chunk in enumerate(csv_reader):
    output_file = f'{output_prefix}{i}.csv'
    chunk.to_csv(output_file, index=False)
    print(f'Created {output_file}')

# Step 3: delete the extracted large CSV file (the zip archive is kept)
os.remove(input_csv_file_path)

Created data/credit-card-fraud/creditcard_part_0.csv
Created data/credit-card-fraud/creditcard_part_1.csv
Created data/credit-card-fraud/creditcard_part_2.csv
Created data/credit-card-fraud/creditcard_part_3.csv
Created data/credit-card-fraud/creditcard_part_4.csv
Created data/credit-card-fraud/creditcard_part_5.csv
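
Because the cell above deletes the extracted CSV after splitting, it can be worth confirming that the parts add up to the source before removing it. The sketch below shows one way to do that on a toy file in a temporary directory; `split_csv` and `count_rows` are hypothetical helper names, not part of this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

def split_csv(src, out_dir, rows_per_file, prefix="part_"):
    """Write `src` into numbered CSV files of at most `rows_per_file` rows each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(pd.read_csv(src, chunksize=rows_per_file)):
        path = out_dir / f"{prefix}{i}.csv"
        chunk.to_csv(path, index=False)
        paths.append(path)
    return paths

def count_rows(paths):
    """Total number of data rows across all CSV files in `paths`."""
    return sum(len(pd.read_csv(p)) for p in paths)

with tempfile.TemporaryDirectory() as tmp:
    # Toy source file with 25 rows, split into chunks of 10.
    src = Path(tmp) / "toy.csv"
    pd.DataFrame({"a": range(25), "b": range(25)}).to_csv(src, index=False)

    parts = split_csv(src, Path(tmp) / "parts", rows_per_file=10)
    n_parts = len(parts)
    total_rows = count_rows(parts)
    source_rows = len(pd.read_csv(src))

    # Only delete the source once the row counts match.
    if total_rows == source_rows:
        src.unlink()

print(n_parts, total_rows, source_rows)
```

The same arithmetic explains the six files listed above: 284,807 rows at 50,000 rows per file give five full parts and one partial part.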


<a name='step-1-data-preparation-and-visualization'></a>
### Step 1. Data Preparation and Visualization

In [None]:
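# Load the split files back into a single data frame
import glob
import re

import pandas as pd

path = './data/credit-card-fraud/'
# glob returns files in arbitrary order; sort by part number to restore the original row order
all_files = sorted(
    glob.glob(path + 'creditcard_part_*.csv'),
    key=lambda name: int(re.search(r'_part_(\d+)\.csv$', name).group(1)),
)

df = pd.concat((pd.read_csv(filename) for filename in all_files), ignore_index=True)

df.info()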

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28
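
After loading, a common preparation step for fraud data is to check for missing values and to see how rare the positive class is. The sketch below does this on a small made-up frame; the column names `Amount` and `Class` (with 1 meaning fraud) are assumed to match the Kaggle dataset, since they are not visible in the truncated output above.

```python
import pandas as pd

# Toy stand-in for the loaded data frame; all values are made up.
toy = pd.DataFrame({
    "Amount": [12.5, 3.0, 250.0, 8.99, 1.0, 40.0, 7.5, 19.9, 5.0, 1200.0],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Total number of missing cells across all columns.
missing_total = int(toy.isna().sum().sum())

# Rows per class, ordered by label, and the share of the positive class.
class_counts = toy["Class"].value_counts().sort_index()
fraud_rate = toy["Class"].mean()

print(f"missing={missing_total}")
print(class_counts.to_dict())
print(f"fraud rate={fraud_rate:.1%}")
```

On the real data the same three lines would run against `df` instead of `toy`.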

<a name='Unsupervised-Learning'></a>
## Unsupervised Learning