This is a solution notebook for an assignment question from a graduate Data Mining course. Code blocks are accompanied by analysis where relevant.
Dataset link: https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets/blob/main/numerical%20data/DevNet%20datasets/bank-additional-full_normalised.csv
Samples with Class label 1 are treated as anomalous.
Broadly, the following steps have been performed in this solution notebook:
- Computed various statistical measures and presented them as an infographic.
- Count plot and classwise categorical plot for categorical attributes
- Histogram plot for continuous attribute
- Pie chart depicting class distribution
- Correlation Analysis
- Used KNN as the baseline model and fitted it on the dataset.
- Applied dimensionality reduction using PCA and retrained the model on the reduced dimensions.
- Compared the accuracy of the baseline model with the models obtained after retaining various levels of variance (60%, 70%, 80%, 90%, 99%).
- Clustered the data using DBSCAN to remove anomalies and retrained the model after their removal.
- Compared the accuracy of the baseline model with the model trained after DBSCAN-based anomaly removal.
- Used a classification model (Decision Tree) to identify anomalies on the test set, then retrained the model after removing them.
- Compared the accuracy of the baseline model with the model trained after Decision Tree-based anomaly removal.
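
The KNN, PCA, and DBSCAN steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: it assumes scikit-learn, uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the bank CSV, and the helper `knn_accuracy`, the choice of `k=5`, and the k-distance heuristic for DBSCAN's `eps` are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Synthetic stand-in for the bank dataset (assumption): class 1 is the rare class.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def knn_accuracy(X_tr, y_tr, X_te, y_te, k=5):
    """Hypothetical helper: fit KNN on the training split, return test accuracy."""
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Baseline: KNN on all original features.
results = {"baseline": knn_accuracy(X_train, y_train, X_test, y_test)}

# PCA: a float n_components in (0, 1) keeps just enough components
# to explain that fraction of the variance.
for pct in (60, 70, 80, 90, 99):
    pca = PCA(n_components=pct / 100, svd_solver="full").fit(X_train)
    results[f"pca_{pct}"] = knn_accuracy(
        pca.transform(X_train), y_train, pca.transform(X_test), y_test)

# DBSCAN: points labelled -1 are noise and are treated as anomalies here.
# eps is set from the k-distance distribution (an assumed heuristic) so that
# roughly 90% of the training points qualify as core points.
min_samples = 5
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X_train).kneighbors(X_train)
eps = np.percentile(distances[:, -1], 90)
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_train)
keep = labels != -1

# Retrain KNN on the training data with DBSCAN noise points removed.
results["dbscan_cleaned"] = knn_accuracy(
    X_train[keep], y_train[keep], X_test, y_test)

for name, acc in results.items():
    print(f"{name:>15}: {acc:.3f}")
```

PCA and DBSCAN are fitted on the training split only, so the test set does not influence the transformation or the anomaly filter. On an imbalanced dataset like this one, accuracy alone can be misleading, so the printed values are best read alongside the class distribution.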