In [1]:
# hack to allow importing from sibling directories
#https://stackoverflow.com/questions/34478398/import-local-function-from-a-module-housed-in-another-directory-with-relative-im
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)


from mlviz.dimensionality_reduction import HDVis
from mlviz.data_visualisation import DraughtPlot

# imports to support analysis
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler


# required bokeh imports
from bokeh.io import output_notebook
output_notebook()

%load_ext autoreload
%autoreload 1

# Comments on this example

This is an advanced example: the dataset gives meaningless results if it is analysed in its raw form. It needs some pre-processing (scaling and class balancing) before the analysis.

# Load and inspect the data

The data set is the credit card dataset, hosted on [openml](https://www.openml.org/d/1597). It contains information on credit card transactions and is a classification data set: the target indicates whether the transaction was fraudulent (target=1) or legitimate (target=0).

It contains 30 features and 284807 instances. The data is well formatted, and 28 of the features are the result of a PCA transformation (in part for data protection).


**We begin by loading the data and separating the 'Class' column (the target) from the features:**

In [2]:
data_fpath = 'data/creditcard/creditcard.csv'

df = pd.read_csv(data_fpath)

target = df['Class'].str.strip('\'').astype(int)
df.drop('Class', axis=1, inplace=True)
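
Before going further it is worth checking the size of the data and the class balance. The sketch below does this on a small made-up frame (`toy_df` and its values are hypothetical) that mimics the quoted `Class` labels handled by the `str.strip` call in the loading cell; on the real data the same calls would be run on `df` and `target`.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: Class values are stored as quoted strings
toy_df = pd.DataFrame({
    "V1": [0.1, -1.2, 0.5, 2.3],
    "Amount": [10.0, 250.0, 3.5, 99.0],
    "Class": ["'0'", "'0'", "'1'", "'0'"],
})

# Strip the surrounding quotes and convert the labels to integers
toy_target = toy_df["Class"].str.strip("'").astype(int)
toy_features = toy_df.drop(columns="Class")

# Basic inspection: size, class counts and positive rate
print(toy_features.shape)
class_counts = toy_target.value_counts().sort_index()
print(class_counts)
positive_rate = toy_target.mean()
print(f"positive rate: {positive_rate:.2%}")
```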

# Balance the dataset

The dataset is **highly** imbalanced (492 out of 284807 instances are positive). To help tool evaluation we will work with a sample of the data in which the negative class is undersampled: we keep all the positive samples (492) and a user-selected number of negative class instances.

**Sample the dataset:**

*Note: you can vary negative_samples to see how it affects the results*


In [3]:
negative_samples = 5000

df_pos = df[target==1]
df_neg = df[target==0].sample(n=negative_samples)

reduced_df = pd.concat([df_pos,df_neg])

Next we drop the Time feature, as it is unlikely to carry much information given the sparsity of the positive class. We also make a new target array for the reduced dataset.

**Removing 'Time' feature and making new target:**

In [4]:
reduced_df.drop('Time', axis=1, inplace=True)
# positives come first in reduced_df, followed by the sampled negatives
reduced_target = np.r_[np.ones(len(df_pos)), np.zeros(len(df_neg))]
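
The sampling and target construction can also be wrapped into one helper, so that the label array always matches the sampled rows. This is only a sketch and not part of MLViz: `undersample_negatives` and its `random_state` argument are names introduced here for illustration, and the data is a toy frame.

```python
import numpy as np
import pandas as pd


def undersample_negatives(features, target, n_negative, random_state=0):
    """Keep every positive row and a random sample of negative rows.

    Hypothetical helper: returns the reduced features and a matching target
    array, with positives first and negatives second.
    """
    target = np.asarray(target)
    pos = features[target == 1]
    neg = features[target == 0].sample(n=n_negative, random_state=random_state)
    reduced = pd.concat([pos, neg])
    # Build labels from the actual row counts so they always match the rows
    reduced_target = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    return reduced, reduced_target


# Toy data: 3 positives among 20 rows
toy = pd.DataFrame({"V1": np.arange(20.0), "Time": np.arange(20)})
toy_target = np.array([1, 1, 1] + [0] * 17)

reduced, reduced_target = undersample_negatives(toy, toy_target, n_negative=5)
reduced = reduced.drop(columns="Time")
print(reduced.shape, reduced_target.sum())
```

Fixing `random_state` makes the negative sample reproducible between runs, which helps when comparing the effect of different values of `negative_samples`.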

## Scale the data

For nearly all the dimensionality reduction techniques **you must** scale the data (i.e., so all features take values between 0 and 1 or are on the same length scale).

We suggest using either StandardScaler or MinMaxScaler from scikit-learn as a minimum.


**Performing feature scaling:**

In [5]:
# scale the data; fit_transform returns a numpy array by default, so the DataFrame is rebuilt below
reduced_scaled_X = StandardScaler().fit_transform(reduced_df.to_numpy())

reduced_scaled_df = pd.DataFrame(reduced_scaled_X, columns=reduced_df.columns)
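
To make the two scaling options concrete, the sketch below computes by hand what they do on a toy frame: standardisation subtracts the column mean and divides by the column standard deviation, while min-max scaling maps each column onto [0, 1]. The frame and variable names are made up for illustration; in practice you would use the scikit-learn classes as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy features on very different scales (hypothetical values)
toy = pd.DataFrame({"V1": [-1.0, 0.0, 1.0, 2.0],
                    "Amount": [5.0, 50.0, 500.0, 5000.0]})
X = toy.to_numpy()

# Standardisation: zero mean and unit variance per column
# (population standard deviation, ddof=0, which is what StandardScaler uses)
standardised = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each column mapped onto the interval [0, 1]
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Rebuild DataFrames so the column names are kept
standardised_df = pd.DataFrame(standardised, columns=toy.columns)
minmax_df = pd.DataFrame(minmax, columns=toy.columns)
print(standardised_df.describe().loc[["mean", "std"]])
print(minmax_df.describe().loc[["min", "max"]])
```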

## Use the HDVis tool

You can now call the HDVis tool, making sure to provide the correct url for your notebook.

To begin with, we recommend the following selections:

- UMAP as your dimensionality reduction method: it is fast and effective with most datasets.
- Select all the data points.

Then hit run. After a short wait you should observe several clusters. Now you can try colouring the instances by different features and by the target class to get a feel for the data quality and for which features are strong predictors.
When colouring by the target class you should see several clusters of the positive class which are separated from the bulk of the negative class.

You should independently brush each cluster then click 'add selection' to extract the indices of these clusters.

In [7]:
HD_plot = HDVis(reduced_scaled_df, y=reduced_target, url='localhost:8888')


We can see that each method does quite a good job of separating the positive and negative classes. There are several different clusters of the positive class, and by varying the colour dimension we can see that some of them are stronger 'positives' than others.

Next we brush and extract these different clusters so that we can look at them in greater detail.

## Extract the brushes


The HD_plot object has a brushes attribute which contains the indices (in the original df) for the brushed instances. 

.brushes is a list of arrays: if you brushed three clusters you will get three arrays each containing the indices of the respective cluster.


Users can either use these indices to investigate the different clusters themselves or provide them to a different tool in the MLViz library.
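
As an illustration of working with the indices directly, the sketch below summarises each brushed cluster. It assumes the indices are positional row numbers into the DataFrame passed to the tool; `toy_brushes` is a made-up stand-in for `HD_plot.brushes`, and the frame and target are toy data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): a scaled frame, its target, and two brushes
toy_df = pd.DataFrame({"V1": [0.0, 0.1, 0.2, 3.0, 3.1, 3.2],
                       "V2": [1.0, 1.1, 0.9, -2.0, -2.1, -1.9]})
toy_target = np.array([0, 0, 0, 1, 1, 0])
toy_brushes = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# Summarise each brushed cluster: size, share of positives, feature means
rows = []
for i, idx in enumerate(toy_brushes):
    cluster = toy_df.iloc[idx]  # assumes positional indices
    summary = {"cluster": i,
               "n": len(idx),
               "positive_share": toy_target[idx].mean()}
    summary.update(cluster.mean().to_dict())
    rows.append(summary)

cluster_summary = pd.DataFrame(rows).set_index("cluster")
print(cluster_summary)
```

A per-cluster table like this gives a quick numerical check on what the colouring in the plot suggests, for example whether a brushed cluster is dominated by the positive class.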

In [10]:
X, y  = HD_plot.get_brushed_data()

# DraughtPlot

We can then feed this brushed data (with or without the target) into our DraughtPlot tool, which allows us to investigate these clusters in the original feature space.

This will allow us to gain insight into why the different clusters form, and could also help guide feature selection or further dimensionality reduction.

In [11]:
DP = DraughtPlot(X, y, features=['V1','V2','V3','V9','V16'])