# Tumor classification
Let’s see an example implementation on the BreastCancer dataset, where the objective is to determine whether a tumor is benign or malignant. `One Class classification` refers to algorithms whose training dataset contains observations belonging to only one class.

With only that information known, the objective is to figure out if a given observation in a new (or test) dataset belongs to that class.
Reference: https://www.machinelearningplus.com/statistics/mahalanobis-distance/
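As a minimal sketch of the idea, the squared Mahalanobis distance of a point to a reference sample can be computed by hand with NumPy. The helper `squared_mahalanobis` and the synthetic `reference` data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def squared_mahalanobis(points, reference):
    """Squared Mahalanobis distance of each row in `points` to the
    distribution estimated from the rows of `reference` (hypothetical helper)."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    centered = points - mean
    # Row-wise quadratic form (x - mean)^T inv_cov (x - mean)
    return np.einsum('ij,jk,ik->i', centered, inv_cov, centered)

# Synthetic reference sample standing in for the single known class
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 3))

near = reference.mean(axis=0, keepdims=True)  # the sample mean itself
far = near + 10.0                             # a point far from the sample

d_near = squared_mahalanobis(near, reference)
d_far = squared_mahalanobis(far, reference)
print(d_near, d_far)
```

A point at the sample mean has distance zero, and the distance grows as a point moves away from the reference class, which is what makes it usable as an outlier score.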

In [1]:
import pandas as pd
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet
from sklearn.model_selection import train_test_split

## Read data

In [2]:
df = pd.read_csv('https://goz39a.s3.eu-central-1.amazonaws.com/breastcancer.csv')

In [3]:
df.dropna(how='any',inplace=True) 
df.head()

Unnamed: 0,Id,Cl.thickness,Cell.size,Cell.shape,Marg.adhesion,Epith.c.size,Bare.nuclei,Bl.cromatin,Normal.nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1.0,3,1,1,0
1,1002945,5,4,4,5,7,10.0,3,2,1,0
2,1015425,3,1,1,1,2,2.0,3,1,1,0
3,1016277,6,8,8,1,3,4.0,3,7,1,0
4,1017023,4,1,1,3,2,1.0,3,1,1,0


In [4]:
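# Features: every column except the first (Id) and the last (Class)
X = df.values[:, 1:-1]
# Labels: 0 = benign, 1 = malignant
y = df.values[:, -1]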

In [5]:
np.unique(y, return_counts=True)

(array([0., 1.]), array([444, 239]))

The dataset is split 50/50 into training and test sets. Only the 1's are retained in the training data.
The 1's are the malignant cases.

## Find the benign cases
Because the model is fitted on malignant cases only, a benign case should show up as an outlier relative to the malignant cases.

In [6]:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.5, random_state=0)

Isolate all the malignant cases from the training set into the variable `xtrain_pos`.

In [7]:
xtrain_pos = xtrain[ytrain == 1, :]
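The split-then-filter step can be tried on toy data to see the shapes involved. This is a sketch only: `X_toy`, `y_toy` and the class sizes are made-up stand-ins for the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 4 features, 60 zeros and 40 ones as labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0.0] * 60 + [1.0] * 40)

# Same 50/50 split as in the notebook
x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)

# Boolean mask keeps only the rows whose label is 1
x_tr_pos = x_tr[y_tr == 1, :]

print(x_tr.shape, x_te.shape, x_tr_pos.shape)
```

The test set keeps both classes, so it can be used to check how many observations of the unseen class are flagged.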

Calculate the Mahalanobis distance using the standard (empirical) covariance matrix and the `MinCovDet` (minimum covariance determinant) variant. Both estimator objects have a built-in `mahalanobis` method, which returns squared distances.

In [8]:
dist_test_mincov = MinCovDet(random_state=0).fit(xtrain_pos).mahalanobis(xtest)
dist_test_cov = EmpiricalCovariance().fit(xtrain_pos).mahalanobis(xtest)
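The following sketch checks two things on synthetic data: that the `mahalanobis` method of the scikit-learn estimators returns squared distances, and that the two estimators can disagree strongly when the fitting data contains a contaminating cluster. The arrays `clean`, `contaminated` and `probe` are invented for illustration.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2))
# Add a small cluster far from the bulk of the data
contaminated = np.vstack([clean, rng.normal(loc=8.0, size=(20, 2))])

emp = EmpiricalCovariance().fit(contaminated)
mcd = MinCovDet(random_state=0).fit(contaminated)

# Manual squared distance from the fitted location and covariance
centered = contaminated - emp.location_
inv_cov = np.linalg.inv(emp.covariance_)
manual = np.einsum('ij,jk,ik->i', centered, inv_cov, centered)
sk = emp.mahalanobis(contaminated)
print(np.allclose(manual, sk))

# A point inside the contaminating cluster
probe = np.array([[8.0, 8.0]])
print(emp.mahalanobis(probe), mcd.mahalanobis(probe))
```

The empirical estimate is stretched towards the contaminating cluster, so the probe looks closer than it does under the robust `MinCovDet` estimate.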


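The cut-off (threshold) distance is the 99th percentile (upper 1% tail) of a Chi-squared distribution with as many degrees of freedom as there are features. Under a multivariate normal assumption, squared Mahalanobis distances follow this distribution, so the squared distances returned by `mahalanobis` can be compared against the cut-off directly.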

In [9]:
from scipy.stats import chi2
crit_distance = chi2.ppf((1-0.01), df=xtrain_pos.shape[1])
print(f'critical distance {np.round(crit_distance,2)}')

critical distance 21.67
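A quick simulation illustrates why this cut-off is reasonable. The sketch below assumes standard normal data with 9 features (so the squared Mahalanobis distance reduces to the sum of squares); the sample size and the names `samples` and `sq_dist` are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_features = 9

# Standard normal data: mean 0 and identity covariance, so the squared
# Mahalanobis distance of each row is simply its sum of squares
samples = rng.normal(size=(100_000, n_features))
sq_dist = np.sum(samples**2, axis=1)

threshold = chi2.ppf(0.99, df=n_features)
frac = np.mean(sq_dist > threshold)

print(round(threshold, 2), frac)
```

Roughly 1% of observations drawn from the reference distribution exceed the cut-off, so anything beyond it is treated as unlikely to belong to the training class.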


Number of benign test cases flagged as outliers using the `MinCovDet` estimate:

In [10]:
idx = dist_test_mincov>crit_distance
np.sum(ytest[idx]==0)

6

Number of benign test cases flagged as outliers using the empirical covariance matrix:

In [11]:
idx = dist_test_cov>crit_distance
np.sum(ytest[idx]==0)

1
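To put such counts in context, it can help to report the flagged cases per class together with the class totals. The helper `flag_summary` and the toy arrays below are hypothetical and only sketch one way to do this.

```python
import numpy as np

def flag_summary(sq_distances, labels, threshold):
    """Count flagged observations per class (hypothetical helper).
    Labels follow the notebook: 0 = benign, 1 = malignant."""
    flagged = sq_distances > threshold
    benign = labels == 0
    return {
        "benign_flagged": int(np.sum(flagged & benign)),
        "benign_total": int(np.sum(benign)),
        "malignant_flagged": int(np.sum(flagged & ~benign)),
        "malignant_total": int(np.sum(~benign)),
    }

# Toy squared distances and labels, invented for illustration
toy_dist = np.array([30.0, 5.0, 25.0, 2.0, 40.0])
toy_labels = np.array([0, 0, 1, 1, 0])

summary = flag_summary(toy_dist, toy_labels, threshold=21.67)
print(summary)
```

With the notebook's variables, the call would presumably look like `flag_summary(dist_test_mincov, ytest, crit_distance)`.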