# **Deep learning-based meta-classifier approach for breast cancer diagnosis using deep support vector data description (DSVDD) and one-class classification convolutional neural network (OCCNN)**

---




# Task 1: Python code for one-class classification using support vector data description
---

## 1. Import the necessary libraries

In [None]:
# Import libraries
import os  # accessing directory structure
import pickle

import matplotlib.pyplot as plt  # plotting
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Show every row and column when displaying a DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## 2. CBIS-DDSM (Curated Breast Imaging Subset of DDSM)

The CBIS-DDSM dataset contains **6 CSV** files, including separate files for the mass and calcification training and test sets:
* mass_case_description_train_set.csv
* mass_case_description_test_set.csv 
* calc_case_description_train_set.csv
* calc_case_description_test_set.csv

The cases are labeled Benign, Benign without callback, or Malignant.

The dataset also contains metadata and DICOM info files.

The CBIS-DDSM dataset contains 6774 directories holding **10239 mammography images in JPEG format**. The images come in three types:
*   Full mammogram images
*   Cropped images
*   ROI mask images


### 2.1. How to Load Kaggle Datasets Directly into Google Colab?
https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/


---
Additional resource:

https://www.journaldunet.fr/web-tech/developpement/1441251-comment-importer-des-donnees-dans-les-notebooks-google-colaboratory/

In [None]:
! pip install kaggle

In [None]:
! mkdir -p ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
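
The directory, copy, and permission steps above can also be written in Python, which makes them safe to re-run. This is a minimal sketch rather than part of the original notebook: the helper name `install_kaggle_token` and its `dest_dir` parameter are hypothetical, and it assumes `kaggle.json` has already been uploaded to the working directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir=None):
    """Copy an API token into dest_dir (default ~/.kaggle), readable by the owner only."""
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p`: no error on re-run
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)                    # like `cp kaggle.json ~/.kaggle/`
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)   # like `chmod 600`
    return dest
```

Calling `install_kaggle_token()` with no arguments would mirror the shell cells; passing another `dest_dir` lets the helper be tried without touching the real home directory.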

### 2.2. Download the CBIS-DDSM dataset

In [None]:
! kaggle datasets download awsaf49/cbis-ddsm-breast-cancer-image-dataset

In [None]:
# -o overwrites existing files without prompting, so re-running the cell does not hang
! unzip -o -q cbis-ddsm-breast-cancer-image-dataset.zip



## 3. Data exploration

Let's check the second file, calc_case_description_train_set.csv, and take a quick look at the data:

In [None]:
#Read the dataset
df1 = pd.read_csv('/content/csv/calc_case_description_train_set.csv')
df1.head(5)

In [None]:
# Count the rows and columns of the dataset (the DataFrame shape)
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

To display the entire dataset, run the following instruction:


---
pd.set_option('display.max_rows',df1.shape[0]+1)

df1

To find out how many columns are categorical and how many are numerical, use the pandas “dtypes” attribute to get each column's data type, then call “value_counts()” on the result. “value_counts()” groups identical values and returns the count of each.

**As shown below, 10 columns are of object type (categorical data) and 4 columns are of int type.**

In [None]:
display(df1.dtypes.value_counts())
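
To see what `dtypes.value_counts()` returns without loading the dataset, here is a small sketch on a toy frame. The values are made up for illustration; only the column names are borrowed from the CSV. It also shows `select_dtypes`, an alternative way to split columns by kind.

```python
import pandas as pd

# Toy frame (hypothetical values) with two text columns and one integer column
toy = pd.DataFrame({
    "pathology": ["MALIGNANT", "BENIGN", "BENIGN"],
    "image view": ["CC", "MLO", "CC"],
    "assessment": [4, 2, 3],
})

# dtypes gives one dtype per column; value_counts() tallies how often each dtype occurs
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# select_dtypes splits the columns by kind without comparing dtypes by hand
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()
print(cat_cols, num_cols)
```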

In [None]:
# Identify the numerical and categorical variables
num_vars=df1.columns[df1.dtypes!='object']
cat_vars=df1.columns[df1.dtypes=='object']
print(num_vars)
print(cat_vars)

In [None]:
# Check for missing values
df1.info()

In [None]:
#count the empty (NaN, NAN, na) values in each column
df1.isna().sum().sort_values(ascending=False)

In [None]:
df1.describe()

In [None]:
df1.describe(include=object)

Percentage of missing data by feature, with visualization:

In [None]:
# Percentage of missing values in each column, with visualization
total = df1.isnull().sum().sort_values(ascending=False)
percent = (df1.isnull().mean() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation=90)
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
missing_data
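
As a quick check of the missing-value computation, the sketch below uses a toy frame with made-up values. Because `isnull()` yields booleans, the column-wise mean is the fraction of missing rows, and multiplying by 100 gives a percentage.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 2 of 4 and 1 of 4 entries are missing
toy = pd.DataFrame({
    "calc type": ["PLEOMORPHIC", np.nan, "AMORPHOUS", np.nan],
    "calc distribution": ["CLUSTERED", "LINEAR", np.nan, "CLUSTERED"],
    "subtlety": [3, 4, 5, 2],
})

# Mean of the boolean mask per column = fraction of missing rows
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```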

In [None]:
# Split the columns into categorical and numerical data before filling in missing values
cat_data = df1.select_dtypes(include='object').copy()
num_data = df1.select_dtypes(exclude='object').copy()

In [None]:
cat_data

In [None]:
num_data

**For the categorical variables, we replace missing values with the most frequent value in each column.**


In [None]:
cat_data=cat_data.apply(lambda x:x.fillna(x.value_counts().index[0]))
cat_data.isnull().sum().any()
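
The most-frequent-value imputation can be verified on a toy frame (made-up values). `value_counts()` sorts by descending frequency, so `index[0]` is the most frequent entry of each column.

```python
import numpy as np
import pandas as pd

# Toy categorical frame (hypothetical values) with one gap per column
toy = pd.DataFrame({
    "calc distribution": ["CLUSTERED", np.nan, "CLUSTERED", "LINEAR"],
    "calc type": ["AMORPHOUS", "AMORPHOUS", np.nan, "PLEOMORPHIC"],
})

# Fill each column's gaps with that column's most frequent value
filled = toy.apply(lambda col: col.fillna(col.value_counts().index[0]))
print(filled)
```

When a column has a single most frequent value, `col.mode()[0]` should give the same result.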

In [None]:
cat_data['calc distribution'].value_counts()

In [None]:
cat_data

In [None]:
# Encode the target column
target_value={'MALIGNANT':1,'BENIGN':0,'BENIGN_WITHOUT_CALLBACK':2}
target=cat_data['pathology']
cat_data.drop('pathology',axis=1,inplace=True)
target=target.map(target_value)
target

In [None]:
# Replace the categorical values with numerical codes 0, 1, 2, ...
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for i in cat_data:
  cat_data[i]=le.fit_transform(cat_data[i])
cat_data
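
The loop above refits one `LabelEncoder` per column, so only the last column's mapping stays available in `le.classes_`. As an alternative sketch (not what the notebook runs), `OrdinalEncoder`, already imported in section 1, encodes all columns in one call and keeps every mapping in `categories_`. The toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical frame (hypothetical values)
toy = pd.DataFrame({
    "image view": ["CC", "MLO", "CC"],
    "left or right breast": ["LEFT", "RIGHT", "RIGHT"],
})

encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy),   # returns a float array, one column per feature
    columns=toy.columns,
    index=toy.index,
).astype(int)

print(encoded)
print(encoder.categories_)        # per-column category lists, in code order
```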

In [None]:
# Concatenate cat_data and num_data and append the target column
df1=pd.concat([cat_data,num_data,target], axis=1)

In [None]:
df1

In [None]:
# Count the number of malignant, benign, and benign-without-callback cases in the target column
target.value_counts()

In [None]:
# Visualize this count
plt.figure(figsize=(8,6))
sns.countplot(x=df1['pathology'])
# Share of each class in the target column, as a fraction of all cases
MALIGNANT=target.value_counts()[1]/len(target)
BENIGN=target.value_counts()[0]/len(target)
BENIGN_WITHOUT_CALLBACK=target.value_counts()[2]/len(target)
print(f'Proportion of MALIGNANT cases: {MALIGNANT}')
print(f'Proportion of BENIGN cases: {BENIGN}')
print(f'Proportion of BENIGN_WITHOUT_CALLBACK cases: {BENIGN_WITHOUT_CALLBACK}')

In [None]:
# Plot the correlation matrix
corr = df1.corr()
plt.figure(figsize=(18,18))
sns.heatmap(corr, cmap='coolwarm', annot = True)
plt.show()

In [None]:
df1=df1[['breast density', 'abnormality id', 'assessment', 'subtlety', 'patient_id', 'left or right breast', 'image view', 'abnormality type',
       'calc type', 'calc distribution', 'pathology', 'image file path',
       'cropped image file path', 'ROI mask file path']]
df1.corr()

In [None]:
# Find the features strongly correlated with the target (|correlation| > 0.5, either sign)
cc=corr[abs(corr['pathology']) > 0.5].index
print('- Number of most correlated features = ', len(cc))
print('--------------------------------------------------')
print('- Most correlated features are: \n ',cc)

In [None]:
# Correlation of every feature with the target
acc=df1.corr()['pathology']
print('All features with their correlations are: \n',acc)

In [None]:
# Find the features weakly correlated with the target (|correlation| <= 0.5)
cc2=corr[abs(corr['pathology']) <= 0.5].index
print('- Number of least correlated features = ', len(cc2))
print('--------------------------------------------------')
print('- Least correlated features are: \n ',cc2)
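
The threshold split can be illustrated on a toy frame with made-up values. Note that `pathology` correlates perfectly with itself, so the cells above count it among the most correlated features; this sketch drops it first.

```python
import pandas as pd

# Toy frame (hypothetical values): one feature tracks the target, one does not
toy = pd.DataFrame({
    "assessment": [1, 2, 3, 4, 5, 6],
    "subtlety":   [3, 1, 4, 1, 5, 2],
    "pathology":  [0, 0, 0, 1, 1, 1],
})

# Correlation of each feature with the target, excluding the target itself
corr_with_target = toy.corr()["pathology"].drop("pathology")

# Split on the absolute value, so strong negative correlations also count as strong
strong = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
weak = corr_with_target[corr_with_target.abs() <= 0.5].index.tolist()
print(strong, weak)
```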

## 4. Create the Deep SVDD model

In [None]:
!pip install deep-svdd

In [None]:
from dsvdd import *

In [None]:
import os
import sys

import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

sys.path.append("..")

# Divide the DataFrame into input features (x) and target (y)
x = df1.drop(['pathology'], axis=1)
y = df1['pathology']

# Split the data, keeping 80% for training and the remaining 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify = y)

# Build the Deep SVDD model with the one-class objective.
# NOTE: input_shape=(32, 32, 3) describes 32x32 RGB images, whereas x_train here
# is a 2-D feature table; the shape should match the data actually passed in.
deepsvdd=DeepSVDD(x_train, input_shape=(32, 32, 3), representation_dim=128,
                    objective='one-class')

# Train the model
deepsvdd.fit(x_train, y_train)

# Plot the learned decision boundary
deepsvdd.plot_boundary(x_train,  y_train)

# Predict on the test set
y_test_predict = deepsvdd.predict(x_test, y_test)

# Plot the test-set distances to the center against the radius
radius = deepsvdd.radius
distance = deepsvdd.get_distance(x_test)
deepsvdd.plot_distance(radius, distance)
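
To make the roles of the radius and the distance concrete, here is a minimal NumPy sketch of the decision rule behind a one-class hypersphere model. It is an illustration under stated assumptions, not the API of the `dsvdd` package: random vectors stand in for the embeddings a trained network would produce, the center is taken as the mean of the training embeddings, and the radius is set so that a fraction `nu` of training points falls outside. All names (`center`, `radius_sq`, `predict`, `nu`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for network embeddings phi(x); a real Deep SVDD model would learn these
normal_train = rng.normal(loc=0.0, scale=1.0, size=(200, 8))    # "normal" class only
normal_test = rng.normal(loc=0.0, scale=1.0, size=(50, 8))
anomalous_test = rng.normal(loc=6.0, scale=1.0, size=(50, 8))   # shifted cluster

# Hypersphere center: mean of the training embeddings
center = normal_train.mean(axis=0)


def squared_distance(z, c):
    """Anomaly score: squared Euclidean distance to the center."""
    return np.sum((z - c) ** 2, axis=1)


# Squared radius: the (1 - nu) quantile of the training distances
nu = 0.05
radius_sq = np.quantile(squared_distance(normal_train, center), 1.0 - nu)


def predict(z):
    """Return +1 inside the sphere (normal) and -1 outside (anomalous)."""
    return np.where(squared_distance(z, center) <= radius_sq, 1, -1)


print("normal test points accepted:", np.mean(predict(normal_test) == 1))
print("anomalous test points rejected:", np.mean(predict(anomalous_test) == -1))
```

In a one-class setting the model is fitted on a single class (the "normal" cases) and every other class is treated as anomalous at test time, which is why the sketch builds `normal_train` from one cluster only.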