# Project 2
## This project will investigate the Wisconsin Breast Cancer dataset.

***
### Undertake an analysis/review of the dataset and present an overview and background.
***
Select 1 from 2 datasets:
- Original: 699 instances. Attributes are integers. Contains missing values
- Diagnostic: 569 instances. Attributes are real. No missing values

"Cancer classification historically requires prior biological knowledge. However, most machine learning engineers do not have a biological background. Therefore, the general classification performance is worth improving. Prediction models with high accuracy are aimed to aid oncologists with diagnosis
and prognosis.
The goal is to create a good predictor with samples from known classes, and possibly identify hidden cancer subtypes, without any biological priors" (Shen, 2019).

#### Breast Cancer Wisconsin (Diagnostic) Data Set 
This dataset uses 10 features to predict the classification of tumours. The measured and recorded features used to classify whether a tumour is malignant (M) or benign (B) are:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

These features are computed for each cell nucleus using digitized images of fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

No single characteristic separates the two classes well, but combining all of the features makes the two types of mass classifiable (Breast Cancer Wisconsin (Diagnostic) Data Set, 2019).

The table below summarises the dataset:

**Type** | Value
:-|:-|
**Classes** | 2
**Samples per class** | 212(M), 357(B)
**Samples Total** | 569
**Characteristics** | 10
**Dimensionality (Features)** | 30
**Features** | real, positive

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

The dimensionality of 30 arises because three statistics are computed for each of the 10 characteristics of the nuclei in an image: the mean, the standard error and the "worst" value, i.e. the mean of the three largest values (10 x 3).
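
As a rough illustration of where the 30 columns come from, the sketch below crosses the ten characteristics with three statistics. The `_mean` suffix matches the column names shown later in this notebook; the `_se` and `_worst` suffixes are assumed here to follow the same naming convention and are not taken from the text above.

```python
from itertools import product

# The ten characteristics measured for each cell nucleus (from the list above).
characteristics = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three summary statistics per characteristic. "se" and "worst" are assumed
# suffixes; only "mean" appears in this notebook's own output.
summaries = ["mean", "se", "worst"]

# One column per (statistic, characteristic) pair.
feature_names = [f"{c}_{s}" for s, c in product(summaries, characteristics)]

print(len(feature_names))  # number of characteristics times number of statistics
print(feature_names[:3])
```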

***
###  Provide a literature review on classifiers which have been applied to the dataset and compare their performance
***
(JOFSR, 2022) compared logistic regression with k-means clustering and concluded, after also comparing their own results with those reported in the literature, that logistic regression was the better predictive model. (JOFSR, 2022) also observed that the requirements imposed and the machine learning tools used significantly affected the results and accuracy of the model.

(Shen, 2019) states that radius, perimeter and area are highly correlated, both by definition and as observed in a heat map, and that compactness mean, concavity mean and concave points mean are also highly correlated. On this basis (Shen, 2019) simplified the analysis by dropping the highly correlated features, leaving only 6 features. Because the features have entirely different scales, the data was normalized by z-scoring as a preprocessing step; z-scoring was chosen because the pairplot histograms (Figure 3 in the paper) showed approximately Gaussian distributions. Without normalization, if KNN with the l2 norm were used as the ML algorithm, the distance calculation would be dominated by the feature with the largest scale, preventing the estimator from learning from the other features.
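
The scale argument can be illustrated with a small sketch. It takes the `area_mean` and `smoothness_mean` values of the first five rows shown later in this notebook and measures how much of the squared l2 distance between two rows comes from `area_mean`, before and after z-scoring. The helper names are hypothetical, and the z-scores here are computed over these five rows only, so the numbers are illustrative rather than a reproduction of the paper.

```python
import statistics

# (area_mean, smoothness_mean) for the first five rows shown in this notebook.
rows = [
    (1001.0, 0.11840),
    (1326.0, 0.08474),
    (1203.0, 0.10960),
    (386.1, 0.14250),
    (1297.0, 0.10030),
]

def area_share(a, b):
    """Fraction of the squared Euclidean (l2) distance due to the first feature."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[0] / sum(squared)

def zscore_columns(data):
    """Z-score each column using the population standard deviation."""
    columns = list(zip(*data))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in data
    ]

scaled = zscore_columns(rows)

raw_share = area_share(rows[0], rows[1])
scaled_share = area_share(scaled[0], scaled[1])

print(f"raw: {raw_share:.6f}, z-scored: {scaled_share:.3f}")
```

On the raw values almost the entire distance comes from `area_mean`; after z-scoring both features contribute on a comparable scale.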

(Mohammad et al., 2022) used a number of algorithms to classify the tumour types. These were: Decision Tree (DT), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Naive Bayes multinomial classifier (NBM), and K-nearest neighbors (KNN). The results of the classifying techniques in terms of accuracy are summarized in Figure 15 from the paper seen below.
![Figure15_ComparisonOfClassifiers.PNG](attachment:Figure15_ComparisonOfClassifiers.PNG)

(Saygili, 2019) classified the records in the Wisconsin Diagnostic Breast Cancer dataset using support vector machines (SVM), k-nearest neighbours (k-NN), Naive Bayes, J48, random forest and multilayer perceptron methods. A normalization preprocessing step was applied to the dataset prior to classification.
After preprocessing, the six classifiers were evaluated using 10-fold cross-validation. Table 6 shows that random forest was the most successful method, with an AUC of 0.999, followed by the multilayer perceptron and then k-NN with an AUC of 0.991.
![Table6_EvaluationOfTheClassificationMethods.PNG](attachment:Table6_EvaluationOfTheClassificationMethods.PNG)
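
To make the 10-fold cross-validation procedure concrete, here is a minimal sketch of how 569 samples could be partitioned so that every sample is used for testing exactly once. It is a plain, unstratified split written for illustration; the function name and the seed are assumptions, and it is not claimed to be the procedure or tooling used in the paper.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled indices gives folds whose sizes
    # differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(569, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

A classifier would be fitted on each `train` list and scored on the matching `test` list, and the ten scores averaged.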

Confusion matrix, accuracy, sensitivity, specificity and ROC area (AUC) metrics were used to measure the classification success of the methods. Equations 5, 6 and 7 in the paper show how these metrics are obtained:
* TP: Data that is sick and labelled as patient
* TN: Data that is not sick and labelled as non-patient
* FP: Data that is not sick and labelled as patient
* FN: Data that is sick and labelled as non-patient

$$ Accuracy = \frac{TP+TN}{TP+FP+FN+TN} $$ Equation 5

$$ Specificity = \frac{TN}{FP+TN} $$ Equation 6

$$ Sensitivity = \frac{TP}{TP+FN} $$ Equation 7
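
Equations 5 to 7 can be checked with a short sketch. The labels below are hypothetical (1 = patient, 0 = non-patient), and the function names are chosen for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + fn + tn)  # Equation 5

def specificity(tn, fp):
    return tn / (fp + tn)                   # Equation 6

def sensitivity(tp, fn):
    return tp / (tp + fn)                   # Equation 7

# Hypothetical ground truth and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
print(accuracy(tp, tn, fp, fn), specificity(tn, fp), sensitivity(tp, fn))
```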

Tables 3 and 4 in this paper show that the preprocessing steps affect classification success. This is most pronounced for the random forest method, which gives the most successful results and whose success changed considerably with the selection of features.
![Table3_SuccesRatesWithoutPreprocessing.PNG](attachment:Table3_SuccesRatesWithoutPreprocessing.PNG)

![Table4_SuccessRatesWithPreprocessing.PNG](attachment:Table4_SuccessRatesWithPreprocessing.PNG)

Figure 1 in (Tiwari et al., 2022) lays out the methodology for implementing this model.
![Figure1_Methodology.PNG](attachment:Figure1_Methodology.PNG)

Pre-processing includes methods such as Label Encoder and Normalisation. Label Encoder was seen as an efficient tool for encoding the levels of categorical features as numeric values, and all the categorical features were encoded. In this paper, the malignant and benign labels were encoded as 0 and 1. With the Normalizer method, the values of all attributes were rescaled to the range 0 to 1. SVM and the Random Forest Classifier were the best for predictive analysis with an accuracy of 96.5%, followed by KNN and Decision Tree at 95%.

The methods used across these papers are consistent with Figure 1 in (Tiwari et al., 2022). Preprocessing increases the accuracy of the models, and normalization is the most popular choice. SVM and Random Forest were the most successful methods, with KNN also a strong choice that delivered good accuracy relative to the other models used.


###### Breast Cancer Wisconsin (Diagnostic) Data Set (2019). Predict whether the cancer is benign or malignant.
###### JOFSR (2022). Comparative analysis of Malignancy prediction of Breast Cancer cells using Logistic Regression & K Means Algorithm
###### Mohammad, Walid Theib et al. (2022). Diagnosis of Breast Cancer Pathology on the Wisconsin Dataset with the Help of Data Mining Classification and Clustering Techniques
###### Saygili, Ahmet (2018). Classification and Diagnostic Prediction of Breast Cancers via Different Classifiers
###### Shen, Ziyuan (2019). Final Report: Breast Cancer Wisconsin

***
### Present a statistical analysis of the dataset
***
##### building block of this section taken from https://www.kaggle.com/code/kanncaa1/statistical-learning-tutorial-for-beginners

In [90]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# https://www.kaggle.com/code/kanncaa1/statistical-learning-tutorial-for-beginners used as reference for statistical analysis

import re
# https://www.pythontutorial.net/python-regex/python-regex-sub/

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# https://stackoverflow.com/questions/26414913/normalize-columns-of-a-dataframe

In [85]:
data = pd.read_csv(r"C:\Users\35387\Program\repos\gitIntro\Programming\PfDA_Project2\Data\data.csv") 

#### (i) Tidy up the data

In [86]:
# tidy up data

data = data.drop(['id'], axis=1)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

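data = data.drop(columns=data.columns[11:31])
# Drops the 20 standard-error and "worst" columns, keeping the 10 mean features
# https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/

data.drop(['Unnamed: 32'], axis=1)
# NOTE: the result above is not assigned back (and inplace is not set), so the empty
# 'Unnamed: 32' column remains in data, as the output below shows
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html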

data = data.replace('M',0)
data = data.replace('B',1)
# Had to replace the strings 'M' and 'B' with integers otherwise the normalisation didn't work
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
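
As a side note on the tidy-up step, `DataFrame.drop` returns a new frame unless its result is assigned (or `inplace=True` is passed). The sketch below shows this on a hypothetical two-row miniature of the CSV; the second row's values and the frame itself are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the CSV: an identifier, the label, two features
# and the empty trailing column.
df = pd.DataFrame({
    "id": [1, 2],
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 13.54],
    "texture_mean": [10.38, 14.36],
    "Unnamed: 32": [float("nan"), float("nan")],
})

# drop() returns a new frame; without assignment df itself is unchanged.
df.drop(columns=["Unnamed: 32"])
still_there = "Unnamed: 32" in df.columns

# Assigning the result actually removes the columns.
df = df.drop(columns=["id", "Unnamed: 32"])

# Encode the label, mirroring the notebook's M -> 0, B -> 1 choice.
df["diagnosis"] = df["diagnosis"].map({"M": 0, "B": 1})

print(still_there)
print(list(df.columns))
```

If the empty column were removed this way in the notebook, the later `0:-1` slices would need adjusting, since they currently rely on `Unnamed: 32` being the last column.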

In [87]:
data

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,Unnamed: 32
0,0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,
1,0,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,
2,0,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,
3,0,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,
4,0,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,
...,...,...,...,...,...,...,...,...,...,...,...,...
564,0,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,
565,0,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,
566,0,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,
567,0,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,


#### (ii) Normalise the data
##### reference used for this section https://stackoverflow.com/questions/26414913/normalize-columns-of-a-dataframe

In [88]:
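# Z-score with the unbiased (sample) standard deviation: pandas std() defaults to ddof=1
# The 0:-1 slice skips only the last column ('Unnamed: 32'), so the diagnosis label is z-scored as well
data.iloc[:,0:-1] = data.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)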
print(data)

     diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    -1.296535     1.096100     -2.071512        1.268817   0.983510   
1    -1.296535     1.828212     -0.353322        1.684473   1.907030   
2    -1.296535     1.578499      0.455786        1.565126   1.557513   
3    -1.296535    -0.768233      0.253509       -0.592166  -0.763792   
4    -1.296535     1.748758     -1.150804        1.775011   1.824624   
..         ...          ...           ...             ...        ...   
564  -1.296535     2.109139      0.720838        2.058974   2.341795   
565  -1.296535     1.703356      2.083301        1.614511   1.722326   
566  -1.296535     0.701667      2.043775        0.672084   0.577445   
567  -1.296535     1.836725      2.334403        1.980781   1.733693   
568   0.769931    -1.806811      1.220718       -1.812793  -1.346604   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0           1.567087          3.280628        2.650542  

In [91]:
# Z-score with the biased (population) standard deviation: StandardScaler uses ddof=0
data.iloc[:,0:-1] = scaler.fit_transform(data.iloc[:,0:-1].to_numpy())
print(data)

     diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    -1.297676     1.097064     -2.073335        1.269934   0.984375   
1    -1.297676     1.829821     -0.353632        1.685955   1.908708   
2    -1.297676     1.579888      0.456187        1.566503   1.558884   
3    -1.297676    -0.768909      0.253732       -0.592687  -0.764464   
4    -1.297676     1.750297     -1.151816        1.776573   1.826229   
..         ...          ...           ...             ...        ...   
564  -1.297676     2.110995      0.721473        2.060786   2.343856   
565  -1.297676     1.704854      2.085134        1.615931   1.723842   
566  -1.297676     0.702284      2.045574        0.672676   0.577953   
567  -1.297676     1.838341      2.336457        1.982524   1.735218   
568   0.770609    -1.808401      1.221792       -1.814389  -1.347789   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0           1.568466          3.283515        2.652874  

In [None]:
# The scikit-learn documentation for sklearn.preprocessing.scale notes that it uses a biased estimator
# of the standard deviation and that this choice is unlikely to affect model performance,
# so either estimate can safely be used.
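
The size of the difference between the two estimates can be worked out directly. The sketch below uses only the standard library and the `radius_mean` values of the first five rows shown earlier; the variable names are illustrative.

```python
import math
import statistics

# radius_mean of the first five rows shown earlier in this notebook.
values = [17.99, 20.57, 19.69, 11.42, 20.29]
n = len(values)

sample_std = statistics.stdev(values)       # divides by n - 1 (unbiased variance)
population_std = statistics.pstdev(values)  # divides by n (biased variance)

# The two estimates differ by a fixed factor that depends only on n.
factor = math.sqrt(n / (n - 1))
assert math.isclose(sample_std, population_std * factor)

# For the 569 rows of this dataset the factor is very close to 1, so the
# two sets of z-scores differ by roughly 0.1%.
factor_569 = math.sqrt(569 / 568)
print(factor_569)
print(1.096100 * factor_569)  # compare with 1.097064 in the StandardScaler output
```

Multiplying a z-score from the unbiased output by this factor gives the corresponding value in the biased output, up to rounding.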

In [96]:
# Unbiased estimate
normalized_data1 = (data-data.mean())/data.std()
normalized_data1

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,Unnamed: 32
0,-1.296535,1.096100,-2.071512,1.268817,0.983510,1.567087,3.280628,2.650542,2.530249,2.215566,2.253764,
1,-1.296535,1.828212,-0.353322,1.684473,1.907030,-0.826235,-0.486643,-0.023825,0.547662,0.001391,-0.867889,
2,-1.296535,1.578499,0.455786,1.565126,1.557513,0.941382,1.052000,1.362280,2.035440,0.938859,-0.397658,
3,-1.296535,-0.768233,0.253509,-0.592166,-0.763792,3.280667,3.399917,1.914213,1.450431,2.864862,4.906602,
4,-1.296535,1.748758,-1.150804,1.775011,1.824624,0.280125,0.538866,1.369806,1.427237,-0.009552,-0.561956,
...,...,...,...,...,...,...,...,...,...,...,...,...
564,-1.296535,2.109139,0.720838,2.058974,2.341795,1.040926,0.218868,1.945573,2.318924,-0.312314,-0.930209,
565,-1.296535,1.703356,2.083301,1.614511,1.722326,0.102368,-0.017817,0.692434,1.262558,-0.217473,-1.057681,
566,-1.296535,0.701667,2.043775,0.672084,0.577445,-0.839745,-0.038646,0.046547,0.105684,-0.808406,-0.894800,
567,-1.296535,1.836725,2.334403,1.980781,1.733693,1.524426,3.269267,3.294046,2.656528,2.135315,1.042778,


In [97]:
# Min Max scaling
normalized_data2 = (data-data.min())/(data.max()-data.min())
normalized_data2

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,Unnamed: 32
0,0.0,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.703140,0.731113,0.686364,0.605518,
1,0.0,0.643144,0.272574,0.615783,0.501591,0.289880,0.181768,0.203608,0.348757,0.379798,0.141323,
2,0.0,0.601496,0.390260,0.595743,0.449417,0.514309,0.431017,0.462512,0.635686,0.509596,0.211247,
3,0.0,0.210090,0.360839,0.233501,0.102906,0.811321,0.811361,0.565604,0.522863,0.776263,1.000000,
4,0.0,0.629893,0.156578,0.630986,0.489290,0.430351,0.347893,0.463918,0.518390,0.378283,0.186816,
...,...,...,...,...,...,...,...,...,...,...,...,...
564,0.0,0.690000,0.428813,0.678668,0.566490,0.526948,0.296055,0.571462,0.690358,0.336364,0.132056,
565,0.0,0.622320,0.626987,0.604036,0.474019,0.407782,0.257714,0.337395,0.486630,0.349495,0.113100,
566,0.0,0.455251,0.621238,0.445788,0.303118,0.288165,0.254340,0.216753,0.263519,0.267677,0.137321,
567,0.0,0.644564,0.663510,0.665538,0.475716,0.588336,0.790197,0.823336,0.755467,0.675253,0.425442,
