# Introduction
***

This assignment investigates the Wisconsin Breast Cancer Dataset by:
- Analysing and reviewing the dataset, and presenting an overview and background.

- Reviewing the literature on classifiers that have been applied to the dataset and comparing their performance.

- Presenting a statistical analysis of the dataset.

- Training a set of classifiers on the dataset using a range of machine learning algorithms, presenting their classification performance, and detailing the rationale for the parameter selections made.

- Comparing, contrasting and critiquing the results with reference to the literature.

- Discussing and investigating how the dataset could be extended by synthesising new tumour datapoints.

# Overview of Breast Cancer
***

Breast cancer is the abnormal growth of cells in the breast, which forms a tumour. It is the most common form of cancer diagnosed in women. Although it is more common in women, men can also be diagnosed (Mayo Clinic, 2022).

According to the Irish Cancer Society, more than 3,500 women in Ireland are diagnosed with breast cancer each year, and it is more common in women over 50 (Irish Cancer Society,_).

# Dataset Background
***

The Wisconsin Breast Cancer (Original) Dataset was compiled by Wolberg, Street and Mangasarian at the University of Wisconsin in 1996. It consists of measurements for 699 samples of breast cell nuclei, described by 10 attributes and a class label. The data was collected from Dr. Wolberg's clinical cases using a non-surgical method called fine needle biopsy, which extracts a small number of cells from the tumour.

![Breast%20cells.png](attachment:Breast%20cells.png)

The attributes of the dataset are as follows:

- ID number/Sample number
- Clump Thickness (1-10)
- Uniformity of Cell Size (1-10)
- Uniformity of Cell Shape (1-10)
- Marginal Adhesion (1-10)
- Single Epithelial Cell Size (1-10)
- Bare Nuclei (1-10)
- Bland Chromatin (1-10)
- Normal Nucleoli (1-10)
- Mitoses (1-10)
- Class (2 for benign, 4 for malignant)
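
The numeric class codes can be mapped to readable labels before analysis. The following is a minimal sketch, assuming the column is named `Class` as in the Kaggle file; the `Diagnosis` column name and the five sample values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Illustrative class codes only; the real values come from the dataset's Class column.
sample = pd.DataFrame({"Class": [2, 2, 4, 2, 4]})

# Map the dataset's numeric codes to readable labels (2 = benign, 4 = malignant).
class_labels = {2: "benign", 4: "malignant"}
sample["Diagnosis"] = sample["Class"].map(class_labels)

# Count how many samples fall into each class.
diagnosis_counts = sample["Diagnosis"].value_counts()
print(diagnosis_counts)
```

The same mapping could be applied to the full DataFrame once it is loaded, which also makes any class imbalance visible early on.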

There are 16 instances in the data with a missing attribute value, denoted by "?".
The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/ninjacoding/breast-cancer-wisconsin-benign-or-malignant).

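Because missing values are stored as "?", pandas would otherwise read an affected column as text. One possible way to handle this is to pass `na_values="?"` to `pd.read_csv`. The sketch below uses a hypothetical three-row extract with a shortened column set rather than the real file, so the rows shown are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the second row has a missing Bare Nuclei value.
raw_csv = io.StringIO(
    "Sample code number,Bare Nuclei,Class\n"
    "1,1,2\n"
    "2,?,2\n"
    "3,10,4\n"
)

# Treat "?" as missing so the column is parsed as numeric instead of text.
sample = pd.read_csv(raw_csv, na_values="?")

# Count missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# One simple option is to drop incomplete rows; imputation is an alternative.
complete_rows = sample.dropna()
print(len(complete_rows))
```

Whether to drop or impute the incomplete rows is a judgement call; with only 16 affected instances out of 699, dropping them loses little data.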

# Literature Review
***

## Classifiers

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7330506/

# Statistical Review
***
- Import the data
- Clean the data: check for missing attributes and empty fields, tidy the columns, compute basic statistics (`describe`), and examine correlations (correlation matrix/pairplot)

In [2]:
# importing libraries
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Read the CSV data into a DataFrame and display it.
df = pd.read_csv("tumor.csv")
df

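| Unnamed: 0 | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 678 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 679 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 680 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 681 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |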
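
The cleaning and summary steps listed for this section could be sketched as follows. To keep the example self-contained it rebuilds only the nine rows visible in the output above instead of reading `tumor.csv`, so the resulting statistics are illustrative and do not describe the full dataset. The variable names (`sample`, `features`, `class_corr`) are assumptions introduced here.

```python
import pandas as pd

columns = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]

# The nine rows visible in the displayed output, used in place of the full file.
rows = [
    [1000025, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2],
    [1002945, 5, 4, 4, 5, 7, 10, 3, 2, 1, 2],
    [1015425, 3, 1, 1, 1, 2, 2, 3, 1, 1, 2],
    [1016277, 6, 8, 8, 1, 3, 4, 3, 7, 1, 2],
    [1017023, 4, 1, 1, 3, 2, 1, 3, 1, 1, 2],
    [776715, 3, 1, 1, 1, 3, 2, 1, 1, 1, 2],
    [841769, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2],
    [888820, 5, 10, 10, 3, 7, 3, 8, 10, 2, 4],
    [897471, 4, 8, 6, 4, 3, 4, 10, 6, 1, 4],
]
sample = pd.DataFrame(rows, columns=columns)

# Drop the identifier column: it labels the sample but is not a measurement.
features = sample.drop(columns=["Sample code number"])

# Check for missing values across all remaining columns.
missing_total = int(features.isna().sum().sum())
print("Missing values:", missing_total)

# Basic summary statistics (count, mean, std, min, quartiles, max).
summary = features.describe()
print(summary)

# Pearson correlation of each attribute with the class label.
class_corr = features.corr()["Class"].drop("Class").sort_values(ascending=False)
print(class_corr)
```

On the full DataFrame, `sns.pairplot(df, hue="Class")` is one option for visualising the same pairwise relationships.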


# References
- Mayo Clinic (2022). https://www.mayoclinic.org/diseases-conditions/breast-cancer/symptoms-causes/syc-20352470