Table of Contents
Libraries
Data Source
Project Workflow
Training
Test
Confusion Matrix
Conclusion
- numpy
- pandas
- matplotlib.pyplot
- seaborn as sns
- sklearn
This breast cancer database was obtained from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison.
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
It can be downloaded from the UCI Machine Learning Repository.
The workflow for this project is listed below:
- Pre-processing
- Exploratory Data Analysis (EDA)
- Data Visualization
- Feature Selection
- Train-Test
- Confusion Matrix
- Conclusion
We tried four machine learning models and evaluated their performance on our dataset.
The models are listed with their training accuracy below:
- Decision Tree: Mean accuracy = 0.954974 & Std accuracy = 0.020103
- Support Vector Machine (SVM): Mean accuracy = 0.971386 & Std accuracy = 0.013512
- Gaussian Naive Bayes: Mean accuracy = 0.963223 & Std accuracy = 0.025463
- K-Nearest Neighbors (KNN): Mean accuracy = 0.969345 & Std accuracy = 0.016428
Training shows that the Support Vector Machine model has the highest mean training accuracy, so we evaluated it (SVC) on our test dataset, where it reached a test accuracy of 0.9714 (97.14%).
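The mean and standard deviation listed for each model suggest that the models were scored with k-fold cross-validation. A minimal sketch of such a comparison follows. It assumes 10 stratified folds, default hyperparameters, and synthetic stand-in data from `make_classification` instead of the Wisconsin records. None of these choices is confirmed by the project, so the printed numbers will not match the ones listed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data: nine features, two classes.
# This is NOT the Wisconsin data; it only makes the sketch runnable.
X_train, y_train = make_classification(
    n_samples=489, n_features=9, n_informative=5, n_redundant=2, random_state=0
)

# The four model families named above, with default settings (assumed).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Assumed evaluation setup: 10 stratified folds, scored by accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy = {scores.mean():.6f}, std = {scores.std():.6f}")
```

Comparing the standard deviation alongside the mean helps show whether a small gap between two models, such as the one between SVM and KNN above, is larger than the fold-to-fold variation.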
We also provided a list of nine numbers, one per feature, as input, and the model predicted the correct breast cancer class.
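The test step and the single-record prediction can be sketched as follows. The data is a synthetic stand-in, and the `SVC` defaults, the random seed, and the nine input values are illustrative assumptions, not the project's actual settings or inputs.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with nine features; NOT the Wisconsin data.
X, y = make_classification(
    n_samples=699, n_features=9, n_informative=5, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = SVC()  # default hyperparameters (assumed)
model.fit(X_train, y_train)

# Accuracy on the held-out test set.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.4f}")

# Predict the class of one new record given as a list of nine feature values.
# predict() expects a 2D array-like, so the list is wrapped in another list.
sample = [0.5, -1.2, 0.3, 0.0, 1.1, -0.7, 0.9, 0.2, -0.4]  # hypothetical values
predicted_class = model.predict([sample])[0]
print(f"Predicted class: {predicted_class}")
```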
Confusion Matrix from our Validation is:
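A confusion matrix of this kind can be computed with `sklearn.metrics.confusion_matrix`. The sketch below uses eight made-up labels purely for illustration. They are not the project's validation results.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for eight records; NOT the project's predictions.
y_true = ["benign", "benign", "benign", "malignant",
          "malignant", "benign", "malignant", "benign"]
y_pred = ["benign", "benign", "malignant", "malignant",
          "malignant", "benign", "benign", "benign"]

# Rows are true classes and columns are predicted classes,
# both in the order given by `labels`.
labels = ["benign", "malignant"]
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)
# [[4 1]
#  [1 2]]

# Accuracy is the share of records on the diagonal.
accuracy = matrix.trace() / matrix.sum()
print(accuracy)  # 0.75
```

For a diagnosis task, the off-diagonal cell for malignant records predicted as benign is usually the one to watch most closely, since it counts missed cancers.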
We used the Wisconsin Breast Cancer Database, which has 699 records and 11 columns of attributes.
The attributes in columns 2 through 10 were used to represent each instance.
- Each instance has one of 2 possible classes: benign or malignant.
- The class is stored as the attribute in column 11.
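The column layout described above can be sketched with pandas. The three rows and the column names below are made up for illustration, and the first column is assumed to be a record identifier. Only the positions (columns 2 through 10 for features, column 11 for the class) come from the description.

```python
import io

import pandas as pd

# Three made-up rows in an 11-column layout: an assumed identifier,
# nine feature columns, and the class. Column names are hypothetical.
raw = io.StringIO(
    "id,f1,f2,f3,f4,f5,f6,f7,f8,f9,class\n"
    "1001,5,1,1,1,2,1,3,1,1,benign\n"
    "1002,8,10,10,8,7,10,9,7,1,malignant\n"
    "1003,3,1,1,1,2,2,3,1,1,benign\n"
)
df = pd.read_csv(raw)

# Columns 2 through 10 (1-based) are positions 1..9 in 0-based iloc slicing.
X = df.iloc[:, 1:10]
# Column 11 (1-based) is position 10.
y = df.iloc[:, 10]

print(X.shape)     # (3, 9)
print(y.tolist())  # ['benign', 'malignant', 'benign']
```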
Class distribution was:
- Benign: 458 (65.5%)
- Malignant: 241 (34.5%)
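The percentages follow directly from the class counts and the 699 total records, as this short check shows:

```python
# Class counts as reported above.
counts = {"Benign": 458, "Malignant": 241}
total = sum(counts.values())  # 699

shares = {}
for name, count in counts.items():
    shares[name] = 100 * count / total
    print(f"{name}: {count} ({shares[name]:.1f}%)")
# Benign: 458 (65.5%)
# Malignant: 241 (34.5%)
```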
After processing and analyzing the data, we split the dataset 70%-30% into training and test sets and trained four models. The SVM model had the highest mean training accuracy (0.971386) and reached 97.14% accuracy on the test data.
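As a sketch of what a 70%-30% split of 699 records looks like with `train_test_split`, the example below uses placeholder record indices in place of the feature rows. The `random_state` value and the use of `stratify` are assumptions; the project does not say whether the split was stratified.

```python
from sklearn.model_selection import train_test_split

# Labels with the reported class counts: 458 benign, 241 malignant.
labels = ["benign"] * 458 + ["malignant"] * 241
record_ids = list(range(len(labels)))  # stand-in for the 699 feature rows

# test_size=0.3 gives the 70%-30% split; stratify keeps class shares similar
# in both parts (an assumption, not confirmed by the project).
train_ids, test_ids, y_train, y_test = train_test_split(
    record_ids, labels, test_size=0.3, random_state=42, stratify=labels
)

print(len(train_ids), len(test_ids))  # 489 210

# Share of malignant records in the test part, close to the overall 34.5%.
malignant_share = sum(1 for label in y_test if label == "malignant") / len(y_test)
print(f"{malignant_share:.3f}")
```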