# Step 1 - Importing the Dataset
- Dataset: Breast Cancer Wisconsin Diagnostic Database
- Dataset includes measurements of breast cancer tumors, as well as a classification label for each tumor: malignant or benign.
- Dataset has 569 instances (one per tumor) and 30 attributes (features) per instance.
- Task: predict whether a tumor is malignant or benign.
- Attributes capture important characteristics about the nature of the data.
- Given the label to be predicted (malignant vs benign), useful attributes include the size, radius, and texture of the tumor.
- Create a new variable for each important set of information and assign the data to it; the result is one NumPy array per set of information.

In [1]:
import sklearn
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()            # Bunch object (dictionary-like)

# Organize data
label_names = data['target_names']     # label names - NumPy array
labels = data['target']                # actual labels - NumPy array
feature_names = data['feature_names']  # attribute names - NumPy array
features = data['data']                # attributes - NumPy array (569 x 30)

# Look at data
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

['malignant' 'benign']
0
mean radius
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


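- The output shows that the class names are malignant and benign, which are encoded as the binary values 0 and 1 (0 represents malignant tumors, and 1 represents benign tumors).
- The first instance is a malignant tumor (label 0), and its first attribute, mean radius, has the value 1.799e+01.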
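
As a quick sanity check on the figures quoted in this step, the sketch below reloads the dataset and prints the shape of the feature matrix and the number of tumors in each class. The variable name `counts` is introduced here for illustration only, and `np.bincount` is simply one convenient way to tally integer labels.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this cell runs on its own
data = load_breast_cancer()
features = data['data']
labels = data['target']
label_names = data['target_names']

# Shape of the feature matrix: (instances, attributes)
print(features.shape)

# Tally how many tumors carry each integer label (0 or 1)
counts = np.bincount(labels)  # hypothetical helper name
for index, name in enumerate(label_names):
    print(index, name, counts[index])
```

The position of each name in `label_names` is its integer label, which is why index 0 prints next to malignant and index 1 next to benign.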

# Step 2 - Organizing Data into Sets
- To evaluate how well a classifier is performing, always test the model on unseen data.
- Before building a model, split the data into two parts: a training set and a test set.
- Use the training set to train and evaluate the model during the development stage.
- Use the trained model to make predictions on the unseen test set; this gives a sense of the model's performance and robustness.

In [2]:
from sklearn.model_selection import train_test_split

# Split data
train, test, train_labels, test_labels = train_test_split(features, labels,
                                                          test_size=0.33,
                                                          random_state=42)

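- The train_test_split() function shuffles the data and splits it randomly; the test_size parameter sets the fraction of the data held out for testing.
- The test set represents 33% of the original dataset; the remaining data makes up the training set.
- Setting random_state=42 fixes the random seed, so the same split is produced on every run.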
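
To see what the split actually produced, the following sketch repeats the call above and prints the size of each part. It assumes the same `test_size` and `random_state` values as the cell above; nothing else is changed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the dataset so this cell runs on its own
data = load_breast_cancer()
features = data['data']
labels = data['target']

# Same split as above: 33% of the instances are held out for testing
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

# Number of instances and attributes in each part
print(train.shape, test.shape)

# Fraction of the dataset that ended up in the test set
print(len(test) / len(features))
```

The printed fraction is close to, but not exactly, 0.33, because the number of test instances has to be a whole number.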

# Step 3 - Building and Evaluating the Model
- The Naive Bayes (NB) classifier usually performs well in binary classification tasks; this example uses GaussianNB, which models each feature with a normal distribution per class.
- Train the model with the fit() function, then use the trained model to make predictions on the test set with the predict() function.
- The predict() function returns an array with one prediction for each data instance in the test set.
- Print the results.

In [3]:
from sklearn.naive_bayes import GaussianNB

# Initialize classifier
gnb = GaussianNB()

# Train classifier - fit model to data
model = gnb.fit(train, train_labels)

# Make predictions - use trained model on test set
preds = gnb.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]


- The predict() function returns an array of 0s and 1s, which represent the predicted values for the tumor class (malignant vs benign).
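
The raw 0s and 1s are easier to read once they are translated back into class names. The sketch below does this by indexing the array of label names with the predictions; the names `pred_names` and `probs` are introduced here for illustration. It also calls `predict_proba()`, which is not used elsewhere in this walkthrough, to show the class probabilities behind the first few predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Rebuild the data, split, and model so this cell runs on its own
data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)
gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Index the label names with the 0/1 predictions to get readable names
pred_names = data['target_names'][preds]
print(pred_names[:5])

# Probability of each class (columns: malignant, benign) for five tumors
probs = gnb.predict_proba(test[:5])
print(probs.round(3))
```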

# Step 4 - Evaluating the Model's Accuracy
- Using the array of true class labels, evaluate the accuracy of the model's predicted values by comparing the two arrays (test_labels vs preds).

In [4]:
from sklearn.metrics import accuracy_score

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

0.9414893617021277


- The output shows that the NB classifier is 94.15% accurate on the test set, which means that it correctly predicts whether the tumor is malignant or benign for 94.15% of the test instances.
- These results suggest that our feature set of 30 attributes is a good indicator of tumor class.
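
Accuracy is simply the fraction of positions where the two arrays agree, so it can also be computed by hand. The sketch below does that and compares the result with `accuracy_score()`. As an optional extra that is not part of the original walkthrough, it also prints a confusion matrix, which breaks the predictions down by class; `manual_accuracy` and `cm` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Rebuild the data, split, and model so this cell runs on its own
data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Fraction of test instances where the prediction equals the true label
manual_accuracy = np.mean(preds == test_labels)
print(manual_accuracy, accuracy_score(test_labels, preds))

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(test_labels, preds)
print(cm)
```

The off-diagonal entries of the matrix are the misclassified tumors, which a single accuracy figure does not separate into malignant tumors predicted as benign and the reverse.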