# Cancer Classifier

In this project, we will use several Python libraries to build a K-Nearest Neighbors classifier that is trained to predict whether a breast tumor is malignant or benign.

If you get stuck during this project or want to see an experienced developer solve it, watch the project's explanatory video.

##### Explore the data

1. Let’s begin by importing the breast cancer data from ``sklearn``. We want to import the function ``load_breast_cancer`` from ``sklearn.datasets``.

   Once we’ve imported the function, let’s load the data into a variable called ``breast_cancer_data``. Do this by setting ``breast_cancer_data`` equal to the return value of ``load_breast_cancer()``.

In [1]:
# Task 1
from sklearn.datasets import load_breast_cancer

breast_cancer_data = load_breast_cancer()

2. Before jumping into creating our classifier, let’s take a look at the data. Begin by printing ``breast_cancer_data.data[0]``. That’s the first datapoint in our set. But what do all of those numbers represent? Let’s also print ``breast_cancer_data.feature_names``.

In [5]:
# Task 2
print(f'The first datapoint in our set: \n {breast_cancer_data.data[0]}')
print()
print(f'Feature names in the dataset: \n {breast_cancer_data.feature_names}')

The first datapoint in our set: 
 [1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]

Feature names in the dataset: 
 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
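
Printed separately, the raw values and the feature names are hard to match up by eye. One way to line them up, as a minimal sketch, is to zip the two arrays together. The variable name `first_point` is introduced here only for illustration and is not part of the project.

```python
from sklearn.datasets import load_breast_cancer

breast_cancer_data = load_breast_cancer()

# Pair every feature name with the matching value of the first datapoint.
# first_point is an illustrative name, not something the project defines.
first_point = {
    str(name): float(value)
    for name, value in zip(breast_cancer_data.feature_names, breast_cancer_data.data[0])
}

for name, value in first_point.items():
    print(f'{name:>25}: {value}')

# The data array has one row per sample and one column per feature.
print(breast_cancer_data.data.shape)
```

The shape shows that the dataset holds 569 samples with 30 features each, which matches the 30 names printed above.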


3. We now have a sense of what the data looks like, but what are we trying to classify? Let’s print both ``breast_cancer_data.target`` and ``breast_cancer_data.target_names``.

    Was the very first data point tagged as malignant or benign?

In [7]:
# Task 3
print(breast_cancer_data.target)
print(breast_cancer_data.target_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 

##### Splitting the data into Training and Validation Sets

4. We have our data, but now it needs to be split into training and validation sets. Luckily, ``sklearn`` has a function that does that for us. Begin by importing the ``train_test_split`` function from ``sklearn.model_selection``.

In [8]:
# Task 4
from sklearn.model_selection import train_test_split

5. Call the ``train_test_split`` function. It takes several parameters:

- The data you want to split (for us, ``breast_cancer_data.data``).
- The labels associated with that data (for us, ``breast_cancer_data.target``).
- ``test_size``. This is the fraction of your data that you want in your testing set. Let’s use ``test_size = 0.2``.
- ``random_state``. This ensures that every time you run your code, the data is split in the same way. This can be any number. We used ``random_state = 100``.

In [10]:
# Task 5
train_test_split

<function sklearn.model_selection._split.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)>
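
The cell above only displays the function object; it does not call it. As a rough sketch of what a call with the parameters listed in step 5 looks like, the snippet below splits the data twice with the same ``random_state`` and checks that both calls produce the same split. The names `first_split`, `second_split`, and `same_split` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast_cancer_data = load_breast_cancer()

# Call the function twice with identical arguments.
first_split = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target,
    test_size=0.2, random_state=100,
)
second_split = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target,
    test_size=0.2, random_state=100,
)

# The call returns four arrays.
print(len(first_split))

# With a fixed random_state, both calls produce the same four arrays.
same_split = all(np.array_equal(a, b) for a, b in zip(first_split, second_split))
print(same_split)
```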

6. Right now we’re not storing the return value of ``train_test_split``. ``train_test_split`` returns four values in the following order:

- The training set
- The validation set
- The training labels
- The validation labels

Store those values in variables named ``training_data``, ``validation_data``, ``training_labels``, and ``validation_labels``.

In [11]:
# Task 6
training_data, validation_data, training_labels, validation_labels = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=99
)
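
A quick sanity check, sketched below, is to confirm that each set of data has as many rows as its set of labels. The snippet repeats the split from the cell above so that it can run on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast_cancer_data = load_breast_cancer()

training_data, validation_data, training_labels, validation_labels = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target, test_size=0.2, random_state=99
)

# Every row of data should have exactly one label.
print(training_data.shape, training_labels.shape)
print(validation_data.shape, validation_labels.shape)
```

With 569 samples and ``test_size=0.2``, 114 rows end up in the validation set and the remaining 455 in the training set.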