<div style="width: 100%; overflow: hidden;">
    <a href="http://www.uc.pt/fctuc/dei/">
    <div style="float:left; width: 75%;">
        <img src="https://eden.dei.uc.pt/~naml/images_ecos/dei25.png"/>
    </div>
    </a>
</div>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
# Just to make plots look better
plt.rcParams["figure.figsize"] = (20,12)
plt.rcParams['axes.grid'] = True
plt.style.use('fivethirtyeight')
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['lines.linewidth'] = 3

<h2><font color='#3498db'>Introduction to Machine Learning</font></h2>

The main goal of this class is to introduce you to the fundamental steps of using Scikit-Learn in Python. To do this, we are going to use the [Breast Cancer Wisconsin Diagnostic Database](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).  

The dataset comprises a wide range of details concerning breast cancer tumors, along with classification labels indicating whether they are *malignant* or *benign*. It is composed of 569 instances, each representing a distinct tumor characterised by 30 attributes or features, including characteristics like tumor radius, texture, smoothness, and area.

Our objective with this dataset is to build a machine learning model that uses tumor-related information to predict whether a tumor is malignant or benign.

Scikit-learn provides a collection of preloaded datasets that we can seamlessly import into Python, and the specific dataset we require is readily available. Let's proceed to import and load this dataset.



### Variable Description
Attribute Information:

- ID number 
- Diagnosis (M = malignant, B = benign) 

Ten real-valued features are computed for each cell nucleus:

- Radius (mean of distances from center to points on the perimeter) 
- Texture (standard deviation of gray-scale values) 
- Perimeter 
- Area 
- Smoothness (local variation in radius lengths) 
- Compactness (perimeter^2 / area - 1.0) 
- Concavity (severity of concave portions of the contour) 
- Concave points (number of concave portions of the contour) 
- Symmetry 
- Fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recorded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

### Loading the Data

In [2]:
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()

The `data` variable serves as a Python object functioning similarly to a dictionary. It is crucial to focus on specific keys within this dictionary, including the names of classification labels (`target_names`), the real labels (`target`), the names of attributes or features (`feature_names`), and the attributes themselves (`data`).
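As a quick check of the structure described above, the following sketch lists the keys of the returned object and the shapes of its arrays. The variable name `bunch` is used here only for illustration and is not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer

# load_breast_cancer() returns a Bunch, a dictionary-like object
bunch = load_breast_cancer()

# Entries can be read like dictionary items or like attributes
print(sorted(bunch.keys()))
print(bunch["data"].shape)      # (n_samples, n_features)
print(bunch.target.shape)       # one label per sample
print(bunch.feature_names[:3])  # first three feature names
```

Both access styles (`bunch["data"]` and `bunch.data`) refer to the same array, so either can be used in the cells that follow.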

Attributes play a pivotal role in any classifier as they encapsulate vital characteristics about the underlying data. In the context of predicting the label we are interested in (distinguishing between malignant and benign tumors), potential informative attributes might encompass tumor size, radius, and texture.

Let's proceed to create new variables for each of these essential sets of information and assign the corresponding data to them.

In [3]:
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

In [4]:
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

['malignant' 'benign']
0
mean radius
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


As the cell above shows, our classes are denoted as `malignant` and `benign` and are encoded as binary values: 0 corresponds to malignant tumors, while 1 corresponds to benign tumors. Hence, the first data instance in our dataset represents a malignant tumor with a mean radius of 1.799e+01 (17.99).

With our data successfully loaded and class labels defined, we are now ready to proceed with our machine learning classifier development.
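To confirm the label encoding and the class distribution listed in the variable description, one option is to count how often each integer label occurs. This is a minimal sketch. The name `counts` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# np.bincount counts how many times each integer label (0, 1) occurs
counts = np.bincount(data.target)

# target_names[i] is the name of the class encoded as i
for name, count in zip(data.target_names, counts):
    print(f"{name}: {count}")
```

The counts should match the 212 malignant and 357 benign cases stated in the dataset description, which also shows that the classes are moderately imbalanced.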

### Organizing the Data in Training and Test

To measure the performance of a classifier, it is essential to test the model on data it hasn't seen during training. To accomplish this, it is common practice to partition your dataset into two distinct subsets: a `training set` and a `test set`.

The training set is used to develop and train the model.

The test set is reserved for assessing how well the trained model generalizes to unseen data. This methodology provides valuable insights into the model's performance and its ability to handle new, unseen instances.

Thankfully, scikit-learn offers a convenient function called "train_test_split()" that simplifies the process of dividing your dataset into these distinct sets. You can import this function and then apply it to split your data as follows:

In [5]:
from sklearn.model_selection import train_test_split
# Split our data
##TODO: Your code here!
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=42)

Using the `train_test_split()` function, the data is randomly divided, and the size of the split is controlled by the `test_size` argument. In this instance, we split the data by allocating 33% of the initial dataset to the test set (`test`), while the remaining data constitutes the training set (`train`). We also obtain the corresponding labels for both sets, `train_labels` and `test_labels`. Setting `random_state=42` fixes the random shuffle, so the same split is produced on every run.
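The sketch below repeats the split and prints the resulting shapes, which makes the effect of `test_size` concrete. The second call uses `stratify=labels`, an optional argument that this notebook does not use. It is shown only as an assumption about how one might keep the class proportions similar in both sets, and the `s_`-prefixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features, labels = load_breast_cancer(return_X_y=True)

# Same call as in the notebook: 33% of the rows go to the test set,
# and random_state fixes the shuffle so the split is reproducible
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42
)
print(train.shape, test.shape)

# Optional variant (not used in this notebook): stratify=labels keeps
# the fraction of benign cases similar in the training and test sets
s_train, s_test, s_train_labels, s_test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)
print(labels.mean(), s_train_labels.mean(), s_test_labels.mean())
```

Because benign is encoded as 1, the mean of a label array is the fraction of benign cases in it, which gives a simple way to compare the class balance of the subsets.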

With this data preparation step complete, we're now ready to proceed with the training of our initial model.

### Model Construction and Evaluation
Machine learning encompasses a variety of models, and each of these models possesses its unique set of advantages and limitations.

We are going to instantiate a few algorithms and see how they perform on the Breast Cancer Wisconsin Diagnostic Database.

### k-Nearest Neighbors (kNN)

In [6]:
from sklearn.neighbors import KNeighborsClassifier
##TODO: Your code here!

clf_neigh = KNeighborsClassifier(n_neighbors=3)
clf_neigh.fit(train, train_labels)

KNeighborsClassifier(n_neighbors=3)

First we import the `KNeighborsClassifier`. Then we initialize the classifier with `n_neighbors=3` and train it using the `fit` method.

Once our model has been trained, we can use it to make predictions on our test set. This is done with the `predict()` function, which returns an array containing one prediction for each data instance in the test set. We can then print these predictions to see what the model produced.

To perform predictions on the test set and observe the results, apply the "predict()" function as follows:

In [7]:
# Make predictions
##TODO: Your code here!
preds = clf_neigh.predict(test)
print(preds)

[0 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1
 1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 0 1 1 0 1 1 0
 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0
 0 0 1]


As you can see from the output above, the `predict()` function returns an array of 0s and 1s, representing the predicted class of each tumor (0 for malignant, 1 for benign).
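To see where such a prediction comes from, the following sketch looks up the three nearest training neighbours of the first five test instances and takes a majority vote over their labels. It is an illustration of the idea behind kNN with uniform weights, not a description of scikit-learn's internals. The names `neighbor_idx`, `neighbor_labels` and `manual_preds` are introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

features, labels = load_breast_cancer(return_X_y=True)
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42
)

clf_neigh = KNeighborsClassifier(n_neighbors=3).fit(train, train_labels)

# Indices (into the training set) of the 3 nearest neighbours
# of the first 5 test instances
_, neighbor_idx = clf_neigh.kneighbors(test[:5])

# Labels of those neighbours: each row holds 3 votes
neighbor_labels = train_labels[neighbor_idx]

# Majority vote: with 3 voters and 2 classes there can be no tie
manual_preds = np.array(
    [np.bincount(votes, minlength=2).argmax() for votes in neighbor_labels]
)

print(neighbor_labels)
print(manual_preds)
print(clf_neigh.predict(test[:5]))
```

The last two printed arrays should agree, since `predict()` with the default uniform weights also assigns the most common label among the neighbours.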

With our predictions in hand, the next step is to assess the performance of our kNN classifier.

### Evaluating the Model Accuracy

Given the array of true class labels, we can assess our model's predictions by comparing the two arrays (`test_labels` versus `preds`). To do this, we will use the scikit-learn function `accuracy_score()` to compute the accuracy of our classifier, together with `recall_score()`, `f1_score()`, and `confusion_matrix()`. Note that `recall_score()` and `f1_score()` treat class 1 (benign) as the positive class by default.

In [16]:
# Evaluates model
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

##TODO: Your code here!

print("accuracy score \t=", accuracy_score(test_labels, preds))
print("recall score \t=", recall_score(test_labels, preds))
print("f1 score \t=", f1_score(test_labels, preds))


print("Confusion Matrix\n", confusion_matrix(test_labels, preds))

accuracy score 	= 0.9414893617021277
recall score 	= 0.9504132231404959
f1 score 	= 0.9543568464730291
Confusion Matrix
 [[ 62   5]
 [  6 115]]
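The scores printed above can be recomputed by hand from the confusion matrix, which helps to see what each metric measures. The sketch below copies the kNN confusion matrix from the output above. The variable `malignant_recall` is an addition for illustration: it is the recall of class 0, which the notebook does not print.

```python
import numpy as np

# Confusion matrix printed above for the kNN classifier:
# rows = true class, columns = predicted class (0 = malignant, 1 = benign)
cm = np.array([[62, 5],
               [6, 115]])

# With class 1 (benign) as the positive class, scikit-learn's default
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
recall = tp / (tp + fn)                # benign tumors correctly found
precision = tp / (tp + fp)             # predicted benign that really are benign
f1 = 2 * precision * recall / (precision + recall)

# Recall of the malignant class (class 0)
malignant_recall = tn / (tn + fp)

print("accuracy         =", accuracy)
print("recall (benign)  =", recall)
print("f1 (benign)      =", f1)
print("recall (malign.) =", malignant_recall)
```

The first three values reproduce the printed scores. The last one shows that 5 of the 67 malignant tumors in the test set were predicted as benign, which the default recall score does not reveal.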


### Decision Trees

In [17]:
from sklearn import tree
##TODO: Your code here!
clf_tree = tree.DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=6)
clf_tree.fit(train, train_labels)

DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=6)
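A fitted decision tree can be inspected as a set of if/else rules, which is one of its practical advantages. The sketch below prints the rules with `export_text`. It assumes the same split as earlier in the notebook, and it adds `random_state=0` to the classifier (not present in the notebook) so that the printed tree is reproducible between runs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42
)

# Same hyperparameters as in the notebook; random_state is an added assumption
clf_tree = DecisionTreeClassifier(
    criterion="entropy", max_leaf_nodes=6, random_state=0
)
clf_tree.fit(train, train_labels)

# Text rendering of the learned splits, labelled with the feature names
print(export_text(clf_tree, feature_names=list(data.feature_names)))
print("number of leaves:", clf_tree.get_n_leaves())
```

With `max_leaf_nodes=6` the tree has at most six leaves, so the printed rules stay short enough to read in full.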

In [18]:
# Make predictions
##TODO: Your code here!
preds = clf_tree.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0
 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]


In [19]:
# Evaluates model
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

##TODO: Your code here!

print("accuracy score \t=", accuracy_score(test_labels, preds))
print("recall score \t=", recall_score(test_labels, preds))
print("f1 score \t=", f1_score(test_labels, preds))


print("Confusion Matrix\n", confusion_matrix(test_labels, preds))

accuracy score 	= 0.9627659574468085
recall score 	= 0.9752066115702479
f1 score 	= 0.9711934156378601
Confusion Matrix
 [[ 63   4]
 [  3 118]]


### Support Vector Machines

In [20]:
from sklearn import svm
##TODO: Your code here!
clf_svm = svm.SVC()
clf_svm.fit(train, train_labels)

SVC()
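The features in this dataset have very different ranges (for example, mean area is in the hundreds while smoothness is below 1), and an SVM with the default RBF kernel is sensitive to feature scale. The sketch below is an optional variant that is not part of the notebook: it standardizes the features inside a pipeline before fitting the SVM. The names `scaled_svm` and `plain_svm` are illustrative, and no particular improvement is claimed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

features, labels = load_breast_cancer(return_X_y=True)
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42
)

# SVM as used in the notebook, on the raw features
plain_svm = SVC().fit(train, train_labels)

# Pipeline: the scaler is fitted on the training data only,
# then the same transformation is applied to the test data
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(train, train_labels)

# score() returns the accuracy on the given data
print("unscaled:", plain_svm.score(test, test_labels))
print("scaled:  ", scaled_svm.score(test, test_labels))
```

Wrapping the scaler and the classifier in one pipeline avoids leaking test-set statistics into training, because `fit` only ever sees the training data.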

In [21]:
# Make predictions
##TODO: Your code here!
preds = clf_svm.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]


In [22]:
# Evaluates model
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

##TODO: Your code here!

print("accuracy score \t=", accuracy_score(test_labels, preds))
print("recall score \t=", recall_score(test_labels, preds))
print("f1 score \t=", f1_score(test_labels, preds))


print("Confusion Matrix\n", confusion_matrix(test_labels, preds))

accuracy score 	= 0.9521276595744681
recall score 	= 0.9917355371900827
f1 score 	= 0.963855421686747
Confusion Matrix
 [[ 59   8]
 [  1 120]]


What can you conclude regarding the results?
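As one possible starting point for this comparison, the sketch below puts the three confusion matrices printed above side by side and derives per-class recall from them. The dictionary names `results` and `summary` are introduced here for illustration, and the numbers are copied from the outputs of this notebook rather than recomputed.

```python
import numpy as np

# Confusion matrices copied from the outputs above:
# rows = true class, columns = predicted class (0 = malignant, 1 = benign)
results = {
    "kNN": np.array([[62, 5], [6, 115]]),
    "Decision tree": np.array([[63, 4], [3, 118]]),
    "SVM": np.array([[59, 8], [1, 120]]),
}

summary = {}
for name, cm in results.items():
    summary[name] = {
        "accuracy": np.trace(cm) / cm.sum(),
        "malignant_recall": cm[0, 0] / cm[0].sum(),
        "benign_recall": cm[1, 1] / cm[1].sum(),
        "missed_malignant": int(cm[0, 1]),  # malignant predicted as benign
    }

for name, row in summary.items():
    print(
        f"{name:14s} acc={row['accuracy']:.3f} "
        f"malignant recall={row['malignant_recall']:.3f} "
        f"benign recall={row['benign_recall']:.3f} "
        f"missed malignant={row['missed_malignant']}"
    )
```

On this particular split, the SVM has the highest benign recall but also misclassifies the most malignant tumors as benign (8, against 5 for kNN and 4 for the decision tree). In a diagnostic setting that kind of error is usually the more costly one, so accuracy or default recall alone may not be enough to rank the models.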