
Adapted from Michelle Morales, "[How to Build a Machine Learning Classifier in Python With Scikit-learn](https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn)" *Digital Ocean* (24 March 2019).

## Introduction

[Machine learning](https://www.digitalocean.com/community/tutorials/an-introduction-to-machine-learning) is a research field in computer science, artificial intelligence, and statistics. The focus of machine learning is to train algorithms to learn patterns and make predictions from data. Machine learning is especially valuable because it lets us use computers to automate decision-making processes.

You’ll find machine learning applications everywhere. Netflix and Amazon use machine learning to make new product recommendations. Banks use machine learning to detect fraudulent activity in credit card transactions, and healthcare companies are beginning to use machine learning to monitor, assess, and diagnose patients.

In this tutorial, you’ll implement a simple machine learning algorithm in Python using [Scikit-learn](http://scikit-learn.org/stable/), a machine learning library for Python. Using a dataset of breast cancer tumor information, you’ll train a [Naive Bayes (NB)](http://scikit-learn.org/stable/modules/naive_bayes.html) classifier that predicts whether a tumor is malignant or benign.

By the end of this tutorial, you’ll know how to build your very own machine learning model in Python.

## Step 1 - Importing Scikit-Learn

Let’s begin by installing the Python module Scikit-learn, one of the best-documented machine learning libraries for Python, along with NumPy, pandas, SciPy, and Matplotlib.

In [None]:
# install packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install --user numpy
!{sys.executable} -m pip install --user pandas
!{sys.executable} -m pip install --user scipy
!{sys.executable} -m pip install --user matplotlib
!{sys.executable} -m pip install --user scikit-learn

In [None]:
# import statements
import numpy as np
import pandas as pd
import scipy
import matplotlib
import sklearn

Now that we have sklearn imported in our notebook, we can begin working with the dataset for our machine learning model.
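As a quick sanity check, you can print the version of each library to confirm that the installation and imports worked. This is a minimal sketch; the `versions` dictionary is just an illustrative name, and the version numbers you see will depend on your environment.

```python
# Confirm that the core libraries import correctly by printing their versions.
import numpy as np
import scipy
import sklearn

# Collect the version string reported by each package.
versions = {
    "numpy": np.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of these imports raises `ModuleNotFoundError`, rerun the install cell and restart the kernel.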

## Step 2 - Importing Scikit-learn's Dataset

The dataset we will be working with in this tutorial is the [Breast Cancer Wisconsin Diagnostic Database](http://scikit-learn.org/stable/datasets/index.html#breast-cancer-wisconsin-diagnostic-database). The dataset includes various information about breast cancer tumors, as well as classification labels of malignant or benign. It contains 569 instances (one per tumor) and 30 attributes, or features, for each instance, such as the tumor’s radius, texture, smoothness, and area.

Using this dataset, we will build a machine learning model that uses tumor information to predict whether a tumor is malignant or benign.

Scikit-learn ships with several built-in datasets that we can load into Python, and the dataset we want is among them. Import and load the dataset:

In [None]:
# import statement
from sklearn.datasets import load_breast_cancer

# load dataset
data = load_breast_cancer()

The `data` variable represents a Python object that works like a dictionary. The important dictionary keys to consider are the classification label names (`target_names`), the actual labels (`target`), the attribute/feature names (`feature_names`), and the attributes (`data`).
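To see this structure for yourself, you can list the keys and check the size of each piece. The sketch below reloads the dataset so that it runs on its own; it only inspects the object and does not change it.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The returned object supports dictionary-style access, so we can list its keys.
print(sorted(data.keys()))

# The feature matrix has one row per tumor and one column per feature.
print(data['data'].shape)

# There is exactly one label per tumor.
print(data['target'].shape)

# The number of feature names matches the number of columns.
print(len(data['feature_names']))

# The two class names.
print(list(data['target_names']))
```

The shapes confirm the figures given earlier: 569 instances and 30 features.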

Attributes are a critical part of any classifier. Attributes capture important characteristics about the nature of the data. Given the label we are trying to predict (malignant versus benign tumor), possible useful attributes include the size, radius, and texture of the tumor.

Create new variables for each important set of information and assign the data:

In [None]:
# organize data by selecting subsets to create new variables

# target names/labels
label_names = data['target_names']

# target data
labels = data['target']

# feature names/labels
feature_names = data['feature_names']

# feature data
features = data['data']

We now have a NumPy array for each set of information. These arrays can be indexed much like Python [lists](https://www.digitalocean.com/community/tutorials/understanding-lists-in-python-3).

To get a better understanding of our dataset, let’s take a look at our data by printing our class labels, the first data instance’s label, our feature names, and the feature values for the first data instance:

In [None]:
# show label names
print(label_names)

# print first label
print(labels[0])

# show first feature name
print(feature_names[0])

# print feature values for the first data instance
print(features[0])

As the output shows, our class names are `malignant` and `benign`, which map to the binary values `0` and `1`: `0` represents malignant tumors and `1` represents benign tumors.

Therefore, our first data instance is a malignant tumor whose mean radius is `1.79900000e+01`.
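Because the labels are stored as integers, it can help to translate them back into class names and to check how many tumors fall into each class. The following is a small sketch using NumPy's `bincount`; the `counts` variable is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']

# Count how many instances carry each integer label (index 0 and index 1).
counts = np.bincount(labels)
for index, name in enumerate(label_names):
    print(f"{index} = {name}: {counts[index]} tumors")

# Use the integer label of the first instance to look up its class name.
print(label_names[labels[0]])
```

The dataset is moderately imbalanced, with more benign tumors than malignant ones, which is worth keeping in mind when interpreting accuracy later.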

Now that we have our data loaded, we can work with our data to build our machine learning classifier.

## Step 3 - Organizing Data Into Sets

To evaluate how well a classifier is performing, you should always test the model on unseen data. Therefore, before building a model, split your data into two parts: a training set and a test set.

You use the training set to train and evaluate the model during the development stage. You then use the trained model to make predictions on the unseen test set. This approach gives you a sense of the model’s performance and robustness.

Fortunately, sklearn has a function called `train_test_split()`, which divides your data into these sets. Import the function and then use it to split the data:

In [None]:
# import statement
from sklearn.model_selection import train_test_split


# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

The function randomly splits the data according to the `test_size` parameter. In this example, we now have a test set (`test`) that represents 33% of the original dataset. The remaining data (`train`) makes up the training set. We also have the corresponding labels for each set, `train_labels` and `test_labels`. Setting `random_state` fixes the random seed, so the split is the same every time you run the code.
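You can verify the split by checking the size of each piece. This sketch repeats the same split so that it runs on its own; `test_fraction` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

# Rows and columns in each feature set; both keep all 30 feature columns.
print(train.shape, test.shape)

# Each feature set has a matching array of labels.
print(len(train_labels), len(test_labels))

# Share of the full dataset held out for testing.
test_fraction = len(test) / len(data['data'])
print(round(test_fraction, 2))
```

Together the two sets account for every instance in the original dataset, and no instance appears in both.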

We can now move on to training our first model.

## Step 4 - Building and Evaluating the Model

There are many models for machine learning, and each model has its own strengths and weaknesses. In this tutorial, we will focus on a simple algorithm that usually performs well in binary classification tasks, namely [Naive Bayes (NB)](http://scikit-learn.org/stable/modules/naive_bayes.html).

First, import the `GaussianNB` class. Then initialize the model with `GaussianNB()` and train it by fitting it to the training data using `gnb.fit()`:

In [None]:
# import statement
from sklearn.naive_bayes import GaussianNB

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)
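Fitting stores what the model estimated from the training data as attributes on the classifier. As a rough illustration, the sketch below repeats the split and fit from above and then inspects `classes_`, `class_prior_`, and `theta_` (the per-class mean of each feature), which are attributes of scikit-learn's fitted `GaussianNB`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)

# The class labels the model saw during training.
print(gnb.classes_)

# The share of each class in the training set, used as prior probabilities.
print(gnb.class_prior_)

# One estimated mean per class per feature: 2 classes by 30 features.
print(gnb.theta_.shape)

# Estimated mean of the first feature (mean radius) for each class.
print(gnb.theta_[:, 0])
```

In the last line of output, the mean radius estimated for malignant tumors (class `0`) is larger than the one for benign tumors (class `1`), which is the kind of difference the classifier relies on.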

After we train the model, we can then use the trained model to make predictions on our test set, which we do using the `predict()` function. The `predict()` function returns an array of predictions for each data instance in the test set. We can then print our predictions to get a sense of what the model determined.

Use the `predict()` function with the `test` set and print the results:

In [None]:
# Make predictions
preds = gnb.predict(test)

# show predictions
print(preds)

As you see in the Jupyter Notebook output, the `predict()` function returned an array of `0`s and `1`s, which represent our predicted values for the tumor class (malignant vs. benign).
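An array of integers is hard to read at a glance. One way to make the predictions more readable, sketched below, is to index the array of class names with the predictions. The sketch also calls `predict_proba()`, which returns the probability the model assigns to each class. The names `predicted_names` and `probabilities` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Index the class-name array with the integer predictions to get readable labels.
predicted_names = data['target_names'][preds]
print(predicted_names[:10])

# Class probabilities for the first five test instances.
# Column 0 is the probability of malignant, column 1 of benign.
probabilities = gnb.predict_proba(test[:5])
print(probabilities)
```

Each row of `probabilities` sums to 1, and `predict()` returns the class with the larger probability.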

Now that we have our predictions, let’s evaluate how well our classifier is performing.

## Step 5 - Evaluating the Model’s Accuracy

Using the array of true class labels, we can evaluate the accuracy of our model’s predicted values by comparing the two arrays (`test_labels` vs. `preds`). We will use the `sklearn` function `accuracy_score()` to determine the accuracy of our machine learning classifier. 

In [None]:
# import statement
from sklearn.metrics import accuracy_score

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

As you see in the output, the NB classifier is 94.15% accurate. This means the classifier correctly predicted whether the tumor was malignant or benign for 94.15 percent of the tumors in the test set. These results suggest that our feature set of 30 attributes is a good indicator of tumor class.
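Accuracy is simply the fraction of predictions that match the true labels, so you can reproduce it by hand. The sketch below does that and also prints a confusion matrix using scikit-learn's `confusion_matrix()`, which shows where the errors fall. The names `manual_accuracy` and `matrix` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Fraction of test instances where the prediction equals the true label.
manual_accuracy = np.mean(preds == test_labels)
print(manual_accuracy)
print(accuracy_score(test_labels, preds))

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
matrix = confusion_matrix(test_labels, preds)
print(matrix)
```

The diagonal of the matrix counts correct predictions, and the off-diagonal entries count the two kinds of mistakes. In a medical setting these are not equally costly, so accuracy alone does not tell the whole story.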

## Putting It All Together

You have successfully built your first machine learning classifier. Let’s reorganize the code by placing all `import` statements at the top of the Notebook or script. The final version of the code should look like this: 

In [None]:
# import statements
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()

# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Look at our data
print(label_names)
print('Class label = ', labels[0])
print(feature_names)
print(features[0])

# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
print(preds)

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

Now you can continue to work with your code to see if you can make your classifier perform even better. You could experiment with different subsets of features or even try completely different algorithms. Check out [Scikit-learn’s website](http://scikit-learn.org/stable/) for more machine learning ideas.
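As one possible starting point, the sketch below compares the classifier from this tutorial against two variations: the same algorithm trained only on the ten "mean" features, and a different algorithm (logistic regression on standardized features). The `evaluate()` helper and the choice of variations are assumptions made for illustration, not part of the original tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
features = data['data']
labels = data['target']
feature_names = data['feature_names']

train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)


def evaluate(classifier, columns):
    """Hypothetical helper: fit on the chosen feature columns, return test accuracy."""
    classifier.fit(train[:, columns], train_labels)
    preds = classifier.predict(test[:, columns])
    return accuracy_score(test_labels, preds)


# Column indices for all features, and for features whose name starts with "mean ".
all_columns = list(range(features.shape[1]))
mean_columns = [i for i, name in enumerate(feature_names) if name.startswith('mean ')]

results = {
    'GaussianNB, all features': evaluate(GaussianNB(), all_columns),
    'GaussianNB, mean features only': evaluate(GaussianNB(), mean_columns),
    'LogisticRegression, all features': evaluate(
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        all_columns),
}

for description, accuracy in results.items():
    print(f"{description}: {accuracy:.4f}")
```

Keep in mind that comparing many variations against the same test set can make the best score look better than it really is. For a fairer comparison, consider a separate validation set or cross-validation.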

## Conclusion

In this tutorial, you learned how to build a machine learning classifier in Python. Now you can load data, organize data, train, predict, and evaluate machine learning classifiers in Python using Scikit-learn. The steps in this tutorial should make it easier to work with your own data in Python.

# Next Steps

[Click here](https://github.com/kwaldenphd/building-a-ml-model/) to return to the main lab page on GitHub.