__An Introduction to Machine Learning__

Machine learning is a subfield of artificial intelligence (AI). Its general goal is to understand the structure of data and fit that data into models that people can understand and use.

Although machine learning is a field within computer science, it differs from traditional computational approaches. In traditional computing, algorithms are sets of explicitly programmed instructions that computers use to calculate or solve a problem. Machine learning algorithms instead allow computers to train on data inputs and use statistical analysis to output values that fall within a specific range. In this way, machine learning lets computers build models from sample data and use them to automate decision-making based on new data inputs.

Any technology user today has benefited from machine learning. Facial recognition technology allows social media platforms to help users tag and share photos of friends. Optical character recognition (OCR) technology converts images of text into machine-readable text. Recommendation engines, powered by machine learning, suggest what movies or television shows to watch next based on user preferences. Self-driving cars that rely on machine learning to navigate may soon be available to consumers.

In this project I will implement a simple machine learning algorithm in Python using Scikit-learn, a machine learning library for Python. Using a database of breast cancer tumor information, I'll train a Naive Bayes (NB) classifier that predicts whether a tumor is malignant or benign.


In [30]:
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

The dataset I'll be working with in this project is the Breast Cancer Wisconsin (Diagnostic) Database. It contains various measurements of breast cancer tumors, along with a classification label of malignant or benign for each one. The dataset has 569 instances, one per tumor, and includes 30 attributes, or features, such as the radius of the tumor, texture, smoothness, and area.

__I will now proceed to organize the data__

In [32]:
# Load dataset
data = load_breast_cancer()

In [33]:
label_names = data['target_names']

In [34]:
labels = data['target']

In [35]:
feature_names = data['feature_names']

In [36]:
features = data['data']
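
As a quick sanity check on the figures quoted for the dataset (569 instances, 30 features), the shape of the feature matrix can be inspected directly. This is a minimal sketch that reloads the data so it runs on its own; the variable names `n_samples` and `n_features` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this snippet is self-contained
data = load_breast_cancer()

# The feature matrix has one row per tumor and one column per attribute
n_samples, n_features = data["data"].shape
print(n_samples, n_features)  # 569 30

# There is exactly one label and one feature name per row / column
print(len(data["target"]), len(data["feature_names"]))  # 569 30
```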

__Now let's have a look at the data__

In [38]:
print(label_names)

['malignant' 'benign']


In [39]:
print(labels[0])

0
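
The printed label is an integer index into `label_names`, so the `0` above should correspond to `malignant` and a `1` to `benign`. The sketch below makes that mapping explicit and counts how many tumors fall into each class; `first_label_name` and `class_counts` are illustrative names that do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
label_names = data["target_names"]
labels = data["target"]

# Translate the integer label of the first tumor into its class name
first_label_name = str(label_names[labels[0]])
print(first_label_name)  # malignant

# Count how many tumors belong to each class (index 0 and index 1)
class_counts = {
    str(name): int(count)
    for name, count in zip(label_names, np.bincount(labels))
}
print(class_counts)
```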


In [40]:
print(feature_names[0])

mean radius


In [41]:
print(features[0])

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


__Splitting our data__

In [43]:
# Hold out 33% of the samples for testing; a fixed seed makes the split reproducible
train, test, train_labels, test_labels = train_test_split(features, labels,
                                                          test_size=0.33,
                                                          random_state=42)
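
To see what the split produces, the sketch below repeats it on a freshly loaded copy of the data and prints the resulting sizes. It also repeats the split a second time to illustrate that a fixed `random_state` gives the same partition on every run. The names `train_again`, `test_again`, and `same_split` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Same call as in the notebook: 33% of the rows go to the test set
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
print(train.shape, test.shape)  # (381, 30) (188, 30)

# Repeating the split with the same seed yields the same test rows
train_again, test_again, _, _ = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
same_split = bool((test == test_again).all())
print(same_split)  # True
```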

Many machine learning models exist, and each has its own strengths and weaknesses. In this project I will focus on Naive Bayes (NB), a simple algorithm that usually performs well in binary classification tasks. The __GaussianNB__ class was already imported at the start of the notebook. I first initialize the model with __GaussianNB()__, then train it by fitting it to the training data using __gnb.fit()__:
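
To give a rough idea of what a Gaussian Naive Bayes model does when it is fitted, the sketch below reimplements the idea with NumPy: estimate a prior, a per-feature mean, and a per-feature variance for each class, then pick the class with the highest log posterior score. This is an illustrative reimplementation under stated assumptions, not scikit-learn's actual code. The `variance_floor` term and the helper `log_posterior_scores` are my own names, and the floor is only assumed to play a role similar to the library's `var_smoothing` parameter.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)

classes = np.unique(train_labels)

# Assumption: a small floor added to every variance for numerical stability
variance_floor = 1e-9 * train.var(axis=0).max()

# "Training": per-class prior, feature means, and feature variances
priors, means, variances = [], [], []
for c in classes:
    rows = train[train_labels == c]
    priors.append(len(rows) / len(train))
    means.append(rows.mean(axis=0))
    variances.append(rows.var(axis=0) + variance_floor)


def log_posterior_scores(samples):
    """Return an (n_samples, n_classes) array of unnormalized log posteriors."""
    scores = []
    for prior, mean, var in zip(priors, means, variances):
        # "Naive" assumption: features are independent given the class,
        # so the per-feature Gaussian log densities are simply summed.
        log_likelihood = -0.5 * np.sum(
            np.log(2.0 * np.pi * var) + (samples - mean) ** 2 / var, axis=1
        )
        scores.append(np.log(prior) + log_likelihood)
    return np.column_stack(scores)


# "Prediction": choose the class with the highest score for each test row
manual_preds = classes[np.argmax(log_posterior_scores(test), axis=1)]
manual_accuracy = float(np.mean(manual_preds == test_labels))
print(manual_accuracy)
```

Working in log space avoids multiplying 30 very small probabilities together, which could otherwise underflow to zero.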

__Initializing the classifier__

In [45]:
gnb = GaussianNB()

__Train our classifier__

In [47]:
model = gnb.fit(train, train_labels)

__Make predictions__

In [49]:
preds = gnb.predict(test)

In [50]:
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]


As you can see in the output above, the __predict()__ function returned an array of 0s and 1s, which represent our predicted values for the tumor class (malignant vs. benign). Now it's time to evaluate how well the classifier is performing. Using the array of true class labels, I can evaluate the accuracy of my model's predicted values by comparing the two arrays (test_labels vs. preds). I will use the __sklearn__ function __accuracy_score()__ to determine the accuracy of my machine learning classifier.
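
The comparison of the two arrays can also be done by hand: accuracy is the number of positions where the prediction equals the true label, divided by the number of test samples. The sketch below recomputes it that way and checks it against `accuracy_score()`. It refits the classifier so that it runs on its own; `n_correct`, `manual_accuracy`, and `library_accuracy` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Element-wise comparison gives True wherever the prediction is correct
n_correct = int(np.sum(preds == test_labels))
manual_accuracy = n_correct / len(test_labels)

# The library function should report the same fraction
library_accuracy = accuracy_score(test_labels, preds)
print(n_correct, len(test_labels), manual_accuracy, library_accuracy)
```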

__Evaluate accuracy__

In [52]:
print(accuracy_score(test_labels, preds))

0.9414893617021277


As shown in the output above, the NB classifier reaches an accuracy of 94.15% on the held-out test set: it predicted the correct class, malignant or benign, for 177 of the 188 test tumors. These results suggest that our feature set of 30 attributes is a good indicator of tumor class.
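
Accuracy alone does not show which kind of mistake the classifier makes, and for tumor classification a malignant tumor predicted as benign is arguably the more serious error. As a possible next step, the sketch below breaks the test predictions down with scikit-learn's `confusion_matrix()`. It refits the classifier so that it runs on its own; `missed_malignant` and `false_alarms` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Rows are true classes, columns are predicted classes,
# both ordered as 0 (malignant), 1 (benign)
cm = confusion_matrix(test_labels, preds)
print(cm)

missed_malignant = int(cm[0, 1])  # truly malignant, predicted benign
false_alarms = int(cm[1, 0])      # truly benign, predicted malignant
print(missed_malignant, false_alarms)
```

The diagonal of the matrix holds the correct predictions, so dividing its sum by the total number of test samples gives back the accuracy.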