## Machine Learning Recap: Classifying Breast Cancer Using ML Models

Welcome to this recap of our intensive machine learning course! In this interactive notebook, we will focus on a critical application of machine learning – classifying breast cancer.

Breast cancer is a significant health concern, and accurate diagnosis is crucial for effective treatment. By using basic machine learning algorithms, we can contribute to this important field and showcase the practicality of machine learning in real-life scenarios.

Our dataset is the well-known Breast Cancer Wisconsin (Diagnostic) dataset, which provides 30 numeric features describing cell nuclei. We will use these features to predict the diagnosis, classifying each sample as either malignant (M) or benign (B).

In [1]:
# Import the libraries used in this notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
from sklearn.linear_model import LogisticRegression # logistic regression classifier
from sklearn.model_selection import train_test_split # split the data into train and test sets
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn.naive_bayes import GaussianNB # Gaussian naive Bayes classifier
from sklearn import metrics # accuracy, precision, recall, and F1 scores

Import the data

In [2]:
data = pd.read_csv("./data/breast_cancer.csv", header=0)
# header=0 means row 0 of the file holds the column names

data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


Before we dive into the models, let's understand the attribute information in the dataset. It includes an ID number, the diagnosis (malignant or benign), and ten real-valued features computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

The features are grouped as Mean, Standard Error, and Worst, each containing ten values, for 30 features in total. Mean is the average value over the nuclei in an image, Standard Error measures the variability of that value, and Worst is the mean of the three largest values.

Get ready to embark on this exciting journey where we combine the power of machine learning with the vital task of breast cancer classification. Let's dive in and explore the models together!

Let's get the basic information from the dataset: the columns, their non-null counts, and their types. Can you find the pandas method that achieves this?

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

Are there any null values?

In [4]:
data.isnull().values.any()

True

In [5]:
data.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

We can see that the column `Unnamed: 32` has 0 non-null values. Every value in this column is null, so we cannot use it for our analysis. Let's drop it!

In [6]:
data.drop("Unnamed: 32",axis=1,inplace=True)

In [7]:
# check that the column has been dropped
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
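
Dropping a column by name works when we already know which one is empty. As a minimal sketch of a more general approach, the snippet below finds and drops every column that contains only null values. The small `toy` frame is a hypothetical stand-in for the real data, not the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one column is entirely NaN.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, 17.77, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)

# how="all" drops only the columns that have no non-null value at all.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Unlike `inplace=True`, this returns a new DataFrame and leaves the original untouched, which makes it easier to re-run a cell without errors.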

Is there any other column that has no relevance for a model whatsoever?

In [8]:
data.drop("id",axis=1,inplace=True)

Let's get the list of columns that hold the mean, the standard error, and the worst value, in 3 different lists.

In [9]:
features_mean = list(data.columns[1:11])
features_se = list(data.columns[11:21])
features_worst = list(data.columns[21:31])
print(features_mean)
print("-----------------------------------")
print(features_se)
print("------------------------------------")
print(features_worst)

['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']
-----------------------------------
['radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se']
------------------------------------
['radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
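
Positional slices are easy to get wrong by one. As a hedged alternative, the three lists can be built from the column-name suffixes instead. The sketch below uses a hypothetical frame with only a few of the dataset's column names and placeholder values.

```python
import pandas as pd

# Hypothetical frame with a subset of the dataset's column names (values are placeholders).
cols = [
    "diagnosis",
    "radius_mean", "texture_mean",
    "radius_se", "texture_se",
    "radius_worst", "texture_worst",
]
toy = pd.DataFrame([[0] * len(cols)], columns=cols)

# Group the columns by suffix rather than by position.
features_mean = [c for c in toy.columns if c.endswith("_mean")]
features_se = [c for c in toy.columns if c.endswith("_se")]
features_worst = [c for c in toy.columns if c.endswith("_worst")]

print(features_mean)
print(features_se)
print(features_worst)
```

This keeps working if columns are reordered or if an extra column is dropped earlier in the notebook.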


Now let's convert the diagnosis column to integers, using 0 for benign samples and 1 for malignant ones.

In [10]:
data['diagnosis'] = data['diagnosis'].map({'M':1,'B':0})

Let's check the distribution of the diagnosis column, the one we want to predict. How many benign and malignant samples are there?

In [11]:
data['diagnosis'].value_counts()

0    357
1    212
Name: diagnosis, dtype: int64
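
Raw counts are easier to interpret as proportions. The sketch below rebuilds a label column from the counts reported above (an illustrative stand-in, not the real `data` frame) and uses `value_counts(normalize=True)` to get the class fractions.

```python
import pandas as pd

# Rebuild a label column from the reported counts: 357 benign (0), 212 malignant (1).
diagnosis = pd.Series([0] * 357 + [1] * 212, name="diagnosis")

# normalize=True returns fractions instead of raw counts.
proportions = diagnosis.value_counts(normalize=True)
print(proportions.round(4))
```

Roughly 63% of the samples are benign, so a model that always predicts "benign" would already reach about 63% accuracy. That is a useful baseline to keep in mind when reading the accuracy scores later on.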

### Train and test split

Divide the dataset into 80% training and 20% test data. Use scikit-learn's `train_test_split` function, and print the number of rows of each dataset. Use a random state of 10 for this.

In [28]:
train, test = train_test_split(data, test_size = 0.2, random_state=10)

print(train.shape)
print(test.shape)

(455, 31)
(114, 31)
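
Because the classes are imbalanced, a purely random split can leave the train and test sets with slightly different class ratios. As a sketch of one option, `train_test_split` accepts a `stratify` argument that preserves the ratio. The `toy` frame below is a hypothetical stand-in with the same class counts as our dataset. The notebook itself does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same class counts as the dataset.
toy = pd.DataFrame({
    "feature": range(569),
    "diagnosis": [0] * 357 + [1] * 212,
})

# stratify keeps the class proportions approximately equal in both splits.
train, test = train_test_split(
    toy, test_size=0.2, random_state=10, stratify=toy["diagnosis"]
)

print(train.shape, test.shape)
print(f"Malignant share in train: {train['diagnosis'].mean():.3f}")
print(f"Malignant share in test:  {test['diagnosis'].mean():.3f}")
```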


### Model training

Let's train a model using a Random Forest Classifier

Get the `X` matrix (features) and `y` vector (variable to predict) for both the train and test sets

In [29]:
train_X = train.drop(columns=["diagnosis"])
train_y = train.diagnosis
test_X = test.drop(columns=["diagnosis"])
test_y = test.diagnosis

Train a Random Forest Classifier. Use a random state of 10

In [30]:
model=RandomForestClassifier(random_state=10)
model.fit(train_X,train_y)

What is the accuracy of the model?

In [31]:
prediction = model.predict(test_X)
metrics.accuracy_score(test_y, prediction)

0.9824561403508771

What are the precision, recall, and f1-score of the model?

In [32]:
# scikit-learn metrics take the true labels first, then the predictions
precision = metrics.precision_score(test_y, prediction)
recall = metrics.recall_score(test_y, prediction)
f1 = metrics.f1_score(test_y, prediction)

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1: {f1:.4f}')

Precision: 0.9512
Recall: 1.0000
F1: 0.9750
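
To see where precision and recall come from, it can help to compute them by hand from the confusion matrix. The sketch below uses small hypothetical label lists (not the notebook's predictions). It also shows that scikit-learn's metric functions expect the true labels first and the predictions second: swapping the two arguments exchanges precision and recall.

```python
from sklearn import metrics

# Hypothetical labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)     # share of malignant cases that the model found

print(f"Manual precision: {precision:.4f}, manual recall: {recall:.4f}")
print(f"precision_score(y_true, y_pred): {metrics.precision_score(y_true, y_pred):.4f}")
print(f"precision_score(y_pred, y_true): {metrics.precision_score(y_pred, y_true):.4f}")
```

In a diagnostic setting, recall on the malignant class usually deserves particular attention, since a false negative means a missed cancer.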


Can you try other scikit-learn classification models and repeat the same process?

**Logistic Regression**

In [33]:
model = LogisticRegression(max_iter=10000, random_state=10)
model.fit(train_X,train_y)

In [34]:
prediction = model.predict(test_X)
metrics.accuracy_score(test_y, prediction)

0.956140350877193

In [35]:
# scikit-learn metrics take the true labels first, then the predictions
precision = metrics.precision_score(test_y, prediction)
recall = metrics.recall_score(test_y, prediction)
f1 = metrics.f1_score(test_y, prediction)

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1: {f1:.4f}')

Precision: 0.9048
Recall: 0.9744
F1: 0.9383
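
The large `max_iter` is needed because the features live on very different scales (areas in the thousands, smoothness values near 0.1), which slows down the solver. One common remedy, sketched here under assumptions, is to standardize the features in a pipeline before the logistic regression. To keep the example self-contained it loads scikit-learn's bundled copy of the dataset instead of the CSV, so the numbers will not match the cells above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the dataset stands in for the CSV.
# Its target encoding is 0 = malignant, 1 = benign, the reverse of this notebook's mapping.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Inside the pipeline, the scaler is fitted on the training data only.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)

accuracy = scaled_logreg.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")
```

Tree-based models such as random forests are insensitive to feature scaling, so this step mainly matters for the logistic regression.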


**Naive Bayes**

In [36]:
model = GaussianNB()
model.fit(train_X,train_y)

In [37]:
prediction = model.predict(test_X)
metrics.accuracy_score(test_y, prediction)

0.956140350877193

In [38]:
# scikit-learn metrics take the true labels first, then the predictions
precision = metrics.precision_score(test_y, prediction)
recall = metrics.recall_score(test_y, prediction)
f1 = metrics.f1_score(test_y, prediction)

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1: {f1:.4f}')

Precision: 0.9048
Recall: 0.9744
F1: 0.9383


Based on this, which model would you choose for the task?

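```Random Forest Classifier```: it has the highest accuracy (0.9825 vs. 0.9561) and F1 score (0.9750 vs. 0.9383) of the three models on this test split.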
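
A single test split of 114 rows gives a fairly noisy estimate, so the ranking between models could change with a different random state. As a hedged sketch, 5-fold cross-validation averages the accuracy over several splits. The example assumes scikit-learn's bundled copy of the dataset in place of the CSV, so the scores are not directly comparable to the ones above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled copy of the dataset stands in for the CSV.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "Random Forest": RandomForestClassifier(random_state=10),
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Naive Bayes": GaussianNB(),
}

# Evaluate each model on the same 5 folds and report mean and spread of the accuracy.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the gap between two models is smaller than the spread across folds, the difference is probably not meaningful.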

Which features have the most predictive importance? Can you get the five most important ones?

In [39]:
model=RandomForestClassifier(random_state=10)
model.fit(train_X,train_y)

In [40]:
featimp = pd.Series(model.feature_importances_, index=train_X.columns).sort_values(ascending=False)
featimp[:5]

area_worst              0.135426
concave points_worst    0.132727
radius_worst            0.120659
concave points_mean     0.104581
perimeter_worst         0.103084
dtype: float64
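
The `feature_importances_` attribute is impurity-based and computed on the training data, and correlated features such as radius, perimeter, and area tend to share importance between them. As a sketch of a complementary check, permutation importance measures how much the test accuracy drops when one feature's values are shuffled. The example again assumes scikit-learn's bundled copy of the dataset, whose column names differ from the CSV (for example `worst area` instead of `area_worst`).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset stands in for the CSV.
bunch = load_breast_cancer(as_frame=True)
X, y = bunch.data, bunch.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

rf = RandomForestClassifier(random_state=10).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out data and average the score drop.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=10)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_imp.head(5))
```

The two rankings will not necessarily agree. Features that appear near the top of both are the most convincing candidates.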