## Machine Learning Recap: Classifying Breast Cancer Using ML Models

Welcome to this recap of our intensive machine learning course! In this interactive notebook, we will focus on a critical application of machine learning – classifying breast cancer.

Breast cancer is a significant health concern, and accurate diagnosis is crucial for effective treatment. By using basic machine learning algorithms, we can contribute to this important field and showcase the practicality of machine learning in real-life scenarios.

Our dataset is the well-known Breast Cancer Wisconsin (Diagnostic) dataset, which provides 30 numeric features describing cell nuclei. We will use these features to predict the diagnosis, classifying each sample as either malignant (M) or benign (B).

In [None]:
# Import the libraries used throughout this notebook
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv), SQL-like data manipulation
%matplotlib inline
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.model_selection import train_test_split  # split the data into train and test sets
from sklearn.ensemble import RandomForestClassifier  # random forest classifier
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes classifier
from sklearn import metrics  # accuracy, precision, recall, and other evaluation metrics

Import the data

In [None]:
data = ...
    
data.head()

Before we dive into the models, let's understand the attribute information in the dataset. It includes an ID number, the diagnosis (malignant or benign), and ten real-valued features computed for each cell nucleus. These features capture important characteristics: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

The features are grouped as Mean, Standard Error, and Worst, each containing the ten parameters above, for 30 features in total. Mean is the average value across the nuclei in an image, Standard Error measures the variability of that value, and Worst is the mean of the three largest values.

Get ready to embark on this exciting journey where we combine the power of machine learning with the vital task of breast cancer classification. Let's dive in and explore the models together!

Let's get the basic information from the dataset: columns, count, and type of columns. Can you find the Pandas method that achieves this?

In [None]:
...

Are there any null values?

In [None]:
...
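As a minimal sketch of this kind of inspection, assuming a small made-up DataFrame (the column names and values below are illustrative, not the real data), `DataFrame.info()` prints the columns, non-null counts, and dtypes, while `isna().sum()` counts the missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 11.4, np.nan],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

toy.info()                      # columns, non-null counts, and dtypes
null_counts = toy.isna().sum()  # number of missing values per column
print(null_counts)
```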

We can see that the column `Unnamed: 32` has 0 non-null values. This means every value in this column is null, so we cannot use it in our analysis. Let's drop it!

In [None]:
...
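Two common ways to remove such a column are sketched below on a toy DataFrame (the values are made up, and the column name `Unnamed: 32` is assumed to match what pandas produced when reading the CSV). Dropping by name is explicit; `dropna(axis=1, how="all")` removes any column that is entirely null.

```python
import numpy as np
import pandas as pd

# Toy frame with one usable column and one all-null column
toy = pd.DataFrame({
    "radius_mean": [17.9, 11.4, 13.0],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Option 1: drop the column by name
by_name = toy.drop(columns=["Unnamed: 32"])

# Option 2: drop every column whose values are all missing
all_null_removed = toy.dropna(axis=1, how="all")

print(list(by_name.columns))
print(list(all_null_removed.columns))
```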

Is there any other column that has no relevance for a model whatsoever?

In [None]:
...

Let's get the lists of columns that hold the mean, the standard error, and the worst value, as three separate lists.

In [None]:
features_mean = ...
features_se = ...
features_worst = ...

print(features_mean)
print("-----------------------------------")
print(features_se)
print("------------------------------------")
print(features_worst)
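One possible approach, assuming the column names end in `_mean`, `_se`, and `_worst` (check `data.columns` to confirm this for your file), is to filter on the suffix. The toy column list below is a hypothetical subset used only for illustration:

```python
import pandas as pd

# Hypothetical subset of columns following the _mean / _se / _worst naming
toy = pd.DataFrame(columns=[
    "diagnosis",
    "radius_mean", "texture_mean",
    "radius_se", "texture_se",
    "radius_worst", "texture_worst",
])

# Select column names by suffix
mean_cols = [c for c in toy.columns if c.endswith("_mean")]
se_cols = [c for c in toy.columns if c.endswith("_se")]
worst_cols = [c for c in toy.columns if c.endswith("_worst")]

print(mean_cols)
print(se_cols)
print(worst_cols)
```

Filtering by suffix does not depend on column order, which makes it less fragile than slicing the columns by position.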

Now let's transform the diagnosis column to integers, where 0 is used for benign samples and 1 for malignant ones.

In [None]:
data['diagnosis'] = ...

Let's check the distribution of the diagnosis column, the one we want to predict. How many benign and malignant samples are there?

In [None]:
...
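The encoding and the count can be sketched on a toy column as follows. The five labels are made up; only the `B`/`M` coding comes from the dataset description above.

```python
import pandas as pd

# Toy diagnosis column using the same B/M labels as the dataset
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Encode benign as 0 and malignant as 1
toy["diagnosis"] = toy["diagnosis"].map({"B": 0, "M": 1})

# Count how many samples fall in each class
counts = toy["diagnosis"].value_counts()
print(counts)
```

Passing `normalize=True` to `value_counts` returns proportions instead of counts, which helps when judging class imbalance.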

### Train and test split

Divide the dataset into 80% training and 20% test data. Use scikit-learn's `train_test_split` function, and print the number of rows of each dataset. Use a random state of 10 for this.

In [None]:
train, test = ...

print(train.shape)
print(test.shape)
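As a quick illustration of the call, here is a sketch on a ten-row toy DataFrame (made-up values, with hypothetical names `train_toy` and `test_toy`). With `test_size=0.2`, two of the ten rows go to the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows: one feature and an alternating binary label
toy = pd.DataFrame({"feature": range(10), "diagnosis": [0, 1] * 5})

# 80% train, 20% test; the fixed random_state makes the split reproducible
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=10)

print(train_toy.shape)
print(test_toy.shape)
```

When the classes are imbalanced, the `stratify` parameter can be used to keep the class proportions similar in both parts.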

### Model training

Let's train a model using a Random Forest Classifier.

Get the `X` matrix (features) and `y` vector (variable to predict) for both the train and test sets.

In [None]:
train_X = ...
train_y = ...
test_X = ...
test_y = ...

Train a Random Forest Classifier. Use a random state of 10.

In [None]:
...

What is the accuracy of the model?

In [None]:
...
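The training and scoring steps could look roughly like the sketch below. It is not this notebook's solution: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) instead of the CSV, so the numbers may differ from yours. In the bundled copy the target is coded 0 = malignant and 1 = benign, the reverse of the encoding used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for the CSV used in this notebook.
# Here target 0 = malignant and 1 = benign (opposite of the notebook's encoding).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Fit the forest on the training part only
rf = RandomForestClassifier(random_state=10)
rf.fit(X_tr, y_tr)

# Accuracy = share of test samples that are classified correctly
rf_accuracy = accuracy_score(y_te, rf.predict(X_te))
print(f"Accuracy: {rf_accuracy:.4f}")
```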

What are the precision, recall, and f1-score of the model?

In [None]:
precision = ...
recall = ...
f1 = ...

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1: {f1:.4f}')
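To see what these three metrics measure, here is a small worked sketch on hand-made labels (the arrays are invented for illustration). With 3 true positives, 1 false positive, and 1 false negative, precision and recall are both 3/4:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_hat)  # TP / (TP + FP) = 3 / 4
r = recall_score(y_true, y_hat)     # TP / (TP + FN) = 3 / 4
f = f1_score(y_true, y_hat)         # harmonic mean of precision and recall

print(f"Precision: {p:.4f}")
print(f"Recall: {r:.4f}")
print(f"F1: {f:.4f}")
```

In a cancer screening setting, recall on the malignant class is often the metric to watch, since a false negative means a missed malignant case.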

Can you try other scikit-learn classification models and repeat the same process?

In [None]:
...
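One way to compare several models is to loop over them with the same split, as in the sketch below. It again assumes scikit-learn's bundled copy of the dataset rather than the CSV. Scaling the features before logistic regression is a choice made here to help the solver converge; it is not something the notebook prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Candidate models, keyed by a descriptive name
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "gaussian_nb": GaussianNB(),
}

# Fit each model on the same training data and score it on the same test data
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.4f}")
```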

Based on this, which model would you choose for the task?

```Your Answer```

Which features have the most predictive importance? Can you get the five most important ones?

In [None]:
...
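For tree ensembles such as the random forest, the fitted model exposes a `feature_importances_` attribute. The sketch below assumes scikit-learn's bundled copy of the dataset and, for brevity, fits on all rows, so the ranking may differ from the one you get with your own training split:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled dataset stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rf = RandomForestClassifier(random_state=10).fit(X, y)

# Pair each importance with its column name and keep the five largest
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_five = importances.nlargest(5)
print(top_five)
```

These impurity-based importances sum to 1 and can be split across correlated features, so treat the ranking as a guide, not a definitive ordering.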