## Machine Learning Recap: Classifying Breast Cancer Using ML Models

Welcome to this recap of our week-long intensive machine learning course! In this interactive notebook, we will focus on a critical application of machine learning – classifying breast cancer.

Breast cancer is a significant health concern, and accurate diagnosis is crucial for effective treatment. By using basic machine learning algorithms, we can contribute to this important field and showcase the practicality of machine learning in real-life scenarios.

Our dataset is the well-known Breast Cancer Wisconsin (Diagnostic) dataset, which provides 30 numeric features describing the cell nuclei in each sample. We will use these features to predict the diagnosis, classifying each tumor as either malignant (M) or benign (B).

Throughout this notebook, we will explore fundamental machine learning algorithms, step by step, with clear explanations and easy-to-follow implementations. Our goal is to make this recap accessible and enjoyable for beginners.

Before we dive into the models, let's understand the attribute information in the dataset. It includes an ID number, diagnosis (malignant or benign), and ten real-valued features for each cell nucleus. These features capture important characteristics like radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

The features are grouped as Mean, Standard Error, and Worst, with ten features in each group. Mean is the average value over the nuclei in a sample, Standard Error is the standard error of that value, and Worst is the mean of the three largest values.

Get ready to embark on this exciting journey where we combine the power of machine learning with the vital task of breast cancer classification. Let's dive in and explore the models together!

In [None]:
# Import the libraries used throughout this notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv), SQL-like data manipulation
import matplotlib.pyplot as plt # plotting
# import seaborn as sns # optional: statistical plots
%matplotlib inline
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.model_selection import train_test_split # split the data into training and test sets
# from sklearn.cross_validation import KFold # old module path; KFold now lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV # parameter tuning
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm # Support Vector Machine
from sklearn import metrics # error and accuracy measures

Import the data

In [None]:
data = pd.read_csv("./data/data.csv",header=0) # header=0 means row 0 of the file holds the
                                                # column names

In [None]:
# Have a look at the data
print(data.head(2)) # the data has been imported and has 33 columns
# head() shows the first 5 rows by default; passing 2 prints only the first 2 rows
# print(data.tail(2)) would print the last 2 rows instead

In [None]:
# Now let's look at the column types using info()
data.info()

1. Each line of the output describes one column. For example, "radius_mean 569 non-null float64" means that radius_mean holds 569 non-null floating-point values.

2. The "Unnamed: 32" column has 0 non-null values, which means every value in it is null, so we cannot use this column in our analysis.
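The check described in point 2 can also be done programmatically instead of by reading the info() output. The sketch below uses a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from data.csv) to find and drop columns that are entirely null.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; "Unnamed: 32" mimics the empty column
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop all such columns without having to name them explicitly
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping by condition rather than by name keeps working if the empty column is called something else in another export of the file.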

In [None]:
# Now we can drop the column Unnamed: 32
data.drop("Unnamed: 32",axis=1,inplace=True) # inplace=True modifies data itself instead of returning a copy
# axis=1 means we are dropping a column (axis=0 would drop rows)

In [None]:
# Check that the column has been dropped
data.columns # lists the columns present in our data; Unnamed: 32 is no longer among them

In [None]:
# Likewise, we don't need the id column for our analysis
data.drop("id",axis=1,inplace=True)

In [None]:
# As noted above, the features fall into three groups; let's split the column names accordingly
features_mean = list(data.columns[1:11])
features_se = list(data.columns[11:21]) # the slice end is exclusive, so 11:21 covers all ten SE columns
features_worst = list(data.columns[21:31])
print(features_mean)
print("-----------------------------------")
print(features_se)
print("------------------------------------")
print(features_worst)
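Positional slices like the ones above are easy to get wrong by one column. As an alternative, here is a minimal sketch that groups columns by their name suffix. The helper `columns_with_suffix` and the `toy` frame are hypothetical names introduced for this example, and the sketch assumes the column names end in `_mean`, `_se`, and `_worst` as they do in this notebook.

```python
import pandas as pd

# Toy frame with a few columns named like those in the dataset
toy = pd.DataFrame(columns=[
    "diagnosis",
    "radius_mean", "texture_mean",
    "radius_se", "texture_se",
    "radius_worst", "texture_worst",
])

def columns_with_suffix(df, suffix):
    """Return the column names of df that end with the given suffix."""
    return [c for c in df.columns if c.endswith(suffix)]

# One list of column names per feature group
groups = {s: columns_with_suffix(toy, s) for s in ("_mean", "_se", "_worst")}
for suffix, cols in groups.items():
    print(suffix, len(cols), cols)
```

On the full dataset each group should contain ten names, which gives a quick sanity check on the split.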

In [None]:
# Let's start with features_mean
# The diagnosis column has object type, so we map it to integer values first
data['diagnosis']=data['diagnosis'].map({'M':1,'B':0})
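One thing worth checking after a mapping like this is that every label was covered, because map() silently turns unmatched values into NaN. The sketch below shows this on a made-up series (the `labels` values, including the lowercase "m", are assumptions chosen to trigger the problem).

```python
import pandas as pd

labels = pd.Series(["M", "B", "B", "M", "B", "m"])  # the last value is a deliberate typo

encoded = labels.map({"M": 1, "B": 0})

# Values missing from the mapping become NaN rather than raising an error
n_unmapped = int(encoded.isna().sum())
print("unmapped labels:", n_unmapped)

# Class counts after encoding (NaN is excluded by default)
counts = encoded.value_counts().to_dict()
print(counts)
```

Running the same two checks on the real diagnosis column also shows how balanced the two classes are, which matters when interpreting accuracy later on.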

## Explore the Data

In [None]:
data.describe() # summary statistics (count, mean, std, min, quartiles, max) for each numeric column

## Data Analysis and a Little Feature Selection

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

data_subset = data[['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
                   'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean',
                   'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se',
                   'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se',
                   'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
                   'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst',
                   'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']]

cor = data_subset.corr()  # Calculate the correlation of the variables

# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 10))

# Define the heatmap properties
heatmap = ax.imshow(cor, cmap='coolwarm')

# Set the tick labels and font size
ax.set_xticks(range(len(cor.columns)))
ax.set_yticks(range(len(cor.columns)))
ax.set_xticklabels(cor.columns, rotation=45, ha='right', fontsize=10)
ax.set_yticklabels(cor.columns, rotation=0, ha='right', fontsize=10)

# Set axis labels
ax.set_xlabel('Features', fontsize=12)
ax.set_ylabel('Features', fontsize=12)

# Set the title
ax.set_title('Correlation Heatmap', fontsize=14)

# Add colorbar
cbar = plt.colorbar(heatmap)

# Remove the gridlines
ax.grid(False)

# Show the plot
plt.show()


### Observations:

- Radius, perimeter, and area are strongly correlated with one another, which is expected because perimeter and area are both largely determined by radius. We therefore keep only one of them.

- compactness_mean, concavity_mean, and concave points_mean are also strongly correlated with one another. We keep compactness_mean as the representative of this group.

Based on these observations, the features selected for our analysis are:
- Perimeter_mean
- Texture_mean
- Smoothness_mean
- Compactness_mean
- Symmetry_mean

These features are less strongly correlated with one another, so each should contribute information that the others do not.
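Reading strongly correlated groups off a 31 x 31 heatmap by eye is error-prone, so it can help to list them. The following is a minimal sketch, not the method used above: the helper `correlated_pairs`, the 0.9 threshold, and the synthetic `toy` data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation is at least threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Synthetic data: perimeter is derived from radius, texture is independent of both
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    "texture": rng.normal(20, 4, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

Applied to the notebook's data, `correlated_pairs(data[features_mean])` would return the candidate groups from which one representative can be kept.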

In [None]:
prediction_var = ['texture_mean','perimeter_mean','smoothness_mean','compactness_mean','symmetry_mean']
# these are the variables we will use for prediction

In [None]:
# Now split the data into training and test sets
train, test = train_test_split(data, test_size = 0.3) # 70% of the rows go to train, 30% to test
# we can check their dimensions
print(train.shape)
print(test.shape)
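The split above is random, so the accuracies reported later change from run to run. A possible variant, sketched here on synthetic data (the arrays `X` and `y` and the seed 42 are assumptions), fixes the shuffle with `random_state` and uses `stratify` to keep the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 200 rows, 5 features, 40% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([1] * 80 + [0] * 120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,  # fixes the shuffle so the split is the same on every run
    stratify=y,       # keeps the class proportions similar in both parts
)

print(X_train.shape, X_test.shape)
print("positive share:", y_train.mean(), y_test.mean())
```

With the notebook's DataFrame the equivalent call would be `train_test_split(data, test_size=0.3, random_state=42, stratify=data['diagnosis'])`.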

In [None]:
train_X = train[prediction_var] # training inputs
train_y = train.diagnosis # training labels
# do the same for the test set
test_X = test[prediction_var] # test inputs
test_y = test.diagnosis # test labels

In [None]:
model=RandomForestClassifier(n_estimators=100) # a simple random forest model with 100 trees

In [None]:
model.fit(train_X,train_y) # fit the model on the training data

In [None]:
prediction=model.predict(test_X) # predict for the test data
# prediction holds the model's predicted diagnosis values for the test inputs

In [None]:
metrics.accuracy_score(test_y, prediction) # check the accuracy (true labels first, then predictions)
# accuracy is the fraction of test rows whose predicted diagnosis matches the actual one

*Here the accuracy of our model is 91%, which seems good. The exact value varies from run to run because the train/test split is random.*

Let's now try an SVM.

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

**The SVM gives only 0.85, which we can improve using different techniques.**
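One commonly used technique for SVMs is to standardize the features first, because the features in this dataset are on very different scales and SVMs are sensitive to that. The sketch below is an illustration under assumptions rather than part of the original workflow: it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`) with all 30 features instead of data.csv, and compares cross-validated accuracy with and without a `StandardScaler`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scikit-learn's bundled copy of the dataset (assumption: used here instead of data.csv)
X, y = load_breast_cancer(return_X_y=True)

plain_svm = SVC()
scaled_svm = make_pipeline(StandardScaler(), SVC())  # the scaler is re-fitted inside each CV fold

plain_acc = cross_val_score(plain_svm, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_svm, X, y, cv=5).mean()

print("SVC without scaling: %.3f" % plain_acc)
print("SVC with scaling:    %.3f" % scaled_acc)
```

Putting the scaler inside a pipeline means it only ever sees the training part of each fold, which avoids leaking information from the held-out rows.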

*Now let's repeat this with all of features_mean, so that the random forest can tell us which features are important.*

In [None]:
prediction_var = features_mean # taking all features

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model=RandomForestClassifier(n_estimators=100)

In [None]:
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

 - Using all the mean features increased the accuracy, but not by much, so by Occam's razor the simpler model is preferable.
 - Now let's check which features were important for the prediction.

In [None]:
featimp = pd.Series(model.feature_importances_, index=prediction_var).sort_values(ascending=False)
print(featimp) # a fitted random forest exposes the importance of each feature it used
# through its feature_importances_ attribute

First, let's also run the SVM with all the mean features.

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

As observed, the accuracy of the SVM significantly decreases. Therefore, let's proceed by considering only the top 5 important features identified by the RandomForest classifier.

In [None]:
prediction_var=['concave points_mean','perimeter_mean' , 'concavity_mean' , 'radius_mean','area_mean']      
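The list above is typed in by hand from the printed importances. It can also be taken directly from the fitted forest, as in this minimal sketch. It assumes scikit-learn's bundled copy of the dataset, whose columns are named "mean radius" rather than "radius_mean", and a fixed `random_state` so the ranking is repeatable; neither is part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Bundled copy of the dataset; its columns are named "mean radius" rather than "radius_mean"
bunch = load_breast_cancer(as_frame=True)
mean_cols = [c for c in bunch.data.columns if c.startswith("mean ")]
X, y = bunch.data[mean_cols], bunch.target

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank the features by importance and keep the names of the five highest
importances = pd.Series(forest.feature_importances_, index=mean_cols)
top_features = importances.sort_values(ascending=False).head(5).index.tolist()
print(top_features)
```

In the notebook itself the same idea reduces to `prediction_var = featimp.head(5).index.tolist()`, since featimp is already sorted in descending order.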

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

Based on these results, the SVM appears to be far more sensitive to the choice of input features than the Random Forest. Correlation among the features may play a part, and so may the fact that the features are on very different scales, which is known to affect SVMs much more than tree-based models. This highlights the difference in effort required for analysis between the two models. Moving forward, let's focus on the third part of the data, which pertains to the "worst" features. We will begin by considering all the features in the "worst" category.

In [None]:
prediction_var = features_worst

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# Same problem with the SVM: the accuracy is much lower, so its parameters probably need tuning,
# which we will do later in the tuning section.
# Now let's run a random forest to get the important features.

In [None]:
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
# The accuracy of the random forest increased, which suggests the "worst" features separate the classes better
# Let's get the important features
featimp = pd.Series(model.feature_importances_, index=prediction_var).sort_values(ascending=False)
print(featimp) # a fitted random forest exposes the importance of each feature it used
# through its feature_importances_ attribute

In [None]:
# The same underlying features rank highly again, and concave points_worst appears to dominate,
# which may bias the model; let's check using only the top 5 important features

In [None]:
prediction_var = ['concave points_worst','radius_worst','area_worst','perimeter_worst','concavity_worst'] 

In [None]:
train_X= train[prediction_var]
train_y= train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

In [None]:
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_y)
prediction = model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

In [None]:
#check for SVM
model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
metrics.accuracy_score(prediction,test_y)

Considering the need for simplicity, it seems that Random Forest would be a more suitable choice for prediction.

Let's explore the data further, starting with features_mean. We will create a scatter plot matrix of all the mean features, coloring the points by diagnosis. The goal is to identify the features that show a distinct boundary between the two classes and can therefore be used to tell them apart.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

color_function = {0: "blue", 1: "red"}  # Red color represents 1 (M) and blue represents 0 (B)
colors = data["diagnosis"].map(lambda x: color_function.get(x))  # Mapping the color function with the diagnosis column

pd.plotting.scatter_matrix(data[features_mean], c=colors, alpha=0.5, figsize=(15, 15))  # Plotting scatter plot matrix

plt.show()  # Display the plot


### Observations

1. Radius, area, and perimeter have a strong linear relationship, as expected.
2. In the plots, texture_mean, smoothness_mean, symmetry_mean, and fractal_dimension_mean do not separate the two classes: the red and blue points overlap, with no clear boundary between them.
3. So we can remove them from our prediction_var.
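The visual impression from the scatter matrix can be complemented with a rough number per feature. The sketch below computes the gap between the two class means in units of the pooled standard deviation. This is only one simple proxy for separability and is not the method used above; it also assumes scikit-learn's bundled copy of the dataset, where the columns are named "mean radius" and so on and where 0 encodes malignant.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)
mean_cols = [c for c in bunch.data.columns if c.startswith("mean ")]
X = bunch.data[mean_cols]
malignant = bunch.target == 0  # in the bundled copy, 0 means malignant and 1 means benign

# Gap between the class means, in units of the pooled standard deviation
gap = (X[malignant].mean() - X[~malignant].mean()).abs()
pooled_std = ((X[malignant].var() + X[~malignant].var()) / 2) ** 0.5
separation = (gap / pooled_std).sort_values(ascending=False)
print(separation.round(2))
```

A score like this looks at one feature at a time, so it cannot capture separation that only shows up when two features are combined; the scatter plots remain useful for that.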

In [None]:
# For reference, the full list of mean features we started from
features_mean

In [None]:
# So the prediction features will be
prediction_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']

In [None]:
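from sklearn.model_selection import KFold, cross_val_score

def kfold_accuracy(estimator, data, prediction, outcome):
    # Check the accuracy of a model with 10-fold cross-validation
    # (renamed from model() so the function is not overwritten when we later assign model = ...)
    kf = KFold(n_splits=10) # Define the number of folds for cross-validation
    return cross_val_score(estimator, data[prediction], data[outcome], cv=kf) # one accuracy score per fold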

In [None]:
prediction_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']

Features that separate the classes well are the most valuable for our analysis. In this section, I will give an overview of some machine learning concepts, compare the accuracy of different models, demonstrate cross-validation, and explain how to tune model parameters with GridSearchCV.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn import metrics

def classification_model(model, data, prediction_input, output):
    # Evaluate a model: accuracy on the data it was fitted on, then cross-validated accuracy
    model.fit(data[prediction_input], data[output])  # Fit the model on the full dataset passed in

    predictions = model.predict(data[prediction_input])  # Predict on the same rows used for fitting

    accuracy = metrics.accuracy_score(data[output], predictions)  # Training accuracy, an optimistic estimate
    print("Accuracy: %s" % "{0:.3%}".format(accuracy))

    cv_scores = cross_val_score(model, data[prediction_input], data[output], cv=5)  # Perform cross-validation
    print("Cross-Validation Scores: ", cv_scores)
    print("Mean Cross Validation Accuracy: %s" % "{0:.3%}".format(np.mean(cv_scores)))
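The function above reports two numbers, and the gap between them is informative: accuracy on the rows a model was fitted on is usually higher than accuracy on held-out rows. Here is a minimal sketch of that effect with an unpruned decision tree. It assumes scikit-learn's bundled copy of the dataset with all 30 features and a fixed `random_state`, both of which are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

train_acc = tree.score(X, y)  # accuracy on the very rows the tree was fitted on
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # mean accuracy on held-out folds

print("training accuracy:        %.3f" % train_acc)
print("cross-validated accuracy: %.3f" % cv_acc)
```

When comparing models, the cross-validated figure is the one to rely on, since the training figure mostly measures how well a model can memorize.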


From here on, we will try out different models.

In [None]:
model = DecisionTreeClassifier()
prediction_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']
outcome_var= "diagnosis"
classification_model(model,data,prediction_var,outcome_var)

Move on to SVM

In [None]:
model = svm.SVC()

classification_model(model,data,prediction_var,outcome_var)

In [None]:
model = KNeighborsClassifier()
classification_model(model,data,prediction_var,outcome_var)

In [None]:
model = RandomForestClassifier(n_estimators=100)
classification_model(model,data,prediction_var,outcome_var)

In [None]:
model=LogisticRegression()
classification_model(model,data,prediction_var,outcome_var)

### We just saw a detailed comparison of some Machine Learning models 

 1. In the next segment we will tune the parameters of different models.
 2. Then we will use those parameters to make predictions.

### Tuning Parameters Using GridSearchCV

Let's start with the decision tree classifier.

Tuning means searching for the parameter values that give the best predictions. Every machine learning algorithm has parameters that shape the model it builds; for those of the decision tree classifier, see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
data_X= data[prediction_var]
data_y= data["diagnosis"]

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def Classification_model_gridsearchCV(model, param_grid, data_X, data_y):
    clf = GridSearchCV(model, param_grid, cv=10, scoring="accuracy")
    clf.fit(data_X, data_y)
    
    print("The best parameter found on the development set is:")
    print(clf.best_params_)
    
    print("The best estimator is:")
    print(clf.best_estimator_)
    
    print("The best score is:")
    print(clf.best_score_)
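Note that the function above runs the search on all of the data, so its best score comes from the same rows that were used to pick the parameters and tends to be somewhat optimistic. A possible variant, sketched below under assumptions (scikit-learn's bundled copy of the dataset, a reduced parameter grid, and fixed seeds), keeps a test set aside and scores the tuned model on it afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Small illustrative grid; the search only ever sees the training part
param_grid = {"min_samples_split": [2, 5, 10], "min_samples_leaf": [1, 2, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the whole training part, so it can score new data
test_acc = search.score(X_test, y_test)

print("best parameters:", search.best_params_)
print("cross-validated accuracy on the training part: %.3f" % search.best_score_)
print("accuracy on the held-out test part: %.3f" % test_acc)
```

The held-out figure is the more honest estimate of how the tuned model would perform on unseen samples.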


In [None]:
param_grid = {'max_features': ['sqrt', 'log2'],
              'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10], 
              'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 10]}
model = DecisionTreeClassifier()
Classification_model_gridsearchCV(model, param_grid, data_X, data_y)

### Observations

1. The accuracy score has significantly increased to 95%.
2. This is a substantial improvement and indicates the effectiveness of the tuned parameters.
3. Next, let's explore the K-Nearest Neighbors (KNN) algorithm.
4. For more details on KNN, you can refer to the [Link](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
5. If you are a beginner, I highly recommend following the provided link as it will provide valuable information and insights on KNN.

In [None]:
model = KNeighborsClassifier()

k_range = list(range(1, 30))
leaf_size = list(range(1,30))
weight_options = ['uniform', 'distance']
param_grid = {'n_neighbors': k_range, 'leaf_size': leaf_size, 'weights': weight_options}
Classification_model_gridsearchCV(model,param_grid,data_X,data_y)
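Before running a grid like this one, it is worth counting how many models it implies. The sketch below uses `ParameterGrid` to count the combinations in the grid above and multiplies by the 10 folds used in Classification_model_gridsearchCV. It also counts a reduced grid without leaf_size, on the assumption, based on the scikit-learn documentation, that leaf_size influences the speed and memory of the neighbor search rather than which neighbors are found.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_neighbors": list(range(1, 30)),
    "leaf_size": list(range(1, 30)),
    "weights": ["uniform", "distance"],
}

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 10  # cv=10 in Classification_model_gridsearchCV
print(n_candidates, "parameter combinations,", n_fits, "model fits")

# The same search without leaf_size
reduced = {k: v for k, v in param_grid.items() if k != "leaf_size"}
n_reduced = len(ParameterGrid(reduced))
print(n_reduced, "parameter combinations without leaf_size")
```

Dropping a parameter that does not change the predictions shrinks the search by a factor of 29 here without changing which n_neighbors and weights values can be found.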

Next, let's try the SVM. Its parameters are described in the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [None]:
model=svm.SVC()
param_grid = [
              {'C': [1, 10, 100, 1000], 
               'kernel': ['linear']
              },
              {'C': [1, 10, 100, 1000], 
               'gamma': [0.001, 0.0001], 
               'kernel': ['rbf']
              },
 ]
Classification_model_gridsearchCV(model,param_grid,data_X,data_y)

### Observations

1. The SVM model is performing well with the optimal parameters, highlighting the importance of parameter tuning.
2. Initially, using the default parameters, the accuracy was only 70%.
3. However, after tuning the parameters, the accuracy significantly improved to 95%.

1. Similarly, we can apply the same approach to the Random Forest classifier.
2. However, for the sake of brevity, I will not provide the code for the Random Forest classifier in this context.
3. If someone is using this as a reference and wants to explore the Random Forest classifier, I encourage them to apply the same techniques discussed for parameter tuning and evaluation to the Random Forest classifier as well.
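For readers who want a starting point for that exercise, here is a minimal sketch of a grid search over a random forest. The parameter values in the grid are illustrative assumptions rather than recommendations, and the example uses scikit-learn's bundled copy of the dataset so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# A deliberately small grid so the search finishes quickly
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: %.3f" % search.best_score_)
```

In the notebook, the same grid can be passed straight to the helper defined earlier: `Classification_model_gridsearchCV(RandomForestClassifier(), param_grid, data_X, data_y)`.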

### Conclusion

1. The primary goal of this notebook is to offer a comprehensive introduction to various machine learning methods.
2. Thank you for your attention and I hope you find this notebook valuable in your machine learning journey.