#### breast_cancer.data Column Details


 Attribute Information: (class attribute has been moved to last column)<br>
<pre>
     Attribute                     Domain
   -- -----------------------------------------
1. Sample code number            id number
2. Clump Thickness               1 - 10
3. Uniformity of Cell Size       1 - 10
4. Uniformity of Cell Shape      1 - 10
5. Marginal Adhesion             1 - 10
6. Single Epithelial Cell Size   1 - 10
7. Bare Nuclei                   1 - 10
8. Bland Chromatin               1 - 10
9. Normal Nucleoli               1 - 10
10. Mitoses                      1 - 10
11. Class:                       (2 for benign, 4 for malignant)
</pre>

#### Classification Exercise

1) Read the dataset 'breast_cancer.data'<br>
2) Remove/handle the null values.<br>
3) Rename the column names appropriately as per the attribute info mentioned above<br>
4) Based on the general understanding of the dataset, select independent features and dependent feature<br>
5) Split the dataset into training and testing dataset with test_size 25%<br>
6) Apply Decision Tree Classification and predict the class for the test data.<br>
7) Find the confusion matrix, accuracy_score and generate classification_report.<br>
8) Apply Random Forest Classification and predict the class for the test data.<br>
9) Find the confusion matrix, accuracy_score and generate classification_report.<br>
Use Cross Validation appropriately

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# 1. Read the dataset 'breast_cancer.data'
df = pd.read_csv("breast_cancer.data",header=None)
df.head()

         0  1  2  3  4  5   6  7  8  9  10
0  1000025  5  1  1  1  2   1  3  1  1   2
1  1002945  5  4  4  5  7  10  3  2  1   2
2  1015425  3  1  1  1  2   2  3  1  1   2
3  1016277  6  8  8  1  3   4  3  7  1   2
4  1017023  4  1  1  3  2   1  3  1  1   2


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       699 non-null    int64 
 1   1       699 non-null    int64 
 2   2       699 non-null    int64 
 3   3       699 non-null    int64 
 4   4       699 non-null    int64 
 5   5       699 non-null    int64 
 6   6       699 non-null    object
 7   7       699 non-null    int64 
 8   8       699 non-null    int64 
 9   9       699 non-null    int64 
 10  10      699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


**Note:** The 7th column (index 6) has dtype `object`, while every other column is `int64`.

In [4]:
# 2. Remove/handle the null values.
df.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
dtype: int64

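`isna()` reports no nulls, but that does not mean the data is complete: missing entries in column 6 are encoded as "?" rather than as NaN, which is why that column was read as `object`.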

In [5]:
# 3. Rename the column names appropriately as per the attribute info mentioned above
df.rename(columns={0:"Sample code number",1:"Clump Thickness",2:"Uniformity of Cell Size",3:"Uniformity of Cell Shape"
                   ,4:"Marginal Adhesion",5:"Single Epithelial Cell Size",6:"Bare Nuclei",7:"Bland Chromatin",
                   8:"Normal Nucleoli",9:"Mitoses",10:"Class"},inplace=True)

The `Bare Nuclei` column contains some "?" entries, which is why pandas read it as `object`. Replacing them with an arbitrarily chosen value (5 here) lets the column be converted to the same integer dtype as the others.

In [6]:
df["Bare Nuclei"] = df["Bare Nuclei"].astype(str).replace("?",5).astype(np.int64)
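An arbitrary fill value works, but it hides the fact that those entries were missing. As an alternative sketch (not what this notebook does), the "?" entries can first be turned into real NaN values, so that `isna()` reports them, and then filled with a statistic such as the column median. The small frame below is made-up illustration data, not rows from `breast_cancer.data`.

```python
import pandas as pd

# Made-up stand-in for the "Bare Nuclei" column; "?" marks a missing entry.
demo = pd.DataFrame({"Bare Nuclei": ["1", "10", "?", "4", "?", "1", "4"]})

# errors="coerce" turns anything non-numeric (here "?") into NaN.
bare = pd.to_numeric(demo["Bare Nuclei"], errors="coerce")
n_missing = int(bare.isna().sum())

# Fill with the median of the observed values, then cast back to integers.
median_value = bare.median()
demo["Bare Nuclei"] = bare.fillna(median_value).astype("int64")

print(n_missing)
print(demo["Bare Nuclei"].tolist())
```

Whether the median, the mode, or dropping the affected rows is the better choice depends on how many entries are missing; the point of the sketch is only that the missing entries are counted before they are filled.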

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           699 non-null    int64
 1   Clump Thickness              699 non-null    int64
 2   Uniformity of Cell Size      699 non-null    int64
 3   Uniformity of Cell Shape     699 non-null    int64
 4   Marginal Adhesion            699 non-null    int64
 5   Single Epithelial Cell Size  699 non-null    int64
 6   Bare Nuclei                  699 non-null    int64
 7   Bland Chromatin              699 non-null    int64
 8   Normal Nucleoli              699 non-null    int64
 9   Mitoses                      699 non-null    int64
 10  Class                        699 non-null    int64
dtypes: int64(11)
memory usage: 60.2 KB


In [8]:
# 4. Based on the general understanding of the dataset, select independent features and dependent feature
X = df.drop(["Class"],axis=1)
y = df.Class
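`Sample code number` is an identifier rather than a measurement, so one may prefer to leave it out of the feature matrix. This notebook keeps it, which is why `X` has 10 columns. A minimal sketch of the alternative, on a made-up two-row frame with only a few of the renamed columns:

```python
import pandas as pd

# Two made-up rows with a subset of the renamed columns.
demo = pd.DataFrame({
    "Sample code number": [111, 222],
    "Clump Thickness": [5, 3],
    "Mitoses": [1, 1],
    "Class": [2, 4],
})

# Drop both the target and the identifier from the features.
X_demo = demo.drop(columns=["Class", "Sample code number"])
y_demo = demo["Class"]

print(list(X_demo.columns))
```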

In [9]:
# 5. Split the dataset into training and testing dataset with test_size 25%
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((524, 10), (175, 10), (524,), (175,))
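The split above is unseeded, so the rows in each set, and every score that follows, change from run to run. Here is a hedged sketch of a reproducible, class-balanced split on synthetic data. The seed 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, labels 2/4 in a 65/35 mix.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 3))
y_demo = np.array([2] * 65 + [4] * 35)

# random_state fixes the shuffle; stratify keeps the class mix in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print((y_te == 4).sum(), "rows labelled 4 in the test part")
```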

In [10]:
# 6. Apply Decision Tree Classification and predict the class for the test data.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
clf_preds = clf.predict(X_test)
clf_preds

array([4, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 2, 4, 2, 4, 2, 4, 4, 2, 4, 4, 2,
       2, 2, 4, 4, 2, 2, 2, 4, 2, 4, 2, 2, 2, 4, 2, 4, 4, 2, 2, 2, 2, 2,
       2, 2, 4, 4, 4, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 4, 4, 2, 2, 4, 2, 4, 4, 4, 2, 2, 4, 2, 2, 2, 4, 2, 2,
       4, 4, 2, 4, 2, 2, 4, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 4, 2, 2, 2,
       4, 2, 2, 2, 4, 4, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2,
       4, 2, 2, 2, 2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 4, 2, 2, 2, 4, 2, 2, 2,
       4, 4, 2, 2, 2, 2, 4, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2],
      dtype=int64)

In [11]:
# 7. Find the confusion matrix, accuracy_score and generate classification_report.
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
conf_mat = confusion_matrix(y_test,clf_preds)
print(f"Confusion Matrix:\n {conf_mat}\n")
print(f"Model Score: {round(100*accuracy_score(y_test,clf_preds),2)}%\n")
print(f"Classification Report:\n {classification_report(y_test,clf_preds)}")

Confusion Matrix:
 [[108   5]
 [  6  56]]

Model Score: 93.71%

Classification Report:
               precision    recall  f1-score   support

           2       0.95      0.96      0.95       113
           4       0.92      0.90      0.91        62

    accuracy                           0.94       175
   macro avg       0.93      0.93      0.93       175
weighted avg       0.94      0.94      0.94       175
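
The figures in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes (scikit-learn's convention), ordered 2 then 4. As a check, this short sketch recomputes the accuracy and the class-4 precision and recall by hand from the matrix printed above:

```python
# Confusion matrix from the Decision Tree run above.
# Rows: true class (2, 4); columns: predicted class (2, 4).
cm = [[108, 5],
      [6, 56]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Treat class 4 (malignant) as the class of interest.
precision_4 = cm[1][1] / (cm[0][1] + cm[1][1])  # of predicted 4, how many are 4
recall_4 = cm[1][1] / (cm[1][0] + cm[1][1])     # of true 4, how many were found

print(f"accuracy    = {accuracy:.4f}")
print(f"precision_4 = {precision_4:.2f}")
print(f"recall_4    = {recall_4:.2f}")
```

The 6 in the bottom-left cell counts malignant samples predicted as benign, which is usually the costlier error in a screening setting, so class-4 recall deserves as much attention as overall accuracy.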



In [12]:
# 8. Apply Random Forest Classification and predict the class for the test data.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train,y_train)
rf_clf_preds = rf_clf.predict(X_test)
rf_clf_preds

array([4, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 2, 4, 2, 4, 2, 4, 4, 2, 4, 4, 2,
       2, 2, 4, 4, 2, 2, 2, 4, 2, 4, 2, 2, 2, 4, 2, 4, 4, 4, 2, 2, 2, 2,
       2, 4, 4, 4, 4, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 4, 4, 2, 2, 4, 2, 4, 4, 4, 2, 2, 4, 2, 2, 2, 4, 2, 2,
       4, 4, 2, 4, 2, 2, 4, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 4, 2, 2, 2,
       4, 2, 2, 2, 4, 4, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2,
       4, 2, 2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 2,
       4, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2],
      dtype=int64)

In [13]:
# 9. Find the confusion matrix, accuracy_score and generate classification_report.
conf_mat_ = confusion_matrix(y_test,rf_clf_preds)
print(f"Confusion Matrix:\n {conf_mat_}\n")
print(f"Model Score: {round(100*accuracy_score(y_test,rf_clf_preds),2)}%\n")
print(f"Classification Report:\n {classification_report(y_test,rf_clf_preds)}")

Confusion Matrix:
 [[109   4]
 [  2  60]]

Model Score: 96.57%

Classification Report:
               precision    recall  f1-score   support

           2       0.98      0.96      0.97       113
           4       0.94      0.97      0.95        62

    accuracy                           0.97       175
   macro avg       0.96      0.97      0.96       175
weighted avg       0.97      0.97      0.97       175



**Applying Cross Validation for Decision Tree Classifier**

In [14]:
criterion = ['gini','entropy']
max_depth = [None, 2,4,6,8,10,12,14]
min_samples_split = [2,4,6,8,10]
min_samples_leaf = [1,3,5]
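# Note: "auto" (an alias of "sqrt" for classifiers) was removed in scikit-learn 1.3;
# drop it from this list when running on newer versions.
max_features = ["auto","sqrt","log2"]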

In [15]:
from sklearn.model_selection import cross_val_score

# Vary one hyperparameter at a time (all others stay at their defaults)
# and report the mean 5-fold cross-validated accuracy for each value.
dt_params = {"criterion": criterion,
             "max_depth": max_depth,
             "min_samples_split": min_samples_split,
             "min_samples_leaf": min_samples_leaf,
             "max_features": max_features}
for name, values in dt_params.items():
    for value in values:
        score = cross_val_score(estimator=DecisionTreeClassifier(**{name: value}),
                                X=X,
                                y=y,
                                cv=5,
                                scoring="accuracy")
        print(f"For {name} {value} score is: {round(100*score.mean(),2)}%\n")

For criterion gini score is: 92.42%

For criterion entropy score is: 93.28%

For max_depth None score is: 92.56%

For max_depth 2 score is: 92.71%

For max_depth 4 score is: 93.27%

For max_depth 6 score is: 92.71%

For max_depth 8 score is: 91.99%

For max_depth 10 score is: 92.71%

For max_depth 12 score is: 92.42%

For max_depth 14 score is: 92.71%

For min_samples_split 2 score is: 92.13%

For min_samples_split 4 score is: 92.56%

For min_samples_split 6 score is: 92.13%

For min_samples_split 8 score is: 90.27%

For min_samples_split 10 score is: 91.13%

For min_samples_leaf 1 score is: 92.71%

For min_samples_leaf 3 score is: 91.84%

For min_samples_leaf 5 score is: 93.28%

For max_features auto score is: 92.71%

For max_features sqrt score is: 92.28%

For max_features log2 score is: 92.27%



Choosing hyperparameter values that scored well in the sweep above and refitting the model with them. Each hyperparameter was varied on its own, with the others left at their defaults.

In [16]:
clf.set_params(criterion="entropy",
               max_depth=2,
               min_samples_split=4,
               min_samples_leaf=5,
               max_features="auto")

DecisionTreeClassifier(criterion='entropy', max_depth=2, max_features='auto',
                       min_samples_leaf=5, min_samples_split=4)

In [17]:
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

0.92
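
Tuning each hyperparameter in isolation and then combining the winners does not guarantee a good combination, and running the search on all of `X` lets the test rows influence the choice. A common alternative is a joint grid search on the training split only. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`. The reduced grid, the seeds, and the data are illustrative assumptions, so the printed values say nothing about `breast_cancer.data`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# A deliberately small grid; every combination is scored with 5-fold CV.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 4, 6],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # only the training split is seen during tuning

print(search.best_params_)
print(round(search.best_score_, 4))
# best_estimator_ is refit on all of X_tr; the test split is used once, here.
print(round(search.score(X_te, y_te), 4))
```

The cost grows with the product of the list lengths (24 combinations times 5 folds here), so a full grid over every list in this notebook would be noticeably slower than the one-at-a-time sweep.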

**Applying Cross Validation for Random Forest Classifier**

In [18]:
bootstrap = [True,False]
criterion = ['gini','entropy']
max_depth = [None,2,4,6,8,10]
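# Note: 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3;
# drop it from this list when running on newer versions.
max_features = ['auto','sqrt','log2']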
min_samples_leaf = [1,3,5,7]
min_samples_split = [2,4,6,8]
n_estimators = [100,150,200,300,500]

In [19]:
# One-at-a-time sweep for the random forest: vary a single hyperparameter,
# keep the others at their defaults, and report mean 5-fold CV accuracy.
rf_params = {"bootstrap": bootstrap,
             "criterion": criterion,
             "max_depth": max_depth,
             "max_features": max_features,
             "min_samples_leaf": min_samples_leaf,
             "min_samples_split": min_samples_split,
             "n_estimators": n_estimators}
for name, values in rf_params.items():
    for value in values:
        score = cross_val_score(estimator=RandomForestClassifier(**{name: value}),
                                X=X,
                                y=y,
                                cv=5,
                                scoring="accuracy")
        print(f"For {name} {value} score is: {round(100*score.mean(),2)}%\n")

For bootstrap True score is: 96.28%

For bootstrap False score is: 95.71%

For criterion gini score is: 96.14%

For criterion entropy score is: 95.57%

For max_depth None score is: 96.14%

For max_depth 2 score is: 96.28%

For max_depth 4 score is: 96.43%

For max_depth 6 score is: 96.43%

For max_depth 8 score is: 96.0%

For max_depth 10 score is: 96.0%

For max_features auto score is: 96.14%

For max_features sqrt score is: 96.57%

For max_features log2 score is: 96.28%

For min_samples_leaf 1 score is: 96.0%

For min_samples_leaf 3 score is: 96.43%

For min_samples_leaf 5 score is: 96.43%

For min_samples_leaf 7 score is: 96.57%

For min_samples_split 2 score is: 95.71%

For min_samples_split 4 score is: 96.28%

For min_samples_split 6 score is: 96.28%

For min_samples_split 8 score is: 96.28%

For n_estimators 100 score is: 96.14%

For n_estimators 150 score is: 96.14%

For n_estimators 200 score is: 96.28%

For n_estimators 300 score is: 96.14%

For n_estimators 500 score is: 96.0%

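Selecting the hyperparameter values with the best accuracy in the sweep above and refitting the model with them. Each hyperparameter was varied on its own, with the others left at their defaults.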

In [20]:
rf_clf.set_params(bootstrap=True,
                  criterion="gini",
                  max_depth=4,
                  max_features="sqrt",
                  min_samples_leaf=7,
                  min_samples_split=4,
                  n_estimators=200)

RandomForestClassifier(max_depth=4, max_features='sqrt', min_samples_leaf=7,
                       min_samples_split=4, n_estimators=200)

In [21]:
rf_clf.fit(X_train,y_train)
rf_clf.score(X_test,y_test)

0.9542857142857143
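
A single train/test split gives one number that depends on which rows landed in the test set, so differences of a point or two between runs are hard to interpret. Reporting the spread across folds makes that variation visible. This minimal sketch uses synthetic data; the 50 trees, the seeds, and the data are illustrative assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Shuffled, stratified folds with a fixed seed so the result is repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=cv, scoring="accuracy",
)

# Mean and standard deviation of the per-fold accuracies.
print(f"{100 * scores.mean():.2f}% +/- {100 * scores.std():.2f}%")
```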