# Breast Cancer Prediction using Naive Bayes

**Note:**

>* scikit-learn (`sklearn`) ships with several freely available datasets. Some are small toy datasets; others are real-world datasets that have been anonymised and released publicly. 

>* A good way to start practising any ML algorithm is to try it first on these toy and real-world datasets from scikit-learn. 

## 1. Data Loading & EDA

In [1]:
# import the datasets library. 
from sklearn import datasets

In [2]:
# load the Breast Cancer dataset. 
data = datasets.load_breast_cancer()

In [3]:
# what did we really download? 
print(type(data))

<class 'sklearn.utils.Bunch'>


In scikit-learn, a `Bunch` is a dictionary-like container that holds the data together with its metadata. Its keys can also be read as attributes (for example, `data.target`). 
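As a small illustration of the `Bunch` container, the sketch below reads the same array by key and by attribute. The variable names (`bunch`, `features_by_key`, `features_by_attr`) are illustrative only.

```python
from sklearn import datasets

bunch = datasets.load_breast_cancer()

# A Bunch supports both dictionary-style and attribute-style access.
features_by_key = bunch["data"]
features_by_attr = bunch.data
print(features_by_key is features_by_attr)  # both names refer to the same array

# Features and labels are stored as separate NumPy arrays.
print(bunch.data.shape, bunch.target.shape)
print(list(bunch.target_names))
```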

In [4]:
# let's see what we got along with the data. 
print(data.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [5]:
# Let's look into the data description. 
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [6]:
# loading the pandas library
import pandas as pd

In [7]:
# load only the data. 
df = pd.DataFrame(data.data, columns=data.feature_names)

In [8]:
# let's look into the sample rows in the data.
df.sample(10)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
321,20.16,19.66,131.1,1274.0,0.0802,0.08564,0.1155,0.07726,0.1928,0.05096,0.5925,0.6863,3.868,74.85,0.004536,0.01376,0.02645,0.01247,0.02193,0.001589,23.06,23.03,150.2,1657.0,0.1054,0.1537,0.2606,0.1425,0.3055,0.05933
371,15.19,13.21,97.65,711.8,0.07963,0.06934,0.03393,0.02657,0.1721,0.05544,0.1783,0.4125,1.338,17.72,0.005012,0.01485,0.01551,0.009155,0.01647,0.001767,16.2,15.73,104.5,819.1,0.1126,0.1737,0.1362,0.08178,0.2487,0.06766
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
182,15.7,20.31,101.2,766.6,0.09597,0.08799,0.06593,0.05189,0.1618,0.05549,0.3699,1.15,2.406,40.98,0.004626,0.02263,0.01954,0.009767,0.01547,0.00243,20.11,32.82,129.3,1269.0,0.1414,0.3547,0.2902,0.1541,0.3437,0.08631
338,10.05,17.53,64.41,310.8,0.1007,0.07326,0.02511,0.01775,0.189,0.06331,0.2619,2.015,1.778,16.85,0.007803,0.01449,0.0169,0.008043,0.021,0.002778,11.16,26.84,71.98,384.0,0.1402,0.1402,0.1055,0.06499,0.2894,0.07664
419,11.16,21.41,70.95,380.3,0.1018,0.05978,0.008955,0.01076,0.1615,0.06144,0.2865,1.678,1.968,18.99,0.006908,0.009442,0.006972,0.006159,0.02694,0.00206,12.36,28.92,79.26,458.0,0.1282,0.1108,0.03582,0.04306,0.2976,0.07123
238,14.22,27.85,92.55,623.9,0.08223,0.1039,0.1103,0.04408,0.1342,0.06129,0.3354,2.324,2.105,29.96,0.006307,0.02845,0.0385,0.01011,0.01185,0.003589,15.75,40.54,102.5,764.0,0.1081,0.2426,0.3064,0.08219,0.189,0.07796
362,12.76,18.84,81.87,496.6,0.09676,0.07952,0.02688,0.01781,0.1759,0.06183,0.2213,1.285,1.535,17.26,0.005608,0.01646,0.01529,0.009997,0.01909,0.002133,13.75,25.99,87.82,579.7,0.1298,0.1839,0.1255,0.08312,0.2744,0.07238
511,14.81,14.7,94.66,680.7,0.08472,0.05016,0.03416,0.02541,0.1659,0.05348,0.2182,0.6232,1.677,20.72,0.006708,0.01197,0.01482,0.01056,0.0158,0.001779,15.61,17.58,101.7,760.2,0.1139,0.1011,0.1101,0.07955,0.2334,0.06142
366,20.2,26.83,133.7,1234.0,0.09905,0.1669,0.1641,0.1265,0.1875,0.0602,0.9761,1.892,7.128,103.6,0.008439,0.04674,0.05904,0.02536,0.0371,0.004286,24.19,33.81,160.0,1671.0,0.1278,0.3416,0.3703,0.2152,0.3271,0.07632


For datasets loaded from scikit-learn, the target column must be added separately, because `data.data` contains only the feature values. 

In [9]:
# let's load the target column. 
df['target'] = data.target
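An alternative sketch, assuming scikit-learn 0.23 or newer: the loader can return pandas objects directly through `as_frame=True`, in which case the features and the target arrive in one DataFrame. The name `df_alt` is illustrative.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects
# (assumption: scikit-learn >= 0.23).
bunch = datasets.load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final "target" column.
df_alt = bunch.frame
print(df_alt.shape)
print(df_alt.columns[-1])
```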

In [10]:
# now let's look into the sample rows. 
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [11]:
# look at the class distribution (0 = malignant, 1 = benign). 
df['target'].value_counts()

1    357
0    212
Name: target, dtype: int64
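The numeric labels are easier to read once mapped to the class names stored in `target_names`. A minimal sketch (the names `labels` and `counts` are illustrative):

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()

# Build a {0: "malignant", 1: "benign"} lookup from the dataset metadata.
name_of = {i: str(name) for i, name in enumerate(data.target_names)}

labels = pd.Series(data.target).map(name_of)
counts = labels.value_counts()
print(counts)
print(counts / counts.sum())  # class proportions
```

The classes are only mildly imbalanced, which is worth keeping in mind when deciding whether resampling is needed at all.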

**Important Note On Data Sampling:**

>* In real life, an application can generate enormous amounts of data, in some cases petabytes per day. 

>* In such cases, before fitting a machine learning model to a regression or classification problem, we draw a sample from the data and use it as the training data. Stratified sampling preserves the class proportions; simple random sampling can also work when the data is fairly uniformly distributed. 

>* We must document how the sample was obtained, including technical details such as the Python code, SQL queries, databases, and tables involved. 

>* Using this sample, we train candidate ML models and select the one that works best for us. 

>* With this approach, we must ensure that the sample represents the population well and that it is as large as our existing setup can process. 

>* If instead the whole population is already small (tens of thousands of rows in total), sampling is not required and we can model on the entire dataset. 
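As a rough sketch of stratified sampling, the breast cancer data stands in for a much larger population here. `train_test_split` is used only as a convenient stratified sampler; the 20% sample size and `random_state=42` are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
population = pd.DataFrame(data.data, columns=data.feature_names)
population["target"] = data.target

# Draw a 20% sample whose class proportions mirror the population.
sample, _ = train_test_split(
    population,
    train_size=0.2,
    stratify=population["target"],
    random_state=42,
)

# The two distributions should be nearly identical.
print(population["target"].value_counts(normalize=True))
print(sample["target"].value_counts(normalize=True))
```

Recording the seed and the sampling fraction, as done here, is part of documenting how the sample was obtained.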



In [12]:
# install Pandas profiling
#!pip uninstall pandas-profiling --yes
#!pip install pandas-profiling==2.7.1

In [14]:
# import pandas profiler
from pandas_profiling import ProfileReport

In [15]:
# profile our data. 
data_profile = ProfileReport(df)

In [16]:
# print the profile on the screen. 
data_profile.to_notebook_iframe()


In [17]:
data_profile.to_file('/content/breast_cancer_data_analysis_v1.html')


## 2. Initial Modelling

In [20]:
# divide the data into features and targets. 
X = df.drop('target', axis=1)
y = df['target']

In [21]:
# load libraries for training and testing splits
from sklearn.model_selection import train_test_split

In [22]:
# loading the imblearn library
from imblearn.over_sampling import SMOTE

In [23]:
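# oversample the entire dataset with SMOTE. 
# caveat: resampling before the train/test split lets synthetic points derived from
# rows that later land in the test set influence training, which can inflate the scores;
# resampling only the training split avoids this.
oversampler = SMOTE()
X,y = oversampler.fit_resample(X,y)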



In [24]:
# check the new size of the data.
print(X.shape)
print(y.shape)

(714, 30)
(714,)


In [25]:
# Split the data into training and testing. 
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=101)
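A hedged sketch of the leakage-free ordering: split first, then balance only the training fold. To depend on scikit-learn alone, plain random oversampling with replacement (`sklearn.utils.resample`) stands in for SMOTE here. With imbalanced-learn installed, calling `fit_resample` on the training fold only would play the same role. All variable names are illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

data = datasets.load_breast_cancer()
X_raw = pd.DataFrame(data.data, columns=data.feature_names)
y_raw = pd.Series(data.target, name="target")

# 1. Split first, so the test set contains only original, untouched rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, stratify=y_raw, random_state=101
)

# 2. Balance the training fold only: resample the minority class with
#    replacement until it matches the majority class size.
train = X_tr.assign(target=y_tr)
majority_label = train["target"].value_counts().idxmax()
majority = train[train["target"] == majority_label]
minority = train[train["target"] != majority_label]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=101
)
train_balanced = pd.concat([majority, minority_up])

X_tr_bal = train_balanced.drop(columns="target")
y_tr_bal = train_balanced["target"]
print(y_tr_bal.value_counts())
print(X_te.shape)
```

The test fold keeps the original class mix, so the metrics computed on it reflect data the model would actually see.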

In [18]:
# importing module for Naive Bayes
from sklearn.naive_bayes import GaussianNB

In [19]:
# create the object for the model. 
nb = GaussianNB()

In [26]:
# fit our first model using the training data. 
nb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [27]:
# predict on the testing data. 
y_pred = nb.predict(X_test)

In [28]:
# import the classification metrics' libraries. 
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score

In [29]:
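# let's look into the metrics. 
# note: scikit-learn metrics are defined as metric(y_true, y_pred). The calls below
# pass y_pred first, so in the confusion matrix the rows are predicted labels and the
# columns are true labels, the report's precision and recall columns are interchanged,
# and the AUC is computed with the two roles swapped. Accuracy is symmetric and unaffected.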
print("Accuracy of the model: ", accuracy_score(y_pred, y_test))
print("")
print("AUC Score: ", roc_auc_score(y_pred, y_test))
print("")
print("Confusion matrix:")
print(confusion_matrix(y_pred, y_test))
print("")
print("Classification Report:")
print(classification_report(y_pred, y_test))
print("")

Accuracy of the model:  0.9106145251396648

AUC Score:  0.9109649122807018

Confusion matrix:
[[77  7]
 [ 9 86]]

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.92      0.91        84
           1       0.92      0.91      0.91        95

    accuracy                           0.91       179
   macro avg       0.91      0.91      0.91       179
weighted avg       0.91      0.91      0.91       179
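For comparison, a minimal sketch of the conventional argument order, `metric(y_true, y_pred)`. It runs on the original (not oversampled) data, so its numbers are not expected to match the output above. It also computes the AUC from predicted probabilities rather than hard labels, which is the more usual choice. Variable names are illustrative.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_all, y_all = datasets.load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=101
)

model = GaussianNB().fit(X_tr, y_tr)
pred = model.predict(X_te)

# True labels go first: rows are true classes, columns are predictions.
cm = confusion_matrix(y_te, pred)
print(cm)

# Passing the arguments the other way round yields the transposed matrix.
print(confusion_matrix(pred, y_te))

# AUC computed from scores (probability of class 1) instead of 0/1 predictions.
proba = model.predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, proba))
```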




**Very Important Feature About Naive Bayes:**

>* Naive Bayes is sometimes loosely described as a **Lazy Learner**, but strictly speaking it is an eager learner: lazy learners such as k-nearest neighbours defer all work to prediction time, whereas Naive Bayes does fit a model. 

>* What makes it feel lazy is how little training involves: `fit` only estimates the class priors and, for `GaussianNB`, the per-class mean and variance of each feature, in a single pass over the training data. 

>* Prediction on a new datapoint then combines these stored statistics through Bayes' rule, under the assumption that the features are conditionally independent given the class. 

>* This makes the model very fast to train compared with most other probabilistic or non-probabilistic models. 

>* However, when the independence assumption is badly violated, or when the test data differs in distribution or quality from the training data, accuracy can suffer. 

>* For these reasons, Naive Bayes (typically the multinomial variant) is a common choice for NLP and text classification problems, where the feature set is usually a very large, sparse matrix: it is simple, and it trains and predicts quickly. 
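To make concrete what `GaussianNB` computes, the following hedged sketch inspects the statistics stored by `fit` and then reproduces the predictions by hand from class priors, means, and variances. The smoothing term `eps` mirrors the default `var_smoothing=1e-09`. The manual scoring is an illustrative reimplementation under that assumption, not the library's internal code.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

X_all, y_all = datasets.load_breast_cancer(return_X_y=True)
model = GaussianNB().fit(X_all, y_all)

# fit() stores per-class priors and per-class feature means.
print(model.class_prior_)
print(model.theta_.shape)  # (n_classes, n_features)

# Recompute the statistics by hand and score every class c with
# log P(c) + sum_j log N(x_j | mean_cj, var_cj).
eps = 1e-9 * X_all.var(axis=0).max()  # mirrors the default var_smoothing
scores = []
for c in model.classes_:
    Xc = X_all[y_all == c]
    mean = Xc.mean(axis=0)
    var = Xc.var(axis=0) + eps
    log_prior = np.log(len(Xc) / len(X_all))
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_all - mean) ** 2 / var).sum(axis=1)
    scores.append(log_prior + log_lik)

# Pick the class with the highest score for each row.
manual_pred = model.classes_[np.argmax(np.vstack(scores), axis=0)]

# Fraction of rows where the manual rule agrees with the fitted model.
print((manual_pred == model.predict(X_all)).mean())
```

Training amounts to a handful of means, variances, and counts, which is why fitting is so cheap.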

# End