<p  style="text-align: center;"><font size="10"><b>PREDICTING BREAST CANCER IN WISCONSIN</b></font></p>


Using data derived from digitized images of breast masses in the state of Wisconsin, this notebook applies feature selection and builds models with several different algorithms to predict whether a breast mass is benign or malignant. 

<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Table of Contents</h3>

1. [Libraries & Packages](#libraries)
2. [Initial Insights](#insights)
3. [Data Preprocessing & Feature Engineering](#preprocessing)
4. [Data Exploration & Visualization](#exploration)
5. [Feature Selection](#features)  
6. [Model Building](#models)  
    A. [Random Forest](#rf)  
    B. [Random Forest w Select K Best](#rfkbest)  
    C. [Support Vector Machine](#svm)  
    D. [Logistic Regression](#lr)  
    E. [Decision Tree](#dt)  
    F. [K Nearest Neighbors](#knn) 
7. [Algorithm Comparison](#comparison)
8. [Conclusion](#conclusion) 

<a id="libraries"></a>
## LIBRARIES & PACKAGES 

In [None]:
!conda install -c conda-forge pydotplus -y
!conda install -c conda-forge python-graphviz -y

In [None]:
!pip install --upgrade scikit-learn==0.20.3

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno as msno

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
import matplotlib.image as mpimg
%matplotlib inline 

import itertools


#EVALUATION ALGORITHMS
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn import metrics
# from sklearn.metrics import jaccard_score
# sklearn.externals.six is not available in newer scikit-learn releases; io.StringIO works the same way here
from io import StringIO
from sklearn import tree

import pydotplus

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.head()

<a id="insights"></a>
## INITIAL INSIGHTS

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

<a id="preprocessing"></a>
## DATA PRE-PROCESSING

In [None]:
# DROP UNUSED COLUMNS, SPLIT INTO TARGET (Y) AND FEATURES (X), AND CHECK THE SHAPE OF THE FEATURE SET

df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
Y = df.diagnosis
X = df.drop('diagnosis', axis=1)
X.shape

In [None]:
#DATA STANDARDIZATION
X_std = (X - X.mean()) / (X.std())
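
The standardization above is a z-score: each column has its mean subtracted and is divided by its standard deviation. A minimal sketch on a toy table (the column values are hypothetical, not taken from the dataset) checks that the result has mean 0 and standard deviation 1. Note that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` uses the population standard deviation, so the two give slightly different values.

```python
import pandas as pd

# Toy feature table (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({"radius": [10.0, 12.0, 14.0, 20.0],
                    "texture": [15.0, 18.0, 21.0, 30.0]})

# Same formula as above: subtract the column mean, divide by the column std.
toy_std = (toy - toy.mean()) / toy.std()

print(toy_std.mean().round(6))  # approximately 0 for every column
print(toy_std.std().round(6))   # approximately 1 for every column (sample std, ddof=1)
```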

<a id="exploration"></a>
## DATA EXPLORATION & VISUALIZATION

This section uses visualization to get an overview of the data and of how each feature relates to the target variable. 

Violin and swarm plots show how the distribution of each feature differs between benign and malignant cases. 

In [None]:
#VISUALIZE NUMBER OF BENIGN AND MALIGNANT CASES

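sns.countplot(x='diagnosis', data=df)

# Look the counts up by label instead of relying on the order value_counts() returns
counts = df['diagnosis'].value_counts()
B, M = counts['B'], counts['M']
print('Benign: ', B)
print('Malignant: ', M)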

In [None]:
# SPLIT DATASET INTO TWO SETS OF 15 FEATURES EACH

df_set1 = pd.concat([Y, X_std.iloc[:, 0:15]], axis=1)
df_set2 = pd.concat([Y, X_std.iloc[:, 15:30]], axis=1)


# TRANSFORM DATA INTO 3 COLUMN DATA FRAME W/ ALL FEATURES IN ONE COLUMN

df_melt1 = pd.melt(df_set1, id_vars="diagnosis", var_name="features", value_name='value')
df_melt2 = pd.melt(df_set2, id_vars="diagnosis", var_name="features", value_name='value')
df_melt1.head()

In [None]:
# VIOLIN PLOT OF BENIGN VS. MALIGNANT FEATURE DISTRIBUTIONS (FIRST 15 FEATURES)

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=df_melt1, split=True, inner="quart")
plt.xticks(rotation=90)

In [None]:
# SWARM PLOT OF BENIGN VS. MALIGNANT FEATURE DISTRIBUTIONS (FIRST 15 FEATURES)

sns.set(style='whitegrid', palette='muted')

plt.figure(figsize=(15,10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=df_melt1)
plt.xticks(rotation=90)

In [None]:
# VIOLIN PLOT OF BENIGN VS. MALIGNANT FEATURE DISTRIBUTIONS (LAST 15 FEATURES)

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=df_melt2, split=True, inner="quart")
plt.xticks(rotation=90)

In [None]:
# SWARM PLOT OF BENIGN VS. MALIGNANT FEATURE DISTRIBUTIONS (LAST 15 FEATURES)

sns.set(style='whitegrid', palette='muted')

plt.figure(figsize=(15,10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=df_melt2)
plt.xticks(rotation=90)

<a id="features"></a>
## FEATURE SELECTION

We will narrow down our features by using a heatmap to visualize the correlation between variables and eliminating features that are almost perfectly correlated with one another. 

In [None]:
# CREATE HEATMAP TO VISUALIZE DATA CORRELATIONS

plt.figure(figsize=(16,16))
# Use X (numeric features only): df still contains the string column 'diagnosis'
sns.heatmap(X.corr(), cbar = True,  square = True, annot=True, fmt= '.1f', annot_kws={'size': 12},
           xticklabels=X.columns, yticklabels=X.columns,
           cmap= 'YlGnBu')
plt.title('FEATURE VARIABLE CORRELATIONS')


Several features are almost perfectly correlated: their coefficients round to 1.0 in the heatmap, which shows one decimal place. For instance, **radius_mean**, **perimeter_mean**, and **area_mean** are all correlated at roughly 1.0, so we can keep one and eliminate the rest. We'll keep **area_mean**. 

This step will be repeated for the other features until we have a feature set that is narrowed down to the most essential features. 
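
The manual elimination described above can also be sketched programmatically. The helper below is a hypothetical illustration, not the procedure used in this notebook: `drop_highly_correlated` and the 0.95 threshold are assumptions, and the toy columns are synthetic. It walks the upper triangle of the absolute correlation matrix and drops the later column of every pair above the threshold.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.95):
    """Greedily drop one column from every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Synthetic toy data: perimeter and area are derived from radius, texture is independent.
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
    "texture": rng.normal(20, 4, size=200),
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.95)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Which member of a correlated group survives depends on column order: this sketch keeps the first column it encounters (here `radius`), whereas the feature list in this notebook keeps **area_mean** by choice.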

In [None]:
# CREATE NEW FEATURE SET 

features = ['area_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean',
        'area_se', 'texture_se', 'smoothness_se', 'compactness_se', 'symmetry_se', 'fractal_dimension_se',
        'area_worst', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'symmetry_worst', 'fractal_dimension_worst']
X_1 = X[features]


<a id="models"></a>
## MODEL BUILDING

Using our new feature set we will build several models to determine which method is best for predicting the outcome of our target variable, diagnosis. 

We will then evaluate each model using **accuracy_score**, **F1_Score**, **Confusion Matrix**, and **Log Loss**.

### MODELS USED:
1. RANDOM FOREST
2. RANDOM FOREST USING SELECT K BEST FEATURES
3. SUPPORT VECTOR MACHINE
4. LOGISTIC REGRESSION
5. DECISION TREE
6. K NEAREST NEIGHBOR
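
Before the models are built, the four evaluation measures named above can be illustrated on a handful of made-up predictions. This is a minimal sketch: the labels and probabilities are hypothetical, chosen only so the numbers are easy to check by hand. It also recomputes log loss manually as the mean negative log of the probability assigned to the true class.

```python
import math
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss

# Toy labels (hypothetical): B = benign, M = malignant.
y_true = ["B", "B", "B", "M", "M", "B"]
y_pred = ["B", "B", "M", "M", "B", "B"]
# Predicted probability of class "M" for each sample (hypothetical values).
p_malignant = [0.1, 0.2, 0.6, 0.9, 0.4, 0.3]
# log_loss expects one column per class, ordered by sorted label: [P(B), P(M)].
y_proba = [[1 - p, p] for p in p_malignant]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["B", "M"])  # rows = actual, columns = predicted
f1_weighted = f1_score(y_true, y_pred, average="weighted")
ll = log_loss(y_true, y_proba, labels=["B", "M"])

# Log loss by hand: mean negative log of the probability given to the true class.
p_true = [p if label == "M" else 1 - p for label, p in zip(y_true, p_malignant)]
ll_manual = -sum(math.log(p) for p in p_true) / len(p_true)

print("accuracy:", acc)
print("confusion matrix:", cm.tolist())
print("weighted F1:", f1_weighted)
print("log loss:", ll, "manual:", ll_manual)
```

Accuracy and F1 are better when higher, while log loss is better when lower, because it penalizes confident predictions that turn out to be wrong.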

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score, classification_report

<a id="rf"></a>
### RANDOM FOREST CLASSIFICATION

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_1, Y, test_size=0.3, random_state=1)

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
yhat_rf = clf_rf.predict(X_test)
yhat_proba_rf = clf_rf.predict_proba(X_test)

ac_rf = accuracy_score(y_test, yhat_rf)
print('Accuracy Score: ', ac_rf)

cm_rf = confusion_matrix(y_test, yhat_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
plt.xlabel('Predicted Values - Y Hat')
plt.ylabel('Actual Values - Y')
plt.title('Confusion Matrix - Random Forest')



#### F1 SCORE

In [None]:
f1_rf = f1_score(y_test, yhat_rf, average='weighted') 
print('F1 Score: ', f1_rf)

#### LOG LOSS

In [None]:
log_loss_rf = log_loss(y_test, yhat_proba_rf)
log_loss_rf

<hr>

<a id="rfkbest"></a>
### SELECT K BEST AND RANDOM FOREST

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
y_train

In [None]:
select_features = SelectKBest(chi2, k=9).fit(X_train, y_train)
X_train.columns[select_features.get_support()]

In [None]:
feature_scores = pd.DataFrame(X_train.columns, columns=['Features'])
feature_scores['scores'] = select_features.scores_
feature_scores = feature_scores.sort_values(by='scores', ascending=False)

In [None]:
X_train_2 = select_features.transform(X_train)
X_test_2 = select_features.transform(X_test)

clf_rf2 = RandomForestClassifier()
clf_rf2.fit(X_train_2, y_train)
yhat_rf2 = clf_rf2.predict(X_test_2)
yhat_proba_rf2 = clf_rf2.predict_proba(X_test_2)

ac_rf2 = accuracy_score(y_test, yhat_rf2)
print('Accuracy Score: ', ac_rf2)

cm_rf_kbest = confusion_matrix(y_test, yhat_rf2)
sns.heatmap(cm_rf_kbest, annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
plt.xlabel('Predicted Values - Y Hat')
plt.ylabel('Actual Values - Y')
plt.title('Confusion Matrix - Random Forest 2')

#### F1 SCORE

In [None]:
f1_rf2 = f1_score(y_test, yhat_rf2, average='weighted')
print('F1 Score: ', f1_rf2)

#### LOG LOSS

In [None]:
log_loss_rf2 = log_loss(y_test, yhat_proba_rf2)
log_loss_rf2

<a id="svm"></a>
### SUPPORT VECTOR MACHINE

In [None]:
clf_svm = svm.SVC(kernel='rbf', probability=True)
clf_svm.fit(X_train, y_train)
yhat_svm = clf_svm.predict(X_test)
yhat_proba_svm = clf_svm.predict_proba(X_test)

In [None]:
ac_svm = accuracy_score(y_test, yhat_svm)
print('Accuracy Score: ', ac_svm)

cm_svm = confusion_matrix(y_test, yhat_svm)
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
plt.xlabel('Predicted Values - Y Hat')
plt.ylabel('Actual Values - Y')
plt.title('Confusion Matrix - Support Vector Machine')

#### F1 SCORE

In [None]:
f1_svm = f1_score(y_test, yhat_svm, average='weighted')
print('F1 Score: ', f1_svm)

#### LOG LOSS

In [None]:
log_loss_svm = log_loss(y_test, yhat_proba_svm)
log_loss_svm

<a id="lr"></a>
### LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf_lr = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
clf_lr

In [None]:
# clf_lr was already fitted in the previous cell, so it only needs to predict here
yhat_lr = clf_lr.predict(X_test)
yhat_proba_lr = clf_lr.predict_proba(X_test)

In [None]:
ac_lr = accuracy_score(y_test, yhat_lr)
print('Accuracy Score: ', ac_lr)

cm_lr = confusion_matrix(y_test, yhat_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
plt.xlabel('Predicted Values - Y Hat')
plt.ylabel('Actual Values - Y')
plt.title('Confusion Matrix - Logistic Regression')

#### F1 SCORE

In [None]:
f1_lr = f1_score(y_test, yhat_lr, average='weighted') 
print('F1 Score: ', f1_lr)

In [None]:
print (classification_report(y_test, yhat_lr))

#### LOG LOSS

In [None]:
log_loss_lr = log_loss(y_test, yhat_proba_lr)
log_loss_lr

<a id="dt"></a>
### DECISION TREE

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf_dt = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
clf_dt.fit(X_train, y_train)
yhat_dt = clf_dt.predict(X_test)
yhat_proba_dt = clf_dt.predict_proba(X_test)

In [None]:
ac_dt = metrics.accuracy_score(y_test, yhat_dt)
print("Decision Tree's Accuracy: ", ac_dt)

In [None]:
cm_dt = confusion_matrix(y_test, yhat_dt)
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
plt.xlabel('Predicted Values - Y Hat')
plt.ylabel('Actual Values - Y')
plt.title('Confusion Matrix - Decision Tree')

#### F1 SCORE

In [None]:
f1_dt = f1_score(y_test, yhat_dt, average='weighted') 
print('F1 Score: ', f1_dt)

#### LOG LOSS

In [None]:
log_loss_dt = log_loss(y_test, yhat_proba_dt)
log_loss_dt

#### VISUALIZE DECISION TREE

In [None]:
dot_data = StringIO()
filename = "clf_dt.png"
featureNames = X_train.columns[0:18]
targetNames = Y.unique().tolist()
out=tree.export_graphviz(clf_dt,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_train), filled=True,  special_characters=True,rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')

<a id="knn"></a>
### K NEAREST NEIGHBORS

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
k = 7
clf_knn = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat_knn = clf_knn.predict(X_test)
yhat_proba_knn = clf_knn.predict_proba(X_test)

In [None]:
ac_knn = metrics.accuracy_score(y_test, yhat_knn)

print("Train set Accuracy: ", metrics.accuracy_score(y_train, clf_knn.predict(X_train)))
print("Test set Accuracy: ", ac_knn)

In [None]:
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))
# Use loop-local names so the k=7 model and its predictions (clf_knn, yhat_knn) are not overwritten
for n in range(1,Ks):
    
    #Train Model and Predict  
    knn_n = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat_n = knn_n.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat_n)

    # Standard error of the per-sample correctness for this k
    std_acc[n-1]=np.std(yhat_n==y_test)/np.sqrt(yhat_n.shape[0])

mean_acc

In [None]:
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
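
The loop above compares values of k by scoring each one on the test set, which tends to make the reported test accuracy optimistic. As a hedged alternative (a swapped-in technique, not what this notebook does), k can be chosen with cross-validation on the training split only. The sketch assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the Kaggle CSV, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Wisconsin diagnostic data stands in for the Kaggle CSV.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

# Score each candidate k with 5-fold cross-validation on the training split only.
candidate_ks = list(range(1, 10))
cv_means = []
for k_candidate in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k_candidate), X_tr, y_tr, cv=5)
    cv_means.append(scores.mean())

# Refit with the selected k and touch the test set once, at the end.
best_k = candidate_ks[int(np.argmax(cv_means))]
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)

print("k chosen by cross-validation:", best_k)
print("held-out test accuracy:", test_accuracy)
```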

In [None]:
print("The best accuracy was", mean_acc.max(), "with k =", mean_acc.argmax()+1) 

cm_knn = confusion_matrix(y_test, yhat_knn)
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
plt.xlabel('Predicted Values - Y Hat')
plt.ylabel('Actual Values - Y')
plt.title('Confusion Matrix - K Nearest Neighbors')

#### F1 SCORE

In [None]:
f1_knn = f1_score(y_test, yhat_knn, average='weighted') 
print('F1 Score: ', f1_knn)

#### LOG LOSS

In [None]:
log_loss_knn = log_loss(y_test, yhat_proba_knn)
log_loss_knn

<a id="comparison"></a>
## CLASSIFICATION ACCURACY COMPARISON

We will do a side-by-side comparison and a visualization of each algorithm's **accuracy_score**, **f1_score**, and **log loss** to determine which model yielded the best results. Higher is better for accuracy and F1; lower is better for log loss. 

In [None]:
# CREATE NEW DATAFRAME WITH THE ALGORITHM AND EACH ACCURACY MEASUREMENT. 

d = {'Algorithm' : ['Random Forest', 'Random Forest w/ KBest', 'Support Vector Machine', 'Logistic Regression', 'Decision Tree', 'K Nearest Neighbor'],
     'Accuracy_Score' : [ac_rf, ac_rf2, ac_svm, ac_lr, ac_dt, ac_knn],
    'F1_Score' : [f1_rf, f1_rf2, f1_svm, f1_lr, f1_dt, f1_knn],
    'Log_Loss' : [log_loss_rf, log_loss_rf2, log_loss_svm, log_loss_lr, log_loss_dt, log_loss_knn]}
df_accuracy = pd.DataFrame(data=d)
df_accuracy


In [None]:
# CREATE BAR CHART TO VISUALIZE EACH ALGORITHM'S ACCURACY MEASUREMENT. 

fig = go.Figure(data=[go.Bar(name='Accuracy_Score', x=df_accuracy['Algorithm'], y=df_accuracy['Accuracy_Score']),
                      go.Bar(name='F1_Score', x=df_accuracy['Algorithm'], y=df_accuracy['F1_Score']),
                      go.Bar(name='Log_Loss', x=df_accuracy['Algorithm'], y=df_accuracy['Log_Loss']),
                     ])

# Change the bar mode
fig.update_layout(barmode='group', title_text='Classification Scores')
fig.show()

In [None]:
# CREATE SUBPLOTS WITH ALL CONFUSION MATRICES

fig, axes = plt.subplots(2, 3, figsize=(15, 8), sharey=True)
fig.suptitle("CONFUSION MATRICES", fontsize=16)
fig.text(0.5, 0.04, 'PREDICTED VALUES (YHAT)', ha='center', va='center')
fig.text(0.06, 0.5, 'ACTUAL VALUES (Y)', ha='center', va='center', rotation='vertical')

ax1 = sns.heatmap(cm_rf, ax=axes[0, 0], annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
ax2 = sns.heatmap(cm_rf_kbest,ax=axes[0, 1], annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
ax3 = sns.heatmap(cm_svm, ax=axes[0, 2], annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
ax4 = sns.heatmap(cm_lr, ax=axes[1, 0], annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
ax5 = sns.heatmap(cm_dt, ax=axes[1, 1], annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')
ax6 = sns.heatmap(cm_knn, ax=axes[1, 2], annot=True, fmt='d', cmap='YlGnBu', xticklabels='BM', yticklabels='BM')

ax1.set_title('Random Forest')
ax2.set_title('Random Forest - KBest')
ax3.set_title('Support Vector Machine')
ax4.set_title('Logistic Regression')
ax5.set_title('Decision Tree')
ax6.set_title('K Nearest Neighbors')





<a id="conclusion"></a>
## CONCLUSION

The Random Forest algorithm appears to perform best on this dataset, with an accuracy score of about 93% and a log loss of about 0.16. 

Thank you for stopping by! I'd love to receive feedback or suggestions on how I could improve this kernel. Please leave a comment below. 

-Milton