# Classifying Benign and Malignant Tumors

The data consists of variables that will help classify tumors as benign or malignment. The variables are:
,,,,

* Clump Thickness
* Uniformity of Cell Size
* Uniformity of Cell Shape
* Marginal Adhesion
* Single Epithelial Cell Size
* Bare Nuclei
* Bland Chromatin
* Normal Nucleoli
* Mitoses

The classes are encoded at 2: 'Benign' or 4: 'Malignant'. Here, we will use Logistic, Kneighbors, SVC, GassianNB, and RandomForest to classify the tumor types.

# Importing Packages

In [52]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

In [53]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Importing Dataset and data processing

In [54]:
df = pd.read_csv('/content/drive/MyDrive/DataScience_Portfolio/CancerData.csv')

# Splitting data into X and Y. X contains descriptor variables and Y contains classifications.
X=df.iloc[:,:-1].values
Y=df.iloc[:,-1].values

# Splitting Data into test and training sets
xTrain, xTest, yTrain, yTest = train_test_split(X,Y,test_size=0.25,random_state=0)

df.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


# Running classification

In [55]:
def classy(xTrain,xTest,yTrain,yTest):

    '''
    DATA PREPARATION
    '''
    xTrain = np.delete(xTrain,0,1)      # remove sample name from xTrain
    xTestSamples = xTest[:,0]           # extract sample name from xTest 
    xTest = np.delete(xTest,0,1)        # remove sample name from xTest
    
    # Scaling Features to equal influence for features
    sc = StandardScaler()                    
    xTrain = sc.fit_transform(xTrain)   
    xTest = sc.fit_transform(xTest)

    #Logistic Classification
    LogisticClassy = LogisticRegression(random_state=0)

    # KNN Classifier
    knnClassy = KNeighborsClassifier()

    # Kernel SVC
    SVCClassy = SVC(kernel='rbf',random_state=0)

    # Naive Bayes
    BayesClassy = GaussianNB()

    # Random Forest
    RandomForestClassy = RandomForestClassifier(n_estimators=100,criterion='entropy',random_state=0)

    # Prepping output dataframes
    xTest_inverseTransform = sc.inverse_transform(xTest)
    resultDF = pd.DataFrame(({'Patient':xTestSamples,'True_Label':yTest}))
    accuracyDF = []

    # Running classifiers
    classifiers = []
    classifiers.extend([LogisticClassy,knnClassy,SVCClassy,BayesClassy,RandomForestClassy])

    for clf in classifiers:
        classyName = clf.__class__.__name__
        print('Performing ' + classyName)
        clf.fit(xTrain,yTrain)                # Fitting model
        clf_ypredict = clf.predict(xTest)     # predict data from testing set
        clf_acc = accuracy_score(y_true=yTest,y_pred=clf_ypredict) # Obtaining accuracy

        resultDF[classyName + '_Prediction'] = clf_ypredict
        accuracyDF.append({classyName + '_Accuracy':clf_acc})

    mapping = {2:'Benign', 4: 'Malignant'}
    resultDF.iloc[:,1:7] = resultDF.iloc[:,1:7].applymap(lambda s: mapping.get(s) if s in mapping else s)

    return resultDF, pd.DataFrame(accuracyDF)

In [56]:
results,accuracy = classy(xTrain,xTest,yTrain,yTest)

Performing LogisticRegression
Performing KNeighborsClassifier
Performing SVC
Performing GaussianNB
Performing RandomForestClassifier


In [57]:
results

Unnamed: 0,Patient,True_Label,LogisticRegression_Prediction,KNeighborsClassifier_Prediction,SVC_Prediction,GaussianNB_Prediction,RandomForestClassifier_Prediction
0,1173347,Benign,Benign,Benign,Benign,Benign,Benign
1,1156017,Benign,Benign,Benign,Benign,Benign,Benign
2,706426,Malignant,Malignant,Malignant,Malignant,Malignant,Malignant
3,1330439,Malignant,Malignant,Malignant,Malignant,Malignant,Malignant
4,693702,Benign,Benign,Benign,Benign,Benign,Benign
...,...,...,...,...,...,...,...
166,1266124,Benign,Benign,Benign,Benign,Benign,Benign
167,1197979,Benign,Benign,Benign,Benign,Benign,Benign
168,764974,Benign,Benign,Benign,Benign,Benign,Benign
169,1137156,Benign,Benign,Benign,Benign,Benign,Benign


In [58]:
accuracy

Unnamed: 0,LogisticRegression_Accuracy,KNeighborsClassifier_Accuracy,SVC_Accuracy,GaussianNB_Accuracy,RandomForestClassifier_Accuracy
0,0.947368,,,,
1,,0.947368,,,
2,,,0.947368,,
3,,,,0.94152,
4,,,,,0.947368
