# BUILDING THE PREDICTIVE MODEL

Let's now build our model using Logistic Regression. Logistic regression is a widely used algorithm for classification tasks, especially binary classification. It directly models the probability that a given data point belongs to a particular class.
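As a minimal sketch of what "modeling the probability" means, the snippet below passes a linear score through the logistic (sigmoid) function. The weights, bias, and feature values are hypothetical and chosen only for illustration; they are not taken from the fitted model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for a single sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical weights and standardized feature values (illustration only).
weights = [1.2, -0.4]
bias = -0.3
sample = [0.8, 1.5]

p = predict_probability(sample, weights, bias)
print(round(p, 4))
```

A score of 0 corresponds to a probability of exactly 0.5, which is the usual decision boundary between the two classes.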

**Loading the libraries**

In [35]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.metrics import log_loss
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

Let's load the dataset into a DataFrame:

In [36]:
breast_cancer = pd.read_csv(r"C:\Users\maria\Desktop\proyecto cancer de mama\breast-cancer-wisconsin-data_data.csv")
print(len(breast_cancer))
breast_cancer.head()

569


| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [37]:
drop_columns=['id', 'texture_se', 'smoothness_se', 'symmetry_se', 'fractal_dimension_mean', 'fractal_dimension_se', 'radius_worst', 'area_worst', 'radius_mean'] 

Let's separate the features from the target and split the data into training and test sets (70% / 30%).

In [38]:
X = breast_cancer.drop('diagnosis',axis=1)
X = X.drop(drop_columns, axis=1)
Y = breast_cancer['diagnosis'].values

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=0)
len(X_train), len(X_test), len(X_train.columns)

(398, 171, 22)
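The split sizes above follow from simple arithmetic. The sketch below assumes that the test count is obtained by rounding `test_size * n_samples` up, which is consistent with the 398 / 171 split printed by the cell.

```python
import math

n_samples = 569   # rows in the dataset, as printed when loading it
test_size = 0.3   # fraction passed to train_test_split

# 0.3 * 569 = 170.7, rounded up for the test set (assumed rounding rule).
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```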

Let's use the LabelEncoder to convert the class labels into numerical values and let's standardize all the data to ensure uniform scaling.

In [39]:
le = LabelEncoder()
Y_train = le.fit_transform(Y_train)
Y_test = le.transform(Y_test)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
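To make the two preprocessing steps concrete, here is a small hand-written sketch of what they compute. The values are toy data, not rows from the dataset. It assumes the encoder numbers the classes in sorted order (so B becomes 0 and M becomes 1) and that standardization uses the training mean and population standard deviation for both sets.

```python
from statistics import fmean, pstdev

# --- Label encoding: classes are numbered in sorted order ---
labels = ["M", "B", "B", "M"]
classes = sorted(set(labels))                  # ['B', 'M']
encoded = [classes.index(label) for label in labels]
print(classes, encoded)

# --- Standardization: statistics come from the training data only ---
train_values = [10.0, 12.0, 14.0, 16.0]        # toy feature, training set
test_values = [11.0, 18.0]                     # toy feature, test set

mean = fmean(train_values)
std = pstdev(train_values)                     # population standard deviation

train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]
print(train_scaled, test_scaled)
```

Reusing the training statistics on the test set, as `ss.transform(X_test)` does, keeps information about the test data out of the fitting step.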

Let's now build our model using Logistic Regression:

In [40]:
lr = LogisticRegression()
lr.fit(X_train, Y_train)

Let's assess the quality of our model using two metrics:

- Accuracy: Simply counts how many of the model's classifications are correct; it returns a value between 0 and 1 (higher is better).
- Negative Log-Likelihood (log loss): Takes into account the predicted probability, penalizing confident wrong predictions heavily; it returns a value from 0 upward with no upper bound (lower is better).
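Both metrics can be computed by hand, which makes the difference between them easier to see. The sketch below uses made-up labels and probabilities (illustration only): model A is correct on every sample, while model B makes one very confident mistake, which pushes its log loss above 1.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_log_loss(y_true, p_positive):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, p_positive):
        total += -math.log(p if t == 1 else 1.0 - p)
    return total / len(y_true)


y_true = [1, 0, 1, 0]

# Predicted probability of class 1 for each sample (made-up values).
model_a = [0.9, 0.2, 0.6, 0.4]
model_b = [0.9, 0.2, 0.6, 0.999]   # last sample: confidently wrong

for name, probs in [("A", model_a), ("B", model_b)]:
    y_pred = [1 if p > 0.5 else 0 for p in probs]
    print(name, accuracy(y_true, y_pred), round(binary_log_loss(y_true, probs), 4))
```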

To obtain the probability of belonging to a class instead of the class itself, we can use the predict_proba() method.

In [41]:
Y_pred = lr.predict(X_train)
Y_pred_proba = lr.predict_proba(X_train)

print("TRAIN ACCURACY: "+str(accuracy_score(Y_train, Y_pred)))
print("TRAIN LOG LOSS: "+str(log_loss(Y_train, Y_pred_proba)))

TRAIN ACCURACY: 0.9874371859296482
TRAIN LOG LOSS: 0.05952943868796311
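The relationship between the two kinds of output can be sketched as follows: for a binary model, the predicted class is the one with the larger probability. The probability rows below are hypothetical and only mimic the shape of a `predict_proba()` result, with columns ordered as class 0, class 1.

```python
# Hypothetical predict_proba-style output: one row per sample,
# columns ordered as [P(class 0), P(class 1)].
probabilities = [
    [0.97, 0.03],
    [0.40, 0.60],
    [0.51, 0.49],
]

# The predicted class is the index of the largest probability in each row.
predicted = [row.index(max(row)) for row in probabilities]
print(predicted)
```

The third sample is classified as class 0 with barely more than 50% confidence. Accuracy ignores that uncertainty, while log loss reflects it.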


In [42]:
Y_pred = lr.predict(X_test)
Y_pred_proba = lr.predict_proba(X_test)

print("TEST ACCURACY: "+str(accuracy_score(Y_test, Y_pred)))
print("TEST LOG LOSS: "+str(log_loss(Y_test, Y_pred_proba)))

TEST ACCURACY: 0.9766081871345029
TEST LOG LOSS: 0.08171116905823453


As we can see, the accuracy of our model on the test set is 0.9766, which is excellent. Let's now try other classification algorithms, namely K-NN, Random Forest, and SVM, to see if we can achieve better results.

**K-NN**

In [43]:
Ks = [1,2,3,4,5,7,10,12,15,20]

for K in Ks:
    print("K="+str(K))
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(X_train,Y_train)
    
    Y_pred_train = knn.predict(X_train)
    Y_prob_train = knn.predict_proba(X_train)
    
    Y_pred = knn.predict(X_test)
    Y_prob = knn.predict_proba(X_test)
    
    accuracy_train = accuracy_score(Y_train, Y_pred_train)
    accuracy_test = accuracy_score(Y_test, Y_pred)

    loss_train = log_loss(Y_train, Y_prob_train)
    loss_test = log_loss(Y_test, Y_prob)
    
    print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))
    print("LOG LOSS: TRAIN=%.4f TEST=%.4f" % (loss_train,loss_test))

K=1
ACCURACY: TRAIN=1.0000 TEST=0.9415
LOG LOSS: TRAIN=0.0000 TEST=2.1078
K=2
ACCURACY: TRAIN=0.9724 TEST=0.9415
LOG LOSS: TRAIN=0.0383 TEST=1.1188
K=3
ACCURACY: TRAIN=0.9824 TEST=0.9298
LOG LOSS: TRAIN=0.0428 TEST=0.9372
K=4
ACCURACY: TRAIN=0.9749 TEST=0.9415
LOG LOSS: TRAIN=0.0525 TEST=0.7366
K=5
ACCURACY: TRAIN=0.9698 TEST=0.9415
LOG LOSS: TRAIN=0.0633 TEST=0.3328
K=7
ACCURACY: TRAIN=0.9698 TEST=0.9532
LOG LOSS: TRAIN=0.0739 TEST=0.3379
K=10
ACCURACY: TRAIN=0.9648 TEST=0.9474
LOG LOSS: TRAIN=0.0872 TEST=0.1348
K=12
ACCURACY: TRAIN=0.9598 TEST=0.9532
LOG LOSS: TRAIN=0.0920 TEST=0.1302
K=15
ACCURACY: TRAIN=0.9598 TEST=0.9474
LOG LOSS: TRAIN=0.0996 TEST=0.1341
K=20
ACCURACY: TRAIN=0.9598 TEST=0.9415
LOG LOSS: TRAIN=0.1063 TEST=0.1337


As observed, the best results are obtained with k=7 and k=12, with an accuracy on the test set of 0.9532. However, this accuracy is lower than the one achieved with Logistic Regression.
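The very high test log loss at K=1 has a simple explanation: with a single neighbor, every predicted probability is either 0 or 1, so each misclassified sample is assigned probability 0 for its true class. The sketch below reconstructs the reported value under the assumption that `log_loss` clips probabilities to machine epsilon before taking the logarithm. That assumption is consistent with the printed figure, but it is not something the notebook states.

```python
import math
import sys

n_test = 171
test_accuracy_k1 = 0.9415                      # reported for K=1

# 0.0585 * 171 is about 10 misclassified test samples.
n_errors = round((1 - test_accuracy_k1) * n_test)

eps = sys.float_info.epsilon                   # assumed clipping value
loss_per_error = -math.log(eps)                # about 36.04 per wrong sample

# Correct predictions contribute essentially nothing to the sum.
loss = n_errors * loss_per_error / n_test
print(n_errors, round(loss, 4))
```

As K grows, the predicted probabilities become fractions of neighbors rather than hard 0/1 values, which is why the test log loss falls sharply even though the accuracy barely changes.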

**Random Forest**

In [44]:
# random_state expects an integer seed; False was being interpreted as 0
forest = RandomForestClassifier(n_estimators=29, max_depth=8, random_state=0)

forest.fit(X_train, Y_train)

Y_pred_train = forest.predict(X_train)
Y_pred = forest.predict(X_test)

accuracy_train = accuracy_score(Y_train, Y_pred_train)
accuracy_test = accuracy_score(Y_test, Y_pred)

print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))

ACCURACY: TRAIN=0.9975 TEST=0.9532


As we can see, using 29 estimators and a maximum depth of 8, we achieve a test accuracy of 0.9532, which is the same as the one obtained with K-NN. However, the accuracy is still better using Logistic Regression.

**Support Vector Machine (SVM)**

In [45]:
svc = LinearSVC()
svc.fit(X_train, Y_train)
print("ACCURACY: Train=%.4f Test=%.4f" % (svc.score(X_train, Y_train), svc.score(X_test,Y_test)))

ACCURACY: Train=0.9874 Test=0.9591




In this case, the test accuracy of 0.9591 is better than the results obtained with K-NN and Random Forest. Nevertheless, Logistic Regression still performs best.

**And what would have happened if we had used all the features?**

Let's see it by building the model using Logistic Regression. 

In [46]:
X = breast_cancer.drop('diagnosis',axis=1).values
Y = breast_cancer['diagnosis'].values

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=0)  

In [47]:
le = LabelEncoder()
Y_train = le.fit_transform(Y_train)
Y_test = le.transform(Y_test)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, Y_train)

In [48]:
Y_pred = lr.predict(X_train)
Y_pred_proba = lr.predict_proba(X_train)

print("TRAIN ACCURACY: "+str(accuracy_score(Y_train, Y_pred)))
print("TRAIN LOG LOSS: "+str(log_loss(Y_train, Y_pred_proba))) 

TRAIN ACCURACY: 0.9899497487437185
TRAIN LOG LOSS: 0.05370470413754593


In [49]:
Y_pred = lr.predict(X_test)
Y_pred_proba = lr.predict_proba(X_test)

print("TEST ACCURACY: "+str(accuracy_score(Y_test, Y_pred)))
print("TEST LOG LOSS: "+str(log_loss(Y_test, Y_pred_proba)))

TEST ACCURACY: 0.9649122807017544
TEST LOG LOSS: 0.11068153613896022


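As observed, if we hadn't removed the features identified during the exploratory data analysis, we would have obtained a worse model: the test accuracy drops from 0.9766 to 0.9649 and the test log loss rises from 0.0817 to 0.1107. Note that this run also keeps the id column, since only diagnosis is dropped, so part of the difference may come from that non-informative identifier.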

# CONCLUSION

We conclude that by removing the less informative features identified during the exploratory data analysis, we have obtained a model with excellent performance, achieving a test accuracy of 97.66%. We also tried building our model using K-NN, Random Forest, and SVM. While the results obtained with these approaches are also very good (test accuracy exceeding 95%), Logistic Regression still outperforms them.
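To put these accuracies in terms of individual samples, the short sketch below converts each reported test accuracy into a number of misclassified samples out of the 171 in the test set. The accuracy values are copied from the outputs above, and the dictionary keys are descriptive labels chosen here.

```python
n_test = 171

# Test accuracies as printed in the cells above.
test_accuracy = {
    "Logistic Regression (selected features)": 0.9766081871345029,
    "K-NN (K=7 or K=12)": 0.9532,
    "Random Forest": 0.9532,
    "Linear SVM": 0.9591,
    "Logistic Regression (all features)": 0.9649122807017544,
}

# Number of misclassified test samples implied by each accuracy.
errors = {name: round((1 - acc) * n_test) for name, acc in test_accuracy.items()}

for name, n_wrong in errors.items():
    print(f"{name}: {n_wrong} misclassified out of {n_test}")
```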