# Let's build a classification model with the breast cancer dataset.
**This dataset contains statistical data obtained from histopathology studies.**


**This is a binary classification problem.**

**We want to classify each tumor as benign or malignant.**



In [1]:
import pandas as pd
dataset = pd.read_csv('/kaggle/input/breast-cancer-dataset/breast-cancer (3).csv')

In [2]:
dataset.shape

(569, 31)

In [3]:
dataset.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,M,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,M,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,M,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,M,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


**Label Encoder**

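Label encoding converts text labels into integers. `LabelEncoder` assigns codes in sorted order of the labels, so with the labels `B` (benign) and `M` (malignant), `B` becomes 0 and `M` becomes 1.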

In [4]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
dataset['diagnosis'] = labelencoder.fit_transform(dataset['diagnosis'].values)
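To check which integer each label was mapped to, the fitted encoder can be inspected. The sketch below uses a small hypothetical series of `M`/`B` labels in place of the real `diagnosis` column; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'diagnosis' column (hypothetical data).
diagnosis = pd.Series(['M', 'B', 'B', 'M'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# classes_ holds the original labels in sorted order;
# the position of each label is the integer code it receives.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1]
```

`encoder.inverse_transform(encoded)` would recover the original letters if they are needed later, for example when labelling a report.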

**We only have one dataset, so we need to split it into a training set and a test set.**

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
train, test = train_test_split(dataset, test_size=0.3)
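The split above is random, so the numbers further down will change from run to run. A minimal sketch of a reproducible, class-balanced split is shown here on a hypothetical toy frame; `random_state=42` and the toy columns are assumptions for illustration, not values used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, 37 of them malignant (label 1).
toy = pd.DataFrame({
    'diagnosis': [1] * 37 + [0] * 63,
    'radius_mean': range(100),
})

train_toy, test_toy = train_test_split(
    toy,
    test_size=0.3,
    random_state=42,            # fixed seed: the same split on every run
    stratify=toy['diagnosis'],  # keep the class ratio similar in both subsets
)

print(len(train_toy), len(test_toy))
print(round(test_toy['diagnosis'].mean(), 2))  # share of malignant rows in the test set
```

Stratifying matters most when one class is much rarer than the other, because a purely random split can leave the test set with too few examples of the rare class.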

**We are trying to guess whether the tumor is benign or malignant.**

**Our target variable (y) is the `diagnosis` column.**

**Let's assign the remaining columns as the features (X).**

In [7]:
X_train = train.drop('diagnosis', axis = 1)
y_train = train.loc[:,'diagnosis']

X_test = test.drop('diagnosis', axis = 1)
y_test = test.loc[:, 'diagnosis']

# First, let's train and test the Logistic Regression classifier.

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
model = LogisticRegression()

In [10]:
model.fit(X_train, y_train)

# We can make predictions on the test dataset and evaluate them with the confusion matrix.

In [11]:
predictions = model.predict(X_test)
predictions

array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0])

In [12]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

array([[101,   1],
       [  4,  65]])

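**166 out of 171 predictions are correct (101 + 65 on the diagonal). The 5 errors are 1 benign tumor predicted as malignant and 4 malignant tumors predicted as benign.**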
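The count of correct predictions can be read directly off the confusion matrix: rows are the true labels, columns are the predicted labels, and the diagonal holds the correct predictions. A small sketch of that arithmetic, using the matrix printed above:

```python
# Confusion matrix copied from the output above.
# Rows: true label (0, 1). Columns: predicted label (0, 1).
cm = [[101, 1],
      [4, 65]]

correct = cm[0][0] + cm[1][1]        # diagonal entries
total = sum(sum(row) for row in cm)  # all test samples
accuracy = correct / total

print(correct, total, round(accuracy, 4))  # 166 171 0.9708
```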

In [13]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       102
           1       0.98      0.94      0.96        69

    accuracy                           0.97       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.97      0.97      0.97       171



**Precision, recall, and F1-score are at least 0.94 for both classes, and overall accuracy is 0.97.**
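The per-class numbers in the report follow from the same confusion matrix. As a sketch, here is the calculation for class 1, treating it as the positive class (an assumption that matches the usual `B` → 0, `M` → 1 encoding):

```python
# Entries of the confusion matrix above, with class 1 as the positive class.
tn, fp = 101, 1   # true class 0: correctly predicted 0 / wrongly predicted 1
fn, tp = 4, 65    # true class 1: wrongly predicted 0 / correctly predicted 1

precision = tp / (tp + fp)   # of everything predicted 1, how much really is 1
recall = tp / (tp + fn)      # of everything that really is 1, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.98 0.94 0.96
```

Recall for class 1 is the lowest figure in the report, and it corresponds to the 4 tumors of class 1 that the model predicted as class 0.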

# Let's use another classification algorithm:

# SUPPORT VECTOR MACHINE 

In [14]:
from sklearn.svm import LinearSVC

In [15]:
model_2 = LinearSVC()

In [16]:
model_2.fit(X_train, y_train)

In [17]:
predictions = model_2.predict(X_test)
predictions

array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0])

In [18]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

array([[101,   1],
       [  4,  65]])

**We see that 166 of 171 predictions were correct, the same as with logistic regression.**

In [19]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       102
           1       0.98      0.94      0.96        69

    accuracy                           0.97       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.97      0.97      0.97       171



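**When we compare them, the confusion matrices and classification reports shown above are identical, so on this split logistic regression and the linear SVM perform equally well.**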
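A single random split is a noisy basis for ranking two models. One common alternative is k-fold cross-validation, sketched below. This sketch makes several assumptions that are not part of the notebook: it loads scikit-learn's bundled Wisconsin breast cancer data with `load_breast_cancer` rather than the Kaggle CSV, it scales the features inside a pipeline, and it raises `max_iter` to avoid convergence warnings. Note that the bundled data encodes malignant as 0 and benign as 1, the reverse of the encoding used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Bundled copy of the data (assumption: stands in for the Kaggle CSV).
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside each pipeline so it is re-fit on every training fold.
models = {
    'logistic_regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    'linear_svc': make_pipeline(
        StandardScaler(), LinearSVC(max_iter=10000)),
}

mean_accuracy = {}
for name, candidate in models.items():
    # 5-fold cross-validated accuracy: five train/test splits instead of one.
    scores = cross_val_score(candidate, X, y, cv=5, scoring='accuracy')
    mean_accuracy[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If the two means differ by less than the spread across folds, the data do not support calling either model better.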

# Thank you!!