# **Binary Classification with XGBoost: A Comprehensive Approach Using Cross-Validation**

This notebook demonstrates XGBoost for binary classification, covering implementation, training, and evaluation. The model is assessed with a confusion matrix and an accuracy score on a held-out test set, and 10-fold cross-validation on the training set gives a more reliable estimate of its performance. The approach is well suited to structured (tabular) data.

# XGBoost

**XGBoost (Extreme Gradient Boosting)** is a fast, efficient, and scalable machine learning algorithm based on the gradient boosting framework. It builds an ensemble of decision trees, where each tree corrects the errors of the previous ones. XGBoost is popular for its accuracy and speed, its native handling of missing values, its built-in regularization, and its ability to scale to large datasets. It's commonly used for classification, regression, and ranking tasks.

In short, it’s a go-to algorithm for structured data, known for winning data science competitions due to its strong performance.
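To make the idea of each tree correcting the errors of the previous ones concrete, here is a minimal sketch of gradient boosting for squared-error regression, with scikit-learn's `DecisionTreeRegressor` as the weak learner. It is a simplified illustration rather than XGBoost's actual algorithm, which adds regularization, second-order gradient information, and many optimizations. The function names `fit_boosted_trees` and `predict_boosted_trees` and the toy sine data are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared error (illustrative, not XGBoost itself)."""
    base = float(np.mean(y))               # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction         # errors left by the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)             # the new tree learns those errors
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_boosted_trees(base, trees, X, learning_rate=0.1):
    """Add the shrunken contribution of every tree to the constant start value."""
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction


# Toy data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X_toy, y_toy)
mse_start = np.mean((y_toy - base) ** 2)
mse_end = np.mean((y_toy - predict_boosted_trees(base, trees, X_toy)) ** 2)
print(f"MSE with constant prediction: {mse_start:.3f}")
print(f"MSE after boosting:           {mse_end:.3f}")
```

The training error falls as trees are added because every new tree is fit to whatever the current ensemble still gets wrong.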


XGBoost is widely used for both binary and multiclass classification because it handles large datasets efficiently and produces accurate models. Typical examples of each:



**Binary Classification**:

* Predicting whether a customer will churn (Yes/No).
* Identifying spam emails (Spam/Not Spam).
* Detecting fraud in financial transactions (Fraud/No Fraud).

**Multiclass Classification**:

* Predicting the type of disease based on symptoms (Class A, Class B, Class C).
* Classifying images into different categories (e.g., Cats, Dogs, Birds).

## Importing the libraries

In [49]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

The dataset (`Data.csv`) concerns breast cancer diagnosis. Each row is one biopsy sample, described by nine cell-characteristic features and a class label giving the diagnosis. The columns are:


**Sample code number**: An identifier for each sample.

**Clump Thickness**: Measurement of the thickness of clumps of cells.

**Uniformity of Cell Size**: Uniformity in size of the cells.

**Uniformity of Cell Shape**: Uniformity in shape of the cells.

**Marginal Adhesion**: Degree of adhesion of cells to each other.

**Single Epithelial Cell Size**: Size of individual epithelial cells.

**Bare Nuclei**: Number of bare nuclei (nuclei without surrounding cytoplasm).

**Bland Chromatin**: Texture of chromatin in cells.

**Normal Nucleoli**: Number of normal nucleoli in cells.

**Mitoses**: Number of mitotic figures in the sample.

**Class**: Diagnosis label, where 2 denotes benign and 4 denotes malignant.

In [50]:
dataset = pd.read_csv('Data.csv')
dataset

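| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 678 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 679 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 680 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 681 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |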
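Before modelling, it can help to confirm that the columns are numeric, that nothing is missing, and how the two classes are balanced. The sketch below runs those checks on a small stand-in DataFrame built from a few of the rows and columns displayed above. The name `preview` is hypothetical, and with the real data the same calls would be made on `dataset`.

```python
import pandas as pd

# Stand-in for `dataset`: rows 0, 1, 2, 680 and 681 from the display, with a subset of columns
preview = pd.DataFrame(
    {
        "Sample code number": [1000025, 1002945, 1015425, 888820, 897471],
        "Clump Thickness": [5, 5, 3, 5, 4],
        "Bare Nuclei": [1, 10, 2, 3, 4],
        "Class": [2, 2, 2, 4, 4],
    }
)

print(preview.dtypes)          # every column should be numeric
print(preview.isna().sum())    # missing values per column

# Samples per class: 2 = benign, 4 = malignant
class_counts = preview["Class"].value_counts().sort_index()
print(class_counts)
```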


In [51]:
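# Features: all columns except the last (this keeps the "Sample code number" ID as column 0)
X = dataset.iloc[:, :-1].values
# Target: the last column (Class)
y = dataset.iloc[:, -1].values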

In [52]:
X[:5]

array([[1000025,       5,       1,       1,       1,       2,       1,
              3,       1,       1],
       [1002945,       5,       4,       4,       5,       7,      10,
              3,       2,       1],
       [1015425,       3,       1,       1,       1,       2,       2,
              3,       1,       1],
       [1016277,       6,       8,       8,       1,       3,       4,
              3,       7,       1],
       [1017023,       4,       1,       1,       3,       2,       1,
              3,       1,       1]])
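The first column of `X` printed above is the sample code number, which is an identifier rather than a measurement. One alternative, which this notebook does not use and which would change the results reported later, is to select features by name and leave the identifier out. The sketch below shows this on a stand-in DataFrame holding the first three rows displayed earlier; the names `sample`, `X_no_id`, and `y_labels` are hypothetical.

```python
import pandas as pd

# Stand-in for `dataset`: the first three rows displayed earlier
sample = pd.DataFrame(
    [
        [1000025, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2],
        [1002945, 5, 4, 4, 5, 7, 10, 3, 2, 1, 2],
        [1015425, 3, 1, 1, 1, 2, 2, 3, 1, 1, 2],
    ],
    columns=[
        "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
        "Uniformity of Cell Shape", "Marginal Adhesion",
        "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
        "Normal Nucleoli", "Mitoses", "Class",
    ],
)

# Drop the identifier and the label by name, keeping the nine measurements
X_no_id = sample.drop(columns=["Sample code number", "Class"]).values
y_labels = sample["Class"].values

print(X_no_id.shape)   # (3, 9)
print(X_no_id[0])
```

Selecting by column name also protects against a reordered CSV, which positional slicing with `iloc` would not catch.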

In [53]:
y[:40]

array([2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 4, 4, 2, 2, 4, 2, 4, 4,
       2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 4, 4])

In [54]:
# Recode the class labels: 2 (benign) -> 0, 4 (malignant) -> 1
y = np.where(y == 2, 0, 1)
y[:40]

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1])
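Note that `np.where(y == 2, 0, 1)` maps every value other than 2 to 1, so an unexpected label would silently be treated as malignant. A stricter variant is sketched below. The helper name `encode_labels` is hypothetical, and the default mapping assumes the 2/4 coding described earlier.

```python
import numpy as np


def encode_labels(y, mapping=None):
    """Map raw class codes to 0/1 and fail loudly on any unexpected code."""
    if mapping is None:
        mapping = {2: 0, 4: 1}   # 2 = benign -> 0, 4 = malignant -> 1
    y = np.asarray(y)
    unknown = set(np.unique(y).tolist()) - set(mapping)
    if unknown:
        raise ValueError(f"Unexpected class labels: {sorted(unknown)}")
    return np.array([mapping[value] for value in y.tolist()])


print(encode_labels([2, 2, 4, 2, 4]))   # [0 0 1 0 1]
```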

## Splitting the dataset into the Training set and Test set

In [55]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
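The split above is purely random. When the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions similar in both sets. The sketch below illustrates this on synthetic stand-in data with a hypothetical 65/35 class mix; it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 35% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 130 + [1] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("Positive share overall:", y_demo.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```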

## Training XGBoost on the Training set

In [56]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

## Making the Confusion Matrix

In [57]:
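from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
# Rows of cm are the true classes, columns the predicted classes (0 = benign, 1 = malignant)
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Fraction of test samples classified correctly
accuracy_score(y_test, y_pred)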

[[85  2]
 [ 1 49]]


0.9781021897810219
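Accuracy alone hides which kind of error the model makes, and in a diagnostic setting a missed malignant case usually matters more than a false alarm. The sketch below derives precision, recall, and specificity from the confusion matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and class 1 meaning malignant.

```python
import numpy as np

# Confusion matrix printed above (rows = true class, columns = predicted class)
cm = np.array([[85, 2],
               [1, 49]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)      # of the cases predicted malignant, the share that were malignant
recall = tp / (tp + fn)         # of the truly malignant cases, the share that were caught
specificity = tn / (tn + fp)    # of the truly benign cases, the share that were recognized

print(f"Accuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
```

Here one malignant sample out of 50 was missed (the false negative) and two benign samples out of 87 were flagged as malignant (the false positives).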

## Applying k-Fold Cross Validation

In [58]:
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation on the training set: returns one accuracy per fold
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.71 %
Standard Deviation: 2.28 %
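The same idea extends to several metrics at once and to explicitly controlled folds. The sketch below uses `cross_validate` with a shuffled `StratifiedKFold`. To keep it runnable without the dataset, synthetic data and a shallow `DecisionTreeClassifier` stand in for `X_train`, `y_train`, and the notebook's `classifier`; any scikit-learn compatible estimator, including `XGBClassifier`, could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a stand-in estimator (assumptions for this example)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
estimator = DecisionTreeClassifier(max_depth=3, random_state=0)

# Ten stratified folds, shuffled with a fixed seed for reproducibility
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    estimator, X_demo, y_demo, cv=folds, scoring=["accuracy", "recall", "precision"]
)

for metric in ("accuracy", "recall", "precision"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean() * 100:.2f} % (std {values.std() * 100:.2f} %)")
```

Reporting the standard deviation next to the mean, as the notebook does, shows how much the score depends on which samples land in each fold.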

