![](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2019/12/Artboard-1-100-1.jpg)

# What is Classification?

- Classification is a machine learning task that assigns data to predefined classes. It is a supervised learning approach where the model learns from labeled examples to make predictions on new, unseen data. The goal of classification is to find patterns or relationships in the input features that can be used to accurately assign each instance to the correct class.

- In classification, the input data consists of a set of features or attributes, and each instance is associated with a specific class label. The classification model uses these features to learn a decision boundary or a set of rules that separate different classes. Once trained, the model can then predict the class labels for new, unseen instances based on their feature values.

- There are two main types of classification problems: binary classification and multi-class classification. In binary classification, the task is to classify instances into one of two possible classes, for example distinguishing between spam and non-spam emails. In multi-class classification, the goal is to assign instances to one of several possible classes, for instance classifying images of animals into categories such as dog, cat, or bird.

- Classification algorithms vary in complexity and can range from simple ones like logistic regression and decision trees to more advanced methods like support vector machines (SVM), random forests, and neural networks. The choice of algorithm depends on the specific problem, the nature of the data, and the desired performance metrics.

- Evaluation of a classification model is typically done using metrics such as accuracy, precision, recall, and F1 score, which assess the model's performance in correctly predicting class labels. These metrics help measure the model's effectiveness in terms of correctly identifying positive and negative instances and minimizing false positives and false negatives.

- Overall, classification is a fundamental task in machine learning with applications in domains such as image recognition, sentiment analysis, fraud detection, and medical diagnosis.
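
The evaluation metrics listed above can be computed directly from the counts of true positives, false positives, and false negatives. The following is a minimal sketch in plain Python. The function name `binary_metrics` and the toy spam labels are illustrative assumptions, not part of the notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of everything flagged as positive, how much really was positive?", while recall answers "of everything that really was positive, how much was found?". F1 is their harmonic mean.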

![](https://s3.amazonaws.com/youngwonks.lessons/epR4c2VUDRkFtjonhL8XPDd6)

# What is Logistic Regression?

- Logistic regression is a statistical model used in classification problems, particularly binary classification. It predicts the outcome as a probability computed from a linear combination of the input features.

- Logistic regression is used when the dependent variable (class label) is categorical with two classes. For example, it can be used to predict whether a patient has a certain disease or not, or to predict whether a customer will purchase a product or not.

- Logistic regression combines the input features through a linear function, and the result is interpreted as the log-odds (the logarithm of the odds) of the positive class. The log-odds are then mapped to a probability between 0 and 1 by the sigmoid function, which is the inverse of the logit function. This probability represents the likelihood that the instance belongs to the positive class.

- The training of logistic regression involves utilizing the features and class labels in the dataset. The training process uses the maximum likelihood estimation to find the optimal coefficients (weights) of the model. It aims to determine the parameter values that best fit the observed class labels in the dataset.

- The evaluation of logistic regression is performed using various metrics. These include accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC-ROC). These metrics measure the classification performance of the model and assess how accurate its predictions are.

- Logistic regression is a simple and interpretable classification model. However, it assumes a linear relationship between the input features and the log-odds of the outcome, so when the classes are separated by a non-linear boundary, other classification algorithms may perform better.
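
To make the chain "linear combination, then log-odds, then probability" concrete, here is a small sketch in plain Python. The weights, bias, and samples are hypothetical values chosen for illustration. The `log_loss` function is the average negative log-likelihood, the quantity that maximum likelihood estimation minimizes.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood of the observed labels."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical model parameters and data (not taken from the notebook).
weights = [0.8, -0.5]
bias = 0.1
samples = [[2.0, 1.0], [-1.0, 0.5], [0.0, 0.2]]
labels = [1, 0, 1]

probs = [predict_proba(x, weights, bias) for x in samples]
print(probs)
print(log_loss(labels, probs))
```

Training searches for the weights and bias that make this loss as small as possible on the training data. A probability above 0.5 corresponds to positive log-odds, that is, to the positive side of the linear decision boundary.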

![](https://datatron.com/wp-content/uploads/2021/05/Support-Vector-Machine.png)

# What is LinearSVC?

- LinearSVC (Linear Support Vector Classification) is a linear classification model used for binary and multiclass classification tasks. It is a variant of Support Vector Machines (SVM) that uses a linear kernel function.

- LinearSVC aims to find a linear decision boundary that separates the classes in the input data. It constructs a hyperplane in the feature space that maximizes the margin between the classes. The margin is the distance between the hyperplane and the closest data points from each class.

- During the training process, LinearSVC learns the coefficients (weights) and intercept of the linear decision boundary by solving an optimization problem. It minimizes a regularized hinge-type loss (scikit-learn's LinearSVC uses the squared hinge loss by default): the loss term penalizes points that are misclassified or fall inside the margin, while the regularization term favors a wider margin. The optimization algorithm finds the hyperplane that best balances these two goals.

- LinearSVC is suitable for large-scale datasets because its training algorithms scale well with the number of samples and features, including high-dimensional data. It can be trained on both linearly separable and non-separable data, since the regularization parameter C controls how strongly margin violations are penalized. However, it can only learn a linear decision boundary, so when the classes are separated non-linearly, classifiers such as kernel SVMs or deep learning models may be more appropriate.

- After training, LinearSVC can make predictions by evaluating the input data's position relative to the learned linear decision boundary. It assigns the data points to the class associated with the side of the decision boundary on which they lie.

- The performance of LinearSVC can be evaluated using various metrics such as accuracy, precision, recall, and F1 score. It is important to tune the hyperparameters of the model, such as the regularization parameter C, to achieve optimal performance on the specific classification task.
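
The role of the margin and the hinge loss can be illustrated with a few hand-picked points. This is a sketch in plain Python under stated assumptions: the weights, bias, and points are hypothetical, labels are coded as -1 and +1, and the plain (not squared) hinge loss is shown for readability.

```python
def decision_value(x, weights, bias):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def hinge_loss(y_signed, score):
    """Zero when the point is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y_signed * score)


# Hypothetical hyperplane: x1 - x2 = 0.
weights = [1.0, -1.0]
bias = 0.0

# (features, true label in {-1, +1})
points = [
    ([3.0, 0.0], +1),  # correct, outside the margin
    ([0.5, 0.0], +1),  # correct, but inside the margin
    ([1.0, 2.0], +1),  # misclassified
    ([0.0, 2.0], -1),  # correct, outside the margin
]

for x, y in points:
    score = decision_value(x, weights, bias)
    predicted = 1 if score >= 0 else -1
    print(x, "true:", y, "score:", score, "pred:", predicted,
          "hinge:", hinge_loss(y, score))
```

Only the second and third points contribute to the loss. A correctly classified point still incurs a penalty when it lies inside the margin, which is what pushes the boundary away from the training points.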

# Dataset Story

- The dataset is the Breast Cancer Wisconsin (Diagnostic) dataset. It contains features computed from digitized images of fine needle aspirates (FNA) of breast masses, describing the cell nuclei in each image. The dataset consists of 30 features that are used to classify each sample as "M" (malignant) or "B" (benign).

- **The columns in the dataset are defined as follows:**

* **"diagnosis":** The target variable indicating the classification of cancer. It is coded as "M" for malignant and "B" for benign.

* **"radius_mean":** Mean radius of the cell nuclei, where the radius is the mean of distances from the center to points on the perimeter.

* **"texture_mean":** Mean texture, measured as the standard deviation of gray-scale values.

* **"perimeter_mean":** Mean perimeter of the cell nuclei.

* **"area_mean":** Mean area of the cell nuclei.

* **"smoothness_mean":** Mean smoothness, measured as the local variation in radius lengths.

* **"compactness_mean":** Mean compactness, computed as Perimeter^2 / Area - 1.0.

* **"concavity_mean":** Mean severity of concave portions of the contour.

* **"concave points_mean":** Mean number of concave portions of the contour.

* **"symmetry_mean":** Mean symmetry of the cell nuclei.

* **"fractal_dimension_mean":** Mean fractal dimension ("coastline approximation" - 1).

- The other columns follow the same pattern: for each of the ten measurements above, the dataset also contains the standard error ("_se") and the "worst" value ("_worst", the mean of the three largest values).

- These features can be used to build classification models that support breast cancer diagnosis.
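
scikit-learn bundles a copy of this dataset, which offers a quick way to check its structure without the Kaggle CSV. The sketch below assumes scikit-learn is installed. Note two differences from the notebook: the bundled copy names columns like `mean radius` and `radius error` rather than `radius_mean` and `radius_se`, and it encodes the target as 0 = malignant, 1 = benign, which is the opposite of the encoding produced later in this notebook.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

print(data.data.shape)          # (number of samples, number of features)
print(list(data.target_names))  # class names, in encoded order

# Group the feature names by statistic: mean, standard error, worst.
groups = {
    "mean": [n for n in data.feature_names if n.startswith("mean ")],
    "error": [n for n in data.feature_names if n.endswith(" error")],
    "worst": [n for n in data.feature_names if n.startswith("worst ")],
}
for statistic, names in groups.items():
    print(statistic, len(names))
```

Each of the three groups holds the same ten measurements, which accounts for the 30 features.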


# Road Map

- **1. Import Required Libraries**
- **2. Loading the Data Set**
- **3. Checking Rows and Columns**
- **4. "LabelEncoder: Transforming Categorical Variables into Numerical Values"**
- **5. "Encoding Diagnosis Column using LabelEncoder"**
- **6. "Train-Test Split: Dividing the Dataset into Training and Testing Sets"**
- **7. "Creating X_train and y_train: Separating Features and Target Variable from the Train Dataset"**
- **8. "Creating X_test and y_test: Separating Features and Target Variable from the Test Dataset"**
- **9. "Creating model_1: Logistic Regression Model"**
- **10. "Fitting model_1: Training the Logistic Regression Model on the Training Data"**
- **11. "Making Predictions: Predicting the Target Variable using model_1 on the Test Data"**
- **12. "Calculating Confusion Matrix: Evaluating the Performance of the Predictions using Confusion Matrix on the Test Data"**
- **13. "Printing Classification Report: Assessing the Performance of the Predictions using Classification Report on the Test Data"**
- **14. "Creating model_2: Linear Support Vector Classifier (LinearSVC) Model"**
- **15. "Fitting model_2: Training the Linear Support Vector Classifier (LinearSVC) Model on the Training Data"**
- **16. "Making Predictions: Predicting the Target Variable using model_2 on the Test Data"**
- **17. "Calculating Confusion Matrix: Evaluating the Performance of the Predictions using Confusion Matrix on the Test Data"**
- **18. "Printing Classification Report: Assessing the Performance of the Predictions using Classification Report on the Test Data"**

# 1. Import Required Libraries

In [1]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# 2. Loading the Data Set

In [2]:
df = pd.read_csv("/kaggle/input/breast-cancercsv/breast-cancer.csv")

# 3. Checking Rows and Columns

In [3]:
df.shape

(569, 31)

In [4]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,M,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,M,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,M,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,M,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


In [5]:
df.tail()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
564,M,2.110995,0.721473,2.060786,2.343856,1.041842,0.21906,1.947285,2.320965,-0.312589,...,1.901185,0.1177,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,M,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,...,1.53672,2.047399,1.42194,1.494959,-0.69123,-0.39482,0.236573,0.733827,-0.531855,-0.973978
566,M,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.03868,0.046588,0.105777,-0.809117,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,M,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
568,B,-1.808401,1.221792,-1.814389,-1.347789,-3.112085,-1.150752,-1.114873,-1.26182,-0.82007,...,-1.410893,0.76419,-1.432735,-1.075813,-1.859019,-1.207552,-1.305831,-1.745063,-0.048138,-0.751207


# 4. "LabelEncoder: Transforming Categorical Variables into Numerical Values"

In [6]:
labelencoder = LabelEncoder()

# 5. "Encoding Diagnosis Column using LabelEncoder"

In [7]:
df["diagnosis"] = labelencoder.fit_transform(df["diagnosis"])
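
`LabelEncoder` sorts the class labels before assigning integer codes, so "B" becomes 0 and "M" becomes 1. The small sketch below, using a hypothetical list of labels rather than the real column, shows how to inspect the mapping and how to convert codes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the "diagnosis" column.
diagnosis = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# classes_ holds the original labels; the index of each label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(encoded))

# inverse_transform recovers the original labels from the codes.
decoded = list(encoder.inverse_transform(encoded))
print(decoded)
```

Knowing the mapping matters when reading the confusion matrix and classification report later: class 1 is the malignant class.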

In [8]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,1,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


In [9]:
df.tail()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
564,1,2.110995,0.721473,2.060786,2.343856,1.041842,0.21906,1.947285,2.320965,-0.312589,...,1.901185,0.1177,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,...,1.53672,2.047399,1.42194,1.494959,-0.69123,-0.39482,0.236573,0.733827,-0.531855,-0.973978
566,1,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.03868,0.046588,0.105777,-0.809117,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
568,0,-1.808401,1.221792,-1.814389,-1.347789,-3.112085,-1.150752,-1.114873,-1.26182,-0.82007,...,-1.410893,0.76419,-1.432735,-1.075813,-1.859019,-1.207552,-1.305831,-1.745063,-0.048138,-0.751207


# 6. "Train-Test Split: Dividing the Dataset into Training and Testing Sets"

In [10]:
train, test = train_test_split(df, test_size=0.3)
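
Because the call above sets no `random_state`, every run produces a different split, and therefore slightly different results from those shown in this notebook. A possible variant is sketched below. It uses scikit-learn's bundled copy of the dataset as a stand-in for the CSV, an arbitrary seed of 42, and `stratify` to keep the class ratio equal in both parts. It also splits features and target in one call.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in for the notebook's DataFrame: X holds the features, y the target.
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # same proportion as in the notebook
    random_state=42,   # arbitrary seed, makes the split reproducible
    stratify=y,        # preserve the class ratio in train and test
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of class 1 in each part
```

With 569 samples and `test_size=0.3`, the test set holds 171 samples, which matches the support total in the classification reports.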

# 7. "Creating X_train and y_train: Separating Features and Target Variable from the Train Dataset"

In [11]:
X_train = train.drop("diagnosis",axis=1)
y_train = train.loc[:,"diagnosis"]

# 8. "Creating X_test and y_test: Separating Features and Target Variable from the Test Dataset"

In [12]:
X_test = test.drop("diagnosis",axis=1)
y_test = test.loc[:,"diagnosis"]

# 9. "Creating model_1: Logistic Regression Model"

In [13]:
model_1 = LogisticRegression()

# 10. "Fitting model_1: Training the Logistic Regression Model on the Training Data"

In [14]:
model_1.fit(X_train,y_train)
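
Beyond hard class predictions, a fitted logistic regression exposes the class probabilities and the learned coefficients. The following self-contained sketch uses scikit-learn's bundled copy of the dataset, which is an assumption and not the notebook's CSV. Since that copy is not standardized, a `StandardScaler` is placed in front of the model, and `max_iter=1000` is an illustrative choice to give the solver room to converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale the features, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = model.predict_proba(X_test[:3])
print(proba)
print(model.predict(X_test[:3]))

# One coefficient per feature for a binary problem.
coef = model.named_steps["logisticregression"].coef_
print(coef.shape)
```

The sign and size of each coefficient indicate how a feature shifts the log-odds, which is what makes the model comparatively easy to interpret when the features are on a common scale.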

# 11. "Making Predictions: Predicting the Target Variable using model_1 on the Test Data"

In [15]:
predictions = model_1.predict(X_test)
predictions

array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])

# 12. "Calculating Confusion Matrix: Evaluating the Performance of the Predictions using Confusion Matrix on the Test Data"

In [16]:
confusion_matrix(y_test, predictions)

array([[95,  4],
       [ 1, 71]])
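
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. With class 1 (malignant) as the positive class, the four cells of the matrix printed above can be unpacked as sketched here. The matrix values are copied from the output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of model_1, copied from the output above.
cm = np.array([[95, 4],
               [1, 71]])

# Row-major order for a binary problem: tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)

print("benign correctly identified:   ", tn)
print("benign flagged as malignant:   ", fp)
print("malignant missed:              ", fn)
print("malignant correctly identified:", tp)
print(round(accuracy, 4), round(precision_malignant, 4), round(recall_malignant, 4))
```

Of the 171 test samples, 5 are misclassified: 4 benign cases predicted as malignant and 1 malignant case predicted as benign.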

# 13. "Printing Classification Report: Assessing the Performance of the Predictions using Classification Report on the Test Data"

In [17]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.99      0.96      0.97        99
           1       0.95      0.99      0.97        72

    accuracy                           0.97       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.97      0.97      0.97       171



# 14. "Creating model_2: Linear Support Vector Classifier (LinearSVC) Model"

In [18]:
model_2 = LinearSVC()

# 15. "Fitting model_2: Training the Linear Support Vector Classifier (LinearSVC) Model on the Training Data"

In [19]:
model_2.fit(X_train,y_train)
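
The model above uses the default regularization strength `C=1.0`. One way to tune `C` is a cross-validated grid search on the training data, sketched below. The grid values, the F1 scoring choice, `max_iter=10000`, and the use of scikit-learn's bundled copy of the dataset are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))

# Hypothetical grid; smaller C means stronger regularization.
param_grid = {"linearsvc__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
print(round(search.score(X_test, y_test), 3))  # F1 on the held-out test set
```

The test set is touched only once, after the best `C` has been chosen, so the final score remains an honest estimate.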

# 16. "Making Predictions: Predicting the Target Variable using model_2 on the Test Data"

In [20]:
predictions = model_2.predict(X_test)
predictions

array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])
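
Unlike logistic regression, `LinearSVC` does not provide `predict_proba`. What it offers instead is `decision_function`, a signed score that grows with the distance from the hyperplane; for a binary problem the predicted class follows from the sign of that score. A minimal sketch, again assuming scikit-learn's bundled copy of the dataset and illustrative settings:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
model.fit(X, y)

scores = model.decision_function(X[:5])  # signed scores, one per sample
labels = model.predict(X[:5])

print(scores)
print(labels)

# For binary problems, a positive score means class 1, otherwise class 0.
labels_from_scores = (scores > 0).astype(int)
print(np.array_equal(labels, labels_from_scores))
```

Scores close to zero mark samples near the decision boundary, which are the predictions the model is least sure about.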

# 17. "Calculating Confusion Matrix: Evaluating the Performance of the Predictions using Confusion Matrix on the Test Data"

In [21]:
confusion_matrix(y_test, predictions)

array([[97,  2],
       [ 2, 70]])

# 18. "Printing Classification Report: Assessing the Performance of the Predictions using Classification Report on the Test Data"

In [22]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        99
           1       0.97      0.97      0.97        72

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171
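
The two models can be compared side by side using the confusion matrices printed in steps 12 and 17. The sketch below copies those matrices and treats class 1 (malignant) as the positive class; the dictionary layout and key names are illustrative.

```python
# Confusion matrices copied from the outputs of steps 12 and 17.
results = {
    "LogisticRegression": [[95, 4], [1, 71]],
    "LinearSVC": [[97, 2], [2, 70]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in results.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "errors": fp + fn,
        "missed_malignant": fn,
        "accuracy": round((tn + tp) / total, 3),
        "recall_malignant": round(tp / (tp + fn), 3),
    }
    print(name, summary[name])
```

On this particular split, LinearSVC makes one fewer error overall, while logistic regression misses one fewer malignant case. A difference of a single sample on one random split is too small to rank the models, so repeating the comparison with cross-validation would give a more reliable picture.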

