# Classification of Breast Cancer Type

Let's walk through a step-by-step machine learning workflow that classifies a breast tumor diagnosis as Malignant or Benign, using all the features in the dataset. We'll also include data visualization as part of the workflow.

[Malignant vs. Benign Tumors: What Are the Differences?](https://www.verywellhealth.com/what-does-malignant-and-benign-mean-514240)

Dataset source: [Kaggle](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)

## Data Attributes
1. `id`: Record identifier (not used as a feature)
2. `diagnosis`: Target label (categorical: 'M' = Malignant, 'B' = Benign)
3. `radius`, `texture`, `perimeter`, `area`, `smoothness`, `compactness`, `concavity`, `concave points`, `symmetry`, `fractal_dimension`: Ten cell-nucleus measurements (numeric), each stored in three columns with the suffixes `_mean`, `_se` (standard error), and `_worst` (largest values), for example `radius_mean`, `radius_se`, `radius_worst`. This gives 30 feature columns in total.
4. `Unnamed: 32`: Empty trailing column, dropped in Step 3

## Step 1: Load and Explore the Data
First, we'll load the data from the CSV file and take a look at its structure.

Explanation:
- We'll use `pandas` to load the data.
- Inspect the first few rows and basic statistics to understand the dataset.

In [None]:
import pandas as pd

# Load the data
data = pd.read_csv('breast-cancer.csv')

# Display the first few rows of the dataset
data.head()

In [None]:
# Display basic statistics
data.describe()

In [None]:
# Display information about the dataset
data.info()
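
Alongside `head()`, `describe()`, and `info()`, it can help to count missing values and look at the class balance before preprocessing. The sketch below is a minimal illustration on a tiny hypothetical frame (`toy`) with made-up values; it mimics the CSV layout, including an empty trailing column like the `Unnamed: 32` column that Step 3 drops.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: three rows, made-up values
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.9, 11.4, 12.3],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns in which every value is missing carry no information
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)

# Class balance of the target column
class_counts = toy['diagnosis'].value_counts()
print(class_counts)
```

On the real data, the same three calls can be run on `data` instead of `toy`.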

## Step 2: Data Visualization

We'll create a pie chart to visualize the distribution of the diagnosis (Malignant or Benign).

Explanation:
- Use `matplotlib` to create a pie chart.
- Show the proportion of Malignant and Benign diagnoses.

In [None]:
import matplotlib.pyplot as plt

# Plot the distribution of diagnosis
diagnosis_counts = data['diagnosis'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(diagnosis_counts, labels=diagnosis_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Cancer Diagnosis')
plt.show()

## Step 3: Data Preprocessing

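We'll drop an unused column, encode the target variable, and split the data into training and testing sets.

Explanation:
- Remove the empty `Unnamed: 32` column.
- Convert the diagnosis column to numerical values (`Malignant`: 1, `Benign`: 0). `LabelEncoder` assigns codes in alphabetical order, so 'B' becomes 0 and 'M' becomes 1.
- Split the dataset into features (every column after `diagnosis`) and target (`diagnosis`).
- Split the data into training (80%) and testing (20%) sets.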

In [None]:
# Drop the 'Unnamed: 32' column
data.drop(columns=['Unnamed: 32'], inplace=True)

from sklearn.preprocessing import LabelEncoder

# Define features and target
X = data.iloc[:, 2:].values
y = data.iloc[:, 1].values

lb = LabelEncoder()
y = lb.fit_transform(y)

from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Display the shape of the training and testing sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
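
The split above is random, so each run can produce different subsets. A possible variant, sketched here on hypothetical data (`X_demo`, `y_demo` are made up), passes `random_state` to make the split repeatable and `stratify` to keep the class ratio the same in both subsets. This is an optional refinement, not what the notebook's own split does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify keeps the class ratio the same in both subsets;
# random_state makes the split repeatable across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Share of the positive class in each subset
print(y_tr.mean(), y_te.mean())
```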

## Step 4: Train a Support Vector Machine (SVM) Model

We'll train an SVM model to classify the diagnosis.

Explanation:
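- We'll standardize the features using the `StandardScaler` class from scikit-learn. The scaler is fitted on the training set only and then applied to the test set.
- We'll use the `SVC` class from scikit-learn with a linear kernel and `C=0.1`.
- Fit the model to the scaled training data.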

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.svm import SVC

# Initialize and train the SVM model
# svm_model = SVC()   # default RBF kernel, C=1
svm_model = SVC(kernel='linear', C=0.1)
svm_model.fit(X_train_scaled, y_train)
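
Scaling and the classifier can also be bundled into a single estimator, so the same scaling is applied automatically at fit and predict time. Here is a minimal sketch, assuming scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the CSV. Note that its target is encoded 0 = malignant, 1 = benign, the reverse of this notebook. The names `X_demo`, `y_demo`, and `svm_pipeline` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled diagnostic dataset.
# Its target encoding is 0 = malignant, 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training data only and
# reapplies that same scaling inside predict() and score()
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.1))
svm_pipeline.fit(X_tr, y_tr)

test_accuracy = svm_pipeline.score(X_te, y_te)
print(f'Test accuracy: {test_accuracy:.4f}')
```

With a pipeline, raw (unscaled) features are passed to `predict`, which removes the need to track a separate scaled copy of each array.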

## Step 5: Evaluate the Model

We'll evaluate the model's performance using metrics such as `accuracy`, `precision`, `recall`, and `F1-score`.

Explanation:
- Predict the target values for the test set.
- Calculate evaluation metrics to assess model performance.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the target values for the test set
# (use the scaled features, since the model was trained on scaled data)
y_pred = svm_model.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')

## Model Evaluation Metrics

The values below come from a single run. Because the train/test split in Step 3 is random, re-running the notebook can give slightly different numbers.

### Accuracy:

Accuracy=0.9825

- Explanation: Accuracy measures the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. An accuracy of 0.9825 means that the model correctly classified about 98.25% of the cancer diagnoses.
- Interpretation: The model has very high accuracy, indicating that it performs exceptionally well overall in classifying cancer diagnoses.

### Precision:

Precision=1.0

- Explanation: Precision is the ratio of true positive predictions to the total number of positive predictions (true positives + false positives). A precision of 1.0 means that 100% of the instances predicted as Malignant by the model are actually Malignant.
- Interpretation: The precision is perfect, which means the model did not produce any false positives. Every prediction of Malignant was correct.

### Recall:

Recall=0.9592

- Explanation: Recall (also known as sensitivity or true positive rate) is the ratio of true positive predictions to the total number of actual positives (true positives + false negatives). A recall of 0.9592 means that about 95.92% of all actual Malignant cases were correctly identified by the model.
- Interpretation: The recall is very high, indicating that the model is very effective at identifying Malignant cases, though it missed a small number (about 4.08%) of actual Malignant cases (false negatives).

### F1-score:

F1-score=0.9792

- Explanation: The F1-score is the harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall). It provides a single metric that is high only when both precision and recall are high.
- Interpretation: An F1-score of 0.9792 shows that the perfect precision was not achieved at the expense of recall; both are high.
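
To make the definitions concrete, the sketch below rebuilds the four metrics from confusion-matrix counts. The counts (47 true positives, 2 false negatives, 0 false positives, 65 true negatives) are not taken from the notebook output. They are an assumption: one set of counts that is consistent with the reported values for a 114-sample test set.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Assumed counts (inferred for illustration, not from the original output)
tp, fn, fp, tn = 47, 2, 0, 65

# Build label arrays that realise these counts (1 = Malignant, 0 = Benign)
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# Rows are actual classes, columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true, y_hat)
print(cm)

acc = accuracy_score(y_true, y_hat)     # (tp + tn) / total
prec = precision_score(y_true, y_hat)   # tp / (tp + fp)
rec = recall_score(y_true, y_hat)       # tp / (tp + fn)
f1_demo = f1_score(y_true, y_hat)       # 2 * tp / (2 * tp + fp + fn)

print(f'{acc:.4f} {prec:.4f} {rec:.4f} {f1_demo:.4f}')
```

Under these assumed counts, the two errors are both false negatives, which is why precision stays at 1.0 while recall drops below it.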

## Summary
- `Accuracy` (98.25%): The model classifies about 98 of every 100 test cases correctly.
- `Precision` (100%): Every case the model predicted as Malignant was actually Malignant, so there were no false positives.
- `Recall` (95.92%): The model identified most actual Malignant cases but missed about 4.08% of them (false negatives).
- `F1-score` (97.92%): Precision and recall are both high, so neither was achieved at the expense of the other.

## Step 6: Predict Breast Cancer Type Diagnosis as Malignant or Benign

- `Collect New Patient Data`: Ensure the new data includes values for all the features used in the training set.
- `Preprocess the New Data`: Standardize the new data using the same scaler used for the training data.
- `Make Predictions`: Use the trained SVM model to predict the diagnosis for the new data.

In [None]:
# New patient data
new_patient = {
    'radius_mean': 14.2,
    'texture_mean': 19.8,
    'perimeter_mean': 94.3,
    'area_mean': 603.4,
    'smoothness_mean': 0.1,
    'compactness_mean': 0.12,
    'concavity_mean': 0.09,
    'concave points_mean': 0.05,
    'symmetry_mean': 0.17,
    'fractal_dimension_mean': 0.06,
    'radius_se': 0.4,
    'texture_se': 1.2,
    'perimeter_se': 2.3,
    'area_se': 18.5,
    'smoothness_se': 0.007,
    'compactness_se': 0.03,
    'concavity_se': 0.04,
    'concave points_se': 0.011,
    'symmetry_se': 0.02,
    'fractal_dimension_se': 0.003,
    'radius_worst': 16.2,
    'texture_worst': 25.4,
    'perimeter_worst': 106.3,
    'area_worst': 715.2,
    'smoothness_worst': 0.145,
    'compactness_worst': 0.23,
    'concavity_worst': 0.16,
    'concave points_worst': 0.1,
    'symmetry_worst': 0.24,
    'fractal_dimension_worst': 0.075
}

# Convert the new patient data to a DataFrame
new_patient_data = pd.DataFrame([new_patient])

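# Standardize the new patient data with the scaler fitted on the training data
# (transform only: refitting on a single row would turn every feature into 0)
new_patient_scaled = scaler.transform(new_patient_data.values)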

# Make a prediction
new_patient_prediction = svm_model.predict(new_patient_scaled)

# Output the prediction
diagnosis = 'Malignant' if new_patient_prediction[0] == 1 else 'Benign'
print(f'Predicted diagnosis for the new patient: {diagnosis}')
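
Why the new data must go through the same scaler that was fitted on the training data can be shown with a small sketch. The arrays here are hypothetical (`X_train_demo` and `new_row` are made-up values with two features). Refitting a scaler on a single row sets each feature's mean to the row's own value, so every scaled feature becomes 0 and the patient's measurements are lost.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 4 samples, 2 features
X_train_demo = np.array([
    [10.0, 100.0],
    [12.0, 140.0],
    [14.0, 180.0],
    [16.0, 220.0],
])
new_row = np.array([[15.0, 200.0]])

# Reuse the scaler fitted on the training data
fitted_scaler = StandardScaler().fit(X_train_demo)
reused = fitted_scaler.transform(new_row)

# Refit on the single new row: the mean equals the row itself,
# so every feature is mapped to 0
refit = StandardScaler().fit_transform(new_row)

print(reused)
print(refit)
```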