# **Cancer Classification Model**

Welcome to our model! This is a k-nearest neighbors (k-NN) classifier that predicts a patient's cancer type (the label) from their gene expression data. Let's begin!

# **DATA PREPROCESSING**

Firstly, we'll import the necessary libraries as well as our data.

In [None]:
#importing libraries
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
%matplotlib notebook
import matplotlib.pyplot as plt
import os
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#importing the data uploaded to the colab files
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/data.csv'
df = pd.read_csv(file_path)
file_path = '/content/drive/MyDrive/labels.csv'
df_labels = pd.read_csv(file_path)

#our data came in two separate datasets, but we want to combine them for now.
df = pd.concat([df, df_labels], axis = 1)

Mounted at /content/drive


Let's take a first look at our data! Based on the shape below, we have 801 patients and 20,534 columns: 20,531 gene expression features, two sample ID columns, and the class label. We expect that some values might be missing, since measurement precision is limited when collecting this many data points per patient.

In [None]:
df.shape

(801, 20534)

In [None]:
df.head()

Unnamed: 0.2,Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,...,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530,Unnamed: 0.1,Class
0,sample_0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,...,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0,sample_0,PRAD
1,sample_1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,...,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0,sample_1,LUAD
2,sample_2,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,...,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0,sample_2,PRAD
3,sample_3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,...,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0,sample_3,PRAD
4,sample_4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,...,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0,sample_4,BRCA


In [None]:
#we can remove the Unnamed: 0 column, since it won't help the model
df = df.drop(columns=['Unnamed: 0'])

In [None]:
df.head()

Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,gene_9,...,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530,Class
0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,0.0,...,8.210257,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0,PRAD
1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,0.0,...,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0,LUAD
2,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,0.0,...,8.127123,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0,PRAD
3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,0.0,...,8.792959,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0,PRAD
4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,0.0,...,8.891425,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0,BRCA


Let's check for any missing values.

In [None]:
print(df.isnull().sum())

gene_0        0
gene_1        0
gene_2        0
gene_3        0
gene_4        0
             ..
gene_20527    0
gene_20528    0
gene_20529    0
gene_20530    0
Class         0
Length: 20532, dtype: int64


Nice, we're not missing any data! We can continue exploring our data.

Note: Because we have only 801 samples but more than 20,000 features, there is a real risk of overfitting: the model may memorize the training examples rather than learn patterns that generalize to new patients. On the flip side, with so little data the model may fail to pick up the underlying patterns and underfit. To handle this, we will use k-fold cross-validation, which lets every sample be used for both training and validation when we tune the model's hyperparameters and assess its performance.

K-fold cross-validation divides the dataset into K roughly equal-sized folds (subsets). The model is trained on K-1 folds and validated on the remaining fold. This process is repeated K times, using a different fold as the validation set each time, and the scores are averaged.
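
As a minimal sketch of the splitting procedure described above, the snippet below runs scikit-learn's `KFold` on a toy array of 10 samples. The toy array, `n_splits=5`, and the `_toy` variable names are illustrative assumptions, not the notebook's data. Each sample should land in the validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 2 features (illustrative only)
X_toy = np.arange(20).reshape(10, 2)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

val_counts = np.zeros(len(X_toy), dtype=int)  # how often each sample is used for validation
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy)):
    # Train on K-1 folds, validate on the held-out fold
    val_counts[val_idx] += 1
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={len(train_idx)} samples, validation={len(val_idx)} samples")

print("Times each sample was used for validation:", val_counts)
```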

Next, let's look at the values that the labels and the gene features can take.

In [None]:
df['Class'].unique()

array(['PRAD', 'LUAD', 'BRCA', 'KIRC', 'COAD'], dtype=object)

These are the different cancer types. This is the key below:

PRAD: Prostate adenocarcinoma (Prostate cancer)

LUAD: Lung adenocarcinoma (Lung cancer)

BRCA: Breast invasive carcinoma (Breast cancer)

KIRC: Kidney renal clear cell carcinoma (Renal tubule kidney cancer)

COAD: Colon adenocarcinoma (Colon cancer)

In [None]:
df['gene_0'].unique()

array([0.        , 0.34175809, 0.63152341, 1.48233202, 0.66775591,
       0.54408967, 0.81893234, 0.41294407, 0.531868  , 0.3287216 ,
       0.34164424, 0.56588958, 0.69884077, 0.6247092 , 0.58890049,
       1.21915303, 0.48738344, 0.33753984, 0.54774594, 1.24110797,
       0.87334075, 0.75745002, 0.64643919, 0.56666905, 0.54201036,
       0.84406421, 0.55316387, 0.45037988, 0.28392177, 0.66411871,
       0.32515612, 0.36333931, 0.80776706, 0.40860318, 0.40403114,
       0.43658829])

Nice, it looks like there are no unexpected values in the data. Let's move on to our train/validation split. First, we're going to create labeled examples by separating the features from the labels.

In [None]:
y = df['Class']
X = df.drop(columns='Class')

print("Number of Examples: " + str(X.shape[0]))
print("Number of Features: " + str(X.shape[1]))
print("\nFeatures:\n")
print(X)
print("\nLabels:\n")
print(y)

Number of Examples: 801
Number of Features: 20531

Features:

     gene_0    gene_1    gene_2    gene_3     gene_4  gene_5    gene_6  \
0       0.0  2.017209  3.265527  5.478487  10.431999     0.0  7.175175   
1       0.0  0.592732  1.588421  7.586157   9.623011     0.0  6.816049   
2       0.0  3.511759  4.327199  6.881787   9.870730     0.0  6.972130   
3       0.0  3.663618  4.507649  6.659068  10.196184     0.0  7.843375   
4       0.0  2.655741  2.821547  6.539454   9.738265     0.0  6.566967   
..      ...       ...       ...       ...        ...     ...       ...   
796     0.0  1.865642  2.718197  7.350099  10.006003     0.0  6.764792   
797     0.0  3.942955  4.453807  6.346597  10.056868     0.0  7.320331   
798     0.0  3.249582  3.707492  8.185901   9.504082     0.0  7.536589   
799     0.0  2.590339  2.787976  7.318624   9.987136     0.0  9.213464   
800     0.0  2.325242  3.805932  6.530246   9.560367     0.0  7.957027   

       gene_7  gene_8  gene_9  ...  gene_20521  g

# **MODEL SETUP**
Here we set up the k-NN model, using stratified k-fold cross-validation and L1 normalization to train and evaluate it more rigorously.

In [None]:
from sklearn.preprocessing import normalize

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# Assuming X and y are your features and target labels
# StratifiedKFold ensures that each fold maintains the class distribution
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize the model
model = KNeighborsClassifier(n_neighbors=5, weights='distance')

# Standardize the features (important for k-NN, which is distance-based).
# Note: the scaler is fit on all samples here, before the folds are split.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize lists to store performance metrics for each fold
precisions = []
recalls = []
f1_scores = []
accuracies = []

# Start k-fold cross-validation
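for train_index, test_index in kf.split(X_scaled, y):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    # Use .iloc so the split indices are treated as integer positions
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # L1-normalize each sample (row) so that its absolute values sum to 1
    X_train = normalize(X_train, norm='l1', axis=1)
    X_test = normalize(X_test, norm='l1', axis=1)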

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate evaluation metrics
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    accuracy = accuracy_score(y_test, y_pred)

    # Append metrics for this fold
    precisions.append(precision)
    recalls.append(recall)
    f1_scores.append(f1)
    accuracies.append(accuracy)

# Compute the mean and standard deviation of each metric across all folds
print(f"Average Precision: {np.mean(precisions):.4f} ± {np.std(precisions):.4f}")
print(f"Average Recall: {np.mean(recalls):.4f} ± {np.std(recalls):.4f}")
print(f"Average F1 Score: {np.mean(f1_scores):.4f} ± {np.std(f1_scores):.4f}")
print(f"Average Accuracy: {np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}")


Average Precision: 0.9941 ± 0.0035
Average Recall: 0.9938 ± 0.0040
Average F1 Score: 0.9938 ± 0.0039
Average Accuracy: 0.9938 ± 0.0040
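
One thing to keep in mind with the loop above is that `StandardScaler` is fit on all samples before the split, so each training fold has already seen statistics from its test fold. A hedged alternative sketch, shown here on synthetic data from `make_classification` rather than the gene expression matrix, wraps the scaling, L1 normalization, and k-NN steps in a `Pipeline` so that every step is refit inside each fold. The step names, the `_demo` variables, and the synthetic dataset are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Synthetic stand-in for the real data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_classes=3, random_state=42,
)

pipe = Pipeline([
    ("scale", StandardScaler()),      # refit on the training part of each fold
    ("l1", Normalizer(norm="l1")),    # row-wise L1 normalization
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance")),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipe, X_demo, y_demo, cv=cv_demo,
    scoring=["accuracy", "f1_weighted"],
)

print("Accuracy per fold:", cv_results["test_accuracy"])
print("Weighted F1 per fold:", cv_results["test_f1_weighted"])
```

The same `pipe` object could also be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `knn__n_neighbors`).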


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_scaled, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")


Best parameters: {'n_neighbors': 5, 'weights': 'uniform'}
Best cross-validation accuracy: 0.9963


In [None]:
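# Refit with the best parameters found by the grid search.
# X_train/X_test and y_train/y_test are the split left over from the last fold of the loop above.
model = KNeighborsClassifier(n_neighbors=5, weights='uniform')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)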

# **EVALUATION**

**Cross-Validation**

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score
import numpy as np

# Initialize LeaveOneOut
loo = LeaveOneOut()

# Perform LOOCV
accuracy_scores = []
for train_index, test_index in loo.split(X_train):
    X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]

    # Use .iloc to index y_train based on integer positions:
    y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]

    # Train the model
    model.fit(X_train_fold, y_train_fold)

    # Make predictions
    y_pred_fold = model.predict(X_test_fold)

    # Calculate accuracy for this fold
    accuracy_scores.append(accuracy_score(y_test_fold, y_pred_fold))

# Print the accuracy scores
print('Accuracies for each fold in Leave-One-Out CV:')
print(accuracy_scores)

# Calculate and print the mean accuracy
mean_accuracy = np.mean(accuracy_scores)
print(f'Mean Accuracy (LOOCV): {mean_accuracy:.4f}')

Accuracies for each fold in Leave-One-Out CV:
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0

Next, we summarize the leave-one-out results. Note that the model evaluated here is the k-NN classifier, with its hyperparameters (`n_neighbors` and `weights`) chosen by the grid search in the model setup section.

**Mean Accuracy**

In [None]:
# Find the mean accuracy score and save to variable 'acc_mean'
acc_mean = np.mean(accuracy_scores)
print('The mean accuracy score across the leave-one-out iterations:')
print(acc_mean)


# Find the standard deviation of the accuracy score and save to variable 'acc_std'
acc_std = np.std(accuracy_scores)

# Print the standard deviation of the accuracy scores (computed with np.std) to see the degree of variance.
print('The standard deviation of the accuracy score across the leave-one-out iterations:')
print(acc_std)

The mean accuracy score across the leave-one-out iterations:
0.9937597503900156
The standard deviation of the accuracy score across the leave-one-out iterations:
0.0787483897917252
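
Because each leave-one-out fold scores a single sample, every entry in `accuracy_scores` is either 0 or 1, so the standard deviation is fully determined by the mean: for a 0/1 variable with mean p, the population standard deviation is sqrt(p(1-p)). The sketch below checks this against the numbers printed above. The counts of 641 folds and 4 misclassifications are inferred from the reported mean (637/641) and are assumptions, not values read from the notebook's variables.

```python
import numpy as np

n_folds = 641   # assumed: size of X_train (801 samples minus one 160-sample test fold)
n_errors = 4    # assumed: inferred from the reported mean accuracy

# Rebuild a 0/1 score vector with the assumed number of errors
scores = np.array([1.0] * (n_folds - n_errors) + [0.0] * n_errors)

p = scores.mean()
print(f"Mean accuracy: {p:.6f}")
print(f"np.std:        {np.std(scores):.6f}")
print(f"sqrt(p(1-p)):  {np.sqrt(p * (1 - p)):.6f}")
```

Under these assumptions, the standard deviation mostly restates the error rate rather than describing fold-to-fold variability.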


**Cross Validation (Again)**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print("Cross-validation scores:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())

Cross-validation scores: [0.99378882 1.         1.         1.         1.        ]
Mean CV accuracy: 0.9987577639751553


**Overall Accuracy**

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 98.75%


**Euclidean Distance**

In [None]:
def euclidean_distance(x1, x2):
    """
    Compute the Euclidean distance between two data points (x1 and x2).
    """
    return np.sqrt(np.sum((x1 - x2) ** 2))

def predict_class(X_train, y_train, X_test_point, k=3):
    """
    Predict the class of a test point based on Euclidean distance.
    Arguments:
    - X_train: Training feature data
    - y_train: Training labels
    - X_test_point: A single test data point (row of features)
    - k: Number of neighbors to consider (default is 3)

    Returns:
    - Predicted class label for the test point
    """
    distances = []

    # Compute the distance from the test point to each training point
    for i in range(len(X_train)):
        dist = euclidean_distance(X_train[i], X_test_point)
        distances.append((dist, y_train.iloc[i]))

    # Sort the distances in ascending order
    distances.sort(key=lambda x: x[0])

    # Get the k nearest neighbors
    nearest_neighbors = distances[:k]

    # Count the classes of the nearest neighbors
    class_counts = {}
    for _, label in nearest_neighbors:
        class_counts[label] = class_counts.get(label, 0) + 1

    # Return the class with the highest count
    predicted_class = max(class_counts, key=class_counts.get)
    return predicted_class

# Predict for each test point
predictions = []
for i in range(len(X_test)):
    test_point = X_test[i]
    predicted_class = predict_class(X_train, y_train, test_point, k=3)
    predictions.append(predicted_class)

# Compute accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.4f}')


Accuracy: 0.9875
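
The loop above computes one distance at a time. As a sketch of the same idea in vectorized form, the snippet below computes all test-to-train Euclidean distances in one call to SciPy's `cdist` and takes a majority vote over the `k` nearest labels. The function name `knn_predict_vectorized` and the toy points are hypothetical and only for illustration. Ties go to the label seen first among the distance-sorted neighbors, which mirrors the dictionary-based vote in the loop above.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import cdist

def knn_predict_vectorized(X_train, y_train, X_test, k=3):
    """Majority-vote k-NN using a pairwise Euclidean distance matrix."""
    y_train = np.asarray(y_train)

    # dists[i, j] is the distance from test point i to training point j
    dists = cdist(X_test, X_train, metric="euclidean")

    # Indices of the k closest training points for each test point
    nearest = np.argsort(dists, axis=1)[:, :k]

    # Majority vote among the neighbors' labels
    return [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest]

# Toy example: two well-separated clusters (illustrative only)
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array(["A", "A", "A", "B", "B", "B"])
X_te = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_predictions = knn_predict_vectorized(X_tr, y_tr, X_te, k=3)
print(toy_predictions)
```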


**Mean-Squared Error**

In [None]:
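from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

y_pred_proba = model.predict_proba(X_test)

# Encode the true labels as integers. LabelEncoder sorts the classes, which matches
# the column order of predict_proba (model.classes_) as long as y_test contains every class.
label_encoder = LabelEncoder()
y_test_encoded = label_encoder.fit_transform(y_test)

# Get predicted labels by selecting the class with highest probability
y_pred_labels = y_pred_proba.argmax(axis=1)

# Calculate MSE using predicted labels and true labels.
# The integer codes are arbitrary, so this value depends on the class ordering.
mse = mean_squared_error(y_test_encoded, y_pred_labels)
print(mse)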

0.0625
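
Since the class codes produced by `LabelEncoder` are arbitrary integers, the MSE above depends on how the classes happen to be ordered. One order-independent variant, sketched here as an assumption about what might be worth scoring rather than as the notebook's method, compares the predicted probability vectors with one-hot encoded labels (a multiclass Brier-style score). The labels and probabilities below are hypothetical values for three samples.

```python
import numpy as np

# Class order as LabelEncoder would sort them
classes = np.array(["BRCA", "COAD", "KIRC", "LUAD", "PRAD"])

# Hypothetical true labels and predicted probabilities (illustrative only)
y_true_demo = np.array(["BRCA", "LUAD", "PRAD"])
proba_demo = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],   # confident and correct
    [0.6, 0.0, 0.0, 0.4, 0.0],   # wrong: most probability mass on BRCA
    [0.0, 0.0, 0.0, 0.2, 0.8],   # correct, slightly less confident
])

# One-hot encode the true labels in the same column order as the probabilities
one_hot = (y_true_demo[:, None] == classes[None, :]).astype(float)

# Squared error summed across classes, then averaged over samples
brier = np.mean(np.sum((proba_demo - one_hot) ** 2, axis=1))
print(f"Multiclass Brier-style score: {brier:.4f}")
```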


**Top-K**

In [None]:
from sklearn.preprocessing import LabelEncoder

k = 2  # Define k for Top-k Accuracy
top_k_indices = np.argsort(y_pred_proba, axis=1)[:, -k:]  # Indices of the k highest-probability classes
top_k_accuracy = np.mean([y_test.iloc[i] in label_encoder.classes_[top_k_indices[i]] for i in range(len(y_test))])
print(f"Top-{k} Accuracy: {top_k_accuracy * 100:.2f}%")

Top-2 Accuracy: 99.38%
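
scikit-learn also provides `top_k_accuracy_score`, which computes top-k accuracy directly from a probability matrix. The sketch below uses a small hand-made probability matrix (hypothetical values, not the notebook's predictions) and passes `labels` so that the columns are matched to class names explicitly.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

labels_demo = np.array(["BRCA", "COAD", "KIRC", "LUAD", "PRAD"])  # column order of the scores
y_true_topk = np.array(["BRCA", "LUAD", "PRAD", "COAD"])

# Hypothetical predicted probabilities (illustrative only)
scores_demo = np.array([
    [0.8, 0.0, 0.0, 0.2, 0.0],   # BRCA ranked 1st: hit for k=1 and k=2
    [0.6, 0.0, 0.0, 0.4, 0.0],   # LUAD ranked 2nd: hit only for k=2
    [0.0, 0.0, 0.0, 0.2, 0.8],   # PRAD ranked 1st: hit for k=1 and k=2
    [0.5, 0.1, 0.4, 0.0, 0.0],   # COAD ranked 3rd: miss for k=1 and k=2
])

top1 = top_k_accuracy_score(y_true_topk, scores_demo, k=1, labels=labels_demo)
top2 = top_k_accuracy_score(y_true_topk, scores_demo, k=2, labels=labels_demo)
print(f"Top-1 accuracy: {top1:.2f}")
print(f"Top-2 accuracy: {top2:.2f}")
```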


**Precision, Recall, F1, Confusion Matrix**

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, accuracy_score

# Assuming you have the following from your model:
# y_test - true labels for test data
# y_pred - predicted labels from your model

# Calculate Precision, Recall, F1 Score for Multi-Class Classification
precision = precision_score(y_test, y_pred, average='weighted')  # 'weighted' is common for multi-class
recall = recall_score(y_test, y_pred, average='weighted')        # 'weighted' handles imbalanced classes
f1 = f1_score(y_test, y_pred, average='weighted')                # 'weighted' is commonly used for multi-class

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:")
print(conf_matrix)


Precision: 0.9879
Recall: 0.9875
F1 Score: 0.9874
Accuracy: 0.9875
Confusion Matrix:
[[60  0  0  0  0]
 [ 1 15  0  0  0]
 [ 0  0 29  0  0]
 [ 1  0  0 27  0]
 [ 0  0  0  0 27]]
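
The weighted averages above can be recovered from the confusion matrix itself. The sketch below copies the printed matrix and computes per-class precision and recall. The class order in `class_names` is an assumption, based on `confusion_matrix` sorting the labels alphabetically by default.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = np.array([
    [60,  0,  0,  0,  0],
    [ 1, 15,  0,  0,  0],
    [ 0,  0, 29,  0,  0],
    [ 1,  0,  0, 27,  0],
    [ 0,  0,  0,  0, 27],
])
class_names = ["BRCA", "COAD", "KIRC", "LUAD", "PRAD"]  # assumed sorted label order

support = cm.sum(axis=1)                              # true samples per class
recall_per_class = np.diag(cm) / support              # correct / all true members
precision_per_class = np.diag(cm) / cm.sum(axis=0)    # correct / all predicted members

for name, p, r, s in zip(class_names, precision_per_class, recall_per_class, support):
    print(f"{name}: precision={p:.4f}, recall={r:.4f}, support={s}")

# Support-weighted averages, which is what average='weighted' computes
weighted_precision = np.average(precision_per_class, weights=support)
weighted_recall = np.average(recall_per_class, weights=support)
print(f"Weighted precision: {weighted_precision:.4f}")
print(f"Weighted recall:    {weighted_recall:.4f}")
```

Under the assumed label order, both misclassified samples were predicted as the first class (BRCA), which is why only that class has precision below 1.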


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a k-NN classifier on the scaled features
# (the labels are categorical, so a regressor does not apply here)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Evaluate permutation importance for the classifier fit above, on test data scaled the same way
result = permutation_importance(knn, X_test_scaled, y_test, n_repeats=5, random_state=42)

# Get the importance scores for each feature
importance_scores = result.importances_mean

# Create a DataFrame to visualize the importance of each feature
import pandas as pd
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importance_scores
})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Show the top 5 important features
print(importance_df.head(5))
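
Permutation importance over all 20,531 features can be slow with a k-NN model, since every feature is shuffled and rescored `n_repeats` times. As a small sketch of what the call returns, the snippet below runs it on synthetic data in which only the first feature carries signal. The dataset, the scaling factor applied to that feature, and the `_syn` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data: 4 features, only feature 0 determines the class
X_syn = rng.normal(size=(300, 4))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_syn[:, 0] *= 5.0   # make the informative feature dominate the distances

X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

knn_syn = KNeighborsClassifier(n_neighbors=5).fit(X_tr_syn, y_tr_syn)

# Shuffle each feature in turn and measure the drop in test accuracy
perm = permutation_importance(
    knn_syn, X_te_syn, y_te_syn, n_repeats=5, random_state=0
)

# List features from most to least important
for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature_{i}: {perm.importances_mean[i]:.3f} ± {perm.importances_std[i]:.3f}")
```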

In [None]:
# Get a random index from the test set
index = np.random.randint(0, len(X_test))

# Get the input features for the selected index
example_input = X_test[index]

# Get the ground truth label for the selected index
ground_truth = y_test.iloc[index]

# Reshape the example_input for prediction
example_input_reshaped = example_input.reshape(1, -1)  # Reshape to (1, num_features)

# Get the model's prediction
predicted_label = model.predict(example_input_reshaped)[0]

# Print the results
print(f"Example Input (First 5 Features): {example_input[:5]}")
print(f"Ground Truth Label: {ground_truth}")
print(f"Predicted Label: {predicted_label}")
print(f"Accuracy: {accuracy * 100:.2f}%")


Example Input (First 5 Features): [-1.10607571e-05  4.51175838e-05  3.07848123e-05 -5.49098278e-06
 -3.38983904e-05]
Ground Truth Label: PRAD
Predicted Label: PRAD
Accuracy: 98.75%
