# Cancer Classification using Logistic Regression

## Introduction to Logistic Regression

Logistic regression is a fundamental machine learning algorithm for classification. Despite its name, it predicts which class a sample belongs to rather than a continuous value.

### What is Logistic Regression?
- A supervised model that learns a decision rule from labelled examples
- Suited to yes/no (binary) or multi-category predictions
- In our case, it predicts the cancer type from genetic markers (single-nucleotide polymorphisms, SNPs)

### How does it work?
- Takes input features (in our case, SNP markers)
- Calculates probability of belonging to each class (cancer type)
- Makes prediction based on highest probability
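- Uses an S-shaped curve called the 'sigmoid' to turn a score into a probability for two classes; with more than two classes, its generalization (softmax) is typically used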
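
The steps above can be sketched in plain Python. This is only an illustration, not what scikit-learn runs internally: the class names, weights, biases, and the sample are all made up, and for more than two classes it assumes the softmax function, a common generalization of the sigmoid.

```python
import math

def sigmoid(z):
    """S-shaped curve that maps any score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values for illustration only (not learned from any data).
class_names = ["type_A", "type_B", "type_C"]
weights = [[1.2, -0.4],   # one row of weights per class,
           [-0.8, 0.9],   # one weight per SNP
           [0.1, 0.1]]
biases = [0.0, 0.2, -0.1]
sample = [2, 0]           # made-up genotype at two SNPs

# 1. Take the input features and compute one score per class.
scores = [sum(w * x for w, x in zip(row, sample)) + b
          for row, b in zip(weights, biases)]

# 2. Convert the scores to a probability for each class.
probs = softmax(scores)

# 3. Predict the class with the highest probability.
predicted = class_names[probs.index(max(probs))]
print(predicted)  # type_A

# With only two classes, softmax reduces to the sigmoid of the score difference.
two_class = softmax([1.5, 0.0])[0]
print(math.isclose(two_class, sigmoid(1.5)))  # True
```

In a trained model the weights and biases are fitted to the training data rather than chosen by hand.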

### Why use it?
- Simple to understand and explain
- Works well as a baseline for classification, especially when the classes are roughly linearly separable
- Gives probability scores for predictions
- Can handle multiple classes (like our different cancer types)
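
To see the probability scores in practice, here is a minimal sketch on made-up data. The three class labels, the allele frequencies, and the names `X_toy`, `y_toy`, and `toy_model` are hypothetical and unrelated to `common_cancer.csv`; the point is only that `predict_proba` returns one probability per class for each sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 made-up cancer types, 30 samples each, 5 SNPs
# coded as 0/1/2 (number of copies of the minor allele).
labels = ["type_A", "type_B", "type_C"]
allele_freqs = {
    "type_A": [0.1, 0.1, 0.5, 0.5, 0.5],
    "type_B": [0.9, 0.1, 0.5, 0.5, 0.5],
    "type_C": [0.1, 0.9, 0.5, 0.5, 0.5],
}
X_toy = np.vstack([rng.binomial(2, allele_freqs[lab], size=(30, 5))
                   for lab in labels])
y_toy = np.repeat(labels, 30)  # 30 x type_A, then 30 x type_B, then 30 x type_C

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, y_toy)

# One row per sample, one column per class (in the order of toy_model.classes_).
proba = toy_model.predict_proba(X_toy[:3])
print(toy_model.classes_)
print(proba.round(3))

# The predicted label is the class with the highest probability.
print(toy_model.classes_[proba.argmax(axis=1)])
```

Each row of `proba` sums to 1, so the values can be read as the model's confidence in each cancer type for that sample.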

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the data
data = pd.read_csv('common_cancer.csv')

# Display basic information about our dataset
print("Dataset Overview:")
print(f"Number of samples: {len(data)}")
# The first column holds the cancer type, so it is not counted as a feature
print(f"Number of features: {len(data.columns) - 1}")
print("\nFirst few rows of the data:")
display(data.head())

In [None]:
# Separate features (X) and target variable (y)
X = data.iloc[:, 1:]  # All columns except the first one (SNP markers)
y = data.iloc[:, 0]   # First column (cancer types)

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

print("Data splitting:")
print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

In [None]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Create and train the model
print("Training the model...")
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

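# Create confusion matrix (rows: actual, columns: predicted), ordered by model.classes_
conf_matrix = confusion_matrix(y_test, y_pred, labels=model.classes_)

# Plot confusion matrix with cancer type names on both axes
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)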
plt.title('Confusion Matrix')
plt.xlabel('Predicted Cancer Type')
plt.ylabel('Actual Cancer Type')
plt.show()

In [None]:
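# Calculate feature importance as the mean absolute coefficient across classes
# (coefficients are comparable across SNPs because the features were standardized)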
feature_importance = pd.DataFrame({
    'SNP': X.columns,
    'Importance': np.abs(model.coef_).mean(axis=0)
})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Plot top 20 most important features
plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance.head(20), x='Importance', y='SNP')
plt.title('Top 20 Most Important SNP Markers')
plt.xlabel('Importance Score')
plt.ylabel('SNP Marker')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Print top 10 important features
print("\nTop 10 Most Important SNP Markers:")
print(feature_importance.head(10))