
# Cancer Classification using Logistic Regression - Mathematical Approach

## 1. Introduction to Logistic Regression

### Mathematical Foundation

Logistic Regression is based on the following mathematical concepts:

1) **The Logistic Function (Sigmoid)**:
   $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

2) **Linear Combination**:
   $$z = b + w_1x_1 + w_2x_2 + \dots + w_nx_n = w^Tx + b$$

3) **Probability Estimation**:
   $$P(Y=1 \mid x) = \sigma(w^Tx + b)$$
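
The three pieces above can be combined in a few lines of NumPy. This is a minimal sketch: the weights `w`, bias `b`, and sample `x` are made-up values chosen for illustration, not parameters learned from the cancer dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one sample with three features
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])

z = w @ x + b      # linear combination w^T x + b
p = sigmoid(z)     # estimated P(Y=1 | x)
print(f"z = {z:.2f}, P(Y=1|x) = {p:.3f}")
```

A negative `z` gives a probability below 0.5 and a positive `z` gives one above it, so the sign of the linear combination decides the predicted class at the usual 0.5 threshold.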

### Why Logistic Regression for SNP Analysis?
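- Extends to multiple classes (different cancer types) through the multinomial (softmax) formulation
- Provides a probability for each class, not just a label
- Coefficients give an interpretable measure of feature importance
- Works with SNP genotypes once they are encoded numerically (here as 1, 2, 3)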

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Plot the sigmoid function
x = np.linspace(-10, 10, 100)
sigmoid = 1/(1 + np.exp(-x))

plt.figure(figsize=(10, 6))
plt.plot(x, sigmoid)
plt.title('Sigmoid Function: σ(z) = 1/(1 + e^(-z))')
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.grid(True)
plt.show()

## 2. Data Loading and Exploration

### Understanding SNP Data
SNP values represent genotypes:
- 1: Homozygous reference (AA)
- 2: Heterozygous (AB)
- 3: Homozygous alternate (BB)
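
The sketch below shows two ways of looking at this coding on a toy table. The SNP names `snp_a` and `snp_b` and the genotype values are invented for illustration. The one-hot step is an alternative encoding that treats the three genotypes as unordered categories; it is not what the rest of this notebook does, which feeds the 1/2/3 codes to the model as numbers.

```python
import pandas as pd

# Toy genotype table using the 1/2/3 coding above (SNP names are made up)
toy = pd.DataFrame({"snp_a": [1, 2, 3, 2], "snp_b": [3, 3, 1, 2]})

# Human-readable view of the codes
genotype_labels = {1: "AA", 2: "AB", 3: "BB"}
readable = toy.apply(lambda col: col.map(genotype_labels))

# Alternative encoding: one indicator column per (SNP, genotype) pair
one_hot = pd.get_dummies(toy, columns=list(toy.columns), dtype=int)

print(readable)
print(one_hot)
```

Keeping the numeric codes assumes each extra copy of the alternate allele shifts the score by the same amount, while one-hot encoding drops that assumption at the cost of three times as many columns.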

In [None]:
# Load the data
data = pd.read_csv('common_cancer.csv')

print("Dataset Information:")
print(f"Number of samples: {len(data)}")
print(f"Number of SNPs: {len(data.columns)-1}")

# Display first few rows
print("\nFirst few rows of the data:")
display(data.head())

# Plot distribution of cancer types
plt.figure(figsize=(12, 6))
cancer_counts = data.iloc[:, 0].value_counts()
sns.barplot(x=cancer_counts.index, y=cancer_counts.values)
plt.title('Distribution of Cancer Types')
plt.xlabel('Cancer Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 3. Data Preprocessing

### Feature Scaling
We standardize features using the formula:
$$X_{scaled} = \frac{X - \mu}{\sigma}$$

Where, for each feature:
- μ is its mean in the training set
- σ is its standard deviation in the training set
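
As a quick check of the formula, the sketch below standardizes a small made-up genotype matrix by hand and compares the result with `StandardScaler`. The values in `X_toy` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy genotype matrix: 4 samples x 2 SNPs (values are illustrative)
X_toy = np.array([[1, 3], [2, 3], [3, 1], [2, 2]], dtype=float)

mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)            # population standard deviation (ddof=0)
manual = (X_toy - mu) / sigma

sk = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, sk))       # manual result matches StandardScaler
print(manual.std(axis=0, ddof=1))    # sample std is slightly above 1
```

`StandardScaler` divides by the population standard deviation, whereas pandas `describe()` reports the sample standard deviation. On a small dataset the "std" printed after scaling can therefore be slightly greater than 1.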

In [None]:
# Split features and target
X = data.iloc[:, 1:]  # SNP features
y = data.iloc[:, 0]   # Cancer types

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42,
                                                    stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training set shape:", X_train_scaled.shape)
print("Testing set shape:", X_test_scaled.shape)

# Show feature statistics before and after scaling
print("\nFeature statistics before scaling:")
print(X_train.describe().loc[['mean', 'std']].round(2))

print("\nFeature statistics after scaling:")
print(pd.DataFrame(X_train_scaled, columns=X.columns).describe().loc[['mean', 'std']].round(2))

## 4. Model Training

### Multinomial Logistic Regression
For K classes, probability for each class k is calculated as:

$$P(Y=k \mid x) = \frac{e^{w_k^Tx + b_k}}{\sum_{j=1}^K e^{w_j^Tx + b_j}}$$

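The model minimizes the cross-entropy loss over the n training samples (scikit-learn also adds an L2 penalty by default):
$$L = -\sum_{i=1}^n \sum_{k=1}^K y_{ik}\log(p_{ik})$$

where $y_{ik}$ is 1 if sample i belongs to class k and 0 otherwise, and $p_{ik}$ is the predicted probability that sample i belongs to class k.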
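
The softmax probabilities and the cross-entropy loss can be worked through by hand on a toy example. In this sketch the class scores and labels are hypothetical, and the regularization term that scikit-learn adds is left out.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical class scores for 2 samples and 3 classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
probs = softmax(scores)

# One-hot true labels: sample 0 is class 0, sample 1 is class 1
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])

# Cross-entropy: only the log-probability of the true class contributes
loss = -np.sum(y_onehot * np.log(probs))

print(probs.round(3))
print(f"cross-entropy loss = {loss:.3f}")
```

Because each row of `y_onehot` has a single 1, the double sum reduces to adding up the negative log-probability the model assigns to the correct class of each sample.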

In [None]:
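# Create and train the model.
# With the lbfgs solver, scikit-learn fits a multinomial (softmax) model for
# multiclass targets by default. The explicit multi_class argument is
# deprecated in recent releases, so it is omitted here.
model = LogisticRegression(max_iter=1000,
                           solver='lbfgs')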

print("Training the model...")
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_prob = model.predict_proba(X_test_scaled)

print("\nModel trained successfully!")

## 5. Model Evaluation

### Key Metrics

1. **Accuracy**:
   $$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

2. **Precision** (for each class):
   $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$$

3. **Recall** (for each class):
   $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$
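
These definitions can be read directly off a confusion matrix. The sketch below uses a small set of made-up labels (`"A"`, `"B"`, `"C"`) and compares the hand-computed values with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels for three classes
y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_hat = ["A", "A", "B", "B", "C", "C", "C", "A"]
labels = ["A", "B", "C"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)

tp = np.diag(cm)                  # true positives per class
precision = tp / cm.sum(axis=0)   # column sums = TP + FP
recall = tp / cm.sum(axis=1)      # row sums = TP + FN
accuracy = tp.sum() / cm.sum()

print(cm)
print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print("accuracy: ", accuracy)
```

Precision divides by column totals (everything predicted as the class) and recall divides by row totals (everything that truly is the class), which is why the two can differ for the same class.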

In [None]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# Create confusion matrix (fix the label order so rows and columns match the axis ticks)
conf_matrix = confusion_matrix(y_test, y_pred, labels=model.classes_)

# Plot confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(conf_matrix, 
            annot=True, 
            fmt='d', 
            cmap='Blues',
            xticklabels=model.classes_,
            yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Cancer Type')
plt.ylabel('Actual Cancer Type')
plt.tight_layout()
plt.show()

## 6. Feature Importance Analysis

### Mathematical Interpretation
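For each feature i, importance is the mean absolute coefficient across classes. Because the features were standardized, coefficient magnitudes are comparable across SNPs: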
$$\text{Importance}_i = \frac{1}{K}\sum_{k=1}^K |w_{ik}|$$

Where:
- K is the number of classes
- $w_{ik}$ is the coefficient for feature i in class k
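
The formula can be applied by hand to a small coefficient matrix. In this sketch `coef` is a hypothetical stand-in for `model.coef_`, laid out the same way with one row per class and one column per feature.

```python
import numpy as np

# Hypothetical coefficients, shape (K classes, n features), like model.coef_
coef = np.array([[ 0.8, -0.1,  0.0],
                 [-0.5,  0.2,  0.1],
                 [-0.3, -0.1, -0.1]])

# Mean absolute coefficient per feature, averaged over the K classes
importance = np.abs(coef).mean(axis=0)

# Feature indices ordered from most to least important
ranking = np.argsort(importance)[::-1]

print(importance.round(3))
print(ranking)
```

Taking absolute values first keeps large positive and large negative coefficients from cancelling, so a SNP that pushes strongly toward one cancer type and away from another still ranks as important.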

In [None]:
# Calculate feature importance
feature_importance = pd.DataFrame({
    'SNP': X.columns,
    'Importance': np.abs(model.coef_).mean(axis=0)
})

# Sort by importance
feature_importance = feature_importance.sort_values('Importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance.head(20), x='Importance', y='SNP')
plt.title('Top 20 Most Important SNP Markers')
plt.xlabel('Average Absolute Coefficient Value')
plt.ylabel('SNP Marker')
plt.tight_layout()
plt.show()

# Print top 10 SNPs
print("\nTop 10 Most Important SNP Markers:")
print(feature_importance.head(10))