<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

# Week 9 | Homework: Classification | Logistic Regression

**Clemson University** <br>
**Instructor(s):** Tim Ransom <br>

------------------------------------------------------------------------
## Learning objectives

- Differentiate between supervised and unsupervised classification methods.
- Implement a logistic regression model for classification.
- Evaluate the performance of a classification model using metrics like accuracy.
- Compare the performance of different classification algorithms.
- Preprocess data for classification tasks, including normalization.


### INSTRUCTIONS

-   To submit your assignment, follow the instructions provided by Coursera Labs.
-   Restart the kernel and run the whole notebook again before you
    submit.
-   As much as possible, stick to the hints and the functions we
    import at the top of the homework, as those are the ideas and tools
    the class supports and aims to teach. If a problem
    specifies a particular library, you are required to use that library,
    and possibly others from the import list.

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LassoCV

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

import matplotlib
import matplotlib.pyplot as plt
from matplotcheck.base import PlotTester
%matplotlib inline

import seaborn as sns
sns.set()

from scipy.stats import ttest_ind


<div class='theme'> Cancer Classification from Gene Expressions </div>

- In this problem, we will build a classification model to distinguish between two related classes of cancer, `acute lymphoblastic leukemia (ALL)` and `acute myeloid leukemia (AML)`, using gene expression measurements. 
- The dataset is provided in the file `hw6_enhance.csv`.
- Each row in this file corresponds to a tumor tissue sample from a patient with one of the two forms of leukemia.
    - The first column contains the cancer type, with **0 indicating the ALL** class and **1 indicating the AML** class. 
    - Columns 2-7130 contain expression levels of 7129 genes recorded from each tissue sample.

- In the following questions, we will use linear and logistic regression to build classification models for this data set.

<div class='exercise'><b> Exercise 1</b></div>

## Data Exploration

1. Load the dataset `data/hw6_enhance.csv` into a pandas DataFrame named `df`. (The provided CSV file is located at the `/data/hw6_enhance.csv` path.)
2. Split the observations into an approximate **80-20 train-test split**:
   - Store the feature matrix in `X` (excluding the `Unnamed: 0` index column and `Cancer_type`).
   - Store the target variable (`Cancer_type`) in `y`.
3. Use `train_test_split()` from `sklearn.model_selection` to split the data into `X_train`, `X_test`, `y_train`, and `y_test` with:
   - `test_size=0.2`
   - `random_state=72` (for reproducibility)
   - `stratify=y` (to maintain the class distribution)
4. Print the dataset shape **before** and **after** splitting.
5. Verify class distribution in both training and test sets.
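
The effect of `stratify` in the steps above can be seen on a small made-up table. This is only a sketch: the DataFrame `toy`, its column names `g1` to `g4`, and the 30/20 class counts are assumptions chosen for illustration, not the homework data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the homework data: 50 samples, 4 made-up "genes".
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["g1", "g2", "g3", "g4"])
toy.insert(0, "Cancer_type", [0] * 30 + [1] * 20)  # 60% class 0, 40% class 1

X_toy = toy.drop(columns=["Cancer_type"])
y_toy = toy["Cancer_type"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=72, stratify=y_toy
)

print("Before split:", X_toy.shape)
print("After split:", X_tr.shape, X_te.shape)
# Class proportions per split; stratify keeps them close to the full data.
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Because the toy classes are split 60/40, both the 40-row training set and the 10-row test set keep that same ratio.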


In [None]:
"""Write your code for exercise-1 here:"""

# your code here
raise NotImplementedError

### Normalization

- Let's take a peek at your training set (using the `describe()` method): you should notice the severe differences in the measurements from one gene to the next (some are negative, some hover around zero, and some are well into the thousands).

In [None]:
# original data statistics before normalization
print("Original Training Set Description:")
print(X_train.describe())

In [None]:
print("Checking for missing values in X_train before normalization:")
print(X_train.isnull().sum().sum())

# Check for non-numeric data (shouldn't happen if all columns are numeric)
print("Checking for non-numeric columns in X_train:")
print(X_train.dtypes)

print("Min and Max values in X_train before scaling:")
print(f"Min: {X_train.min().min()}, Max: {X_train.max().max()}")

In [None]:
# original data statistics before normalization
print("Original Test Set Description:")
print(X_test.describe())

In [None]:
print("Checking for missing values in X_test before normalization:")
print(X_test.isnull().sum().sum())

# Check for non-numeric data (shouldn't happen if all columns are numeric)
print("Checking for non-numeric columns in X_test:")
print(X_test.dtypes)

print("Min and Max values in X_test before scaling:")
print(f"Min: {X_test.min().min()}, Max: {X_test.max().max()}")

<div class='exercise'><b> Exercise 2</b></div>

## Normalization

In this exercise, you will normalize the dataset to ensure that all predictors vary between **0 and 1**. This helps in handling differences in scale and variability across different gene expression levels.

1. Use `MinMaxScaler()` from `sklearn.preprocessing` to scale **X_train** and **X_test**.
2. Fit the scaler on the **training data** and transform both **X_train** and **X_test**.
3. Store the transformed data back into the same DataFrames, keeping the original structure.
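4. Confirm that all training feature values now lie between **0 and 1**. Test values can fall slightly outside this range, because the scaler is fit on the training data only.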

**Note:**  
For the remainder of this homework, you **must** use these normalized values instead of the original raw values.
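
A minimal sketch of the fit-on-train, transform-both pattern follows. The two tiny DataFrames `train_toy` and `test_toy` and their values are hypothetical; they are chosen so the arithmetic is easy to check by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with a non-default index, to show the structure is kept.
train_toy = pd.DataFrame({"g1": [-200.0, 0.0, 800.0], "g2": [10.0, 20.0, 30.0]},
                         index=[7, 3, 9])
test_toy = pd.DataFrame({"g1": [300.0, 1000.0], "g2": [5.0, 25.0]}, index=[1, 4])

scaler = MinMaxScaler()
scaler.fit(train_toy)  # learn per-column min and max from the training rows only

# transform() returns a NumPy array, so rebuild DataFrames with the original
# column names and row index.
train_scaled = pd.DataFrame(scaler.transform(train_toy),
                            columns=train_toy.columns, index=train_toy.index)
test_scaled = pd.DataFrame(scaler.transform(test_toy),
                           columns=test_toy.columns, index=test_toy.index)

print(train_scaled)
print(test_scaled)
```

In this toy case the training columns span exactly 0 to 1, while the test rows contain values such as 1.2 and -0.25, since the test data never influenced the fitted minimum and maximum.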

---

In [None]:
"""Write your code for exercise-2 here:"""

# your code here
raise NotImplementedError

In [None]:
# normalized data statistics
print("\nNormalized Training Set (first 5 rows):")
print(X_train.head())

<div class="theme"> Question 1:</div>

The training set contains more predictors than observations. What problem(s) can this lead to when fitting a classification model? (Select the most appropriate answer)

1. The model will generalize well to new data because it has a large number of predictors.
2. The model will perform faster because more features improve efficiency.
3. The model may overfit due to the high dimensionality, leading to poor generalization.
4. Having more predictors than observations is beneficial and does not cause any issues.

**Store your answer in an integer variable named `answer` in the code cell below.**

In [None]:
# your code here
raise NotImplementedError

### Next we want to determine which 10 genes individually discriminate between the two cancer classes the best (consider every gene in the dataset). Code has been provided to do this for you. Make sure you understand what the code is doing. Note that it makes use of [t-testing](https://en.wikipedia.org/wiki/Welch%27s_t-test).
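
As a sanity check on the statistic itself, the sketch below computes the absolute Welch t-statistic by hand for one made-up feature and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. The arrays `group_0` and `group_1` are synthetic assumptions standing in for one gene's values in each class.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_0 = rng.normal(loc=0.3, scale=0.10, size=25)  # synthetic class-0 values
group_1 = rng.normal(loc=0.6, scale=0.20, size=15)  # synthetic class-1 values

# Welch's t-statistic: difference in means over the unpooled standard error.
# ddof=1 gives the sample variance, matching the default of pandas' .std().
se = np.sqrt(group_0.var(ddof=1) / group_0.size + group_1.var(ddof=1) / group_1.size)
t_manual = np.abs(group_0.mean() - group_1.mean()) / se

# SciPy's Welch test (equal_var=False) should give the same magnitude.
t_scipy = ttest_ind(group_0, group_1, equal_var=False).statistic
print(t_manual, abs(t_scipy))
```

A larger absolute value means the class means are further apart relative to the within-class spread, which is why the genes are ranked by it.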

In [None]:
print(X_train.columns)
# Drop the column 'Unnamed: 0' if it exists in the DataFrame
X_train = X_train.loc[:, ~X_train.columns.str.contains('^Unnamed')]
X_test = X_test.loc[:, ~X_test.columns.str.contains('^Unnamed')]

In [None]:
# Drop the 'Unnamed: 0' column from df and X_train, X_test 
df = df.drop(columns=['Unnamed: 0'], errors='ignore')
X_train = X_train.drop(columns=['Unnamed: 0'], errors='ignore')
X_test = X_test.drop(columns=['Unnamed: 0'], errors='ignore')

### The code below uses t-values to determine which genes discriminate between the two cancer classes the best.

In [None]:
"""
This code uses t-values to determine which genes discriminate between the two
cancer classes the best. 
"""
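# All gene columns (every column of df except the response)
predictors = df.columns
predictors = predictors.drop('Cancer_type')
print(predictors.shape)

# Per-class mean, standard deviation, and sample count in the training set
means_0 = X_train[y_train==0][predictors].mean()
means_1 = X_train[y_train==1][predictors].mean()
stds_0 = X_train[y_train==0][predictors].std()
stds_1 = X_train[y_train==1][predictors].std()
n1 = X_train[y_train==0].shape[0]
n2 = X_train[y_train==1].shape[0]

# Absolute Welch t-statistic for each gene
t_tests = np.abs(means_0-means_1)/np.sqrt( stds_0**2/n1 + stds_1**2/n2)

# Rank genes from the largest to the smallest t-statistic
best_preds_idx = np.argsort(-t_tests.values)
best_preds = t_tests.index[best_preds_idx]

# Top 10 genes; .iloc makes the positional lookup explicit
print(t_tests.iloc[best_preds_idx[0:10]])
print(t_tests.index[best_preds_idx[0:10]])

# The single most discriminative gene
best_pred = t_tests.index[best_preds_idx[0]]
print(best_pred)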


<div class='exercise'><b> Exercise 3</b></div>

## Visualizing the Best Gene for Cancer Classification 

In this exercise, you will create a histogram to visualize how the most discriminative gene (identified using the t-test in the code cell above) varies between the two cancer types.

### **Instructions**
1. Write a function `plot_histograms()` that takes the following parameters:
   - `best_pred`: The gene name identified as the best predictor.
   - `X`: The feature matrix (gene expression data).
   - `y`: The target variable (cancer type labels).
   - `dataset_name`: A string indicating whether the data is from the **training** or **test** set.
   
2. The function should:
   - Plot histograms for the gene expression levels for **Cancer Type 0** and **Cancer Type 1**.
   - Use **different colors** to differentiate between the two cancer types.
   - Include a **title** to indicate which dataset is being visualized.
   - Label the **x-axis (gene expression levels)** and **y-axis (frequency count)**.
   - Add a **legend** for clarity.


**Example code:**
```python
    def plot_histograms(best_pred, X, y, dataset_name):
        ....
        ....
```
**Code usage:**
```python
# Plot histograms for the training set
plot_histograms(best_pred, X_train, y_train, 'Training')

# Plot histograms for the test set
plot_histograms(best_pred, X_test, y_test, 'Testing')
```
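
One possible shape for such a plot is sketched below on synthetic data. The helper name `sketch_class_histograms`, the column `gene_A`, the bin count, and the colors are all illustrative assumptions rather than the required solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sketch_class_histograms(feature, X, y, dataset_name):
    """Overlay one histogram per class for a single column of X."""
    fig, ax = plt.subplots(figsize=(7, 4))
    for label, color in [(0, "tab:blue"), (1, "tab:orange")]:
        # Boolean mask selects the rows belonging to one class.
        ax.hist(X.loc[y == label, feature], bins=10, alpha=0.6,
                color=color, label=f"Cancer Type {label}")
    ax.set_title(f"{feature} by class ({dataset_name} set)")
    ax.set_xlabel("Normalized gene expression")
    ax.set_ylabel("Frequency")
    ax.legend()
    return ax

# Synthetic stand-in: class 1 tends to have larger values of "gene_A".
rng = np.random.default_rng(2)
y_hist = pd.Series([0] * 30 + [1] * 20)
X_hist = pd.DataFrame({"gene_A": np.concatenate(
    [rng.normal(0.3, 0.1, 30), rng.normal(0.7, 0.1, 20)])})

ax_hist = sketch_class_histograms("gene_A", X_hist, y_hist, "Training")
```

Using `alpha` below 1 keeps the overlapping region visible, which is where a manual threshold would have to trade off the two kinds of errors.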

In [None]:
"""Write your code for exercise-3 here:"""

# your code here
raise NotImplementedError

<div class='exercise'><b> Exercise 4</b></div>

## Creating a Manual Classification Model  

In this exercise, you will create a simple **manual classification model** using the best gene visualized in Exercise 3. Rather than using a machine learning algorithm, you will classify cancer types based on a **manually chosen threshold** from the histogram.

1. **Choose a threshold value** based on the histogram from Exercise 3.  
   - Manually **eyeball** the distribution and choose a threshold value that best **separates the two cancer types** (**ALL vs. AML**).
   - Assign the threshold value to a variable named **`threshold`**.
   
2. **Implement a classification rule** using the threshold:
    - If the **gene expression value** is **greater** than the threshold, classify it as **1 (AML)**.
    - Otherwise, classify it as **0 (ALL)**.
    - Store the predicted values in a variable **`y_pred_test`**.

3. **Evaluate your model**:
   - Compute the **accuracy** on the test set and store it in a variable named `accuracy`.
   - Print the chosen threshold and the computed accuracy.
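
The rule above amounts to one vectorized comparison. Here is a minimal sketch on six made-up values; `gene_values`, `true_labels`, and the cutoff `demo_threshold = 0.5` are hypothetical and should not be read as a suggested threshold for the real gene.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical normalized expression values and their true classes.
gene_values = np.array([0.12, 0.35, 0.48, 0.52, 0.77, 0.91])
true_labels = np.array([0, 0, 1, 0, 1, 1])

demo_threshold = 0.5  # illustrative only; pick yours from your own histogram

# Greater than the threshold -> 1 (AML), otherwise -> 0 (ALL).
demo_pred = (gene_values > demo_threshold).astype(int)
demo_accuracy = accuracy_score(true_labels, demo_pred)

print("Threshold:", demo_threshold)
print("Predictions:", demo_pred)
print("Accuracy:", demo_accuracy)
```

Four of the six toy samples land on the correct side of the cutoff, so the accuracy here is 4/6.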

-------

In [None]:
"""Write your code for exercise-4 here:"""

# your code here
raise NotImplementedError

- **In class, we discussed how to use both `linear regression` and `logistic regression` for classification.**
- **Now we will explore these two models by working with the single gene that you identified above as being the best predictor.**

<div class='exercise'><b> Exercise 5</b></div>

## Linear and Logistic Regression for Cancer Classification

In this exercise, you will fit a **simple linear regression model** using the single **best gene predictor** identified previously. You will analyze whether **linear regression** is suitable for classifying AML vs ALL.

#### **Linear Regression Model**
1. Fit a **simple linear regression model** to predict cancer type (AML vs ALL) based on the best gene predictor.
   - Use the **normalized values** of the best predictor.
   - Create an instance of `LinearRegression()` named **`linear_model`**.
   - Train this model using **X_train_best** (the `best_pred` column of `X_train`, kept two-dimensional because scikit-learn expects a 2D feature matrix) and **y_train**.

2. Predict the **cancer type** (`y_train_pred`) using the linear model.

3. **Plot the results**:
   - **Scatter plot**: True binary labels (0 for ALL, 1 for AML).
   - **Line plot**: Predicted values from the linear regression model.
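
To see why linear regression is questioned as a classifier, the sketch below fits it to a binary response on eight made-up points. The arrays `x_lin` and `y_lin` are assumptions for illustration; the point is only that the fitted line is not confined to the interval from 0 to 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One hypothetical normalized predictor and a 0/1 response.
x_lin = np.array([0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]).reshape(-1, 1)
y_lin = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin_demo = LinearRegression().fit(x_lin, y_lin)
lin_pred = lin_demo.predict(x_lin)

print("Slope:", lin_demo.coef_[0], "Intercept:", lin_demo.intercept_)
print("Smallest and largest prediction:", lin_pred.min(), lin_pred.max())
```

On this toy data the predictions at the two ends of the range fall below 0 and above 1, so they cannot be read directly as probabilities.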


In [None]:
"""Write your code for exercise-5 here:"""

# your code here
raise NotImplementedError

<div class='exercise'><b> Exercise 6: Linear and Logistic Regression  </b></div>

## Linear Regression as a Classifier  

In this exercise, you will use your **trained linear regression model** from Exercise 5 to classify cancer types into **0 (ALL)** and **1 (AML)**. Since linear regression produces **continuous values**, you will treat each prediction as a rough class-1 probability and apply the **Bayes classifier** decision rule, which uses a **threshold of 0.5**.
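
For reference, the Bayes classifier for a binary response assigns each observation to the more probable class given its predictors:

$$
\hat{y}(x) =
\begin{cases}
1 & \text{if } P(Y = 1 \mid X = x) > 0.5 \\
0 & \text{otherwise}
\end{cases}
$$

The true conditional probability is unknown, so this exercise substitutes the linear regression output for $P(Y = 1 \mid X = x)$. That substitution is an approximation, since the regression output is not guaranteed to lie between 0 and 1.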

1. Apply the classification rule:
   - If the **predicted value** is **greater than 0.5**, classify the sample as **1 (AML)**.
   - Otherwise, classify the sample as **0 (ALL)**.

2. Compute the classification accuracy:
   - **Training set:**
     - Convert the predicted values `y_train_pred` into binary classifications.
     - Store the classified predictions in a variable named **`y_train_classified`**.
     - Compute the **training accuracy** and store it in **`train_accuracy`**.
   - **Test set:**
     - Predict values for **`X_test_best`** using the trained linear regression model.
     - Convert predictions into binary classifications and store them in **`y_test_classified`**.
     - Compute the **test accuracy** and store it in **`test_accuracy`**.

3. Print the accuracy values for both sets.

---

In [None]:
"""Write your code for exercise-6 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 7</b> </div>

## Logistic Regression for Cancer Classification  

In this exercise, you will fit a **simple logistic regression model** to classify AML vs ALL based on the best predictor gene identified earlier. Unlike linear regression, logistic regression is a more appropriate method for **binary classification**.

1. **Create a logistic regression model**:
   - Use `LogisticRegression()` from `sklearn.linear_model`.
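   - Set **C=10000** to make regularization negligible (in scikit-learn, `C` is the inverse of the regularization strength, so a large value means a weak penalty).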
   - Set **random_state=4300** for reproducibility.
   - Name the model **`logistic_model`**.

2. **Fit the model** using:
   - **X_train_best** (the single best predictor).
   - **y_train** (cancer type labels).

3. **Make predictions**:
   - Predict **training set labels** and store in `y_train_pred_logistic`.
   - Predict **test set labels** and store in `y_test_pred_logistic`.

4. **Compute accuracy**:
   - Compute **training accuracy** and store in `train_accuracy_logistic`.
   - Compute **test accuracy** and store in `test_accuracy_logistic`.

---
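
The difference between predicted labels and predicted probabilities can be sketched on a small made-up example. The arrays `x_one` and `y_one` are hypothetical, with the two classes deliberately overlapping so that the fit stays finite.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One hypothetical predictor; classes overlap around 0.40 to 0.45.
x_one = np.array([0.05, 0.15, 0.25, 0.45, 0.40, 0.60, 0.75, 0.95]).reshape(-1, 1)
y_one = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Large C means very weak regularization.
log_demo = LogisticRegression(C=10000, random_state=4300).fit(x_one, y_one)

labels_demo = log_demo.predict(x_one)              # hard 0/1 labels
probs_demo = log_demo.predict_proba(x_one)[:, 1]   # estimated P(class 1)

print(labels_demo)
print(probs_demo.round(3))
```

`predict` returns class labels, while the second column of `predict_proba` holds the estimated probability of class 1, which always lies between 0 and 1. For a binary problem the predicted label is 1 when that probability exceeds 0.5.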


In [None]:
"""Write your code for exercise-7 here:"""

# your code here
raise NotImplementedError

<div class="theme"> Question 2:</div>

How does the classification accuracy of **logistic regression** compare to **linear regression** in this problem? (Select the most appropriate answer)

1. Logistic regression performs **worse** than linear regression because it overfits the training data.
2. Logistic regression performs **similarly** to linear regression because both methods are linear models.
3. Logistic regression performs **better** than linear regression because it is specifically designed for binary classification.
4. Linear regression performs **better** than logistic regression because it can handle continuous predictions.

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

<div class='exercise'><b> Exercise 8: Linear and Logistic Regression  </b></div>

## Comparing Linear and Logistic Regression Predictions

In this exercise, you will visualize and compare the predictions from **linear regression** and **logistic regression** to evaluate their suitability for binary classification.

1. **Create two subplots** (side by side) for:
   - **Training data**
   - **Test data**

2. **Each plot should contain the following:**
   - **Linear regression predictions** (continuous values).
   - **Logistic regression predicted probabilities**.
   - **True binary response** (actual labels).
   - **A horizontal line at y=0.5** (classification threshold).

3. **Customize the plot:**
   - Use `plt.subplots(1, 2, figsize=(12, 6))` to create the figure.
   - Set appropriate **titles, labels, and legends**.
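
A rough sketch of the figure layout follows, using synthetic one-feature data. The helper `make_split`, the dictionary `splits`, and the models `lin_cmp` and `logit_cmp` are all hypothetical names; the models are drawn on an evenly spaced grid so their curves appear smooth.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)

def make_split(n):
    """Hypothetical one-feature data: class 1 tends to have larger x."""
    y = rng.integers(0, 2, size=n)
    x = np.clip(0.35 + 0.3 * y + rng.normal(0, 0.15, size=n), 0, 1)
    return x.reshape(-1, 1), y

splits = {"Training": make_split(60), "Test": make_split(20)}

# Both models are fit on the training split only.
lin_cmp = LinearRegression().fit(*splits["Training"])
logit_cmp = LogisticRegression(C=10000).fit(*splits["Training"])

grid = np.linspace(0, 1, 200).reshape(-1, 1)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for ax, (name, (x, y)) in zip(axes, splits.items()):
    ax.scatter(x[:, 0], y, alpha=0.5, label="True label")
    ax.plot(grid[:, 0], lin_cmp.predict(grid), label="Linear regression")
    ax.plot(grid[:, 0], logit_cmp.predict_proba(grid)[:, 1],
            label="Logistic P(class 1)")
    ax.axhline(0.5, color="gray", linestyle="--", label="y = 0.5")
    ax.set(title=f"{name} data", xlabel="Normalized gene expression",
           ylabel="Response")
    ax.legend()
```

Plotting both curves against the same 0.5 line makes it easy to compare where each model switches its predicted class.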

---

In [None]:
"""Write your code for exercise-8 here:"""

# your code here
raise NotImplementedError

<div class="theme"> Question 3:</div>

Based on the plots comparing **linear regression** and **logistic regression**, which model is better suited for binary classification? (Select the most appropriate answer)

1. **Linear regression** is better because it provides continuous predictions that can be thresholded.
2. **Logistic regression** is better because it models probabilities and naturally handles binary classification.
3. **Both models are equally effective** for classification tasks.
4. **Linear regression is better than logistic regression** for probability estimation.

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

<div class='exercise'> <b> Exercise 9: Multiple Logistic Regression </b> </div>

## Multiple Logistic Regression

In this exercise, you will fit a **multiple logistic regression model** using **all gene predictors** in the dataset. Unlike the previous exercises, where classification was based on a single gene, this model will use all available gene expressions as features.

1. **Fit a logistic regression model** using all genes:
   - Use `LogisticRegression()` from `sklearn.linear_model`.
   - Set **C=10000**, **max_iter=10000**, and **random_state=4300**.
   - Name the model **`multi_logistic_model`**.

2. **Make predictions**:
   - Predict **training set labels** and store in `y_train_pred_multi`.
   - Predict **test set labels** and store in `y_test_pred_multi`.

3. **Compute accuracy**:
   - Store **training accuracy** in `train_accuracy_multi`.
   - Store **test accuracy** in `test_accuracy_multi`.
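
What can happen when predictors far outnumber samples is sketched below on synthetic data. Everything here is an assumption for illustration: 60 samples, 500 random features, and a response that depends on the first feature only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n_samples, n_features = 60, 500   # far more predictors than samples
X_wide = rng.normal(size=(n_samples, n_features))

# Only the first feature carries signal; the other 499 are pure noise.
y_wide = (X_wide[:, 0] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

Xw_tr, Xw_te, yw_tr, yw_te = train_test_split(
    X_wide, y_wide, test_size=0.25, random_state=0, stratify=y_wide
)

# Nearly unregularized fit on all features.
wide_model = LogisticRegression(C=10000, max_iter=10000, random_state=4300)
wide_model.fit(Xw_tr, yw_tr)

wide_train_acc = wide_model.score(Xw_tr, yw_tr)
wide_test_acc = wide_model.score(Xw_te, yw_te)
print("Train accuracy:", wide_train_acc)
print("Test accuracy:", wide_test_acc)
```

With this many free coefficients a nearly unregularized model can usually separate the training points almost perfectly, so the training accuracy by itself says little about performance on new samples.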

---


In [None]:
"""Write your code for exercise-9 here:"""

# your code here
raise NotImplementedError

<div class="theme"> Question 4:</div>

How does **multiple logistic regression** compare to the **single-gene models** in terms of accuracy? (Select the most appropriate answer)

1. **Multiple logistic regression performs worse** because using all genes increases overfitting.
2. **Multiple logistic regression performs similarly** to single-gene models because adding more predictors does not always improve classification.
3. **Multiple logistic regression performs better** because it considers multiple genes, leading to more informed predictions.
4. **Single-gene logistic regression is superior** because it is simpler and avoids overfitting.

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

<div class="theme"> Question 5:</div>

Based on the classification accuracy observed in **Exercise 9**, how would you assess the generalization capacity of your trained multiple logistic regression model? (Select the most appropriate answer)

1. The model generalizes well if **training accuracy ≈ test accuracy**, meaning it is not overfitting.
2. The model overfits if **training accuracy >> test accuracy**, meaning it memorizes the training data but performs poorly on new data.
3. The model underfits if **training accuracy and test accuracy are both low**, meaning it fails to capture patterns in the data.
4. Generalization capacity cannot be assessed using accuracy alone; more evaluation metrics are needed.

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

<div class='exercise'> <b> Exercise 10 </b> </div>

## Regularization with L1 Penalty (LASSO)

In this exercise, you will apply **LASSO-like regularization** (L1 penalty) using **5-fold cross-validation** to train a **logistic regression model**. Regularization helps to improve **generalization** by reducing overfitting.

1. **Create and train a logistic regression model** with **L1 penalty (LASSO)**:
   - Use `LogisticRegressionCV()` from `sklearn.linear_model`.
   - Set **Cs = 10** (searches a grid of 10 values of `C`, the inverse regularization strength, spaced logarithmically between 1e-4 and 1e4).
   - Use **5-fold cross-validation (cv = 5)**.
   - Set **penalty = 'l1'** for LASSO.
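   - Use **solver = 'saga'** (one of the scikit-learn solvers that supports the L1 penalty).
   - Set **max_iter = 50** and **random_state = 4300**. With so few iterations, the solver may emit a `ConvergenceWarning`.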

2. **Make predictions**:
   - Predict **training set labels** and store in `y_train_pred_lasso`.
   - Predict **test set labels** and store in `y_test_pred_lasso`.

3. **Compute accuracy**:
   - Store **training accuracy** in `train_accuracy_lasso`.
   - Store **test accuracy** in `test_accuracy_lasso`.
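
The feature-selection effect of the L1 penalty can be sketched on synthetic data by counting the coefficients that remain non-zero after fitting. The data below are made up (40 features in the 0 to 1 range, of which only the first two drive the response), and `max_iter=5000` is an assumption that differs from the homework's required setting so that this toy fit has room to converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(5)
X_l1 = rng.uniform(size=(120, 40))   # synthetic features already in [0, 1]
# Only features 0 and 1 influence the response; the rest are noise.
y_l1 = (X_l1[:, 0] + X_l1[:, 1] + rng.normal(scale=0.2, size=120) > 1.0).astype(int)

l1_demo = LogisticRegressionCV(
    Cs=10, cv=5, penalty="l1", solver="saga",
    max_iter=5000,        # assumption: larger than the homework's 50
    random_state=4300,
).fit(X_l1, y_l1)

# Coefficients driven exactly to zero correspond to dropped features.
n_kept = int(np.count_nonzero(l1_demo.coef_))
print("Chosen C:", np.ravel(l1_demo.C_)[0])
print("Non-zero coefficients:", n_kept, "of", l1_demo.coef_.size)
print("Training accuracy:", l1_demo.score(X_l1, y_l1))
```

The number of surviving coefficients depends on the value of `C` chosen by cross-validation, so it is printed rather than assumed.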

---


In [None]:
"""Write your code for exercise-10 here:"""

# your code here
raise NotImplementedError

# **Conclusion: Model Comparison and Generalization**

In this homework, we implemented and evaluated different **Logistic Regression models** for classifying leukemia types based on gene expression data. Below is a summary of the results:

| Model | Training Accuracy | Test Accuracy | Analysis |
|--------|-----------------|--------------|-----------|
| **Logistic Regression (No Regularization)** | **0.7205** | **0.6821** | The gap between training and test accuracy is small, so the model is not overfitting, but its overall accuracy is the lowest of the three. With a single gene as its only predictor, the model is too simple to capture signal carried by the other genes. |
| **Multiple Logistic Regression (All Features)** | **1.0000** | **0.8278** | The model overfits the training data (100% accuracy), meaning it memorized the training set rather than generalizing well. The large gap between training and test accuracy suggests poor generalization. |
| **Lasso Logistic Regression (L1 Regularization)** | **0.9035** | **0.8808** | This model balances bias and variance, showing a high test accuracy with a small gap from the training accuracy. The use of **L1 regularization (Lasso)** reduces overfitting by shrinking many gene coefficients to exactly zero, keeping only the most relevant genes. |
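
The train-test gaps discussed in the table can be computed directly from the reported accuracies. The short dictionary keys below are abbreviations introduced here for illustration; the numbers are copied from the table.

```python
# (training accuracy, test accuracy) as reported in the summary table.
results = {
    "No regularization": (0.7205, 0.6821),
    "All features": (1.0000, 0.8278),
    "Lasso (L1)": (0.9035, 0.8808),
}

# Gap = training accuracy minus test accuracy, rounded to 4 decimals.
gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
# No regularization: gap = 0.0384
# All features: gap = 0.1722
# Lasso (L1): gap = 0.0227
```

The all-features model has by far the largest gap, and the Lasso model combines the smallest gap with the highest test accuracy.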

## **Final Conclusion**
The **Lasso Logistic Regression model** provides the best **generalization** across all models. 

- It achieves a **high test accuracy (88.08%)** while avoiding **overfitting**.
- **L1 regularization** helps **select important features** and **ignore noise** in the data.
- Unlike **Multiple Logistic Regression**, which memorizes the training set, the **Lasso model** learns a more **generalizable decision boundary**.

### **Key Takeaways**
✅ Regularization techniques (such as **L1 penalty**) help **prevent overfitting**.  
✅ A **high training accuracy with a large test accuracy gap** is a sign of **overfitting**.  
✅ **Feature selection** via Lasso improves **interpretability and generalization**.  
✅ **Cross-validation** and **regularization** should be part of every **classification pipeline**.

**Lasso Logistic Regression is the recommended model for this task!**


# END