<a href="https://colab.research.google.com/github/kankkw/229352-StatisticalLearning/blob/main/Lab01_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #2

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve

# For Fashion-MNIST
from tensorflow.keras.datasets import fashion_mnist

# For 20 Newsgroups
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## Part 1: Marketing Campaign Dataset - Manual Data Preprocessing & Logistic Regression

### Load the Marketing Campaign Dataset ([Data Information](https://archive.ics.uci.edu/dataset/222/bank+marketing))

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed (`'yes'`) or not (`'no'`).

In [None]:
bank_url = 'https://raw.githubusercontent.com/donlap/ds352-labs/main/bank.csv'

df = pd.read_csv(bank_url, sep=';', na_values=['unknown'])
df = df.drop(["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"], axis=1)
print("Shape of the dataset:", df.shape)
df.head()

### Data Exploration

In [None]:
print("--- Missing Values Count ---")
print(df.isnull().sum())

In [None]:
print("--- Unique Values for Categorical Columns ---")
for col in df.select_dtypes(include='object').columns:
    print(f"\n'{col}' unique values:")
    print(df[col].value_counts(dropna=False)) # Include NaN counts

### Data Preprocessing

In [None]:
# Map target variable 'y' to 0 (no) and 1 (yes)
df['y'] = df['y'].map({'no': 0, 'yes': 1})

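# Drop 'duration': the call length is only known after the call ends,
# when the outcome is already known, so using it would leak the target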
df = df.drop(columns=['duration'])

# Define features (X) and target (y)
X = df.drop(columns=['y'])
y = df['y']

# Split the data BEFORE any transformations
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Print data shape
print("X_train shape:", X_train.shape)
print("X_test shape :", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape :", y_test.shape)
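
As a small illustration of what `stratify=y` does in the split above, the sketch below uses made-up, imbalanced toy labels (`X_toy` and `y_toy` are hypothetical and not part of the lab data) and checks that both splits keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 negatives and 10 positives (illustrative only)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the 10% positive rate of the full data
print("Train positive rate:", y_tr.mean())
print("Test positive rate :", y_te.mean())
```

Without `stratify`, a random split of a small or imbalanced dataset can leave the test set with noticeably more or fewer positives than the training set.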

We will apply `StandardScaler()`, `OrdinalEncoder()`, and `OneHotEncoder()` to a few selected columns.

**1. Numerical Feature: `age` and `campaign` (Standard Scaling)**

In [None]:
num_cols_demo = ['age', 'campaign']

scaler = StandardScaler()

# Fit the scaler ONLY on the training data
scaler.fit(X_train[num_cols_demo])

# Transform both training and test data
X_train_scaled_demo = scaler.transform(X_train[num_cols_demo])
X_test_scaled_demo = scaler.transform(X_test[num_cols_demo])

print("Scaled training data (first 5 rows):")
print(X_train_scaled_demo[:5])
print("\nScaled test data (first 5 rows):")
print(X_test_scaled_demo[:5])
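
To make the "fit on train, transform both" pattern concrete, here is a minimal sketch on toy numbers (the arrays `train` and `test` are made up). It checks that `StandardScaler` applies the training mean and standard deviation to the test data rather than recomputing them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[20.0], [30.0], [40.0]])  # toy training column
test = np.array([[50.0]])                   # toy test column

toy_scaler = StandardScaler().fit(train)

# StandardScaler stores the training mean and the population (ddof=0) std
mu, sigma = train.mean(axis=0), train.std(axis=0)
manual_test = (test - mu) / sigma

print(toy_scaler.transform(test))
print(manual_test)
```

Both printed values should agree, which shows that the test set is scaled with statistics learned from the training set only.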

**2. Ordinal Feature: `education` (Ordinal Encoding with Imputation)**

- **Imputation**

In [None]:
ord_col_demo = ['education']

imputer_ord = SimpleImputer(strategy='most_frequent')

imputer_ord.fit(X_train[ord_col_demo])

X_train_imputed_ord_demo = imputer_ord.transform(X_train[ord_col_demo])
X_test_imputed_ord_demo = imputer_ord.transform(X_test[ord_col_demo])

print("Imputed 'education' (train) first 5 rows:")
print(X_train_imputed_ord_demo[:5])
print("\nImputed 'education' (test) first 5 rows:")
print(X_test_imputed_ord_demo[:5])

- **Ordinal Encoding**

In [None]:
education_categories = [
    'illiterate', 'basic.4y', 'basic.6y', 'basic.9y', 'high.school',
    'professional.course', 'university.degree', 'masters', 'doctorate'
]

In [None]:
ordinal_encoder = OrdinalEncoder(categories=[education_categories])

ordinal_encoder.fit(X_train_imputed_ord_demo)

X_train_encoded_ord_demo = ordinal_encoder.transform(X_train_imputed_ord_demo)
X_test_encoded_ord_demo = ordinal_encoder.transform(X_test_imputed_ord_demo)

print("Encoded 'education' (train) first 5 rows:")
print(X_train_encoded_ord_demo[:5])
print("\nEncoded 'education' (test) first 5 rows:")
print(X_test_encoded_ord_demo[:5])
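
The reason for passing an explicit `categories` list can be seen on a toy column. The sketch below uses hypothetical levels (`'low'`, `'medium'`, `'high'`) and compares the explicit ordering with the encoder's default, which sorts the categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = ['low', 'medium', 'high']  # hypothetical ordered levels
toy = np.array([['high'], ['low'], ['medium']], dtype=object)

# Explicit order: low -> 0, medium -> 1, high -> 2
enc_ordered = OrdinalEncoder(categories=[levels]).fit(toy)
# Default order is alphabetical: high -> 0, low -> 1, medium -> 2
enc_default = OrdinalEncoder().fit(toy)

print(enc_ordered.transform(toy).ravel())
print(enc_default.transform(toy).ravel())
```

With the default ordering, `'high'` would receive the smallest code, which would misrepresent the ranking to a linear model.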

**3. Nominal Feature: `job` (One-Hot Encoding with Imputation)**

- **Imputation**

In [None]:
nom_col_demo = ['job']

imputer_nom = SimpleImputer(strategy='most_frequent')
imputer_nom.fit(X_train[nom_col_demo])

X_train_imputed_nom_demo = imputer_nom.transform(X_train[nom_col_demo])
X_test_imputed_nom_demo = imputer_nom.transform(X_test[nom_col_demo])

- **Nominal Encoding**

In [None]:
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

onehot_encoder.fit(X_train_imputed_nom_demo)

X_train_onehot_nom_demo = onehot_encoder.transform(X_train_imputed_nom_demo)
X_test_onehot_nom_demo = onehot_encoder.transform(X_test_imputed_nom_demo)

print("One-hot encoded 'job' (train) shape:", X_train_onehot_nom_demo.shape)
print("One-hot encoded 'job' (test) shape :", X_test_onehot_nom_demo.shape)

### **Exercise 1: Apply All Preprocessing & Train Logistic Regression**

Now, it's your turn to apply these preprocessing steps to *all* relevant columns and then train a Logistic Regression model.

**Instructions:**

1.  Look at the Variable Table in [this link](https://archive.ics.uci.edu/dataset/222/bank+marketing).
2. Make lists for `numerical_features`, `ordinal_features`, and `nominal_features`.
3. Preprocess the features. It is safer to work on copies of `X_train` and `X_test`:
   ```
   X_train_copy = X_train.copy()
   X_test_copy = X_test.copy()
   ```
   and preprocess `X_train_copy` and `X_test_copy` instead of the originals.

   **For nominal features, concatenate the one-hot encoded columns to the dataframe using [`pd.concat(..., axis=1)`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and drop the original nominal columns.**
4. Train Logistic Regression on the preprocessed `X_train_copy` and `y_train`.
5. Evaluate the Model:
    *   Make predictions on the preprocessed `X_test_copy`.
    *   Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?


In [None]:
# 1. Define feature groups. The loaded CSV appears to follow the "bank-additional"
#    schema (it contained emp.var.rate, euribor3m, ...), which has 'day_of_week'
#    but no 'balance' or 'day' column.
numerical_features = ['age', 'campaign', 'pdays', 'previous']
ordinal_features = ['education']
nominal_features = ['job', 'marital', 'default', 'housing', 'loan',
                    'contact', 'month', 'day_of_week', 'poutcome']

# Make copies to keep original X_train, X_test safe
X_train_copy = X_train.copy()
X_test_copy = X_test.copy()

# 2. Numerical: impute (in case of missing) + scale
num_imputer = SimpleImputer(strategy='median')
X_train_num = num_imputer.fit_transform(X_train_copy[numerical_features])
X_test_num = num_imputer.transform(X_test_copy[numerical_features])

num_scaler = StandardScaler()
X_train_num_scaled = num_scaler.fit_transform(X_train_num)
X_test_num_scaled = num_scaler.transform(X_test_num)

# 3. Ordinal: impute + ordinal encode (education)
ord_imputer = SimpleImputer(strategy='most_frequent')
X_train_ord_imp = ord_imputer.fit_transform(X_train_copy[ordinal_features])
X_test_ord_imp = ord_imputer.transform(X_test_copy[ordinal_features])

ordinal_encoder_full = OrdinalEncoder(categories=[education_categories])
X_train_ord_enc = ordinal_encoder_full.fit_transform(X_train_ord_imp)
X_test_ord_enc = ordinal_encoder_full.transform(X_test_ord_imp)

# 4. Nominal: impute + one-hot encode
nom_imputer = SimpleImputer(strategy='most_frequent')
X_train_nom_imp = nom_imputer.fit_transform(X_train_copy[nominal_features])
X_test_nom_imp = nom_imputer.transform(X_test_copy[nominal_features])

onehot_encoder_full = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_nom_ohe = onehot_encoder_full.fit_transform(X_train_nom_imp)
X_test_nom_ohe = onehot_encoder_full.transform(X_test_nom_imp)

# 5. Concatenate all processed features
X_train_processed = np.hstack([X_train_num_scaled, X_train_ord_enc, X_train_nom_ohe])
X_test_processed = np.hstack([X_test_num_scaled, X_test_ord_enc, X_test_nom_ohe])

print("X_train_processed shape:", X_train_processed.shape)
print("X_test_processed shape :", X_test_processed.shape)

# 6. Train Logistic Regression
log_reg_bank = LogisticRegression(max_iter=1000, solver='lbfgs')
log_reg_bank.fit(X_train_processed, y_train)

# 7. Evaluate on test set
y_pred_bank = log_reg_bank.predict(X_test_processed)
y_proba_bank = log_reg_bank.predict_proba(X_test_processed)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred_bank))
print("ROC-AUC :", roc_auc_score(y_test, y_proba_bank))
print("\nClassification report:")
print(classification_report(y_test, y_pred_bank))

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_bank))
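
To read the averaged metrics programmatically rather than from the printed table, `classification_report` accepts `output_dict=True`. A minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are illustrative only):

```python
from sklearn.metrics import classification_report

y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# 'macro avg' is the unweighted mean over classes;
# 'weighted avg' weights each class by its number of true samples
print("Accuracy          :", report['accuracy'])
print("Macro precision   :", report['macro avg']['precision'])
print("Macro recall      :", report['macro avg']['recall'])
print("Macro f1-score    :", report['macro avg']['f1-score'])
print("Weighted precision:", report['weighted avg']['precision'])
```

On imbalanced data such as the bank dataset, the macro and weighted averages can differ substantially, so it helps to state which one is being reported.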

## Part 2: Fashion-MNIST Dataset - Image Classification

### Load Fashion-MNIST Dataset

The Fashion-MNIST dataset consists of 28x28 grayscale images of fashion items.

In [None]:
(fm_X_train, fm_y_train), (fm_X_test, fm_y_test) = fashion_mnist.load_data()

print(f"Fashion-MNIST Train data shape: {fm_X_train.shape}")
print(f"Fashion-MNIST Train labels shape: {fm_y_train.shape}")
print(f"Fashion-MNIST Test data shape: {fm_X_test.shape}")
print(f"Fashion-MNIST Test labels shape: {fm_y_test.shape}")

In [None]:
print(f"First image {fm_X_train[0]}")
print(f"First label {fm_y_train[0]}")

### Visualize Fashion-MNIST Images

Let's see what these images look like.

In [None]:
fashion_mnist_class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

# Visualize the images
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(fm_X_train[i], cmap='gray')
    plt.xlabel(fashion_mnist_class_names[fm_y_train[i]])
plt.tight_layout()
plt.show()

### **Exercise 2: Preprocessing Images (Flatten and Scale)**

Images are 2D arrays (matrices of pixels) and pixel values are integers from 0-255. For Logistic Regression, we need:
*  **Flattening:** Convert each 28x28 image into a 1D array of 784 features.
*  **Scaling:** Normalize pixel values from [0, 255] to [0, 1].
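
The two steps above can be sketched on a tiny stand-in array. Here `toy_images` holds two made-up 2x3 "images" instead of 28x28 ones, so the shapes are easy to follow.

```python
import numpy as np

# Two toy 2x3 "images" with uint8 pixel values (stand-ins for 28x28 images)
toy_images = np.array(
    [[[0, 51, 102], [153, 204, 255]],
     [[255, 255, 255], [0, 0, 0]]],
    dtype=np.uint8,
)

# Flatten: (2, 2, 3) -> (2, 6); -1 lets NumPy infer the second dimension
toy_flat = toy_images.reshape(toy_images.shape[0], -1)

# Scale: cast to float32 (keeps memory use modest) and map [0, 255] to [0, 1]
toy_scaled = toy_flat.astype('float32') / 255.0

print(toy_flat.shape)
print(toy_scaled[0])
```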

**Instructions:**

1.   **Flatten:** Use the `.reshape()` method (see [documentation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.reshape.html)). `fm_X_train` has shape `(num_samples, 28, 28)`; reshape it to `(num_samples, 28*28)`, and do the same for `fm_X_test`.
2.  **Scale:** Divide the flattened pixel values by 255.0 to get values between 0 and 1.
3.   **Train Logistic Regression:**
    *   Initialize `LogisticRegression(solver='saga')`. `saga` is a good solver when both number of samples and number of features are large.
    *   Fit the model on your *processed* `fm_X_train_scaled` and `fm_y_train`.
4.   **Make Predictions:** Use `predict()` to make predictions on the *processed* `fm_X_test_scaled`.
5.   **Print Classification Report:** Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?
6.   **Visualize Misclassifications:**
    *   Find the indices in `fm_X_test` where your model made incorrect predictions (i.e., `fm_y_pred != fm_y_test`).
    *   Select 5 of these misclassified images.
    *   Plot these images (using `plt.imshow`). For each image, print its true label and its predicted label.
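
For step 6, one way to locate the misclassified examples is a boolean comparison followed by `np.where`. The sketch below uses made-up label arrays (`toy_true` and `toy_pred`) rather than real model output.

```python
import numpy as np

toy_true = np.array([0, 1, 2, 2, 1])
toy_pred = np.array([0, 2, 2, 1, 1])

# Boolean mask of errors, then the positions where the mask is True
toy_mis_idx = np.where(toy_pred != toy_true)[0]

for idx in toy_mis_idx:
    print(f"index {idx}: true={toy_true[idx]}, predicted={toy_pred[idx]}")
```

The resulting indices can then be used to select the corresponding unflattened images for plotting.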

In [None]:
# 1. Flatten images: (n_samples, 28, 28) → (n_samples, 784)
n_train = fm_X_train.shape[0]
n_test = fm_X_test.shape[0]

fm_X_train_flat = fm_X_train.reshape(n_train, -1)
fm_X_test_flat = fm_X_test.reshape(n_test, -1)

# 2. Scale pixel values to [0, 1]
fm_X_train_scaled = fm_X_train_flat.astype('float32') / 255.0
fm_X_test_scaled = fm_X_test_flat.astype('float32') / 255.0

print("fm_X_train_scaled shape:", fm_X_train_scaled.shape)
print("fm_X_test_scaled shape :", fm_X_test_scaled.shape)

# 3. Train Logistic Regression (multiclass)
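# The multinomial (softmax) formulation is the default for multiclass targets with
# 'saga', so the deprecated `multi_class` argument is not needed.
log_reg_fm = LogisticRegression(
    solver='saga',
    max_iter=1000
)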
log_reg_fm.fit(fm_X_train_scaled, fm_y_train)

# 4. Predict on test set
fm_y_pred = log_reg_fm.predict(fm_X_test_scaled)

# 5. Evaluation
print("Classification report (Fashion-MNIST):")
print(classification_report(fm_y_test, fm_y_pred, target_names=fashion_mnist_class_names))

# 6. Visualize some misclassified images
mis_idx = np.where(fm_y_pred != fm_y_test)[0]
print("Number of misclassified examples:", len(mis_idx))

num_to_plot = 5
plt.figure(figsize=(3 * num_to_plot, 3))
for i in range(num_to_plot):
    idx = mis_idx[i]
    # One row with one panel per misclassified image
    plt.subplot(1, num_to_plot, i + 1)
    plt.imshow(fm_X_test[idx], cmap='gray')
    plt.xticks([])
    plt.yticks([])
    true_label = fashion_mnist_class_names[fm_y_test[idx]]
    pred_label = fashion_mnist_class_names[fm_y_pred[idx]]
    plt.title(f"True: {true_label}\nPred: {pred_label}")
plt.tight_layout()
plt.show()

## Part 3: 20 Newsgroups Dataset - Text Classification

### Load 20 Newsgroups Dataset

The 20 Newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics.

In [None]:
news_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
news_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

X_train_news, y_train_news = news_train.data, news_train.target
X_test_news, y_test_news = news_test.data, news_test.target

print(f"Number of training documents: {len(X_train_news)}")
print(f"Number of test documents: {len(X_test_news)}")
print(f"Categories: {news_train.target_names}")

### Explore Sample Document

In [None]:
# Print the first document and its class
print("First training document:\n")
print(X_train_news[0])

print("\nClass index:", y_train_news[0])
print("Class name :", news_train.target_names[y_train_news[0]])

### Preprocessing: Text Vectorization Demonstration with `TfidfVectorizer`

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Where:

$$
\text{TF}(t, d) = \frac{\text{number of occurrences of word }t\text{ in } d}{\text{total number of words in } d} \quad \text{ and } \quad
\text{IDF}(t, D) = \log\left(\frac{\text{total number of documents in } D}{\text{number of documents in } D \text{ that contain word }t}\right).
$$
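
The formulas above can be evaluated by hand on a tiny corpus. The sketch below is a plain-Python illustration, assuming the natural logarithm and a made-up, pre-tokenized corpus `docs`. The helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math

# Toy corpus: each document is already split into words
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

def tf(term, doc):
    # Share of the words in `doc` that are equal to `term`
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Assumes `term` occurs in at least one document of `corpus`
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("cat", docs[1], docs))     # (2/3) * ln(3/2)
print(tf_idf("barked", docs[2], docs))  # (1/2) * ln(3/1)
```

Note that `TfidfVectorizer` does not reproduce these numbers exactly: by default it uses raw counts for TF, a smoothed IDF, and L2-normalizes each document vector.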

In [None]:
sample_sentences = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the sample sentences
sample_vec_output_sparse = vectorizer.fit_transform(sample_sentences)

sample_vec_output_dense = sample_vec_output_sparse.toarray()

print(vectorizer.vocabulary_)
print(vectorizer.get_feature_names_out())
print(sample_vec_output_dense)
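
The weights printed by `TfidfVectorizer` come from a smoothed variant of the textbook formula. With the default settings (`smooth_idf=True`, `norm='l2'`), the IDF is computed as $\ln\frac{1+n}{1+\text{df}(t)} + 1$, multiplied by the raw term counts, and each row is then scaled to unit length. The sketch below checks this on a made-up corpus (`toy_corpus`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["apple banana", "apple cherry", "apple banana banana"]

# Raw term counts; columns follow the alphabetically sorted vocabulary
counts = CountVectorizer().fit_transform(toy_corpus).toarray()

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(toy_corpus).toarray()

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply counts by IDF, then L2-normalize each document (row)
manual = counts * manual_idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(tfidf.idf_, manual_idf))
print(np.allclose(weights, manual))
```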

### **Exercise 3: Apply TF-IDF Vectorization to Full Dataset**

Now, apply `TfidfVectorizer` to the actual training and testing datasets for the 20 Newsgroups classification task.

**Instructions:**

1.  **Initialize `TfidfVectorizer`:**
    *   Initialize `TfidfVectorizer`. Use `stop_words='english'` to remove common words.
2.  **Fit and Transform Training Data:**
    *   Call `fit_transform()` on `X_train_news` to learn the vocabulary and transform the training text into TF-IDF features. Store the result in `X_train_vec`.
3.  **Transform Test Data:**
    *   Call `transform()` on `X_test_news` using the *already fitted* vectorizer. Store the result in `X_test_vec`. **Crucially, do not call `fit_transform()` on the test data!** This would cause data leakage.
4.  **Initialize Logistic Regression:**
    *   Initialize `LogisticRegression(solver='saga')`. `saga` is a good solver when both number of samples and number of features are large.
5.  **Train the Model:**
    *   Fit the model on your `X_train_vec` and `y_train_news`.
6.  **Make Predictions:**
    *   Make predictions using `predict()` on the `X_test_vec`.
7.  **Evaluate the Model:**
    *   Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?
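
Step 3 can be illustrated on a toy corpus. In the sketch below (`toy_train` and `toy_test` are made up), the fitted vectorizer maps test documents into the training vocabulary, whereas a vectorizer refitted on the test set learns a different vocabulary whose columns would not match the trained model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["the engine stalled", "the goalie saved the puck"]
toy_test = ["the engine overheated"]

vec = TfidfVectorizer()
toy_train_vec = vec.fit_transform(toy_train)  # learns the vocabulary and IDF weights
toy_test_vec = vec.transform(toy_test)        # reuses the training vocabulary

# A vectorizer refitted on the test set learns a different vocabulary
refit_vec = TfidfVectorizer().fit(toy_test)

print(toy_train_vec.shape, toy_test_vec.shape)
print(len(refit_vec.vocabulary_))
print("overheated" in vec.vocabulary_)
```

Words that appear only in the test set, such as `"overheated"` here, are simply ignored by `transform()`.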

In [None]:
# 1. Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# 2. Fit and transform training data
X_train_vec = tfidf_vectorizer.fit_transform(X_train_news)

# 3. Transform test data (DO NOT fit again on test!)
X_test_vec = tfidf_vectorizer.transform(X_test_news)

print("X_train_vec shape:", X_train_vec.shape)
print("X_test_vec shape :", X_test_vec.shape)

# 4. Initialize Logistic Regression
log_reg_news = LogisticRegression(solver='saga', max_iter=1000)

# 5. Train the model
log_reg_news.fit(X_train_vec, y_train_news)

# 6. Make predictions on test data
y_pred_news = log_reg_news.predict(X_test_vec)

# 7. Evaluate the model
print("Accuracy (20 Newsgroups):", accuracy_score(y_test_news, y_pred_news))
print("\nClassification report (20 Newsgroups):")
print(classification_report(y_test_news, y_pred_news, target_names=news_train.target_names))