# HW4

Overall rules:

- Do not split your answers into separate files. All answers must be in a single Jupyter notebook.
- Obtain all required remote data using the appropriate API unless otherwise specified.
- Refrain from using code comments to explain what has been done. Document your steps by writing appropriate markdown cells in your notebook.
- Avoid duplicating code by copying and pasting it from one cell to another. If copying and pasting is necessary, develop a suitable function for the task at hand and call that function.
- When providing parameters to a function, never use global variables. Instead, always pass parameters explicitly and always make use of local variables.
- Document your use of LLMs (ChatGPT, Claude, Copilot, etc.). Either take screenshots of your steps and include them with this notebook, or give me a full log (both questions and answers) in a markdown file named HW4-LLM-LOG.md.

Failure to adhere to these guidelines will result in a 15-point deduction for each infraction.

## Q1

In this problem, you will compare the performance of several supervised learning models on the CIFAR-10 dataset available through `tensorflow_datasets`. The CIFAR-10 dataset consists of 60,000 $32 \times 32$ color images in 10 classes, with 6,000 images per class. To simplify the computational requirements and focus on core algorithmic aspects, you may preprocess the images by converting them to grayscale.

1. Load the CIFAR-10 dataset using the `tensorflow_datasets` library. Preprocess the dataset by:
   - Converting the images to grayscale,
   - Flattening the image into a vector (if necessary) for models that require vector inputs,
   - Normalizing pixel values to the $[0,1]$ range.

2. Train and evaluate the following models on the dataset:
   - **Logistic Regression** (one-vs-rest and one-vs-one),
   - **Support Vector Machine (SVM)** with:
     - A linear kernel,
     - A Gaussian (RBF) kernel,
     - A polynomial kernel,
   - **A simple neural network** consisting of:
     - One or two hidden layers,
     - Nonlinear activation functions (such as ReLU or tanh),
     - A softmax output layer for classification.

3. For each model, report:
   - The overall test set classification accuracy,
   - The confusion matrix,
   - Precision, recall, and F1-score per class.

4. Conduct a **statistical error analysis**:
   - Compute $95\%$ confidence intervals for the test set accuracy estimates (e.g., using the binomial proportion confidence interval),
   - Discuss any significant differences between model performances.

5. Discuss the **computational trade-offs**:
   - Measure and report the **training time** and **inference time** for each model (e.g., using simple timing functions),
   - Measure and report **memory usage** where feasible (e.g., by estimating the number of parameters or using profiling tools),
   - Reflect on how model complexity (both in terms of runtime and memory) correlates with performance.

6. Conclude by discussing the observed trade-offs between **model simplicity**, **computational cost**, and **predictive accuracy**.
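
The preprocessing in step 1 can be sketched as follows. This is a minimal illustration on a synthetic batch, not the required implementation: the function name `preprocess_images` is hypothetical, and the luminance weights are an assumed choice (a plain mean over the color channels is an equally valid grayscale conversion).

```python
import numpy as np

def preprocess_images(images, flatten=True):
    """Convert uint8 RGB images of shape (N, H, W, 3) to grayscale in [0, 1]."""
    images = np.asarray(images, dtype=np.float32)
    # Assumed luminance weights (ITU-R BT.601); a channel mean would also work.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = images @ weights                    # weighted channel sum, shape (N, H, W)
    gray = np.clip(gray / 255.0, 0.0, 1.0)     # scale to [0, 1], guard against rounding
    if flatten:
        gray = gray.reshape(len(gray), -1)     # shape (N, H * W)
    return gray

# Synthetic stand-in for a CIFAR-10 batch; real data would come from tensorflow_datasets.
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
features = preprocess_images(fake_batch)
print(features.shape, float(features.min()), float(features.max()))
```

Keeping `flatten` as an explicit argument lets the same function serve both the vector-input models and any model that expects a 2D image.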

You may use `tensorflow`, `pytorch` or `keras` for neural network model implementation. Clearly indicate any hyperparameter choices (e.g., regularization strength, kernel parameters, number of neurons in hidden layers). Ensure that your experimental code supports reproducibility (by fixing random seeds where applicable).
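
For the confidence intervals requested in step 4, one possible approach is a binomial proportion interval on the number of correctly classified test images. The sketch below shows the normal-approximation (Wald) interval and the Wilson score interval. The function names are hypothetical and the counts are made-up placeholders, not results.

```python
import math

def wald_interval(n_correct, n_total, z=1.96):
    """Normal-approximation interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p, p - half_width, p + half_width

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval, which behaves better when p is near 0 or 1."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return p, center - half_width, center + half_width

# Hypothetical counts for illustration only: 4,200 correct out of 10,000 test images.
acc, wald_low, wald_high = wald_interval(4200, 10000)
_, wilson_low, wilson_high = wilson_interval(4200, 10000)
print(f"accuracy={acc:.4f} Wald=[{wald_low:.4f}, {wald_high:.4f}] "
      f"Wilson=[{wilson_low:.4f}, {wilson_high:.4f}]")
```

Non-overlapping intervals suggest a real difference between two models, while overlapping intervals are inconclusive. Since all models are evaluated on the same test set, a paired comparison such as McNemar's test may be a sharper tool.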


## Q2

In this question, you will explore the application of boosting algorithms to a medical image classification problem. The dataset is **PneumoniaMNIST** from the **MedMNIST** collection, available through `tensorflow_datasets` under `medmnist.pneumoniamnist`. PneumoniaMNIST consists of chest X-ray images labeled either as **normal** or **pneumonia**.

The goal is to predict whether a patient shows signs of pneumonia based on their chest X-ray image.

1. Load the PneumoniaMNIST dataset using the `tensorflow_datasets` library. Preprocess the dataset by:
   - Normalizing the pixel values to $[0,1]$,
   - Flattening the $28\times28$ images into vectors suitable for tabular classifiers.

2. Train and evaluate the following models:
   - **Gradient Boosted Decision Trees (GBDT)** using **XGBoost**,
   - **AdaBoost** with **decision stumps** (single-level decision trees) as weak learners,
   - **Gradient Boosting Classifier** using **LightGBM** or **sklearn's GradientBoostingClassifier**.

3. For each model, report the test set classification **accuracy**, **precision**, **recall**, and **F1-score**, and perform a proper error analysis of these values.

4. Interpret the learned models:
   - Extract and rank the most important pixels (features) contributing to the classification decision (feature importance analysis).
   - Visualize and interpret these "important pixels" on the original $28\times28$ image grid. Discuss whether they align with medical intuition (e.g., lung region emphasis).

5. Discuss computational aspects:
   - Measure and report the **training time** and **inference time** for each boosting model,
   - Report and compare the **model size** (e.g., number of leaves, total size of the trained model).

6. Write a final summary:
   - Summarize the strengths and weaknesses you observed for each boosting method in this specific small-image classification task,
   - Comment on whether boosting methods are well-suited for this type of structured low-dimensional image data compared to neural network approaches.
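
For the interpretation in step 4, a per-feature importance vector (such as the `feature_importances_` attribute exposed by scikit-learn style estimators) can be mapped back onto the image grid. The sketch below assumes the images were flattened in row-major order, so feature index `i` corresponds to row `i // 28` and column `i % 28`. The helper names and the importance values are hypothetical.

```python
import numpy as np

def importance_to_grid(importances, side=28):
    """Reshape a flat importance vector into a (side, side) heat map."""
    importances = np.asarray(importances, dtype=float)
    if importances.size != side * side:
        raise ValueError("expected one importance value per pixel")
    return importances.reshape(side, side)

def top_k_pixels(importances, k=10, side=28):
    """Return (row, col, importance) for the k most important pixels."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1][:k]          # indices sorted by descending importance
    rows, cols = np.unravel_index(order, (side, side))  # row-major index to (row, col)
    return list(zip(rows.tolist(), cols.tolist(), importances[order].tolist()))

# Made-up importance vector standing in for a trained model's output.
fake_importances = np.zeros(28 * 28)
fake_importances[5 * 28 + 7] = 0.9
fake_importances[0] = 0.5
heat_map = importance_to_grid(fake_importances)
print(top_k_pixels(fake_importances, k=2))
```

The resulting `heat_map` can be overlaid on a sample X-ray (for example with an image plot) to check whether the highlighted pixels fall in the lung region.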

Hyperparameters such as learning rate, number of estimators, and maximum tree depth should be selected using 5-fold cross-validation or a separate validation split. Pay special attention to **early stopping** where appropriate to avoid overfitting.
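
As an illustration of the 5-fold cross-validation mentioned above, the sketch below generates train/validation index splits with NumPy only. The generator name `k_fold_indices` is hypothetical. In practice, `sklearn.model_selection.KFold` or `StratifiedKFold` provide the same functionality, and the stratified variant is worth considering if the two classes are imbalanced.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs covering a shuffled index range."""
    rng = np.random.default_rng(seed)          # fixed seed for reproducible folds
    permuted = rng.permutation(n_samples)
    folds = np.array_split(permuted, n_folds)  # fold sizes differ by at most one
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(103))
for fold_number, (train_idx, val_idx) in enumerate(splits):
    print(fold_number, len(train_idx), len(val_idx))
```

Each candidate hyperparameter setting would be trained on `train_idx` and scored on `val_idx` for every fold, with the mean validation score used to pick the final setting.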


## Q3

In this question, you will investigate the use of **autoencoders** for compressing high-dimensional text data into a lower-dimensional latent space. You will use the **IMDB movie reviews dataset** from `tensorflow_datasets` and develop a pipeline based on **count vectorization**, followed by a **fully-connected autoencoder**. This question focuses on **unsupervised representation learning** and reconstruction accuracy in the **bag-of-words (BoW)** paradigm.

#### **Part I: Preprocessing Pipeline**

1. Load the `imdb_reviews` dataset using `tensorflow_datasets`.
2. Construct a preprocessing pipeline consisting of the following steps:
   - **Text cleaning**: remove punctuation and convert all text to lowercase.
   - **Tokenization**: split the review into word tokens.
   - **Count-based vectorization**:
     - Use `sklearn.feature_extraction.text.CountVectorizer`,
     - Limit the vocabulary to the top $V = 5000$ most frequent words,
     - Represent each review as a **$V$-dimensional sparse count vector**.

3. Optionally, apply TF scaling or binary indicator (0/1) encoding to the count vectors. Justify your choice.
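
To make the pipeline steps concrete, here is a small standard-library sketch of cleaning, tokenization, and count vectorization on two toy sentences. It only illustrates what the steps compute. The assignment itself asks for `CountVectorizer`, where the vocabulary limit corresponds to `max_features=5000` and the 0/1 encoding to `binary=True`. All helper names below are hypothetical.

```python
import re
from collections import Counter

def clean_and_tokenize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

def build_vocabulary(documents, max_features=5000):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter()
    for doc in documents:
        counts.update(clean_and_tokenize(doc))
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_features))}

def vectorize(text, vocabulary, binary=False):
    """Return a dense count (or 0/1 indicator) vector for one document."""
    vector = [0] * len(vocabulary)
    for token in clean_and_tokenize(text):
        index = vocabulary.get(token)
        if index is not None:                  # out-of-vocabulary tokens are dropped
            vector[index] = 1 if binary else vector[index] + 1
    return vector

docs = ["Great movie, great cast!", "Terrible movie."]
vocab = build_vocabulary(docs, max_features=3)
print(vocab, vectorize(docs[0], vocab), vectorize(docs[0], vocab, binary=True))
```

Note that replacing punctuation with spaces splits contractions such as "don't" into two tokens. Whether that is acceptable is a design choice to justify in the notebook.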

#### **Part II: Autoencoder Architecture**

4. Build a **fully-connected feedforward autoencoder** in TensorFlow/Keras:
   - The **input layer** accepts a $V$-dimensional BoW vector,
   - The **encoder** compresses it through one or more dense layers to a **latent space** of dimension $d \ll V$ (e.g., $d = 32$ or $64$),
   - The **decoder** maps the latent representation back to a $V$-dimensional output using an architecture that mirrors the encoder,
   - Use **sigmoid activation** on the final output layer to model per-word presence/absence (binary cross-entropy loss),
   - Apply **dropout** and/or **L2 regularization** to mitigate overfitting.
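
As a rough size check before building the network, the parameter count of a symmetric dense autoencoder follows directly from its layer widths: each dense layer contributes `n_in * n_out` weights plus `n_out` biases, while dropout and L2 regularization add no trainable parameters. The sketch below uses a single hidden layer of width 512, which is an assumed value chosen only for illustration, and hypothetical helper names.

```python
def symmetric_layer_sizes(input_dim, hidden_dims, latent_dim):
    """Build the width sequence input -> hidden... -> latent -> hidden... -> input."""
    encoder = [input_dim, *hidden_dims, latent_dim]
    return encoder + encoder[-2::-1]           # mirror the encoder, skipping the latent layer

def dense_parameter_count(layer_sizes):
    """Sum weights and biases over consecutive fully-connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed example configuration: V = 5000, one hidden layer of 512 units, d = 64.
sizes = symmetric_layer_sizes(5000, [512], 64)
total_parameters = dense_parameter_count(sizes)
compression_ratio = sizes[0] / 64
print(sizes, total_parameters, compression_ratio)
```

Comparing this hand count with the total reported by the framework's model summary is a quick way to confirm that the architecture was built as intended.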

#### **Part III: Training and Evaluation**

5. Train the autoencoder on the training portion of the dataset using:
   - Mini-batch SGD or Adam optimizer,
   - Binary cross-entropy loss,
   - A held-out validation set for early stopping.

6. Evaluate the quality of the trained autoencoder on the test set:
   - For each reconstructed output $\hat{x}$ and corresponding input $x$, compute the **binary prediction** $\tilde{x}_i = 1$ if $\hat{x}_i > 0.5$ and $\tilde{x}_i = 0$ otherwise,
   - Compute **per-sample accuracy**: the proportion of binary entries in each BoW vector that $\tilde{x}$ reconstructs correctly, together with a proper error analysis,
   - Report the **mean accuracy** across all test samples together with a proper error analysis.

7. In addition, report:
   - The total number of parameters in the model,
   - The average compression ratio (input dimension / latent dimension),
   - A few example reconstructions: for 5 randomly chosen reviews, list the top-10 most frequent words in the original and reconstructed BoW vectors. Discuss any semantic degradation or preservation.
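
The evaluation in step 6 can be sketched with NumPy as follows. The function name is hypothetical, and the 95% interval uses a normal approximation based on the standard error of the per-sample accuracies, which is one of several reasonable choices for the error analysis. The arrays are toy values, not model outputs.

```python
import numpy as np

def reconstruction_accuracy(x_true, x_prob, threshold=0.5):
    """Per-sample binary reconstruction accuracy, its mean, and a 95% interval."""
    x_true_bin = (np.asarray(x_true) > 0).astype(np.int8)        # word present or absent
    x_pred_bin = (np.asarray(x_prob) > threshold).astype(np.int8)
    per_sample = (x_true_bin == x_pred_bin).mean(axis=1)          # fraction of matching entries
    mean_accuracy = per_sample.mean()
    standard_error = per_sample.std(ddof=1) / np.sqrt(len(per_sample))
    interval = (mean_accuracy - 1.96 * standard_error,
                mean_accuracy + 1.96 * standard_error)
    return per_sample, mean_accuracy, interval

# Toy inputs: two documents over a four-word vocabulary.
toy_counts = np.array([[1, 0, 2, 0], [0, 0, 1, 1]])
toy_probs = np.array([[0.9, 0.1, 0.2, 0.4], [0.6, 0.3, 0.7, 0.8]])
per_sample, mean_accuracy, interval = reconstruction_accuracy(toy_counts, toy_probs)
print(per_sample, mean_accuracy, interval)
```

Because BoW vectors are sparse, a model that predicts all zeros can already score a high accuracy under this metric, so it may be worth reporting that baseline alongside the autoencoder's result.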

#### **Part IV: Latent Space Visualization and Interpretation**

8. Extract the **latent representations** of 1000 test reviews.
9. Use PCA to reduce the $d$-dimensional latent codes to 2D and visualize the resulting plot.
   - Color the points based on **sentiment label** (positive or negative, available in the dataset but not used during training),
   - Discuss whether the autoencoder latent space **implicitly clusters** reviews by sentiment, even though the model was trained in an unsupervised manner.
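
The PCA projection in step 9 can be sketched with a singular value decomposition of the centered latent codes. The example below uses random numbers as a stand-in for real encoder outputs, and `pca_2d` is a hypothetical helper. `sklearn.decomposition.PCA(n_components=2)` gives an equivalent projection, up to the sign of each component.

```python
import numpy as np

def pca_2d(latent_codes):
    """Project latent codes onto their first two principal components."""
    z = np.asarray(latent_codes, dtype=float)
    centered = z - z.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    explained_ratio = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return projected, explained_ratio

# Random placeholders for 1000 latent codes (d = 32) and their sentiment labels.
rng = np.random.default_rng(0)
fake_latents = rng.normal(size=(1000, 32))
fake_labels = rng.integers(0, 2, size=1000)
points, explained_ratio = pca_2d(fake_latents)
print(points.shape, explained_ratio)
```

The two columns of `points` can then be drawn as a scatter plot colored by `fake_labels` (or the real sentiment labels). Reporting `explained_ratio` alongside the plot indicates how much of the latent variance the 2D view actually captures.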
