[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/assignments/hw05-dimensionality-reduction.ipynb)

# Homework 5: Dimensionality Reduction

**Due:** One week after Lecture 5

**Points:** 10

Apply principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to simulated chemical engineering data.

In [None]:
! pip install -q pycse
from pycse.colab import pdf

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Problem 1: PCA Basics (4 points)

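Analyze simulated spectroscopic data from polymer samples. The cell below generates 100 spectra at 50 wavelengths; `spectra_df` also carries the `concentration` and `temperature` values used to generate them.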

In [None]:
np.random.seed(42)
n_samples = 100
n_wavelengths = 50

# Simulate spectral data with underlying structure
# Two main components: concentration and temperature effects
concentration = np.random.uniform(0.1, 1.0, n_samples)
temperature = np.random.uniform(300, 400, n_samples)

wavelengths = np.linspace(400, 800, n_wavelengths)
spectra = np.zeros((n_samples, n_wavelengths))

for i in range(n_samples):
    # Peak 1 (near 500 nm) shifts with concentration;
    # peak 2 (at 650 nm) scales in intensity with temperature
    peak1 = np.exp(-((wavelengths - 500 - 20*concentration[i])**2) / 500)
    peak2 = np.exp(-((wavelengths - 650)**2) / 800) * (temperature[i] / 350)
    spectra[i] = peak1 + peak2 + np.random.normal(0, 0.05, n_wavelengths)

spectra_df = pd.DataFrame(spectra, columns=[f'w{int(w)}' for w in wavelengths])
spectra_df['concentration'] = concentration
spectra_df['temperature'] = temperature

print(f"Data shape: {spectra.shape}")
plt.figure(figsize=(10, 4))
for i in range(5):
    plt.plot(wavelengths, spectra[i], alpha=0.7)
plt.xlabel('Wavelength (nm)')
plt.ylabel('Absorbance')
plt.title('Sample Spectra')
plt.show()

**1a.** Standardize the spectral data (the wavelength columns only, not `concentration` or `temperature`) and perform PCA. How many components are needed to explain at least 95% of the variance?
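
*Hint:* the mechanics of reading a component count off the cumulative explained variance can be sketched on a small toy array. This is only an illustration under stated assumptions: `X_toy`, `factors`, and `mixing` are hypothetical names and made-up data, not the homework spectra, and the 0.95 threshold is applied with `np.argmax`, which is one of several reasonable ways to do it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical toy data (NOT the homework spectra): 200 samples, 6 features
# generated from 2 hidden factors plus a small amount of noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_toy = factors @ mixing + rng.normal(scale=0.05, size=(200, 6))

# Standardize each feature to zero mean and unit variance, then fit PCA
X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA().fit(X_scaled)

# Cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# np.argmax on a boolean array returns the first True index;
# add 1 to convert the zero-based index into a component count
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_95}")
```

The same `cumulative` array can be passed straight to `plt.plot` if you want to see where the curve crosses the threshold.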

In [None]:
# Your code here


**1b.** Create a scree plot (the explained variance ratio of each component) and plot the cumulative explained variance ratio.

In [None]:
# Your code here


**1c.** Plot the first two principal component loadings. What wavelength regions are most important for each component?
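
*Hint:* in scikit-learn the loadings are stored in the fitted model's `components_` attribute, one row per component and one column per input feature. The sketch below uses a hypothetical, deliberately unscaled toy array (`X_toy`) so that the dominant feature of each component is obvious; it is not the homework data, and for the spectra you would plot each row against `wavelengths` instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data (NOT the homework spectra): four independent
# features with very different spreads. Left unscaled on purpose so that
# the high-variance features dominate the leading components.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 4)) * np.array([5.0, 1.0, 0.5, 0.1])

pca_toy = PCA(n_components=2).fit(X_toy)

# Rows are components, columns are the original features
loadings = pca_toy.components_
print(loadings.shape)  # (2, 4)

# Index of the feature with the largest absolute loading in each component
top_feature = np.argmax(np.abs(loadings), axis=1)
print(top_feature)
```

The sign of a loading vector is arbitrary, so compare absolute values when judging which features matter most.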

In [None]:
# Your code here


**1d.** Create a scatter plot of PC1 vs PC2, colored by concentration. Is there a relationship between the component scores and concentration?

In [None]:
# Your code here


## Problem 2: t-SNE Visualization (3 points)

Visualize clusters in high-dimensional data.

In [None]:
np.random.seed(42)

# Three types of catalysts with different property profiles
n_per_class = 50

# Each catalyst has 10 measured properties
cat_A = np.random.multivariate_normal([1, 2, 3, 1, 2, 3, 1, 2, 3, 1], np.eye(10)*0.3, n_per_class)
cat_B = np.random.multivariate_normal([3, 1, 2, 3, 1, 2, 3, 1, 2, 3], np.eye(10)*0.3, n_per_class)
cat_C = np.random.multivariate_normal([2, 3, 1, 2, 3, 1, 2, 3, 1, 2], np.eye(10)*0.3, n_per_class)

catalyst_properties = np.vstack([cat_A, cat_B, cat_C])
catalyst_labels = ['A']*n_per_class + ['B']*n_per_class + ['C']*n_per_class

print(f"Data shape: {catalyst_properties.shape}")

**2a.** Apply t-SNE with perplexity=30. Plot the results colored by catalyst type.
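
*Hint:* a minimal sketch of the `TSNE` call pattern on hypothetical toy data (two made-up blobs, not the catalyst set). The names `X_toy` and `labels_toy` and the perplexity of 10 are assumptions chosen for this small example; the problem itself asks for perplexity=30.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical toy data (NOT the catalyst data): two well-separated
# blobs of 40 points each in 5 dimensions.
rng = np.random.default_rng(2)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(40, 5))
blob_b = rng.normal(loc=4.0, scale=0.5, size=(40, 5))
X_toy = np.vstack([blob_a, blob_b])
labels_toy = np.array(["a"] * 40 + ["b"] * 40)

# t-SNE has no separate transform step: fit_transform both fits the
# embedding and returns the 2D coordinates for the same samples.
# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X_toy)
print(embedding.shape)  # (80, 2)
```

To color by class, one option is to loop over `np.unique(labels_toy)` and call `plt.scatter` on the rows of `embedding` that match each label.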

In [None]:
# Your code here


**2b.** How does perplexity affect the result? Compare perplexity=5, 30, and 50.

In [None]:
# Your code here


**2c.** Compare t-SNE to PCA for this dataset. Which method better separates the catalyst types?

In [None]:
# Your code here


## Problem 3: Interpretation (3 points)

**3a.** A colleague wants to use PCA to reduce 100 process variables to 10 for a predictive model. What questions should they consider before doing this?

*Your answer here:*



**3b.** Why can't you use t-SNE embeddings as features for a predictive model on new data?

*Your answer here:*



**3c.** You have spectral data measured at 1000 wavelengths and want to predict concentration. Would you use PCA, t-SNE, or neither for preprocessing? Explain.

*Your answer here:*

