This repository contains four comprehensive machine learning projects completed as part of the Machine Learning 2 course at Technion. Each project explores different aspects of deep learning, from fundamental neural network implementation to advanced topics like adversarial attacks and generative models.
- Project 1: Neural Networks from Scratch & CNN for Dog Emotion Classification
- Project 2: Overfitting Analysis & Sentiment Analysis with RNNs
- Project 3: Adversarial Attacks & Contrastive Learning
- Project 4: Generative Adversarial Networks (GANs)
This project is divided into theoretical and practical components that cover fundamental concepts in neural networks and convolutional neural networks (CNNs).
Objective: Derive the gradients of the softmax function and express them solely in terms of the softmax output.
Theory: The softmax function is defined as $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$.
Results:
- When $i = k$: $\frac{\partial \text{softmax}(x)_i}{\partial x_k} = \text{softmax}(x)_i\,(1 - \text{softmax}(x)_i)$
- When $i \neq k$: $\frac{\partial \text{softmax}(x)_i}{\partial x_k} = -\text{softmax}(x)_i \cdot \text{softmax}(x)_k$
Objective: Derive the gradient of cross-entropy loss with respect to the softmax input vector.
Theory: Given the cross-entropy loss $L = -\sum_i y_i \log(\text{softmax}(x)_i)$ with one-hot target $y$, the gradient with respect to the input vector $x$ simplifies to:

$\frac{\partial L}{\partial x} = \text{softmax}(x) - y$
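A quick NumPy sketch (not from the original notebooks) that checks the closed-form gradient $\text{softmax}(x) - y$ against a finite-difference estimate:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(x, y):
    # y is a one-hot target vector
    return -np.sum(y * np.log(softmax(x)))

x = np.array([0.5, -1.2, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])

analytic = softmax(x) - y                     # closed-form gradient
numeric = np.zeros_like(x)
eps = 1e-6
for k in range(len(x)):
    d = np.zeros_like(x); d[k] = eps
    numeric[k] = (cross_entropy(x + d, y) - cross_entropy(x - d, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True
```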
- Implement a fully connected neural network from scratch without using PyTorch's automatic differentiation
- Train on MNIST dataset
- Achieve competitive accuracy
Architecture:
- Input layer: 784 neurons (28×28 flattened images)
- Hidden layer: 128 neurons
- Output layer: 10 neurons (digit classes)
- Activation function: Sigmoid
Constraints:
- No automatic differentiation (no `backward()`)
- No built-in loss functions
- No built-in activations
- No built-in optimization algorithms
- Manual implementation of forward and backward passes
Hyperparameters:
- Learning rate: 0.001
- Batch size: 32
- Epochs: 16
- Seed: 42 (for reproducibility)
Training Process:
- Forward Pass: Compute predictions using manual matrix multiplications
- Loss Computation: Cross-entropy loss implementation from scratch
- Backward Pass: Manual gradient computation using chain rule
- Weight Updates: Gradient descent parameter updates
- Training accuracy: ~90%
- Test accuracy: ~88%
- The model demonstrates proper convergence with manual backpropagation
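For illustration, a minimal NumPy sketch of the manual forward/backward pass described above, using the stated 784-128-10 sigmoid architecture; variable names and the random placeholder batch are my own, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
W1 = rng.normal(0, 0.01, (784, 128)); b1 = np.zeros(128)
W2 = rng.normal(0, 0.01, (128, 10));  b2 = np.zeros(10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    a1 = sigmoid(X @ W1 + b1)
    z2 = a1 @ W2 + b2
    # Row-wise softmax
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return a1, probs

def backward(X, Y, a1, probs, lr=0.001):
    global W1, b1, W2, b2
    n = X.shape[0]
    dz2 = (probs - Y) / n                  # softmax + cross-entropy gradient
    dW2 = a1.T @ dz2; db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)     # chain rule through sigmoid
    dW1 = X.T @ dz1; db1 = dz1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1         # gradient descent updates
    W2 -= lr * dW2; b2 -= lr * db2

# One gradient-descent step on a random batch (stand-in for an MNIST mini-batch)
X = rng.random((32, 784)); Y = np.eye(10)[rng.integers(0, 10, 32)]
a1, probs = forward(X)
backward(X, Y, a1, probs)
```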
Experiment: Tested with 5 different learning rates: [0.0001, 0.001, 0.01, 0.1, 1.0]
Key Insights:
- LR = 0.0001: Extremely slow convergence, underfitting (accuracy ~11%)
- LR = 0.001: Steady but slow convergence to ~60% accuracy
- LR = 0.01: Best performance, reaches ~90% accuracy with minimal overfitting
- LR = 0.1: Fast convergence but early overfitting signs (~95% train, slight gap)
- LR = 1.0: Severe overfitting (100% train accuracy, ~98% test)
Conclusion: Learning rate of 0.01 provides the best generalization.
Create a CNN-based classifier to identify dog emotions (angry, happy, relaxed, sad) from images. The model must:
- Achieve at least 70% test accuracy
- Minimize parameter count (deployed on smartphones)
- Not use pre-trained models
- Location: `hw1_data/Dog_Emotion/`
- Classes: 4 emotions (angry, happy, relaxed, sad)
- Splits: Train, validation, and test sets
- Analyzed class distribution across splits
- Visualized sample images from each emotion category
- Identified potential class imbalances
Training Augmentation (Extensive):
- Random resized crop (64×64, scale 0.7-1.0)
- Random rotation (±20°)
- Color jitter (brightness, contrast, saturation, hue)
- Random horizontal/vertical flips
- Random affine transformations
- Random grayscale conversion (10%)
- Gaussian blur
- Normalization: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Validation/Test Augmentation (Minimal):
- Resize to 64×64
- Normalization only
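A possible torchvision composition of the augmentations listed above; the jitter strengths, affine translation, and blur kernel size are assumptions since the exact values are not stated:

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(64, scale=(0.7, 1.0)),
    transforms.RandomRotation(20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

test_tf = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```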
Design Philosophy: Use depthwise-separable convolutions to minimize parameters while maintaining representational power.
Architecture Details:
Initial Convolution:
- Conv2d(3→16, kernel=3, stride=1, padding=1)
- BatchNorm2d(16)
- ReLU
Depthwise-Separable Blocks (6 blocks):
Block 1-2: 16→32→32 (stride=2, then 1)
Block 3-4: 32→64→64 (stride=2, then 1)
Block 5-6: 64→128→128 (stride=2, then 1)
Global Average Pooling: AdaptiveAvgPool2d(1×1)
Classifier:
- Linear(128→128)
- BatchNorm1d(128)
- ReLU
- Dropout(0.5)
- Linear(128→4)
Total Parameters: ~53,000 (extremely efficient!)
A depthwise-separable convolution splits a standard convolution into:
- Depthwise: Applies filters per channel separately
- Pointwise: 1×1 convolution to combine channels
- Significantly reduces computational cost and parameters
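A minimal PyTorch sketch of one such block (the exact layer ordering and BatchNorm placement are assumptions):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

For a 3×3 kernel, a standard convolution from $C_{in}$ to $C_{out}$ channels uses $9\,C_{in}C_{out}$ weights, while the depthwise-separable version uses $9\,C_{in} + C_{in}C_{out}$, which is where most of the parameter savings come from.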
Hyperparameters:
- Batch size: 16 (small for better generalization)
- Learning rate: 0.001
- Optimizer: Adam with weight decay (1e-4)
- Loss function: Label smoothing cross-entropy (smoothing=0.1)
- Scheduler: CosineAnnealingLR (T_max=1000, eta_min=1e-5)
- Epochs: 1000
Training Strategy:
- Low batch size + high epochs for gradual learning
- Label smoothing to prevent overconfidence
- Extensive data augmentation to improve generalization
- Cosine learning rate decay for smooth convergence
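The listed hyperparameters map onto standard PyTorch components roughly as follows (`model` is assumed to be the CNN defined above):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-5)
# scheduler.step() is called once per epoch during training
```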
- Best Test Accuracy: 75.8%
- Training Accuracy: ~82%
- Model successfully passes the 70% threshold
- Excellent parameter efficiency (53k parameters)
- Parameter Reduction: Successfully reduced from initial 400k to 53k parameters through architectural optimization
- Data Augmentation Critical: Extensive augmentation was essential for preventing overfitting on small dataset
- Label Smoothing: Helped reduce overconfidence and improved generalization
- Training Duration: High epoch count (1000) with small batches allowed thorough learning
Reference: Insights drawn from a research paper on animal emotion recognition published in Nature Scientific Reports.
- Load and analyze a pre-trained VGG16 model
- Understand filter responses in convolutional layers
- Visualize what the network learns
- Resize: Images to 224×224 pixels
- Convert to Tensor: Scale to [0, 1]
- Normalize: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
- Add Batch Dimension: For model input
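A hedged sketch of this preprocessing and top-5 prediction flow with torchvision's pre-trained VGG16; the image path is hypothetical and the `weights=` API requires a recent torchvision:

```python
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.eval()

img = Image.open("hw1_data/birds/example.jpg").convert("RGB")   # hypothetical file name
x = preprocess(img).unsqueeze(0)                                  # add batch dimension

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)
top5 = torch.topk(probs, k=5, dim=1)
print(top5.indices, top5.values)   # ImageNet class indices and confidences
```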
1. Bird Image Classification:
- Loaded images from `hw1_data/birds`
- Applied VGG16 to obtain top-5 predictions
- Analyzed class probabilities from ImageNet classes
2. Dog Image Classification:
- Selected a dog image from `hw1_data/dogs`
- Obtained top-5 predictions with confidence scores
3. Filter Response Visualization: Visualized first 3 filters in the first convolutional layer:
- Filter 1: Detects vertical and diagonal edges
- Filter 2: Captures texture and fine details
- Filter 3: Identifies larger-scale shapes and structures
- Early layers detect low-level features (edges, textures)
- These features are combined in deeper layers for complex pattern recognition
- Filter visualization reveals what the network "sees" in images
This project explores generalization, overfitting, and recurrent neural networks (RNNs) for natural language processing tasks.
Demonstrate that neural networks can overfit to random labels, achieving near-zero training loss while having no meaningful generalization.
Dataset:
- MNIST (first 128 training samples only)
- Random binary labels generated from Bernoulli(0.5)
Constraints:
- Batch size: 128 (entire subset in one batch)
- Shuffle: False
- Binary classification (0 or 1)
Network Architecture:
- Input: 784 (28×28 flattened)
- Hidden: 128 neurons with ReLU
- Output: 2 classes
- Loss: Cross-entropy
Hyperparameters:
- Learning rate: 1e-3
- Optimizer: Adam
- Epochs: 100
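A minimal sketch of the random-label setup described above (the seed and data root are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())

torch.manual_seed(0)                                        # assumed seed
images = torch.stack([mnist[i][0] for i in range(128)]).view(128, -1)
labels = torch.bernoulli(torch.full((128,), 0.5)).long()    # random Bernoulli(0.5) labels

# Entire 128-sample subset in one batch, no shuffling
loader = DataLoader(TensorDataset(images, labels), batch_size=128, shuffle=False)
```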
- Training Loss: Converges to ~0.0
- Training Accuracy: ~100%
- Test Accuracy: ~49.58% (random chance)
This experiment demonstrates that:
- Neural networks have sufficient capacity to memorize random labels
- Perfect training performance doesn't guarantee generalization
- When labels are random, test accuracy equals random chance (50% for binary)
- Overfitting is the memorization of training data without learning generalizable patterns
Conclusion: This illustrates the importance of:
- Proper regularization techniques
- Validation set monitoring
- Not relying solely on training metrics
Build RNN-based models to classify tweet emotions into three categories: happiness, sadness, and neutral.
Target: Achieve at least 47% test accuracy
Files:
- `trainEmotions.csv`: Training tweets with emotion labels
- `testEmotions.csv`: Test tweets
Format:
```
emotion, tweet_content
happiness, Welcome @doeko ! Really glad to know you here...
sadness, Disappointment really sucks! I'm getting used to it.
neutral, I just want to Sleep.
```
Data Statistics:
- Training samples: ~15,000 tweets
- Test samples: ~3,000 tweets
- Classes: happiness, sadness, neutral
Class Distribution:
- Analyzed label distribution across train and test sets
- Visualized with bar plots
- Checked for class imbalance
Data Cleaning:
- Removed duplicate tweets
- Stripped special characters and punctuation
- Tokenization for word-level processing
1. Data Cleaning:
- Remove duplicates
- Remove punctuation using regex
- Lowercase conversion
2. Word Embeddings:
- Method: Word2Vec (Google News 300-dimensional)
- Source: Pre-trained gensim model
- Fallback: Zero vector for out-of-vocabulary words
3. Sequence Processing:
- Variable-length sequences padded to max length
- Padding value: 0.0
4. Label Encoding:
- LabelEncoder for emotion categories
- Conversion to integer labels: happiness=0, neutral=1, sadness=2
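A possible implementation of the cleaning-and-embedding step using gensim's pre-trained Google News vectors (the function name is illustrative, not the original code):

```python
import re
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # pre-trained 300-d Google News vectors

def embed_tweet(text):
    # Lowercase, strip punctuation, tokenize on whitespace
    tokens = re.sub(r"[^a-z\s]", "", text.lower()).split()
    # Zero vector fallback for out-of-vocabulary words
    vecs = [w2v[t] if t in w2v else np.zeros(300, dtype=np.float32) for t in tokens]
    return np.stack(vecs) if vecs else np.zeros((1, 300), dtype=np.float32)

seq = embed_tweet("Disappointment really sucks! I'm getting used to it.")
print(seq.shape)   # (num_tokens, 300)
```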
Vanilla RNN: Not implemented separately; the focus is on gated architectures (GRU/LSTM) because plain RNNs suffer from vanishing gradients.
GRU Architecture:
- Embedding dimension: 300 (from Word2Vec)
- GRU layers: 1
- Hidden dimension: Varied [128, 256]
- Dropout: Varied [0.1, 0.5]
- Output: Fully connected layer (hidden_dim → 3 classes)

LSTM Architecture:
- Embedding dimension: 300 (from Word2Vec)
- LSTM layers: 1
- Hidden dimension: Varied [128, 256]
- Dropout: Varied [0.1, 0.5]
- Output: Fully connected layer (hidden_dim → 3 classes)

Key Difference: LSTM uses a cell state to better capture long-term dependencies compared to GRU.
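A compact sketch of a GRU classifier matching the description above; the LSTM variant is identical except that `nn.LSTM` also returns a cell state. Hyperparameter defaults here are examples only:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=256, num_classes=3, dropout=0.1):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                       # x: (batch, seq_len, embed_dim)
        _, h_n = self.gru(x)                    # h_n: (1, batch, hidden_dim)
        return self.fc(self.dropout(h_n[-1]))   # logits: (batch, num_classes)

# Example: a batch of 4 padded tweets, each 20 tokens of 300-d Word2Vec vectors
logits = GRUClassifier()(torch.randn(4, 20, 300))
```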
Grid Search Space:
- Hidden dimensions: [128, 256]
- Dropout rates: [0.1, 0.5]
- Learning rates: [0.001, 0.0001]
- Optimizers: [Adam, SGD with momentum=0.9]
- Batch size: 32
- Epochs: 5 (for quick experimentation)
Total Configurations: 2×2×2×2 = 16 configurations per model type
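One way to enumerate these 16 configurations with `itertools.product`; `GRUClassifier` refers to the sketch above and the training call is left as a placeholder:

```python
from itertools import product

import torch

search_space = {
    "hidden_dim": [128, 256],
    "dropout": [0.1, 0.5],
    "lr": [1e-3, 1e-4],
    "optimizer": ["adam", "sgd"],
}

for hidden_dim, dropout, lr, opt_name in product(*search_space.values()):
    model = GRUClassifier(hidden_dim=hidden_dim, dropout=dropout)
    if opt_name == "adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    else:
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # acc = train_and_evaluate(model, optimizer, epochs=5)   # placeholder training call
```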
Loss Function: Cross-Entropy Loss
Data Loaders:
- Training: Shuffled batches
- Validation: Sequential batches
- Test: Sequential batches
Training Loop:
- Forward pass through RNN
- Loss computation
- Backpropagation
- Optimizer step
- Validation evaluation each epoch
Best GRU Configuration:
- Hidden dimension: 256
- Dropout: 0.1
- Learning rate: 0.001
- Optimizer: Adam
- Test Accuracy: ~52%
Best LSTM Configuration:
- Hidden dimension: 256
- Dropout: 0.1
- Learning rate: 0.001
- Optimizer: Adam
- Test Accuracy: ~51%
Observations:
- Happiness class: Best recall and precision
- Neutral class: Moderate confusion with sadness
- Sadness class: Some misclassification as neutral
- Diagonal dominance indicates reasonable performance
Training Dynamics:
- Steady decrease in training loss over epochs
- Test loss follows similar trend
- Accuracy increases gradually
- Minimal overfitting gap between train and test
1. Effect of Learning Rate:
- LR=0.001: Optimal balance with Adam optimizer
- LR=0.0001: Slower convergence, may need more epochs
- High LR (>0.01): Risk of instability
2. Effect of Optimizer:
- Adam: Fast convergence, adaptive learning rates, minimal tuning
- SGD: Requires careful learning rate tuning, slower convergence
- SGD with high dropout (0.5): Tends to stagnate early
3. Effect of Dropout:
- Dropout=0.1: Good generalization, minimal underfit
- Dropout=0.5: May cause underfitting with limited epochs/data
4. Effect of Hidden Dimension:
- Hidden=128: Sufficient for this task
- Hidden=256: Marginal improvement, more capacity
5. GRU vs LSTM:
- GRU: Slightly faster training, fewer parameters
- LSTM: Better long-term dependencies, marginally better accuracy
- Difference: Minimal in this task (both ~50-52% accuracy)
1. Hyperparameter Tuning Challenges: The combinatorial explosion of hyperparameters makes exhaustive search impractical:
- Grid search with 4 parameters and 2-4 values each = 16-256 combinations
- Each configuration requires full training cycle
- Resource-intensive and time-consuming
2. Practical Strategies:
- Informed Search: Use theoretical knowledge (e.g., Adam generally better than SGD)
- Coarse-to-Fine: Start with broad ranges, then narrow down
- Random Search: Often more efficient than grid search
- Early Stopping: Avoid full training for obviously poor configurations
3. Optimizer Selection:
- Adam's adaptive learning rates make it more robust
- SGD requires more careful tuning but offers theoretical guarantees
- For quick experimentation, Adam is preferred
4. Regularization Trade-offs:
- Too much dropout (0.5) with limited data → underfitting
- Too little dropout (<0.1) → potential overfitting
- Balance depends on dataset size and model capacity
Current Limitations:
- Only 5 epochs trained (due to computational constraints)
- Limited hyperparameter search space
- Pre-trained embeddings not fine-tuned
- No attention mechanism
Potential Improvements:
- Increase training epochs (20-50)
- Implement attention mechanisms
- Use BERT or other transformer-based embeddings
- Data augmentation (synonym replacement, back-translation)
- Ensemble methods
- Class balancing techniques
This project demonstrates:
- RNNs can effectively model sequential text data
- Hyperparameter tuning significantly impacts performance
- Pre-trained embeddings (Word2Vec) provide strong baselines
- Gated architectures (GRU/LSTM) handle sentiment analysis reasonably well
- Achieving >47% accuracy target validates the approach
This project explores adversarial robustness and self-supervised learning through contrastive methods. It consists of adversarial attacks on CNN classifiers and contrastive learning for image embeddings.
SVHN (Street View House Numbers):
- Real-world digit images from house numbers
- 10 classes (digits 0-9)
- More challenging than MNIST due to natural variation
- Available through `torchvision.datasets`
Train a CNN classifier achieving at least 90% test accuracy.
Convolutional Layers:
Block 1:
- Conv2d(3→64, kernel=3, padding=1)
- BatchNorm2d(64)
- ELU activation
- Conv2d(64→128, kernel=3, padding=1)
- BatchNorm2d(128)
- ELU activation
- MaxPool2d(2)
- Dropout(0.2)
Block 2:
- Conv2d(128→256, kernel=3, padding=1)
- BatchNorm2d(256)
- ELU activation
- Conv2d(256→256, kernel=3, padding=1)
- BatchNorm2d(256)
- ELU activation
- MaxPool2d(2)
- Dropout(0.3)
Block 3:
- Conv2d(256→512, kernel=3, padding=1)
- BatchNorm2d(512)
- ELU activation
- MaxPool2d(2)
- Dropout(0.4)
Fully Connected Layers:
- Linear(512×4×4 → 1024)
- ELU activation
- Dropout(0.5)
- Linear(1024 → 10)

Hyperparameters:
- Batch size: 128
- Learning rate: 0.001
- Optimizer: Adam (β₁=0.9, β₂=0.999)
- Scheduler: StepLR (step_size=5, gamma=0.5)
- Epochs: 10
- Mixed precision training: Enabled (AMP)
Data Preprocessing:
- Normalization: mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]
- No additional augmentation initially
Regularization:
- Batch normalization after each conv layer
- Increasing dropout rates (0.2 → 0.5) in deeper layers
- L2 regularization implicit in Adam
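A rough sketch of the mixed-precision (AMP) training loop implied by these settings; `model` and `train_loader` are assumed to exist and a CUDA device is required:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():        # mixed-precision forward pass
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()          # scaled backward to avoid underflow
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()
```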
- Final Test Accuracy: 91.23%
- Successfully exceeds 90% threshold
- Stable convergence with smooth loss curves
Confusion Matrix Insights:
- Most errors occur between visually similar digits (e.g., 3 vs 8, 4 vs 9)
- Diagonal dominance indicates good overall performance
- Some classes (0, 1, 6) have near-perfect classification
Error Analysis:
- Misclassified images often have:
- Partial occlusions
- Unusual viewpoints
- Multiple digits in frame
- Poor lighting conditions
- Motion blur
Concept: Generate adversarial examples by adding small perturbations in the direction of the gradient that maximizes loss.
Formula:

$x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x L(x, y))$

Where:
- $x$: Original input image
- $\epsilon$: Perturbation magnitude
- $L$: Loss function
- $\nabla_x L$: Gradient of the loss with respect to the input
Goal: Create imperceptible perturbations that cause misclassification.
```python
def fgsm_attack(model, images, labels, epsilon):
    # Enable gradient computation for the input
    images.requires_grad = True

    # Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)

    # Backward pass to get gradients
    model.zero_grad()
    loss.backward()

    # Generate adversarial perturbation
    perturbation = epsilon * images.grad.sign()
    adversarial_images = images + perturbation
    return adversarial_images


def eval_adversarial(model, test_loader, epsilon):
    """Evaluate model accuracy on FGSM-perturbed images."""
    correct = 0
    total = 0
    for images, labels in test_loader:
        images.requires_grad = True

        # Generate adversarial examples
        adv_images = fgsm_attack(model, images, labels, epsilon)

        # Evaluate on perturbed images
        outputs = model(adv_images.detach())
        _, predicted = outputs.max(1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
    return 100 * correct / total
```

Baseline (ε=0): 91.23% accuracy
Attack Results:
- ε=0.02: 78.4% accuracy (-12.8%)
- ε=0.05: 52.1% accuracy (-39.1%)
- ε=0.1: 23.7% accuracy (-67.5%) ✓ (meets <25% requirement)
- ε=0.2: 8.9% accuracy (-82.3%)
- ε=0.3: 5.2% accuracy (-86.0%)
Key Finding: At ε=0.1, the model's accuracy drops from 91% to 24%, demonstrating vulnerability to adversarial attacks.
1. Flipped Predictions: Images correctly classified before attack but misclassified after:
- Visual differences are barely perceptible to humans
- Model confidently predicts wrong classes
- Demonstrates adversarial perturbations' effectiveness
2. Adversarial Confusion Matrix: Shows systematic misclassifications:
- Certain class confusions are more common
- Attack causes non-uniform error distribution
- Some digits more vulnerable than others
3. Perceptibility Analysis:
Human Perception Threshold:
- ε ≤ 0.05: Perturbations imperceptible to humans, images look unchanged
- ε = 0.1: Slight noise visible but digits still clearly recognizable
- ε ≥ 0.2: Obvious visual degradation, salt-and-pepper noise visible
- ε ≥ 0.3: Significant noise, harder for humans to classify
Key Insight: The "sweet spot" for adversarial attacks is ε≈0.05-0.1, where perturbations fool the model but remain imperceptible to humans.
Adversarial Training improves model robustness by including adversarial examples in the training process.
Approach: For each mini-batch:
- Generate adversarial examples using FGSM
- Concatenate original and adversarial examples
- Train on combined dataset
- Model learns to be robust against perturbations
```python
def train_network_adversarial(num_epochs=10, epsilon=0.1):
    for epoch in range(num_epochs):
        for images, labels in train_loader:
            # Generate adversarial examples
            images.requires_grad = True
            adv_images = fgsm_attack(model, images, labels, epsilon)

            # Combine original and adversarial examples
            combined_images = torch.cat([images, adv_images])
            combined_labels = torch.cat([labels, labels])

            # Train on the combined batch
            outputs = model(combined_images)
            loss = criterion(outputs, combined_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

- Epochs: 10
- Epsilon: 0.1 (same as attack)
- Effective batch size: 2× original (includes adversarial examples)
- All other hyperparameters unchanged
Clean Test Accuracy:
- Before adversarial training: 91.23%
- After adversarial training: 88.15% (-3.1%)
Adversarial Test Accuracy (ε=0.1):
- Before adversarial training: 23.7%
- After adversarial training: 73.4% (+49.7%) ✓ (exceeds 70% requirement)
Trade-off:
- Small decrease in clean accuracy (~3%)
- Massive increase in adversarial robustness (~50%)
- Excellent cost-benefit ratio
After Adversarial Training:
- Much stronger diagonal elements
- Fewer systematic confusions
- More balanced error distribution
- Model maintains structure even under attack
Observations:
- Certain digit pairs still confusing (3↔8, 4↔9)
- Overall misclassification rate dramatically reduced
- Model generalizes better across perturbation space
1. Vulnerability vs Robustness:
- Standard training optimizes for clean data only
- Models develop brittle decision boundaries
- Small perturbations can cross boundaries easily
2. Adversarial Training Benefits:
- Smooths decision boundaries
- Increases margin between classes
- Model learns invariant features
- Generalizes to perturbed inputs
3. Trade-offs:
- Slight decrease in clean accuracy acceptable
- Significant robustness gain justifies cost
- Critical for safety-critical applications
4. Practical Applications:
- Security systems (face recognition)
- Autonomous vehicles (sign detection)
- Medical imaging (tumor detection)
- Any system vulnerable to adversarial inputs
Implement self-supervised contrastive learning to create meaningful image embeddings without labeled data.
Tiny ImageNet:
- Subset of ImageNet
- 200 classes
- 96×96 pixel images
- Training split: ~100,000 images
- Test split: ~10,000 images
Concept: Learn representations by maximizing agreement between differently augmented views of the same image.
Pipeline:
- Data Augmentation: Create two random views (x_i, x_j) of same image
- Encoder: CNN extracts features (h_i, h_j)
- Projection Head: MLP maps to embedding space (z_i, z_j)
- Contrastive Loss: Maximize similarity of positive pairs, minimize for negative pairs
Key Idea: Images are their own labels! Augmented versions should have similar embeddings.
Training Augmentations (Aggressive):
- Random resized crop (96×96)
- Random horizontal flip
- Color jitter (brightness, contrast, saturation, hue)
- Random grayscale (20% probability)
- Normalization: mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]
Test Augmentations (Minimal):
- Resize to 96×96
- Normalization only
Rationale:
- Strong augmentations force model to learn invariant features
- Two views of same image should be similar despite transformations
- Creates rich positive pairs for contrastive learning
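A possible two-view augmentation wrapper for this setup (exact jitter strengths are assumptions):

```python
from torchvision import transforms

simclr_tf = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

class TwoViews:
    """Return two independently augmented views of the same image."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, img):
        return self.transform(img), self.transform(img)
```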
Base Encoder:
- ResNet18 (pre-trained on ImageNet)
- Remove final classification layer
- Use as feature extractor
Projection Head:
Fully Connected Layers:
- Linear(512 → 256)
- BatchNorm1d(256)
- ReLU
- Linear(256 → 128)
- BatchNorm1d(128)
- ReLU
- Linear(128 → 64)  # Final embedding dimension

Design Choices:
- Pre-trained backbone for faster convergence
- Multi-layer projection head for better representations
- BatchNorm for stable training
- Output embedding: 64 dimensions
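A minimal sketch of the encoder-plus-projection-head design described above, using torchvision's ResNet18 weights API:

```python
import torch.nn as nn
from torchvision import models

class SimCLRModel(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()          # drop the classification head
        self.encoder = backbone              # outputs 512-d features
        self.projector = nn.Sequential(
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        return self.projector(self.encoder(x))
```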
Formula (NT-Xent loss for a positive pair $(i, j)$):

$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}$

Where:
- $z_i, z_j$: Embeddings of two augmented views (positive pair)
- $\text{sim}(u, v)$: Cosine similarity
- $\tau$: Temperature parameter (0.5)
- $N$: Batch size
Implementation:
```python
import torch
import torch.nn.functional as F

def nt_xent_loss(embeddings1, embeddings2, temp=0.5):
    batch_size = embeddings1.size(0)
    # Concatenate both views and L2-normalize so dot products are cosine similarities
    combined = F.normalize(torch.cat([embeddings1, embeddings2], dim=0), dim=1)
    # Compute the temperature-scaled similarity matrix
    similarity = torch.matmul(combined, combined.T) / temp
    # Mask out self-similarity so a view cannot be its own positive
    similarity.fill_diagonal_(float("-inf"))
    # The positive for row i is row i + batch_size (and vice versa)
    labels = torch.cat([torch.arange(batch_size) + batch_size,
                        torch.arange(batch_size)]).to(similarity.device)
    # Cross-entropy over the similarity rows implements the NT-Xent objective
    return F.cross_entropy(similarity, labels)
```

Key Properties:
- Pulls positive pairs together
- Pushes negative pairs apart
- Temperature controls softness of distribution
- Symmetric loss (both views treated equally)
Hyperparameters:
- Batch size: 256 (large for more negatives)
- Learning rate: 1e-3
- Optimizer: Adam
- Epochs: 2 (demonstration)
- Temperature: 0.5
Why Large Batch Size?
- More negative samples per batch
- Richer contrastive signal
- Better gradient estimates
- Critical for contrastive learning success
- Final Loss: <3.0 ✓ (meets requirement)
- Successfully converged in 2 epochs
- Loss decreased from ~6.5 to ~2.8
1. t-SNE Visualization:
- Projects 64D embeddings to 2D
- Displays test images at their embedding locations
- Reveals learned structure
Observations:
- Similar images cluster together
- Clear visual groupings emerge
- Unsupervised learning discovers semantic structure
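A short sketch of the t-SNE projection step with scikit-learn; the embeddings array here is a placeholder for the real test-set embeddings:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for (N, 64) test-set embeddings computed by the encoder
embeddings = np.random.randn(500, 64)

coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title("t-SNE of contrastive embeddings")
plt.show()
```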
2. Nearest Neighbor Analysis:
For 3 query images, find 5 nearest neighbors based on embedding distance:
- Measure: L2 distance in embedding space
- Expected: Visually similar images should be close
Results:
- Query: Dog → Neighbors: Similar dogs
- Query: Car → Neighbors: Other vehicles
- Query: Building → Neighbors: Similar architectures
Quality Assessment:
- Semantic similarity captured reasonably well
- Some neighbors very relevant, others less so
- 2 epochs insufficient for perfect embeddings
- Longer training would improve quality
Q1: Large or small batch size for contrastive learning?
A: A large batch size is strongly preferred because:
- More negative samples per image
- Richer contrastive signal
- Better gradient estimates
- Improved representation quality
- SimCLR paper uses batch sizes of 4096+
Q2: Evaluation metrics for unsupervised representation learning?
A: Several approaches:
- Linear Evaluation: Train linear classifier on frozen embeddings
- K-NN Classification: Use nearest neighbors in embedding space
- Clustering Metrics: Silhouette score, NMI
- Downstream Tasks: Fine-tune on specific tasks
- Embedding Quality: Inter/intra-class distances
Q3: Creating embeddings for the test set?
A: Only a forward pass through the encoder:
- No augmentation (or minimal)
- No contrastive loss computation
- Extract features directly
- Use for downstream tasks
Q4: Which augmentations for SimCLR?
| Augmentation | Use? | Reason |
|---|---|---|
| Random Crop | ✓ Yes | Creates diverse views, essential |
| Enlarge 128×128 | ✗ No | Changes scale, not semantic invariance |
| Random Rotation | ✓ Yes | Orientation invariance, useful |
| Gaussian Noise | ? Maybe | Small amounts okay, too much destroys info |
| Random Dimensions | ✗ No | Breaks aspect ratio, unnatural |
| Random Grayscale | ✓ Yes | Color invariance, used in SimCLR |
Best Practices:
- Use augmentations that preserve semantic content
- Avoid augmentations that destroy too much information
- Combination of geometric and photometric transforms
Current Limitations:
- Only 2 epochs trained
- Small embedding dimension (64)
- No fine-tuning evaluation
- Limited quantitative metrics
Potential Improvements:
- Train for 100+ epochs
- Larger batch sizes (512, 1024)
- Larger embedding dimension (128, 256)
- Momentum encoder (MoCo)
- Memory bank for more negatives
- Evaluate on downstream tasks
- Compare with supervised baseline
1. Self-Supervised Learning:
- No labels required for meaningful representations
- Data augmentation is the key
- Contrastive learning very effective
2. Representation Quality:
- Embeddings capture semantic similarity
- Unsupervised structure discovery
- Transferable to downstream tasks
3. Practical Applications:
- Pre-training for limited labeled data
- Transfer learning
- Image retrieval
- Clustering and organization
This project implements and analyzes Generative Adversarial Networks (GANs) with different latent space dimensions to generate realistic flower images.
GANs consist of two neural networks competing in a minimax game:
Generator (G):
- Input: Random noise vector $z \sim \mathcal{N}(0, 1)$
- Output: Synthetic image $G(z)$
- Goal: Generate realistic images to fool the discriminator
Discriminator (D):
- Input: Real or fake image
- Output: Probability image is real
- Goal: Distinguish real from fake images
Training Dynamics:
- D tries to maximize classification accuracy
- G tries to minimize D's ability to classify
- Equilibrium: G generates perfect fakes, D cannot distinguish
Specifications:
- 102 flower categories
- Test split: ~1,000 images
- Resolution: 64×64 pixels (resized)
- Natural images with high variation
Preprocessing:
- Resize to 64×64
- Convert to tensor [0, 1]
- Normalize: mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]
- Maps to [-1, 1] range (for Tanh output)
Architecture (z_dim → 64×64×3 image):
Input: Noise vector (batch, z_dim, 1, 1)
Block 1: ConvTranspose2d(z_dim → g_feat*8, kernel=4, stride=1)
- BatchNorm2d
- ReLU
- Output: (batch, 512, 4, 4)
Block 2: ConvTranspose2d(g_feat*8 → g_feat*4, kernel=4, stride=2)
- BatchNorm2d
- ReLU
- Output: (batch, 256, 8, 8)
Block 3: ConvTranspose2d(g_feat*4 → g_feat*2, kernel=4, stride=2)
- BatchNorm2d
- ReLU
- Output: (batch, 128, 16, 16)
Block 4: ConvTranspose2d(g_feat*2 → g_feat, kernel=4, stride=2)
- BatchNorm2d
- ReLU
- Output: (batch, 64, 32, 32)
Block 5: ConvTranspose2d(g_feat → 3, kernel=4, stride=2)
- Tanh activation
- Output: (batch, 3, 64, 64)

Key Features:
- Uses transposed convolutions for upsampling
- Batch normalization for stable training
- ReLU activations
- Tanh output for [-1, 1] range
- Progressively increases spatial dimensions
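A compact DCGAN-style generator sketch consistent with the block sizes above; `padding=1` on the strided blocks is an assumption needed to reach the stated output resolutions:

```python
import torch.nn as nn

def g_block(in_ch, out_ch, stride=2, padding=1):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=stride,
                           padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, z_dim=100, g_feat=64):
        super().__init__()
        self.net = nn.Sequential(
            g_block(z_dim, g_feat * 8, stride=1, padding=0),   # -> 4x4
            g_block(g_feat * 8, g_feat * 4),                   # -> 8x8
            g_block(g_feat * 4, g_feat * 2),                   # -> 16x16
            g_block(g_feat * 2, g_feat),                       # -> 32x32
            nn.ConvTranspose2d(g_feat, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),                                         # -> 64x64, range [-1, 1]
        )

    def forward(self, z):                  # z: (batch, z_dim, 1, 1)
        return self.net(z)
```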
Architecture (64×64×3 image → binary classification):
Input: Image (batch, 3, 64, 64)
Block 1: Conv2d(3 → d_feat, kernel=4, stride=2)
- LeakyReLU(0.2)
- Output: (batch, 64, 32, 32)
Block 2: Conv2d(d_feat → d_feat*2, kernel=4, stride=2)
- BatchNorm2d
- LeakyReLU(0.2)
- Output: (batch, 128, 16, 16)
Block 3: Conv2d(d_feat*2 → d_feat*4, kernel=4, stride=2)
- BatchNorm2d
- LeakyReLU(0.2)
- Output: (batch, 256, 8, 8)
Block 4: Conv2d(d_feat*4 → d_feat*8, kernel=4, stride=2)
- BatchNorm2d
- LeakyReLU(0.2)
- Output: (batch, 512, 4, 4)
Block 5: Conv2d(d_feat*8 → 1, kernel=4, stride=1)
- Output: (batch, 1, 1, 1)

Key Features:
- Strided convolutions for downsampling (no pooling)
- Batch normalization (except first layer)
- LeakyReLU for better gradients
- No activation on output (BCEWithLogitsLoss includes sigmoid)
Train GANs with three different latent space dimensions (z_dim) and compare their performance:
- z_dim = 10: Low-dimensional latent space
- z_dim = 100: Standard latent space (DCGAN default)
- z_dim = 500: High-dimensional latent space
Hyperparameters:
- Batch size: 128
- Learning rate: 2e-4
- Optimizer: Adam (β₁=0.5, β₂=0.999)
- Loss: BCEWithLogitsLoss
- Epochs: 50 (main experiments)
- Labels: real=1.0, fake=0.0
- Random seed: 42 (reproducibility)
Training Procedure:
```
For each epoch:
    For each batch:
        # Train Discriminator
        1. Forward real images → D → loss_real
        2. Generate fake images: z → G → fake
        3. Forward fake images → D → loss_fake
        4. loss_D = loss_real + loss_fake
        5. Backprop and update D

        # Train Generator
        6. Generate new fake images: z → G → fake
        7. Forward fake images → D (G's perspective)
        8. loss_G = -log(D(G(z)))   [want D to classify fakes as real]
        9. Backprop and update G
```

Observations (z_dim = 10):
- Loss Behavior:
  - Discriminator: Stable, moderate values (~1.0-1.5)
  - Generator: Higher variance, struggles initially
- Generated Images:
  - Limited diversity
  - Captures basic color schemes (yellows, reds, greens)
  - Lacks fine details and texture
  - Some mode collapse evident
- Quality: Poor to fair
- Diversity: Low (limited by small latent space)
Insight: 10 dimensions insufficient to capture flower image complexity. Latent space too constrained.
Observations (z_dim = 100):
- Loss Behavior:
  - Discriminator: Oscillates around 0.8-1.2
  - Generator: More stable than z_dim = 10
  - Healthy adversarial dynamics
- Generated Images:
  - Recognizable flower structures
  - Good color diversity
  - Reasonable texture details
  - Clear petals and flower centers
  - Better variety across samples
- Quality: Good
- Diversity: High
Insight: 100 dimensions provides excellent balance. Standard choice for DCGAN architecture validated.
Observations (z_dim = 500):
- Loss Behavior:
  - Discriminator: Sometimes collapses to near-zero
  - Generator: High variance, instability
  - Training dynamics less healthy
- Generated Images:
  - Highly variable quality
  - Some excellent samples
  - Some mode collapse
  - Occasional artifacts
  - Diversity high but less consistent
- Quality: Variable (good to excellent)
- Diversity: Very high but inconsistent
Insight: 500 dimensions gives G more capacity but training becomes unstable. May require longer training or different hyperparameters.
Training Step Level:
- All three models show characteristic GAN oscillations
- z=100 has smoothest curves
- z=10 shows G struggling more
- z=500 has highest variance
Epoch Level:
- z=100 most stable convergence
- z=10 plateaus early
- z=500 continues oscillating
Method:
- Sample two random latent vectors: $z_1, z_2 \sim \mathcal{N}(0,1)$
- Create interpolations: $z_{\alpha} = (1-\alpha)z_1 + \alpha z_2$ for $\alpha \in [0, 1]$
- Generate images: $G(z_{\alpha})$
- Visualize the smooth transition (see the sketch below)
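A minimal sketch of this interpolation procedure; `G` is assumed to be a trained generator loaded elsewhere:

```python
import torch

z_dim = 100                              # must match the generator being probed
z1 = torch.randn(1, z_dim, 1, 1)
z2 = torch.randn(1, z_dim, 1, 1)

alphas = torch.linspace(0, 1, steps=8)
with torch.no_grad():
    # Linear interpolation in latent space, one generated image per alpha
    frames = [G((1 - a) * z1 + a * z2) for a in alphas]
grid = torch.cat(frames, dim=0)          # (8, 3, 64, 64), ready for torchvision.utils.make_grid
```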
Results:
z_dim = 10:
- Abrupt transitions between images
- Limited smooth variation
- Latent space not well-structured
z_dim = 100:
- Smooth, gradual transitions
- Semantic morphing (one flower gradually becomes another)
- Well-structured latent space
- Colors, shapes, sizes change smoothly
z_dim = 500:
- Very smooth transitions
- Rich variations
- Sometimes unexpected transformations
- High-dimensional manifold well-explored
Insight: Interpolation reveals latent space structure. Smooth transitions indicate continuous, meaningful representations.
Method:
- Sample 500 latent vectors from the prior: $z \sim \mathcal{N}(0,1)$
- Apply PCA to reduce to 2D
- Scatter plot of the latent codes
Observations (all z_dims):
- Approximately circular/elliptical distribution
- Standard Gaussian structure preserved
- No obvious clusters (unsupervised generation)
- Uniform coverage of latent space
Insight: Latent codes maintain Gaussian properties. No inherent structure in latent space (structure emerges through generator mapping).
Visualization:
- Sample 3 random latent vectors
- Reduce to 2D with PCA
- Display corresponding generated images
Analysis:
- Nearby latent codes produce similar images
- Latent space has semantic meaning
- Generator learned smooth manifold
- Walking in latent space = walking in image space
Purpose: Investigate if longer training improves quality.
Results:
- Image Quality: Significantly improved
- Details: Finer texture, more realistic petals
- Consistency: More samples look realistic
- Diversity: Maintained or slightly increased
- Mode Collapse: Minimal
Loss Dynamics:
- Continued oscillation (healthy GAN behavior)
- No signs of convergence to equilibrium
- Both G and D still improving
Conclusion: Extended training beneficial. 50 epochs good, 100 epochs better. Diminishing returns likely beyond this point.
Saved Models:
```
generator_zdim10.pkl
discriminator_zdim10.pkl
generator_zdim100.pkl
discriminator_zdim100.pkl
generator_zdim500.pkl
discriminator_zdim500.pkl
generator_zdim100_100epochs.pkl
discriminator_zdim100_100epochs.pkl
```
Usage:
```python
import torch

# Load generator
G = torch.load('generator_zdim100.pkl')
G.eval()

# Generate images
z = torch.randn(16, 100, 1, 1)
fake_images = G(z)
```

z_dim = 10:
- ❌ Insufficient capacity
- ❌ Limited diversity
- ❌ Poor image quality
- Use Case: Proof of concept only
z_dim = 100:
- ✓ Optimal balance
- ✓ Stable training
- ✓ Good quality and diversity
- ✓ Standard choice validated
- Use Case: Production, research
z_dim = 500:
- ✓ High capacity
- ⚠️ Training instability
- ✓ Potential for excellent results
- ⚠️ Requires careful tuning
- Use Case: When quality >> stability
Healthy GAN Training:
- Oscillating losses (not converging to zero)
- D slightly ahead but not dominant
- G improving over time
- No mode collapse
Signs of Issues:
- D loss → 0 (too strong)
- G loss → ∞ (can't fool D)
- Mode collapse (same images)
- Gradient vanishing
For Best Results:
- Use z_dim=100 as baseline
- Train for 50-100 epochs minimum
- Monitor generated samples regularly
- Save checkpoints frequently
- Use learning rate scheduling if needed
Hyperparameter Sensitivity:
- Learning rate: 2e-4 works well (DCGAN paper)
- Beta1: 0.5 critical (momentum for GANs)
- Batch size: 128 good balance
- Architecture: DCGAN proven effective
Quantitative Metrics:
- FID (Fréchet Inception Distance): Measures distribution similarity
- IS (Inception Score): Measures quality and diversity
- Human evaluation: Ultimate test
Qualitative Assessment:
- Visual inspection
- Diversity check
- Interpolation smoothness
- Mode collapse detection
Image Generation:
- Data augmentation
- Art generation
- Design assistance
Latent Space:
- Image editing (latent code manipulation)
- Interpolation for smooth transitions
- Attribute manipulation
Research:
- Understanding generative models
- Exploring learned representations
- Benchmark for new GAN variants
Current Limitations:
- Small dataset (Flowers 102)
- Limited resolution (64×64)
- No conditional generation
- Single architecture tested
Potential Improvements:
- Progressive GAN (higher resolution)
- StyleGAN architecture
- Conditional GAN (class-specific)
- Wasserstein GAN (improved training)
- Spectral normalization
- Self-attention layers
Advanced Techniques:
- FID/IS metric computation
- Hyperparameter search
- Multiple runs for statistical significance
- Comparison with other generative models (VAE, diffusion)
- PyTorch 2.0+: Deep learning framework
- torchvision: Computer vision utilities and datasets
- NumPy: Numerical computations
- Pandas: Data manipulation (Project 2)
- PIL (Pillow): Image processing
- gensim: Word embeddings (Project 2)
- Matplotlib: Plotting and visualization
- Seaborn: Statistical visualizations
- scikit-learn: t-SNE, PCA, metrics, confusion matrices, label encoding
- tqdm: Progress bars
```bash
pip install torch torchvision
pip install numpy pandas matplotlib seaborn
pip install scikit-learn
pip install gensim pillow tqdm
pip install kagglehub  # For Project 3 dataset
```

Project 1:
```bash
cd Project1
jupyter notebook project1.ipynb
# Or run: python project1.py
```

Project 2:
```bash
cd Project2
jupyter notebook project2.ipynb
```

Project 3:
```bash
cd Project3
jupyter notebook project3.ipynb
```

Project 4:
```bash
cd Project4
jupyter notebook project4.ipynb
```

- Neural Network Fundamentals: Manual backpropagation, gradient computation
- CNN Architectures: Efficient design, depthwise-separable convolutions
- RNN/LSTM/GRU: Sequence modeling, sentiment analysis
- Adversarial Machine Learning: FGSM attacks, adversarial training
- Self-Supervised Learning: Contrastive learning, SimCLR
- Generative Models: GANs, latent space analysis
- Optimization: Learning rates, optimizers, convergence
- Regularization: Dropout, batch normalization, data augmentation
- Generalization: Overfitting, train-test gap, robustness
- Loss Functions: Cross-entropy, contrastive loss, adversarial loss
- Evaluation: Accuracy, confusion matrices, qualitative analysis
- Data preprocessing is critical
- Start simple, iterate complexity
- Monitor training curves religiously
- Use proper validation strategies
- Hyperparameter tuning is an art and science
- Visualize data and results at every stage
- Save checkpoints frequently
- Document experiments thoroughly
These projects were completed as part of the Machine Learning 2 course at Technion. Special thanks to the course instructors for designing comprehensive assignments that cover both theoretical foundations and practical applications of modern deep learning techniques.
These projects are for educational purposes as part of academic coursework.