# **Simpsons Faces Clustering and Mapping**

### **Data Source and Literature**
- **Data Source**: [Kaggle: Simpsons Faces Dataset](https://www.kaggle.com/datasets/kostastokis/simpsons-faces)  
- **Literature**: [Eigenfaces for Recognition by Turk and Pentland](https://www.face-rec.org/algorithms/PCA/jcn.pdf)  

---

### **Introduction and Motivation**
We cluster the faces of characters from *The Simpsons* and explore how to map a third-party face image to *Simpsons* faces at different angles. The project is a fun, creative way to apply machine learning techniques such as PCA and clustering while exploring how facial features carry over between visual styles. It could have real-world applications in entertainment, such as creating personalized avatars, and in the analysis of art styles.

---

### **Related Work**
The paper *"Eigenfaces for Recognition"* by Turk and Pentland connects to our project because it tackles challenges similar to ours, like dealing with different backgrounds, facial expressions, emotions, and head angles in facial data. In their work, PCA is used to extract the most important features of faces while minimizing the impact of these variations. We used the same approach in our project to focus on the key traits of *The Simpsons* characters, despite the variety in their facial expressions and orientations. It shows how PCA is a practical tool for simplifying complex datasets like ours.

Our project builds on PCA and k-means clustering, both commonly used in image analysis and dimensionality reduction and applied to facial recognition in work such as Turk and Pentland’s *Eigenfaces for Recognition*. Our focus differs from that work: we cluster *The Simpsons* faces and map third-party images to them, an application of these methods that we did not find in the work we reviewed.

---

### **Methods**
1. **Data Collection and Preprocessing**:
   - We gathered a dataset of images of *The Simpsons* characters, focusing on close-up facial shots for most of the images.
   - To standardize inputs, we resized all images to the same dimensions and, for our first PCA pass, converted them to grayscale to simplify feature extraction.
   - We then took color into account and generated a mean face from the color images.

2. **Feature Extraction**:
   - We used PCA (Principal Component Analysis) to reduce the dimensionality of the images while preserving key features that define the unique characteristics of each face.
   - After PCA, we retained the top components explaining 95% of the variance in the data.

3. **Clustering**:
   - We applied k-means clustering to group the faces of different characters based on the reduced feature set from PCA, and experimented with different numbers of clusters to find the best grouping.
   - To get clearer clusters, we reduced the number of components and narrowed the data: we manually selected 10 images for each top character, choosing consistent orientations and facial expressions to reduce potential blur.
   - We also tried cropping the images to contain only *Simpsons* faces, without backgrounds, to test the effect on clustering.

4. **Mapping Third-Party Images**:
   - We embedded external human face images in the PCA-transformed feature space and matched each embedding to a *Simpsons* face. The match was determined by the Euclidean distance between the two embeddings in that space.

5. **Evaluation**:
   - Clusters were visually inspected for cohesion and separation. We also tested the mapping with various third-party images to assess the alignment with *The Simpsons* style.
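
The feature extraction and mapping steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the helper names (`fit_pca`, `project`, `nearest_face`), the 32x32 image size, and the random array standing in for the flattened grayscale faces are all assumptions made for the example, and PCA is computed here through a plain SVD.

```python
import numpy as np

def fit_pca(X, variance_kept=0.95):
    """Fit PCA via SVD, keeping the fewest components that reach `variance_kept`."""
    mean_face = X.mean(axis=0)
    centered = X - mean_face
    # Rows of vt are the principal axes ("eigenfaces"), sorted by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    n_components = int(np.searchsorted(np.cumsum(explained), variance_kept) + 1)
    n_components = min(n_components, len(explained))
    return mean_face, vt[:n_components]

def project(X, mean_face, components):
    """Map flattened images into the PCA feature space."""
    return (X - mean_face) @ components.T

def nearest_face(query, gallery_features, mean_face, components):
    """Return the index of, and distance to, the closest gallery face (Euclidean)."""
    q = project(query[np.newaxis, :], mean_face, components)
    distances = np.linalg.norm(gallery_features - q, axis=1)
    return int(np.argmin(distances)), float(distances.min())

# Hypothetical stand-in data: 200 flattened 32x32 grayscale images in [0, 1].
rng = np.random.default_rng(0)
faces = rng.random((200, 32 * 32))

mean_face, components = fit_pca(faces, variance_kept=0.95)
features = project(faces, mean_face, components)

# A slightly noisy copy of image 17 plays the role of a third-party face.
query = faces[17] + 0.01 * rng.standard_normal(faces.shape[1])
idx, dist = nearest_face(query, features, mean_face, components)
print("components kept:", components.shape[0])
print("nearest gallery index:", idx, "distance:", round(dist, 3))
```

With real data, `faces` would hold the resized, flattened dataset images, and `query` would be an external face preprocessed in the same way. `mean_face.reshape(32, 32)` gives the mean face as an image.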

---

### **Results**
- **Clustering Outcomes**:
  - PCA effectively reduced the feature space, making clustering computationally efficient.
  - k-means clustering successfully grouped similar faces, with clusters generally aligning to key visual traits (e.g., hairstyle, eye shape). However, some clusters showed overlap, especially for characters with ambiguous or shared features.

- **Mapping Results**:
  - The mapping worked reasonably well for simpler third-party images but struggled with complex or high-detail faces. Cartoonish or stylized external faces tended to map more accurately than realistic photographs.

- **Challenges and Limitations**:
  - Clustering quality was highly dependent on the choice of k, and finding the optimal number of clusters required significant trial and error.
  - The PCA-based feature reduction occasionally discarded subtle but critical features, leading to some misclassifications.
  - Mapping results were inconsistent, with some faces being mapped to clusters that visually did not match well. This highlighted the limitations of using Euclidean distances in a reduced feature space for style translation.

- **What Worked**:
  - PCA and k-means provided a good foundation for clustering *The Simpsons* faces.
  - Using visual inspection to evaluate clusters offered valuable insights into how well the algorithm captured the unique characteristics of the characters.

- **What Didn’t Work**:
  - The generalization to third-party human images needs improvement. Techniques like deep learning-based embeddings or style-aware features could enhance the mapping accuracy.
  - The reliance on grayscale images likely contributed to the loss of nuanced details, especially color-based traits.
  - Clustering on the whole dataset without first removing potential non-*Simpsons* images reduced the quality of both PCA and clustering.
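
The trial and error involved in choosing k, noted under Challenges and Limitations, can be partly automated by scoring each candidate k. The sketch below uses scikit-learn's silhouette score as one possible criterion; we did not use it in the project, and the synthetic blobs only stand in for the PCA features of the faces.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for PCA features: 4 well-separated groups in 10 dimensions.
features, _ = make_blobs(
    n_samples=300, centers=4, n_features=10, cluster_std=1.0, random_state=0
)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

# A higher silhouette means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

On real face data the scores are usually much closer together than on synthetic blobs, so the result is best treated as a starting point for visual inspection rather than a final answer.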

---

### **Discussion**
- **Key Learnings**:
  We learned how PCA and k-means clustering simplify facial data and group faces by shared features despite variations such as expressions and head angles. However, PCA discarded too much detail and sometimes missed subtle traits, and our distance-based mapping was too simple to handle the style differences between third-party images and *Simpsons* faces. The result was overlapping clusters and inconsistent mappings for external images.

- **Next Steps**:
  - Expand the dataset to include more *Simpsons* characters for better cluster diversity, and manually remove all the non-*Simpsons* images.
  - Develop an interactive tool for real-time face-to-cluster mapping, such as an application or website where users can create *Simpsons* faces at different angles and with different features.

- **Improvements**:
  - **Feature Extraction**: Use color images or texture-based features to better capture unique traits, rather than applying PCA directly and computing a mean face that ignores each character's distinct characteristics.
  - **Mapping**: Replace Euclidean distances with a neural network trained on cartoon styles for more accurate mappings.

- **Extensions**:
  - **Deep Learning for Mapping**: Use CNNs for style-specific mapping, addressing inconsistencies.
  - **Advanced Clustering**: Apply more advanced methods like hierarchical clustering to handle variability in features more effectively.
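
As a starting point for the hierarchical clustering extension, a minimal sketch with scikit-learn's `AgglomerativeClustering` could look like the following. The synthetic features and the choice of three clusters with Ward linkage are assumptions for illustration only.

```python
from collections import Counter

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the PCA features of the face images.
features, _ = make_blobs(n_samples=150, centers=3, n_features=10, random_state=1)

# Ward linkage merges, at each step, the pair of clusters whose merge
# increases the within-cluster variance the least.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(features)

# Cluster sizes give a quick sanity check before visual inspection.
cluster_sizes = Counter(labels.tolist())
print(cluster_sizes)
```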

---

### **Grading Scheme Highlights**
- Do you know why you are doing the project?  
- Can you relate it to work from the class?  
- How hard was what you did/tried to do?  
- What did you learn?  
- How much work did you do?  
- Can you clearly present the motivation and what you did/tried to do?  
- Did you get something to work?  
- Did you think deeply about the results or problems with the algorithm?  
- Are you aware of the strengths and limitations of your approach?  
- How much do you understand about the problem, algorithm, and/or approach?
