<a href="https://colab.research.google.com/github/hawa1983/DATA-612/blob/main/Research_Discussion_Assignment_2_Fomba_Kassoh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Research Discussion Assignment 2: Fomba Kassoh

* **Title:** *Music Recommendations at Scale with Spark*
* **Presenter:** Christopher Johnson
* **GitHub Notebook:**

### Summary and Key Insights

In this Spark Summit presentation, Christopher Johnson, a machine learning engineer at Spotify, tackles two challenges: *building effective recommendation systems* and *scaling them across massive datasets*. The first half of the talk walks through the mathematical foundations of recommendation algorithms, while the second half explores the engineering hurdles of running these systems in production with Spark.

### Part 1: Mathematical Foundations

1. **Collaborative Filtering & Matrix Factorization:**

   * The speaker explains that Spotify relies on *implicit feedback* rather than explicit ratings: instead of 1–5 star scores, it uses user interactions such as song plays.
   * The recommendation problem is framed as a *matrix factorization* task, decomposing the user-item interaction matrix into two lower-dimensional latent matrices.
   * The loss function incorporates confidence weights, so frequently played songs count for more than songs played once or never.

2. **Alternating Least Squares (ALS):**

   * ALS solves for user and item factors iteratively: it fixes one set of factors and solves for the other via *weighted ridge regression*, then swaps the two roles.
   * With one side fixed, each user (or item) vector can be solved independently of the others, so the method parallelizes well across a cluster and suits big data environments.
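
To make the two ideas above concrete, here is a minimal single-machine sketch of implicit-feedback ALS in plain Python. It assumes the commonly used formulation in which a play count `r` becomes a binary preference `p = 1 if r > 0 else 0` with confidence `c = 1 + alpha * r`; the exact loss used in the talk may differ. The toy `plays` matrix, the function names, and the hyperparameter values are illustrative assumptions, and a real system would use NumPy or Spark MLlib rather than the hand-written solver.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def update_factors(counts, fixed, k, alpha, lam):
    """One weighted ridge regression per row of `counts`, holding `fixed` constant."""
    new = []
    for row in counts:
        # Normal equations: (sum_j c_j y_j y_j^T + lam * I) x = sum_j c_j p_j y_j
        A = [[lam if a == b else 0.0 for b in range(k)] for a in range(k)]
        rhs = [0.0] * k
        for j, r in enumerate(row):
            c = 1.0 + alpha * r        # confidence weight (assumed form)
            p = 1.0 if r > 0 else 0.0  # binary preference
            y = fixed[j]
            for a in range(k):
                rhs[a] += c * p * y[a]
                for b in range(k):
                    A[a][b] += c * y[a] * y[b]
        new.append(solve(A, rhs))
    return new


def loss(counts, X, Y, alpha, lam):
    """Confidence-weighted squared error plus L2 regularization."""
    total = 0.0
    for u, row in enumerate(counts):
        for i, r in enumerate(row):
            c = 1.0 + alpha * r
            p = 1.0 if r > 0 else 0.0
            pred = sum(a * b for a, b in zip(X[u], Y[i]))
            total += c * (p - pred) ** 2
    return total + lam * sum(v * v for vec in X + Y for v in vec)


def implicit_als(counts, k=2, alpha=10.0, lam=0.1, iters=10, seed=0):
    rng = random.Random(seed)
    n_users, n_items = len(counts), len(counts[0])
    X = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Y = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    counts_t = [list(col) for col in zip(*counts)]  # item-major view
    history = []
    for _ in range(iters):
        X = update_factors(counts, Y, k, alpha, lam)    # fix items, solve users
        Y = update_factors(counts_t, X, k, alpha, lam)  # fix users, solve items
        history.append(loss(counts, X, Y, alpha, lam))
    return X, Y, history


# Toy play counts (rows = users, columns = songs); values are made up.
plays = [
    [5, 3, 0, 0],
    [4, 0, 0, 1],
    [0, 0, 6, 4],
    [0, 1, 5, 3],
]
X, Y, history = implicit_als(plays)
```

Because each half-step solves its ridge regressions exactly, the objective recorded in `history` should not increase from one sweep to the next. Each row's solve touches only that row's counts and the fixed factors, which is the property that lets the work be spread across executors.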

### Part 2: Industrial-Scale Engineering with Spark

1. **Transition from Hadoop to Spark:**

   * Hadoop-based matrix factorization had to read from and write to disk multiple times per iteration, a severe performance bottleneck.
   * Spark, by enabling *in-memory caching*, dramatically reduces I/O overhead, accelerating iterative computation.

2. **Three Optimization Attempts in Spark:**

   * **Attempt 1 – Full Broadcast:**
     Every executor receives all item vectors. It works, but it incurs *heavy network overhead* and does not cache the ratings.

   * **Attempt 2 – Full Gridify:**
     The ratings matrix is blocked into submatrices (user × item blocks), and only the required vectors are sent to each block. Ratings are cached, which improves performance.

   * **Attempt 3 – Half Gridify (used by MLlib):**
     Each block holds all ratings for a set of users. Grouping ratings with their users reduces shuffling. However, it may demand *high memory* if the users in a block collectively interact with nearly all items.

3. **Performance Results:**

   * The *Half Gridify* method outperformed others on a dataset with 4M users and 500K artists, using Spark with 200 executors.
   * Compared to Hadoop, Spark offered up to a *10x speedup*.
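
The blocking idea behind the gridify attempts above can be illustrated with a small single-machine sketch. This is not Spark code and not MLlib's implementation: it only counts how many item vectors would have to be shipped when every block receives everything versus when ratings are grouped by user block and each block receives only the item vectors it needs. The function names, the modulo partitioner, and the toy `(user, item, play_count)` triples are all assumptions made for illustration.

```python
from collections import defaultdict


def vectors_sent_full_broadcast(ratings, num_blocks):
    """Full broadcast: every block receives every item vector."""
    items = {item for _, item, _ in ratings}
    return num_blocks * len(items)


def half_gridify(ratings, num_blocks):
    """Group ratings by user block and record the item vectors each block needs."""
    ratings_by_block = defaultdict(list)
    items_needed = defaultdict(set)
    for user, item, count in ratings:
        block = user % num_blocks  # hypothetical partitioner: user id modulo block count
        ratings_by_block[block].append((user, item, count))
        items_needed[block].add(item)
    return ratings_by_block, items_needed


# Toy (user, item, play_count) triples; values are made up.
ratings = [
    (0, 10, 3), (0, 11, 1),
    (1, 12, 7),
    (2, 10, 2), (2, 13, 4),
    (3, 12, 1), (3, 14, 5),
]

blocks, needed = half_gridify(ratings, num_blocks=2)
broadcast_cost = vectors_sent_full_broadcast(ratings, num_blocks=2)
gridify_cost = sum(len(items) for items in needed.values())

print("full broadcast sends", broadcast_cost, "item vectors")
print("half gridify sends  ", gridify_cost, "item vectors")
```

On this toy data the two user blocks happen to touch disjoint item sets, so grouping by user halves the number of vectors shipped. If every block's users listened to every item, the two costs would be equal, which is the memory and network caveat noted for Attempt 3.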

### Lessons Learned & Practical Considerations

* **Serialization Matters:** Kryo serialization is much faster than Java’s default, but it often requires *custom serializers*.
* **Memory Management:** Running on full datasets sometimes led to *executor failures* due to resource constraints.
* **Not Yet Production Ready:** At the time, Spotify still relied on Hadoop for production, using Spark mainly for experimentation due to tuning difficulties and instability with large-scale runs.

### Reflection

This talk elegantly demonstrates the balance between *theoretical machine learning* and *practical system design*. The speaker's candid discussion of failed attempts, memory bottlenecks, and tuning frustrations gives real-world insight into deploying ML at scale. Particularly interesting was how Spark’s abstraction simplifies iterative matrix factorization, making it far more feasible than the disk-heavy Hadoop model.

For practitioners, this case study underscores:

* The importance of algorithm choice when dealing with implicit feedback.
* Why caching, partitioning, and minimizing data shuffling are crucial in large-scale ML pipelines.
* That "correct" algorithms often require just as much *engineering finesse* as mathematical rigor to become viable in production.