# Shazam Algorithm and System Design Notes for ML System Design Classes

### Key Points
- Shazam identifies songs by creating unique audio fingerprints from short audio samples, likely using signal processing techniques rather than heavy machine learning.
- The process involves recording audio, applying Fourier transforms to create spectrograms, generating hashes from significant peaks, and matching them against a database.
- The system is designed to be robust to noise, scalable for millions of users, and efficient in delivering fast results.
- While primarily used for music recognition, the technology can be applied to voice identification, plagiarism detection, and other audio-based applications.
- The evidence leans toward Shazam relying on traditional algorithms like Fourier transforms and hashing, though modern enhancements might incorporate machine learning for specific tasks.

### Overview of Shazam
Shazam is a mobile application that identifies music, movies, and other media by analyzing short audio clips (typically 5-10 seconds) captured via a device's microphone. Launched in 2002, it uses audio fingerprinting to match user-recorded audio against a database of known songs, providing details like song title, artist, and lyrics. This process is a classic example of system design in machine learning, balancing accuracy, scalability, and speed.

### How It Works
When you record a song snippet with Shazam, the app processes the audio to create a unique fingerprint based on its frequency patterns. This fingerprint is compared to a vast database to find a match, even if the audio is noisy or recorded in a busy environment. The system then retrieves and displays metadata about the matched song, such as the artist and album.

### Why It’s Effective
Shazam’s algorithm is designed to handle real-world challenges like background noise or partial audio. It uses mathematical techniques to ensure fast and accurate matching, making it reliable for millions of users worldwide. Its scalability and efficiency make it a great case study for system design.

### Applications Beyond Music
The technology behind Shazam can be used for tasks like identifying speakers in meetings, detecting copyrighted content in videos, or recognizing specific sounds in security systems, showcasing its versatility in audio processing.

---

# Detailed Notes for ML System Design Classes: Shazam Algorithm and System Design

These notes provide a comprehensive overview of Shazam’s algorithm and system design, tailored for ML System Design classes. They are structured for clarity, using Markdown, and include detailed explanations, examples, code snippets, and flowcharts to aid understanding. The content is based on a lecture transcription and supplemented with insights from reliable sources, ensuring technical accuracy while being accessible to freshers.

## 1. Introduction to Shazam

### What is Shazam?
Shazam is a mobile application that identifies music, movies, advertising, and television shows by analyzing short audio samples (5-10 seconds) captured via a device’s microphone ([Toptal: Shazam Algorithm](https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition)). Launched in 2002, it has become a leading tool for music discovery, allowing users to identify songs playing in their environment and access metadata like artist, album, and lyrics.

### Relevance to ML System Design
Shazam’s functionality relies on audio fingerprinting, signal processing, and scalable database design, making it a prime example for studying ML system design. It demonstrates how to handle large-scale data, ensure low-latency responses, and maintain robustness against noise, all critical aspects of designing real-world ML systems.

## 2. Historical Context

### Shazam’s Origins (2002-2003)
Shazam was launched in 2002, and its algorithm was detailed in 2003 by inventor Avery Li-Chung Wang ([Columbia: Shazam Paper](http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf)). In 2003, machine learning was not as prevalent as today, so Shazam relied on signal processing techniques, particularly the Fast Fourier Transform (FFT), to create audio fingerprints. This approach was computationally efficient and effective for the technology available at the time.

## 3. Core Concept: Audio Fingerprinting

### Definition and Importance
Audio fingerprinting involves creating a compact, unique representation of an audio signal that can be matched against a database to identify the source ([ACRCloud: Shazam Technology](https://blog.acrcloud.com/how-does-shazam-work)). Similar to human fingerprints, audio fingerprints capture distinctive features of a song, enabling fast and accurate identification even with short or noisy samples. In ML system design, fingerprinting is a key technique for efficient pattern matching.

### Example
Imagine hearing a song in a noisy café. Shazam records a 10-second clip, extracts its unique frequency patterns, and matches them to a database, identifying the song despite background chatter.

## 4. Functional Requirements

### System Capabilities
Shazam must support the following:
- **Audio Input:** Capture 5-10 seconds of audio via the device’s microphone.
- **Song Identification:** Accurately identify the song from the audio snippet.
- **Metadata Display:** Provide details like song title, artist, album, and lyrics.
- **No Match Handling:** Inform users if no match is found.
- **User History:** Store a history of identified songs for user reference.

## 5. Non-Functional Requirements

### Performance Goals
To ensure a robust user experience, Shazam must meet these non-functional requirements:
- **Accuracy:** High accuracy despite noise or partial audio.
- **Latency:** Response times within seconds.
- **Scalability:** Handle millions of simultaneous queries.
- **Availability:** Maintain high uptime for reliability.
- **Efficiency:** Optimize computational and storage resources.
- **Updateability:** Regularly update the database with new songs.

## 6. Data Types

### Types of Data Handled
Shazam processes three main data types:
- **Audio Data:** Raw user audio (noisy, varying loudness) and reference songs (clean, high-quality).
- **Fingerprint Data:** Hashes and timestamps representing unique audio patterns.
- **Metadata:** Song details (artist, album, lyrics) stored in databases.

### Data Characteristics
User audio may include background noise, varying volumes, or distortions, requiring robust processing. Reference songs are typically high-quality digital files.
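
The three data types above can be pictured as simple records. The sketch below is illustrative only: the class and field names (`AudioClip`, `Fingerprint`, `SongMetadata`, `anchor_time`, and so on) are assumptions made for these notes, not Shazam's actual schema.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AudioClip:
    """Raw audio: mono samples plus the rate they were captured at."""
    samples: Sequence[float]
    sample_rate: int

    @property
    def duration_seconds(self) -> float:
        return len(self.samples) / self.sample_rate


@dataclass(frozen=True)
class Fingerprint:
    """One hash and where it occurs in a song (hypothetical field names)."""
    hash_value: str
    anchor_time: float  # seconds from the start of the track
    song_id: str


@dataclass
class SongMetadata:
    """Descriptive details returned to the user after a match."""
    song_id: str
    title: str
    artist: str
    album: str = ""
    lyrics: str = ""
```

Keeping fingerprints and metadata in separate record types mirrors the split between the fingerprint store and the metadata store: they are joined only by `song_id`.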

## 7. Algorithmic Process

### How Shazam Identifies Songs
Shazam’s algorithm involves six key steps, detailed below with technical insights and code examples.

#### Step 1: Record Raw Audio
- **Description:** Capture 5-10 seconds of audio at 44,100 Hz. By the Nyquist-Shannon sampling theorem, this rate can represent frequencies up to 22,050 Hz, which covers the roughly 20,000 Hz upper limit of human hearing ([Toptal: Shazam Algorithm](https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition)).
- **Example Code:**
```python
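import wave
import pyaudio

FORMAT = pyaudio.paInt16   # 16-bit samples
CHANNELS = 1               # mono
RATE = 44100               # samples per second
CHUNK = 1024               # frames read per buffer
SECONDS = 10

audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)
frames = []
for _ in range(int(RATE / CHUNK * SECONDS)):  # Record for 10 seconds
    frames.append(stream.read(CHUNK))
stream.stop_stream()
stream.close()
sample_width = audio.get_sample_size(FORMAT)
audio.terminate()

# Save the clip so later steps can read it from 'audio.wav'
with wave.open('audio.wav', 'wb') as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))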
```
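
To make the sampling-rate reasoning concrete, here is a small sketch that needs no microphone. It samples pure tones at 44,100 Hz and reports the strongest frequency the FFT sees. The helper name `dominant_frequency` is made up for this example.

```python
import numpy as np

RATE = 44100          # samples per second, as in Step 1
NYQUIST = RATE / 2    # highest representable frequency: 22050.0 Hz


def dominant_frequency(tone_hz: float, rate: int = RATE) -> float:
    """Sample a pure tone for one second and return the strongest FFT frequency."""
    t = np.arange(rate) / rate                    # one second of sample times
    samples = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    peak_bin = int(np.argmax(spectrum))
    return peak_bin * rate / len(samples)         # convert bin index to Hz


print(dominant_frequency(440))     # below Nyquist: 440.0
print(dominant_frequency(30000))   # above Nyquist: aliases to 14100.0
```

A tone above the Nyquist limit is not lost; it folds back to a wrong frequency (here 44,100 - 30,000 = 14,100 Hz), which is why the sampling rate must be at least twice the highest frequency of interest.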

#### Step 2: Clean Audio
- **Description:** Apply minimal noise reduction and normalization to handle varying audio quality. Shazam’s algorithm is designed to be robust to noise, reducing the need for extensive cleaning.
- **Example Code:**
```python
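import numpy as np
from scipy.io import wavfile
fs, audio_data = wavfile.read('audio.wav')
audio_data = audio_data.astype(np.float64)  # float avoids int16 overflow in np.abs
if audio_data.ndim > 1:
    audio_data = audio_data.mean(axis=1)  # mix stereo down to mono
peak = np.max(np.abs(audio_data))
audio_data = audio_data / peak if peak > 0 else audio_data  # normalize to [-1, 1]; leave silence as-is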
```

#### Step 3: Fourier Transform
- **Description:** Apply the Fast Fourier Transform (FFT) to short, overlapping windows of the time-domain signal (a short-time Fourier transform). Stacking the per-window spectra yields a spectrogram showing how frequency content changes over time ([Steemit: Music Recognition](https://steemit.com/technology/%40phenom/how-does-shazam-work-let-s-understand-music-recognition-algorithms-together)).
- **Example Code:**
```python
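from scipy.signal import spectrogram
# Use the sample rate read from the file (fs) rather than a hard-coded value
f, t, Sxx = spectrogram(audio_data, fs=fs, nperseg=4096, noverlap=2048)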
```
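
To show what a spectrogram routine does internally, the following is a minimal hand-rolled version using only NumPy. It is a sketch for intuition, not a replacement for `scipy.signal.spectrogram`; the function name `stft_magnitude` and the default `window_size` and `hop` values are assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(samples: np.ndarray, rate: int, window_size: int = 4096, hop: int = 2048):
    """Return (freqs, times, magnitudes); magnitudes has shape (n_freqs, n_frames).

    Assumes len(samples) >= window_size.
    """
    window = np.hanning(window_size)  # taper each frame to reduce spectral leakage
    starts = range(0, len(samples) - window_size + 1, hop)
    frames = np.array([samples[s:s + window_size] * window for s in starts])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(window_size, d=1 / rate)
    times = np.array([(s + window_size / 2) / rate for s in starts])
    return freqs, times, magnitudes


# Synthetic check: a 1 kHz tone should dominate every frame.
rate = 44100
t = np.arange(rate * 2) / rate
tone = np.sin(2 * np.pi * 1000 * t)
freqs, times, mags = stft_magnitude(tone, rate)
strongest = freqs[np.argmax(mags, axis=0)]  # strongest frequency in each frame
```

With a 4096-sample window at 44,100 Hz, each frequency bin is about 10.8 Hz wide, so `strongest` lands on the bin nearest 1 kHz rather than exactly on it. Larger windows give finer frequency resolution at the cost of coarser time resolution.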

#### Step 4: Create Fingerprints
- **Description:** Identify peaks in the spectrogram (landmarks) in specific frequency intervals (e.g., 30-40 Hz, 40-80 Hz), compute relative differences, and generate hashes with a fuzz factor for noise tolerance ([TechAhead: Shazam Recognition](https://www.techaheadcorp.com/blog/decoding-shazam-how-does-music-recognition-work-with-shazam-app/)).
- **Example Code:**
```python
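from scipy.ndimage import maximum_filter
# find_peaks is 1-D only; for the 2-D spectrogram, keep bins that equal their neighborhood maximum
is_peak = (Sxx == maximum_filter(Sxx, size=20)) & (Sxx > Sxx.mean())  # tune size and threshold as needed
freq_idx, time_idx = is_peak.nonzero()  # row = frequency bin, column = time frame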
```
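
Peak detection alone does not yet produce hashes. The sketch below shows one way to turn peaks into fingerprints by pairing each peak with a few later peaks and hashing the two quantized frequencies plus their time gap. The peak representation `(time_frame, frequency_bin)`, the parameters `fan_out`, `max_dt`, and `fuzz`, and the use of truncated SHA-1 are all assumptions made for illustration, not Shazam's exact scheme.

```python
import hashlib
from typing import List, Tuple

Peak = Tuple[int, int]  # (time_frame, frequency_bin), hypothetical representation


def hash_peaks(peaks: List[Peak], fan_out: int = 5, max_dt: int = 200, fuzz: int = 2) -> List[Tuple[str, int]]:
    """Pair each anchor peak with up to `fan_out` later peaks and hash each pair.

    Returns (hash, anchor_time) tuples. Frequencies are rounded down to a
    multiple of `fuzz` so that small shifts map to the same hash.
    """
    ordered = sorted(peaks)
    fingerprints = []
    for i, (t1, f1) in enumerate(ordered):
        for t2, f2 in ordered[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if dt > max_dt:
                continue  # skip pairs that are too far apart in time
            q1, q2 = f1 - f1 % fuzz, f2 - f2 % fuzz
            key = f"{q1}|{q2}|{dt}".encode()
            fingerprints.append((hashlib.sha1(key).hexdigest()[:16], t1))
    return fingerprints


peaks = [(0, 40), (3, 81), (7, 120), (12, 45)]
shifted = [(t + 100, f) for t, f in peaks]  # same audio, heard 100 frames later
print([h for h, _ in hash_peaks(peaks)] == [h for h, _ in hash_peaks(shifted)])  # True
```

Because each hash stores only the time difference between the two peaks, the same passage produces the same hashes no matter where in the recording it appears. The anchor time is kept alongside the hash so that the matching step can check alignment.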

#### Step 5: Build Database
- **Description:** Store hashes with timestamps and song IDs in a scalable NoSQL database like Cassandra or HBase for fast lookups.
- **Example Code (SQLite used here as a simple local stand-in for the NoSQL store):**
```python
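import sqlite3
conn = sqlite3.connect('fingerprints.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS fingerprints
             (hash TEXT, time REAL, song_id TEXT)''')
# Index the hash column so lookups do not scan the whole table
c.execute("CREATE INDEX IF NOT EXISTS idx_hash ON fingerprints (hash)")
# Placeholder values; in practice these come from Step 4
hash_value, time, song_id = "a1b2c3", 12.5, "song_001"
c.execute("INSERT INTO fingerprints VALUES (?, ?, ?)", (hash_value, time, song_id))
conn.commit()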
```
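
Conceptually, the fingerprint store is an inverted index from each hash to the places it occurs. The following in-memory sketch uses a plain dictionary as a stand-in for a distributed key-value store. The class name `FingerprintIndex` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class FingerprintIndex:
    """Toy in-memory index: hash -> [(song_id, time_in_song), ...]."""

    def __init__(self) -> None:
        self._table: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def add_song(self, song_id: str, fingerprints: Iterable[Tuple[str, float]]) -> None:
        for hash_value, time_in_song in fingerprints:
            self._table[hash_value].append((song_id, time_in_song))

    def lookup(self, hash_value: str) -> List[Tuple[str, float]]:
        # .get avoids creating empty entries for unknown hashes
        return list(self._table.get(hash_value, []))


index = FingerprintIndex()
index.add_song("song-a", [("h1", 0.0), ("h2", 1.5)])
index.add_song("song-b", [("h2", 4.0)])
print(index.lookup("h2"))  # [('song-a', 1.5), ('song-b', 4.0)]
```

The access pattern is a point lookup by key with no joins, which is why a key-value or wide-column store suits this workload.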

#### Step 6: Matching Process
- **Description:** Generate hashes from query audio, search the database for matches, and verify timing consistency to identify the song ([MakeUseOf: Shazam Accuracy](https://www.makeuseof.com/how-does-shazam-work/)).
- **Example Code:**
```python
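from collections import Counter
# compute_hashes is a placeholder for Steps 1-4, assumed to return (hash, query_time) pairs
query_hashes = compute_hashes(query_audio)
votes = Counter()
for h, query_time in query_hashes:
    result = c.execute("SELECT song_id, time FROM fingerprints WHERE hash = ?", (h,))
    for song_id, db_time in result.fetchall():
        # Hashes from the true song share one consistent time offset
        votes[(song_id, round(db_time - query_time, 1))] += 1
best = votes.most_common(1)  # [((song_id, offset), count)], or [] if nothing matched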
```
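
The timing-consistency check can be isolated into a small, database-free function. In the sketch below, the index is a plain dictionary and times are integer frame numbers. The function name `best_match` and the `min_votes` threshold are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def best_match(
    query: List[Tuple[str, int]],
    index: Dict[str, List[Tuple[str, int]]],
    min_votes: int = 3,
) -> Optional[Tuple[str, int, int]]:
    """Return (song_id, offset, votes) for the best-aligned song, or None."""
    votes: Counter = Counter()
    for hash_value, query_time in query:
        for song_id, song_time in index.get(hash_value, []):
            # A genuine match produces the same offset for many hashes
            votes[(song_id, song_time - query_time)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return (song_id, offset, count) if count >= min_votes else None


# Toy data: the query is "song-a", starting 10 frames into the track.
index = {
    "h1": [("song-a", 10), ("song-b", 3)],
    "h2": [("song-a", 12)],
    "h3": [("song-a", 15), ("song-b", 40)],
}
query = [("h1", 0), ("h2", 2), ("h3", 5)]
print(best_match(query, index))  # ('song-a', 10, 3)
```

Here `song-b` shares two hashes with the query, but at inconsistent offsets (3 and 35), so it never accumulates votes in a single bucket. Requiring a minimum vote count is also a simple way to implement the "no match" case.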

## 8. System Design Components

### Architecture Overview
Shazam’s system is designed for scalability and low latency, with the following components:
- **API Gateway:** Manages incoming requests from mobile apps.
- **Load Balancers:** Distribute requests across servers.
- **Fingerprint Conversion Servers:** Process audio to fingerprints.
- **Matching Servers:** Compare fingerprints to the database.
- **Databases:**
  - **Fingerprint Database:** NoSQL (e.g., Cassandra, HBase) for fast hash lookups.
  - **Metadata Database:** Relational (PostgreSQL) or NoSQL (MongoDB) for song details.
- **Response Routing:** Delivers results to the user’s app.
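
The request path through these components can be sketched as a single function with the services injected as callables. This is a simplified illustration: the function name `handle_identify_request`, the response fields, and the stub services are all hypothetical, and a real deployment would place network calls between each stage.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fingerprints = List[Tuple[str, int]]


def handle_identify_request(
    audio: bytes,
    fingerprint: Callable[[bytes], Fingerprints],
    match: Callable[[Fingerprints], Optional[str]],
    metadata_db: Dict[str, dict],
) -> dict:
    """Run one request through fingerprinting, matching, and metadata lookup."""
    hashes = fingerprint(audio)              # fingerprint conversion server
    song_id = match(hashes)                  # matching server + fingerprint database
    if song_id is None:
        return {"status": "no_match"}        # tell the user nothing was found
    details = metadata_db.get(song_id, {})   # metadata database
    return {"status": "ok", "song_id": song_id, **details}


# Stub services standing in for the real servers.
metadata = {"song-a": {"title": "Example Title", "artist": "Example Artist"}}
response = handle_identify_request(
    b"raw-audio-bytes",
    fingerprint=lambda audio: [("h1", 0)],
    match=lambda hashes: "song-a" if hashes else None,
    metadata_db=metadata,
)
print(response["status"], response["title"])  # ok Example Title
```

Passing the stages in as parameters keeps each one independently replaceable and testable, which reflects how the fingerprinting and matching tiers can be scaled separately.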

### Database Choices
| Database Type | Example | Use Case | Pros | Cons |
|---------------|---------|----------|------|------|
| NoSQL | Cassandra, HBase | Fingerprint storage | High scalability, fast lookups | Limited query flexibility |
| Relational | PostgreSQL | Metadata storage | Complex queries, structured data | Less scalable for massive datasets |
| NoSQL | MongoDB | Metadata storage | Flexible schemas, unstructured data | Limited join support; relationships are often handled in application code |

## 9. Robustness to Noise

### Handling Noisy Audio
Shazam’s algorithm is robust to noise due to:
- **Relative Differences:** Using pairs of peaks (frequency and time differences) rather than absolute values.
- **Fuzz Factor:** Quantizing hashes to tolerate minor distortions.
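- **Example:** In a noisy environment, some peaks may be masked and spurious ones added, but strong peaks such as 440 Hz and 880 Hz usually survive. The time gap between them is the same in the noisy clip as in the reference track, so hashes built from that pair still match, and the fuzz factor absorbs small shifts in the measured frequencies.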
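
The effect of the fuzz factor can be shown in a few lines. In this sketch, frequencies are integer bins and the fuzz value of 4 is arbitrary. The helper names `quantize` and `pair_key` are made up for the example.

```python
def quantize(freq_bin: int, fuzz: int = 4) -> int:
    """Round a frequency bin down to the nearest multiple of `fuzz`."""
    return freq_bin - (freq_bin % fuzz)


def pair_key(f1: int, f2: int, dt: int, fuzz: int = 4) -> tuple:
    """Key for a peak pair: quantized frequencies plus their time difference."""
    return (quantize(f1, fuzz), quantize(f2, fuzz), dt)


clean = pair_key(440, 880, dt=12)
noisy = pair_key(441, 882, dt=12)   # both peaks nudged slightly by noise
print(clean == noisy)               # True: both become (440, 880, 12)

far = pair_key(447, 880, dt=12)
print(clean == far)                 # False: 447 falls into the next bucket (444)
```

Quantization reduces sensitivity rather than eliminating it: a small shift that crosses a bucket boundary (for example 443 to 444) still changes the key. This is one reason matching relies on many hashes agreeing rather than on any single one.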

## 10. Scalability and Performance

### Scaling Strategies
- **Distributed Systems:** Fingerprint database is sharded across multiple servers.
- **Parallel Processing:** Queries are processed in parallel for speed.
- **Caching:** Popular songs’ fingerprints are cached to reduce database load.
- **CDNs:** Content Delivery Networks distribute metadata efficiently.
- **Performance:** The design targets millions of queries per day, with results returned within a few seconds.
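
Sharding and caching can be sketched together in a few lines. Each shard below is a plain dictionary standing in for one database node, and `functools.lru_cache` plays the role of the cache tier. The shard count, cache size, and function names are illustrative assumptions.

```python
import hashlib
from functools import lru_cache
from typing import Dict, List, Tuple

NUM_SHARDS = 4  # illustrative; a real deployment would size this from data volume

# Each shard is a plain dict here, standing in for one database node.
shards: List[Dict[str, List[Tuple[str, int]]]] = [dict() for _ in range(NUM_SHARDS)]


def shard_for(hash_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a fingerprint hash to a shard using a stable, process-independent hash."""
    digest = hashlib.sha256(hash_value.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def store(hash_value: str, song_id: str, time_in_song: int) -> None:
    shards[shard_for(hash_value)].setdefault(hash_value, []).append((song_id, time_in_song))


@lru_cache(maxsize=10_000)
def lookup(hash_value: str) -> Tuple[Tuple[str, int], ...]:
    """Cached lookup; returns a tuple so cached results cannot be mutated."""
    return tuple(shards[shard_for(hash_value)].get(hash_value, []))


store("h1", "song-a", 10)
print(lookup("h1"))  # (('song-a', 10),)
```

Python's built-in `hash()` is randomized per process for strings, so a digest-based function is used to keep shard assignment stable across servers. Cached entries go stale when new songs are indexed, so in this sketch `lookup.cache_clear()` would need to be called after each catalog update.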

## 11. Use Cases Beyond Shazam

### Additional Applications
Shazam’s fingerprinting technology can be applied to:
- **Voice Identification:** Identifying speakers in meetings (e.g., Google Meet note-taker).
- **Plagiarism Detection:** Detecting copyrighted music in videos (e.g., YouTube).
- **Environmental Monitoring:** Recognizing specific sounds like animal calls.
- **Security Systems:** Detecting alarms or other critical sounds.

### Example
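In a Google Meet scenario, a fingerprinting-style approach could associate speech with individual speakers to support automated note-taking. Note that song fingerprints match one specific recording, whereas a speaker never repeats an utterance exactly, so voice identification would likely need features that generalize across utterances rather than exact-match hashes.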

## 12. Comparison of Techniques

### Traditional vs. ML Approaches
- **Traditional (Shazam):** Relies on FFT and hashing, which are computationally efficient and scalable for song identification.
- **ML/Deep Learning:** Could enhance accuracy for complex scenarios (e.g., distinguishing similar songs) but is often overkill due to increased computational demands.
- **Trade-offs:** Traditional methods are faster and require less training data, while ML offers flexibility for nuanced tasks but at higher cost.

## 13. Flowchart of Shazam’s Process

```mermaid
graph TD
    A[Reference Song Audio] --> B[Clean Audio]
    B --> C[Fourier Transform]
    C --> D[Create Spectrogram]
    D --> E[Identify Landmarks]
    E --> F[Compute Hashes]
    F --> G[Store Fingerprints in Database]
    H[Query Audio] --> I[Clean Audio]
    I --> J[Fourier Transform]
    J --> K[Create Spectrogram]
    K --> L[Identify Landmarks]
    L --> M[Compute Hashes]
    M --> N[Search Database]
    G --> N
    N --> O[Find Matches]
    O --> P[Retrieve Metadata]
    P --> Q[Display Results]
```

## 14. Summary of Key Points
- **Core Idea:** Audio fingerprinting enables fast, accurate music recognition.
- **Algorithm:** Uses FFT, spectrograms, and hashing to create and match fingerprints.
- **System Design:** Scalable databases and distributed servers ensure performance.
- **Robustness:** Handles noise through relative differences and fuzz factors.
- **Applications:** Extends to voice identification, plagiarism detection, and more.

## Key Citations
- [How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing](https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition)
- [How Does Shazam Work? Let's Understand Music Recognition Algorithms Together](https://steemit.com/technology/%40phenom/how-does-shazam-work-let-s-understand-music-recognition-algorithms-together)
- [abracadabra: How does Shazam work?](https://www.cameronmacleod.com/blog/how-does-shazam-work)
- [How does the Shazam's algorithm work?](https://www.quora.com/How-does-the-Shazams-algorithm-work)
- [How does Shazam work](http://coding-geek.com/how-shazam-works/)
- [How does the Shazam app recognize music?](https://www.techaheadcorp.com/blog/decoding-shazam-how-does-music-recognition-work-with-shazam-app/)
- [How does Shazam work?](https://blog.acrcloud.com/how-does-shazam-work)
- [How does Shazam work? What is the logic behind Shazam tracing out the exact song by just a sample of it?](https://www.quora.com/How-does-Shazam-work-What-is-the-logic-behind-Shazam-tracing-out-the-exact-song-by-just-a-sample-of-it)
- [An Industrial-Strength Audio Search Algorithm](http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf)
- [How Does Shazam Recognize Music Accurately?](https://www.makeuseof.com/how-does-shazam-work/)