# ML System Design: Designing Instagram Feed Ranking Model

## End to End Implementation

### Goal 
The primary goals of designing an Instagram feed ranking model are as follows:

1. **Rank Posts/Reels/Content to Maximize Engagement**:
   - The aim here is to enhance user interaction, which is quantified in terms of likes, comments, and shares. By optimizing the content displayed to users, the model can drive engagement metrics effectively.

2. **Diversity in Content**:
   - Users should see a variety of content: posts from different creators and across various genres, so that the feed stays interesting and engaging.

3. **Freshness in Content**:
   - Users should see recent posts preferentially. This is important since fresher content is generally more relevant to users, aligning with their current interests and trends.

### Business Perspective
From a business standpoint, better-ranked suggested posts are expected to move the following metrics:
- **Increase in Daily Active Users (DAU)**:
  - A higher DAU indicates better retention and user satisfaction.
- **Increase in Click-Through Rate (CTR)**:
  - A higher CTR signifies that users are engaging with the content suggested.
- **Increase in Average Session Time**:
  - Longer sessions suggest that users find the content engaging enough to spend additional time on the platform.

### Question - Explain the difference between DAU and WAU?

**DAU (Daily Active Users)** refers to the number of unique users who interact with the app on a daily basis. This metric directly indicates how many unique users find value in the platform each day.

**WAU (Weekly Active Users)** denotes the count of unique users who engage with the app over a week. 

- **Does 7 days of DAU = Weekly Active Users?**
No, they are not equal. DAU counts the unique users active on a single day, while WAU counts the unique users active at any point during the week. Summing seven days of DAU counts a user once for every day they were active, so the sum overstates WAU unless no user is active on more than one day.
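
The difference can be checked on a toy activity log. The sketch below is only an illustration: the `events` list, user names, and dates are made up, and `dau` / `wau` are hypothetical helper names.

```python
from datetime import date

# Hypothetical activity log: (user_id, day on which the user was active).
events = [
    ("alice", date(2024, 1, 1)),
    ("alice", date(2024, 1, 2)),
    ("alice", date(2024, 1, 3)),
    ("bob", date(2024, 1, 1)),
    ("carol", date(2024, 1, 5)),
]

def dau(events, day):
    """Unique users active on one specific day."""
    return len({user for user, d in events if d == day})

def wau(events, days):
    """Unique users active on at least one of the given days."""
    days = set(days)
    return len({user for user, d in events if d in days})

week = [date(2024, 1, n) for n in range(1, 8)]
sum_of_dau = sum(dau(events, d) for d in week)  # alice is counted three times
weekly_active = wau(events, week)               # alice is counted once
```

Here the seven daily values add up to 5 while only 3 distinct users were active during the week.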

### Individual Level Requirement
At the individual user level, the primary requirement is to retrieve relevant and useful content that aligns with the user’s interests. This refinement is crucial for enhancing user satisfaction and engagement rates.

## Functional Requirements
The functional needs of the Instagram feed ranking model include:
- Improve **Daily Active Users (DAU)**.
- Maximize **Average Time Spent** on the platform.
- Enhance **Click-Through Rates (CTR)**.
- Boost the **User Engagement Score** based on interaction metrics such as likes, shares, and comments.

## Non-Functional Requirements
Non-functional requirements ensure that the system operates effectively under varying circumstances:
- **Scalability**: The system must scale to a user base of more than 2 billion.
- **Availability**: Aiming for 99.99% uptime to ensure that users can access the service whenever needed.
- **Latency**: Feed responses should return in under 100 ms for a seamless user experience.
- **Monetization**: Generate ad revenue through efficient ad placements.
- **Tooling**: Robust mechanisms for error handling, debugging, monitoring, and alerting.
- **Analytics**: Insights on creator demographics and content interactions to refine the algorithm.
- **Monitoring and Observability**: Ongoing tracking of system performance and user engagement.

## Estimate
### Building Model for 500 Million DAUs
- A solid storage strategy is essential, including:
  - **Structured Data**: Storing follow relationships and post metadata in a robust data warehouse like AWS Redshift.
  - **Data Lakes**: Using platforms like Amazon S3 or HDFS for storing raw interaction logs, which can be unstructured and varied.

## Pipeline
The pipeline unfolds as follows upon opening the Instagram app:
1. **Candidate Generation**: From roughly 1 billion possible posts, the model narrows the pool down to about 100 candidate posts tailored to the user.
2. **Ranking**: Score each candidate with a model that predicts the probability of engagement, then order the candidates by that score.

### Data Engineering
Understanding data is vital:
- **Data Types**:
  - **Posts**: Attributes include creators, media (audio/video/text), and embedded features.
  - **Viewer Interactions**: Collating historical data on user interactions with posts to derive insights on preferences.

### Labels
For each interaction, labels indicate engagement:
- `1`: Engaged
- `0`: Not Engaged
   
### Model Architecture
The architecture design focuses on:
1. **ML Approach**: Traditional algorithms for ranking.
2. **DL Approach**: Utilizing deep learning methods to leverage complex user-post relationships.

#### Candidate Generation
Much like Olympic selection, this phase filters a very large pool (the 1 billion posts mentioned in the pipeline above) down to a manageable hundred using user interaction data.

#### Ranking
A classification model outputs an engagement probability for each candidate post, and the candidates are sorted by that probability.
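
The two stages can be sketched as a small funnel. This is a minimal illustration, not the production design: the vectors, post IDs, and the `engagement_prob` callable are assumptions standing in for a real retrieval index and a trained ranking model.

```python
import heapq

def generate_candidates(user_vec, post_vecs, k):
    """Stage 1: cheap dot-product retrieval of the k best-matching post IDs."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return heapq.nlargest(k, post_vecs, key=lambda pid: dot(user_vec, post_vecs[pid]))

def rank(candidates, engagement_prob):
    """Stage 2: order the candidates by predicted engagement probability."""
    return sorted(candidates, key=engagement_prob, reverse=True)

# Hypothetical embeddings and model scores, for illustration only.
post_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.9, 0.1], "p4": [-1.0, 0.0]}
probs = {"p1": 0.2, "p3": 0.7}

candidates = generate_candidates([1.0, 0.0], post_vecs, k=2)
feed = rank(candidates, engagement_prob=lambda pid: probs[pid])
```

The expensive model only ever sees the handful of posts that survive stage 1, which is what keeps the approach feasible at feed scale.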

### Deep Dive
#### Collaborative Filtering
Collaborative filtering infers a user's preferences from interaction patterns shared with other users. Here is a simple demonstration:

**Example Matrix Visualization**:  
```mermaid
graph LR
    A[User A] --> P1[Post 1]
    A --> P2[Post 2]
    B[User B] --> P2[Post 2]
    B --> P3[Post 3]
```

The above illustrates a **sparse matrix**: each user interacts with only a tiny fraction of all available posts, so most user-post entries are zero (no engagement).

### SVD (Singular Value Decomposition)
- **Definition**: SVD factorizes a matrix as \(A = U \Sigma V^T\). Keeping only the top \(d\) singular values gives a low-rank approximation whose two factors act as user and post embeddings.
- **Matrix Representation**:

\[
A \approx B \cdot C^T 
\]

Where:
- \(A\) (m x n) = User-Item interaction matrix.
- \(B\) (m x d) = User feature matrix.
- \(C\) (n x d) = Item (post) feature matrix.

**Computational Load**: A full SVD is expensive on large matrices. Numerical libraries such as NumPy and SciPy help, and sparse truncated solvers (for example `scipy.sparse.linalg.svds`) compute only the top \(d\) singular values rather than the full decomposition.
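
As a rough sketch of the factorization, the snippet below runs NumPy's dense SVD on a toy matrix and keeps `d = 2` latent dimensions. The matrix values and the choice to split the singular values evenly between the two factors are assumptions made for illustration.

```python
import numpy as np

# Toy user-post interaction matrix A (m=4 users, n=5 posts); 1 = engaged.
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

d = 2  # number of latent dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Split the top-d singular values evenly so that A is approximated by B @ C.T.
B = U[:, :d] * np.sqrt(s[:d])     # (m x d) user factors
C = Vt[:d, :].T * np.sqrt(s[:d])  # (n x d) post factors

A_hat = B @ C.T  # rank-d reconstruction, read as predicted user-post affinity
```

Entries of `A_hat` at positions where `A` is zero can be read as affinity estimates for posts the user has not seen yet.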

## Content-Based Filtering
In content-based filtering, the algorithm recommends items based on their characteristics compared against user profiles. 

**Example**:
- If a user primarily engages with content about **gardening**, the model will prioritize garden-related posts in the feed, aligning recommendations with individual interests.
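
A minimal sketch of the gardening example, assuming each post is described by a hand-made topic vector `[gardening, cooking, sports]` and the user profile is the average of the posts they engaged with. All names and numbers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def user_profile(engaged_post_vecs):
    """Average the feature vectors of the posts the user engaged with."""
    n = len(engaged_post_vecs)
    return [sum(col) / n for col in zip(*engaged_post_vecs)]

# Hypothetical topic features: [gardening, cooking, sports].
posts = {
    "tomato_tips": [1.0, 0.2, 0.0],
    "pasta_recipe": [0.1, 1.0, 0.0],
    "match_recap": [0.0, 0.0, 1.0],
}
history = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]]  # mostly gardening engagement

profile = user_profile(history)
ranked = sorted(posts, key=lambda p: cosine(profile, posts[p]), reverse=True)
```

In practice the feature vectors would come from learned content embeddings rather than hand-labelled topics.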

## Heuristics-Based Filtering
Heuristic filters provide immediate results:
- **Posts from Followed Accounts**: High-priority for users.
- **Trending Content**: Posts gaining traction quickly (e.g., X likes in Y minutes).
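
Both heuristics can be expressed as simple filters. In the sketch below the "X likes in Y minutes" rule becomes `min_likes` within `window_seconds`; those parameter names, their defaults, and the post dictionaries are assumptions for illustration.

```python
def is_trending(like_times, now, window_seconds, min_likes):
    """True if the post collected at least min_likes within the last window."""
    recent = sum(1 for t in like_times if 0 <= now - t <= window_seconds)
    return recent >= min_likes

def heuristic_candidates(posts, followed, now, window_seconds=600, min_likes=3):
    """Followed-account posts first, then trending posts from other creators."""
    from_followed = [p["id"] for p in posts if p["creator"] in followed]
    trending = [
        p["id"] for p in posts
        if p["creator"] not in followed
        and is_trending(p["like_times"], now, window_seconds, min_likes)
    ]
    return from_followed + trending

# Hypothetical posts with like timestamps in seconds.
posts = [
    {"id": "p1", "creator": "friend", "like_times": [100]},
    {"id": "p2", "creator": "stranger", "like_times": [950, 970, 990]},
    {"id": "p3", "creator": "stranger", "like_times": [10, 20, 30]},
]
candidates = heuristic_candidates(posts, followed={"friend"}, now=1000)
```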

### Neural Network Approach
#### Two-Tower Network
This structure feeds viewer features and post features through two separate networks (towers). Each tower outputs an embedding, and the similarity between the two embeddings is the score.

**Network Architecture Visualization**:  
```mermaid
flowchart LR
    A[Viewer Features] -->|Dense NN| B[Viewer Embeddings]
    C[Post Features] -->|Dense NN| D[Post Embeddings]
    B --> E[Dot Product / Cosine Similarity] --> F[Ranking Score]
```
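
The forward pass in the diagram can be sketched with NumPy. This is an untrained toy: the layer sizes, the `tanh` activation, and the random weights are assumptions, and a real system would learn the weights from engagement labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tower(in_dim, hidden_dim, out_dim):
    """Randomly initialised two-layer dense tower (untrained, illustration only)."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
    }

def embed(tower, x):
    """Map raw features to an L2-normalised embedding."""
    h = np.tanh(x @ tower["W1"])
    z = h @ tower["W2"]
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)

viewer_tower = make_tower(in_dim=8, hidden_dim=16, out_dim=4)
post_tower = make_tower(in_dim=6, hidden_dim=16, out_dim=4)

viewer_features = rng.normal(size=(8,))   # one viewer
post_features = rng.normal(size=(3, 6))   # three candidate posts

viewer_emb = embed(viewer_tower, viewer_features)
post_embs = embed(post_tower, post_features)

# With unit-length embeddings the dot product equals cosine similarity.
scores = post_embs @ viewer_emb
ranking = np.argsort(-scores)  # post indices, best match first
```

Because the towers never see each other's inputs, post embeddings can be computed ahead of time and only the viewer tower has to run at request time.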

### Autoencoders
Autoencoders are neural networks designed to learn efficient encodings, compressing input data and later reconstructing it back to the original form.

**Process**: An encoder compresses the input into a low-dimensional code, and a decoder reconstructs the input from that code. The code captures the dominant patterns within user engagement data.

## Ranking Mechanism
### Input Features
1. User Interactions: Behavior patterns over the past 7 days.
2. Post Details: Engagement metrics and creator data.
3. Contextual Elements: Time since login, device details, and demographics.

### Loss Function
- **Binary Cross Entropy** serves as the loss function:
  \[
  L = -\frac{1}{N}\sum_{i=1}^{N} [y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)]
  \]
  
Here, \(y_i\) is the actual label, and \(\hat{y}_i\) is the predicted probability.
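
The formula translates directly into a few lines of Python. The clipping constant `eps` is an addition for numerical safety (it avoids `log(0)`) and is not part of the formula itself.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over N (label, predicted probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])  # model has no opinion
confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # confident and correct
```

A model that predicts 0.5 everywhere scores `ln 2`, and confident correct predictions push the loss toward zero.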

### Model Selection
Algorithms are chosen based on desired performance metrics:
- **XGBoost** or **Logistic Regression** could be utilized to rank posts effectively.

### Reranking & Diversity
Design mechanisms to ensure diversity:
- Limit posts from the same creator to avoid redundancy.
- Restrict categories of posts shown consecutively, thereby maintaining variety.
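
Both rules can be applied in one greedy pass over the score-ordered list. The sketch below is one possible policy, not a prescribed one: the caps `max_per_creator` and `max_category_run` are hypothetical parameters, and posts that violate a rule are simply dropped from the current page.

```python
def rerank_with_diversity(ranked_posts, max_per_creator=2, max_category_run=2):
    """Greedy pass over score-ordered posts that enforces both diversity rules."""
    result = []
    per_creator = {}
    for post in ranked_posts:
        creator, category = post["creator"], post["category"]
        # Rule 1: cap the number of posts from any single creator.
        if per_creator.get(creator, 0) >= max_per_creator:
            continue
        # Rule 2: do not extend a run of the same category beyond the cap.
        tail = result[-max_category_run:]
        if len(tail) == max_category_run and all(p["category"] == category for p in tail):
            continue
        result.append(post)
        per_creator[creator] = per_creator.get(creator, 0) + 1
    return result

# Hypothetical ranked list, highest score first.
ranked_posts = [
    {"id": 1, "creator": "a", "category": "food"},
    {"id": 2, "creator": "a", "category": "food"},
    {"id": 3, "creator": "a", "category": "travel"},
    {"id": 4, "creator": "b", "category": "food"},
    {"id": 5, "creator": "c", "category": "travel"},
    {"id": 6, "creator": "b", "category": "food"},
]
feed_ids = [p["id"] for p in rerank_with_diversity(ranked_posts)]
```

Post 3 is skipped by the creator cap and post 4 by the category run limit, while post 6 is allowed again once a travel post has broken the run.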

### Deployment
#### Batch Inference
Batch inference processes inputs in groups, scheduled at intervals for efficient throughput and resource utilization.

#### Airflow
Apache Airflow is instrumental for managing data workflows and can automate complex data pipelines, ensuring timely model updates and data ingestion.

## Monitoring
### Key Metrics
Keep tabs on:
- **Precision, Recall, F1 Score**: To evaluate prediction accuracy.
- **Latency**: Ensure service response remains below 100 milliseconds.
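
For reference, the three classification metrics can be computed from binary labels and thresholded predictions as follows. This is a plain-Python sketch with made-up sample data.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = engaged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels and thresholded model predictions.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```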

### A/B Testing
Conduct experiments evaluating the performance of new algorithms against existing ones, focusing on engagement metrics.

## Retraining Protocol
Create a framework to monitor for feature distribution shifts:
- Utilize metrics such as KL Divergence for detecting changes and triggering model retraining to maintain performance.
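
A drift check along these lines might bin a feature's training and live values into histograms and compare them with KL divergence. In this sketch the bin edges, the smoothing constant `eps`, the direction KL(live || train), and the `threshold` of 0.1 are all assumptions that would need tuning per feature.

```python
import math

def histogram(values, edges):
    """Count values per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms, smoothed to avoid division by zero."""
    n = len(p_counts)
    p = [(c + eps) / (sum(p_counts) + eps * n) for c in p_counts]
    q = [(c + eps) / (sum(q_counts) + eps * n) for c in q_counts]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def needs_retraining(train_values, live_values, edges, threshold=0.1):
    """Flag drift when the live distribution diverges from the training one."""
    live = histogram(live_values, edges)
    train = histogram(train_values, edges)
    return kl_divergence(live, train) > threshold

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
shifted = [0.9] * 9  # hypothetical live data concentrated in the top bin
```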

### Handling Edge Cases
1. **Scalability of Two-Tower Networks**: The two towers are independent, so post embeddings can be precomputed offline and only the viewer embedding needs to be computed at request time.
2. **Multi-Modal Inputs**: Designed to simultaneously process text, images, audio, and video data efficiently.

### Cold Start Problem
Cold starts arise when a new user or a new post has little or no interaction history, so the model has no signal to rank with. Tactics for dealing with this include recommending currently trending posts or using targeted demographic insights.

### Similarity Assessments
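- **Siamese Networks** pass two inputs through twin branches that share the same weights and compare the resulting embeddings to measure content similarity. Sharing weights means only one set of parameters has to be stored.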

### Ad Integrations
Advertisements use distinct ranking models, tuned to balance relevance against revenue.

### Sparse Data Management
Overcome data sparsity challenges by implementing:
- **Compressed Sparse Matrix**: Minimizes memory footprint while enhancing computation efficiency.
- **Lasso Regularization**: Adds an L1 penalty on the model weights, which drives the weights of low-impact features to exactly zero and so acts as feature selection.
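
To make the compressed sparse matrix idea concrete, here is a pure-Python sketch of the compressed sparse row (CSR) layout, which stores only the non-zero entries. The helper names `to_csr` and `row_dot` are hypothetical; in practice a library type such as `scipy.sparse.csr_matrix` would normally be used instead.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # non-zero value
                indices.append(col)  # its column index
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr

def row_dot(csr, row, vec):
    """Dot product of one CSR row with a dense vector, touching non-zeros only."""
    data, indices, indptr = csr
    return sum(data[k] * vec[indices[k]] for k in range(indptr[row], indptr[row + 1]))

# Toy user-post interaction matrix with mostly zero entries.
dense = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
csr = to_csr(dense)
```

Memory grows with the number of non-zero interactions rather than with users times posts, which is what makes the layout practical for interaction matrices.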
