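# Lecture 10: Google's Large Deep Learning Models
Course content: Having completed the introductory material, we now turn to the advanced deep learning models that are widely used today in recommendation, search, natural language processing, and related fields. This lecture covers advanced deep learning methods: Attention, the Transformer, BERT, and pre-trained models.
Frequently tested topics: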

## 1. What is Attention?

**Attention** is a mechanism in deep learning that mimics human cognition. At its core, it allows a neural network to **dynamically focus on the most relevant parts of the input** when performing a task, rather than treating all parts of the input equally.

---

### 1. The Problem it Solves (The "Bottleneck")

Before Attention, the standard way to handle sequence-to-sequence tasks (like language translation) was with an **Encoder-Decoder RNN/LSTM**.

1.  **Encoder:** Reads the entire input sentence (e.g., "I am a student") and compresses all of its information into a *single, fixed-length vector* (called the "context vector" or "thought vector").
2.  **Decoder:** Takes this *single vector* and tries to generate the entire output sentence (e.g., "Je suis un étudiant") based *only* on that one representation.

**This created a "bottleneck."** It's unreasonable to expect a single, fixed-size vector to perfectly summarize a 50-word sentence. The network would "forget" the beginning of the sentence by the time it finished reading it.

### 2. The Solution: How Attention Works

Attention was created to solve this bottleneck. Instead of a single, static context vector, it creates a **dynamic context vector** for *every step* of the output.

It allows the decoder to "look back" at the *entire* input sequence and decide which words are most important for generating the *current* output word.

The mechanism is most famously described using **Queries, Keys, and Values**.

* **Encoder:** The encoder outputs a set of "hidden states" for every input word. These states represent the meaning of each word *in context*. Let's call this set of states the **"Values"**. We also make a copy of these states to act as the **"Keys"**.
* **Decoder:** At each step, the decoder has its own current hidden state. This state represents "what I am trying to produce right now." This is the **"Query"**.

Here is the step-by-step process for the decoder generating *one* word:

1.  **Get Query:** The decoder has its current hidden state (the **Query**).
2.  **Calculate Scores (Find Relevance):** The decoder compares its **Query** with every **Key** from the encoder's output. This comparison (often a dot product) produces a "score" for each input word.
    * *Intuition:* If the decoder is about to produce the word "student" (the Query), it will give a high score to the input word "student" (the Key).
3.  **Get Attention Weights (Normalize):** The scores are passed through a `softmax` function. This converts them into a probability distribution (all values are between 0 and 1, and they all sum to 1). These are the **"attention weights."**
    * *Example:* `[I(0.05), am(0.05), a(0.1), student(0.8)]`
    * This tells the decoder to "pay 80% of your attention to the word 'student', 10% to 'a', and 5% to 'I' and 'am'."
4.  **Calculate Context Vector (Weighted Sum):** The attention weights are used to create a *weighted sum* of the encoder's original **Values**.
    * *Result:* This new **dynamic context vector** is heavily biased towards the "Value" of the word "student."
5.  **Make Prediction:** The decoder takes this new, dynamic context vector (which is tailored for this specific step) and its own hidden state to predict the next word.

This entire process is repeated for *every single word* the decoder generates, creating a new, custom-tailored context vector each time.
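
The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the 2-d encoder states and the decoder state are made-up numbers, and the helper names `softmax` and `attend` are chosen here for illustration only.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One decoder step: score every key, normalize, then mix the values."""
    # Step 2: dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 3: softmax turns the scores into attention weights.
    weights = softmax(scores)
    # Step 4: the context vector is the weighted sum of the values.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Made-up encoder states for "I am a student"; they serve as both keys and values.
encoder_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [3.0, 3.0]]
decoder_state = [1.0, 1.0]  # hypothetical query for the current output word

weights, context = attend(decoder_state, encoder_states, encoder_states)
```

With these toy numbers the largest weight lands on the fourth encoder state ("student"), so the context vector is pulled toward that state's value.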

---

### 3. Why is Attention so Important?

1.  **Eases Long-Range Dependencies:** It removes the fixed-length bottleneck. The network no longer needs to cram all information into one vector. It has direct access to every input word at every step, so it is far less likely to "forget" the beginning of a long sentence.
2.  **Provides Interpretability:** You can visualize the attention weights! This lets you "see inside" the model's "mind." For translation, you can plot a matrix showing which input word the model "looked at" when it produced each output word. This is invaluable for debugging.
3.  **It's the Core of the Transformer:** The Attention mechanism was so powerful that researchers realized they could build an entire, state-of-the-art architecture (the **Transformer**) using *only* attention, completely removing all RNN/LSTM components. This led to models like BERT and GPT.

## 2. How Does Attention Work?
The working principle of Attention is to create a **dynamic, weighted context** for every operation, allowing a model to decide which parts of the input are most relevant for the current task.

The previous section explained "classic" attention in an RNN Encoder-Decoder. This section explains the working principle of **Self-Attention**, the mechanism that powers the **Transformer** (and models like BERT and GPT).

Self-Attention is the principle of letting a sequence **look at itself** to build context.

The most common implementation of this principle uses **Queries (Q)**, **Keys (K)**, and **Values (V)**.

### The QKV (Query, Key, Value) Analogy

Think of it like a "database lookup" or a search engine for your own sentence:

* **Value (V):** The actual *content* or *meaning* of each word. This is the information you want to retrieve.
* **Key (K):** A "label" or "keyword" for each word. This is like the *index* of the database, advertising what kind of content the Value holds.
* **Query (Q):** A "search query" from a specific word. It represents "what I am currently looking for" to build my own context.

The goal is: For each word, its **Query** searches through the **Keys** of all words in the sequence (including its own). The better a Query and Key "match" (the higher their score), the more of that word's **Value** is pulled in.

---

### The Step-by-Step Working Principle (Self-Attention)

Let's use the sentence: "**The cat ate the food**" and find the context-aware representation for the word "**ate**".

#### Step 1: Create Q, K, and V vectors

* First, we have an initial embedding (a vector) for each word.
* We then create three *learnable* weight matrices: $W_Q$, $W_K$, and $W_V$.
* For **every single word**, we create its *own* Q, K, and V vector by multiplying its embedding by these matrices.
    * `Q_ate = embedding_ate * W_Q`
    * `K_ate = embedding_ate * W_K`
    * `V_ate = embedding_ate * W_V`
    * ... (we do this for "The", "cat", "the", "food" as well)

#### Step 2: Calculate Scores (Query $\cdot$ Key)

Now, the word "**ate**" (the "Query") needs to find out how relevant every other word is. It does this by "asking" each word's "Key" for a score. This is done with a **dot product**.

* `score_1 = Q_ate • K_The`
* `score_2 = Q_ate • K_cat`
* `score_3 = Q_ate • K_ate`
* `score_4 = Q_ate • K_the`
* `score_5 = Q_ate • K_food`

If "ate" (the verb) is "querying" for its subject and object, it will have a high-scoring match with the "Keys" for "cat" and "food".

#### Step 3: Scale and Normalize (Softmax)

The scores are not yet ready to be used.
1.  **Scale:** To stabilize training, the scores are divided by $\sqrt{d_k}$, the square root of the dimension of the key vectors. This keeps the dot products from growing so large that the softmax saturates and its gradients become very small.
2.  **Softmax:** All the scaled scores for "ate" are passed through a `softmax` function. This converts them into a probability distribution (adding up to 1.0). These are the **"attention weights."**

* **Example weights for "ate":**
    * `w_1 (The)` = 0.05
    * `w_2 (cat)` = **0.40**
    * `w_3 (ate)` = 0.10
    * `w_4 (the)` = 0.05
    * `w_5 (food)` = **0.40**

This result clearly says: "To understand 'ate', pay 40% attention to 'cat' and 40% to 'food'."

#### Step 4: Calculate the Weighted Sum (Get Values)

Now that we have the "how-much-to-pay-attention" weights, we use them to get the "what-to-pay-attention-to" information.

We calculate a **weighted sum** of all the **Value (V)** vectors in the sequence.

* `z_ate = (w_1 * V_The) + (w_2 * V_cat) + (w_3 * V_ate) + (w_4 * V_the) + (w_5 * V_food)`

This new vector, $z_{ate}$, is the **final output** for "ate". It is a context-aware representation. The original meaning of "ate" has now been blended with the meanings of "cat" and "food".
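
Steps 1 to 4 can be put together in a short plain-Python sketch of scaled dot-product self-attention. Everything numeric here is an assumption made for illustration: the 2-d embeddings are invented, and identity matrices stand in for the learned $W_Q$, $W_K$, and $W_V$ so the arithmetic stays easy to follow.

```python
import math

def matvec(vec, matrix):
    """Row vector times matrix (vec @ matrix)."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a whole sequence."""
    # Step 1: project every embedding into a query, key, and value.
    queries = [matvec(e, w_q) for e in embeddings]
    keys = [matvec(e, w_k) for e in embeddings]
    values = [matvec(e, w_v) for e in embeddings]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Step 2 plus scaling: dot product with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Step 3: softmax gives the attention weights for this word.
        weights = softmax(scores)
        # Step 4: weighted sum of all value vectors.
        z = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        outputs.append(z)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up embeddings for "The cat ate the food" (both "the" tokens share a vector).
embeddings = [[0.1, 0.0], [1.0, 0.2], [0.9, 0.9], [0.1, 0.0], [0.2, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned W_Q, W_K, W_V
outputs, attn = self_attention(embeddings, identity, identity, identity)
```

Here `attn[2]` holds the five attention weights for "ate" and `outputs[2]` is its context-aware vector. The loop over queries is written out for clarity; real implementations compute all rows at once with matrix multiplication.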

---

### The Final Result

This **entire 4-step process is done in parallel for *every single word* in the sequence**, creating a new set of context-aware embeddings for the whole sequence at once.

This is the core working principle. The Transformer model then makes this even more powerful by:
1.  **Multi-Head Attention:** It does this QKV process 8 or 12 times in parallel (with different $W_Q, W_K, W_V$ matrices). Each "head" can learn a *different kind* of relationship (e.g., one head for subject-verb, another for grammar).
2.  **Stacking Layers:** It stacks these Multi-Head Attention layers on top of each other, allowing the model to build up an incredibly deep and rich understanding of the sequence.



## 3. What is the Working Principle of the Transformer?

The **Transformer** is an architecture introduced in the 2017 paper "Attention Is All You Need." Its working principle is a complete paradigm shift: it is the first major sequence-to-sequence model to **completely abandon Recurrence (RNNs, LSTMs, GRUs)** and rely **entirely on Attention mechanisms** to handle sequential data.

Its two key advantages are:
1.  **Parallelization:** Unlike an RNN which must process word 4 after word 3 (sequentially), the Transformer processes all words in a sequence *at the same time*. This makes it dramatically faster to train on modern hardware (GPUs/TPUs).
2.  **Long-Range Dependencies:** It sidesteps the long gradient paths that cause vanishing gradients in RNNs. The "path length" between any two words in a sentence (e.g., the first word and the last word) is just 1, via a direct self-attention connection. In an RNN, this path can be up to `N` steps long for a sentence of `N` words.

The Transformer is an **Encoder-Decoder** architecture, which is best understood by breaking it into its two main parts.

---

### Part 1: Input Preprocessing

Since there is no RNN, how does the model know the *order* of the words?
The input to the Transformer is not just the word embedding. It's the sum of two vectors:

**`Input Vector = Word Embedding + Positional Encoding`**

* **Word Embedding:** The standard vector representation of the word's meaning.
* **Positional Encoding:** A vector that explicitly defines the *position* of the word in the sequence (e.g., 1st, 2nd, 3rd...). This is a clever, fixed vector generated using sine and cosine functions. This "injects" the sequence order into the model.
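
As a rough sketch of the sine/cosine scheme, the snippet below builds the positional encoding table from the original Transformer paper, where dimensions $2j$ and $2j+1$ use $\sin$ and $\cos$ of $pos / 10000^{2j/d_{model}}$. The function names and the tiny sizes are assumptions for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return seq_len vectors of size d_model using the sine/cosine scheme."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2j and 2j+1 share one frequency: 10000^(2j / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(word_embeddings):
    """Input Vector = Word Embedding + Positional Encoding."""
    d_model = len(word_embeddings[0])
    pe = positional_encoding(len(word_embeddings), d_model)
    return [[w + p for w, p in zip(word, pos)]
            for word, pos in zip(word_embeddings, pe)]

pe = positional_encoding(seq_len=4, d_model=6)
```

Because the table depends only on position and dimension, it can be computed once and reused for every input sequence.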

---

### Part 2: The Encoder

The **Encoder's** job is to read the entire input sentence (e.g., "The cat sat on the mat") and build a rich, context-aware representation for *every single word*.

It is a stack of `N` (e.g., 6) identical layers. Each layer has two main sub-layers:

1.  **Multi-Head Self-Attention:**
    * **Self-Attention:** This is where every word in the sentence "looks at" every word (itself included) in the *same* sentence, using the Q, K, V mechanism.
    * This allows the model to build context. For example, when processing the word "sat," the self-attention will create strong links to "cat" (who sat) and "mat" (where it sat).
    * **Multi-Head:** The model does this in parallel 8 or 12 times (each "head" has its own set of Q/K/V weights). Each "head" can learn a different *type* of relationship (e.g., one head learns subject-verb, another learns grammatical relationships). The results are then combined.

2.  **Position-wise Feed-Forward Network:**
    * After attention, the output for *each word* is passed through its own independent, identical, simple two-layer neural network (an MLP).
    * This "processes" or "thinks about" the context-aware vector that self-attention just produced.

*Note: Both sub-layers also have a **Residual Connection** (like in ResNet) and **Layer Normalization** (`Add & Norm`), which are crucial for stabilizing the training of such a deep network.*

**Encoder Output:** A set of vectors, one for each input token, now loaded with rich contextual information. Let's call this `memory`.

---

### Part 3: The Decoder

The **Decoder's** job is to *generate* the output sentence (e.g., "Le chat s'est assis sur le tapis") one word at a time. It is also a stack of `N` identical layers, but its layers are more complex, having *three* sub-layers.

The decoder works **auto-regressively**: it takes the words it has *already generated* as its input.

1.  **Masked Multi-Head Self-Attention:**
    * This is self-attention for the *output* sentence. For example, when generating the 3rd word, it looks at the 1st and 2nd words.
    * **"Masked":** This is the key difference. It applies a "mask" that **hides all future tokens**. When predicting the 3rd word, the model is *not allowed* to see the 4th word (which hasn't been generated yet). This prevents cheating.

2.  **Encoder-Decoder Attention:**
    * This is the "classic" attention where the Encoder and Decoder meet.
    * The **Query (Q)** comes from the decoder (from the masked self-attention sub-layer just before it). It's the "what am I trying to say now" vector.
    * The **Keys (K)** and **Values (V)** come from the **Encoder's final output (`memory`)**.
    * This allows the decoder to "look back" at the entire input sentence and decide which words are most relevant for generating the *current* output word.

3.  **Position-wise Feed-Forward Network:**
    * This is identical to the one in the encoder. It "processes" the vector that was just created by the Encoder-Decoder attention.
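
The "masked" part of the first decoder sub-layer can be illustrated with a small sketch. It assumes raw attention scores are already available and only shows how future positions are hidden; the scores are made-up numbers and the helper names are hypothetical.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    shift = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Hypothetical raw scores of the 2nd output position against all 4 positions.
weights = masked_softmax([2.0, 1.0, 5.0, 3.0], mask[1])
```

Even though the third position has the highest raw score, its weight is exactly zero because it lies in the future. Setting masked scores to negative infinity before an ordinary softmax, as many implementations do, gives the same result.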

---

### Part 4: The Final Output

1.  The vector from the top of the decoder stack is passed through a final **Linear Layer** (which projects it to the size of the entire vocabulary).
2.  A **Softmax** function is applied to turn those scores into probabilities.
3.  In the simplest (greedy) decoding scheme, the word with the highest probability is chosen as the next word.
4.  This word is then fed back into the bottom of the decoder as input for the next time step, and the process repeats until an `<end_of_sentence>` token is generated.
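
The generation loop described in these four steps might look like the sketch below. The function `toy_decoder_logits` is a hypothetical stand-in for the whole decoder stack plus the final linear layer, and the four-word vocabulary is invented; only the loop structure is the point.

```python
import math

VOCAB = ["<start>", "le", "chat", "<end_of_sentence>"]

def toy_decoder_logits(tokens):
    """Hypothetical stand-in for decoder + linear layer: favors the next vocab entry."""
    last = VOCAB.index(tokens[-1])
    target = min(last + 1, len(VOCAB) - 1)
    return [4.0 if i == target else 0.0 for i in range(len(VOCAB))]

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(logits_fn, max_len=10):
    """Generate tokens one at a time, always taking the most probable one."""
    tokens = ["<start>"]
    for _ in range(max_len):
        probs = softmax(logits_fn(tokens))            # steps 1-2: scores to probabilities
        next_token = VOCAB[probs.index(max(probs))]   # step 3: pick the argmax
        tokens.append(next_token)                     # step 4: feed it back as input
        if next_token == "<end_of_sentence>":
            break
    return tokens

result = greedy_decode(toy_decoder_logits)
```

The `max_len` cap is a safety limit in case the end token is never produced. Alternatives such as beam search or sampling change only how `next_token` is chosen.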



## 4. Why is ResNet (Residual Network) so Effective?

ResNet is effective because it solves the two critical problems that prevented neural networks from becoming "deeper": the **Vanishing Gradient** problem and the **Degradation** problem.

It solves these using a simple but powerful mechanism called a **Residual Block** (or "Skip Connection").

---

### 1. The Core Problem: Deep Networks Were Too Hard to Train

Before ResNet, a major paradox existed:
1.  We *knew* that "deeper" models should be more powerful because they can learn more complex patterns.
2.  In *practice*, when networks got "too deep" (e.g., 50+ layers), their performance would get **worse**, not better.

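This wasn't just overfitting: the deeper model had higher *training* error too. This was a **degradation** problem, meaning the deep model was simply too difficult to train (optimize). The **Vanishing Gradient Problem** is a commonly cited contributor, although the original ResNet paper observed degradation even in networks trained with batch normalization, where gradients did not vanish.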

### 2. The Solution: The Residual Block

The ResNet architecture is built from "Residual Blocks."

* **A "Plain" Network Layer:** A standard layer tries to learn a direct mapping, $H(x)$, from its input $x$.
    `x -> [WEIGHTS -> RELU] -> H(x)`

* **A "Residual" Network Block:** A residual block learns a "residual" function, $F(x)$, and adds its original input $x$ back to the output.
    * It tries to learn: $H(x) = F(x) + x$
    * `F(x)` is the "residual" (the *change* or *delta* the network needs to learn).
    * `x` is the "identity" or "skip connection" that bypasses the layers.
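
A minimal sketch can contrast the two kinds of layer. It uses a single fully connected layer as $F(x)$, which is a simplification (real ResNet blocks use stacked convolutions), and the helper names are chosen here for illustration.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def linear(vec, weights, bias):
    """Fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def plain_block(x, weights, bias):
    """Plain layer: the output H(x) is whatever the weights produce."""
    return relu(linear(x, weights, bias))

def residual_block(x, weights, bias):
    """Residual layer: H(x) = F(x) + x, where F is the learned residual."""
    f_x = relu(linear(x, weights, bias))
    return [f + xi for f, xi in zip(f_x, x)]

x = [1.0, -2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]  # all-zero weights: the layer has learned nothing
zero_b = [0.0] * 3

plain_out = plain_block(x, zero_w, zero_b)        # the input signal is lost
residual_out = residual_block(x, zero_w, zero_b)  # the input passes through unchanged
```

With all-zero weights the plain layer wipes out its input, while the residual block returns `x` untouched. This shows why an extra residual block can fall back to the identity mapping easily, whereas a plain layer has to learn it.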


# Lecture 11: Three Classic Recommender System Algorithms from Amazon, Netflix, Hulu, Google, and Alibaba

## 1. What Are the Deep and Shallow Components of DeepFM?

In the DeepFM architecture, the "shallow" and "deep" parts are two parallel components that are trained jointly. They are designed to capture different types of feature interactions from the same set of inputs.

The goal is to simultaneously learn **low-order feature interactions** (the "shallow" part) and **high-order feature interactions** (the "deep" part) to make a prediction.

---

### 1. The "Shallow" Learning: Factorization Machine (FM)

The "shallow" component of DeepFM is a **Factorization Machine (FM)**.

* **What it is:** The FM is a "shallow" but powerful model that is highly effective for sparse data, like that in recommender systems.
* **What it learns:** Its primary job is to explicitly model **low-order feature interactions**, specifically:
    * **1st-order interactions:** The individual importance of each feature (just like in linear regression).
    * **2nd-order interactions (the key part):** The pairwise interactions between features. For example, it learns one embedding vector for each feature value (one for "User=Alice" and one for "Item=Coffee") and models their interaction by taking the **dot product** of the two vectors.
* **Why it's used:** It is very efficient at "memorizing" the effects of these simple, common feature combinations (e.g., "User Alice *always* likes Item Coffee").

### 2. The "Deep" Learning: Deep Neural Network (DNN)

The "deep" component of DeepFM is a standard **Deep Neural Network (DNN)**, which is typically a Multi-Layer Perceptron (MLP).

* **What it is:** A stack of fully-connected layers with non-linear activation functions (like ReLU).
* **What it learns:** Its job is to automatically learn **high-order feature interactions** and **complex, non-linear patterns**.
    * "High-order" means interactions between three, four, or more features (e.g., "User=Alice" *and* "Time=Morning" *and* "Device=Mobile").
    * Humans cannot manually define all these complex combinations. The DNN learns them implicitly.
* **Why it's used:** It provides "generalization." It can find new, previously unseen feature combinations that are predictive, helping the model recommend items that the user hasn't interacted with but are similar to their "deeper" preferences.

---

### How They Work Together (The Key Idea)

The final architecture of DeepFM combines these two parts:

1.  **Shared Embeddings (Crucial):** All sparse input features (like `user_id`, `item_id`, `gender`) are first converted into low-dimensional embedding vectors. These *exact same* embedding vectors are used as the input for *both* the FM and DNN components. This is a very efficient way to learn the feature representations.
2.  **Parallel Computation:** The FM component and the DNN component process these embeddings in parallel.
3.  **Final Prediction:** The output of the FM component (a single number representing low-order interactions) and the output of the DNN component (another number representing high-order interactions) are **summed together** and passed through a sigmoid function to produce the final click-through rate (CTR) prediction.

**`Prediction = sigmoid(Output_FM + Output_DNN)`**
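
The forward pass can be sketched as follows. This is a simplified illustration under stated assumptions: each active one-hot field contributes one embedding (so the FM pairwise term reduces to dot products between field embeddings), the DNN has a single hidden layer, and every parameter value is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fm_output(first_order, embeddings, bias=0.0):
    """1st-order terms plus dot products of every pair of field embeddings."""
    total = bias + sum(first_order)
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
    return total

def dnn_output(embeddings, hidden_w, out_w):
    """Tiny MLP over the concatenated embeddings: one ReLU layer, then a scalar."""
    flat = [v for emb in embeddings for v in emb]
    hidden = [max(0.0, sum(w * v for w, v in zip(row, flat))) for row in hidden_w]
    return sum(w * h for w, h in zip(out_w, hidden))

def deepfm_predict(first_order, embeddings, hidden_w, out_w):
    """Prediction = sigmoid(Output_FM + Output_DNN), both fed the same embeddings."""
    return sigmoid(fm_output(first_order, embeddings)
                   + dnn_output(embeddings, hidden_w, out_w))

# Made-up parameters for two active fields: User=Alice and Item=Coffee.
embeddings = [[0.5, 1.0], [1.0, 0.5]]   # shared by the FM and DNN parts
first_order = [0.1, 0.2]                # per-feature linear weights
hidden_w = [[0.1, 0.1, 0.1, 0.1], [-0.2, 0.0, 0.2, 0.0]]
out_w = [1.0, 1.0]
p_click = deepfm_predict(first_order, embeddings, hidden_w, out_w)
```

Note that `embeddings` is passed to both `fm_output` and `dnn_output`, which is the shared-embedding idea from step 1.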

## 2. What Are the Common Evaluation Metrics for Recommender Systems?

Evaluation metrics for recommender systems can be broadly divided into three main categories. The metric you choose depends on the specific task (e.g., predicting a rating vs. creating a top-N list) and the ultimate business goal.

---

### 1. Prediction Accuracy Metrics

These metrics are used when your goal is to **predict a user's exact rating** for an item (e.g., "predict this user will rate this movie 4.2 stars"). They are most common in "classical" systems, like those for the Netflix Prize.

* **Mean Absolute Error (MAE):**
    * **What it is:** The average absolute difference between the ratings a user *actually* gave (ground truth) and the ratings your model *predicted*.
    * **Formula:** $\frac{1}{|N|} \sum_{(u,i) \in N} |y_{ui} - \hat{y}_{ui}|$
    * **Pros:** Easy to understand. It gives a clear, average "error" in the same unit as the rating (e.g., "we are off by 0.5 stars on average").
    * **Cons:** Treats every error the same regardless of context: being off by 1 star counts equally whether the true rating was 1 star or 5 stars.

* **Root Mean Squared Error (RMSE):**
    * **What it is:** Similar to MAE, but it squares the differences before averaging, then takes the square root.
    * **Formula:** $\sqrt{\frac{1}{|N|} \sum_{(u,i) \in N} (y_{ui} - \hat{y}_{ui})^2}$
    * **Pros:** It **punishes large errors much more** than small errors. This is useful because being wrong by 4 stars is much worse than being wrong by 1 star. This was the primary metric for the Netflix Prize.
    * **Cons:** Less intuitive than MAE.
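
A small worked example shows how the two metrics differ. The ratings below are made up so that both prediction sets have the same MAE while one of them contains a single large miss.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between true and predicted ratings."""
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

actual = [5.0, 3.0, 4.0, 1.0]
small_errors = [4.0, 4.0, 3.0, 2.0]    # off by 1 star on every item
one_big_error = [5.0, 3.0, 4.0, 5.0]   # perfect except for one 4-star miss

mae_small, mae_big = mae(actual, small_errors), mae(actual, one_big_error)
rmse_small, rmse_big = rmse(actual, small_errors), rmse(actual, one_big_error)
```

Both prediction sets have an MAE of 1.0, but the RMSE is 1.0 for the evenly spread errors and 2.0 for the single large miss, which is the "punishes large errors" behavior described above.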

---

### 2. Ranking / Information Retrieval Metrics (Most Common)

In most modern systems (like e-commerce, search, or media), you don't care about the *exact* rating. You care about **creating an ordered list** where the *best* items are at the **top**. These metrics evaluate the quality of that "Top-N" list (e.g., your "Top 10" recommendations).

* **Precision@k:**
    * **What it is:** Of the **Top-K** items you recommended, what fraction were *actually relevant* (e.g., items the user clicked, bought, or liked)?
    * **Formula:** $\frac{\text{Number of relevant items in Top-K}}{\text{K}}$
    * **Example:** If you recommend 10 items (K=10) and the user interacts with 3 of them, your Precision@10 is 3/10 = 0.3.
    * **Use Case:** Good for measuring "how many of the items I'm showing are useful?"

* **Recall@k:**
    * **What it is:** Of *all* the items the user *actually liked* (in the test set), what fraction did you *successfully recommend* in your Top-K list?
    * **Formula:** $\frac{\text{Number of relevant items in Top-K}}{\text{Total number of relevant items}}$
    * **Example:** If the user liked a total of 6 items, and your Top-10 list contained 3 of them, your Recall@10 is 3/6 = 0.5.
    * **Use Case:** Good for measuring "how many of the good items am I finding?"

* **Mean Average Precision (MAP):**
    * **What it is:** A more sophisticated version of Precision. It's the *average* of Precision@k, but it also **rewards for putting relevant items *higher* up the list.**
    * **How it works:** It calculates the precision at *each point* a relevant item is found in the list, and then averages these. This is then averaged across all users.
    * **Use Case:** Excellent for when the *order* of the relevant items matters, not just the count.

* **Normalized Discounted Cumulative Gain (NDCG@k):**
    * **What it is:** Often treated as the standard metric for ranking quality. It is the most involved of the metrics listed here, and it accounts for both graded relevance and position.
    * **How it works:**
        1.  **Cumulative Gain (CG):** Sums up the "relevance" of all items in the Top-K list.
        2.  **Discounted (DCG):** "Discounts" (penalizes) the relevance of items that are lower down the list. A relevant item at position 1 is worth more than at position 5.
        3.  **Normalized (NDCG):** Compares your model's DCG to the "ideal" DCG (the score of a perfect ranking). The result is a score between 0.0 and 1.0.
    * **Use Case:** The best metric when you have **graded relevance** (e.g., item A is "perfect" (5 stars), item B is "good" (3 stars)) and **position matters deeply**.
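
The four ranking metrics above can be sketched in plain Python for a single user. Two choices here are assumptions, since conventions vary: average precision is divided by the total number of relevant items, and DCG uses the raw relevance as gain (some variants use $2^{rel} - 1$). The ideal ranking in `ndcg_at_k` is built only from the items in the list being scored.

```python
import math

def precision_at_k(ranked, relevant, k):
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Sum of precision@i at each rank i holding a relevant item, over len(relevant)."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def dcg_at_k(gains, k):
    """Discounted cumulative gain: relevance divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """gains[i] is the graded relevance of the item shown at rank i + 1."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

ranked = ["a", "b", "c", "d", "e"]   # model's top-5 list for one user
relevant = {"a", "c", "f"}           # items the user actually liked

p5 = precision_at_k(ranked, relevant, 5)   # 2 hits out of 5 shown
r5 = recall_at_k(ranked, relevant, 5)      # 2 hits out of 3 relevant
ap = average_precision(ranked, relevant)
perfect = ndcg_at_k([3, 2, 1], 3)          # already in ideal order
reversed_order = ndcg_at_k([1, 2, 3], 3)   # best item shown last
```

MAP is then the mean of `average_precision` over all users. A list already sorted by relevance gets an NDCG of 1.0, and any other order of the same items scores lower.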

---

### 3. Business / "Beyond Accuracy" Metrics

These metrics measure the *actual impact* of the recommender system on business goals and user experience. They are often measured in **online A/B tests**.

* **Click-Through Rate (CTR):**
    * **What it is:** The percentage of recommended items that are clicked by users.
    * **Formula:** $\frac{\text{Number of Clicks}}{\text{Number of Impressions (Views)}}$
    * **Use Case:** The most common metric for advertising, search, and media (e.g., YouTube, news).

* **Conversion Rate (CVR):**
    * **What it is:** The percentage of clicks on recommended items that lead to a "conversion" (e.g., a purchase, a subscription, or a sign-up).
    * **Formula:** $\frac{\text{Number of Conversions}}{\text{Number of Clicks}}$
    * **Use Case:** The most important metric for e-commerce.

* **Diversity:**
    * **What it is:** Measures how *different* the items in the recommendation list are from each other.
    * **Why it matters:** A user might get bored if you only recommend 10 items from the exact same category (a "filter bubble"). High diversity improves user experience.

* **Serendipity & Novelty:**
    * **Novelty:** Measures if your system recommends items the user hasn't seen before.
    * **Serendipity:** Measures if your system recommends items that are *surprisingly* relevant and new (i.e., not just popular and not just similar to what they already like). This is key to user delight and "discovery."

* **Coverage:**
    * **What it is:** What percentage of your *total item catalog* is your system actually able to recommend?
    * **Why it matters:** A low coverage means you are only recommending the same popular items over and over, ignoring your "long-tail" inventory.
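
Coverage and a very simple diversity proxy can be computed as below. The definitions are one possible choice among several, and the catalog, recommendation lists, and category labels are invented for illustration.

```python
def catalog_coverage(recommendation_lists, catalog):
    """Share of the catalog that appears in at least one user's list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)

def category_diversity(items, category_of):
    """Distinct categories divided by list length (one simple diversity proxy)."""
    return len({category_of[item] for item in items}) / len(items)

catalog = ["a", "b", "c", "d", "e", "f", "g", "h"]
lists = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d"]]  # one list per user
category_of = {"a": "drama", "b": "drama", "c": "comedy", "d": "news"}

coverage = catalog_coverage(lists, catalog)
diversity_first_user = category_diversity(lists[0], category_of)
```

Here only 4 of the 8 catalog items are ever recommended, so coverage is 0.5. Diversity is often measured instead as the average pairwise dissimilarity between item embeddings; the category count is just the simplest version to show.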

## 3. Is a Larger Number of Hops Always Better in a GNN?
No, a larger number of hops (or layers) in a GNN is **not** always better. In fact, this is one of the most famous problems in GNN research.

For most GNN architectures, performance **degrades** significantly once the number of hops (layers) grows beyond a handful.

This is a fundamental trade-off. Each "hop" corresponds to one layer of the GNN. A $k$-layer GNN aggregates information from its $k$-hop neighborhood.

* **Benefit of More Hops:** A larger $k$ means a larger **"receptive field"**. The node gets information from further away in the graph, which should (in theory) help it capture more *global* structural information.
* **Drawback of More Hops:** Adding too many layers leads to a critical problem called **Over-smoothing**.

---

### 1. The Core Problem: Over-smoothing (The #1 Reason)

This is the main reason why deep GNNs (with many hops) fail.

* **What is it?** A GNN layer works by *averaging* (or "aggregating") the feature vectors of a node and its neighbors. When you repeat this averaging process too many times, the feature vectors of *all* nodes in the graph (or a connected component) start to look more and more like each other. They all converge towards a single, average "global" vector.

* **Analogy (Mixing Paint):**
    * **1 Hop:** Imagine your node is a drop of **blue** paint. You mix it with its neighbors (a drop of **red** and **yellow**). You get a unique **brown** color.
    * **2 Hops:** You mix your **brown** with your neighbors' mixed colors. You get a slightly different, more "average" brown.
    * **10 Hops:** After 10 steps of mixing, you have effectively averaged *every* color in the entire graph. The color for *every single node* is now the same **muddy brown**.

* **The Result:** If all node representations become identical, the GNN can no longer tell them apart. It loses all specific, local information, and its predictive power plummets.
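
Over-smoothing is easy to reproduce on a toy graph. The sketch below assumes the simplest possible GNN layer (plain mean aggregation over a node and its neighbors, with no learned weights) on a made-up 5-node path graph with scalar features.

```python
def mean_aggregate(features, neighbors):
    """One hop: each node becomes the average of itself and its neighbors."""
    updated = []
    for node, feat in enumerate(features):
        group = [feat] + [features[n] for n in neighbors[node]]
        updated.append(sum(group) / len(group))
    return updated

def spread_after_hops(features, neighbors, hops):
    """Gap between the largest and smallest node value after `hops` rounds."""
    for _ in range(hops):
        features = mean_aggregate(features, neighbors)
    return max(features) - min(features)

# A 5-node path graph 0-1-2-3-4; only the middle node starts with a signal.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
features = [0.0, 0.0, 10.0, 0.0, 0.0]

hop_counts = (0, 1, 2, 10, 50)
spreads = [spread_after_hops(features, neighbors, k) for k in hop_counts]
```

The spread starts at 10.0 and shrinks with every extra hop until all nodes are nearly indistinguishable. Learned weights and non-linearities change the details, but the same averaging effect is what makes deep GNNs lose local information.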

---

### 2. Other Major Problems

* **Computational Cost (Neighbor Explosion):**
    Each additional hop *exponentially* increases the number of neighbors (and their features) that must be processed for a single node. In a dense social or recommendation graph, a 2-hop GNN might be fast, but a 10-hop GNN could be trying to process billions of nodes, making it computationally infeasible.

* **Overfitting:**
    More layers (hops) mean more parameters. This increases the model's complexity and makes it more likely to overfit the training data, especially if the graph is small.

---

### What is the "Correct" Number of Hops?

For most real-world problems, **GNNs are deliberately kept very shallow.**

* The most common and effective architectures use only **2 to 4 layers (hops)**.
* In many tasks like recommendation, the most powerful signals come from your *immediate* neighborhood (e.g., your 1-hop friends, or 2-hop "friends of friends"). Information from 10 hops away (e.g., your friend's friend's friend's... cousin) is often just noise.

**In summary: You add hops to get more *global* context, but you must stop before over-smoothing *destroys* the essential *local* information.**


## 4. What Is the Architecture of an Industrial Recommender System?
The structure of a modern industrial-scale recommender system (like at YouTube, Amazon, or ByteDance) is a **multi-stage pipeline** designed to solve one core problem: **how to find the 10 best items for a specific user from a corpus of *billions* of items in under 200 milliseconds.**

A single, complex model cannot do this. The solution is a "funnel" architecture that progressively filters and refines the recommendations. This is almost always a three-stage system:

1.  **Candidate Generation (or "Retrieval")**
2.  **Scoring (or "Ranking")**
3.  **Re-ranking (or "Ordering")**

This entire model pipeline is supported by a massive real-time data and serving infrastructure.

---

### 1. The Model Architecture: The Three-Stage Funnel

#### Stage 1: Candidate Generation (The "Recall" Stage)

* **Goal:** To quickly filter the entire item corpus (e.g., 1 billion videos) down to a manageable subset of *hundreds* (e.g., ~500) of "good enough" candidates.
* **Focus:** **Speed** and **Recall** (don't miss any potentially good items).
* **How it works:** This stage uses simpler, computationally cheap models to find broad matches. Common models include:
    * **Two-Tower Models:** This is the most common modern approach. One "tower" (a DNN) learns a user embedding from user features. A second "tower" learns an item embedding from item features. At serving time, you find the items whose embeddings are "closest" (using a dot product or cosine similarity) to the user's embedding.
    * **Collaborative Filtering (e.g., Matrix Factorization):** "Users who liked this also liked..."
    * **Content-Based:** "Show other items from the same category/artist."
* **Output:** A list of ~500 candidate items that are "relevant."
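
The serving-time half of a two-tower model can be sketched as a top-k search over item embeddings. The embeddings below are made-up stand-ins for the outputs of the two trained towers, and the exhaustive scan is a simplification: at real scale it would be replaced by an approximate nearest neighbor index.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_candidates(user_embedding, item_embeddings, k):
    """Return the ids of the k items whose embeddings score highest with the user."""
    scored = ((dot(user_embedding, emb), item_id)
              for item_id, emb in item_embeddings.items())
    # nlargest returns (score, id) pairs sorted from best to worst.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Made-up item-tower outputs; in practice these are precomputed offline.
item_embeddings = {
    "cooking_video": [0.9, 0.1],
    "news_clip": [0.1, 0.9],
    "baking_video": [0.8, 0.3],
    "sports_clip": [-0.5, 0.2],
}
user_embedding = [1.0, 0.0]  # hypothetical user-tower output for a cooking fan

candidates = retrieve_candidates(user_embedding, item_embeddings, k=2)
```

Because the user and item towers never see each other's features, all item embeddings can be computed ahead of time and only the user embedding has to be computed per request.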

#### Stage 2: Scoring (The "Precision" Stage)

* **Goal:** To take the ~500 candidates and rank them in order of *precise* relevance to the user.
* **Focus:** **Accuracy** and **Precision**. This is where the heavy-duty "Deep Learning" models are used.
* **How it works:** This model is much more complex and can use *thousands* of features. Unlike the Two-Tower model, it **can mix user and item features together** to learn complex interactions.
* **Common Models:**
    * **DeepFM (or Wide & Deep):** This is a classic. It combines a "Deep" part (a DNN to learn high-order, non-linear interactions) and a "Shallow" part (like FM or Logistic Regression to "memorize" simple, co-occurrence rules).
    * **DLRM (Deep Learning Recommendation Model):** A specialized architecture from Facebook (Meta) that is highly optimized for this task.
* **Output:** A precise score (e.g., `pCTR = 0.92`) for each of the 500 candidates, which are then sorted.

#### Stage 3: Re-ranking (The "Business Logic" Stage)

* **Goal:** To take the top of the scored list (e.g., Top 50) and make final adjustments based on business rules, fairness, and user experience.
* **Focus:** **Diversity, Novelty, Fairness, and Business Objectives.**
* **How it works:** This stage applies a set of simple rules or lightweight models to the final list.
    * **Remove "Seen" Items:** Filter out items the user has already watched or purchased.
    * **Ensure Diversity:** Prevent showing 10 items from the exact same category. (e.g., "Don't show more than 3 videos from the same channel in the top 10.")
    * **Promote Freshness:** Boost the score of new or trending content.
    * **Fairness:** Ensure that content from new creators gets a chance to be seen.
    * **Business Rules:** Promote sponsored items or items with higher profit margins.
* **Output:** The final Top-10 list that is actually displayed to the user.
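
Two of these rules (removing seen items and capping items per channel) can be sketched as a single pass over the scored list. The item records, field names, and thresholds are hypothetical; a real re-ranker would combine many more rules.

```python
def rerank(scored_items, seen_ids, max_per_channel=3, top_n=10):
    """Drop seen items, cap items per channel, and keep the top_n by score."""
    final, per_channel = [], {}
    for item in sorted(scored_items, key=lambda it: it["score"], reverse=True):
        if item["id"] in seen_ids:
            continue  # rule: remove items the user has already seen
        count = per_channel.get(item["channel"], 0)
        if count >= max_per_channel:
            continue  # rule: diversity cap per channel
        per_channel[item["channel"]] = count + 1
        final.append(item["id"])
        if len(final) == top_n:
            break
    return final

# Made-up ranker output: five videos from channel A, two from channel B.
scored = [
    {"id": "a1", "channel": "A", "score": 0.95},
    {"id": "a2", "channel": "A", "score": 0.93},
    {"id": "a3", "channel": "A", "score": 0.91},
    {"id": "a4", "channel": "A", "score": 0.90},
    {"id": "b1", "channel": "B", "score": 0.80},
    {"id": "a5", "channel": "A", "score": 0.70},
    {"id": "b2", "channel": "B", "score": 0.60},
]
final_list = rerank(scored, seen_ids={"a2"}, max_per_channel=3, top_n=5)
```

In this example `a2` is dropped because it was already seen, and `a5` is dropped by the channel cap, so lower-scored items from channel B make it into the final list.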

---

### 2. The Supporting Infrastructure (The "Industrial" Part)

The 3-stage model is just one piece. The true "industrial" structure is the data and serving system that makes it possible in real-time.

#### 1. Data Pipelines (Offline vs. Online)

* **Offline Batch Pipeline:** Runs daily or hourly. It uses tools like **Spark** (often on top of a storage layer such as **Delta Lake**) to process massive historical logs.
    * **Purpose:** Training models (Ranker, Two-Tower) and pre-computing features for the Feature Store.
* **Online Streaming Pipeline:** Runs 24/7. It uses tools like **Kafka** (to ingest events) and **Flink/Spark Streaming** (to process them).
    * **Purpose:** To capture the user's *immediate* actions (e.g., "user just clicked item X"). This data is fed *instantly* to the Feature Store to update the user's profile, making the recommendations "real-time."

#### 2. The Feature Store

This is the central "brain" of the system. Because training and serving read the same feature definitions from it, it helps avoid the **training-serving skew** problem. It has two parts:

* **Offline Store (e.g., S3, Hive):** A massive database of historical features used by the **Offline Pipeline** to train models.
* **Online Store (e.g., Redis, DynamoDB):** A super-fast, low-latency key-value database.
    * **How it's used:** When a user request hits the server, the **Scoring** model (Stage 2) needs features like "user's click count in the last 10 minutes." It queries the **Online Store** to get these "fresh" features in milliseconds.
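
As a rough illustration of a "fresh" feature such as "click count in the last 10 minutes," here is an in-memory stand-in for the online store. The class `OnlineClickCounter` and its methods are hypothetical; a real system would keep this state in a store like Redis rather than in process memory.

```python
from collections import defaultdict, deque

class OnlineClickCounter:
    """Toy online feature: number of clicks per user inside a sliding window."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.events = defaultdict(deque)  # user_id -> click timestamps (seconds)

    def record_click(self, user_id, ts):
        # Called by the streaming pipeline for every click event.
        self.events[user_id].append(ts)

    def clicks_in_window(self, user_id, now):
        # Called by the ranking service at request time.
        q = self.events[user_id]
        while q and q[0] <= now - self.window:
            q.popleft()  # evict clicks that fell out of the window
        return len(q)

store = OnlineClickCounter()
for ts in (0, 100, 650):
    store.record_click("u1", ts)
recent = store.clicks_in_window("u1", now=700)  # only the click at t=650 remains
```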

#### 3. Model Serving & Evaluation

* **Candidate Generation Service:** This service queries for the user embedding and uses an **Approximate Nearest Neighbor (ANN)** index (like **Faiss** or **ScaNN**) to find the 500 closest item embeddings from the corpus of billions, all in ~20-30ms.
* **Ranking Service:** This service gets the 500 candidates, queries the **Online Feature Store** for fresh features, and runs the complex DeepFM/DLRM model to score them, all in ~100-150ms.
* **A/B Testing Framework:** No new model goes straight to 100% of traffic. It is first exposed to a small percentage of users (e.g., 1%) as an **A/B test** (similar in spirit to a "canary release"). The system logs the business metrics (CTR, CVR, diversity, etc.) for the new model and compares them to the old one. Only after it proves better in a live test is it rolled out to 100% of users.
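
One common way to pick the "small percentage of users" is deterministic hashing, so that a user always lands in the same group. The sketch below assumes that approach; the function name `assign_bucket` and the experiment key are illustrative, not part of any specific framework.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, treatment_pct: float = 1.0) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    treatment_pct is a percentage (1.0 means roughly 1% of users).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    slot = int(digest[:8], 16) % 10000  # stable slot in 0..9999
    return "treatment" if slot < treatment_pct * 100 else "control"

# Hypothetical experiment name; the same user always gets the same answer.
group = assign_bucket("user42", "new_ranker_v2")
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users are not always the ones who see new models.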

## 5. How to Solve the Sparsity Problem in Recommendation Systems

**Sparsity** is the single biggest challenge in most recommender systems. It describes the fact that the **user-item interaction matrix** (the "who-liked-what" dataset) is **mostly empty**. Most users have only interacted with a tiny fraction of the total available items.

This creates two major problems:
1.  **General Sparsity:** It's hard to find patterns. If two users have no items in common, traditional collaborative filtering can't determine if they are similar.
2.  **The "Cold Start" Problem:**
    * **User Cold Start:** A new user joins. They have *zero* interactions. We know nothing about them. What do we recommend?
    * **Item Cold Start:** A new item is added to the catalog. It has *zero* interactions. Who should we show it to?

Here are the primary methods used in industrial systems to solve this, moving from simple heuristics to advanced models.

---

### 1. Model-Based Solutions (Handling General Sparsity)

These models are inherently designed to "fill in the blanks" of a sparse matrix by learning generalized patterns.

* **Matrix Factorization (MF) & Embedding-Based Models (e.g., DeepFM):**
    This is the **fundamental solution** to general sparsity. Instead of comparing users to other users, MF-based models learn a dense, low-dimensional "latent factor" vector (an **embedding**) for every user and every item.
    * **How it helps:** The model learns a user's "taste" (e.g., `user_A = [loves_action, hates_comedy]`) and an item's "characteristics" (e.g., `item_B = [is_action, is_not_comedy]`).
    * The model can then predict that User A will like Item B (by taking the dot product of their vectors), even if they have *never* interacted before.
    * This is the core of models like **Matrix Factorization** and the **FM** part of **DeepFM**.
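
The idea of predicting an unseen user-item pair from learned vectors can be sketched with a tiny matrix factorization trained by SGD. The data, hyperparameters, and function names below are made up for illustration; ratings are +1 (liked) and -1 (disliked).

```python
import random

def train_mf(ratings, n_users, n_items, dim=2, lr=0.05, reg=0.01, epochs=300, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] approximates r."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for k in range(dim):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)  # SGD step with L2 penalty
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

# Items 0 and 1 are "action", items 2 and 3 are "comedy" (hypothetical catalog).
# User 0 is sparse: they rated only items 0 and 2.
ratings = [
    (0, 0, 1), (0, 2, -1),
    (1, 0, 1), (1, 1, 1), (1, 2, -1), (1, 3, -1),
    (2, 0, -1), (2, 1, -1), (2, 2, 1), (2, 3, 1),
]
P, Q = train_mf(ratings, n_users=3, n_items=4)
score_action = predict(P, Q, 0, 1)  # user 0 never saw item 1
score_comedy = predict(P, Q, 0, 3)  # user 0 never saw item 3
```

Although user 0 never interacted with items 1 or 3, the learned factors should rank the unseen action item above the unseen comedy item, because other users' behavior ties item 1 to item 0 and item 3 to item 2.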

* **Graph Neural Networks (GNNs):**
    This is a more advanced technique. A GNN (like GraphSAGE or PinSAGE) doesn't just learn an embedding for a user based on their *own* interactions; it also learns from the interactions of their **neighbors**.
    * **How it helps:** If User A has only liked one item, they are very "sparse." But if their friends (1-hop neighbors) or "friends of friends" (2-hop neighbors) are very active, the GNN will "pass" this information (via message passing) to User A's embedding.
    * This enriches the representation of sparse users/items by borrowing information from their local graph neighborhood.
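
The message-passing step can be illustrated with one round of mean aggregation. This is a deliberately simplified sketch: real GraphSAGE/PinSAGE layers apply learned weight matrices and a non-linearity, which are omitted here, and the `self_weight` mixing factor is an assumption of the example.

```python
def mean_aggregate(embeddings, neighbors, self_weight=0.5):
    """One message-passing round: mix each node's vector with the mean of its neighbors."""
    updated = {}
    for node, vec in embeddings.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            updated[node] = list(vec)  # isolated node keeps its own vector
            continue
        mean = [sum(embeddings[n][k] for n in nbrs) / len(nbrs) for k in range(len(vec))]
        updated[node] = [
            self_weight * vec[k] + (1 - self_weight) * mean[k] for k in range(len(vec))
        ]
    return updated

# User A has almost no signal of their own; friends B and C are active.
embeddings = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [1.0, 0.5]}
neighbors = {"A": ["B", "C"]}
after_one_hop = mean_aggregate(embeddings, neighbors)
```

Calling `mean_aggregate` a second time on its own output would propagate 2-hop information, which is how "friends of friends" reach a sparse user.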

---

### 2. Feature-Based / Hybrid Solutions (Handling the "Cold Start" Problem)

When you have *zero* interactions (cold start), you can't use collaborative models. You *must* rely on metadata.

* **Content-Based Filtering:**
    This is the primary strategy for cold-start. You ignore the interaction matrix and use the *features* of the items or users.
    * **Item Cold Start:** A new movie has no ratings. But it *does* have a `genre`, `director`, `cast`, and `description`. You can recommend this new item to users who have previously liked other items with the *same* `genre` or `director`.
    * **User Cold Start:** A new user has no history. But you can ask them for their `age`, `gender`, `location`, or ask them to select a few `genres` they like during onboarding. You can then recommend items that are popular with other users in their demographic (e.g., "users aged 18-25 in your area also liked...").
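
For item cold start, a minimal content-based scorer only needs metadata overlap. The sketch below uses Jaccard similarity over tag sets; the tags and function names are hypothetical, and a real system would more likely use learned text or image embeddings.

```python
def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def score_new_item(new_item_tags, liked_items_tags):
    """Average metadata overlap between a new item and the items a user liked."""
    if not liked_items_tags:
        return 0.0
    return sum(jaccard(new_item_tags, t) for t in liked_items_tags) / len(liked_items_tags)

new_movie = {"action", "nolan"}  # no ratings yet, only metadata
action_fan = [{"action", "nolan"}, {"action", "cameron"}]
comedy_fan = [{"comedy", "apatow"}]

fan_score = score_new_item(new_movie, action_fan)
other_score = score_new_item(new_movie, comedy_fan)
```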

* **Hybrid Models (The Industrial Standard):**
    Modern models like **DeepFM** or **Wide & Deep** are *inherently* hybrid. This is their main strength. They combine both approaches:
    1.  **Collaborative (ID) Features:** They learn embeddings for the `user_id` and `item_id` directly from interaction data.
    2.  **Content (Metadata) Features:** They also take `user_age`, `item_category`, `time_of_day`, etc., as inputs, which stay informative even when an ID has little or no history.
    * **How it helps:** For a new user, the `user_id` embedding carries no signal (it is still at its random initialization). But the model can *still* make a good prediction based on their `age`, `gender`, and `location`. The content features "bootstrap" the recommendation until the user has enough interaction history for the collaborative part to become powerful.

---

### 3. System-Level / Heuristic Solutions (Practical Fallbacks)

These are the business-logic "safety nets" that every real-world system implements.

* **For User Cold Start: The "Popularity" Baseline**
    The simplest and most reliable solution for a new user is to **not** personalize.
    * **Action:** Show them the "Top 10 Most Popular," "Trending Now," or "Most-Viewed" items, either globally or for their region.
    * **Why it works:** Popular items are popular for a reason. It's a high-probability, low-risk recommendation that also serves as the *first* data-gathering step for that user.

* **For Item Cold Start: Exploration Strategies**
    A new item has no history, so a model driven by interaction data will rarely pick it on its own. You have to *force* the system to show it.
    * **Action:** This is an **Explore-Exploit** problem. You "exploit" by showing items you *know* are good, but you must also "explore" by intentionally injecting *new, unknown items* into a small percentage of users' recommendations.
    * **How it works:** You might reserve 1-5% of recommendation slots for "new items" to gather the first clicks and interactions. This data is then fed back into the model, allowing it to learn the item's embedding and start recommending it "organically."
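
Reserving a small share of slots for new items can be sketched as a per-slot coin flip (an epsilon-greedy style rule). The function name `inject_new_items` and the 5% default are assumptions for illustration.

```python
import random

def inject_new_items(ranked, new_items, explore_rate=0.05, rng=None):
    """Replace each slot with a random unexplored item with probability explore_rate."""
    rng = rng or random.Random()
    pool = list(new_items)
    result = []
    for item in ranked:
        if pool and rng.random() < explore_rate:
            result.append(pool.pop(rng.randrange(len(pool))))  # explore
        else:
            result.append(item)  # exploit the model's ranking
    return result

ranked = ["r0", "r1", "r2", "r3", "r4"]
fresh = ["new_a", "new_b"]
served = inject_new_items(ranked, fresh, explore_rate=0.05, rng=random.Random(7))
```

The list length is unchanged, so exploration costs at most a few displaced items per request; more refined schemes (e.g., bandit algorithms) adapt the exploration rate per item as feedback arrives.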


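# Lecture 12: Google's Three Major Practical NLP Algorithms
Course content: In natural language processing (NLP), translation is an important task. It involves converting text in one natural language (the source language) into equivalent text in another natural language (the target language).
Translation poses many challenges because languages differ in grammar, vocabulary, semantics, and culture. A translation system must understand the meaning of the source text and express an equivalent meaning accurately in the target language. This lecture introduces common translation models, including n-gram models and encoder-decoder RNNs.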

## 1. Differences Between Bi-LSTM and LSTM
The key difference is the **direction of information processing** and, as a result, the **context** the model has at any given time step.

* **LSTM** is **unidirectional**: It only processes information from the past (from beginning to end).
* **Bi-LSTM** (Bidirectional LSTM) is **bidirectional**: It processes information from both the past *and* the future (from beginning-to-end and end-to-beginning).

---

### 1. Standard LSTM (Unidirectional)

A standard LSTM is a type of RNN that reads a sequence (like a sentence) one word at a time, from left to right.

* **How it works:** At any time step `t`, the LSTM's hidden state ($h_t$) is calculated using two inputs:
    1.  The current input $x_t$ (the current word).
    2.  The hidden state from the previous time step $h_{t-1}$ (the "memory" of all past words).
* **The Limitation (Incomplete Context):** The model's "understanding" of a word is based *only* on the words that came before it.
* **Example:** In the sentence "The **bank** ... by the **river**," with "bank" at time step $t$:
    * When the LSTM is at the word "bank", it has only seen "The". It has no idea if "bank" means a financial institution or a river bank. It's making a guess based on incomplete context.

---

### 2. Bi-LSTM (Bidirectional LSTM)

A Bi-LSTM is not a new *type* of cell; it's an **architecture** that uses two separate LSTMs on the same input.

* **How it works:**
    1.  **Forward LSTM:** One LSTM processes the sentence from **left-to-right** (beginning-to-end), just like a standard LSTM. This generates a "forward" hidden state ($\overrightarrow{h_t}$) for each word, which knows about the *past* context.
    2.  **Backward LSTM:** A *second, separate* LSTM processes the same sentence from **right-to-left** (end-to-beginning). This generates a "backward" hidden state ($\overleftarrow{h_t}$) for each word, which knows about the *future* context.
* **The Final State:** The final, context-aware representation for the word at time `t` is the **concatenation** of these two hidden states:
    $h_t^{\text{final}} = [ \overrightarrow{h_t} ; \overleftarrow{h_t} ]$
* **The Advantage (Full Context):** The model's "understanding" of a word is now based on *all* the words that came before it *and* all the words that come after it.
* **Example (Same Sentence):** "The **bank** ... by the **river**," with "bank" at time step $t$:
    * When the Bi-LSTM is at the word "bank":
        * The **Forward LSTM** knows "The".
        * The **Backward LSTM** (which started at "river") knows "by the river".
    * The combined state `[h_forward ; h_backward]` has the full context and can easily determine that "bank" refers to a river bank.
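
The two-pass structure can be demonstrated without a deep learning library. In the sketch below a scalar tanh RNN cell stands in for the LSTM cell (an assumption made to keep the example short), and the weights are arbitrary fixed numbers rather than trained values.

```python
import math

def run_rnn(xs, w_in, w_rec):
    """Scalar tanh RNN: returns the hidden state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(xs):
    """Pair each position with (forward state, backward state)."""
    forward = run_rnn(xs, w_in=0.5, w_rec=0.8)
    # The backward pass has its own weights and reads the sequence reversed;
    # its outputs are flipped back so index t lines up with the forward pass.
    backward = run_rnn(xs[::-1], w_in=0.3, w_rec=0.6)[::-1]
    return list(zip(forward, backward))

original = bidirectional_states([1.0, 2.0, 3.0])
changed = bidirectional_states([1.0, 2.0, -3.0])  # only the last input differs
```

Comparing `original[0]` with `changed[0]` shows the point of the architecture: the forward state at position 0 is identical in both runs, while the backward state differs because it has "seen" the changed future input.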

---

### Key Differences & Trade-offs

| Feature | LSTM (Unidirectional) | Bi-LSTM (Bidirectional) |
| :--- | :--- | :--- |
| **Information Flow** | **One direction** (Past to Future) | **Two directions** (Past $\leftrightarrow$ Future) |
| **Context** | Only knows the **past** context. | Knows both **past and future** context. |
| **Architecture** | A single LSTM layer. | **Two LSTM layers** running in parallel. |
| **Performance** | Weaker on tasks that need full-sentence context; a word's representation cannot use the words that follow it. | Usually stronger on NLP understanding tasks, since each word's representation uses both left and right context. |
| **Parameters** | `N` parameters. | `2 * N` parameters (roughly double). Slower to train. |
| **Use Case (NLP)** | Good for **Generative** tasks (a **Decoder**) where you *cannot* see the future, like language modeling. | Good for **Representation** tasks (an **Encoder**) where you *can* see the full sentence, like translation, sentiment analysis, or NER. |
| **Use Case (Other)** | **Time-Series Forecasting** (where you must not see the future). | **Not suitable** for real-time forecasting. |

## 2. What is the Objective Function for a Translation Task?

This is a foundational concept. The objective function for a translation task (and most sequence-to-sequence tasks) is **Maximum Likelihood Estimation (MLE)**.

In practice, this is implemented by minimizing the **Cross-Entropy Loss**.

Here is the breakdown of the principle.

---

### 1. The High-Level Goal: Maximize Probability

The goal of the model is to find the **most probable** target sentence ($Y$) given the source sentence ($X$).

* **Source Sentence ($X$):** "I am a student"
* **Target Sentence ($Y$):** "Je suis un étudiant"

We want to train our model's parameters ($\theta$) to **maximize the conditional probability $P(Y|X)$**.

### 2. The Mathematical Formulation (Maximum Likelihood)

A sentence $Y$ is a sequence of words (tokens): $Y = (y_1, y_2, ..., y_T)$.

To find the probability of this entire sequence, we use the **chain rule of probability**:
$P(Y|X) = P(y_1|X) \times P(y_2|y_1, X) \times P(y_3|y_1, y_2, X) \times ... \times P(y_T|y_{<T}, X)$

This is a product of the probabilities of each word, given the source sentence and all the target words that came before it.

The objective function for the model is to **maximize this product** across the entire training dataset:

$$\text{Objective} = \max_{\theta} \prod_{(X,Y) \in \text{Dataset}} \prod_{t=1}^{T} P(y_t | y_{<t}, X; \theta)$$

This is **Maximum Likelihood Estimation (MLE)**.

### 3. The Practical Loss Function: Cross-Entropy

Working with a product of probabilities is numerically unstable (multiplying many small numbers leads to underflow). The standard solution is to work with **log-probabilities**, which converts the product into a sum.

Since $\log$ is a monotonically increasing function, maximizing a value is the same as maximizing its log:

$$\text{Objective} = \max_{\theta} \sum_{(X,Y)} \sum_{t=1}^{T} \log P(y_t | y_{<t}, X; \theta)$$

This is called the **Log-Likelihood**.
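
A quick numeric check of the underflow argument, using a made-up sequence of 200 tokens that each have probability 0.01:

```python
import math

probs = [0.01] * 200  # hypothetical per-token probabilities

# Direct product: 0.01**200 = 1e-400, far below the smallest float64 value.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale, with no underflow.
log_likelihood = sum(math.log(p) for p in probs)

print(product)         # underflows to 0.0
print(log_likelihood)  # about -921.03
```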

In machine learning, we prefer to *minimize a loss function*. Maximizing a value is identical to **minimizing its negative**. This gives us our final loss function:

$$\text{Loss Function} = \min_{\theta} \sum_{(X,Y)} - \sum_{t=1}^{T} \log P(y_t | y_{<t}, X; \theta)$$

This is the **negative log-likelihood (NLL)**, which is exactly the **Cross-Entropy Loss** between the model's predicted distribution and the one-hot ground-truth token at each step.

* **How it works:** At each time step $t$, the decoder (e.g., an RNN or Transformer) applies a `softmax` layer, which outputs a probability distribution over the *entire* vocabulary (e.g., 50,000 words).
* The term $- \log P(y_t | ...)$ simply picks out the negative log-probability of the **one correct word** (the ground truth $y_t$) from that distribution.
* The model is trained to make this specific probability as close to 1.0 (and its log-probability as close to 0) as possible, for every step in the sentence.
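
The per-step computation can be written out directly. This sketch assumes the decoder's raw scores (logits) for each step are already available; the function names and the three-word vocabulary are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(step_logits, target_ids):
    """Sum over time steps of -log P(y_t | y_<t, X)."""
    loss = 0.0
    for logits, y in zip(step_logits, target_ids):
        probs = softmax(logits)      # distribution over the vocabulary
        loss += -math.log(probs[y])  # pick the ground-truth token
    return loss

# Two decoding steps over a 3-word vocabulary.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_ids = [0, 1]
loss = sequence_nll(step_logits, target_ids)
```

A confident, correct step (step 1) contributes a small loss, while a uniform guess (step 2) contributes $\log 3$, the loss of knowing nothing about a 3-word vocabulary.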

---

### Important Distinction: Training vs. Inference

It is critical to separate the *training objective* from the *final evaluation metric*.

* **Training Objective (Loss Function):** **Cross-Entropy Loss**. This is what the model *optimizes* (using "teacher forcing," where the decoder is fed the correct previous words $y_{<t}$ at each step instead of its own predictions). It must be differentiable.
* **Evaluation Metric:** **BLEU Score** (or METEOR; ROUGE is more common for summarization). These are automatic metrics that score the final generated translation by comparing it against human reference translations. They are *not* differentiable and cannot be used directly as a loss function.
* **Inference Procedure:** **Beam Search**. During inference (testing), we don't have the target $Y$. We must *search* for the sentence $Y$ that maximizes $P(Y|X)$. Exact search is intractable, so **Beam Search** is the heuristic algorithm used to find a high-probability sentence.
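
A minimal beam search over a toy model shows why it can beat greedy decoding. Everything here is a sketch under stated assumptions: `toy_model` is a hand-written probability table standing in for a trained decoder, and no length normalization is applied.

```python
import math

def beam_search(next_probs, beam_width=2, max_len=10, eos="</s>"):
    """Keep the beam_width best partial sequences by cumulative log-probability."""
    beams = [((), 0.0)]  # (tokens so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

# Hypothetical next-token table: "A" looks best at step 1, but "B z" is the best sequence.
TABLE = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def toy_model(prefix):
    return TABLE.get(prefix, {"</s>": 1.0})

best_tokens, best_score = beam_search(toy_model, beam_width=2)
greedy_tokens, greedy_score = beam_search(toy_model, beam_width=1)
```

With `beam_width=1` the search is greedy: it commits to "A" (0.6) and ends with a sequence of probability 0.3. With `beam_width=2` it keeps "B" alive and finds "B z" with probability 0.36.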

## 3. What Tasks are Seq2Seq Tasks?

A **Seq2Seq (Sequence-to-Sequence)** task is any task that involves **transforming an input sequence of variable length into an output sequence of variable length**.

The input and output sequences do not need to be the same length, and they don't even need to be the same *type* of data. This general-purpose framework is one of the most powerful in deep learning.

Here are the most common examples:

### 1. Natural Language Processing (Text-to-Text)

This is the most famous category.

* **Machine Translation:**
    * **Input:** A sentence in one language (e.g., "How are you?").
    * **Output:** The same sentence in another language (e.g., "Comment allez-vous ?").
* **Text Summarization:**
    * **Input:** A long document or article (a long sequence).
    * **Output:** A short summary of that document (a short sequence).
* **Question Answering:**
    * **Input:** A "context" document and a "question" (often concatenated together).
    * **Output:** The "answer" to the question.
* **Chatbots / Conversational AI:**
    * **Input:** A user's prompt or question.
    * **Output:** The chatbot's reply.
* **Grammatical Error Correction:**
    * **Input:** A sentence with grammatical errors (e.g., "I go to store yesterday").
    * **Output:** The corrected sentence (e.g., "I went to the store yesterday").
* **Style Transfer:**
    * **Input:** A sentence in one style (e.g., "The movie was bad.").
    * **Output:** The same sentence in another style (e.g., "I found the film to be profoundly disappointing.").

### 2. Speech Processing (Audio-to-Text / Text-to-Audio)

* **Automatic Speech Recognition (ASR):**
    * **Input:** An audio sequence (a waveform or spectrogram).
    * **Output:** A text sequence (the transcript).
* **Text-to-Speech (TTS):**
    * **Input:** A text sequence.
    * **Output:** An audio sequence (the generated speech).

### 3. Computer Vision (Image-to-Text)

* **Image Captioning:**
    * **Input:** An image (which is first processed by a CNN "encoder").
    * **Output:** A text sequence describing the image (e.g., "A red car parked on the street").
* **Video Captioning:**
    * **Input:** A sequence of video frames.
    * **Output:** A text sequence describing the action in the video.

### 4. Code & Software Engineering

* **Code Generation:**
    * **Input:** A natural language description (e.g., "a Python function that sorts a list").
    * **Output:** The corresponding code sequence (e.g., `def sort_list(my_list): ...`).

