## 🔁 Graph Representation Learning

### ❓ What is Graph Representation Learning?

In modern semi-supervised learning, **graph representation learning** (also called **graph embedding**) refers to learning **compact, task-independent features** for the nodes of a graph, so that they can be used in machine learning tasks such as prediction.

---

### 📌 How It Works:

Each node `u` in the graph is mapped to a **low-dimensional vector** (embedding):



<center>
    <img src="images/image1.png" alt="Image" width="600"/>
</center>


$$
f: u \to \mathbb{R}^d
$$



This resulting vector is known as the **embedding vector**.


### 🎯 Goal:

The goal of these methods is to **embed nodes** into **low-dimensional vectors** in a way that **preserves their structural context** in the graph.  
In other words, we want to **embed nodes into a latent space** where the **geometric relationships** reflect the **original graph neighborhood structure**.


### 🧠 Applications:

- Node classification  
- Link prediction  
- Graph classification  
- Anomaly detection  
- Community detection


## 📊 Graph Embedding

Graph embedding is a technique that can address the challenge of graph analysis in a **cost-effective** and **precise** manner.  
This method converts the graph into a **vector-based representation** (typically in lower dimensions) based on its structure.



### 🖼️ Visual Explanation of Graph Embedding Types

**(A)** Original Graph  
Nodes are color-coded into three clusters:
- Blue: A, B, C, D  
- Yellow: E, F, G  
- Red: H, I, J  



**(B)** Node Embedding  
Each **node** is embedded into a 2D space, preserving structural similarity.  
Nodes from the same cluster are located close to each other.



**(C)** Edge Embedding  
Each **edge** is mapped to a point in 2D space.  
The goal is to preserve the edge-level relationships.



**(D)** Subgraph Embedding  
Groups of nodes (subgraphs) are represented in a compact form, capturing local structures like communities.



**(E)** Whole Graph Embedding  
The entire graph is embedded into a single point in space, which is useful for comparing whole graphs.



# How can we learn the embedding function *f*?


<center>
    <img src="images/image2.png" alt="Image" width="600"/>
</center>


## 🔷 Node Embedding: Encoding with Matrix Formulation

We focus on **node embedding** based on the **encoder-decoder framework**.

### ✅ Encoder Function

We define a function that maps each node $v \in \mathcal{V}$ to an embedding vector $\mathbf{z}_v \in \mathbb{R}^d$:

$\text{ENC} : \mathcal{V} \rightarrow \mathbb{R}^d,\quad \text{ENC}(v) = \mathbf{z}_v$

This function **encodes** the node $v$ into a low-dimensional vector representation.



### 🟨 Embedding Matrix $Z$

We can arrange all node embeddings in a matrix:

$Z \in \mathbb{R}^{d \times |\mathcal{V}|}$

Each **column** of $Z$ corresponds to the embedding of one node:

$Z[:, v] = \mathbf{z}_v$

So we can write:

$\text{ENC}(v) = Z \cdot \mathbf{x}_v$

where $\mathbf{x}_v$ is a **one-hot vector** indicating the index of node $v$.
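
The matrix form of the encoder can be checked numerically. The snippet below is a minimal sketch on a hypothetical toy setup (4 nodes, $d = 2$, random values in `Z`); the helper names `one_hot` and `enc` are illustrative and not part of the original text.

```python
import numpy as np

# Hypothetical toy setup: 4 nodes, embedding dimension d = 2.
num_nodes, d = 4, 2
rng = np.random.default_rng(0)
Z = rng.normal(size=(d, num_nodes))  # one column per node


def one_hot(v, n):
    """Return the one-hot indicator vector x_v of length n."""
    x = np.zeros(n)
    x[v] = 1.0
    return x


def enc(v):
    """ENC(v) = Z @ x_v, which selects column v of Z."""
    return Z @ one_hot(v, num_nodes)


z_2 = enc(2)
print(np.allclose(z_2, Z[:, 2]))  # the matrix product equals a column lookup
```

In practice the product with a one-hot vector is usually replaced by direct indexing (`Z[:, v]`), which gives the same vector without building $\mathbf{x}_v$.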



### 📌 Notes:

- Each **column** of $Z$ represents the **embedding** for a node.
- $\mathbf{z}_v \in \mathbb{R}^d$: the latent representation of node $v$.
- This formulation allows us to apply vector operations efficiently.




<center>
    <img src="images/image3.png" alt="Image" width="600"/>
</center>


### 🎯 Decoder Objective: Reconstructing the Relationship Between Nodes $u$ and $v$

In graph representation learning, the **decoder** reconstructs information about the original graph using the node embeddings:

$$
\text{DEC}: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^+
$$

The decoder receives the embeddings of nodes $u$ and $v$, and estimates their similarity:

$$
\text{DEC}(\text{ENC}(u), \text{ENC}(v)) = \text{DEC}(z_u, z_v) \approx S[u, v]
$$

- Here, $S[u, v]$ is an entry in the **similarity matrix** $S$, which captures how similar or connected nodes $u$ and $v$ are in the original graph.



### 💡 A Simple Choice for $S$:

We can use the adjacency matrix $A$ of the graph as the similarity matrix:

$$
S[u, v] \triangleq A[u, v]
$$

This means:

- $S[u, v] = 1$ if nodes $u$ and $v$ are connected  
- $S[u, v] = 0$ otherwise
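
As a small illustration, the adjacency-based similarity matrix can be built from an edge list. The graph below is a hypothetical 4-node undirected example chosen only for this sketch.

```python
import numpy as np

# Hypothetical undirected toy graph on 4 nodes, given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4

A = np.zeros((num_nodes, num_nodes), dtype=int)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1  # undirected graph: store the edge in both directions

S = A  # S[u, v] = A[u, v]
print(S)
```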

### 🔹 Common Neighbors

To count the common neighbors of nodes $v_i$ and $v_j$, we can compute:

$$
S_{CN} = AA = A^2
$$

- For an **undirected graph**, the entry $S_{CN}[i, j]$ is the number of common neighbors of nodes $v_i$ and $v_j$.

- For a **directed graph**, with the convention that $A[i, j] = 1$ denotes an edge $v_i \rightarrow v_j$, $S_{CN}[i, j]$ counts the nodes $v_k$ with **edges from $v_i$ to $v_k$ and from $v_k$ to $v_j$**, i.e., the two-step paths $v_i \rightarrow v_k \rightarrow v_j$.
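
For the undirected case, the product can be verified on a small example. This is a sketch on a hypothetical 4-node graph (a triangle 0-1-2 with node 3 attached to node 2); the variable names are illustrative.

```python
import numpy as np

# Adjacency matrix of a hypothetical undirected graph:
# triangle 0-1-2 plus an extra edge 2-3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

S_cn = A @ A  # S_cn[i, j] = sum_k A[i, k] * A[k, j]

print(S_cn)
print(S_cn[0, 1])  # nodes 0 and 1 share exactly one neighbor (node 2)
```

Note that the diagonal of $A^2$ holds each node's degree, so only the off-diagonal entries are usually read as common-neighbor counts.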


### 🔄 Pairwise Decoder

A **pairwise decoder** $\text{DEC}(z_u, z_v)$ predicts the **relationship or similarity** between nodes $u$ and $v$.  
For example, it may estimate whether the two nodes are **neighbors** in the original graph.



> ✅ The goal is to **minimize reconstruction error** between the predicted similarity and the true similarity in $S$,  
> thus optimizing both the **encoder** and the **decoder** functions.


___

# 🔷 Node Representation Learning: Shallow Embedding

- This is an **unsupervised** method for node representation learning.

- We **do not use node labels**.
- We **do not use node features**.

- The goal is to **directly estimate an embedding vector for each node**, such that certain aspects of the graph structure are preserved.

- These embeddings are **independent of any downstream prediction task**.


## Shallow Embedding

- $z_u$: the embedding of node $u$, which is the target of our learning process.

- If $\mathcal{D}$ is the set of training data pairs, the goal is to minimize the following loss function $\mathcal{L}$:

$$
\mathcal{L} = \sum_{(u,v) \in \mathcal{D}} \ell\left( \text{DEC}(z_u, z_v),\ \text{S}[u,v] \right)
$$

where $\ell: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$

- $\ell$ is a loss function that measures the difference between the decoded similarity $\text{DEC}(z_u, z_v)$ and the true similarity $S[u,v]$.

- The loss function $\ell$ can vary depending on how similarity and the decoder are defined. A common choice is the **mean squared error** when $S[u,v]$ is real-valued (regression); a classification-style loss such as **cross-entropy** is typical when $S[u,v]$ is binary.
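
To make the objective concrete, here is a minimal sketch of shallow embedding training under several assumptions that the text does not prescribe: the decoder is a plain inner product, $\ell$ is the squared error, $S$ is the adjacency matrix of a hypothetical 4-node graph, $\mathcal{D}$ is the set of all ordered pairs with $u \neq v$, and the learning rate and step count are arbitrary.

```python
import numpy as np

# Toy similarity matrix S: adjacency of a hypothetical 4-node graph.
S = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
n, d = S.shape[0], 2

# Training pairs D: all ordered pairs (u, v) with u != v (an assumption).
pairs = [(u, v) for u in range(n) for v in range(n) if u != v]


def dec(z_u, z_v):
    """Inner-product decoder (one possible choice, assumed for this sketch)."""
    return z_u @ z_v


def loss(Z):
    """Sum of squared errors between decoded and true similarities."""
    return sum((dec(Z[:, u], Z[:, v]) - S[u, v]) ** 2 for u, v in pairs)


rng = np.random.default_rng(0)
Z = 0.1 * rng.normal(size=(d, n))  # shallow embedding: Z itself is the parameter
initial_loss = loss(Z)

lr = 0.01  # arbitrary small step size
for _ in range(500):
    grad = np.zeros_like(Z)
    for u, v in pairs:
        err = dec(Z[:, u], Z[:, v]) - S[u, v]
        grad[:, u] += 2 * err * Z[:, v]  # d/dz_u of (z_u . z_v - S[u, v])^2
        grad[:, v] += 2 * err * Z[:, u]  # d/dz_v of the same term
    Z -= lr * grad  # plain gradient descent step

final_loss = loss(Z)
print(initial_loss, final_loss)
```

Because the embeddings are the only parameters, "training the encoder" here amounts to updating the columns of `Z` directly.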


___


# Creating Node Representations with DeepWalk


# 📌 Introducing Word2Vec

The first step to comprehending the DeepWalk algorithm is to understand its major component: **Word2Vec**.

---

## 🧠 What is Word2Vec?

Word2Vec is one of the most influential neural embedding techniques in NLP.  
Published in 2013 by **Tomas Mikolov et al.** at Google, it introduced a way to convert words into vectors, called **embeddings**, using large text datasets.

These embeddings:

- Allow computers to represent the **meaning of words** numerically.
- Are useful in downstream tasks (like sentiment analysis or graph node classification).
- Are a famous, patented example of a successful ML architecture.


## 📊 Example Embeddings

Here are a few example word vectors:

```
vec(king)   = [-2.1, 4.1, 0.6]
vec(queen)  = [-1.9, 2.6, 1.5]
vec(man)    = [3.0, -1.1, -2.0]
vec(woman)  = [2.8, -2.6, -1.1]
```

These are simplified 3-dimensional representations; real embeddings often have 100 to 300 dimensions.


## 🧮 Similarity by Euclidean Distance

Let's compare the Euclidean distances between words, using the example vectors above:

- Distance between **king** and **queen**: $\approx 1.76$
- Distance between **king** and **woman**: $\approx 8.47$

This tells us:

➡ $vec(king)$ is **closer** to $vec(queen)$ than it is to $vec(woman)$.
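
These distances can be checked with a few lines of NumPy. The sketch below reuses the toy vectors from the example embeddings; the `euclidean` helper is an illustrative name.

```python
import numpy as np

# Toy 3-dimensional vectors from the example embeddings.
vec = {
    "king":  np.array([-2.1, 4.1, 0.6]),
    "queen": np.array([-1.9, 2.6, 1.5]),
    "man":   np.array([3.0, -1.1, -2.0]),
    "woman": np.array([2.8, -2.6, -1.1]),
}


def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))


d_king_queen = euclidean(vec["king"], vec["queen"])
d_king_woman = euclidean(vec["king"], vec["woman"])
print(round(d_king_queen, 2), round(d_king_woman, 2))
```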



## 🔁 Cosine Similarity (Angle-Based Comparison)

Instead of using distances, **cosine similarity** is often used.  
It compares **angles**, not magnitudes, which makes it a better fit when vector lengths differ.

It is defined as:

$$
\text{cosine\_similarity}(\vec{A}, \vec{B}) = \cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \cdot \|\vec{B}\|}
$$

- $\vec{A} \cdot \vec{B}$ = dot product  
- $\|\vec{A}\|$ = length (norm) of vector A
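
The formula translates directly into code. This is a minimal sketch using the toy vectors from the example embeddings; the function name `cosine_similarity` is chosen here for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


king = np.array([-2.1, 4.1, 0.6])
queen = np.array([-1.9, 2.6, 1.5])
woman = np.array([2.8, -2.6, -1.1])

sim_king_queen = cosine_similarity(king, queen)
sim_king_woman = cosine_similarity(king, woman)
print(sim_king_queen, sim_king_woman)

# Rescaling a vector changes its length but not its direction,
# so the cosine similarity stays the same.
print(np.isclose(cosine_similarity(10 * king, queen), sim_king_queen))
```

With these toy vectors the similarity is positive for king/queen and negative for king/woman, and it is unaffected by rescaling either vector.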



## 💡 Vector Arithmetic: Word Analogies

One surprising ability of Word2Vec is solving analogies using simple vector math.

Famous example:

> "**man** is to **woman** as **king** is to ___?"

This is calculated as:

$$
\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}
$$

This relationship doesn't always hold exactly, but it works surprisingly well in many cases.
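
The analogy can be tried on the toy vectors from the example embeddings. This is only an illustration on four hand-picked 3-dimensional vectors, not on trained Word2Vec embeddings.

```python
import numpy as np

# Toy 3-dimensional vectors from the example embeddings.
vec = {
    "king":  np.array([-2.1, 4.1, 0.6]),
    "queen": np.array([-1.9, 2.6, 1.5]),
    "man":   np.array([3.0, -1.1, -2.0]),
    "woman": np.array([2.8, -2.6, -1.1]),
}

analogy = vec["king"] - vec["man"] + vec["woman"]

# Find the word whose vector is nearest (Euclidean distance) to the result.
nearest = min(vec, key=lambda w: np.linalg.norm(vec[w] - analogy))
print(analogy, nearest)
```

With these toy vectors the result lands closest to `queen`, although it does not coincide with it exactly.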


# CBOW vs Skip-gram (Word2Vec)

## 🧠 Core Concept
Both **CBOW** and **Skip-gram** are models used in Word2Vec to learn word embeddings (dense vector representations of words) based on their context in a sentence.

---

## 📘 CBOW (Continuous Bag-of-Words)
This model is trained to predict a word from its surrounding context (the words coming before and after the target word). The order of the context words does not matter, since their embeddings are summed in the model. The authors report better results when using four words before and after the predicted word.

### ✅ Goal:
Predict the **center word** given its **context words**.

### 📌 Example:
For the sentence:  
`The cat sits on the mat`

CBOW input and output would be:  
**Input (Context):** `The, cat, on, the`  
**Output (Target):** `sits`


## 📗 Skip-gram
Here, we feed a single word to the model and try to predict the words around it. Increasing the range of context words tends to give better embeddings but also increases the training time.


### ✅ Goal:
Predict the **context words** given the **center word**.

### 📌 Example:
For the sentence:  
`The cat sits on the mat`

Skip-gram input and output would be:  
**Input (Center):** `sits`  
**Output (Context):** `The, cat, on, the`
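
The CBOW and Skip-gram examples above differ only in which side is the input. The sketch below builds both training examples for the center word `sits`, assuming a window of two words on each side; the variable names are illustrative.

```python
sentence = "The cat sits on the mat".split()
window = 2   # assumed number of context words on each side
center = 2   # index of the center word "sits"

# Context = words within the window around the center, excluding the center.
context = [
    sentence[j]
    for j in range(max(0, center - window), min(len(sentence), center + window + 1))
    if j != center
]

# CBOW: the context words are the input, the center word is the output.
cbow_example = (context, sentence[center])

# Skip-gram: the center word is the input, each context word is an output.
skipgram_examples = [(sentence[center], c) for c in context]

print(cbow_example)
print(skipgram_examples)
```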

<center>
    <img src="images/image4.jpg" alt="Image" width="600"/>
</center>


## 📊 Comparison Table

| Feature         | CBOW                          | Skip-gram                        |
|----------------|-------------------------------|----------------------------------|
| Input           | Context words                 | Center word                      |
| Output          | Center word                   | Context words                    |
| Speed           | Faster to train               | Slower, but often more accurate  |
| Good for        | Frequent words                | Rare words                       |



___


## 📌 Creating Skip-grams

For now, we will focus on the **skip-gram model** since it is the architecture used by **DeepWalk**.

Skip-grams are implemented as **pairs of words** with the following structure:

```
(target word, context word)
```

- `target word`: the input word to the model.
- `context word`: the word the model tries to predict (surrounding words).



### 🔧 Parameter: `context size`

The number of skip-gram pairs generated for a given target word depends on a parameter called **context size**, which defines how many words before and after the target word are considered context.



### 📊 Example

Let's take the sentence:  
**"the train was late"**
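
The `(target word, context word)` pairs for this sentence can be enumerated with a short helper. This is a sketch: `skipgram_pairs` is a hypothetical function, and clipping the window at the sentence boundaries is an assumption made here.

```python
def skipgram_pairs(tokens, context_size):
    """Return (target word, context word) pairs.

    The window is clipped at the sentence edges (an assumption of this sketch).
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - context_size)
        hi = min(len(tokens), i + context_size + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = "the train was late".split()

pairs_1 = skipgram_pairs(tokens, context_size=1)
pairs_2 = skipgram_pairs(tokens, context_size=2)

print(pairs_1)
print(len(pairs_1), len(pairs_2))
```

A larger context size yields more pairs per target word, which is why it increases training time.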

<center>
    <img src="images/image5.jpg" alt="Image" width="600"/>
</center>

### 🧠 Practical Usage

- The same idea applies to a **corpus** of text, not just a single sentence.
- We **store all context words** for the same target word in a **list** to save memory.

In the following example, we create skip-grams for an entire paragraph stored in the `text` variable. We set the `CONTEXT_SIZE` variable to 2, which means we look at the two words before and after each target word:

```python
import numpy as np

np.random.seed(42)

CONTEXT_SIZE = 2

text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu sem
scelerisque, dictum eros aliquam, accumsan quam. Pellentesque tempus, lorem ut
semper fermentum, ante turpis accumsan ex, sit amet ultricies tortor erat quis
nulla. Nunc consectetur ligula sit amet purus porttitor, vel tempus tortor
scelerisque. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices
posuere cubilia curae; Quisque suscipit ligula nec faucibus accumsan. Duis
vulputate massa sit amet viverra hendrerit. Integer maximus quis sapien id
convallis. Donec elementum placerat ex laoreet gravida. Praesent quis enim
facilisis, bibendum est nec, pharetra ex. Etiam pharetra congue justo, eget
imperdiet diam varius non. Mauris dolor lectus, interdum in laoreet quis,
faucibus vitae velit. Donec lacinia dui eget maximus cursus. Class aptent taciti
sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Vivamus
tincidunt velit eget nisi ornare convallis. Pellentesque habitant morbi
tristique senectus et netus et malesuada fames ac turpis egestas. Donec
tristique ultrices tortor at accumsan.
""".split()
```


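Next, we create the skip-grams with a simple `for` loop over every word in `text` that has a full window on both sides (the first and last `CONTEXT_SIZE` words are not used as targets).
A list comprehension collects the context words for each target, and each `(target, context)` tuple is appended to the `skipgrams` list: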

```python
# Create skipgrams: pair each target word with its surrounding context words
skipgrams = []
for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE):
    # Context = the CONTEXT_SIZE words before and after position i (target excluded)
    array = [text[j] for j in range(i - CONTEXT_SIZE, i + CONTEXT_SIZE + 1) if j != i]
    skipgrams.append((text[i], array))

print(skipgrams[0:2])
```

```
[('dolor', ['Lorem', 'ipsum', 'sit', 'amet,']), ('sit', ['ipsum', 'dolor', 'amet,', 'consectetur'])]
```


These two target words, together with their context lists, show what the inputs to Word2Vec look like.