In [1]:
import numpy as np
from neighbourhood import Neighbourhood
from preprocessor import *

[nltk_data] Downloading package stopwords to /Users/heidi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
nodes = pd.read_csv("data/nodes.csv")
links = pd.read_csv("data/links_type.csv")
bilinks = pd.read_csv("data/bilinks.csv")

## Idea of the Graph Neural Network
### Graph Convolutional Network
#### PinSAGE? Pins and Boards vs Items and Users vs Products and Customers

1. Collecting **neighbourhood** of item and presenting the item through its neighbourhood. -> We need to define the neighbourhood of item in bipartite graph.
2. Items **initial representation** should be embeddings created based on its features. -> We need to determine features for the items.
3. ...
4. PinSAGE creates embeddings based on item and its neighbourhood features.


### **Collecting Neighbourhood in Bipartite graph**

Define the neighbourhood for any Product item in our Bipartite graph.

In [3]:
N = Neighbourhood(bilinks,links)

Generate random product and find its neighbourhood through common customer.

In [4]:
random_product = np.random.choice(range(bilinks.Id.nunique()))
print("Random product: ", random_product)
N.find_neighbourhood(random_product)

Random product:  5465


([470733], [4.0])

Get the Product's whole neighbourhood including direct neighbours and neighbourhood of common Customer.

In [5]:
N.get_neighbourhood(random_product)

([470733], [1.0])

Receptive field is neighbourhood of K randomly selected neighbours. Neighbours can be selected multiple times.

In [6]:
for k in range(7):
    print(N.get_receptive_field(random_product,K=k))

[]
[470733]
[470733, 470733]
[470733, 470733, 470733]
[470733, 470733, 470733, 470733]
[470733, 470733, 470733, 470733, 470733]
[470733, 470733, 470733, 470733, 470733, 470733]


Generate a random walk from Product.

In [7]:
N.generate_random_walk(143845)

[147830, 142912, 143845, 142912, 147830]

Importance pooling: selecting T (default 10) most important neighbours of Product by generating T random walks from T-hop neighbourhood.

In [8]:
print(N.importance_pooling(random_product))

{470733: 0.5, 5465: 0.5}


### **Feature table for Products (nodes)**

We decided to select different types of values: textual, categorical and numerical values.

In [9]:
columns = {
    "ID": ["Id"],
    "textual": ["Title"],
    "categorical": ["Group"],
    "numerical": ["Salesrank","AvgRating"]
}

In [10]:
Features = get_features(nodes,columns)

In [11]:
Features.dtypes.unique()

array([dtype('int64'), dtype('float64'), dtype('int8')], dtype=object)

In [12]:
Features.shape # normaliseerimine? Standard scaler

(542664, 2004)

#### Additional features from Categories

We would like to add more information for nodes: categories

In [None]:
# df_cat = pd.read_csv('data/categories.csv')
# df_cat.head()

Each category path has its own unique CatId. We can represent each node as vectors representing which categories they belong to.

In [None]:
# df_cat["nbr"] = 1
# Features2 = pd.pivot_table(data=df_cat,index="Id",columns="CatId",values="nbr",fill_value=0)

In [None]:
# Features2.shape

In [None]:
# df_cat.Id.nunique()

In [None]:
# np.min(df_cat.CatId.values),np.max(df_cat.CatId.values)

In [None]:
# 542664-519781

### **Graph neural network**

**Architecture from Paper**

Input: 
- Set of nodes $\mathcal{M}\subset \mathcal{V}$ (minibatch from nodes $\mathcal{V}$);
- depth parameter $K$;
- neighbourhood function $\mathcal{N}:\mathcal{V}\rightarrow 2^{\mathcal{V}}$

Output:
- Embeddings $z_u, \forall u\in \mathcal{M}$

Sampling neighbourhoods for nodes in minibatch:
- $K$-th round consist of batchnodes: $\mathcal{S}^{(k)} \leftarrow \mathcal{M}$;
- for $k = K,\dots, 1$ do
  - $\mathcal{S}^{(k-1)}\leftarrow \mathcal{S}^{(k)}$
  - for $u\in \mathcal{S}^{(k)}$ do
    - $\mathcal{S}^{(k-1)}\leftarrow\mathcal{S}^{(k-1)}\cup \mathcal{N}(u)$ 
    - ($K-1$)-st round consist of $K$-th nodes and their neighbourhood nodes

Generating embeddings for nodes in minibatch:
- $h^{(0)}_u \leftarrow x_u \forall u\in \mathcal{S}^{(0)}$; init emb is feature vector $x_u$
- for $k = 1,\dots,K$ do
  - for $u\in \mathcal{S}^{(k)}$ do
    - $\mathcal{H}\leftarrow \big\{ h^{(k-1)}_v, \forall v\in \mathcal{N}(u) \big\}$
    - $h^{(k)}_u\leftarrow \text{convolve}^{(k)}\big( h^{(k-1)}_u,\mathcal{H} \big)$
- for $u\in \mathcal{M}$ do
  - $z_u\leftarrow G_2\cdot\text{ReLU}\big( G_1h^{(K)}_u+g \big)$

<hr>
<u>Convolve</u>:

Input:
- current embedding $z_u$ for node $u$;
- set of neighbour embeddings $\{ z_v|v\in\mathcal{N}(u) \}$ with set of neighbour importances **$\alpha$**;
- symmetric vector function $\gamma(\cdot)$

Output: 
- new embedding $z^{\text{NEW}}_u$ for node u

Generating an embedding:
- neighbourhood embedding: $n_u\leftarrow \gamma\big(\{ \text{ReLU}(Qh_v+q)|v\in\mathcal{N}(u) \}, \alpha \big)$;
- node $u$ embedding: $z^{\text{NEW}}_u \leftarrow \text{ReLU}\big(W\cdot \text{concat}(z_u,n_u)+w\big)$;
- normalized node $u$ emb: $z^{\text{NEW}}_u \leftarrow \frac{z^{\text{NEW}}_u}{\lVert z^{\text{NEW}}_u\rVert _2}$

In [13]:
from input import Input

In [14]:
batch_size = 5
pool_size = 7

I = Input(Features, bilinks, links)
NodeIDs = I.random_batch(batch_size)
Nodes = I.init_embedding(NodeIDs)
NeighbourIDs, Importances = I.pooling(NodeIDs,pool_size)
Neighbourhoods, Alpha = I.init_neigh_embeddings(NeighbourIDs,Importances,pool_size)

```
Neighbouring =
{
    node1: {
        "neighbours": [nodea, nodeb, nodec],
        "importances": [0.55,0.25,0.2]},
    node2: {
        "neighbours": [nodex, nodey, nodez],
        "importances": [0.45,0.45,0.1]},
    nodea: {
        "neighbours": ...,
        "importances": ... },
    nodeb: {
        ... },
    ...
}
Stacks = [
[node1, node2],
[node1, node2, nodea, nodeb, nodec, nodex, nodey, nodez],
...
]
```

In [15]:
Stacks, Neighbouring = I.sampler(NodeIDs,pool_size=4,depth_K=3)

In [16]:
print(len(Stacks))
for S in Stacks: print(len(S),end=", ")
print("")
print(len(Neighbouring))

3
5, 25, 229, 
40


In [17]:
print(Nodes.shape)
print(Neighbourhoods.shape)
print(Alpha.shape)

(5, 2004)
(5, 7, 2004)
(5, 7)


In [18]:
import tensorflow as tf
from keras.layers import Input, Layer, Dense, Concatenate, Lambda, Reshape, Flatten

In [157]:
class Aggregator(Layer):
    def __init__(self):
        super(Aggregator, self).__init__()

    def call(self, inputs):
        Neighbours = inputs[0] # (7, 2004)
        Alphas = inputs[1] # (1, 7)
        Neighbourhood = Dense(20,activation='relu',dtype='float32')(Neighbours) # (7, 20)
        return tf.matmul(Alphas,Neighbourhood) # (1, 7) x (7, 20) = (1, 20)

class Convolve(Layer):
    def __init__(self):
        super(Convolve, self).__init__()

    def call(self, inputs): # inputs = [Node, Neighbours,Alpha]
        Node, Neighbours, Alpha = inputs
        # Node = Input(shape=(1,2004)) 
        # Neighbours = Input(shape=(7,2004)) 
        # Alpha = Input(shape=(1,7))
        NodeEmb = Dense(20,activation='relu',dtype='float32')(Node) # (1, 20)
        print(NodeEmb.shape)
        NeighboursEmb = Aggregator()([Neighbours,Alpha]) # (1, 20)
        print(NeighboursEmb.shape)
        AggEmb = Concatenate()([NodeEmb,NeighboursEmb]) # (1, 40)
        print(AggEmb.shape)
        Emb = Dense(20,activation='relu',dtype='float32')(AggEmb) # (1, 20)
        return tf.divide(Emb,tf.norm(Emb,ord=2)) # (None, 1, 20)

In [20]:
x = Convolve()([Nodes,Neighbourhoods,Alpha])
x

2022-05-12 18:34:27.146156: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


(None, 1, 20)
(None, 1, 20)
(None, 1, 40)


<KerasTensor: shape=(None, 1, 20) dtype=float32 (created by layer 'tf.math.truediv')>

- **NodeIDs** - list of Product Id-s -- the batch `(batch_size x 1)`
- **NeighbourIDs** - list of important Neighbours' Ids -- pool of batch nodes `(batch_size x T)`
- **Importances** - list of imp Neighbours' importances -- importances of each node in pool `(batch_size x T)`
- **Hoods** - Features of important Neighbours -- features of pools `(batch_size x T x feature_dim)`

In [21]:
minibatch = Input(shape=(1,1)) # minibatch: NodeIDs
# Sampler:
pool = 4
K = 3
Stack3 = Lambda(lambda x: [x+1,x+2,x+3])(minibatch)
Stack3 = tf.stack([Stack3[0],Stack3[1],Stack3[2]],axis=1)
print(Stack3)


KerasTensor(type_spec=TensorSpec(shape=(None, 3, 1, 1), dtype=tf.float32, name=None), name='tf.stack/stack:0', description="created by layer 'tf.stack'")


In [None]:
batch = 5
pool = 4
minibatch = Input(shape=(1,1)) # minibatch: NodeIDs
miba = tf.make_ndarray(minibatch)
neighbours = tf.constant(I.pooling(miba,pool)[0])
importances = tf.constant(I.pooling(miba,pool)[1])
allnodes = tf.concat([minibatch,neighbours])

In [103]:
# Input
mb = I.random_batch(5)
inp = tf.constant(mb,shape=(mb.size,1),dtype="int32")
print("Nodes with neighbours: ", inp)
print("")

# Neighbours of input nodes
nei = tf.constant(I.pooling(mb,4)[0])
print("Neighbours: ", nei)
print("")

# Neighbours of input nodes
alf = tf.constant(I.pooling(mb,4)[1])
print("Importances: ", alf)
print("")

# Concatenate input with neighbours
inpnei = tf.concat([inp,nei],axis=-1)
print("Nodes with neighb: ", inpnei)
print("")

# Flattening
ns = tf.unstack(nei, axis=1)
n = tf.concat([ns[i] for i in range(len(ns))],axis=0)
tf.constant(I.pooling(n.numpy(),4)[0])

Nodes with neighbours:  tf.Tensor(
[[175826]
 [477413]
 [432050]
 [106924]
 [452232]], shape=(5, 1), dtype=int32)

Neighbours:  tf.Tensor(
[[     0      0      0      0]
 [154611 294332 420897 477413]
 [ 34879 260074  47197 148374]
 [ 58262 106924 478404      0]
 [302920 435646 384171 507845]], shape=(5, 4), dtype=int32)

Importances:  tf.Tensor(
[[0.     0.     0.     0.    ]
 [0.1875 0.1875 0.0625 0.0625]
 [0.3125 0.25   0.125  0.0625]
 [0.4375 0.3125 0.1875 0.0625]
 [0.3125 0.25   0.125  0.125 ]], shape=(5, 4), dtype=float32)

Nodes with neighb:  tf.Tensor(
[[175826      0      0      0      0]
 [477413 154611 294332 420897 477413]
 [432050  34879 260074  47197 148374]
 [106924  58262 106924 478404      0]
 [452232 302920 435646 384171 507845]], shape=(5, 5), dtype=int32)



<tf.Tensor: shape=(20, 4), dtype=int32, numpy=
array([[     0,      0,      0,      0],
       [154611,  81002, 294332, 420897],
       [ 91637, 522778, 260074,  73690],
       [478404,  58262, 461256, 106924],
       [452232, 302920, 435646, 379412],
       [     0,      0,      0,      0],
       [154611, 477413, 420897, 502937],
       [ 43617, 350998,  47197, 359084],
       [ 58262, 478404, 106924, 461256],
       [231890, 435646,  12152,  98728],
       [     0,      0,      0,      0],
       [420897, 154611, 134651, 294332],
       [451100, 260074,  34879,  47197],
       [461256, 199891, 504962, 339592],
       [275207, 384171, 302920, 452232],
       [     0,      0,      0,      0],
       [477413, 154611, 294332, 502937],
       [502029,  80655, 260074, 451100],
       [     0,      0,      0,      0],
       [ 52224, 507845, 280603, 106114]], dtype=int32)>

In [85]:
print(nei)
print(alf)

tf.Tensor(
[[182649 428845 252963  95884]
 [141329 450640 151096 533523]
 [ 93966 435885 371659  34342]
 [     0      0      0      0]
 [138403 509369 317317 277166]], shape=(5, 4), dtype=int32)
tf.Tensor(
[[0.0625 0.0625 0.0625 0.0625]
 [0.4375 0.125  0.125  0.125 ]
 [0.4375 0.25   0.1875 0.0625]
 [0.     0.     0.     0.    ]
 [0.1875 0.0625 0.0625 0.0625]], shape=(5, 4), dtype=float32)


In [84]:
n_emb = I.init_embedding(n.numpy())
nei_emb = n_emb.reshape(5,4,2004)
nei_emb.shape

(5, 4, 2004)

In [89]:
alf_br = tf.broadcast_to(tf.expand_dims(alf,axis=-1),[5,4,2004])

In [93]:
imp_emb = tf.reduce_sum(nei_emb*alf_br,axis=1)

In [30]:
a = tf.constant([[1,2,3],[4,5,6]])
proto_tensor = tf.make_tensor_proto(a)  # convert `tensor a` to a proto tensor
tf.make_ndarray(proto_tensor)

array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)

In [209]:
def sampling(node_ids,pool_T,depth_K):
    stacks, alphas = [],[]
    node_ids = tf.constant(node_ids,dtype="int32")
    for k in range(depth_K): # k = 0,...,K-1
        neigh_ids = tf.constant(I.pooling(node_ids,pool_T)[0]) # get neighbours
        alfas = tf.constant(I.pooling(node_ids,pool_T)[1]) # get importances
        node_ids = tf.constant(node_ids,shape=(node_ids.numpy().size,1),dtype="int32") # reshape nodes
        stack = tf.concat([node_ids,neigh_ids],axis=-1) # stack nodes with their neighbours
        stacks.append(stack)
        alphas.append(alfas)
        if k > 0:
            stacks[k] = tf.concat([stack,stacks[k-1]],axis=0)
            alphas[k] = tf.concat([alfas,alphas[k-1]],axis=0)
        node_ids = tf.concat(tf.unstack(neigh_ids, axis=1),axis=0) # flatten neighbours for next round nodes
    return stacks[-1], alphas

In [165]:
# def sampling(node_ids,pool_T,depth_K):
#     stacks, neighbours, alphas = [],[],[]
#     node_ids = tf.constant(node_ids,dtype="int32")
#     for k in range(depth_K):
#         #node_ids = tf.constant(node_ids,shape=(1,node_ids.size),dtype="int32") # (5, 1)
#         neigh_ids = tf.constant(I.pooling(node_ids,pool_T)[0]) # (5, 4)
#         alfas = tf.constant(I.pooling(node_ids,pool_T)[1]) # (5, 4)
#         node_ids = tf.constant(node_ids,shape=(node_ids.numpy().size,1),dtype="int32") # (5, 1)
#         stack = tf.concat([node_ids,neigh_ids],axis=-1) # (5, 5)
#         stacks.append(stack)
#         neighbours.append(neigh_ids)
#         alphas.append(alfas)
#         node_ids = tf.concat(tf.unstack(neigh_ids, axis=1),axis=0)
#     return stacks, neighbours, alphas

In [212]:
batch = I.random_batch(5)
stack, alphas = sampling(batch,4,3)
print(stack.shape)
for a in alphas:
    print(a.shape)

(105, 5)
(5, 4)
(25, 4)
(105, 4)


In [177]:
# stack = stacks[0]
# print(0, stack.shape)
# for i in range(1,len(stacks)):
#     stack = tf.concat([stack,stacks[i]],axis=0)
#     print(i, stack.shape)
# stack.shape

0 (5, 5)
1 (25, 5)
2 (105, 5)


TensorShape([105, 5])

In [181]:
# importances = []
# importances.append(alphas[0])
# print(importances[0].shape)
# for i in range(1,len(alphas)):
#     importances.append(tf.concat([importances[i-1],alphas[i]],axis=0))
#     print(importances[i].shape)

(5, 4)
(25, 4)
(105, 4)


In [207]:
def embedding(stack, importances, pool_T, depth_K, init_dim):
    stack_sizes = [0]+[importances[i].shape[0] for i in range(len(importances))]
    # get initial stack
    stack_np = tf.concat(tf.unstack(stack,axis=1),axis=0).numpy() # flatten and numpy
    stack_emb = tf.reshape(tf.constant(I.init_embedding(stack_np)), [-1, pool_T+1, init_dim]) # init embeddings
    # layers:
    for k in range(depth_K-1,-1,-1): # k = K-1,...,0
        # nodes, neighbours, alphas -> CONV -> new embeddings
        node_emb = tf.expand_dims(tf.unstack(stack_emb,axis=1)[0],axis=1)
        neigh_emb = tf.reshape(tf.concat(tf.unstack(stack_emb,axis=1)[1:],axis=0), [-1, pool_T, stack_emb.shape[-1]])
        alphas = tf.expand_dims(importances[k],axis=1)
        new_emb = Convolve()([node_emb, neigh_emb, alphas])
        if k > 0: # k = K-1,...,1
            # new embeddings -> COMBINE -> next stack
            splits,combined = [],[]
            for i in range(1,len(stack_sizes[:k+2])):
                split = tf.concat(tf.unstack(new_emb,axis=0)[stack_sizes[i-1]:stack_sizes[i]],axis=0)
                splits.append(tf.expand_dims(split,axis=1))
                if i > 1: 
                    reshape_split = tf.reshape(split,[-1,pool_T,split.shape[-1]])
                    combined.append(tf.concat([splits[i-2],reshape_split],axis=1))
            stack_emb = tf.concat(combined,axis=0)
    return new_emb

In [217]:
from model import PinSAGE
from graph import BipartiteGraph

G = BipartiteGraph(Features,bilinks,links)
mdl = PinSAGE(G, pool_T=4, depth_K=3, emb_dim=20)

In [218]:
batch = G.random_subgraph(5)
mdl.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3))

AttributeError: 'BipartiteGraph' object has no attribute 'random_subgraph'

In [213]:
embedding(stack, alphas, pool_T=4, depth_K=3, init_dim=2004)

(105, 1, 20)
(105, 1, 20)
(105, 1, 40)
(25, 1, 20)
(25, 1, 20)
(25, 1, 40)
(5, 1, 20)
(5, 1, 20)
(5, 1, 40)


<tf.Tensor: shape=(5, 1, 20), dtype=float32, numpy=
array([[[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ]],

       [[0.        , 0.        , 0.        , 0.08099678, 0.23498493,
         0.24243158, 0.        , 0.        , 0.        , 0.        ,
         0.32936975, 0.45133516, 0.42156368, 0.        , 0.        ,
         0.20280978, 0.27794722, 0.38285437, 0.17848478, 0.03373609]],

       [[0.        , 0.        , 0.        , 0.0805148 , 0.00652521,
         0.09415338, 0.00813679, 0.        , 0.        , 0.        ,
         0.10833282, 0.16822295, 0.13277219, 0.        , 0.        ,
         0.        , 0.06208873, 0.08875901, 0.        , 0.08205435]],

       [[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.       

In [160]:
# def embedding(stack, nodes, neighbours, alphas,pool_T,depth_K, init_dim):
#     # initial embeddings
#     stack_np = tf.reshape(stack,[stack.numpy().size]).numpy()
#     stack_emb = tf.reshape(tf.constant(I.init_embedding(stack_np)), [-1, pool_T+1, init_dim])
#     node_emb = tf.unstack(stack_emb,axis=1)[0]
#     node_emb = tf.expand_dims(node_emb,axis=1)
#     neigh_emb = tf.reshape(tf.concat(tf.unstack(stack_emb,axis=1)[1:],axis=0), [-1, pool_T, init_dim])
#     #importances = tf.broadcast_to(tf.expand_dims(alphas[-1],axis=-1),[alphas[-1].shape[0],pool_T,init_dim])
#     importances = tf.expand_dims(alphas[-1],axis=1)
#     new_emb = Convolve()([node_emb, neigh_emb, importances])
    
#     # using previous embeddings

In [161]:
# embedding(stack, nodes, neighbours, alphas,4,3, 2004)

(80, 1, 20)
(80, 1, 20)
(80, 1, 40)


<tf.Tensor: shape=(80, 1, 20), dtype=float32, numpy=
array([[[0.        , 0.        , 0.        , ..., 0.01369335,
         0.02207531, 0.07849218]],

       [[0.        , 0.        , 0.        , ..., 0.        ,
         0.00028086, 0.0083801 ]],

       [[0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.01452408]],

       ...,

       [[0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.02514934]],

       [[0.        , 0.        , 0.00122165, ..., 0.        ,
         0.00608738, 0.00664893]],

       [[0.        , 0.        , 0.        , ..., 0.00747036,
         0.02758896, 0.07731792]]], dtype=float32)>