# Simplified code for Sentence Embedding (SIF)

The original code for the paper below is a complete sentence-embedding pipeline, so it bundles many steps together. In many cases, however, we already have word embeddings and only need to compute sentence embeddings from them. To keep things simple, this notebook uses only the sentence-embedding part of the original code, i.e. the get_weighted_average(), compute_pc(), remove_pc() and SIF_embedding() functions. The word embeddings are generated with word2vec from the gensim library.

### Source
Paper: "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" by Sanjeev Arora, Yingyu Liang and Tengyu Ma

Link: https://openreview.net/pdf?id=SyK00v5xx

Original Code at GitHub: https://github.com/PrincetonML/SIF

In [1]:
import numpy as np
from sklearn.decomposition import TruncatedSVD

In [2]:
def get_weighted_average(We, x, w):
    """
    Compute the weighted average vectors
    :param We: We[i,:] is the vector for word i
    :param x: x[i, :] are the indices of the words in sentence i
    :param w: w[i, :] are the weights for the words in sentence i
    :return: emb[i, :] are the weighted average vector for sentence i
    """
    n_samples = x.shape[0]
    emb = np.zeros((n_samples, We.shape[1]))
    for i in range(n_samples):
        emb[i] = np.dot(w[i], We[x[i]]) / np.count_nonzero(w[i])
    return emb
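
As a quick sanity check, the weighted average can be worked out by hand on toy data. The sketch below is only an illustration: the 3-word vocabulary, the vectors in `We_toy` and the weights are made up, and the function is restated so the block runs on its own.

```python
import numpy as np

def get_weighted_average(We, x, w):
    # Same logic as the cell above, restated so this block is self-contained.
    n_samples = len(x)
    emb = np.zeros((n_samples, We.shape[1]))
    for i in range(n_samples):
        emb[i] = np.dot(w[i], We[x[i]]) / np.count_nonzero(w[i])
    return emb

# Hypothetical 3-word vocabulary with 2-dimensional vectors.
We_toy = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])
# Two sentences of different lengths, given as word indices and weights.
x_toy = [[0, 1], [2, 2, 0]]
w_toy = [[0.5, 0.5], [1.0, 1.0, 1.0]]

emb_toy = get_weighted_average(We_toy, x_toy, w_toy)
```

Note that the weighted sum is divided by the number of nonzero weights (the sentence length when every word has a nonzero weight), not by the sum of the weights.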

In [3]:
def compute_pc(X,npc=1):
    """
    Compute the principal components. DO NOT MAKE THE DATA ZERO MEAN!
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: component_[i,:] is the i-th pc
    """
    svd = TruncatedSVD(n_components=npc, n_iter=7, random_state=0)
    svd.fit(X)
    return svd.components_

In [4]:
def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc==1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX
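
To see what removing the projection does, the sketch below checks on random data that nothing is left along the first principal direction afterwards. It swaps in `np.linalg.svd` for `TruncatedSVD` so that it depends on NumPy only; the helper name `remove_first_pc` and the random matrix are assumptions for illustration, not part of the original code.

```python
import numpy as np

def remove_first_pc(X):
    # First right singular vector of the uncentered data (no mean subtraction).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[:1]  # shape (1, n_features), unit norm
    # Subtract each row's projection onto pc.
    return X - X.dot(pc.T) * pc, pc

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 4)) + 3.0  # shifted so a dominant direction exists

XX_toy, pc_toy = remove_first_pc(X_toy)
residual = XX_toy.dot(pc_toy.T)  # component that remains along pc
```

Every entry of `residual` should be zero up to floating-point error, because `pc` has unit norm and each row's component along it has been subtracted.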

In [5]:
def SIF_embedding(We, x, w, param):
    """
    Compute sentence embeddings using weighted average + removing the projection on the first principal component(s)
    :param We: We[i,:] is the vector for word i
    :param x: x[i, :] are the indices of the words in the i-th sentence
    :param w: w[i, :] are the weights for the words in the i-th sentence
    :param param: number of principal components to remove; if >0, remove the projections of the sentence embeddings onto them
    :return: emb, emb[i, :] is the embedding for sentence i
    """
    emb = get_weighted_average(We, x, w)
    if param > 0:
        emb = remove_pc(emb, param)
    return emb
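
In the notation of the paper, the two steps that SIF_embedding() chains together can be summarized as follows. Here $p(w)$ is the estimated probability of word $w$, $a$ is the smoothing parameter, $v_w$ is the word vector, $|s|$ is the sentence length, and $u$ is the first singular vector of the matrix of sentence embeddings. This is a summary for orientation, not a derivation.

```latex
v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w ,
\qquad
v_s \leftarrow v_s - u u^{\top} v_s
```

The first expression corresponds to get_weighted_average() when `w` holds the weights $a / (a + p(w))$, and the second to remove_pc() with `npc=1`.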

# Load data

In [6]:
import pandas as pd

In [7]:
PATH = 'U:\\Research\\Projects\\sef\\datamining\\mlonlineabuse\\WorkingFolderActiveAMI\\'
cur_state = '_4'

In [8]:
df_U = pd.read_csv(PATH+'U'+cur_state+'.csv')
df_U.sample(5)

Unnamed: 0,label,text
619,0,@hangfirebbq @AlisonMoyet Lolol.. that's so no...
1272,1,@SarahhWaqar @CallmeJaagii Bitch shut the fuck up
1574,1,Happy 4th of July everyone! Ladies don't fuck ...
641,1,@lmchristi1 It's a nice butt. But I didn't thi...
1531,0,"@miidnighthour i mean, i know that 😂😂😂 i spent..."


In [9]:
df_L = pd.read_csv(PATH+'L'+cur_state+'.csv')
df_L.sample(5)

Unnamed: 0,label,text
1100,1,Told “that tight dress is what makes you a who...
1561,0,@ewnupdates Motlante you power hungry kunt you...
1647,1,@grxmd I dont think this fat whore could even ...
652,1,“One minute you’re a spooky little witch bitch...
388,1,@YeenShitCuh Bitch keep my name out your mouth...


In [10]:
df = pd.concat([df_U, df_L], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df.head(5)

Unnamed: 0,label,text
0,1,@priya_ebooks Man: (harasses and stalks woman ...
1,0,Anybody can dig a hole and plant a tree. But t...
2,0,That's the original it came from. But I apprec...
3,0,When you've run out of kids to pimp out and yo...
4,0,The bitch flipped millions for em she deserved...


# word2vec

In [11]:
from gensim.models import Word2Vec



In [12]:
toc_sentences = [df.iloc[idx]['text'].split() for idx in df.index]
toc_sentences;

In [13]:
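# Argument names below are for gensim < 4.0;
# gensim >= 4.0 renamed size -> vector_size and iter -> epochs.
model_w2v = Word2Vec(sentences=toc_sentences, max_vocab_size=None, size=200,
                     window=5, min_count=1, iter=10, hs=1,
                     workers=4, sg=0)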

In [14]:
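vocab = list(model_w2v.wv.vocab)  # gensim < 4.0; in gensim >= 4.0 use model_w2v.wv.index_to_key
vectors = model_w2v.wv[vocab]  # one vector per vocabulary word, in vocab order
#vector = model_w2v.wv['author'] # vector for a single word
w2v = dict(zip(vocab, vectors))
w2v['@'];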

  


# word2frequency, word2index, index2word

In [15]:
from collections import Counter 

In [16]:
# Flatten the tokenized sentences into one list of words
words = [word for sentence in toc_sentences for word in sentence]
words[5]

'woman'

In [17]:
w2f = Counter(words)
w2f;

In [18]:
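a = 10e-3  # smoothing parameter; note that 10e-3 equals 0.01
w2w = {}
for word in vocab:
    # The paper's weight is a / (a + p(w)) with p(w) a relative frequency;
    # w2f[word] here is a raw count.
    w2w[word] = a/(a + w2f[word])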
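
The paper defines the weight as a / (a + p(w)), where p(w) is the relative frequency of the word rather than its raw count. A minimal sketch of that variant follows, assuming a toy token list and a = 1e-3; both are chosen for illustration, and the helper name `sif_weights` is hypothetical.

```python
from collections import Counter

def sif_weights(tokens, a=1e-3):
    """Map each word to a / (a + p(w)), with p(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}

# Hypothetical toy corpus, already tokenized.
tokens_toy = "the cat sat on the mat the end".split()
weights_toy = sif_weights(tokens_toy)
```

With this definition, a frequent word such as "the" receives a smaller weight than a word that occurs once.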

In [19]:
w2i = {word: i for i, word in enumerate(vocab)}  # word -> row index in We
i2w = {i: word for i, word in enumerate(vocab)}  # row index in We -> word
w2i;
i2w;

In [20]:
We = np.array(vectors)
x = []
w = []
for toc_sentence in toc_sentences:
    x_i = []
    w_i = []
    for word in toc_sentence:
        x_i.append(w2i[word])
        w_i.append(w2w[word])
    x.append(x_i)
    w.append(w_i)
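# Sentences differ in length, so these arrays are ragged;
# newer NumPy versions raise an error unless dtype=object is given.
x = np.array(x, dtype=object)
w = np.array(w, dtype=object)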

In [21]:
sentence_embs = SIF_embedding(We, x, w, 1)

In [22]:
sentence_embs

array([[-9.98905345e-05, -2.14459598e-05, -3.82540129e-05, ...,
         5.49619736e-05, -1.50128978e-05,  2.40446742e-05],
       [ 2.84096030e-06,  1.72700151e-05,  3.73897006e-05, ...,
         1.38573537e-05,  1.62785198e-05, -8.63739881e-05],
       [ 4.44145038e-05, -9.32318724e-06,  5.05819931e-06, ...,
         6.37717797e-06,  6.53118119e-06, -6.15614937e-05],
       ...,
       [ 2.84741139e-05,  8.53944969e-07,  3.19873161e-05, ...,
        -5.79333959e-06,  1.17873289e-05, -5.84563887e-05],
       [ 5.25880127e-07,  1.57506371e-05,  6.63278426e-05, ...,
        -5.16076179e-05,  9.61174363e-06, -1.08848990e-04],
       [ 4.68408766e-05,  6.87340723e-06,  2.39540514e-05, ...,
        -6.49857222e-06,  1.16351252e-05, -8.61213949e-05]])

In [23]:
np.save(PATH+'sentence_embs.npy', sentence_embs) # save

In [24]:
sentence_embs = np.load(PATH+'sentence_embs.npy') # load

In [25]:
sentence_embs

array([[-9.98905345e-05, -2.14459598e-05, -3.82540129e-05, ...,
         5.49619736e-05, -1.50128978e-05,  2.40446742e-05],
       [ 2.84096030e-06,  1.72700151e-05,  3.73897006e-05, ...,
         1.38573537e-05,  1.62785198e-05, -8.63739881e-05],
       [ 4.44145038e-05, -9.32318724e-06,  5.05819931e-06, ...,
         6.37717797e-06,  6.53118119e-06, -6.15614937e-05],
       ...,
       [ 2.84741139e-05,  8.53944969e-07,  3.19873161e-05, ...,
        -5.79333959e-06,  1.17873289e-05, -5.84563887e-05],
       [ 5.25880127e-07,  1.57506371e-05,  6.63278426e-05, ...,
        -5.16076179e-05,  9.61174363e-06, -1.08848990e-04],
       [ 4.68408766e-05,  6.87340723e-06,  2.39540514e-05, ...,
        -6.49857222e-06,  1.16351252e-05, -8.61213949e-05]])

### Please feel free to give your feedback