Let's build some sentence vectors

In [2]:
import numpy as np
import pandas as pd
import pickle
import spacy
import os
import re

from multiprocessing import Pool
from gensim.utils import simple_preprocess
from pathlib import Path

In [3]:
n_cpus = os.cpu_count()
tok = spacy.blank('en', disable=["parser","tagger","ner"])

In [6]:
DATA_PATH = Path('../data')
with open(DATA_PATH/'df_reviews_text.p', 'rb') as f:
    df = pickle.load(f)

I will start by coding a few small helpers to clean the sentences, so that we can later average the word vectors of the words in each sentence.

In [4]:
def normalize_sents(sents):
    # join spaCy's token norms: lowercases and expands contractions
    return [' '.join(t.norm_ for t in tok.tokenizer(s)) for s in sents]

Let's see what this does

In [8]:
df.review_sents[0]

['This is a great tutu and at a really great price.',
 "It doesn't look cheap at all.",
 "I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly.",
 'A++']

In [9]:
normalize_sents(df.review_sents[0])

['this is a great tutu and at a really great price .',
 'it does not look cheap at all .',
 'i am so glad i looked on amazon and found such an affordable tutu that is not made poorly .',
 'a++']

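The following ones are mostly self-explanatory. Note that `rm_single_chars` drops whole sentences that are one character long or shorter (e.g. what is left of 'a++'), not single-character words.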

In [10]:
def rm_non_alpha(sents):
    return [re.sub("[^a-zA-Z]", " ", s).strip() for s in sents]


def rm_single_chars(sents):
    return [s for s in sents if len(s)>1]


def rm_useless_spaces(sents):
    return [re.sub(' {2,}', ' ', s) for s in sents]

So, we will process the sentences as follows:

In [11]:
def process_sents(df, coln='processed_sents'):
    df[coln] = df['review_sents'].apply(normalize_sents)
    df[coln] = df[coln].apply(rm_non_alpha)
    df[coln] = df[coln].apply(rm_single_chars)
    df[coln] = df[coln].apply(rm_useless_spaces)
    return df
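
To make the effect of each step concrete, here is a minimal sketch that runs the three regex helpers, copied from above, on a few already-normalized sentences. The input list is only illustrative (the 'my 4 yr old daughter' sentence is made up), and the spaCy normalization step is skipped so that the snippet needs nothing beyond `re`.

```python
import re


def rm_non_alpha(sents):
    # replace every non-letter with a space, then trim the ends
    return [re.sub("[^a-zA-Z]", " ", s).strip() for s in sents]


def rm_single_chars(sents):
    # drop sentences that are one character long or empty
    return [s for s in sents if len(s) > 1]


def rm_useless_spaces(sents):
    # collapse runs of two or more spaces into a single space
    return [re.sub(' {2,}', ' ', s) for s in sents]


# illustrative input, as if it came out of normalize_sents
normalized = [
    'it does not look cheap at all .',
    'my 4 yr old daughter loves it .',
    'a++',
]

step1 = rm_non_alpha(normalized)    # 'a++' becomes 'a', the digit leaves a gap
step2 = rm_single_chars(step1)      # the lone 'a' is dropped
step3 = rm_useless_spaces(step2)    # the gap left by '4' is collapsed

for s in step3:
    print(s)
```

The order matters: removing non-letters first is what turns 'a++' into a single character and leaves the extra spaces that the last step cleans up.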

We will use this function via pandas `apply`, which can be rather slow. Here is a little helper that splits the dataframe into one chunk per core and processes the chunks in parallel.

In [12]:
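def parallel_apply(df, func, n_cores=n_cpus):
    # one chunk per core; chunks keep their original row order
    df_split = np.array_split(df, n_cores)
    # the context manager shuts the pool down once all chunks are done
    with Pool(n_cores) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df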
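
The pattern behind this helper is split, map, concatenate. The sketch below shows it on a plain list and runs serially with the built-in `map`, so it is not the helper above: `split_into_chunks`, `chunked_apply` and `upper_all` are hypothetical names used only for illustration, and the splitting is written by hand rather than with `np.array_split`.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks get one more element
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def chunked_apply(items, func, n_chunks, map_fn=map):
    # func receives a whole chunk and returns a list of results
    chunks = split_into_chunks(items, n_chunks)
    results = map_fn(func, chunks)
    # concatenate the per-chunk results back in their original order
    return [r for chunk in results for r in chunk]


def upper_all(chunk):
    # stand-in for the per-chunk work
    return [s.upper() for s in chunk]


toy = ['a', 'b', 'c', 'd', 'e']
chunks = split_into_chunks(toy, 2)
out = chunked_apply(toy, upper_all, 2)
print(chunks)
print(out)
```

Because the chunks are contiguous and the results are concatenated in order, the output matches what a single pass over the whole list would give. Passing a pool's `map` method as `map_fn` would be the parallel counterpart, provided `func` is defined at module level so it can be pickled.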

In [14]:
%time df = parallel_apply(df, process_sents)

CPU times: user 2.86 s, sys: 1.48 s, total: 4.34 s
Wall time: 27 s


In [15]:
df.head()

Unnamed: 0,reviewText,summary,review_sents,processed_sents
0,This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I lo...,Great tutu- not cheaply made,"[This is a great tutu and at a really great price., It doesn't look cheap at all., I'm so glad I...","[this is a great tutu and at a really great price, it does not look cheap at all, i am so glad i..."
1,"I bought this for my 4 yr old daughter for dance class, she wore it today for the first time and...",Very Cute!!,"[I bought this for my 4 yr old daughter for dance class, she wore it today for the first time an...",[i bought this for my yr old daughter for dance class she wore it today for the first time and t...
2,"What can I say... my daughters have it in orange, black, white and pink and I am thinking to buy...",I have buy more than one,"[What can I say..., my daughters have it in orange, black, white and pink, and I am thinking to ...","[what can i say, my daughters have it in orange black white and pink, and i am thinking to buy f..."
3,"We bought several tutus at once, and they are got high reviews. Sturdy and seemingly well-made. ...","Adorable, Sturdy","[We bought several tutus at once, and they are got high reviews., Sturdy and seemingly well-made...","[we bought several tutus at once and they are got high reviews, sturdy and seemingly well made, ..."
4,Thank you Halo Heaven great product for Little Girls. My Great Grand Daughters Love these Tutu'...,Grammy's Angels Love it,"[Thank you Halo Heaven great product for Little Girls. , My Great Grand Daughters Love these Tu...","[thank you halo heaven great product for little girls, my great grand daughters love these tutu ..."


Let's compare with `swifter`. 

In [16]:
import swifter
df['processed_sents'] = df['review_sents'].swifter.apply(lambda x: normalize_sents(x))

So... it is slower, and it only uses 4 cores on my Mac. Based on their repo, I'd say `swifter` would be more useful for larger datasets.

Anyway, moving on to word and sentence vectors.

In [17]:
def word_vectors(path, fname):
    # each line of the file is a word followed by its vector components
    embeddings_index = {}
    with open(path/fname, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index


def sentence_vector(sent, embeddings_index, dim=100):
    # average of the word vectors; words missing from the index count as zeros
    if len(sent)>0:
        return sum([embeddings_index.get(w, np.zeros((dim,))) for w in sent])/len(sent)
    else:
        return np.zeros((dim,))
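
A small worked example may help here. The sketch below parses two made-up 2-dimensional vectors laid out like a GloVe text file (the words, values and the `toy_index` name are all hypothetical) and then averages them with the same `sentence_vector` logic as above.

```python
import io

import numpy as np

# hypothetical 2-d embeddings in the same layout as the GloVe text file:
# one word per line, followed by its vector components
toy_glove = io.StringIO("great 1.0 0.0\ntutu 0.0 1.0\n")

toy_index = {}
for line in toy_glove:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')


def sentence_vector(sent, embeddings_index, dim=100):
    # average of the word vectors; words missing from the index count as zeros
    if len(sent) > 0:
        return sum([embeddings_index.get(w, np.zeros((dim,))) for w in sent]) / len(sent)
    else:
        return np.zeros((dim,))


# both words known: plain average of the two vectors
both_known = sentence_vector(['great', 'tutu'], toy_index, dim=2)
# 'zzz' is not in the index: it adds a zero vector but still counts in the length
one_unknown = sentence_vector(['great', 'zzz'], toy_index, dim=2)
# empty sentence: falls back to a zero vector
empty = sentence_vector([], toy_index, dim=2)

print(both_known, one_unknown, empty)
```

Note the second case: an unknown word pulls the average towards zero because it is still counted in the denominator. Whether that is desirable depends on the application; skipping unknown words before averaging would be an alternative.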

In [18]:
WORDVEC_PATH = DATA_PATH/'glove.6B'
wordvec_fname= 'glove.6B.100d.txt'
embeddings_index = word_vectors(WORDVEC_PATH, wordvec_fname)

In [19]:
# Individual sentences
all_sents = [s for sents in df.processed_sents for s in sents]
idx2sent = dict(enumerate(all_sents))

In [20]:
idx2toksents = {i: simple_preprocess(s) for i,s in idx2sent.items()}

In [21]:
idx2toksents[0]

['this', 'is', 'great', 'tutu', 'and', 'at', 'really', 'great', 'price']

Note that `simple_preprocess` drops tokens shorter than two characters by default, which is why 'a' no longer appears.

In [22]:
sent2vec = {i: sentence_vector(s, embeddings_index) for i,s in idx2toksents.items()}

In [23]:
sent2vec[0]

array([-0.08077522,  0.40300232,  0.46199453, -0.17641734, -0.08382   ,
       -0.1148019 , -0.22886679,  0.12189113, -0.31666443, -0.27374446,
       -0.01265999,  0.02562739,  0.08072   ,  0.03578821,  0.03440867,
       -0.30850387,  0.02228022,  0.04734689, -0.44706333,  0.29923317,
        0.15157245,  0.0429249 ,  0.12130955, -0.13901   ,  0.55765444,
        0.15392989, -0.19783   , -0.39939177,  0.07522634, -0.30169234,
       -0.26608667,  0.41936848,  0.15411378,  0.0793588 ,  0.02022355,
        0.26441944,  0.04570355,  0.44493055, -0.04780278, -0.24634112,
       -0.33693856, -0.10056976,  0.16340989, -0.51877445, -0.06957345,
        0.12563667,  0.48675224, -0.41143847, -0.04714622, -0.7283745 ,
        0.06338234, -0.23116666,  0.22893542,  0.94118   , -0.315755  ,
       -2.5162    , -0.2728801 , -0.07546222,  1.2681334 ,  0.26414666,
       -0.10944656,  0.65772665, -0.4733538 ,  0.08199911,  0.57762116,
       -0.20582278,  0.5442811 ,  0.20293446,  0.41186   , -0.10

In [24]:
len(sent2vec)

1409349

And just like that, we have about 1.4 million sentence vectors.