# Customer satisfaction from YouTube comments

YouTube has active and popular product review channels; however, customers' comments aren't explicitly linked to products or their satisfaction (i.e. explicit ratings) so the data is less structured and unlabeled. This analysis examines the extent to which customer satisfaction on YouTube product reviews can be inferred from a classifier trained on a large open-source dataset of Amazon product reviews.

This notebook performs the analysis in a few steps:
1. Get Amazon review data (from electronics sub-category)
2. Preprocess text and extract features
3. Train a simple classifier
4. Get YouTube comments from API (comments on laptop review videos)
5. Apply preprocessing from Step 2 and classifier from Step 3 to comments

In [7]:
import pandas as pd
import numpy as np
np.random.seed(0)
import os
import pickle
import multiprocessing as mp
import gzip
import json
import re
import nltk
from nltk import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow as tf
from googleapiclient.discovery import build
#nltk.download('punkt')
api_key = 'xxxxxxxxxxxxxxxxxxx'

## Getting Amazon review data

The file Electronics_5.json.gz contains ~6.7 million Amazon reviews from the electronics category. It is a subset of the full dataset that excludes reviews of products with fewer than 5 reviews and reviews from users with fewer than 5 reviews (see https://nijianmo.github.io/amazon/index.html).

In [134]:
out = {}
g = gzip.open('input/Electronics_5.json.gz', 'rb')
for i,l in enumerate(g):
    out[i]=json.loads(l)
df = pd.DataFrame.from_dict(out, orient='index')
df['reviewText'] = df['reviewText'].astype(str)
df.shape

(6739590, 12)
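
For a file of this size, holding every parsed record in a dictionary before building the DataFrame can use a lot of memory. Below is a minimal sketch of a lighter-weight reader, assuming only a few fields are needed downstream. The helper name `iter_reviews`, its `fields` argument, and the two-line demo file are illustrative and not part of the original notebook.

```python
import gzip
import json
import os
import tempfile

def iter_reviews(path, fields=("overall", "reviewText")):
    """Yield one dict per JSON line, keeping only the requested fields."""
    # 'rt' opens the gzip stream in text mode; the with-block closes it afterwards
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            # fields missing from a record (e.g. a review without text) become None
            yield {name: record.get(name) for name in fields}

# Demo on a two-line stand-in file (hypothetical data, same line-delimited layout)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"overall": 5.0, "reviewText": "Great", "asin": "X1"}) + "\n")
    handle.write(json.dumps({"overall": 2.0, "asin": "X2"}) + "\n")

demo_rows = list(iter_reviews(demo_path))
print(demo_rows)
```

The resulting list of dicts can be passed directly to `pd.DataFrame`.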

In [135]:
df.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,5.0,67,True,"09 18, 1999",AAP7PPBU72QFM,151004714,{'Format:': ' Hardcover'},D. C. Carrad,This is the best novel I have read in 2 or 3 y...,A star is born,937612800,
1,3.0,5,True,"10 23, 2013",A2E168DTVGE6SV,151004714,{'Format:': ' Kindle Edition'},Evy,"Pages and pages of introspection, in the style...",A stream of consciousness novel,1382486400,
2,5.0,4,False,"09 2, 2008",A1ER5AYS3FQ9O3,151004714,{'Format:': ' Paperback'},Kcorn,This is the kind of novel to read when you hav...,I'm a huge fan of the author and this one did ...,1220313600,
3,5.0,13,False,"09 4, 2000",A1T17LMQABMBN5,151004714,{'Format:': ' Hardcover'},Caf Girl Writes,What gorgeous language! What an incredible wri...,The most beautiful book I have ever read!,968025600,
4,3.0,8,True,"02 4, 2000",A3QHJ0FXK33OBE,151004714,{'Format:': ' Hardcover'},W. Shane Schmidt,I was taken in by reviews that compared this b...,A dissenting view--In part.,949622400,


## Preprocessing text data

Below are some functions to preprocess text data for feature extraction. Note that some of these steps are done to preserve words that might contain sentiment information in bigrams, which are extracted later on. For example, the word "not" is important to keep in the bigram "not happy". 

In [136]:
contractions = { 
"aren't": "are not",
"arent": "are not",
"can't": "can not",
"cant": "can not",
"could've": "could have",
"couldve": "could have",
"couldn't": "could not",
"couldnt": "could not",
"didn't": "did not",
"didnt": "did not",
"doesn't": "does not",
"doesnt": "does not",
"don't": "do not",
"dont": "do not",
"hadn't": "had not",
"hadnt": "had not",
"hasn't": "has not",
"hasnt": "has not",
"haven't": "have not",
"havent": "have not",
"he'd": "he would",
"he'll": "he will",
"he's": "he is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"isnt": "is not",
"it'd": "it would",
"it'll": "it will",
"itll": "it will",
"it's": "it is",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldve": "should have",
"shouldn't": "should not",
"shouldnt": "should not",
"that'd": "that would",
"thatd": "that would",
"that's": "that is",
"thats": "that is",
"there's": "there is",
"theres": "there is",
"they'll": "they will",
"theyll": "they will",
"they're": "they are",
"theyre": "they are",
"they've": "they have",
"theyve": "they have",
"wasn't": "was not",
"wasnt": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"werent": "were not",
"what're": "what are",
"what's": "what is",
"where'd": "where did",
"whered": "where did",
"where's": "where is",
"wheres": "where is",
"won't": "will not",
"wont": "will not",
"would've": "would have",
"wouldve": "would have",
"wouldn't": "would not",
"wouldnt": "would not",
"you'd": "you would",
"youd": "you would",
"you'll": "you will",
"youll": "you will",
"you're": "you are",
"youre": "you are",
"you've": "you have",
"youve": "you have"
}

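#function to replace contractions based on the dictionary above using regex
#word boundaries keep keys from matching inside longer words (e.g. "cant" in "significant")
contractions_re = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, contractions.keys())))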
def expand_contractions(s, contractions_dict=contractions):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

#stopwords to exclude
sw = ['of','the','on','in','if','you','i','with','it','was','when','from','this','they','for','my','to',
      'that','there','and','be','been','than','which','as','but','these','at','what','who','why','where',
      'a','an','is','does','do','can','did','would','will','have','are','im','ive','weve','were','am']

#function to lowercase, remove non-alphabetic characters, expand contractions, and remove stopwords
def preprocess_sentence(sentence):
    sent = sentence.lower().strip()
    sent = re.sub(r'[^a-zA-Z\s]','',sent)
    sent = expand_contractions(sent)
    sent = sent.rstrip().strip()
    sent = " ".join([x for x in sent.split() if x not in sw])
    return sent

#tokenize into sentences, process each sentence, and re-join
def preprocess_review(review):
    sents = sent_tokenize(review)
    sents = [preprocess_sentence(sent) for sent in sents]
    sents = " . ".join(sents)
    return sents

In [143]:
#an example of result of preprocessing
print('Actual review:')
print(df.reviewText.values[99996])
print('--------')
print('Preprocessed:')
print(preprocess_review(df.reviewText.values[99996]))

Actual review:
This product is excellent I feel very safe using it on my equipment I liked it so much i purchased another one you won't be disappointed !!!!!!
--------
Preprocessed:
product excellent feel very safe using equipment liked so much purchased another one not disappointed . 
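
To make the order of operations concrete, here is a toy version of the same pipeline. It is a sketch with a three-entry contraction table and a handful of stopwords (both illustrative subsets, not the notebook's full lists). Because punctuation is stripped before contractions are expanded, "won't" is matched through its apostrophe-free key "wont". The pattern in this sketch uses word-boundary anchors so that keys match whole words only.

```python
import re

# Toy subsets of the notebook's tables (illustrative only)
mini_contractions = {"wont": "will not", "isnt": "is not", "cant": "can not"}
mini_stopwords = {"you", "will", "be", "is", "can", "it", "a", "the"}

# \b stops a key such as "cant" from matching inside "significant"
mini_pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, mini_contractions)))

def mini_preprocess(sentence):
    s = sentence.lower().strip()
    s = re.sub(r"[^a-z\s]", "", s)  # "won't" becomes "wont"
    s = mini_pattern.sub(lambda m: mini_contractions[m.group(0)], s)
    # drop stopwords but keep "not", so negation survives into the bigrams
    return " ".join(w for w in s.split() if w not in mini_stopwords)

result_negation = mini_preprocess("You won't be disappointed!!!")
result_boundary = mini_preprocess("It is a significant upgrade")
print(result_negation)  # not disappointed
print(result_boundary)  # significant upgrade
```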


In [142]:
#preprocess reviews using multiprocessing
pool = mp.Pool(processes=mp.cpu_count())
reviews = pool.map(preprocess_review,df['reviewText'])
pool.close()
pool.join()
ratings = df['overall'].values

After preprocessing, each review is represented as a vector of bigram counts. I use bigrams as a simple way to preserve some context that is absent in bag-of-words models. Below, filters remove bigrams that occur in fewer than 0.05% of reviews (with a floor of 10 reviews) or in more than 50% of reviews. The resulting vocabulary has 5,486 entries: 5,485 unique bigrams plus an empty string reserved for padding.

In [238]:
n_reviews = len(reviews)
min_df=max(int(np.floor(.0005*n_reviews)),10)
max_df=int(np.floor(.5*n_reviews))
cv = CountVectorizer(ngram_range=(2,2),max_df=max_df,min_df=min_df)
mat = cv.fit_transform(reviews)
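
Plugging the dataset size into the two formulas above gives the concrete cut-offs. This sketch assumes `n_reviews` equals the 6,739,590 rows reported by `df.shape`.

```python
import math

n_reviews = 6739590  # number of rows reported by df.shape earlier

# Same formulas as the notebook cell
min_df = max(int(math.floor(0.0005 * n_reviews)), 10)
max_df = int(math.floor(0.5 * n_reviews))

print(min_df, max_df)  # 3369 3369795
print(f"minimum share of reviews: {min_df / n_reviews:.4%}")
```

A bigram is therefore kept only if it appears in at least 3,369 reviews (roughly 0.05% of them) and in no more than half of all reviews.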

In [239]:
vocab = ['']+cv.get_feature_names()
vocab_size = len(vocab)
vocab_size

5486

The function below formats the vectors of bigram counts into a tensor that can be fed to a Keras embedding layer. See the section below for the model architecture.

In [241]:
def get_tensor_from_count_matrix(count_mat,max_len=100):
    #count_mat is a sparse matrix of word/doc counts
    #returns an (n_docs, max_len) array of vocab ids (column index + 1), zero-padded on the right
    out = np.zeros((count_mat.shape[0],max_len))
    coo = count_mat.tocoo()
    doc_idx = -1
    word_idx = 0
    for doc,word,num in zip(coo.row,coo.col,coo.data):
        if doc != doc_idx:
            #new document: restart at the first position
            doc_idx = doc
            word_idx = 0
        for _ in range(num):
            #occurrences beyond max_len are dropped
            if word_idx < max_len:
                out[doc,word_idx] = word+1
                word_idx += 1
    return out
X = get_tensor_from_count_matrix(mat)
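
The encoding is easier to see on a single toy document. The sketch below is a pure-Python analogue of the function above for one row. The names `encode_counts`, `toy_vocab_index`, and `toy_counts` are made up for illustration. Ids follow vocabulary order rather than word order, which is harmless for a model that only averages over positions.

```python
def encode_counts(counts, vocab_index, max_len=6):
    """Turn {bigram: count} into a fixed-length list of ids, using 0 as padding."""
    ids = []
    # visit bigrams in vocabulary order, mirroring the column order of the sparse matrix
    for bigram in sorted(counts, key=lambda b: vocab_index.get(b, -1)):
        if bigram not in vocab_index:
            continue  # out-of-vocabulary bigrams are ignored
        # shift by 1 so id 0 stays reserved for padding; repeat once per occurrence
        ids.extend([vocab_index[bigram] + 1] * counts[bigram])
    ids = ids[:max_len]                      # truncate long documents
    return ids + [0] * (max_len - len(ids))  # right-pad short ones

toy_vocab_index = {"not disappointed": 0, "so much": 1, "very safe": 2}
toy_counts = {"so much": 2, "not disappointed": 1, "works great": 1}
encoded = encode_counts(toy_counts, toy_vocab_index)
print(encoded)  # [1, 2, 2, 0, 0, 0]
```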

After the feature extraction process, the following bigrams are produced from the same example shown above:

In [242]:
#get bigrams from one observation of a tensor
def get_features_from_tensor(tensor_obs):
    return [vocab[int(x)] for x in tensor_obs if vocab[int(x)]!='']
print('Actual review:')
print(df.reviewText.values[99996])
print('------')
print('Features extracted:')
get_features_from_tensor(X[99996])

Actual review:
This product is excellent I feel very safe using it on my equipment I liked it so much i purchased another one you won't be disappointed !!!!!!
------
Features extracted:


['product excellent',
 'liked so',
 'purchased another',
 'feel very',
 'not disappointed',
 'another one',
 'one not',
 'so much']

Create labels and partition data randomly into train/test sets. Note that for this analysis I am training a binary classifier by converting 5-star ratings into 1's and anything less than 5-star ratings into 0's. A multi-class model could be used in the future to improve results. 

In [248]:
Y = (ratings==5)*1
part = np.random.permutation(X.shape[0])
n_train = int(np.floor(.6*X.shape[0]))
X_train = X[part[:n_train]] 
X_test = X[part[n_train:]] 
Y_train = Y[part[:n_train]] 
Y_test = Y[part[n_train:]]
ratings_train = ratings[part[:n_train]] 
ratings_test = ratings[part[n_train:]]
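
The same labeling and 60/40 split can be written as a small reusable function. This is a standard-library sketch on toy ratings. The function name and the use of `random.Random` instead of NumPy are choices made for the illustration, so the permutation will not match the notebook's.

```python
import random

def binarize_and_split(ratings, train_frac=0.6, seed=0):
    """Label 5-star ratings as 1 and everything else as 0, then split at random."""
    labels = [1 if r == 5 else 0 for r in ratings]
    order = list(range(len(ratings)))
    random.Random(seed).shuffle(order)  # reproducible permutation of row indices
    n_train = int(train_frac * len(order))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return labels, train_idx, test_idx

toy_ratings = [5, 4, 5, 1, 3, 5, 5, 2, 5, 4]
labels, train_idx, test_idx = binarize_and_split(toy_ratings)
print(labels)
print(len(train_idx), len(test_idx))
```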

## Training a classifier

This section trains a simple classifier on the processed/labeled data. The model below simply learns the polarity of bigrams (1D embedding layer) and classifies the reviews' rating (customer satisfaction) based on the average polarity of the bigrams in the review. Other more complicated models could be used to improve results (e.g. sequence models, higher dimensional embeddings, more complex classifiers); however, because the goal is to apply the classifier to a new distribution of data, I chose a more constrained network based on features that are likely to also show up in the YouTube data.

In [272]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size,1,input_shape=X[0].shape),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1,activation='sigmoid')
])
optimizer = tf.keras.optimizers.Adam(.01)
model.compile(
    loss='binary_crossentropy',
    optimizer = optimizer,
    metrics=['accuracy']
)
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_23 (Embedding)     (None, 100, 1)            5486      
_________________________________________________________________
global_average_pooling1d_23  (None, 1)                 0         
_________________________________________________________________
dense_24 (Dense)             (None, 1)                 2         
Total params: 5,488
Trainable params: 5,488
Non-trainable params: 0
_________________________________________________________________
None
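
What the three layers compute can be written out by hand. The sketch below uses made-up weights (`toy_embedding`, `weight`, and `bias` are illustrative, not trained values). It averages the 1-D embeddings over all positions, padding included since the embedding layer is created without masking, and passes the result through a sigmoid.

```python
import math

def predict_proba(token_ids, embedding, weight, bias):
    """Embedding lookup -> mean over positions -> sigmoid(weight * mean + bias)."""
    mean_embedding = sum(embedding[i] for i in token_ids) / len(token_ids)
    return 1.0 / (1.0 + math.exp(-(weight * mean_embedding + bias)))

# One scalar per vocabulary id; id 0 is the padding entry.
# It is set to 0.0 here for readability; in the notebook it is a learned weight.
toy_embedding = [0.0, 1.5, -2.0, 0.5]
padded_ids = [1, 3, 0, 0]  # two known bigrams followed by padding

p = predict_proba(padded_ids, toy_embedding, weight=2.0, bias=0.0)
print(round(p, 4))  # 0.7311

# Parameter count reported by model.summary(): one scalar per vocabulary entry
# plus the Dense layer's weight and bias
n_params = 5486 * 1 + 2
```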


In [273]:
history = model.fit(X_train,Y_train,batch_size=4096,epochs=30,validation_data=(X_test,Y_test))

Train on 4043754 samples, validate on 2695836 samples
Epoch 1/30
...
Epoch 30/30


The model is clearly learning to embed bigrams based on polarity. Some further cleaning of bigrams could reduce noise: for example, "four stars" cleanly separates a "negative" review (anything below 5 stars) from a "positive" one, but rating phrases like this are unlikely to appear in the new distribution of YouTube data.

In [274]:
embedding_layer = np.squeeze(model.layers[0].get_weights()[0])
classifier_layer = np.squeeze(model.layers[2].get_weights()[0])
polarity = pd.Series(embedding_layer*classifier_layer,index=vocab).sort_values()
print('Most negative bigrams:')
print(polarity.index[:10].values)
print('---------------')
print('Most positive bigrams:')
print(polarity.index[-10:].values)


Most negative bigrams:
['given stars' 'only stars' 'stars instead' 'four stars'
 'very disappointing' 'three stars' 'very disappointed' 'two stars'
 'one star' 'buyer beware']
---------------
Most positive bigrams:
['love thing' 'more pleased' 'extremely happy' 'cons none'
 'highly recommended' 'perfect replacement' 'excellent product'
 'awesome product' 'not happier' 'say enough']
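
One possible form of that cleaning is to drop bigrams that mention star ratings before training. The sketch below applies a hypothetical filter (`STAR_PATTERN` is an assumption, not something the notebook defines) to the ten most negative bigrams listed above.

```python
import re

# Matches "star" or "stars" as a whole word
STAR_PATTERN = re.compile(r"\bstars?\b")

negative_bigrams = [
    "given stars", "only stars", "stars instead", "four stars",
    "very disappointing", "three stars", "very disappointed", "two stars",
    "one star", "buyer beware",
]

# Keep only bigrams that do not refer to the rating scale itself
kept = [b for b in negative_bigrams if not STAR_PATTERN.search(b)]
print(kept)  # ['very disappointing', 'very disappointed', 'buyer beware']
```

Seven of the ten strongest negative features are rating phrases, which suggests that a filter along these lines could noticeably change what the model relies on.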


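The model reaches about 78% accuracy on the test set, against a majority-class baseline of about 64%. That is reasonable given its simplicity and the fact that negative samples are noisier by construction (they mix 1- to 4-star reviews). Most errors are lower-rated reviews predicted as 5-star: about 93% of 5-star reviews are classified correctly, but only about half of the others.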

In [275]:
from sklearn.metrics import confusion_matrix
pred = np.squeeze(np.array(model.predict(X_test)))
confusion_matrix(Y_test,(pred>.5)*1)

array([[ 489772,  476847],
       [ 118423, 1610794]])
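
The usual summary metrics follow directly from this matrix. The sketch below copies the four counts by hand, using scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
# Counts copied from the confusion matrix above: [[tn, fp], [fn, tp]]
tn, fp = 489772, 476847
fn, tp = 118423, 1610794

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
baseline = (tp + fn) / total   # accuracy of always predicting "5-star"
recall_pos = tp / (tp + fn)    # share of 5-star reviews recovered
recall_neg = tn / (tn + fp)    # share of lower-rated reviews recovered
precision_pos = tp / (tp + fp)

print(f"accuracy      {accuracy:.3f}")
print(f"baseline      {baseline:.3f}")
print(f"recall (5*)   {recall_pos:.3f}")
print(f"recall (<5*)  {recall_neg:.3f}")
print(f"precision     {precision_pos:.3f}")
```

The gap between the two recall values shows that the model leans towards predicting 5-star, which is worth keeping in mind when reading the YouTube predictions later on.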

## Get YouTube comment data

The functions below download metadata and text data for the top 100 comments of the top 50 videos for a particular query. In this analysis, I use a list of recent laptop releases as query terms, which return almost exclusively product review videos.

In [210]:
def get_video_data(query,order='relevance',publishedBefore='2020-01-01T00:00:00Z'):
    #gets the video metadata for the top 50 YouTube videos for a particular query string
    #order in ['relevance','date','rating','viewCount']
    req = yt.search().list(q=query,
                       part='snippet',
                       type='video',
                       maxResults=50,
                       relevanceLanguage='en',
                       order=order,
                       publishedBefore=publishedBefore)
    res = req.execute()
    video_ids = ",".join([vid['id']['videoId'] for vid in res['items']])
    req = yt.videos().list(part='snippet,statistics',id=video_ids)
    res = req.execute()    
    out = {}
    for i,item in enumerate(res['items']):
        vid = item['id']
        out[vid] = {}
        out[vid]['query'] = query
        out[vid]['number'] = i
        for snippet_feature in ['publishedAt','title','description']:
            try:
                out[vid][snippet_feature] = item['snippet'][snippet_feature]
            except:
                out[vid][snippet_feature] = np.nan
        for stat_feature in ['viewCount','likeCount','dislikeCount','commentCount']:
            try:
                out[vid][stat_feature] = int(item['statistics'][stat_feature])
            except (KeyError, ValueError):
                #statistic missing or not numeric for this video
                out[vid][stat_feature] = np.nan
        try:
            #get data for top 100 comments for the video
            out[vid]['comments'] = get_comment_data(vid,order=order,publishedBefore=publishedBefore)
        except:
            #if comments are disabled
            out[vid]['comments'] = np.nan
    return out

def get_comment_data(video_id,order='relevance',publishedBefore='2020-01-01T00:00:00Z'):
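    #gets metadata and text data for top 100 comments of a particular YouTube video
    #note: order and publishedBefore are accepted but not used; comments are always requested by relevance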
    req = yt.commentThreads().list(part='snippet',
                       videoId=video_id,
                       maxResults=100,
                       textFormat='plainText',
                       order='relevance')
    res = req.execute()
    out = {}
    for i,item in enumerate(res['items']):
        cid = item['snippet']['topLevelComment']['id']
        out[cid]={}
        out[cid]['number'] = i
        for snippet_feature in ['publishedAt','textDisplay','likeCount']:
            try:
                out[cid][snippet_feature] = item['snippet']['topLevelComment']['snippet'][snippet_feature]
            except:
                out[cid][snippet_feature] = np.nan
    return out
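
Separating the parsing from the network call makes this logic checkable without an API key. The sketch below is a hypothetical helper (`parse_comment_threads` is not part of the notebook) that flattens a response with the same nested keys that `get_comment_data` reads. The response shown is hand-written for illustration.

```python
def parse_comment_threads(response, features=("publishedAt", "textDisplay", "likeCount")):
    """Flatten a commentThreads-style response dict into {comment_id: {...}}."""
    out = {}
    for i, item in enumerate(response.get("items", [])):
        top = item["snippet"]["topLevelComment"]
        row = {"number": i}
        for name in features:
            # None marks a field that is missing from the response
            row[name] = top["snippet"].get(name)
        out[top["id"]] = row
    return out

# Hand-written stand-in for an API response (illustrative values only)
fake_response = {
    "items": [
        {"snippet": {"topLevelComment": {
            "id": "c1",
            "snippet": {"publishedAt": "2019-12-19T17:24:22.000Z",
                        "textDisplay": "Great review",
                        "likeCount": 3}}}},
        {"snippet": {"topLevelComment": {
            "id": "c2",
            "snippet": {"textDisplay": "Too expensive"}}}},
    ]
}

parsed = parse_comment_threads(fake_response)
print(parsed["c2"])
```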

Get the video/comment data for a set of query terms and save as pickle.

In [279]:
yt = build('youtube','v3',developerKey=api_key)
#a list of recent top laptop releases
laptops = ['HP Elite Dragonfly',
           'Dell XPS 13 2019',
           'Huawei MateBook 13',
           'HP Spectre x360 2019',
           'MacBook Pro 16-inch 2019',
           'Alienware Area-51m',
           'Google Pixelbook Go',
           'Microsoft Surface Laptop 3',
           'Dell XPS 15 2-in-1',
           'Dell G5 15 5590',
           'Asus Chromebook Flip',
           'Asus VivoBook S15',
           'Acer Switch 3',
           'Apple MacBook 12-inch 2017',
           'HP Spectre Folio']
if os.path.isfile('youtube_data.p'):
    with open('youtube_data.p','rb') as f:
        youtube_data = pickle.load(f)
else:
    youtube_data = {}
    for query in laptops:
        print(query)
        youtube_data.update(get_video_data(query))
    with open('youtube_data.p','wb') as f:
        pickle.dump(youtube_data,f)        

In [297]:
#example of the structure of the video data
pd.DataFrame(youtube_data).T.head()

Unnamed: 0,query,number,publishedAt,title,description,viewCount,likeCount,dislikeCount,commentCount,comments
22_AGgC7QJM,HP Elite Dragonfly,0,2019-12-19T17:00:00.000Z,HP Elite Dragonfly Review,"Lisa Gade reviews the HP Elite Dragonfly 13"" p...",161061,2468,67,238,"{'UgzuQR1_JPXWidT6CAx4AaABAg': {'number': 0, '..."
3pDn8wPD4Ng,HP Elite Dragonfly,1,2019-09-18T06:00:02.000Z,HP Elite Dragonfly first look: A light busines...,"HP is chasing superlatives again. Last year, t...",115724,1261,70,136,"{'Ugzgqec21xctbHwgbOR4AaABAg': {'number': 0, '..."
c0muI1HMvkU,HP Elite Dragonfly,2,2019-12-10T22:47:53.000Z,HP Elite Dragonfly Review: The Stunning Busine...,HP Elite Dragonfly: http://tidd.ly/6ceb42b8\n...,42230,729,36,99,"{'UgwM5y_BDZFhSyn2Zwx4AaABAg': {'number': 0, '..."
0tmnvxjj4xY,HP Elite Dragonfly,3,2019-12-13T19:38:37.000Z,HP Elite Dragonfly unboxing and first impressions,Unboxing and first impressions of HP's ultra-l...,5118,38,10,10,"{'Ugxju1v7yDz-QRwwPeZ4AaABAg': {'number': 0, '..."
ZLZsSdOknz8,HP Elite Dragonfly,4,2019-09-20T19:10:30.000Z,Lew Later On The 24-Hour Dragonfly Laptop,Clip from Lew Later (Episode - iPhone 11 Teard...,30103,626,28,76,"{'UgxYa67chZW-nJXlisB4AaABAg': {'number': 0, '..."


The data for these products (queries) consist of 37,603 YouTube comments. In general there are between 2,000 and 3,700 comments per product (HP Elite Dragonfly and Acer Switch 3 are exceptions, with fewer than 1,000 each).

In [281]:
#make a dataframe of comments
frames = []
for vid,data in youtube_data.items():
    comment_dict = data['comments']
    #comments is NaN (not a dict) when they are disabled for the video
    if not isinstance(comment_dict,dict) or len(comment_dict)==0:
        continue
    toadd = pd.DataFrame(comment_dict).T
    toadd['query'] = data['query']
    frames.append(toadd)
comment_df = pd.concat(frames)
comment_df.shape

(37603, 5)

In [282]:
comment_df.head()

Unnamed: 0,likeCount,number,publishedAt,query,textDisplay
UgzuQR1_JPXWidT6CAx4AaABAg,166,0,2019-12-19T17:24:22.000Z,HP Elite Dragonfly,Ok when is she gonna wear a shirt i dont want ...
UgzfJsION2Yoixzzgm54AaABAg,63,1,2019-12-19T23:12:20.000Z,HP Elite Dragonfly,she's not wasting any time and really nailing ...
UgwBlVIWFDPURgAZtLd4AaABAg,58,2,2019-12-19T17:02:32.000Z,HP Elite Dragonfly,You've been one of my favorite YouTubers since...
UgwHXy-Zd9USvgvv6DF4AaABAg,7,3,2019-12-19T17:37:08.000Z,HP Elite Dragonfly,Upgrade pricing should be near actual cost +~1...
UgzyicVlmw14jKEybnF4AaABAg,51,4,2019-12-19T17:14:52.000Z,HP Elite Dragonfly,Lisa is one of the best when it comes to in-de...


In [295]:
print('Number of comments per product query')
comment_df.groupby('query')['textDisplay'].count().sort_values()

Number of comments per product query


query
HP Elite Dragonfly             789
Acer Switch 3                  920
Dell G5 15 5590               2009
HP Spectre Folio              2256
Asus VivoBook S15             2498
Microsoft Surface Laptop 3    2555
Asus Chromebook Flip          2671
Google Pixelbook Go           2706
Dell XPS 13 2019              2765
Apple MacBook 12-inch 2017    2800
Dell XPS 15 2-in-1            2840
HP Spectre x360 2019          2883
Huawei MateBook 13            3052
Alienware Area-51m            3201
MacBook Pro 16-inch 2019      3658
Name: textDisplay, dtype: int64
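
As a quick consistency check, the per-product counts can be compared with the total and with the most the download could return. The cap assumes the limits described earlier: at most 50 videos per query and at most 100 comments per video.

```python
# Per-query comment counts copied from the output above, in the same order
comments_per_query = [789, 920, 2009, 2256, 2498, 2555, 2671, 2706,
                      2765, 2800, 2840, 2883, 3052, 3201, 3658]

total_comments = sum(comments_per_query)
cap_per_query = 50 * 100  # 50 videos x 100 comments each
largest = max(comments_per_query)

print(total_comments)          # 37603
print(largest, cap_per_query)  # 3658 5000
```

No query reaches the cap, so many of the videos evidently returned fewer than 100 comments.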

## Apply classifier to YouTube comment data

The following steps are followed to predict customer satisfaction from the YouTube comment data:
1. Preprocess comment data in the same way as Amazon review data
2. Extract features using the vocabulary from Amazon reviews
3. Infer sentiment/customer satisfaction using trained classifier

In [283]:
comments = comment_df.textDisplay.apply(preprocess_review)
comments_cv = CountVectorizer(ngram_range=(2,2),vocabulary=vocab[1:])
comments_mat = comments_cv.fit_transform(comments)
comments_tensor = get_tensor_from_count_matrix(comments_mat)
comment_df['pred'] = np.squeeze(np.array(model.predict(comments_tensor)))
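
Because the vocabulary is fixed to the Amazon bigrams, a comment with no known bigram becomes an all-padding row, and every such comment receives the same prediction. It may therefore be worth checking how many comments that applies to. The sketch below is a simplified, pure-Python coverage check on toy data. It tokenizes on whitespace, which only approximates `CountVectorizer`'s default tokenizer, and the names `bigrams` and `vocab_coverage` are illustrative.

```python
def bigrams(text):
    """Adjacent word pairs of a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def vocab_coverage(texts, vocabulary):
    """Count known bigrams per text and the share of texts with none."""
    known = set(vocabulary)
    hits = [sum(b in known for b in bigrams(t)) for t in texts]
    empty_share = sum(h == 0 for h in hits) / len(hits)
    return hits, empty_share

toy_vocab = ["not disappointed", "so much", "highly recommended"]
toy_comments = ["not disappointed so much", "nice video", "first"]

hits, empty_share = vocab_coverage(toy_comments, toy_vocab)
print(hits)         # [2, 0, 0]
print(empty_share)
```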

We can get a rough sense of how well the classifier works on the new distribution of data by reading some of the comments it scores highest and lowest. The examples below suggest the classifier is working as intended.

In [291]:
print('-------------')
print('Most "positive" comments:')
print('-------------')
print('\n')
for comment in comment_df.sort_values(by=['pred'])['textDisplay'].values[-5:]:
    print(comment.replace('\n',''))
    print('\n')
print('-------------')
print('Most "negative" comments:')
print('-------------')
print('\n')
for comment in comment_df.sort_values(by=['pred'])['textDisplay'].values[:5]:
    print(comment)
    print('\n')

-------------
Most "positive" comments:
-------------


own this laptop and i absolutely love it, its the best laptop i ever had in my life i so love it, and its also the first  laptop i own with a white color, it's change so much from black classic gaming laptop


Great video man, I live in Brazil and actually got the Macbook 12 2017 with the i5 processor and 512gb ssd storage based on this review, I was leaning towards either this or the MacBook Pro base model, and I decided to go with the 12 inch because portability is really important for me, since I`m a college professor and businessman, so I carry mine around all day. All I can say about this machine is that everything you said on your review is true, it is a wonderful, lightweight and very powerful device, much more powerful than my 2014 MacBook Air, which I replaced with this one. The port situation doesn't bother me at all, I only have one adaptor which lets me charge it, plug on my 32" monitor and keep an extra USB-A port. Th

Aggregating to the product level gives a rough sense of overall YouTube-customer satisfaction for each product. Here, the two MacBooks have the highest mean predicted satisfaction and the HP Spectre x360 2019 has the lowest, although the differences are small: the product means span only about 0.647 to 0.663.

In [285]:
comment_df.groupby('query')['pred'].mean().sort_values()

query
HP Spectre x360 2019          0.646922
Google Pixelbook Go           0.648963
Asus Chromebook Flip          0.649582
HP Elite Dragonfly            0.649692
Acer Switch 3                 0.649988
Alienware Area-51m            0.650416
Dell XPS 15 2-in-1            0.650839
Dell G5 15 5590               0.651589
Microsoft Surface Laptop 3    0.652427
Asus VivoBook S15             0.652914
Dell XPS 13 2019              0.652993
HP Spectre Folio              0.653268
Huawei MateBook 13            0.657045
MacBook Pro 16-inch 2019      0.658816
Apple MacBook 12-inch 2017    0.663216
Name: pred, dtype: float32
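
Since the product means differ by less than 0.02, it helps to attach an uncertainty estimate before ranking products. One option is a percentile bootstrap interval for each product's mean prediction. The sketch below uses only the standard library and runs on randomly generated stand-in values, not on the notebook's predictions. The function name and its defaults are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # resample with replacement and record the mean of each resample
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Stand-in predictions for one product (random toy values, not notebook output)
toy_rng = random.Random(1)
toy_preds = [toy_rng.uniform(0.4, 0.9) for _ in range(300)]

toy_mean = statistics.fmean(toy_preds)
ci_low, ci_high = bootstrap_mean_ci(toy_preds)
print(f"mean {toy_mean:.3f}, 95% interval ({ci_low:.3f}, {ci_high:.3f})")
```

Applied per product, for example to each group of `comment_df.groupby('query')['pred']`, overlapping intervals would indicate that the ordering of those products is not well supported by the data.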

## Summary & next steps

A lot can be done to improve the data quality and model used here (as discussed above). However, this gives a starting point for inferring customer satisfaction from a large unlabeled dataset of YouTube comments, where there is an active community centered around product reviews.