# Analysis of YouTube Comments on the Mercedes-Benz Channel

Esther Ling, 2018

## Introduction

Vimeo vs YouTube? Vimeo, a paid platform with 170 million users, is claimed to have a more mature audience than YouTube, which has over 1 billion users. As a result, videos on Vimeo are said to receive more constructive comments, less spam and fewer unhelpful negative reviews, which may be an important factor for businesses implementing branding strategies on social media.

In this project, I propose to test this claim by performing a sentiment analysis on a collection of comments from the two platforms.

Since different types of videos have different audiences, it is possible that comments also depend on the type of video poster (hobbyist vs business) and the genre of video (educational, marketing, etc.). To limit the scope, I aim to focus on large corporations that cross-post to both platforms.

For this stage, I extracted comments from videos posted by Mercedes-Benz on YouTube using the YouTube Data API.
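The extraction step itself is not shown in this notebook. As a rough sketch only, assuming the `commentThreads.list` endpoint of the YouTube Data API v3 was used, building a request and reading one page of results might look like the following. The helper names `build_request_url` and `extract_comments`, the `VIDEO_ID` and `API_KEY` placeholders, and the sample response are all hypothetical; no network call is made.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_request_url(video_id, api_key, page_token=None):
    """Build the URL for one page of top-level comments on a video."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return API_ENDPOINT + "?" + urlencode(params)


def extract_comments(response):
    """Pull the comment text out of one parsed JSON response page."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]


# Made-up response page that mirrors the documented response shape.
sample_page = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "I love mercedes car"}}}},
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Love mercedes"}}}},
    ],
    "nextPageToken": "HYPOTHETICAL_TOKEN",
}

url = build_request_url("VIDEO_ID", "API_KEY")
comments = extract_comments(sample_page)
print(url)
print(comments)
```

In a real run, the `nextPageToken` of each page would be passed back as `page_token` until no token is returned, and the collected comments written to `comments.csv`.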

In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk import tokenize, FreqDist
import string
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Main data structures:
# 1. df_comments: The original list of comments, with pre-processing.
# 2. comments_tok: A list of lists containing tokenized comments.
# 3. words_comments_tok: A list of all the words from all comments.

In [3]:
## Read comments from csv file
df_comments = pd.read_csv("../Data/comments.csv", header=None)
df_comments.dropna(inplace=True) ## Remove NaNs
df_comments = df_comments[0].tolist()
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['would', 'im'])
print("Done!")

Done!


In [4]:
## Remove non-printable characters (such as emojis)
## Note: written for Python 3, where filter() returns an iterator, not a str or list
printable = set(string.printable)
df_comments = [''.join(ch for ch in c if ch in printable) for c in df_comments]

## Remove comments by channel - sets the lines to empty
df_comments = ['' if 'Subscribe' in c else c for c in df_comments]

## Remove punctuation
punct_table = str.maketrans('', '', string.punctuation)
df_comments = [c.translate(punct_table) for c in df_comments]

## Remove empty lines
df_comments = [c for c in df_comments if c]

print(df_comments[0:10])

['AUDIS ARE WAY MORE FASTER', 'The dream will be on December 18 2022', 'I love mercedes car', 'Awesome  video ', 'Love mercedes', 'UP', 'I love mercedes company', 'Would you give a car for freePlease', 'Mercedes hit like please', '1']
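One side effect is visible in the output above: deleting punctuation outright can glue neighbouring words together, as in `freePlease`. A possible variant, not the method used in this notebook, is to replace punctuation with a space and then collapse repeated whitespace. The function name and the sample comment below are hypothetical, since the raw text behind `freePlease` is not available here.

```python
import string

# Map every ASCII punctuation character to a single space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation_keep_gaps(comment):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    spaced = comment.translate(PUNCT_TO_SPACE)
    return " ".join(spaced.split())


# Hypothetical raw comment with no space after the question mark.
cleaned = strip_punctuation_keep_gaps("Would you give a car for free?Please")
print(cleaned)
```

The trade-off is that contractions are split as well ("I'm" becomes "I m"), so the extra stop-word `im` added earlier would no longer match and the stop-word list would need adjusting.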


In [5]:
## Tokenize each sentence (make a list of lists)
comments_tok = [tokenize.word_tokenize(c) for c in df_comments]

## Total word bank (flat_list = [item for sublist in l for item in sublist])
words_comments_tok = [w for c in comments_tok for w in c]
words_comments_tok = [w.lower() for w in words_comments_tok] ## lower_case all the words 
words_comments_tok = [w for w in words_comments_tok if w not in stopwords] ## remove stop-words

print(words_comments_tok[0:10])

['audis', 'way', 'faster', 'dream', 'december', '18', '2022', 'love', 'mercedes', 'car']
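The order of the steps matters: NLTK's English stop-word list is lower-case, so words have to be lower-cased before they are compared against it. The sketch below illustrates this on the first two cleaned comments. It swaps in plain whitespace splitting for `word_tokenize` and uses a tiny, hypothetical stop-word set so that it runs without any NLTK data.

```python
# Hypothetical, tiny stop-word set; the notebook uses NLTK's full English list.
SAMPLE_STOPWORDS = {"are", "more", "the", "will", "be", "on", "i"}


def to_word_bank(comments, stopwords):
    """Tokenize on whitespace, lower-case, and drop stop-words."""
    tokens = [w.lower() for c in comments for w in c.split()]
    return [w for w in tokens if w not in stopwords]


sample = ["AUDIS ARE WAY MORE FASTER", "The dream will be on December 18 2022"]
word_bank = to_word_bank(sample, SAMPLE_STOPWORDS)
print(word_bank)
```

Filtering before lower-casing would leave "ARE" and "MORE" in the word bank, because neither matches the lower-case entries.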


In [6]:
## Get word distribution
allWordDist = FreqDist(w for w in words_comments_tok)
mostCommon = allWordDist.most_common(20)
mostCommon

[('mercedes', 2125),
 ('car', 1782),
 ('like', 967),
 ('best', 890),
 ('love', 851),
 ('benz', 633),
 ('one', 627),
 ('nice', 549),
 ('video', 520),
 ('amg', 517),
 ('class', 499),
 ('new', 498),
 ('cars', 485),
 ('good', 441),
 ('great', 416),
 ('mercedesbenz', 379),
 ('looks', 356),
 ('nothing', 321),
 ('please', 315),
 ('amazing', 314)]

In [7]:
## Sort words based on frequency, for plotting using Plotly
wordDist_keys = [w for _,w in sorted(zip(allWordDist.values(), allWordDist.keys()), reverse=True)]
wordDist_values = sorted(allWordDist.values(), reverse=True)
x = wordDist_keys[0:20]
y = wordDist_values[0:20]
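As an aside, `FreqDist` is a subclass of `collections.Counter`, so the sorted pairs returned by `most_common` can be unzipped directly into the two plotting sequences. A minimal sketch with a plain `Counter`, using four of the counts from the word distribution shown above:

```python
from collections import Counter

# Counts taken from the top of the word distribution shown above.
word_counts = Counter({"mercedes": 2125, "car": 1782, "like": 967, "best": 890})

# most_common(n) returns (word, count) pairs sorted by descending count.
top = word_counts.most_common(3)
x_top, y_top = zip(*top)
print(x_top)
print(y_top)
```

Words with equal counts may come out in a different order than with the `sorted(zip(...))` approach, which only matters for ties at the cut-off.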

## Plot 1: Most common 20 words

Do we see a more positive or negative sentiment based on the most common words?

In [8]:
data = [go.Bar(x=x, 
               y=y,)]
layout = go.Layout(
    title='Most common 20 words',
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

### Observations

Overall, the most frequently occurring words are positive, such as "like", "best", "love", "nice", "good", "great" and "amazing".
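To put a rough number on this, the sketch below computes what share of the top-20 word occurrences comes from the seven positive words named above. The counts are copied from the `FreqDist` output; treating exactly these seven words as "positive" is an assumption made for illustration, not a sentiment lexicon.

```python
# Top-20 counts copied from the FreqDist output above.
top20 = {
    "mercedes": 2125, "car": 1782, "like": 967, "best": 890, "love": 851,
    "benz": 633, "one": 627, "nice": 549, "video": 520, "amg": 517,
    "class": 499, "new": 498, "cars": 485, "good": 441, "great": 416,
    "mercedesbenz": 379, "looks": 356, "nothing": 321, "please": 315,
    "amazing": 314,
}

# Assumption: the seven words named in the observation count as positive.
POSITIVE_WORDS = {"like", "best", "love", "nice", "good", "great", "amazing"}

positive_total = sum(n for w, n in top20.items() if w in POSITIVE_WORDS)
total = sum(top20.values())
share = positive_total / total
print(positive_total, total, round(share, 3))
```

This is only indicative: a word such as "like" also appears in neutral phrases ("looks like", "hit like"), and single-word counts ignore negation.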

## Plot 2: What's the distribution of competitor mentions?

In [43]:
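## Note: matching is done per single token, so the two-word entry
## 'aston martin' can never match, and plurals such as 'audis' are not counted.
competitor_list = ['audi', 'bmw', 'porsche', 'volkswagen', 'jaguar', 
                    'ferrari', 'volvo', 'aston martin', 
                    'hyundai', 'honda', 'nissan', 'toyota', 
                    'peugeot', 'renault', 'bugatti',
                    'ford', 'tesla']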
words_comments_tok_comp = [w for w in words_comments_tok if w in competitor_list]
print(len(words_comments_tok_comp))
compDist = FreqDist(w for w in words_comments_tok_comp)

647
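Because the word bank holds one word per token, a two-word brand such as 'aston martin' cannot be found by a per-token membership test. One possible way to include it, sketched here under the assumption that adjacent tokens come from the same comment, is to also scan consecutive token pairs. The brand sets, the function name and the token stream are hypothetical.

```python
from collections import Counter

SINGLE_WORD_BRANDS = {"audi", "bmw", "porsche", "tesla"}
TWO_WORD_BRANDS = {("aston", "martin")}


def count_brand_mentions(tokens):
    """Count single-token brands plus two-token brands found as bigrams."""
    counts = Counter(t for t in tokens if t in SINGLE_WORD_BRANDS)
    for pair in zip(tokens, tokens[1:]):
        if pair in TWO_WORD_BRANDS:
            counts[" ".join(pair)] += 1
    return counts


# Hypothetical token stream, lower-cased as in words_comments_tok.
tokens = ["aston", "martin", "better", "bmw", "audi", "bmw"]
mentions = count_brand_mentions(tokens)
print(mentions)
```

On the real data it would be safer to run this per comment (over `comments_tok`) rather than over the flat word bank, so that a pair is never formed across the boundary between two comments.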


In [45]:
data = [go.Histogram(x=words_comments_tok_comp)]
layout = go.Layout(
    title='Competitor Brand Mentions',
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

### Observations

1. According to mbaskool.com (https://www.mbaskool.com/brandguide/automobiles/1300-mercedes-benz.html), the top 3 competitors of Mercedes-Benz are BMW, Porsche and Audi. In the video comments, BMW has the most mentions, followed by Audi, which is consistent with that list. However, Porsche has relatively few mentions despite being in the top 3.

2. Surprisingly, Tesla and Nissan, which are not German competitors, also have a significant number of mentions.