### YouTube API: Extraction and analysis of comments about Asus Zenbook Pro (Regex/NLTK)

The subject of this analysis is the Asus Zenbook Pro, a laptop from Asus. The goal is to find out what people think about the product by analysing comments extracted from YouTube videos on the topic.

**Import libraries**

In [1]:
import requests
import pandas as pd
import numpy as np

**Extract videos that match specific search words** <br>
Keywords: _asus zenbook pro_ <br>
Number of videos: _50_ <br>
Relevance language: _English_

In [2]:
params = (
    ('key', 'AIzaSyDyPycUEc7szd7NWABwbAULVdAxBo36W3w'),
    ('part', 'snippet'),
    ('type', 'video'),
    ('maxResults', 50),
    ('q', 'asus zenbook pro'),
    ('relevanceLanguage', 'en'), #is not guaranteed to work
)

response = requests.get('https://www.googleapis.com/youtube/v3/search', params=params)

response_json=response.json()


channel_ids = []
videoid_name = {}  # maps video title -> video ID
for item in response_json['items']:
    channel_ids.append(item['snippet']['channelId'])
    videoid_name[item['snippet']['title']] = item['id']['videoId']
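
The request above embeds the API key directly in the notebook. A minimal sketch of one alternative, assuming the key is stored in an environment variable: the variable name `YOUTUBE_API_KEY` and the helper `build_search_params` are hypothetical and not part of the original notebook.

```python
import os

def build_search_params(query, max_results=50, api_key=None):
    """Build the query parameters for the YouTube Data API v3 search endpoint."""
    # Assumption: the key lives in an environment variable named YOUTUBE_API_KEY
    key = api_key or os.environ.get('YOUTUBE_API_KEY')
    if not key:
        raise ValueError('No API key supplied')
    return {
        'key': key,
        'part': 'snippet',
        'type': 'video',
        'maxResults': max_results,
        'q': query,
        'relevanceLanguage': 'en',  # a hint only; results may still be in other languages
    }
```

The returned dictionary can be passed as `params=` to `requests.get('https://www.googleapis.com/youtube/v3/search', ...)` in the same way as the tuple used above.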

Even though **'relevanceLanguage'** is set to English, the API also returns videos from non-English channels. Consequently, only comments from English-language videos are selected for the analysis.<br>
To identify which videos are in English, the **"langid"** library is used.<br>
The language is determined from the title of each video.

In [3]:
import langid

#create a list of IDs of videos with English titles
videos_required=[]
for name, vid in videoid_name.items():
    lang = langid.classify(name)  # returns a (language, score) tuple
    if lang[0] == 'en':
        videos_required.append(vid)

In [4]:
print("Number of English videos: ", len(videos_required))

Number of English videos:  24
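
The title-based filter can also be wrapped in a small helper. This is a sketch rather than the notebook's own code: `english_video_ids` is a hypothetical name, and the classifier is passed in as an argument so that `langid.classify` or any other detector returning a `(language, score)` tuple can be used.

```python
def english_video_ids(videoid_name, classify, language='en'):
    """Return the IDs of videos whose title is classified as `language`.

    videoid_name maps video title -> video ID.
    classify is any callable that returns a (language_code, score) tuple,
    for example langid.classify.
    """
    selected = []
    for title, vid in videoid_name.items():
        lang_code, _score = classify(title)
        if lang_code == language:
            selected.append(vid)
    return selected
```

With the names used in this notebook, the call would be `english_video_ids(videoid_name, langid.classify)`. Titles are short texts, so some misclassification should be expected.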


**Now that the video IDs are stored, the API can be used once more to extract the comments of the videos in the list "videos_required"**

In [5]:
import time
comments=[]
video_id = []
for video in videos_required:

    params_v = (
        ('key', 'AIzaSyDyPycUEc7szd7NWABwbAULVdAxBo36W3w'),
        ('part', 'snippet'),
        ('videoId', video),
        ('maxResults', '100'),  # API maximum per request; only the first page is fetched
    )

    response_v = requests.get('https://www.googleapis.com/youtube/v3/commentThreads', params=params_v)
    response_json_v = response_v.json()

    # 'items' is absent when the API returns an error (e.g. comments are disabled for the video)
    for item in response_json_v.get('items', []):
        top_comment = item['snippet']['topLevelComment']['snippet']
        comments.append(top_comment['textOriginal'])
        video_id.append(top_comment['videoId'])
    time.sleep(3)  # pause between requests
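
Each `commentThreads` request returns at most 100 threads, so the loop above collects only the first page per video. The YouTube Data API signals further pages with a `nextPageToken` field, which is sent back as the `pageToken` parameter. A minimal sketch of that loop, assuming a hypothetical helper `fetch_all_comments` and a caller-supplied `get_json` function that performs the HTTP request:

```python
def fetch_all_comments(video, api_key, get_json, max_pages=10):
    """Collect top-level comment texts for one video, following page tokens.

    get_json(url, params) is assumed to return the decoded JSON response,
    e.g. lambda url, params: requests.get(url, params=params).json().
    """
    url = 'https://www.googleapis.com/youtube/v3/commentThreads'
    params = {'key': api_key, 'part': 'snippet', 'videoId': video, 'maxResults': 100}
    texts = []
    for _ in range(max_pages):  # hard cap so a long thread cannot exhaust the quota
        page = get_json(url, params)
        # 'items' is missing when the API returns an error response
        for item in page.get('items', []):
            texts.append(item['snippet']['topLevelComment']['snippet']['textOriginal'])
        token = page.get('nextPageToken')
        if not token:
            break
        params = dict(params, pageToken=token)
    return texts
```

Passing the request function in as an argument keeps the paging logic separate from the network call, which makes it easy to try out with canned responses.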

The comments are now put into a dataframe. <br>
In addition, the API from text-processing.com is used to label each comment as **positive**, **negative** or **neutral**.

In [6]:
rows = []
for comment, vid_id in zip(comments, video_id):
    # the service returns a label plus class probabilities for each text
    data = [('text', comment),]
    response = requests.post('http://text-processing.com/api/sentiment/', data=data)
    json_sent = response.json()
    rows.append([
        comment,
        vid_id,
        json_sent['label'],
        json_sent["probability"]["pos"],
        json_sent["probability"]["neg"],
        json_sent["probability"]["neutral"],
    ])

# build the dataframe once instead of appending row by row
df = pd.DataFrame(rows, columns=['textDisplay', 'video_id','label','pos','neg','neutral'])

**Summary: number of positive, negative, neutral comments**

In [7]:
df.groupby(['label'])['textDisplay'].count()

label
neg        654
neutral    330
pos        468
Name: textDisplay, dtype: int64
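
The raw counts are easier to compare as shares of all labelled comments. A small sketch in plain Python using the counts printed above (the variable names are illustrative):

```python
# Label counts copied from the groupby output above
label_counts = {'neg': 654, 'neutral': 330, 'pos': 468}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f'{label:<8}{share:.1%}')
```

Roughly 45% of the comments are labelled negative, 32% positive and 23% neutral. On the dataframe itself, `df['label'].value_counts(normalize=True)` should give the same shares.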

**Subset of negative comments**

In [14]:
pd.options.display.max_colwidth = 140
df.loc[df['label'] == 'neg'].sort_values(by=['neg'], ascending =False)[:5]

| | textDisplay | video_id | label | pos | neg | neutral |
|---|---|---|---|---|---|---|
| 1378 | Ther is NOTHING WORST than scrolling a touch screen and that it lags so terribly.\n\nThe touch pad completely turned me off from this la... | jR1V_7RxrIk | neg | 0.028725 | 0.971275 | 0.001585 |
| 679 | Stupid idea and boring naming | otLtSbzWgrA | neg | 0.05839 | 0.94161 | 0.061381 |
| 1190 | I bought an UX430UA from Asus and I'm really mad at them for not having Asus health charging app. the website says all 2017 zenbook have... | A0cLS0ZHWNc | neg | 0.06791 | 0.93209 | 0.211965 |
| 63 | I'm all for an extra screen on a laptop, but why on earth did they put it in the worse possible place to put a screen?\nDoes anyone seri... | b5wGGp88nBs | neg | 0.106105 | 0.893895 | 0.146019 |
| 332 | Who the hell measures battery life with the screen off? That's so stupid! | ycsCNY-wSHg | neg | 0.106472 | 0.893528 | 0.012235 |


**Subset of positive comments**

In [13]:
df.loc[df['label'] == 'pos'].sort_values(by=['pos'], ascending =False)[:5]

| | textDisplay | video_id | label | pos | neg | neutral |
|---|---|---|---|---|---|---|
| 1429 | VERY NICE. GOOD BRAND. I use this brand for many years and I feel very comfortable. this is the top of the PC and of the various brands.... | EcaDhN_OD_Q | pos | 0.89826 | 0.10174 | 0.093861 |
| 872 | Nice one Saf! This is probably the best coverage of Computex haha | phGShu0LzwQ | pos | 0.871374 | 0.128626 | 0.161591 |
| 1078 | Asus always deliver a great, durable, and beautiful product. | CEWrNY0u-Gc | pos | 0.869709 | 0.130291 | 0.111447 |
| 1430 | Fiero utilizzatore della Asus da più di 15 anni. Una marca davvero ottima. Eccelle in ogni sua funzionalità e prestazioni. Design e graf... | EcaDhN_OD_Q | pos | 0.864242 | 0.135758 | 0.157376 |
| 877 | Now that is awesome innovation. especially the extension display option. that is nice. | phGShu0LzwQ | pos | 0.859301 | 0.140699 | 0.111939 |


**Use PorterStemmer to normalize words and find the most frequent words used in positive and negative comments**

In [11]:
import operator
import re

import nltk
from nltk import FreqDist

#the words that appear the most in positive comments
porter = nltk.PorterStemmer()
list_pos = df.loc[df['label'] == 'pos', 'textDisplay'].tolist()
lst_words_pos = []
for line in list_pos:
    #split on whitespace and punctuation; note that `\...` matches a literal '.' plus the next two characters
    text_pos = re.split(r'\n| |\?|\!|\:|\"|\(|\)|\...|\;',line)
    for word in text_pos:
        #keep words longer than 3 characters; skip mentions, hashtags and the 'RT' marker
        if (len(word)>3 and not word.startswith('@') and not word.startswith('#') and word != 'RT'):
            lst_words_pos.append(porter.stem(word.lower()))


dist_pos = FreqDist(lst_words_pos)
sorted_dist_pos = sorted(dist_pos.items(), key=operator.itemgetter(1), reverse=True)
sorted_dist_pos[:50]


[('thi', 112),
 ('laptop', 81),
 ('with', 79),
 ('asu', 66),
 ('that', 58),
 ('video', 58),
 ('great', 57),
 ('review', 48),
 ('good', 45),
 ('have', 42),
 ('would', 42),
 ('your', 41),
 ('thank', 41),
 ('nice', 37),
 ('more', 36),
 ('love', 35),
 ('look', 35),
 ('will', 33),
 ('zenbook', 31),
 ('what', 31),
 ('like', 31),
 ('than', 30),
 ('awesom', 28),
 ('better', 27),
 ('macbook', 26),
 ('realli', 26),
 ("it'", 25),
 ('just', 25),
 ('screen', 24),
 ('game', 23),
 ('vivobook', 23),
 ('cool', 22),
 ('appl', 22),
 ('use', 20),
 ('veri', 20),
 ('know', 19),
 ('could', 18),
 ('from', 18),
 ('price', 18),
 ('work', 17),
 ('about', 17),
 ('think', 16),
 ('amaz', 16),
 ('make', 16),
 ('best', 15),
 ('display', 15),
 ('some', 15),
 ('want', 15),
 ('edit', 14),
 ('pleas', 13)]

Some useful words that hint at what users **like** about Zenbook laptops: _look, video, screen, game, price, display_
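
Most of the top entries ('thi', 'with', 'that', 'have') are function words that say little about the product. One possible refinement is sketched here under stated assumptions: the stop-word set is a tiny hand-picked illustration (NLTK's `stopwords` corpus would be a fuller choice), and `top_words` is a hypothetical helper, not the notebook's code. Stemming is optional and passed in as a function, so `porter.stem` can be supplied.

```python
import re
from collections import Counter

# Illustrative stop-word set only; a real run would use a fuller list
STOP_WORDS = {'this', 'that', 'with', 'have', 'would', 'your', 'will',
              'what', 'than', 'just', 'from', 'about', 'some', 'more'}

def top_words(texts, n=10, stem=lambda w: w):
    """Count the most frequent non-stop-words longer than three characters."""
    counts = Counter()
    for text in texts:
        # Runs of letters and apostrophes; punctuation and digits act as separators
        for word in re.findall(r"[a-z']+", text.lower()):
            if len(word) > 3 and word not in STOP_WORDS:
                counts[stem(word)] += 1
    return counts.most_common(n)
```

With the names used in this notebook, a call could look like `top_words(list_pos, n=50, stem=porter.stem)`.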

In [12]:
#the words that appear the most in negative comments
list_neg = df.loc[df['label'] == 'neg', 'textDisplay'].tolist()
lst_words_neg = []
for line in list_neg:
    #same splitting and filtering rules as for the positive comments
    text_neg = re.split(r'\n| |\?|\!|\:|\"|\(|\)|\...|\;',line)
    for word in text_neg:
        if (len(word)>3 and not word.startswith('@') and not word.startswith('#') and word != 'RT'):
            lst_words_neg.append(porter.stem(word.lower()))
dist_neg = FreqDist(lst_words_neg)
sorted_dist_neg = sorted(dist_neg.items(), key=operator.itemgetter(1), reverse=True)
sorted_dist_neg[:50]

[('thi', 253),
 ('laptop', 191),
 ('that', 158),
 ('have', 116),
 ('with', 112),
 ('asu', 106),
 ('screen', 99),
 ('zenbook', 85),
 ('like', 79),
 ('look', 71),
 ('just', 67),
 ('about', 67),
 ('what', 58),
 ('want', 56),
 ('when', 55),
 ('would', 54),
 ('they', 53),
 ('macbook', 49),
 ('than', 47),
 ('think', 46),
 ('more', 45),
 ('better', 45),
 ('game', 45),
 ('onli', 45),
 ('realli', 45),
 ('will', 45),
 ("don't", 44),
 ('your', 44),
 ('touch', 44),
 ("it'", 42),
 ('need', 42),
 ('much', 40),
 ('releas', 38),
 ('review', 36),
 ('video', 36),
 ('use', 35),
 ('price', 35),
 ('appl', 33),
 ('could', 33),
 ('from', 32),
 ('there', 32),
 ('make', 31),
 ('pleas', 31),
 ('time', 28),
 ('doe', 28),
 ('where', 27),
 ('thing', 27),
 ('some', 27),
 ('come', 26),
 ('trackpad', 26)]

Some useful words that hint at what users **dislike** about Zenbook laptops: _time, game, price, touch(pad), screen, trackpad_
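
Several stems (_game_, _price_, _screen_) appear in both lists, partly because there are more negative comments (654) than positive ones (468). Dividing each count by the number of comments in its class makes the two lists comparable. A rough sketch using counts copied from the outputs above; the per-comment rate is an illustrative normalisation chosen here, not part of the original analysis.

```python
# (positive count, negative count) for stems that appear in both top-50 lists
counts = {
    'screen': (24, 99),
    'game': (23, 45),
    'price': (18, 35),
    'macbook': (26, 49),
    'review': (48, 36),
    'video': (58, 36),
}
N_POS, N_NEG = 468, 654  # number of positive / negative comments

# Ratio > 1 means the stem is relatively more common in negative comments
ratios = {
    word: (neg / N_NEG) / (pos / N_POS)
    for word, (pos, neg) in counts.items()
}

for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{word:<8}{ratio:.2f}')
```

Under this normalisation, _screen_ is roughly three times as frequent per negative comment as per positive comment, while _review_ and _video_ lean positive.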