# Fetching YouTube comments and classifying them

In this notebook we fetch comments from YouTube and classify them using the model trained in the previous notebook.

See this page for a handy way of setting up access to the YouTube data API: [https://python.gotrained.com/youtube-api-extracting-comments/](https://python.gotrained.com/youtube-api-extracting-comments/)

YouTube API reference for `commentThreads.list`:
[https://developers.google.com/youtube/v3/docs/commentThreads/list](https://developers.google.com/youtube/v3/docs/commentThreads/list)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip uninstall scikit-learn -y
!pip install scikit-learn==0.20.3

Uninstalling scikit-learn-0.20.3:
  Successfully uninstalled scikit-learn-0.20.3
Looking in indexes: https://packages.dns.ad.zopa.com/artifactory/api/pypi/pypi-python2711-virtual/simple/
Collecting scikit-learn==0.20.3
  Downloading https://packages.dns.ad.zopa.com/artifactory/api/pypi/pypi-python2711-virtual/packages/5e/82/c0de5839d613b82bddd088599ac0bbfbbbcbd8ca470680658352d2c435bd/scikit_learn-0.20.3-cp36-cp36m-manylinux1_x86_64.whl (5.4MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-0.20.3


In [3]:
!pip install google-api-python-client
!pip install google-auth google-auth-oauthlib google-auth-httplib2

Looking in indexes: https://packages.dns.ad.zopa.com/artifactory/api/pypi/pypi-python2711-virtual/simple/
Looking in indexes: https://packages.dns.ad.zopa.com/artifactory/api/pypi/pypi-python2711-virtual/simple/


In [4]:
!pip freeze > requirements.txt


In [5]:
import os
import pickle
import getpass

The next cell points to my personal YouTube API client secrets file. To run this notebook yourself, you will need to set up your own account, download your own client secrets file, and set `CLIENT_SECRETS_FILE` to its path.

In [6]:
CLIENT_SECRETS_FILE = 'client_secret_1049876915637-7ia95c7rg5teak6crcuodies22keluuh.apps.googleusercontent.com.json'
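
Before starting the OAuth flow it can help to confirm that the downloaded file really is a client secrets file. The sketch below is only an illustration: `load_client_config` is a hypothetical helper that is not part of this notebook, and it assumes the usual layout of a downloaded Google client secrets file, where the settings sit under a single top-level `installed` or `web` key.

```python
import json


def load_client_config(path):
    """Return (client_type, settings) from a downloaded client secrets file.

    Hypothetical helper: raises ValueError if the file does not have the
    expected layout.
    """
    with open(path, encoding='utf-8') as f:
        config = json.load(f)

    # The settings are nested under a key naming the client type.
    for client_type in ('installed', 'web'):
        if client_type in config:
            settings = config[client_type]
            missing = {'client_id', 'client_secret'} - settings.keys()
            if missing:
                raise ValueError(f'{path} is missing fields: {sorted(missing)}')
            return client_type, settings

    raise ValueError(f'{path} does not look like a client secrets file')
```

Calling `load_client_config(CLIENT_SECRETS_FILE)` before `get_authenticated_service()` would surface a wrong or truncated download as a clear error rather than a failure inside the OAuth flow.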

In [7]:
from youtube_comments import youtube

Specify the OAuth scope this application requests, along with the API service name and version.

In [8]:
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

In [9]:
import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES
            )
            credentials = flow.run_console()
 
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)
 
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

In [10]:
# When running locally, disable OAuthlib's HTTPS verification. When
# running in production *do not* leave this option enabled.
os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
service = get_authenticated_service()

### Search videos and fetch comments

In [11]:
import csv
import re

def _get_video_id_from_url(url):
    # Take the text after 'v=' and drop any query parameters that follow it
    return re.split('v=', url)[-1].split('&')[0]

def get_video_comments_from_url(url, service, **kwargs):
    
    video_id = _get_video_id_from_url(url)
    
    comments = []
    results = service.commentThreads().list(
        part='snippet', videoId=video_id, textFormat='plainText', **kwargs
    ).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(
                part='snippet', videoId=video_id, textFormat='plainText', **kwargs
            ).execute()
        else:
            break
    
    return comments


def get_videos(service, **kwargs):
    final_results = []
    results = service.search().list(**kwargs).execute()
 
    i = 0
    max_pages = 3
    while results and i < max_pages:
        final_results.extend(results['items'])
 
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.search().list(**kwargs).execute()
            i += 1
        else:
            break
 
    return final_results


def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

        
def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    final_result = []
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')
        final_result.extend([(video_id, title, comment) for comment in comments])
    
    return final_result


def write_to_csv(comments):
    with open('comments.csv', 'w', newline='', encoding='utf-8') as comments_file:
        comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        comments_writer.writerow(['Video ID', 'Title', 'Comment'])
        for row in comments:
            comments_writer.writerow(list(row))
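
The three fetch functions above repeat the same `nextPageToken` loop. As a minimal sketch of that pattern on its own, the generator below follows page tokens until none is returned. `iter_pages` and `fake_list_page` are hypothetical names introduced here for illustration; the fake function stands in for a call such as `service.commentThreads().list(**params).execute()` so the loop can be run without network access.

```python
def iter_pages(list_page, **kwargs):
    """Yield each response page, following nextPageToken until it runs out.

    list_page is any callable that accepts the request parameters as
    keyword arguments and returns one response dict.
    """
    params = dict(kwargs)  # copy so the caller's arguments are not mutated
    while True:
        page = list_page(**params)
        yield page
        token = page.get('nextPageToken')
        if not token:
            break
        params['pageToken'] = token


# Fake two-page response, keyed by the page token that requests it.
_FAKE_PAGES = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c']},
}


def fake_list_page(pageToken=None, **_ignored):
    return _FAKE_PAGES[pageToken]


items = [item
         for page in iter_pages(fake_list_page, part='snippet')
         for item in page['items']]
print(items)  # ['a', 'b', 'c']
```

Copying `kwargs` into `params` keeps the page token out of the caller's dictionary, which the in-place `kwargs['pageToken'] = ...` assignments above do not.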


In [12]:
comments = get_video_comments_from_url('https://www.youtube.com/watch?v=D0W1v0kOELA', service)
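
The URL above is a plain `watch?v=` link, which is the easiest case for pulling out a video ID. As a hedged alternative sketch, the standard library's URL parser can also cope with extra query parameters and with `youtu.be` short links. `extract_video_id` is a hypothetical name and is not part of the `youtube_comments` package.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    host = (parsed.hostname or '').lower()

    # Short links carry the ID in the path: https://youtu.be/<id>
    if host == 'youtu.be':
        return parsed.path.lstrip('/').split('/')[0]

    # Watch links carry the ID in the 'v' query parameter.
    video_ids = parse_qs(parsed.query).get('v')
    if not video_ids:
        raise ValueError(f'No video ID found in {url!r}')
    return video_ids[0]


print(extract_video_id('https://www.youtube.com/watch?v=D0W1v0kOELA&t=42s'))
# D0W1v0kOELA
```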

In [13]:
len(comments)

7623

In [14]:
comments[:10]

['Trippin while watching this is euphoric',
 'any 1 rember\nJoe dirt 2 when he goes back in time and warns the band about the crash ,',
 'this bird jew cannot change',
 '4:04',
 'driving a flying car over bone county on san andreas to this song was just majestic\n"if the police can\'t stop you, you must be on ........the dust"\n4:56 you’re welcome',
 '2020 and the is solo still going..\nThis is the song of freedom.\nIf this isn’t played at my funeral I’m not going.',
 'Forest gum and gta snd I think',
 'I still dont understand what SUSUDIO means!',
 'Oh the memories.',
 'Is that the coolest guy who ever lived at 1.48']

In [15]:
import pandas as pd

comments_data = pd.DataFrame({
    'comment': comments
})
comments_data.head(3)

                                             comment
0            Trippin while watching this is euphoric
1  any 1 rember\nJoe dirt 2 when he goes back in ...
2                        this bird jew cannot change


In [16]:
import pickle
with open('models/models_dict.pkl', 'rb') as f:
    models_dict = pickle.load(f)

In [17]:
word_vectorizer = models_dict['word_vectorizer']
char_vectorizer = models_dict['char_vectorizer']
models = models_dict['models']

In [18]:
from scipy.sparse import hstack

def classify_comments(comments, word_vectorizer, char_vectorizer, models, probability=False):
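    """
    Score raw comments with every per-class model.

    :param comments: an array of strings, the raw data to score
    :param probability: if True, return positive-class probabilities instead of labels
    :return: a DataFrame with one column per class and one row per comment
    """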
    word_features = word_vectorizer.transform(comments)
    char_features = char_vectorizer.transform(comments)
    combined_features = hstack([char_features, word_features])
    
    predictions = {}
    for class_name, model in models.items():
        if probability:
            # Take the positive class probability (second column) for every comment
            class_prediction = model.predict_proba(combined_features)[:, 1]
        else:
            class_prediction = model.predict(combined_features)
            
        predictions[class_name] = class_prediction
    
    return pd.DataFrame(predictions)


In [19]:
comments_data.loc[14:18, 'comment']

14    HEY DISLIKERS ! DO YOURSELF A FAVOR....1 ST LO...
15                                 K-rose GTA SAN bitch
16       Players guitar hero dari indonesia ada gak 😁😁😁
17    I listen to slipknot, dope, lil Wayne, eminem,...
18    My girlfriend says "if i leave here tomorrow, ...
Name: comment, dtype: object

### Test the command-line system

In [73]:
!python youtube_comments/youtube.py --url https://www.youtube.com/watch?v=wLdK6z679Bs --data_path=output

Traceback (most recent call last):
  File "youtube_comments/youtube.py", line 165, in <module>
    raise ValueError(f'The video {video_id} currently has no comments!')
ValueError: The video wLdK6z679Bs currently has no comments!


In [74]:
!python youtube_comments/youtube.py --url https://www.youtube.com/watch?v=eyHEn0lDc6g --data_path=output --save_raw=True

         toxic  severe_toxic   obscene  threat  insult  identity_hate
mean  0.035714           0.0  0.035714     0.0     0.0            0.0
sum   2.000000           0.0  2.000000     0.0     0.0            0.0
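
The `mean` and `sum` rows can be read together: if each comment gets a 0/1 label per class, the sum is the number of flagged comments and the mean is the flagged fraction. The sketch below reproduces the `toxic` column under that assumption. How the script computes its summary is not shown in this notebook, and the count of 56 comments is inferred from 2 / 0.035714 rather than reported by the script.

```python
def summarise(flags):
    """Return (mean, sum) for a list of 0/1 predictions for one class."""
    return sum(flags) / len(flags), sum(flags)


# Assumed input: 56 comments, 2 of them labelled toxic.
toxic_flags = [1, 1] + [0] * 54
toxic_mean, toxic_sum = summarise(toxic_flags)
print(round(toxic_mean, 6), toxic_sum)  # 0.035714 2
```

With the predictions held in a pandas DataFrame, `predictions.agg(['mean', 'sum'])` would give the same two rows for every class at once.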


In [None]:
!python youtube_comments/youtube.py --url https://www.youtube.com/watch?v=D0W1v0kOELA --data_path=output