# Fetching YouTube comments and classifying them

In this notebook we fetch comments from YouTube and classify them using the model trained in the previous notebook.

It is the development ground for the code that ends up in `youtube_comments/youtube.py`.

See this page for a handy way of setting up access to the YouTube data API: [https://python.gotrained.com/youtube-api-extracting-comments/](https://python.gotrained.com/youtube-api-extracting-comments/)

YouTube API reference for `commentThreads.list`:
[https://developers.google.com/youtube/v3/docs/commentThreads/list](https://developers.google.com/youtube/v3/docs/commentThreads/list)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip install google-api-python-client
!pip install google-auth google-auth-oauthlib google-auth-httplib2

Collecting google-api-python-client
  Downloading https://files.pythonhosted.org/packages/9a/b4/a955f393b838bc47cbb6ae4643b9d0f90333d3b4db4dc1e819f36aad18cc/google_api_python_client-1.8.0-py3-none-any.whl (57kB)
Collecting google-auth>=1.4.1 (from google-api-python-client)
  Downloading https://files.pythonhosted.org/packages/05/b0/cc391ebf8ebf7855cdcfe0a9a4cdc8dcd90287c90e1ac22651d104ac6481/google_auth-1.12.0-py2.py3-none-any.whl (83kB)
Collecting uritemplate<4dev,>=3.0.0 (from google-api-python-client)
  Downloading https://files.pythonhosted.org/packages/bf/0c/60d82c077998feb631608dca3cc1fe19ac074e772bf0c24cf409b977b815/uritemplate-3.0.1-py2.py3-none-any.whl
Collecting httplib2<1dev,>=0.9.2 (from google-api-python-client)
  Downloading https://files.pythonhosted.org/packages/8e/4b/025a7338bb2d4a2c61f0e530b79aafc29d112ed8e61333a6dd9ba48f3bab/httplib2-0.17.0-py3-none-any.whl (95kB)
Collecting google-auth-httplib2>=0.0.3 (from google-api-python-client)
  Downloading https://files.pytho

tensorflow 1.11.0 has requirement setuptools<=39.1.0, but you'll have setuptools 46.1.3 which is incompatible.
You are using pip version 18.1, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


Collecting google-auth-oauthlib
  Downloading https://files.pythonhosted.org/packages/7b/b8/88def36e74bee9fce511c9519571f4e485e890093ab7442284f4ffaef60b/google_auth_oauthlib-0.4.1-py2.py3-none-any.whl
Installing collected packages: google-auth-oauthlib
Successfully installed google-auth-oauthlib-0.4.1




Install the remaining packages listed in `requirements.txt`.

In [3]:
%%capture
!pip install -r requirements.txt

In [4]:
import os
import pickle
import getpass

Point to the file holding my personal YouTube client keys. To reuse this work you will need to set up your own account, download your own keys, and change the path below.

In [5]:
CLIENT_SECRETS_FILE = 'personal_key.json'

In [6]:
from youtube_comments import youtube

Specify the scope of this application.

In [7]:
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

In [8]:
import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES
            )
            credentials = flow.run_console()
 
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)
 
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

In [None]:
# When running locally, disable OAuthlib's HTTPS verification. When
# running in production, *do not* leave this option enabled.
os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
service = get_authenticated_service()

### Search videos and fetch comments

In [10]:
import csv
import re

def _get_video_id_from_url(url):
    return re.split('v=', url)[-1]

def get_video_comments_from_url(url, service, **kwargs):
    
    video_id = _get_video_id_from_url(url)
    
    comments = []
    results = service.commentThreads().list(
        part='snippet', videoId=video_id, textFormat='plainText', **kwargs
    ).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(
                part='snippet', videoId=video_id, textFormat='plainText', **kwargs
            ).execute()
        else:
            break
    
    return comments


def get_videos(service, **kwargs):
    final_results = []
    results = service.search().list(**kwargs).execute()
 
    i = 0
    max_pages = 3
    while results and i < max_pages:
        final_results.extend(results['items'])
 
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.search().list(**kwargs).execute()
            i += 1
        else:
            break
 
    return final_results


def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

        
def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    final_result = []
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')
        final_result.extend([(video_id, title, comment) for comment in comments])
    
    return final_result


def write_to_csv(comments):
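    # newline='' stops the csv module writing blank rows on Windows;
    # utf-8 keeps emoji and non-Latin comments intact.
    with open('comments.csv', 'w', newline='', encoding='utf-8') as comments_file: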
        comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        comments_writer.writerow(['Video ID', 'Title', 'Comment'])
        for row in comments:
            comments_writer.writerow(list(row))

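The three fetch functions above repeat the same `nextPageToken` loop. A minimal sketch of how that loop could be factored out follows. `iter_pages`, `collect_comment_texts`, and the `fetch_page` callable are hypothetical names, not part of `youtube.py`, and the fake pages exist only so the example runs without API access.

```python
def iter_pages(fetch_page, **kwargs):
    """Yield each response page, following nextPageToken until it runs out.

    fetch_page is a hypothetical callable that takes the request keyword
    arguments and returns the decoded response dict.
    """
    params = dict(kwargs)  # copy so the caller's kwargs are not mutated
    while True:
        page = fetch_page(**params)
        yield page
        token = page.get('nextPageToken')
        if not token:
            break
        params['pageToken'] = token


def collect_comment_texts(fetch_page, **kwargs):
    """Flatten the top-level comment text out of every page."""
    texts = []
    for page in iter_pages(fetch_page, **kwargs):
        for item in page.get('items', []):
            texts.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
    return texts


def _fake_item(text):
    # Mimics the nesting used by the comment loops above.
    return {'snippet': {'topLevelComment': {'snippet': {'textDisplay': text}}}}


# Stand-in for the real API: two fake pages keyed by page token.
_FAKE_PAGES = {
    None: {'items': [_fake_item('first')], 'nextPageToken': 'p2'},
    'p2': {'items': [_fake_item('second')]},
}


def fake_fetch(pageToken=None, **_ignored):
    return _FAKE_PAGES[pageToken]


demo_comments = collect_comment_texts(fake_fetch, videoId='abc')
print(demo_comments)
```

With the real service, `fetch_page` could be something like `lambda **kw: service.commentThreads().list(**kw).execute()`. Copying `kwargs` also avoids leaving a stale `pageToken` in the caller's dictionary.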

In [None]:
comments = get_video_comments_from_url('https://www.youtube.com/watch?v=D0W1v0kOELA', service)
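The call above passes a plain watch URL, which `_get_video_id_from_url` handles. Because that helper splits on `v=`, a URL carrying extra query parameters (for example `&t=42s`) would return the ID with those parameters still attached. A sketch of a more tolerant parser, using only the standard library, might look like this; `extract_video_id` is a hypothetical name and the `youtu.be` short-link handling is an added assumption.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    # Short links carry the ID in the path: https://youtu.be/<id>
    if parsed.netloc.endswith('youtu.be'):
        return parsed.path.lstrip('/')
    # Watch URLs carry the ID in the `v` query parameter.
    values = parse_qs(parsed.query).get('v')
    if not values:
        raise ValueError('no video id found in {!r}'.format(url))
    return values[0]


print(extract_video_id('https://www.youtube.com/watch?v=D0W1v0kOELA&t=42s'))
```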

In [None]:
len(comments)

In [None]:
comments[:10]

In [None]:
import pandas as pd

comments_data = pd.DataFrame({
    'comment': comments
})
comments_data.head(3)

In [16]:
import pickle
with open('models/models_dict.pkl', 'rb') as f:
    models_dict = pickle.load(f)

In [17]:
word_vectorizer = models_dict['word_vectorizer']
char_vectorizer = models_dict['char_vectorizer']
models = models_dict['models']

In [18]:
from scipy.sparse import hstack

def classify_comments(comments, word_vectorizer, char_vectorizer, models, probability=False):
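    """
    Score raw comments with one model per class.

    :param comments: an array of strings, the raw data to score
    :param word_vectorizer: fitted vectorizer producing word-level features
    :param char_vectorizer: fitted vectorizer producing character-level features
    :param models: dict mapping each class name to its fitted model
    :param probability: if True, return positive-class probabilities
        instead of class predictions
    :return: a DataFrame with one column per class and one row per comment
    """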
    word_features = word_vectorizer.transform(comments)
    char_features = char_vectorizer.transform(comments)
    combined_features = hstack([char_features, word_features])
    
    predictions = {}
    for class_name, model in models.items():
        if probability:
            # Take the positive-class probability (second column) for every comment
            class_prediction = model.predict_proba(combined_features)[:, 1]
        else:
            class_prediction = model.predict(combined_features)
            
        predictions[class_name] = class_prediction
    
    return pd.DataFrame(predictions)
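In the `probability=True` branch the positive-class scores are one column of the `predict_proba` output. Assuming the models follow the scikit-learn convention, that output has shape `(n_samples, n_classes)`, so the index used decides whether you get one comment's row or one class's column. The array below is made up purely to show the difference.

```python
import numpy as np

# Made-up predict_proba output for three comments:
# column 0 = P(negative), column 1 = P(positive).
proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

second_row = proba[1]          # both class probabilities for comment 1 only
positive_column = proba[:, 1]  # P(positive) for every comment

print(second_row.shape)       # (2,)
print(positive_column.shape)  # (3,)
```

Only the column form yields one value per comment, which is what the resulting DataFrame needs.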


In [19]:
comments_data.loc[14:18, 'comment']

14    HEY DISLIKERS ! DO YOURSELF A FAVOR....1 ST LO...
15                                 K-rose GTA SAN bitch
16       Players guitar hero dari indonesia ada gak 😁😁😁
17    I listen to slipknot, dope, lil Wayne, eminem,...
18    My girlfriend says "if i leave here tomorrow, ...
Name: comment, dtype: object

### Test the command-line system

In [13]:
!python youtube_comments/youtube.py --key_file personal_key.json --url https://www.youtube.com/watch?v=wLdK6z679Bs --data_path=output

Traceback (most recent call last):
  File "youtube_comments/youtube.py", line 180, in <module>
    youtube_comments = get_video_comments_from_url(args['url'], service)
  File "youtube_comments/youtube.py", line 120, in get_video_comments_from_url
    part='snippet', videoId=video_id, textFormat='plainText', **kwargs
  File "C:\Users\ollie\Anaconda3\lib\site-packages\googleapiclient\_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\ollie\Anaconda3\lib\site-packages\googleapiclient\http.py", line 898, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/youtube/v3/commentThreads?part=snippet&videoId=wLdK6z679Bs&textFormat=plainText&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.">
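The traceback above is a quota failure, not a bug in the script. A small sketch of how such a response could be told apart from other 403 errors follows. `is_quota_error` is a hypothetical helper, and the `quotaExceeded` and `dailyLimitExceeded` reason strings and the sample body are assumptions about the error payload, so check them against a real response before relying on them.

```python
import json

# Assumed reason codes for quota failures; verify against a real response.
QUOTA_REASONS = ('quotaExceeded', 'dailyLimitExceeded')


def is_quota_error(status, content):
    """Return True if an API error response looks like a quota failure.

    status: HTTP status code; content: raw response body (bytes or str).
    """
    if int(status) != 403:
        return False
    if isinstance(content, bytes):
        content = content.decode('utf-8', errors='replace')
    try:
        payload = json.loads(content)
    except ValueError:
        # Body was not JSON; fall back to a plain substring check.
        return 'quota' in content.lower()
    error = payload.get('error') if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return False
    return any(
        isinstance(entry, dict) and entry.get('reason') in QUOTA_REASONS
        for entry in error.get('errors', [])
    )


# Illustrative body only, not copied from a real API response.
sample_body = json.dumps({
    'error': {'code': 403, 'errors': [{'reason': 'quotaExceeded'}]}
}).encode('utf-8')

print(is_quota_error(403, sample_body))
```

Inside an `except HttpError as err:` block, the arguments would be `err.resp.status` and `err.content`, so the script could print a short message about the quota instead of a full traceback.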


In [74]:
!python youtube_comments/youtube.py --key_file personal_key.json --url https://www.youtube.com/watch?v=eyHEn0lDc6g --data_path=output --save_raw=True

         toxic  severe_toxic   obscene  threat  insult  identity_hate
mean  0.035714           0.0  0.035714     0.0     0.0            0.0
sum   2.000000           0.0  2.000000     0.0     0.0            0.0


In [None]:
!python youtube_comments/youtube.py --key_file personal_key.json --url https://www.youtube.com/watch?v=D0W1v0kOELA --data_path=output