In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

In [5]:
df = pd.read_csv('All_data_cbs.csv').drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1)

In [7]:
df.columns

Index([u'rowHash', u'Id', u'Title', u'Description', u'LikeCount',
       u'DislikeCount', u'ViewCount', u'FavoriteCount', u'CommentCount',
       u'PublishedAt', u'Channel Id', u'Channel Title', u'Lang',
       u'publishedAt', u'subscriberCount', u'channelVideoCount',
       u'channelViewCount', u'nextHash', u'PrevCommentCount',
       u'PrevDislikeCount', u'PrevLikeCount', u'PrevPublishedAt',
       u'PrevViewCount', u'PrevTitle', u'PublishedYear', u'ChannelAge',
       u'Title-clickbait', u'PrevTitle-clickbait'],
      dtype='object')

In [24]:
prolific_channels = df[df['channelVideoCount'] > 50]
df['PublishedAt'][1] > df['PublishedAt'][0]

False

In [14]:
prolific_channels[['Id', 'Title']]

    Id           Title
3   E--7XtuuqAg  Mooring Anchor Carry 120kg at Papar Strongestm...
4   fdsppIOelPw  Tire Flip 450kg at Papar Strongestman 2016
5   tJ3bb0V5y2Y  Farmer Walk 125kg each hand at Papar Strongest...
6   UYnmMwiNGUU  Truck Pull 14 tonnes at Papar Strongestman 2016
7   3anYa8MAjvA  Atlas Stone at Papar Strongestman 2016
8   lT5lqT0zJgk  200kg Bench Press with support
9   9qIULP6nY04  Farmer Walk 130kg Arnold Classic Australia
10  JUSp1cWyyjc  Memories JK1M
11  iLKDs4ISafA  30 Squats for Ticket
12  AGmVK_he-pY  The Craziest Trainer !!!


Here is a high-level overview of what I want to do:

The `prolific_channels` pandas dataframe has one entry per video, restricted to channels that have published more than 50 videos. For each of those channels, I want to get the rowHash (index in the dataframe) of the most recently uploaded video. Then, for each of those most recent videos, we want to get statistics for the last 10 or so videos uploaded before it. Each of these stats will be a feature in our regression. After building a dataframe that holds those features (which I'll refer to as `df`), we fit a GradientBoostingRegressor as follows.

In [None]:
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

#features = ['last_video_views', '2nd_to_last_views', ... 'views_from_10_videos_ago',
#            'last_video_comment_count', same for the others...]

X = df[features]
y = df['Views']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    
reg = GradientBoostingRegressor()
reg.fit(X_train, y_train)
print("R^2: {0}".format(reg.score(X_test, y_test)))

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,12))
sns.barplot(x=reg.feature_importances_, y=features)

In addition to that regressor, which predicts view counts from stats specific to the channel's own recent videos, we'll also want a second regressor that looks at coarser channel-level stats like that other group did (total subscribers, total videos, total view count, etc.) and, based on those alone, estimates the view count of the most recent upload to a channel. Then we combine those two regressors somehow.
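One simple option for the "combine somehow" step is a weighted average of the two models' predictions, with the weight chosen to minimize squared error on a held-out split. The sketch below only illustrates that idea and is not a settled design: `fit_blend_weight` and `blend` are hypothetical helper names, and the three lists are made-up numbers standing in for validation-set predictions from the two regressors and the true view counts.

```python
def fit_blend_weight(pred_a, pred_b, y_true):
    """Least-squares weight w for the blend w * pred_a + (1 - w) * pred_b."""
    num = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, y_true))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        # Both models agree everywhere, so any weight gives the same blend.
        return 0.5
    # Clip to [0, 1] so the result stays a convex combination.
    return min(1.0, max(0.0, float(num) / den))


def blend(pred_a, pred_b, w):
    """Weighted average of two prediction sequences."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Made-up validation predictions, purely for illustration.
video_level_preds = [1000.0, 2000.0, 3000.0]
channel_level_preds = [3000.0, 4000.0, 5000.0]
actual_views = [2000.0, 3000.0, 4000.0]

w = fit_blend_weight(video_level_preds, channel_level_preds, actual_views)
combined = blend(video_level_preds, channel_level_preds, w)
```

In practice the inputs would be the `predict` outputs of the two fitted regressors on the same held-out videos. Stacking a third model on top of both predictions is another option, at the cost of needing more held-out data.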

In [30]:
channels = {}
for _, video in prolific_channels.iterrows():
    # Group the video rows by the channel that published them.
    channels.setdefault(video['Channel Id'], []).append(video)
for channel_id in channels:
    # Sort each channel's videos from oldest to newest.
    channels[channel_id] = sorted(channels[channel_id], key=lambda v: v['PublishedAt'])

In [36]:
def prediction_answer_and_feature_extractor(channel, n):
    """
    I'm assuming the thing that's passed in is a list of videos,
    in other words a channel is a value in the channels dict.
    This also assumes that the list is sorted by PublishedAt, oldest first.

    This takes one of those entries and returns a tuple, one value of which
    is the correct view count that we want to predict for the most recent video
    posted to the channel, and the second is a dictionary containing feature names
    and feature values of the previous n videos before the most recent one.
    """
    if len(channel) < n + 1:
        raise ValueError("Not enough videos in channel to create a feature vector.")

    video_stat_names = ['LikeCount', 'DislikeCount', 'ViewCount',
                        'FavoriteCount', 'CommentCount']
    correct_prediction = channel[-1]['ViewCount']
    feature_values = []
    feature_names = []
    # i runs over -2, -3, ..., -(n+1): the n videos before the latest one.
    for i in range(-2, -n - 2, -1):
        feature_values += channel[i][video_stat_names].tolist()
        feature_names += [name + "_%d" % i for name in video_stat_names]
    return (correct_prediction, dict(zip(feature_names, feature_values)))

1

The above will definitely fail on our current data, since the entries in `channels` are often going to have fewer than n + 1 videos (we'll set n to 5 or 10 or something), in which case the function raises instead of returning features. That is why we need to use the YouTube API to loop through all channel ids in `channels` and populate each list with the last 15 videos posted to that channel.
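A rough sketch of that backfill step, assuming the YouTube Data API v3 is accessed through the google-api-python-client package. The function name `fetch_recent_videos`, the `max_videos` parameter, and the layout of the returned dicts are made up for illustration. The `youtube` argument is assumed to be a client built with `googleapiclient.discovery.build('youtube', 'v3', developerKey=...)`, which needs a valid API key. This has not been run against the live API here.

```python
def fetch_recent_videos(youtube, channel_id, max_videos=15):
    """Return up to max_videos recent uploads for a channel, oldest first.

    Assumes max_videos <= 50, the per-request limit of the endpoints used.
    """
    # Step 1: look up the channel's "uploads" playlist.
    channel_resp = youtube.channels().list(
        part='contentDetails', id=channel_id).execute()
    items = channel_resp.get('items', [])
    if not items:
        return []
    uploads_id = items[0]['contentDetails']['relatedPlaylists']['uploads']

    # Step 2: list the first max_videos entries of that playlist.
    playlist_resp = youtube.playlistItems().list(
        part='contentDetails', playlistId=uploads_id,
        maxResults=max_videos).execute()
    video_ids = [item['contentDetails']['videoId']
                 for item in playlist_resp.get('items', [])]
    if not video_ids:
        return []

    # Step 3: fetch snippet and statistics for all ids in one request.
    videos_resp = youtube.videos().list(
        part='snippet,statistics', id=','.join(video_ids)).execute()

    videos = []
    for item in videos_resp.get('items', []):
        stats = item.get('statistics', {})
        # The API returns counts as strings; a statistic missing from the
        # response is defaulted to 0 here, which is an assumption.
        videos.append({
            'Id': item['id'],
            'Channel Id': channel_id,
            'PublishedAt': item['snippet']['publishedAt'],
            'LikeCount': int(stats.get('likeCount', 0)),
            'DislikeCount': int(stats.get('dislikeCount', 0)),
            'ViewCount': int(stats.get('viewCount', 0)),
            'FavoriteCount': int(stats.get('favoriteCount', 0)),
            'CommentCount': int(stats.get('commentCount', 0)),
        })
    # Sort explicitly rather than relying on the playlist order.
    return sorted(videos, key=lambda v: v['PublishedAt'])
```

The dict keys mirror the column names in the dataframe, so each dict could be wrapped in a `pd.Series` before being stored in `channels`, which keeps the list-style indexing in the feature extractor working.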

In [49]:
df['Channel Id'][0]

'UCa0o4WDOq1kJAyMETrPnpQg'