# Concept
Predict how much revenue a given video idea will produce for Tiger Fitness. We also want to predict views, average percent viewed, and change in subscribers.

# Features (input 1-7, output 8-14)
1. video duration (bucketed, analytics based)
2. title polarity
3. title length (characters)
4. number of links in description
5. category of video (vlog, training, nutrition, personal story, supplements and steroids) 
6. product price range (bucketed, intuition based)
7. product type (apparel(3), supplements(2), food(1), coaching(4), nothing(0))
8. transactions
9. views
10. subscribers gained
11. average percent viewed
12. likes
13. impressions
14. impression click-through rate

# Models
1. revenue prediction (predict sales volume, then multiply by the price range to give a revenue range)
2. views prediction
3. change in subscribers prediction
4. average percent viewed prediction
5. likes prediction
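
The revenue range in model 1 is plain arithmetic once a sales volume has been predicted. A minimal sketch, assuming a hypothetical `revenue_range` helper and a price bucket expressed as `(low, high)` bounds in dollars. Neither the helper name nor the example bucket comes from the notes above:

```python
def revenue_range(predicted_transactions, price_bucket):
    """Return (low, high) revenue for a predicted number of sales.

    price_bucket is a hypothetical (low_price, high_price) tuple in dollars.
    """
    low_price, high_price = price_bucket
    if low_price > high_price:
        raise ValueError("price bucket bounds are out of order")
    return predicted_transactions * low_price, predicted_transactions * high_price


# Example: 12 predicted sales of a product priced between $20 and $40.
low, high = revenue_range(12, (20, 40))
print(f"expected revenue between ${low} and ${high}")
```

Returning both bounds keeps the uncertainty of the bucketed price visible instead of collapsing it into a single number.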

# Data sources (numbers refer to the feature list)

manual entry or unsupervised learning: 5

youtube analytics report - video analytics overview: 2, 3, 9, 10, 11, 12

youtube analytics report - video analytics engagement: 8, 13, 14

## Import statements

In [1]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from textblob import TextBlob

engine = create_engine('sqlite://', echo=False)  # in-memory SQLite engine

## Merging the two reports

In [2]:
df1 = pd.read_csv(r'C:\Users\aacjp\OneDrive\Desktop\Table0.csv').iloc[1:].set_index('Video')
df2 = pd.read_csv(r'C:\Users\aacjp\OneDrive\Desktop\Table data.csv').iloc[1:]
df3 = pd.read_csv(r'C:\Users\aacjp\OneDrive\Desktop\internal_all.csv')[['Address', 'Meta Description 1']]
df3['Address'] = df3['Address'].apply(lambda x: x.split('=')[1])

imp = []      # impressions per video
imp_crt = []  # impressions click-through rate per video
desc = []     # description lookup per video (each entry is a Series, not a string)
# df2 keeps its original labels after .iloc[1:], so they run from 1 to len(df2)
for i in range(1, len(df2['Video'])+1):
    video_id = df2['Video'][i]
    imp.append(df1.loc[video_id]['Impressions'])
    imp_crt.append(df1.loc[video_id]['Impressions click-through rate (%)'])
    desc.append(df3.loc[df3['Address'] == video_id]['Meta Description 1'])
df2['Impressions'] = imp
df2['Impressions click-through rate (%)'] = imp_crt
df2['description'] = desc
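
The loop above looks up each video one at a time and fails with a `KeyError` if a video is missing from `df1`. The same join can also be expressed with `DataFrame.merge`. A minimal sketch on toy frames: the column names follow the reports above, but the video IDs, values, and the variable names `reach`, `overview`, and `crawl` are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the three reports; the values are invented.
reach = pd.DataFrame({
    'Video': ['aaa', 'bbb'],
    'Impressions': [810, 1251],
    'Impressions click-through rate (%)': [8.52, 8.63],
})
overview = pd.DataFrame({'Video': ['aaa', 'bbb', 'ccc'], 'Views': [182, 183, 52]})
crawl = pd.DataFrame({
    'Address': ['aaa', 'bbb'],
    'Meta Description 1': ['first description', 'second description'],
})

# Left joins keep every row of the overview report; videos missing from the
# other tables get NaN instead of raising a KeyError.
merged = (
    overview
    .merge(reach, on='Video', how='left')
    .merge(crawl.rename(columns={'Address': 'Video',
                                 'Meta Description 1': 'description'}),
           on='Video', how='left')
)
print(merged)
```

With this approach `description` holds a plain string (or NaN) rather than a Series, so a later call such as `countLinks(x.iloc[0])` would become `countLinks(x)`. Duplicate video IDs in either lookup table would duplicate rows, so they should be checked first.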

In [3]:
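def countLinks(description):
    # Count space-separated tokens that start with 'http'.
    # A missing description (NaN, a float) has no .replace, so it counts as 0 links.
    try:
        wrds = description.replace('...', ' ').split(' ')
    except AttributeError:
        return 0
    return sum(1 for w in wrds if w.startswith('http'))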

def youtubeLingo(title):
    return title.lower().replace('***', 'uck').replace('prod.', 'produced by').replace('ep.', 'episode ').replace('|', 'and').replace('@', 'at ').replace('=', 'equals')

def stripPunct(string):
    return string.replace('(', '').replace(')', '').replace(':', '').replace('!', '').replace('?', '')
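
`countLinks` only sees a link when it starts a space-separated token, so a URL glued to the preceding word is missed. A possible alternative is to count URL schemes with a regular expression. This is a sketch, not the notebook's method: `count_links` and `URL_PATTERN` are hypothetical names, and the pattern is stricter than `startswith('http')` because it requires `http://` or `https://`:

```python
import re

URL_PATTERN = re.compile(r'https?://')


def count_links(description):
    """Count 'http://' or 'https://' occurrences; non-string input counts as 0."""
    if not isinstance(description, str):
        return 0
    return len(URL_PATTERN.findall(description))


print(count_links('full project here: https://example.com and http://example.org'))
print(count_links('follow mehttps://example.com'))
print(count_links(float('nan')))
```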

## Adding title polarity and length

In [4]:
df2['nlinks'] = df2['description'].apply(lambda x: countLinks(x.iloc[0]))
df2

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),description,nlinks
1,Ewdr2H2qmVY,Ux of 4 (prod. CashMoneyAP),"Oct 7, 2018",8,3,182,0:01:13,31.63,810,8.52,19 shot by @gwen.holley and @its_jeremiahpr...,0
2,ApXSnKR8lQU,Retarded genius (prod. Letzer),"Nov 4, 2018",7,1,183,0:01:04,33.02,1251,8.63,3 shot by @gwen.holleyfollow me on instagra...,1
3,Hde5iY6CG8s,I tried to move to LA!!!!!,"Oct 1, 2019",4,0,52,0:04:39,35.42,188,11.7,9 This is by far the craziest thing I've ev...,0
4,QVYOgNfa1Mk,Your first Machine Learning Model | Beginner P...,"Jun 13, 2021",2,3,82,0:02:20,8.94,62,4.84,20 Click Here for to book FREE Tutoring ses...,1
5,f-4wjPBSROo,Smart kids = hamster brain ??,"Jun 27, 2019",2,0,14,0:01:53,30.67,165,6.06,"16 Enjoy the videos and music you love, upl...",0
6,vrDEeol53V8,2 things that fucked me up!,"Jul 25, 2019",2,0,17,0:01:59,32.61,157,7.01,"18 Enjoy the videos and music you love, upl...",0
7,60UBEiBPxaY,Moving to Seattle lol,"Oct 7, 2020",1,0,14,0:01:42,18.59,133,7.52,"11 NaN Name: Meta Description 1, dtype: object",0
8,CYP2npTbbmg,reacting to my old song,"Oct 3, 2020",1,0,14,0:01:58,38.34,83,10.84,"17 NaN Name: Meta Description 1, dtype: object",0
9,RyWIwt4BlCE,AI web app demo,"Feb 9, 2021",1,0,13,0:01:06,79.67,104,9.62,14 check out the full project here: https:/...,1
10,aiuCMSwQTec,350x2 deadlift @150lbs,"Dec 24, 2019",1,0,8,0:00:35,75.85,103,3.88,"15 NaN Name: Meta Description 1, dtype: object",0


In [5]:
df2['Video title'] = df2['Video title'].apply(lambda x: stripPunct(youtubeLingo(x)))

In [6]:
tl = []
tp = []
for title in df2['Video title']:
    tl.append(len(title))
    tp.append(TextBlob(title).sentiment.polarity)
    
df2['title length'] = tl
df2['title polarity'] = tp

In [7]:
df2.head()

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),description,nlinks,title length,title polarity
1,Ewdr2H2qmVY,ux of 4 produced by cashmoneyap,"Oct 7, 2018",8,3,182,0:01:13,31.63,810,8.52,19 shot by @gwen.holley and @its_jeremiahpr...,0,31,0.0
2,ApXSnKR8lQU,retarded genius produced by letzer,"Nov 4, 2018",7,1,183,0:01:04,33.02,1251,8.63,3 shot by @gwen.holleyfollow me on instagra...,1,34,-0.8
3,Hde5iY6CG8s,i tried to move to la,"Oct 1, 2019",4,0,52,0:04:39,35.42,188,11.7,9 This is by far the craziest thing I've ev...,0,21,0.0
4,QVYOgNfa1Mk,your first machine learning model and beginner...,"Jun 13, 2021",2,3,82,0:02:20,8.94,62,4.84,20 Click Here for to book FREE Tutoring ses...,1,79,0.25
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,0:01:53,30.67,165,6.06,"16 Enjoy the videos and music you love, upl...",0,32,0.214286


In [8]:
def toSeconds(duration):
    timesplits = duration.split(':')
    return int(timesplits[0])*3600 + int(timesplits[1])*60 + int(timesplits[2])

In [9]:
df2['Average view duration'] = df2['Average view duration'].apply(toSeconds)  # convert H:MM:SS strings to whole seconds
df2.head()

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),description,nlinks,title length,title polarity
1,Ewdr2H2qmVY,ux of 4 produced by cashmoneyap,"Oct 7, 2018",8,3,182,73,31.63,810,8.52,19 shot by @gwen.holley and @its_jeremiahpr...,0,31,0.0
2,ApXSnKR8lQU,retarded genius produced by letzer,"Nov 4, 2018",7,1,183,64,33.02,1251,8.63,3 shot by @gwen.holleyfollow me on instagra...,1,34,-0.8
3,Hde5iY6CG8s,i tried to move to la,"Oct 1, 2019",4,0,52,279,35.42,188,11.7,9 This is by far the craziest thing I've ev...,0,21,0.0
4,QVYOgNfa1Mk,your first machine learning model and beginner...,"Jun 13, 2021",2,3,82,140,8.94,62,4.84,20 Click Here for to book FREE Tutoring ses...,1,79,0.25
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,113,30.67,165,6.06,"16 Enjoy the videos and music you love, upl...",0,32,0.214286


## Calculating total duration

In [10]:
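def getVideoLength(avd, apv):
    # Total length (s) = average view duration (s) / (average percentage viewed / 100).
    # Both Series are labelled 1..n here, so the loop indexes by label, not position.
    vd = []
    for i in range(1, len(avd)+1):
        vd.append(int((avd[i]/apv[i])*100))
    return vd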

df2['Video duration'] = getVideoLength(df2['Average view duration'], df2['Average percentage viewed (%)'])
df2.head()

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),description,nlinks,title length,title polarity,Video duration
1,Ewdr2H2qmVY,ux of 4 produced by cashmoneyap,"Oct 7, 2018",8,3,182,73,31.63,810,8.52,19 shot by @gwen.holley and @its_jeremiahpr...,0,31,0.0,230
2,ApXSnKR8lQU,retarded genius produced by letzer,"Nov 4, 2018",7,1,183,64,33.02,1251,8.63,3 shot by @gwen.holleyfollow me on instagra...,1,34,-0.8,193
3,Hde5iY6CG8s,i tried to move to la,"Oct 1, 2019",4,0,52,279,35.42,188,11.7,9 This is by far the craziest thing I've ev...,0,21,0.0,787
4,QVYOgNfa1Mk,your first machine learning model and beginner...,"Jun 13, 2021",2,3,82,140,8.94,62,4.84,20 Click Here for to book FREE Tutoring ses...,1,79,0.25,1565
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,113,30.67,165,6.06,"16 Enjoy the videos and music you love, upl...",0,32,0.214286,368


## Bucketing total duration

In [11]:
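def getPercentile(n, sample):
    # Percentage of values in sample that are less than or equal to n.
    # Counting does not depend on order, so the sample is not sorted.
    sample = list(sample)
    n_lesser = 0
    n_greater = 0
    for s in sample:
        if s > n:
            n_greater += 1
        else:
            n_lesser += 1
    return (n_lesser / (n_lesser+n_greater))*100

def getQuartile(percentile):
    # Maps 0-100 to 0-4; only a percentile of exactly 100 (the largest value) lands in bucket 4.
    return int(percentile/25)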

In [12]:
df2['duration bucket']= df2['Video duration'].apply(lambda x: getQuartile(getPercentile(x, list(df2['Video duration']))))
df2.head()

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),description,nlinks,title length,title polarity,Video duration,duration bucket
1,Ewdr2H2qmVY,ux of 4 produced by cashmoneyap,"Oct 7, 2018",8,3,182,73,31.63,810,8.52,19 shot by @gwen.holley and @its_jeremiahpr...,0,31,0.0,230,0
2,ApXSnKR8lQU,retarded genius produced by letzer,"Nov 4, 2018",7,1,183,64,33.02,1251,8.63,3 shot by @gwen.holleyfollow me on instagra...,1,34,-0.8,193,0
3,Hde5iY6CG8s,i tried to move to la,"Oct 1, 2019",4,0,52,279,35.42,188,11.7,9 This is by far the craziest thing I've ev...,0,21,0.0,787,3
4,QVYOgNfa1Mk,your first machine learning model and beginner...,"Jun 13, 2021",2,3,82,140,8.94,62,4.84,20 Click Here for to book FREE Tutoring ses...,1,79,0.25,1565,3
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,113,30.67,165,6.06,"16 Enjoy the videos and music you love, upl...",0,32,0.214286,368,1
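
Because `getPercentile` counts values less than or equal to `n`, the longest video always scores 100 and `getQuartile` returns 4, so the column can hold five buckets rather than four. If exactly four buckets are wanted, one option is to rank by strictly smaller values. A minimal sketch, where `quartile_bucket` is a hypothetical name and the sample is only the five durations shown above, not the full dataset:

```python
def quartile_bucket(value, sample):
    """Return a bucket in 0..3 based on the share of sample values below value."""
    sample = list(sample)
    n_below = sum(1 for s in sample if s < value)
    # n_below is at most len(sample) - 1, so the result never reaches 4.
    return (4 * n_below) // len(sample)


durations = [230, 193, 787, 1565, 368]  # 'Video duration' of the five rows shown above
print([quartile_bucket(d, durations) for d in durations])
```

On this small sample the buckets differ from the notebook's output, which is expected: the notebook ranks each video against all of its rows and uses a less-than-or-equal count.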


In [13]:
df2['transactions'] = [0]*len(df2) # all zeroes for now because I haven't sold anything on my channel yet
df2['revenue'] = [0]*len(df2) # revenue refers to sales (the dollar value of transactions driven by a video)
df2['ad revenue'] = [0]*len(df2) # if your channel is monetized, this is the money you receive directly from YouTube
df2['total revenue'] = [0]*len(df2)
df2.head()

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),description,nlinks,title length,title polarity,Video duration,duration bucket,transactions,revenue,ad revenue,total revenue
1,Ewdr2H2qmVY,ux of 4 produced by cashmoneyap,"Oct 7, 2018",8,3,182,73,31.63,810,8.52,19 shot by @gwen.holley and @its_jeremiahpr...,0,31,0.0,230,0,0,0,0,0
2,ApXSnKR8lQU,retarded genius produced by letzer,"Nov 4, 2018",7,1,183,64,33.02,1251,8.63,3 shot by @gwen.holleyfollow me on instagra...,1,34,-0.8,193,0,0,0,0,0
3,Hde5iY6CG8s,i tried to move to la,"Oct 1, 2019",4,0,52,279,35.42,188,11.7,9 This is by far the craziest thing I've ev...,0,21,0.0,787,3,0,0,0,0
4,QVYOgNfa1Mk,your first machine learning model and beginner...,"Jun 13, 2021",2,3,82,140,8.94,62,4.84,20 Click Here for to book FREE Tutoring ses...,1,79,0.25,1565,3,0,0,0,0
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,113,30.67,165,6.06,"16 Enjoy the videos and music you love, upl...",0,32,0.214286,368,1,0,0,0,0


In [14]:
df2['Video title'][2]

'retarded genius produced by letzer'

In [15]:
#','.join(df['Text']).split(' ')
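# Unique words across all titles. The [1:] slice drops whichever element the set
# yields first (apparently meant to remove the empty string); set order is not
# guaranteed, so this is fragile.
vocab = list(set(','.join(df2['Video title']).replace(',', ' ').split(' ')))[1:]
embeddings = {}  # maps each word to an integer ID (a lookup table, not learned embeddings)
for i in range(len(vocab)):
    embeddings[vocab[i]] = i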
    
def tokenize(string, embeddings):
    mappings = ([])
    for word in string.split(' '):
        if word != '':
            mappings.append(embeddings[word])
    return mappings

def tokenizeDF(col, embeddings):
    arrays = ([])
    for string in col:
        arrays.append(tokenize(string, embeddings))
    return arrays
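
In the cell above, word IDs depend on set iteration order, and ID 0 is given to a real word even though 0 is also the padding value used later. A sketch of a deterministic variant that reserves 0 for padding; `PAD_ID`, `build_vocab`, and `tokenize_title` are hypothetical names and the titles are a small sample, not the full column:

```python
PAD_ID = 0  # reserved for padding, never assigned to a word


def build_vocab(titles):
    """Map each distinct word to an integer ID starting at 1, in sorted order."""
    words = sorted({word for title in titles for word in title.split()})
    return {word: idx for idx, word in enumerate(words, start=1)}


def tokenize_title(title, vocab):
    """Convert a title into a list of word IDs."""
    return [vocab[word] for word in title.split()]


titles = ['ai web app demo', 'moving to seattle lol', 'ai demo']
vocab = build_vocab(titles)
print(vocab)
print([tokenize_title(t, vocab) for t in titles])
```

Sorting makes the IDs reproducible between runs, and `str.split()` with no argument never yields empty strings, so no slice is needed to drop them.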

In [16]:
def getPad(tokenizedTexts):
    # Target sequence length: the mean number of tokens per title, rounded.
    dim = 0
    for lst in tokenizedTexts:
        dim += len(lst)
    return int(round(dim/len(tokenizedTexts)))

def padIt(pad, tokenizedTexts):
    arrays = []
    for t in tokenizedTexts:
        if len(t[:pad]) != pad:
            padding = [0] * (pad-len(t[:pad]))
            arrays.append(t[:pad]+padding)
        else:
            arrays.append(t[:pad])
    return np.array(arrays)

In [17]:
tt = tokenizeDF(df2['Video title'], embeddings)
tt

[[64, 21, 57, 35, 9, 39],
 [71, 62, 35, 9, 17],
 [49, 43, 82, 72, 82, 70],
 [46, 79, 29, 81, 77, 58, 18, 12, 83, 80, 33, 25],
 [36, 84, 75, 69, 5],
 [38, 11, 87, 23, 13, 31],
 [4, 82, 34, 14],
 [63, 82, 24, 56, 19],
 [78, 59, 37, 42],
 [26, 8, 65, 68],
 [52, 49, 0, 16, 61, 22],
 [27, 50, 24, 85],
 [28, 67, 80],
 [15, 30, 74, 86, 20],
 [1, 41, 51, 80],
 [53, 78, 2, 46, 45],
 [55, 32, 66],
 [7, 76, 54, 60],
 [40, 48, 41, 47],
 [15, 30, 74, 10, 73, 38],
 [6, 3, 25],
 [57, 88, 21, 80, 44]]

In [18]:
pad = getPad(tt)
pad

5

In [19]:
padded = padIt(pad, tt)
padded

array([[64, 21, 57, 35,  9],
       [71, 62, 35,  9, 17],
       [49, 43, 82, 72, 82],
       [46, 79, 29, 81, 77],
       [36, 84, 75, 69,  5],
       [38, 11, 87, 23, 13],
       [ 4, 82, 34, 14,  0],
       [63, 82, 24, 56, 19],
       [78, 59, 37, 42,  0],
       [26,  8, 65, 68,  0],
       [52, 49,  0, 16, 61],
       [27, 50, 24, 85,  0],
       [28, 67, 80,  0,  0],
       [15, 30, 74, 86, 20],
       [ 1, 41, 51, 80,  0],
       [53, 78,  2, 46, 45],
       [55, 32, 66,  0,  0],
       [ 7, 76, 54, 60,  0],
       [40, 48, 41, 47,  0],
       [15, 30, 74, 10, 73],
       [ 6,  3, 25,  0,  0],
       [57, 88, 21, 80, 44]])

In [20]:
padded.shape

(22, 5)

In [25]:
from sklearn.cluster import KMeans

In [23]:
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

In [27]:
unsupervised = KMeans(n_clusters=4).fit(padded)
unsupervised

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [28]:
unsupervised.cluster_centers_

array([[21.71428571, 48.14285714, 54.85714286, 70.71428571,  3.57142857],
       [43.        , 42.125     , 52.625     , 15.375     ,  4.875     ],
       [54.2       , 75.2       , 15.2       , 55.8       , 49.2       ],
       [32.        , 36.5       , 78.        , 41.        , 77.5       ]])

In [51]:
unsupervised.inertia_

35842.7857142857

In [59]:
sum(np.min(cdist(padded, unsupervised.cluster_centers_, 'euclidean'), axis=1)) / padded.shape[0]

38.944741045507
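
Inertia and the mean distance to the nearest centroid, computed above for `n_clusters=4`, are the quantities usually compared across several values of `k` when choosing a cluster count (the elbow heuristic). A minimal sketch, assuming random toy data in place of `padded`, an arbitrary candidate range of 1 to 6, and a fixed `random_state` so repeated runs agree:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for `padded`: 22 rows of 5 integer token IDs (made-up values).
X = rng.integers(0, 89, size=(22, 5)).astype(float)

ks = list(range(1, 7))  # candidate cluster counts; the range is an arbitrary choice
inertias = []
distortions = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)
    # Mean Euclidean distance from each row to its nearest cluster center.
    dists = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
    distortions.append(dists.min(axis=1).mean())

for k, inertia, distortion in zip(ks, inertias, distortions):
    print(f"k={k}  inertia={inertia:.1f}  mean distance={distortion:.2f}")
```

Plotting either list against `ks` and looking for the point where the curve flattens gives a rough guide to a reasonable `k`. With only 22 titles the curve may not show a clear bend.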

In [66]:
df2['category'] = unsupervised.predict(padded)

In [67]:
df2

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),...,nlinks,title length,title polarity,Video duration,duration bucket,transactions,revenue,ad revenue,total revenue,category
1,Ewdr2H2qmVY,ux of 4 produced by cashmoneyap,"Oct 7, 2018",8,3,182,73,31.63,810,8.52,...,0,31,0.0,230,0,0,0,0,0,1
2,ApXSnKR8lQU,retarded genius produced by letzer,"Nov 4, 2018",7,1,183,64,33.02,1251,8.63,...,1,34,-0.8,193,0,0,0,0,0,1
3,Hde5iY6CG8s,i tried to move to la,"Oct 1, 2019",4,0,52,279,35.42,188,11.7,...,0,21,0.0,787,3,0,0,0,0,3
4,QVYOgNfa1Mk,your first machine learning model and beginner...,"Jun 13, 2021",2,3,82,140,8.94,62,4.84,...,1,79,0.25,1565,3,0,0,0,0,2
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,113,30.67,165,6.06,...,0,32,0.214286,368,1,0,0,0,0,0
6,vrDEeol53V8,2 things that fucked me up,"Jul 25, 2019",2,0,17,119,32.61,157,7.01,...,0,26,-0.6,364,1,0,0,0,0,1
7,60UBEiBPxaY,moving to seattle lol,"Oct 7, 2020",1,0,14,102,18.59,133,7.52,...,0,21,0.8,548,2,0,0,0,0,1
8,CYP2npTbbmg,reacting to my old song,"Oct 3, 2020",1,0,14,118,38.34,83,10.84,...,0,23,0.1,307,0,0,0,0,0,2
9,RyWIwt4BlCE,ai web app demo,"Feb 9, 2021",1,0,13,66,79.67,104,9.62,...,1,15,0.0,82,0,0,0,0,0,1
10,aiuCMSwQTec,350x2 deadlift at 150lbs,"Dec 24, 2019",1,0,8,35,75.85,103,3.88,...,0,24,0.0,46,0,0,0,0,0,0


In [74]:
df2.loc[df2['category'] == 0]

Unnamed: 0,Video,Video title,Video publish time,Likes,Subscribers gained,Views,Average view duration,Average percentage viewed (%),Impressions,Impressions click-through rate (%),...,nlinks,title length,title polarity,Video duration,duration bucket,transactions,revenue,ad revenue,total revenue,category
5,f-4wjPBSROo,smart kids equals hamster brain,"Jun 27, 2019",2,0,14,113,30.67,165,6.06,...,0,32,0.214286,368,1,0,0,0,0,0
10,aiuCMSwQTec,350x2 deadlift at 150lbs,"Dec 24, 2019",1,0,8,35,75.85,103,3.88,...,0,24,0.0,46,0,0,0,0,0,0
12,3eJXtEzFQqs,twitter is my friend,"Dec 11, 2020",0,0,10,231,41.94,68,8.82,...,0,21,0.0,550,2,0,0,0,0,0
14,NMGihswkjhY,feed forward neural networks part1,"Feb 7, 2020",0,0,10,83,25.11,178,3.93,...,2,34,0.0,330,1,0,0,0,0,0
15,YOFTAifYo_w,working with movie data,"Sep 11, 2020",0,0,4,38,10.96,317,1.26,...,0,23,0.0,346,1,0,0,0,0,0
18,ge_R6_Q7Pxg,ravage life episode 1,"Sep 19, 2020",0,0,5,973,22.32,30,10.0,...,0,21,0.0,4359,4,0,0,0,0,0
19,jHFr8cvxeAo,linear regression with sklearn,"Feb 6, 2020",0,0,14,82,14.55,122,2.46,...,1,30,0.0,563,2,0,0,0,0,0
