We extract video frames from selected animations and extract
the line art images to form our training dataset. We calculate a
**768-dimensional feature vector of histograms of R, G, B channels
for each frame**. The difference between frames is determined by
calculating the **mean square error** of the feature vectors, which
is used for splitting the source animations into shots. When the
difference between the neighboring frames is **greater than 200**, it
is considered to belong to different shots. In order to improve the
quality of the data, we **remove shots in which the mean square
errors between all pairs of frames are less than 10** (as they are
too uniform), and **the shot with a length less than 8 frames**.
Then we **filter out video frames that are too dark or too faded
in color**. Finally we get a total of 
1096 video sequences from 6 animations, with a total of 29,834 images. 
Each video sequence
has 27 frames on average. 

In [76]:
import cv2
import matplotlib.pyplot as plt
import numpy as np
import glob
import json
import os
import shutil
import pandas

Let's create the frames.

In [88]:
def VidToFrames (vidpath, folderName):
    """Write every frame of the video at vidpath to folderName as 0.png, 1.png, ..."""
    vidcap = cv2.VideoCapture(vidpath)
    success,image = vidcap.read()
    print(success)
    count = 0
    fps = int(vidcap.get(cv2.CAP_PROP_FPS))  # frames per second, not the number of frames
    width  = int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    print(fps, width, height)
    while success:
        cv2.imwrite("{foldername}/{frameNum}.png".format(foldername= folderName, frameNum=str(count)), image)
        success,image = vidcap.read()
        count += 1
    vidcap.release()
    print(count, "frames in file")

for i in glob.glob("../mp4/*.mp4"):
    destination = os.path.splitext(i)[0]  # "../mp4/name.mp4" -> "../mp4/name"
    os.makedirs(destination, exist_ok=True)  # do not fail if the folder already exists
    VidToFrames(i, destination)

True
29 640 360
11453 frames in file
True
25 640 360
27641 frames in file
True
29 640 360
16502 frames in file
True
29 640 360
9485 frames in file
True
24 640 360
11420 frames in file


Now that we have a folder of frames for each channelId's video, we can start the analysis. There are two parts to this:
1. Color analysis
2. Motion analysis

### Part I  <br/>
***Color Analysis*** <br/>
Is there a correlation between color and subcount?

In [78]:
def getBrightnessOfImage(imagePath, plot=False):
    """Return the sum of all B, G and R pixel values, computed from the channel histograms."""
    image = cv2.imread(imagePath)
    levels = np.arange(256)
    vectors = []
    for i, col in enumerate(['b', 'g', 'r']):
        hist = cv2.calcHist([image], [i], None, [256], [0, 256])
        # Weight each intensity level by the number of pixels at that level.
        vectors.append(np.dot(levels, hist.ravel()))
        if plot:
            plt.plot(hist, color = col)
            plt.xlim([0, 256])
    if plot:
        plt.show()
    return np.sum(vectors)


def removeDimAndFadedImages(folderName):
    """Delete frames whose total brightness is too low (dark) or too high (faded)."""
    sortedFrameList = sorted(glob.glob(glob.escape(folderName)+'/*.png'), key=os.path.getmtime)
    for debugCount, imagePath in enumerate(sortedFrameList):
        brightness = getBrightnessOfImage(imagePath)  # no plot here: one figure per frame is too slow
        if brightness < 9.4e7 or brightness > 4.5e8: # earlier attempt: brightness < 6e7 or brightness > 7e8
            print(imagePath, ": ", brightness)
            os.remove(imagePath)
        if debugCount % 100 == 0:  # progress indicator
            print (debugCount)

An image histogram gives a graphical representation of the distribution of pixel intensities in a digital image.
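
To make the link between the histogram and the brightness score above concrete, here is a minimal sketch. It assumes a synthetic random frame and uses `np.bincount` instead of `cv2.calcHist` so that it runs with NumPy alone; `brightness_from_histogram` is a made-up helper name, not something from this notebook. It checks that summing "bin index times bin count" over the three channels is the same as summing every pixel value.

```python
import numpy as np

def brightness_from_histogram(image):
    """Sum of (intensity * pixel count) over the 256 bins of every channel."""
    levels = np.arange(256)
    total = 0
    for channel in range(image.shape[2]):
        # 256-bin histogram of one channel (one bin per intensity level).
        hist = np.bincount(image[:, :, channel].ravel(), minlength=256)
        total += int(np.dot(levels, hist))
    return total

# Synthetic 360x640 three-channel frame (assumption: same size as the videos above).
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(360, 640, 3), dtype=np.uint8)

# The histogram-based score equals the plain sum of all pixel values.
print(brightness_from_histogram(frame) == int(frame.sum(dtype=np.int64)))
```

For a 640x360 frame the largest possible score is 640 * 360 * 3 * 255 = 176,256,000 (about 1.76e8), so absolute thresholds on this score depend on the frame size.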

In [73]:
def binAndPlot(some_list):
    """Plot a histogram of the values in some_list (matplotlib's default of 10 bins)."""
    # The per-value Counter / Series that used to be built here was never plotted
    # and needed imports (Counter, pd) that only appear in a later cell.
    plt.hist(some_list, edgecolor="yellow", color="green")
    plt.show()

### Part II  <br/>
***Motion Analysis*** <br/>

Is there a correlation between the amount of animation and subcount?

In my head, I should technically analyze how much each frame differs from the next.
However, color is a value that is assigned to each pixel in a frame, so a raw frame-to-frame difference would pick up color changes as well as motion.

So we are going to get rid of color. We're not getting rid of it completely, though: any changes due to color will still be detected, because we are using an eXtended difference-of-Gaussians (XDoG) to extract the lines. So we're basically changing the data to put less weight on color, because it has already been accounted for in Part I.

There was a time when I was interested in machine learning techniques for colorization. Honestly, I still am, just less than in this project.

There was a paper called [Deep Line Art Video Colorization with a Few References](https://arxiv.org/abs/2003.10685) that I spent some time generating data for. It was only after I generated all the data that I realized I didn't have the resources for it (GPU, money and, honestly, motivation, which was the biggest factor). This is how the paper describes its data collection:

> We extract video frames from selected animations and extract
the line art images to form our training dataset. We calculate a
768-dimensional feature vector of histograms of R, G, B channels
for each frame. The difference between frames is determined by
calculating the mean square error of the feature vectors, which
is used for splitting the source animations into shots. When the
difference between the neighboring frames is greater than 200, it
is considered to belong to different shots. In order to improve the
quality of the data, we remove shots in which the mean square
errors between all pairs of frames are less than 10 (as they are
too uniform), and the shot with a length less than 8 frames.
Then we filter out video frames that are too dark or too faded
in color. Finally we get a total of 1096 video sequences from 6
animations, with a total of 29,834 images. Each video sequence
has 27 frames on average. 

In my case, I got around 28k images with 32 frames per scene on average.

I'm going to use the data collection method stated above, with a slight variation.
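
Here is a minimal sketch of the shot-splitting step from the quote, under a few assumptions: histograms are raw (unnormalized) counts built with `np.bincount` rather than `cv2.calcHist`, the frames are tiny synthetic images, and all function names are made up for illustration. The quote does not say how the histograms are scaled, so the threshold of 200 may not transfer to other frame sizes. The too-uniform and too-dark filters are left out.

```python
import numpy as np

def histogram_feature(frame):
    """768-dimensional vector: 256-bin histograms of the three color channels."""
    return np.concatenate([
        np.bincount(frame[:, :, c].ravel(), minlength=256) for c in range(3)
    ]).astype(np.float64)

def mse(a, b):
    """Mean square error between two feature vectors."""
    return float(np.mean((a - b) ** 2))

def split_into_shots(features, cut_threshold=200.0, min_length=8):
    """Return (start, end) frame index pairs, end inclusive, one per kept shot."""
    shots = []
    start = 0
    for i in range(1, len(features) + 1):
        # A shot ends at the last frame, or where neighboring frames differ a lot.
        is_last = i == len(features)
        if is_last or mse(features[i - 1], features[i]) > cut_threshold:
            if i - start >= min_length:  # drop shots shorter than min_length
                shots.append((start, i - 1))
            start = i
    return shots

# Synthetic clip: 10 dark frames, 3 mid-gray frames, 12 bright frames (32x32 each).
values = [10] * 10 + [100] * 3 + [200] * 12
frames = [np.full((32, 32, 3), v, dtype=np.uint8) for v in values]
features = [histogram_feature(f) for f in frames]

shots = split_into_shots(features)
print(shots)  # the 3-frame shot in the middle is too short and is dropped
```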


In [80]:
with open('../channelId_to_mostViewedVidId.json', 'r') as j:
    channelId_to_mostViewedVidId = json.loads(j.read())

LIST_OF_CHANNEL_IDS= channelId_to_mostViewedVidId.keys()
print(len(LIST_OF_CHANNEL_IDS))

61


In [None]:
with open('../masterSheet.json', 'r') as j:
    masterSheet = json.loads(j.read())

channelId_to_mostViewedVidInfo = {}

for channelId in LIST_OF_CHANNEL_IDS:
    mostViewedVidId = channelId_to_mostViewedVidId[channelId]
    for vidInfo in masterSheet[channelId]:
        if vidInfo['vidId'] == mostViewedVidId:
            channelId_to_mostViewedVidInfo[channelId] = vidInfo
channelId_to_mostViewedVidInfo
# print(len(channelId_to_mostViewedVidId))

In [159]:
import re
df = pandas.DataFrame.from_dict(channelId_to_mostViewedVidInfo).T
# Videos excluded by hand because they do not fit the criteria.
not_candidate_vidIds= ['kQEtRoyFfI8', 'hpQQohcHk9Q', '6wS_uON5s6Q', 'RlU32AfEVeU', 'iVqhzEaJhDw', 'Xnv7JGqjaAo', '18msRdBF11A', 'Ln4AnsWNUQI', 'Hkz0NcKPzMs']

# Durations are ISO 8601 strings such as "PT4M13S" or "PT45S":
# https://en.wikipedia.org/wiki/ISO_8601#Durations
for channelId, row in df.iterrows():
    x= row['duration']
    m=0
    s=0
    try:
        m, s = re.findall(r'PT(\d+)M(\d+)S',x)[0]
    except IndexError:  # no minutes component, e.g. "PT45S"
        s = re.findall(r'PT(\d+)S',x)[0]
    if int(m)*60+int(s) < 180:  # drop videos shorter than 3 minutes
        not_candidate_vidIds.append(row['vidId'])
not_candidate_vidIds = set(not_candidate_vidIds)
candidateVideos = list(set(df.vidId)- not_candidate_vidIds)
df = df[df.vidId.isin(candidateVideos)]
print(len(df))

{'KwkhpOJ6nQ4', 'I9uWUw1fxOY', 'uDvPIKM-L1o', '5pMckBGWzAY', 'cZ_CnLE6SPo', 'Hkz0NcKPzMs', 'eNGgfPs0Xp8', '6wS_uON5s6Q', 'kQEtRoyFfI8', 'ewsGmhAjjjI', 'Ln4AnsWNUQI', 'Xnv7JGqjaAo', 'Ox49X6Andl8', 'IwxWmKsVR5U', 'iRBmUQQzpWQ', 'F8A-tXp09fs', 'oKLbOxLJfRg', '7FnQrNFyWy8', 'gA0bi-bFEYs', 'xa-4IAR_9Yw', 'P2EjH7l_N70', 'vv2vPAzj8S4', 'RlU32AfEVeU', 'WXNmSruTWIA', 'vuk2NZ0YKAE', '_uk_6vfqwTA', 'W8P5ewPk9fM', '7WSo1Uw-p_g', '12Ne9n40tmw', '2E2El1kdooM', 'EcgkRp2IUsc', 'O0hyjRF6quc', 'BErOLQBZ6c8', 'de8PRd_d7kg', 'so1_5hYUEE8', '0Vxp_Lj2b-E', 'rnQlkpOFgm8', 'uqJKryP1-8M', '2bGkEK8I6zQ', 'OyDLuom4KGs', 'Y7lYeRqhQ9Q', '18msRdBF11A', '2juKkLxdQo0', '0TlV3w1YGqk', 'o0zjRGRYEhk', 'hpQQohcHk9Q', 'PgajWuZA408', 'Vm6Yu2N-ePI', '2yFCyPX3kT0', 'EZCfJqr7Eao', 'nHgRnjqLmtM', 'plSyrHqUh78', 'iVqhzEaJhDw', 'LdNi3PpGtl8', 'n4CAhXpyVCI', 'Mv8OkBjySGQ', 'kbCah6yhYRs', 'A6V1QujNz8s', 'iEW-d02l9ew', '9oUpImsyf4Y', 'rlSXaDq3uOk'}
{'Ln4AnsWNUQI', 'KwkhpOJ6nQ4', 'I9uWUw1fxOY', 'xa-4IAR_9Yw', 'EZCfJqr7Eao', 'iVqhzEa
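
The two patterns in the duration cell cover "minutes and seconds" and "seconds only". A duration with an hours part, or with minutes but no seconds (for example `PT5M`), would match neither. Below is a hedged sketch of a more forgiving parser. It assumes only the time part of an ISO 8601 duration shows up (no days or weeks), and `duration_to_seconds` is a hypothetical helper name.

```python
import re

# Optional hours, minutes and seconds after the "PT" prefix.
_DURATION_RE = re.compile(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?')

def duration_to_seconds(duration):
    """Convert a time-only ISO 8601 duration such as "PT1H2M3S" to seconds."""
    match = _DURATION_RE.fullmatch(duration)
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: %r" % duration)
    # Missing components come back as None and count as zero.
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

for text in ["PT4M13S", "PT45S", "PT5M", "PT1H2M3S"]:
    print(text, duration_to_seconds(text))
```

With a helper like this, the filter in the loop would reduce to `duration_to_seconds(row['duration']) < 180`.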

In [101]:
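def ColorToLineart (framepath):
    """Convert every frame in framepath to line art, saved under "[lineart]<folder name>/"."""
    lineartpath = "[lineart]"+ framepath.split('/')[-1]+"/"
    os.makedirs(lineartpath, exist_ok=True)
    print(lineartpath)
    imgList = glob.glob('%s/*'% glob.escape(framepath))
    count=0
    for i in imgList:
        try:
            xdog2(i, lineartpath)
            count+=1
        except Exception as e:
            # Report the frame that failed instead of only the output folder.
            print("failed:", i, e)
    print(count, "out of", len(imgList), " files converted to lineart")
    
for framepath in glob.glob("frames/*"):
    print(framepath)
    ColorToLineart(framepath)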


frames/Private School-IwxWmKsVR5U
[lineart]Private School-IwxWmKsVR5U/
11453 out of 11453  files converted to lineart
frames/Embarrassing Water Park Story (Ft. Emirichu)-P2EjH7l_N70
[lineart]Embarrassing Water Park Story (Ft. Emirichu)-P2EjH7l_N70/
16502 out of 16502  files converted to lineart
frames/Life is Fun - TheOdd1sOut - Animation Breakdown-Xnv7JGqjaAo
[lineart]Life is Fun - TheOdd1sOut - Animation Breakdown-Xnv7JGqjaAo/
9485 out of 9485  files converted to lineart
frames/How I Met My 'Boyfriend' (ft. Sultan Sketches)-vv2vPAzj8S4
[lineart]How I Met My 'Boyfriend' (ft. Sultan Sketches)-vv2vPAzj8S4/
4117 out of 4117  files converted to lineart
frames/Can They Survive My Hero Academia-2bGkEK8I6zQ
[lineart]Can They Survive My Hero Academia-2bGkEK8I6zQ/
27641 out of 27641  files converted to lineart


In [83]:
#XDoG
import os
import numpy as np
from scipy.ndimage import gaussian_filter
import cv2
import glob

def xdog(original, count, lineartPath, epsilon=0.5, phi=10, k=1.4, tau=1, sigma=0.5):
    """eXtended difference-of-Gaussians line extraction, saved as <count>.png."""
    image = cv2.imread(original, cv2.IMREAD_GRAYSCALE)
    image = gaussian_filter(image, 0.7)
    gauss1 = gaussian_filter(image, sigma)
    gauss2 = gaussian_filter(image, sigma*k)

    # The image is uint8, so with the default integer tau this subtraction
    # wraps around wherever gauss2 > gauss1.
    D = gauss1 - tau*gauss2

    U = np.abs(1 - D/255)
    # Soft threshold: 1 at or above epsilon, a tanh ramp below it.
    U = np.where(U >= epsilon, 1.0, 1 + np.tanh(phi*(U - epsilon)))

    lineart = U*255
    success = cv2.imwrite(lineartPath+"/%d.png" % count, lineart)

def dodgeV2(x, y):
    # Color-dodge blend: brighten x by dividing it by the inverse of y.
    return cv2.divide(x, 255 - y, scale=256)

def xdog2 (original, lineartPath):
    """Sketch-style line art: dodge-blend the grayscale frame with a blurred, inverted copy."""
    img = cv2.imread(original)
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_invert = cv2.bitwise_not(img_gray)
    img_smoothing = cv2.GaussianBlur(img_invert, (21, 21),sigmaX=0, sigmaY=0)
    final_img = dodgeV2(img_gray, img_smoothing)

    # Save under the same file name as the source frame (PNG, not JPEG).
    success = cv2.imwrite(lineartPath+original.split('/')[-1], final_img)


Get lineart for all frames

A couple of videos do not fit the criteria (by vidId):

- kQEtRoyFfI8: gameplay
- hpQQohcHk9Q: too short, not commentary
- 6wS_uON5s6Q: speed drawing
- RlU32AfEVeU: footage apparently from another YouTuber
- iVqhzEaJhDw: too short
- Xnv7JGqjaAo: speed drawing
- iRBmUQQzpWQ: simply not animation
- 18msRdBF11A: not animation
- Ln4AnsWNUQI: not animation
- Hkz0NcKPzMs: too short



The plan is to find the distribution of DIFF. It'll vary based on the artist.
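
One way to look at that distribution, sketched here with made-up numbers: summarize the frame-to-frame DIFF values of one video with a few order statistics, plus the share of frame pairs that fall under a duplicate cut-off. `summarize_diffs` and its `duplicate_threshold` parameter are hypothetical names. The default of 2 mirrors the `diff >= 2` check used in the next cell, and would probably need tuning per artist.

```python
import numpy as np

def summarize_diffs(diffs, duplicate_threshold=2.0):
    """Summary statistics for a list of frame-to-frame DIFF values."""
    diffs = np.asarray(diffs, dtype=np.float64)
    return {
        "min": float(diffs.min()),
        "median": float(np.median(diffs)),
        "p90": float(np.percentile(diffs, 90)),
        "max": float(diffs.max()),
        # Fraction of neighboring frame pairs that would count as duplicates.
        "share_below_threshold": float(np.mean(diffs < duplicate_threshold)),
    }

# Made-up DIFF values for one short clip (not real data from the videos).
example_diffs = [0.0, 0.0, 1.0, 3.0, 10.0]
summary = summarize_diffs(example_diffs)
for name, value in summary.items():
    print(name, value)
```

Comparing these summaries across videos gives a quick feel for how much the DIFF scale changes from one artist to the next before committing to a single threshold.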

In [74]:
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import glob
from IPython.display import Image, display

def getFeatureVectorOfImage_color (imagePath):
    image = cv2.imread(imagePath)
    vectors = []
    for i, col in enumerate(['b', 'g', 'r']):
        hist = cv2.calcHist([image], [i], None, [256], [0, 256])
        vectors.append(hist)
        # plt.plot(hist, color = col)
        # plt.xlim([0, 256])
    # plt.show()
    vectors = np.vstack(vectors)
    return vectors

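def getFeatureVectorOfImage (imagePath):
    """256-bin histogram of channel 0 (enough for grayscale line art), shape (256, 1)."""
    image = cv2.imread(imagePath)
    hist = cv2.calcHist([image], [0], None, [256], [0, 256])
    return np.vstack(hist)

def MSE_images (vA, vB):
    # Sum of squared differences; removeLowDiff_andDemo divides by the pixel count to normalize.
    return np.sum(np.square(np.subtract(vA, vB)))

def DIFF_images(vA, vB):
    # Mean absolute difference between two feature vectors.
    return (np.abs(vA - vB)).mean()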

def removeLowDiff_andDemo(folderName):
    print(folderName)
    lenOfFrameList = len(glob.glob(glob.escape(folderName)+'/*.png')) # only the count is used: sorting the glob result would be lexicographic (there is no built-in sort-by-int), so frames are indexed by number below
    filesMoved = 0 
    validFrameList = []

    lowRangeDone= False
    midRangeDone= False
    highRangeDone= False

    listOfMSE = []
    listOfDIFF = []

    imageShape = cv2.imread('{}/{}.png'.format(folderName, 0)).shape
    h,w,c = imageShape
    for i in range(lenOfFrameList-1):
        firstFrame = '{}/{}.png'.format(folderName, i)
        secondFrame = '{}/{}.png'.format(folderName, i+1)
        f1_v = getFeatureVectorOfImage(firstFrame)
        f2_v = getFeatureVectorOfImage(secondFrame)
        diff = DIFF_images(f1_v, f2_v)
        mse = MSE_images(f1_v, f2_v)/ (h*w)
        listOfDIFF.append(diff)
        listOfMSE.append(mse)

        if (diff >= 2):
            validFrameList.append(firstFrame)
            filesMoved+=1
        # if (lowRangeDone and midRangeDone and highRangeDone) == False:
        #     if (diff == 0 and not lowRangeDone):
        #         print ("low diff = : ", diff)
        #         print(firstFrame)
        #         display(Image(filename=firstFrame))
        #         print (secondFrame)
        #         display(Image(filename=secondFrame))
        #         lowRangeDone = True
        #     if (diff > 100 and diff < 150  and not midRangeDone):
        #         print ("mid diff = : ", diff)
        #         print(firstFrame)
        #         display(Image(filename=firstFrame))
        #         print (secondFrame)
        #         display(Image(filename=secondFrame))
        #         midRangeDone = True
        #     if (diff > 250 and diff < 300  and not highRangeDone):
        #         print ("high diff = : ", diff)
        #         print(firstFrame)
        #         display(Image(filename=firstFrame))
        #         print (secondFrame)
        #         display(Image(filename=secondFrame))
        #         highRangeDone = True
    print(lenOfFrameList - filesMoved, "out of", lenOfFrameList, "files were duplicates")
    # binAndPlot(listOfDIFF)
    print("DIFF min: ", np.amin(listOfDIFF), " max: ", np.amax(listOfDIFF), "mean: ", np.mean(listOfDIFF), "")
    # binAndPlot(listOfMSE)
    print("MSE min: ", np.amin(listOfMSE), " max: ", np.amax(listOfMSE), "mean: ", np.mean(listOfMSE), "")

    return validFrameList
            

for lineartPath in glob.glob("frames/lineart/*")[3:5]:
    removeLowDiffDst = removeLowDiff_andDemo(lineartPath)
    print("===============================================================================")

frames/lineart/[lineart]i was gonna delete this but you guys told me not to-iVqhzEaJhDw
60 out of 1129 files were duplicates
DIFF min:  0.0  max:  94.05469 mean:  8.618614 
MSE min:  0.0  max:  637.0105555555556 mean:  4.005593286729216 
frames/lineart/[lineart]Being a Boba Barista (Work Stories)-so1_5hYUEE8
6263 out of 9584 files were duplicates
DIFF min:  0.0  max:  1800.0 mean:  5.462136 
MSE min:  0.0  max:  460080.9244444444 mean:  111.91795814184115 


In [135]:
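def detectScenes(folderName, diff):
    """Write "start, end" frame indices of every shot longer than 8 frames; return the shot count."""
    # Sort frames by their numeric file name (0.png, 1.png, ...).
    frameNumber = lambda p: int(os.path.splitext(os.path.basename(p))[0])
    sortedFrameList = sorted(glob.glob(glob.escape(folderName)+'/*.png'), key=frameNumber)
    vectors = [getFeatureVectorOfImage(p) for p in sortedFrameList]
    startOfShot = 0
    numberOfScenes = 0
    with open(folderName+"_sceneDetection.txt", 'w') as f:
        for index in range(len(vectors)):
            isLastFrame = index == len(vectors)-1
            # A shot ends at the last frame or where neighboring frames differ by more than diff.
            if isLastFrame or MSE_images(vectors[index], vectors[index+1]) > diff:
                lengthOfScene = index - startOfShot + 1
                if lengthOfScene > 8:
                    f.write("{}, {}\n".format(startOfShot, index))
                    numberOfScenes += 1
                startOfShot = index+1
    return numberOfScenes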
