# Data Cleaning from Gathered Data

## Setup

Here we'll use some of the same libraries to clean the data gathered from the YouTube API.

In [1]:
import pandas as pd
import numpy as np

In [6]:
video_metrics_df = pd.read_csv('csv/video_metrics.csv', index_col=0)
video_comments_df = pd.read_csv('csv/video_comments.csv', index_col=0)
print(f'video_metrics_df shape: {video_metrics_df.shape}')
print(f'video_comments_df shape: {video_comments_df.shape}')

video_metrics_df shape: (508, 13)
video_comments_df shape: (2192, 6)


## Cleaning

First, let's look at video_metrics_df.

In [9]:
video_metrics_df.dtypes

channel_name     object
video_id         object
title            object
published_at     object
description      object
thumbnail_url    object
tags             object
category_id       int64
duration         object
view_count        int64
like_count        int64
fav_count         int64
comment_count     int64
dtype: object

Based on the above, most columns have the expected type: numerical values are integers and strings are objects. 'published_at' and 'duration' would ideally be datetime and timedelta types respectively, but since we plan to upload these cleaned CSV files to a SQL database we'll leave them as strings for now. 'category_id' should be an object rather than an integer, since its values are identifiers that link to a separate table mapping each category_id to a category name.

In [17]:
video_metrics_df['category_id'] = video_metrics_df['category_id'].astype('str')
print(video_metrics_df['category_id'].describe())

count     508
unique      2
top        20
freq      502
Name: category_id, dtype: object
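
If typed timestamps and durations are needed later (for example, for analysis in pandas rather than in SQL), they can be parsed into separate columns without touching the original strings. The sketch below is one possible approach, assuming 'published_at' holds ISO 8601 timestamps and 'duration' holds ISO 8601 durations limited to days, hours, minutes, and seconds, as in the values shown in this notebook. The helper `iso_duration_to_seconds`, the `sample_df` frame, and the new column names are hypothetical and not part of the original workflow.

```python
import re

import pandas as pd

# Sample values copied from rows shown in this notebook (illustrative only).
sample_df = pd.DataFrame({
    'published_at': ['2023-10-14T05:24:00Z', '2023-09-30T20:44:24Z'],
    'duration': ['PT2H20M31S', 'PT1M'],
})

# Matches durations such as 'P0D', 'PT59S', 'PT1M', 'PT2H20M31S'.
_DURATION_RE = re.compile(
    r'^P(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)


def iso_duration_to_seconds(value: str) -> int:
    """Convert an ISO 8601 duration (days/hours/minutes/seconds only) to seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f'Unsupported duration format: {value!r}')
    parts = {name: int(number or 0) for name, number in match.groupdict().items()}
    return (parts['days'] * 86400 + parts['hours'] * 3600
            + parts['minutes'] * 60 + parts['seconds'])


# Parse into new columns so the original strings stay available for the SQL upload.
sample_df['published_at_dt'] = pd.to_datetime(sample_df['published_at'], utc=True)
sample_df['duration_seconds'] = sample_df['duration'].map(iso_duration_to_seconds)
print(sample_df.dtypes)
```

Storing the duration as whole seconds keeps it easy to aggregate and maps cleanly onto an integer column if it is ever added to the database.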


Now let's look at the categorical features further.

In [30]:
video_metrics_df.describe(include=['O'])

| | channel_name | video_id | title | published_at | description | thumbnail_url | tags | category_id | duration |
|---|---|---|---|---|---|---|---|---|---|
| count | 508 | 508 | 508 | 508 | 490 | 508 | 508 | 508 | 508 |
| unique | 1 | 508 | 505 | 508 | 461 | 508 | 46 | 2 | 466 |
| top | Why I Salty | Sd34QHPBKRA | Salty Brew - Creation Control [Eternal Card Game] | 2023-09-30T20:44:24Z | Powered by Restream https://restream.io\n\nJus... | https://i.ytimg.com/vi/ezJOf4Zn0Os/default.jpg | none | 20 | PT1M |
| freq | 508 | 1 | 2 | 1 | 9 | 1 | 289 | 502 | 4 |


We can see that 'video_id', 'published_at', and 'thumbnail_url' are unique for every row, as each row represents a unique video. We also expect the tags, description, and category IDs to be reused at times. Interestingly, three titles appear more than once (505 unique titles across 508 videos) - let's explore why that may be the case.

In [29]:
duplicated_titles = video_metrics_df[video_metrics_df['title'].duplicated()]['title']
print(duplicated_titles.to_list())
video_metrics_df[video_metrics_df['title'].isin(duplicated_titles)]

['Mecha Auto Battler and then some Armored Core | Mechabellum [STREAM]', 'Salty Brew - BULL$%@T XENAN STRANGERS [Eternal Card Game]', 'Salty Brew - Creation Control [Eternal Card Game]']


| | channel_name | video_id | title | published_at | description | thumbnail_url | tags | category_id | duration | view_count | like_count | fav_count | comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | Why I Salty | -UrQ--njWvY | Mecha Auto Battler and then some Armored Core ... | 2023-10-15T19:39:11Z | Was just mainly going to do a Mecabellum strea... | https://i.ytimg.com/vi/-UrQ--njWvY/default_liv... | ['mecha', 'auto battler', 'tft', 'team fight t... | 20 | P0D | 0 | 0 | 0 | 0 |
| 48 | Why I Salty | DfV-xuJX6bE | Mecha Auto Battler and then some Armored Core ... | 2023-10-14T05:24:00Z | Was just mainly going to do a Mecabellum strea... | https://i.ytimg.com/vi/DfV-xuJX6bE/default.jpg | ['mecha', 'auto battler', 'tft', 'team fight t... | 20 | PT2H20M31S | 337 | 13 | 0 | 2 |
| 443 | Why I Salty | mRR2JQjzidE | Salty Brew - BULL$%@T XENAN STRANGERS [Eternal... | 2020-03-09T00:57:49Z | I got home late from work and decided stream a... | https://i.ytimg.com/vi/mRR2JQjzidE/default.jpg | none | 20 | PT32M51S | 434 | 10 | 0 | 0 |
| 445 | Why I Salty | DuMKMM0vFj0 | Salty Brew - BULL$%@T XENAN STRANGERS [Eternal... | 2020-03-05T06:15:04Z | Warning there is only one match, but it very l... | https://i.ytimg.com/vi/DuMKMM0vFj0/default.jpg | none | 20 | PT23M29S | 290 | 5 | 0 | 1 |
| 446 | Why I Salty | b3R1TBYyB28 | Salty Brew - Creation Control [Eternal Card Game] | 2020-03-04T05:46:33Z | Decided to give creation control another go, b... | https://i.ytimg.com/vi/b3R1TBYyB28/default.jpg | none | 20 | PT33M25S | 466 | 20 | 0 | 5 |
| 459 | Why I Salty | Snk0gfToLYY | Salty Brew - Creation Control [Eternal Card Game] | 2020-02-15T05:05:54Z | Originally i was trying to make an armory deck... | https://i.ytimg.com/vi/Snk0gfToLYY/default.jpg | none | 20 | PT32M2S | 305 | 9 | 0 | 1 |


Nothing unusual here: all duplicated titles have different values in the other columns.

The only exception is the first pair of videos, which share the same title and description. Since these were from streams on back-to-back days, the title was most likely reused.
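
An alternative way to pull every row that shares a title is `duplicated(keep=False)`, which flags all members of each duplicate group rather than only the later occurrences. The sketch below uses a small made-up frame (the `toy_df` name and its rows are illustrative, not taken from the real data) and also checks whether any rows repeat both the title and the video_id, which would indicate a genuinely duplicated record.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'video_id': ['a1', 'b2', 'c3', 'd4'],
    'title': ['Stream A', 'Stream A', 'Deck Tech', 'Short Clip'],
    'view_count': [0, 337, 466, 9259],
})

# keep=False marks every row in a duplicate group, not just the repeats.
shared_title_rows = toy_df[toy_df['title'].duplicated(keep=False)]
print(shared_title_rows.sort_values('title'))

# True duplicates would repeat the video_id as well as the title.
true_duplicates = toy_df[toy_df.duplicated(subset=['video_id', 'title'], keep=False)]
print(f'Rows duplicated on video_id and title: {len(true_duplicates)}')
```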

Finally let's look for missing values.

In [25]:
print(video_metrics_df.isna().sum())

channel_name      0
video_id          0
title             0
published_at      0
description      18
thumbnail_url     0
tags              0
category_id       0
duration          0
view_count        0
like_count        0
fav_count         0
comment_count     0
dtype: int64


In [28]:
video_metrics_df[video_metrics_df['description'].isna()].head()

| | channel_name | video_id | title | published_at | description | thumbnail_url | tags | category_id | duration | view_count | like_count | fav_count | comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | Why I Salty | xUOy06kZMAw | Raven Please.. | 2023-10-01T21:08:07Z | NaN | https://i.ytimg.com/vi/xUOy06kZMAw/default.jpg | none | 20 | PT59S | 9259 | 638 | 0 | 21 |
| 67 | Why I Salty | yRCX_zCiBZ0 | Got a Job For you 621 (5) #armoredcore6 #armor... | 2023-09-30T20:57:58Z | NaN | https://i.ytimg.com/vi/yRCX_zCiBZ0/default.jpg | none | 20 | PT1M | 4729 | 498 | 0 | 22 |
| 68 | Why I Salty | rZdzszUyav8 | Got a Job For you 621 (4) #armoredcore6 #armor... | 2023-09-30T20:44:24Z | NaN | https://i.ytimg.com/vi/rZdzszUyav8/default.jpg | none | 20 | PT49S | 4742 | 524 | 0 | 16 |
| 69 | Why I Salty | 1vSUQfUB66Q | Got a Job For you 621 (3) #armoredcore6 #armor... | 2023-09-30T20:29:00Z | NaN | https://i.ytimg.com/vi/1vSUQfUB66Q/default.jpg | none | 20 | PT17S | 3894 | 242 | 0 | 6 |
| 70 | Why I Salty | 9Er61UwlIF4 | Got A Job For You 621 (1) #armoredcore6 #armor... | 2023-09-30T20:09:57Z | NaN | https://i.ytimg.com/vi/9Er61UwlIF4/default.jpg | none | 20 | PT15S | 3177 | 242 | 0 | 4 |


A quick check on YouTube shows that the NaN values correspond to videos that had no description provided. We'll replace these with 'No description provided' to remove any confusion.

In [33]:
video_metrics_df['description'] = video_metrics_df['description'].replace(np.nan, 'No description provided')
print(video_metrics_df['description'].describe())

count                         508
unique                        462
top       No description provided
freq                           18
Name: description, dtype: object
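
For reference, `fillna` is the more common pandas idiom for this step and gives the same result for a single column. A minimal sketch on made-up data (the `toy_df` frame is illustrative):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'description': ['Powered by Restream', np.nan, None]})

# fillna replaces both NaN and None with the placeholder text.
toy_df['description'] = toy_df['description'].fillna('No description provided')
print(toy_df['description'].value_counts())
```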


Now let's repeat the process with the video_comments_df.

In [34]:
video_comments_df.dtypes

video_id                   object
video_comment              object
video_comment_user         object
video_comment_time         object
video_comment_parent_id    object
video_comment_child_id     object
dtype: object

In [35]:
video_comments_df.describe()

| | video_id | video_comment | video_comment_user | video_comment_time | video_comment_parent_id | video_comment_child_id |
|---|---|---|---|---|---|---|
| count | 2192 | 2044 | 2043 | 2044 | 2044 | 2044 |
| unique | 508 | 2016 | 918 | 2044 | 1536 | 509 |
| top | t3y4vt9kkoc | Nice | Why I Salty | 2020-07-11T15:34:16Z | UgyB03KENb3uiX5Zafx4AaABAg | 0 |
| freq | 36 | 5 | 252 | 1 | 6 | 1536 |
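
In the summary above, the count for every column except video_id is lower than the number of rows, which points to missing values in the comments table. A per-column missing-value count makes this explicit. The sketch below uses a small stand-in frame: the column names follow video_comments_df, but the rows and the `toy_comments_df` name are made up for illustration.

```python
import numpy as np
import pandas as pd

toy_comments_df = pd.DataFrame({
    'video_id': ['a1', 'a1', 'b2', 'c3'],
    'video_comment': ['Nice', 'Great deck', np.nan, 'GG'],
    'video_comment_user': ['user_1', 'user_2', np.nan, np.nan],
})

# Missing values per column.
missing_counts = toy_comments_df.isna().sum()
print(missing_counts)

# Videos that have a row but no comment text.
videos_without_comments = toy_comments_df.loc[
    toy_comments_df['video_comment'].isna(), 'video_id'
].unique()
print(videos_without_comments)
```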


In [36]:
print(video_metrics_df.isna().sum())

channel_name     0
video_id         0
title            0
published_at     0
description      0
thumbnail_url    0
tags             0
category_id      0
duration         0
view_count       0
like_count       0
fav_count        0
comment_count    0
dtype: int64


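The isna() check above confirms that video_metrics_df no longer has missing values. Note that it was run on video_metrics_df rather than video_comments_df: the describe() output for the comments table shows 2044 non-null comments across 2192 rows, so 148 rows have no comment text, and those rows are saved unchanged. Let's save these tables and ingest them into a SQL database for subsequent analysis.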

In [38]:
video_comments_df.to_csv('csv/video_comments_cleaned.csv')
video_metrics_df.to_csv('csv/video_metrics_cleaned.csv')
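
One thing to watch when reloading the cleaned CSV: `read_csv` infers dtypes again, so a 'category_id' column that was cast to string comes back as an integer unless the dtype is specified. The sketch below shows the round trip and a load into SQLite. The choice of SQLite, the in-memory database, the `video_metrics` table name, and the `toy_df` rows are all assumptions for illustration; the notebook does not say which SQL database is used.

```python
import io
import sqlite3

import pandas as pd

toy_df = pd.DataFrame({
    'video_id': ['a1', 'b2'],
    'category_id': ['20', '22'],
    'view_count': [337, 466],
})

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# Without dtype=..., category_id would be inferred as int64 on reload.
reloaded_df = pd.read_csv(buffer, index_col=0, dtype={'category_id': str})

# Load into an in-memory SQLite database and read the values back.
connection = sqlite3.connect(':memory:')
reloaded_df.to_sql('video_metrics', connection, index=False)
row_count = connection.execute('SELECT COUNT(*) FROM video_metrics').fetchone()[0]
category_ids = [row[0] for row in connection.execute(
    'SELECT category_id FROM video_metrics ORDER BY video_id')]
connection.close()
print(row_count, category_ids)
```

Passing `index=False` to `to_sql` avoids writing the DataFrame index as an extra column, which is usually what is wanted when video_id already identifies each row.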