# SECTION 2: PREPROCESSING
___

This notebook takes a raw csv file as input and writes a cleaned version of it to a new csv file. The cleaned csv can then be used in further analyses.

<br>

| Step | Subtask                                                                     |
|------|-----------------------------------------------------------------------------|
| 1    | Remove entries with null values (optional)                                  |
| 2    | Standardize header names                                                    |
| 3    | Convert dates to datetime                                                   |
| 4    | Convert durations to time values + add new column for length in seconds     |
| 5    | Remove emojis from text fields (optional)                                   |
| 6    | Save cleaned data as a new csv                                              |


In [87]:
import math
from importlib import reload
from os import getcwd
from pathlib import Path

import pandas as pd

# silence pandas' chained-assignment warning for this notebook
pd.options.mode.chained_assignment = None

# project directories
ROOT = Path(getcwd())
DATA = ROOT.joinpath('data')
CLEAN_DATA = ROOT.joinpath('clean_data')

# local files
import preprocessing
reload(preprocessing)

<module 'preprocessing' from 'C:\\Users\\Matt\\DataspellProjects\\Youtube Data Analytics\\preprocessing.py'>

In [88]:
files = preprocessing.file_df
files

Unnamed: 0,index,name
0,1,channels.csv
1,2,channel_info.csv
2,3,comments.csv
3,4,videos.csv


In [89]:
# set file_index to the 'index' value of the file to be cleaned (see the df in the cell above)
# adjust the optional parameters below as needed
# hit RUN ALL when ready

# parameters
file_index: int = 4
remove_nulls: bool = False
remove_emojis: bool = False   # WARNING: setting this to True makes the process run much slower


files.loc[files['index'] == file_index, 'name'].values[0]

'videos.csv'

In [90]:
# final check that file_index has been set before initiating process
file_selected: bool = file_index is not None
assert file_selected, "No file was selected"

In [91]:
# extract path to datafile
selected_file_name = files.loc[files['index'] == file_index, 'name'].values[0]
full_path = DATA.joinpath(selected_file_name)
f"Cleaning file '{full_path.name}'..."

"Cleaning file 'videos.csv'..."

In [92]:
# create DataFrame from datafile & preview
df = pd.read_csv(full_path)  # read_csv already returns a DataFrame
cols = list(df.columns)
df.sample(10)

Unnamed: 0,videoId,channelName,description,videoTitle,postDate,duration,views,commentCount,thumbnail
914,a546lxxJIhE,LastWeekTonight,John Oliver discusses psychedelic assisted the...,Psychedelic Assisted Therapy: Last Week Tonigh...,2023-02-20T07:30:00Z,PT21M28S,4681836,,https://i.ytimg.com/vi/a546lxxJIhE/default.jpg
1555,go4hT1BLG6o,JerryRigEverything,"3 things I can't live without #ad. By the way,...",3 THINGS I CAN'T LIVE WITHOUT! #ad #ShopWithYo...,2023-10-10T13:00:36Z,PT59S,385680,,https://i.ytimg.com/vi/go4hT1BLG6o/default.jpg
1306,Cm_uIxcczWM,CaseyNeistat,,10 simple Tricks to Not get Divorced,2023-06-16T13:51:20Z,PT12M14S,1746069,,https://i.ytimg.com/vi/Cm_uIxcczWM/default.jpg
2278,KeeeLsAa30M,PewDiePie,#AD - Pre-Order G FUEL’s New PAC-MAN Flavor! h...,"$39,000,000 Minecraft House..",2023-01-17T17:45:00Z,PT20M46S,3265699,,https://i.ytimg.com/vi/KeeeLsAa30M/default.jpg
1093,Ee8CEV0QjCA,PragerU,There is a lot to learn about the Israeli-Pale...,The Middle East Conflict Part 2 (Marathon),2023-10-10T00:45:11Z,PT1H5M4S,46742,,https://i.ytimg.com/vi/Ee8CEV0QjCA/default.jpg
3903,2wM3pf79QSk,Jarvis Johnson,Take back creative control with Storyblocks' u...,Horrifying Home Design (w/ Chad Chad),2023-01-25T18:41:41Z,PT24M23S,1860634,,https://i.ytimg.com/vi/2wM3pf79QSk/default.jpg
3088,qquglT7Knfg,Markiplier,We finally see the monster! Just barely... but...,Amnesia: The Bunker - Part 2,2023-06-26T20:22:37Z,PT54M55S,2374279,,https://i.ytimg.com/vi/qquglT7Knfg/default.jpg
3044,_y2r2oEyqOA,ESPN,✔️ Subscribe to ESPN+ http://espnplus.com/yout...,This looks fun 👀👏 (via @cris_motorfide/TT) #sh...,2023-10-23T19:00:27Z,PT14S,14753,,https://i.ytimg.com/vi/_y2r2oEyqOA/default.jpg
3426,-H1c7C-hyAk,Nerdwriter1,MY BOOK IS OUT NOW! \nAMAZON: https://amzn.to/...,How Postwar Italy Created The Paparazzi,2022-04-30T12:42:05Z,PT7M35S,231711,,https://i.ytimg.com/vi/-H1c7C-hyAk/default.jpg
3617,2JAOTJxYqh8,Mark Rober,One of the best things about life is you can p...,Bed Bugs- What You've Been Told is Totally False,2023-03-04T14:00:00Z,PT23M47S,24914748,,https://i.ytimg.com/vi/2JAOTJxYqh8/default.jpg


In [93]:
required_columns = ['postDate', 'duration', 'description', 'channelName', 'videoId']
assert all(column in cols for column in required_columns), "DataFile is missing required columns"
# ignore comment count since many are null (for vids with comments disabled), and thumbnails aren't necessary for our analyses
df = df.drop(columns=['commentCount', 'thumbnail'])

df.sample(5)

Unnamed: 0,videoId,channelName,description,videoTitle,postDate,duration,views
3719,RiM0moNk74o,Ariana Grande,Check out Ariana’s full performance in Fortnit...,Fortnite Presents: Rift Tour Featuring Ariana ...,2021-08-07T18:00:11Z,PT6M56S,8652089
1312,34k7UI-DR_8,CaseyNeistat,JORDAN!!!! https://www.youtube.com/@JordanStud...,TESTING $1400 Ai POWERED ELECTRIC SHOES in NYC,2023-02-22T21:32:11Z,PT5M20S,3677747
561,a4vmEpHfBAk,MLB,Don't forget to subscribe! https://www.youtube...,What ump cam was made for. #ALCS,2023-10-24T01:09:22Z,PT13S,36833
1613,s7lU2h4KsSA,Domics,,@Domics,2021-04-01T11:51:19Z,PT5S,563800
3771,5GFunML9Iy4,Troom Troom,Watch the NEWEST videos: https://youtu.be/_mud...,We Build Secret Rooms for Mermaids! Emerald Gi...,2023-10-11T13:00:04Z,PT35M26S,196503


In [94]:
# step 1: remove nulls (optional)
if remove_nulls:
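    pre_removal_size = df.shape[0]
    df = df.dropna()
    post_removal_size = df.shape[0]
    removed = pre_removal_size - post_removal_size
    # report the share of rows that were removed (not the share that remains)
    pct_removed = round(removed / pre_removal_size * 100, 1) if pre_removal_size else 0.0
    print(f"Removed {removed} rows with empty values ({pct_removed}%)")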
else:
    print("Removed 0 rows with empty values")

Removed 0 rows with empty values
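
Since `dropna()` with no arguments drops every row that has a null in any column, videos with an empty description would be removed too. The sketch below shows one way to inspect nulls per column first and to restrict removal to selected columns with the `subset` argument. The toy DataFrame and the choice of "required" columns are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Toy data (hypothetical values) using the raw column names from the datafile
toy = pd.DataFrame({
    "videoId": ["a1", "b2", "c3"],
    "description": ["first", None, "third"],
    "views": [10, 20, None],
})

# Count nulls per column before deciding what to drop
nulls_per_column = toy.isna().sum()

# Drop rows only when one of the chosen columns is null (assumed choice)
required_only = toy.dropna(subset=["videoId", "views"])

removed = len(toy) - len(required_only)
share_removed = round(removed / len(toy) * 100, 1)
print(f"Removed {removed} rows ({share_removed}%)")
```

Here the row with the missing description is kept, and only the row with missing views is dropped.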


In [95]:
# step 2: rename columns
df = df.rename(columns={'videoId': 'video_id', 'channelName': 'channel_name', 'videoTitle': 'video_title', 'postDate': 'upload_date', 'duration': 'length', 'views': 'num_views'})
df

Unnamed: 0,video_id,channel_name,description,video_title,upload_date,length,num_views
0,V1hN1ekwTP8,Bon Appétit,Andrew Rea (AKA Binging With Babish) pits 32 c...,Babish Picks the Best Halloween Candy of All-T...,2023-10-24T16:00:02Z,PT17M33S,1214
1,d8q6bC-pAF8,Bon Appétit,"Japanese chef Yuji Haraguchi, owner of OKONOMI...",Flaming Fish Chashu,2023-10-20T18:45:02Z,PT58S,31681
2,hktIRdd90g4,Bon Appétit,Kendra Vaculin spent days in the Bon Appétit T...,Developing These Perfect Lemon Bars Nearly Bro...,2023-10-19T16:45:02Z,PT24M29S,140798
3,Jk90CG3WBy8,Bon Appétit,Chef Evan Funke brings Bon Appétit along to th...,Artichoke Hips Don't Lie,2023-10-18T20:15:02Z,PT27S,38809
4,9Tc33xCppQo,Bon Appétit,"“We make about 13 to 14,000 cookies every week...","Making 28,000 Pastries a Week in a Small Brook...",2023-10-17T16:00:07Z,PT20M1S,171404
...,...,...,...,...,...,...,...
4045,CIc8fHhO5O0,Ed Sheeran,Autumn Variations out now: https://es.lnk.to/a...,Autumn Is Coming #4,2023-09-18T22:00:06Z,PT48S,130324
4046,m9JM-a5AH54,Ed Sheeran,Subscribe to Ed's channel: http://bit.ly/Subsc...,"Santa Clara pop up before the stadium show, gi...",2023-09-18T14:08:11Z,PT41S,124038
4047,1btffD7DYVQ,Ed Sheeran,Subscribe to Ed's channel: http://bit.ly/Subsc...,Debuting American Town unplugged in the audien...,2023-09-16T09:18:26Z,PT1M,93226
4048,e_irHEmAkgw,Ed Sheeran,Autumn Variations is the first album I’m putti...,I want you guys to make the videos for Autumn ...,2023-09-13T18:39:51Z,PT25S,107355


In [96]:
# step 3: convert dates
df['upload_date'] = df['upload_date'].apply(preprocessing.convert_date)
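
The body of `preprocessing.convert_date` is not shown in this notebook. As a rough idea of what such a conversion could look like, here is a minimal sketch that parses timestamps in the format seen in the preview (e.g. `2023-02-20T07:30:00Z`). The function name `parse_upload_date` is hypothetical and the real helper may behave differently.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_upload_date(raw: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp like '2023-02-20T07:30:00Z' into an aware UTC datetime."""
    if not isinstance(raw, str):
        return None  # missing values (None / NaN) pass through as None
    # The trailing 'Z' means UTC; strptime matches it literally,
    # so the timezone is attached explicitly afterwards.
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


example = parse_upload_date("2023-02-20T07:30:00Z")
print(example.isoformat())
```

For a whole column, `pd.to_datetime(df['upload_date'], utc=True)` would be a vectorized alternative.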

In [97]:
# step 4: convert durations + add seconds column
df['length'] = df['length'].apply(preprocessing.convert_duration)
df['length_secs'] = df['length'].apply(preprocessing.duration_to_secs)
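
The raw `duration` values in the preview (`PT21M28S`, `PT1H5M4S`, `PT59S`) are ISO 8601 durations. The helpers `convert_duration` and `duration_to_secs` live in `preprocessing.py` and are not shown here, so the following is only a sketch of how such a conversion might work. The names `iso_duration_to_secs` and `secs_to_hms` are hypothetical, and unlike the cell above this sketch computes seconds directly from the raw string. It only handles the hours/minutes/seconds part and returns `None` for anything else, such as durations with a day component.

```python
import re
from typing import Optional

# Time part of an ISO 8601 duration, e.g. 'PT1H5M4S', 'PT59S', 'PT1M'
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def iso_duration_to_secs(raw: object) -> Optional[int]:
    """Return the total number of seconds, or None if the value is not recognised."""
    match = _DURATION_RE.match(raw) if isinstance(raw, str) else None
    if match is None:
        return None
    # Missing components (e.g. no hours in 'PT59S') count as zero
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def secs_to_hms(total: int) -> str:
    """Format a number of seconds as 'HH:MM:SS'."""
    minutes, seconds = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(secs_to_hms(iso_duration_to_secs("PT1H5M4S")))
```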

In [98]:
# step 5: remove emojis (optional)
if remove_emojis:
    df['video_title'] = df['video_title'].apply(preprocessing.remove_emojis)
    df['description'] = df['description'].apply(preprocessing.remove_emojis)
    print("All emojis removed")
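
How `preprocessing.remove_emojis` is implemented is not shown in this notebook. One common approach, sketched below under that caveat, is a regular expression over the main emoji code point ranges. The name `remove_emojis_sketch` is hypothetical, and the ranges are a reasonable approximation rather than a complete list of every emoji.

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive list)
_EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)


def remove_emojis_sketch(text: object) -> object:
    """Strip emoji characters from a string; non-strings (e.g. NaN) are returned unchanged."""
    if not isinstance(text, str):
        return text
    return _EMOJI_RE.sub("", text)


cleaned = remove_emojis_sketch("This looks fun \U0001F440\U0001F44F (via @cris_motorfide/TT)")
print(cleaned)
```

Accented letters and typographic quotes fall outside these ranges, so names like "Bon Appétit" are left intact. Whitespace around a removed emoji is not collapsed.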

In [99]:
# step 6: save file as new csv
new_file_name = full_path.stem + '_cleaned.csv'
CLEAN_DATA.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
df.to_csv(CLEAN_DATA.joinpath(new_file_name), index=False)
f"Saved file to '{new_file_name}'"

"Saved file to 'videos_cleaned.csv'"