## ADA PM2 - Dataset pre-filtering

This notebook pre-filters the [YouNiverse](https://zenodo.org/records/4650046) dataset to keep only gaming-related content. We proceed in four steps:
1. Keep only videos whose `categories` field is `Gaming`.
2. Keep only channels which have at least one video in the selected list.
3. Keep only time-series for the selected channels.
4. Keep only comments for the selected videos.

We export each resulting dataset to its own file: `.parquet` for the large tables (videos, time-series and comments) and `.tsv` for the channels.

In [39]:
import polars as pl
import tqdm
import json

VIDEOS_PATH = "data/youniverse/original/yt_metadata_en_100k.jsonl"
CHANNELS_PATH = "data/youniverse/original/df_channels_en.tsv"
TIMESERIES_PATH = "data/youniverse/original/df_timeseries_en.tsv"
COMMENTS_PATH = "data/youniverse/original/youtube_comments.tsv"

## Videos
As a first step, let's look at the first row of the dataset to understand its structure.

In [44]:
first_video_df = pl.scan_ndjson(VIDEOS_PATH, n_rows=1)

first_video_df.collect()

categories,channel_id,crawl_date,description,dislike_count,display_id,duration,like_count,tags,title,upload_date,view_count
str,str,str,str,f64,str,i64,f64,str,str,str,f64
"""Film & Animation""","""UCzWrhkg9eK5I8Bm3HfV-unA""","""2019-10-31 20:19:26.270363""","""Lego City Police Lego Firetruc…",1.0,"""SBqSc91Hn9g""",1159,8.0,"""lego city,lego police,lego cit…","""Lego City Police Lego Firetruc…","""2016-09-28 00:00:00""",1057.0


Since we only want **gaming videos**, we keep only the rows whose `categories` value is `Gaming`. At the same time, we drop the columns that are not relevant to our analysis. Based on our understanding of the dataset, we keep the following columns:
- `title` and `tags`, which contain useful information about the video content
- `upload_date`, which may be useful to track the link between subjects and periods
- `view_count`, `like_count` and `dislike_count`, which are key indicators of the video popularity
- `duration`, which may be useful to track trends per video game
- `channel_id` and `display_id`, which are useful to link videos to channels and comments

We drop the `description` column, as it is too heavy to handle over that many videos. We also drop the `categories` column, which carries no information once the filter is applied, and the `crawl_date` column, which is not useful for our study. Finally, we fill missing values for `tags` with empty strings.

In [81]:
columns_to_keep = [
    "categories",
    "title",
    "tags",
    "upload_date",
    "view_count",
    "like_count",
    "dislike_count",
    "duration",
    "channel_id",
    "display_id",
]

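filtered_gaming_df = (
    pl.read_ndjson(VIDEOS_PATH)
    .select(columns_to_keep)
    # `categories` is kept up to here only because this filter needs it.
    .filter(pl.col("categories") == "Gaming")
    # A string fill value only applies to string columns such as `tags`.
    .fill_null("")
    .drop("categories")
)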

filtered_gaming_df

title,tags,upload_date,view_count,like_count,dislike_count,duration,channel_id,display_id
str,str,str,f64,f64,f64,i64,str,str
"""Lego City Lego Police1 Hour Lo…","""lego city,lego police,lego cit…","""2016-09-26 00:00:00""",1253.0,9.0,0.0,3442,"""UCzWrhkg9eK5I8Bm3HfV-unA""","""y5IvyZlzELs"""
"""Lego City Police Lego Fireman …","""lego city,lego police,lego cit…","""2016-09-25 00:00:00""",2311.0,8.0,0.0,2407,"""UCzWrhkg9eK5I8Bm3HfV-unA""","""m1agc0qT0BY"""
"""Lego Dimensions Cartoons Movie…","""lego city,lego dimensions,lego…","""2016-09-24 00:00:00""",5596.0,11.0,1.0,1820,"""UCzWrhkg9eK5I8Bm3HfV-unA""","""rr6tfbBA9iY"""
"""Lego City Police Lego Fireman …","""lego city,lego city police,leg…","""2016-09-21 00:00:00""",792.0,8.0,0.0,1209,"""UCzWrhkg9eK5I8Bm3HfV-unA""","""ZGll5_wD9Ys"""
"""Lego Jurassic World Complete M…","""lego jurassic world,lego city,…","""2016-09-21 00:00:00""",1.141393e6,2076.0,426.0,2053,"""UCzWrhkg9eK5I8Bm3HfV-unA""","""kYkokQgnu20"""
…,…,…,…,…,…,…,…,…
"""How To Gain 50,000 XP A GAME F…","""cdotveezy,cdotveezy uploads,sh…","""2018-09-28 00:00:00""",6505.0,135.0,1.0,260,"""UCzRFJpgXBSRoYlZfKe0w80w""","""fLaFJtgGUUY"""
"""I Dropped 20 On His Head?!?! 9…","""cdotveezy,cdotveezy uploads,sh…","""2018-09-25 00:00:00""",1101.0,42.0,2.0,574,"""UCzRFJpgXBSRoYlZfKe0w80w""","""pbAeRGl-BaY"""
"""Turn Up!! Here's My 92 Overall…","""cdotveezy,cdotveezy uploads,sh…","""2018-09-24 00:00:00""",12819.0,213.0,7.0,444,"""UCzRFJpgXBSRoYlZfKe0w80w""","""csHo-yizKxY"""
"""Its Time To Clutch Up!!!!! 91 …","""cdotveezy,cdotveezy uploads,sh…","""2018-09-21 00:00:00""",794.0,27.0,0.0,385,"""UCzRFJpgXBSRoYlZfKe0w80w""","""zMfT8C0Hac8"""


Now we can write our `polars` DataFrame to a `.parquet` file, because it takes much less space than a `.tsv` file and improves performance on huge dataframes.

In [75]:
filtered_gaming_df.write_parquet("data/youniverse/filtered/gaming_videos.parquet")

## Channels

Let's see what this dataset looks like.

In [76]:
first_channel_df = pl.read_csv(CHANNELS_PATH, separator='\t', n_rows=1)

first_channel_df

category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
str,str,str,str,i64,i64,f64,f64
"""Gaming""","""2010-04-29""","""UC-lHJZR3Gqxm24_Vd_AJ5Yw""","""PewDiePie""",101000000,3956,3.0,2.087


For our further analysis, we need to keep only the **gaming channels**. The channel IDs and names are needed to identify the channels in the videos dataset, and the subscriber and video counts are needed for statistical insights. Thus, we keep the following columns:
- `channel`, the channel ID used to link with the videos dataset. It is renamed to `channel_id`.
- `name_cc`, the channel name, which is more human-readable. It is renamed to `channel_name`.
- `subscribers_cc` and `videos_cc`, which are key indicators of the channel popularity. They are renamed to `subscribers` and `videos`.

Since we only focus on gaming channels, we filter out the other categories and drop the `category_cc` column. We also drop the `join_date` column, as it is not really relevant for our work, and the `subscriber_rank_sb` and `weights` columns.


In [77]:
channels_df = pl.read_csv(CHANNELS_PATH, separator='\t')

gaming_channels_df = channels_df.filter(pl.col("category_cc") == "Gaming").select([
    "channel",
    "name_cc",
    "subscribers_cc",
    "videos_cc",
]).rename({
    "channel": "channel_id",
    "name_cc": "channel_name",
    "subscribers_cc": "subscribers",
    "videos_cc": "videos",
})

gaming_channels_df

channel_id,channel_name,subscribers,videos
str,str,i64,i64
"""UC-lHJZR3Gqxm24_Vd_AJ5Yw""","""PewDiePie""",101000000,3956
"""UCEdvpU2pFRCVqU6yIPyTpMQ""","""Marshmello""",39100000,366
"""UC7_YxT-KID8kRbqZo7MyscQ""","""Markiplier""",24400000,4484
"""UCKqH_9mk1waLgBiL2vT5b9g""","""VanossGaming""",24800000,1079
"""UCYzPXprvl5Y-Sf0g4vX-m6g""","""jacksepticeye""",22833014,4255
…,…,…,…
"""UCUhejRdj-jaI5H8u4t4ra_Q""","""Lots Of Gamer Pro""",10000,40
"""UCfdiBkNZbAFuAsfiEgNcA5w""","""Kurojaki Kaoru""",10000,107
"""UC2oRkTPQk0-czP04VOYigxg""","""BLITZFIRE911""",10000,305
"""UC-njwIOSrdj_RJd4JWXMrcA""","""Crimson Heroes""",10000,425


This dataframe is approximately **20,000 rows long**. We can now write the filtered dataset to a `.tsv` file.

In [78]:
gaming_channels_df.write_csv('data/youniverse/filtered/gaming_channels.tsv', separator='\t')

## Time-series

Our next pre-filtering step handles the **time-series** dataset. It contains the time-series of views and subscribers for each channel. Each data point represents the **view and subscriber count** for a given week. Here is what it looks like:

In [79]:
first_timeseries_df = pl.read_csv(TIMESERIES_PATH, separator='\t', n_rows=1)

first_timeseries_df

channel,category,datetime,views,delta_views,subs,delta_subs,videos,delta_videos,activity
str,str,str,f64,f64,f64,f64,i64,i64,i64
"""UCBJuEqXfXTdcPSbGO9qqn1g""","""Film and Animation""","""2017-07-03 00:00:00""",202494.555556,0.0,650.222222,0.0,5,0,3


For our work, we keep the following columns, only for the **gaming** channels:
- `channel` to link with the videos and channels datasets. It is renamed to `channel_id`.
- `datetime` to track the time of the data point
- `views`, `subs` and `videos` to track the global statistics of the channel
- `delta_views`, `delta_subs` and `delta_videos` to track the evolution of the channel

The columns we drop are `category`, which is always **Gaming** after filtering, and `activity`, which represents the number of videos uploaded in the last 15 days.

In [84]:
timeseries_df = pl.read_csv(TIMESERIES_PATH, separator="\t")

filtered_timeseries_df = timeseries_df.filter(pl.col("category") == "Gaming").select([
    "channel",
    "datetime",
    "views",
    "delta_views",
    "subs",
    "delta_subs",
    "videos",
    "delta_videos"
]).rename({"channel": "channel_id"})

filtered_timeseries_df

channel_id,datetime,views,delta_views,subs,delta_subs,videos,delta_videos
str,str,f64,f64,f64,f64,i64,i64
"""UCNNLaOkE-rcthxNssSHET2A""","""2017-05-15 00:00:00""",21755.111111,19051.111111,4980.444444,0.0,4,0
"""UCNNLaOkE-rcthxNssSHET2A""","""2017-05-22 00:00:00""",63875.75,42120.638889,7588.25,2607.805556,6,2
"""UCNNLaOkE-rcthxNssSHET2A""","""2017-05-29 00:00:00""",131038.375,67162.625,11309.125,3720.875,9,3
"""UCNNLaOkE-rcthxNssSHET2A""","""2017-06-05 00:00:00""",216735.0,85696.625,19353.0,8043.875,13,4
"""UCNNLaOkE-rcthxNssSHET2A""","""2017-06-12 00:00:00""",396499.666667,179764.666667,22546.555556,3193.555556,16,3
…,…,…,…,…,…,…,…
"""UC0UeVA9YHpOEr_Ng442xiRw""","""2019-09-02 00:00:00""",6.0129e6,232418.277778,61268.611111,1305.611111,278,2
"""UC0UeVA9YHpOEr_Ng442xiRw""","""2019-09-09 00:00:00""",6.2446e6,231640.888889,62631.666667,1363.055556,287,9
"""UC0UeVA9YHpOEr_Ng442xiRw""","""2019-09-16 00:00:00""",6480901.6,236322.933333,64010.0,1378.333333,294,7
"""UC0UeVA9YHpOEr_Ng442xiRw""","""2019-09-23 00:00:00""",6745316.8,264415.2,65480.0,1470.0,301,7


For the same reasons as mentioned above, we will write this dataset to a `.parquet` file.

In [85]:
filtered_timeseries_df.write_parquet("data/youniverse/filtered/gaming_timeseries.parquet")
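
As a quick sanity check on the time-series rows displayed above, one can verify that consecutive data points are one week apart and that `delta_views` matches the difference between consecutive `views` values. The sketch below uses only the first five displayed rows, copied by hand. It is a reading of that sample, not a documented guarantee about the whole dataset.

```python
from datetime import datetime, timedelta
from math import isclose

# (datetime, views, delta_views) copied from the first five displayed rows.
rows = [
    ("2017-05-15 00:00:00", 21755.111111, 19051.111111),
    ("2017-05-22 00:00:00", 63875.75, 42120.638889),
    ("2017-05-29 00:00:00", 131038.375, 67162.625),
    ("2017-06-05 00:00:00", 216735.0, 85696.625),
    ("2017-06-12 00:00:00", 396499.666667, 179764.666667),
]

# Check that consecutive timestamps are exactly seven days apart.
timestamps = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _, _ in rows]
gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
weekly = all(gap == timedelta(days=7) for gap in gaps)

# Check that each row's delta_views equals its views minus the previous views.
# The tolerance absorbs the rounding of the printed values.
deltas_consistent = all(
    isclose(curr[1] - prev[1], curr[2], abs_tol=1e-3)
    for prev, curr in zip(rows, rows[1:])
)

print(weekly, deltas_consistent)
```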

## Comments

Now is the time to handle the **biggest** dataset of our work: the **comments**. Here is what a comment looks like in our dataset.

In [37]:
first_comment_df = pl.read_csv(COMMENTS_PATH, separator='\t', n_rows=1)

first_comment_df

author,video_id,likes,replies
i64,str,i64,i64
1,"""Gkb1QMHrGvA""",2,0


The task here is to drop the comments that are **not related** to gaming videos. To do so, we build a set containing all the **gaming video IDs** and check, for each comment, whether its video ID is in this set. If it is, we keep the comment; otherwise we drop it.

The columns that we will keep are:
- `author`, which is the **author's ID**, to ensure we keep track of which user commented on which video
- `video_id`, which is the **video's ID**, to link the comments to the videos

The `likes` and `replies` columns are not needed to generate our graphs, so we drop them.

First, we generate the set of **gaming video IDs**.

In [None]:
# Collect the gaming video IDs into a set for fast membership tests.
gaming_display_ids = set(filtered_gaming_df["display_id"].to_list())

len(gaming_display_ids)

21057

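We then filter out the comments that are **not related** to gaming videos, reading the file in batches. Note that the cell below only processes roughly the first `total_rows` (2,000,000) rows of the comments file. Finally, we save the filtered dataset to a `.parquet` file.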

In [95]:
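dfs = []
batch_size = 100_000
total_rows = 2_000_000  # approximate upper bound on the rows read from the file
gaming_ids = list(gaming_display_ids)

batched_reader = pl.read_csv_batched(COMMENTS_PATH, separator='\t', batch_size=batch_size)
num_batches = total_rows // batch_size
# next_batches returns None once the file is exhausted.
batches = batched_reader.next_batches(num_batches) or []

for chunk in batches:
    # Keep only comments on gaming videos, and only the columns we need.
    filtered_chunk = chunk.filter(
        pl.col("video_id").is_in(gaming_ids)
    ).select(["author", "video_id"])
    dfs.append(filtered_chunk)

gaming_comments_df = pl.concat(dfs)

gaming_comments_df.write_parquet("data/youniverse/filtered/gaming_comments.parquet")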
