## ADA PM2 - Dataset pre-filtering

This notebook pre-filters the [YouNiverse](https://zenodo.org/records/4650046) dataset to keep only gaming-related content. In detail, we proceed as follows.
1. Keep only videos whose `categories` value is `Gaming`.
2. Keep only channels which have at least one video in the selected list.
3. Keep only time-series for the selected channels.
4. Keep only comments for the selected videos.

We will export each of the resulting datasets to a separate `.tsv` file.

In [1]:
import polars as pl
import tqdm
import json

ORIGINAL_PATH = {
    "videos": "../data/youniverse/original/yt_metadata_en.jsonl",
    "channels": "../data/youniverse/original/df_channels_en.tsv",
    "timeseries": "../data/youniverse/original/df_timeseries_en.tsv",
    "comments": "../data/youniverse/original/youtube_comments.tsv"
}

FILTERED_PATH = {
    "videos": "../data/youniverse/filtered/gaming_videos.tsv",
    "channels": "../data/youniverse/filtered/gaming_channels.tsv",
    "timeseries": "../data/youniverse/filtered/gaming_timeseries.tsv",
    "comments": "../data/youniverse/filtered/gaming_comments.tsv"
}

## Videos
As a first step, let's look at the first row of the dataset to understand its structure.

In [2]:
pl.read_ndjson(ORIGINAL_PATH["videos"], n_rows=1)

categories,channel_id,crawl_date,description,dislike_count,display_id,duration,like_count,tags,title,upload_date,view_count
str,str,str,str,f64,str,i64,f64,str,str,str,f64
"""Film & Animation""","""UCzWrhkg9eK5I8Bm3HfV-unA""","""2019-10-31 20:19:26.270363""","""Lego City Police Lego Firetruc…",1.0,"""SBqSc91Hn9g""",1159,8.0,"""lego city,lego police,lego cit…","""Lego City Police Lego Firetruc…","""2016-09-28 00:00:00""",1057.0


Since we only want **gaming videos**, we keep only the rows in this category. At the same time, we drop the columns that are not relevant for our analysis. Based on our understanding of the dataset, we will keep the following columns:
- `title` and `tags`, which contain useful information about the video content
- `upload_date`, which may be useful to track the link between subjects and periods
- `view_count`, `like_count` and `dislike_count`, which are key indicators of the video popularity
- `duration`, which may be useful to track trends per video game
- `channel_id` and `display_id`, which are useful to link videos to channels and comments

We drop the `description` column, as it is too heavy to handle over that many videos. We also drop the `categories` column, which is no longer informative once the filter is applied, as well as the `crawl_date` column, which is not useful for our study. Finally, we fill missing values for `tags` with empty strings.

In [None]:
columns_to_keep = [
    "categories",
    "title",
    "tags",
    "upload_date",
    "view_count",
    "like_count",
    "dislike_count",
    "duration",
    "channel_id",
    "display_id",
]

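filtered_gaming_df = (
    pl.read_ndjson(ORIGINAL_PATH["videos"])
    .select(columns_to_keep)
    # `categories` is only needed for this filter and is dropped afterwards.
    .filter(pl.col("categories") == "Gaming")
    # Replace missing tags with empty strings, as described above.
    .with_columns(pl.col("tags").fill_null(""))
    .drop("categories")
)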

Now, we can write our `polars` DataFrame to a `.tsv` file.

In [None]:
filtered_gaming_df.write_csv(FILTERED_PATH["videos"], separator="\t")
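
The `read_ndjson` call above loads the whole metadata file into memory. If that is not feasible, the same selection can be sketched line by line with the standard `json` module. This is only an illustration under assumptions: the helper name `iter_gaming_rows` and the sample records are hypothetical, and each line is assumed to be one JSON object with the fields shown in the preview above.

```python
import json

# Columns kept by the notebook (everything except description, categories, crawl_date).
KEEP = ["title", "tags", "upload_date", "view_count", "like_count",
        "dislike_count", "duration", "channel_id", "display_id"]

def iter_gaming_rows(lines, category="Gaming", keep=KEEP):
    """Yield one dict per video of the given category, restricted to `keep`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("categories") != category:
            continue  # not a gaming video
        row = {key: record.get(key) for key in keep}
        if "tags" in row and row["tags"] is None:
            row["tags"] = ""  # same convention as the notebook: missing tags become ""
        yield row

# Hypothetical toy input: two JSON lines, only the first one is a gaming video.
sample = [
    json.dumps({"categories": "Gaming", "title": "Speedrun", "tags": None,
                "display_id": "a1", "channel_id": "c1"}),
    json.dumps({"categories": "Music", "title": "Song", "tags": "pop",
                "display_id": "b2", "channel_id": "c2"}),
]
rows = list(iter_gaming_rows(sample))
```

With the real file, `lines` would be the open file handle for `ORIGINAL_PATH["videos"]`, so only one record is held in memory at a time.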

## Channels

Let's see what this dataset looks like.

In [3]:
pl.read_csv(ORIGINAL_PATH["channels"], separator='\t', n_rows=1)

category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
str,str,str,str,i64,i64,f64,f64
"""Gaming""","""2010-04-29""","""UC-lHJZR3Gqxm24_Vd_AJ5Yw""","""PewDiePie""",101000000,3956,3.0,2.087


Now, for our further analysis, we need to keep only the **gaming channels**. Furthermore, the channel IDs and names are needed to identify the channels in the videos dataset, and the subscriber and video counts are needed for statistical insights. Thus, we will keep the following columns:
- `channel` is the channel ID to link with the videos dataset; we will rename it to `channel_id`
- `name_cc` is the channel name, which is more human-readable; we will rename it to `channel_name`
- `subscribers_cc` and `videos_cc` are key indicators of the channel popularity; we will rename them to `subscribers` and `videos`

Since we only focus on gaming channels, we filter out the other categories and drop the `category_cc` column. We also drop the `join_date`, `subscriber_rank_sb` and `weights` columns, as they are not relevant for our work.


In [4]:
channels_df = pl.read_csv(ORIGINAL_PATH["channels"], separator='\t')

gaming_channels_df = channels_df.filter(pl.col("category_cc") == "Gaming").select([
    "channel",
    "name_cc",
    "subscribers_cc",
    "videos_cc",
]).rename({
    "channel": "channel_id",
    "name_cc": "channel_name",
    "subscribers_cc": "subscribers",
    "videos_cc": "videos",
})

This dataframe is approximately **20,000 rows long**. We can now write the filtered dataset to a `.tsv` file.

In [5]:
gaming_channels_df.write_csv(FILTERED_PATH["channels"], separator='\t')
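
The cell above selects channels by their own `category_cc` label. The overview at the top of the notebook describes a slightly different criterion, namely keeping the channels that own at least one selected video. A minimal plain-Python sketch of that criterion, on toy rows rather than the real files, could look as follows; the function name `channels_with_selected_videos` and the sample values are hypothetical.

```python
def channels_with_selected_videos(channels, videos):
    """Keep the channels whose `channel_id` appears in at least one video row."""
    selected_ids = {video["channel_id"] for video in videos}
    return [channel for channel in channels if channel["channel_id"] in selected_ids]

# Hypothetical toy rows: channel c1 owns both selected videos, c2 owns none.
videos = [
    {"display_id": "v1", "channel_id": "c1"},
    {"display_id": "v2", "channel_id": "c1"},
]
channels = [
    {"channel_id": "c1", "channel_name": "Alpha"},
    {"channel_id": "c2", "channel_name": "Beta"},
]
kept = channels_with_selected_videos(channels, videos)
```

The two criteria can disagree: a channel labelled `Gaming` may have no video in the selection, and a channel with another label may have uploaded gaming videos.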

## Time-series



Our next pre-filtering step is to process the **time-series** dataset. It contains the time-series of views and subscribers for each channel. Each data point represents the **view and subscriber count** for a given week. Here is what it looks like:

In [6]:
pl.read_csv(ORIGINAL_PATH["timeseries"], separator='\t', n_rows=1)

channel,category,datetime,views,delta_views,subs,delta_subs,videos,delta_videos,activity
str,str,str,f64,f64,f64,f64,i64,i64,i64
"""UCBJuEqXfXTdcPSbGO9qqn1g""","""Film and Animation""","""2017-07-03 00:00:00""",202494.555556,0.0,650.222222,0.0,5,0,3


For our work, we will keep the following columns, only for the **gaming** channels:
- `channel` to link with the videos and channels datasets, it will be renamed to `channel_id`
- `datetime` to track the time of the data point
- `views`, `subs` and `videos` to track the global statistics of the channel
- `delta_views`, `delta_subs` and `delta_videos` to track the evolution of the channel

We drop the two remaining columns: `category`, which is always **Gaming** after filtering, and `activity`, which represents the number of videos uploaded in the last 15 days.

In [8]:
timeseries_df = pl.read_csv(ORIGINAL_PATH["timeseries"], separator="\t")

filtered_timeseries_df = timeseries_df.filter(pl.col("category") == "Gaming").select([
    "channel",
    "datetime",
    "views",
    "delta_views",
    "subs",
    "delta_subs",
    "videos",
    "delta_videos"
]).rename({"channel": "channel_id"})

As with the previous datasets, we write the result to a `.tsv` file.

In [9]:
filtered_timeseries_df.write_csv(FILTERED_PATH["timeseries"], separator="\t")

## Comments

Now is the time to handle the **biggest** dataset of our work: the **comments**. Here is what a comment looks like in our dataset.

In [8]:
pl.read_csv(ORIGINAL_PATH["comments"], separator='\t', n_rows=1)

author,video_id
i64,str
2,"""9pQILRT42Cg"""


The goal is to filter out the comments that are **not related** to gaming videos. To do so, we build a set containing all the **gaming video IDs** and check, for each comment, whether its video ID is in this set. If it is, we keep the comment; otherwise we drop it.

The columns that we will keep are:
- `author`, which is the **author's ID**, to ensure we keep track of which user commented on which video
- `video_id`, which is the **video's ID**, to link the comments to the videos

The `likes` and `replies` columns are not needed to generate our graphs, so we drop them.

First, we generate the set of **gaming video IDs**.

In [None]:
# Build the set directly from the polars column; no pandas round-trip is needed.
gaming_display_ids = set(filtered_gaming_df["display_id"].to_list())

We then filter out the comments that are **not related** to gaming videos. Finally, we save the filtered dataset to a `.tsv` file.

In [None]:
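# Scan lazily so the full comments file is never loaded at once.
comments = pl.scan_csv(ORIGINAL_PATH["comments"], separator="\t", has_header=True).select(["author", "video_id"])
gaming_comments_df = comments.filter(pl.col("video_id").is_in(gaming_display_ids))

# A LazyFrame has no `write_csv`; `sink_csv` runs the query and writes the result to disk.
gaming_comments_df.sink_csv(FILTERED_PATH["comments"], separator="\t")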
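
The membership test described above can also be sketched without polars, streaming the file row by row with the standard `csv` module. This is an illustration only: the helper name `filter_comments` and the sample rows are hypothetical, and the input is assumed to be tab-separated with a header row containing `author` and `video_id`.

```python
import csv
import io

def filter_comments(src, dst, gaming_ids):
    """Copy `author` and `video_id` of comments on gaming videos from src to dst.

    Returns the number of comments kept.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(["author", "video_id"])
    kept = 0
    for row in reader:
        if row["video_id"] in gaming_ids:  # constant-time lookup on average in a set
            writer.writerow([row["author"], row["video_id"]])
            kept += 1
    return kept

# Hypothetical toy input standing in for the real comments file.
src = io.StringIO(
    "author\tvideo_id\tlikes\treplies\n"
    "1\tvidA\t0\t0\n"
    "2\tvidB\t3\t1\n"
    "1\tvidC\t0\t0\n"
)
dst = io.StringIO()
kept = filter_comments(src, dst, gaming_ids={"vidA", "vidC"})
```

On the real data, `src` and `dst` would be file handles opened on `ORIGINAL_PATH["comments"]` and `FILTERED_PATH["comments"]`. This version is likely much slower than the polars query, but it makes the keep-or-drop rule easy to check on a small sample.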