## ADA PM2 - Dataset pre-filtering

This notebook pre-filters the [YouNiverse](https://zenodo.org/records/4650046) dataset to keep only gaming-related content. We proceed in the following steps.
1. Keep only videos whose `categories` field is `Gaming`.
2. Keep only channels which have at least one video in the selected list.
3. Keep only time-series for the selected channels.
4. Keep only comments for the selected videos.

We will export each of the resulting datasets in a separate `.tsv` file.

In [None]:
import polars as pl

VIDEOS_PATH = "data/youniverse/original/yt_metadata_en.jsonl"
CHANNELS_PATH = "data/youniverse/original/df_channels_en.tsv"
TIMESERIES_PATH = "data/youniverse/original/df_timeseries_en.tsv"
COMMENTS_PATH = "data/youniverse/original/youtube_comments.tsv"

## Videos
As a first step, let's read the first line of the dataset to understand its structure.

In [8]:
import json

# Read only the first record of the JSON Lines file to inspect its schema.
with open(VIDEOS_PATH, "r") as file:
    data = [json.loads(next(file)) for _ in range(1)]
pl.DataFrame(data)

| categories | channel_id | crawl_date | description | dislike_count | display_id | duration | like_count | tags | title | upload_date | view_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | f64 | str | i64 | f64 | str | str | str | f64 |
| "Film & Animation" | "UCzWrhkg9eK5I8Bm3HfV-unA" | "2019-10-31 20:19:26.270363" | "Lego City Police Lego Firetruc…" | 1.0 | "SBqSc91Hn9g" | 1159 | 8.0 | "lego city,lego police,lego cit…" | "Lego City Police Lego Firetruc…" | "2016-09-28 00:00:00" | 1057.0 |


Since we only want **gaming videos**, we keep only the rows whose category is `Gaming`. At the same time, we drop the columns that are not relevant to our analysis. Based on our understanding of the dataset, we keep the following columns:
- `title` and `tags`, which contain useful information about the video content
- `upload_date`, which may be useful to track the link between subjects and periods
- `view_count`, `like_count` and `dislike_count`, which are key indicators of the video popularity
- `duration`, which may be useful to track trends per video game
- `channel_id` and `display_id`, which are useful to link videos to channels and comments

We drop the `description` column, as it is too heavy to handle across that many videos. We also drop the `categories` column, which is no longer needed once the filter has been applied, and the `crawl_date` column, which is not useful for our study. Finally, we fill missing values in `tags` with empty strings.

In [None]:
import pandas as pd
from tqdm import tqdm

dfs = []
total_rows = 72924794  # used only to size the progress bar
chunksize = 100000
for json_df in tqdm(
    pd.read_json(
        VIDEOS_PATH, compression="infer", lines=True, chunksize=chunksize
    ),
    desc="Loading data",
    total=total_rows // chunksize,
):
    json_df = json_df[
        [
            "categories",
            "title",
            "tags",
            "upload_date",
            "view_count",
            "like_count",
            "dislike_count",
            "duration",
            "channel_id",
            "display_id",
        ]
    ]
    json_df = json_df[json_df["categories"] == "Gaming"]
    dfs.append(json_df)

# The chunks are pandas DataFrames, so concatenate them with pandas.
gaming_df = pd.concat(dfs, ignore_index=True)
gaming_df = gaming_df.drop(columns=["categories"])
gaming_df = gaming_df.fillna({"tags": ""})

In [None]:
columns_to_keep = [
    "categories",
    "title",
    "tags",
    "upload_date",
    "view_count",
    "like_count",
    "dislike_count",
    "duration",
    "channel_id",
    "display_id",
]

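# Scan the file lazily so that only the needed columns and rows are materialised.
gaming_df = (
    pl.scan_ndjson(VIDEOS_PATH)
    .select(columns_to_keep)
    .filter(pl.col("categories") == "Gaming")
    .drop("categories")
    .with_columns(pl.col("tags").fill_null(""))
    .collect()
)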

Now, we can convert our `pandas` data frame to `vaex`, an out-of-core library that handles huge datasets efficiently by lazily loading them into memory. We then save the filtered dataset to an `hdf5` file.

In [None]:
import vaex

# Requires the pandas version of `gaming_df` (built by the chunked loader).
gaming_vdf = vaex.from_pandas(gaming_df)
gaming_vdf.export_hdf5("data/yt_gaming_metadata_en.hdf5")

# With polars (requires the polars version of `gaming_df`)
gaming_df.write_csv("data/yt_gaming_metadata_en.tsv", separator="\t")