## NOTEBOOK DESCRIPTION:

This notebook builds the training data for the word2vec_pytorch implementation. 
For each user, it selects at most CONTEXT_SIZE channels among those the user has commented on, then generates every 2-combination (pair) of the selected channels.

WARNING: Before running this notebook, fill in `/word2vecf/config.py` with the desired parameters, in particular THRESHOLD_NAME, which corresponds to the minimum number of comments per channel.

In [1]:
import scipy.sparse
import sys
import os
import gzip
import pickle  # used to load vocab_occ.pkl when SUBSAMPLING is enabled
import random
import time
import itertools

import pandas as pd
import numpy as np

from itertools import permutations, combinations

scriptpath = "/home/jouven/youtube_projects/word2vec_pytorch/"
sys.path.append(os.path.abspath(scriptpath))
from config import *


scriptpath = "/home/jouven/youtube_projects/"
sys.path.append(os.path.abspath(scriptpath))
from helpers.helpers_channels_more_300 import *

### Create training data from the comments dataset

To build the training data, the following code reads the original `comments_dataset`. As usual, each user is processed sequentially and the results are put into a Pandas DataFrame. 
The results need a specific format: every row of the DataFrame contains a pair of channels corresponding to the (input, output) of a given user.
    
    For each user:
        - Select the channels that this user has commented on
        - Perform subsampling if specified
        - Select at most CONTEXT_SIZE channels from the set of channels this user has commented on
        - Build the 2-combinations of the selected channels
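
The subsampling step follows the word2vec-style rule used later in this notebook: a channel whose share of all comments is `f` is discarded with probability `1 - sqrt(SAMPLING_RATE / f)`. Very popular channels are therefore dropped more often, while channels with `f <= SAMPLING_RATE` are always kept. Below is a minimal sketch of that rule, assuming toy comment counts and a hypothetical rate (`toy_vocab_occ`, `toy_rate` and both helper functions are illustrative names, not part of the project):

```python
import math
import random


def discard_probability(count, total, sampling_rate):
    """Probability of dropping a channel; a value <= 0 means 'always keep'."""
    frac = count / total
    return 1.0 - math.sqrt(sampling_rate / frac)


def subsample_channels(vocab_occ, sampling_rate, rng):
    """Return the set of channel indices kept after subsampling."""
    total = sum(vocab_occ)
    kept = set()
    for channel, count in enumerate(vocab_occ):
        if count == 0:
            continue  # assumption: channels without comments are skipped
        if rng.random() >= discard_probability(count, total, sampling_rate):
            kept.add(channel)
    return kept


# Toy counts (hypothetical): channel 0 is very popular, channels 1 and 2 are rare.
toy_vocab_occ = [9800, 100, 100]
toy_rate = 0.05  # hypothetical stand-in for SAMPLING_RATE
kept = subsample_channels(toy_vocab_occ, toy_rate, random.Random(0))
```

In this toy setting the rare channels are always kept, because their discard probability is negative, while the popular channel is dropped roughly three times out of four.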

In [2]:
COMMON_PATH = "/dlabdata1/youtube_large/jouven/word2vec_pytorch/channels_more_" + THRESHOLD_NAME

In [3]:
# Dictionary mapping the video_id to the channel_id
vid_to_channels = video_id_to_channel_id()

In [4]:
# Dictionary of duplicate users (user id -> occurrence counter)
duplicate_users = dict_occurent_users()

In [5]:
# Index <-> id mappings for the channels kept after filtering
dict_channel_ind, dict_ind_channel, channels_id = filtered_channels_index_id_mapping()

In [11]:
def create_pairs(data, user_channels):
    '''
    Select at most CONTEXT_SIZE channels from the channels this user has commented on,
    build the 2-combinations of the selected channels and append them to data.

    PARAMETERS:
        - data: List of channel pairs from the users already processed (modified in place)
        - user_channels: List of channels a given user has commented on
    '''
    if len(user_channels) > CONTEXT_SIZE:
        user_channels = random.sample(user_channels, CONTEXT_SIZE)
    for comb in itertools.combinations(user_channels, 2):
        data.append((comb[0], comb[1]))
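
As a quick check of what `create_pairs` produces, here is a self-contained sketch on toy input. The function body is repeated so the example runs on its own, and `CONTEXT_SIZE = 3` is a hypothetical value (the notebook reads the real one from `config.py`).

```python
import itertools
import random

CONTEXT_SIZE = 3  # hypothetical value; the notebook takes it from config.py


def create_pairs(data, user_channels):
    """Same logic as the notebook cell, repeated to keep the example self-contained."""
    if len(user_channels) > CONTEXT_SIZE:
        user_channels = random.sample(user_channels, CONTEXT_SIZE)
    for first, second in itertools.combinations(user_channels, 2):
        data.append((first, second))


random.seed(0)
pairs = []
create_pairs(pairs, [10, 20])            # 2 channels -> 1 pair
create_pairs(pairs, [1, 2, 3, 4, 5, 6])  # capped at 3 channels -> 3 pairs
print(len(pairs))
```

Note that `itertools.combinations` yields each unordered pair once, so a user contributes `(a, b)` but not `(b, a)`. With at most `CONTEXT_SIZE` channels per user, a single user adds at most `CONTEXT_SIZE * (CONTEXT_SIZE - 1) / 2` rows.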

In [12]:
# Adjust chunk_size as necessary -- defaults to 16,384 if not specified
reader = Zreader("/dlabdata1/youtube_large/youtube_comments.ndjson.zst", chunk_size=16384)

# PARAMETERS

# List of (channel_idx, channel2_idx) pairs forming the training data
data = []
# Indices
nb = 0
idx = 1
# Channels that the current user has commented on
user_channels = []
# Number of channels, Row and columns length of the sparse matrix
matrix_len = len(channels_id)

# Create directory if not existing
check_directory(COMMON_DLAB_PATH)

user = ''
begin_time = time.time()


if SUBSAMPLING:
    print('performing subsampling ...')
    
    # The with-block closes the file, so no explicit f.close() is needed
    with open(os.path.join(COMMON_PATH, "vocab_occ.pkl"), 'rb') as f:
        vocab_occ = pickle.load(f)
    total = np.sum(vocab_occ) # Total number of comments
    
    selected_channels = []
    for channel in range(len(vocab_occ)):
        frac = vocab_occ[channel]/total
        # Probability of discarding this channel (word2vec-style subsampling)
        prob = 1 - np.sqrt(SAMPLING_RATE/frac)

        sampling = np.random.sample()
        if (sampling >= prob):
            selected_channels.append(channel)
    # Convert to a set once, after the loop, for the intersection below
    selected_channels = set(selected_channels)

print('Create training set ...')
# Read each line from the reader
for line in reader.readlines():
    line_split = line.replace('"', '').split(',')
    if len(line_split) >= 9:
        author_id = line_split[0]
        if vid_to_channels.get(line_split[2]) in channels_id:
            corr_channel = dict_channel_ind[vid_to_channels[line_split[2]]]
            if author_id == user:
                # if user is a duplicate user
                if author_id in duplicate_users:
                    if duplicate_users[author_id] <= 1:
                        user_channels.append(corr_channel)
                else:
                    user_channels.append(corr_channel)
            else:
                if SUBSAMPLING:
                    user_channels = list(set(user_channels).intersection(selected_channels))
                else:
                    user_channels = list(set(user_channels))
                
                # We need at least 2 channels to build a line in the training set.
                if len(user_channels) >= 2:
                    create_pairs(data, user_channels)
                user_channels = []
                
                # To limit memory usage, flush the accumulated pairs to disk
                if len(data) >= 50000000:
                    df = pd.DataFrame(data)
                    if nb == 0:
                        df.to_csv(TRAINING_DATA_PATH, compression='gzip', index = False)
                    else:
                        df.to_csv(TRAINING_DATA_PATH, compression='gzip', mode='a', index = False, header = False)
                    nb += 1
                    data = []
                    df = 0
                    print('idx ' + str(idx))
                    print('nb ' + str(nb))
                    
                # If user is a duplicate user
                if author_id in duplicate_users:
                    duplicate_users[author_id] += 1
                    if duplicate_users[author_id] <= 1:
                        user_channels.append(corr_channel)
                else:
                    user_channels.append(corr_channel)
           
        user = author_id
    idx += 1
    
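# Process the last user, whose channels are never flushed inside the loop
if SUBSAMPLING:
    user_channels = list(set(user_channels).intersection(selected_channels))
else:
    user_channels = list(set(user_channels))
if len(user_channels) >= 2:
    create_pairs(data, user_channels)

# Write the remaining pairs
df = pd.DataFrame(data)
df.to_csv(TRAINING_DATA_PATH, compression='gzip', mode='a', index = False, header = False)
data = 0
df = 0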

Create training set ...
idx 76827
nb 1
idx 155879
nb 2
idx 247193
nb 3
idx 333841
nb 4
idx 403207
nb 5
idx 483070
nb 6
idx 564200
nb 7
idx 643304
nb 8
idx 736036
nb 9
idx 813628
nb 10
idx 897520
nb 11


KeyboardInterrupt: 
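
The main loop writes the pairs in chunks, appending each chunk to the same gzip-compressed CSV and emitting the header only for the first chunk. The sketch below illustrates that append pattern with the standard library alone (`gzip` and `csv` in place of pandas, as a deliberate simplification). The file name, the helper functions and the toy rows are all hypothetical, and the header `0,1` mimics the default integer column names of a DataFrame built from a list of tuples.

```python
import csv
import gzip
import os
import tempfile


def append_chunk(path, rows, write_header):
    """Append rows as a new gzip member; the header is written only once."""
    with gzip.open(path, "at", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["0", "1"])
        writer.writerows(rows)


def read_all(path):
    """Read back every row; gzip transparently concatenates the members."""
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.reader(f))


tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_training_data.csv.gz")  # hypothetical file
append_chunk(toy_path, [(1, 2), (1, 3)], write_header=True)   # first chunk
append_chunk(toy_path, [(4, 5)], write_header=False)          # later chunk
rows = read_all(toy_path)
```

Reading the file back returns the header followed by the rows of both chunks, which suggests that a run interrupted between two flushes still leaves a readable file containing the chunks written so far.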