## Notebook description

This helper notebook is used to:
- Build the set of users that appear multiple times in the `comments_dataset` file
- Compute the number of comments per channel
- Filter the video-channel mapping to keep only the selected channels
- Compute the number of users after filtering the set of channels
- Compute the random-jumper and user-jumper pairs of channels

In [2]:
import time
import pickle
import os
import sys
import scipy.sparse

import numpy as np
import pandas as pd

scriptpath = "/home/jouven/youtube_projects/"
sys.path.append(os.path.abspath(scriptpath))

from helpers.helpers_channels_more_300 import *

# -------------------------------------------------------------------------------------------

### Video channel mapping filtered

From the original dictionary holding every video-channel relationship, we keep only the entries whose channel belongs to the selected channels.
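
As a minimal sketch of this filtering step, the snippet below applies the same idea to toy data. The video and channel ids are hypothetical stand-ins for the real mapping, and `filter_video_mapping` is an illustrative helper, not a function from the project's `helpers` module.

```python
# Toy stand-ins for the real mappings (hypothetical values).
vid_to_channels = {
    "vid_a": "chan_1",
    "vid_b": "chan_2",
    "vid_c": "chan_3",
    "vid_d": "chan_1",
}
channels_id = {"chan_1", "chan_3"}


def filter_video_mapping(video_to_channel, selected_channels):
    """Keep only the videos whose channel is in selected_channels."""
    selected = set(selected_channels)  # set membership is O(1) on average
    return {vid: ch for vid, ch in video_to_channel.items() if ch in selected}


vid_to_channels_filtered = filter_video_mapping(vid_to_channels, channels_id)
print(len(vid_to_channels), len(vid_to_channels_filtered))
```

Converting the selected channels to a set keeps each lookup cheap, which matters when the mapping holds many millions of videos.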

In [None]:
# Selected channels and id-index mapping
dict_channel_ind, dict_ind_channel, channels_id = filtered_channels_index_id_mapping()

In [None]:
vid_to_channels = pd.read_pickle("/dlabdata1/youtube_large/id_to_channel_mapping.pickle")

In [None]:
print('Original length of the relationship dataframe ', len(vid_to_channels))

In [None]:
# Set of channels selected
dict_channel_ind, dict_ind_channel, channels_id = filtered_channels_index_id_mapping()

In [None]:
# Keep only the videos whose channel is one of the selected channels
vid_to_channels_filtered = {}
for vid, channel in vid_to_channels.items():
    if dict_channel_ind.get(channel) is not None and channel in channels_id:
        vid_to_channels_filtered[vid] = channel

In [None]:
print('Filtered length of the relationship dataframe ', len(vid_to_channels_filtered))

In [None]:
# Store the filtered video_id-to-channel mapping
# (the `with` block closes the file, so no explicit close is needed)
with open("/dlabdata1/youtube_large/jouven/channels_more_300/video_to_channel_mapping_filtered.pkl",'wb') as f:
    pickle.dump(vid_to_channels_filtered, f)

# -------------------------------------------------------------------------------------------

### Duplicate users

Some users have duplicate rows in the `comments_dataset` file. The goal is to find these users so that their duplicate rows can be skipped when reading the file.

To find them, we read the whole `comments_dataset` file and record the user of each block of comments made on the selected channels. Since the file is ordered by user, we append a user to the list each time the author differs from the author of the previous line; a user recorded more than once therefore has duplicate rows.
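
The idea can be illustrated on a toy stream of author ids. This is a sketch under simplifying assumptions: the data is hypothetical, the channel filter is omitted, and `block_starts` and `find_duplicate_users` are illustrative names rather than helpers from this project.

```python
from collections import Counter
from itertools import groupby

# Toy stream of author ids, in file order (hypothetical data).
# "u1" appears in two separate blocks, so it is a duplicate user.
authors = ["u1", "u1", "u2", "u3", "u3", "u1", "u4"]


def block_starts(author_stream):
    """Return one entry per block of consecutive rows from the same author."""
    return [author for author, _ in groupby(author_stream)]


def find_duplicate_users(author_stream):
    """Return the authors that start more than one block."""
    counts = Counter(block_starts(author_stream))
    return sorted(author for author, n in counts.items() if n > 1)


starts = block_starts(authors)
duplicates = find_duplicate_users(authors)
print(starts)
print(duplicates)
```

Only one entry per block is kept, so the intermediate list grows with the number of blocks rather than with the number of comments.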

In [None]:
# Selected channels and id-index mapping
dict_channel_ind, dict_ind_channel, channels_id = filtered_channels_index_id_mapping()

# Dictionary mapping the video_id to the channel_id
vid_to_channels = video_id_to_channel_id()

In [5]:
# Adjust chunk_size as necessary -- defaults to 16,384 if not specified
reader = Zreader("/dlabdata1/youtube_large/youtube_comments.ndjson.zst", chunk_size=160384)

# parameters
idx = 1
user = ''
author_id = ''  # defined up front in case the first line is malformed
begin_time = time.time()
# Users having commented
users = []
nb = 0

# Read each line from the reader
for line in reader.readlines():
    line_split = line.replace('"', '').split(',')
    if len(line_split) >= 9:
        author_id = line_split[0]
        if vid_to_channels.get(line_split[2]) in channels_id:
            if author_id != user:
                users.append(author_id)
            
    if len(users) >= 50000000:
        print(str(idx) + ' line have been processed')
        with open("/dlabdata1/youtube_large/jouven/idx.pkl",'wb') as f:
            pickle.dump([idx], f)
        df = pd.DataFrame({'users': users})
        if nb == 0:
            df.to_csv('/dlabdata1/youtube_large/jouven/channels_more_300/users.csv.gz', compression='gzip', index = False)
        else:
            df.to_csv('/dlabdata1/youtube_large/jouven/channels_more_300/users.csv.gz', compression='gzip', mode='a', index = False, header = False)
        nb += 1
        df = 0
        users = []
        
    user = author_id
    idx += 1
    
print(str(idx) + ' line have been processed')
with open("/dlabdata1/youtube_large/jouven/idx.pkl",'wb') as f:
    pickle.dump([idx], f)
df = pd.DataFrame({'users': users})
if nb == 0:
    df.to_csv('/dlabdata1/youtube_large/jouven/channels_more_300/users.csv.gz', compression='gzip', index = False)
else:
    df.to_csv('/dlabdata1/youtube_large/jouven/channels_more_300/users.csv.gz', compression='gzip', mode='a', index = False, header = False)
nb += 1
df = 0
users = []

1276141994 line have been processed
2552907222 line have been processed
3829866804 line have been processed
5105730512 line have been processed
6382036099 line have been processed
7659055277 line have been processed
8935401407 line have been processed
10210173875 line have been processed
10365014154 line have been processed


From this dataframe, we identify the duplicate users: a user who starts more than one block of comments appears more than once in the `users` column.

In [2]:
users = pd.read_csv('/dlabdata1/youtube_large/jouven/channels_more_300/users.csv.gz', compression='gzip')

In [4]:
nb_users = len(users.drop_duplicates())

In [5]:
print('Number of users ', nb_users)

379473215

In [7]:
with open("/dlabdata1/youtube_large/jouven/channels_more_300/nb_users.pkl",'wb') as f:
    pickle.dump([nb_users], f)

In [8]:
duplicate_users = users[users.duplicated()].drop_duplicates()

In [92]:
print('Number of users having duplicate rows =  ', len(duplicate_users))

17825

In [None]:
with open("/dlabdata1/youtube_large/jouven/channels_more_300/duplicate_users.pkl",'wb') as f:
    pickle.dump(list(duplicate_users['users']), f)

### Number of comments per channel

Compute the number of comments per channel by traversing the whole `comments_dataset` file, and store the result in a dictionary. For users with duplicate rows, only their first block of comments is counted.
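
The counting rule can be sketched on toy rows as follows. The `(author_id, channel_id)` rows and the `duplicate_users` set are hypothetical, and `count_comments_per_channel` is an illustrative name; the sketch assumes that repeated blocks of a duplicate user must be ignored entirely.

```python
from collections import Counter

# Toy rows (author_id, channel_id), ordered by author except that "u1"
# comes back in a second block (hypothetical data).
rows = [
    ("u1", "c1"), ("u1", "c2"),
    ("u2", "c1"),
    ("u1", "c1"), ("u1", "c2"),  # repeated block for "u1": must be skipped
    ("u3", "c2"),
]
duplicate_users = {"u1"}


def count_comments_per_channel(rows, duplicate_users):
    """Count comments per channel, keeping only the first block of each duplicate user."""
    comments_per_channel = Counter()
    blocks_seen = Counter()  # number of blocks started so far, per duplicate user
    previous_author = None
    for author, channel in rows:
        if author != previous_author and author in duplicate_users:
            blocks_seen[author] += 1  # a new block starts for this duplicate user
        previous_author = author
        if author in duplicate_users and blocks_seen[author] > 1:
            continue  # row belongs to a repeated block
        comments_per_channel[channel] += 1
    return dict(comments_per_channel)


comments_per_channel = count_comments_per_channel(rows, duplicate_users)
print(comments_per_channel)
```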

In [None]:
# Set of duplicate users
duplicate_users = dict_occurent_users()

# Selected channels and id-index mapping
dict_channel_ind, dict_ind_channel, channels_id = filtered_channels_index_id_mapping()

# Dictionary mapping the video_id to the channel_id
vid_to_channels = video_id_to_channel_id()

In [13]:
def add_comment(dictionnary, corr_channel):
    '''
    Add one comment to the dictionary keeping track of the number of comments per channel.
    PARAMETERS:
        - dictionnary: dictionary mapping each channel to its number of comments
        - corr_channel: the channel to which a comment is added
    '''
    if corr_channel in dictionnary:
        dictionnary[corr_channel] += 1
    else:
        dictionnary[corr_channel] = 1

In [14]:
# Adjust chunk_size as necessary -- defaults to 16,384 if not specified
reader = Zreader("/dlabdata1/youtube_large/youtube_comments.ndjson.zst", chunk_size=160384)

# parameters
idx = 1
comments_per_channel = {}
user = ''
begin_time = time.time()


# Read each line from the reader
for line in reader.readlines():
    line_split = line.replace('"', '').split(',')
    if len(line_split) == 9:
        author_id = line_split[0]
        if vid_to_channels.get(line_split[2]) in channels_id:
            corr_channel = vid_to_channels[line_split[2]]
            if author_id == user:
                if author_id in duplicate_users:
                    if duplicate_users[author_id] <= 1:
                        add_comment(comments_per_channel, corr_channel)
                else:
                    add_comment(comments_per_channel, corr_channel)
            else:
                if author_id in duplicate_users:
                    duplicate_users[author_id] += 1
                    if duplicate_users[author_id] <= 1:
                        add_comment(comments_per_channel, corr_channel)
                else:
                    add_comment(comments_per_channel, corr_channel)
            
            user = author_id
    idx += 1
    if idx % 1000000000 == 0:
        print(begin_time-time.time())
        begin_time = time.time()
    

-6333.590427398682
-6144.500817298889
-6062.126186847687
-5905.280533075333
-6172.949314117432
-6818.088992357254
-6077.464460372925
-5907.094409942627
-5780.842119693756
-5753.591685295105


In [15]:
# Store the comments_per_channel dictionary into the YouTube project
with open("/dlabdata1/youtube_large/jouven/comments_per_channel.pkl",'wb') as f:
    pickle.dump(comments_per_channel, f)

# -------------------------------------------------------------------------------------------

### Compute the pairs of channels for the user and random jumper

In [None]:
NB_SAMPLE = 3000

# Load the channel tuple sparse matrix
S = scipy.sparse.load_npz('/dlabdata1/youtube_large/jouven/final_sparse_matrix/channels_more_300/sparse_matrix_word2vec_users_commented_geq_2_channels.npz')

Compute the 3000 pairs of channels with the new random jumper method.
For every pair, we take 2 users at random and, for each user, select one of their channels at random.
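
A small standard-library sketch of this sampling scheme is given below. It assumes a hypothetical `user_channels` dictionary in place of the sparse matrix, and `sample_random_jumper_pairs` is an illustrative name; the seed only makes the toy run repeatable.

```python
import random

# Hypothetical toy data: for each user, the channels they commented on.
user_channels = {
    "u1": ["c1", "c2"],
    "u2": ["c2", "c3"],
    "u3": ["c1", "c3", "c4"],
    "u4": ["c2", "c4"],
    "u5": ["c1", "c4"],
    "u6": ["c3", "c4"],
}


def sample_random_jumper_pairs(user_channels, nb_pairs, seed=0):
    """Draw 2 * nb_pairs distinct users, then one random channel per user."""
    rng = random.Random(seed)
    selected_users = rng.sample(sorted(user_channels), 2 * nb_pairs)
    pairs = []
    # Consecutive users (0, 1), (2, 3), ... each contribute one channel to a pair.
    for first_user, second_user in zip(selected_users[0::2], selected_users[1::2]):
        first_channel = rng.choice(user_channels[first_user])
        second_channel = rng.choice(user_channels[second_user])
        pairs.append((first_channel, second_channel))
    return pairs


pairs = sample_random_jumper_pairs(user_channels, nb_pairs=3)
print(pairs)
```

Because channels are drawn through users, channels with many commenters are picked more often than under uniform channel sampling. The two channels of a pair may also coincide.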

In [None]:
# Each pair needs 2 users, so sample 2 * NB_SAMPLE distinct users (columns of S)
selected_users = np.random.choice(np.arange(S.shape[1]), 2 * NB_SAMPLE, replace=False)
S2 = S[:, selected_users]

# Create and store channels tuples
channels_tuple = []
last = 0
for i in range(S2.shape[1]):
    idx = S2[:, i].nonzero()
    idx = idx[0]
    
    if i % 2 == 1:
        selected_channel = np.random.choice(idx, 1)
        channels_tuple.append((last, selected_channel))
    else:
        last = np.random.choice(idx, 1)
    
with open("/dlabdata1/youtube_large/jouven/channels_more_300/channels_tuple_random_walk_new.pkl",'wb') as f:
    pickle.dump(channels_tuple, f)

Construct the random pairs of channels used to compute the random jumper of a model.
Each pair is obtained by selecting 2 channels at random from the set of channels.
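
As a minimal sketch, uniform pairs of distinct channels can be drawn as shown below. The channel list is hypothetical and `sample_random_channel_pairs` is an illustrative name; whether the two channels of a pair must be distinct is an assumption.

```python
import random

# Hypothetical toy set of channels.
channels = ["c1", "c2", "c3", "c4", "c5"]


def sample_random_channel_pairs(channels, nb_pairs, seed=0):
    """Draw nb_pairs pairs, each made of 2 distinct channels chosen uniformly."""
    rng = random.Random(seed)
    return [tuple(rng.sample(channels, 2)) for _ in range(nb_pairs)]


random_pairs = sample_random_channel_pairs(channels, nb_pairs=4)
print(random_pairs)
```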

In [None]:
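# `selected_channels_rw` is not defined elsewhere in this notebook; as an assumption
# consistent with the description above, draw 2 distinct channels (rows of S) per pair
selected_channels_rw = [np.random.choice(S.shape[0], 2, replace=False) for _ in range(NB_SAMPLE)]

channels_tuple = []
for val in selected_channels_rw:
    channels_tuple.append((val[0], val[1]))
with open("/dlabdata1/youtube_large/jouven/channels_more_300/channels_tuple_random_walk.pkl",'wb') as f:
    pickle.dump(channels_tuple, f)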


Construct the random pairs of channels used to compute the user jumper of a model.
Each pair is obtained by selecting a user at random and then selecting 2 of this user's channels at random.
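
The user jumper can be sketched with the standard library as follows. The `user_channels` dictionary is hypothetical toy data and `sample_user_jumper_pairs` is an illustrative name; the sketch assumes every user has commented on at least 2 channels, as the name of the loaded sparse matrix suggests.

```python
import random

# Hypothetical toy data: every user has commented on at least 2 channels.
user_channels = {
    "u1": ["c1", "c2"],
    "u2": ["c2", "c3", "c5"],
    "u3": ["c1", "c3", "c4"],
}


def sample_user_jumper_pairs(user_channels, nb_pairs, seed=0):
    """For each pair, pick one user at random, then 2 distinct channels of that user."""
    rng = random.Random(seed)
    users = sorted(user_channels)
    pairs = []
    for _ in range(nb_pairs):
        user = rng.choice(users)  # users are drawn with replacement
        pairs.append(tuple(rng.sample(user_channels[user], 2)))
    return pairs


user_pairs = sample_user_jumper_pairs(user_channels, nb_pairs=5)
print(user_pairs)
```

Both channels of a pair come from the same user, so such pairs reflect co-commenting behaviour, unlike pairs drawn independently.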

In [None]:
# Sample NB_SAMPLE users (columns of S); randint's upper bound is exclusive
selected_users = np.random.randint(S.shape[1], size=NB_SAMPLE)
S_sampled = S[:, selected_users]

# Create and store channels tuples
channels_tuple = []
for i in range(S_sampled.shape[1]):
    # Row indices of the non-zero entries are the channels of user i
    idx = S_sampled[:, i].nonzero()[0]

    selected_channels = np.random.choice(idx, 2, replace=False)
    channels_tuple.append((selected_channels[0], selected_channels[1]))
    
with open("/dlabdata1/youtube_large/jouven/channels_more_300/channels_tuple_user_walk.pkl",'wb') as f:
    pickle.dump(channels_tuple, f)

