## NOTEBOOK DESCRIPTION:

This notebook builds a sparse matrix that represents a proximity graph between channels.
Since the dataset is sorted by user, we process the users sequentially:

	For each user:
		We select the channels this user commented in.
		Since our comments have no timestamp, we construct the edges of the graph from all possible pairs (2-combinations) of the selected channels.

In [1]:
import json
import time
import pickle
import scipy.sparse
import sys
import os
import random
import itertools
import math
import glob

import zstandard as zstd
import pandas as pd
import numpy as np

from scipy.sparse import dok_matrix

scriptpath = "/home/jouven/youtube_projects/"
sys.path.append(os.path.abspath(scriptpath))

from helpers.helpers_channels_more_10k import *

In [2]:
# Dictionary mapping the video_id to the channel_id
vid_to_channels = video_id_to_channel_id()

In [3]:
# Channels with the selected comments
dict_channel_ind, dict_ind_channel, channels_id = filtered_channels_index_id_mapping()

In [4]:
# Duplicate users (mapping from author_id to an occurrence counter)
duplicate_users = dict_occurent_users()

To evaluate the similarity between channels, we create what we call a proximity graph.
Since the dataset is sorted by user, we process the users sequentially:
For each user:
	We select the channels this user commented in.

Since our comments have no timestamp, we construct the edges of the graph from all possible pairs (2-combinations) of the selected channels.
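As a small illustration of this edge construction, the sketch below lists the pairs produced for one hypothetical user who commented in three channels. The channel indices are made up for the example.

```python
import itertools

# Hypothetical channel indices for a single user (illustrative values only)
user_channels = [3, 7, 12]

# Every unordered pair of channels becomes one edge of the proximity graph
edges = list(itertools.combinations(user_channels, 2))
print(edges)  # [(3, 7), (3, 12), (7, 12)]
```

A user who commented in n distinct channels therefore contributes n(n-1)/2 edges.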

In [5]:
'''
Function to add new edges to the existing data. Edges are formed by taking the 2-combinations of user_channels. 
Each edge gets a weight of 1, or 1 more than its existing weight if the edge already exists
    PARAMETERS:
        - graph_dict: dictionary mapping each edge (tuple of channel indices) to the weight of that edge
        - user_channels: channels that a user commented in
'''

def create_edges(graph_dict, user_channels):
    
    for comb in itertools.combinations(user_channels, 2):
        edge = (comb[0], comb[1])
        if edge in graph_dict:
            graph_dict[edge] += 1
        else:
            graph_dict[edge] = 1
            
'''
Function to add new edges to the existing data. Edges are formed by first selecting at most context channels at random
and then taking the 2-combinations of the selected channels. 
Each edge gets a weight of 1, or 1 more than its existing weight if the edge already exists
    PARAMETERS:
        - graph_dict: dictionary mapping each edge (tuple of channel indices) to the weight of that edge
        - user_channels: channels that a user commented in
        - context: maximum number of channels selected for a given user
'''
def create_edges_limited(graph_dict, user_channels, context):
    
    if len(user_channels) > context:
        # random.sample needs a sequence, so convert in case a set is passed
        user_channels = random.sample(list(user_channels), context)
    
    for comb in itertools.combinations(user_channels, 2):
        edge = (comb[0], comb[1])

        if edge in graph_dict:
            graph_dict[edge] += 1
        else:
            graph_dict[edge] = 1
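
The counting pattern of the two functions above can also be expressed with `collections.Counter`. The following is only a sketch under that assumption: `add_user_edges` is a hypothetical name that does not exist in the helpers, and the channel indices are illustrative.

```python
import itertools
import random
from collections import Counter

def add_user_edges(graph_counter, user_channels, context=None):
    """Add 1 to the weight of every channel pair of one user.

    Hypothetical helper mirroring create_edges / create_edges_limited:
    when context is given, at most `context` channels are sampled first.
    """
    channels = list(user_channels)
    if context is not None and len(channels) > context:
        channels = random.sample(channels, context)
    # Counter.update increments the count of each pair by 1
    graph_counter.update(itertools.combinations(channels, 2))

graph = Counter()
add_user_edges(graph, [0, 1, 2])             # adds (0, 1), (0, 2), (1, 2)
add_user_edges(graph, [0, 1])                # (0, 1) now has weight 2
add_user_edges(graph, range(10), context=3)  # 3 sampled channels, so 3 pairs
```

With `context=3`, a user with 10 channels adds 3 pairs instead of 45, which is the point of limiting the context.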
        
            

In [8]:
'''
Function to add new edges to the existing data. Edges are formed by taking the 2-combinations of user_channels. 
Each edge receives 1/log2(number of channels that this user has commented in), added to its existing weight if the edge already exists
    PARAMETERS:
        - graph_dict: dictionary mapping each edge (tuple of channel indices) to the weight of that edge
        - user_channels: channels that a user commented in (at least 2, otherwise the logarithm is 0)
'''
def create_edges_normalized(graph_dict, user_channels):
    
    nb_comments = len(user_channels)
    
    for comb in itertools.combinations(user_channels, 2):

        edge = (comb[0], comb[1])
        if edge in graph_dict:
            graph_dict[edge] += 1/(math.log(nb_comments, 2))
        else:
            graph_dict[edge] = 1/(math.log(nb_comments, 2))

'''
Function to add new edges to the existing data. Edges are formed by first selecting at most context channels at random
and then taking the 2-combinations of the selected channels. 
Each edge receives 1/log2(number of channels that this user has commented in, counted before sampling),
added to its existing weight if the edge already exists
    PARAMETERS:
        - graph_dict: dictionary mapping each edge (tuple of channel indices) to the weight of that edge
        - user_channels: channels that a user commented in (at least 2, otherwise the logarithm is 0)
        - context: maximum number of channels selected for a given user
'''
def create_edges_limited_normalized(graph_dict, user_channels, context):
    
    nb_comments = len(user_channels)
    if len(user_channels) > context:
        # random.sample needs a sequence, so convert in case a set is passed
        user_channels = random.sample(list(user_channels), context)
    
    for comb in itertools.combinations(user_channels, 2):
        edge = (comb[0], comb[1])
        if edge in graph_dict:
            graph_dict[edge] += 1/(math.log(nb_comments, 2))
        else:
            graph_dict[edge] = 1/(math.log(nb_comments, 2))
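
To give a feeling for the normalization, the sketch below prints the weight that a single user adds to each of their edges, depending on how many channels they commented in. `edge_weight` is a hypothetical name introduced only for this illustration.

```python
import math

def edge_weight(nb_channels):
    """Weight added per edge by a user with nb_channels channels (1 / log2)."""
    return 1 / math.log(nb_channels, 2)

# The more channels a user comments in, the less each of their edges counts
weights = {n: round(edge_weight(n), 3) for n in (2, 4, 8, 1024)}
print(weights)  # {2: 1.0, 4: 0.5, 8: 0.333, 1024: 0.1}
```

The weight is undefined for a single channel, since log2(1) = 0. The main loop only builds edges for users with at least two entries in user_channels, so this case does not occur there.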

In [None]:
# Adjust chunk_size as necessary -- defaults to 16,384 if not specified
reader = Zreader("/dlabdata1/youtube_large/youtube_comments.ndjson.zst", chunk_size=16384)

# PARAMETERS

# Dictionary mapping each edge (channel_idx, channel2_idx) to its accumulated weight
graph_dict_limited = {}
# Counters: nb numbers the saved matrices, idx counts the lines read
nb = 1
idx = 1
# Channels that the current user has commented in
user_channels = []
# Number of channels, i.e. the number of rows and columns of the sparse matrix
matrix_len = len(channels_id)
context = 30

user = ''
begin_time = time.time()

dir_1 = '/dlabdata1/youtube_large/jouven/sparse_matrix_construction/channels_more_10k/sparse_matrices_limited_normalized/'
check_directory(dir_1)

# Read each line from the reader
for line in reader.readlines():
    line_split = line.replace('"', '').split(',')
    if len(line_split) >= 9:
        author_id = line_split[0]
        if vid_to_channels.get(line_split[2]) in channels_id:
            corr_channel = dict_channel_ind[vid_to_channels[line_split[2]]]
            if author_id == user:
                # if user is a duplicate user
                if author_id in duplicate_users:
                    if duplicate_users[author_id] <= 1:
                        user_channels.append(corr_channel)
                else:
                    user_channels.append(corr_channel)
            else:
                if len(user_channels) >= 2:
                    create_edges_limited_normalized(graph_dict_limited, user_channels, context)
                user_channels = []
                
                if len(graph_dict_limited) >= 75000000:
                    # To limit memory usage, once the dictionary holds 75 million edges, create a dok matrix,
                    # fill it from graph_dict_limited, save it in csr format and then release the memory.
                    # dict.update only works here because dok_matrix is a dict subclass in the SciPy version used
                    graph_matrix = dok_matrix((matrix_len, matrix_len))
                    dict.update(graph_matrix, graph_dict_limited)
                    graph_dict_limited = {}
                    # Save sparse matrix
                    scipy.sparse.save_npz('/dlabdata1/youtube_large/jouven/sparse_matrix_construction/channels_more_10k/sparse_matrices_limited_normalized/matrice' + str(nb) + '.npz', graph_matrix.tocsr())
                    # Save the current line index (the with block closes the file on exit)
                    with open("/dlabdata1/youtube_large/jouven/sparse_matrix_construction/idx2.pkl",'wb') as f:
                         pickle.dump([idx], f)
                    graph_matrix = []
                    nb += 1
                    print('line number: ' + str(idx) + ' time: ' + str(time.time() - begin_time))
                    begin_time = time.time()
                        
                # If user is a duplicate user
                if author_id in duplicate_users:
                    duplicate_users[author_id] += 1
                    if duplicate_users[author_id] <= 1:
                        user_channels.append(corr_channel)
                else:
                    user_channels.append(corr_channel)
           
        user = author_id
    idx += 1
    

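# Process the channels of the last user, which the loop above never flushes
if len(user_channels) >= 2:
    create_edges_limited_normalized(graph_dict_limited, user_channels, context)

# Store the remaining edges in a sparse matrix
graph_matrix = dok_matrix((matrix_len, matrix_len))
dict.update(graph_matrix, graph_dict_limited)
graph_dict_limited = {}
# Save sparse matrix
scipy.sparse.save_npz('/dlabdata1/youtube_large/jouven/sparse_matrix_construction/channels_more_10k/sparse_matrices_limited_normalized/matrice' + str(nb) + '.npz', graph_matrix.tocsr())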

graph_matrix = []


line number: 137270327 time: 1158.5001480579376
line number: 275022892 time: 1073.48898935318
line number: 412639213 time: 1065.8698954582214
line number: 549837858 time: 1062.13547539711
line number: 687178507 time: 1079.4762296676636
line number: 825136521 time: 1076.4855399131775
line number: 962940178 time: 1063.9683725833893
line number: 1100249592 time: 1062.4339764118195
line number: 1237301359 time: 1085.1867506504059
line number: 1374517380 time: 1076.8347342014313
line number: 1512419854 time: 1054.8100345134735
line number: 1650211681 time: 1053.404995918274
line number: 1787434072 time: 1091.6070857048035
line number: 1925929449 time: 1100.9692499637604
line number: 2062947711 time: 1053.5515422821045
line number: 2200687072 time: 1045.3022074699402
line number: 2613319873 time: 1085.4445509910583
line number: 2750808720 time: 1069.2701110839844
line number: 2889148950 time: 1250.6160681247711
line number: 3026568974 time: 1185.6369943618774
line number: 3164283203 time: 11
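
The loop above detects user boundaries by comparing consecutive author ids. As a sketch of the same idea, `itertools.groupby` can group a stream that is sorted by author. The `comments` list below is made-up, already-parsed data, and the sketch leaves out the channel filtering and the duplicate-user handling of the real loop.

```python
import itertools

# Hypothetical, already-parsed stream of (author_id, channel_index) pairs,
# sorted by author as in the comments dataset
comments = [
    ("u1", 0), ("u1", 4), ("u1", 4),
    ("u2", 2),
    ("u3", 1), ("u3", 0),
]

channels_per_user = []
for author_id, rows in itertools.groupby(comments, key=lambda row: row[0]):
    # Collect every channel of this author before building any edge
    user_channels = [channel for _, channel in rows]
    if len(user_channels) >= 2:
        channels_per_user.append((author_id, user_channels))

print(channels_per_user)
# [('u1', [0, 4, 4]), ('u3', [1, 0])]
```

Grouping this way also handles the last user of the stream without a separate step after the loop. Note that, as in the loop above, a channel appears once per comment, so repeated channels are kept.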

In [11]:
'''
Load every intermediate sparse matrix and sum their weights to form a single final matrix
'''

dir_1 = '/dlabdata1/youtube_large/jouven/final_sparse_matrix/channels_more_10k/proximity_graph'
check_directory(dir_1)

matrix_len = len(channels_id)

path = '/dlabdata1/youtube_large/jouven/sparse_matrix_construction/channels_more_10k/sparse_matrices_limited_normalized/'
files = glob.glob(path + '*.npz')

graph = dok_matrix((matrix_len, matrix_len)).tocsr()
for file in files: 
    graph += scipy.sparse.load_npz(file)
# Save the final sparse matrix
scipy.sparse.save_npz('/dlabdata1/youtube_large/jouven/final_sparse_matrix/channels_more_10k/proximity_graph/channel_by_channel_ln_30.npz', graph)
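
The two steps used in this notebook (turning an edge dictionary into a sparse matrix, then summing the partial matrices) can be sketched in memory as follows. This is an illustration under assumptions, not the code that produced the saved files: `dict_to_csr` is a hypothetical helper that builds the matrix through `coo_matrix` instead of `dict.update`, and the two small dictionaries stand in for the saved chunks.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def dict_to_csr(graph_dict, size):
    """Build a (size x size) CSR matrix from {(row, col): weight}."""
    if not graph_dict:
        return csr_matrix((size, size))
    rows, cols = zip(*graph_dict.keys())
    weights = np.fromiter(graph_dict.values(), dtype=float, count=len(graph_dict))
    return coo_matrix((weights, (rows, cols)), shape=(size, size)).tocsr()

# Two hypothetical partial dictionaries, standing in for two saved chunks
chunk_1 = {(0, 1): 1.0, (1, 2): 0.5}
chunk_2 = {(0, 1): 0.5, (2, 3): 1.0}

# Summing the partial matrices adds the weights of edges present in both
graph = csr_matrix((4, 4))
for chunk in (chunk_1, chunk_2):
    graph = graph + dict_to_csr(chunk, 4)

print(graph[0, 1])  # 1.5
```

Building through COO does not depend on `dok_matrix` being a dict subclass, which may make it easier to run on other SciPy versions.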
