## Data Processing Exploratory Notebook


This notebook contains exploratory work on processing the data for two datasets: the Amazon dataset and the Taobao dataset.

We will start by looking at the data for the Amazon dataset.

In [2]:
## importing the necessary libraries for the Amazon data set
import torch
import numpy as np
from torch_geometric.datasets import AmazonBook
import pandas as pd

Basic data exploration for the Taobao dataset

In [3]:
taobao_userbehavior_df = pd.read_csv("taobao/raw/UserBehavior.csv", header = None)

In [4]:
print(taobao_userbehavior_df.head())
taobao_userbehavior_df.columns = ['user_id', 'item_id', 'category_id', 'behavior_type', 'timestamp']
taobao_userbehavior_df.head()

   0        1        2   3           4
0  1  2268318  2520377  pv  1511544070
1  1  2333346  2520771  pv  1511561733
2  1  2576651   149192  pv  1511572885
3  1  3830808  4181361  pv  1511593493
4  1  4365585  2520377  pv  1511596146


   user_id  item_id  category_id  behavior_type   timestamp
0        1  2268318      2520377             pv  1511544070
1        1  2333346      2520771             pv  1511561733
2        1  2576651       149192             pv  1511572885
3        1  3830808      4181361             pv  1511593493
4        1  4365585      2520377             pv  1511596146
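The `timestamp` values look like Unix epoch seconds, and `behavior_type` is a short string code. Below is a minimal sketch, assuming the timestamps really are epoch seconds (the notebook does not confirm this), of converting them to datetimes and counting behavior codes. `sample_df` is a hypothetical frame built only from the five rows printed above, not from the full file.

```python
import pandas as pd

# Hypothetical sample: the five rows printed above, not the full file.
sample_df = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 1],
        "item_id": [2268318, 2333346, 2576651, 3830808, 4365585],
        "category_id": [2520377, 2520771, 149192, 4181361, 2520377],
        "behavior_type": ["pv", "pv", "pv", "pv", "pv"],
        "timestamp": [1511544070, 1511561733, 1511572885, 1511593493, 1511596146],
    }
)

# Assumption: timestamps are Unix epoch seconds (UTC).
sample_df["datetime"] = pd.to_datetime(sample_df["timestamp"], unit="s")

# Count how often each behavior code occurs.
behavior_counts = sample_df["behavior_type"].value_counts()

print(sample_df[["timestamp", "datetime"]].head())
print(behavior_counts)
```

On the full frame, the same two steps would give the time span covered by the log and the mix of behavior types.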


Looking into the data.pt dataset

In [5]:
taobao_pt = torch.load('./taobao/processed/data.pt')
print(type(taobao_pt))
print(len(taobao_pt))
print(type(taobao_pt[0]))
print(type(taobao_pt[1]))
print(type(taobao_pt[2]))
taobao_data = taobao_pt[0]


<class 'tuple'>
3
<class 'dict'>
<class 'NoneType'>
<class 'abc.ABCMeta'>


In [6]:
taobao_data

{'_global_store': {},
 'user': {'num_nodes': 987991},
 'item': {'num_nodes': 4161138},
 'category': {'num_nodes': 9437},
 ('user',
  'to',
  'item'): {'edge_index': tensor([[      0,       0,       0,  ...,  970447,  970447,  970447],
          [1827766, 1880345, 2076699,  ..., 2939548, 1534057, 2978718]]), 'time': tensor([1511544070, 1511561733, 1511572885,  ..., 1512293792, 1512293827,
          1512293891]), 'behavior': tensor([0, 0, 0,  ..., 0, 0, 0])},
 ('item',
  'to',
  'category'): {'edge_index': tensor([[1827766, 1880345, 2076699,  ...,  848356,  522299, 2015151],
          [   4564,    4565,     259,  ...,    4637,    4565,    8438]])}}

The already processed data.pt probably cannot be used as-is: the raw data still needs to be preprocessed into the appropriate knowledge graph.
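If the raw file has to be preprocessed from scratch, one early step is remapping raw IDs to contiguous indices, which is what the processed tensors above appear to use (raw user 1 shows up as index 0). The sketch below is one possible way to do that with `np.unique`, not the method that produced data.pt. The frame `raw_df` and the `behavior_to_id` mapping are hypothetical; only `pv -> 0` is visible in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the UserBehavior.csv layout.
raw_df = pd.DataFrame(
    {
        "user_id": [1, 1, 100, 100, 7],
        "item_id": [2268318, 2333346, 2268318, 50, 50],
        "behavior_type": ["pv", "pv", "buy", "cart", "pv"],
    }
)

# np.unique returns the sorted unique IDs and, for each row, the position of
# its ID in that sorted array, which serves as a contiguous node index.
user_ids, user_idx = np.unique(raw_df["user_id"].to_numpy(), return_inverse=True)
item_ids, item_idx = np.unique(raw_df["item_id"].to_numpy(), return_inverse=True)

# Shape (2, num_edges): row 0 holds user indices, row 1 holds item indices.
edge_index = np.stack([user_idx, item_idx])

# Assumption: this code-to-integer mapping is illustrative only.
behavior_to_id = {"pv": 0, "cart": 1, "buy": 2, "fav": 3}
behavior = raw_df["behavior_type"].map(behavior_to_id).to_numpy()

print(edge_index)
print(behavior)
```

Keeping `user_ids` and `item_ids` around makes it possible to translate indices back to the raw IDs later.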

Looking specifically into the amazon-book dataset files to better understand what they look like

In [7]:
amazon_entity_df = pd.read_csv('./amazon-book/entity_list.txt', delimiter=' ', on_bad_lines='skip')
amazon_item_list_df = pd.read_csv('./amazon-book/item_list.txt', delimiter=' ', on_bad_lines='skip')
amazon_relation_df = pd.read_csv('./amazon-book/relation_list.txt', delimiter=' ', on_bad_lines='skip')
amazon_user_df = pd.read_csv('./amazon-book/user_list.txt', delimiter=' ', on_bad_lines='skip')

In [8]:
print(amazon_entity_df.shape)
print(amazon_item_list_df.shape)
print(amazon_relation_df.shape)
print(amazon_user_df.shape)

(83460, 2)
(24915, 3)
(39, 2)
(70679, 2)
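These shapes only count the rows pandas managed to parse. Because the files are read with `on_bad_lines='skip'`, any line with more fields than the header is dropped silently. The sketch below shows one way to check how many lines were skipped. `sample_text` is hypothetical content, not taken from the real files, and whether the real files contain such lines is not known here.

```python
import io
import pandas as pd

# Hypothetical file content: the third data line has an extra
# space-separated token, so it has three fields instead of two.
sample_text = (
    "org_id remap_id\n"
    "m.045wq1q 0\n"
    "m.03_28m 1\n"
    "m.bad id 2\n"
    "m.060c1r 3\n"
)

parsed = pd.read_csv(io.StringIO(sample_text), delimiter=" ", on_bad_lines="skip")

# Non-empty data lines in the raw text, excluding the header.
n_data_lines = sum(1 for line in sample_text.splitlines()[1:] if line.strip())
n_skipped = n_data_lines - len(parsed)

print("data lines:", n_data_lines, "parsed rows:", len(parsed), "skipped:", n_skipped)
```

Running the same comparison against each real file would show whether the row counts above match the files' actual line counts.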


In [9]:
print('amazon entity mapping')
print(len(amazon_entity_df['org_id'].unique()))
print(len(amazon_entity_df['remap_id'].unique()))
print(amazon_entity_df.head())

print('amazon item mapping')
print(len(amazon_item_list_df['org_id'].unique()))
print(len(amazon_item_list_df['remap_id'].unique()))
print(len(amazon_item_list_df['freebase_id'].unique()))
print(amazon_item_list_df.head())

print('amazon relation mapping')
print(len(amazon_relation_df['org_id'].unique()))
print(len(amazon_relation_df['remap_id'].unique()))
print(amazon_relation_df.head())

print('amazon user mapping')
print(len(amazon_user_df['org_id'].unique()))
print(len(amazon_user_df['remap_id'].unique()))
print(amazon_user_df.head())



amazon entity mapping
83460
83460
      org_id  remap_id
0  m.045wq1q         0
1   m.03_28m         1
2  m.0h2q1cq         2
3  m.04y9jxd         3
4   m.060c1r         4
amazon item mapping
24915
24915
24915
       org_id  remap_id freebase_id
0  0553092626         0   m.045wq1q
1  0393316041         1    m.03_28m
2  038548254X         2   m.0h2q1cq
3  0385307756         3   m.04y9jxd
4  038531258X         4    m.060c1r
amazon relation mapping
39
39
                                              org_id  remap_id
0        http://rdf.freebase.com/ns/type.object.type         0
1      http://rdf.freebase.com/ns/type.type.instance         1
2  http://rdf.freebase.com/ns/book.written_work.c...         2
3    http://www.w3.org/1999/02/22-rdf-syntax-ns#type         3
4  http://rdf.freebase.com/ns/kg.object_profile.p...         4
amazon user mapping
70679
70679
           org_id  remap_id
0  A3RTKL9KB8KLID         0
1  A38LAIK2N83NH0         1
2  A3PPXVR5J6U2JD         2
3  A2ULDDL3MLJPUR     

In every file the number of unique org_id values equals the number of unique remap_id values and the number of rows, so each mapping is one-to-one.
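The same check can be written more compactly with `Series.is_unique`. This is a small sketch, assuming a hypothetical `mapping_df` in the layout of item_list.txt; the helper name `is_one_to_one` is also hypothetical.

```python
import pandas as pd

# Hypothetical sample in the layout of item_list.txt.
mapping_df = pd.DataFrame(
    {
        "org_id": ["0553092626", "0393316041", "038548254X"],
        "remap_id": [0, 1, 2],
        "freebase_id": ["m.045wq1q", "m.03_28m", "m.0h2q1cq"],
    }
)


def is_one_to_one(df, columns):
    """Return True if none of the listed columns contains duplicate values."""
    return all(df[col].is_unique for col in columns)


# Check that remap_id is also a contiguous range 0..n-1.
is_contiguous = sorted(mapping_df["remap_id"].tolist()) == list(range(len(mapping_df)))

print(is_one_to_one(mapping_df, ["org_id", "remap_id", "freebase_id"]))
print(is_contiguous)
```

If every column is free of duplicates, each row pairs exactly one value per column, which is what makes the mapping one-to-one.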

In [11]:
amazon_kg = pd.read_csv('./amazon-book/kg_final.txt', delimiter=' ', on_bad_lines='skip', header=None)

print('shape of the kg dataset:', amazon_kg.shape)
amazon_kg.columns = ['head', 'relation', 'tail']
print(amazon_kg.head())
print('unique relations:', len(amazon_kg['relation'].unique()))
print('unique heads:', len(amazon_kg['head'].unique()))
print('unique tails:', len(amazon_kg['tail'].unique()))
print('max heads:', amazon_kg['head'].max())
print('max tails:', amazon_kg['tail'].max())

shape of the kg dataset: (2557746, 3)
    head  relation   tail
0  24915         0  24916
1  24917         1   5117
2  24918         0  24917
3  24919         1  24920
4  24921         2  24922
unique relations: 39
unique heads: 113308
unique tails: 113479
max heads: 113486
max tails: 113486


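The knowledge graph contains more unique heads (113308) and unique tails (113479) than the 83460 rows loaded from entity_list.txt, and its IDs run up to 113486. Some IDs in the graph therefore have no counterpart in the entity, item, or user mapping frames loaded above.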
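To see exactly which IDs are affected, the IDs used in the triples can be compared with the `remap_id` values in the mapping table. The sketch below uses hypothetical miniature frames (`entity_df`, `kg_df`) in the same column layout; it illustrates the check and does not reproduce the real counts.

```python
import pandas as pd

# Hypothetical miniature versions of entity_list.txt and kg_final.txt.
entity_df = pd.DataFrame({"org_id": ["m.a", "m.b", "m.c"], "remap_id": [0, 1, 2]})
kg_df = pd.DataFrame(
    {"head": [0, 1, 3, 4], "relation": [0, 1, 0, 1], "tail": [1, 2, 4, 0]}
)

# All entity IDs that occur anywhere in the triples.
kg_ids = set(kg_df["head"].tolist()) | set(kg_df["tail"].tolist())
known_ids = set(entity_df["remap_id"].tolist())

# IDs referenced by the KG but absent from the mapping table.
unmapped_ids = sorted(kg_ids - known_ids)

print("KG entity ids:", len(kg_ids))
print("unmapped ids:", unmapped_ids)
```

A non-empty result on the real data would be consistent with rows missing from the loaded mapping frame, for example lines dropped by `on_bad_lines='skip'`, though that cause is only a guess.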

In [82]:
# amazon_train = pd.read_csv('./amazon-book/train.txt', delimiter=' ', on_bad_lines='skip', header = None)
# amazon_test = pd.read_csv('./amazon-book/test.txt', delimiter  = ' ', on_bad_lines = 'skip', header = None)

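def _load_ratings(file_name):
    """Parse lines of the form 'user_id item_id item_id ...'."""
    user_dict = dict()  # user id -> list of unique item ids
    inter_mat = list()  # flat list of [user id, item id] pairs

    # Use a context manager so the file handle is closed after reading.
    with open(file_name, 'r') as f:
        lines = f.readlines()

    for line in lines:
        tmps = line.strip()
        inters = [int(i) for i in tmps.split(' ')]

        u_id, pos_ids = inters[0], inters[1:]
        pos_ids = list(set(pos_ids))  # drop duplicate items for this user

        for i_id in pos_ids:
            inter_mat.append([u_id, i_id])

        if len(pos_ids) > 0:
            user_dict[u_id] = pos_ids
    return np.array(inter_mat), user_dict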
    

amazon_train_array, amazon_train_dict = _load_ratings('./amazon-book/train.txt')
amazon_test_array, amazon_test_dict = _load_ratings('./amazon-book/test.txt')
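The `(user, item)` array returned by `_load_ratings` can be wrapped in a DataFrame to summarize interactions per user and per item. This is a minimal sketch with a hypothetical `inter_array` standing in for `amazon_train_array`; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the (user, item) array returned by _load_ratings.
inter_array = np.array([[0, 10], [0, 11], [0, 12], [1, 10], [2, 13], [2, 10]])

inter_df = pd.DataFrame(inter_array, columns=["user_id", "item_id"])

# Number of distinct items each user interacted with.
items_per_user = inter_df.groupby("user_id")["item_id"].nunique()

# Number of distinct users that interacted with each item.
users_per_item = inter_df.groupby("item_id")["user_id"].nunique()

print(items_per_user.describe())
print(users_per_item.sort_values(ascending=False).head())
```

The per-user distribution helps spot users with very few interactions, and the per-item counts show how skewed item popularity is.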

In [1]:
# number of users with at least one interaction in each split
print(len(amazon_train_dict))
print(len(amazon_test_dict))

# number of (user, item) interactions in each split
print(len(amazon_train_array))
print(len(amazon_test_array))

print(amazon_train_array.shape)
print('max item id train:', amazon_train_array[:, 1].max())
print('max item id test:', amazon_test_array[:, 1].max())

print('max user id train:', amazon_train_array[:, 0].max())
print('max user id test:', amazon_test_array[:, 0].max())

## basically every user shows up here
# ratio of test interactions to train interactions
print(len(amazon_test_array) / len(amazon_train_array))



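NameError: name 'amazon_train_dict' is not defined

The In [1] execution count suggests this cell was run in a fresh kernel, before the cell that calls _load_ratings, so amazon_train_dict did not exist yet. Re-running the loading cell first should resolve the error.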
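Once the dictionaries are available, the comparisons this cell is after (how many users appear in both splits, and how large the test split is relative to train) can be computed directly. The sketch below uses hypothetical `train_dict` and `test_dict` values in place of `amazon_train_dict` and `amazon_test_dict`.

```python
# Hypothetical stand-ins for the user -> item-list dicts from _load_ratings.
train_dict = {0: [10, 11, 12], 1: [10], 2: [13, 10]}
test_dict = {0: [14], 2: [15]}

train_users = set(train_dict)
test_users = set(test_dict)

# Share of training users that also have at least one test interaction.
user_coverage = len(train_users & test_users) / len(train_users)

# Ratio of test interactions to training interactions.
n_train = sum(len(items) for items in train_dict.values())
n_test = sum(len(items) for items in test_dict.values())
test_train_ratio = n_test / n_train

# Test users that never appear in training (cold-start users).
cold_start_users = sorted(test_users - train_users)

print(f"user coverage: {user_coverage:.2f}")
print(f"test/train ratio: {test_train_ratio:.2f}")
print("cold-start users:", cold_start_users)
```

A coverage close to 1 would back up the observation that basically every user shows up in both splits.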