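# 1. Dataset import and preprocessing

A **subset** of the ACM dataset is used. The **parameters of the preprocessed dataset** are:
- Three areas: `Database, Wireless Communication, Data Mining`
- `Database: SIGMOD, VLDB`
- `Data Mining: KDD`
- `Wireless Communication: SIGCOMM, MobiCOMM`
- Node types: `Author, Paper, Subject`
- Edge types (undirected): `Paper-Author, Author-Paper, Paper-Subject, Subject-Paper`
- Meta-paths: `PAP, PSP`
- Semi-supervised learning target: `Paper-->Conference` (conferences are grouped into the three areas above, which serve as class labels)
- Training set size: 600 (200 per area)
- Validation set size: 300 (100 per area)
- Test set: all remaining `Paper` nodes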

In [57]:
from pathlib import Path
import scipy.io as sio
data_path = Path('../datasets/ACM/ACM.mat')
data = sio.loadmat(data_path)

In [144]:
data.keys()

dict_keys(['__header__', '__version__', '__globals__', 'TvsP', 'PvsA', 'PvsV', 'AvsF', 'VvsC', 'PvsL', 'PvsC', 'A', 'C', 'F', 'L', 'P', 'T', 'V', 'PvsT', 'CNormPvsA', 'RNormPvsA', 'CNormPvsC', 'RNormPvsC', 'CNormPvsT', 'RNormPvsT', 'CNormPvsV', 'RNormPvsV', 'CNormVvsC', 'RNormVvsC', 'CNormAvsF', 'RNormAvsF', 'CNormPvsL', 'RNormPvsL', 'stopwords', 'nPvsT', 'nT', 'CNormnPvsT', 'RNormnPvsT', 'nnPvsT', 'nnT', 'CNormnnPvsT', 'RNormnnPvsT', 'PvsP', 'CNormPvsP', 'RNormPvsP'])

## 1.1 Generating nodes

- paper_inx
- paper_target

In the data, the conference indices `[0, 1, 9, 10, 13]` correspond to `KDD, SIGMOD, SIGCOMM, MobiCOMM, VLDB` respectively, i.e. the subset selected in the paper.
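
As a small illustration of how these conference indices relate to the three areas, the grouping can be written as a lookup table. This is only a sketch: the names `AREA_CONFS` and `conf_to_area` and the toy `conf_ids` array are assumptions made for the example and do not appear in the notebook.

```python
import numpy as np

# Conference indices (rows of data['C']) grouped by area.
# Labels: 0 = database, 1 = data mining, 2 = wireless communication.
AREA_CONFS = {0: [1, 13], 1: [0], 2: [9, 10]}

# Invert the grouping into a conference -> area lookup.
conf_to_area = {c: area for area, confs in AREA_CONFS.items() for c in confs}

# Toy conference ids for five papers (hypothetical values).
conf_ids = np.array([13, 0, 9, 1, 10])
areas = np.array([conf_to_area[c] for c in conf_ids.tolist()])
print(areas)  # [0 1 2 0 2]
```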

In [145]:
# conferences
data['C']

array([[array(['KDD'], dtype='<U3')],
       [array(['SIGMOD'], dtype='<U6')],
       [array(['WWW'], dtype='<U3')],
       [array(['SIGIR'], dtype='<U5')],
       [array(['CIKM'], dtype='<U4')],
       [array(['SODA'], dtype='<U4')],
       [array(['STOC'], dtype='<U4')],
       [array(['SOSP'], dtype='<U4')],
       [array(['SPAA'], dtype='<U4')],
       [array(['SIGCOMM'], dtype='<U7')],
       [array(['MobiCOMM'], dtype='<U8')],
       [array(['ICML'], dtype='<U4')],
       [array(['COLT'], dtype='<U4')],
       [array(['VLDB'], dtype='<U4')]], dtype=object)

In [103]:
# paper vs conference
paper_conf = data['PvsC']

# (paper, conference) index pair of every non-zero entry; pairing the two
# arrays avoids relying on the order in which nonzero() returns entries
paper_ids, conf_ids = paper_conf.nonzero()
# Database: SIGMOD, VLDB; randomly keep 994 of the 1994 papers
paper_db = np.isin(conf_ids, [1, 13])
paper_db_inx = np.sort(np.random.choice(paper_ids[paper_db], 994, replace=False))
# Data Mining: KDD; keep all 1061 papers
paper_dm = np.isin(conf_ids, [0])
paper_dm_inx = paper_ids[paper_dm]
# Wireless Communication: SIGCOMM, MobiCOMM; keep all 970 papers
paper_wc = np.isin(conf_ids, [9, 10])
paper_wc_inx = paper_ids[paper_wc]

In [100]:
paper_db.sum(), paper_dm.sum(), paper_wc.sum()

(1994, 1061, 970)

In [129]:
# This gives 3025 papers in total, the same number as in the HAN paper
paper_inx = np.sort(np.concatenate((paper_db_inx, 
                                    paper_dm_inx, 
                                    paper_wc_inx), axis=0))
paper_inx.shape

(3025,)

In [135]:
# 0 : database, 1: data mining, 2: wireless communication
paper_target = np.zeros_like(paper_inx)
paper_target[np.isin(paper_inx, paper_dm_inx)] = 1
paper_target[np.isin(paper_inx, paper_wc_inx)] = 2

In [288]:
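# new (contiguous) paper index -> original paper index
paper_dict = dict(enumerate(paper_inx))
papers = np.array(list(paper_dict.keys()))
# invert the mapping: original paper index -> new paper index
paper_dict = {key:value for value, key in paper_dict.items()}
targets = paper_target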

In [296]:
num_papers = papers.shape[0]
num_papers

3025

## 1.2 Generating edges

Following the description in the paper, two meta-paths are used: `[PAP, PSP]`
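
For intuition, a meta-path such as `PAP` links two papers that share an author, and its adjacency can be obtained as a product of bipartite incidence matrices. The following is a minimal sketch on toy data; the matrix `pa` and the choice to drop self-loops are assumptions for illustration, not steps taken in this notebook.

```python
import numpy as np
import scipy.sparse as sp

# Toy paper-author incidence: 3 papers x 2 authors (hypothetical data).
# Papers 0 and 1 share author 0; paper 2 is written by author 1 only.
pa = sp.csr_matrix(np.array([[1, 0],
                             [1, 0],
                             [0, 1]]))

# PAP: entry (i, j) counts the authors shared by papers i and j.
pap = pa @ pa.T

# Binarise and drop self-loops to get a paper-paper neighbour graph.
pap_adj = (pap.toarray() > 0).astype(int)
np.fill_diagonal(pap_adj, 0)
print(pap_adj)
```

Whether self-loops are kept or removed is a modelling choice; `PSP` can be sketched the same way from a paper-subject matrix.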

In [353]:
# paper_author edges
p_a = np.transpose(data['PvsA'][paper_inx].nonzero())
# paper_subjects edges
p_s = np.transpose(data['PvsL'][paper_inx].nonzero())

In [317]:
authors = np.unique(p_a[:, 1])
author_dict = dict(enumerate(authors))
author_dict = {key:value+num_papers for value, key in author_dict.items()}

In [327]:
authors = np.frompyfunc(author_dict.get, 1, 1)(authors)

In [331]:
num_authors = authors.shape[0]

In [333]:
subjects = np.unique(p_s[:, 1])
subject_dict = {key:value+num_papers+num_authors for value, key in dict(enumerate(subjects)).items()}

In [335]:
subjects = np.frompyfunc(subject_dict.get, 1, 1)(subjects)

In [337]:
num_subjects = subjects.shape[0]

In [346]:
num_node = num_papers + num_authors + num_subjects

This completes the re-indexing of `authors, subjects`, so that the indices of `papers, authors, subjects` together form one contiguous range.
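
The same offsetting idea can be shown on a toy array. This sketch uses `np.unique(..., return_inverse=True)` instead of the dictionary lookups above; `raw_author_ids` and `global_author_ids` are hypothetical names chosen for the example.

```python
import numpy as np

num_papers = 4                               # papers occupy ids 0..3
raw_author_ids = np.array([17, 5, 17, 42])   # hypothetical original ids

# np.unique returns the sorted distinct ids plus, for each element,
# its position in that sorted list.
uniq, local = np.unique(raw_author_ids, return_inverse=True)

# Shift by num_papers so author ids start right after the paper ids.
global_author_ids = local + num_papers
print(global_author_ids)  # [5 4 5 6]
```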

In [391]:
vec_papers_trans = np.vectorize(paper_dict.get)
vec_authors_trans = np.vectorize(author_dict.get)
vec_subjects_trans = np.vectorize(subject_dict.get)

In [348]:
PA = sp.csr_matrix((np.ones_like(p_a[:, 0]), 
                    (p_a[:, 0], vec_authors_trans(p_a[:, 1]))), 
                   shape=(num_node, num_node))

In [356]:
PS = sp.csr_matrix((np.ones_like(p_s[:, 0]),
                    (p_s[:, 0], vec_subjects_trans(p_s[:, 1]))),
                   shape=(num_node, num_node))

In [363]:
AP = PA.transpose()
SP = PS.transpose()

In [367]:
print(f'nodes: {num_node}\n'\
      f'edges: {p_a.shape[0]+p_s.shape[0]}\n'\
      f'papers: {papers.shape[0]}\n'\
      f'authors: {authors.shape[0]}\n'\
      f'subjects: {subjects.shape[0]}\n'\
      f'paper_author edges: {p_a.shape}\n'\
      f'paper_subjects edges: {p_s.shape}\n')

nodes: 9100
edges: 13063
papers: 3025
authors: 6018
subjects: 57
paper_author edges: (10038, 2)
paper_subjects edges: (3025, 2)



## 1.3 Feature construction


In [445]:
paper_feats = data['TvsP'].transpose()[paper_inx]

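One detail here: how should the features of `authors, subjects` be computed?

They are also derived from the feature information of `papers`. Concretely, the `authors-papers` and `subjects-papers` matrices are each multiplied with the `papers-terms` matrix (a **matrix product**), so every author or subject receives the sum of the term vectors of its papers.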
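
A toy example may make the matrix product concrete. The matrices below are made up for illustration; only the operation itself mirrors what the notebook does.

```python
import numpy as np
import scipy.sparse as sp

# 3 papers x 4 terms bag-of-words (hypothetical values).
paper_terms = sp.csr_matrix(np.array([[1, 0, 1, 0],
                                      [0, 1, 0, 0],
                                      [0, 0, 1, 1]]))
# 2 authors x 3 papers: author 0 wrote papers 0 and 1, author 1 wrote paper 2.
author_papers = sp.csr_matrix(np.array([[1, 1, 0],
                                        [0, 0, 1]]))

# Each author row is the sum of the term vectors of that author's papers.
author_terms = author_papers @ paper_terms
print(author_terms.toarray())
```

Author 0 ends up with the terms of papers 0 and 1 combined, and author 1 simply inherits the terms of paper 2.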

In [453]:
num_authors, num_subjects

(6018, 57)

In [454]:
vec_authors_trans(p_a[:, 1]) - num_papers

array([1723, 1991, 4543, ...,  502, 2369, 4829])

In [462]:
# construct `authors-papers`, `subjects-papers` matrix
AP_tmp = sp.csr_matrix((np.ones_like(p_a[:, 0]), 
                        (vec_authors_trans(p_a[:, 1]) - num_papers, p_a[:, 0])), 
                       shape=(num_authors, num_papers))
SP_tmp = sp.csr_matrix((np.ones_like(p_s[:, 0]),
                        (vec_subjects_trans(p_s[:, 1]) - num_papers - num_authors, p_s[:, 0])),
                       shape=(num_subjects, num_papers))

In [458]:
AP_tmp

<6018x3025 sparse matrix of type '<class 'numpy.int32'>'
	with 10038 stored elements in Compressed Sparse Row format>

In [459]:
SP_tmp

<57x3025 sparse matrix of type '<class 'numpy.int32'>'
	with 3025 stored elements in Compressed Sparse Row format>

In [463]:
author_feats = AP_tmp.dot(paper_feats)
subject_feats = SP_tmp.dot(paper_feats)

In [474]:
features = sp.vstack((paper_feats, author_feats, subject_feats))

In [475]:
features

<9100x1903 sparse matrix of type '<class 'numpy.float64'>'
	with 989913 stored elements in Compressed Sparse Row format>

In [464]:
print(f'paper_feats: {paper_feats.shape}\n'\
     f'author_feats: {author_feats.shape}\n'\
     f'subject_feats: {subject_feats.shape}\n')

paper_feats: (3025, 1903)
author_feats: (6018, 1903)
subject_feats: (57, 1903)



## 1.4 Train / validation / test split

- train: 600
- val: 300
- test: 2125 (the rest)
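
A per-class split with fixed counts can also be written as a small helper with a seeded generator, which makes the split reproducible. This is a sketch under assumptions: `split_per_class` is a hypothetical helper, not part of the notebook, and it works on label arrays rather than on the `paper_*_inx` arrays used here.

```python
import numpy as np

def split_per_class(labels, n_train, n_val, seed=0):
    """Return (train, val, test) index arrays with fixed per-class counts."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        # Shuffle the positions of class c, then cut into three parts.
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])
    return tuple(np.concatenate(part) for part in (train, val, test))

# Toy labels: 3 classes with 10 papers each (hypothetical data).
labels = np.repeat([0, 1, 2], 10)
train_idx, val_idx, test_idx = split_per_class(labels, n_train=2, n_val=1)
print(len(train_idx), len(val_idx), len(test_idx))  # 6 3 21
```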

In [387]:
db_train_val = np.random.choice(paper_db_inx, 300, replace=False)
db_test = paper_db_inx[~np.isin(paper_db_inx, db_train_val)]
dm_train_val = np.random.choice(paper_dm_inx, 300, replace=False)
dm_test = paper_dm_inx[~np.isin(paper_dm_inx, dm_train_val)]
wc_train_val = np.random.choice(paper_wc_inx, 300, replace=False)
wc_test = paper_wc_inx[~np.isin(paper_wc_inx, wc_train_val)]

In [388]:
db_train_val.shape, db_test.shape

((300,), (694,))

In [389]:
dm_train_val.shape, dm_test.shape

((300,), (761,))

In [390]:
wc_train_val.shape, wc_test.shape

((300,), (670,))

In [476]:
train_papers = np.concatenate((db_train_val[:200], dm_train_val[:200], wc_train_val[:200]))
val_papers = np.concatenate((db_train_val[200:], dm_train_val[200:], wc_train_val[200:]))
test_papers = np.concatenate((db_test, dm_test, wc_test))

train_papers = vec_papers_trans(train_papers)
val_papers = vec_papers_trans(val_papers)
test_papers = vec_papers_trans(test_papers)

In [477]:
np.random.shuffle(train_papers)
np.random.shuffle(val_papers)
np.random.shuffle(test_papers)

In [478]:
train_targets = targets[train_papers]
val_targets = targets[val_papers]
test_targets = targets[test_papers]

In [479]:
print(f'train: {train_papers.shape[0]}\n'\
     f'val: {val_papers.shape[0]}\n'\
     f'test: {test_papers.shape[0]}')

train: 600
val: 300
test: 2125


## 1.5 Saving the data offline

The data is stored with pickle.

In [494]:
acm_data = {'papers': papers, 'authors': authors, 
            'subjects': subjects, 'PA': PA, 
            'PS': PS, 'AP': AP, 'SP':SP, 
            'train_papers': train_papers,
            'val_papers': val_papers,
            'test_papers': test_papers,
            'train_targets': train_targets,
            'val_targets': val_targets,
            'test_targets': test_targets,
            'paper_feats': paper_feats,
            'author_feats': author_feats,
            'subject_feats': subject_feats,
            'paper_dict': paper_dict,
            'author_dict': author_dict,
            'subject_dict': subject_dict}

In [507]:
import pickle
save_path = Path('../datasets/ACM') / 'acm_data'

In [517]:
with open(save_path, 'wb') as f:
    pickle.dump(acm_data, f)
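
To check that a saved file can be read back, a round trip along the following lines may help. It is a sketch that uses a toy dictionary and a temporary directory; `toy` and `restored` are hypothetical names, and the real `acm_data` file is not touched.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Small stand-in for acm_data (hypothetical contents).
toy = {'papers': np.arange(3), 'paper_dict': {10: 0, 11: 1, 12: 2}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy_data'
    # Context managers close the file handles even if an error occurs.
    with open(path, 'wb') as f:
        pickle.dump(toy, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored['paper_dict'][11])  # 1
```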