# CIT-HEPTH

Reference:
- http://networkrepository.com/ca-cit-HepTh.php

**Description**:
> The Arxiv HEP-TH (high energy physics theory) citation graph comes from arXiv and covers all citations within the dataset. An edge from u to v indicates that paper u cites paper v. If a paper cites, or is cited by, a paper outside the dataset, the graph contains no information about it. The data covers papers from January 1993 to April 2003.



# Library

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import tensorflow as tf
import requests
import zipfile
from datetime import datetime
import os

# Download

In [19]:
link_dts = 'http://nrvis.com/download/data/ca/ca-cit-HepTh.zip'
dts_zip = 'ca-cit-HepTh.zip'
dts_name = 'ca-cit-HepTh.edges'

In [20]:
r1 = requests.get(link_dts, allow_redirects=True)
with open(dts_zip, 'wb') as fo:  # the context manager closes the file
    n_bytes = fo.write(r1.content)
n_bytes

7779843

In [21]:
with zipfile.ZipFile(dts_zip, 'r') as zip_ref:
    zip_ref.extractall()
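
Before extracting, it can help to confirm that the archive actually holds an `.edges` file. The sketch below is only an illustration: `find_member` is a hypothetical helper, and the in-memory archive (including its `readme.html` entry) is made up so the example runs without a download.

```python
import io
import zipfile


def find_member(zip_source, suffix=".edges"):
    """Return the names of archive members that end with `suffix`."""
    with zipfile.ZipFile(zip_source, "r") as zf:
        return [name for name in zf.namelist() if name.endswith(suffix)]


# Build a tiny stand-in archive in memory (hypothetical contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ca-cit-HepTh.edges", "% sym unweighted\n1 2 1 1015887601\n")
    zf.writestr("readme.html", "<html></html>")
buf.seek(0)

members = find_member(buf)
print(members)
```

With the real download, `find_member(dts_zip)` could be compared against `dts_name` before calling `extractall()`.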

# Handle data

In [27]:
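df = None
with open(dts_name, 'r') as fi:
    lines = fi.readlines()
    print(lines[:6])  # preview the header and the first edges
    # Skip the first four lines. Note: the preview shows a single '%' header
    # line, so this slice also discards the first three edges.
    lines = lines[4:]
    # Each remaining line reads "node_1 node_2 weight timestamp"
    lines_ = [list(map(int, line.strip().split())) for line in lines]
    print(lines_[:4])
    df = pd.DataFrame(data=lines_, columns=['node_1', 'node_2', 'weight', 'timestamp'])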

print()
print(df.dtypes)


['% sym unweighted\n', '1 2 1 1015887601\n', '1 3 1 1015887601\n', '1 4 1 1015887601\n', '1 5 1 1015887601\n', '1 6 1 1015887601\n']
[[1, 5, 1, 1015887601], [1, 6, 1, 1015887601], [1, 7, 1, 1015887601], [1, 8, 1, 1015887601]]

node_1       int64
node_2       int64
weight       int64
timestamp    int64
dtype: object
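
The preview above shows one `%` header line followed directly by edges, so a fixed-size slice can drop real rows. A minimal sketch of an alternative, assuming every non-data line starts with `%`: let `pd.read_csv` skip comment lines itself. The helper name `read_edges` is hypothetical, and the sample text reuses the first lines printed above.

```python
import io

import pandas as pd


def read_edges(source):
    """Parse a whitespace-separated edge list, skipping '%' comment lines."""
    return pd.read_csv(
        source,
        sep=r"\s+",      # columns are separated by whitespace
        comment="%",     # lines starting with '%' are ignored
        header=None,
        names=["node_1", "node_2", "weight", "timestamp"],
    )


sample = io.StringIO(
    "% sym unweighted\n"
    "1 2 1 1015887601\n"
    "1 3 1 1015887601\n"
    "1 4 1 1015887601\n"
)
edges = read_edges(sample)
print(edges)
print(edges.dtypes)
```

On the real file, `read_edges(dts_name)` would be the equivalent call; the resulting row count may differ from the one reported here because no edges are skipped.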


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2673130 entries, 0 to 2673129
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   node_1     int64
 1   node_2     int64
 2   weight     int64
 3   timestamp  int64
dtypes: int64(4)
memory usage: 81.6 MB


In [29]:
df.describe()

| | node_1 | node_2 | weight | timestamp |
|---|---|---|---|---|
| count | 2673130.0 | 2673130.0 | 2673130.0 | 2673130.0 |
| mean | 3769.401 | 8666.282 | 1.0 | 1002332000.0 |
| std | 4792.042 | 6395.25 | 0.0 | 40888600.0 |
| min | 1.0 | 3.0 | 1.0 | 749430000.0 |
| 25% | 559.0 | 3093.0 | 1.0 | 1015888000.0 |
| 50% | 1561.0 | 6939.5 | 1.0 | 1015888000.0 |
| 75% | 4823.0 | 14613.0 | 1.0 | 1015888000.0 |
| max | 22906.0 | 22908.0 | 1.0 | 1015888000.0 |


We drop the `weight` column (it is 1 for every edge) and any row with `timestamp = 0`, because a temporal network cannot be built from edges that lack a timestamp.

In [30]:
df.drop(columns='weight', inplace=True)

In [31]:
df = df[df.timestamp != 0]
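
To see how many rows such a filter removes, the two cleaning steps can be wrapped in a small function that also reports the dropped count. This is a sketch on made-up rows, and `clean_edges` is a hypothetical helper name. In the summaries shown here the minimum timestamp is 749430000 and the count stays at 2673130, so the filter appears to remove nothing from this dataset.

```python
import pandas as pd


def clean_edges(df):
    """Drop rows without a timestamp and the constant `weight` column.

    Returns the cleaned frame and the number of rows removed.
    """
    missing = df["timestamp"] == 0
    cleaned = df.loc[~missing].drop(columns="weight")
    return cleaned, int(missing.sum())


# Toy rows (made up): the second one has no usable timestamp.
toy = pd.DataFrame({
    "node_1": [1, 1, 2],
    "node_2": [2, 3, 3],
    "weight": [1, 1, 1],
    "timestamp": [749430000, 0, 1015887601],
})
cleaned, dropped = clean_edges(toy)
print(dropped, list(cleaned.columns), len(cleaned))
```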

In [32]:
df.describe()

| | node_1 | node_2 | timestamp |
|---|---|---|---|
| count | 2673130.0 | 2673130.0 | 2673130.0 |
| mean | 3769.401 | 8666.282 | 1002332000.0 |
| std | 4792.042 | 6395.25 | 40888600.0 |
| min | 1.0 | 3.0 | 749430000.0 |
| 25% | 559.0 | 3093.0 | 1015888000.0 |
| 50% | 1561.0 | 6939.5 | 1015888000.0 |
| 75% | 4823.0 | 14613.0 | 1015888000.0 |
| max | 22906.0 | 22908.0 | 1015888000.0 |


# Creating dynamic graph
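Split the timestamp range into `k` equal-width bins, one graph per bin. Each snapshot is cumulative: it contains every edge whose timestamp falls below the bin's upper bound. The result is one dynamic graph made of `k` static snapshots.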

In [41]:
k = 6

In [42]:
timestamp_range = (df.timestamp.max() - df.timestamp.min() + 1)//k 
timestamp_range

44409600

In [43]:
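graphs_df = []
print("Start time: ", datetime.fromtimestamp(df.timestamp.min()) )
for i in range(k):
    upper_time = df.timestamp.min() + timestamp_range*(i+1)
    # The printed count applies the strict `< upper_time` filter to every bin,
    # so for the last bin it can undercount the final snapshot built below.
    print(f"[{i}|\tUpper_time= {datetime.fromtimestamp(upper_time)}\t |Row|= {len(df[df.timestamp<upper_time])}")
    if i == k-1:
        # Floor division can leave the last bound at or below the maximum
        # timestamp, so the final snapshot takes every row instead.
        graph_df = df.copy()
    else:
        # Snapshots are cumulative: each keeps all rows before its bound.
        graph_df = df[df.timestamp<upper_time].copy()
    graphs_df.append(graph_df)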

Start time:  1993-09-30 23:00:00
[0|	Upper_time= 1995-02-26 23:00:00	 |Row|= 6514
[1|	Upper_time= 1996-07-24 23:00:00	 |Row|= 26938
[2|	Upper_time= 1997-12-20 23:00:00	 |Row|= 130085
[3|	Upper_time= 1999-05-18 23:00:00	 |Row|= 223589
[4|	Upper_time= 2000-10-13 23:00:00	 |Row|= 290597
[5|	Upper_time= 2002-03-11 23:00:00	 |Row|= 290597
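
The last bin needs special handling because floor division can place the final upper bound at or below the largest timestamp. One possible alternative, sketched here on toy data, is to use ceiling division for the bin width so that `k` bins always cover the full range. The helper `cumulative_snapshots` and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def cumulative_snapshots(df, k, col="timestamp"):
    """Split `df` into k cumulative snapshots over equal-width time bins."""
    t_min, t_max = int(df[col].min()), int(df[col].max())
    span = t_max - t_min + 1
    width = -(-span // k)  # ceiling division: k * width >= span
    bounds = [t_min + width * (i + 1) for i in range(k)]
    # Each snapshot keeps every row strictly before its upper bound.
    snapshots = [df[df[col] < b].copy() for b in bounds]
    return snapshots, bounds


toy = pd.DataFrame({
    "node_1": [1, 1, 2, 3],
    "node_2": [2, 3, 3, 4],
    "timestamp": [10, 11, 14, 17],
})
snaps, bounds = cumulative_snapshots(toy, k=3)
print(bounds)
print([len(s) for s in snaps])
```

Because the last bound is strictly greater than the maximum timestamp, the final snapshot holds every row without a separate branch.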


In [44]:
graphs = []
for i in range(k):
    g = nx.from_pandas_edgelist(graphs_df[i], "node_1", "node_2", create_using=nx.Graph())
    graphs.append(g)
    print(f"Graph {i+1}:\t|V|={g.number_of_nodes()}\t|E|={g.number_of_edges()}")

Graph 1:	|V|=619	|E|=6374
Graph 2:	|V|=1660	|E|=23654
Graph 3:	|V|=3699	|E|=99588
Graph 4:	|V|=5654	|E|=167731
Graph 5:	|V|=6798	|E|=214693
Graph 6:	|V|=22908	|E|=2444795
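
Each edge count is lower than the matching row count (for example 6374 edges from 6514 rows in the first snapshot). One likely reason is that `nx.Graph` is undirected and stores a pair only once, so repeated rows and reciprocal pairs collapse. The toy sketch below, using made-up edges, contrasts this with `nx.DiGraph`, which would be the choice if citation direction mattered.

```python
import networkx as nx
import pandas as pd

# Toy edge list (made up): (1, 2) and (2, 1) are reciprocal rows.
toy = pd.DataFrame({"node_1": [1, 2, 1], "node_2": [2, 1, 3]})

undirected = nx.from_pandas_edgelist(
    toy, "node_1", "node_2", create_using=nx.Graph()
)
directed = nx.from_pandas_edgelist(
    toy, "node_1", "node_2", create_using=nx.DiGraph()
)

print("rows:", len(toy))
print("undirected |E|:", undirected.number_of_edges())
print("directed |E|:", directed.number_of_edges())
```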


# Save dynamic graph

In [45]:
NUMBER_SAVE_GRAPH = 6

In [46]:
folder = "../data/"
os.makedirs(folder, exist_ok=True)  # create the folder only if it is missing

In [47]:
for i in range(min(NUMBER_SAVE_GRAPH, k)):
    # Zero-padded index gives graph_00.edgelist, graph_01.edgelist, ...
    nx.write_edgelist(graphs[i], os.path.join(folder, f'graph_{i:02d}.edgelist'), data=False)
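
A saved snapshot can be loaded back with `nx.read_edgelist`. The sketch below round-trips a small made-up graph through a temporary directory, so it does not touch `../data/`. Passing `nodetype=int` is an assumption that the node ids should come back as integers rather than strings.

```python
import os
import tempfile

import networkx as nx

g = nx.Graph([(1, 2), (2, 3)])  # toy graph standing in for one snapshot

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph_00.edgelist")
    nx.write_edgelist(g, path, data=False)
    # nodetype=int converts the node labels back from text to integers
    loaded = nx.read_edgelist(path, nodetype=int)

print(loaded.number_of_nodes(), loaded.number_of_edges())
```

Note that an edge list stores only edges, so nodes without any edge would not survive the round trip.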