# Loading cables data

Desired output: a `Vector{Cable}`, where each `Cable` contains `time::Float64`, `node::Int`, and `mark::SparseVector{Int}`.
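
As a rough illustration of the target type, a minimal sketch might look like the following. The struct definition itself is an assumption (the notebook never defines `Cable`), and `SparseVector{Int,Int}` is used in place of `SparseVector{Int}` only so that the field has a concrete type.

```julia
using SparseArrays

# Hypothetical definition; field names and types follow the description above.
struct Cable
    time::Float64                 # time of the document
    node::Int                     # node ID
    mark::SparseVector{Int,Int}   # sparse word-count vector
end

# Toy example: a cable at time 2.5 on node 3 over a 10-word vocabulary,
# with word 2 appearing once and word 7 appearing four times.
example_cable = Cable(2.5, 3, sparsevec([2, 7], [1, 4], 10))
```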

Some notes on the data:

- `train.tsv`, `test.tsv`, and `validation.tsv` contain the train, test, and validation sets, respectively. Each row contains three tab-delimited integers: the document ID, the word ID, and the number of times that word appears in the document.
- `meta.tsv` contains the node and timestamp IDs of each document. Each row contains three tab-delimited integers: the document ID, the node ID, and the timestamp ID.

We'll use the following universal column names when working with tables:

- `docid`: a unique document ID
- `nodeid`: a unique node ID
- `dateid`: a unique timestamp ID (note: this is to be contrasted with the actual time itself, which has a meaningful numerical value)
- `wordid`: a unique word ID
- `word_count`: the number of times `wordid` appears in `docid`
- `date`: the actual date in a string format
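
To make the row layout of the dataset files concrete, here is a small sketch that parses one tab-delimited row into the column names above. The helper `parse_triplet` and the sample row are made up for illustration; the notebook itself reads the files with `CSV`.

```julia
# Hypothetical helper: parse one tab-delimited row of three integers
# into a named tuple using the column names listed above.
function parse_triplet(line::AbstractString)
    fields = split(strip(line), '\t')
    length(fields) == 3 ||
        throw(ArgumentError("expected 3 tab-delimited fields, got $(length(fields))"))
    docid, wordid, word_count = parse.(Int, fields)
    return (docid = docid, wordid = wordid, word_count = word_count)
end

# Made-up row: document 0 contains word 41 twice.
row = parse_triplet("0\t41\t2")
```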

In [1]:
import CSV
import JLD
using DataFrames
using SparseArrays

DATADIR = "/home/anthony/data/cables/"

UNIQUE_ENTITIES = "unique_entities.txt"
UNIQUE_DATES = "unique_dates.txt"

META = "meta.tsv"
DATES = "dates.tsv"

TRAIN = "train.tsv"
TEST = "test.tsv"
VALIDATION = "validation.tsv"
;

## Load raw data into memory

In [2]:
dataset_name = TRAIN

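# Load the metadata. The third column holds timestamp IDs
# (the `dateid` described above), even though it is named `date` here.
meta = DataFrame(CSV.File(DATADIR * META, header=[:docid, :nodeid, :date]))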

@show first(meta, 5)
;

first(meta, 5) = 5×3 DataFrame
│ Row │ docid │ nodeid │ date  │
│     │ Int64 │ Int64  │ Int64 │
├─────┼───────┼────────┼───────┤
│ 1   │ 0     │ 0      │ 2002  │
│ 2   │ 1     │ 1      │ 181   │
│ 3   │ 2     │ 1      │ 182   │
│ 4   │ 3     │ 1      │ 183   │
│ 5   │ 4     │ 1      │ 183   │


## Extract the data

In [3]:
# Get the sparse doc-word matrix
doc_inds = Int[]
word_inds = Int[]
word_counts = Int[]

# Instead of materializing a DataFrame, stream the
# file row by row and collect the triplets, because
# the train dataset is too large to load as a table.
for row in CSV.Rows(
    DATADIR * dataset_name,
    header=[:docid, :wordid, :word_count],
    types=[Int, Int, Int]
)
    # IDs in the files start at 0; shift them to Julia's 1-based indexing
    push!(doc_inds, row.docid+1)
    push!(word_inds, row.wordid+1)
    push!(word_counts, row.word_count)
end

In [4]:
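# Build the sparse doc-word count matrix from the (row, column, value) triplets
doc_word_mat = sparse(doc_inds, word_inds, word_counts)

# Remove garbage from memory (please don't die computer!)
# `doc_inds` is kept because it is still needed below.
word_inds = nothing
word_counts = nothing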
GC.gc()
;

2114195×21819 SparseMatrixCSC{Int64,Int64} with 112674036 stored entries:
  [31     ,     1]  =  1
  [55     ,     1]  =  1
  [62     ,     1]  =  1
  [65     ,     1]  =  1
  [67     ,     1]  =  1
  [73     ,     1]  =  1
  [75     ,     1]  =  2
  [92     ,     1]  =  1
  [148    ,     1]  =  2
  [161    ,     1]  =  1
  [166    ,     1]  =  1
  [238    ,     1]  =  1
  ⋮
  [1915747, 21819]  =  1
  [1917058, 21819]  =  1
  [1918733, 21819]  =  1
  [1918878, 21819]  =  1
  [1918943, 21819]  =  1
  [1921051, 21819]  =  1
  [1921129, 21819]  =  1
  [1964356, 21819]  =  1
  [1981318, 21819]  =  1
  [2034855, 21819]  =  1
  [2035548, 21819]  =  1
  [2055796, 21819]  =  1
  [2071179, 21819]  =  1
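
For intuition about how the triplets become a matrix, here is a toy version of the same construction. All values are made up. Two behaviors of `sparse(I, J, V)` are worth noting: the dimensions are inferred from the largest indices, and repeated `(row, column)` pairs are summed.

```julia
using SparseArrays

# Made-up zero-based (docid, wordid, count) triplets;
# the pair (0, 1) appears twice on purpose.
docids  = [0, 0, 2, 0]
wordids = [1, 3, 0, 1]
counts  = [2, 1, 5, 4]

# Shift to 1-based indices, as the loading loop does.
toy_mat = sparse(docids .+ 1, wordids .+ 1, counts)

size(toy_mat)   # (3, 4): inferred from the largest row and column index
toy_mat[1, 2]   # 6: the duplicate entries 2 and 4 are summed
```

If the matrix needs a fixed shape regardless of which IDs happen to occur (for example, so that train and test share a vocabulary size), the dimensions can be passed explicitly with `sparse(I, J, V, m, n)`.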

In [9]:
# Find all the unique document IDs in the dataset
docs = unique(doc_inds)

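# Get the node for each document. `docs` holds 1-based indices, so this
# assumes row i of `meta` describes the document with docid i-1.
nodes = meta[docs, :nodeid]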

# Finally, get the dates for each document
dates = meta[docs, :date]
;
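
The positional lookup above works when the rows of `meta.tsv` are ordered by document ID. If that ordering cannot be assumed, one alternative is to key the lookup on `docid` instead. The sketch below uses plain vectors with made-up values in place of the `meta` columns; all names in it are hypothetical.

```julia
# Made-up metadata columns, deliberately not sorted by docid.
meta_docid  = [2, 0, 1]
meta_nodeid = [7, 5, 5]
meta_dateid = [182, 2002, 181]

# Map each zero-based docid to its row position in the metadata.
row_of = Dict(d => i for (i, d) in enumerate(meta_docid))

# One-based document indices, as produced by the loading loop.
toy_docs = [1, 3]

# Convert back to zero-based docids before looking up each row.
rows = [row_of[d - 1] for d in toy_docs]
toy_nodes = meta_nodeid[rows]
toy_dates = meta_dateid[rows]
```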

## Save the result as a JLD file

In [10]:
# Swap the dataset's file extension for ".jld"
filepath = DATADIR * first(splitext(dataset_name)) * ".jld"
JLD.@save filepath docs dates nodes doc_word_mat
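
From the four saved arrays, the `Vector{Cable}` described at the top could be assembled roughly as follows. This is only a sketch on toy stand-ins: the `Cable` struct is a hypothetical definition, and because the mapping from timestamp ID to actual time is not shown here, the timestamp ID is simply converted to `Float64` as a placeholder.

```julia
using SparseArrays

# Hypothetical definition matching the desired output described earlier.
struct Cable
    time::Float64
    node::Int
    mark::SparseVector{Int,Int}
end

# Toy stand-ins for `docs`, `nodes`, `dates`, and `doc_word_mat`.
toy_docs  = [1, 3]
toy_nodes = [5, 7]
toy_dates = [2002, 182]
toy_mat   = sparse([1, 1, 3], [2, 4, 1], [6, 1, 5])

# Transpose once so that each document is a column; column slices
# are much cheaper than row slices in the CSC format.
word_doc_mat = copy(transpose(toy_mat))

# Placeholder: the timestamp ID stands in for the actual time.
cables = [
    Cable(Float64(toy_dates[k]), toy_nodes[k], word_doc_mat[:, toy_docs[k]])
    for k in eachindex(toy_docs)
]
```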