<a href="https://colab.research.google.com/github/hanhluukim/replication-topic-modelling-in-embedding-space/blob/main/etm_in_julia/ETM_in_julia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Setting up the Julia development environment on Colab**

In [3]:
%%capture
%%shell
# Installation cell: download Julia 1.6.1 only if it is not already on the PATH
if ! command -v julia > /dev/null 2>&1
then
    wget -q 'https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.1-linux-x86_64.tar.gz' \
        -O /tmp/julia.tar.gz
    tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
    rm /tmp/julia.tar.gz
fi
julia -e 'using Pkg; pkg"add IJulia; precompile;"'
echo 'Done'

In [1]:
VERSION

v"1.6.1"

# **Optional GPU experiments**

In [2]:
using Pkg
Pkg.add(["BenchmarkTools", "CUDA"])


using BenchmarkTools, CUDA

if has_cuda_gpu()
  print("The GPU device is:", CUDA.device())
end

The GPU device is:CuDevice(0)

In [4]:
mcpu = rand(2^10, 2^10)
@benchmark mcpu*mcpu

BenchmarkTools.Trial: 79 samples with 1 evaluation.
 Range (min … max):  60.195 ms … 131.440 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.417 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   63.776 ms ±   8.926 ms  ┊ GC (mean ± σ):  0.71% ± 1.59%

In [5]:
println("The CuArray operation should take around 0.5 ms (excluding the CUDA download time, which is a one-time cost) and should be much faster than the CPU version. If so, the GPU is working.")
mgpu = cu(mcpu)
@benchmark CUDA.@sync mgpu*mgpu

The CuArray operation should take around 0.5 ms (excluding the CUDA download time, which is a one-time cost) and should be much faster than the CPU version. If so, the GPU is working.


BenchmarkTools.Trial: 2826 samples with 1 evaluation.
 Range (min … max):  1.190 ms …   1.172 s  ┊ GC (min … max): 0.00% … 0.48%
 Time  (median):     1.376 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.825 ms ± 22.021 ms  ┊ GC (mean ± σ):  0.11% ± 0.01%

# **Importing the packages needed for the implementation**

In [14]:
using Pkg
Pkg.add(["BenchmarkTools", "CUDA", "Flux", "TextAnalysis"])

In [22]:
Pkg.add("MAT")

In [23]:
using BenchmarkTools, CUDA
using Flux, TextAnalysis
using MAT

# **Reading the preprocessed BoW representations from the file: prepared_data/bow_train.mat**


1.   First, upload the preprocessed files to this Colab instance.



In [26]:
# matread opens the file, reads every variable into a Dict, and closes the file again,
# so no separate matopen handle is needed.
vars = matread("/content/bow_train.mat")
vars["train"]

Dict{String, Any} with 2 entries:
  "tokens" => Any[Int32[26 79 … 1882 1884] Int32[29 110 … 1622 1810] … Int32[2 …
  "counts" => Any[[1 1 … 1 1] [1 2 … 1 1] … [1 1 … 1 1] [1 1 … 1 1]]
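The `tokens` and `counts` entries above store each document as a pair of row matrices: word ids and how often each occurs. A minimal sketch of turning that sparse form into a dense documents × vocabulary count matrix follows. The helper name `bow_matrix` and the `zero_based` flag are hypothetical, and whether the ids in the file start at 0 or at 1 is an assumption that should be checked against the preprocessing script. The toy data only imitates the shape of the real entries.

```julia
# Build a dense (documents × vocabulary) count matrix from per-document
# word ids and counts. `zero_based=true` ASSUMES ids start at 0 (as they
# would when exported from Python); set it to false for 1-based ids.
function bow_matrix(tokens, counts, vocab_size; zero_based=true)
    offset = zero_based ? 1 : 0
    bow = zeros(Float32, length(tokens), vocab_size)
    for (doc, (ids, cnts)) in enumerate(zip(tokens, counts))
        for (id, c) in zip(ids, cnts)
            bow[doc, id + offset] += c
        end
    end
    return bow
end

# Toy data shaped like the entries shown above (one 1×n row matrix per document)
toy_tokens = Any[Int32[0 2 3], Int32[1 3]]
toy_counts = Any[[1 2 1], [3 1]]
toy_bow = bow_matrix(toy_tokens, toy_counts, 4)
```

With the real data, the call would presumably look like `bow_matrix(vars["train"]["tokens"], vars["train"]["counts"], 1936)`, with the vocabulary size taken from the embedding file.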

# **Reading the word2vec word embeddings from the file: prepared_data/vocab_embedding.txt**

In [43]:
# One embedding vector per vocabulary word, in file order
embeddings = Vector{Vector{Float64}}()
open("/content/vocab_embedding.txt") do file
  for line in eachline(file)
    # Each line has the form "word<TAB>v1 v2 ... vn"
    word, vector = split(line, "\t")
    push!(embeddings, parse.(Float64, split(vector, " ")))
  end
end
println("number of words in vocabulary: " * string(length(embeddings)))
print("dimension of word-embedding: " * string(length(embeddings[1])))

number of words in vocabulary: 1936
dimension of word-embedding: 300
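For the model, it is usually more convenient to hold the embeddings as one matrix rather than a vector of vectors. The sketch below parses a couple of made-up lines in the same `word<TAB>values` format and stacks the vectors column-wise. The names `toy_lines`, `words`, and `rho` are illustrative only, and `Float32` is chosen on the assumption that the matrix will later be moved to the GPU.

```julia
# Two made-up lines in the same "word<TAB>v1 v2 ... vn" format as the file
toy_lines = ["cat\t0.1 0.2 0.3", "dog\t0.4 0.5 0.6"]

words = String[]
vectors = Vector{Float32}[]
for line in toy_lines
    word, raw = split(line, "\t")
    push!(words, word)
    push!(vectors, parse.(Float32, split(raw, " ")))
end

# Stack the vectors as columns: embedding_dim × vocab_size
rho = reduce(hcat, vectors)
```

Applied to the `embeddings` read above, `reduce(hcat, embeddings)` would give a 300 × 1936 matrix.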

# **Implementing the ETM model**
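The notebook does not yet contain the model itself. As a starting point, here is a minimal sketch of only the decoder side of an embedded topic model, assuming the standard formulation in which each topic's word distribution is a softmax over the inner products of the word embeddings with a topic embedding. The function names `softmax_cols` and `etm_decode` are hypothetical. The encoder that would produce `theta` and the training loss are omitted, and plain arrays are used instead of Flux layers so that the sketch runs without packages.

```julia
# Column-wise softmax, written out so the sketch needs no packages
function softmax_cols(x)
    e = exp.(x .- maximum(x, dims=1))
    return e ./ sum(e, dims=1)
end

# rho:   embedding_dim × vocab_size word embeddings
# alpha: embedding_dim × num_topics topic embeddings
# theta: num_topics × num_docs topic proportions (each column sums to 1)
function etm_decode(rho, alpha, theta)
    beta = softmax_cols(rho' * alpha)  # vocab_size × num_topics, columns sum to 1
    return beta * theta                # vocab_size × num_docs word probabilities
end

# Random toy inputs: 5-dimensional embeddings, 8 words, 3 topics, 2 documents
rho = randn(Float32, 5, 8)
alpha = randn(Float32, 5, 3)
theta = softmax_cols(randn(Float32, 3, 2))
word_probs = etm_decode(rho, alpha, theta)
```

Each column of `word_probs` is a probability distribution over the vocabulary for one document, which is the quantity a reconstruction loss against the BoW counts would be computed from.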

# **Evaluating the topics using the following evaluation measures**