In [1]:
# Use Pkg.add("<PackageName>") to install these packages first.
using Turing, Turing.RandomMeasures
using Plots, StatsPlots
using Statistics, Random, LinearAlgebra
using MCMCChains
using DataFrames, CSV

Define the latent Dirichlet allocation (LDA) model.

In [2]:
@model function LDA(w, K, D)
    # K: number of topics
    # D: vocabulary size (number of unique words in the corpus)
    # M: number of documents, i.e. the length of the outer vector w
    M = size(w, 1)

    # topic distributions
    # A Vector of Vectors, size M, each initialized to undef
    # Each inner vector will have K entries that add up to 1.
    θ = Vector{Vector}(undef, M)
    α = 1.0
    for m = 1:M
        θ[m] ~ Dirichlet(K, α)
    end
    # println("theta:")
    # println(θ)

    # word distributions (for each topic)
    ψ = Vector{Vector}(undef, K)
    η = 0.01
    for k = 1:K
        ψ[k] ~ Dirichlet(D, η)
    end

    # println("ψ (word distributions for each topic):")
    # println(ψ)

    # one entry in outer vec per doc
    # the ints represent the topic assignment of each word
    z = Vector{Vector{Int}}(undef, M)

    for m = 1:M
        doc_length = size(w[m], 1)
        # in each doc, initialize each word's topic as 0
        z[m] = zeros(Int, doc_length)
        for n = 1:doc_length
            # select topic for word n in document m
            # draw from the topic distribution for that doc
            z[m][n] ~ Categorical(θ[m])
            # select symbol for word n in document m from topic z[m][n]
            # draw from the word distribution for that topic
            w[m][n] ~ Categorical(ψ[z[m][n]])
        end
    end
    # println("z:")
    # println(z)
    # println("w:")
    # println(w)
    return w
end


LDA (generic function with 2 methods)
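
To make the generative story in the model concrete, here is a minimal sketch that forward-simulates a toy corpus with plain Distributions.jl calls instead of Turing. The function name `simulate_lda` and the toy sizes are hypothetical and chosen only for illustration; the default `α` and `η` mirror the values in the model above.

```julia
using Distributions, Random

# Forward-simulate a toy corpus from the LDA generative process.
# simulate_lda and all sizes below are hypothetical, for illustration only.
function simulate_lda(rng, K, D, doc_lengths; α = 1.0, η = 0.01)
    M = length(doc_lengths)
    # per-document topic proportions: M vectors of length K, each summing to 1
    θ = [rand(rng, Dirichlet(K, α)) for _ in 1:M]
    # per-topic word distributions: K vectors of length D, each summing to 1
    ψ = [rand(rng, Dirichlet(D, η)) for _ in 1:K]
    z = [zeros(Int, n) for n in doc_lengths]
    w = [zeros(Int, n) for n in doc_lengths]
    for m in 1:M, n in 1:doc_lengths[m]
        # pick a topic for word n of document m, then a word from that topic
        z[m][n] = rand(rng, Categorical(θ[m]))
        w[m][n] = rand(rng, Categorical(ψ[z[m][n]]))
    end
    return (θ = θ, ψ = ψ, z = z, w = w)
end

# 3 topics, 20-word vocabulary, three short documents
sim = simulate_lda(MersenneTwister(1), 3, 20, [5, 8, 6])
```

With a small `η` such as 0.01, each simulated topic tends to concentrate its mass on a few words, which is the sparsity the prior is meant to encourage.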

Import the data from a CSV file into a DataFrame. Run generate_words_csv.py first to generate words_df.csv.

In [3]:
words_df = CSV.read(joinpath(".", "cache", "words_df.csv"), DataFrame)

Row,Column1,word_id,doc_id
,Int64,Int64,Int64
1,0,1,1
2,1,2,1
3,2,3,1
4,3,4,1
5,4,5,1
6,5,6,1
7,6,7,1
8,7,8,1
9,8,9,1
10,9,10,1


Get the number of documents and unique words from the imported corpus and set the parameters.

In [4]:
# number of topics
K = 10
# number of docs (count the unique doc IDs)
M = nrow(combine(words_df, :doc_id => unique => :uniq_doc_id))
# number of unique words in corpus
D = nrow(combine(words_df, :word_id => unique => :uniq_word_id))

4131

Transform the DataFrame into a Vector of Vectors that we can use with the LDA model.

In [5]:
condition_data = Vector{Vector{Int}}(undef, M)
for doc_num = 1:M
    filter_func(doc_id) = doc_id == doc_num
    filtered = filter(:doc_id => filter_func, words_df)
    condition_data[doc_num] = filtered[:, "word_id"]
end
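
The loop above filters the whole DataFrame once per document. As an alternative sketch, the same grouping can be done in a single pass over the rows. The helper `group_by_doc` and the toy columns are hypothetical; like the loop above, the sketch assumes the document IDs are exactly the integers 1 to M.

```julia
# Group word IDs by document in one pass over the rows.
# Assumes doc IDs are the integers 1..M (hypothetical helper, not from the notebook).
function group_by_doc(doc_ids::AbstractVector{<:Integer},
                      word_ids::AbstractVector{<:Integer}, M::Integer)
    docs = [Int[] for _ in 1:M]
    for (d, w) in zip(doc_ids, word_ids)
        push!(docs[d], w)   # row order within each document is preserved
    end
    return docs
end

# toy columns standing in for words_df.doc_id and words_df.word_id
toy_doc_ids  = [1, 1, 2, 3, 3, 3]
toy_word_ids = [1, 2, 3, 4, 4, 5]
toy_docs = group_by_doc(toy_doc_ids, toy_word_ids, 3)
```

On the real data this would be called as `group_by_doc(words_df.doc_id, words_df.word_id, M)`.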

Condition the model on the provided documents.

In [6]:
conditioned_LDA = LDA(condition_data, K, D)

DynamicPPL.Model{typeof(LDA), (:w, :K, :D), (), (), Tuple{Vector{Vector{Int64}}, Int64, Int64}, Tuple{}, DynamicPPL.DefaultContext}(LDA, (w = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 42, 42, 43, 43, 43, 44, 45, 46, 46, 46], [47, 47, 47, 47, 48, 48, 49, 49, 49, 50, 51, 51], [52, 53, 54, 55, 56, 57, 58, 59, 60, 60, 60, 61], [62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74], [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87], [88, 89, 90, 91, 91, 92, 93, 94, 95, 96, 97, 97, 98], [99, 100, 101, 102, 103, 104, 105, 106, 107, 108]  …  [4105, 4105, 4105, 4105, 4105, 4105, 4105, 4106, 4106, 4106, 4106, 4106, 4107, 4107], [4108, 4109, 4109, 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4110, 4111, 4111, 4111], [4112, 4112, 4112, 4113, 4113, 4113, 4114, 4114, 4114, 4114, 4114, 4114, 4114, 4114], [4114, 4115, 4115, 4115, 4115, 4115, 4116, 4117, 4117, 4117, 4118], [4119, 4

Sample from the model. The cell below uses a Metropolis-Hastings (MH) sampler, but Turing also provides importance sampling (IS), Sequential Monte Carlo (SMC), and Particle Gibbs (PG). Samplers can also be combined so that one handles the discrete variables and another, such as Hamiltonian Monte Carlo (HMC) or the No-U-Turn Sampler (NUTS), handles the continuous variables. Only 5 samples are drawn here, which exercises the pipeline but is far too few for reliable posterior estimates.

In [7]:
chain = sample(conditioned_LDA, MH(), 5)

Sampling: 100%|███████████████████████████████████████████████████████████| Time: 0:00:04


Chains MCMC chain (5×64675×1 Array{Float64, 3}):

Iterations        = 1:1:5
Number of chains  = 1
Samples per chain = 5
Wall duration     = 9.37 seconds
Compute duration  = 9.37 seconds
parameters        = θ[1][1], θ[1][2], θ[1][3], θ[1][4], θ[1][5], θ[1][6], θ[1][7], θ[1][8], θ[1][9], θ[1][10], θ[2][1], θ[2][2], θ[2][3], θ[2][4], θ[2][5], θ[2][6], θ[2][7], θ[2][8], θ[2][9], θ[2][10], θ[3][1], θ[3][2], θ[3][3], θ[3][4], θ[3][5], θ[3][6], θ[3][7], θ[3][8], θ[3][9], θ[3][10], θ[4][1], θ[4][2], θ[4][3], θ[4][4], θ[4][5], θ[4][6], θ[4][7], θ[4][8], θ[4][9], θ[4][10], θ[5][1], θ[5][2], θ[5][3], θ[5][4], θ[5][5], θ[5][6], θ[5][7], θ[5][8], θ[5][9], θ[5][10], θ[6][1], θ[6][2], θ[6][3], θ[6][4], θ[6][5], θ[6][6], θ[6][7], θ[6][8], θ[6][9], θ[6][10], θ[7][1], θ[7][2], θ[7][3], θ[7][4], θ[7][5], θ[7][6], θ[7][7], θ[7][8], θ[7][9], θ[7][10], θ[8][1], θ[8][2], θ[8][3], θ[8][4], θ[8][5], θ[8][6], θ[8][7], θ[8][8], θ[8][9], θ[8][10], θ[9][1], θ[9][2], θ[9][3], θ[9][4], θ[9][5], θ[9][6], θ[9][7], θ[9

Compute the posterior mean word distribution for each topic by averaging every ψ entry over the sampled chain.

In [8]:
topic_word_dists = Vector{Vector{Float64}}(undef, K)
for j = 1:K
    topic_word_dists[j] = [mean(chain, "ψ[$j][$i]") for i in 1:D]
end
topic_word_dists

10-element Vector{Vector{Float64}}:
 [1.5756657769545425e-34, 2.682871253628163e-5, 4.638485081214197e-48, 3.4012816713574985e-19, 7.202124319327328e-43, 4.909026415616577e-152, 0.003400801577946562, 3.846142598864491e-18, 4.852858852121852e-8, 1.533302606814653e-49  …  0.0021861936246779564, 1.5630524353786516e-15, 2.4031080408681994e-18, 1.048573190206108e-13, 5.81444504010267e-7, 2.7252589044010593e-28, 4.2621241872527e-45, 3.948740756802549e-43, 3.3975787981786227e-25, 2.3087183046029092e-32]
 [0.000904148171838744, 7.26960089225627e-24, 0.0002587704959292351, 0.000213276106574922, 1.795010522380385e-26, 2.8539559776835686e-111, 2.4377592857476376e-79, 2.099906119829144e-5, 7.733387044950214e-48, 1.1188503951850131e-113  …  9.812160522205885e-60, 9.261607150026063e-22, 9.270867473558673e-20, 3.8044626764956503e-22, 6.852335902237966e-51, 5.024445106719229e-66, 0.00023603639032001162, 8.649846368240308e-20, 4.072962500099924e-66, 2.767645453466503e-47]
 [1.6535503044729516e-59, 2.21
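
Vectors of 4131 probabilities are hard to read directly. One common way to inspect a topic is to list its most probable word IDs; a minimal sketch using only Base Julia follows. The helper `top_words` and the toy distribution are hypothetical, and mapping the returned IDs back to actual words would need a vocabulary table that is not shown in this notebook.

```julia
# Indices (word IDs) of the n most probable words in one topic's distribution.
# top_words is a hypothetical helper, not part of the notebook.
top_words(dist::AbstractVector{<:Real}, n::Integer) =
    partialsortperm(dist, 1:min(n, length(dist)); rev = true)

toy_topic = [0.05, 0.40, 0.10, 0.30, 0.15]
top3 = top_words(toy_topic, 3)
```

Applied to the real results, `[top_words(d, 10) for d in topic_word_dists]` would give the ten highest-probability word IDs per topic.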

Compute the posterior mean topic distribution for each document by averaging every θ entry over the sampled chain.

In [9]:
document_topic_distributions = Vector{Vector{Float64}}(undef, M)
for j = 1:M
    document_topic_distributions[j] = [mean(chain, "θ[$j][$i]") for i in 1:K]
end
document_topic_distributions

1056-element Vector{Vector{Float64}}:
 [0.18747990627191288, 0.08421065602839911, 0.08558688477688353, 0.03808774810482357, 0.0065165463050139445, 0.058745303386433224, 0.1367765152745831, 0.07442553915346879, 0.03286986921295992, 0.295301031485522]
 [0.003929025069745596, 0.16350437085349206, 0.08254190594104771, 0.272664569610898, 0.0733708990311926, 0.2186472801984179, 0.13254789381744483, 0.03361555918978549, 0.009839827664464281, 0.009338668623511315]
 [0.06323194522153733, 0.08143298559127128, 0.04100014586537749, 0.24697634897079057, 0.24906933258087563, 0.11138747153648462, 0.10465703570205356, 0.0363310805397157, 0.05212607236400365, 0.013787581627890424]
 [0.5237529296496167, 0.07349649896528214, 0.03771428740078687, 0.22449141471629525, 0.021196086448818267, 0.032148724043979074, 0.026786937131047316, 0.021904201899388652, 0.03398227827838498, 0.004526641466400513]
 [0.07448304429827156, 0.22772029632224763, 0.029456526843322683, 0.007059691028335223, 0.09077157831897838, 0.

Find the highest-probability topic for each movie (each document in the corpus).

In [10]:
highest_prob_topic_per_movie = Vector{Tuple{Int, Float64}}(undef, M)
for doc = 1:M
    max_prob = 0.0
    max_ind = 0
    for topic = 1:K
        if document_topic_distributions[doc][topic] > max_prob
            max_prob = document_topic_distributions[doc][topic]
            max_ind = topic
        end
    end
    highest_prob_topic_per_movie[doc] = (max_ind, max_prob) 
end
highest_prob_topic_per_movie

1056-element Vector{Tuple{Int64, Float64}}:
 (10, 0.295301031485522)
 (4, 0.272664569610898)
 (5, 0.24906933258087563)
 (1, 0.5237529296496167)
 (2, 0.22772029632224763)
 (5, 0.3031128238688373)
 (6, 0.2971738579801988)
 (9, 0.24973069215187663)
 (3, 0.24793862281307893)
 (9, 0.27410971280838087)
 (2, 0.25929224551010904)
 ⋮
 (5, 0.26561827647286756)
 (9, 0.4955338564195927)
 (7, 0.22987275084866282)
 (4, 0.28877775248846366)
 (10, 0.1757061922622305)
 (7, 0.30404525157324475)
 (5, 0.39810528759960545)
 (4, 0.3172172175701393)
 (1, 0.22050395013280819)
 (3, 0.24250388388861702)
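
The explicit loop above can also be written with Base Julia's `findmax`, which returns the maximum value together with its index. A minimal sketch on hypothetical toy data:

```julia
# findmax returns (maximum value, index of the maximum);
# reverse flips the tuple into the (topic, probability) order used above.
toy_dists = [[0.2, 0.5, 0.3], [0.7, 0.1, 0.2]]
toy_best = [reverse(findmax(d)) for d in toy_dists]
```

Like the loop, `findmax` returns the first index when several topics tie for the maximum.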

Save the results in a DataFrame and write them to a CSV file so they can be loaded into Python for evaluation.

In [17]:
# out_df = DataFrame(doc_id=1:1:M, topic_dist=document_topic_distributions)
highest_prob_topic = [tup[1] for tup in highest_prob_topic_per_movie]
highest_prob = [tup[2] for tup in highest_prob_topic_per_movie]
out_df = DataFrame(doc_id=1:1:M, highest_prob_topic=highest_prob_topic, highest_prob=highest_prob)

Row,doc_id,highest_prob_topic,highest_prob
,Int64,Int64,Float64
1,1,10,0.295301
2,2,4,0.272665
3,3,5,0.249069
4,4,1,0.523753
5,5,2,0.22772
6,6,5,0.303113
7,7,6,0.297174
8,8,9,0.249731
9,9,3,0.247939
10,10,9,0.27411


In [19]:
CSV.write("cache/julia_out.csv", out_df)

"cache/julia_out.csv"