# Introduction:

Here we provide a tutorial that uses both the Dirichlet multinomial mixture model and the Poisson model. The objective is to perform topic modelling on a text file in which each line corresponds to one document.

# Preamble

We first install the package itself and then import the relevant classes:

In [2]:
pip install GPyM-TM==3.0.0

Collecting GPyM-TM==3.0.0
  Downloading https://files.pythonhosted.org/packages/da/88/0816826b5b6d2d9fbf4dfff6d6b9e963d85a5ec8dc8e09d5d2bb2381c347/GPyM_TM-3.0.0-py3-none-any.whl
Installing collected packages: GPyM-TM
  Found existing installation: GPyM-TM 1.3.7
    Uninstalling GPyM-TM-1.3.7:
      Successfully uninstalled GPyM-TM-1.3.7
Successfully installed GPyM-TM-3.0.0


Having installed the package, we now import the two classes available within it.

In [3]:
from GPyM_TM import GSDMM
from GPyM_TM import GPM

### Data:

We now load the relevant text file using one of the functions available in the package, **load_file**, which converts the text file into the format the package requires.

The cell below mounts Google Drive so that the text file can be read directly from it; this step is not necessary if your text file is stored locally.

In [4]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Reading in the data:

In [5]:
name="toy_dataset_cleaned"
filename = ('/content/drive/My Drive/Internship/%s.txt' % name)   
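
If the text file is stored locally, the Drive mount can be skipped and the path built directly. The sketch below is a minimal illustration under stated assumptions: the folder (a temporary directory), the file name `toy_dataset_local` and the three toy documents are all hypothetical. It writes a small file with one document per line and checks the number of documents it contains.

```python
import tempfile
from pathlib import Path

# Hypothetical toy documents: one pre-cleaned document per line.
toy_documents = [
    "cat kitten small furry mammal",
    "oil gas crude petroleum liquid",
    "broccoli cauliflower vegetable nutrient",
]

# Hypothetical local folder (a temporary directory here) instead of Google Drive.
data_dir = Path(tempfile.mkdtemp())
local_name = "toy_dataset_local"
local_filename = data_dir / ("%s.txt" % local_name)

# Write one document per line.
local_filename.write_text("\n".join(toy_documents) + "\n", encoding="utf-8")

# Sanity check: the number of non-empty lines equals the number of documents.
with open(local_filename, encoding="utf-8") as handle:
    n_documents = sum(1 for line in handle if line.strip())

print("documents in file:", n_documents)
```

The resulting path can then be converted with `str(local_filename)` and used wherever `filename` appears in this tutorial.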

We define the number of topics:

In [6]:
nTopics = 10

Lastly, the text is transformed into the format required by the class, and saved within the variable corpus:

In [7]:
corpus = GSDMM.load_file(filename)

# Application - GSDMM:

In the code below, we initialize the object that will perform the topic modelling, and call several of its methods.

For this example, we provide both the default usage and a case in which several of the parameters have been specified.

In [8]:
data_dmm = GSDMM.DMM(corpus, nTopics) # Initialize the object, with default parameters.

# data_dmm = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters =5) # Initialize the object.

data_dmm.topicAssigmentInitialise() # Performs the initial document assignments and counts
data_dmm.inference()

psi, theta, selected_psi, selected_theta = data_dmm.worddist() # Determines and stores the psi, theta and selected_psi and selected_theta values
   
finalAssignments = data_dmm.writeTopicAssignments() # Records the final topic assignments for the documents

coherence_topwords = data_dmm.writeTopTopicalWords(finalAssignments) # Records the top words for each topic

score = data_dmm.coherence(coherence_topwords, len(finalAssignments)) # Calculates and stores the coherence

print("Final number of topics found: " + str(len(finalAssignments)))

corpus=10, words=75, K=10, a=0.100000, b=0.100000, nTopWords=10, iters=15
iteration: 0
iteration: 1
iteration: 2
iteration: 3
iteration: 4
iteration: 5
iteration: 6
iteration: 7
iteration: 8
iteration: 9
iteration: 10
iteration: 11
iteration: 12
iteration: 13
iteration: 14
[0 1 2 3 5 7 8 9]
cancer abnormal cell spread body disease group growth invade involve 
oil gas call combination crude exist liquid petroleum sticky substance 
born current donald january john june office president states trump 
broccoli cauliflower delicate flavor greener stronger abnormal body cancer cell 
trump apprentice fame hollywood reality receive star tv walk abnormal 
cat kitten clowder collective kindle noun abnormal body cancer cell 
broccoli dozen hearty nutrient rich tasty vegetable abnormal body cancer 
carnivorous cat domestic furry mammal small typically abnormal body cancer 
average topic:  6.237884775635033
Final number of topics found: 8
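
The output above reports 8 topics although `nTopics` was set to 10, which suggests that some topics ended up with no documents. As a rough illustration of that idea (not the package's internal code), the sketch below counts the distinct labels in a hypothetical array of per-document topic assignments using NumPy; the array values and the names `doc_topics`, `populated` and `empty` are assumptions made for the example.

```python
import numpy as np

n_topics = 10  # upper bound on the number of topics, as with nTopics above

# Hypothetical per-document topic labels (illustrative values only).
doc_topics = np.array([0, 1, 2, 3, 5, 7, 8, 9, 0, 3])

# Labels that have at least one document assigned to them.
populated = np.unique(doc_topics)

# Labels in range(n_topics) that received no documents.
empty = np.setdiff1d(np.arange(n_topics), populated)

print("populated topics:", populated)
print("number of populated topics:", len(populated))
print("empty topics:", empty)
```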


In [9]:
# The selected psi and theta values are stored in the following two variables
selected_psi
selected_theta

array([[9.96956814e-01, 6.51863805e-05, 1.36336784e-03, 1.36336784e-03,
        2.01037928e-05, 6.90451404e-06, 1.23129830e-04, 3.56680195e-05],
       [9.96007373e-01, 2.59821030e-04, 1.36206945e-03, 1.36206945e-03,
        1.10439294e-04, 5.10433710e-05, 4.13789048e-04, 1.67440219e-04],
       [1.17486888e-04, 1.86653305e-02, 2.83219667e-01, 2.83219667e-01,
        6.82876259e-02, 2.83219667e-01, 3.38741182e-02, 1.06136193e-02],
       [5.16387722e-04, 3.45089706e-02, 6.33178825e-02, 2.62607894e-01,
        2.62607894e-01, 2.62607894e-01, 5.75149510e-02, 2.12927691e-02],
       [1.35009082e-03, 3.56363004e-02, 8.30346540e-03, 1.25993200e-01,
        1.65566068e-02, 1.25993200e-01, 5.41143822e-02, 5.04709464e-01],
       [5.90413635e-04, 2.50072906e-02, 4.91282946e-03, 1.31096649e-01,
        1.06295765e-02, 1.31096649e-01, 3.98264257e-02, 5.25153102e-01],
       [1.27310833e-03, 1.18809038e-01, 7.82999986e-03, 1.18809038e-01,
        1.56125452e-02, 2.26633720e-02, 5.61316442e-01, 3.
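
One common way to read a matrix such as `selected_theta` is to take the largest entry in each row. The sketch below does this on a small hypothetical array. It assumes that rows correspond to documents and columns to topics, which the output above does not confirm, and the values in `theta_example` are invented for illustration.

```python
import numpy as np

# Hypothetical stand-in for selected_theta: 3 rows, 4 columns.
# Assumption: each row is a document and each column is a topic.
theta_example = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# Index of the largest weight in each row.
dominant_topic = theta_example.argmax(axis=1)

# The weight attached to that largest entry.
dominant_weight = theta_example.max(axis=1)

for row, (topic, weight) in enumerate(zip(dominant_topic, dominant_weight)):
    print("row %d -> column %d (weight %.2f)" % (row, topic, weight))
```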

# Application - GPM

Having shown how topic modelling can be performed with the Dirichlet multinomial mixture model, we now repeat the process using the Poisson model.

Just as before, we make use of the **load_file** function, which correctly formats the text file for the class.

In [10]:
corpus = GPM.load_file(filename)

Then, just as before, we initialize the object and show the relevant results.

In [10]:
# Default usage
data_gpm = GPM.GPM(corpus, nTopics)

# Non-default usage
# data_gpm = GPM.GPM(corpus, nTopics, alpha = 0.002, beta = 0.03, gam = 0.06, nTopWords = 12, iters = 7, N = 8)

data_gpm.topicAssigmentInitialise()

data_gpm.inference()
psi, theta, selected_psi, selected_theta = data_gpm.worddist()

finalAssignments = data_gpm.writeTopicAssignments()

coherence_topwords = data_gpm.writeTopTopicalWords(finalAssignments)

score = data_gpm.coherence(coherence_topwords, len(finalAssignments))

print("Final number of topics found: " + str(len(finalAssignments)))

corpus=10, words=75, K=10, a=0.001000, b=0.001000, g=0.100000,nTopWords=10, iters=15, N=20
iteration: 0
iteration: 1
iteration: 2
iteration: 3
iteration: 4
iteration: 5
iteration: 6
iteration: 7
iteration: 8
iteration: 9
iteration: 10
iteration: 11
iteration: 12
iteration: 13
iteration: 14
cancer abnormal cell spread eventually tissue uncontrolled body disease group 
oil gas trump cleaner coal disadvantage environmental fuel natural apprentice 
broccoli cauliflower delicate flavor greener stronger dozen hearty nutrient rich 
cat kitten carnivorous domestic furry mammal small typically clowder collective 
average topic:  8.435969586274934
Final number of topics found: 4
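
Both model cells above call the same sequence of methods, so the steps could be wrapped in a single helper. The sketch below is only an illustration: `run_topic_model` is a hypothetical name, and `_StubModel` is a stand-in defined here so that the example runs without the package installed. The method names are the ones used in the cells above; the values the stub returns are dummies.

```python
def run_topic_model(model):
    """Run the fit-and-report steps used in this tutorial on a model object.

    `model` is expected to provide the methods called in the cells above.
    """
    model.topicAssigmentInitialise()  # initial document assignments and counts
    model.inference()                 # run the inference iterations
    psi, theta, selected_psi, selected_theta = model.worddist()
    final_assignments = model.writeTopicAssignments()
    top_words = model.writeTopTopicalWords(final_assignments)
    score = model.coherence(top_words, len(final_assignments))
    return {
        "selected_psi": selected_psi,
        "selected_theta": selected_theta,
        "n_topics": len(final_assignments),
        "score": score,
    }


class _StubModel:
    """Hypothetical stand-in with the same method names, returning dummy values."""

    def topicAssigmentInitialise(self):
        pass

    def inference(self):
        pass

    def worddist(self):
        return "psi", "theta", "selected_psi", "selected_theta"

    def writeTopicAssignments(self):
        return [0, 1, 2]

    def writeTopTopicalWords(self, final_assignments):
        return [["a"], ["b"], ["c"]]

    def coherence(self, top_words, n_topics):
        return float(n_topics)


result = run_topic_model(_StubModel())
print("Final number of topics found: " + str(result["n_topics"]))
```

With the package installed, the same helper could be called as `run_topic_model(GSDMM.DMM(corpus, nTopics))` or `run_topic_model(GPM.GPM(corpus, nTopics))`.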


In [11]:
# The selected psi and theta values are stored in the following two variables
selected_psi
selected_theta

array([[9.96956814e-01, 6.51863805e-05, 1.36336784e-03, 1.36336784e-03,
        2.01037928e-05, 6.90451404e-06, 1.23129830e-04, 3.56680195e-05],
       [9.96007373e-01, 2.59821030e-04, 1.36206945e-03, 1.36206945e-03,
        1.10439294e-04, 5.10433710e-05, 4.13789048e-04, 1.67440219e-04],
       [1.17486888e-04, 1.86653305e-02, 2.83219667e-01, 2.83219667e-01,
        6.82876259e-02, 2.83219667e-01, 3.38741182e-02, 1.06136193e-02],
       [5.16387722e-04, 3.45089706e-02, 6.33178825e-02, 2.62607894e-01,
        2.62607894e-01, 2.62607894e-01, 5.75149510e-02, 2.12927691e-02],
       [1.35009082e-03, 3.56363004e-02, 8.30346540e-03, 1.25993200e-01,
        1.65566068e-02, 1.25993200e-01, 5.41143822e-02, 5.04709464e-01],
       [5.90413635e-04, 2.50072906e-02, 4.91282946e-03, 1.31096649e-01,
        1.06295765e-02, 1.31096649e-01, 3.98264257e-02, 5.25153102e-01],
       [1.27310833e-03, 1.18809038e-01, 7.82999986e-03, 1.18809038e-01,
        1.56125452e-02, 2.26633720e-02, 5.61316442e-01, 3.