# Introduction:

Here we provide a tutorial on using the Dirichlet multinomial mixture model described above. The objective is to perform topic modelling on a text file in which each line corresponds to one document.

# Preamble

We first install the package, and then import the relevant class:

In [8]:
pip install GPyM-TM



In [0]:
from GSDMM import GSDMM

### Data:

We now load the relevant text file using one of the functions available in the package, **load_file**, which converts the text file into the format the package requires.

The cell below mounts Google Drive so that the text file can be read directly from it. This step is not necessary if your text file is stored locally.

In [11]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We set the path to the data file:

In [0]:
name = "toy_dataset_cleaned"
filename = '/content/drive/My Drive/Internship/%s.txt' % name
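
As a quick sanity check before loading, the sketch below counts the documents in a file that follows the one-document-per-line layout described in the introduction. It is only an illustration: the helper count_documents and the temporary toy file are hypothetical and not part of GPyM-TM. For a locally stored file, pass its local path instead of the Drive path.

```python
import os
import tempfile


def count_documents(path):
    """Count non-empty lines, i.e. documents, in a one-document-per-line file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Hypothetical toy file standing in for the real dataset (assumption, not the tutorial's data).
toy_lines = ["cat kitten furry mammal", "oil gas crude petroleum", "", "cancer cell growth"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(toy_lines) + "\n")
    toy_path = tmp.name

n_docs = count_documents(toy_path)  # the blank line is not counted as a document
os.remove(toy_path)                 # clean up the temporary file
print("documents:", n_docs)
```

If the count differs from the number of documents you expect, the file probably contains blank lines or documents that span several lines.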

We define the number of topics:

In [0]:
nTopics = 10

Lastly, the text is converted into the format required by the class and stored in the variable corpus:

In [0]:
corpus = GSDMM.load_file(filename)

# Application:

In the code below, we initialize the object that performs the topic modelling and call several of its methods.

For this example, we provide both the default usage and, commented out, a case in which several of the parameters are specified explicitly.

In [25]:
data_dmm = GSDMM.DMM(corpus, nTopics)  # Initialize the object with default parameters.

# Alternative: initialize the object with several parameters specified.
# data_dmm = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters =5)

data_dmm.topicAssigmentInitialise()  # Performs the initial document assignments and counts
data_dmm.inference()  # Runs the inference iterations

psi, theta, selected_psi, selected_theta = data_dmm.worddist()  # Determines and stores psi, theta, selected_psi and selected_theta

finalAssignments = data_dmm.writeTopicAssignments()  # Records the final topic assignments for the documents

coherence_topwords = data_dmm.writeTopTopicalWords(finalAssignments)  # Records the top words for each topic

score = data_dmm.coherence(coherence_topwords, len(finalAssignments))  # Calculates and stores the coherence

print("Final K:", len(finalAssignments))

corpus=10, words=75, K=10, a=0.100000, b=0.100000, nTopWords=10, iters=15
iteration: 0
iteration: 1
iteration: 2
iteration: 3
iteration: 4
iteration: 5
iteration: 6
iteration: 7
iteration: 8
iteration: 9
iteration: 10
iteration: 11
iteration: 12
iteration: 13
iteration: 14
[0 1 3 4 5 6 8]
trump apprentice fame hollywood reality receive star tv walk abnormal 
carnivorous cat domestic furry mammal small typically abnormal body cancer 
cancer abnormal cell spread body disease group growth invade involve 
oil gas call combination crude exist liquid petroleum sticky substance 
cat kitten clowder collective kindle noun abnormal body cancer cell 
broccoli dozen hearty nutrient rich tasty vegetable cauliflower delicate flavor 
born current donald january john june office president states trump 
average topic:  7.93147758254302
Final K: 7
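
In the output above, K=10 topics were requested but Final K is 7, and the printed array [0 1 3 4 5 6 8] appears to list the topic indices that still hold at least one document. The sketch below illustrates that idea on made-up labels. The list doc_topics is a hypothetical set of per-document topic labels, not the actual return value of **writeTopicAssignments**.

```python
from collections import Counter

# Hypothetical per-document topic labels (10 documents, 10 candidate topics).
doc_topics = [0, 1, 3, 4, 5, 6, 8, 1, 3, 0]

cluster_sizes = Counter(doc_topics)       # number of documents in each non-empty topic
surviving_topics = sorted(cluster_sizes)  # topic indices that are actually in use
final_k = len(surviving_topics)           # topics with no documents are not counted

print(surviving_topics)
print("Final K:", final_k)
```

Under this reading, topics 2, 7 and 9 ended up empty, which is why fewer topics are reported than were requested.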
