
# Topic modeling
This notebook creates topics from the data set and returns the corresponding feature vectors. It loads already preprocessed data to speed up the process. The number of topics equals the length of the generated feature vector. Note that topic generation involves randomness, so the topics and their values are not fixed and may change from run to run.

In [14]:
NUM_TOPICS = 10   # Number of topics and size of output vec
PERCANTAGE = 0.2  # Fraction of Literary books to keep when
                  # generating the topics, 1.0 means all

In [10]:
from sklearn.model_selection import StratifiedShuffleSplit
import gensim
from gensim import corpora
import numpy as np
import random

In [11]:
genres = []
books = []

data_path = 'prepared_tokens.npy'

with open(data_path, 'rb') as f:
    genres = np.load(f, allow_pickle=True)
    books = np.load(f, allow_pickle=True)

The train/test split happens in this cell. Adapt this part to your own code. The important thing is that the variables books_train, books_test, genres_train, and genres_test are filled with reasonable data.

In [12]:
NUMBER_OF_SPLITS = 5
TEST_SIZE = 1 / 3

sss = StratifiedShuffleSplit(
    n_splits=NUMBER_OF_SPLITS,
    test_size=TEST_SIZE,
    random_state=0
)

splits = sss.split( books, genres )

# TODO add your train test split here, make sure to fill in 

train_index = []
test_index = []

# Only the indices of the last split are kept.
for tr, te in splits:
  train_index = tr
  test_index = te
  
books_train, books_test = books[train_index], books[test_index]
genres_train, genres_test = genres[train_index], genres[test_index]
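If a single stratified split is enough, the four variables can also be filled in one call. The following is a minimal sketch, assuming scikit-learn's `train_test_split` with its `stratify` argument; the toy `books` and `genres` lists and the `test_size` value are made up for illustration and are not the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's `books` and `genres` (illustrative only).
books = [["ship", "sea"], ["love", "letter"], ["murder", "clue"], ["storm", "sail"],
         ["detective", "alibi"], ["heart", "loss"], ["witness", "motive"], ["memory", "home"]]
genres = ["Literary", "Literary", "Mystery", "Literary",
          "Mystery", "Literary", "Mystery", "Mystery"]

# One stratified split: the genre proportions are preserved in both parts.
books_train, books_test, genres_train, genres_test = train_test_split(
    books, genres, test_size=0.25, stratify=genres, random_state=0
)

print(len(books_train), len(books_test))
```

Fixing `random_state` makes the split repeatable across runs, which helps when comparing models trained on the resulting features.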

The actual processing happens here. At the end you have the four variables X_train, X_test, Y_train, and Y_test that you can use to train your model.

In [16]:
arr = []  # Books used to fit the topic model

for i in range( len(genres_train ) ):
  # Keep each Literary book only with probability PERCANTAGE
  if genres_train[i] == 'Literary':
    val = random.random()
    if val > PERCANTAGE:
      continue
  arr.append( books_train[i])
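The subsampling above draws from the global `random` module, so a different subset of Literary books is kept on every run. A possible way to make this step repeatable is sketched below. The helper `subsample` and its parameters are hypothetical names, not part of the notebook, and the toy data is made up.

```python
import random

def subsample(books, genres, keep_fraction, label="Literary", seed=None):
    """Keep every book, except that books of `label` are kept
    only with probability `keep_fraction`."""
    rng = random.Random(seed)  # private generator, independent of global state
    kept = []
    for book, genre in zip(books, genres):
        if genre == label and rng.random() > keep_fraction:
            continue
        kept.append(book)
    return kept

# Toy data for illustration only.
toy_books = [["a"], ["b"], ["c"], ["d"], ["e"], ["f"]]
toy_genres = ["Literary", "Mystery", "Literary", "Literary", "Mystery", "Literary"]

first = subsample(toy_books, toy_genres, keep_fraction=0.5, seed=42)
second = subsample(toy_books, toy_genres, keep_fraction=0.5, seed=42)
print(first == second)
```

This only fixes the subsampling. The topic model has its own source of randomness; gensim's `LdaModel` accepts a `random_state` argument that can be set for the same purpose.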

In [None]:
# Create the topics
NUM_WORDS  = 4

dictionary = corpora.Dictionary( arr )
dictionary.filter_extremes(no_below=20, no_above=0.5)
corpus = [ dictionary.doc2bow( text ) for text in arr ]

# Set training parameters.
num_topics = NUM_TOPICS
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

ldamodel = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

# Print out the topics
topics = ldamodel.print_topics( num_words=NUM_WORDS )
print("The following topics were generated:")
for topic in topics:
  print( topic )

# Process the books and get final training data
X_train = []
Y_train = genres_train # TODO do I have to preprocess it as well?

for book in books_train:

  # Get the topic weights
  bow = dictionary.doc2bow( book )
  topics = ldamodel.get_document_topics( bow )

  # Convert the vector of dynamic length to
  # constant length feature vector
  x = [0] * NUM_TOPICS
  for topic in topics:
    x[topic[0]] = topic[1]
  X_train.append(x)

# Prepare our test data in the same way
X_test = []
Y_test = genres_test # TODO do I have to preprocess it as well?

for book in books_test:

  # Get the topic weights
  bow = dictionary.doc2bow( book )
  topics = ldamodel.get_document_topics(bow)

  # Convert the vector of dynamic length to
  # constant length feature vector
  x = [0] * NUM_TOPICS
  for topic in topics:
    x[topic[0]] = topic[1]
  X_test.append(x)
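The conversion from a variable-length list of `(topic_id, weight)` pairs to a fixed-length vector is written out twice above, once for the training data and once for the test data. A small helper could factor it out. The sketch below uses a hypothetical function name, `to_dense`, and a hand-written input in place of a real `get_document_topics` result.

```python
def to_dense(topic_weights, num_topics):
    """Turn a sparse list of (topic_id, weight) pairs into a
    list of length `num_topics`; missing topics get weight 0.0."""
    dense = [0.0] * num_topics
    for topic_id, weight in topic_weights:
        dense[topic_id] = weight
    return dense

# Hand-written stand-in for the output of get_document_topics (illustrative only).
sparse = [(1, 0.7), (4, 0.3)]
vector = to_dense(sparse, num_topics=5)
print(vector)
```

The list has variable length because `get_document_topics` omits topics whose weight falls below its `minimum_probability` threshold; those entries stay at zero in the dense vector.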