## TOPIC MODELLING 


### 1. Latent Dirichlet Allocation (LDA)

The idea is to first <i>LEARN</i> the topics from the Title fields. This requires pooling the titles from all the cleaned data tables into a single corpus.

To achieve this, we follow these steps:

   1. Create a corpus from the tokenised Titles
   2. Create a Dictionary of (word-id, word) pairs 
   3. Vectorise the corpus using a simple Bag of Words (BOW) representation (bigrams or trigrams could also be used)
   4. Train an LDA model on the vectorised corpus
   
At the end, we have an LDA model that gives a probability distribution over topics for each document, as well as a distribution over words for each topic.

We shall use the <b>gensim</b> library to do the above.

### 1.1 Corpus Creation 

In [1]:
import pandas as pd
from gensim.corpora import Dictionary
from ast import literal_eval
from gensim.models import LdaModel
import random
import pickle



First, we need a corpus of documents; in our case, each document is a tokenised title.

In [10]:
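title_corpus = []

# One cleaned CSV per DBLP publication type.
for filename in ['dblp_books_clean.csv', 'dblp_articles_clean.csv', 'dblp_incollections_clean.csv',
                 'dblp_inproceedings_clean.csv', 'dblp_proceedings_clean.csv', 'dblp_theses_clean.csv'
                ]:
    # Keep each distinct title once per file.
    vals = pd.read_csv(filename).Title.unique()
    # Titles are stored as the string form of a token list; parse them back into lists.
    vals = [literal_eval(val) for val in vals]
    title_corpus.extend(vals)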



### 1.2 Dictionary Creation 

In [26]:
vocab_dict = Dictionary(title_corpus)

In [27]:
len(vocab_dict)

333919

In [28]:
with open("full_vocab.p", "wb") as f:
    pickle.dump(vocab_dict, f)

In [29]:
vocab_dict.filter_extremes(no_above=0.9, no_below=10)

In [30]:
len(vocab_dict)

50806

In [31]:
with open("cut_vocab.p", "wb") as f:
    pickle.dump(vocab_dict, f)
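
The `filter_extremes` call above is what shrinks the vocabulary from 333,919 to 50,806 tokens: `no_below=10` drops tokens that occur in fewer than 10 titles, and `no_above=0.9` drops tokens that occur in more than 90% of titles. A minimal pure-Python sketch of that document-frequency rule follows. The function `filter_by_doc_freq` and the toy documents are hypothetical, and the sketch ignores gensim's additional `keep_n` cap.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within the given bounds."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(doc))
    # Convert the fractional upper bound into an absolute document count.
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["based", "image"],
    ["based", "graph"],
    ["based", "graph", "logic"],
    ["based", "zur"],
]

# "based" is in all 4 documents (too common); "image", "logic", "zur" are in 1 (too rare).
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.9)
```

Here only `"graph"` survives, since it is the one token that appears in at least 2 and at most 3 of the 4 toy documents.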

### 1.3 BOW Vectorisation 

In [32]:
bow_corpus = [vocab_dict.doc2bow(x) for x in title_corpus]

In [36]:
len(bow_corpus)

3748124
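
To make the BOW format concrete, here is a small pure-Python sketch of what a `doc2bow`-style conversion produces: a sorted list of (token id, count) pairs, with tokens missing from the vocabulary silently dropped. The helper `doc_to_bow` and the three-word vocabulary are hypothetical and only mimic the behaviour for illustration.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Convert a token list into sorted (token id, count) pairs."""
    # Tokens that are not in the vocabulary are ignored.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_token2id = {"graph": 0, "logic": 1, "theory": 2}

# "graph" occurs twice, "theory" once, and "zur" is out of vocabulary.
toy_vector = doc_to_bow(["graph", "theory", "graph", "zur"], toy_token2id)
```

This also shows why pruning the dictionary matters: any token removed from the vocabulary simply disappears from every document vector, so very short titles can end up as empty lists.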

### 1.4 Sampling & LDA Model 

We need to sample our dataset, as training on all of the roughly 3.7 million documents would take too long on a single computer.
We sample 100,000 documents at random without replacement and train an LDA model on them. Note that the number of topics (10 here) would need to be <b>CROSS VALIDATED BASED ON THE RECOMMENDATION QUALITY</b>.

In [50]:
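# Draw 100,000 distinct document indices (sampling without replacement).
sample = random.sample(range(len(bow_corpus)), 100000)
# Sorting the indices keeps the sampled documents in their original corpus order.
bow_sample = [bow_corpus[x] for x in sorted(sample)]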
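
Because the sampling is unseeded, rerunning the notebook draws a different subset each time. One way to make the draw repeatable is sketched below; the helper `sample_indices` and the seed value 42 are assumptions for illustration, not part of the original run.

```python
import random

def sample_indices(n_docs, k, seed):
    """Draw k distinct indices from range(n_docs), reproducibly for a given seed."""
    # A dedicated generator avoids touching the global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_docs), k))

# Two draws with the same seed select exactly the same documents.
idx_a = sample_indices(1000, 10, seed=42)
idx_b = sample_indices(1000, 10, seed=42)
```

The returned indices could then be used in place of `sorted(sample)` when building `bow_sample`.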

In [53]:
%time lda_model = LdaModel(corpus=bow_sample, id2word=vocab_dict, num_topics=10, passes=3)

CPU times: user 3min 24s, sys: 3.13 s, total: 3min 27s
Wall time: 3min 28s


In [54]:
with open('lda_model_10_topics', 'wb') as f:
    pickle.dump(lda_model, f)

In [55]:
lda_model.print_topics(-1)

[(0,
  '0.027*"based" + 0.023*"image" + 0.020*"icassp" + 0.018*"recognition" + 0.015*"detection" + 0.014*"processing" + 0.012*"video" + 0.010*"images" + 0.010*"speech" + 0.009*"estimation"'),
 (1,
  '0.022*"logic" + 0.011*"complexity" + 0.010*"graphs" + 0.010*"graph" + 0.009*"theory" + 0.009*"set" + 0.008*"program" + 0.008*"order" + 0.008*"sets" + 0.007*"matrix"'),
 (2,
  '0.032*"based" + 0.026*"time" + 0.023*"systems" + 0.013*"real" + 0.013*"fuzzy" + 0.013*"control" + 0.013*"multi" + 0.012*"model" + 0.010*"analysis" + 0.009*"selection"'),
 (3,
  '0.027*"proceedings" + 0.018*"learning" + 0.015*"systems" + 0.014*"study" + 0.013*"symposium" + 0.012*"case" + 0.012*"information" + 0.010*"zur" + 0.010*"intelligent" + 0.009*"robot"'),
 (4,
  '0.011*"design" + 0.009*"high" + 0.009*"simulation" + 0.008*"localization" + 0.007*"analysis" + 0.007*"functional" + 0.007*"synthesis" + 0.007*"experimental" + 0.006*"eines" + 0.005*"speed"'),
 (5,
  '0.023*"optimization" + 0.015*"algorithm" + 0.012*"pro

#### We can see from the above that there is still some overlap between the topics (for example, "based" is the top word in both topics 0 and 2), but these results are good enough for the purpose of our prototype.

#### It can also be seen that French and German stopwords (for example, the German "zur" and "eines" in topics 3 and 4) are leaking into the topics; they need to be removed during cleaning.
This happened because we only removed English stopwords.
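
A minimal sketch of the extra cleaning step is shown below. The set `EXTRA_STOPWORDS` is a small hand-picked sample chosen for illustration, and `drop_stopwords` is a hypothetical helper; a real cleaning pass would use complete German and French stopword lists.

```python
# A few common German and French stopwords (illustrative sample, far from complete).
EXTRA_STOPWORDS = {
    "zur", "eines", "der", "die", "und", "für",   # German
    "le", "la", "les", "des", "pour",             # French
}

def drop_stopwords(tokens, stopwords=EXTRA_STOPWORDS):
    """Remove tokens that appear in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

# A made-up German title: only the content words remain.
cleaned = drop_stopwords(["zur", "synthese", "eines", "reglers"])
```

Applying such a filter to the tokenised titles before building the dictionary would keep these words out of the vocabulary and therefore out of the topics.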