Given the same dataset,

- Extract the most syntactically weighted N-grams, omitting filler phrases (‘казалось бы’, "it would seem"; ‘возможно предположить’, "one may assume"; etc.). The main idea is to extract the most valuable data from the text.

- Try different models for topic extraction. Which one performs better? What metrics were used to evaluate the models?


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sbn
import re
import multiprocessing as mp
from string import punctuation
import gensim

Load data:

In [None]:
lenta_data = pd.read_csv('E:/PycharmProjects/data/lenta-ru-news.csv')
lenta_data.head()

Clean and preprocess the text data:

In [None]:
import nltk
nltk.download("stopwords")

In [None]:
from nltk.corpus import stopwords
import re, string
from nltk.stem.snowball import SnowballStemmer

russian_stopwords = stopwords.words("russian")
russian_stopwords.extend(['BBC', 'ВВС', 'РИА', 'Риа','риа', 'пресс центра', 'агенства','прайм тасс', 'ТАСС', 
                          'Интерфакс', 'Интерфакса', 'новости','пресс службы', 'интерфакс', 'интерфакса', 'мнен', 'дан',
                          'Таким образом', 'как', 'по', 'в', 'результате', 'В', 'В результате', 'мнению', 'данным',
                          'таким образом', 'согласно которым', 'до сих пор', 'понедельник', 'вторник', 'рбк', 'РБК', 
                          'итартасс', 'ссыл', 'ссылается', 'день', 'ночь', 'утро', 'вечер', 'Reuters',
                          'наст', 'врем', 'ссылк', 'сообщ', 'агенств', 'сих', 'пор',
                          'среда', 'четверг', 'пятница', 'суббота', 'воскресенье', 'лет', 'год', 'некотор', 'необход', 
                          'лиш', "владимир",'Associated Press',
                          'сказ', 'больш', "ЕЭС России", 'eэс', 'Associated', 'соответств', 'говор', 'лучш', 'сообща',
                          'новост', 'эх'
                         ])
regex = re.compile('[%s]' % re.escape(string.punctuation))
stemmer = SnowballStemmer("russian") 

def preprocessing(text):
    text = regex.sub('', text)
    text = [token for token in text.split() if token not in russian_stopwords]
    text = [stemmer.stem(token) for token in text] 
    text = [token for token in text if token not in russian_stopwords]
    text = [token for token in text if token] 
    return ' '.join(text)
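The stop-word list above contains multi-word entries such as 'пресс центра' and 'таким образом', but `preprocessing` compares single whitespace-separated tokens, so those entries can never match. Below is a minimal sketch of one way to drop such phrases before tokenizing. The names `FILLER_PHRASES` and `remove_phrases` are hypothetical and not part of the original notebook, and the phrase list is only an illustration.

```python
import re

# Hypothetical phrase list for illustration; extend it with your own filler phrases.
FILLER_PHRASES = ["казалось бы", "таким образом", "до сих пор"]

# One alternation pattern; longer phrases go first so they win over shorter ones.
_phrase_re = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(FILLER_PHRASES, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def remove_phrases(text):
    """Delete filler phrases, then collapse the leftover whitespace."""
    return " ".join(_phrase_re.sub(" ", text).split())

print(remove_phrases("таким образом казалось бы курс рубля вырос"))
# prints: курс рубля вырос
```

Such a helper would probably be called on the raw text after punctuation is stripped and before `text.split()`, so that the phrases are still contiguous.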

Process only the first 100 000 rows, because the full dataset is very large and takes a long time to process:

In [None]:
first_rows = lenta_data.index[:100000]  # assign through .loc to avoid chained assignment
lenta_data.loc[first_rows, 'text'] = lenta_data.loc[first_rows, 'text'].apply(preprocessing)

In [None]:
preprocessing_data = pd.DataFrame({'text': lenta_data['text'][:100000]})
preprocessing_data

Create the bigrams and trigrams, using gensim:

In [None]:
text = []
for index, row in preprocessing_data.iterrows():
        text.append(row['text'].split())

from gensim.models import Phrases
bigram = Phrases(text) 
trigram = Phrases(bigram[text])

for idx in range(len(text)):
    for token in bigram[text[idx]]:
        if '_' in token:
            text[idx].append(token)
    for token in trigram[text[idx]]:
        if '_' in token:
            text[idx].append(token)

Create the bigrams, using nltk:

In [None]:
texts = []
for index, row in preprocessing_data.iterrows():
        texts.append(row['text'].split())
        
from nltk.util import ngrams
import collections

bigrams = [ngrams(text, 2) for text in texts]
bigram_freq = [collections.Counter(bigram) for bigram in bigrams]
# look at the most popular bigrams in the third and fourth text
bigram_freq[2].most_common(5), bigram_freq[3].most_common(5)
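The per-document counts above show which bigrams are frequent, but not which ones are the most informative across the corpus. One common way to weight N-grams is pointwise mutual information (PMI), which favors pairs that occur together more often than their individual frequencies would predict. The sketch below is an assumption about how such a ranking could be done in plain Python. `rank_bigrams_by_pmi` and the toy documents are hypothetical and not part of the original notebook.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(docs, min_count=2):
    """Return (bigram, PMI) pairs sorted from strongest to weakest association."""
    unigram_counts = Counter(token for doc in docs for token in doc)
    bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue  # very rare pairs get inflated PMI, so skip them
        p_pair = count / total_bigrams
        p_first = unigram_counts[first] / total_unigrams
        p_second = unigram_counts[second] / total_unigrams
        scored.append(((first, second), math.log2(p_pair / (p_first * p_second))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy tokenized documents, made up for illustration (not taken from the dataset).
toy_docs = [
    ["курс", "рубл", "вырос", "курс", "рубл"],
    ["цен", "нефт", "упал", "курс", "рубл"],
    ["цен", "нефт", "вырос"],
]
top = rank_bigrams_by_pmi(toy_docs)
print(top)
```

In the notebook this could be applied to the tokenized corpus, for example `rank_bigrams_by_pmi(texts, min_count=10)[:20]`, where the `min_count` value is a guess that would need tuning.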

Create an LDA model for 100 000 items.

A more detailed study is carried out on 10 000 items, because running it on 100 000 items takes a long time.

In [None]:
from gensim.corpora.dictionary import Dictionary
from numpy import array

dictionary = Dictionary(text)
dictionary.filter_extremes(no_below=10, no_above=0.1)

corpus = [dictionary.doc2bow(doc) for doc in text]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

In [None]:
from gensim.models.ldamulticore import LdaMulticore

model = LdaMulticore(corpus = corpus, id2word = dictionary, num_topics = 22)
model.show_topics()

In [None]:
import pyLDAvis.gensim  # in pyLDAvis >= 3.3 this module is named pyLDAvis.gensim_models

lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

In [None]:
from gensim.models.coherencemodel import CoherenceModel

def calc_coherence_values(dictionary, corpus, texts, limit, start = 2, step = 2):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LdaMulticore(corpus=corpus,id2word = dictionary, num_topics = num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model = model, texts = texts, dictionary = dictionary, coherence = 'c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values = calc_coherence_values(dictionary = dictionary, corpus=corpus, texts=text, start = 2, limit = 41, step = 3)

Determine the optimal number of topics:

In [None]:
limit, start, step = 41, 2, 3
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Number of topics")
plt.ylabel("Coherence")
plt.legend(("coherence_values",), loc='best')  # trailing comma makes this a one-element tuple
plt.show()

Coherence is highest at about 22 topics.
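Instead of reading the best point off the plot, the topic count with the highest coherence can be picked programmatically. The helper below is a small sketch. `best_num_topics` is a hypothetical name, and the scores in the example are placeholders, not results from this notebook.

```python
def best_num_topics(coherence_values, start, limit, step):
    """Return (num_topics, coherence) for the highest coherence score."""
    topic_counts = list(range(start, limit, step))
    if len(topic_counts) != len(coherence_values):
        raise ValueError("coherence_values must hold one score per topic count")
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return topic_counts[best_index], coherence_values[best_index]

# Placeholder scores for illustration only; these are not results from the notebook.
example_scores = [0.31, 0.40, 0.38]
print(best_num_topics(example_scores, start=2, limit=11, step=3))
# prints: (5, 0.4)
```

With the variables from the cells above, the call would be `best_num_topics(coherence_values, start, limit, step)`.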


In [None]:
from gensim.models import LsiModel

In [None]:
lsamodel = LsiModel(corpus, num_topics = 22, id2word = dictionary) 
lsamodel.print_topics(num_topics = 22, num_words = 10)

In [None]:
def calc_coherence_values_Lsi(dictionary, corpus, texts, limit, start = 2, step = 2):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LsiModel(corpus=corpus, id2word = dictionary, num_topics = num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model = model, texts = texts, dictionary = dictionary, coherence = 'c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values_Lsi = calc_coherence_values_Lsi(dictionary = dictionary, corpus=corpus, texts=text, start = 2, limit = 41, step = 3)

In [None]:

limit, start, step = 41, 2, 3
x = range(start, limit, step)
plt.plot(x, coherence_values_Lsi)
plt.xlabel("Number of topics")
plt.ylabel("Coherence")
plt.legend(("coherence_values",), loc='best')  # trailing comma makes this a one-element tuple
plt.show()

Coherence is highest at about 22 topics.

- Which one performs better? What metrics were used to evaluate the model?


LSA performs slightly better: its coherence score (about 0.53) is higher than the LDA score (about 0.52). The difference is not significant, though, so the two models can be considered roughly equal: they give almost the same quality assessment for the same number of topics.

Coherence, computed here with the 'c_v' measure, is the metric used to evaluate model quality.
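To give an intuition for what a coherence score measures, the sketch below computes a simplified coherence: the average normalized pointwise mutual information (NPMI) over all pairs of a topic's top words, using document-level co-occurrence. This is not gensim's exact 'c_v' computation, which additionally uses a sliding window and compares context vectors by cosine similarity. `npmi_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Average NPMI over all pairs of top words, based on document co-occurrence."""
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur: minimum NPMI
            continue
        pmi = math.log(p12 / (prob(w1) * prob(w2)))
        denom = -math.log(p12)
        # If both words appear in every document, NPMI is undefined; treat it as 0.
        scores.append(pmi / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy tokenized documents, made up for illustration.
toy_docs = [["курс", "рубл"], ["курс", "рубл"], ["нефт", "цен"], ["нефт", "цен"]]
coherent = npmi_coherence(["курс", "рубл"], toy_docs)
mixed = npmi_coherence(["курс", "нефт"], toy_docs)
print(coherent, mixed)
```

Words that always appear together score close to 1, and words that never share a document score -1, so a topic whose top words belong together gets a higher average.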

In [None]:
coherence_values_Lsi, coherence_values

Save models for 100 000 items:

In [None]:
model.save('model_lda.gensim')

In [None]:
lsamodel.save("model_lsa.gensim")

Create and explore models for 10 000 items:

In [None]:
new_preprocessing_data = pd.DataFrame({'text': preprocessing_data['text'][:10000]})
new_preprocessing_data

Create N-grams:

In [None]:
text = []
for index, row in new_preprocessing_data.iterrows():
        text.append(row['text'].split())

from gensim.models import Phrases
bigram = Phrases(text) 
trigram = Phrases(bigram[text])

for idx in range(len(text)):
    for token in bigram[text[idx]]:
        if '_' in token:
            text[idx].append(token)
    for token in trigram[text[idx]]:
        if '_' in token:
            text[idx].append(token)

Create LDA-model:

In [None]:
from gensim.corpora.dictionary import Dictionary
from numpy import array

dictionary = Dictionary(text)
dictionary.filter_extremes(no_below=10, no_above=0.1)

corpus = [dictionary.doc2bow(doc) for doc in text]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Explore the influence of hyperparameters on model quality. First, find the optimal number of topics:

In [None]:
from gensim.models.ldamulticore import LdaMulticore

def calc_coherence_values_new_LDA(dictionary, corpus, texts, limit, start = 2, step = 2):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LdaMulticore(corpus=corpus,id2word = dictionary, num_topics = num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model = model, texts = texts, dictionary = dictionary, coherence = 'c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values = calc_coherence_values_new_LDA(dictionary = dictionary, corpus=corpus, texts=text, start = 2, limit = 20, step = 2)
coherence_values

The optimal number of topics is 6. Next, vary the eta hyperparameter and look at the result:

In [None]:
def calc_coherence_values_hp(dictionary, corpus, texts, num_topics = 6, eta = (1, 0.01, 0.001, 0.0001, 0.00001)):
    coherence_values = []
    model_list = []
    for eta_value in eta:
        model = LdaMulticore(corpus=corpus,id2word = dictionary, num_topics = num_topics, eta = eta_value)
        model_list.append(model)
        coherencemodel = CoherenceModel(model = model, texts = texts, dictionary = dictionary, coherence = 'c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values_eta = calc_coherence_values_hp(dictionary = dictionary, corpus=corpus, texts=text)
coherence_values_eta

The LDA model's coherence is best at eta = 0.001 and worst at eta = 0.00001.

Next, let's see how the alpha hyperparameter influences the LDA model:

In [None]:
def calc_coherence_values_hp_alpha(dictionary, corpus, texts, num_topics = 6, eta = 0.001):
    coherence_values = []
    model_list = []
    for alpha_value in np.arange(1, 2, 0.1):
        model = LdaMulticore(corpus=corpus,id2word = dictionary, num_topics = num_topics, alpha = alpha_value, eta = eta)
        model_list.append(model)
        coherencemodel = CoherenceModel(model = model, texts = texts, dictionary = dictionary, coherence = 'c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values


model_list, coherence_values_alpha = calc_coherence_values_hp_alpha(dictionary = dictionary, corpus=corpus, texts=text)
coherence_values_alpha

The results got worse. Alpha values in the range [0, 1] were also tried, but the results were just as bad.

For the best LDA model, create a pyLDAvis display:
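In LDA, alpha is the Dirichlet prior on each document's topic mixture and eta is the Dirichlet prior on each topic's word distribution. The sketch below illustrates, independently of gensim, how the concentration value of a symmetric Dirichlet changes the shape of the sampled distributions: small values give sparse mixtures dominated by one component, large values give nearly uniform ones. The function names and the chosen concentration values are hypothetical.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size=6, n_samples=500, seed=0):
    """Average weight of the largest component; closer to 1 means sparser mixtures."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(concentration, size, rng)) for _ in range(n_samples))
    return sum(largest) / n_samples

# size=6 mirrors the six topics used above; the concentrations are illustrative.
for value in (0.1, 1.0, 10.0):
    print(value, round(mean_largest_share(value), 3))
```

This may help explain why alpha values above 1 lowered coherence here: they push every document towards an even blend of all six topics, which is unlikely to suit short news texts that are mostly about a single subject.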

In [None]:
model = LdaMulticore(corpus=corpus,id2word = dictionary, num_topics = 6, eta = 0.001)

lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

In the picture, the bubbles are approximately the same size, which means the topics are about equally prevalent in the corpus.
The topics are also scattered across the diagram with virtually no overlap, which suggests that the LDA model separates the topics well.

Picture:

![LDA_for_10%20000.png](attachment:LDA_for_10%20000.png)

Create an LSI model for 10 000 items:

In [None]:
lsamodel = LsiModel(corpus, num_topics = 6, id2word = dictionary) 
lsamodel.print_topics(num_topics = 6, num_words = 10)

In [None]:
def calc_coherence_values_new_Lsi(dictionary, corpus, texts, limit, start = 2, step = 2):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LsiModel(corpus=corpus, id2word = dictionary, num_topics = num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model = model, texts = texts, dictionary = dictionary, coherence = 'c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values_Lsi = calc_coherence_values_new_Lsi(dictionary = dictionary, corpus=corpus, texts=text, start = 2, limit = 20, step = 2)
coherence_values_Lsi

In [None]:
limit, start, step = 20, 2, 2
x = range(start, limit, step)
plt.plot(x, coherence_values_Lsi)
plt.xlabel("Number of topics")
plt.ylabel("Coherence")
plt.legend(("coherence_values",), loc='best')  # trailing comma makes this a one-element tuple
plt.show()

With hyperparameter tuning, the LDA model performed better than the LSA model.
