Tretiak Olesia, Hermann Yavorskiy

Ukrainian Catholic University 2020

# Topic Modeling for News Articles

## 1 RESEARCH GOAL 
The goal of this project is to let the user digest a large volume of news in compressed form. We want to extract the topics discussed in our dataset and understand which of them were the ‘hottest’, without manually reading through more than 170k news articles. The pipeline should be automated so that it can process any collection of documents, in case someone wants to discover insights from a different (news) dataset. (Retraining, tuning and optimizing the model will still take time, and a different dataset will probably need its own approach to cleaning junk out of the texts.)
We will focus on classic NLP techniques, comparing methods and picking the best-performing ones. We will also compare implementations of individual steps (for example, lemmatization) across existing libraries to find a good trade-off between running time and accuracy.
The pipeline will be able to process an unseen news article and return its distribution over topics from our pre-trained LDA model. We will also attempt Dynamic Topic Modeling (DTM) to capture how topics evolve over time in our dataset (the DTM attempt fails, though, so keep reading).
When training our topic model (LDA), we will tune the hyperparameters not with good old grid search or randomized search, but with Bayesian optimization.
We would also like to extract named entities from the articles and present statistics on the most popular people, locations, etc. for each topic.
Finally, we will visualize the topics and the entities to provide more insight into the corpus.


### Imports

In [47]:
import pandas as pd
import numpy as np
import re

import os, json

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import CoherenceModel, TfidfModel, LdaMulticore, Phrases, LdaSeqModel, HdpModel
from gensim.corpora import Dictionary

import pyLDAvis

import time

import warnings
warnings.filterwarnings('ignore')

### Reading the JSON news dataset and sorting it by the publishing date

In [3]:
def extract_json_data(root, columns=('published', 'title', 'text')):
    json_files = (os.path.join(path, file)
                  for path, subdirs, files in os.walk(root)
                  for file in files
                  if file.endswith('.json'))

    # Collect one dict per article and build the DataFrame once;
    # DataFrame.append was removed in pandas 2.0 and was slow inside a loop.
    rows = []
    for file in json_files:
        with open(file, encoding='utf-8') as json_data:
            raw_data = json.load(json_data)
        rows.append({key: raw_data[key] for key in columns})

    articles = pd.DataFrame(rows, columns=list(columns))
    articles.published = pd.to_datetime(articles.published)
    return articles.sort_values('published').reset_index(drop=True)

Wall time: 0 ns


In [None]:
#!! change the path to where your data lives
documents = extract_json_data(r'C:\Users\trety\Desktop\NLP\714_20170904122143')

### Preview of the dataset

In [4]:
documents.head()

|   | published | title | text |
|---|---|---|---|
| 0 | 2017-01-29 02:00:00+02:00 | Woman Uses Pair of Tights to Make Point About ... | See the photos body positive blogger Milly Smi... |
| 1 | 2017-01-29 02:16:00+02:00 | Thon Maker set to start at centre against the ... | January 28, 2017 7:16pm EST January 28, 2017 7... |
| 2 | 2017-01-29 03:18:00+02:00 | Chandler Parsons calls out Trail Blazers on Tw... | \nPublished on Jan. 28, 2017 \nJan. 28, 2017 \... |
| 3 | 2017-01-29 04:48:00+02:00 | Fern Adair Conservatory, home to generations o... | It’s a beautiful day in the neighborhood, as a... |
| 4 | 2017-01-29 07:09:00+02:00 | Heat grab seventh straight against Pistons, 11... | January 29, 2017 12:09am EST January 28, 2017 ... |


### Pre-save data

In [5]:
documents.to_csv('popular_news.csv', index=False)

In [48]:
# documents = pd.read_csv('popular_news.csv', encoding='utf-8')

### Shape of the data

In [7]:
documents.shape

(170882, 3)

### Unique texts and titles

In [8]:
documents.text.nunique()

167589

In [9]:
documents.title.nunique()

167238

### Missing values
#### Better safe than sorry, so we check

In [10]:
documents.isna().sum()

published    0
title        0
text         0
dtype: int64

## Cleaning of the dataset

In [11]:
documents.dropna(inplace=True)
documents['text'] = documents['text'].map(str)
print(documents.shape)
documents = documents[documents['text'].map(len) > 1300]
print(documents.shape)
documents.drop_duplicates(subset ="text", 
                     keep = 'first', inplace = True) 
documents.drop_duplicates(subset ="title", 
                     keep = 'first', inplace = True) 
print(documents.shape)
# documents = documents[~documents['text'].str.contains("title_hn", flags=re.IGNORECASE)]
# print(documents.shape)
# documents = documents[~documents['text'].str.contains("", flags=re.IGNORECASE)]
# print(documents.shape)
# documents.text = documents.text.str.split('\ue001').str[-1]
documents = documents.reset_index(drop=True)

(170882, 3)
(130739, 3)
(127643, 3)


### Tokenize + lemmatize

In [35]:
from pattern.en import lemma

def preprocess(text):
    result = [lemma(token) for token in simple_preprocess(text) if token not in STOPWORDS and len(token) > 3]
    return result

In [36]:
start = time.time()
processed_docs0 = documents['text'].map(preprocess)
print(f'Lemmatizing with pattern: {round(time.time() - start, 2)/60} minutes')

Lemmatizing with pattern: 4.064666666666667 minutes
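
The research goal mentions comparing lemmatizers from different libraries by speed. A minimal sketch of such a comparison is shown here. `time_lemmatizer` and the two toy lemmatizers are hypothetical stand-ins (in practice one would pass `pattern.en.lemma` or another library's function), and accuracy would still have to be judged separately, for example by inspecting a sample of the output.

```python
import time

def time_lemmatizer(lemmatize, docs):
    """Apply `lemmatize` to every token of every document; return (result, seconds)."""
    start = time.perf_counter()
    result = [[lemmatize(token) for token in doc] for doc in docs]
    return result, time.perf_counter() - start

# Toy stand-ins for real lemmatizers (assumptions, for illustration only).
def strip_plural(token):
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

def identity(token):
    return token

sample_docs = [['players', 'games', 'coach'], ['phones', 'battery']]

timings = {}
for name, fn in {'strip_plural': strip_plural, 'identity': identity}.items():
    lemmas, seconds = time_lemmatizer(fn, sample_docs)
    timings[name] = seconds
    print(f'{name}: {seconds:.6f} s, first doc -> {lemmas[0]}')
```

Running every candidate on the same sample of documents keeps the timings comparable.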


### Bigrams (trigrams are left out for now)

In [39]:
def make_bigrams(processed_docs):
    bigram = Phrases(processed_docs, min_count=10, threshold=100) 
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    processed_docs = [bigram_mod[doc] for doc in processed_docs]
    return processed_docs

In [40]:
processed_docs0 = make_bigrams(processed_docs0)
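
To give a feeling for what `min_count` and `threshold` control, here is a simplified sketch of the kind of score that decides whether two adjacent tokens are merged. It assumes the default scoring is roughly `(count(ab) - min_count) * vocab_size / (count(a) * count(b))`; gensim's own bookkeeping differs in details (how the vocabulary size is counted, for example), so `bigram_scores` is a hypothetical helper for illustration, not the library's implementation.

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent pairs as (count(ab) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(token for doc in docs for token in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

toy_docs = [
    ['north', 'korea', 'missile', 'test'],
    ['north', 'korea', 'nuclear', 'test'],
    ['north', 'korea', 'missile', 'launch'],
]
scores = bigram_scores(toy_docs, min_count=1)

# Hypothetical threshold for the toy data; the notebook uses threshold=100 on the full corpus.
threshold = 1.0
accepted = sorted(pair for pair, score in scores.items() if score > threshold)
print(accepted)
```

Pairs that always occur together score high, while pairs of individually frequent words that only occasionally meet score low, which is why a high threshold yields fewer but cleaner bigrams.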

### Dictionary

In [42]:
dictionary0 = Dictionary(processed_docs0)

print(dictionary0)

Dictionary(390076 unique tokens: ['account', 'attempt', 'average', 'away', 'basket']...)


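#### Filter out tokens that appear in fewer than 5 documents

With its default arguments, `filter_extremes` also drops tokens that occur in more than 50% of the documents and keeps at most 100,000 tokens, which is why the filtered dictionary below has exactly 100000 entries.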

In [43]:
dictionary0.filter_extremes(no_below=5)

print(dictionary0)

Dictionary(100000 unique tokens: ['account', 'attempt', 'average', 'away', 'basket']...)
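
The filtering step can be pictured with a small pure-Python sketch. `filter_by_document_frequency` is a hypothetical helper that mimics the three knobs (`no_below`, `no_above`, `keep_n`) on plain token lists; it is a simplified illustration under that assumption, not gensim's code.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])

toy_docs = [
    ['game', 'team', 'the'],
    ['game', 'player', 'the'],
    ['phone', 'the'],
    ['phone', 'battery', 'the'],
]
kept_tokens = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept_tokens))
```

Here `the` is removed for being too common and `team`, `player`, `battery` for being too rare.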


### BOW -> TF-IDF

In [44]:
bow_corpus = [dictionary0.doc2bow(doc) for doc in processed_docs0]
tfidf = TfidfModel(bow_corpus)
corpus_tfidf_0 = tfidf[bow_corpus]
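
As a rough illustration of what the TF-IDF step does to the bag-of-words counts, the sketch below reweights toy documents by hand. It assumes a weighting of raw term frequency times `log2(N / document_frequency)` followed by L2 normalization, which matches our understanding of the `TfidfModel` defaults; `tfidf_weights` is a hypothetical helper, not part of gensim.

```python
import math
from collections import Counter

def tfidf_weights(bow_docs):
    """bow_docs: list of dicts token -> raw count. Returns list of dicts token -> weight."""
    n_docs = len(bow_docs)
    doc_freq = Counter(token for doc in bow_docs for token in doc)
    weighted = []
    for doc in bow_docs:
        # Term frequency times inverse document frequency (log base 2).
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Normalize to unit length and drop tokens whose weight is zero.
        weighted.append({t: w / norm for t, w in raw.items() if w > 0} if norm else {})
    return weighted

toy_bows = [
    {'game': 2, 'team': 1, 'the': 5},
    {'game': 1, 'phone': 1, 'the': 4},
    {'phone': 3, 'the': 6},
]
weights = tfidf_weights(toy_bows)
for doc in weights:
    print({token: round(weight, 3) for token, weight in doc.items()})
```

A token that occurs in every document (`the`) gets weight zero, and a rarer token (`team`) outweighs a more widespread one (`game`) even with a lower raw count.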

## LDA helper functions

In [55]:
def LDA(corpus, dictionary):
    lda_model = LdaMulticore(corpus, 
                             num_topics=30, 
                             id2word=dictionary, 
                             workers=3, 
                             per_word_topics=True,
                             #minimum_probability=0.1,
                             #alpha=0.1,
                             random_state=42,
                             passes=10) 
    return lda_model

In [56]:
def topics(model):
    for i in range(model.num_topics):
        words = ' '.join([x[0] for x in model.show_topic(i)])
        print('topic',i+1,':', words)
        print()

In [74]:
def docs_topics(model, corpus):
    mentioned_rows = []  # one row per (document, topic) pair
    docs_rows = []       # one row per document, one column per topic
    for i, a in enumerate(model.get_document_topics(corpus)[:]):
        temp_dict = {}
        a = sorted(a, key=lambda x: x[1], reverse=True)
        try:
            first, snd = zip(*a)
        except ValueError:
            # no topics were returned for this document
            continue
        for j in range(len(first)):
            temp_dict[f'Topic {first[j]}'] = round(snd[j]*100, 2)
            mentioned_rows.append({'Document #': i, 'Mentioned topic': int(first[j]), 'Percentage': round(snd[j]*100, 2)})
        docs_rows.append(temp_dict)
    # DataFrame.append was removed in pandas 2.0, so build both frames once from lists of dicts
    return pd.DataFrame(mentioned_rows), pd.DataFrame(docs_rows)

#### Visualization function

In [71]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

def vis(model, corpus, dictionary):
    v = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
    return v

## Running untuned LDA on our corpus

In [59]:
start = time.time()
lda_model_tfidf_0 = LDA(corpus_tfidf_0, dictionary0)                                           
print('TIME: ',(time.time()-start)/60, 'minutes')

TIME:  74.82215463717779 minutes


### Saving the model

In [60]:
lda_model_tfidf_0.save('30_topics')

#### Example on how to load the model

In [61]:
# lda = LdaMulticore.load('30_topics')

### Topics:

In [62]:
topics(lda_model_tfidf_0)

topic 1 : india kohli cricket test australia wicket sturgeon barcelona bangladesh nicola_sturgeon

topic 2 : india minister uttar_pradesh modi delhi party congres singh narendra_modi chief

topic 3 : north_korea china malaysia jong north_korean missile nasa malaysian south_korea nuclear

topic 4 : swn colbert dalai_lama corden jame_corden cyclone golovkin bieber joho storm_dori

topic 5 : food restaurant meat chef chicken menu cook wine beer cheese

topic 6 : women family life girl love feel children like think mother

topic 7 : game player season team play sport club coach league match

topic 8 : star_war buzzfee sasikala html_tag href_strong macron guardian_galaxy mcmaster jedi rogue

topic 9 : samsung galaxy phone smartphone android device apple nokia battery samsung_galaxy

topic 10 : film music movie song star oscar album actor award artist

topic 11 : beauty_beast emma_watson meredith belle barangay cebu deadpool flipkart efcc grey_anatomy

topic 12 : police officer attack arrest

## Bayesian Optimization

In [63]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

fspace = {
    'num_topics': hp.quniform('num_topics', 30, 40, 1),
    'alpha': hp.uniform('alpha', 0.03, 0.3),
    #'eta': hp.uniform('eta', 0.01, 0.3),
}

In [64]:
def f(params):
    lda_model = LdaMulticore(corpus_tfidf_0, 
                         num_topics=int(params['num_topics']),  # hp.quniform samples floats such as 36.0
                         id2word=dictionary0, 
                         workers=3, 
                         per_word_topics=True,
                         #minimum_probability=0.1,
                         alpha=params['alpha'],
                         random_state=42,
                         passes=10) 
    
    coherence_lda = CoherenceModel(model=lda_model, texts=processed_docs0, dictionary=dictionary0, coherence='c_v').get_coherence()
    
    return {'loss': -coherence_lda, 'status': STATUS_OK}
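
The objective returns the negated coherence because `fmin` minimizes its loss. The `c_v` measure used above is fairly involved, so the sketch below swaps in a much simpler UMass-style co-occurrence score purely to show what "coherence" rewards: top words of a topic that tend to appear in the same documents. `umass_coherence`, the smoothing constant `eps=1.0` and the averaging over pairs are assumptions of this illustration and will not reproduce gensim's numbers.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs, eps=1.0):
    """Average over word pairs of log((D(w_i, w_j) + eps) / D(w_j)), w_j ranked above w_i."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_count(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_count(w_i, w_j) + eps) / d_j))
    return sum(scores) / len(scores) if scores else float('nan')

toy_docs = [
    ['game', 'team', 'coach'],
    ['game', 'team'],
    ['phone', 'battery'],
    ['phone', 'game'],
]
coherent = umass_coherence(['game', 'team'], toy_docs)
incoherent = umass_coherence(['game', 'battery'], toy_docs)
loss = -coherent  # what an optimizer such as fmin would minimize
print(coherent, incoherent, loss)
```

A topic whose top words co-occur gets a higher score, so minimizing the negated score steers the search toward more interpretable topics.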

In [65]:
trials = Trials()

In [66]:
start = time.time()
best = fmin(fn=f, space=fspace, algo=tpe.suggest, max_evals=10, trials=trials)
print("TIME:", time.time() - start)

100%|█████████████████████████████████████████████| 10/10 [14:58:54<00:00, 5393.40s/it, best loss: -0.5645393342929282]
TIME: 53934.076534986496


In [67]:
print('best:', best)

best: {'alpha': 0.13796190573474543, 'num_topics': 36.0}


In [68]:
lda_model = LdaMulticore(corpus_tfidf_0, 
                     num_topics=int(best['num_topics']),  # fmin returns 36.0, a float
                     id2word=dictionary0, 
                     workers=3, 
                     per_word_topics=True,
                     #minimum_probability=0.1,
                     alpha=best['alpha'],
                     random_state=42,
                     passes=10) 

In [69]:
topics(lda_model)

topic 1 : food meat mcdonald chicken restaurant chef menu cook rice cheese

topic 2 : police officer attack arrest victim kill incident court suspect london

topic 3 : argentina maldive raqqa chile jazeera lender rupee davao greece mackay

topic 4 : ferrari barca ibrahimovic mercede orgasm hamilton napolitano portuguese gibson milan

topic 5 : nune intelligence_committee kardashian surveillance gordhan bollywood adam_schiff affleck eastender golf

topic 6 : samsung apple device user google phone smartphone iphone android galaxy

topic 7 : paul_manafort pelosi famine mulvaney yadav classify_information mick_mulvaney pruitt hernandez congo

topic 8 : kohli mumbai bengaluru pune dharamsala bangalore indiatoday pope lukaku sindhu

topic 9 : russian_ambassador angela_merkel mla bangladesh german_chancellor sasikala mcmaster donald_tusk tusk dhaka

topic 10 : film star movie love song actor music video character instagram

topic 11 : pipeline schiff najib messi buzzfee href_strong html_tag d

In [70]:
lda_model.save('baynesian_model')

In [72]:
vis(lda_model, corpus_tfidf_0, dictionary0)

# Everything below will be finished in a future update: because of power issues (Murphy's law kicked in) we could not execute all the code in time

## Predictions on unseen data

In [54]:
#data = "LOGAN PAUL HOSTS DUBAI'S BIGGEST EVER MEET AND GREET An estimated 11,000 fans packed into The Dubai Mall this Saturday"
data = '''President Donald Trump has put his Republican comrades in another politically tricky position as they await his latest scheme to snub Congress in shifting billions of dollars toward the border wall.

Even some GOP legislators who say they back the U.S.-Mexico barrier are lamenting reports that the president is plotting to reprogram $7.2 billion. Although the Trump administration has not confirmed the plan, the Washington Post reported Monday that the administration intends to shift that cash from military construction projects and efforts to combat drug smuggling.'''
vec = dictionary0.doc2bow(preprocess(data))

topics_list = lda_model_tfidf_0[vec]
topics_list[0]

[(12, 0.51910496), (13, 0.46130717)]
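
The result is a list of `(topic_id, probability)` pairs. A small hypothetical helper such as `dominant_topic` below can turn it into a single answer; the optional `labels` mapping is an assumption of this sketch and would have to be written for whichever model produced the prediction. Note also that the cell above passes raw bag-of-words counts to the untuned model without the bigram and TF-IDF steps used in training, so the probabilities should be read with some caution.

```python
def dominant_topic(doc_topics, labels=None):
    """Return (topic_id, probability, label) for the most probable topic."""
    topic_id, probability = max(doc_topics, key=lambda pair: pair[1])
    label = labels.get(topic_id) if labels else None
    return topic_id, probability, label

# Distribution copied from the output above.
prediction = [(12, 0.51910496), (13, 0.46130717)]

# Hypothetical label, only to show the lookup.
example_labels = {12: 'label for topic 12'}

print(dominant_topic(prediction))
print(dominant_topic(prediction, example_labels))
```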

### DataFrames for visualizations

In [None]:
mentioned, docs = docs_topics(lda_model, corpus_tfidf_0)

In [None]:
mentioned.to_csv('mentioned_topics.csv', index=False)
docs.to_csv('doc_topic.csv', index=False)

In [None]:
mentioned.head()

In [None]:
docs.head()

In [None]:
# Named topic_names so the list does not shadow the topics() helper defined earlier
topic_names = ['FOOD', 'POLICE & LAW VIOLATIONS', 'TRAVEL', 'FOOTBALL (BARCELONA, MILANO)', 'CELEBRITIES (REALITY SHOWS)', 'TECHNOLOGIES: SMARTPHONES',
               'PEOPLE IN POLITICS', 'INDIA', 'POLITICS: RUSSIA VS GERMANY', 'SOCIAL MEDIA & HOLLYWOOD', 'PROGRAMMING', 'HEALTH (CHILDREN)',
               'USA POLITICS: ELECTIONS', 'WAR: MIDDLE EAST', 'FINANCE', 'MOVIES', 'MARIJUANA', 'UNIDENTIFIED', 'TURKEY', 'BEAUTY AND THE BEAST',
               'SPACE (MENTAL HEALTH OF ASTRONAUTS)', 'ISRAEL', 'SPANISH-RELATED', 'GOLF', 'FOOTBALL UK', 'WILD LIFE/ZOO/UNIVERSITY’S SPORTS',
               'SPORTS: (AMERICAN FOOTBALL)', 'SPORTS: FOOTBALL/BOXING', 'COURT (LAW)', 'POLITICS: ASIA', 'OBAMA HEALTHCARE', 'CLIMATE CHANGE RESEARCH',
               'WEATHER', 'SCHOOL', 'AIRCRAFT', 'MUSIC']

In [None]:
sequence = [f'Topic {i}' for i in range(36)]
docs = docs.reindex(columns=sequence)
docs

In [None]:
df0 = mentioned.groupby(['Mentioned topic'], sort=False).size().reset_index(name='Count').sort_values(by='Mentioned topic').reset_index(drop=True)
df0

## Named entity recognition with spaCy

In [None]:
## in future update

## Dynamic Topic Modeling

### *Attempt*

In [None]:
docs_test = documents[:10000]
process_test = docs_test['text'].map(preprocess)
process_test = make_bigrams(process_test)
dict_test = Dictionary(process_test)
dict_test.filter_extremes(no_below=5)
bow_corpus = [dict_test.doc2bow(doc) for doc in process_test]
tfidf = TfidfModel(bow_corpus)
corpus_test = tfidf[bow_corpus]

In [None]:
time_slice = [5000, 5000]
start = time.time()
ldaseq = LdaSeqModel(corpus=corpus_test, id2word=dict_test, time_slice=time_slice, num_topics=20, random_state=42)
print(f'Training the model on tf-idf corpus: {round((time.time()-start)/60, 2)} minutes')
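
The `time_slice` argument lists how many documents fall into each consecutive period, in corpus order, so the corpus has to be sorted by date first. The attempt above simply halves the 10,000 documents. A sketch of deriving the slices from the publishing dates instead is given here; `monthly_time_slices` is a hypothetical helper and the choice of calendar months as the period is an assumption.

```python
from collections import Counter
from datetime import datetime

def monthly_time_slices(published):
    """published: datetimes sorted ascending. Returns (month keys, document counts)."""
    counts = Counter((d.year, d.month) for d in published)
    months = sorted(counts)
    return months, [counts[month] for month in months]

# Toy dates standing in for the sorted 'published' column.
toy_published = [
    datetime(2017, 1, 29), datetime(2017, 1, 30),
    datetime(2017, 2, 1), datetime(2017, 2, 2), datetime(2017, 2, 3),
]
months, time_slice = monthly_time_slices(toy_published)
print(months, time_slice)
```

The counts sum to the number of documents, which is the property a `time_slice` list needs to cover the whole corpus.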

In [None]:
for i in range(ldaseq.num_topics):
    print(f'\n\nTOPIC {i}')
    for j, topic in enumerate(ldaseq.print_topic_times(i, top_terms=15)):
        print(f'\nAt time {j} : ', end='')
        words, prob = zip(*topic)
        print(' '.join(words))

#### A detailed description of why this piece doesn't work is given in the report

## Visualizations

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
df_appended = pd.concat([df0, df1, df2, df3])

In [None]:
df_appended

In [None]:
import plotly.express as px

# gapminder = px.data.gapminder().query("continent=='Oceania'")
# gapminder
fig = px.line(df_appended, x='Timeslice', y="Count", color='Mentioned topic', labels={'Count':'Number of documents', 'Timeslice':'Timeslice'})
fig.show()

In [None]:
df0['Count_percent'] = round(df0['Count']/(df0['Count'].sum())*100, 2)
# df1['Count_percent'] = round(df1['Count']/(df1['Count'].sum())*100, 2)


In [None]:
import plotly.graph_objects as go

fig = go.Figure(
    data=[go.Bar(x=sequence, y=df0['Count'])],
    layout=go.Layout(
        title=go.layout.Title(text="Number of documents containing topic N ")
    )
)
fig.update_traces(marker_color='pink')
fig.show()

In [None]:
fig = go.Figure(
    data=[go.Bar(x=sequence, y=df0['Count_percent'])],
    layout=go.Layout(
        title=go.layout.Title(text="Percentage of documents containing topic N")
    )
)
#fig.update_traces(marker_color='pink')
fig.show()

In [None]:

x, y = zip(*lda_model_tfidf_2.get_document_topics(corpus_tfidf_2)[6])
fig = go.Figure(
    data=[go.Bar(x=[f'Topic {_}' for _ in x], y=[_*100 for _ in y])],
    layout=go.Layout(
        title=go.layout.Title(text="Percentage of documents containing topic N in Timeslice 1")
    )
)
#fig.update_traces(marker_color='pink')
fig.show()

## Search

In [None]:
docs_topics0.fillna(value=0, inplace=True)
docs_topics0
docs_topics0.loc[docs_topics0['Topic 0'] > 0].sort_values('Topic 0', ascending=False)

In [None]:
# fig = px.bar(df1, x='Dominant topic 0', y='Count', labels={'Count':'Number of documents'})
# fig.update_traces(marker_color='pink')
# fig.show()

In [None]:
lda_model_tfidf_0.num_terms

In [None]:
# vis(lda_model_tfidf_3, corpus_tfidf_3, dictionary3)   