# Topic modeling with BERT

The purpose of topic modeling on our dataset is to find single keywords/topics for each article that can be fed into the Google Trends API, to gain insight into how popular certain events were at a given time.

For topic modeling we will use BERTopic, a library by Maarten Grootendorst that builds unsupervised topic models on top of embeddings from Bidirectional Encoder Representations from Transformers (BERT). Documentation can be found here: https://github.com/MaartenGr/BERTopic

#TODO write why BERT

In [1]:
from bertopic import BERTopic
import pandas as pd
import numpy as np
import re, nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from nltk.stem import WordNetLemmatizer

First we want to remove stopwords from our data and then lemmatize it. This will be done with the nltk library.

In [2]:
# Load data
df = pd.read_csv('articles_data.csv')

# Drop rows with nan-values in specific columns
df = df.dropna(subset = ['Unnamed: 0', 'source_id', 'source_name', 'author', 'title',
       'description', 'url', 'published_at',
       'top_article', 'engagement_reaction_count', 'engagement_comment_count',
       'engagement_share_count', 'engagement_comment_plugin_count', 'content'])
# Reset index
df = df.reset_index(drop=True)


In [3]:
# Function for cleaning text
def cleaned_text(text, source_name):
    try:
        if source_name == 'Reuters':
            # Drop the dateline, e.g. "WASHINGTON (Reuters) - ..."
            text = re.split('-', text, maxsplit=1)[1]
        clean = re.sub(r"\n", " ", text)
        clean = clean.lower()
        # Replace punctuation with spaces
        clean = re.sub(r"[~.,%/:;?_&+*=!-]", " ", clean)
        # Remove bracketed residue such as "[+1200 chars]"
        clean = re.sub(r"\[.*?\]", "", clean)
        # Keep letters only
        clean = re.sub(r"[^a-z]", " ", clean)
        clean = clean.lstrip()
        # Collapse runs of whitespace
        clean = re.sub(r"\s{2,}", " ", clean)
        # Remove single-character words
        clean = re.sub(r'\b\w\b', '', clean)
    except (TypeError, IndexError):
        # Non-string content (e.g. NaN) or a Reuters article without a "-" separator
        clean = np.nan
    return clean
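
To see what the cleaning steps do, here is a minimal sketch that applies the same regular expressions to a made-up Reuters-style sentence. The helper name `clean_article`, its `strip_dateline` flag and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def clean_article(text: str, strip_dateline: bool = False) -> str:
    """Condensed, illustrative copy of the cleaning steps (hypothetical helper)."""
    if strip_dateline:
        # Keep only what follows the first hyphen, e.g. "LONDON (Reuters) - ..."
        text = re.split("-", text, maxsplit=1)[1]
    text = text.replace("\n", " ").lower()
    text = re.sub(r"[~.,%/:;?_&+*=!-]", " ", text)  # punctuation -> space
    text = re.sub(r"\[.*?\]", "", text)             # drop bracketed residue
    text = re.sub(r"[^a-z]", " ", text)             # keep letters only
    text = re.sub(r"\s{2,}", " ", text.lstrip())    # collapse whitespace
    return re.sub(r"\b\w\b", "", text)              # drop one-letter words

# Made-up sample text, for illustration only
sample = "LONDON (Reuters) - Stocks rose 5% on Monday. [+1200 chars]"
tokens = clean_article(sample, strip_dateline=True).split()
print(tokens)  # ['stocks', 'rose', 'on', 'monday']
```

The dateline, the number, the punctuation and the bracketed character count are all gone, and only lowercase words remain.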

In [4]:
# Cleaning text
df["cleaned_content"] = df.apply(lambda x : cleaned_text(x['content'], x['source_name']), axis=1)

# Drop rows with nan in column cleaned_content
df = df.dropna(subset = ['cleaned_content'])
df = df.reset_index(drop=True)

In [5]:
stop=stopwords.words('english')
stop.append("say")
# Remove stopwords
df["stop_removed_content"]=df["cleaned_content"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
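
The order of the steps matters here. The sketch below uses a tiny hard-coded stopword set as a stand-in for nltk's English list (an assumption, so that it runs without downloading any nltk data) and a hypothetical helper `remove_stopwords`. It suggests why "say" can still appear later on: the filter runs before lemmatization, so only the exact form "say" is removed, while "said" passes through and is turned into "say" afterwards.

```python
# Tiny stand-in for nltk's English stopword list (assumption, illustration only)
STOP_WORDS = frozenset({"the", "is", "to", "if", "say"})

def remove_stopwords(text: str, stop_words=STOP_WORDS) -> str:
    """Drop every whitespace-separated token that is in stop_words."""
    # A set gives constant-time membership checks, unlike a list
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = remove_stopwords("the board said the union is waiting to see")
print(filtered)  # board said union waiting see
```

If "say" should be excluded from the topics, one option is to run a second stopword pass on the lemmatized tokens.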

In [6]:
# Tokenize
df["tokenized"]=df["stop_removed_content"].apply(lambda x: nltk.word_tokenize(x))

In [7]:
# Function for lemmatizing
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def word_lemmatizer(text):
    # Lemmatize every token as a verb (pos='v')
    lem_text = [lemmatizer.lemmatize(i, pos='v') for i in text]
    return lem_text

# Lemmatize
df["lemmatized"]=df["tokenized"].apply(lambda x: word_lemmatizer(x))
df["lemmatize_joined"]=df["lemmatized"].apply(lambda x: ' '.join(x))

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count,cleaned_content,stop_removed_content,tokenized,lemmatized,lemmatize_joined
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0,the national transportation safety board said ...,national transportation safety board said tues...,"[national, transportation, safety, board, said...","[national, transportation, safety, board, say,...",national transportation safety board say tuesd...
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0,the states jobless rate fell to per cent last ...,states jobless rate fell per cent last month a...,"[states, jobless, rate, fell, per, cent, last,...","[state, jobless, rate, fell, per, cent, last, ...",state jobless rate fell per cent last month ac...
2,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0,han kwang song the first north korean football...,han kwang song first north korean footballer s...,"[han, kwang, song, first, north, korean, footb...","[han, kwang, song, first, north, korean, footb...",han kwang song first north korean footballer s...
3,5,abc-news,ABC News,The Associated Press,'This Tender Land' is an affecting story about...,"""This Tender Land"" by William Kent Krueger is ...",https://abcnews.go.com/Entertainment/wireStory...,,2019-09-03T15:56:49Z,"""This Tender Land: a Novel"" (Atria Books), by ...",0.0,0.0,0.0,0.0,0.0,this tender land novel atria books by william...,tender land novel atria books william kent kru...,"[tender, land, novel, atria, books, william, k...","[tender, land, novel, atria, book, william, ke...",tender land novel atria book william kent krue...
4,6,reuters,Reuters,Reuters Editorial,EU wants to see if lawmakers will block Brexit...,The European Union is waiting to see if Britis...,https://www.reuters.com/article/us-britain-eu-...,https://s2.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:25:41Z,LONDON (Reuters) - The European Union is waiti...,0.0,0.0,0.0,817.0,0.0,the european union is waiting to see if britis...,european union waiting see british lawmakers b...,"[european, union, waiting, see, british, lawma...","[european, union, wait, see, british, lawmaker...",european union wait see british lawmakers bloc...


We will now create our topic model with the BERTopic library. We will use a pre-trained embedding model named `all-MiniLM-L6-v2`, a general-purpose sentence-embedding model trained on more than 1 billion training pairs.

In [9]:
# create model 

#model = BERTopic(verbose=True, embedding_model = 'all-MiniLM-L6-v2', calculate_probabilities = True, nr_topics="auto")
 
#topics, probabilities = model.fit_transform(df['lemmatize_joined'])

# Save model

#model.save("topics_model")

In [10]:
# Load model

model = BERTopic.load("topics_model")

topics, probabilities = model.transform(df['lemmatize_joined'])


In [56]:
a = 'national transportation safety board say tuesday tesla model autopilot mode strike fire truck culver city california one series crash board investigate involve tesla driver assistance state jobless rate fell per cent last month accord latest official figure higher previously report account upward revision central statistics office cso one several last two years nonethehan kwang song first north korean footballer score italian serie league join reign champion juventus loan cagliari han sign one year contract club option buy year end five million eu'

We can see that the model finds 17 topics, plus the label -1, which refers to all documents that did not have any topic assigned. In this run, 2742 articles have not been assigned a topic.

BERTopic uses the clustering algorithm HDBSCAN (https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html). A characteristic of this algorithm is that it does not force every document/article into a cluster: if no suitable cluster is found, the document is simply treated as an outlier.
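
Because outliers share the label -1, they have to be excluded explicitly when counting topics or assigned articles. A minimal sketch of that bookkeeping, using a made-up list of labels in place of the real `topics` output:

```python
from collections import Counter

# Hypothetical topic labels, one per document; -1 marks an outlier
doc_topics = [0, 0, -1, 1, 0, -1, 2, 1]

counts = Counter(doc_topics)
n_outliers = counts.get(-1, 0)                        # documents without a topic
n_topics = sum(1 for label in counts if label != -1)  # real topics only
n_assigned = len(doc_topics) - n_outliers             # documents with a topic

print(n_topics, n_assigned, n_outliers)  # 3 6 2
```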

In [60]:
# Topic -1 collects the outliers, so exclude it from both counts
topic_freq = model.get_topic_freq()
assigned = topic_freq[topic_freq.Topic != -1]
print(f'Number of topics the model finds: {len(assigned)}')
print(f'Number of articles with an assigned topic: {assigned.Count.sum()}')
topic_freq

Number of topics the model finds: 17
Number of articles with an assigned topic: 5378


Unnamed: 0,Topic,Count
0,0,4709
1,-1,2742
2,1,133
3,2,79
4,3,77
5,4,76
6,5,66
7,6,50
8,7,30
9,8,29


If we look at the 8 largest topics, we can see some patterns. Topic 0 is about Donald Trump and topic 7 is about the protests in Hong Kong.

In [61]:
import itertools
import numpy as np
from typing import List

import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [62]:
def visualize_barchart2(topic_model,
                       topics: List[int] = None,
                       top_n_topics: int = 12,
                       n_words: int = 5,
                       width: int = 250,
                       height: int = 250) -> go.Figure:
    """ Visualize a barchart of selected topics
    Arguments:
        topic_model: A fitted BERTopic instance.
        topics: A selection of topics to visualize.
        top_n_topics: Only select the top n most frequent topics.
        n_words: Number of words to show in a topic
        width: The width of each figure.
        height: The height of each figure.
    Returns:
        fig: A plotly figure
    Usage:
    To visualize the barchart of selected topics
    simply run:
    ```python
    visualize_barchart2(topic_model)
    ```
    Or if you want to save the resulting figure:
    ```python
    fig = visualize_barchart2(topic_model)
    fig.write_html("path/to/file.html")
    ```
    """
    colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7", "#E69F00", "#56B4E9", "#009E73", "#F0E442"])

    # Select topics based on top_n and topics args, skipping the outlier topic -1
    # (-1 is not guaranteed to be the first row of the frequency table)
    freq_topics = [t for t in topic_model.get_topic_freq().Topic.to_list() if t != -1]
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = freq_topics[:top_n_topics]
    else:
        topics = freq_topics[:6]

    # Initialize figure
    subplot_titles = [f"Topic {topic}" for topic in topics]
    columns = 4
    rows = int(np.ceil(len(topics) / columns))
    fig = make_subplots(rows=rows,
                        cols=columns,
                        shared_xaxes=False,
                        x_title = 'c-TF-IDF score',
                        y_title = f'top {n_words} words',
                        horizontal_spacing=.1,
                        vertical_spacing=.4 / rows if rows > 1 else 0,
                        subplot_titles=subplot_titles)

    # Add barchart for each topic
    row = 1
    column = 1
    for topic in topics:
        words = [word + "  " for word, _ in topic_model.get_topic(topic)][:n_words][::-1]
        scores = [score for _, score in topic_model.get_topic(topic)][:n_words][::-1]

        fig.add_trace(
            go.Bar(x=scores,
                   y=words,
                   orientation='h',
                   marker_color=next(colors)),
            row=row, col=column)

        if column == columns:
            column = 1
            row += 1
        else:
            column += 1

    # Stylize graph
    fig.update_layout(
        template="plotly_white",
        showlegend=False,
        title={
            'text': "<b>Topic Word Scores</b>",
            'x': .5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
        },
        width=width*4,
        height=height*rows if rows > 1 else height * 1.3,
        hoverlabel=dict(
            bgcolor="white",
            font_size=16,
            font_family="Rockwell"
        ),
    )
    
    
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)
    
    return fig

In [63]:
visualize_barchart2(model)

Now we will give each article 3 keywords that can be put into the Google Trends API. They are the 3 most important words of the topic that the article is clustered to.

In [64]:
# Making topic column
df['Topic'] = topics

def get_3_topic(data, model):
    topic1 = []
    topic2 = []
    topic3 = []
    
    for i in data:
        if i < 0:
            topic1.append(np.nan)
            topic2.append(np.nan)
            topic3.append(np.nan)
        else:
            topic1.append(model.get_topic(i)[0][0])
            topic2.append(model.get_topic(i)[1][0])
            topic3.append(model.get_topic(i)[2][0])

    return topic1, topic2, topic3

In [65]:
topic1, topic2, topic3 = get_3_topic(df['Topic'], model)

df['Topic1'] = topic1
df['Topic2'] = topic2
df['Topic3'] = topic3
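
As a variation, the lookup can be done once per topic instead of three times per article. The sketch below is only an illustration under stated assumptions: `topic_words` is a hand-written stand-in for what `model.get_topic(i)` returns (a list of `(word, score)` pairs), the scores are invented, and `keywords_per_document` is a hypothetical helper name.

```python
from typing import Dict, List, Optional, Sequence, Tuple

def keywords_per_document(
    doc_topics: Sequence[int],
    topic_words: Dict[int, List[Tuple[str, float]]],
    n_keywords: int = 3,
) -> List[List[Optional[str]]]:
    """Return the top n_keywords words of each document's topic (None for outliers)."""
    # Build the keyword list once per topic, ignoring the outlier label -1
    top = {
        topic: [word for word, _ in words[:n_keywords]]
        for topic, words in topic_words.items()
        if topic >= 0
    }
    return [list(top.get(topic, [None] * n_keywords)) for topic in doc_topics]

# Stand-in for model.get_topic(i); the scores are made up for illustration
topic_words = {
    0: [("say", 0.05), ("new", 0.03), ("president", 0.02), ("year", 0.01)],
    4: [("book", 0.09), ("novel", 0.07), ("prize", 0.04)],
}
keywords = keywords_per_document([0, -1, 4], topic_words)
print(keywords)
# [['say', 'new', 'president'], [None, None, None], ['book', 'novel', 'prize']]
```

Here outliers get `None` instead of `np.nan`; pandas treats both as missing values in an object column.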

In [66]:
df.head()

Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,...,engagement_comment_plugin_count,cleaned_content,stop_removed_content,tokenized,lemmatized,lemmatize_joined,Topic,Topic1,Topic2,Topic3
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,...,0.0,the national transportation safety board said ...,national transportation safety board said tues...,"[national, transportation, safety, board, said...","[national, transportation, safety, board, say,...",national transportation safety board say tuesd...,0,say,new,president
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,...,0.0,the states jobless rate fell to per cent last ...,states jobless rate fell per cent last month a...,"[states, jobless, rate, fell, per, cent, last,...","[state, jobless, rate, fell, per, cent, last, ...",state jobless rate fell per cent last month ac...,-1,,,
2,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",...,0.0,han kwang song the first north korean football...,han kwang song first north korean footballer s...,"[han, kwang, song, first, north, korean, footb...","[han, kwang, song, first, north, korean, footb...",han kwang song first north korean footballer s...,0,say,new,president
3,5,abc-news,ABC News,The Associated Press,'This Tender Land' is an affecting story about...,"""This Tender Land"" by William Kent Krueger is ...",https://abcnews.go.com/Entertainment/wireStory...,,2019-09-03T15:56:49Z,"""This Tender Land: a Novel"" (Atria Books), by ...",...,0.0,this tender land novel atria books by william...,tender land novel atria books william kent kru...,"[tender, land, novel, atria, books, william, k...","[tender, land, novel, atria, book, william, ke...",tender land novel atria book william kent krue...,4,book,novel,prize
4,6,reuters,Reuters,Reuters Editorial,EU wants to see if lawmakers will block Brexit...,The European Union is waiting to see if Britis...,https://www.reuters.com/article/us-britain-eu-...,https://s2.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:25:41Z,LONDON (Reuters) - The European Union is waiti...,...,0.0,the european union is waiting to see if britis...,european union waiting see british lawmakers b...,"[european, union, waiting, see, british, lawma...","[european, union, wait, see, british, lawmaker...",european union wait see british lawmakers bloc...,0,say,new,president


In [67]:
model.visualize_topics()

## Partial conclusion