# BERTopic topic modelling

## Running the example

First, run the Quickstart example from https://maartengr.github.io/BERTopic/index.html to check that everything is working. The output should look like that shown in the Quickstart example on the website, although the exact topics can differ between runs.

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

In [None]:
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.get_topic(0)

In [None]:
topic_model.get_document_info(docs)

## Running on London data

Now that we know BERTopic works at least as well as in the example, we can run it on the `london_urls.csv` data we prepared earlier.

In [None]:
import pandas as pd

In [None]:
london_df = pd.read_csv('london_urls.csv', usecols=['url', 'parent_url', 'content', 'name', 'keyword'])
london_df

In [None]:
# get summary stats about the dataframe to check things look right.
london_df.describe()
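
BERTopic expects a list of strings, and empty cells in a CSV are read by pandas as `NaN`, so it may be worth checking the `content` column for missing or blank values before fitting. The following is a minimal sketch in plain Python, using made-up example values rather than the real data; the helper name `keep_text_docs` is hypothetical and not part of pandas or BERTopic.

```python
def keep_text_docs(values):
    """Split raw cell values into usable documents and the positions that were dropped.

    A value is kept only if it is a string with some non-whitespace content.
    """
    kept, dropped = [], []
    for position, value in enumerate(values):
        if isinstance(value, str) and value.strip():
            kept.append(value)
        else:
            dropped.append(position)
    return kept, dropped


# made-up values standing in for a 'content' column (NaN = empty CSV cell)
raw = ["Housing strategy 2023", float("nan"), "   ", "Parking permits"]
docs_ok, dropped_positions = keep_text_docs(raw)
print(len(docs_ok), dropped_positions)
```

With the real data this could be called on `london_df['content']`; the dropped positions indicate which rows would need to be removed from the dataframe so that rows and documents stay aligned.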

### Approach 1: Train a model on all pages

In [None]:
# fit a BERTopic model to the data and use it to find groups in its training data.
docs = list(london_df['content'])
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

In [None]:
# look at a slice of the topics to see if they make sense. We see a lot of council names.
topic_model.get_topic_info()[40:60]

In [None]:
# we can also get information about how documents relate to topics and why they've been assigned
topic_model.get_document_info(docs)

### Approach 2: Topic modelling with sites featuring keywords

In [None]:
# make a new dataframe filtering to websites matching keywords
keyword_london_df = london_df[london_df['keyword']]
# we only want the documents for each site, not the whole dataframe
filtered_docs = list(keyword_london_df['content'].values)

In [None]:
# train a new topic model just on websites matching keywords
filtered_topic_model = BERTopic()
f_topics, f_probs = filtered_topic_model.fit_transform(filtered_docs)

In [None]:
# look at a slice of the topics again. These look more promising.
filtered_topic_model.get_topic_info()[1:30]

In [None]:
# looking at the document info again
filtered_topic_model.get_document_info(filtered_docs)

In [None]:
# export the enriched data for further processing
export_df = pd.concat([keyword_london_df.reset_index(drop=True), filtered_topic_model.get_document_info(filtered_docs)], axis=1)
export_df.to_csv("filtered_london_with_topics.csv")

In [None]:
# reduce the model to 100 topics and see what changes
filtered_topic_model.reduce_topics(filtered_docs, nr_topics=100)
filtered_topic_model.get_topic_info()

### Approach 3: Southwark only model

We found that the topic model tends to build topics from the councils. This makes sense based on how the model works, but we want it to build topics based on policy rather than location. We'll train a model on one council and see if it generalises. Southwark is the council with the most pages in the dataset so we'll use that.
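
One rough way to put a number on "topics built from councils" is to count how many topics have a council name among their top words. The sketch below is an assumption-laden illustration rather than part of the original analysis: the function `council_topic_share`, the `GENERIC_WORDS` set and the example topics are all hypothetical.

```python
# words that appear in many council names but do not identify a place (assumed list)
GENERIC_WORDS = {"council", "london", "borough", "royal", "city", "of", "and", "upon"}


def council_topic_share(topic_words, council_names):
    """Return the fraction of topics whose top words include a council-specific token."""
    place_tokens = {
        token.lower()
        for name in council_names
        for token in name.split()
        if token.lower() not in GENERIC_WORDS
    }
    if not topic_words:
        return 0.0
    hits = sum(
        1 for words in topic_words if any(word.lower() in place_tokens for word in words)
    )
    return hits / len(topic_words)


# made-up topics: two mention a place, two are about policy
example_topics = [
    ["southwark", "council", "housing"],
    ["planning", "application", "permit"],
    ["camden", "parking", "permit"],
    ["recycling", "waste", "bins"],
]
share = council_topic_share(example_topics, ["Southwark Council", "Camden Council"])
print(share)
```

With a fitted model, the word lists could be taken from the `Representation` column of `get_topic_info()` and the names from `london_df['name']`, which would allow the all-pages model and a single-council model to be compared on the same measure.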

In [None]:
# count pages per council; Southwark has the most
london_df['name'].value_counts()

In [None]:
# filter to just Southwark and retrain the model
# (convert to a list so BERTopic receives plain strings, as in the earlier approaches)
sdocs = list(keyword_london_df[keyword_london_df['name'].str.contains("Southwark")]['content'])
s_topic_model = BERTopic()
s_topics, s_probs = s_topic_model.fit_transform(sdocs)

In [None]:
# we don't see so many council names in the topics now
s_topic_model.get_topic_info()

In [None]:
# we'll try reducing the number of topics and see how that affects robustness
s_topic_model.reduce_topics(sdocs, nr_topics=50)
s_topic_model.get_topic_info()

### Apply to other councils

In [None]:
# apply the Southwark model to the keyword-filtered pages from all councils
f_topics, f_probs = s_topic_model.transform(filtered_docs)

In [None]:
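# rebuild the keyword-filtered dataframe as a copy first, then attach the predictions
# (assigning before re-filtering would discard the new columns)
keyword_london_df = london_df[london_df['keyword']].copy()
keyword_london_df['Topic'] = f_topics
keyword_london_df['topic_prob'] = f_probs
keyword_london_df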

In [None]:
topics = s_topic_model.get_topic_info()[["Topic", "Name", "Representation"]]
topics = topics.set_index("Topic")
topics

In [None]:
# look at topics, they look more reasonable
all_london_topics = keyword_london_df.join(topics, on="Topic")
all_london_topics["Representation"].value_counts()
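
A quick check on how well the Southwark model generalises is the share of pages per council that land in the outlier topic, which BERTopic labels `-1`. The sketch below uses made-up council names and topic ids; `outlier_share_by_group` is a hypothetical helper, not a BERTopic function.

```python
from collections import Counter


def outlier_share_by_group(groups, topic_ids, outlier_id=-1):
    """Return {group: fraction of its documents assigned to the outlier topic}."""
    totals = Counter()
    outliers = Counter()
    for group, topic_id in zip(groups, topic_ids):
        totals[group] += 1
        if topic_id == outlier_id:
            outliers[group] += 1
    return {group: outliers[group] / totals[group] for group in totals}


# made-up assignments: one council name and one predicted topic id per page
example_groups = ["Southwark", "Southwark", "Camden", "Camden", "Camden", "Camden"]
example_topic_ids = [3, 5, -1, -1, 7, -1]
shares = outlier_share_by_group(example_groups, example_topic_ids)
print(shares)
```

On the real data the inputs would be the `name` and `Topic` columns. A much higher outlier share for councils other than Southwark would suggest that the model does not transfer well to them.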

In [None]:
# export for further analysis
all_london_topics.to_csv("all_london_data_southwark_topics.csv")
topics.to_csv("southwark_topic_list.csv")