# LDA Topic Modelling from Hansard
## Using LDA to try to extract topics from unlabelled UK parliamentary debates.

### Data preparation
A single-day example of the data is available in `/data/2016-01-14.json` to make this easier to follow.

In [1]:
import lzma
import pickle
from pprint import pprint

raw_data = None
with lzma.open("../data/2016-today.xz", "rb") as f:
    raw_data = pickle.load(f)

# Check loaded successfully by printing date of first 2 days in data.
pprint(raw_data[0]["date"])
pprint(raw_data[1]["date"])

'2016-01-05'
'2016-01-06'


Here we recursively loop through the "debates", which have a nested structure. The difficulty is that some entries Hansard lists as "debates" are really collections of debates, or just statements. So we walk every level of the hierarchy and keep only the entries with more than one contribution, which drops single statements and containers that hold no contributions of their own.

In [2]:
import numpy as np
import pandas as pd


def recurse_debates(debates):
    """Yield [title, joined text] for each debate with more than one contribution."""
    for deb in debates:
        # Keep only the items that are contributions.
        contributions = [x for x in deb["items"] if x["type"] == "contribution"]
        # More than one contribution means a debate rather than a statement.
        if len(contributions) > 1:
            yield np.array(
                [deb["title"], " ".join(item["text"] for item in deb["items"])]
            )
        # Descend into child debates, if there are any.
        if "child_debates" in deb:
            yield from recurse_debates(deb["child_debates"])


def generate_debates(raw_data):
    for day in raw_data:  # Loop through days in data
        yield from recurse_debates(day["debates"])


df = pd.DataFrame(list(generate_debates(raw_data)), columns=["title", "text"])
df

| | title | text |
|---|---|---|
| 0 | Out-of-hospital Care | 1. What progress his Department has made on i... |
| 1 | GP Services | 2. What progress his Department is making on ... |
| 2 | Hospital Trusts: Deficits | 3. What proportion of hospital trusts are in ... |
| 3 | Rare Diseases | 4. How many people have diseases classified b... |
| 4 | Social Care Budgets: A&E Attendance | 5. What assessment he has made of the effect ... |
| ... | ... | ... |
| 13988 | Education Recovery | With permission, Mr Deputy Speaker, I will mak... |
| 13989 | Official Development Assistance | Application for emergency debate (Standing Ord... |
| 13990 | Points of Order | On a point of order, Mr Deputy Speaker. This i... |
| 13991 | Advanced Research and Invention Agency Bill | [Relevant documents: Third Report of the Scien... |
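
To make the filtering rule concrete, here is a minimal sketch on hypothetical toy data. `toy_debates` and `iter_debates` are illustrative names that do not appear in the notebook, and the toy structure assumes only the `title`, `items`, `type`, `text` and `child_debates` keys used in the cell above. It yields plain tuples rather than NumPy arrays to stay dependency-free.

```python
from typing import Iterator, Tuple

# Hypothetical toy data mimicking the nested layout of one day's debates.
toy_debates = [
    {
        "title": "Health",  # container: holds no contributions of its own
        "items": [],
        "child_debates": [
            {
                "title": "GP Services",  # two contributions: kept
                "items": [
                    {"type": "contribution", "text": "Question."},
                    {"type": "contribution", "text": "Answer."},
                ],
            },
            {
                "title": "Written Statement",  # one contribution: dropped
                "items": [{"type": "contribution", "text": "Statement."}],
            },
        ],
    }
]


def iter_debates(debates: list) -> Iterator[Tuple[str, str]]:
    """Yield (title, joined text) for debates with more than one contribution."""
    for deb in debates:
        contributions = [x for x in deb["items"] if x["type"] == "contribution"]
        if len(contributions) > 1:
            yield deb["title"], " ".join(item["text"] for item in deb["items"])
        # Descend into nested debates, if there are any.
        yield from iter_debates(deb.get("child_debates", []))


kept_debates = list(iter_debates(toy_debates))
print(kept_debates)
```

Only the nested "GP Services" entry survives: the container has no contributions and the statement has just one.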


### Preprocessing
Preprocessing starts with [Gensim's preprocessing](https://radimrehurek.com/gensim/parsing/preprocessing.html), which handles the usual removal of unwanted characters and stopwords, as well as tokenisation and stemming. Stemming is why the tokens in the output below are truncated forms such as 'depart' and 'integr'.

In [3]:
from gensim.parsing.preprocessing import preprocess_documents

# Grab just the text from the original data.
processed_corp = preprocess_documents(df["text"])
print(processed_corp[0])



['progress', 'depart', 'integr', 'improv', 'care', 'provid', 'outsid', 'hospit', 'happi', 'new', 'year', 'speaker—and', 'happi', 'new', 'year', 'familiar', 'face', 'opposit', 'shadow', 'cabinet', 'govern', 'commit', 'transform', 'hospit', 'care', 'commun', 'seen', 'excel', 'progress', 'area', 'led', 'integr', 'pioneer', 'torbai', 'greenwich', 'govern', 'remain', 'fulli', 'commit', 'deliv', 'integr', 'programm', 'better', 'care', 'fund', 'vanguard', 'seventi', 'cent', 'peopl', 'prefer', 'die', 'home', 'allow', 'peopl', 'die', 'hospit', 'chang', 'netherland', 'ow', 'better', 'social', 'care', 'provid', 'outsid', 'hospit', 'messag', 'minist', 'clinic', 'commiss', 'group', 'try', 'hard', 'bring', 'integr', 'servic', 'grate', 'hon', 'friend', 'rais', 'issu', 'share', 'view', 'want', 'greater', 'choic', 'end', 'life', 'care', 'peopl', 'abl', 'care', 'die', 'place', 'choos', 'appropri', 'need', 'hospic', 'hospit', 'home', 'recent', 'choic', 'review', 'set', 'vision', 'enabl', 'greater', 'choi

Next we write out the document frequency of each word to determine a suitable filter threshold. LDA is sensitive to very frequent and very rare words, so we filter them out to get a more representative picture of each document. The filter below keeps words that appear in more than 0.5% and fewer than 80% of documents.

In [4]:
from collections import defaultdict

doc_count = len(processed_corp)
frequencies = defaultdict(int)  # The doc frequencies of each token.
for doc in processed_corp:
    # Count each token at most once per document.
    for word in set(doc):
        frequencies[word] += 1

# A sorted list of (token, doc_count, percentage) tuples, restricted to tokens
# that appear in more than 2 docs (to reduce the size for sorting).
freq_sorted = sorted(
    [
        (key, val, round(val * 100.0 / doc_count, 3))
        for (key, val) in frequencies.items()
        if val > 2
    ],
    key=lambda x: x[1],
)

with open("../logs/sorted_frequencies.txt", "w") as log_file:
    pprint(freq_sorted, log_file)

In [6]:
final_words = [word for (word, count, perc) in freq_sorted if 0.5 < perc < 80]
print(final_words[:50])

['danish', 'discontinu', 'consol', 'mosul', 'coalition’', 'consul', 'overtaken', 'eco', 'wiggin', 'editori', 'juggl', 'penc', 'grit', 'collis', 'bme', 'mar', 'needi', 'subscript', 'tsunami', 'dent', 'eccentr', 'businesses’', 'cannock', 'memo', 'slowest', 'echr', 'defianc', 'patronag', 'trainer', 'mod’', 'dakin', 'chryston', 'meagr', 'discomfort', 'epidemiolog', 'arcan', 'backtrack', 'flex', 'deviz', 'bonfir', 'dwp’', 'derail', 'assassin', 'abet', 'whate', 'teen', 'scanner', 'astronom', 'coroner’', 'martyn']
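
The document-frequency filter can be checked on a small example. The sketch below uses a hypothetical toy corpus and illustrative bounds of 30% and 80% (not the thresholds used in the notebook); the names `toy_corpus`, `doc_freq` and `kept_tokens` are assumptions made for this illustration only.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenised documents.
toy_corpus = [
    ["care", "hospit", "care", "minist"],
    ["care", "school", "minist"],
    ["care", "tax"],
    ["care", "school"],
]

# Document frequency: set() ensures a token counts once per document,
# however many times it is repeated inside that document.
doc_freq = Counter(token for doc in toy_corpus for token in set(doc))

n_docs = len(toy_corpus)
lower, upper = 30.0, 80.0  # illustrative bounds, in percent of documents

kept_tokens = sorted(
    token
    for token, count in doc_freq.items()
    if lower < count * 100.0 / n_docs < upper
)
print(doc_freq["care"], kept_tokens)
```

Here 'care' appears in every document (100%) and is dropped as too common, 'hospit' and 'tax' appear in one document each (25%) and are dropped as too rare, and 'minist' and 'school' (50% each) are kept.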


There is clearly still some cleaning to be done on this list of words, but we can move on for now. Next we filter every word that is not in `final_words` out of the corpus. Note that `np.intersect1d` returns the sorted unique values common to both inputs, so each word appears at most once per filtered document, which is why every count in the bag-of-words output further down is 1. Then, using Gensim, we create a dictionary for the corpus (i.e. assign each word a unique integer id) and convert each document to its "bag of words" representation (i.e. discard the word order and replace each word with its id from the dictionary, paired with a count).

In [7]:
filtered_corp = []
for doc in processed_corp:
    filtered_corp.append(np.intersect1d(doc, final_words))

print(filtered_corp[:2])

[array(['abil', 'abl', 'absolut', 'achiev', 'ad', 'add', 'admiss', 'admit',
       'adult', 'affect', 'agre', 'allow', 'antibiot', 'appal',
       'appropri', 'area', 'arrang', 'assembl', 'assist', 'associ',
       'author', 'autumn', 'avail', 'bear', 'best', 'better', 'billion',
       'bring', 'brought', 'cabinet', 'care', 'carefulli', 'cent',
       'challeng', 'chancellor’', 'chang', 'chariti', 'china', 'choic',
       'choos', 'christma', 'clinic', 'come', 'comment', 'commiss',
       'commit', 'commun', 'concern', 'consid', 'constitu', 'context',
       'contribut', 'cornwal', 'council', 'creat', 'death', 'deliv',
       'demand', 'demograph', 'depart', 'determin', 'die', 'differ',
       'director', 'disabl', 'discuss', 'dispens', 'earli', 'elect',
       'emerg', 'enabl', 'end', 'england', 'excel', 'expect', 'experi',
       'explor', 'extrem', 'face', 'facilit', 'fact', 'failur',
       'familiar', 'feel', 'feet', 'final', 'find', 'flood', 'forgotten',
       'foundat', 'frequ
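
The filtered documents above are sorted and contain no repeats, because that is how `np.intersect1d` behaves. If within-document word counts matter for the model, one possible alternative is a set-membership filter. The sketch below is an illustration on hypothetical toy inputs, not the approach used in this notebook; `sorted(set(...) & set(...))` stands in for `np.intersect1d` so the example needs no third-party imports.

```python
# Hypothetical toy inputs.
doc = ["care", "hospit", "care", "speaker", "minist", "care"]
vocabulary = ["care", "minist", "hospit"]

# Sorted-unique filtering, equivalent in effect to np.intersect1d:
# repeated tokens collapse to a single occurrence.
unique_filtered = sorted(set(doc) & set(vocabulary))

# Count-preserving alternative: keep every occurrence, in original order.
vocab_set = set(vocabulary)  # set lookup is O(1) per token
count_preserving = [token for token in doc if token in vocab_set]

print(unique_filtered)
print(count_preserving)
```

The first list holds each surviving token once, while the second keeps all three occurrences of 'care', so a later bag-of-words step would see real term counts.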

In [19]:
from gensim import corpora, models

dictionary = corpora.Dictionary(filtered_corp)
corpus_bow = [dictionary.doc2bow(doc) for doc in filtered_corp]
print(list(dictionary.token2id.items())[:10])
print(corpus_bow[10])

[('abil', 0), ('abl', 1), ('absolut', 2), ('achiev', 3), ('ad', 4), ('add', 5), ('admiss', 6), ('admit', 7), ('adult', 8), ('affect', 9)]
[(1, 1), (15, 1), (22, 1), (30, 1), (41, 1), (44, 1), (47, 1), (48, 1), (62, 1), (72, 1), (91, 1), (94, 1), (97, 1), (101, 1), (103, 1), (105, 1), (122, 1), (126, 1), (137, 1), (139, 1), (143, 1), (147, 1), (148, 1), (150, 1), (151, 1), (156, 1), (168, 1), (191, 1), (194, 1), (201, 1), (204, 1), (210, 1), (213, 1), (221, 1), (227, 1), (236, 1), (239, 1), (245, 1), (255, 1), (256, 1), (265, 1), (274, 1), (275, 1), (279, 1), (301, 1), (316, 1), (333, 1), (403, 1), (430, 1), (443, 1), (456, 1), (457, 1), (465, 1), (496, 1), (504, 1), (522, 1), (542, 1), (564, 1), (566, 1), (586, 1), (805, 1), (827, 1), (828, 1), (854, 1), (855, 1), (856, 1), (857, 1), (858, 1), (859, 1), (860, 1), (861, 1), (862, 1), (863, 1), (864, 1), (865, 1), (866, 1), (867, 1), (868, 1), (869, 1), (870, 1), (871, 1)]
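
The bag-of-words idea can be sketched without Gensim. The code below is a simplified stand-in, not Gensim's implementation: `token2id` and `to_bow` are hypothetical names, the documents are toy data, and the id assignment order is an assumption that may differ from what `corpora.Dictionary` produces.

```python
from collections import Counter

# Hypothetical toy documents (already tokenised and filtered).
docs = [["care", "hospit", "care"], ["minist", "care", "tax", "tax"]]

# Assign each distinct token an integer id the first time it is seen.
token2id = {}
for d in docs:
    for token in sorted(set(d)):
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bows = [to_bow(d) for d in docs]
print(token2id)
print(bows)
```

Word order is gone, and each document is reduced to id and count pairs. Counts above 1 appear here only because the toy documents contain repeated tokens.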
