# Andy Haldane Speech Analysis

### Import Packages

In [1]:
import os
from pathlib import Path
import re
from collections import Counter

from spacy.lang.en.stop_words import STOP_WORDS

### Loading and Preprocessing

#### Load Speeches as Text Files

In [2]:
speeches_path = Path('speeches')

speeches = []
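# Note: os.listdir returns files in arbitrary order, so speech indices depend on the filesystem
for file in os.listdir(speeches_path):
    if file.endswith('.txt'):
        # encoding='utf-8' is an assumption about how the speech files were saved
        with open(speeches_path / file, encoding='utf-8') as f:
            speeches.append(f.read())
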
speeches[0][:800]

'Introduction\nI am delighted to be speaking at this “Engaging Business” Summit, at such a critical time for business, for workers and for the wider economy. The focus of today, and the excellent background report, is happiness in the workplace. This is an issue in which everyone has a stake. It is particularly pertinent with many people having had to adapt their ways of working as a result of the Covid crisis. Indeed, this year may well have seen the largest shift in working practices ever seen, certainly the largest in modern times.\n\nThat begs a host of questions about the impact of these changes in working practices on workers, businesses, communities and the wider economy. For economists like me, it raises questions about the impact on productivity and output in the workplace. As arid as'

#### Clean the Text

In [3]:
for i, speech in enumerate(speeches):
    speech = speech.replace('\n', ' ')  # replace newline characters with spaces
    speech = speech.replace("'", '')  # remove apostrophes (str.replace returns a new string, so reassign)
    speech = re.sub(r'.footnote\[[0-9]+\]', '', speech)  # remove .footnote[x] text (the unescaped '.' matches any preceding character)
    speech = ''.join(ch for ch in speech if ch.isalpha() or ch == ' ')  # remove non-alphabetic characters
    speech = speech.lower()  # convert text to lower case
    speech = re.sub(r' +', ' ', speech).strip()  # collapse repeated spaces and strip leading/trailing spaces
    speeches[i] = speech

speeches[0][:800]

'introduction i am delighted to be speaking at this engaging business summit at such a critical time for business for workers and for the wider economy the focus of today and the excellent background report is happiness in the workplace this is an issue in which everyone has a stake it is particularly pertinent with many people having had to adapt their ways of working as a result of the covid crisis indeed this year may well have seen the largest shift in working practices ever seen certainly the largest in modern times that begs a host of questions about the impact of these changes in working practices on workers businesses communities and the wider economy for economists like me it raises questions about the impact on productivity and output in the workplace as arid as these concepts can'

## Word Frequency Analysis

<font size="3"> To extract information about the topics discussed, we start by looking at the frequency of every word used in the first loaded speech.

In [4]:
counts = Counter(speeches[0].split(' '))
counts.most_common(20)

[('the', 189),
 ('of', 131),
 ('to', 99),
 ('and', 95),
 ('in', 79),
 ('a', 62),
 ('is', 55),
 ('homeworking', 45),
 ('for', 40),
 ('working', 40),
 ('that', 40),
 ('i', 34),
 ('as', 33),
 ('has', 31),
 ('these', 30),
 ('productivity', 30),
 ('from', 30),
 ('by', 28),
 ('have', 27),
 ('be', 26)]

<font size="3"> Some words, such as "the", are frequent but not very informative. To improve the analysis we remove these "stopwords" by cross-referencing against the list of known stopwords imported from the Python library spaCy.

In [5]:
new_counts = Counter(word for word in speeches[0].split(' ') if word not in STOP_WORDS)
new_counts.most_common(20)

[('homeworking', 45),
 ('working', 40),
 ('productivity', 30),
 ('time', 19),
 ('work', 19),
 ('people', 18),
 ('workers', 17),
 ('home', 17),
 ('covid', 14),
 ('wellbeing', 14),
 ('effects', 14),
 ('crisis', 13),
 ('social', 12),
 ('office', 12),
 ('capital', 12),
 ('studies', 11),
 ('creativity', 11),
 ('shift', 10),
 ('different', 10),
 ('evidence', 10)]

<font size="3"> These results suggest that the speech focuses on homeworking due to Covid and the effects this will have on productivity, wellbeing and creativity.

<font size="3"> Applying the same approach to Andy Haldane's final speech, given on 30th June 2021, gives the following output:

In [6]:
new_counts = Counter(word for word in speeches[1].split(' ') if word not in STOP_WORDS)
new_counts.most_common(30)

[('bank', 111),
 ('financial', 84),
 ('banks', 80),
 ('policy', 65),
 ('monetary', 63),
 ('central', 55),
 ('uk', 48),
 ('public', 35),
 ('crisis', 33),
 ('time', 32),
 ('new', 31),
 ('inflation', 30),
 ('years', 29),
 ('economy', 28),
 ('stability', 26),
 ('system', 24),
 ('guidance', 21),
 ('risk', 17),
 ('past', 16),
 ('s', 16),
 ('global', 16),
 ('good', 15),
 ('crises', 15),
 ('forward', 15),
 ('international', 15),
 ('banking', 15),
 ('expectations', 14),
 ('far', 14),
 ('better', 13),
 ('interest', 13)]

<font size="3"> Here the results are less clear, but it seems that the topics of global inflation and economic stability were discussed. For this speech, however, a better approach is probably needed to confirm this.

<font size="3"> Another approach that may work better is a dictionary-based one, which counts the usage of predetermined key words or phrases associated with the risks that could be discussed. Its obvious disadvantage is that the possible risks must be known beforehand, but it could be a good way to categorise speeches by risks that occur reasonably frequently or can be foreseen, such as high unemployment rates.
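
As a rough illustration of the dictionary-based idea, the sketch below counts whole-word matches for a small set of risk phrases in a cleaned speech. The `RISK_TERMS` categories and phrases and the `count_risk_terms` helper are hypothetical names chosen for this example, not something taken from the speeches or from an established risk taxonomy.

```python
import re
from collections import Counter

# Hypothetical risk dictionary: the categories and phrases are illustrative
# assumptions and would need to be chosen by a domain expert.
RISK_TERMS = {
    'unemployment': ['unemployment', 'job losses'],
    'inflation': ['inflation', 'price stability'],
    'financial stability': ['financial stability', 'banking crisis'],
}

def count_risk_terms(text, risk_terms=RISK_TERMS):
    """Count whole-word occurrences of each category's phrases in cleaned, lower-case text."""
    counts = Counter()
    for category, phrases in risk_terms.items():
        for phrase in phrases:
            pattern = r'\b' + re.escape(phrase) + r'\b'
            counts[category] += len(re.findall(pattern, text))
    return counts

# Small made-up example; in the notebook this could be called on speeches[1]
example = 'inflation is rising and unemployment is high while inflation expectations drift'
risk_counts = count_risk_terms(example)
print(risk_counts.most_common())
```

Matching on phrases rather than single tokens lets the dictionary capture multi-word risks, at the cost of having to maintain the phrase list by hand.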
    
One flaw with these word count approaches is that they indiscriminately show all topics discussed. This can be a problem in speeches such as Andy Haldane's final speech, where a large portion is spent discussing his early career at the Bank of England. A potential fix is to filter out paragraphs that do not contain enough negative words such as "risk" and "crisis", using either a dictionary of such words or a sentiment analysis algorithm.
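
A minimal sketch of the dictionary variant of this filter is shown here. It assumes paragraphs in the raw text are separated by blank lines (as in the excerpt of the first speech), so it has to run before the cleaning step that replaces newlines. The `NEGATIVE_WORDS` set, the `risk_paragraphs` helper and the `min_hits` threshold are hypothetical choices for illustration.

```python
import re

# Hypothetical list of negative words; a sentiment lexicon or a sentiment
# analysis model could be used in its place.
NEGATIVE_WORDS = {'risk', 'risks', 'crisis', 'crises'}

def risk_paragraphs(raw_speech, negative_words=NEGATIVE_WORDS, min_hits=2):
    """Keep paragraphs of the raw (uncleaned) text with at least min_hits negative words."""
    kept = []
    for paragraph in raw_speech.split('\n\n'):
        words = re.findall(r'[a-z]+', paragraph.lower())
        hits = sum(word in negative_words for word in words)
        if hits >= min_hits:
            kept.append(paragraph)
    return kept

# Made-up text for illustration only
raw = ('I joined the Bank many years ago.\n\n'
       'The crisis exposed a risk to the system.\n\n'
       'One risk remains.')
filtered = risk_paragraphs(raw)
print(filtered)
```

The kept paragraphs could then be cleaned and counted in the same way as the full speeches.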

<font size="3"> Another flaw is that certain words or topics are likely to appear in every speech. One option to fix this is to look at the distribution of words across all the speeches together and give greater weight to words that are concentrated in particular speeches.
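
One common way to implement this kind of weighting is TF-IDF (term frequency multiplied by inverse document frequency). The sketch below is a plain-Python version under that assumption; it is not necessarily the weighting the analysis would end up using, and the `tf_idf` helper is a hypothetical name.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document, scoring each word as tf * log(N / df)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    scores = []
    for tokens in tokenised:
        term_counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in term_counts.items()
        })
    return scores

# Made-up documents; in the notebook this could be called on the cleaned speeches list
docs = ['bank policy inflation bank', 'bank homeworking productivity']
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

A word that appears in every document, such as "bank" in this toy example, scores zero, so only words concentrated in particular speeches stand out.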