# Using regex in Python to analyse speeches

This notebook details some techniques for using regular expressions - **regex** - with text data.

First we import some libraries we are going to need.

In [1]:
#pandas for data analysis
import pandas as pd

In [2]:
#scraperwiki for scraping
!pip install scraperwiki
import scraperwiki

Collecting scraperwiki
  Downloading scraperwiki-0.5.1.tar.gz (7.7 kB)
Collecting alembic
  Downloading alembic-1.7.5-py3-none-any.whl (209 kB)
Collecting Mako
  Downloading Mako-1.1.6-py2.py3-none-any.whl (75 kB)
Building wheels for collected packages: scraperwiki
  Building wheel for scraperwiki (setup.py) ... done
  Created wheel for scraperwiki: filename=scraperwiki-0.5.1-py3-none-any.whl size=6545 sha256=cf3b98a0b47bcf00425586855059128aa375708f05990d4740566bc8321df478
  Stored in directory: /root/.cache/pip/wheels/3c/57/8d/41e15f7e5cc9eb0067539416abd445f210c0d04f39975d5ca5
Successfully built scraperwiki
Installing collected packages: Mako, alembic, scraperwiki
Successfully installed Mako-1.1.6 alembic-1.7.5 scraperwiki-0.5.1


In [3]:
#lxml.html and cssselect for drilling down into scraped webpages
import lxml.html
!pip install cssselect
import cssselect

Collecting cssselect
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Installing collected packages: cssselect
Successfully installed cssselect-1.1.0


## Scraping the speeches

We need some speech data to analyse. The gov.uk website publishes speeches and these can be found under the 'News and communications' search facility at https://www.gov.uk/search/news-and-communications.

We've accessed this via the '[View all announcements](https://www.gov.uk/search/news-and-communications?people=dominic-raab)' link at the bottom of [the gov.uk page for the politician Dominic Raab](https://www.gov.uk/government/people/dominic-raab).

We've then added an extra search term: 'statement', as we've noticed that speeches tend to use this term as a category.

Below we store the URL, then fetch the webpage at that URL, then convert it to an lxml 'object' which will make it easier to drill down into for specific information.

In [4]:
#store the URL
raaburl = "https://www.gov.uk/search/news-and-communications?keywords=statement&people%5B%5D=dominic-raab&order=updated-newest"
#fetch the webpage and store in 'html' as one long string
html = scraperwiki.scrape(raaburl)
#convert to an lxml object called 'root' - this gives it structure we can drill down into
root = lxml.html.fromstring(html)

### Drilling down into specific parts of the webpage using `cssselect`

Now we want specific pieces of information from that webpage: the URLs of the webpages containing the full text of each speech or press release, etc.

We use `.cssselect()` which will grab the contents of any HTML tags/attributes/values that we specify. Those are specified using **CSS selectors**.

Looking at the webpage HTML, for example, we can identify that document links are always within an `<a>` tag inside a list item tag like `<li class="gem-c-document-list__item">`.

In [5]:
#grab all the elements within the specified tags, with the specified class
listlinks = root.cssselect('li.gem-c-document-list__item a')
#show how many matches we get
print(len(listlinks))
#store the text inside those tags, in a list called 'titles'
titles = [i.text_content() for i in listlinks]
#store the links - the href= values - in a list called 'hrefs'
hrefs = [i.attrib['href'] for i in listlinks]
#print them
print(titles)
print(hrefs)

20
["Flight from Kabul carrying British nationals: Foreign Secretary's statement, 9 September 2021", "Afghanistan response: Foreign Secretary's statement, 6 September 2021", 'Foreign Secretary statement on the sentencing of Maria Kolesnikova and Maksim Znak', "Harry Dunn: Foreign Secretary's statement, 27 August 2021", "Kabul attack: Foreign Secretary's statement following his call with US Secretary of State", 'UK sanctions Russian FSB operatives over poisoning of Alexey Navalny', 'Foreign Secretary Statement: 20 August 2021', 'UK doubles aid to Afghanistan', "Afghanistan: G7 Foreign and Development Ministers' Meeting, chair's statement, 19 August 2021", "Afghanistan debate in the House of Commons, 18 August 2021: Foreign Secretary's closing statement", "Fourth anniversary of Anoosheh Ashoori's detention: Foreign Secretary Statement", "MV Mercer Street attack: G7 Foreign Ministers' statement", 'Draft mandate for negotiations in respect of Gibraltar: Foreign Secretary statement', 'UN Hu

## Scrape the linked pages

We get 20 matches, which is what we expect. 

At some point we need to loop through multiple pages but for now 20 results is enough.

Now we need to scrape the linked pages and store those in a dataframe. To keep things quick, the loop below only scrapes the first 5 of the 20 links (`hrefs[:5]`).

In [None]:
#create an empty dataframe
df = pd.DataFrame()

#loop through the links
for i in hrefs[:5]:
  #create an empty dictionary
  datadict = {}
  #add the link to the base URL, to form a full URL
  fulllink = "https://www.gov.uk"+i
  print(fulllink)
  #scrape the page at that link
  html = scraperwiki.scrape(fulllink)
  #convert to lxml object
  root = lxml.html.fromstring(html)
  #drill down into the tag containing the category - and store
  categories = root.cssselect('span.govuk-caption-xl.gem-c-title__context')
  #check we only have one match
  print(len(categories))
  #check the first match - there's some extra white space that we strip
  print(categories[0].text_content().strip())
  #store that in the dictionary
  datadict['category'] = categories[0].text_content().strip()
  #drill down into the tag containing the lead paragraph - and store
  leadpars = root.cssselect('p.gem-c-lead-paragraph')
  #check it
  print(leadpars[0].text_content())
  #store it
  datadict['leadpar'] = leadpars[0].text_content()
  #drill down into the paragraph tags within <div class="govspeak">
  ps = root.cssselect('div.govspeak p')
  #join the pars into a single string
  joinedps = '\n'.join([i.text_content() for i in ps])
  print(joinedps)
  #store it
  datadict['text'] = joinedps
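  #store the URL too
  datadict['url'] = fulllink
  #add to the dataframe (df.append() was removed in pandas 2.0, so we use pd.concat())
  df = pd.concat([df, pd.DataFrame([datadict])], ignore_index=True)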

https://www.gov.uk/government/news/foreign-secretary-statement-9-september-2021
1
Press release
Foreign Secretary Dominic Raab gave a statement on the departure of British nationals on a flight from Kabul.
The Foreign Secretary said:
We are grateful to our Qatari friends for facilitating a flight carrying 13 British nationals from Kabul to safety in Doha today.
We expect the Taliban to keep to their commitment to allow safe passage for those who want to leave.
Media enquiries

          Email newsdesk@fcdo.gov.uk
        

          Telephone 020 7008 3100
        
Contact the FCDO Communication Team via email (monitored 24 hours a day) in the first instance, and we will respond as soon as possible.
https://www.gov.uk/government/speeches/foreign-secretary-statement-on-afghanistan-response
1
Oral statement to Parliament
The Foreign Secretary updated Parliament on the UK's international response to the situation in Afghanistan.
Mr Speaker, with your permission I will update the House on 

In [None]:
df

| | category | leadpar | text | url |
|---|---|---|---|---|
| 0 | Press release | Foreign Secretary Dominic Raab gave a statemen... | The Foreign Secretary said:\nWe are grateful t... | https://www.gov.uk/government/news/foreign-sec... |
| 1 | Oral statement to Parliament | The Foreign Secretary updated Parliament on th... | Mr Speaker, with your permission I will update... | https://www.gov.uk/government/speeches/foreign... |
| 2 | Press release | Foreign Secretary Dominic Raab has provided a ... | Foreign Secretary Dominic Raab said:\nThe sent... | https://www.gov.uk/government/news/foreign-sec... |
| 3 | Press release | Foreign Secretary Dominic Raab's statement on ... | My deepest condolences are with Harry Dunn’s f... | https://www.gov.uk/government/news/harry-dunn-... |
| 4 | Press release | Foreign Secretary Dominic Raab gave a statemen... | Foreign Secretary Dominic Raab said:\nThis eve... | https://www.gov.uk/government/news/foreign-sec... |


## Introducing regex

Now we have some documents to use regex on. First we need to import the `re` library for using regex.

In [None]:
#import re library for regex
import re

## 'Compiling' a regular expression

Now we need to 'compile' a regular expression using the `compile()` function.

This is stored in a variable called `p`.

In this case the expression specifies we are looking for a space, followed by 'w' and 'e', followed by another space, and then 'one or more word characters'. That last part uses two special characters: `\w` (a **metacharacter** which matches any letter, digit or underscore) and `+` (a **quantifier** which means 'one or more of').

In [None]:
p = re.compile(r' we \w+')
p

re.compile(r' we \w+', re.UNICODE)

## Finding all matches using `.findall()`

We then use that with `.findall()` to find all matches within a specified string, which is passed as an argument to that function.

In [None]:
print(p.findall(" and we will build"))

[' we will']


Here it matches the space and 'we' but also 'will' because it is one or more alphanumeric characters. The match stops with the space after 'will' because this is not an alphanumeric character.
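To make that behaviour concrete, here is a small sketch using the same pattern on a made-up sentence (the sentence and the variable names are invented for illustration, not taken from the speeches).

```python
import re

# the same pattern, written as a raw string so the backslash is passed through to regex
pattern = re.compile(r' we \w+')

# a made-up sentence for illustration
sample = "We agreed that we will act, and we can't wait."

matches = pattern.findall(sample)
print(matches)
# → [' we will', ' we can']
```

The opening 'We' is not matched because it has a capital letter and no space before it, and the second match stops at 'can' because an apostrophe is not a word character.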

Now to apply that to the first speech in the dataframe: the oral statement to Parliament, which is at index 1.

In [None]:
p.findall(df['text'][1])

[' we have',
 ' we accelerated',
 ' we have',
 ' we remember',
 ' we also',
 ' we have',
 ' we can',
 ' we are',
 ' we want',
 ' we possibly',
 ' we are',
 ' we have',
 ' we must',
 ' we must',
 ' we must',
 ' we must',
 ' we have',
 ' we plan',
 ' we will',
 ' we will',
 ' we will',
 ' we must',
 ' we stand',
 ' we continue',
 ' we possibly']

Here we get lots of matches. This list - along with lists of matches from other speeches - could be stored in a dataframe that can then be analysed. 
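Note that this pattern only catches a lower case 'we' with a space in front of it, so a sentence that opens with 'We' is missed. One possible variant, sketched here on a made-up sentence, swaps the leading space for a word boundary (`\b`) and ignores case. This is an alternative to the pattern used in this notebook, not a replacement for it, and the results elsewhere in the notebook come from the original pattern.

```python
import re

# \b is a word boundary: 'we' can now start the string,
# but still won't match inside a longer word such as 'owe'
variant = re.compile(r'\bwe \w+', re.IGNORECASE)

# a made-up sentence for illustration
sample = "We must act. They owe us, and we went."

variant_matches = variant.findall(sample)
print(variant_matches)
# → ['We must', 'we went']
```

Because there is no longer a leading space in each match, the results would not need stripping before being counted.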

## Storing the matches in a dataframe

We can repeat this regex on each speech to generate a list for each speech.

To generate a dataframe of all those mentions, we need to generate a dataframe for each speech, with that list as a column, and the url as another, and then append it to a larger dataframe.

Below is the code to do that.

In [None]:
#create a new dataframe to store the results
wedf = pd.DataFrame()

#loop through a list of indices, up to an index which is equal to the number of items in the dataframe of speeches
for i in range(0,len(df)):
  #store the url in that row
  thisurl = df['url'][i]
  print(thisurl)
  #store all matches of the regex
  welist = p.findall(df['text'][i])
  #create a dataframe for the results
  localdf = pd.DataFrame()
  #store the matches - because this is a list it will fill as many cells as needed
  localdf['wemention'] = welist
  #create a second column which just has the url repeated. 
  #Because this is a string it will just repeat for as many rows as there are
  localdf['url'] = thisurl
  #append to the ongoing dataframe (df.append() was removed in pandas 2.0, so we use pd.concat())
  wedf = pd.concat([wedf, localdf], ignore_index=True)

#show the results
print(wedf)
  

https://www.gov.uk/government/news/foreign-secretary-statement-9-september-2021
https://www.gov.uk/government/speeches/foreign-secretary-statement-on-afghanistan-response
https://www.gov.uk/government/news/foreign-secretary-statement-on-the-sentencing-of-maria-kolesnikova-and-maksim-znak
https://www.gov.uk/government/news/harry-dunn-foreign-secretarys-statement-27-august-2021
https://www.gov.uk/government/news/foreign-secretary-statement-26-august-2021
          wemention                                                url
0           we will  https://www.gov.uk/government/news/foreign-sec...
1           we have  https://www.gov.uk/government/speeches/foreign...
2    we accelerated  https://www.gov.uk/government/speeches/foreign...
3           we have  https://www.gov.uk/government/speeches/foreign...
4       we remember  https://www.gov.uk/government/speeches/foreign...
5           we also  https://www.gov.uk/government/speeches/foreign...
6           we have  https://www.gov.uk/govern

In [None]:
#show the most frequent mentions
wedf['wemention'].value_counts()

 we will           6
 we must           5
 we have           5
 we possibly       2
 we are            2
 we should         1
 we plan           1
 we can            1
 we also           1
 we continue       1
 we want           1
 we stand          1
 we accelerated    1
 we remember       1
Name: wemention, dtype: int64
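An alternative to building a small dataframe per speech is to collect one row per match in a plain list and create the dataframe once at the end. The sketch below assumes that approach and uses two made-up texts and `example.com` URLs in place of the scraped speeches. All the names in it are hypothetical.

```python
import re
import pandas as pd

pattern = re.compile(r' we \w+')

# made-up stand-ins for the scraped speeches: url -> text
sample_speeches = {
    "https://example.com/a": "Today we will act and we must deliver.",
    "https://example.com/b": "Tomorrow we will meet.",
}

# collect one dictionary per match
rows = []
for url, text in sample_speeches.items():
    for match in pattern.findall(text):
        # strip the leading space so ' we will' is stored as 'we will'
        rows.append({"wemention": match.strip(), "url": url})

# build the dataframe in one go
sample_wedf = pd.DataFrame(rows)
counts = sample_wedf["wemention"].value_counts()
print(counts)
```

Building the dataframe once avoids repeatedly copying it inside the loop, which matters more as the number of speeches grows.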

## Using NLTK to extract ngrams

An **ngram** is a number of words that appear consecutively. For example "to the" is a common ngram. 

The 'n' in 'ngram' means 'number' and there are specific words for ngrams of specific numbers. For example, an ngram of two words is called a **bigram**, or you can have a **trigram** of three words and so on. 

In the table above 'we will' is the most common bigram - but there might be other bigrams in those speeches which *end* with 'we', or which don't use it at all.

The natural language processing library `NLTK` (Natural Language Toolkit) includes a function for extracting ngrams. 

In [None]:
#import the ngrams part of nltk
from nltk.util import ngrams

In [None]:
#convert text to lower so the same words will be treated the same regardless of case
speech1lc = df['text'][1].lower()
speech1lc

'mr speaker, with your permission i will update the house on the uk’s international response to the situation in afghanistan.\nas my rt hon friend the prime minster has set out, over the last 3 weeks, through a shared effort right across government and our armed forces, we have delivered the largest and most complex evacuation in living memory.\nbetween 15 and 29 august, the uk evacuated over 15,000 people from afghanistan. that includes: over 8,000 british nationals, close to 5,000 afghans who loyally served the uk, along with their dependents, and around 500 special cases of particularly vulnerable afghans, including chevening scholars, journalists, human rights defenders, campaigners for women’s rights, judges and many others.\nof course, the work to get people out did not start on 15 august. the fcdo advised british nationals to leave the country in april, and then again on 6 august. we estimate that around 500 did so.\nat the same time, the government launched the arap scheme for 

In [None]:
#replace anything that's not a lower case or upper case letter, or number, or space - with a space
#this is again so words aren't treated differently because they're followed by a comma or full stop, etc.
speech1lc = re.sub(r'[^a-zA-Z0-9\s]', ' ', speech1lc)
speech1lc

'mr speaker  with your permission i will update the house on the uk s international response to the situation in afghanistan \nas my rt hon friend the prime minster has set out  over the last 3 weeks  through a shared effort right across government and our armed forces  we have delivered the largest and most complex evacuation in living memory \nbetween 15 and 29 august  the uk evacuated over 15 000 people from afghanistan  that includes  over 8 000 british nationals  close to 5 000 afghans who loyally served the uk  along with their dependents  and around 500 special cases of particularly vulnerable afghans  including chevening scholars  journalists  human rights defenders  campaigners for women s rights  judges and many others \nof course  the work to get people out did not start on 15 august  the fcdo advised british nationals to leave the country in april  and then again on 6 august  we estimate that around 500 did so \nat the same time  the government launched the arap scheme for 
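Here is the same substitution applied to a short made-up string, to show what it does to apostrophes, commas and full stops (the string is invented for illustration).

```python
import re

# a made-up string containing a curly apostrophe, commas and a full stop
sample = "the uk’s response, over 15,000 people."

# anything that isn't a letter, number or whitespace becomes a space
cleaned_sample = re.sub(r'[^a-zA-Z0-9\s]', ' ', sample)
print(repr(cleaned_sample))
# → 'the uk s response  over 15 000 people '
```

Two side effects are worth knowing about: possessives such as "uk’s" are split into 'uk' and 's', and numbers such as '15,000' are split into '15' and '000'.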

In [None]:
#split the string on spaces, which creates a list
#loop through that list, calling each item 'token'
#store in a new list called 'tokens' if it's not "" (an empty item)
tokens = [token for token in speech1lc.split(" ") if token != ""]

In [None]:
#create a list of ngrams that are two words long (bigrams)
output = list(ngrams(tokens, 2))
#show the first 10 bigrams
output[:10]

[('mr', 'speaker'),
 ('speaker', 'with'),
 ('with', 'your'),
 ('your', 'permission'),
 ('permission', 'i'),
 ('i', 'will'),
 ('will', 'update'),
 ('update', 'the'),
 ('the', 'house'),
 ('house', 'on')]
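To show what is going on when ngrams are generated, here is a rough sketch of the same idea using only Python's built-in `zip()`. The `simple_ngrams` function and the sample tokens are made up for illustration and are not how NLTK implements it.

```python
# a made-up list of tokens for illustration
sample_tokens = ['we', 'will', 'update', 'the', 'house']

# bigrams: pair each token with the one that follows it
sample_bigrams = list(zip(sample_tokens, sample_tokens[1:]))
print(sample_bigrams)
# → [('we', 'will'), ('will', 'update'), ('update', 'the'), ('the', 'house')]

def simple_ngrams(tokens, n):
    #zip together n copies of the list, each shifted along by one more place
    return list(zip(*(tokens[i:] for i in range(n))))

sample_trigrams = simple_ngrams(sample_tokens, 3)
print(sample_trigrams)
# → [('we', 'will', 'update'), ('will', 'update', 'the'), ('update', 'the', 'house')]
```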

## Show the most common bigrams using `collections`

The `collections` library allows us to [count the frequency of items in a list](https://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-an-unordered-list). Below we import it, and then use its `Counter` class to count frequency.

This creates an object which has a method called `.most_common()` - that can be used to show a specified number of the most frequent items.
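As a quick sketch of how that works, here it is on a short made-up list (the list and variable names are invented for illustration).

```python
import collections

# a made-up list with some repeated items
sample_items = ['to', 'the', 'to', 'we', 'to', 'the']

# count how often each item appears
sample_count = collections.Counter(sample_items)

# show the 2 most frequent items as (item, count) pairs
top_two = sample_count.most_common(2)
print(top_two)
# → [('to', 3), ('the', 2)]
```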

In [None]:
import collections

In [None]:
#count the frequency of items
outputcount = collections.Counter(output)
#show the 10 most common
outputcount.most_common(10)

[(('to', 'the'), 12),
 (('the', 'taliban'), 8),
 (('the', 'uk'), 6),
 (('and', 'the'), 6),
 (('set', 'out'), 5),
 (('we', 'have'), 5),
 (('safe', 'passage'), 5),
 (('we', 'are'), 5),
 (('we', 'must'), 5),
 (('on', 'the'), 4)]

What did I say? 'To the' *is* a common ngram!

We can adapt that code to look at trigrams, too.

In [None]:
#create a list of ngrams that are 3 words long (trigrams)
output = list(ngrams(tokens, 3))
#count the frequency of items
outputcount = collections.Counter(output)
#show the 10 most common
outputcount.most_common(10)

[(('the', 'international', 'community'), 4),
 (('response', 'to', 'the'), 2),
 (('to', 'the', 'situation'), 2),
 (('rt', 'hon', 'friend'), 2),
 (('hon', 'friend', 'the'), 2),
 (('friend', 'the', 'prime'), 2),
 (('has', 'set', 'out'), 2),
 (('on', '15', 'august'), 2),
 (('\nat', 'the', 'same'), 2),
 (('the', 'same', 'time'), 2)]
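Notice the token `'\nat'` in that list: because the text was split on spaces only, the line break at the start of a paragraph stays attached to the word that follows it, so 'at' and '\nat' are counted as different words. One possible fix, sketched here on a made-up string, is to call `.split()` with no argument, which splits on any whitespace including line breaks.

```python
import re

# a made-up string with line breaks, for illustration
sample_text = "we will act.\nat the same time we will help.\nat the same time"
cleaned = re.sub(r'[^a-z0-9\s]', ' ', sample_text.lower())

# splitting on spaces only leaves the line break attached to 'at'
tokens_space = [token for token in cleaned.split(" ") if token != ""]
# splitting on any whitespace removes it (and never produces empty items)
tokens_any = cleaned.split()

print(tokens_space[:5])
print(tokens_any[:5])
```

Applying this to the speech would change some of the counts shown above, because tokens such as '\nat' and 'at' would then be merged.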