# Tutorial: Data-driven news discourse analysis with Python

**August 2023**

This notebook follows the Medium tutorial article, and uses Innovation Sweet Spots' public discourse analysis modules.

We will fetch and analyse *The Guardian* news articles, but the analysis can also be applied to any other text data.

We will provide examples for:

*   Checking mentions of search terms over time
*   Exploring the news topics using BERTopic
*   Understanding the language used around these terms using spaCy


## Setting up

Running the following cells will install the Innovation Sweet Spots code and the other necessary Python packages.

Skip this step if running locally instead of Colab.

In [1]:
!git clone --branch discourse_tutorial_blog https://github.com/nestauk/innovation_sweet_spots.git

Cloning into 'innovation_sweet_spots'...
remote: Enumerating objects: 2844, done.
remote: Counting objects: 100% (1537/1537), done.
remote: Compressing objects: 100% (665/665), done.
remote: Total 2844 (delta 1121), reused 1061 (delta 869), pack-reused 1307
Receiving objects: 100% (2844/2844), 2.45 MiB | 6.67 MiB/s, done.
Resolving deltas: 100% (1808/1808), done.


In [2]:
import sys
sys.path.insert(0,'/content/innovation_sweet_spots')

In [3]:
!cd innovation_sweet_spots && \
pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matplotlib==3.5.1
  Downloading matplotlib-3.5.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.3 MB)
Collecting metaflow
  Downloading metaflow-2.7.21-py2.py3-none-any.whl (878 kB)
Collecting python-dotenv
  Downloading python_dotenv-0.21.1-py3-none-any.whl (19 kB)
Collecting sh
  Downloading sh-1.14.3.tar.gz (62 kB)
  Preparing metadata (setup.py) ... done
Collecting CurrencyConverter
  Downloading CurrencyConverter-0.17.5-py3-none-any.whl (563 kB)

## Importing requirements

In [1]:
# Import packages
import altair as alt
import pandas as pd
from innovation_sweet_spots.utils.pd import pd_analysis_utils as au

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/karlis.kanders/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/karlis.kanders/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/karlis.kanders/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Getting the data: Using the Guardian Open Platform

This step shows how to fetch news articles from the Guardian mentioning "heat pumps".

First you should define your Guardian API key.

Setting it to `"test"` might work, but you should set up your own key here: https://open-platform.theguardian.com/access/

In [2]:
API_KEY = "test"
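
If you would rather not hard-code the key in the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `GUARDIAN_API_KEY`, which is a hypothetical name chosen for this example (the Innovation Sweet Spots code is not known to look for it), and falls back to `"test"` if the variable is not set.

```python
import os

# GUARDIAN_API_KEY is a hypothetical variable name chosen for this example;
# fall back to the public "test" key if it is not set
API_KEY = os.environ.get("GUARDIAN_API_KEY", "test")
```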

You can take a peek at the results by setting `only_first_page=True`.

In [3]:
test_articles = au.guardian.search_content(
    "heat pumps", 
    api_key=API_KEY, 
    only_first_page=True, 
    use_cached=True, 
    save_to_cache=False
)

2023-08-04 11:11:03,542 - root - INFO - Loading results from /Users/karlis.kanders/Documents/code/innovation_sweet_spots/inputs/data/guardian/api_results/heat%20pumps.json


At the time of writing this tutorial, the output should say that 100 articles is about 14% of the total number of results, so you can work out that there are around 700 *Guardian* articles mentioning heat pumps.
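
The estimate of around 700 articles follows from dividing the first-page count by its share of the total. As a quick check, using only the two figures quoted above:

```python
# Figures quoted in the text above
articles_on_first_page = 100
share_of_total = 0.14  # "about 14%"

# Estimated total number of matching articles
estimated_total = round(articles_on_first_page / share_of_total)
print(estimated_total)  # 714, i.e. around 700
```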

You can check the most recent article:

In [4]:
# Get the first (most recent) result
test_articles[0]

{'id': 'environment/2023/jul/17/uk-installation-heat-pumps-report',
 'type': 'article',
 'sectionId': 'environment',
 'sectionName': 'Environment',
 'webPublicationDate': '2023-07-17T04:00:18Z',
 'webTitle': 'UK installations of heat pumps 10 times lower than in France, report finds',
 'webUrl': 'https://www.theguardian.com/environment/2023/jul/17/uk-installation-heat-pumps-report',
 'apiUrl': 'https://content.guardianapis.com/environment/2023/jul/17/uk-installation-heat-pumps-report',
 'fields': {'headline': 'UK installations of heat pumps 10 times lower than in France, report finds',
  'trailText': 'Analysts call on government to make pumps mandatory for all new homes and scale up grants for installation in existing properties',
  'body': '<p>The UK is lagging far behind France and other EU countries in installing heat pumps, research has shown, with less than a tenth of the number of installations despite having similar markets.</p> <p>Only 55,000 heat pumps were sold in the UK last

Now let's get all articles mentioning heat pumps. 

In my experience, it is best to use both the singular and plural forms to catch all relevant results.

We will also specify the following article categories to reduce the possibility of irrelevant articles.

In [5]:
# Define allowed article categories
CATEGORIES = [
    "Environment",
    "Technology",
    "Science",
    "Business",
    "Money",
    "Cities",
    "Politics",
    "Opinion",
    "UK news",
    "Life and style",
]


In [6]:
# List of search terms
SEARCH_TERMS = ["heat pump", "heat pumps"]

articles_df, articles_metadata = au.get_guardian_articles(
    # Specify the search terms
    search_terms=SEARCH_TERMS,
    # To fetch the most recent articles, set use_cached to False
    use_cached = True,
    # Specify the API key
    api_key=API_KEY,
    # Specify which news article categories we'll consider
    allowed_categories = CATEGORIES,
)


2023-08-04 11:11:09,871 - root - INFO - Loading results from /Users/karlis.kanders/Documents/code/innovation_sweet_spots/inputs/data/guardian/api_results/heat%20pump.json


2023-08-04 11:11:09,897 - root - INFO - Loading results from /Users/karlis.kanders/Documents/code/innovation_sweet_spots/inputs/data/guardian/api_results/heat%20pumps.json


In [7]:
# Article texts
articles_df.head(3)

Unnamed: 0,id,text,date,year
0,environment/2023/jul/17/uk-installation-heat-p...,The UK is lagging far behind France and other ...,2023-07-17 04:00:18+00:00,2023
1,business/2023/jun/07/labour-donor-dale-vince-i...,Dale Vince has been condemned in the rightwing...,2023-06-07 21:00:47+00:00,2023
2,business/2023/may/30/heat-pumps-more-than-80-p...,More than 80% of households that have replaced...,2023-05-30 05:00:22+00:00,2023


In [8]:
# Article metadata
articles_metadata[articles_df.iloc[0].id]

{'webUrl': 'https://www.theguardian.com/environment/2023/jul/17/uk-installation-heat-pumps-report',
 'webTitle': 'UK installations of heat pumps 10 times lower than in France, report finds',
 'webPublicationDate': '2023-07-17T04:00:18Z',
 'tags': [{'id': 'profile/fiona-harvey',
   'type': 'contributor',
   'webTitle': 'Fiona Harvey',
   'webUrl': 'https://www.theguardian.com/profile/fiona-harvey',
   'apiUrl': 'https://content.guardianapis.com/profile/fiona-harvey',
   'references': [],
   'bio': '<p>Fiona Harvey is an environment editor at the Guardian</p>',
   'bylineImageUrl': 'https://uploads.guim.co.uk/2022/12/08/Fiona_Harvey_old_image.jpg',
   'bylineLargeImageUrl': 'https://uploads.guim.co.uk/2022/12/08/Fiona_Harvey_old_image.png',
   'firstName': 'Fiona',
   'lastName': 'Harvey'}]}

## Initialising the `DiscourseAnalysis` class

First, we can specify the path to the analysis outputs directory, which will come in handy when revisiting the analysis in the future. Note that we are storing the analysis outputs separately from the cached search results (discussed above), in order to separate the analysis process, which is agnostic to the data sources, from the data fetching process.

In [9]:
# Specify the location for analysis outputs
from innovation_sweet_spots import PROJECT_DIR
OUTPUTS_DIR = PROJECT_DIR / "outputs/data/discourse_analysis_outputs"

We can then specify the name `ANALYSIS_ID` for this specific analysis session - all the output tables will be stored in a subfolder of `OUTPUTS_DIR` with the same name.

In [10]:
ANALYSIS_ID = "guardian_heat_pumps_tutorial"

We will be saving and loading our analysis results to and from `innovation_sweet_spots/outputs/data/discourse_analysis_outputs/{ANALYSIS_ID}`.

We will then define a couple of additional filtering criteria to keep the most relevant results to our context, by specifying a (non-exhaustive) list of UK-related geographic terms and excluding any article that mentions Australia.

In [11]:
# Terms required to appear in the articles, 
# for the articles to be considered in the analysis
REQUIRED_TERMS = [
    "UK",
    "Britain",
    "Scotland",
    "Wales",
    "England",
    "Northern Ireland",
    "Britons",
    "London",
]

# Articles with these terms will be removed from the analysis
BANNED_TERMS = ["Australia"]
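
To make the effect of these two lists concrete, here is a minimal sketch of this kind of filter in plain Python. It is not the `DiscourseAnalysis` implementation: the helper `keep_article` is hypothetical, and the choices to keep an article if it mentions at least one required term and to match whole words case-sensitively are assumptions made for this example.

```python
import re

def keep_article(text, required_terms, banned_terms):
    """Return True if text has at least one required term and no banned term."""
    def mentions(term):
        # Whole-word, case-sensitive match (an assumption made for this sketch)
        return re.search(rf"\b{re.escape(term)}\b", text) is not None

    has_required = any(mentions(term) for term in required_terms)
    has_banned = any(mentions(term) for term in banned_terms)
    return has_required and not has_banned

# Made-up example texts
examples = [
    "Heat pump sales in the UK are rising.",
    "Heat pump sales in the UK lag behind Australia.",
    "Heat pump sales in France are rising.",
]
kept = [t for t in examples if keep_article(t, ["UK", "Britain"], ["Australia"])]
print(kept)  # only the first example is kept
```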

In [12]:
pda = au.DiscourseAnalysis(
    search_terms=SEARCH_TERMS,
    outputs_path=OUTPUTS_DIR,
    query_identifier=ANALYSIS_ID,
    required_terms = REQUIRED_TERMS,
    banned_terms = BANNED_TERMS,
)

pda.load_documents(document_text=articles_df)



The warning message above says we are missing document text and metadata. Metadata is optional and can be supplied when working with *Guardian* articles.

The `load_documents` step adds document text to the class. Its `document_text` argument takes a dataframe; if left blank, the function will search for a file `document_text_{ANALYSIS_ID}.csv` in `outputs/data/discourse_analysis_outputs/{ANALYSIS_ID}/`.

Note that you can use `load_documents` to input any text data, as long as it has columns for `text`, `date`, `year` and `id`.
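
As an illustration of the expected input, the sketch below builds a small table with the four required columns from made-up texts. The exact data types that `load_documents` expects (for example, whether `date` must be a timezone-aware timestamp, as in the *Guardian* data above) are an assumption here.

```python
import pandas as pd

# Made-up documents, for illustration only
my_documents = pd.DataFrame(
    {
        "id": ["doc_1", "doc_2"],
        "text": [
            "Heat pumps are becoming more common in the UK.",
            "A heat pump was installed in a London flat.",
        ],
        "date": pd.to_datetime(["2021-03-01", "2022-11-15"], utc=True),
    }
)
# Derive the year column from the date
my_documents["year"] = my_documents["date"].dt.year
```

A table like this could then be passed in the same way as the *Guardian* articles, with `pda.load_documents(document_text=my_documents)`.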

## Number of news articles across years

The `document_mentions` table gives the number of documents per year that contain the search terms.

(The results for each search term are combined and deduplicated.)
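
Deduplication matters because an article that mentions both "heat pump" and "heat pumps" is returned by both searches. The sketch below shows the idea with pandas on made-up article ids; it is an illustration of the principle, not the code that `DiscourseAnalysis` runs internally.

```python
import pandas as pd

# Toy search results for each term (made-up ids, for illustration only)
results_singular = pd.DataFrame({"id": ["a1", "a2"], "year": [2021, 2022]})
results_plural = pd.DataFrame({"id": ["a2", "a3"], "year": [2022, 2022]})

# Combine the results and keep each article only once
combined = (
    pd.concat([results_singular, results_plural], ignore_index=True)
    .drop_duplicates(subset="id")
)

# Count documents per year
documents_per_year = combined.groupby("year").size()
```

Here article `a2` appears in both result sets but is counted once, giving one document in 2021 and two in 2022.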

In [13]:
pda.document_mentions

Unnamed: 0,year,documents
0,2000,0
1,2001,0
2,2002,1
3,2003,3
4,2004,2
5,2005,6
6,2006,15
7,2007,19
8,2008,31
9,2009,21


Plot of the number of documents per year that contain the search terms.

In [14]:
pda.plot_mentions(use_documents=True)

You can also plot the number of sentences, disaggregated by search term.

(This might take a minute, as the text is split into sentences using spaCy.)

In [15]:
pda.plot_mentions(use_documents=False)

You can then get all sentences with the search terms for a specific year, using the dictionary `combined_term_sentences`:

In [16]:
pd.set_option('max_colwidth', 500)
pda.combined_term_sentences["2022"].head(5)

Unnamed: 0,sentence,id,year
14941,"domestic heat pumps, which are now more accessible, were extremely expensive and scarcely deployed in the uk. that left only two options, gas or wood.",commentisfree/2022/dec/27/wood-burning-stove-environment-home-toxins,2022
14974,for the same budget i could buy an air source heat pump.,commentisfree/2022/dec/27/wood-burning-stove-environment-home-toxins,2022
14997,a ground source heat pump powers the underfloor heating; a new borehole was recently dug for drinking water.,lifeandstyle/2022/dec/10/flood-proof-riverside-living-a-house-on-the-thames-without-the-anxiety,2022
15032,"michael liebreich, chair of liebreich associates and founder of the analyst firm bloomberg new energy finance, has hit out at boiler slingers the uk's existing network of gas companies, plumbing firms and engineers who see hydrogen as a route to maintain as much of the status quo as possible, rather than moving to heat pumps and other proven low carbon technology.",environment/2022/sep/26/hydrogen-could-nearly-double-cost-of-heating-a-home-compared-with-gas,2022
15033,"liebreich tweeted, heating with hydrogen from renewable energy is six times less efficient than using the same electricity in a heat pump.",environment/2022/sep/26/hydrogen-could-nearly-double-cost-of-heating-a-home-compared-with-gas,2022


Finally, when considering the growth trends of news mentions, another important element is a baseline growth trend that we can use as a reference.

In [17]:
# Get the total article counts across relevant article categories
total_counts = au.get_total_article_counts(sections=CATEGORIES, api_key=API_KEY)

After dividing the number of articles mentioning heat pumps by the total number of reference articles, we find that the shape of the trend is preserved.

In [18]:
document_mentions_norm = (
    pda.document_mentions.copy()
    .assign(baseline_documents = list(total_counts.values()))
    .assign(normalised = lambda df: df.documents / df.baseline_documents)
)

alt.Chart(document_mentions_norm).mark_line().encode(x="year:O", y="normalised")

## Characterising discourse topics using BERTopic

We can use BERTopic to find topics within our documents. More info on BERTopic can be found [here](https://maartengr.github.io/BERTopic/faq.html).

To create a topic model, use the function `fit_topic_model`. To use the sentences found by phrase matching, set the argument `use_phrases` to `True` (note that `set_phrase_patterns` needs to be run first in this case). If it is set to `False`, the function will use `sentence_mentions`.

In [19]:
topic_model, docs = pda.fit_topic_model()

2023-08-04 11:17:51,901 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2


2023-08-04 11:17:52,314 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device: cpu


In [20]:
topic_model.visualize_barchart(top_n_topics=len(set(topic_model.topics_)))

Visualising the documents and topics in 2D space (mouse over for the text). Note that the `visualize_documents` function is not deterministic. You can make the output deterministic by following the steps [here](https://maartengr.github.io/BERTopic/faq.html).

To make the plot larger (and view more of the mouseover text), increase the values of the parameters `width` and `height`.

In [21]:
topic_model.visualize_documents(docs, width=1400, height=750)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Batches: 100%|██████████| 32/32 [00:04<00:00,  7.76it/s]


## What we talk about when we talk about X: Collocation analysis

The `view_collocations` function can be used to find sentences where the search term appears with another specified term.

In [22]:
pda.view_collocations("air source")

Unnamed: 0,year,id,sentence
0,2005,uk/2005/nov/02/greenpolitics.renewableenergy,"he added, the installation of microgeneration products, such as micro turbines, solar panels and air source heat pumps, are an excellent way for individuals, communities and businesses to make their own contribution to tackling climate change."
1,2005,money/2005/sep/23/utilities.greenpolitics,"there is a significantly reduced, 5, vat rate for most microgeneration technologies which was extended in this year's budget to cover air source heat pumps and micro chp units."
2,2006,money/2006/jan/28/utilities.moneysupplement,air source heat pumps work just as well and are increasingly being used in flats and offices.
3,2007,business/2007/nov/16/eddieshah,the heating system utilises an interesting technology called an air source heat pump.
4,2007,business/2007/nov/16/eddieshah,"at the moment a pv system to run the air source heat pump would cost about 12,000, he says."
...,...,...,...
91,2023,environment/2023/jan/12/energy-house-20-tests-tech-that-aims-to-make-homes-greener-and-cheaper-to-run,"compared with the bitter temperature outside, the bellway home gives you a warm hug as you step inside, although it feels a bit topsy turvy as ultra slim infrared radiators are perched on the ceiling rather than the wall and an air source heat pump has been stowed in the loft in what is a uk first."
92,2023,environment/2023/jan/12/energy-house-20-tests-tech-that-aims-to-make-homes-greener-and-cheaper-to-run,"two competing heating systems are being tested inside, an electric based system utilising infrared panels, some of which are disguised as ceiling coving, as well as a water based system that uses heated skirting boards combined with an air source heat pump."
93,2023,environment/2023/jan/11/plan-rural-households-run-heating-on-vegetable-oil-uk,"under government plans, homes with off grid gas connections will be banned from buying replacement boilers from 2026 and instead expected to install air source or ground source heat pump systems."
94,2023,environment/2023/jan/01/wood-burning-stoves-for-some-of-us-wood-is-the-only-practical-affordable-fuel,"i installed an air source heat pump, and a small wood burning stove for extra warmth in winter."


You can check the `term_rank` table for all collocated terms for each year.

The importance of collocations is measured by pointwise mutual information (pmi), which indicates the strength of association of the collocated term and search term. 

Frequency (freq) and rank indicate how often the terms have been used together with the search terms. Frequency is the number of co-occurrences.
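
For intuition, the textbook definition of PMI is log2( p(x, y) / (p(x) · p(y)) ), where p(x, y) is the probability of the two terms appearing together and p(x), p(y) are their individual probabilities. The sketch below computes it from made-up sentence counts. The function and counts are hypothetical, and the library may count or normalise differently (for example, a different log base or counting unit), so the values in `term_rank` will not necessarily match this formula exactly.

```python
import math

def pmi(n_both, n_term, n_search, n_total):
    """Pointwise mutual information (base 2) from sentence counts."""
    # p(term, search) / (p(term) * p(search)), rearranged to use counts
    return math.log2((n_both * n_total) / (n_term * n_search))

# Made-up counts: out of 1,000 sentences, "air source" appears in 50,
# the search term in 100, and both together in 20
example_pmi = pmi(n_both=20, n_term=50, n_search=100, n_total=1000)
print(example_pmi)  # 2.0
```

A PMI of 2 means that the two terms co-occur four times more often than would be expected if they were independent.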

Note that this might take a minute, and on Colab it sometimes runs out of memory. If that happens, I would advise cloning the repo and running this section on your local machine.

As an alternative, you can also run a simpler query, `analyse_colocated_unigrams`, to find frequently mentioned single-word terms.

In [23]:
pda.term_rank

Unnamed: 0,term,year,freq,rank,pmi
0,1000,2021,6,248.000091,1.500399
1,1000,2022,6,178.000090,1.586728
2,"1,000 worse",2021,4,423.000091,2.869246
3,"1,000 worse year",2021,4,424.000091,2.869246
4,1200,2007,3,41.000735,2.558919
...,...,...,...,...,...
2736,years,2008,4,70.000417,0.959682
2737,years,2020,5,68.000283,0.650424
2738,years costs,2021,4,652.000091,2.179451
2739,years costs covered,2021,4,653.000091,2.179451


In [24]:
# Check most often collocated terms
(
    pda.term_rank.groupby('term')
    .agg(freq=('freq', 'sum'))
    .sort_values('freq', ascending=False)
    .head(20)
)

term,freq
heat,1679
pumps,1027
source,440
pump,437
source heat,410
ground,322
solar,302
gas,281
air,258
source heat pumps,246


In [25]:
# Check collocations with highest PMI in 2021
(
    pda.term_rank
    .query("year == 2021")
    .sort_values('pmi', ascending=False)
).head(20)

Unnamed: 0,term,year,freq,rank,pmi
1818,pump,2021,125,2.000091,4.708417
1634,option like hear,2021,4,568.000091,4.610692
1837,pump installers,2021,4,583.000091,4.610692
1344,"installations reach 600,000",2021,3,803.000091,4.610692
412,charging high price,2021,4,456.000091,4.610692
411,charging high,2021,4,455.000091,4.610692
2510,toasty fossil,2021,4,636.000091,4.610692
1359,installers charging,2021,4,534.000091,4.610692
2511,toasty fossil fuels,2021,4,637.000091,4.610692
1360,installers charging high,2021,4,535.000091,4.610692


You can also try this simpler approach using only unigrams (this will work on Colab).

In [26]:
# Simpler measure using only unigrams
pda.analyse_colocated_unigrams().sort_values('counts', ascending=False).head(20)

Unnamed: 0,lemmas,counts
1773,heat,1230
2762,pump,1075
1821,home,314
610,boiler,259
1336,energy,253
1642,gas,250
1776,heating,187
3247,source,184
3231,solar,182
1692,government,164


## Shifting narratives: Collocation importance over time

It is also informative to analyse how the importance of collocated terms varies over time. This can point to shifts in language used to describe our search terms, which in our case might be caused by the emergence of new entities and applications associated with our technologies of interest.

For example, a comparison of the two different types of heat pumps discussed above highlights an interesting trend, where ‘ground source’ was initially more frequently collocated, whereas now ‘air source’ - which is a more affordable type of heat pump - has overtaken the mentions of ‘ground source’.


In [27]:
check_terms = ['air source', 'ground source']
fig = (
    alt.Chart(
        pda.term_rank.query("term in @check_terms")
        )
    .mark_line()
    .encode(
        x=alt.X('year:O', title=''),
        y=alt.Y('freq:Q', title='Frequency'),
        color=alt.Color('term:N', title='Term'),
    )
)
fig
        

The `term_temporal_rank` table can be used to highlight potentially interesting terms whose rank has changed significantly (i.e. has a high variation across years).

In [28]:
pda.term_temporal_rank.sort_values('st_dev_rank', ascending=False).head(25)

term (index),term,year_first_mention,num_years,st_dev_rank,mean_pmi
heat,heat,2002,22,2925.788,3.277
ground,ground,2002,17,32.186,3.211
building,building,2004,5,0.387,1.379
source,source,2005,18,0.333,3.39
pumps,pumps,2003,21,0.267,4.05
pump,pump,2002,19,0.222,4.616
solar,solar,2005,17,0.158,1.944
gas,gas,2006,12,0.156,1.184
source heat,source heat,2005,18,0.15,3.736
energy,energy,2007,12,0.15,0.729


## Extracting patterns using spaCy

We can also analyse specific types of phrase patterns containing the search term. For example, phrases that contain adjectives or verbs together with the search term.

More information on POS phrase matching can be found [here](https://spacy.io/usage/rule-based-matching).
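
As a minimal, standalone illustration of spaCy's rule-based matching, the sketch below finds the search term followed by one more word. It uses a blank English pipeline, which only tokenises the text, so the pattern relies on token text rather than part-of-speech tags; POS-based patterns need a trained pipeline. The pattern name `heat_pump_next_word` is made up for this example and is not one of the patterns generated by `set_phrase_patterns`.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenises text; POS-based patterns such as
# {"POS": "NOUN"} would need a trained pipeline (e.g. en_core_web_sm)
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match "heat pump" or "heat pumps" followed by any alphabetic token
pattern = [
    {"LOWER": "heat"},
    {"LOWER": {"IN": ["pump", "pumps"]}},
    {"IS_ALPHA": True},
]
matcher.add("heat_pump_next_word", [pattern])

doc = nlp("Our heat pump system works well.")
phrases = [doc[start:end].text for _, start, end in matcher(doc)]
print(phrases)  # ['heat pump system']
```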

The phrase patterns can either be loaded from a JSON file, or the default phrase patterns can be generated on the fly using `make_patterns=True`.

First we need to set the phrase patterns. Here we are making patterns based on the search terms.

In [29]:
pda.set_phrase_patterns(load_patterns=False, make_patterns=True)

{'heat_pump_term_noun': [{'TEXT': 'heat'},
  {'TEXT': 'pump'},
  {'POS': 'NOUN'},
  {'POS': 'NOUN', 'OP': '?'}],
 'heat_pump_noun_phrase': [{'POS': 'ADJ', 'OP': '*'},
  {'POS': 'NOUN'},
  {'POS': 'NOUN', 'OP': '?'},
  {'TEXT': 'heat'},
  {'TEXT': 'pump'}],
 'heat_pump_adj_phrase': [{'POS': 'ADV', 'OP': '*'},
  {'POS': 'ADJ'},
  {'POS': 'ADJ', 'OP': '*'},
  {'POS': 'NOUN', 'OP': '?'},
  {'TEXT': 'heat'},
  {'TEXT': 'pump'}],
 'heat_pump_term_is': [{'TEXT': 'heat'},
  {'TEXT': 'pump'},
  {'LEMMA': 'be'},
  {'DEP': 'neg', 'OP': '?'},
  {'POS': {'IN': ['ADV', 'DET']}, 'OP': '*'},
  {'POS': {'IN': ['NOUN', 'ADJ']}, 'OP': '*'}],
 'heat_pump_term_have': [{'TEXT': 'heat'},
  {'TEXT': 'pump'},
  {'LEMMA': 'have'},
  {'DEP': 'neg', 'OP': '?'},
  {'POS': {'IN': ['ADV', 'DET']}, 'OP': '*'},
  {'POS': {'IN': ['NOUN', 'ADJ']}, 'OP': '*'}],
 'heat_pump_term_can': [{'TEXT': 'heat'},
  {'TEXT': 'pump'},
  {'LEMMA': 'can'},
  {'DEP': 'neg', 'OP': '?'},
  {'POS': {'IN': ['ADV', 'DET']}, 'OP': '*'},
  {'P

In [30]:
pda.set_phrase_patterns(load_patterns=False, make_patterns=True).keys()

dict_keys(['heat_pump_term_noun', 'heat_pump_noun_phrase', 'heat_pump_adj_phrase', 'heat_pump_term_is', 'heat_pump_term_have', 'heat_pump_term_can', 'heat_pump_term_is_at', 'heat_pump_verb_obj', 'heat_pump_verb_subj', 'heat_pumps_term_noun', 'heat_pumps_noun_phrase', 'heat_pumps_adj_phrase', 'heat_pumps_term_is', 'heat_pumps_term_have', 'heat_pumps_term_can', 'heat_pumps_term_is_at', 'heat_pumps_verb_obj', 'heat_pumps_verb_subj'])

Then we can find the phrases in the documents that match the phrase patterns.

This might take a minute.

In [31]:
pda.pos_phrases

 33%|███▎      | 6/18 [00:50<01:40,  8.36s/it]

2023-08-04 11:19:07,493 - innovation_sweet_spots - INFO - heat_pump_term_is_at pattern found no phrase matches.


 83%|████████▎ | 15/18 [02:01<00:23,  7.99s/it]

2023-08-04 11:20:18,831 - innovation_sweet_spots - INFO - heat_pumps_term_is_at pattern found no phrase matches.


100%|██████████| 18/18 [02:24<00:00,  8.04s/it]


Unnamed: 0,year,phrase,number_of_mentions,pattern
0,"2005, 2006, 2007",heat pump system,1,heat_pump_term_noun
1,"2005, 2006, 2007",heat pump technologies,1,heat_pump_term_noun
2,"2008, 2009, 2010",heat pump devices,1,heat_pump_term_noun
3,"2008, 2009, 2010",heat pump rating,1,heat_pump_term_noun
4,"2008, 2009, 2010",heat pump system,1,heat_pump_term_noun
...,...,...,...,...
495,2023,heat pumps fitted,1,heat_pumps_verb_subj
496,2023,heat pumps grew,1,heat_pumps_verb_subj
497,2023,heat pumps installed,1,heat_pumps_verb_subj
498,2023,heat pumps offer,1,heat_pumps_verb_subj


In [32]:
# All types of patterns generated
# (note that we have separate patterns for singular and plural search terms)
sorted(pda.pos_phrases.pattern.unique())

['heat_pump_adj_phrase',
 'heat_pump_noun_phrase',
 'heat_pump_term_can',
 'heat_pump_term_have',
 'heat_pump_term_is',
 'heat_pump_term_noun',
 'heat_pump_verb_obj',
 'heat_pump_verb_subj',
 'heat_pumps_adj_phrase',
 'heat_pumps_noun_phrase',
 'heat_pumps_term_can',
 'heat_pumps_term_have',
 'heat_pumps_term_is',
 'heat_pumps_term_noun',
 'heat_pumps_verb_obj',
 'heat_pumps_verb_subj']

In [33]:
# Query the adjective phrase patterns
pos = 'adj_phrase'
(
    pda.pos_phrases
    .groupby(['phrase', 'pattern'], as_index=False)
    .agg(number_of_mentions=('number_of_mentions','sum'))
    .query(f"pattern.str.contains('{pos}')", engine='python')
    .sort_values('number_of_mentions', ascending=False)
    # .head(10)
)

Unnamed: 0,phrase,pattern,number_of_mentions
33,electric heat pumps,heat_pumps_adj_phrase,25
289,low carbon heat pumps,heat_pumps_adj_phrase,9
32,electric heat pump,heat_pump_adj_phrase,9
316,new heat pump,heat_pump_adj_phrase,7
28,domestic heat pumps,heat_pumps_adj_phrase,6
34,electric powered heat pumps,heat_pumps_adj_phrase,3
291,low co2 heat pumps,heat_pumps_adj_phrase,2
52,fitting heat pumps,heat_pumps_adj_phrase,2
53,free heat pumps,heat_pumps_adj_phrase,2
27,domestic heat pump,heat_pump_adj_phrase,2


## Save the analysis outputs

Saving the results speeds up future analysis. Next time, if you specify the same query identifier (`ANALYSIS_ID`), it should load the results and you won't have to compute the phrases and patterns again.

In [34]:
pda.save_analysis_results()