<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 1 - Arianne

This phase extracts the text of the blog posts published on the following websites:
- [Greenpeace Stories](https://www.greenpeace.org/international/story/)
- [WWF](https://www.worldwildlife.org/stories?page=1&threat_id=effects-of-climate-change)
- [WRI](https://www.wri.org/resources/topic/climate-53/type/insights-50?page=0)

## Required Python packages

- beautifulsoup4
- pandas
- tqdm
- selenium
- lxml
- openpyxl (or another Excel writer engine; needed by `DataFrame.to_excel` for the `.xlsx` export)
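
Before running the notebook, it can help to confirm that the packages are importable. The sketch below is only an illustration: the helper name `find_missing_packages` and the mapping `REQUIRED_MODULES` are hypothetical, and the mapping assumes the usual import names (for instance, `beautifulsoup4` is imported as `bs4`).

```python
import importlib.util

# Hypothetical mapping from PyPI package name to import name.
# Note that beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = {
    'beautifulsoup4': 'bs4',
    'pandas': 'pandas',
    'tqdm': 'tqdm',
    'selenium': 'selenium',
    'lxml': 'lxml',
}


def find_missing_packages(required=REQUIRED_MODULES):
    """Return the PyPI names of the packages whose modules cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = find_missing_packages()
if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages are available.')
```

Any package reported as missing can then be installed with the package manager of your choice.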

## Import the required libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import os
import sys
import time
import logging
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.edge.options import Options

## Define input variables

In [2]:
input_directory = 'cl_st2_ph1_arianne'
output_directory = 'cl_st2_ph1_arianne'

## Create output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.


## Set up logging

In [4]:
log_filename = f"{output_directory}/{output_directory}.log"

In [5]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename=log_filename
)
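
One caveat: `logging.basicConfig` does nothing if the root logger already has handlers, which can happen in some interactive environments. The following is a minimal sketch of the same configuration with `force=True` (available from Python 3.8), which replaces any existing handlers. The temporary directory and the file name `example.log` are assumptions made only so that the example is self-contained.

```python
import logging
import os
import tempfile

# Hypothetical log location, used only for this illustration
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, 'example.log')

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename=log_path,
    force=True,  # Replace handlers that may already be attached to the root logger
)

logging.info('Logging configured')
logging.shutdown()  # Flush and close the handlers before reading the file back

with open(log_path, 'r', encoding='utf-8') as file:
    content = file.read()

print(content)
```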

## Functions

### Create output subdirectories

In [6]:
def create_directory(path):
    """Creates a subdirectory if it doesn't exist."""
    if not os.path.exists(path):
        try:
            os.makedirs(path)
            print(f"Successfully created the directory: {path}")
        except OSError as e:
            print(f"Failed to create the {path} directory: {e}")
            sys.exit(1)
    else:
        print(f"Directory already exists: {path}")
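
As a possible simplification, `os.makedirs` accepts `exist_ok=True`, which removes the need for a separate existence check. The sketch below assumes a hypothetical helper name, `ensure_directory`, and a temporary base directory; it is not the function used in the rest of this notebook.

```python
import os
import tempfile


def ensure_directory(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical location inside a temporary directory
base = tempfile.mkdtemp()
target = os.path.join(base, 'cl_st2_ph1_example', 'grp')

ensure_directory(target)
ensure_directory(target)  # The second call is a no-op rather than an error
print(os.path.isdir(target))
```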

### Scrape web pages

In [7]:
def scrape_html(url):
    """Loads a web page and returns its source HTML."""
    # Setting up the WebDriver
    #service = Service(r'C:\Users\eyamr\OneDrive\00-Technology\msedgedriver\edgedriver_win64\msedgedriver.exe')
    service = Service('/Users/eyamrog/msedgedriver/edgedriver_mac64/msedgedriver')
    #service = Service('/home/eyamrog/msedgedriver/edgedriver_linux64/msedgedriver')

    # Configure Edge to run headless
    options = Options()
    # For modern Edge/Chromium; if incompatible with your version, try "--headless"
    options.add_argument('--headless=new')
    options.add_argument('--disable-gpu')
    options.add_argument('--window-size=1920,1080')

    driver = webdriver.Edge(service=service, options=options)
    html = None
    try:
        driver.get(url)

        # Poll the page source until it stops changing or max_wait_time elapses
        max_wait_time = 30
        start_time = time.time()
        previous_html = ''

        while True:
            current_html = driver.page_source
            if current_html == previous_html or time.time() - start_time > max_wait_time:
                break
            previous_html = current_html
            time.sleep(2)

        html = driver.page_source  # Capture page source
    except Exception as e:
        logging.error(f"Error scraping {url}: {e}")
    finally:
        # Always close WebDriver
        driver.quit()

    return html
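
The stabilisation loop inside `scrape_html` can be read as a general "poll until two consecutive snapshots match" pattern. The sketch below isolates that idea without Selenium so it can be tried on its own; `wait_until_stable`, its parameters, and the fake snapshot sequence are all hypothetical names introduced for illustration.

```python
import time


def wait_until_stable(get_snapshot, interval=2.0, max_wait=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_snapshot() until two consecutive results match or max_wait elapses."""
    start = clock()
    previous = get_snapshot()
    while clock() - start <= max_wait:
        sleep(interval)
        current = get_snapshot()
        if current == previous:
            return current  # The content stopped changing
        previous = current
    return previous  # Timed out: return the latest snapshot seen


# A fake page whose content changes twice before settling
snapshots = iter(['<p>loading</p>', '<p>loading more</p>', '<p>done</p>', '<p>done</p>'])
result = wait_until_stable(lambda: next(snapshots), interval=0, sleep=lambda seconds: None)
print(result)
```

With a real WebDriver, `get_snapshot` could be something like `lambda: driver.page_source`.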

In [8]:
def scrape_html_docs(df, path):
    """Iterates over a DataFrame and saves each page's HTML, opening a new WebDriver session per page."""
    if not os.path.exists(path):
        try:
            os.makedirs(path)
        except OSError as e:
            logging.error(f"Failed to create the {path} directory: {e}")
            sys.exit(1)

    for _, row in tqdm(df.iterrows(), total=len(df), desc='Scraping HTML documents'):
        url = row['Post URL']
        doc_id = row['Post ID']
        filename = os.path.join(path, f"{doc_id}.html")

        page_source = scrape_html(url)  # Call the scrape_html function

        if page_source:
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(page_source)
            logging.info(f"Saved: {filename}")
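
Because each page takes several seconds to load, an interrupted run is expensive to repeat. One possible approach, sketched below under the assumption that files are named `<Post ID>.html` as above, is to compute which documents still need to be downloaded before scraping. The helper `pending_downloads` and the `example.org` URLs are hypothetical.

```python
import os
import tempfile


def pending_downloads(records, directory):
    """Return the (doc_id, url) pairs whose HTML file is missing or empty."""
    pending = []
    for doc_id, url in records:
        filename = os.path.join(directory, f"{doc_id}.html")
        if not os.path.exists(filename) or os.path.getsize(filename) == 0:
            pending.append((doc_id, url))
    return pending


# Hypothetical records and a temporary directory holding one finished download
directory = tempfile.mkdtemp()
records = [
    ('grp000000', 'https://example.org/post-0'),
    ('grp000001', 'https://example.org/post-1'),
]
with open(os.path.join(directory, 'grp000000.html'), 'w', encoding='utf-8') as file:
    file.write('<html><body>Already saved</body></html>')

todo = pending_downloads(records, directory)
print(todo)
```

Only the records returned by the helper would then need to be passed to the scraping step.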

## Scraping [Greenpeace Stories](https://www.greenpeace.org/international/story/)

### Define local variables

In [9]:
id = 'grp'
path = os.path.join(output_directory, id)
dataset_filename_1 = f"{id}_list"
dataset_filename_2 = f"{id}"

### Create output subdirectory

In [10]:
create_directory(path)

Directory already exists: cl_st2_ph1_arianne/grp


### Capture a few document pages for inspection

In [11]:
filename_sample_1 = 'greenpeace_stories_sample1.html'
url_sample_1 = 'https://www.greenpeace.org/international/story/page/1/'
filename_sample_11 = 'greenpeace_stories_sample11.html'
url_sample_11 = 'https://www.greenpeace.org/international/story/77736/from-hiroshima-to-gaza-defending-peace/'
filename_sample_2 = 'greenpeace_stories_sample2.html'
url_sample_2 = 'https://www.greenpeace.org/international/story/page/2/'
filename_sample_21 = 'greenpeace_stories_sample21.html'
url_sample_21 = 'https://www.greenpeace.org/international/story/77406/boots-to-boost-justice-standing-in-solidarity-with-indonesian-migrant-fishers/'
filename_sample_3 = 'greenpeace_stories_sample3.html'
url_sample_3 = 'https://www.greenpeace.org/international/story/page/3/'
filename_sample_31 = 'greenpeace_stories_sample31.html'
url_sample_31 = 'https://www.greenpeace.org/international/story/76810/vanishing-millet-fields-endangered-sparrows-the-climate-crisis-and-taiwans-forgotten-guardians/'

In [12]:
document_page_sample_1 = scrape_html(url_sample_1)

with open(f'{path}/{filename_sample_1}', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample_1)

In [13]:
document_page_sample_11 = scrape_html(url_sample_11)

with open(f'{path}/{filename_sample_11}', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample_11)

In [14]:
document_page_sample_2 = scrape_html(url_sample_2)

with open(f'{path}/{filename_sample_2}', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample_2)

In [15]:
document_page_sample_21 = scrape_html(url_sample_21)

with open(f'{path}/{filename_sample_21}', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample_21)

In [16]:
document_page_sample_3 = scrape_html(url_sample_3)

with open(f'{path}/{filename_sample_3}', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample_3)

In [17]:
document_page_sample_31 = scrape_html(url_sample_31)

with open(f'{path}/{filename_sample_31}', 'w', encoding='utf8', newline='\n') as file:
    file.write(document_page_sample_31)

### Scraping the post metadata

In [18]:
def scrape_posts(source, index_page_url_1, index_page_url_2, start_page, end_page):
    """Iterates over a set of index pages and extracts post metadata."""
    data = []

    for i in tqdm(range(start_page, end_page + 1)):
        url = f"{index_page_url_1}{i}{index_page_url_2}"

        index_page = scrape_html(url)

        # Skip index pages that could not be loaded (scrape_html returns None on failure)
        if not index_page:
            logging.error(f"Skipping index page {url}: no HTML returned")
            continue

        # Parse page source with BeautifulSoup
        soup = BeautifulSoup(index_page, 'lxml')

        # Capture the listing page content
        listing_page_content = soup.find('div', id='listing-page-content')

        # Extract the items (reset on every page so items from a previous page are never reused)
        items = []
        if listing_page_content:
            post_list = listing_page_content.find('ul', class_='wp-block-post-template')
            if post_list:
                items = post_list.find_all('li')

        for item in items:
            # Reset the fields so values from a previous item are never carried over
            post_term_text = ''
            post_tags_text = ''
            title_text = ''
            post_url = ''
            authors_text = ''
            post_date_time = ''

            # Extract the item body
            body = item.find('div', class_='query-list-item-body')

            if body:
                # Extract the post term
                post_term = body.find('div', class_='wp-block-post-terms')
                if post_term:
                    post_term_text = ' '.join(post_term.get_text(' ', strip=True).split())

                # Extract the post tags
                post_tags = body.find('div', class_='taxonomy-post_tag wp-block-post-terms')
                if post_tags:
                    post_tags_list = [a.get_text(strip=True) for a in post_tags.select('a[rel="tag"]')]
                    post_tags_text = ', '.join(post_tags_list)

                # Extract the title and the post URL
                headline = body.find('h4', class_='query-list-item-headline wp-block-post-title')
                if headline:
                    title_text = ' '.join(headline.get_text(' ', strip=True).split())
                    anchor_url = headline.find('a')
                    post_url = anchor_url.get('href', '') if anchor_url else ''

                ## Extract the category
                #post_page = scrape_html(post_url)
                #soup_article = BeautifulSoup(post_page, 'lxml')
                #tag_wrap_issues = soup_article.find('div', class_='tag-wrap issues')
                #if tag_wrap_issues:
                #    anchor_category = tag_wrap_issues.find('a')
                #    category_text = anchor_category.get_text(strip=True) if anchor_category else ''

                # Extract the authors
                authors = body.find('span', class_='article-list-item-author')
                if authors:
                    authors_text = ' '.join(authors.get_text(' ', strip=True).split())

                # Extract post date (named time_tag to avoid shadowing the time module)
                post_date = body.find('div', class_='wp-block-post-date')
                if post_date:
                    time_tag = post_date.find('time')
                    post_date_time = time_tag.get('datetime', '') if time_tag else ''

            # Append the extracted data
            data.append({
                'Source': source,
                'Post Term': post_term_text,
                #'Category': category_text,
                'Post Tags': post_tags_text,
                'Title': title_text,
                'Post URL': post_url,
                'Authors': authors_text,
                'Post Date': post_date_time
            })

    return pd.DataFrame(data)

In [11]:
source = 'Greenpeace'
index_page_url_1 = 'https://www.greenpeace.org/international/story/page/'
index_page_url_2 = '/'
start_page = 1
end_page = 136

Note: On 17/08/2025, when the data was extracted, the end page was 136.
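
To make the page range concrete, the sketch below builds the index page URLs in the same way `scrape_posts` does, by concatenating the prefix, the page number, and the suffix. The helper name `build_index_urls` is hypothetical.

```python
def build_index_urls(prefix, suffix, start_page, end_page):
    """Return the index page URLs for start_page..end_page (inclusive)."""
    return [f"{prefix}{page}{suffix}" for page in range(start_page, end_page + 1)]


urls = build_index_urls(
    'https://www.greenpeace.org/international/story/page/', '/', 1, 136
)
print(len(urls))
print(urls[0])
print(urls[-1])
```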

In [20]:
df_grp = scrape_posts(source, index_page_url_1, index_page_url_2, start_page, end_page)

100%|██████████| 1/1 [00:07<00:00,  7.26s/it]


In [21]:
df_grp['Post Date'] = pd.to_datetime(df_grp['Post Date'], errors='coerce', utc=True)

In [22]:
df_grp['Post ID'] = id + df_grp.index.astype(str).str.zfill(6)
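
The identifier format used above is the source prefix followed by the row index zero-padded to six digits. A plain-Python sketch of the same rule follows; the helper name `make_post_id` is hypothetical.

```python
def make_post_id(prefix, index, width=6):
    """Build an identifier such as 'grp000004' from a prefix and a row index."""
    return f"{prefix}{str(index).zfill(width)}"


post_ids = [make_post_id('grp', i) for i in range(3)]
print(post_ids)
```

Since the identifiers are derived from the row position, they probably shift if the index pages are scraped again after new posts have been published, so the exported list is the safer reference for later steps.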

In [43]:
df_grp.dtypes

Source               object
Post Term            object
Post Tags            object
Title                object
Post URL             object
Authors              object
Post Date    datetime64[ns]
Post ID              object
dtype: object

In [24]:
df_grp

Unnamed: 0,Source,Post Term,Post Tags,Title,Post URL,Authors,Post Date,Post ID
0,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-15 01:45:33+00:00,grp000000
1,Greenpeace,Stories,Forests,Environmental storytelling for a Chinese audie...,https://www.greenpeace.org/international/story...,August Rick,2025-08-14 01:40:25+00:00,grp000001
2,Greenpeace,Stories,AlternativeFutures,5 reasons Greenpeace calls for new global tax ...,https://www.greenpeace.org/international/story...,Nina Stros,2025-08-13 13:47:04+00:00,grp000002
3,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-08 05:09:19+00:00,grp000003
4,Greenpeace,Stories,"Peace, Nuclear",From Hiroshima to Gaza: defending peace,https://www.greenpeace.org/international/story...,Greenpeace France,2025-08-07 15:22:58+00:00,grp000004
5,Greenpeace,Stories,"Nuclear, Peace",80 years since Hiroshima and Nagasaki — time f...,https://www.greenpeace.org/international/story...,Sam Annesley,2025-08-06 00:15:55+00:00,grp000005
6,Greenpeace,Stories,"Plastics, Oceans",More businesses join the call for a strong UN ...,https://www.greenpeace.org/international/story...,Sarah King,2025-08-01 06:00:00+00:00,grp000006
7,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-01 04:08:33+00:00,grp000007
8,Greenpeace,Stories,"Climate, Health, PollutersPayPact",The climate crisis hits health care in South A...,https://www.greenpeace.org/international/story...,Yoliswa Sobuwa,2025-07-31 09:05:58+00:00,grp000008
9,Greenpeace,Stories,"EnergyRevolution, Peace, Nuclear",How can we protect peace and democracy? Greenp...,https://www.greenpeace.org/international/story...,Camilo Sanchez,2025-07-30 11:58:55+00:00,grp000009


#### Export to a file

In [25]:
df_grp.to_json(f"{output_directory}/{dataset_filename_1}.jsonl", orient='records', lines=True)

### Scrape the posts

#### Import the data into a DataFrame

In [12]:
df_grp = pd.read_json(f"{input_directory}/{dataset_filename_1}.jsonl", lines=True)

In [13]:
df_grp['Post Date'] = pd.to_datetime(df_grp['Post Date'], unit='ms')
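
The `unit='ms'` argument is needed because the JSON Lines export stores the dates as integers counting milliseconds since the Unix epoch, at least with the `to_json` defaults this notebook appears to rely on. The standard-library sketch below illustrates that round trip; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(dt.timestamp() * 1000)


def from_epoch_ms(ms):
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


original = datetime(2025, 8, 15, 1, 45, 33, tzinfo=timezone.utc)
ms = to_epoch_ms(original)
restored = from_epoch_ms(ms)
print(ms, restored.isoformat())
```

Note that the re-imported column in this notebook is timezone-naive, so the values should be read as UTC.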

#### Scrape the posts

In [28]:
scrape_html_docs(df_grp, path)

Scraping HTML documents: 100%|██████████| 10/10 [00:56<00:00,  5.66s/it]


### Extract the text from the posts

In [14]:
def extract_text(df, path):
    """Extracts text from HTML files and saves as text files."""

    for post_id in df['Post ID']:
        html_file = os.path.join(path, f"{post_id}.html")
        txt_file = os.path.join(path, f"{post_id}.txt")

        # Check if the HTML file exists
        if not os.path.exists(html_file):
            logging.error(f"Skipping {html_file}: File not found")
            continue

        # Read HTML content
        with open(html_file, 'r', encoding='utf-8') as file:
            soup = BeautifulSoup(file, 'lxml')

        # Initialise text variable
        text = ''

        # Web Scraping - Begin

        # Capture the 'article body'
        post_body = soup.find('article')

        # Extract the category
        if post_body:
            tag_wrap_issues = post_body.find('div', class_='tag-wrap issues')
            if tag_wrap_issues:
                anchor_category = tag_wrap_issues.find('a')
                if anchor_category:
                    category_text = ' '.join(anchor_category.get_text(' ', strip=True).split())
                    text += f"Category: {category_text}\n"

        # Extract the paragraphs
        if post_body:
            post_content = post_body.find('div', class_='post-content')
            if post_content:
                post_details = post_content.find('div', class_='post-details clearfix')
                if post_details:
                    # Iterate top-level content blocks in order: paragraphs and lists
                    for block in post_details.find_all(['p', 'ul', 'ol'], recursive=False):
                        if block.name == 'p':
                            paragraph_text = ' '.join(block.get_text(' ', strip=True).split())
                            text += f"{paragraph_text}\n"
                        elif block.name in ('ul', 'ol'):
                            # Capture top-level list items in order
                            for li in block.find_all('li', recursive=False):
                                li_text = ' '.join(li.get_text(' ', strip=True).split())
                                text += f"{li_text}\n"

        # Web Scraping - End

        # Save text to a text file
        with open(txt_file, 'w', encoding='utf-8', newline='\n') as file:
            file.write(text)

        logging.info(f"Saved text for {post_id} to {txt_file}")

In [41]:
extract_text(df_grp, path)
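
The expression `' '.join(text.split())`, used throughout `extract_text`, collapses every run of whitespace (spaces, tabs, newlines, non-breaking spaces) into a single space and trims both ends. A small sketch, with a hypothetical helper name:

```python
def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return ' '.join(text.split())


sample = '  Greenpeace\n  Pictures\tof the   Week '
print(normalise_whitespace(sample))
```

This is what keeps each extracted paragraph on a single line of the output text file.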

In [14]:
def extract_text_gp000071_gp001092(df, path):
    """Extracts text from HTML files, also searching nested blocks, and saves as text files."""

    for post_id in df['Post ID']:
        html_file = os.path.join(path, f"{post_id}.html")
        txt_file = os.path.join(path, f"{post_id}.txt")

        # Check if the HTML file exists
        if not os.path.exists(html_file):
            logging.error(f"Skipping {html_file}: File not found")
            continue

        # Read HTML content
        with open(html_file, 'r', encoding='utf-8') as file:
            soup = BeautifulSoup(file, 'lxml')

        # Initialise text variable
        text = ''

        # Web Scraping - Begin

        # Capture the 'article body'
        post_body = soup.find('article')

        # Extract the category
        if post_body:
            tag_wrap_issues = post_body.find('div', class_='tag-wrap issues')
            if tag_wrap_issues:
                anchor_category = tag_wrap_issues.find('a')
                if anchor_category:
                    category_text = ' '.join(anchor_category.get_text(' ', strip=True).split())
                    text += f"Category: {category_text}\n"

        # Extract the paragraphs
        if post_body:
            post_content = post_body.find('div', class_='post-content')
            if post_content:
                post_details = post_content.find('div', class_='post-details clearfix')
                if post_details:
                    # Iterate all paragraphs and lists in document order, including nested ones (recursive=True)
                    for block in post_details.find_all(['p', 'ul', 'ol'], recursive=True):
                        if block.name == 'p':
                            paragraph_text = ' '.join(block.get_text(' ', strip=True).split())
                            text += f"{paragraph_text}\n"
                        elif block.name in ('ul', 'ol'):
                            # Capture top-level list items in order
                            for li in block.find_all('li', recursive=False):
                                li_text = ' '.join(li.get_text(' ', strip=True).split())
                                text += f"{li_text}\n"

        # Web Scraping - End

        # Save text to a text file
        with open(txt_file, 'w', encoding='utf-8', newline='\n') as file:
            file.write(text)

        logging.info(f"Saved text for {post_id} to {txt_file}")
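
The only functional difference between the two extraction functions is whether nested blocks are searched. The sketch below shows the same distinction with the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup: `findall('p')` looks at direct children only (comparable to `recursive=False`), while `iter('p')` walks all descendants (comparable to `recursive=True`). The markup is a hypothetical example, not taken from a Greenpeace page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one top-level paragraph and one nested inside a blockquote
snippet = """
<div>
  <p>Top-level paragraph</p>
  <blockquote><p>Nested paragraph</p></blockquote>
</div>
""".strip()

root = ET.fromstring(snippet)

direct = [p.text for p in root.findall('p')]  # Direct children only
nested = [p.text for p in root.iter('p')]     # All descendants

print(direct)
print(nested)
```

One thing to watch with the recursive variant: a paragraph nested inside a list item may be written twice, once as a paragraph and once as part of the list item text.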

### Break down the texts into paragraphs

In [42]:
# Prepare to collect rows
data = []

# Loop through each 'Post ID' in the DataFrame
for _, row in df_grp.iterrows():
    post_id = row['Post ID']

    # Reset per post so a missing 'Category:' line never raises a NameError
    # or reuses the category of the previous post
    category = 'Undefined Category'
    paragraph_count = 0
    file_path = os.path.join(path, f"{post_id}.txt")

    if not os.path.isfile(file_path):
        print(f"Missing file: {file_path}")
        continue

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = ' '.join(line.split())
            if not line:
                continue

            if line.startswith('Category:'):
                category_name = line.partition(':')[2].strip()
                # If a category line is blank or only has the colon, we gracefully assign the fallback name 'Undefined Category'
                category = category_name if category_name else 'Undefined Category'
                paragraph_count = 0 # Resetting paragraph count for new category

            else:
                paragraph_count += 1
                data.append({
                    'Post ID': post_id,
                    'Category': category,
                    'Paragraph': f"Paragraph {paragraph_count}",
                    'Text Paragraph': line
                    })

# Create final DataFrame
df_paragraph = pd.DataFrame(data)
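
The parsing rule applied to each text file can be isolated as a small pure function, which makes it easier to check on a few lines before running it over the whole corpus. This is a sketch only: `split_paragraphs` and the sample lines are hypothetical, and the fallback category mirrors the 'Undefined Category' label used above.

```python
def split_paragraphs(lines, default_category='Undefined Category'):
    """Turn the lines of one text file into numbered paragraph rows."""
    rows = []
    category = default_category
    count = 0
    for raw in lines:
        line = ' '.join(raw.split())
        if not line:
            continue  # Skip blank lines
        if line.startswith('Category:'):
            category = line.partition(':')[2].strip() or default_category
            count = 0  # Restart numbering for the new category
        else:
            count += 1
            rows.append({
                'Category': category,
                'Paragraph': f"Paragraph {count}",
                'Text Paragraph': line,
            })
    return rows


sample = ['Category: Greenpeace\n', 'First paragraph.\n', '\n', 'Second   paragraph.\n']
rows = split_paragraphs(sample)
for row in rows:
    print(row)
```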

In [35]:
df_paragraph

Unnamed: 0,Post ID,Category,Paragraph,Text Paragraph
0,grp000000,Greenpeace,Paragraph 1,From a banner protest at the plastic treaty in...
1,grp000000,Greenpeace,Paragraph 2,🇬🇧 England – Greenpeace UK’s climbers install ...
2,grp000000,Greenpeace,Paragraph 3,After securing a giant 12m x 8m canvas to one ...
3,grp000000,Greenpeace,Paragraph 4,The work starkly visualises the wound inflicte...
4,grp000000,Greenpeace,Paragraph 5,"🇨🇭 Switzerland – Juan Carlos Monterrey Gómez, ..."
...,...,...,...,...
23363,grp001356,Greenpeace,Paragraph 13,He joined Toronto’s City TV as an ecology spec...
23364,grp001356,Greenpeace,Paragraph 14,Over the years he continued to contribute to G...
23365,grp001356,Greenpeace,Paragraph 15,"In a recent book, Rex Weyler writes about refl..."
23366,grp001356,Greenpeace,Paragraph 16,“The ironies and tension of history simultaneo...


In [44]:
df_grp_paragraph = df_grp.merge(df_paragraph, on='Post ID', how='left')

In [45]:
df_grp_paragraph

Unnamed: 0,Source,Post Term,Post Tags,Title,Post URL,Authors,Post Date,Post ID,Category,Paragraph,Text Paragraph
0,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-15 01:45:33,grp000000,Greenpeace,Paragraph 1,From a banner protest at the plastic treaty in...
1,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-15 01:45:33,grp000000,Greenpeace,Paragraph 2,🇬🇧 England – Greenpeace UK’s climbers install ...
2,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-15 01:45:33,grp000000,Greenpeace,Paragraph 3,After securing a giant 12m x 8m canvas to one ...
3,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-15 01:45:33,grp000000,Greenpeace,Paragraph 4,The work starkly visualises the wound inflicte...
4,Greenpeace,Stories,Photography,Greenpeace Pictures of the Week,https://www.greenpeace.org/international/story...,Greenpeace International,2025-08-15 01:45:33,grp000000,Greenpeace,Paragraph 5,"🇨🇭 Switzerland – Juan Carlos Monterrey Gómez, ..."
...,...,...,...,...,...,...,...,...,...,...,...
23365,Greenpeace,Stories,"AboutUs, 50Years",Bob Hunter 1941 – 2005,https://www.greenpeace.org/international/story...,Greenpeace International,2005-05-02 15:15:00,grp001356,Greenpeace,Paragraph 13,He joined Toronto’s City TV as an ecology spec...
23366,Greenpeace,Stories,"AboutUs, 50Years",Bob Hunter 1941 – 2005,https://www.greenpeace.org/international/story...,Greenpeace International,2005-05-02 15:15:00,grp001356,Greenpeace,Paragraph 14,Over the years he continued to contribute to G...
23367,Greenpeace,Stories,"AboutUs, 50Years",Bob Hunter 1941 – 2005,https://www.greenpeace.org/international/story...,Greenpeace International,2005-05-02 15:15:00,grp001356,Greenpeace,Paragraph 15,"In a recent book, Rex Weyler writes about refl..."
23368,Greenpeace,Stories,"AboutUs, 50Years",Bob Hunter 1941 – 2005,https://www.greenpeace.org/international/story...,Greenpeace International,2005-05-02 15:15:00,grp001356,Greenpeace,Paragraph 16,“The ironies and tension of history simultaneo...


### Find texts that are empty

In [46]:
# Find rows where 'Category' is null, i.e. posts for which no paragraph was extracted
mask = df_grp_paragraph['Category'].isnull()

# Get the corresponding 'Post ID' values
post_ids_with_missing_text = df_grp_paragraph[mask]['Post ID'].tolist()

post_ids_with_missing_text

['grp000071', 'grp001092']
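
The check works because a left merge keeps every post, including those with no matching paragraph rows, and fills the paragraph columns of such posts with nulls. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: post 'a2' has no extracted paragraphs
posts = pd.DataFrame({'Post ID': ['a1', 'a2', 'a3']})
paragraphs = pd.DataFrame({
    'Post ID': ['a1', 'a1', 'a3'],
    'Text Paragraph': ['x', 'y', 'z'],
})

# A left merge keeps 'a2', with a null 'Text Paragraph'
merged = posts.merge(paragraphs, on='Post ID', how='left')

missing_ids = merged.loc[merged['Text Paragraph'].isnull(), 'Post ID'].unique().tolist()
print(missing_ids)
```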

In [47]:
df_grp_empty = df_grp_paragraph[df_grp_paragraph['Category'].isnull()]
df_grp_empty

Unnamed: 0,Source,Post Term,Post Tags,Title,Post URL,Authors,Post Date,Post ID,Category,Paragraph,Text Paragraph
1101,Greenpeace,Stories,Photography,Behind the Lens: Marizilda Cruppe,https://www.greenpeace.org/international/story...,Greenpeace International,2025-04-06 06:00:00,grp000071,,,
18996,Greenpeace,Stories,Forests,This company promised to stop deforestation. B...,https://www.greenpeace.org/international/story...,Grant Rosoman,2018-05-21 04:34:08,grp001092,,,


#### Rescrape

In [40]:
scrape_html_docs(df_grp_empty, path)

Scraping HTML documents: 100%|██████████| 2/2 [00:13<00:00,  6.73s/it]


In [None]:
extract_text_gp000071_gp001092(df_grp_empty, path)

#### Export to a file

In [48]:
df_grp_paragraph.to_json(f"{output_directory}/{dataset_filename_2}.jsonl", orient='records', lines=True)

In [49]:
df_grp_paragraph.to_excel(f"{output_directory}/{dataset_filename_2}.xlsx", index=False)