# The Guardian

## Getting data

Getting articles from The Guardian API with code snippets from https://towardsdatascience.com/discovering-powerful-data-the-guardian-news-api-into-python-for-nlp-1829b568fb0f. The code was used to access the API and tweaked to finding articles based on phrase instead of tag - and specified for our topic.

In [1]:
# import packages
import json
import requests
import pandas as pd
from os import makedirs
from os.path import join, exists
from datetime import date, timedelta

use show-fields key and bodyText key value to your URL then you will get content.

e.g: show-fields=BodyText

http://content.guardianapis.com/search?order-by=newest&show-fields=bodyText&q=politics&api-key='eb90ec3a-ffa6-4874-86d5-6eb062efd14d'

# The Guardian articles
## Getting data from the Guardian API

Adding in phrase to query:
https://content.guardianapis.com/search?q="mitochondrial%20donation"&tag=politics/politics&from-date=2014-01-01&api-key=test

From a qualitative reading of the articles and writing down topics, the articles we look at include 'Ukraine', 'Russia', and 'news' or 'analysis'. We want to first get a broad list of articles that mention Ukraine and Russia, then we can further narrow down to our specifications regarding the war. 

The Guardian provides a platform to explore the results from parameters, which is what I used to get the link for the query. https://open-platform.theguardian.com/explore/.

In [2]:
# paramters for query
api_key = 'eb90ec3a-ffa6-4874-86d5-6eb062efd14d'
phrase = 'Ukraine%2C%20Russia'
from_date = '2021-08-01' #6 months before
page = 1705


def query_api(phrase, page, from_date, api_key):
    """
    Function to query the API for a particular tag
    returns: a response from API
    """
    response = requests.get("https://content.guardianapis.com/search?show-fields=bodytext&q=" + phrase + "&from-date=" + from_date 
                            +"&page=" + str(page) + "&show-blocks=all&&page-size=200&api-key=" + api_key)
    return response

In [3]:
def get_results_for_phrase(phrase, from_date, api_key):
    """
    Function to run a for loop for results greater than 200. 
    Calls the query_api function accordingly
    returns: a list of JSON results
    """
    json_responses = []
    response = query_api(phrase, 1, from_date, api_key).json()
    json_responses.append(response)
    number_of_results = response['response']['total']
    if number_of_results > 200:
        for page in range(2, (round(number_of_results/200))+1):
            response = query_api(phrase, page, from_date, api_key).json()
            json_responses.append(response)
    return json_responses

In [4]:
def convert_json_responses_to_df(json_responses):
    """
    Function to convert the list of json responses to a dataframe
    """
    df_results = []
    
    for page_json in json_responses:
        # Check if 'response' key is present in the current page
        if 'response' in page_json:
            # Use get method to avoid KeyError for 'results' key
            results = page_json['response'].get('results', [])
            
            # Normalize the JSON and append to the list
            df = pd.json_normalize(results)
            df_results.append(df)

    # Concatenate all DataFrames in the list
    all_df = pd.concat(df_results, ignore_index=True)
    return all_df

# Example usage
# df = convert_json_responses_to_df(json_responses)
# print(df)

In [5]:
json_responses = get_results_for_phrase(phrase, from_date, api_key)

In [9]:
df_responses = convert_json_responses_to_df(json_responses)
df_responses.drop_duplicates(subset=['webTitle', 'webUrl'], inplace = True) # getting rid of duplicates
# df['webPublicationDate'] = df['webPublicationDate'].apply(lambda x: pd.to_datetime(x))
df_responses.head()

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,...,blocks.main.bodyTextSummary,blocks.main.published,blocks.main.createdDate,blocks.main.firstPublishedDate,blocks.main.publishedDate,blocks.main.lastModifiedDate,blocks.main.contributors,blocks.main.elements,blocks.body,blocks.totalBodyBlocks
0,world/live/2023/nov/29/russia-ukraine-war-live...,liveblog,world,World news,2023-11-29T15:57:21Z,Russia-Ukraine war: Russia ramping up attacks ...,https://www.theguardian.com/world/live/2023/no...,https://content.guardianapis.com/world/live/20...,False,pillar/news,...,,True,2023-11-29T10:53:48Z,2023-11-29T08:04:48Z,2023-11-29T10:53:50Z,2023-11-29T10:53:48Z,[],"[{'type': 'contentatom', 'assets': [], 'conten...","[{'id': '65675b468f087816ab20acfe', 'bodyHtml'...",20
1,world/live/2023/dec/01/russia-ukraine-war-live...,liveblog,world,World news,2023-12-01T15:59:43Z,Russia-Ukraine war: no reason for Russia to ch...,https://www.theguardian.com/world/live/2023/de...,https://content.guardianapis.com/world/live/20...,False,pillar/news,...,,True,2023-12-01T12:08:05Z,2023-12-01T08:16:10Z,2023-12-01T12:08:44Z,2023-12-01T12:08:39Z,[],"[{'type': 'image', 'assets': [{'type': 'image'...","[{'id': '6569f96a8f08db319e1bc924', 'bodyHtml'...",20
2,world/live/2023/oct/26/russia-ukraine-war-live...,liveblog,world,World news,2023-10-26T17:52:42Z,Russia-Ukraine war: US announces extra funding...,https://www.theguardian.com/world/live/2023/oc...,https://content.guardianapis.com/world/live/20...,False,pillar/news,...,,True,2023-10-26T16:41:30Z,2023-10-26T06:00:47Z,2023-10-26T16:42:16Z,2023-10-26T16:42:11Z,[],"[{'type': 'image', 'assets': [{'type': 'image'...","[{'id': '653aa7208f0837aa4c22b37a', 'bodyHtml'...",38
3,world/live/2023/nov/15/russia-ukraine-war-live...,liveblog,world,World news,2023-11-15T18:49:54Z,Russia-Ukraine war: Russia confirms Ukrainian ...,https://www.theguardian.com/world/live/2023/no...,https://content.guardianapis.com/world/live/20...,False,pillar/news,...,,True,2023-11-15T07:14:44Z,2023-11-15T07:14:44Z,2023-11-15T18:52:47Z,2023-11-15T05:22:36Z,[],"[{'type': 'image', 'assets': [{'type': 'image'...","[{'id': '65550f9e8f08d1d922ef766f', 'bodyHtml'...",39
4,world/live/2023/dec/03/russia-ukraine-war-live...,liveblog,world,World news,2023-12-03T16:51:49Z,Russia-Ukraine war live: Ukraine launches inqu...,https://www.theguardian.com/world/live/2023/de...,https://content.guardianapis.com/world/live/20...,False,pillar/news,...,,True,2023-12-03T08:26:03Z,2023-12-03T08:26:03Z,2023-12-03T17:04:56Z,2023-12-03T17:04:51Z,[],"[{'type': 'image', 'assets': [{'type': 'image'...","[{'id': '656cb1f88f08db319e1bdc4f', 'bodyHtml'...",22


In [7]:
df_responses.shape

(16246, 23)

In [10]:
df_responses.to_excel('guardian_responses.xlsx')

In [None]:
json_responses[0]['response']

In [None]:
json_responses[0]['response']['results'][0]

In [None]:
example = json_responses[0]['response']['results'][0]['blocks']['body'][0]['bodyHtml']
example

In [None]:
json_responses[0]['response']['results'][0]['blocks']['body'][1]['bodyHtml']

In [20]:
def remove_tags(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    return soup.get_text()

In [21]:
# creating the df to store results
df = pd.DataFrame(columns=['Page', 'Article', 'sectionId', 'sectionName', 'webPublicationDate', 'webTitle', 'pillarName', 'Text'])

count_page = 0

for page in json_responses:
    try:
        articles = page['response']['results']

        for count_article, article in enumerate(articles):
            try:
                article_id = article['id']
                section_id = article['sectionId']
                section_name = article['sectionName']
                web_publication_date = article['webPublicationDate']
                web_title = article['webTitle']
                pillar_name = article['pillarName']
                
                bodies = article['blocks']['body']

                for count_text, body in enumerate(bodies):
                    try:
                        text = remove_tags(body['bodyHtml'])
                        df = df.append({
                            'Page': count_page,
                            'Article': article_id,
                            'sectionId': section_id,
                            'sectionName': section_name,
                            'webPublicationDate': web_publication_date,
                            'webTitle': web_title,
                            'pillarName': pillar_name,
                            'Text': text
                        }, ignore_index=True)
                    except Exception as e:
                        print(f"Error processing text in Page {count_page}, Article {article_id}, Body {count_text}: {e}")
                        continue
            except Exception as e:
                print(f"Error processing article in Page {count_page}, Article {count_article}: {e}")
                continue
    except Exception as e:
        print(f"Error processing page {count_page}: {e}")
        continue

print(df.head())  # check contents


Error processing article in Page 0, Article 118: 'pillarName'
Error processing article in Page 0, Article 70: 'pillarName'
Error processing article in Page 0, Article 193: 'pillarName'
Error processing article in Page 0, Article 158: 'pillarName'
Error processing article in Page 0, Article 163: 'pillarName'
Error processing article in Page 0, Article 110: 'pillarName'
Error processing article in Page 0, Article 181: 'pillarName'
Error processing article in Page 0, Article 155: 'pillarName'
Error processing article in Page 0, Article 81: 'pillarName'
Error processing article in Page 0, Article 140: 'pillarName'
Error processing article in Page 0, Article 38: 'pillarName'
Error processing article in Page 0, Article 56: 'pillarName'
Error processing article in Page 0, Article 42: 'pillarName'
Error processing article in Page 0, Article 88: 'pillarName'
Error processing article in Page 0, Article 140: 'pillarName'
Error processing article in Page 0, Article 39: 'pillarName'
Error processin

In [22]:
df.head()

Unnamed: 0,Page,Article,sectionId,sectionName,webPublicationDate,webTitle,pillarName,Text
0,0,world/live/2023/nov/29/russia-ukraine-war-live...,world,World news,2023-11-29T15:57:21Z,Russia-Ukraine war: Russia ramping up attacks ...,News,We’re ending our live coverage of the war in U...
1,0,world/live/2023/nov/29/russia-ukraine-war-live...,world,World news,2023-11-29T15:57:21Z,Russia-Ukraine war: Russia ramping up attacks ...,News,Ukraine has seen no sign that its Nato allies ...
2,0,world/live/2023/nov/29/russia-ukraine-war-live...,world,World news,2023-11-29T15:57:21Z,Russia-Ukraine war: Russia ramping up attacks ...,News,Russia has failed in its bid to be re-elected ...
3,0,world/live/2023/nov/29/russia-ukraine-war-live...,world,World news,2023-11-29T15:57:21Z,Russia-Ukraine war: Russia ramping up attacks ...,News,Russian forces have today ramped up their atta...
4,0,world/live/2023/nov/29/russia-ukraine-war-live...,world,World news,2023-11-29T15:57:21Z,Russia-Ukraine war: Russia ramping up attacks ...,News,A man who used his finger to write “no to war”...


In [23]:
df.shape

(108702, 8)

In [25]:
print(df.isna().sum())

Page                  0
Article               0
sectionId             0
sectionName           0
webPublicationDate    0
webTitle              0
pillarName            0
Text                  0
dtype: int64


In [24]:
df.to_excel('guardian_responses_text.xlsx')

From the shape of the df, we can see that there are way more columns than the number of articles we pulled. Further examination revealed that the body texts were separated so we have to merge the texts of the same article. We do that below by grouping by all the other columns to make sure it is the same article.

In [27]:
# Grouping text that come from the same article
grouped_df = df.groupby(['Page', 'Article', 'sectionId', 'sectionName', 'webPublicationDate', 'webTitle', 'pillarName'])['Text'].apply(lambda x: ' '.join(x)).reset_index()

In [28]:
grouped_df.shape

(16222, 8)

In [31]:
print(grouped_df.isna().sum())

Page                  0
Article               0
sectionId             0
sectionName           0
webPublicationDate    0
webTitle              0
pillarName            0
Text                  0
dtype: int64


In [30]:
# saving grouped df to excel
grouped_df.to_excel('guardian_responses_text_grouped.xlsx')

## Filtering articles
From all the articles, we want to narrow down to capture the articles that are relavant to what has been reported on the war. From close reading we identified some keywords that are indicative of when reporting is done, we will then apply a dictionary based approach to to filter articles by keywords. Before the dictionary approach, we can filter out sections that are not news-related such as sports because they are not going to be reporting on the war.

In [50]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer #harsher stemmer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
import re
import string
from collections import defaultdict

In [32]:
# loading the data from excel file
all_articles_df = pd.read_excel('guardian_responses_text_grouped.xlsx')

In [34]:
# changing date to datetime format and sorting by date
all_articles_df['webPublicationDate'] = all_articles_df['webPublicationDate'].apply(lambda x: pd.to_datetime(x))
all_articles_df = df.sort_values(by='webPublicationDate')

In [36]:
all_articles_df.head()

Unnamed: 0,Page,Article,sectionId,sectionName,webPublicationDate,webTitle,pillarName,Text
77827,0,sport/2021/aug/02/belarus-athlete-refused-fly-...,sport,Sport,2021-08-02T13:21:31Z,Belarus athlete who refused to fly home is gra...,Sport,The Belarus Olympic athlete Krystsina Tsimanou...
79012,0,sport/2021/aug/02/olympics-shooting-serhiy-kul...,sport,Sport,2021-08-02T14:00:11Z,‘I shot someone else’s target’: Ukraine’s Serh...,Sport,Ukrainian shooter Serhiy Kulish missed out on ...
76251,0,world/2021/aug/03/belarus-exile-group-leader-v...,world,World news,2021-08-03T12:36:15Z,Belarus exile group leader Vitaly Shishov foun...,News,The head of a Kyiv-based non-profit organisati...
99801,0,world/2021/aug/03/uk-on-your-side-boris-johnso...,world,World news,2021-08-03T14:27:42Z,"UK is on your side, Boris Johnson tells Belaru...",News,The UK is on the side of Belarusian opposition...
91984,0,world/2021/aug/03/anger-in-turkey-grows-over-g...,world,World news,2021-08-03T22:19:18Z,Anger in Turkey grows over government’s handli...,News,People across Turkey are looking for answers a...


In [40]:
# unique pillar names
unique_pillarNames = all_articles_df['pillarName'].unique()

print(unique_pillarNames)

['Sport' 'News' 'Arts' 'Opinion' 'Lifestyle']


In [41]:
# unique pillar names
unique_sectionName = all_articles_df['sectionName'].unique()

print(unique_sectionName)

['Sport' 'World news' 'Football' 'Books' 'Global development' 'Membership'
 'Environment' 'Opinion' 'Media' 'US news' 'Politics' 'Business' 'Society'
 'UK news' 'News' 'Film' 'Life and style' 'Television & radio' 'Science'
 'Technology' 'Australia news' 'From the Observer' 'Music' 'Culture'
 'Art and design' 'Travel' 'Law' 'Stage' 'Money' 'Games' 'Fashion'
 'Education' 'Food' 'Info' 'Crosswords' 'Inequality' 'Weather' 'Help']


In [53]:
# looking at all unique sections
# unique_sections_ID = all_articles_df['sectionId'].unique()

# print(unique_sections_ID)

Based on the sections, we think the following will be relevant to our topic:
1. World news
2. Global development
3. Opinion
4. Media
5. US news
6. Politics
7. Society
8. UK news
9. News
10. Law
11. Info

In [None]:
# make a copy to filter sections
relevant_articles_df = all_articles_df.copy()

### Text preprocessing

Before filtering, we have to first preprocess the text to make sure our keywords capture all variants of a word. We apply some of the same preprocessing steps to the words so they are the same.

In [46]:
# lowercase words
all_articles_df['processed_text'] = all_articles_df['Text'].str.lower()

# remove punctuation
all_articles_df['processed_text'] = [re.sub(f"[{string.punctuation}]", "", i) for i in all_articles_df['processed_text']]

# remove numbers
pattern = r'[0-9]+'
        
all_articles_df['processed_text'] = all_articles_df['processed_text'].str.replace(pattern, '')

# remove punctuation
all_articles_df['processed_text'] = [re.sub(f"[{string.punctuation}]", "", i) for i in all_articles_df['processed_text']]

# remove extra whitespaces
all_articles_df['processed_text'] = [re.sub(r'\s+', " ", i) for i in all_articles_df['processed_text']]

  all_articles_df['processed_text'] = all_articles_df['processed_text'].str.replace(pattern, '')


In [47]:
# remove stopwords
stop_words = stopwords.words('english')

def remove_stopwords(text):
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# apply the function to the 'text' column of the dataframe
all_articles_df['processed_text'] = all_articles_df['processed_text'].apply(remove_stopwords) # for stemming

In [52]:
# lemmatize words
def lemmatize_words(text):
    tag_map = defaultdict(lambda : wordnet.NOUN)  #If nothing else is specified, use noun tag
    tag_map['J'] = wordnet.ADJ
    tag_map['V'] = wordnet.VERB
    tag_map['R'] = wordnet.ADV  
    
    lemmatizer = WordNetLemmatizer()
    
    tokens = text.split()
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
    lemmatized = []


    for word, tag in pos_tag(tokens):
        lemma = lemmatizer.lemmatize(word, tag_map[tag[0]])  #Where the magic happens 
        
        #we choose tag[0] to get all instances of a word class, 
        #e.g. NN (noun) and NP (proper noun) should both translate to noun. 
        
        lemmatized.append(lemma) 
        
    filtered_text = ' '.join(lemmatized)
    return filtered_text
 
all_articles_df['processed_text'] = all_articles_df['processed_text'].apply(lemmatize_words)

In [54]:
all_articles_df.head()

Unnamed: 0,Page,Article,sectionId,sectionName,webPublicationDate,webTitle,pillarName,Text,processed_text
77827,0,sport/2021/aug/02/belarus-athlete-refused-fly-...,sport,Sport,2021-08-02T13:21:31Z,Belarus athlete who refused to fly home is gra...,Sport,The Belarus Olympic athlete Krystsina Tsimanou...,belarus olympic athlete krystsina tsimanouskay...
79012,0,sport/2021/aug/02/olympics-shooting-serhiy-kul...,sport,Sport,2021-08-02T14:00:11Z,‘I shot someone else’s target’: Ukraine’s Serh...,Sport,Ukrainian shooter Serhiy Kulish missed out on ...,ukrainian shooter serhiy kulish miss medal com...
76251,0,world/2021/aug/03/belarus-exile-group-leader-v...,world,World news,2021-08-03T12:36:15Z,Belarus exile group leader Vitaly Shishov foun...,News,The head of a Kyiv-based non-profit organisati...,head kyivbased nonprofit organisation help bel...
99801,0,world/2021/aug/03/uk-on-your-side-boris-johnso...,world,World news,2021-08-03T14:27:42Z,"UK is on your side, Boris Johnson tells Belaru...",News,The UK is on the side of Belarusian opposition...,uk side belarusian opposition leader try bring...
91984,0,world/2021/aug/03/anger-in-turkey-grows-over-g...,world,World news,2021-08-03T22:19:18Z,Anger in Turkey grows over government’s handli...,News,People across Turkey are looking for answers a...,people across turkey look answer summer wildfi...


In [55]:
# saving procssed text df to excel
all_articles_df.to_excel('guardian_responses_text_processed.xlsx')

## Analysis

1. Bar chart showing the amount of articles per year (or other bin) and colored by our time period with a key. Volume of relevant news over time.
2. Dynamic topic modeling
3. Word embedding over time (neighborhood of Israel, Palestine, and Hamas over time for different time periods) https://arxiv.org/pdf/1807.04441.pdf 
- stem the words and embed them depending on year category