# Scraping the GDELT dataset

We need a DataFrame with the following columns:
- date
- source country of article
- country article mentions
- theme(s) of article OR article text

OR
- date
- percentage of articles from source country x mentioning country y that have theme z
    - could be smoothed over time, but only with past values (a trailing window), never future ones
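
If the smoothing is meant to be causal (each smoothed value depends only on the current and earlier dates), a trailing rolling mean is one way to do it. This is a minimal sketch on made-up numbers; the series name `share` and the 3-day window are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy daily series standing in for one (source, target, theme) combination.
dates = pd.date_range("2017-01-01", periods=5, freq="D")
share = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=dates, name="share")

# Trailing window: each value averages the current day and up to two
# earlier days, so no future observation influences an earlier date.
smoothed = share.rolling(window=3, min_periods=1).mean()

print(smoothed)
```

With `min_periods=1` the first dates are averaged over whatever history exists, rather than being returned as NaN.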

In [13]:
import pandas as pd
import urllib.parse
import pickle

countries_path = 'saves/countries_dict.pickle'
themes_path = 'saves/themes_list.pickle'
countries_capitals_path = 'saves/countries_capitals.csv'

Then we load the auxiliary datasets:

In [15]:
with open(countries_path, 'rb') as f:
    countries = pickle.load(f)

with open(themes_path, 'rb') as f:
    themes = pickle.load(f)

countries_capitals = pd.read_csv(countries_capitals_path)

# Functions to scrape

In [6]:
# Named `params` rather than `dict`, which would shadow the built-in type
params = {
    "theme": "Terror",
    "query": "Netherlands",
    "mode": "TimelineSourceCountry",
    "format": "csv",
    "startdatetime": "20170101010000",
    "enddatetime": "20240223143054",
}

In [7]:
def querybuilder(params):
    """Build a GDELT DOC API request URL from a dict of query parameters."""
    base_url = "https://api.gdeltproject.org/api/v2/doc/doc?"
    # urlencode percent-encodes every key and value, including '&' and '='
    return base_url + urllib.parse.urlencode(params, quote_via=urllib.parse.quote)

def get_gdelt_data(theme, country, start_date, end_date):
    """Fetch the source-country timeline for one theme and one target country."""
    params = {
        "theme": theme,
        "query": countries[country],  # search expression for this country
        "mode": "TimelineSourceCountry",
        "format": "csv",
        "startdatetime": start_date,
        "enddatetime": end_date,
    }
    url = querybuilder(params)
    return pd.read_csv(url)

def scrape_gdelt(themes, countries, start_date, end_date, verbose=False):
    """Fetch every theme/country combination and stack the results."""
    df_list = []
    for theme in themes:
        for country in countries:
            if verbose:
                print(f"Scraping {theme} in {country}")
            df = get_gdelt_data(theme, country, start_date, end_date)
            df['theme'] = theme
            df['country'] = country
            df_list.append(df)
    return pd.concat(df_list, ignore_index=True)

# Testing methods

In [8]:
url = querybuilder(params)
df = pd.read_csv(url)
df.head()

|   | Date | Series | Value |
|---|------|--------|-------|
| 0 | 2017-01-02 | Switzerland Volume Intensity | 0.9921 |
| 1 | 2017-01-03 | Switzerland Volume Intensity | 1.3936 |
| 2 | 2017-01-04 | Switzerland Volume Intensity | 0.8653 |
| 3 | 2017-01-05 | Switzerland Volume Intensity | 1.3412 |
| 4 | 2017-01-06 | Switzerland Volume Intensity | 0.951 |
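
Before scaling up, it can help to look at the request URL itself. The sketch below is a minimal, offline illustration of how parameter values get percent-encoded. `build_url` and `example_params` are hypothetical names, and it relies on `urllib.parse.urlencode`, which may escape some characters differently from the notebook's own `querybuilder`.

```python
import urllib.parse

BASE_URL = "https://api.gdeltproject.org/api/v2/doc/doc?"


def build_url(params):
    """Join the parameters into a query string, percent-encoding each value."""
    return BASE_URL + urllib.parse.urlencode(params, quote_via=urllib.parse.quote)


# Illustrative parameters; the multi-word query shows how spaces are encoded.
example_params = {
    "query": "United Kingdom",
    "mode": "TimelineSourceCountry",
    "format": "csv",
}
example_url = build_url(example_params)
print(example_url)
```

Printing the URL and opening it in a browser is a quick way to check a single query before looping over many of them.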


Test for a few countries and a few themes:

In [None]:
countries_subset = [str(item) for item in countries.keys()][:5]
themes_subset = themes[:5]

start_date_test = "20170101010000"
end_date_test = "20170223143054"

scraped_test = scrape_gdelt(themes_subset, countries_subset, start_date_test, end_date_test)

In [None]:
# rename columns
scraped_test.columns = ["Date", "Source country", "Intensity", "Theme", "Target country"]

# strip the " Volume Intensity" suffix from the source country column
scraped_test["Source country"] = scraped_test["Source country"].str.replace(" Volume Intensity", "", regex=False)

# sort on source country, date, target country and theme
scraped_test = scraped_test.sort_values(by=["Source country", "Date", "Target country", "Theme"])

# drop rows with an empty source country
scraped_clean = scraped_test[scraped_test["Source country"] != ""]

scraped_clean.head(10)

In [None]:
# pivot the table so that each theme becomes its own column
scraped_test_pivot = scraped_clean.pivot_table(index=["Date", "Source country", "Target country"], columns=["Theme"], values="Intensity").reset_index()

# set "Date" as the index
scraped_test_pivot = scraped_test_pivot.set_index("Date")

scraped_test_pivot.head()

Using this method we can create a DataFrame containing the volume of coverage from source country x that mentions country y and has theme z. This is the data we want.

However, we then need to iterate over all countries and all themes. For this we need a dictionary that lists, for each country, the ways that country can be referred to, and a list of all the themes whose volume we want to measure.
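
One possible shape for such a dictionary is a mapping from country name to a list of aliases, which is then collapsed into a single search expression. This is only a sketch: the alias lists and the helper `aliases_to_query` are made up for illustration, the real contents of `countries_dict.pickle` are not shown here, and it assumes the API accepts parenthesised `OR` groups and double-quoted phrases in the `query` parameter.

```python
# Hypothetical alias lists; the real dictionary is loaded from a pickle file.
country_aliases = {
    "Netherlands": ["Netherlands", "Dutch", "The Hague"],
    "France": ["France"],
}


def aliases_to_query(aliases):
    """Combine aliases into one search expression, quoting multi-word phrases."""
    terms = [f'"{alias}"' if " " in alias else alias for alias in aliases]
    if len(terms) == 1:
        return terms[0]
    return "(" + " OR ".join(terms) + ")"


queries = {country: aliases_to_query(aliases)
           for country, aliases in country_aliases.items()}
print(queries["Netherlands"])
```

Keeping the aliases as lists, rather than pre-built strings, makes it easier to add or remove a spelling later.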

We can then make a function that iterates and makes repeated API calls for all combinations, and use some data wrangling to get all data into one dataframe.
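
A rough sketch of such a loop is shown below. It assumes a `fetch(theme, country)` callable that returns a DataFrame; `scrape_all`, `fake_fetch`, and the `pause_seconds` parameter are hypothetical names introduced here, and `fake_fetch` replaces the real API call so the example runs offline.

```python
import itertools
import time

import pandas as pd


def scrape_all(themes, countries, fetch, pause_seconds=0.0):
    """Call fetch(theme, country) for every combination and stack the results."""
    frames = []
    failed = []
    for theme, country in itertools.product(themes, countries):
        try:
            df = fetch(theme, country)
        except Exception as exc:  # record the failure and keep going
            failed.append((theme, country, str(exc)))
            continue
        # Label each chunk so the combinations stay distinguishable after stacking.
        frames.append(df.assign(theme=theme, country=country))
        time.sleep(pause_seconds)  # optional pause between requests
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed


def fake_fetch(theme, country):
    """Stand-in for an API call: returns one row and fails for one pair."""
    if (theme, country) == ("B", "y"):
        raise ValueError("simulated failure")
    return pd.DataFrame({"Date": ["2017-01-02"], "Value": [1.0]})


combined, failed = scrape_all(["A", "B"], ["x", "y"], fake_fetch)
print(combined)
print(failed)
```

Returning the failed combinations alongside the data means a long run does not have to be restarted from scratch; only the listed pairs need retrying.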

# Scraping all data

For this we need the dataset of all country names and the list of themes.