# Scraping the GDELT dataset

We need a DataFrame with the following columns:
- date
- source country of the article
- country the article mentions
- theme(s) of the article OR the article text

Alternatively, an aggregated DataFrame with:
- date
- percentage of articles from source country x mentioning country y that have theme z
    - this could be smoothed over time, but not backwards in time (a smoothed value should not depend on later dates)
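
One reading of "not backwards in time" is that a smoothed value for a given date may only use that date and earlier ones. A minimal sketch of such a trailing rolling mean in pandas is shown here. The series name `share`, its values and the window of 3 days are made up for illustration.

```python
import pandas as pd

# hypothetical daily theme shares for one source/target country pair
share = pd.Series(
    [0.0, 3.0, 6.0, 3.0, 0.0],
    index=pd.date_range("2017-01-02", periods=5, freq="D"),
    name="share",
)

# trailing window: each value averages the current day and up to 2 earlier days,
# so nothing from later dates leaks into earlier ones
smoothed = share.rolling(window=3, min_periods=1).mean()
print(smoothed)
```

A centered window (`center=True`) would mix in future values, which is what this note wants to avoid.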

In [76]:
import pandas as pd
import urllib.parse
import pickle

countries_path = 'saves/countries_dict.pickle'
themes_path = 'saves/themes_list.pickle'
countries_capitals_path = 'saves/countries_capitals.csv'
countries_queries_path = 'saves/country_queries.csv'

Next, we load the auxiliary datasets: the themes list, the country names and capitals, and the per-country queries.

In [79]:
# open themes list
with open(themes_path, 'rb') as f:
    themes = pickle.load(f)

# open the countries and capitals csv
countries_capitals = pd.read_csv(countries_capitals_path)
# make a countries dictionary with FIPS as key
countries = countries_capitals.set_index('FIPS')['Country'].to_dict()

# open the queries csv
countries_queries = pd.read_csv(countries_queries_path)
# make a query dictionary with FIPS as key
query_dict = countries_queries.set_index('FIPS')['Query'].to_dict()
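
To make the lookup pattern concrete, here is a small self-contained sketch that builds the same kind of FIPS-keyed dictionary from an in-memory CSV. The three rows are a hypothetical stand-in for `saves/country_queries.csv`, using query strings that appear in the test URLs further down. The real file may contain other columns.

```python
import io
import pandas as pd

# hypothetical stand-in for saves/country_queries.csv
csv_text = '''FIPS,Query
AU,(Austria OR Vienna)
AJ,(Azerbaijan OR Baku)
BF,"(""The Bahamas"" OR Nassau)"
'''
queries = pd.read_csv(io.StringIO(csv_text))

# to_dict() silently keeps the last row for a duplicated key, so check first
assert queries["FIPS"].is_unique

query_lookup = queries.set_index("FIPS")["Query"].to_dict()
print(query_lookup["BF"])
```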

# Functions to scrape

In [89]:
def querybuilder(params):
    """Build a GDELT DOC API URL from a dictionary of query parameters."""
    base_url = "https://api.gdeltproject.org/api/v2/doc/doc?"
    url = base_url + "&".join(f"{key}={value}" for key, value in params.items())
    # percent-encode spaces and quotes, then fold the theme into the query string
    url = urllib.parse.quote(url, safe='():/?&=').replace("&theme=", "%20theme:")
    return url

def get_gdelt_data(theme, country, start_date, end_date, verbose=False):
    """Return the source-country timeline for one theme and target country.

    Returns None when the API sends back an empty response.
    """
    params = {"query": query_dict[country]}
    if theme != "ALL":
        # "theme" has to come directly after "query": querybuilder appends it to the query
        params["theme"] = theme
    params.update({
        "mode": "TimelineSourceCountry",
        "startdatetime": start_date,
        "enddatetime": end_date,
        "format": "csv",
        "timezoom": "yes",
    })
    url = querybuilder(params)
    if verbose:
        # print the browser-viewable version of the same request
        print(url.replace("format=csv", "format=html"))
    try:
        return pd.read_csv(url)
    except pd.errors.EmptyDataError:
        if verbose:
            print("passed")
        return None

def scrape_gdelt(themes, countries, start_date, end_date, verbose=False):
    """Scrape every theme/country combination and stack the results into one DataFrame."""
    df_list = []
    # work on a copy so that adding "ALL" does not modify the caller's list
    themes = list(themes) + ["ALL"]
    for theme in themes:
        for country in countries:
            if verbose:
                print(f"Scraping {theme} in {country}")
            df = get_gdelt_data(theme, country, start_date, end_date, verbose=verbose)
            if df is not None:
                df['theme'] = theme
                df['country'] = country
                df_list.append(df)
    return pd.concat(df_list)
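
The string replacement in `querybuilder` depends on the `theme` key coming right after `query`. As a possible alternative, the sketch below appends the theme to the query text before encoding and lets `urllib.parse.urlencode` do the escaping. The name `build_url` is hypothetical and not part of the notebook. The example call reproduces the first URL printed in the test run below.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.gdeltproject.org/api/v2/doc/doc"

def build_url(query, theme=None, **params):
    """Hypothetical helper: build a request URL independent of parameter order."""
    if theme is not None:
        # the theme filter is part of the query text itself
        query = f"{query} theme:{theme}"
    # quote_via=quote encodes spaces as %20; "():" are left readable
    encoded = urlencode({"query": query, **params}, quote_via=quote, safe="():")
    return f"{BASE_URL}?{encoded}"

url = build_url(
    "(Austria OR Vienna)",
    theme="TAX_FNCACT",
    mode="TimelineSourceCountry",
    startdatetime="20170101010000",
    enddatetime="20170223143054",
    format="html",
    timezoom="yes",
)
print(url)
```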

# Testing methods

In [67]:
test_dict = {
    "query" : "Netherlands",
    "theme" : "KILL",
    "mode" : "TimelineSourceCountry",
    "startdatetime" : "20170101010000",
    "enddatetime" : "20240229143054",
    "format" : "html",
    "timezoom" : "yes"
}


Test for a few countries and a few themes:

In [90]:
countries_subset = [str(item) for item in countries.keys()][12:15]
themes_subset = themes[:8]

start_date_test = "20170101010000"
end_date_test = "20170223143054"

scarped_test = scrape_gdelt(themes_subset, countries_subset, start_date_test, end_date_test, verbose=True)

Scraping TAX_FNCACT in AU
https://api.gdeltproject.org/api/v2/doc/doc?query=(Austria%20OR%20Vienna)%20theme:TAX_FNCACT&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_FNCACT in AJ
https://api.gdeltproject.org/api/v2/doc/doc?query=(Azerbaijan%20OR%20Baku)%20theme:TAX_FNCACT&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_FNCACT in BF
https://api.gdeltproject.org/api/v2/doc/doc?query=(%22The%20Bahamas%22%20OR%20Nassau)%20theme:TAX_FNCACT&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_ETHNICITY in AU
https://api.gdeltproject.org/api/v2/doc/doc?query=(Austria%20OR%20Vienna)%20theme:TAX_ETHNICITY&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_ETHNICITY in AJ
https://api.gdel

In [91]:
# rename columns
scarped_test.columns = ["Date", "Source country", "Intensity", "Theme", "Target country"]

# clean up source country column
scarped_test["Source country"] = scarped_test["Source country"].str.replace(" Volume Intensity", "")

# map the FIPS code in the target country column to the country name using the countries dictionary
scarped_test["Target country"] = scarped_test["Target country"].map(countries)

# sort by date, source country, target country and theme
scarped_test = scarped_test.sort_values(by=["Date", "Source country", "Target country", "Theme"])

# drop rows with an empty source country
scarped_clean = scarped_test[scarped_test["Source country"] != ""]

scarped_clean[20:50]

| | Date | Source country | Intensity | Theme | Target country |
|---|---|---|---|---|---|
| 8003 | 2017-01-02 | Algeria | 0.2006 | LEADER | Austria |
| 265 | 2017-01-02 | Algeria | 0.2006 | USPEC_POLITICS_GENERAL1 | Austria |
| 477 | 2017-01-02 | Algeria | 0.0 | ALL | Azerbaijan |
| 4452 | 2017-01-02 | Algeria | 0.0 | CRISISLEX_CRISISLEXREC | Azerbaijan |
| 6201 | 2017-01-02 | Algeria | 0.0 | LEADER | Azerbaijan |
| 4346 | 2017-01-02 | Algeria | 0.0 | USPEC_POLITICS_GENERAL1 | Azerbaijan |
| 3074 | 2017-01-02 | Algeria | 0.0 | ALL | The Bahamas |
| 0 | 2017-01-02 | Algeria | 0.0 | LEADER | The Bahamas |
| 2385 | 2017-01-02 | Algeria | 0.0 | USPEC_POLITICS_GENERAL1 | The Bahamas |
| 4558 | 2017-01-02 | Angola | 0.9709 | ALL | Austria |


In [92]:
# pivot the table so that each theme becomes its own column (duplicate rows are averaged)
scarped_test_pivot = scarped_clean.pivot_table(index=["Date", "Source country", "Target country"], columns=["Theme"], values="Intensity").reset_index()

# set the "Date" as the index
scarped_test_pivot = scarped_test_pivot.set_index("Date")

scarped_test_pivot.head(20)

| Date | Source country | Target country | ALL | CRISISLEX_CRISISLEXREC | LEADER | USPEC_POLITICS_GENERAL1 |
|---|---|---|---|---|---|---|
| 2017-01-02 | Afghanistan | Austria | 0.6557 | 0.3279 | 0.0 | 0.0 |
| 2017-01-02 | Afghanistan | Azerbaijan | 0.0 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Afghanistan | The Bahamas | 0.0 | 0.0 | NaN | NaN |
| 2017-01-02 | Albania | Austria | 1.4085 | 0.7042 | 0.3521 | 0.7042 |
| 2017-01-02 | Albania | Azerbaijan | 0.0 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Algeria | Austria | 1.003 | 0.2006 | 0.2006 | 0.2006 |
| 2017-01-02 | Algeria | Azerbaijan | 0.0 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Algeria | The Bahamas | 0.0 | NaN | 0.0 | 0.0 |
| 2017-01-02 | Angola | Austria | 0.9709 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Angola | Azerbaijan | 0.9709 | 0.9709 | 0.0 | 0.0 |

With this method we can build a DataFrame that contains the volume of coverage from source country x that mentions country y and has theme z. This is the data that we want.
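
To get closer to the "percentage of articles with theme z" described at the top, one option is to divide each theme column by the `ALL` column. The sketch below assumes that the intensities of a theme query and of the `ALL` query are measured on the same scale, so that their ratio approximates a share. That assumption has not been checked against the GDELT documentation here. The two rows reuse values from the pivoted output above.

```python
import numpy as np
import pandas as pd

# two rows in the shape of the pivoted table (values taken from the output above)
pivot = pd.DataFrame({
    "Source country": ["Albania", "Albania"],
    "Target country": ["Austria", "Azerbaijan"],
    "ALL": [1.4085, 0.0],
    "LEADER": [0.3521, 0.0],
})
theme_cols = ["LEADER"]

# where there is no coverage at all the share is undefined, so use NaN instead of dividing by zero
all_volume = pivot["ALL"].replace(0, np.nan)
shares = pivot[theme_cols].div(all_volume, axis=0)
print(shares)
```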

To cover everything, we need to iterate over all countries and all themes. This requires a dictionary that lists, for each country, the ways that country can be referred to (for example its name or its capital, as in the query `(Austria OR Vienna)`), and a list of all the themes whose volume we want to measure.
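
As a sketch of how such a dictionary could be turned into query strings, the hypothetical helper `build_query` below joins the alternative names with `OR` and puts multi-word names in double quotes, matching the form of the queries seen in the test URLs. The `aliases` dictionary is made up for illustration.

```python
def build_query(names):
    """Hypothetical helper: combine alternative names into one OR query."""
    # multi-word names are quoted so they are searched as a phrase
    terms = [f'"{name}"' if " " in name else name for name in names]
    if len(terms) == 1:
        # assumption: parentheses are only needed around OR'd terms
        return terms[0]
    return "(" + " OR ".join(terms) + ")"

# illustrative aliases keyed by FIPS code
aliases = {
    "AU": ["Austria", "Vienna"],
    "BF": ["The Bahamas", "Nassau"],
}
queries = {fips: build_query(names) for fips, names in aliases.items()}
print(queries)
```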

We can then write a function that makes one API call for each combination, and use some data wrangling to combine all results into one DataFrame.
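
The loop over all combinations can be sketched independently of the API by passing the fetching step in as a function. Everything below is illustrative: `scrape_all`, `fake_fetch` and the `pause` parameter are hypothetical names, and `fake_fetch` only stands in for a real request so that the data flow can be run offline.

```python
import itertools
import time
import pandas as pd

def scrape_all(themes, countries, fetch, pause=0.0):
    """Call fetch(theme, country) for every combination and stack the results."""
    frames = []
    for theme, country in itertools.product(themes, countries):
        df = fetch(theme, country)
        if df is None or df.empty:
            continue  # nothing came back for this combination
        frames.append(df.assign(theme=theme, country=country))
        time.sleep(pause)  # optional pause between successive calls
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

def fake_fetch(theme, country):
    """Stand-in for a real API call; returns no data for the theme "EMPTY"."""
    if theme == "EMPTY":
        return None
    return pd.DataFrame({"Date": ["2017-01-02"], "Value": [1.0]})

result = scrape_all(["LEADER", "EMPTY"], ["AU", "AJ"], fake_fetch)
print(result)
```

Returning an empty DataFrame when nothing was fetched avoids the error that `pd.concat` raises on an empty list.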

# Scraping all data

For this we need the dataset of all country names and the list of themes.