# Scraping the GDELT dataset

What we need is a DataFrame with:
- date
- source country of article
- country article mentions
- theme(s) of article OR article text

OR
- date
- percentage of articles from source country x mentioning country y that have theme z
    - could be smoothed over time, but only using current and earlier values, so that no information flows backwards in time
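
The second format can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual pipeline: the function name `theme_share`, the window size and the numbers are all assumptions, and it assumes that the share is the theme volume divided by the total volume and that smoothing should only use the current and earlier days.

```python
import pandas as pd

def theme_share(theme_volume, all_volume, window=3):
    """Percentage of coverage carrying a theme, smoothed with a trailing window."""
    # Avoid division by zero: days without any coverage get a share of 0 (an assumption).
    share = (theme_volume / all_volume.where(all_volume > 0) * 100).fillna(0.0)
    # rolling() uses the current and earlier rows only, so no future data leaks in.
    return share.rolling(window=window, min_periods=1).mean()

dates = pd.date_range("2017-01-01", periods=4, freq="D")
theme_volume = pd.Series([1.0, 2.0, 0.0, 4.0], index=dates)  # made-up values
all_volume = pd.Series([10.0, 10.0, 0.0, 8.0], index=dates)  # made-up values

smoothed = theme_share(theme_volume, all_volume, window=2)
print(smoothed)
```

A trailing window is used here because a centred window would let later days influence earlier ones.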

In [102]:
import pandas as pd
import urllib.parse
import pickle

countries_path = 'saves/countries_dict.pickle'
themes_path = 'saves/themes_list.pickle'
countries_capitals_path = 'saves/countries_capitals.csv'

Then we load the auxiliary datasets:

In [103]:
with open(countries_path, 'rb') as f:
    countries = pickle.load(f)

with open(themes_path, 'rb') as f:
    themes = pickle.load(f)

countries_capitals = pd.read_csv(countries_capitals_path)

# Functions to scrape

In [104]:
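def querybuilder(params):
    """Build a GDELT DOC API URL from a dict of query parameters."""
    base_url = "https://api.gdeltproject.org/api/v2/doc/doc?"
    url = base_url + "&".join(f"{key}={value}" for key, value in params.items())
    # Percent-encode the URL, then fold the theme into the query term as " theme:<THEME>".
    # This relies on "theme" directly following "query" in the dict.
    url = urllib.parse.quote(url, safe=':/?&=').replace("&theme=", "%20theme:")
    return url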

def get_gdelt_data(theme, country, start_date, end_date, verbose=False):
    # "theme" must directly follow "query" so that querybuilder can merge the two.
    params = {"query": countries[country]}
    if theme != "ALL":
        params["theme"] = theme
    params.update({
        "mode": "TimelineSourceCountry",
        "startdatetime": start_date,
        "enddatetime": end_date,
        "format": "csv",
        "timezoom": "yes",
    })
    url = querybuilder(params)
    if verbose:
        # Print the HTML version of the same query so it can be opened in a browser.
        print(url.replace("format=csv", "format=html"))
    try:
        return pd.read_csv(url)
    except pd.errors.EmptyDataError:
        # The API returned no data for this combination.
        if verbose:
            print("passed")
        return None

    

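def scrape_gdelt(themes, countries, start_date, end_date, verbose=False):
    """Scrape every theme/country combination, plus an "ALL" baseline per country."""
    df_list = []
    # Add the "ALL" baseline without mutating the caller's list.
    for theme in [*themes, "ALL"]:
        for country in countries:
            if verbose:
                print(f"Scraping {theme} in {country}")
            df = get_gdelt_data(theme, country, start_date, end_date, verbose=verbose)
            if df is not None:
                df['theme'] = theme
                df['country'] = country
                df_list.append(df)
    # pd.concat raises on an empty list, so return an empty frame in that case.
    if not df_list:
        return pd.DataFrame()
    return pd.concat(df_list)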

# Testing methods

In [105]:
test_dict = {
    "query" : "Netherlands",
    "theme" : "KILL",
    "mode" : "TimelineSourceCountry",
    "startdatetime" : "20170101010000",
    "enddatetime" : "20240229143054",
    "format" : "html",
    "timezoom" : "yes"
}
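
As a sketch of what such a parameter dictionary turns into, the snippet below builds the request URL with the standard library's `urlencode` instead of manual string replacement. The helper name `build_query_url` is hypothetical and is not the notebook's `querybuilder`; it assumes, as in the URLs printed by the scraper, that the theme is appended to the query term as ` theme:<THEME>`.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.gdeltproject.org/api/v2/doc/doc?"

def build_query_url(params):
    """Hypothetical helper: fold the theme into the query and encode the parameters."""
    params = dict(params)  # copy, so the caller's dict is left untouched
    theme = params.pop("theme", None)
    if theme is not None:
        params["query"] = f"{params['query']} theme:{theme}"
    # quote_via=quote encodes spaces as %20; safe=":" keeps the "theme:" separator readable.
    return BASE_URL + urlencode(params, safe=":", quote_via=quote)

example_params = {
    "query": "Netherlands",
    "theme": "KILL",
    "mode": "TimelineSourceCountry",
    "startdatetime": "20170101010000",
    "enddatetime": "20240229143054",
    "format": "html",
    "timezoom": "yes",
}
example_url = build_query_url(example_params)
print(example_url)
```

Opening the printed URL in a browser is a quick way to check a query by eye before scraping it as CSV.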

Test for a few countries and a few themes:

In [107]:
countries_subset = [str(item) for item in countries.keys()][12:15]
themes_subset = themes[:8]

start_date_test = "20170101010000"
end_date_test = "20170223143054"

scarped_test = scrape_gdelt(themes_subset, countries_subset, start_date_test, end_date_test, verbose=True)

Scraping TAX_FNCACT in AA
https://api.gdeltproject.org/api/v2/doc/doc?query=Aruba%20theme:TAX_FNCACT&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_FNCACT in AT
https://api.gdeltproject.org/api/v2/doc/doc?query=Ashmore%20and%20Cartier%20Islands%20%20theme:TAX_FNCACT&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_FNCACT in AS
https://api.gdeltproject.org/api/v2/doc/doc?query=Australia%20theme:TAX_FNCACT&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_ETHNICITY in AA
https://api.gdeltproject.org/api/v2/doc/doc?query=Aruba%20theme:TAX_ETHNICITY&mode=TimelineSourceCountry&startdatetime=20170101010000&enddatetime=20170223143054&format=html&timezoom=yes
passed
Scraping TAX_ETHNICITY in AT
https://api.gdeltproject.org/api/v2/doc/doc?query=Ashmore%20and%20



In [108]:
# rename columns
scarped_test.columns = ["Date", "Source country", "Intensity", "Theme", "Target country"]

# clean up source country column
scarped_test["Source country"] = scarped_test["Source country"].str.replace(" Volume Intensity", "", regex=False)

# sort by date, source country, target country and theme
scarped_test = scarped_test.sort_values(by=["Date", "Source country", "Target country", "Theme"])

# drop rows that have no source country
scarped_clean = scarped_test[scarped_test["Source country"] != ""]

scarped_clean[20:50]

| | Date | Source country | Intensity | Theme | Target country |
|---|---|---|---|---|---|
| 1060 | 2017-01-02 | Argentina | 0.0217 | ALL | AA |
| 583 | 2017-01-02 | Argentina | 0.0 | CRISISLEX_CRISISLEXREC | AA |
| 1113 | 2017-01-02 | Argentina | 0.0 | LEADER | AA |
| 530 | 2017-01-02 | Argentina | 0.0 | USPEC_POLITICS_GENERAL1 | AA |
| 6042 | 2017-01-02 | Argentina | 0.2064 | ALL | AS |
| 6307 | 2017-01-02 | Argentina | 0.0217 | CRISISLEX_CRISISLEXREC | AS |
| 5724 | 2017-01-02 | Argentina | 0.0109 | LEADER | AS |
| 2756 | 2017-01-02 | Argentina | 0.0 | USPEC_POLITICS_GENERAL1 | AS |
| 5989 | 2017-01-02 | Armenia | 0.5882 | ALL | AS |
| 1855 | 2017-01-02 | Armenia | 0.2941 | CRISISLEX_CRISISLEXREC | AS |


In [109]:
# pivot the table
scarped_test_pivot = scarped_clean.pivot_table(index=["Date", "Source country", "Target country"], columns=["Theme"], values="Intensity").reset_index()

# set the "Date" as the index
scarped_test_pivot = scarped_test_pivot.set_index("Date")

scarped_test_pivot.head(20)

| Date | Source country | Target country | ALL | CRISISLEX_CRISISLEXREC | LEADER | USPEC_POLITICS_GENERAL1 |
|---|---|---|---|---|---|---|
| 2017-01-02 | Afghanistan | AS | 0.3279 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Albania | AS | 0.0 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Algeria | AA | 0.1003 | 0.1003 | 0.0 | 0.0 |
| 2017-01-02 | Algeria | AS | 0.0 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Angola | AS | 1.9417 | 1.9417 | 1.9417 | 1.9417 |
| 2017-01-02 | Argentina | AA | 0.0217 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Argentina | AS | 0.2064 | 0.0217 | 0.0109 | 0.0 |
| 2017-01-02 | Armenia | AS | 0.5882 | 0.2941 | 0.2941 | 0.0 |
| 2017-01-02 | Australia | AA | 0.0 | 0.0 | 0.0 | 0.0 |
| 2017-01-02 | Australia | AS | 34.2565 | 12.7958 | 8.1032 | 5.8716 |

With this method we can build a dataframe that contains the volume of coverage from source country x that mentions country y and has theme z. This is the data that we want.

However, we then need to iterate over all countries and all themes. For this we need a dictionary that maps each country to the names by which it can be referred to, and a list of all the themes whose volume we want to measure.
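
One possible shape for such a dictionary is sketched below. The alias lists and the helper `country_query` are hypothetical and do not reflect the contents of `countries_dict.pickle`. The sketch also assumes that the API accepts alternative terms written as `(a OR b)` and exact phrases in double quotes, which should be checked against the GDELT DOC API documentation.

```python
def country_query(names):
    """Combine alternative names for one country into a single query term."""
    # Quote multi-word names so they are treated as one phrase.
    terms = [f'"{name}"' if " " in name else name for name in names]
    if len(terms) == 1:
        return terms[0]
    return "(" + " OR ".join(terms) + ")"

# Hypothetical alias dictionary keyed by country code (illustration only).
country_aliases = {
    "NL": ["Netherlands", "Holland"],
    "AT": ["Ashmore and Cartier Islands"],
}

queries = {code: country_query(names) for code, names in country_aliases.items()}
print(queries)
```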

We can then write a function that makes repeated API calls for all combinations, and use some data wrangling to combine the results into one dataframe.
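
The iterate-then-combine step can be sketched without touching the network by passing the fetching function in as an argument. Everything here is illustrative: `collect_all` and `fake_fetch` are hypothetical names, and the numbers are made up.

```python
from itertools import product

import pandas as pd

def collect_all(themes, countries, fetch):
    """Call fetch(theme, country) for every combination and stack the results."""
    frames = []
    for theme, country in product(themes, countries):
        df = fetch(theme, country)
        if df is None or df.empty:
            continue  # skip combinations without data
        frames.append(df.assign(Theme=theme, **{"Target country": country}))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

def fake_fetch(theme, country):
    """Stand-in for an API call: made-up numbers, and no data for one combination."""
    if (theme, country) == ("LEADER", "AA"):
        return None
    return pd.DataFrame(
        {"Date": ["2017-01-02"], "Source country": ["Argentina"], "Intensity": [0.5]}
    )

combined = collect_all(["ALL", "LEADER"], ["AA", "AS"], fake_fetch)

# One row per date, source and target country; one column per theme.
wide = combined.pivot_table(
    index=["Date", "Source country", "Target country"],
    columns="Theme",
    values="Intensity",
).reset_index()
print(wide)
```

Injecting the fetcher keeps the wrangling logic testable on small inputs before any real requests are made.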

# Scraping all data

For this we need the dataset of all country names and the list of themes.