# **NB01 - Data Collection**

**Objective:** This notebook collects data from GDELT and the World Bank API and saves it to the raw data folder for processing in notebook 2. V-Dem data is collected in notebook 2 because the raw file is too large to store as raw data; it can only be pushed to GitHub after it has been processed (see the README file for more details).

**Imports:**

In [None]:
import pandas as pd
import os
import functions 
from google.cloud import storage

# 1. Collect the data from the GDELT API

We used Google BigQuery and Google Cloud Storage (GCS) to query the data from GDELT.

The query used to select the data was:

```sql
SELECT SQLDATE, EventCode, ActionGeo_CountryCode, AvgTone, GoldsteinScale, NumMentions
FROM `gdelt-bq.gdeltv2.events`
WHERE ActionGeo_CountryCode IN ('US', 'FR', 'IR', 'BR', 'IN', 'ZA')
  AND EventRootCode IN ('10', '11', '13', '14', '15', '18', '20')
  AND SQLDATE BETWEEN 20130101 AND 20231231
ORDER BY SQLDATE DESC;
```

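`ActionGeo_CountryCode` holds two-character FIPS 10-4 country codes, which differ from ISO 3166 codes for some countries. In FIPS 10-4, South Africa is `SF`, whereas `ZA` denotes Zambia, so the country list in the query is worth checking against the FIPS table.

The event root codes in the query represent the following categories: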
- 10 = demands
- 11 = disapproval 
- 13 = threat
- 14 = protest
- 15 = force
- 18 = assault 
- 20 = mass violence
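
As a minimal sketch, the root codes above can be kept in one mapping and used both to label events and to rebuild the `IN (...)` filter. The names `ROOT_CODE_LABELS`, `root_code` and `sql_in_list` are illustrative and are not part of the project's `functions` module. The sketch assumes `EventCode` arrives as a zero-padded CAMEO string whose first two characters are the root code, since the query returns `EventCode` rather than `EventRootCode`.

```python
# Root-code labels as listed above (illustrative mapping, not taken from the project's code).
ROOT_CODE_LABELS = {
    "10": "demands",
    "11": "disapproval",
    "13": "threat",
    "14": "protest",
    "15": "force",
    "18": "assault",
    "20": "mass violence",
}


def root_code(event_code: str) -> str:
    """Return the two-character root code of a zero-padded CAMEO EventCode string."""
    return event_code[:2]


def sql_in_list(values) -> str:
    """Format trusted constant values as a quoted list for a SQL IN (...) clause."""
    return ", ".join(f"'{v}'" for v in values)


print(root_code("141"), ROOT_CODE_LABELS[root_code("141")])  # 14 protest
print(sql_in_list(ROOT_CODE_LABELS))  # '10', '11', '13', '14', '15', '18', '20'
```

Keeping the codes in a single mapping means the filter and the labels cannot drift apart. The string formatting here is only suitable for hard-coded constants, not for user-supplied values.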

The GDELT documentation describes the tone and Goldstein scale measures as follows:

- AvgTone. (numeric) This is the average “tone” of all documents containing one or more
mentions of this event during the 15 minute update in which it was first seen. The score
ranges from -100 (extremely negative) to +100 (extremely positive). Common values range
between -10 and +10, with 0 indicating neutral. This can be used as a method of filtering the
“context” of events as a subtle measure of the importance of an event and as a proxy for the
“impact” of that event. For example, a riot event with a slightly negative average tone is likely
to have been a minor occurrence, whereas if it had an extremely negative average tone, it
suggests a far more serious occurrence. A riot with a positive score likely suggests a very minor 
occurrence described in the context of a more positive narrative (such as a report of an attack
occurring in a discussion of improving conditions on the ground in a country and how the
number of attacks per day has been greatly reduced). NOTE: this field refers only to the first
news report to mention an event and is not updated if the event is found in a different context
in other news reports. It is included for legacy purposes – for more precise information on the
positioning of an event, see the Mentions table. NOTE: this provides only a basic tonal
assessment of an article and it is recommended that users interested in emotional measures use
the Mentions and Global Knowledge Graph tables to merge the complete set of 2,300 emotions
and themes from the GKG GCAM system into their analysis of event records.

- GoldsteinScale. (floating point) Each CAMEO event code is assigned a numeric score from -10 to
+10, capturing the theoretical potential impact that type of event will have on the stability of a
country. This is known as the Goldstein Scale. This field specifies the Goldstein score for each
event type. NOTE: this score is based on the type of event, not the specifics of the actual event
record being recorded – thus two riots, one with 10 people and one with 10,000, will both
receive the same Goldstein score. This can be aggregated to various levels of time resolution to
yield an approximation of the stability of a location over time.
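
To illustrate the aggregation described above, the sketch below computes a yearly mean Goldstein score and a `NumMentions`-weighted mean tone per country. The sample rows are invented for illustration, and weighting tone by mentions is one possible choice rather than something this notebook prescribes. Field names follow the query above, and `aggregate_yearly` is a hypothetical helper.

```python
from collections import defaultdict

# Invented sample rows using the columns returned by the query above.
rows = [
    {"SQLDATE": 20230105, "ActionGeo_CountryCode": "FR", "AvgTone": -4.0,
     "GoldsteinScale": -6.5, "NumMentions": 10},
    {"SQLDATE": 20230210, "ActionGeo_CountryCode": "FR", "AvgTone": -2.0,
     "GoldsteinScale": -5.0, "NumMentions": 30},
    {"SQLDATE": 20220301, "ActionGeo_CountryCode": "US", "AvgTone": 1.0,
     "GoldsteinScale": -2.0, "NumMentions": 5},
]


def aggregate_yearly(rows):
    """Return {(country, year): stats} with mean Goldstein and mention-weighted tone."""
    sums = defaultdict(lambda: {"n": 0, "goldstein": 0.0, "tone_x_mentions": 0.0, "mentions": 0})
    for r in rows:
        # SQLDATE is a YYYYMMDD value; int() also covers exports that store it as text.
        year = int(r["SQLDATE"]) // 10000
        acc = sums[(r["ActionGeo_CountryCode"], year)]
        acc["n"] += 1
        acc["goldstein"] += r["GoldsteinScale"]
        acc["tone_x_mentions"] += r["AvgTone"] * r["NumMentions"]
        acc["mentions"] += r["NumMentions"]
    return {
        key: {
            "events": acc["n"],
            "mean_goldstein": acc["goldstein"] / acc["n"],
            "weighted_tone": acc["tone_x_mentions"] / acc["mentions"] if acc["mentions"] else None,
        }
        for key, acc in sums.items()
    }


yearly = aggregate_yearly(rows)
print(yearly[("FR", 2023)])
```

Because the Goldstein score depends only on the event type, its yearly mean reflects the mix of event types recorded, while the tone measure reflects how the first report of each event was written.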

We used GCS to export each table from Google BigQuery to the cloud, and then used the Python package `google.cloud` to load the JSON files. The code below reads each JSON file and saves it to `data/raw/gdelt`.


In [None]:
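# Authenticate with the service-account key and open the bucket holding the BigQuery exports
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "../access_keys/secure_key.json"
client = storage.Client()
bucket_name = "gdelt_data_yearly_storage"
bucket = client.bucket(bucket_name)

# Read each exported JSON file for the selected countries and save it under data/raw/gdelt
countries = ['US', 'FR', 'ZA', 'IR', 'BR', 'IN']
output_dir = "../data/raw/gdelt/"
functions.process_gdelt_data(bucket, countries, output_dir)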
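
`functions.process_gdelt_data` is defined in the project's `functions` module and is not shown in this notebook. As a rough sketch of the kind of step it might perform, and assuming the exported files are newline-delimited JSON (the format BigQuery uses for JSON exports), the helper below groups records by country. `split_by_country` is a hypothetical name and the sample text is invented.

```python
import json
from collections import defaultdict


def split_by_country(ndjson_text, countries):
    """Parse newline-delimited JSON and group records by ActionGeo_CountryCode."""
    wanted = set(countries)
    grouped = defaultdict(list)
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        code = record.get("ActionGeo_CountryCode")
        if code in wanted:
            grouped[code].append(record)
    return dict(grouped)


# Invented sample standing in for the text of one exported file.
sample = "\n".join([
    '{"SQLDATE": "20230105", "ActionGeo_CountryCode": "FR", "AvgTone": -4.0}',
    '{"SQLDATE": "20230106", "ActionGeo_CountryCode": "DE", "AvgTone": 0.5}',
    '{"SQLDATE": "20230107", "ActionGeo_CountryCode": "FR", "AvgTone": -1.2}',
])
by_country = split_by_country(sample, ["US", "FR"])
print({code: len(records) for code, records in by_country.items()})  # {'FR': 2}
```

With the `google-cloud-storage` client, the text for each file could come from `blob.download_as_text()` while iterating over `bucket.list_blobs()`; whether the project's function works this way is an assumption.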

# 2. Collect the data from the World Bank API

The World Bank API provides access to a large repository of global economic data, including GDP, inflation, unemployment, and governance indicators.

We used this API to collect key economic variables for our analysis, which examines the relationship between economic conditions and public sentiment towards democracy.

### Economic Indicators Collected

- **GDP Growth (NY.GDP.MKTP.KD.ZG):** Measures economic performance, often linked to public confidence in governance.
- **GINI Index (SI.POV.GINI):** Captures income distribution and economic disparity, which may impact support for democracy.
- **Income Share Held by Lowest 20% (SI.DST.FRST.20):** Measures income inequality, often associated with political discontent.
- **Consumer Price Index, 2010 = 100 (FP.CPI.TOTL):** Tracks the price level from which inflation is derived; rising prices affect cost of living and economic stability, impacting political trust.
- **Poverty Headcount Ratio at $2.15 a Day (SI.POV.DDAY):** Indicates extreme poverty levels, a crucial factor in economic grievances.
- **Unemployment (% of total labor force) (SL.UEM.TOTL.ZS):** Reflects labor market health, influencing political sentiment.
- **Government Effectiveness: Estimate (GE.EST):** Measures the quality of governance, affecting trust in democratic institutions.
- **Control of Corruption: Estimate (CC.EST):** Evaluates corruption levels, which influence public confidence in democracy.
- **Education Expenditure (% of GDP) (SE.XPD.TOTL.GD.ZS):** Assesses investment in human capital, influencing long-term economic and political stability.

### Countries and Time Frame

To maintain consistency with the **GDELT dataset**, we collected data for the following countries:

- **United States (USA)**
- **Brazil (BRA)**
- **France (FRA)**
- **South Africa (ZAF)**
- **Iran (IRN)**
- **India (IND)**

We retrieved data from **2014 to 2023**, ensuring a sufficient time span to analyse economic trends alongside democratic sentiment.


In [None]:
indicators = {
    "gdp_growth": "NY.GDP.MKTP.KD.ZG",  # GDP Growth (annual %)
    "inflation": "FP.CPI.TOTL",  # Consumer price index (2010 = 100); annual inflation (%) is FP.CPI.TOTL.ZG
    "unemployment": "SL.UEM.TOTL.ZS",  # Unemployment (% of total labor force)
    "income_share_lowest_20": "SI.DST.FRST.20",  # Income Share Held by Lowest 20%
    "gini_index": "SI.POV.GINI",  # GINI Index (World Bank estimate)
    "poverty_headcount": "SI.POV.DDAY",  # Poverty Headcount Ratio at $2.15 a Day (2017 PPP) (% of population)
    "education_expenditure": "SE.XPD.TOTL.GD.ZS",  # Education Expenditure (% of GDP)
    "government_effectiveness": "GE.EST",  # Government Effectiveness: Estimate
    "control_of_corruption": "CC.EST"  # Control of Corruption: Estimate
}
countries = ["USA", "BRA", "FRA", "ZAF", "IRN", "IND"]  # USA, BRAZIL, FRANCE, SOUTH AFRICA, IRAN, INDIA
start_year = 2014
end_year = 2023
folder = "data/raw/world_bank"

data = functions.collect_data(countries, indicators, start_year, end_year, folder)
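
`functions.collect_data` lives in the project's `functions` module and is not shown here. The sketch below illustrates, under stated assumptions, how one indicator could be requested from the World Bank API v2 and flattened into rows. `build_url`, `parse_response` and `fetch_indicator` are hypothetical names, and the sample payload is invented to mirror the `[metadata, records]` shape the API returns, so no network request is made when the block runs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"


def build_url(countries, indicator, start_year, end_year, per_page=1000):
    """Build a World Bank API v2 URL for one indicator and several countries."""
    params = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": per_page},
        safe=":",  # keep the year range readable as 2014:2023
    )
    return f"{BASE_URL}/country/{';'.join(countries)}/indicator/{indicator}?{params}"


def parse_response(payload, name):
    """Flatten a [metadata, records] payload into a list of row dicts."""
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []  # error messages and empty results carry no records list
    return [
        {"country": rec["countryiso3code"], "year": int(rec["date"]), name: rec["value"]}
        for rec in payload[1]
    ]


def fetch_indicator(countries, indicator, name, start_year, end_year):
    """Download and parse one indicator (network call; not executed in this sketch)."""
    url = build_url(countries, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as resp:
        return parse_response(json.load(resp), name)


# Invented payload used in place of a live request.
sample = [
    {"page": 1, "pages": 1, "per_page": 1000, "total": 2},
    [
        {"countryiso3code": "FRA", "date": "2023", "value": 0.9},
        {"countryiso3code": "FRA", "date": "2022", "value": None},
    ],
]
url = build_url(["USA", "FRA"], "NY.GDP.MKTP.KD.ZG", 2014, 2023)
parsed = parse_response(sample, "gdp_growth")
print(url)
print(parsed)
```

Missing observations come back as `null` and are kept as `None` here, which leaves the decision on how to handle gaps to the processing step in notebook 2.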