In [1]:
# packages
from google.cloud import bigquery
import os
import pickle
import pandas as pd
import geopandas as gpd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# GDELT (Global Database of Events, Language, and Tone)

The GDELT [project](https://www.gdeltproject.org/) was first introduced by [Leetaru and
Schrodt (2013)](http://data.gdeltproject.org/documentation/ISA.2013.GDELT.pdf) based on earlier explorations (see for example [Schrodt (2010)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1643761)). 

It monitors print, broadcast, and web news media in over 100 languages from nearly all countries in the world. Its 2.0 Event Database contains a quarter billion geo-referenced event records in over 300 categories covering the entire world and has been updated every 15 minutes since February 2015. 
Before February 2015, the 1.0 Event Database released daily updates. Historical data go back to 1979. 

The GDELT project pre-processes each of its sources and reports details on
the actors involved, the type of event, the number of mentions, a sentiment score with respect to the source,
and detailed information on the event’s location. It thereby answers the questions: 

**What happened? Who was involved? Where did it happen? How is it talked about?**

While the GDELT project’s data have been widely used in social science research, it is important to be aware of
what they are and, accordingly, what they are not. The project captures an extensive picture of what is reported, which
might diverge from what truly happened. This means that the data might include both false positives and
numerous mentions of the same incident. In a reduced version, the project limits its data to one record of each
event type between actors per day. Additionally, the data are usually weighted by the total number of event
records collected per day, since a higher number of events today than 20 years ago might just reflect
technical advances. 
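
The two adjustments described above (one record per event type, actor pair, and day; weighting by the daily record total) can be sketched in pandas. This is a minimal illustration on made-up records, not GDELT's own procedure: the column names `actor1`, `actor2`, and `event_code` and the `daily_totals` values are assumptions for this example only.

```python
import pandas as pd

# Hypothetical event records; the second row repeats the first incident.
events = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-01", "2020-03-02"]),
    "actor1": ["PROTESTER", "PROTESTER", "STUDENT", "PROTESTER"],
    "actor2": ["GOVERNMENT", "GOVERNMENT", "POLICE", "GOVERNMENT"],
    "event_code": ["141", "141", "145", "141"],
})

# Keep one record per event type, actor pair, and day.
deduplicated = events.drop_duplicates(subset=["date", "actor1", "actor2", "event_code"])

# Hypothetical total number of event records collected on each day (all event types).
daily_totals = pd.Series(
    [1000, 2000],
    index=pd.to_datetime(["2020-03-01", "2020-03-02"]),
)

# Express daily counts as a share of all records collected that day,
# so that days with very different media coverage become comparable.
daily_counts = deduplicated.groupby("date").size()
daily_share = daily_counts / daily_totals
```

Dividing by the daily total relies on pandas aligning both series on their date index, so days missing from either series would show up as `NaN` rather than being silently dropped.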

Compared with other event data, [Kwak and An (2016)](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/paper/download/13014/12811) found that the GDELT project
seems to overstate the number of events but also has a superior method for resolving the geographic
locations of events, which makes it very attractive. 

To classify actors and events, the GDELT project uses CAMEO’s coding scheme ([Schrodt, 2012](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf)).
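
CAMEO event codes are hierarchical: the first two digits give the root category and the first three the base code, which is how the `EventRootCode` and `EventBaseCode` columns relate to `EventCode`. A minimal sketch of that relationship follows; the helper `cameo_levels` is hypothetical and not part of any GDELT tooling.

```python
def cameo_levels(event_code: str) -> dict:
    """Split a CAMEO event code into its root, base, and full code."""
    code = str(event_code)  # codes are kept as strings to preserve leading zeros
    return {
        "root": code[:2],  # e.g. "14" for protest events
        "base": code[:3],  # three-digit base code (or the code itself if shorter)
        "full": code,      # full code, up to four digits
    }

levels = cameo_levels("1411")
```

Keeping the codes as strings matters because root categories below 10 start with a zero, which an integer conversion would drop.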

## This notebook
This notebook accesses the database via Google BigQuery and extracts events that meet all of the following criteria:
- Protests (event root code 14)
- Located in NRW, Germany (action geo ADM1 code GM07)
- Geo-located at the city/landmark level (action geo type 4)

It uses the result to generate two dataframes (for 2014 and 2020) with the following variables:
- location
- number of protests in the months before the election (5 months for 2014, 6 months for 2020)
- average number of mentions per event
- average tone of how these events were spoken about
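
The per-location variables listed above can be produced in one pass with pandas named aggregation. The sketch below runs on a made-up three-row frame that reuses this notebook's column names; the output column names `total_mentions` and `mean_mentions` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical protest records in the shape used in this notebook.
toy = pd.DataFrame({
    "eventgeo_name": ["Bonn", "Bonn", "Koln"],
    "num_mentions": [4, 6, 12],
    "avg_tone": [-1.0, -3.0, 2.0],
})

summary = (
    toy.groupby("eventgeo_name")
       .agg(
           count=("num_mentions", "size"),          # number of protest events
           total_mentions=("num_mentions", "sum"),  # mentions summed over events
           mean_mentions=("num_mentions", "mean"),  # mentions per event
           avg_tone=("avg_tone", "mean"),           # average tone across events
       )
       .reset_index()
)
```

Computing the sum and the mean side by side makes it explicit which of the two a downstream analysis is using.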

In [2]:
# Path to the service-account key file used to authenticate with Google Cloud.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'service_account.json'

client = bigquery.Client()

# Set project_id to your Google Cloud Platform project ID.
project_id = "thesisau"

sql = """
SELECT  GLOBALEVENTID as id,
        SQLDATE as date, 
        Year as year, 
        Actor1Name as actor1_name,
        Actor1CountryCode as actor1_country, 
        Actor1KnownGroupCode as actor1_group,
        Actor1Religion1Code as actor1_religion1,
        Actor1Religion2Code as actor1_religion2,
        Actor1EthnicCode as actor1_ethnic,
        Actor1Type1Code as actor1_type1,
        Actor1Type2Code as actor1_type2, 
        Actor1Type3Code as actor1_type3,
        Actor2Name as actor2_name, 
        Actor2CountryCode as actor2_country, 
        Actor2KnownGroupCode as actor2_group,
        Actor2Religion1Code as actor2_religion1,
        Actor2Religion2Code as actor2_religion2,
        Actor2EthnicCode as actor2_ethnic,
        Actor2Type1Code as actor2_type1, 
        Actor2Type2Code as actor2_type2, 
        Actor2Type3Code as actor2_type3, 
        EventCode as event_code,
        EventBaseCode as event_basecode,
        EventRootCode as event_rootcode,
        QuadClass as quad_class, 
        GoldsteinScale as goldstein_scale, 
        NumMentions as num_mentions, 
        NumSources as num_sources, 
        NumArticles as num_articles, 
        AvgTone as avg_tone,
        Actor1Geo_Type as actor1geo_res, 
        Actor1Geo_Fullname as actor1geo_name, 
        Actor1Geo_CountryCode as actor1geo_country, 
        Actor1Geo_Lat as actor1geo_lat,
        Actor1Geo_Long as actor1geo_long,
        Actor1Geo_FeatureID as actor1geo_id,
        Actor2Geo_Type as actor2geo_res, 
        Actor2Geo_Fullname as actor2geo_name, 
        Actor2Geo_CountryCode as actor2geo_country, 
        Actor2Geo_Lat as actor2geo_lat,
        Actor2Geo_Long as actor2geo_long,
        Actor2Geo_FeatureID as actor2geo_id,
        ActionGeo_Type as eventgeo_res, 
        ActionGeo_Fullname as eventgeo_name, 
        ActionGeo_CountryCode as eventgeo_country, 
        ActionGeo_Lat as eventgeo_lat,
        ActionGeo_Long as eventgeo_long,
        ActionGeo_FeatureID as eventgeo_id,
        DATEADDED as date_added, 
        SOURCEURL as source
FROM `gdelt-bq.full.events` 
WHERE ActionGeo_Type = 4 and ActionGeo_ADM1Code = 'GM07' and EventRootCode = '14'
"""

In [3]:
gdelt = client.query(sql, project=project_id).to_dataframe()

In [5]:
# Election dates are written day-month-year; parse them with an explicit format.
end_2014 = pd.to_datetime('25-05-2014', format='%d-%m-%Y')
start_2014 = pd.to_datetime('25-12-2013', format='%d-%m-%Y')  # 5 months before the election

end_2020 = pd.to_datetime('13-09-2020', format='%d-%m-%Y')
start_2020 = end_2020 - pd.DateOffset(months=6)  # 6 months before the election

In [6]:
gdelt.date = pd.to_datetime(gdelt.date, format='%Y%m%d')

# variance in quad_class
gdelt.quad_class.value_counts()
print('There is',gdelt.quad_class.var(),'variance in primary classifications.')

There is 0.0 variance in primary classifications.


In [7]:
#select columns
columns = ['date', 'num_mentions', 'eventgeo_lat', 'eventgeo_long', 'eventgeo_name', 
           'avg_tone', 'id', 'event_code']
gdelt = gdelt[columns]

In [8]:
# 2014
mask=(gdelt['date']>=start_2014) & (gdelt['date']<=end_2014)
df_14 = gdelt.loc[mask]
df_14.reset_index(inplace=True, drop=True)
len(df_14)

33

In [9]:
# Build a new frame instead of modifying a slice in place (avoids SettingWithCopyWarning).
df_14 = df_14.drop(columns='id').assign(count=1)

# Sum per location, then divide by the event count to get per-location means.
mean_cols = ['avg_tone', 'eventgeo_lat', 'eventgeo_long', 'num_mentions']
protests_14 = df_14.groupby('eventgeo_name').sum(numeric_only=True).reset_index()
protests_14[mean_cols] = protests_14[mean_cols].div(protests_14['count'], axis=0)


In [10]:
# 2020
mask=(gdelt['date']>=start_2020) & (gdelt['date']<=end_2020)
df_20 = gdelt.loc[mask]
len(df_20)

56

In [11]:
# Build a new frame instead of modifying a slice in place (avoids SettingWithCopyWarning).
df_20 = df_20.drop(columns='id').assign(count=1)

# Sum per location, then divide by the event count to get per-location means.
mean_cols = ['avg_tone', 'eventgeo_lat', 'eventgeo_long', 'num_mentions']
protests_20 = df_20.groupby('eventgeo_name').sum(numeric_only=True).reset_index()
protests_20[mean_cols] = protests_20[mean_cols].div(protests_20['count'], axis=0)


In [12]:
protests_20

Unnamed: 0,eventgeo_name,num_mentions,eventgeo_lat,eventgeo_long,avg_tone,count
0,"Bismarck, Nordrhein-Westfalen, Germany",3.5,51.55,7.1,-2.949483,4
1,"Bochum, Nordrhein-Westfalen, Germany",14.666667,51.4833,7.21667,-4.301591,3
2,"Bonn, Nordrhein-Westfalen, Germany",7.066667,50.7333,7.1,-1.567228,15
3,"Bottrop, Nordrhein-Westfalen, Germany",5.0,51.5167,6.91667,-1.872659,2
4,"Datteln, Nordrhein-Westfalen, Germany",6.5,51.6667,7.38333,-3.279144,6
5,"Dormagen, Nordrhein-Westfalen, Germany",2.5,51.1,6.83333,-2.292769,2
6,"Eppendorf, Nordrhein-Westfalen, Germany",6.0,51.45,7.18333,-5.042335,4
7,"Gutersloh, Nordrhein-Westfalen, Germany",13.0,51.9,8.38333,-3.666553,2
8,"Haltern, Nordrhein-Westfalen, Germany",2.0,52.1167,7.26667,-3.680253,1
9,"Heinsberg, Nordrhein-Westfalen, Germany",9.333333,51.0333,8.15,-4.242424,3


In [13]:
protests_14

Unnamed: 0,eventgeo_name,num_mentions,eventgeo_lat,eventgeo_long,avg_tone,count
0,"Bochum, Nordrhein-Westfalen, Germany",3.0,51.4833,7.21667,3.432836,1
1,"Bonn, Nordrhein-Westfalen, Germany",4.529412,50.7333,7.1,2.330657,17
2,"Koln, Nordrhein-Westfalen, Germany",12.25,50.9333,6.95,-0.311772,4
3,"Konigswinter, Nordrhein-Westfalen, Germany",6.0,50.6833,7.18333,3.911205,1
4,"Rheda, Nordrhein-Westfalen, Germany",10.0,51.9833,8.2,3.351955,1
5,"Solingen, Nordrhein-Westfalen, Germany",41.25,51.1833,7.08333,3.387091,8
6,"Steinfurt, Nordrhein-Westfalen, Germany",6.0,52.15,7.35,1.5,1
