In [1]:
import pandas as pd
import datetime

## Data Fetching

In [2]:
# might be better to import the code and not a file (to show what we've done)

from fetch_gdelt_data import *
from data_cleaning import clean_df

Below, we specify the date interval of the data to load into a DataFrame. Any data that is not already available locally is downloaded first.

In [3]:
start = datetime.date(2015, 3, 1)
end = datetime.date(2015, 3, 20)

step = datetime.timedelta(days=1)

# Lists every date from start (inclusive) to end (exclusive)
all_dates = [(start + datetime.timedelta(days=delta)) for delta in range((end - start).days)]

To load the data (downloading it first if necessary), a single function call is enough. We can choose between the translingual version and the English-only one.

In [4]:
test_df = fetch_df(start, translingual=True, v='2')

## Data Cleaning and Selection

We only keep the information about the event type and location, the source URL and number of mentions, and the Goldstein scale and average tone of the event. We drop every event with missing entries and add a column containing the ISO 3166-1 alpha-3 code of the country where the event happened.

In [5]:
selected_df = clean_df(test_df)

The function below will be useful later on, when we need to select events based on a condition on one of their features.

In [6]:
def select_events(df, feature, selector):
    '''Example of use : select_events(selected_df, 'EventCode', lambda x: x[:2] == '08')'''
    return df[df[feature].apply(selector)]
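
As a quick check of how `select_events` behaves, the sketch below applies it to a small hand-made DataFrame. The rows are invented for illustration and are not GDELT records; the function is repeated so that the block runs on its own.

```python
import pandas as pd


def select_events(df, feature, selector):
    """Keep the rows of df whose value in `feature` satisfies `selector`."""
    return df[df[feature].apply(selector)]


# Toy data, invented for illustration (not real GDELT records).
toy_df = pd.DataFrame({
    'EventCode': ['080', '0841', '190', '043'],
    'NumMentions': [3, 10, 7, 1],
})

# Events whose code starts with '08'.
code_08_events = select_events(toy_df, 'EventCode', lambda code: code[:2] == '08')

# Events mentioned at least five times.
frequent_events = select_events(toy_df, 'NumMentions', lambda n: n >= 5)
```

Since the selector is an arbitrary function of a single value, the same helper works for string prefixes, numeric thresholds, or membership tests.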

## Data Visualization

To show how we can visualize the data, we use folium below; we plan to use plotly as well later on.

In [7]:
import json
import branca
import folium
from folium.plugins import HeatMap
from fetch_location import get_country_from_point, get_mapping

We load the GeoJSON file that will be used to aggregate the data by country and display it in a choropleth map.

In [8]:
world_geo_path = '../data/geo/countries.geo.json'
# Use a context manager so the file is closed once it has been read
with open(world_geo_path, encoding="UTF-8") as geo_file:
    world_json_data = json.load(geo_file)

The data contains a code for each country, which makes it easy to aggregate events by country. However, the corresponding country name is not always consistent, as it sometimes includes details at the city or state level.

For this reason, it is not easy to know which country code corresponds to which polygon of the GeoJSON. The easiest solution we found was to use the longitude and latitude of each event to test which polygon it falls in, and from that to build a mapping country_code -> polygon_name.
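
The idea can be sketched as follows. This is a minimal illustration, not the code in `fetch_location`: the helper names (`point_in_ring`, `find_polygon_name`, `build_mapping`), the toy features, and the choice of an even-odd ray-casting test are all assumptions. It also assumes GeoJSON features with `Polygon` or `MultiPolygon` geometries and a `name` property, and it ignores holes inside polygons.

```python
def point_in_ring(lon, lat, ring):
    """Even-odd ray-casting test: is (lon, lat) inside the closed ring?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_polygon_name(lon, lat, features):
    """Return the name of the first feature whose outer ring contains the point."""
    for feature in features:
        geometry = feature['geometry']
        if geometry['type'] == 'Polygon':
            polygons = [geometry['coordinates']]
        elif geometry['type'] == 'MultiPolygon':
            polygons = geometry['coordinates']
        else:
            continue
        # polygon[0] is the outer ring; holes are ignored in this sketch.
        if any(point_in_ring(lon, lat, polygon[0]) for polygon in polygons):
            return feature['properties']['name']
    return None


def build_mapping(events, features):
    """Map each country code to the polygon name of its first located event."""
    mapping = {}
    for country_code, lon, lat in events:
        if country_code not in mapping:
            name = find_polygon_name(lon, lat, features)
            if name is not None:
                mapping[country_code] = name
    return mapping


# Toy features (simple squares, not real borders) and invented events.
toy_features = [
    {'properties': {'name': 'Squareland'},
     'geometry': {'type': 'Polygon',
                  'coordinates': [[[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]]}},
    {'properties': {'name': 'Twin Isles'},
     'geometry': {'type': 'MultiPolygon',
                  'coordinates': [[[[20, 0], [25, 0], [25, 5], [20, 5], [20, 0]]],
                                  [[[30, 0], [35, 0], [35, 5], [30, 5], [30, 0]]]]}},
]
toy_events = [('SQ', 5.0, 5.0), ('TI', 32.0, 2.0), ('XX', 50.0, 50.0)]
toy_mapping = build_mapping(toy_events, toy_features)
```

Events that fall outside every polygon (such as the `'XX'` row here) are simply left out of the mapping, so their country code has to be handled separately or dropped.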

Below, we weight each metric by its "importance", measured here as the number of mentions of the event. This could be done differently.

In [10]:
selected_df.loc[:,'pondered_GoldsteinScale'] = selected_df.loc[:,'GoldsteinScale'] * selected_df.loc[:,'NumMentions']
selected_df.loc[:,'pondered_AvgTone'] = selected_df.loc[:,'AvgTone'] * selected_df.loc[:,'NumMentions']
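
One way this could be done differently: the mean of `metric * NumMentions` scales with how often a country's events are mentioned, so a mention-weighted average (the sum of weighted values divided by the sum of mentions) may be easier to compare across countries. The sketch below shows that alternative on invented numbers. The column names match the ones used above, but the data and the `weighted_mean` helper are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration (not real GDELT records).
toy_df = pd.DataFrame({
    'Country_Code': ['AAA', 'AAA', 'BBB'],
    'AvgTone': [2.0, -4.0, 1.0],
    'NumMentions': [1, 3, 10],
})


def weighted_mean(df, metric, weight='NumMentions', by='Country_Code'):
    """Per-group average of `metric`, weighted by `weight` (hypothetical helper)."""
    weighted = df[metric] * df[weight]
    totals = weighted.groupby(df[by]).sum()
    weights = df[weight].groupby(df[by]).sum()
    return totals / weights


weighted_tone = weighted_mean(toy_df, 'AvgTone')
```

With this variant, the result stays on the same scale as the original metric, whatever the number of mentions.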

Once we have the mapping, we can give a score to each country based on a chosen metric (average tone of the news toward the event, Goldstein scale, etc.), and map the index of each country (country_code) to the name of the polygon that represents it.

In [11]:
chosen_metric = 'pondered_AvgTone'

In [12]:
scores = selected_df.groupby('Country_Code')[chosen_metric].agg('mean')

rate_min = scores.min()
rate_max = scores.max()

# color scale from min rate to max rate
color_scale = branca.colormap.linear.RdYlGn.scale(rate_min, rate_max)
color_scale = color_scale.to_step(n=8)

def style_function(country):
    if country['id'] in scores.index:
        # country is in the dataframe: color it according to its score
        score = scores.loc[country['id']]
        return {
            'fillOpacity': 0.8,
            'weight': 0.5,
            'color': 'black',
            'fillColor': color_scale(score)
        }
    else:
        # country is not in the dataframe: fill it with a faint black
        return {
            'fillOpacity': 0.2,
            'weight': 0.2,
            'color': 'black',
            'fillColor': 'black'
        }


def highlight_function(feature):
    # style applied to a country while the mouse hovers over it
    return {
        'weight': 2,
        'fillOpacity': .2
    }

world_map = folium.Map([0, 0], tiles='', zoom_start=2.5)
g = folium.GeoJson(world_json_data, style_function=style_function, highlight_function=highlight_function).add_to(world_map)

color_scale.caption = ' '.join(chosen_metric.split('_')) + ' scale'
color_scale.add_to(world_map)

del style_function
del highlight_function
world_map