In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName('aws') \
        .getOrCreate()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1647248874453_0002,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.



In [2]:
import pandas as pd
import numpy as np

import pyspark.sql.functions as f
from pyspark.ml.fpm import FPGrowth

from urllib.request import urlopen
from datetime import datetime
from functools import reduce


In [22]:
def fim_years(itemsCol, years, minSupport=0.001, minConfidence=0.3):
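    """Fit FP-Growth on the itemsets of the given years and print
    the first 20 association rules.

    Reads from the global DataFrame ``trans_db``.

    Parameters
    -----
    itemsCol: str
        Name of the items column. The selection below aliases the
        itemsets as 'itemset', so this should be 'itemset'.

    years: list
        Years to include, matched against ``trans_db.Year``.

    minSupport: float
        Required minimum support for the FIM.

    minConfidence: float
        Required minimum confidence for the association rules.

    Output
    -----
    None
        The rules are printed with ``show`` (20 rows by default),
        sorted by descending support and then confidence.
    """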
    fpg = FPGrowth(itemsCol=itemsCol,
                   minSupport=minSupport,
                   minConfidence=minConfidence)
    fpg_trained = fpg.fit(trans_db.filter(trans_db.Year.isin(years))
                          .select(trans_db.SQLDATE,
                                  f.array_distinct(f.split(
                                      trans_db.itemset[0], "-"))
                                  .alias('itemset')))

    return (fpg_trained.associationRules
                       .orderBy(['support', 'confidence'],
                                ascending=[False, False])
                       .select('antecedent', 'consequent',
                               f.round('support', 3).alias('sup'),
                               f.round('confidence', 3).alias('conf'))
            ).show(truncate=False)


![banner](banner.PNG)

<div style='text-align: center; background-color: #FFD500; color: #005BBB; font-size: 60px; padding: 16px'><b><tt>CYBERWARFARE</tt></b></div>

<div style='text-align: center; background-color: #FFD500; color: #005BBB; font-size: 16px; font-style: bold; padding: 6px'><b>USING BIG DATA MINING METHODS ON ONLINE NEWS SOURCES<br>TO UNDERSTAND CONCURRENT EVENTS WITH THE RUSSO-UKRAINIAN WAR IN YEARS 2013-2022</b></div>

# Executive Summary

<div style='font-size: 18px; font-weight: bold;'>Background</div>
<br><div style='text-align: justify;'>The Russo-Ukrainian War has been ongoing since 2014. Global news outlets have reported extensively on its re-escalation in early 2022, when Russia invaded Ukraine. However, the intervening years were given less attention by source documents even though the war did not stop between 2015 and 2021, leading journalists to call it the "forgotten war." In this report, we use a novel application of itemset mining to surface frequent event patterns among geopolitical events concurrent with the Russo-Ukrainian War. By using big data available from news sources, we aim to identify trends in global sentiment and extract frequently occurring geopolitical events between Russia, Ukraine, and other major players, to further understand the Russo-Ukrainian tensions in the intervening years.</div>
<br>

<div style='font-size: 18px; font-weight: bold;'>Methods</div>
<br><div style='text-align: justify;'>We used the Global Database of Events, Language and Tone (GDELT) dataset available at the AWS Registry of Open Data and Google BigQuery. Limiting the scope to the years 2013 to 2022, we accessed the dataset by initializing the necessary instances and machines on Google Cloud Platform and AWS and connecting to the appropriate bucket. We then converted the retrieved RDDs into Spark DataFrames for convenient manipulation and preprocessing. The resulting data was saved in Parquet format, from which relevant tables and results were filtered and processed using Spark SQL and Spark ML's <tt>FPGrowth</tt>. A minimum support of 0.001 and a minimum confidence of 0.3 were used to generate the frequent itemsets and their corresponding association rules. Python 3 was used to generate selected visualizations.</div>
<br>

<div style='font-size: 18px; font-weight: bold;'>Results</div>
<br><div style='text-align: justify;'>Salient results from our analysis include the following:</div>
<br><div style='text-align: justify; display: inline-block;'>
    <ol>
        <li>Frequently mentioned actors are mostly Presidents from major superpowers, such as the UK, the US, China, Russia, and Ukraine. Neither then-US President Donald Trump nor Ukrainian President Volodymyr Zelenskyy is included in the list of frequently mentioned actors.</li>
        <br>
        <li>The average tone of documents about Western world leaders is far more negative than that of their Eastern counterparts.</li>
        <br>
        <li>Highly mentioned events in the Russo-Ukrainian War are related to escalation events. However, the years 2015 to 2021 received very limited attention from source documents even though the war continued in these years.</li>
        <br>
        <li>Ukraine appears to be more prone to destabilization than Russia. This could be due to the latter's geopolitical and economic power. Towards early March 2022, however, it appeared that recent sanctions had negatively impacted Russia much more than Ukraine.</li>
        <br>
        <li>In mining the frequently occurring events and association rules, we saw no indication of Russia's interest in annexing Crimea before the war. When the war started in 2014, only the US actively sanctioned and imposed embargoes on Russia. There was a period of relative silence from 2015 to 2021, and global attention increased with the re-escalation in 2022, given the invasion by and economic sanctions against Russia.</li>
    </ol>
</div>
<br>

<div style='font-size: 18px; font-weight: bold;'>Conclusion and Recommendations</div>
<br><div style='text-align: justify;'>We described the global sentiment and mined for frequently occurring events concurrent with the Russo-Ukrainian War and their corresponding association rules. We suggest conducting further research to validate our findings against the historical context pre-2013. Future research may extend our report with sequence-aware pattern mining to more fully understand the dynamics between actors, including repeating and/or sequential patterns of certain actors.</div>
<br>

# 1. Introduction

<br><div style='text-align: justify;'>The conflict between Ukraine and Russia is only the latest phase of the Russo-Ukrainian War. While the war received global attention when it started in 2014, it received far less attention in succeeding years. Geopolitical commentators called it Europe's "forgotten war" because of global leaders' friendlier ties with Putin and reduced global attention, despite continued escalation and illegal annexation in the areas of Crimea and Donbas. Since then, the conflict has ranged from the physical (Russia has been present in these areas since 2014) to the digital (cyberwarfare, which started with organized hacking around the initial days of the war in 2014, has further intensified in recent months). In this report, we provide a novel way of analyzing events concurrent with the Russo-Ukrainian War using big data from a wide range of news sources.</div>
<br><div style='text-align: justify; text-indent: 2em'>By using big data available from a wide range of news sources, we aim to identify trends in global sentiment, from the "tone" of documents on a particular topic to their potential impact on a nation's stability. We want to extract frequently occurring geopolitical events between Russia, Ukraine, and other major players, and understand the Russo-Ukrainian tensions.</div>

## 1.1. Background

### 1.1.1. Revolution of Dignity

<br><div style='text-align: justify;'>Violent protests in Ukraine culminated in February 2014 with the ouster of then-Ukrainian President Viktor Yanukovych and his government ([Kharkiv, Kiev & Lviv, 2014](#kharkiv)). The unrest was closely related to his decision not to pursue closer ties with the European Union and, instead, to move closer to Russia ([Kyiv Post, 2013](#kyiv)). While an interim government replaced the former government, Russia gradually increased its military presence in Crimea, a peninsula in the south of Ukraine.</div>

### 1.1.2. Annexation of Crimea and War in Donbas

<br><div style='text-align: justify;'>Beginning on the 20th of February 2014, Russia moved to annex Crimea and took control of the area, capturing political structures and raising Russian flags, in addition to its attacks on the government of Ukraine and on social media websites ([Cathcart, 2014](#cathcart)). The area has been under Russian control ever since. Outside institutions, including the United Nations General Assembly, denounced Russia's actions ([United Nations, 2014](#united)).</div>

### 1.1.3. On frequent event pattern mining

<br><div style='text-align: justify;'>Frequent pattern mining has been used to obtain knowledge and useful patterns when manual review is infeasible. For instance, event log monitoring tools have been used to handle security incidents and automate event log analysis for network and systems management (Vaarandi, 2008). It has similarly been used in sports, such as rugby, to determine patterns of play and observe which actions (e.g., line breaks, successful line-outs, regained kicks in play, repeated phase-breakdown play, and failed exit plays) lead to team scoring or not scoring. To our knowledge, no application has yet been done in geopolitical events at a scale as large as the GDELT database.</div>
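The two quantities that drive frequent pattern mining, support and confidence, can be illustrated with a minimal plain-Python sketch. The transactions and event labels below are hypothetical toy data, not drawn from GDELT, and the helper names are ours rather than part of any library.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


# Hypothetical toy transactions: each set holds events seen together.
transactions = [
    {"protest", "sanction"},
    {"protest", "sanction", "embargo"},
    {"protest"},
    {"sanction", "embargo"},
]

sup = support(transactions, {"protest", "sanction"})        # 2 of 4 transactions
conf = confidence(transactions, {"protest"}, {"sanction"})  # 2 of the 3 with "protest"
print(f"support={sup:.3f}, confidence={conf:.3f}")
```

An association rule is kept only when its itemset meets the minimum support and the rule meets the minimum confidence, which is how thresholds such as 0.001 and 0.3 prune the candidate rules.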

## 1.2. Motivation and Impact

<br><div style='text-align: justify;'>Given recent events surrounding both Ukraine and Russia, this report assesses how the situation developed between 2013 and 2022 as reported through different platforms. In so doing, it may aid in understanding how actors are perceived and in surfacing frequent geopolitical events that may not be immediately apparent in the overwhelming day-to-day reports of news sources.</div>

## 1.3. Problem Statement

<br><div style='text-align: justify;'>In this report, we aim to answer the question, "What is the overall global sentiment toward, and what are the frequently occurring geopolitical events concurrent with, the Russo-Ukrainian War?" Specifically, we focus on the following:</div>
<div style='text-align: justify; display: inline-block;'>
    <ol>
        <li>Who are the most significant actors in source documents about the Russo-Ukrainian War?</li>
        <li>What is the average sentiment of documents towards these significant actors?</li>
        <li>What are the significant events in the Russo-Ukrainian War?</li>
        <li>What events most influenced country instability in Russia and Ukraine?</li>
        <li>What are the association rules between events that happened concurrently with the Russo-Ukrainian War?</li>
        </ol>
</div>

## 1.4. Scope and Limitations

<br><div style='text-align: justify;'>This study is subject to the following scope and limitations:</div>
<div style='text-align: justify; display: inline-block;'>
    <ol>
        <li><b><u>Contextualization of the results may be subject to bias.</u></b> The authors claim no expertise in geopolitics. We nevertheless performed extensive research as required to situate the data and results in the proper context.</li>
        <br>
        <li><b><u>The scope is limited to 2013 through 2022 due to data limitations.</u></b> The authors made no material changes to the data provided through the platforms. While experts could have helped validate the information collected, we did not include such validation given time limitations. In addition, the dataset only extends to the first week of March 2022; many developments may have occurred after the time of submission.</li>
        </ol>
</div>

# 2. Data Source

## 2.1. Dataset Description

<br><div style='text-align: justify;'>The Global Database of Events, Language and Tone (GDELT) dataset was retrieved from both the AWS Registry of Open Data ([GDELT Project, 2020](#gdelt)) and Google Cloud Platform's BigQuery. The GDELT Project periodically crawls and monitors broadcast, print, and web news from "nearly every country." Its maintainers quantitatively codify events, conflicts, and reactions every 15 minutes. It has been used in understanding historical context given its coverage back to the year 1800, as well as in observing emerging media given its scope. The data is made publicly available for unlimited and unrestricted use by the GDELT Project. The latest version of the dataset is available in the Google Cloud Platform, with 250 gigabytes of data containing more than 600 million event records. We provide details on the schema in the following table.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 1: Data Fields of the GDELT Dataset</b></center>

| #  | Field                 | Type | Description |
| :- | :-                    | :-   | :- |
|  1 | <tt>GLOBALEVENTID</tt> | STR  | Globally unique identifier assigned to each event record that uniquely identifies it in the master dataset. |
|  2 | <tt>SQLDATE</tt> | DATE | Date the event took place. |
|  3 | <tt>MonthYear</tt> | DATE | Alternative formatting of the event date. |
|  4 | <tt>Year</tt> | DATE | Alternative formatting of the event date. |
|  5 | <tt>FractionDate</tt> | DATE | Alternative formatting of the event date. |
|  6 | <tt>Actor1Code</tt> | STR  | The complete raw CAMEO code for <tt>Actor1</tt>. |
|  7 | <tt>Actor1Name</tt> | STR  | The actual name of <tt>Actor1</tt>. |
|  8 | <tt>Actor1CountryCode</tt> | STR  | The 3-character CAMEO code for the country affiliation of <tt>Actor1</tt>. |
|  9 | <tt>Actor1KnownGroupCode</tt> | STR  | If <tt>Actor1</tt> is a known IGO/NGO/rebel organization (United Nations, World Bank, al-Qaeda, etc.) with its own CAMEO code, this field will contain that code. |
| 10 | <tt>Actor1EthnicCode</tt> | STR  | If the source document specifies the ethnic affiliation of <tt>Actor1</tt> and that ethnic group has a CAMEO entry, the CAMEO code is entered here. |
| 11 | <tt>Actor1Religion1Code</tt> | STR  | If the source document specifies the religious affiliation of <tt>Actor1</tt> and that religious group has a CAMEO entry, the CAMEO code is entered here. |
| 12 | <tt>Actor1Religion2Code</tt> | STR  | If multiple religious codes are specified for <tt>Actor1</tt>, this contains the second code. |
| 13 | <tt>Actor1Type1Code</tt> | STR  | The 3-character CAMEO code of the CAMEO "type" or "role" of <tt>Actor1</tt>, if specified. |
| 14 | <tt>Actor1Type2Code</tt> | STR  | If multiple type/role codes are specified for <tt>Actor1</tt>, this returns the second code. |
| 15 | <tt>Actor1Type3Code</tt> | STR  | If multiple type/role codes are specified for <tt>Actor1</tt>, this returns the third code. |
| 16 | <tt>Actor2Code</tt> | STR  | The complete raw CAMEO code for <tt>Actor2</tt>. |
| 17 | <tt>Actor2Name</tt> | STR  | The actual name of <tt>Actor2</tt>. |
| 18 | <tt>Actor2CountryCode</tt> | STR  | The 3-character CAMEO code for the country affiliation of <tt>Actor2</tt>. |
| 19 | <tt>Actor2KnownGroupCode</tt> | STR  | If <tt>Actor2</tt> is a known IGO/NGO/rebel organization (United Nations, World Bank, al-Qaeda, etc.) with its own CAMEO code, this field will contain that code. |
| 20 | <tt>Actor2EthnicCode</tt> | STR  | If the source document specifies the ethnic affiliation of <tt>Actor2</tt> and that ethnic group has a CAMEO entry, the CAMEO code is entered here. |
| 21 | <tt>Actor2Religion1Code</tt> | STR  | If the source document specifies the religious affiliation of <tt>Actor2</tt> and that religious group has a CAMEO entry, the CAMEO code is entered here. |
| 22 | <tt>Actor2Religion2Code</tt> | STR  | If multiple religious codes are specified for <tt>Actor2</tt>, this contains the second code. |
| 23 | <tt>Actor2Type1Code</tt> | STR  | The 3-character CAMEO code of the CAMEO "type" or "role" of <tt>Actor2</tt>, if specified. |
| 24 | <tt>Actor2Type2Code</tt> | STR  | If multiple type/role codes are specified for <tt>Actor2</tt>, this returns the second code. |
| 25 | <tt>Actor2Type3Code</tt> | STR  | If multiple type/role codes are specified for <tt>Actor2</tt>, this returns the third code. |
| 26 | <tt>IsRootEvent</tt> | STR  | The system codes every event found in an entire document, using an array of techniques to reference and link information together; this flag can serve as a rough proxy for the importance of an event. |
| 27 | <tt>EventCode</tt> | STR  | The raw CAMEO action code describing the action that <tt>Actor1</tt> performed upon <tt>Actor2</tt>. |
| 28 | <tt>EventBaseCode</tt> | STR  | CAMEO event codes are defined in a three-level taxonomy. For events at level three in the taxonomy, this yields the level-two code they fall under. |
| 29 | <tt>EventRootCode</tt> | STR  | Similar to <tt>EventBaseCode</tt>, this defines the root-level category the event code falls under. |
| 30 | <tt>QuadClass</tt> | STR  | This field specifies the primary classification for the event type, allowing analysis at the highest level of aggregation. |
| 31 | <tt>GoldsteinScale</tt> | INT  | A numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. |
| 32 | <tt>NumMentions</tt> | INT  | The total number of mentions of this event across all source documents. |
| 33 | <tt>NumSources</tt> | INT  | The total number of information sources containing one or more mentions of this event. |
| 34 | <tt>NumArticles</tt> | INT  | The total number of source documents containing one or more mentions of this event. |
| 35 | <tt>AvgTone</tt> | INT  | The average "tone" of all documents containing one or more mentions of this event. |
| 36 | <tt>Actor1Geo_Type</tt> | STR  | The geographic resolution of the match type. |
| 37 | <tt>Actor1Geo_FullName</tt> | STR  | The full human-readable name of the matched location. |
| 38 | <tt>Actor1Geo_CountryCode</tt> | STR  | The 2-character FIPS10-4 country code for the location. |
| 39 | <tt>Actor1Geo_ADM1Code</tt> | STR  | The 2-character FIPS10-4 country code followed by the 2-character FIPS10-4 administrative division 1 (ADM1) code for the administrative division housing the landmark. |
| 40 | <tt>Actor1Geo_Lat</tt> | INT  | The centroid latitude of the landmark for mapping. |
| 41 | <tt>Actor1Geo_Long</tt> | INT  | The centroid longitude of the landmark for mapping. |
| 42 | <tt>Actor1Geo_FeatureID</tt> | STR  | The GNS or GNIS FeatureID for this location. |
| 43 | <tt>Actor2Geo_Type</tt> | STR  | The geographic resolution of the match type. |
| 44 | <tt>Actor2Geo_FullName</tt> | STR  | The full human-readable name of the matched location. |
| 45 | <tt>Actor2Geo_CountryCode</tt> | STR  | The 2-character FIPS10-4 country code for the location. |
| 46 | <tt>Actor2Geo_ADM1Code</tt> | STR  | The 2-character FIPS10-4 country code followed by the 2-character FIPS10-4 administrative division 1 (ADM1) code for the administrative division housing the landmark. |
| 47 | <tt>Actor2Geo_Lat</tt> | INT  | The centroid latitude of the landmark for mapping. |
| 48 | <tt>Actor2Geo_Long</tt> | INT  | The centroid longitude of the landmark for mapping. |
| 49 | <tt>Actor2Geo_FeatureID</tt> | STR  | The GNS or GNIS FeatureID for this location. |
| 50 | <tt>ActionGeo_Type</tt> | STR  | The geographic resolution of the match type. |
| 51 | <tt>ActionGeo_FullName</tt> | STR  | The full human-readable name of the matched location. |
| 52 | <tt>ActionGeo_CountryCode</tt> | STR  | The 2-character FIPS10-4 country code for the location. |
| 53 | <tt>ActionGeo_ADM1Code</tt> | STR  | The 2-character FIPS10-4 country code followed by the 2-character FIPS10-4 administrative division 1 (ADM1) code for the administrative division housing the landmark. |
| 54 | <tt>ActionGeo_Lat</tt> | INT  | The centroid latitude of the landmark for mapping. |
| 55 | <tt>ActionGeo_Long</tt> | INT  | The centroid longitude of the landmark for mapping. |
| 56 | <tt>ActionGeo_FeatureID</tt> | STR  | The GNS or GNIS FeatureID for this location. |
| 57 | <tt>DATEADDED</tt> | DATE | The date the event was added to the master database. |
| 58 | <tt>SOURCEURL</tt> | STR  | The URL of the news article in which the event was found. |
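
The relationship between <tt>EventCode</tt>, <tt>EventBaseCode</tt>, and <tt>EventRootCode</tt> in Table 1 can be sketched as follows. This is a minimal illustration that assumes the prefix structure of CAMEO codes (two digits for the root level, three for the base level); the function name is ours and is not part of GDELT or any library.

```python
def cameo_levels(event_code):
    """Split a CAMEO event code into its (root, base, full) levels.

    Assumes the first two digits give the root category and the first
    three digits give the base code. Codes are handled as strings so
    that leading zeros are preserved.
    """
    code = str(event_code).strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"unexpected CAMEO event code: {event_code!r}")
    root = code[:2]
    base = code[:3]  # a two-digit code is its own base
    return root, base, code


print(cameo_levels("0251"))
```

Storing these fields as strings, as in the schema above, matters because casting them to integers would drop the leading zero and break the prefix relationship.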

## 2.2. Summary Statistics

In [3]:
df_all = spark.read.parquet('s3://bdcc-final-project-bucket/df_combined.parquet')


In [4]:
df_all.createOrReplaceTempView('df_all')


<div style='text-align: justify;'>Using the data from 2013 to 2022, we retrieved a total of 695,440,526 rows. Storing the dataset in Parquet made retrieval easier, with most of the code cells below running in 2 minutes or less.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 2: Number of Rows per Year</b></center>

In [5]:
(spark.sql("""SELECT Year, COUNT(*) as count
              FROM df_all
              GROUP BY Year
              ORDER BY Year ASC
              """).show())


+----+---------+
|Year|    count|
+----+---------+
|2013| 27407188|
|2014| 48381955|
|2015| 88097260|
|2016|114490697|
|2017|105043785|
|2018| 94274806|
|2019| 82320409|
|2020| 65355708|
|2021| 58835556|
|2022| 11233162|
+----+---------+

In [6]:
print('Number of Rows: ', df_all.count())


Number of Rows:  695440526

<div style='text-align: justify; text-indent: 2em'>The United States of America (<tt>USA</tt>) ranked first, with the most events found in the database for the years covered. It is followed by the United Kingdom (<tt>GBR</tt>), Russia (<tt>RUS</tt>), and China (<tt>CHN</tt>). Ukraine (<tt>UKR</tt>) ranked only 12th in the list.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 3: Common Country Codes</b></center>

In [7]:
(spark.sql("""SELECT actor1countrycode, COUNT(*) as count
              FROM df_all
              WHERE actor1countrycode != ''
              GROUP BY actor1countrycode
              ORDER BY count DESC
              """).show(30, truncate=False))


+-----------------+--------+
|actor1countrycode|count   |
+-----------------+--------+
|USA              |94498172|
|GBR              |15822565|
|RUS              |15343983|
|CHN              |10943325|
|FRA              |8221175 |
|ISR              |7874826 |
|TUR              |7604698 |
|CAN              |7278318 |
|DEU              |7111797 |
|EUR              |6766498 |
|AUS              |6740041 |
|UKR              |6425771 |
|SYR              |6181725 |
|IRN              |6120403 |
|IND              |5394058 |
|PAK              |5260724 |
|NGA              |5156149 |
|ITA              |4607899 |
|EGY              |4477197 |
|ESP              |4124748 |
|AFR              |4112159 |
|SAU              |4077006 |
|JPN              |4072871 |
|MEX              |3976957 |
|AFG              |3665467 |
|IRQ              |3482916 |
|PSE              |3220719 |
|GRC              |2940069 |
|PHL              |2932544 |
|IDN              |2787000 |
+-----------------+--------+
only showing top 30 rows

<div style='text-align: justify; text-indent: 2em'>Each year, almost all countries are covered by at least one news outlet in the database.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 4: Number of Distinct Countries per Year</b></center>

In [8]:
(df_all.select(df_all['Year'].cast('int'),
               df_all['Actor1CountryCode'].cast('string'))
       .groupby('Year')
       .agg(f.countDistinct('Actor1CountryCode').alias('count(countries)'))
       .orderBy('Year', ascending=True)
       .show())


+----+----------------+
|Year|count(countries)|
+----+----------------+
|2013|             222|
|2014|             224|
|2015|             223|
|2016|             223|
|2017|             222|
|2018|             223|
|2019|             224|
|2020|             222|
|2021|             221|
|2022|             222|
+----+----------------+

# 3. Methodology

<center style="font-size:14px;font-style:default;"><b>FIGURE 1: Methodology Workflow</b></center>

![Figure 1](figures/fig1.PNG)

## 3.1. AWS Setup

<br><div style='text-align: justify;'>We experimented with different iterations of our AWS setup to optimize speed given the scale of the data we have used. Our final EMR setup includes the following:</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 5: AWS EMR Configuration</b></center>

| Configuration | Details |
| :- | :- |
| EMR | emr-6.5.0 |
| Hadoop distribution | Amazon 3.2.1 |
| Applications | Hive 3.1.2<br>Pig 0.17.0<br>Hue 4.9.0<br>JupyterHub 1.4.1<br>JupyterEnterpriseGateway 2.1.0<br>Spark 3.1.2<br>Livy 0.7.1<br>Zeppelin 0.10.0<br>TensorFlow 2.4.1 |
| Instance type: master | 1x m5.xlarge<br>4 vCore, 16 GiB memory, EBS only storage<br>EBS Storage: 100 GiB<br>Paid on-demand |
| Instance type: core | 1x m5.xlarge<br>4 vCore, 16 GiB memory, EBS only storage<br>EBS Storage: 500 GiB<br>Paid on-demand |
| Instance type: task | 3x m5.xlarge<br>4 vCore, 16 GiB memory, EBS only storage<br>EBS Storage: 500 GiB<br>Paid at spot price, using on-demand as the maximum |
| EBS Root Storage | 20 GB |
| Auto-termination | After 1 hour |
| Bootstrap action | Installation of libraries through a <tt>.sh</tt> file |

<b>Code Used:</b>
```terminal
#!/bin/bash

sudo pip3 install -U    \
  pip                   \
  numpy==1.21.5         \
  matplotlib            \
  pandas                \
  pyarrow               \
  fastparquet           \
  s3fs                  \
  fsspec                \
  cython                \
  seaborn               \
  plotly                \
  scipy                 \
  joblib                \
  scikit-learn          \
  geopandas             \
  datetime              \
  urllib3               \
  widgetsnbextension    \
  ipywidgets            \
  pandoc                \
  pylatex               \
  nbconvert             \
  ipyparallel           \
  pycodestyle           \
  autopep8              \
  jupyter-contrib-nbextensions
```

## 3.2. 2013-2014 data: Access data in S3

<br><div style='text-align: justify;'>Because our analysis requires data from 2013 onwards, we used GDELT v1 for the years 2013 and 2014, which are incomplete in GDELT v2. We retrieved the data from <tt>s3://gdelt-open-data</tt>.</div>
<div style='font-size: 18px; font-weight: bold;'>Convert RDD into dataframe</div>
<div style='text-align: justify;'>The retrieved data was converted into Spark DataFrames for convenient manipulation of the dataset as required by the exploratory data analysis.</div>
<br><div style='font-size: 18px; font-weight: bold;'>Join 2013 & 2014 dataframe</div>
<div style='text-align: justify;'>A function was written to concatenate all DataFrames using PySpark's <tt>unionByName</tt>. Unlike <tt>union</tt>, <tt>unionByName</tt> resolves columns by name and not by position.</div>
<br><div style='font-size: 18px; font-weight: bold;'>Remove rows for years 2012 and earlier</div>
<div style='text-align: justify;'>A quick inspection of the data revealed that rows from previous years were included in the S3 download. We removed all rows dated earlier than 2013.</div>
<br><div style='font-size: 18px; font-weight: bold;'>Save as Parquet</div>
<div style='text-align: justify;'>Finally, we saved the data in Parquet to take advantage of the efficiency of this data format. We partitioned it by <tt>Year</tt> and <tt>MonthYear</tt> since the data is usually retrieved by date, as shown later in our EDA.</div>
<br><br><div style='text-align: justify; text-indent: 2em'>The authors used separate notebooks to retrieve and explore the dataset using PySpark SQL and Spark DataFrames. The notebook <tt>01_dataPreparation.ipynb</tt> contains the code for the steps just discussed.</div>
<br>
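
The concatenation and year-trimming steps above can be sketched as follows. This is a minimal illustration, not the code in <tt>01_dataPreparation.ipynb</tt>: the function name, the default cutoff argument, and the variable names in the usage comment are hypothetical, and the sketch assumes every input DataFrame has the same column names, including <tt>Year</tt>.

```python
from functools import reduce


def combine_and_trim(dfs, first_year=2013):
    """Concatenate DataFrames by column name, then drop earlier years.

    `dfs` is any iterable of Spark DataFrames sharing the same column
    names. `unionByName` aligns columns by name, so differing column
    order across inputs does not shift values into the wrong field.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("expected at least one DataFrame")
    combined = reduce(lambda left, right: left.unionByName(right), dfs)
    # DataFrame.filter accepts a SQL expression string.
    return combined.filter(f"Year >= {first_year}")


# Usage with hypothetical DataFrames and output path:
# trimmed = combine_and_trim([df_2013, df_2014])
# trimmed.write.partitionBy("Year", "MonthYear").parquet(output_path)
```

Partitioning the output by <tt>Year</tt> and <tt>MonthYear</tt> lets later date-based queries read only the matching partitions rather than the full dataset.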

## 3.3. 2015-2022 data: Access data in Google Cloud Platform and migrate to AWS

<br><div style='text-align: justify;'>To access the most recent data available for the years 2015 to 2022, we signed up for a Google Cloud Platform account using a personal Gmail account to take advantage of the $300 credits provided to new-to-GCP users.</div>

<div style='font-size: 18px; font-weight: bold;'>Setup Google Cloud Storage</div>
<div style='text-align: justify;'>We then set up our own Google Cloud Storage, creating a bucket using the default options to serve as the container where the GDELT data will be stored. Public access to the bucket was turned on to minimize challenges related to permissions; note that no private data was stored in the GCS bucket.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 6: Google Cloud Storage Configuration</b></center>

| Configuration | Details |
| :- | :- |
| Location type | Multi-region |
| Location | <tt>us</tt> (multiple regions in United States) |
| Replication | Default |
| Default storage class | Standard |
| Requester Pays | Off |
| Labels | None |
| Cloud Console URL | <tt>https://console.cloud.google.com/storage/browser/bdcc-final-project-gcp</tt> |
| <tt>gsutil</tt> URI | <tt>gs://bdcc-final-project-gcp</tt> |
| Access control | Uniform |
| Public access prevention | Not enabled by org policy or bucket setting |
| Public access status | Public to internet |
| Object versioning | Off |
| Retention policy | None |
| Encryption type | Google-managed key |
| Lifecycle rules | None |

<br>

<div style='font-size: 18px; font-weight: bold;'>Access required rows from Google BigQuery</div>
<div style='text-align: justify;'>We then accessed the GDELT page in Google BigQuery available at <a href="https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.events">https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.events</a>. We exported the entire table to our GCS bucket in Parquet format with Snappy compression for cost-effective use of our bucket, reducing the total size from 250 GB to 51.8 GB.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 7: Google GDELT Events Database in GCP</b></center>

| Key | Value |
| :- | :- |
| Table ID | <tt>gdelt-bq:gdeltv2.events</tt> |
| Table size | 250 GB |
| Long-term storage size | 0 B |
| Number of rows | 622,100,515 |
| Created | May 20, 2016, 1:36:49 AM UTC+8 |
| Last modified | Mar 9, 2022, 12:04:34 AM UTC+8 |
| Table expiration | Never |
| Data location | US |

<br>

<div style='font-size: 18px; font-weight: bold;'>Create a VM Instance in Google Compute Engine</div>
<div style='text-align: justify;'>We then accessed Google Compute Engine to create a virtual machine instance using the default credentials. The region we used is the same as that of our S3 bucket to maximize transfer speed.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 8: Google Compute Engine Configuration</b></center>

| Configuration | Details |
| :- | :- |
| Region | <tt>us-east4</tt> (Northern Virginia) |
| Zone | <tt>us-east4-c</tt> |
| Machine type | <tt>e2-standard-8</tt> |
| vCPU | 8 |
| Memory | 32 GB |
| Boot disk | 10 GB |
| Image | Debian GNU/Linux 10 |

<div style='text-align: justify; text-indent: 2em'>We connected to the instance over SSH and ran the following commands so that we could access AWS from GCP:</div>

<b>Code Used:</b>
```terminal
cd ~
touch .boto
echo "[Credentials]" >> ~/.boto
echo "aws_access_key_id = <insert access key id>" >> ~/.boto
echo "aws_secret_access_key = <insert secret access key>" >> ~/.boto
cat .boto
gsutil -m rsync -rd gs://bdcc-final-project-gcp/gdelt-v2.parquet/ s3://bdcc-final-project-bucket/gdelt-v2.parquet/
```

<div style='text-align: justify; text-indent: 2em'>We can now access the files in S3. We have terminated the VM instance.</div>
<br>

## 3.4. Concatenate rows

<br><div style='text-align: justify;'>We concatenated both Parquet files in PySpark using <tt>unionByName</tt> and saved the result in Parquet format. Full details on the concatenated dataframe, such as the schema and summary statistics, are available in the earlier section on the <tt>Data Source</tt> (i.e., Section 2). The total size is 66.8 GB.</div><div style='text-align: justify; text-indent: 2em'>The entire data cleaning pipeline is provided by the authors in the <tt>01_dataPreparation.ipynb</tt> file as part of our submission.</div>

## 3.5. Process big data

<div style='font-size: 18px; font-weight: bold;'>Filter data by only including relevant events</div>
<br><div style='text-align: justify;'>We kept only events involving the relevant actors, namely those whose country codes correspond to Ukraine or Russia, as well as the other permanent members of the UN Security Council: China, France, the United Kingdom, and the United States. Including the UNSC members allows us to understand how, under the threat of war, these superpowers negotiate for peace or escalate conflict.</div>
<br><div style='text-align: justify; text-indent: 2em'>In addition, we filtered the source URLs of each reported event with a regular expression, focusing on words that are related to the war:</div>
<br><tt>"war|conflict|aggress|crisis|invade|attack|invasion"
<br>"|weapon|nuclear|military|army|politic|alliance|battle|airforce"
<br>"|marine|combat|fight|soldier|discord|martial|power|violence"
<br>"|peace|negotiat|agree|coalition|ally|aid|sanction|refuge|defen|offen"</tt>
<br><br><div style='text-align: justify; text-indent: 2em'>We resaved the filtered data in Parquet format for our frequent itemset mining pipeline.</div>
<br>
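<div style='text-align: justify; text-indent: 2em'>As a minimal sketch of how such a keyword filter behaves, the snippet below applies the same pattern with Python's <tt>re</tt> module. It is not the authors' pipeline code: the helper <tt>is_war_related</tt> and the sample URLs are hypothetical, and we assume a case-sensitive substring search.</div>

```python
import re

# Keyword pattern as listed in the text, joined into one alternation.
WAR_PATTERN = re.compile(
    "war|conflict|aggress|crisis|invade|attack|invasion"
    "|weapon|nuclear|military|army|politic|alliance|battle|airforce"
    "|marine|combat|fight|soldier|discord|martial|power|violence"
    "|peace|negotiat|agree|coalition|ally|aid|sanction|refuge|defen|offen"
)


def is_war_related(url: str) -> bool:
    """Return True if any keyword occurs anywhere in the URL (hypothetical helper)."""
    return WAR_PATTERN.search(url) is not None


# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://example.com/world/sanctions-imposed-on-moscow",
    "https://example.com/sports/final-score",
    "https://example.com/tech/new-software-release",
]
matches = [is_war_related(u) for u in sample_urls]
```

<div style='text-align: justify; text-indent: 2em'>Because the pattern matches substrings, the third URL also passes the filter ("software" contains "war"), so a filter of this kind favors recall over precision.</div>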

<div style='font-size: 18px; font-weight: bold;'>Prepare database for FIM and partition by year</div>
<br><div style='text-align: justify;'>To process the data for the frequent itemset mining pipeline, we needed to add an itemset column through <tt>collect_set</tt> so we may concatenate salient values.</div>
<br><div style='text-align: justify; text-indent: 2em'>Our itemsets capture frequent geopolitical events, so they focus on the actors and the events that happened between them. Events are defined through the Conflict and Mediation Event Observations (CAMEO) event codes, a framework for coding event data. The full list of event codes is available at <a href='https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt'>this link</a>.</div>
<br><div style='text-align: justify; text-indent: 2em'>Below are the details of each itemset:</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 9: Details of each itemset</b></center>

| Key | Description |
| :- | :- |
| <tt>Actor1Name</tt> | Name of <tt>Actor1</tt>, who plays the active role in an action/event; appended with <tt>_1</tt> for clarity. |
| <tt>Actor2Name</tt> | Name of <tt>Actor2</tt>, who plays the passive role in an action/event; appended with <tt>_2</tt> for clarity. |
| <tt>Actor2CountryCode</tt> | Country code of <tt>Actor2</tt>. |
| <tt>EventDescription</tt> | Describes the action/event using the CAMEO event codes. |

<br><div style='text-align: justify; text-indent: 2em'>For convenience, we provide sample itemsets below for reference:</div>
<br><div style='text-align: justify; display: inline-block;'>
    <ul>
        <li><tt>RUSSIA_1, UKRAINE_2, UKR-Fight with small arms and light weapons</tt></li>
        <li><tt>UNITED STATES_1, RUSSIA_2, RUS-Impose embargo, boycott, or sanctions</tt></li>
        <li><tt>CHINA_1, RUSSIAN_2, RUS-Make empathetic comment</tt></li>
    </ul>
</div>
<br><br><div style='text-align: justify; text-indent: 2em'>The entire data filtering pipeline is provided by the authors in the <tt>02_dataFiltering.ipynb</tt> file as part of our submission.</div>
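<br><div style='text-align: justify; text-indent: 2em'>To make the itemset format concrete, the following plain-Python sketch joins the four fields with "-" separators and then splits the string back into distinct items, which is our reading of the concatenate-then-split pattern used in this report. The function names <tt>build_item_string</tt> and <tt>to_itemset</tt> are hypothetical and do not appear in the submitted notebooks.</div>

```python
def build_item_string(actor1, actor2, actor2_country, event_description):
    """Join the fields with "-", tagging the actors with _1 and _2."""
    return "-".join(
        [f"{actor1}_1", f"{actor2}_2", actor2_country, event_description]
    )


def to_itemset(item_string):
    """Split on "-" and drop duplicate items, keeping first-seen order."""
    return list(dict.fromkeys(item_string.split("-")))


# Values taken from the first sample itemset listed in the text.
row = build_item_string(
    "RUSSIA", "UKRAINE", "UKR", "Fight with small arms and light weapons"
)
itemset = to_itemset(row)
```

<div style='text-align: justify; text-indent: 2em'>Note that, under this scheme, any name or description that itself contains a hyphen would be split into several items.</div>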
<br>

<div style='font-size: 18px; font-weight: bold;'>Data Visualization of Sentiment</div>
<br><div style='text-align: justify;'>We used Matplotlib and Seaborn in Python 3 to produce the required visualizations. They are made available by the authors in the <tt>02_visualizations.ipynb</tt> file as part of our submission.</div>

<br><br><div style='font-size: 18px; font-weight: bold;'>Run FP-Growth Algorithm to obtain association rules</div>
<br><div style='text-align: justify;'>We then ran the frequent itemset mining with the <tt>FPGrowth</tt> class from PySpark's <tt>pyspark.ml.fpm</tt> module. It requires only a single column containing all itemsets, along with the chosen minimum support and minimum confidence.</div>
<br><div style='text-align: justify; text-indent: 2em'>FP-Growth extracts frequent itemsets through the construction of a frequent pattern (FP) tree to find the most frequent patterns. It then generates strong association rules from the resulting FP tree. We believe this is the most appropriate algorithm given the massive number of itemsets in our database, as opposed to the Apriori algorithm, which requires candidate generation and is thus computationally expensive.</div>
<br><div style='text-align: justify; text-indent: 2em'>We selected a minimum support of 0.001, meaning that the itemsets we consider frequent occur in at least 0.1% of all transactions in our database, and a minimum confidence of 0.3 for generating association rules, meaning that the consequent must appear in at least 30% of the transactions that contain the antecedent. Details on the analysis of the resulting frequent items and association rules are provided in the next section.</div>
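<br><div style='text-align: justify; text-indent: 2em'>The two thresholds can be illustrated with a small worked example in plain Python. This is a sketch of the standard definitions of support and confidence, not of Spark's implementation; the helper functions and the toy transactions are hypothetical and are not taken from the GDELT data.</div>

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Toy transactions, for illustration only.
transactions = [
    ["RUSSIA_2", "RUS", "Consult"],
    ["RUSSIA_2", "RUS", "Engage in negotiation"],
    ["MOSCOW_2", "RUS", "Consult"],
    ["UKRAINE_2", "UKR", "Consult"],
]

sup_rule = support(transactions, ["RUS", "RUSSIA_2"])            # 2 of 4
conf_forward = confidence(transactions, ["RUSSIA_2"], ["RUS"])   # 2 of 2
conf_reverse = confidence(transactions, ["RUS"], ["RUSSIA_2"])   # 2 of 3

# A rule is kept only if it clears both thresholds used in this report.
keeps_forward = sup_rule >= 0.001 and conf_forward >= 0.3
```

<div style='text-align: justify; text-indent: 2em'>The example also shows why confidence is directional: every toy transaction with <tt>RUSSIA_2</tt> contains <tt>RUS</tt>, but only two of the three transactions with <tt>RUS</tt> contain <tt>RUSSIA_2</tt>.</div>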

## 3.6. Save in S3

<br><div style='text-align: justify;'>We saved all relevant files and models in S3 as required.</div>

# 4. Data Exploration

<br><div style='text-align: justify;'>In this section, we provide the exploratory data analysis to set the context for our Results and Discussion. We start with Russia and Ukraine only by identifying the most mentioned actors and events between these countries. We then build on this analysis by extending the discussion to global sentiment and frequent geopolitical events to include world leaders and other significant actors in the Russo-Ukrainian War.</div>
<hr>

<div style='font-size: 18px; font-weight: bold;'>Who are the actors with the most mentions between Ukraine and Russia?</div>
<br><div style='text-align: justify;'>Actors are those who may be directly or indirectly involved in the escalation, start, prolonging, and end of any war (as with any event). We list the actors that figure most prominently in the media, which suggests their significance in the event.</div>
<br><div style='text-align: justify; text-indent: 2em'>In the tables below, we can see that most of the actors are countries, locations, or institutions. The only individual included in the tables below is Vladimir Putin, showing that the media's portrayal of Russia almost always includes him as decision-maker, more so than for Ukraine, which is possibly portrayed as a more democratic country.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 10: Top 10 <tt>Actor1</tt> with most mentions (Ukraine-Russia only)</b></center>

In [10]:
(df_all.select(df_all['NumMentions'].cast('int'),
               df_all['Actor1Name'].cast('string'))
       .filter((df_all['Actor1Name'] != '') &
               (((df_all['Actor1CountryCode'] == 'UKR') &
                 (df_all['Actor2CountryCode'] == 'RUS')) |
                ((df_all['Actor1CountryCode'] == 'RUS') &
                 (df_all['Actor2CountryCode'] == 'UKR'))))
       .groupby('Actor1Name')
       .sum('NumMentions').alias('Total NumMentions')
       .orderBy('sum(NumMentions)', ascending=False)
       .show(10))

+--------------+----------------+
|    Actor1Name|sum(NumMentions)|
+--------------+----------------+
|       UKRAINE|         3176649|
|        RUSSIA|         3049490|
|       RUSSIAN|         2101494|
|     UKRAINIAN|         1162220|
|        MOSCOW|          648972|
|          KIEV|          423625|
|        CRIMEA|          421984|
|      VLADIMIR|          259946|
|VLADIMIR PUTIN|          135915|
|          KYIV|           97708|
+--------------+----------------+
only showing top 10 rows

<center style="font-size:14px;font-style:default;"><b>TABLE 11: Top 10 <tt>Actor2</tt> with most mentions (Ukraine-Russia only)</b></center>

In [11]:
(df_all.select(df_all['NumMentions'].cast('int'),
               df_all['Actor2Name'].cast('string'))
       .filter((df_all['Actor1Name'] != '') &
               (((df_all['Actor1CountryCode'] == 'UKR') &
                 (df_all['Actor2CountryCode'] == 'RUS')) |
                ((df_all['Actor1CountryCode'] == 'RUS') &
                 (df_all['Actor2CountryCode'] == 'UKR'))))
       .groupby('Actor2Name')
       .sum('NumMentions').alias('Total NumMentions')
       .orderBy('sum(NumMentions)', ascending=False)
       .show(10))

+----------+----------------+
|Actor2Name|sum(NumMentions)|
+----------+----------------+
|   UKRAINE|         3638119|
|    RUSSIA|         2726664|
|   RUSSIAN|         1928897|
| UKRAINIAN|         1339576|
|    CRIMEA|          586220|
|    MOSCOW|          466605|
|      KIEV|          350506|
|  VLADIMIR|          206714|
|    DONBAS|          127800|
|   DONETSK|          105536|
+----------+----------------+
only showing top 10 rows

<div style='text-align: justify; text-indent: 2em'>Among the actors with the highest tone, those who are individuals are Russian politicians. Among those with the lowest tone are Yevhen Marchuk and Oleksandr Popov, who are prominent politicians in Ukraine. The former was Prime Minister and the latter was Kyiv City Administrator. Both appear not to be pro-Russian.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 12: Top 10 <tt>Actor1</tt> with highest tone (Ukraine-Russia only)</b></center>

In [12]:
(df_all.select(df_all['AvgTone'].cast('float'),
               df_all['Actor1Name'].cast('string'))
       .filter(((df_all['Actor1CountryCode'] == 'UKR') &
                (df_all['Actor2CountryCode'] == 'RUS')) |
               ((df_all['Actor1CountryCode'] == 'RUS') &
                (df_all['Actor2CountryCode'] == 'UKR')))
       .groupby('Actor1Name')
       .mean('AvgTone')
       .orderBy('avg(AvgTone)', ascending=False)
       .show(10, truncate=False))

+------------------+------------------+
|Actor1Name        |avg(AvgTone)      |
+------------------+------------------+
|URAL MOUNTAINS    |6.153846263885498 |
|SERGEY LEBEDEV    |3.9800994396209717|
|MIKHAIL SHVYDKOY  |3.7931034564971924|
|KHANTY MANSI      |3.4398033618927   |
|SERGEY KIRIYENKO  |2.641509532928467 |
|NEVA RIVER        |2.4636058807373047|
|ANATOLIY SERDYUKOV|2.3774144649505615|
|ANDRIY KLYUYEV    |2.3353466033935546|
|VIKTOR KHRISTENKO |1.9596365690231323|
|MYKOLA AZAROV     |1.7327005275514689|
+------------------+------------------+
only showing top 10 rows

<center style="font-size:14px;font-style:default;"><b>TABLE 13: Top 10 <tt>Actor1</tt> with lowest tone (Ukraine-Russia only)</b></center>

In [13]:
(df_all.select(df_all['AvgTone'].cast('float'),
               df_all['Actor1Name'].cast('string'))
       .filter(((df_all['Actor1CountryCode'] == 'UKR') &
                (df_all['Actor2CountryCode'] == 'RUS')) |
               ((df_all['Actor1CountryCode'] == 'RUS') &
                (df_all['Actor2CountryCode'] == 'UKR')))
       .groupby('Actor1Name')
       .mean('AvgTone')
       .orderBy('avg(AvgTone)', ascending=True)
       .show(10, truncate=False))

+-------------------+-------------------+
|Actor1Name         |avg(AvgTone)       |
+-------------------+-------------------+
|MARI EL            |-8.0               |
|KABARDINO BALKAR   |-7.242339611053467 |
|YEVHEN MARCHUK     |-6.615913733839989 |
|TOMSK              |-6.278172786171372 |
|ALEKSANDR BORTNIKOV|-6.037964570522308 |
|OLEKSANDR POPOV    |-5.567010402679443 |
|NIZHNI NOVGOROD    |-5.5404917399088545|
|TVER OBLAST        |-4.925099015235901 |
|CHUVASHIA          |-4.91709109715053  |
|SAKHA              |-4.748990486065547 |
+-------------------+-------------------+
only showing top 10 rows

<div style='text-align: justify; text-indent: 2em'>Among the <tt>Actor2</tt> entries, those with the highest tones are Russian politicians such as Mikhail Shvydkoy and Sergey Kiriyenko. Among those with the lowest tones, most are areas in Russia with sizeable Ukrainian-identifying populations. Interestingly, Vladimir Filippov is a prominent Russian academician associated with Vladimir Putin.</div>
<br><div style='text-align: justify; text-indent: 2em'>Critically, most actors with top average tones are from Russia. Conversely, some Ukrainian politicians received the most negative average tones.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 14: Top 10 <tt>Actor2</tt> with highest tone (Ukraine-Russia only)</b></center>

In [14]:
(df_all.select(df_all['AvgTone'].cast('float'),
               df_all['Actor2Name'].cast('string'))
       .filter(((df_all['Actor1CountryCode'] == 'UKR') &
                (df_all['Actor2CountryCode'] == 'RUS')) |
               ((df_all['Actor1CountryCode'] == 'RUS') &
                (df_all['Actor2CountryCode'] == 'UKR')))
       .groupby('Actor2Name')
       .mean('AvgTone')
       .orderBy('avg(AvgTone)', ascending=False)
       .show(10, truncate=False))

+------------------+------------------+
|Actor2Name        |avg(AvgTone)      |
+------------------+------------------+
|UKRAIEN           |5.060931921005249 |
|MIKHAIL SHVYDKOY  |3.7931034564971924|
|GORKI             |3.654582921947752 |
|KHANTY MANSI      |3.4398033618927   |
|MOBILE TELESYSTEMS|2.9675941467285156|
|SERGEY LEBEDEV    |2.7455849051475525|
|SERGEY KIRIYENKO  |2.641509532928467 |
|ANATOLIY SERDYUKOV|2.3774144649505615|
|NALCHIK           |2.04501149058342  |
|CHECHENIA         |1.954397439956665 |
+------------------+------------------+
only showing top 10 rows

<center style="font-size:14px;font-style:default;"><b>TABLE 15: Top 10 <tt>Actor2</tt> with lowest tone (Ukraine-Russia only)</b></center>

In [15]:
(df_all.select(df_all['AvgTone'].cast('float'),
               df_all['Actor2Name'].cast('string'))
       .filter(((df_all['Actor1CountryCode'] == 'UKR') &
                (df_all['Actor2CountryCode'] == 'RUS')) |
               ((df_all['Actor1CountryCode'] == 'RUS') &
                (df_all['Actor2CountryCode'] == 'UKR')))
       .groupby('Actor2Name')
       .mean('AvgTone')
       .orderBy('avg(AvgTone)', ascending=True)
       .show(10, truncate=False))

+-----------------+------------------+
|Actor2Name       |avg(AvgTone)      |
+-----------------+------------------+
|DONBASS          |-7.597536087036133|
|CHUVASHIA        |-7.393648584683736|
|KABARDINO BALKAR |-7.242339611053467|
|VLADIMIR FILIPPOV|-6.382978916168213|
|SAKHA            |-6.064684782709394|
|ARKHANGELSK      |-6.041539817196982|
|UKRAJINA         |-5.532409071922302|
|CHARKIV          |-5.101778652932909|
|KEMEROVO         |-5.098754187424977|
|TVER OBLAST      |-5.011488199234009|
+-----------------+------------------+
only showing top 10 rows

<hr>

<div style='font-size: 18px; font-weight: bold;'>What were the most mentioned events done by Ukraine to Russia and vice versa?</div>
<br><div style='text-align: justify;'>The database has tagged each news report with an event code that tracks what happened between actors.</div>
<br><div style='text-align: justify; text-indent: 2em'>The following table provides the first few rows of the events included in the database. We use these event codes to identify the usual events that occur between Ukraine and Russia.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 16: Selected Event Codes</b></center>

In [16]:
event_codes_path = 's3://bdcc-lab2-bucket/parquet_files/CAMEO.eventcodes.txt'
desc_eventTab = spark.sparkContext.textFile(event_codes_path)
header = desc_eventTab.first()

# Filter out the header, make sure the rest looks correct
log_txt = desc_eventTab.filter(lambda line: line != header)
temp_var = log_txt.map(lambda k: k.split("\t"))
event_desc = temp_var.toDF(header.split("\t"))
event_desc.show(10, truncate=False)

# For local temporary view of this DataFrame
event_desc.createOrReplaceTempView('event_desc')

+--------------+-----------------------------------+
|CAMEOEVENTCODE|EVENTDESCRIPTION                   |
+--------------+-----------------------------------+
|01            |MAKE PUBLIC STATEMENT              |
|010           |Make statement, not specified below|
|011           |Decline comment                    |
|012           |Make pessimistic comment           |
|013           |Make optimistic comment            |
|014           |Consider policy option             |
|015           |Acknowledge or claim responsibility|
|016           |Deny responsibility                |
|017           |Engage in symbolic act             |
|018           |Make empathetic comment            |
+--------------+-----------------------------------+
only showing top 10 rows

<br><div style='text-align: justify; text-indent: 2em'>Looking at the tables below, unsurprisingly, most news reports focused on events in which either Ukraine or Russia would <tt>Use conventional military force</tt>. Events done by Russia to Ukraine include <tt>Make statement</tt>, <tt>Consult</tt>, and <tt>Appeal</tt>. Among events done by Ukraine to Russia, <tt>Consult</tt>, <tt>Make statement</tt>, <tt>Accuse</tt>, and <tt>Appeal</tt> ranked higher, which is indicative of the power imbalance between these two countries.</div>
<br><div style='text-align: justify; text-indent: 2em'>Unfortunately, the Top 10 of Russia's actions toward Ukraine includes <tt>Arrest, detain, or charge with legal action</tt>, which is a usual tactic of any government when it encounters a citizen of the opposing country.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 17: Top 10 Most Mentioned Events done by Russia to Ukraine</b></center>

In [17]:
(spark.sql("""SELECT EventCode,
                     event_desc.EVENTDESCRIPTION as `Event Description`,
                     number_of_events
              FROM(
                   SELECT df_all.EventCode,
                          count(df_all.GLOBALEVENTID) as number_of_events
                   FROM df_all
                   WHERE df_all.Actor2CountryCode == 'UKR' and
                         df_all.Actor1CountryCode == 'RUS'
                   GROUP BY EventCode
                  )
              JOIN event_desc on event_desc.CAMEOEVENTCODE = EventCode
              ORDER BY number_of_events DESC
              LIMIT 20""").show(10, truncate=False))

+---------+----------------------------------------------------+----------------+
|EventCode|Event Description                                   |number_of_events|
+---------+----------------------------------------------------+----------------+
|190      |Use conventional military force, not specified below|64341           |
|010      |Make statement, not specified below                 |59730           |
|040      |Consult, not specified below                        |57512           |
|020      |Appeal, not specified below                         |53859           |
|051      |Praise or endorse                                   |47672           |
|112      |Accuse, not specified below                         |45235           |
|042      |Make a visit                                        |43230           |
|046      |Engage in negotiation                               |41115           |
|043      |Host a visit                                        |35322           |
|173      |Arres

<center style="font-size:14px;font-style:default;"><b>TABLE 18: Top 10 Most Mentioned Events done by Ukraine to Russia</b></center>

In [18]:
(spark.sql("""SELECT EventCode,
                     event_desc.EVENTDESCRIPTION as `Event Description`,
                     number_of_events
              FROM
                  (
                   SELECT df_all.EventCode,
                          count(df_all.GLOBALEVENTID) as number_of_events
                   FROM df_all
                   WHERE df_all.Actor2CountryCode == 'RUS' and
                         df_all.Actor1CountryCode == 'UKR'
                   GROUP BY EventCode
                  )
              JOIN event_desc on event_desc.CAMEOEVENTCODE = EventCode
              ORDER BY number_of_events DESC
              LIMIT 20""").show(10, truncate=False))

+---------+----------------------------------------------------+----------------+
|EventCode|Event Description                                   |number_of_events|
+---------+----------------------------------------------------+----------------+
|040      |Consult, not specified below                        |57629           |
|010      |Make statement, not specified below                 |56801           |
|112      |Accuse, not specified below                         |54880           |
|020      |Appeal, not specified below                         |53501           |
|190      |Use conventional military force, not specified below|52298           |
|046      |Engage in negotiation                               |41226           |
|043      |Host a visit                                        |39215           |
|042      |Make a visit                                        |38067           |
|051      |Praise or endorse                                   |36226           |
|036      |Expre

In [19]:
df = spark.read.parquet('s3://bdcc-final-project-bucket/'
                        'war_ukr_rus_w_internal.parquet')

In [20]:
df.createOrReplaceTempView('df')

# 5. Results and Discussion

## 5.1. Global Sentiment

<br><div style='font-size: 18px; font-weight: bold;'>Who were the significant actors?</div>
<br><div style='text-align: justify;'>We start by visualizing the global sentiment on all major actors relevant to the Russo-Ukrainian War. GDELT suggests that the number of mentions is a proxy for the importance of an individual actor. Below, we provide the list of actors who served as leaders of their respective countries.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 19: Frequently Mentioned Actors</b></center>

In [21]:
(df.select(df['NumMentions'].cast('int'),
           df['Actor1Name'].cast('string'))
   # Exact-name match; equivalent to chaining like() without wildcards
   .filter(f.col('Actor1Name').isin(
       'VLADIMIR PUTIN', 'VOLODYMYR ZELENSKYY', 'PETRO POROSHENKO',
       'OLEKSANDR TURCHYNOV', 'VIKTOR YANUKOVYCH', 'DONALD TRUMP',
       'JOE BIDEN', 'BARACK OBAMA', 'XI JINPING', 'BORIS JOHNSON',
       'THERESA MAY', 'DAVID CAMERON', 'EMMANUEL MACRON',
       'FRANCOIS HOLLANDE'))
   .groupby('Actor1Name')
   .sum('NumMentions').alias('Total NumMentions')
   .orderBy('sum(NumMentions)', ascending=False)
   .show())

+-----------------+----------------+
|       Actor1Name|sum(NumMentions)|
+-----------------+----------------+
|   VLADIMIR PUTIN|          168261|
|     BARACK OBAMA|           55151|
|        JOE BIDEN|           44965|
|VIKTOR YANUKOVYCH|           11396|
|      THERESA MAY|            7948|
|    DAVID CAMERON|            6892|
|       XI JINPING|            4363|
+-----------------+----------------+

<div style='text-align: justify; text-indent: 2em'>The leaders who were frequently mentioned across all events were Russian President Vladimir Putin, then-US President Barack Obama, US President Joe Biden, then-Ukrainian President Viktor Yanukovych, then-UK Prime Ministers Theresa May and David Cameron, and Chinese President Xi Jinping. Strikingly, then-US President Donald Trump and Ukrainian President Volodymyr Zelenskyy do not appear on the list of frequently mentioned actors.</div>
<hr>

<div style='font-size: 18px; font-weight: bold;'>What is the average sentiment of documents towards significant actors?</div>
<br><div style='text-align: justify;'>The following figure plots the average "tone" of documents on the identified significant actors, with warm colors representing Russia and China and cold colors representing Ukraine, the UK, and the US. By "tone," we mean the average sentiment of documents containing the actor. Tone ranges from -100 to +100, although most documents fall between -10 and +10.</div>
<br><center style="font-size:14px;font-style:default;"><b>FIGURE 2: Monthly Average Tone of Events Based on Individual Actor</b></center>

![Figure 2](figures/fig2.PNG)

<div style='text-align: justify; text-indent: 2em'>For the significant actors included in the plot, the "tone" of documents about Russia and China appears to be far more positive than that of their Western counterparts. This difference in tone potentially reflects how much immediate control the governments of these countries have over how they are portrayed by the media.</div>
<hr>

<div style='font-size: 18px; font-weight: bold;'>What are the significant events in the Russo-Ukrainian War?</div>
<br><div style='text-align: justify;'>We then looked at the number of "mentions" among source documents relating to the Russo-Ukrainian War. GDELT suggests that frequently mentioned events are likely to be significant.</div>
<br><center style="font-size:14px;font-style:default;"><b>FIGURE 3: Number of Mentions on the Russo-Ukrainian War</b></center>

![Figure 3](figures/fig3.PNG)

<div style='text-align: justify; text-indent: 2em'>We observe several peaks in the plot, with the first peak in early 2014, when the Ukrainians ousted their then-President Viktor Yanukovych. All other peaks coincide with escalation events: for instance, the peak in mid-2014 refers to Russia's annexation of Crimea, which started the formal invasion by Russia; the peak in late 2018 refers to the Kerch Strait incident; and the highest peak in early 2022 refers to the latest invasion by Russia in a bid for Ukraine's capital. The relative "silence" between 2015 and 2022 led journalists to refer to the events in these years as the "forgotten war."</div>
<hr>

<div style='font-size: 18px; font-weight: bold;'>What events most influenced country instability in Russia and Ukraine?</div>
<br><div style='text-align: justify;'>We look at the impact of each event on the country. GDELT uses the Goldstein Scale, a score from -10 to +10, which aims to capture the "theoretical potential impact that type of event will have on the stability of a country."</div>
<br><center style="font-size:14px;font-style:default;"><b>FIGURE 4: Goldstein Scales for Russia and Ukraine</b></center>

![Figure 4](figures/fig4.PNG)

<div style='text-align: justify; text-indent: 2em'>These heatmaps visualize the Goldstein Scales for each month in the past nine years. Russia appears to have paler grids and thus less negative Goldstein scores, possibly owing to the extent of its geopolitical and economic power and how it strategically wields that power in diplomatic spaces. However, for the data ending in early March 2022, Russia's scale appears to be getting redder, indicating that recent sanctions have negatively impacted it.</div>
<br><div style='text-align: justify; text-indent: 2em'>On the other hand, Ukraine appears to be more prone to destabilization. The months of May 2014, July 2014, and November 2018 are months when key events occurred in the war. These include the Annexation of Crimea and the War in Donbas in eastern Ukraine in 2014, and the Kerch Strait incident in 2018.</div>

In [24]:
trans_db = spark.sql(
    """
    SELECT
        Year,
        SQLDATE,
        Actor1Name,
        COLLECT_SET(CONCAT(
            Actor1Name, "_1", "-",
            Actor2Name, "_2", "-",
            Actor2CountryCode, "-",
            EVENTDESCRIPTION
        )) AS itemset
    FROM df
    GROUP BY Year, SQLDATE, Actor1Name
    """
)

## 5.2. Frequently Occurring Items and Association Rules between Geopolitical Events

<br><div style='text-align: justify;'>We now look at the frequently occurring geopolitical events between actors in the past nine years and discuss them in four sections below. Please note that the following tables show only the top 20 rows for readability. The analysis itself used the top 100 rows of each year, which are available in <tt>03_fimPipeline.ipynb</tt>.</div>
<br><div style='text-align: justify; text-indent: 2em'>Below, actors are appended with either a "<tt>_1</tt>" or a "<tt>_2</tt>", the former meaning that they are <tt>Actor1</tt> (active role in the event) in the itemset, while the latter is <tt>Actor2</tt> (passive role in the event).</div>

<div style='font-size: 18px; font-weight: bold;'>2013: Usual diplomatic negotiations between world superpowers</div>
<br><div style='text-align: justify;'>At least for 2013, we observe the usual negotiations and consultations between world superpowers.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 20: First 20 association rules for itemsets in 2013</b></center>

In [25]:
fim_years('itemset', [2013])

+----------------------------------------+-----------------+-----+-----+
|antecedent                              |consequent       |sup  |conf |
+----------------------------------------+-----------------+-----+-----+
|[RUSSIA_2]                              |[RUS]            |0.468|1.0  |
|[RUS]                                   |[RUSSIA_2]       |0.468|0.618|
|[RUSSIAN_2]                             |[RUS]            |0.168|1.0  |
|[Consult, not specified below]          |[RUS]            |0.083|0.789|
|[Engage in negotiation]                 |[RUS]            |0.073|0.842|
|[MOSCOW_2]                              |[RUS]            |0.064|1.0  |
|[UKRAINE_2]                             |[UKR]            |0.054|1.0  |
|[UKR]                                   |[UKRAINE_2]      |0.054|0.581|
|[Consult, not specified below, RUSSIA_2]|[RUS]            |0.051|1.0  |
|[Consult, not specified below, RUS]     |[RUSSIA_2]       |0.051|0.617|
|[Consult, not specified below]          |[RUSSIA_2

<div style='text-align: justify; text-indent: 2em'>Specifically, many of the actions focused on <tt>Consult</tt> or <tt>Engage in negotiation</tt>, signifying the regular diplomatic functions between countries.</div>
<br><div style='text-align: justify; text-indent: 2em'>Understandably, given their power and influence, including permanent membership of the UN Security Council, the superpowers are mentioned more frequently than Ukraine. We observe no itemsets that may indicate Russia's intention to invade Crimea.</div>
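<br><div style='text-align: justify; text-indent: 2em'>To make the <tt>sup</tt> and <tt>conf</tt> columns concrete, the sketch below computes support and confidence by hand on four made-up transactions. It assumes the standard definitions: the support of a rule is the share of transactions containing both sides, and the confidence is that share divided by the share containing the antecedent. The helper names are hypothetical.</div>

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Toy transactions, not GDELT data. 'RUS' accompanies every 'RUSSIA_2',
# but it also accompanies other actor names such as 'MOSCOW_2'.
transactions = [
    ["UNITED STATES_1", "RUSSIA_2", "RUS", "Consult"],
    ["CHINA_1", "RUSSIA_2", "RUS", "Sign formal agreement"],
    ["FRANCE_1", "MOSCOW_2", "RUS", "Consult"],
    ["RUSSIA_1", "UKRAINE_2", "UKR", "Consult"],
]

print(support(transactions, ["RUSSIA_2", "RUS"]))       # 0.5
print(confidence(transactions, ["RUSSIA_2"], ["RUS"]))  # 1.0
print(confidence(transactions, ["RUS"], ["RUSSIA_2"]))  # about 0.667
```

<div style='text-align: justify; text-indent: 2em'>This mirrors the pattern in the table: an actor name always comes with its country code, so <tt>[RUSSIA_2]</tt> to <tt>[RUS]</tt> has a confidence of 1.0, while the country code is shared by several actor names, so the reverse rule has a lower confidence.</div>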
<br>

<div style='font-size: 18px; font-weight: bold;'>2014: Start of the Russo-Ukrainian War</div>
<br><div style='text-align: justify;'>When the war started in 2014 with Russia's annexation of Crimea, the US actively sanctioned and imposed embargoes on Russia.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 21: First 20 association rules for itemsets in 2014</b></center>

In [26]:
fim_years('itemset', [2014])


+---------------------------------------+-----------+-----+-----+
|antecedent                             |consequent |sup  |conf |
+---------------------------------------+-----------+-----+-----+
|[RUSSIA_2]                             |[RUS]      |0.369|1.0  |
|[RUS]                                  |[RUSSIA_2] |0.369|0.624|
|[UKRAINE_2]                            |[UKR]      |0.192|1.0  |
|[UKR]                                  |[UKRAINE_2]|0.192|0.604|
|[RUSSIAN_2]                            |[RUS]      |0.143|1.0  |
|[UKRAINIAN_2]                          |[UKR]      |0.057|1.0  |
|[MOSCOW_2]                             |[RUS]      |0.046|1.0  |
|[Engage in negotiation]                |[RUS]      |0.04 |0.674|
|[Consult, not specified below]         |[RUS]      |0.036|0.521|
|[Make statement, not specified below]  |[RUS]      |0.035|0.582|
|[Impose embargo, boycott, or sanctions]|[RUS]      |0.031|0.835|
|[Host a visit]                         |[RUS]      |0.029|0.556|
|[Appeal, 

<div style='text-align: justify; text-indent: 2em'>Itemsets that were explicitly included were the following:</div>
<div style='text-align: justify; display: inline-block;'>
    <ol>
        <li><tt>UNITED STATES_1 Impose embargo, boycott, or sanctions</tt> to <tt>RUSSIA_2</tt></li>
        <li><tt>RUSSIA_1 Use conventional military force</tt> on <tt>UKRAINE_2</tt></li>
        <li><tt>RUSSIA_1 Express intent to cooperate</tt>, but as we later find out, they fail to do so.</li>
    </ol>
</div>
<br><br>

<div style='font-size: 18px; font-weight: bold;'>2015-2021: Continuation of the War</div>
<br><div style='text-align: justify;'>Judging by the number of mentions and the Goldstein scale, events in the years 2015 to 2021 received fewer mentions and appear unlikely to be significant. However, we surfaced several events with strong association rules across different years.</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 22: First 20 association rules for itemsets in 2015-2021</b></center>

In [27]:
fim_years('itemset', [2015, 2016, 2017, 2018, 2019, 2020, 2021])


+------------------------------------------------------+-----------+-----+-----+
|antecedent                                            |consequent |sup  |conf |
+------------------------------------------------------+-----------+-----+-----+
|[RUSSIA_2]                                            |[RUS]      |0.399|1.0  |
|[RUS]                                                 |[RUSSIA_2] |0.399|0.599|
|[RUSSIAN_2]                                           |[RUS]      |0.167|1.0  |
|[UKRAINE_2]                                           |[UKR]      |0.145|1.0  |
|[UKR]                                                 |[UKRAINE_2]|0.145|0.64 |
|[MOSCOW_2]                                            |[RUS]      |0.05 |1.0  |
|[Consult, not specified below]                        |[RUS]      |0.05 |0.62 |
|[Make statement, not specified below]                 |[RUS]      |0.041|0.666|
|[UKRAINIAN_2]                                         |[UKR]      |0.04 |1.0  |
|[Host a visit]             

<div style='text-align: justify; text-indent: 2em'>Events in the next seven years followed a pattern, with most events centering on the following: Russia had formal agreements with China, which both continued to uphold. Also, Russia used military force somewhere in Ukraine. Europe typically responded by appealing to Russia, although their response had been increasingly hostile in the past three years. America's response was typically more aggressive, sanctioning and imposing embargoes on Russia more often than not, which usually resulted in Russia appealing to the US.</div>
<br><div style='text-align: justify; display: inline-block;'>
    <ol>
        <li><b>China has formal agreements with Russia.</b> Russia had formal agreements with China, which both continued to uphold. Specifically, this includes <tt>CHINA_1 Sign formal agreement</tt> with <tt>RUSSIA_2</tt>.</li>
        <br><li><b>Russia uses military force in Ukraine.</b> Specifically, this includes <tt>RUSSIA_1 Use conventional military force</tt> on <tt>UKRAINE_2</tt>.</li>
        <br><li><b>France and the UK appeal to and accuse Russia.</b> European countries such as <tt>FRANCE_1</tt> or <tt>UNITED KINGDOM_1</tt> typically responded by <tt>Appeal</tt>ing to Russia, although their response had been increasingly hostile in the past three years.</li>
        <br><li><b>US negotiates with and sanctions Russia.</b> America's response was typically more aggressive, often sanctioning and imposing embargoes on Russia.</li>
        <br><li><b>Russia appeals to US.</b> This is often a result of the US's actions against Russia.</li>
    </ol>
</div>
<br><br><div style='text-align: justify; text-indent: 2em'>The above items may indicate that the escalation did not happen from a single event, but stems from a larger group of events that happened over the years. Potentially, international travel restrictions related to the COVID-19 pandemic may have delayed events and actions between actors, but we leave the detailed analysis of 2020-2021 for future studies.</div>
<br>

<div style='font-size: 18px; font-weight: bold;'>2022: Re-escalation and Economic Sanctions against Russia</div>
<br><div style='text-align: justify;'>Finally, the following groups of itemsets stand out in 2022 compared with previous years.</div>
<br><div style='text-align: justify; display: inline-block;'>
    <ol>
        <li><b>Worldwide criticisms and denouncements of Russia, as well as sanctions and embargoes.</b> This includes actions of <tt>Threaten</tt>ing, <tt>Impose embargo, boycott, or sanctions</tt>, and <tt>Make pessimistic comment</tt>.</li>
        <li><b>Support for Ukraine.</b> This includes events where Ukraine is on the receiving end of <tt>Provide military aid</tt> and <tt>Engage in negotiation</tt>, but also includes <tt>Make pessimistic comment</tt>.</li>
    </ol>
</div>
<br><center style="font-size:14px;font-style:default;"><b>TABLE 23: First 20 association rules for itemsets in 2022</b></center>

In [28]:
fim_years('itemset', [2022])


+---------------------------------------+-----------+-----+-----+
|antecedent                             |consequent |sup  |conf |
+---------------------------------------+-----------+-----+-----+
|[RUSSIA_2]                             |[RUS]      |0.287|1.0  |
|[RUS]                                  |[RUSSIA_2] |0.287|0.544|
|[UKRAINE_2]                            |[UKR]      |0.286|1.0  |
|[UKR]                                  |[UKRAINE_2]|0.286|0.7  |
|[RUSSIAN_2]                            |[RUS]      |0.182|1.0  |
|[RUS]                                  |[RUSSIAN_2]|0.182|0.344|
|[UKRAINIAN_2]                          |[UKR]      |0.077|1.0  |
|[Praise or endorse]                    |[UKR]      |0.037|0.64 |
|[Appeal, not specified below]          |[RUS]      |0.036|0.555|
|[Consult, not specified below]         |[RUS]      |0.035|0.463|
|[Consult, not specified below]         |[UKR]      |0.034|0.448|
|[Make a visit]                         |[UKR]      |0.032|0.625|
|[MOSCOW_2

# 6. Conclusion

<br><div style='text-align: justify;'>In this report, we described the global sentiment around the Russo-Ukrainian War and mined for frequently occurring events concurrent with the war, along with their corresponding association rules. We summarize our main insights and contributions below:</div>
<br><div style='text-align: justify; display: inline-block;'>
    <ol>
        <li>Frequently mentioned actors, suggested by GDELT as significant actors, are mostly Presidents from major superpowers, such as the UK, the US, China, Russia, and Ukraine. Strikingly, neither then-US President Donald Trump nor Ukrainian President Volodymyr Zelenskyy is included on the list of frequently mentioned actors.</li>
        <br>
        <li>The average tone of documents about Western world leaders is far more negative than that of their Eastern counterparts. This could be due to the immediate control the governments of these countries have on how they are perceived by the media.</li>
        <br>
        <li>Highly mentioned events in the Russo-Ukrainian War are related to escalation events, such as the annexation of Crimea or the Kerch Strait incident. However, the years 2015 to 2021 received very limited attention from source documents, even though the war continued during these years.</li>
        <br>
        <li>Using Goldstein Scales as an indicator of the theoretical impact of events on country stability, Ukraine appears to be more prone to destabilization than Russia. This could be due to the latter's geopolitical and economic power. Towards early March 2022, however, recent sanctions appeared to have negatively impacted Russia much more than Ukraine.</li>
        <br>
        <li>Finally, the frequently occurring events and association rules can be divided into four parts:</li>
        <ol>
            <li>2013: The usual diplomatic negotiations between world superpowers occurred, with no indication of Russia’s interest in the annexation of Crimea.</li>
            <li>2014: Russia starts a war with Ukraine by annexing Crimea, but it was only the US who actively sanctioned and imposed embargoes on Russia.</li>
            <li>2015-2021: While these years received fewer mentions and a lower Goldstein scale, they are nevertheless significant given the strong association rules and patterns between actors. China increased the number of its formal agreements with Russia, while France, the UK, and the US appealed to, negotiated with, and sanctioned Russia in turns, while the latter used military force in Ukraine.</li>
            <li>2022: Re-escalation of events has finally impacted country stability in Russia due to heavy economic sanctions from other countries.</li>
        </ol>
    </ol>
</div>

# 7. Recommendations

<br><div style='text-align: justify; display: inline-block;'>
    <ol>
        <li><b>Further research can be conducted to validate the above findings.</b> The above insights were generated based on the years 2013 to 2022 only. However, wars do not exist in a vacuum; conflict between Russia and Ukraine has existed for many decades and only reached its peak at the start of the war. Knowing the historical context before 2013 will help validate the above insights.</li>
        <br>
        <li><b>Sequence-aware pattern mining may aid in understanding the nuances between the dynamics of actors.</b> The current study only looks at individual events with no regard for sequences between them. Using sequence-aware pattern mining may surface new insights, including repeating and/or sequential patterns of certain actors.</li>
        <br>
        <li><b>Encourage media outlets to work against misinformation.</b> State-sponsored media outlets have persisted for many years and can greatly influence people's perceptions of the war (or of any topic, for that matter). It is in the public interest for the government and civil society organizations to promote responsible consumption of online media, including in matters that may not be directly or immediately related to a particular country.</li>
    </ol>
</div>
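<br><div style='text-align: justify;'>As a rough illustration of the second recommendation, the sketch below counts ordered pairs of events, that is, the share of sequences in which event <tt>a</tt> is followed at some later point by event <tt>b</tt>. This is only a toy form of sequential pattern mining, with made-up sequences and a hypothetical helper name; a full study would more likely rely on a dedicated algorithm such as PrefixSpan.</div>

```python
from collections import Counter


def ordered_pair_support(sequences):
    """Share of sequences in which event a occurs before event b."""
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    seen_pairs.add((a, b))
        # Count each ordered pair at most once per sequence
        counts.update(seen_pairs)
    return {pair: n / len(sequences) for pair, n in counts.items()}


FORCE = "Use conventional military force"
SANCTION = "Impose embargo, boycott, or sanctions"
APPEAL = "Appeal"

# Made-up event sequences ordered by date; not taken from GDELT.
sequences = [
    [FORCE, SANCTION, APPEAL],
    [FORCE, APPEAL],
    [SANCTION, FORCE],
]

pair_support = ordered_pair_support(sequences)
print(pair_support[(FORCE, APPEAL)])    # about 0.667
print(pair_support[(FORCE, SANCTION)])  # about 0.333
print(pair_support[(SANCTION, FORCE)])  # about 0.333
```

<div style='text-align: justify; text-indent: 2em'>Unlike an unordered itemset, the pair (<tt>FORCE</tt>, <tt>SANCTION</tt>) and its reverse are counted separately here, which is what would let a sequence-aware method distinguish a trigger from a response.</div>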

# 8. References

<br><div style='text-align: justify;'><a id='bbc'>BBC News. (2018).</a> <i>Russia-Ukraine sea clash in 300 words.</i> BBC News. Retrieved 06 March 2022 from <a href="https://www.bbc.com/news/world-europe-46345697">https://www.bbc.com/news/world-europe-46345697</a></div>
<br><div style='text-align: justify;'><a id='cathcart'>Cathcart, W. (2017).</a> <i>Putin's Crimean medal of honor, forged before the war even began.</i> The Daily Beast. Retrieved 06 March 2022 from <a href="https://www.thedailybeast.com/putins-crimean-medal-of-honor-forged-before-the-war-even-began">https://www.thedailybeast.com/putins-crimean-medal-of-honor-forged-before-the-war-even-began</a></div>
<br><div style='text-align: justify;'><a id='dress'>Dress, B. (2022).</a> <i>Russia demanding that Ukraine demilitarize.</i> The Hill. Retrieved 06 March 2022 from <a href="https://thehill.com/policy/international/596551-russia-demanding-that-ukraine-demilitarize">https://thehill.com/policy/international/596551-russia-demanding-that-ukraine-demilitarize</a></div>
<br><div style='text-align: justify;'><a id='gdelt'>GDELT Project. (2020).</a> <i>Global Database of Events, Language and Tone (GDELT).</i> Retrieved 06 March 2022 from <a href="https://registry.opendata.aws/gdelt">https://registry.opendata.aws/gdelt</a></div>
<br><div style='text-align: justify;'><a id='kharkiv'>Kharkiv, Kiev and Lviv. (2014).</a> <i>The February revolution.</i> The Economist. Retrieved 06 March 2022 from <a href="https://www.economist.com/briefing/2014/02/27/the-february-revolution">https://www.economist.com/briefing/2014/02/27/the-february-revolution</a></div>
<br><div style='text-align: justify;'><a id='kyiv'>Kyiv Post. (2013).</a> <i>Parliament passes statement on Ukraine's aspirations for European integration.</i> Kyiv Post. Retrieved 06 March 2022 from <a href="https://www.kyivpost.com/article/content/ukraine-politics/parliament-passes-statement-on-ukraines-aspirations-for-european-integration-320792.html">https://www.kyivpost.com/article/content/ukraine-politics/parliament-passes-statement-on-ukraines-aspirations-for-european-integration-320792.html</a></div>
<br><div style='text-align: justify;'><a id='macfarquhar'>MacFarquhar, N. (2018).</a> <i>In standoff with Russia, what does Ukraine's martial law decree mean?</i> The New York Times. Retrieved 06 March 2022 from <a href="https://www.nytimes.com/2018/11/27/world/europe/ukraine-crimea-russia.html">https://www.nytimes.com/2018/11/27/world/europe/ukraine-crimea-russia.html</a></div>
<br><div style='text-align: justify;'><a id='united'>United Nations. (2014).</a> <i>68/262. Territorial integrity of Ukraine: Resolution adopted by the General Assembly on 27 March 2014.</i> United Nations. Retrieved 06 March 2022 from <a href="https://www.securitycouncilreport.org/atf/cf/%7B65BFCF9B-6D27-4E9C-8CD3-CF6E4FF96FF9%7D/a_res_68_262.pdf">https://www.securitycouncilreport.org/atf/cf/%7B65BFCF9B-6D27-4E9C-8CD3-CF6E4FF96FF9%7D/a_res_68_262.pdf</a></div>

# 9. Appendix

<div style='font-size: 18px; font-weight: bold;'>Screenshot of the instances, showing the instance type</div>

![Screenshot 1](figures/appendix1.PNG)
<hr>

<div style='font-size: 18px; font-weight: bold;'>Screenshot of the output of <tt>aws s3 ls --summarize --human-readable</tt> showing the total size of the actual data that was processed</div>

![Screenshot 2](figures/appendix2.PNG)
![Screenshot 3](figures/appendix3.PNG)
<hr>

<div style='font-size: 18px; font-weight: bold;'>Screenshot of the spark list of workers from the dashboard</div>

![Screenshot 4](figures/appendix4.PNG)<br>
![Screenshot 5](figures/appendix5.PNG)<br>
![Screenshot 6](figures/appendix6.PNG)