<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">ABSTRACT</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">INTRODUCTION</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h2 style="color: white; font-weight:bold">Background</h2></center>
</div> 

In an era where data-driven decisions are pivotal to success, political campaigns are no exception. Recognizing the power of big data analytics to shape political strategies, this project leverages the Global Database of Events, Language, and Tone (GDELT) to extract actionable insights for future political leaders in the Philippines. By harnessing GDELT's extensive repository of media events and sentiment analyses, the project aims to distill the most pressing issues and public sentiments across the country. This approach ensures that campaign platforms are not only relevant but also resonant with the electorate’s current concerns and needs.

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h2 style="color: white; font-weight:bold">Problem Statement</h2></center>
</div> 

In the digital age, the proliferation of disinformation, particularly through orchestrated campaigns via "troll farms", has emerged as a critical challenge in political campaigning across the Philippines and Southeast Asia. These deceptive practices not only distort public perceptions but also undermine the integrity of democratic processes by swaying voter opinions with false narratives. The difficulty lies in the ability of political campaigns to both identify and counteract these disinformation efforts effectively while also promoting genuine and fact-based discourse.

This project seeks to address the urgent need for sophisticated data-driven strategies that can discern and mitigate the impact of digital disinformation. Utilizing the extensive monitoring capabilities of GDELT to track media events and sentiment, the study aims to equip political leaders with the tools to identify trends and anomalies in public discourse that may indicate the presence of disinformation. By establishing more transparent and factually accurate communication strategies, aspiring political leaders can enhance their campaign platforms, foster a more informed electorate, and strengthen democratic resilience against the corrosive effects of misinformation.

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h2 style="color: white; font-weight:bold">Objective</h2></center>
</div> 

The goal of this study is to dig deeper into the data made freely available by the GDELT Project and to help leaders adjust their goals and strategies accordingly.
The objectives of this study are as follows:
1. asdasd
2. asdasd
3. asdasd

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">DATA SOURCES and DESCRIPTION</h1></center>
</div> 

The **GDELT (Global Database of Events, Language, and Tone) Project** is a comprehensive global monitoring system that tracks news content from broadcasters, print media, and websites in numerous languages across almost every country in the world. It employs advanced techniques to identify and extract the key elements driving global discourse and events, such as people, locations, organizations, topics, sources, sentiments, numerical data, quotations, images, and events. By continuously processing this immense stream of information on a second-by-second basis, GDELT generates a freely accessible open data platform that enables computational analysis of the world's events, narratives, and societal forces in real time.

GDELT has several [datasets](https://www.gdeltproject.org/data.html). This study specifically uses the **Global Knowledge Graph** (GKG) dataset. GKG represents the underlying dimensions, geographic patterns, and network structures inherent in global news coverage. It uses sophisticated natural language processing algorithms to compute and encode a range of codified metadata that captures the latent and contextual aspects of each document. In essence, the GKG interconnects every person, organization, location, numerical value, theme, news source, and event across the globe into one massive unified network. This network captures what is happening worldwide, the associated contexts and involved entities, and the sentiments surrounding these events, providing a comprehensive daily view of our global society.

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">METHODOLOGY</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">DATA PREPROCESSING</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">EXPLORATORY DATA ANALYSIS</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">INSIGHTS</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">RESULTS and DISCUSSIONS</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">RECOMMENDATIONS</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">SCOPE & LIMITATIONS</h1></center>
</div> 

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">CONCLUSION</h1></center>
</div>

<div style="background-color:#019EDB ; padding: 10px 0;">
    <center><h1 style="color: white; font-weight:bold">Paula Martinez - Senator</h1></center>
</div> 

References:


The GDELT Project. (n.d.). [Web page]. Retrieved from https://www.gdeltproject.org/

Channel News Asia. (2022). Paid troll army for hire: Philippines' social media elections influencers. [Web page]. Channel News Asia. Retrieved from https://www.channelnewsasia.com/cna-insider/paid-troll-army-hire-philippines-social-media-elections-influencers-2917556

Rappler. (n.d.). Investigating troll farms: What to look out for. [Web page]. Retrieved from https://www.rappler.com/newsbreak/iq/investigating-troll-farms-what-to-look-out-for


# **SCRATCH**

In [1]:
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
spark = (SparkSession
     .builder
     .master('local[*]')  # Master URL: run locally, using all available cores
     .getOrCreate())

In [3]:
from pyspark.sql.types import (StructType, StructField, StringType,
IntegerType, FloatType, TimestampType, LongType)
import glob
# Define the schema (not final; should be revisited)
schema = StructType([
    StructField("GKGRECORDID", StringType(), True),
    StructField("V2.1DATE", LongType(), True),
    StructField("V2SOURCECOLLECTIONIDENTIFIER", IntegerType(), True),
    StructField("V2SOURCECOMMONNAME", StringType(), True),
    StructField("V2DOCUMENTIDENTIFIER", StringType(), True),
    StructField("V1COUNTS", StringType(), True),
    StructField("V2.1COUNTS", StringType(), True),
    StructField("V1THEMES", StringType(), True),
    StructField("V2ENHANCEDTHEMES", StringType(), True),
    StructField("V1LOCATIONS", StringType(), True),
    StructField("V2ENHANCEDLOCATIONS", StringType(), True),
    StructField("V1PERSONS", StringType(), True),
    StructField("V2ENHANCEDPERSONS", StringType(), True),
    StructField("V1ORGANIZATIONS", StringType(), True),
    StructField("V2ENHANCEDORGANIZATIONS", StringType(), True),
    # V1.5TONE is a comma-delimited list of floats, so it is read as a string
    StructField("V1.5TONE", StringType(), True),
    StructField("V2.1ENHANCEDDATES", StringType(), True),
    StructField("V2GCAM", StringType(), True),
    StructField("V2.1SHARINGIMAGE", StringType(), True),
    StructField("V2.1RELATEDIMAGES", StringType(), True),
    StructField("V2.1SOCIALIMAGEEMBEDS", StringType(), True),
    StructField("V2.1SOCIALVIDEOEMBEDS", StringType(), True),
    StructField("V2.1QUOTATIONS", StringType(), True),
    StructField("V2.1ALLNAMES", StringType(), True),
    StructField("V2.1AMOUNTS", StringType(), True),
    StructField("V2.1TRANSLATIONINFO", StringType(), True),
    StructField("V2EXTRASXML", StringType(), True)
])

# Define the path and file patterns for the first 8 days of August 2019
path = '/mnt/data/public/gdeltv2/gkg/'
file_pattern = '2019080[1-8]*.gkg.csv'  # Matches days 01 to 08

# Use glob to list files matching the pattern
files = glob.glob(path + file_pattern)

# Read the files into a DataFrame with the specified schema
df_gkg = spark.read.csv(files, sep='\t', schema=schema)
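
The file pattern above relies on glob's character-class syntax. As a rough check of what `2019080[1-8]*.gkg.csv` would and would not pick up, the sketch below applies the same matching rules with `fnmatchcase` from the standard library. The file names are hypothetical, modelled on the 14-digit timestamps seen in `GKGRECORDID`; whether the data directory also holds translated-feed files is an assumption, not something confirmed here.

```python
from fnmatch import fnmatchcase

# Same pattern as in the cell above.
file_pattern = '2019080[1-8]*.gkg.csv'

# Hypothetical file names (not taken from the actual directory listing).
candidates = [
    '20190801000000.gkg.csv',              # 1 August
    '20190808234500.gkg.csv',              # 8 August
    '20190809000000.gkg.csv',              # 9 August
    '20190810000000.gkg.csv',              # 10 August
    '20190801000000.translation.gkg.csv',  # assumed translated-feed name
]

# Keep only the names the pattern accepts, in their original order.
matched = [name for name in candidates if fnmatchcase(name, file_pattern)]

for name in candidates:
    status = 'match' if name in matched else 'no match'
    print(f'{name}: {status}')
```

Because `*` also matches dots, a name such as the assumed `.translation.gkg.csv` file would be included by this pattern. If such files exist and are not wanted, the pattern may need tightening.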

In [4]:
row_count = df_gkg.count()

In [5]:
print(f"The dataset has {row_count} rows.")

The dataset has 1436716 rows.


This is a sample of the dataset:

In [6]:
df_gkg.limit(3).toPandas()

Unnamed: 0,GKGRECORDID,V2.1DATE,V2SOURCECOLLECTIONIDENTIFIER,V2SOURCECOMMONNAME,V2DOCUMENTIDENTIFIER,V1COUNTS,V2.1COUNTS,V1THEMES,V2ENHANCEDTHEMES,V1LOCATIONS,...,V2GCAM,V2.1SHARINGIMAGE,V2.1RELATEDIMAGES,V2.1SOCIALIMAGEEMBEDS,V2.1SOCIALVIDEOEMBEDS,V2.1QUOTATIONS,V2.1ALLNAMES,V2.1AMOUNTS,V2.1TRANSLATIONINFO,V2EXTRASXML
0,20190801173000-0,20190801173000,1,newstoday.com.bd,http://www.newstoday.com.bd/?option=details&ne...,,,TAX_WORLDLANGUAGES;TAX_WORLDLANGUAGES_ARABIC;T...,"TAX_FNCACT_JUDGES,1022;MEDIA_MSM,296;SOC_SLAVE...",1#Germany#GM#GM#51.5#10.5#GM;1#Chile#CI#CI#-30...,...,"wc:345,c1.1:2,c1.4:1,c12.1:29,c12.10:24,c12.12...",,,,,"1188|112||a richly imagined , engaging and poe...","Man Booker International,95;Edinburgh Universi...","2,previous collections of short,323;3,novels,3...",,
1,20190801173000-1,20190801173000,1,idrw.org,http://idrw.org/ms-velpari-takes-over-from-sun...,,,TAX_FNCACT;TAX_FNCACT_DIRECTOR;TAX_FNCACT_CHIE...,"TAX_FNCACT_CHIEF,275;EDUCATION,430;SOC_POINTSO...","4#Hindustan, India (General), India#IN#IN00#28...",...,"wc:168,c12.1:6,c12.10:14,c12.12:5,c12.13:4,c12...",,,,https://youtube.com/channel/UChCONU0XnVC2671b7...,,"Sunil Kumar,143;Tejas Division,322;Aircraft Pr...",,,<PAGE_AUTHORS>By</PAGE_AUTHORS>
2,20190801173000-2,20190801173000,1,willistonherald.com,https://www.willistonherald.com/news/oil_and_e...,"KILL#4000000##2#Colorado, United States#US#USC...","KILL#4000000##2#Colorado, United States#US#USC...",WB_507_ENERGY_AND_EXTRACTIVES;WB_1702_OILFIELD...,"WB_507_ENERGY_AND_EXTRACTIVES,25;WB_1702_OILFI...","2#Colorado, United States#US#USCO#39.0646#-105...",...,"wc:715,c1.2:6,c1.3:1,c12.1:47,c12.10:88,c12.11...",https://bloximages.chicago2.vip.townnews.com/w...,,,https://youtube.com/channel/UCHR2WhAPYJ6magx0g...,,"Liberty Oilfield Services,26;Tier Four,2357","5,hydraulic fracturing fleets operating,30;23,...",,<PAGE_AUTHORS>Ren&eacute;e Jean rjean@willisto...


In [7]:
df_final = (df_gkg
            .withColumn('COUNTTYPE', F.regexp_extract('V1COUNTS', r'^([^#]+)',
                                                      1))
            .withColumn('LOCATION', F.regexp_extract('V1COUNTS',
                                                     r",\s*([^,#]+)#", 1))
            .withColumn('THEMES_SINGLE',
                        F.explode(F.array_remove(F.split(F.col('V1THEMES'),
                                                         ';'), "")))
            .filter(F.col('LOCATION').rlike('Philippines'))
            .persist()
           )

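The final dataset, filtered to Philippine locations, with the first count type and the first matching location extracted from `V1COUNTS`, and `V1THEMES` exploded to one row per theme: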

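**Note:**
- `persist` is lazy: it only marks `df_final` for caching. The first action computes and stores the result, and later actions reuse it instead of re-running the regex extraction, explode, and filter (it does not go through those steps again). How much this helps depends on how often `df_final` is reused and how much of it fits in memory.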
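
To make the extraction logic in `df_final` easier to inspect, here is a minimal plain-Python sketch of the same two regular expressions and the theme split. The helper names `extract_first` and `split_themes` are illustrative only and are not part of the notebook. The `V1COUNTS` strings are the visible prefixes from the sample rows; their truncated tails are left out rather than guessed. The theme string is shortened, and its trailing semicolon is an assumption.

```python
import re

# Same patterns as in the regexp_extract calls above.
COUNTTYPE_RE = re.compile(r'^([^#]+)')
LOCATION_RE = re.compile(r',\s*([^,#]+)#')


def extract_first(pattern, text):
    """Mimic regexp_extract(col, pattern, 1): group 1 of the first match, else ''."""
    match = pattern.search(text)
    return match.group(1) if match else ''


def split_themes(v1themes):
    """Mimic array_remove(split(col, ';'), ''): split on ';' and drop empty items."""
    return [theme for theme in v1themes.split(';') if theme]


# Visible prefixes of V1COUNTS from the sample output (tails omitted).
ph_counts = 'KILL#22##4#Philippine, Benguet, Philippines#RP'
us_counts = 'KILL#4000000##2#Colorado, United States#US'

print(extract_first(COUNTTYPE_RE, ph_counts))  # count type of the first entry
print(extract_first(LOCATION_RE, ph_counts))   # last comma-separated part before '#'
print(extract_first(LOCATION_RE, us_counts))

# Shortened theme string; the trailing ';' is assumed for illustration.
print(split_themes('TAX_WEAPONS;TAX_WEAPONS_BOMB;'))
```

Both patterns stop at the first match, so a record whose `V1COUNTS` holds several counts contributes only one count type and one location. Since `explode` emits no row for an empty array, records without any theme would also drop out of `df_final`.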

In [8]:
df_final.limit(3).toPandas()

Unnamed: 0,GKGRECORDID,V2.1DATE,V2SOURCECOLLECTIONIDENTIFIER,V2SOURCECOMMONNAME,V2DOCUMENTIDENTIFIER,V1COUNTS,V2.1COUNTS,V1THEMES,V2ENHANCEDTHEMES,V1LOCATIONS,...,V2.1SOCIALIMAGEEMBEDS,V2.1SOCIALVIDEOEMBEDS,V2.1QUOTATIONS,V2.1ALLNAMES,V2.1AMOUNTS,V2.1TRANSLATIONINFO,V2EXTRASXML,COUNTTYPE,LOCATION,THEMES_SINGLE
0,20190801194500-914,20190801194500,1,thejakartapost.com,https://www.thejakartapost.com/news/2019/08/02...,"KILL#22##4#Philippine, Benguet, Philippines#RP...","KILL#22##4#Philippine, Benguet, Philippines#RP...",TAX_WEAPONS;TAX_WEAPONS_BOMB;EPU_CATS_MIGRATIO...,"GENERAL_GOVERNMENT,812;EPU_POLICY_GOVERNMENT,8...",1#Philippines#RP#RP#13#122#RP;1#Libya#LY#LY#25...,...,,,,"Mount Carmel Cathedral,72;Philippine Defense M...","100,people,143;",,<PAGE_AUTHORS>The Jakarta Post</PAGE_AUTHORS>,KILL,Philippines,TAX_WEAPONS
1,20190801194500-914,20190801194500,1,thejakartapost.com,https://www.thejakartapost.com/news/2019/08/02...,"KILL#22##4#Philippine, Benguet, Philippines#RP...","KILL#22##4#Philippine, Benguet, Philippines#RP...",TAX_WEAPONS;TAX_WEAPONS_BOMB;EPU_CATS_MIGRATIO...,"GENERAL_GOVERNMENT,812;EPU_POLICY_GOVERNMENT,8...",1#Philippines#RP#RP#13#122#RP;1#Libya#LY#LY#25...,...,,,,"Mount Carmel Cathedral,72;Philippine Defense M...","100,people,143;",,<PAGE_AUTHORS>The Jakarta Post</PAGE_AUTHORS>,KILL,Philippines,TAX_WEAPONS_BOMB
2,20190801194500-914,20190801194500,1,thejakartapost.com,https://www.thejakartapost.com/news/2019/08/02...,"KILL#22##4#Philippine, Benguet, Philippines#RP...","KILL#22##4#Philippine, Benguet, Philippines#RP...",TAX_WEAPONS;TAX_WEAPONS_BOMB;EPU_CATS_MIGRATIO...,"GENERAL_GOVERNMENT,812;EPU_POLICY_GOVERNMENT,8...",1#Philippines#RP#RP#13#122#RP;1#Libya#LY#LY#25...,...,,,,"Mount Carmel Cathedral,72;Philippine Defense M...","100,people,143;",,<PAGE_AUTHORS>The Jakarta Post</PAGE_AUTHORS>,KILL,Philippines,EPU_CATS_MIGRATION_FEAR_FEAR


# **How many different types of counts were made (e.g., how many arrests, protests)?**

In [9]:
(df_final
 .groupBy('COUNTTYPE')
 .count()
 .orderBy('count', ascending=False)
).show()

+----------+-----+
| COUNTTYPE|count|
+----------+-----+
|      KILL|30830|
|    AFFECT| 5064|
|    ARREST| 4207|
|     SEIZE|  897|
|   PROTEST|  821|
|     WOUND|  369|
|EVACUATION|  145|
|    KIDNAP|   31|
+----------+-----+
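
Since `df_final` has one row per theme, the tallies above count (record, theme) rows rather than distinct articles. The toy sketch below shows the difference between the two ways of counting. All record ids and theme names are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical toy records: (record id, count type, list of themes).
records = [
    ('rec-1', 'KILL', ['THEME_A', 'THEME_B', 'THEME_C']),
    ('rec-2', 'KILL', ['THEME_A']),
    ('rec-3', 'ARREST', ['THEME_B', 'THEME_C']),
]

# One row per (record, theme), analogous to the explode on THEMES_SINGLE.
exploded = [
    (rid, ctype, theme)
    for rid, ctype, themes in records
    for theme in themes
]

# Counting exploded rows: records with more themes weigh more.
rows_per_type = Counter(ctype for _, ctype, _ in exploded)

# Counting each record once: deduplicate on (record id, count type) first.
unique_records = {(rid, ctype) for rid, ctype, _ in exploded}
records_per_type = Counter(ctype for _, ctype in unique_records)

print('rows per count type:   ', dict(rows_per_type))
print('records per count type:', dict(records_per_type))
```

In the notebook, a per-article tally could presumably be obtained in a similar way by calling `dropDuplicates(['GKGRECORDID'])` on `df_final` before the `groupBy`. Which of the two counts is more appropriate depends on the question being asked.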

