# Raw Data Exploration of GDELT data
## Introduction

A small notebook that examines the raw data downloaded from GDELT for the date range 2019-01-01 to 2019-01-07. The data is unfiltered and therefore includes records from outside the UK. One of the main questions investigated in this notebook is "what variables does the data contain and what do they represent?".

## Library Definitions

In [1]:
import sys
import os
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import pandas as pd
import numpy as np
from src.data import GdeltCleaner
import src.config as cfg

%load_ext autoreload
%autoreload 2

## Ingesting Data

In [2]:
cleaner = GdeltCleaner.GdeltCleaner()

In [3]:
cleaner.load_raw_dataset("uk_jan_01_v2")

In [4]:
raw_data = cleaner.get_raw_dataset()

In [5]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12765 entries, 20190101003000-9 to 20190101210000-1363
Data columns (total 26 columns):
DATE                          12765 non-null float64
SourceCollectionIdentifier    12765 non-null float64
SourceCommonName              12765 non-null object
DocumentIdentifier            12765 non-null object
Counts                        2559 non-null object
V2Counts                      2559 non-null object
Themes                        11522 non-null object
V2Themes                      11522 non-null object
Locations                     12765 non-null object
V2Locations                   12765 non-null object
Persons                       11347 non-null object
V2Persons                     11302 non-null object
Organizations                 10477 non-null object
V2Organizations               9827 non-null object
V2Tone                        12765 non-null object
Dates                         6172 non-null object
GCAM                          12765 n

In [11]:
raw_data.head(1)["V2Tone"].values

array(['-3.16742081447964,0.452488687782805,3.61990950226244,4.07239819004525,26.6968325791855,0,200'],
      dtype=object)

In [10]:
len(raw_data.head(1)["V2Tone"].str.split(","))

1
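The `1` returned above is the length of the one-row Series produced by `head(1)`, not the number of comma-separated fields inside the value. A minimal sketch of counting the fields instead, using a stand-in DataFrame (`sample` is a hypothetical name) that holds the `V2Tone` string printed earlier:

```python
import pandas as pd

# Stand-in for raw_data: one row holding the V2Tone value printed above.
sample = pd.DataFrame(
    {"V2Tone": ["-3.16742081447964,0.452488687782805,3.61990950226244,"
                "4.07239819004525,26.6968325791855,0,200"]}
)

# len() of a Series counts rows, so this is 1 whatever the string contains.
rows = len(sample.head(1)["V2Tone"].str.split(","))

# .str.len() applied to the split lists counts the fields in each row.
fields_per_row = sample["V2Tone"].str.split(",").str.len()

print(rows, fields_per_row.iloc[0])
```

For this value the split yields seven fields, which is worth keeping in mind when reading the TONE description in the field notes.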

# **THIS NEEDS UPDATING**

DATE - The date of publication of the news source, NOT the date of the event

NUMARTS - Total number of source documents containing one or more mentions of this count. **Can be used to assess importance (the higher the value, the more important) once normalised against other counts during the time period of interest**
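One way to read the normalisation suggested for NUMARTS is to divide each record's value by the total for the period of interest. This is an assumption rather than something the codebook prescribes, and the toy values and the `NUMARTS_SHARE` column name below are hypothetical:

```python
import pandas as pd

# Hypothetical NUMARTS values for three records in the same period.
period = pd.DataFrame({"NUMARTS": [1, 3, 6]})

# Share of the period's total article count attributed to each record.
period["NUMARTS_SHARE"] = period["NUMARTS"] / period["NUMARTS"].sum()

print(period)
```

Dividing by the period maximum instead of the total would be an equally plausible choice; which one fits depends on how the importance score is used later.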

COUNTS - List of counts found in this nameset. Each count is separated by a semicolon, fields within a count are separated by a #.
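The two-level delimiting described for COUNTS can be unpacked with two `split` calls. A minimal sketch: the sample string is made up for illustration, `parse_counts` is a hypothetical helper, and the fields are left unnamed because their meaning should be taken from the codebook:

```python
def parse_counts(counts):
    """Split a COUNTS string into a list of counts, each a list of fields."""
    # Missing values (None, NaN) and empty strings yield no counts.
    if not isinstance(counts, str) or not counts:
        return []
    # Semicolons separate counts; '#' separates the fields inside one count.
    return [block.split("#") for block in counts.split(";") if block]

# Hypothetical value, not taken from the dataset.
example = "KILL#3#soldiers#1#Somewhere;ARREST#12##1#Elsewhere;"
parsed = parse_counts(example)
print(parsed)
```

Empty fields are kept as empty strings so that field positions stay aligned across counts.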

THEMES - List of all themes found in the nameset

LOCATIONS - List of all locations found in text. Semicolon delimited blocks with # delimited fields. **This will create a new column called LOCATIONS_COUNT**

PERSONS - List of all person names found in text, semicolon delimited. **This will be converted into PERSONS_COUNT, which will be an integer of the number of persons included in the article.**
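The count columns mentioned for LOCATIONS and PERSONS could be derived by counting semicolon-delimited entries. A rough sketch, not the `GdeltCleaner` implementation; the helper `count_entries` and the toy rows are hypothetical:

```python
import pandas as pd

def count_entries(value):
    """Count non-empty semicolon-delimited entries; missing values count as 0."""
    if not isinstance(value, str):
        return 0
    return sum(1 for entry in value.split(";") if entry.strip())

# Toy rows standing in for the raw data; None mimics a missing value.
toy = pd.DataFrame({"Persons": ["alice smith;bob jones", None, "carol white"]})

toy["PERSONS_COUNT"] = toy["Persons"].apply(count_entries)
print(toy["PERSONS_COUNT"].tolist())
```

The same helper could be applied to the locations column to produce LOCATIONS_COUNT, since each location block is also separated by a semicolon.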

ORGANIZATIONS - List of all company and organization names found in the text. It is highly recommended to produce a histogram and discard names that appear only once or twice, which eliminates most false-positive matches
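The recommendation above amounts to counting how often each name occurs across all records and keeping only the names seen more than a couple of times. A minimal sketch with made-up rows; the threshold of 3 and the helper name `frequent_names` are assumptions:

```python
from collections import Counter

def frequent_names(rows, min_count=3):
    """Return names that occur at least min_count times across all rows."""
    counts = Counter(
        name.strip()
        for row in rows
        if isinstance(row, str)      # skip missing values
        for name in row.split(";")
        if name.strip()
    )
    return {name: n for name, n in counts.items() if n >= min_count}

# Hypothetical semicolon-delimited ORGANIZATIONS values.
rows = ["org a;org b", "org a", "org a;org c", None, "org b"]
print(frequent_names(rows))
```

Here only `org a` survives, because the other names appear once or twice.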

TONE - Comma-delimited list of six core emotional dimensions:
* Tone - Average "tone", ranges from -100 (extremely negative) to +100 (extremely positive). Most common values range between -10 and +10. 0 indicates neutral. Calculated as $\text{Positive Score} - \text{Negative Score}$. Documents with a tone close to 0 may have low emotional response or counteracting Positive and Negative Scores.

* Positive Score - Percentage of all words in the article that were found to have a positive emotional connotation (ranges from 0 to +100)

* Negative Score - Percentage of all words in the article that were found to have a negative emotional connotation (ranges from 0 to +100)

* Polarity - Percentage of words that had matches in the tonal dictionary as an indicator of how emotionally polarized or charged the text is. If polarity is high but tone is neutral this suggests that the text was highly emotionally charged but had roughly equivalent numbers of positively and negatively charged emotional words

* Activity Reference Density - Percentage of words that were active words offering a very basic proxy of the overall "activeness" of the text

* Self/Group Reference Density - Percentage of all words in the article that are pronouns, capturing a combination of self-references and group-based discourse. News media material tends to have very low densities of such language but this can be used to distinguish certain classes of news media and certain contexts
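The relationship between Tone, Positive Score and Negative Score can be checked against the `V2Tone` value printed earlier in this notebook. The sketch below assumes the first six fields follow the order of the list above. The sample carries a seventh field that the list does not cover; it is kept under the placeholder name `extra` (the GKG 2.1 codebook appears to describe it as a word count, which should be verified):

```python
FIELD_NAMES = [
    "tone", "positive_score", "negative_score", "polarity",
    "activity_reference_density", "self_group_reference_density",
    "extra",  # placeholder for the seventh field (assumed to be a word count)
]

raw = ("-3.16742081447964,0.452488687782805,3.61990950226244,"
       "4.07239819004525,26.6968325791855,0,200")

values = [float(v) for v in raw.split(",")]
tone = dict(zip(FIELD_NAMES, values))

# Tone should equal Positive Score minus Negative Score, up to rounding.
difference = tone["positive_score"] - tone["negative_score"]
print(tone["tone"], difference)
```

For this record the two numbers agree to within floating-point rounding, which supports the stated formula.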

CAMEOEVENTIDS - Comma-separated list of GlobalEventIDs from the master GDELT event stream that were found in the same article(s) as this nameset

SOURCES - Semicolon-delimited list of all sources publishing articles mentioning this nameset. For web-based news material this will be the top-level domain the page was from. In the case of BBC Monitoring material this will be noted as "BBC Monitoring" rather than a URL

SOURCEURLS - URLs of sources. Delimiter <UDIV\> is used to separate articles. **This column will be removed as we are not interested in the article links themselves.**
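Dropping the column is a one-line operation in pandas, and the `<UDIV>` delimiter can be split on first if the links are ever needed. A small sketch on a made-up frame; the URLs are placeholders:

```python
import pandas as pd

toy = pd.DataFrame({
    "SOURCES": ["example.com;example.org"],
    "SOURCEURLS": ["http://example.com/a<UDIV>http://example.org/b"],
})

# Split the article links on the <UDIV> delimiter.
urls = toy["SOURCEURLS"].str.split("<UDIV>").iloc[0]

# Remove the column; errors="ignore" keeps this safe if it is already absent.
toy = toy.drop(columns=["SOURCEURLS"], errors="ignore")
print(urls, list(toy.columns))
```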


All of this information is obtained from [this document](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook.pdf)

## Data Cleaning

In [6]:
cleaner.clean_data()
cleaned_data = cleaner.get_clean_dataset()

AssertionError: 6 columns passed, passed data had 7 columns
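The message is consistent with seven-field data being paired with six column names, which would fit the seven comma-separated values seen in `V2Tone`. The `GdeltCleaner` source is not shown here, so this is a guess. A defensive sketch that checks the width before naming the columns; `TONE_COLUMNS` and `split_tone` are hypothetical names:

```python
import pandas as pd

TONE_COLUMNS = [
    "Tone", "PositiveScore", "NegativeScore", "Polarity",
    "ActivityReferenceDensity", "SelfGroupReferenceDensity",
    "Extra",  # placeholder for the seventh field seen in the sample
]

def split_tone(series, names=TONE_COLUMNS):
    """Split a comma-delimited column into named float columns."""
    parts = series.str.split(",", expand=True)
    # Fail early with a readable message if the widths disagree.
    if parts.shape[1] != len(names):
        raise ValueError(
            f"expected {len(names)} fields, found {parts.shape[1]}"
        )
    parts.columns = names
    return parts.astype(float)

sample = pd.Series(["-3.16742081447964,0.452488687782805,3.61990950226244,"
                    "4.07239819004525,26.6968325791855,0,200"])
print(split_tone(sample).iloc[0])
```

Passing only six names to `split_tone` raises the `ValueError`, which makes a mismatch of this kind visible before the columns are assigned.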

In [None]:
cleaned_data.info()

In [None]:
cleaned_data.describe()

In [None]:
cleaned_data.head()

In [None]:
cleaner.save_clean_dataset()

## Data Inspection

#### Single Value Histograms