<a href="https://colab.research.google.com/github/limaOlima/news-map/blob/master/GDelt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gdelt Cookbook
http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf

## General


**GlobalEventID**. (integer) Globally unique identifier assigned to each event record that uniquely
identifies it in the master dataset. NOTE: While these will often be sequential with date, this is
NOT always the case and this field should NOT be used to sort events by date: the date fields
should be used for this. NOTE: There is a large gap in the sequence between February 18, 2015
and February 19, 2015 with the switchover to GDELT 2.0 – these are not missing events, the ID
sequence was simply reset at a higher number so that it is possible to easily distinguish events
created after the switchover to GDELT 2.0 from those created using the older GDELT 1.0 system.

**Day**. (integer) Date the event took place in YYYYMMDD format. See DATEADDED field for
YYYYMMDDHHMMSS date.

**MonthYear**. (integer) Alternative formatting of the event date, in YYYYMM format.

**Year**. (integer) Alternative formatting of the event date, in YYYY format.

**FractionDate**. (floating point) Alternative formatting of the event date, computed as YYYY.FFFF,
where FFFF is the percentage of the year completed by that day. This collapses the month and
day into a fractional range from 0 to 0.9999, capturing the 365 days of the year. The fractional
component (FFFF) is computed as (MONTH * 30 + DAY) / 365. This is an approximation and does
not correctly take into account the differing numbers of days in each month or leap years, but
offers a simple single-number sorting mechanism for applications that wish to estimate the
rough temporal distance between dates.
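
The approximation above can be reproduced in a few lines. This is a minimal sketch that applies the quoted formula literally; the helper name `fraction_date` is an assumption made for illustration and is not part of GDELT or the `gdelt` package.

```python
def fraction_date(year: int, month: int, day: int) -> float:
    """Approximate YYYY.FFFF as described above: FFFF = (MONTH * 30 + DAY) / 365."""
    return year + (month * 30 + day) / 365


# 20 February 2020: (2 * 30 + 20) / 365 = 80 / 365
print(fraction_date(2020, 2, 20))
```

Because every month is treated as 30 days, the value is only suitable for rough sorting and distance estimates, as the description notes.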

## Actor Attributes



**Actor1Code**. (string) The complete raw CAMEO code for Actor1 (includes geographic, class,
ethnic, religious, and type classes). May be blank if the system was unable to identify an Actor1.

**Actor1Name**. (string) The actual name of the Actor1. In the case of a political leader or
organization, this will be the leader’s formal name (GEORGE W BUSH, UNITED NATIONS), for a
geographic match it will be either the country or capital/major city name (UNITED STATES /
PARIS), and for ethnic, religious, and type matches it will reflect the root match class (KURD,
CATHOLIC, POLICE OFFICER, etc). May be blank if the system was unable to identify an Actor1.

**Actor1CountryCode**. (string) The 3-character CAMEO code for the country affiliation of Actor1.
May be blank if the system was unable to identify an Actor1 or determine its country affiliation
(such as “UNIDENTIFIED GUNMEN”).

**Actor1KnownGroupCode**. (string) If Actor1 is a known IGO/NGO/rebel organization (United
Nations, World Bank, al-Qaeda, etc) with its own CAMEO code, this field will contain that code.

**Actor1EthnicCode**. (string) If the source document specifies the ethnic affiliation of Actor1 and
that ethnic group has a CAMEO entry, the CAMEO code is entered here. NOTE: a few special
groups like ARAB may also have entries in the type column due to legacy CAMEO behavior.
NOTE: this behavior is highly experimental and may not capture all affiliations properly – for
more comprehensive and sophisticated identification of ethnic affiliation, it is recommended
that users use the GDELT Global Knowledge Graph’s ethnic, religious, and social group
taxonomies and post-enrich actors from the GKG.

**Actor1Religion1Code**. (string) If the source document specifies the religious affiliation of Actor1
and that religious group has a CAMEO entry, the CAMEO code is entered here. NOTE: a few
special groups like JEW may also have entries in the geographic or type columns due to legacy
CAMEO behavior. NOTE: this behavior is highly experimental and may not capture all affiliations
properly – for more comprehensive and sophisticated identification of religious affiliation, it is
recommended that users use the GDELT Global Knowledge Graph’s ethnic, religious, and social
group taxonomies and post-enrich actors from the GKG.

**Actor1Religion2Code**. (string) If multiple religious codes are specified for Actor1, this contains
the secondary code. Some religion entries automatically use two codes, such as Catholic, which
invokes Christianity as Code1 and Catholicism as Code2.

**Actor1Type1Code**. (string) The 3-character CAMEO code of the CAMEO “type” or “role” of
Actor1, if specified. This can be a specific role such as Police Forces, Government, Military,
Political Opposition, Rebels, etc, a broad role class such as Education, Elites, Media, Refugees, or organizational classes like Non-Governmental Movement. Special codes such as Moderate and
Radical may refer to the operational strategy of a group.

**Actor1Type2Code**. (string) If multiple type/role codes are specified for Actor1, this returns the
second code.

**Actor1Type3Code**. (string) If multiple type/role codes are specified for Actor1, this returns the
third code.

## Event Action Attributes

**IsRootEvent**. (integer) The system codes every event found in an entire document, using an
array of techniques to dereference and link information together. A number of previous projects
such as the ICEWS initiative have found that events occurring in the lead paragraph of a
document tend to be the most “important.” This flag can therefore be used as a proxy for the
rough importance of an event to create subsets of the event stream. NOTE: this field refers only
to the first news report to mention an event and is not updated if the event is found in a
different context in other news reports. It is included for legacy purposes – for more precise
information on the positioning of an event, see the Mentions table.

**EventCode**. (string) This is the raw CAMEO action code describing the action that Actor1
performed upon Actor2. NOTE: it is strongly recommended that this field be stored as a string
instead of an integer, since the CAMEO taxonomy can include event codes with leading zeros,
which are harder to tell apart from other event types once stored as integers.

**EventBaseCode**. (string) CAMEO event codes are defined in a three-level taxonomy. For events
at level three in the taxonomy, this yields its level two leaf root node. For example, code “0251”
(“Appeal for easing of administrative sanctions”) would yield an EventBaseCode of “025”
(“Appeal to yield”). This makes it possible to aggregate events at various resolutions of
specificity. For events at levels two or one, this field will be set to EventCode. NOTE: it is
strongly recommended that this field be stored as a string instead of an integer, since the
CAMEO taxonomy can include event codes with leading zeros, which are harder to tell apart
from other event types once stored as integers.

**EventRootCode**. (string) Similar to EventBaseCode, this defines the root-level category the
event code falls under. For example, code “0251” (“Appeal for easing of administrative
sanctions”) has a root code of “02” (“Appeal”). This makes it possible to aggregate events at
various resolutions of specificity. For events at levels two or one, this field will be set to
EventCode. NOTE: it is strongly recommended that this field be stored as a string instead of an
integer, since the CAMEO taxonomy can include event codes with leading zeros, which are
harder to tell apart from other event types once stored as integers.
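
The relationship between the three code fields can be sketched with plain string slicing. This assumes the prefix structure suggested by the “0251” / “025” / “02” example above; `cameo_levels` is a hypothetical helper, not a GDELT or `gdelt` function. The last line shows why integer storage is discouraged.

```python
def cameo_levels(event_code: str):
    """Return (base_code, root_code) for a CAMEO event code kept as a string."""
    base_code = event_code[:3]   # level-two code; unchanged for shorter codes
    root_code = event_code[:2]   # level-one (root) code
    return base_code, root_code


print(cameo_levels("0251"))      # the example code from the text
print(str(int("0251")))          # round-tripping through int drops the leading zero
```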

**QuadClass**. (integer) The entire CAMEO event taxonomy is ultimately organized under four
primary classifications: Verbal Cooperation, Material Cooperation, Verbal Conflict, and Material Conflict. This field specifies this primary classification for the event type, allowing analysis at the
highest level of aggregation. The numeric codes in this field map to the Quad Classes as follows:
1=Verbal Cooperation, 2=Material Cooperation, 3=Verbal Conflict, 4=Material Conflict.
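
A small lookup table makes the numeric codes readable in analysis code. This is only a sketch; the names `QUAD_CLASSES`, `describe_quad_class`, and `is_conflict` are assumptions chosen for illustration.

```python
# Mapping taken from the QuadClass description above
QUAD_CLASSES = {
    1: "Verbal Cooperation",
    2: "Material Cooperation",
    3: "Verbal Conflict",
    4: "Material Conflict",
}


def describe_quad_class(code: int) -> str:
    """Return the label for a QuadClass code, or 'Unknown' for anything else."""
    return QUAD_CLASSES.get(code, "Unknown")


def is_conflict(code: int) -> bool:
    """True for the two conflict classes (3 and 4)."""
    return code in (3, 4)


print(describe_quad_class(4), is_conflict(4))
```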

**GoldsteinScale**. (floating point) Each CAMEO event code is assigned a numeric score from -10 to
+10, capturing the theoretical potential impact that type of event will have on the stability of a
country. This is known as the Goldstein Scale. This field specifies the Goldstein score for each
event type. NOTE: this score is based on the type of event, not the specifics of the actual event
record being recorded – thus two riots, one with 10 people and one with 10,000, will both
receive the same Goldstein score. This can be aggregated to various levels of time resolution to
yield an approximation of the stability of a location over time.
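
Aggregating the score over time could look like the following sketch. It assumes a DataFrame indexed by a datetime `dateadded` column with a `goldsteinscale` column, matching the normalized column names used later in this notebook; the three sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows; real data would come from a GDELT events query
events = pd.DataFrame(
    {
        "dateadded": pd.to_datetime(
            ["20200220000000", "20200220121500", "20200221000000"],
            format="%Y%m%d%H%M%S",
        ),
        "goldsteinscale": [-5.0, 4.0, -2.0],
    }
).set_index("dateadded")

# Mean Goldstein score per calendar day
daily = events["goldsteinscale"].resample("D").mean()
print(daily)
```

Changing `"D"` to another pandas frequency string changes the time resolution of the aggregate.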

**NumMentions**. (integer) This is the total number of mentions of this event across all source
documents during the 15 minute update in which it was first seen. Multiple references to an
event within a single document also contribute to this count. This can be used as a method of
assessing the “importance” of an event: the more discussion of that event, the more likely it is
to be significant. The total universe of source documents and the density of events within them
vary over time, so it is recommended that this field be normalized by the average or other
measure of the universe of events during the time period of interest. This field is actually a
composite score of the total number of raw mentions and the number of mentions extracted
from reprocessed versions of each article (see the discussion for the Mentions table). NOTE:
this field refers only to the first news report to mention an event and is not updated if the event
is found in a different context in other news reports. It is included for legacy purposes – for
more precise information on the positioning of an event, see the Mentions table.

**NumSources**. (integer) This is the total number of information sources containing one or more
mentions of this event during the 15 minute update in which it was first seen. This can be used
as a method of assessing the “importance” of an event: the more discussion of that event, the
more likely it is to be significant. The total universe of sources varies over time, so it is
recommended that this field be normalized by the average or other measure of the universe of
events during the time period of interest. NOTE: this field refers only to the first news report to
mention an event and is not updated if the event is found in a different context in other news
reports. It is included for legacy purposes – for more precise information on the positioning of
an event, see the Mentions table.

**NumArticles**. (integer) This is the total number of source documents containing one or more
mentions of this event during the 15 minute update in which it was first seen. This can be used
as a method of assessing the “importance” of an event: the more discussion of that event, the
more likely it is to be significant. The total universe of source documents varies over time, so it
is recommended that this field be normalized by the average or other measure of the universe
of events during the time period of interest. NOTE: this field refers only to the first news report
to mention an event and is not updated if the event is found in a different context in other news
reports. It is included for legacy purposes – for more precise information on the positioning of
an event, see the Mentions table.
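
One possible way to apply the normalization recommended for NumMentions, NumSources, and NumArticles is to divide each value by the mean for its day. This is a sketch under stated assumptions: the choice of the daily mean as the "measure of the universe of events" is ours, the sample values are made up, and the column names `sqldate` and `nummentions` follow the normalized names used later in this notebook.

```python
import pandas as pd

# Made-up sample rows for two days
events = pd.DataFrame(
    {
        "sqldate": [20200220, 20200220, 20200221],
        "nummentions": [7, 3, 10],
    }
)

# Mean mentions per day, broadcast back to every row of that day
day_mean = events.groupby("sqldate")["nummentions"].transform("mean")

# Values above 1 mean "more mentions than the average event that day"
events["mentions_norm"] = events["nummentions"] / day_mean
print(events)
```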

**AvgTone**. (numeric) This is the average “tone” of all documents containing one or more
mentions of this event during the 15 minute update in which it was first seen. The score
ranges from -100 (extremely negative) to +100 (extremely positive). Common values range
between -10 and +10, with 0 indicating neutral. This can be used as a method of filtering the
“context” of events as a subtle measure of the importance of an event and as a proxy for the
“impact” of that event. For example, a riot event with a slightly negative average tone is likely
to have been a minor occurrence, whereas if it had an extremely negative average tone, it
suggests a far more serious occurrence. A riot with a positive score likely suggests a very minor occurrence described in the context of a more positive narrative (such as a report of an attack
occurring in a discussion of improving conditions on the ground in a country and how the
number of attacks per day has been greatly reduced). NOTE: this field refers only to the first
news report to mention an event and is not updated if the event is found in a different context
in other news reports. It is included for legacy purposes – for more precise information on the
positioning of an event, see the Mentions table. NOTE: this provides only a basic tonal
assessment of an article and it is recommended that users interested in emotional measures use
the Mentions and Global Knowledge Graph tables to merge the complete set of 2,300 emotions
and themes from the GKG GCAM system into their analysis of event records.


## Event Geography

**Actor1Geo_Type**. (integer) This field specifies the geographic resolution of the match type and
holds one of the following values: 1=COUNTRY (match was at the country level), 2=USSTATE
(match was to a US state), 3=USCITY (match was to a US city or landmark), 4=WORLDCITY
(match was to a city or landmark outside the US), 5=WORLDSTATE (match was to an
Administrative Division 1 outside the US – roughly equivalent to a US state). This can be used to
filter events by geographic specificity, for example, extracting only those events with a
landmark-level geographic resolution for mapping. Note that matches with codes 1 (COUNTRY),
2 (USSTATE), and 5 (WORLDSTATE) will still provide a latitude/longitude pair, which will be the
centroid of that country or state, but the FeatureID field below will be blank.
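
Filtering by geographic resolution can be sketched as follows. The rows are plain dictionaries with made-up contents, the key names follow the normalized column names used later in this notebook, and `CITY_LEVEL` is a hypothetical name for the two city/landmark codes.

```python
# Codes taken from the Actor1Geo_Type description above
GEO_TYPES = {
    1: "COUNTRY",
    2: "USSTATE",
    3: "USCITY",
    4: "WORLDCITY",
    5: "WORLDSTATE",
}

# City- or landmark-level matches, the most specific resolutions
CITY_LEVEL = {3, 4}

rows = [
    {"actor1geotype": 1, "actor1geofullname": "New Zealand"},
    {"actor1geotype": 3, "actor1geofullname": "Shiawassee County, Michigan, United States"},
]

# Keep only rows precise enough for point-level mapping
city_rows = [row for row in rows if row["actor1geotype"] in CITY_LEVEL]
for row in city_rows:
    print(GEO_TYPES[row["actor1geotype"]], row["actor1geofullname"])
```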

**Actor1Geo_Fullname**. (string) This is the full human-readable name of the matched location. In
the case of a country it is simply the country name. For US and World states it is in the format of
“State, Country Name”, while for all other matches it is in the format of “City/Landmark, State,
Country”. This can be used to label locations when placing events on a map. NOTE: this field
reflects the precise name used to refer to the location in the text itself, meaning it may contain
multiple spellings of the same location – use the FeatureID column to determine whether two
location names refer to the same place.

**Actor1Geo_CountryCode**. (string) This is the 2-character FIPS10-4 country code for the location.

**Actor1Geo_ADM1Code**. (string). This is the 2-character FIPS10-4 country code followed by the
2-character FIPS10-4 administrative division 1 (ADM1) code for the administrative division
housing the landmark. In the case of the United States, this is the 2-character shortform of the
state’s name (such as “TX” for Texas).

**Actor1Geo_ADM2Code**. (string). For international locations this is the numeric Global
Administrative Unit Layers (GAUL) administrative division 2 (ADM2) code assigned to each global
location, while for US locations this is the two-character shortform of the state’s name (such as
“TX” for Texas) followed by the 3-digit numeric county code (following the INCITS 31:200x
standard used in GNIS). For more detail on the contents and computation of this field, please
see the footnoted URL in the GDELT Event Codebook. NOTE: This field may be blank/null in cases where no ADM2
information was available, for some ADM1-level matches, and for all country-level matches.
NOTE: this field may still contain a value for ADM1-level matches depending on how they are
codified in GNS.

**Actor1Geo_Lat**. (floating point) This is the centroid latitude of the landmark for mapping.

**Actor1Geo_Long**. (floating point) This is the centroid longitude of the landmark for mapping.

**Actor1Geo_FeatureID**. (string). This is the GNS or GNIS FeatureID for this location. More
information on these values can be found in Leetaru (2012). NOTE: When Actor1Geo_Type has
a value of 3 or 4 this field will contain a signed numeric value, while it will contain a textual
FeatureID in the case of other match resolutions (usually the country code or country code and
ADM1 code). A small percentage of small cities and towns may have a blank value in this field
even for Actor1Geo_Type values of 3 or 4: this will be corrected in the 2.0 release of GDELT.
NOTE: This field can contain both positive and negative numbers, see Leetaru (2012) for more
information on this.

## Data Management Fields

**DATEADDED**. (integer) This field stores the date the event was added to the master database
in YYYYMMDDHHMMSS format in the UTC timezone. For those needing to access events at 15
minute resolution, this is the field that should be used in queries.

**SOURCEURL**. (string) This field records the URL or citation of the first news report in which
the event was found. In most cases this is the first report that mentioned the event, but due to
the timing and flow of news reports through the processing pipeline, this may not always be the
very first report, although it is among the first few.

## Mentions Table

**GlobalEventID**. (integer) This is the ID of the event that was mentioned in the article.

**EventTimeDate**. (integer) This is the 15-minute timestamp (YYYYMMDDHHMMSS) when the
event being mentioned was first recorded by GDELT (the DATEADDED field of the original event
record). This field can be compared against the next one to identify events being mentioned for
the first time (their first mentions) or to identify events of a particular vintage being mentioned
now (such as filtering for mentions of events at least one week old).
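
Comparing the two timestamps can be sketched with the standard library. The helper name `mention_age` and the sample timestamps are assumptions for illustration; both values use the YYYYMMDDHHMMSS integer format described above.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H%M%S"


def mention_age(event_time: int, mention_time: int) -> timedelta:
    """Time elapsed between first recording of the event and this mention."""
    recorded = datetime.strptime(str(event_time), FMT)
    mentioned = datetime.strptime(str(mention_time), FMT)
    return mentioned - recorded


age = mention_age(20200220000000, 20200227001500)

is_first_mention = age == timedelta(0)        # mention in the same update as the event
is_week_old = age >= timedelta(days=7)        # event of an older "vintage"
print(age, is_first_mention, is_week_old)
```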

**MentionTimeDate**. (integer) This is the 15-minute timestamp (YYYYMMDDHHMMSS) of the
current update. This is identical for all entries in the update file but is included to make it easier
to load the Mentions table into a database.

**MentionType**. (integer) This is a numeric identifier that refers to the source collection the
document came from and is used to interpret the MentionIdentifier in the next column. In
essence, it specifies how to interpret the MentionIdentifier to locate the actual document. At
present, it can hold one of the following values:

> 1 = WEB (The document originates from the open web and the MentionIdentifier is a
fully-qualified URL that can be used to access the document on the web).

> 2 = CITATIONONLY (The document originates from a broadcast, print, or other offline
source in which only a textual citation is available for the document. In this case the
MentionIdentifier contains the textual citation for the document).

> 3 = CORE (The document originates from the CORE archive and the MentionIdentifier
contains its DOI, suitable for accessing the original document through the CORE
website).

> 4 = DTIC (The document originates from the DTIC archive and the MentionIdentifier
contains its DOI, suitable for accessing the original document through the DTIC website).

> 5 = JSTOR (The document originates from the JSTOR archive and the MentionIdentifier
contains its DOI, suitable for accessing the original document through your JSTOR
subscription if your institution subscribes to it).

> 6 = NONTEXTUALSOURCE (The document originates from a textual proxy (such as closed
captioning) of a non-textual information source (such as a video) available via a URL and
the MentionIdentifier provides the URL of the non-textual original source. At present,
this Collection Identifier is used for processing of the closed captioning streams of the
Internet Archive Television News Archive in which each broadcast is available via a URL,
but the URL offers access only to the video of the broadcast and does not provide any
access to the textual closed captioning used to generate the metadata. This code is
used in order to draw a distinction between URL-based textual material (Collection
Identifier 1 (WEB)) and URL-based non-textual material like the Television News Archive).
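
The code list above maps naturally onto an enumeration. This is a sketch only; the class `MentionType`, the set `URL_TYPES`, and the helper `identifier_is_url` are hypothetical names, not part of GDELT or the `gdelt` package.

```python
from enum import IntEnum


class MentionType(IntEnum):
    """Source collections as listed in the MentionType description above."""
    WEB = 1
    CITATIONONLY = 2
    CORE = 3
    DTIC = 4
    JSTOR = 5
    NONTEXTUALSOURCE = 6


# Per the descriptions above, types 1 and 6 carry a URL in MentionIdentifier
URL_TYPES = {MentionType.WEB, MentionType.NONTEXTUALSOURCE}


def identifier_is_url(mention_type: int) -> bool:
    """True when MentionIdentifier should be read as a URL."""
    return MentionType(mention_type) in URL_TYPES


print(MentionType(6).name, identifier_is_url(6))
```

Passing a value outside 1 to 6 raises `ValueError`, which surfaces unexpected codes early instead of silently misreading the identifier.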

**MentionSourceName**. (string) This is a human-friendly identifier of the source of the
document. For material originating from the open web with a URL this field will contain the
top-level domain the page was from. For BBC Monitoring material it will contain “BBC Monitoring”
and for JSTOR material it will contain “JSTOR.” This field is intended for human display of major
sources as well as for network analysis of information flows by source, obviating the
requirement to perform domain or other parsing of the MentionIdentifier field.

**MentionIdentifier**. (string) This is the unique external identifier for the source document. It
can be used to uniquely identify the document and access it if you have the necessary
subscriptions or authorizations and/or the document is public access. This field can contain a
range of values, from URLs of open web resources to textual citations of print or broadcast
material to DOI identifiers for various document repositories. For example, if MentionType is
equal to 1, this field will contain a fully-qualified URL suitable for direct access. If MentionType
is equal to 2, this field will contain a textual citation akin to what would appear in an academic
journal article referencing that document (NOTE that the actual citation format will vary (usually
between APA, Chicago, Harvard, or MLA) depending on a number of factors and no assumptions
should be made on its precise format at this time due to the way in which this data is currently
provided to GDELT – future efforts will focus on normalization of this field to a standard citation
format). If MentionType is 5, the field will contain a numeric or alpha-numeric DOI that can be
typed into JSTOR’s search engine to access the document if your institution has a JSTOR
subscription.

**SentenceID**. (integer) The sentence within the article where the event was mentioned (starting
with the first sentence as 1, the second sentence as 2, the third sentence as 3, and so on). This
can be used similarly to the CharOffset fields below, but reports the event’s location in the
article in terms of sentences instead of characters, which is more amenable to certain measures
of the “importance” of an event’s positioning within an article.

**Actor1CharOffset**. (integer) The location within the article (in terms of English characters)
where Actor1 was found. This can be used in combination with the GKG or other analysis to
identify further characteristics and attributes of the actor. NOTE: due to processing performed
on each article, this may be slightly offset from the position seen when the article is rendered in
a web browser.

**Actor2CharOffset**. (integer) The location within the article (in terms of English characters)
where Actor2 was found. This can be used in combination with the GKG or other analysis to
identify further characteristics and attributes of the actor. NOTE: due to processing performed
on each article, this may be slightly offset from the position seen when the article is rendered in
a web browser.

**ActionCharOffset**. (integer) The location within the article (in terms of English characters)
where the core Action description was found. This can be used in combination with the GKG or
other analysis to identify further characteristics and attributes of the actor. NOTE: due to
processing performed on each article, this may be slightly offset from the position seen when
the article is rendered in a web browser.

**InRawText**. (integer) This records whether the event was found in the original unaltered raw
article text (a value of 1) or whether advanced natural language processing algorithms were
required to synthesize and rewrite the article text to identify the event (a value of 0). See the
discussion on the Confidence field below for more details. Mentions with a value of “1” in this
field likely represent strong detail-rich references to an event.

**Confidence**. (integer) Percent confidence in the extraction of this event from this article. See
the discussion above.

**MentionDocLen**. (integer) The length in English characters of the source document (making it
possible to filter for short articles focusing on a particular event versus long summary articles
that casually mention an event in passing).

**MentionDocTone**. (numeric) The same contents as the AvgTone field in the Events table, but
computed for this particular article. NOTE: users interested in emotional measures should use
the MentionIdentifier field above to merge the Mentions table with the GKG table to access the
complete set of 2,300 emotions and themes from the GCAM system.

**MentionDocTranslationInfo**. (string) This field is internally delimited by semicolons and is used
to record provenance information for machine translated documents indicating the original
source language and the citation of the translation system used to translate the document for
processing. It will be blank for documents originally in English. At this time the field will also be
blank for documents translated by a human translator and provided to GDELT in English (such as
BBC Monitoring materials) – in future this field may be expanded to include information on
human translation pipelines, but at present it only captures information on machine translated
materials. An example of the contents of this field might be “srclc:fra; eng:Moses 2.1.1 /
MosesCore Europarl fr-en / GT-FRA 1.0”. NOTE: Machine translation is often not as accurate as
human translation and users requiring the highest possible confidence levels may wish to
exclude events whose only mentions are in translated reports, while those needing the highest
possible coverage of the non-Western world will find that these events often offer the earliest
glimmers of breaking events or smaller-bore events of less interest to Western media.

> SRCLC. This is the Source Language Code, representing the three-letter ISO639-2 code of
the language of the original source material.

> ENG. This is a textual citation string that indicates the engine(s) and model(s) used to
translate the text. The format of this field will vary across engines and over time and no
expectations should be made on the ordering or formatting of this field. In the example
above, the string “Moses 2.1.1 / MosesCore Europarl fr-en / GT-FRA 1.0” indicates that
the document was translated using version 2.1.1 of the Moses SMT platform, using the
“MosesCore Europarl fr-en” translation and language models, with the final translation
enhanced via GDELT Translingual’s own version 1.0 French translation and language
models. A value of “GT-ARA 1.0” indicates that GDELT Translingual’s version 1.0 Arabic
translation and language models were the sole resources used for translation.
Additional language systems used in the translation pipeline such as word segmentation
systems are also captured in this field such that a value of “GT-ZHO 1.0 / Stanford PKU”
indicates that the Stanford Chinese Word Segmenter was used to segment the text
into individual words and sentences, which were then translated by GDELT
Translingual’s own version 1.0 Chinese (Traditional or Simplified) translation and
language models.
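
Splitting the field into its parts can be sketched as below. The parser assumes the `key:value; key:value` layout of the example string quoted above; `parse_translation_info` is a hypothetical helper, and since the format of the `eng` value is explicitly not guaranteed, it is kept as an opaque string.

```python
def parse_translation_info(value: str) -> dict:
    """Split a MentionDocTranslationInfo string into a {key: value} dict."""
    info = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition(":")
        if sep:  # skip empty or malformed pieces
            info[key.strip().lower()] = val.strip()
    return info


example = "srclc:fra; eng:Moses 2.1.1 / MosesCore Europarl fr-en / GT-FRA 1.0"
parsed = parse_translation_info(example)
print(parsed["srclc"])
print(parsed["eng"])

# A blank field (document originally in English) yields an empty dict
print(parse_translation_info(""))
```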

**Extras**. (string) This field is currently blank, but is reserved for future use to encode special
additional measurements for selected material.

# Libraries and Imports (Compile)

In [2]:
!pip install gdelt
!pip install tqdm


In [3]:
import gdelt

import pandas as pd

from datetime import timedelta, date, datetime

from concurrent.futures import ProcessPoolExecutor

import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth',1000)
gd2 = gdelt.gdelt(version=2)

# Gdelt (Compile)

In [4]:
def daterange(start_date, end_date):
  for n in range(int((end_date - start_date).days + 1)):
      yield start_date + timedelta(n)

def prepareDF(df):
  # Convert dateadded to datetime
  df['dateadded'] = pd.to_datetime(df['dateadded'], format='%Y%m%d%H%M%S')
  
  # use dateadded as index
  df.set_index('dateadded', inplace=True)
  df.sort_index(inplace=True)

  return df

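def gdeltFromTo(start_date, end_date, table="events"):
  # Query GDELT one day at a time and concatenate the daily results
  dfs = []
  for single_date in daterange(start_date, end_date):
    # use a separate name so the imported `date` class is not shadowed
    date_str = single_date.strftime("%Y %m %d")
    print(f"Search for {date_str}")
    # pass the requested table through instead of always querying 'events'
    df = gd2.Search([date_str], table=table, coverage=True, normcols=True)
    df = clear_df(df)
    dfs.append(df)
  df = pd.concat(dfs)
  df = prepareDF(df)
  return df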

In [5]:
def clear_df(df):

  # first, drop a few columns that are not needed
  # (keyword `columns=` is used because a positional axis argument fails on recent pandas)
  df.drop(columns=['globaleventid', 'sqldate', 'monthyear', 'year', 'fractiondate'], inplace=True)

  return df

In [6]:
df = gdeltFromTo(date(2020, 2, 20), date(2020, 3, 20))

Search for 2020 02 20
Search for 2020 02 21
Search for 2020 02 22
Search for 2020 02 23
Search for 2020 02 24
Search for 2020 02 25
Search for 2020 02 26
Search for 2020 02 27
Search for 2020 02 28
Search for 2020 02 29
Search for 2020 03 01
Search for 2020 03 02
Search for 2020 03 03
Search for 2020 03 04
Search for 2020 03 05
Search for 2020 03 06
Search for 2020 03 07
Search for 2020 03 08
Search for 2020 03 09
Search for 2020 03 10
Search for 2020 03 11
Search for 2020 03 12
Search for 2020 03 13
Search for 2020 03 14
Search for 2020 03 15
Search for 2020 03 16
Search for 2020 03 17
Search for 2020 03 18
Search for 2020 03 19
Search for 2020 03 20


In [7]:
df

dateadded,actor1code,actor1name,actor1countrycode,actor1knowngroupcode,actor1ethniccode,actor1religion1code,actor1religion2code,actor1type1code,actor1type2code,actor1type3code,actor2code,actor2name,actor2countrycode,actor2knowngroupcode,actor2ethniccode,actor2religion1code,actor2religion2code,actor2type1code,actor2type2code,actor2type3code,isrootevent,eventcode,cameocodedescription,eventbasecode,eventrootcode,quadclass,goldsteinscale,nummentions,numsources,numarticles,avgtone,actor1geotype,actor1geofullname,actor1geocountrycode,actor1geoadm1code,actor1geoadm2code,actor1geolat,actor1geolong,actor1geofeatureid,actor2geotype,actor2geofullname,actor2geocountrycode,actor2geoadm1code,actor2geoadm2code,actor2geolat,actor2geolong,actor2geofeatureid,actiongeotype,actiongeofullname,actiongeocountrycode,actiongeoadm1code,actiongeoadm2code,actiongeolat,actiongeolong,actiongeofeatureid,sourceurl
2020-02-20 00:00:00,,,,,,,,,,,USA,UNITED STATES,USA,,,,,,,,1,173,"Arrest, detain, or charge with legal action",173,17,4,-5.0,7,7,7,-8.573721,0,,,,,,,,3,"Shiawassee County, Michigan, United States",US,USMI,,42.95,-84.15,1623018,3,"Shiawassee County, Michigan, United States",US,USMI,,42.9500,-84.1500,1623018,https://www.sfgate.com/news/crime/article/Man-charged-in-grisly-slaying-found-unresponsive-15068981.php
2020-02-20 00:00:00,NZL,NEW ZEALAND,NZL,,,,,,,,,,,,,,,,,,1,036,Express intent to meet or negotiate,036,03,1,4.0,4,1,4,-3.174603,1,New Zealand,NZ,NZ,,-42.0000,174.0000,NZ,0,,,,,,,,1,New Zealand,NZ,NZ,,-42.0000,174.0000,NZ,https://www.kamloopsthisweek.com/new-brunswick-green-leader-launches-his-budget-with-a-heart-tour-1.24072517
2020-02-20 00:00:00,NZL,NEW ZEALAND,NZL,,,,,,,,,,,,,,,,,,1,110,"Disapprove, not specified below",110,11,3,-2.0,4,1,4,-1.363636,1,New Zealand,NZ,NZ,,-42.0000,174.0000,NZ,0,,,,,,,,1,New Zealand,NZ,NZ,,-42.0000,174.0000,NZ,https://au.news.yahoo.com/treasurer-chides-labors-wellbeing-budget-221342412--spt.html
2020-02-20 00:00:00,NZL,NEW ZEALAND,NZL,,,,,,,,LAB,EMPLOYEE,,,,,,LAB,,,1,043,Host a visit,043,04,1,2.8,2,1,2,-2.564103,1,New Zealand,NZ,NZ,,-42.0000,174.0000,NZ,1,New Zealand,NZ,NZ,,-42.00,174.00,NZ,1,New Zealand,NZ,NZ,,-42.0000,174.0000,NZ,https://www.watoday.com.au/business/companies/qantas-slashes-flights-as-coronavirus-costs-top-100m-20200220-p542ih.html
2020-02-20 00:00:00,OMN,OMAN,OMN,,,,,,,,SAU,SAUDI ARABIA,SAU,,,,,,,,0,141,Demonstrate or rally,141,14,3,-6.5,8,1,8,-9.840426,1,Oman,MU,MU,,21.0000,57.0000,MU,1,Saudi Arabia,SA,SA,,25.00,45.00,SA,1,Saudi Arabia,SA,SA,,25.0000,45.0000,SA,https://www.660citynews.com/2020/02/19/landmine-blast-kills-6-in-yemeni-defence-ministers-convoy/
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-20 23:45:00,EDU,SCHOOL,,,,,,EDU,,,,,,,,,,,,,1,043,Host a visit,043,04,1,2.8,10,1,10,1.956522,3,"State Of Texas, Texas, United States",US,USTX,TX083,31.2504,-99.2506,1779801,0,,,,,,,,3,"State Of Texas, Texas, United States",US,USTX,,31.2504,-99.2506,1779801,https://www.oaoa.com/news/education/article_0e33df3c-6ae7-11ea-80de-c359d407b411.html
2020-03-20 23:45:00,EDU,COLLEGE,,,,,,EDU,,,,,,,,,,,,,0,043,Host a visit,043,04,1,2.8,6,1,6,1.420217,3,"Sweetwater, Texas, United States",US,USTX,TX353,32.4710,-100.4060,1348139,0,,,,,,,,3,"Sweetwater, Texas, United States",US,USTX,TX353,32.4710,-100.4060,1348139,https://www.reporternews.com/story/life/faith/2020/03/20/mitch-mcvickers-walking-faithfully-late-rich-mullins-footsteps/4978890002/
2020-03-20 23:45:00,EDU,COLLEGE,,,,,,EDU,,,,,,,,,,,,,0,043,Host a visit,043,04,1,2.8,4,1,4,1.420217,3,"Abilene, Texas, United States",US,USTX,TX441,32.4487,-99.7331,1329173,0,,,,,,,,3,"Abilene, Texas, United States",US,USTX,TX441,32.4487,-99.7331,1329173,https://www.reporternews.com/story/life/faith/2020/03/20/mitch-mcvickers-walking-faithfully-late-rich-mullins-footsteps/4978890002/
2020-03-20 23:45:00,EDU,SCHOOL,,,,,,EDU,,,,,,,,,,,,,0,071,Provide economic aid,071,07,2,7.4,1,1,1,-1.255230,4,"Rome, Lazio, Italy",IT,IT07,18350,41.9000,12.4833,-126693,0,,,,,,,,4,"Rome, Lazio, Italy",IT,IT07,18350,41.9000,12.4833,-126693,https://www.northwestgeorgianews.com/rome/news/education/as-sales-drop-local-education-sales-tax-funds-could-be/article_cba88c60-6ae3-11ea-958c-0bd988db05a4.html


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4547636 entries, 2020-02-20 00:00:00 to 2020-03-20 23:45:00
Data columns (total 56 columns):
 #   Column                Dtype  
---  ------                -----  
 0   actor1code            object 
 1   actor1name            object 
 2   actor1countrycode     object 
 3   actor1knowngroupcode  object 
 4   actor1ethniccode      object 
 5   actor1religion1code   object 
 6   actor1religion2code   object 
 7   actor1type1code       object 
 8   actor1type2code       object 
 9   actor1type3code       object 
 10  actor2code            object 
 11  actor2name            object 
 12  actor2countrycode     object 
 13  actor2knowngroupcode  object 
 14  actor2ethniccode      object 
 15  actor2religion1code   object 
 16  actor2religion2code   object 
 17  actor2type1code       object 
 18  actor2type2code       object 
 19  actor2type3code       object 
 20  isrootevent           int64  
 21  eventcode             object 
 22  cameocode

# Actors & NLP (Compile)
Actor frequencies, average tone per actor, and named entity recognition on the actor names.

In [9]:
count_actors = df['actor1name'].value_counts()
count_actors.head(20)

UNITED STATES     492328
UNITED KINGDOM     81833
PRESIDENT          69098
GOVERNMENT         68916
CHINA              64514
POLICE             63318
SCHOOL             61907
CANADA             44791
STUDENT            40153
ITALY              39144
IRAN               35861
AUSTRALIA          35405
COMPANY            34905
NIGERIA            34250
RUSSIA             32996
AFRICA             32455
COMMUNITY          31945
TURKEY             31308
BUSINESS           30745
ISRAEL             28413
Name: actor1name, dtype: int64

In [10]:
len(count_actors)

6778

In [11]:
# Keep only the actor names and the average tone. reset_index() turns the
# 'dateadded' index into a regular column, so no rename is needed.
actors_tone = df.loc[:, ['actor1name', 'actor2name', 'avgtone']].reset_index()
actors_tone

,dateadded,actor1name,actor2name,avgtone
0,2020-02-20 00:00:00,,UNITED STATES,-8.573721
1,2020-02-20 00:00:00,NEW ZEALAND,,-3.174603
2,2020-02-20 00:00:00,NEW ZEALAND,,-1.363636
3,2020-02-20 00:00:00,NEW ZEALAND,EMPLOYEE,-2.564103
4,2020-02-20 00:00:00,OMAN,SAUDI ARABIA,-9.840426
...,...,...,...,...
4547631,2020-03-20 23:45:00,SCHOOL,,1.956522
4547632,2020-03-20 23:45:00,COLLEGE,,1.420217
4547633,2020-03-20 23:45:00,COLLEGE,,1.420217
4547634,2020-03-20 23:45:00,SCHOOL,,-1.255230


In [12]:
# Average only the numeric avgtone column; the date and name columns cannot be averaged
grouped_actors = actors_tone.groupby('actor1name')[['avgtone']].mean()
grouped_actors

actor1name,avgtone
A CABINET MEETING,-2.829695
A US,-3.521703
A. REHMAN MALIK,-2.371963
AACHEN,3.944688
AALBORG,1.315789
...,...
ZTE CORP,-2.137188
ZULU,-1.959366
ZUNI,-1.129898
ZURAB ADEISHVILI,-9.606987


In [13]:
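# Add the mean and the spread of avgtone per actor
tone_by_actor = actors_tone.groupby('actor1name')['avgtone']
grouped_actors_sorted = tone_by_actor.mean().to_frame()
# Note: despite its name, "var" holds the population standard deviation (ddof=0)
grouped_actors_sorted["var"] = tone_by_actor.std(ddof=0)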

# Sort by AvgTone
grouped_actors_sorted = grouped_actors_sorted.sort_values(by=['avgtone'])

# get 5 bad and 5 good scores
selected_df = pd.concat([grouped_actors_sorted.iloc[0:5], grouped_actors_sorted.iloc[-5:]])
selected_df = selected_df.reset_index()
selected_df

,actor1name,avgtone,var
0,MICHAEL JEFFERY,-19.0,0.0
1,ALBANIE,-16.438356,0.0
2,PHILIP ALSTON,-16.052122,8.919691
3,TADEUSZ MAZOWIECKI,-14.830508,0.0
4,NEW HUMAN RIGHTS,-14.785992,0.0
5,MINIST OF TELECOMMUNICATION,8.823529,0.0
6,OBLATES OF MARY IMMACULATE,9.230769,0.0
7,XAVERIAN BROTHERS,9.94152,0.0
8,XINYUAN REAL ESTATE,10.0,0.0
9,ASSURED GUARANTY LTD,10.526316,0.0
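
Nine of the ten extremes above have a spread of 0.0, which is what a single-event actor would show, so their mean tone may reflect just one article. The following is a minimal sketch of one way to guard against that, not part of the original notebook: the helper name `rank_actors_by_tone` and the threshold `min_events` are assumptions, and the toy frame only stands in for `actors_tone`.

```python
import pandas as pd

def rank_actors_by_tone(frame, min_events=2, n=5):
    """Return the n lowest and n highest mean-tone actors that have
    at least `min_events` rows. Both defaults are illustrative assumptions."""
    grouped = frame.groupby("actor1name")["avgtone"]
    stats = pd.DataFrame({
        "events": grouped.count(),          # number of rows per actor
        "mean_tone": grouped.mean(),        # average tone per actor
        "std_tone": grouped.std(ddof=0),    # population standard deviation
    })
    # Drop actors with too few events, then sort from most negative to most positive
    stats = stats[stats["events"] >= min_events].sort_values("mean_tone")
    return pd.concat([stats.head(n), stats.tail(n)])

# Toy data standing in for actors_tone (values are made up)
toy = pd.DataFrame({
    "actor1name": ["A", "A", "B", "C", "C", "C", None],
    "avgtone": [-2.0, -4.0, -19.0, 1.0, 3.0, 2.0, 5.0],
})
ranked = rank_actors_by_tone(toy, min_events=2, n=1)
print(ranked)
```

Here actor `B` has the most negative tone but only one event, so it is filtered out before ranking. On the real data the threshold would need tuning, and `n` should stay below half the number of remaining actors to avoid the head and tail overlapping.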


In [14]:
# Find out why avgtone is so bad or good
interesting_col = ["actor1name", "actor2name", "avgtone", "isrootevent", "goldsteinscale", "nummentions", "sourceurl"]
for index, row in selected_df.iterrows():
    print(f"{row['avgtone']} - {row['actor1name']}")
    print(df[df["actor1name"] == row["actor1name"]][interesting_col].to_markdown())
    print("")

-19.0 - MICHAEL JEFFERY
| dateadded           | actor1name      |   actor2name |   avgtone |   isrootevent |   goldsteinscale |   nummentions | sourceurl                                                            |
|:--------------------|:----------------|-------------:|----------:|--------------:|-----------------:|--------------:|:---------------------------------------------------------------------|
| 2020-02-21 06:30:00 | MICHAEL JEFFERY |          nan |       -19 |             1 |             -9.5 |             5 | http://www.greenfieldreporter.com/2020/02/21/arrests__february_21-4/ |

-16.4383561643836 - ALBANIE
| dateadded           | actor1name   |   actor2name |   avgtone |   isrootevent |   goldsteinscale |   nummentions | sourceurl                                                                              |
|:--------------------|:-------------|-------------:|----------:|--------------:|-----------------:|--------------:|:---------------------------------------------------

In [15]:
actors_tone.describe()

,avgtone
count,4547636.0
mean,-2.161652
std,3.622487
min,-37.5
25%,-4.363636
50%,-2.052786
75%,0.1745201
max,31.10731


In [16]:
#listactors = actors.tolist
#listactors

In [None]:
from collections import Counter

import spacy

# Download the large English model once, then load it
spacy.cli.download("en_core_web_lg")
#spacy.cli.download("en_core_web_md")

nlp = spacy.load('en_core_web_lg')

In [None]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [None]:
actors_head = actors_tone.head(20000)

In [None]:
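# Join the non-missing actor names into one comma-separated text and parse it.
# (str(list) would feed the list's repr, including brackets, quotes and 'nan', to spaCy.)
tokens = nlp(', '.join(actors_head['actor1name'].dropna()))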

Show all entities and how they were categorized (commented out because the output is long)

In [None]:
#show_ents(tokens)

In [None]:
items = [x.text for x in tokens.ents]
Counter(items).most_common(20)

In [None]:
org_list = []
person_list = []
product_list = []

# Sort the entities of the parsed Doc (`tokens`) into per-label lists
for ent in tokens.ents:
    if ent.label_ == 'ORG':
        org_list.append(ent.text)

    elif ent.label_ == 'PERSON':
        person_list.append(ent.text)
    
    elif ent.label_ == 'PRODUCT':
        product_list.append(ent.text)
        
org_counts = Counter(org_list).most_common(20)
person_counts = Counter(person_list).most_common(20)
product_counts = Counter(product_list).most_common(20)


df_org = pd.DataFrame(org_counts, columns =['text', 'count'])
df_person = pd.DataFrame(person_counts, columns =['text', 'count'])
df_products = pd.DataFrame(product_counts, columns =['text', 'count'])



In [None]:
df_org

In [None]:
df_products

In [None]:
df_person
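
The three parallel lists above can also be expressed as a single tally keyed by entity label. This is a minimal sketch using only the standard library, not the notebook's own method: the helper name `top_entities_by_label` is hypothetical, and the sample pairs are hand-written stand-ins for `[(ent.text, ent.label_) for ent in tokens.ents]`, not real spaCy output.

```python
from collections import Counter, defaultdict

def top_entities_by_label(pairs, labels=("ORG", "PERSON", "PRODUCT"), top_n=20):
    """Count (text, label) pairs and return the top_n texts for each wanted label."""
    counters = defaultdict(Counter)
    for text, label in pairs:
        if label in labels:
            counters[label][text] += 1
    # Labels that never occurred map to an empty list
    return {label: counters[label].most_common(top_n) for label in labels}

# Hand-written sample pairs (assumed for illustration only)
sample_pairs = [
    ("POLICE", "ORG"),
    ("POLICE", "ORG"),
    ("CHINA", "GPE"),
    ("ZTE CORP", "ORG"),
    ("PHILIP ALSTON", "PERSON"),
]
top = top_entities_by_label(sample_pairs)
print(top["ORG"])
```

Each value is a list of `(text, count)` tuples, so it can be passed to `pd.DataFrame(..., columns=['text', 'count'])` in the same way as `org_counts` above. Adding another label then only means extending the `labels` tuple.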

In [None]:
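# NOTE: this line appears to come from a separate VADER sentiment example. `df` has
# no 'review' column and `sid` is never defined in this notebook, so it is disabled.
# df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))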


In [None]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning
test = actors_tone.head().copy()
test

In [None]:
# Join the non-missing actor names into one comma-separated text and parse it
tokens = nlp(', '.join(actors_head['actor1name'].dropna()))


In [None]:
def label(doc):
  # Entity labels live on the spans in doc.ents; a Token has no `label_` attribute
  for ent in doc.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

In [None]:
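def label(doc):
    # Assumed completion of the fragment: return the label of the first ORG, PERSON
    # or PRODUCT entity in the doc and record its text; return None if there is none.
    for ent in doc.ents:
        if ent.label_ == 'ORG':
            org_list.append(ent.text)
            return 'ORG'
        elif ent.label_ == 'PERSON':
            person_list.append(ent.text)
            return 'PERSON'
        elif ent.label_ == 'PRODUCT':
            product_list.append(ent.text)
            return 'PRODUCT'
    return None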

In [None]:
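# Parse each actor name (missing names become empty strings so nlp() accepts them),
# then derive an entity label per row
test['text'] = test['actor1name'].fillna('').apply(nlp)
test['ents'] = test['text'].apply(label)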



In [None]:
test
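
The full frame has 4,547,636 rows but only 6,778 distinct `actor1name` values, so labelling each distinct name once and mapping the result back is likely much cheaper than calling the model per row. Below is a minimal sketch of that pattern, under stated assumptions: `label_unique_names` is a hypothetical helper, and `toy_classify` is a made-up stand-in for something like `lambda name: label(nlp(name))`.

```python
import pandas as pd

def label_unique_names(names, classify):
    """Call `classify` once per distinct non-missing name and map results back.

    Missing names stay missing in the returned Series.
    """
    lookup = {name: classify(name) for name in names.dropna().unique()}
    return names.map(lookup)

calls = []  # records every name the classifier was actually asked about

def toy_classify(name):
    # Hypothetical stand-in for an NER call; the rule below is made up
    calls.append(name)
    return "ORG" if name in {"SCHOOL", "COLLEGE"} else "OTHER"

names = pd.Series(["SCHOOL", None, "COLLEGE", "SCHOOL", "OMAN"])
labels = label_unique_names(names, toy_classify)
print(labels.tolist())
print(len(calls))
```

In the notebook this could be used as `actors_tone['ents'] = label_unique_names(actors_tone['actor1name'], ...)`, with the trade-off that each name is classified without any surrounding context.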