# Basic `gdeltPyR` Usage

`gdeltPyR` retrieves [Global Database of Events, Language, and Tone (GDELT) data (version 1.0 or version 2.0)](http://gdeltproject.org/data.html#intro) via [parallel HTTP GET requests](http://docs.python-requests.org/en/v0.10.6/user/advanced/#asynchronous-requests) and is an alternative to [accessing GDELT data via Google BigQuery](http://gdeltproject.org/data.html#googlebigquery).

Performance varies with the number of available cores (i.e., CPUs), internet connection speed, and available RAM. For systems with limited RAM, later iterations of `gdeltPyR` will include an option to store the output directly to disk.

### Memory Considerations

Take your system's specifications into consideration when running large or complex queries. `gdeltPyR` loads each temporary file only long enough to convert it into a `pandas` dataframe (one file per 15 minute interval for 2.0, one file per full day for 1.0 events tables), but GDELT data can be especially large and can exhaust a computer's RAM. For example, Global Knowledge Graph (gkg) table queries can consume large amounts of RAM even when pulling data for only a few days. Before trying month-long queries, try single-day queries, or create a pipeline that pulls several days' worth of data, writes it to disk, frees the memory, and continues to pull more data.
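The pull, write, flush, repeat pipeline described above could be sketched as follows. This is a minimal sketch, not part of `gdeltPyR`: `pull_in_batches` is a hypothetical helper, and it assumes that `search` is any callable that accepts a list of date strings and returns an object with a `pandas`-style `to_csv` method.

```python
import gc
from pathlib import Path


def pull_in_batches(search, dates, out_dir, batch_size=2):
    """Query `dates` a few at a time and write each batch to its own CSV.

    Hypothetical helper (not part of gdeltPyR). `search` is assumed to be a
    callable that takes a list of date strings and returns an object with a
    pandas-style `to_csv(path, index=False)` method.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(dates), batch_size):
        batch = dates[start:start + batch_size]
        frame = search(batch)
        # Name each file after the first and last date in the batch.
        name = f"{batch[0]}_{batch[-1]}".replace(" ", "") + ".csv"
        path = out_dir / name
        frame.to_csv(path, index=False)
        written.append(path)
        # Drop the only reference and collect so the memory can be reclaimed
        # before the next batch is pulled.
        del frame
        gc.collect()
    return written
```

With an instantiated `gd` object, a call might look like `pull_in_batches(lambda d: gd.Search(d, table='gkg', coverage=True), dates, 'gdelt_out')`. A small `batch_size` trades more, smaller files for a lower peak memory footprint.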

### Recommended RAM

It's best to use a system with at least 8 GB of RAM.

# Installation

```bash
pip install gdeltPyR
```

You can also install directly from GitHub:

```bash
pip install git+https://github.com/linwoodc3/gdeltPyR
```

# Basic Usage

[`gdeltPyR`](https://github.com/linwoodc3/gdeltPyR) queries revolve around 4 concepts:

| **Name** | **Description**                                                                                                                                                                                                                                                       | **Input Possibilities/Examples**    |
|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|
| *version*  | (integer)  - Selects the version of GDELT data to query; defaults to version 2.                                                                                                                                                                                   | 1 or 2                          |
| *date*    | (string or list of strings) - Dates to query                                                                                                                                                                                                                      | "2016 10 23" or "2016 Oct 23"   |
| *coverage*| (bool) - For GDELT 2.0, when True, pulls every 15 minute interval in the dates passed in the 'date' parameter. The default is False or None, in which case `gdeltPyR` pulls the latest 15 minute interval for the current day or the last 15 minute interval for a historic day. | True or False or None           |
| *table*   | (string) - The specific GDELT table to pull; the default is the 'events' table. See the [GDELT documentation page for more information](http://gdeltproject.org/data.html#documentation). | 'events' or 'mentions' or 'gkg' |

With these basic concepts, you can run any number of GDELT queries.

```python
# Import the package
import gdelt
gdelt.__version__
```

```
'0.1.13'
```

```python
# Instantiate the gdelt object
gd = gdelt.gdelt(version=2, cores=10)
```

To launch your query, pass in your dates.  When passing multiple dates, pass as a list of strings.  We will time the multi-day query.  
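For longer spans, the list of date strings can be generated rather than typed by hand. The sketch below uses only the standard library; `date_strings` is a hypothetical helper (not part of `gdeltPyR`) that emits the "YYYY MM DD" form shown in the table above.

```python
from datetime import date, timedelta


def date_strings(start, end):
    """Return one "YYYY MM DD" string per day from start to end, inclusive.

    Hypothetical helper for building the list of strings to query.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    days = (end - start).days + 1
    return [(start + timedelta(days=n)).strftime("%Y %m %d") for n in range(days)]


dates = date_strings(date(2016, 10, 21), date(2016, 10, 23))
print(dates)  # ['2016 10 21', '2016 10 22', '2016 10 23']
```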

## Important Date Details for GDELT 1.0 and 2.0
For **GDELT 2.0**, every 15 minute interval is a zipped CSV file, and `gdeltPyR` makes concurrent HTTP GET requests to each file. When the `coverage` parameter is set to *True*, each full day of data has 96 files to pull, one per 15 minute interval. If you are pulling the current day and `coverage` is set to *True*, `gdeltPyR` pulls all the intervals leading up to the latest 15 minute interval. When `coverage` is *False*, the package pulls the last 15 minute interval when querying a historical date and the latest 15 minute interval when querying the current date. Additionally, GDELT 2.0 data only goes back as far as Feb 2015. The [additional features of GDELT 2.0 are discussed here](http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/).
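The interval arithmetic above can be made concrete with a short sketch. It mirrors the behavior described in this section rather than `gdeltPyR`'s internal code: `intervals_for_day` is a hypothetical function, and it assumes `day` and `now` are naive datetimes in the same time zone.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)


def intervals_for_day(day, now, coverage=False):
    """List the 15 minute marks to request for `day` (midnight of that date).

    Hypothetical illustration of the coverage rules described in the text.
    """
    end_of_day = day + timedelta(days=1) - INTERVAL  # the 23:45 mark
    # Round `now` down to the latest 15 minute mark that has passed.
    latest = now - timedelta(minutes=now.minute % 15,
                             seconds=now.second,
                             microseconds=now.microsecond)
    last = min(end_of_day, latest)
    if last < day:
        return []  # the queried day has not started yet
    if not coverage:
        return [last]  # only the last (or latest) interval
    count = (last - day) // INTERVAL + 1
    return [day + n * INTERVAL for n in range(count)]


now = datetime(2016, 10, 23, 10, 7)
print(len(intervals_for_day(datetime(2016, 10, 22), now, coverage=True)))  # 96
print(len(intervals_for_day(datetime(2016, 10, 23), now, coverage=True)))  # 41
```

A historic day yields the full 96 marks, while the current day stops at the most recent mark (10:00 in this example, giving 41).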

**GDELT 1.0** releases the previous day's data at 6 AM EST the next day (for example, if today is 23 Oct, the 22 Oct results become available at 6 AM Eastern on 23 Oct).
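As a rough illustration of that release schedule, the sketch below works out the most recent date that should be queryable in GDELT 1.0. `latest_v1_date` is a hypothetical helper, and it assumes the datetime passed in is already expressed in US Eastern time.

```python
from datetime import date, datetime, time, timedelta


def latest_v1_date(now_eastern):
    """Most recent date whose GDELT 1.0 data should already be released.

    Hypothetical helper; `now_eastern` is assumed to be a naive datetime
    already expressed in US Eastern time.
    """
    release = time(6, 0)
    # Before 6 AM, yesterday's data is not out yet, so step back two days.
    lag = 1 if now_eastern.time() >= release else 2
    return now_eastern.date() - timedelta(days=lag)


print(latest_v1_date(datetime(2016, 10, 23, 9, 30)))  # 2016-10-22
print(latest_v1_date(datetime(2016, 10, 23, 5, 59)))  # 2016-10-21
```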

# The Query

To launch your query, just pass the dates to `Search`; multiple dates go in as a list of strings. First, some information on my OS and core count.

```python
import platform
import multiprocessing

# Report the operating system and the number of available cores
print(platform.platform())
print(multiprocessing.cpu_count())
```

```
macOS-14.1-x86_64-i386-64bit
16
```


And now the query.

```python
%time results = gd.Search(['2023 10 19','2023 10 20'],table='events',coverage=True)
```

```
CPU times: user 1.23 s, sys: 473 ms, total: 1.7 s
Wall time: 11.4 s
```


Let's get an idea of the number of results returned.

```python
results.info(memory_usage='deep',show_counts=True,verbose=True)
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361591 entries, 0 to 361590
Data columns (total 62 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   GLOBALEVENTID          361591 non-null  int64  
 1   SQLDATE                361591 non-null  int64  
 2   MonthYear              361591 non-null  int64  
 3   Year                   361591 non-null  int64  
 4   FractionDate           361591 non-null  float64
 5   Actor1Code             331779 non-null  object 
 6   Actor1Name             331779 non-null  object 
 7   Actor1CountryCode      221099 non-null  object 
 8   Actor1KnownGroupCode   4609 non-null    object 
 9   Actor1EthnicCode       2266 non-null    object 
 10  Actor1Religion1Code    5234 non-null    object 
 11  Actor1Religion2Code    941 non-null     object 
 12  Actor1Type1Code        154085 non-null  object 
 13  Actor1Type2Code        10563 non-null   object 
 14  Actor1Type3Code        272 non-null 
```

In ~11 seconds, `gdeltPyR` returned 361,591 rows with 62 columns of data. With the data in a tidy format, GDELT data can be analyzed with any number of [`pandas` data analysis pipelines and techniques](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).