# LT3_BDCC Final Project_Data Collection

This notebook collects and filters events from the GDELT GKG datasets whose locations include any of the following: `US`, `Romania`, `China`, `Spain`, `Japan`, `Switzerland`, `South Korea`. These are the countries that hold the most bitcoin according to this [article](https://medium.com/@biditex/7-countries-with-the-most-bitcoin-hodlers-503b205d926f). The data were filtered using the FIPS 10-4 country codes, as indicated in the GDELT codebook. A list of country codes can be found [here](https://en.wikipedia.org/wiki/List_of_FIPS_country_codes).

Selected columns: `id`, `timestamp`, `sourceID`, `sourceName`, `themes`, `location`, and `tone`

# Distributed Dask Cluster

In [1]:
from dask.distributed import Client
import dask.dataframe as dd

In [2]:
client = Client("3.23.112.208:8786") # enter the public IP address of dask cluster
client


+-------------+--------+-----------+-------------------------+
| Package     | client | scheduler | workers                 |
+-------------+--------+-----------+-------------------------+
| dask        | 2.30.0 | 2.30.0    | {'2020.12.0', '2.30.0'} |
| distributed | 2.30.1 | 2.30.1    | {'2.30.1', '2020.12.0'} |
| msgpack     | 1.0.0  | 1.0.0     | {'1.0.1', '1.0.0'}      |
| numpy       | 1.19.1 | 1.19.1    | 1.19.2                  |
| tornado     | 6.0.4  | 6.0.4     | {'6.0.4', '6.1'}        |
+-------------+--------+-----------+-------------------------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


Client: Scheduler: tcp://3.23.112.208:8786, Dashboard: http://3.23.112.208:8787/status
Cluster: Workers: 10, Cores: 20, Memory: 83.46 GB


# Functions

In [3]:
def read_data(fp):
    """Read the data from the S3 bucket source"""
    cols = [0,1,2,4,7,9,15]
    
    colnames = ['id', 'timestamp', 'sourceID', 'sourceName', 'themes', 'location',
                'tone']

    ddf = dd.read_csv(fp, storage_options={'anon': True}, assume_missing=True,
                      delimiter='\t', usecols=cols, parse_dates=[1],
                      names=colnames)
    
    # drop rows with nulls in any of the required columns
    ddf = ddf.dropna(how='any', subset=['timestamp', 'themes', 'location', 'tone'])
    
    return ddf

def filter_county(ddf):
    """Filter GDELT events based on the specified countries only"""
    # FIPS10-4 country codes
    countries = ['#US#', '#RO#', '#CH#', '#JA#', '#KS#', '#SZ#', '#SP#']
    countries = '|'.join(countries)

    return ddf[ddf.location.str.contains(countries)].reset_index(drop=True)
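
The matching rule inside `filter_county` can be tried on a small in-memory frame. The sketch below uses plain pandas rather than Dask, and the sample `location` strings are hypothetical values shaped like the ones in the GKG output of this notebook, shortened for illustration. It assumes a row may carry several locations, in which case one match is enough to keep the row.

```python
import pandas as pd

# Hypothetical sample rows; shortened stand-ins for real GKG location strings.
sample = pd.DataFrame({
    "location": [
        "3#Hollywood, California, United States#US#USCA",
        "1#France#FR#FR",
        "1#Japan#JA#JA;1#Germany#GM#GM",
        "4#Bucharest, Bucuresti, Romania#RO#RO10",
    ]
})

# Same FIPS 10-4 codes and the same joined pattern as in filter_county.
countries = ["#US#", "#RO#", "#CH#", "#JA#", "#KS#", "#SZ#", "#SP#"]
pattern = "|".join(countries)

# str.contains treats the pattern as a regular expression,
# so "|" acts as "or" between the country codes.
mask = sample["location"].str.contains(pattern)
filtered = sample[mask].reset_index(drop=True)
print(filtered)
```

Wrapping each code in `#` keeps the match on the country-code field, so a code such as `US` is not matched inside a place name.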

# 2019

## January 2019

Raw size: 65GB

In [4]:
# read data
ddf_201901 = read_data('s3://gdelt-open-data/v2/gkg/201901*.gkg.csv')

In [5]:
# filter data
ddf_201901 = filter_county(ddf_201901)

In [10]:
# count
# ddf_201901.size.compute()

In [6]:
ddf_201901.head()

| | id | timestamp | sourceID | sourceName | themes | location | tone |
|---|---|---|---|---|---|---|---|
| 0 | 20190101000000-0 | 2019-01-01 | 1.0 | http://www.buzznet.com/2018/12/inside-the-live... | TAX_ECON_PRICE;TAX_FNCACT;TAX_FNCACT_ACTRESS;T... | 3#Hollywood, California, United States#US#USCA... | 0.444444444444444,3.11111111111111,2.666666666... |
| 1 | 20190101000000-1 | 2019-01-01 | 1.0 | http://gwdtoday.com/main.asp?SectionID=2&SubSe... | MANMADE_DISASTER_IMPLIED;EDUCATION;SOC_POINTSO... | 2#New York, United States#US#USNY#42.1497#-74.... | 2.73081924577373,3.94451668833984,1.2136974425... |
| 2 | 20190101000000-2 | 2019-01-01 | 1.0 | https://www.lamonitor.com/content/new-mexico-p... | LEADER;TAX_FNCACT;TAX_FNCACT_GOVERNOR;TAKE_OFF... | 2#New Mexico, United States#US#USNM#34.8375#-1... | 0.72463768115942,3.6231884057971,2.89855072463... |
| 3 | 20190101000000-3 | 2019-01-01 | 1.0 | https://www.nbc15.com/content/news/Winter-weat... | CRISISLEX_T01_CAUTION_ADVICE;MANMADE_DISASTER_... | 3#Dane County, Wisconsin, United States#US#USW... | -2.18181818181818,0.727272727272727,2.90909090... |
| 4 | 20190101000000-4 | 2019-01-01 | 1.0 | https://magicvalley.com/entertainment/the-best... | EPU_ECONOMY_HISTORIC;TAX_FNCACT;TAX_FNCACT_COR... | 3#Hollywood, California, United States#US#USCA... | 3.97003745318352,8.68913857677903,4.7191011235... |


In [7]:
# save to csv
ddf_201901.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201901-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201901-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201901-002

## February 2019

Raw size: 64GB

In [8]:
# read data
ddf_201902 = read_data('s3://gdelt-open-data/v2/gkg/201902*.gkg.csv')

In [9]:
# filter data
ddf_201902 = filter_county(ddf_201902)

In [11]:
# count
# ddf_201902.size.compute()

In [12]:
# save to csv
ddf_201902.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201902-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201902-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201902-002

## March 2019

Raw size: 70GB

In [13]:
# read data
ddf_201903 = read_data('s3://gdelt-open-data/v2/gkg/201903*.gkg.csv')

In [14]:
# filter data
ddf_201903 = filter_county(ddf_201903)

In [15]:
# count
# ddf_201903.size.compute()

In [16]:
# save to csv
ddf_201903.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201903-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201903-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201903-002

## April 2019

Raw size: 36GB

In [17]:
# read data
ddf_201904 = read_data('s3://gdelt-open-data/v2/gkg/201904*.gkg.csv')

In [18]:
# filter data
ddf_201904 = filter_county(ddf_201904)

In [19]:
# count
# ddf_201904.size.compute()

In [20]:
# save to csv
ddf_201904.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201904-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201904-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201904-002

## May 2019

Data not available

In [22]:
# read data
ddf_201905 = read_data('s3://gdelt-open-data/v2/gkg/201905*.gkg.csv')

OSError: s3://gdelt-open-data/v2/gkg/201905*.gkg.csv resolved to no files
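
Since several months resolve to no files, one option is to probe each month and record the gaps instead of letting the error stop the run. The following is a minimal sketch, not part of the original workflow: `collect_available` and `fake_reader` are hypothetical names, and the stand-in reader only imitates the `OSError` shown above so the example runs without access to S3.

```python
def collect_available(months, reader):
    """Split months into those that load and those whose glob matches no files."""
    loaded, missing = {}, []
    for month in months:
        path = f"s3://gdelt-open-data/v2/gkg/{month}*.gkg.csv"
        try:
            loaded[month] = reader(path)
        except OSError:
            # Raised when the glob resolves to no files, as in the output above.
            missing.append(month)
    return loaded, missing


def fake_reader(path):
    """Stand-in for read_data (illustration only): pretends only 201904 exists."""
    if "201904" not in path:
        raise OSError(f"{path} resolved to no files")
    return path


loaded, missing = collect_available(["201904", "201905", "201906"], fake_reader)
print(missing)
```

In the notebook itself, `read_data` would be passed in place of `fake_reader`, so that only the months in `loaded` go on to the filter and save steps.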

In [None]:
# filter data
ddf_201905 = filter_county(ddf_201905)

In [None]:
# count
ddf_201905.size.compute()

In [None]:
# save to csv
ddf_201905.to_csv('s3://bdcc2021-aids/bdcc_gdelt/201905-*.csv', index=False)

## June 2019

Data not available

In [23]:
# read data
ddf_201906 = read_data('s3://gdelt-open-data/v2/gkg/201906*.gkg.csv')

OSError: s3://gdelt-open-data/v2/gkg/201906*.gkg.csv resolved to no files

In [None]:
# filter data
ddf_201906 = filter_county(ddf_201906)

In [None]:
# count
ddf_201906.size.compute()

In [None]:
# save to csv
ddf_201906.to_csv('s3://bdcc2021-aids/bdcc_gdelt/201906-*.csv', index=False)

## July 2019

Data not available

In [24]:
# read data
ddf_201907 = read_data('s3://gdelt-open-data/v2/gkg/201907*.gkg.csv')

OSError: s3://gdelt-open-data/v2/gkg/201907*.gkg.csv resolved to no files

In [None]:
# filter data
ddf_201907 = filter_county(ddf_201907)

In [None]:
# count
ddf_201907.size.compute()

In [None]:
# save to csv
ddf_201907.to_csv('s3://bdcc2021-aids/bdcc_gdelt/201907-*.csv', index=False)

## August 2019

Data not available

In [25]:
# read data
ddf_201908 = read_data('s3://gdelt-open-data/v2/gkg/201908*.gkg.csv')

OSError: s3://gdelt-open-data/v2/gkg/201908*.gkg.csv resolved to no files

In [None]:
# filter data
ddf_201908 = filter_county(ddf_201908)

In [None]:
# count
ddf_201908.size.compute()

In [None]:
# save to csv
ddf_201908.to_csv('s3://bdcc2021-aids/bdcc_gdelt/201908-*.csv', index=False)

## September 2019

Data not available

In [26]:
# read data
ddf_201909 = read_data('s3://gdelt-open-data/v2/gkg/201909*.gkg.csv')

OSError: s3://gdelt-open-data/v2/gkg/201909*.gkg.csv resolved to no files

In [None]:
# filter data
ddf_201909 = filter_county(ddf_201909)

In [None]:
# count
ddf_201909.size.compute()

In [None]:
# save to csv
ddf_201909.to_csv('s3://bdcc2021-aids/bdcc_gdelt/201909-*.csv', index=False)

# 2018

## December 2018

Raw size: 62GB

In [22]:
# read data
ddf_201812 = read_data('s3://gdelt-open-data/v2/gkg/201812*.gkg.csv')

In [23]:
# filter data
ddf_201812 = filter_county(ddf_201812)

In [24]:
# count
# ddf_201812.size.compute()

In [25]:
# save to csv
ddf_201812.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201812-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201812-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201812-002

## November 2018

Raw size: 73GB

In [26]:
# read data
ddf_201811 = read_data('s3://gdelt-open-data/v2/gkg/201811*.gkg.csv')

In [27]:
# filter data
ddf_201811 = filter_county(ddf_201811)

In [28]:
# count
# ddf_201811.size.compute()

In [29]:
# save to csv
ddf_201811.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201811-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201811-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201811-002

## October 2018

Raw size: 77GB

In [30]:
# read data
ddf_201810 = read_data('s3://gdelt-open-data/v2/gkg/201810*.gkg.csv')

In [31]:
# filter data
ddf_201810 = filter_county(ddf_201810)

In [32]:
# count
# ddf_201810.size.compute()

In [33]:
# save to csv
ddf_201810.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201810-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201810-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201810-002

## September 2018

Raw size: 73GB

In [34]:
# read data
ddf_201809 = read_data('s3://gdelt-open-data/v2/gkg/201809*.gkg.csv')

In [35]:
# filter data
ddf_201809 = filter_county(ddf_201809)

In [36]:
# count
# ddf_201809.size.compute()

In [37]:
# save to csv
ddf_201809.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201809-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201809-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201809-002

## August 2018

Raw size: 76GB

In [38]:
# read data
ddf_201808 = read_data('s3://gdelt-open-data/v2/gkg/201808*.gkg.csv')

In [39]:
# filter data
ddf_201808 = filter_county(ddf_201808)

In [40]:
# count
# ddf_201808.size.compute()

In [41]:
# save to csv
ddf_201808.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201808-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201808-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201808-002

## July 2018

Raw size: 76GB

In [42]:
# read data
ddf_201807 = read_data('s3://gdelt-open-data/v2/gkg/201807*.gkg.csv')

In [43]:
# filter data
ddf_201807 = filter_county(ddf_201807)

In [44]:
# count
# ddf_201807.size.compute()

In [45]:
# save to csv
ddf_201807.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201807-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201807-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201807-002

## June 2018

Raw size: 80GB

In [46]:
# read data
ddf_201806 = read_data('s3://gdelt-open-data/v2/gkg/201806*.gkg.csv')

In [47]:
# filter data
ddf_201806 = filter_county(ddf_201806)

In [48]:
# count
# ddf_201806.size.compute()

In [49]:
# save to csv
ddf_201806.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201806-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201806-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201806-002

## May 2018

Raw size: 85GB

Need to specify data type and filter non-numeric values in `sourceID` column.

In [50]:
cols = [0,1,2,4,7,9,15]

colnames = ['id', 'timestamp', 'sourceID', 'sourceName', 'themes', 'location',
            'tone']

ddf_201805 = dd.read_csv('s3://gdelt-open-data/v2/gkg/201805*.gkg.csv', 
                  storage_options={'anon': True}, assume_missing=True,
                  delimiter='\t', usecols=cols, parse_dates=[1],
                  names=colnames, dtype={'sourceID': 'object'})

# drop rows with nulls in any of the required columns
ddf_201805 = ddf_201805.dropna(how='any', subset=['timestamp', 'themes', 'location', 'tone'])

In [51]:
# filter data
ddf_201805 = filter_county(ddf_201805)

In [52]:
# remove non-numeric rows on sourceID
ddf_201805 = ddf_201805[ddf_201805.sourceID.apply(lambda x: x.isnumeric(), meta=('sourceID', 'bool'))]

In [53]:
# convert sourceID column data type to float
ddf_201805.sourceID = ddf_201805.sourceID.astype(float)
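
The same cleanup can be illustrated on a small pandas frame. This sketch swaps in `pd.to_numeric(..., errors="coerce")` for the `str.isnumeric` check used in the cells above, so it is an alternative under stated assumptions rather than the method this notebook ran. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sourceID values read as strings (dtype object), with one bad
# entry and one missing entry.
df = pd.DataFrame({"sourceID": ["1", "2", "abc", None, "1"]})

# Values that cannot be parsed as numbers become NaN instead of raising.
df["sourceID"] = pd.to_numeric(df["sourceID"], errors="coerce")

# Drop the rows that failed to parse; the column is now float.
clean = df.dropna(subset=["sourceID"]).reset_index(drop=True)
print(clean)
```

The two approaches are not equivalent: `to_numeric` accepts strings such as `"1.5"` or `"-1"`, which `str.isnumeric` rejects, and it tolerates missing values, whereas calling `isnumeric` on a missing value raises an error.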

In [54]:
# count
# ddf_201805.size.compute()

In [55]:
# save to csv
ddf_201805.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201805-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201805-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201805-002

## April 2018

Raw size: 80GB

In [57]:
# read data
ddf_201804 = read_data('s3://gdelt-open-data/v2/gkg/201804*.gkg.csv')

In [58]:
# filter data
ddf_201804 = filter_county(ddf_201804)

In [59]:
# count
# ddf_201804.size.compute()

In [60]:
# save to csv
ddf_201804.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201804-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201804-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201804-002

## March 2018

Raw size: 85GB

Need to specify data type and filter non-numeric values in `sourceID` column.

In [61]:
cols = [0,1,2,4,7,9,15]

colnames = ['id', 'timestamp', 'sourceID', 'sourceName', 'themes', 'location',
            'tone']

ddf_201803 = dd.read_csv('s3://gdelt-open-data/v2/gkg/201803*.gkg.csv', 
                  storage_options={'anon': True}, assume_missing=True,
                  delimiter='\t', usecols=cols, parse_dates=[1],
                  names=colnames, dtype={'sourceID': 'object'})

# drop rows with nulls in any of the required columns
ddf_201803 = ddf_201803.dropna(how='any', subset=['timestamp', 'themes', 'location', 'tone'])

In [62]:
# filter data
ddf_201803 = filter_county(ddf_201803)

In [63]:
# remove non-numeric rows on sourceID
ddf_201803 = ddf_201803[ddf_201803.sourceID.apply(lambda x: x.isnumeric(), meta=('sourceID', 'bool'))]

In [64]:
# count
# ddf_201803.size.compute()

In [65]:
# save to csv
ddf_201803.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201803-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201803-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201803-002

## February 2018

Raw size: 76GB

In [66]:
# read data
ddf_201802 = read_data('s3://gdelt-open-data/v2/gkg/201802*.gkg.csv')

In [67]:
# filter data
ddf_201802 = filter_county(ddf_201802)

In [68]:
# count
# ddf_201802.size.compute()

In [69]:
# save to csv
ddf_201802.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201802-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201802-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201802-002

## January 2018

Raw size: 80GB

In [70]:
# read data
ddf_201801 = read_data('s3://gdelt-open-data/v2/gkg/201801*.gkg.csv')

In [71]:
# filter data
ddf_201801 = filter_county(ddf_201801)

In [72]:
# count
# ddf_201801.size.compute()

In [73]:
# save to csv
ddf_201801.to_csv('s3://bdcc2021-aids/bdcc_gdelt_v2/201801-*.csv', index=False)

['bdcc2021-aids/bdcc_gdelt_v2/201801-0000.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0001.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0002.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0003.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0004.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0005.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0006.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0007.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0008.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0009.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0010.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0011.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0012.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0013.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0014.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0015.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0016.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0017.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0018.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-0019.csv',
 'bdcc2021-aids/bdcc_gdelt_v2/201801-002

**Note**: Due to technical errors, the rest of the code used for extracting and filtering the 2017 and 2016 data was not saved in this notebook; it follows the same chain of functions used for the 2018 and 2019 data.
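
For reference, the per-month path pattern used throughout this notebook can be generated rather than typed out. The sketch below is not the unsaved 2016 and 2017 code; `month_paths` is a hypothetical helper, and reusing the `bdcc_gdelt_v2` output prefix for those years is an assumption based on the 2018 and 2019 cells.

```python
def month_paths(year,
                src="s3://gdelt-open-data/v2/gkg",
                dst="s3://bdcc2021-aids/bdcc_gdelt_v2"):
    """Yield (month, source glob, output pattern) for a year, newest month first."""
    for m in range(12, 0, -1):
        month = f"{year}{m:02d}"
        yield month, f"{src}/{month}*.gkg.csv", f"{dst}/{month}-*.csv"


jobs = list(month_paths(2017))
for month, source, target in jobs[:2]:
    print(month, source, target)
```

Each tuple could then be fed through the same chain as above: `read_data(source)`, `filter_county(...)`, and `to_csv(target, index=False)`.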