# Trying out ways to retrieve datasets

Quick experiments with different methods for retrieving the datasets, to see which ones could be useful.

# GDELT & Google BigQuery

Import the BigQuery library and set up the client

In [8]:
import requests
from google.cloud import bigquery

In [9]:
client = bigquery.Client()

In [4]:
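# Reference the public GDELT v2 dataset and list its tables.
# DatasetReference replaces the deprecated client.dataset() helper.
dataset_ref = bigquery.DatasetReference("gdelt-bq", "gdeltv2")
dataset = client.get_dataset(dataset_ref)
tables = list(client.list_tables(dataset))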

Set up a dry-run query

In [5]:
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
        SELECT GKGRECORDID, DATE, SourceCollectionIdentifier, SourceCommonName, V2Themes, V2Tone, Dates, GCAM, Amounts, TranslationInfo
        FROM `gdelt-bq.gdeltv2.gkg`
        LIMIT 300
        """

Run the query as a dry run and display the estimated number of bytes it would process

In [6]:
query_job = client.query(query, job_config=job_config)

print("This query will process {} bytes.".format(query_job.total_bytes_processed))

This query will process 14948953384706 bytes.


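`14948953384706` bytes is about `14.95` TB of queried data. The `LIMIT 300` clause does not help here: BigQuery bills for the columns it scans, not for the rows it returns. This is far too much for the free tier, and comes to roughly €90 of credits in Google BigQuery.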
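The arithmetic behind that estimate can be sketched as follows. This is a rough helper, not BigQuery's own billing logic: the function names are made up for illustration, and both the price per TiB and the size of the free allowance are assumptions that should be checked against the current pricing page.

```python
def bytes_to_tb(num_bytes):
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return num_bytes / 10**12


def bytes_to_tib(num_bytes):
    """Convert bytes to binary tebibytes (2**40 bytes)."""
    return num_bytes / 2**40


def estimate_cost(num_bytes, price_per_tib, free_tib=1.0):
    """Estimate on-demand query cost.

    price_per_tib and free_tib are assumptions supplied by the caller,
    not values read from BigQuery.
    """
    billable_tib = max(bytes_to_tib(num_bytes) - free_tib, 0.0)
    return billable_tib * price_per_tib


scanned = 14948953384706  # bytes reported by the dry run
print(f"{bytes_to_tb(scanned):.2f} TB scanned")

# Placeholder price; replace with the rate that applies to your project.
print(f"Estimated cost: {estimate_cost(scanned, price_per_tib=6.25):.2f}")
```

Keeping the price as a parameter makes it easy to rerun the estimate when the rate or currency changes.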

# Directly retrieve CSVs from GDELT

The individual parts of the dataset can also be retrieved manually, one file at a time. A `HEAD` request reports the file size without downloading the file.

In [10]:
url = "http://data.gdeltproject.org/gdeltv2/20150218224500.translation.gkg.csv.zip"
response = requests.head(url)
if response.status_code == 200:
    print("File size:", response.headers['Content-Length'], "bytes")
else:
    print("Error:", response.status_code)

File size: 9117874 bytes
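To fetch more than one part, the URLs could be generated from timestamps. The sketch below assumes that files follow the naming pattern of the URL above and appear at 15-minute intervals; `gkg_urls` is a hypothetical helper, and not every generated slot is guaranteed to exist on the server, so each URL should still be checked (for example with the `HEAD` request above).

```python
from datetime import datetime, timedelta

BASE_URL = "http://data.gdeltproject.org/gdeltv2"
SUFFIX = "translation.gkg.csv.zip"  # taken from the URL used above


def gkg_urls(start, end, step_minutes=15):
    """Yield one archive URL per time slot between start and end (inclusive).

    The 15-minute step is an assumption about the publication interval.
    """
    step = timedelta(minutes=step_minutes)
    current = start
    while current <= end:
        yield f"{BASE_URL}/{current:%Y%m%d%H%M%S}.{SUFFIX}"
        current += step


urls = list(gkg_urls(datetime(2015, 2, 18, 22, 45), datetime(2015, 2, 18, 23, 15)))
for u in urls:
    print(u)
```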


# LexisNexis

Even though the API documentation explicitly states that it is not meant to be used from a script, I still tried and got the following:

In [11]:
url = "http://advance.lexis.com.proxy.uba.uva.nl/api/search?q=burden%20of%20proof&collection=cases&qlang=bool&context=1516831"
response = requests.get(url)

# print(response.text)

We get a URL to a login screen, which means you have to log in regardless of the proxy, making it impractical to script this at a large scale.
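A check like this could be automated instead of reading `response.text` by eye. The sketch below is only a heuristic: `redirected_to_login` is a hypothetical helper, the marker words are guesses, and the login URL in the example is invented for illustration. With `requests`, the final URL after redirects is available as `response.url`.

```python
from urllib.parse import urlparse


def redirected_to_login(requested_url, final_url, markers=("login", "signin", "sso")):
    """Return True if the request ended up on a different page that looks like a login screen.

    The marker words are assumptions; adjust them to the site in question.
    """
    requested, final = urlparse(requested_url), urlparse(final_url)
    moved = (requested.netloc, requested.path) != (final.netloc, final.path)
    target = (final.netloc + final.path).lower()
    return moved and any(marker in target for marker in markers)


# Usage with requests would look like:
#   response = requests.get(url)
#   redirected_to_login(url, response.url)
requested = "http://advance.lexis.com.proxy.uba.uva.nl/api/search?q=burden%20of%20proof"
final = "https://login.example.org/login?target=search"  # hypothetical redirect target
print(redirected_to_login(requested, final))
```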