In [2]:
from src.gdelt_preprocess import process_day, process_month, process_year
from datetime import datetime

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
process_day(datetime(2023, 1, 15), delete_csv=True)

[INFO] Found 96 GKG files for 2023-01-15


Processing 20230115: 100%|██████████| 96/96 [02:10<00:00,  1.36s/it]


[SUCCESS] Saved 593 records to data/preprocessed/gdelt_gkg_files/20230115.parquet


|     | GKGRECORDID | DATE | PAGE_TITLE | URL | ORGANIZATIONS |
| --- | --- | --- | --- | --- | --- |
| 0 | 20230115000000-45 | 2023-01-15 00:00:00 | SHAREHOLDER ALERT: Pomerantz Law Firm Reminds ... | https://www.memphissun.com/news/273370952/shar... | [Exchange Commission, United States District C... |
| 1 | 20230115000000-108 | 2023-01-15 00:00:00 | Bankruptcy being weighed by retailer Bed Bath ... | https://www.texasguardian.com/news/273370719/b... | [Bed Bath Beyond, Goldman Sachs, Walt Disney] |
| 2 | 20230115000000-232 | 2023-01-15 00:00:00 | SHAREHOLDER ALERT: Pomerantz Law Firm Investig... | https://www.californiatelegraph.com/news/27337... | [Sunlight Financial Holdings Inc Sunlight] |
| 3 | 20230115000000-288 | 2023-01-15 00:00:00 | US marines on Okinawa armed with missiles as m... | https://www.texasguardian.com/news/273370720/u... | [Bed Bath Beyond, Goldman Sachs, Walt Disney] |
| 4 | 20230115000000-328 | 2023-01-15 00:00:00 | SHAREHOLDER ALERT: Pomerantz Law Firm Investig... | https://www.saltlakecitysun.com/news/273370989... | [Sunlight Financial Holdings Inc Sunlight] |
| ... | ... | ... | ... | ... | ... |
| 588 | 20230115234500-121 | 2023-01-15 23:45:00 | Asia Stocks Set for Support From Wall Street R... | https://www.bnnbloomberg.ca/asia-stocks-set-fo... | [Morgan Stanley, European Central Bank, Goldma... |
| 589 | 20230115234500-289 | 2023-01-15 23:45:00 | SUSTAINABLE INNOVATION IN MINING A KEY THEME A... | https://www.philippinetimes.com/news/273375919... | [Aspermont Media, Oz Minerals, Artificial Inte... |
| 590 | 20230115234500-470 | 2023-01-15 23:45:00 | Scaramucci Sees Bitcoin at $50,000 to $100,000... | https://www.nbcmiami.com/news/business/money-r... | [Federal Reserve, Nasdaq] |
| 591 | 20230115234500-623 | 2023-01-15 23:45:00 | Lawmakers renew bipartisan push to ban congres... | https://www.cbs58.com/news/lawmakers-renew-bip... | [Warner Bros, Discovery Company, Members Of Co... |


In [5]:
process_day(datetime(2019, 1, 15), delete_csv=True)

[INFO] Found 96 GKG files for 2019-01-15


Processing 20190115: 100%|██████████| 96/96 [03:19<00:00,  2.07s/it]

[WARN] No data collected for 2019-01-15, Parquet will not be saved.





For January 15, 2023, preprocessing yields 593 finance-related news items; for January 15, 2019, it yields none, even though 96 GKG files were found for both days. The discrepancy arises because the actual news headline is reliably available only in the `EXTRASXML` field, which appears to be populated only from around 2020 onward, so earlier records have no headline to extract.

The GDELT GKG (Global Knowledge Graph) CSV files contain several columns, among which the most relevant for our purposes are:

- `GKGRECORDID`: a unique identifier for each GKG record.
- `DATE_RAW`: the timestamp of the article.
- `URL`: the original URL of the news article.
- `THEMES`: tags describing the topics of the article, e.g., `ECON_STOCKMARKET`.
- `LOCATIONS`: locations mentioned in the article.
- `ORGANIZATIONS_RAW`: recognized organizations mentioned.
- `EXTRASXML`: a field containing additional metadata in XML format, including the actual page title.
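
As an illustration of how these fields might be used, the sketch below selects records whose `THEMES` field carries a given tag. It assumes the field is a semicolon-delimited string; the helper name `has_theme` and the sample records are hypothetical and are not taken from `src.gdelt_preprocess`.

```python
def has_theme(themes_raw, wanted: str) -> bool:
    """Return True if a semicolon-delimited THEMES string contains `wanted`."""
    # Missing values (None, NaN) carry no themes.
    if not isinstance(themes_raw, str):
        return False
    # Exact tag match, so "ECON_STOCKMARKET_X" does not count as "ECON_STOCKMARKET".
    return wanted in {t.strip() for t in themes_raw.split(";") if t.strip()}


# Hypothetical sample records, for illustration only.
records = [
    {"GKGRECORDID": "a-1", "THEMES": "ECON_STOCKMARKET;TAX_FNCACT"},
    {"GKGRECORDID": "a-2", "THEMES": "ARMEDCONFLICT"},
    {"GKGRECORDID": "a-3", "THEMES": None},
]

finance = [r["GKGRECORDID"] for r in records if has_theme(r["THEMES"], "ECON_STOCKMARKET")]
print(finance)  # ['a-1']
```

Splitting on the delimiter and comparing whole tags avoids the false positives that a plain substring test would produce.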

In practice, we extract the page title from `EXTRASXML` using a function that parses the `<PAGE_TITLE>` tag. This allows us to obtain the news headline in a structured form:

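```python
import re


def extract_page_title(xml_str: str) -> str:
    """Extract the <PAGE_TITLE> content from an EXTRASXML field."""
    # Missing values (e.g. NaN floats from pandas) are not strings.
    if not isinstance(xml_str, str):
        return ""
    # Non-greedy match: stop at the first closing tag.
    match = re.search(r"<PAGE_TITLE>(.*?)</PAGE_TITLE>", xml_str)
    return match.group(1).strip() if match else ""
```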
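
A quick check of how this function behaves on a populated field, an empty field, and a missing value may help explain the empty 2019 result. The sample strings below are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
import re


def extract_page_title(xml_str) -> str:
    """Same logic as above, repeated so this snippet is self-contained."""
    if not isinstance(xml_str, str):
        return ""
    match = re.search(r"<PAGE_TITLE>(.*?)</PAGE_TITLE>", xml_str)
    return match.group(1).strip() if match else ""


# Hypothetical EXTRASXML values, for illustration only.
populated = "<PAGE_TITLE> Asia Stocks Set for Support </PAGE_TITLE><OTHER>x</OTHER>"
no_title = "<OTHER>x</OTHER>"
missing = float("nan")  # how pandas typically represents an empty cell

print(repr(extract_page_title(populated)))  # 'Asia Stocks Set for Support'
print(repr(extract_page_title(no_title)))   # ''
print(repr(extract_page_title(missing)))    # ''
```

If the preprocessing step discards rows with an empty title (an assumption about `process_day`, not something shown here), a day whose `EXTRASXML` fields never contain the tag would produce no records at all.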

In principle, headlines could be inferred from URLs, since many URLs embed the title as a slug with dashes in place of spaces. In practice this approach is highly inconsistent and too noisy for reliable data extraction.
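
To make the noise concrete, here is a minimal sketch of slug-based inference. The function name `headline_from_url` and the `example.com` URLs are hypothetical; this is not part of the preprocessing code.

```python
from urllib.parse import urlparse


def headline_from_url(url: str) -> str:
    """Guess a headline from the last path segment of a URL."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    # Drop a trailing file extension such as ".html", if present.
    if "." in slug:
        slug = slug.rsplit(".", 1)[0]
    return slug.replace("-", " ").replace("_", " ").strip()


# Hypothetical URLs, for illustration only.
print(headline_from_url("https://example.com/news/asia-stocks-set-for-support.html"))
# asia stocks set for support
print(headline_from_url("https://example.com/news/273370952/"))
# 273370952
print(headline_from_url("https://example.com/article?id=42"))
# article
```

Even in the best case, capitalization and punctuation are lost; in the other two cases the result is an article ID or a generic path segment, with no headline at all.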

Another significant limitation lies in mapping recognized organizations to financial identifiers such as tickers or PERMCO codes. Each article may list multiple recognized entities, often including unrelated companies or organizations. In the output above, for example, the article on the Bed Bath bankruptcy is tagged with Bed Bath Beyond, Goldman Sachs, and Walt Disney. Deciding which entity corresponds to the relevant financial instrument is both challenging and error-prone.
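
The ambiguity can be sketched with a toy lookup. The `TICKERS` table and the `candidate_tickers` helper below are hypothetical and cover only three names from the sample output; a real mapping would need a maintained reference table and name normalization.

```python
# Hypothetical, hand-written lookup for illustration only.
TICKERS = {
    "bed bath beyond": "BBBY",
    "goldman sachs": "GS",
    "walt disney": "DIS",
}


def candidate_tickers(organizations):
    """Return (organization, ticker) pairs for every name found in the lookup."""
    pairs = []
    for org in organizations:
        ticker = TICKERS.get(org.lower())
        if ticker is not None:
            pairs.append((org, ticker))
    return pairs


# ORGANIZATIONS value taken from one row of the sample output above.
orgs = ["Bed Bath Beyond", "Goldman Sachs", "Walt Disney"]
print(candidate_tickers(orgs))
# [('Bed Bath Beyond', 'BBBY'), ('Goldman Sachs', 'GS'), ('Walt Disney', 'DIS')]
```

The lookup returns three equally plausible candidates, and nothing in the organization list indicates which company the article is actually about.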

For all these reasons, we have decided not to use GDELT as the source for our news dataset.