# GDELT Web NGrams Reconstruction Pipeline

This notebook demonstrates a three-step pipeline:

1. **Download** GDELT Web NGrams files for a given time range (`step1_gdelt_download.py`).
2. **Reconstruct articles** from downloaded files (`step2_reconstruct_gdelt.py` + `gdelt_wordmatch_multiprocess.py`).
3. **Filter and deduplicate** reconstructed articles using a Boolean query (`step3_filtermerge_db.py`).

Each script is invoked with `%run`, using the same arguments it accepts on the command line.


In [1]:
# Install boolean.py if not already available.
# Comment out this cell if dependency management is handled elsewhere.
%pip install boolean.py


Note: you may need to restart the kernel to use updated packages.





In [6]:
from pathlib import Path

# Base directories
DATA_DIR = Path("gdeltdata")              # Step 1 output (.webngrams.json.gz)
PREPROC_DIR = Path("gdeltpreprocessed")   # Step 2 output (CSVs)
FINAL_CSV = Path("final_filtered_dedup.csv")

# Time range (UTC)
START_TS = "2025-11-25T00:00:00"
END_TS   = "2025-11-25T23:59:00"

# Language filter for reconstruction (None => no language filter)
LANGUAGE = "it"   # for Italian; set to None to disable language filtering

# URL filters (comma-separated string)
URL_FILTER = "repubblica.it,corriere.it"

# Boolean query for step 3
QUERY = '((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)'


## Step 1 – Download GDELT Web NGrams

This step uses `step1_gdelt_download.py` to download `.webngrams.json.gz` files
for the selected time range into `gdeltdata/`.

Decompression is disabled here (`--no-decompress`), because step 2 handles
decompression one file at a time.


In [9]:
%run step1_gdelt_download.py \
  --start $START_TS \
  --end   $END_TS \
  --outdir $DATA_DIR \
  --no-decompress

Time range from 2025-11-25 00:00:00 to 2025-11-25 23:59:00 covers 1440 minute slots.
Target directory for downloads: gdeltdata


Downloaded 188 .gz files into gdeltdata.


Downloading: 100%|██████████| 1440/1440 [03:11<00:00,  7.50file/s]
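
The log reports 1440 minute slots for the one-day range, of which 188 yielded a `.gz` file. The following is a minimal sketch of how such slots could be enumerated, assuming both endpoints are inclusive (which is consistent with the 1440 count). The function name `minute_slots` and the timestamp-based file name are illustrative assumptions, not taken from `step1_gdelt_download.py`.

```python
from datetime import datetime, timedelta


def minute_slots(start: str, end: str):
    """Yield one datetime per minute from start to end, both inclusive."""
    current = datetime.fromisoformat(start)
    last = datetime.fromisoformat(end)
    while current <= last:
        yield current
        current += timedelta(minutes=1)


slots = list(minute_slots("2025-11-25T00:00:00", "2025-11-25T23:59:00"))
print(len(slots))  # 1440

# Assumed naming scheme: a 14-digit UTC timestamp plus the file extension.
first_name = slots[0].strftime("%Y%m%d%H%M%S") + ".webngrams.json.gz"
print(first_name)  # 20251125000000.webngrams.json.gz
```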






## Step 2 – Reconstruct articles from Web NGrams

This step uses:

- `step2_reconstruct_gdelt.py` to iterate over all `.webngrams.json.gz` files in `gdeltdata/`,
- `gdelt_wordmatch_multiprocess.py` internally, to reconstruct full-text articles.

For each `.webngrams.json.gz` file:

1. The file is decompressed to `.json`.
2. Only records that match the language and URL filters are processed.
3. Articles are reconstructed and written to a CSV in `gdeltpreprocessed/`.
4. The `.json` file is removed.
5. Empty CSVs (header only) are deleted.
6. Optionally, the original `.gz` can be deleted (`--delete-gz`).
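
The six steps above can be sketched for a single file as follows. This is an illustration under stated assumptions, not the code in `step2_reconstruct_gdelt.py`: it assumes one JSON object per line with `lang` and `url` fields, treats the URL filter as a substring match, uses hypothetical CSV columns `URL` and `Text`, and takes the actual reconstruction as a caller-supplied `reconstruct` function.

```python
import csv
import gzip
import json
import shutil
from pathlib import Path

FIELDS = ["URL", "Text"]  # hypothetical CSV columns


def keep_record(record, language=None, url_filter=""):
    """Return True if a record passes the language and URL filters."""
    if language is not None and record.get("lang") != language:
        return False
    domains = [d.strip() for d in url_filter.split(",") if d.strip()]
    # Assumption: a URL matches if it contains any listed domain as a substring.
    return not domains or any(d in record.get("url", "") for d in domains)


def process_gz(gz_path, out_dir, reconstruct, language=None,
               url_filter="", delete_gz=False):
    """Process one .json.gz file; return the CSV path, or None if it was empty."""
    gz_path, out_dir = Path(gz_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = gz_path.with_suffix("")             # drops the ".gz" suffix
    csv_path = out_dir / (json_path.stem + ".csv")

    # 1. Decompress to .json.
    with gzip.open(gz_path, "rb") as src, open(json_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    try:
        # 2. Keep only records that pass the filters.
        with open(json_path, encoding="utf-8") as fh:
            records = [json.loads(line) for line in fh if line.strip()]
        kept = [r for r in records if keep_record(r, language, url_filter)]
        # 3. Reconstruct articles and write them to a CSV.
        rows = list(reconstruct(kept))
        with open(csv_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows)
    finally:
        # 4. Remove the temporary .json file.
        json_path.unlink(missing_ok=True)
    # 5. Delete header-only CSVs.
    if not rows:
        csv_path.unlink()
        csv_path = None
    # 6. Optionally delete the original .gz.
    if delete_gz:
        gz_path.unlink()
    return csv_path
```

Decompressing and cleaning up one file at a time keeps at most one uncompressed `.json` on disk at any moment.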


In [None]:
if LANGUAGE is None:
    %run step2_reconstruct_gdelt.py \
      --input-dir $DATA_DIR \
      --output-dir $PREPROC_DIR \
      --url-filter "$URL_FILTER" \
      --processes 8
else:
    %run step2_reconstruct_gdelt.py \
      --input-dir $DATA_DIR \
      --output-dir $PREPROC_DIR \
      --language $LANGUAGE \
      --url-filter "$URL_FILTER" \
      --processes 8


Cell output is omitted here because the step runs slowly and multiprocessing is unreliable inside Jupyter notebooks. The generated CSV files are written to `gdeltpreprocessed/`.
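
One possible way around notebook multiprocessing problems, offered here as an assumption rather than something this notebook does, is to launch the script as a separate Python process. The sketch below only builds the argument list, using the flags already shown in the cell above. The helper name `build_step2_command` is hypothetical.

```python
import sys


def build_step2_command(input_dir, output_dir, url_filter,
                        language=None, processes=8):
    """Build the argument list for step 2; --language is added only if set."""
    cmd = [
        sys.executable, "step2_reconstruct_gdelt.py",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        "--url-filter", url_filter,
        "--processes", str(processes),
    ]
    if language is not None:
        cmd += ["--language", language]
    return cmd


cmd = build_step2_command("gdeltdata", "gdeltpreprocessed",
                          "repubblica.it,corriere.it", language="it")
print(cmd[1:])

# To actually run it outside the kernel:
# import subprocess; subprocess.run(cmd, check=True)
```

Building the list once also removes the need for two nearly identical `%run` branches.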

## Step 3 – Filter and deduplicate articles

This step uses `step3_filtermerge_db.py` to:

1. Read all CSV files from `gdeltpreprocessed/`.
2. Filter rows by a Boolean query that supports:
   - `AND`, `OR`, `NOT`
   - Parentheses
   - Quoted phrases (for multi-word terms)
3. Write all matching rows to a temporary CSV.
4. Deduplicate by URL, keeping the row with the longest `Text` for each URL.
5. Write the final result to `final_filtered_dedup.csv`.
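
To illustrate the query semantics in item 2, here is a small standard-library evaluator. It is a sketch, not the implementation in `step3_filtermerge_db.py` (which presumably builds on `boolean.py`, installed earlier). It assumes the usual precedence `NOT` > `AND` > `OR` and case-insensitive substring matching of each term. Both are assumptions, as are all function names.

```python
import re

# Tokens: parentheses, double-quoted phrases, or bare words.
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def tokenize(query):
    return TOKEN_RE.findall(query)


def parse(tokens):
    """Parse tokens into a nested tuple tree (NOT binds tighter than AND, AND than OR)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        while peek() == "AND":
            advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():
        if peek() == "NOT":
            advance()
            return ("NOT", parse_not())
        return parse_atom()

    def parse_atom():
        tok = advance()
        if tok == "(":
            node = parse_or()
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        if tok in (")", "AND", "OR"):
            raise ValueError("unexpected token: " + tok)
        return ("TERM", tok.strip('"').lower())

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree


def matches(tree, text):
    """Evaluate the parsed query against a text (case-insensitive substring match)."""
    op = tree[0]
    if op == "TERM":
        return tree[1] in text.lower()
    if op == "NOT":
        return not matches(tree[1], text)
    if op == "AND":
        return matches(tree[1], text) and matches(tree[2], text)
    return matches(tree[1], text) or matches(tree[2], text)


QUERY = ('((elezioni OR voto) AND (regionali OR campania)) '
         'OR ((fico OR cirielli) AND NOT veneto)')
tree = parse(tokenize(QUERY))
print(matches(tree, "Elezioni regionali in Campania"))  # True
print(matches(tree, "Fico in visita in Veneto"))        # False
```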


In [11]:
%run step3_filtermerge_db.py \
  --input-dir $PREPROC_DIR \
  --output $FINAL_CSV \
  --query "$QUERY"

Filtering CSV files in gdeltpreprocessed into temporary file final_filtered_dedup.csv.tmp.
Deduplicating by URL and writing final output to final_filtered_dedup.csv.
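
The deduplication rule (one row per URL, keeping the longest `Text`) can be sketched on in-memory rows as shown below. The `URL` column name and the function name are assumptions. The real script works on CSV files and may handle ties or missing values differently.

```python
def dedup_longest(rows):
    """Keep, for each URL, the row whose Text is longest (first one wins ties)."""
    best = {}
    for row in rows:
        url = row["URL"]
        if url not in best or len(row["Text"]) > len(best[url]["Text"]):
            best[url] = row
    return list(best.values())


rows = [
    {"URL": "https://example.com/a", "Text": "short"},
    {"URL": "https://example.com/a", "Text": "a longer text"},
    {"URL": "https://example.com/b", "Text": "other"},
]
result = dedup_longest(rows)
print([r["Text"] for r in result])  # ['a longer text', 'other']
```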
