Add News Dataset Files by GeoffreySie · Pull Request #12 · pgmpy/example_datasets

GeoffreySie · 2026-03-11T22:58:17Z

The News dataset is one of the datasets suggested for addition in pgmpy/pgmpy#2620, and this PR resolves pgmpy/pgmpy#2787.

The dataset is the topic_doc_mean_n5000_k3477_seed_1.csv dataset downloaded from https://www.fredjo.com/

DARHWOLF

Nice work!

May I request the changes mentioned in the comments, please?
Also, I see that the dataset is already available in .csv format. Was there any specific reason to convert it to .txt?
Further, could you clearly specify how this conversion was done, and how you verified that the contents of the dataset were not modified in the process?

news/README.md

DARHWOLF · 2026-03-11T23:31:58Z

README.md

+1. [example-causal-datasets](https://github.com/cmu-phil/example-causal-datasets): CC0 1.0 Universal. Last synced on 2026-02-05.

-3.[Twins-datasets](http://www.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html)
+2. [News-dataset] (https://www.fredjo.com/): Last downloaded on 2026-03-11.


This link points to the author's personal website. May I request you to find a better alternative here? Preferably a direct link to the webpage where this dataset can be accessed.

Also, please include the exact link to the dataset.

ankurankan · 2026-03-12T17:14:39Z

I see that there are multiple datasets in the zip file. Which dataset has been added in this PR?

GeoffreySie · 2026-03-12T23:03:11Z

Thank you both for the advice! I've updated the README links so that they point directly to the dataset download.

> Also, I see that the dataset is already available in .csv format. Was there any specific reason to convert it to .txt?

I saw that #3 had a similar comment and decided to do the same for this.

> Further, could you clearly specify how this conversion was done, and how you verified that the contents of the dataset were not modified in the process?

Sure! The News dataset is split into .x and .y files, where .x contains a sparse representation of the features and .y contains treatment, y_factual, y_counterfactual, mu0, and mu1. To make this similar to the other datasets in this repository, I wrote a script that expands the sparse representation in .x into dense columns and then joins them side by side with the columns from .y. The result is written as tab-separated text by passing a .txt output path and sep='\t' to pandas to_csv. Below is the exact script I used to do this.

import pandas as pd
from scipy.sparse import coo_matrix


def merge_x_and_y(x_file, y_file, output_file):
    # The first line of the .x file holds the matrix dimensions
    with open(x_file, 'r') as f:
        header = f.readline().strip().split(',')
        n_rows, n_cols = int(header[0]), int(header[1])

    # Read the remaining lines of the .x file as (row, col, val) triplets
    df = pd.read_csv(x_file, skiprows=1, header=None, names=['row', 'col', 'val'])

    # coo_matrix uses 0-based indexing, but original dataset uses 1-based indexing
    rows = df['row'].values - 1
    cols = df['col'].values - 1
    vals = df['val'].values

    # Create dense matrix filling in all the zeros
    sparse_matrix = coo_matrix((vals.astype(int), (rows, cols)), shape=(n_rows, n_cols))
    dense_matrix = sparse_matrix.toarray()

    # Convert to a pandas DataFrame with 'x_i' as each column in the dense matrix
    df_x = pd.DataFrame(dense_matrix, columns=[f'x_{i}' for i in range(n_cols)])

    # Read the .y file
    df_y = pd.read_csv(y_file, header=None,
                       names=['treatment', 'y_factual', 'y_counterfactual', 'mu0', 'mu1'])

    # Merge x and y dataframes side by side
    df_merged = pd.concat([df_y, df_x], axis=1)

    # Write the merged table as tab-separated text
    df_merged.to_csv(output_file, index=False, sep='\t')
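
On the question of verifying that the contents were not modified, one possible check is to re-read the merged file and compare it against the original .x and .y files. The following is only a minimal sketch: verify_merged is a hypothetical helper name, and it assumes the same file layout as the script above (dimensions on the first line of .x, 1-based row/col/val triplets with no repeated positions, and five unnamed columns in .y).

```python
import numpy as np
import pandas as pd


def verify_merged(x_file, y_file, merged_file):
    """Hypothetical check: does merged_file match the original .x/.y pair?"""
    merged = pd.read_csv(merged_file, sep='\t')

    # Assumed layout: the first line of .x starts with "n_rows,n_cols"
    with open(x_file, 'r') as f:
        header = f.readline().strip().split(',')
        n_rows, n_cols = int(header[0]), int(header[1])

    triplets = pd.read_csv(x_file, skiprows=1, header=None,
                           names=['row', 'col', 'val'])
    y_names = ['treatment', 'y_factual', 'y_counterfactual', 'mu0', 'mu1']
    df_y = pd.read_csv(y_file, header=None, names=y_names)

    # The merged table should have one row per unit and 5 + n_cols columns
    assert merged.shape == (n_rows, len(y_names) + n_cols)

    # The outcome columns should match the .y file
    assert np.allclose(merged[y_names].to_numpy(), df_y.to_numpy())

    dense = merged[[f'x_{i}' for i in range(n_cols)]].to_numpy()

    # Every (row, col, val) triplet should sit at its 0-based position;
    # this also fails if a non-integer value was truncated during conversion
    r = triplets['row'].to_numpy() - 1
    c = triplets['col'].to_numpy() - 1
    assert np.array_equal(dense[r, c], triplets['val'].to_numpy())

    # There should be no non-zero cells beyond those listed in .x
    assert np.count_nonzero(dense) == np.count_nonzero(triplets['val'].to_numpy())
    return True
```

Calling verify_merged with the two original files and the generated .txt file raises an AssertionError on the first mismatch and returns True otherwise.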

> I see that there are multiple datasets in the zip file. Which dataset has been added in this PR?

This is the first dataset, namely the topic_doc_mean_n5000_k3477_seed_1.csv.x and topic_doc_mean_n5000_k3477_seed_1.csv.y.

Happy to find another way to add this dataset instead if required.

GeoffreySie added 2 commits March 11, 2026 22:45

Add News dataset

04f8ba3

Fixed incorrect README formatting

9be1a0c

DARHWOLF requested changes Mar 11, 2026

Merge branch 'main' into add-news-dataset

1c7fcd5

GeoffreySie added 3 commits March 12, 2026 21:41

Updated links to dataset

f28fdec

Merge branch 'add-news-dataset' of https://github.com/GeoffreySie/exa…

e8a8c04

…mple_datasets into add-news-dataset

Fixed spacing in README

5f5d3a4

GeoffreySie wants to merge 6 commits into pgmpy:main from GeoffreySie:add-news-dataset

3 participants
