Add News Dataset Files #12

Open
GeoffreySie wants to merge 6 commits into pgmpy:main from GeoffreySie:add-news-dataset

Conversation

@GeoffreySie

The News dataset is one of the datasets suggested for addition in pgmpy/pgmpy#2620. This PR resolves pgmpy/pgmpy#2787.

The dataset is the topic_doc_mean_n5000_k3477_seed_1.csv dataset downloaded from https://www.fredjo.com/

@DARHWOLF DARHWOLF left a comment

Nice work!

  • May I request the changes mentioned in the comments, please?

  • Also, I see that the dataset is already available in .csv format. Was there any specific reason to convert it to .txt?

  • Further, may I request you to clearly specify how this conversion was done, and how you verified that the contents of the dataset were not modified in the process?

README.md Outdated
1. [example-causal-datasets](https://github.com/cmu-phil/example-causal-datasets): CC0 1.0 Universal. Last synced on 2026-02-05.

3.[Twins-datasets](http://www.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html)
2. [News-dataset] (https://www.fredjo.com/): Last downloaded on 2026-03-11.

This link points to the author's personal website. May I request you to find a better alternative here? Preferably a direct link to the webpage where this dataset can be accessed.

Member

Also please put the exact link to the dataset.

@ankurankan
Member

I see that there are multiple datasets in the zip file. Which dataset has been added in this PR?

@GeoffreySie
Author

Thank you both for the advice! I've replaced the links with direct download links for the dataset.

  • Also, I see that the dataset is already available in .csv format. Was there any specific reason to convert it to .txt?

I saw that #3 had a similar comment and decided to do the same for this.

  • Further, may I request you to clearly specify how this conversion was done, and how you verified that the contents of the dataset were not modified in the process?

Sure! The News dataset is split into .x and .y files: .x holds a sparse representation of the features, and .y holds treatment, y_factual, y_counterfactual, mu0, and mu1. To match the other datasets in this repository, I wrote a script that expands the sparse .x representation into dense columns and joins them side by side with the .y columns. The result is written as a tab-separated .txt file by passing a .txt output path and sep='\t' to pandas to_csv. Below is the exact script I used.

import pandas as pd
from scipy.sparse import coo_matrix


def merge_x_and_y(x_file, y_file, output_file):

    # The first line of the .x file holds the matrix dimensions
    with open(x_file, 'r') as f:
        header = f.readline().strip().split(',')
        n_rows, n_cols = int(header[0]), int(header[1])

    # Read the sparse (row, col, val) triples from the rest of the .x file
    df = pd.read_csv(x_file, skiprows=1, header=None, names=['row', 'col', 'val'])

    # coo_matrix uses 0-based indexing, but original dataset uses 1-based indexing
    rows = df['row'].values - 1
    cols = df['col'].values - 1
    vals = df['val'].values

    # Create dense matrix filling in all the zeros
    sparse_matrix = coo_matrix((vals.astype(int), (rows, cols)), shape=(n_rows, n_cols))
    dense_matrix = sparse_matrix.toarray()

    # Convert to a pandas DataFrame with 'x_i' as the name of column i of the dense matrix
    df_x = pd.DataFrame(dense_matrix, columns=[f'x_{i}' for i in range(n_cols)])

    # Read the .y file
    df_y = pd.read_csv(y_file, header=None,
                       names=['treatment', 'y_factual', 'y_counterfactual', 'mu0', 'mu1'])

    # Merge x and y dataframes side by side (y columns first)
    df_merged = pd.concat([df_y, df_x], axis=1)

    # Write a tab-separated file without the pandas index column
    df_merged.to_csv(output_file, index=False, sep='\t')
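
On the verification question, one possible check is sketched below. It is an illustration under assumptions, not something confirmed to have been run for this PR: it rebuilds the dense matrix from the original .x triples, re-reads the .y file, and compares both against the merged tab-separated file. The names `check_merged_file` and `Y_COLUMNS` are hypothetical, and the sketch assumes the .x header line starts with the row and column counts, as the script above does. The .x values are deliberately not cast to int here, so any truncation introduced during conversion would show up as a mismatch.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Column names given to the .y file by the conversion script (hypothetical constant)
Y_COLUMNS = ["treatment", "y_factual", "y_counterfactual", "mu0", "mu1"]


def check_merged_file(x_file, y_file, merged_file):
    """Raise AssertionError if merged_file disagrees with the .x/.y sources."""
    # Assumption: the first line of the .x file starts with n_rows, n_cols,
    # and the remaining lines are 1-based (row, col, val) triples.
    with open(x_file, "r") as f:
        header = f.readline().strip().split(",")
    n_rows, n_cols = int(header[0]), int(header[1])
    triples = pd.read_csv(x_file, skiprows=1, header=None,
                          names=["row", "col", "val"])

    # Rebuild the expected dense feature matrix without any dtype cast
    expected_x = coo_matrix(
        (triples["val"].values,
         (triples["row"].values - 1, triples["col"].values - 1)),
        shape=(n_rows, n_cols),
    ).toarray()
    expected_y = pd.read_csv(y_file, header=None, names=Y_COLUMNS)

    merged = pd.read_csv(merged_file, sep="\t")

    # Layout: the five .y columns followed by x_0 .. x_{n_cols - 1}
    assert merged.shape == (n_rows, len(Y_COLUMNS) + n_cols)
    assert list(merged.columns) == Y_COLUMNS + [f"x_{i}" for i in range(n_cols)]

    # Features must match exactly; outcomes up to floating-point round-off
    assert np.array_equal(merged.iloc[:, len(Y_COLUMNS):].to_numpy(), expected_x)
    assert np.allclose(merged[Y_COLUMNS].to_numpy(), expected_y.to_numpy())
    return True
```

Calling `check_merged_file` with the two source files and the path of the generated .txt file would either return True or raise on the first mismatch.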

I see that there are multiple datasets in the zip file. Which dataset has been added in this PR?

This is the first dataset in the zip file, namely the files topic_doc_mean_n5000_k3477_seed_1.csv.x and topic_doc_mean_n5000_k3477_seed_1.csv.y.

Happy to find another way to add this dataset instead if required.
