# 🤗 Database Setup

This notebook uploads the Hugging Face Climate Policy Radar dataset to a Postgres table.

See `README.md` for instructions on setting up the Postgres database.

In [4]:
import os
import regex as re
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import pgai
import torch
import glob
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
from datasets import load_dataset, Features, Value
import sys

# Add the project root directory to the Python path.
# `__file__` is not defined inside a notebook, so start from the current working directory.
notebook_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(notebook_dir, '../..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"Added {project_root} to sys.path")

# Now import the functions from the root directory
from scripts.retrieval.functions import store_database_batched



## 1. Load the Hugging Face dataset.

To access this dataset, you need a Hugging Face account.

1. Run `huggingface-cli login` on the command line.
2. Paste your access token when prompted.
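
As an alternative to the CLI steps above, it may be more convenient to authenticate from Python inside the notebook. The following is a minimal sketch, not part of the original setup: it assumes the `huggingface_hub` package is installed and that the token is stored in an environment variable named `HF_TOKEN`. The helper name `login_from_env` is hypothetical.

```python
import os


def login_from_env(env_var: str = "HF_TOKEN") -> bool:
    """Log in to the Hugging Face Hub if a token is present in the environment.

    Returns True if a login was attempted, False if no token was found.
    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        # No token available: fall back to `huggingface-cli login`.
        return False
    # Imported lazily so the helper can be defined without the package installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` before `load_dataset(...)` keeps the token out of the notebook itself, which matters if the notebook is committed to version control.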

In [2]:
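# Download the dataset (requires an authenticated Hugging Face session)
ds = load_dataset("ClimatePolicyRadar/all-document-text-data")
# Use the "train" split
ds = ds["train"]
# Flatten nested (struct) columns into top-level columns
flat_ds = ds.flatten()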

Resolving data files:   0%|          | 0/23 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/42 [00:00<?, ?it/s]

## 2. Save all chunks into the "climate_policy_radar" table.

The table contains all documents from the Climate Policy Radar dataset, with each row representing a chunk and all associated metadata stored in individual columns.
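
The "individual columns" come from the `ds.flatten()` call in the previous step: each field of a nested (struct) column becomes its own column, named with a dot-separated path. The pure-Python sketch below mimics that behaviour on a single record to show the resulting shape. The field names are made up for illustration and are not taken from the dataset schema, and `flatten_record` is a hypothetical helper, not a `datasets` API.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structs, carrying the path as a prefix.
            flat.update(flatten_record(value, prefix=name))
        else:
            flat[name] = value
    return flat


# Illustrative record; these field names are assumptions, not the real schema.
example = {
    "document_id": "doc-1",
    "text_block": {"text": "Net zero by 2050.", "page_number": 3},
}
flat = flatten_record(example)
print(flat)
# {'document_id': 'doc-1', 'text_block.text': 'Net zero by 2050.', 'text_block.page_number': 3}
```

To see the actual column names that end up in the table, inspect `flat_ds.column_names` after flattening.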

In [3]:
# Pass the full dataset length so that every chunk is inserted
store_database_batched(flat_ds, num_chunks=len(flat_ds))

Inserting chunks into database:   0%|          | 0/180 [00:00<?, ?it/s]
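
The implementation of `store_database_batched` lives in the `scripts.retrieval.functions` module and is not shown in this notebook. As a rough illustration of batched insertion in general, and not the project's actual code, the sketch below splits `num_chunks` rows into fixed-size index ranges. The helper name `batch_ranges` and the batch size are assumptions.

```python
from typing import Iterator, Tuple


def batch_ranges(num_chunks: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index ranges that cover num_chunks rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_chunks, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, num_chunks)


for start, end in batch_ranges(num_chunks=10, batch_size=4):
    print(start, end)
# 0 4
# 4 8
# 8 10
```

With a Hugging Face dataset, each range could be materialised with `flat_ds.select(range(start, end))` and written in one database round trip, which keeps memory use bounded and lets a progress bar tick once per batch.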