# Fetch Articles from Database

In [1]:
from utils.preprocessing import *
from utils.database import *
from utils.files import *
from datasets import Dataset


## Connect to Database

Credentials are sourced from the `.env` file.

In [2]:
_, db = getConnection(use_dotenv=True)

## Query Database

Fetch up to 50 articles that have not been processed yet, i.e. documents without a `processing_result` field.
The projection restricts the returned fields to `url`, `title`, and `parsing_result.text` (MongoDB also returns `_id` by default).

In [3]:
fields = {"url": 1, "title": 1, "parsing_result.text": 1}
query = {"processing_result": {"$exists": False}}  # not processed yet
articles = fetchArticleTexts(db, limit=50, skip=0, fields=fields, query=query)

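`fetchArticleTexts` is a project helper whose implementation is not shown in this notebook. As a rough sketch only, assuming it wraps pymongo's `Collection.find`, the filter, projection, `skip`, and `limit` arguments could be passed through as below. The function name `fetch_article_texts_sketch` and the `collection` parameter are hypothetical.

```python
from typing import Any, Optional


def fetch_article_texts_sketch(
    collection: Any,
    limit: int = 50,
    skip: int = 0,
    fields: Optional[dict] = None,
    query: Optional[dict] = None,
) -> list:
    """Hypothetical stand-in for fetchArticleTexts; not the project's code.

    `collection` is assumed to be a pymongo Collection (or any object
    with a compatible `find` method).
    """
    # find(filter, projection, skip=..., limit=...) returns a lazy cursor;
    # list() materializes it so the result can be iterated more than once.
    cursor = collection.find(query or {}, fields, skip=skip, limit=limit)
    return list(cursor)
```

Returning a list rather than the cursor matters here because the next cell iterates over `articles` twice.
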

Clean the text in each article's `parsing_result`, then drop articles that lack a `title` or a `parsing_result`.


In [4]:
for article in articles:
    if "parsing_result" in article:
        # Clean the extracted text in place
        text = article["parsing_result"].get("text", "")
        article["parsing_result"]["text"] = cleanText(text)
    else:
        print("No parsing result for article", article["_id"])

# Keep only articles that have both a title and a parsing result
articles = [
    article for article in articles
    if article.get("title") and article.get("parsing_result")
]

print("Number of articles:", len(articles))

Number of articles: 50
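
`cleanText` comes from `utils.preprocessing`, and its behavior is not shown here. A minimal sketch of what such a cleaner might do, assuming it only normalizes Unicode and collapses whitespace (the name `clean_text_sketch` is hypothetical, and the real function may do more or less):

```python
import re
import unicodedata


def clean_text_sketch(text: str) -> str:
    """Hypothetical cleaner; not the project's cleanText."""
    # Normalize Unicode so compatibility characters (for example a
    # non-breaking space) are mapped to their plain equivalents
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```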


## Export as JSON

Save the filtered articles to a JSON file for optional visual inspection.

In [5]:
exportAsJSON("../data/input/articles.json", articles)
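
`exportAsJSON` is another project helper that is not shown. At this point each article's `_id` has not been converted to a string yet, so whatever the helper does, it has to cope with values that `json` cannot encode natively. One possible approach, sketched under that assumption with the hypothetical name `export_as_json_sketch`, is to pass `default=str`:

```python
import json
from pathlib import Path


def export_as_json_sketch(path: str, data: list) -> None:
    """Hypothetical stand-in for exportAsJSON; not the project's code."""
    target = Path(path)
    # Create the parent directory if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        # default=str converts values json cannot encode natively
        # (for example a MongoDB ObjectId) to their string form
        json.dump(data, f, ensure_ascii=False, indent=2, default=str)
```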

# Prepare Dataset

Convert the article IDs to strings and transform the list of articles into a dataset with the fields `id`, `title`, `url`, and `text` (taken from the parsing result). The Hugging Face `datasets` library offers several advantages over plain JSON files:

- **Efficiency**: Datasets are memory-mapped from disk, so you can work with data larger than your available RAM without loading all of it into memory.
- **Speed**: Datasets are stored in the Arrow format, which can be loaded faster than large JSON files that must be parsed in full first.
- **Columnar Storage**: Apache Arrow stores data column by column, so reading a single field such as `text` does not require decoding every complete record, as it would with row-based JSON.


In [6]:
# Convert each MongoDB "_id" to a string so it can be stored in the dataset
for article in articles:
    article["_id"] = str(article["_id"])

# Pivot the list of article dicts into a dict of columns,
# the layout expected by Dataset.from_dict
dataset_dict = {
    "id": [article["_id"] for article in articles],
    "title": [article["title"] for article in articles],
    "url": [article["url"] for article in articles],
    "text": [article["parsing_result"]["text"] for article in articles],
}

dataset = Dataset.from_dict(dataset_dict)

Save dataset to disk:

In [7]:
dataset.save_to_disk('../data/input/articles')

Saving the dataset (1/1 shards): 100%|██████████| 50/50 [00:00<00:00, 3001.89 examples/s]


In [10]:
print(dataset[42]["text"][:100])

Skip to comments. Posted on 07/01/2016 4:30:44 PM PDT by The_Media_never_lie A West Glacier man was 
