# Load Data Corpus

This notebook shows how to use the standardized data loaders built on top of the
following open-source APIs:
* [GNews](https://pypi.org/project/gnews/)
* [NewsAPI](https://newsapi.org/)
* [Arxiv](https://github.com/lukasschwab/arxiv.py)

We also use [Newspaper3k](https://newspaper.readthedocs.io/en/latest/) to scrape
the full content of the articles retrieved through GNews and NewsAPI.

All results are standardized so that they share the same fields regardless of
the search engine that returned them. The standardized structure looks like
this:

```python
{
    'url':  'url of the article',
    'title':  'title of the article',
    'content':  'full text of the article',
    'metadata': {
        'query_engine': 'gnews, newsapi, or arxiv',
        'published_date': Timestamp('2024-01-25 17:50:07+0000', tz='UTC'),
        <others>
    }
}
```

`<others>` stands for extra fields (not shown explicitly here) that are
specific to certain search engines.
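
As a quick sanity check, a record can be compared against the standardized structure described above. The snippet below is only a sketch: `missing_fields` and the two field tuples are hypothetical names that are not part of the loaders, and the sample record is made up for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Field names taken from the standardized structure described above.
REQUIRED_FIELDS = ("url", "title", "content", "metadata")
REQUIRED_METADATA_FIELDS = ("query_engine", "published_date")


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the names of the standardized fields absent from `record` (hypothetical helper)."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    metadata = record.get("metadata")
    if isinstance(metadata, dict):
        # Nested fields are reported with a "metadata." prefix.
        missing += [
            f"metadata.{name}"
            for name in REQUIRED_METADATA_FIELDS
            if name not in metadata
        ]
    return missing


# Made-up record, for illustration only.
complete_record = {
    "url": "https://example.com/article",
    "title": "Example title",
    "content": "Example full text.",
    "metadata": {
        "query_engine": "gnews",
        "published_date": datetime(2024, 1, 25, 17, 50, 7, tzinfo=timezone.utc),
    },
}
partial_record = {"url": "https://example.com/article", "metadata": {}}

print(missing_fields(complete_record))
print(missing_fields(partial_record))
```

Engine-specific extra fields are ignored by this check, so it should apply equally to records from any of the loaders.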

## Setup

In [None]:
from rich import print
import newspaper
import pandas as pd
import arxiv
from typing import Any, Dict, List

In [None]:
import sys
sys.path.append("../")
sys.path.append("../src/ai_news_digest/steps/")
from src.ai_news_digest.steps import load_news_corpus as lnc
from src.ai_news_digest.steps import load_arxiv_corpus as lac

## GNews

In [None]:
res_gnews = lnc.load_news_gnews(
    keywords="Ukraine war",
    language="en",
    country=None,
    period="90d",
    start_date=None,
    end_date=None,
    exclude_websites=[],
    max_results=20,
    override_content=True,
    standardize=True,
)

# print(res_gnews[:3])

In [None]:
d = res_gnews[0]
print(d)
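
Because `content` holds the full article text, printing a whole record can be noisy. One possible way to inspect a record is to shorten `content` before printing. This is a minimal sketch using only the standard library; `preview_record` is a hypothetical helper, not part of `load_news_corpus`, and the sample record is invented.

```python
import textwrap
from typing import Any, Dict


def preview_record(record: Dict[str, Any], max_chars: int = 80) -> Dict[str, Any]:
    """Return a copy of `record` whose 'content' is shortened for display (hypothetical helper)."""
    shortened = dict(record)  # shallow copy, the original record is left untouched
    content = shortened.get("content") or ""
    # textwrap.shorten collapses whitespace and truncates on word boundaries.
    shortened["content"] = textwrap.shorten(content, width=max_chars, placeholder="...")
    return shortened


# Invented record standing in for an element of res_gnews.
sample = {
    "url": "https://example.com/article",
    "title": "Example title",
    "content": "word " * 100,
    "metadata": {"query_engine": "gnews"},
}
short = preview_record(sample)
print(short)
```

In the notebook, the same helper could be applied as `print(preview_record(res_gnews[0]))`.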

## NewsAPI

In [None]:
credentials = lnc.load_credentials(path_to_creds="../conf/local/credentials.yml")

res_newsapi = lnc.load_news_newsapi(
    credentials=credentials,
    query="Ukraine war",
    sources=["bbc-news", "google-news"],
    domains=["apnews.com"],
    language="en",
    period="30d",
    from_date=None,  # YYYY-MM-DD format
    to_date=None,  # YYYY-MM-DD format
    sort_by="relevancy",
    override_content=True,
    use_logs=False,
    standardize=True,
)

# print(res_newsapi[:3])

In [None]:
d = res_newsapi[14]
print(d)
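
The call above passes `period="30d"` and leaves `from_date` and `to_date` unset, both of which expect the `YYYY-MM-DD` format. How `load_news_newsapi` resolves the period internally is not shown in this notebook. The sketch below illustrates one way an explicit date range equivalent to such a period could be computed; `period_to_date_range` is a hypothetical helper, and the `<N>d` format is an assumption based on the values used here.

```python
import re
from datetime import date, timedelta
from typing import Optional, Tuple


def period_to_date_range(period: str, today: Optional[date] = None) -> Tuple[str, str]:
    """Convert a period such as '30d' into (from_date, to_date) strings in YYYY-MM-DD format.

    Hypothetical helper: only the '<N>d' form is handled (an assumption).
    """
    match = re.fullmatch(r"(\d+)d", period)
    if match is None:
        raise ValueError(f"Unsupported period format: {period!r}")
    end = today or date.today()
    start = end - timedelta(days=int(match.group(1)))
    # date.isoformat() yields YYYY-MM-DD.
    return start.isoformat(), end.isoformat()


from_date, to_date = period_to_date_range("30d", today=date(2024, 1, 25))
print(from_date, to_date)  # 2023-12-26 2024-01-25
```

Passing `today` explicitly keeps the result reproducible; omitting it uses the current date.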

## Arxiv

In [None]:
res_arxiv = lac.load_arxiv_corpus(
    # query="Transformers Explainability",
    query="ti:Transformers AND ti:interpretability",
    # query="abs:Transformers AND abs:interpretability",
    max_results=20,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

In [None]:
d = res_arxiv[14]
print(d)
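
The query strings above use arXiv field prefixes (`ti:` for title, `abs:` for abstract) combined with boolean operators. As an illustration, such strings could be assembled programmatically. `build_arxiv_query` below is a hypothetical helper, not part of `load_arxiv_corpus` or the `arxiv` package, and it assumes single-word terms.

```python
from typing import Iterable


def build_arxiv_query(terms: Iterable[str], field: str = "ti", operator: str = "AND") -> str:
    """Join single-word terms into a field-prefixed arXiv query string (hypothetical helper)."""
    if operator not in {"AND", "OR", "ANDNOT"}:
        raise ValueError(f"Unsupported operator: {operator!r}")
    # Each term gets the same field prefix, e.g. 'ti:Transformers'.
    return f" {operator} ".join(f"{field}:{term}" for term in terms)


title_query = build_arxiv_query(["Transformers", "interpretability"], field="ti")
abstract_query = build_arxiv_query(["Transformers", "interpretability"], field="abs")
print(title_query)     # ti:Transformers AND ti:interpretability
print(abstract_query)  # abs:Transformers AND abs:interpretability
```

The resulting string could then be passed as the `query` argument shown in the cell above.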