# Data Acquisition

This notebook demonstrates how to acquire data for an NLP task from the web.

We will use the New York Times Article Search API to download articles. The API is available at [https://developer.nytimes.com/apis](https://developer.nytimes.com/apis).

Here is an example call that we are going to implement as a Python function to automate the data acquisition process:

```
https://api.nytimes.com/svc/search/v2/articlesearch.json?q=election&api-key=yourkey
```
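To make the structure of this call explicit, the following minimal sketch assembles the same URL from a parameter dictionary using only the standard library. The helper name `build_search_url` and the constant `BASE_URL` are illustrative assumptions, not part of the API.

```python
from urllib.parse import urlencode

# Endpoint taken from the example call above
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(params: dict, api_key: str) -> str:
    """Return the full request URL (hypothetical helper for illustration)."""
    # The key is passed as the "api-key" query parameter, as in the example call
    query = urlencode({**params, "api-key": api_key})
    return f"{BASE_URL}?{query}"


example_url = build_search_url({"q": "election"}, "yourkey")
print(example_url)
```

In practice we do not need to build the URL by hand, because `requests` performs the same encoding when given a `params` dictionary.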

The exact API specification can be found [here](https://developer.nytimes.com/docs/articlesearch-product/1/overview).

To use the API, we need to create an account and get an API key. The API key is used to authenticate the user and to track API usage.

More details about the API key can be found [here](https://developer.nytimes.com/get-started).
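Since the key identifies our account, it is usually better not to store it in the notebook itself. One possible approach, sketched below, is to read it from an environment variable. The variable name `NYT_API_KEY` and the helper `load_api_key` are assumptions made for this example.

```python
import os


def load_api_key(env_var: str = "NYT_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail with a clear message instead of sending a request without a key
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key
```

With the variable set in the shell that starts the notebook, `API_KEY = load_api_key()` could replace a hard-coded key.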


In [None]:
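import requests


def get_nyt_articles_by_keyword(params: dict, api_key: str) -> dict:
    """Query the NYT Article Search API and return the parsed JSON response."""
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    # the API key is sent as the "api-key" query parameter
    response = requests.get(url, params={**params, "api-key": api_key}, timeout=30)
    # fail early on HTTP errors (e.g. invalid key or rate limit exceeded)
    response.raise_for_status()
    return response.json()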

In [None]:
# set the API key
API_KEY = "insert_your_api_key_here"

# specify the search parameters according to your needs
params = {
    "q": "uefa euro 2024",
    "begin_date": "20240601",
    "end_date": "20240630",
    "sort": "newest",
}
response = get_nyt_articles_by_keyword(params, API_KEY)

# inspect the response
response

Now we have the data at hand and can inspect it.

In [None]:
articles = response["response"]["docs"]

# print the number of articles (we only get 10 articles per request)
len(articles)
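Because each request returns only 10 articles, collecting more results means requesting several pages. The sketch below shows one way to do this. It assumes the API's zero-based `page` query parameter and takes the actual request as a callable, so the loop itself stays independent of the network. The helper name `fetch_all_pages` and the default pause are assumptions; check the current rate limits in the API documentation before choosing a value.

```python
import time
from typing import Callable


def fetch_all_pages(
    fetch_page: Callable[[int], dict],
    max_pages: int = 3,
    pause_seconds: float = 12.0,
) -> list:
    """Collect the docs of up to max_pages result pages (hypothetical helper)."""
    docs = []
    for page in range(max_pages):
        payload = fetch_page(page)
        batch = payload.get("response", {}).get("docs", [])
        if not batch:
            # an empty page means there are no further results
            break
        docs.extend(batch)
        if page < max_pages - 1:
            # wait between requests to stay within the rate limit
            time.sleep(pause_seconds)
    return docs
```

With the function defined earlier, a call could look like `fetch_all_pages(lambda page: get_nyt_articles_by_keyword({**params, "page": page}, API_KEY))`.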

Now we can build a preprocessing pipeline according to the task at hand and prepare the data for further processing.

In [None]:
for a in articles:
    print(a["headline"]["main"])
    print(a["snippet"])
    print(a["pub_date"])
    print("")
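As a first preprocessing step, the nested article dictionaries can be flattened into simple records that contain only the fields printed above. This is a minimal sketch, not a complete pipeline; the helper name `to_records` is an assumption, and missing fields are replaced by empty strings or `None`.

```python
def to_records(articles: list) -> list:
    """Flatten raw article dicts into headline/snippet/pub_date records."""
    records = []
    for article in articles:
        # "headline" is a nested dict; guard against missing or empty values
        headline = (article.get("headline") or {}).get("main") or ""
        records.append(
            {
                "headline": headline.strip(),
                "snippet": (article.get("snippet") or "").strip(),
                "pub_date": article.get("pub_date"),
            }
        )
    return records
```

Calling `to_records(articles)` yields a list of flat dictionaries that is easy to load into a table or pass to a tokenizer.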

Once we are done, we could save the data to a file or a database for later use.

In [None]:
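import json
import os

# create the output directory if it does not exist yet
os.makedirs("output", exist_ok=True)

with open("output/nyt_articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2)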