## Data Mining

### Introduction
The spread of fake news online has become a significant concern, and a range of techniques has been developed to detect and counter it. In this project, we use data mining techniques to gather news articles from The Guardian and then analyze them for fake news detection.

### Data Collection
We utilized The Guardian's API to collect news articles for analysis. The data collection process involved the following steps:

1. API Access: Access to The Guardian's API was established using a unique API key, which is read from a local text file (`Guardian_API.txt`) rather than hard-coded.
2. Date Range Specification: We specified a date range from January 1, 2021, to January 15, 2021 (inclusive), and queried the API one day at a time.
3. Article Retrieval: For each day, we requested every page of results, with up to 200 articles per page. From each article we kept the ID, title, URL, section name, and publication date.
4. Data Storage: The articles for each day were written to a separate CSV file (for example, `tempdata/articles/2021-01-01.csv`) for further analysis. Days whose file already exists are skipped when the code is re-run.

In [3]:
import csv
import requests
from os import makedirs
from os.path import join, exists
from datetime import date, timedelta

ARTICLES_DIR = join('tempdata', 'articles')
makedirs(ARTICLES_DIR, exist_ok=True)

# Read the API key from a local file and close the file afterwards
with open("Guardian_API.txt") as key_file:
    MY_API_KEY = key_file.read().strip()
API_ENDPOINT = 'https://content.guardianapis.com/search'
my_params = {
    'from-date': "",
    'to-date': "",
    'order-by': "newest",
    'show-fields': 'all',
    'page-size': 200,
    'api-key': MY_API_KEY
}

start_date = date(2021, 1, 1)
end_date = date(2021, 1, 15)
dayrange = range((end_date - start_date).days + 1)

for daycount in dayrange:
    dt = start_date + timedelta(days=daycount)
    datestr = dt.strftime('%Y-%m-%d')
    fname = join(ARTICLES_DIR, datestr + '.csv')
    if not exists(fname):
        print("Downloading", datestr)
        all_results = []
        my_params['from-date'] = datestr
        my_params['to-date'] = datestr
        current_page = 1
        total_pages = 1
        while current_page <= total_pages:
            print("...page", current_page)
            my_params['page'] = current_page
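            resp = requests.get(API_ENDPOINT, params=my_params, timeout=30)
            # Stop on HTTP errors (e.g. a bad API key) instead of failing later with a KeyError
            resp.raise_for_status()
            data = resp.json()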
            all_results.extend(data['response']['results'])
            current_page += 1
            total_pages = data['response']['pages']

        with open(fname, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['id', 'webTitle', 'webUrl', 'sectionName', 'publicationDate'])
            writer.writeheader()
            for article in all_results:
                writer.writerow({
                    'id': article['id'],
                    'webTitle': article['webTitle'],
                    'webUrl': article['webUrl'],
                    'sectionName': article['sectionName'],
                    'publicationDate': article['webPublicationDate']
                })
        print("Wrote", fname)
    else:
        print("File already exists:", fname)


File already exists: tempdata/articles/2021-01-01.csv
File already exists: tempdata/articles/2021-01-02.csv
File already exists: tempdata/articles/2021-01-03.csv
File already exists: tempdata/articles/2021-01-04.csv
File already exists: tempdata/articles/2021-01-05.csv
File already exists: tempdata/articles/2021-01-06.csv
File already exists: tempdata/articles/2021-01-07.csv
File already exists: tempdata/articles/2021-01-08.csv
File already exists: tempdata/articles/2021-01-09.csv
File already exists: tempdata/articles/2021-01-10.csv
File already exists: tempdata/articles/2021-01-11.csv
File already exists: tempdata/articles/2021-01-12.csv
File already exists: tempdata/articles/2021-01-13.csv
File already exists: tempdata/articles/2021-01-14.csv
File already exists: tempdata/articles/2021-01-15.csv
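
The daily CSV files can be read back into memory for the later analysis steps. The following is a minimal sketch, not part of the original pipeline: it assumes the `tempdata/articles` directory and the column names written by the cell above, and the helper names `load_articles` and `count_by_section` are hypothetical.

```python
import csv
from collections import Counter
from glob import glob
from os.path import join


def load_articles(articles_dir):
    """Read every daily CSV in articles_dir into one list of row dicts."""
    rows = []
    # File names follow YYYY-MM-DD.csv, so sorting them gives date order.
    for fname in sorted(glob(join(articles_dir, '*.csv'))):
        with open(fname, newline='', encoding='utf-8') as f:
            rows.extend(csv.DictReader(f))
    return rows


def count_by_section(rows):
    """Count articles per section using the 'sectionName' column."""
    return Counter(row['sectionName'] for row in rows)


if __name__ == '__main__':
    # Assumed location: the directory used by the download cell.
    articles = load_articles(join('tempdata', 'articles'))
    print(len(articles), "articles loaded")
    for section, n in count_by_section(articles).most_common(5):
        print(section, n)
```

Counting articles per section is a quick sanity check that the download covered the expected sections. If the directory is missing or empty, `load_articles` returns an empty list.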
