# First Deliverable for FCD

## Objectives of the project

1. Mine news articles related to André Ventura / CHEGA (possibly only headlines at first)
	- Timeframe: 2019-2024
2. Word-cloud analysis since the party's foundation, to assess the topics of interest over the years
3. Correlate the identified topics with INE (National Institute of Statistics) data, to distinguish real trends from trends manufactured by fake news or social media
4. Track changes in the party’s position on the identified topics
5. Word clouds of the Portuguese political landscape at five-year intervals (2000, 2005, 2010, 2015, 2020)

## Script to make requests from ARQUIVO's CDX API

---
In the first part of the script we define the newspapers to search (the CDX API only accepts a *single* URL pattern per request) and the function that performs the lookups.

---
### Challenges with the CDX API

The CDX API can list any URL stored in **Arquivo.pt**, but results can only be filtered by a few metadata fields and by text within the URL itself.
- We used timestamps to limit the search to the period of interest. However, the timestamp refers to the snapshot, so this does not guarantee that every page collected was published in that period: several snapshots are of pages from well before 2019
- We queried one newspaper at a time (12 newspapers selected across the political spectrum) and filtered for the tags '-chega-' and 'ventura'
    - Filtering for '-chega-' reduced the number of false positives by excluding words that merely contain 'chega' (e.g., chegado, aconchega)
    - For 'ventura' there were few false positives, so it caused no significant issues
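
The effect of the dashes can be checked locally. The sketch below is only an illustration: `matches_tag` is a hypothetical helper, the `example.pt` URLs are made up, and it uses plain substring matching, which may not be exactly how the CDX server evaluates its `filter` parameter.

```python
def matches_tag(url: str, tag: str) -> bool:
    """Hypothetical helper: case-insensitive substring match of `tag` in `url`."""
    return tag.lower() in url.lower()

# Made-up URLs, for illustration only
urls = [
    "https://example.pt/politica/ventura-diz-que-chega-vai-crescer",
    "https://example.pt/sociedade/comboio-chegado-com-atraso",
    "https://example.pt/vida/manta-que-aconchega-no-inverno",
]

# Without dashes, 'chegado' and 'aconchega' also match
loose = [u for u in urls if matches_tag(u, "chega")]

# With dashes, only the slug where 'chega' is a word of its own matches
strict = [u for u in urls if matches_tag(u, "-chega-")]

print(len(loose), len(strict))
```

One trade-off of this approach: a slug that starts or ends with the party name (for example `/chega-sobe-nas-sondagens`) has no dash on one side, so the strict tag does not match it.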

In [None]:
import requests
import json
import time


# Define the API endpoint
cdx_url = "https://arquivo.pt/wayback/cdx"

# Newspapers to search
newsp = ['cmjornal.pt/*', 
         'dn.pt/*',
         'expresso.pt/*',
         'folhanacional.pt/*',
         'jn.pt/*',
         'ionline.sapo.pt/*',   
         'sol.sapo.pt/*',
         'observador.pt/*',
         'publico.pt/*',
         'sabado.pt/*',
         'sapo.pt/*',
         'visao.pt/*',
         ]

# Tags to search within newspaper's links
tags = ['-chega-', 'ventura'] # needed to add dashes before and after "chega" in order to avoid other words containing it 

# Records collected across all requests
data = []

# Maximum number of attempts per request, and the pause between them
max_retries = 2
delay_between_requests = 5 # seconds

# Function to handle requests with retries and delays
def fetch_data_w_retries(url, params, retries=max_retries):
    """
    Makes a GET request to the given URL with the given parameters, trying
    up to 'retries' times in total. If every attempt fails, returns None.
    
    :param url: str, the URL to make the request to
    :param params: dict, the parameters to send with the request
    :param retries: int, the maximum number of attempts (including the first)
    :return: requests.Response, or None if every attempt fails
    """
    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, timeout=150)
            response.raise_for_status() # Raise an error for 4xx or 5xx responses
            return response
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}. Attempt {attempt + 1} of {retries}")
            if attempt < retries - 1:
                print(f"Retrieved {len(data)} records.")
                time.sleep(delay_between_requests) # Wait before retrying
            else:
                print("Max retries reached. Skipping.")
    return None


---
Both in the function and in the rest of the script we had to handle several error-prone scenarios, such as:
- Records with a blank or missing status
- Responses that must be parsed as NDJSON (one JSON object per line) instead of a single JSON document
- Requests that exceed the read timeout
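
To illustrate the NDJSON point, here is a minimal sketch using only the standard library. The `parse_ndjson` helper and the sample records are hypothetical and not taken from a real CDX response; the field names simply mirror the ones requested in the script.

```python
import json

def parse_ndjson(text):
    """Parse newline-delimited JSON; return (records, bad_lines)."""
    records, bad_lines = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line)  # Keep unparseable lines for inspection
    return records, bad_lines

# Made-up sample body: two records, a blank line and a broken line
sample = (
    '{"url": "https://example.pt/a-chega-b", "timestamp": "20220101120000", "status": "200"}\n'
    '{"url": "https://example.pt/ventura", "timestamp": "20230315080000", "status": "404"}\n'
    '\n'
    'not json\n'
)

# Parsing the whole body as one JSON document fails
try:
    json.loads(sample)
    whole_body_ok = True
except json.JSONDecodeError:
    whole_body_ok = False

records, bad_lines = parse_ndjson(sample)
ok = [r for r in records if r.get("status") == "200"]
print(whole_body_ok, len(records), len(ok), bad_lines)
```

Collecting the bad lines instead of printing them right away makes it easier to review them after a long run.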

In [None]:

# Query every newspaper/tag combination and parse the NDJSON response
for i in newsp:
    for tag in tags:
        params = {
            'url': i,
            'fields': 'url,timestamp,status',
            'from': '2022',
            'to': '2024',
            'filter': 'original:'+tag,
            'output': 'json',
            'limit': '5000'
        }
        
        response = fetch_data_w_retries(cdx_url, params)

        if response:
            if response.status_code == 200:
                # startswith() also accepts a header with parameters, e.g. "; charset=utf-8"
                if response.headers.get('Content-Type', '').startswith('text/x-ndjson'):
                    # Process each line as a separate JSON object
                    for line in response.text.splitlines():
                        try:
                            record = json.loads(line)

                            status = record.get('status')

                            if status == '200':
                                data.append(record)
                            else:
                                if status is None:
                                    print(f"Record missing 'status' field: {record}")
                                    print(f"Retrieved {len(data)} records.")

                                else:
                                    print(f"Record with status '{status}': {record}")
                                    print(f"Retrieved {len(data)} records.")

                        except json.JSONDecodeError as e:
                            print(f"Error parsing line: {line}")
                            print(f"JSONDecodeError: {e}")
                        except TypeError as e:
                            print(f"Unexpected data format: {line}")
                            print(f"TypeError: {e}")
                else:
                    print("Response is not in NDJSON format.")
            else:
                print(f"Failed to retrieve data: {response.status_code}")
        else:
            print("Failed to retrieve data.")

# Print the number of records retrieved
print(f"Retrieved {len(data)} records.")

# Write the collected records to cdx_results.json (overwrites any existing file)
with open("cdx_results.json", "w") as json_file:
    json.dump(data, json_file, indent=4)
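
Since the same page can be archived several times, the collected records may contain many snapshots of one URL. The sketch below shows one possible way to keep only the earliest snapshot per URL before scraping. It is not part of the original script: both helpers are hypothetical, the sample records are made up, and it assumes the `timestamp` field is a 14-digit `YYYYMMDDhhmmss` string.

```python
from datetime import datetime

def parse_cdx_timestamp(ts):
    """Convert a 14-digit timestamp (YYYYMMDDhhmmss, assumed format) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def earliest_snapshot_per_url(records):
    """Keep one record per URL: the one with the smallest timestamp."""
    earliest = {}
    for rec in records:
        url, ts = rec["url"], rec["timestamp"]
        # Fixed-width digit strings sort in the same order as the dates they encode
        if url not in earliest or ts < earliest[url]["timestamp"]:
            earliest[url] = rec
    return list(earliest.values())

# Made-up records, for illustration only
sample_records = [
    {"url": "https://example.pt/a-chega-b", "timestamp": "20230510093000", "status": "200"},
    {"url": "https://example.pt/a-chega-b", "timestamp": "20220101120000", "status": "200"},
    {"url": "https://example.pt/ventura-x", "timestamp": "20240220180000", "status": "200"},
]

unique = earliest_snapshot_per_url(sample_records)
for rec in unique:
    print(rec["url"], parse_cdx_timestamp(rec["timestamp"]).date())
```

The earliest snapshot is only a rough upper bound on the publication date, so it does not remove the pre-2019 pages mentioned above; it only reduces duplicate requests.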


## Extracting titles from the URLs

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Receiving Data with CDX API

In [None]:
# Set the name of the .json file produced by the CDX script here; the cells below read it through the variable 'file_name'

file_name = ''

#### Scraping words out of the received URLs

- Open every website and scrape the `<title>` element of its HTML
- Save the original URL as key and the extracted title as value in a dictionary, if the title contains 'Chega' or 'CHEGA'
- Write the dictionary to a new .json file

In [None]:
import json
import requests
from bs4 import BeautifulSoup
import time

# Path to the JSON file: use 'file_name' if it was set above, otherwise the test file
json_file_path = file_name or 'test_for_scraping_cdx_results.json'

# Open and load the JSON file
with open(json_file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

titles = {}

# Loop through the URLs and look for the <title> element in each page's HTML
for v in data:
    url = v['url']
    
    try:
        # Request the page with a 10-second timeout; stream=True downloads the
        # body in chunks instead of all at once. The 'with' block closes the
        # connection even when we stop reading early.
        with requests.get(url, timeout=10, stream=True) as response:
            # Accumulate chunks only until the closing </title> tag appears
            html_content = b""
            for chunk in response.iter_content(chunk_size=512):
                html_content += chunk
                if b"</title>" in html_content:
                    break  # Stop streaming once the <title> element is complete

        # Delay between requests to avoid overloading the server
        time.sleep(1)  # 1-second delay

        # Parse only the partial content with BeautifulSoup
        soup = BeautifulSoup(html_content, "html.parser")

        # Extract the <title> element
        title_tag = soup.find("title")

        # If no <title> tag is found, skip this page, makes program faster
        if not title_tag:
            continue

        # Extract the text from the <title> tag
        title_text = title_tag.get_text()

        # Check for "Chega" or "CHEGA" (case-sensitive check)
        if "Chega" in title_text or "CHEGA" in title_text:
            titles[url] = title_text  # Store the URL and title in the dictionary

    except requests.exceptions.Timeout:
        print(f"Timeout occurred for URL: {url}")
        continue  # Skip to the next URL if a timeout occurs

    except requests.exceptions.RequestException as e:
        print(f"Request failed for URL: {url} with error: {e}")
        continue  # Skip to the next URL if another error occurs

# Writing the dictionary to a JSON file
with open('titles.json', 'w', encoding='utf-8') as json_file:
    json.dump(titles, json_file, indent=4, ensure_ascii=False)

# Print the results, only for debugging, to remove later
for url, title in titles.items():
    print(url, title)