# First Deliverable for FCD

## Objectives of the project

1. Data mining news articles related to André Ventura / CHEGA (potentially only headlines at first)
	- Timeframe 2019-2024
2. Word cloud analysis since the party foundation, assess the topics of interest throughout the years
3. Correlation with INE (National Institute of Statistics) data with the various topics (identify real trends, or made up trends by fake news/social media)
4. Change of the party’s position regarding the identified topics
5. Word clouds every 5 years of the portuguese politics landscape (2000-2005-2010-2015-2020)

## Script to make requests from ARQUIVO's CDX API

---
In the first part of the script we defined the newspapers to search for (yes, CDX API makes you search by a *single* url at a time), and the function that will look them up.

---
### Challenges with the CDX API

The CDX API can access any stored link in **arquivo API**, but it can only be filtered with some metadata and text within the URL
- We made use of timestamps to limit our search to the period of interest. However,  it doesn't mean that all the pages collected are from that period, as there are several snapshots from pages way before 2019
- We looked up link by link (out of 12 newspapers selected, across the political spectrum), and looked for the tags '-chega-' and 'ventura'
    - Filtering for '-chega-' helped reduce the number of outliers by avoiding words where 'chega' can be found (e.g., chegado, aconchega)
    - In the case of 'ventura' the outliers weren't that many, and we had no big issues

In [1]:
import requests
import json
import time


# Define the API endpoint
cdx_url = "https://arquivo.pt/wayback/cdx"

# Newspapers to search
newsp = ['cmjornal.pt/*', 
         'dn.pt/*',
         'expresso.pt/*',
         'folhanacional.pt/*',
         'jn.pt/*',
         'ionline.sapo.pt/*',   
         'sol.sapo.pt/*',
         'observador.pt/*',
         'publico.pt/*',
         'sabado.pt/*',
         'sapo.pt/*',
         'visao.pt/*',
         ]

# Tags to search within newspaper's links
tags = ['-chega-', 'ventura'] # needed to add dashes before and after "chega" in order to avoid other words containing it 

# Process the response into a list
data = []

# Define the maximum number of retries
max_retries = 2
delay_between_requests = 5 # seconds

# Function to handle requests with retries and delays
def fetch_data_w_retries(url, params, retries=max_retries):
    """
    Makes a GET request to the given URL with the given parameters, and 
    retries the request up to 'retries' times if it fails. If the request 
    fails after all retries, returns None.
    
    :param url: str, the URL to make the request to
    :param params: dict, the parameters to send with the request
    :param retries: int, the number of times to retry the request if it fails
    :return: requests.Response, or None if the request fails after all retries
    """
    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, timeout=150)
            response.raise_for_status() # Raise an error for 4xx or 5xx responses
            return response
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}. Attempt {attempt + 1} of {retries}")
            if attempt < retries - 1:
                print(f"Retrieved {len(data)} records.")
                time.sleep(delay_between_requests) # Wait before retrying
            else:
                print("Max retries reached. Skipping.")
                return None


---
Both in the function and the rest of the script we had to manage several error inducing scenarios such as:
- Blank error status
- Process the data as NDJSON instead of JSON
- Exceeding read time out

In [2]:

# Check if the response is in NDJSON format
for i in newsp:
    for tag in tags:
        params = {
        'url': i,
        'fields': 'url,timestamp,status',
        'from': '2022',
        'to': '2024',
        'filter': 'original:'+tag,
        'output': 'json',
        'limit': '5000'
        }
        
        response = fetch_data_w_retries(cdx_url, params)

        if response:
            if response.status_code == 200:
                if response.headers.get('Content-Type') == 'text/x-ndjson':
                    # Process each line as a separate JSON object
                    for line in response.text.splitlines():
                        try:
                            record = json.loads(line)

                            status = record.get('status')

                            if status == '200':
                                data.append(record)
                            else:
                                if status is None:
                                    print(f"Record missing 'status' field: {record}")
                                    print(f"Retrieved {len(data)} records.")

                                else:
                                    print(f"Record with status '{status}': {record}")
                                    print(f"Retrieved {len(data)} records.")

                        except json.JSONDecodeError as e:
                            print(f"Error parsing line: {line}")
                            print(f"JSONDecodeError: {e}")
                        except TypeError as e:
                            print(f"Unexpected data format: {line}")
                            print(f"TypeError: {e}")
                else:
                    print("Response is not in NDJSON format.")
            else:
                print(f"Failed to retrieve data: {response.status_code}")
        else:
            print("Failed to retrieve data.")

# Print the number of records retrieved
print(f"Retrieved {len(data)} records.")

# Insert the new data into cdx_results.json
with open("cdx_results.json", "w") as json_file:
    json.dump(data, json_file, indent=4)


KeyboardInterrupt: 

#### Scrapping words out of received URLs

- Open every website and webscrape the title-element in the HTML-Script

- Safe the original URL as Key and the extracted words as value in a dictionary, if the title contains 'Chega' or 'CHEGA'
- Write it into a new .json file

In [9]:
import json
import requests
from bs4 import BeautifulSoup
import time

# Path to the JSON file
file_name = 'cdx_results_2019.json'

# Open and load the JSON file
with open(file_name, 'r') as file:
    data = json.load(file)

newspaper_values = {}

# Loop through URLs; search for the title element in HTML-Script
counter = 0

for v in data:
    url = v['url']
    print(f'iteration {counter}')
    print(f'link: {url}')
    counter += 1
    
    try:
        # Requesting the website and setting a timeout of 10 seconds
        response = requests.get(url, timeout=10, stream=True) # stream=True for downloading data in chunks not everything at once 
        
        # Delay between requests to avoid overloading the server
        # time.sleep(1)  # 1-second delay; should may be in code if server gets too much requests, for now works fine without delay

        # Initialize an empty content variable and stream content chunks
        html_content = b""
        
        # Stream the content until we find the closing </title> tag
        for chunk in response.iter_content(chunk_size=512):
            html_content += chunk
            # if b"</title>" in html_content:
                # break  # Stop streaming once the <title> tag is found

        # Parse only the partial content with BeautifulSoup
        soup = BeautifulSoup(html_content, "html.parser") # optional lxml parser

        # Extract the <title> element
        title_tag = soup.find("title")
        text_tag = soup.find_all("p")

        # If no <title> tag is found, skip this page, makes program faster
        if not title_tag:
            continue

        # Extract title and text from the <title>/<p>-tag
        title_text = title_tag.get_text()

        for text in text_tag:
            text_text = text.get_text()

        # Debug output for Text:
        print(f'Text found in <p> element: {text_text}')

        # Check for "Chega" or "CHEGA" (case-sensitive check)
        if "Chega" in title_text or "CHEGA" or "Andre Ventura" in title_text:
            print(f'title text "{title_text}" is valid"')
            newspaper_values['url'] = url  # Store the URL and title in the dictionary
            newspaper_values['title'] = title_text
            newspaper_values['text'] = text_text

    except requests.exceptions.Timeout:
        print(f"Timeout occurred for URL: {url}")
        continue  # Skip to the next URL if a timeout occurs

    except requests.exceptions.RequestException as e:
        print(f"Request failed for URL: {url} with error: {e}")
        continue  # Skip to the next URL if another error occurs

df_newspaper_values = df.DataFrame(data=newspaper_values)

# Writing the dictionary to a JSON file
with open('newspaper_url_infos.json', 'w', encoding='utf-8') as json_file:
    json.dump(titles, json_file, indent=4, ensure_ascii=False)

iteration 0
link: https://www.cmjornal.pt/boa-vida/amp/capital-do-natal-chega-a-oeiras
Text found in <p> element: "Pretendemos criar um espaço de referência, onde se conjuguem os valores e princípios do espírito de Natal, com fortes componentes de diversão e de responsabilidade social", afirmou Ivan Dias, um dos fundadores do projecto.

title text "Capital do Natal chega a Oeiras - Boa Vida - Correio da Manhã" is valid"
iteration 1
link: https://www.cmjornal.pt/boa-vida/amp/capital-do-natal-chega-a-oeiras
Text found in <p> element: "Pretendemos criar um espaço de referência, onde se conjuguem os valores e princípios do espírito de Natal, com fortes componentes de diversão e de responsabilidade social", afirmou Ivan Dias, um dos fundadores do projecto.

title text "Capital do Natal chega a Oeiras - Boa Vida - Correio da Manhã" is valid"
iteration 2
link: https://www.cmjornal.pt/boa-vida/detalhe/capital-do-natal-chega-a-oeiras
Text found in <p> element: (Enviada diariamente)
title text "

In [17]:
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Eingabedatei mit URLs
file_name = 'test_for_scraping_cdx_results.json'

# Eingabe-JSON laden
with open(file_name, 'r') as file:
    data = json.load(file)

# Liste, um die Ergebnisse zu sammeln
records = []

# Iteration über die URLs in der Eingabedatei
for index, entry in enumerate(data):
    url = entry['url']
    print(f'Iteration {index + 1}, URL: {url}')
    
    try:
        # Website anfragen
        response = requests.get(url, timeout=10)
        html_content = response.content  # Gesamten Inhalt laden

        # HTML parsen
        soup = BeautifulSoup(html_content, "lxml")

        # <title>-Element und <p>-Tags extrahieren
        title_tag = soup.find("title")
        text_tags = soup.find_all("p")

        # Prüfen, ob ein <title>-Tag vorhanden ist
        if not title_tag:
            print(f"Kein <title>-Tag gefunden für URL: {url}")
            continue


        # Titeltext extrahieren
        title_text = title_tag.get_text().strip()

        # Text aller <p>-Tags sammeln
        cat_text = "\n".join([p.get_text().strip() for p in text_tags])

        # Ergebnis zur Liste hinzufügen
        records.append({
            "URL": url,
            "Title": title_text,
            "Text": cat_text
        })

    except requests.exceptions.Timeout:
        print(f"Timeout bei URL: {url}")
        continue

    except requests.exceptions.RequestException as e:
        print(f"Anfrage fehlgeschlagen für URL: {url} mit Fehler: {e}")
        continue

# DataFrame erstellen und in JSON-Datei speichern
df_newspaper_values = pd.DataFrame(records)
df_newspaper_values.to_json('newspaper_url_infos.json', orient='records', indent=4, force_ascii=False)

# Ausgabe des DataFrames zur Überprüfung

Iteration 1, URL: https://www.cmjornal.pt/politica/amp/andre-ventura-diz-chega-vai-impedir-extrema-direita-em-portugal
Iteration 2, URL: https://www.cmjornal.pt/politica/amp/partido-chega-de-andre-ventura-inicia-formalizacao-para-ser-alternativa-a-direita-que-parece-nao-existir
Iteration 3, URL: https://www.jn.pt/ntv/interior/lourenco-ortigao-e-kelly-bailey-dao-tudo-e-mais-alguma-coisa-mas-nao-chega-10616299.html
Timeout bei URL: https://www.jn.pt/ntv/interior/lourenco-ortigao-e-kelly-bailey-dao-tudo-e-mais-alguma-coisa-mas-nao-chega-10616299.html
Iteration 4, URL: https://www.cmjornal.pt/cm-ao-minuto/amp/clima-mais-de-mil-alunos-em-coimbra-dizem-que-chega-de-bla-bla-bla-e-exigem-mudancas
Iteration 5, URL: https://www.cmjornal.pt/cm-ao-minuto/amp/euro2020-cristiano-ronaldo-chega-aos-90-golos-por-portugal-na-lituania
Iteration 6, URL: https://www.cmjornal.pt/cm-ao-minuto/amp/jogo-baleia-azul-chega-a-china
Iteration 7, URL: https://www.cmjornal.pt/cm-ao-minuto/amp/liga-dos-bombeiros-cheg

In [18]:
df_newspaper_values.head()

Unnamed: 0,URL,Title,Text
0,https://www.cmjornal.pt/politica/amp/andre-ven...,André Ventura diz que Chega vai impedir extrem...,A informação do País e do mundo em primeiro lu...
1,https://www.cmjornal.pt/politica/amp/partido-c...,Partido 'CHEGA' inicia formalização para ser a...,A informação do País e do mundo em primeiro lu...
2,https://www.cmjornal.pt/cm-ao-minuto/amp/clima...,"Estudantes de todo o país estão fartos de ""blá...",A informação do País e do mundo em primeiro lu...
3,https://www.cmjornal.pt/cm-ao-minuto/amp/euro2...,Error404 - Correio da Manhã,A informação do País e do mundo em primeiro lu...
4,https://www.cmjornal.pt/cm-ao-minuto/amp/jogo-...,"Jogo ""Baleia Azul"" chega à China - Cm ao Minut...",A informação do País e do mundo em primeiro lu...


In [24]:
df_newspaper_values[12][1]

KeyError: 12

In [33]:
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np  # For NaN values

# Path to the JSON file containing URLs
file_name = 'test_for_scraping_cdx_results.json'

# Load the JSON input file
with open(file_name, 'r') as file:
    data = json.load(file)

# List to store the results for DataFrame creation
records = []

# Iterate over each entry (URL) in the input file
for index, entry in enumerate(data):
    url = entry['url']
    print(f'Iteration {index + 1}, URL: {url}')
    
    try:
        # Request the webpage content with a timeout of 10 seconds
        response = requests.get(url, timeout=10)
        html_content = response.content  # Load the entire content

        # Parse the HTML with BeautifulSoup using the lxml parser for faster processing
        soup = BeautifulSoup(html_content, "lxml")

        # Extract the <title> element and all <p> elements
        title_tag = soup.find("title")
        text_tags = soup.find_all("p")

        # Check if <title> tag exists
        if title_tag:
            # Extract text from the <title> tag
            title_text = title_tag.get_text().strip()

            # Concatenate the text of all <p> tags
            cat_text = "\n".join([p.get_text().strip() for p in text_tags])

            # Check for keywords in the title
            if "Chega" in title_text or "CHEGA" in title_text or "Andre Ventura" in title_text:
                # If keywords are found, add URL, title, and text to the record
                print(f'cat_text: {cat_text}')
                records.append({
                    "URL": url,
                    "Title": title_text,
                    "Text": cat_text
                })
            else:
                # If keywords are not found, add URL with NaN for title and text
                records.append({
                    "URL": url,
                    "Title": np.nan,
                    "Text": np.nan
                })

        else:
            # If no <title> tag, add URL with NaN for title and text
            records.append({
                "URL": url,
                "Title": np.nan,
                "Text": np.nan
            })

    except requests.exceptions.Timeout:
        print(f"Timeout occurred for URL: {url}")
        # Skip to the next URL if a timeout occurs
        records.append({
            "URL": url,
            "Title": np.nan,
            "Text": np.nan
        })
        continue

    except requests.exceptions.RequestException as e:
        print(f"Request failed for URL: {url} with error: {e}")
        # Skip to the next URL if another error occurs
        records.append({
            "URL": url,
            "Title": np.nan,
            "Text": np.nan
        })
        continue

# Create a DataFrame from the list of records
df_newspaper_values = pd.DataFrame(records)

# Save the DataFrame to a JSON file
df_newspaper_values.to_json('newspaper_url_infos.json', orient='records', indent=4, force_ascii=False)

Iteration 1, URL: https://www.cmjornal.pt/c-studio/especiais-c-studio/por-boas-causas/detalhe/fado-chega-a-todos-no-regresso-da-santa-casa-alfama?ref=HP_BlocoComercial
Iteration 2, URL: https://www.cmjornal.pt/c-studio/especiais-c-studio/por-boas-causas/detalhe/fado-chega-a-todos-no-regresso-da-santa-casa-alfama?ref=HP_CofinaBoostSolutions
Iteration 3, URL: https://www.cmjornal.pt/c-studio/especiais-c-studio/por-boas-causas/detalhe/fado-chega-a-todos-no-regresso-da-santa-casa-alfama?ref=HP_CofinaBoostSolutions
Iteration 4, URL: https://www.cmjornal.pt/c-studio/especiais-c-studio/por-boas-causas/detalhe/fado-chega-a-todos-no-regresso-da-santa-casa-alfama?ref=HP_CofinaBoostSolutions
Iteration 5, URL: https://www.cmjornal.pt/c-studio/especiais-c-studio/por-boas-causas/detalhe/fado-chega-a-todos-no-regresso-da-santa-casa-alfama?ref=HP_CofinaBoostSolutions
Iteration 6, URL: https://www.cmjornal.pt/c-studio/especiais-c-studio/por-boas-causas/detalhe/fado-chega-a-todos-no-regresso-da-santa-ca

In [34]:
df_newspaper_values.head()

Unnamed: 0,URL,Title,Text
0,https://www.cmjornal.pt/c-studio/especiais-c-s...,,
1,https://www.cmjornal.pt/c-studio/especiais-c-s...,,
2,https://www.cmjornal.pt/c-studio/especiais-c-s...,,
3,https://www.cmjornal.pt/c-studio/especiais-c-s...,,
4,https://www.cmjornal.pt/c-studio/especiais-c-s...,,
