## Homework 2 Working so far - Minas Emiris

Write web crawlers for the following two tasks:
1. Extract at least 10 United Nations press releases containing the word “crisis”. Start with the following seed url: https://press.un.org/en. Notice how press release pages gave the "PRESS RELEASE" relative link in the top left corner. Here is an example press release: https://press.un.org/en/2023/sc15431.doc.htm where the “PRESS RELEASE” has the following relative anchor tag:
`<a href="/en/press-release" hreflang="en">Press Release</a>`
Use this information to determine whether the web page is a press release.

In [73]:
import requests
from bs4 import BeautifulSoup

url = 'https://press.un.org/en'
response = requests.get(url)

In [24]:
# Use a set to automatically handle duplicates for URLs and content
visited_urls = set()
press_releases_content = set()

# Loop over anchor tags
for link in soup.find_all('a', href=True):
    href = link['href']
    
    # Now just checking for the .doc.htm ending, omitting the year-specific criteria
    if href.endswith('.doc.htm'):
        full_url = href if href.startswith('http') else 'https://press.un.org' + href

        # If we haven't visited this URL before, retrieve its content
        if full_url not in visited_urls:
            content = extract_content_from_url(full_url)
            visited_urls.add(full_url)
            press_releases_content.add(content)

# Display the results
for release_content in press_releases_content:
    # Print the URL for diagnostic purposes
    print(f"Content from URL: {full_url}")
    print(release_content)
    print('-' * 100)

Content from URL: https://press.un.org/en/2023/231002_sc.doc.htm
Recognizing Need to Bolster Indigenous Peoples’ Rights, Third Committee Underscores Importance of Respecting Traditional Lands, Valuable Conservation Knowledge
----------------------------------------------------------------------------------------------------
Content from URL: https://press.un.org/en/2023/231002_sc.doc.htm
Fourth Committee Hears Last Petitioners, Resumes General Debate with Conflict in Western Sahara Again in Spotlight
----------------------------------------------------------------------------------------------------
Content from URL: https://press.un.org/en/2023/231002_sc.doc.htm
Secretary-General Deeply Saddened by Earthquake in Afghanistan, Extends Sincere Condolences to Families of Victims
----------------------------------------------------------------------------------------------------
Content from URL: https://press.un.org/en/2023/231002_sc.doc.htm
‘Outrageous a Person Dies of Hunger Every Few S

In [80]:
import requests
from bs4 import BeautifulSoup

url = 'https://press.un.org/en'
response = requests.get(url)

In [82]:
# Use a set to automatically handle duplicates for URLs and content
visited_urls = set()
press_releases_content = set()

# Loop over anchor tags
for link in soup.find_all('a', href=True):
    href = link['href']
    
    # Now just checking for the .doc.htm ending, omitting the year-specific criteria
    if href.endswith('.doc.htm'):
        full_url = href if href.startswith('http') else 'https://press.un.org' + href

        # If we haven't visited this URL before, retrieve its content
        if full_url not in visited_urls:
            content = extract_content_from_url(full_url)
            visited_urls.add(full_url)
            press_releases_content.add(content)

# Display the results
for release_content in press_releases_content:
    # Print the URL for diagnostic purposes
    print(f"Content from URL: {full_url}")
    print(release_content)
    print('-' * 100)

2. Crawl the press room of the European Parliament and extract at least 10 press releases that cover the plenary sessions and contain the word “crisis”. Start with the following seed url: https://www.europarl.europa.eu/news/en/press-room Notice how press releases related to plenary sessions contain the text “PLENARY SESSIONS” with the following html: `<span class="ep_name">Plenary session</span>` 
Here is an example:
https://www.europarl.europa.eu/news/en/press-room/20220620IPR33417/national-recovery-plans-meps-assess-the-performance-of-crisis-funding

In [66]:
import requests
from bs4 import BeautifulSoup

def extract_plenary_links(url):
    local_links = set()  # Using a set to automatically handle duplicates
    
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all spans with the specific text "Plenary session"
    plenary_spans = soup.find_all('span', class_='ep_name', string='Plenary session')

    # Loop over found spans and navigate upwards to get the associated link
    for span in plenary_spans:
        parent_article = span.find_parent('article', class_='ep_gridcolumn')
        if parent_article:
            link = parent_article.find('a', href=True)
            if link:
                href = link['href']
                full_url = href if href.startswith('http') else 'https://www.europarl.europa.eu' + href
                # Add to local_links if the link contains "press-room"
                if '/press-room/' in full_url:
                    local_links.add(full_url)
    
    return local_links

# Extract content from the link and check if the text contains crisis
def extract_and_check_content(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_content = soup.get_text()

    # Check for the word "crisis"
    if 'crisis' not in page_content.lower():
        return None

    # Extract title and text
    title = soup.find('div', class_='ep-p_text')
    text = soup.find('div', class_='ep-a_text')

    return {
        'title': title.get_text().strip() if title else None,
        'text': text.get_text().strip() if text else None
    }

# Start from the seed URL
base_url = 'https://www.europarl.europa.eu/news/en/press-room'
max_releases = 20

all_plenary_links = set()
press_release_contents = []
page_num = 1

while len(press_release_contents) < max_releases:
    current_url = f"{base_url}/page/{page_num}"
    new_links = extract_plenary_links(current_url)
    
    # Break if no new links are found, indicating we might have reached the last page
    if not new_links:
        break
    
    all_plenary_links.update(new_links)
    page_num += 1

# Check each link for the presence of the word 'crisis' and extract content if found
for link in all_plenary_links:
    content = extract_and_check_content(link)
    
    if content:
        press_release_contents.append({
            'url': link,
            **content
        })
        
        if len(press_release_contents) >= max_releases:
            break

# Display the results
for release in press_release_contents:
    print(f"Content from URL: {release['url']}")
    print(f"Title: {release['title']}")
    print(f"Text: {release['text']}")
    print('-' * 100)


Content from URL: https://www.europarl.europa.eu/news/en/press-room/20230911IPR04908/meps-vote-to-strengthen-eu-defence-industry-through-common-procurement
Title: MEPs vote to strengthen EU defence industry through common procurement
Text: MEPs backed the European Defence Industry Reinforcement through common Procurement Act (EDIRPA) on Tuesday.
----------------------------------------------------------------------------------------------------
Content from URL: https://www.europarl.europa.eu/news/en/press-room/20230707IPR02418/semiconductors-meps-adopt-legislation-to-boost-eu-chips-industry
Title: Semiconductors: MEPs adopt legislation to boost EU chips industry
Text: Plans to secure the EU’s supply of chips by boosting production and innovation, and establishing emergency measures against shortages, were adopted by Parliament on Tuesday.
----------------------------------------------------------------------------------------------------
Content from URL: https://www.europarl.europa.e

In [None]:
import requests
from bs4 import BeautifulSoup

In [68]:
# for the most part this copied from my previous homework
def extract_plenary_links(url):
    # storre all the links within the webpage
    local_links = set()
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find spans with the text plenary session
    plenary_spans = soup.find_all('span', class_='ep_name', string='Plenary session')
    # I also tried the following that works equally
    # plenary_spans = soup.find_all('div', class_='ep-p_text ep-layout_contenttype ep-layout_plenary')

    # Move upwards to get the link
    for span in plenary_spans:
        parent_article = span.find_parent('article', class_='ep_gridcolumn')
        if parent_article:
            link = parent_article.find('a', href=True)
            if link:
                href = link['href']
                # distinguish absolute & relative URLs -> Copied from last HW.
                full_url = href if href.startswith('http') else 'https://www.europarl.europa.eu' + href
                # Add to local_links if the link contains "press-room"
                if '/press-room/' in full_url:
                    local_links.add(full_url)
    
    return local_links

In [None]:
# Extract content from the link and check if the text contains crisis
def extract_crisis_content(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_content = soup.get_text()

    # check is crisis exists, if not return None
    if 'crisis' not in page_content.lower():
        return None

    # extract title and text
    title = soup.find('div', class_='ep-p_text')
    text = soup.find('div', class_='ep-a_text')
    
    # return title and text
    return {
        'title': title.get_text().strip() if title else None,
        'text': text.get_text().strip() if text else None
    }

In [69]:


# Start from the seed URL
base_url = 'https://www.europarl.europa.eu/news/en/press-room'
# I set max iterations to stop the crawler
max_releases = 20

press_release_contents = []
page_num = 0  # Start with 0 to also process the initial seed URL

# Set max iterations
while len(press_release_contents) < max_releases:
    current_url = base_url if page_num == 0 else f"{base_url}/page/{page_num}"
    new_links = extract_plenary_links(current_url)
    
    # Break if no new links are found, indicating we might have reached the last page
    #if not new_links:
    #    break
    # Check each link for check link for crisis keyword & extract if it meets 
    for link in new_links:
        content = extract_crisis_content(link)
        if content:
            # store contents in a seto
            press_release_contents.append({
                'url': link,
                **content
            })
            if len(press_release_contents) >= max_releases:
                break
    page_num += 1

# Display the results
for release in press_release_contents:
    print(f"Content from URL: {release['url']}")
    print(f"Title: {release['title']}")
    print(f"Text: {release['text']}")
    print('-' * 100)

Content from URL: https://www.europarl.europa.eu/news/en/press-room/20230929IPR06132/nagorno-karabakh-meps-demand-review-of-eu-relations-with-azerbaijan
Title: Nagorno-Karabakh: MEPs demand review of EU relations with Azerbaijan
Text: Condemning Azerbaijan’s violent seizure of Nagorno-Karabakh, MEPs call for sanctions against those responsible and for the EU to review its relations with Baku.
----------------------------------------------------------------------------------------------------
Content from URL: https://www.europarl.europa.eu/news/en/press-room/20230929IPR06130/parliament-argues-for-a-top-up-to-multi-annual-budget-for-crisis-response
Title: Parliament argues for a top-up to multi-annual budget for crisis response
Text: On Tuesday, MEPs set out their position on the reform of the EU’s long-term budget, emphasizing the urgency of future-proofing the EU budget.
----------------------------------------------------------------------------------------------------
Content from U

In [57]:
# Find all div elements with class 'ep_title'
ep_titles = soup.find_all('div', class_='ep_gridcolumn-content')

# Store plenary article details
plenary_articles = []

for title in ep_titles:
    # Check if the title contains the specific 'plenary' class
    plenary_div = title.find('div', class_='ep-p_text ep-layout_contenttype ep-layout_plenary')
    if plenary_div:
        # Extract the article text
        article_text = title.get_text().strip()

        # Extract the associated URL
        link = title.find('a', href=True)
        article_url = link['href'] if link else None

        # If the URL is relative, prepend the base URL
        if article_url and not article_url.startswith('http'):
            article_url = 'https://www.europarl.europa.eu' + article_url

        # Store details in the list
        plenary_articles.append({'text': article_text, 'url': article_url})

# Display the plenary articles with their URLs
plenary_articles


[{'text': 'Greening the bond markets: MEPs approve new standard to fight greenwashing\xa0\n\n\n\nPlenary session\xa0\nECON\xa0\n\n\n05-10-2023 - 13:48\n\xa0\n\n\n\n\nMEPs on Thursday adopted a new voluntary standard for the use of a “European Green Bond” label, the first of its kind in the world.',
  'url': 'https://www.europarl.europa.eu/news/en/press-room/20230929IPR06139/greening-the-bond-markets-meps-approve-new-standard-to-fight-greenwashing'},
 {'text': 'Nagorno-Karabakh: MEPs demand review of EU relations with Azerbaijan\xa0\n\n\n\nPlenary session\xa0\nAFET\xa0\n\n\n05-10-2023 - 12:53\n\xa0\n\n\n\n\nCondemning Azerbaijan’s violent seizure of Nagorno-Karabakh, MEPs call for sanctions against those responsible and for the EU to review its relations with Baku.',
  'url': 'https://www.europarl.europa.eu/news/en/press-room/20230929IPR06132/nagorno-karabakh-meps-demand-review-of-eu-relations-with-azerbaijan'},
 {'text': 'Parliament pushes for start of EU accession talks with Moldova \

In [61]:
# Extracting URLs from the provided plenary_articles output
plenary_urls = [article['url'] for article in plenary_articles]

# List to store URLs of articles containing the word "crisis"
articles_with_crisis_from_plenary = []

# Iterate over the plenary articles and fetch content for each
for url in plenary_urls:
    response = requests.get(url)
    article_soup = BeautifulSoup(response.content, 'html.parser')
    
    # Check if the word "crisis" is present in the article content
    if "crisis" in article_soup.get_text().lower():
        articles_with_crisis_from_plenary.append(url)

articles_with_crisis_from_plenary


['https://www.europarl.europa.eu/news/en/press-room/20230929IPR06132/nagorno-karabakh-meps-demand-review-of-eu-relations-with-azerbaijan']

In [62]:
import requests
from bs4 import BeautifulSoup

def extract_plenary_links(url):
    local_links = set()  # Using a set to automatically handle duplicates
    
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all spans with the specific text "Plenary session"
    plenary_spans = soup.find_all('span', class_='ep_name', string='Plenary session')

    # Loop over found spans and navigate upwards to get the associated link
    for span in plenary_spans:
        parent_article = span.find_parent('article', class_='ep_gridcolumn')
        if parent_article:
            link = parent_article.find('a', href=True)
            if link:
                href = link['href']
                # Construct the full URL, assuming relative URLs
                full_url = href if href.startswith('http') else 'https://www.europarl.europa.eu' + href
                # Add to local_links if the link contains "press-room"
                if '/press-room/' in full_url:
                    local_links.add(full_url)
    
    return local_links

def check_for_crisis(link):
    """Check if the content of the link contains the word 'crisis'."""
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_content = soup.get_text()
    
    return 'crisis' in page_content.lower()

# Start from the seed URL
seed_url = 'https://www.europarl.europa.eu/news/en/press-room'
plenary_links_set = extract_plenary_links(seed_url)

# Check each link for the presence of the word 'crisis' and print if found
for link in plenary_links_set:
    if check_for_crisis(link):
        print(link)

https://www.europarl.europa.eu/news/en/press-room/20230929IPR06132/nagorno-karabakh-meps-demand-review-of-eu-relations-with-azerbaijan


In [65]:
import requests
from bs4 import BeautifulSoup

def extract_plenary_links(url):
    local_links = set()  # Using a set to automatically handle duplicates
    
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all spans with the specific text "Plenary session"
    plenary_spans = soup.find_all('span', class_='ep_name', string='Plenary session')

    # Loop over found spans and navigate upwards to get the associated link
    for span in plenary_spans:
        parent_article = span.find_parent('article', class_='ep_gridcolumn')
        if parent_article:
            link = parent_article.find('a', href=True)
            if link:
                href = link['href']
                full_url = href if href.startswith('http') else 'https://www.europarl.europa.eu' + href
                # Add to local_links if the link contains "press-room"
                if '/press-room/' in full_url:
                    local_links.add(full_url)
    
    return local_links

# check if the text contains crisis
def check_for_crisis(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_content = soup.get_text()
    return 'crisis' in page_content.lower()

# Start from the seed URL
base_url = 'https://www.europarl.europa.eu/news/en/press-room'
max_releases = 20

all_plenary_links = set()
press_release_count = 0
page_num = 1

# Set max iterations
while press_release_count < max_releases:
    current_url = f"{base_url}/page/{page_num}"
    new_links = extract_plenary_links(current_url)
    
    # Break if no new links are found, indicating we might have reached the last page
    if not new_links:
        break
    
    all_plenary_links.update(new_links)
    page_num += 1

# Check each link for the presence of the word 'crisis' and print if found
for link in all_plenary_links:
    if check_for_crisis(link):
        print(link)
        press_release_count += 1
        
        if press_release_count >= max_releases:
            break


https://www.europarl.europa.eu/news/en/press-room/20230911IPR04908/meps-vote-to-strengthen-eu-defence-industry-through-common-procurement
https://www.europarl.europa.eu/news/en/press-room/20230707IPR02418/semiconductors-meps-adopt-legislation-to-boost-eu-chips-industry
https://www.europarl.europa.eu/news/en/press-room/20230706IPR02317/ep-today
https://www.europarl.europa.eu/news/en/press-room/20230929IPR06130/parliament-argues-for-a-top-up-to-multi-annual-budget-for-crisis-response
https://www.europarl.europa.eu/news/en/press-room/20230706IPR02316/ep-today
https://www.europarl.europa.eu/news/en/press-room/20230707IPR02427/covid-19-parliament-adopts-roadmap-to-better-prepare-for-future-health-crises
https://www.europarl.europa.eu/news/en/press-room/20230911IPR04918/svietlana-tsikhanouskaya-to-meps-support-belarusians-european-aspirations
https://www.europarl.europa.eu/news/en/press-room/20230707IPR02421/parliament-adopts-new-rules-to-boost-energy-savings
https://www.europarl.europa.eu/n

In [63]:
import requests
from bs4 import BeautifulSoup

def extract_plenary_links(url):
    local_links = set()  # Using a set to automatically handle duplicates
    
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all spans with the specific text "Plenary session"
    plenary_spans = soup.find_all('span', class_='ep_name', string='Plenary session')

    # Loop over found spans and navigate upwards to get the associated link
    for span in plenary_spans:
        parent_article = span.find_parent('article', class_='ep_gridcolumn')
        if parent_article:
            link = parent_article.find('a', href=True)
            if link:
                href = link['href']
                # Construct the full URL, assuming relative URLs
                full_url = href if href.startswith('http') else 'https://www.europarl.europa.eu' + href
                # Add to local_links if the link contains "press-room"
                if '/press-room/' in full_url:
                    local_links.add(full_url)
    
    return local_links

def check_for_crisis(link):
    """Check if the content of the link contains the word 'crisis'."""
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    page_content = soup.get_text()
    
    return 'crisis' in page_content.lower()

# Start from the seed URL and iterate over multiple pages
base_url = 'https://www.europarl.europa.eu/news/en/press-room'
max_pages = 10  # Define how many pages you want to check

all_plenary_links = set()

# Iterate through multiple pages
for page_num in range(1, max_pages + 1):
    current_url = f"{base_url}/page/{page_num}"
    all_plenary_links.update(extract_plenary_links(current_url))

# Check each link for the presence of the word 'crisis' and print if found
for link in all_plenary_links:
    if check_for_crisis(link):
        print(link)


https://www.europarl.europa.eu/news/en/press-room/20230911IPR04908/meps-vote-to-strengthen-eu-defence-industry-through-common-procurement
https://www.europarl.europa.eu/news/en/press-room/20230609IPR96209/meps-demand-an-eu-food-security-plan-and-more-resources-for-farmers
https://www.europarl.europa.eu/news/en/press-room/20230609IPR96202/president-christodoulides-no-border-changes-will-stem-from-violence-and-war
https://www.europarl.europa.eu/news/en/press-room/20230707IPR02418/semiconductors-meps-adopt-legislation-to-boost-eu-chips-industry
https://www.europarl.europa.eu/news/en/press-room/20230706IPR02317/ep-today
https://www.europarl.europa.eu/news/en/press-room/20230608IPR95908/ep-today
https://www.europarl.europa.eu/news/en/press-room/20230929IPR06130/parliament-argues-for-a-top-up-to-multi-annual-budget-for-crisis-response
https://www.europarl.europa.eu/news/en/press-room/20230608IPR95905/ep-today
https://www.europarl.europa.eu/news/en/press-room/20230706IPR02316/ep-today
https:/

KeyboardInterrupt: 

In [23]:
import requests
from bs4 import BeautifulSoup

def extract_plenary_links(url):
    local_links = set()  # Using a set to automatically handle duplicates
    
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Navigate to the main body
    ep_gridrow_contents = soup.find_all('div', class_='ep_gridrow-content')

    # Filter for plenary sessions
    plenary_contents = [content for content in ep_gridrow_contents if "Plenary session" in content.get_text()]

    # Loop over anchor tags within plenary sessions to extract links
    for content in plenary_contents:
        for link in content.find_all('a', href=True):
            href = link['href']
            
            # Construct the full URL, assuming relative URLs
            full_url = href if href.startswith('http') else 'https://www.europarl.europa.eu' + href
            
            # Add to local_links; a set will automatically ignore duplicates
            local_links.add(full_url)
    
    return local_links

# Start from the seed URL
seed_url = 'https://www.europarl.europa.eu/news/en/press-room'
plenary_links_set = extract_plenary_links(seed_url)

# Display the extracted links
for link in plenary_links_set:
    print(link)


https://www.eppgroup.eu/newsroom 
https://www.europarl.europa.eu/news/en/press-room/20231009IPR06729/a-step-towards-supporting-eu-competitiveness-and-resilience-in-strategic-sectors
https://www.europarl.europa.eu/news/en/press-room/20231011IPR06911/president-metsola-in-solidarity-with-the-victims-of-the-terror-attacks-in-israel
https://www.europarl.europa.eu/news/en/press-room/20230929IPR06132/nagorno-karabakh-meps-demand-review-of-eu-relations-with-azerbaijan
https://ecrgroup.eu/news/
https://www.europarl.europa.eu/news/en/press-room/20230929IPR06139/greening-the-bond-markets-meps-approve-new-standard-to-fight-greenwashing
https://www.europarl.europa.eu/news/en/press-room/20230929IPR06137/parliament-pushes-for-start-of-eu-accession-talks-with-moldova
https://www.greens-efa.eu/en/newsroom/press-releases/
https://www.europarl.europa.eu/committees/en
https://www.europarl.europa.eu/news/en/agenda/weekly-agenda/2023-42#agenda-day20231016
https://www.guengl.eu/news
https://www.europarl.euro