# Notebook Overview
This notebook includes setup and implementation for web scraping tasks. Below is the structured layout of the notebook for better understanding and readability.

## GeckoDriver and Selenium Setup Script

This script is specifically designed for setting up the environment for web scraping using Selenium in Python. It focuses on installing and configuring necessary components to control a Firefox browser through Selenium.

Key Features:

    System and Browser Setup:
      Updates the system's package list.
      Installs Firefox, the web browser to be automated.
    GeckoDriver Installation:
      Downloads and extracts GeckoDriver, which is essential for Selenium to interact with Firefox.
      Sets executable permissions for GeckoDriver.
      Moves GeckoDriver to a system directory for easy access.
    Selenium Installation:
      Installs Selenium, a powerful tool for browser automation.
    Verification:   
      Checks the installation of GeckoDriver.
      Verifies the installed version of Firefox.
    Purpose:
      The script is tailored for preparing a Python environment for web scraping tasks.
      It ensures that all necessary components, particularly GeckoDriver and Selenium, are correctly installed and configured.
      This setup is crucial for automating and controlling the Firefox browser in web scraping applications.





In [None]:
!apt-get update
!apt-get install firefox
!wget https://github.com/mozilla/geckodriver/releases/download/v0.29.0/geckodriver-v0.29.0-linux64.tar.gz
!tar -xvzf geckodriver-v0.29.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/


!pip install selenium

!wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
!tar -xvzf geckodriver-v0.34.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/

!tar -xvzf geckodriver-v0.34.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/

!which geckodriver
!firefox --version

## Web Content Scraping Script



This Python script is designed for scraping specific content from web pages using the requests library and BeautifulSoup. It focuses on extracting structured data from a list of URLs.

Key Features:

    URLs Processing:
        Maintains a list of URLs to be scraped.
        Iterates over each URL to fetch its content.

    Page Content Fetching:
        Uses the requests library to retrieve the HTML content of each page.

    Data Extraction with BeautifulSoup:
        Parses the HTML content using BeautifulSoup.
        Extracts specific data based on HTML tags and classes.

    Content Organization:
        Stores the extracted content in a structured format (dictionary).
        Handles different types of HTML elements like paragraphs and lists.

    Output Display:
        Prints the extracted information in a readable format.
        Includes a separator for clarity between different scraped contents.

    Purpose:
        The script is tailored for efficiently scraping and organizing information from multiple web pages.
        It demonstrates the use of Python's requests and BeautifulSoup for web scraping, showcasing the ability to handle and parse HTML content to extract meaningful data.
        This script is particularly useful for gathering structured information from various web sources.

In [None]:
import requests
from bs4 import BeautifulSoup

# List of links to scrape
links = [
    'https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/horizon-cl3-2024-bm-01-02',
    # ... Add all other links as needed
]

# Iterate over each link to scrape information
for link in links:
    # Fetch the page content
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Initialize a dictionary to store the sections' content
    sections_content = {}

    # Loop through each description tag and capture the content
    for tag in soup.find_all('span', class_='topicdescriptionkind'):
        section_title = tag.get_text(strip=True)
        content = ''

        # Attempt to find the next element that contains text, this could be a paragraph, list, or another tag
        next_element = tag.find_next()

        # Check if the next element is a paragraph or a list
        if next_element.name == 'p':
            content = next_element.get_text(strip=True)
        elif next_element.name == 'ul':
            # Combine all list items into a single string
            content = ' '.join(li.get_text(strip=True) for li in next_element.find_all('li'))

        # Store the content in the dictionary using the section title as the key
        sections_content[section_title] = content

    # Print the topic and its corresponding sections' content
    print(f"Topic: {link.split('/')[-1]}")
    for title, text in sections_content.items():
        print(f"{title}: {text}")
    print("\n" + "-"*80 + "\n")  # Print a separator line


## Headless Chrome Web Scraping Script



Headless Chrome Web Scraping Script

This Python script is designed for web scraping using Selenium with a headless Chrome browser. It's configured to run in environments like Google Colab, where a graphical user interface is not available.

Key Features:

    Selenium WebDriver Setup:
        Imports necessary Selenium components for web automation.
        Initializes the Chrome WebDriver.
    Headless Chrome Configuration:
        Configures Chrome to run in headless mode (no GUI).
        Sets additional arguments for compatibility with environments like Colab (--no-sandbox, --disable-dev-shm-usage).
    Web Page Interaction:
        Uses the WebDriver to navigate to a specified URL.
        Demonstrates how to interact with web elements (e.g., retrieving the page title).
    Scraping Task Execution:
        The script is structured to perform web scraping tasks on the visited page.
    Clean Up:
        Properly closes the Chrome driver after completing tasks to free resources.
    Purpose:
      The script is specifically designed for automated web scraping tasks in headless environments.
      It showcases how to configure and use Selenium with a headless Chrome browser, making it suitable for server-side scraping tasks or in environments without a display.
      This approach is essential for efficient, automated data extraction from web pages in non-GUI contexts.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set up Chrome options for headless browsing in Colab
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Initialize the Chrome driver with the specified options
driver = webdriver.Chrome(options=options)

# Now you can use the driver object to visit pages, interact with elements, etc.
# For example, to visit a webpage:
driver.get('https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/horizon-cl3-2024-bm-01-01?keywords=HORIZON-CL3-2024&tenders=false&openForSubmission=false&sortBy=identifier&pageSize=25')
# Print the title of the page
print(driver.title)

# Do your scraping tasks...

# Don't forget to close the driver after your tasks are done
driver.quit()


## Web Scraping Script Description



This Python script is designed for **web scraping using Selenium and BeautifulSoup**. It automates a web browser to extract specific information from a webpage.

Key Features:
    
    Selenium Installation and Setup:
      Installs and configures Selenium for browser automation.
      Sets up GeckoDriver and ChromeDriver for Firefox and Chrome browsers.

    Web Browser Configuration:
      Customizes Chrome browser settings for the scraping task.
   
    Web Scraping Process:
      Automates navigation to a specified URL.
      Waits for the necessary page elements to load.
      Parses HTML content using BeautifulSoup.
      Extracts and prints targeted information from the webpage.
      Handles timeouts and exceptions efficiently.
      Closes the browser driver post-scraping.
    Purpose:
      The script demonstrates a comprehensive approach to web scraping, showcasing techniques for browser automation,
      dynamic content handling, and structured data extraction from webpages.

In [None]:
!pip install --upgrade selenium

!apt-get install firefox
!wget https://github.com/mozilla/geckodriver/releases/download/v0.29.0/geckodriver-v0.29.0-linux64.tar.gz
!tar -xvzf geckodriver-v0.29.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/

!tar -xvzf geckodriver-v0.29.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/

!rm /usr/local/bin/geckodriver
!wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
!tar -xvzf geckodriver-v0.34.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/


# Install Selenium
!pip install selenium

# Download the latest chromedriver and unzip it
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time



# Set up options for Chrome, this time without headless mode
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')  # Comment this line out or remove it
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--disable-images")
chrome_options.add_argument("--disable-javascript")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1920,1080')


# Specify the path to the Chrome binary
chrome_options.binary_location = "/usr/bin/google-chrome"

# Initialize the Chrome driver
driver = webdriver.Chrome(options=chrome_options)


# URL of the webpage you want to scrape
url = "https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/horizon-cl3-2024-fct-01-05?keywords=CL3-2024&openForSubmission=false&programmePeriod=2021%20-%202027&frameworkProgramme=43108390"

# Use the driver to navigate to the page
driver.get(url)

try:
    # Wait for the page content to load
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
    print("Page loaded successfully.")
except TimeoutException:
    print("Timed out waiting for page content to load.")

# try:

# # Increase timeout and try a different approach to locate the element
#     WebDriverWait(driver, 90).until(EC.visibility_of_element_located((By.XPATH, "//eui-card-header-title[contains(text(), 'Topic description')]")))

#     # Now get the page source and create a BeautifulSoup object
#     soup = BeautifulSoup(driver.page_source, 'html.parser')

#     # Function to extract information
#     def extract_information(soup):
#         topic_info = {}
#         sections_to_scrape = ["ExpectedOutcome", "Scope"]

#         # Find the 'Topic description' title
#         topic_description_title = soup.find('eui-card-header-title', string="Topic description")
#         if topic_description_title:
#             topic_description_content = topic_description_title.find_next('div', class_='showMore--three-lines')
#             topic_info["Topic description"] = topic_description_content.get_text(strip=True) if topic_description_content else "Content not found"

#         # Find 'ExpectedOutcome' and 'Scope' elements and their respective contents
#         for section in sections_to_scrape:
#             section_content = soup.find('span', class_='topicdescriptionkind', string=section)
#             if section_content:
#                 content = section_content.find_next('p')
#                 topic_info[section] = content.get_text(strip=True) if content else 'Content not found'

#         return topic_info

#     # Call the function with the BeautifulSoup object
#     scraped_data = extract_information(soup)

#     # Print the scraped data
#     for key, value in scraped_data.items():
#         print(f"{key}: {value}\n")

# except TimeoutException:
#     print("Timed out waiting for page to load")
# finally:
#     # Close the driver
#     driver.quit()


try:

    print("Navigated to URL.")

    # Alternative method to find and click the 'Show more' button
    # Update this XPath according to the button's attributes or text
    show_more_button_xpath = "//button[contains(text(), 'Show more')]"  # Example XPath
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, show_more_button_xpath)))
    driver.find_element(By.XPATH, show_more_button_xpath).click()
    print("'Show more' button clicked.")


    # Wait for the visibility of the title 'Topic description'
    WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//eui-card-header-title[contains(text(), 'Topic description')]")))
    print("Topic description is visible.")

    # Now get the page source and create a BeautifulSoup object
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Function to extract information
    def extract_information(soup):
        topic_info = {}
        sections_to_scrape = ["ExpectedOutcome", "Scope"]

        # Find the 'Topic description' title
        topic_description_title = soup.find('eui-card-header-title', string="Topic description")
        if topic_description_title:
            topic_description_content = topic_description_title.find_next('div', class_='showMore--three-lines')
            topic_info["Topic description"] = topic_description_content.get_text(strip=True) if topic_description_content else "Content not found"

        # Find 'ExpectedOutcome' and 'Scope' elements and their respective contents
        for section in sections_to_scrape:
            section_content = soup.find('span', class_='topicdescriptionkind', string=section)
            if section_content:
                content = section_content.find_next('p')
                topic_info[section] = content.get_text(strip=True) if content else 'Content not found'

        return topic_info

    # Call the function with the BeautifulSoup object
    scraped_data = extract_information(soup)
    print("Scraped data extracted.")

    # Print the scraped data
    for key, value in scraped_data.items():
        print(f"{key}: {value}\n")

except TimeoutException:
    print("Timed out waiting for page to load or element to be found")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the driver
    driver.quit()

### This cell imports necessary libraries and sets up the environment for the task.

In [None]:
import requests

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-05.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()

# Print the top-level keys
print(data.keys())


### This cell imports necessary libraries and sets up the environment for the task.

In [None]:
import requests

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-05.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()

# Access the 'TopicDetails' key
topic_details = data['TopicDetails']

# Now, you need to find the correct sub-key under 'TopicDetails' that contains the "Topic description"
# If you're not sure what the sub-key is, you can print out the keys within 'TopicDetails' to explore:
print(topic_details.keys())

# Once you identify the correct sub-key, you can access the "Topic description" like this:
# topic_description = topic_details['the_sub_key_here']
# print(topic_description)


### This cell imports necessary libraries and sets up the environment for the task.

In [None]:
import requests

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-03.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()

# Extract the 'Topic description' from 'destinationDetails' key
topic_description = data['TopicDetails']

# Print or process the 'Topic description'
print(topic_description)


### This cell imports necessary libraries and sets up the environment for the task.

In [None]:
import requests
import pandas as pd
from datetime import datetime

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-05.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()['TopicDetails']

# Extract specific data
call_id = data['identifier']
call_title = data['callTitle']
type_of_action = data['actions'][0]['types'][0]['typeOfAction']
type_of_mga = data['actions'][0]['types'][0]['typeOfMGA'][0]['abbreviation']

# Converting dates from timestamp to readable format
planned_opening_date = data['actions'][0]['plannedOpeningDate']
deadline_date = data['actions'][0]['deadlineDates'][0] + " 17:00:00 Brussels time"


# Create a DataFrame for table format
df = pd.DataFrame({
    'Call ID': [call_id],
    'Call Title': [call_title],
    'Type of Action': [type_of_action],
    'Type of MGA': [type_of_mga],
    'Planned Opening Date': [planned_opening_date],
    'Deadline Date': [deadline_date]
})

# Print the DataFrame
print(df)


## JSON Data Retrieval and Parsing Script



This Python script is designed for fetching and parsing JSON data from a specified URL using the requests library. It focuses on retrieving structured JSON data from a web source and extracting key information.

Key Features:

    URL Specification:
        Defines the URL of the JSON data to be fetched.
    Data Fetching:
        Uses the requests library to send a GET request to the specified URL.
    JSON Parsing:
        Parses the JSON response from the request.
    Data Inspection:
        Prints the top-level keys of the JSON data to provide an overview of its structure.
    Purpose:
      The script is tailored for simple and efficient retrieval and parsing of JSON data from web sources.
      It demonstrates the use of Python's requests library for network requests and JSON handling, showcasing the ability
      to easily access and inspect structured data from the internet.
      This script is particularly useful for applications that require automated data fetching and parsing from online JSON sources.

In [None]:
import requests

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-05.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()

# Print the top-level keys
print(data.keys())


## JSON Data Retrieval and Specific Content Extraction Script



This Python script is tailored for fetching JSON data from a URL and extracting specific information from it using the requests library. It focuses on accessing a particular section of the JSON data structure.

Key Features:

    URL Definition:
        Specifies the URL of the JSON file to be retrieved.
    GET Request Execution:
        Sends a GET request to the URL to fetch the JSON data.
    JSON Data Parsing:
        Parses the received JSON response.
    Accessing Specific Data:
        Extracts the 'TopicDetails' section from the JSON data.
        Prints the keys within 'TopicDetails' to facilitate exploration and identification of the relevant sub-key.
    Targeted Data Extraction:
        Provides a template for accessing a specific sub-key (e.g., "Topic description") once identified.
    Purpose:
        The script is designed for scenarios where specific pieces of information need to be extracted from structured JSON data obtained from a web source.
        It demonstrates the process of fetching, parsing, and drilling down into JSON data to access detailed content.
        This approach is valuable for applications that require targeted data extraction from complex JSON structures, particularly in data analysis or integration tasks.

In [None]:
import requests

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-05.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()

# Access the 'TopicDetails' key
topic_details = data['TopicDetails']

# Now, you need to find the correct sub-key under 'TopicDetails' that contains the "Topic description"
# If you're not sure what the sub-key is, you can print out the keys within 'TopicDetails' to explore:
print(topic_details.keys())

# Once you identify the correct sub-key, you can access the "Topic description" like this:
# topic_description = topic_details['the_sub_key_here']
# print(topic_description)


## JSON Data Retrieval and DataFrame Creation Script



This Python script is designed for fetching JSON data from a specified URL, extracting specific information, and organizing it into a pandas DataFrame. It uses the requests library for data retrieval and pandas for data structuring.

Key Features:

    URL and Data Fetching:
        Specifies the URL of the JSON file.
        Fetches the JSON data using a GET request.
    JSON Parsing and Data Extraction:
        Parses the JSON response to access the 'TopicDetails'.
        Extracts specific details like call ID, title, type of action, and type of MGA.
    Date Formatting:
        Converts timestamps into a readable date format.
    DataFrame Creation:
        Utilizes pandas to create a DataFrame for structured data representation.
        Organizes extracted data into a tabular format.
    Data Display:
        Prints the DataFrame for review or further analysis.

In [None]:
import requests
import pandas as pd
from datetime import datetime

# URL of the JSON file
url = 'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/horizon-cl3-2024-fct-01-05.json'

# Send a GET request
response = requests.get(url)

# Parse the JSON response
data = response.json()['TopicDetails']

# Extract specific data
call_id = data['identifier']
call_title = data['callTitle']
type_of_action = data['actions'][0]['types'][0]['typeOfAction']
type_of_mga = data['actions'][0]['types'][0]['typeOfMGA'][0]['abbreviation']

# Converting dates from timestamp to readable format
planned_opening_date = data['actions'][0]['plannedOpeningDate']
deadline_date = data['actions'][0]['deadlineDates'][0] + " 17:00:00 Brussels time"


# Create a DataFrame for table format
df = pd.DataFrame({
    'Call ID': [call_id],
    'Call Title': [call_title],
    'Type of Action': [type_of_action],
    'Type of MGA': [type_of_mga],
    'Planned Opening Date': [planned_opening_date],
    'Deadline Date': [deadline_date]
})

# Print the DataFrame
print(df)

## POST Request and Data Processing Script



This Python script is designed for sending a POST request to a specified API, processing the JSON response, and organizing the data into a pandas DataFrame, which is then saved to an Excel file. It uses the requests library for the API call and pandas for data handling.

**Adauga Call Id in Excel/ Sheet1**

Key Features:

    API Interaction:
        Defines the URL for a POST request to an API.
        Sends the POST request and checks the response status.
    JSON Response Handling:
        Parses the JSON response upon a successful request.
        Extracts relevant data from the "results" array.
    Data Processing:
        Processes and filters the data, avoiding duplicates.
        Converts content to lowercase for consistency.
    Data Sorting and DataFrame Creation:
        Sorts the entries based on identifiers.
        Creates a DataFrame with the sorted data.
    Excel File Generation:
        Saves the DataFrame to an Excel file for external use.
    Data Display:
        Prints the sorted entries for immediate review.

In [None]:
import requests
import json
import pandas as pd

# Define the URL for the POST request
url = 'https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=SEDIA&text=*CL3-2024*&pageSize=50&pageNumber=1'

# Make the POST request
response = requests.post(url)

# Check the response status code
if response.status_code == 200:
    # Request was successful, parse the JSON response
    json_data = json.loads(response.text)

    # Extract information from the "results" array
    results = json_data.get("results", [])

    # Create a list to store entries and a set to keep track of processed identifiers
    entries = []
    processed_identifiers = set()

    for result in results:
        metadata = result.get("metadata", {})
        identifiers = metadata.get("identifier", [])
        content = result.get("content", "")

        # Check if identifiers exist and are not empty
        if identifiers:
            identifier = identifiers[0]  # Use the first identifier
            content = content.lower()  # Convert content to lowercase

            # Check if the identifier has already been processed
            if identifier not in processed_identifiers:
                # Append the entry to the list and mark the identifier as processed
                entries.append((identifier.lower(), content))
                processed_identifiers.add(identifier)

    # Sort entries by identifier content
    sorted_entries = sorted(entries, key=lambda x: x[0])

    # Create a DataFrame from the sorted entries
    df = pd.DataFrame(sorted_entries, columns=['Call-Id', 'Content'])

    # Save the DataFrame to an Excel file in Sheet 1
    with pd.ExcelWriter('eu_topic_details.xlsx', engine='openpyxl') as writer:
        df.to_excel(writer, sheet_name='Sheet1', index=False)

    # Print the sorted entries
    for idx, entry in enumerate(sorted_entries, start=1):
        identifier, content = entry
        print(f"{idx}. Identifier: {identifier}")
        print(f"   Content: {content}")

else:
    # Request failed
    print(f'POST request failed with status code {response.status_code}')

## Web Data Extraction and DataFrame Compilation Script



This Python script is designed for extracting specific information from web sources using identifiers from an Excel file, and then organizing this data into a structured pandas DataFrame. It uses requests for web requests, BeautifulSoup for HTML parsing, and pandas for data handling.

Key Features: **Parseaza datele de interes in Excel / Sheet2**

    Excel File Reading:
        Reads identifiers from an Excel file using pandas.
    Web Requests and Data Extraction:
        Iterates through each identifier to construct and send GET requests to specific URLs.
        Parses JSON responses to extract detailed information.
    HTML Content Parsing:
        Utilizes BeautifulSoup to parse HTML content within the JSON data.
        Extracts and cleans specific sections like 'Expected Outcome' and 'Scope'.
    DataFrame Creation and Updating:
        Compiles extracted data into a pandas DataFrame.
        Appends new data to the DataFrame for each identifier.
    Error Handling:
        Includes exception handling for robust processing.
    Excel File Writing:
        Saves the compiled data into a new sheet in the Excel file.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Read the Excel file with identifiers from Sheet1
df_identifiers = pd.read_excel('eu_topic_details.xlsx', sheet_name='Sheet1')

# Create an empty DataFrame to store the results
df_results = pd.DataFrame(columns=[
    'Call ID',
    'Call Title',
    'Type of Action',
    'Type of MGA',
    'Planned Opening Date',
    'Deadline Date',
    'Expected Outcome',
    'Scope'
])

# Iterate through identifiers
for index, row in df_identifiers.iterrows():
    identifier = row['Call-Id']

    # Construct the URL with the identifier
    url = f'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/{identifier}.json'

    # Send a GET request
    response = requests.get(url)

    try:
        # Parse the JSON response
        data = response.json()

        # Extract the required fields
        call_id = data['TopicDetails']['identifier']
        call_title = data['TopicDetails']['callTitle']
        type_of_action = data['TopicDetails']['actions'][0]['types'][0]['typeOfAction']
        type_of_mga = data['TopicDetails']['actions'][0]['types'][0]['typeOfMGA'][0]['description']
        planned_opening_date = data['TopicDetails']['actions'][0]['plannedOpeningDate']
        deadline_date = data['TopicDetails']['actions'][0]['deadlineDates'][0] + " 17:00:00 Brussels time"

        # Extracting and cleaning the ExpectedOutcome and Scope content
        description_html = data['TopicDetails']['description']
        soup = BeautifulSoup(description_html, 'html.parser')

        # Extracting text content for ExpectedOutcome and Scope
        expected_outcome_tag = soup.find('span', class_='topicdescriptionkind', string='ExpectedOutcome')
        scope_tag = soup.find('span', class_='topicdescriptionkind', string='Scope')

        expected_outcome = ""
        scope = ""

        if expected_outcome_tag:
            # Find all <li> tags within the ExpectedOutcome section and concatenate their text
            expected_outcome_list = expected_outcome_tag.find_next('ul')
            if expected_outcome_list:
                expected_outcome_items = expected_outcome_list.find_all('li')
                expected_outcome = "\n".join([item.get_text(strip=True) for item in expected_outcome_items])

        if scope_tag:
            # Find the next <p> tag after the Scope section and get its text
            scope_paragraph = scope_tag.find_next('p')
            if scope_paragraph:
                scope = scope_paragraph.get_text(strip=True)

        # Add the data to the DataFrame for results
        df_results = df_results.append({
            'Call ID': call_id,
            'Call Title': call_title,
            'Type of Action': type_of_action,
            'Type of MGA': type_of_mga,
            'Planned Opening Date': planned_opening_date,
            'Deadline Date': deadline_date,
            'Expected Outcome': expected_outcome,
            'Scope': scope
        }, ignore_index=True)

    except Exception as e:
        print(f"Error processing identifier {identifier}: {str(e)}")

# Save the results to an Excel file
with pd.ExcelWriter('eu_topic_details.xlsx', engine='openpyxl', mode='a') as writer:
    df_results.to_excel(writer, sheet_name='Sheet2', index=False)

# Print the results DataFrame for Sheet 2
print(df_results)

## Budget Data Extraction and Compilation Script



This Python script is designed for extracting budget-related information from a web source using identifiers from an Excel file and organizing this data into a structured pandas DataFrame. It uses requests for web requests and pandas for data handling and Excel file operations.

Key Features:

    Excel File Interaction:
        Reads identifiers from an existing Excel file.
        Prepares to save new data to the same file.
    Web Requests and JSON Parsing:
        Iterates through identifiers to construct and send GET requests.
        Parses JSON responses to extract budget information.
    Data Extraction and Processing:
        Extracts budget values, number of projects, and budget per project.
        Utilizes a custom function to handle budget value extraction.
    Unique Identifier Tracking:
        Ensures data uniqueness by tracking action identifiers.
    DataFrame Creation and Updating:
        Compiles extracted data into a new DataFrame.
        Appends new data for each unique action identifier.
    Error Handling:
        Implements exception handling for robust data processing.
    Excel File Writing:
        Adds the new DataFrame to a new sheet in the existing Excel file.

In [None]:
import requests
import pandas as pd

# Define the path to your Excel file
input_excel_path = '/content/eu_topic_details.xlsx'
output_excel_path = '/content/eu_topic_details.xlsx'  # Updated output path

# Read the Excel file with identifiers from Sheet1
df_identifiers = pd.read_excel(input_excel_path, sheet_name='Sheet1')

# Prepare a new DataFrame to store action identifiers and budget values
budget_data = pd.DataFrame(columns=['Action Identifier', 'Budget Value', 'Number of Projects', 'Budget Per Project'])

# Define a function to extract the budget value as a number
def get_budget_value(budget_year_map):
    if '2024' in budget_year_map:
        return budget_year_map['2024']
    return 0

# Set to keep track of unique Action Identifiers
unique_action_identifiers = set()

# Process each link in the Excel file
for index, row in df_identifiers.iterrows():
    identifier = row['Call-Id']

    # Construct the URL with the identifier
    url = f'https://ec.europa.eu/info/funding-tenders/opportunities/data/topicDetails/{identifier}.json'

    # Send a GET request
    response = requests.get(url)

    try:
        # Parse the JSON response
        data = response.json()

        # Access data under 'TopicDetails' and then 'budgetOverviewJSONItem'
        budget_overview = data['TopicDetails'].get('budgetOverviewJSONItem', {}).get('budgetTopicActionMap', {})

        for action_id, actions in budget_overview.items():
            for action in actions:
                # Extract the required fields
                action_identifier = action.get('action', '').split(' - ')[0]
                budget_value = get_budget_value(action.get('budgetYearMap', {}))
                num_projects = action.get('expectedGrants', 0)
                budget_per_project = action.get('minContribution', 0)

                # Check for duplicate Action Identifiers
                if action_identifier not in unique_action_identifiers:
                    # Append the data to the new DataFrame
                    budget_data = budget_data.append({
                        'Action Identifier': action_identifier,
                        'Budget Value': budget_value,
                        'Number of Projects': num_projects,
                        'Budget Per Project': budget_per_project
                    }, ignore_index=True)
                    unique_action_identifiers.add(action_identifier)

    except Exception as e:
        print(f"Error processing identifier {identifier}: {str(e)}")

# Save the new DataFrame with action identifiers and budget values to the same Excel file in a new sheet (Sheet3)
with pd.ExcelWriter(output_excel_path, engine='openpyxl', mode='a') as writer:
    budget_data.to_excel(writer, sheet_name='Sheet3', index=False)

print("Processing complete. Action identifiers and budget values added to 'Sheet3' in the same Excel file.")

In [None]:
import os

# Specify the path to your Google Drive folder
drive_path = '/content/drive/My Drive/'

# List all items (files and folders) in your Google Drive root directory
drive_items = os.listdir(drive_path)

# Filter and print only the folders
folders = [item for item in drive_items if os.path.isdir(os.path.join(drive_path, item))]

# Print the list of folders
print("List of Folders in Google Drive:")
for folder in folders:
    print(folder)


List of Folders in Google Drive:
BEIA
KINETO
cursuri
Books
UPB
odbackup
GFoto
cardsdxi
Colab Notebooks
RST_A6.1
appsheet


In [None]:
%cd '/content/drive/My Drive/Colab Notebooks/'


/content/drive/My Drive/Colab Notebooks


In [None]:
rm -rf repo


In [None]:
!git clone https://github.com/pasatsanduadrian/repo.git


Cloning into 'repo'...


In [None]:
%cd '/content/drive/My Drive/Colab Notebooks/'


/content/drive/My Drive/Colab Notebooks


In [None]:
%cd repo  # Navigate to the 'repo' directory
!git add Scrape_v3.ipynb  # Add the file you want to commit
!git commit -m "Added my Colab notebook"  # Commit the changes


[Errno 2] No such file or directory: 'repo # Navigate to the repo directory'
/content/drive/My Drive/Colab Notebooks
[master bdae24a] Added my Colab notebook
 1 file changed, 1 insertion(+)
 create mode 100644 Scrape_v3.ipynb


In [60]:
!git config --global credential.helper store
!echo "https://pasatsanduadrian:${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}@github.com" > ~/.git-credentials


In [4]:
!rm -f ~/.git-credential-cache


In [7]:
!git config --global credential.helper cache
!git config --global credential.helper 'cache --timeout=3600'


In [12]:
!rm -rf '/content/drive/My Drive/Colab Notebooks/repo/'


In [14]:
!git clone https://github.com/pasatsanduadrian/repo.git


fatal: destination path 'repo' already exists and is not an empty directory.


In [15]:
%cd '/content/drive/My Drive/Colab Notebooks/'


/content/drive/My Drive/Colab Notebooks


In [18]:
cd repo  # Navigate to the 'repo' directory
!git add Scrape_v3.ipynb  # Add the file you want to commit
!git commit -m "Added my Colab notebook"  # Commit the changes


SyntaxError: invalid syntax (<ipython-input-18-e3b518d896ce>, line 1)