# Sheet 3

## Exercise 1a
Which rules do you need to follow when doing web scraping?
- **Control request rate**: Send requests at a reasonable frequency to avoid server overload.
- **Avoid private or restricted data**: Only collect publicly available information and do not collect personal data.
- **Design for flexibility**: Structure the scraper to adapt to changes in website layout or HTML structure, making it easier to update if the site changes.
- **Extract data dynamically**: Use relative paths and search functions (e.g., class names or unique IDs) to locate data, reducing dependency on specific HTML elements.
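
The first rule (controlling the request rate) can be sketched as a small helper that enforces a minimum gap between requests. This is only an illustration: the class name `RateLimiter` and its parameters are hypothetical and not part of the exercise, and the injectable `clock` and `sleep` arguments exist purely so the behaviour can be checked without actually waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `limiter.wait()` right before each `requests.get(...)`, so the pacing applies to every request regardless of whether it succeeds.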


## Exercise 1b

In [1]:
# imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random

In [2]:
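# Read the file line by line so that each line becomes exactly one title
with open('./Data/titles.csv', 'r') as file:
    lines = file.readlines()

# Skip the header row and drop empty lines
data = [line.strip() for line in lines[1:] if line.strip()]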
df_titles = pd.DataFrame(data, columns=['Title'])

df_titles.sample(10)

|     | Title |
|----:|-------|
| 81 | A comprehensive study on challenges in deployi... |
| 125 | Impact of Advanced Synoptics and Simplified Ch... |
| 38 | Tensorflow-serving: Flexible, high-performance... |
| 70 | On challenges in machine learning model manage... |
| 111 | Perceive your users in depth: Learning univers... |
| 42 | Innovative devops for artificial intelligence |
| 162 | Extending reference architecture of big data s... |
| 171 | Scaling Machine Learning as a Service |
| 259 | Toward human-in-the-loop collaboration between... |
| 267 | Towards classes of architectural dependability... |


In [None]:
titles = df_titles['Title'].tolist()

# List to store the data of the papers
papers_data = []

# Counter to limit requests
counter = 0

# Iterate over the titles and fetch the data from the Semantic Scholar API
search_url = "https://api.semanticscholar.org/graph/v1/paper/search"
fields = "title,authors,year,venue,abstract,citationCount"

for title in titles:
    # Pass the query via params so that requests URL-encodes special
    # characters in the title (e.g. '&' or '#')
    response = requests.get(
        search_url,
        params={"query": title, "fields": fields, "limit": 1},
        timeout=30,
    )

    if response.status_code == 429:
        print("Rate limit reached. Process terminated due to too many requests.")
        break  # Exit the loop if rate limit is reached

    if response.status_code == 200:
        data = response.json()

        if data.get("data"):
            # Keep only the first search hit for this title
            first_paper = data["data"][0]

            paper_title = first_paper.get("title") or "Title not available"
            authors = ", ".join([author["name"] for author in first_paper.get("authors", [])]) or "No author available"
            year_of_publication = first_paper.get("year") or "Year not available"
            publication_venue = first_paper.get("venue") or "Publication venue not available"
            full_abstract = first_paper.get("abstract") or "Abstract not available"
            citation_count = first_paper.get("citationCount") or 0

            papers_data.append({
                "Paper title": paper_title,
                "Authors": authors,
                "Year of publication": year_of_publication,
                "Where it was published": publication_venue,
                "Full abstract": full_abstract,
                "Amount of citations": citation_count
            })
        else:
            print(f"No result found for: {title}")
    else:
        print(f"Error when fetching: {title}, Status Code: {response.status_code}")

    # Count every request and pause after it, including failed or empty ones
    counter += 1
    time.sleep(random.uniform(3, 10))

    # Longer pause every 20 requests
    if counter % 20 == 0:
        print("Waiting for 120 seconds to prevent blocking...")
        time.sleep(120)

# Store the collected data in a DataFrame and save it where the next cell reads it
df_papers = pd.DataFrame(papers_data)
df_papers.to_csv('./Output/dea03_exercise1_web_scraping_results.csv', sep=';', index=False)

df_papers.head()

Waiting for 120 seconds to prevent blocking... (printed 14 times, once after every 20 requests)


|     | Paper title | Authors | Year of publication | Where it was published | Full abstract | Amount of citations |
|----:|-------------|---------|--------------------:|------------------------|---------------|--------------------:|
| 0 | A Framework for Automated Testing | Thomas Fehlmann, Eberhard Kranich | 2020 | European Conference on Software Process Improv... |  | 6 |
| 1 | Towards MLOps: A Case Study of ML Pipeline Pla... | Yue Zhou, Yue Yu, Bo Ding | 2020 | 2020 International Conference on Artificial In... | The development and deployment of machine lear... | 58 |
| 2 | Euclid. I. Overview of the Euclid mission | Euclid Collaboration Y. Mellier, Abdurro’uf, J... | 2024 | Astronomy & Astrophysics | The current standard model of cosmology succes... | 26 |
| 3 | The key to leveraging AI at scale | Deborah Leff, Kenneth T. K. Lim | 2021 | Journal of Revenue and Pricing Management |  | 6 |
| 4 | Reliable Fleet Analytics for Edge IoT Solutions | E. Raj, M. Westerlund, L. E. Leal | 2021 | arXiv.org | In recent years we have witnessed a boom in In... | 10 |
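
The loop above gives up as soon as the API answers with status 429. A common alternative is to wait and retry with growing delays. The following is a minimal sketch of that idea, not the method used in this sheet: `fetch_with_backoff`, `max_retries`, and `base_delay` are hypothetical names, and the `sleep` argument is injectable only so that the logic can be exercised without real waiting.

```python
import random
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or the retries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    """
    for attempt in range(max_retries + 1):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            # Exponential backoff plus up to one second of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
    # Still rate limited after all retries; the caller decides what to do
    return response
```

Inside the loop, this could be used as `response = fetch_with_backoff(lambda: requests.get(url, params=params))`, where `url` and `params` stand for whatever the request is built from. The existing status-code checks would remain unchanged.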


In [3]:
df_papers = pd.read_csv('./Output/dea03_exercise1_web_scraping_results.csv', sep=';')
df_papers.head()

|     | Paper title | Authors | Year of publication | Where it was published | Full abstract | Amount of citations |
|----:|-------------|---------|--------------------:|------------------------|---------------|--------------------:|
| 0 | A Framework for Automated Testing | Thomas Fehlmann, Eberhard Kranich | 2020 | European Conference on Software Process Improv... | Abstract not available | 6 |
| 1 | Towards MLOps: A Case Study of ML Pipeline Pla... | Yue Zhou, Yue Yu, Bo Ding | 2020 | 2020 International Conference on Artificial In... | The development and deployment of machine lear... | 58 |
| 2 | Euclid. I. Overview of the Euclid mission | Euclid Collaboration Y. Mellier, Abdurro’uf, J... | 2024 | Astronomy & Astrophysics | The current standard model of cosmology succes... | 26 |
| 3 | The key to leveraging AI at scale | Deborah Leff, Kenneth T. K. Lim | 2021 | Journal of Revenue and Pricing Management | Abstract not available | 6 |
| 4 | Reliable Fleet Analytics for Edge IoT Solutions | E. Raj, M. Westerlund, L. E. Leal | 2021 | arXiv.org | In recent years we have witnessed a boom in In... | 10 |


### Author with most published titles

In [7]:
# One entry per (paper, author) pair; drop the placeholder used for missing authors
all_authors = df_papers['Authors'].str.split(', ').explode()
all_authors = all_authors[all_authors != 'No author available']

# Count each author once and find the one with the most publications
author_counts = all_authors.value_counts()
most_published_author = author_counts.idxmax()
most_published_count = author_counts.max()
print(f"Author with the most publications: {most_published_author} ({most_published_count} papers)")

Author with the most publications: J. Bosch (13 papers)


### What is the year with the most paper publications?

In [8]:
# Count publications by year
year_counts = df_papers['Year of publication'].value_counts()

# Find the year with the most publications
most_publications_year = year_counts.idxmax()
most_publications_count = year_counts.max()
print(f"Year with the most publications: {most_publications_year} ({most_publications_count} papers)")

Year with the most publications: 2020 (80 papers)


### Which paper was cited the most?

In [9]:
# Convert citations to integers and find the paper with the most citations
df_papers['Amount of citations'] = df_papers['Amount of citations'].astype(int)
most_cited_paper = df_papers.loc[df_papers['Amount of citations'].idxmax()]

print("Paper with the most citations:")
print(f"Title: {most_cited_paper['Paper title']}")
print(f"Citations: {most_cited_paper['Amount of citations']}")

Paper with the most citations:
Title: STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
Citations: 12319
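
Because the scraping loop keeps the first search hit for every query, the returned paper is not guaranteed to be the one that was searched for, and a mismatched hit can influence statistics such as the most cited paper. One possible sanity check is to compare the queried title with the returned title. The sketch below uses `difflib` from the standard library; the function names and the `threshold` of 0.9 are assumptions chosen for illustration, not values taken from the exercise.

```python
from difflib import SequenceMatcher


def title_similarity(query, returned):
    """Return a similarity ratio between 0 and 1, ignoring case and extra spaces."""
    def normalize(text):
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalize(query), normalize(returned)).ratio()


def is_probable_match(query, returned, threshold=0.9):
    """Treat the hit as the searched paper only if the titles are nearly identical."""
    return title_similarity(query, returned) >= threshold
```

Applied row by row to the queried title and the `Paper title` column, rows where `is_probable_match` returns `False` could be inspected manually or excluded before computing the statistics.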
