<div dir=ltr align=center>
<font color=0F5298 size=10>
    Data Science <br>
<font color=0F5298 size=5>
    Electrical Engineering Department <br>
    Fall 2024 <br>
    Parham Gilani - 400101859 <br>
    Project Phase 0<br>
    
____

# Crawling:

In [41]:
import requests
import time
import json
import os

# Fetch papers from the Semantic Scholar API
def scrape_semantic_scholar_api(topic, year_range, limit=1000):
    api_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    collected_papers = []
    current_offset = 0

    while len(collected_papers) < limit:
        if current_offset % 300 == 0:
            print(f"Scraped: {current_offset}/{limit}")

        # Request parameters
        query_params = {
            "query": topic,
            "fields": "title,abstract,authors,citationCount",
            "offset": current_offset,
            "limit": 100,
            "year": year_range,
            "sort": "relevance"
        }

        try:
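            # A timeout keeps a stalled connection from hanging the crawl; timeouts fall through to the retry below
            response = requests.get(api_url, params=query_params, timeout=30)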

            if response.status_code == 429:
                print("Hit rate limit. Pausing for 10 seconds...")
                time.sleep(10)
                continue

            if response.status_code != 200:
                # Use the raw body: response.json() raises on a non-JSON error page
                print(f"Error: Received {response.status_code}. Message: {response.text}.")
                break

            response_data = response.json()
            paper_list = response_data.get("data", [])

            if not paper_list:
                print("No additional papers found.")
                break

            for paper in paper_list:
                try:
                    collected_papers.append({
                        # The API can return null for a field, so fall back explicitly instead of relying on .get defaults
                        'title': paper.get("title") or "No Title",
                        'abstract': paper.get("abstract") or "No Abstract",
                        'authors': ", ".join(author.get("name") or "Unknown" for author in paper.get("authors") or []),
                        'citations': paper.get("citationCount") or 0
                    })

                    if len(collected_papers) >= limit:
                        break

                except Exception as inner_error:
                    print(f"Error processing paper data: {inner_error}")

            current_offset += 100
            time.sleep(5)

        except Exception as outer_error:
            print(f"Network or request error: {outer_error}. Retrying in 10 seconds...")
            time.sleep(10)

    return collected_papers


# Save the scraped data to a JSON file for each topic
def save_to_json_file(topic, all_year_data):
    if not all_year_data:
        print(f"No data to save for topic: {topic}")
        return

    # Create a unique filename for the topic
    file_name = f"{topic.replace(' ', '_')}_data.json"

    # Save the data to the JSON file
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(all_year_data, f, ensure_ascii=False, indent=4)

    print(f"Data saved to {file_name}")


topics = ["Foundation Models", "Generative Models", "LLM", "VLM", "Diffusion Models"]
year_ranges = ["2017-2023", "2024-2025"]

for topic in topics:
    print(f"Scraping data for topic: {topic}")

    all_year_data = []

    # Scrape and collect data for each year range
    for year_range in year_ranges:
        data = scrape_semantic_scholar_api(topic, year_range, limit=1000)
        all_year_data.append({
            'papers': data
        })

    # Save all data for the topic in one file
    save_to_json_file(topic, all_year_data)

    print(f"Completed scraping for topic: {topic}")


Scraping data for topic: Foundation Models
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Data saved to Foundation_Models_data.json
Completed scraping for topic: Foundation Models
Scraping data for topic: Generative Models
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 900/1000
Data saved to Generative_Models_data.json
Completed scraping for topic: Generative Models
Scraping data for topic: LLM
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Data saved to LLM_data.json
Completed scraping for topic: LLM
Scraping data for topic: VLM
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Error: Received 400. Message: {'error': 'Requested data for this limit and/or offs

This script collects academic papers from the Semantic Scholar API for each topic and year range. `scrape_semantic_scholar_api` requests the `title`, `abstract`, `authors`, and `citationCount` fields in pages of 100 results, advancing `offset` by 100 after each page and waiting 5 seconds between requests. Fields absent from a record fall back to placeholder values such as "No Title". The function stops when `limit` papers have been collected, when the API returns an empty page, or when a response has a status other than 200 or 429. A 429 (rate limit) response pauses the script for 10 seconds and then repeats the same request, and network exceptions are retried the same way. Any other error status is printed and ends the crawl for that topic and year range, as happened for "VLM" in the output above, where the request at offset 900 returned a 400.

`save_to_json_file` then writes one JSON file per topic, named after the topic with spaces replaced by underscores and the suffix `_data.json`. Each file holds a list with one `papers` entry per year range, in the order given in `year_ranges`, and the topics are processed sequentially.
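
The retry behaviour described above has no upper bound on the number of attempts. As an illustration only, the following minimal sketch shows one way to cap retries and grow the pause between them. The names `call_with_retries`, `request_fn`, `max_attempts`, and `base_delay` are hypothetical and not part of the original script; `request_fn` stands in for the `requests.get` call and is assumed to return a `(status_code, payload)` tuple.

```python
import time


def call_with_retries(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn until it returns status 200 or max_attempts is used up.

    request_fn is a hypothetical zero-argument callable returning
    (status_code, payload). Status 429 and OSError-based network failures
    are retried with exponentially growing pauses; any other status is
    treated as unrecoverable.
    """
    for attempt in range(max_attempts):
        try:
            status, payload = request_fn()
        except OSError as error:
            # Network-level failure: mark it as retryable
            status, payload = None, error

        if status == 200:
            return payload
        if status not in (None, 429):
            raise RuntimeError(f"Unrecoverable status {status}: {payload}")

        # Pause before the next attempt: base_delay, 2*base_delay, 4*base_delay, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)

    raise RuntimeError(f"Gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter is a design choice that lets the pause be replaced in tests, so the logic can be checked without waiting or touching the network.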

# Cleaning and merging the dataset:

In [44]:
import json
import os

# Load the existing file and get the data
def load_data_from_file(file_path):
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            try:
                return json.load(file)
            except json.JSONDecodeError:
                print(f"Error loading data from {file_path}. File may be empty or corrupted.")
                return []
    else:
        print(f"File {file_path} not found.")
        return []

# Save the merged data back to the file
def save_data_to_file(file_path, merged_data):
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(merged_data, file, ensure_ascii=False, indent=4)
    print(f"Data saved to {file_path}")

# Merge the "papers" parts and save them to the same file
def merge_papers_in_file(file_path):
    # Load the data from the file
    data = load_data_from_file(file_path)

    # Merge only when the file holds exactly two "papers" parts (one per year range)
    if len(data) == 2:
        # Merge the papers from both parts into one list
        merged_papers = data[0].get('papers', []) + data[1].get('papers', [])

        # Save the merged data back into the same file
        merged_data = [{
            'papers': merged_papers
        }]

        save_data_to_file(file_path, merged_data)
    else:
        print(f"Data in {file_path} does not contain two parts to merge.")

# List of file names
file_names = [
    "Foundation_Models_data.json",
    "Generative_Models_data.json",
    "LLM_data.json",
    "VLM_data.json",
    "Diffusion_Models_data.json"
]

# Merge the papers parts in each file
for file_name in file_names:
    merge_papers_in_file(file_name)


Data saved to Foundation_Models_data.json
Data saved to Generative_Models_data.json
Data saved to LLM_data.json
Data saved to VLM_data.json
Data saved to Diffusion_Models_data.json
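
The merge above concatenates exactly two parts and keeps every record. As a hedged sketch of a possible further cleaning step, the function below merges any number of parts and drops records whose title repeats. Matching on a lower-cased, stripped title is an assumption made for this example, and `merge_parts` is a hypothetical name that does not appear in the original notebook.

```python
def merge_parts(parts):
    """Flatten a list of {'papers': [...]} parts into a single part.

    Records whose normalized title was already seen are skipped.
    Records with no title are always kept, since they cannot be compared.
    """
    merged = []
    seen_titles = set()

    for part in parts:
        for paper in part.get("papers", []):
            # Normalize the title so case and surrounding spaces do not matter
            key = (paper.get("title") or "").strip().lower()
            if key and key in seen_titles:
                continue
            seen_titles.add(key)
            merged.append(paper)

    # Same shape as the merged files above: a list with one 'papers' entry
    return [{"papers": merged}]
```

The result could be passed to `save_data_to_file` in place of `merged_data`, since it has the same structure.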


# Visualization:

In [46]:
import json
import os

# Load the existing file and get the data
def load_data_from_file(file_path):
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            try:
                return json.load(file)
            except json.JSONDecodeError:
                print(f"Error loading data from {file_path}. File may be empty or corrupted.")
                return []
    else:
        print(f"File {file_path} not found.")
        return []

# Show the first item (paper) from the "papers" list in the file
def show_first_paper_from_file(file_path):
    data = load_data_from_file(file_path)

    if data and 'papers' in data[0]:
        papers = data[0]['papers']

        if papers:
            first_paper = papers[0]
            print(f"\nFirst paper from {file_path}:")
            print(f"Title: {first_paper.get('title', 'No Title')}")
            print(f"Abstract: {first_paper.get('abstract', 'No Abstract')}")
            print(f"Authors: {first_paper.get('authors', 'No Authors')}")
            print(f"Citations: {first_paper.get('citations', 0)}")
            print("-" * 40)
        else:
            print(f"No papers found in {file_path}.")
    else:
        print(f"No 'papers' key found in {file_path} or the file is empty.")

# List of file names
file_names = [
    "Foundation_Models_data.json",
    "Generative_Models_data.json",
    "LLM_data.json",
    "VLM_data.json",
    "Diffusion_Models_data.json"
]

# Show the first paper from each file
for file_name in file_names:
    show_first_paper_from_file(file_name)



First paper from Foundation_Models_data.json:
Title: On the Opportunities and Risks of Foundation Models
Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles(e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent