# Formalia

Please read the [assignment overview page](https://github.com/TheYuanLiao/comsocsci2025/wiki/Assignments) carefully before proceeding. The page contains information about formatting and file formats, group sizes, and many other aspects of handing in the assignment.

__If you fail to follow these simple instructions, it will negatively impact your grade!__

**Due date and time**: The assignment is due on Mar 4th at 23:59. Hand in your Jupyter notebook file (with extension `.ipynb`) via DTU Learn _(Assignment 1)_. 

Remember to include in the first cell of your notebook:
* the link to your group's Git repository 
* group members' contributions


## Part 1: Web-scraping

> **Exercise: Web-scraping the list of participants to the International Conference in Computational Social Science**    
>
> You can find the programme of the 2023 edition of the conference at [this link](https://ic2s2-2023.org/program). As you can see the conference programme included many different contributions: keynote presentations, parallel talks, tutorials, posters. 
> 1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported in the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling. 
> 2. Some instructions for success: 
>    * First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.   
>    * Use the [BeautifulSoup Python package](https://pypi.org/project/beautifulsoup4/) to navigate through the hierarchy and extract the elements you need from the page. 
>    * You can use the [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to find elements that match specific filters. Check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) of the library for detailed explanations on how to set filters.  
>    * Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas, or other unwanted characters)
>    * The overall idea is to adapt the procedure I have used [here](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/ScreenScraping.ipynb) for the specific page you are scraping. 
> 3. Create the set of unique researchers that joined the conference and *store it into a file*.
>     * *Important:* If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible. 
> 4. *Optional:* For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at [this link](https://ic2s2-2023.org/program_committee); (ii) the organizers of tutorials, that can be found at [this link](https://ic2s2-2023.org/tutorials)

In [9]:
import requests
from bs4 import BeautifulSoup
import re

url = "https://ic2s2-2023.org/program"

response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

sections = soup.find_all("section", id="main")

# Initialize a set to store unique names
researcher_names_2023 = set()

# Regular expression to match names inside <i>
name_pattern = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z]\.?)?(?:\s[A-Z][a-z]+)+\b")

for section in sections:
    for i_tag in section.find_all("i"):
        text = i_tag.get_text()
        
        matches = name_pattern.findall(text)
        for match in matches:
            researcher_names_2023.add(match)

print(f"Number of researcher names from the program of 2023: {len(researcher_names_2023)}\n")
print("First 10 names:")
print("\n".join(list(sorted(researcher_names_2023))[:10]))

Number of researcher names from the program of 2023: 1380

First 10 names:
Aaron Clauset
Aaron J. Schwartz
Aaron Schein
Aaron Smith
Abbas Haidar
Abby Smith
Abdulkadir Celikkanat
Abdullah Almaatouq
Abdullah Zameek
Adam Finnemann


### Storing the unique researcher names in a file

In [5]:
# Write one name per line; the "files" directory must already exist
with open("files/researcher_names_2023.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(researcher_names_2023)))

> 5. How many unique researchers do you get?

**Answer:**  
We found 1380 unique researchers.
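
The count above treats two spellings of the same person as two researchers. As a rough check, near-duplicates could be flagged with a sketch like the one below. It is an illustration and not part of the pipeline we ran: the helper names `normalize` and `near_duplicates`, the 0.9 similarity threshold, and the sample names (including the made-up variant without the dot) are all assumptions.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase a name, strip accents and dots, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.replace(".", "").lower().split())


def near_duplicates(names, threshold=0.9):
    """Return (name_a, name_b, similarity) for pairs that look like the same person."""
    normalized = {name: normalize(name) for name in names}
    flagged = []
    # Pairwise comparison is quadratic, which is acceptable for a list of
    # about a thousand names but would not scale much further.
    for a, b in combinations(sorted(names), 2):
        ratio = SequenceMatcher(None, normalized[a], normalized[b]).ratio()
        if ratio >= threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged


# Hypothetical sample: "Aaron J Schwartz" is an invented variant spelling
sample = {"Aaron J. Schwartz", "Aaron J Schwartz", "Aaron Schein", "Abby Smith"}
flagged = near_duplicates(sample)
print(flagged)
```

Flagged pairs would still need a manual look, since two different people can have very similar names.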

> 6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices __(answer in max 150 words)__.

**Answer:**  
We started by fetching the page with `requests` and parsing the HTML with `BeautifulSoup`. Each day's program was stored in a `<section>` element with an id of "main", so we used the soup object to find all these sections. We then made a `set()` called `researcher_names_2023` to store unique names. After inspecting the sections, we noticed that names were inside `<i>` tags. We looped through each section, found all `<i>` elements, and extracted the text. Since one `<i>` tag could contain multiple names, we used a regular expression to match each name and added the matches to the set. ChatGPT helped create the pattern, which matches names like:  
- `"John Smith"`  
- `"John A. Smith"`  
- `"John A Smith"`  
- `"John Michael Smith"` or `"John A. Michael Smith"`
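
The pattern can be sanity-checked on a few strings before trusting the final list. The sketch below reuses the pattern from the scraping cell. The test strings are made up for illustration and do not come from the conference page. Because the pattern only allows the ASCII letters `a-z` after an initial capital, a hyphenated first name comes back truncated and an accented name is not matched at all, so names of this kind may be incomplete or missing in our list.

```python
import re

# Same pattern as in the scraping cell above
name_pattern = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z]\.?)?(?:\s[A-Z][a-z]+)+\b")

# Made-up strings, not taken from the conference page
plain = name_pattern.findall("John Smith, John A. Smith, John A Smith, John Michael Smith")
hyphenated = name_pattern.findall("Jean-Pierre Dupont")
accented = name_pattern.findall("José García")

print(plain)       # all four names are matched in full
print(hyphenated)  # only the part after the hyphen is matched
print(accented)    # no match, because of the accented letters
```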

## Part 2: Ready Made vs Custom Made Data

> **Exercise: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood the key points of my lecture and the reading. 
>
> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book __(answer in max 150 words)__.

**Answer:**  
### Centola's experiment
#### Pros:
- The results help decide which people to select for future studies  
- The data is nonreactive: participants are unaware of the experiment, so their behavior isn't biased by it  
- The researchers have access to all of the data because it is custom-made  
- The data is clean: no bots and no spam  

#### Cons:
- Not much data  
- Takes time  
- The participant pool is narrow: participants have to opt in to the study, which selects for a particular type of person  

### Nicolaides's study
#### Pros:
- Always-on: the data is continuously updated  
- Large data  
- Nonreactive  
- Complete, clean and accessible data with all the required variables

#### Cons:
- The data is nonrepresentative: it comes from an app that only runners use, so the population isn't varied  
- There could be many confounders, such as weather and vacations


> 2. How do you think these differences can influence the interpretation of the results in each study? __(answer in max 150 words)__

**Answer:**  
For Centola's experiment, the data is controlled and clean, so the observed effects can be attributed to the design of the experiment. However, since the sample is nonrepresentative, the results can't necessarily be generalized to the broader population. The findings may only apply to a specific demographic that self-selected into the study, and the effects may not hold in real-world, more diverse settings.

Nicolaides's study uses large-scale, real-world data, capturing social behavior over time. While this improves generalizability, confounders like weather or socioeconomic factors may bias results. Additionally, since the data comes from a runners' app, findings may not apply to non-runners.

#### The trade-off between control and generalizability is key:
- Centola's experiment provides high internal validity (causality is clear) but low external validity (hard to generalize).
- Nicolaides's study offers high external validity (reflects real-world behavior) but lower internal validity (difficult to establish causal relationships).

## Part 3: Gathering Research Articles using the OpenAlex API

> **Exercise : Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2024 (NOT 2023) conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**
>
> 
> **Steps:**
>  
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
> 
>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest adjusting this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
>
> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>  
>
> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
> 
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts). 
>
> **Efficiency Tips:**
> Writing efficient code in this exercise is **crucial**. To speed up your process:
> - **Apply filters directly in your request:** When possible, use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) of the *works* endpoint to apply the filters above directly in your API request, ensuring only relevant data is returned. Learn about combining multiple filters [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists).  
> - **Bulk requests:** Instead of sending one request for each author, you can use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) to query works by multiple authors in a single request. *Note: My testing suggests that you can only include up to 25 authors per request.*
> - **Use multiprocessing:** Implement multiprocessing to handle multiple requests simultaneously. I highly recommend [Joblib’s Parallel](https://joblib.readthedocs.io/en/stable/) function for that, and [tqdm](https://tqdm.github.io/) can help monitor progress of your jobs. Remember to stay within [the rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 10 requests per second.
>
>
>   
> For reference, employing these strategies allowed me to fetch the data in about 30 seconds using 5 cores on my laptop. I obtained a dataset of approximately 25 MB (including both the *IC2S2 abstracts* and *IC2S2 papers* files).

### Functions for retrieving data

In [None]:
import pandas as pd
import requests
from joblib import Parallel, delayed
from tqdm import tqdm
import time

def batch_list(lst, batch_size):
    """Yield successive chunks of size batch_size from lst."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i+batch_size]

def fetch_works_for_batch(author_ids, per_page=200, sleep_time=0.2):
    """
    Fetch works from OpenAlex for a batch of author IDs.

    The request itself filters for works with more than 10 citations,
    fewer than 10 authors, and at least one of the seven concepts in the filter.

    Parameters:
      - author_ids: list of author OpenAlex IDs (e.g. "https://openalex.org/A123456789")
      - per_page: number of works per page (max 200)
      - sleep_time: pause in seconds between pages (to respect rate limits)

    Returns:
      - List of work dictionaries.
    """
    works_endpoint = "https://api.openalex.org/works"
    all_works = []
    # Build filter for author IDs (using the OR operator "|")
    filter_query = "authorships.author.id:" + "|".join(author_ids)
    filter_query += ",cited_by_count:>10,authors_count:<10"
    filter_query += ",concepts.id:" + "|".join(["https://openalex.org/C144024400", "https://openalex.org/C15744967", "https://openalex.org/C162324750", "https://openalex.org/C17744445", "https://openalex.org/C33923547", "https://openalex.org/C121332964", "https://openalex.org/C41008148"])

    cursor = "*"
    while True:
        params = {
            "filter": filter_query,
            "per_page": per_page,
            "cursor": cursor,
            "select": "id,publication_year,cited_by_count,authorships,abstract_inverted_index,title,concepts"
        }
        response = requests.get(works_endpoint, params=params)
        if response.status_code != 200:
            print(f"Error fetching works for authors {author_ids}: {response.status_code}")
            break
        data = response.json()
        results = data.get("results", [])
        all_works.extend(results)
        meta = data.get("meta", {})
        next_cursor = meta.get("next_cursor")
        if not next_cursor:
            break
        cursor = next_cursor
        time.sleep(sleep_time)
    return all_works

def fetch_works_for_authors(author_ids, batch_size=25, n_jobs=5):
    """
    Given a list of author IDs, split them into batches and fetch works in parallel.
    
    Returns:
      - List of work dictionaries.
    """
    batches = list(batch_list(author_ids, batch_size))
    results = Parallel(n_jobs=n_jobs)(
        delayed(fetch_works_for_batch)(batch) for batch in tqdm(batches, desc="Fetching works for authors")
    )
    # Flatten list of lists
    all_works = [work for sublist in results for work in sublist]
    return all_works

### Retrieving works data for each researcher name

In [None]:
authors_df = pd.read_csv('files/researchers_data_2024.csv')
filtered_authors = authors_df[(authors_df['Works Count'] >= 5) & (authors_df['Works Count'] <= 5000)]
filtered_authors_ids = filtered_authors["ID"].tolist()

all_works = fetch_works_for_authors(filtered_authors_ids)
print(f"Total works fetched for IC2S2 2024 authors: {len(all_works)}")

In the request we filtered on the number of citations, the number of authors, and the concepts. Because the concept filter in the request matches works with any one of the seven concepts, we now keep only the works that are relevant to Computational Social Science and also intersect with a quantitative discipline.

In [None]:
def filter_works_by_concepts(works, css_concepts, quantitative_concepts):
    """
    Filter works to include only those that have at least one level-0 concept 
    from each of the following groups:
      - Computational Social Science: e.g. Sociology, Psychology, Economics, Political Science
      - Quantitative disciplines: e.g. Mathematics, Physics, Computer Science
    """
    filtered = []
    # Pre-lowercase the concept names for easier comparison
    css_concepts_lower = [c.lower() for c in css_concepts]
    quantitative_concepts_lower = [q.lower() for q in quantitative_concepts]
    
    for work in works:
        concepts = work.get("concepts", [])
        css_found = False
        quantitative_found = False
        for concept in concepts:
            if concept.get("level") == 0:
                name = concept.get("display_name", "").lower()
                if name in css_concepts_lower:
                    css_found = True
                if name in quantitative_concepts_lower:
                    quantitative_found = True
        if css_found and quantitative_found:
            filtered.append(work)
    return filtered

css_concepts = ["Sociology", "Psychology", "Economics", "Political Science"]
quantitative_concepts = ["Mathematics", "Physics", "Computer Science"]
filtered_works = filter_works_by_concepts(all_works, css_concepts, quantitative_concepts)
print(f"Total works after concept filtering: {len(filtered_works)}")
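
The intersection could probably also be pushed into the request itself. The sketch below only builds the filter string and sends no request. It assumes that OpenAlex treats a repeated filter key separated by commas as an AND, as described in its documentation on filtering entity lists, and that the first four concept IDs used in `fetch_works_for_batch` are the social science concepts and the last three are the quantitative ones. The helper `build_works_filter` and the two constant names are hypothetical.

```python
# Assumption: these groupings follow the order of the IDs used in fetch_works_for_batch
CSS_CONCEPT_IDS = [
    "https://openalex.org/C144024400",
    "https://openalex.org/C15744967",
    "https://openalex.org/C162324750",
    "https://openalex.org/C17744445",
]
QUANT_CONCEPT_IDS = [
    "https://openalex.org/C33923547",
    "https://openalex.org/C121332964",
    "https://openalex.org/C41008148",
]


def build_works_filter(author_ids):
    """Build a filter string where "|" means OR and "," means AND."""
    clauses = [
        "authorships.author.id:" + "|".join(author_ids),
        "cited_by_count:>10",
        "authors_count:<10",
        # Two separate concept clauses: one social science concept AND one quantitative concept
        "concepts.id:" + "|".join(CSS_CONCEPT_IDS),
        "concepts.id:" + "|".join(QUANT_CONCEPT_IDS),
    ]
    return ",".join(clauses)


# Made-up author IDs, for illustration only
example_filter = build_works_filter(["https://openalex.org/A1", "https://openalex.org/A2"])
print(example_filter)
```

If that assumption holds, fewer works would be downloaded and the local concept filter would serve mainly as a check.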

### Storing the data in csv files

In [None]:
authors_papers = []
authors_abstracts = []

for work in filtered_works:
    # Extract author IDs from the "authorships" field
    authors = [auth.get("author", {}).get("id") for auth in work.get("authorships", []) if auth.get("author", {}).get("id")]
    authors_papers.append({
        "id": work.get("id"),
        "publication_year": work.get("publication_year"),
        "cited_by_count": work.get("cited_by_count"),
        "author_ids": authors
    })
    authors_abstracts.append({
        "id": work.get("id"),
        "title": work.get("title"),
        "abstract_inverted_index": work.get("abstract_inverted_index")
    })

authors_papers_df = pd.DataFrame(authors_papers)
authors_abstracts_df = pd.DataFrame(authors_abstracts)

authors_papers_df.to_csv("authors_papers.csv", index=False)
authors_abstracts_df.to_csv("authors_abstracts.csv", index=False)
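
The `abstract_inverted_index` column maps each word to the positions where it occurs in the abstract. A minimal sketch for turning such an index back into text is shown below. The helper name `inverted_index_to_text` and the example index are made up for illustration. After reading the CSV back in, the column holds strings, so each value would first need to be parsed (for example with `ast.literal_eval`) before this function can be applied.

```python
def inverted_index_to_text(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not inverted_index:
        # Works without an abstract have no index
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order
    return " ".join(word for _, word in sorted(positioned))


# Made-up example index
example_index = {"Social": [0], "networks": [1, 4], "shape": [2], "other": [3]}
example_text = inverted_index_to_text(example_index)
print(example_text)
```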

### Retrieving the ids of unique co-author researchers

In [None]:
import ast

#authors_papers_df = pd.read_csv("authors_papers.csv")
authors_papers = authors_papers_df.to_dict(orient='records')
for paper in authors_papers:
    if isinstance(paper['author_ids'], str):
        paper['author_ids'] = ast.literal_eval(paper['author_ids'])

all_author_ids_in_works = set()
for paper in authors_papers:
    for aid in paper["author_ids"]:
        all_author_ids_in_works.add(aid)

# Identify co-author IDs by removing the IC2S2 authors
authors_ids_set = set(filtered_authors_ids)
coauthor_ids = list(all_author_ids_in_works - authors_ids_set)
print(f"Total unique co-author IDs: {len(coauthor_ids)}")

> **Data Overview and Reflection questions:** Answer the following questions: 
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works? 

**Answer:**  
We obtained 12746 papers, which we stored in the dataframe. From these papers we extracted the co-authors and found 16912 unique co-authors.

> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__

**Answer:**  
To improve efficiency, we implemented batch processing and parallelization. The `batch_list` function splits author IDs into smaller batches, optimizing the API request size and preventing timeouts. By using `joblib`'s `Parallel` and `delayed` with `n_jobs=5`, we parallelized API requests, significantly reducing execution time. We also used a cursor-based pagination approach in `fetch_works_for_batch`, allowing uninterrupted data retrieval until all results were fetched. Adding a brief `sleep_time` of 0.2 seconds between requests maintained API rate limits without unnecessary delays. Before requesting the API, we only kept the ids of authors with a total work count between 5 and 5,000. Then we applied filters directly to the request, which greatly reduced the number of works to retrieve.

> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

**Answer:**  
The filtering criteria helped make the dataset more relevant and manageable. By setting a range of 5 to 5000 works per author, we avoided including authors with too few or too many works, focusing on active researchers. The citation filter ensured we included impactful research, and limiting the number of authors per work helped highlight studies with clear individual contributions. We also filtered works to specific fields related to Computational Social Science to keep the dataset focused.

These filtering choices might mean we missed new researchers with fewer citations. Filtering by specific fields could also mean we missed other interdisciplinary studies relevant to Computational Social Science outside of the specific fields we chose. This could lead to overrepresenting well-established areas with lots of citations. So while the filtering made our dataset more relevant, it might create a bias toward established researchers and topics.

## Part 4: The Network of Computational Social Scientists

> **Exercise: Constructing the Computational Social Scientists Network**
>
> In this exercise, we will create a network of researchers in the field of Computational Social Science using the NetworkX library. In our network, nodes represent authors of academic papers, with a direct link from node _A_ to node _B_ indicating a joint paper written by both. The link's weight reflects the number of papers written by both _A_ and _B_.
>
> **Part 1: Network Construction**
>
> 1. **Weighted Edgelist Creation:** Start with your dataframe of *papers*. Construct a _weighted edgelist_ where each list element is a tuple containing three elements: the _author ids_ of two collaborating authors and the total number of papers they've co-authored. Ensure each author pair is listed only once. 
>
> 2. **Graph Construction:**
>    - Use NetworkX to create an undirected [``Graph``](https://networkx.org/documentation/stable/reference/classes/graph.html).
>    - Employ the [`add_weighted_edges_from`](https://networkx.org/documentation/stable/reference/classes/generated/networkx.Graph.add_weighted_edges_from.html#networkx.Graph.add_weighted_edges_from) function to populate the graph with the weighted edgelist from step 1, creating a weighted, undirected graph.
>
> 3. **Node Attributes:**
>    - For each node, add attributes for the author's _display name_, _country_, _citation count_, and the _year of their first publication_ in Computational Social Science. The _display name_ and _country_ can be retrieved from your _authors_ dataset. The _year of their first publication_ and the _citation count_  can be retrieved from the _papers_ dataset.
>    - Save the network as a JSON file.
>      
> **Part 2: Preliminary Network Analysis**
> Now, with the network constructed, perform a basic analysis to explore its features.
> 1. **Network Metrics:**
>    - What is the total number of nodes (authors) and links (collaborations) in the network? 
>    - Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.
>    - Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?
>    - If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset. 
>    - How many isolated nodes are there in your network?  An isolated node is defined as a node with no connections to any other node in the network.
>    - Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why?  __(answer in max 150 words)__
> 
> 3. **Degree Analysis:**
>    - Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree). What do these metrics tell us about the network? __(answer in max 150 words)__
> 
> 4. **Top Authors:**
>    - Identify the top 5 authors by degree. What role do these nodes play in the network? 
>    - Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? __(answer in max 150 words)__


In [2]:
import ast
import pandas as pd
from itertools import combinations
from collections import defaultdict

# Load the data (raw string, so the backslashes in the Windows path are not read as escapes)
papers_df = pd.read_csv(r"C:\DTU\Fjerde-semester\Social_informatik\comsocsci2025\lectures\papers_combined.csv")

# Convert 'author_ids' from a string back to a list
papers_df["author_ids"] = papers_df["author_ids"].apply(ast.literal_eval)

# Dictionary for counting co-authorships
coauthor_counts = defaultdict(int)

# Go through each paper and find the collaborating authors
for author_list in papers_df["author_ids"]:
    for author1, author2 in combinations(author_list, 2):
        pair = tuple(sorted((author1, author2)))  # Sort so each pair is counted only once
        coauthor_counts[pair] += 1

# Convert to a weighted edgelist (list of tuples)
weighted_edgelist = [(a, b, count) for (a, b), count in coauthor_counts.items()]

# Convert to a DataFrame for easier display and storage
edgelist_df = pd.DataFrame(weighted_edgelist, columns=["author1", "author2", "weight"])

# Save the result to a CSV file
edgelist_df.to_csv("weighted_edgelist.csv", index=False)

# Print the first rows
print(edgelist_df.head())




                            author1                           author2  weight
0  https://openalex.org/A5014647140  https://openalex.org/A5082953212       2
1  https://openalex.org/A5014647140  https://openalex.org/A5067142016       4
2  https://openalex.org/A5067142016  https://openalex.org/A5082953212       1
3  https://openalex.org/A5008033989  https://openalex.org/A5014647140       5
4  https://openalex.org/A5008033989  https://openalex.org/A5067142016       4
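
The graph construction and the preliminary metrics asked for in the exercise can be sketched on a toy edgelist as follows. This is an illustration under stated assumptions and not the analysis of our data: the author IDs `A1` to `A6`, the weights, and the helper `summarize` are made up, while the NetworkX calls are the standard ones for density, connected components, isolated nodes and degree.

```python
import statistics
import networkx as nx

# Toy weighted edgelist in the same (author1, author2, weight) shape as above; IDs are made up
toy_edgelist = [
    ("A1", "A2", 2),
    ("A1", "A3", 4),
    ("A2", "A3", 1),
    ("A4", "A5", 3),
]

G = nx.Graph()
G.add_weighted_edges_from(toy_edgelist)
G.add_node("A6")  # an author without co-authors, i.e. an isolated node

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
density = nx.density(G)  # 2E / (N * (N - 1)) for an undirected graph
n_components = nx.number_connected_components(G)
n_isolates = nx.number_of_isolates(G)

# Degree counts neighbours; strength sums the weights of the incident edges
degrees = [degree for _, degree in G.degree()]
strengths = [strength for _, strength in G.degree(weight="weight")]


def summarize(values):
    """Return the summary statistics asked for in the degree analysis."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
    }


print(n_nodes, n_edges, round(density, 3), n_components, n_isolates)
print("degree:", summarize(degrees))
print("strength:", summarize(strengths))
```

On the real network, `weighted_edgelist` would take the place of the toy list, and authors who appear only on single-author papers would have to be added as nodes explicitly, since they never show up in an edge.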
