### Link to git repository:
https://github.com/mbrochlips/CSS_project1

### Contributions:
| Name               | Part 1 | Part 2 | Part 3 | Part 4 |
|--------------------|--------|--------|--------|--------|
| Mikkel (s234860)   | 40%    | 30%    | 30%    | 35%    |
| Kantinka (s235058) | 30%    | 40%    | 30%    | 35%    |
| Marcus (s234816)   | 30%    | 30%    | 40%    | 30%    |

In [1]:
# Library imports
from bs4 import BeautifulSoup 
import re
import pandas as pd
import requests
from time import sleep
from json import JSONDecodeError
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter
from thefuzz import fuzz

### Part 1: Web-scraping

In [2]:
url = "https://ic2s2-2023.org/program"
r = requests.get(url) 
soup = BeautifulSoup(r.content) 

data = soup.find_all("i")

text = str(data)

cleaned_text = re.sub(r'<[^>]+>', '', text)

cleaned_text = re.sub(r'Chair:', '', cleaned_text)

names = [name.strip().lower() for name in cleaned_text.split(',')]

names[0] = names[0][2:]     # drop the list bracket left over from str(data)
names[-1] = names[-1][:-1]  # drop the closing bracket from the last name

#print(names)

In [3]:
name = "Linda Steg"
for elem in data:
    if name in str(elem):
        print(f"{name} found")
        break
else:
    # Runs only if the loop finished without finding the name
    print(f"{name} not found")

# It seems that the keynote speakers are not included

Linda Steg not found


In [4]:
soup = BeautifulSoup(r.text, "html.parser")

keynote_elements = soup.find_all("td", colspan="100%")

names_keynote = []
for elem in keynote_elements:
    if "keynotes" in str(elem):
        index = str(elem).find("Keynote - ")
        name = str(elem)[index+10:-13]
        names_keynote.append(name.lower())

#print(names_keynote)

allnames = names + names_keynote

In [5]:
# NOTE: We used ChatGPT to help us filter the names and find all unique ones.

# Steps:
# 1.	Check for exact duplicates.
# 2.	Use fuzzy matching to detect minor spelling differences.
# 3.	Handle middle name variations (keep the longest version)

# Function to check if one name is a short version of another
def is_shorter_version(name1, name2):
    name1_parts = set(name1.split())
    name2_parts = set(name2.split())
    return name1_parts.issubset(name2_parts) or name2_parts.issubset(name1_parts)

# Filter out duplicates with minor differences
unique_names = []
for name in allnames:
    name  = name.title().strip()
    found_duplicate = False
    for unique in unique_names: 
        # Check for minor spelling differences
        if fuzz.ratio(name, unique) > 90:
            found_duplicate = True
            break
        # Check if one is a shorter version of another
        if is_shorter_version(name, unique):
            if len(name) > len(unique):  # Keep the longer version
                unique_names.remove(unique)
                unique_names.append(name)
            found_duplicate = True
            break
    if not found_duplicate:
        unique_names.append(name)

# Print length of cleaned list
print(len(unique_names))

1455


### 6. Explaining the process.

Looking into the HTML code of the website, we quickly saw that most names were written in italics (\<i>). This included the two talk categories and the posters. Some names were additionally underlined (\<u>), and many names were often listed together. To clean this up, we used regular expressions (the `re` library) to filter the different cases we found, including removing "Chair:".
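
The cleaning steps described above can be sketched on a small fragment. This is only an illustration: `sample_html` is a made-up snippet imitating the structure of the program page, and `extract_names` is a hypothetical helper, not the code used in the cell above. Handling each `<i>` element separately avoids having to trim list brackets afterwards.

```python
import re

# Hypothetical fragment imitating the program page; not copied from the real site.
sample_html = (
    "<i>Chair: Jane Doe</i>"
    "<i><u>John Smith</u>, Ana María López</i>"
    "<i>Li Wei, <u>Omar Haddad</u></i>"
)

def extract_names(html):
    """Return lower-cased names found inside <i>...</i> elements."""
    names = []
    # Process each italic element on its own instead of one big string
    for inner in re.findall(r"<i>(.*?)</i>", html, flags=re.DOTALL):
        text = re.sub(r"<[^>]+>", "", inner)  # drop nested tags such as <u>
        text = text.replace("Chair:", "")      # drop the session-chair label
        for part in text.split(","):
            part = part.strip().lower()
            if part:
                names.append(part)
    return names

print(extract_names(sample_html))
```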

After checking names from different areas of the webpage, we found that we were only missing the keynote speakers at the top of the page. We discovered that `"td", colspan="100%"` was unique to the rows with keynote speakers, and the names found there were added. Lastly, all names were cleaned and we searched for duplicates (including spelling mistakes and different versions of middle names).

Name count = 1455
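
The duplicate search can be illustrated with a small, self-contained variant. It is a sketch under assumptions: `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` (the two scores are similar but not identical), the sample names are invented, and the longer variant replaces the shorter one in place rather than being appended.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score in [0, 100]; a stand-in for thefuzz's fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def is_shorter_version(name1, name2):
    """True if the words of one name are a subset of the other's."""
    parts1, parts2 = set(name1.split()), set(name2.split())
    return parts1 <= parts2 or parts2 <= parts1

def deduplicate(names, threshold=90):
    unique = []
    for name in names:
        name = name.strip().title()
        match = None
        for i, kept in enumerate(unique):
            if similarity(name, kept) > threshold or is_shorter_version(name, kept):
                match = i
                break
        if match is None:
            unique.append(name)
        elif is_shorter_version(name, unique[match]) and len(name) > len(unique[match]):
            unique[match] = name  # keep the longer variant (e.g. with middle name)
    return unique

# Invented names covering exact, middle-name and spelling duplicates
sample = ["jane doe", "Jane Doe", "jane q doe", "john smith", "jon smith", "li wei"]
print(deduplicate(sample))
```

Note that a subset test on name parts can also merge distinct people who share a first and last name, so the resulting count is an approximation.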

___

### Part 2: Ready Made vs Custom Made Data

#### Pros and cons of the custom-made data vs. ready-made data
One of the advantages of custom-made data is the ability to control the experimental setup so that it precisely fits the research question, e.g. through a randomized setup with a control group, as in Centola's experiment. This is harder to achieve with ready-made data, as the participants might not be representative of the entire population; for example, only certain types of people might use the fitness-tracking app in Nicolaides's study. Moreover, ready-made data might be incomplete or dirty, and demographic information might be sparse, either due to limited data collection or privacy concerns.

The advantage of ready-made data is that it can contain large samples collected continuously over a long period of time, whereas custom-made data has to be planned and can be costly to obtain. Ready-made data can also reflect real-world conditions more closely, as it is not obtained in an artificial setting.

#### How the differences can influence the interpretation of the results
In Centola’s experiment, the setup might be too far from a real setting to say something general about behavioral contagion, as there are many factors the study doesn’t consider.

In Nicolaides's study, there is a risk that the people who used the fitness app are not representative of the population; maybe they are generally more social, or more susceptible to being nudged by their friends’ behavior. There is also a chance that the data doesn’t reflect the theoretical concept under study, or that there are confounders. The researchers tried to account for the latter by using an instrumental variable.

___
### Part 3: Gathering Research Articles using the OpenAlex API

> **Exercise : Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2024 (NOT 2023) conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**
>
>
> **Steps:**
>
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
>
>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest to adjust this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
>
> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>
>
> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
>
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts).
>
>
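
The cursor paging mentioned in the note above can be sketched as follows. This is a minimal illustration, not the approach used in the cells below (which step through a `page` parameter; OpenAlex limits that basic paging to the first 10,000 results of a query). `fetch_all_works` and `http_fetch` are hypothetical helpers, and the fetcher is passed in as an argument so that the loop can be tried without network access.

```python
def fetch_all_works(fetch_page, filter_str, per_page=200):
    """Collect all results for a filter using OpenAlex cursor paging.

    fetch_page is a hypothetical callable that takes a params dict and
    returns the decoded JSON response.
    """
    results = []
    cursor = "*"  # "*" asks for the first page
    while cursor:
        page = fetch_page({"filter": filter_str, "per-page": per_page, "cursor": cursor})
        results.extend(page["results"])
        # next_cursor is null (None) once the last page has been served
        cursor = page["meta"].get("next_cursor")
    return results

def http_fetch(params):
    """Real fetcher; needs network access and the requests package."""
    import requests
    response = requests.get("https://api.openalex.org/works", params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# Offline demonstration with two fake pages
fake_pages = {
    "*": {"results": ["W1", "W2"], "meta": {"next_cursor": "abc"}},
    "abc": {"results": ["W3"], "meta": {"next_cursor": None}},
}
print(fetch_all_works(lambda p: fake_pages[p["cursor"]], "cited_by_count:>10"))
```

With network access, `fetch_all_works(http_fetch, "cited_by_count:>10,authors_count:<10")` would run the same loop against the live API.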


In [6]:
# Read authors CSV file
csv_path = 'Authors/Authors2024.csv'
authors = pd.read_csv(csv_path)

authors

|      | id | display_name | works_api_url | h_index | works_count | country_code |
|------|----|--------------|---------------|---------|-------------|--------------|
| 0 | https://openalex.org/A5101854927 | Hazem Ibrahim | https://api.openalex.org/works?filter=author.i... | 12 | 25 | FI |
| 1 | https://openalex.org/A5007282319 | Talal Rahwan | https://api.openalex.org/works?filter=author.i... | 32 | 167 | US |
| 2 | https://openalex.org/A5018129441 | Yasir Zaki | https://api.openalex.org/works?filter=author.i... | 16 | 125 | AE |
| 3 | https://openalex.org/A5077466034 | Vikram Balasubramanian | https://api.openalex.org/works?filter=author.i... | 2 | 3 | US |
| 4 | https://openalex.org/A5030255458 | Guodong Ju | https://api.openalex.org/works?filter=author.i... | 11 | 48 | CN |
| ... | ... | ... | ... | ... | ... | ... |
| 1125 | https://openalex.org/A5025218700 | Amit Goldenberg | https://api.openalex.org/works?filter=author.i... | 26 | 139 | US |
| 1126 | https://openalex.org/A5061723972 | Federico Zimmerman | https://api.openalex.org/works?filter=author.i... | 1 | 1 | AR |
| 1127 | https://openalex.org/A5092145709 | Brooke Perreault | https://api.openalex.org/works?filter=author.i... | 1 | 4 | US |
| 1128 | https://openalex.org/A5058723753 | Olivier Bergeron-Boutin | https://api.openalex.org/works?filter=author.i... | 3 | 7 | US |


These are all the authors from IC2S2 2024.
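
The table above still lists authors with fewer than 5 works (for example a `works_count` of 3), so the exercise's author filter (between 5 and 5,000 works) would need to be applied before batching if it is not handled elsewhere. A minimal sketch, assuming inclusive bounds; `filter_authors` is a hypothetical helper and the demo rows are three rows copied from the table:

```python
import pandas as pd

def filter_authors(authors, low=5, high=5000):
    """Keep authors whose works_count lies in [low, high] (bounds assumed inclusive)."""
    mask = authors["works_count"].between(low, high)
    return authors[mask].reset_index(drop=True)

# Three rows taken from the authors table
demo = pd.DataFrame({
    "display_name": ["Hazem Ibrahim", "Vikram Balasubramanian", "Talal Rahwan"],
    "works_count": [25, 3, 167],
})
print(filter_authors(demo)["display_name"].tolist())
```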

We then batch the authors in groups of about 25 to make the code run faster:
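
One detail worth noting: `np.array_split(authors, len(authors) // 25)` creates `len(authors) // 25` batches of near-equal size, so some batches can hold 26 authors when the total is not a multiple of 25. If a hard upper limit of 25 per request is wanted, plain slicing is one alternative. The sketch below uses a made-up list of IDs; `chunk` is a hypothetical helper.

```python
def chunk(items, size=25):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

ids = [f"A{i}" for i in range(1130)]  # made-up IDs; the count is for illustration only
batches = chunk(ids)

# Number of batches, largest batch size, size of the final batch
print(len(batches), max(len(b) for b in batches), len(batches[-1]))
```

The same idea works on a DataFrame by slicing with `authors.iloc[i:i + size]`.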

In [7]:
# Split authors into batches of roughly 25 authors each
author_batches = np.array_split(authors, len(authors) // 25)
author_batches[0]


|      | id | display_name | works_api_url | h_index | works_count | country_code |
|------|----|--------------|---------------|---------|-------------|--------------|
| 0 | https://openalex.org/A5101854927 | Hazem Ibrahim | https://api.openalex.org/works?filter=author.i... | 12 | 25 | FI |
| 1 | https://openalex.org/A5007282319 | Talal Rahwan | https://api.openalex.org/works?filter=author.i... | 32 | 167 | US |
| 2 | https://openalex.org/A5018129441 | Yasir Zaki | https://api.openalex.org/works?filter=author.i... | 16 | 125 | AE |
| 3 | https://openalex.org/A5077466034 | Vikram Balasubramanian | https://api.openalex.org/works?filter=author.i... | 2 | 3 | US |
| 4 | https://openalex.org/A5030255458 | Guodong Ju | https://api.openalex.org/works?filter=author.i... | 11 | 48 | CN |
| 5 | https://openalex.org/A5103042494 | John Duncan | https://api.openalex.org/works?filter=author.i... | 41 | 157 | GB |
| 6 | https://openalex.org/A5102845677 | Kenji Yokotani | https://api.openalex.org/works?filter=author.i... | 6 | 30 | JP |
| 7 | https://openalex.org/A5011732280 | Masanori Takano | https://api.openalex.org/works?filter=author.i... | 8 | 70 | JP |
| 8 | https://openalex.org/A5074372343 | Nobuhito Abe | https://api.openalex.org/works?filter=author.i... | 27 | 137 | JP |
| 9 | https://openalex.org/A5081816161 | Saran Tenzin Tamang | https://api.openalex.org/works?filter=author.i... | 6 | 17 |  |


In [8]:
# Define DataFrame column names
paper_col = ['id', 'publication_year', 'cited_by_count', 'author_ids']
abstract_col = ['id', 'title', 'abstract_inverted_index']

# Global API source and filter Concepts dictionaries
source = 'https://api.openalex.org/works'
concepts_1 = {
    'Computer science': 'https://openalex.org/C41008148',
    'Physics': 'https://openalex.org/C121332964',
    'Mathematics': 'https://openalex.org/C33923547',
}
concepts_2 = {
    'Psychology': 'https://openalex.org/C15744967',
    'Sociology': 'https://openalex.org/C144024400',
    'Economics': 'https://openalex.org/C162324750',
    'Political science': 'https://openalex.org/C17744445',
}

We then define a function that calls the API for a single batch, so that the batches can be processed in parallel.

In [9]:
def process_batch(author_batch, batch_index):
    """
    Process a single batch of authors by iterating over API pages,
    applying retry logic, and collecting papers and abstracts.
    :param author_batch: DataFrame containing a batch of authors (roughly 25 authors per batch).
    :param batch_index: The index of the batch to be processed.
    :return: Two lists of papers and abstracts data.
    """

    local_session = requests.Session()  # create a local session for the thread
    local_papers_data = []
    local_abstracts_data = []
    page = 1

    while True:
        tries = 0
        results = None
        # Retry loop for API request
        while tries < 10:
            try:
                # Joining author ids to bulk the request
                author_ids_str = '|'.join(author_batch['id'])
                url = (
                    f"{source}?filter=author.id:{author_ids_str},"
                    f"cited_by_count:>10,authors_count:<10,"
                    f"concepts.id:{'|'.join(concepts_1.values())},"
                    f"concepts.id:{'|'.join(concepts_2.values())}"
                )
                response = local_session.get(url, params={'per_page': 200, 'page': page})
                results = response.json()['results']
                break  # exit retry loop on success
            except (JSONDecodeError, KeyError, requests.exceptions.RequestException):
                # KeyError covers error payloads (e.g. rate limiting) that lack a 'results' key
                tries += 1
                sleep(0.1)

        # If no results are returned, exit the paging loop
        if not results:
            break

        # Process each paper in the current page
        for paper in results:
            paper_id = paper['id']
            publication_year = paper['publication_year']
            cited_by_count = paper['cited_by_count']
            # Extract author IDs from the authorships list
            author_ids = ";".join(sub_author['author']['id'] for sub_author in paper['authorships'])
            local_papers_data.append([paper_id, publication_year, cited_by_count, author_ids])

            title = paper['title']
            abstract_index = paper['abstract_inverted_index']
            local_abstracts_data.append([paper_id, title, abstract_index])

        page += 1

    print(f"Completed batch {batch_index}")
    return local_papers_data, local_abstracts_data
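
One caveat with the retry loop above: if all ten tries fail, `results` stays `None` and the paging loop ends as if the batch were complete. A possible alternative is a small retry helper that waits longer after each failure and re-raises the last error. This is a sketch only; `call_with_retry` is a hypothetical helper, and the `sleep` argument exists so that the behaviour can be checked without real waiting.

```python
import time

def call_with_retry(func, max_tries=10, base_delay=0.1, sleep=time.sleep):
    """Call func() until it succeeds or max_tries is reached.

    Waits base_delay * 2**attempt between attempts and re-raises the
    last error instead of silently returning nothing.
    """
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demonstration with a function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("temporary failure")
    return "ok"

delays = []
print(call_with_retry(flaky, sleep=delays.append), delays)
```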

We then multithread all of the batches:

In [31]:
# Multi-threaded execution using ThreadPoolExecutor
all_papers_data = []
all_abstracts_data = []

# A maximum of 10 threads work at a time
with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_batch = {
        executor.submit(process_batch, batch, idx): idx
        for idx, batch in enumerate(author_batches)
    }
    # As each future completes, combine its results
    for future in as_completed(future_to_batch):
        try:
            papers_data, abstracts_data = future.result()
            all_papers_data.extend(papers_data)
            all_abstracts_data.extend(abstracts_data)
        except Exception as exc:
            batch_index = future_to_batch[future]
            print(f"Batch {batch_index} generated an exception: {exc}")

Completed batch 9
Completed batch 6
Completed batch 1
Completed batch 3
Completed batch 2
Completed batch 4
Completed batch 7
Completed batch 11
Completed batch 8
Completed batch 10
Completed batch 13
Completed batch 0
Completed batch 14
Completed batch 5
Completed batch 16
Completed batch 12
Completed batch 15
Completed batch 20
Completed batch 23
Completed batch 17
Completed batch 18
Completed batch 22
Completed batch 21
Completed batch 25
Completed batch 19
Completed batch 24
Completed batch 30
Completed batch 32
Completed batch 31
Completed batch 26
Completed batch 29
Completed batch 28
Completed batch 33
Completed batch 35
Completed batch 27
Completed batch 38
Completed batch 34
Completed batch 36
Completed batch 37
Completed batch 42
Completed batch 39
Completed batch 41
Completed batch 43
Completed batch 40
Completed batch 44


In [32]:
# Convert the collected data into DataFrames
papers = pd.DataFrame(all_papers_data, columns=paper_col)
abstracts = pd.DataFrame(all_abstracts_data, columns=abstract_col)

As a sanity check, we verify that the two DataFrames have the same length:

In [33]:
# Check length of papers and abstracts DataFrames
print(len(papers), len(abstracts))

12439 12439


We can then save the DataFrames

In [34]:
# Drop duplicates and save to csv
papers = papers.drop_duplicates(['id'], ignore_index=True)
papers.to_csv('Works/IC2S2_papers.csv', index=False)
papers

|       | id | publication_year | cited_by_count | author_ids |
|-------|----|------------------|----------------|------------|
| 0 | https://openalex.org/W1965631677 | 2007 | 3273 | https://openalex.org/A5076755077;https://opena... |
| 1 | https://openalex.org/W2136968651 | 2005 | 1829 | https://openalex.org/A5015711710;https://opena... |
| 2 | https://openalex.org/W2104834906 | 2014 | 1697 | https://openalex.org/A5078758511;https://opena... |
| 3 | https://openalex.org/W2066752129 | 2013 | 1157 | https://openalex.org/A5015711710;https://opena... |
| 4 | https://openalex.org/W2122465344 | 2003 | 1049 | https://openalex.org/A5015711710;https://opena... |
| ... | ... | ... | ... | ... |
| 11344 | https://openalex.org/W1490505863 | 1986 | 11 | https://openalex.org/A5079825006 |
| 11345 | https://openalex.org/W1546346013 | 1988 | 11 | https://openalex.org/A5038255653;https://opena... |
| 11346 | https://openalex.org/W1585825907 | 1988 | 11 | https://openalex.org/A5079825006 |
| 11347 | https://openalex.org/W2076461592 | 1993 | 12 | https://openalex.org/A5100398118;https://opena... |


In [35]:
# Drop duplicates and save to csv
abstracts = abstracts.drop_duplicates(['id'], ignore_index=True)
abstracts.to_csv('Works/IC2S2_abstracts.csv', index=False)
abstracts

|       | id | title | abstract_inverted_index |
|-------|----|-------|-------------------------|
| 0 | https://openalex.org/W1965631677 | The Increasing Dominance of Teams in Productio... | {'We': [0], 'have': [1], 'used': [2], '19.9': ... |
| 1 | https://openalex.org/W2136968651 | Collaboration and Creativity: The Small World ... | {'Small': [0], 'world': [1, 23, 81], 'networks... |
| 2 | https://openalex.org/W2104834906 | What Do We Learn from the Weather? The New Cli... | {'A': [0], 'rapidly': [1], 'growing': [2], 'bo... |
| 3 | https://openalex.org/W2066752129 | Atypical Combinations and Scientific Impact | {'Novelty': [0], 'is': [1, 61], 'an': [2, 74],... |
| 4 | https://openalex.org/W2122465344 | Relational Embeddedness and Learning: The Case... | {'As': [0], 'a': [1, 15, 122, 153], 'complemen... |
| ... | ... | ... | ... |
| 11344 | https://openalex.org/W1490505863 | A formal model of organizational structure and... |  |
| 11345 | https://openalex.org/W1546346013 | Markets, hierarchies and the impact of informa... |  |
| 11346 | https://openalex.org/W1585825907 | Modeling coordination in organizations and mar... | {'This': [0, 53], 'paper': [1], 'describes': [... |
| 11347 | https://openalex.org/W2076461592 | Performance of dynamic rate leaky bucket algor... | {'A': [0], 'new': [1], 'input': [2], 'rate': [... |
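
Since the abstracts are stored as inverted indexes (word mapped to the positions where it occurs), a small helper is needed to read them as text. The following is a minimal sketch; `abstract_from_inverted_index` is a hypothetical helper and the example index is invented.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from an OpenAlex-style abstract_inverted_index."""
    # Missing abstracts (None, NaN, empty dict) give an empty string
    if not isinstance(inverted_index, dict) or not inverted_index:
        return ""
    positions = {}
    for word, indexes in inverted_index.items():
        for idx in indexes:
            positions[idx] = word
    # Put the words back in the order of their positions
    return " ".join(positions[i] for i in sorted(positions))

example = {"Small": [0], "world": [1, 4], "networks": [2], "shape": [3]}
print(abstract_from_inverted_index(example))
```

After a round trip through CSV the column contains the string representation of each dictionary, so it would first have to be parsed, for example with `ast.literal_eval`, before being passed to such a helper.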


> **Data Overview and Reflection questions:** Answer the following questions:
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works?
> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__
> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

A total of 11349 unique papers are in our dataset. The following code calculates the number of unique researchers who have co-authored these works:

In [36]:
author_ids = ';'.join(papers['author_ids'])
author_ids = author_ids.split(';')
unique_authors = list(set( author_ids ))

len(unique_authors)

17581

A total of 17581 researchers have co-authored works in our dataset.

When I first ran the code without any optimizations, it took around half an hour to get all the works for the IC2S2_papers and IC2S2_abstracts DataFrames. I then bulked authors together in batches of 25, which reduced the running time to a few minutes. Finally, multithreading the batches with 10 worker threads at a time reduced the running time further, to about 16 seconds.

The filters we apply aim to enhance data quality and relevance to the topic. By only including works with more than 10 citations, we ensure that the dataset only contains research with a recognized impact. The concepts filter ensures that the papers are in the field of Computational Social Science.

Because of these filters, we may underrepresent emerging papers with few citations and works by large teams of 10 or more authors.

___
### Part 4: The Network of Computational Social Scientists

In [37]:
folder_path = "Works/"
papers = pd.read_csv(folder_path + "IC2S2_papers.csv")

In [38]:
def pair_exists(df, author_a, author_b):
    mask = ((df["author_1"] == author_a) & (df["author_2"] == author_b)) | \
           ((df["author_1"] == author_b) & (df["author_2"] == author_a))
    match_index = df[mask].index
    return match_index, mask.any()  # Returns the matching row index and whether the pair exists

In [39]:
edge_dict = {}
for authors in papers["author_ids"].values:
    cleaned_string = authors.replace("[", "").replace("]", "").replace("'", "")
    authors = np.array(cleaned_string.split(";"))
    
    for i,author1 in enumerate(authors):
        for j in range(i+1,len(authors)):
            author2 = authors[j]
            author_pair = tuple(sorted([author1,author2]))
            if author_pair in edge_dict:
                edge_dict[author_pair] += 1
            else:
                edge_dict[author_pair] = 1
df = pd.DataFrame([{"author_1": pair[0], "author_2": pair[1], "weight": weight} for pair, weight in edge_dict.items()])
df
#df.to_csv("Authors/author_edgelist_allworks.csv", index = False)

|       | author_1 | author_2 | weight |
|-------|----------|----------|--------|
| 0 | https://openalex.org/A5076755077 | https://openalex.org/A5079813490 | 4 |
| 1 | https://openalex.org/A5015711710 | https://openalex.org/A5076755077 | 5 |
| 2 | https://openalex.org/A5015711710 | https://openalex.org/A5079813490 | 7 |
| 3 | https://openalex.org/A5015711710 | https://openalex.org/A5040544719 | 1 |
| 4 | https://openalex.org/A5078758511 | https://openalex.org/A5079813490 | 1 |
| ... | ... | ... | ... |
| 56389 | https://openalex.org/A5005339370 | https://openalex.org/A5079825006 | 2 |
| 56390 | https://openalex.org/A5100398118 | https://openalex.org/A5111622296 | 1 |
| 56391 | https://openalex.org/A5038255653 | https://openalex.org/A5081335111 | 1 |
| 56392 | https://openalex.org/A5079825006 | https://openalex.org/A5081335111 | 1 |
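
The pair counting above can also be written more compactly with the standard library. This is a sketch of an equivalent idea rather than the code that produced the table: `coauthor_edges` is a hypothetical helper, the IDs are made up, and the `set()` call is an added assumption that an author listed twice on one paper should count once.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(author_id_strings):
    """Count how many papers each unordered author pair shares."""
    weights = Counter()
    for id_string in author_id_strings:
        # Sorting makes (a, b) and (b, a) map to the same key
        authors = sorted(set(id_string.split(";")))
        weights.update(combinations(authors, 2))
    return weights

papers_demo = ["A1;A2;A3", "A2;A1", "A4"]  # made-up IDs; the last paper has a single author
edges = coauthor_edges(papers_demo)
print(dict(edges))
```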


In [40]:
G = nx.Graph()
edges = [(row['author_1'], row['author_2'], row['weight']) for index, row in df.iterrows()]
G.add_weighted_edges_from(edges)

In [41]:
nodes_N = G.nodes
author_N = len(nodes_N) #the number of authors
author_N

17572

In [42]:
weight_sum = int(sum(df["weight"].values))
weight_sum

77881

In [43]:
nx.is_connected(G)

False

The graph is disconnected. There are isolated groups of nodes with no path connecting them.

In [44]:
list_of_connected_comp = list(nx.connected_components(G))
print(len(list_of_connected_comp)) # number of connected components

273


In [45]:
list(nx.isolates(G)) 

[]

There are 273 connected components and no isolated nodes in the network. The absence of isolated nodes was expected, as the network was created from an edge list.
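
A related observation: the papers dataset contains 17581 unique author IDs, while the graph has 17572 nodes. The 9 missing authors presumably appear only on single-author works, which produce no edges and therefore never enter a graph built from an edge list. A minimal sketch for finding such authors, using made-up IDs and a hypothetical helper `authors_without_edges`:

```python
def authors_without_edges(author_id_strings):
    """Return authors that never appear together with a co-author."""
    all_authors, linked = set(), set()
    for id_string in author_id_strings:
        ids = set(id_string.split(";"))
        all_authors |= ids
        if len(ids) > 1:  # only multi-author papers create edges
            linked |= ids
    return all_authors - linked

demo = ["A1;A2", "A3", "A2;A4", "A1"]  # made-up IDs
print(sorted(authors_without_edges(demo)))
```

If these authors should be part of the network, they could be added explicitly with `G.add_nodes_from(...)`, in which case they would show up as isolated nodes.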

In [46]:
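# Note: this divides the total edge weight (not the number of edges) by the number of possible pairs
density = weight_sum/((author_N*(author_N-1))/2)
density #very low density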

0.0005044798701189593
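
The value above is based on the sum of edge weights. The conventional density of an undirected graph counts each connected pair once, regardless of weight, which is also what `nx.density(G)` computes. The sketch below contrasts the two on a made-up edge dictionary; `densities` is a hypothetical helper. Using the figures reported in this notebook (17572 nodes and an average degree of about 6.42, hence about 56394 edges), the unweighted density would come out near 0.000365, so the conclusion of a very sparse network is unchanged.

```python
def densities(weighted_edges, n_nodes):
    """Return (unweighted density, weight-based ratio) for an undirected graph."""
    possible_pairs = n_nodes * (n_nodes - 1) / 2
    n_edges = len(weighted_edges)                 # each pair counted once
    total_weight = sum(weighted_edges.values())   # repeated collaborations counted
    return n_edges / possible_pairs, total_weight / possible_pairs

demo_edges = {("A1", "A2"): 3, ("A2", "A3"): 1}  # made-up weights
unweighted, weighted = densities(demo_edges, n_nodes=4)
print(unweighted, weighted)
```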

In [47]:
average_weight_pr_author = (weight_sum/author_N)*2
average_weight_pr_author #average sum of edge weights going to each node

8.864215797860233

The 273 connected components suggest that G has low connectivity, which is underlined by the low density. The data thus suggests that CSS researchers are poorly connected and work in around 273 isolated groups. This makes sense, since researchers probably often work in smaller teams, and since Computational Social Science is quite a wide topic, it is plausible that many different subareas of research exist. The low density is also to be expected, as it would be unrealistic for every researcher in CSS to have worked with anywhere close to 18000 other people.
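
The number of components alone does not show how the authors are distributed across them; in principle most authors could sit in one large component with many small ones around it. Looking at component sizes would make the interpretation more precise. A minimal sketch on a made-up edge list, with `component_sizes` as a hypothetical helper:

```python
from collections import defaultdict, deque

def component_sizes(edges):
    """Return connected-component sizes, largest first, for an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, sizes = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search over one component
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)

demo = [("A1", "A2"), ("A2", "A3"), ("A4", "A5"), ("A6", "A7"), ("A3", "A1")]
print(component_sizes(demo))
```

With networkx the same information is available through `sorted((len(c) for c in nx.connected_components(G)), reverse=True)`.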

In [48]:
# Get the degree of each node (as a list of values)
degrees = [degree for node, degree in G.degree()]

# Compute the required statistics
average_degree = np.mean(degrees)
median_degree = np.median(degrees)
counter = Counter(degrees)
mode_degree, mode_count = counter.most_common(1)[0]
minimum_degree = np.min(degrees)
maximum_degree = np.max(degrees)

In [49]:
# Calculate node strength (weighted degree) for each node
strengths = [strength for node, strength in G.degree(weight='weight')]

# Compute the required statistics
average_strength = np.mean(strengths)
median_strength = np.median(strengths)
counter_strength = Counter(strengths)
mode_strength, mode_count = counter_strength.most_common(1)[0]
minimum_strength = np.min(strengths)
maximum_strength = np.max(strengths)

In [50]:
# Print the computed degree statistics
print("Average degree:", average_degree)
print("Median degree:", median_degree)
print("Mode degree:", mode_degree)
print("Minimum degree:", minimum_degree)
print("Maximum degree:", maximum_degree)

print()

# Print the computed node strength statistics
print("Average strength:", average_strength)
print("Median strength:", median_strength)
print("Mode strength:", mode_strength)
print("Minimum strength:", minimum_strength)
print("Maximum strength:", maximum_strength)

Average degree: 6.418620532665605
Median degree: 5.0
Mode degree: 4
Minimum degree: 1
Maximum degree: 362

Average strength: 8.864215797860233
Median strength: 5.0
Mode strength: 4
Minimum strength: 1
Maximum strength: 607


We can tell from the degree information that on average each author collaborates with around 6 to 7 other authors. The median (5) and mode (4) show that most authors work in small networks. There is also a large range of collaborator counts, $[1..362]$: some authors have only 1 collaborator, while others have hundreds.

Looking at the strength of the connections, the average is $8.86$, which is higher than the average degree of $6.42$ and indicates that some collaborations are repeated rather than every joint paper involving a new co-author. The median and mode of the strength equal those of the degree (5 and 4), so for most authors repeated collaborations are few and take place within small groups. The maximum of 607 shows that some authors co-author many papers, suggesting that they have a large collaboration network.
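
A short worked calculation, using only the figures printed above, makes the comparison between strength and degree concrete. Both averages count every edge twice (once per endpoint), so their ratio equals the total edge weight divided by the number of edges, i.e. the average number of joint papers per collaborating pair:

$$
\frac{\bar{s}}{\bar{k}} = \frac{8.864}{6.419} \approx 1.38
$$

On average, a pair of co-authors therefore shares about 1.4 papers in this dataset.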

In [51]:
weighted_degree_dict = dict(G.degree(weight="weight"))

most_connected_weighted_nodes = sorted(weighted_degree_dict.items(), key=lambda x: x[1], reverse=True)

url = 'https://api.openalex.org/authors/'

print("Top 5:")
for id_url, weighted_degree in most_connected_weighted_nodes[:5]:
    index = id_url.find("A")
    id = id_url[index:]
    response = requests.get(url + id).json()
    name = response["display_name"]

    print(f"Node {id_url} with weight: {weighted_degree}")
    print(name)
    
    field = response["topics"][0]["field"]["display_name"]
    subfield = response["topics"][0]["subfield"]["display_name"]
    print(f"Field: {field}, Subfield: {subfield}")
    print("")

Top 5:
Node https://openalex.org/A5005421447 with weight: 607
Yi Yang
Field: Computer Science, Subfield: Computer Vision and Pattern Recognition

Node https://openalex.org/A5007176508 with weight: 594
Alex Pentland
Field: Social Sciences, Subfield: Transportation

Node https://openalex.org/A5100355277 with weight: 550
Yong Li
Field: Social Sciences, Subfield: Transportation

Node https://openalex.org/A5100322712 with weight: 517
Yan Wang
Field: Computer Science, Subfield: Information Systems

Node https://openalex.org/A5044944954 with weight: 514
Lyle Ungar
Field: Psychology, Subfield: Social Psychology



The top authors in our network connect research clusters together through their broad backgrounds. Their high weights indicate frequent collaborations. For instance, the top topic of Yi Yang, the most connected author, lies in Computer Science (Computer Vision and Pattern Recognition). Alex Pentland is an entrepreneur and researcher associated with social physics, honest signals, Computational Social Science, and network and complexity science. Lyle Ungar is interesting, since his top topic is classified under Psychology (Social Psychology), even though the network is built from works that also intersect a quantitative discipline. Even though these authors have different primary specializations, they strengthen the connections within Computational Social Science, demonstrating how technical, social, and quantitative research can address societal challenges.