In [None]:
import numpy as np
import pandas as pd

### Part 1: Web-scraping
## ANSWERS TO QUESTIONS:
___

In [None]:
from bs4 import BeautifulSoup
import requests

url2019 = "https://ic2s2-2023.org/program"

page = requests.get(url2019)
soup = BeautifulSoup(page.content, "html.parser")
## find all classes "nav_list"
nav_lists = soup.find_all(class_="nav_list")

names = []
for nav_list in nav_lists:
    new_names = nav_list.find_all("i")
    new_names = [name.get_text() + "," for name in new_names]
    new_names = " ".join(new_names)
    new_names = new_names.split(", ")
    new_names = [name.strip() for name in new_names]
    names.extend(new_names)
    
unique_names = list(set(names))
unique_names_sorted = [name for name in unique_names if " " in name]
np.save("unique_names.npy", unique_names_sorted)

In [6]:
unique_names = np.load("unique_names.npy")
print("Number of unique names: ", len(unique_names))
print("First 10 unique names: ", unique_names[:10])

Number of unique names:  1499
First 10 unique names:  ['Telmo Menezes' 'Lui Ruck' 'James Calum Young' 'Giyeon Baek'
 'Karolina Stanczak' 'Mathieu Génois' 'Karthikeya Kaushik' 'James Evans'
 'Giada Marino' 'Tom Emery']


### Question:

Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices (answer in max 150 words).
### Answer:
In total, we obtained 1499 unique researcher names. We scraped the page given in the assignment with the BeautifulSoup library. Inspecting the page source showed that the names sit inside elements of the class “nav_list”, within “i” tags, so we looped over those tags to extract them. Several tags hold more than one name, so we split their text on “,” to separate the names. To remove repeats, we converted the list to a set and then back to a list. Finally, we kept only the entries that contain a space, as a simple check that each entry is a full name rather than a stray word.
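The cleaning steps described above (split on commas, strip whitespace, deduplicate, keep entries with a space) can be sketched in plain Python. This is a minimal illustration and not the scraping cell itself: the function name `clean_names` and the example strings are invented for the sketch.

```python
def clean_names(raw_entries):
    """Split comma-separated name strings, then deduplicate and filter them."""
    names = set()  # a set drops repeated names automatically
    for entry in raw_entries:
        for part in entry.split(","):
            name = part.strip()
            # Keep only entries that look like "First Last" (contain a space).
            if " " in name:
                names.add(name)
    return sorted(names)


# Hypothetical text of three <i> elements, invented for illustration.
raw_entries = [
    "Ada Lovelace, Alan Turing",
    "Alan Turing,  Grace Hopper ",
    "Chair",
]

cleaned = clean_names(raw_entries)
print(cleaned)
```

Splitting on the bare comma and stripping afterwards also handles entries with irregular spacing or a trailing comma.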
___

### Part 2: Ready made vs Custom made data
### Question:

What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book (answer in max 150 words).
### Answer:
A positive aspect of the custom-made data in Centola's experiment is that it was designed specifically to answer the research question. Another advantage of this study, which does not hold for custom-made data in general, is that the participants did not know they were being observed, so the data is "non-reactive". The data is also likely to be quite "clean", as it was collected in a controlled environment. A downside is the overhead, resources and time needed to collect a relatively small amount of data. The ready-made data in Nicolaides's study has some of the classic advantages: it is big, non-reactive and always-on. However, one has to make sure that no sensitive information is shared with the public, and algorithmic confounding and drift, both inherent to longitudinal app data, can be a downside.

### Question:
How do you think these differences can influence the interpretation of the results in each study? (answer in max 150 words)

### Answer:
One of the main differences in how the results of Centola's experiment and Nicolaides's study can be interpreted is that the custom-made data makes it more reasonable to conclude causality. This is because the experiment was designed to answer a specific question and the data was collected in a controlled environment. With the ready-made data, we are more likely to observe only correlation, as the data was not collected to answer a specific research question and will thus be incomplete. The ready-made data is also more likely to be affected by algorithmic confounding and drift, and more likely to be "dirty", i.e. full of noise and outliers. Both make the results harder to interpret.



### Part 3: Gathering Research Articles using the OpenAlex API

We haven't included all the code for this section. We ran the same code for both the original authors and the coauthors, and we thought it would be redundant to include it twice. The functions used to extract the information are included below.

### Question:
__How many works are listed in your IC2S2 papers dataframe? How many unique researchers have co-authored these works?__ <br>
### Answer:
In the end we found 10540 unique authors and 68428 papers. 
### Question:
__Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?__
### Answer:
Unfortunately, we couldn't get multiprocessing to work. To speed up the process slightly, we used filters in the API query: <br>
<code> FILTERS = ",cited_by_count:>10,authors_count:<10" </code> <br>
<br>
We also tried to end a search as early as possible when the requirements were not met. For example, if an author's work count wasn't between 5 and 5000, we immediately stopped processing that author.
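Since the work is dominated by waiting for HTTP responses, a thread pool is one possible alternative to multiprocessing. The sketch below only illustrates the pattern and is not something we ran against the API: `fetch_works` is a hypothetical stand-in that sleeps instead of sending a request, and the URLs are made up. With a real API, the number of workers would have to respect the API's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_works(url):
    """Hypothetical stand-in for a request: waits briefly, then returns a dummy record."""
    time.sleep(0.05)
    return {"url": url, "n_works": len(url)}


# Made-up URLs, used only to have something to iterate over.
urls = [f"https://example.org/works?author={i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map returns the results in the same order as the input URLs.
    results = list(pool.map(fetch_works, urls))
elapsed = time.perf_counter() - start

print(len(results), "responses in", round(elapsed, 2), "seconds")
```

Run sequentially, the twenty simulated requests would take about one second in total; with ten threads the waiting overlaps.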
### Question:
__Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?__
### Answer:
One reason for setting these thresholds is to filter out outliers, i.e. records that are either corrupted or anomalous. Also, since we are interested in computational social science, it makes sense to only look at papers that fall within this field. Filtering away authors with too many works and papers with too many authors is also a way of limiting the size of our dataset, so that it is actually manageable to work with. Obviously, we miss some results in the process. In particular, papers written by many collaborating authors are underrepresented, and so are the many papers with few or no citations, for example those by aspiring PhD students (these are of course not "junk", but they are removed by the citation filter).
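To make the effect of these criteria concrete, the sketch below applies the same rules locally to a few invented works. The helper names `is_relevant` and `passes_thresholds` and the example records are assumptions made for the illustration; in our actual code the citation and author-count thresholds are applied by the API through the filter string.

```python
COMSCOSCI = {"Sociology", "Psychology", "Economics", "Political Science"}
QUANSCI = {"Mathematics", "Physics", "Computer Science"}


def is_relevant(concepts):
    """True if the concepts contain at least one social and one quantitative field."""
    concepts = set(concepts)
    return bool(concepts & COMSCOSCI) and bool(concepts & QUANSCI)


def passes_thresholds(cited_by_count, n_authors):
    """Local mirror of the API filter cited_by_count:>10,authors_count:<10."""
    return cited_by_count > 10 and n_authors < 10


# Invented example works, not taken from our dataset.
works = [
    {"concepts": ["Sociology", "Computer Science"], "cited_by_count": 25, "n_authors": 3},
    {"concepts": ["Sociology", "Computer Science"], "cited_by_count": 0, "n_authors": 2},
    {"concepts": ["Physics", "Mathematics"], "cited_by_count": 40, "n_authors": 4},
    {"concepts": ["Economics", "Mathematics"], "cited_by_count": 300, "n_authors": 25},
]

kept = [
    w for w in works
    if is_relevant(w["concepts"]) and passes_thresholds(w["cited_by_count"], w["n_authors"])
]
print(len(kept), "of", len(works), "works kept")
```

Only the first work survives: the second has too few citations, the third has no social science concept, and the fourth has too many authors.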


In [None]:
"""
Script for finding the information about the authors
We find:
- id
- display_name
- works_api_url
- h_index
- works_count
- country_code

We use the OpenAlex API to find the information about the authors
"""

import numpy as np
import pandas as pd
import requests
from Levenshtein import distance as lev
from tqdm import tqdm

## The things we want to extract
columns = ["id", "display_name", "works_api_url", "h_index", "works_count", "country_code"]
df = pd.DataFrame(columns=columns)

## The base url for the API
BASE_URL = "https://api.openalex.org/authors?search="
## The unique names we want to search for
names = np.load("unique_names.npy")

def get_information_from_name(name):
    """Get information about an author from the OpenAlex API"""
    url = BASE_URL + name.lower()
    response = requests.get(url)
    response_json = response.json()
    results = response_json["results"]

    if results:
        ## Look at display names but also alternative names
        ## to see if we can find a match
        display_name = results[0]["display_name"]
        display_name_alternatives = results[0]["display_name_alternatives"]
        display_name_alternatives.append(display_name)

        ## Calculate the Levenshtein distance between the name we are looking for
        ## Levenshtein distance is the number of single-character edits (insertions, deletions or substitutions)
        ## required to change one word into the other
        ## If the distance is less than 2, we consider it a match (this is a bit arbitrary, but we had to set a threshold)
        lev_distances = [lev(name, alt) for alt in display_name_alternatives]

        if min(lev_distances) < 2:
            ## If we have a match, we extract the information we want
            id = results[0]["id"]
            works_api_url = results[0]["works_api_url"]
            h_index = results[0]["summary_stats"]["h_index"]

            ## total number of works by the author
            ## (counts_by_year only holds per-year counts, so we read the top-level field)
            works_count = results[0]["works_count"]

            affiliations = results[0]["affiliations"]
            country_code = affiliations[0]["institution"]["country_code"] if affiliations else None

            new_row = pd.Series(data=[id, display_name, works_api_url, h_index, works_count, country_code], index=columns)
            return new_row
        
    return None


## We loop through all the names and extract the information
## We add the information to a dataframe
for name in tqdm(names):
    new_row = get_information_from_name(name)
    if new_row is not None:
        df.loc[len(df)] = new_row
            
df.to_pickle("authors.pkl")
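The name matching in the cell above relies on the Levenshtein distance. The sketch below shows what the threshold does, using a plain-Python edit distance as a stand-in for `Levenshtein.distance`. The names `edit_distance` and `is_match` are introduced for the illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_match(query, candidates, max_distance=1):
    """Same rule as in the cell above: the closest candidate must be within one edit."""
    return min(edit_distance(query, c) for c in candidates) <= max_distance


print(edit_distance("kitten", "sitting"))
print(is_match("Tom Emery", ["Tom Emerey", "T. Emery"]))
print(is_match("James Evans", ["James A. Evans"]))
```

A single typo still counts as a match, while a name that differs by a middle initial ("A. " is three inserted characters) does not, unless the shorter form appears among the alternative display names.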

In [None]:
"""
Use the above information to find the works of the authors
We find:
- id
- publication_year
- cited_by_count
- author_ids
- title
- abstract_inverted_index

"""

df = pd.read_pickle("authors.pkl")
works_api_urls = df["works_api_url"].tolist()

import requests
from tqdm import tqdm
import time

paper_df = pd.DataFrame(columns = ["id", "publication_year", "cited_by_count", "author_ids"])
abstracts_df = pd.DataFrame(columns = ["id", "title", "abstract_inverted_index"])

COMSCOSCI = ["Sociology", "Psychology", "Economics", "Political Science"]
QUANSCI = ["Mathematics", "Physics", "Computer Science"]

FILTERS = ",cited_by_count:>10,authors_count:<10"
PARAMS = {"per-page": "200", "page": "1"}

## ids of the works we have already stored, so that we do not add a work twice
seen_ids = set()

def get_works_from_url(url, filters, params):
    """Fetch all works behind one author's works_api_url.

    Relevant works are appended to the global paper_df and abstracts_df.
    Returns True if the request succeeded, False after 10 failed attempts.
    """
    ## build the filtered url once, so that retries do not append the filters again
    filtered_url = url + filters

    for attempt in range(10): ## for some reason, the requests sometimes fail, so we need to keep trying until it works
                ## usually, it works on the first try, but sometimes it takes 2 or 3 tries
        try:    ## use a try-except block to catch the error and keep going
            ## get the first page of results to find out how many pages there are
            ## and how many works there are in total
            response = requests.get(filtered_url, params=params).json()
            work_counts = response["meta"]["count"]
            number_of_pages = work_counts // 200 + 1

            ## we only look at authors with between 5 and 5000 (filtered) works
            if not 5 <= work_counts <= 5000:
                return True

            ## loop through all the pages of results
            for page_number in range(1, number_of_pages + 1):
                new_params = {**params, "page": str(page_number)}
                works = requests.get(filtered_url, params=new_params).json()["results"]

                ## loop through all the works on the page
                for work in works:
                    work_id = work["id"]
                    ## skip works that are already in the dataframe
                    if work_id in seen_ids:
                        continue

                    ## check if the work is relevant
                    ## ie if it has a concept in both the COMSCOSCI and QUANSCI categories
                    concepts = [concept["display_name"] for concept in work["concepts"]]
                    relevant = any(c in COMSCOSCI for c in concepts) and any(c in QUANSCI for c in concepts)
                    if not relevant:
                        continue

                    author_ids = [author["author"]["id"] for author in work["authorships"]]
                    seen_ids.add(work_id)

                    new_paper_row = pd.Series([work_id, work["publication_year"], work["cited_by_count"], author_ids], index=paper_df.columns)
                    paper_df.loc[len(paper_df)] = new_paper_row

                    new_abstract_row = pd.Series([work_id, work["title"], work["abstract_inverted_index"]], index=abstracts_df.columns)
                    abstracts_df.loc[len(abstracts_df)] = new_abstract_row

            ## all pages were processed, so there is no need to retry
            return True

        ## if the request fails, print the error and try again
        ## we will try 10 times before giving up
        except Exception as e:
            print(e)

    ## if we have tried 10 times and it still doesn't work, give up on this author
    return False

## loop through all the authors
## for each author, get all the works they have written
## and add the relevant works to the dataframes
for url in tqdm(works_api_urls):
    get_works_from_url(url, FILTERS, PARAMS)

paper_df.to_pickle("papers.pkl")
abstracts_df.to_pickle("abstracts.pkl")

We then repeat the same procedure for all the coauthors. <br>
i.e. we find information about the new authors and find their work. <br>
This amounts to finding all the authors that we haven't already looked at and running them through the same functions. <br>
In the end, we end up with these dataframes:

In [8]:
final_authors = pd.read_pickle("final_authors.pkl")
final_papers = pd.read_pickle("final_papers.pkl")

print("Final number of authors: ", len(final_authors))
print("Final number of papers: ", len(final_papers))

Final number of authors:  10540
Final number of papers:  68428
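The step of "finding all the authors that we haven't already looked at" amounts to a set difference between the author ids on the collected papers and the ids we already have. A minimal sketch, with a hypothetical helper name and made-up ids:

```python
def new_coauthor_ids(papers_author_ids, known_author_ids):
    """Return the author ids that appear on the papers but were not looked up yet."""
    seen = set(known_author_ids)
    new_ids = []
    for author_ids in papers_author_ids:
        for author_id in author_ids:
            if author_id not in seen:
                seen.add(author_id)  # also avoids listing a new coauthor twice
                new_ids.append(author_id)
    return new_ids


# Made-up ids for illustration.
known = ["A1", "A2"]
papers_author_ids = [["A1", "A3"], ["A2", "A3", "A4"], ["A1", "A2"]]

new_ids = new_coauthor_ids(papers_author_ids, known)
print(new_ids)
```

The resulting ids can then be passed through the same two steps as before: first the author lookup, then the works lookup.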


### Part 4: The Network of Computational Social Scientists

In [None]:
import networkx as nx

## load in the data again
## we do this so that we do not have to re-run the previous cells
papers = pd.read_pickle("final_papers.pkl")
authors = pd.read_pickle("final_authors.pkl")

"""
For each author, we want to find the total number of citations they have received
and the earliest year they have published a paper
We will add these columns to the authors dataframe
"""

## use the explode function to very quickly create a new row for each author of a paper
citations = papers.explode("author_ids").groupby("author_ids")["cited_by_count"].sum().reset_index()
## add this column to the authors dataframe
authors = authors.merge(citations, left_on="id", right_on="author_ids", how="left")
authors = authors.rename(columns={"cited_by_count": "total_citations"})
authors = authors.drop(columns="author_ids")
authors = authors.fillna(0)

earliest_publication = papers.explode("author_ids").groupby("author_ids")["publication_year"].min().reset_index()
## add this column to the authors dataframe
## some authors have no publication year, so we will fill in a zero
authors = authors.merge(earliest_publication, left_on="id", right_on="author_ids", how="left")
authors = authors.rename(columns={"publication_year": "earliest_publication"})
authors = authors.drop(columns="author_ids")
authors = authors.fillna(0)

In [12]:
import itertools

"""
We want to create a graph where the nodes are authors and each edge is weighted by the total number of citations of the papers the two authors have written together
We will use the NetworkX library to create this graph
"""


# Get all unique pairs of author_ids from the papers dataframe, one line at a time
edgelist_dict = {}
for index, row in papers.iterrows():
    author_ids = row['author_ids']
    citations = row['cited_by_count']
    for pair in itertools.combinations(author_ids, 2):
        pair = tuple(sorted(pair))
        # If the pair is already in the dictionary, add this paper's citations to its weight
        if pair in edgelist_dict:
            edgelist_dict[pair] += citations
        # If the pair is not in the dictionary, add it
        else:
            edgelist_dict[pair] = citations

edgelist = [(k[0], k[1], v) for k, v in edgelist_dict.items()]

G = nx.Graph()
G.add_weighted_edges_from(edgelist)

print("Number of nodes: ", G.number_of_nodes())
print("Number of edges: ", G.number_of_edges())

print(G.nodes())

Number of nodes:  9964
Number of edges:  32742
['https://openalex.org/A5032539868', 'https://openalex.org/A5080298742', 'https://openalex.org/A5076015541', 'https://openalex.org/A5020199695', 'https://openalex.org/A5059871346', 'https://openalex.org/A5014662127', 'https://openalex.org/A5069885186', 'https://openalex.org/A5068511908', 'https://openalex.org/A5033629846', 'https://openalex.org/A5036321054', 'https://openalex.org/A5035632114', 'https://openalex.org/A5055900838', 'https://openalex.org/A5071261828', 'https://openalex.org/A5014680546', 'https://openalex.org/A5037582478', 'https://openalex.org/A5089544677', 'https://openalex.org/A5084994165', 'https://openalex.org/A5052174052', 'https://openalex.org/A5089883420', 'https://openalex.org/A5015601396', 'https://openalex.org/A5039065189', 'https://openalex.org/A5023961667', 'https://openalex.org/A5046026314', 'https://openalex.org/A5042391987', 'https://openalex.org/A5010764981', 'https://openalex.org/A5039020564', 'https://openale
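The edge weights built above are sums of citation counts over the papers two authors share. The toy example below, with made-up author ids and citation counts, shows this weight next to the alternative of simply counting joint papers, so that the difference between the two is easy to see. Unlike the cell above, the sketch applies `set()` to each author list as an extra safeguard against an id that is listed twice on one paper.

```python
from collections import defaultdict
from itertools import combinations

# (author_ids, cited_by_count) for three invented papers.
toy_papers = [
    (["A1", "A2", "A3"], 10),
    (["A2", "A1"], 5),
    (["A4"], 7),  # single-author paper: produces no edge
]

citation_weight = defaultdict(int)  # weight used in our network
paper_count = defaultdict(int)      # alternative weight: number of joint papers

for author_ids, cited_by_count in toy_papers:
    # Sorting makes ("A1", "A2") and ("A2", "A1") the same dictionary key.
    for pair in combinations(sorted(set(author_ids)), 2):
        citation_weight[pair] += cited_by_count
        paper_count[pair] += 1

for pair in sorted(citation_weight):
    print(pair, citation_weight[pair], paper_count[pair])
```

With citation-based weights, a single highly cited paper can dominate the strength of all its authors, which is worth keeping in mind when reading the strength statistics.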

In [15]:
"""
Add the attributes of the authors to the graph
We will add the display name, country code, total citations, and earliest publication year
"""

for row in authors.iterrows():
    row = row[1]
    
    attributes_dict = {}
    attributes_dict["display_name"] = row["display_name"]
    attributes_dict["country_code"] = row["country_code"]
    attributes_dict["total_citations"] = row["total_citations"]
    attributes_dict["earliest_publication"] = row["earliest_publication"]
    G.add_node(row["id"], **attributes_dict)
    
## save network as json
import json
data = nx.node_link_data(G)
with open('network.json', 'w') as f:
    json.dump(data, f, indent=4)
    

In [16]:
num_edges = G.number_of_edges()
num_nodes = G.number_of_nodes()

## calculate density of the network
density = nx.density(G)
print("Network density:", density)

Network density: 0.00016581743345606923
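For an undirected graph, `nx.density` is the number of edges divided by the number of possible edges, 2m / (n(n - 1)). The sketch below recomputes it by hand from the edge count printed earlier. The function name is ours, and the second call assumes that the node count at this point is the 9964 nodes from the edge list plus the 9909 isolated nodes reported further down.

```python
def undirected_density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2m / (n (n - 1))."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


n_edges = 32742

# Only the nodes that came from the co-authorship edge list.
density_connected_only = undirected_density(9964, n_edges)

# Assumption: edge-list nodes plus the isolated nodes added from the authors table.
density_all_nodes = undirected_density(9964 + 9909, n_edges)

print("Edge-list nodes only:", density_connected_only)
print("Including isolated nodes:", density_all_nodes)
```

The second value agrees with the density printed above, which suggests that the reported density is taken over all nodes, including the isolated ones.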


In [17]:
## is the network fully connected?
is_connected = nx.is_connected(G)
print("Is the network connected?", is_connected)

Is the network connected? False


In [19]:
## how many clusters are there? 
clusters = list(nx.connected_components(G))
print("Number of clusters:", len(clusters))

Number of clusters: 10007


In [21]:
## how many isolated nodes are there?
isolated_nodes = list(nx.isolates(G))
print("Number of isolated nodes:", len(isolated_nodes))

Number of isolated nodes: 9909


### Question:

Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why? (answer in max 150 words)
## ANSWER TO QUESTION:
We started with the idea that there must be a lot of collaboration, but the network turned out to be sparse. The co-authorship edge list gives 9964 nodes (researchers) and 32742 links (collaborations), and once the remaining authors are added as nodes the density is only 0.000166, so only a tiny fraction of all possible pairs have written a paper together. To see how collaboration is distributed, we looked at the clusters (connected components) and the isolated nodes. We found 10007 clusters, of which 9909 are isolated nodes, which leaves 98 clusters with two or more researchers. The isolated nodes are authors for whom our dataset contains no co-authored paper. To conclude, collaboration is concentrated in a limited number of connected groups, while a large share of the authors are not linked to anyone.
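The distinction between clusters and isolated nodes can be illustrated on a toy graph. The sketch below finds connected components with a breadth-first search in plain Python, as a stand-in for `nx.connected_components`. The graph and the helper name are invented for the example.

```python
from collections import deque


def connected_components(adjacency):
    """Return the connected components of an undirected graph given as {node: neighbours}."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Toy graph: one group of three, one pair, and two isolated nodes.
adjacency = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
    "f": set(), "g": set(),
}

components = connected_components(adjacency)
isolated = [c for c in components if len(c) == 1]
groups = [c for c in components if len(c) > 1]
largest = max(components, key=len)

print(len(components), "components,", len(isolated), "isolated,", len(groups), "groups")
```

Every isolated node counts as a component of its own, so a large number of components mainly reflects a large number of isolated nodes rather than many collaborating groups.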

In [22]:
## Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree).
degrees = [val for (node, val) in G.degree()]
strengths = [val for (node, val) in G.degree(weight="weight")]

print("Average degree:", sum(degrees) / len(degrees))
print("Median degree:", sorted(degrees)[len(degrees) // 2])
print("Mode degree:", max(set(degrees), key=degrees.count))
print("Minimum degree:", min(degrees))
print("Maximum degree:", max(degrees))

print("Average strength:", sum(strengths) / len(strengths))
print("Median strength:", sorted(strengths)[len(strengths) // 2])
print("Mode strength:", max(set(strengths), key=strengths.count))
print("Minimum strength:", min(strengths))
print("Maximum strength:", max(strengths))

Average degree: 3.295124037639008
Median degree: 1
Mode degree: 0
Minimum degree: 0
Maximum degree: 428
Average strength: 505.9961757157953
Median strength: 13
Mode strength: 0
Minimum strength: 0
Maximum strength: 431675
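The same summary statistics can also be computed with Python's `statistics` module. The sketch below uses a small invented degree list to show how a single hub pulls the mean far above the median and the mode.

```python
import statistics

# Invented degree sequence: mostly low degrees and one hub.
degrees = [0, 0, 0, 1, 1, 2, 3, 40]

summary = {
    "mean": statistics.mean(degrees),
    "median": statistics.median(degrees),
    "mode": statistics.mode(degrees),
    "min": min(degrees),
    "max": max(degrees),
}

for key, value in summary.items():
    print(key, value)
```

Note that `statistics.median` averages the two middle values for an even number of observations, whereas `sorted(x)[len(x) // 2]` returns the upper of the two.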


### Question:
Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why? (answer in max 150 words)

## ANSWER TO QUESTION:
With a density of 0.000166, the network is very sparse. While the magnitude of the number admittedly does not say much on its own, this is in line with what we expected: the network is very large, and the number of connections is very small compared to the number of possible connections. We expected a sparse network because we are investigating researchers across a broad geographical scope and a broad variety of research topics and academic backgrounds, so any two researchers are unlikely to be connected to each other. The network is also not connected: there are 10007 clusters (connected components), 9909 of which are isolated nodes. This is likely because the researchers come from different fields and geographical locations, and it is also in line with what we expected.

### Question:
Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree). What do these metrics tell us about the network? (answer in max 150 words)

## ANSWER TO QUESTION:
An average degree of 3.30 tells us that each researcher is connected to 3.30 other researchers on average. The median degree of 1 tells us that at least half of the researchers have one collaborator or none, and the mode of 0 shows that the most common case is to have no collaborators at all. The minimum is unsurprisingly 0, since there are isolated nodes. The maximum is 428, far above the average, so although most of the network is thinly connected, a few researchers are very well connected. The average strength, i.e. the weighted degree, is 506, while the median strength is only 13. Both distributions are therefore heavily skewed: a few researchers have a very high degree or strength, and most researchers have a very low one. This is consistent with the sparsity we found above.

### Question:
Identify the top 5 authors by degree. What role do these nodes play in the network?
Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? (answer in max 150 words)

## ANSWER TO QUESTION:
The top 5 authors by degree are: <br>
428, Chao Wang: Seems to work primarily in chemistry and materials science, but has also published in applied physics, which could be the reason we see this author in the network. <br>
417, Hui Wang: Works mostly on genetics and diseases, but has publications in "Science". <br>
313, Hao Chen: Nanotechnology and organic materials. <br>
284, Qi Wang: Has studied both social interactions and topics such as physiology and materials science. <br>
189, Xiao Zhang: Has published in journals such as "Econometrica" and "Green Chemistry".

All five authors appear to be Asian, and it is likely that they are so well connected in the network because they come from a country with a large population and a large number of researchers, and can therefore easily connect with peers.



In [24]:
## find top 5 nodes by number of degrees
top5_nodes = sorted(G.degree, key=lambda x: x[1], reverse=True)[:5]
print("Top 5 nodes by degree:")
for node in top5_nodes:
    print(node, G.nodes[node[0]]["display_name"])

Top 5 nodes by degree:
('https://openalex.org/A5055838753', 428) Chao Wang
('https://openalex.org/A5090366405', 417) Hui Wang
('https://openalex.org/A5022499603', 313) Hao Chen
('https://openalex.org/A5015195367', 284) Qi Wang
('https://openalex.org/A5002318539', 189) Xiao Zhang


In [25]:
## extract the largest connected component
largest_cc = max(nx.connected_components(G), key=len)
print("Number of nodes in largest connected component:", len(largest_cc))

Number of nodes in largest connected component: 9329
