Link to git repository: https://github.com/ongiboy/Cognitive-Social-Science

Group members' contributions:
* Every task was completed in collaboration by all members.

# Part 1: Web-scraping

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Setup the URL
LINK = "https://ic2s2-2023.org/program"
r = requests.get(LINK)
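# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")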

Extracting keynote speakers

In [None]:
table = soup.find("table",{"class":"tutorials"})
table_rows = table.find_all("tr")

# For each row find the elements that include "Keynote" and save to list
keynotes = []
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text for i in td]
    if row:
        if "Keynote" in row[1]:
            keynotes.append(row[1])

# Remove "Keynote - " from each element in list
keynotes = [re.sub("Keynote - ", "", i) for i in keynotes]

print(keynotes)
print("Number of keynotes: ", len(list(set(keynotes))))

Extracting chairs

In [None]:
chairs = []
# Find the sections with names
sections = soup.find_all("h2")
for section in sections:

    # Find names within section
    bullets = section.find_all("i")
    for nameline in bullets:

        # Keep only the name after the "Chair: " prefix (e.g. "Chair: Taha Yasseri")
        if "Chair" in nameline.text:
            chairs.append(nameline.text[7:])
        
# Print all chairs and the number of unique chairs
print(chairs)
print("Number of chairs: ", len(list(set(chairs))))

Extracting speakers

In [None]:
names = []
# Find the sections with names
sections = soup.find_all("ul",{"class":"nav_list"})
for section in sections:

    # Find names within section
    bullets = section.find_all("i")
    for nameline in bullets:

        # Split into each name and add to list
        names.extend(nameline.text.split(", "))
        
print(names)
print("Number of names: ", len(list(set(names))))

Combined list of all researchers

In [None]:
# Combine all lists into one
all_names = keynotes + chairs + names

# Remove duplicates
all_names = list(set(all_names))

print("Number of names: ", len(all_names))

In [None]:
# Save names in csv
df = pd.DataFrame(all_names, columns=["name"])
df.to_csv("names.csv", index=False)

How many unique researchers do you get?
* 1491

Explain the process you followed to web-scrape the page. 
Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list?
* First, a detailed inspection of the page was made to find the sections containing names. These sections are:
    * An overview table containing keynote speakers (< td >)
    * Plenary talks containing chairs (< h2 >) and speakers (< i >)
    * Parallel talks containing chairs (< h2 >) and speakers (< i >)
    * Posters (< i >)
* Each section was then scraped for names. In the overview table, the rows containing "Keynote" were kept and the "Keynote - " prefix was removed. Chairs were taken from the < i > elements inside the < h2 > headings, and the speaker lines were split on ", " into individual names. Finally, the lists were combined.
* Then, the combined list was converted to a set to remove duplicates.
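
Converting the list to a set only removes exact duplicates. One possible way to assess the quality of the list further is to flag names that differ only in case, accents or spacing. The following is a minimal sketch of such a check and not the procedure used to produce the count above; the helper names `normalize` and `find_near_duplicates` and the example names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Lowercase, strip accents and collapse whitespace (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", name)
    no_accents = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return " ".join(no_accents.lower().split())


def find_near_duplicates(names):
    """Group raw names that become identical after normalization."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    # Keep only the groups where several raw spellings share one key
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Hypothetical example names, not taken from the scraped list
example = ["Ana Pérez", "Ana Perez", "ana  perez", "John Smith"]
print(find_near_duplicates(example))
```

Any group that is returned would need a manual look, since two different people can share a normalized name.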

# Part 2: Ready Made vs Custom Made Data

What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)?
* Centola's data samples and experimental setup were highly controlled, so the causal effect of clustered vs. random networks on the spread of behavior could be investigated while effectively avoiding confounding effects. However, the transparency of the experiment raises ethical questions, as the people registering on the health website were not informed about their participation in the experiment. The controlled setup might also create an artificial environment that does not accurately reflect clustered networks in the real world.

* The observational data from the fitness tracking app, combined with the social network app used in Nicolaides's study, provided large samples under real-world conditions, reflecting the users' actual behavior in their natural environments. However, as control over the data collection was limited, potential biases in the app's user base can affect the generalizability of the findings. Other external variables, such as motivation or homophily, may also explain the observed correlation.
  

How do you think these differences can influence the interpretation of the results in each study?
* The findings of Centola's experiment are likely more internally valid because the controlled setup helps isolate the variables of interest and reduce the influence of external factors. This gives a clearer understanding of the causal relationship between network structure and the spread of behavior. However, because of the artificial environment and lack of transparency, the results of Centola's study may have limited external validity.
* On the other hand, the results of Nicolaides's study are more applicable to real-world scenarios, since they reflect the complexity of actual social networks. However, because of the lack of control over the data collection, there might be biases and confounding variables, so the findings must be interpreted with caution.


# Part 3: Gathering Research Articles using the OpenAlex API

Define functions and import packages

In [None]:
import requests
import pandas as pd
import tqdm
import unicodedata

def normalize_name(name):
    """
    returns more standard version of name
        (removes accents, special characters, and makes lowercase)
    """

    normalized_name = ''.join(c for c in unicodedata.normalize('NFD', name) if unicodedata.category(c) != 'Mn' and (c.isalnum() or c.isspace()))
    return normalized_name.lower()


def get_info_from_names(df_names):
    """
    Retrieves author information from names (through a search)
    """

    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/authors'

    data = []
    for i in tqdm.tqdm(range(len(df_names))):
        try:
            response = requests.get(BASE_URL + RESOURCE, params={"search": df_names["name"][i]})
            all_results = response.json()["results"]#[0]

            names = []
            normalized_df_name = normalize_name(df_names["name"][i])

            for r in range(len(all_results)):
                name = all_results[r]["display_name"]
                names.append(normalize_name(name))

                if normalized_df_name in names:
                    results = response.json()["results"][r]
                    break

            if normalized_df_name not in names:
                continue

            data += [[results["id"], results["display_name"], results["works_api_url"], results["summary_stats"]["h_index"], results["works_count"], results["last_known_institution"]["country_code"]]]

        except (IndexError, KeyError, ValueError, TypeError) as e:
            print(f"Skipping data point at iteration {i} due to error: {e}")
            continue  # Skip the rest of the loop and proceed to the next iteration
        except requests.exceptions.RequestException as e:
            print(f"Request error at iteration {i}: {e}")
            continue

    new_df = pd.DataFrame(data, columns=["id","display_name","works_api_url","h_index","works_count", "country_code"])
    return new_df


def get_concept_ids(concept_requirements):
    """
    Retrieves the OpenAlex concept_ids for a list of requirements.
    """

    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/concepts'

    concept_ids = []
    for concept in concept_requirements:
        result = requests.get(BASE_URL + RESOURCE, params={'search': concept, 'filter': 'level:0'}).json()
        concept_ids.append(result['results'][0]['id'])

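    # Keep only the short id at the end of the OpenAlex URL (avoid shadowing the built-in id)
    concept_ids = [concept_id.split("/")[-1] for concept_id in concept_ids]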
    return concept_ids


def get_articles_from_authors(names, concept_ids_requirements_1, concept_ids_requirements_2, subset=False):
    """
    Extracts articles from authors in the names table.
        The articles are filtered by the criteria from the assignment description.
    """
    
    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/works'

    # Filter out authors not having 5-5000 works
    names = names[(names['works_count']>=5) & (names['works_count']<=5000)]

    table1 = []
    table2 = []

    # Search for articles in batches of 25 authors
    name_batches = [list(names['id'][i:i+25]) for i in range(0, len(names), 25)]

    for num_name_batch, name_batch in tqdm.tqdm(enumerate(name_batches)):

        # short version for testing
        if subset and num_name_batch>0:
            break

        # Scroll through the results
        cursor = '*'
        while True:
            filters = ['cited_by_count:>10', 
                        'authors_count:<10',
                        'authorships.author.id:'+'|'.join(name_batch),
                        'concepts.id:'+'|'.join(concept_ids_requirements_1),
                        'concepts.id:'+'|'.join(concept_ids_requirements_2)
                        ]
            parameters = {'per-page': 200,
                            'filter': ','.join(filters),
                            'cursor': cursor
                            }
            result = requests.get(BASE_URL + RESOURCE, params=parameters).json()

            # If last page is reached (which is empty), break 
            cursor = result['meta']['next_cursor'] # next page for next search
            if len(result['results'])==0 or cursor is None:
                break

            # Go through all articles and extract information
            for n_article,article in enumerate(result['results']):
                try:
                    tab1 = [article['id'], article['publication_year'], article['cited_by_count'], [author['author']['id'] for author in article['authorships']]]
                    tab2 = [article['id'], article['title'], article['abstract_inverted_index']]
                    table1.append(tab1)
                    table2.append(tab2)

                except (KeyError, TypeError):
                    print("skipped name batch:", num_name_batch, "article:", n_article)
                    continue
    
    table1 = pd.DataFrame(table1, columns=['id', 'publication_year', 'cited_by_count', 'authors'])
    table2 = pd.DataFrame(table2, columns=['id', 'title', 'abstract_inverted_index'])

    return table1, table2


def get_info_from_author_ids(author_ids):
    """
    Gets info table from author ids.
    """

    URL = 'https://api.openalex.org/authors'
    author_ids = [author_id.split('/')[-1] for author_id in author_ids]

    co_author_info = []
    id_batches = [author_ids[i:i+25] for i in range(0, len(author_ids), 25)]

    for i,ids_batch in tqdm.tqdm(enumerate(id_batches)):
            
        # search for result
        params = {'filter': 'ids.openalex:'+'|'.join(ids_batch)}
        result_batch = requests.get(URL, params=params).json()

        # No results (the response is a dict, so check its 'results' list)
        if not result_batch.get('results'):
            continue
        
        # Go through results (authors)
        for result in result_batch['results']:

            # if person doesnt have all info, skip person
            try:
                # extract desired information
                person_info = [result['id'], result['display_name'], result['works_api_url'], result['summary_stats']['h_index'], result['works_count'], result['last_known_institution']['country_code']]
            except (KeyError, TypeError):
                continue

            co_author_info.append(person_info)
    
    co_author_info_df = pd.DataFrame(co_author_info, columns=['id', 'display_name', 'works_api_url', 'h_index', 'works_count', 'country_code'])

    return co_author_info_df

In [None]:
# First, turn names-list from Part 1 into a pandas dataframe
names = pd.read_csv('data/names.csv')

# Get info from names (table of authors + info)
authors = get_info_from_names(names)

In [None]:
# Drop duplicates
authors = authors.drop_duplicates(subset='id')

# Save authors to csv
authors.to_csv("data/authors.csv", index=False)

The authors from week 2 are loaded.

(The workflow is in a different file, "data_preperation.ipynb", since it is not part of the task.)

In [None]:
authors = pd.read_csv('data/authors_final.csv')

First, the concept_ids from the desired concepts are retrieved (for use in the search)

In [None]:
# Get concept ids
concepts_requirements_1 = ['Sociology', 'Psychology', 'Economics', 'Political Science']
concepts_requirements_2 = ['Mathematics', 'Physics', 'Computer Science']
concept_ids_1 = get_concept_ids(concepts_requirements_1)
concept_ids_2 = get_concept_ids(concepts_requirements_2)

Then, the articles by the author list are retrieved, using the specified filters.

In [None]:
# Get articles made by names
authors = pd.read_csv('data/authors_final.csv')

articles_from_authors, abstracts_from_authors = get_articles_from_authors(authors, concept_ids_1, concept_ids_2, subset=False)
print("Number of articles:", len(articles_from_authors))

Finally, duplicates are dropped and the files are saved.

In [None]:
# Drop duplicates from articles and abstracts
papers = articles_from_authors.drop_duplicates(subset=['id'])
abstracts = abstracts_from_authors.drop_duplicates(subset=['id'])

# Save articles to csv
papers.to_csv("data/papers.csv", index=False)
abstracts.to_csv("data/abstracts.csv", index=False)

## (Note: the following was not asked for in the assignment)

In [None]:
# load papers and authors
import ast
authors = pd.read_csv('data/authors_final.csv')
papers = pd.read_csv('data/papers.csv')

# Use a set: "in" on a pandas Series tests the index, not the values
author_ids = set(authors["id"])
co_authors_ids = [x for x in papers["authors"].apply(ast.literal_eval).explode().dropna().unique() if x not in author_ids]

In [None]:
# Get info from author ids
co_author_info = get_info_from_author_ids(co_authors_ids)


In [None]:
# Drop rows in co_author_info if "nan" in country_code
co_author_info = co_author_info.dropna(subset=['country_code'])

# Concatenate the authors and co_author_info tables
co_authors_df = pd.concat([authors, co_author_info]).drop_duplicates(subset=['id'])

# Drop duplicates
co_author_info = co_author_info.drop_duplicates(subset='id')

# Save authors to csv
co_author_info.to_csv("data/co_authors_info.csv", index=False)
co_authors_df.to_csv("data/THE_AUTHOR_DATASET.csv", index=False)

In [None]:
# Get papers and abstracts from co-authors (We use the same concept_ids as earlier)
papers_from_co_authors, abstracts_from_co_authors = get_articles_from_authors(co_author_info, concept_ids_1, concept_ids_2, subset=False)

In [None]:
# Concatenate the papers and abstracts tables
papers_final = pd.concat([papers, papers_from_co_authors]).drop_duplicates(subset=['id'])
abstracts_final = pd.concat([abstracts, abstracts_from_co_authors]).drop_duplicates(subset=['id'])

# Save papers and abstracts to csv
papers_final.to_csv("data/THE_PAPER_DATASET.csv", index=False)
abstracts_final.to_csv("data/THE_ABSTRACT_DATASET.csv", index=False)

Dataset summary: How many works are listed in your IC2S2 papers dataframe?
* 9615

How many unique researchers have co-authored these works?
* 12766

Efficiency in code: Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?
* As many of the search filters as possible were moved to the filter parameter, as this resulted in fewer results to filter (loop) through.
* Additionally, the search was done in batches of names, paging through the results (200 per page) and thereby avoiding many non-full pages. This sped up the run time a lot.
* Parallel processing was not implemented due to lack of time. It would have reduced the runtime considerably, but it was mostly relevant for the functions not included in this script, since those had the longest runtimes.
* The names searched for were based on the author table. If a name search didn't return all desired columns (e.g. country_code), the name was dropped. This results in fewer names than desired but avoids a dataset with missing values.
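
The batching described above, together with the parallel processing that was not implemented, could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that produced the dataset: `fetch_batch` is a hypothetical placeholder that returns dummy values instead of calling the API, and the ids are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(ids, batch_size=25):
    """Split a list of ids into consecutive batches of at most batch_size."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


def fetch_batch(batch):
    """
    Hypothetical placeholder for one API request per batch.
    A real version would send the batch in the filter of a requests.get call.
    """
    return [f"result-for-{author_id}" for author_id in batch]


def fetch_all(ids, batch_size=25, max_workers=4):
    """Run fetch_batch over all batches in a small thread pool."""
    batches = make_batches(ids, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the order of the input batches
        results = pool.map(fetch_batch, batches)
        return [item for batch_result in results for item in batch_result]


ids = [f"A{i}" for i in range(60)]  # made-up ids for illustration
print(len(make_batches(ids)), len(fetch_all(ids)))
```

Threads suit this kind of work because the time is spent waiting for responses. The number of workers would likely need to stay small to respect the rate limits of the API.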

Filtering Criteria and Dataset Relevance: Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? 
* Filtering on field concepts ensures that the articles fall within our desired scientific fields.
* Requiring more than 10 citations ensures well-acknowledged articles.
* Requiring at least 5 works ensures that relevant and active authors are selected. Limiting to authors with at most 5000 works attempts to keep the authors relevant to our desired field: a very large number of works increases the likelihood that the author works across a very wide range of fields, which could make the author irrelevant.
* Requiring fewer than 10 authors per work also attempts to avoid very broad and non-specific (irrelevant) articles.
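
How these criteria combine is visible in the filter string sent to the API. In the OpenAlex filter syntax, a comma between filters acts as AND and a pipe between values acts as OR, so a work must match at least one concept from each of the two concept groups. The sketch below rebuilds the string from `get_articles_from_authors` as a standalone function; the function name `build_works_filter` and the placeholder ids are assumptions for illustration.

```python
def build_works_filter(author_ids, concept_ids_1, concept_ids_2):
    """Combine the criteria into one filter string (comma = AND, pipe = OR)."""
    filters = [
        "cited_by_count:>10",                             # more than 10 citations
        "authors_count:<10",                              # fewer than 10 authors
        "authorships.author.id:" + "|".join(author_ids),  # any author in the batch
        "concepts.id:" + "|".join(concept_ids_1),         # any concept in group 1
        "concepts.id:" + "|".join(concept_ids_2),         # any concept in group 2
    ]
    return ",".join(filters)


# Placeholder ids for illustration only, not real OpenAlex ids
print(build_works_filter(["A1", "A2"], ["C1", "C2"], ["C3"]))
```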

Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?
* On the one hand, the thresholds could exclude too many authors, simply because they don't satisfy them.
* On the other hand, we believe that a greater number of irrelevant authors are included, e.g. because of the concept filters. For example, a physics article with political relevance would be included because it matches a concept in both groups. Such an article is irrelevant to our search, and there are many other examples.

# Part 4: The Network of Computational Social Scientists

In [None]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import tqdm
import ast
import itertools
from statistics import median, mode
import json
import numpy as np

# Load papers csv-file
papers = pd.read_csv('data/papers.csv')
authors_final = pd.read_csv('data/authors_final.csv')

# The author lists are stored as strings in the CSV; parse them back into Python lists
papers['authors'] = papers['authors'].apply(ast.literal_eval)

# From the dataframe pull the authors
co_authors = papers['authors']

## Part 1: Network Construction:

#### 1.  Weighted Edgelist Creation: 
Start with your dataframe of papers. Construct a weighted edgelist where each list element is a tuple containing three elements: the author ids of two collaborating authors and the total number of papers they've co-authored. Ensure each author pair is listed only once.


#### 2.  Graph Construction: 
- Use NetworkX to create an undirected Graph.
- Employ the add_weighted_edges_from function to populate the graph with the weighted edgelist from step 1, creating a weighted, undirected graph.

In [None]:
GRR = nx.Graph()

# Dictionary to store the count of collaborations between author pairs
collaboration_count = {}

for author_list in co_authors:
    # Sort so each pair gets a single key; set() avoids self-pairs if an id is repeated
    author_list = sorted(set(author_list))
    
    for author_pair in itertools.combinations(author_list,2):
        if author_pair in collaboration_count:
            collaboration_count[author_pair] += 1
        else:
            collaboration_count[author_pair] = 1
            
edge_list = []
for (author1, author2), count in collaboration_count.items():
    edge_list.append((author1, author2, count))
    
GRR.add_weighted_edges_from(edge_list)
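
To check that each author pair is listed only once, the same counting logic can be run on a small example. This is a minimal sketch with made-up author ids; `weighted_edgelist` is a hypothetical helper that mirrors the loop above using a `Counter`.

```python
import itertools
from collections import Counter


def weighted_edgelist(author_lists):
    """Count co-authored papers for every unordered author pair."""
    counts = Counter()
    for authors in author_lists:
        # Sorting makes (a, b) and (b, a) the same key; set() drops repeated ids
        for pair in itertools.combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return [(a, b, n) for (a, b), n in counts.items()]


# Made-up author ids: A and B share two papers, C joins one of them
toy_papers = [["B", "A"], ["A", "B", "C"], ["D"]]
edges = weighted_edgelist(toy_papers)
print(sorted(edges))
```

Note that the single-author paper by D produces no pair. An author who only appears on single-author papers therefore never enters a graph built from the edgelist alone, which matters when counting isolated nodes.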

#### 3. Node Attributes:

* For each node, add attributes for the author's display name, country, citation count, and the year of their first publication in Computational Social Science. The display name and country can be retrieved from your authors dataset. The year of their first publication and the citation count can be retrieved from the papers dataset.
* Save the network as a JSON file.

In [None]:
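# Load the final datasets (assumed to be the files saved in Part 3)
THE_PAPER_DATASET = pd.read_csv("data/THE_PAPER_DATASET.csv")
THE_PAPER_DATASET["authors"] = THE_PAPER_DATASET["authors"].apply(ast.literal_eval)
THE_AUTHOR_DATASET = pd.read_csv("data/THE_AUTHOR_DATASET.csv")

# Add attributes for each node
for node in GRR.nodes():
    
    # Find the papers that the current author has contributed to
    author_papers = THE_PAPER_DATASET[THE_PAPER_DATASET['authors'].apply(lambda x: node in x)]

    # Check if the author has any publications
    if not author_papers.empty:
        # Cast to built-in int so that json.dump can serialize the values
        first_publication_year = int(author_papers['publication_year'].min())
        total_citation_count = int(author_papers['cited_by_count'].sum())
    else:
        first_publication_year = 0
        total_citation_count = 0
    
    # Look up the author's row and store scalar values (None if the author is missing)
    author_row = THE_AUTHOR_DATASET[THE_AUTHOR_DATASET['id'] == node]
    display_name = author_row["display_name"].iloc[0] if not author_row.empty else None
    country = author_row["country_code"].iloc[0] if not author_row.empty else None
   
    # Add node attributes    
    GRR.nodes[node]["display_name"] = display_name
    GRR.nodes[node]["country"] = country
    GRR.nodes[node]["first_publication_year"] = first_publication_year
    GRR.nodes[node]["citation_count"] = total_citation_count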

In [None]:
# Convert the network to a JSON file
graph_data = nx.json_graph.node_link_data(GRR)

# Save the JSON data to a file
with open("data/graph_data.json", "w") as json_file:
    json.dump(graph_data, json_file)

## Part 2: Preliminary Network Analysis 
#### 1. Network Metrics
- What is the total number of nodes (authors) and links (collaborations) in the network?
- Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.
- Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?
- If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset.
- How many isolated nodes are there in your network? An isolated node is defined as a node with no connections to any other node in the network.
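
For an undirected network with N nodes and L links, the maximum possible number of links is N(N-1)/2, so the density is 2L / (N(N-1)). The sketch below computes this by hand on a toy graph with made-up nodes; the function name `density` is chosen for illustration, and `nx.density` uses the same formula for undirected graphs.

```python
def density(num_nodes, num_links):
    """Density of an undirected simple graph: actual links / possible links."""
    if num_nodes < 2:
        return 0.0
    max_links = num_nodes * (num_nodes - 1) / 2
    return num_links / max_links


# Toy graph with made-up nodes: 4 nodes, 3 links
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
toy_nodes = {node for edge in toy_edges for node in edge}
print(density(len(toy_nodes), len(toy_edges)))  # 3 of 6 possible links
```

Equivalently, the density equals the average degree divided by N-1, so a network with thousands of nodes and a modest average degree has a density close to zero.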

In [None]:
# Total number of nodes and links in the network
print(GRR)

# Density: ratio of actual links to the maximum possible number of links
graph_density = nx.density(GRR)
print("Density:", graph_density)

# Check if the graph is connected
is_fully_connected = nx.is_connected(GRR)

if is_fully_connected:
    print("The graph is fully connected.")
else:
    print("The graph is disconnected.")

# Find connected components
connected_components = list(nx.connected_components(GRR))
print("Number of connected components:", len(connected_components))

# Find isolated nodes
isolated_nodes = list(nx.isolates(GRR))
print("Number of isolated nodes:", len(isolated_nodes))

* Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why? (answer in max 150 words)

#### 2. Degree Analysis:
- Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree).
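
The difference between degree and strength can be illustrated on a small weighted edgelist. This is a minimal sketch with made-up author ids, assuming each pair appears only once in the edgelist; `degree_and_strength` is a hypothetical helper, not a NetworkX function.

```python
from collections import defaultdict
from statistics import mean, median


def degree_and_strength(weighted_edges):
    """Return (degree, strength) dicts from (u, v, weight) tuples."""
    degree = defaultdict(int)
    strength = defaultdict(int)
    for u, v, w in weighted_edges:
        for node in (u, v):
            degree[node] += 1    # one more distinct collaborator
            strength[node] += w  # add the weight of this edge
    return dict(degree), dict(strength)


# Made-up edgelist: A and B wrote 3 papers together, B and C wrote 1
toy_edges = [("A", "B", 3), ("B", "C", 1)]
degree, strength = degree_and_strength(toy_edges)
print(degree, strength)
print(mean(degree.values()), median(strength.values()))
```

Here A has one collaborator but a strength of 3, so a node with low degree can still have high strength when it collaborates repeatedly with the same person.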

In [None]:
# Compute node degrees
degrees = dict(GRR.degree())
weighted_degrees = dict(GRR.degree(weight='weight'))

# Compute average, median, mode, minimum, and maximum degree
average_degree = np.mean(list(degrees.values()))
median_degree = median(list(degrees.values()))
try:
    mode_degree = mode(list(degrees.values())) #Degree value that occurs with highest frequency among the nodes
except:
    mode_degree = "No unique mode"
min_degree = min(degrees.values())
max_degree = max(degrees.values())

# Same calculations but for STRENGTH (WEIGHTED DEGREE)
# Compute average, median, mode, minimum, and maximum weighted degree
average_weighted_degree = np.mean(list(weighted_degrees.values()))
median_weighted_degree = median(list(weighted_degrees.values()))
try:
    mode_weighted_degree = mode(list(weighted_degrees.values()))
except:
    mode_weighted_degree = "No unique mode"
min_weighted_degree = min(weighted_degrees.values())
max_weighted_degree = max(weighted_degrees.values())

# Print the results
print("Degree Analysis:")
print(f"Average Degree: {average_degree}")
print(f"Median Degree: {median_degree}")
print(f"Mode Degree: {mode_degree}")
print(f"Minimum Degree: {min_degree}")
print(f"Maximum Degree: {max_degree}")

print("\nWeighted Degree Analysis:")
print(f"Average Weighted Degree: {average_weighted_degree}")
print(f"Median Weighted Degree: {median_weighted_degree}")
print(f"Mode Weighted Degree: {mode_weighted_degree}")
print(f"Minimum Weighted Degree: {min_weighted_degree}")
print(f"Maximum Weighted Degree: {max_weighted_degree}")

- What do these metrics tell us about the network? (answer in max 150 words)

#### 3. Top Authors
- Identify the top 5 authors by degree. What role do these nodes play in the network?

In [None]:
# Identify the top 5 authors by degree
top_authors = sorted(degrees.items(), key=lambda x: x[1], reverse=True)[:5]

# Print the top authors and their degrees
print("Top 5 Authors by Degree:")
for author, degree in top_authors:
    print(f"Author {author}: Degree {degree}")

- Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? (answer in max 150 words)