Link to git repository: https://github.com/ongiboy/Cognitive-Social-Science

Group members' contributions:
* Christian Ong (s204109)   : 33%
* Daniel Ries (s21)         : 33%
* Kavus Latifi (s21)        : 33%

# Part 1: Web-scraping

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Setup the URL
LINK = "https://ic2s2-2023.org/program"
r = requests.get(LINK)
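# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")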

In [None]:
table = soup.find("table",{"class":"tutorials"})
table_rows = table.find_all("tr")

# For each row find the elements that include "Keynote" and save to list
keynotes = []
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text for i in td]
    if row:
        if "Keynote" in row[1]:
            keynotes.append(row[1])

print(keynotes)

# Remove "Keynote - " from each element in list
keynotes = [re.sub("Keynote - ", "", i) for i in keynotes]

print(keynotes)
print("Number of keynotes: ", len(list(set(keynotes))))

In [None]:
chairs = []
# Find the sections with names
sections = soup.find_all("h2")
for section in sections:

    # Find names within section
    bullets = section.find_all("i")
    for nameline in bullets:

        # Keep only the name after the "Chair: " prefix (e.g. "Chair: Taha Yasseri")
        if "Chair" in nameline.text:
            chairs.append(nameline.text[7:])
        
# Print all chairs and the number of unique chairs
print(chairs)
print("Number of chairs: ", len(list(set(chairs))))

In [None]:
names = []
# Find the sections with names
sections = soup.find_all("ul",{"class":"nav_list"})
for section in sections:

    # Find names within section
    bullets = section.find_all("i")
    for nameline in bullets:

        # Split into each name and add to list
        names.extend(nameline.text.split(", "))
        
print(names)
print("Number of names: ", len(list(set(names))))

In [None]:
# Combine all lists into one
all_names = keynotes + chairs + names

# Remove duplicates
all_names = list(set(all_names))

print("Number of names: ", len(all_names))

In [None]:
# Save names in csv
df = pd.DataFrame(all_names, columns=["name"])
df.to_csv("names.csv", index=False)

How many unique researchers do you get?
* 1491

Explain the process you followed to web-scrape the page. 
Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list?
* First, the page was inspected in detail to locate the sections that contain names:
    * An overview table containing keynote speakers (`<td>`)
    * Plenary talks containing chairs (`<h2>`) and speakers (`<i>`)
    * Parallel talks containing chairs (`<h2>`) and speakers (`<i>`)
    * Posters (`<i>`)
* Each of these sections was scraped for names, and the results were combined into one list.
* Finally, the list was converted to a set to remove duplicates.
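
The combine-and-deduplicate step described above can be sketched as follows. This is a minimal illustration, not the code used for the reported count: the helper `dedupe_names` and the whitespace clean-up it applies are assumptions added here, and the example names are made up.

```python
def dedupe_names(*name_lists):
    """Merge several lists of scraped names and drop duplicates.

    Assumption (not in the original pipeline): surrounding and repeated
    whitespace is collapsed first, so "Ada  Lovelace " and "Ada Lovelace"
    count as the same person.
    """
    unique = set()
    for names in name_lists:
        for name in names:
            cleaned = " ".join(name.split())  # collapse whitespace
            if cleaned:  # skip empty strings
                unique.add(cleaned)
    return sorted(unique)


# Made-up example input, one list per scraped section
keynotes = ["Ada Lovelace"]
chairs = ["Alan Turing", "Ada  Lovelace "]
speakers = ["Grace Hopper", "Alan Turing", ""]

all_names = dedupe_names(keynotes, chairs, speakers)
print(len(all_names), all_names)
```

Comparing the length of the list before and after such a clean-up is one simple way to spot near-duplicates that a plain `set` would miss.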

# Part 2: Ready Made vs Custom Made Data

What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)?
* Centola's sample and experimental setup were highly controlled, so the causal effect of clustered versus random networks on the spread of behavior could be studied while largely avoiding confounding. However, the transparency of the experiment raises ethical questions, since the people registering on the health website were not informed that they were participating in an experiment. The controlled setup might also create an artificial environment that does not accurately reflect clustered networks in the real world.

* The observational data in Nicolaides's study, which combined a fitness-tracking app with a social network app, provided large samples under real-world conditions and reflected users' actual behavior in their natural environments. However, because control over data collection was limited, biases in the app's user base can affect the generalizability of the findings. External variables such as motivation or homophily may also explain the observed correlation.
  

How do you think these differences can influence the interpretation of the results in each study?
* The findings of Centola's experiment are likely more internally valid, because the controlled setup helps isolate the variables of interest and reduce the influence of external factors. This gives a clearer understanding of the causal relationship between network structure and the spread of behavior. However, because of the artificial environment and the lack of transparency, the results of Centola's study may have limited external validity.
* The results of Nicolaides's study, on the other hand, are more applicable to real-world scenarios, since they reflect the complexity of actual social networks. However, the lack of control over data collection leaves room for biases and confounding variables, so the findings must be interpreted with caution.


# Part 3: Gathering Research Articles using the OpenAlex API

In [None]:
import requests
import pandas as pd
import tqdm
import unicodedata

def normalize_name(name):
    normalized_name = ''.join(c for c in unicodedata.normalize('NFD', name) if unicodedata.category(c) != 'Mn' and (c.isalnum() or c.isspace()))
    return normalized_name.lower()


def get_info_from_names(df_names):

    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/authors'

    data = []
    for i in tqdm.tqdm(range(len(df_names))):
        try:
            response = requests.get(BASE_URL + RESOURCE, params={"search": df_names["name"][i]})
            all_results = response.json()["results"]

            # Take the first search result whose normalized display name
            # exactly matches the normalized name we searched for
            normalized_df_name = normalize_name(df_names["name"][i])
            results = next(
                (res for res in all_results
                 if normalize_name(res["display_name"]) == normalized_df_name),
                None,
            )

            # No exact match among the results: skip this name
            if results is None:
                continue

            data += [[results["id"], results["display_name"], results["works_api_url"], results["summary_stats"]["h_index"], results["works_count"], results["last_known_institution"]["country_code"]]]

        except (IndexError, KeyError, ValueError, TypeError) as e:
            print(f"Skipping data point at iteration {i} due to error: {e}")
            continue  # Skip the rest of the loop and proceed to the next iteration
        except requests.exceptions.RequestException as e:
            print(f"Request error at iteration {i}: {e}")
            continue

    new_df = pd.DataFrame(data, columns=["id","display_name","works_api_url","h_index","works_count", "country_code"])
    return new_df


def get_info_from_names_old(names):
    """
    names --> author table (id, name, works_api_url, h_index, works_count, country_code)
    """

    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/authors'
    people_info = []
    for i,name in tqdm.tqdm(enumerate(names['name'])):
  
        # search for person
        person = requests.get(BASE_URL + RESOURCE, params={'search': name}).json()

        # No results
        if len(person['results']) == 0:
            continue

        # take first result
        person = person['results'][0]

        # If the person lacks any of the required fields, skip them
        try:
            # extract desired information
            person_info = [person['id'], person['display_name'], person['works_api_url'], person['summary_stats']['h_index'], person['works_count'], person['last_known_institution']['country_code']]
        
        except (KeyError, TypeError):
            # A key is missing, or a nested field (e.g. last_known_institution) is None
            continue

        people_info.append(person_info)

    people_pd = pd.DataFrame(people_info, columns=['id', 'name', 'works_api_url', 'h_index', 'works_count', 'country_code'])
    return people_pd


def get_concept_ids(concept_requirements):
    """
    concepts --> concept_ids
    """

    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/concepts'

    concept_ids = []
    for concept in concept_requirements:
        result = requests.get(BASE_URL + RESOURCE, params={'search': concept, 'filter': 'level:0'}).json()
        concept_ids.append(result['results'][0]['id'])

    # Keep only the short ID (last URL segment); avoid shadowing the built-in `id`
    concept_ids = [concept_id.split("/")[-1] for concept_id in concept_ids]
    return concept_ids


def get_articles_from_authors(names, concept_ids_requirements_1, concept_ids_requirements_2, subset=False):
    """
    Extracts articles from authors in the names table.
        The articles are filtered by the criteria from the assignment description.
    """
    
    BASE_URL = 'https://api.openalex.org'
    RESOURCE = '/works'

    # Filter out authors not having 5-5000 works
    names = names[(names['works_count']>=5) & (names['works_count']<=5000)]

    table1 = []
    table2 = []

    # Search for articles in batches of 25 authors
    name_batches = [list(names['id'][i:i+25]) for i in range(0, len(names), 25)]

    for num_name_batch, name_batch in enumerate(name_batches):

        # track progress
        print("batch number:", num_name_batch)

        # short version for testing
        if subset and num_name_batch>0:
            break

        # Scroll through the results
        cursor = '*'
        while True:
            filters = ['cited_by_count:>10', 
                        'authors_count:<10',
                        'authorships.author.id:'+'|'.join(name_batch),
                        'concepts.id:'+'|'.join(concept_ids_requirements_1),
                        'concepts.id:'+'|'.join(concept_ids_requirements_2)
                        ]
            parameters = {'per-page': 200,
                            'filter': ','.join(filters),
                            'cursor': cursor
                            }
            result = requests.get(BASE_URL + RESOURCE, params=parameters).json()

            # If last page is reached (which is empty), break 
            cursor = result['meta']['next_cursor'] # next page for next search
            if len(result['results'])==0 or cursor is None:
                break

            # Go through all articles and extract information
            for n_article,article in enumerate(result['results']):
                try:
                    tab1 = [article['id'], article['publication_year'], article['cited_by_count'], [author['author']['id'] for author in article['authorships']]]
                    tab2 = [article['id'], article['title'], article['abstract_inverted_index']]
                    table1.append(tab1)
                    table2.append(tab2)

                # A key is missing or a nested field is None: skip this article
                except (KeyError, TypeError):
                    print("skipped name batch:", num_name_batch, "article:", n_article)
                    continue
    
    table1 = pd.DataFrame(table1, columns=['id', 'publication_year', 'cited_by_count', 'authors'])
    table2 = pd.DataFrame(table2, columns=['id', 'title', 'abstract_inverted_index'])

    return table1, table2


def get_info_from_author_ids(authors):
    """
    Gets info table from author ids.
    """

    URL = 'https://api.openalex.org/authors'

    co_author_info = []
    for i,author_id in enumerate(authors):
        # track progress
        if i!=0 and i%50==0:
            print(i)
            
        # search for person
        author_id = author_id.split('/')[-1]
        person = requests.get(URL + '/' + author_id).json()

        # No results
        if len(person) == 0:
            continue

        # If the person lacks any of the required fields, skip them
        try:
            # extract desired information
            person_info = [author_id, person['display_name'], person['works_api_url'], person['summary_stats']['h_index'], person['works_count'], person['last_known_institution']['country_code']]
        
        except (KeyError, TypeError):
            # A key is missing, or a nested field (e.g. last_known_institution) is None
            continue

        co_author_info.append(person_info)

    return co_author_info

In [None]:
# First, turn names-list from Part 1 into a pandas dataframe
names = pd.read_csv('data/names.csv')

# Get info from names (table of authors + info)
authors = get_info_from_names(names)

In [None]:
# Save authors to csv
authors.to_csv("authors.csv", index=False)

In [None]:
# Get concept ids
concepts_requirements_1 = ['Sociology', 'Psychology', 'Economics', 'Political Science']
concepts_requirements_2 = ['Mathematics', 'Physics', 'Computer Science']
concept_ids_1 = get_concept_ids(concepts_requirements_1)
concept_ids_2 = get_concept_ids(concepts_requirements_2)

In [None]:
# Get articles made by names
authors = pd.read_csv('data/authors_final.csv')

articles_from_authors, abstracts_from_authors = get_articles_from_authors(authors, concept_ids_1, concept_ids_2, subset=False)
print("Number of articles:", len(articles_from_authors))

In [None]:
# Save articles to csv
articles_from_authors.to_csv("papers.csv", index=False)
abstracts_from_authors.to_csv("abstracts.csv", index=False)

In [None]:
# Get unique authors from articles
authors_unique = articles_from_authors['authors'].explode().dropna().unique()
print("Number of unique authors:", len(authors_unique))

Dataset summary: How many works are listed in your IC2S2 papers dataframe?
* 9609

How many unique researchers have co-authored these works?
* 12766

Efficiency in code: Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time?
* As many of the search criteria as possible were moved into the API's filter parameter, so that fewer results had to be downloaded and filtered locally.
* Additionally, authors were queried in batches of 25 and the results were paged through with 200 works per page, which avoids many requests that each return only a few results.
* A major weakness is the concept filter, which is hard-coded instead of using the 'concepts.id' parameter, due to lack of time. Implementing this would significantly improve efficiency and would only require a list of the OpenAlex concept IDs. Multiprocessing was also not implemented, despite the suggestion, again due to lack of time.
* The name search was based on the author table. If the search result for a name lacked any of the desired columns (e.g. country_code), the name was dropped. This yields fewer names than desired but avoids a dataset with missing values.
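
The batching strategy, and the parallelism that was not implemented, can be sketched as follows. This is a sketch under assumptions, not the pipeline used above: `chunk`, `build_author_filter`, and `fetch_batch` are hypothetical helpers, `fetch_batch` is a stand-in that performs no network request, and a thread pool is used in place of multiprocessing because the workload is dominated by waiting on HTTP responses.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_author_filter(author_ids):
    """Combine several author IDs into one OR-filter string."""
    return "authorships.author.id:" + "|".join(author_ids)


def fetch_batch(author_ids):
    """Hypothetical stand-in for the paged /works request of one batch.

    A real version would send the filter string with requests.get(...) and
    follow the cursor; here it only returns the filter it would send.
    """
    return build_author_filter(author_ids)


# Made-up author IDs for illustration
author_ids = [f"A{n}" for n in range(1, 8)]
batches = chunk(author_ids, 3)

# executor.map returns results in the same order as the input batches
with ThreadPoolExecutor(max_workers=4) as executor:
    filters = list(executor.map(fetch_batch, batches))

for f in filters:
    print(f)
```

With real requests, the number of workers would need to stay within the API's rate limits.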

Filtering Criteria and Dataset Relevance: Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? 
* Filtering on fields (concepts) ensures that the articles fall within our desired scientific field.
* Requiring more than 10 citations ensures that the articles are well acknowledged.
* Requiring at least 5 works per author ensures that relevant and active authors are selected. Capping authors at 5000 works is an attempt to keep only authors relevant to our field: a very large number of works makes it more likely that the author works across a very wide range of fields, and may therefore be irrelevant.
* Limiting works to fewer than 10 authors is likewise an attempt to avoid very broad and non-specific (irrelevant) articles.
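
As a rough illustration, the thresholds above can also be written as local checks on already-downloaded records. The functions `keep_author` and `keep_work` and the sample records are hypothetical; in the code above, the work-level criteria are applied server-side through the API's filter parameter instead.

```python
def keep_author(author):
    """Author-level threshold: between 5 and 5000 works (inclusive)."""
    return 5 <= author["works_count"] <= 5000


def keep_work(work):
    """Work-level thresholds: more than 10 citations, fewer than 10 authors."""
    return work["cited_by_count"] > 10 and len(work["authors"]) < 10


# Made-up records for illustration
authors = [
    {"id": "A1", "works_count": 3},
    {"id": "A2", "works_count": 120},
    {"id": "A3", "works_count": 9000},
]
works = [
    {"id": "W1", "cited_by_count": 50, "authors": ["A2", "A4"]},
    {"id": "W2", "cited_by_count": 10, "authors": ["A2"]},
    {"id": "W3", "cited_by_count": 80, "authors": [f"A{n}" for n in range(10, 25)]},
]

kept_authors = [a["id"] for a in authors if keep_author(a)]
kept_works = [w["id"] for w in works if keep_work(w)]
print(kept_authors, kept_works)
```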

Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices?
* On the one hand, the thresholds could exclude too many authors who simply fail to meet them.
* On the other hand, we believe the larger problem is the inclusion of irrelevant authors, e.g. through the concept filters. For example, an arbitrary physics article with political relevance would be included by the concept filters even though it is irrelevant to our search, and there are many similar cases.

# Part 4: The Network of Computational Social Scientists

In [None]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import tqdm
import ast
import itertools

# Load papers csv-file
papers = pd.read_csv('data/papers.csv')
authors_final = pd.read_csv('data/authors_final.csv')

# The CSV stores each author list as a string; parse it back into a Python list
papers['authors'] = papers['authors'].apply(ast.literal_eval)

# From the dataframe pull the authors
co_authors = papers['authors']

In [None]:
GRR = nx.Graph()

# Dictionary to store the count of collaborations between author pairs
collaboration_count = {}

for author_list in co_authors:
    author_list = sorted(author_list)
    
    for author_pair in itertools.combinations(author_list,2):
        if author_pair in collaboration_count:
            collaboration_count[author_pair] += 1
        else:
            collaboration_count[author_pair] = 1
            
edge_list = []
for (author1, author2), count in collaboration_count.items():
    edge_list.append((author1, author2, count))
    
GRR.add_weighted_edges_from(edge_list)
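
To make the edge weights concrete, the counting step above can be checked on a toy input. This sketch uses `collections.Counter` as an equivalent alternative to the explicit dictionary; the author IDs are made up.

```python
from collections import Counter
from itertools import combinations

# Made-up author lists, one per paper
co_authors = [
    ["A3", "A1", "A2"],
    ["A2", "A1"],
    ["A4"],  # single-author paper: contributes no pairs
]

collaboration_count = Counter()
for author_list in co_authors:
    # Sorting makes (A1, A2) and (A2, A1) map to the same key
    collaboration_count.update(combinations(sorted(author_list), 2))

# (author1, author2, weight) triples, in the shape add_weighted_edges_from expects
edge_list = [(a, b, w) for (a, b), w in collaboration_count.items()]
print(sorted(edge_list))
```

Note that a single-author paper yields no pairs, so with this construction its author only appears in the graph if they co-authored another paper.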