### Link to git repository:
https://github.com/mbrochlips/CSS_project1

### Contributions:
| Name                | Part 1 | Part 2 | Part 3 | Part 4 |
|---------------------|--------|--------|--------|--------|
|  Mikkel (s234860)   | 40%    | 30%    | 30%    | 35%    |
|  Kantinka (s23....) | 30%    | 40%    | 30%    | 35%    |
|  Marcus (s2348..)   | 30%    | 30%    | 40%    | 30%    |

## Part 3: Gathering Research Articles using the OpenAlex API

> **Exercise : Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2024 (NOT 2023) conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**
>
>
> **Steps:**
>
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
>
>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest to adjust this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
>
> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>
>
> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
>
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts).
>
>


In [19]:
# Library imports
import pandas as pd
import requests
from time import sleep
from json import JSONDecodeError
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

In [20]:
# Read authors CSV file
csv_path = 'C:/Users/marcu/Documents/Computational social science/Week 2/Authors.csv'
authors = pd.read_csv(csv_path)

authors

Unnamed: 0,id,display_name,works_api_url,works_count,country_code,h_index
0,https://openalex.org/A5026829784,Sam Corbett‐Davies,https://api.openalex.org/works?filter=author.i...,34,IL,14
1,https://openalex.org/A5066089123,Byungkyu Lee,https://api.openalex.org/works?filter=author.i...,186,US,13
2,https://openalex.org/A5100435139,Jingwen Zhang,https://api.openalex.org/works?filter=author.i...,399,CN,47
3,https://openalex.org/A5002073039,Sou Hyun Jang,https://api.openalex.org/works?filter=author.i...,69,KR,11
4,https://openalex.org/A5091610280,Carl Colglazier,https://api.openalex.org/works?filter=author.i...,4,US,1
...,...,...,...,...,...,...
1145,https://openalex.org/A5043405308,Yen-Huei Chen,https://api.openalex.org/works?filter=author.i...,29,TW,13
1146,https://openalex.org/A5021090586,Michael Lees,https://api.openalex.org/works?filter=author.i...,226,NL,28
1147,https://openalex.org/A5102918288,Jiayu Zheng,https://api.openalex.org/works?filter=author.i...,35,CN,13
1148,https://openalex.org/A5065295188,Yang Tian,https://api.openalex.org/works?filter=author.i...,450,CN,63


These are all the authors from IC2S2 2024.
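
The exercise restricts the author set to those with between 5 and 5,000 works, while the preview above still contains authors with a `works_count` of 4. The cells shown here do not apply that restriction, so it is presumably handled elsewhere or still missing. A minimal sketch of such a filter, assuming the bounds are inclusive and using a small made-up frame with the same `works_count` column (`toy_authors` and `eligible` are illustrative names, not from the notebook):

```python
import pandas as pd

# Toy stand-in for the authors table; only works_count matters here
toy_authors = pd.DataFrame({
    "id": ["A1", "A2", "A3", "A4"],
    "works_count": [34, 4, 5000, 5001],
})

# Series.between is inclusive on both ends by default
eligible = toy_authors[toy_authors["works_count"].between(5, 5000)].reset_index(drop=True)
print(eligible)
```

Applied to the real table, the same expression on `authors` would drop rows such as the ones with 4 or 2 works before batching.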

We then batch the authors into groups of 25 to make the code run faster:

In [21]:
# Split authors into batches. Each batch contains 25 authors
author_batches = np.array_split(authors, len(authors) // 25)
author_batches[0]


Unnamed: 0,id,display_name,works_api_url,works_count,country_code,h_index
0,https://openalex.org/A5026829784,Sam Corbett‐Davies,https://api.openalex.org/works?filter=author.i...,34,IL,14
1,https://openalex.org/A5066089123,Byungkyu Lee,https://api.openalex.org/works?filter=author.i...,186,US,13
2,https://openalex.org/A5100435139,Jingwen Zhang,https://api.openalex.org/works?filter=author.i...,399,CN,47
3,https://openalex.org/A5002073039,Sou Hyun Jang,https://api.openalex.org/works?filter=author.i...,69,KR,11
4,https://openalex.org/A5091610280,Carl Colglazier,https://api.openalex.org/works?filter=author.i...,4,US,1
5,https://openalex.org/A5071422618,Markus Strohmaier,https://api.openalex.org/works?filter=author.i...,348,DE,38
6,https://openalex.org/A5044191812,Vinícius Andrade Brei,https://api.openalex.org/works?filter=author.i...,63,BR,14
7,https://openalex.org/A5019023655,Alyssa Smith,https://api.openalex.org/works?filter=author.i...,4,MX,2
8,https://openalex.org/A5037451955,Alexander Furnas,https://api.openalex.org/works?filter=author.i...,51,US,6
9,https://openalex.org/A5010577211,Dehao Zhang,https://api.openalex.org/works?filter=author.i...,2,CN,1
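
Note that `np.array_split(authors, n)` produces `n` batches of near-equal size rather than batches of exactly 25, so a batch can hold more than 25 authors when the author count is not a multiple of 25. If a hard upper bound per request matters, fixed-size chunking is one alternative. The sketch below is plain Python over a list of made-up IDs; `chunked` is a hypothetical helper, not part of the notebook:

```python
def chunked(items, size):
    """Return consecutive slices of `items`, each holding at most `size` elements."""
    return [items[start:start + size] for start in range(0, len(items), size)]

# 60 made-up author IDs split into batches of at most 25
toy_ids = [f"https://openalex.org/A{n}" for n in range(60)]
batches = chunked(toy_ids, 25)
print([len(batch) for batch in batches])  # [25, 25, 10]
```

With a DataFrame, the same idea can be written as `authors.iloc[start:start + 25]` inside the loop.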


In [22]:
# Define DataFrame column names
paper_col = ['id', 'publication_year', 'cited_by_count', 'author_ids']
abstract_col = ['id', 'title', 'abstract_inverted_index']

# Global API source and filter Concepts dictionaries
source = 'https://api.openalex.org/works'
concepts_1 = {
    'Computer science': 'https://openalex.org/C41008148',
    'Physics': 'https://openalex.org/C121332964',
    'Mathematics': 'https://openalex.org/C33923547',
}
concepts_2 = {
    'Psychology': 'https://openalex.org/C15744967',
    'Sociology': 'https://openalex.org/C144024400',
    'Economics': 'https://openalex.org/C162324750',
    'Political science': 'https://openalex.org/C17744445',
}
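
In OpenAlex filter syntax, a comma between filters means AND and a pipe between values means OR. Repeating `concepts.id` once per dictionary therefore expresses "at least one quantitative concept AND at least one social-science concept". A minimal sketch of assembling such a string; `build_filter` is a hypothetical helper, and the shortened IDs are used only for readability:

```python
def build_filter(author_ids, quantitative, social):
    """Combine the exercise's constraints into one OpenAlex filter string."""
    parts = [
        "author.id:" + "|".join(author_ids),      # any of these authors
        "cited_by_count:>10",                     # more than 10 citations
        "authors_count:<10",                      # fewer than 10 authors
        "concepts.id:" + "|".join(quantitative),  # at least one quantitative concept
        "concepts.id:" + "|".join(social),        # AND at least one social-science concept
    ]
    return ",".join(parts)

example = build_filter(["A1", "A2"], ["C41008148", "C33923547"], ["C15744967"])
print(example)
```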

We then define a function that queries the API for a single batch of authors, so that several batches can be processed in parallel.

In [53]:
def process_batch(author_batch, batch_index):
    """
    Process a single batch of authors by iterating over API pages,
    applying retry logic, and collecting papers and abstracts.
    :param author_batch: DataFrame containing one batch of authors (roughly 25 rows).
    :param batch_index: The index of the batch to be processed.
    :return: Two lists of papers and abstracts data.
    """

    local_session = requests.Session()  # create a local session for the thread
    local_papers_data = []
    local_abstracts_data = []
    page = 1

    while True:
        tries = 0
        results = None
        # Retry loop for API request
        while tries < 10:
            try:
                # Joining author ids to bulk the request
                author_ids_str = '|'.join(author_batch['id'])
                url = (
                    f"{source}?filter=author.id:{author_ids_str},"
                    f"cited_by_count:>10,authors_count:<10,"
                    f"concepts.id:{'|'.join(concepts_1.values())},"
                    f"concepts.id:{'|'.join(concepts_2.values())}"
                )
                response = local_session.get(url, params={'per_page': 200, 'page': page})
                results = response.json()['results']
                break  # exit retry loop on success
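            except (JSONDecodeError, requests.exceptions.RequestException):
                # Transient failure: wait briefly, then retry
                tries += 1
                sleep(0.1)

        # If every retry failed, raise instead of silently dropping the rest of the batch
        if results is None:
            raise RuntimeError(f"Batch {batch_index}: page {page} failed after {tries} tries")

        # An empty page means all results have been fetched, so exit the paging loop
        if not results:
            break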

        # Process each paper in the current page
        for paper in results:
            paper_id = paper['id']
            publication_year = paper['publication_year']
            cited_by_count = paper['cited_by_count']
            # Extract author IDs from the authorships list
            author_ids = ";".join(sub_author['author']['id'] for sub_author in paper['authorships'])
            local_papers_data.append([paper_id, publication_year, cited_by_count, author_ids])

            title = paper['title']
            abstract_index = paper['abstract_inverted_index']
            local_abstracts_data.append([paper_id, title, abstract_index])

        page += 1

    print(f"Completed batch {batch_index}")
    return local_papers_data, local_abstracts_data
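
`process_batch` walks through the results with the `page` parameter. The exercise text points to cursor paging instead, which OpenAlex documents as the way to go beyond the first 10,000 results of a query. The following is a sketch of that approach, not the method used in this notebook: `iter_works`, `fetch_with_requests`, and `fake_fetch` are hypothetical names, and the demo runs against made-up pages so that it needs no network access.

```python
def iter_works(fetch_json, filter_str, per_page=200):
    """Yield works one by one using cursor paging.

    `fetch_json` is any callable that takes a dict of query parameters
    and returns the decoded JSON response.
    """
    cursor = "*"  # "*" requests the first page
    while cursor:
        payload = fetch_json({"filter": filter_str, "per-page": per_page, "cursor": cursor})
        yield from payload["results"]
        # next_cursor is null (None) once the last page has been served
        cursor = payload["meta"].get("next_cursor")


def fetch_with_requests(params):
    """One possible real fetcher (assumption: plain GET on the works endpoint)."""
    import requests  # imported lazily so the demo below runs without it
    response = requests.get("https://api.openalex.org/works", params=params, timeout=30)
    response.raise_for_status()
    return response.json()


# Offline demo with made-up pages keyed by cursor value
fake_pages = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": "W1"}, {"id": "W2"}]},
    "abc": {"meta": {"next_cursor": "def"}, "results": [{"id": "W3"}]},
    "def": {"meta": {"next_cursor": None}, "results": []},
}

def fake_fetch(params):
    return fake_pages[params["cursor"]]

work_ids = [work["id"] for work in iter_works(fake_fetch, "cited_by_count:>10")]
print(work_ids)  # ['W1', 'W2', 'W3']
```

For a real run, `iter_works(fetch_with_requests, filter_str)` would replace the page counter, and the retry logic would wrap the call to `fetch_json`.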

We then multithread all of the batches:

In [54]:
# Multi-threaded execution using ThreadPoolExecutor
all_papers_data = []
all_abstracts_data = []

# A maximum of 10 threads work at a time
with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_batch = {
        executor.submit(process_batch, batch, idx): idx
        for idx, batch in enumerate(author_batches)
    }
    # As each future completes, combine its results
    for future in as_completed(future_to_batch):
        try:
            papers_data, abstracts_data = future.result()
            all_papers_data.extend(papers_data)
            all_abstracts_data.extend(abstracts_data)
        except Exception as exc:
            batch_index = future_to_batch[future]
            print(f"Batch {batch_index} generated an exception: {exc}")

Completed batch 9
Completed batch 1
Completed batch 5
Completed batch 6
Completed batch 4
Completed batch 0
Completed batch 7
Completed batch 3
Completed batch 8
Completed batch 2
Completed batch 16
Completed batch 11
Completed batch 15
Completed batch 13
Completed batch 12
Completed batch 19
Completed batch 10
Completed batch 17
Completed batch 14
Completed batch 22
Completed batch 18
Completed batch 26
Completed batch 20
Completed batch 21
Completed batch 25
Completed batch 24
Completed batch 30
Completed batch 23
Completed batch 28
Completed batch 33
Completed batch 27
Completed batch 29
Completed batch 37
Completed batch 35
Completed batch 32
Completed batch 31
Completed batch 36
Completed batch 34
Completed batch 38
Completed batch 39
Completed batch 40
Completed batch 42
Completed batch 44
Completed batch 43
Completed batch 41
Completed batch 45
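
The batch numbers above are out of order because `as_completed` yields futures as they finish, not in the order they were submitted. A self-contained sketch of the same pattern, with a hypothetical `fake_batch` that sleeps instead of calling the API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

def fake_batch(batch_index):
    """Stand-in for process_batch: later batches 'respond' faster."""
    sleep(0.05 * (3 - batch_index))
    return [f"paper-from-batch-{batch_index}"]

collected = []
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(fake_batch, idx): idx for idx in range(3)}
    for future in as_completed(futures):
        # Results arrive in completion order, so sort afterwards if order matters
        collected.extend(future.result())

print(sorted(collected))
```

Threads help here because the work is I/O bound: each thread spends most of its time waiting for an HTTP response rather than computing.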


In [55]:
# Convert the collected data into DataFrames
papers = pd.DataFrame(all_papers_data, columns=paper_col)
abstracts = pd.DataFrame(all_abstracts_data, columns=abstract_col)

As a sanity check, we verify that both DataFrames have the same length:

In [56]:
# Check length of papers and abstracts DataFrames
print(len(papers), len(abstracts))

13150 13150


We can then drop duplicate works and save the DataFrames:

In [57]:
# Drop duplicates and save to csv
papers = papers.drop_duplicates(['id'], ignore_index=True)
papers.to_csv('Works/IC2S2_papers.csv', index=False)
papers

Unnamed: 0,id,publication_year,cited_by_count,author_ids
0,https://openalex.org/W2955058313,2019,5709,https://openalex.org/A5101758238;https://opena...
1,https://openalex.org/W1520494989,1990,3200,https://openalex.org/A5080791781;https://opena...
2,https://openalex.org/W1551153090,2013,1690,https://openalex.org/A5080791781;https://opena...
3,https://openalex.org/W1983912405,2001,1672,https://openalex.org/A5082473613;https://opena...
4,https://openalex.org/W4296586302,1993,1543,https://openalex.org/A5080791781
...,...,...,...,...
11620,https://openalex.org/W2514509634,2001,12,https://openalex.org/A5010044245;https://opena...
11621,https://openalex.org/W2076427612,1995,13,https://openalex.org/A5040651832
11622,https://openalex.org/W1526178299,1999,11,https://openalex.org/A5112819277;https://opena...
11623,https://openalex.org/W2322495318,1991,11,https://openalex.org/A5099111335
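
The drop from 13150 collected rows to 11625 unique papers is expected with batched requests: a work whose co-authors fall into different author batches is presumably returned once per batch. A toy illustration of de-duplicating on `id` (the rows are made up):

```python
import pandas as pd

# W1 was returned by two different batches, so it appears twice
toy_papers = pd.DataFrame({
    "id": ["W1", "W2", "W1"],
    "cited_by_count": [50, 12, 50],
})

deduped = toy_papers.drop_duplicates(["id"], ignore_index=True)
print(len(toy_papers), len(deduped))  # 3 2
```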


In [58]:
# Drop duplicates and save to csv
abstracts = abstracts.drop_duplicates(['id'], ignore_index=True)
abstracts.to_csv('Works/IC2S2_abstracts.csv', index=False)
abstracts

Unnamed: 0,id,title,abstract_inverted_index
0,https://openalex.org/W2955058313,Dual Attention Network for Scene Segmentation,"{'In': [0, 162], 'this': [1], 'paper,': [2], '..."
1,https://openalex.org/W1520494989,"A Continuum of Impression Formation, from Cate...",
2,https://openalex.org/W1551153090,Social Cognition: From Brains to Culture,"{'Social': [0, 13, 28, 46, 52, 151, 155, 162, ..."
3,https://openalex.org/W1983912405,An ambivalent alliance: Hostile and benevolent...,"{'The': [0, 45], 'equation': [1], 'of': [2, 76..."
4,https://openalex.org/W4296586302,Controlling other people: The impact of power ...,
...,...,...,...
11620,https://openalex.org/W2514509634,"Plus Ça Change, Plus C'est Différent: A Report...","{'Reported': [0], 'here': [1], 'are': [2, 47],..."
11621,https://openalex.org/W2076427612,The Primacy of Virtue in Children's Moral Deve...,"{'Abstract': [0], 'The': [1, 60], 'concept': [..."
11622,https://openalex.org/W1526178299,Exploring the Contexts of Information Behaviou...,"{'Exploring': [0, 29], 'the': [1, 8, 30, 37], ..."
11623,https://openalex.org/W2322495318,IT strategies for information management,
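
An inverted index maps each word to the positions where it occurs, so the readable abstract can be rebuilt by sorting the words by position. The sketch below assumes the dictionary form shown in the table; `decode_abstract` is a hypothetical helper. Because `to_csv` writes each dictionary as its string representation, the sketch also parses strings with `ast.literal_eval`, and it returns `None` for works without an abstract.

```python
import ast

def decode_abstract(inverted_index):
    """Rebuild abstract text from an OpenAlex-style inverted index."""
    # Missing abstracts show up as None, NaN, or an empty value
    if not isinstance(inverted_index, (dict, str)) or not inverted_index:
        return None
    if isinstance(inverted_index, str):
        # A CSV round trip leaves the dict as its string representation
        inverted_index = ast.literal_eval(inverted_index)
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))

toy_index = {"In": [0], "this": [1], "paper,": [2], "we": [3, 5], "show": [4]}
print(decode_abstract(toy_index))       # In this paper, we show we
print(decode_abstract(str(toy_index)))  # In this paper, we show we
```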


> **Data Overview and Reflection questions:** Answer the following questions:
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works?
> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__
> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

A total of 11625 unique papers are in our dataset. The following code calculates the number of unique researchers who have co-authored these works:

In [59]:
author_ids = ';'.join(papers['author_ids'])
author_ids = author_ids.split(';')
unique_authors = list(set( author_ids ))

len(unique_authors)

17936

A total of 17936 researchers have co-authored works in our dataset.
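
An equivalent count can be computed within pandas by splitting the `author_ids` strings and exploding them into one row per author. A sketch on made-up rows that follow the same `;`-separated layout:

```python
import pandas as pd

toy_papers = pd.DataFrame({
    "id": ["W1", "W2", "W3"],
    "author_ids": ["A1;A2", "A2", "A3;A1"],
})

# One row per (paper, author) pair, then count the distinct author IDs
n_unique_authors = toy_papers["author_ids"].str.split(";").explode().nunique()
print(n_unique_authors)  # 3
```

Note that this count covers every co-author listed on the works, not only the IC2S2 authors used to query the API.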

When I first ran the code without any optimizations, it took around half an hour to collect all the works for the IC2S2_papers and IC2S2_abstracts DataFrames. Bulking the authors together in batches of 25 reduced the running time to a few minutes, and running the batches on 10 threads at a time reduced it further to 16 seconds.

The filters we apply aim to enhance data quality and relevance to the topic. By only including works with more than 10 citations, we ensure that the included papers have had a recognized impact. The concepts filter ensures that the papers lie within the field of computational social science.

As a result of these filters, we may underrepresent emerging papers with few citations as well as work by large teams of 10 or more researchers.

## Part 4: The Network of Computational Social Scientists