# Week 2

## Today

In this lecture, we will continue to work on the data. As you may have already noticed from the previous class, working with real-world data is not easy! It involves taking one decision after another, to ensure that the results we obtain are reliable. 
My goal is for you to learn how to take these decisions on your own and understand the consequences that these choices may have on your analysis and results. Today, we will learn more about different data sources for Computational Social Science, and some of the challenges that they present. 

Here is the plan for the lecture. 

* **Part 1:** We will learn the differences between **different kinds of data sources**. We will go through the theory, then I will ask you to reflect on what you have learned through an exercise. 

* **Part 2:**  In the second part of this class, I will introduce you to **APIs**. We will then use one API to continue our investigation of the field of Computational Social Science in a data-driven way.
  

## Part 1: Data Sources for Computational Social Science


We have seen how __DATA__ is central to Computational Social Science. But what data sources are we talking about? What are the limitations of different types of data sources? In the video below, I will give you an introduction to different types of data sources. As an example, I will introduce you to two studies that use two very different datasets to answer related questions. 


> **_Video lecture_**: Watch the video below about Data Sources in Computational Social Science
>
> *Optional Reading: [The Spread of Behavior in an Online Social Network Experiment.](https://www.science.org/doi/full/10.1126/science.1185231)* This is the article describing the first study I talk about in the video.    
> *Optional Reading: [Exercise contagion in a global social network.](https://www.nature.com/articles/ncomms14753)* This is the article describing the second study I talk about in the video.    

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo("Hr5yKJaQUhE",width=600, height=337.5)


In this course, we are going to focus mostly on observational data collected online to address social science questions. So, I would like us to reflect a little more on what it means to use *Ready made* data in the social sciences, and to understand its advantages and challenges. This is something that you can read about in Sections 2.1 to 2.3 of the book _Bit by Bit_. 

> *Reading*: [Bit by Bit, sections 2.1 to 2.3](https://www.bitbybitbook.com/en/1st-ed/observing-behavior/observing-intro/) Read sections 2.1 to 2.3. I don't expect you to read all the details, but to have a general understanding of advantages and challenges of large observational datasets (a.k.a. Big Data/Ready made data) for social science research. *If you have problems accessing the book, you can find a pdf version of the book [here](https://github.com/lalessan/comsocsci2024/tree/main/figures).*

> **Exercise 1: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood the key points of my lecture and the reading. Remember to come and ask me if you have any questions about this! 
>
> 1. What are the pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Section 2.3 of the book. 
> 2. How do you think these differences can influence the interpretation of the results in each study?

## 1.1
### Centola's experiment
#### Pros:
- The results can inform which participants to select in a follow-up study  
- The data is nonreactive, which means that participants are unaware of the experiment and therefore do not change their behavior because of it  
- The researchers have access to all the data, since it is custom-made  
- The data is clean: no bots, no spam  

#### Cons:
- Not much data  
- Takes time  
- The sample is demographically narrow: participants have to opt in to the study, which attracts a particular type of person  

### Nicolaides's study
#### Pros:
- Always-on: the data is continuously updated  
- Large data  
- Nonreactive  
- Complete, clean and accessible data, they have all variables required

#### Cons:
- The data is nonrepresentative: the population isn't varied, because the data comes from an app that only runners use  
- There could be many confounders, such as weather and vacations

## 1.2
In Centola's experiment the data is controlled and clean, so the observed effects can be attributed causally to the design of the experiment. However, since the data is nonrepresentative, the results can't necessarily be generalized to the broader population. The findings may only apply to the specific demographic that self-selected into the study, and the effects may not hold in more diverse, real-world settings.

Nicolaides's study uses large-scale, real-world data that captures social behavior over time. While this improves generalizability, confounders like weather or socioeconomic factors may bias the results. Additionally, since the data comes from a runners' app, the findings may not apply to non-runners.

#### The trade-off between control and generalizability is key:
- Centola's experiment provides high internal validity (causality is clear) but low external validity (hard to generalize).
- Nicolaides's study offers high external validity (reflects real-world behavior) but lower internal validity (difficult to establish causal relationships).

## Part 2: Using APIs to download data

In this class, we will work with *Ready made* data. The second thing we will learn today is how to get ready-made data using APIs. We will do it using the Academic Graph API provided by Semantic Scholar. The Academic Graph API enables you to gather information on scientists and their publications. 

In [6]:
from IPython.display import YouTubeVideo
YouTubeVideo("7AQO3vJptvg",width=600, height=337.5)

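At its simplest, a request to a web API is a URL plus a set of query parameters, and the answer usually comes back as JSON. The sketch below only builds such a URL, without sending anything over the network. It assumes the OpenAlex `authors` endpoint and the `search` and `select` parameters that Part 3 of this notebook uses; the helper name `build_request_url` is made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint queried in Part 3 of this notebook
BASE_URL = "https://api.openalex.org/authors"


def build_request_url(base_url, params):
    """Return the URL that a GET request with these query parameters targets."""
    return f"{base_url}?{urlencode(params)}"


# Illustrative query: search one name and ask only for two fields
params = {"search": "Aaron Clauset", "select": "id,display_name"}
url = build_request_url(BASE_URL, params)
print(url)

# Decoding the query string gives back the original parameters
params_back = parse_qs(urlsplit(url).query)
print(params_back)
```

Libraries such as `requests` do this encoding for you when you pass a `params` dictionary, so in practice you rarely build the URL by hand. Seeing the final URL is still handy for debugging, because you can paste it into a browser and inspect the response.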

> **Exercise 2: APIs for Computational Social Science research.** In this exercise, I ask you to look for an API that one could use for Computational Social Science. Starting from your answers, I will compile a list of useful APIs, and share it with the class next week. It may come in handy when you start working on your project.
> *Note: the answers to the surveys on DTU Learn are not contributing to your final grade, but it is still important to fill them in, because they help me ensure that you are on track with the material.* 
>
> - Use the web to look for one API that could be used for gathering interesting data (from a Computational Social Science perspective). 
>     - *Data description*: describe in a couple of lines the data types that you can gather using this API
>     - *Rate limits*: What are the rate limits of the free version of the API?
>     - *Link to the API*: Add the link to the API.
> - Go on DTU Learn and fill in the survey *[Week 2 - APIs](https://learn.inside.dtu.dk/d2l/lms/survey/user/attempt/survey_start_frame.d2l?si=31329&ou=242061)*


## Prelude to part 3: Pandas Dataframes


Before starting, we will learn a bit about [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), a very user-friendly data structure that you can use to manipulate tabular data. Pandas dataframes are built on top of NumPy, which is in turn largely implemented in C, so they are quite an efficient data structure. You will find them quite useful :)

Pandas dataframes should be intuitive to use. **I suggest you go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to learn what you need to follow the rest of the course.**

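To give a first flavor of what working with a dataframe looks like, here is a minimal sketch. The records are invented for illustration (they only mimic the shape of the author table built later in this notebook), and the variable names are arbitrary.

```python
import pandas as pd

# Hypothetical records: one dictionary per row, keys become column names
records = [
    {"Name": "Author A", "Works Count": 13, "H-Index": 6, "Country Code": "NL"},
    {"Name": "Author B", "Works Count": 284, "H-Index": 48, "Country Code": "US"},
    {"Name": "Author C", "Works Count": 10, "H-Index": 2, "Country Code": "US"},
]

df_demo = pd.DataFrame(records)

# Boolean indexing: keep only the rows that satisfy a condition
us_authors = df_demo[df_demo["Country Code"] == "US"]

# Group by a column and aggregate another one
mean_h_by_country = df_demo.groupby("Country Code")["H-Index"].mean()

print(us_authors)
print(mean_h_by_country)
```

Building a dataframe from a list of dictionaries is a convenient pattern when collecting data from an API, since each response can be turned into one dictionary and appended to a list.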



**_Video lecture_**: Watch the video below about Pandas, [here is the notebook I used in the video](https://nbviewer.org/github/TheYuanLiao/comsocsci2025/blob/main/additional_notebooks/Pandas.ipynb) 

In [9]:
from IPython.display import YouTubeVideo
YouTubeVideo("pM7IKcyfOV4",width=600, height=337.5)

## Part 3: Getting data on Computational Social Scientists from the OpenAlex API

All right, let's dive in! We're going to start collecting data on Computational Social Scientists and their work. 
As you will see, getting to the final dataset will take a few iterations and will require you to handle lots of data, but I'm confident you'll find it rewarding in the end. 
To help you throughout, I've broken down the process into manageable steps and provided strategies to help you tackle each one effectively.

Feel free to team up with a classmate or two for this project (remember, you'll be submitting assignments as a group of three). 
If you run into any issues or have questions, don't hesitate to ask me. Working with real-world data can be frustrating at times, but I'm here to help!

> **Exercise: Find potential Computational Social Scientists** In this exercise, we'll use the OpenAlex API to compile a list of researchers in the field of Computational Social Science, focusing on those who have attended the IC2S2 conference in **2024** (NOT 2023). This will not only help you understand the landscape of Computational Social Science research later on, but also help you develop practical skills in data collection and analysis.
>
> Please read the text of the whole exercise before starting to work on it. 
>
> **Steps**
> 
> 1. **Retrieve data.** Consider the set of unique researcher names that you collected in Week 1, Exercise 3. Use the _authors_ endpoint of the [OpenAlex API](https://docs.openalex.org/api-entities/authors) to _search_ for these researchers in the database based on their names. Loop through the list and, for each researcher in your list, find: 
>     - their _id_: The OpenAlex ID for this author.
>     - their _display\_name_: The name of the author as a single string.
>     - their _works\_api\_url_: A URL that will get you a list of all this author's works.
>     - their _h\_index_ : The h-index for this author.
>     - their _works\_count_: The number of Works this author has created.
>     - their _country\_code_: The country code of their last known institution.
> 2. **Data Storage** Store this information in a Pandas DataFrame and save it to file.
>    
> **Handling Challenges**
> 
> While working on the steps above, you will face several challenges, such as missing names or multiple results for a single name. Here are some of the issues you may run into and a possible way to address them. *Note: you may also find other ways to address these issues, which is ok. However, if you decide to use a different strategy, please come and talk to me first*
>    - *Problem:* an author is not found. *A possible solution:* you can discard that author.
>    - *Problem:* you get more than one result for the same name. *A possible solution:* You can pick the name with the highest relevance.
>    - *Problem:* None of the authors returned by the API has a name that is "close enough" to the name you searched for. *A possible solution:* you can discard that author.
>    - *Problem:* Your for loop keeps breaking due to unforeseen errors. *A possible solution:* to prevent losing data due to errors, save your progress frequently.
>
> **Reflection Questions**
> 
>  Answer the following questions: 
>    - Did you encounter any challenges not listed here? How did you address them?
>    - Choose one problem you faced while collecting the data and describe your solution. Why did you choose this approach, and what impact might it have on your data? 
>      
> Remember, if you're unsure about any steps or encounter hurdles, please reach out.

In [6]:
with open("files/researcher_names_2024.txt") as f:
    researcher_names = f.read().splitlines()

print(len(researcher_names))

1231

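One of the challenges listed in the exercise is deciding whether a returned name is "close enough" to the name that was searched. The sketch below shows one possible way to make that decision explicit, using string similarity from Python's standard library. The threshold of 0.8, the helper names, and the candidate records are all assumptions chosen for illustration; the exercise does not prescribe a specific rule.

```python
from difflib import SequenceMatcher
import unicodedata


def normalize(name):
    """Lowercase, strip accents and collapse whitespace ('López' becomes 'lopez')."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def name_similarity(a, b):
    """Similarity between two names, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def pick_match(query, results, threshold=0.8):
    """Return the first result whose display_name is close enough, else None.

    `results` is assumed to be ordered by relevance, so the first acceptable
    candidate is also the most relevant acceptable one.
    """
    for result in results:
        if name_similarity(query, result.get("display_name") or "") >= threshold:
            return result
    return None


# Hypothetical candidates for one searched name
candidates = [{"display_name": "E. Lopes"}, {"display_name": "Eduardo Lopez"}]
match = pick_match("Eduardo López", candidates)
print(match)
```

A stricter threshold discards more true matches (for example, authors listed with initials only), while a looser one lets in more wrong people. Whatever value you pick, it is worth reporting it, since it shapes the final dataset.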

In [25]:
import requests

URL = "https://api.openalex.org/authors"

data_list = []

# Loop through researcher names
for name in researcher_names:
    params = {
        'search': name,
        "select": "id,display_name,works_api_url,summary_stats,works_count,last_known_institutions"
    }
    
    response = requests.get(URL, params=params)
    
    if response.status_code == 200:
        results = response.json().get('results', [])

        if results:
            result = results[0]
            author_id = result.get('id', '')
            display_name = result.get('display_name', '')
            works_api_url = result.get('works_api_url', '')
            works_count = result.get('works_count', 0)
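            # summary_stats may be missing or None, so fall back to an empty dict
            h_index = (result.get('summary_stats') or {}).get('h_index', 0)
            # last_known_institutions may be missing, None, or an empty list
            institutions = result.get('last_known_institutions') or []
            country_code = (institutions[0].get('country_code') or '') if institutions else ''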


            # Append to list
            data_list.append({
                "ID": author_id,
                "Name": display_name,
                "Works API URL": works_api_url,
                "Works Count": works_count,
                "H-Index": h_index,
                "Country Code": country_code
            })
        else:
            data_list.append({
                "ID": '',
                "Name": name,
                "Works API URL": '',
                "Works Count": '',
                "H-Index": '',
                "Country Code": ''
            })
    else:
        print(f"Error fetching data for {name}: {response.status_code}")

In [27]:
import pandas as pd

# Convert list to DataFrame
df = pd.DataFrame(data_list)
df.head()

| | ID | Name | Works API URL | Works Count | H-Index | Country Code |
|---|---|---|---|---|---|---|
| 0 | https://openalex.org/A5082130337 | A. Marthe Möller | https://api.openalex.org/works?filter=author.i... | 13 | 6 | NL |
| 1 | https://openalex.org/A5014647140 | Aaron Clauset | https://api.openalex.org/works?filter=author.i... | 284 | 48 | US |
| 2 | https://openalex.org/A5089395967 | Aaron Nichols | https://api.openalex.org/works?filter=author.i... | 10 | 2 | US |
| 3 | https://openalex.org/A5047404909 | Aaron J. Schwartz | https://api.openalex.org/works?filter=author.i... | 32 | 10 | US |
| 4 | https://openalex.org/A5053043999 | Aaron J. Schein | https://api.openalex.org/works?filter=author.i... | 19 | 16 | US |

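The exercise suggests saving progress frequently, so that an error in the middle of a long loop does not wipe out everything collected so far. Below is a minimal sketch of that idea using a JSON checkpoint file. It assumes a hypothetical `fetch_one(name)` function that wraps the API call for one researcher; the function names and the checkpoint layout are illustrative choices, not part of the exercise.

```python
import json
import os


def load_checkpoint(path):
    """Return previously saved rows, or an empty list if there is no checkpoint."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path, rows):
    """Write to a temporary file first, then swap it in, so that a crash
    in the middle of writing cannot corrupt the existing checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False)
    os.replace(tmp_path, path)


def collect(names, fetch_one, path, every=50):
    """Fetch one record per name, skipping names already in the checkpoint."""
    rows = load_checkpoint(path)
    done = {row["query"] for row in rows}
    for name in names:
        if name in done:
            continue
        try:
            record = fetch_one(name)  # hypothetical wrapper around the API call
        except Exception as err:
            # Leave the name out of the checkpoint so a later run retries it
            print(f"Skipping {name}: {err}")
            continue
        rows.append({"query": name, "record": record})
        done.add(name)
        if len(rows) % every == 0:
            save_checkpoint(path, rows)
    save_checkpoint(path, rows)
    return rows
```

Running `collect` a second time with the same `path` only fetches the names that are still missing, which also makes it cheap to resume after an interruption. Storing the searched name (`query`) next to the returned record keeps track of which search produced which author.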

In [28]:
df.to_csv("researchers_data.csv", index=False)

# Optimized fetching

In [7]:
import requests

URL = "https://api.openalex.org/authors"
data_list = []

# Use a session to reuse connections
with requests.Session() as session:
    for name in researcher_names:
        params = {
            'search': name,
            "select": "id,display_name,works_api_url,summary_stats,works_count,last_known_institutions"
        }
        
        response = session.get(URL, params=params)
        
        if response.status_code == 200:
            results = response.json().get('results', [])
            if results:
                result = results[0]
                author_id = result.get('id', '')
                display_name = result.get('display_name', '')
                works_api_url = result.get('works_api_url', '')
                works_count = result.get('works_count', 0)
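                # summary_stats may be missing or None, so fall back to an empty dict
                h_index = (result.get('summary_stats') or {}).get('h_index', 0)
                # last_known_institutions may be missing, None, or an empty list
                institutions = result.get('last_known_institutions') or []
                country_code = (institutions[0].get('country_code') or '') if institutions else ''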


                # Append to list
                data_list.append({
                    "ID": author_id,
                    "Name": display_name,
                    "Works API URL": works_api_url,
                    "Works Count": works_count,
                    "H-Index": h_index,
                    "Country Code": country_code
                })
            else:
                data_list.append({
                    "ID": '',
                    "Name": name,
                    "Works API URL": '',
                    "Works Count": '',
                    "H-Index": '',
                    "Country Code": ''
                })
        else:
            print(f"Error fetching data for {name}: {response.status_code}")

print(f"Fetched data for {len(data_list)} authors.")


Error fetching data for Ayush Kanodia: 429
Error fetching data for Bedoor AlShebli: 429
Error fetching data for Calvin Yixiang Cheng: 429
Error fetching data for Carla Freitas Silveira Netto: 429
Error fetching data for Daniel E. Ho: 429
Error fetching data for David Fang: 429
Error fetching data for David Gamba: 429
Error fetching data for Dilrukshi Gamage: 429
Error fetching data for Eduardo López: 429
Error fetching data for Fynn Bachmann: 429
Error fetching data for Henrik Olsson: 429
Fetched data for 1220 authors.

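The output above shows 11 names failing with status code 429 (Too Many Requests), which means the requests were sent faster than the API allows, and those names are missing from the 1220 collected rows. One common remedy is to wait and retry, doubling the waiting time after each failed attempt. The sketch below illustrates this under a few assumptions: the function name, the default delays, and the number of retries are arbitrary, and the `Retry-After` header is only used if the server happens to send it as a number of seconds.

```python
import time


def get_with_backoff(get, url, params=None, max_retries=5, base_delay=1.0,
                     sleep=time.sleep):
    """Call get(url, params=params), retrying with exponential backoff on 429.

    `get` is any callable shaped like requests.get or session.get; `sleep`
    can be replaced to avoid real waiting when trying the function out.
    """
    for attempt in range(max_retries + 1):
        response = get(url, params=params)
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            # Prefer the server's hint when it is a plain number of seconds
            retry_after = response.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            sleep(delay)
    # Still rate-limited after all retries: the caller decides what to do
    return response
```

Inside the loop above, this would replace the direct call with something like `response = get_with_backoff(session.get, URL, params=params)`. The rest of the loop stays the same, because the function returns the same kind of response object.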

In [8]:
import pandas as pd

# Convert list to DataFrame
df = pd.DataFrame(data_list)
df.to_csv("researchers_data_2024.csv", index=False)

