Link to GitHub repository [here](https://github.com/jesp9435/ComSocSci)

Group member contributions: Both group members contributed equally to all parts of the assignment, and we worked on every part collaboratively.

# Part 1: Web-scraping

In total, we found 1484 unique researchers.

First we inspected the HTML of the page, using the browser tool for selecting an element on a page. This revealed a pattern on the website, e.g. that all the conferences were nested under a single class. We expected that searching for this class would yield the maximum number of authors. However, it missed some of the names located above the nested class. We then realized that by searching for "i", as in italic, it was possible to retrieve all names on the website. Some names had the title "Chair" next to them, so this was removed after printing the full list of names. We retrieved data about all the names using the OpenAlex API and saved it to authors.txt. There were duplicates in the results, but we chose to include everyone, skipping only the names that returned no results. In total we collected results for 3902 people.
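
The name clean-up described above can be sketched as a small helper. This is a minimal illustration rather than the exact notebook code: `clean_names` is a hypothetical function name, the sample strings are made-up stand-ins for the text of the `<i>` elements, and dropping empty fragments is an added assumption.

```python
def clean_names(italic_texts):
    """Split comma-separated name lists, drop the "Chair: " title, and deduplicate.

    `italic_texts` is assumed to hold the text of each <i> element on the page.
    """
    names = []
    for text in italic_texts:
        for name in text.split(","):
            cleaned = name.replace("Chair: ", "").strip()
            if cleaned:  # skip empty fragments left by trailing commas
                names.append(cleaned)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(names))


# Made-up sample input, for illustration only
sample = ["Chair: Ada Example", "Ada Example, Bo Sample", "Cy Placeholder,"]
print(clean_names(sample))
```

Keeping the order of first appearance makes it easier to compare the cleaned list against the program page by eye.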


In [None]:
import csv

import requests
from bs4 import BeautifulSoup

LINK = "https://ic2s2-2023.org/program"
API_URL = "https://api.openalex.org/autocomplete/authors"

# Scrape the program page; all author names sit inside <i> tags.
r = requests.get(LINK)
soup = BeautifulSoup(r.content, "html.parser")
parallel_talks = soup.find_all("i")

# Split each comma-separated list of names and strip the "Chair: " title.
pep = []
for people in parallel_talks:
    for name in people.text.split(","):
        pep.append(name.replace("Chair: ", "").strip())

# Remove duplicate names while preserving their order of appearance.
res = list(dict.fromkeys(pep))

# Query the OpenAlex autocomplete endpoint for every name and store all matches.
authors = []
with open("authors.txt", mode="a", encoding="utf-8", newline="") as file1:
    # csv.writer quotes fields that contain commas (e.g. the hint).
    writer = csv.writer(file1, lineterminator="\n")
    for author in res:
        # Passing the name via params lets requests URL-encode the spaces.
        r = requests.get(API_URL, params={"q": author})
        meta_data = r.json()
        for individual in meta_data["results"]:
            row = [individual["id"], individual["display_name"], individual["external_id"],
                   individual["works_count"], individual["hint"]]
            authors.append(row)
            writer.writerow(row)

print(len(res))      # number of unique names
print(len(authors))  # number of author records returned by the API
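
Because a field such as `hint` may itself contain commas, joining fields with `,` by hand can make a row ambiguous when it is read back. The sketch below shows one way to avoid this with Python's `csv` module. The record values are invented placeholders, and an in-memory buffer stands in for authors.txt.

```python
import csv
import io

# Invented placeholder record in the same column order as authors.txt:
# id, display_name, external_id, works_count, hint
record = ["https://openalex.org/A0000000000", "Ada Example", None, 42,
          "Example University, Denmark"]

buffer = io.StringIO()  # stands in for the authors.txt file
csv.writer(buffer, lineterminator="\n").writerow(record)

line = buffer.getvalue()
naive_fields = line.rstrip("\n").split(",")           # also splits inside the quoted hint
parsed_fields = next(csv.reader(io.StringIO(line)))   # respects the quoting

print(len(naive_fields), len(parsed_fields))
```

Reading the file back with `csv.reader` instead of `str.split(",")` keeps the column positions stable regardless of what the text fields contain.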

# Part 2: Ready Made vs Custom Made Data

**2.1** The advantage of custom-made data, as used in Centola's study, is that the researcher can set up the experiment in a very specific way, so that the data directly serves the research question. The disadvantage is that significant bias may be present, since the data is collected to test a particular hypothesis. The results may also fail to represent the real world if the participants know what is going on. Furthermore, creating custom-made data can be time-consuming and resource-intensive.
The advantage of ready-made data, as used in Nicolaides's study, is that a much larger data set can be gathered at a lower cost. The disadvantage is that the data may have been collected for an entirely different purpose, so it will not be useful for every research question, e.g. because it lacks information on confounders or other metadata.

**2.2** The interpretation of the results in each study can be influenced by the aforementioned pros and cons. For example, bias has to be considered to a large extent for custom-made data, since the researcher collects the data with a specific hypothesis in mind. Ready-made data created by governments or large corporations may not account for confounders to the same extent as custom-made data, so extra precautions have to be taken. In Centola's experiment, the controlled nature of the custom-made data allows for more confident conclusions about causal relationships. In Nicolaides's study, the ready-made data can provide insights into real-world phenomena and population trends, but one has to be careful when generalizing findings and interpreting results, given the limitations of such datasets.

# Part 3: Gathering Research Articles using the OpenAlex API

In [None]:
import concurrent.futures
import math

import pandas as pd
import requests

BASE_URL = "https://api.openalex.org/works"
FOCUSES = {"Sociology", "Psychology", "Economics", "Political science"}
QUANTITATIVE_DISCIPLINES = {"Mathematics", "Physics", "Computer science"}

abstracts = []
papers = []
counters = []
co_authors = []
urls = []  # (author_id, url) pairs, so each request remembers its author

# Build one request URL per page of works for every eligible author.
with open("authors.txt", "r", encoding="utf-8") as f:
    for line in f:
        temp = line.rstrip("\n").split(",")
        # Skip malformed rows where the works count is missing or not a number.
        if len(temp) < 4 or not temp[3].isdigit():
            continue
        works_count = int(temp[3])
        if 5 < works_count < 5000:
            # The API returns at most 200 works per page.
            for page in range(1, math.ceil(works_count / 200) + 1):
                page_url = f"{BASE_URL}?page={page}&per-page=200&filter=author.id:{temp[0]}"
                urls.append((temp[0], page_url))


def process_data(task):
    """Download one page of works and keep those that pass the filters."""
    author_id, url = task
    r = requests.get(url).json()
    for work in r["results"]:
        if work["cited_by_count"] > 10 and len(work["authorships"]) < 10:
            # Level-0 concepts are the top-level disciplines of a work.
            top_level = {c["display_name"] for c in work["concepts"] if c["level"] == 0}
            for co_author in work["authorships"]:
                co_authors.append(co_author["raw_author_name"])
            # Keep works tagged with both a social science and a quantitative discipline.
            if top_level & FOCUSES and top_level & QUANTITATIVE_DISCIPLINES:
                abstracts.append([work["id"], work["publication_year"], work["cited_by_count"], author_id])
                papers.append([work["id"], work["title"], work["abstract_inverted_index"]])
    counters.append(1)
    print(len(counters))  # progress: number of pages processed so far


with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # list() consumes the iterator so exceptions raised in worker threads surface here.
    list(executor.map(process_data, urls))

abstracts_df = pd.DataFrame(data=abstracts)
papers_df = pd.DataFrame(data=papers)
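
The paging and threading pattern used above can be illustrated without any network access. In this sketch, `page_tasks` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for the real API request, and the author ids and work counts are made up. Returning rows from the workers instead of appending to shared lists is one possible design, not the notebook's own.

```python
import concurrent.futures
import math


def page_tasks(author_id, works_count, per_page=200):
    """Return one (author_id, page) task per page needed to cover all works."""
    pages = math.ceil(works_count / per_page)
    return [(author_id, page) for page in range(1, pages + 1)]


def fake_fetch(task):
    """Hypothetical stand-in for one API request; returns rows tagged with the author."""
    author_id, page = task
    return [(author_id, page, f"work-{page}-{i}") for i in range(2)]


# Made-up authors: 450 works need 3 pages, 120 works need 1 page
tasks = page_tasks("A1", 450) + page_tasks("A2", 120)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in task order; iterating also surfaces worker exceptions
    pages = list(executor.map(fake_fetch, tasks))

rows = [row for page_rows in pages for row in page_rows]
print(len(tasks), len(rows))
```

Carrying the author id inside each task means every returned row is attributed to the author whose page was requested, independent of the order in which threads finish.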


In [None]:
print(abstracts_df)
print(papers_df)
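
The `abstract_inverted_index` column stores each abstract as a mapping from a word to the positions where it occurs, rather than as plain text. A minimal sketch of turning such a mapping back into a string is given below. `rebuild_abstract` is a hypothetical helper name and the sample index is invented.

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping; None yields an empty string."""
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs and sort them by position
    positioned = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


# Invented sample index, for illustration only
sample_index = {"networks": [1, 4], "Social": [0], "shape": [2], "other": [3]}
print(rebuild_abstract(sample_index))
```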

**Efficiency in code**: \
Considering the size of the dataset, we had to implement some strategies to keep the code efficient. We first retrieved the data using the requests.get() function. We then looped through all the authors to remove any duplicates before collecting all unique authors in a txt file. We also made sure to only include authors with a total work count between 5 and 5,000. The most computationally heavy step was extracting all the works of each author. To speed up this process, we applied multithreading. \
\
**Filtering Criteria and Dataset Relevance**: \
Setting thresholds for authors' total works, citation counts, and relevance to specific fields helps ensure dataset quality. It keeps important authors and influential works while filtering out less significant ones. However, this might exclude emerging researchers or niche topics. In the dataset that we have compiled, we could miss interdisciplinary research or new methodologies because of the emphasis on established topics. Overemphasizing popular fields could also skew the research focus. As a result, some aspects of Computational Social Science research could be over- or underrepresented.
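
The filtering criteria can be expressed as a single predicate on a work record, which makes the thresholds easy to adjust when checking how they affect coverage. The sketch below assumes a work is a dictionary with the `cited_by_count`, `authorships`, and `concepts` fields used earlier. `is_relevant`, its parameter names, and the sample record are hypothetical.

```python
FOCUSES = {"Sociology", "Psychology", "Economics", "Political science"}
QUANTITATIVE = {"Mathematics", "Physics", "Computer science"}


def is_relevant(work, min_citations=10, max_authors=10):
    """True if the work has more than min_citations citations, fewer than max_authors
    authors, and level-0 concepts in both discipline groups."""
    if work["cited_by_count"] <= min_citations or len(work["authorships"]) >= max_authors:
        return False
    top_level = {c["display_name"] for c in work["concepts"] if c["level"] == 0}
    return bool(top_level & FOCUSES) and bool(top_level & QUANTITATIVE)


# Invented sample record, for illustration only
sample_work = {
    "cited_by_count": 25,
    "authorships": [{}, {}, {}],
    "concepts": [
        {"display_name": "Sociology", "level": 0},
        {"display_name": "Computer science", "level": 0},
        {"display_name": "Social network", "level": 1},
    ],
}
print(is_relevant(sample_work))
print(is_relevant(sample_work, min_citations=50))
```

Re-running the predicate with different thresholds gives a quick estimate of how many works each criterion removes.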

# Part 4: The Network of Computational Social Scientists

In [None]:
# Dataframe of papers here
import networkx as nx

weighted_edgelist = []

G = nx.Graph()
print(G)

# Placeholder loop: iterating over a DataFrame yields its column labels, not its rows.
for authors in papers_df:
    pass
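
One possible way to fill `weighted_edgelist` is sketched below. It assumes that each paper is available as a list of author ids and that an edge weight counts the papers two authors share. Both are assumptions, since `papers_df` as built above does not store author lists. `build_weighted_edgelist` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_weighted_edgelist(author_lists):
    """Count how many papers each pair of authors shares."""
    pair_counts = Counter()
    for authors in author_lists:
        # sorted() gives every pair a canonical order; set() ignores repeated names
        for a, b in combinations(sorted(set(authors)), 2):
            pair_counts[(a, b)] += 1
    return [(a, b, weight) for (a, b), weight in pair_counts.items()]


# Made-up author lists, one per paper
sample_papers = [["A1", "A2", "A3"], ["A1", "A2"], ["A3"]]
weighted_edgelist = build_weighted_edgelist(sample_papers)
print(sorted(weighted_edgelist))
```

A list of `(author, author, weight)` tuples in this form can be loaded into the graph with `G.add_weighted_edges_from(weighted_edgelist)`.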