Link to git repository: https://github.com/ongiboy/Cognitive-Social-Science

Group members:
* Christian Ong Hansen (s204109)
* Kavus Latifi Yaghin (s214601)
* Daniel Damkjær Ries (s214641)

Group members' contributions:
* Every task was made in collaboration by all members.

# Part 1: Properties of the real-world network of Computational Social Scientists

##### 1.1 Analyzing networks through a random model

Answer the following questions (max 200 words in total):

* What regime does your random network fall into? Is it above or below the critical threshold?

* According to the textbook, what does the network's structure resemble in this regime?

* Based on your visualizations, identify the key differences between the actual and the random networks. Explain whether these differences are consistent with theoretical expectations.
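To decide which regime a random network falls into, the average degree can be compared with the textbook thresholds: 1 for the critical point and ln N for full connectivity. The sketch below is only illustrative. The function name `random_network_regime` and the node and edge counts are hypothetical, not the values from our network.

```python
import math

def random_network_regime(num_nodes, num_edges):
    """Return the average degree and the random-network regime it implies."""
    avg_degree = 2 * num_edges / num_nodes
    if avg_degree < 1:
        regime = "subcritical"
    elif avg_degree == 1:
        regime = "critical"
    elif avg_degree <= math.log(num_nodes):
        regime = "supercritical"
    else:
        regime = "connected"
    return avg_degree, regime

# Hypothetical sizes, chosen only to illustrate the comparison
avg_degree, regime = random_network_regime(num_nodes=1000, num_edges=2000)
print(f"<k> = {avg_degree:.2f}, ln N = {math.log(1000):.2f}, regime: {regime}")
```

Plugging in the actual number of nodes and edges of the random network gives the regime asked about in the first question.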

# Part 2: Network Analysis in Computational Social Science

##### 2.1 Mixing patterns and assortativity

Reflection questions (max 250 words for the 3 questions)

* Assortativity by degree. Were the results of the degree assortativity in line with your expectations? Why or why not?

* Edge flipping. In the process of implementing the configuration model, you were instructed to flip the edges (e.g., changing e_1 from (u,v) to (v,u)) 50% of the time. Why do you think this step is included?

* Distribution of assortativity in random networks. Describe the distribution of degree assortativity values you observed for the random networks. Was the distribution pattern expected? Discuss how the nature of random network generation (specifically, the configuration model and edge flipping) might influence this distribution and whether it aligns with theoretical expectations.
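The edge-flipping step can be illustrated with a minimal double-edge-swap sketch. This is not our exact implementation: the function names, the toy edge list and the attempt cap are assumptions made for the example. Without the flip, two chosen edges (u,v) and (x,y) can only be rewired to (u,y) and (x,v). With the flip, the alternative pairing (v,y) and (x,u) becomes equally likely.

```python
import random
from collections import Counter

def double_edge_swap_randomize(edges, num_swaps, seed=None):
    """Rewire an undirected simple graph while preserving every node's degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    swaps_done, attempts = 0, 0
    max_attempts = 100 * num_swaps  # safety cap (an assumption of this sketch)
    while swaps_done < num_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        u, v = edge_list[i]
        x, y = edge_list[j]
        # Flip e_1 from (u, v) to (v, u) half of the time
        if rng.random() < 0.5:
            u, v = v, u
        # Proposed new edges are (u, y) and (x, v)
        if u == y or x == v:
            continue  # would create a self-loop
        if frozenset((u, y)) in edge_set or frozenset((x, v)) in edge_set:
            continue  # would create a multi-edge
        edge_set.discard(frozenset(edge_list[i]))
        edge_set.discard(frozenset(edge_list[j]))
        edge_list[i], edge_list[j] = (u, y), (x, v)
        edge_set.update({frozenset((u, y)), frozenset((x, v))})
        swaps_done += 1
    return edge_list

def degree_sequence(edges):
    """Count how many edges touch each node."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

# Toy graph: a ring of 10 nodes plus two chords (hypothetical data)
toy_edges = [(k, (k + 1) % 10) for k in range(10)] + [(0, 5), (2, 7)]
randomized = double_edge_swap_randomize(toy_edges, num_swaps=50, seed=1)
print(degree_sequence(randomized) == degree_sequence(toy_edges))
```

Because every swap removes one edge end and adds one edge end at each of the four nodes involved, the degree sequence stays fixed while the degree correlations are shuffled.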

##### 2.2 Central nodes

* Find the 5 most central scientists according to the closeness centrality. What role do you imagine scientists with high closeness centrality play?

* Find the 5 most central scientists according to eigenvector centrality.

* Plot the closeness centrality of nodes vs their degree. Is there a correlation between the two? Did you expect that? Why?

* Repeat the two points above using eigenvector centrality instead. Do you observe any difference? Why?
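As a reminder of what closeness centrality measures, here is a small standard-library sketch that computes it with breadth-first search on a toy graph. The adjacency list and node names are hypothetical. For the real network a graph library would normally be used instead.

```python
from collections import deque

def closeness_centrality(adjacency):
    """Closeness in a connected, unweighted graph: (N - 1) / sum of shortest-path distances."""
    num_nodes = len(adjacency)
    closeness = {}
    for source in adjacency:
        # Breadth-first search gives shortest-path distances from the source
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        closeness[source] = (num_nodes - 1) / sum(distance.values())
    return closeness

# Hypothetical toy graph: A is a hub, E is a peripheral node
toy_graph = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["D"],
}
closeness = closeness_centrality(toy_graph)
degree = {node: len(neighbors) for node, neighbors in toy_graph.items()}
for node in sorted(closeness, key=closeness.get, reverse=True):
    print(node, degree[node], round(closeness[node], 3))
```

In this toy graph the highest-degree node is also the closest to everyone else, which is the kind of relationship the degree-vs-closeness plot is meant to probe.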

# Part 3: Words that characterize Computational Social Science communities

##### 3.1 TF-IDF and the Computational Social Science communities.

First, check out the Wikipedia page for TF-IDF. Explain in your own words the point of TF-IDF.

What does TF stand for?
* TF stands for Term Frequency: the number of times a term occurs in a document. For example, the TF of the word "Hello" in a document is the number of times "Hello" appears in that document.

What does IDF stand for?
* IDF stands for Inverse Document Frequency. It is computed by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the result. A high IDF therefore means that the term is rare across the corpus, so IDF highlights distinctive words and down-weights very common ones.

The point of TF-IDF?
* The TF-IDF of a term is the product of its TF and its IDF. A high TF-IDF value means that the term appears often in the document but is rare across the corpus. Words with a high TF-IDF are therefore a good tool for characterizing and distinguishing documents, because they are frequent within one document and uncommon in the rest of the corpus.
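The definitions above can be checked on a tiny made-up corpus. The three documents and their tokens below are hypothetical and only serve to show how a term that occurs everywhere ("model") gets a TF-IDF of zero, while a frequent but document-specific term comes out on top.

```python
import math
from collections import Counter

# Hypothetical toy corpus of three tokenized documents
documents = {
    "doc_a": ["network", "model", "network", "epidem"],
    "doc_b": ["model", "misinform", "misinform", "peopl"],
    "doc_c": ["model", "user", "data", "user"],
}

# Document frequency: in how many documents does each term occur?
num_docs = len(documents)
doc_freq = Counter()
for tokens in documents.values():
    doc_freq.update(set(tokens))

# IDF with the natural logarithm
idf = {term: math.log(num_docs / count) for term, count in doc_freq.items()}

# TF-IDF per document, with TF normalized by document length
tfidf = {}
for name, tokens in documents.items():
    counts = Counter(tokens)
    tfidf[name] = {term: (count / len(tokens)) * idf[term] for term, count in counts.items()}

for name, scores in tfidf.items():
    top_term = max(scores, key=scores.get)
    print(name, top_term, round(scores[top_term], 3))
```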

In [2]:
import pandas as pd
import ast
import math
from collections import defaultdict
from collections import Counter

# Load the data
communities_df = pd.read_csv("data/author_communities.csv")
papers_df = pd.read_csv("data/papers_final.csv")
abstracts_df = pd.read_csv("data/abstracts_tokens.csv", converters={'tokens': ast.literal_eval})

In [3]:
# Ensure 'authors' column is a list
papers_df['authors'] = papers_df['authors'].apply(lambda x: x if isinstance(x, list) else ast.literal_eval(x))

# Explode 'authors' column
papers_exploded = papers_df.explode('authors')

# Create new dataframe with unique authors and their works
authors_works_df = papers_exploded.groupby('authors')['id'].agg(list).reset_index()

# Rename columns
authors_works_df.columns = ['Author', 'Works']

In [5]:
# Merge authors_works_df with communities_df
df = pd.merge(authors_works_df, communities_df, on='Author')

# Explode 'Works' column
df = df.explode('Works')

# Merge with abstracts_df
df = pd.merge(df, abstracts_df, left_on='Works', right_on='id')

# Get abstract tokens for all communities
all_communities_abstracts = df.groupby('Community')['tokens'].agg("sum").reset_index()

# Rename columns
all_communities_abstracts.columns = ['Community', 'Abstract Tokens']

In [11]:
# Get the top 5 communities
top5_communities = communities_df['Community'].value_counts().nlargest(5).index

# Filter all_communities_abstracts for top 5 communities
top5_communities_abstracts = all_communities_abstracts[all_communities_abstracts['Community'].isin(top5_communities)]

# Initialize a dictionary to store the top 5 terms for each community
top_terms = {}

# For each community in the top 5 communities
for community in top5_communities:
    # Get the abstract tokens for the community
    abstract_tokens = top5_communities_abstracts[top5_communities_abstracts['Community'] == community]['Abstract Tokens'].values[0]
    
    # Count the frequency of each term
    term_counts = Counter(abstract_tokens)
    
    # Get the top 5 terms
    top_terms[community] = term_counts.most_common(5)

In [14]:
for item in top_terms.items():
    print(f"Community {item[0]}:")
    print(f"Top 5 terms: {item[1]}\n")

Community 18:
Top 5 terms: [('model', 1854), ('use', 1729), ('data', 1084), ('network', 1065), ('task', 1031)]

Community 17:
Top 5 terms: [('network', 2283), ('model', 1505), ('use', 988), ('time', 985), ('social', 966)]

Community 30:
Top 5 terms: [('network', 2218), ('model', 1999), ('social', 1689), ('data', 1484), ('use', 1459)]

Community 0:
Top 5 terms: [('use', 969), ('model', 919), ('misinform', 810), ('inform', 803), ('peopl', 793)]

Community 10:
Top 5 terms: [('user', 1447), ('use', 1374), ('data', 1202), ('inform', 764), ('model', 715)]



Describe similarities and differences between the communities.
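* The terms model and use appear among the five most frequent terms of all five communities. Several other terms are shared as well: network (communities 18, 17 and 30), data (18, 30 and 10), social (17 and 30) and inform (0 and 10). The differences are small: only a few terms, such as task (18), time (17), misinform and peopl (0) and user (10), appear in the top 5 of a single community.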

Why are the TFs not necessarily a good description of the communities?

In [7]:
# Get the top 5 communities
top5_communities = communities_df['Community'].value_counts().nlargest(5).index

# Filter all_communities_abstracts for top 5 communities
top5_communities_abstracts = all_communities_abstracts[all_communities_abstracts['Community'].isin(top5_communities)]

# Get the total number of documents
num_docs = len(top5_communities_abstracts)

# Initialize a dictionary to store the document frequency for each word
# (named doc_freq so it does not overwrite the DataFrame df created earlier)
doc_freq = defaultdict(int)

# For each document
for tokens in top5_communities_abstracts['Abstract Tokens']:
    # Get the unique words in the document
    unique_words = set(tokens)
    
    # For each unique word, increment its document frequency
    for word in unique_words:
        doc_freq[word] += 1

# Initialize a dictionary to store the IDF for each word
idf = {}

# For each word in the document frequency dictionary
for word, freq in doc_freq.items():
    # Calculate the IDF (natural logarithm) and store it in the dictionary
    idf[word] = math.log(num_docs / freq)

idf

{'bird': 1.6094379124341003,
 'engag': 0.0,
 'variou': 0.0,
 'linden': 1.6094379124341003,
 'heed': 1.6094379124341003,
 'riski': 0.22314355131420976,
 'sob': 1.6094379124341003,
 'earli': 0.0,
 'late': 0.0,
 'segreg': 0.0,
 '18th': 0.9162907318741551,
 'propens': 0.0,
 'paper': 0.0,
 'postcorrect': 1.6094379124341003,
 'prescrib': 0.22314355131420976,
 'relax': 0.22314355131420976,
 'baldwin': 0.9162907318741551,
 'titl': 0.22314355131420976,
 'appeal': 0.0,
 'compound': 0.0,
 'contest': 0.22314355131420976,
 'ideolog': 0.22314355131420976,
 'ate': 1.6094379124341003,
 'connectionist': 1.6094379124341003,
 'elud': 0.9162907318741551,
 'simultan': 0.0,
 'full': 0.0,
 'misl': 1.6094379124341003,
 'tcm': 1.6094379124341003,
 'st': 0.5108256237659907,
 'maibach': 1.6094379124341003,
 'negoti': 0.0,
 'align': 0.0,
 'greec': 0.9162907318741551,
 'neurocomput': 1.6094379124341003,
 'constant': 0.0,
 'tractabl': 0.9162907318741551,
 'arguabl': 0.22314355131420976,
 'monetarili': 1.60943791243

In [8]:
from collections import Counter
import math

# Get the top 9 communities
top9_communities = communities_df['Community'].value_counts().nlargest(9).index

# Filter all_communities_abstracts for top 9 communities
top9_communities_abstracts = all_communities_abstracts[all_communities_abstracts['Community'].isin(top9_communities)]

# Initialize dictionaries to store the top 10 terms and TF-IDF terms for each community
top_terms = {}
top_tfidf_terms = {}

# Calculate IDF for each term once and store the results in a dictionary
# (token lists are converted to sets first, so each membership test is fast)
token_sets = top9_communities_abstracts['Abstract Tokens'].apply(set).tolist()
idf_dict = {}
for term in set.union(*token_sets):
    idf_dict[term] = math.log(len(token_sets) / sum(term in tokens for tokens in token_sets))

# For each community in the top 9 communities
for community in top9_communities:
    # Get the abstract tokens for the community
    abstract_tokens = top9_communities_abstracts[top9_communities_abstracts['Community'] == community]['Abstract Tokens'].values[0]
    
    # Count the frequency of each term
    term_counts = Counter(abstract_tokens)
    
    # Get the top 10 terms
    top_terms[community] = term_counts.most_common(10)

    # Initialize a dictionary to store the TF-IDF for each term
    tfidf = {}
    
    # For each term and its count
    for term, count in term_counts.items():
        # Calculate the Term Frequency (TF)
        tf = count / len(abstract_tokens)
        
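        # Look up the Inverse Document Frequency (IDF) in the precalculated dictionary
        # and calculate the TF-IDF (no variable named idf is assigned here, so the
        # IDF dictionary from the earlier cell is not overwritten)
        tfidf[term] = tf * idf_dict[term]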

    # Get the top 10 TF-IDF words
    top_tfidf_terms[community] = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)[:10]

print(top_terms)
print(top_tfidf_terms)

{18: [('model', 1854), ('use', 1729), ('data', 1084), ('network', 1065), ('task', 1031), ('user', 932), ('commun', 915), ('languag', 899), ('predict', 894), ('gener', 888)], 17: [('network', 2283), ('model', 1505), ('use', 988), ('time', 985), ('social', 966), ('differ', 931), ('inform', 914), ('show', 838), ('find', 833), ('studi', 826)], 30: [('network', 2218), ('model', 1999), ('social', 1689), ('data', 1484), ('use', 1459), ('user', 1454), ('studi', 1176), ('inform', 1132), ('dynam', 1012), ('differ', 997)], 0: [('use', 969), ('model', 919), ('misinform', 810), ('inform', 803), ('peopl', 793), ('effect', 772), ('studi', 675), ('social', 600), ('decis', 593), ('differ', 579)], 10: [('user', 1447), ('use', 1374), ('data', 1202), ('inform', 764), ('model', 715), ('differ', 691), ('social', 680), ('network', 664), ('studi', 611), ('onlin', 609)], 2: [('network', 1054), ('data', 1000), ('use', 824), ('model', 785), ('social', 665), ('system', 605), ('mobil', 573), ('individu', 536), ('s

In [9]:
# Filter communities_df for top 9 communities
top9_communities_df = communities_df[communities_df['Community'].isin(top9_communities)]

# Group by Community and Author, and sum the Degree
grouped_df = top9_communities_df.groupby(['Community', 'Author'])['Degree'].sum().reset_index()

# Get the top 3 authors by degree for each community
top_authors = grouped_df.groupby('Community').apply(lambda x: x.nlargest(3, 'Degree')).reset_index(drop=True)

print(top_authors)

    Community                            Author  Degree
0           0  https://openalex.org/A5017914184     145
1           0  https://openalex.org/A5071165387      60
2           0  https://openalex.org/A5033765081      47
3           1  https://openalex.org/A5012701585      64
4           1  https://openalex.org/A5047315859      63
5           1  https://openalex.org/A5014466973      49
6           2  https://openalex.org/A5038976962      86
7           2  https://openalex.org/A5024505700      81
8           2  https://openalex.org/A5044898565      64
9          10  https://openalex.org/A5084282503      98
10         10  https://openalex.org/A5033656008      94
11         10  https://openalex.org/A5086453253      62
12         16  https://openalex.org/A5007176508     106
13         16  https://openalex.org/A5048877432      97
14         16  https://openalex.org/A5067118505      85
15         17  https://openalex.org/A5044033087     100
16         17  https://openalex.org/A5016268748 

In [10]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# For each community in the top 9 communities
for community in top9_communities:
    # Get the abstract tokens for the community
    abstract_tokens = top9_communities_abstracts[top9_communities_abstracts['Community'] == community]['Abstract Tokens'].values[0]
    
    # Create a word cloud
    wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = None, 
                min_font_size = 10).generate(" ".join(abstract_tokens))
    
    # Plot the word cloud
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.show()

    # Print the names of the top three authors in the community
    top_three_authors = top_authors[top_authors['Community'] == community]['Author'].values
    print(f"Top three authors in community {community}: {', '.join(top_three_authors)}")

* Next, we calculate IDF for every word.
* What base logarithm did you use? Is that important?

* Are these 10 words more descriptive of the community? If yes, what is it about IDF that makes the words more informative?
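On the question of the logarithm base: the code above uses `math.log`, which is the natural logarithm. A quick sketch shows that changing the base only rescales every IDF by the same constant factor, so the ranking of terms is unaffected. The document counts below are hypothetical.

```python
import math

# Hypothetical document counts for four terms in a corpus of 9 documents
num_docs = 9
doc_counts = {"model": 9, "network": 6, "misinform": 2, "mobil": 1}

idf_natural = {term: math.log(num_docs / n) for term, n in doc_counts.items()}
idf_base2 = {term: math.log2(num_docs / n) for term, n in doc_counts.items()}

# Rank terms from rarest (highest IDF) to most common under each base
ranking_natural = sorted(idf_natural, key=idf_natural.get, reverse=True)
ranking_base2 = sorted(idf_base2, key=idf_base2.get, reverse=True)

print(ranking_natural == ranking_base2)
for term in ranking_natural:
    print(term, round(idf_natural[term], 3), round(idf_base2[term], 3))
```

The base therefore matters for the absolute values, for example when comparing with numbers computed elsewhere, but not for which words end up in a community's top 10.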

##### 3.2 The Wordcloud

##### 3.3 Computational Social Science

* Go back to Week 1, Exercise 1. Revise what you wrote on the topics in Computational Social Science.

* In light of your data-driven analysis, has your understanding of the field changed? How? (max 150 words)