# Scientific Collaboration Graph Analysis

This Jupyter notebook contains the remaining analyses needed for my final graduation project.

## Environment setup

In [1]:
%pip install graphdatascience plotly pandas nbformat numpy


In [2]:
from graphdatascience import GraphDataScience

host = "bolt://localhost:7687"
user = "neo4j"
password = "password"

gds = GraphDataScience(host, auth=(user, password))


## Part 1: Attributing an Institution to Authors

Currently, authors are not assigned to an institution. Instead, each author has one or more affiliations on each publication. This part of the analysis assigns a "home" institution to each author: the institution with which the author has the most affiliated publications.

We will create a new relationship `HOME_INSTITUTION`, like so:

```
(:Author)-[:HOME_INSTITUTION]->(:Institution)
```
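
The selection rule above can be sketched in pandas, assuming a table of per-author, per-institution counts with the columns `author_id`, `institution_id`, and `affiliation_count`. The helper name `pick_home_institution`, the toy data, and the tie-breaking rule (lowest `institution_id` wins) are illustrative assumptions, not part of the analysis itself:

```python
import pandas as pd

# Toy affiliation counts (made-up values) with one row per (author, institution) pair.
toy_counts = pd.DataFrame(
    {
        "author_id": ["a1", "a1", "a2", "a3", "a3"],
        "institution_id": ["i1", "i2", "i1", "i2", "i3"],
        "affiliation_count": [3, 5, 2, 4, 1],
    }
)


def pick_home_institution(counts: pd.DataFrame) -> pd.DataFrame:
    """Return one row per author: the institution with the highest affiliation_count."""
    # Sort by count (descending) within each author; ties are broken by
    # institution_id, which is an arbitrary assumption made for determinism.
    ordered = counts.sort_values(
        ["author_id", "affiliation_count", "institution_id"],
        ascending=[True, False, True],
    )
    # The first row per author is now the one with the largest count.
    return ordered.drop_duplicates("author_id").reset_index(drop=True)


home = pick_home_institution(toy_counts)
print(home)
```

Each row of `home` could then be turned into one `HOME_INSTITUTION` relationship. How ties should really be resolved is a modelling decision that this sketch does not settle.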

In [3]:
df_affcount = gds.run_cypher(
    """
    MATCH (a:Author)<-[:INVOLVES_AUTHOR]-(auth:Authorship)-[:INVOLVES_INSTITUTION]->(d:Institution)
    RETURN a.scopus_id AS author_id, a.name AS author_name,
       d.scopus_id AS institution_id, d.name AS institution_name,
       count(*) AS affiliation_count
    """
)

df_affcount

| | author_id | author_name | institution_id | institution_name | affiliation_count |
|---|---|---|---|---|---|
| 0 | 57217764581 | Rodriguez, Juan C. | 60032361 | Pontifícia Universidade Católica do Rio de Jan... | 1 |
| 1 | 7006818871 | Wrobel, Luiz Carlos | 60032361 | Pontifícia Universidade Católica do Rio de Jan... | 9 |
| 2 | 7005255555 | Nunokawa, H. | 60032361 | Pontifícia Universidade Católica do Rio de Jan... | 9 |
| 3 | 7006493149 | Carvalho, I. C.S. | 60032361 | Pontifícia Universidade Católica do Rio de Jan... | 4 |
| 4 | 6507364614 | Ziolli, Roberta L. | 60032361 | Pontifícia Universidade Católica do Rio de Jan... | 4 |
| ... | ... | ... | ... | ... | ... |
| 248290 | 59695062000 | Nogueira, André A. | 130121784 | Independent Researcher | 1 |
| 248291 | 57206291888 | de Lima, Marcelo G. | 105512226 | Center for Large Landscape Conservation | 1 |
| 248292 | 59696345900 | Kisil Marino, Ian | 100998101 | Leibniz-Institut für Europäische Geschichte | 1 |
| 248293 | 8856664000 | Duque, Cristiane | 60109712 | Universidade Catolica Portuguesa | 1 |


In [4]:
import pandas as pd
import plotly.express as px

# Filter for Unicamp affiliations
unicamp_affiliations = df_affcount[df_affcount["institution_id"] == "60029570"]

# Total affiliations per author
total_affiliations = df_affcount.groupby("author_id")["affiliation_count"].sum().rename("total_count")

# Merge and compute percentage
unicamp_affiliations = unicamp_affiliations.merge(total_affiliations, on="author_id", how="right").fillna({"affiliation_count": 0})
unicamp_affiliations["percentage"] = unicamp_affiliations["affiliation_count"] / unicamp_affiliations["total_count"]
unicamp_affiliations

| | author_id | author_name | institution_id | institution_name | affiliation_count | total_count | percentage |
|---|---|---|---|---|---|---|---|
| 0 | 10038868400 | | | | 0.0 | 1 | 0.0 |
| 1 | 10039131200 | | | | 0.0 | 12 | 0.0 |
| 2 | 10039132600 | | | | 0.0 | 1 | 0.0 |
| 3 | 10039306300 | | | | 0.0 | 1 | 0.0 |
| 4 | 10039420200 | | | | 0.0 | 8 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 181026 | 9943475600 | | | | 0.0 | 1 | 0.0 |
| 181027 | 9943485500 | | | | 0.0 | 1 | 0.0 |
| 181028 | 9943654300 | | | | 0.0 | 1 | 0.0 |
| 181029 | 9943745400 | | | | 0.0 | 9 | 0.0 |


In [5]:
import numpy as np

# Define 20 bins between 0 and 1
bins = np.linspace(0, 1, 21)
unicamp_affiliations["binned"] = pd.cut(unicamp_affiliations["percentage"], bins=bins, include_lowest=True)
unicamp_affiliations["binned"] = unicamp_affiliations["binned"].astype(str)  # Convert to string for plotting


def bin_key(bin_str):
    # bin_str example: "(0.05, 0.1]"; the first bin is "(-0.001, 0.05]"
    left = bin_str.split(",")[0].strip("[(")
    return float(left)

unique_bins = unicamp_affiliations["binned"].unique()
sorted_bins = sorted(unique_bins, key=bin_key)

fig = px.histogram(
    unicamp_affiliations,
    x="binned",
    category_orders={"binned": sorted_bins},
    title="Distribution of Author Affiliation % with Unicamp"
)
fig.update_layout(
    xaxis_title="Affiliation Percentage Bin",
    yaxis_title="Number of Authors"
)
fig.show()


Most authors have either all of their publications associated with Unicamp or none of them.
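
One way to put numbers on this observation is to compute the share of authors at each extreme. The sketch below uses made-up values in place of the real `percentage` column, so the printed shares are illustrative only. The variable names `share_none`, `share_all`, and `share_mixed` are assumptions introduced here:

```python
import pandas as pd

# Made-up percentages standing in for unicamp_affiliations["percentage"].
toy = pd.DataFrame({"percentage": [0.0, 0.0, 1.0, 1.0, 1.0, 0.5, 0.25, 0.0]})

# The mean of a boolean mask is the fraction of rows where the mask is True.
share_none = (toy["percentage"] == 0).mean()
share_all = (toy["percentage"] == 1).mean()
share_mixed = 1 - share_none - share_all

print(f"none: {share_none:.3f}, all: {share_all:.3f}, mixed: {share_mixed:.3f}")
```

Running the same three lines on `unicamp_affiliations` would give the actual shares behind the histogram.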

In [8]:
unicamp_affiliations["unicamp"] = unicamp_affiliations["percentage"] > 0.5
unicamp_affiliations

| | author_id | author_name | institution_id | institution_name | affiliation_count | total_count | percentage | binned | unicamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10038868400 | | | | 0.0 | 1 | 0.0 | (-0.001, 0.05] | False |
| 1 | 10039131200 | | | | 0.0 | 12 | 0.0 | (-0.001, 0.05] | False |
| 2 | 10039132600 | | | | 0.0 | 1 | 0.0 | (-0.001, 0.05] | False |
| 3 | 10039306300 | | | | 0.0 | 1 | 0.0 | (-0.001, 0.05] | False |
| 4 | 10039420200 | | | | 0.0 | 8 | 0.0 | (-0.001, 0.05] | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 181026 | 9943475600 | | | | 0.0 | 1 | 0.0 | (-0.001, 0.05] | False |
| 181027 | 9943485500 | | | | 0.0 | 1 | 0.0 | (-0.001, 0.05] | False |
| 181028 | 9943654300 | | | | 0.0 | 1 | 0.0 | (-0.001, 0.05] | False |
| 181029 | 9943745400 | | | | 0.0 | 9 | 0.0 | (-0.001, 0.05] | False |


## Write back

Below, we create a new property called `unicamp` on `Author` nodes. The property is `true` if the share of the author's affiliations with Unicamp is strictly greater than 0.5 (that is, `percentage > 0.5`) and `false` otherwise. We will use this percentage to classify communities as "Unicamp" or "non-Unicamp".

In [7]:
batch_data = [
    {
        "author_id": row["author_id"],
        "unicamp": row["unicamp"]
    }
    for _, row in unicamp_affiliations.iterrows()
]

gds.run_cypher(
    """
    UNWIND $rows AS row
    MATCH (a:Author {scopus_id: row.author_id})
    SET a.unicamp = row.unicamp
    """,
    params={"rows": batch_data}
)
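
If sending every row in a single `UNWIND` call turns out to be too heavy for one transaction, the same write can be split into smaller batches. The following is a minimal sketch under that assumption. The helpers `chunk_rows` and `write_in_batches`, the default batch size of 10,000, and the stand-in `fake_run_cypher` are hypothetical names and values introduced for illustration; only the `run_cypher(query, params=...)` call shape comes from the cells above.

```python
from typing import Callable, Iterator


def chunk_rows(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Same Cypher statement as in the write-back cell.
WRITE_QUERY = """
UNWIND $rows AS row
MATCH (a:Author {scopus_id: row.author_id})
SET a.unicamp = row.unicamp
"""


def write_in_batches(
    run_cypher: Callable[..., object], rows: list[dict], size: int = 10_000
) -> int:
    """Run WRITE_QUERY once per chunk and return the number of batches sent."""
    batches = 0
    for chunk in chunk_rows(rows, size):
        run_cypher(WRITE_QUERY, params={"rows": chunk})
        batches += 1
    return batches


# Dry run with a stand-in for gds.run_cypher, so the sketch runs without a database.
sent = []


def fake_run_cypher(query, params=None):
    sent.append(len(params["rows"]))


demo_rows = [{"author_id": str(i), "unicamp": i % 2 == 0} for i in range(25)]
n_batches = write_in_batches(fake_run_cypher, demo_rows, size=10)
print(n_batches, sent)
```

In this notebook the call would look like `write_in_batches(gds.run_cypher, batch_data)`. The batch size is a tuning knob, not a recommendation.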