# OpenAIRE Beginners Kit

The **OpenAIRE Graph** is an Open Access dataset containing metadata about research products (literature, datasets, software, and other research products) linked to other entities of the research ecosystem, such as organisations, grants, and data sources.

The large size of the OpenAIRE Graph is a major impediment for beginners who want to familiarise themselves with the underlying data model and explore its contents. Working with the Graph in its full size typically requires a large distributed computing infrastructure, which is not easily accessible to everyone.

The OpenAIRE Beginner’s Kit aims to address this issue. It consists of two components: a subset of the Graph composed of the research products published between `2024-06-01` and `2024-12-31`, all the entities connected to them and the respective relationships, and the present Jupyter notebook that demonstrates how you can use `PySpark` to analyse the Graph and get answers to some interesting research questions.

This notebook is structured in sections to help you navigate the different cells; you can visualise the structure by clicking on the third icon in the leftmost menu.

You can run each cell individually (`shift+enter` with a cell highlighted) or run everything at once from the top menu.

## Download data from Zenodo

In [None]:
import os
import requests
from urllib.parse import urlsplit
import tarfile
from pathlib import Path

zenodo_url = "https://zenodo.org/record/14891799/files/"

openaire_files = [zenodo_url + "communities_infrastructures.tar",
                  zenodo_url + "dataset.tar",
                  zenodo_url + "datasource.tar",
                  zenodo_url + "organization.tar",
                  zenodo_url + "otherresearchproduct.tar",
                  zenodo_url + "project.tar",
                  zenodo_url + "publication.tar",
                  zenodo_url + "relation.tar",
                  zenodo_url + "software.tar"]



def download_and_extract(url, path):
    os.makedirs(path, exist_ok=True) # make sure the target folder exists
    tar_name = urlsplit(url).path.split('/')[-1] # publication.tar
    tar_path = os.path.join(path, tar_name) # data/raw/publication.tar
    untarred_folder = tar_name.split('.')[0] # publication
    untarred_path = os.path.join(path, untarred_folder) # data/raw/publication
    # skip everything if the extracted folder is already there
    if not os.path.exists(untarred_path):
        if not os.path.exists(tar_path):
            print(f"downloading {url}")
            try:
                with requests.get(url, stream=True) as response:
                    response.raise_for_status()
                    with open(tar_path, 'wb') as f:
                        for chunk in response.iter_content(chunk_size=8192):
                            f.write(chunk)
            except requests.exceptions.RequestException as e:
                print("Error downloading the file:", e)
                # drop any partial download and skip extraction for this file
                if os.path.exists(tar_path):
                    os.remove(tar_path)
                return

        print(f"untar {tar_name}")
        with tarfile.open(tar_path, "r") as tar:
            tar.extractall(path)

        print('cleaning')
        os.remove(tar_path)


# Download data into /app/openaire/data/raw
for tar in openaire_files:
    download_and_extract(tar, "/app/openaire/data/raw")
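
The path handling inside `download_and_extract` can be checked in isolation, without touching the network. The sketch below repeats only the name derivation; `derive_paths` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import os
from urllib.parse import urlsplit


def derive_paths(url, path):
    """Return (tar_path, untarred_path) the same way download_and_extract derives them."""
    tar_name = urlsplit(url).path.split('/')[-1]  # last URL segment, e.g. publication.tar
    tar_path = os.path.join(path, tar_name)
    untarred_path = os.path.join(path, tar_name.split('.')[0])  # name without extension
    return tar_path, untarred_path


tar_path, untarred_path = derive_paths(
    "https://zenodo.org/record/14891799/files/publication.tar", "data/raw")
print(tar_path)
print(untarred_path)
```

Because the function returns early when the extracted folder already exists, re-running the download cell should only fetch the archives that are still missing.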

## Import libraries

In [None]:
import json

import glob
import pandas as pd

import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType
from pyspark.sql import SparkSession
from IPython.display import JSON as pretty_print

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Prepare data for analysis

The data subset is organised in part files under the `data/raw` folder. 

If you load the data straight into memory with pandas, a single part file fits:

In [None]:
files = sorted(glob.glob('data/raw/publication/part*'))
df = pd.read_json(files[0], compression='gzip', lines=True)
df.head(3)
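
The arguments `compression='gzip', lines=True` reflect the layout of a part file: gzip-compressed JSON Lines, one record per line. As a minimal sketch of that layout, the snippet below writes a toy part file and streams it back record by record with the standard library only. The file name, the records, and the `iter_records` helper are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def iter_records(part_file):
    """Yield one parsed JSON record per non-empty line of a gzip-compressed file."""
    with gzip.open(part_file, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Build a toy part file (hypothetical content, not real OpenAIRE data)
sample = [{"id": "p1", "mainTitle": "A"}, {"id": "p2", "mainTitle": "B"}]
tmp_dir = tempfile.mkdtemp()
part_file = os.path.join(tmp_dir, "part-00000.json.gz")
with gzip.open(part_file, "wt", encoding="utf-8") as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")

records = list(iter_records(part_file))
print(len(records), records[0]["id"])
```

Streaming keeps only one record in memory at a time, which helps when you want to peek at a file without building a full DataFrame.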

However, loading the whole dataset this way, or even just the publications, is unlikely to fit in memory.

If you uncomment and run the following lines, the kernel will most likely run out of memory after a while, die, and restart.

In [None]:
# files = sorted(glob.glob('data/raw/publication/part*'))
# publications_df = pd.concat(pd.read_json(f, compression='gzip', lines=True) for f in files)

So, let's see how `Spark` can help us.

First things first, let's create the Spark session.

In [None]:
spark = SparkSession.builder.getOrCreate()

Now, let's read the datasets about OpenAIRE entities.

In [None]:
inputPath = 'data/raw/'
 
publications = spark.read.json(inputPath + 'publication')
datasets = spark.read.json(inputPath + 'dataset')
software = spark.read.json(inputPath + 'software')
others = spark.read.json(inputPath + 'otherresearchproduct')
datasources = spark.read.json(inputPath + 'datasource')
organizations = spark.read.json(inputPath + 'organization')
projects = spark.read.json(inputPath + 'project')
communities = spark.read.json(inputPath + 'communities_infrastructures')
relations = spark.read.json(inputPath + 'relation')

Let's create some temporary views. A temporary view behaves like an SQL table that you can query via Spark.

In [None]:
publications.createOrReplaceTempView("publications")
datasets.createOrReplaceTempView("datasets")
software.createOrReplaceTempView("software")
others.createOrReplaceTempView("others")
datasources.createOrReplaceTempView("datasources")
organizations.createOrReplaceTempView("organizations")
projects.createOrReplaceTempView("projects")
communities.createOrReplaceTempView("communities")
relations.createOrReplaceTempView("relations")


Ok, let's count the number of rows.

In [None]:
print("number of publications %s"%publications.count())
print("number of datasets %s"%datasets.count())
print("number of software %s"%software.count())
print("number of other research products %s"%others.count())
print("number of datasources %s"%datasources.count())
print("number of organizations %s"%organizations.count())
print("number of communities %s"%communities.count())
print("number of projects %s"%projects.count())
print("number of relations %s"%relations.count())

By the way, the same could be achieved in SQL via Spark.

In [None]:
spark.sql("SELECT COUNT(*) FROM publications").toPandas()

## Visualise sample data structures

Let's show some data now. 

### Publication

For example, a generic publication (link to documentation: https://graph.openaire.eu/docs/data-model/entities/research-product)

In [None]:
pretty_print(json.loads(publications.toJSON().first()), expanded=False)

### Dataset

A generic dataset (link to documentation: https://graph.openaire.eu/docs/data-model/entities/research-product#data)

In [None]:
pretty_print(json.loads(datasets.toJSON().first()), expanded=False)

### Software

A software product (link to documentation: https://graph.openaire.eu/docs/data-model/entities/research-product#software)

In [None]:
pretty_print(json.loads(software.toJSON().first()), expanded=False)

### Other

A research product of type "other" (link to documentation: https://graph.openaire.eu/docs/data-model/entities/research-product#other-research-product)

In [None]:
pretty_print(json.loads(others.toJSON().first()), expanded=False)

### Data source

Or a data source (link to documentation: https://graph.openaire.eu/docs/data-model/entities/data-source)

In [None]:
pretty_print(json.loads(datasources.toJSON().first()), expanded=False)

### Organization

An organization (link to documentation: https://graph.openaire.eu/docs/data-model/entities/organization)

In [None]:
pretty_print(json.loads(organizations.toJSON().first()), expanded=False)

### Project

A project (link to documentation: https://graph.openaire.eu/docs/data-model/entities/project)

In [None]:
pretty_print(json.loads(projects.toJSON().first()), expanded=False)

### Community

A community (link to documentation: https://graph.openaire.eu/docs/data-model/entities/community)

In [None]:
pretty_print(json.loads(communities.toJSON().first()), expanded=False)

### Relation

And finally, a relation (link to documentation: https://graph.openaire.eu/docs/data-model/relationships)

In [None]:
pretty_print(json.loads(relations.toJSON().first()), expanded=False)

## Simple queries 

All the exercises follow the template below.
```python
query = """
SELECT <columns>
FROM <table>
[WHERE <conditions>]
[GROUP BY <fields>]
[ORDER BY <fields> DESC|ASC]
[LIMIT <n_rows>]
"""

spark.sql(query).toPandas()
```

Your query is written as a string inside the `query` variable and then executed with `spark.sql(query)`. 

`toPandas()` results in the collection of all records in the PySpark DataFrame to the driver program and should be done only on a small subset of the data. Running queries that return a large number of rows could yield an out-of-memory exception and crash the application. If this is the case, it is always better to limit the number of rows with the `LIMIT` clause (see below).
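
One way to make this habit harder to forget is to cap every query before it is collected. The sketch below is plain string handling; `ensure_limit` is a hypothetical helper, not part of Spark, and it only looks at the very end of the query, so a `LIMIT` inside a subquery is ignored.

```python
import re


def ensure_limit(query, max_rows=1000):
    """Append a LIMIT clause unless the query already ends with one."""
    stripped = query.strip().rstrip(";")
    if re.search(r"\bLIMIT\s+\d+\s*$", stripped, flags=re.IGNORECASE):
        return stripped  # the query is already capped
    return f"{stripped}\nLIMIT {max_rows}"


print(ensure_limit("SELECT * FROM publications"))
print(ensure_limit("SELECT * FROM publications LIMIT 5"))
```

In the notebook it could be used as `spark.sql(ensure_limit(query)).toPandas()`.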

In case the kernel dies unexpectedly (e.g., a connection error shows up after executing a cell), it is necessary to restart it from the top menu. In this case, you will need to reload the data from the beginning **without downloading everything from Zenodo again**.
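
Before reloading, it can be reassuring to check which entity folders are already extracted on disk. The following is a small sketch under the assumption that each archive unpacks into a folder named after the entity; `missing_folders` is a hypothetical helper, and a temporary directory stands in for `data/raw`.

```python
import tempfile
from pathlib import Path

# Folder names taken from the archive names in the download cell
ENTITY_FOLDERS = ["communities_infrastructures", "dataset", "datasource",
                  "organization", "otherresearchproduct", "project",
                  "publication", "relation", "software"]


def missing_folders(raw_dir, names=ENTITY_FOLDERS):
    """Return the entity folders that are not present under raw_dir."""
    raw = Path(raw_dir)
    return [name for name in names if not (raw / name).is_dir()]


# Demo on a temporary directory standing in for data/raw
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "publication").mkdir()
    (Path(tmp) / "relation").mkdir()
    still_missing = missing_folders(tmp)

print(still_missing)
```

An empty list means the Spark session and the `spark.read.json` cells can be re-run directly.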

The reference guide for Spark SQL can be found here: https://spark.apache.org/docs/latest/sql-ref.html.

**Generative AI and LLMs**, such as ChatGPT or Gemini, work great to get familiar with Spark SQL syntax and built-in functions; just frame your question in an appropriate way.

### **Task:** find publications with a keyword in the title, e.g., `covid`

To test this, we can use the standard SQL operator `LIKE`.

On LLMs, for example, the prompt `In Spark SQL, write a query that checks if a column mainTitle contains the string covid` would produce the intended query.

In [None]:
query = """
SELECT *
FROM publications
WHERE mainTitle LIKE '%covid%'
LIMIT 5
"""

spark.sql(query).toPandas()

As an alternative, we could use a convenient built-in function that checks if a string contains another string.
The documentation can be found here: https://spark.apache.org/docs/latest/api/sql/#contains.

In [None]:
query = """
SELECT *
FROM publications
WHERE CONTAINS(mainTitle, 'covid')
LIMIT 5
"""

spark.sql(query).toPandas()
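
Both `LIKE` and `CONTAINS` compare case-sensitively, so searching for `covid` would miss titles spelled `COVID-19`. The plain-Python sketch below mirrors the two behaviours on made-up titles; in Spark SQL, lowercasing the column first with `LOWER(mainTitle)` or using the case-insensitive `ILIKE` operator (used further down in this notebook) are ways to catch both spellings.

```python
# Made-up titles, for illustration only
titles = ["COVID-19 vaccine uptake", "Long covid symptoms", "Graph databases"]

# Case-sensitive match, analogous to mainTitle LIKE '%covid%'
case_sensitive = [t for t in titles if "covid" in t]

# Case-insensitive match, analogous to LOWER(mainTitle) LIKE '%covid%'
case_insensitive = [t for t in titles if "covid" in t.lower()]

print(case_sensitive)
print(case_insensitive)
```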

### **Task:** count publications per date

In [None]:
query = """
SELECT publicationDate AS publication_date, COUNT(*) AS n_pubs
FROM publications
GROUP BY publication_date
ORDER BY publication_date DESC
"""

spark.sql(query).toPandas()

### **Task:** count publications by language

In [None]:
query = """
SELECT language.code, COUNT(*) AS n_papers
FROM publications
GROUP BY language
ORDER BY n_papers DESC
LIMIT 25
"""

spark.sql(query).toPandas()

### **Task:** count publications by country

In [None]:
query = """
SELECT country.code AS code, COUNT(DISTINCT id) AS n_pubs
FROM publications
  LATERAL VIEW EXPLODE(countries) AS country
GROUP BY code
ORDER BY n_pubs DESC
LIMIT 10
"""

spark.sql(query).toPandas()
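
`LATERAL VIEW EXPLODE(countries)` emits one output row per array element, which is why the query counts `DISTINCT id`: a publication listing the same country twice would otherwise be counted twice. The plain-Python sketch below reproduces that logic on made-up records, so the field values are assumptions for illustration only.

```python
from collections import defaultdict

# Made-up records shaped like the columns used by the query
publications = [
    {"id": "p1", "countries": [{"code": "IT"}, {"code": "FR"}]},
    {"id": "p2", "countries": [{"code": "IT"}, {"code": "IT"}]},
    {"id": "p3", "countries": []},  # an empty array produces no rows at all
]

# Equivalent of LATERAL VIEW EXPLODE: one (id, code) row per array element
exploded = [(p["id"], c["code"]) for p in publications for c in p["countries"]]

# Equivalent of COUNT(DISTINCT id) ... GROUP BY code
ids_per_country = defaultdict(set)
for pub_id, code in exploded:
    ids_per_country[code].add(pub_id)
n_pubs = {code: len(ids) for code, ids in ids_per_country.items()}

print(len(exploded))
print(n_pubs)
```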

### **Task:** show the most occurring publication subjects

In [None]:
query = """
SELECT term, COUNT(term) as occurrences
FROM publications
    LATERAL VIEW explode (subjects.subject.value) as term
GROUP BY term
ORDER BY occurrences DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** find active projects

In [None]:
query = """
SELECT id, title,
    EXTRACT(YEAR FROM DATE(projects.startDate)) AS start_year,
    EXTRACT(YEAR FROM DATE(projects.endDate)) AS end_year
FROM projects
WHERE EXTRACT(YEAR FROM DATE(projects.endDate)) > 2024
"""

spark.sql(query).toPandas()

### **Task:** count projects by subject

In [None]:
query = """
SELECT subject, COUNT(id) AS n_projects
FROM projects
  LATERAL VIEW EXPLODE(subjects) AS subject
GROUP BY subject
ORDER BY n_projects DESC
"""

spark.sql(query).toPandas()

### **Task:** count active projects by subjects

In [None]:
query = """
SELECT subject, COUNT(id) AS n_projects
FROM projects
  LATERAL VIEW EXPLODE(subjects) AS subject
WHERE EXTRACT(YEAR FROM DATE(projects.endDate)) > 2024
GROUP BY subject
ORDER BY n_projects DESC
"""

spark.sql(query).toPandas()

### **Task:** count projects by subject, aggregate total funded amount

Hint: funded amounts can be expressed in different currencies, so they should not be summed together blindly.

In [None]:
query = """
SELECT subject,
        COUNT(id) AS n_projects,
        SUM(granted.fundedAmount) AS funded_total
FROM projects
    LATERAL VIEW EXPLODE(subjects) AS subject
WHERE granted.currency = 'EUR'
GROUP BY subject
ORDER BY n_projects DESC, funded_total DESC
"""

spark.sql(query).toPandas()
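
Filtering on `EUR` discards projects funded in other currencies. An alternative is to keep one total per currency, which in Spark SQL would mean adding `granted.currency` to the `GROUP BY`. The plain-Python sketch below shows the idea on made-up projects; the amounts and subjects are assumptions for illustration.

```python
from collections import defaultdict

# Made-up projects shaped like the columns used by the query
projects = [
    {"id": "a", "subjects": ["energy"],
     "granted": {"currency": "EUR", "fundedAmount": 100.0}},
    {"id": "b", "subjects": ["energy", "health"],
     "granted": {"currency": "USD", "fundedAmount": 50.0}},
    {"id": "c", "subjects": ["health"],
     "granted": {"currency": "EUR", "fundedAmount": 25.0}},
]

# One running total per (subject, currency) pair, never mixing currencies
totals = defaultdict(float)
for project in projects:
    granted = project["granted"]
    for subject in project["subjects"]:
        totals[(subject, granted["currency"])] += granted["fundedAmount"]

for key in sorted(totals):
    print(key, totals[key])
```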

### **Task:** count different OA statuses of publications

In [None]:
query = """
SELECT bestAccessRight.label AS OA_status, COUNT(*) AS n_papers
FROM publications
GROUP BY bestAccessRight.label
"""

spark.sql(query).toPandas()

### **Task:** count OA statuses of all products

In [None]:
query = """
SELECT bestAccessRight.label AS accessright,
       SUBSTRING(publicationDate, 0,4) AS year,
       COUNT(*) AS count
FROM (SELECT id, bestAccessRight, publicationDate FROM publications 
        UNION 
      SELECT id, bestAccessRight, publicationDate FROM datasets
        UNION
      SELECT id, bestAccessRight, publicationDate FROM software
        UNION
      SELECT id, bestAccessRight, publicationDate FROM others) as products
WHERE bestAccessRight IS NOT NULL AND publicationDate IS NOT NULL
GROUP BY bestAccessRight, year
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** explore instances

In [None]:
query = """
SELECT instance.type as instance_type, COUNT(*) AS count
FROM publications
    LATERAL VIEW EXPLODE(instances) AS instance
GROUP BY instance_type
ORDER BY count DESC
"""

spark.sql(query).toPandas()

### **Task:** restrict to specific instance types

In [None]:
query = """
SELECT COUNT(DISTINCT id)
FROM publications
  LATERAL VIEW EXPLODE(instances) AS instance
WHERE instance.type IN ('Article',
                       'Book',
                       'Conference object',
                       'Part of book or chapter of book',
                       'Data Paper',
                       'Software Paper')
"""

spark.sql(query).toPandas()

### **Task:** count peer-reviewed publications per country

In [None]:
query = """
SELECT country.code AS country, COUNT(DISTINCT id) AS n_pubs
FROM publications
  LATERAL VIEW EXPLODE(countries) AS country
WHERE id IN (SELECT DISTINCT id
              FROM publications
                LATERAL VIEW EXPLODE(instances) AS instance
              WHERE instance.refereed = 'peerReviewed')
GROUP BY country
ORDER BY n_pubs DESC
LIMIT 10
"""

spark.sql(query).toPandas()

### **Task:** show information about publishing venues

In [None]:
query = """
SELECT container.issnLinking, container.issnOnline, container.issnPrinted, container.name 
FROM publications 
WHERE container IS NOT NULL
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** group relations by their semantics and count them

In [None]:
query = """
SELECT reltype.name, COUNT(*) AS count 
FROM relations 
GROUP BY reltype.name 
ORDER BY count DESC
"""

spark.sql(query).toPandas()

## Advanced queries

### **Task:** count and sort publications by citations

In [None]:
query = """
SELECT publications.id, pids.value, COUNT(*) AS count
FROM publications 
    JOIN relations ON publications.id = relations.target
WHERE reltype.name = 'Cites'
GROUP BY publications.id, pids.value
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the journals with the highest number of publications

In [None]:
query = """
SELECT container.name, COUNT(id) as n_pubs
FROM publications
WHERE container IS NOT NULL
GROUP BY container.name
ORDER BY n_pubs DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the number of projects per organization

Hint: the `COALESCE` function can be of help to choose among the possible name forms of an organisation (e.g., short and full name). You can pass it multiple columns and it will return the first one that is not null.

In [None]:
query = """
SELECT COALESCE(legalshortname, legalname) AS name, 
        COUNT(*) AS count 
FROM organizations 
    JOIN relations ON organizations.id = relations.source
                    AND reltype.name = 'isParticipant'
GROUP BY name 
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()
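
The behaviour of `COALESCE` can be mimicked in a few lines of plain Python, with `None` standing in for SQL `NULL`. This is only an illustration of the semantics; the `coalesce` function and the organisation names below are made up.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all of them are."""
    return next((v for v in values if v is not None), None)


print(coalesce(None, "Example University"))   # falls back to the second value
print(coalesce("EXU", "Example University"))  # the first value wins when present
print(repr(coalesce("", "Example University")))  # an empty string is not null
```

The last call shows a limit of this approach: an empty string is still a value, so only a true null triggers the fallback.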

### **Task:** show projects with the highest number of associated results

Note: An `unidentified` project title is a placeholder for all the associations to a funder without knowing the specific project. It should be removed from the count.

In [None]:
query = """
SELECT fundings.shortName, code, title, COUNT(*) AS count 
FROM projects 
    JOIN relations ON projects.id = relations.source
                    AND reltype.name = 'produces' 
                    AND not projects.title ilike '%unidentified%' 
GROUP BY fundings.shortName, code, title
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

Strings can also be manipulated on the fly.

In [None]:
query = """
SELECT CONCAT_WS(' / ',
                IF(SIZE(fundings.shortName) > 0, ARRAY_JOIN(fundings.shortName, ',', '-'), '?'), 
                COALESCE(code, '?'), 
                SUBSTRING(title, 0, 50)) AS project, COUNT(*) AS count 
FROM projects 
    JOIN relations ON projects.id = relations.source 
                    AND reltype.name = 'produces' 
                    AND NOT projects.title ilike '%unidentified%' 
GROUP BY project 
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the most co-occurring publication subjects from controlled vocabularies (i.e., scheme != 'keyword')

In [None]:
query = """
WITH subjects AS (
    WITH exploded_subjects AS (
        SELECT id, EXPLODE(subjects.subject) AS subject 
        FROM publications) 
    SELECT id, subject.value AS `subject` 
    FROM exploded_subjects 
    WHERE subject.scheme != 'keyword'
)
SELECT l.subject AS left, 
       r.subject AS right, 
       COUNT(*) AS count
FROM subjects AS l 
    JOIN subjects AS r ON l.id = r.id 
                        AND l.subject < r.subject
GROUP BY left, right
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()
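
The self-join with `l.subject < r.subject` lists every unordered pair of subjects once per publication: the strict inequality drops pairs of a subject with itself and keeps only one of the two possible orderings. The plain-Python sketch below does the same with `itertools.combinations` on made-up data. Sets are used here, so a subject repeated within a publication counts once.

```python
from collections import Counter
from itertools import combinations

# Made-up subjects per publication id
subjects_by_pub = {
    "p1": {"physics", "math"},
    "p2": {"physics", "math", "biology"},
    "p3": {"biology"},
}

pair_counts = Counter()
for subjects in subjects_by_pub.values():
    # sorted() guarantees left < right, like l.subject < r.subject in the query
    for left, right in combinations(sorted(subjects), 2):
        pair_counts[(left, right)] += 1

print(pair_counts.most_common(1))
```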

### **Task:** show the number of research products per organization 

The relation used is `isAuthorInstitutionOf`. Its name suggests the affiliation of an individual to an organisation, but in our data it links organizations to research products, not to authors.

Organization short names can be missing (null), so the legal name can serve as a fallback in `COALESCE`.

In [None]:
query = """
SELECT legalShortName, legalName
FROM organizations 
WHERE legalShortName IS NULL 
    AND legalName IS NOT NULL
LIMIT 20
"""

spark.sql(query).toPandas()

In [None]:
query = """
SELECT COALESCE(legalShortName, legalName) AS organization,
        COUNT(*) AS count 
FROM organizations 
    JOIN relations ON organizations.id = relations.source
                    AND relType.name = 'isAuthorInstitutionOf' 
GROUP BY organization
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the number of ALL research products (per type) per organization

In [None]:
query = """
SELECT COALESCE(legalShortName, legalName) AS organization, 
       COUNT(*) AS total,
       COUNT(IF(type = 'publication', 1, NULL)) AS publication,
       COUNT(IF(type = 'dataset', 1, NULL)) AS dataset,
       COUNT(IF(type = 'software', 1, NULL)) AS software,
       COUNT(IF(type = 'other', 1, NULL)) AS other
FROM (SELECT id, type FROM publications 
        UNION 
      SELECT id, type FROM datasets
        UNION
      SELECT id, type FROM software
        UNION
      SELECT id, type FROM others) as products 
    JOIN organizations 
    JOIN relations ON organizations.id = relations.source
                    AND products.id = relations.target 
                    AND relType.name = 'isAuthorInstitutionOf' 
GROUP BY organization 
ORDER BY total DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show publications access types per organization

In [None]:
query = """
SELECT COALESCE(legalShortName, legalName) AS organization, 
       COUNT(*) as total,
       COUNT(IF(bestAccessRight.label = 'OPEN', 1, NULL)) AS open,
       COUNT(IF(bestAccessRight.label = 'EMBARGO', 1, NULL)) AS embargo,
       COUNT(IF(bestAccessRight.label = 'CLOSED', 1, NULL)) AS closed
FROM organizations 
    JOIN relations 
    JOIN publications ON organizations.id = relations.source 
                        AND publications.id = relations.target 
                        AND relType.name = 'isAuthorInstitutionOf'
GROUP BY organization
ORDER BY total DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the result access types per country of the organizations

In [None]:
query = """
SELECT organizations.country.code AS country, 
       COUNT(*) AS total,
       COUNT(IF(bestAccessRight.label = 'OPEN', 1, NULL)) AS open,
       COUNT(IF(bestAccessRight.label = 'EMBARGO', 1, NULL)) AS embargo,
       COUNT(IF(bestAccessRight.label = 'CLOSED', 1, NULL)) AS closed
FROM (SELECT id, bestAccessRight FROM publications 
        UNION 
      SELECT id, bestAccessRight FROM datasets
        UNION
      SELECT id, bestAccessRight FROM software
        UNION
      SELECT id, bestAccessRight FROM others) as products 
    JOIN organizations 
    JOIN relations ON organizations.id = relations.source
                    AND products.id = relations.target 
                    AND relType.name = 'isAuthorInstitutionOf'
WHERE organizations.country IS NOT NULL
GROUP BY organizations.country.code
ORDER BY total DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show international project collaborations; focus on organizations

In [None]:
query = """
WITH countryProject AS (
    SELECT country.code AS country, 
           target AS id 
    FROM organizations 
        JOIN relations ON reltype.name = 'isParticipant' 
                        AND source = organizations.id
    WHERE country IS NOT NULL
)
SELECT l.country AS left, 
       r.country AS right, 
       COUNT(*) AS count 
FROM countryProject AS l 
    JOIN countryProject AS r ON l.id = r.id 
                                AND l.country < r.country
GROUP BY left, right 
ORDER BY count DESC
"""

spark.sql(query).toPandas() 

### **Task:** show the organisations collaborating in projects most often

In [None]:
query = """
WITH orgProject AS (
    SELECT COALESCE(legalshortname, legalname) AS organization, 
           target AS id 
    FROM organizations 
    JOIN relations ON reltype.name = 'isParticipant' 
                    AND source = organizations.id
)
SELECT l.organization AS left,
       r.organization AS right,
       COUNT(*) AS count
FROM orgProject AS l 
    JOIN orgProject AS r ON l.id = r.id 
                            AND l.organization < r.organization
GROUP BY left, right 
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the organizations co-authoring papers most often

In [None]:
query = """
WITH orgProduct AS (
    SELECT COALESCE(legalshortname, legalname) AS organization, 
           target AS id 
    FROM organizations 
        JOIN relations ON reltype.name = 'isAuthorInstitutionOf' 
                        AND source = organizations.id
)
SELECT l.organization AS left, 
       r.organization AS right,
       COUNT(*) AS count 
FROM orgProduct AS l 
    JOIN orgProduct AS r ON l.id = r.id 
                        AND l.organization < r.organization
GROUP BY left, right 
ORDER BY count DESC
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the number of publications supplemented by datasets

In [None]:
query = """
SELECT COUNT(*) AS count
FROM relations 
    JOIN publications 
    JOIN datasets ON reltype.name = 'IsSupplementedBy' 
                    AND publications.id = relations.source 
                    AND datasets.id = relations.target
LIMIT 20
"""

spark.sql(query).toPandas()

### **Task:** show the number of peer-reviewed publications with a DOI, split by `openAccessColor` and the availability of a green deposition

In [None]:
query = """

SELECT COUNT(id), isGreen, openAccessColor
FROM publications
WHERE ARRAY_CONTAINS(instances.refereed, 'peerReviewed') 
    AND  ARRAY_CONTAINS(pids.scheme, 'doi')
    AND isGreen IS NOT NULL 
    AND openAccessColor IS NOT NULL
GROUP BY openAccessColor, isGreen
LIMIT 20
"""
spark.sql(query).toPandas()

## Passing data to other libraries

### **Task:** show the collaboration network of countries participating in projects, based on the participating organizations

In [None]:
query = """
WITH countryProject AS (
    SELECT country.code AS country, 
           target AS id 
    FROM organizations JOIN relations ON reltype.name = 'isParticipant' AND source = organizations.id
    WHERE country IS NOT NULL
)
SELECT l.country AS left, 
       r.country AS right,
       COUNT(*) AS count 
FROM countryProject AS l 
    JOIN countryProject AS r ON l.id = r.id 
                                AND l.country <= r.country
GROUP BY left, right 
ORDER BY count DESC
"""

edges = spark.sql(query).toPandas()
edges

Results can be modelled as a graph and analysed. Let's do so with igraph, feeding it the country pairs and the number of projects in which both countries participate.

In [None]:
import igraph as ig

G = ig.Graph.TupleList(
    edges=edges[['left', 'right', 'count']].values,
    vertex_name_attr='countrycode',
    edge_attrs = ['weight'],
    directed=False)
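
To get a feel for what such a graph holds, the same kind of edge list can be stored in a plain dictionary of dictionaries. This is a sketch on made-up country pairs and weights, independent of igraph. Note the `("IT", "IT", 3)` row: the query uses `<=`, so pairs of a country with itself (self-loops) are part of the result.

```python
from collections import defaultdict

# Made-up (left, right, count) rows shaped like the query result
edge_rows = [("DE", "IT", 12), ("FR", "IT", 7), ("IT", "IT", 3), ("DE", "FR", 5)]

# Undirected weighted graph: store every edge in both directions
adjacency = defaultdict(dict)
for left, right, weight in edge_rows:
    adjacency[left][right] = weight
    adjacency[right][left] = weight

# Neighbours of Italy and total number of shared projects, ignoring the self-loop
neighbours_of_it = sorted(n for n in adjacency["IT"] if n != "IT")
strength_it = sum(w for n, w in adjacency["IT"].items() if n != "IT")

print(neighbours_of_it, strength_it)
```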

Let's find the number of nodes in the graph.

In [None]:
G.vcount()

Now, let's find the number of edges.

In [None]:
G.ecount()

A generic node can be inspected like so:

In [None]:
G.vs[0]

A generic edge, instead, can be inspected with:

In [None]:
G.es[0]

Ok. Let's try to plot something. The whole network looks like this.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 10))
max_w = np.max(G.es['weight'])
ig.plot(G, vertex_label=G.vs['countrycode'], vertex_size=10, edge_width=2, edge_color='gray', target=ax)

Now, let's filter by country, say Italy (i.e., IT).

In [None]:
H = G.induced_subgraph(G.neighborhood(G.vs.find(countrycode_eq = 'IT')))
H.summary()

In [None]:
H.vs['color'] = 'blue'
H.vs.find(countrycode_eq = 'IT')['color'] = 'red'
fig, ax = plt.subplots(figsize=(10, 10))
ig.plot(H, vertex_label=H.vs['countrycode'], vertex_size=30, edge_width=2, edge_color='gray',target=ax)

Still a lot of collaborations there. Let's plot a country with fewer collaborations, say Maldives (i.e., MV).

In [None]:
H = G.induced_subgraph(G.neighborhood(G.vs.find(countrycode_eq = 'MV'))) # Maldives
H.summary()

In [None]:
H.vs['color'] = 'blue'
H.vs.find(countrycode_eq = 'MV')['color'] = 'red'
fig, ax = plt.subplots(figsize=(10, 10))
ig.plot(H, vertex_label=H.vs['countrycode'], vertex_size=30, edge_width=2, edge_color='gray',target=ax)