# Introduction to Python and SPARQL Queries in Jupyter Notebook



## Section 1: Python Basics

This section covers basic Python concepts such as variables, data types, lists, dictionaries, loops, and functions before transitioning to SPARQL queries.

In [None]:
# Defining variables and printing values
x = 10
y = 5
sum_xy = x + y
print("The sum of x and y is:", sum_xy)

In [None]:
# Lists and loops
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(f"Number: {num}")

In [None]:
# Dictionaries
data_dict = {"name": "Alice", "age": 25}
print("Name:", data_dict["name"])

In [None]:
# Function example
def greet(name):
    return f"Hello, {name}!"

print(greet("Bob"))

## Section 2: SPARQL Queries in Python

In this section, we introduce SPARQL and how to query RDF data using Python. 
We'll use the `SPARQLWrapper` library to send queries to a SPARQL endpoint.

In [None]:
%pip install pandas
%pip install SPARQLWrapper

In [1]:
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

Note: you may need to restart the kernel to use updated packages.


### Function: execute_sparql_query

This function takes a SPARQL endpoint and a query as input. 
It then executes the query and returns the results as a Pandas DataFrame.

#### Parameters:
- `endpoint` (str): The SPARQL endpoint URL.
- `query` (str): The SPARQL query to execute.

#### Returns:
- `pd.DataFrame`: The query results in a structured format.

In [51]:
def execute_sparql_query(endpoint, query):
    """Run a SPARQL SELECT query and return the results as a DataFrame."""
    # Initialize SPARQLWrapper with the endpoint
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    # Execute the query and retrieve the results
    results = sparql.query().convert()
    bindings = results["results"]["bindings"]

    # Return an empty DataFrame if there are no results
    if not bindings:
        return pd.DataFrame()

    # Take the column names from the response header, so variables that
    # are unbound in the first result entry are not lost
    columns = results["head"]["vars"]

    # Convert results to a list of dictionaries (None for unbound variables)
    data = []
    for row in bindings:
        data.append({col: row.get(col, {}).get("value") for col in columns})

    # Convert list of dictionaries to a Pandas DataFrame
    return pd.DataFrame(data)
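
To see what the flattening step does without contacting an endpoint, the sketch below applies it to a hand-written response in the SPARQL JSON results layout. The `sample_response` dictionary and the `flatten_bindings` helper are hypothetical and exist only for illustration; the second row omits `title` to mimic a variable that is unbound in that row.

```python
# Hypothetical, hand-written response in the SPARQL JSON results layout
sample_response = {
    "head": {"vars": ["wpIdentifier", "title"]},
    "results": {
        "bindings": [
            {
                "wpIdentifier": {"type": "literal", "value": "WP1"},
                "title": {"type": "literal", "value": "Example pathway"},
            },
            # 'title' is unbound in this row
            {"wpIdentifier": {"type": "literal", "value": "WP2"}},
        ]
    },
}


def flatten_bindings(response):
    """Turn SPARQL JSON bindings into a list of plain dictionaries."""
    columns = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        # Use None for variables that are unbound in this row
        rows.append({col: binding.get(col, {}).get("value") for col in columns})
    return rows


rows = flatten_bindings(sample_response)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give one column per query variable, with a missing value where a variable was unbound.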

## Section 3: Example SPARQL Query

We'll use the WikiPathways SPARQL endpoint to retrieve some example data. 

### Query Description:
The query retrieves the identifier, title, and organism name of every pathway in WikiPathways.
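
The query in this section uses the `dc:` and `wp:` prefixes without declaring them, which presumably relies on the endpoint predefining these namespaces. A more portable variant, sketched here with explicit `PREFIX` lines, could look as follows; the name `SPARQL_QUERY_1_PORTABLE` is hypothetical and is not used elsewhere in this notebook.

```python
# Hypothetical variant of the query with explicit PREFIX declarations,
# so it does not depend on namespaces predefined by the endpoint
SPARQL_QUERY_1_PORTABLE = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>

SELECT DISTINCT ?wpIdentifier ?title ?organismName
WHERE {
    ?pathway dc:identifier ?wpIdentifier ;
             dc:title ?title ;
             wp:organismName ?organismName .
}
"""

print(SPARQL_QUERY_1_PORTABLE)
```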

In [52]:
# Define the SPARQL endpoint
SPARQL_ENDPOINT = "https://sparql.wikipathways.org/sparql"

In [53]:
# Define the SPARQL query
SPARQL_QUERY_1 = """
SELECT DISTINCT ?wpIdentifier ?title ?organismName
WHERE {
    ?pathway dc:identifier ?wpIdentifier ;
             dc:title ?title ;
             wp:organismName ?organismName . 
 }
"""


### Running the Query

We now execute the query using our defined function and display the results as a Pandas DataFrame.

In [54]:
# Execute the SPARQL query and store results in a DataFrame
df = execute_sparql_query(SPARQL_ENDPOINT, SPARQL_QUERY_1)

### Displaying the Results

We now print the first few rows of the DataFrame to inspect the retrieved data.

In [None]:
display(df.head())

### Exploring the DataFrame

We now inspect the available columns and perform some filtering operations.

In [None]:
# Display column names
print("Columns in DataFrame:", df.columns.tolist())


### Filtering the DataFrame

Let's filter the DataFrame to display only results where 'organismName' contains 'Homo sapiens'.

In [None]:
filtered_df = df[df['organismName'].str.contains("Homo sapiens", na=False)]
display(filtered_df)
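
As a small illustration of this kind of filter, the sketch below compares a substring match with an exact match on a hypothetical toy table (`toy` and its contents are invented for illustration). It also shows why `na=False` matters: rows with a missing organism name are treated as non-matching instead of producing missing values in the mask.

```python
import pandas as pd

# Hypothetical toy data standing in for the query results
toy = pd.DataFrame({
    "wpIdentifier": ["WP1", "WP2", "WP3"],
    "organismName": ["Homo sapiens", "Mus musculus", None],
})

# Substring match; na=False treats missing names as non-matching
contains_mask = toy["organismName"].str.contains("Homo sapiens", na=False)

# Exact match; comparisons with missing values evaluate to False
exact_mask = toy["organismName"] == "Homo sapiens"

print(toy[contains_mask])
print(toy[exact_mask])
```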

### Enhancing the DataFrame with Additional Data

We now retrieve extra information about the pathways in our dataset, such as the number of DataNodes in each pathway, using another SPARQL query.

In [58]:
SPARQL_QUERY_2 = """
SELECT ?wpIdentifier (COUNT(DISTINCT ?dataNode) AS ?n_dataNode)
WHERE {
    ?pathway dc:identifier ?wpIdentifier .
    ?dataNode a wp:DataNode ;
              dcterms:isPartOf ?pathway .
}
GROUP BY ?wpIdentifier
"""

In [None]:
df_extra = execute_sparql_query(SPARQL_ENDPOINT, SPARQL_QUERY_2)

df_enriched = df.merge(df_extra, on="wpIdentifier", how="left")

display(df_enriched)
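
The behaviour of a left merge can be checked on a small example. The sketch below uses two hypothetical toy tables (`pathways` and `counts`, invented for illustration) and adds `indicator=True` to show which rows found a match.

```python
import pandas as pd

# Hypothetical toy tables standing in for df and df_extra
pathways = pd.DataFrame({
    "wpIdentifier": ["WP1", "WP2", "WP3"],
    "title": ["A", "B", "C"],
})
counts = pd.DataFrame({
    "wpIdentifier": ["WP1", "WP2"],
    "n_dataNode": ["12", "150"],  # values from SPARQL results are strings
})

# how="left" keeps every pathway; unmatched rows get NaN in n_dataNode
merged = pathways.merge(counts, on="wpIdentifier", how="left", indicator=True)
print(merged)
```

Rows marked `left_only` in the `_merge` column are pathways for which the second table had no count.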

## Section 4: Data Visualization

In this section, we rank the pathways by their number of DataNodes and then visualize the data retrieved from the SPARQL queries with two popular Python libraries, Matplotlib and Seaborn.

In [None]:
# Rank pathways based on n_dataNode (descending order)
df_sorted = df_enriched.sort_values(by="n_dataNode", ascending=False)

# Display the top pathways
print("Top-ranked pathways:")
display(df_sorted.head(10))

In [None]:
# Identify the type of values in the n_dataNode column before re-defining the column
print("Type of values in n_dataNode column before re-defining:", df_sorted['n_dataNode'].dtype)

# Convert n_dataNode to numeric
df_sorted['n_dataNode'] = pd.to_numeric(df_sorted['n_dataNode'], errors='coerce')

# Identify the type of values in the n_dataNode column after re-defining the column
print("Type of values in n_dataNode column after re-defining:", df_sorted['n_dataNode'].dtype)

The first ranking above tops out at 99 because the `n_dataNode` column was not interpreted as numeric: the values returned by the SPARQL query are strings, so the sort compared them character by character. In Section 3, however, we observed that multiple pathways have more than 100 DataNodes. The cell above therefore converts the column with `pd.to_numeric`, and we now sort the table again so that the ranking reflects the actual counts.
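
The effect of sorting numbers stored as text can be reproduced with plain Python lists. The values below are hypothetical and only serve as a minimal illustration.

```python
# Hypothetical counts stored as strings, as in the unconverted column
counts_as_text = ["99", "150", "12", "1034"]

# Sorting strings compares character by character, so "99" ranks first
text_order = sorted(counts_as_text, reverse=True)

# Converting to integers first gives the intended numeric ranking
numeric_order = sorted((int(c) for c in counts_as_text), reverse=True)

print(text_order)     # ['99', '150', '12', '1034']
print(numeric_order)  # [1034, 150, 99, 12]
```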

In [None]:
# Rank pathways based on n_dataNode (descending order)
df_sorted = df_sorted.sort_values(by="n_dataNode", ascending=False)

# Display the top pathways
print("Top-ranked pathways:")
display(df_sorted.head(10))

### Data visualization with Matplotlib
First, we install and import the `matplotlib` library.

In the next cells, we use Matplotlib to visualize the number of pathways across different species.

In [None]:
%pip install matplotlib

In [64]:
import matplotlib.pyplot as plt

In [None]:
# Distribution plot of pathways per species
species_counts = df_sorted["organismName"].value_counts()

plt.figure(figsize=(12, 6))
species_counts.plot(kind='bar', color='skyblue')
plt.xlabel("Species")
plt.ylabel("Number of Pathways")
plt.title("Distribution of Pathways per Species")
plt.show()

### Data visualization with Seaborn

In this section, we use Seaborn to create a box plot of the number of DataNodes per pathway, grouped by species. This shows the variability and spread of DataNode counts within each species.

First, we install and import the `seaborn` library.

In [None]:
%pip install seaborn

In [67]:
import seaborn as sns

In [None]:
# Create a box plot for the distribution of n_dataNode per species
plt.figure(figsize=(14, 8))
sns.boxplot(x='organismName', y='n_dataNode', data=df_sorted)
plt.xticks(rotation=90)
plt.xlabel("Species")
plt.ylabel("Number of Data Nodes")
plt.title("Distribution of Data Nodes per Species")
plt.show()
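
The box plot can be complemented with a numeric summary per species. The sketch below assumes a hypothetical toy table (`toy_nodes`, invented for illustration) with the same two columns as the plot; the same `groupby` call should work on the real table once `n_dataNode` is numeric.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the box plot input
toy_nodes = pd.DataFrame({
    "organismName": ["Homo sapiens"] * 3 + ["Mus musculus"] * 2,
    "n_dataNode": [10, 20, 30, 5, 15],
})

# One row per species with the count, minimum, median, and maximum
summary = toy_nodes.groupby("organismName")["n_dataNode"].agg(
    ["count", "min", "median", "max"]
)
print(summary)
```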

## Metadata


In [None]:
import sys
import pandas as pd
import matplotlib
import seaborn as sns
from SPARQLWrapper import __version__ as sparqlwrapper_version

# Print Python version
print("Python version:", sys.version)

# Print library versions
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("SPARQLWrapper version:", sparqlwrapper_version)
