# Programmatically Querying a Knowledge Graph
 - A Jupyter notebook template for programmatically querying a knowledge graph through the SPARQL endpoint of a triplestore
 - It lets you interact with RDF-based knowledge graphs directly from Python, using programmatic interfaces instead of manual web-based query tools
    - Programmatic querying supports automation, repeatability, and scalability in workflows where semantic data needs to be extracted, transformed, or analyzed at regular intervals
    - It helps in building software or AI applications that use knowledge graphs (KGs) as a knowledge source

## Prerequisite: Fuseki Setup (IMPORTANT!)
- Before running this notebook, make sure you either have the endpoint URL of an existing [Apache Jena Fuseki](https://jena.apache.org/documentation/fuseki2/) server or have Fuseki installed locally on your system.
- Fuseki is the server we’ll use to store and query RDF data using SPARQL.
### What You Need to Do:
#### External Server (admin access is usually disabled, so it is read-only)
 - Have your triplestore endpoint URL ready. An example endpoint looks like this: `https://stko-kwg.geog.ucsb.edu/sparql`
#### Local Server
 - Install Fuseki from the [Apache Jena download](https://jena.apache.org/download/index.cgi) site.
 - Set the `JENA_HOME` environment variable to the folder where you installed Fuseki.
    - Example on Windows:
        ```bat
        set JENA_HOME=C:\Programs\apache-jena-fuseki
        ```
    - Example on Linux or macOS:
        ```bash
        export JENA_HOME=/Users/yourname/apache-jena-fuseki
        ```
 - Start the server using the command below (from this notebook or a terminal):

In [None]:
# This command tries to launch the Apache Jena Fuseki server from your local installation.
# Make sure the environment variable JENA_HOME points to your Fuseki installation directory.
# If you see an error like 'command not found', JENA_HOME might not be set correctly or Fuseki isn't installed.
# The forward slash below is for Linux/macOS shells; on Windows, adapt the variable syntax and path separator (e.g. %JENA_HOME%).
!$JENA_HOME/fuseki start
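
Once the server has been started, it can help to confirm that it is actually reachable before moving on. The sketch below assumes a local Fuseki on the default port 3030 and relies on Fuseki's `/$/ping` admin endpoint; the helper names `ping_url` and `is_fuseki_up` are illustrative and not part of any library. It uses only the standard library, so it can run before the packages in the next section are installed.

```python
import urllib.error
import urllib.request


def ping_url(base_url):
    """Build the URL of Fuseki's ping endpoint from the server base URL."""
    return base_url.rstrip("/") + "/$/ping"


def is_fuseki_up(base_url="http://localhost:3030", timeout=3):
    """Return True if the server answers the ping request with HTTP 200."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_fuseki_up()` in a cell shortly after starting the server should return `True` once Fuseki is accepting connections; the server may need a few seconds to come up.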

### Install Required Packages
 - Before we can interact with RDF data or send HTTP requests, we need to make sure a few Python libraries are installed.


### What These Do:
 - SPARQLWrapper: Helps Python talk to SPARQL endpoints (like Fuseki)
 - requests: Lets Python send web requests (like fetching data from URLs)
 - pandas: For working with tables and dataframes
 
 If you see “Requirement already satisfied,” that’s good news: it means you already have the package!

In [None]:
# Installing required Python libraries for working with RDF data and HTTP requests.
!pip3 install SPARQLWrapper requests pandas
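
To double-check that the installation worked in the kernel you are actually using, one option is to ask Python whether each module can be located. This is a minimal sketch; `missing_packages` is a hypothetical helper, and it only checks that a module can be found, not which version is installed.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names of the packages installed by the pip command above.
required = ["SPARQLWrapper", "requests", "pandas"]
missing = missing_packages(required)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```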

### Import Libraries
 - Once the packages are installed, we import them into Python so we can use them

### Here’s what each one does:
- os: Used to access environment variables like `JENA_HOME`
- requests: For sending HTTP GET/POST requests
- SPARQLWrapper: Core library for querying SPARQL endpoints
- pandas: For manipulating tabular data (great for SPARQL results!)
- pprint: Makes output prettier and easier to read

In [None]:
# Importing required Python libraries for working with RDF data and HTTP requests.
import os
import requests
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import pprint

### Function to Create Dataset

In [None]:
# Function to create a new dataset in Fuseki (It requires the admin URL and dataset name).
# The function also allows specifying the type of persistence (e.g., TDB2 or in-memory).
def create_fuseki_dataset(admin_url, dataset_name, persistent_type = "tdb2"): # Use "mem" for in-memory

    payload = {
        "dbName": dataset_name,
        "dbType": persistent_type
    }
    
    # Send a POST request to the Fuseki admin endpoint to create the dataset.
    response = requests.post(admin_url, data=payload)

    # Check the response status code and print an appropriate message.
    if response.status_code in (200, 201):
        print(f"Dataset '{dataset_name}' created successfully.")
    # Fuseki is expected to answer 409 Conflict when the name is already registered.
    elif response.status_code == 409 or "already exists" in response.text:
        print(f"Dataset '{dataset_name}' already exists.")
    else:
        print("Error creating dataset:", response.status_code, response.text)
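
Before creating a dataset, it can be useful to see which ones the server already has. A `GET` on the same admin URL returns a JSON listing; the sketch below only parses such a listing. The helper names are hypothetical, and the response shape (a `"datasets"` array whose entries carry the name under `"ds.name"` with a leading slash) is an assumption to verify against your Fuseki version.

```python
def dataset_names(listing):
    """Extract dataset names from the decoded JSON of a GET on the admin URL.

    Assumes each entry stores its name under "ds.name" with a leading
    slash (e.g. "/mydataset"); adjust if your server responds differently.
    """
    return [entry["ds.name"].lstrip("/") for entry in listing.get("datasets", [])]


def dataset_exists(listing, dataset_name):
    """Return True if dataset_name appears in the listing."""
    return dataset_name in dataset_names(listing)


# Hand-written sample shaped like the assumed response, for illustration only.
sample_listing = {
    "datasets": [
        {"ds.name": "/mydataset", "ds.state": True},
        {"ds.name": "/other", "ds.state": True},
    ]
}
print(dataset_names(sample_listing))                # ['mydataset', 'other']
print(dataset_exists(sample_listing, "mydataset"))  # True
```

Against a live server, the listing could be obtained with `requests.get(admin_url).json()`, provided the admin endpoint is enabled and accessible.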

### Function to Upload TTL File

In [None]:
# Function to upload a Turtle (.ttl) file to a dataset on the Fuseki server.
# The function takes the base URL of the Fuseki server, the dataset name, and the path of the Turtle file.
def upload_ttl_file(fuseki_base, dataset_name, file_path):
    # Constructing the URL for uploading data to the Fuseki server.
    data_url = f"{fuseki_base}/{dataset_name}/data"
    headers = {"Content-Type": "text/turtle"}

    # Reading the ttl file and sending it to the Fuseki server.
    with open(file_path, "rb") as f:
        response = requests.post(data_url, headers=headers, data=f) # sends a POST request to the Fuseki server to upload the data.

    # Checking the response status code and printing appropriate messages.
    if response.status_code in (200, 201):
        print(f"File '{file_path}' uploaded successfully.")
    else:
        print("Error uploading file:", response.status_code, response.text)
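
The upload function above hard-codes `text/turtle`, which is right for `.ttl` files only. If you need to load other RDF serializations, the `Content-Type` header has to match the file. The following sketch picks the media type from the file extension; `rdf_content_type` and the mapping are illustrative helpers, not part of the notebook's original code.

```python
from pathlib import Path

# Registered media types for a few common RDF serializations.
RDF_CONTENT_TYPES = {
    ".ttl": "text/turtle",
    ".nt": "application/n-triples",
    ".rdf": "application/rdf+xml",
    ".jsonld": "application/ld+json",
}


def rdf_content_type(file_path):
    """Return the media type for an RDF file, based on its extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in RDF_CONTENT_TYPES:
        raise ValueError(f"Unrecognized RDF file extension: {suffix!r}")
    return RDF_CONTENT_TYPES[suffix]


print(rdf_content_type("example_data.ttl"))  # text/turtle
```

The returned value could then replace the fixed string in the `headers` dictionary of `upload_ttl_file`.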


## If you want to start from scratch, use the instructions below

### Setting Up Apache Jena Fuseki
 - Before you can run SPARQL queries or interact with your knowledge graph, you need to have Apache Jena Fuseki set up and running.

### Set Fuseki Base and Dataset Name
 - Set your Fuseki base URL as the default value of the `fuseki_base` variable, or store it in an environment variable and read it from there

In [None]:
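# Get the Fuseki base URL from an environment variable, falling back to a default.
# JENA_HOME holds the installation directory, not a URL, so a separate variable is used here
# (the name FUSEKI_BASE_URL is an assumption; pick any name you like).
fuseki_base = os.getenv("FUSEKI_BASE_URL", "http://arsenal.cs.wright.edu:3030")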
# Default dataset name
dataset_name = "mydataset"  # Change to your preferred new dataset name

# Full endpoint and admin URLs
endpoint_url = f"{fuseki_base}/{dataset_name}/sparql"
admin_url = f"{fuseki_base}/$/datasets"

print("Fuseki base URL:", fuseki_base)
print("SPARQL endpoint:", endpoint_url)

Fuseki base URL: http://arsenal.cs.wright.edu:3030
SPARQL endpoint: http://arsenal.cs.wright.edu:3030/mydataset/sparql
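
The URLs above are built with plain string formatting, so a trailing slash in the base URL or a leading slash in the dataset name would produce a double slash in the result. A small helper can normalize both. This is a sketch; `fuseki_urls` is a hypothetical name, and it only covers the three paths this notebook uses.

```python
def fuseki_urls(base_url, dataset_name):
    """Return the query, data, and admin URLs for a dataset, without double slashes."""
    base = base_url.rstrip("/")
    name = dataset_name.strip("/")
    return {
        "query": f"{base}/{name}/sparql",
        "data": f"{base}/{name}/data",
        "admin": f"{base}/$/datasets",
    }


urls = fuseki_urls("http://localhost:3030/", "/mydataset")
print(urls["query"])  # http://localhost:3030/mydataset/sparql
```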


### Create Dataset

In [None]:
# Function call to create a new dataset
# You can change the dataset name and persistent type as needed.
# For example, use "mem" for in-memory datasets or "tdb2" for persistent datasets.
# create_fuseki_dataset(admin_url, dataset_name, persistent_type="mem")
create_fuseki_dataset(admin_url, dataset_name)

Dataset 'mydataset' created successfully.


### Upload Your TTL File

In [None]:
# Function call to upload a TTL file to the dataset
# You can replace "example_data.ttl" with the path to your actual TTL file.
# Make sure the file exists in the specified path.
upload_ttl_file(fuseki_base, dataset_name, "example_data.ttl")

File 'example_data.ttl' uploaded successfully.


## Use Existing Dataset in Triplestore
### If you want to use an existing dataset in Fuseki

In [None]:
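# Get the Fuseki base URL from an environment variable, falling back to a local default.
# JENA_HOME holds the installation directory, not a URL, so a separate variable is used here
# (the name FUSEKI_BASE_URL is an assumption; pick any name you like).
fuseki_base = os.getenv("FUSEKI_BASE_URL", "http://localhost:3030")

dataset_name = "mydataset"  # Change to the name of your existing dataset in Fuseki

# Full SPARQL endpoint URL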
endpoint_url = f"{fuseki_base}/{dataset_name}/sparql"

print("Fuseki base URL:", fuseki_base)
print("SPARQL endpoint:", endpoint_url)

### Define SPARQL Query

- This cell defines a SPARQL query to retrieve distinct topics and their names from the dataset.
- The query uses the edu-ont ontology to find resources of type 'edu-ont:Topic' and their associated names.
- The result is limited to 10 entries for brevity.
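
If more than the first 10 entries are needed, one common approach is to page through the results with `LIMIT` and `OFFSET`. The sketch below shows one way to do that; the helper `paged_query` is hypothetical, and it assumes the base query carries an `ORDER BY` and no `LIMIT` of its own, since paging without a stable order can skip or repeat rows.

```python
# Same query as in this notebook, with ORDER BY added and LIMIT removed.
TOPIC_QUERY = """
PREFIX edu-ont: <https://edugate.cs.wright.edu/lod/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?topic ?topicName
WHERE {
  ?topic rdf:type edu-ont:Topic ;
         edu-ont:asString ?topicName .
}
ORDER BY ?topicName
"""


def paged_query(base_query, limit, offset=0):
    """Append LIMIT and OFFSET clauses to a query that has neither."""
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    return f"{base_query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


# Third page of 10 results (rows 21 to 30).
print(paged_query(TOPIC_QUERY, limit=10, offset=20))
```

Each page can be sent to the endpoint in turn, stopping when a page comes back with fewer rows than `limit`.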

In [None]:
# A SPARQL query to retrieve distinct topics and their names from the dataset.
query = """
PREFIX edu-ont: <https://edugate.cs.wright.edu/lod/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?topic ?topicName
WHERE {
  ?topic rdf:type edu-ont:Topic ;
         edu-ont:asString ?topicName .
}
LIMIT 10 
"""

### Run Query with SPARQLWrapper

In [None]:
# Initialize SPARQLWrapper with the SPARQL endpoint URL
sparql = SPARQLWrapper(endpoint_url) # This sets up the connection to the Fuseki server for executing SPARQL queries.

# Set the SPARQL query to be executed
sparql.setQuery(query) # The query is defined in the previous cell and retrieves distinct topics and their names.

# Set the return format of the query results to JSON
sparql.setReturnFormat(JSON) # JSON format is easier to parse and work with in Python.


# Execute the query and convert the results to a Python dictionary
results = sparql.query().convert() # This sends the query to the Fuseki server and retrieves the results.
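
For readers curious about what happens on the wire, the SPARQL 1.1 Protocol allows a query to be sent as an HTTP `GET` with the query text in a `query` parameter and the desired result format in the `Accept` header. The sketch below builds such a request with the standard library; it illustrates the protocol and is not a description of SPARQLWrapper's internals. The function name and the localhost endpoint are assumptions.

```python
import urllib.parse
import urllib.request


def sparql_get_request(endpoint_url, query):
    """Build a 'query via GET' request that asks for JSON results."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


request = sparql_get_request(
    "http://localhost:3030/mydataset/sparql",  # assumed local endpoint
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` and decoding the body with `json.load` would yield the same dictionary structure that `sparql.query().convert()` returns for the JSON format.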

### Show Results as DataFrame

In [None]:
# Extract the "bindings" section from the SPARQL query results.
bindings = results["results"]["bindings"] # The "bindings" contain the actual data returned by the query.

# Transform the bindings into a list of dictionaries.
# Each dictionary represents a row of data, where the keys are variable names and the values are their corresponding values.
data = [
    {var: binding[var]["value"] for var in binding} # Extract the "value" field for each variable in the binding.
    for binding in bindings # Iterate over all bindings (rows of results).
]

# Convert the list of dictionaries into a pandas DataFrame.
# This makes it easier to work with the data in a tabular format.
df = pd.DataFrame(data)

# Display the DataFrame.
df

| | topic | topicName |
|---|---|---|
| 0 | https://edugate.cs.wright.edu/lod/resource/Top... | Coding |
| 1 | https://edugate.cs.wright.edu/lod/resource/Top... | Python |
| 2 | https://edugate.cs.wright.edu/lod/resource/Top... | Pandas |
| 3 | https://edugate.cs.wright.edu/lod/resource/Top... | Data Science Careers |
| 4 | https://edugate.cs.wright.edu/lod/resource/Top... | Probability |
| 5 | https://edugate.cs.wright.edu/lod/resource/Top... | Case Studies |
| 6 | https://edugate.cs.wright.edu/lod/resource/Top... | Machine Learning |
| 7 | https://edugate.cs.wright.edu/lod/resource/Top... | Natural Language Processing (NLP) |
| 8 | https://edugate.cs.wright.edu/lod/resource/Top... | Math Fundamentals |
| 9 | https://edugate.cs.wright.edu/lod/resource/Top... | Statistics |
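
The conversion above works well when every row binds every variable. With `OPTIONAL` patterns, a variable can be missing from a binding. One way to keep the columns stable is to take the column list from `head.vars` and fill unbound variables with `None`. This is a minimal sketch with a hypothetical helper name and hand-written sample data (the `example.org` URIs are placeholders).

```python
def bindings_to_rows(results):
    """Flatten SPARQL JSON results into (columns, rows).

    Column order comes from head.vars; variables that are unbound in a
    given solution (e.g. because of OPTIONAL) are filled with None.
    """
    columns = results["head"]["vars"]
    rows = [
        {var: (binding[var]["value"] if var in binding else None) for var in columns}
        for binding in results["results"]["bindings"]
    ]
    return columns, rows


# Hand-written sample in the SPARQL 1.1 Query Results JSON Format.
sample = {
    "head": {"vars": ["topic", "topicName"]},
    "results": {
        "bindings": [
            {
                "topic": {"type": "uri", "value": "http://example.org/topic/1"},
                "topicName": {"type": "literal", "value": "Python"},
            },
            # Second solution has no topicName bound.
            {"topic": {"type": "uri", "value": "http://example.org/topic/2"}},
        ]
    },
}
columns, rows = bindings_to_rows(sample)
print(rows)
```

The result can be handed to pandas with `pd.DataFrame(rows, columns=columns)`, which keeps the column order even when the result set is empty.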
