# Tests with the Software Heritage API

Documentation:

- [rate limit](https://archive.softwareheritage.org/api/#rate-limiting)
- [API endpoints](https://archive.softwareheritage.org/api/1/)

In [1]:
import json


import requests

## Configuration

- [Create a token](https://archive.softwareheritage.org/oidc/profile/#tokens)
- API root: https://archive.softwareheritage.org

In [2]:
token = "XXXXX"
base_url = "https://archive.softwareheritage.org"
headers = {"Authorization": f"Bearer {token}"}
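To keep the token out of the notebook, the headers could be built from an environment variable instead. This is a minimal sketch: the variable name `SWH_TOKEN` and the helper `build_headers` are hypothetical, not part of the Software Heritage tooling.

```python
import os


def build_headers(env_var="SWH_TOKEN"):
    """Return request headers, with a bearer token when the variable is set.

    `SWH_TOKEN` is a hypothetical variable name chosen for this example.
    """
    token = os.environ.get(env_var)
    if not token:
        # No token available: requests will be sent unauthenticated
        return {}
    return {"Authorization": f"Bearer {token}"}


headers = build_headers()
```

With this approach the token is set once in the shell (for example `export SWH_TOKEN=...`) before starting the notebook.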

## Check that token authentication works

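The `/api/1/ping/` [endpoint](https://archive.softwareheritage.org/api/1/ping/) checks that the API is up and responding.

Authenticated users are limited to [1200 requests per hour](https://archive.softwareheritage.org/api/#rate-limiting). The limit applied to a request is returned in the `X-RateLimit-Limit` response header, so it can be used to check that the token was accepted.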

In [3]:
endpoint = "/api/1/ping/"
response = requests.get(base_url + endpoint, headers=headers)

print(f"Status code: {response.status_code}")
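# The hourly quota applied to this request is reported in a response header
request_per_hour = response.headers['X-RateLimit-Limit']
print(f"Max number of requests per hour: {request_per_hour}")
# Heuristic: a quota of 1200 requests per hour means the token was accepted
if request_per_hour == "1200":
    print("User is authenticated")
else:
    print("User is not authenticated. Check the token.")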

Status code: 200
Max number of requests per hour: 1200
User is authenticated


In [4]:
print("Detailed headers:")
print(json.dumps(dict(response.headers), indent=4))

Detailed headers:
{
    "Date": "Thu, 24 Oct 2024 15:01:07 GMT",
    "Content-Type": "application/json",
    "Content-Length": "6",
    "Vary": "Accept, origin, Cookie",
    "Allow": "HEAD, GET, OPTIONS, OPTIONS",
    "X-RateLimit-Limit": "1200",
    "X-RateLimit-Remaining": "1194",
    "X-RateLimit-Reset": "1729782071",
    "X-Frame-Options": "DENY",
    "X-Content-Type-Options": "nosniff",
    "Referrer-Policy": "same-origin",
    "Cross-Origin-Opener-Policy": "same-origin",
    "X-Varnish": "334008188",
    "Age": "0",
    "Via": "1.1 varnish (Varnish/6.5)",
    "Strict-Transport-Security": "max-age=15768000;",
    "Accept-Ranges": "bytes",
    "Connection": "keep-alive"
}
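The rate-limit headers shown above can be turned into something easier to read. The sketch below assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds, which is consistent with the `Date` header of the same response; `rate_limit_status` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def rate_limit_status(headers):
    """Return the remaining request count and the reset time (UTC).

    Assumes X-RateLimit-Reset is expressed in seconds since the Unix epoch.
    """
    remaining = int(headers["X-RateLimit-Remaining"])
    reset_at = datetime.fromtimestamp(
        int(headers["X-RateLimit-Reset"]), tz=timezone.utc
    )
    return remaining, reset_at


# Values copied from the headers printed above
sample_headers = {
    "X-RateLimit-Limit": "1200",
    "X-RateLimit-Remaining": "1194",
    "X-RateLimit-Reset": "1729782071",
}
remaining, reset_at = rate_limit_status(sample_headers)
print(f"{remaining} requests left, window resets at {reset_at.isoformat()}")
```

In the notebook, `rate_limit_status(response.headers)` would work the same way, since `response.headers` supports lookup by header name.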


In [5]:
response.content

b'"pong"'

Expected answer: `pong`. The body is the JSON string `"pong"`, so `response.json()` returns `pong` without the quotes.

## Get the latest archive date (if any)

Use the `/api/1/origin/{origin_url}/visit/latest/` [endpoint](https://archive.softwareheritage.org/api/1/origin/visit/latest/doc/), where `{origin_url}` is the URL of the repository.

With a repo already archived:

In [6]:
repo_url = "https://github.com/pierrepo/biopyassistant"
endpoint = f"/api/1/origin/{repo_url}/visit/latest/"
response = requests.get(base_url + endpoint, headers=headers)

print(f"Status code: {response.status_code}")
print("Content:")
print(json.dumps(response.json(), indent=4))

Status code: 200
Content:
{
    "origin": "https://github.com/pierrepo/biopyassistant",
    "visit": 7,
    "date": "2024-10-18T16:14:44.398000+00:00",
    "status": "full",
    "snapshot": "f3fda6d6ccb5b4258cd656a37a4c2903a849df9b",
    "type": "git",
    "metadata": {},
    "origin_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/pierrepo/biopyassistant/get/",
    "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/f3fda6d6ccb5b4258cd656a37a4c2903a849df9b/"
}


In [7]:
print(f"Latest archive: {response.json()['date']}")

Latest archive: 2024-10-18T16:14:44.398000+00:00


If a repo has already been archived in Software Heritage:

- the status code is 200
- the date of the latest visit is provided in the `date` field
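The `date` field is an ISO 8601 string, so it can be parsed with the standard library to compute how old the latest visit is. A minimal sketch, where `days_since` is a hypothetical helper and the date string is the one returned above:

```python
from datetime import datetime, timezone


def days_since(visit_date, now=None):
    """Return the number of whole days elapsed since `visit_date`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - visit_date).days


# Date string copied from the response above
visit_date = datetime.fromisoformat("2024-10-18T16:14:44.398000+00:00")
print(f"Latest visit was {days_since(visit_date)} day(s) ago")
```

A check like this could be used to decide whether a new archive request is worth sending.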

With a repo not archived:

In [8]:
repo_url = "https://github.com/pierrepo/thisrepodoesnotexist"
endpoint = f"/api/1/origin/{repo_url}/visit/latest/"
response = requests.get(base_url + endpoint, headers=headers)

print(f"Status code: {response.status_code}")
print("Content:")
print(json.dumps(response.json(), indent=4))

Status code: 404
Content:
{
    "exception": "NotFoundExc",
    "reason": "Origin with url https://github.com/pierrepo/thisrepodoesnotexist not found!"
}


If a repo has never been archived in Software Heritage:

- the status code is 404
- the body reports a `NotFoundExc` exception with a `reason` message
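The two cases (200 and 404) can be folded into one small function. This is a sketch based only on the two responses observed in this notebook; `latest_visit_date` is a hypothetical helper, and any other status code is treated as an error.

```python
def latest_visit_date(status_code, payload):
    """Return the latest visit date string, or None if the origin is unknown.

    Based on the two responses observed in this notebook:
    200 with a `date` field, or 404 for an origin that was never archived.
    """
    if status_code == 200:
        return payload.get("date")
    if status_code == 404:
        return None
    raise RuntimeError(f"Unexpected status code {status_code}: {payload}")


# Payloads shaped like the two responses shown in this section
archived = {"date": "2024-10-18T16:14:44.398000+00:00", "status": "full"}
unknown = {"exception": "NotFoundExc", "reason": "Origin not found"}
print(latest_visit_date(200, archived))
print(latest_visit_date(404, unknown))
```

In the notebook it would be called as `latest_visit_date(response.status_code, response.json())`.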

## Archive a repo

Use the `/api/1/origin/save/{visit_type}/url/{origin_url}/` [endpoint](https://archive.softwareheritage.org/api/1/origin/save/doc/)

One way to find GitHub repositories that are not yet archived is to look at [trending repositories](https://github.com/trending).


In [9]:
repo_url = "https://github.com/langgenius/dify"
repo_type = "git"
endpoint = f"/api/1/origin/save/{repo_type}/url/{repo_url}/"
# We need to use a POST request to trigger an archive
response = requests.post(base_url + endpoint, headers=headers)

print(f"Status code: {response.status_code}")
print("Content:")
print(json.dumps(response.json(), indent=4))

Status code: 200
Content:
{
    "id": 1705270,
    "origin_url": "https://github.com/langgenius/dify",
    "visit_type": "git",
    "save_request_date": "2024-10-24T15:01:08.856227+00:00",
    "save_request_status": "accepted",
    "save_task_status": "pending",
    "visit_status": null,
    "visit_date": null,
    "loading_task_id": 416908143,
    "note": null,
    "from_webhook": false,
    "webhook_origin": null,
    "snapshot_swhid": null,
    "next_run": "2024-10-24T15:01:08.693120Z",
    "request_url": "https://archive.softwareheritage.org/api/1/origin/save/1705270/"
}
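The request is only queued at this point (`save_task_status` is `pending`), so its progress has to be checked later, for instance with a GET on the `request_url` returned above. The sketch below polls until the task finishes. It assumes that `succeeded` and `failed` are the final values of `save_task_status`; `wait_for_save` and its parameters are hypothetical names. The fetch step is passed in as a function so that the loop can be tried without network access.

```python
import time

# Assumption: these values of `save_task_status` mean the task is over
TERMINAL_STATUSES = {"succeeded", "failed"}


def wait_for_save(fetch_status, interval=60, max_checks=10, sleep=time.sleep):
    """Call `fetch_status` until the save task finishes or checks run out.

    `fetch_status` is any function returning the save request as a dict.
    Returns the last payload received (None if `max_checks` is 0).
    """
    payload = None
    for attempt in range(max_checks):
        payload = fetch_status()
        if payload.get("save_task_status") in TERMINAL_STATUSES:
            break
        if attempt < max_checks - 1:
            # Wait before the next check; each check counts against the quota
            sleep(interval)
    return payload


# Offline demonstration with canned payloads instead of real API calls
canned = iter([
    {"save_task_status": "pending"},
    {"save_task_status": "succeeded"},
])
result = wait_for_save(lambda: next(canned), interval=0)
print(result["save_task_status"])
```

In the notebook, the fetch function could be `lambda: requests.get(response.json()["request_url"], headers=headers).json()`. A long interval keeps the polling well within the hourly rate limit.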
