## Overview

In this notebook we show how to filter and select relevant packages based on, for example, title, creation date or tags. First, we show how it's done using the [kblab](https://github.com/Kungbib/kblab) Python package. Second, we show how to do the same manually by constructing our own API calls and downloading the `meta.json` file of each package.

### Import kblab

In [1]:
import kblab 
from kblab import Archive

### Archive object

In the `kblab` package, an `Archive` object lets us iterate over all package ids in the database, or construct filters to iterate over a subset of them.

Calling the API requires us to submit authentication details (username and password). It is good practice not to write the password directly in scripts and notebooks, but to read it from a text file or environment file stored elsewhere on the computer.

In [8]:
# Read password from .txt file containing only the password
with open('/home/faton/projects/api_credentials.txt', 'r') as file:
    pw = file.read().replace('\n', '')
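
As an alternative to a text file, the password can be read from an environment variable. The following is a minimal sketch, not part of `kblab`: the helper `read_password` and the variable name `KBLAB_PASSWORD` are hypothetical names chosen for illustration.

```python
import os
from pathlib import Path


def read_password(env_var="KBLAB_PASSWORD", fallback_path=None):
    """Return the password from an environment variable, else from a file.

    Both `env_var` and `fallback_path` are illustrative; adapt them to
    wherever you actually keep your credentials.
    """
    value = os.environ.get(env_var)
    if value:
        return value.strip()
    if fallback_path is not None:
        # Same idea as the cell above: the file holds only the password
        return Path(fallback_path).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"Set {env_var} or pass fallback_path")
```

With the variable set in the shell that starts the notebook, `pw = read_password()` would then replace the file-reading cell.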

Now create the archive object with authentication.

In [9]:
a = Archive("https://datalab.kb.se", auth=("demo", pw))
a

<kblab.httparchive.HttpArchive at 0x7f4504f72160>

We can iterate over this object to get all package ids that exist in betalab.kb.se.

In [4]:
package_ids = [package_id for package_id in a]
print(len(package_ids))

796173


In [5]:
package_ids[0:5]

['sou-1922-10', 'sou-1922-1', 'sou-1922-11', 'sou-1922-12', 'sou-1922-13']

### Filter package ids based on metadata

To select only the package ids we are interested in, we can filter with the `.search()` method. 

For example, to select only packages relating to parliamentary minutes ("protokoll") we can search on the `tags` field with the value "protokoll". This returns a generator that allows us to iterate over these specific package ids. We can see there are **13440** of these in betalab.kb.se.

In [6]:
a.search({"tags": "protokoll"})

Result(start=0, n=13440, m=13440, keys=<generator object HttpArchive._search_iter at 0x7fb951f31970>, hits=<list_iterator object at 0x7fb96024a6d0>)

For this to be useful, we need to iterate over the generator and do something with the package ids. Below is once again a basic example, this time saving the relevant package ids in a list.

In [7]:
protokoll_package_ids = []
for package_id in a.search({"tags": "protokoll"}):
    protokoll_package_ids.append(package_id)

In [8]:
print(f"Number of protokoll: {len(protokoll_package_ids)}")
print(protokoll_package_ids[0:4])

Number of protokoll: 13440
['prot-1972--146', 'prot-1972--147', 'prot-1972--31', 'prot-1972--27']
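
Since collecting many thousands of ids takes a while, it can be convenient to store them on disk and reload them in later sessions. The sketch below assumes a plain text file with one id per line; `save_ids` and `load_ids` are hypothetical helper names, not part of `kblab`.

```python
from pathlib import Path


def save_ids(ids, path):
    """Write one package id per line and return how many were written."""
    ids = list(ids)  # also accepts a generator, such as the result of a search
    Path(path).write_text("\n".join(ids) + "\n", encoding="utf-8")
    return len(ids)


def load_ids(path):
    """Read the ids back, skipping empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line]
```

For example, `save_ids(protokoll_package_ids, "protokoll_ids.txt")` followed later by `load_ids("protokoll_ids.txt")` avoids repeating the search.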


Here are some additional examples of ways to filter content:

-   **label** or **meta.title**: The title given to the package. Newspapers will for example have titles like "AFTONBLADET 2003-08-02". Not all package types have meaningful titles though.
-   **tags**: The different types of tags can be found in the left panel when visiting betalab.kb.se, e.g. "sou", "protokoll", "issue" (newspapers).
-   **content**: Search the text contents. Returns packages whose textual contents matched your search string.
-   **meta.created**: Creation date or year

Examples of each:

In [33]:
a.search({"label": "AFTONBLADET"}) # Same as a.search({"meta.title": "AFTONBLADET"})
a.search({"tags": "issue"})
a.search({"content": "hunger"})
a.search({"meta.created": "1888"})
a.search({'label: "AFTONBLADET" meta.created: "1888"'}) # Multiple criteria

Result(start=0, n=305, m=305, keys=<generator object HttpArchive._search_iter at 0x7f4504e72580>, hits=<list_iterator object at 0x7f4504ed1d90>)

### Download data from filtered packages

We can download data from the filtered packages. Most of the useful information relating to them has been assembled by us in three different `.json` files:

-   **content.json** (returns json file with ids and contents of all segmented text/image boxes)
-   **structure.json** (returns json file following the hierarchical structure of how the data is organized. For newspapers, for example, it is organized in package, parts, pages and segmented boxes. It also contains the contents, but nested hierarchically rather than in a flat file as is the case with content.json)
-   **meta.json** (metadata associated with the package)

### Approach #1: download data from list of package ids

If you have extracted package ids the way we did for `protokoll_package_ids`, then you can use the `kblab` package's built-in methods to make a GET request.

In [10]:
package = a.get(protokoll_package_ids[0])

Here, we download "meta.json" (if it exists inside the package) and parse its contents.

In [11]:
import json
if "meta.json" in package:
    meta_raw = package.get_raw("meta.json")
    meta_json = json.load(meta_raw)

In [12]:
print(meta_raw)
print(meta_json)

<urllib3.response.HTTPResponse object at 0x7fb96024ec40>
{'title': 'prot 1972::146', 'year': '1972', 'created': '1972'}


### Approach #2: Download data while iterating over generator

Another way, and the way described in the `kblab` documentation, is to do everything on the fly in one step.

In [34]:
meta_json_list = []
for package_id in a.search({"tags": "protokoll"}, max=5):
    package = a.get(package_id)

    if "meta.json" in package:
        meta_json = json.load(package.get_raw("meta.json"))
        meta_json_list.append(meta_json)

print(meta_json_list[0])

{'title': 'prot 1972::146', 'year': '1972', 'created': '1972'}
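
Once the metadata has been collected in a list, it can be filtered further locally. The following is a minimal sketch assuming each entry is a dict with a `year` string, as in the output above; the helper `filter_by_year` and the second and third sample entries are made up for illustration.

```python
def filter_by_year(meta_list, first_year, last_year):
    """Keep entries whose 'year' lies in [first_year, last_year]."""
    selected = []
    for meta in meta_list:
        year = str(meta.get("year", ""))
        # Skip entries with a missing or non-numeric year instead of failing
        if year.isdigit() and first_year <= int(year) <= last_year:
            selected.append(meta)
    return selected


sample = [
    {"title": "prot 1972::146", "year": "1972", "created": "1972"},
    {"title": "hypothetical entry", "year": "1985", "created": "1985"},
    {"title": "hypothetical entry without year"},
]
seventies = filter_by_year(sample, 1970, 1979)
print(seventies)
```

In the notebook, `filter_by_year(meta_json_list, 1970, 1979)` would apply the same filter to the downloaded metadata.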


### Getting a dump of content

One simple way of getting a dump of the text and structure contents is to use the `kblab` package's built-in `flerge()` function.

In [25]:
from kblab.utils import flerge

package = a.get(package_ids[0])
res = flerge(package)

In [35]:
print(f"Length of output: {len(res)}")
print(res[0])

Length of output: 928
{'@id': 'https://betalab.kb.se/sou-1922-10#1-1-cblock_0-block_0', '@type': 'Text', 'box': ['209', '167', '1214', '68'], 'has_representation': ['https://betalab.kb.se/sou-1922-10/SOU_1922_10-000.xml', 'https://betalab.kb.se/sou-1922-10/SOU_1922_10-000.jp2'], 'height': '2561', 'label': 'SOU 1922:10', 'meta': {'title': 'Om lappskattelandsinstitutet och dess historiska utveckling [Elektronisk resurs]', 'year': '1922', 'created': '1922', 'seriesEnumeration': 'SOU 1922:10'}, 'path': [{'@id': 'https://betalab.kb.se/sou-1922-10/', '@type': 'Package'}, {'@id': 'https://betalab.kb.se/sou-1922-10#1', '@type': 'Part'}, {'@id': 'https://betalab.kb.se/sou-1922-10#1-1', '@type': 'Page'}, {'@id': 'https://betalab.kb.se/sou-1922-10#1-1-cblock_0', '@type': 'Area'}], 'tags': ['SOU'], 'width': '1807', 'content': ['STATENS OFFENTLIGA UTREDNINGAR 1922:10\n']}
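
To turn such a dump into plain text, one option is to concatenate the `content` of every record of type `Text`. This is a sketch under the assumption that all records look like the one printed above (`'@type'` key, `'content'` as a list of strings); `records_to_text` is a hypothetical helper and the second sample record is invented for illustration.

```python
def records_to_text(records):
    """Concatenate the 'content' of all records whose '@type' is 'Text'."""
    pieces = []
    for record in records:
        if record.get("@type") != "Text":
            continue
        content = record.get("content", [])
        if isinstance(content, str):
            # Tolerate a bare string in place of a list of strings
            content = [content]
        pieces.extend(content)
    return "".join(pieces)


sample_records = [
    # Trimmed copy of the record shown in the output above
    {"@type": "Text", "content": ["STATENS OFFENTLIGA UTREDNINGAR 1922:10\n"]},
    # Hypothetical non-text record, skipped by the function
    {"@type": "Image"},
]
text = records_to_text(sample_records)
print(text)
```

In the notebook, `records_to_text(res)` would produce the text of the whole package in the order the records are returned.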


## Write custom API calls using requests package

Alternatively, we can construct our own API requests, e.g. via the Python package `requests`. This assumes we already have access to a list of package ids (obtaining it usually involves the `kblab` package and its archive object).

In [55]:
import requests
from requests.auth import HTTPBasicAuth

package_id = package_ids[1]

meta_raw = requests.get(f"https://betalab.kb.se/{package_id}/meta.json", auth=HTTPBasicAuth("demo", pw))

# Fail loudly on a non-200 answer instead of silently keeping an old meta_json
meta_raw.raise_for_status()
meta_json = meta_raw.json()

meta_json

{'title': 'Några iakttagelser från 1921 års riksdagsmannaval [Elektronisk resurs]',
 'year': '1922',
 'created': '1922',
 'seriesEnumeration': 'SOU 1922:1'}
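
When the same request is made for many packages, it may help to wrap it in a small function. The sketch below is one possible shape, not an official client: `meta_url` and `fetch_meta` are hypothetical names, the URL pattern is the one used in the cell above, and `session` is assumed to be a `requests.Session` (or any object with a compatible `get` method).

```python
def meta_url(package_id, base_url="https://betalab.kb.se"):
    """Build the meta.json URL for a package, following the pattern above."""
    return f"{base_url.rstrip('/')}/{package_id}/meta.json"


def fetch_meta(session, package_id, base_url="https://betalab.kb.se"):
    """Return the parsed meta.json, or None if the answer is not 200."""
    response = session.get(meta_url(package_id, base_url), timeout=30)
    if response.status_code != 200:
        return None
    return response.json()


# Possible usage with requests (needs network access and valid credentials):
#   session = requests.Session()
#   session.auth = ("demo", pw)
#   meta_json = fetch_meta(session, package_ids[1])
```

Returning `None` for missing files keeps a long loop running; raising an exception instead would be the stricter choice.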

### Parallelize