## The `.list` Method

After using the [`.process`](../parameters_processing_files_through_pipelines/process_method.md) method to process one or several files through your chosen pipeline, you can retrieve the record of any file(s) with the `.list` method. You can `.list` by `file_id` or by any other metadata you included when initially processing the file.  

This overview of the `.list` method is divided into the following sections:

- [.list Method Arguments](#.list-method-arguments)
- [Example Pipeline Setup and File Processing](#example-pipeline-setup-and-file-processing)
- [Listing by `file_ids`](#listing-by-file_ids)
- [Listing by `file_names`](#listing-by-file_names)
- [Listing by `symbolic_directory_paths`](#listing-by-symbolic_directory_paths)
- [Listing by `file_tags`](#listing-by-file_tags)
- [Listing by `created_at` and `updated_at` Bookend Times](#listing-by-created_at-and-updated_at-bookend-times)
- [Wildcard Operator Arguments](#wildcard-operator-arguments)
- [The Global Root](#the-global-root)
- [Using Multiple Arguments with the `.list` Method](#using-multiple-arguments-with-the-.list-method)
- [Output Size Cap](#output-size-cap)

In [1]:
# import utilities
import sys 
import json
import importlib
sys.path.append('../../../')
reset = importlib.import_module("utilities.reset")
reset_pipeline = reset.reset_pipeline

# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.


### `.list` Method Arguments

The `.list` method is very versatile. It allows you to list by several different metadata items and by a combination of different metadata items.

All of the following arguments are optional. However, you must use at least one argument for the `.list` method to function.

For a refresher on file system metadata arguments please visit the [`.process` method overview](../parameters_processing_files_through_pipelines/process_method.md). The metadata arguments you can use for `.list` are:

- `file_ids`: A list of one or several `file_id`s to return records for.

- `file_names`: A list of  one or several `file_name`s to return records for.

- `symbolic_directory_paths`: A list of one or several `symbolic_directory_path`s to return records for.

- `symbolic_file_paths`: A list of one or several `symbolic_file_path`s to return records for.

- `file_tags`: A list of one or several `file_tag`s to return records for. Note that individual file_tags suffice; if a file has several file tags and you include at least one of them as a `.list` argument, that file's record will be returned.

You may use wildcard operators with `file_names`, `symbolic_directory_paths`,`symbolic_file_paths`, and `file_tags` to retrieve records whose exact metadata you don't remember—or if you wish to retrieve records for a group of files that share similar metadata. More on wildcards operators [later](#wildcard-operator-arguments) in this document.

You may also list by timestamp bookends. The `.list` method accepts timestamps based on both the creation and latest-update times of your records. These are strings in the `"YYYY-MM-DD HH:MM:SS"` format, or alternatively just in the `"YYYY-MM-DD"` format.

- `created_at_start`: Filters out all files whose `created_at` time is earlier than what you've specified.

- `created_at_end`: Filters out all files whose `created_at` time is after what you've specified.

- `last_updated_start`: Filters out all files whose `last_updated` time is earlier than what you've specified.

- `last_updated_end`: Filters out all files whose `last_updated` time is after what you've specified.

Examples on how to use metadata and timestamps in the `.list` method are included below.

Note that file system metadata arguments operate on **OR** logic: for instance, if you `.list` by `file_names`, `file_ids`, and `file_tags`, any file that is a match for any of these will be returned. However, timestamp arguments operate on **AND** logic; all files returned must respect the given timestamp bookends. If two timestamp bookends are given and there is no overlap between them, the `.list` method will return nothing.

Finally, the `.list` method takes two additional optional arguments to help you organize your output:

- `max_files` (int): Determines the maximum number of file records `.list` should return. Defaults to none.

- `sort_order` (str): Specifies how results should be sorted. The two valid values for this argument are 'ascending' and 'descending' (in reference to creation timestamp). Defaults to 'descending'.

### Example Pipeline Setup and File Processing

We will need to create a pipeline and [`.process`](../parameters_processing_files_through_pipelines/process_method.md) a couple of files through it to illustrate usage of `.list`. We'll create a single-module pipeline with a [`parser`](../../modules/support_function_modules/parser_module.md) module and [`.process`](../parameters_processing_files_through_pipelines/process_method.md) some TXT files that hold the text of some English-language classics.  We define optional metadata like file_name, file_tags, and symbolic_directory_path for each process to illustrate how each can be used with `.list` below.

In [None]:
# create single-module parser pipeline
pipeline = krixik.create_pipeline(name='list_method_1_parser',
                                  module_chain=['parser'])

In [4]:
# process files through the pipeline we just created.
# we define optional metadata like file_name, file_tags, and symbolic_directory_path for each
# to illustrate the ability to list by each.
entries = [
    {
        "local_file_path" : "../../../data/input/frankenstein_very_short.txt",
        "file_name": "Frankenstein.txt",
        "file_tags": [{"author": "Shelley"}, {"category": "gothic"}, {"century": "19"}],
        "symbolic_directory_path": "/novels/gothic",
    },
    {
        "local_file_path": "../../../data/input/pride_and_prejudice_very_short.txt",
        "file_name": "Pride and Prejudice.txt",
        "symbolic_directory_path": "/novels/romance",
        "file_tags": [{"author": "Austen"}, {"category": "romance"}, {"century": "19"}],
    },
    {
        "local_file_path":  "../../../data/input/moby_dick_very_short.txt",
        "file_name": "Moby Dick.txt",
        "symbolic_directory_path": "/novels/adventure",
        "file_tags": [{"author": "Melville"}, {"category": "adventure"}, {"century": "19"}]
    }
]
        
# process each file
all_process_output = []
for entry in entries:
    process_output = pipeline.process(local_file_path=entry["local_file_path"], # the initial local filepath where the input file is stored
                                       local_save_directory="../../../data/output",  # the local directory that the output file will be saved to
                                      expire_time=60 * 30,  # process data will be deleted from the Krixik system in 30 minutes
                                      wait_for_process=True,  # do not wait for process to complete before returning IDE control to user
                                      verbose=False,  # do not display process update printouts upon running code
                                      file_name=entry["file_name"],
                                      symbolic_directory_path=entry["symbolic_directory_path"],
                                      file_tags=entry["file_tags"])
    all_process_output.append(process_output)


ValueError: processes associated with request_id '8594789e-42d3-b069-d0f2-2d5f0d33c3be' failed at module 'summarize'

Let's quickly look at what the output for the last of these processed files.

In [None]:
# nicely print the output of the last process
print(json.dumps(all_process_output[-1], indent=2))

{
  "status_code": 200,
  "pipeline": "examples-transcribe-multilingual-sentiment-docs",
  "request_id": "1119f07f-e4a1-4021-9668-2f19ea367568",
  "file_id": "efdc2954-9bef-4427-8de1-2bd18a830015",
  "message": "SUCCESS - output fetched for file_id efdc2954-9bef-4427-8de1-2bd18a830015.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": "For the starting position, we want to see the feed between the hip and shoulders width, the heels on the floor, a neutral column mediated by abdominal tension, the shoulders are lightly in front of the bar or above, straight arms, symmetrical hands and enough width to not rather the knees and we can have a lightly look forward.",
      "positive": 0.99,
      "negative": 0.01,
      "neutral": 0.0
    },
    {
      "snippet": "To perform the movement, our athlete will push from the heels, he will start to raise the hips and shoulders together, when the bar passes the knees, we extend the hip.",
   

### Listing by `file_ids`

Let's try listing by `file_ids`.

You have the `file_id` of each of the four files you processed; each was returned after processing finalized.  

You can list by multiple `file_id`s if you so choose by providing a list of desired `file_ids`.

For example, to see metadata associated with each file processed above simply pluck out the `file_id` from each processed return.

In [None]:
# .list records for two of the uploaded files via file_ids
list_output = pipeline.list(file_ids=[v["file_id"] for v in all_process_output])

# nicely print the output of this process
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "11dcf756-702c-421c-a85a-49dabc2cca7f",
  "message": "Successfully returned 1 item.  Note: all timestamps in UTC.",
  "items": [
    {
      "last_updated": "2024-04-26 21:05:05",
      "process_id": "578cb0a2-0f19-4d83-4b05-3c543f5e2506",
      "created_at": "2024-04-26 21:05:05",
      "file_metadata": {
        "modules": {
          "parser": {
            "model": "sentence"
          }
        },
        "modules_data": {
          "parser": {
            "data_files_extensions": [
              ".json"
            ]
          }
        }
      },
      "file_tags": [
        {
          "author": "orwell"
        },
        {
          "category": "fiction"
        }
      ],
      "file_description": "the first paragraph of 1984",
      "symbolic_directory_path": "/my/custom/filepath",
      "pipeline": "parser-pipeline-1",
      "file_id": "fb228e8e-eefd-4c52-b966-a49506d63f34",
      "expire_time": "2024-04-26 21:10:05",
      "file_nam

As you can see, a full record for each file was returned. To learn more about each metadata item, visit the documentation for the [`.process`](../parameters_processing_files_through_pipelines/process_method.md) method, where they are gone into detail on.

### Listing by `file_names`

You can also list via `file_name`s. It works just like listing with `file_id`s above, but with `file_name` instead of `file_id`.  We'll list <u>Pride and Prejudice</u> via `file_names`, as follows:

In [None]:
# .list records for one of the uploaded files via file_names
list_output = pipeline.list(file_names=["Pride and Prejudice.txt"])

# nicely print the output of this .list
print(json.dumps(list_output, indent=2))

As you can see, a full record for each file was returned. To learn more about each metadata item, visit the documentation for the [`.process`](../parameters_processing_files_through_pipelines/process_method.md) method, where they are gone into detail on.

### Listing by `symbolic_directory_paths`

You can also list via `symbolic_directory_paths`. It works just like listing with `file_id`s and `file_name`s above, but with `symbolic_directory_path` instead. We'll list <u>Little Women</u> and <u>Moby Dick</u> via `symbolic_directory_paths`, as follows:

In [None]:
# .list records for two of the uploaded files via symbolic_directory_paths
list_output = pipeline.list(symbolic_directory_paths=["/novels/gothic", "/novels/adventure"])

# nicely print the output of this process
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "70c71a76-7ce9-43c7-86e7-838b7fa93d8e",
  "message": "Successfully returned 1 item.  Note: all timestamps in UTC.",
  "items": [
    {
      "last_updated": "2024-04-26 21:05:05",
      "process_id": "578cb0a2-0f19-4d83-4b05-3c543f5e2506",
      "created_at": "2024-04-26 21:05:05",
      "file_metadata": {
        "modules": {
          "parser": {
            "model": "sentence"
          }
        },
        "modules_data": {
          "parser": {
            "data_files_extensions": [
              ".json"
            ]
          }
        }
      },
      "file_tags": [
        {
          "author": "orwell"
        },
        {
          "category": "fiction"
        }
      ],
      "file_description": "the first paragraph of 1984",
      "symbolic_directory_path": "/my/custom/filepath",
      "pipeline": "parser-pipeline-1",
      "file_id": "fb228e8e-eefd-4c52-b966-a49506d63f34",
      "expire_time": "2024-04-26 21:10:05",
      "file_nam

As you can see, a full record for each file was returned. To learn more about each metadata item, visit the documentation for the [`.process`](../parameters_processing_files_through_pipelines/process_method.md) method, where they are gone into detail on.

## Listing by `file_tags`

We can also list through `file_tags`.  We'll list for 19th century novels and any novels by 'Melville', as follows:

In [None]:
# .list records for two of the uploaded files via symbolic_directory_paths
list_output = pipeline.list(file_tags=[{"author": "Melville"}, {"century": 19}])

# nicely print the output of this process
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "7915929d-47cc-4fb9-82f6-b737ad823458",
  "message": "Successfully returned 1 item.  Note: all timestamps in UTC.",
  "items": [
    {
      "last_updated": "2024-04-26 21:05:05",
      "process_id": "578cb0a2-0f19-4d83-4b05-3c543f5e2506",
      "created_at": "2024-04-26 21:05:05",
      "file_metadata": {
        "modules": {
          "parser": {
            "model": "sentence"
          }
        },
        "modules_data": {
          "parser": {
            "data_files_extensions": [
              ".json"
            ]
          }
        }
      },
      "file_tags": [
        {
          "author": "orwell"
        },
        {
          "category": "fiction"
        }
      ],
      "file_description": "the first paragraph of 1984",
      "symbolic_directory_path": "/my/custom/filepath",
      "pipeline": "parser-pipeline-1",
      "file_id": "fb228e8e-eefd-4c52-b966-a49506d63f34",
      "expire_time": "2024-04-26 21:10:05",
      "file_nam

Given that every file included the file tag `{"century": 19}` when initially processed, all four files were listed. <u>Little Women</u> also included the file tag `{"author": "Melville"}`, but there's no duplication of results, so that file's record is only listed once.

### Listing by `created_at` and `updated_at` Bookend Times

To illustrate how to `.list` by timestamp bookends, let's first [`.process`](../parameters_processing_files_through_pipelines/process_method.md) one additional file through our pipeline:

In [None]:
# process an additional file into earlier pipeline
process_output = pipeline.process(local_file_path="../../../data/input/1984_very_short.txt", # the initial local filepath where the input JSON file is stored
                                  local_save_directory="../../../data/output",  # the local directory that the output file will be saved to
                                  expire_time=60 * 30,  # process data will be deleted from the Krixik system in 30 minutes
                                  wait_for_process=True,  # do not wait for process to complete before returning IDE control to user
                                  verbose=False,  # do not display process update printouts upon running code
                                  symbolic_directory_path="/novels/dystopian",
                                  file_name="1984.txt",
                                  file_tags=[{"author": "Orwell"}, {"category": "dystopian"}, {"century": 20}])

{
  "status_code": 200,
  "pipeline": "parser-pipeline-1",
  "request_id": "d3bca30e-d260-4c62-8aa9-91307b21d8b1",
  "file_id": "3b941b6f-bd05-4fbb-83fd-6fea80c25629",
  "message": "SUCCESS - output fetched for file_id 3b941b6f-bd05-4fbb-83fd-6fea80c25629.Output saved to location(s) listed in process_output_files.",
  "process_output": [
    {
      "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
      "line_numbers": [
        1
      ]
    },
    {
      "snippet": "Winston Smith, his chin nuzzled into his breast in an effort to escape the\nvile wind, slipped quickly through the glass doors of Victory Mansions,\nthough not quickly enough to prevent a swirl of gritty dust from entering\nalong with him.",
      "line_numbers": [
        2,
        3,
        4,
        5
      ]
    }
  ],
  "process_output_files": [
    "./3b941b6f-bd05-4fbb-83fd-6fea80c25629.json"
  ]
}


Listing by timestamp bookends is as straightforward as doing it by file system metadata. The following example only uses one type of bookend—`last_updated_start`—but all of them work the same way.

Based on the output from the file we just processed and the output from the four earlier files, we'll choose a time/date that falls in the middle of all five `last_updated` timestamps:

In [None]:
# .list process records by last_updated timestamp bookend
list_output = pipeline.list(created_at_start=process_output["created_at"])

# nicely print the output of this .list
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "aacbb5e9-4701-454b-be2d-16d0f812a201",
  "message": "Successfully returned 1 item.  Note: all timestamps in UTC.",
  "items": [
    {
      "last_updated": "2024-04-26 21:05:21",
      "process_id": "132561f2-336b-c889-ba9e-500df80fdd38",
      "created_at": "2024-04-26 21:05:21",
      "file_metadata": {
        "modules": {
          "parser": {
            "model": "sentence"
          }
        },
        "modules_data": {
          "parser": {
            "data_files_extensions": [
              ".json"
            ]
          }
        },
        "pipeline_ordered_modules": [
          "parser"
        ],
        "pipeline_output_process_keys": [
          "snippet"
        ]
      },
      "file_tags": [],
      "file_description": "",
      "symbolic_directory_path": "/etc",
      "pipeline": "parser-pipeline-1",
      "file_id": "3b941b6f-bd05-4fbb-83fd-6fea80c25629",
      "expire_time": "2024-04-26 21:10:21",
      "file_name": "krixi

Keep in mind that timestamp bookend arguments operate with **AND** logic: to be listed, a file _must_ fall within the specified timestamp window. This also means that if two timestamp arguments are provided and there is no overlap between them, the `.list` method will return nothing.

### Wildcard Operator Arguments

The wildcard operator is the asterisk: *

You can use the wildcard operator * to `.list` records whose exact metadata you don't remember—or if you wish to `.list` records for a group of files that share similar metadata.

For `file_names` and `symbolic_directory_paths` a wildcard may be used as either prefix or suffix:

- Example * as a prefix: `*report.txt`
- Example * as a suffix: `/home/files/studies*`

Note that you don't necessarily have to attach full words to the wildcard operator *. The two above examples could thus instead be:

- Example * as a prefix: `*ort.txt`
- Example * as a suffix: `/home/files/studi*`

For `file_tags` a wildcard may be used for as the value in a key-value pair dictionary. This will return all records with the corresponding key.

- Example * in file_tags: `{"invoice_type": "*"}`

Let's dig into `.list` method examples for each of these. First a prefix wildcard in `file_names`:

In [None]:
# list process records using a wildcard prefix in file_names
list_output = pipeline.list(file_names=["*e.txt"])

# nicely print the output of this .list
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "341c4e07-62a6-47f0-904b-7ff2ee3bbaef",
  "message": "No files were found for the given query arguments",
    {
        {
          "file_names": [
            "some*"
          ]
        }
      ]
    }
  ],
  "items": []
}


The above will return records for every file whose `file_name` ends with "e.txt".

Now a suffix wildcard in `symbolic_directory_paths`:

In [None]:
# list process records using wildcard suffix in symbolic_directory_paths
list_output = pipeline.list(symbolic_directory_paths=["/my/*"])

# nicely print the output of this .list
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "b6d53064-2748-4c3a-ac7a-e0325cf8c58f",
  "message": "No files were found for the given query arguments",
    {
        {
          "symbolic_directory_paths": [
            "/my/*"
          ]
        }
      ]
    }
  ],
  "items": []
}


The above will return records for every file whose `symbolic_directory_path` begins with "/my/".

Now a wildcard operator in `file_tags`:

In [None]:
# list process records using the wildcard operator in file_tags
list_output = pipeline.list(file_tags=[{"author": "*"}])

# nicely print the output of this .list
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "de17823f-5601-4143-b2ae-c546c173cdc7",
  "message": "No files were found for the given query arguments",
    {
        {
          "file_tags_keys": [
            "author"
          ]
        }
      ]
    }
  ],
  "items": []
}


The above will return records for every file that has a file_tag whose key is "author", regardless of the value.

You can also use the wildcard operator with the [`.show_tree`](show_tree_method.md) method, the [`.semantic_search`](../search_methods/semantic_search_method.md) method, and the [`.keyword_search`](../search_methods/keyword_search_method.md) method.

### The Global Root

As you might have surmised, there is one very special use of the wildcard operator on `symbolic_directory_path`s: we call it "the global root". It's leveraged by placing a wildcard operator * right after the root slash, and having nothing else, as follows:

```python
# example line of code with the global root
symbolic_directory_paths=['/*']
```

Listing the global root returns records for every single file in your pipeline.

### Using Multiple Arguments with the `.list` method

As earlier mentioned, you can jointly use multiple input arguments with the `.list` method. Multiple inputs are combined in a logical **OR** (if they are metadata arguments) or **AND** (if they are timestamp bookends) to retrieve records satisfying what's been requested.

As an example, let's combine a timestamp bookend, a `symbolic_file_path`, and `file_tags` in one `.list` method invocation:

In [None]:
# list process records using a combination of input args
list_output = pipeline.list(created_at_end=process_output["created_at"],
                            symbolic_file_path="/novels/gothic/Pride and Prejudice.txt",
                            file_tags=[({"author":"Orwell"})])

# nicely print the output of this .list
print(json.dumps(list_output, indent=2))

{
  "status_code": 200,
  "request_id": "091a2b5e-4d2f-44cd-9fcf-65a0c80546b7",
  "message": "No files were found for the given query arguments",
    {
        {
          "symbolic_directory_paths": [
            "/my/*"
          ]
        },
        {
          "file_names": [
            "some*"
          ]
        }
      ]
    }
  ],
  "items": []
}


Although <u>Pride and Prejudice</u> and <u>Little Women</u> are respectively covered by the `symbolic_file_paths` and `file_tags` arguments, neither of them falls within the indicated timestamp window. Consequently, they are both excluded from the above result.

### Output Size Cap

The current size limit on output generated by the `.list` method is 5MB.

In [None]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)