Search a directory of JSON files for a specific shopper id and return that shopper's data.

In [16]:
import parallel_playhouse.constants as c

### Notebook Parameters ###

dir = c.DATA_DIR / "search_json"
generate_data_flag = True

shopper_id = 500 # Shopper Id to use for the search

### Setup ###

dir.mkdir(parents=True, exist_ok=True)

## Generate Data

In [17]:
import random
from pathlib import Path


def generate_data(
    destination_dir: Path,
    num_files: int = 20,
    shoppers_per_file: int = 50,
    products_per_shopper: int = 100,
) -> None:
    shopper_ids = list(range(num_files * shoppers_per_file))
    random.shuffle(shopper_ids)

    for file in range(num_files):
        file_path = destination_dir / f"shoppers_{file}.json"
        with open(file_path, "w") as f:
            for _ in range(shoppers_per_file):
                shopper_id = shopper_ids.pop()
                products = list(range(products_per_shopper))
                random.shuffle(products)
                f.write(
                    f'{{"shopper_id": {shopper_id}, "products": {products}}}\n'
                )

if generate_data_flag:
    generate_data(dir)
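
Each generated file holds one JSON object per line rather than a single JSON document, which is what makes a line-oriented tool like grep a good fit. As a quick sanity check of that layout, the sketch below parses every line of a file. The helper name `check_file` and the tiny sample file are assumptions made for illustration; they are not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def check_file(path: Path) -> int:
    """Parse every line as JSON and return the number of shopper records."""
    count = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)  # raises ValueError if a line is malformed
            # Every record is expected to carry exactly these two keys.
            assert set(record) == {"shopper_id", "products"}
            count += 1
    return count


# Demo on a small hand-written file in the same one-object-per-line layout.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "shoppers_0.json"
    sample.write_text(
        '{"shopper_id": 3, "products": [2, 0, 1]}\n'
        '{"shopper_id": 7, "products": [1, 2, 0]}\n'
    )
    n_records = check_file(sample)
```

Pointing `check_file` at one of the generated files should return the `shoppers_per_file` value used during generation.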

## Parallel Search Json

In [27]:
import subprocess

command = fr"""
parallel "grep '\"shopper_id\": {shopper_id}'" ::: *.json > basic_output.txt
"""
print(command)
output = subprocess.run(command, shell=True, cwd=dir, capture_output=True)


parallel "grep '\"shopper_id\": 500'" ::: *.json > basic_output.txt



The shell expands the *.json glob into the list of JSON files, and parallel runs one grep per file to match lines with the right shopper id.
By default, parallel passes each value after ::: as the last argument to the command.
Because the command to be parallelized is wrapped in double quotes, the double quotes inside it are escaped with backslashes.
All matched lines are then redirected to the output file. There should be only one, because each shopper id is written to exactly one file.
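
One caveat is that the grep pattern is a plain substring match, so a shorter id such as 50 would also match 500 through 509. A minimal sketch of an exact-match cross-check in pure Python follows, assuming the one-object-per-line layout produced earlier. The helper `find_shopper` and the sample files are hypothetical, and the search runs sequentially rather than in parallel.

```python
import json
import tempfile
from pathlib import Path


def find_shopper(data_dir: Path, shopper_id: int) -> list[tuple[str, dict]]:
    """Return (file name, record) pairs whose shopper_id matches exactly."""
    matches = []
    for path in sorted(data_dir.glob("*.json")):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                # Integer comparison, so 50 does not match 500.
                if record["shopper_id"] == shopper_id:
                    matches.append((path.name, record))
    return matches


# Demo: two small files where a substring search for 50 would hit both.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "shoppers_0.json").write_text(
        '{"shopper_id": 500, "products": [1, 0]}\n'
    )
    (tmp_dir / "shoppers_1.json").write_text(
        '{"shopper_id": 50, "products": [0, 1]}\n'
    )
    found = find_shopper(tmp_dir, 50)
```

Comparing the result of `find_shopper` against the contents of basic_output.txt is one way to confirm that the grep pattern matched only the intended record.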

In [31]:
command = fr"""
parallel "grep -H '\"shopper_id\": {shopper_id}' {{}} | sed 's/:/\n/'" ::: *.json > tidy_output.txt
"""
print(command)
output = subprocess.run(command, shell=True, cwd=dir, capture_output=True)


parallel "grep -H '\"shopper_id\": 500' {} | sed 's/:/\n/'" ::: *.json > tidy_output.txt



Here we tidy the output a little. We use -H to prefix each matched line with its file name.
We then pipe that into sed, the stream editor, which replaces the first colon (the one separating the file name from the matched line) with a newline character.
Here we need {} to mark where parallel should place each value, since it no longer goes at the end of the command. In the Python f-string the braces are doubled ({{}}) so that a literal {} reaches the shell.
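
The tidied file alternates between a file name line and the matching JSON line, which makes it easy to read back. The sketch below shows one way to do that. The helper `parse_tidy_output` and the file name in the sample text are assumptions for illustration, and the parser relies on the file names containing no colon.

```python
import json


def parse_tidy_output(text: str) -> list[tuple[str, dict]]:
    """Pair each file name line with the JSON record on the line after it."""
    lines = [line for line in text.splitlines() if line.strip()]
    pairs = []
    # Even-indexed lines are file names, odd-indexed lines are JSON records.
    for name, payload in zip(lines[0::2], lines[1::2]):
        pairs.append((name, json.loads(payload)))
    return pairs


# Hypothetical contents of tidy_output.txt for a single match.
sample = 'shoppers_12.json\n{"shopper_id": 500, "products": [2, 0, 1]}\n'
parsed = parse_tidy_output(sample)
```

In the notebook, the same function could be applied to the text of tidy_output.txt in the data directory once the command has run.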