## GraphQL for grabbing TJ's product info

GraphQL is a query language for APIs. Apparently, TJ's exposes a product endpoint that does not require an authentication token. I suspect it was found just by poking around the browser's Network tab.

This way, we don't need to scrape pages or render any JS. All we need to do now is understand the product schema. In particular, the nutrition info we want is nested a few layers deep inside each entry of the top-level `items` list.

Some data of interest include:

* Food category (`categories`)
* Price
* Weight
* Ingredients
* Serves X
* Serving size
* Calories per serving
* A mapping of "nutrient_name":(`nutrient_amt`, `nutrient_dv`)
    * Nutrient amount per serving
    * Nutrient daily value
* `country_of_manufacture`
* `country_of_origin`

In [1]:
url = "https://www.traderjoes.com/api/graphql"
# Where does ^ even come from!? Most likely the XHR/fetch requests shown in the Network tab of the browser dev tools (F12)


In [2]:
import requests

### Example query of 100 products and their prices

In [3]:
# Start by mimicking the original price tracker project

store_code = 706 # Choose a store location

query_string = """
query {
  products(search: "", pageSize: 100) {
    items {
      sku
      item_title
      retail_price
    }
    total_count
  }
}
"""

headers = {
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9",
    "content-type": "application/json",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36",
}

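# NB: `query_string` above is an anonymous query that declares no variables, so
# in standard GraphQL `operationName` and `variables` have nothing to bind to.
# The request still succeeds, but `storeCode` likely has no effect on the result.
query = {
    "operationName": "SearchProduct",
    "variables": {
        "storeCode": store_code,
        "published": "1",
        "currentPage": 1,
        "pageSize": 100
    },
    "query": query_string
}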

response = requests.post(url, json=query, headers=headers)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Failure with status code {response.status_code}")
    

{'data': {'products': {'items': [{'sku': '083046', 'item_title': 'Honey Hydration Bath Fizzer', 'retail_price': '4.99'}, {'sku': '081280', 'item_title': 'Mini Chicken Tacos', 'retail_price': '5.99'}, {'sku': '081865', 'item_title': 'Turkish Inspired Stuffed Eggplant', 'retail_price': '0.00'}, {'sku': '081840', 'item_title': 'Loaded Mashed Potatoes', 'retail_price': '5.99'}, {'sku': '081863', 'item_title': 'Sliced Peppered Uncured Salami', 'retail_price': '6.99'}, {'sku': '080825', 'item_title': 'Organic Sweet Cream Creamer', 'retail_price': '2.99'}, {'sku': '080971', 'item_title': 'Organic Milk A2/A2', 'retail_price': '5.99'}, {'sku': '080667', 'item_title': 'McLelland Vintage Scottish Cheddar', 'retail_price': '5.49'}, {'sku': '082350', 'item_title': 'Fall Scented Candle Trio', 'retail_price': '9.99'}, {'sku': '082641', 'item_title': 'Bow Wow and Meow Tillandsia Planter', 'retail_price': '5.99'}, {'sku': '082761', 'item_title': 'Honeydew Cold Pressed Juice', 'retail_price': '3.49'}, {

In [6]:
# This `total_count` is independent of how many items we query using `pageSize`
# This is the actual total number of items in the store (historically)
data['data']['products']['total_count']

26231

In [5]:
len(data['data']['products']['items'])

100
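Since `total_count` does not depend on `pageSize`, the number of pages needed to walk the whole catalog follows from a ceiling division. A minimal sketch (the helper name `pages_needed` is mine, not part of the API, and I am assuming the server counts pages the same way):

```python
import math

def pages_needed(total_count: int, page_size: int) -> int:
    """Number of pages required to cover `total_count` items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total_count / page_size)

# Counts taken from the two cells above
print(pages_needed(26231, 100))
```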

In [9]:
for item in data['data']['products']['items'][:5]:
    print(item)

{'sku': '083046', 'item_title': 'Honey Hydration Bath Fizzer', 'retail_price': '4.99'}
{'sku': '081280', 'item_title': 'Mini Chicken Tacos', 'retail_price': '5.99'}
{'sku': '081865', 'item_title': 'Turkish Inspired Stuffed Eggplant', 'retail_price': '0.00'}
{'sku': '081840', 'item_title': 'Loaded Mashed Potatoes', 'retail_price': '5.99'}
{'sku': '081863', 'item_title': 'Sliced Peppered Uncured Salami', 'retail_price': '6.99'}

### "Introspecting" all defined queries in the API

These are all of the queries available in the TJ's GraphQL API. The `description` explains vaguely how they are used.

**NB: This is not telling us much about attributes that the queries return, only how to perform the query (how to search, filter, limit number of results).**

In [11]:
query_query_schema = """
{
  __type(name: "Query") {
    name
    kind
    description
    fields {
      name
      description
      args {
        name
        description
        type {
          name
          kind
          ofType {
            name
            kind
          }
        }
        defaultValue
      }
      type {
        name
        kind
        ofType {
          name
          kind
        }
      }
    }
  }
}
"""
response = requests.post(url, json={"query": query_query_schema}, headers=headers)
qschema = response.json()


In [12]:
for field in qschema['data']['__type']['fields'][:3]:
    print(field)

{'name': 'availableStores', 'description': 'Get a list of available store views and their config information.', 'args': [{'name': 'useCurrentGroup', 'description': 'Filter store views by the current store group.', 'type': {'name': 'Boolean', 'kind': 'SCALAR', 'ofType': None}, 'defaultValue': None}], 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'StoreConfig', 'kind': 'OBJECT'}}}
{'name': 'cart', 'description': 'Return information about the specified shopping cart.', 'args': [{'name': 'cart_id', 'description': 'The unique ID of the cart to query.', 'type': {'name': None, 'kind': 'NON_NULL', 'ofType': {'name': 'String', 'kind': 'SCALAR'}}, 'defaultValue': None}], 'type': {'name': 'Cart', 'kind': 'OBJECT', 'ofType': None}}
{'name': 'categories', 'description': 'Return a list of categories that match the specified filter.', 'args': [{'name': 'filters', 'description': 'Identifies which Category filter inputs to search for and return.', 'type': {'name': 'CategoryFilterInput', 'ki

In [None]:
# {'name': 'products',
#   'description': 'Search for products that match the criteria specified in the `search` and `filter` attributes.',
#   'args': [
#    {'name': 'search',
#     'description': 'One or more keywords to use in a full-text search.',
#     'type': {'name': 'String', 'kind': 'SCALAR', 'ofType': None},
#     'defaultValue': None},
#    {'name': 'filter',
#     'description': 'The product attributes to search for and return.',
#     'type': {'name': 'ProductAttributeFilterInput',
#      'kind': 'INPUT_OBJECT',
#      'ofType': None},
#     'defaultValue': None},
#    {'name': 'pageSize',
#     'description': 'The maximum number of results to return at once. The default value is 20.',
#     'type': {'name': 'Int', 'kind': 'SCALAR', 'ofType': None},
#     'defaultValue': '20'},
#    {'name': 'currentPage',
#     'description': 'The page of results to return. The default value is 1.',
#     'type': {'name': 'Int', 'kind': 'SCALAR', 'ofType': None},
#     'defaultValue': '1'},
#    {'name': 'sort',
#     'description': 'Specifies which attributes to sort on, and whether to return the results in ascending or descending order.',
#     'type': {'name': 'ProductAttributeSortInput',
#      'kind': 'INPUT_OBJECT',
#      'ofType': None},
#     'defaultValue': None}
#    ],
#   'type': {'name': 'Products', 'kind': 'OBJECT', 'ofType': None}}
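The `type` entries above hide the interesting name inside `LIST` and `NON_NULL` wrappers, so reading them by eye gets tedious. Here is a small sketch of a helper that renders them in GraphQL-style notation. `type_signature` is a hypothetical name, and it can only unwrap as many `ofType` levels as the introspection query requested (one, in the queries here).

```python
def type_signature(t):
    """Render an introspection `type` dict as GraphQL notation, e.g. [StoreConfig]."""
    if t is None:
        return "?"  # the query did not request this many `ofType` levels
    kind = t.get("kind")
    if kind == "NON_NULL":
        return type_signature(t.get("ofType")) + "!"
    if kind == "LIST":
        return "[" + type_signature(t.get("ofType")) + "]"
    return t.get("name") or "?"

# Two `type` entries copied from the introspection output above
available_stores = {'name': None, 'kind': 'LIST', 'ofType': {'name': 'StoreConfig', 'kind': 'OBJECT'}}
cart_id = {'name': None, 'kind': 'NON_NULL', 'ofType': {'name': 'String', 'kind': 'SCALAR'}}

print(type_signature(available_stores))
print(type_signature(cart_id))
```

Applied to every field in `qschema`, this would give a compact one-line summary per query instead of the nested dicts.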

Let's attempt to get some individual item info.

In [15]:
query_product_fields = """
{
  __type(name: "Products") {
    name
    kind
    fields {
      name
      description
      type {
        name
        kind
        ofType {
          name
          kind
        }
      }
    }
  }
}
"""
product_fields_response = requests.post(url, json={"query": query_product_fields}, headers=headers)

product_fields = product_fields_response.json()
# product_fields

In [20]:
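def print_n(iterable, n=3):
    """Print the first `n` elements of a sequence, one per line."""
    # Slicing avoids an IndexError when the sequence has fewer than n elements
    for item in iterable[:n]:
        print(item)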

In [21]:
# [print(product_fields['data']['__type']['fields'][i]) for i in range(3)]
print_n(product_fields['data']['__type']['fields'])

{'name': 'aggregations', 'description': 'A bucket that contains the attribute code and label for each filterable option.', 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'Aggregation', 'kind': 'OBJECT'}}}
{'name': 'items', 'description': 'An array of products that match the specified search criteria.', 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'ProductInterface', 'kind': 'INTERFACE'}}}
{'name': 'page_info', 'description': 'An object that includes the page_info and currentPage values specified in the query.', 'type': {'name': 'SearchResultPageInfo', 'kind': 'OBJECT', 'ofType': None}}


So apparently, `ProductInterface` is the actual type we are concerned with. `Products` is only the search-result wrapper: its `items` field holds the list of products that matched the query, next to bookkeeping fields such as `aggregations` and `page_info`.

Let's see what fields each element of `items` contains, now that we know those elements have type `ProductInterface`.

In [17]:
query_productinterface_fields = """
{
  __type(name: "ProductInterface") {
    name
    kind
    fields {
      name
      description
      type {
        name
        kind
        ofType {
          name
          kind
        }
      }
    }
  }
}
"""
productinterface_fields_response = requests.post(url, json={"query": query_productinterface_fields}, headers=headers)

productinterface_fields = productinterface_fields_response.json()

In [22]:
print_n(productinterface_fields['data']['__type']['fields'])

{'name': 'all_context_images', 'description': 'Array of all context images', 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'StoreSpecificImages', 'kind': 'OBJECT'}}}
{'name': 'all_other_images', 'description': 'Array of all other image attributes info', 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'StoreSpecificImages', 'kind': 'OBJECT'}}}
{'name': 'all_primary_images', 'description': 'Array of all primary images', 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'StoreSpecificImages', 'kind': 'OBJECT'}}}


Finally getting somewhere. We now know that `Products` contains a list `items`, each element of type `ProductInterface`. Each `ProductInterface` has its own `nutrition` attribute of type `NutritionAttribute`.

I was confused by the naming scheme, but it is clear now that `NutritionAttribute` is not an attribute of some nutrition object. It is the type of the `nutrition` attribute, which *holds the nutrition information of its parent `ProductInterface`.*

In [23]:
query_nutrition_fields = """
{
  __type(name: "NutritionAttribute") {
    name
    kind
    fields {
      name
      description
      type {
        name
        kind
        ofType {
          name
          kind
        }
      }
    }
  }
}
"""
query_nutrition_fields_response = requests.post(url, json={"query": query_nutrition_fields}, headers=headers)

nutrition_fields = query_nutrition_fields_response.json()

In [24]:
# [(field['name']) for field in nutrition_fields['data']['__type']['fields']]
# nutrition_fields['data']['__type']['fields']
print_n(nutrition_fields['data']['__type']['fields'])

{'name': 'calories_per_serving', 'description': 'Calories per serving', 'type': {'name': 'String', 'kind': 'SCALAR', 'ofType': None}}
{'name': 'details', 'description': 'Nutrition Details', 'type': {'name': None, 'kind': 'LIST', 'ofType': {'name': 'NutritionDetails', 'kind': 'OBJECT'}}}
{'name': 'display_sequence', 'description': 'display sequence', 'type': {'name': 'Int', 'kind': 'SCALAR', 'ofType': None}}


To summarize all of the objects and their fields that we desire:

```
Products
|
L items (list of ProductInterface)
    L sku
    L item_title
    L retail_price
    L nutrition (NutritionAttribute)
        L calories_per_serving
        L details (list of NutritionDetails)
        L serving_size
        L servings_per_container

```

## MOVE TO SCRAPER_SCRATCH


In [None]:
# {'name': 'route',
#   'description': 'Return the full details for a specified product, category, or CMS page.',
#   'args': [{'name': 'url',
#     'description': 'A `url_key` appended by the `url_suffix, if one exists.',
#     'type': {'name': None,
#      'kind': 'NON_NULL',
#      'ofType': {'name': 'String', 'kind': 'SCALAR'}},
#     'defaultValue': None}],
#   'type': {'name': 'RoutableInterface', 'kind': 'INTERFACE', 'ofType': None}}

In [None]:
# query_string = """
# query {
#   route(url: "https://www.traderjoes.com/api/graphql") {
#     __typename
#     ... on Product {
#       sku
#     }
#   }
# }
# """

# query = {
#     "operationName": "SearchRoute",
#     "variables": {
#         "storeCode": "706",
#         "published": "1",
#         "currentPage": 1,
#         "pageSize": 100
#     },
#     "query": query_string
# }

# response = requests.post(url, json=query, headers=headers)

# # data = response.json()
# # data
# response

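# How to use this `route` query? Going by its description, `url` presumably
# wants a page's `url_key` (plus suffix), not the API endpoint used above.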


Let's grab the full API schema and see if we can find which fields to query for our purposes.

In [25]:
query_schema = """
query {
  __schema {
    types {
      name
      kind
      description
    }
  }
}
"""

response = requests.post(url, json={"query": query_schema}, headers=headers)
schema = response.json()
# print(schema)

In [26]:
# schema['data']['__schema']['types'] # ['data']['__type'].keys()
print_n(schema['data']['__schema']['types'])

{'name': 'Query', 'kind': 'OBJECT', 'description': ''}
{'name': 'Boolean', 'kind': 'SCALAR', 'description': 'The `Boolean` scalar type represents `true` or `false`.'}
{'name': 'StoreConfig', 'kind': 'OBJECT', 'description': "Contains information about a store's configuration."}


In [29]:
query_schema_nutrition = """
{
  __type(name: "NutritionDetails") {
    name
    kind
    description
    fields {
      name
      type {
        kind
        name
        ofType {
          kind
          name
        }
      }
    }
  }
}"""

response = requests.post(url, json={"query": query_schema_nutrition}, headers=headers)
nutrition = response.json()


In [31]:
nutrition

{'data': {'__type': {'name': 'NutritionDetails',
   'kind': 'OBJECT',
   'description': 'NutritionAttribute type',
   'fields': [{'name': 'amount',
     'type': {'kind': 'SCALAR', 'name': 'String', 'ofType': None}},
    {'name': 'display_seq',
     'type': {'kind': 'SCALAR', 'name': 'Int', 'ofType': None}},
    {'name': 'nutritional_item',
     'type': {'kind': 'SCALAR', 'name': 'String', 'ofType': None}},
    {'name': 'percent_dv',
     'type': {'kind': 'SCALAR', 'name': 'String', 'ofType': None}}]}}}

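What is the relationship between `NutritionDetails` and `NutritionAttribute`?

Going by the introspection output above, the `details` field of `NutritionAttribute` is a list of `NutritionDetails` objects, one per nutrient row. Confusingly, the schema's own description of `NutritionDetails` reads 'NutritionAttribute type'.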
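Since each `NutritionDetails` row carries `nutritional_item`, `amount`, and `percent_dv`, the `"nutrient_name": (amount, daily value)` mapping from the wish list at the top can be built from a single `details` list. A rough sketch, assuming the field names from the introspection output above; `nutrient_map` is a hypothetical helper, and the sample rows only imitate the shape the API returns.

```python
def nutrient_map(details):
    """Map nutrient name -> (amount, percent_dv) for one nutrition panel."""
    mapping = {}
    # Follow the label order given by `display_seq`; missing values sort first
    for row in sorted(details or [], key=lambda r: r.get("display_seq") or 0):
        mapping[row["nutritional_item"]] = (row.get("amount"), row.get("percent_dv"))
    return mapping

sample_details = [
    {"amount": "3.5 g", "display_seq": 2, "nutritional_item": "Saturated Fat", "percent_dv": ".18"},
    {"amount": "6 g", "display_seq": 1, "nutritional_item": "Total Fat", "percent_dv": ".08"},
]
print(nutrient_map(sample_details))
```

Both values stay as the raw strings the API sends; converting them to numbers is a separate cleaning step.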

In [35]:
query_schema_all = """
{
  __schema {
    queryType {
      name
    }
    mutationType {
      name
    }
    subscriptionType {
      name
    }
    types {
      name
      kind
      fields {
        name
        args {
          name
          type {
            name
            kind
          }
          defaultValue
        }
        type {
          name
          kind
        }
      }
      inputFields {
        name
        type {
          name
          kind
        }
      }
      interfaces {
        name
      }
      enumValues {
        name
      }
      possibleTypes {
        name
      }
    }
  }
}
"""
response = requests.post(url, json={"query": query_schema_all}, headers=headers)
schema_all = response.json()
# print(schema_all)

In [40]:
print_n(schema_all['data']['__schema']['types'][0]['fields'], 8)

{'name': 'availableStores', 'args': [{'name': 'useCurrentGroup', 'type': {'name': 'Boolean', 'kind': 'SCALAR'}, 'defaultValue': None}], 'type': {'name': None, 'kind': 'LIST'}}
{'name': 'cart', 'args': [{'name': 'cart_id', 'type': {'name': None, 'kind': 'NON_NULL'}, 'defaultValue': None}], 'type': {'name': 'Cart', 'kind': 'OBJECT'}}
{'name': 'categories', 'args': [{'name': 'filters', 'type': {'name': 'CategoryFilterInput', 'kind': 'INPUT_OBJECT'}, 'defaultValue': None}, {'name': 'pageSize', 'type': {'name': 'Int', 'kind': 'SCALAR'}, 'defaultValue': '20'}, {'name': 'currentPage', 'type': {'name': 'Int', 'kind': 'SCALAR'}, 'defaultValue': '1'}], 'type': {'name': 'CategoryResult', 'kind': 'OBJECT'}}
{'name': 'categoryList', 'args': [{'name': 'filters', 'type': {'name': 'CategoryFilterInput', 'kind': 'INPUT_OBJECT'}, 'defaultValue': None}], 'type': {'name': None, 'kind': 'LIST'}}
{'name': 'checkoutAgreements', 'args': [], 'type': {'name': None, 'kind': 'LIST'}}
{'name': 'cmsBlocks', 'args':

### Finding fields in various attributes

In [45]:
query_products_schema = """
query {
  __type(name: "Products") {
    name
    fields {
      name
      type {
        name
        kind
      }
    }
  }
}
"""

products_schema = requests.post(url, json={"query": query_products_schema}, headers=headers)
products_schema = products_schema.json()
# print(products_schema)

In [46]:
products_schema

{'data': {'__type': {'name': 'Products',
   'fields': [{'name': 'aggregations', 'type': {'name': None, 'kind': 'LIST'}},
    {'name': 'items', 'type': {'name': None, 'kind': 'LIST'}},
    {'name': 'page_info',
     'type': {'name': 'SearchResultPageInfo', 'kind': 'OBJECT'}},
    {'name': 'sort_fields', 'type': {'name': 'SortFields', 'kind': 'OBJECT'}},
    {'name': 'suggestions', 'type': {'name': None, 'kind': 'LIST'}},
    {'name': 'total_count', 'type': {'name': 'Int', 'kind': 'SCALAR'}}]}}}

In [48]:
query_types_schema = """
query {
  __schema {
    types {
      name
    }
  }
}
"""

types_schema = requests.post(url, json={"query": query_types_schema}, headers=headers)
types_schema = types_schema.json()


In [55]:
print_n(types_schema['data']['__schema']['types'], 7)

{'name': 'Query'}
{'name': 'Boolean'}
{'name': 'StoreConfig'}
{'name': 'String'}
{'name': 'FixedProductTaxDisplaySettings'}
{'name': 'Int'}
{'name': 'ID'}


### Test out querying one product

In [56]:
query_one_product = """
{
  products(
    storeCode: "706",
    search: "",
    filter: {},
    pageSize: 1,
    currentPage: 0
  ) {
    items {
      sku
      name
      availability
    }
  }
}
"""

response = requests.post(url, json={"query": query_one_product}, headers=headers)

if response.status_code == 200:
  one_product = response.json()
else:
  raise Exception(f"Failure with code {response.status_code}: {response.text}")

# print(one_product)


In [57]:
one_product

{'errors': [{'message': 'Unknown argument "storeCode" on field "products" of type "Query".',
   'extensions': {'category': 'graphql'},
   'locations': [{'line': 4, 'column': 5}]}]}
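The error makes sense given the introspection output earlier: `products` accepts `search`, `filter`, `pageSize`, `currentPage`, and `sort`, and nothing named `storeCode`. Relatedly, in standard GraphQL the `variables` in a request body only take effect when the query declares them with `$`-prefixed names. Below is a sketch of a payload that does so for the two paging arguments. The operation name `ProductPage` and the helper `build_page_payload` are made up here, and I have not confirmed how this particular server treats such a request.

```python
import json

PRODUCT_PAGE_QUERY = """
query ProductPage($pageSize: Int, $currentPage: Int) {
  products(search: "", pageSize: $pageSize, currentPage: $currentPage) {
    items {
      sku
      item_title
      retail_price
    }
    total_count
  }
}
"""

def build_page_payload(page, page_size=100):
    """Build a request body whose variables are actually declared by the query."""
    return {
        "operationName": "ProductPage",
        "variables": {"pageSize": page_size, "currentPage": page},
        "query": PRODUCT_PAGE_QUERY,
    }

payload = build_page_payload(page=1)
print(json.dumps(payload["variables"]))
# To send it: requests.post(url, json=payload, headers=headers)
```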

### Only querying items at Hyde Park location (and available?)

In [None]:
import requests
import json

def items_by_store(store_code, page):
    """Fetch one page of 100 products.

    NB: `store_code` is not used yet. The `products` field rejects a
    `storeCode` argument (see the error above), so every store code
    returns the same result.
    """
    url = "https://www.traderjoes.com/api/graphql"
    headers = {
        "accept": "*/*",
        "accept-language": "en-US,en;q=0.9",
        "cache-control": "no-cache",
        "content-type": "application/json",
        "pragma": "no-cache",
        "accept-encoding": "gzip, deflate, br",
        # Adjacent string literals are joined; a backslash continuation inside
        # the quotes would embed the next line's indentation in the header.
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36"
    }

    query = f"""
    {{
      products(
        search: "",
        filter: {{}},
        currentPage: {page},
        pageSize: 100
      ) {{
        items {{
          sku
          name
          availability
        }}
        total_count
      }}
    }}
    """

    # The query is anonymous, so no `operationName` is sent
    payload = {"query": query}

    response = requests.post(url, json=payload, headers=headers)

    if response.status_code != 200:
        raise Exception(f"Failure with code {response.status_code}: {response.text}")

    try:
        data = response.json()
    except ValueError as err:  # JSON decode errors are ValueError subclasses
        raise Exception("Failed to parse response JSON") from err

    # GraphQL errors can come back with "data": null, so guard each level
    products = (data.get('data') or {}).get('products') or {}
    return products.get('items', [])

items = items_by_store("706", 1)
for item in items:
    print(item)

In [None]:
items[0]['availability']

In [None]:
len(items)

In [None]:
[item for item in items if item['availability']=='0']

### Actually query all fields from all products (including nutrition)

Now, query the nutritional information.

uid vs sku? Both unique identifiers.

In [61]:
# Get total number of products
# Must be run before querying all products
query_number_products = """
{
  products(
    search: "",
    filter: {},
    pageSize: 100,
    currentPage: 0
  ) 
  {
    total_count
    page_info {
      current_page
      page_size
      total_pages
    }
  }
}
"""
number_products = requests.post(url, json={"query": query_number_products}, headers=headers)
number_products = number_products.json()
# print(number_products)


Let's query every page of 100 products, using the `total_pages` value the API reports.

In [63]:
page_info = number_products['data']['products']['page_info']
page_size = page_info['page_size']
num_pages = page_info['total_pages']

query_product_page = """
{{
  products(
    search: "",
    filter: {{}},
    pageSize: {},
    currentPage: {}
  )
  {{
    items {{
      sku
      name
      nutrition {{
        calories_per_serving
        details {{
            amount
            display_seq
            nutritional_item
            percent_dv
        }}
        display_sequence
        panel_id
        panel_title
        serving_size
        servings_per_container
      }}
      ingredients {{
        display_sequence
        ingredient
      }}
      item_description
      popularity
      price {{
        regularPrice {{
          amount {{
            value
            currency
          }}
        }}
      }}
      country_of_manufacture
      country_of_origin
      description {{
        html
      }}
    }}
    total_count
    page_info {{
      current_page
      page_size
      total_pages
    }}
  }}
}}
"""


In [64]:
num_pages

263

In [65]:
all_items = []

In [None]:
# # Last run time: 3m 42.6s!
# # Page 0 is identical to page 1!!
# for page_idx in range(1, num_pages+1):
#     query_page_i = {"query": query_product_page.format(page_size, page_idx)}
#     product_page = requests.post(url, json=query_page_i, headers=headers)
#     product_page = product_page.json()
#     all_items = all_items + product_page['data']['products']['items']


In [66]:
len(all_items)

0

In [None]:
skus = [item['sku'] for item in all_items]

In [None]:
len(skus)

In [None]:
len(set(skus))

In [None]:
25564

In [None]:
len(set(skus[100:])) == len(set(skus))
# Ah-ha! Yes, page 0 is identical to page 1
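Because page 0 just repeats page 1, the crawl should start at page 1 and run through `total_pages` inclusive. Here is a sketch of that loop with the HTTP call factored out so it can be tried offline. `fetch_all_pages` and the stand-in `fake_fetch` are hypothetical; in the notebook, `fetch_page` would wrap the `requests.post(...)` call and return that page's `items`.

```python
def fetch_all_pages(fetch_page, num_pages, first_page=1):
    """Collect items from pages first_page..num_pages (inclusive)."""
    items = []
    for page_idx in range(first_page, num_pages + 1):
        items.extend(fetch_page(page_idx))
    return items

def fake_fetch(page_idx, page_size=3):
    """Stand-in for the API: page 0 mirrors page 1, as observed above."""
    page_idx = max(page_idx, 1)
    start = (page_idx - 1) * page_size
    return [{"sku": f"{n:06d}"} for n in range(start, start + page_size)]

from_one = fetch_all_pages(fake_fetch, num_pages=4)
from_zero = fetch_all_pages(fake_fetch, num_pages=4, first_page=0)

# Total items vs. distinct SKUs for each starting page
print(len(from_one), len({item["sku"] for item in from_one}))
print(len(from_zero), len({item["sku"] for item in from_zero}))
```

Starting from 0 yields one extra page of items without adding any new SKUs, which is the same signature as the `skus[100:]` check above.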

In [None]:
query_page_256 = {"query": query_product_page.format(page_size, 256)}
product_page_256 = requests.post(url, json=query_page_256, headers=headers)
product_page_256 = product_page_256.json()
all_items_fixed = all_items[100:]

all_items_fixed = all_items_fixed + product_page_256['data']['products']['items']

In [None]:
skus_fixed = [item['sku'] for item in all_items_fixed]
len(set(skus_fixed))

### Query all items from Hyde Park location

In [67]:
query_product_page = """
{{
  products(
    search: "",
    filter: {{}},
    pageSize: {},
    currentPage: {}
  )
  {{
    items {{
      sku
      name
      availability
      stock_status
      only_x_left_in_stock
      nutrition {{
        calories_per_serving
        details {{
            amount
            display_seq
            nutritional_item
            percent_dv
        }}
        display_sequence
        panel_id
        panel_title
        serving_size
        servings_per_container
      }}
      ingredients {{
        display_sequence
        ingredient
      }}
      item_description
      popularity
      price {{
        regularPrice {{
          amount {{
            value
            currency
          }}
        }}
      }}
      country_of_manufacture
      country_of_origin
      description {{
        html
      }}
    }}
    total_count
    page_info {{
      current_page
      page_size
      total_pages
    }}
  }}
}}
"""

In [76]:
# Just mimicking the Haskell one

store_code = 706

headers = {
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9",
    "content-type": "application/json",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36",
}

page_idx=1
query_product_page_filled = query_product_page.format(page_size, page_idx)

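# NB: `query_product_page` declares no variables, so in standard GraphQL the
# `storeCode` below likely does not restrict the results to one store.
query = {
    "operationName": "SearchProduct",
    "variables": {
        "storeCode": store_code,
        "published": "1",
        "currentPage": 1,
        "pageSize": 100
    },
    "query": query_product_page_filled
}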

response = requests.post(url, json=query, headers=headers)

if response.status_code == 200:
    data = response.json()
else:
    print(f"Failure with code {response.status_code}")

In [78]:
num_pages

263

In [79]:
page_size

100

In [None]:
# all_items_hp = []
# store_code = 706

# # Last run time: 5m 47.0s!
# # Page 0 is identical to page 1!!

# for page_idx in range(1, num_pages+1):
#     query_product_page_filled = query_product_page.format(page_size, page_idx)
#     query_page_i = {
#         "operationName": "SearchProduct",
#         "variables": {
#             "storeCode": store_code,
#             "published": "1",
#             "currentPage": 1,
#             "pageSize": 100
#         },
#         "query": query_product_page_filled
#     }
#     # query_page_i = {"query": query_product_page.format(page_size, page_idx)}
#     product_page = requests.post(url, json=query_page_i, headers=headers)
#     product_page = product_page.json()
#     all_items_hp = all_items_hp + product_page['data']['products']['items']


In [None]:
# len(all_items_hp) # 25722 on 06/29/2025

25722

## Converting json output to `polars`

### Hyde Park list

In [156]:
skus = [item['sku'] for item in all_items_hp]

In [157]:
len(skus)

25722

In [158]:
len(set(skus))

23150

In [160]:
num_pages

258

In [165]:
from collections import Counter

In [169]:
skus = [i['sku'] for i in all_items_hp]
skus_ct = Counter(skus)

In [182]:
# skus_ct

In [180]:
# Number of SKUs that appear exactly twice
sum(1 for count in skus_ct.values() if count == 2)

2572

In [183]:
[i for i in all_items_hp if i['sku']=='007363']

[{'sku': '007363',
  'name': 'JAGERMEISTER LIQUEUR 750ML',
  'availability': '1',
  'stock_status': 'OUT_OF_STOCK',
  'only_x_left_in_stock': None,
  'nutrition': None,
  'ingredients': None,
  'item_description': None,
  'popularity': '0',
  'price': {'regularPrice': {'amount': {'value': 18.99, 'currency': 'USD'}}},
  'country_of_manufacture': None,
  'country_of_origin': None,
  'description': {'html': ''}},
 {'sku': '007363',
  'name': 'JAGERMEISTER LIQUEUR 750ML',
  'availability': '1',
  'stock_status': 'OUT_OF_STOCK',
  'only_x_left_in_stock': None,
  'nutrition': None,
  'ingredients': None,
  'item_description': None,
  'popularity': '0',
  'price': {'regularPrice': {'amount': {'value': 18.99, 'currency': 'USD'}}},
  'country_of_manufacture': None,
  'country_of_origin': None,
  'description': {'html': ''}}]

In [168]:
len(all_items_hp)

25722

Uh-oh, we need to remove the duplicates and the sponges.

SKU 10032021 ("POP UP SPONGES") for some reason carries information for pizza crusts: its problematic field is `"item_description": "Broccoli & Kale Pizza Crust description"`.

In [191]:
all_items_hp_fixed = [item for item in all_items_hp if item['sku']!='10032021']

In [192]:
len(all_items_hp_fixed)

25721
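The sponge is gone, but the 2,572 SKUs that appear twice are still in the list. Going by the JAGERMEISTER pair above, the repeats look like exact copies, so keeping the first occurrence of each SKU should be safe. A minimal sketch; `dedupe_by_sku` is a hypothetical helper, and it assumes duplicated SKUs always carry identical data.

```python
def dedupe_by_sku(items):
    """Keep the first item seen for each SKU, preserving the original order."""
    seen = set()
    unique_items = []
    for item in items:
        if item["sku"] not in seen:
            seen.add(item["sku"])
            unique_items.append(item)
    return unique_items

sample = [
    {"sku": "007363", "name": "JAGERMEISTER LIQUEUR 750ML"},
    {"sku": "073162", "name": "CROISSANT CHOCOLATE"},
    {"sku": "007363", "name": "JAGERMEISTER LIQUEUR 750ML"},
]
print([item["sku"] for item in dedupe_by_sku(sample)])
```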

Save Hyde Park json.

In [None]:
# import json
# all_items_hp_fixed_raw_path = "data/all_items_hp_fixed_raw.json"
# with open(all_items_hp_fixed_raw_path, "w") as f:
#        json.dump(all_items_hp_fixed, f, indent=4)



In [196]:
import polars as pl

In [None]:
df_hp = pl.json_normalize(all_items_hp_fixed)
df_hp.head(5)

sku,name,availability,stock_status,only_x_left_in_stock,nutrition,ingredients,item_description,popularity,country_of_manufacture,country_of_origin,price.regularPrice.amount.value,price.regularPrice.amount.currency,description.html
str,str,str,str,null,list[struct[7]],list[struct[2]],null,str,null,str,f64,str,str
"""080626""","""R-SALAD BABY RED BUTTER & ARUG…","""1""","""OUT_OF_STOCK""",,"[{""15 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""250 mg"",14,""Potassium"","".06""}],0,1,""per serving"",""1/2 package (85g)"",""Serves 2""}, {""30 "",[{""0.5 g"",1,""Total Fat"","".01""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""490 mg"",14,""Potassium"","".1""}],1,2,""per container"",""1/2 package (85g)"",""Serves 2""}]","[{1,""ORGANIC BABY RED BUTTER LETTUCE""}, {2,""ORGANIC BABY ARUGULA""}]",,"""107""",,"""Product of USA""",2.49,"""USD""",""""""
"""097210""","""TURBINADO RAW CANE SUGAR""","""1""","""OUT_OF_STOCK""",,"[{""30"",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""2 tsp (8 g)"",""Serves 85""}]","[{1,""TURBINADO RAW CANE SUGAR""}]",,"""95""",,"""Product of Malawi Product of""",3.49,"""USD""",""""""
"""082077""","""SALSA MANGO PINEAPPLE PICO DE …","""1""","""OUT_OF_STOCK""",,"[{""10 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""50 mg"",14,""Potassium"","".02""}],0,1,null,""2 Tbsp. (30g)"",""Serves about 11""}]","[{1,""TOMATO""}, {2,""YELLOW ONION""}, … {8,""SALT""}]",,"""236""",,"""Made in United States""",3.99,"""USD""",""""""
"""080470""","""MARSHMALLOWS STRAWBERRY & WATE…","""1""","""OUT_OF_STOCK""",,"[{""100 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,null,""6 pieces (30g)"",""Serves about 8""}]","[{1,""GLUCOSE-FRUCTOSE SYRUP""}, {2,""SUGAR""}, … {9,""WATERMELON POWDER (MALTODEXTRIN, WATERMELON JUICE CONCENTRATE, CITRIC ACID [ACIDIFIER]).""}]",,"""371""",,"""Product of Spain""",2.99,"""USD""",""""""
"""073162""","""CROISSANT CHOCOLATE""","""1""","""OUT_OF_STOCK""",,"[{""350 "",[{""20 g"",1,""Total Fat"","".26""}, {""12 g"",2,""Saturated Fat"","".6""}, … {""140 mg"",14,""Potassium"","".02""}],0,1,""Per serving"",""1 croissant (83g)"",""Serves 2""}, {""710 "",[{""40 g"",1,""Total Fat"","".51""}, {""23 g"",2,""Saturated Fat"",""1.15""}, … {""280 mg"",14,""Potassium"","".06""}],1,2,""Per container"",""1 croissant (83g)"",""Serves 2""}]","[{1,""UNBLEACHED WHEAT FLOUR (WHEAT FLOUR, ENZYME)""}, {2,""WATER""}, … {12,""ENZYMES.""}]",,"""158""",,"""Product of USA""",3.49,"""USD""",""""""


Beyond just dumping each json object from `all_items_hp_fixed` into `pl.json_normalize()`, we also need to further handle the nested `nutrition` and `ingredients` columns, which come out as lists of structs.

In [None]:
# df_all_items_raw.write_csv('data/all_items_raw.csv')
# # Nested columns 2 and 3 do not work with CSV

In [None]:
# import json
# all_items_fixed_raw_path = "data/all_items_fixed_raw.json"
# with open(all_items_fixed_raw_path, "w") as f:
#        json.dump(all_items_fixed, f, indent=4)



Every `description.html` value appears to be an empty string???

Let's try to isolate only food items.

In [199]:
nuts = [item['nutrition'] for item in all_items_hp_fixed]

In [None]:
# Get only items with non-empty nutrition information
# This will be criterion for what counts as "food"

In [203]:
# nut_lens = [len(nut) for nut in nuts]

#### Getting only the values we need from the json, making it relational

In [17]:
import json
import polars as pl

In [18]:
with open('data/all_items_hp_fixed_raw.json', 'r') as f:
    tjhp_raw = json.load(f)

Recall that the `item_description` field is `null` for just about every entry except one where it is instead the unrelated `Broccoli & Kale Pizza Crust description`.

In [19]:
# tj_raw = [item for item in tj_raw if item['item_description'] != 'Broccoli & Kale Pizza Crust description']

In [20]:
tjhp = pl.DataFrame(tjhp_raw)

Need to:

* Remove rows with null `nutrition`
* Remove useless fields (`item_description`, `description`)
* Drop duplicate rows
* Turn nested fields into 
    * lists (ingredients)
    * individual and possibly sparse columns (fiber, protein, calories)
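One way to approach the last step is to flatten each raw item in plain Python before handing anything to `polars`. The sketch below is only an illustration under a few assumptions: it keeps just the first nutrition panel, treats `calories_per_serving` as a string that may fail to parse, and uses hypothetical helper names (`to_float`, `flatten_item`). The sample item is a trimmed copy of a row shown earlier.

```python
def to_float(text):
    """Parse strings like '140 ' or '.08'; return None if that fails."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def flatten_item(item):
    """Turn one raw product dict into a flat record (first nutrition panel only)."""
    panel = (item.get("nutrition") or [{}])[0]
    ingredients = sorted(item.get("ingredients") or [],
                         key=lambda i: i["display_sequence"])
    record = {
        "sku": item["sku"],
        "name": item["name"],
        "price_usd": item["price"]["regularPrice"]["amount"]["value"],
        "calories": to_float(panel.get("calories_per_serving")),
        "serving_size": panel.get("serving_size"),
        "ingredients": [i["ingredient"] for i in ingredients],
    }
    # One (possibly sparse) column per nutrient, keeping the raw amount string
    for row in panel.get("details") or []:
        record[row["nutritional_item"]] = row["amount"]
    return record

sample_item = {
    "sku": "097210",
    "name": "TURBINADO RAW CANE SUGAR",
    "nutrition": [{
        "calories_per_serving": "30",
        "details": [{"amount": "0 g", "display_seq": 1,
                     "nutritional_item": "Total Fat", "percent_dv": "0"}],
        "serving_size": "2 tsp (8 g)",
    }],
    "ingredients": [{"display_sequence": 1, "ingredient": "TURBINADO RAW CANE SUGAR"}],
    "price": {"regularPrice": {"amount": {"value": 3.49, "currency": "USD"}}},
}
print(flatten_item(sample_item))
```

A list of such records can then be passed to `pl.DataFrame(...)`, with nutrients that a product lacks showing up as nulls.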

In [21]:
tjhp = tjhp.drop(['item_description', 'description'])

In [22]:
tjhp.shape

(25721, 11)

In [23]:
tjhp = tjhp.drop_nulls(subset=['nutrition', 'price'])
tjhp.shape

(4783, 11)

In [24]:
# Drop dupes
tjhp = tjhp.unique()

In [25]:
# fix price so it actually is a float of US dollars
temp = tjhp.with_columns(
    pl.col("price").struct.field("regularPrice").struct.field("amount").struct.field("currency").alias("nested_field")
)

print(temp)

shape: (4_588, 12)
┌────────┬────────────┬────────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ sku    ┆ name       ┆ availabili ┆ stock_sta ┆ … ┆ price     ┆ country_o ┆ country_o ┆ nested_fi │
│ ---    ┆ ---        ┆ ty         ┆ tus       ┆   ┆ ---       ┆ f_manufac ┆ f_origin  ┆ eld       │
│ str    ┆ str        ┆ ---        ┆ ---       ┆   ┆ struct[1] ┆ ture      ┆ ---       ┆ ---       │
│        ┆            ┆ str        ┆ str       ┆   ┆           ┆ ---       ┆ str       ┆ str       │
│        ┆            ┆            ┆           ┆   ┆           ┆ null      ┆           ┆           │
╞════════╪════════════╪════════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 076670 ┆ MANGO      ┆ 1          ┆ OUT_OF_ST ┆ … ┆ {{{3.29," ┆ null      ┆ Product   ┆ USD       │
│        ┆ STICKY     ┆            ┆ OCK       ┆   ┆ USD"}}}   ┆           ┆ of        ┆           │
│        ┆ RICE       ┆            ┆           ┆   ┆           ┆        

In [26]:
temp['nested_field'].value_counts()

nested_field,count
str,u32
"""USD""",4588


OK, they are all in USD, proceed.

In [27]:
tjhp = tjhp.with_columns(
    pl.col("price").struct.field("regularPrice").struct.field("amount").struct.field("value").alias("price_usd")
)


Onto the same nested json but for calories...

Wow, these nutrition items are formatted horrendously. The key is not a key whatsoever. Should go back and fix this in the original GraphQL query under `nutrition`.

In [28]:
tjhp['nutrition'][0]

"{""140 "",[{""6 g"",1,""Total Fat"","".08""}, {""3.5 g"",2,""Saturated Fat"","".18""}, … {""40 mg"",14,""Potassium"",""0""}],0,1,"""",""1 oz (28g/about 9 pieces)"",""Serves about 5""}"


Oh, maybe not. That was only the truncated repr; indexing into the list shows the keys are intact.

In [29]:
tjhp['nutrition'][0][0]

{'calories_per_serving': '140 ',
 'details': [{'amount': '6 g',
   'display_seq': 1,
   'nutritional_item': 'Total Fat',
   'percent_dv': '.08'},
  {'amount': '3.5 g',
   'display_seq': 2,
   'nutritional_item': 'Saturated Fat',
   'percent_dv': '.18'},
  {'amount': '0 g',
   'display_seq': 3,
   'nutritional_item': 'Trans Fat',
   'percent_dv': ''},
  {'amount': '0 mg',
   'display_seq': 4,
   'nutritional_item': 'Cholesterol',
   'percent_dv': '0'},
  {'amount': '65 mg',
   'display_seq': 5,
   'nutritional_item': 'Sodium',
   'percent_dv': '.03'},
  {'amount': '19 g',
   'display_seq': 6,
   'nutritional_item': 'Total Carbohydrate',
   'percent_dv': '.07'},
  {'amount': '2 g',
   'display_seq': 7,
   'nutritional_item': 'Dietary Fiber',
   'percent_dv': '.07'},
  {'amount': '9 g',
   'display_seq': 8,
   'nutritional_item': 'Total Sugars',
   'percent_dv': ''},
  {'amount': '3 g Added Sugars',
   'display_seq': 9,
   'nutritional_item': 'Includes',
   'percent_dv': '.06'},
  {'amoun

In [30]:
tjhp['nutrition'][0][0]['calories_per_serving']

'140 '

In [31]:
tjhp = tjhp.with_columns(
    # tjhp['nutrition']
    pl.col("nutrition").map_elements(lambda x: x[0]['calories_per_serving'] if len(x)>0 else None, return_dtype=str).alias("calories_per_serving")
)
tjhp.head(3)

sku,name,availability,stock_status,only_x_left_in_stock,nutrition,ingredients,popularity,price,country_of_manufacture,country_of_origin,price_usd,calories_per_serving
str,str,str,str,null,list[struct[7]],list[struct[2]],str,struct[1],null,str,f64,str
"""076670""","""MANGO STICKY RICE CRISPS""","""1""","""OUT_OF_STOCK""",,"[{""140 "",[{""6 g"",1,""Total Fat"","".08""}, {""3.5 g"",2,""Saturated Fat"","".18""}, … {""40 mg"",14,""Potassium"",""0""}],0,1,"""",""1 oz (28g/about 9 pieces)"",""Serves about 5""}]","[{1,""STICKY RICE (RICE, WATER)""}, {2,""DRIED MANGO (MANGO, CANE SUGAR, GLYCERIN, CITRIC ACID [ACIDIFIER], SULFUR DIOXIDE [TO MAINTAIN COLOR])""}, … {10,""MANGO SYRUP (MANGO PUREE, SUGAR, SALT, GLUCOSE SYRUP, COCONUT MILK).""}]","""71""","{{{3.29,""USD""}}}",,"""Product of Thailand""",3.29,"""140 """
"""077815""","""ORG COCONUT BEVERAGE ORGINAL U…","""1""","""OUT_OF_STOCK""",,"[{""60 "",[{""6 g"",0,""Total Fat"","".08""}, {""5 g"",1,""Saturated Fat"","".25""}, … {""1.11 mcg"",16,""Vitamin B12"","".45""}],0,1,null,""1 Cup (240mL)"",""Serves 4""}]","[{1,""WATER""}, {2,""ORGANIC COCONUT CREAM""}, … {10,""CYANOCOBALAMIN (VITAMIN B 12).""}]","""121""","{{{2.99,""USD""}}}",,"""Product of United States""",2.99,"""60 """
"""077788""","""COOKIES NOCCIOLINI TINY HAZELN…","""1""","""OUT_OF_STOCK""",,"[{""150 "",[{""8 g"",1,""Total Fat"","".1""}, {""0.5 g"",2,""Saturated Fat"","".03""}, … {""110 mg"",14,""Potassium"","".02""}],0,1,null,""1/3 cup (28g)"",""Serves about 3.5""}]","[{1,""SUGAR""}, {2,""HAZELNUTS""}, {3,""EGG WHITE""}]","""112""","{{{2.69,""USD""}}}",,"""Product of Italy""",2.69,"""150 """


Some items have an empty string for `calories_per_serving`. These things definitely do not actually have 0 calories. Drop them.

In [32]:
tjhp.filter((pl.col("calories_per_serving") == ""))

sku,name,availability,stock_status,only_x_left_in_stock,nutrition,ingredients,popularity,price,country_of_manufacture,country_of_origin,price_usd,calories_per_serving
str,str,str,str,null,list[struct[7]],list[struct[2]],str,struct[1],null,str,f64,str
"""066054""","""CHEESECAKE CONES""","""1""","""IN_STOCK""",,"[{"""",[{""13 g"",1,""Total Fat"","".17""}, {""5 g"",2,""Saturated Fat"","".25""}, … {""120 mg"",14,""Potassium"","".02""}],0,1,"""",""1 cone (50g)"",""4""}]","[{1,""CONE (SUGAR, ALMONDS""}, {2,""GLUCOSE SYRUP""}, … {13,""SUGAR).""}]","""147""","{{{4.99,""USD""}}}",,"""Product of Italy""",4.99,""""""
"""070148""","""MILK ULTRA - FILTERED REDUCED …","""1""","""IN_STOCK""",,"[{"""",[{""5g"",1,""Total Fat"","".06""}, {""3g"",2,""Saturated Fat"","".15""}, … {""200mcg"",15,""Vitamin A"","".2""}],0,1,"""",""1 cup (240mL)"",""Serves about 7""}]","[{1,""ULTRA-FILTERED REDUCED FAT MILK""}, {2,""WATER""}, … {5,""VITAMIN D3""}]","""84""","{{{3.99,""USD""}}}",,"""Product of United States""",3.99,""""""
"""076648""","""HOL PB DECO DOG COOKIES""","""1""","""OUT_OF_STOCK""",,"[{"""",[{"""",1,""CRUDE PROTEIN (MIN)"","".09""}, {"""",2,""CRUDE FAT (MIN)"","".05""}, … {"""",4,""MOISTURE (MAX)"","".12""}],0,1,"""",""3400 kcal/kg"",""45 kcal/treat""}]","[{1,""WHEAT FLOUR""}, {2,""GLUCOSE""}, … {17,""CALCIUM CARBONATE.""}]","""17""","{{{3.99,""USD""}}}",,"""Product of Vietnam""",3.99,""""""
"""072630""","""JOE'S CARVERY JERKY BITES""","""1""","""OUT_OF_STOCK""",,"[{"""",[{"""",1,""CRUDE PROTEIN (MIN)"","".22""}, {"""",2,""CRUDE FAT (MIN)"","".1""}, … {"""",4,""MOISTURE (MAX)"","".22""}],0,1,"""",""1 treat"",""CALORIE CONTENT (CALCULATED) ME: 3500 kcal/kg; 25 kcal/treat""}]","[{1,""TURKEY""}, {2,""CHICKEN""}, … {15,""ROSEMARY EXTRACT.""}]","""19""","{{{3.49,""USD""}}}",,"""Made in USA""",3.49,""""""
"""069471""","""HOL CREAMED GREENS""","""1""","""IN_STOCK""",,"[{"""",[{""7 g"",1,""Total Fat"","".09""}, {""2.5 g"",2,""Saturated Fat"","".13""}, … {""190 mg"",14,""Potassium"","".04""}],0,1,"""",""1 cup (124g)"",""Serves about 4""}]","[{1,""MILK (MILK, VITAMIN D3), BRUSSELS SPROUTS, KALE""}, {2,""ONION""}, … {11,""DRIED ROSEMARY.""}]","""10""","{{{5.99,""USD""}}}",,"""Manufactured in USA""",5.99,""""""
"""063197""","""GRATIN BROCCOLI & CAULIFLOWER""","""1""","""IN_STOCK""",,"[{"""",[{""7 g"",1,""Total Fat"","".09""}, {""4.0 g"",2,""Saturated Fat"","".2""}, … {""260 mg"",14,""Potassium"","".06""}],0,1,"""",""1 cup (150g)"",""about 4""}]","[{1,""BROCCOLI""}, {2,""CAULIFLOWER""}, … {11,""BLACK PEPPER.""}]","""2""","{{{4.49,""USD""}}}",,"""Product of Italy""",4.49,""""""


In [33]:
tjhp = tjhp.filter((pl.col("calories_per_serving") != ""))

More problematic calories per serving fields: `8 out of 4570 values: ["3291 kcal/kg; 29 kcal/treat", "3200 kcal/kg; 18 kcal/treat", … "varied"]`.

In [34]:
tjhp.shape

(4394, 13)

In [35]:
### DROP THEM. come back later to salvage whichever.
tjhp = tjhp.filter(tjhp['calories_per_serving'].str.contains('^\\s*\\d+\\s*$'))
tjhp.shape
# Yes, this filters out those 8 values giving us trouble before

(4386, 13)
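To see what the pattern in the cell above lets through, here is a minimal standalone sketch using Python's `re` module instead of polars. The sample strings are taken from the outputs above; `is_plain_number` is a hypothetical helper name, not part of the notebook.

```python
import re

# Same pattern as the polars filter: optional whitespace, digits only, optional whitespace
PLAIN_NUMBER = re.compile(r"^\s*\d+\s*$")

def is_plain_number(text):
    """Return True when text is just an integer, possibly padded with spaces."""
    return bool(PLAIN_NUMBER.match(text))

samples = ["140 ", "10", "3291 kcal/kg; 29 kcal/treat", "varied", ""]
for s in samples:
    print(repr(s), is_plain_number(s))
```

Note that a decimal value such as `"12.5"` would also be rejected by this pattern, since `.` is not a digit.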

In [36]:
tjhp = tjhp.with_columns(
    pl.col('calories_per_serving').str.strip_chars(' ').cast(pl.Float32).alias('calories_per_serving')
)

Now the same for servings per container.

In [45]:
tjhp['nutrition'][0][0]['servings_per_container']

'Serves about 5'

In [47]:
# Create raw servings column
tjhp = tjhp.with_columns(
    pl.col("nutrition").map_elements(lambda x: x[0]['servings_per_container'] if len(x)>0 else None, return_dtype=str).alias("servings_per_container_text")
)

In [48]:
unique_spc = tjhp['servings_per_container_text'].unique() # alternative: .value_counts()
for spc in unique_spc:
    print(spc)

Serves About 3
serves 4
Serves 20
Serves about 12
Serves About 2.5 servings
about 6
Serves About 3.5
Serves aproximately 10
Serves 28
Serves about 83
Serves about 40
Serves about 5
Serves about 29
Serves 120
Serves about 92
Serves about 15
Serves About 4.5
Serves 82
Serves 88
Serves about 47
Serves about 22
Serves 75
Serves about 31
Serves about 4
Serves ABOUT 5
about 13
1
Serves about 28
Serves About 5
Serves about 542
Serves about 13
Serves ??
serves 3
about 11
Serves 3 Slices
Servings 13
Serves about 51
Serves About 30
Serves About 454
Serves 6
Serves about 10
Serves about 62
Serves 64
about 17
Servings varied
Serves 1 snowman (75g)
Serves About 7
Serves about 16 servings
Serves 50
about 4.5
Serves approx. 6
Servings varies
Serves 36
Serves 383
Serves about 152
Serves about 67
Serves 3
Serves about 13 per set
Serves 51
Serves 65
Serves About 25
Serves About 9
Serves not specified
serves 2
Serving varies
Serves about 45
Serves 68
Serves About 2.5 serving per container
about 16
Serves

The common problems with servings per container include:
* `Serves one` and `Serves does not need this since there is only one serving` (manually replace with 1)
* `Serves about 3 (About 2.5 without Dressing)` (just grab first numeric)
* `Serves 8 on 8oz and 12 on 12oz` (This might be a problem depending on whether there is a single reported calorie value)
* `Serves varied` (idek man)
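One way to handle the cases listed above in a single pass is a small pure-Python parser. This is a sketch under assumptions, not the notebook's method: `parse_servings` is a hypothetical helper, it treats the word "one" as 1, takes the first number it finds (keeping any decimal part), and returns `None` when no number is present.

```python
import re

def parse_servings(text):
    """Best-effort parse of a servings string; returns a float or None."""
    if text is None:
        return None
    lowered = text.lower()
    # "Serves one" and "...there is only one serving" both mean a single serving
    if re.search(r"\bone\b", lowered):
        return 1.0
    # Grab the first number, keeping any decimal part ("about 4.5" -> 4.5)
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    return float(match.group()) if match else None

for raw in ["Serves about 5", "Serves About 2.5 servings", "Serves one", "Serves varied"]:
    print(raw, "->", parse_servings(raw))
```

A string like `Serves 8 on 8oz and 12 on 12oz` still comes back as 8.0 here, so ambiguous entries would need a manual check either way.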

In [49]:

tjhp = tjhp.with_columns(
    # A spelled-out "one" means a single serving; keep it a string so the regex step below still applies
    servings_per_container=pl.when(pl.col("servings_per_container_text").str.contains(' one', literal=True))
    .then(pl.lit("1"))
    .otherwise(pl.col("servings_per_container_text"))
)

In [50]:
# Grab the first run of digits. Note that (\d+) keeps only the integer part, so "Serves about 3.5" becomes 3
tjhp = tjhp.with_columns(
    servings_per_container=tjhp['servings_per_container'].str.extract(r'(\d+)').cast(pl.Float32)
)

In [51]:
tjhp = tjhp.with_columns(
    calories_per_container = tjhp['calories_per_serving'] * tjhp['servings_per_container']
)

In [52]:
tjhp = tjhp.with_columns(
    dollars_per_calorie = tjhp['price_usd'] / tjhp['calories_per_container']
)

In [53]:
tjhp = tjhp.with_columns(
    calories_per_dollar = 1 / pl.col('dollars_per_calorie')
)
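As a worked example of the three derived columns above, take the first row of the data (140 calories per serving, "Serves about 5", $3.29): calories per container is 140 × 5 = 700, dollars per calorie is 3.29 / 700 = 0.0047, and calories per dollar is 700 / 3.29 ≈ 212.77. A plain-Python sketch of the same arithmetic, with a hypothetical helper name and an assumed `None` return for free items:

```python
def calories_per_dollar(calories_per_serving, servings_per_container, price_usd):
    """Total calories in the container divided by its price; None if the price is zero."""
    if price_usd == 0:
        return None
    calories_per_container = calories_per_serving * servings_per_container
    return calories_per_container / price_usd

# Mango Sticky Rice Crisps: 140 calories per serving, about 5 servings, $3.29
value = calories_per_dollar(140, 5, 3.29)
print(round(value, 2))  # 212.77
```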

In [224]:
tjhp.columns

['sku',
 'name',
 'availability',
 'stock_status',
 'only_x_left_in_stock',
 'nutrition',
 'ingredients',
 'popularity',
 'price',
 'country_of_manufacture',
 'country_of_origin',
 'price_usd',
 'calories_per_serving',
 'servings_per_container_text',
 'servings_per_container',
 'calories_per_container',
 'dollars_per_calorie',
 'calories_per_dollar']

In [178]:
# Remove 0-calorie nothings like salt and tea and hot sauce
# Remove items that are just free i guess??!!
tjhp_trim = tjhp[:, ['sku', 'name', 'calories_per_dollar', 'dollars_per_calorie', 'price_usd', 'calories_per_container', 'calories_per_serving', 'servings_per_container', 'nutrition', 'stock_status']].filter(
    ~tjhp['dollars_per_calorie'].is_null() & ~tjhp['dollars_per_calorie'].is_nan() & (tjhp['calories_per_container']!=0) & (tjhp['price_usd']!=0)
)
tjhp_trim

sku,name,calories_per_dollar,dollars_per_calorie,price_usd,calories_per_container,calories_per_serving,servings_per_container,nutrition,stock_status
str,str,f64,f64,f64,f32,f32,f32,list[struct[7]],str
"""076670""","""MANGO STICKY RICE CRISPS""",212.765957,0.0047,3.29,700.0,140.0,5.0,"[{""140 "",[{""6 g"",1,""Total Fat"","".08""}, {""3.5 g"",2,""Saturated Fat"","".18""}, … {""40 mg"",14,""Potassium"",""0""}],0,1,"""",""1 oz (28g/about 9 pieces)"",""Serves about 5""}]","""OUT_OF_STOCK"""
"""077815""","""ORG COCONUT BEVERAGE ORGINAL U…",80.267559,0.012458,2.99,240.0,60.0,4.0,"[{""60 "",[{""6 g"",0,""Total Fat"","".08""}, {""5 g"",1,""Saturated Fat"","".25""}, … {""1.11 mcg"",16,""Vitamin B12"","".45""}],0,1,null,""1 Cup (240mL)"",""Serves 4""}]","""OUT_OF_STOCK"""
"""077788""","""COOKIES NOCCIOLINI TINY HAZELN…",167.286245,0.005978,2.69,450.0,150.0,3.0,"[{""150 "",[{""8 g"",1,""Total Fat"","".1""}, {""0.5 g"",2,""Saturated Fat"","".03""}, … {""110 mg"",14,""Potassium"","".02""}],0,1,null,""1/3 cup (28g)"",""Serves about 3.5""}]","""OUT_OF_STOCK"""
"""002636""","""PEANUT BUTTER CUPS MILK MINI""",352.705411,0.002835,4.99,1760.0,160.0,11.0,"[{""160 "",[{""9 g"",1,""Total Fat"","".12""}, {""5.0 g"",2,""Saturated Fat"","".25""}, … {""80 mg"",14,""Potassium"","".02""}],0,1,null,""21 pieces (30g)"",""Serves about 11""}]","""OUT_OF_STOCK"""
"""050920""","""WHOLE MILK RICOTTA""",160.401003,0.006234,3.99,640.0,80.0,8.0,"[{""80 "",[{""6 g"",1,""Total Fat"","".08""}, {""5 g"",2,""Saturated Fat"","".25""}, … {""80 mg"",14,""Potassium"","".02""}],0,1,"""",""1/4 cup (55g)"",""Serves about 8""}]","""OUT_OF_STOCK"""
…,…,…,…,…,…,…,…,…,…
"""051226""","""NATURAL GROUND TURKEY BREAST""",91.819699,0.010891,5.99,550.0,110.0,5.0,"[{""110 "",[{""1.5 g"",1,""Total Fat"","".02""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""280 mg"",14,""Potassium"","".06""}],0,1,null,""4 oz (112g)"",""Serves 5""}]","""OUT_OF_STOCK"""
"""018410""","""WLD UNSLTD SRDNS SPRING WATER""",107.38255,0.0093125,1.49,160.0,160.0,1.0,"[{""160 "",[{""10 g"",1,""Total Fat"","".13""}, {""3.0 g"",2,""Saturated Fat"","".15""}, … {""160 mg"",16,""Potassium"","".04""}],0,1,null,""1 can drained (84g)"",""Serves 1""}]","""OUT_OF_STOCK"""
"""057868""","""PUFF DOGS""",208.012327,0.004807,6.49,1350.0,270.0,5.0,"[{""270 "",[{""170 kcal"",1,""Calories from Fat"",null}, {""19 g"",2,""Total Fat"","".29""}, … {""1.4329 mg"",15,""Iron"","".08""}],0,1,null,""1 puff dog (89g)"",""Serves 5""}]","""OUT_OF_STOCK"""
"""077629""","""A HANDFUL OF TINY DARK CHOCOLA…",217.054264,0.004607,1.29,280.0,140.0,2.0,"[{""140 "",[{""6 g"",1,""Total Fat"","".08""}, {""4.0 g"",2,""Saturated Fat"","".2""}, … {""100 mg"",14,""Potassium"","".02""}],0,1,""per serving"",""10 pretzels (28g)"",""Serves about 2.5""}, {""350 "",[{""16 g"",1,""Total Fat"","".21""}, {""10 g"",2,""Saturated Fat"","".5""}, … {""260 mg"",14,""Potassium"","".06""}],1,2,""per container"",""10 pretzels (28g)"",""Serves about 2.5""}]","""OUT_OF_STOCK"""


Why are there polars `null` and also just `NaN`?

In [179]:
tjhp_trim.sort(by='dollars_per_calorie', descending=False)
# Chocolatey dipping kit limited time only, caloric maximum is a fucking cryptid, it's lost media
# NO these fucking prices are just straight up wrong, 1 cent???

sku,name,calories_per_dollar,dollars_per_calorie,price_usd,calories_per_container,calories_per_serving,servings_per_container,nutrition,stock_status
str,str,f64,f64,f64,f32,f32,f32,list[struct[7]],str
"""077826""","""KIT CHOCOLATE DIPPING""",120000.0,0.000008,0.01,1200.0,150.0,8.0,"[{""150 "",[{""8 g"",0,""Total Fat"","".1""}, {""6 g"",1,""Saturated Fat"","".3""}, … {""70 mg"",13,""Potassium"","".02""}],0,1,"""",""1/8 kit (28g)"",""Serves 8""}]","""OUT_OF_STOCK"""
"""073468""","""INCREDISAUCE""",100000.0,0.00001,0.01,1000.0,100.0,10.0,"[{""100 "",[{""7 g"",1,""Total Fat"","".09""}, {""1 g"",2,""Saturated Fat"","".05""}, … {""20 mg"",14,""Potassium"",""0""}],0,1,"""",""2 Tbsp. (32g)"",""Serves about 10""}]","""OUT_OF_STOCK"""
"""077328""","""CINNAMON SWIZZLE STICKS""",72000.0,0.000014,0.01,720.0,120.0,6.0,"[{""120 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""1 stick (30g) edible portion"",""Serves 6""}]","""OUT_OF_STOCK"""
"""077036""","""CINNAMON BUN TRUFFLES""",48000.0,0.000021,0.01,480.0,160.0,3.0,"[{""160 "",[{""10 g"",1,""Total Fat"","".13""}, {""5 g"",2,""Saturated Fat"","".25""}, … {""70 mg"",14,""Potassium"","".02""}],0,1,"""",""6 pieces (30g)"",""Serves about 3""}]","""OUT_OF_STOCK"""
"""067699""","""ALMONDS DARK CHOC AMPED-UP""",38000.0,0.000026,0.01,380.0,380.0,1.0,"[{""380 "",[{""29 g"",1,""Total Fat"","".37""}, {""10 g"",2,""Saturated Fat"","".5""}, … {""450 mg"",14,""Potassium"","".1""}],0,1,"""",""1 pack (71g)"",""1""}]","""IN_STOCK"""
…,…,…,…,…,…,…,…,…,…
"""161708""","""COLD PRESSED JUICE SHOT ORGANI…",3.297739,0.303238,31.84,105.0,35.0,3.0,"[{""35 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"","".01""}, … {""270 mg"",14,""Potassium"","".06""}],0,1,null,""1 bottle (59 mL)"",""Serves 3""}]","""OUT_OF_STOCK"""
"""170308""","""SPARKLING GREEN TEA PINEAPPLE …",2.506266,0.399,3.99,10.0,10.0,1.0,"[{""10"",[{""0g"",1,""Total Fat"",""0""}, {""15mg"",2,""Sodium"","".01""}, … {""0g"",6,""Protein"",""""}],0,1,"""",""1 can (250mL)"",""Serves 1""}]","""OUT_OF_STOCK"""
"""175077""","""COLD PRESSED PROBIOTIC JUICE S…",0.628141,1.592,31.84,20.0,20.0,1.0,"[{""20 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""70 mg"",14,""Potassium"","".02""}],0,1,null,""1 bottle (59mL)"",""Serves 1""}]","""OUT_OF_STOCK"""
"""168481""","""COLD PRESSED JUICE SHOT 100% A…",0.471106,2.122667,31.84,15.0,15.0,1.0,"[{""15 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""907 mg"",16,""Vitamin C"",""10.1""}],0,1,null,""1 bottle (59 mL)"",""Serves 1""}]","""OUT_OF_STOCK"""


In [180]:
cfg = pl.Config()
cfg.set_tbl_rows(10)
# with pl.Config(tbl_rows=10):
#     tjhp_trim.sort(by='dollars_per_calorie', descending=False)
tjhp_trim.sort(by='calories_per_dollar', descending=True).head(20)
# How can I do a HAVING filter after the sort? To remove crazy low dollars_per_calorie values
## What did I mean by this. Maybe HAVING after the calories_per_container thing?

sku,name,calories_per_dollar,dollars_per_calorie,price_usd,calories_per_container,calories_per_serving,servings_per_container,nutrition,stock_status
str,str,f64,f64,f64,f32,f32,f32,list[struct[7]],str
"""077826""","""KIT CHOCOLATE DIPPING""",120000.0,0.000008,0.01,1200.0,150.0,8.0,"[{""150 "",[{""8 g"",0,""Total Fat"","".1""}, {""6 g"",1,""Saturated Fat"","".3""}, … {""70 mg"",13,""Potassium"","".02""}],0,1,"""",""1/8 kit (28g)"",""Serves 8""}]","""OUT_OF_STOCK"""
"""073468""","""INCREDISAUCE""",100000.0,0.00001,0.01,1000.0,100.0,10.0,"[{""100 "",[{""7 g"",1,""Total Fat"","".09""}, {""1 g"",2,""Saturated Fat"","".05""}, … {""20 mg"",14,""Potassium"",""0""}],0,1,"""",""2 Tbsp. (32g)"",""Serves about 10""}]","""OUT_OF_STOCK"""
"""077328""","""CINNAMON SWIZZLE STICKS""",72000.0,0.000014,0.01,720.0,120.0,6.0,"[{""120 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""1 stick (30g) edible portion"",""Serves 6""}]","""OUT_OF_STOCK"""
"""077036""","""CINNAMON BUN TRUFFLES""",48000.0,0.000021,0.01,480.0,160.0,3.0,"[{""160 "",[{""10 g"",1,""Total Fat"","".13""}, {""5 g"",2,""Saturated Fat"","".25""}, … {""70 mg"",14,""Potassium"","".02""}],0,1,"""",""6 pieces (30g)"",""Serves about 3""}]","""OUT_OF_STOCK"""
"""067699""","""ALMONDS DARK CHOC AMPED-UP""",38000.0,0.000026,0.01,380.0,380.0,1.0,"[{""380 "",[{""29 g"",1,""Total Fat"","".37""}, {""10 g"",2,""Saturated Fat"","".5""}, … {""450 mg"",14,""Potassium"","".1""}],0,1,"""",""1 pack (71g)"",""1""}]","""IN_STOCK"""
…,…,…,…,…,…,…,…,…,…
"""079176""","""HOL FESTIVELY SHAPED PRETZELS""",1333.333333,0.00075,0.99,1320.0,120.0,11.0,"[{""120 "",[{""1.0 g"",1,""Total Fat"","".01""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""50 mg"",14,""Potassium"","".02""}],0,1,null,""12 pieces (30g)"",""Serves about 11""}]","""OUT_OF_STOCK"""
"""070185""","""RICE CALROSE""",1285.140562,0.000778,2.49,3200.0,160.0,20.0,"[{""160 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""10 mg"",14,""Potassium"",""0""}],0,1,"""",""1/4 cup dry (45g) yields 2/3 cup cooked rice"",""Serves about 20""}]","""OUT_OF_STOCK"""
"""038985""","""SALTED TORTILLA CHIPS 2 LB""",1283.667622,0.000779,3.49,4480.0,140.0,32.0,"[{""140 "",[{""6 g"",1,""Total Fat"","".08""}, {""0.5 g"",2,""Saturated Fat"","".03""}, … {""60 mg"",14,""Potassium"","".02""}],0,1,"""",""1 oz (28g/about 10 chips)"",""Serves about 32""}]","""IN_STOCK"""
"""065544""","""PASTA FETTUCCINE ORGANIC""",1240.310078,0.000806,1.29,1600.0,200.0,8.0,"[{""200 "",[{""1.0 g"",1,""Total Fat"","".01""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""100 mg"",14,""Potassium"","".02""}],0,1,null,""2 oz (56g/1/8 package) dry"",""Serves 8""}]","""OUT_OF_STOCK"""


Let's filter out products that obviously make no sense, e.g. excessive calories in a container and $0.01 price tags.

In [204]:
tjhp_clean = tjhp_trim.clone() # Cheap copy; polars does not duplicate the underlying data

In [205]:
tjhp_clean = tjhp_clean.sort(by='calories_per_dollar', descending=True)


In [206]:
tjhp_clean['price_usd'].hist()

breakpoint,category,count
f64,cat,u32
3.193,"""[0.01, 3.193]""",1351
6.376,"""(3.193, 6.376]""",2154
9.559,"""(6.376, 9.559]""",266
12.742,"""(9.559, 12.742]""",61
15.925,"""(12.742, 15.925]""",17
19.108,"""(15.925, 19.108]""",6
22.291,"""(19.108, 22.291]""",6
25.474,"""(22.291, 25.474]""",2
28.657,"""(25.474, 28.657]""",0
31.84,"""(28.657, 31.84]""",4


In [207]:
tjhp_clean.filter(tjhp_clean['price_usd'] <= 0.49)

sku,name,calories_per_dollar,dollars_per_calorie,price_usd,calories_per_container,calories_per_serving,servings_per_container,nutrition,stock_status
str,str,f64,f64,f64,f32,f32,f32,list[struct[7]],str
"""077826""","""KIT CHOCOLATE DIPPING""",120000.0,0.000008,0.01,1200.0,150.0,8.0,"[{""150 "",[{""8 g"",0,""Total Fat"","".1""}, {""6 g"",1,""Saturated Fat"","".3""}, … {""70 mg"",13,""Potassium"","".02""}],0,1,"""",""1/8 kit (28g)"",""Serves 8""}]","""OUT_OF_STOCK"""
"""073468""","""INCREDISAUCE""",100000.0,0.00001,0.01,1000.0,100.0,10.0,"[{""100 "",[{""7 g"",1,""Total Fat"","".09""}, {""1 g"",2,""Saturated Fat"","".05""}, … {""20 mg"",14,""Potassium"",""0""}],0,1,"""",""2 Tbsp. (32g)"",""Serves about 10""}]","""OUT_OF_STOCK"""
"""077328""","""CINNAMON SWIZZLE STICKS""",72000.0,0.000014,0.01,720.0,120.0,6.0,"[{""120 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""1 stick (30g) edible portion"",""Serves 6""}]","""OUT_OF_STOCK"""
"""077036""","""CINNAMON BUN TRUFFLES""",48000.0,0.000021,0.01,480.0,160.0,3.0,"[{""160 "",[{""10 g"",1,""Total Fat"","".13""}, {""5 g"",2,""Saturated Fat"","".25""}, … {""70 mg"",14,""Potassium"","".02""}],0,1,"""",""6 pieces (30g)"",""Serves about 3""}]","""OUT_OF_STOCK"""
"""067699""","""ALMONDS DARK CHOC AMPED-UP""",38000.0,0.000026,0.01,380.0,380.0,1.0,"[{""380 "",[{""29 g"",1,""Total Fat"","".37""}, {""10 g"",2,""Saturated Fat"","".5""}, … {""450 mg"",14,""Potassium"","".1""}],0,1,"""",""1 pack (71g)"",""1""}]","""IN_STOCK"""
…,…,…,…,…,…,…,…,…,…
"""042973""","""ORG APPLE RASPBERRY FRUIT WRAP""",91.836735,0.010889,0.49,45.0,45.0,1.0,"[{""45 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""110 mg"",14,""Potassium"","".02""}],0,1,null,""14g"",""Serves 1""}]","""OUT_OF_STOCK"""
"""097560""","""ORG APPLE BLUEBERRY FRUIT WRAP""",91.836735,0.010889,0.49,45.0,45.0,1.0,"[{""45 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""100 mg"",14,""Potassium"","".02""}],0,1,null,""14g"",""Serves 1""}]","""OUT_OF_STOCK"""
"""081316""","""ORG SOUR WATERMELON BAR""",91.836735,0.010889,0.49,45.0,45.0,1.0,"[{""45 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""100 mg"",14,""Potassium"","".02""}],0,1,null,""1 bar (14g)"",""Serves 1""}]","""OUT_OF_STOCK"""
"""042974""","""ORG APPLE STRAWBERRY FRUIT WRA…",91.836735,0.010889,0.49,45.0,45.0,1.0,"[{""45 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""110 mg"",14,""Potassium"","".02""}],0,1,null,""14g"",""Serves 1""}]","""OUT_OF_STOCK"""


In [208]:
# tjhp = tjhp.with_columns(
#     calories_per_dollar = 1 / pl.col('dollars_per_calorie')
# )

# df.filter KEEPS the rows that satisfy the condition (and drops the rest)
tjhp_clean = tjhp_clean.filter(
        (tjhp_clean['calories_per_container'] < 30000) # Demon core candied ginger
        &
        (tjhp_clean['price_usd'] >= 0.48) # Threshold to just include fruit wraps and "just a handful"-type products
)

In [209]:
# tjhp.filter(~tjhp['servings_per_container'].str.contains('[(serves)(Serves)(about)]'))

In [210]:
# # Energy Bar Peanut Butter / TJ'S CRUNCHY PEANUT BUTTER ENERGY
# # Wrong, it is listing price of a single bar, but servings as 12 bars!!!!
# tjhp.filter(tjhp['sku'] == '077616')

# # Same issue with TJ'S CHOCOLATE CHIP ENERGY

In [211]:
tjhp.filter(tjhp['sku'] == '070379')

sku,name,availability,stock_status,only_x_left_in_stock,nutrition,ingredients,popularity,price,country_of_manufacture,country_of_origin,price_usd,calories_per_serving,servings_per_container_text,servings_per_container,calories_per_container,dollars_per_calorie,calories_per_dollar
str,str,str,str,null,list[struct[7]],list[struct[2]],str,struct[1],null,str,f64,f32,str,f32,f32,f64,f64
"""070379""","""OIL CANOLA SPRAY ORGANIC""","""1""","""OUT_OF_STOCK""",,"[{""0 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 mg"",2,""Sodium"",""0""}, … {""0 g"",10,""Protein"",""""}],0,1,"""",""1/3 second spray (0.25g)"",""Serves 536""}]","[{1,""ORGANIC CANOLA OIL""}]","""52""","{{{2.99,""USD""}}}",,"""Product of Other (do not use) …",2.99,0.0,"""Serves 536""",536.0,0.0,inf,0.0


In [212]:
tjhp.filter(tjhp['sku'] == '073468')

sku,name,availability,stock_status,only_x_left_in_stock,nutrition,ingredients,popularity,price,country_of_manufacture,country_of_origin,price_usd,calories_per_serving,servings_per_container_text,servings_per_container,calories_per_container,dollars_per_calorie,calories_per_dollar
str,str,str,str,null,list[struct[7]],list[struct[2]],str,struct[1],null,str,f64,f32,str,f32,f32,f64,f64
"""073468""","""INCREDISAUCE""","""1""","""OUT_OF_STOCK""",,"[{""100 "",[{""7 g"",1,""Total Fat"","".09""}, {""1 g"",2,""Saturated Fat"","".05""}, … {""20 mg"",14,""Potassium"",""0""}],0,1,"""",""2 Tbsp. (32g)"",""Serves about 10""}]","[{1,""WATER""}, {2,""CANE SUGAR""}, … {20,""PAPRIKA OLEORESIN FOR COLOR.""}]","""29""","{{{0.01,""USD""}}}",,"""Product of United States""",0.01,100.0,"""Serves about 10""",10.0,1000.0,1e-05,100000.0


To sanity check items that do not have accurate servings per container given the price, we can insert another variable for the price per serving to see if that is weird.

In [213]:
tjhp_clean = tjhp_clean.with_columns(
    dollars_per_serving = tjhp_clean['price_usd'] / tjhp_clean['servings_per_container']
)

In [214]:
tjhp_clean.head(3)

sku,name,calories_per_dollar,dollars_per_calorie,price_usd,calories_per_container,calories_per_serving,servings_per_container,nutrition,stock_status,dollars_per_serving
str,str,f64,f64,f64,f32,f32,f32,list[struct[7]],str,f64
"""077616""","""TJ'S CRUNCHY PEANUT BUTTER ENE…",2722.689076,0.000367,1.19,3240.0,270.0,12.0,"[{""270 "",[{""7 g"",1,""Total Fat"","".09""}, {""0.5 g"",2,""Saturated Fat"","".03""}, … {""180 mg"",14,""Potassium"","".04""}],0,1,"""",""1 bar (68g)"",""Serves 12""}]","""OUT_OF_STOCK""",0.099167
"""077618""","""TJ'S CHOCOLATE CHIP ENERGY""",2621.848739,0.000381,1.19,3120.0,260.0,12.0,"[{""260 "",[{""6 g"",1,""Total Fat"","".08""}, {""2.5 g"",2,""Saturated Fat"","".13""}, … {""170 mg"",14,""Potassium"","".04""}],0,1,"""",""1 bar (68g)"",""Serves 12""}]","""OUT_OF_STOCK""",0.099167
"""052029""","""OIL 100% CANOLA""",2303.724928,0.000434,3.49,8040.0,120.0,67.0,"[{""120 "",[{""120 kcal"",1,""Calories from Fat"",null}, {""14 g"",2,""Total Fat"","".22""}, … {""0.00000000 mg"",17,""Iron"",""0""}],0,1,null,""1 Tbsp (14 g)"",""Serves About 67""}]","""OUT_OF_STOCK""",0.05209


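Also, let's fix some obvious mistakes manually. The two energy bars at the top list the price of a single bar but report 12 servings, so set their servings per container to 1.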

In [215]:
tjhp_clean = tjhp_clean.with_columns(
    pl.when(tjhp_clean['sku'].is_in(["077616", "077618"]))
    .then(1)
    .otherwise(pl.col("servings_per_container"))
    .alias("servings_per_container")
)


In [216]:
tjhp_clean = tjhp_clean.with_columns(
    calories_per_container = tjhp_clean['calories_per_serving'] * tjhp_clean['servings_per_container']
)

In [217]:
tjhp_clean = tjhp_clean.with_columns(
    calories_per_dollar = tjhp_clean['calories_per_container'] / tjhp_clean['price_usd']
)

In [218]:
tjhp_clean = tjhp_clean.with_columns(
    dollars_per_calorie = 1 / tjhp_clean['calories_per_dollar']
)

In [219]:
tjhp_clean = tjhp_clean.sort(by='calories_per_dollar', descending=True)

In [220]:
# tjhp['nutrition'][0][0]['details']

In [221]:

# [i['amount'] for i in tjhp['nutrition'][0][0]['details'] if i['nutritional_item']=='Protein']

In [222]:
# Try to extract nutrients (not calories)
# Start with protein 
tjhp_clean = tjhp_clean.with_columns(
    # tj['nutrition']
    pl.col("nutrition").map_elements(lambda x: x[0]['details'] if len(x)>0 else None).alias("nutrition_exp")
    # pl.col("nutrition").map_elements(lambda x: x[0]['servings_per_container'] if len(x)>0 else None, return_dtype=str).alias("servings_per_container")

)
# tjhp.head(3)


In [227]:
tjhp_clean = tjhp_clean.with_columns(
    pl.col('nutrition_exp').map_elements(lambda x: [i['amount'] for i in x if i['nutritional_item']=='Protein'], return_dtype=list[str]).alias('protein')
)


In [228]:
import re

def parse_nutritional_item(x):
    '''Return the first number in a one-element list of amounts, e.g. ['3.5 g'] -> 3.5.
    Missing or unparseable amounts count as 0.'''
    if x is None or len(x) == 0:
        return 0.0
    # Decimal-aware pattern: r'\d+' alone would turn '0.5 g' into 0
    match = re.search(r'\d*\.?\d+', x[0] or '')
    return float(match.group()) if match else 0.0

tjhp_clean = tjhp_clean.with_columns(
    pl.col("protein").map_elements(parse_nutritional_item, return_dtype=float).alias('protein_per_serving')
)
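The protein extraction above handles one nutrient at a time. As a rough sketch of how the same idea could extend to every entry in `details`, the hypothetical helper below flattens one `details` list into a dict keyed by `nutritional_item`. The sample entries are copied from the output shown earlier; the renaming of `Includes` to `Added Sugars` is an assumption based on that single example, and units (g vs. mg) are discarded.

```python
import re

def flatten_details(details):
    """Map each nutritional_item to its numeric amount; unparseable amounts become None."""
    flat = {}
    for entry in details:
        name = entry["nutritional_item"]
        amount = entry["amount"] or ""
        # Quirk seen in the data: {'amount': '3 g Added Sugars', 'nutritional_item': 'Includes'}
        if name == "Includes":
            name = "Added Sugars"
        match = re.search(r"\d*\.?\d+", amount)
        flat[name] = float(match.group()) if match else None
    return flat

details = [
    {"amount": "6 g", "display_seq": 1, "nutritional_item": "Total Fat", "percent_dv": ".08"},
    {"amount": "3.5 g", "display_seq": 2, "nutritional_item": "Saturated Fat", "percent_dv": ".18"},
    {"amount": "3 g Added Sugars", "display_seq": 9, "nutritional_item": "Includes", "percent_dv": ".06"},
]
print(flatten_details(details))
```

Each key could then become its own sparse column, which matches the "individual and possibly sparse columns" goal from the to-do list.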

CSV does not support nested data, so we must drop nutrition here.

In [238]:
tjhp_clean = tjhp_clean.drop('nutrition_exp')

In [240]:
tjhp_clean = tjhp_clean.drop('nutrition')
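If the nested fields are worth keeping, one alternative to dropping them is to serialize each nested value to a JSON string before writing. The sketch below uses only the standard library (`csv`, `json`) instead of polars, and the single row is a trimmed-down version of the first item above, so treat it as an illustration of the idea and not as the notebook's approach.

```python
import csv
import io
import json

rows = [
    {
        "sku": "076670",
        "nutrition": [{"calories_per_serving": "140 ", "servings_per_container": "Serves about 5"}],
    },
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["sku", "nutrition"])
writer.writeheader()
for row in rows:
    # Serialize the nested list to a JSON string so it fits in one CSV cell
    writer.writerow({"sku": row["sku"], "nutrition": json.dumps(row["nutrition"])})

# Reading it back: parse the JSON string to recover the nested structure
buffer.seek(0)
restored = [
    {"sku": r["sku"], "nutrition": json.loads(r["nutrition"])}
    for r in csv.DictReader(buffer)
]
print(restored == rows)
```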

In [247]:
tjhp_clean = tjhp_clean.with_columns(
    # Each list holds at most one amount; take it directly (empty lists become null)
    pl.col('protein').list.first()
)

In [249]:
tjhp_clean = tjhp_clean.with_columns(
    (pl.col('protein_per_serving') * pl.col('servings_per_container')).alias('protein_per_container')
)

tjhp_clean = tjhp_clean.with_columns(
    (pl.col('protein_per_container') / pl.col('price_usd')).alias('grams_protein_per_dollar')
)

Save.

In [250]:
tjhp_clean.write_csv("data/tjhp_clean.csv")

### All list

In [None]:
import polars as pl

In [None]:
# pl.json_normalize(all_items_fixed[100:102])

Beyond just dumping each json object from `all_items_fixed` into `pl.json_normalize()`, we also need to further handle the `nutrition` and `ingredients` dictionaries.

In [None]:
pl.json_normalize(all_items_fixed[100]['nutrition'])

For now, just save it as expressly as possible.

In [None]:
# df_all_items_raw = pl.json_normalize(all_items_fixed,
#                                      infer_schema_length=None)

In [None]:
df_all_items_raw.tail(3)
# Columns 2 and 3 are Lists of Structs of...


In [None]:
# df_all_items_raw.write_csv('data/all_items_raw.csv')
# # Nested columns 2 and 3 do not work with CSV

In [None]:
# import json
# all_items_fixed_raw_path = "data/all_items_fixed_raw.json"
# with open(all_items_fixed_raw_path, "w") as f:
#        json.dump(all_items_fixed, f, indent=4)



Every description is empty html???

In [None]:
broc = [item['name'] for item in all_items_fixed if 'KALE' in item['name']]



broc

In [None]:
df_all_items_raw.write_csv()

Let's try and isolate only food items.

In [None]:
pl.json_normalize(all_items_fixed[100], max_level=4)

In [None]:
nuts = [item['nutrition'] for item in all_items_fixed]

In [None]:
# Get only items with non-empty nutrition information
# This will be criterion for what counts as "food"

In [None]:
# nut_lens = [len(nut) for nut in nuts]
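
A minimal plain-Python sketch of that criterion, assuming each item is a dict whose `nutrition` value is a list that may be empty or `None`. The sample records and the `is_food` helper are made up for illustration and are not real catalog data.

```python
def is_food(item):
    """Treat an item as food only if it carries at least one nutrition panel."""
    nutrition = item.get("nutrition")
    return bool(nutrition)  # False for None and for an empty list


# Hypothetical sample records, not real catalog data
sample_items = [
    {"sku": "000001", "name": "EXAMPLE COOKIES", "nutrition": [{"calories_per_serving": "230 "}]},
    {"sku": "000002", "name": "EXAMPLE SPONGES", "nutrition": []},
    {"sku": "000003", "name": "EXAMPLE CANDLE", "nutrition": None},
]

food_items = [item for item in sample_items if is_food(item)]
print([item["name"] for item in food_items])
```

The same test could be applied to `all_items_fixed` directly, though a non-empty panel does not guarantee the values inside it are usable.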

Oh, SKU 10032021 "POP UP SPONGES" for some reason contains information for pizza crusts? The problematic field is `"item_description": "Broccoli & Kale Pizza Crust description"`.

#### Getting only the values we need from the json, making it relational

In [1]:
import json
import polars as pl

In [2]:
with open('data/all_items_fixed_raw.json', 'r') as f:
    tj_raw = json.load(f)

Recall that the `item_description` field is `null` for just about every entry except one where it is instead the unrelated `Broccoli & Kale Pizza Crust description`.

In [3]:
tj_raw = [item for item in tj_raw if item['item_description'] != 'Broccoli & Kale Pizza Crust description']

In [4]:
tj = pl.DataFrame(tj_raw)

Need to:

* Remove rows with null `nutrition`
* Remove useless fields (`item_description`, `description`)
* Turn nested fields into 
    * lists (ingredients)
    * individual and possibly sparse columns (fiber, protein, calories)
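
As a rough sketch of the last bullet, here is one way to flatten a nutrition panel into individual columns in plain Python before building a DataFrame. It uses the field names that the raw nutrition structs carry (`calories_per_serving`, `details`, `nutritional_item`, `amount`); the `flatten_nutrition` helper and its snake_case column naming are assumptions for illustration, not what the notebook ends up doing.

```python
def flatten_nutrition(item):
    """Turn the first nutrition panel of an item into flat, possibly sparse, columns."""
    flat = {"sku": item["sku"]}
    panels = item.get("nutrition") or []
    if not panels:
        return flat  # no panel: only the sku survives
    panel = panels[0]
    flat["calories_per_serving"] = panel.get("calories_per_serving")
    for detail in panel.get("details") or []:
        # e.g. "Total Fat" -> "total_fat"
        column = detail["nutritional_item"].strip().lower().replace(" ", "_")
        flat[column] = detail["amount"]
    return flat


sample = {
    "sku": "079877",
    "nutrition": [{
        "calories_per_serving": "100 ",
        "details": [
            {"amount": "0 g", "display_seq": 1, "nutritional_item": "Total Fat", "percent_dv": "0"},
            {"amount": "26 g", "display_seq": 6, "nutritional_item": "Total Carbohydrate", "percent_dv": ".09"},
        ],
    }],
}

row = flatten_nutrition(sample)
print(row)
```

Items that lack a given nutrient simply have no key for it, which is where the sparsity comes from once the rows are stacked into a table.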

In [5]:
tj = tj.drop(['item_description', 'description'])

In [6]:
tj.shape

(25563, 8)

In [7]:
tj = tj.drop_nulls(subset=['nutrition', 'price'])
tj.shape

(4764, 8)

In [8]:
tj.head()

sku,name,nutrition,ingredients,popularity,price,country_of_manufacture,country_of_origin
str,str,list[struct[7]],list[struct[2]],str,struct[1],null,str
"""079877""","""STRAWBERRY FIELDS GUMMY""","[{""100 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""1/4 cup (33g)"",""Serves about 6""}]","[{1,""GLUCOSE SYRUP (CORN, WHEAT)""}, {2,""SUGAR""}, … {10,""WHEAT STARCH.""}]","""731""","{{{2.29,""USD""}}}",,"""Product of France"""
"""079073""","""CASANARE BAR SINGLE STATE DARK…","[{""180 "",[{""14g"",1,""Total Fat"","".18""}, {""8g"",2,""Saturated Fat"","".4""}, … {""140mg"",14,""Potassium"","".02""}],0,1,"""",""10 pieces (31g)"",""Serves About 3""}]","[{1,""UNSWEETENED CHOCOLATE""}, {2,""SUGAR""}, … {4,""SUNFLOWER LECITHIN [EMULSIFIER]""}]","""337""","{{{2.49,""USD""}}}",,"""Product of Colombia"""
"""081510""","""CAKE STRAWBERRY SHEET MINI""","[{""320"",[{""16g"",1,""Total Fat"","".21""}, {""7g"",2,""Saturated Fat"","".35""}, … {"""",14,""Potassium"",""0""}],0,1,"""",""1/6 cake (85g)"",""Serves 6""}]","[{1,""CAKE (CANE SUGAR, ENRICHED WHEAT FLOUR [WHEAT FLOUR, MALTED BARLEY FLOUR, NIACIN, IRON, THIAMINE MONONITRATE, RIBOFLAVIN, FOLIC ACID], WATER, STRAWBERRY PRESERVES [ORGANIC STRAWBERRIES, ORGANIC CANE SUGAR, APPLE PECTIN, ASCORBIC ACID {TO PRESERVE}, CITRIC ACID {ACIDIFIER}], CANOLA OIL, EGGS, BUTTERMILK [CULTURED PASTEURIZED MILK, NONFAT MILK POWDER, SALT], BAKING SODA, NATURAL FLAVOR, SEA SALT)""}, {2,""FROSTING (POWDERED SUGAR [SUGAR, CORNSTARCH], UNSALTED BUTTER [CREAM [MILK], NATURAL FLAVOR], CREAM CHEESE [PASTEURIZED CULTURED MILK AND CREAM, SALT, GUAR GUM, CAROB BEAN GUM], STRAWBERRY PUREE [STRAWBERRIES, CANE SUGAR, FRUIT PECTIN, ASCORBIC ACID {TO MAINTAIN COLOR}], FREEZE DRIED STRAWBERRY, NATURAL FLAVORS, SEA SALT).""}]","""761""","{{{5.99,""USD""}}}",,"""Made in USA"""
"""081299""","""FLATBREAD""","[{""180 "",[{""3.5 g"",1,""Total Fat"","".04""}, {""0.5 g"",2,""Saturated Fat"","".03""}, … {""60 mg"",14,""Potassium"","".02""}],0,1,"""",""1 flatbread (57g)"",""Serves 6""}]","[{1,""WHEAT FLOUR""}, {2,""WATER""}, … {10,""BAKING POWDER (SODIUM ACID PYROPHOSPHATE, SODIUM BICARBONATE, CORNSTARCH, MONOCALCIUM PHOSPHATE).""}]","""280""","{{{2.99,""USD""}}}",,"""Made in USA"""
"""081523""","""COOKIES STRAWBERRY DOODLE""","[{""230 "",[{""10 g"",1,""Total Fat"","".13""}, {""5 g"",2,""Saturated Fat"","".25""}, … {""50 mg"",14,""Potassium"","".02""}],0,1,null,""1 cookie (50 g)"",""Serves 6""}]","[{1,""UNBLEACHED ENRICHED FLOUR (UNBLEACHED WHEAT FLOUR, NIACIN, REDUCED IRON, THIAMINE MONONITRATE, RIBOFLAVIN, FOLIC ACID, ENZYME)""}, {2,""MARGARINE (PALM OIL, PALM KERNEL OIL, WATER, SALT, MONOGLYCERIDES, SUNFLOWER LECITHIN [EMULSIFIER], NATURAL FLAVOR, CITRIC ACID [ACIDULANT], VITAMIN A PALMITATE, VITAMIN D2)""}, … {16,""CITRIC ACID (ACIDULANT).""}]","""363""","{{{5.49,""USD""}}}",,"""Product of United States"""


In [9]:
tj['price'].str

<polars.series.string.StringNameSpace at 0x1f93afc74d0>

In [10]:
temp = tj.with_columns(
    pl.col("price").struct.field("regularPrice").struct.field("amount").struct.field("currency").alias("nested_field")
)

print(temp)

shape: (4_764, 9)
┌────────┬────────────┬────────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ sku    ┆ name       ┆ nutrition  ┆ ingredien ┆ … ┆ price     ┆ country_o ┆ country_o ┆ nested_fi │
│ ---    ┆ ---        ┆ ---        ┆ ts        ┆   ┆ ---       ┆ f_manufac ┆ f_origin  ┆ eld       │
│ str    ┆ str        ┆ list[struc ┆ ---       ┆   ┆ struct[1] ┆ ture      ┆ ---       ┆ ---       │
│        ┆            ┆ t[7]]      ┆ list[stru ┆   ┆           ┆ ---       ┆ str       ┆ str       │
│        ┆            ┆            ┆ ct[2]]    ┆   ┆           ┆ null      ┆           ┆           │
╞════════╪════════════╪════════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 079877 ┆ STRAWBERRY ┆ [{"100     ┆ [{1,"GLUC ┆ … ┆ {{{2.29," ┆ null      ┆ Product   ┆ USD       │
│        ┆ FIELDS     ┆ ",[{"0 g", ┆ OSE SYRUP ┆   ┆ USD"}}}   ┆           ┆ of France ┆           │
│        ┆ GUMMY      ┆ 1,"Total   ┆ (CORN,    ┆   ┆           ┆         

In [11]:
temp['nested_field'].value_counts()

nested_field,count
str,u32
"""USD""",4764


OK, they are all in USD, proceed.

In [12]:
tj = tj.with_columns(
    pl.col("price").struct.field("regularPrice").struct.field("amount").struct.field("value").alias("price_usd")
)
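
For comparison, the same nested lookup can be sketched on the raw JSON dicts in plain Python. The `get_nested` helper is hypothetical; the key path (`price`, `regularPrice`, `amount`, `value` / `currency`) is the one used in the polars expressions above.

```python
def get_nested(record, *keys):
    """Walk nested dicts key by key, returning None as soon as a level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# One record shaped like the raw JSON, with values taken from the first row above
sample_item = {
    "sku": "079877",
    "price": {"regularPrice": {"amount": {"value": 2.29, "currency": "USD"}}},
}

price_usd = get_nested(sample_item, "price", "regularPrice", "amount", "value")
currency = get_nested(sample_item, "price", "regularPrice", "amount", "currency")
print(price_usd, currency)
```

Returning `None` for a missing level mirrors how a null `price` would need to be handled before any arithmetic.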


Onto the same nested json but for calories...

Wow, these nutrition items are formatted horrendously. The struct shows values only, with no keys in sight. Might need to go back and fix this in the original GraphQL query under `nutrition`.

In [13]:
tj['nutrition'][0]

"{""100 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""1/4 cup (33g)"",""Serves about 6""}"


Oh, maybe not. Indexing one level deeper shows the keys are there after all; the Series display just omits them.

In [14]:
tj['nutrition'][0][0]

{'calories_per_serving': '100 ',
 'details': [{'amount': '0 g',
   'display_seq': 1,
   'nutritional_item': 'Total Fat',
   'percent_dv': '0'},
  {'amount': '0 g',
   'display_seq': 2,
   'nutritional_item': 'Saturated Fat',
   'percent_dv': '0'},
  {'amount': '0 g',
   'display_seq': 3,
   'nutritional_item': 'Trans Fat',
   'percent_dv': ''},
  {'amount': '0 mg',
   'display_seq': 4,
   'nutritional_item': 'Cholesterol',
   'percent_dv': '0'},
  {'amount': '0 mg',
   'display_seq': 5,
   'nutritional_item': 'Sodium',
   'percent_dv': '0'},
  {'amount': '26 g',
   'display_seq': 6,
   'nutritional_item': 'Total Carbohydrate',
   'percent_dv': '.09'},
  {'amount': '0 g',
   'display_seq': 7,
   'nutritional_item': 'Dietary Fiber',
   'percent_dv': '0'},
  {'amount': '21 g',
   'display_seq': 8,
   'nutritional_item': 'Total Sugars',
   'percent_dv': ''},
  {'amount': '21 g Added Sugars',
   'display_seq': 9,
   'nutritional_item': 'Includes',
   'percent_dv': '.42'},
  {'amount': '0 g'

In [None]:
tj = tj.with_columns(
    # Pull calories from the first nutrition panel; rows with an empty list get null
    pl.col("nutrition").map_elements(lambda x: x[0]['calories_per_serving'] if len(x)>0 else None, return_dtype=str).alias("calories_per_serving")
)
tj.head(3)

sku,name,nutrition,ingredients,popularity,price,country_of_manufacture,country_of_origin,price_usd,calories_per_serving
str,str,list[struct[7]],list[struct[2]],str,struct[1],null,str,f64,str
"""079877""","""STRAWBERRY FIELDS GUMMY""","[{""100 "",[{""0 g"",1,""Total Fat"",""0""}, {""0 g"",2,""Saturated Fat"",""0""}, … {""0 mg"",14,""Potassium"",""0""}],0,1,"""",""1/4 cup (33g)"",""Serves about 6""}]","[{1,""GLUCOSE SYRUP (CORN, WHEAT)""}, {2,""SUGAR""}, … {10,""WHEAT STARCH.""}]","""731""","{{{2.29,""USD""}}}",,"""Product of France""",2.29,"""100 """
"""079073""","""CASANARE BAR SINGLE STATE DARK…","[{""180 "",[{""14g"",1,""Total Fat"","".18""}, {""8g"",2,""Saturated Fat"","".4""}, … {""140mg"",14,""Potassium"","".02""}],0,1,"""",""10 pieces (31g)"",""Serves About 3""}]","[{1,""UNSWEETENED CHOCOLATE""}, {2,""SUGAR""}, … {4,""SUNFLOWER LECITHIN [EMULSIFIER]""}]","""337""","{{{2.49,""USD""}}}",,"""Product of Colombia""",2.49,"""180 """
"""081510""","""CAKE STRAWBERRY SHEET MINI""","[{""320"",[{""16g"",1,""Total Fat"","".21""}, {""7g"",2,""Saturated Fat"","".35""}, … {"""",14,""Potassium"",""0""}],0,1,"""",""1/6 cake (85g)"",""Serves 6""}]","[{1,""CAKE (CANE SUGAR, ENRICHED WHEAT FLOUR [WHEAT FLOUR, MALTED BARLEY FLOUR, NIACIN, IRON, THIAMINE MONONITRATE, RIBOFLAVIN, FOLIC ACID], WATER, STRAWBERRY PRESERVES [ORGANIC STRAWBERRIES, ORGANIC CANE SUGAR, APPLE PECTIN, ASCORBIC ACID {TO PRESERVE}, CITRIC ACID {ACIDIFIER}], CANOLA OIL, EGGS, BUTTERMILK [CULTURED PASTEURIZED MILK, NONFAT MILK POWDER, SALT], BAKING SODA, NATURAL FLAVOR, SEA SALT)""}, {2,""FROSTING (POWDERED SUGAR [SUGAR, CORNSTARCH], UNSALTED BUTTER [CREAM [MILK], NATURAL FLAVOR], CREAM CHEESE [PASTEURIZED CULTURED MILK AND CREAM, SALT, GUAR GUM, CAROB BEAN GUM], STRAWBERRY PUREE [STRAWBERRIES, CANE SUGAR, FRUIT PECTIN, ASCORBIC ACID {TO MAINTAIN COLOR}], FREEZE DRIED STRAWBERRY, NATURAL FLAVORS, SEA SALT).""}]","""761""","{{{5.99,""USD""}}}",,"""Made in USA""",5.99,"""320"""


Some items have a blank calories field, which definitely does not mean they actually have 0 calories. Drop them.

In [None]:
tj.filter((pl.col("calories_per_serving") == ""))

In [None]:
tj = tj.filter((pl.col("calories_per_serving") != ""))

More problematic calories per serving fields: `8 out of 4570 values: ["3291 kcal/kg; 29 kcal/treat", "3200 kcal/kg; 18 kcal/treat", … "varied"]`.

In [None]:
tj.shape

In [None]:
# Drop them for now; come back later to salvage whichever are recoverable.
tj = tj.filter(pl.col('calories_per_serving').str.contains(r'^\s*\d+\s*$'))
tj.shape
# Yes, this filters out the 8 values that gave us trouble above

In [None]:
tj = tj.with_columns(
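    # strip_chars() with no argument removes all leading/trailing whitespace, matching the \s in the filter above
    pl.col('calories_per_serving').str.strip_chars().cast(pl.Float32).alias('calories_per_serving')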
)
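
To see which raw values survive the filter-then-cast steps above, here is a small standalone sketch using Python's `re` module with the same pattern. The `parse_calories` helper is an illustration only; the example strings are ones that appear in the data.

```python
import re

# Same pattern as the polars filter: optional whitespace around a bare integer
CALORIES_PATTERN = re.compile(r"^\s*(\d+)\s*$")


def parse_calories(raw):
    """Return calories as a float, or None when the field is not a bare integer."""
    if raw is None:
        return None
    match = CALORIES_PATTERN.match(raw)
    return float(match.group(1)) if match else None


examples = ["100 ", "320", "", "3291 kcal/kg; 29 kcal/treat", "varied"]
parsed = [parse_calories(value) for value in examples]
print(parsed)  # [100.0, 320.0, None, None, None]
```

Anything that comes back as `None` here corresponds to a row the notebook drops.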

Now the same for servings per container.

In [None]:
tj['nutrition'][0][0]['servings_per_container']

In [None]:
# Create raw servings column
tj = tj.with_columns(
    pl.col("nutrition").map_elements(lambda x: x[0]['servings_per_container'] if len(x)>0 else None, return_dtype=str).alias("servings_per_container")
)

In [None]:
unique_spc = tj['servings_per_container'].unique() # or: value_counts()['servings_per_container']
for spc in unique_spc:
    print(spc)

In [None]:
tj.shape

In [None]:
# tj.head(3)

The common problems with servings per container include:
* `Serves one` and `Serves does not need this since there is only one serving` (manually replace with 1)
* `Serves about 3 (About 2.5 without Dressing)` (just grab first numeric)
* `Serves 8 on 8oz and 12 on 12oz` (This might be a problem depending on whether there is a single reported calorie value)
* `Serves varied` (idek man)
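
A rough plain-Python sketch of the rules listed above, run against those same example strings. The `parse_servings` helper is hypothetical, and keeping decimals in the number pattern is a choice made here rather than something the notebook necessarily does.

```python
import re

# First number in the string, keeping decimals such as "2.5"
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def parse_servings(raw):
    """Map a free-text servings string to a float, or None if no number is found."""
    if raw is None:
        return None
    if " one" in raw:  # "Serves one", "...there is only one serving"
        return 1.0
    match = NUMBER_PATTERN.search(raw)
    return float(match.group()) if match else None


examples = [
    "Serves about 6",
    "Serves one",
    "Serves does not need this since there is only one serving",
    "Serves about 3 (About 2.5 without Dressing)",
    "Serves 8 on 8oz and 12 on 12oz",
    "Serves varied",
]
servings = [parse_servings(value) for value in examples]
print(servings)  # [6.0, 1.0, 1.0, 3.0, 8.0, None]
```

The `8 on 8oz and 12 on 12oz` case silently becomes 8, so that row still deserves a manual look.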

In [None]:

tj = tj.with_columns(
    # "Serves one" and similar: substitute the string "1" so the column stays a string for the regex step
    spc=pl.when(pl.col("servings_per_container").str.contains(' one', literal=True))
    .then(pl.lit("1"))
    .otherwise(pl.col("servings_per_container"))
)

In [None]:
# Grab the first number in the string, keeping decimals such as "2.5"
tj = tj.with_columns(
    spc=pl.col('spc').str.extract(r'(\d+(?:\.\d+)?)').cast(pl.Float32)
)

In [None]:
tj = tj.with_columns(
    calories_per_container = tj['calories_per_serving'] * tj['spc']
)

In [None]:
tj = tj.with_columns(
    dollars_per_calorie = tj['price_usd'] / tj['calories_per_container']
)

In [None]:
tj = tj.with_columns(
    calories_per_dollar = 1 / pl.col('dollars_per_calorie')
)
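
As a worked check of the arithmetic in the last three cells, here is a standalone sketch using the first row of the table (STRAWBERRY FIELDS GUMMY: $2.29, 100 calories per serving, serves about 6). The `calorie_metrics` helper is hypothetical, and returning `None` for zero or missing inputs is an assumption made here; the notebook instead computes everything and filters afterwards.

```python
def calorie_metrics(price_usd, calories_per_serving, servings):
    """Compute container calories and both price ratios, or None where undefined."""
    if None in (price_usd, calories_per_serving, servings):
        return None  # missing input
    calories_per_container = calories_per_serving * servings
    if calories_per_container == 0 or price_usd == 0:
        return None  # a ratio would be undefined or meaningless
    return {
        "calories_per_container": calories_per_container,
        "calories_per_dollar": calories_per_container / price_usd,
        "dollars_per_calorie": price_usd / calories_per_container,
    }


metrics = calorie_metrics(2.29, 100.0, 6.0)
print(metrics)
```

For this row that works out to 600 calories per container, or roughly 262 calories per dollar.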

In [None]:
tj.columns

In [None]:
# Remove 0-calorie items like salt, tea, and hot sauce
# Remove items that are listed as free (price of 0) for some reason
tj_trim = tj.select(
    ['sku', 'name', 'calories_per_dollar', 'dollars_per_calorie', 'price_usd', 'calories_per_container', 'calories_per_serving', 'spc']
).filter(
    pl.col('dollars_per_calorie').is_not_null() & pl.col('dollars_per_calorie').is_not_nan() & (pl.col('calories_per_container') != 0) & (pl.col('price_usd') != 0)
)
tj_trim

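Lol, why are there both polars `null` and `NaN`? Presumably `null` marks a missing value (e.g. no number found in the servings string), while `NaN` is an actual float result, such as 0 / 0 when both price and calories are zero.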

In [None]:
tj_trim.sort(by='dollars_per_calorie', descending=False)
# Chocolatey dipping kit limited time only, caloric maximum is a fucking cryptid, it's lost media
# NO these fucking prices are just straight up wrong, 1 cent???

In [None]:
cfg = pl.Config()
cfg.set_tbl_rows(20)
# with pl.Config(tbl_rows=10):
#     tj_trim.sort(by='dollars_per_calorie', descending=False)
tj_trim.sort(by='calories_per_dollar', descending=True).head(20)
# How can I do a HAVING-style filter after the sort, to remove crazy low dollars_per_calorie values?
# Chaining .filter(...) after .sort(...) works for this.

In [None]:
# Energy Bar Peanut Butter / TJ'S CRUNCHY PEANUT BUTTER ENERGY
# Wrong, it is listing price of a single bar, but servings as 12 bars!!!!
tj.filter(tj['sku'] == '077616')

# Same issue with TJ'S CHOCOLATE CHIP ENERGY

In [None]:
tj.filter(tj['sku'] == '070379')

In [None]:
tj.filter(tj['sku'] == '073468')

Let's filter out products that obviously make no sense, e.g. excessive calories in a container and $0.01 price tags.

In [None]:
tj_clean = tj_trim.clone() # Cheap copy: polars clone() does not duplicate the underlying data

In [None]:
tj_clean = tj_clean.sort(by='calories_per_dollar', descending=True)


In [None]:
tj_clean['price_usd'].hist()

In [None]:
tj_clean.filter(tj_clean['price_usd'] <= 0.49)

In [None]:
# df.filter KEEPS the rows that satisfy the condition
tj_clean = tj_clean.filter(
        (pl.col('calories_per_container') < 30000) # Demon core candied ginger
        &
        (pl.col('price_usd') >= 0.49) # Threshold at fruit wraps and "just a handful"-type products
)

In [None]:
# tj.filter(~tj['servings_per_container'].str.contains('[(serves)(Serves)(about)]'))

In [None]:
# tj_clean_top20 = 
# tj_clean.sort(by='calories_per_dollar', descending=True).head(20)

To sanity check items that do not have accurate servings per container given the price, we can insert another variable for the price per serving to see if that is weird.

In [None]:
tj_clean = tj_clean.with_columns(
    dollars_per_serving = tj_clean['price_usd'] / tj_clean['spc']
)
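
One way to use that column, sketched in plain Python: flag rows whose price per serving falls below some plausibility floor. The `flag_suspicious` helper, the $0.10 floor, and the sample rows are all hypothetical; they only illustrate the kind of mismatch seen with the energy bars, where the price is for one bar but the servings are for a whole box.

```python
def flag_suspicious(rows, min_dollars_per_serving=0.10):
    """Return rows whose price per serving is below a (hypothetical) plausibility floor."""
    flagged = []
    for row in rows:
        dollars_per_serving = row["price_usd"] / row["spc"]
        if dollars_per_serving < min_dollars_per_serving:
            flagged.append({**row, "dollars_per_serving": dollars_per_serving})
    return flagged


# Hypothetical rows, not real catalog data
rows = [
    {"sku": "000001", "price_usd": 0.99, "spc": 12.0},  # priced per bar, servings per box
    {"sku": "000002", "price_usd": 5.99, "spc": 6.0},   # roughly $1 per serving
]

flagged = flag_suspicious(rows)
print([row["sku"] for row in flagged])
```

Flagged rows are candidates for manual review, not automatic removal, since some cheap bulk items may be legitimate.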

In [None]:
tj_clean.head(3)

Also, let's fix some obvious mistakes manually.

In [None]:
tj_clean = tj_clean.with_columns(
    pl.when(tj_clean['sku'].is_in(["077616", "077618"]))
    .then(1)
    .otherwise(pl.col("spc"))
    .alias("spc")
)


In [None]:
tj_clean = tj_clean.with_columns(
    calories_per_container = tj_clean['calories_per_serving'] * tj_clean['spc']
)

In [None]:
tj_clean = tj_clean.with_columns(
    calories_per_dollar = tj_clean['calories_per_container'] / tj_clean['price_usd']
)

In [None]:
tj_clean = tj_clean.with_columns(
    dollars_per_calorie = 1 / tj_clean['calories_per_dollar']
)

In [None]:
tj_clean = tj_clean.sort(by='calories_per_dollar', descending=True)

Save.

In [None]:
tj_clean.write_csv("data/tj_clean.csv")