# Shopify Scraping

This notebook was me doing some simple data exploration with Shopify product data. I have a lot of Shopify stores saved in my iCloud notes, and wanted a way to aggregate them. The long-term idea was to make an RSS feed for e-commerce. I think it's a natural fit for where e-commerce is today, but that's a topic for another time.

Some quick googling showed me that you can see a product listing by visiting a Shopify store and adding `/products.json` to the end of the URL. Turns out that doesn't give you all the data, so this notebook was me looking through the dataset and structuring it in my mind.

In [2]:
import pandas as pd
import requests

Let's test out the very basic functionality. This site, Burga, makes cases for phones, tablets, laptops, and airpods. Maybe some other stuff.

In [5]:
prod = requests.get('https://burga.com/products.json')

pdf = pd.DataFrame(prod.json()['products'])

pdf.head()

Unnamed: 0,id,title,handle,body_html,published_at,created_at,updated_at,vendor,product_type,tags,variants,images,options
0,7221549269167,Northern Lights - Marble Samsung Galaxy S22 Ul...,n-l-l,<p>* The case does not contain actual glitter-...,2022-04-26T10:54:30+03:00,2022-04-26T10:54:30+03:00,2022-05-09T21:11:28+03:00,BURGA,Phone Case,"[custom_text:GYTIS, custom_text_color:#ffffff,...","[{'id': 41554229362863, 'title': 'Snap', 'opti...","[{'id': 33742816247983, 'created_at': '2022-04...","[{'name': 'Case Type', 'position': 1, 'values'..."
1,7206143754415,Sparkling Tiara - Stars OnePlus 9 Pro Case,sparkling-tiara-oneplus-9-pro-case,<p>* The case does not contain actual glitter-...,2022-04-11T10:39:05+03:00,2022-04-11T09:26:26+03:00,2022-05-08T01:18:19+03:00,BURGA,Phone Case,"[custom_text:STARS, custom_text_color:#fff7f0,...","[{'id': 41501283123375, 'title': 'Tough', 'opt...","[{'id': 33551594684591, 'created_at': '2022-04...","[{'name': 'Case Type', 'position': 1, 'values'..."
2,7206143721647,Enchanted Mirror - Marble OnePlus 9 Pro Case,enchanted-mirror-oneplus-9-pro-case,<p>* The case does not contain actual glitter-...,2022-04-11T10:39:05+03:00,2022-04-11T09:26:18+03:00,2022-05-12T16:31:29+03:00,BURGA,Phone Case,"[custom_text:GLOW, custom_text_color:#fff3e7, ...","[{'id': 41501282697391, 'title': 'Tough', 'opt...","[{'id': 33551593865391, 'created_at': '2022-04...","[{'name': 'Case Type', 'position': 1, 'values'..."
3,7206143688879,Prince Charming - Stars OnePlus 9 Pro Case,prince-charming-oneplus-9-pro-case,<p>* The case does not contain actual glitter-...,2022-04-11T10:39:04+03:00,2022-04-11T09:26:10+03:00,2022-05-07T23:21:42+03:00,BURGA,Phone Case,"[custom_text:1991, custom_text_color:#fff3e7, ...","[{'id': 41501282336943, 'title': 'Tough', 'opt...","[{'id': 33551593078959, 'created_at': '2022-04...","[{'name': 'Case Type', 'position': 1, 'values'..."
4,7206143623343,Snow White - Marble OnePlus 9 Pro Case,snow-white-oneplus-9-pro-case,<p>* The case does not contain actual glitter-...,2022-04-11T10:39:04+03:00,2022-04-11T09:26:02+03:00,2022-05-08T01:03:31+03:00,BURGA,Phone Case,"[custom_text:T.K, custom_text_color:#ffffff, c...","[{'id': 41501281878191, 'title': 'Tough', 'opt...","[{'id': 33551592358063, 'created_at': '2022-04...","[{'name': 'Case Type', 'position': 1, 'values'..."


Let me see if we have all the data.

In [6]:
pdf['product_type'].unique()

array(['Phone Case', 'iPad Case'], dtype=object)

Right from the get-go I can see that it's not picking up all the data. If you go to burga.com you'll see a lot more products. Let me check the length of the response.

In [12]:
len(pdf)

30

OK, so it's capping us at 30? Turns out this is a global default: if you don't specify a limit, you only get 30 items in a JSON response from this API. You'll see me check this on every response until I figure out how to get more than 30 items.

Here I'm looking for collections. I imagine there's some kind of hierarchy with collections that contain products, so I'm going to get the collections and then use those to get the items.

In [7]:
#seems like there are high-level products, but there's some kind of hierarchy. Let's look for collections
collections = requests.get('https://www.burga.com/collections.json')

In [8]:
collections.json()['collections'][0].keys()

dict_keys(['id', 'title', 'handle', 'description', 'published_at', 'updated_at', 'image', 'products_count'])

In [9]:
collections.json()['collections'][0]['title']

'Accessories'

In [10]:
cdf = pd.DataFrame(collections.json()['collections'])

In [14]:
cdf.head(10)

Unnamed: 0,id,title,handle,description,published_at,updated_at,image,products_count
0,282440958127,Accessories,accessories,,2021-12-15T21:11:52+02:00,2022-05-12T17:50:49+03:00,,13
1,268083626159,Airpod Max Cases,airpod-max-cases,<p>The AirPods Max headphones shook the world....,2021-05-24T17:40:24+03:00,2022-05-11T23:45:37+03:00,,140
2,266627645615,Airpod Types,airpod-types,,2021-05-10T17:50:58+03:00,2022-05-08T15:40:12+03:00,,4
3,280987467951,AirPods 3 Cases,airpods-3-cases,"<p>Your AirPods 3 need some love, too! Choose ...",2021-10-25T19:53:16+03:00,2022-05-12T16:05:35+03:00,,143
4,266500440239,AirPods Cases,airpod-cases,<p>Revive your AirPods with a fresh new look. ...,2021-05-08T16:30:49+03:00,2022-05-12T12:50:47+03:00,,149
5,266494640303,AirPods Pro Cases,airpods-pro-cases,"<p>Your AirPods Pro need some love, too! Choos...",2021-05-08T14:13:32+03:00,2022-05-12T07:35:19+03:00,,140
6,263714144431,All models,all-models,,2021-04-02T17:24:33+03:00,2022-05-11T23:50:45+03:00,,74
7,168930246742,All Products (Flexify),all-products-flexify,,2020-05-18T14:43:32+03:00,2022-05-12T18:16:05+03:00,,16275
8,279820763311,Apple Watch Bands,apple-watch-bands,<p>Tired of your boring old Apple Watch straps...,2021-09-28T13:10:38+03:00,2022-05-08T04:37:06+03:00,,148
9,172213010518,Backyard Stories,backyard-stories,<div>It’s time to discover the beauty of our o...,2020-08-18T14:28:48+03:00,2022-05-12T17:25:52+03:00,"{'id': 940383633494, 'created_at': '2020-08-18...",744


In [12]:
len(cdf)

30

30 items!

Next I'm going to look at one of the collections. A couple of things:
- I noticed that the handle looks like the URL structure when you actually visit the site, so that's what I'll try for the API call.
- I'm looking at the all-products-flexify object and seeing it has 16k products. I'm going to see if I get all of them by requesting that specific collection.

In [27]:
flexify = requests.get('http://burga.com/collections/all-products-flexify/products.json')

In [28]:
flexify.json().keys()

dict_keys(['products'])

In [30]:
len(flexify.json()['products']) 

30

30 products! I'm going to try to get more. I need all the products. I have a pretty basic understanding of how APIs work, so I'm looking for the number 30 in the headers.

In [35]:
flexify.headers

{'Date': 'Wed, 30 Mar 2022 01:10:59 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Sorting-Hat-PodId': '174', 'X-Sorting-Hat-ShopId': '21002577', 'X-Storefront-Renderer-Rendered': '1', 'Set-Cookie': 'secure_customer_sig=; path=/; expires=Thu, 30 Mar 2023 01:10:59 GMT; secure; HttpOnly; SameSite=Lax, localization=US; path=/; expires=Wed, 13 Apr 2022 01:10:59 GMT, _y=6b61480b-2e5f-438d-be76-0ecffd698a06; Expires=Thu, 30-Mar-23 01:10:59 GMT; Domain=burga.com; Path=/; SameSite=Lax, _s=4d6227b1-c2c2-4c62-ad90-76886622c422; Expires=Wed, 30-Mar-22 01:40:59 GMT; Domain=burga.com; Path=/; SameSite=Lax, _shopify_y=6b61480b-2e5f-438d-be76-0ecffd698a06; Expires=Thu, 30-Mar-23 01:10:59 GMT; Domain=burga.com; Path=/; SameSite=Lax, _shopify_s=4d6227b1-c2c2-4c62-ad90-76886622c422; Expires=Wed, 30-Mar-22 01:40:59 GMT; Domain=burga.com; Path=/; SameSite=Lax', 'X-Alternate-Cache-Key': 'cacheable:1e014b5c6d0b1fde612ffa4b1a9b18da', 'X

Here I'm trying the naive thing and adding a limit parameter to the API call. Turns out it works.

In [15]:
pixel = requests.get('http://burga.com/collections/google-pixel-2-cases/products.json',params={'limit':1000})

In [16]:
len(pixel.json()['products'])

151

Let's look at the "elite" phone cases. This is what I actually want to buy.

In [17]:
elite = requests.get('http://burga.com/collections/elite-cases-burger-nav/products.json',params={'limit':1000})

In [18]:
len(elite.json()['products'])

250
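One caveat on that 250: it is a suspiciously round number. As far as I know, the storefront `products.json` endpoint caps `limit` at 250 per request and accepts a `page` parameter for the rest, so a collection with more than 250 products would be silently truncated. Below is a minimal sketch of paging until an empty page comes back. The function names and the `fetch_page` hook are mine, not part of any API, and both the cap and the `page` parameter are assumptions worth checking against the store.

```python
import requests

PAGE_SIZE = 250  # assumed per-request cap; verify against the store


def http_fetch_page(url, page, limit=PAGE_SIZE):
    """Fetch one page of products from a products.json URL."""
    resp = requests.get(url, params={'limit': limit, 'page': page}, timeout=30)
    resp.raise_for_status()
    return resp.json()['products']


def fetch_all_products(url, fetch_page=http_fetch_page, max_pages=100):
    """Collect products page by page until a page comes back empty."""
    products = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(url, page)
        if not batch:
            break
        products.extend(batch)
    return products
```

Something like `fetch_all_products('https://burga.com/collections/elite-cases-burger-nav/products.json')` would then stand in for the single `requests.get` call above. The `max_pages` guard is only there so a misbehaving endpoint can't loop forever.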

In [19]:
edf = pd.DataFrame(elite.json()['products'])

In [20]:
edf.head()

Unnamed: 0,id,title,handle,body_html,published_at,created_at,updated_at,vendor,product_type,tags,variants,images,options
0,6939995242671,Almond Latte - Cute iPhone 13 Pro Max Case,cute-iphone-13-pro-max-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:24+03:00,2021-09-14T21:39:29+03:00,2022-05-12T04:40:48+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","[{'id': 40642467987631, 'title': 'Snap', 'opti...","[{'id': 32707641868463, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
1,6602849583279,Almond Latte - Cute iPhone 12 Pro Max Case,cute-iphone-12-pro-max-case,<p>Cute cream color as a background for the un...,2021-04-01T12:30:05+03:00,2021-04-01T12:30:12+03:00,2022-05-11T10:30:20+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","[{'id': 39488378667183, 'title': 'Snap', 'opti...","[{'id': 32707649339567, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
2,6939994914991,Almond Latte - Cute iPhone 13 Pro Case,cute-iphone-13-pro-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:17+03:00,2021-09-14T21:39:22+03:00,2022-05-12T18:23:35+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","[{'id': 40642467561647, 'title': 'Snap', 'opti...","[{'id': 32707639476399, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
3,6602849714351,Almond Latte - Cute iPhone 11 Case,cute-iphone-11-case,<p>Cute cream color as a background for the un...,2021-04-01T12:30:14+03:00,2021-04-01T12:30:21+03:00,2022-05-12T10:40:13+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","[{'id': 39488378831023, 'title': 'Snap', 'opti...","[{'id': 32705721303215, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
4,6602849452207,Almond Latte - Cute iPhone 12 Pro Case,cute-iphone-12-pro-case,<p>Cute cream color as a background for the un...,2021-04-01T12:29:56+03:00,2021-04-01T12:30:03+03:00,2022-05-10T23:20:29+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","[{'id': 39488378372271, 'title': 'Snap', 'opti...","[{'id': 32707647602863, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."


Looks like we have some metadata to parse out here. Specifically, I want to look at those JSON columns. There's some info that I'm not getting at the product level, so maybe it lives in those nested columns.

In [21]:
edf.loc[0,'tags']

['Beige',
 'design:Almond Latte - Cute',
 'fw19',
 'model:iPhone 13 Pro Max',
 'PhoneCase2',
 'recomatic-model:iPhone 13 Pro Max']

Doesn't look that useful. Maybe the color is good, but it's a flat list of strings rather than actual JSON, which is disappointing. Let's look at the next one, variants:
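A quick aside before the variants: several of the tags follow a `key:value` pattern (`design:...`, `model:...`), so they can be split into something more structured. This is a small sketch, assuming the colon convention is just how this particular store writes its tags; `parse_tags` is a name I made up for illustration.

```python
def parse_tags(tags):
    """Split a product's tag list into key/value pairs and plain labels."""
    pairs, labels = {}, []
    for tag in tags:
        # partition splits on the first colon only, so values may contain colons
        key, sep, value = tag.partition(':')
        if sep:
            pairs[key.strip()] = value.strip()
        else:
            labels.append(tag)
    return pairs, labels


pairs, labels = parse_tags(['Beige',
                            'design:Almond Latte - Cute',
                            'fw19',
                            'model:iPhone 13 Pro Max',
                            'PhoneCase2',
                            'recomatic-model:iPhone 13 Pro Max'])
```

Here `pairs` holds the `design`, `model` and `recomatic-model` entries, and `labels` keeps the bare tags like `Beige`. With that aside out of the way, on to the variants column.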

In [22]:
edf.loc[0,'variants']

[{'id': 40642467987631,
  'title': 'Snap',
  'option1': 'Snap',
  'option2': None,
  'option3': None,
  'sku': 'FA_01_IP13PROMAX_SP',
  'requires_shipping': True,
  'taxable': True,
  'featured_image': {'id': 32707641868463,
   'product_id': 6939995242671,
   'position': 1,
   'created_at': '2022-01-24T23:52:34+02:00',
   'updated_at': '2022-01-24T23:52:34+02:00',
   'alt': 'FA_01_IP13PROMAX_SP',
   'width': 1263,
   'height': 1263,
   'src': 'https://cdn.shopify.com/s/files/1/2100/2577/products/FA-01-A01AA_f6b6c376-6c76-45f9-8307-556d4124c204.jpg?v=1643061154',
   'variant_ids': [40642467987631]},
  'available': True,
  'price': '29.95',
  'grams': 29,
  'compare_at_price': None,
  'position': 1,
  'product_id': 6939995242671,
  'created_at': '2021-09-14T21:39:29+03:00',
  'updated_at': '2022-05-11T16:12:25+03:00'},
 {'id': 40642468020399,
  'title': 'Tough',
  'option1': 'Tough',
  'option2': None,
  'option3': None,
  'sku': 'FA_01_IP13PROMAX_TH',
  'requires_shipping': True,
  'tax

Oh yeah, now we're cookin'. Here we're getting different variants and their individual prices, as well as their images. On the site, these exist on the same product page, kind of like different colors or sizes for a T-shirt. Let's see if one of those ids shows up in the product `id` column:

In [53]:
edf[edf['id']==6602849583279]

Unnamed: 0,id,title,handle,body_html,published_at,created_at,updated_at,vendor,product_type,tags,variants,images,options
0,6602849583279,Almond Latte - Cute iPhone 12 Pro Max Case,cute-iphone-12-pro-max-case,<p>Cute cream color as a background for the un...,2021-04-01T12:30:05+03:00,2021-04-01T12:30:12+03:00,2022-03-28T21:51:00+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","[{'id': 39488378667183, 'title': 'Snap', 'opti...","[{'id': 32707649339567, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."


OK, that's one of the product rows we've already been looking at. There's a product-to-variant hierarchy, where every product can have several variants. Let me look for the first variant ID:

In [23]:
edf[edf['id']==39488378667183]

Unnamed: 0,id,title,handle,body_html,published_at,created_at,updated_at,vendor,product_type,tags,variants,images,options


Nothing. Good. Now we know we can explode this to get variant-product combinations. Before we do that, let's look at the last column.

In [24]:
edf.loc[0,'options']

[{'name': 'Case Type',
  'position': 1,
  'values': ['Snap',
   'Tough',
   'Elite Dark',
   'Tough (MagSafe)',
   'Elite Dark (MagSafe)']}]

This looks like how we can filter through those variants. Basically, this is one-to-one with the title of each variant, and with the option1 type. Let me expand this into a dataframe:

In [26]:
edf.explode('variants').head()

Unnamed: 0,id,title,handle,body_html,published_at,created_at,updated_at,vendor,product_type,tags,variants,images,options
0,6939995242671,Almond Latte - Cute iPhone 13 Pro Max Case,cute-iphone-13-pro-max-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:24+03:00,2021-09-14T21:39:29+03:00,2022-05-12T04:40:48+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","{'id': 40642467987631, 'title': 'Snap', 'optio...","[{'id': 32707641868463, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
0,6939995242671,Almond Latte - Cute iPhone 13 Pro Max Case,cute-iphone-13-pro-max-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:24+03:00,2021-09-14T21:39:29+03:00,2022-05-12T04:40:48+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","{'id': 40642468020399, 'title': 'Tough', 'opti...","[{'id': 32707641868463, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
0,6939995242671,Almond Latte - Cute iPhone 13 Pro Max Case,cute-iphone-13-pro-max-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:24+03:00,2021-09-14T21:39:29+03:00,2022-05-12T04:40:48+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","{'id': 41227336450223, 'title': 'Elite Dark', ...","[{'id': 32707641868463, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
0,6939995242671,Almond Latte - Cute iPhone 13 Pro Max Case,cute-iphone-13-pro-max-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:24+03:00,2021-09-14T21:39:29+03:00,2022-05-12T04:40:48+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","{'id': 41551436251311, 'title': 'Tough (MagSaf...","[{'id': 32707641868463, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."
0,6939995242671,Almond Latte - Cute iPhone 13 Pro Max Case,cute-iphone-13-pro-max-case,<p>Cute cream color as a background for the un...,2021-09-14T21:39:24+03:00,2021-09-14T21:39:29+03:00,2022-05-12T04:40:48+03:00,BURGA,Phone Case,"[Beige, design:Almond Latte - Cute, fw19, mode...","{'id': 41551436284079, 'title': 'Elite Dark (M...","[{'id': 32707641868463, 'created_at': '2022-01...","[{'name': 'Case Type', 'position': 1, 'values'..."


We get five times as many rows, which suggests that we now have all the variant pairs. Now we need to expand all the json stuff.

In [44]:
t = edf.explode('variants').reset_index()

t.columns

vdf = t['variants'].apply(pd.Series)
vdf.head()

Unnamed: 0,id,title,option1,option2,option3,sku,requires_shipping,taxable,featured_image,available,price,grams,compare_at_price,position,product_id,created_at,updated_at
0,40642467987631,Snap,Snap,,,FA_01_IP13PROMAX_SP,True,True,"{'id': 32707641868463, 'product_id': 693999524...",True,29.95,29,,1,6939995242671,2021-09-14T21:39:29+03:00,2022-05-11T16:12:25+03:00
1,40642468020399,Tough,Tough,,,FA_01_IP13PROMAX_TH,True,True,"{'id': 32707642032303, 'product_id': 693999524...",True,44.95,39,,2,6939995242671,2021-09-14T21:39:29+03:00,2022-05-12T04:36:45+03:00
2,41227336450223,Elite Dark,Elite Dark,,,FA_01EL_IP13PROMAX_EL-dark,True,True,"{'id': 32707642196143, 'product_id': 693999524...",True,79.95,45,,3,6939995242671,2022-01-24T23:52:35+02:00,2022-05-11T15:39:50+03:00
3,41551436251311,Tough (MagSafe),Tough (MagSafe),,,FA_01_IP13PROMAX_TH-magsafe,True,True,"{'id': 32707642032303, 'product_id': 693999524...",True,49.9,49,,4,6939995242671,2022-04-25T17:08:52+03:00,2022-05-11T10:20:07+03:00
4,41551436284079,Elite Dark (MagSafe),Elite Dark (MagSafe),,,FA_01EL_IP13PROMAX_EL-dark-magsafe,True,True,"{'id': 32707642196143, 'product_id': 693999524...",True,84.9,55,,5,6939995242671,2022-04-25T17:08:52+03:00,2022-05-01T15:15:41+03:00


In [45]:
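# NB: rename() without columns= targets the index, so 'id' is not renamed here and the merge produces id_x / id_y
edfx = pd.merge(edf,vdf.rename({'id':'variant_id'}),left_on='id',right_on='product_id')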

In [46]:
edfx.columns

Index(['id_x', 'title_x', 'handle', 'body_html', 'published_at',
       'created_at_x', 'updated_at_x', 'vendor', 'product_type', 'tags',
       'variants', 'images', 'options', 'id_y', 'title_y', 'option1',
       'option2', 'option3', 'sku', 'requires_shipping', 'taxable',
       'featured_image', 'available', 'price', 'grams', 'compare_at_price',
       'position', 'product_id', 'created_at_y', 'updated_at_y'],
      dtype='object')
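For comparison, here is an alternative sketch of the same explode-and-merge step using `pd.json_normalize`, which flattens the nested `variants` list and carries selected product fields along in one call. It works on the raw list of product dicts (for example `elite.json()['products']`) rather than on the dataframe. The function name and the choice of `meta` fields are my own; the prefixes just keep product and variant columns from colliding.

```python
import pandas as pd


def variants_frame(products):
    """Return one row per variant, with chosen product fields as product_* columns."""
    return pd.json_normalize(
        products,
        record_path='variants',           # one output row per entry in 'variants'
        meta=['id', 'title', 'handle'],   # product-level fields to repeat per row
        meta_prefix='product_',
        record_prefix='variant_',
    )
```

With this, the variant id lands in `variant_id` and the parent product id in `product_id`, so there are no `_x` / `_y` suffixes to untangle afterwards.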

Now I've got a dataset with all the product/variant pairs. Each variant has the price, availability, and a picture. I could literally recreate the cosmetics of a Shopify store if I wanted to. But more importantly, I can write all of this into a script and not have to do it manually.

I'm leaving the rest of the notebook here for you to look at the process. It's very raw, because I was still tinkering with how to pull and structure the data. I wanted to create methods that would, in one fell swoop, get all the products from a website and generate filters for it. That way I could create a database that I could query for the entire store.

I stopped working on it because I decided at some point that I like the idea enough to use it (maybe even pay for it) or maybe invest in it, but not enough to spend a significant part of my life building it. But either way it was a fun exercise.

## Tinkering (unfinished)

Next steps:
- Create a way to understand the structure of the data and search capabilities
- Try to recreate it for another site
- Try to create a generic one
- Test how it cuts across similar sites (like merch sites)

In [48]:
edfx['product_type'].unique()

array(['Phone Case'], dtype=object)

In [49]:
cdf.head()

Unnamed: 0,id,title,handle,description,published_at,updated_at,image,products_count
0,282440958127,Accessories,accessories,,2021-12-15T21:11:52+02:00,2022-05-12T17:50:49+03:00,,13
1,268083626159,Airpod Max Cases,airpod-max-cases,<p>The AirPods Max headphones shook the world....,2021-05-24T17:40:24+03:00,2022-05-11T23:45:37+03:00,,140
2,266627645615,Airpod Types,airpod-types,,2021-05-10T17:50:58+03:00,2022-05-08T15:40:12+03:00,,4
3,280987467951,AirPods 3 Cases,airpods-3-cases,"<p>Your AirPods 3 need some love, too! Choose ...",2021-10-25T19:53:16+03:00,2022-05-12T16:05:35+03:00,,143
4,266500440239,AirPods Cases,airpod-cases,<p>Revive your AirPods with a fresh new look. ...,2021-05-08T16:30:49+03:00,2022-05-12T12:50:47+03:00,,149


In [50]:
def get_collection_products(base_url,collection):
    
    url = base_url+'/collections/{0}/products.json'.format(collection['handle'])
    e = requests.get(url,params={'limit':collection['products_count']})
    
    pdf = pd.DataFrame(e.json()['products'])
    return pdf

In [51]:
def explode_variants(product_df):
    t = product_df.explode('variants').reset_index()
    pdfx = pd.merge(product_df,t['variants'].apply(pd.Series).rename(columns={'id':'variant_id'}),
                    left_on='id',
                    right_on='product_id',
                   suffixes=('_prod','_variant'))
    return pdfx

In [52]:
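def get_all_collection_variants(base_url):
    
    collections =  requests.get(base_url + '/collections.json')
    
    frames = []  # one exploded frame per collection
    
    for c in collections.json()['collections']:
        products = get_collection_products(base_url,c)
        if products.empty:
            continue
        
        try:
            pdfx = explode_variants(products)
        except KeyError:
            # products in this collection lack an expected column; log the handle and move on
            print(c['handle'])
            continue

        frames.append(pdfx)
        
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()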

In [53]:
df = get_all_collection_variants('https://www.burga.com')
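Since the same product can sit in several collections (all-products-flexify alone overlaps with most of the others), the combined frame almost certainly repeats variants. A minimal sketch of de-duplicating on the `variant_id` column that `explode_variants` creates is below. The helper name is hypothetical, and the sketch assumes it doesn't matter which collection's copy of a variant is kept.

```python
import pandas as pd


def dedupe_variants(df, key='variant_id'):
    """Keep the first row seen for each variant id."""
    # subset= matters: columns like 'tags' hold lists, which can't be hashed
    return df.drop_duplicates(subset=key).reset_index(drop=True)
```

Calling `dedupe_variants(df)` on the result above would give one row per variant. If collection membership matters later, it would need to be recorded before this step, because this throws it away.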

Next step:
- find a way to get all the different variant information out of a product collection. So want to get possible filters

Product type is probably an easy one:

In [123]:
df['product_type'].unique()

array(['Airtag Case', 'wireless charger', 'Ring Light', 'Phone Charm',
       'Eyewear Chain', 'AirPods Case', 'Phone Case', 'screenprotector',
       'Ring Holder', 'macbook', 'Cable', 'Camera Lens Protector',
       'Card Holder', 'Water Bottle', 'Passport Holder', 'Straws',
       'Power Bank', 'Travel Mug', 'Watch Protector', 'iPad Case',
       'Leather Apple Watch Band', 'Metal Mesh Apple Watch Band',
       'Wall Charger'], dtype=object)

In [131]:
filter_cols = []
filter_cols += ['product_type']
filter_cols += ['available']
filter_cols += ['requires_shipping']

In [129]:
df['available'].unique()

array([ True, False])

In [247]:
products.url

'https://www.burga.com/collections/airpod-cases/products.json?limit=143'

In [149]:
df.columns

Index(['id', 'title_prod', 'handle', 'body_html', 'published_at',
       'created_at_prod', 'updated_at_prod', 'vendor', 'product_type', 'tags',
       'variants', 'images', 'options', 'variant_id', 'title_variant',
       'option1', 'option2', 'option3', 'sku', 'requires_shipping', 'taxable',
       'featured_image', 'available', 'price', 'grams', 'compare_at_price',
       'position', 'product_id', 'created_at_variant', 'updated_at_variant'],
      dtype='object')

In [132]:
filter_cols += ['created_at_variant','updated_at_variant']

In [None]:
filter_cols += ['options']

Note that the options column has different kinds of data. It's a list of JSON objects describing the different options that a variant could have. These would probably need to be exploded into different options for each product type.

So a structure might look like this

- for each store
    - options
    - for each collection
        - collection options
        - for each product
            - product options
            - variants
                - variant options
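To make the options part of that structure concrete, here is a sketch that flattens the nested `options` lists into a long table with one row per product, option name and option value. It takes the raw product dicts from a `products.json` response; the function and column names are my own choices, not anything Shopify defines.

```python
import pandas as pd


def options_long(products):
    """One row per (product, option name, option value)."""
    rows = [
        {'product_id': p['id'],
         'product_type': p['product_type'],
         'option_name': opt['name'],
         'option_value': value}
        for p in products
        for opt in p.get('options', [])
        for value in opt['values']
    ]
    columns = ['product_id', 'product_type', 'option_name', 'option_value']
    return pd.DataFrame(rows, columns=columns)
```

Grouping the result by `product_type` and `option_name` then gives the candidate filter values for each product type, which is roughly the "product options" level in the outline above.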

In [223]:
def get_shopify_shop_structure(base_url,collection_limit=30):
    collections = requests.get(base_url+'/collections.json',params={'limit':collection_limit}).json()['collections']
    
    #fields available on each collection (currently unused)
    base_collection_info = ['title',
                              'handle',
                           'description',
                           'published_at',
                           'updated_at',
                           'image',
                           'products_count']
    
    product_types = dict() #set of product types for each collection handle
    product_type_options = dict() #set of option names for each product type
    product_type_options_values = dict() #set of possible values for each option name
    products = [] #placeholder, not populated yet
    

    #for each collection
    for c in collections:
        if c['products_count']==0:
            continue
        url = '{0}/collections/{1}/products.json'.format(base_url,c['handle'])
        try:
            c_products = requests.get(url,params={'limit':c['products_count']}).json()['products']
        except (requests.RequestException, ValueError, KeyError):
            #request failed or the response wasn't the expected JSON; log the URL and skip
            print(url)
            continue
        product_types[c['handle']] = set()
        
        #for each product
        for product in c_products:
            options = product['options']
            #add product type to this collection's set
            product_types[c['handle']].add(product['product_type'])

            #add option names to the product type's set (the option dicts themselves aren't hashable)
            product_type_options.setdefault(product['product_type'],set()).update(o['name'] for o in options)

            #for each option, add possible values if not already included
            for option in options:
                product_type_options_values.setdefault(option['name'],set()).update(option['values'])
    
    return collections,products, product_types, product_type_options, product_type_options_values

In [224]:
burga = get_shopify_shop_structure('https://burga.com')

In [159]:
c2 = collections.json()['collections'][4]

In [173]:
products = requests.get('{0}/collections/{1}/products.json'.format('https://www.burga.com',c2['handle']),params={'limit':c2['products_count']})

In [179]:
ptypes = set([p['product_type'] for p in products.json()['products']])

In [195]:
products.json()['products'][0]['options']

[{'name': 'Model', 'position': 1, 'values': ['Airpod']}]

In [232]:
bcoll = burga[0]
btypes = burga[1]
bptypes = burga[2]
bpto = burga[3]
bptov = burga[4]

In [237]:
bptov.keys()

dict_keys(['Color', 'Size', 'Model', 'Case Type', 'Quantity', 'Type', 'Plug Type'])

In [245]:
bptov

{'Color': {'Almond Latte',
  'Assorted Dreams',
  'Auriel',
  'Baby Blue',
  'Black',
  'Blue',
  'Cream',
  'Emerald Pool',
  'Gold',
  'Gun Metal',
  'Ivory',
  'Ivy',
  'Jin',
  'Lavender',
  'Pink',
  'Pink Sunrise',
  'Positive Vibes',
  'Rainbow Splash',
  'Red',
  'Rose Gold',
  'Rosé',
  'Santorini',
  'Sienna',
  'Silver',
  'Snow Cone',
  'Transparent'},
 'Size': {'16oz/470ml',
  '17oz/500ml',
  '38mm / 40 mm',
  '38mm / 40 mm / 41 mm',
  '38mm / 40mm / 41mm',
  '42mm / 44mm',
  '42mm / 44mm / 45 mm',
  '42mm / 44mm / 45mm',
  'One-Size'},
 'Model': {'AirPods Case',
  'AirPods Pro Case',
  'Airpod',
  'Airpod Max',
  'Airpods 3',
  'Airpods Pro',
  'Apple Watch 40mm',
  'Apple Watch 42mm',
  'Apple Watch 44mm',
  'MACBOOK 12 [A1534]',
  'MACBOOK AIR 11 [A1370/A1465]',
  'MACBOOK AIR 13 [A1466/A1369]',
  'MACBOOK AIR 13 [A1932/A2179/A2337]',
  'MACBOOK PRO 13 [A1502/A1425]',
  'MACBOOK PRO 13 [A1706/A1708/A2338]',
  'MACBOOK PRO 13 [A1989/A2159]',
  'MACBOOK PRO 13 [A2289/A225

In [243]:
[c['handle'] for c in bcoll]

['accessories',
 'airpod-max-cases',
 'airpod-types',
 'airpods-3-cases',
 'airpod-cases',
 'airpods-pro-cases',
 'all-models',
 'all-products-flexify',
 'apple-watch-bands',
 'backyard-stories',
 'beige-tones',
 'best-selling-phone-cases',
 'blooming-spring',
 'camo-phone-cases',
 'charging',
 'colorful-phone-cases',
 'personalized-phone-cases',
 'elite-cases-burger-nav',
 'essentials-phone-cases',
 'explorer-phone-cases',
 'eyeglass-chains',
 'festive-collection',
 'floral-phone-cases',
 'for-him-airpods-cases',
 'for-him-ipad-cases',
 'for-him-macbook-cases',
 'for-him-phone-cases',
 'for-him-drinkware',
 'google-pixel-2-cases',
 'google-pixel-2-xl-cases']

Impossible to query this the way the site gets queried. I probably need a normalized table?

In [None]:
example_url = 'https://www.burga.com/collections/personalized-phone-cases/model:iphone-13?type=elite-dark&page=2&=undefined&color=black'
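To see what a normalized table would have to answer, it helps to pull the filters out of a storefront URL like this one. The sketch below uses only the standard library. How I read the URL (a collection handle, then `key:value` path segments, then query parameters) is my guess from this single example, not documented behaviour, and the function name is made up.

```python
from urllib.parse import urlparse, parse_qsl


def parse_storefront_url(url):
    """Return (collection_handle, filters) guessed from a storefront URL."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split('/') if s]

    collection, filters = None, {}
    if len(segments) >= 2 and segments[0] == 'collections':
        collection = segments[1]
        # extra path segments such as 'model:iphone-13' look like key:value filters
        for seg in segments[2:]:
            key, sep, value = seg.partition(':')
            if sep:
                filters[key] = value

    # query-string filters; skip the stray '=undefined' pair, which has an empty key
    for key, value in parse_qsl(parts.query):
        if key:
            filters[key] = value
    return collection, filters
```

For the example URL this yields the `personalized-phone-cases` handle plus `model`, `type`, `page` and `color` entries. Each of those except `page` would need a matching column in the normalized table.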

In [244]:
bptypes

{'accessories': {'Airtag Case',
  'Eyewear Chain',
  'Phone Charm',
  'Ring Light',
  'wireless charger'},
 'airpod-max-cases': {'AirPods Case'},
 'airpod-types': {'AirPods Case'},
 'airpods-3-cases': {'AirPods Case'},
 'airpod-cases': {'AirPods Case'},
 'airpods-pro-cases': {'AirPods Case'},
 'all-models': {'Phone Case'},
 'all-products-flexify': {'AirPods Case',
  'Cable',
  'Camera Lens Protector',
  'Card Holder',
  'Passport Holder',
  'Phone Case',
  'Power Bank',
  'Ring Holder',
  'Straws',
  'Travel Mug',
  'Watch Protector',
  'Water Bottle',
  'iPad Case',
  'macbook',
  'screenprotector',
  'wireless charger'},
 'apple-watch-bands': {'Leather Apple Watch Band',
  'Metal Mesh Apple Watch Band'},
 'backyard-stories': {'Phone Case'},
 'beige-tones': {'Phone Case'},
 'blooming-spring': {'Phone Case'},
 'camo-phone-cases': {'Phone Case'},
 'charging': {'Cable', 'Power Bank', 'Wall Charger', 'wireless charger'},
 'colorful-phone-cases': {'Phone Case'},
 'personalized-phone-cases'

What if we get the product types from the collections?

Hypotheses to test:
- Options available for variants are consistent within each collection. So all the items that belong to a single collection will have the same format for option1, option2, option3
    - Actually, this occurs at the product level
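One way that hypothesis could be checked is to collect, for each collection, the distinct tuples of option names its products use, and flag any collection with more than one. This is only a sketch: the input is assumed to be a dict mapping each collection handle to its list of raw product dicts, and both function names are hypothetical.

```python
from collections import defaultdict


def option_signatures(products_by_collection):
    """Map each collection handle to the set of option-name tuples its products use."""
    signatures = defaultdict(set)
    for handle, products in products_by_collection.items():
        for product in products:
            names = tuple(opt['name'] for opt in product.get('options', []))
            signatures[handle].add(names)
    return dict(signatures)


def inconsistent_collections(signatures):
    """Handles whose products do not all share one option layout."""
    return sorted(h for h, sigs in signatures.items() if len(sigs) > 1)
```

If `inconsistent_collections` comes back empty, the collection-level version of the hypothesis holds for the data pulled. Otherwise the option layout really is a product-level property, as the sub-bullet suspects.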