# Field Decomposition

In this notebook, we decompose fields from the raw tables whose values contain more than one property. Specifically, two fields need to be split into their individual components: `ingredients` in `recipe_at` and `name` in `faker_journalists`.


In [None]:
%%bigquery
select * from magazine_recipes_raw.recipe_at limit 5

Unnamed: 0,name,rating,ease_of_prep,note,type,prep_time,cookbook,page,ingredients,slowcooker,link,last_made,load_time
0,Chive Butter Radishes,4,,,,,,,,,,6/16/2018,2024-02-02 21:09:57.475069+00:00
1,Sweet Potato Breakfast Burritos,4,,,Main Dish,,,,Sweet potato,,https://www.ambitiouskitchen.com/healthy-sweet...,11/1/2018,2024-02-02 21:09:57.475069+00:00
2,Spicy Black Bean Nachos,2,,,Main Dish,,,,"Beans,Adobo Chile",,https://www.mexicanplease.com/spicy-black-bean...,9/17/2018,2024-02-02 21:09:57.475069+00:00
3,Balsamic Pork Chops,1,,,,,,,,,,2/14/2019,2024-02-02 21:09:57.475069+00:00
4,Chocolate Raspberry Torte,5,Hard,So good,Dessert,60.0,,,"Eggs,Milk",,,10/11/2016,2024-02-02 21:09:57.475069+00:00


In [None]:
%%bigquery
select * from magazine_recipes_raw.faker_journalists limit 5

Unnamed: 0,author_id,name,age,phone_number,state,load_time
0,13,Heather Roberts,25,(499)524-6610x935,IN,2024-01-27 00:25:41.566545+00:00
1,22,Christina Walker,25,(701)568-8477x9361,KS,2024-01-27 00:25:41.566545+00:00
2,40,David Chen,25,+1-380-466-0657x3547,WY,2024-01-27 00:25:41.566545+00:00
3,15,Joseph Freeman,26,+1-890-507-5470,OH,2024-01-27 00:25:41.566545+00:00
4,37,Gregory Haley,26,(703)455-7448,OR,2024-01-27 00:25:41.566545+00:00


# faker_journalists


Split the `name` field in the `faker_journalists` table. Each name contains a first and a last name.

  Example: Paris Hilton --->  f_name: Paris, l_name: Hilton

Create the staging table `Journalists`.

This is a change from our plan in proj1: we realized that the directions in the BIRD recipes table should stay together as one unit, so it made more sense to split `name` from `faker_journalists` instead.
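
As a rough illustration of the split described above, here is a minimal Python sketch. The helper name `split_name` is hypothetical and not part of the notebook, and returning `None` when there is no second token is an assumption made for this sketch only; the SQL in the next cell may behave differently on such input.

```python
from typing import Optional, Tuple


def split_name(name: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a full name on single spaces and keep the first two tokens.

    Mirrors splitting on ' ' and then taking elements 0 and 1.
    """
    parts = name.split(" ")
    f_name = parts[0]  # str.split always returns at least one element
    # Assumption for this sketch: use None when there is no second token.
    l_name = parts[1] if len(parts) > 1 else None
    return f_name, l_name


print(split_name("Paris Hilton"))    # ('Paris', 'Hilton')
print(split_name("Mary Ann Smith"))  # ('Mary', 'Ann'), the third token is dropped
```

The second call shows the main limitation of this approach: any name with more than two space-separated tokens loses everything after the second token.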

In [None]:
%%bigquery
create or replace table magazine_recipes_stg.Journalists as
  select journalist_id, name_array[0] as f_name, name_array[1] as l_name, age, phone, state, 'faker' as data_source, load_time
  from
  (select author_id as journalist_id, age, phone_number as phone, state, split(name, ' ') as name_array, load_time
  from magazine_recipes_raw.faker_journalists)

# recipe_at


- We now split the `ingredients` field in `recipe_at` into an individual column for each of the 7 possible ingredients.

- We add a unique integer id to the recipes from Airtable because the raw table didn't include one.
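
The two steps above can be sketched on a couple of sample rows without touching BigQuery. This is only an illustration: the helper `split_ingredients` and the `raw_rows` list are hypothetical names introduced here, with values taken from the sample output of `recipe_at` shown earlier.

```python
from typing import Dict, Optional

MAX_INGREDIENTS = 7  # the staging schema has ingredient_1 .. ingredient_7


def split_ingredients(ingredients: Optional[str]) -> Dict[str, str]:
    """Map a comma-separated ingredient string to ingredient_N keys."""
    if ingredients is None:
        return {}
    parts = [part.strip() for part in ingredients.split(",")]
    # Anything beyond the seventh ingredient is ignored in this sketch.
    return {f"ingredient_{i + 1}": part for i, part in enumerate(parts[:MAX_INGREDIENTS])}


# Hypothetical stand-in for rows read from the raw table.
raw_rows = [
    {"name": "Spicy Black Bean Nachos", "ingredients": "Beans,Adobo Chile"},
    {"name": "Balsamic Pork Chops", "ingredients": None},
]

# enumerate(..., start=1) supplies the running integer id the raw table lacks.
for recipe_id, row in enumerate(raw_rows, start=1):
    record = {"recipe_id": recipe_id, "name": row["name"], **split_ingredients(row["ingredients"])}
    print(record)
```

Missing ingredients are simply left out of the record, so the corresponding nullable columns would stay empty on load.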

In [None]:
import json, datetime
from google.cloud import bigquery

project_id = "shidcs329e"
raw_dataset_name = "magazine_recipes_raw"
raw_table_name = "recipe_at"
stg_dataset_name = "magazine_recipes_stg"
stg_table_name = "recipe_ingredient_at" # lowercase the name because it's an intermediate table

recipe_at = []
target_table_id = "{}.{}.{}".format(project_id, stg_dataset_name, stg_table_name)

def serialize_datetime(obj):
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("Type not serializable")

schema = [
  bigquery.SchemaField("recipe_id", "INTEGER", mode = "REQUIRED"),
  bigquery.SchemaField("name", "STRING", mode = "NULLABLE"),
  bigquery.SchemaField("rating", "INTEGER", mode = "NULLABLE"),
  bigquery.SchemaField("ease_of_prep", "STRING", mode = "NULLABLE"),
  bigquery.SchemaField("note", "STRING", mode = "NULLABLE"),
  bigquery.SchemaField("type","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("prep_time", "INTEGER", mode = "NULLABLE"),
  bigquery.SchemaField("cookbook", "STRING", mode = "NULLABLE"),
  bigquery.SchemaField("page", "INTEGER", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_1","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_2","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_3","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_4","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_5","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_6","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("ingredient_7","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("slowcooker","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("link","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("last_made","STRING", mode = "NULLABLE"),
  bigquery.SchemaField("load_time", "TIMESTAMP", mode="NULLABLE", default_value_expression="CURRENT_TIMESTAMP")
]


bq_client = bigquery.Client()
sql = "select * from {}.{}".format(raw_dataset_name, raw_table_name)
query_job = bq_client.query(sql)

MAX_INGREDIENTS = 7
# columns copied through unchanged from the raw table
passthrough_fields = ["name", "rating", "ease_of_prep", "note", "type", "prep_time",
                      "cookbook", "page", "slowcooker", "link", "last_made"]

for index, row in enumerate(query_job):
    # the raw table has no id, so assign a 1-based running integer
    record = {"recipe_id": index + 1}

    # leave null fields out of the record so they load as NULL
    for field in passthrough_fields:
        if row[field] is not None:
            record[field] = row[field]

    # timestamps are not JSON serializable, so convert to an ISO 8601 string
    if row["load_time"] is not None:
        record["load_time"] = serialize_datetime(row["load_time"])

    # split the comma-separated ingredients into ingredient_1 .. ingredient_7
    ingredients = row["ingredients"]
    if ingredients is not None:
        parts = [part.strip() for part in ingredients.split(",")]
        for i, part in enumerate(parts[:MAX_INGREDIENTS]):
            record[f"ingredient_{i + 1}"] = part

    recipe_at.append(record)

# load records into staging table

job_config = bigquery.LoadJobConfig(schema=schema, source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON, write_disposition='WRITE_TRUNCATE')
table_ref = bigquery.table.TableReference.from_string(target_table_id)

try:
    job = bq_client.load_table_from_json(recipe_at, table_ref, job_config=job_config)
    job.result()  # block until the load job completes; raises if the job failed
    print('Inserted into', stg_table_name, ':', job.output_rows, 'records')

    if job.errors:
      print('job errors:', job.errors)

except Exception as e:
    print("Error inserting into BQ: {}".format(e))


Inserted into recipe_ingredient_at : 145 records


Verify that we ended up with the same record count in the staging table as in the raw table:

In [None]:
%%bigquery
select (select count(*) from magazine_recipes_raw.recipe_at) as raw_count,
  (select count(*) from magazine_recipes_stg.recipe_ingredient_at) as intermediate_stg_count

Unnamed: 0,raw_count,intermediate_stg_count
0,145,145


In [None]:
%%bigquery
select (select count(*) from magazine_recipes_raw.faker_journalists) as raw_count,
  (select count(*) from magazine_recipes_stg.Journalists) as intermediate_stg_count

Unnamed: 0,raw_count,intermediate_stg_count
0,90,90
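
The count comparisons above could also be scripted. The sketch below is one possible shape, not the notebook's method: `row_count` and `counts_match` are hypothetical helpers, and they assume a client object whose `query(sql).result()` yields rows indexable by column name, as the `google.cloud.bigquery` client does.

```python
def row_count(client, table_name: str) -> int:
    """Return count(*) for a table via any client exposing query(sql).result()."""
    rows = client.query(f"select count(*) as n from {table_name}").result()
    return next(iter(rows))["n"]


def counts_match(client, raw_table: str, stg_table: str) -> bool:
    """True when the raw and staging tables hold the same number of rows."""
    return row_count(client, raw_table) == row_count(client, stg_table)
```

With the `bq_client` created earlier, a call might look like `counts_match(bq_client, "magazine_recipes_raw.recipe_at", "magazine_recipes_stg.recipe_ingredient_at")`.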


# Primary Key

BigQuery does not enforce primary keys, so the following commands only document the intent of the `journalist_id` and `recipe_id` fields. We still need to check that each field conforms to a PK by looking for duplicate values.

In [None]:
%%bigquery
alter table magazine_recipes_stg.Journalists
  add primary key (journalist_id) not enforced;

In [None]:
%%bigquery
select journalist_id, count(*) duplicate_records
from magazine_recipes_stg.Journalists
group by journalist_id
having count(*) > 1
order by count(*) desc

Unnamed: 0,journalist_id,duplicate_records


In [None]:
%%bigquery
alter table magazine_recipes_stg.recipe_ingredient_at
  add primary key (recipe_id) not enforced;

In [None]:
%%bigquery
select recipe_id, count(*) duplicate_records
from magazine_recipes_stg.recipe_ingredient_at
group by recipe_id
having count(*) > 1
order by count(*) desc

Unnamed: 0,recipe_id,duplicate_records
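
Both duplicate checks return no rows, which is the expected result for a valid key. For intuition, the same `group by ... having count(*) > 1` logic can be sketched in plain Python. The helper `duplicate_keys` and the id lists are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplicate_keys(ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (id, count) pairs for ids seen more than once, most frequent first."""
    counts = Counter(ids)
    return [(key, n) for key, n in counts.most_common() if n > 1]


print(duplicate_keys([1, 2, 3]))           # [] means every id is unique
print(duplicate_keys([1, 2, 2, 3, 3, 3]))  # [(3, 3), (2, 2)]
```

An empty result corresponds to the empty query outputs above. Note that this only checks uniqueness; a primary key column should also contain no nulls.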
