# Entity Merge

In this notebook we merge three pairs of tables: (1) 'bird_recipes' and 'recipes_at', (2) 'ingredients' and 'ingredients_at', and (3) 'quantity' and 'quantity_at'.

We did most of the processing and cleansing for ingredients and ingredients_at in Project 2. Additionally, very few of these table pairs contain duplicate or repeating rows, so for us most of these merges were just a matter of doing a UNION.

In [None]:
%%bigquery
select * from magazine_recipes_stg.recipes_at
limit 5

Unnamed: 0,recipe_id,name,rating,ease_of_prep,note,type,prep_time,cookbook,page,slowcooker,link,last_made,data_source,load_time
0,4,Balsamic Pork Chops,1,,,,,,,,,2/14/2019,Airtable,2024-02-06 23:27:55.034768
1,1,Chive Butter Radishes,4,,,,,,,,,6/16/2018,Airtable,2024-02-06 23:27:55.034768
2,3,Spicy Black Bean Nachos,2,,,Main Dish,,,,,https://www.mexicanplease.com/spicy-black-bean...,9/17/2018,Airtable,2024-02-06 23:27:55.034768
3,2,Sweet Potato Breakfast Burritos,4,,,Main Dish,,,,,https://www.ambitiouskitchen.com/healthy-sweet...,11/1/2018,Airtable,2024-02-06 23:27:55.034768
4,75,Vegetarian Chili,4,,,Main Dish,,Taste of Home,260.0,,,7/8/2018,Airtable,2024-02-06 23:27:55.034768


## Merge Recipes
##### Create an intermediate table for each recipe source so that both share the same schema (rename shared columns and add the missing columns as NULLs of the proper type); the two can then be combined with a set operation. Make minor schema changes to fit the envisioned logic (e.g., convert last_made from a string to a date).
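
As a rough illustration of this alignment step, the sketch below uses Python's built-in `sqlite3` as a local stand-in for BigQuery (an assumption made so the example runs anywhere; the notebook itself uses BigQuery SQL). The tables `at_src` and `bird_src`, the helper `to_iso_date`, and the sample bird row are hypothetical and trimmed to a few columns. It renames a shared column, pads a column that one source lacks with a typed NULL, and converts the `m/d/Y` string to an ISO date.

```python
import sqlite3
from datetime import datetime


def to_iso_date(text):
    """Convert an 'm/d/Y' string such as '2/14/2019' to 'YYYY-MM-DD'; pass NULL through."""
    if text is None:
        return None
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()


conn = sqlite3.connect(":memory:")
conn.create_function("to_iso_date", 1, to_iso_date)

# Hypothetical, trimmed-down versions of the two sources.
conn.execute("CREATE TABLE at_src (recipe_id INTEGER, name TEXT, last_made TEXT)")
conn.execute("CREATE TABLE bird_src (recipe_id INTEGER, title TEXT, subtitle TEXT)")
conn.execute("INSERT INTO at_src VALUES (4, 'Balsamic Pork Chops', '2/14/2019')")
conn.execute("INSERT INTO bird_src VALUES (1001, 'Sample Bird Recipe', 'A subtitle')")

# Give both sides the same columns in the same order.
conn.execute("""
    CREATE TABLE at_aligned AS
    SELECT recipe_id,
           name AS title,                        -- rename the shared column
           CAST(NULL AS TEXT) AS subtitle,       -- column only the other source has
           to_iso_date(last_made) AS last_made,  -- string -> ISO date
           'airtable' AS data_source
    FROM at_src
""")
conn.execute("""
    CREATE TABLE bird_aligned AS
    SELECT recipe_id, title, subtitle,
           CAST(NULL AS TEXT) AS last_made,      -- column only the other source has
           'bird' AS data_source
    FROM bird_src
""")

aligned_rows = conn.execute(
    "SELECT * FROM bird_aligned UNION ALL SELECT * FROM at_aligned ORDER BY recipe_id"
).fetchall()
print(aligned_rows)
```

In the BigQuery cells that follow, `CAST(NULL as STRING)` and `PARSE_DATE('%m/%d/%Y', last_made)` play the roles that `CAST(NULL AS TEXT)` and the hypothetical `to_iso_date` play here.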

In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.recipes_at_p3 AS
SELECT recipe_id, name as title, CAST(NULL as STRING) as subtitle, CAST(NULL as INTEGER) as servings, CAST(NULL as STRING) yield_unit,
prep_time as prep_min, CAST(NULL as INTEGER) as cook_min, CAST(NULL as INTEGER) as stnd_min, cookbook as source,
CAST(NULL as STRING) as intro, CAST(NULL as STRING) as directions, rating, ease_of_prep, note, type, page, slowcooker,
link, PARSE_DATE('%m/%d/%Y', last_made) as last_made, 'airtable' as data_source, CAST('2024-02-02 21:09:57.475069 UTC' as TIMESTAMP) as load_time FROM magazine_recipes_stg.recipes_at;

In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.recipes_bird_p3 AS
SELECT recipe_id, title, subtitle, servings, yield_unit, prep_min, cook_min, stnd_min, source, intro, directions,
CAST(NULL as INTEGER) as rating, CAST(NULL as STRING) as ease_of_prep, CAST(NULL as STRING) as note,  CAST(NULL as STRING) as type,
CAST(NULL as INTEGER) page, CAST(NULL as STRING) slowcooker, CAST(NULL as STRING) link, CAST(NULL as DATE) last_made,
'bird' as data_source, load_time FROM magazine_recipes_raw.bird_recipes;

We made sure that the two recipe tables have exactly the same columns in the same order, so now we can perform a UNION ALL to combine them.
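
Because a plain `UNION ALL` pairs columns by position rather than by name, a quick pre-check of the two column lists can catch a swapped column before it silently lands in the wrong field. The sketch below shows one way to do that with Python's built-in `sqlite3` as a local stand-in (an assumption; in BigQuery the column lists would more likely come from the dataset's `INFORMATION_SCHEMA.COLUMNS` view). The helpers `column_names` and `assert_union_compatible` and the three small tables are hypothetical.

```python
import sqlite3


def column_names(conn, table):
    """Return a table's column names in declared order (SQLite PRAGMA)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def assert_union_compatible(conn, left, right):
    """Raise ValueError if the two tables differ in column names or order."""
    left_cols, right_cols = column_names(conn, left), column_names(conn, right)
    if left_cols != right_cols:
        raise ValueError(f"column mismatch: {left_cols} vs {right_cols}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bird_p3 (recipe_id INTEGER, title TEXT, data_source TEXT)")
conn.execute("CREATE TABLE at_p3 (recipe_id INTEGER, title TEXT, data_source TEXT)")
# Same columns as at_p3, but in a different order.
conn.execute("CREATE TABLE at_swapped (recipe_id INTEGER, data_source TEXT, title TEXT)")

assert_union_compatible(conn, "bird_p3", "at_p3")  # passes silently

try:
    assert_union_compatible(conn, "bird_p3", "at_swapped")
    swapped_detected = False
except ValueError:
    swapped_detected = True
print(swapped_detected)
```

The check compares names and order only; comparing declared types as well would be a natural extension.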

In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.Recipes AS
SELECT * FROM magazine_recipes_stg.recipes_bird_p3
UNION ALL
SELECT * FROM magazine_recipes_stg.recipes_at_p3;

### Recipe Primary Keys

In [None]:
%%bigquery
-- set primary key on recipe_id
alter table magazine_recipes_stg.Recipes
  add primary key (recipe_id) not enforced;

In [None]:
%%bigquery
-- check for duplicate records
select recipe_id, count(*) duplicate_records
from magazine_recipes_stg.Recipes
group by recipe_id
having count(*) > 1
order by count(*) desc

Unnamed: 0,recipe_id,duplicate_records
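
Since the primary key is declared `not enforced`, BigQuery does not validate uniqueness itself, so the empty result of the duplicate query is what actually confirms that `recipe_id` is unique. The sketch below reproduces the same GROUP BY / HAVING check on a deliberately flawed table, using Python's built-in `sqlite3` as a local stand-in (an assumption). The table `merged`, the helper `duplicate_keys`, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merged (recipe_id INTEGER, data_source TEXT)")
conn.executemany(
    "INSERT INTO merged VALUES (?, ?)",
    [(1, "bird"), (2, "bird"), (2, "airtable"), (3, "airtable")],  # id 2 collides
)


def duplicate_keys(conn, table, key):
    """Return (key_value, row_count) for every key value that appears more than once."""
    query = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC"
    )
    return conn.execute(query).fetchall()


duplicates = duplicate_keys(conn, "merged", "recipe_id")
print(duplicates)
```

An empty list from `duplicate_keys` corresponds to the empty result shown in the notebook output.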


## Merge Quantity
##### Apply the same logic to the quantity tables: make sure both sides have the same columns, then combine them with a UNION ALL. We can do this because we are confident there are no duplicate records or repeating values.

In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.Quantity as
SELECT quantity_id, recipe_id, ingredient_id, CAST(NULL as FLOAT64) as max_qty, CAST(NULL as FLOAT64) as min_qty,
CAST(NULL as STRING) as unit, CAST(NULL as STRING) as preparation, CAST(NULL as BOOLEAN) as optional, data_source,
CAST('2024-02-02 21:09:57.475069 UTC' as TIMESTAMP) as load_time  FROM magazine_recipes_stg.quantity_at
UNION ALL
SELECT * except(load_time), 'bird' as data_source, load_time
FROM magazine_recipes_raw.quantity



### Quantity Primary Keys

In [None]:
%%bigquery
-- set primary key on quantity_id
alter table magazine_recipes_stg.Quantity
  add primary key (quantity_id) not enforced;

In [None]:
%%bigquery
-- check for duplicate records
select quantity_id, count(*) duplicate_records
from magazine_recipes_stg.Quantity
group by quantity_id
having count(*) > 1
order by count(*) desc

Unnamed: 0,quantity_id,duplicate_records


## Merge Ingredients
##### Merge the two ingredient tables. A plain UNION is not enough this time: some ingredients show up in both sources, so we first use a join to identify the overlapping ingredients, tag them as coming from both sources, and exclude them from the Airtable side before the union.
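
The overall shape of this merge can be sketched in a single query. The version below is a condensed variant, not the notebook's own sequence of intermediate tables, and it runs on Python's built-in `sqlite3` as a local stand-in (an assumption). The tables `bird_ing` and `at_ing` and their rows are hypothetical, and the sketch assumes `ingredient_id` is unique within each source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bird_ing (ingredient_id INTEGER, name TEXT, category TEXT)")
conn.execute("CREATE TABLE at_ing (ingredient_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO bird_ing VALUES (?, ?, ?)",
    [(1, "salt", "spices"), (2, "radish", "vegetables")],
)
conn.executemany(
    "INSERT INTO at_ing VALUES (?, ?)",
    [(2, "radish"), (3, "chives")],  # id 2 exists in both sources
)

merged = conn.execute("""
    -- Bird rows keep their richer columns; rows also present in at_ing are tagged.
    SELECT b.ingredient_id, b.name, b.category,
           CASE WHEN a.ingredient_id IS NULL THEN 'bird' ELSE 'bird-airtable' END AS data_source
    FROM bird_ing b
    LEFT JOIN at_ing a ON a.ingredient_id = b.ingredient_id
    UNION ALL
    -- Airtable-only rows (anti-join), padded with a NULL category.
    SELECT a.ingredient_id, a.name, CAST(NULL AS TEXT), 'airtable'
    FROM at_ing a
    WHERE NOT EXISTS (SELECT 1 FROM bird_ing b WHERE b.ingredient_id = a.ingredient_id)
    ORDER BY 1
""").fetchall()
print(merged)
```

Each ingredient ends up in the result exactly once, with `data_source` recording whether it came from one source or both.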

In [None]:
%%bigquery
-- first create a table that has all of the overlapping ingredients
CREATE OR REPLACE TABLE magazine_recipes_stg.dual_ingredients as
SELECT a.ingredient_id, a.ingredient_name, b.category, b.plural, b.load_time, 'bird-airtable' as data_source,
FROM magazine_recipes_stg.ingredients_at a
JOIN magazine_recipes_raw.ingredients b
on a.ingredient_id = b.ingredient_id

In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.bird_ingredients AS
SELECT * EXCEPT(load_time), 'bird' as data_source, load_time FROM magazine_recipes_raw.ingredients

In [None]:
%%bigquery
UPDATE magazine_recipes_stg.bird_ingredients
SET data_source = 'bird-airtable'
WHERE ingredient_id in (select ingredient_id from magazine_recipes_stg.dual_ingredients);

In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.ingredients_at_p3 AS
SELECT * FROM magazine_recipes_stg.ingredients_at
WHERE ingredient_id not in (select ingredient_id from magazine_recipes_stg.dual_ingredients);


In [None]:
%%bigquery
CREATE OR REPLACE TABLE magazine_recipes_stg.Ingredients as
SELECT * FROM magazine_recipes_stg.bird_ingredients
UNION ALL
SELECT ingredient_id, ingredient_name, CAST(NULL AS STRING) AS category, CAST(NULL AS STRING) as plural, data_source, CAST('2024-02-02 21:09:57.475069 UTC' as TIMESTAMP) as load_time FROM magazine_recipes_stg.ingredients_at_p3

### Ingredients Primary Keys

In [None]:
%%bigquery
alter table magazine_recipes_stg.Ingredients
  add primary key (ingredient_id) not enforced;

In [None]:
%%bigquery
-- check for duplicate records
select ingredient_id, count(*) duplicate_records
from magazine_recipes_stg.Ingredients
group by ingredient_id
having count(*) > 1
order by count(*) desc

Unnamed: 0,ingredient_id,duplicate_records


# Foreign Keys

In [None]:
%%bigquery
alter table magazine_recipes_stg.Quantity add foreign key (recipe_id)
  references magazine_recipes_stg.Recipes (recipe_id) not enforced;

In [None]:
%%bigquery
select count(*) as orphan_records
from magazine_recipes_stg.Quantity
where recipe_id not in (select recipe_id from magazine_recipes_stg.Recipes)

Unnamed: 0,orphan_records
0,0


In [None]:
%%bigquery
alter table magazine_recipes_stg.Quantity add foreign key (ingredient_id)
  references magazine_recipes_stg.Ingredients (ingredient_id) not enforced;

In [None]:
%%bigquery
select count(*) as orphan_records
from magazine_recipes_stg.Quantity
where ingredient_id not in (select ingredient_id from magazine_recipes_stg.Ingredients)

Unnamed: 0,orphan_records
0,0
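
One caveat about orphan checks written with `NOT IN`: under SQL's three-valued logic, a single NULL in the subquery makes every non-matching comparison evaluate to NULL, so the check can report 0 orphans even when some exist. The zero counts above are therefore reliable only if the referenced key columns contain no NULLs. The sketch below contrasts `NOT IN` with `NOT EXISTS` on hypothetical `parent` and `child` tables, using Python's built-in `sqlite3` as a local stand-in (an assumption).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (recipe_id INTEGER)")
conn.execute("CREATE TABLE child (quantity_id INTEGER, recipe_id INTEGER)")
conn.executemany("INSERT INTO parent VALUES (?)", [(1,), (2,), (None,)])
conn.executemany(
    "INSERT INTO child VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 99)],  # recipe_id 99 has no parent row
)

# NOT IN: the NULL in parent turns "99 NOT IN (...)" into NULL, so the orphan is missed.
not_in_count = conn.execute(
    "SELECT COUNT(*) FROM child WHERE recipe_id NOT IN (SELECT recipe_id FROM parent)"
).fetchone()[0]

# NOT EXISTS: unaffected by NULLs in the parent column.
not_exists_count = conn.execute(
    "SELECT COUNT(*) FROM child c "
    "WHERE NOT EXISTS (SELECT 1 FROM parent p WHERE p.recipe_id = c.recipe_id)"
).fetchone()[0]

print(not_in_count, not_exists_count)
```

Here the `NOT IN` version misses the orphan while the `NOT EXISTS` version finds it, which is why the latter is often the safer form when the referenced column is nullable.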


# Cleanup

In [None]:
%%bigquery
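-- `if exists` keeps the cleanup re-runnable and skips tables this notebook never created (e.g. quantity_at_p3)
-- note: lowercase `recipes` is distinct from the merged `Recipes` table only while the dataset is case-sensitive (the BigQuery default)
drop table if exists magazine_recipes_stg.bird_ingredients;
drop table if exists magazine_recipes_stg.dual_ingredients;
drop table if exists magazine_recipes_stg.ingredients_at;
drop table if exists magazine_recipes_stg.ingredients_at_p3;
drop table if exists magazine_recipes_stg.quantity_at;
drop table if exists magazine_recipes_stg.quantity_at_p3;
drop table if exists magazine_recipes_stg.recipes;
drop table if exists magazine_recipes_stg.recipes_at;
drop table if exists magazine_recipes_stg.recipes_at_p3;
drop table if exists magazine_recipes_stg.recipes_bird_p3;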



