# Entity Join

- In this notebook, we connect Journalists and Recipes using Publications as a junction table.
- First, we run a select statement on Magazines to verify that it was created properly. (We created the Magazines table in our catchall ipynb.)

In [None]:
%%bigquery
select *
from  magazine_recipes_stg.Magazines
limit 5

Unnamed: 0,magazine_id,magazine_name,website,pub_frequency_weeks,publishing_company,subscription_price,data_source,load_time
0,0,,,,,,,2024-02-09 23:45:33.542043+00:00
1,1,,,,,,,2024-02-09 23:45:33.542043+00:00
2,2,,,,,,,2024-02-09 23:45:33.542043+00:00
3,3,,,,,,,2024-02-09 23:45:33.542043+00:00
4,4,,,,,,,2024-02-09 23:45:33.542043+00:00


In [None]:
%%bigquery
select *
from magazine_recipes_stg.Journalists
limit 5

Unnamed: 0,journalist_id,f_name,l_name,age,phone,state,data_source,load_time
0,13,Heather,Roberts,25,(499)524-6610x935,IN,faker,2024-01-27 00:25:41.566545+00:00
1,22,Christina,Walker,25,(701)568-8477x9361,KS,faker,2024-01-27 00:25:41.566545+00:00
2,40,David,Chen,25,+1-380-466-0657x3547,WY,faker,2024-01-27 00:25:41.566545+00:00
3,37,Gregory,Haley,26,(703)455-7448,OR,faker,2024-01-27 00:25:41.566545+00:00
4,15,Joseph,Freeman,26,+1-890-507-5470,OH,faker,2024-01-27 00:25:41.566545+00:00


In [None]:
%%bigquery
select *
from magazine_recipes_stg.Recipes
limit 5

Unnamed: 0,recipe_id,title,subtitle,servings,yield_unit,prep_min,cook_min,stnd_min,source,intro,...,rating,ease_of_prep,note,type,page,slowcooker,link,last_made,data_source,load_time
0,1471,,,,,,,,,,...,,,,,,,,NaT,bird,2024-01-30 01:08:47.612652+00:00
1,1559,,,,,,,,,,...,,,,,,,,NaT,bird,2024-01-30 01:08:47.612652+00:00
2,1509,,,,,,,,,,...,,,,,,,,NaT,bird,2024-01-30 01:08:47.612652+00:00
3,1458,,,,,,,,,,...,,,,,,,,NaT,bird,2024-01-30 01:08:47.612652+00:00
4,1567,,,,,,,,,,...,,,,,,,,NaT,bird,2024-01-30 01:08:47.612652+00:00


We assign each journalist in Journalists a magazine_id. The first 20 journalists receive the magazine_id equal to their row number, which ensures that every magazine has at least one journalist. The remaining 70 journalists receive a random magazine_id between 1 and 20.

In [None]:

import random
journalist_magazine_map = {}
for i in range(1, 91):
  if i <= 20:
    journalist_magazine_map[i] = i
  else:
    journalist_magazine_map[i] = random.randint(1, 20)
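
The assignment rule described above can also be wrapped in a small function, which makes the coverage guarantee easy to check. This is a minimal sketch, not the notebook's own code: the function name `build_journalist_magazine_map`, its parameters, and the optional `seed` are assumptions introduced for illustration.

```python
import random


def build_journalist_magazine_map(n_journalists=90, n_magazines=20, seed=None):
    """Map journalist_id -> magazine_id so that every magazine is used at least once."""
    rng = random.Random(seed)  # passing a seed makes the random part reproducible
    mapping = {}
    for journalist_id in range(1, n_journalists + 1):
        if journalist_id <= n_magazines:
            # The first n_magazines journalists get the magazine matching their own id
            mapping[journalist_id] = journalist_id
        else:
            # Every other journalist gets a magazine picked uniformly at random
            mapping[journalist_id] = rng.randint(1, n_magazines)
    return mapping


example_map = build_journalist_magazine_map(seed=0)
# Every magazine id from 1 to 20 appears among the assigned values
assert set(example_map.values()) == set(range(1, 21))
```

Using a local `random.Random` instance rather than the module-level functions keeps the seed from affecting other cells.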


Create the JSON data for Publications. Each row includes only publication_id, recipe_id, magazine_id, and journalist_id; all other fields are null. These null placeholder columns will be populated with faker or another tool in the future.


In [None]:
from google.cloud import bigquery
client = bigquery.Client()

sql_query = """
    SELECT recipe_id
    FROM `magazine_recipes_stg.Recipes`
"""
query_job = client.query(sql_query)

# Fetch the results and save them as a list
recipe_ids = [row.recipe_id for row in query_job]
recipe_ids

In [None]:
from google.cloud import bigquery

client = bigquery.Client()

table_name = 'Publications'

schema = [
  bigquery.SchemaField("publication_id", "INTEGER", mode="REQUIRED"),
  bigquery.SchemaField("recipe_id", "INTEGER", mode="REQUIRED"),
  bigquery.SchemaField("magazine_id", "INTEGER", mode="REQUIRED"),
  bigquery.SchemaField("journalist_id", "INTEGER", mode="REQUIRED"),
  bigquery.SchemaField("date", "DATE", mode="NULLABLE"),
  bigquery.SchemaField("volume", "INTEGER", mode="NULLABLE"),
  bigquery.SchemaField("issue", "INTEGER", mode="NULLABLE"),
  bigquery.SchemaField("publication_type", "STRING", mode="NULLABLE"),
  bigquery.SchemaField("data_source", "STRING", mode="NULLABLE"),
  bigquery.SchemaField("load_time", "TIMESTAMP", mode="REQUIRED", default_value_expression="CURRENT_TIMESTAMP"),
]

table_ref = client.dataset("magazine_recipes_stg").table(table_name)
table = bigquery.Table(table_ref, schema=schema)

client.create_table(table)

rows_to_insert = []

for index, recipe_id in enumerate(recipe_ids):
  journalist_id = random.randint(1, 90)
  magazine_id = journalist_magazine_map[journalist_id]
  row = {
    "publication_id": index,
    "recipe_id": recipe_id,
    "magazine_id": magazine_id,
    "journalist_id": journalist_id,
    "date": None,
    "volume": None,
    "issue": None,
    "publication_type": None,
    "data_source": None,
    "load_time": None,
  }
  rows_to_insert.append(row)

errors = client.insert_rows(table_ref, rows_to_insert, schema)

if errors == []:
    print("Rows inserted successfully.")
else:
    print("Encountered errors while inserting rows:", errors)


Rows inserted successfully.
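
The row-building loop above can be separated from the BigQuery calls so that it can be checked without a live connection. This is a rough sketch under assumptions: `build_publication_rows` and its parameters are hypothetical names, and the small sample map is made up for illustration. Only the four populated fields are emitted here, whereas the notebook's version also passes the nullable columns explicitly as None.

```python
import random


def build_publication_rows(recipe_ids, journalist_magazine_map, seed=None):
    """Build one publication row per recipe, pairing it with a random journalist."""
    rng = random.Random(seed)
    journalist_ids = sorted(journalist_magazine_map)
    rows = []
    for index, recipe_id in enumerate(recipe_ids):
        journalist_id = rng.choice(journalist_ids)
        rows.append({
            "publication_id": index,  # position in the recipe list serves as the key
            "recipe_id": recipe_id,
            # The magazine always follows from the chosen journalist
            "magazine_id": journalist_magazine_map[journalist_id],
            "journalist_id": journalist_id,
        })
    return rows


sample_map = {1: 1, 2: 2, 3: 1}  # hypothetical journalist_id -> magazine_id
sample_rows = build_publication_rows([1471, 1559, 1509], sample_map, seed=42)
```

Because magazine_id is looked up from the journalist, a publication can never pair a journalist with a magazine they were not assigned to.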


## Primary Key
- The Publications junction table between Journalists and Recipes has been created successfully. We now establish the necessary primary and foreign keys.

In [None]:
%%bigquery
alter table magazine_recipes_stg.Publications
  add primary key (publication_id) not enforced

Check for primary key violations:

In [None]:
%%bigquery
select publication_id, count(*) as duplicate_pk
from magazine_recipes_stg.Publications
group by publication_id
having count(*) > 1

Unnamed: 0,publication_id,duplicate_pk
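
Because the primary key is declared `not enforced`, BigQuery does not reject duplicate values itself, so a check like the one above is the only safeguard. The same test can be run on rows in memory before they are loaded. This is a minimal sketch; `find_duplicate_keys` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def find_duplicate_keys(rows, key="publication_id"):
    """Return {key_value: count} for every key value that occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


clean_rows = [{"publication_id": i} for i in range(5)]
dirty_rows = clean_rows + [{"publication_id": 3}]  # id 3 now appears twice

clean_result = find_duplicate_keys(clean_rows)
dirty_result = find_duplicate_keys(dirty_rows)
```

An empty result corresponds to the empty query output above: no key value occurs more than once.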


## Foreign Keys

Publications has three foreign keys: recipe_id, magazine_id, and journalist_id.

In [None]:
%%bigquery
alter table magazine_recipes_stg.Publications add foreign key (recipe_id)
  references magazine_recipes_stg.Recipes (recipe_id) not enforced

In [None]:
%%bigquery
alter table magazine_recipes_stg.Publications add foreign key (magazine_id)
  references magazine_recipes_stg.Magazines (magazine_id) not enforced

In [None]:
%%bigquery
alter table magazine_recipes_stg.Journalists add primary key (journalist_id) not enforced

In [None]:
%%bigquery
alter table magazine_recipes_stg.Publications add foreign key (journalist_id)
  references magazine_recipes_stg.Journalists (journalist_id) not enforced

Check for foreign key violations for each of the three FKs:

In [None]:
%%bigquery
select count(*) as orphan_records
from magazine_recipes_stg.Publications
where recipe_id not in (select recipe_id from  magazine_recipes_stg.Recipes)

Unnamed: 0,orphan_records
0,0


In [None]:
%%bigquery
select count(*) as orphan_records
from magazine_recipes_stg.Publications
where magazine_id not in (select magazine_id from  magazine_recipes_stg.Magazines)

Unnamed: 0,orphan_records
0,0


In [None]:
%%bigquery
select count(*) as orphan_records
from magazine_recipes_stg.Publications
where journalist_id not in (select journalist_id from  magazine_recipes_stg.Journalists)

Unnamed: 0,orphan_records
0,0
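
The three orphan checks follow one pattern: count the child rows whose key value does not appear in the parent table. Below is a minimal Python sketch of that anti-join; `count_orphans` is a hypothetical helper, and the sample publications are made up (the journalist ids are taken from the sample output earlier in the notebook).

```python
def count_orphans(child_rows, parent_ids, key):
    """Count child rows whose `key` value has no match among the parent ids."""
    parent_set = set(parent_ids)  # set membership keeps each lookup O(1)
    return sum(1 for row in child_rows if row[key] not in parent_set)


publications = [
    {"publication_id": 0, "journalist_id": 13},
    {"publication_id": 1, "journalist_id": 22},
    {"publication_id": 2, "journalist_id": 99},  # no matching journalist
]
journalist_ids = [13, 22, 40, 37, 15]

orphans = count_orphans(publications, journalist_ids, "journalist_id")
```

One caveat for the SQL form: `not in` matches no rows when its subquery returns a NULL, so the queries above report zero orphans reliably only if the parent key columns contain no NULLs.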
