## CMPINF 2110 Spring 2021 - Homework 06

### SOLUTION GUIDE

Helper notebook to read in and inspect the CSV files before importing into Neo4j.

The data are located in the GitHub repo below.

https://github.com/jyurko/CMPINF_2110_Spring_2021_data/tree/main/hw06

## Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

## Read data

There are 8 CSV files in the GitHub repo. Four of the files have names that contain only "nouns". The other four files have names that contain both "nouns" and "verbs". Let's first look at the four "noun" files.

### Devices

In [2]:
url_devices = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/devices.csv'

df_devices = pd.read_csv( url_devices )

In [3]:
df_devices

Unnamed: 0,device_id,class
0,1,doohickey
1,2,thing
2,3,whachmacallit
3,4,object


As we see above, there are 4 device classes. Each class in the `df_devices` DataFrame has a corresponding unique ID, `device_id`.

### Employees

In [4]:
url_employees = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/employees.csv'

df_employees = pd.read_csv( url_employees )

In [5]:
df_employees

Unnamed: 0,employee_id,name,started
0,1,Alice,2017
1,2,Bob,2018
2,3,Chuck,2020
3,4,Dave,2019
4,5,Emily,2011


Each employee has an ID, `employee_id`, a name, and the year they joined the company as denoted by the `started` column.

### Machines

In [6]:
url_machines = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/machines.csv'

df_machines = pd.read_csv( url_machines )

In [7]:
df_machines

Unnamed: 0,machine_id,name,type
0,1,alpha,printer
1,2,bravo,printer
2,3,charlie,printer
3,4,delta,printer


There are four machines. Each machine has a unique ID, `machine_id`, a `name`, and the `type`. In this example, all machines are of the same type.

### Parts

In [8]:
url_parts = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/parts.csv'

df_parts = pd.read_csv( url_parts )

In [9]:
df_parts

Unnamed: 0,part_id,type
0,1,widget
1,2,gizmo
2,3,gadget
3,4,sprocket
4,5,button
5,6,square
6,7,triangle


We have 7 parts represented by their `part_id` and their `type` in the `df_parts` DataFrame.

### "Noun" tables summary

As we have seen above, each row in the four "noun" tables corresponds to a unique instance of an entity. The entity *label* is provided by the table name. 

We will now examine the tables with names containing nouns and verbs.

### Employees assemble devices

In [10]:
url_ead = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/employees_assembles_devices.csv'

df_ead = pd.read_csv( url_ead )

In [11]:
df_ead.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   assembly_id  50 non-null     int64
 1   device_id    50 non-null     int64
 2   employee_id  50 non-null     int64
dtypes: int64(3)
memory usage: 1.3 KB


The `.info()` method reveals that there are 50 rows in the `employees_assembles_devices.csv` table and 3 columns. There are no missing values, and we see that the table contains the `device_id`, `employee_id`, and `assembly_id`. Let's check the number of unique values per column.

In [12]:
df_ead.nunique()

assembly_id    50
device_id       4
employee_id     5
dtype: int64

There are 4 unique values for `device_id`. The cell below displays the unique values as well as the number of rows associated with each unique value. As we can see, the four values for `device_id` are the **same** as those in the `devices.csv` file.

In [13]:
df_ead.groupby(['device_id']).size().reset_index(name='num_rows')

Unnamed: 0,device_id,num_rows
0,1,12
1,2,15
2,3,11
3,4,12


There are 5 unique values for `employee_id`. The cell below shows that those 5 values are the same unique values of `employee_id` in the `employees.csv` file.

In [14]:
df_ead.groupby(['employee_id']).size().reset_index(name='num_rows')

Unnamed: 0,employee_id,num_rows
0,1,8
1,2,10
2,3,13
3,4,8
4,5,11
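
The comparison between the link table and the `employees.csv` table can also be made programmatically rather than by eye. The sketch below is only an illustration: it uses a few hand-typed rows (the five `employee_id` values and the first five rows of the link table, as displayed in this notebook) in place of the real DataFrames, and the helper name `missing_keys` is hypothetical.

```python
import pandas as pd

# Hand-typed stand-ins for the real tables (values copied from the outputs in this notebook).
employees = pd.DataFrame({'employee_id': [1, 2, 3, 4, 5]})
link = pd.DataFrame({'assembly_id': [1, 2, 3, 4, 5],
                     'device_id':   [2, 1, 2, 4, 2],
                     'employee_id': [4, 2, 2, 5, 2]})

def missing_keys(child, parent, key):
    """Return the sorted values of `key` in `child` that never appear in `parent`."""
    mask = ~child[key].isin(parent[key])
    return sorted(child.loc[mask, key].unique().tolist())

# An empty list means every row of the link table points at a known employee.
orphans = missing_keys(link, employees, 'employee_id')
print(orphans)
```

With the real data, the same call would be `missing_keys(df_ead, df_employees, 'employee_id')`. Unlike a plain equality test, it lists the offending values when the check fails.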


The `employees_assembles_devices.csv` file therefore provides the relationship between the employees and the devices. It stores which employee assembled which device. The `device_id` and `employee_id` values repeat and are not unique in the `employees_assembles_devices.csv` file because *many* employees assemble *many* classes of devices. Thus, a many-to-many relationship exists between employees and devices. Relational data models cannot represent a many-to-many relationship directly between two tables. Instead, **link tables** are created to logically model the many-to-many relationship via two one-to-many relationships.
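
To make the role of the link table concrete, here is a minimal sketch that resolves a many-to-many relationship through two many-to-one joins. The rows are typed by hand from the outputs in this notebook (a subset of employees and devices, and the first five link rows), so this is an illustration rather than the full data set.

```python
import pandas as pd

employees = pd.DataFrame({'employee_id': [2, 4, 5],
                          'name': ['Bob', 'Dave', 'Emily']})
devices = pd.DataFrame({'device_id': [1, 2, 4],
                        'class': ['doohickey', 'thing', 'object']})
link = pd.DataFrame({'assembly_id': [1, 2, 3, 4, 5],
                     'device_id':   [2, 1, 2, 4, 2],
                     'employee_id': [4, 2, 2, 5, 2]})

# validate='many_to_one' asks pandas to raise a MergeError if the
# right-hand key is not unique, i.e. if the join is not many-to-one.
resolved = (link
            .merge(employees, on='employee_id', how='left', validate='many_to_one')
            .merge(devices, on='device_id', how='left', validate='many_to_one'))

# Distinct (employee, device class) pairs reveal the many-to-many pattern.
pairs = resolved[['name', 'class']].drop_duplicates()
print(pairs)
```

In this small sample Bob assembles two device classes, and the "thing" class is assembled by two employees, which is exactly the pattern that a single foreign key column on either table could not capture.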

Let's look at the first few and last few rows of the `employees_assembles_devices.csv` file.

In [15]:
df_ead.head()

Unnamed: 0,assembly_id,device_id,employee_id
0,1,2,4
1,2,1,2
2,3,2,2
3,4,4,5
4,5,2,2


In [16]:
df_ead.tail()

Unnamed: 0,assembly_id,device_id,employee_id
45,46,3,3
46,47,1,1
47,48,3,3
48,49,4,4
49,50,4,5


The `assembly_id` looks like it increases with the rows of the table. We previously saw that `assembly_id` has 50 unique values. The data model already showed us that `assembly_id` is the PRIMARY KEY, thus we should not be surprised that it uniquely identifies each row.
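
The primary key property can be tested directly instead of being inferred from the head and tail. The sketch below uses the five rows from the `.head()` output above, typed by hand; the helper name `looks_like_primary_key` is hypothetical.

```python
import pandas as pd

# First five rows of the link table, typed by hand from the `.head()` output above.
link = pd.DataFrame({'assembly_id': [1, 2, 3, 4, 5],
                     'device_id':   [2, 1, 2, 4, 2],
                     'employee_id': [4, 2, 2, 5, 2]})

def looks_like_primary_key(df, column):
    """True when `column` has no missing values and no repeated values."""
    return bool(df[column].notna().all() and df[column].is_unique)

print(looks_like_primary_key(link, 'assembly_id'))   # True
print(looks_like_primary_key(link, 'employee_id'))   # False, employee 2 repeats
print(link['assembly_id'].is_monotonic_increasing)   # True
```

On the full table the calls would be `looks_like_primary_key(df_ead, 'assembly_id')` and `df_ead['assembly_id'].is_monotonic_increasing`.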

### Machines print jobs

In [17]:
url_mpj = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/machines_prints_jobs.csv'

df_mpj = pd.read_csv( url_mpj )

df_mpj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   job_id       49 non-null     int64
 1   machine_id   49 non-null     int64
 2   employee_id  49 non-null     int64
dtypes: int64(3)
memory usage: 1.3 KB


The `machines_prints_jobs.csv` file contains 49 rows and 3 columns. We again see the `employee_id` variable as one of the columns. The `machine_id` is also contained in this table, and so we should expect that we can relate which employee is associated with the machine printing the job. The data model provided that `job_id` is the PRIMARY KEY for this table and so we should anticipate `job_id` to have as many unique values as rows. Let's use the `.nunique()` method to check.

In [18]:
df_mpj.nunique()

job_id         49
machine_id      4
employee_id     5
dtype: int64

As shown above, there are indeed 49 unique values for `job_id`. There are 5 unique values for `employee_id`, which are shown below to be the same values as those in the `employees` table.

In [19]:
df_mpj.groupby(['employee_id']).size().reset_index(name='num_rows')

Unnamed: 0,employee_id,num_rows
0,1,8
1,2,12
2,3,5
3,4,8
4,5,16


The `machine_id` column has 4 unique values, which are shown below to be the same as the unique values of `machine_id` in the `machines` table.

In [20]:
df_mpj.groupby(['machine_id']).size().reset_index(name='num_rows')

Unnamed: 0,machine_id,num_rows
0,1,12
1,2,11
2,3,14
3,4,12


Let's take a look at the head and tail of the table to confirm that `job_id` is incrementing each row.

In [21]:
df_mpj.head()

Unnamed: 0,job_id,machine_id,employee_id
0,1,2,2
1,2,4,4
2,3,4,5
3,4,4,5
4,5,2,2


In [22]:
df_mpj.tail()

Unnamed: 0,job_id,machine_id,employee_id
44,45,3,1
45,46,2,5
46,47,1,5
47,48,2,5
48,49,3,2


### Parts in jobs

In [23]:
url_pij = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/parts_in_jobs.csv'

df_pij = pd.read_csv( url_pij )

df_pij.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   item_id  149 non-null    int64
 1   part_id  149 non-null    int64
 2   job_id   149 non-null    int64
dtypes: int64(3)
memory usage: 3.6 KB


The data model shows that the PRIMARY KEY of the `parts_in_jobs.csv` table is `item_id`. Thus, we should expect `item_id` to have as many unique values as there are rows. We know that a machine prints a job. The `parts_in_jobs.csv` table serves as the **link table** between the parts and the jobs. Each `item_id` value therefore corresponds to an *instance* of the part *entity*, that is, one specific printed part.

Let's take a look at the number of unique values per column. As we see below, there are indeed 149 unique values for `item_id`.

In [24]:
df_pij.nunique()

item_id    149
part_id      7
job_id      49
dtype: int64

The above cell shows us that there are 7 unique values for `part_id`. The cell below shows that the 7 `part_id` values are the same as those from the `parts` table.

In [25]:
df_pij.groupby(['part_id']).size().reset_index(name='num_rows')

Unnamed: 0,part_id,num_rows
0,1,15
1,2,12
2,3,24
3,4,41
4,5,35
5,6,11
6,7,11


Let's confirm that the unique values of `job_id` are the same as those in the `machines_prints_jobs.csv` table. The Python `set()` function is used to extract the unique *set* of values from the lists. As we see below, the unique `job_id` values in the `parts_in_jobs.csv` file are the same as those in the `machines_prints_jobs.csv` table.

In [26]:
set(df_mpj.job_id.to_list()) == set(df_pij.job_id.to_list())

True

### Part component of devices

In [27]:
url_pcd = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/part_componet_of_device.csv'

df_pcd = pd.read_csv( url_pcd )

In [28]:
df_pcd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   item_id      149 non-null    int64
 1   assembly_id  149 non-null    int64
dtypes: int64(2)
memory usage: 2.5 KB


The `part_component_of_device.csv` file contains 149 rows and just 2 columns. The unique values for the two columns are shown below. 

In [29]:
df_pcd.nunique()

item_id        149
assembly_id     50
dtype: int64

The `item_id` column has 149 unique values. The data model shows us that this table is linked to the `parts_in_jobs` table via `item_id`. We already saw that `item_id` uniquely identifies a specific part printed within a job, thus we should not be surprised that there are 149 unique values. Let's use the `set()` function to confirm that the unique values are indeed the same between the two tables.

In [30]:
set( df_pij.item_id.to_list() ) == set( df_pcd.item_id.to_list() )

True

The `assembly_id` column has 50 unique values. The data model shows us that `assembly_id` describes a specific assembled device. It has the same number of unique values as there are rows in the `employees_assembles_devices.csv` table. Let's confirm the unique values of `assembly_id` are the same between the two tables.

In [31]:
set( df_ead.assembly_id.to_list() ) == set( df_pcd.assembly_id.to_list() )

True

## Import data into Neo4j

### Nodes and Labels

Now that we have inspected each of the tables, let's go ahead and start importing the data into Neo4j. After creating the project and local DBMS, the first set of information that we should load in are the nodes with appropriate labels. 

Nodes correspond to rows in tables, and so we will begin by importing the rows of the 4 "noun"-only tables. The table names will be converted to Labels in Neo4j. Thus, we will import the Machine, Employee, Device, and Part Labels. The columns of those tables will become properties of the nodes.

After adding the 4 "major" labels, we next need to add the Job label. Nodes with the Job label will only have the `job_id` column assigned to a property. The other columns in the `machines_prints_jobs.csv` and `parts_in_jobs.csv` tables are not stored as Job properties, since those columns will be used to define relationships to other nodes.

Next, the Item label is added by reading in the `parts_in_jobs.csv` table with the `item_id` column set to a property. No other properties are set for these nodes in our present application. The other columns of the `parts_in_jobs.csv` table will be used to define relationships.

Lastly, the Assembly label is added by reading in the `employees_assembles_devices.csv` table. The `assembly_id` column is set to a property.
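
The node imports described above can be sketched as Cypher `LOAD CSV` statements. To stay in the notebook's language, the sketch below builds the statements as Python strings. Several things here are assumptions rather than the official solution: the helper name `node_import_cypher`, the use of `MERGE` rather than `CREATE`, the property names (taken to be the CSV column names), and the choice of which file each label is read from for Job. `LOAD CSV` reads every field as a string, so the ID columns are converted with `toInteger`.

```python
BASE_URL = 'https://raw.githubusercontent.com/jyurko/CMPINF_2110_Spring_2021_data/main/hw06/'

# (label, csv file, id column, other property columns); label names follow the text above.
NODE_SPECS = [
    ('Device',   'devices.csv',   'device_id',   ['class']),
    ('Employee', 'employees.csv', 'employee_id', ['name', 'started']),
    ('Machine',  'machines.csv',  'machine_id',  ['name', 'type']),
    ('Part',     'parts.csv',     'part_id',     ['type']),
    ('Job',      'machines_prints_jobs.csv',        'job_id',      []),
    ('Item',     'parts_in_jobs.csv',               'item_id',     []),
    ('Assembly', 'employees_assembles_devices.csv', 'assembly_id', []),
]

def node_import_cypher(label, filename, id_col, prop_cols):
    """Build one Cypher statement that creates the nodes of a single label."""
    lines = [
        f"LOAD CSV WITH HEADERS FROM '{BASE_URL}{filename}' AS row",
        f"MERGE (n:{label} {{{id_col}: toInteger(row.{id_col})}})",
    ]
    if prop_cols:
        # Non-ID properties are stored as the strings that LOAD CSV returns.
        sets = ', '.join(f"n.{col} = row.{col}" for col in prop_cols)
        lines.append(f"SET {sets}")
    return '\n'.join(lines) + ';'

for spec in NODE_SPECS:
    print(node_import_cypher(*spec), end='\n\n')
```

Using `MERGE` on the ID property keeps the import idempotent, so re-running a statement does not duplicate nodes.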

### Relationship types

After creating all nodes, we can create the relationships between them. 

We will start out by creating the relationships between the Employees and the devices they assembled. The `employee_id` column in the `employees_assembles_devices.csv` table gives the employee that assembled each unique `assembly_id`. We will refer to this as the **ASSEMBLES** relationship type.

Next, we will create the relationship between the Employees and the jobs they operated. The `employee_id` column in the `machines_prints_jobs.csv` table gives the employee that operated the machine for each unique `job_id`. We will refer to this as the **OPERATES** relationship type. Homework 05 focused on the employee-to-machine relationship directly, but was a smaller introductory example. With a larger data set in the present assignment, we will capture the employee-to-machine relationship via the employee-to-job and machine-to-job relationships.

The `machines_prints_jobs.csv` table is used a second time to create the **PRINTS** relationship type between the Machine and Job nodes. 

The `parts_in_jobs.csv` table is used to associate the Items with a Job. We will define this as the **IS_IN** relationship type. This way we know which specific part (as described by the `item_id`) is printed *in* a specific `job_id`.

The `part_component_of_device.csv` table is used to create the **IS_COMPONENT_OF** relationship type between the Item and Assembly nodes. This relationship tells us which specific items are used as components (think pieces or lego bricks) of assembled devices. The items are therefore the building blocks of the devices!

The last two relationships provide the association of the specific Item to the *type* of part and the specific Assembly to the *class* of device. The `parts_in_jobs.csv` table tells us which `part_id` is associated with each `item_id`. We will use that table to create the **IS_TYPE_OF** relationship type. The `employees_assembles_devices.csv` table tells us which `device_id` each `assembly_id` is associated with. Thus, we will use the `employees_assembles_devices.csv` table to create the **IS_CLASS_OF** relationship type.
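
One way to sanity check the import is to tabulate how many relationships of each type the tables should produce. The sketch below uses the row counts reported by `.info()` earlier in this notebook. It assumes that each table row yields exactly one relationship per type, which holds here because `assembly_id`, `job_id`, and `item_id` are unique within their source tables. The relationship directions are an assumption based on the verb phrasing and are not confirmed by the text.

```python
import pandas as pd

# Row counts reported by `.info()` earlier in this notebook.
ROW_COUNTS = {
    'employees_assembles_devices': 50,
    'machines_prints_jobs': 49,
    'parts_in_jobs': 149,
    'part_component_of_device': 149,
}

# (relationship type, start label, end label, source table)
REL_SPECS = [
    ('ASSEMBLES',       'Employee', 'Assembly', 'employees_assembles_devices'),
    ('OPERATES',        'Employee', 'Job',      'machines_prints_jobs'),
    ('PRINTS',          'Machine',  'Job',      'machines_prints_jobs'),
    ('IS_IN',           'Item',     'Job',      'parts_in_jobs'),
    ('IS_COMPONENT_OF', 'Item',     'Assembly', 'part_component_of_device'),
    ('IS_TYPE_OF',      'Item',     'Part',     'parts_in_jobs'),
    ('IS_CLASS_OF',     'Assembly', 'Device',   'employees_assembles_devices'),
]

rels = pd.DataFrame(REL_SPECS, columns=['rel_type', 'start_label', 'end_label', 'table'])
rels['expected_count'] = rels['table'].map(ROW_COUNTS)

print(rels)
print(rels['expected_count'].sum())  # 645
```

These expected counts can then be compared against a per-type count in Neo4j, for example with a query along the lines of `MATCH ()-[r]->() RETURN type(r), count(*)`.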

## Check required queries

Although you were not required to do this, we can double check any query we make in Neo4j with Pandas. This can make sure the graph structure is "behaving" as we expect, given our relational model.

### Query all nodes related to `job_id = 3`

We first need to join the appropriate dataframes together.

In [32]:
q01 = df_pij.merge( df_mpj, on='job_id', how='left' ).\
merge( df_machines.rename(columns={'name': 'machine_name', 'type': 'machine_type'}), on='machine_id', how='left').\
merge( df_employees.rename(columns={'name': 'employee_name'}), on='employee_id', how='left').\
merge( df_parts.loc[:, ['part_id', 'type']], on='part_id', how='left')

In [33]:
q01

Unnamed: 0,item_id,part_id,job_id,machine_id,employee_id,machine_name,machine_type,employee_name,started,type
0,1,4,1,2,2,bravo,printer,Bob,2018,sprocket
1,2,4,1,2,2,bravo,printer,Bob,2018,sprocket
2,3,1,2,4,4,delta,printer,Dave,2019,widget
3,4,3,2,4,4,delta,printer,Dave,2019,gadget
4,5,2,3,4,5,delta,printer,Emily,2011,gizmo
...,...,...,...,...,...,...,...,...,...,...
144,145,5,48,2,5,bravo,printer,Emily,2011,button
145,146,3,48,2,5,bravo,printer,Emily,2011,gadget
146,147,5,49,3,2,charlie,printer,Bob,2018,button
147,148,5,49,3,2,charlie,printer,Bob,2018,button


Next, let's focus just on those rows associated with `job_id = 3`.

In [34]:
q01.loc[ q01.job_id == 3, :]

Unnamed: 0,item_id,part_id,job_id,machine_id,employee_id,machine_name,machine_type,employee_name,started,type
4,5,2,3,4,5,delta,printer,Emily,2011,gizmo
5,6,4,3,4,5,delta,printer,Emily,2011,sprocket


The columns that correspond to the nodes directly related to the Job node in Neo4j are:

In [35]:
q01.loc[ q01.job_id == 3, ['job_id', 'employee_name', 'machine_name', 'item_id']]

Unnamed: 0,job_id,employee_name,machine_name,item_id
4,3,Emily,delta,5
5,3,Emily,delta,6


### Query the number of unique parts printed for `job_id = 3`

Check that the number of items is the same as the number of unique part types.

In [36]:
q01.loc[ q01.job_id == 3, ['job_id', 'employee_name', 'machine_name', 'item_id', 'type']]

Unnamed: 0,job_id,employee_name,machine_name,item_id,type
4,3,Emily,delta,5,gizmo
5,3,Emily,delta,6,sprocket
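
Rather than comparing the rows by eye, the two counts can be computed explicitly. This is a minimal sketch using the two `job_id = 3` rows shown above, typed by hand in place of `q01`.

```python
import pandas as pd

# The two rows for job_id = 3, typed by hand from the output above.
job3 = pd.DataFrame({'job_id': [3, 3],
                     'item_id': [5, 6],
                     'type': ['gizmo', 'sprocket']})

# Count distinct items and distinct part types for the job.
summary = (job3
           .groupby('job_id')
           .agg(num_items=('item_id', 'nunique'),
                num_part_types=('type', 'nunique'))
           .reset_index())
print(summary)
```

Both counts are 2 for this job, so each item printed in job 3 is a different type of part.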


### Query all nodes related to Employee Alice

The `machines_prints_jobs.csv` table gives us which Job each Employee operates. Let's join that table with the `employees.csv` table and then isolate all rows associated with Alice.

In [37]:
q03_a = df_mpj.merge( df_employees, on='employee_id', how='left' )

In [38]:
q03_a.loc[ q03_a.name == 'Alice', :]

Unnamed: 0,job_id,machine_id,employee_id,name,started
13,14,4,1,Alice,2017
21,22,3,1,Alice,2017
22,23,3,1,Alice,2017
26,27,3,1,Alice,2017
34,35,1,1,Alice,2017
37,38,2,1,Alice,2017
41,42,2,1,Alice,2017
44,45,3,1,Alice,2017


In [39]:
q03_a.loc[ q03_a.name == 'Alice', :].shape[0]

8

The `employees_assembles_devices.csv` table gives us which Assembly each Employee assembled (put together). Join the `employees.csv` table and then isolate all rows associated with Alice.

In [40]:
q03_b = df_ead.merge( df_employees, on='employee_id', how='left' )

In [41]:
q03_b.loc[ q03_b.name == 'Alice', :]

Unnamed: 0,assembly_id,device_id,employee_id,name,started
10,11,2,1,Alice,2017
11,12,2,1,Alice,2017
19,20,2,1,Alice,2017
20,21,4,1,Alice,2017
25,26,1,1,Alice,2017
26,27,1,1,Alice,2017
41,42,2,1,Alice,2017
46,47,1,1,Alice,2017


In [42]:
q03_b.loc[ q03_b.name == 'Alice', :].shape[0]

8
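
The two results can be combined into a single count of nodes related to Alice. The sketch below types the `job_id` and `assembly_id` values by hand from the outputs above. It assumes that Employee nodes are connected only to Job and Assembly nodes, which matches the OPERATES and ASSEMBLES relationship types described earlier.

```python
import pandas as pd

# IDs typed by hand from the two Alice outputs above.
alice_jobs = pd.Series([14, 22, 23, 27, 35, 38, 42, 45], name='job_id')
alice_assemblies = pd.Series([11, 12, 20, 21, 26, 27, 42, 47], name='assembly_id')

# Keep the label next to each ID: job 27 and assembly 27 are different nodes.
neighbors = pd.concat([
    pd.DataFrame({'label': 'Job', 'node_id': alice_jobs}),
    pd.DataFrame({'label': 'Assembly', 'node_id': alice_assemblies}),
], ignore_index=True)

counts = neighbors.groupby('label')['node_id'].nunique()
total = int(counts.sum())
print(counts)
print(total)  # 16
```

Under that assumption, a Neo4j query for all nodes directly related to Alice should return 16 nodes: 8 Job nodes and 8 Assembly nodes.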

### Count the number of devices assembled by Chuck

In [43]:
q03_b.loc[ q03_b.name == 'Chuck', :].shape[0]

13

In [44]:
q03_b.groupby(['name']).\
aggregate(num_assemblies = ('assembly_id', 'nunique')).\
reset_index()

Unnamed: 0,name,num_assemblies
0,Alice,8
1,Bob,10
2,Chuck,13
3,Dave,8
4,Emily,11


Check the number of unique device classes assembled by Chuck.

In [45]:
q03_b.groupby(['name']).\
aggregate(num_assemblies = ('assembly_id', 'nunique'),
          num_device_classes = ('device_id', 'nunique')).\
reset_index()

Unnamed: 0,name,num_assemblies,num_device_classes
0,Alice,8,3
1,Bob,10,4
2,Chuck,13,4
3,Dave,8,4
4,Emily,11,4


Or starting from the "base" tables, the grouping and aggregation steps are:

In [46]:
df_ead.merge( df_employees, on='employee_id', how='left' ).\
groupby(['name']).\
aggregate(num_assemblies = ('assembly_id', 'nunique'),
          num_device_classes = ('device_id', 'nunique')).\
reset_index()

Unnamed: 0,name,num_assemblies,num_device_classes
0,Alice,8,3
1,Bob,10,4
2,Chuck,13,4
3,Dave,8,4
4,Emily,11,4


### Count the number of parts printed by Machine delta

In [47]:
df_pij.merge( df_mpj, on='job_id', how='left').\
merge( df_machines.loc[:, ['machine_id', 'name']], on='machine_id', how='left').\
merge( df_parts, on='part_id', how='left').\
groupby(['name']).\
aggregate(num_jobs = ('job_id', 'nunique'),
          num_items = ('item_id', 'nunique'),
          num_part_types = ('type', 'nunique')).\
reset_index()

Unnamed: 0,name,num_jobs,num_items,num_part_types
0,alpha,12,48,7
1,bravo,11,22,5
2,charlie,14,55,7
3,delta,12,24,7
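
The table above summarizes every machine. To isolate Machine delta and cross-check the totals against the earlier row counts, a small sketch follows. The summary values are typed by hand from the output above rather than recomputed from the CSV files.

```python
import pandas as pd

# Per-machine summary, typed by hand from the output above.
summary = pd.DataFrame({'name': ['alpha', 'bravo', 'charlie', 'delta'],
                        'num_jobs': [12, 11, 14, 12],
                        'num_items': [48, 22, 55, 24],
                        'num_part_types': [7, 5, 7, 7]})

# Select the single row for Machine delta as a Series.
delta = summary.loc[summary['name'] == 'delta'].squeeze()
print(delta['num_items'])  # 24

# The per-machine totals should add up to the table sizes seen earlier.
total_jobs = int(summary['num_jobs'].sum())    # 49 rows in machines_prints_jobs.csv
total_items = int(summary['num_items'].sum())  # 149 rows in parts_in_jobs.csv
print(total_jobs, total_items)
```

Machine delta printed 24 items across 12 jobs, and the per-machine totals add up to the 49 jobs and 149 items found in the link tables.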
