
CMSI 3520 Database Systems, Fall 2021

Assignment 1122

We wrap up our tour of selected database models with graph databases, as represented by Neo4j.

This assignment continues the structure of the other mini-stack assignments. You are also to stay with the same group and dataset.

Background Reading

Theoretical/Conceptual Reading

Although graph databases are a relatively recent development, the graph data structure is well-studied. In many respects, a graph database is simply a persisted graph data structure. As such, any data structures texts on graphs and their associated algorithms would be useful for review.

Elmasri & Navathe’s NOSQL chapter 24 includes coverage of graph databases, also focusing on Neo4j as the reference system. If you don’t have the book, this PDF covers the chapter, with graph database and Neo4j coverage appearing near the end.

Technical/Operational Reading

Direct technical assistance for the action items in this assignment can be found primarily in the Neo4j documentation site. Documentation types range from initial Getting Started tutorial to a full-blown Operations Manual.

Separate but similar are the Neo4j Developer Guides—confusingly, these overlap in content with the official docs but are distinct from them, and they have embedded videos if that suits your learning style better.

Neo4j shares MongoDB’s terminology choice of “drivers” as the name for libraries that allow general-purpose programs to communicate with a Neo4j server. Multiple languages are supported and for our Netflix Prize graph database mini-stack case study, examples are provided for Python and JavaScript. If you choose to use Java, Neo4j also provides an “OGM” library (“Object Graph Mapper,” in the vein of “Object Relational Mapper”).
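
For a concrete feel, here is a minimal connection sketch using the official Python driver (installed with pip install neo4j). The URI, credentials, and the Movie label are assumptions; adjust them to your own setup and schema.

    # Minimal sketch: connect to a local Neo4j instance and run one query.
    # The bolt URI, credentials, and the :Movie label are assumptions.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        result = session.run("MATCH (m:Movie) RETURN m.title AS title LIMIT 5")
        for record in result:
            print(record["title"])

    driver.close()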

For Submission: Graph Database Mini-Stack

For this assignment, you are asked to build the beginnings of a graph database persistence layer for an envisioned application that is based on your chosen dataset. A working example of such a persistence layer is provided for our on-going Netflix Prize case study.

NodeFlix: netflix-practice.md, Graph Database Edition

Transfer some skills from the Netflix Prize graph database mini-stack case study: study the logical schema; study and run the given preprocessor programs and import command so that you have the dataset in graph form (giving them ample time to finish!); study and run the sample programs to see how they and their respective libraries interact with Neo4j.

Due to the specific strengths of and motivations for graph databases, we change up our Netflix portion somewhat in order to emphasize these differentiating features. However, as before, make sure to still provide the following items for each query/statement in netflix-practice.md:

  • State what the query/statement is asking or doing in English
  • Provide the Neo4j Cypher query/statement that yields those results
  • Include a screenshot of this query/statement being issued and the graph that it produces

Watch for scale: Graph databases require a lot of resources. Watch out for queries that might return too many nodes and edges—if you seem to have formulated one, find ways to restrict it either with greater specifics or through the LIMIT clause.
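
One way to stay ahead of scale (a sketch, assuming the case study’s Viewer/RATED/Movie shape) is to count how many paths a pattern matches before asking for the graph itself, and to cap the graph query with LIMIT:

    # Sketch: gauge the size of a pattern before rendering it as a graph.
    # Labels and relationship types follow the Netflix Prize case study
    # and are assumptions; substitute your own.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # How many viewer-rates-movie paths are out there?
        total = session.run(
            "MATCH (:Viewer)-[r:RATED]->(:Movie) RETURN count(r) AS total"
        ).single()["total"]
        print(f"{total} RATED relationships overall")

        # Only then pull a capped subgraph for display.
        capped = session.run(
            "MATCH (v:Viewer)-[r:RATED]->(m:Movie) RETURN v, r, m LIMIT 25"
        )
        print(f"{len(list(capped))} records returned for rendering")

    driver.close()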

  1. Create new nodes and relationships: Do some IMDB/Wikipedia research on your choice of artists (performers, directors, writers, musicians, etc.) who are affiliated with movies/shows in the Netflix dataset and connect them to those shows’ nodes with appropriate relationships. Pick a mix—around five (5) such nodes will be good, and make sure they have movies/shows in common, in different combinations. Research and define a small set of common properties for those artists, such as gender, birthdate, nationality, etc. Show the MATCH/CREATE/RETURN clauses that make these additions and a culminating query that produces a graph showing all of your additions and the movies/shows that they worked on (but no ratings—that would be too much). (A sketch illustrating this item and the next appears after this list.)
  2. Viewers who are fans: Let’s define a “fan” as someone who has rated a movie/show with a 5. Formulate a query that graphs the viewers who have given a 5 rating to the work of one of your selected artists. Make sure to return the viewers, the movies/shows that they rated, your chosen artist, and what they did in those movies/shows.
  3. Love/hate relationship: Pick two movies that are likely to have a decent overlap of viewers. Formulate a query that graphs the viewers who hated one movie (rated it a 1) but loved the other (rated it a 5).
  4. Watch party 1: Define a set of criteria that filters out a small subset of movies/shows (no more than 3 to be safe). Formulate a query that produces a graph showing viewers who rated those movies/shows on the same day.
  5. Watch party 2: Define a set of criteria that filters out a small subset of the artists that you’ve loaded into Neo4j. Formulate a query that produces a graph showing viewers who rated a movie/show on the same day, for movies/shows that your chosen artists worked on. Make sure to return the viewers, the movies/shows that they rated, the chosen artists, and what they did in those movies/shows.
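
As a starting point for items 1 and 2, here is a minimal sketch in the spirit of the case study. The Artist label, its properties, the ACTED_IN relationship, the Movie title/RATED rating properties, and the specific names are all illustrative assumptions; model them however your research dictates.

    # Sketch for items 1 and 2 above. Every label, property, relationship
    # type, and name here is an illustrative assumption, not the case
    # study's actual schema.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    CREATE_ARTIST = """
    MATCH (m:Movie {title: $title})
    CREATE (a:Artist {name: $name, gender: $gender, birthdate: date($birthdate)})
    CREATE (a)-[:ACTED_IN]->(m)
    RETURN a, m
    """

    FANS_OF_ARTIST = """
    MATCH (a:Artist {name: $name})-[worked]->(m:Movie)<-[r:RATED]-(v:Viewer)
    WHERE r.rating = 5
    RETURN a, worked, m, r, v
    """

    with driver.session() as session:
        # Item 1: add one artist and connect them to an existing movie node.
        session.run(CREATE_ARTIST, title="The Matrix", name="Keanu Reeves",
                    gender="male", birthdate="1964-09-02")
        # Item 2: viewers who gave a 5 to that artist's work.
        for record in session.run(FANS_OF_ARTIST, name="Keanu Reeves"):
            print(record["v"], record["m"])

    driver.close()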

For each of these queries, find ways to double-check your work—are there ways to run other queries that will help you verify whether you are really getting the results you’ve requested? It’s useful to do this at first while you’re still getting the hang of Cypher.
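
For example (still assuming the illustrative schema above), a complementary aggregate can confirm that the fan query is pulling in the number of viewers you expect:

    # Sketch: cross-check the "fans" graph query with a plain count.
    # The schema and artist name are the same illustrative assumptions
    # used in the sketch above.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        check = session.run(
            """
            MATCH (a:Artist {name: $name})-->(m:Movie)<-[r:RATED]-(v:Viewer)
            WHERE r.rating = 5
            RETURN count(DISTINCT v) AS fans
            """,
            name="Keanu Reeves",
        ).single()
        print(check["fans"], "distinct viewers gave a 5 to this artist's work")

    driver.close()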

Just .gitignore It

Because this is your third go with the same dataset, we don’t need about.md for this assignment. Just edit the .gitignore file again so that it makes your repository ignore your chosen dataset’s files.

Rock the Graph-bah: Schema, Preprocessors, Headers, and Commentary

What doesn’t change from before is the need to populate your database with your dataset:

  1. Determine an appropriate logical schema for the dataset—because this is a graph database, take the opportunity to rethink the structure of your data in a way that highlights relationships and connections within it
  2. Put that design in writing by providing a diagram of that schema—follow the rounded-rectangle notation used by Neo4j: submit this as schema.* in some standard image format
  3. Write one or more programs and header files that will populate the target database with the dataset using neo4j-admin import: submit these as preprocess* and *-header.csv files (a sketch of such a preprocessor appears after this list)
  4. In a Markdown file called design.md, provide commentary on your logical schema design choices with an embedded image or link to your schema diagram; in addition, provide the command sequence for loading up your dataset, with explanatory remarks as needed
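
To illustrate item 3, here is a minimal preprocessor sketch in the spirit of the case study: it writes a node CSV plus the separate header file that neo4j-admin import expects. The input file name and column layout (movie_titles.csv with id, year, title columns) and the Movie label are assumptions; adapt them to your own dataset.

    # Sketch: turn a raw movie list into a node CSV plus the header file
    # that neo4j-admin import expects. The input file's name, layout, and
    # encoding are assumptions; adapt them to your dataset.
    import csv

    with open("movies-header.csv", "w", newline="") as header_file:
        # :ID marks the unique node identifier, :LABEL the node label,
        # and year:int asks the importer to store the year as an integer.
        header_file.write("movieId:ID,title,year:int,:LABEL\n")

    with open("movie_titles.csv", encoding="latin-1") as source, \
         open("movies.csv", "w", newline="") as destination:
        writer = csv.writer(destination)
        for line in source:
            movie_id, year, title = line.rstrip("\n").split(",", 2)
            writer.writerow([movie_id, title, year if year != "NULL" else "", "Movie"])

    # These files can then be handed to the importer with something like
    # (exact flags vary by Neo4j version):
    #
    #     neo4j-admin import --nodes=movies-header.csv,movies.csv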

Dance the Graphy Q: queries.md

Show off your ability to derive graphs from your database by writing the following Cypher queries. For each query, use the format given in the NodeFlix section where you:

  • State what the query/statement is asking or doing in English
  • Provide the Neo4j Cypher query/statement that yields those results
  • Include a screenshot of this query/statement being issued and the graph that it produces

Submit these in a Markdown file called queries.md. All queries should be domain-appropriate—i.e., they should make sense for an application that is trying to do real-world work with your adopted dataset:

  1. A query that matches a meaningful subgraph in your dataset
  2. Another such query, involving a different set of nodes, properties, and relationships
  3. A query that matches a meaningful subgraph then optionally matches more relationships/nodes (i.e., the query returns all nodes in the first subgraph even if they don’t match the second pattern); a sketch of this and query type 5 appears after this list
  4. An overall aggregate query that provides counts or other aggregate computations for an overall set of pattern-matched nodes or edges (this one will not return a graph)
  5. A grouped aggregate query that provides counts or other aggregate computations for groupings derived from pattern-matched nodes or edges (this one will not return a graph)
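
To show the shape of query types 3 and 5 (again leaning on the case study’s schema purely as an assumption; your own labels, properties, and relationships will differ):

    # Sketch of query types 3 and 5 above, issued through the Python driver.
    # The Viewer/RATED/Movie shape and the year/rating properties are
    # assumptions borrowed from the Netflix Prize case study.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Type 3: match a subgraph, then OPTIONAL MATCH so that movies without
    # any 5-star ratings still appear in the returned graph.
    OPTIONAL_MATCH_QUERY = """
    MATCH (m:Movie)
    WHERE m.year = 1999
    OPTIONAL MATCH (m)<-[r:RATED {rating: 5}]-(v:Viewer)
    RETURN m, r, v
    LIMIT 50
    """

    # Type 5: grouped aggregate (no graph returned): average rating and
    # rating count per movie.
    GROUPED_AGGREGATE_QUERY = """
    MATCH (m:Movie)<-[r:RATED]-(:Viewer)
    RETURN m.title AS title, avg(r.rating) AS average, count(r) AS ratings
    ORDER BY average DESC
    LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(GROUPED_AGGREGATE_QUERY):
            print(record["title"], record["average"], record["ratings"])

    driver.close()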

Five (5)-member groups are asked to do two (2) additional queries: one (1) more of any of query types 1–3 and one (1) more of any of query types 4–5, for the same total number of points overall.

If inspiration strikes you, don’t stop at just these five (5) queries. The more practice you get with Cypher, the better. The five that are given are only meant to provide the base coverage for this assignment.

Connect the DAL: dal.*

As with the other mini-stack assignments, we would like the beginnings of a graph database DAL. Once more, you may choose the programming language for this code—the only requirement is that a Neo4j “driver” exists in that language. The Netflix Prize example again provides its own netflix-dal that you can use as a reference (a minimal sketch also appears at the end of this section):

  • Appropriate configuration and connection setup code
  • Model objects and other definitions, as applicable (specifics will vary based on the language and database connection library)
  • One (1) domain-appropriate retrieval function that, given some set of arguments, will return a graph matching those arguments—you may adapt one of the queries you wrote in Dance the Graphy Q for this—pick some aspect of that query that would make sense as parameters so that the same function can be used for multiple queries of the same type
  • One (1) domain-appropriate “CUD” function (create, update, or delete) that modifies the database’s overall graph, given some set of arguments

Five (5)-member groups are asked to do one additional “CUD” function, for the same total number of points overall.

One disadvantage for a graph database here is that its results only truly come across in a graph rendering, which is highly infeasible for a command-line program. You aren’t required to go that far in your demo programs, but make sure that your functions’ return values still contain enough information to produce a graph rendering. As long as your functions return collections of Neo4j’s “record” objects, a graph can still be constructed given the right front end.
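
Pulling the bullets above together, a minimal DAL sketch might look like the following. The module layout, environment variable names, schema, and function names are all assumptions and not the case study’s actual netflix-dal; the point is only the separation of configuration, one parameterized retrieval function, and one “CUD” function.

    # dal.py (sketch): configuration, one retrieval function, one "CUD"
    # function. Environment variable names, labels, properties, and
    # relationship types are assumptions, not the case study's schema.
    import os
    from neo4j import GraphDatabase

    NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
    NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
    NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "password")

    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

    def fans_of_artist(name, rating=5, limit=50):
        """Retrieval: records (nodes and relationships) for viewers who gave
        `rating` to movies/shows that the named artist worked on."""
        query = """
        MATCH (a:Artist {name: $name})-[worked]->(m:Movie)<-[r:RATED]-(v:Viewer)
        WHERE r.rating = $rating
        RETURN a, worked, m, r, v
        LIMIT $limit
        """
        with driver.session() as session:
            return list(session.run(query, name=name, rating=rating, limit=limit))

    def add_artist(name, properties, movie_title, relationship="ACTED_IN"):
        """CUD: create an artist node and connect it to an existing movie."""
        # Cypher cannot parameterize relationship types, so the type is
        # interpolated; validate it against a known list in real code.
        query = f"""
        MATCH (m:Movie {{title: $movie_title}})
        CREATE (a:Artist {{name: $name}})
        SET a += $properties
        CREATE (a)-[:{relationship}]->(m)
        RETURN a, m
        """
        with driver.session() as session:
            return list(session.run(query, name=name, properties=properties,
                                    movie_title=movie_title))

    def close():
        driver.close()

Returning the raw driver records, rather than just scalar values, keeps enough structure for a front end to reconstruct the graph later, per the note above.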

(:Program)-[:CALLS]->(:Dal)

Write one (presumably short) program apiece that calls the retrieval and “CUD” functions, respectively. These programs’ primary jobs (sketched at the end of this section) would be:

  • Provide help on how to use the program
  • Check program arguments for validity
  • Call the underlying DAL function with those arguments
  • Report any errors that may have occurred

As a natural consequence of having three (3) DAL functions instead of two (2), five (5)-member groups will end up doing an additional DAL-calling program, for the same total number of points overall.

As mentioned in the DAL instructions, it isn’t very feasible to expect a command-line program to provide a graph rendering (though it isn’t outright impossible; just…a helluva lot of work). Still, try to keep your output readable and clear. If you can express some of the graph aspects of the return value (e.g., listing connected nodes as indented beneath another node), feel free to give it a shot.
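
A sketch of one such calling program, assuming the hypothetical dal.py sketched in the previous section (the argument handling is deliberately bare-bones, and the viewer id property is another assumption):

    # fans.py (sketch): a thin command-line wrapper around the hypothetical
    # dal.fans_of_artist function from the previous section's sketch.
    import sys
    import dal

    USAGE = "Usage: python fans.py <artist name> [rating]"

    def main():
        if len(sys.argv) < 2:
            print(USAGE)
            sys.exit(1)

        name = sys.argv[1]
        try:
            rating = int(sys.argv[2]) if len(sys.argv) > 2 else 5
        except ValueError:
            print("rating must be an integer from 1 to 5")
            sys.exit(1)

        try:
            records = dal.fans_of_artist(name, rating=rating)
        except Exception as error:  # report driver/query errors plainly
            print(f"Query failed: {error}")
            sys.exit(1)
        finally:
            dal.close()

        # Express a little of the graph structure in plain text: movies
        # indented beneath the artist, viewers beneath each movie.
        movies = {}
        for record in records:
            title = record["m"]["title"]
            movies.setdefault(title, set()).add(record["v"]["id"])
        print(name)
        for title, viewers in sorted(movies.items()):
            print(f"  {title}")
            for viewer in sorted(viewers):
                print(f"    viewer {viewer}")

    if __name__ == "__main__":
        main()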

Operational Directives/Suggestions

The same notes and suggestions remain from before:

  • Make sure to divide the implementation work relatively evenly within your group. Most groups have four (4) members and there is plenty of work to spread around. Let each member “run point” on some set of tasks so that someone is on top of things but of course allow yourselves to help each other.
  • Once more, do not commit dataset files to the repository—they may be too large for that. Provide links instead. Edit .gitignore to avoid accidental commits.
  • Not everyone’s computer might have enough storage or other capacity—AWS is an option but watch your credits; or, designate someone as the “host” for doing work and find ways to collaborate over a screenshare and (friendly) remote control of a classmate’s screen.

How to Turn it In

Commit everything to GitHub. Reiterating the deliverables, they are:

  • netflix-practice.md
  • An updated .gitignore
  • schema.* (your logical schema diagram)
  • preprocess* program(s) and *-header.csv files
  • design.md
  • queries.md
  • dal.* (your DAL module)
  • Programs that call the DAL functions

Review the instructions in the deliverables’ respective sections to see what goes in them.

Specific Point Allocations

This assignment is scored according to outcomes 1a, 1d, 3a–3d, and 4a–4f in the syllabus. For this particular assignment, graded categories are as follows:

Category | Points | Outcomes
---------|--------|---------
netflix-practice.md correctly implements the requested operations | 5 points each, 25 points total | 1a, 1d, 3a–3c, 4a–4d
• 5-member groups: 7 queries total | 3 + 3 + 3 + 4 + 4 + 4 + 4 |
.gitignore correctly prevents accidental commits of dataset files | deduction only, if missed | 4a
schema.* clearly diagrams the logical schema | 5 points | 1d, 4c
Preprocessor program(s) and *-header.csv files | 15 points | 3a, 3b, 4a–4d
design.md explains the logical schema and import approach | 5 points | 1d, 4c
queries.md correctly implements the requested operations | 5 points each, 25 points total | 1d, 3c, 4a–4d
DAL module | 19 points total | 3c, 3d, 4a–4d
• Correct, well-separated configuration and setup | 5 points |
• Domain-appropriate retrieval function | 7 points |
• Domain-appropriate “CUD” function | 7 points |
• 5-member groups: one more “CUD” function | 4 + 5 + 5 |
DAL-calling programs | 3 points each, 6 points total | 3d, 4a–4d
• 5-member groups: one more DAL-calling program | 2 + 2 + 2 |
Hard-to-maintain or error-prone code | deduction only | 4b
Hard-to-read code | deduction only | 4c
Version control | deduction only | 4e
Punctuality | deduction only | 4f
Total | 100 |

Where applicable, we reinterpret outcomes 4b and 4c in this assignment to represent the clarity, polish, and effectiveness of how you document your dataset, database, and its features, whether in written descriptions, the database diagram, or the DAL code.

Note that inability to compile and run any code to begin with will negatively affect other criteria, because if we can’t run your code (or commands), we can’t evaluate related remaining items completely.
