
CMSI 3520 Database Systems, Fall 2021

Assignment 1122

We wrap up our tour of selected database models with graph databases, as represented by Neo4j.

This assignment continues the structure of the other mini-stack assignments. You are also to stay with the same group and dataset.

Background Reading

Theoretical/Conceptual Reading

Although graph databases are a relatively recent development, the graph data structure is well-studied. In many respects, a graph database is simply a persisted graph data structure. As such, any data structures texts on graphs and their associated algorithms would be useful for review.

Elmasri & Navathe’s NOSQL chapter 24 includes coverage of graph databases, also focusing on Neo4j as the reference system. If you don’t have the book, this PDF covers the chapter, with graph database and Neo4j coverage appearing near the end.

Technical/Operational Reading

Direct technical assistance for the action items in this assignment can be found primarily in the Neo4j documentation site. Documentation types range from initial Getting Started tutorial to a full-blown Operations Manual.

Separate but similar are the Neo4j Developer Guides—confusingly, these overlap in content with the official docs but are distinct from them, and they have embedded videos if that suits your learning style better.

Neo4j shares MongoDB’s terminology choice of “drivers” as the name for libraries that allow general-purpose programs to communicate with a Neo4j server. Multiple languages are supported and for our Netflix Prize graph database mini-stack case study, examples are provided for Python and JavaScript. If you choose to use Java, Neo4j also provides an “OGM” library (“Object Graph Mapper,” in the vein of “Object Relational Mapper”).
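
For a concrete feel, here is a minimal connection sketch using the official Python driver (installed with pip install neo4j). The URI, credentials, and the Movie label are assumptions; adjust them to your own setup and schema.

    # Minimal sketch: connect to a local Neo4j instance and run one query.
    # The bolt URI, credentials, and the :Movie label are assumptions.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        result = session.run("MATCH (m:Movie) RETURN m.title AS title LIMIT 5")
        for record in result:
            print(record["title"])

    driver.close()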

For Submission: Graph Database Mini-Stack

For this assignment, you are asked to build the beginnings of a graph database persistence layer for an envisioned application that is based on your chosen dataset. A working example of such a persistence layer is provided for our on-going Netflix Prize case study.

NodeFlix: netflix-practice.md, Graph Database Edition

Transfer some skills from the Netflix Prize graph database mini-stack case study: study the logical schema; study and run the given preprocessor programs and import command so that you have the dataset in graph form (giving them ample time to finish!); study and run the sample programs to see how they and their respective libraries interact with Neo4j.

Due to the specific strengths of and motivations for graph databases, we change up our Netflix portion somewhat in order to emphasize these differentiating features. However, as before, make sure to still provide the following items for each query/statement in netflix-practice.md:

  • State what the query/statement is asking or doing in English
  • Provide the Neo4j Cypher query/statement that yields those results
  • Include a screenshot of this query/statement being issued and the graph that it produces

Watch for scale: Graph databases require a lot of resources. Watch out for queries that might return too many nodes and edges—if you seem to have formulated one, find ways to restrict it either with greater specifics or through the LIMIT clause.
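
One way to stay ahead of scale (a sketch, assuming the case study’s Viewer/RATED/Movie shape) is to count how many paths a pattern matches before asking for the graph itself, and to cap the graph query with LIMIT:

    # Sketch: gauge the size of a pattern before rendering it as a graph.
    # Labels and relationship types follow the Netflix Prize case study
    # and are assumptions; substitute your own.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # How many viewer-rates-movie paths are out there?
        total = session.run(
            "MATCH (:Viewer)-[r:RATED]->(:Movie) RETURN count(r) AS total"
        ).single()["total"]
        print(f"{total} RATED relationships overall")

        # Only then pull a capped subgraph for display.
        capped = session.run(
            "MATCH (v:Viewer)-[r:RATED]->(m:Movie) RETURN v, r, m LIMIT 25"
        )
        print(f"{len(list(capped))} records returned for rendering")

    driver.close()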

  1. Create new nodes and relationships: Do some IMDB/Wikipedia research on your choice of artists (performers, directors, writers, musicians, etc.) who are affiliated with movies/shows in the Netflix dataset and connect them to those shows’ nodes with appropriate relationships. Pick a mix—around five (5) such nodes will be good, and make sure they have movies/shows in common, in different combinations. Research and define a small set of common properties for those artists, such as gender, birthdate, nationality, etc. Show the MATCH/CREATE/RETURN clauses that make these additions and a culminating query that produces a graph showing all of your additions and the movies/shows that they worked on (but no ratings—that would be too much). (A sketch illustrating this item and the next appears after this list.)
  2. Viewers who are fans: Let’s define a “fan” as someone who has rated a movie/show with a 5. Formulate a query that graphs the viewers who have given a 5 rating to the work of one of your selected artists. Make sure to return the viewers, the movies/shows that they rated, your chosen artist, and what they did in those movies/shows.
  3. Love/hate relationship: Pick two movies that are likely to have a decent overlap of viewers. Formulate a query that graphs the viewers who hated one movie (rated it a 1) but loved the other (rated it a 5).
  4. Watch party 1: Define a set of criteria that filters out a small subset of movies/shows (no more than 3 to be safe). Formulate a query that produces a graph showing viewers who rated those movies/shows on the same day.
  5. Watch party 2: Define a set of criteria that filters out a small subset of the artists that you’ve loaded into Neo4j. Formulate a query that produces a graph showing viewers who rated a movie/show on the same day, for movies/shows that your chosen artists worked on. Make sure to return the viewers, the movies/shows that they rated, the chosen artists, and what they did in those movies/shows.
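
As a starting point for items 1 and 2, here is a minimal sketch in the spirit of the case study. The Artist label, its properties, the ACTED_IN relationship, the Movie title/RATED rating properties, and the specific names are all illustrative assumptions; model them however your research dictates.

    # Sketch for items 1 and 2 above. Every label, property, relationship
    # type, and name here is an illustrative assumption, not the case
    # study's actual schema.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    CREATE_ARTIST = """
    MATCH (m:Movie {title: $title})
    CREATE (a:Artist {name: $name, gender: $gender, birthdate: date($birthdate)})
    CREATE (a)-[:ACTED_IN]->(m)
    RETURN a, m
    """

    FANS_OF_ARTIST = """
    MATCH (a:Artist {name: $name})-[worked]->(m:Movie)<-[r:RATED]-(v:Viewer)
    WHERE r.rating = 5
    RETURN a, worked, m, r, v
    """

    with driver.session() as session:
        # Item 1: add one artist and connect them to an existing movie node.
        session.run(CREATE_ARTIST, title="The Matrix", name="Keanu Reeves",
                    gender="male", birthdate="1964-09-02")
        # Item 2: viewers who gave a 5 to that artist's work.
        for record in session.run(FANS_OF_ARTIST, name="Keanu Reeves"):
            print(record["v"], record["m"])

    driver.close()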

For each of these queries, find ways to double-check your work—are there ways to run other queries that will help you verify whether you are really getting the results you’ve requested? It’s useful to do this at first while you’re still getting the hang of Cypher.
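
For example (still assuming the illustrative schema above), a complementary aggregate can confirm that the fan query is pulling in the number of viewers you expect:

    # Sketch: cross-check the "fans" graph query with a plain count.
    # The schema and artist name are the same illustrative assumptions
    # used in the sketch above.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        check = session.run(
            """
            MATCH (a:Artist {name: $name})-->(m:Movie)<-[r:RATED]-(v:Viewer)
            WHERE r.rating = 5
            RETURN count(DISTINCT v) AS fans
            """,
            name="Keanu Reeves",
        ).single()
        print(check["fans"], "distinct viewers gave a 5 to this artist's work")

    driver.close()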

Just .gitignore It

Because this is your third go with the same dataset, we don’t need about.md for this assignment. Just edit the .gitignore file again so that it makes your repository ignore your chosen dataset’s files.

Rock the Graph-bah: Schema, Preprocessors, Headers, and Commentary

What doesn’t change from before is the need to populate your database with your dataset:

  1. Determine an appropriate logical schema for the dataset—because this is a graph database, take the opportunity to rethink the structure of your data in a way that highlights relationships and connections within it
  2. Put that design in writing by providing a diagram of that schema—follow the rounded-rectangle notation used by Neo4j: submit this as schema.* in some standard image format
  3. Write one or more programs and header files that will populate the target database with the dataset using neo4j-admin import: submit these as preprocess* and *-header.csv files (a sketch of such a preprocessor appears after this list)
  4. In a Markdown file called design.md, provide commentary on your logical schema design choices with an embedded image or link to your schema diagram; in addition, provide the command sequence for loading up your dataset, with explanatory remarks as needed
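
To illustrate item 3, here is a minimal preprocessor sketch in the spirit of the case study: it writes a node CSV plus the separate header file that neo4j-admin import expects. The input file name and column layout (movie_titles.csv with id, year, title columns) and the Movie label are assumptions; adapt them to your own dataset.

    # Sketch: turn a raw movie list into a node CSV plus the header file
    # that neo4j-admin import expects. The input file's name, layout, and
    # encoding are assumptions; adapt them to your dataset.
    import csv

    with open("movies-header.csv", "w", newline="") as header_file:
        # :ID marks the unique node identifier, :LABEL the node label,
        # and year:int asks the importer to store the year as an integer.
        header_file.write("movieId:ID,title,year:int,:LABEL\n")

    with open("movie_titles.csv", encoding="latin-1") as source, \
         open("movies.csv", "w", newline="") as destination:
        writer = csv.writer(destination)
        for line in source:
            movie_id, year, title = line.rstrip("\n").split(",", 2)
            writer.writerow([movie_id, title, year if year != "NULL" else "", "Movie"])

    # These files can then be handed to the importer with something like
    # (exact flags vary by Neo4j version):
    #
    #     neo4j-admin import --nodes=movies-header.csv,movies.csv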

Dance the Graphy Q: queries.md

Show off your ability to derive graphs from your database by writing the following Cypher queries. For each query, use the format given in the NodeFlix section where you:

  • State what the query/statement is asking or doing in English
  • Provide the Neo4j Cypher query/statement that yields those results
  • Include a screenshot of this query/statement being issued and the graph that it produces

Submit these in a Markdown file called queries.md. All queries should be domain-appropriate—i.e., they should make sense for an application that is trying to do real-world work with your adopted dataset:

  1. A query that matches a meaningful subgraph in your dataset
  2. Another such query, involving a different set of nodes, properties, and relationships
  3. A query that matches a meaningful subgraph then optionally matches more relationships/nodes (i.e., the query returns all nodes in the first subgraph even if they don’t match the second pattern); a sketch of this and query type 5 appears after this list
  4. An overall aggregate query that provides counts or other aggregate computations for an overall set of pattern-matched nodes or edges (this one will not return a graph)
  5. A grouped aggregate query that provides counts or other aggregate computations for groupings derived from pattern-matched nodes or edges (this one will not return a graph)
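
To show the shape of query types 3 and 5 (again leaning on the case study’s schema purely as an assumption; your own labels, properties, and relationships will differ):

    # Sketch of query types 3 and 5 above, issued through the Python driver.
    # The Viewer/RATED/Movie shape and the year/rating properties are
    # assumptions borrowed from the Netflix Prize case study.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Type 3: match a subgraph, then OPTIONAL MATCH so that movies without
    # any 5-star ratings still appear in the returned graph.
    OPTIONAL_MATCH_QUERY = """
    MATCH (m:Movie)
    WHERE m.year = 1999
    OPTIONAL MATCH (m)<-[r:RATED {rating: 5}]-(v:Viewer)
    RETURN m, r, v
    LIMIT 50
    """

    # Type 5: grouped aggregate (no graph returned): average rating and
    # rating count per movie.
    GROUPED_AGGREGATE_QUERY = """
    MATCH (m:Movie)<-[r:RATED]-(:Viewer)
    RETURN m.title AS title, avg(r.rating) AS average, count(r) AS ratings
    ORDER BY average DESC
    LIMIT 10
    """

    with driver.session() as session:
        for record in session.run(GROUPED_AGGREGATE_QUERY):
            print(record["title"], record["average"], record["ratings"])

    driver.close()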

Five (5)-member groups are asked to do two (2) additional queries: one (1) more of any of query types 1–3 and one (1) more of any of query types 4–5, for the same total number of points overall.

If inspiration strikes you, don’t stop at just these five (5) queries. The more practice you get with Cypher, the better. The five that are given are only meant to provide the base coverage for this assignment.

Connect the DAL: dal.*

As with the other mini-stack assignments, we would like the beginnings of a graph database DAL. Once more, you may choose the programming language for this code—the only requirement is that a Neo4j “driver” exists in that language. The Netflix Prize example again provides its own netflix-dal that you can use as a reference (a minimal sketch also appears at the end of this section):

  • Appropriate configuration and connection setup code
  • Model objects and other definitions, as applicable (specifics will vary based on the language and database connection library)
  • One (1) domain-appropriate retrieval function that, given some set of arguments, will return a graph matching those arguments—you may adapt one of the queries you wrote in Dance the Graphy Q for this—pick some aspect of that query that would make sense as parameters so that the same function can be used for multiple queries of the same type
  • One (1) domain-appropriate “CUD” function (create, update, or delete) that modifies the database’s overall graph, given some set of arguments

Five (5)-member groups are asked to do one additional “CUD” function, for the same total number of points overall.

One disadvantage for a graph database here is that its results only truly come across in a graph rendering, which is highly infeasible for a command-line program. You aren’t required to go that far in your demo programs, but make sure that your functions’ return values still contain enough information to produce a graph rendering. As long as your functions return collections of Neo4j’s “record” objects, a graph can still be constructed given the right front end.
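
Pulling the bullets above together, a minimal DAL sketch might look like the following. The module layout, environment variable names, schema, and function names are all assumptions and not the case study’s actual netflix-dal; the point is only the separation of configuration, one parameterized retrieval function, and one “CUD” function.

    # dal.py (sketch): configuration, one retrieval function, one "CUD"
    # function. Environment variable names, labels, properties, and
    # relationship types are assumptions, not the case study's schema.
    import os
    from neo4j import GraphDatabase

    NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
    NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
    NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "password")

    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

    def fans_of_artist(name, rating=5, limit=50):
        """Retrieval: records (nodes and relationships) for viewers who gave
        `rating` to movies/shows that the named artist worked on."""
        query = """
        MATCH (a:Artist {name: $name})-[worked]->(m:Movie)<-[r:RATED]-(v:Viewer)
        WHERE r.rating = $rating
        RETURN a, worked, m, r, v
        LIMIT $limit
        """
        with driver.session() as session:
            return list(session.run(query, name=name, rating=rating, limit=limit))

    def add_artist(name, properties, movie_title, relationship="ACTED_IN"):
        """CUD: create an artist node and connect it to an existing movie."""
        # Cypher cannot parameterize relationship types, so the type is
        # interpolated; validate it against a known list in real code.
        query = f"""
        MATCH (m:Movie {{title: $movie_title}})
        CREATE (a:Artist {{name: $name}})
        SET a += $properties
        CREATE (a)-[:{relationship}]->(m)
        RETURN a, m
        """
        with driver.session() as session:
            return list(session.run(query, name=name, properties=properties,
                                    movie_title=movie_title))

    def close():
        driver.close()

Returning the raw driver records, rather than just scalar values, keeps enough structure for a front end to reconstruct the graph later, per the note above.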

(:Program)-[:CALLS]->(:Dal)

Write one (presumably short) program apiece that calls the retrieval and “CUD” functions, respectively. These programs’ primary jobs (sketched at the end of this section) would be:

  • Provide help on how to use the program
  • Check program arguments for validity
  • Call the underlying DAL function with those arguments
  • Report any errors that may have occurred

As a natural consequence of having three (3) DAL functions instead of two (2), five (5)-member groups will end up doing an additional DAL-calling program, for the same total number of points overall.

As mentioned in the DAL instructions, it isn’t very feasible to expect a command-line program to provide a graph rendering (though it isn’t outright impossible; just…a helluva lot of work). Still, try to keep your output readable and clear. If you can express some of the graph aspects of the return value (e.g., listing connected nodes as indented beneath another node), feel free to give it a shot.
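
A sketch of one such calling program, assuming the hypothetical dal.py sketched in the previous section (the argument handling is deliberately bare-bones, and the viewer id property is another assumption):

    # fans.py (sketch): a thin command-line wrapper around the hypothetical
    # dal.fans_of_artist function from the previous section's sketch.
    import sys
    import dal

    USAGE = "Usage: python fans.py <artist name> [rating]"

    def main():
        if len(sys.argv) < 2:
            print(USAGE)
            sys.exit(1)

        name = sys.argv[1]
        try:
            rating = int(sys.argv[2]) if len(sys.argv) > 2 else 5
        except ValueError:
            print("rating must be an integer from 1 to 5")
            sys.exit(1)

        try:
            records = dal.fans_of_artist(name, rating=rating)
        except Exception as error:  # report driver/query errors plainly
            print(f"Query failed: {error}")
            sys.exit(1)
        finally:
            dal.close()

        # Express a little of the graph structure in plain text: movies
        # indented beneath the artist, viewers beneath each movie.
        movies = {}
        for record in records:
            title = record["m"]["title"]
            movies.setdefault(title, set()).add(record["v"]["id"])
        print(name)
        for title, viewers in sorted(movies.items()):
            print(f"  {title}")
            for viewer in sorted(viewers):
                print(f"    viewer {viewer}")

    if __name__ == "__main__":
        main()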

Operational Directives/Suggestions

The same notes and suggestions remain from before:

  • Make sure to divide the implementation work relatively evenly within your group. Most groups have four (4) members and there is plenty of work to spread around. Let each member “run point” on some set of tasks so that someone is on top of things but of course allow yourselves to help each other.
  • Once more, do not commit dataset files to the repository—they may be too large for that. Provide links instead. Edit .gitignore to avoid accidental commits.
  • Not everyone’s computer might have enough storage or other capacity—AWS is an option but watch your credits; or, designate someone as the “host” for doing work and find ways to collaborate over a screenshare and (friendly) remote control of a classmate’s screen.

How to Turn it In

Commit everything to GitHub. Reiterating the deliverables, they are:

  • netflix-practice.md
  • An updated .gitignore
  • schema.* (your logical schema diagram)
  • preprocess* program(s) and *-header.csv files
  • design.md
  • queries.md
  • dal.* (your DAL module)
  • Programs that call the DAL functions

Review the instructions in the deliverables’ respective sections to see what goes in them.

Specific Point Allocations

This assignment is scored according to outcomes 1a, 1d, 3a–3d, and 4a–4f in the syllabus. For this particular assignment, graded categories are as follows:

Category | Points | Outcomes
---------|--------|---------
netflix-practice.md correctly implements the requested operations | 5 points each, 25 points total | 1a, 1d, 3a–3c, 4a–4d
• 5-member groups: 7 queries total | 3 + 3 + 3 + 4 + 4 + 4 + 4 |
.gitignore correctly prevents accidental commits of dataset files | deduction only, if missed | 4a
schema.* clearly diagrams the logical schema | 5 points | 1d, 4c
Preprocessor program(s) and *-header.csv files | 15 points | 3a, 3b, 4a–4d
design.md explains the logical schema and import approach | 5 points | 1d, 4c
queries.md correctly implements the requested operations | 5 points each, 25 points total | 1d, 3c, 4a–4d
DAL module | 19 points total | 3c, 3d, 4a–4d
• Correct, well-separated configuration and setup | 5 points |
• Domain-appropriate retrieval function | 7 points |
• Domain-appropriate “CUD” function | 7 points |
• 5-member groups: one more “CUD” function | 4 + 5 + 5 |
DAL-calling programs | 3 points each, 6 points total | 3d, 4a–4d
• 5-member groups: one more DAL-calling program | 2 + 2 + 2 |
Hard-to-maintain or error-prone code | deduction only | 4b
Hard-to-read code | deduction only | 4c
Version control | deduction only | 4e
Punctuality | deduction only | 4f
Total | 100 |

Where applicable, we reinterpret outcomes 4b and 4c in this assignment to represent the clarity, polish, and effectiveness of how you document your dataset, database, and its features, whether in written descriptions, the database diagram, or the DAL code.

Note that inability to compile and run any code to begin with will negatively affect other criteria, because if we can’t run your code (or commands), we can’t evaluate related remaining items completely.
