
CMSI 3520 Database Systems, Fall 2022

# Assignment 1107

For this assignment, we shift our attention to one of the most prominent alternatives to the relational model—a document-centric database. You may choose between two such systems for this assignment: MongoDB or ElasticSearch.

To assist in your decision, you may want to review this video overview of database paradigms to get a quick recap of how MongoDB and ElasticSearch differ. In addition, here is an informal summary that compares the two:

| | MongoDB | ElasticSearch |
| --- | --- | --- |
| Ease of use (based on observations of prior students) | Generally easier to learn and use | Harder to learn and use |
| Performance | Slows down notably with large databases and cross-collection lookups | Built for speed as a central feature |
| Interaction | Command line: mongosh corresponds to psql | Web service only: must use a web service client like curl or Postman |
| Industry use | Some adoption; best used for rapid prototyping and smaller-scale systems | Widely used for search, often mirroring data held in other systems |
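
To make the "web service only" point concrete, here is a hedged example of what direct ElasticSearch interaction looks like; it assumes a local node on the default port and a hypothetical movies index:

```sh
# Full-text search against a hypothetical "movies" index on a local node.
curl -X GET "localhost:9200/movies/_search?q=title:godzilla&pretty"
```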

As mentioned previously, this assignment has a similar structure to all of the mini-stack assignments. You are also to stay with the same group and dataset.

## Background Reading

At this point, the presumption is that you know your chosen dataset quite well, and of course the Netflix Prize dataset continues to tag along. All reading is thus centered on what differentiates document-centric databases from the rest.

### Theoretical/Conceptual Reading

Unlike the relational database model, document-centric databases don’t have pre-established mathematical/theoretical underpinnings. There are lots of papers about MongoDB and ElasticSearch (a quick look at the ACM Digital Library bears this out) but nothing like Ted Codd’s seminal paper.

Chapter 24 of the Elmasri & Navathe textbook covers NoSQL systems in general (now written in all-caps as "NOSQL" to mean Not Only SQL rather than flat-out "no" SQL). If you don't have the book, this PDF covers the chapter, with some pages looking specifically at MongoDB.

### Technical/Operational Reading

Direct technical assistance for the action items in this assignment can be found primarily in the MongoDB and ElasticSearch documentation sites. Watch out when doing web searches—multiple documentation sites exist for the various versions of these systems, so make sure that the documentation you're reading matches the version you're using. (This was also true of PostgreSQL, but it was less of an issue there because SQL exists as a broader standard.)

MongoDB has a separate collection of Getting Started guides that summarize the fundamentals. ElasticSearch has a step-by-step Getting Started set of pages as well. An ElasticSearch video collection is also available if you prefer to learn by that medium instead.

We are also more strongly bound to libraries that specifically target these systems. MongoDB calls them "drivers." ElasticSearch calls them "clients." Whatever—they are libraries that you can use to write code that interacts with those respective systems. Pick a language, any language.

Between these and the Netflix Prize document-centric database mini-stack case study, the hope is that you have plenty of resources for getting into the system of your choice.
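
To give a feel for what these libraries look like in practice, here is a minimal connection sketch using the official MongoDB Node.js driver; the database and collection names are hypothetical stand-ins for whatever your schema defines, and a local server is assumed:

```js
// Minimal MongoDB driver sketch: connect, count, disconnect.
const { MongoClient } = require("mongodb");

async function main() {
  const client = new MongoClient("mongodb://localhost:27017");
  try {
    await client.connect();
    const db = client.db("netflix"); // hypothetical database name
    const count = await db.collection("movies").countDocuments();
    console.log(`The movies collection holds ${count} documents.`);
  } finally {
    await client.close();
  }
}

main().catch(console.error);
```

ElasticSearch clients follow the same broad shape: construct a client object pointed at your node, then call methods that correspond to the web service's endpoints.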

## For Submission: Document-Centric Database Mini-Stack

For this assignment, you are asked to build the beginnings of a document-centric database persistence layer for an envisioned application that is based on your chosen dataset. A working example of such a persistence layer is provided for our on-going Netflix Prize case study.

### DocFlix: netflix-practice.md, Document-Centric Database Edition

Transfer some skills from the Netflix Prize document-centric database mini-stack case study: study the logical schema; study and run the given loader programs so that you have the dataset in document-centric form (giving them ample time to finish!); study and run the sample programs to see how they and their respective libraries interact with MongoDB and ElasticSearch.

With that background in mind, implement the queries listed below, providing the following items for each in netflix-practice.md:

- State the criteria in English
- Provide the MongoDB or ElasticSearch query that yields those results
- Include a screenshot of this query being issued in the corresponding direct-database access utility (mongosh, curl, Postman) alongside the first few results

Precision is the thing: Due to the way document-centric databases work, it won't always be possible to enforce the same level of precision that we see in other types of databases—but we can get close enough. Make a good-faith effort at eliminating extraneous information. If the request is for "the number of reviews" that meet a certain condition, don't provide a query that lists the reviews and requires the user to look at how many were returned; the result itself must provide the number directly. Similarly, for queries that ask for the "highest" or "lowest" values of something, don't make the user look for the desired result manually—again, the query must clearly identify the requested values.
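
In MongoDB terms, for instance, this is the difference between merely listing matches and returning the count itself (the collection and field names here are hypothetical):

```js
// Lists the matching reviews and makes the user count them—avoid this:
db.reviews.find({ rating: { $gte: 4 } })

// Returns the number directly—do this instead:
db.reviews.countDocuments({ rating: { $gte: 4 } })
```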

1. Movies filtered by title and/or year: A query that retrieves the ID, year, and title of movies that fit criteria of your choosing (e.g., movies with certain titles or title patterns, movies released in one or more given years, etc.), sorted ascending by title
2. Number of movies released per year: A query that takes movie criteria of your choosing and returns a collection consisting of year and count, where count is the number of movies meeting those criteria that were released in that year, sorted ascending by year (a mongosh sketch of this query and the next appears after this list)
3. Years with the most movies released: A query that takes movie criteria of your choosing and returns the same collection as above, except that it only returns the year and count of the top five (5) years with the most movies released, sorted descending by count then ascending by year in case of a tie
4. Movies rated a certain way by a specific user: A query that lists the title and year of movies seen by a particular user with a rating matching conditions of your choosing (e.g., 4 and above, 2 and below, etc.), sorted ascending by title
5. Average rating of movies: A query that takes movie criteria of your choosing and returns a collection consisting of title, year, and avg, where avg is the average rating received by each movie, sorted descending by avg (thus listing the top-rated movie first) then ascending by title in case of a tie
6. Specific average rating of movies: A query that takes movie criteria of your choosing and returns a collection consisting of title, year, and avg, where avg is the average rating received by each movie and meets some condition of your choosing, such as average greater than 4, average less than 3, etc.—the results should be sorted descending by avg (thus listing the top-rated movie first) then ascending by title in case of a tie
7. Number of reviews received by a movie during a certain time period: A query that takes movie criteria of your choosing and returns a collection consisting of title, year, and count, where count is the number of reviews received by each movie within a particular date range of your choosing, such as after 2005, during the month of September, etc.—the results should be sorted descending by count (thus listing the most-frequently-rated movie first) then ascending by title in case of a tie
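
As a hedged illustration of queries 2 and 3, the following mongosh aggregation assumes a movies collection with title and year properties (your actual schema may differ):

```js
// Query 2: number of matching movies per year, ascending by year.
db.movies.aggregate([
  { $match: { title: /^The / } },  // movie criteria of your choosing
  { $group: { _id: "$year", count: { $sum: 1 } } },
  { $project: { _id: 0, year: "$_id", count: 1 } },
  { $sort: { year: 1 } }
])

// Query 3 replaces the final $sort and adds a $limit:
//   { $sort: { count: -1, year: 1 } },
//   { $limit: 5 }
```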

For each of these queries, find ways to double-check your work—are there ways to run other queries that will help you verify whether you are really getting the results you’ve requested? It’s useful to do this at first while you’re still getting the hang of your chosen system’s query language.

### Just .gitignore It

Because we're on the same dataset as before, we don't need about.md for this assignment. Just edit the .gitignore file again so that your repository ignores the files that comprise your chosen dataset.
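
For instance, if your dataset consists of CSV files in a data directory (hypothetical paths—adjust them to your actual file names):

```gitignore
# Keep dataset files out of the repository.
data/
*.csv
```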

### We Will, We Will Doc You: Schema and Loader Files

What doesn’t change from before is the need to populate your database with your dataset:

1. Determine an appropriate logical schema for the dataset, in line with the database that will host it—for document-centric databases this is technically not necessary, but we'll still require it because we aren't loading arbitrary data structures either
2. Put that design in writing by providing a diagram of that schema—notation is again less rigid here because of the way document-centric databases operate:
   - For MongoDB, state the chosen name for your database and the collections it will contain
   - For ElasticSearch, state the names of the indices that you will use
   - In both cases, describe the documents that will go into these collections or indices by showing their JSON structure—what are their property names? What types do they have? Are there any sub-objects or arrays?
   - Submit this as schema.pdf or schema.md, as appropriate
3. For ElasticSearch, you may need to supply mappings objects (one per index that requires mappings): if so, supply them as (index-name)-mappings.json, meant for direct PUT via curl or Postman to the intended index (a sketch of such a file appears after this list)
4. Write one or more programs that will populate the target database with the dataset: submit these as one or more loader source files
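
As a rough illustration of item 3, a mappings object for a hypothetical movies index might look like the following—every property name and type here is an assumption, to be replaced by whatever your schema actually calls for:

```json
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "year": { "type": "integer" },
      "ratings": {
        "properties": {
          "viewer_id": { "type": "keyword" },
          "rating": { "type": "byte" },
          "date_rated": { "type": "date" }
        }
      }
    }
  }
}
```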

### What's Up Doc? queries.md

Show off your ability to derive information from your database by writing the following queries. For each query, use the format given in the DocFlix section where you:

- State the criteria in English
- Provide the query that yields those results
- Include a screenshot of this query being issued in the corresponding direct-database access utility (mongosh, Kibana) alongside the first few results

Submit these in a Markdown file called queries.md. All queries should be domain-appropriate—i.e., they should make sense for an application that is trying to do real-world work with your adopted dataset:

1. A query that selects a subset of a particular entity in your dataset
2. Another such query, with a specific sort order
3. Either a sequence of queries that combines information from more than one collection (this may require some pseudocode that connects one collection to another) or a query that iterates through nested collections or sub-objects (a sketch of one approach appears after this list)
4. An aggregate query that provides counts or other aggregate computations for certain groups in your dataset
5. A ranking query that provides the "top" or "bottom" n documents based on some metric
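
For item 3, MongoDB can combine collections in a single aggregation via its $lookup stage; this sketch assumes hypothetical reviews and movies collections joined on a movie_id field:

```js
// Attach each review's movie document, then keep a few fields.
db.reviews.aggregate([
  { $lookup: {
      from: "movies",           // collection to join with
      localField: "movie_id",   // field in reviews
      foreignField: "id",       // field in movies
      as: "movie"               // resulting array of matches
  } },
  { $unwind: "$movie" },
  { $project: { rating: 1, "movie.title": 1, "movie.year": 1 } }
])
```

In ElasticSearch, which has no direct equivalent of $lookup, the "sequence of queries" option is typically the way to go.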

If inspiration strikes you, don't stop at just the required queries. The more practice you get with your chosen system's query language, the better. The given queries are only meant to provide base coverage for this assignment.

### DAL-Doc-ahedron: dal.*

As with the other mini-stack assignments, we would like the beginnings of a document-centric DAL. Once more, you may choose the programming language for this code—the only requirement is that a “driver” (MongoDB) or “client” (ElasticSearch) exists for the corresponding system in that language. The Netflix Prize example again provides its own dal that you can use as a reference:

- Appropriate configuration and connection setup code
- Model objects and other definitions, as applicable (specifics will vary based on the language and database connection library)
- One (1) domain-appropriate retrieval function that, given some set of arguments, will return corresponding data matching those arguments—you may adapt one of the queries you wrote in What's Up Doc? for this—pick some aspect of that query that would make sense as parameters so that the same function can be used for multiple queries of the same type
- One (1) domain-appropriate "CUD" function (create, update, or delete) that modifies the database's documents, given some set of arguments

A fun self-challenge here would be to see if you can implement the exact same functions with the exact same signatures and external behavior as your previous mini-stack. The benefit of doing this is that you wouldn't need to rewrite your demo applications (the next section). Because this sometimes won't be possible through no fault of your own—some definitions may be inapplicable, the technology may shift, etc.—it wouldn't be fair to offer extra credit for this; the main motivation is simply not having to write new demo application code.
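
To make the retrieval-function idea concrete, here is a rough Node.js sketch; the function name, parameters, and collection layout are all hypothetical, not the case study's actual dal:

```js
// dal.js (sketch): a parameterized, domain-appropriate retrieval function.
const { MongoClient } = require("mongodb");

const client = new MongoClient(process.env.DB_URL ?? "mongodb://localhost:27017");

// Movies from a given year whose average rating meets a minimum—
// the parameters let one function serve many queries of the same type.
async function topRatedMoviesForYear(year, minimumAverage) {
  await client.connect(); // a no-op if already connected
  return client.db("netflix").collection("movies").aggregate([
    { $match: { year } },
    { $lookup: { from: "reviews", localField: "id", foreignField: "movie_id", as: "reviews" } },
    { $project: { title: 1, year: 1, avg: { $avg: "$reviews.rating" } } },
    { $match: { avg: { $gte: minimumAverage } } },
    { $sort: { avg: -1, title: 1 } }
  ]).toArray();
}

module.exports = { client, topRatedMoviesForYear };
```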

### DAL Call

Write one (presumably short) program apiece that calls the retrieval and "CUD" functions, respectively. These programs' primary jobs, sketched after the list below, would be:

- Provide help on how to use the program
- Check program arguments for validity
- Call the underlying DAL function with those arguments
- Report any errors that may have occurred
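
Continuing the hypothetical DAL sketch from above (the file and function names are invented for illustration):

```js
// get-top-rated.js (sketch): a thin command-line wrapper around the DAL.
const { client, topRatedMoviesForYear } = require("./dal");

async function main() {
  const [year, minimumAverage] = process.argv.slice(2).map(Number);
  if (!Number.isFinite(year) || !Number.isFinite(minimumAverage)) {
    console.error("Usage: node get-top-rated.js <year> <minimum-average>");
    process.exitCode = 1;
    return;
  }
  try {
    const movies = await topRatedMoviesForYear(year, minimumAverage);
    movies.forEach(movie => console.log(`${movie.title} (${movie.year}): ${movie.avg}`));
  } catch (error) {
    console.error(`Query failed: ${error.message}`);
    process.exitCode = 1;
  } finally {
    await client.close();
  }
}

main();
```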

As a natural consequence of having three (3) DAL functions instead of two (2), five (5)-member groups will end up doing an additional DAL-calling program, for the same total number of points overall.

As mentioned in the DAL instructions, see if you can find a way to define your DAL so that the programs you submitted previously will work without modification. The reward would be that you don’t have to redo these programs! (but don’t worry about it if this isn’t feasible)

## Operational Directives/Suggestions

The same notes and suggestions remain from before:

- Make sure to divide the implementation work relatively evenly within your group. Most groups have four (4) members and there is plenty of work to spread around. Let each member "run point" on some set of tasks so that someone is on top of things, but of course allow yourselves to help each other.
- Once more, do not commit dataset files to the repository—they may be too large for that. Provide links instead. Edit .gitignore to avoid accidental commits.
- Not everyone's computer may have enough storage or other capacity for these systems. AWS is an option, but watch your credits; alternatively, designate someone as the "host" for doing work and find ways to collaborate over a screenshare with (friendly) remote control of a classmate's screen.

## How to Turn it In

Commit everything to GitHub. Reiterating the deliverables, they are:

- netflix-practice.md
- An updated .gitignore
- schema.pdf or schema.md, plus (index-name)-mappings.json files if applicable
- Loader program(s)
- queries.md
- DAL module (dal.*)
- DAL-calling programs

Review the instructions in the deliverables' respective sections to see what goes in them.

## Specific Point Allocations

This assignment is scored according to outcomes 1a, 1d, 3a–3d, and 4a–4f in the syllabus. For this particular assignment, graded categories are as follows:

| Category | Points | Outcomes |
| --- | --- | --- |
| netflix-practice.md correctly implements the requested operations | 4 points each, 28 points total | 1a, 1d, 3a–3c, 4a–4d |
| .gitignore correctly prevents accidental commits of dataset files | deduction only, if missed | 4a |
| schema.pdf/schema.md and (index-name)-mappings.json (if applicable) | 5 points | 1d, 4c |
| Loader program(s) | 15 points | 3a, 3b, 4a–4d |
| queries.md correctly implements the requested operations | 5 points each, 25 points total | 1d, 3c, 4a–4d |
| DAL module | 21 points total | 3c, 3d, 4a–4d |
| • Correct, well-separated configuration and setup | 7 points | |
| • Domain-appropriate retrieval function | 7 points | |
| • Domain-appropriate "CUD" function | 7 points | |
| DAL-calling programs | 3 points each, 6 points total | 3d, 4a–4d |
| Hard-to-maintain or error-prone code | deduction only | 4b |
| Hard-to-read code | deduction only | 4c |
| Version control | deduction only | 4e |
| Punctuality | deduction only | 4f |
| Total | 100 | |

Where applicable, we reinterpret outcomes 4b and 4c in this assignment to represent the clarity, polish, and effectiveness of how you document your dataset, database, and its features, whether in written descriptions, the database diagram, or the DAL code.

Note that inability to compile and run any code to begin with will negatively affect other criteria, because if we can’t run your code (or commands), we can’t evaluate related remaining items completely.