CMSI 3520 Database Systems, Fall 2021

Assignment 1215

So it all comes to this—after implementing multiple mini-stacks for the same hypothetical application across different database paradigms, it’s time to pick the one that you feel can expand best into a full database software development kit (SDK) for that application. Which paradigm will you choose? That’s up to you. What application will you envision? That’s up to you also, though presumably you already have a fair idea for what that will be, given that you have already written multiple queries for it and some simple command-line programs that preview its functionality.

Whichever way you go, you will bootstrap this application with data from your chosen dataset (which you have presumably already done in a previous assignment, but this one gives you an opportunity to improve on that). But you will also make sure that the application is fully capable of adding, updating, and deleting information from that database as well.

As a little side trip, we will take a glimpse at the theoretical side of the relational database model, so that you have some sense of the underpinnings that allowed relational databases to be implemented in the first place.

Background Reading

Not surprisingly, for this culminating assignment your final background reading material comprises whatever background reading was applicable to the database paradigm that you choose. We won’t repeat that here; just go back to the mini-stack assignment corresponding to your chosen database model and review what’s there.

Theoretical/Conceptual Reading

What we will reiterate, because it gets deeper examination as we close out the course, are readings pertaining to the theoretical foundations of the relational database model—foundations that were instrumental to that logical data model’s success and for which equivalents would be worth seeking when pursuing future endeavors in computer science:

E. F. Codd’s seminal A relational model of data for large shared data banks
Elmasri & Navathe Chapter 5
Elmasri & Navathe Chapter 8
Elmasri & Navathe Chapter 14

Technical/Operational Reading

On the technical/operational side, we have two new areas: transactions and indexing/physical schemas. These areas are universal to databases regardless of database type, and so each of our representative systems has corresopnding resources for each area. In addition, transactions also factor into database programming and thus there are programming resources for them as well.

Transactions

SQL: BEGIN, COMMIT, ROLLBACK (among others)
SQLAlchemy: Transactions and Connection Management
Sequelize: Transactions
MongoDB: Transactions and Transactions in Applications
Elasticsearch doesn’t have native transactions, but they can be emulated
Neo4j: Transactions and Session API

Indexing and Physical Schemas

SQL: CREATE INDEX command
MongoDB: Indexes
Elasticsearch: Tips for tuning search speed and indexing speed
Neo4j: Indexes for search performance

For Submission: Full Database Application SDK

Expand one of your database mini-stacks into a full software development kit (SDK) for an envisioned end-user application that relies on the dataset (and additions/changes to it) that you have been using for most of the semester.

“Whad-About.md It” (Hugh Grant as Michael Felgate, `5415` Mickey Blue Eyes 1999)

about.md is back to take its final, fullest form. Provide a complete about.md that includes the following (from previous versions)…

Link to the baseline dataset files
What the dataset contains
The application you have envisioned that uses this dataset: general description, prominent use cases (involving underlying data)

…with these new sections:

Rationale for why you chose the logical data model for this dataset and application—why was it the best fit? What features/characteristics were in its favor? What weren’t in its favor (but clearly wasn’t enough to overturn it)?
Assessment—now that the assignment is done, how does your group feel about this choice?

Consider integrating your database schema (see below) into this document, if technically feasible. At a minimum, you should include a link to the file when appropriate.

“That Little .gitignore” (Rupert Grint as Ron Weasley, Harry Potter and the Deathly Hallows Part 2 2011)

Perhaps the least-changed part of these assignments: make sure your .gitignore file has been edited to exclude raw dataset files from git modification tracking.

“You Always Do Drag Me Into Your Crazy Schemas, Preprocessors, and Loaders” (Vivian Vance as Ethel Mertz, `14325|76|9826|1882|1017`I Love Lucy 1951–1957)

Provide all of the information needed in order to get the database set up and loaded. This includes but is not limited to:

A schema diagram of your database, using notation appropriate for the chosen logical data model
Schema, mappings, or other DDL (if applicable)
Preprocessing code for preparing the raw dataset files for loading (if any)
Loader code for populating the database
New with the SDK: Determine and implement an appropriate indexing scheme, based on your application’s needs (Elasticsearch is exempt from this)—you may build this into your loader programs, or specify this as a command to be executed at the database level
New with the SDK: A complete, step-by-step setup.md document containing complete instructions for setting up the database (you may assume that the user knows how to prepare an empty file structure for the database server, so the instructions can take it from that point on)

Show the Impact of Your Index

Except for Elasticsearch, provide objective evidence in setup.md that your chosen index(es) have a concrete impact on your queries:

Show “explain” output indicating how the query plan changes before and after the index is created
Show time output indicating the real-time speed difference for the same query from before and after the index is created

For the maximum impact, try to index something that spans your largest data collection.

“I’d Rather Take My Chances as a Supernaturally-Possessed DAL” (Brad Dourif as Chucky, `12997` Seed of Chucky 2004)

Surprise surprise, the centerpiece of your SDK is…the software. Supply a fully-decked DAL with the following features/functions.

Configuration Code Driven by Environment Variables

Most of you have already done this, so just don’t mess it up.

Featured Query Functions

Implement two (2) DAL functions for non-trivial “top-level”/“featured” queries—operations that capture a primary/central use case of your intended application.

Choose queries of some sophistication: involve multiple entities (join/nested object/edges) or a grouped aggregate query
Simple “subset of a single entity based on simple criteria” don’t count

For each function, supply a detailed comment block that includes the following:

A natural language statement of a query (as you’ve done in previous assignments, but now in the code)
Specification of all function parameters
Specification of the return value
At least one copy-pasteable example of the query, with specific values already provided

Full-Cycle CRUD for at Least One Entity

Make sure your DAL has full CRUD functionality for at least one entity in the database, meaning that a user has the complete ability to create, retrieve, update, and delete occurrences of that entity via your DAL, without having to go to the level of the database (i.e., no need to run psql nor mongo nor curl http://localhost:9200 nor Neo4j Browser).

Creation function
User-readable search function (which includes an ID so that the user can then call…)
Get-one-entity-by-ID function
Update function
Delete function

Provide the same information in the comment blocks for these functions as the featured queries (though most likely, the descriptions will be much simpler): natural language statement, specifications of function parameters and return values, and a copy-pasteable example of the query (or queries), with specific values already provided. State any assumptions, restrictions, or other domain-specific characteristics that apply to these functions (e.g., required values, default values, constraints, etc.).

At least one (1) of these functions must require a transaction: This means that one or more of the CUD functions must perform an operation that requires more than one query (functions that solely read from the database only need transactions for the most complex of queries, which you are highly unlikely to have given the scope of your datasets). You are most likely to find this requirement with an entity that has relationships with another entity. Do not implement a transaction without actually needing one.

Elasticsearch is exempt from the transaction requirement. Further, some systems have advanced features that allow operations to cascade such that multiple entities may be touched by a single query: for the purposes of the transaction requirement, do not make use of those features and just implement your operation as multiple queries within a transaction.

Helper Functions for Usability

To ensure that your SDK is ready for end-users, supply any necessary helper functions whose parameters can be provided by an end-user which then return values that can be used for other queries. The prime example here is for any query that takes IDs—end-users will almost never know what IDs to use for various operations. Instead, a more usable search or filter is performed first, after which the application would know what ID(s) to use.

The user-readable search function in the CRUD suite is one such function. This allows a user to locate an entity of interest through a human-readable search term, which then supplies ID information that can either retrieve that specific entity in full (get-one-entity-by-ID) or update/delete that entity.

Your envisioned application may require more of these. Specify and implement whatever you feel is needed, again supplying a comment block that describes them completely: natural language statement, specifications of function parameters and return values, and a copy-pasteable example of the query (or queries), with specific values already provided.

“With the Right Leverage and the Proper Application Proof-of-Concept, the Door Will Lift Free” (Orlando Bloom as Will Turner, `1905` Pirates of the Caribbean: The Curse of the Black Pearl 2003)

Your DAL won’t run itself—it needs end-user programs to shine. Provide such programs to demonstrate a “proof-of-concept” for what your envisioned application can do with this DAL. This proof-of-concept can take any of these forms:

Command-line suite: Collection of programs that, together, show off the functions in your DAL
Integrated text-based program: A single executable program which can run all of these functions (e.g., via a menu/keyboard interface)
(only if you’ve already built one) Web or GUI application: This is most likely in a separate GitHub repository—point to it, provide instructions on how to run it, and provide instructions on how to access the features that use your DAL

All programs must:

Accept input from the user (no fully-hardcoded arguments in function calls)
Provide reasonable input validation
Provide reasonable error handling (no stack traces!)
Include adequate help when appropriate (e.g., if the user runs a program with no/insufficient arguments)

Operational Directives/Suggestions

The first two are new/different from before:

Every group’s applications and datasets are different and have unique characteristics—if you feel that yours has aspects that do not match the instructions, deliverables given here, talk to me right away. We can find ways to adjust the assignment accordingly
If you have been given feedback that your dataset is too limited (e.g., insufficient entities, low richness/content, trivial/unnecessary relationships, etc.), also talk to me right away. We will also need to set parameters for the assignment to ensure that you learn what is needed from it. In fairness to other groups that picked applications/datasets of sufficient sophistication, failure to discuss and make these adjustments will incur a reverse degree of difficulty deduction
Make sure to divide the implementation work relatively evenly within your group. Most groups have four (4) members and there is plenty of work to spread around. Let each member “run point” on some set of tasks so that someone is on top of things but of course allow yourselves to help each other.
Once more, do not commit dataset files to the repository—they may be too large for that. Provide links instead. Edit .gitignore to avoid accidental commits.
Not everyone’s computer might have enough storage or other capacity—AWS is an option but watch your credits; or, designate someone as the “host” for doing work and find ways to collaborate over a screenshare and (friendly) remote control of a classmate’s screen.

For Submission: Group Retrospective

“As for you, here tomorrow, write a group retrospective” (Subtitled character in Bruce Lee film Xi lu xiang/The Kid 1950)

Reflect on your overall group database stack work for the semester in a group-retrospective.md document. Describe what each group member worked on through all of the database mini-stack assignments as well as this last full SDK
Write an individual email directly to me stating anything that you prefer not to include in the shared group-retrospective.md—even if you have nothing to add, write to me anyway, so I know that you didn’t miss a chance to speak up. So that I don’t miss it, use the subject line “CMSI 486 group name individual retrospective”

The points for this deliverable pertain to submission only and not its content. If warranted, individual adjustments to final grades may be made based on the information provided in these items.

For Submission: Tutorial Video

“Thanks to the genius of the theory of relations video, many customers also became friends” (Taylor Nichols as Ted Boynton, 7659 Barcelona 1994) Unrelated to the SDK

The last week of class will look at the theoretical foundations of the relational database model—foundations that remain unique and unmatched in the world of generalized database management systems. As a distinct deliverable from the full database SDK, record a tutorial video where each member takes a turn to explain or talk through one of the following:

A favorite relational algebra operation beyond σ, π, and ρ (but of course you can use these operations in your explanation)
A favorite normal form above 1NF
A proof of a favorite relational theorem or lemma

Coordinate/negotiate your choices with each other so that you don’t repeat yourselves.

At a minimum, provide (a) an explanation of the topic and (b) a concrete example—ideally from your own application! Writing on a piece of paper then holding it up to the camera will be fine (especially for the explanation). A screen-shared document that has a dynamic table feature may work best as a visual aid for the example.

Keep the video concise, to the point, and, if possible…fun! I just want to make sure you pick something up from this last unit and provide some assurance that you won’t get all “deer in the headlights” if someone asks you about this in the future 🚘🦌🚨

Supply the video through whatever approach is feasible (which will likely not include GitHub nor Brightspace): link, Box file, other high-capacity storage.

How to Turn it In

Commit everything to GitHub. Reiterating the deliverables, they are:

about.md
.gitignore
Schema diagram
Schema, mappings, or other DDL, if applicable
Preprocessing program(s), if applicable
Loader program(s)
Indexing scheme
- Implementation (e.g., part of the loader code, easily-executable command, etc.)
- Evidence of impact (before-after query explanation; before-after query benchmarks) as part of setup.md
setup.md
DAL code
Application proof-of-concept
Group retrospective
- group-retrospective.md
- Individual email to me
Tutorial video

Review the instructions in the deliverables’ respective sections to see what goes in them.

Specific Point Allocations

This assignment is scored according to outcomes 1a, 1b, 1c or 1d, 2a, 2b, 3a–3d, and 4a–4f in the syllabus. For this particular assignment, graded categories are as follows:

Category	Points	Outcomes
about.md provides all requested information	20 points total	1a, 1b/1c or 1d, 4c
• Dataset link	deduction only, if missed
• Dataset content description	5 points
• Application description	5 points
• Database choice rationale	5 points
• Database choice assessment	5 points
.gitignore correctly prevents accidental commits of dataset files	deduction only, if missed	4a
Schema diagram is accurate, complete, and clear	10 points	1b/1c or 1d, 4c
DDL, preprocessing, and loader code works as expected	20 points	3a, 3b, 4a–4d
Indexing scheme is implemented and documented	20 points	2b
setup.md is accurate, complete, and clear	10 points	1b/1c or 1d, 3a, 3b 4c
DAL code meets specifications and works as expected	60 points total	3c, 3d, 4a–4d
• Configuration code	deduction only, if regressed
• Two (2) featured query functions	20 points
• Full-cycle CRUD with transaction implementation	25 points
• Helper functions	15 points
Application proof-of-concept is functional and usable	20 points	3d, 4a–4d
Group retrospective has the requested information	10 points total	4c
• group-retrospective.md	5 points
• Individual email	5 points
Tutorial video has the requested content	30 points	2a
Hard-to-maintain or error-prone code	deduction only	4b
Hard-to-read code	deduction only	4c
Version control	deduction only	4e
Punctuality	deduction only	4f
Total	200

Where applicable, we reinterpret outcomes 4b and 4c in this assignment to represent the clarity, polish, and effectiveness of how you document your dataset, database, and its features, whether in written descriptions, the database diagram, or the DAL code.

Note that inability to compile and run any code to begin with will negatively affect other criteria, because if we can’t run your code (or commands), we can’t evaluate related remaining items completely.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly