CMSI 3520 Database Systems, Fall 2022

Assignment 1003

Before diving fully into database systems per se, we make one stopover “under the hood” by looking at data management in terms of working with files alone. The premise here is that all database systems, regardless of model, philosophy, or generation, ultimately store their data in files. Thus, it is of value to explore what it takes to work with files directly—in a way, to get an additional taste of what is involved in implementing your own database system.

Background Reading

The Elmasri & Navathe book touches on this conceptually in their Chapter 1 section on “Advantages of Using the DBMS Approach.” If you want a peek into how database management systems actually organize their data files in order to do what they do efficiently, you can skim Part 7, File Structures, Hashing, Indexing, and Physical Database Design, to get an idea of how this is done. Think of this assignment as a microcosm of that material.

Continue to draw inspiration from the Kaggle and Awesome Public Datasets collections for this assignment—and this time, keep a closer eye on datasets whose files you can actually work with in the ways described below. As before, the Netflix Prize dataset is used as the case study for the assignment’s accompanying examples—you may not use it for your own submission, but by all means download these files and try out the given examples for yourself.

Direct assistance for the action items in this assignment can be found (among other places) in:

Also of use would be programming references for I/O and file processing libraries of whatever programming language you might choose, such as:

  • Python (take a look at their csv library too)
  • NodeJS—particularly handy if your dataset’s files are in JSON format
  • Java—just don’t tell Dr. Toal (lest you become a…tattle-Toal?)
  • (though I doubt any of you will pick this one) C—unless you’re taking this as a challenge and want to legit say “Hold my avocado 🥑”

Of course, for technologies like these, you’re likely to find a lot of additional supplements on the web.
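
If you go the Python route, the csv library mentioned above handles most of the fiddly parsing for you. Below is a minimal sketch, assuming a hypothetical some_dataset.csv with a header row; the file name and the “peek at the first five rows” behavior are placeholders for whatever exploration your dataset calls for.

```python
# Minimal sketch: peek at a delimited file with Python's csv module.
# "some_dataset.csv" is a hypothetical file with a header row.
import csv

with open("some_dataset.csv", newline="", encoding="utf-8") as data_file:
    for i, row in enumerate(csv.DictReader(data_file)):
        # DictReader keys each row by the header names, and the csv module
        # handles quoting and embedded commas that a plain split(",") would not.
        print(row)
        if i == 4:  # stop after the first five rows
            break
```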

For Submission: File Database Proto-Stack

For this assignment, give yourselves and your groups a taste of what it would be like to work with data at a file level by taking a stab at performing the database operations described below.

Warm Up to the Challenge: netflix-practice.md

Walk through the Netflix Prize file database proto-stack case study—ideally together as a group, so you can help each other through each example—and do a little prep with some freeform exploration of that data using the techniques shown therein. Ideally, everyone in the group can download and preprocess the data firsthand; if this is not feasible for some, designate one or more group members to serve as “hosts” for your work, and collaborate in person or around a screenshare with them.

In a Markdown file on your repository called netflix-practice.md, do the following:

  1. Write up two (2) movie queries—things like movies with certain titles or title patterns, movies released in one or more given years, etc. Provide those queries and show their results (or a subset of them, if there are too many to list).
  2. Pick a viewer at random and build their complete “review profile”—actually track down the titles/years of the movies they reviewed and how they rated them. Provide the commands used to build this profile and show the results. State conclusions that your group thinks can be drawn about that viewer’s preferences and tastes. (A rough sketch of one programmatic way to build such a profile appears after this list.)
  3. Pick a movie at random and collate its review scores. Provide the commands used to gather this data and show the results. Can you draw any conclusions about the quality of this movie based on the review scores that it received?
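
If commands alone feel unwieldy for item 2, the sketch below shows one way it could be approached with a short Python program; it is an illustration, not the case study’s actual commands. It assumes the usual Netflix Prize layout, in which movie_titles.csv rows look like MovieID,Year,Title and the combined_data files mark each movie’s block with a line such as 123: followed by CustomerID,Rating,Date lines, so verify those details against the files you actually downloaded.

```python
# Rough sketch: build one viewer's review profile from the Netflix Prize files.
# Assumes movie_titles.csv lines look like "MovieID,Year,Title" and the
# combined_data_*.txt files use "MovieID:" header lines followed by
# "CustomerID,Rating,Date" lines. Adjust paths, names, and encodings as needed.
import glob

TARGET_VIEWER = "1488844"  # placeholder: substitute any customer ID you find

# Map movie IDs to (year, title) so the profile is human-readable.
# Titles can contain commas, so split at most twice.
titles = {}
with open("movie_titles.csv", encoding="latin-1") as title_file:
    for line in title_file:
        line = line.rstrip("\n")
        if not line:
            continue
        movie_id, year, title = line.split(",", 2)
        titles[movie_id] = (year, title)

# Scan the rating files, remembering which movie block we are inside.
profile = []
for path in sorted(glob.glob("combined_data_*.txt")):
    with open(path) as rating_file:
        current_movie = None
        for line in rating_file:
            line = line.strip()
            if not line:
                continue
            if line.endswith(":"):
                current_movie = line[:-1]
            else:
                viewer, rating, date = line.split(",")
                if viewer == TARGET_VIEWER:
                    year, title = titles.get(current_movie, ("?", "?"))
                    profile.append((date, title, year, rating))

for date, title, year, rating in sorted(profile):
    print(f"{date}: rated {title} ({year}) a {rating}")
```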

Let There Be…Data! (about.md, .gitignore)

After you’ve gotten some practice with the Netflix Prize data, you’ll now need to work with another dataset in all its glory (not just mimicking its structure), so choose well! Think ahead about the kinds of files you can work with given the requested operations, as well as the language you would be inclined to use for the operations that call for a little programming. Keep those practicalities in mind while, of course, still picking a dataset whose core content is interesting to you.

This assignment is most effective when you implement the requested operations on that dataset in its entirety (just like with the Netflix Prize case study)—so yes, you’ll have to be conscious of how much disk space you have available—and get to know how the files are organized, formatted, and structured. Ideally, everyone in the group will be able to work with the files directly; however, not everyone may have enough capacity to work with the entire dataset on their local computer, so please be open to projecting or sharing your screen and letting some teammates drive in order to give everyone that firsthand experience.

In any case, describe your dataset and potential applications for it in a file called about.md. This can be very similar to the about.md that you wrote for the database fiddle assignment. Make sure that about.md also has links to the actual dataset, for anyone who would like to download a copy for themselves.

Whatever you do, do not commit the datasets to GitHub. This is one case where GitHub isn’t a good fit.

To guarantee that this does not happen accidentally, edit the .gitignore file so that your repository ignores the files that comprise your chosen dataset.
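
For instance, if your group keeps the downloaded files in a folder named data (a placeholder name; use whatever layout you settle on), the additions to .gitignore could be as simple as the patterns below. Adjust them to match your dataset’s actual folder and file extensions.

```
# Keep the raw dataset out of version control; link to it in about.md instead.
data/
*.csv
*.tar.gz
```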

Feel the Power: queries.md

Once you have the data at your fingertips, it’s time to play database. Seek to implement the following operations. Refer back to the examples in the netflix-prize-file-example folder as needed. Just remember that datasets will differ from each other, so don’t limit yourself exclusively to what you see in the case study. In particular, note that the case study uses Python for custom programming work—this will not necessarily be the best fit for the dataset that you choose.

As before, it is also useful to contextualize things by envisioning an application that would use your chosen dataset. Having such an application in mind will help fuel ideas for the specific operations to implement in the following categories. Place query descriptions and commands in a file called queries.md and commit relevant source code directly to the repository (referring to them as needed in queries.md):

  1. Query by pure command (grep, awk)
  2. Count query by pure command (grep, awk, wc)
  3. Query by a program that scans the data file(s); a sketch of such a program appears after this list
  4. Query by pure command facilitated by pre-processing (with accompanying preprocessor of course)
  5. “Compound” query that requires a manual combination of commands or programs
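
Because datasets vary so much, treat the following as a generic shape for item 3 rather than a recipe: a small Python program that scans a delimited file and prints the rows matching some condition. The file name, column index, and default year are all placeholders.

```python
# Sketch of a "query by program" (item 3): scan a delimited file and print
# the rows that satisfy a predicate. The file name, column index, and
# default year below are placeholders for your own dataset.
import csv
import sys

DATA_FILE = "ratings.csv"  # placeholder path
YEAR_COLUMN = 2            # placeholder column index
TARGET_YEAR = sys.argv[1] if len(sys.argv) > 1 else "1999"

matches = 0
with open(DATA_FILE, newline="", encoding="utf-8") as data_file:
    for row in csv.reader(data_file):
        if len(row) > YEAR_COLUMN and row[YEAR_COLUMN] == TARGET_YEAR:
            print(",".join(row))
            matches += 1

# Report the total separately so it does not mix with the matched rows.
print(f"{matches} matching row(s)", file=sys.stderr)
```

A program like this does roughly what a grep or awk one-liner does, but it gives you room for conditions and formatting that a single command cannot express comfortably.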

Take Stock of Your Proto-Stack: report.md

Finally, finish up the exercise by writing a small technical report on everything that you did. Call this document report.md and answer the following questions:

  1. Which operations/commands/programs were the most difficult to implement? Which were easiest? Provide brief rationales for your responses.
  2. In what way do the size and ordering of your data files affect the speed of an individual operation?
    • To support your answer, use the time command to get an objective reading for how long certain operations take
    • Choose your timed operations in a way that illustrates how size and ordering affect performance
  3. Is there a correlation between the ease of implementation and performance? (i.e., are the hardest operations to implement always the slowest ones? vice versa? or is there no relationship at all?)

Operational Directives/Suggestions

  • Make sure to divide the implementation work relatively evenly within your group. Most groups have four (4) members, and you will notice that there are eight (8) total “coding” tasks (three for Netflix, five for your dataset). Thus, letting individual group members “own” around two (2) tasks each will help spread the load. Of course you can all help each other as needed, but let each person take point on two items.
  • Once more, do not commit the original files to the repository—they may be too large for that. Provide links instead. Edit .gitignore to avoid accidental commits.
  • Not everyone’s computer might have enough storage or other capacity; in this case, designate someone as the “host” for doing work and find ways to collaborate and take turns “driving.”

How to Turn it In

Commit everything to GitHub. Reiterating the deliverables, they are:

  • netflix-practice.md
  • about.md
  • an edited .gitignore that keeps your dataset files out of the repository
  • queries.md, plus any accompanying source code committed directly to the repository
  • report.md

Review the instructions in the deliverables’ respective sections to see what goes in them.

Specific Point Allocations

This assignment is scored according to outcomes 1a, 1d, 2b, 3c, 3d, and 4a to 4f in the syllabus. For this particular assignment, graded categories are as follows:

| Category | Points | Outcomes |
| --- | --- | --- |
| netflix-practice.md correctly implements the requested operations | 5 points each, 15 points total | 1a, 2b, 3c, 4a |
| about.md | 10 points total | 1a, 1d, 4c |
| • Includes link to the dataset’s files | 3 points | |
| • Describes the dataset and possible applications well | 7 points | |
| .gitignore correctly prevents accidental commits of dataset files | 5 points | 4a |
| queries.md correctly implements the requested operations | 7 points each, 35 points total | 1d, 2b, 3c, 4a |
| Custom file-processing/querying code works as intended | 10 points each, 20 points total | 1d, 2b, 3d, 4a |
| report.md answers the questions well | 5 points each, 15 points total | 1a, 4c |
| • Objective data supplied via time | deduction if not provided | |
| Hard-to-maintain or error-prone code | deduction only | 4b |
| Hard-to-read code | deduction only | 4c |
| Version control | deduction only | 4e |
| Punctuality | deduction only | 4f |
| Total | 100 | |

Note that inability to compile and run any code to begin with will negatively affect other criteria, because if we can’t run your code (or commands), we can’t evaluate related remaining items completely.