<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510 Final Project**
__Spring 2021__

## Purpose
To put into practice what you have learned in this class and to give the MSBA students a head start on their Capstone projects. 

## Objectives
In this project you will ...
- Construct a data pipeline from original source files to a dimensional data warehouse
- Extract, translate, and load data into a normalized database of your own design
- Design and populate a dimensional data warehouse designed to address common analytical questions
- Test your databases to ensure domain, entity, and relational integrity
- Demonstrate that the course objectives have been met  
- Work in small teams, sharing code and data files equally

## Expectations
- Each student should be able to fully present the work of her/his team. While it is on your honor to share the load with your teammates, the instructor may ask individuals to present separately from the team in private. These private presentations will be graded separately from the rest of the team.

## Instructions


__0. Form teams of 1-3 students each.__

Use Google Sharing to share your notebooks with your teammates. Note that while the documents (files) are shared and jointly editable, the Colab runtime is not. You will need to figure out what this means and how to work around it. 



__1. Gather Source Data and Frame an Analytical Problem to Address.__ 

Submit your writeup in the `1_ProjectFraming.ipynb` notebook. 

There are two options:
- Use the [banner data](SOMEWHERE ON GITHUB).  
  You will need to figure out how to extract the relevant files from GitHub. 
- Bring Your Own Data.   
  The source data must be nontrivial. Either of the following suffice:  
    - Integration from multiple data sources 
    - One large *denormalized* dataset of at least 1 million rows; if it fits in Excel or can't be normalized into at least 3 tables, then it's too trivial to count. 

The problem statement should be about a decision you intend to support. For example, if using the banner data, then you could address detection of faculty workload imbalances. Are there faculty who are teaching too many different kinds of courses? Too many students per semester? 

Then indicate how you would use the data you have collected to answer the analytical questions.



__2. Design a normalized relational database that can contain all the data outlined in your framing document. Document the design with an ERD and a data dictionary.__

Submit your database design in the `2_Normalized_Data_Model.ipynb` notebook. 

- The database should not leave off any files or columns.
- Normalize to at least BCNF.
- Use Lucidchart to draw the ERD. Export the ERD to a PNG file at 300 DPI. Then drop the PNG file into Google Drive and link to it in your Colab notebook. (See [here](https://towardsdatascience.com/the-2-step-guide-to-upload-images-in-google-colab-b51348e882e4#:~:text=Step%20I%3A%20Upload%20the%20image,a%20sharable%20link%20%26%20copy%20it.&text=Open%20Google%20Colab%20Notebook%20%26%20add,want%20to%20include%20the%20image.).)

- Within your notebook add a 'Data Dictionary' section that defines every column on every table. Use the table names as third level headings (`###`) and bullet lists for the column definitions. If you are feeling frisky, then perhaps use Markdown tables instead of bullet lists.  
- Take care with the [Markdown formatting](https://github.github.com/gfm/). Dropping big, stupidly-formatted blobs of text is very bad form. It's also extremely unprofessional (and will affect your course grade).



__3. Create a SQLite or AWS MySQL database. (Instructions for AWS will be posted on Slack if requested.) The database should exactly match your ERD. Populate the database with data from your original sources.__

Submit your initial ETL work in the `3_Data_Ingestion.ipynb` notebook. 

- You will likely have to import CSV file data into tables that you will ultimately drop when the database is completed. To keep track of them all, please prefix the table name with `import_` to indicate that the table contains raw source data.
- Use SQL to create and populate the tables in your ERD. The code will likely look a lot like what we did in class, with lots of JOINs. You should implement FOREIGN KEY constraints (with cascading updates/deletes) as well.
- Write queries to ensure that  ...
    - Each column has a sensible data type (Domain integrity); are there truncation or translation errors?   
    - Each row describes a unique entity (Entity integrity); just having a PK is not enough: you will need to look for duplicate data records
    - Each relationship is implemented correctly (Relational integrity); are the FKs JOIN-compatible with the PKs? does each mandatory relationship have a corresponding NOT NULL constraint?
- Use Markdown and SQL comments to annotate your work as you go along.
- Your notebook should be rerunnable from scratch to recreate and reload the database as needed. That means there should be no manual steps. 
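
The three integrity checks listed above can be sketched as follows. This is only an illustration, assuming SQLite through Python's built-in `sqlite3` module; the tables `import_enrollment`, `student`, and `enrollment`, their columns, and the sample rows are all hypothetical and are not taken from the banner data.

```python
import sqlite3

# In-memory database so the sketch reruns from scratch; a real notebook
# would connect to its own database file instead.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
-- Hypothetical raw import table (note the import_ prefix)
CREATE TABLE import_enrollment (student_id TEXT, student_name TEXT, course_code TEXT);
INSERT INTO import_enrollment VALUES
    ('1', 'Ada',   'BUAN6510'),
    ('1', 'Ada',   'BUAN6500'),
    ('2', 'Grace', 'BUAN6510'),
    ('2', 'Grace', 'BUAN6510');  -- duplicated source row

-- Hypothetical normalized tables with cascading FK constraints
CREATE TABLE student (
    student_id   INTEGER PRIMARY KEY,
    student_name TEXT NOT NULL
);
CREATE TABLE enrollment (
    student_id  INTEGER NOT NULL
        REFERENCES student(student_id) ON UPDATE CASCADE ON DELETE CASCADE,
    course_code TEXT NOT NULL,
    PRIMARY KEY (student_id, course_code)
);

INSERT INTO student
    SELECT DISTINCT CAST(student_id AS INTEGER), student_name FROM import_enrollment;
INSERT INTO enrollment
    SELECT DISTINCT CAST(student_id AS INTEGER), course_code FROM import_enrollment;
""")

# Domain integrity: source ids that do not survive a round trip through INTEGER
domain_errors = conn.execute("""
    SELECT COUNT(*) FROM import_enrollment
    WHERE CAST(CAST(student_id AS INTEGER) AS TEXT) <> student_id
""").fetchone()[0]

# Entity integrity: duplicated records in the raw source data
duplicates = conn.execute("""
    SELECT student_id, course_code, COUNT(*) AS n
    FROM import_enrollment
    GROUP BY student_id, course_code
    HAVING COUNT(*) > 1
""").fetchall()

# Relational integrity: enrollment rows whose FK matches no student PK
orphans = conn.execute("""
    SELECT COUNT(*) FROM enrollment AS e
    LEFT JOIN student AS s ON s.student_id = e.student_id
    WHERE s.student_id IS NULL
""").fetchone()[0]

# SQLite's own FK audit; an empty list means no violations were found
fk_violations = conn.execute("PRAGMA foreign_key_check").fetchall()

print(domain_errors, duplicates, orphans, fk_violations)
```

Running the checks against the `import_` tables as well as the final tables helps show where a problem was introduced: in the source data or in your own ETL code.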



__4. Design and build a data warehouse called `CourseDataWarehouse.db`.__

Submit your work in the `4_DataWarehouse_Design_ETL.ipynb` notebook. 

- The notebook should lay out the design of the warehouse and the ETL code necessary to populate the tables.
- Use a star schema design. The idea is to make writing 'rollup' queries with SELECT ... FROM ... WHERE ... GROUP BY as easy as possible. The dimension FKs are likely redundant -- they can usually be inferred from other table relationships -- but often eliminate the need for complex JOINs.
- Document each fact table (and associated dimensions) as a separate ERD. Each ERD should be named using the pattern `fact-table-name.pdf` and contain no spaces or other unnecessary punctuation. Store the PDF files in the `docs` folder.
- You will need to figure out how to extract data from `CourseData.db` in order to insert it into `CourseDataWarehouse.db`.
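
One possible way to move data between two SQLite files is the `ATTACH DATABASE` statement, sketched below with Python's built-in `sqlite3` module. Treat it as one option under stated assumptions, not the required approach: the `section` source table, the `dim_instructor`, `dim_term`, and `fact_section` warehouse tables, and the sample rows are all hypothetical, and the files are created in a temporary folder so the sketch runs on its own.

```python
import os
import shutil
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "CourseData.db")
warehouse_path = os.path.join(workdir, "CourseDataWarehouse.db")

# Stand-in for the normalized database from step 3 (hypothetical table and rows)
src = sqlite3.connect(source_path)
src.executescript("""
CREATE TABLE section (
    section_id INTEGER PRIMARY KEY,
    instructor TEXT, term TEXT, enrolled INTEGER
);
INSERT INTO section VALUES
    (1, 'Hopper', '2021SP', 25),
    (2, 'Hopper', '2021SP', 30),
    (3, 'Turing', '2020FA', 18);
""")
src.close()

# Open the warehouse and attach the source file under the schema name src
dw = sqlite3.connect(warehouse_path)
dw.execute("ATTACH DATABASE ? AS src", (source_path,))
dw.executescript("""
-- Hypothetical star schema: one fact table and two dimensions
CREATE TABLE dim_instructor (instructor_key INTEGER PRIMARY KEY, instructor TEXT UNIQUE);
CREATE TABLE dim_term (term_key INTEGER PRIMARY KEY, term TEXT UNIQUE);
CREATE TABLE fact_section (
    section_id     INTEGER PRIMARY KEY,
    instructor_key INTEGER NOT NULL REFERENCES dim_instructor(instructor_key),
    term_key       INTEGER NOT NULL REFERENCES dim_term(term_key),
    enrolled       INTEGER NOT NULL
);

-- Populate the dimensions first, then the facts that point at them
INSERT INTO dim_instructor (instructor)
    SELECT DISTINCT instructor FROM src.section ORDER BY instructor;
INSERT INTO dim_term (term)
    SELECT DISTINCT term FROM src.section ORDER BY term;
INSERT INTO fact_section
    SELECT s.section_id, i.instructor_key, t.term_key, s.enrolled
    FROM src.section AS s
    JOIN dim_instructor AS i ON i.instructor = s.instructor
    JOIN dim_term AS t ON t.term = s.term;
""")

fact_rows = dw.execute("SELECT COUNT(*) FROM fact_section").fetchone()[0]
instructor_rows = dw.execute("SELECT COUNT(*) FROM dim_instructor").fetchone()[0]
dw.close()
shutil.rmtree(workdir)  # remove the temporary files created by this sketch

print(fact_rows, instructor_rows)
```

Because the whole transfer is plain SQL, it can live in a single notebook cell and be rerun from scratch without manual steps.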



__5. Demo your results with useful queries.__

Submit your work in the `5_DataWarehouse_Demo.ipynb` notebook. 

- Formulate a few informative queries that illustrate how the data warehouse addresses the issues raised in step #1. 
- Number the queries so we can refer to them by number later. (In other words, make sure your queries have entity integrity.)
- In your Markdown include remarks about why, when, and how the query might be used in practice. If your query results suggest anything insightful, then include a cell below the query results with Markdown-formatted remarks.
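
A numbered rollup query might look like the sketch below. It assumes a tiny, hypothetical star schema (`fact_section` and `dim_instructor`, with made-up rows) built in memory with Python's `sqlite3` module; your own tables, columns, and questions will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse tables and sample rows
CREATE TABLE dim_instructor (instructor_key INTEGER PRIMARY KEY, instructor TEXT);
CREATE TABLE fact_section (
    section_id     INTEGER PRIMARY KEY,
    instructor_key INTEGER REFERENCES dim_instructor(instructor_key),
    course_code    TEXT,
    enrolled       INTEGER
);
INSERT INTO dim_instructor VALUES (1, 'Hopper'), (2, 'Turing');
INSERT INTO fact_section VALUES
    (1, 1, 'BUAN6510', 25),
    (2, 1, 'BUAN6500', 30),
    (3, 2, 'BUAN6510', 18);
""")

# Query 1: teaching load per instructor (distinct courses and total students)
query_1 = """
SELECT i.instructor,
       COUNT(DISTINCT f.course_code) AS distinct_courses,
       SUM(f.enrolled)               AS total_students
FROM fact_section AS f
JOIN dim_instructor AS i ON i.instructor_key = f.instructor_key
GROUP BY i.instructor
ORDER BY total_students DESC;
"""
rows = conn.execute(query_1).fetchall()
for row in rows:
    print(row)
```

A remark cell under a query like this could note that it would be rerun each semester to flag instructors whose course variety or student count sits well above their peers.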



__6. Deliver a brief walkthrough presentation of your work.__

- You will have exactly 10 minutes to present your work. One member of your team will be selected at random to present. Everyone has to be 100% knowledgeable about the project. 
- There are no slides for the presentation itself. Just walk us through your work, notebook by notebook. Whatever you do, please do not try to _sell_ your work. The purpose of the walkthrough is to review the work for completeness, relevance, and professionalism. If you have met the course objectives, then your work should stand on its own.
- The presentation can be live over Zoom -- schedule an appointment! -- or with a pre-recorded YouTube video. 