# Project Specification



## Group formation
The project group may consist of 3 to 4 students.

## Project grading 
The project is worth 35% of the class grade, with the following details:

|                         | **Due date** | **Proportion (%)** |
|-------------------------|:------------:|:------------------:|
| Proposal                | May 13       |        15          |
| Check-up                | June 2       |        20          |
| Presentation recording  | June 11      |        25          |
| Report                  | June 11      |        35          |
| Peer experience summary | June 11      |        10          |
| **Total**               |              |       100          |

## Project proposal

Develop a proposal that includes your topic, the selection of relevant dataset(s), and a plan to answer your questions of interest.  Keep in mind the timeline of the quarter and set achievable goals.


### Guidelines for proposal

Review the guidelines below carefully.

The proposal should follow this format (5 points): 

- Include full names of all teammates.
- Between 1 and 2 pages, single-spaced, 11-point type, 1-inch margins.
- Do not include graphics.
- Decide on the name of your group.  A pdf named `[GroupName]_proposal.pdf` should be submitted.  

The proposal should cover the following components (but in a narrative format, not Q/A):

- **Introduction** (12 points): Introduce your project and motivation, e.g.,
  - What is the main issue you are interested in?
  - Why is this topic important?
  - In what way does this project provide a solution?
- **Data source(s)** (10 points): Describe the data sources you have chosen, e.g.,
  - How can the data be retrieved?
  - How are the data related to the topic?
  - State the amount of data you will be working with. 
- **Goal definitions** (18 points): State two to three goals you are interested in achieving with the selected data sources.  
  As you tackle the goals, you should use **at least one** of the data engineering techniques covered in class:
  - Missing data imputation
  - Database system (relational or non-relational)
  - Distributed computing framework (e.g., PySpark)

  For each goal, answer the following:
  - How do you intend to achieve your goal?
  - How do you intend to use the above techniques?
  - What can you answer (and not answer) with the proposed datasets?
- **Reference** (5 points): Provide valid references for your data source(s). (References do not count toward the page limit.)
  - For example, using APA guidelines, the MIMIC-III database can be cited as
    > Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). *PhysioNet*. https://doi.org/10.13026/C2XW26.
  - or the original article 
    > Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. *Scientific Data*, 3(1), 1-9.
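
As an illustration only, and not a required approach, the sketch below shows what combining two of the techniques listed under **Goal definitions** might look like on a very small scale: mean imputation of missing values followed by loading the result into a relational database. The toy records, the `readings` table, and the helper names `impute_mean` and `load_into_sqlite` are all hypothetical; your own project would use its chosen dataset and whichever imputation method and database system fit your goals.

```python
import sqlite3
from statistics import mean

# Hypothetical toy records: (id, value); None marks a missing measurement.
rows = [
    ("a", 72.0),
    ("b", None),
    ("c", 80.0),
    ("d", None),
    ("e", 64.0),
]


def impute_mean(records):
    """Replace missing values with the mean of the observed values.

    Raises statistics.StatisticsError if no value is observed at all.
    """
    observed = [value for _, value in records if value is not None]
    fill = mean(observed)
    return [(key, value if value is not None else fill) for key, value in records]


def load_into_sqlite(records):
    """Store the records in an in-memory SQLite database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (id TEXT PRIMARY KEY, value REAL NOT NULL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()
    return conn


imputed = impute_mean(rows)
conn = load_into_sqlite(imputed)

# Query the database to summarize the imputed data.
count, avg = conn.execute("SELECT COUNT(*), AVG(value) FROM readings").fetchone()
print(count, avg)
```

Mean imputation is only the simplest option and can understate the variability in the data, so a proposal that relies on it should say why it is adequate for the stated goal.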
  

## Potential data sources
Below are some possible avenues for finding data sources:
- [Google dataset search](https://datasetsearch.research.google.com/)
- [Awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)
- [BuzzFeed news repository](https://github.com/BuzzFeedNews/everything)

## Project check-up

Two elements are required for the project check-up:  
1. **Group, submitted on Canvas as a pdf.** (30 points)
    - For each of your set goals in the proposal, provide a brief summary of progress. *If you have not worked on a particular goal yet, feel free to say that you have not made progress there.*
    - Include any questions or roadblocks you are encountering that require assistance.
    - If you find it helpful to set up a time to meet with Prof. Chan, indicate so, and provide some available times to meet between June 2 and 6.
2. **Individual.** (20 points)
    - Complete the following Google form: [DE300 Project check-up Google Form](https://forms.gle/inRfq9NKifQcGcfa8)
  
*Note:* The check-up form will not necessarily change your individual project grade, but adjustments will be considered based on the combined feedback from the check-up and the peer experience summary.

## Presentation recording
Produce a presentation recording with Panopto. The presentation recording should follow these guidelines:

- Between 8 and 12.5 minutes.
- Include a side-by-side view with your presentation slides and your camera.
- All of your teammates should be visible at all times during the recording.
- *Tips:*
    - If you decide to record using Zoom, you may upload the recorded video to Panopto for edits.
    - Distribute your content and do a quick dry run in advance.  This will reduce the time spent re-recording.

The presentation should include, *but is not limited to*, the following:

- Motivation of your project.
- Introduction to your dataset(s) and its relevance.
- Goals, method of analyses, and findings.
- Summary.

The presentation is evaluated on the following criteria (20 points each):

- Organization of content
- Appropriate use of language
- Delivery
- Use of supporting materials (visuals, statistics, etc.)
- Clarity of central idea

Note that while the evaluation is primarily for the entire group, it may differ among students if a significant discrepancy is observed. You may refer to the [sample reference rubric](a1-presentation-rubric) for details.

## Report
Prepare a project report to clearly outline your project choice and its importance, your approach to achieving the set of goals, your results, and a summary.

The report should follow these guidelines:

- Include full names of all teammates.
    - This is a reminder that it is an academic violation to include your name on any work to which you have not contributed.
- Up to 15 pages (not including Appendix), single-spaced, 11-point type, 1-inch margins.

**Submission:**
A single pdf named `LastName1_LastName2_..._report.pdf` should be submitted.

Include all of your code in a zipped folder in your submission.  Make sure your code and its organization are understandable.

---

The report should include the following components in a narrative format:

- **Introduction** (5 points): Introduce your project and motivation, e.g.,
  - What is the main issue you are interested in?
  - Why is this topic important?
  - In what way does this project provide a solution?
- **Data source(s)** (5 points): Describe the data sources you have chosen, e.g.,
  - How can the data be retrieved?
  - How are the data related to the topic?
  - State the amount of data you will be working with. 
- **Goals and approaches** (60 points): State the *specific* goals you are attempting to achieve with the selected data sources.  For each goal, describe your approach to achieving it, its implementation, and your results.  Use graphs, tables, or other visualizations to support your narrative.
- **Summary** (10 points): Provide a brief summary of your project, and describe any loose ends or future opportunities.
- **Generative AI statement and reference** (10 points):
  - Generative AI statement: If you have, in any way, employed the use of generative AI tools, report all usage according to the syllabus. This statement does not count toward the page limit.
  - Reference: Provide valid references for your data source(s). (*Does not count toward the page limit.*)
      - For example, using APA guidelines, the MIMIC-III database can be cited as
        > Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). *PhysioNet*. https://doi.org/10.13026/C2XW26.
      - or the original article 
        > Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. *Scientific Data*, 3(1), 1-9.
- **Code legibility and organization** (10 points): Give a brief summary or provide a README to use and evaluate your code. (*Does not count toward the page limit.*)


## Peer experience summary
Complete the following Google form: [to be provided]