# Introduction to Data Science

Designed to prepare educators to teach data science, this workshop presents an introductory data science unit for high school students. The unit includes a project plan that guides students through key steps in data investigation: data gathering, processing, exploration, visualization, and storytelling. Participants in the session will learn to help students use Python and libraries like pandas, NumPy, and Matplotlib to analyze and present data. This session offers practical tools for teaching introductory data science within a structured project framework, and participants will receive resources to help them teach the unit in their own high school classrooms.

# Project Description

This project gives you the opportunity to choose a dataset that interests you and apply the coding and analytical concepts you’ve learned in this class. It is structured into four milestones, which will be assigned throughout the remainder of the unit. Each milestone will be graded and accompanied by detailed feedback to support your progress.

To successfully complete the final project, you will need to:

- Select a dataset
- Develop research questions
- Apply data manipulation, generate numerical summaries, and create visualizations
- Write a comprehensive report that:
  - Presents your findings
  - Answers your research questions
  - Reflects on the methods you used

You may complete the coding in R or Python. The project will conclude with a final deliverable, which can be either a Jupyter notebook (`.ipynb` file) or an RMarkdown document (`.Rmd` file) that is knit into a `.pdf` report.

## Project Milestone 1: Dataset

This milestone of your project will be completed in two steps: first, selecting a suitable dataset and obtaining approval; second, completing a detailed datasheet. It is important that your chosen dataset aligns with the project requirements and supports the type of analysis you plan to conduct. To ensure this, you’ll need to start by submitting a Dataset Approval Request.

When selecting a dataset for your project, please keep the following requirements in mind. The dataset must:

- Contain between 500 and 10,000 rows
- Include at least 10 variables, with at least 2 character variables and 2 numerical variables
- Exclude any personally identifiable information (PII) if using data from a project you are working on
- Be different from any dataset that will be used in this course
- Not be a dataset that:
  - Consists entirely of time series data (e.g., stock prices or monthly sales)
  - Has repeated measurements spread across rows

To help you get started with finding datasets, please review this [document](https://docs.google.com/document/d/1anCa86MPIZQDcicwkMT7w2aiHkPOIy5iKffbtxdCQ1o/edit?usp=sharing). If you know of other valuable sources, please share them with me, and I’ll add them to the list.

After you submit your dataset, it will be reviewed, and you will receive feedback on whether it has been approved or if modifications are needed. Please wait for approval before proceeding to the second part of this milestone: **Creating a Datasheet**.

### Part 1: Selecting a Dataset and Obtaining Approval

Choose a dataset that meets all of the following criteria:

- Publicly available  
- Between **500 and 10,000 rows**  
- Contains between **10 and 15 variables**, including:  
    - At least **2 character variables**
    - At least **2 numerical variables**

**Notes:**

- Datasets with more than **30 variables** should be **narrowed down to the 10–15** that are most relevant to your analysis.
- Do not use any dataset provided in this course.
- If your dataset comes from your work or a personal project, ensure all Personally Identifiable Information (PII) is removed.
- If you're unsure whether your dataset meets the criteria, contact the instructor for clarification.
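
Before you submit your approval request, you may want to check your dataset against the criteria above. The following is a minimal sketch in Python with pandas, not an official checker. The function name `check_dataset`, the example data, and the rule for what counts as a character variable (any column that is not numeric, boolean, or datetime) are assumptions made for illustration.

```python
import pandas as pd


def check_dataset(df: pd.DataFrame) -> dict:
    """Compare a DataFrame against the size and variable-type criteria."""
    n_rows, n_cols = df.shape

    # Numerical variables: integer or float columns.
    n_numeric = df.select_dtypes(include="number").shape[1]

    # Character variables (assumption): every column that is not
    # numeric, boolean, or datetime is counted as text.
    n_character = df.select_dtypes(
        exclude=["number", "bool", "datetime"]
    ).shape[1]

    return {
        "rows": n_rows,
        "rows_ok": 500 <= n_rows <= 10_000,
        "variables": n_cols,
        # Wider datasets can be narrowed down later, so only the
        # minimum of 10 variables is checked here.
        "variables_ok": n_cols >= 10,
        "character_ok": n_character >= 2,
        "numerical_ok": n_numeric >= 2,
    }


# Hypothetical data: 600 rows but only 4 variables, so the
# variable-count check is expected to fail.
example = pd.DataFrame({
    "city": ["Springfield", "Shelbyville"] * 300,
    "category": ["A", "B", "C"] * 200,
    "price": [1.5, 2.5, 3.5] * 200,
    "quantity": list(range(600)),
})

report = check_dataset(example)
print(report)
```

To run this on your own data, load it first (for example with `pd.read_csv`) and pass the resulting DataFrame to the function.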

### Part 2: Sharing Your Dataset

Submit your dataset in one of the following ways:

- Provide a link to the dataset (preferred if it’s available online), or
- Upload the dataset as a `.csv` file to Moodle

**Note:** 
- Uploaded files must be under 1 GB due to Moodle's file size limit.

### Part 3: Justifying Your Dataset Selection

Write a brief justification (1-2 paragraphs) explaining why your chosen dataset is a good fit for your project. Your explanation should:

- Describe how the dataset meets the project's size and variable requirements.
- Include 3 initial research questions you plan to explore (these may change as your analysis progresses).

## Submission Instructions

Follow these steps to complete your submission:

### Dataset

- If your dataset is available online, include a link to it.
- If you need to upload the dataset, save it as a `.csv` file and upload it to Moodle.

### Paragraph

- If you write your paragraph using word processing software (e.g., Word, Google Docs), save it as a `.pdf` before uploading to Moodle.

**Note:**  
- You may submit two files: the justification paragraph (`.pdf`) and the dataset (`.csv`). However, if your dataset is available online, it’s preferred that you include the dataset link within your `.pdf` rather than uploading a separate file.

## Grading Rubric

Please consult the [Project Milestone 1: Finding a Dataset Section of the Grading Rubric](https://docs.google.com/document/d/1pNJdSuyrcYNnsImlPU8tpOty9IWoyYKAv4mkPX50Tbs/edit?tab=t.0#heading=h.ac4ga0s8jrbc) for details on how your work will be evaluated for this milestone.

### Part 4: Creating Your Datasheet

Provide a detailed description of your dataset by creating a datasheet. The purpose of the datasheet is to clearly explain what each variable (column) in your dataset represents. For each variable, include the following:

- A clear description that explains the meaning and context of the variable.
- The data type (e.g., string, integer, float)
- The unit of measurement, if applicable

**Note:** 
- For examples, refer to the datasheets created for the farmer’s market and coffee datasets in earlier assignments.
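
If you are working in Python, one possible way to start a datasheet is to generate a table with one row per variable and then fill in the descriptions by hand. This is only a sketch under assumptions: the function name `datasheet_skeleton`, the column layout, and the example data are hypothetical, and the data types pandas reports may need to be translated into the terms used in class (for example, text columns are reported as `object` or `str` depending on the pandas version).

```python
import pandas as pd


def datasheet_skeleton(df: pd.DataFrame) -> pd.DataFrame:
    """Build a starter table with one row per variable in df."""
    return pd.DataFrame({
        "variable": list(df.columns),
        # Data type as reported by pandas, converted to plain text.
        "data_type": [str(dtype) for dtype in df.dtypes],
        # Left blank on purpose: these must be written by you.
        "description": "",
        "unit": "",
    })


# Hypothetical two-column dataset used only to show the output shape.
example = pd.DataFrame({
    "vendor": ["Green Acres", "Hilltop Farm"],
    "price_usd": [4.25, 3.00],
})

sheet = datasheet_skeleton(example)
print(sheet)
```

The generated table covers only the variable names and data types. The descriptions and units still have to come from your own understanding of the dataset and its source.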

## Submission Instructions

Follow these steps to complete your submission:

- Include the source of your dataset in your datasheet.
- Provide a brief description of the dataset, summarizing its context or purpose.
- If you create your datasheet using word processing software (e.g., Word or Google Docs), save it as a `.pdf` before uploading it to Moodle.

## Grading Rubric

Please consult the [Project Milestone 1: Creating a Datasheet Section of the Grading Rubric](https://docs.google.com/document/d/1pNJdSuyrcYNnsImlPU8tpOty9IWoyYKAv4mkPX50Tbs/edit?tab=t.0#heading=h.65b467nsbsmg) for details on how your work will be evaluated for this milestone.

## Project Milestone 2: Data Dive

In this milestone of your project, the Data Dive, you will perform exploratory data analysis on your dataset using R or Python and document your work in either an RMarkdown document or a Jupyter Notebook. Your analysis should include data moves such as filtering, summarizing, calculating, and grouping. Be sure to comment your code clearly to explain what it is doing and why, so that your analysis is easy to follow and demonstrates your reasoning.
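
As a rough illustration of the four data moves named above, here is a short pandas sketch. The `orders` data and all column names are hypothetical, and your own Data Dive should follow the step-by-step guide for your chosen language rather than this example.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the four data moves.
orders = pd.DataFrame({
    "drink": ["latte", "mocha", "latte", "tea", "mocha", "tea"],
    "size": ["small", "large", "large", "small", "small", "large"],
    "price": [3.50, 5.00, 4.50, 2.00, 4.00, 3.00],
    "quantity": [2, 1, 3, 4, 1, 2],
})

# Calculating: create a new variable from existing ones.
orders["revenue"] = orders["price"] * orders["quantity"]

# Filtering: keep only the rows that meet a condition.
large_orders = orders[orders["size"] == "large"]

# Summarizing: reduce a variable to a single number.
mean_revenue = orders["revenue"].mean()

# Grouping: compute a summary within each category.
revenue_by_drink = orders.groupby("drink")["revenue"].sum()

print(len(large_orders))   # number of large orders
print(mean_revenue)        # average revenue per order
print(revenue_by_drink)    # total revenue for each drink
```

Note how each comment names the data move and says what it does. In your own notebook, also explain why the step matters for your analysis.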

Along with code, your notebook should include text that explains the story of the dataset, including its context and significance. For proper text formatting, please refer to this [Basic Markdown Guide](https://www.markdownguide.org/basic-syntax/).

Refer to the provided template to guide the structure and organization of your analysis as you begin your work.

### Part 5: Data Dive Using R

Follow the guidelines and steps outlined in the document [Data Dive Using R: A Step-by-Step Guide](https://docs.google.com/document/d/17FS7yhTA8DcF9pjZ2xW6pkirE7QkzTwBH3YjnH7i3x4/edit?tab=t.0) to complete your assignment.

## Submission Instructions

Follow these steps to complete your submission:

- Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document. 
- Save the assignment before proceeding to knit the file into a `.pdf` document.
- Once the knitting process is complete, locate the resulting `.pdf` document and upload this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.

### Part 6: Data Dive Using Python

Follow the guidelines and steps outlined in the document [Data Dive Using Python: A Step-by-Step Guide](https://docs.google.com/document/d/1C-J3pVmkv0BT4F9clsXABWgqLHt1G-zFa94A9o1ibnk/edit?tab=t.0) to complete your assignment.

## Submission Instructions

Follow these steps to complete your submission:

- Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.
- Save the assignment before proceeding to download the file.
- After downloading, locate the `.ipynb` file and upload only this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.

## Grading Rubric

Please consult the [Project Milestone 2: Data Dive Section of the Grading Rubric](https://docs.google.com/document/d/1pNJdSuyrcYNnsImlPU8tpOty9IWoyYKAv4mkPX50Tbs/edit?tab=t.0#heading=h.lrx3oz9s6qom) for details on how your work will be evaluated for this milestone.

## Project Milestone 3: Analysis Plan

In this milestone, you will build on the progress you've made in earlier milestones and continue developing your research questions. Your task involves two main steps: first, refining and finalizing focused research questions; and second, creating an analysis plan to guide your exploratory data analysis (EDA). This plan should draw on insights from previous milestones, including feedback from your instructor and observations from your Data Dive assignments. Your EDA should be designed to help you begin answering your research questions by identifying patterns, relationships, or trends in the data.

### Part 7: Research Questions

Your task is to develop at least three research questions, each involving two or more variables from your dataset. These variables can be ones already included in the dataset or ones you create by transforming existing data. Each question should be accompanied by a short explanation that describes why you chose the variables and what makes the relationship worth exploring.

**Note:** 

- Follow the recommendations in [Guidance for Developing Research Questions](https://docs.google.com/document/d/15bQMtscw6QthfCzXgpu7EEPp0KQnDvYRvPFVb12bmz0/edit?tab=t.0) to write research questions that connect to your dataset and goals. 
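
To make the idea of creating variables by transforming existing data concrete, here is a small pandas sketch. The `trips` data, the column names, and the bin boundaries are all hypothetical. It shows one possible approach, not a required one.

```python
import pandas as pd

# Hypothetical data with two numerical variables.
trips = pd.DataFrame({
    "distance_km": [1.2, 8.5, 3.0, 15.0, 0.8, 6.1],
    "duration_min": [6, 30, 12, 45, 5, 20],
})

# Transform a numerical variable into a categorical one by binning.
# Bins are right-inclusive: (0, 2], (2, 10], (10, inf].
trips["trip_length"] = pd.cut(
    trips["distance_km"],
    bins=[0, 2, 10, float("inf")],
    labels=["short", "medium", "long"],
)

# Combine two existing variables into a new numerical one.
trips["speed_kmh"] = trips["distance_km"] / (trips["duration_min"] / 60)

# A research question can now relate a created variable to an existing
# one, e.g. "How does trip duration differ across trip-length categories?"
mean_duration = trips.groupby("trip_length", observed=True)["duration_min"].mean()
print(mean_duration)
```

When you use a created variable in a research question, your explanation should say how it was derived and why the transformation makes the relationship easier to study.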

## Submission Instructions

Follow these steps to complete your submission:

- Enter your research questions in the text box or, if using word processing software (e.g., Word, Google Docs) to create your document, save it as a `.pdf` before uploading it to Moodle.

## Grading Rubric

Please consult the [Project Milestone 3: Research Questions Section of the Grading Rubric](https://docs.google.com/document/d/1pNJdSuyrcYNnsImlPU8tpOty9IWoyYKAv4mkPX50Tbs/edit?tab=t.0#heading=h.4ux29xt5szo5) for details on how your work will be evaluated for this milestone.

### Part 8: Analysis Plan Document Setup

For this milestone, you will create a new document to outline your analysis plan. At this stage, do not include any code; instead, focus on outlining a plan for your analysis. In **Project Milestone 4: Data Analysis and Storytelling**, you will return to this document and add the code needed to carry out your analysis. Refer to the format and guidance provided in the [Analysis Plan Document Setup Template](https://docs.google.com/document/d/1z0o5stpWtw5LLCVe_hRFoqAC9Lgad2dbIoSqgQT8Bt4/edit?tab=t.0) as you create your outline.

- If you plan to use R, create an RMarkdown document (`.Rmd`).
- If you plan to use Python, set up a Jupyter Notebook (`.ipynb`).

## Submission Instructions

Follow these steps to complete your submission:

### RMarkdown
  - Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.
  - Save your assignment before proceeding to knit the file into a `.pdf` document.
  - Upload the `.pdf` file to this Moodle page, and it will be automatically submitted to Gradescope for grading.

### Jupyter Notebook
  - Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.
  - Save your assignment before proceeding to download the `.ipynb` file. 
  - Upload the `.ipynb` file to this Moodle page, and it will be automatically submitted to Gradescope for grading.

## Grading Rubric

Please consult the [Project Milestone 3: Analysis Plan Document Setup Section of the Grading Rubric](https://docs.google.com/document/d/1pNJdSuyrcYNnsImlPU8tpOty9IWoyYKAv4mkPX50Tbs/edit?tab=t.0#heading=h.lrx3oz9s6qom) for details on how your work will be evaluated for this milestone.

## Project Milestone 4: Data Analysis and Storytelling

### Introduction

For this final project milestone, you will implement your planned analyses and present your findings. Submit your work as a single, complete document, either a knitted RMarkdown file in `.pdf` format or a Jupyter notebook (`.ipynb`). The document must include all required sections.

### Part 9: Data Analysis and Storytelling Template

This milestone builds on your earlier submissions but also requires new content. As you complete this milestone:

- Replace the text you previously wrote for the planned analysis with the actual code, including appropriate comments.
- Write your interpretations of the output produced by your code.
- Add code to build the visualizations you described in your plan.
- Explain what your visualizations show and how they relate to your research questions.
- Revise and strengthen the previously submitted version of your document based on instructor feedback and what you have learned throughout the course.
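
As one possible illustration of turning a planned visualization into commented code, here is a minimal Matplotlib sketch. The values, labels, and chart type are hypothetical stand-ins. Your own plots should be built from your dataset and follow the Data Analysis and Storytelling Template.

```python
import matplotlib.pyplot as plt

# Hypothetical summary values, standing in for output from your own analysis.
drinks = ["latte", "mocha", "tea"]
total_revenue = [20.5, 9.0, 14.0]

# A bar chart compares one numerical summary across categories.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(drinks, total_revenue)

# A title and axis labels (with units) let the plot be read on its own.
ax.set_title("Total revenue by drink")
ax.set_xlabel("Drink")
ax.set_ylabel("Total revenue (USD)")
fig.tight_layout()
```

In the text that follows a plot like this, state what it shows (here, which category has the highest total) and connect that observation to the research question it addresses.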

Return to the document you set up in **Part 8: Analysis Plan Document Setup** using the [Analysis Plan Document Setup Template](https://docs.google.com/document/d/1z0o5stpWtw5LLCVe_hRFoqAC9Lgad2dbIoSqgQT8Bt4/edit?tab=t.0), and add the code needed to carry out your analysis. Refer to the format and guidance provided in the [Data Analysis and Storytelling Template](https://docs.google.com/document/d/1bPPJ8mr6JAev6zbaPSDmhhjL4nzmerV5MRJCnO6t9VE/edit?tab=t.0#heading=h.lx5dpsinbhos) as you complete your final document.

## Grading Rubric

Please consult the [Project Milestone 4: Data Analysis and Storytelling Section of the Grading Rubric](https://docs.google.com/document/d/1pNJdSuyrcYNnsImlPU8tpOty9IWoyYKAv4mkPX50Tbs/edit?tab=t.0#heading=h.xairjwsspn6x) for details on how your work will be evaluated for this milestone.