# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
## DA5 Jupyter Notebook (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Write Markdown and code cells in Jupyter Notebook
* Typeset equations using LaTeX
* Tell a data science story
* Parse JSON data
 
## Prerequisites
Before starting this programming assignment, participants should be able to:
* Use pandas for data analysis
* Use matplotlib to visualize data

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [The COVID Tracking Project Data API](https://covidtracking.com/data/api)

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this DA5 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/eMsFWSLZ

Your repo, for example, will be named GonzagaCPSC222/da5-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

## Programming (40 pts)
For this assignment, we are going to use a Jupyter Notebook, called WACovidEDA.ipynb, to tell a story about exploratory data analysis (EDA) of a JSON dataset. Your source code, explanations/log, and charts will be combined in a Jupyter Notebook with well-organized, interleaved code cells and markdown cells. Here are general requirements that your Notebook should conform to:
* Your EDA story should be logically divided into different section levels, appropriately labeled using markdown headers, and contain well-written commentary describing the code, results, and insights you come up with
* Each chart that you generate must be generated inline in the Notebook and include a figure title and labels on the x and y axes where appropriate
* Each formula that you use in your code must be typeset using LaTeX and described in markdown (this includes formulas used for stats, even if those stats are implemented in a pandas function/method you call)
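
To illustrate the last requirement, here is a minimal sketch of what a Markdown cell might contain when documenting the mean and standard deviation. The symbols ($x_i$ for the $i$-th value, $n$ for the number of values) are notation chosen for this example, and the sample form of the standard deviation is shown on the assumption that you call pandas `std()` with its default `ddof=1`:

```latex
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$
```

If you compute a population standard deviation instead (e.g. `ddof=0`), the denominator under the square root becomes $n$, and your typeset formula should match whichever version your code uses.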
    
### Dataset
The dataset we are going to use is going to be downloaded by your code! We are going to fetch the daily COVID-19 data for Washington state from [covidtracking.com](https://covidtracking.com/data/api). The specific URL we are going to use to download the data is https://api.covidtracking.com/v1/states/wa/daily.json. You can/should open this URL now and take a look at the data. It is in JSON format :)

Here is starter code that fetches this JSON data from the above URL and writes the data to a file called "dailyWA.json":

```python
import json
import urllib.request

# data source
# https://covidtracking.com/data/api
# historic value for a single state
# note: data is also available as a CSV output
# at https://covidtracking.com/data/download click Washington
# to download the "washington-history.csv"

# URL to fetch data from
url = "https://api.covidtracking.com/v1/states/wa/daily.json"
# download raw JSON object from the URL
data = urllib.request.urlopen(url).read().decode()
# write JSON object to an output file (the file is closed automatically)
with open("dailyWA.json", "w") as outfile:
    outfile.write(data)
```

Once you run this code, you will have dailyWA.json on your local machine and can begin the following data storytelling exercises.

### Python Data Storytelling Exercises
1. Convert dailyWA.json to dailyWA.csv (e.g. convert the JSON data to CSV data and write out the resulting CSV file)
1. Load the dailyWA.csv data into a pandas DataFrame object. Add a new column to the DataFrame called "month" that simply stores the month for each row (e.g. 1, 2, 3,..., 10 or "Jan", "Feb", "March",..., "Oct")
1. Display the total number of new positive cases for each month. Then display the mean and standard deviation across the months. 
1. Create two charts. The first chart is of the accumulated positive cases over time and the second chart is of the accumulated negative cases over time.

Note: the code for the above exercises does not need to be modular (e.g. you don't need to write functions).
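
As a rough sketch of how the first three exercises fit together (not a reference solution), the snippet below runs the same steps on three made-up records. It assumes each record has `date` (an integer like `20200301`), `positive`, `negative`, and `positiveIncrease` fields and that every record shares the same keys; check the real JSON before relying on these names. The sample values are illustrative only, and an in-memory buffer stands in for dailyWA.csv.

```python
import csv
import io
import json

import pandas as pd

# Made-up records shaped like the assumed API response (values are illustrative only).
sample_json = """[
  {"date": 20200302, "positive": 10, "negative": 50, "positiveIncrease": 4},
  {"date": 20200301, "positive": 6, "negative": 30, "positiveIncrease": 6},
  {"date": 20200229, "positive": 0, "negative": 10, "positiveIncrease": 0}
]"""
records = json.loads(sample_json)

# Exercise 1: JSON -> CSV. A StringIO buffer stands in for the output file;
# for a real file, use open("dailyWA.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)

# Exercise 2: load the CSV into a DataFrame and add a "month" column.
buffer.seek(0)
df = pd.read_csv(buffer)
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
df["month"] = df["date"].dt.month

# Exercise 3: total new positive cases per month, then mean and
# standard deviation across the monthly totals.
monthly_totals = df.groupby("month")["positiveIncrease"].sum()
print(monthly_totals)
print("mean:", monthly_totals.mean())
print("std:", monthly_totals.std())  # pandas uses the sample form (ddof=1) by default
```

Parsing the integer date into a real datetime first keeps the month extraction simple and also gives you a proper time axis for the charts in the last exercise.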

### Bonus (7 pts)
(4 pts) Create a third chart that plots both the accumulated positive and negative cases on the same Matplotlib figure. Your chart should have two Y-axes so you can see the trend of each line clearly (otherwise the negative case numbers are so large the scale of a single Y-axis washes out the trend of the positive case numbers). Include a legend for each line.

(3 pts) Modify the X-axis labels so they show the month. 

Here is the bonus plot in action (note depending on when you download the JSON data, your chart will likely contain additional days of data):
<img src="https://github.com/GonzagaCPSC222/DAs/raw/master/figures/wa_covid_bonus_chart.png" width="400">
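
One possible way to get two Y-axes is Matplotlib's `twinx()`, sketched below on made-up numbers (the dates and counts are illustrative, not real data). The variable names are hypothetical, and the month formatting via `matplotlib.dates` is just one option for the X-axis labels.

```python
from datetime import date

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Made-up accumulated counts (illustrative only).
dates = [date(2020, 3, 1), date(2020, 4, 1), date(2020, 5, 1)]
positive = [6, 40, 90]
negative = [30, 400, 1200]

fig, ax_pos = plt.subplots()
ax_neg = ax_pos.twinx()  # second Y-axis that shares the same X-axis

# Keep the line handles so both lines can go into a single legend.
line_pos, = ax_pos.plot(dates, positive, color="tab:red", label="Positive")
line_neg, = ax_neg.plot(dates, negative, color="tab:blue", label="Negative")

ax_pos.set_title("Accumulated WA COVID-19 test results (sample data)")
ax_pos.set_xlabel("Month")
ax_pos.set_ylabel("Accumulated positive cases")
ax_neg.set_ylabel("Accumulated negative cases")

# Show one tick per month, labeled with the abbreviated month name.
ax_pos.xaxis.set_major_locator(mdates.MonthLocator())
ax_pos.xaxis.set_major_formatter(mdates.DateFormatter("%b"))

ax_pos.legend(handles=[line_pos, line_neg], loc="upper left")
```

Passing both line handles to one `legend()` call avoids ending up with two separate legends, one per axis.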

## Project Proposal (40 pts)
### Overview
For the "Quantified Self" project, you are going to create and analyze your own dataset by collecting longitudinal data (data collected over time) on yourself. Previous DAs have included incremental tasks related to exploring what data on yourself you want to collect and some preliminary analyses of this data. Now, you are going to decide on the dataset, analyses, and hypotheses.

Your "Quantified Self" dataset can be any dataset that includes your own data, so long as it conforms to the following requirements:
1. It spans at least two months of a recent data collection period
1. It has at least 5 attributes of different measurement scales, including a "class" attribute to be used for classification (e.g. the data is labeled and can be used with supervised machine learning algorithms)
1. It contains at least two tables that can be joined (e.g. each table is from a different data source. As an example, consider my Fitbit CSV file and the days of the week CSV file from DA3). 
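
To make the "tables that can be joined" requirement concrete, here is a small sketch that joins two hypothetical tables on a shared `date` column. The table names, column names, and step counts are made up for illustration; your own tables only need some common key.

```python
import pandas as pd

# Hypothetical table 1: daily step counts (made-up values).
steps = pd.DataFrame({
    "date": ["2020-10-01", "2020-10-02", "2020-10-03"],
    "steps": [8200, 10450, 6100],
})

# Hypothetical table 2: day of the week for each date.
days = pd.DataFrame({
    "date": ["2020-10-01", "2020-10-02", "2020-10-03"],
    "day_of_week": ["Thursday", "Friday", "Saturday"],
})

# Join the two tables on their shared key.
merged = steps.merge(days, on="date", how="inner")
print(merged)
```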

Include your dataset files in your Github repo.

### Propose the Project
In a Jupyter Notebook named ProjectProposal.ipynb, formally write up your proposed project. Your proposal should be a grammatically correct narrative written in a data-storytelling format. If you have preliminary Python code from previous DAs that is relevant for the proposal, please interleave it via code cells in your Jupyter Notebook.

Your proposal should have the following headers and content:
1. Title of your project (heading level 1)
1. Introduction (heading level 2)
    1. Domain introduction (heading level 3)
        1. Introduce the domain of your project (e.g. the fitness domain, the music domain, the medical domain, etc.)
        1. Personally, why is this domain important to you
        1. What are you researching in this domain
    1. Hypotheses (heading level 3)
        1. What are your hypotheses about insights/results in this domain
        1. What are potential impacts of the results
        1. Who are [stakeholders](https://www.projectmanager.com/blog/what-is-a-stakeholder) interested in your results
1. Data Analysis (heading level 2)
    1. Dataset description (heading level 3)
        1. What tables are included in the dataset and how is the data in each table collected
        1. What is its format (e.g. CSV files, JSON files, a mix of the two, etc.)
        1. Include a brief description of the attributes
    1. Data preparation (heading level 3)
        1. What cleaning of the dataset is required (e.g., are there missing values and how do you plan to handle the missing values)
        1. How are you going to merge the tables
        1. What are anticipated challenges with data preparation
    1. Exploratory data analysis (heading level 3)
        1. What data aggregation techniques are you going to apply
        1. What visualizations will informatively present the attributes and relationships
        1. What statistical hypothesis tests are you going to compute
    1. Classification (heading level 3)
        1. What attribute will you use as class information (i.e., what attribute or attributes will you try to predict)
        1. What are your hypotheses about the predictions
        1. What are anticipated challenges with classification
        
Note: looking forward, the final project report will include variations of the above sections (with code/results), as well as sections titled "Discussion" and "Conclusion".

### Project Timeline
About three weeks after your project proposal is approved, we will have a "mid-project" demo due. For this, you will demo a Jupyter Notebook with results for "Data preparation" and "Exploratory data analysis". You will also demo preliminary work/results on your classification approach.

The final project deliverables include a presentation/demo on Thursday, December 13th from 1-3pm, as well as a final report that will be due that night at 11:59pm.

## Data Ethics (20 pts)
Watch Netflix's documentary, ["The Social Dilemma"](https://www.netflix.com/title/81254224). While I encourage you to watch the whole documentary (1 hour and 34 mins), I'm only assigning the following ~45 minute chunk: 5:45 (title page) to 49:01. 

Note: If you do not have access to Netflix, you can read the transcript of the documentary [here](https://scrapsfromtheloft.com/2020/10/03/the-social-dilemma-movie-transcript/). Read from "Aza does welcoming remarks. We play the video." to "Because they’re controlling, you know, the information that we see, they’re controlling us more than we’re controlling them." (this part of the transcript corresponds with the assigned timestamps above). Use the web browser search feature (e.g. ctrl + F or cmd + F) to search for the start of the reading.

In a Jupyter Notebook called Ethics.ipynb, provide your reflection on the following discussion points:
1. Toward the beginning of the documentary, viewers are presented with the idea that social media sells our attention to advertisers to make their profits. The quote that best summarizes this business model is, "If you’re not paying for the product, then you are the product." How do you feel about selling your attention to use the products for free? Are there short and long term effects that we can/cannot foresee?
1. Tristan Harris states, "On the other side of the screen, it’s almost as if they had this avatar voodoo doll-like model of us. All of the things we’ve ever done, all the clicks we’ve ever made, all the videos we’ve watched, all the likes, that all gets brought back into building a more and more accurate model. The model, once you have it, you can predict the kinds of things that person does." He then states that tech companies have three main goals: engagement (drive up your usage), growth (keep you coming back and inviting friends), and advertising (making as much money as possible from advertising). Are there ethical lines that companies should not cross in order to achieve these three goals? Do you feel that the "avatar voodoo doll-like model of us" crosses these lines? Why or why not?
1. How do you feel about persuasive technology (e.g. the programming of "positive intermittent reinforcement", AKA the slot machine)? As humans, are we able to differentiate between our original thoughts and ones that are "planted" unconsciously by persuasive technology?
1. What else struck you about this documentary?

This write-up should be written using full sentences and should be grammatically correct. Proofread your writing before you submit it!

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .ipynb file(s) and your .csv or .json file(s). Double check that this is the case by cloning (or downloading a zip) your submission repo and running your code from Jupyter Lab like we will when we grade your code.

## Grading Guidelines
This assignment is worth 100 points + 7 points bonus. Your assignment will be evaluated based on a successful execution in Jupyter Lab (using the Anaconda Python Distribution v3.8) and adherence to the program requirements. We will grade according to the following criteria:
* 5 pts for converting the JSON file to a CSV file
* 10 pts for correct monthly stats
* 10 pts for correct accumulated cases charts
* 5 pts for data storytelling using Markdown cells
* 5 pts for typesetting stats formulas using LaTeX
* 5 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC222/DAs/blob/master/Coding%20Standard.ipynb)
* 40 pts for quality, clarity, and creativity in the project proposal, as well as coverage of the required headers and content
* 20 pts for quality, clarity, and creativity in the data ethics write-up