# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
## DA3 Intro to Pandas (75 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Utilize the Pandas library to
    * Load data from a CSV into a DataFrame
    * Work with DataFrames and Series
    * Save data to a CSV

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Define and call functions
* Open a CSV file
* Create and work with 2D lists

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* None

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this DA3 link to accept the assignment and create a private repository for your assignment in GitHub Classroom: https://classroom.github.com/a/HyEAjjrd

Your repo, for example, will be named GonzagaCPSC222/da3-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

## Programming (45 pts)
Write a program to work with data using the Pandas library (i.e. the `pandas` module).

### Download the Data
Download the youtube_analytics_9-20-20_9-20-21.csv file and the days_of_week_9-20-20_9-20-21.csv from the DAs repo on Github: https://github.com/GonzagaCPSC222/DAs/blob/master/files. 

One way to download a file is to click "Raw", then right-click on the page and click "Save As." Move both of these files into the same folder as your local DA3 Git repo. This is my own YouTube channel data that I downloaded from the YouTube Studio's Analytics website for 9/20/20-9/21/21.

### Pandas Exercises
Using Pandas objects, methods, and functions, code up the following data processing steps using the above two CSV files:
1. Load the two CSV files into DataFrame objects (`youtube_df` and `days_df`). The Index column for each DataFrame should be the "Date" column.
1. Prompt the user for a start date (inclusive) and an end date (inclusive). Slice `youtube_df` using the dates to make a new DataFrame. Prompt the user for the name of a numeric column (provide these options to the user). Create a Series from the user-entered column name and the sliced DataFrame.
1. Compute the following stats on the sliced column: sum, mean, standard deviation, median, smallest value, and largest value. Store these label/result pairs in a Pandas Series object. Write this stats Series out to a CSV file. For example:
```
Sum,1121.0
Mean,373.6666666666667
StdDev,69.83074776438623
Median,358.0
Smallest,313.0
Largest,450.0
```
1. Join the original `youtube_df` with `days_df` on the "Date" index column to make a new DataFrame, `merged_df`. Write `merged_df` to a CSV file.
1. Use the split-apply-combine approach to do the following:
    1. Split the data into groups based on the day of the week
    1. Apply the mean functionality to the user-entered column (from exercise #2 above) of each of the groups
    1. Combine the daily means into a Series and write this Series to a new CSV file
        1. Do you notice any patterns about my YouTube channel based on the day of the week?
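
To make the first three steps concrete, here is a minimal sketch on toy data, not a reference solution. It assumes the "Date" values are plain strings in year-month-day form, and the column names "Views" and "Likes" are hypothetical stand-ins for the real numeric columns. Hard-coded values take the place of the user prompts, and the three sliced values are chosen so the stats line up with the example output above.

```python
import io

import pandas as pd

# Toy stand-in for a real CSV file; "Views" and "Likes" are hypothetical columns.
CSV_TEXT = """Date,Views,Likes
2020-09-20,290,10
2020-09-21,313,12
2020-09-22,358,15
2020-09-23,450,21
2020-09-24,402,18
"""


def load_df(source):
    """Load a CSV (filename or file-like object) indexed by its "Date" column."""
    return pd.read_csv(source, index_col="Date")


def compute_stats(ser):
    """Return a Series of label/result pairs for a numeric Series."""
    return pd.Series({
        "Sum": ser.sum(),
        "Mean": ser.mean(),
        "StdDev": ser.std(),  # sample standard deviation (ddof=1) by default
        "Median": ser.median(),
        "Smallest": ser.min(),
        "Largest": ser.max(),
    })


toy_df = load_df(io.StringIO(CSV_TEXT))

# Label-based slicing with .loc includes both the start and the end label.
sliced_df = toy_df.loc["2020-09-21":"2020-09-23"]
views_ser = sliced_df["Views"]

stats_ser = compute_stats(views_ser)

# With no filename, to_csv returns the CSV text; pass a filename to write a file.
stats_csv = stats_ser.to_csv(header=False)
print(stats_csv)
```

In your program, the start date, end date, and column name would come from `input()` rather than being hard-coded.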
    
Notes:
1. Your solution should be modular. Define appropriate functions to solve these exercises. To help you get started, consider writing one function for each of the major steps above (e.g. load, slice, join, split/apply/combine).
1. The CSV files you write out should have descriptive names.
1. There are multiple ways to solve each of these steps. If you use a method/approach other than what was covered in class, make sure you cite your source!! (and understand the code you write)
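
As one possible shape for a modular solution, the sketch below puts the join and split-apply-combine steps into small functions and runs them on toy data. The function names, the "Views" column, and the "Day of Week" column name in the days table are all assumptions for illustration, so check them against the real files. This is only one of several reasonable approaches.

```python
import io

import pandas as pd

# Toy stand-ins for the two CSV files; "Views" is a hypothetical column name.
YOUTUBE_CSV = """Date,Views
2020-09-20,100
2020-09-21,200
2020-09-27,300
2020-09-28,400
"""
DAYS_CSV = """Date,Day of Week
2020-09-20,Sunday
2020-09-21,Monday
2020-09-27,Sunday
2020-09-28,Monday
"""


def load_df(source):
    """Load a CSV (filename or file-like object) indexed by its "Date" column."""
    return pd.read_csv(source, index_col="Date")


def join_on_date(left_df, right_df):
    """Join two DataFrames on their shared "Date" index (left join by default)."""
    return left_df.join(right_df)


def mean_by_group(df, group_col, value_col):
    """Split df by group_col, take the mean of value_col, combine into a Series."""
    return df.groupby(group_col)[value_col].mean()


youtube_df = load_df(io.StringIO(YOUTUBE_CSV))
days_df = load_df(io.StringIO(DAYS_CSV))

merged_df = join_on_date(youtube_df, days_df)
daily_means_ser = mean_by_group(merged_df, "Day of Week", "Views")

# With no filename, to_csv returns the CSV text; pass a filename to write a file.
print(merged_df.to_csv())
print(daily_means_ser.to_csv())
```

Keeping each step in its own function makes it easy to reuse the user-entered column name from exercise #2 as the `value_col` argument.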


### Bonus (5 pts)
Write your own "group-by" function. Your function should accept a DataFrame and a column name to split on. Return a dictionary of group name-DataFrame pairs. Your solution may not use `groupby` or other existing functions/methods. You must write your own code using a loop to do this. Walk through each row in the DataFrame and assign it to a new DataFrame depending on its value for column name (e.g. if column name is "Day of Week" then your dictionary would have 7 DataFrames in it, one for each of the 7 values of "Day of Week").

<img src="https://github.com/GonzagaCPSC222/DAs/raw/master/figures/group_by_day_of_week.png" width="300">

Note: It is nice that Pandas provides this functionality as a one-liner with `groupby`, right? This is the beauty of using really nice data science libraries; however, it is always good to know how algorithms are implemented under the hood!!
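
For comparison, here is a small sketch of that `groupby` one-liner building the same kind of dictionary of group name-DataFrame pairs on toy data. The "Views" column and its values are made up for illustration. Since it uses `groupby`, it is not a valid bonus solution, but it can be handy for checking the output of your own function.

```python
import pandas as pd

# Toy data; "Views" and its values are hypothetical.
toy_df = pd.DataFrame({
    "Day of Week": ["Sunday", "Monday", "Sunday", "Monday", "Tuesday"],
    "Views": [100, 200, 300, 400, 500],
})

# Iterating over a groupby object yields (group name, group DataFrame) pairs.
groups = {name: group_df for name, group_df in toy_df.groupby("Day of Week")}

for name, group_df in groups.items():
    print(name)
    print(group_df)
```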

## Project Part 3 (15 pts)
In a **PDF document called project_part3.pdf**, provide the following information for one of your three data sources from DA1 Project Part 1 (but not the same one you used for DA2 Project Part 2):
1. What table could be populated from this data source?
1. What is at least one other table of data you could combine with this one to form a larger and more informative dataset? 
1. For each table in your dataset:
    1. What is an instance? What is the universe of instances?
    1. What are the attributes? For each attribute:
        1. Is it categorical/discrete or continuous? 
        1. What is its scale of measurement (e.g. nominal, ordinal, interval, ratio)?
    1. Is there a key?
    1. Is there an attribute (or attributes) that would logically serve as a class for supervised learning? Meaning would it be logical/interesting to predict this attribute based on the other attributes? What would be the value of predicting this attribute?
1. What would be a common key you could use to identify instances across your tables?

This write-up should be written using full sentences and should be grammatically correct. Proofread your writing before you submit it!!

## Data Ethics (15 pts)
Read the Introduction to [Weapons of Math Destruction](https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815) by Cathy O'Neil. This book is a NY Times bestseller, was longlisted for the National Book Award, and is frequently mentioned as one of the top non-technical data science/big data books that everyone should read (here are a few lists: [kdnuggets](https://www.kdnuggets.com/2019/12/non-technical-reading-list-data-science.html), [dataquest](https://www.dataquest.io/blog/data-science-books/), [builtin](https://builtin.com/data-science/data-science-books), etc.). You don't need to purchase this book unless you want a hard copy or you want to read the whole thing. We will be reading a few chapters from it, starting with the Introduction. I'll [post the sections we will read to Ed](https://edstem.org/us/courses/8021/resources) so they are not publicly available on Github.

In a **PDF document called ethics.pdf**, provide your reflection on the following discussion points:
1. The Introduction states, "...attempting to reduce human behavior, performance, and potential to algorithms is no easy job." The parents loved Sarah Wysocki, but the algorithm did not. For this algorithm, Sarah was an "instance" and there were several attributes used to describe her. These attributes were weighted to produce an IMPACT score. What do you think some of the attributes were? Can you think of attributes that could better represent "human behavior, performance, and potential"? Is it possible to fully code a teacher's impact on student learning? What might a feedback loop look like for the IMPACT WMD?
1. The Introduction also states, "...data scientists all too often lose sight of the folks on the receiving end of the transaction." What is an example of an automated data-based system where you were on the receiving end of the transaction? Do you think the output from the system used to label you was accurate and fair? Try to think of an example not mentioned in the Introduction. 
1. What else struck you about this introduction?

This write-up should be written using full sentences and should be grammatically correct. Proofread your writing before you submit it!!

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .py file(s), your .csv file(s), and your write-up file(s) (.pdf). Double check that this is the case by cloning your submission repo (or downloading a zip of it) and running your code from VS Code like we will when we grade your code.

## Grading Guidelines
This assignment is worth 75 points + 5 points bonus. Your assignment will be evaluated based on a successful execution in VS Code (using the Anaconda Python Distribution v3.8) and adherence to the program requirements. We will grade according to the following criteria:
* 5 pts for loading the CSV files
* 10 pts for prompting the user and indexing/slicing to form a Series
* 10 pts for writing out a stats Series based on the user's column
* 5 pts for joining on "Date" and writing out the merged DataFrame
* 10 pts for split-apply-combine and writing out the combined Series based on the user's column
* 5 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC222/DAs/blob/master/Coding%20Standard.ipynb)
* 15 pts for quality, clarity, and creativity in the project part 3 write-up
* 15 pts for quality, clarity, and creativity in the data ethics write-up