## CS 260 Final Project Overview

Pick a dataset, any dataset…

…and tell me a story using visualizations and the techniques we've used this semester.

That is your final project in a nutshell. More details below.

---

## A. Project Goal

The final project for this class will involve an analysis of a dataset of your team's own choosing. You can choose the data based on your interests or based on work in other courses or research projects. The intent of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.

Your main goal is to prove to me that:
* you are proficient at asking meaningful questions and answering them with the results of basic data analysis and visualization, 
* you are proficient in using Python, 
* you are proficient at interpreting results, and 
* you are able to tell a cohesive story when you present your results. 

The project is very open ended. You should create compelling visualizations of this data in Python and then interpret them. (You can even use statistical techniques or visualizations we haven’t officially covered in class, if you’re feeling adventurous.)  There is no limit on what tools or packages you may incorporate, but you must also incorporate packages we learned in class. You do not need to visualize all of the data at once. A handful of high quality visualizations that tell a compelling story will receive a much higher grade than a large number of poor quality visualizations that are unrelated to each other. Also pay attention to your presentation. Neatness, coherency, grammar, cohesiveness, and clarity will count. All analyses must be done in Jupyter Lab/Notebooks, using Python.

---

## B. Data

In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. 

As such, your dataset must have at least 250 observations (rows) and between 10 and 20 variables (columns). Exceptions can be made, but you must speak with me first. The dataset’s variables should include a mix of:

* categorical variables, 
* discrete numerical variables, and  
* continuous numerical variables. 
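
One way to check a candidate dataset against these requirements is sketched below. It assumes pandas is available and uses a tiny made-up table in place of real data; the function name `check_dataset` and the column names are hypothetical. The dtype-based classification is only a rough guide (integer columns are treated as discrete, float columns as continuous, everything else as categorical), so confirm it against the data's documentation.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def check_dataset(df, min_rows=250, min_cols=10, max_cols=20):
    """Report whether a DataFrame meets the size and variable-mix guidelines."""
    n_rows, n_cols = df.shape
    # Rough classification by dtype; anything non-numeric counts as categorical.
    discrete = [c for c in df.columns if is_integer_dtype(df[c])]
    continuous = [c for c in df.columns if is_float_dtype(df[c])]
    categorical = [c for c in df.columns if c not in discrete + continuous]
    return {
        "enough_rows": n_rows >= min_rows,
        "columns_in_range": min_cols <= n_cols <= max_cols,
        "categorical": categorical,
        "discrete": discrete,
        "continuous": continuous,
    }


# Made-up example data (hypothetical columns), far too small for a real project.
toy = pd.DataFrame({
    "island": ["Oahu", "Maui", "Kauai"],
    "bedrooms": [1, 2, 3],
    "price": [120.0, 250.5, 310.0],
})
report = check_dataset(toy)
print(report)
```

With a real dataset, you would pass your own DataFrame to the function and look for `True` in the first two entries and at least one column in each of the three lists.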

Ideally, your dataset should be in an Excel file, a CSV file, or some type of table that can be converted to one of these formats. If you are using a dataset that comes in a format we haven’t encountered in class, make sure that you are able to load it into Python, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.

**Note on reusing datasets from class:** Do *not* reuse datasets used in examples, homework assignments, or labs in the class.

Below is a list of data repositories that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:

<ul>
<li><a href="https://www.statcrunch.com/datasets/shared">Stat Crunch</a></li>
<li><a href="https://github.com/rfordatascience/tidytuesday">TidyTuesday</a></li>
<li><a href="https://www.opendata.nhs.scot/">NHS Scotland Open Data</a></li>
<li><a href="https://edinburghopendata.info/">Edinburgh Open Data</a></li>
<li><a href="https://statistics.gov.scot/home">Open access to Scotland’s official statistics</a></li>
<li><a href="https://www.bikeshare.com/data/">Bikeshare data portal</a></li>
<li><a href="https://data.gov.uk/">UK Gov Data</a></li>
<li><a href="https://www.kaggle.com/datasets">Kaggle datasets</a></li>
<li><a href="http://openintrostat.github.io/openintro/">OpenIntro datasets</a></li>
<li><a href="https://worlddata.ai/">World Data AI</a></li>
<li><a href="https://github.com/awesomedata/awesome-public-datasets">Awesome public datasets</a></li>
<li><a href="https://chronicdata.cdc.gov/Youth-Risk-Behaviors/DASH-Youth-Risk-Behavior-Surveillance-System-YRBSS/q6p7-56au">Youth Risk Behavior Surveillance System (YRBSS)</a></li>
<li><a href="https://www.icpsr.umich.edu/icpsrweb/content/ICPSR/fenway.html">PRISM Data Archive Project</a></li>
<li><a href="https://dataverse.harvard.edu/">Harvard Dataverse</a></li>
<li>If you know of others, let me know, and we’ll add here…</li>
</ul>

Finally, some of your professors likely do research in areas that may be of interest to you. This means that they may have datasets that they would like help analyzing. If you wish to ask a professor whether they would be willing to share a dataset with you for this project, then please talk to me first. **I repeat - do not contact your professors without first talking to me.** I will reach out to them to explain the scope of the project, and then you may talk to them afterwards. Note: It is quite possible that a professor will decline to share data, so you should start this process early.

## C. Deliverables

1. Proposal - due **[Friday Oct. 29 at 11:59 PM]**
1. Project Update 1: Initial Analysis/Descriptive Statistics - due **[Fri Nov. 12 at 11:59 PM]**
1. \*Project Update 2: Initial Report With Preview Video - due **[Fri Dec. 3 at 11:59 PM]**
1. Preview Video Peer Feedback: due **[Mon Dec. 6 at 11:59 PM]**
1. Final Report:  Due at 11:59 PM on the day of the final exam period.
1. Presentation: Will present during the final exam period - Slides due at the start of class.

---

## D. Proposal

This is a summary of your data as well as a list of questions you wish to address.  You will submit:

* An ipynb file with any csv's that I need to run it  (See below for sections required by your ipynb)
* A link to your team's google collaborate file
* A partner review 

The ipynb must contain:

* an Introduction Section and 
* a Data Section.

**What to include in Section 1 - Introduction:** 

The introduction should explain the topic you plan to explore. 

There should be a main/overarching research question that you are trying to answer.  Here are some examples from last year.  Notice these questions are open ended and cannot be answered with just a few lines of code.
* Did excitement for games grow after the NHL changed its rules in 2016?
* What variables lead to higher graduation rates among private colleges?
* Which variables contribute to frequent rentals of Airbnbs in Hawaii?
* Are the salaries of female CEOs on par with those of their male counterparts in the USA?

Then you must include at least 5 **mature** questions involving multiple variables (the features in the columns) from your data that could help you assess the answer to your question.  These questions should help you identify patterns and relationships amongst variables in the data.  Here are some examples:
* Did team A perform better than team B in all factors, or just points scored?
* Did the number of points scored before/after the 2016 rule changes increase/decrease?
* Does the length of an Airbnb rental increase as price decreases?
* **Do coastal Airbnbs in Hawaii cost more than inland rentals?**
* **Do the babies born of non-smoking mothers weigh more than those born of smoking mothers?**

**NOTE:** At least two of your questions must compare **2 or more groups/categories** within your data set, like the 2 bolded above.
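
A group comparison like the bolded questions might start from something like the following minimal sketch. It assumes pandas and uses invented numbers with hypothetical column names (`location`, `price`); it is not drawn from any real dataset.

```python
import pandas as pd

# Invented listings; a real project would load these from a file.
listings = pd.DataFrame({
    "location": ["coastal", "coastal", "inland", "inland"],
    "price": [300.0, 260.0, 150.0, 170.0],
})

# Summarize the numeric variable separately for each category.
price_by_location = listings.groupby("location")["price"].agg(["count", "mean", "median"])
print(price_by_location)
```

A table of group summaries is only a starting point. A mature answer would pair it with a visualization and an interpretation of how large and how consistent the difference is.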

Here are examples of **non-mature** questions.  Yes, you may have to ask/answer these questions while performing your actual analysis, but they don't show relationships between variables and can be answered with just one line of code.
* What was the total income made by company A last year?
* What categories were possible in column XXX?

**What to Include In Section 2 - Data:**  This section will explain your data:
* the websites it came from, 
* how it was collected, 
* what the variables/columns are, etc.  

In this section you will also load your data and show its first several rows. Each variable should be explained in detail in a bulleted list.
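
Loading the data and showing its first rows could look like the sketch below. It assumes pandas and reads from an in-memory string so that it runs on its own; with a real file you would pass its path to `pd.read_csv` instead (the file name in the comment is hypothetical, as are the columns).

```python
import io

import pandas as pd

# Stand-in for a real file. With your own data you would write something like
# pd.read_csv("my_data.csv"), where "my_data.csv" is a hypothetical file name.
csv_text = io.StringIO(
    "island,bedrooms,price\n"
    "Oahu,1,120.0\n"
    "Maui,2,250.5\n"
    "Kauai,3,310.0\n"
    "Oahu,2,180.0\n"
    "Maui,1,140.0\n"
    "Kauai,4,420.0\n"
)
df = pd.read_csv(csv_text)

print(df.shape)          # (number of rows, number of columns)
first_rows = df.head()   # first five rows by default
print(first_rows)
```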

### Proposal Grading Scheme

The grading scheme for the project proposal is as follows.

<table style="width:99%;">
<colgroup>
<col width="70%" />
<col width="30%" />
</colgroup>
<thead>
<tr class="header">
<th>Total*</th>
<th>20 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Data Section</td>
<td>5 pts</td>
</tr>
<tr class="even">
<td>Introduction Section</td>
<td>5 pts</td>
</tr>
    
<tr class="odd">
<td>Maturity of Questions/Topic</td>
<td>5 pts</td>
</tr>
<tr class="odd">
<td>Professionalism/Grammar</td>
<td>4 pts</td>
</tr>
<tr class="even">
<td>Teamwork</td>
<td>1 pt</td>
</tr>
</tbody>
</table>

<em>\*If it becomes clear that you have contributed little to nothing to the project, then you may not receive the same score as your teammates.</em>

---

## E. Project Update 1: 
## Initial Analysis/Descriptive Statistics

This is an update showing me you are making progress.  You will submit:

* An ipynb file with any csv's that I need to run it  (See below for sections required by your ipynb)
* A link to your team's google collaborate file
* A partner review 


For the project update, your goal is to prove to me that you have begun performing your initial data analysis. It should be clear that you are attempting to answer the questions that you posed in your proposal and that you are gaining support for or against certain hypotheses as you make them. It should also be clear that you are discovering new results and applying new techniques (transforming columns via functions, including new visualizations, etc.).
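
As one illustration of transforming a column via a function, the sketch below derives a new categorical column from a numeric one. It assumes pandas; the function `price_band`, its cutoffs, and the data are all made up for the example.

```python
import pandas as pd


def price_band(price):
    """Label a nightly price as 'budget', 'mid', or 'premium' (arbitrary cutoffs)."""
    if price < 150:
        return "budget"
    if price < 300:
        return "mid"
    return "premium"


listings = pd.DataFrame({"price": [120.0, 250.5, 310.0, 95.0]})

# Apply the function to every value in the column and store the result.
listings["price_band"] = listings["price"].apply(price_band)
print(listings)
```

A derived column like this can then serve as a grouping variable in later comparisons or visualizations.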

**DIRECTIONS:** First, update Sections 1-2 of your proposal document per my feedback. Then begin adding content to the third section of your ipynb file.
    
**Section 3 - Initial Analysis**
    
In Section 1, you offered questions you wished to consider. In this section, try to answer these questions with descriptive statistics and visualizations.  Include at least 5 visualizations that tell a compelling story about your data set.  

You must explain what you learn from these statistics and visualizations. After every code cell that produces a statistic or a visualization, there should be a markdown cell explaining what you learned from that code cell.
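
A code cell of this kind might resemble the following sketch, which computes descriptive statistics by group and draws a box plot. It assumes pandas and matplotlib are among the packages you are using; the data and column names are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented data; column names are hypothetical.
babies = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "birth_weight_oz": [125, 130, 118, 110, 121, 105],
})

# Descriptive statistics for each group.
summary = babies.groupby("smoker")["birth_weight_oz"].describe()
print(summary)

# Collect one array of weights per group for the box plot.
labels, samples = [], []
for status, weights in babies.groupby("smoker")["birth_weight_oz"]:
    labels.append(status)
    samples.append(weights.to_numpy())

fig, ax = plt.subplots()
ax.boxplot(samples)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_xlabel("Mother smokes")
ax.set_ylabel("Birth weight (oz)")
ax.set_title("Birth weight by smoking status (made-up data)")
```

The markdown cell that follows would then interpret the output, for example by stating which group has the higher mean and median, how much the distributions overlap, and what that suggests about the question being asked.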

**Before submitting**, reread Section 1 and decide if you need to make any updates. Perhaps your discussions with me have led you to spin off in a new direction, so you may have different or additional questions to add to your list. On the other hand, perhaps our discussions have led you to tweak your overarching theme as well. In that case, please update that too.

The grading scheme for the project update is as follows. 

<table style="width:99%;">
<colgroup>
<col width="80%" />
<col width="20%" />
</colgroup>
<thead>
<tr class="header">
<th>Total</th>
<th align = "center">30 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Visualizations/Procedures:  The team has clearly made progress since the proposal and is manipulating data in a way that suggests they are asking sound questions and not just throwing visualizations together for the update.</td>
<td align = "center">15 pts</td>
</tr>
<tr class="even">
<td>Relevancy:  The team is asking questions relevant to initial questions asked, or has discovered a new interesting path that warrants deviating from their initial analysis plan.</td>
<td align = "center">5 pts</td>
</tr>
<tr class="odd">
<td>Maturity of Interpretations/Explanation of Intermediate Results.</td>
<td align = "center">5 pts</td>
</tr>
<tr class="even">
<td>Professionalism/Grammar</td>
<td align = "center">5 pts</td>
</tr>

</tbody>
</table>

---

## F. Project Update 2: 

## Initial Report With Preview Video

Details to come, but basically, you are cleaning up your results and starting to create a final report with actual findings/conclusions.  Additionally, you are incorporating a video for your peers to watch and to ask questions/offer suggestions.