---

## CS 260 Final Project Overview

Pick a dataset, any dataset…

…and tell me a story using visualizations and the techniques we've used this semester.

That is your final project in a nutshell. More details below.

---

## A. Project Goal

The final project for this class will involve an analysis of a dataset of your team's own choosing. You can choose the data based on your interests or based on work in other courses or research projects. The intent of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.

Your main goal is to prove to me that:
* you are proficient at asking meaningful questions and answering them with the results of basic data analysis and visualization, 
* you are proficient in using Python, 
* you are proficient at interpreting results, and 
* you are able to tell a cohesive story when you present your results. 

The project is very open ended. You should create compelling visualizations of this data in Python and then interpret them. (You can even use statistical techniques or visualizations we haven’t officially covered in class, if you’re feeling adventurous.)  There is no limit on what tools or packages you may incorporate, but you must also incorporate packages we learned in class. You do not need to visualize all of the data at once. A handful of high quality visualizations that tell a compelling story will receive a much higher grade than a large number of poor quality visualizations that are unrelated to each other. Also pay attention to your presentation. Neatness, coherency, grammar, cohesiveness, and clarity will count. All analyses must be done in Jupyter Lab/Notebooks, using Python.

---

## B. Data

In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. 

As such, your dataset must have at least 250 observations (rows) and between 10 and 20 variables (exceptions can be made, but you must speak with me first). The dataset’s variables should include a mix of:

* categorical variables, 
* discrete numerical variables, and  
* continuous numerical variables. 

Ideally, your data set should be in an Excel file, a CSV file, or some type of table that can be converted to one of these formats. If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into Python, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.
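As a quick self-check, the size and variable-mix guidelines above can be verified right after loading. The sketch below assumes pandas is one of the packages your team uses. The tiny inline table and the helper name `check_dataset` are hypothetical stand-ins for illustration only, so the sample deliberately fails the size check.

```python
import io

import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_numeric_dtype

# Hypothetical stand-in for your own file. In practice you would call
# pd.read_csv("your_file.csv") or pd.read_excel("your_file.xlsx") instead.
csv_text = io.StringIO(
    "species,visits,weight_kg\n"
    "cat,3,4.2\n"
    "dog,1,12.5\n"
    "cat,2,3.9\n"
)
df = pd.read_csv(csv_text)


def check_dataset(df, min_rows=250, min_cols=10, max_cols=20):
    """Report whether a DataFrame meets the project's size guidelines."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        "enough_rows": n_rows >= min_rows,
        "columns_in_range": min_cols <= n_cols <= max_cols,
        # Non-numeric columns are candidates for categorical variables.
        "non_numeric_columns": [c for c in df.columns if not is_numeric_dtype(df[c])],
        # Integer columns are candidates for discrete numerical variables.
        "integer_columns": [c for c in df.columns if is_integer_dtype(df[c])],
        # Float columns are candidates for continuous numerical variables.
        "float_columns": [c for c in df.columns if is_float_dtype(df[c])],
    }


report = check_dataset(df)
print(report)
```

Treat the column types as a hint rather than a verdict: a ZIP code stored as an integer, for example, is really a categorical variable.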

**Note on reusing datasets from class:** Do *not* reuse datasets used in examples, homework assignments, or labs in the class.

Below is a list of data repositories that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:

<ul>
<li><a href="https://www.statcrunch.com/datasets/shared">Stat Crunch</a></li>
<li><a href="https://github.com/rfordatascience/tidytuesday">TidyTuesday</a></li>
<li><a href="https://www.opendata.nhs.scot/">NHS Scotland Open Data</a></li>
<li><a href="https://edinburghopendata.info/">Edinburgh Open Data</a></li>
<li><a href="https://statistics.gov.scot/home">Open access to Scotland’s official statistics</a></li>
<li><a href="https://www.bikeshare.com/data/">Bikeshare data portal</a></li>
<li><a href="https://data.gov.uk/">UK Gov Data</a></li>
<li><a href="https://www.kaggle.com/datasets">Kaggle datasets</a></li>
<li><a href="http://openintrostat.github.io/openintro/">OpenIntro datasets</a></li>
<li><a href="https://worlddata.ai/">World Data AI</a></li>
<li><a href="https://github.com/awesomedata/awesome-public-datasets">Awesome public datasets</a></li>
<li><a href="https://chronicdata.cdc.gov/Youth-Risk-Behaviors/DASH-Youth-Risk-Behavior-Surveillance-System-YRBSS/q6p7-56au">Youth Risk Behavior Surveillance System (YRBSS)</a></li>
<li><a href="https://www.icpsr.umich.edu/icpsrweb/content/ICPSR/fenway.html">PRISM Data Archive Project</a></li>
<li><a href="https://dataverse.harvard.edu/">Harvard Dataverse</a></li>
<li>If you know of others, let me know, and we’ll add here…</li>
</ul>

Finally, some of your professors likely do research in areas that may be of interest to you. This means that they may have datasets that they would like help analyzing. If you wish to ask a professor if they would be willing to share a data set with you for this project, then please talk to me first. **I repeat - do not contact your professors without first talking to me.** I will reach out to them to explain the scope of the project, and then you may talk to them afterwards. Note: It is quite possible that your professor will decline to share data, so you should start this conversation early.

## C. Deliverables

1. Proposal - due **[Friday Oct. 29 at 11:59 PM]**
1. Project Update 1: Initial Analysis/Descriptive Statistics - due **[Fri Nov. 12 at 11:59 PM]**
1. Project Update 2: Almost-Final Analysis and Small Group Presentation: due **[Tuesday Dec. 7 at 11:59 PM]**
1. Final Report:  Due on the day of the final exam period **[Monday Dec. 13 at 11:59 PM]**
1. Presentation: Will present during the final exam period - Slides due at the start of class - **[Monday Dec. 13 at 11:59 PM]**.

---

## D. Proposal

This is a summary of your data as well as a list of questions you wish to address.  You will submit:

* An ipynb file with any csv's that I need to run it  (See below for sections required by your ipynb)
* A link to your team's google collaborate file
* A partner review 

The ipynb must contain:

* an Introduction Section and 
* a Data Section.

**What to include in Section 1 - Introduction:** 

The introduction should explain the topic you plan to explore. 

There should be a main/overarching research question that you are trying to answer.  Here are some examples from last year.  Notice these questions are open ended and cannot be answered with just a few lines of code.
* Did excitement for games grow after the NHL changed its rules in 2016?
* What variables lead to higher graduation rates among private colleges?
* Which variables contribute to frequent rentals of Airbnbs in Hawaii?
* Are the salaries of female CEOs on par with those of their male counterparts in the USA?

Then you must include at least 5 **mature** questions involving multiple variables (the features in the columns) from your data that could help you assess the answer to your question.  These questions should help you identify patterns and relationships amongst variables in the data.  Here are some examples:
* Did team A perform better than team B in all factors, or just points scored?
* Did the number of points scored before/after the 2016 rule changes increase/decrease?
* Does the length of the Airbnb rental increase as price decreases?
* **Do coastal Airbnbs in Hawaii cost more than inland rentals?**
* **Do the babies born of non-smoking mothers weigh more than those born of smoking mothers?**

**NOTE:** At least two of your questions must compare **2 or more groups/categories** within your data set, like the 2 bolded above.

Here are examples of **non-mature** questions.  Yes, you may have to ask/answer these questions while performing your actual analysis, but they don't show relationships between variables and can be answered with just one line of code.
* What was the total income made by company A last year?
* What categories were possible in column XXX?
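To make the contrast concrete, here is a minimal sketch, assuming pandas and a made-up rentals table (the column names `location`, `price`, and `nights` are hypothetical). The first two statements answer non-mature questions in a single line each. The last one compares two groups across more than one variable, which is the kind of relationship a mature question explores.

```python
import pandas as pd

# Made-up data for illustration only.
rentals = pd.DataFrame({
    "location": ["coastal", "inland", "coastal", "inland", "coastal", "inland"],
    "price": [220, 140, 260, 120, 240, 160],
    "nights": [3, 5, 2, 6, 4, 7],
})

# Non-mature: each is a one-line lookup on a single column.
categories = sorted(rentals["location"].unique())
total_price = rentals["price"].sum()

# Closer to mature: compare groups (coastal vs. inland) on two variables.
by_location = rentals.groupby("location")[["price", "nights"]].mean()
print(by_location)
```

Even the grouped table is only a starting point: a mature question also asks why the groups differ and whether other variables might explain the gap.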

**What to Include In Section 2 - Data:**  This section will explain your data:
* the websites it came from, 
* how it was collected, 
* what the variables/columns are, etc.  

In this section you will also load your data and show its first several rows. The variables should be explained in detail in a bulleted list.

### Proposal Grading Scheme

The grading scheme for the project proposal is as follows.

<table style="width:99%;">
<colgroup>
<col width="70%" />
<col width="30%" />
</colgroup>
<thead>
<tr class="header">
<th>Total*</th>
<th>20 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Data Section</td>
<td>5 pts</td>
</tr>
<tr class="even">
<td>Introduction Section</td>
<td>5 pts</td>
</tr>
    
<tr class="odd">
<td>Maturity of Questions/Topic</td>
<td>5 pts</td>
</tr>
<tr class="odd">
<td>Professionalism/Grammar</td>
<td>4 pts</td>
</tr>
<tr class="even">
<td>Teamwork</td>
<td>1 pt</td>
</tr>
</tbody>
</table>

<em>\*If it becomes clear that you have contributed little to nothing to the project, then you may not receive the same score as your teammates.</em>

---

## E. Project Update 1: Initial Analysis/Descriptive Statistics

This is an update showing me you are making progress.  You will submit:

* An ipynb file with any csv's that I need to run it  (See below for sections required by your ipynb)
* A link to your team's google collaborate file
* A partner review 


For the project update, your goal is to prove to me that you have begun performing your initial data analysis.  It should be clear that you are attempting to answer the questions that you posed in your proposal and that you are gaining support for or against certain hypotheses as you make them. It should also be clear that you are discovering new results and applying new techniques (transforming columns via functions, including new visualizations, etc.).

**DIRECTIONS:** First, update Sections 1-2 of your proposal document per my feedback. Then begin adding content to the 3rd section in your ipynb file. 
    
**Section 3 - Initial Analysis**
    
In Section 1, you offered questions you wished to consider. In this section, try to answer these questions with descriptive statistics and visualizations.  Include at least 5 visualizations that tell a compelling story about your data set.  

You must explain what you learn from these statistics/visualizations.  So after every code cell with a statistic or a visualization, there should be a markdown cell explaining what you learned from the previous code cell.
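One way a code cell in this section might look, assuming pandas and a hypothetical birth-weight table (the columns `smoker` and `weight_oz`, the helper `ounces_to_pounds`, and all values are made up for illustration): transform a column with a function, then compute a descriptive statistic for each group.

```python
import pandas as pd

# Made-up data for illustration only.
births = pd.DataFrame({
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "weight_oz": [128, 112, 120, 104, 136, 120],
})


def ounces_to_pounds(oz):
    """Convert a weight in ounces to pounds."""
    return oz / 16


# Transform a column via a function to get a more readable unit.
births["weight_lb"] = births["weight_oz"].apply(ounces_to_pounds)

# Descriptive statistics for each group being compared.
summary = births.groupby("smoker")["weight_lb"].agg(["count", "mean", "median"])
print(summary)
```

The markdown cell after it would then interpret the table in words, for instance by stating which group has the higher mean weight, by how much, and how far a sample of this size can be trusted. A box plot of the same two groups (for example `births.boxplot(column="weight_lb", by="smoker")`, which requires matplotlib) would pair naturally with this statistic.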

**Before submitting**, reread Section 1 and decide if you need to make any updates.  Perhaps your discussions with me have led you to spin off in a new direction.  Therefore, you may have different/more questions to add to your list.  On the other hand, perhaps via our discussions, you've tweaked your initial/overarching theme as well.  In that case, please update that.

The grading scheme for the project update is as follows. 

<table style="width:99%;">
<colgroup>
<col width="80%" />
<col width="20%" />
</colgroup>
<thead>
<tr class="header">
<th>Total</th>
<th align = "center">30 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Visualizations/Procedures:  The team has clearly made progress since the proposal and is manipulating data in a way that suggests they are asking sound questions and not just throwing visualizations together for the update.</td>
<td align = "center">15 pts</td>
</tr>
<tr class="even">
<td>Relevancy:  The team is asking questions relevant to initial questions asked, or has discovered a new interesting path that warrants deviating from their initial analysis plan.</td>
<td align = "center">5 pts</td>
</tr>
<tr class="odd">
<td>Maturity of Interpretations/Explanation of Intermediate Results.</td>
<td align = "center">5 pts</td>
</tr>
<tr class="even">
<td>Professionalism/Grammar</td>
<td align = "center">4 pts</td>
</tr>
<tr class="even">
<td>Teamwork</td>
<td align = "center">1 pt</td>
</tr>
</tbody>
</table>

---

## F. Project Update 2:  Almost-Final Analysis and Small Group Presentation

For this update, the main goal is for you to get to the point where you have a COHESIVE story to tell about your data.  Also, your analyses should be almost complete upon the submission of this update.   When you give your final presentation/report (see the next section), the audience should clearly understand the point of your research question.  The audience should be able to identify the relevance between your question/topic and any visuals/statistics that you provide.  This update is meant to ensure your success during the final presentation and report.

So for this update, continue trying to answer your research question(s). 
* Make sure you have compared subgroups during your analyses, according to whatever comparison questions you included in your intro.  
* Add new statistics/visualizations to Section 3 Initial Analysis of your ipynb. 
* If you wound up taking a different path during the research, you should update your introduction with these details.  
* As before, after every code cell with a statistic or a visualization, there should be a markdown cell explaining what you learned from the previous code cell. 

Create a set of slides detailing your topic, main research question, and what results you have found.   You will informally present these slides to 1-2 teams during class on the day the project is due.  Though the presentation of them will  be informal, the slides must be very professional and formal.   The goal is for your peers to answer questions/offer good suggestions.
* There should be NO code in these slides.
* The slides should be professional with no typos.
* Usual conventions should be maintained:  succinct, consistent, nice flow, no long paragraphs, etc.
* These slides can be done via Google Slides or PowerPoint, your preference.  But if you submit a google slides link and I cannot open it, I will give no credit, so be sure to submit a link in the same way that you submit a SQL ZOO Lab link.
* Your slides should contain results/visuals that I have seen in other updates plus new results.

You will submit:
* Your updated ipynb file with any csv's that I need to run it, updated in the way I described above.
* A link to your team's google collaborate file.
* PowerPoint slides OR a link to your Google Slides.  If I cannot open the link, I will give no credit.
* There is no partner review at this time.

Rubric:


<table style="width:99%;">
<colgroup>
<col width="80%" />
<col width="20%" />
</colgroup>
<thead>
<tr class="header">
<th>Total</th>
<th align = "center">10 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Mature Slides</td>
<td align = "center">5 pts</td>
</tr>
<tr class="even">
<td>Continued Progress (ipynb and slides)</td>
<td align = "center">5 pts</td>
</tr>
</tbody>
</table>

---

## G. Presentation

Presentations will take place during the final exam period.  The presentation should last 7-10 minutes, with points taken off for speaking too long or too short, and each team member should say something substantial and speak for a decent amount of time.  

When you give your final presentation, the audience should clearly understand the point of your research question.  Also, the relevance of each visualization/statistic to your question should be clear.    

The presentation should be cohesive and have a natural flow.  **So your presentation should not just be an account of everything you tried ("We did this, then we did this, etc.").  Instead, it should convey what choices you made, and why, and what you found.** 

It makes sense that your presentation should include these items:  

* your research question 
* an explanation of the context/domain: Make your topic accessible.  For example, if analyzing hockey data, do not assume that everyone in the audience understands hockey jargon or the rules of hockey.  Tell the audience what they need to know to understand your project.
* an explanation of the variables
* explanation of your results/conclusions

At the end of your presentation, you should include these 2 slides.

* Future Work Slide - The second to last slide should be entitled "Future Work". Here, give ideas for what you would work on in the future if you had more time.  What additional questions would you want to answer?  What additional data do you wish you had?

* Lessons and Challenges Slide- The last slide of your presentation should be entitled "Lessons and Challenges".  Discuss:  What was the toughest part of the project?  What lessons do you feel you learned?

Some Notes: 
 * There isn't a limit to how many slides you can use, just the time limit indicated above.
 * There should NOT be code in your slides.
 * Be professional: No clunky or overly wordy slides.
 * Dress in business casual.  So no jeans/t-shirts.
 * Those who simply read their presentations off the slides will lose points. Make eye contact with the classroom.

During the presentations, you will provide feedback in the form of peer evaluations.  The presentation line-up will be generated randomly.

What to submit:

* A link to your google slides, OR if you used PowerPoint, a  PowerPoint presentation.

The grading scheme for the presentation is as follows:  

<table style="width:99%;">
<colgroup>
<col width="70%" />
<col width="30%" />
</colgroup>
<thead>
<tr class="header">
<th>Total</th>
<th align="left">60 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Time management: Did the team divide the time well amongst themselves? Were they over or under the time limit? Did everyone get a chance to say something meaningful about the project?</td>
<td>5 pts</td>
</tr>
<tr class="even">
<td>Teamwork: Did the team present a unified and cohesive story, or did it seem like independent pieces of work patched together?</td>
<td>10 pts  </td>
</tr>
<tr class="odd">
<td>Content: Are the visualizations relevant to the main research question? Were a variety of visualizations used?  Were subgroups/categories compared?  (Recall from the initial proposal that your analysis should include at least two questions that compare 2 or more groups/categories within your data set.)</td>
<td>5 pts</td>
</tr>
<tr class="even">
<td>Professionalism: How smoothly did the team present? Does the presentation appear to be well practiced?  Did the team members dress accordingly?
</td>
<td>5 pts</td>
</tr>
<tr class="odd">
<td>Content: Did the team use appropriate visualizations/stats and interpret the results accurately?</td>
<td>10 pts  </td>
</tr>
<tr class="even">
<td>Creativity and Critical Thought: Is the project carefully thought out? Is it clear that the procedures/visualizations are carefully considered? Does it appear that time and effort went into planning the visualizations and making them look appealing?</td>
<td>10 pts  </td>
</tr>
<tr class="odd">
<td>Slides: Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.?</td>
<td>10 pts  </td>
</tr>
<tr class="even">
<td>Feedback to peers: Did the team give meaningful feedback to their peers?</td>
<td>5 pts  </td>
</tr>
</tbody>
</table>

---

## H. Final Report 

The final report is a polished version of the notebook file you have been building.  It is meant to be a concise summary of your project.  The audience is someone who has never heard about your topic or results before.  

* **Header**:  Where your project says "CS 260 Proposal - Fall 2021", please change it to say "CS 260 Final Project - Fall 2021".

* **Introduction**:  Update your introduction and write it in an essay/paragraph format, not just bullet points.  Now that your analysis is over, the intro should be in the past or present tense, but not in the future tense.  It should be concise and straightforward. Include your main question/topic and a summary of the main results.  (The details of the results will be in the results section, of course.)  It is still fine to keep the bulleted list of questions that you included in your proposal, but make sure there is a sentence before the list explaining why the list is there.  In other words, the list should not just appear; there should be some wording that introduces it.

* **Data**:  You probably have few changes to make here.  Update your data section if you included any additional tables, OR if you did not end up using a data set described there, delete that information from the data section. Recall that per the proposal:
    * This section should explain your data: the websites it came from with a clickable link, how it was collected, what the variables/columns are, etc.  Also, this is the section where you load your data and show its first several rows. The variables should be explained in detail in a bulleted list. 


* **Results**:  Rename your "Initial Analysis" section to "Results".  This section should include a description of your most important results with supporting summary statistics, visualizations, and analyses from other procedures.  This section should discuss and explain the results and what is learned from any figures.  This section should flow very nicely from result to result.  It should *tell a compelling* story from the data.  It should not hop from picture to picture in an unrelated fashion. Do not opt for "quantity over quality".  Choose your most important results/findings.  Along the way, feel free to discuss/mention any critiques or deficiencies in your results that may be caused by limitations in your data.  Such challenges are realistic and points will not be deducted for those issues.

* **Conclusion**:  This section should summarize your findings one last time and bring your write-up to a close.  You should also describe what additional questions you would like to consider if you had more time and what data could be incorporated to shed more light on your topic.


You will submit:
* Your updated ipynb file with any csv's that I need to run it, updated in the way I described above.
* A link to your team's google collaborate file.

The grading scheme for the final report is as follows:  

<table style="width:99%;">
<colgroup>
<col width="70%" />
<col width="30%" />
</colgroup>
<thead>
<tr class="header">
<th>Total</th>
<th>25 pts</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Cohesion, storytelling, and flow in the results section</td>
<td>5 pts</td>
</tr>
 <tr class="odd">
<td>Maturity of interpretations/explanation of the findings in the results section.</td>
<td>5 pts</td>
</tr>
<tr class="even">
<td>A solid introduction updated per the given requirements</td>
<td>5 pts</td>
</tr>
<tr class="odd">
<td>Grammar, organization, professionalism</td>
<td>5 pts</td>
</tr>
<tr class="even">
<td>A thoughtful conclusion</td>
<td>5 pts</td>
</tr>
</tbody>
</table>


---

Look at your final project and feel proud - You have learned so much! Go you!