# CMPINF0010 Lab Final Project

Your final project is, at least ideally, the culmination of everything you've learned in the course so far. You'll be using many of the big ideas you've learned, your Python skills, the command line and `git`, and plenty of pandas and data viz.

You will be working in teams (assigned below) to make a data-driven argument that answers the following question:

## What is the best neighborhood in Pittsburgh?

Using data from the WPRDC, you will create a data-driven argument to support your claim about the "best" neighborhood in Pittsburgh.

With your group members, you will be creating a Jupyter notebook to demonstrate your argument and the data analysis you did to support it. You will present your arguments to the class in the last few weeks of lab.

To answer this question you need to do the following:

* Come up with a team name!
* As a group, come up with some way of defining and measuring "bestness." This doesn't have to be a serious metric; it could be whimsical, or secretly "worstness."
* Use at least two datasets in your argument.
* Create a git repository to store your data and notebooks and code.


**Note**: There is a lot of subjectivity here. To wit: *what does "best" mean?* One of your tasks in this project is to come up with your own metric for "best" and then use it to analyze data to determine the best neighborhood. The goal of this final project is to work as a team to develop a metric, apply it, and write up the results.

You could define "best" as the smartest and then define smartness as "number of advanced degree holders living in the neighborhood." Or you might define best by the ratio of the "number of potholes" to the "number of trees" in the neighborhood (whether lower or higher is better is your call). How you want to measure bestness is up to you.
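
To make the pothole-to-tree idea concrete, here is a minimal sketch of how a ratio metric could be computed once you have two per-neighborhood tables. The neighborhood names, column names, and counts below are all made up for illustration; real numbers would come from whichever WPRDC datasets your group picks.

```python
import pandas as pd

# Made-up counts for illustration only (hypothetical neighborhoods and columns)
potholes = pd.DataFrame({
    "neighborhood": ["Alpha Hill", "Beta Flats", "Gamma Park"],
    "potholes": [30, 12, 45],
})
trees = pd.DataFrame({
    "neighborhood": ["Alpha Hill", "Beta Flats", "Gamma Park"],
    "trees": [300, 60, 900],
})

# Combine the two datasets on the column they share
combined = potholes.merge(trees, on="neighborhood")

# The metric: potholes per tree (here we decide that lower is better)
combined["potholes_per_tree"] = combined["potholes"] / combined["trees"]

# Sort so the "best" neighborhood under this metric comes first
ranked = combined.sort_values("potholes_per_tree")
best = ranked.iloc[0]["neighborhood"]

print(ranked)
print("Best by this metric:", best)
```

Whether lower or higher counts as "best" is a choice your group has to defend; flipping it only means passing `ascending=False` to `sort_values`.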

----

You'll be dealing with WPRDC data to talk about Pittsburgh, so here's a guide to working with the WPRDC.

In [1]:
import pandas as pd
import numpy as np

## The WPRDC, or, how to deal with open data

Last week, you dug through the WPRDC and had a chance to see the vast and varied data available there. During this project, you'll be dealing with quite a few datasets and data sources, but luckily they'll all be *open data*. Open data is, according to [Albert Lin](http://www.wprdc.org/news/so-you-want-to-use-open-data/) of the WPRDC, "a complete set of primary data made easily and permanently available in a timely fashion using electronic, machine readable, open file formats. Cost should not pose a barrier to accessing information, and no unreasonable restrictions should limit accessibility, sharing and re-use." 

### A brief guide to the WPRDC

Your primary source of data to analyze will be the Western Pennsylvania Regional Data Center, or WPRDC. We've used WPRDC data a lot over the past few weeks, but this is the first time you'll be exploring the vast expanses of its data.

There's a *lot* of data available, so it's important to know how to search through it and find what you want. 

The primary way you'll find relevant datasets is through **keyword search**. From the master list of all datasets available at [data.wprdc.org](https://data.wprdc.org), just search for what you're looking for! For example, when I was looking for a list of all of the bus stops on Port Authority routes, I searched "bus stops," and the first (and only) result was exactly what I wanted: [this lovely dataset](https://data.wprdc.org/dataset/port-authority-of-allegheny-county-transit-stops).
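
If you'd rather search from a notebook than from the website, the sketch below shows one way it might be done. It assumes the WPRDC portal exposes the standard CKAN Action API (`package_search`) at `data.wprdc.org`, which is not something this guide confirms; the helper names `build_search_url` and `search_titles` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://data.wprdc.org"  # assumption: the portal is a CKAN site


def build_search_url(keywords, rows=5):
    """Build a CKAN package_search URL for the given keywords."""
    query = urlencode({"q": keywords, "rows": rows})
    return f"{BASE}/api/3/action/package_search?{query}"


def search_titles(keywords, rows=5):
    """Return the titles of matching datasets (needs an internet connection)."""
    with urlopen(build_search_url(keywords, rows)) as response:
        payload = json.load(response)
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Building the URL works offline; call search_titles("bus stops") to run the search
print(build_search_url("bus stops"))
```

The search box on the website does the same job, so treat this as an optional convenience.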

However, searching around can be fruitless if you're not exactly sure what kind of data are available; fortunately, the WPRDC groups their datasets into topic-centered **categories**, like health, public safety, and housing. The full list of categories is available here: [data.wprdc.org/group](https://data.wprdc.org/group). If you can't find the specific data you want by searching for it, or if you don't know what data you want exactly, try browsing through the relevant category or categories and see what you find.

Finally, you can also sort by **publisher**. Each government agency or private group that displays data through the WPRDC has their own page listing all of the data they provide. You can view all of the organizations here: [data.wprdc.org/organization](https://data.wprdc.org/organization).

### What to do once you've got your data

When you've found a dataset you're going to use in your analysis, it's important to consider how you should treat your data. There are basically two options to consider here: streaming the data online, or downloading a copy locally. The good news is that you've done both of these kinds of data use before!

**Streaming the data** is essentially just reading in your data from a link on the internet. We did this with the 311 data in a past week: 

In [2]:
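# run this if you have trouble loading the data (for example, an SSL certificate error);
# note that it turns off HTTPS certificate verification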

import ssl

ssl._create_default_https_context = ssl._create_unverified_context

In [4]:
# load the 311 data directly from the WPRDC and parse dates directly
pgh_311_data = pd.read_csv("https://data.wprdc.org/datastore/dump/76fda9d0-69be-4dd5-8108-0de7907fc5a4",
                           index_col="CREATED_ON", 
                           parse_dates=True)
pgh_311_data.head()

| CREATED_ON | REQUEST_ID | REQUEST_TYPE | REQUEST_ORIGIN | STATUS | DEPARTMENT | NEIGHBORHOOD | COUNCIL_DISTRICT | WARD | TRACT | PUBLIC_WORKS_DIVISION | PLI_DIVISION | POLICE_ZONE | FIRE_ZONE | X | Y | GEO_ACCURACY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-15 14:53:00 | 203364.0 | Street Obstruction/Closure | Call Center | 1 | DOMI - Permits | Central Northside | 1.0 | 22.0 | 42003220000.0 | 1.0 | 22.0 | 1.0 | 1-7 | -80.016716 | 40.454144 | EXACT |
| 2017-11-29 09:54:00 | 200800.0 | Graffiti | Control Panel | 1 | Police - Zones 1-6 | South Side Flats | 3.0 | 16.0 | 42003160000.0 | 3.0 | 16.0 | 3.0 | 4-24 | -79.969952 | 40.429243 | APPROXIMATE |
| 2017-12-01 13:23:00 | 201310.0 | Litter | Call Center | 1 | DPW - Street Maintenance | Troy Hill | 1.0 | 24.0 | 42003240000.0 | 1.0 | 24.0 | 1.0 | 1-2 | -79.985859 | 40.459716 | EXACT |
| 2017-11-22 14:54:00 | 200171.0 | Water Main Break | Call Center | 0 | Pittsburgh Water and Sewer Authority | Banksville | 2.0 | 20.0 | 42003200000.0 | 5.0 | 20.0 | 6.0 | 4-9 | -80.03421 | 40.406969 | EXACT |
| 2017-10-12 12:46:00 | 193043.0 | Guide Rail | Call Center | 1 | DPW - Construction Division | East Hills | 9.0 | 13.0 | 42003130000.0 | 2.0 | 13.0 | 5.0 | 3-19 | -79.876582 | 40.451226 | EXACT |


As you see here, we can just give `pandas` a link to a data file from the internet and it'll just handle it; it's pretty great that way. And it was good to use that for something that keeps updating continuously, like Pittsburgh 311 calls. (As I write this, the most recent update was 4 minutes ago.) 

However, links on the web are *unstable*: they can move or be taken down, and there's no guarantee that there will be a good internet connection everywhere. So, it's always a good idea to have a copy of any online data as a backup.
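
One way to guard against a dead link is to save a copy whenever the download succeeds and fall back to that copy when it doesn't. The sketch below is only an illustration: `load_with_backup` is a hypothetical helper, not something from `pandas`, and the demo uses a missing local file to stand in for a broken link so that it runs without internet.

```python
import os
import tempfile

import pandas as pd


def load_with_backup(source, backup_path):
    """Read from the live source; if that fails, read the saved backup instead."""
    try:
        data = pd.read_csv(source)
    except OSError:
        # Network failures and missing files are both kinds of OSError
        return pd.read_csv(backup_path)
    # The download worked, so refresh the backup for next time
    data.to_csv(backup_path, index=False)
    return data


# Demo with made-up data: a backup exists, but the "link" is dead
workdir = tempfile.mkdtemp()
backup = os.path.join(workdir, "backup.csv")
pd.DataFrame({"neighborhood": ["Alpha Hill"], "requests": [3]}).to_csv(backup, index=False)

dead_link = os.path.join(workdir, "not_here.csv")  # stands in for a broken URL
recovered = load_with_backup(dead_link, backup)
print(recovered)
```

In your project, `source` would be the WPRDC link and `backup_path` a file inside your git repository.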

If you're dealing with data that isn't actively changing, like results from a diabetes study that occurred in 2015, it's probably best to **use a local copy of the data** for your analysis. You've done this a lot; it's why our data analysis labs have been in GitHub repos that you `git clone` instead of just downloading a notebook. Managing a bunch of files can be fiddly, but GitHub makes it easy to distribute and manage projects. You can download local copies of WPRDC data from the site and use them in your projects like any other file we've worked with, using the `read_table()` and `read_csv()` functions in `pandas`.
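
As a quick reminder of how the two functions differ: `read_csv()` expects commas between values by default, while `read_table()` expects tabs. The sketch below uses made-up data held in strings so it runs anywhere; with a real download you would pass the file name instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# The same made-up table, once comma-separated and once tab-separated
csv_text = "neighborhood,playgrounds\nAlpha Hill,2\nBeta Flats,5\n"
tsv_text = "neighborhood\tplaygrounds\nAlpha Hill\t2\nBeta Flats\t5\n"

from_csv = pd.read_csv(io.StringIO(csv_text))    # comma is the default separator
from_tsv = pd.read_table(io.StringIO(tsv_text))  # tab is the default separator

# Both routes produce the same DataFrame
print(from_csv.equals(from_tsv))
```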

**Note**: You may encounter some weird filetypes that we haven't dealt with in class. One I ran into was a `.dbf` file, which is an older type of database. To deal with `.dbf` files, we can use a module called `geopandas`, which normally is used for doing spatial/map stuff in data analysis. (You'll use `geopandas` elsewhere in your project, most likely.)

To read in a `.dbf` file (I'm using the Port Authority's bus stop data as an example), do the following: 

In [9]:
import geopandas as gpd

dbf = r'PAAC_Stops_1611.dbf' # path to the .dbf file (this line does not open it yet)

table = gpd.read_file(dbf) # geopandas reads the file into a GeoDataFrame

pdtable = pd.DataFrame(table) # convert to a plain pandas DataFrame
pdtable.head()

|  | InternalID | Name | ExternalID | Direction | Lat | Long | Time_Point | NewZone | No_Rts_Ser | Routes_161 | Mode | Public_She | Public_Sto | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S00010 | 10TH AVE AT ANN ST | 11652.0 | Inbound | 40.406334 | -79.908116 | No | 1A | 1.0 | 53L | Bus | No Shelter | Bus Stop |  |
| 1 | N71237 | 10TH AVE AT GARFIELD ST | 20654.0 | Outbound | 40.606766 | -79.753111 | No | 2 | 1.0 | P10 | Bus | No Shelter | Bus Stop |  |
| 2 | N71239 | 10TH AVE AT ORMOND ST | 20656.0 | Outbound | 40.608314 | -79.7505 | No | 2 | 1.0 | P10 | Bus | No Shelter | Bus Stop |  |
| 3 | N71238 | 10TH AVE AT SUMMIT ST | 20655.0 | Outbound | 40.607561 | -79.75176 | No | 2 | 1.0 | P10 | Bus | No Shelter | Bus Stop |  |
| 4 | S00080 | 12TH AVE AT AMITY ST | 11650.0 | Inbound | 40.404325 | -79.908139 | No | 1A | 2.0 | 53, 53L | Bus | No Shelter | Bus Stop |  |


`geopandas` can read some filetypes that `pandas` doesn't natively support. If you encounter any other filetypes and the normal functions don't work, remember to use [Read-Search-Ask](https://medium.freecodecamp.org/read-search-dont-be-afraid-to-ask-743a23c411b4) and try to find a solution on the internet. Someone's dealt with your situation before.

---

## Groups 

You'll be broken into groups, which were generated randomly (using pandas).

In [7]:
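students = pd.read_csv("students.csv")
students = students.sample(frac=1)  # shuffle the roster into a random order

# split the row positions into 12 roughly equal chunks, one per group
chunks = np.array_split(np.arange(len(students)), 12)

# label each chunk of students with its group number
labeled = [students.iloc[rows].assign(**{"Group Number": i})
           for i, rows in enumerate(chunks)]

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
glist = pd.concat(labeled).sort_index()
glist.to_csv("groups.csv")
glist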


Once you've found your group partners, link up with them (with the emails provided, or through whatever other means you wish). Each group will turn in **one copy of the project**.

---

## Progress Presentations

In the last two weeks of lab, your group will need to present your argument for what is the best neighborhood in Pittsburgh. You will also have to talk about your metric and how you applied it. Expect to have a discussion about your team’s metric and get feedback to incorporate into the final report. 

Presentations should last a maximum of 10 minutes, with 5 minutes of discussion.

You can prepare slides or you can present from a notebook directly. We understand your analysis will be in-progress, but you should have some initial results. 

## Final Report

It is best to think about your final project as a data-driven report. You will need to put everything into a Jupyter notebook with the following structure:

* **Introduction:** Introduce the project and your approach. Talk about how you came up with the metric and any alternatives you explored.
* **The Metric:** Describe your metric. What features are you measuring? What datasets are you using?
* **The Best Neighborhood:** Apply the metric from the previous section to determine the best neighborhood in Pittsburgh. Beyond just executing code, provide narrative about why you think this is the best neighborhood. Incorporate a data visualization, perhaps to rank all of the neighborhoods or show a neighborhood's best-ness over time. The key is to make a data-driven argument.
* **Conclusion:** Reflect on how the data-driven determination of "best neighborhood" is the same or different from your personal favorite neighborhood. Each member of the group should write their own response to this.
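
For the ranking visualization mentioned in the list above, a horizontal bar chart is one simple option. This is only a sketch with made-up scores and hypothetical neighborhood names; it assumes `matplotlib` is installed (pandas uses it for plotting) and that lower scores are better.

```python
import pandas as pd

# Made-up metric values for illustration (lower = better in this example)
scores = pd.Series(
    {"Alpha Hill": 0.10, "Beta Flats": 0.20, "Gamma Park": 0.05},
    name="potholes_per_tree",
)

# Rank the neighborhoods: sorting ascending puts the best one first
ranked = scores.sort_values()

# Draw one horizontal bar per neighborhood
ax = ranked.plot.barh(title="Potholes per tree by neighborhood (made-up data)")
ax.set_xlabel("potholes per tree")
ax.invert_yaxis()  # barh draws the first row at the bottom; flip so the best is on top
```

A sorted chart like this lets your audience see both the winner and how close the runners-up are.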

### Tips

* Make this fun! Don’t make this harder than it needs to be. You define your metrics, so you are entirely free to pick a metric that is easy to calculate, but reveals something fun about Pittsburgh. You can also be as fancy as you want, and if you want to do something big and impressive, go for it!
* Show your work, but not all of it. When working on a project, I often have a couple sets of notebooks. One is a set of working notebooks where I do my data exploration, hack on the code, and go down wrong paths. When I finally am able to pull together the code that generates the result I want, I move just that code into my “final draft.” I only want to see the data cleaning, transformation, analysis, and visualization associated with the specific argument you are making.


## Collaboration

You should use a shared GitHub repository to collaborate with your group, so that you can work on the project and be sure that you all have the same version. One person should create a repository for the project, and then they'll need to invite the other members of the group to collaborate. You can see how to do that in this [GitHub article](https://help.github.com/articles/inviting-collaborators-to-a-personal-repository/). 

For programming, I'd draw your attention to the software development concept of "pair programming". It's defined [here](https://www.agilealliance.org/glossary/pairing/):

> "Pair programming consists of two programmers sharing a single workstation (one screen, keyboard and mouse among the pair). The programmer at the keyboard is usually called the "driver", the other, also actively involved in the programming task but focusing more on overall direction is the "navigator"; it is expected that the programmers swap roles every few minutes or so."

If you and your partners want to work together to solve the larger coding/data analysis problems, pair programming can be a good framework to try to operate in. It, however, is just a suggestion.

## Completing the Report and Presentation

The **final report** is due by **11:59pm, on Friday 6 December 2019**. 

You'll have this week and next to work on your presentation and the project in lab and independently, with your groups.

The final two lab sessions will consist of **student presentations**: each group will give a short presentation on their chosen neighborhood. This presentation will be a lot like your lab lecture; you'll be presenting from your Jupyter notebook. You should prepare this with your group ahead of time, taking the time to write yourself speaker notes and get a sense of who will be presenting what.

Next week, the groups will select their **presentation times**. 