# HOMEWORK 1 / Data Manipulation, Pandas and Distance Metrics

__This assignment is worth up to 15 POINTS toward your grade total if you complete it and turn it in on time.  Late assignments will lose 15%.__


| Points Possible | Due Date |
|:---------------:|:--------:|
| 15 | Wednesday, Sep. 28 @ Midnight|

## OBJECTIVE
* Learn how to use Pandas for data ingest, manipulation and summarization.
* Learn to perform a simple exploratory data analysis to plot data and visualize outcomes.
* Understand how to use and interpret Pearson-_r_ for correlation.
* Implement $k$-Nearest Neighbors and see a distance metric in action on real data.

## WHAT TO TURN IN

You are encouraged to turn in the assignment using the provided Jupyter Notebook.  To do so, clone the repository and modify the `Homework1.ipynb` file in the `HOMEWORK/01` directory.

Turn in a copy of an ipynb file OR a PDF or Word document to Blackboard with the answers to the questions labeled with the &#167; sign.

## RESOURCES

### Datasets

| Dataset | Summary | Size |
|---------|---------|------|
|[Salt Lake City Water Usage Jan-July 2016](http://www.civicdata.io/dataset/slc_water_usage_1/resource/a2bc8285-d9ef-45ca-ae86-bf735bc5011a) | Real data that includes water usage for various property types, connection counts, etc. | > 40K instances, < 10 features |
|[Sacramento Real Estate transactions, May 15-21, 2008](http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv).  Also found [archived here in the repo](./Sacramentorealestatetransactions.csv).| Sacramento transaction data from real estate properties sold in a single week in 2008.  Includes latitude, longitude, prices, addresses, and basic property characteristics (# beds, square footage, etc.) | > 900 instances, < 10 features |

### Pandas

http://pandas.pydata.org/

### Jupyter Notebooks

http://jupyter.org

### Code 

You might need to use the following at the beginning of your Jupyter notebook:
```python
%matplotlib inline
import numpy as np
import pandas as pd
import requests
```

## HOW TO COMPLETE THIS ASSIGNMENT / ASSIGNMENT DETAILS

## PART 1 / Python Pandas

For this part you will need to load and analyze data using [Pandas](http://pandas.pydata.org). 

Familiarize yourself with the dataset from the [spatialkey.com](https://support.spatialkey.com/spatialkey-sample-csv-data/) website.  In particular, we will be working with the dataset here:
![sacramento dataset](./spatialkey_screenshot01.PNG)

###  &#167; Write the Python code to load the [CSV file from spatialkey](http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv) directly into a Pandas dataframe.  
 * You may need to familiarize yourself with the [IO Tools](http://pandas.pydata.org/pandas-docs/stable/io.html) functions of Pandas.
 * Load the file to your local file system where the running code exists (e.g. the CSV file should exist in the same location as your code or notebook).
 * Turn in the fragment of code that does this.
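
A minimal sketch of the save-then-load round trip, using a tiny inline CSV in place of the real download so that it runs offline. The file name `sample_transactions.csv`, the temporary directory and the column values are illustrative assumptions. `pd.read_csv` also accepts a URL string in place of a local path.

```python
import os
import tempfile

import pandas as pd

# Illustrative stand-in for the downloaded file contents (not the real data).
csv_text = "street,beds,price\n1 EXAMPLE ST,2,100000\n2 SAMPLE CT,3,150000\n"

# Save a local copy; a temporary directory keeps this sketch self-contained.
local_path = os.path.join(tempfile.mkdtemp(), "sample_transactions.csv")
with open(local_path, "w") as f:
    f.write(csv_text)

# read_csv takes a local path (or a URL) and returns a DataFrame.
df = pd.read_csv(local_path)
print(df.shape)
```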

In [1]:
import pandas as pd
print("my answer")

my answer


### &#167;  Compute the price per square foot for all properties.  Add this data back into the dataframe.  
 * See the documentation on [concatenating objects](http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects).
 * Turn in the code that does this, and show the results of `head()` and `tail()` on the new dataframe.
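
One possible way to add a derived column, sketched on toy data. Plain column assignment is used here rather than `pd.concat`; the column names `price`, `sq_ft` and `price_per_sq_ft` are assumptions, so check `df.columns` on the real file.

```python
import pandas as pd

# Toy data; the column names are assumptions, not the real dataset's headers.
df = pd.DataFrame({"price": [100000, 150000, 240000],
                   "sq_ft": [1000, 1200, 1500]})

# Assigning to a new column name adds the column to the existing dataframe.
df["price_per_sq_ft"] = df["price"] / df["sq_ft"]

print(df.head())
print(df.tail())
```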

In [5]:
def ppsf(some_number, another_number):
    return some_number / (another_number * 1.7)

ppsf(20, 30)

0.39215686274509803

### &#167;  Provide three data points from the original dataset that give you reason to be concerned about some of the properties with unusual price per square foot data.
 * Learn about [summarizing your data](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics) and [sorting](http://pandas.pydata.org/pandas-docs/stable/basics.html#by-values).
 * Turn in the rows of those three data points and 1 to 3 sentences discussing your concerns.
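
A short sketch of summarizing and sorting, assuming a toy dataframe with a hypothetical `price_per_sq_ft` column. `describe()` reports count, mean, standard deviation, min, quartiles and max, and `sort_values()` orders the rows so the extremes are easy to inspect.

```python
import pandas as pd

# Toy data with made-up streets and values.
df = pd.DataFrame({
    "street": ["A ST", "B CT", "C WAY", "D DR"],
    "price_per_sq_ft": [100.0, 125.0, 160.0, 20.0],
})

# Summary statistics for a single column.
summary = df["price_per_sq_ft"].describe()
print(summary)

# Rows with the three smallest values; pass ascending=False for the largest.
lowest = df.sort_values(by="price_per_sq_ft").head(3)
print(lowest)
```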

### &#167;  Use a scatter plot to plot the sale price against the square footage.  

### &#167;  Do the same for number of beds to sale price.
* You will need to learn about [scatter plots](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics) to complete these.
* For both, turn in the scatter plot; if you're using Jupyter, just leave the plots inline.
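
A minimal scatter-plot sketch on toy data, assuming hypothetical `sq_ft` and `price` columns. The `Agg` backend line is only there so the sketch runs outside a notebook; in Jupyter with `%matplotlib inline` it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter

import pandas as pd

# Toy data; the column names are assumptions.
df = pd.DataFrame({"sq_ft": [1000, 1200, 1500, 1800],
                   "price": [100000, 150000, 240000, 260000]})

# DataFrame.plot.scatter draws one point per row and returns the Axes.
ax = df.plot.scatter(x="sq_ft", y="price")
ax.set_title("Sale price vs. square footage (toy data)")
```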

### &#167; Explore the distribution of properties by number of beds.  Plot this using `plot()` 
* Learn about that function [here](http://pandas.pydata.org/pandas-docs/stable/visualization.html).
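
One way to look at a distribution is to count how often each value occurs and plot the counts as bars. The sketch below assumes a toy `beds` column; the `Agg` backend line can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter

import pandas as pd

# Toy data; "beds" is an assumed column name.
df = pd.DataFrame({"beds": [2, 3, 3, 4, 3, 2]})

# Count rows per number of beds, ordered by the number of beds.
counts = df["beds"].value_counts().sort_index()

# plot(kind="bar") draws one bar per distinct value.
ax = counts.plot(kind="bar")
ax.set_xlabel("beds")
ax.set_ylabel("number of properties")
```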

## PART 2 / Distance metrics and _k_-Nearest Neighbor

### &#167; Implement the _k_-Nearest Neighbor algorithm in Python using the Euclidean distance metric.

The _k_-Nearest Neighbor algorithm is a very simple _lazy learning_ algorithm.  It finds the $k$ nearest neighbors of a given data point $d_i$ (a vector) from the set of all data $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$.


#### 1. Implement the function `d_euclidean(v1, v2)` that takes 2 arguments v1 and v2 which are the vectors you will be comparing from your dataset.

You will be turning in the implementation of two functions.  The first is the Euclidean distance function, which is trivial to implement.  Recall Euclidean distance $d_{\mathrm{euclidean}}$ is defined by, given two vectors $v_1$ and $v_2$ of length $n$ :

$$
d_{\mathrm{euclidean}} = \sqrt{ \sum_{i=1}^n \big({v_1}_i - {v_2}_i\big)^2 }
$$
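
As a quick check of the definition, take the illustrative vectors $v_1 = (1, 2)$ and $v_2 = (4, 6)$ (made-up values, not from the dataset). Each difference is squared first, and the squares are then summed:

$$
d_{\mathrm{euclidean}} = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5
$$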

You are free to use the implementation in the notes, but please make sure you actually cite it appropriately if you copy it verbatim!

#### 2. Implement the function `knn_euclidean(k, v, d)` that takes three arguments: the number of neighbors to return _k_, the data vector _v_, and the entire dataset _d_.

The algorithm will do the following:
1. compute the distance between $v$ and all vectors in $d$ (with $v$ removed, if you like, but the distance between $v$ and itself will be ... 0!).  You will use the function from part 1 `d_euclidean`.
2. sort the distances of all vectors in ascending order, with the closest (lowest) distances first.
3. return the _k_ top neighbors
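
Steps 2 and 3 amount to ranking by distance with the smallest values first and keeping the first $k$ entries. A small sketch of just that ranking step, assuming the distances have already been computed and stored in a hypothetical dictionary keyed by label:

```python
# Hypothetical precomputed distances from v to four other vectors.
distances = {"a": 2.5, "b": 0.5, "c": 4.0, "d": 1.0}
k = 2

# Rank (label, distance) pairs by distance, smallest first.
ranked = sorted(distances.items(), key=lambda item: item[1])

# Keep the k closest entries.
nearest = ranked[:k]
print(nearest)  # [('b', 0.5), ('d', 1.0)]
```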

Your Python code might look something like this:

#### What you will turn in:

* Turn in the code for the two functions implemented in Python.
* Use the templates below :

```python

# assume v1 and v2 are python tuples
def d_euclidean(v1, v2):
    return # the calculation of the euclidean distance


# implement the knn algorithm as defined above
def knn_euclidean(k, v, all_v):

    # compute all the distances between v and d_v in all_v
    for d_v in all_v:
        d_euclidean(d_v, v)

    # store the distances, sort and return k of them
    
    # the rest of your implementation
    
    return # the top k vectors sorted

```

### &#167; Produce the distance table using your version of _k_-NN for all properties.  

*  To make things easier, please reduce the data vector to just beds, baths, square footage, price, latitude and longitude.
* You can use the street as the index.  Your final output will look something like this:


<div class="tg-wrap"><table>
  <tr>
    <td></td>
    <td>3526 HIGH ST</td>
    <td>51 OMAHA CT</td>
    <td>2796 BRANCH ST</td>
    <td>2805 JANETTE WAY</td>
  </tr>
  <tr>
    <td>3526 HIGH ST</td>
    <td>0</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>51 OMAHA CT</td>
    <td>0.25</td>
    <td>0</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>2796 BRANCH ST</td>
    <td>0.25</td>
    <td>0.25</td>
    <td>0</td>
    <td></td>
  </tr>
  <tr>
    <td>2805 JANETTE WAY</td>
    <td>0.25</td>
    <td>0.25</td>
    <td>0.25</td>
    <td>0</td>
  </tr>
</table></div>
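
A table of this shape can be assembled in several ways. The sketch below uses NumPy broadcasting on toy data instead of calling a k-NN function, purely to illustrate the layout. The street names, feature columns and values are made up.

```python
import numpy as np
import pandas as pd

# Toy feature vectors indexed by (made-up) street names.
features = pd.DataFrame(
    {"beds": [2, 5, 2], "baths": [1, 5, 5]},
    index=["1 EXAMPLE ST", "2 SAMPLE CT", "3 DEMO WAY"],
)

values = features.to_numpy(dtype=float)

# Pairwise differences via broadcasting: shape (n, n, number_of_features).
diffs = values[:, None, :] - values[None, :, :]

# Euclidean distance for every pair of rows.
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Label rows and columns with the street names.
table = pd.DataFrame(dist, index=features.index, columns=features.index)
print(table)
```

The result is symmetric with zeros on the diagonal, so only the lower triangle needs to be shown.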


### &#167; Compute the 4-NN and 5-NN to the following properties:

* 4882 BANDALIN WAY
* 7511 OAKVALE CT
* 7731 MASTERS ST
* 4925 PERCHERON DR
* 4727 SAVOIE WAY
* 3228 BAGGAN CT
* 8515 DARTFORD DR
* 2460 EL ROCCO WAY
* 5840 WALERGA RD
* 923 FULTON AVE
* 4030 BROADWAY
* 6485 LAGUNA MIRAGE LN
* 8758 LEMAS RD
* 1140 EDMONTON DR
* 1890 GENEVA PL

Turn in the table that has each property address and the addresses of the 4-NN and 5-NN properties.

## PART 3 / Data Exploration

For this part we will be using the data from the [Salt Lake City water usage dataset (by block)](http://www.civicdata.io/dataset/slc_water_usage_1/resource/a2bc8285-d9ef-45ca-ae86-bf735bc5011a).  This dataset includes information about water consumption for 2016.

## Explore the data:

### &#167; How many different property types are described in this data?
### &#167; How many of the consumption values are less than 0?
### &#167; Produce a histogram of the monthly consumption for all property types.
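
A small sketch of the kinds of calls that answer counting questions like these, on toy data. The column names `type` and `consumption` are assumptions and may not match the real file.

```python
import pandas as pd

# Toy data; the column names and values are made up.
df = pd.DataFrame({
    "type": ["Single Residence", "Business", "Single Residence", "Duplex"],
    "consumption": [12, -3, 40, 7],
})

# Number of distinct values in a column.
n_types = df["type"].nunique()

# A boolean comparison yields True/False per row; summing counts the True values.
n_negative = (df["consumption"] < 0).sum()

print(n_types, n_negative)
# A histogram can be drawn with df["consumption"].plot(kind="hist") when matplotlib is available.
```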

## Fix the data
Some of the input data need work.  In particular, the property descriptions have issues.  Fix them (**hint:** explore `strip()` and `lower()`).  Also, please remove all data points that are out of spec, that is, rows with connections less than 0 and rows with consumption less than 0.
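
A sketch of this kind of cleanup on toy data, assuming hypothetical `type`, `connections` and `consumption` columns. The `.str` accessor applies string methods to every value in a column, and a boolean mask keeps only the rows that are in spec.

```python
import pandas as pd

# Toy data with inconsistent labels and some out-of-spec values.
df = pd.DataFrame({
    "type": [" Business", "business ", "Duplex", "DUPLEX"],
    "connections": [3, -1, 2, 5],
    "consumption": [10, 20, -5, 8],
})

# Normalize the labels: trim surrounding whitespace, then lowercase.
df["type"] = df["type"].str.strip().str.lower()

# Keep rows where both values are non-negative.
clean = df[(df["connections"] >= 0) & (df["consumption"] >= 0)]
print(len(clean))
```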

### &#167; How many data points do you now have?
### &#167; Show the descriptive stats for this new fixed data: for consumption and connections only, what are the mean, median, min and max?
### &#167; Normalize the data using z-score normalization for connection and consumption.  Show the `head()` and `tail()` of the new values.
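
Z-score normalization subtracts the column mean and divides by the column standard deviation. A sketch on toy data, assuming hypothetical column names. Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default, which is a choice rather than a requirement of the assignment.

```python
import pandas as pd

# Toy data; the column names are assumptions.
df = pd.DataFrame({"connections": [1.0, 2.0, 3.0],
                   "consumption": [10.0, 20.0, 60.0]})

cols = ["connections", "consumption"]

# (x - mean) / std, applied column by column.
z = (df[cols] - df[cols].mean()) / df[cols].std()

print(z.head())
print(z.tail())
```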

## Explore relationships

### &#167; Plot the connections (x-axis) to consumption (y-axis) in a scatter plot.  Is there a relationship?

### &#167; Compute Pearson's _r_ correlation between these two.  Does it confirm what you see?
* You will need to investigate the [scipy Pearson's _r_ for this part](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
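
A minimal usage sketch of `scipy.stats.pearsonr` on made-up numbers. It returns the correlation coefficient together with a p-value; values of _r_ near 1 or -1 indicate a strong linear relationship and values near 0 a weak one.

```python
from scipy.stats import pearsonr

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# r is the correlation coefficient, p_value the associated two-sided p-value.
r, p_value = pearsonr(x, y)
print(r, p_value)
```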

### &#167; Group the data by month and by property type and show the bar plot of this for the year-to-date data (January to July).

* You will need to read up on the `groupby()` function in Pandas for the next few questions. 
* Also read up on plotting/visualizing in Pandas.
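
A sketch of grouping by two keys on toy data, assuming hypothetical `month`, `type` and `consumption` columns. Summing is used here as one possible aggregation, and `unstack()` turns the inner group level into columns so that each month becomes one row.

```python
import pandas as pd

# Toy data; the column names and values are made up.
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "type": ["business", "duplex", "business", "duplex", "duplex"],
    "consumption": [10, 5, 20, 7, 3],
})

# Total consumption per (month, type) pair.
totals = df.groupby(["month", "type"])["consumption"].sum()

# Pivot the "type" level into columns: one row per month.
table = totals.unstack()
print(table)

# table.plot(kind="bar") would then draw one group of bars per month.
```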

## BONUS

## This analysis is optional, but you will receive 1 bonus point for each question you answer below.

### _Water consumption rises dramatically in June and July -- let's explore why._
### &#167; Analyze the data for May, June and July and produce a bar plot of the consumption over those three months.
### &#167; Explain what you see.  Of course, during the summer months, the temperatures are higher and thus water consumption would naturally go up, but given what you see in the plot, what else is going on?

* _Hints_: You will need to understand how to use [aggregate()](http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation) and maybe [get_level_values()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.CategoricalIndex.get_level_values.html#pandas.CategoricalIndex.get_level_values).  
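
A sketch of how these two calls can fit together, on toy data with hypothetical `month`, `type` and `consumption` columns. `agg()` computes several statistics per group, and `get_level_values()` pulls one level out of the resulting multi-level index so it can be used for filtering.

```python
import pandas as pd

# Toy data; the column names and values are made up.
df = pd.DataFrame({
    "month": [5, 6, 6, 7],
    "type": ["business", "business", "duplex", "business"],
    "consumption": [10, 30, 20, 50],
})

# Several aggregates per (month, type) group.
summary = df.groupby(["month", "type"])["consumption"].agg(["sum", "mean"])

# Extract the "month" level of the index and keep only June and July.
months = summary.index.get_level_values("month")
june_july = summary[months.isin([6, 7])]
print(june_july)
```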